[Lazarus] Encoding agnostic functions for codepoints + an iterator

Juha Manninen juha.manninen62 at gmail.com
Tue Jun 21 14:49:14 CEST 2016


On Mon, Jun 20, 2016 at 3:42 PM, Martin Frb <lazarus at mfriebe.de> wrote:
> Glyph and (de-)composed are different things
>
> A glyph may be a ligature https://en.wikipedia.org/wiki/Typographic_ligature
> 2 or more chars, depending on the font you use. Even whole words.

I don't know whether glyphs require extra treatment. I guess I still don't
understand all the details.
Anyway, I am now trying to figure out all the Combining Diacritical Marks.

> As for (de-)composed: SynEdit has some code to detect them (combining
> codepoints). Not sure if (still) complete (if new ones were added ??).
> But if you just want (de-)composed, and not a complete unicode library (with
> all the properties for each codepoint), then feel free to look at the code, ...

I found:
 function TSynEditStringList.LogicPosIsCombining(const AChar: PChar): Boolean;
 begin
   Result := (
     // Combining Diacritical Marks (belongs to previous char) 0300-036F
     ( (AChar[0] = #$CC) ) or
     // Combining Diacritical Marks
     ( (AChar[0] = #$CD) and (AChar[1] in [#$80..#$AF]) ) or
     // Arabic 0610 (d890)..061A (d89a)
     ( (AChar[0] = #$D8) and (AChar[1] in [#$90..#$9A]) ) or
     // Arabic 064B (d98b)..065F (d99f) // 0670 (d9b0)
     ( (AChar[0] = #$D9) and (AChar[1] in [#$8b..#$9f, #$B0]) ) or
     // Arabic 06D6 (db96)..  .. ..06EA (dbaa)
     ( (AChar[0] = #$DB) and (AChar[1] in [#$96..#$9C, #$9F..#$A4,
                                           #$A7..#$A8, #$AA..#$AD]) ) or
     // Arabic 08E4 (e0a3a4) ..08FE (e0a3be)
     ( (AChar[0] = #$E0) and (AChar[1] = #$A3) and
       (AChar[2] in [#$A4..#$BE]) ) or
     // Combining Diacritical Marks Supplement 1DC0-1DFF
     ( (AChar[0] = #$E1) and (AChar[1] = #$B7) ) or
     // Combining Diacritical Marks for Symbols 20D0-20FF
     ( (AChar[0] = #$E2) and (AChar[1] = #$83) and
       (AChar[2] in [#$90..#$FF]) ) or
     // Combining half Marks FE20-FE2F
     ( (AChar[0] = #$EF) and (AChar[1] = #$B8) and
       (AChar[2] in [#$A0..#$AF]) )
   );
 end;

It works well. I have already created an enumerator for UTF-8 using it.
I will do a UTF-16 version soon. First I wanted to add all the
possibly missing ranges of combining marks.
It is difficult to find such a list anywhere! How come?
The rule is quite simple after all: a codepoint either combines
with the previous codepoint or it does not. True or False. Simple.
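
To make the True/False rule concrete, here is a minimal sketch of how
such a test can drive stepping through a UTF-8 string one "combined
character" at a time. It is not the actual enumerator mentioned above.
The names IsCombining, CodepointLen and CombinedLen are hypothetical,
and IsCombining is a deliberately tiny stand-in that only knows
U+0300..U+036F.

```pascal
program CombinedStep;
{$mode objfpc}{$H+}

// Hypothetical stand-in for a full combining-mark test. It covers only
// U+0300..U+036F: lead byte $CC, or $CD followed by $80..$AF.
function IsCombining(P: PChar): Boolean;
begin
  Result := (P[0] = #$CC) or
            ((P[0] = #$CD) and (P[1] in [#$80..#$AF]));
end;

// Byte length of one UTF-8 codepoint, judged from its lead byte only.
// Invalid or stray bytes are stepped over one at a time.
function CodepointLen(P: PChar): Integer;
begin
  case Ord(P[0]) of
    $C2..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F4: Result := 4;
  else
    Result := 1;
  end;
end;

// Byte length of the codepoint at S[StartPos] (1-based) plus all
// combining marks that directly follow it.
function CombinedLen(const S: string; StartPos: Integer): Integer;
var
  Base: PChar;
  NextPos: Integer;
begin
  Base := PChar(S);
  NextPos := StartPos + CodepointLen(Base + StartPos - 1);
  while (NextPos <= Length(S)) and IsCombining(Base + NextPos - 1) do
    Inc(NextPos, CodepointLen(Base + NextPos - 1));
  // A truncated sequence at the end must not run past the string.
  if NextPos > Length(S) + 1 then
    NextPos := Length(S) + 1;
  Result := NextPos - StartPos;
end;

var
  S: string;
  I, Len: Integer;
begin
  // 'a', then 'e' followed by U+0301 (bytes $CC $81), then 'z'
  S := 'a' + 'e' + #$CC#$81 + 'z';
  I := 1;
  while I <= Length(S) do
  begin
    Len := CombinedLen(S, I);
    WriteLn('unit at byte ', I, ': ', Len, ' byte(s)');
    Inc(I, Len);
  end;
end.
```

In this sketch the 'e' and its accent are reported as a single
three-byte unit, while 'a' and 'z' are one byte each. Swapping the
stand-in for a complete test would not change the stepping logic.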

I did some detective work and studied the C# library sources of the
Mono project.
  https://github.com/mono/mono
The string etc. units are extremely complicated. I didn't understand
much and didn't find any list of combining marks.
I found something in XmlChar.cs and XmlCharTests.cs, which have a
function IsCombiningChar. I copy it below.
The problem is that its ranges are completely wrong!
For example, the first range ends at 0x0345. However, 0x0346 is also a
combining mark:
  http://www.fileformat.info/info/unicode/char/0346/index.htm
and so are 0x0347, 0x0348, etc.
Uhhh...
I guess I could dig out the ranges from here:
  http://www.fileformat.info/info/unicode/category/index.htm
At least these are combining marks:
  [Mc]  Mark, Spacing Combining
  [Me]  Mark, Enclosing
  [Mn]  Mark, Nonspacing
Are there others?
It will be a lot of work because the site only shows individual values,
not ranges, and there are almost 2000 values!
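
One possible way to avoid collecting the values by hand is to derive
the ranges from UnicodeData.txt, the semicolon-separated file of the
Unicode Character Database, where the first field is the codepoint in
hex and the third field is the general category. The following is only
a sketch under the assumption that a copy of that file sits in the
current directory. The helper names GetField, IsMarkCategory and
PrintRange are hypothetical.

```pascal
program MarkRanges;
{$mode objfpc}{$H+}
uses
  SysUtils, Classes;

// Returns the Index-th (0-based) ';'-separated field of Line,
// or '' if the line has fewer fields.
function GetField(const Line: string; Index: Integer): string;
var
  StartPos, i: Integer;
begin
  Result := '';
  StartPos := 1;
  for i := 1 to Length(Line) + 1 do
    if (i > Length(Line)) or (Line[i] = ';') then
    begin
      if Index = 0 then
      begin
        Result := Copy(Line, StartPos, i - StartPos);
        Exit;
      end;
      Dec(Index);
      StartPos := i + 1;
    end;
end;

// True for the three "Mark" general categories.
function IsMarkCategory(const Cat: string): Boolean;
begin
  Result := (Cat = 'Mn') or (Cat = 'Mc') or (Cat = 'Me');
end;

procedure PrintRange(First, Last: Integer);
begin
  if First = Last then
    WriteLn(IntToHex(First, 4))
  else
    WriteLn(IntToHex(First, 4), '..', IntToHex(Last, 4));
end;

var
  Lines: TStringList;
  n, Code, RangeFirst, RangeLast: Integer;
begin
  Lines := TStringList.Create;
  try
    // Assumption: the file was downloaded beforehand to this path.
    Lines.LoadFromFile('UnicodeData.txt');
    RangeFirst := -1;
    RangeLast := -1;
    for n := 0 to Lines.Count - 1 do
    begin
      if not IsMarkCategory(GetField(Lines[n], 2)) then
        Continue;
      Code := StrToInt('$' + GetField(Lines[n], 0));
      if (RangeFirst >= 0) and (Code = RangeLast + 1) then
        RangeLast := Code          // extends the current run
      else
      begin
        if RangeFirst >= 0 then
          PrintRange(RangeFirst, RangeLast);
        RangeFirst := Code;        // starts a new run
        RangeLast := Code;
      end;
    end;
    if RangeFirst >= 0 then
      PrintRange(RangeFirst, RangeLast);
  finally
    Lines.Free;
  end;
end.
```

The program merges consecutive mark codepoints into runs and prints one
run per line, so the output could be turned into range checks like the
ones in LogicPosIsCombining. Whether Mn, Mc and Me together are exactly
the right set for this purpose is still the open question asked above.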

Any hints, anybody?
This feels like reinventing the wheel. There must be a valid,
up-to-date list of the ranges somewhere.

Juha

---
--- From XmlCharTests.cs ---
---
private static bool IsCombiningChar (int ch)
{
    return
        (ch >= 0x0300 && ch <= 0x0345) ||
        (ch >= 0x0360 && ch <= 0x0361) ||
        (ch >= 0x0483 && ch <= 0x0486) ||
        (ch >= 0x0591 && ch <= 0x05A1) ||
        (ch >= 0x05A3 && ch <= 0x05B9) ||
        (ch >= 0x05BB && ch <= 0x05BD) ||
        (ch == 0x05BF) ||
        (ch >= 0x05C1 && ch <= 0x05C2) ||
        (ch == 0x05C4) ||
        (ch >= 0x064B && ch <= 0x0652) ||
        (ch == 0x0670) ||
        (ch >= 0x06D6 && ch <= 0x06DC) ||
        (ch >= 0x06DD && ch <= 0x06DF) ||
        (ch >= 0x06E0 && ch <= 0x06E4) ||
        (ch >= 0x06E7 && ch <= 0x06E8) ||
        (ch >= 0x06EA && ch <= 0x06ED) ||
        (ch >= 0x0901 && ch <= 0x0903) ||
        (ch == 0x093C) ||
        (ch >= 0x093E && ch <= 0x094C) ||
        (ch == 0x094D) ||
        (ch >= 0x0951 && ch <= 0x0954) ||
        (ch >= 0x0962 && ch <= 0x0963) ||
        (ch >= 0x0981 && ch <= 0x0983) ||
        (ch == 0x09BC) ||
        (ch == 0x09BE) ||
        (ch == 0x09BF) ||
        (ch >= 0x09C0 && ch <= 0x09C4) ||
        (ch >= 0x09C7 && ch <= 0x09C8) ||
        (ch >= 0x09CB && ch <= 0x09CD) ||
        (ch == 0x09D7) ||
        (ch >= 0x09E2 && ch <= 0x09E3) ||
        (ch == 0x0A02) ||
        (ch == 0x0A3C) ||
        (ch == 0x0A3E) ||
        (ch == 0x0A3F) ||
        (ch >= 0x0A40 && ch <= 0x0A42) ||
        (ch >= 0x0A47 && ch <= 0x0A48) ||
        (ch >= 0x0A4B && ch <= 0x0A4D) ||
        (ch >= 0x0A70 && ch <= 0x0A71) ||
        (ch >= 0x0A81 && ch <= 0x0A83) ||
        (ch == 0x0ABC) ||
        (ch >= 0x0ABE && ch <= 0x0AC5) ||
        (ch >= 0x0AC7 && ch <= 0x0AC9) ||
        (ch >= 0x0ACB && ch <= 0x0ACD) ||
        (ch >= 0x0B01 && ch <= 0x0B03) ||
        (ch == 0x0B3C) ||
        (ch >= 0x0B3E && ch <= 0x0B43) ||
        (ch >= 0x0B47 && ch <= 0x0B48) ||
        (ch >= 0x0B4B && ch <= 0x0B4D) ||
        (ch >= 0x0B56 && ch <= 0x0B57) ||
        (ch >= 0x0B82 && ch <= 0x0B83) ||
        (ch >= 0x0BBE && ch <= 0x0BC2) ||
        (ch >= 0x0BC6 && ch <= 0x0BC8) ||
        (ch >= 0x0BCA && ch <= 0x0BCD) ||
        (ch == 0x0BD7) ||
        (ch >= 0x0C01 && ch <= 0x0C03) ||
        (ch >= 0x0C3E && ch <= 0x0C44) ||
        (ch >= 0x0C46 && ch <= 0x0C48) ||
        (ch >= 0x0C4A && ch <= 0x0C4D) ||
        (ch >= 0x0C55 && ch <= 0x0C56) ||
        (ch >= 0x0C82 && ch <= 0x0C83) ||
        (ch >= 0x0CBE && ch <= 0x0CC4) ||
        (ch >= 0x0CC6 && ch <= 0x0CC8) ||
        (ch >= 0x0CCA && ch <= 0x0CCD) ||
        (ch >= 0x0CD5 && ch <= 0x0CD6) ||
        (ch >= 0x0D02 && ch <= 0x0D03) ||
        (ch >= 0x0D3E && ch <= 0x0D43) ||
        (ch >= 0x0D46 && ch <= 0x0D48) ||
        (ch >= 0x0D4A && ch <= 0x0D4D) ||
        (ch == 0x0D57) ||
        (ch == 0x0E31) ||
        (ch >= 0x0E34 && ch <= 0x0E3A) ||
        (ch >= 0x0E47 && ch <= 0x0E4E) ||
        (ch == 0x0EB1) ||
        (ch >= 0x0EB4 && ch <= 0x0EB9) ||
        (ch >= 0x0EBB && ch <= 0x0EBC) ||
        (ch >= 0x0EC8 && ch <= 0x0ECD) ||
        (ch >= 0x0F18 && ch <= 0x0F19) ||
        (ch == 0x0F35) ||
        (ch == 0x0F37) ||
        (ch == 0x0F39) ||
        (ch == 0x0F3E) ||
        (ch == 0x0F3F) ||
        (ch >= 0x0F71 && ch <= 0x0F84) ||
        (ch >= 0x0F86 && ch <= 0x0F8B) ||
        (ch >= 0x0F90 && ch <= 0x0F95) ||
        (ch == 0x0F97) ||
        (ch >= 0x0F99 && ch <= 0x0FAD) ||
        (ch >= 0x0FB1 && ch <= 0x0FB7) ||
        (ch == 0x0FB9) ||
        (ch >= 0x20D0 && ch <= 0x20DC) ||
        (ch == 0x20E1) ||
        (ch >= 0x302A && ch <= 0x302F) ||
        (ch == 0x3099) ||
        (ch == 0x309A);
}


