[Lazarus] dynamic string proposal

Juha Manninen juha.manninen62 at gmail.com
Wed Aug 16 17:20:07 CEST 2017


On Wed, Aug 16, 2017 at 4:49 PM, Michael Schnell via Lazarus
<lazarus at lists.lazarus-ide.org> wrote:
>> You are writing about encodings etc. which are part of codepoints, but
>> you call them "characters". Why?
>
> Because the type for this stuff used in Delphi and FPC is called "char".

No, actually the Pascal type "Char" contains a CodeUnit, not a CodePoint.
It is the smallest fixed width "atom" of Unicode text. It is still
extremely useful in Unicode related programming.
The word "character" in Unicode can mean:

1. CodeUnit - Represented by Pascal type "Char".

2. CodePoint - all the arguments about one encoding's supremacy over
another deal with CodePoints. Yes, UTF-8, UTF-16, UTF-32 etc. all only
encode CodePoints.

3. Abstract Unicode character - like 🍷 'WINE GLASS'.

4. Coded Unicode character - "U" + a unique number, like U+1F377. This
is what "character" means in Unicode Standard.

5. User-perceived character - Whatever the end user thinks of as a character.
This is language dependent. For instance, 'ch' is two letters in
English but one letter in Czech and Slovak.
Many more complexities are involved here, including decomposed codepoints.

6. Grapheme cluster

7. Glyph - related to fonts.
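
To make the difference between a CodeUnit and a coded character concrete,
here is a small sketch that counts both for U+1F377. It is written in Python
only as a neutral illustration, not FPC code, and the variable names are my
own. Python's str is indexed by codepoint, so the code units have to be
counted from the encoded bytes.

```python
# U+1F377 WINE GLASS: one coded character, but a different number of
# code units depending on the encoding form.
wine = "\U0001F377"

code_points = len(wine)                            # str length counts codepoints
utf8_units = len(wine.encode("utf-8"))             # 8-bit code units
utf16_units = len(wine.encode("utf-16-le")) // 2   # 16-bit code units
utf32_units = len(wine.encode("utf-32-le")) // 4   # 32-bit code units

print(f"U+{ord(wine):04X}", code_points, utf8_units, utf16_units, utf32_units)
# prints: U+1F377 1 4 2 1
```

In Pascal terms, this one character occupies four 8-bit Chars in a UTF-8
encoded string and two 16-bit WideChars in a UTF-16 encoded string.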

So, number 4 is the official Unicode "character".
Otherwise the most useful meanings are 1. "CodeUnit" for programmers
and 5. "User-perceived character" for everybody else.
Note, CodePoint is NOT a useful meaning for "character". It would only
confuse things. Yet most people in these Unicode threads write about
"character" as if it meant CodePoint. It can only mean that those
people are ignorant of the complexity of Unicode.  :(
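
A short sketch of why counting codepoints does not give user-perceived
characters: the same visible letter can be stored precomposed or decomposed.
Again Python is used only for illustration, and the variable names are
hypothetical.

```python
import unicodedata

composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT, two codepoints

# Both display as the same single letter, yet they differ as codepoint sequences.
same_raw = composed == decomposed
lengths = (len(composed), len(decomposed))

# Normalization converts between the two forms.
to_nfc = unicodedata.normalize("NFC", decomposed)
to_nfd = unicodedata.normalize("NFD", composed)

print(same_raw, lengths, to_nfc == composed, to_nfd == decomposed)
# prints: False (1, 2) True True
```

So a codepoint count of 2 here says nothing about what the user sees, which
is one letter.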


> In fact I did not explicitly talk about Unicode at all. the paper says it:
> ...

Unicode is the standard now. We cannot ignore it, and we don't want to
ignore it because it solves so many problems of the earlier solutions.
If you create a new string type, you certainly must take Unicode into account.

Juha

