[Lazarus] dynamic string proposal

Juha Manninen juha.manninen62 at gmail.com
Fri Aug 18 11:42:22 CEST 2017


I answer here Tony's post from the "String vs WideString" thread.


On Thu, Aug 17, 2017 at 2:09 PM, Tony Whyman via Lazarus
<lazarus at lists.lazarus-ide.org> wrote:
> Are you making my points for me? If such a basic term as "character" means 7
> different things then something is badly amiss. It should be fairly obvious
> that in this context, character = printable symbol - whilst for practical
> reasons allowing for format control characters such as an "end of line" and
> "end of string".

So, maybe it is the "User-perceived character" from my list above.


> I believe that you need to go back to the idea that you have both an
> abstract representation of a character with a constant semantic, separate
> from the actual encoding and for which there may be many different and valid
> encodings. For example, using a somewhat dated comparison, a lower case
> latin alphabet letter 'a' should always have a constant semantic, but in
> ASCII is encoded as decimal 97, while in EBCDIC is encoded as decimal 129.
> Even though they have different binary values, they represent the same
> abstract character.
>
> I want a 'char' type in Pascal to represent a character such as a lower case
> 'a' regardless of the encoding used. Indeed, for a program to be properly
> portable, the programmer should not have to care about the actual encoding -
> only that it is a lower case 'a'.
>
> Hence my proposal that a character type should include an implicit or
> explicit attribute that records the encoding scheme used - which could vary
> from ASCII to UTF-32.

You use the words "character" and "encoding" in the same sentences,
although only codepoints are encoded.
Codepoints are the easy part anyway.
A "character" can be composed of many codepoints, and the rules for
that are complex and locale-dependent. It is not only about combining
accent marks; some languages have additional rules about grapheme
clusters etc. (which I don't know well).
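A minimal illustration of that point, in Python only because its str
type is a plain sequence of codepoints; the same holds for any
codepoint-based string type:

```python
# One user-perceived character can be one codepoint or several.
import unicodedata

precomposed = "\u00e9"   # 'é' as a single codepoint (U+00E9)
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

print(len(precomposed))  # 1 codepoint
print(len(decomposed))   # 2 codepoints, yet the same visible character

# Unicode normalization maps one representation onto the other:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

So even before grapheme clusters enter the picture, "number of
codepoints" and "number of characters" are already different things.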


> I was referring to GB 18030 and that it has one, two and four byte
> code points.

Ok, I didn't know that one. If I understand right, it is a Chinese
national standard rather than part of Unicode, although it maps to the
full Unicode repertoire.


> The point I believe that you are missing is to consider that a character is
> an abstract symbol with a semantic independent of how it is encoded.
> Collation sequences are independent of encoding and should remain the same
> regardless of how a character set is encoded.

Above I used Unicode concepts and terminology.
Maybe your idea is to create another abstract system outside of Unicode.
The problem is that the complex rules of Unicode are needed to support
all the languages of the world.
Your new abstract system would end up equally complex.

Now I remember: my unit LazUnicode in LazUtils has the classes:
- TCodePointEnumerator  and
- TUnicodeCharacterEnumerator, which can handle Combining Diacritical Marks.
That may be enough for most European and many other languages.
Please take a look.
Maybe you could use the idea for a new abstract "character" class.
In any case, such complex logic must live in library code, not in a
compiler built-in type.

Or maybe you just want fixed-width indexing of codepoints without
dealing with the complexities of Unicode. In that case UTF-32, and a
new string type UTF32String or similar, would do it.
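A sketch of what such fixed-width indexing buys, and what it does not.
Python's str is already codepoint-indexed, which is essentially what a
hypothetical UTF32String would give Pascal: O(1) access to the N-th
codepoint, but the N-th codepoint is still not the N-th user-perceived
character:

```python
# Codepoint indexing is fixed-width, but combining sequences still
# span several indices.
s = "e\u0301tude"   # starts with 'e' + COMBINING ACUTE ACCENT

print(s[0])  # 'e' alone, the first codepoint, not the accented glyph
print(s[1])  # the combining accent mark by itself
```

So UTF32String would simplify indexing and slicing by codepoint, while
character-level operations would still need library support.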

Juha
