[Lazarus] Does Lazarus support a complete Unicode Component Library?
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Wed Feb 16 23:58:13 CET 2011
Graeme Geldenhuys wrote:
> On 2011-02-16 12:52, Hans-Peter Diettrich wrote:
>> In the past, most people could take it for granted that they were using
>> an SBCS, where every character on screen is one char in memory.
>> Consequently they use indexed access to the chars in a string, and
>> for..to loops.
>
> Yes, and that code accesses string characters 99% of the time in a
> sequential manner, be that left-to-right (or the other way round), hardly
> ever at random. So to overcome this supposed limitation, one simply
> needs to create a StringIterator (which I already have in my projects
> where character extraction is needed), and it will work just fine. So I
> don't see this as a problem at all.
What's the type of the loop variable?
The iteration costs time, so many users will insist on using "fast"
SBCS access. No doubt proper Unicode coding will require iterators,
unless Pos can return a valid index immediately.
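For illustration, a minimal sketch of such an iterator over a UTF-8
string (the names and the record layout are mine, not Graeme's actual
class; malformed UTF-8 is not handled):

  type
    TUtf8Iterator = record
      S: AnsiString;   // assumed to hold UTF-8
      Idx: Integer;    // 1-based byte index of the next code point
    end;

  procedure InitIterator(var It: TUtf8Iterator; const AText: AnsiString);
  begin
    It.S := AText;
    It.Idx := 1;
  end;

  { Decodes the next code point; returns False at the end of the string. }
  function NextCodePoint(var It: TUtf8Iterator; var CP: Cardinal): Boolean;
  var
    B: Byte;
    Len, I: Integer;
  begin
    Result := It.Idx <= Length(It.S);
    if not Result then Exit;
    B := Ord(It.S[It.Idx]);
    if B < $80 then begin Len := 1; CP := B; end
    else if (B and $E0) = $C0 then begin Len := 2; CP := B and $1F; end
    else if (B and $F0) = $E0 then begin Len := 3; CP := B and $0F; end
    else begin Len := 4; CP := B and $07; end;
    for I := 1 to Len - 1 do
      CP := (CP shl 6) or (Ord(It.S[It.Idx + I]) and $3F);
    Inc(It.Idx, Len);
  end;

The caller's loop variable then is a Cardinal (or UCS4Char) holding a
code point, no longer a char - which is the point of the question above.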
>> The same
>> procedures may work for UTF-16,
>
> No, character indexes will not work for UTF-16 either. Not ALL Unicode
> characters can fit into 2 bytes.
When a Unicode string contains the same characters as an Ansi string,
then each of these BMP characters fits into one widechar.
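A sketch of what I mean (plain ASCII content here, so everything stays
in the BMP; string types as in current FPC):

  program BmpDemo;
  var
    A: AnsiString;
    W: WideString;
  begin
    A := 'Hello';
    W := A;                          // implicit conversion to UTF-16
    WriteLn(Length(W) = Length(A));  // TRUE: one widechar per char
    WriteLn(W[1]);                   // indexed access works as before: H
  end.

As long as the content is such BMP material, the old indexed code keeps
working on UTF-16.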
> Also what about screen characters
> that are made up of multiple code points (combining diacritics etc.)?
> e.g.:
> U+0041 (A) + U+030A (̊) = Å
These are special Unicode issues that never were an issue with
Ansi strings, and they should not become one in Unicode - as long as you
deal with the same content as before. Again the COBOL distinction
applies: the user does not have to bother with the internals of strings
of "usage display" - they are only read, written and displayed, plus
whatever else portable "high-level" string handling offers.
Dealing with *all* the Unicode quirks is IMO beyond "usual" coding; it
will be reserved to specialized text processing components or
applications.
Perhaps you understand better now why I suggest a string type with an
immutable, application-defined codepage for "traditional" coding? This
would be "usage computational", where the known rules for low-level
string handling apply, just as with AnsiStrings.
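Delphi 2009 already has a syntax for exactly that; as an illustration,
the suggested type could be declared like this (the AnsiString(CodePage)
notation is taken from Delphi here, as an assumption about a future FPC):

  program FixedCpDemo;
  {$mode delphi}
  type
    // Codepage fixed at declaration time, never changed behind the
    // programmer's back - "usage computational":
    TWin1252String = type AnsiString(1252);
  var
    S: TWin1252String;
    I: Integer;
  begin
    S := 'payload';
    // Low-level indexed access stays valid: one char, one byte.
    for I := 1 to Length(S) do
      if S[I] = 'a' then S[I] := 'A';
    WriteLn(S);  // pAyloAd
  end.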
> Depending on how that string is normalized, doing a MyString[1] might
> only return 'A' and not Å as you would have expected.
No different from the current encoding, is it? You should not assume
that such non-canonical Unicode is, or has ever been, translated into a
single Ansi char by automatic conversion.
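To make that concrete, a sketch with the decomposed form from the
example above (UnicodeString, i.e. UTF-16, assumed):

  program NfdDemo;
  var
    W: UnicodeString;
  begin
    SetLength(W, 2);
    W[1] := WideChar($0041);  // 'A'
    W[2] := WideChar($030A);  // COMBINING RING ABOVE
    // Two code points, one character (Å) on screen:
    WriteLn(Length(W));  // 2
    WriteLn(Ord(W[1]));  // 65 = plain 'A', not Å
  end.

An automatic Ansi conversion of W would face the same problem; only a
normalization pass (NFC) would fold the two code points into one.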
>> one widechar, but this code will fail miserably on an UTF-8 platform,
>
> And so too for UTF-16 - as I have just shown. If you want to use UTF-16
> like that (just because *most* of the Unicode code points fit into
> 2 bytes), then it is no better than UCS-2.
*Most* users will be happy with the BMP. Those using code points outside
the BMP have had to live with all that stuff from the start.
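This is the stuff in question: outside the BMP every code point costs
two UTF-16 code units (a surrogate pair), so even UTF-16 indexing no
longer matches characters. A sketch with the usual example, U+1D11E
MUSICAL SYMBOL G CLEF:

  program SurrogateDemo;
  var
    W: UnicodeString;
  begin
    SetLength(W, 2);
    W[1] := WideChar($D834);  // high surrogate
    W[2] := WideChar($DD1E);  // low surrogate
    WriteLn(Length(W));       // 2 code units for 1 character
    // Low-level code has to check for surrogates before indexing:
    if (W[1] >= #$D800) and (W[1] <= #$DBFF) then
      WriteLn('W[1] is only half a character');
  end.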
IMO the most important thing about Unicode is to teach users the
difference between low-level and high-level string handling. Indexed
access to characters is a low-level operation that should not be used in
Unicode-aware applications without specific knowledge. Low-level string
handling requires exact knowledge of the encoding of a string, with
branches for the *expected* encodings and char types where necessary.
High-level string handling is not character-based, so your objections do
not apply.
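As a sketch of this distinction, a typical high-level operation: Pos
finds a substring and Copy extracts around it; the returned index is
only handed back to Copy, never interpreted as "the n-th character", so
it works unchanged for any encoding:

  program HighLevelDemo;
  var
    S, Key: UnicodeString;
    P: Integer;
  begin
    S := 'name=Lazarus';
    Key := 'name=';
    P := Pos(Key, S);                             // code-unit index, opaque
    if P > 0 then
      WriteLn(Copy(S, P + Length(Key), MaxInt));  // Lazarus
  end.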
DoDi