[Lazarus] Does Lazarus support a complete Unicode Component Library?
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Wed Feb 16 23:58:13 CET 2011
Graeme Geldenhuys wrote:
> On 2011-02-16 12:52, Hans-Peter Diettrich wrote:
>> In the past, most people could take it for granted that they were using
>> an SBCS, where every character on screen is one char in memory.
>> Consequently they use indexed access to the chars in a string, and
>> for..to loops.
>
> Yes, and that code accesses string characters 99% of the time in a
> sequential manner, be that left-to-right (or the other way round), hardly
> ever at random. So to overcome this supposed limitation, one simply
> needs to create a StringIterator (which I already have in my projects
> where character extraction is needed), and it will work just fine. So I
> don't see this as a problem at all.
What's the type of the loop variable?
The iteration costs time, so many users will insist on using "fast"
SBCS access. No doubt proper Unicode coding will require iterators,
unless Pos can return a valid index immediately.
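For illustration, a minimal sketch of such an iterator over a UTF-8
string (the names and the record layout are mine, not Graeme's actual
class; malformed UTF-8 is not handled):

  type
    TUtf8Iterator = record
      S: AnsiString;   // assumed to hold UTF-8
      Idx: Integer;    // 1-based byte index of the next code point
    end;

  procedure InitIterator(var It: TUtf8Iterator; const AText: AnsiString);
  begin
    It.S := AText;
    It.Idx := 1;
  end;

  { Decodes the next code point; returns False at the end of the string. }
  function NextCodePoint(var It: TUtf8Iterator; var CP: Cardinal): Boolean;
  var
    B: Byte;
    Len, I: Integer;
  begin
    Result := It.Idx <= Length(It.S);
    if not Result then Exit;
    B := Ord(It.S[It.Idx]);
    if B < $80 then begin Len := 1; CP := B; end
    else if (B and $E0) = $C0 then begin Len := 2; CP := B and $1F; end
    else if (B and $F0) = $E0 then begin Len := 3; CP := B and $0F; end
    else begin Len := 4; CP := B and $07; end;
    for I := 1 to Len - 1 do
      CP := (CP shl 6) or (Ord(It.S[It.Idx + I]) and $3F);
    Inc(It.Idx, Len);
  end;

The caller's loop variable then is a Cardinal (or UCS4Char) holding a
code point, no longer a char - which is the point of the question above.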
>> The same
>> procedures may work for UTF-16,
>
> No, character indexes will not work for UTF-16 either. Not ALL Unicode
> characters can fit into 2 bytes.
When a Unicode string contains the same characters as an Ansi string,
then each of these BMP characters fits into one widechar.
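A sketch of what I mean (plain ASCII content here, so everything stays
in the BMP; string types as in current FPC):

  program BmpDemo;
  var
    A: AnsiString;
    W: WideString;
  begin
    A := 'Hello';
    W := A;                          // implicit conversion to UTF-16
    WriteLn(Length(W) = Length(A));  // TRUE: one widechar per char
    WriteLn(W[1]);                   // indexed access works as before: H
  end.

As long as the content is such BMP material, the old indexed code keeps
working on UTF-16.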
> Also what about screen characters
> that are made up of multiple code points (combining diacritics etc.)?
> e.g.:
> U+0041 (A) + U+030A (̊) = Å
These are special Unicode issues that never were an issue with
Ansi strings, and they should not become one in Unicode - as long as you
deal with the same content as before. Again the COBOL distinction
applies: the user does not have to bother with the internals of strings
of "usage display" - they are only read, written and displayed, plus
whatever else portable "high-level" string handling offers.
Dealing with *all* the Unicode quirks is IMO beyond "usual" coding; it
will be reserved to specialized text processing components or
applications.
Perhaps you understand better now why I suggest a string type with an
immutable, application-defined codepage for "traditional" coding? This
would be "usage computational", where the known rules for low-level
string handling apply, just as with AnsiStrings.
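Delphi 2009 already has a syntax for exactly that; as an illustration,
the suggested type could be declared like this (the AnsiString(CodePage)
notation is taken from Delphi here, as an assumption about a future FPC):

  program FixedCpDemo;
  {$mode delphi}
  type
    // Codepage fixed at declaration time, never changed behind the
    // programmer's back - "usage computational":
    TWin1252String = type AnsiString(1252);
  var
    S: TWin1252String;
    I: Integer;
  begin
    S := 'payload';
    // Low-level indexed access stays valid: one char, one byte.
    for I := 1 to Length(S) do
      if S[I] = 'a' then S[I] := 'A';
    WriteLn(S);  // pAyloAd
  end.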
> Depending on how that string is normalized, doing a MyString[1] might
> only return 'A' and not Å as you would have expected.
No different from the current encoding, is it? You should not assume
that such non-canonical Unicode is, or has ever been, translated into a
single Ansi char by automatic conversion.
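To make that concrete, a sketch with the decomposed form from the
example above (UnicodeString, i.e. UTF-16, assumed):

  program NfdDemo;
  var
    W: UnicodeString;
  begin
    SetLength(W, 2);
    W[1] := WideChar($0041);  // 'A'
    W[2] := WideChar($030A);  // COMBINING RING ABOVE
    // Two code points, one character (Å) on screen:
    WriteLn(Length(W));  // 2
    WriteLn(Ord(W[1]));  // 65 = plain 'A', not Å
  end.

An automatic Ansi conversion of W would face the same problem; only a
normalization pass (NFC) would fold the two code points into one.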
>> one widechar, but this code will fail miserably on an UTF-8 platform,
>
> And so too for UTF-16 - as I have just shown. If you want to use UTF-16
> like that (just because *most* of the Unicode code points fit into
> 2 bytes), then it is no better than UCS-2.
*Most* users will be happy with the BMP. Those using code points outside
the BMP have had to live with all that stuff from the start.
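This is the stuff in question: outside the BMP every code point costs
two UTF-16 code units (a surrogate pair), so even UTF-16 indexing no
longer matches characters. A sketch with the usual example, U+1D11E
MUSICAL SYMBOL G CLEF:

  program SurrogateDemo;
  var
    W: UnicodeString;
  begin
    SetLength(W, 2);
    W[1] := WideChar($D834);  // high surrogate
    W[2] := WideChar($DD1E);  // low surrogate
    WriteLn(Length(W));       // 2 code units for 1 character
    // Low-level code has to check for surrogates before indexing:
    if (W[1] >= #$D800) and (W[1] <= #$DBFF) then
      WriteLn('W[1] is only half a character');
  end.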
IMO the most important thing about Unicode is to teach users the
difference between low-level and high-level string handling. Indexed
access to characters is a low-level operation that should not be used in
Unicode-aware applications without specific knowledge. Low-level string
handling requires exact knowledge of the encoding of a string, with
branches for the *expected* encodings and char types where necessary.
High-level string handling is not character-based, so your objections do
not apply.
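As a sketch of this distinction, a typical high-level operation: Pos
finds a substring and Copy extracts around it; the returned index is
only handed back to Copy, never interpreted as "the n-th character", so
it works unchanged for any encoding:

  program HighLevelDemo;
  var
    S, Key: UnicodeString;
    P: Integer;
  begin
    S := 'name=Lazarus';
    Key := 'name=';
    P := Pos(Key, S);                             // code-unit index, opaque
    if P > 0 then
      WriteLn(Copy(S, P + Length(Key), MaxInt));  // Lazarus
  end.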
DoDi