[Lazarus] Does Lazarus support a complete Unicode Component Library?

Hans-Peter Diettrich DrDiettrich1 at aol.com
Thu Feb 17 14:34:37 CET 2011


Graeme Geldenhuys wrote:
> On 2011-02-17 00:58, Hans-Peter Diettrich wrote:
>> What's the type of the loop variable???
> 
> Any type that can store 4 bytes. Be that a string, dynamic array or a
> custom object/class type.

String iteration based on such positions is quite useless, because you 
then have to determine the size (byte count...) of the indexed element 
in a separate explicit call. A proper for-each iteration returns an 
*element* of the structure, which in the case of Unicode can be neither 
an AnsiChar nor a WideChar.

In fact you'll need many different string iterator methods, like 
NextCharIncludingLigatures, NextCharInVisualOrder...

IMO the only feasible iteration is over physical codepoints, taking into 
account the various string encodings. No special iterator objects are 
required for that purpose; instead something like MoveNext() etc. can be 
implemented, as known from database access.


>> The iteration costs time, so that many users will insist on using "fast"
>> SBCS access.
> 
> That would also mean they can't use Unicode text - which is the whole
> point of this conversation.

Right, most users neither really want nor have to deal with full Unicode 
and foreign characters. My idea of a single, immutable, application-defined 
encoding supports just that model. Then all local/national texts can be 
processed in the well-known way (mostly as SBCS).


> Yes, but still, not all Unicode characters fit into a widechar
> (2-bytes). Most [if not all - I'm not sure here] spoken languages fit
> into the BMP, but that might not always be the case.

When it comes to merely storing and displaying general Unicode text in a 
lossless way, the actual encoding of such strings is of no interest to 
the coder. For that purpose UTF-8 or any OS-specific encoding is fine.
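
FPC already ships UTF8Encode/UTF8Decode (in the System unit, IIRC), so 
a lossless round trip between UTF-8 storage and UTF-16 is trivial; a 
minimal sketch:

program RoundTrip;
{$mode objfpc}{$H+}

var
  U8: UTF8String;   // holds UTF-8 encoded bytes
  W:  WideString;   // UTF-16, as used by e.g. the Windows API
begin
  U8 := 'Grüße, Καλημέρα';   // assuming the source file is saved as UTF-8
  W  := UTF8Decode(U8);      // to UTF-16 ...
  if UTF8Encode(W) = U8 then // ... and back again
    WriteLn('round trip is lossless for this text');
end.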


> Maybe some day you
> want to translate all your text into Klingon or Goa'uld or whatever
> alien race visits our planet. Being prepared and supporting the full
> Unicode is the best option at the moment.

Have you ever tried, seriously, to translate texts *semantically*, 
preserving the *meaning* of the text? Then you know that the 
*capability* of entering and storing foreign glyphs is the least 
important issue in such a task. And again, UTF-8 is perfectly sufficient 
even for future Unicode versions with far more than the 2^16 codepoints 
of the BMP.
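
For illustration, encoding one codepoint by hand shows why: a codepoint 
outside the BMP simply takes a 4-byte sequence, nothing else changes. 
CodePointToUtf8 is a made-up helper name, not a library routine:

program BeyondBmp;
{$mode objfpc}{$H+}

{ Encode a single codepoint as UTF-8 (no range validation). }
function CodePointToUtf8(CP: Cardinal): AnsiString;
begin
  if CP < $80 then
    Result := Chr(CP)
  else if CP < $800 then
    Result := Chr($C0 or (CP shr 6)) +
              Chr($80 or (CP and $3F))
  else if CP < $10000 then
    Result := Chr($E0 or (CP shr 12)) +
              Chr($80 or ((CP shr 6) and $3F)) +
              Chr($80 or (CP and $3F))
  else
    Result := Chr($F0 or (CP shr 18)) +
              Chr($80 or ((CP shr 12) and $3F)) +
              Chr($80 or ((CP shr 6) and $3F)) +
              Chr($80 or (CP and $3F));
end;

begin
  // U+1D11E (musical symbol G clef) lies outside the BMP;
  // UTF-8 just uses a 4-byte sequence for it.
  WriteLn(Length(CodePointToUtf8($1D11E)), ' bytes');   // prints: 4 bytes
end.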


>> These are special Unicode issues, that never have been an issue with
>> Ansi strings, and should not be in Unicode - as long as dealing with the
>> same content as before.
> 
> My example might not have been extensive enough to get the point across.
> The point being that what you see on screen as a "character" might be a
> combination of code-points. This is not an "issue of Unicode", but a
> functionality of Unicode - hence the reason there are stacks of
> information about various Unicode normalizations too. E.g. Macs keep
> them separated, whereas under Linux I believe such combined diacritics
> are replaced with a single code-point that can represent the same
> information [if it exists].

No problem, as long as the strings are displayed properly. If somebody 
wants a *specific* canonical representation, he has to enforce it by 
using appropriate translation functions, just like with little/big endian. 
The implementation of such translation functions is beyond the scope of 
the runtime library; it requires detailed knowledge about every language, 
and maybe about the OS conventions, so it should be delegated to an 
installed external library. We also don't care about the visual 
representation of Unicode text, with e.g. ligatures and accents, do we?
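
A tiny example of what such a translation function would have to 
reconcile - it only demonstrates the problem, it does not solve it:

program CanonDemo;
{$mode objfpc}{$H+}

var
  Precomposed, Decomposed: AnsiString;
begin
  // 'é' as a single codepoint (U+00E9, UTF-8: C3 A9) ...
  Precomposed := #$C3#$A9;
  // ... and as 'e' followed by combining acute accent (U+0301, CC 81).
  Decomposed  := 'e'#$CC#$81;
  // Both display as the same glyph, but compare unequal byte-wise;
  // making them comparable needs an explicit normalization step.
  WriteLn(Precomposed = Decomposed);   // prints FALSE
end.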


>> - they only are read, written and displayed, and what else can be made
>> in portable "high-level" string handling.
> 
> Well, for any string handling in your application, you need to know the
> difference between what is perceived as a Unicode "character" on the
> screen, and the various ways such a "character" can be presented in a
> language structure. There is no way around this, unless FPC defines that
> such Unicode strings are always stored in some specific normalized manner.

All higher-level Unicode string handling is the task of dedicated 
libraries, not of a compiler and its RTL.


>> Dealing with *all* the Unicode quirks IMO is beyond "usual" coding, it
>> will be reserved to specialized text processing components or applications.
> 
> I'm not arguing that point.
> 
> 
>> *Most* users will be happy with the BMP. Those using codepages outside
>> the BMP had to live with all that stuff, since ever.
> 
> Then you should call it UCS-2 support, and not Unicode support. We are
> talking about implementing Unicode support here.

I see no reason why we should *not* distinguish between *physical* 
Unicode support (like UTF-8) and *logical* Unicode string handling, 
which may be restricted to specific subsets (e.g. the BMP). Support for 
more languages should be delegated to dedicated libraries, supplied by 
experts with detailed knowledge of the Unicode quirks in their own 
(natural) language.
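
As a rough illustration of that distinction: the *physical* UTF-8 
storage stays as it is, while a *logical* restriction to the BMP can be 
checked cheaply. FitsInBmp is a made-up name, purely illustrative:

program BmpCheck;
{$mode objfpc}{$H+}

{ True if a UTF-8 string contains only BMP codepoints,
  i.e. no 4-byte sequences. }
function FitsInBmp(const S: AnsiString): Boolean;
var
  I: Integer;
begin
  Result := True;
  for I := 1 to Length(S) do
    if Ord(S[I]) >= $F0 then     // lead byte of a 4-byte sequence
      Exit(False);
end;

begin
  WriteLn(FitsInBmp('Grüße'));            // TRUE  (UTF-8 source file)
  WriteLn(FitsInBmp(#$F0#$9D#$84#$9E));   // FALSE (U+1D11E, outside BMP)
end.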

Let's proceed step by step :-)

DoDi




