[Lazarus] Unicode (was Re: cwstring in arm-linux)

Hans-Peter Diettrich DrDiettrich1 at aol.com
Fri Oct 21 23:29:54 CEST 2011


Michael Lutz wrote:
> On 21.10.2011 00:20, Hans-Peter Diettrich wrote:
>> The Ansi/UTF-16 migration is much easier than a migration to UTF-8. When 
>> your legacy code can assume that every (visible) character is a Char, in 
>> an SBCS codepage, this is not different in UTF-16.
> 
> Ever heard of decomposed characters?

Right, that's one of the strange-looking features of Unicode, related to 
*language* conventions.

> Don't even think about collation, sorting, upper/lower-casing etc, there's
> a reason the ICU library comes with 16 MB of data in addition to the code.

Right, see above.

For writing applications that are aware of different languages, Unicode 
by itself is not very helpful, but it makes it possible to implement and 
use libraries that deal with everything beyond code points.

> Conclusion: Every Unicode encoding has variable length characters. Code
> points in UTF-32 are of fixed size, in UTF-16 come in two sizes, and in
> UTF-8 come in four sizes (not six as the Unicode standard chose not to
> utilize a full 32-bit numerical space). Additionally, UTF-16 and UTF-32
> are not endian neutral.

Data can be compressed in various ways; text is only one kind of such 
data. An application should select the most appropriate text encoding 
and use exactly that one internally.


> Conclusion 2: For storing a single visible character, a simple
> char/wchar_t/wxChar/wxUniChar/whatever variable is not enough. You always
> need a string to cater for decomposed characters.

Right, outside SBCS the code has to distinguish between logical and 
physical characters, indices and counts. IMO it's not normally required 
to store exactly one logical character, so (sub)strings should be used 
everywhere.
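
To illustrate the logical/physical distinction, here is a minimal sketch 
that assumes the text is held as well-formed UTF-8 in an AnsiString. The 
name Utf8CodePoints is made up for this example. It shows the same visible 
character, once precomposed and once decomposed, with different byte 
counts and different code point counts:

```pascal
program CodePointCount;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes well-formed UTF-8.
function Utf8CodePoints(const s: AnsiString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: AnsiString;
begin
  Precomposed := #$C3#$A9;     // U+00E9, "e with acute" as one code point
  Decomposed := 'e'#$CC#$81;   // U+0065 followed by combining U+0301

  // physical size in bytes, then number of code points
  WriteLn(Length(Precomposed), ' ', Utf8CodePoints(Precomposed)); // 2 1
  WriteLn(Length(Decomposed), ' ', Utf8CodePoints(Decomposed));   // 3 2
end.
```

Both strings display as one character, yet neither the byte count nor the 
code point count says so. That is the job of the libraries mentioned above.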

E.g. it may be more efficient to cut a string at the place where a 
certain pattern has been found, into two or three substrings, and then 
continue working with the preceding or following string. This eliminates 
the need to find the starting position again, based on a logical 
character count that the search may have returned.
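
A rough sketch of such a cut, using only Pos and Copy from the System 
unit. The name SplitAt and its parameters are invented for this example. 
All positions are physical, so no logical character count is ever needed:

```pascal
program SplitAtPattern;
{$mode objfpc}{$H+}

// Hypothetical helper: cuts s around the first occurrence of pattern.
// Returns False and leaves all three outputs empty if there is no match.
function SplitAt(const s, pattern: string;
  out before, matched, after: string): Boolean;
var
  p: SizeInt;
begin
  before := '';
  matched := '';
  after := '';
  p := Pos(pattern, s);            // physical (code unit) index, 0 if absent
  Result := p > 0;
  if not Result then
    Exit;
  before := Copy(s, 1, p - 1);
  matched := Copy(s, p, Length(pattern));
  after := Copy(s, p + Length(pattern), Length(s));  // count is clamped
end;

var
  b, m, a: string;
begin
  if SplitAt('key = value', ' = ', b, m, a) then
    WriteLn(b, '|', m, '|', a);    // key| = |value
end.
```

Further processing can continue on "after" alone, without scanning the 
original string again.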

AFAIK Java uses powerful substrings, which are references into the same 
string, with their own physical offsets and sizes, reducing runtime and 
storage requirements. Imagine what could be done with an equivalent 
SubString type in OPL, that can be used to refer to the preceding, 
matched and following parts of the entire string, without iterating 
again over the entire string. E.g.
   function Match(str, pattern: string): SubString;
   function Match(str, start, delim: string): SubString;
would allow for simple implementations of:
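   function MatchNext(const str: string; const substr: SubString): SubString;
   procedure Replace(var str: string; const substr: SubString; const newstr: string);
   function Leader(const str: string; const substr: SubString): SubString;
   function Trailer(const str: string; const substr: SubString): SubString;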
etc.
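
A minimal sketch of how part of this could look, under the assumption that 
a SubString is just a physical offset and length, with the owning string 
passed alongside it. The record layout and the AsString helper are 
hypothetical. A real implementation would more likely keep a reference to 
the string inside the SubString itself:

```pascal
program SubStringSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical view into a string, in physical (code unit) terms.
  SubString = record
    Start: SizeInt;   // 1-based offset; 0 means "no match"
    Len: SizeInt;
  end;

function Match(const str, pattern: string): SubString;
begin
  Result.Start := Pos(pattern, str);
  if Result.Start > 0 then
    Result.Len := Length(pattern)
  else
    Result.Len := 0;
end;

// The part of str preceding sub; empty if sub is "no match".
function Leader(const str: string; const sub: SubString): SubString;
begin
  Result.Start := 0;
  Result.Len := 0;
  if sub.Start = 0 then
    Exit;
  Result.Start := 1;
  Result.Len := sub.Start - 1;
end;

// The part of str following sub; empty if sub is "no match".
function Trailer(const str: string; const sub: SubString): SubString;
begin
  Result.Start := 0;
  Result.Len := 0;
  if sub.Start = 0 then
    Exit;
  Result.Start := sub.Start + sub.Len;
  Result.Len := Length(str) - Result.Start + 1;
end;

// Copies the referenced part out, only when it is really needed.
function AsString(const str: string; const sub: SubString): string;
begin
  Result := Copy(str, sub.Start, sub.Len);
end;

var
  s: string;
  m: SubString;
begin
  s := 'name=DoDi';
  m := Match(s, '=');
  WriteLn(AsString(s, Leader(s, m)));    // name
  WriteLn(AsString(s, Trailer(s, m)));   // DoDi
end.
```

Leader and Trailer only do arithmetic on offsets, so no character data is 
copied until AsString is called.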

DoDi




