[Lazarus] Unicode (was Re: cwstring in arm-linux)

Fri Oct 21 13:26:12 CEST 2011

On 21.10.2011 00:20, Hans-Peter Diettrich wrote:
> The Ansi/UTF-16 migration is much easier than a migration to UTF-8. When 
> your legacy code can assume that every (visible) character is a Char, in 
> an SBCS codepage, this is not different in UTF-16.

Ever heard of decomposed characters? In *no* Unicode encoding can you
assume that a single visible character is made up of only a single code
point (Char, as you call it). Even if you normalize all external text, you
still have to deal with such sequences, because not every valid combination
of base character and diacritics has a precomposed form.
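
To illustrate, here is a minimal sketch in Python (not Pascal, simply
because its standard unicodedata module makes this easy to try). The helper
name nfc_len is made up for this example. It shows that NFC normalization
composes "e" + combining acute into one code point, but leaves "q" +
combining dot below as two, since no precomposed form exists for it:

```python
import unicodedata

def nfc_len(text):
    """Number of code points left after NFC (composing) normalization."""
    return len(unicodedata.normalize("NFC", text))

# "e" + COMBINING ACUTE ACCENT composes to U+00E9, a single code point.
e_acute = "e\u0301"
# "q" + COMBINING DOT BELOW has no precomposed form and stays two code points.
q_dot_below = "q\u0323"

print(nfc_len(e_acute))      # one visible character, one code point
print(nfc_len(q_dot_below))  # one visible character, still two code points
```

Both strings render as one visible character, yet only the first fits into
a single Char after normalization.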

Don't even think about collation, sorting, upper/lower-casing etc. There's
a reason the ICU library comes with 16 MB of data in addition to the code.
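
Two small cases show why per-Char handling falls short here. This is only
a sketch using Python's built-in string methods (not ICU, and not
locale-aware); the variable names are made up for the example:

```python
# Upper-casing can change the length of a string: the German sharp s
# has the two-letter upper-case mapping "SS".
sharp_s_upper = "\u00df".upper()
print(len("\u00df"), len(sharp_s_upper))

# Sorting by raw code point value is not collation: "ä" (U+00E4) ends up
# after "z" (U+007A), which no German dictionary would accept.
by_code_point = sorted(["z", "\u00e4", "a"])
print(by_code_point)
```

Getting the second case right needs language-specific collation rules,
which is exactly the kind of data ICU ships.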

Conclusion: Every Unicode encoding has variable-length characters. Code
points are of fixed size in UTF-32, come in two sizes in UTF-16, and come
in four sizes in UTF-8 (not six, as the Unicode standard caps code points
at U+10FFFF rather than utilizing a full 32-bit numerical space).
Additionally, UTF-16 and UTF-32 are not endian-neutral.
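
The sizes are easy to check. A minimal sketch in Python, where
encoded_sizes is a helper name invented for this example and the
little-endian variants are picked so that no byte order mark is counted:

```python
def encoded_sizes(ch):
    """Byte length of one code point in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))

# One sample code point for each of the four UTF-8 lengths.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), encoded_sizes(ch))

# The same code point gives different bytes depending on byte order.
print("A".encode("utf-16-le"), "A".encode("utf-16-be"))
```

U+1F600 lies outside the BMP, so it takes a surrogate pair (four bytes) in
UTF-16 and four bytes in UTF-8 as well.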

Conclusion 2: For storing a single visible character, a simple
char/wchar_t/wxChar/wxUniChar/whatever variable is not enough. You always
need a string to cater for decomposed characters.

Michael Lutz