[Lazarus] Does Lazarus support a complete Unicode Component Library?

Thu Feb 17 13:41:50 CET 2011

Graeme Geldenhuys wrote:
> On 2011-02-17 11:28, Michael Schnell wrote:
>> On 02/17/2011 07:19 AM, Jürgen Hestermann wrote:
>>> I often search for substrings, delete them from the string, insert
>>> other strings at certain places, etc.
>>> How can you do all this without knowledge of the internal structure of
>>> the string?
>> This (magically :-) ) does work with UTF8.
> 
> NO, it doesn't! You can't use FPC's Copy(), Pos() etc reliably with
> UTF-8 text,

You can, if you do it the *right* way.

> because those RTL functions work purely on ANSI text
> (1-byte characters - speaking of String type text here) and don't know
> about multi-byte characters, combining diacritics etc.

Pos() certainly works with MBCS as well, and you cannot expect the
basic Unicode functions to handle combining characters and ligatures.
Where Copy requires a byte count, you can compute it from the
difference between the index positions of the substrings involved. It
would be better, though, if the basic procedures did not deal with
counts or sizes at all.
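
To make that concrete, here is a minimal sketch of searching, cutting and
inserting in a UTF-8 string with nothing but the plain RTL routines. The
helper name ReplaceFirst is made up for this example, and the UTF-8 bytes
are written as explicit #$xx constants so that the result does not depend
on the encoding of the source file.

```pascal
program PosOnUtf8;
{$mode objfpc}{$H+}

// Hypothetical helper: replaces the first occurrence of OldSub in S.
// Every position and count used here is a byte (physical) value.
function ReplaceFirst(const S, OldSub, NewSub: string): string;
var
  P: SizeInt;
begin
  Result := S;
  P := Pos(OldSub, Result);            // byte index of the match, 0 if none
  if P = 0 then
    Exit;
  Delete(Result, P, Length(OldSub));   // Length() is the byte count of OldSub
  Insert(NewSub, Result, P);
end;

var
  S: string;
  SpacePos: SizeInt;
begin
  // "Jürgen Müller" as UTF-8; each "ü" is the two bytes $C3 $BC.
  S := 'J'#$C3#$BC'rgen M'#$C3#$BC'ller';

  // The byte count for Copy is the difference of two byte indices.
  SpacePos := Pos(' ', S);                        // 8
  WriteLn(Length(Copy(S, 1, SpacePos - 1)));      // 7 bytes: "Jürgen"

  // Search for a multi-byte substring and replace it.
  WriteLn(Length(ReplaceFirst(S, 'M'#$C3#$BC'ller', 'Meier')));  // 13 bytes
end.
```

Nowhere does the code need to know how many bytes a character occupies,
because every index comes from Pos() and every count from Length() or
from a difference of indices.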

> Hence LCL and
> fpGUI have special functions, similar to the RTL ones, that know how to
> work with UTF-8 encoded text, e.g. UTF8Pos(), UTF8Length() and UTF8Copy().

This is a stupid idea, IMO. A "UTF8" prefix is inappropriate when it
comes to the distinction between physical and logical functionality.
E.g. the number of *logical* (maybe visible) characters can be
determined for any string encoding, and that function should have a
*unique* name and (possibly) overloaded implementations. Likewise a
SubString procedure could take two index positions, which can be
determined without knowledge of the string encoding. This way string
insertion or extraction does not require a re-parse of the strings in
order to translate logical into physical indices and counts.
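
A rough sketch of what I mean by such a SubString. It is not an existing
RTL or LCL routine, and both the name and the half-open [First, Last)
convention are only assumptions for the sake of the example.

```pascal
program SubStringSketch;
{$mode objfpc}{$H+}

// Hypothetical: returns the bytes S[First] .. S[Last - 1].
// Both arguments are physical (byte) indices, e.g. results of Pos().
function SubString(const S: string; First, Last: SizeInt): string;
begin
  Result := Copy(S, First, Last - First);
end;

var
  S, Value: string;
  First, Last: SizeInt;
begin
  // "name=Jürgen;city=Köln" as UTF-8 bytes.
  S := 'name=J'#$C3#$BC'rgen;city=K'#$C3#$B6'ln';

  First := Pos('=', S) + 1;   // first byte after the "="
  Last  := Pos(';', S);       // byte index of the terminating ";"

  Value := SubString(S, First, Last);
  WriteLn(Length(Value));     // 7 bytes: "Jürgen"
end.
```

The caller never converts a character count into a byte count; both
indices are taken directly from the search results.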

IMO we simply have to agree that Length() is a physical property, the 
number of elements in an array. A logical character count has a very 
different meaning in string handling, and not even a *single* meaning, 
when we start dealing with ligatures and other Unicode stuff[1].
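
A small sketch of how far these numbers can drift apart. CodePointCount
is a made-up name (it counts code points much as I would expect a routine
like UTF8Length to do), and it simply assumes that its input is valid
UTF-8.

```pascal
program LengthVsCodePoints;
{$mode objfpc}{$H+}

// Hypothetical: counts code points by skipping UTF-8 continuation
// bytes (bit pattern 10xxxxxx). Assumes S is valid UTF-8.
function CodePointCount(const S: string): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: string;
begin
  Precomposed := #$C3#$A9;       // U+00E9, "e with acute" as one code point
  Decomposed  := 'e'#$CC#$81;    // U+0065 followed by combining U+0301

  WriteLn(Length(Precomposed), ' ', CodePointCount(Precomposed));  // 2 1
  WriteLn(Length(Decomposed), ' ', CodePointCount(Decomposed));    // 3 2
end.
```

Both strings are displayed as the same single character, yet they give
2 bytes and 1 code point versus 3 bytes and 2 code points. Only Length()
has an unambiguous meaning here.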

[1] In a mix of LTR and RTL parts a distinction between sequential 
physical and logical indices is required as well. The first RTL 
codepoint physically follows the preceding LTR codepoint, but logically 
(on screen...) it precedes the *next* LTR codepoint. I only see one 
proper solution to such quirks, by restricting the arguments of string 
handling functions to physical (array) indices. Logical increments of 
such indices are at the discretion of the user, depending on his 
understanding of the desired result. Library functions can only deal
with different encodings, but will always return physical indices.

DoDi