[Lazarus] Does Lazarus support a complete Unicode Component Library?
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Thu Feb 17 13:41:50 CET 2011
Graeme Geldenhuys wrote:
> On 2011-02-17 11:28, Michael Schnell wrote:
>> On 02/17/2011 07:19 AM, Jürgen Hestermann wrote:
>>> I often search for substrings, delete them from the string, insert
>>> other strings at certain places, etc.
>>> How can you do all this without knowledge of the internal structure of
>>> the string?
>> This (magically :-) ) does work with UTF8.
>
> NO, it doesn't! You can't use FPC's Copy(), Pos() etc reliably with
> UTF-8 text,
You can, if you do it the *right* way.
> because those RTL functions work purely on ANSI text
> (1-byte characters - speaking of String type text here) and don't know
> about multi-byte characters, combining diacritics etc.
Pos() certainly works with MBCS as well, and you cannot expect the
basic Unicode functions to handle combining characters and ligatures.
When Copy() requires a byte count, you can compute it from the
difference of the index positions of the involved substrings. It would
be better, though, if the basic procedures did not deal with counts
or sizes at all.
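A minimal sketch of the index-difference approach, written in Python because its bytes type is a plain byte array, much like a byte-indexed Pascal string. The sample text and the two marker strings are made up for illustration; bytes.find() stands in for Pos() and slicing stands in for Copy(). A plain byte search is safe here because a valid UTF-8 sequence can never match in the middle of another character's bytes.

```python
# UTF-8 text held as raw bytes, comparable to a byte-indexed Pascal string.
text = "Grüße aus Köln, viele Grüße".encode("utf-8")
start_marker = "aus ".encode("utf-8")  # illustrative marker
end_marker = ",".encode("utf-8")       # illustrative marker

# bytes.find() is a plain byte search, like Pos(); it returns a byte index
# (0-based here, whereas Pos() is 1-based).
start = text.find(start_marker) + len(start_marker)
end = text.find(end_marker, start)

# The byte count that a Copy()-style call needs is just the index difference.
count = end - start
extracted = text[start:start + count]

print(extracted.decode("utf-8"))  # Köln
print(count)                      # 5 (bytes, for 4 visible characters)
```

No character counting or re-parsing of the string is involved: both indices come straight from the search.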
> Hence LCL and
> fpGUI have special functions similar to RTL, that know how to work with
> UTF-8 encoded text. eg: UTF8Pos(), UTF8Length and UTF8Copy() etc functions.
This is a stupid idea, IMO. A "UTF8" prefix is inappropriate when it
comes to the distinction between physical and logical functionality.
E.g. the number of *logical* (maybe visible) characters can be
determined from any string encoding, and that function should have a
*unique* name and (possibly) overloaded implementations. Likewise a
SubString procedure could take two index positions, which can be
determined without knowledge of the string encoding. This way string
insertion or extraction does not require a re-parse of the strings in
order to translate logical into physical indices and counts.
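As a rough illustration of a SubString that takes two index positions, the sketch below uses hypothetical helpers (sub_string, insert_at and demo are invented names, not RTL or LCL functions). The same code runs on two storage forms of one text: UTF-8 bytes and a Python str, which is an array of code points. Only the index values differ between the two.

```python
def sub_string(s, start, stop):
    """Hypothetical SubString: return the elements from index start up to,
    but not including, stop. It only slices the underlying array."""
    return s[start:stop]


def insert_at(s, index, piece):
    """Hypothetical insertion at a physical index, again without parsing."""
    return s[:index] + piece + s[index:]


def demo(text, needle, piece):
    # Both indices come from a search on the physical array.
    i = text.find(needle)
    j = i + len(needle)
    return i, j, sub_string(text, i, j), insert_at(text, i, piece)


as_utf8 = demo("naïve café".encode("utf-8"),
               "café".encode("utf-8"),
               "petit ".encode("utf-8"))
as_str = demo("naïve café", "café", "petit ")

print(as_utf8[:2])                 # (7, 12): byte indices
print(as_str[:2])                  # (6, 10): code point indices
print(as_utf8[3].decode("utf-8"))  # naïve petit café
print(as_str[3])                   # naïve petit café
```

The caller never converts a logical character position into a byte offset; the indices are taken as found and handed back unchanged.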
IMO we simply have to agree that Length() is a physical property, the
number of elements in an array. A logical character count has a very
different meaning in string handling, and not even a *single* meaning,
when we start dealing with ligatures and other Unicode stuff[1].
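To make the "not even a single meaning" point concrete, here is a small sketch that measures one short string in several ways. The sample string is an assumption chosen for illustration, and counting non-combining code points is only a crude approximation of visible characters, not a full grapheme segmentation.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT, then "tude".
# On screen this looks like the five characters of "étude".
s = "e\u0301tude"

utf8_units = len(s.encode("utf-8"))            # physical length as UTF-8 bytes
utf16_units = len(s.encode("utf-16-le")) // 2  # physical length as 16-bit units
code_points = len(s)                           # number of Unicode code points
# Approximate "visible" count: skip code points with a combining class.
base_chars = sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(utf8_units, utf16_units, code_points, base_chars)  # 7 6 6 5
```

Only the first two numbers are array lengths in the sense of Length(); the others depend on which notion of "character" the caller has in mind.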
[1] In a mix of LTR and RTL parts a distinction between sequential
physical and logical indices is required as well. The first RTL
codepoint physically follows the preceding LTR codepoint, but logically
(on screen...) it precedes the *next* LTR codepoint. I only see one
proper solution to such quirks, by restricting the arguments of string
handling functions to physical (array) indices. Logical increments of
such indices are at the discretion of the user, depending on his
understanding of the desired result. Library functions can only deal
with different encodings, but will always return physical indices.
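One way such a user-controlled increment might look for UTF-8 is sketched below. next_code_point is a hypothetical helper, not an existing library function, and it assumes valid UTF-8 input with the index pointing at the first byte of a sequence. It takes a physical index and returns a physical index; whether stepping by code point is the right logical unit is left to the caller.

```python
def next_code_point(data: bytes, index: int) -> int:
    """Return the byte index of the code point after the one at index.

    Hypothetical helper: assumes data is valid UTF-8 and that index
    points at the first byte of a sequence."""
    index += 1
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while index < len(data) and (data[index] & 0xC0) == 0x80:
        index += 1
    return index


# 'a' (1 byte), Hebrew letter bet (2 bytes), euro sign (3 bytes).
text = "a\u05d1\u20ac".encode("utf-8")

indices = []
i = 0
while i < len(text):
    indices.append(i)
    i = next_code_point(text, i)

print(indices)  # [0, 1, 3]
```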
DoDi