[Lazarus] substr return wrong string with some utf8 char

Hans-Peter Diettrich DrDiettrich1 at aol.com
Fri Feb 11 12:49:20 CET 2011


Michael Schnell wrote:
> On 02/10/2011 03:20 PM, Hans-Peter Diettrich wrote:
>>
>>> length('à') return 2
>>> utf8length('à') return 1
>>
> I think that, according to the definition of UTF8String, it's correct that 
> Length(s) provides the byte count. I do hope that with "NewStrings" this 
> might change some day, as it's quite confusing for anybody who does not 
> want to be bothered with the Unicode internals.

Length() is bound to the physical (array) size of the string; redefining 
it would break this established rule.

MBCS users have always had to live with this problem, and UTF-8 is an 
MBCS. I'm not sure whether the difference between the number of characters 
(glyphs) and the number of codepoints can be eliminated by any approved 
convention.
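
To make the three different counts concrete, here is a minimal sketch in 
plain Free Pascal. CodePointCount is a hypothetical helper written only for 
this illustration (it is not an RTL or LCL function), and it assumes 
well-formed UTF-8. The bytes are assigned explicitly so that the result does 
not depend on the encoding of the source file.

```pascal
program Utf8Counts;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by skipping UTF-8 continuation
// bytes (bit pattern 10xxxxxx). Assumes well-formed UTF-8; no validation.
function CodePointCount(const S: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  precomposed, combined: AnsiString;
begin
  // U+00E0 (a with grave) as one precomposed code point: bytes C3 A0
  SetLength(precomposed, 2);
  precomposed[1] := #$C3;
  precomposed[2] := #$A0;

  // 'a' followed by U+0300 (combining grave accent): bytes 61 CC 80
  SetLength(combined, 3);
  combined[1] := 'a';
  combined[2] := #$CC;
  combined[3] := #$80;

  // Both strings display as one character, yet the counts differ.
  WriteLn(Length(precomposed), ' ', CodePointCount(precomposed)); // 2 1
  WriteLn(Length(combined), ' ', CodePointCount(combined));       // 3 2
end.
```

The second string is the case where even a codepoint count no longer 
matches what the user sees as one character.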

IMO it's a good idea to forget about "char" when dealing with Unicode/UTF 
strings, and to use only (sub)strings. This is not a major problem, since 
Pascal does not distinguish between char and string literals.

Obviously this code will fail with UTF-8 encoding, because the literal 
occupies two bytes and does not fit into a one-byte char:
   var a: char = 'à'; //or '`a'?
and even UTF-32 may fail with ligatures or other character combinations.
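
A rough sketch of the "(sub)strings instead of char" idea, again in plain 
Free Pascal: the first codepoint is returned as a string whose byte length 
is derived from the lead byte. SeqLen and FirstCodePoint are hypothetical 
names made up for this example, and the code assumes well-formed UTF-8. 
Lazarus has its own UTF-8 helpers (such as the UTF8Length quoted above), 
which would be the natural choice in real code.

```pascal
program Utf8FirstChar;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF-8 sequence introduced by
// lead byte B. Returns 1 for ASCII and, as a fallback, for invalid bytes.
function SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1
  else if (B and $E0) = $C0 then Result := 2
  else if (B and $F0) = $E0 then Result := 3
  else if (B and $F8) = $F0 then Result := 4
  else Result := 1;
end;

// Hypothetical helper: returns the first code point of S as a
// (sub)string rather than as a Char.
function FirstCodePoint(const S: AnsiString): AnsiString;
begin
  if S = '' then
    Result := ''
  else
    Result := Copy(S, 1, SeqLen(Ord(S[1])));
end;

var
  s: AnsiString;
begin
  // Bytes C3 A0 (U+00E0, a with grave) followed by 'b' and 'c'
  SetLength(s, 4);
  s[1] := #$C3;
  s[2] := #$A0;
  s[3] := 'b';
  s[4] := 'c';

  WriteLn(Length(FirstCodePoint(s)));  // 2: the complete two-byte sequence
  WriteLn(Length(Copy(s, 1, 1)));      // 1: only the lead byte, a broken string
end.
```

The plain Copy(s, 1, 1) call is what produces the "wrong string" from the 
subject line: it cuts the two-byte sequence in half.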

IMO any "NewStrings" model should at least distinguish between ASCII, 
ANSI, UCS2 and UTF strings:

ASCII: never convert, codes above #$7F are undefined (maybe raw data).
ANSI: SBCS according to a specific codepage.
UCS2: a possible Unicode subset (BMP) of 2-byte (WideChar) characters.
UTF: anything else, with unrelated character and byte counts.

This would at least make those coders happy who are used to dealing with 
SBCS and who write applications for local/national use. All coders, in 
particular the English (ASCII) speakers, have to learn about UTF and MBCS 
when dealing with UTF strings (apart from assignment and display).

DoDi




