[Lazarus] substr return wrong string with some utf8 char

Fri Feb 11 12:49:20 CET 2011

Michael Schnell wrote:
> On 02/10/2011 03:20 PM, Hans-Peter Diettrich wrote:
>>
>>> length('à') returns 2
>>> utf8length('à') returns 1
>>
> I think according to the definition of UTF8String it's correct that 
> Length(s) provides the byte count. I do hope that with "NewStrings" this 
> some day might change, as it's quite confusing for anybody who does not 
> want to be bothered with the Unicode internals.

Length() is bound to the physical (array) size; a redefinition would 
break this established rule.

MBCS users have always had to live with this problem, and UTF-8 is an 
MBCS. I'm not sure whether the difference between the number of 
characters (glyphs) and the number of code points can be eliminated by 
any accepted convention.
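
To illustrate the different counts, here is a minimal sketch, assuming a 
plain FPC program without any LCL units. CountCodePoints is a 
hypothetical helper written for this example, not an RTL routine, and 
the strings are given as explicit UTF-8 bytes so that the result does 
not depend on the source file encoding:

```pascal
program Utf8Count;
{$mode objfpc}{$H+}

// Hypothetical helper (not an RTL or LCL routine): counts code points by
// skipping UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
function CountCodePoints(const s: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  precomposed, combining: AnsiString;
begin
  precomposed := #$C3#$A0;     // U+00E0, a with grave as one code point
  combining   := 'a'#$CC#$80;  // 'a' followed by U+0300 combining grave
  // Each line prints the byte count, then the code point count.
  WriteLn(Length(precomposed), ' ', CountCodePoints(precomposed));
  WriteLn(Length(combining), ' ', CountCodePoints(combining));
end.
```

Both strings display as a single glyph, yet this should print "2 1" and 
"3 2": the byte count, the code point count and the glyph count can all 
differ for the same visible character.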

IMO it's a good idea to forget about "char" in dealing with Unicode/UTF 
strings, and only use (sub)strings. This is not a major problem, since 
Pascal does not distinguish between char and string literals.

Obviously this code will fail with UTF-8 encoding, where 'à' occupies 
two bytes and does not fit into a one-byte char:
   var a: char = 'à'; //or '`a'?
and even UTF-32 may fail with ligatures or other character combinations.

Any "NewStrings" model should IMO at least distinguish between ASCII, 
ANSI, UCS2 and UTF strings:

ASCII: never convert, codes above #$7F are undefined (maybe raw data).
ANSI: SBCS according to a specific codepage.
UCS2: a possible Unicode subset (BMP) of 2-byte (WideChar) characters.
UTF: anything else, with unrelated character and byte counts.

This would at least make those coders happy who are used to dealing 
with SBCS and who write applications for local/national use. All coders, 
in particular the English (ASCII) speakers, have to learn about UTF and 
MBCS when doing anything with UTF strings beyond assignment and display.

DoDi