[Lazarus] substr return wrong string with some utf8 char
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Fri Feb 11 12:49:20 CET 2011
Michael Schnell wrote:
> On 02/10/2011 03:20 PM, Hans-Peter Diettrich wrote:
>>
>>> length('à') returns 2
>>> utf8length('à') returns 1
>>
> I think that, according to the definition of UTF8String, it's correct that
> Length(s) provides the byte count. I do hope that with "NewStrings" this
> might change some day, as it's quite confusing for anybody who does not
> want to be bothered with the Unicode internals.
Length() is bound to the physical (array) size; redefining it would
break this established rule.
MBCS users have had to live with this problem all along, and UTF-8 is an
MBCS. I'm not sure whether the difference between the number of characters
(glyphs) and the number of codepoints can be eliminated by any approved
convention.
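
To make the three counts concrete, here is a minimal sketch. CountCodePoints is a hypothetical stand-in written for this example (it is not the Lazarus utf8length routine quoted above), and it assumes the string holds well-formed UTF-8. The non-ASCII characters are written as byte escapes so the result does not depend on the source file encoding.

```pascal
program Utf8LenDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts codepoints by skipping UTF-8
// continuation bytes (bit pattern 10xxxxxx).
// Assumes S holds well-formed UTF-8.
function CountCodePoints(const S: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  precomposed, decomposed: AnsiString;
begin
  precomposed := #$C3#$A0;     // U+00E0, a with grave as one codepoint
  decomposed  := 'a'#$CC#$80;  // 'a' followed by U+0300 combining grave

  WriteLn('bytes: ', Length(precomposed));               // bytes: 2
  WriteLn('codepoints: ', CountCodePoints(precomposed)); // codepoints: 1
  WriteLn('bytes: ', Length(decomposed));                // bytes: 3
  WriteLn('codepoints: ', CountCodePoints(decomposed));  // codepoints: 2
end.
```

Both strings display as a single glyph, yet the second one has two codepoints, so even a codepoint count does not give the number of visible characters.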
IMO it's a good idea to forget about "char" when dealing with Unicode/UTF
strings, and to use only (sub)strings. This is not a major problem, since
Pascal does not distinguish between char and string literals.
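Obviously this code will fail with UTF-8 encoding, where 'à' takes at
least two bytes and cannot fit into a one-byte char:
var a: char = 'à'; //or '`a'?
Even UTF-32 may fail with ligatures or other character combinations,
where one visible character spans several codepoints.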
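
The "only use (sub)strings" idea can be sketched as follows. CodePointToByteIndex and CopyCodePoints are hypothetical helpers invented for this example, not RTL or LCL routines, and they assume well-formed UTF-8. The point is only that the substring is addressed by codepoint position while the result stays a string, never a char.

```pascal
program Utf8SubstrDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: 1-based byte index at which the N-th codepoint
// starts, or Length(S) + 1 if S has fewer than N codepoints.
function CodePointToByteIndex(const S: AnsiString; N: Integer): Integer;
var
  i, seen: Integer;
begin
  seen := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then  // not a continuation byte
    begin
      Inc(seen);
      if seen = N then
        Exit(i);
    end;
  Result := Length(S) + 1;
end;

// Hypothetical helper: like Copy(), but StartCP and CountCP are
// measured in codepoints instead of bytes.
function CopyCodePoints(const S: AnsiString;
  StartCP, CountCP: Integer): AnsiString;
var
  firstByte, endByte: Integer;
begin
  firstByte := CodePointToByteIndex(S, StartCP);
  endByte   := CodePointToByteIndex(S, StartCP + CountCP);
  Result := Copy(S, firstByte, endByte - firstByte);
end;

var
  s: AnsiString;
begin
  s := 'd'#$C3#$A0'b';  // 'd', U+00E0 as two UTF-8 bytes, 'b'

  // Byte-based Copy cuts the two-byte sequence in half.
  WriteLn(Length(Copy(s, 2, 1)));            // 1
  // Codepoint-based copy returns the complete sequence.
  WriteLn(Length(CopyCodePoints(s, 2, 1)));  // 2
end.
```

This still works on codepoints only; a decomposed 'a' plus combining accent would need a further layer that groups combining sequences.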
Any "NewStrings" model should IMO distinguish at least between ASCII,
ANSI, UCS2 and UTF strings:
ASCII: never convert, codes above #$7F are undefined (maybe raw data).
ANSI: SBCS according to a specific codepage.
UCS2: a possible Unicode subset (BMP) of 2-byte (WideChar) characters.
UTF: anything else, with unrelated character and byte counts.
This would at least make those coders happy who are used to dealing with
SBCS and who write applications for local/national use. All coders, in
particular the English (ASCII) speakers, have to learn about UTF and MBCS
when dealing with UTF strings (apart from assignment and display).
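
The four-way distinction proposed above could be sketched roughly like this. TStringKind, BytesPerChar and IsPureAscii are hypothetical names made up for illustration; nothing like them is claimed to exist in FPC or Lazarus. The sketch only shows which kinds let the character count be derived from the physical size.

```pascal
program StringKindDemo;
{$mode objfpc}{$H+}

type
  // Hypothetical classification, mirroring the list in this post.
  TStringKind = (skASCII, skANSI, skUCS2, skUTF);

// Bytes per character for the fixed-width kinds. 0 means "variable":
// the character count cannot be derived from the byte count.
function BytesPerChar(Kind: TStringKind): Integer;
begin
  case Kind of
    skASCII, skANSI: Result := 1;
    skUCS2:          Result := 2;
  else
    Result := 0;  // skUTF
  end;
end;

// True if every byte lies in the ASCII range #$00..#$7F, so the
// string could be treated as ASCII without any conversion.
function IsPureAscii(const S: AnsiString): Boolean;
var
  i: Integer;
begin
  Result := True;
  for i := 1 to Length(S) do
    if Ord(S[i]) > $7F then
      Exit(False);
end;

begin
  WriteLn(BytesPerChar(skANSI));       // 1
  WriteLn(BytesPerChar(skUTF));        // 0
  WriteLn(IsPureAscii('abc'));         // TRUE
  WriteLn(IsPureAscii('a'#$C3#$A0));   // FALSE
end.
```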
DoDi