[Lazarus] Unicode on Windows

Sven Barth pascaldragon at googlemail.com
Tue Apr 10 09:58:52 CEST 2012


On 10.04.2012 03:50, Marcos Douglas wrote:
> On Mon, Apr 9, 2012 at 8:17 PM, Hans-Peter Diettrich
> <DrDiettrich1 at aol.com>  wrote:
>> Marcos Douglas wrote:
>>
>>
>>> I still have to think about:
>>> DirectoryExists or DirectoryExistsUTF8
>>> ForceDirectoriesUTF8 or ForceDirectories
>>> Pos or UTF8Pos
>>> etc.
>>>
>>> It depends on which part of the code you are in...
>>
>>
>> Such problems may (should) go away with the new UnicodeString and
>> AnsiString types, where AnsiString contains an Encoding field. Then
>> conversions between UTF-8 and the system codepage are done automatically
>> whenever required, and the xyUTF8 functions can be dropped.
>>
>> I discourage the use of UTF8Pos, in particular together with the new
>> (encoded) AnsiString type. When such a string is auto-converted for some
>> reason, the index returned by UTF8Pos becomes invalid. This is one of the
>> downsides of encoded strings, which suggests using UnicodeString in future
>> code. Delphi enforced that move by changing String and Char to
>> UnicodeString and WideChar, and Delphi compatibility propagated that
>> pressure into FPC. The continued use of UTF-8 strings (AnsiString) will
>> result in a speed and memory usage penalty, unless the system codepage is
>> UTF-8. If your code only contains String type strings, not AnsiString or
>> UTF8String, then all your strings will become UnicodeStrings (UTF-16), for
>> which the xyUTF8 functions are either inapplicable or result only in
>> superfluous implicit string conversions.
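
The index problem described above does not depend on Pascal, so it can be
shown with plain encoding arithmetic. The following is a minimal sketch in
Python, not FPC code: the helper names are hypothetical, they do not model
UTF8Pos or Pos, and the offsets are 0-based whereas Pascal string indexes
are 1-based. It only shows that the "position" of the same substring
depends on which unit (code point, UTF-8 byte, UTF-16 code unit) is counted.

```python
def codepoint_index(text: str, needle: str) -> int:
    """0-based offset of needle counted in Unicode code points (-1 if absent)."""
    return text.find(needle)


def utf8_byte_index(text: str, needle: str) -> int:
    """0-based offset of needle counted in UTF-8 bytes (-1 if absent)."""
    i = text.find(needle)
    if i < 0:
        return -1
    # Every code point before the match contributes 1 to 4 bytes.
    return len(text[:i].encode("utf-8"))


def utf16_unit_index(text: str, needle: str) -> int:
    """0-based offset of needle counted in 16-bit UTF-16 code units (-1 if absent)."""
    i = text.find(needle)
    if i < 0:
        return -1
    # Every code point before the match contributes 1 or 2 code units.
    return len(text[:i].encode("utf-16-le")) // 2


if __name__ == "__main__":
    sample = "na\u00efve caf\u00e9"   # contains two non-ASCII BMP characters
    needle = "caf\u00e9"
    for fn in (codepoint_index, utf8_byte_index, utf16_unit_index):
        print(fn.__name__, fn(sample, needle))
```

An index obtained under one encoding is therefore only meaningful as long as
the string stays in that encoding; after a conversion it has to be
recomputed rather than reused.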
>>
>> Now every user has the choice to stay with a specific FPC/Lazarus version
>> that does not yet support the new string types, or to drop UTF-8 strings in
>> favor of the new UTF-16 strings. Since most code has to deal with the
>> Unicode BMP (Basic Multilingual Plane) only, the difference between the
>> length of a string in (UTF-8) chars and in characters goes away with
>> UTF-16. Do you really see a need for finding the position of a non-BMP
>> character in a string, and changing exactly that character in the string?
>> Then you are on the safe side by using StringReplace, which already worked
>> with UTF-8 and will continue to work with UTF-16 and whatever other
>> encoding. The use of Char variables has been dangerous already with UTF-8,
>> where exotic ("astral") characters consist of 4 bytes. In that respect I
>> don't understand why Delphi now uses WideChar for Char, instead of
>> UnicodeChar, where it is guaranteed that every codepoint (except ligatures
>> and similar text-processing stuff) can be stored in a UnicodeChar variable.
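
To make the size argument concrete, the sketch below counts how many UTF-8
bytes and how many 16-bit UTF-16 code units a single code point needs. It is
again plain Python used for illustration only, with hypothetical helper
names, and it assumes the current UTF-8 definition (RFC 3629), under which a
code point takes at most 4 bytes.

```python
from typing import Tuple


def encoded_sizes(ch: str) -> Tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 code units) needed for one code point."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


def fits_in_one_utf16_unit(ch: str) -> bool:
    """True for BMP code points, which need no surrogate pair in UTF-16."""
    return ord(ch) <= 0xFFFF


if __name__ == "__main__":
    # ASCII letter, Latin-1 letter, euro sign (all BMP), and a non-BMP symbol.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
        print(hex(ord(ch)), encoded_sizes(ch), fits_in_one_utf16_unit(ch))
```

A BMP character always fits into one 16-bit unit, while a non-BMP character
needs two units (a surrogate pair), so a single 16-bit character variable
cannot hold it.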
>
> When will the new UnicodeString and AnsiString types (the ones with an
> Encoding field) arrive for us, the users of FPC 2.6.1? Is this done?

This is part of 2.7.1 and will become part of the next major release.

Regards,
Sven




