[Lazarus] Unicode on Windows
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Tue Apr 10 01:17:05 CEST 2012
Marcos Douglas wrote:
> I still think about:
> DirectoryExists or DirectoryExistsUTF8
> ForceDirectoriesUTF8 or ForceDirectories
> Pos or UTF8Pos
> etc
>
> Depends on what part of the code you are in...
Such problems may (and should) go away with the new UnicodeString and
AnsiString types, where AnsiString carries an encoding field. The
conversions between UTF-8 and the system codepage are then done
automatically whenever required, and the xyUTF8 functions can be dropped.
I discourage the use of UTF8Pos, particularly together with the new
(encoded) AnsiString type. When such a string is auto-converted for some
reason, the index returned by UTF8Pos becomes invalid. This is one of the
downsides of encoded strings, and it suggests using UnicodeString in future
code. Delphi forced that move by changing String and Char to UnicodeString
and WideChar, and Delphi compatibility propagated that pressure into FPC.
The continued use of UTF-8 strings (AnsiString) will carry a speed and
memory usage penalty unless the system codepage is UTF-8. If your code
contains only String type strings, not AnsiString or UTF8String, then all
your strings will become UnicodeStrings (UTF-16), for which the xyUTF8
functions are either inapplicable or result only in superfluous implicit
string conversions.
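As a hypothetical illustration (assuming a UTF-8 encoded source file whose
string literals are taken as raw UTF-8 bytes, as Lazarus code conventionally
does): an index computed against one byte representation means nothing in
another.

  program IndexDemo;
  {$mode objfpc}{$H+}
  var
    S: string;       // UTF-8 encoded, as the LCL expects
    P: SizeInt;
  begin
    S := 'Grüße';        // 'ü' and 'ß' take two bytes each in UTF-8
    P := Pos('ß', S);    // byte index 5 into the UTF-8 data
    // If S is auto-converted to e.g. CP1252, every character becomes one
    // byte and index 5 then points at 'e', not 'ß': the stored position is
    // only valid for the representation it was computed against.
    WriteLn(P);
  end.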
Now every user has the choice to stay with a specific FPC/Lazarus version
that does not yet support the new string types, or to drop UTF-8 strings in
favor of the new UTF-16 strings. Since most code only has to deal with the
Unicode BMP (Basic Multilingual Plane), the difference between the length
of a string in (UTF-8) chars and in characters goes away with UTF-16. Do
you really see a need for finding the position of a non-BMP character in a
string and for changing exactly that character? Even then you are on the
safe side with StringReplace, which already works with UTF-8 and will
continue to work with UTF-16 and any other encoding. The use of Char
variables was dangerous already with UTF-8, where exotic ("astral")
characters take four bytes. That is why I don't understand why Delphi now
uses WideChar for Char instead of a full-codepoint UnicodeChar, where it
would be guaranteed that every codepoint (except ligatures and similar
text-processing constructs) fits into a single Char variable.
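To make the length argument concrete, a sketch under the same assumptions
(UTF-8 source file, raw literals): for BMP-only text a UTF-16 string has
exactly one code unit per character, while the UTF-8 byte count differs,
and StringReplace stays encoding-agnostic because it matches whole
substrings.

  program LengthDemo;
  {$mode objfpc}{$H+}
  uses SysUtils;
  var
    U8:  UTF8String;
    U16: UnicodeString;
  begin
    U8  := 'Grüße';               // Length counts bytes: 7
    U16 := UTF8Decode('Grüße');   // Length counts UTF-16 code units: 5
    WriteLn(Length(U8), ' ', Length(U16));
    // StringReplace matches whole substrings, so it behaves the same way
    // whether the data is UTF-8 or UTF-16 encoded:
    WriteLn(StringReplace('Grüße', 'ß', 'ss', [rfReplaceAll]));
  end.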
DoDi