[Lazarus] UTF8 RTL for Windows
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Tue Nov 25 20:45:14 CET 2014
Mattias Gaertner wrote:
> On Tue, 25 Nov 2014 11:53:00 +0100
> Hans-Peter Diettrich <DrDiettrich1 at aol.com> wrote:
>
>> [...]
>>> Correction: *This* Char type needs to be extended.
>> Please specify.
>
> The ThousandSeparator type is "Char", which does not work with
> Russian in UTF-8. Well, at least if you want the non breakable space
> instead of the normal space.
> There are many cases where Char is enough.
You admit that there exist cases where Char is not enough :-]
>>> There is a Pos overload for strings. Where is the flaw in Pos?
>> The flaw is the added overload with a Char parameter.
>
> I use that a lot. It is faster than the string variant.
> Why is that a flaw?
When working with SBCS you can assume that a Char can hold any entire
character. This is not true with MBCS, like UTF-8.
With CP_ACP set to UTF-8 you cannot assign 'ä' to a Char and then search for
it. Depending on your exact code, the compiler may not detect that
this assignment is invalid because it stores only *part* of a
multibyte sequence. A subsequent Pos with that partial character will
not always yield the *expected* result; it might find an 'ö' or 'ü' as
well. In detail, that Pos overload has no indication of the codepage of
the Char, and consequently cannot enforce a possibly required
conversion to the encoding of the string parameter. The same
considerations apply to any StringReplace (or similar) overloads.
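The partial-character match can be reproduced outside Pascal. The following is only a sketch in Python, where a bytes object stands in for a UTF-8 encoded AnsiString and a single byte stands in for a one-byte Char. It assumes the Char ends up holding the first byte of the sequence, and the sample word is arbitrary. Note that find is 0-based, while Pos is 1-based.

```python
# Illustration only: bytes model a UTF-8 AnsiString, one byte models a Char.
text = "größer".encode("utf-8")   # contains 'ö' and 'ß', but no 'ä'
needle = "ä".encode("utf-8")      # b'\xc3\xa4', a two-byte sequence

# Assumption: a one-byte Char keeps only the lead byte of the sequence.
partial = needle[:1]              # b'\xc3'

# The truncated "character" is found although the text has no 'ä':
# the hit is the lead byte that 'ö' shares with 'ä'.
false_hit = text.find(partial)

# Searching for the complete sequence correctly reports "not found" (-1).
real_hit = text.find(needle)

print(false_hit, real_hit)        # 2 -1
```

The lead byte 0xC3 is shared by 'ä', 'ö', 'ü' and 'ß' in UTF-8, which is why a search for it cannot tell them apart.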
Delphi users may think, like you, that a Char is sufficient in such
cases. They are right insofar as in Unicode Delphi a Char is a WideChar
and a String is a UnicodeString, so that such optimizations work with BMP
characters. [Users of MBCS/non-BMP character sets already know that Char
is quite useless for text processing]
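As a rough illustration of the BMP limit, the sketch below counts the 16-bit units that a character needs in UTF-16, which is what a WideChar can hold one of. The helper name utf16_units and the sample characters are chosen for this example only.

```python
def utf16_units(s):
    """Return the number of 16-bit code units s occupies in UTF-16."""
    return len(s.encode("utf-16-le")) // 2

bmp = "ä"                  # U+00E4, inside the BMP
non_bmp = "\U0001D11E"     # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

# A BMP character fits into one 16-bit unit, a non-BMP character
# needs a surrogate pair, i.e. two units.
print(utf16_units(bmp), utf16_units(non_bmp))                     # 1 2

# In UTF-8 neither of them fits into a single byte.
print(len(bmp.encode("utf-8")), len(non_bmp.encode("utf-8")))     # 2 4
```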
But when such code is compiled with FPC/Lazarus and the new RTL, where String
is AnsiString and the default encoding is UTF-8, the same code will not
work properly. That's why I consider Char (=AnsiChar) dangerous in the
new RTL: it causes obscure program errors.
Removing Char, perhaps in some special compiler mode, would make it possible to
identify all *possibly* wrong uses of the *generic* Char. Then the code
can be fixed in various ways, e.g. by replacing Char with WideChar or
UCS4Char (4 bytes), removing overloads with Char parameters, or
whatever else will prevent inadvertent misuse of constants, variables,
fields or parameters of Char type.
Please note that Delphi compatibility is not a valid argument as long
as FPC/Lazarus differs in the declaration of the generic String and Char
types. That's why Delphi made the Unicode move in *one* step, retyping
both String and Char at the same time, and (effectively) deprecating
AnsiString. This at least keeps legacy code applicable to BMP
characters, where WideChar is sufficient to hold any character value, and
legacy MBCS code continues working without surprises.
>> Furthermore the Pos arguments should never be subject to automatic
>> conversion, otherwise the returned index will be useless.
>
> You can argue the same way in the direction: If it does not
> automatically convert it will find crap.
That's why the *original* declaration, with both parameters of type
String, *allows* all required conversions to be identified and performed. A
Char type, without an encoding indicator, prevents such checks and
conversions both at compiler level (in translating the call) and inside
the function code.
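Why the returned index depends on the encoding in which the search actually runs can be sketched as follows. This is only an illustration in Python: the bytes object models a UTF-8 string, and the str index models a search after conversion to UTF-16 (for BMP-only text the code point index equals the 16-bit unit index). Indexes here are 0-based, unlike Pos.

```python
s = "Grüße, Welt"
needle = "W"

# Search in the UTF-8 bytes: 'ü' and 'ß' each take two bytes.
idx_utf8 = s.encode("utf-8").find(needle.encode("utf-8"))

# Search after conversion to 16-bit units (BMP-only text, so the
# Python code point index equals the UTF-16 unit index).
idx_units = s.find(needle)

# The two results differ, so an index obtained after an implicit
# conversion cannot be used on the original UTF-8 string.
print(idx_utf8, idx_units)   # 9 7
```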
>>>> In the best case Char could be retyped into an string (substring),
>>> That would be wrong in 99.9% of the cases.
>> Please give at least one example.
>
> Retype "Char" to "String" and the compiler will bark. For example in
> Graphics.
How is *graphic* information related to *text*? Using Char for Byte,
only because using strings offers some coding comfort, is another flaw.
Delphi has long discouraged the use of strings for holding anything
but text. The continued abuse of strings for other types of
information will now result in errors whenever an (implicit) string
conversion occurs in some library routine, as can easily happen with
encoded AnsiStrings. Unfortunately Delphi missed the chance to simply
add an "unencoded" AnsiString encoding, which would prevent any
conversion of such string variables. The RawByteString type,
despite its name, was added for quite a different purpose, *not* as a
way to safely store arbitrary bytes in such strings.
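How a codepage conversion damages binary data kept in a string can be shown with a small Python sketch. It assumes the bytes are treated as Latin-1 text and then converted to UTF-8, a stand-in for whatever implicit conversion a library routine might perform. The sample bytes are the first four bytes of a PNG signature.

```python
# Binary data that was never meant to be text.
raw = bytes([0x89, 0x50, 0x4E, 0x47])

# Assumption: some routine treats the bytes as Latin-1 text
# and converts them to UTF-8.
as_text = raw.decode("latin-1")
converted = as_text.encode("utf-8")

# The byte 0x89 became the two-byte sequence C2 89, so the data
# is now one byte longer and no longer matches the original.
print(len(raw), len(converted))   # 4 5
print(converted == raw)           # False
```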
DoDi