[Lazarus] Lazarus (UTF8) and Windows: SysToUTF8, UTF8ToSys... Is there a better solution?

Hans-Peter Diettrich DrDiettrich1 at aol.com
Thu Dec 26 01:39:50 CET 2013


Sven Barth wrote:

> If in 2.6.2 your three strings contain text of different encodings then 
> the resulting string might be garbage from the user's POV.
> In trunk the encoding is part of each string and if they differ then 
> each string will be converted to the default string encoding (defined 
> by a global variable inside unit System) and thus the string might still 
> be valid.

If so, this flaw should be fixed immediately. Delphi uses lossless 
conversions, i.e. an up-cast to Unicode.

BTW the use of RawByteString variables or parameters *in Delphi* can 
result in stored strings of an encoding that doesn't match the 
declaration of the target variable. This in turn can confuse the 
compiler, when two such strings of the same declared (static) encoding, 
but of different actual (dynamic) encoding, are simply appended without 
further checks/conversions.

Such problems can be avoided by making RawByteString compiler magic 
that enforces a Unicode conversion whenever AnsiStrings of different 
dynamic encodings have to be combined.
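
To illustrate the difference between a blind append and a checked one, here 
is a minimal sketch in Python rather than Pascal. It is only a model, not 
FPC or Delphi code: TaggedString, naive_concat and checked_concat are 
hypothetical names, and UTF-8 is assumed as the common target encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical stand-in for an AnsiString: raw bytes plus a dynamic encoding."""
    data: bytes
    codec: str


def naive_concat(a: TaggedString, b: TaggedString) -> TaggedString:
    # Appends the payloads without checking the dynamic encodings;
    # the result simply keeps the tag of the first operand.
    return TaggedString(a.data + b.data, a.codec)


def checked_concat(a: TaggedString, b: TaggedString,
                   target: str = "utf-8") -> TaggedString:
    # Same dynamic encoding: a plain append is safe.
    if a.codec == b.codec:
        return TaggedString(a.data + b.data, a.codec)
    # Different dynamic encodings: up-cast both to Unicode, then
    # re-encode into the assumed common target (UTF-8 here).
    text = a.data.decode(a.codec) + b.data.decode(b.codec)
    return TaggedString(text.encode(target), target)


western = TaggedString("Grüße ".encode("cp1252"), "cp1252")
cyrillic = TaggedString("Привет".encode("cp1251"), "cp1251")

garbled = naive_concat(western, cyrillic)
correct = checked_concat(western, cyrillic)

print(garbled.data.decode(garbled.codec))  # Cyrillic part shows up as Latin letters
print(correct.data.decode(correct.codec))  # both parts intact
```

The naive version corresponds to the "simply appended" case described above; 
the checked version is what the compiler would have to insert.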

Furthermore the use of UTF-8 will allow for lossless conversions of 
AnsiStrings of any encoding, with the result still being an AnsiString. 
Here Delphi has the problem that a RawByteString result type requires a 
conversion of an intermediate Unicode string (UTF-16) into an 
AnsiString(CP_ACP), with possible losses. This is not required when FPC 
treats UTF-8 as a fully supported encoding, in addition to CP_ACP - it 
would also be a strong argument for using UTF-8 for UnicodeString, *instead* 
of UTF-16. The related functions already exist in the FPC libraries, 
they only have to take precedence over CP_ACP (if different). Then 
additional UTF-8/16 conversions are required only on Windows, when 
calling external (API...) functions which expect/return WideStrings.
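
The loss in question can be shown with a small round trip. Again a Python 
sketch, not FPC code; it assumes Windows-1252 as a stand-in for CP_ACP and 
uses "?" substitution for unmappable characters, which may differ in detail 
from what a given platform does.

```python
ACP = "cp1252"  # assumption: stand-in for the system's CP_ACP

text = "Grüße Привет"  # mixes Western European and Cyrillic characters

# Unicode -> AnsiString(CP_ACP) -> Unicode: characters outside the
# code page cannot be represented and are replaced.
via_acp = text.encode(ACP, errors="replace").decode(ACP)

# Unicode -> UTF-8 -> Unicode: every character survives.
via_utf8 = text.encode("utf-8").decode("utf-8")

print(via_acp)
print(via_utf8)
```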


Conclusion:

FPC can treat RawByteString as *the one and only* string type with a 
variable dynamic encoding. Procedures accepting RawByteString arguments 
either retain the dynamic encoding of these strings, or convert 
parameters of different encoding into UTF-8. A conversion back to a 
different encoding may be required *only* when a RawByteString is 
assigned to a variable or parameter in another subroutine call.

There remains one problem with empty strings, whose declared encoding 
cannot be determined at runtime in the Delphi model, because empty 
strings are represented by nil pointers. I can imagine two workarounds: 
add an Encoding field to every string variable, or make empty 
strings point to a string constant of their static encoding.
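
A rough model of the problem and of the second workaround, sketched in 
Python. StrRec, dynamic_codec and empty_of are hypothetical names; None 
plays the role of the nil pointer, and nothing here reflects the actual 
layout of FPC or Delphi strings.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class StrRec:
    """Hypothetical string record: payload plus its encoding."""
    data: bytes
    codec: str


def dynamic_codec(s: Optional[StrRec]) -> Optional[str]:
    # Nil-pointer model: an empty string has no record to inspect,
    # so its encoding cannot be recovered at runtime.
    return None if s is None else s.codec


# Workaround: one shared empty-string constant per static encoding,
# so that even an empty string points at a record carrying its encoding.
_EMPTY: Dict[str, StrRec] = {}


def empty_of(codec: str) -> StrRec:
    return _EMPTY.setdefault(codec, StrRec(b"", codec))


print(dynamic_codec(None))               # encoding unknown
print(dynamic_codec(empty_of("utf-8")))  # encoding still available
```

The first workaround would instead store the encoding next to the pointer in 
every variable, at the cost of a larger string variable.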


Alternatively, typed AnsiStrings and RawByteString can be dropped, so 
that every AnsiString variable or parameter can have any dynamic 
encoding (equivalent to RawByteString), with the preferred encoding being 
UTF-8. This would allow Lazarus and other existing code to stay 
unmodified: any string conversions that become necessary can be inserted 
by the compiler, and the then-obsolete UTF8... functions can be dropped.

DoDi




