[Lazarus] UTF8 RTL for Windows

Hans-Peter Diettrich DrDiettrich1 at aol.com
Tue Nov 25 21:39:20 CET 2014


Mattias Gaertner wrote:
> On Tue, 25 Nov 2014 13:10:26 +0100
> Hans-Peter Diettrich <DrDiettrich1 at aol.com> wrote:
> 
>> [...]
>>> Maybe I don't understand the question, but it seems to me this is
>>> documented where static-, dynamic cp and rawbytestring are explained.
>> More concrete questions:
>>
>> How can a user be sure that a string parameter in a subroutine has the 
>> specified encoding?
>> How to check, how to fix if needed?
> 
> As you know in general you cannot find out the encoding of a text. You
> have to trust that the caller gave the right encoding.
> This was true before 2.7.1 and it still is.
> The new thing with 2.7.1 is that String now has an encoding field and
> that you can use this to let the compiler convert encodings
> automatically.
> For example the RTL uses this to convert between OS strings and program
> strings. This means some RTL functions don't need manual encoding
> conversions (e.g. UTF8ToAnsi) anymore. You can simply pass the string.
> Hopefully more and more RTL functions/variables will be converted.
> 
> In short: Most of the time you code exactly like before.

Fully agreed so far :-]

> If your code works with various encodings, then formerly you had to be
> very careful what you do with the strings. For example when you pass
> the strings to the RTL you had to convert them to the system codepage.
> Now you can use for instance UTF8String instead and omit the
> UTF8ToAnsi. It is like gaining some type safety.

The Delphi model already broke that claimed type safety by omitting 
conversions of RawByteString results as a speed optimization. That's 
dangerous, because the compiler can check *only* the static type of 
string variables, not the dynamic encoding of their contents.
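
To illustrate the difference between static type and dynamic encoding, 
here is a minimal sketch that inspects the encoding field at run time. 
It assumes FPC 2.7.1 or later; IsTaggedUTF8 is a hypothetical helper 
name, not an RTL routine. The parameter is declared as RawByteString 
so that the argument should arrive without a conversion.

```pascal
program CheckDynamicCodePage;
{$mode objfpc}{$H+}

// Hypothetical helper: reports whether the contents of S are tagged
// as UTF-8 in the encoding field. This only reads the tag; it cannot
// prove that the bytes really are valid UTF-8.
function IsTaggedUTF8(const S: RawByteString): Boolean;
begin
  Result := StringCodePage(S) = CP_UTF8;
end;

var
  U: UTF8String;
begin
  U := 'abc';
  // The static type of U is UTF8String; the dynamic code page is
  // whatever the encoding field of the string data says.
  WriteLn('dynamic code page: ', StringCodePage(U));
  WriteLn('tagged as UTF-8:   ', IsTaggedUTF8(U));
end.
```

Such a check only tells what the string claims to be, so it can detect 
a mismatch with the static type but not wrongly tagged contents.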

> And you can now use SetCodePage. But then you have to be very careful
> again.

SetCodePage is safe as long as it enforces a corresponding conversion of 
the dynamic string encoding. The option of changing only the encoding 
field is reserved for adjustments after reading strings from external 
sources, or from Char, Char arrays/pointers or ShortString, where the 
correct codepage is unknown to the compiler and library routines.
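
The two uses of SetCodePage can be sketched as follows. This is only 
an illustration under assumptions: FPC 2.7.1 or later, a two-byte 
buffer standing in for data read from an external source, and code 
page 1252 picked arbitrarily as a conversion target. On Unix a 
widestring manager such as cwstring is assumed to be needed for the 
converting call.

```pascal
program SetCodePageModes;
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // assumption: provides the code page conversions on Unix
{$endif}

var
  Raw, Txt: RawByteString;
begin
  // Stand-in for bytes read from an external source. The programmer
  // knows they are UTF-8 (here: the encoding of U+00E4), the compiler
  // and the RTL do not.
  SetLength(Raw, 2);
  Raw[1] := #$C3;
  Raw[2] := #$A4;

  // Convert = False: only the encoding field is adjusted, the bytes
  // stay untouched. Correct only if the new tag matches the real
  // contents.
  SetCodePage(Raw, CP_UTF8, False);
  WriteLn('after retag:   cp=', StringCodePage(Raw), ' len=', Length(Raw));

  // Convert = True: the contents are converted from the current
  // dynamic code page to the requested one, and the tag is updated.
  Txt := Raw;
  SetCodePage(Txt, 1252, True);
  WriteLn('after convert: cp=', StringCodePage(Txt), ' len=', Length(Txt));
end.
```

The retagging call is the dangerous one: used on a string whose 
contents are not in the stated encoding, it makes every later 
automatic conversion start from a wrong assumption.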


>>> http://wiki.freepascal.org/FPC_Unicode_support#Ansistring
>>>
>>> When a procedure requires a specific encoding it uses a specific String
>>> type. If it works with CP_ACP it uses "String". If it needs UTF8 it
>>> uses UTF8String.
>> Such specifications are meaningless when the string parameters can have 
>> a different dynamic encoding :-(
> 
> Please read the paragraph "Dynamic code page" again.

Please read my statement again, you still miss my point.


> With FPC 2.7.1 we have a new possibility. This is the new mode I was
> talking about. Now we get UTF-8 strings in many places in the RTL. Not
> all places yet. But we are working on it. And you can help.

I'm trying to help all the time, but if you don't understand my 
arguments, I cannot help you :-(

I explored the encoded AnsiStrings in Delphi XE years ago, and 
identified a couple of problems with the Delphi implementation. I can 
help by explaining these problems, and how to avoid or reduce them 
in FPC/Lazarus. But the corresponding fixes to legacy code must be 
applied by the maintainers of that code, who know the *right* way 
(intended behaviour) to fix every single problem.


> Well, two weeks ago I was rolling my eyes when I read about this
> complex system and DefaultSystemCodePage. But then I tried to set it
> and now we can use one String encoding cross platform and it works
> with file functions, TStringList and friends. Almost all of the
> UTF8ToSys calls are no longer needed and file functions now support
> full Unicode.

This was clear to me right after exploring and understanding encoded 
strings in Delphi. In FPC/Lazarus we now have a *chance* for 
simplifications and improvements, when the new features are used in the 
*right* way. But many arguments and opinions presented in this thread 
indicate to me a still incomplete understanding and many 
misunderstandings, which I am currently trying to point out.

DoDi




