[Lazarus] UTF8 RTL for Windows

Tue Nov 25 14:23:59 CET 2014

On Tue, 25 Nov 2014 13:10:26 +0100
Hans-Peter Diettrich <DrDiettrich1 at aol.com> wrote:

>[...]
> > Maybe I don't understand the question, but it seems to me this is
> > documented where static-, dynamic cp and rawbytestring are explained.
> 
> More concrete questions:
> 
> How can a user be sure that a string parameter in a subroutine has the 
> specified encoding?
> How to check, how to fix if needed?

As you know, in general you cannot find out the encoding of a text. You
have to trust that the caller gave the right encoding.
This was true before 2.7.1 and it still is.
The new thing with 2.7.1 is that String now has an encoding field and
that you can use this to let the compiler convert encodings
automatically.
For example the RTL uses this to convert between OS strings and program
strings. This means some RTL functions don't need manual encoding
conversions (e.g. UTF8ToAnsi) anymore. You can simply pass the string.
Hopefully more and more RTL functions/variables will be converted.

In short: Most of the time you code exactly like before.

If your code works with various encodings, then formerly you had to be
very careful about what you did with the strings. For example, when you
passed the strings to the RTL you had to convert them to the system
codepage.
Now you can, for instance, use UTF8String instead and omit the
UTF8ToAnsi. It is like gaining some type safety.
And you can now use SetCodePage. But then you have to be very careful
again.
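
To make the "how to check, how to fix" part concrete, here is a minimal
sketch, assuming FPC 2.7.1 or newer. The helper name Report and the
target code page 1252 are only illustrative choices of mine, and on Unix
a widestring manager (e.g. the cwstring unit) is needed for real
conversions. It is not the only way to do it.

```pascal
program CheckEncoding;
{$mode objfpc}{$H+}

// Hypothetical helper: print the dynamic code page of any ansistring.
// A RawByteString parameter accepts the string without converting it.
procedure Report(const Name: string; const S: RawByteString);
begin
  WriteLn(Name, ': dynamic code page = ', StringCodePage(S));
end;

var
  U: UTF8String;
  R: RawByteString;
begin
  U := 'abc';
  Report('U', U);

  // Assigning to a RawByteString keeps bytes and code page as they are.
  R := U;
  // Convert = True re-encodes the bytes and updates the code page field.
  // Convert = False would only relabel the bytes, which is the case
  // where you have to be very careful.
  SetCodePage(R, 1252, True);
  Report('R', R);
end.
```

StringCodePage only tells you what the string claims to be. It cannot
verify that the bytes really are in that encoding.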

> > http://wiki.freepascal.org/FPC_Unicode_support#Ansistring
> > 
> > When a procedure requires a specific encoding it uses a specific String
> > type. If it works with CP_ACP it uses "String". If it needs UTF8 it
> > uses UTF8String.
> 
> Such specifications are meaningless when the string parameters can have 
> a different dynamic encoding :-(

Please read the paragraph "Dynamic code page" again. The example it
describes is the most common case: the system code page. This is
the same as in FPC 2.6.5 and below. A String coming from the OS has
the system code page, which is dynamic. If you wanted a specific
encoding you had to convert it.
With FPC 2.7.1 we have a new possibility. This is the new mode I was
talking about. Now we get UTF-8 strings in many places in the RTL. Not
all places yet. But we are working on it. And you can help.

> Unicode Delphi works well as long as only one codepage (CP_ACP) is used, 
> in addition to Unicode (UTF-16) strings. As soon as multiple codepages 
> can be involved at the same time, the dynamic string encodings become 
> almost random (observed in Delphi XE). FPC now already has multiple 
> built-in codepage variables (DefaultSystemCodePage...), with possibly 
> different values, so that the observed Delphi mess is inevitable, as 
> long as RawByteString results (of e.g. standard stringhandling 
> functions) are *not* converted when assigned to a string variable of 
> some specific static encoding.

Well, two weeks ago I was rolling my eyes when I read about this
complex system and DefaultSystemCodePage. But then I tried setting it,
and now we can use one String encoding cross platform and it works
with file functions, TStringList and friends. Almost all of the
UTF8ToSys calls are no longer needed and file functions now support
full Unicode.
We can write a Unicode program cross platform using our normal strings
and classes.
And it is pretty compatible.
So from the Lazarus point of view this is a great step forward.
And last but not least: it is optional.
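
A rough sketch of what "setting it" can look like, assuming FPC 2.7.1
or newer and assuming UTF-8 is the encoding you want for String. This
is an illustration, not necessarily the exact code Lazarus uses. On Unix
you would also need a widestring manager such as the cwstring unit.

```pascal
program Utf8Default;
{$mode objfpc}{$H+}

begin
  WriteLn('before: ', DefaultSystemCodePage);

  // Make CP_ACP strings (plain String/AnsiString) mean UTF-8.
  SetMultiByteConversionCodePage(CP_UTF8);
  // Let the RTL file functions interpret file names as UTF-8 too.
  SetMultiByteRTLFileSystemCodePage(CP_UTF8);

  WriteLn('after: ', DefaultSystemCodePage);
end.
```

This has to run before any strings are created or passed to the RTL,
otherwise existing CP_ACP strings change their meaning.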

Of course, if you have a product and you have to support all old modes
and some of the new possibilities, you will curse.

> Unfortunately I have not been able to test Lazarus trunk for a long
> time, no answer on my request for assistance.

?

Mattias