[Lazarus] TProcess, UTF8, Windows

Mattias Gaertner nc-gaertnma at netcologne.de
Sun Apr 15 11:32:21 CEST 2012


On Sun, 15 Apr 2012 07:16:15 +0200
Martin Schreiber <mse00000 at gmail.com> wrote:

> On Saturday 14 April 2012 22:36:02 Marcos Douglas wrote:
> >
> > Well, it works if I change this line:
> >   fname := 'c:\á b ç\á.txt';
> > to this:
> >   fname := UTF8Decode('c:\á b ç\á.txt');
> >
> > And it doesn't matter if fname is UnicodeString or string -- well, the
> > debug hint for 'UnicodeString' is nicer than the one for 'string' because
> > the compiler translates it.
> >
> Add {$codepage utf8} to the unit header or compile with -Fcutf8; this is the
> default setting in MSEide+MSEgui for automatic Unicode handling.
> Warning: most likely this setting will break Lazarus on FPC 2.6.0. I don't
> know if Lazarus has been fully tested with cpstrnew and -Fcutf8 yet.

cpstrnew is part of FPC 2.7.1, and Lazarus has been running fine with it for
two months now.

The -Fcutf8 switch and the {$codepage utf8} directive have existed for ages.

There are some traps with -Fcutf8 and {$codepage utf8}. They only work as
expected if the RTL's DefaultSystemCodePage is CP_UTF8. Otherwise your
strings are converted.
For example, under Linux the RTL default is CP_ACP, which in turn defaults to
ISO-8859-1. The RTL does *not* read your environment language on its own, so
the default stays ISO-8859-1. This means your UTF-8 string constants get
converted by the compiler:

Compile this with -Fcutf8 and run it on a Linux system where LANG is set to a UTF-8 locale:

program project1;
{$mode objfpc}{$H+}
begin
  writeln(DefaultSystemCodePage,' ',CP_UTF8);
  writeln('ä');
end.

This results in
0 65001
ä
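
For completeness, here is the same program with the codepage given in the
source instead of on the command line. This is only a sketch: per the quoted
message above, {$codepage utf8} and -Fcutf8 are equivalent, so on a bare
Linux RTL it should print the same 0 65001 (DefaultSystemCodePage is still
not UTF-8, so the constant still gets converted):

program project1;
{$mode objfpc}{$H+}
{$codepage utf8}
begin
  writeln(DefaultSystemCodePage,' ',CP_UTF8);
  writeln('ä');
end.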


The LCL uses a widestringmanager (at the moment cwstring), which sets the
DefaultSystemCodePage. You can do the same in your non-LCL programs:

program project1;
{$mode objfpc}{$H+}
uses cwstring;
begin
  writeln(DefaultSystemCodePage,' ',CP_UTF8);
  writeln('ä');
end.

This results in
65001 65001
ä


The above is a lie-to-children.
See the program below:

program project1;
{$mode objfpc}{$H+}
uses cwstring;
var
  a,b,c: string;
begin
  writeln(ord(DefaultSystemCodePage),' ',CP_UTF8);
  a:='ä'; b:='='#$C3#$A4; // #$C3#$A4 is UTF-8 for ä
  c:=       'ä='#$C3#$A4;
  writeln(a,b); // writes ä=ä
  writeln(c);   // does NOT write ä=ä (see below)
end.

You can see that a UTF-8 string constant works, and a string constant built
from UTF-8 byte codes works too, but the combination does not work. The above
was compiled with -Fcutf8 and uses cwstring to set the DefaultSystemCodePage
to CP_UTF8. So what went wrong?
The compiler treats any non-ASCII string constant (here: the ä) as a
widestring, and its literal characters must fit into UCS-2, not full UTF-16.

You cannot fool the compiler with 'ä='+#$C3#$A4 either. You must define two
separate string constants.
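
A minimal sketch of that workaround (untested, same setup as the program
above: compiled with -Fcutf8 and using cwstring) -- keep the UTF-8 literal
and the byte codes in two separate constants and join them at run time:

program project1;
{$mode objfpc}{$H+}
uses cwstring;
var
  a, b, c: string;
begin
  a := 'ä=';     // non-ASCII constant, treated as a widestring by the compiler
  b := #$C3#$A4; // pure byte constant, already the UTF-8 bytes for ä
  c := a + b;    // joined at run time, when both are UTF-8 ansistrings
  writeln(c);    // should write ä=ä
end.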

Using any character outside the UCS-2 range in such a constant results in
Fatal: illegal character "'�'" ($F0)
($F0 is the lead byte of a four-byte UTF-8 sequence, i.e. a code point
outside the BMP.)

You can specify such characters with UTF-16 surrogate codes instead:
#$D834#$DD1E (that is U+1D11E).
Yes, you read that right: specifying the codepage with -Fcutf8 or
{$codepage utf8} actually defines a mix of UTF-8 and UTF-16.
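
For illustration, a minimal sketch (untested, same setup as above: -Fcutf8
plus cwstring) that embeds the character U+1D11E via that surrogate pair,
since pasting it literally into the source triggers the fatal error above:

program project1;
{$mode objfpc}{$H+}
uses cwstring;
var
  s: string;
begin
  s := #$D834#$DD1E; // UTF-16 surrogate pair for U+1D11E (musical G clef)
  writeln(s);        // on a UTF-8 terminal this should come out as the clef
end.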


Now compile the a/b/c program above again, this time without -Fcutf8:

65001 65001
ä=ä
ä=ä

Wow, everything looks as expected. You can even mix ASCII and non-ASCII
string constants.
Without the codepage flag the compiler stores string constants as plain byte
sequences, and that is exactly what UTF-8 is.

That's why LCL applications do not use the codepage flags.
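
For contrast, a minimal sketch of that LCL-style convention (no -Fcutf8, no
{$codepage}, source file saved as UTF-8): plain strings simply hold UTF-8
bytes, and UTF8Decode -- the same call as in the quoted message at the top --
is used only where a UTF-16 string is actually needed:

program project1;
{$mode objfpc}{$H+}
var
  fname: string;     // plain string holding raw UTF-8 bytes
  wname: WideString;
begin
  fname := 'c:\á b ç\á.txt';  // stored as the byte sequence from the source
  wname := UTF8Decode(fname); // explicit conversion where UTF-16 is required
  writeln(fname);
  writeln(length(fname),' ',length(wname)); // byte count vs. UTF-16 unit count
end.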

MSEgui has implemented an ecosystem of widestrings, so it works better with a codepage.


Mattias



