[Lazarus] TProcess, UTF8, Windows
Mattias Gaertner
nc-gaertnma at netcologne.de
Sun Apr 15 11:32:21 CEST 2012
On Sun, 15 Apr 2012 07:16:15 +0200
Martin Schreiber <mse00000 at gmail.com> wrote:
> On Saturday 14 April 2012 22:36:02 Marcos Douglas wrote:
> >
> > Well, works if I change this line:
> > fname := 'c:\á b ç\á.txt';
> > to this:
> > fname := UTF8Decode('c:\á b ç\á.txt');
> >
> > And it doesn't matter if fname is UnicodeString or string -- well, the
> > debug hint for 'UnicodeString' is nicer than for 'string' because
> > the compiler translates it.
> >
> Add {$codepage utf8} to the unit header or compile with -Fcutf8, this is the
> default setting in MSEide+MSEgui for automatic Unicode handling.
> Warning: most likely this setting will break Lazarus on FPC 2.6.0. I don't
> know if Lazarus is fully tested with cpstrnew and -Fcutf8 already.
cpstrnew is part of FPC 2.7.1, and Lazarus has been running fine with it for
two months now.
-Fcutf8 and {$codepage utf8} have existed for ages.
There are some traps with -Fcutf8 and {$codepage utf8}. They only work if
the RTL's DefaultSystemCodePage is CP_UTF8. Otherwise your strings are
converted by the compiler.
For example, under Linux the RTL default is CP_ACP, which defaults to
ISO 8859-1. The RTL does *not* read your environment language on its own,
so the default stays ISO 8859-1. This means your UTF-8 string constants are
converted by the compiler.
Compile this with -Fcutf8 and run it on Linux with LANG set to a UTF-8 locale:
program project1;
{$mode objfpc}{$H+}
begin
  writeln(DefaultSystemCodePage, ' ', CP_UTF8);
  writeln('ä');
end.
This results in
0 65001
ä
The LCL uses a widestringmanager (at the moment cwstring), which sets
DefaultSystemCodePage. You can do the same in your non-LCL programs:
program project1;
{$mode objfpc}{$H+}
uses cwstring;
begin
  writeln(DefaultSystemCodePage, ' ', CP_UTF8);
  writeln('ä');
end.
This results in
65001 65001
ä
The above is a lie for kids.
See the program below:
program project1;
{$mode objfpc}{$H+}
uses cwstring;
var
  a, b, c: string;
begin
  writeln(ord(DefaultSystemCodePage), ' ', CP_UTF8);
  a := 'ä'; b := '='#$C3#$A4; // #$C3#$A4 is UTF-8 for ä
  c := 'ä='#$C3#$A4;
  writeln(a, b); // writes ä=ä
  writeln(c);    // writes ä=Ã¤
end.
You can see that a UTF-8 string constant works, and a string constant built
from UTF-8 byte codes works too, but the combination does not. The above was
compiled with -Fcutf8 and uses cwstring to set DefaultSystemCodePage to
CP_UTF8. So what went wrong?
The compiler treats any non-ASCII string constant (here: the ä) as a
widestring (UCS-2, not full UTF-16). The #$C3#$A4 codes therefore end up as
two widechars instead of two UTF-8 bytes.
You cannot fool the compiler with 'ä='+#$C3#$A4. You must define two
separate string constants.
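A minimal sketch of that workaround (the constant names are mine; compiled
with -Fcutf8 and cwstring like the program above): keep the real character
and the raw byte codes in separate constants and only bring them together at
run time, for example in the writeln call itself:
program project2;
{$mode objfpc}{$H+}
uses cwstring; // sets DefaultSystemCodePage to CP_UTF8
const
  cUmlaut = 'ä';         // non-ASCII literal: stored by the compiler as a widestring
  cBytes  = '='#$C3#$A4; // ASCII plus byte codes: stored as a plain byte sequence
begin
  // The two constants are only combined at run time, so the byte codes
  // are never absorbed into a widestring constant.
  writeln(cUmlaut, cBytes); // writes ä=ä
end.
This is essentially what the a and b variables in the program above already do.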
Using any character outside the UCS-2 range results in
Fatal: illegal character "'�'" ($F0)
You can specify them with UTF-16 surrogate codes: #$D834#$DD1E
Yes, you read that right: specifying the codepage with -Fcutf8 or
{$codepage utf8} actually gives you a mix of UTF-8 and UTF-16.
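For example (a small sketch of mine, not from the original example; U+1D11E,
the musical G clef, picked only for illustration), such a character has to be
written as its UTF-16 surrogate pair instead of pasting the 4-byte UTF-8
sequence into the source:
program project3;
{$mode objfpc}{$H+}
uses cwstring; // sets DefaultSystemCodePage to CP_UTF8
var
  s: string;
begin
  // Pasting the character itself into the source aborts with
  // "Fatal: illegal character", so it is written as a surrogate pair.
  s := #$D834#$DD1E; // U+1D11E as UTF-16 surrogate codes
  writeln(s);        // on a UTF-8 console this should print the G clef
end.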
Now compile the a,b,c program above without -Fcutf8:
65001 65001
ä=ä
ä=ä
Wow, everything looks as expected. You can even mix ASCII and non-ASCII
string constants.
Without the codepage the compiler stores string constants as plain byte
sequences, and that is exactly what UTF-8 is.
That's why the LCL applications do not use the codepage flags.
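A small illustrative sketch of that point (my own example, compiled without
-Fcutf8, source file saved as UTF-8): the constant is nothing more than its
UTF-8 bytes, so Length counts bytes and indexing returns the raw byte values.
program project4;
{$mode objfpc}{$H+}
var
  s: string;
begin
  s := 'ä'; // without -Fcutf8 this is just the two source bytes $C3 $A4
  writeln(Length(s));                 // writes 2 (bytes, not characters)
  writeln(Ord(s[1]), ' ', Ord(s[2])); // writes 195 164
end.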
MSEgui has implemented an ecosystem of widestrings, so it works better with
the codepage switch.
Mattias