[Lazarus] Feature Request: Insert {codepage UTF8} per default

Thu Mar 31 14:04:53 CEST 2016

On Wed, 30 Mar 2016 18:16:32 +0200
Bart <bartjunk64 at gmail.com> wrote:

>[...]
> > Any valid UTF-8 string should work, including diacritics.
> Without the codepage identifier?

Yes, if you use LazUTF8.
If you don't use LazUTF8 and assign a literal to a UnicodeString, you
need the {$codepage} directive.
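
A minimal sketch of that second case, assuming FPC 3.0 or later and a
source file saved as UTF-8; the program name is only illustrative:

```pascal
program UnicodeLiteralDemo;  // hypothetical name, for illustration only
{$mode objfpc}{$H+}
{$codepage utf8}  // tell the compiler the source file is UTF-8

var
  u: UnicodeString;
begin
  // With the directive the compiler knows how to decode the literal
  // and can store it as UTF-16 for the UnicodeString.
  u := 'äöü';
  // Length counts UTF-16 code units here, not bytes.
  WriteLn('Length(u) = ', Length(u));
end.
```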

> Quote from http://wiki.freepascal.org/FPC_Unicode_support#String_constants:
> "Normally, a string constant is interpreted according to the source
> file codepage. If the source file codepage is CP_ACP, a default is
> used instead: in that case, during conversions the constant strings
> are assumed to have code page 28591 (ISO 8859-1 Latin 1; Western
> European). "

AFAIK this is not entirely correct. The string literals are assumed
to be in the system codepage, which is not necessarily code page 28591.
I will ask on the fpc list.

> ...
> "From the above it follows that to ensure predictable interpretation
> of string constants in your source code, it is best to either include
> an explicit {$codepage xxx} directive (or use the equivalent -Fc
> command line option), or to save the source code in UTF-8 with a BOM.
> "
> 
> AFAIK the IDE does not save the file with a BOM, so the compiler may
> very well decide that my sourcefile has ACP codepage?

Yes and no.
When the compiler assumes ACP, it treats the string specially. It does
not convert it and stores it as a byte copy. At runtime the string has
CP_ACP and its codepage is defined by the variable
DefaultSystemCodePage. LazUTF8 sets this to CP_UTF8, so the string is
treated as UTF-8. Note that this happens without any conversion.

OTOH when you tell the compiler that the source is UTF-8, it converts
the literal to UTF-16. At runtime it converts the string back to UTF-8.
It does that every time you assign the literal.

So in both cases you get a UTF-8 string, but the latter has a bit more
overhead. The latter also needs special care when typecasting (e.g.
to PChar).
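
A rough sketch of the first case (no directive, no BOM), assuming FPC 3.0
or later and a source file saved as UTF-8 without BOM. The call to
SetMultiByteConversionCodePage is my stand-in for what LazUTF8 does at
startup, so that the example does not depend on the Lazarus packages;
treat that equivalence as an assumption. The program name is only
illustrative:

```pascal
program AcpLiteralDemo;  // hypothetical name, for illustration only
{$mode objfpc}{$H+}
// Deliberately no {$codepage} directive: the compiler assumes CP_ACP
// and copies the bytes of string literals unchanged.

var
  s: AnsiString;
begin
  // Assumption: this approximates the effect of using LazUTF8, which
  // sets DefaultSystemCodePage to CP_UTF8.
  SetMultiByteConversionCodePage(CP_UTF8);

  s := 'äöü';  // stored as the raw UTF-8 bytes from the source file

  WriteLn('DefaultSystemCodePage = ', DefaultSystemCodePage);
  WriteLn('StringCodePage(s)     = ', StringCodePage(s));
  WriteLn('Length(s) in bytes    = ', Length(s));
end.
```

Adding {$codepage utf8} to the same program and comparing the printed
values is one way to see the difference between the two cases on a
given system.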

>[...] Consider this test sourcefile (encoded as UTF8 without BOM):
>[...]
> DefaultSystemcodePage = 1252
>[...]
> I would say that this experiment contradicts the statement in
> http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals
> ?

No contradiction, because that wiki page describes the case where
DefaultSystemCodePage = CP_UTF8.

Mattias