[Lazarus] Lazarus (UTF8) and Windows: SysToUTF8, UTF8ToSys... Is there a better solution?

Marco van de Voort marcov at stack.nl
Wed Dec 25 19:50:04 CET 2013


On Wed, Dec 25, 2013 at 04:34:40PM -0200, Marcos Douglas wrote:
> > There are many scenarios up in the sky, and nothing is 100% certain, but it
> > would at least be significantly better. It is already significantly better
> > in trunk.
> 
> When you say it is better in trunk, is that only in the FPC context, or are
> there improvements for Lazarus users too?

Functions like mkdir/findfirst/assign etc. are encoding-safe. The only
exception is the concatenation problem below.
 
> > The only problem on Windows is that you must only pass a string with a very
> > clear encoding to a RTL function.
> >
> > so
> >
> >  assignfile(f,s+s2+s3);
> >
> > is dangerous if they are not all the same encoding. If there is any
> > mismatch, it will be converted down to default encoding.

> Yes but where is the difference between 2.6.2 and trunk, in that case?

1. UTF16 works fine.
2. You can actually pass utf8 to functions, as long as you are careful with
concatenations.
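
A minimal sketch of being "careful with concatenations", assuming the code-page-aware string types in trunk and an AssignFile overload that accepts a UnicodeString. The file name pieces are made up for illustration. Each piece is converted to UTF-16 explicitly, so the concatenation never mixes 1-byte encodings:

```pascal
program concatdemo;
{$mode objfpc}{$H+}

var
  base: UTF8String;      // piece held as UTF-8
  ext: AnsiString;       // piece held in the default (ansi) code page
  path: UnicodeString;   // UTF-16, the native Unicode encoding on Windows
  f: Text;
begin
  base := 'out';         // hypothetical file name pieces
  ext := '.txt';

  // Convert every piece to UTF-16 *before* concatenating. Writing
  // base + ext directly would mix two 1-byte encodings, which is the
  // case that gets converted down to the default encoding.
  path := UTF8Decode(base) + UnicodeString(ext);

  AssignFile(f, path);
  Rewrite(f);
  WriteLn(f, 'hello');
  CloseFile(f);
end.
```

With pure ASCII names both forms behave the same; the difference only shows up once a piece contains characters outside the default code page.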

> > There is no utf8 on Windows. One can try to mess with the defaultcodepage,
> > but that will probably only force a different kind of problem.
> >
> > On Windows there is only ansi or utf16, or keeping it manual.
> 
> You're right.
> But if we imagine a perfect world where FPC and Lazarus use the same
> encoding -- it doesn't matter if it is UTF-8 or UTF-16 -- everything would
> work. Do you agree?

The point (as shown by the above problem) is that the choice must align
with what the OS offers, because otherwise you are an island yet again.

E.g. the Windows unit (also in trunk) will only work in ansi or utf16.

> So, if the encoding chosen were UTF-8 for all, the RTL would only need to
> decode strings -- on Windows -- before calling API functions.

The Windows unit is not wrapped, and the only unicode available on Windows
is UTF16. And the windows target converts mixes of 1-byte strings (say
ansi+utf8) to the default encoding (ansi).
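
To illustrate what "not wrapped" means in practice, here is a hedged sketch, assuming a program that keeps its own strings in UTF-8: the conversion to UTF-16 has to be done by hand before calling the W variant of an API function. GetFileAttributesW is only an example call, and the path is hypothetical:

```pascal
program winunitdemo;
{$mode objfpc}{$H+}

uses
  Windows;

var
  path8: UTF8String;     // application-side string, kept as UTF-8
  path16: UnicodeString; // what the W API actually expects (UTF-16)
  attrs: DWORD;
begin
  path8 := 'C:\Windows';           // hypothetical path

  // The Windows unit does no conversion for us: decode explicitly and
  // pass a PWideChar to the UTF-16 ("W") variant of the function.
  path16 := UTF8Decode(path8);
  attrs := GetFileAttributesW(PWideChar(path16));

  WriteLn('attributes: ', attrs);
end.
```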

One can attempt to fix that by messing with Windows encoding settings, but
the effect of doing that for large applications is unknown. 

Another possibility is using only our own unicode routines (linking the
tables into each binary). But that again could lead to strange artefacts.

> The same on Linux (or whatever platform uses UTF-8) if the encoding chosen
> was UTF-16.

Yes. I don't think that is a good default choice either. But at least it has
some merits for modern Delphi compat.
 
> Is my thinking correct?

Oversimplified. The RTL will never abstract everything, and there is the
issue of the default OS encoding.

In short, I don't think fighting the native encoding of a target is worth
the shallow appeal of the "one encoding rules all" principle. That is mostly
pushed by people who don't even use Windows, and thus won't feel the pain.



