[Lazarus] Lazarus (UTF8) and Windows: SysToUTF8, UTF8ToSys... Is there a better solution?

Mon Dec 23 23:08:01 CET 2013

On Mon, Dec 23, 2013 at 06:52:21PM +0100, Jürgen Hestermann wrote:
> On 2013-12-23 11:32, Marco van de Voort wrote:
>  > So I would say UTF16, and maybe, if there is demand, some can get utf8 :-)
> 
> The question is:
> Should FPC and LCL use a fixed encoding for all platforms
> or should the encoding be adapted for each WidgetSet/OS?

Not necessarily. Supporting both on both platforms is a sane option too.

One can't ditch utf16 because of Delphi compatibility. It will be hard to
ditch utf8 because of old Lazarus compatibility.

But if I have to choose one to kill, it is utf8. It is the lesser used choice
for Unicode strings INSIDE APPLICATIONS.  Yes, UTF8 is dominant in documents, but
not in APIs.

> If it should be the same for all platforms then it should be UTF8 IMO.
> UTF16 is the most horrible decision (all bad things combined).

For what? Most of the sentiments I hear are echoed discussions on the web
that are mostly about document encodings, NOT application internal
encodings.

However we

> UTF32 would at least have the advantage of fixed character size
> but pays this with *a lot* of memory consumption.

(it is not fixed size per character, but fixed size per codepoint)
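
To make the codepoint/character distinction concrete, here is a minimal
sketch in Python (used purely for illustration, nothing here is FPC or LCL
code). The helper name utf32_units is made up for this example. The same
visible character "é" takes one UTF-32 unit in precomposed form and two in
decomposed form:

```python
import unicodedata

def utf32_units(text: str) -> int:
    """Count 4-byte UTF-32 code units, i.e. code points, in text."""
    return len(text.encode("utf-32-le")) // 4

precomposed = "\u00e9"   # "é" as one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(utf32_units(precomposed))  # 1
print(utf32_units(decomposed))   # 2
# Both forms are the same character after normalization.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

So even with UTF-32, indexing by code unit does not give you "character
number N" in general.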

> UTF8 has the lowest memory demand

Not according to 1 billion Chinese.
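
A quick way to check this is to count encoded bytes. The sketch below is
Python, used only for illustration; encoded_sizes is a hypothetical helper
and the sample strings are arbitrary. ASCII text is smallest in UTF-8, but
common Chinese characters take 3 bytes in UTF-8 against 2 in UTF-16:

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for three Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

ascii_text = "hello"
chinese_text = "\u4f60\u597d\u4e16\u754c"  # four common Chinese characters

print(encoded_sizes(ascii_text))
# {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes(chinese_text))
# {'utf-8': 12, 'utf-16-le': 8, 'utf-32-le': 16}
```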

> (in general) and a good backward compatibility.

Hardly. Only for western languages, and even there conversions often go
wrong. That's why the whole BOM kludge became so important.

> On the other hand, adapting the string encoding for each
> Widgetset/OS would be a can of worms IMO.

If you feel that way, I think Delphi compatibility should prevail. Old
Lazarus code needs to be modified anyway. 

Note that the language support for utf8 breaks down when you pass e.g. a
"string" to rawbytestring on Windows. (because it is converted to the
default 1-byte encoding, which is not utf8 in general).
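
The effect of such a conversion can be imitated outside FPC. The following
Python sketch is only an analogy, not FPC code: it assumes the default
1-byte encoding is cp1252 (a typical Western European Windows setting), and
to_ansi_and_back is a hypothetical helper. Anything the code page cannot
represent is lost:

```python
def to_ansi_and_back(text: str, codepage: str = "cp1252") -> str:
    """Round-trip text through a 1-byte code page; unmappable chars become '?'."""
    return text.encode(codepage, errors="replace").decode(codepage)

western = "caf\u00e9"               # "café", representable in cp1252
cyrillic = "\u0416\u0443\u043a"     # three Cyrillic letters, not in cp1252

print(to_ansi_and_back(western) == western)  # True
print(to_ansi_and_back(cyrillic))            # ???
```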

As said, UTF8 on Windows is a crutch, and attempts to work around that move
Lazarus in the direction of "portability to everything as long as it is
unix" philosophies, a la Cygwin.

IMHO a bad direction. FPC has in general avoided having an outright
preference and IMHO should continue to do so.

> A lot of additional knowledge about strings is put on the programmer
> because handling of strings has to be done differently depending on OS.

It will anyway, even with utf8. Constructs that happen to work on Linux
will fail on Windows, because on Windows the default 1-byte encoding is not
UTF8.
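
A typical symptom, sketched in Python as an analogy rather than FPC code,
and again assuming cp1252 as the Windows default 1-byte encoding: bytes that
are valid UTF-8 come out garbled as soon as something treats them as being
in the system code page.

```python
utf8_bytes = "J\u00fcrgen".encode("utf-8")   # "Jürgen" stored as UTF-8

# Interpreted as UTF-8 (the usual Linux default): intact.
as_utf8 = utf8_bytes.decode("utf-8")
# Interpreted as cp1252 (assumed Windows default): each non-ASCII
# character turns into two wrong ones.
as_cp1252 = utf8_bytes.decode("cp1252")

print(as_utf8 == "J\u00fcrgen")  # True
print(len(as_utf8), len(as_cp1252))  # 6 7
```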

Moreover, I think people step over the Delphi compatibility card too easily.
Way, way, way too easily.

> But FPC/Lazarus is meant to be portable so this should not be done.

FPC/Lazarus is supposed to be portable, not an emulated Unix on everything.
Using another system's default encoding is emulation, not portability.