[Lazarus] Lazarus (UTF8) and Windows: SysToUTF8, UTF8ToSys... Is there a better solution?

Tue Dec 24 12:18:49 CET 2013

On 2013-12-23 23:08, Marco van de Voort wrote:
 > But if I have to choose to kill one, it is utf8. It is the lesser used choice
 > for unicode strings INSIDE APPLICATIONS.  Yes, UTF8 is dominant in documents, but
 > not in APIs.

But in APIs it would not matter much to convert (in general the time for conversion
is negligible compared to the time that is needed for the rest around the API call).

I have written a file manager for Windows that can log and store millions of files in memory.
It uses the (UTF16) Unicode API of Windows and converts the file names to UTF8 for internal use.
There is another file manager that uses UTF16 internally and can also log millions of files.
When logging the same source I can't see any difference in performance (even when logging
multiple times so that everything is cached!) although I have to convert and the other one does not.
But the memory footprints are very different.

 >> UTF16 is the most horrible decision (all bad things combined).
 > For what? Most of the sentiments I hear are echoed discussions on the web
 > that are mostly about document encodings, NOT application internal
 > encodings.

IMO this verdict follows from the premise that one encoding is chosen for everything,
so that the same encoding is used *everywhere* as much as possible.
Then UTF8 is the best solution.
Why use UTF16/32? They cannot be treated the same as ancient ANSI strings either.
So what would be the reason behind it? Just wasting memory?

 >> UTF8 has the lowest memory demand
 > Not according to 1 billion Chinese.

How many of the strings stored and processed on a Chinese computer are actually in Chinese?
A lot of the strings are still in English (HTML etc.).
So for Asian countries the real memory demand is a mix and is not so easy to determine.
In most Western countries UTF8 definitely uses less memory.
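
To make the "mix" point concrete, here is a minimal sketch that counts the bytes needed for the same text in both encodings. It is written in Python only for brevity (not Free Pascal), and the sample strings and the helper name `encoded_sizes` are made up for this illustration:

```python
# Illustration only: compare how many bytes the same text needs in
# UTF-8 and in UTF-16. The sample strings are invented for this sketch.

def encoded_sizes(text: str) -> tuple:
    """Return (utf8_bytes, utf16_bytes) for text, counted without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "english": "file manager",
    "chinese": "文件管理器",
    "mixed html": '<a href="index.html">文件管理器</a>',
}

for name, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{name}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

Pure ASCII text needs half the space in UTF8, pure Chinese text (3 bytes per character in UTF8, 2 in UTF16) needs more, and for markup-heavy text the result depends on the ratio of the two.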

 >> On the other hand, adapting the string encoding for each
 >> Widgetset/OS would be a can of worms IMO.
 > If you feel that way, I think Delphi compatibility should prevail.

Why this?
Free Pascal/Lazarus should fledge and not repeat all the bad decisions of Borland/Embarcadero/..

 > Note that the language support for utf8 breaks down when you pass e.g. a
 > "string" to rawbytestring on Windows. (because it is converted to the
 > default 1-byte encoding, which is not utf8 in general).

I am not sure what you are talking about here.
For Windows I would use the Unicode (UTF16) API exclusively and
convert its strings to UTF8 internally. From then on, everything should be UTF8.

 > As said, UTF8 on Windows is a crutch, and attempts to workaround that moves
 > Lazarus in the direction of "portability to everything as long as it is
 > unix" philosophies, a la Cygwin.

For me the decision of which Unicode encoding to use is primarily OS independent.
Just do the conversion once at the API interface level, but then use internally whatever was
decided to be "the best" (UTF8 IMO). Conversions seem to be unavoidable anyway.
So it is just a decision where and when they take place.
And the API level is a good place IMO.
And when other OSes use the same encoding it is even better, but that is not the reason to choose one or the other.
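
A rough sketch of what "convert once at the API level" could look like, again in Python for brevity. The function `os_list_names` is a hypothetical stand-in for a UTF16 operating system call, and `from_api` / `to_api` are names invented for this example, not part of any real library:

```python
# Sketch of "convert once at the API boundary".
# os_list_names() is a hypothetical stand-in for a UTF-16 OS API,
# it is not a real system call.

def os_list_names() -> list:
    """Pretend OS call: returns file names as UTF-16LE byte strings."""
    names = ("readme.txt", "Größe.doc", "文件.txt")
    return [name.encode("utf-16-le") for name in names]

def from_api(raw: bytes) -> bytes:
    """Convert UTF-16LE received from the API to UTF-8 for internal use."""
    return raw.decode("utf-16-le").encode("utf-8")

def to_api(internal: bytes) -> bytes:
    """Convert internally used UTF-8 back to UTF-16LE for an API call."""
    return internal.decode("utf-8").encode("utf-16-le")

# Everything after this line only ever sees UTF-8.
names_utf8 = [from_api(raw) for raw in os_list_names()]
```

The rest of the program never needs to know which encoding the platform API uses; only these two functions would differ per OS.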

 >> A lot of additional knowledge about strings is put on the programmer
 >> because handling of strings has to be done differently depending on OS.

No! That is exactly the aim: If *all* Free Pascal/Lazarus programmers can rely on having
UTF8 in all cases then you only need to handle UTF8 strings.
No IFDEFS to handle UTF16 on Windows and UTF8 on Linux.
The same code just works on *all* platforms!

 > Constructs that happen to work with Linux will fail on Windows.
 > Because on Windows the default 1-byte encoding is not UTF8.

The ANSI interface should not be used anymore. It is obsolete and only needed
for ancient OSes like DOS. Programmers should not be encouraged to use it
on modern platforms. Just use UTF8 *everywhere*. That should be the aim IMO.
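
As an illustration of what goes wrong when UTF8 data passes through the ANSI interface, here is a small Python sketch. It assumes code page 1252 as the default 1-byte encoding (other systems use other code pages), and `misread_as_ansi` is a helper name made up for this example:

```python
# Illustration: UTF-8 bytes misread as a 1-byte ANSI code page.
# cp1252 is only an assumption for the system's default ANSI encoding.

def misread_as_ansi(text: str, codepage: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes as if they were ANSI."""
    return text.encode("utf-8").decode(codepage)

print(misread_as_ansi("file"))   # plain ASCII survives unchanged
print(misread_as_ansi("Größe"))  # non-ASCII characters turn into mojibake
```

Pure ASCII strings survive unchanged, which is why such bugs often go unnoticed until the first non-ASCII file name shows up.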