[Lazarus] Any chance of changing the LCL Unicode encoding to UTF-16?

Graeme Geldenhuys graemeg.lists at gmail.com
Tue Aug 5 10:10:27 CEST 2008


On Tue, Aug 5, 2008 at 8:50 AM, Martin Schreiber <fpmse at bluewin.ch> wrote:
>
> For external character representation utf-8 is best for cross platform storage
> of text in files and to send and receive text by communication channels.
> MSEgui uses utf-8 to store initialization and user state data and the like

I knew that, and that's what I said... utf-8 represents its data
in byte form, so it's ideal for streaming. That is the reason, I think,
W3C chose utf-8 as a recommendation for XML and web pages.


> For internal character representation you should first and foremost choose an
> encoding which will be easy to handle by the users.
> Because for most of the users a codepoint can not be expressed by a single
> utf-8 byte, they can not use character access by string index in their code
> and bytelength(utf8string) <> charactercount.

And do a search on the Internet... over and over you get articles
refuting that point. Random character access is very seldom required.
If character access is used, it normally coincides with
looping/iterating through the string, which utf-8 has no problems with
either. Random character access is not the norm in program code, and
if so, it's a very small part.

http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
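
To show what I mean by iterating, here is a minimal sketch. It is in
Python only because it is handy for poking at raw bytes; it is not LCL
or fpGUI code. The name iter_codepoints is made up for this example,
and it assumes well-formed utf-8 input (it does not validate the
continuation bytes). The lead byte tells you how many bytes the
sequence has, so a forward loop never needs random character access:

```python
def iter_codepoints(data: bytes):
    """Yield (byte_offset, code_point) pairs from well-formed UTF-8 bytes."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:               # 0xxxxxxx: 1-byte sequence (ASCII)
            size, cp = 1, lead
        elif lead >> 5 == 0b110:      # 110xxxxx: 2-byte sequence
            size, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:     # 1110xxxx: 3-byte sequence
            size, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:    # 11110xxx: 4-byte sequence
            size, cp = 4, lead & 0x07
        else:
            raise ValueError(f"unexpected byte 0x{lead:02x} at offset {i}")
        # Each continuation byte (10xxxxxx) contributes 6 more bits.
        for b in data[i + 1:i + size]:
            cp = (cp << 6) | (b & 0x3F)
        yield i, cp
        i += size


text = "a\u00e9\u20ac"            # 'a', e-acute, euro sign
encoded = text.encode("utf-8")    # 1 + 2 + 3 = 6 bytes
pairs = list(iter_codepoints(encoded))
```

The byte offsets that come out of the loop can be used for slicing
afterwards, so you still never index by character number.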


> With UCS2/utf-16 on the other hand *all* current users of MSEgui can store
> *all* the codepoints they need in a single utf-16 word and therefore can work

I'm being cynical I guess, but that sounds just like what the Unicode
guys thought with their first draft: 65,536 code points would be more
than enough for all written languages. That assumption was proven
wrong quickly.  ;-)


> Parsing of utf-16 is much simpler than parsing of utf-8, processing of
> UCS2/utf-16 in framework core is probably faster than utf-8 because there are
> less addressing operations for codepoints > 127.

Yes, and with utf-16 you need to worry about the BOM and
endianness. With utf-8 you don't.

http://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes

"The UTF-16 and UCS-2 encoding forms produce a sequence of 16-bit
words or code units. These are not directly usable as a byte or octet
sequence because the endianness of these words varies according to the
computer architecture; either big-endian or little-endian. To account
for this choice of endianness each encoding form defines three related
encoding schemes: for UTF-16 there are the schemes UTF-16, UTF-16BE,
and UTF-16LE, and for UCS-2 there are the schemes UCS-2, UCS-2BE, and
UCS-2LE."

So that means you can have three utf-16 encodings!
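
A small sketch of the three schemes, again in Python purely as an
illustration. It relies on Python's standard codec names
("utf-16-be", "utf-16-le", and "utf-16", which writes a BOM followed
by the platform's native byte order):

```python
import codecs

text = "A"                          # U+0041

be = text.encode("utf-16-be")       # high byte first, no BOM
le = text.encode("utf-16-le")       # low byte first, no BOM
with_bom = text.encode("utf-16")    # BOM, then native byte order
utf8 = text.encode("utf-8")         # one byte, nothing to reorder

# Reading big-endian data as little-endian silently yields a
# different character instead of failing.
misread = be.decode("utf-16-le")

# The BOM is what lets a reader pick the right byte order.
bom_is_known = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
roundtrip = with_bom.decode("utf-16")
```

The same single character gives three different byte sequences in
utf-16, while the utf-8 bytes are identical on every architecture.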


> Widestring versions of all Pascal standard string functions (pos, copy...) are
> supported by FPC.

Does the widestring version of Copy take into account surrogate pairs?
 Or would it simply split the bytes?  As for Pos, Copy - LCL and fpGUI
have their own utf-8 versions. Why would those be any slower than the
widestring versions included in FPC?
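
To make the surrogate pair question concrete, here is a hedged
sketch. It does not claim anything about what FPC's Copy actually
does; copy_units is a hypothetical stand-in that mimics a 1-based
Copy(S, Index, Count) counting 16-bit code units, written in Python
for illustration only:

```python
import struct


def utf16_units(s: str) -> list:
    """Return the string as a list of 16-bit UTF-16 code units."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


def copy_units(units: list, index: int, count: int) -> list:
    """Hypothetical code-unit based Copy with a 1-based index."""
    return units[index - 1:index - 1 + count]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so utf-16
# stores it as the surrogate pair D834 DD1E.
units = utf16_units("a\U0001D11Eb")

# Taking the first two code units cuts the pair in half and leaves a
# lone high surrogate at the end of the result.
piece = copy_units(units, 1, 2)
broken = is_high_surrogate(piece[-1])
```

If a Copy works this way, the caller has to check for a dangling
surrogate, which is the same kind of bookkeeping utf-8 needs for
multi-byte sequences.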

As for Length()...  What should it return?  Bytes or character count?
If bytes (which I believe is what the documentation says it must
return), then the standard ANSI Length() included in FPC will suffice.
If it's character count you want, LCL and fpGUI have their own versions
of that.

As far as I understand, the assumption that Length() returns character
count is a coincidence, simply because ANSI Char = 1 byte.

From the Kylix 3 help:  "For single-byte and multibyte strings, Length
returns the number of bytes used by the string."
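
A quick sketch of the three numbers a "length" could mean for the
same text (Python for illustration only, not FPC's Length()):

```python
text = "na\u00efve \U0001D11E"    # "naive" with i-diaeresis, a space, U+1D11E

utf8_bytes = len(text.encode("utf-8"))            # storage size in utf-8
utf16_units = len(text.encode("utf-16-le")) // 2  # 16-bit code units
code_points = len(text)                           # Python counts code points

lengths = (utf8_bytes, utf16_units, code_points)
```

Only for pure ASCII text do all three agree, and neither the utf-8 nor
the utf-16 count matches the code point count here.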

I couldn't find FPC documentation on the Length() function.

Like I said... the argument between utf-8 and utf-16 is not clear cut.
It's not black & white. I have read so much documentation on both
sides of the argument that my brain started to hurt (that seems to
happen often lately).  ;-)


Regards,
 - Graeme -


_______________________________________________
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/


