[Lazarus] Any chance of changing the LCL Unicode encoding to UTF-16?

Martin Schreiber fpmse at bluewin.ch
Tue Aug 5 10:48:28 CEST 2008


On Tuesday 05 August 2008 10.10:27 Graeme Geldenhuys wrote:
[...]
> > For internal character representation you should first and foremost choose
> > an encoding which will be easy to handle by the users.
> > Because for most of the users a codepoint can not be expressed by a
> > single utf-8 byte, they can not use character access by string index in
> > their code and bytelength(utf8string) <> charactercount.
>
> And do a search on the Internet... over and over you get articles
> refuting that point. Random character access is very seldom required.
> If character access is used, it normally coincides with
> looping/iterating through the string, which utf-8 has no problems with
> either. Random character access is not the norm in program code, and
> if so, it's a very small part.
>
> http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
>
Most MSEgui users don't write a web browser, but string and character
manipulation is common in their code, and the assumption codepoint = storage
unit often simplifies the task a lot. All current users of MSEgui can make that
assumption for widestrings; most cannot for utf-8 encoded strings.
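
To make the difference concrete, here is a small sketch (Python is used only
for illustration, it is not MSEgui or FPC code; the sample string and the
helper utf8_index_of_codepoint are made up for this example). It counts
storage units in both encodings and shows the scan that utf-8 needs to reach
the n-th codepoint:

```python
# Illustration only: storage units versus codepoints for one sample string.
text = "Z\u00fcrich"  # "Zurich" with u-umlaut: 6 codepoints, one outside ASCII

code_points = len(text)                            # Python counts codepoints
utf8_units = len(text.encode("utf-8"))             # storage unit = 1 byte
utf16_units = len(text.encode("utf-16-le")) // 2   # storage unit = 16 bits


def utf8_index_of_codepoint(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th codepoint (0-based) in utf-8 data.

    Hypothetical helper: it has to scan from the start, because the offset
    depends on how many multi-byte sequences come before position n.
    """
    count = -1
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # lead byte or ASCII, not a continuation byte
            count += 1
            if count == n:
                return offset
    raise IndexError(n)


# In utf-16 the third character of this string is simply storage unit 2;
# in utf-8 it sits at byte offset 3 because the umlaut occupies two bytes.
offset_of_r = utf8_index_of_codepoint(text.encode("utf-8"), 2)
```

For this string the utf-16 unit count equals the codepoint count, while the
utf-8 byte count does not. That only holds as long as all characters are
inside the range UCS2 can represent.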

[...]

>
> > Parsing of utf-16 is much simpler than parsing of utf-8, processing of
> > UCS2/utf-16 in framework core is probably faster than utf-8 because there
> > are fewer addressing operations for codepoints > 127.
>
> Yes and with utf-16 you need to worry about the BOM and
> endianness. With utf-8 you don't.
>
MSEgui uses UCS2 encoding for internal character representation only, so there
is no BOM and the endianness is that of the system. Conversion from/to the
system encoding is handled transparently by FPC.
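
A short sketch of that distinction (again Python for illustration only, it
says nothing about how MSEgui or FPC implement it): a BOM matters when
utf-16 data is exchanged as a byte stream, while an in-memory representation
in native byte order carries none.

```python
import sys

text = "A"

# Python's generic "utf-16" codec is meant for interchange: it writes a BOM
# followed by the data in the byte order of the running system.
with_bom = text.encode("utf-16")

# Choosing the native byte order explicitly gives the bare 16-bit units,
# which is what an internal representation would hold.
native_codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
native = text.encode(native_codec)

bom = with_bom[:2]       # b"\xff\xfe" on little-endian, b"\xfe\xff" on big-endian
payload = with_bom[2:]   # identical to the native encoding
```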

[...]

>
> > Widestring versions of all Pascal standard string functions (pos,
> > copy...) are supported by FPC.
>
> Does the widestring version of Copy take into account surrogate pairs?
>  Or would it simply split the bytes?  As for Pos, Copy - LCL and fpGUI
> have their own utf-8 versions. Why would those be any slower than the
> widestring versions included in FPC?
>
> As for Length()...  What should it return?  Bytes or Character count?
> If bytes (which I believe is what the documentation says it must
> return), then the standard ANSI Length() included in FPC will suffice.
> If it's character count you want, LCL and fpGUI have their own versions
> of that.
>
Length(), Copy()... should work with storage units and not with codepoints or
glyph entities. Working with display entities is impracticable: think of glyphs
which are composed of several codepoints. I recently read a similar statement
from Allen Bauer (CodeGear).
http://groups.google.ch/group/borland.public.delphi.non-technical/browse_thread/thread/cf4398427aa0358c/838865b5fef0cbfc
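
The three levels (storage unit, codepoint, glyph) can be told apart with two
sample strings. This is a sketch in Python for illustration only; it makes no
claim about what FPC's Length() or Copy() actually do, and the helper
utf16_units is made up for this example.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit storage units needed for s in utf-16 (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2


# One glyph on screen, but two codepoints: 'e' + COMBINING ACUTE ACCENT.
composed = "e\u0301"

# One codepoint outside the BMP (MUSICAL SYMBOL G CLEF, U+1D11E):
# utf-16 stores it as a surrogate pair, i.e. two storage units.
clef = "\U0001D11E"

composed_points, composed_units = len(composed), utf16_units(composed)
clef_points, clef_units = len(clef), utf16_units(clef)

# Cutting after one storage unit leaves a lone high surrogate (0xD800..0xDBFF).
first_unit = int.from_bytes(clef.encode("utf-16-le")[:2], "little")
is_high_surrogate = 0xD800 <= first_unit <= 0xDBFF
```

A unit-based length is cheap and well defined in both cases. A glyph-based
length would have to know that the first string displays as a single
character, which no count of codepoints or storage units can tell.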

Martin


