[Lazarus] Any chance of changing the LCL Unicode encoding to UTF-16?

Martin Schreiber fpmse at bluewin.ch
Tue Aug 5 08:50:29 CEST 2008


On Monday 04 August 2008 23.23:46 Graeme Geldenhuys wrote:
> 2008/8/4 Ivan P. Gan <ivan at comchatter.com>:
> > It is however better to have a single encoding format for all platforms
> > that Free pascal & Lazarus support
>
> I agree. While researching the issue with unicode I came across a very
> good article the other day. Unicode was designed using UTF-16 at first.
> No need for a variable number of bytes to represent a character. 2 bytes
> is all that was needed. It proved a better idea than UTF-8 which
> required 1-4 bytes. Then to their surprise 65000 code points were not
> enough so UTF-16 ended up having encoding pairs - just the thing they
> tried to avoid!  Oops.
>
[...]

Graeme,
I think you should not mix up the external and the internal character
representation of a GUI framework.
For external character representation utf-8 is best: for cross-platform
storage of text in files and for sending and receiving text over
communication channels. MSEgui uses utf-8 to store initialization and user
state data and the like, and in the MSEifi remote executable framework.
utf-8 is also used to store the source code in files in MSEide.
For internal character representation you should first and foremost choose an
encoding which is easy for the users to handle.
Because most users need codepoints which cannot be expressed by a single
utf-8 byte, they cannot access characters by string index in their code,
and bytelength(utf8string) <> charactercount.
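
The byte length versus character count point can be illustrated with a small
sketch. It is written in Python rather than Pascal purely for illustration;
the helper name utf8_lengths and the sample strings are assumptions and are
not taken from MSEgui or FPC.

```python
# Sketch: codepoint count versus utf-8 byte count for a few sample strings.
# Non-ASCII text is written with escapes so the source stays plain ASCII.
samples = [
    "abc",                                   # plain ASCII
    "Gr\u00fc\u00dfe",                       # German, u-umlaut and sharp s
    "\u041f\u0440\u0438\u0432\u0435\u0442",  # Russian, six Cyrillic letters
    "\u4e2d\u6587",                          # Chinese, two characters
]

def utf8_lengths(text):
    """Return (codepoint_count, utf8_byte_count) for a string."""
    return len(text), len(text.encode("utf-8"))

for s in samples:
    chars, nbytes = utf8_lengths(s)
    # !a prints an ASCII-only representation of the string
    print(f"{s!a}: {chars} codepoints, {nbytes} utf-8 bytes")

# Indexing the raw bytes does not land on character boundaries:
raw = "Gr\u00fc\u00dfe".encode("utf-8")
third_byte = raw[2:3]  # only the first byte of the two-byte u-umlaut sequence
```

Only the pure ASCII sample has equal counts; for every other sample a byte
index no longer corresponds to a character index.
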
With UCS2/utf-16 on the other hand *all* current users of MSEgui can store
*all* the codepoints they need in a single utf-16 word and can therefore work
with the Unicode MSEgui framework just as they are used to working with
ansistrings.
Even our active Chinese MSEide+MSEgui user has not reported any Unicode issue
for a long time.
Parsing of utf-16 is much simpler than parsing of utf-8, and processing of
UCS2/utf-16 in the framework core is probably faster than utf-8 because there
are fewer addressing operations for codepoints > 127.
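
As a rough illustration of why utf-16 parsing is simple, the following Python
sketch decodes a sequence of 16-bit words into codepoints. The function names
utf16_units and decode_utf16 are made up for this example and do not come
from MSEgui or FPC. The only special case is a surrogate pair; every other
word is already a codepoint.

```python
def utf16_units(text):
    """Return the 16-bit code units of text encoded as utf-16."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def decode_utf16(units):
    """Turn 16-bit code units into codepoints, joining surrogate pairs."""
    codepoints = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Two words form one codepoint above U+FFFF.
            low = units[i + 1]
            codepoints.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # Single word; unpaired surrogates are passed through unchanged.
            codepoints.append(u)
            i += 1
    return codepoints

bmp_units = utf16_units("\u4e2d")           # one word for a Chinese character
pair_units = utf16_units("\U0001D11E")      # two words for U+1D11E
```

For text that stays within the first 65536 codepoints the loop never takes
the surrogate branch, so index and character position coincide.
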
FPC has supported transparent conversion widestring <> system encoding for a
long time. For number formatting, filesystem access, string manipulation and
the like MSEgui has implemented its own optimized widestring functions.
Widestring versions of all Pascal standard string functions (pos, copy...) are
supported by FPC. The only missing piece is a reference counted widestring
type on Windows.

Martin


