[Lazarus] Any chance of changing the LCL Unicode encoding to UTF-16?
Martin Schreiber
fpmse at bluewin.ch
Tue Aug 5 08:50:29 CEST 2008
On Monday 04 August 2008 23.23:46 Graeme Geldenhuys wrote:
> 2008/8/4 Ivan P. Gan <ivan at comchatter.com>:
> > It is however better to have a single encoding format for all platforms
> > that Free pascal & Lazarus support
>
> I agree. While researching the issue with unicode I came across a very
> good article the other day. Unicode was designed using UTF-16 at first.
> No need for variable amounts of bytes to represent a character. 2 bytes
> is all that was needed. It proved a better idea than UTF-8 which
> required 1-4 bytes. Then to their surprise 65000 code points were not
> enough so UTF-16 ended up having encoding pairs - just the thing they
> tried to avoid! Oops.
>
[...]
Graeme,
I think you should not mix up the external and the internal character
representation of a GUI framework.
For external character representation utf-8 is best, both for cross-platform
storage of text in files and for sending and receiving text over
communication channels.
MSEgui uses utf-8 to store initialization data, user state data and the like,
and also in the MSEifi remote executable framework. utf-8 is also used to
store the source code files in MSEide.
For internal character representation you should first and foremost choose an
encoding which is easy for the users to handle.
For most users a codepoint cannot be expressed by a single utf-8 byte, so
they cannot access characters by string index in their code, and
bytelength(utf8string) <> charactercount.
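
The mismatch between byte length and character count can be shown with a
minimal sketch. Python is used here purely for illustration (it is not MSEgui
or FPC code), and the sample word and variable names are assumptions chosen
for the example.

```python
# Illustration only: a German word with two non-ASCII characters.
text = "Gr\u00fc\u00dfe"          # "Gruesse" spelled with u-umlaut and sharp s

utf8 = text.encode("utf-8")

char_count = len(text)            # number of codepoints
byte_count = len(utf8)            # number of utf-8 bytes

# Indexing the utf-8 bytes does not yield the third character:
third_char = text[2]              # the u-umlaut, U+00FC
third_byte = utf8[2:3]            # only the lead byte of its 2-byte sequence

print(char_count, byte_count)     # 5 7
print(third_byte)                 # b'\xc3'
```

The byte at index 2 is only half of a character, which is why index-based
string code written for single-byte encodings does not carry over to utf-8.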
With UCS2/utf-16, on the other hand, *all* current users of MSEgui can store
*all* the codepoints they need in a single utf-16 word and can therefore work
with the Unicode MSEgui framework in the same way they are used to working
with ansistrings.
Even our active Chinese MSEide+MSEgui user has not reported any Unicode issue
for a long time.
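
As a hedged illustration of the "single utf-16 word" point: codepoints up to
U+FFFF (the Basic Multilingual Plane) occupy one 16-bit unit in utf-16, while
codepoints above that need a surrogate pair. The sketch below is Python, not
MSEgui code; `utf16_units` is a hypothetical helper and the sample characters
are assumptions.

```python
def utf16_units(text):
    """Return how many 16-bit code units text occupies in utf-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_text = "\u4e2d\u6587"   # two Chinese characters, both inside the BMP
clef = "\U0001d11e"         # MUSICAL SYMBOL G CLEF, outside the BMP

print(len(bmp_text), utf16_units(bmp_text))   # 2 2
print(len(clef), utf16_units(clef))           # 1 2
```

As long as the text stays inside the BMP, word index and character index
coincide; the surrogate-pair case only appears for codepoints above U+FFFF.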
Parsing utf-16 is much simpler than parsing utf-8, and processing UCS2/utf-16
in the framework core is probably faster than utf-8 because fewer addressing
operations are needed for codepoints > 127.
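
The difference in parsing effort can be sketched with two minimal decoders.
This is an illustrative Python sketch under simplifying assumptions: the
function names are hypothetical, the input is assumed to be well-formed, and
no validation of continuation bytes, overlong forms or unpaired surrogates is
done. It says nothing about actual speed.

```python
def next_utf16(units, i):
    """Decode one codepoint from a list of 16-bit units; return (cp, next_i)."""
    u = units[i]
    if 0xD800 <= u <= 0xDBFF:                 # high surrogate: one extra unit
        low = units[i + 1]
        return 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00), i + 2
    return u, i + 1                           # everything else: a single unit

def next_utf8(data, i):
    """Decode one codepoint from utf-8 bytes; return (cp, next_i)."""
    b = data[i]
    if b < 0x80:                              # 1 byte, plain ASCII
        return b, i + 1
    if b >> 5 == 0b110:                       # 2-byte sequence
        extra, cp = 1, b & 0x1F
    elif b >> 4 == 0b1110:                    # 3-byte sequence
        extra, cp = 2, b & 0x0F
    elif b >> 3 == 0b11110:                   # 4-byte sequence
        extra, cp = 3, b & 0x07
    else:
        raise ValueError("not a utf-8 lead byte")
    for k in range(1, extra + 1):             # fold in the continuation bytes
        cp = (cp << 6) | (data[i + k] & 0x3F)
    return cp, i + extra + 1

euro16 = next_utf16([0x20AC], 0)
euro8 = next_utf8("\u20ac".encode("utf-8"), 0)
clef16 = next_utf16([0xD834, 0xDD1E], 0)
```

For the euro sign the utf-16 path reads one unit, while the utf-8 path reads
a lead byte plus two continuation bytes to reach the same codepoint.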
FPC has supported transparent conversion widestring <> system encoding for a
long time. For number formatting, filesystem access, string manipulation and
the like, MSEgui has implemented its own optimized widestring functions.
Widestring versions of all Pascal standard string functions (pos, copy...) are
supported by FPC. The only missing piece is a reference-counted widestring
type on Windows.
Martin