[Lazarus] rewriting of LConvEncoding

Thu Sep 23 11:11:48 CEST 2010

--- Original Message ---
From: Marco van de Voort <marcov at stack.nl>
To: merlin352 at globe.lu, Lazarus mailing list <lazarus at lists.lazarus.freepascal.org>
Subject: Re: [Lazarus] rewriting of LConvEncoding

> On Mon, Sep 20, 2010 at 11:16:39PM +0200, merlin352 at globe.lu wrote:
> > > Note that FPC does contain some conversion code in the RTL,
> > > see the ucmaps directory and the charset unit.
> >
> > The files in ucmaps are the files from ftp.unicode.org that I
> > mentioned.
> >
> > My opinion is that static tables are smaller and faster than
> > dynamic mappings (especially for the Asian code pages), but memory
> > could be an issue.  Perhaps this can be solved by smartlinking.
>
> Explain static/dynamic in this context. Note that there is a tool to
> change the conversions into tables, so they can be linked in, already.

By "the tool", do you mean the charset unit? As far as I can see, charset is not finished. It only offers a rudimentary way to read the unicode.org text files, plus some functions to find a mapping and convert a single character. There is no support for converting complete strings, or for UTF-8, UTF-16, or UTF-32.

The tables are created dynamically via getmem and stored in a linked list. Every character is stored in a tunicodecharmapping record, in which the Unicode value is defined only as a word, not a cardinal. Thus code points above U+FFFF cannot be represented: UTF-32 is not supported, and neither are UTF-16 surrogates. The list is scanned linearly to find a given mapping, which is not really optimal for fast conversions. The reverse conversion (Unicode > character) is even worse than the algorithm in LConvEncoding: it searches linearly through the mapping, which may take some time on big Asian code pages. (LConvEncoding uses a dual Quicksearch.)
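
To illustrate the word/cardinal point, here is a minimal sketch of why a 16-bit field is too small and what surrogate support would involve. It is not taken from charset or LConvEncoding; ToSurrogates and its parameter names are made up for this example.

```pascal
program SurrogateSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Split a code point in the range U+10000..U+10FFFF into a UTF-16
// surrogate pair. The routine and its names are illustrative only.
procedure ToSurrogates(CodePoint: Cardinal; out HighSur, LowSur: Word);
var
  V: Cardinal;
begin
  V := CodePoint - $10000;
  HighSur := Word($D800 + (V shr 10));    // upper 10 bits
  LowSur  := Word($DC00 + (V and $3FF));  // lower 10 bits
end;

var
  CP: Cardinal;
  H, L: Word;
begin
  CP := $20000;  // first code point of CJK Unified Ideographs Extension B
  if CP > High(Word) then
    WriteLn('U+', IntToHex(Int64(CP), 5), ' does not fit in a Word');
  ToSurrogates(CP, H, L);
  // For U+20000 the pair is D840 DC00.
  WriteLn('UTF-16: ', IntToHex(H, 4), ' ', IntToHex(L, 4));
end.
```

A mapping record that stores the Unicode value as a cardinal can hold such code points directly and leave the surrogate split to the UTF-16 output stage.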

Charset has absolutely no support for handling the endianness of UTF-16 and UTF-32 strings.
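
As a rough sketch of what endianness handling for UTF-16 could look like: the byte order mark U+FEFF, read in the wrong byte order, shows up as $FFFE, and in that case every code unit has to be byte-swapped. The routines SwapWord, NeedsSwap and SwapAll are hypothetical and exist in neither charset nor LConvEncoding.

```pascal
program EndianSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Swap the two bytes of a single UTF-16 code unit.
function SwapWord(W: Word): Word;
begin
  Result := ((W and $FF) shl 8) or (W shr 8);
end;

// True if the buffer starts with a byte-swapped BOM ($FFFE), i.e. the
// data was written in the opposite byte order to the one we read it in.
function NeedsSwap(const Units: array of Word): Boolean;
begin
  Result := (Length(Units) > 0) and (Units[0] = $FFFE);
end;

// Byte-swap every code unit in place.
procedure SwapAll(var Units: array of Word);
var
  i: Integer;
begin
  for i := 0 to High(Units) do
    Units[i] := SwapWord(Units[i]);
end;

var
  // Sample data: BOM, 'A', 'B' as seen when read in the wrong byte order.
  Buf: array[0..2] of Word = ($FFFE, $4100, $4200);
  i: Integer;
begin
  if NeedsSwap(Buf) then
    SwapAll(Buf);
  // After swapping, the units are FEFF 0041 0042.
  for i := 0 to High(Buf) do
    Write(IntToHex(Buf[i], 4), ' ');
  WriteLn;
end.
```

UTF-32 would need the same treatment with four-byte swaps. Data without a BOM still requires the caller to state the byte order explicitly.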

By static tables, I mean a table in a const section, compiled and linked into the code. This gives the compiler optimization opportunities for indexed access to the table contents. The tables can be optimized for the best-fitting data size: Latin code pages can mostly be encoded as UCS-2. Others, especially Asian code pages but also some Mac code pages, need UCS-4. The reverse conversion can be optimized through individual functions for every code page.
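
A minimal sketch of the idea, under the assumption of a tiny table covering only bytes $82..$87 with values as in CP1252. The forward direction is a plain indexed access. The reverse direction here uses a second table sorted by Unicode value and a binary search, which is just one possible optimization and not necessarily what LConvEncoding does. All identifiers are invented for this example.

```pascal
program StaticTableSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TRevEntry = record
    Uni: Word;  // Unicode value (UCS-2 is enough for this toy table)
    Ch: Byte;   // byte in the code page
  end;

const
  FirstByte = $82;
  LastByte  = $87;

  // Forward table, indexed directly by the code page byte.
  ToUnicode: array[FirstByte..LastByte] of Word =
    ($201A, $0192, $201E, $2026, $2020, $2021);

  // Reverse table, sorted by Unicode value so it can be binary-searched.
  FromUnicode: array[0..5] of TRevEntry = (
    (Uni: $0192; Ch: $83),
    (Uni: $201A; Ch: $82),
    (Uni: $201E; Ch: $84),
    (Uni: $2020; Ch: $86),
    (Uni: $2021; Ch: $87),
    (Uni: $2026; Ch: $85));

// Forward conversion: one bounds check and one indexed load.
function ByteToUnicode(B: Byte): Word;
begin
  if (B >= FirstByte) and (B <= LastByte) then
    Result := ToUnicode[B]
  else
    Result := B;  // toy fallback: bytes outside the table map to themselves
end;

// Reverse conversion: binary search instead of a linear scan.
function UnicodeToByte(U: Word; out B: Byte): Boolean;
var
  L, R, M: Integer;
begin
  L := Low(FromUnicode);
  R := High(FromUnicode);
  while L <= R do
  begin
    M := (L + R) div 2;
    if FromUnicode[M].Uni = U then
    begin
      B := FromUnicode[M].Ch;
      Exit(True);
    end
    else if FromUnicode[M].Uni < U then
      L := M + 1
    else
      R := M - 1;
  end;
  B := 0;
  Result := False;
end;

var
  B: Byte;
begin
  WriteLn('Byte $85 -> U+', IntToHex(ByteToUnicode($85), 4));
  if UnicodeToByte($2026, B) then
    WriteLn('U+2026 -> byte $', IntToHex(B, 2));
end.
```

For a code page that needs UCS-4, the element type would become Cardinal instead of Word, at the cost of doubling the table size.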

It is easy to write a utility that generates the source code for such code page units.
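
A rough sketch of such a generator, assuming input lines in the style of the unicode.org mapping files ("0xXX", tab, "0xXXXX", tab, "# comment") and assuming the mapped bytes are contiguous and in ascending order. To stay self-contained it reads from an in-memory sample instead of a file. All names are hypothetical.

```pascal
program TableGenSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Cut the next whitespace-separated token off the front of S.
function NextToken(var S: string): string;
var
  P: Integer;
begin
  S := TrimLeft(S);
  P := 1;
  while (P <= Length(S)) and not (S[P] in [' ', #9]) do
    Inc(P);
  Result := Copy(S, 1, P - 1);
  Delete(S, 1, P - 1);
end;

// Parse a token of the form 0xNNNN into an integer.
function ParseHex(const Tok: string; out Value: Longint): Boolean;
begin
  Value := 0;
  Result := (Length(Tok) > 2) and (Copy(Tok, 1, 2) = '0x')
    and TryStrToInt('$' + Copy(Tok, 3, MaxInt), Value);
end;

// Parse one mapping line; comment lines and blank lines yield False.
function ParseLine(Line: string; out B, U: Longint): Boolean;
var
  Tok1, Tok2: string;
begin
  B := 0;
  U := 0;
  Line := Trim(Line);
  if (Line = '') or (Line[1] = '#') then
    Exit(False);
  Tok1 := NextToken(Line);
  Tok2 := NextToken(Line);
  Result := ParseHex(Tok1, B) and ParseHex(Tok2, U);
end;

const
  // Hypothetical sample input; a real tool would read the mapping file.
  Sample: array[0..3] of string = (
    '# sample excerpt in the mapping file format',
    '0x82'#9'0x201A'#9'# SINGLE LOW-9 QUOTATION MARK',
    '0x83'#9'0x0192'#9'# LATIN SMALL LETTER F WITH HOOK',
    '0x84'#9'0x201E'#9'# DOUBLE LOW-9 QUOTATION MARK');

var
  Values: array of Longint;
  i, B, U, FirstB: Longint;
begin
  SetLength(Values, 0);
  FirstB := 0;
  for i := Low(Sample) to High(Sample) do
    if ParseLine(Sample[i], B, U) then
    begin
      if Length(Values) = 0 then
        FirstB := B;  // remember the lowest mapped byte
      SetLength(Values, Length(Values) + 1);
      Values[High(Values)] := U;
    end;
  if Length(Values) = 0 then
    Exit;
  // Emit a const table that can be pasted into a code page unit.
  WriteLn('const');
  WriteLn('  ToUnicode: array[$', IntToHex(FirstB, 2), '..$',
    IntToHex(FirstB + High(Values), 2), '] of Word = (');
  for i := 0 to High(Values) do
  begin
    Write('    $', IntToHex(Values[i], 4));
    if i < High(Values) then
      WriteLn(',')
    else
      WriteLn;
  end;
  WriteLn('  );');
end.
```

A real generator would also have to handle gaps in the byte range, multi-byte code pages, and values above $FFFF, where the element type must be Cardinal rather than Word.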

Best regards

______________________________________________________
powered by GLOBER.LU
Luxembourg Internet Service Provider
Hosting. Domain Registration, Webshops, Webdesign, FreeMail ...

Our professional Web Hosting plans include all the features you are looking for at the best possible price.
www.globe.lu
