[Lazarus] rewriting of LConvEncoding

Marco van de Voort marcov at stack.nl
Thu Sep 23 21:55:21 CEST 2010


On Thu, Sep 23, 2010 at 11:11:48AM +0200, Guy Fink wrote:
> > the conversions into tables, so they can be linked in, already.
> 
> With the tool you mean the charset-unit? 

No, I mean fpc/rtl/units/creumap.pp, which afaik generates statically linkable
units from ISO files that plug in to charset.

> What I see is that charset is not finished.

Then finish it.

> It just offers a rudimentary way to read in the unicode.org textfiles, and
> some functions to find a mapping and convert one character.  No support
> for complete string conversions, or UTF-8, UTF-16, UTF-32.

Then make a good proposal to fix this. Preferably with patches.


> The tables are created dynamically via getmem and stored in a linked list.
> Every character is stored in a record : tunicodecharmapping, where Unicode
> is only defined as word, not cardinal.  Thus UTF32 is not supported, and
> neither are UTF16 surrogates.

UTF32 is not supported anywhere in FPC atm, and to be honest, I don't see a
reason to start now.  The Unicode Delphis also don't provide a type for it.
It is simply not the most practical format, and the few places where it is
typically used, like complex string routines and the like, can survive on
hardcoded, hand-optimized code.  (IOW it is not really a user type.)

Despite what people think, UTF32 is extremely wasteful, and still not free
from problems (codepoints vs. chars, denormalized sequences, etc.).

> The list is linearly scanned to find a certain mapping, not really optimal
> for fast conversions.  The backconversion unicode > character is even
> worse than the algorithm in LConvEncoding; it searches linearly through the
> mapping, which may take some time on big asian codepages.  (LConvEncoding
> uses a dual Quicksearch)

Well, one of the reasons is that the unit is mainly used for embedded
applications (which includes DOS and win9x nowadays) or special cases (like
very, very compatible installers), since on normal targets the OS routines
are used.

Nevertheless, I don't want to hide behind that. Certainly, charset is pretty much
a one-off effort and can be improved. But please, when reengineering, keep
in mind that the "special" uses are the main ones.
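
To illustrate (a quick sketch with made-up types, not the current charset
structures): a reverse table sorted on the unicode value plus a binary search
would already fix the linear unicode->char scan the quoted mail complains
about, while staying a plain static table:

type
  TReverseEntry = record
    u: Word;   // BMP codepoint
    c: Char;   // codepage byte
  end;

function FindReverse(const Tbl: array of TReverseEntry; u: Word;
  out c: Char): Boolean;
var
  lo, hi, mid: Integer;
begin
  Result := False;
  lo := Low(Tbl);
  hi := High(Tbl);
  while lo <= hi do               // classic binary search over the sorted table
  begin
    mid := (lo + hi) div 2;
    if Tbl[mid].u = u then
    begin
      c := Tbl[mid].c;
      Result := True;
      Exit;
    end;
    if Tbl[mid].u < u then
      lo := mid + 1
    else
      hi := mid - 1;
  end;
end;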
 
But if everybody tries to roll something new instead of improving existing
functionality, then we will get nowhere.

> Charset has absolutely no support to handle endianness of UTF-16 and UTF-32
> strings.

I would add separate special functions for that. No need to bog down the
standard functions that do the bulk of the work.  IOW, special functions
that do input validation at the perimeter, and functions that only do
internal conversions (e.g. ones you could base the widestring manager on).
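
A minimal sketch of what I mean (hypothetical name, nothing that exists in
the RTL today): the perimeter routine normalizes the byte order once, so the
internal converters and the widestring manager only ever see native-endian
data:

{ Perimeter function: convert UTF-16BE input to native byte order once.
  Everything behind it can then assume native-endian UTF-16. }
function UTF16BEToNative(const Src: UnicodeString): UnicodeString;
var
  i: Integer;
begin
{$IFDEF ENDIAN_BIG}
  Result := Src;                                        // already native order
{$ELSE}
  SetLength(Result, Length(Src));
  for i := 1 to Length(Src) do
    Result[i] := WideChar(SwapEndian(Word(Src[i])));    // swap each code unit
{$ENDIF}
end;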

> With static tables, I mean a table in a const-section, compiled and linked
> into the code.

Have a look at creumap. If you had looked up where and how (c)charset is
used, you would have noticed.

(See e.g. compiler/cp*.)

>  This offers optimization possibilities to the compiler
> regarding indexed access to the table content.  The tables can be
> optimized for the best-fitting data size; latin codepages can mostly be
> encoded as UCS2.  Others, especially asian codepages but also some Mac
> codepages, need UCS4.

I don't see that last bit. Yes, some Asian codepages have some chars above
the BMP, but afaik they are relatively rare.  You might get more speed (due
to better cache utilization) AND a better footprint by having a separate
UCS2 table, and a separate check (and tables) for the surrogates.
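
Roughly what I have in mind (toy table with made-up data, just to show the
layout): the bulk of the mapping sits in a dense, statically linked UCS2
table with plain indexed access, and the rare above-BMP cases go through a
separate, much smaller table that is only consulted when needed:

const
  { dense main table: one UCS2 (Word) entry per high byte, $FFFD meaning
    "no mapping".  Toy data, not a real codepage. }
  HighByteToUCS2: array[$80..$83] of Word = ($20AC, $FFFD, $0160, $017D);

function MapSingleByte(b: Byte): Word;
begin
  if b < $80 then
    Result := b                      // ASCII range maps to itself
  else if b <= $83 then
    Result := HighByteToUCS2[b]      // plain indexed access, cache friendly
  else
    Result := $FFFD;                 // outside this toy fragment
  { the rare above-BMP mappings would live in a separate, much smaller table
    that is only searched when this lookup signals such a case }
end;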

> It is easy to write a utility which creates the source code for such
> codepage-units.

See above.




