[Lazarus] Adding codepage-support to the RTL (making LConvEncoding obsolete)

Fri Dec 3 12:38:38 CET 2010

Quoting Guy Fink <merlin352 at globe.lu>:

> Hello
>
> I have opened issue #0018144 in the bugtracker and uploaded a new  
> version of my codepages unit.
>
> My description of it:
>
> In September we had a discussion on the Lazarus-mailing list to  
> rewrite LConvEncoding and move the functionality to the RTL (Thread:  
> rewriting of LConvEncoding).
>
> Since then I have done a lot of coding to implement an efficient
> algorithm for both single-byte and double-byte codepages. A first
> release went to the mailing list in mid-October, mainly as a base for
> further discussion, but there were no comments or suggestions on
> it.
>
> So here is a nearly final release with many changes since the first version.

It does not compile under 2.4.2:
cp_ISO88591.pas(69,37) Error: Constant strings can't be longer than 255 chars

> Major points:
>  - The unit supports single- and double-byte codepages through the
> same functions
>  - WideString support (configurable)
>  - UTF-8 and UTF-16 support (UTF-16 needs WideStrings)

Great.

>  - Direct conversion from CP to CP without intermediate string

Nice.

>  - Uppercase and lowercase support
>  - Underlying Unicode data as of version 6.0.0 (October 11, 2010)
>  - A converter application that turns Unicode definitions into a complete
>    Pascal unit. The cp_* units are entirely generated by this app.
>  - Conversion up to 80% faster for SBCS.

Ehm, you made many functions inline, even those that are more than a
few lines of code. This will enlarge the executables and can cost
performance in normal applications (e.g. Lazarus).
You call a conversion function for each character. But most real-world
texts consist largely of ASCII characters, which need no conversion
to UTF-8. My guess is that for most texts this approach is
slower. But I have to wait until it compiles before I can test.
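
To illustrate the ASCII point, here is a minimal sketch of a fast path. It
handles ISO-8859-1 only, and the name Latin1ToUTF8Sketch is made up for this
mail; it is taken neither from the codepages unit nor from LConvEncoding.
Pure ASCII input is returned unchanged, and an ASCII prefix is copied in one
block before any per-character work starts:

```pascal
program AsciiFastPath;
{$mode objfpc}{$H+}

// Hypothetical helper for illustration only (assumed name and signature).
// Converts ISO-8859-1 bytes to UTF-8.
function Latin1ToUTF8Sketch(const S: AnsiString): AnsiString;
var
  i, Len, OutLen: Integer;
  c: Byte;
begin
  Len := Length(S);
  // Fast path: skip over the leading run of ASCII bytes.
  i := 1;
  while (i <= Len) and (Ord(S[i]) < 128) do
    Inc(i);
  if i > Len then
  begin
    // Pure ASCII: the UTF-8 form is byte-identical, no conversion needed.
    Result := S;
    Exit;
  end;
  // Slow path: every remaining byte needs at most two output bytes.
  SetLength(Result, (i - 1) + 2 * (Len - i + 1));
  if i > 1 then
    Move(S[1], Result[1], i - 1);   // copy the ASCII prefix in one block
  OutLen := i - 1;
  while i <= Len do
  begin
    c := Ord(S[i]);
    if c < 128 then
    begin
      Inc(OutLen);
      Result[OutLen] := Chr(c);
    end
    else
    begin
      // U+0080..U+00FF encode as two bytes: 110000xx 10xxxxxx
      Result[OutLen + 1] := Chr($C0 or (c shr 6));
      Result[OutLen + 2] := Chr($80 or (c and $3F));
      Inc(OutLen, 2);
    end;
    Inc(i);
  end;
  SetLength(Result, OutLen);
end;

begin
  WriteLn(Length(Latin1ToUTF8Sketch('abc')));       // ASCII only, unchanged
  WriteLn(Length(Latin1ToUTF8Sketch('caf'#$E9)));   // #$E9 becomes #$C3#$A9
end.
```

Whether this beats a table-driven call per character depends on the text,
so it would have to be measured on real files.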

>  - For DBCS up to 100 times faster

;)

> For now there are only units for ISO-8859-1, ISO-8859-2 and CP932
> (SHIFT_JIS). More will be added for the final release. The
> converter subdir has all the definition files that I could find. I
> will add them all.
>
> The units:
>   codepages.pp        : the main unit (highly configurable through codepagesdef.inc)
>   unicodemappings.pas : some definitions from unicode.org, especially the tables
>                         for uppercase, lowercase and the Unicode blocks
>   utf8.pas            : mainly the UTF-8 functions from LCLProc, plus some new ones
>   utf16.pas           : the same for UTF-16
>   acpinfo.pas         : info on the codepages supported by Windows, as published on MSDN
>
> Some first test results are attached.

Mattias