[Lazarus] Adding codepage-support to the RTL (making LConvEncoding obsolete)

Guy Fink merlin352 at globe.lu
Fri Dec 3 21:51:22 CET 2010


> In some languages some Unicode code points have a different
> uppercase/lowercase pair. For example, "i" in English (and most
> other regions) is uppercased to "I", while in Turkish it is
> "I"+Upperdot (I cannot write it here).
>
> Take a look at "Why Applications Fail With The Turkish Language" at
> http://www.i18nguy.com/unicode/turkish-i18n.htm

A string carries no information about its language, not even a UnicodeString. So it is impossible to handle this case at this level.
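To illustrate the point, here is a minimal sketch in Python (not FPC code) of what a language-aware variant would need: the caller has to pass the language in, because the string cannot supply it. The function name `upper_for_language` and the `lang` parameter are hypothetical and exist only for this example.

```python
# Hypothetical helper: the caller supplies the language, because the
# string itself carries none.
TURKIC_UPPER = {
    "i": "\u0130",   # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0131": "I",   # dotless i -> plain I
}

def upper_for_language(text: str, lang: str = "") -> str:
    """Uppercase text, applying the Turkish/Azerbaijani i rules on request."""
    if lang in ("tr", "az"):
        text = "".join(TURKIC_UPPER.get(ch, ch) for ch in text)
    # Everything else goes through the language-independent mapping.
    return text.upper()
```

Without the `lang` argument the function falls back to the language-independent mapping, which is all a plain table lookup can offer.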

The uppercase/lowercase tables have been generated purely from the official Unicode character descriptions. A character with "SMALL" in its description is replaced by the one with "CAPITAL" in the same place, and vice versa (only if the counterpart exists). You can't do more at this level. Please feel free to implement the functionality you mention; I'm sure it will be appreciated.
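The name-based rule can be sketched in a few lines of Python using the standard `unicodedata` module. This is only an illustration of the idea, not the generator that produced the tables; `name_based_upper` is a name made up for this example.

```python
import unicodedata

def name_based_upper(ch: str) -> str:
    """Map a character to its counterpart by swapping SMALL for CAPITAL
    in its Unicode name; return it unchanged if no counterpart exists."""
    try:
        words = unicodedata.name(ch).split()
    except ValueError:          # character has no name
        return ch
    if "SMALL" not in words:
        return ch
    words[words.index("SMALL")] = "CAPITAL"
    try:
        return unicodedata.lookup(" ".join(words))
    except KeyError:            # no character with that name
        return ch
```

The rule looks only at names, so it knows nothing about languages: "i" always maps to "I", and a character without a same-named counterpart stays as it is.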

> Well, in fact yes, try converting Unicode $FFFF to UTF8 (not tested,
> but 99% of plain implementations just overlook exceptions).
>
> GF> For UTF8 > Unicode, well this is a design question. There is
> GF> a function that generates an exception on a wrong sequence. What
> GF> else would you do?
>
> There are 2 problems, invalid sequences and malformed sequences, both
> are different beasts.
>
> Malformed sequences: These are unterminated UTF8 sequences. They
> start as a UTF8 sequence but end abruptly with a non-continuation
> byte. For example #128#1
>
> Invalid sequences: These are mostly written intentionally to
> bypass protection systems, trigger buffer overflows and other
> "funny" things. They exploit the ability of UTF8 to produce a
> totally different character from a given sequence. For example
> #$C0#$80, which outputs the NULL char, which is in fact dangerous
> when playing with C-style strings.

This is Pascal, not C, and in Pascal NULL (#0) is a valid character.
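For readers who want to see the #$C0#$80 case concretely, here is a small Python sketch (not the LCLProc code) of a two-byte decoder written naively and then with the checks the UTF-8 definition requires. Both function names are invented for this example.

```python
def decode_two_byte_naive(b0: int, b1: int) -> str:
    """Combine the payload bits without any validation."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))

def decode_two_byte_strict(b0: int, b1: int) -> str:
    """Same computation, but reject overlong and unterminated sequences."""
    if not 0xC2 <= b0 <= 0xDF:
        # 0xC0 and 0xC1 could only encode values below 0x80 (overlong).
        raise ValueError("invalid lead byte for a two-byte sequence")
    if b1 & 0xC0 != 0x80:
        raise ValueError("second byte is not a continuation byte")
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))
```

The naive version turns #$C0#$80 into #0, while the strict version raises an error for it, as it does for a lead byte followed by a non-continuation byte such as #$C3#1.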

Once again, I have taken most of this from LCLProc, and I agree that it can be improved, but that was not the aim of this code. The current conversion is reversible in most cases. The exceptions are characters that are defined twice, as in SHIFT_JIS: converting them to Unicode loses some information, so the back conversion may yield a different SHIFT_JIS character, although one that represents the same graphical glyph.
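A toy example in Python shows why a double definition breaks the round trip. The two-entry table below is an illustration only: the byte pairs are modeled on duplicate entries as found in Windows code page 932 (treat the specific values as an assumption), and none of this is the actual conversion table.

```python
# Toy table: two legacy codes that decode to the same Unicode character
# (U+2252). The byte values are illustrative assumptions.
TO_UNICODE = {
    b"\x81\xe0": "\u2252",
    b"\x87\x90": "\u2252",
}

# The reverse table can keep only one legacy code per character;
# here the first one listed wins.
FROM_UNICODE = {}
for code, ch in TO_UNICODE.items():
    FROM_UNICODE.setdefault(ch, code)

def round_trip(code: bytes) -> bytes:
    """Convert a legacy code to Unicode and back again."""
    return FROM_UNICODE[TO_UNICODE[code]]
```

The first code survives the round trip unchanged, the second comes back as the first: the bytes differ, the glyph does not.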

On the other hand, there is a function called UTF8FixBroken that removes invalid sequences and code points. It is not perfect either, because it is a C-style function.
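As a sketch of what a non-C-style repair could look like (in Python, and not a description of what UTF8FixBroken actually does): work on the full length of the buffer instead of stopping at the first #0, and substitute U+FFFD for bad bytes instead of dropping them. The function name `fix_broken_utf8` is made up for this example; the `errors="replace"` handler is part of Python's standard decoder.

```python
def fix_broken_utf8(data: bytes) -> str:
    """Decode the whole buffer, replacing undecodable bytes with U+FFFD.

    The length comes from the bytes object, so an embedded zero byte is
    treated as an ordinary character and not as a terminator.
    """
    return data.decode("utf-8", errors="replace")
```

For example, `fix_broken_utf8(b"a\x00b\xffc")` keeps the zero byte and everything after it, and marks the stray `0xFF` with U+FFFD.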

Once again, improvements to this code are welcome.

> Recomendation is to replace the invalid/malformed codepoint by
> "?" or
> better by the unicode error question mark '?' (U+FFFD), or raise an
> exception, but never eat it.
>
> In order to test you can use the stress test:
> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>
> Take a look over UTF8ToUnicode in fpc sources (there is one border
> case that it is still wrong).
>
> --
> Best regards,
>  José
>

