[Lazarus] Adding codepage-support to the RTL (making LConvEncoding obsolete)

Fri Dec 3 20:09:11 CET 2010

Hello Lazarus-List,

Friday, December 3, 2010, 7:27:51 PM, you wrote:

GF> The upper/lower tables in Unicodemappings are pure Unicode
GF> (please take a look). They are in NO way country dependent. What
GF> make you think " tables seems to be the country agnostic ones"

In some languages certain Unicode code points have a different
uppercase/lowercase pair. For example, "i" is uppercased to "I" in
English (and most other locales), while in Turkish it becomes "I" with
a dot above (U+0130; I cannot type it here).

Take a look at "Why Applications Fail With The Turkish Language" at
http://www.i18nguy.com/unicode/turkish-i18n.html
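
To make the Turkish case concrete, here is a minimal sketch in Python
(used only for illustration; upper_tr is a hypothetical helper, not
part of any library or of the RTL). Python's str.upper() is
locale-independent, so the Turkish pairs have to be handled
explicitly:

```python
# Country-agnostic uppercasing vs. a Turkish-aware variant.
# upper_tr is a hypothetical helper written for this example only.

DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def upper_tr(text: str) -> str:
    """Uppercase text using the Turkish i/I pairs."""
    # Map the two Turkish-specific letters first, then let the
    # default (country-agnostic) mapping handle everything else.
    text = text.replace("i", DOTTED_CAPITAL_I)
    text = text.replace(DOTLESS_SMALL_I, "I")
    return text.upper()


if __name__ == "__main__":
    print("i".upper() == "I")                 # default mapping
    print(upper_tr("i") == DOTTED_CAPITAL_I)  # Turkish mapping
```

The point is only that a single upper/lower table cannot serve both
mappings at once; the caller has to say which language rules apply.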

GF> As I said,  I have taken this from LCLProc.
GF> The Unicode > UTF8 should be ok... It does not generate a
GF> codepoint for a character outside the Unicoderange.

Well, in fact it can. Try converting Unicode $FFFF to UTF8 (not
tested, but 99% of plain implementations simply overlook the
exceptional cases).
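
As an illustration of the kind of checks a plain encoder tends to
skip, here is a minimal Python sketch. Which exceptional inputs matter
is an assumption on my side: the sketch rejects surrogates
(U+D800..U+DFFF) and values above U+10FFFF. U+FFFF is a noncharacter
that still has a well-defined three-byte encoding (EF BF BF); whether
to reject it is a policy choice that the sketch leaves out.
encode_code_point is a hypothetical name, not an RTL function.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point as UTF-8, rejecting values that are
    not Unicode scalar values (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate code point: {cp:#x}")
    if cp < 0x80:                      # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```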

GF> For UTF8 > Unicode, well this is a design question. There is
GF> a function that generates an exception on a wrong sequence. What
GF> else would you do?

There are 2 problems, invalid sequences and malformed sequences, and
they are different beasts.

Malformed sequences: These are unterminated UTF8 sequences. They
start as a UTF8 sequence but end abruptly with a byte that is not a
continuation byte. For example #128#1

Invalid sequences: These are mostly written intentionally to bypass
protection systems, trigger buffer overflows and other "funny"
things. They exploit the way a lenient UTF8 decoder can be made to
produce a character that never appears as such in the byte stream.
For example #$C0#$80, which decodes to the NULL char, which is in
fact dangerous when dealing with C-style strings.
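
To show the difference, here is a minimal Python sketch restricted to
two-byte sequences. Both function names are hypothetical and the code
is an illustration, not the LCLProc or FPC implementation. The naive
version only masks and shifts, so #$C0#$80 quietly becomes 0; the
strict version rejects both kinds of bad input.

```python
def naive_decode_2byte(b0: int, b1: int) -> int:
    """Decode a two-byte sequence with no validation at all."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def strict_decode_2byte(b0: int, b1: int) -> int:
    """Decode a two-byte sequence, rejecting bad input."""
    if b0 & 0xE0 != 0xC0:
        raise ValueError("not a two-byte lead byte")
    if b1 & 0xC0 != 0x80:
        # "malformed" in the sense used above: sequence cut short
        raise ValueError("missing continuation byte")
    cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    if cp < 0x80:
        # "invalid" in the sense used above: overlong encoding
        raise ValueError("overlong encoding")
    return cp


if __name__ == "__main__":
    print(naive_decode_2byte(0xC0, 0x80))   # NULL slips through
    try:
        strict_decode_2byte(0xC0, 0x80)
    except ValueError as exc:
        print("rejected:", exc)
```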

Recomendation is to replace the invalid/malformed codepoint by "?" or
better by the unicode error question mark '?' (U+FFFD), or raise an
exception, but never eat it.
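
The three policies can be compared with Python's built-in UTF-8
decoder (an illustration only; the wrapper names are hypothetical).
The "ignore" handler is the "eat it" behaviour: dropping a bad byte
can splice together text that an earlier filter never saw.

```python
def decode_replace(data: bytes) -> str:
    """Substitute U+FFFD for every bad byte."""
    return data.decode("utf-8", errors="replace")


def decode_strict(data: bytes) -> str:
    """Raise UnicodeDecodeError on the first bad byte."""
    return data.decode("utf-8", errors="strict")


def decode_drop(data: bytes) -> str:
    """Silently eat bad bytes (the behaviour to avoid)."""
    return data.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    sample = b"<scr\xc0ipt>"        # stray byte inside a tag name
    print(repr(decode_replace(sample)))
    print(repr(decode_drop(sample)))
    try:
        decode_strict(sample)
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```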

To test, you can use the stress test: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

Take a look at UTF8ToUnicode in the FPC sources (there is one border
case that is still wrong).

-- 
Best regards,
 José