[Lazarus] Improving UTF8CharacterLength?

Jürgen Hestermann juergen.hestermann at gmx.de
Sun Aug 9 14:31:44 CEST 2015


I just had a closer look at the function UTF8CharacterLength in unit LazUTF8.
To me it looks as if it can be improved (made faster) because it checks too many things.

According to https://de.wikipedia.org/wiki/UTF-8 the number of bytes of a
UTF-8 character can be determined from the first byte alone.
So it does not seem necessary to check any of the following bytes (which also carries
the risk of reading bytes beyond the end of the string).
Wouldn't it be enough to do it like this:

------------------------------
if p=nil then
    exit(0);
if (ord(p^) and %10000000)=%00000000 then // 0xxxxxxx: top bit clear ==> 1 byte
    exit(1);
if (ord(p^) and %11100000)=%11000000 then // 110xxxxx ==> 2 bytes
    exit(2);
if (ord(p^) and %11110000)=%11100000 then // 1110xxxx ==> 3 bytes
    exit(3);
if (ord(p^) and %11111000)=%11110000 then // 11110xxx ==> 4 bytes
    exit(4);
exit(0); // continuation byte (10xxxxxx) or invalid lead byte
-------------------------------
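
For anyone who wants to try the idea outside of LazUTF8, here is a minimal sketch as a standalone Free Pascal program. The name FirstByteLength and the sample bytes are only my illustration and are not part of LazUTF8. The loop assumes a NUL-terminated buffer that holds valid UTF-8.

```pascal
program FirstByteLengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not the LazUTF8 function): derives the sequence
// length from the lead byte alone and never reads anything but p^.
function FirstByteLength(p: PChar): Integer;
var
  b: Byte;
begin
  if p = nil then
    Exit(0);
  b := Ord(p^);
  if (b and %10000000) = %00000000 then Exit(1);  // 0xxxxxxx
  if (b and %11100000) = %11000000 then Exit(2);  // 110xxxxx
  if (b and %11110000) = %11100000 then Exit(3);  // 1110xxxx
  if (b and %11111000) = %11110000 then Exit(4);  // 11110xxx
  Result := 0;  // continuation byte (10xxxxxx) or invalid lead byte
end;

const
  // 'J', U+00FC (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes), NUL
  Sample: array[0..10] of Byte =
    ($4A, $C3, $BC, $E2, $82, $AC, $F0, $9F, $98, $80, $00);

var
  p: PChar;
  n, count: Integer;
begin
  p := PChar(@Sample[0]);
  count := 0;
  while p^ <> #0 do
  begin
    n := FirstByteLength(p);
    if n = 0 then
      n := 1;  // step over a stray byte so the loop always advances
    Inc(p, n);
    Inc(count);
  end;
  WriteLn(count);  // number of characters found in Sample
end.
```

With the four characters in Sample this should count 4. Note that stepping by the returned length is only safe when the data is known to be complete UTF-8; on a truncated sequence the pointer could move past the terminator.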

Currently, the following bytes are checked even when
the first byte already determines the number of bytes.
If the following bytes are not as expected,
the sequence is not a valid UTF-8 character.
But should this be checked by the UTF8CharacterLength function?
The function's result has no error condition anyway.
I think errors should be checked when the character is accessed as a whole.
Or is there any reason to handle invalid UTF-8 bytes in a more fault-tolerant way?
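
To make the question more concrete, this is roughly what I mean by checking the character as a whole in a separate step. It is only a sketch: IsCompleteSequence and its Avail parameter are hypothetical names of mine, not anything that exists in LazUTF8. It only checks that the continuation bytes have the form 10xxxxxx and that the sequence fits into the buffer; it does not reject overlong encodings or surrogates.

```pascal
program ValidateDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: True if the bytes at p form a structurally complete
// sequence, i.e. a valid lead byte followed by the right number of
// continuation bytes. Avail is the number of readable bytes at p, so
// nothing beyond the buffer is touched.
function IsCompleteSequence(p: PByte; Avail: Integer): Boolean;
var
  len, i: Integer;
begin
  Result := False;
  if (p = nil) or (Avail < 1) then
    Exit;
  if (p^ and %10000000) = %00000000 then len := 1
  else if (p^ and %11100000) = %11000000 then len := 2
  else if (p^ and %11110000) = %11100000 then len := 3
  else if (p^ and %11111000) = %11110000 then len := 4
  else Exit;               // stray continuation byte or invalid lead byte
  if len > Avail then
    Exit;                  // sequence is cut off by the end of the buffer
  for i := 1 to len - 1 do
    if (p[i] and %11000000) <> %10000000 then
      Exit;                // expected a continuation byte here
  Result := True;
end;

const
  Good: array[0..2] of Byte = ($E2, $82, $AC);  // U+20AC
  Bad:  array[0..2] of Byte = ($E2, $41, $AC);  // second byte is ASCII 'A'
begin
  WriteLn(IsCompleteSequence(@Good[0], 3));  // complete 3-byte sequence
  WriteLn(IsCompleteSequence(@Good[0], 2));  // truncated: only 2 bytes available
  WriteLn(IsCompleteSequence(@Bad[0], 3));   // broken continuation byte
end.
```

The first call should report true and the other two false. The point of the split is that the length lookup stays a single-byte test, while the bounds and continuation checks are paid for only where the caller actually needs them.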




