[Lazarus] Improving UTF8CharacterLength?

Jürgen Hestermann juergen.hestermann at gmx.de
Thu Aug 13 11:13:38 CEST 2015


Am 2015-08-09 um 14:31 schrieb Jürgen Hestermann:
> I just had a closer look at the function UTF8CharacterLength in unit LazUTF8.
> To me it looks as if it can be improved (made faster) because it checks too many things.
>
> According to https://de.wikipedia.org/wiki/UTF-8 the number of bytes of an
> UTF-8-character should be computable by the first byte only.
> So it does not seem necessary to check any of the following bytes (which also
> carries the danger of accessing bytes outside the range of the string).
> Isn't it enough to just do it like this:
>
> ------------------------------
> if p=nil then
>    exit(0);
> if (ord(p^) and %10000000)=%00000000 then // First bit is not set ==> 1 byte
>    exit(1);
> if (ord(p^) and %11100000)=%11000000 then // First 2 (of first 3) bits are set ==> 2 byte
>    exit(2);
> if (ord(p^) and %11110000)=%11100000 then // First 3 (of first 4) bits are set ==> 3 byte
>    exit(3);
> if (ord(p^) and %11111000)=%11110000 then // First 4 (of first 5) bits are set ==> 4 byte
>    exit(4);
> exit(0); // invalid UTF-8 character
> -------------------------------
>
> Currently, further bytes are checked even when
> the first byte already determines the number of bytes.
> But if the following bytes were not as expected,
> it would not be a valid UTF-8 character anyway.
> But should this be checked by the UTF8CharacterLength function?
> There is no error condition in the result of the function anyway.
> I think errors should be checked when accessing the character as a whole.
> Or is there any reason to handle invalid UTF-8 bytes in a more fault-tolerant way?
>

Really, nobody has an opinion on this topic?
Strange.
Well, I take this as an implicit answer that there is no error in my reasoning.
I am now using my own "UTF8CharacterLength" but I thought others could make use of it too.
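For reference, the proposal above can be turned into a complete, compilable sketch. The function name UTF8CharacterLengthFast and the small demo program are my own additions for illustration, not part of LazUTF8; the classification logic is exactly the one quoted in the original mail (lead byte only, no look-ahead into continuation bytes):

```pascal
program Utf8LenDemo;
{$mode objfpc}
{$assertions on}

// Sketch of the proposed fast variant: the length of a UTF-8 sequence
// is derived from the lead byte alone. Continuation bytes are not
// validated here; a bare continuation byte or invalid lead byte yields 0.
function UTF8CharacterLengthFast(p: PChar): Integer;
begin
  if p = nil then
    Exit(0);
  if (Ord(p^) and %10000000) = %00000000 then  // 0xxxxxxx ==> 1 byte (ASCII)
    Exit(1);
  if (Ord(p^) and %11100000) = %11000000 then  // 110xxxxx ==> 2-byte sequence
    Exit(2);
  if (Ord(p^) and %11110000) = %11100000 then  // 1110xxxx ==> 3-byte sequence
    Exit(3);
  if (Ord(p^) and %11111000) = %11110000 then  // 11110xxx ==> 4-byte sequence
    Exit(4);
  Result := 0;  // continuation byte or invalid lead byte
end;

begin
  Assert(UTF8CharacterLengthFast(PChar('A')) = 1);                // U+0041
  Assert(UTF8CharacterLengthFast(PChar(#$C3#$A4)) = 2);           // U+00E4
  Assert(UTF8CharacterLengthFast(PChar(#$E2#$82#$AC)) = 3);       // U+20AC
  Assert(UTF8CharacterLengthFast(PChar(#$F0#$9F#$99#$82)) = 4);   // U+1F642
  Assert(UTF8CharacterLengthFast(PChar(#$A4)) = 0);               // bare continuation byte
  Assert(UTF8CharacterLengthFast(nil) = 0);
  WriteLn('ok');
end.
```

Note that this only reads the single byte at p^, so it cannot run past the end of the string; validating the continuation bytes (and rejecting overlong encodings) would be a separate concern, done when the character is decoded as a whole.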




