[Lazarus] Improving UTF8CharacterLength?

Thu Aug 13 11:55:29 CEST 2015

On Sun, 9 Aug 2015 14:31:44 +0200
Jürgen Hestermann <juergen.hestermann at gmx.de> wrote:

> I just had a closer look at the function UTF8CharacterLength in unit LazUTF8.
> To me it looks as if it can be improved (made faster) because it checks too many things.
> 
> According to https://de.wikipedia.org/wiki/UTF-8 the number of bytes of a
> UTF-8 character should be computable from the first byte only.

True.

> So it seems not to be necessary to check any of the following bytes (which also carries
> the danger of accessing bytes outside the range of the string).

A string always ends with a #0, and #0 is never a valid continuation
byte, so checking byte by byte makes sure you stay within range.
If you only read the first byte of a codepoint to determine its length,
you must check the length of the string yourself.

The UTF8CharacterLength function handles invalid UTF-8 gracefully.
If you know that you have a valid UTF-8 string you can simply use the
first byte of each codepoint (as you pointed out). So, for that case a
faster function can be added.
Maybe UTF8QuickCharLen or something like that.

> Isn't it enough to just do it like this:
> 
> ------------------------------
> if p=nil then
>     exit(0);
> if (ord(p^) and %10000000)=%00000000 then // First bit is not set ==> 1 byte
>     exit(1);
> if (ord(p^) and %11100000)=%11000000 then // First 2 (of first 3) bits are set ==> 2 byte
>     exit(2);
> if (ord(p^) and %11110000)=%11100000 then // First 3 (of first 4) bits are set ==> 3 byte
>     exit(3);
> if (ord(p^) and %11111000)=%11110000 then // First 4 (of first 5) bits are set ==> 4 byte
>     exit(4);
> exit(0); // invalid UTF-8 character
> -------------------------------

Yes, although afaik the compiler can optimize a CASE better than a
series of IFs.

case p^ of
  #0..#127:   exit(1);
  #192..#223: exit(2);
  #224..#239: exit(3);
  #240..#247: exit(4);
else
  exit(0); // invalid UTF-8 character, should never happen
end;

Note: because it is an optimized version, the check for p=nil can be
omitted. The caller is then responsible for passing a valid pointer.
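
As a rough sketch of how such a quick variant could look and be used (the
name UTF8QuickCharLen is only the suggestion from above, not an existing
LazUTF8 function, and the demo program around it is hypothetical). It
assumes p is not nil and points to the first byte of a codepoint in valid
UTF-8:

```pascal
program QuickCharLenDemo;
{$mode objfpc}{$H+}

// Hypothetical function, named after the suggestion in this mail.
// Assumption: p <> nil and p points to the first byte of a codepoint.
function UTF8QuickCharLen(p: PChar): Integer;
begin
  case p^ of
    #0..#127:   Result := 1;
    #192..#223: Result := 2;
    #224..#239: Result := 3;
    #240..#247: Result := 4;
  else
    Result := 0; // continuation byte or invalid lead byte
  end;
end;

var
  s: string;
  p: PChar;
  Len, Count: Integer;
begin
  // 'a', U+00E4 and U+20AC written as raw UTF-8 bytes
  s := 'a'#$C3#$A4#$E2#$82#$AC;
  p := PChar(s);
  Count := 0;
  while p^ <> #0 do begin
    Len := UTF8QuickCharLen(p);
    if Len = 0 then
      break; // not valid UTF-8: stop instead of looping forever
    inc(p, Len);
    inc(Count);
  end;
  WriteLn(Count); // 3 codepoints
end.
```

The check for Len = 0 matters here: with a result of 0 the pointer would
never advance, so a quick variant is only safe on input that is known to be
valid.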

> Currently, further bytes are checked even when
> the first byte already determines the number of bytes.
> But if the following bytes are not as expected,
> it is not a valid UTF-8 character.
> But should this be checked by the UTF8CharacterLength function?
> There is no error condition in the result of the function anyway.
> I think errors should be checked when accessing the character as a whole.
> Or is there any reason for handling invalid UTF-8 bytes in a more fault-tolerant way?

The latter: it is meant to be more fault-tolerant.

This allows the function to be used like this:

while p^<>#0 do begin
  CharLen:=UTF8CharacterLength(p);
  // ...
  inc(p,CharLen);
end;

This works even with invalid UTF-8.
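
To illustrate why checking the following bytes keeps such a loop in range,
here is a minimal sketch. TolerantCharLen is a hypothetical stand-in written
for this mail, not the actual LazUTF8 source; it only assumes that an invalid
or truncated sequence is treated as a single byte:

```pascal
program TolerantLenDemo;
{$mode objfpc}{$H+}

// Hypothetical helper, not the LazUTF8 implementation.
// Returns the sequence length only if enough continuation bytes
// (10xxxxxx) follow, otherwise 1.
function TolerantCharLen(p: PChar): Integer;
var
  Expected, i: Integer;
begin
  case p^ of
    #192..#223: Expected := 2;
    #224..#239: Expected := 3;
    #240..#247: Expected := 4;
  else
    Exit(1); // ASCII, stray continuation byte or invalid lead byte
  end;
  // Check the following bytes one at a time. The terminating #0 is not a
  // continuation byte, so the check stops before reading past the string.
  for i := 1 to Expected - 1 do
    if (Ord(p[i]) and %11000000) <> %10000000 then
      Exit(1);
  Result := Expected;
end;

var
  s: string;
  p: PChar;
  Steps: Integer;
begin
  // Truncated input: the lead byte of a 3-byte sequence followed by
  // only one continuation byte.
  s := 'a'#$E2#$82;
  p := PChar(s);
  Steps := 0;
  while p^ <> #0 do begin
    inc(p, TolerantCharLen(p));
    inc(Steps);
  end;
  WriteLn(Steps); // 3: each byte is stepped over individually
end.
```

A version that trusts only the first byte would return 3 for #$E2 here and
move p past the terminating #0.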

Mattias