[Lazarus] Improving UTF8CharacterLength?
Mattias Gaertner
nc-gaertnma at netcologne.de
Thu Aug 13 11:55:29 CEST 2015
On Sun, 9 Aug 2015 14:31:44 +0200
Jürgen Hestermann <juergen.hestermann at gmx.de> wrote:
> I just had a closer look at the function UTF8CharacterLength in unit LazUTF8.
> To me it looks as if it can be improved (made faster) because it checks too many things.
>
> According to https://de.wikipedia.org/wiki/UTF-8 the number of bytes of a
> UTF-8 character should be computable from the first byte alone.
True.
> So it does not seem necessary to check any following bytes (which also carries
> the danger of accessing bytes outside the range of the string).
A string always ends with a #0, so checking byte by byte makes sure you
stay within range.
If you only read the first byte of a codepoint to determine its length,
you must also check the length of the string, so that a truncated
sequence at the end does not carry you past the terminator.
The UTF8CharacterLength function handles invalid UTF-8 gracefully.
If you know that you have a valid UTF-8 string you can simply use the
first byte of each codepoint (as you pointed out). So, for that case a
faster function can be added.
Maybe UTF8QuickCharLen or something like that.
> Isn't it enough to just do it like this:
>
> ------------------------------
> if p=nil then
> exit(0);
> if (ord(p^) and %10000000)=%00000000 then // First bit is not set ==> 1 byte
> exit(1);
> if (ord(p^) and %11100000)=%11000000 then // First 2 (of first 3) bits are set ==> 2 byte
> exit(2);
> if (ord(p^) and %11110000)=%11100000 then // First 3 (of first 4) bits are set ==> 3 byte
> exit(3);
> if (ord(p^) and %11111000)=%11110000 then // First 4 (of first 5) bits are set ==> 4 byte
> exit(4);
> exit(0); // invalid UTF-8 character
> -------------------------------
Yes, although afaik the compiler can optimize a CASE better than a
series of IFs.
  case p^ of
    #0..#127:   exit(1);
    #192..#223: exit(2);
    #224..#239: exit(3);
    #240..#247: exit(4);
  else
    exit(0); // invalid UTF-8 character, should never happen
  end;
Note: because this is an optimized version for input that is known to be
valid, the check for p=nil can be omitted.
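
For illustration, here is a minimal sketch of how such a quick function could
be combined with the length check mentioned above. The name UTF8QuickCharLen
is only the suggestion from this mail and CountCodepoints is a made-up helper;
neither is part of LazUTF8.

```pascal
program QuickLenDemo;
{$mode objfpc}{$H+}

// Hypothetical function, not part of LazUTF8. Looks at the first byte only,
// so p must point at the start of a codepoint.
function UTF8QuickCharLen(p: PChar): Integer;
begin
  case p^ of
    #0..#127:   Result := 1;
    #192..#223: Result := 2;
    #224..#239: Result := 3;
    #240..#247: Result := 4;
  else
    Result := 0; // continuation byte or invalid lead byte
  end;
end;

// Hypothetical helper: counts codepoints and checks the remaining length
// before every step. Returns -1 if the data is invalid or truncated.
function CountCodepoints(const s: string): Integer;
var
  p, pEnd: PChar;
  CharLen: Integer;
begin
  Result := 0;
  if s = '' then
    exit;
  p := PChar(s);
  pEnd := p + Length(s);
  while p < pEnd do begin
    CharLen := UTF8QuickCharLen(p);
    if (CharLen = 0) or (CharLen > pEnd - p) then
      exit(-1); // would not advance, or would step past the end
    inc(p, CharLen);
    inc(Result);
  end;
end;

begin
  WriteLn(CountCodepoints('abc'));        // 3
  WriteLn(CountCodepoints(#$C3#$A4'b'));  // 2 (two-byte sequence plus 'b')
  WriteLn(CountCodepoints(#$C3));         // -1 (sequence cut off)
end.
```

The explicit comparison against pEnd is what replaces the byte-by-byte check:
a lead byte that announces more bytes than the string still holds is rejected
instead of being skipped over.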
> Currently, further bytes are checked even when
> the first byte already determines the number of bytes.
> But if the following bytes would not be as expected
> it would not be a valid UTF-8-character.
> But should this be checked by the UTF8CharacterLength function?
> There is no error condition in the result of the function anyway.
> I think errors should be checked when accessing the character as a whole.
> Or is there any reason for handling invalid UTF-8-bytes more fault-tolerant?
The latter: it is meant to be more fault-tolerant.
This allows the function to be used like this:
  while p^<>#0 do begin
    CharLen:=UTF8CharacterLength(p);
    // ...
    inc(p,CharLen);
  end;
This works even with invalid UTF8.
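
To illustrate why such a loop terminates on broken input, here is a small
self-contained sketch. TolerantCharLen and CountSteps are made-up names and
this is not the LazUTF8 implementation; the sketch only assumes a length
function that checks the following bytes and falls back to 1 when they do not
fit.

```pascal
program TolerantLoopDemo;
{$mode objfpc}{$H+}

// Hypothetical helper, not the LazUTF8 code. Returns the sequence length
// if the expected continuation bytes follow, otherwise 1. It never returns
// 0, so a loop using it always advances.
function TolerantCharLen(p: PChar): Integer;
var
  Expected, i: Integer;
begin
  case p^ of
    #192..#223: Expected := 2;
    #224..#239: Expected := 3;
    #240..#247: Expected := 4;
  else
    exit(1); // ASCII, stray continuation byte, or invalid lead byte
  end;
  // Every following byte must look like 10xxxxxx. The terminating #0
  // fails this test, so the check stops at the end of the string.
  for i := 1 to Expected - 1 do
    if (ord(p[i]) and %11000000) <> %10000000 then
      exit(1);
  Result := Expected;
end;

// Hypothetical helper: runs the loop from the mail and counts iterations.
function CountSteps(const s: string): Integer;
var
  p: PChar;
begin
  Result := 0;
  p := PChar(s);
  while p^ <> #0 do begin
    inc(p, TolerantCharLen(p));
    inc(Result);
  end;
end;

begin
  WriteLn(CountSteps(#$C3#$A4'b')); // 2: valid two-byte sequence, then 'b'
  WriteLn(CountSteps(#$C3'b'));     // 2: lone lead byte, then 'b'
  WriteLn(CountSteps(#$E2#$82));    // 2: truncated three-byte sequence
end.
```

In each case the loop stops exactly at the terminating #0, because a broken
sequence is consumed one byte at a time instead of being skipped as a whole.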
Mattias