[Lazarus] Improving UTF8CharacterLength?

Jürgen Hestermann juergen.hestermann at gmx.de
Thu Aug 13 12:38:00 CEST 2015


On 2015-08-13 at 11:55, Mattias Gaertner wrote:
 > A string always ends with a #0, so checking byte by byte makes sure you
 > stay within range.

Not quite true:
------------
if ((ord(p^) and %11110000) = %11100000) then
    begin  // could be 3 byte character
    if ((ord(p[1]) and %11000000) = %10000000) and
       ((ord(p[2]) and %11000000) = %10000000) then ...
    ...
------------
The current code above reads up to 3 bytes, which may step past the terminating zero byte.
UTF8CharacterLength can raise an exception in this case because it does not check the string length (or the zero byte).
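
To make the point concrete, here is a minimal sketch of a length-aware variant. It is not the LazUTF8 code; the name BoundedUTF8Length and the Remaining parameter are made up for illustration. It never reads more bytes than the caller says are available and returns 0 for a truncated or malformed sequence.

```pascal
program BoundedLen;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of LazUTF8: returns the byte length of the
// codepoint at p, reading at most Remaining bytes. Returns 0 if the
// sequence is truncated or malformed.
function BoundedUTF8Length(p: PChar; Remaining: SizeInt): Integer;
var
  Need, i: Integer;
  b: Byte;
begin
  Result := 0;
  if (p = nil) or (Remaining <= 0) then
    Exit;
  b := Ord(p^);
  if (b and %10000000) = %00000000 then
    Need := 1
  else if (b and %11100000) = %11000000 then
    Need := 2
  else if (b and %11110000) = %11100000 then
    Need := 3
  else if (b and %11111000) = %11110000 then
    Need := 4
  else
    Exit;                 // stray continuation byte or invalid lead byte
  if Need > Remaining then
    Exit;                 // sequence would run past the end of the buffer
  for i := 1 to Need - 1 do
    if (Ord(p[i]) and %11000000) <> %10000000 then
      Exit;               // not a continuation byte
  Result := Need;
end;

const
  // U+20AC EURO SIGN encoded as 3 bytes, followed by a zero byte
  Euro: array[0..3] of Byte = ($E2, $82, $AC, $00);
begin
  WriteLn(BoundedUTF8Length(PChar(@Euro[0]), 3));  // prints 3
  WriteLn(BoundedUTF8Length(PChar(@Euro[0]), 2));  // prints 0 (truncated)
end.
```

The length test comes before any continuation byte is read, so the function stays inside the buffer even when the last codepoint is cut off.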


 > If you only read the first byte of a codepoint to determine its length,
 > you must check the length of the string.

That's something that needs to be checked in all cases anyway.
So the check is duplicated (in UTF8CharacterLength and in my program).
Also, as stated at the beginning, the code in UTF8CharacterLength does
not prevent access to bytes past the terminating zero byte.


 > The UTF8CharacterLength function handles invalid UTF-8 gracefully.

Not really (see above).


 > If you know that you have a valid UTF-8 string you can simply use the
 > first byte of each codepoint (as you pointed out). So, for that case a
 > faster function can be added.
 > Maybe UTF8QuickCharLen or something like that.

Determining the character length in an invalid UTF-8 string is quite useless.
What do you do with such a result? IMO the UTF8CharacterLength
function should always assume a valid UTF-8 string.
Using this function on invalid UTF-8 strings gets you into trouble anyway.


 >> > Isn't it enough to just do it like this:
 >> >
 >> > ------------------------------
 >> > if p=nil then
 >> >     exit(0);
 >> > if (ord(p^) and %10000000)=%00000000 then // First bit is not set ==> 1 byte
 >> >     exit(1);
 >> > if (ord(p^) and %11100000)=%11000000 then // First 2 (of first 3) bits are set ==> 2 byte
 >> >     exit(2);
 >> > if (ord(p^) and %11110000)=%11100000 then // First 3 (of first 4) bits are set ==> 3 byte
 >> >     exit(3);
 >> > if (ord(p^) and %11111000)=%11110000 then // First 4 (of first 5) bits are set ==> 4 byte
 >> >     exit(4);
 >> > exit(0); // invalid UTF-8 character
 >> > -------------------------------
 > Yes, although afaik the compiler can optimize a CASE better than a
 > series of IFs.
 > case p^ of
 > #0..#127: exit(1);
 > #192..#223: exit(2);
 > #224..#239: exit(3);
 > #240..#247: exit(4);
 > else exit(0); // invalid UTF-8 character, should never happen
 > end;

Really? With the if statements only constants are used.
In the case statement you have ranges, and I think that more
code is necessary to determine whether a value is within a range
than to check whether it equals a certain constant (or not).
But I don't know how far the compiler can optimize which cases
(which may also depend on the optimization option used).
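
A small sketch like the following can at least confirm that the two variants are interchangeable, so that the choice is purely a matter of generated code. The function names LenByMask and LenByCase are made up here. It says nothing about speed, which would have to be measured with the compiler version and optimization settings in question.

```pascal
program CompareLeadByte;
{$mode objfpc}{$H+}

// IF-chain variant from this thread (bit masks compared against constants).
function LenByMask(c: Char): Integer;
begin
  if (Ord(c) and %10000000) = %00000000 then Exit(1);
  if (Ord(c) and %11100000) = %11000000 then Exit(2);
  if (Ord(c) and %11110000) = %11100000 then Exit(3);
  if (Ord(c) and %11111000) = %11110000 then Exit(4);
  Result := 0;   // invalid lead byte
end;

// CASE variant from this thread (ranges).
function LenByCase(c: Char): Integer;
begin
  case c of
    #0..#127:   Result := 1;
    #192..#223: Result := 2;
    #224..#239: Result := 3;
    #240..#247: Result := 4;
  else
    Result := 0; // invalid lead byte
  end;
end;

var
  i, Mismatches: Integer;
begin
  // Compare both variants for every possible first byte.
  Mismatches := 0;
  for i := 0 to 255 do
    if LenByMask(Chr(i)) <> LenByCase(Chr(i)) then
      Inc(Mismatches);
  WriteLn('mismatches: ', Mismatches);  // prints "mismatches: 0"
end.
```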


 > Note: because it is an optimized version the check for p=nil can be
 > omitted.

Yes.


 >> > Or is there any reason for handling invalid UTF-8-bytes more fault-tolerant?
 > The last, more fault-tolerant.
 > This allows to use the function like this:
 > while p^<>#0 do begin
 >   CharLen:=UTF8CharacterLength(p);
 >   // ...
 >   inc(p,CharLen);
 > end;
 > This works even with invalid UTF8.

Not true (as stated at the beginning).
You may already get an exception within UTF8CharacterLength.

It makes no sense to compensate for one error (an invalid UTF-8 string)
with another (returning wrong results from UTF8CharacterLength).
That will never result in a correct program.
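
As a rough sketch of what I mean by checking the length in the calling code: the loop below uses only the first byte of each codepoint and compares against an end pointer instead of relying on the zero byte. The names LeadByteLength and CountCodepoints are made up for illustration. Continuation bytes are deliberately not validated, on the assumption that the string is valid UTF-8; an invalid lead byte or a truncated last codepoint is reported instead of being skipped over.

```pascal
program WalkUTF8;
{$mode objfpc}{$H+}

// First-byte-only length as proposed in this thread; 0 means invalid lead byte.
function LeadByteLength(c: Char): Integer;
begin
  case c of
    #0..#127:   Result := 1;
    #192..#223: Result := 2;
    #224..#239: Result := 3;
    #240..#247: Result := 4;
  else
    Result := 0;
  end;
end;

// Hypothetical helper: counts the codepoints in Len bytes starting at p.
// Returns -1 as soon as an invalid lead byte or a truncated sequence is found.
function CountCodepoints(p: PChar; Len: SizeInt): Integer;
var
  PEnd: PChar;
  CharLen: Integer;
begin
  Result := 0;
  if p = nil then
    Exit;
  PEnd := p + Len;
  while p < PEnd do
  begin
    CharLen := LeadByteLength(p^);
    // A zero length would loop forever; a too-long one would leave the buffer.
    if (CharLen = 0) or (CharLen > PEnd - p) then
      Exit(-1);
    Inc(p, CharLen);
    Inc(Result);
  end;
end;

const
  // 'A', U+20AC EURO SIGN (3 bytes), zero byte
  Sample: array[0..4] of Byte = ($41, $E2, $82, $AC, $00);
begin
  WriteLn(CountCodepoints(PChar(@Sample[0]), 4));  // prints 2
  WriteLn(CountCodepoints(PChar(@Sample[0]), 3));  // prints -1 (truncated)
end.
```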




