[Lazarus] Improving UTF8CharacterLength?

Jürgen Hestermann juergen.hestermann at gmx.de
Thu Aug 13 14:05:19 CEST 2015


On 2015-08-13 at 13:01, Mattias Gaertner wrote:
> On Thu, 13 Aug 2015 12:38:00 +0200
> Jürgen Hestermann <juergen.hestermann at gmx.de> wrote:
>
>> On 2015-08-13 at 11:55, Mattias Gaertner wrote:
>>   > A string always ends with a #0, so checking byte by byte makes sure you
>>   > stay within range.
>>
>> Not quite true:
>> ------------
>> if ((ord(p^) and %11110000) = %11100000) then
>>      begin  // could be 3 byte character
>>      if ((ord(p[1]) and %11000000) = %10000000) and
>>         ((ord(p[2]) and %11000000) = %10000000) then ...
>>      ...
>> ------------
>> In the above (current) code, 3 bytes are accessed, which may step beyond the zero byte.
> The "and" operator stops evaluating if the left side is already false.

Yes, I see now that I misinterpreted the code.
You are right: as long as a terminating zero byte exists,
UTF8CharacterLength does not access any bytes beyond it.
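
To make the short-circuit point concrete, here is a minimal sketch. It assumes
the Free Pascal default {$B-} (short-circuit boolean evaluation); the helper
IsContinuation and the counter Reads are hypothetical names used only for this
illustration and are not part of the LCL code.

```pascal
program ShortCircuitDemo;
{$mode objfpc}{$H+}
{$B-}  // short-circuit boolean evaluation (the Free Pascal default)

var
  Reads: Integer = 0;  // hypothetical counter: how many trailing bytes were inspected

// Hypothetical helper: True if p[i] is a UTF-8 continuation byte (10xxxxxx).
function IsContinuation(p: PChar; i: Integer): Boolean;
begin
  Inc(Reads);
  Result := (Ord(p[i]) and %11000000) = %10000000;
end;

// Same shape as the quoted check: lead byte first, then both trailing bytes.
function IsComplete3(p: PChar): Boolean;
begin
  Result := ((Ord(p^) and %11110000) = %11100000)
            and IsContinuation(p, 1)
            and IsContinuation(p, 2);
end;

const
  // A 3-byte lead byte (1110xxxx) directly followed by the terminating #0.
  Truncated: array[0..1] of Byte = ($E2, 0);
begin
  WriteLn('complete: ', IsComplete3(PChar(@Truncated[0])));
  // Only p[1] (the #0) was inspected; p[2] was never read.
  WriteLn('trailing bytes inspected: ', Reads);
end.
```

With {$B+} (complete boolean evaluation) the second call would be made as well
and p[2] would be read, so the guarantee depends on that compiler setting.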

Still, I think it would be better to return 3 whenever the lead byte announces
a 3-byte sequence, because that single byte on its own is not a valid UTF-8 character.
If I relied on a result of 1, I would treat this byte as a valid UTF-8 character,
which would be wrong, so I have to apply further checks to handle this situation anyway.
In that case I can just as well check myself whether the 3 or 4 bytes of the correct result exist.
I would not lose anything for invalid UTF-8 strings, but I would gain performance
whenever I can guarantee a valid UTF-8 string.
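
A rough sketch of what I mean, not the actual LCL implementation: the names
LeadByteLength and IsCompleteAt are made up for this mail, and the caller is
assumed to know the end of its buffer (pEnd). The length function looks at the
lead byte only; the extra check is needed only for untrusted input. It does not
reject overlong forms or lead bytes above $F4.

```pascal
program LeadByteLengthDemo;
{$mode objfpc}{$H+}

// Hypothetical variant: return the length announced by the lead byte alone,
// without reading any following byte. Returns 0 for nil and 1 for bytes that
// cannot start a multi-byte sequence (ASCII, stray continuation bytes).
function LeadByteLength(p: PChar): Integer;
var
  b: Byte;
begin
  if p = nil then Exit(0);
  b := Ord(p^);
  if (b and %11100000) = %11000000 then Result := 2       // 110xxxxx
  else if (b and %11110000) = %11100000 then Result := 3  // 1110xxxx
  else if (b and %11111000) = %11110000 then Result := 4  // 11110xxx
  else Result := 1;
end;

// Hypothetical caller-side check for untrusted input: are all announced bytes
// present before pEnd, and are the trailing ones continuation bytes (10xxxxxx)?
function IsCompleteAt(p, pEnd: PChar): Boolean;
var
  len, i: Integer;
begin
  len := LeadByteLength(p);
  if (len = 0) or (pEnd - p < len) then Exit(False);
  for i := 1 to len - 1 do
    if (Ord(p[i]) and %11000000) <> %10000000 then Exit(False);
  // A single byte is only a valid character if it is ASCII.
  Result := (len > 1) or (Ord(p^) < %10000000);
end;

const
  Euro: array[0..3] of Byte = ($E2, $82, $AC, 0);  // U+20AC plus terminating #0
  Cut:  array[0..1] of Byte = ($E2, 0);            // lead byte, then #0
var
  p: PChar;
begin
  p := PChar(@Euro[0]);
  WriteLn(LeadByteLength(p), ' ', IsCompleteAt(p, p + 3));
  p := PChar(@Cut[0]);
  // The announced length is still 3; the caller detects the truncation.
  WriteLn(LeadByteLength(p), ' ', IsCompleteAt(p, p + 1));
end.
```

For a string that is known to be valid UTF-8, the caller would use
LeadByteLength alone and skip IsCompleteAt entirely.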

And if no zero byte exists (for whatever reason) it currently fails anyway.




