[Lazarus] Improving UTF8CharacterLength?

Mattias Gaertner nc-gaertnma at netcologne.de
Thu Aug 13 13:01:16 CEST 2015


On Thu, 13 Aug 2015 12:38:00 +0200
Jürgen Hestermann <juergen.hestermann at gmx.de> wrote:

> On 2015-08-13 at 11:55, Mattias Gaertner wrote:
>  > A string always ends with a #0, so checking byte by byte makes sure you
>  > stay within range.
> 
> Not quite true:
> ------------
> if ((ord(p^) and %11110000) = %11100000) then
>     begin  // could be 3 byte character
>     if ((ord(p[1]) and %11000000) = %10000000) and
>        ((ord(p[2]) and %11000000) = %10000000) then ...
>     ...
> ------------
> In the above (current) code, 3 bytes are accessed, which may read past the zero byte.

The "and" operator stops evaluating if the left side is already false.
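For illustration, a minimal sketch (not the actual Lazarus code): with FPC's default short-circuit evaluation ({$B-}), the right operand of "and" is only evaluated when the left one is True, so p[2] is never touched once p[1] turns out to be the terminating #0.

------------
{$mode objfpc}{$H+}
var
  s: string;
  p: PChar;
begin
  s := #$E2;       // truncated 3-byte sequence; PChar(s) is followed by #0
  p := PChar(s);
  if (ord(p^) and %11110000) = %11100000 then        // could be 3 byte character
  begin
    if ((ord(p[1]) and %11000000) = %10000000) and   // False here: p[1] = #0
       ((ord(p[2]) and %11000000) = %10000000) then  // never evaluated
      WriteLn('valid 3-byte codepoint')
    else
      WriteLn('truncated or invalid sequence');
  end;
end.
------------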

> That's something that needs to be checked in all cases anyway.

No.
That's the advantage of PChar and ASCIIZ.
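That's what lets a scanner walk the data without a separate length check. A minimal sketch (DumpCodepointLengths is just an illustrative name; it assumes UTF8CharacterLength, the function under discussion, returns at least 1 for any non-#0 byte):

------------
// Sketch only: walking a buffer codepoint by codepoint. No explicit
// length is needed, because the terminating #0 stops the loop and the
// byte checks inside UTF8CharacterLength never read past it.
procedure DumpCodepointLengths(p: PChar);
var
  len: Integer;
begin
  while p^ <> #0 do
  begin
    len := UTF8CharacterLength(p);  // assumed to return at least 1
    WriteLn('codepoint of ', len, ' byte(s)');
    Inc(p, len);
  end;
end;
------------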

>[...]
>  > If you know that you have a valid UTF-8 string you can simply use the
>  > first byte of each codepoint (as you pointed out). So, for that case a
>  > faster function can be added.
>  > Maybe UTF8QuickCharLen or something like that.
> 
> Determining the character length of an invalid UTF-8 string is quite useless.
> What do you do with such a result?

Skip the invalid bytes and continue.
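For the faster variant mentioned above, a sketch of what it could look like (UTF8QuickCharLen is only the name suggested in this thread, not an existing LazUtils function); it trusts the input and decides from the lead byte alone:

------------
// Sketch of the proposed fast variant: assumes valid UTF-8 and only
// inspects the lead byte. On a trail byte or an invalid lead byte the
// result is meaningless, which is acceptable under that assumption.
function UTF8QuickCharLen(p: PChar): Integer;
begin
  case p^ of
    #0..#191:   Result := 1;  // ASCII; trail bytes never start a valid codepoint
    #192..#223: Result := 2;  // %110xxxxx
    #224..#239: Result := 3;  // %1110xxxx
  else
    Result := 4;              // %11110xxx
  end;
end;
------------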

> IMO the UTF8CharacterLength
> function should always assume a valid UTF-8 string.
> Using this function on an invalid UTF-8 string leads to problems anyway.

It is already used this way in many places. If you need a function with
a different behavior, you need to add a new one.

 
>[...]
>  > Yes, although afaik the compiler can optimize a CASE better than a series of IFs
>[...]
> 
> Really?

Look at the produced assembler or do some benchmarking.
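For a benchmark it helps to have both forms at hand. As a comparison point (sketch only, same assumed UTF8QuickCharLen signature as above), the IF version of the lead-byte dispatch:

------------
// Same dispatch as an IF chain, for comparing against the CASE version.
// Whether the CASE compiles to a jump table (and wins) depends on the
// target CPU and FPC version, so measure rather than guess.
function UTF8QuickCharLenIf(p: PChar): Integer;
begin
  if p^ < #192 then
    Result := 1
  else if p^ < #224 then
    Result := 2
  else if p^ < #240 then
    Result := 3
  else
    Result := 4;
end;
------------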

> [...]

Mattias



