[Lazarus] Improving UTF8CharacterLength?
Mattias Gaertner
nc-gaertnma at netcologne.de
Thu Aug 13 13:01:16 CEST 2015
On Thu, 13 Aug 2015 12:38:00 +0200
Jürgen Hestermann <juergen.hestermann at gmx.de> wrote:
> Am 2015-08-13 um 11:55 schrieb Mattias Gaertner:
> > A string always ends with a #0, so checking byte by byte makes sure you
> > stay within range.
>
> Not quite true:
> ------------
> if ((ord(p^) and %11110000) = %11100000) then
> begin // could be 3 byte character
>   if ((ord(p[1]) and %11000000) = %10000000) and
>      ((ord(p[2]) and %11000000) = %10000000) then ...
>   ...
> ------------
> In the above (current) code, 3 bytes are accessed, which may read past the zero byte.
The "and" operator stops evaluating if left side is already false.
> That's something that needs to be checked in all cases anyway.
No.
That's the advantage of PChar and ASCIIZ.
>[...]
> > If you know that you have a valid UTF-8 string you can simply use the
> > first byte of each codepoint (as you pointed out). So, for that case a
> > faster function can be added.
> > Maybe UTF8QuickCharLen or something like that.
>
> Determining the character length of an invalid UTF-8 string is quite useless.
> What do you do with such a result?
Skip.
> IMO the UTF8CharacterLength
> function should always assume a valid UTF-8 string.
> Using this function on invalid UTF-8 strings lets you run into problems anyway.
It is already used this way in many places. If you need a function with
a different behavior, you need to add a new one.
>[...]
> > Yes, although afaik the compiler can optimize a CASE better than a series of IFs
>[...]
>
> Really?
Look at the produced assembler or do some benchmarking.
> [...]
Mattias