[Lazarus] Improving UTF8CharacterLength?

Thu Aug 13 14:53:50 CEST 2015

On 2015-08-13 at 14:19, Mattias Gaertner wrote:
 > On Thu, 13 Aug 2015 14:05:19 +0200
 > Jürgen Hestermann <juergen.hestermann at gmx.de> wrote:
 >> Still, I think it would be better to return 3 when the lead byte announces
 >> a 3-byte sequence, because that 1 byte alone does not form a valid UTF-8 character.
 >> If I relied on a result of 1, I would treat this 1 byte as a valid UTF-8 character,
 >> which would be wrong, so I have to apply further checks to cope with this situation anyway.
 > Do you mean like UTF8CharacterStrictLength?

I did not know that yet another, very similar function (UTF8CharacterStrictLength) exists.
Having so many functions that do nearly the same thing is very confusing.

If I am right (after a quick look), UTF8CharacterStrictLength returns 0
in cases where UTF8CharacterLength would return 1.
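
To make the difference concrete, here is a minimal sketch of the two behaviours. StrictLenSketch, LenientLenSketch and IsCont are hypothetical names of my own, not the LazUTF8 source, and the check is only structural (overlong forms and surrogates are not rejected).

```pascal
{$mode objfpc}{$H+}

// Hypothetical helper: True if P^ is a continuation byte (10xxxxxx).
function IsCont(P: PChar): Boolean;
begin
  Result := (Ord(P^) and $C0) = $80;
end;

// Hypothetical strict variant: byte length of the sequence at P,
// or 0 if the lead byte is invalid or the sequence is truncated.
function StrictLenSketch(P: PChar): Integer;
var
  B: Byte;
  N, I: Integer;
begin
  Result := 0;
  if P = nil then Exit;
  B := Ord(P^);
  if B < $80 then Exit(1);                 // ASCII, including the #0 terminator
  if (B and $E0) = $C0 then N := 2
  else if (B and $F0) = $E0 then N := 3
  else if (B and $F8) = $F0 then N := 4
  else Exit;                               // stray continuation or invalid lead byte
  for I := 1 to N - 1 do
    if not IsCont(P + I) then Exit;        // #0 is not a continuation byte, so we
                                           // never read past the terminator
  Result := N;
end;

// Hypothetical lenient variant: same as above, but an invalid
// sequence is reported as a single byte instead of 0.
function LenientLenSketch(P: PChar): Integer;
begin
  Result := StrictLenSketch(P);
  if (Result = 0) and (P <> nil) then Result := 1;
end;
```

With the truncated euro sign #$E2#$82 (third byte missing), StrictLenSketch returns 0 and LenientLenSketch returns 1. For the complete sequence #$E2#$82#$AC both return 3.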

IMO this does not change the underlying problem: if you have an invalid UTF-8
string, you cannot repair it inside functions like UTF8CharacterLength
or UTF8CharacterStrictLength. There is no way around it other than:

1.) Make sure your strings are all valid UTF-8 or
2.) Do error checking and error handling in your program yourself

In both cases I think no further error handling is needed within such helper routines.
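
As an illustration of option 2, the check can be done once at the point where a string enters the program. The following is only a sketch under my own assumptions: FirstInvalidUtf8Pos is a hypothetical name, not a LazUTF8 function, and it tests structure only (overlong forms and surrogates pass).

```pascal
{$mode objfpc}{$H+}

// Hypothetical validator: returns the 1-based index of the first byte that
// does not start a structurally valid UTF-8 sequence, or 0 if S passes.
function FirstInvalidUtf8Pos(const S: string): Integer;
var
  I, N, K, Len: Integer;
  B: Byte;
begin
  Len := Length(S);
  I := 1;
  while I <= Len do
  begin
    B := Ord(S[I]);
    if B < $80 then N := 1
    else if (B and $E0) = $C0 then N := 2
    else if (B and $F0) = $E0 then N := 3
    else if (B and $F8) = $F0 then N := 4
    else Exit(I);                          // stray continuation or invalid lead byte
    if I + N - 1 > Len then Exit(I);       // sequence would run past the end
    for K := 1 to N - 1 do
      if (Ord(S[I + K]) and $C0) <> $80 then Exit(I);
    Inc(I, N);
  end;
  Result := 0;
end;
```

For 'a'#$E2#$82#$AC'b' this returns 0, for the truncated 'a'#$E2#$82 it returns 2. Once a string has passed such a check, the per-character helpers do not need to repeat it.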

 >> Then I can also check whether the 3 or 4 bytes of the correct result exist.
 >> I would not lose anything for invalid UTF-8 strings, but I would gain performance if
 >> I can guarantee valid UTF-8 strings.
 > For this the UTF8QuickCharLen function would suffice, would it not?

Yes, of course.
Although I wonder whether yet another function needs to be added.
Keeping an overview of all the UTF-8 functions is already quite difficult,
and I still think that error checking should not be part of such helper functions,
so that only one of them is needed.
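
What I have in mind is roughly the following sketch, which looks at the lead byte only. QuickLenSketch and CountCodepointsSketch are hypothetical names of my own, not existing LazUTF8 functions, and the code assumes the caller guarantees a valid, zero-terminated UTF-8 string.

```pascal
{$mode objfpc}{$H+}

// Hypothetical unchecked variant: sequence length taken from the lead byte
// alone. Precondition: P points to the first byte of a valid UTF-8 sequence.
function QuickLenSketch(P: PChar): Integer;
var
  B: Byte;
begin
  B := Ord(P^);
  if B < $80 then Result := 1
  else if B < $E0 then Result := 2   // on invalid input a stray continuation
                                     // byte would also land here
  else if B < $F0 then Result := 3
  else Result := 4;
end;

// Counts codepoints by stepping from lead byte to lead byte.
// On invalid input a step could jump over the #0 terminator.
function CountCodepointsSketch(P: PChar): Integer;
begin
  Result := 0;
  while P^ <> #0 do
  begin
    Inc(P, QuickLenSketch(P));
    Inc(Result);
  end;
end;
```

For the valid string 'a'#$E2#$82#$AC'b' CountCodepointsSketch returns 3. With invalid data the result is meaningless, which is the price for leaving out the checks.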

 >> And if no zero byte exists (for whatever reason) it currently fails anyway.
 > Till now the Lazarus code didn't have such a case.

Yes, such a situation is probably quite unlikely.
If a PChar points to arbitrary data, it is impossible to cope with this situation anyway.