[Lazarus] UTF-8 string recognition

Wed Mar 3 17:30:05 CET 2010

JoshyFun wrote:

> RH> Please check the function I use to test for a UTF-8 string. Hope it helps.
> RH> function IsUTF8(UnknownStr:string):boolean;
> 
> Well, there are a lot of UTF-8 strings that do not pass your checks ;)
> If you reject the low ASCII control chars, what happens with UTF-8
> control chars?
> 
> RH> var
> RH>     i    :Integer;
> RH> begin
> RH>     if length(UnknownStr)=0 then exit(true);
> RH>     i:=1;
> RH>     while i<length(UnknownStr) do
> RH>     begin
> RH>         // ASCII
> RH>         if  (UnknownStr[i] = #$09) or
> RH>             (UnknownStr[i] = #$0A) or
> RH>             (UnknownStr[i] = #$0D) or
> RH>             (UnknownStr[i] in [#$20..#$7E]) then
> RH>         begin
> RH>             inc(i);
> RH>             continue;
> RH>         end;
> RH>         // non-overlong 2-byte
> RH>         if  (UnknownStr[i] in [#$C2..#$DF]) and
> RH>             (UnknownStr[i+1] in [#$80..#$BF]) then
> RH>         begin
> 
> It should crash here with strings like:
> 
> var
>  s: string;
> begin
>  s:=#$C2;
>  IsUTF8(s);
> end;
> 
> which is not valid UTF8.

That's correct. A possible workaround would be to use PChars, which can
safely read the #0 that is appended to the string.
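
A minimal sketch of that idea, assuming Free Pascal with long strings
({$H+}); the helper name StartsWithTwoByteSeq is made up for illustration.
For a one-byte string such as #$C2, p[1] is the terminating #0, so the
second test fails cleanly instead of reading out of bounds:

```pascal
{$mode objfpc}{$H+}

// Hypothetical helper: True if S begins with a well-formed,
// non-overlong 2-byte UTF-8 sequence.
function StartsWithTwoByteSeq(const S: string): Boolean;
var
  p: PChar;
begin
  if S = '' then
    Exit(False);
  p := PChar(S);
  // p[0] is a lead byte (never #0) before p[1] is read, so p[1] is at
  // worst the terminating #0, which is not in #$80..#$BF.
  Result := (p[0] in [#$C2..#$DF]) and (p[1] in [#$80..#$BF]);
end;
```

This relies on short-circuit boolean evaluation, which is the Free Pascal
default.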

I'd suggest a state machine or the like for the implementation:
   while True do
     case p^ of
     #0: break; //done; okay only if p is at the end of the string
     #9, #10, #12, #13, ' '..#$7E: inc(p); //okay: tab, LF, FF, CR, printable ASCII
     #$C2..#$DF: ... //2 bytes (#$C0 and #$C1 could only start overlong forms)
     #$E0..#$EF: ... //3 bytes
     #$F0..#$F4: ... //4 bytes
     else exit(False); //not valid text
     end;
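
One way the elided branches could be filled in, as a sketch only. It
assumes Free Pascal with {$H+}; the name IsUTF8Text and the variables
need, minB and maxB are my own. It accepts the same ASCII set as the
original function (tab, LF, CR, printable) plus form feed, and rejects
overlong forms, surrogates and anything above U+10FFFF:

```pascal
{$mode objfpc}{$H+}

// Hypothetical name. True if S is well-formed UTF-8 "text".
function IsUTF8Text(const S: string): Boolean;
var
  p, pEnd: PChar;
  need: Integer;      // continuation bytes still expected
  minB, maxB: Char;   // allowed range for the next continuation byte
begin
  if S = '' then
    Exit(True);
  p := PChar(S);
  pEnd := p + Length(S);
  while True do
  begin
    case p^ of
      #0:
        Exit(p = pEnd);   // terminator is okay, an embedded #0 is not
      #9, #10, #12, #13, ' '..#$7E:
        begin
          Inc(p);
          Continue;
        end;
      #$C2..#$DF:
        begin need := 1; minB := #$80; maxB := #$BF; end;
      #$E0:               // exclude overlong 3-byte forms
        begin need := 2; minB := #$A0; maxB := #$BF; end;
      #$E1..#$EC:
        begin need := 2; minB := #$80; maxB := #$BF; end;
      #$ED:               // exclude surrogates U+D800..U+DFFF
        begin need := 2; minB := #$80; maxB := #$9F; end;
      #$EE..#$EF:
        begin need := 2; minB := #$80; maxB := #$BF; end;
      #$F0:               // exclude overlong 4-byte forms
        begin need := 3; minB := #$90; maxB := #$BF; end;
      #$F1..#$F3:
        begin need := 3; minB := #$80; maxB := #$BF; end;
      #$F4:               // nothing above U+10FFFF
        begin need := 3; minB := #$80; maxB := #$8F; end;
    else
      Exit(False);        // not valid text
    end;
    Inc(p);               // skip the lead byte
    while need > 0 do
    begin
      // The terminating #0 is below minB, so a truncated sequence
      // stops here without reading past the end of the string.
      if (p^ < minB) or (p^ > maxB) then
        Exit(False);
      Inc(p);
      Dec(need);
      minB := #$80;       // only the first continuation byte is restricted
      maxB := #$BF;
    end;
  end;
end;
```

With this sketch, IsUTF8Text(#$C2) returns False rather than crashing,
because the byte after the lead byte is the terminator.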

DoDi