[Lazarus] UTF-8 string recognition
Robin Hoo
robin.hoo.cn at gmail.com
Wed Mar 3 14:05:48 CET 2010
Hi, JoshyFun
Thanks for pointing out the bug in my code; yes, you are right. I forgot to
put a check before every inc(i,k) and continue; there should be a
guard statement:
*if i>length(UnknownStr) then exit(false);*
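
To make the idea concrete, here is a minimal sketch of a bounds-checked validator. It is not the function from the thread: the name IsWellFormedUTF8 and the variables MinB and MaxB are made up for illustration, and it assumes Free Pascal in objfpc mode. Unlike the original, it accepts every byte in #$00..#$7F (control characters included) and follows the well-formed byte ranges of the Unicode standard, so overlong forms and surrogates are rejected.

```pascal
program Utf8CheckSketch;
{$mode objfpc}{$H+}

// Hypothetical helper, not the function from the thread: returns True when
// S is a well-formed UTF-8 byte sequence. The length check before the inner
// loop guarantees that S[i + k] is never read past the end of the string.
function IsWellFormedUTF8(const S: AnsiString): Boolean;
var
  i, k, Len, Trail: Integer;
  MinB, MaxB: Char;
begin
  Result := False;
  Len := Length(S);
  i := 1;
  while i <= Len do
  begin
    // Allowed range for the first continuation byte. It is narrowed below
    // to reject overlong forms, surrogates and values above U+10FFFF.
    MinB := #$80;
    MaxB := #$BF;
    case S[i] of
      #$00..#$7F: Trail := 0;
      #$C2..#$DF: Trail := 1;
      #$E0: begin Trail := 2; MinB := #$A0; end;
      #$E1..#$EC, #$EE..#$EF: Trail := 2;
      #$ED: begin Trail := 2; MaxB := #$9F; end;
      #$F0: begin Trail := 3; MinB := #$90; end;
      #$F1..#$F3: Trail := 3;
      #$F4: begin Trail := 3; MaxB := #$8F; end;
    else
      Exit; // #$80..#$C1 and #$F5..#$FF can never start a sequence
    end;
    // Truncated sequence: not enough bytes left in the string.
    if i + Trail > Len then
      Exit;
    for k := 1 to Trail do
    begin
      if (S[i + k] < MinB) or (S[i + k] > MaxB) then
        Exit;
      // Only the first continuation byte has a narrowed range.
      MinB := #$80;
      MaxB := #$BF;
    end;
    Inc(i, Trail + 1);
  end;
  Result := True;
end;

begin
  WriteLn(IsWellFormedUTF8('plain ASCII'));  // ASCII only
  WriteLn(IsWellFormedUTF8(#$C3#$A9));       // complete 2-byte sequence
  WriteLn(IsWellFormedUTF8(#$C2));           // lead byte with nothing after it
  WriteLn(IsWellFormedUTF8(#$ED#$A0#$80));   // encoded surrogate
end.
```

The single comparison i + Trail > Len plays the role of the missing guard: it runs once per character, before any continuation byte is read.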
2010/3/3 JoshyFun <joshyfun at gmail.com>
> Hello Lazarus-List,
>
> Wednesday, March 3, 2010, 12:24:35 AM, you wrote:
>
> RH> Please check the function I use to check for a UTF-8 string. Hope it is helpful.
> RH> function IsUTF8(UnknownStr:string):boolean;
>
> Well, there are a lot of UTF-8 strings that do not pass your checks ;)
> If you reject the low ASCII control chars, what happens to UTF-8 control
> chars?
>
> RH> var
> RH> i :Integer;
> RH> begin
> RH> if length(UnknownStr)=0 then exit(true);
> RH> i:=1;
> RH> while i<length(UnknownStr) do
> RH> begin
> RH> // ASCII
> RH> if (UnknownStr[i] = #$09) or
> RH> (UnknownStr[i] = #$0A) or
> RH> (UnknownStr[i] = #$0D) or
> RH> (UnknownStr[i] in [#$20..#$7E]) then
> RH> begin
> RH> inc(i);
> RH> continue;
> RH> end;
> RH> // non-overlong 2-byte
> RH> if (UnknownStr[i] in [#$C2..#$DF]) and
> RH> (UnknownStr[i+1] in [#$80..#$BF]) then
> RH> begin
>
> It should crash here with strings like:
>
> var
> s: string;
> begin
> s:=#$C2;
> IsUTF8(s);
> end;
>
> which is not valid UTF-8.
>
> RH> // excluding surrogates
> RH> ((UnknownStr[i]=#$ED) and
> RH> (UnknownStr[i+1] in [#$80..#$9F]) and
> RH> (UnknownStr[i+2] in [#$80..#$BF])) then
>
> Surrogates are not valid UTF-8 code points.
>
> --
> Best regards,
> JoshyFun
>
>