[Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
Bart
bartjunk64 at gmail.com
Mon Sep 16 17:04:04 CEST 2013
Hi,
Current code for Utf8StringOfChar that I wrote (in LazUtf8 unit) may
fail due to Utf8 -> UTF16 -> FillWord -> Utf8 conversions, which only
work for UCS2, as Mattias pointed out to me.
I constructed a new Utf8StringOfChar function that builds UTF8 without
conversions.
For speed reasons it uses FillWord or FillDWord when appropriate.
To be sure this is Endian safe (It works AFAICS on Windows i386) I
need Mac users (or testers with other BigEndian architecture) to test
the code:
function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
var
UCharLen, i: Integer;
W: Word;
DW: DWORD;
C1, C2, C3: Char;
PC: PChar;
begin
Result := '';
if Utf8Length(AUtf8Char) <> 1 then Exit;
UCharLen := Length(AUtf8Char);
Case UCharLen of
1: Result := StringOfChar(AUtf8Char[1], N);
2:
begin
SetLength(Result, 2 * N);
W := Byte(AUtf8Char[1]) + (Word(Byte(AUtf8Char[2])) shl 8);
W := LeToN(W);
System.FillWord(Result[1], N, W); ;
end;
3:
begin
SetLength(Result, 3 * N);
C1 := AUtf8Char[1];
C2 := AUtf8Char[2];
C3 := AUtf8Char[3];
PC := PChar(Result);
for i:=1 to N do
begin
PC^ := C1; inc(PC);
PC^ := C2; inc(PC);
PC^ := C3; inc(PC);
end;
end;
4:
begin
SetLength(Result, 4 * N);
DW := Byte(AUtf8Char[1]) + (Word(Byte(AUtf8Char[2])) shl 8) +
(Byte(AUtf8Char[3]) + (Word(Byte(AUtf8Char[4])) shl 8)) shl 16;
DW := LeToN(DW);
System.FillDWord(Result[1], N, DW);
end;
else
begin
//In November 2003 UTF-8 was restricted by RFC 3629 to four bytes to match
//the constraints of the UTF-16 character encoding.
//http://en.wikipedia.org/wiki/UTF-8
Result := StringOfChar('?', N);
end;
end;
end;
End here's the testing code in an OnClick event of a Button:
It uses a console for output, so either run it from console, redirect
output, or replace the write/writeln with e.g. memo1.lines.add().
procedure TForm1.Button6Click(Sender: TObject);
var
ResS: String;
UChar: String;
Expected: String;
i,j,k: Integer;
const
N = 4;
Utf8Sample: Array[1..3] of String = (#$C2#$A2, // ¢
#$E2#$82#$AC, // €
#$F0#$A4#$AD#$A2 // 𤭢
);
begin
for k := 1 to 3 do
begin
UChar := Utf8Sample[k];
Expected := '';
for i := 1 to N do Expected := Expected + UChar;
ResS := Utf8StringOfChar(UChar, N);
write('Testing: ');
write(Length(UChar),'-byte codepoint: ');
for j := 1 to length(UChar) do write('$',IntToHex(Ord(UChar[j]),2),' ');
writeln;
writeln('Expected Length = ',Length(Expected));
writeln('Found Length = ',Length(ResS));
write('Expected: ');
for i := 1 to length(Expected) do
write('$',IntToHex(Ord(Expected[i]),2),' ');
writeln;
write('Found : ');
for i := 1 to length(ResS) do write('$',IntToHex(Ord(ResS[i]),2),' ');
writeln;
if ResS <> Expected then
begin
if Length(ResS) <> Length(Expected) then
writeln('Different Lengths')
else
begin
i := 1;
while (length(ResS) >= i) and (ResS[i] = Expected[i]) do Inc(i);
writeln('Fail: at position ',i,': Expected =
$',IntToHex(Ord(Expected[i]),2),' Found =
$',IntToHex(Ord(ResS[i]),2));
end;
end
else writeln('Success!');
writeln;
end;
end;
B.t.w the new code isn't in trunk yet, I'ld rather first fix it if
it's broke on BigEndian machines.
Thanks in advance.
Bart
More information about the Lazarus
mailing list