[Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

Bart bartjunk64 at gmail.com
Mon Sep 16 17:04:04 CEST 2013


Hi,

Current code for Utf8StringOfChar that I wrote (in LazUtf8 unit) may
fail due to Utf8 -> UTF16 -> FillWord -> Utf8 conversions, which only
work for UCS2, as Mattias pointed out to me.

I constructed a new Utf8StringOfChar function that builds UTF8 without
conversions.
For speed reasons it uses FillWord or FillDWord when appropriate.

To be sure this is Endian safe (It works AFAICS on Windows i386) I
need Mac users (or testers with other BigEndian architecture) to test
the code:

function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
var
  UCharLen, i: Integer;
  W: Word;
  DW: DWORD;
  C1, C2, C3: Char;
  PC: PChar;
begin
  Result := '';
  if Utf8Length(AUtf8Char) <> 1 then Exit;
  UCharLen := Length(AUtf8Char);
  Case UCharLen of
    1: Result := StringOfChar(AUtf8Char[1], N);
    2:
    begin
      SetLength(Result, 2 * N);
      W := Byte(AUtf8Char[1]) + (Word(Byte(AUtf8Char[2])) shl 8);
      W := LeToN(W);
      System.FillWord(Result[1], N, W);        ;
     end;
    3:
    begin
      SetLength(Result, 3 * N);
      C1 := AUtf8Char[1];
      C2 := AUtf8Char[2];
      C3 := AUtf8Char[3];
      PC := PChar(Result);
      for i:=1 to N do
      begin
        PC^ := C1; inc(PC);
        PC^ := C2; inc(PC);
        PC^ := C3; inc(PC);
      end;
    end;
    4:
    begin
      SetLength(Result, 4 * N);
      DW := Byte(AUtf8Char[1]) + (Word(Byte(AUtf8Char[2])) shl 8) +
            (Byte(AUtf8Char[3]) + (Word(Byte(AUtf8Char[4])) shl 8)) shl 16;
      DW := LeToN(DW);
      System.FillDWord(Result[1], N, DW);
    end;
    else
    begin
      //In November 2003 UTF-8 was restricted by RFC 3629 to four bytes to match
      //the constraints of the UTF-16 character encoding.
      //http://en.wikipedia.org/wiki/UTF-8
      Result := StringOfChar('?', N);
    end;
  end;
end;


End here's the testing code in an OnClick event of a Button:

It uses a console for output, so either run it from console, redirect
output, or replace the write/writeln with e.g. memo1.lines.add().

procedure TForm1.Button6Click(Sender: TObject);
var
  ResS: String;
  UChar: String;
  Expected: String;
  i,j,k: Integer;
const
  N = 4;
  Utf8Sample: Array[1..3] of String = (#$C2#$A2,        // ¢
                                       #$E2#$82#$AC,    // €
                                       #$F0#$A4#$AD#$A2 // 𤭢
                                       );
begin
  for k := 1 to 3 do
  begin
    UChar := Utf8Sample[k];
    Expected := '';
    for i := 1 to N do Expected := Expected + UChar;

    ResS := Utf8StringOfChar(UChar, N);
    write('Testing: ');
    write(Length(UChar),'-byte codepoint: ');
    for j := 1 to length(UChar) do write('$',IntToHex(Ord(UChar[j]),2),' ');
    writeln;
    writeln('Expected Length = ',Length(Expected));
    writeln('Found Length    = ',Length(ResS));
    write('Expected: ');
    for i := 1 to length(Expected) do
write('$',IntToHex(Ord(Expected[i]),2),' ');
    writeln;
    write('Found   : ');
    for i := 1 to length(ResS) do write('$',IntToHex(Ord(ResS[i]),2),' ');
    writeln;
    if ResS <> Expected then
    begin
      if Length(ResS) <> Length(Expected) then
        writeln('Different Lengths')
      else
      begin
        i := 1;
        while (length(ResS) >= i) and (ResS[i] = Expected[i]) do Inc(i);
        writeln('Fail: at position ',i,':  Expected =
$',IntToHex(Ord(Expected[i]),2),' Found =
$',IntToHex(Ord(ResS[i]),2));

      end;
    end
    else writeln('Success!');
    writeln;
  end;

end;


B.t.w the new code isn't in trunk yet, I'ld rather first fix it if
it's broke on BigEndian machines.

Thanks in advance.

Bart




More information about the Lazarus mailing list