[Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

Bart bartjunk64 at gmail.com
Mon Dec 27 23:43:01 CET 2021


On Mon, Dec 27, 2021 at 10:02 PM Noel Duffy via lazarus
<lazarus at lists.lazarus-ide.org> wrote:

> It's not just the euro, though. It's any utf-8 sequence.

What I meant was that a single '€' (or any other single UTF8
"character") will not enter the mentioned block.
Can you add some debug statements to display the values it uses in the
calculation, like I did in the 4th message in this thread?
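
To see why a 3-byte string such as '€' never reaches the block loop, the
split can be worked out separately. What follows is only a sketch:
InitialCount and BlockCount are hypothetical helpers that copy the arithmetic
from UTF8LengthFast, and the addresses are made-up numbers, not real pointers.

```pascal
program SplitDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: how many bytes the initial (misaligned) loop handles.
function InitialCount(Addr, ByteCount: PtrInt): PtrInt;
begin
  Result := (not (Addr - 1)) and (SizeOf(PtrInt) - 1);
  if Result > ByteCount then
    Result := ByteCount;
end;

// Hypothetical helper: how many full PtrInt-sized blocks the block loop handles.
function BlockCount(Addr, ByteCount: PtrInt): PtrInt;
begin
  Result := (ByteCount - InitialCount(Addr, ByteCount)) div SizeOf(PtrInt);
end;

var
  Offset: PtrInt;
begin
  // 4096 is an assumed, aligned base address; try every possible misalignment.
  for Offset := 0 to SizeOf(PtrInt) - 1 do
    WriteLn('offset ', Offset,
            ': initial=', InitialCount(4096 + Offset, 3),
            ' blocks=', BlockCount(4096 + Offset, 3));
end.
```

For a ByteCount of 3, blocks is 0 at every offset, because 3 is smaller than
SizeOf(PtrInt) on both 32-bit and 64-bit targets. All three bytes are handled
by the byte-wise loops, which is where the debug output in the instrumented
function that follows comes from.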

function UTF8LengthFast(p: PChar; ByteCount: PtrInt): PtrInt;
const
{$ifdef CPU32}
  ONEMASK   =$01010101;
  EIGHTYMASK=$80808080;
{$endif}
{$ifdef CPU64}
  ONEMASK   =$0101010101010101;
  EIGHTYMASK=$8080808080808080;
{$endif}
var
  pnx: PPtrInt absolute p; // To get contents of text in PtrInt blocks. x refers to 32 or 64 bits
  pn8: pint8 absolute pnx; // To read text as Int8 in the initial and final loops
  ix: PtrInt absolute pnx; // To read text as PtrInt in the block loop
  nx: PtrInt;              // values processed in block loop
  i,cnt,e: PtrInt;
begin
  Result := 0;
  e := ix+ByteCount; // End marker
  // Handle any initial misaligned bytes.
  cnt := (not (ix-1)) and (sizeof(PtrInt)-1);
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    writeln('pn8^             = ',byte(pn8^).ToBinString);
    writeln('pn8^ shr 7       = ',Byte(Byte(pn8^) shr 7).ToBinString);
    writeln('not pn8^         = ',Byte(not pn8^).ToBinString);
    writeln('(not pn8^) shr 6 = ',Byte((not pn8^) shr 6).ToBinString);
    writeln;
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrUInt) do
  begin
    // Count bytes which are NOT the first byte of a character.
    {
    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
    Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    {$pop}
    }
    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    Result := Result + PopCnt(PtrUInt(nx));

    inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ix<e do
  begin
    // Is this byte NOT the first byte of a character?
    writeln('pn8^             = ',byte(pn8^).ToBinString);
    writeln('pn8^ shr 7       = ',Byte(Byte(pn8^) shr 7).ToBinString);
    writeln('not pn8^         = ',Byte(not pn8^).ToBinString);
    writeln('(not pn8^) shr 6 = ',Byte((not pn8^) shr 6).ToBinString);
    writeln;
    //writeln('',);
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  Result := ByteCount - Result;
end;

(Just put this in the main unit; there is no need to change and rebuild the
LazUtf8 unit.)
Make sure your app has a console.
Then just do something like:

  S := '€';
  Len := Utf8LengthFast(PChar(S), Length(S));

It should output something like:

pn8^             = 11100010
pn8^ shr 7       = 00000001
not pn8^         = 00011101
(not pn8^) shr 6 = 00000000

pn8^             = 10000010
pn8^ shr 7       = 00000001
not pn8^         = 01111101
(not pn8^) shr 6 = 00000001

pn8^             = 10101100
pn8^ shr 7       = 00000001
not pn8^         = 01010011
(not pn8^) shr 6 = 00000001

Notice that '€' in UTF8 is the byte sequence 11100010 10000010 10101100
(the values you should see in pn8^).
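
For comparison, the same count can be done with unsigned bytes only. This is a
minimal sketch, not code from LazUtf8: Utf8LengthSimple is a hypothetical
reference function that treats every byte of the form 10xxxxxx as a
continuation byte and counts all the others.

```pascal
program CountDemo;
{$mode objfpc}{$H+}

// Hypothetical reference function: counts UTF-8 characters by counting
// every byte that is NOT a continuation byte (bit pattern 10xxxxxx).
function Utf8LengthSimple(p: PByte; ByteCount: PtrInt): PtrInt;
var
  i: PtrInt;
begin
  Result := 0;
  for i := 0 to ByteCount - 1 do
    if (p[i] and $C0) <> $80 then  // top two bits are not "10"
      Inc(Result);
end;

const
  // UTF-8 encoding of the euro sign: 11100010 10000010 10101100
  Euro: array[0..2] of Byte = ($E2, $82, $AC);
begin
  WriteLn(Utf8LengthSimple(@Euro[0], Length(Euro)));  // prints 1
end.
```

Because it works on Byte rather than Int8, this version does not depend on how
shr or not behave on signed values, so its result can be checked against what
UTF8LengthFast returns on the same input.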

-- 
Bart

