[Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
Noel Duffy
noelduffy at xtra.co.nz
Tue Dec 28 02:29:04 CET 2021
On 28/12/21 07:21, Bart via lazarus wrote:
> On Mon, Dec 27, 2021 at 6:35 PM Marco van de Voort via lazarus
> <lazarus at lists.lazarus-ide.org> wrote:
>
>> The expression seems to be 1 when the top bits are 10, iow when it is a
>> follow byte of UTF-8; that is what the comment says, and as far as I
>> can see the signedness doesn't matter.
>>
>> Basically to me that seems to be a branchless version of
>>
>> if (p[i] and %11000000)=%10000000 then
>>
>> inc(result);
>
> This is how I understood that part of the code as well.
>
> Just a side note: all this assumes that the UTF8 is correct (in the
> strict sense).
>
> Now for the part that tries to do calculations on blocks of 32 or 64 bits.
> It uses a multiplication in that part.
> That seems a bit odd, as this code tries to do everything with ands,
> nots, and shifts.
>
> Would this approach (in that part of the code) also work?
> X = 32 or 64 bit block
>
> 1: AND X with EIGHTYMASK
> 2: SHR 7: gives (A)
>
> 3: NOT X
> 4: SHR 6: gives (B)
>
> 5: (A) AND (B): gives (C)
>
> Basically we do the same as for a 1-byte block;
> Now we have a pattern where any 1 tells us this byte was a following byte
>
> A 32-bit example:
> (Invalid sequence but that does not matter here)
>
> X = %11xxxxxx %01xxxxxx %00xxxxxx %10xxxxxx
> (Leading-ASCII-ASCII-Following, so count of following bytes must be 1)
>
> AND with EIGHTYMASK gives %10000000 %00000000 %00000000 %10000000
> SHR 7 gives: %00000001 %00000000 %00000000 %00000001 (A)
>
> NOT %11xxxxxx %01xxxxxx %00xxxxxx %10xxxxxx =
> %00yyyyyy %10yyyyyy %11yyyyyy %01yyyyyy
> SHR 6 gives: %00000000 %yyyyyy10 %yyyyyy11 %yyyyyy01 (B)
>
> (A) and (B) gives: %00000000 %00000000 %00000000 %00000001 (C)
> All non-following bytes turn into %00000000
>
> The count of following bytes is PopCnt(C)
>
> As long as PopCnt is available as a machine instruction, this will be
> faster than the (nx * ONEMASK) calculation I think.
> (I would hope this is the case for all platforms Lazarus supports; if
> not, the call to PopCnt could be ifdef-ed.)
>
> So basically change this:
>
> nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
> {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
> Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
> {$pop}
>
> into
>
> nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
> Result := Result + PopCnt(PtrUInt(nx)); // PopCnt only accepts an
> unsigned parameter
>
>
> Since bit manipulating is not my strongpoint, please comment.
>
I ran my code that calls UTF8LengthFast on a euro sign through the
debugger to see which path through UTF8LengthFast was taken. The debugger
skipped over all the loops until the very last one:
// Take care of any left-over bytes.
while ix<e do
begin
// Is this byte NOT the first byte of a character?
Result += (pn8^ shr 7) and ((not pn8^) shr 6);
inc(pn8);
end;
Result := ByteCount - Result;
The euro symbol encoded in UTF-8 has three bytes: 226, 130, 172.
The loop has 3 iterations, one for each byte:
0: Result=33554428
1: Result=67108857
2: Result=100663286
The last line subtracts Result from ByteCount, which gives -100663283.
OK, time to simplify. I tried writing a short Pascal program that did
the bit shifting with constants:
c := (226 shr 7) and ((not 226) shr 6);
writeln('c='+inttostr(c)+' 226.');
c := (130 shr 7) and ((not 130) shr 6);
writeln('c='+inttostr(c)+' 130.');
c := (172 shr 7) and ((not 172) shr 6);
writeln('c='+inttostr(c)+' 172.');
This produces the correct result:
c=0 226.
c=1 130.
c=1 172.
Next I tried using a pint8:
const
s = '€';
var
p: PChar;
c: Byte;
pn8: pint8 absolute p;
begin
p := PChar(s);
c := (pn8^ shr 7) and ((not pn8^) shr 6);
writeln('c='+inttostr(c)+' '+IntToStr(Byte(pn8^)));
Inc(pn8);
c := (pn8^ shr 7) and ((not pn8^) shr 6);
writeln('c='+inttostr(c)+' '+IntToStr(Byte(pn8^)));
Inc(pn8);
c := (pn8^ shr 7) and ((not pn8^) shr 6);
writeln('c='+inttostr(c)+' '+IntToStr(Byte(pn8^)));
end;
This produces:
c=252 226
c=253 130
c=253 172
Clearly wrong. Someone mentioned signed vs unsigned, so that seemed a
logical next step. I tried using a pbyte.
const
s = '€';
var
p: PChar;
pb: PByte absolute p;
c: Byte;
begin
p := PChar(s);
c := (pb^ shr 7) and ((not pb^) shr 6);
writeln('c='+inttostr(c)+' '+IntToStr(pb^));
Inc(pb);
c := (pb^ shr 7) and ((not pb^) shr 6);
writeln('c='+inttostr(c)+' '+IntToStr(pb^));
Inc(pb);
c := (pb^ shr 7) and ((not pb^) shr 6);
writeln('c='+inttostr(c)+' '+IntToStr(pb^));
end;
This produces:
c=0 226
c=1 130
c=1 172
Correct result. So it appears that using a pint8 produces the wrong
result on aarch64 but not on x86_64. It's not clear why, though.
It appears to me that an unsigned pointer type is required in
UTF8LengthFast.