[Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

Marco van de Voort fpc at pascalprogramming.org
Wed Dec 29 13:42:31 CET 2021


On 29-12-2021 10:16, Martin Frb via lazarus wrote:
>
>> // Martin's routine that should be replaced by some punpkl magic, but 
>> it is too late now.
>
> Why too late?

See datetime stamp.  02:10 AM. I don't know how it is with you Lazarus 
devels, but we FPC devels need our beauty sleep from time to time.

So I'll be working on it some more today. Possible things to do are:

- fold countmask into the asm,

- maybe do another outer loop to process all bytes in one go,

- special-case inputs of only a few bytes,

- maybe align to 16 bytes (easier/cheaper for old machines than all the 
movdqu),

- maybe test the code on godbolt for performance.
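
To make the list above a bit more concrete, here is a rough C sketch of the overall shape: a short-input special case, a block loop with the counting folded in, and a scalar tail. It is not the asm from this thread. It assumes the routine counts UTF-8 code points by counting bytes that are not continuation bytes (10xxxxxx), and it uses plain 64-bit integer operations as a portable stand-in for the 16-byte SSE loop. The names `utf8_len_scalar` and `utf8_len_blocks` and the 16-byte cutoff are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference: count bytes that are NOT continuation bytes (10xxxxxx). */
static size_t utf8_len_scalar(const unsigned char *p, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (p[i] & 0xC0) != 0x80;
    return count;
}

/* Hypothetical block variant: 8 bytes per step, counting folded into the loop. */
static size_t utf8_len_blocks(const unsigned char *p, size_t n)
{
    const uint64_t ones = 0x0101010101010101ULL;
    size_t count = 0;

    assert(p != NULL || n == 0);

    /* Special case for only a few bytes (cutoff is an arbitrary assumption). */
    if (n < 16)
        return utf8_len_scalar(p, n);

    while (n >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);  /* unaligned-safe load */
        /* Low bit of each byte is 1 when bit 7 is set and bit 6 is clear. */
        uint64_t cont = (w >> 7) & (~w >> 6) & ones;
        /* Multiplying by 'ones' sums the eight flags into the top byte. */
        count += 8 - (size_t)((cont * ones) >> 56);
        p += 8;
        n -= 8;
    }

    /* Scalar tail for the remaining 0..7 bytes. */
    return count + utf8_len_scalar(p, n);
}
```

Whether a loop like this beats a popcnt-based or SSE version would still have to be measured.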

At the very least I hope the punpckl code for countmask will be more 
compact since it doesn't require literals.

> There is a place for both. My routine works fine for cpu. (soon as a 
> 32 bit (and maybe 16 bit) variant are added).
>
> Then for known cpu, special handling can be added.

Could you post the full source if you haven't already? For a bit of 
benchmarking. I just wrote mine off the top of my head, and I assumed 5 
instructions per 16 bytes would win any time, but I haven't verified 
anything yet.

For 32-bit this can also be done, but you'd need a procedure variable, 
set after testing for SSE support, and you'd only run this for long 
strings (>200-500 or so).
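
A rough sketch of that dispatch idea, in C for illustration only: a function pointer (the C counterpart of a procedure variable) is resolved on first use after a CPU check, and the fast path is only taken above a length threshold. Everything here is an assumption for the sake of the example: the names, the threshold of 256 (picked from the 200-500 range above), and the "SSE" routine, which is only a placeholder that calls the generic code. `__builtin_cpu_supports` is a GCC/Clang builtin for x86; on other targets the sketch simply falls back to the generic routine.

```c
#include <stddef.h>

typedef size_t (*count_fn)(const unsigned char *, size_t);

/* Portable fallback: count bytes that are not UTF-8 continuation bytes. */
static size_t count_generic(const unsigned char *p, size_t n)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        c += (p[i] & 0xC0) != 0x80;
    return c;
}

/* Placeholder for the real SSE routine; here it just defers to the fallback. */
static size_t count_sse_placeholder(const unsigned char *p, size_t n)
{
    return count_generic(p, n);
}

/* CPU check: GCC/Clang builtin on x86, otherwise report "no SSE2". */
static int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;
#endif
}

enum { LONG_STRING_THRESHOLD = 256 };  /* assumed value, needs benchmarking */

static size_t count_resolve(const unsigned char *p, size_t n);

/* The "procedure variable": starts at the resolver, rebinds itself once. */
static count_fn count_long = count_resolve;

static size_t count_resolve(const unsigned char *p, size_t n)
{
    /* Not thread-safe as written; a real version would guard this. */
    count_long = cpu_has_sse2() ? count_sse_placeholder : count_generic;
    return count_long(p, n);
}

/* Entry point: short strings never pay for the indirect call. */
static size_t utf8_count(const unsigned char *p, size_t n)
{
    if (n < LONG_STRING_THRESHOLD)
        return count_generic(p, n);
    return count_long(p, n);
}
```

The threshold exists because the indirect call and any setup cost only pay off once the string is long enough; where exactly that point lies is something only a benchmark can tell.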

p.s. is there a workaround for git worktree to work on the same branch? 
E.g. trunk for 32-bit and trunk for 64-bit ? :-)


