[Lazarus] unit Masks vs. unit FPMasks

Wed Feb 24 13:31:17 CET 2021

On Wed, Feb 24, 2021 at 12:22 PM José Mejuto via lazarus <
lazarus at lists.lazarus-ide.org> wrote:

> My code is not 100% Unicode compatible in "CaseInsensitive" mode,
> because it lowercases both the mask and the string before comparing
> them. That is wrong by definition, but I was unable to find a way to
> compare codepoints case-insensitively without pulling in big Unicode
> tables.
>
> I was thinking of importing the NTFS (the filesystem) case comparison
> tables, which are "only" 128 KB.
>

That is not necessary.
LazUTF8 has functions like UTF8CompareText(), UTF8CompareTextP() and the
latest UTF8CompareLatinTextFast().
UTF8CompareLatinTextFast supports full Unicode but is optimized for mostly
Latin text.
We should add a PChar version UTF8CompareLatinTextFastP() and use it in
your mask code.
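
A minimal sketch of a case-insensitive comparison built on LazUTF8, assuming
the LazUtils package is on the unit search path. SameTextUTF8 is a
hypothetical helper made up for illustration; it is not part of LazUTF8 or
of the mask code under discussion.

```pascal
program CompareDemo;
{$mode objfpc}{$H+}

uses
  LazUTF8;  // part of the LazUtils package shipped with Lazarus

// Hypothetical helper for illustration; not part of LazUTF8 or Masks.
function SameTextUTF8(const A, B: string): Boolean;
begin
  // UTF8CompareText returns 0 when A and B are equal ignoring case.
  // UTF8CompareLatinTextFast could be swapped in here, assuming it
  // takes the same two string parameters.
  Result := UTF8CompareText(A, B) = 0;
end;

begin
  WriteLn(SameTextUTF8('MASK.TXT', 'mask.txt'));
  WriteLn(SameTextUTF8('mask.txt', 'task.txt'));
end.
```

A PChar variant would avoid building temporary strings while the mask
matcher walks through its input, which is the reason for proposing one.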

> Comprehensive unit tests are a way to prevent breaking things.
>
> They also define whether a compatibility break is a bug in the new code
> or in the old code. For example, my mask supports "[z-a]" by converting
> it to "[a-z]" (there is a define to disable this), which is a
> compatibility break.

Your code does not compile when RANGES_AUTOREVERSE is not defined.
cMask is not found.
The reverse logic can be enabled by default. As I understand it, it does
not break anybody's masks: a reversed range used to be an error, and now
it does something sensible.
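
The auto-reverse idea can be sketched as follows. This is not the code from
the mask unit; NormalizeRange and InRange are hypothetical names, and the
sketch handles single-byte characters only.

```pascal
program RangeReverseDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: swaps the bounds when a range is written
// backwards, so that "[z-a]" behaves like "[a-z]".
procedure NormalizeRange(var ALow, AHigh: Char);
var
  Tmp: Char;
begin
  if ALow > AHigh then
  begin
    Tmp := ALow;
    ALow := AHigh;
    AHigh := Tmp;
  end;
end;

// Hypothetical helper: tests C against a range given in either order.
function InRange(C, ALow, AHigh: Char): Boolean;
begin
  NormalizeRange(ALow, AHigh);
  Result := (C >= ALow) and (C <= AHigh);
end;

begin
  WriteLn(InRange('m', 'z', 'a'));  // reversed range, same as [a-z]
  WriteLn(InRange('m', 'a', 'z'));  // normal range
  WriteLn(InRange('M', 'z', 'a'));  // uppercase 'M' is outside a..z
end.
```

Because a reversed range was previously rejected, accepting it this way
cannot change the result of any mask that used to work.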

> Also there is support (which can also be disabled) for the mask "[?]",
> which is the counterpart of "*" but for a single char position.
>

Where did you get this "[?]" syntax? There must be reference
documentation somewhere, but I have not seen it.
What is the difference between "?" and "[?]"?

On Wed, Feb 24, 2021 at 1:28 PM José Mejuto via lazarus <
lazarus at lists.lazarus-ide.org> wrote:

> > Sometimes I wish we would migrate to using UnicodeString by default.
> > It would make life a bit easier.
> > (And yes I know you would have to deal with composed characters
> > (grapheme defined by more than 1 16-bit word)).
>
> That's a can of worms! UTF8 forces you to write "correct code" (or at
> least to try) for any character >127. With UnicodeString you get the
> false appearance that everything magically works, until everything
> cracks when a string with surrogate pairs comes into play :-) and ALL
> your text handling must be rewritten, most of it completely.
>

Exactly. UnicodeString uses UTF-16, which is also a variable-length
encoding. The same rules should be applied, but often they are not. There
is plenty of sloppy UTF-16 code out there.
Writing proper UTF-8 code is not difficult once you wrap your mind around
the concept. There is a learning curve, true. I also scratched my head for
some time when studying it.
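
A small sketch of why UTF-16 needs the same care: Length() on a
UnicodeString counts 16-bit code units, not code points. CodePointCount is a
hypothetical helper written for this illustration, not a library function.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts Unicode code points in a UTF-16 string
// by treating a high+low surrogate pair as one code point.
function CodePointCount(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (I < Length(S))
       and (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF)
       and (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)   // valid surrogate pair: two code units, one code point
    else
      Inc(I);     // BMP character (or a lone surrogate): one code unit
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  // 'a' followed by U+1F600, built from its two UTF-16 code units.
  S := 'a';
  S := S + WideChar($D83D) + WideChar($DE00);
  WriteLn('Code units:  ', Length(S));          // 3
  WriteLn('Code points: ', CodePointCount(S));  // 2
end.
```

Code that indexes S[I] as if every element were a whole character works
until the first surrogate pair shows up, which is the failure described in
the quoted text.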

Juha