[Lazarus] cwstring in arm-linux

Wed Oct 19 20:03:39 CEST 2011

On Wed, Oct 19, 2011 at 6:33 PM, Martin Schreiber <mse00000 at gmail.com> wrote:
> Does it use locale specific collation in PasUnicodeCompareStr and
> PasUnicodeCompareText?

Good point. No, not yet. But AFAIK this only affects Turkish, Azeri and
Lithuanian.

Adding Turkish and Azeri is trivial, because UTF8LowerCase already
supports them, but I have not yet understood the rules for Lithuanian.
They are quite convoluted and depend on nearby characters, among other
things.
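
To illustrate why Turkish and Azeri are the easy case: the only
difference from the default mapping is the pair I/ı and İ/i. The
following is a minimal sketch in Python, not the UTF8LowerCase code
itself; the function name lower_tr is hypothetical and it ignores the
decomposed form of İ.

```python
def lower_tr(text: str) -> str:
    """Lowercase text using the Turkish/Azeri rules for I and İ (sketch)."""
    # Handle the two special letters first, then defer to the default mapping.
    text = text.replace("\u0130", "i")   # İ (capital I with dot above) -> i
    text = text.replace("I", "\u0131")   # I -> ı (dotless small i)
    return text.lower()


print(lower_tr("ISTANBUL"))        # ıstanbul
print(lower_tr("\u0130STANBUL"))   # istanbul
print("ISTANBUL".lower())          # istanbul (default, language-neutral mapping)
```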

> Is the performance of UTF8LowerCase and UTF8UpperCase OK?

UTF8LowerCase was heavily optimized. UTF8UpperCase still needs more
optimization.

6 million UTF8LowerCase operations on the string "АБВЕЁЖЗКЛМНОПРДЙГ"
take about 2.6 seconds on my computer. It outperforms iconv by a factor
of approximately 2.5x:

    Performance test times in ms, in the order printed (one per test
    string):

    j    UTF8LowerCase    CWString SysUtils.UnicodeLowerCase
    0          804              2456
    1         1896              2461
    2         2318              6594
    3         3460              6170
    4         2647              5347
    5         1847              6939
    6         2526              4398
    7         2496              4429
    8         1830              2285
    9         1975              2411

For these strings:

      if j = 0 then Str := UTF8LowerCase('abcdefghijklmnopqrstuwvxyz');
      if j = 1 then Str := UTF8LowerCase('ABCDEFGHIJKLMNOPQRSTUWVXYZ');
      if j = 2 then Str := UTF8LowerCase('aąbcćdeęfghijklłmnńoóprsśtuwyzźż');
      if j = 3 then Str := UTF8LowerCase('AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ');
      if j = 4 then Str := UTF8LowerCase('АБВЕЁЖЗКЛМНОПРДЙГ');
      if j = 5 then Str := UTF8LowerCase('名字叫嘉英，嘉陵江的嘉，英國的英');
      if j = 6 then Str := UTF8LowerCase('AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuWvVwXxYyZz');
      if j = 7 then Str := UTF8LowerCase('AAaaBBbbCCccDDddEEeeFFffGGggHHhhIIiiJJjjKKkkLLllMMmm');
      if j = 8 then Str := UTF8LowerCase('abcDefgHijkLmnoPqrsTuwvXyz');
      if j = 9 then Str := UTF8LowerCase('ABCdEFGhIJKlMNOpQRStUWVxYZ');
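
For anyone who wants to reproduce this kind of measurement, here is a
rough sketch of a timing loop over the same ten strings. It is written
in Python and times Python's built-in str.lower, so the absolute
numbers are not comparable with the ones above; the function name
time_lowercase and the iteration count are assumptions.

```python
import timeit

# The same ten test strings as in the Pascal snippet above.
SAMPLES = [
    "abcdefghijklmnopqrstuwvxyz",
    "ABCDEFGHIJKLMNOPQRSTUWVXYZ",
    "aąbcćdeęfghijklłmnńoóprsśtuwyzźż",
    "AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ",
    "АБВЕЁЖЗКЛМНОПРДЙГ",
    "名字叫嘉英，嘉陵江的嘉，英國的英",
    "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuWvVwXxYyZz",
    "AAaaBBbbCCccDDddEEeeFFffGGggHHhhIIiiJJjjKKkkLLllMMmm",
    "abcDefgHijkLmnoPqrsTuwvXyz",
    "ABCdEFGhIJKlMNOpQRStUWVxYZ",
]


def time_lowercase(samples, iterations=100_000):
    """Return one elapsed time in milliseconds per sample string."""
    results = []
    for s in samples:
        # s.lower is a bound method, so timeit calls it `iterations` times.
        seconds = timeit.timeit(s.lower, number=iterations)
        results.append(seconds * 1000.0)
    return results


for j, ms in enumerate(time_lowercase(SAMPLES)):
    print(f"j = {j}: {ms:8.1f} ms")
```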

> Do UTF8LowerCase and UTF8UpperCase cover all upper/lowercase Unicode
> (possibly accented) characters?

UTF8LowerCase currently covers all characters in the latest Unicode
spec, AFAIK. Of course I might have missed something, but I have
tests for the code points from U+0000 to U+0580 and more tests for
other clusters of characters.

UTF8UpperCase is currently implemented from U+0000 to U+0450, but I
will add the rest.
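
One way to build reference data for range tests like these is to
enumerate the code points in a range whose lowercase form differs from
the original. The sketch below does that with Python's Unicode tables;
it is not how the Lazarus tests are written, lowercase_pairs is a
hypothetical helper, and it skips mappings that expand to more than one
code point.

```python
def lowercase_pairs(first: int, last: int):
    """List (upper, lower) code point pairs in [first, last] (sketch)."""
    pairs = []
    for cp in range(first, last + 1):
        lowered = chr(cp).lower()
        # Keep only simple one-to-one mappings that actually change the char.
        if len(lowered) == 1 and ord(lowered) != cp:
            pairs.append((cp, ord(lowered)))
    return pairs


pairs = lowercase_pairs(0x0000, 0x0580)
print(len(pairs), "one-to-one lowercase mappings in U+0000..U+0580")
for upper, lower in pairs[:5]:
    print(f"U+{upper:04X} -> U+{lower:04X}")
```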

> Does it handle decomposed characters (cwstring doesn't)?

I think that decomposed characters should work naturally. For example,
suppose we have [0]=A followed by [1]=U+0303 (the combining tilde, not
the plain "~"), which together form "Ã". Lowercasing this gives [0]=a
and leaves [1] unchanged, which forms "ã". Or am I wrong?

If you are asking about handling this in CompareText, then AFAIK it
would be too inefficient to do there, so we would need a separate
routine for it, NormalizedCompareText or something like that, which
performs normalization, then lowercasing and finally the comparison.
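
A minimal sketch of such a routine, in Python for brevity. The name
normalized_compare_text just mirrors the suggestion above, the choice
of NFD is an assumption, and the final comparison is by code point, so
it is not locale-aware collation.

```python
import unicodedata


def normalized_compare_text(a: str, b: str) -> int:
    """Compare two strings ignoring case and composition (sketch).

    Returns -1, 0 or 1, like CompareText-style routines.
    """
    def key(s: str) -> str:
        # Step 1: normalize (decompose), step 2: lowercase.
        return unicodedata.normalize("NFD", s).lower()

    ka, kb = key(a), key(b)
    # Step 3: plain code point comparison of the prepared strings.
    return (ka > kb) - (ka < kb)


print(normalized_compare_text("\u00C3", "a\u0303"))  # 0: Ã equals a + combining tilde
print(normalized_compare_text("abc", "ABD"))         # -1
print("\u00C3".lower() == "a\u0303")                 # False without normalization
```

The extra normalization pass on both inputs is what makes this more
expensive than a plain CompareText, which is the reason for keeping it
in a separate routine.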

-- 
Felipe Monteiro de Carvalho