[Lazarus] cwstring in arm-linux
Felipe Monteiro de Carvalho
felipemonteiro.carvalho at gmail.com
Wed Oct 19 20:03:39 CEST 2011
On Wed, Oct 19, 2011 at 6:33 PM, Martin Schreiber <mse00000 at gmail.com> wrote:
> Does it use locale specific collation in PasUnicodeCompareStr and
> PasUnicodeCompareText?
Good point. No, not yet. But AFAIK this affects only Turkish, Azeri
and Lithuanian.
Adding Turkish and Azeri is trivial, because UTF8LowerCase supports
them, but I have not yet understood the rules for Lithuanian. They are
quite convoluted and depend on the neighbouring characters.
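
To make the Turkish/Azeri special case concrete, here is a minimal sketch of the rule for the letter I only. TurkicLowerI is a hypothetical helper written for this illustration, not part of Lazarus or FPC, and it says nothing about how UTF8LowerCase implements the rule internally. It assumes the input is UTF-8.

```pascal
program TurkicLowerDemo;
{$mode objfpc}{$H+}

const
  // UTF-8 byte sequences, written as escapes so that the encoding of
  // the source file does not matter.
  DotlessSmallI  = #$C4#$B1; // U+0131 LATIN SMALL LETTER DOTLESS I
  DottedCapitalI = #$C4#$B0; // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE

// Hypothetical helper: applies only the Turkish/Azeri lowercase rule
// for the letter I to a UTF-8 string. All other bytes are copied as-is.
function TurkicLowerI(const S: string): string;
var
  i: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(S) do
  begin
    if S[i] = 'I' then
    begin
      Result := Result + DotlessSmallI;  // I -> dotless small i
      Inc(i);
    end
    else if Copy(S, i, 2) = DottedCapitalI then
    begin
      Result := Result + 'i';            // dotted capital I -> i
      Inc(i, 2);
    end
    else
    begin
      Result := Result + S[i];
      Inc(i);
    end;
  end;
end;

begin
  // Writes the UTF-8 bytes for dotless small i followed by plain i.
  WriteLn(TurkicLowerI('I' + DottedCapitalI));
end.
```

In the default (non-Turkic) mapping, a plain 'I' lowercases to 'i' instead, which is why the language has to be known before lowercasing.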
> Is the performance of UTF8LowerCase and UTF8UpperCase OK?
UTF8LowerCase was heavily optimized. UTF8UpperCase still needs more
optimization.
6 million UTF8LowerCase operations on the string "АБВЕЁЖЗКЛМНОПРДЙГ"
take 2.6 seconds on my computer. It outperforms iconv by a factor of
approximately 2.5:
Performance test took (one row per test string j):
 j   UTF8LowerCase   CWString SysUtils.UnicodeLowerCase
 0        804 ms     2456 ms
 1       1896 ms     2461 ms
 2       2318 ms     6594 ms
 3       3460 ms     6170 ms
 4       2647 ms     5347 ms
 5       1847 ms     6939 ms
 6       2526 ms     4398 ms
 7       2496 ms     4429 ms
 8       1830 ms     2285 ms
 9       1975 ms     2411 ms
For these strings:
if j = 0 then Str := UTF8LowerCase('abcdefghijklmnopqrstuwvxyz');
if j = 1 then Str := UTF8LowerCase('ABCDEFGHIJKLMNOPQRSTUWVXYZ');
if j = 2 then Str := UTF8LowerCase('aąbcćdeęfghijklłmnńoóprsśtuwyzźż');
if j = 3 then Str := UTF8LowerCase('AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ');
if j = 4 then Str := UTF8LowerCase('АБВЕЁЖЗКЛМНОПРДЙГ');
if j = 5 then Str := UTF8LowerCase('名字叫嘉英,嘉陵江的嘉,英國的英');
if j = 6 then Str := UTF8LowerCase('AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuWvVwXxYyZz');
if j = 7 then Str := UTF8LowerCase('AAaaBBbbCCccDDddEEeeFFffGGggHHhhIIiiJJjjKKkkLLllMMmm');
if j = 8 then Str := UTF8LowerCase('abcDefgHijkLmnoPqrsTuwvXyz');
if j = 9 then Str := UTF8LowerCase('ABCdEFGhIJKlMNOpQRStUWVxYZ');
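
For anyone who wants to repeat this kind of measurement, a timing harness could look roughly like the sketch below. It is not the actual test program. TimeCalls and AsciiLower are hypothetical names, the call count is arbitrary, and SysUtils.LowerCase (ASCII only) is used as a stand-in so that the program builds with plain FPC. Wrap UTF8LowerCase in a function of the same shape to time it instead.

```pascal
program LowerBench;
{$mode objfpc}{$H+}
uses
  SysUtils, DateUtils;

type
  TLowerFunc = function(const S: string): string;

// Stand-in for the routine under test (ASCII only).
function AsciiLower(const S: string): string;
begin
  Result := LowerCase(S);
end;

// Calls Func on Input Count times and returns the elapsed
// wall-clock time in milliseconds.
function TimeCalls(Func: TLowerFunc; const Input: string;
  Count: Integer): Int64;
var
  StartTime: TDateTime;
  i: Integer;
  Str: string;
begin
  StartTime := Now;
  for i := 1 to Count do
    Str := Func(Input);
  Result := MilliSecondsBetween(Now, StartTime);
end;

begin
  WriteLn('Performance test took: ',
    TimeCalls(@AsciiLower, 'ABCDEFGHIJKLMNOPQRSTUWVXYZ', 1000000), ' ms');
end.
```

Now has limited resolution on some platforms, so short runs are noisy. Use a call count large enough that each run takes at least a second or so.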
> Do UTF8LowerCase and UTF8UpperCase cover all upper/lowercase Unicode
> (possibly accented) characters?
UTF8LowerCase currently covers all characters in the latest Unicode
spec AFAIK. Of course I might have forgotten something, but I have
tests for chars from U+0000 to U+0580 and more tests for other
clusters. UTF8UpperCase is currently implemented from U+0000 to
U+0450, but I will add the rest.
> Does it handle decomposed characters (cwstring doesn't)?
I think that decomposed characters should work naturally. For
example, suppose we have [0]=A and [1]=~ (the tilde accent, in its
special combining version, U+0303), which together form "Ã". If we
lowercase this, we get [0]=a and [1] unchanged, which forms "ã". Or am
I wrong?
If you are talking about handling this in CompareText, then AFAIK it
would be too inefficient to do it there. We would need a separate
routine, NormalizedCompareText or something like that, which performs
normalization, then lowercasing and finally the comparison.
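
A rough sketch of that normalize, lowercase, compare pipeline follows. It is a toy under loud assumptions: ToyCompose and ToyLower are hypothetical stand-ins that know only the A/a + U+0303 pair, where a real routine would need full Unicode normalization and UTF8LowerCase. NormalizedCompareText is just the name proposed above and does not exist yet.

```pascal
program NormalizedCompareDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

const
  // UTF-8 byte sequences, written as escapes.
  CombiningTilde = #$CC#$83; // U+0303 COMBINING TILDE
  CapitalATilde  = #$C3#$83; // U+00C3, precomposed
  SmallATilde    = #$C3#$A3; // U+00E3, precomposed

// Toy stand-in for normalization: composes only A/a + U+0303.
function ToyCompose(const S: string): string;
begin
  Result := StringReplace(S, 'A' + CombiningTilde, CapitalATilde,
    [rfReplaceAll]);
  Result := StringReplace(Result, 'a' + CombiningTilde, SmallATilde,
    [rfReplaceAll]);
end;

// Toy stand-in for UTF8LowerCase: ASCII plus the one letter above.
function ToyLower(const S: string): string;
begin
  Result := StringReplace(LowerCase(S), CapitalATilde, SmallATilde,
    [rfReplaceAll]);
end;

// Normalize first, then lowercase, then compare byte-wise.
function NormalizedCompareText(const A, B: string): Integer;
begin
  Result := CompareStr(ToyLower(ToyCompose(A)), ToyLower(ToyCompose(B)));
end;

begin
  // Lowercasing alone keeps the text decomposed, so it does not match
  // the precomposed form: prints FALSE.
  WriteLn(CompareStr(LowerCase('A' + CombiningTilde), SmallATilde) = 0);
  // With the normalization step the two forms compare equal: prints TRUE.
  WriteLn(NormalizedCompareText('A' + CombiningTilde, SmallATilde) = 0);
end.
```

The extra pass over both strings is the cost that makes this unattractive inside the plain CompareText.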
--
Felipe Monteiro de Carvalho