[Lazarus] Making sources compatible with Delphi (but Lazarus is priority)
Tony Whyman
tony.whyman at mccallumwhyman.com
Thu May 4 10:56:18 CEST 2017
On 03/05/17 17:53, Sven Barth via Lazarus wrote:
>
> On 03.05.2017 14:37, "Tony Whyman via Lazarus"
> <lazarus at lists.lazarus-ide.org> wrote:
> > On the other hand, AnsiString and UnicodeString are still separate
> > types. Why? Why should there not be a single unified string type with
> > (e.g.) ASCII, UTF8 and UTF-16 (or MS Unicode) being just another code
> > page?
>
> Because indexed access to the string data would slow down quite a bit
> as the RTL would need to determine whether the string is a 1-Byte,
> 2-Byte, 4-Byte or multi Byte String. Yes the compiler could do
> optimizations for this inside loops, but it would definitely slow down
> -O- code.
>
> Regards,
> Sven
>
>
>
I don't believe that string indexing even works for UTF8 strings at
present, at least not in a simple s[i] way: s[i] gives you the byte at
position i, not the character.

Is it really that much overhead to have a simple codepage check before
calling the correct function to index a string? The obvious optimisation
would be to check for UTF8, then UTF16, then the default codepage and
then the rest, or perhaps UTF16 first on Windows. At register level, that
check amounts to very few actual machine instructions.
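
As a rough illustration of the codepage check described above, here is a
minimal sketch in Python (used only as neutral pseudocode; this is not FPC
RTL code). The `TaggedString` type, the `char_at` function and the choice
of codec names are all assumptions made for the example, and "character"
here means a Unicode code point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical unified string: raw bytes plus a code page tag."""
    data: bytes
    codepage: str  # a Python codec name, e.g. "utf-8", "utf-16-le", "cp1252"


def char_at(s: TaggedString, i: int) -> str:
    """Return the character at 1-based position i, dispatching on the tag."""
    if i < 1:
        raise IndexError("positions are 1-based")
    # Check the most likely encodings first, then fall through to the rest.
    if s.codepage == "utf-8":
        # Variable width: decode, then index by code point.
        return s.data.decode("utf-8")[i - 1]
    if s.codepage == "utf-16-le":
        # Decoding also folds a surrogate pair into a single code point.
        return s.data.decode("utf-16-le")[i - 1]
    # Assumption: every other code page is single-byte, so position i
    # maps directly to byte i.
    if i > len(s.data):
        raise IndexError("position past end of string")
    return s.data[i - 1:i].decode(s.codepage)
```

The same call then works whatever the tag is, for example
`char_at(TaggedString("h\u00e9llo".encode("cp1252"), "cp1252"), 2)`. The
only per-call cost that the dispatch itself adds is the chain of tag
comparisons; the variable-width branches cost more because of the
decoding, not because of the check.
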
To me, a unified string type would have these advantages:
- You would only have one managed string type, "string" (and hence avoid
the confusion that exists today).
- You would have standard string byte length and string character length
functions (which yes, in the latter case, would have to have a codepage
check as above).
- String indexing could be standardised as always returning the
character at position 'i' (including UTF8 strings - albeit after having
to "walk" the string).
- Automatic transliteration on string compare (with code page check of
course) - and perhaps with the option to specify a non-standard collation.
- Readily portable code.
- The only time that a programmer has to think about the character
encoding is when writing code that interacts directly with an external
interface.
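
To make the difference between byte length and character length concrete,
and to show what "walking" a UTF8 string to reach position 'i' involves,
here is a small Python sketch. The function names are hypothetical, the
input is assumed to be valid UTF-8, and "character" again means a code
point rather than a user-perceived character.

```python
def _seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def utf8_char_length(data: bytes) -> int:
    """Count code points: every byte that is not a continuation byte starts one."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_char_at(data: bytes, i: int) -> str:
    """Return the code point at 1-based position i by walking from the start."""
    if i < 1:
        raise IndexError("positions are 1-based")
    pos = 0
    # Skip over the first i - 1 characters, one whole sequence at a time.
    for _ in range(i - 1):
        if pos >= len(data):
            raise IndexError("position past end of string")
        pos += _seq_len(data[pos])
    if pos >= len(data):
        raise IndexError("position past end of string")
    return data[pos:pos + _seq_len(data[pos])].decode("utf-8")
```

For the bytes of `"na\u00efve \u20ac"`, `len(data)` reports 10 bytes while
`utf8_char_length(data)` reports 7 characters. Reaching position i takes
time proportional to i, which is the cost of the "walk" mentioned in the
list above.
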
How often would that extra lookup be significant compared with the
benefits that unified string handling would bring? And there is no
reason why you could not retain the UnicodeString type for cases where
you really need to optimise UTF16 handling.

I see the unified string type as a further extension to AnsiString to
include UTF16 and UCS2 code pages together with appropriate function
support.
Tony