[Lazarus] Making sources compatible with Delphi (but Lazarus is priority)

Thu May 4 10:56:18 CEST 2017

On 03/05/17 17:53, Sven Barth via Lazarus wrote:
>
> On 03.05.2017 14:37, "Tony Whyman via Lazarus"
> <lazarus at lists.lazarus-ide.org> wrote:
> > On the other hand, AnsiString and UnicodeString are still separate
> > types. Why? Why should there not be a single unified string type with
> > (e.g.) ASCII, UTF8 and UTF-16 (or MS Unicode) being just another code
> > page?
>
> Because indexed access to the string data would slow down quite a bit 
> as the RTL would need to determine whether the string is a 1-Byte, 
> 2-Byte, 4-Byte or multi Byte String. Yes the compiler could do 
> optimizations for this inside loops, but it would definitely slow down 
> -O- code.
>
> Regards,
> Sven

I don't believe that string indexing even works for UTF8 strings at 
present - at least not in a simple s[i] way.

Is it really that much overhead to have a simple codepage check before 
calling the correct function to index a string? The obvious optimisation 
would be to check for UTF8, then UTF16, then the default codepage and 
then the rest. Or perhaps UTF16 first on Windows. In register-level 
code, that check amounts to very few machine instructions.
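
To make the idea concrete, here is a minimal sketch of such a dispatch. It 
is written in Python rather than Pascal purely so it can be run standalone, 
and it is not how the FPC RTL works. The function names are hypothetical; 
the code page numbers are the usual Windows identifiers (65001 for UTF-8, 
1200 for UTF-16LE, 28591 for ISO-8859-1). Indexing is 1-based as in Pascal, 
"character" is taken to mean a code point, and the input is assumed to be 
well formed.

```python
CP_UTF8 = 65001    # UTF-8
CP_UTF16 = 1200    # UTF-16, little endian
CP_LATIN1 = 28591  # ISO-8859-1, standing in for "a single-byte code page"


def _utf8_seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence starting with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    return 2


def _utf8_char_at(data: bytes, i: int) -> str:
    # Variable-width encoding: walk from the start, skipping i-1 sequences.
    pos = 0
    for _ in range(i - 1):
        pos += _utf8_seq_len(data[pos])
    n = _utf8_seq_len(data[pos])
    return data[pos:pos + n].decode("utf-8")


def _utf16_char_at(data: bytes, i: int) -> str:
    # Also variable-width: a high surrogate means a 4-byte pair follows.
    pos = 0
    for k in range(i):
        unit = int.from_bytes(data[pos:pos + 2], "little")
        n = 4 if 0xD800 <= unit <= 0xDBFF else 2
        if k == i - 1:
            return data[pos:pos + n].decode("utf-16-le")
        pos += n
    raise IndexError(i)


def char_at(data: bytes, codepage: int, i: int) -> str:
    """Return the i-th code point (1-based) after a simple codepage check."""
    if codepage == CP_UTF8:
        return _utf8_char_at(data, i)
    if codepage == CP_UTF16:
        return _utf16_char_at(data, i)
    if codepage == CP_LATIN1:
        # Fixed width: direct offset, no walking needed.
        return data[i - 1:i].decode("latin-1")
    raise NotImplementedError(f"code page {codepage} not handled in this sketch")
```

The codepage test itself is a couple of integer comparisons; the real cost 
in the UTF-8 and UTF-16 branches is the walk, not the dispatch.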

To me, a unified string type would have the advantage that:

- You would have only one managed string type, "string" (and hence avoid 
the confusion that exists today).

- You would have standard string byte length and string character length 
functions (which yes, in the latter case, would have to have a codepage 
check as above).

- String indexing could be standardised as always returning the 
character at position 'i' (including UTF8 strings - albeit after having 
to "walk" the string).

- Automatic transliteration on string compare (with code page check of 
course) - and perhaps with the option to specify a non-standard collation.

- Readily portable code.

- The only time that a programmer has to think about the character 
encoding is when writing code that interacts directly with an external 
interface.
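
As an illustration of the length and compare points above, here is a rough 
sketch of a unified string value that carries its own code page. Again it 
is in Python only so that it runs standalone; the UnifiedString class and 
its method names are hypothetical and do not correspond to any existing 
FPC or Delphi type. "Character length" counts code points, and the compare 
simply converts both sides to a common form without any collation or 
normalisation.

```python
from dataclasses import dataclass

CP_UTF8, CP_UTF16, CP_1252 = 65001, 1200, 1252

# Python codec names used to stand in for RTL conversion routines.
_CODECS = {CP_UTF8: "utf-8", CP_UTF16: "utf-16-le", CP_1252: "cp1252"}


@dataclass(frozen=True)
class UnifiedString:
    data: bytes     # raw payload in the string's own encoding
    codepage: int   # tag stored alongside the payload

    @classmethod
    def from_text(cls, text: str, codepage: int) -> "UnifiedString":
        return cls(text.encode(_CODECS[codepage]), codepage)

    def byte_length(self) -> int:
        # Same answer for every code page: just the payload size.
        return len(self.data)

    def char_length(self) -> int:
        # Needs the codepage check: the counting rule differs per encoding.
        if self.codepage == CP_UTF8:
            # Count every byte that is not a continuation byte (10xxxxxx).
            return sum(1 for b in self.data if (b & 0xC0) != 0x80)
        if self.codepage == CP_UTF16:
            # Count 16-bit units, minus low surrogates (high byte DC..DF).
            units = len(self.data) // 2
            low = sum(1 for k in range(1, len(self.data), 2)
                      if 0xDC <= self.data[k] <= 0xDF)
            return units - low
        # Single-byte code page: one byte per character.
        return len(self.data)

    def same_text(self, other: "UnifiedString") -> bool:
        # Fast path when both sides share a code page.
        if self.codepage == other.codepage:
            return self.data == other.data
        # Otherwise convert both to a common form before comparing.
        return (self.data.decode(_CODECS[self.codepage])
                == other.data.decode(_CODECS[other.codepage]))
```

For example, "naïve" held as UTF-8, as code page 1252 and as UTF-16 has 
byte lengths of 6, 5 and 10 respectively, a character length of 5 in each 
case, and all three compare as the same text.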

How often would that extra lookup be significant compared with the 
benefits that unified string handling would bring? And, there is no 
reason why you could not retain the UnicodeString type for cases where 
you really need to optimise UTF16 handling.

I see the unified string type as a further extension to AnsiString to 
include UTF16 and UCS2 code pages together with appropriate function 
support.

Tony
