<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 03/05/17 17:53, Sven Barth via Lazarus wrote:<br>
<blockquote
cite="mid:CAFMUeB-Pxwp_G-JKD12Yf1EmMdQ=p0+7v4JkD4GZyUUU3L4jfg@mail.gmail.com"
type="cite">
<p>On 03.05.2017 14:37, "Tony Whyman via Lazarus" &lt;<a
moz-do-not-send="true"
href="mailto:lazarus@lists.lazarus-ide.org">lazarus@lists.lazarus-ide.org</a>&gt; wrote:<br>
&gt; On the other hand, AnsiString and UnicodeString are still
separate types. Why? Why should there not be a single unified
string type with (e.g.) ASCII, UTF8 and UTF-16 (or MS Unicode)
being just another code page?</p>
<p>Because indexed access to the string data would slow down quite
a bit, as the RTL would need to determine whether the string is a
1-byte, 2-byte, 4-byte or multi-byte string. Yes, the compiler
could optimise this inside loops, but it would
definitely slow down -O- code.</p>
<p>Regards,<br>
Sven</p>
<br>
<br>
</blockquote>
<br>
<p>I don't believe that string indexing even works for UTF8 strings
at present - at least not in a simple s[i] way.</p>
<p>Is it really that much overhead to have a simple codepage check
before calling the correct function to index a string? The obvious
optimisation would be to check for UTF8, then UTF16 then the
Default codepage and then the rest. Or perhaps UTF16 first for
Windows. With register level code you are talking about very few
actual machine level operations.<br>
</p>
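<p>A minimal sketch of such a check-then-dispatch indexer, written in
Python purely for illustration rather than as RTL code. The names
(char_at, utf8_char_at, utf16_char_at) are hypothetical, the index is
0-based and assumed non-negative, a "character" is taken to be one
Unicode code point, and the code page numbers are used only as tags on
the byte data.</p>
<pre>
```python
# Windows-style code page numbers, used here only as tags (illustrative).
CP_UTF8 = 65001
CP_UTF16 = 1200   # UTF-16, little-endian
CP_1252 = 1252    # an example single-byte code page


def utf8_char_at(data, i):
    """Walk UTF-8 bytes and return the i-th character (0-based)."""
    seen = -1
    start = None
    for pos, byte in enumerate(data):
        if byte // 64 == 2:        # 10xxxxxx: continuation byte, skip it
            continue
        seen += 1                  # any other byte starts a new character
        if seen == i:
            start = pos
        elif seen == i + 1:
            return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    return data[start:].decode("utf-8")


def utf16_char_at(data, i):
    """Walk UTF-16LE code units, treating a surrogate pair as one character."""
    pos = 0
    seen = 0
    while pos in range(len(data)):
        unit = int.from_bytes(data[pos:pos + 2], "little")
        # A high surrogate (D800..DBFF) means the character takes two units.
        width = 4 if unit in range(0xD800, 0xDC00) else 2
        if seen == i:
            return data[pos:pos + width].decode("utf-16-le")
        seen += 1
        pos += width
    raise IndexError("string index out of range")


def char_at(data, codepage, i):
    """Check the code page first, then call the matching indexing routine."""
    if codepage == CP_UTF8:
        return utf8_char_at(data, i)
    if codepage == CP_UTF16:
        return utf16_char_at(data, i)
    if codepage == CP_1252:
        return bytes([data[i]]).decode("cp1252")   # one byte per character
    raise ValueError("unsupported code page: %d" % codepage)


text = "naïve€😀"
print(char_at(text.encode("utf-8"), CP_UTF8, 6))
print(char_at(text.encode("utf-16-le"), CP_UTF16, 6))
print(char_at("naïve€".encode("cp1252"), CP_1252, 5))
```
</pre>
<p>The dispatch itself is a handful of comparisons; the cost that
varies is the walk, which is linear in the index for UTF8 and UTF16
but constant for a single-byte code page.</p>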
<p>To me, a unified string type would have the advantage that:</p>
<p>- You would only have one managed string type, "string" (and hence
avoid the confusion that exists today).<br>
</p>
<p>- You would have standard string byte length and string character
length functions (which yes, in the latter case, would have to
have a codepage check as above).</p>
<p>- String indexing could be standardised as always returning the
character at position 'i' (including UTF8 strings - albeit after
having to "walk" the string).<br>
</p>
<p>- Automatic transliteration on string compare (with code page
check of course) - and perhaps with the option to specify a
non-standard collation.<br>
</p>
<p>- Readily portable code.</p>
<p>- The only time that a programmer has to think about the
character encoding is when writing code that interacts directly
with an external interface.</p>
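<p>The length and compare points above can be sketched as follows, in
Python for illustration only. The UnifiedString class, its method names
and the CODECS table are hypothetical; a "character" is assumed to be
one Unicode code point, and the compare does no normalisation or
collation, it only transcodes when the code pages differ.</p>
<pre>
```python
# Hypothetical mapping from code page number to a Python codec name.
CODECS = {65001: "utf-8", 1200: "utf-16-le", 1252: "cp1252"}


class UnifiedString:
    """Byte data tagged with a code page, in the spirit of AnsiString."""

    def __init__(self, data, codepage):
        self.data = data
        self.codepage = codepage

    def to_text(self):
        # The code page check: pick the decoder that matches the tag.
        return self.data.decode(CODECS[self.codepage])

    def byte_length(self):
        # Storage size, independent of the encoding.
        return len(self.data)

    def char_length(self):
        # Number of code points, which needs the code page check.
        return len(self.to_text())

    def equals(self, other):
        # Same code page: compare bytes directly. Otherwise transcode first.
        if self.codepage == other.codepage:
            return self.data == other.data
        return self.to_text() == other.to_text()


word = "Grüße"
a = UnifiedString(word.encode("utf-8"), 65001)
b = UnifiedString(word.encode("cp1252"), 1252)
c = UnifiedString(word.encode("utf-16-le"), 1200)

for s in (a, b, c):
    print(s.codepage, s.byte_length(), s.char_length())
print(a.equals(b), b.equals(c))
```
</pre>
<p>The three values hold the same five characters in 7, 5 and 10 bytes
respectively, which is why separate byte-length and character-length
functions are needed.</p>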
<p>How often would that extra lookup be significant compared with
the benefits that unified string handling would bring? And, there
is no reason why you could not retain the UnicodeString type for
cases where you really need to optimise UTF16 handling. <br>
</p>
<p>I see the unified string type as a further extension to
AnsiString to include UTF16 and UCS2 code pages together with
appropriate function support.<br>
</p>
<p>Tony<br>
</p>
<br>
</body>
</html>