<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
On 16/08/17 11:05, Juha Manninen via Lazarus wrote:<br>
<blockquote type="cite"
cite="mid:CAPN1EhByoaA5N4V7SCs-=rE-u4SSwhXBRJuk8YBy0-885vfwJA@mail.gmail.com">
<pre wrap="">On Mon, Aug 14, 2017 at 4:21 PM, Tony Whyman via Lazarus
<a class="moz-txt-link-rfc2396E" href="mailto:lazarus@lists.lazarus-ide.org"><lazarus@lists.lazarus-ide.org></a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">UTF-16/Unicode can only store 65,536 characters while the Unicode standard
(that covers UTF8 as well) defines 136,755 characters.
UTF-16/Unicode's main advantage seems to be for rapid indexing of large
strings.
</pre>
</blockquote>
<pre wrap="">
That shows complete ignorance from your side about Unicode.
You consider UTF-16 as a fixed-width encoding. :(
Unfortunately many other programmers had the same wrong idea or they
were just lazy. The result anyway is a lot of broken UTF-16 code out
there.</pre>
</blockquote>
You do like to use the word "ignorance", don't you? You can, if you
want, take the view that all the "other programmers" who got the
wrong idea are "stupid monkeys that don't know any better" or,
alternatively, that they just wanted a nice cup of tea rather than
the not-quite-tea drink that was served up.<br>
<br>
Wikipedia sums the problem up nicely: "The early 2-byte encoding was
usually called "Unicode", but is now called "UCS-2". UCS-2 differs
from UTF-16 by being a constant length encoding and only capable of
encoding characters of BMP, it is supported by many programs."<br>
<br>
This is where the problem starts. The definition of "Unicode" was
changed (foolishly, in my opinion) after it had been accepted by the
community, and the result is confusion. Hence my first point about
not even using the term. In writing "UTF-16/Unicode" I was attempting to
convey the common use of the term, which is to see UTF-16 as what is
now defined as UCS-2. This is because hardly anyone I know uses the
term UCS-2; they just say "Unicode". Perhaps I spend too much time
amongst the ignorant.<br>
<br>
Wikipedia also makes the wonderful point that "The UTF-16 encoding
scheme was developed as a compromise to resolve this impasse in
version 2.0". The impasse had resulted from "4 bytes per
character wasted a lot of disk space and memory, and because some
manufacturers were already heavily invested in 2-byte-per-character
technology".<br>
<br>
Finally: "In UTF-16, code points greater or equal to 2<sup>16</sup>
are encoded using <i>two</i> 16-bit code units. The standards
organizations chose the largest block available of un-allocated
16-bit code points to use as these code units (since most existing
UCS-2 data did not use these code points and would be valid UTF-16).
Unlike UTF-8 they did not provide a means to encode these code
points".<br>
<br>
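The surrogate-pair arithmetic that quote describes can be sketched as
follows (Python is used here purely for illustration; the constants come
from the Unicode standard, and U+1F600 is just an invented example of a
non-BMP code point):<br>

```python
# Sketch: encoding a code point >= U+10000 as two 16-bit code units
# (a surrogate pair), per the scheme described in the quote above.
import struct

cp = 0x1F600                        # an emoji, well outside the BMP
offset = cp - 0x10000               # leaves a 20-bit value
high = 0xD800 + (offset >> 10)      # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)     # low (trail) surrogate
print(hex(high), hex(low))          # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 encoder (little-endian, no BOM):
units = struct.unpack('<2H', chr(cp).encode('utf-16-le'))
assert units == (high, low)
```
<br>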
This is where I get my own view that UTF-16, as defined by the
standards, is pointless. If you keep to a UCS-2-like subset
then you get rapid indexing of character arrays. But as soon as
you introduce the possibility of some characters being encoded as
two 16-bit units, you lose rapid indexing, and I can see no
advantage over UTF-8 - plus you get all the fun of worrying about
byte order. <br>
<br>
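The indexing point can be illustrated with a minimal Python sketch (the
string is an invented example; one non-BMP character is enough to break
the one-unit-per-character assumption):<br>

```python
# Sketch: fixed-stride indexing into UTF-16 code units breaks as soon
# as a single character needs a surrogate pair.
import struct

s = 'ab\U0001F600c'                          # 4 characters...
data = s.encode('utf-16-le')                 # ...but 5 16-bit code units
units = struct.unpack('<%dH' % (len(data) // 2), data)
print(len(s), len(units))                    # 4 5

# Naively treating unit 2 as "character 2" yields half a surrogate pair:
print(hex(units[2]))                         # 0xd83d - not a character at all
```
<br>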
Indeed, I believe those lazy programmers that you referred to are
actually making a conscious decision to work with a UTF-16 subset
restricted to 16-bit code points (i.e. the Basic Multilingual Plane)
precisely so that they can do rapid indexing. As soon as you bring
in code points encoded as two 16-bit code units, you lose that
benefit - and perhaps you should be using UTF-32 instead.<br>
<br>
IMHO, Linux has got it right by using UTF-8 as the standard for
character encoding, and one of Lazarus's USPs is that it follows that
lead - even on Windows. I can see why a program that does intensive
text scanning might use a UTF-16 constrained to the BMP (i.e. 16-bit
code units only), but not why anyone would prefer an unconstrained
UTF-16 over UTF-8.<br>
<br>
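As a final illustration of the byte-order point (again only a sketch, in
Python for brevity): the same character has two possible UTF-16
serialisations, which is why a BOM or some out-of-band agreement is
needed, whereas UTF-8 has exactly one.<br>

```python
# Sketch: UTF-16 exists in two byte orders; UTF-8 has a single
# serialisation and so no ordering ambiguity.
s = 'A'
print(s.encode('utf-16-le').hex())   # 4100
print(s.encode('utf-16-be').hex())   # 0041
print(s.encode('utf-8').hex())       # 41
```
<br>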
</body>
</html>