<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
On 16/08/17 11:05, Juha Manninen via Lazarus wrote:<br>
<blockquote type="cite"
cite="mid:CAPN1EhByoaA5N4V7SCs-=rE-u4SSwhXBRJuk8YBy0-885vfwJA@mail.gmail.com">
<pre wrap="">On Mon, Aug 14, 2017 at 4:21 PM, Tony Whyman via Lazarus
<a class="moz-txt-link-rfc2396E" href="mailto:lazarus@lists.lazarus-ide.org"><lazarus@lists.lazarus-ide.org></a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">UTF-16/Unicode can only store 65,536 characters while the Unicode standard
(that covers UTF8 as well) defines 136,755 characters.
UTF-16/Unicode's main advantage seems to be for rapid indexing of large
strings.
</pre>
</blockquote>
<pre wrap="">
That shows complete ignorance from your side about Unicode.
You consider UTF-16 as a fixed-width encoding. :(
Unfortunately many other programmers had the same wrong idea or they
were just lazy. The result anyway is a lot of broken UTF-16 code out
there.</pre>
</blockquote>
You do like to use the word "ignorance", don't you? You can, if you
want, take the view that all the "other programmers" who got the
wrong idea are "stupid monkeys that don't know any better" or,
alternatively, that they just wanted a nice cup of tea rather than
the not-quite-tea drink that was served up.<br>
<br>
Wikipedia sums the problem up nicely: "The early 2-byte encoding was
usually called "Unicode", but is now called "UCS-2". UCS-2 differs
from UTF-16 by being a constant length encoding and only capable of
encoding characters of BMP, it is supported by many programs."<br>
<br>
This is where the problem starts. The definition of "Unicode" was
changed (foolishly, in my opinion) after it had been accepted by the
community, and the result is confusion. Hence my first point about
not even using the term. In writing "UTF-16/Unicode" I was attempting to
convey the common use of the term, which is to see UTF-16 as what is
now defined as UCS-2. This is because hardly anyone I know uses the
term UCS-2; they just say "Unicode". Perhaps I spend too much time
amongst the ignorant.<br>
<br>
Wikipedia also makes the wonderful point that "The UTF-16 encoding
scheme was developed as a compromise to resolve this impasse in
version 2.0". The impasse had resulted from "4 bytes per
character wasted a lot of disk space and memory, and because some
manufacturers were already heavily invested in 2-byte-per-character
technology".<br>
<br>
Finally: "In UTF-16, code points greater or equal to 2<sup>16</sup>
are encoded using <i>two</i> 16-bit code units. The standards
organizations chose the largest block available of un-allocated
16-bit code points to use as these code units (since most existing
UCS-2 data did not use these code points and would be valid UTF-16).
Unlike UTF-8 they did not provide a means to encode these code
points".<br>
<br>
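The surrogate-pair arithmetic that quote describes can be sketched as
follows (Python is used here purely for illustration; the constants come
from the Unicode standard, and U+1F600 is just an invented example of a
non-BMP code point):<br>

```python
# Sketch: encoding a code point >= U+10000 as two 16-bit code units
# (a surrogate pair), per the scheme described in the quote above.
import struct

cp = 0x1F600                        # an emoji, well outside the BMP
offset = cp - 0x10000               # leaves a 20-bit value
high = 0xD800 + (offset >> 10)      # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)     # low (trail) surrogate
print(hex(high), hex(low))          # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 encoder (little-endian, no BOM):
units = struct.unpack('<2H', chr(cp).encode('utf-16-le'))
assert units == (high, low)
```
<br>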
This is where I get my own view that UTF-16, as defined by the
standards, is pointless. If you keep to a UCS-2-like subset
then you get rapid indexing of character arrays. But as soon as
you introduce the possibility of some characters being encoded as
two 16-bit units, you lose rapid indexing, and I can see no
advantage over UTF-8 - plus you get all the fun of worrying about
byte order. <br>
<br>
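The indexing point can be illustrated with a minimal Python sketch (the
string is an invented example; one non-BMP character is enough to break
the one-unit-per-character assumption):<br>

```python
# Sketch: fixed-stride indexing into UTF-16 code units breaks as soon
# as a single character needs a surrogate pair.
import struct

s = 'ab\U0001F600c'                          # 4 characters...
data = s.encode('utf-16-le')                 # ...but 5 16-bit code units
units = struct.unpack('<%dH' % (len(data) // 2), data)
print(len(s), len(units))                    # 4 5

# Naively treating unit 2 as "character 2" yields half a surrogate pair:
print(hex(units[2]))                         # 0xd83d - not a character at all
```
<br>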
Indeed, I believe those lazy programmers that you referred to are
actually making a conscious decision to work with a UTF-16 subset
restricted to 16-bit code points (i.e. the Basic Multilingual Plane)
precisely so that they can do rapid indexing. As soon as you bring
in code points encoded as two 16-bit code units, you lose that
benefit - and perhaps you should be using UTF-32 instead.<br>
<br>
IMHO, Linux has got it right by using UTF-8 as the standard for
character encoding, and one of Lazarus's USPs is that it follows that
lead - even on Windows. I can see why a program that does intensive
text scanning might use a UTF-16 constrained to the BMP (i.e. 16-bit
code units only), but not why anyone would prefer an unconstrained
UTF-16 over UTF-8.<br>
<br>
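As a final illustration of the byte-order point (again only a sketch, in
Python for brevity): the same character has two possible UTF-16
serialisations, which is why a BOM or some out-of-band agreement is
needed, whereas UTF-8 has exactly one.<br>

```python
# Sketch: UTF-16 exists in two byte orders; UTF-8 has a single
# serialisation and so no ordering ambiguity.
s = 'A'
print(s.encode('utf-16-le').hex())   # 4100
print(s.encode('utf-16-be').hex())   # 0041
print(s.encode('utf-8').hex())       # 41
```
<br>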
</body>
</html>