[Lazarus] String vs WideString
Tony Whyman
tony.whyman at mccallumwhyman.com
Thu Aug 17 12:41:03 CEST 2017
On 16/08/17 11:05, Juha Manninen via Lazarus wrote:
> On Mon, Aug 14, 2017 at 4:21 PM, Tony Whyman via Lazarus
> <lazarus at lists.lazarus-ide.org> wrote:
>> UTF-16/Unicode can only store 65,536 characters while the Unicode standard
>> (that covers UTF8 as well) defines 136,755 characters.
>> UTF-16/Unicode's main advantage seems to be for rapid indexing of large
>> strings.
> That shows complete ignorance from your side about Unicode.
> You consider UTF-16 as a fixed-width encoding. :(
> Unfortunately many other programmers had the same wrong idea or they
> were just lazy. The result anyway is a lot of broken UTF-16 code out
> there.
You do like to use the word "ignorance", don't you? You can, if you
want, take the view that all the "other programmers" who got the wrong
idea are "stupid monkeys that don't know any better" or, alternatively,
that they just wanted a nice cup of tea rather than the not-quite-tea
drink that was served up.
Wikipedia sums the problem up nicely: "The early 2-byte encoding was
usually called "Unicode", but is now called "UCS-2". UCS-2 differs from
UTF-16 by being a constant length encoding and only capable of encoding
characters of BMP, it is supported by many programs."
This is where the problem starts. The definition of "Unicode" was
changed (foolishly in my opinion) after it had been accepted by the
community and the result is confusion. Hence my first point about not
even using it. In using "UTF-16/Unicode" I was attempting to convey the
common use of the term, which is to see UTF-16 as what is now defined
as UCS-2. This is because hardly anyone I know uses the term UCS-2 and
instead says "Unicode". Perhaps I just spend too much time amongst the
ignorant.
Wikipedia also makes the wonderful point that "The UTF-16 encoding
scheme was developed as a compromise to resolve this impasse in version
2.0". The impasse had arisen because "4 bytes per character wasted a
lot of disk space and memory" and because "some manufacturers were
already heavily invested in 2-byte-per-character technology".
Finally: "In UTF-16, code points greater or equal to 2^16 are encoded
using /two/ 16-bit code units. The standards organizations chose the
largest block available of un-allocated 16-bit code points to use as
these code units (since most existing UCS-2 data did not use these code
points and would be valid UTF-16). Unlike UTF-8 they did not provide a
means to encode these code points".
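
The arithmetic behind those two code units is simple enough to show.
Here is a minimal Free Pascal sketch (FPC 3.x assumed; U+1F600 is just
an arbitrary code point above the BMP picked for illustration) that
derives the lead and trail surrogates:

program SurrogateMath;
{$mode objfpc}{$H+}
uses
  SysUtils;
const
  CodePoint = $1F600; // arbitrary example of a code point above the BMP
var
  U: LongWord;
  Lead, Trail: Word;
begin
  U := CodePoint - $10000;          // 20 bits remain after subtracting the offset
  Lead  := $D800 or (U shr 10);     // top 10 bits -> lead (high) surrogate
  Trail := $DC00 or (U and $3FF);   // bottom 10 bits -> trail (low) surrogate
  WriteLn('U+', IntToHex(CodePoint, 5), ' -> ',
          IntToHex(Lead, 4), ' ', IntToHex(Trail, 4));
  // prints: U+1F600 -> D83D DE00
end.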
This is where I get my own view that UTF-16, as defined by the
standards, is pointless. If you keep to a UCS-2-like subset then you
get rapid indexing of character arrays. But as soon as you introduce
the possibility of some characters being encoded as two 16-bit units,
you lose rapid indexing and I can see no advantage over UTF-8 - plus
you get all the fun of worrying about byte order.
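
To make that loss concrete, here is a small sketch (again FPC 3.x and
UnicodeString assumed, with U+1F600 only as an example) showing that
ordinary indexing addresses 16-bit code units, not characters, once a
non-BMP character is present:

program IndexingDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  S: UnicodeString;
  i: Integer;
begin
  // 'A' followed by U+1F600, which UTF-16 stores as the pair D83D DE00
  S := 'A';
  S := S + WideChar($D83D) + WideChar($DE00);
  WriteLn('Characters: 2, code units: ', Length(S)); // Length counts code units: 3
  for i := 1 to Length(S) do
    WriteLn(i, ': U+', IntToHex(Ord(S[i]), 4));      // positions 2 and 3 are surrogate halves
end.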
Indeed, I believe those lazy programmers that you referred to are
actually making a conscious decision to work with a UTF-16 subset
containing only 16-bit code points (i.e. the Basic Multilingual Plane)
precisely so that they can do rapid indexing. As soon as you bring in 2
x 16-bit code unit code points, you lose that benefit - and perhaps you
should be using UTF-32.
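
As a sketch of what the full encoding costs, the function below counts
code points in a UnicodeString: once lead surrogates may appear, any
character-oriented operation has to scan rather than index (Free
Pascal, FPC 3.x assumed; the function is my own illustration, not RTL
code):

program CountCodePoints;
{$mode objfpc}{$H+}

// Once surrogate pairs are possible, character positions can no longer
// be computed from code-unit indices; they have to be found by scanning.
function CodePointCount(const S: UnicodeString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
  begin
    // a lead surrogate (D800..DBFF) means this character uses two code units
    if (Ord(S[i]) >= $D800) and (Ord(S[i]) <= $DBFF) and (i < Length(S)) then
      Inc(i, 2)
    else
      Inc(i);
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  S := 'AB';
  S := S + WideChar($D83D) + WideChar($DE00);  // append one non-BMP character
  WriteLn('Code units:  ', Length(S));         // 4
  WriteLn('Code points: ', CodePointCount(S)); // 3
end.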
IMHO, Linux has got it right by using UTF-8 as the standard for
character encoding, and one of Lazarus's USPs is that it follows that
lead - even for Windows. I can see why a program that does intensive
text scanning might use UTF-16 constrained to the BMP (i.e. 16-bit
only), but not why anyone would prefer an unconstrained UTF-16 over
UTF-8.
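
For what it is worth, here is a quick sketch of the storage difference
for ASCII-heavy text, which is one of the reasons the UTF-8 default
feels right to me (Free Pascal; UTF8Decode is from the RTL, the sample
string is of course arbitrary):

program Utf8VsUtf16;
{$mode objfpc}{$H+}
var
  U8: UTF8String;
  U16: UnicodeString;
begin
  // ASCII-heavy text: one byte per character in UTF-8, two in UTF-16
  U8  := 'Hello from Lazarus';
  U16 := UTF8Decode(U8);
  WriteLn('UTF-8 bytes : ', Length(U8));                     // 18
  WriteLn('UTF-16 bytes: ', Length(U16) * SizeOf(WideChar)); // 36
end.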