[Lazarus] String vs WideString

Tony Whyman tony.whyman at mccallumwhyman.com
Thu Aug 17 12:41:03 CEST 2017


On 16/08/17 11:05, Juha Manninen via Lazarus wrote:
> On Mon, Aug 14, 2017 at 4:21 PM, Tony Whyman via Lazarus
> <lazarus at lists.lazarus-ide.org> wrote:
>> UTF-16/Unicode can only store 65,536 characters while the Unicode standard
>> (that covers UTF8 as well) defines 136,755 characters.
>> UTF-16/Unicode's main advantage seems to be for rapid indexing of large
>> strings.
> That shows complete ignorance from your side about Unicode.
> You consider UTF-16 as a fixed-width encoding.  :(
> Unfortunately many other programmers had the same wrong idea or they
> were just lazy. The result anyway is a lot of broken UTF-16 code out
> there.
You do like to use the word "ignorance", don't you? You can, if you want, 
take the view that all the "other programmers" who got the wrong idea 
are "stupid monkeys that don't know any better" or, alternatively, that 
they just wanted a nice cup of tea rather than the not-quite-tea drink 
that was served up.

Wikipedia sums the problem up nicely: "The early 2-byte encoding was 
usually called "Unicode", but is now called "UCS-2". UCS-2 differs from 
UTF-16 by being a constant length encoding and only capable of encoding 
characters of BMP, it is supported by many programs."

This is where the problem starts. The definition of "Unicode" was 
changed (foolishly, in my opinion) after it had been accepted by the 
community, and the result is confusion. Hence my first point about not 
even using the term. In writing "UTF-16/Unicode" I was attempting to convey the 
common use of the term, which is to see UTF-16 as what is now defined as 
UCS-2. This is because hardly anyone I know says "UCS-2"; they say 
"Unicode" instead. Perhaps I just spend too much time amongst the ignorant.

Wikipedia also makes the wonderful point that "The UTF-16 encoding 
scheme was developed as a compromise to resolve this impasse in version 
2.0". The impasse arose because "4 bytes per character wasted a 
lot of disk space and memory, and because some manufacturers were 
already heavily invested in 2-byte-per-character technology".

Finally: "In UTF-16, code points greater or equal to 2^16 are encoded 
using /two/ 16-bit code units. The standards organizations chose the 
largest block available of un-allocated 16-bit code points to use as 
these code units (since most existing UCS-2 data did not use these code 
points and would be valid UTF-16). Unlike UTF-8 they did not provide a 
means to encode these code points".

This is where I get my own view that UTF-16, as defined by the 
standards, is pointless. If you keep to a UCS-2-like subset then 
you can get rapid indexing of character arrays. But as soon as you 
introduce the possibility of some characters being encoded as two 16-bit 
units, you lose rapid indexing and I can see no advantage over UTF-8 
- plus you get all the fun of worrying about byte order.
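
To make the indexing point concrete: with FPC's UTF-16 encoded 
UnicodeString, Length and s[i] work in 16-bit code units, so the moment 
a surrogate pair turns up the unit index no longer matches the character 
index. A minimal sketch:

program IndexingDemo;
{$mode objfpc}{$H+}
var
  s: UnicodeString;
begin
  s := 'A';
  s := s + WideChar($D83D) + WideChar($DE00);  { U+1F600 stored as a surrogate pair }
  s := s + 'B';
  WriteLn(Length(s));   { 4 code units, although there are only 3 characters }
  WriteLn(Ord(s[2]));   { s[2] is a lone high surrogate, not a whole character }
end.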

Indeed, I believe those lazy programmers that you referred to are 
actually making a conscious decision to prefer to work with a UTF-16 
subset restricted to 16-bit code points (i.e. the Basic Multilingual Plane) 
precisely so that they can do rapid indexing. As soon as you bring in 2 
x 16-bit code unit code points, you lose that benefit - and perhaps you 
should be using UTF-32.
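
If constant-time character indexing really is the requirement, then 
converting to UTF-32 for the duration of the processing is the honest 
answer. A sketch, assuming the RTL's UnicodeStringToUCS4String 
conversion (which I believe recent FPC versions provide):

program Ucs4Demo;
{$mode objfpc}{$H+}
var
  s: UnicodeString;
  u: UCS4String;
begin
  s := 'A';
  s := s + WideChar($D83D) + WideChar($DE00) + 'B';
  u := UnicodeStringToUCS4String(s);  { one 32-bit element per code point }
  WriteLn(Length(u) - 1);             { 3 - UCS4String carries a terminating zero element }
  WriteLn(u[1]);                      { 128512 = $1F600, indexed as a whole character }
end.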

IMHO, Linux has got it right by using UTF-8 as the standard for 
character encoding, and one of Lazarus's USPs is that it follows that 
lead - even on Windows. I can see why a program that does intensive 
text scanning would use UTF-16 constrained to the BMP (i.e. 16-bit 
only), but not why anyone would prefer an unconstrained UTF-16 over UTF-8.
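
For completeness, this is roughly what code-point-aware handling looks 
like with the UTF-8 default, assuming the UTF8Length and UTF8Copy 
routines from the LazUTF8 unit in LazUtils:

program Utf8Demo;
{$mode objfpc}{$H+}
uses
  LazUTF8;                       { from the LazUtils package }
var
  s: string;
begin
  { 'Grüße' spelled out as raw UTF-8 bytes so the source encoding does not matter }
  s := 'Gr'#$C3#$BC#$C3#$9F'e';
  WriteLn(Length(s));            { 7 - bytes }
  WriteLn(UTF8Length(s));        { 5 - code points }
  WriteLn(UTF8Copy(s, 1, 4));    { 'Grüß' - copies by code points, not bytes }
end.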
