[Lazarus] UTF16 2 utf8

Marc Weustink marc.weustink at cuperus.nl
Thu May 5 13:38:51 CEST 2011


José Mejuto wrote:
> Hello Lazarus-List,
>
> Thursday, May 5, 2011, 12:20:10 PM, you wrote:
>
> MW>  According to unicode.org, when UTF-16 got introduced, the UCS2 standard
> MW>  was extended. So yes they are the same.
> MW>  In some cases when people refer to UCS2 they mean the old Unicode 1.1
> MW>  standard, but that is wrong. The name UCS2 is misleading and should not
> MW>  be used anymore.
>
> Maybe, but, taken from www.unicode.org glossary of terms:
>
> UCS-2. ISO/IEC 10646 encoding form: Universal Character Set coded in 2
> octets, limited to the Basic Multilingual Plane. (See Appendix C,
> Relationship to ISO/IEC 10646.)
>
> UTF-16. A multibyte encoding for text that represents each Unicode
> character with 2 or 4 bytes; it is not backward-compatible with ASCII.
> It is the internal form of Unicode in many programming languages, such
> as Java, C#, and JavaScript, and in many operating systems. More
> technically: (1) The UTF-16 encoding form. (2) The UTF-16 encoding
> scheme. (3) "Transformation format for 16 planes of Group 00," defined
> in Annex C of ISO/IEC 10646:2003; technically equivalent to the
> definitions in the Unicode Standard.
>
> --------------------
>
> I think the text saying that UCS2 has been extended does not mean
> that UCS2 itself was extended. It says that UCS2 was extended into
> UTF-16, so UCS2 cannot be considered Unicode anymore, as noted in
> ISO 10646:
>
> UCS-2. UCS-2 stands for "Universal Character Set coded in 2 octets" and is also known as
"the two-octet BMP form." It was documented in earlier editions of 10646 as the two-octet
> (16-bit) encoding consisting only of code positions for plane zero, the Basic Multilingual
> Plane. This documentation has been removed from ISO/IEC 10646:2011, and the term
> UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either
> 10646 or the Unicode Standard.


From the basic FAQ:
Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 is obsolete terminology which refers to a Unicode 
implementation up to Unicode 1.1, before surrogate code points and 
UTF-16 were added to Version 2.0 of the standard. This term should now 
be avoided.

UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 
are identical for purposes of data exchange. Both are 16-bit, and have 
exactly the same code unit representation.

Sometimes in the past an implementation has been labeled "UCS-2" to 
indicate that it does not support supplementary characters and doesn't 
interpret pairs of surrogate code points as characters. Such an 
implementation would not handle processing of character properties, code 
point boundaries, collation, etc. for supplementary characters. [AF]

From "Both are 16-bit, and have exactly the same code unit 
representation" I concluded that they are the same.
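
A minimal Python sketch of what "the same code unit representation" 
means in practice (not taken from the FAQ; the helper names 
utf16_code_units and is_surrogate are made up for illustration). A 
character inside the BMP is a single 16-bit unit under either label, 
while a supplementary character shows up as two units in the surrogate 
range, which a "UCS-2" implementation would not pair up into one 
character:

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as big-endian UTF-16."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def is_surrogate(unit):
    """True if the 16-bit unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF

bmp = utf16_code_units("\u00e9")        # U+00E9, inside the BMP
supp = utf16_code_units("\U0001D11E")   # U+1D11E, outside the BMP

print([hex(u) for u in bmp])    # ['0xe9']
print([hex(u) for u in supp])   # ['0xd834', '0xdd1e']
print(all(is_surrogate(u) for u in supp))  # True
```

For BMP-only text both labels give identical 16-bit units; the 
difference only appears in how the D800..DFFF range is interpreted.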


Anyway, we're drifting away from the original question.


Marc



