[Lazarus] UTF16 2 utf8

José Mejuto joshyfun at gmail.com
Thu May 5 12:41:51 CEST 2011


Hello Lazarus-List,

Thursday, May 5, 2011, 12:20:10 PM, you wrote:

MW> According to unicode.org, when UTF-16 was introduced, the UCS-2 standard
MW> was extended. So yes, they are the same.
MW> In some cases when people refer to UCS-2 they mean the old Unicode 1.1
MW> standard, but that is wrong. The name UCS-2 is misleading and should not
MW> be used anymore.

Maybe, but this is taken from the www.unicode.org glossary of terms:

UCS-2. ISO/IEC 10646 encoding form: Universal Character Set coded in 2
octets, limited to the Basic Multilingual Plane. (See Appendix C,
Relationship to ISO/IEC 10646.)

UTF-16. A multibyte encoding for text that represents each Unicode
character with 2 or 4 bytes; it is not backward-compatible with ASCII.
It is the internal form of Unicode in many programming languages, such
as Java, C#, and JavaScript, and in many operating systems. More
technically: (1) The UTF-16 encoding form. (2) The UTF-16 encoding
scheme. (3) “Transformation format for 16 planes of Group 00,” defined
in Annex C of ISO/IEC 10646:2003; technically equivalent to the
definitions in the Unicode Standard.
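
The difference between the two glossary entries can be seen in a small
sketch. It is written in Python purely for illustration, and the helper
name utf16_units is hypothetical. It counts how many 16-bit code units
UTF-16 needs per character: one for anything in the Basic Multilingual
Plane, two (4 bytes) for anything outside it, which a strictly two-octet
form cannot hold.

```python
def utf16_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for one character."""
    # Encode without a byte order mark; every code unit is 2 bytes long.
    return len(ch.encode("utf-16-le")) // 2


# The last sample is U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the BMP.
for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
    units = utf16_units(ch)
    in_bmp = ord(ch) <= 0xFFFF
    print(f"U+{ord(ch):04X}: {units} unit(s), {units * 2} bytes, in BMP: {in_bmp}")
```

Even "A" takes two bytes here (0x41 0x00 in little-endian order), which
is why the glossary says UTF-16 is not backward-compatible with ASCII.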

--------------------

I think the text saying that UCS-2 has been extended does not mean that
UCS-2 itself was enlarged; it says that UCS-2 was extended into UTF-16,
so UCS-2 can no longer be considered Unicode, as noted in
ISO 10646:

UCS-2. UCS-2 stands for “Universal Character Set coded in 2 octets” and is also known as
“the two-octet BMP form.” It was documented in earlier editions of 10646 as the two-octet
(16-bit) encoding consisting only of code positions for plane zero, the Basic Multilingual
Plane. This documentation has been removed from ISO/IEC 10646:2011, and the term
UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either
10646 or the Unicode Standard.
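
As a rough illustration of what UTF-16 adds on top of the old two-octet
form, the sketch below splits a code point above U+FFFF into a surrogate
pair. It is Python used only as an example, and the helper name
to_surrogates is hypothetical. UCS-2 had no such step, which is why it
stops at the Basic Multilingual Plane.

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is in the BMP or out of range")
    v = cp - 0x10000            # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogates(0x1D11E)  # MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")
```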

-- Best regards,
   José




