One approach to working with ideographic character sets is to use a character encoding scheme that maps each character to a numeric value (or code point) that is larger than one byte. Such characters are referred to as wide characters. A code page is a specific mapping of characters to code points.
Many systems for encoding characters have been devised. For instance, Shift JIS is a character encoding for Japanese in which each character is represented by a one-byte or two-byte code. A string of such characters is a string of single bytes that holds characters in a variable-width encoding.
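The variable-width behavior described above can be seen with a minimal sketch. It is written in Python rather than Delphi purely for illustration, and it assumes Python's built-in "shift_jis" codec; the helper name shift_jis_widths and the sample string are hypothetical.

```python
def shift_jis_widths(text):
    """Return (character, byte count) pairs for text encoded as Shift JIS."""
    return [(ch, len(ch.encode("shift_jis"))) for ch in text]


# One Latin letter followed by two Japanese characters (hiragana A, kanji "day").
sample = "A\u3042\u65e5"
widths = shift_jis_widths(sample)
encoded = sample.encode("shift_jis")

# The encoded form is a plain byte string, longer than the character count.
print(widths)
print(len(sample), "characters,", len(encoded), "bytes")
```

The Latin letter occupies one byte while each Japanese character occupies two, so the byte length of the string cannot be used as its character count.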
Multibyte Character Set, or MBCS, is a term used to describe code pages whose characters are encoded as strings of single bytes, with some characters occupying more than one byte. Such encodings typically include single-byte characters that are provided for backward compatibility.
The Unicode standard describes a system for representing characters used in all the world's writing systems. It defines well over 100,000 characters within a code space of 1,114,112 code points. The first 256 Unicode code points match the ISO 8859-1 (Latin-1) character set, which largely coincides with the Western ANSI code page. The term MBCS is not used to refer to Unicode.
Depending on the encoding, Unicode characters may be 1, 2, 3, or 4 bytes. The Unicode Transformation Format (UTF) encoding systems are equivalent representations of characters and are easily converted between each other. UTF-8 uses one to four bytes per code point. UTF-16 uses either two or four bytes; characters in the Basic Multilingual Plane (BMP) that contains most of the world's characters in current use can be represented in two bytes. UTF-32 represents each character as four bytes.
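As a rough illustration of the byte counts above, the following sketch encodes four sample characters in each UTF form. It uses Python rather than Delphi for illustration only; the helper name encoded_sizes and the choice of samples are assumptions, and the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16, and UTF-32."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-le")),
        "UTF-32": len(ch.encode("utf-32-le")),
    }


# U+0041, U+00E9, and U+20AC are in the BMP; U+1F600 is outside it.
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]
sizes = {f"U+{ord(ch):04X}": encoded_sizes(ch) for ch in samples}

for code_point, row in sizes.items():
    print(code_point, row)

# The forms are equivalent: converting between them loses nothing.
round_trip = samples[3].encode("utf-8").decode("utf-8").encode("utf-16-le").decode("utf-16-le")
```

The three BMP characters take one, two, and three bytes in UTF-8 but always two in UTF-16, while the character outside the BMP takes four bytes in every form.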
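The Windows operating system uses UTF-16 for its native Unicode APIs. On Linux, wide characters are typically four bytes and hold UTF-32, while UTF-8 is the usual encoding for text and file names.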
In Delphi, the WideString type represents a string of two-byte character elements. A WideChar is a two-byte element, and a PWideChar is a pointer to a null-terminated string of two-byte character elements. A WideString typically contains UTF-16 encoded characters. Since a code point may be represented by 2 or 4 bytes (one or two elements), the number of two-byte elements in a WideString is not necessarily the number of characters in the string.
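The difference between element count and character count can be sketched as follows. This is Python, not Delphi, and is only an illustration: Python's len counts code points, so the hypothetical helper utf16_code_units computes the number of two-byte elements that the same text would occupy in a UTF-16 string.

```python
def utf16_code_units(text):
    """Count the two-byte elements needed to hold text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# 'a' (in the BMP) followed by U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP).
s = "a\U0001D11E"

code_points = len(s)            # characters as code points
elements = utf16_code_units(s)  # two-byte elements in UTF-16

print(code_points, "code points,", elements, "two-byte elements")
```

The character outside the BMP needs a surrogate pair, so the string holds two code points but three two-byte elements.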
The UnicodeString type represents Unicode character strings. Though the WideString type is appropriate for COM, the UnicodeString type is reference counted and is generally preferred.
The AnsiString type represents strings of single-byte character elements and can be used for MBCS text. AnsiString is not used for Unicode.
Copyright(C) 2009 Embarcadero Technologies, Inc. All Rights Reserved.