RAD Studio
Multibyte Character Sets (MBCS)

The ideographic character sets used in Asia cannot use a simple 1:1 mapping between the characters of the language and the one-byte (8-bit) AnsiChar type, because these languages have too many characters to be represented by a single byte. Instead, a multibyte character set (MBCS) string contains one or more bytes per character. A multibyte character set thus provides a way to encode characters outside the single-byte range in strings whose elements are single-byte AnsiChar values.

The lead byte of every multibyte character code is taken from a reserved range that depends on the specific character set. The second and subsequent bytes can sometimes be the same as the character code of a separate one-byte character, or they can fall in the range reserved for the lead byte of multibyte characters. Thus, the only way to tell whether a particular byte in a string represents a single character or is part of a multibyte character is to read the string from the beginning, parsing it into characters of two or more bytes whenever a lead byte from the reserved range is encountered.
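
The scan described above can be sketched as follows. This is a minimal illustration rather than RTL code: the names IsLeadByte and CountCharacters are hypothetical, and the lead-byte ranges are those of Shift-JIS (code page 932), assumed here only as an example of a double-byte character set. Other character sets reserve different ranges.

```delphi
program LeadByteScan;

{$APPTYPE CONSOLE}

// Assumption: Shift-JIS (code page 932) lead-byte ranges, used only as
// an illustration. Other character sets reserve different ranges.
function IsLeadByte(B: Byte): Boolean;
begin
  Result := ((B >= $81) and (B <= $9F)) or ((B >= $E0) and (B <= $FC));
end;

// Walks the buffer from the first byte and counts characters, consuming
// two bytes whenever a lead byte is found.
function CountCharacters(const Data: array of Byte): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := Low(Data);
  while I <= High(Data) do
  begin
    if IsLeadByte(Data[I]) and (I < High(Data)) then
      Inc(I, 2)  // lead byte plus its trail byte form one character
    else
      Inc(I);    // single-byte character, or a dangling lead byte at the end
    Inc(Result);
  end;
end;

const
  // $95 $5C is one double-byte character whose trail byte has the same
  // value as a single-byte code; $41 is the single-byte character 'A'.
  Sample: array[0..2] of Byte = ($95, $5C, $41);

begin
  Writeln('Bytes:      ', Length(Sample));
  Writeln('Characters: ', CountCharacters(Sample));
end.
```

For the three sample bytes the program reports two characters, because $95 $5C is consumed as one double-byte character even though $5C on its own is the ASCII code for a backslash. Starting the scan at the second byte instead of the first would misread the data, which is why the string has to be parsed from the beginning.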

When writing code for Asian locales, you must be sure to handle all string manipulation using functions that are enabled to parse strings into multibyte characters. 

For these reasons, you cannot process multibyte character set (MBCS) strings the way you process single-byte character strings. Use a string type appropriate for MBCS data, such as AnsiString or a short string.

Delphi provides many such MBCS-enabled runtime library functions, as listed in the following table:

Runtime library functions

AdjustLineBreaks         AnsiStrLower             ExtractFileDir
AnsiCompareFileName      AnsiStrPos               ExtractFileExt
AnsiExtractQuotedStr     AnsiStrRScan             ExtractFileName
AnsiLastChar             AnsiStrScan              ExtractFilePath
AnsiLowerCase            AnsiStrUpper             ExtractRelativePath
AnsiLowerCaseFileName    AnsiUpperCase            FileSearch
AnsiPos                  AnsiUpperCaseFileName    IsDelimiter
AnsiQuotedStr            ByteToCharIndex          IsPathDelimiter
AnsiStrComp              ByteToCharLen            LastDelimiter
AnsiStrIComp             ByteType                 StrByteType
AnsiStrLastChar          ChangeFileExt            StringReplace
AnsiStrLComp             CharToByteIndex          WrapText
AnsiStrLIComp            CharToByteLen
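
As a small illustration of one function from the table, the sketch below calls ByteType on each element of a string. The sample text is only a placeholder. The classification that ByteType reports depends on the string type in your Delphi version and, for ANSI strings, on the system code page, so no particular output is assumed here.

```delphi
program ByteTypeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
  I: Integer;
begin
  S := 'abc';  // placeholder text; substitute data from your own locale
  for I := 1 to Length(S) do
    // ByteType classifies the element at index I of S.
    case ByteType(S, I) of
      mbSingleByte: Writeln(I, ': single-byte character');
      mbLeadByte:   Writeln(I, ': lead byte');
      mbTrailByte:  Writeln(I, ': trail byte');
    end;
end.
```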
 

Remember that the length of a string in bytes does not necessarily correspond to its length in characters. Be careful not to truncate a string in the middle of a multibyte character. Do not pass a character as a parameter to a function or procedure, because the size of a character cannot be known in advance. Instead, always pass a pointer to a character or a string.
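
One way to avoid cutting a character in half is to check the element at the cut position before truncating. The sketch below assumes characters of at most two elements. SafeTruncate is a hypothetical helper, not an RTL function, and it relies on ByteType from the table above.

```delphi
program SafeTruncateDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns at most MaxLen elements of S without
// ending on a lead byte that has lost its trail byte.
function SafeTruncate(const S: string; MaxLen: Integer): string;
begin
  if MaxLen >= Length(S) then
  begin
    Result := S;  // nothing to cut
    Exit;
  end;
  if (MaxLen > 0) and (ByteType(S, MaxLen) = mbLeadByte) then
    Dec(MaxLen);  // back up so the split character is dropped whole
  Result := Copy(S, 1, MaxLen);
end;

begin
  // Plain ASCII input, so the cut position needs no adjustment here.
  Writeln(SafeTruncate('Multibyte', 5));
end.
```

The result can be one element shorter than MaxLen. That is the intended trade-off: a shorter string is preferable to one that ends in half a character.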

Ideographic character sets can also be represented in Unicode, using the UnicodeString or WideString types. The WideString type is essentially the same as a Windows BSTR, so it remains appropriate for use in COM applications. However, WideString is not reference counted, and UnicodeString is more flexible and efficient in other types of applications. In addition, more functions are available for handling UnicodeString than WideString, so UnicodeString is generally preferred.

Copyright(C) 2009 Embarcadero Technologies, Inc. All Rights Reserved.