Unicode in RAD Studio

RAD Studio 2009 has changed from ANSI-based strings to Unicode-based strings: the type string is now a Unicode string. This topic describes what you need to know to handle strings properly.

RAD Studio is fully Unicode-compliant, and some changes are required to those parts of your code that involve string handling. However, every effort has been made to keep these changes to a minimum. Although new data types are introduced, existing data types remain and function as they always have. Based on in-house experience with Unicode conversion, existing developer applications should migrate fairly smoothly.

Existing String Types

The pre-existing data types AnsiString and WideString function the same way as before.

Short strings also function the same as before. Note that short strings are limited to 255 characters and contain only a character count and single-byte character data. They do not contain code page information. A short string could contain UTF-8 data for a particular application, but this is not generally true.

AnsiString

Previously, string was an alias for AnsiString. This table shows the location of the fields in AnsiString's previous format:

Format of AnsiString Data Type

Reference Count	Length	String Data (Byte sized)	Null Term
-8	-4	0	Length

For RAD Studio, the format of AnsiString has changed. Two new fields (CodePage and ElemSize) have been added. This makes the format identical for AnsiString and for the new UnicodeString type.

WideString

WideString was previously used for Unicode character data. Its format is essentially the same as a Windows BSTR. WideString is still appropriate for use in COM applications.

New String Type: UnicodeString

The new default for the type string in RAD Studio is the UnicodeString type.

For Delphi, Char and PChar types are now WideChar and PWideChar, respectively.

Note: This differs from versions prior to 2009, in which string was an alias for AnsiString, and Char and PChar types were AnsiChar and PAnsiChar, respectively.

For C++, the "_TCHAR maps to" option controls the floating definition of _TCHAR, which can be either wchar_t or char.

VCL now uses the UnicodeString type; it no longer represents string values as single byte or MBCS strings.

Format of UnicodeString Data Type

CodePage	Element Size	Reference Count	Length	String Data (element sized)	Null Term
-12	-10	-8	-4	0	Length * elementsize

UnicodeString may be represented as the following Object Pascal structure:

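type
  StrRec = record
    CodePage: Word;    // code page of the string data
    ElemSize: Word;    // size of one element, in bytes
    RefCount: Integer; // reference count
    Len: Integer;      // length, in elements
    case Integer of
      1: (AnsiData: array[0..0] of AnsiChar); // field name is illustrative
      2: (WideData: array[0..0] of WideChar); // field name is illustrative
  end;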

UnicodeString adds code page and element size fields that describe the string contents. UnicodeString is assignment compatible with all other string types. However, assignments between AnsiString and UnicodeString still do the appropriate up or down conversions. Note that assigning a UnicodeString type to an AnsiString type is not recommended and can result in data loss.

Note that AnsiString also has CodePage and ElemSize fields.
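The offsets in the table above can be checked with a small standard C++ mirror of the header. This is only a sketch: StrHeader, offsetFromData, and nullTermOffset are hypothetical names, the struct is not the RTL's own declaration, and it assumes the 12-byte layout shown in the table.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the fields that precede the character data,
// in the order given by the table above.
struct StrHeader {
    std::uint16_t codePage;  // offset -12 from the first character
    std::uint16_t elemSize;  // offset -10
    std::int32_t  refCount;  // offset -8
    std::int32_t  length;    // offset -4, counted in elements
};

// Offset of a header field relative to the first character, assuming the
// header ends exactly where the character data begins.
constexpr int offsetFromData(std::size_t fieldOffset) {
    return static_cast<int>(fieldOffset) - static_cast<int>(sizeof(StrHeader));
}

static_assert(sizeof(StrHeader) == 12, "header occupies 12 bytes");
static_assert(offsetFromData(offsetof(StrHeader, codePage)) == -12, "CodePage");
static_assert(offsetFromData(offsetof(StrHeader, elemSize)) == -10, "ElemSize");
static_assert(offsetFromData(offsetof(StrHeader, refCount)) == -8, "RefCount");
static_assert(offsetFromData(offsetof(StrHeader, length)) == -4, "Length");

// Byte offset of the null terminator from the first character:
// Length * element size, as in the last column of the table.
constexpr std::size_t nullTermOffset(const StrHeader& h) {
    return static_cast<std::size_t>(h.length) * h.elemSize;
}
```

For a five-element string with 2-byte elements, nullTermOffset returns 10, whereas the same length with 1-byte elements gives 5.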

UnicodeString data is in UTF-16 for the following reasons:

UTF-16 matches the underlying operating system format.
UTF-16 reduces extra explicit/implicit conversions.
It offers better performance when calling the Windows API.
There is no need to have the operating system do any conversions with UTF-16.
The Basic Multilingual Plane (BMP) already contains the vast majority of the world's active language glyphs and fits in a single UTF-16 Char (16 bits).
Unicode surrogate pairs are analogous to the multibyte character set (MBCS), but more predictable and standard.
UnicodeString can provide lossless implicit conversions to and from WideString for marshaling COM interfaces.

Characters in UTF-16 may be 2 or 4 bytes, so the number of elements in a string is not necessarily equal to the number of characters. If the string has only BMP characters, the number of characters and elements are equal.
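The difference between elements and characters can be sketched in standard C++, using std::u16string as a stand-in for UTF-16 string data. The function names are hypothetical, and "character" here means a Unicode code point, not a user-perceived glyph.

```cpp
#include <cstddef>
#include <string>

// First half of a surrogate pair: U+D800..U+DBFF.
constexpr bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }

// Second half of a surrogate pair: U+DC00..U+DFFF.
constexpr bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Counts code points; s.size() counts 16-bit elements.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // A well-formed pair occupies two elements but is one character.
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
            ++i;
        }
        ++count;
    }
    return count;
}
```

For a string holding only U+1D11E (a character outside the BMP), size() is 2 while countCodePoints returns 1. For a BMP-only string the two values agree.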

UnicodeString offers the following benefits:

It is reference-counted.
It solves a legacy application problem in C++Builder.
Allowing AnsiString to carry encoding information (code page) reduces the potential data loss problem with implicit casts.
The compiler ensures the data is correct before mutating data.

WideString is not reference counted, and so UnicodeString is more flexible and efficient in other types of applications.

Indexing

Instances of UnicodeString can index characters. Indexing is 1-based, just as for AnsiString. Consider the following code:

var
  C: Char;
  S: string;
begin
  ...
  C := S[1];
  ...
end;

In a case such as the one shown above, the compiler needs to ensure that the data in S is in the proper format. The compiler generates code to ensure that assignments to string elements are of the proper type and that the instance is unique (that is, has a reference count of one) via a call to a UniqueString function. For the above code, since the string could contain Unicode data, the compiler also needs to call the appropriate UniqueString function before indexing into the character array.

C++Builder

The UnicodeString class in C++Builder allows automatic conversion semantics similar to Delphi. For existing VCL event handlers that expect AnsiString parameters, this is somewhat transparent in that conversions are done on demand. This also allows users to gradually migrate to full Unicode on their own schedule.

However, in some cases this automatic conversion produces undesired results. The default VCL string type is now UnicodeString instead of AnsiString. However, for backward compatibility the method UnicodeString::t_str() returns const char* instead of const wchar_t* by narrowing the wide data of the UnicodeString instance. This can result in unexpected behavior, as code might not expect a call to t_str() to corrupt the underlying data. This behavior is visible in code that displays the underlying data in the user interface, such as TListItem.

For example, in the following case, the data displayed by the TListView is corrupted after the call to t_str() on the last line of the method:

void ProcessSelectedItem(const char* item); 

void __fastcall TForm6::ListView1DblClick(TObject *Sender)
{ 
    int index = ListView1->Selected->Index;
    TListItem *ClassItem = ListView1->Items->Item[index];
    ProcessSelectedItem(ClassItem->Caption.t_str()); 
}

Compiler Conditionals

In both Delphi and C++Builder, you can use conditionals to allow both Unicode and non-Unicode code in the same source.

Delphi

{$IFDEF UNICODE}

C++Builder

#ifdef _DELPHI_STRING_UNICODE
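As a minimal sketch of how such a conditional might be used, the following C++ fragment selects a character type and a literal prefix at compile time. Only the _DELPHI_STRING_UNICODE macro comes from the text above. AppChar, APP_TEXT, and appTextLength are hypothetical names introduced for illustration.

```cpp
#include <cstddef>

// Pick the element type and the literal prefix once, at compile time.
// AppChar and APP_TEXT are hypothetical names, not part of any library.
#ifdef _DELPHI_STRING_UNICODE
typedef wchar_t AppChar;
#define APP_TEXT(s) L##s
#else
typedef char AppChar;
#define APP_TEXT(s) s
#endif

// Counts elements up to the terminator, whatever the element width is,
// so the same source works in both configurations.
std::size_t appTextLength(const AppChar* text) {
    std::size_t n = 0;
    while (text[n] != 0) {
        ++n;
    }
    return n;
}
```

With either setting, appTextLength(APP_TEXT("caption")) yields 7 because the count is in elements, not bytes.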

Summary of Changes

string now maps to UnicodeString, not to AnsiString.
Char now maps to WideChar (2 bytes, not 1 byte) and is a UTF-16 character.
PChar now maps to PWideChar.
In C++, System::String now maps to the UnicodeString class.

Summary of What Has Not Changed

AnsiString.
WideString.
AnsiChar, PAnsiChar.
WideChar, PWideChar.
Implicit conversions still work.
AnsiString uses the user's active code page.

Code Constructs Independent of Character Size

The following operations don't depend on character size:

String concatenation:
- <string var> + <string var>
- <string var> + <literal>
- <literal> + <literal>
- Concat(<string> , <string>)
Standard string functions:
- Length(<string>) returns the number of elements in a string, which may not be the same as the number of bytes or the number of characters in the string. Note that the SizeOf function returns the number of bytes required to represent a variable or type. SizeOf returns the number of bytes in a string only for a short string. Since the other string types are pointers, SizeOf returns the number of bytes in a pointer for non-short strings.
- Copy(<string>, <start>, <length>) returns a substring in Char elements.
- Pos(<substr>,<string>) returns the index of the first Char element.
Operators:
- <string> <comparison op> <string>
- CompareStr()
- CompareText()
- ...
FillChar(<struct or memory>)
- FillChar(Rect, SizeOf(Rect), #0)
- FillChar(WndClassEx, SizeOf(TWndClassEx), #0). Note that WndClassEx.cbSize := SizeOf(TWndClassEx);
Windows API
- API calls default to their WideString (“W”) versions.
- The PChar(<string>) cast has identical semantics.

GetModuleFileName example:

function ModuleFileName(Handle: HMODULE): string;
var
  Buffer: array[0..MAX_PATH] of Char;
begin
  SetString(Result, Buffer, GetModuleFileName(Handle, Buffer, Length(Buffer)));
end;

GetWindowText example:

function WindowCaption(Handle: HWND): string;
begin
  SetLength(Result, 1024);
  SetLength(Result, GetWindowText(Handle, PChar(Result), Length(Result)));
end;

String character indexing example:

function StripHotKeys(const S: string): string;
var
  I, J: Integer;
  LastChar: Char;
begin
  SetLength(Result, Length(S));
  J := 0;
  LastChar := #0;
  for I := 1 to Length(S) do
  begin
    // Copy the character unless it is an '&' that does not follow another '&'.
    if (S[I] <> '&') or (LastChar = '&') then
    begin
      Inc(J);
      Result[J] := S[I];
    end;
    LastChar := S[I];
  end;
  SetLength(Result, J);
end;

Code Constructs that Depend on Character Size

Some operations do depend on character size. The functions and features in the following list also include a “portable” version when possible. You can similarly rewrite your code to be portable, that is, the code works with both AnsiString and UnicodeString variables.

SizeOf(<Char array>): use the portable Length(<Char array>).
Move(<Char buffer>... CharCount): use the portable Move(<Char buffer>... CharCount * SizeOf(Char)).
Stream Read/Write: use the portable AnsiString, SizeOf(Char), or the TEncoding class.
FillChar(<Char array>, <size>, <AnsiChar>): multiply <size> by SizeOf(Char) if filling with #0, or use the portable StringOfChar function.
GetProcAddress(<module>, <PAnsiChar>): use the provided overload function taking a PWideChar.
Casting or using PChar to do pointer arithmetic: place {IFDEF PByte = PChar} at the top of the file if you use PChar for pointer arithmetic. Or use the {$POINTERMATH <ON|OFF>} directive to turn on pointer arithmetic for all typed pointers, so that increment/decrement is by element size.
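The element-count versus byte-count pitfall behind the Move and SizeOf items can be sketched in standard C++, with std::memcpy playing the role of Move. The Char typedef and all function names are hypothetical stand-ins. The first function is deliberately wrong to show the failure mode.

```cpp
#include <cstddef>
#include <cstring>

typedef char16_t Char;  // hypothetical stand-in for a 2-byte Char

// Deliberately wrong: passes an element count where a byte count is
// expected, so only half of the data is copied once Char is 2 bytes wide.
void copyCharsNaive(Char* dest, const Char* src, std::size_t charCount) {
    std::memcpy(dest, src, charCount);
}

// Portable form: scale the element count by the element size.
void copyChars(Char* dest, const Char* src, std::size_t charCount) {
    std::memcpy(dest, src, charCount * sizeof(Char));
}

// Element count of a fixed-size array, the counterpart of using
// Length(<Char array>) rather than SizeOf(<Char array>).
template <std::size_t N>
constexpr std::size_t lengthOf(const Char (&)[N]) {
    return N;
}
```

For a four-element array, lengthOf gives 4 while sizeof gives 8 bytes. copyCharsNaive with a count of 4 transfers only the first two elements.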

Set of Char Constructs

You may need to modify these constructs.

<Char> in <set of AnsiChar>: code generation is correct (characters above #255 are never in the set). The compiler warns “WideChar reduced in set operations”. Depending on your code, you can safely turn off the warning. Alternatively, use the CharInSet function.
<Char> in LeadBytes: the global LeadBytes set is for MBCS ANSI locales. UTF-16 still has the notion of a “lead char” (#$D800 - #$DBFF are high surrogate, #$DC00 - #$DFFF are low surrogate). To change this, use the overloaded function IsLeadChar. The ANSI version checks against LeadBytes. The WideChar version checks if it is a high/low surrogate.
Character classification: use the TCharacter static class. The Character unit offers functions to classify characters: IsDigit, IsLetter, IsLetterOrDigit, IsSymbol, IsWhiteSpace, IsSurrogatePair, and so on. These are based on table data directly from Unicode.org.
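The behavior described for set membership and lead characters can be approximated in standard C++. This is an analogy, not the RTL implementation: AnsiCharSet, makeAnsiSet, charInSet, and isSurrogate are hypothetical names, and std::bitset<256> stands in for a set of AnsiChar.

```cpp
#include <bitset>

typedef std::bitset<256> AnsiCharSet;  // hypothetical analog of "set of AnsiChar"

// Builds a set from the single-byte characters in a C string.
AnsiCharSet makeAnsiSet(const char* members) {
    AnsiCharSet set;
    for (; *members != '\0'; ++members) {
        set.set(static_cast<unsigned char>(*members));
    }
    return set;
}

// Membership test for a 16-bit element: values above 255 are never in
// the set, instead of being truncated to their low byte first.
bool charInSet(char16_t c, const AnsiCharSet& set) {
    return c <= 0xFF && set.test(c);
}

// True for either half of a surrogate pair (U+D800..U+DFFF), matching
// the high/low surrogate ranges listed above.
constexpr bool isSurrogate(char16_t c) {
    return c >= 0xD800 && c <= 0xDFFF;
}
```

The range check matters: U+0437 has the low byte $37, which is the digit '7', so a test that truncated to one byte would wrongly report it as a member of a digit set.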

Beware of these Constructs

You should examine the following problematic code constructs:

Casts that obscure the type:
- AnsiString(Pointer(foo))
- Review for correctness: what was intended?
Suspicious casts (these generate a warning):
- PChar(<AnsiString var>)
- PAnsiChar(<UnicodeString var>)
Directly constructing, manipulating, or accessing string internal structures. Some, such as AnsiString, have changed internally, so this is unsafe. Use the StringRefCount, StringCodePage, StringElementSize and other functions to get string information.

Runtime Library

Overloads. For functions that took PChar, there are now PAnsiChar and PWideChar versions so the appropriate function gets called.
SysUtils.AnsiXXXX functions, such as AnsiCompareStr:
- remain declared with string and float to UnicodeString
- offer better backward compatibility (no need to change code).
The AnsiStrings unit's AnsiXXXX functions offer the same capabilities as the SysUtils.AnsiXXXX functions, but work only for AnsiString. AnsiStrings.AnsiXXXX functions provide better performance for an AnsiString than SysUtils.AnsiXXXX functions, which work for both AnsiString and UnicodeString, because no implicit conversions are performed.
Write/Writeln and Read/Readln
- Continue to convert to/from ANSI/OEM code pages.
- Console is mostly ANSI or OEM anyway.
- Offer better compatibility with legacy applications.
- TFDD (Text File Device Drivers)
  - TTextRec and TFileRec
  - File names are WideChar, but as above, data is ANSI/OEM.
- Use TEncoding and TStrings for Unicode file I/O. See the System.Text.Encoding class.
PByte: declared with $POINTERMATH ON. This allows array indexing and pointer math like PAnsiChar.
String information functions:
- StringElementSize returns the actual data size.
- StringCodePage returns the code page of string data.
- StringRefCount returns the reference count.
RTL provides helper functions that enable users to do explicit conversions between code pages and element size conversions. If developers are using the Move function on a character array, they cannot make assumptions about the element size. Much of this problem can be mitigated by making sure all RValue references generate the proper calls to RTL to ensure proper element sizes.

Components and Classes

TStrings: Stores UnicodeStrings internally (remains declared as string).
TWideStrings (may get deprecated) is unchanged. Uses WideString (BSTR) internally.
TStringStream
- Has been rewritten; defaults to the default ANSI encoding for internal storage.
- Encoding can be overridden.
- Consider using TStringBuilder instead of TStringStream to construct a string from bits and pieces.
TEncoding
- Defaults to users’ active code page.
- Supports UTF-8.
- Supports UTF-16, big and little endian.
- Byte Order Mark (BOM) support.
- You can create descendant classes for user specific encodings.
Component streaming (Text DFM files)
- Are fully backward compatible.
- Stream as UTF-8 only if component type, property or name contains non-ASCII-7 characters.
- String property values are still streamed in “#” escaped format.
- May allow values as UTF-8 as well (open issue).
- Only change in binary format is potential for UTF-8 data for component name, properties, and type name.
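To make the UTF-8 support listed above more concrete, here is a minimal standard C++ sketch of the work involved in turning UTF-16 elements into UTF-8 bytes. It is not how TEncoding is implemented. utf16ToUtf8 is a hypothetical name, and replacing unpaired surrogates with U+FFFD is a choice made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encodes UTF-16 elements as UTF-8 bytes (no BOM is written).
std::vector<std::uint8_t> utf16ToUtf8(const std::u16string& s) {
    std::vector<std::uint8_t> out;
    auto put = [&out](std::uint32_t v) { out.push_back(static_cast<std::uint8_t>(v)); };
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint32_t cp = s[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate: substitute U+FFFD (example policy)
        }
        if (cp < 0x80) {            // 1 byte
            put(cp);
        } else if (cp < 0x800) {    // 2 bytes
            put(0xC0 | (cp >> 6));
            put(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes
            put(0xE0 | (cp >> 12));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes
            put(0xF0 | (cp >> 18));
            put(0x80 | ((cp >> 12) & 0x3F));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

U+00E9 becomes the two bytes C3 A9, and U+1D11E (stored as the surrogate pair D834 DD1E) becomes the four bytes F0 9D 84 9E.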

Byte Order Mark

The Byte Order Mark (BOM) should be added to files to indicate their encoding:

UTF-8 uses EF BB BF.
UTF-16 Little Endian uses FF FE.
UTF-16 Big Endian uses FE FF.
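A reader of such files can inspect the leading bytes to pick a decoder. The following standard C++ sketch checks only the three marks listed above. The Bom enum and detectBom are hypothetical names, and the sketch does not try to distinguish UTF-32, whose little-endian mark also begins with FF FE.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Returns the encoding indicated by the first bytes of a file, or
// Bom::None when none of the three marks is present.
Bom detectBom(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) {
        return Bom::Utf8;
    }
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        return Bom::Utf16LE;
    }
    if (bytes.size() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
        return Bom::Utf16BE;
    }
    return Bom::None;
}
```

A file without a mark yields Bom::None, in which case the caller has to fall back on some other convention, such as the active code page.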

Users need to Unicode-enable their applications

Users need to perform these steps:

Review char- and string-related functions.
Rebuild the application.
Review surrogate pairs.
Review string payloads.

For more details, see Enabling Your Applications for Unicode.

New Delphi compiler warnings

New warnings have been added to the Delphi compiler related to possible errors in casting types (such as from a UnicodeString or a WideString down to an AnsiString or AnsiChar). When you are converting an application to Unicode, you should enable warnings 1057 and 1058 to assist in finding problem areas in your code.

1057: Implicit string cast from '%s' to '%s' (IMPLICIT_STRING_CAST) Emitted when the compiler detects a case where it must implicitly convert an AnsiString (or AnsiChar) to some form of Unicode (a UnicodeString or a WideString). (NOTE: This warning will eventually be enabled by default).
1058: Implicit string cast with potential data loss from '%s' to '%s' (IMPLICIT_STRING_CAST_LOSS) Emitted when the compiler detects a case where it must implicitly convert some form of Unicode (a UnicodeString or a WideString) down to an AnsiString (or AnsiChar). This is a potentially lossy conversion, since there may be characters in the string that cannot be represented in the code page to which the string is converted. (NOTE: This warning will eventually be enabled by default).
1059: Explicit string cast from '%s' to '%s' (EXPLICIT_STRING_CAST) Emitted when the compiler detects a case where the programmer is explicitly casting an AnsiString (or AnsiChar) to some form of Unicode (UnicodeString or WideString). (NOTE: This warning will always be off by default and should only be used to locate potential problems).
1060: Explicit string cast with potential data loss from '%s' to '%s' (EXPLICIT_STRING_CAST_LOSS) Emitted when the compiler detects a case where the programmer is explicitly casting some form of Unicode (UnicodeString or WideString) down to AnsiString (or AnsiChar). This is a potential lossy conversion, since there may be characters in the string that cannot be represented in the code page to which the string is converted. (NOTE: This warning will always be off by default and should only be used to locate potential problems).
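The data loss that warnings 1058 and 1060 point at can be illustrated with a small standard C++ sketch. It narrows UTF-16 data to ISO-8859-1 (Latin-1), chosen here only because its 256 values coincide with U+0000 to U+00FF. narrowToLatin1 is a hypothetical name, and substituting '?' is an assumption of this example: an actual code page conversion may map unrepresentable characters differently.

```cpp
#include <string>

// Narrows UTF-16 elements to single bytes. Elements above U+00FF cannot
// be represented, so each one is replaced by '?' and *lossy is set.
std::string narrowToLatin1(const std::u16string& wide, bool* lossy) {
    std::string out;
    bool lost = false;
    for (char16_t c : wide) {
        if (c <= 0xFF) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back('?');  // the original character is gone for good
            lost = true;
        }
    }
    if (lossy != nullptr) {
        *lossy = lost;
    }
    return out;
}
```

Text that fits the target code page survives the round trip, while a single Cyrillic letter collapses to '?' and cannot be recovered by converting back.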

Recommendations

Keep source files in UTF-8 format.
- Delphi 2005, 2006, and 2007 support this.
- Files can remain ANSI as long as the source is compiled with the correct code page (you can use the --codepage compiler switch).
- Write a UTF-8 BOM to the source file. Make sure your source control management system supports these files (most do).
Perform IDE refactoring when code must be AnsiString or AnsiChar (code is still portable).
Static code review:
- Is code merely passing the data along?
- Is code doing simple character indexing?
Heed all warnings (elevate to errors):
- Suspicious pointer casts.
- Implicit/Explicit casts (coming).
Determine code intent
- Is code using a string (AnsiString) as a dynamic array of bytes? If so, use the portable TBytes type (array of Byte) instead.
- Is a PChar cast used to enable pointer arithmetic? If so, cast to PByte instead and turn $POINTERMATH ON.