RAD Studio (Common)
RAD Studio 2009 has changed from ANSI-based strings to Unicode-based strings: the type string is now a Unicode string. This topic describes what you need to know to handle strings properly.
RAD Studio is fully Unicode-compliant, and some changes are required to those parts of your code that involve string handling. However, every effort has been made to keep these changes to a minimum. Although new data types are introduced, existing data types remain and function as they always have. Based on in-house experience with Unicode conversion, existing developer applications should migrate fairly smoothly.
The pre-existing data types AnsiString and WideString function the same way as before.
Short strings also function the same as before. Note that short strings are limited to 255 characters and contain only a character count and single-byte character data. They do not contain code page information. A short string could contain UTF-8 data for a particular application, but this is not generally true.
Previously, string was an alias for AnsiString. This table shows the location of the fields in AnsiString's previous format:
Format of AnsiString Data Type
Field  | Reference Count | Length | String Data (byte sized) | Null Term
Offset | -8              | -4     | 0                        | Length
For RAD Studio, the format of AnsiString has changed. Two new fields (CodePage and ElemSize) have been added. This makes the format identical for AnsiString and for the new UnicodeString type.
WideString was previously used for Unicode character data. Its format is essentially the same as a Windows BSTR. WideString is still appropriate for use in COM applications.
The new default for the type string in RAD Studio is the UnicodeString type.
For Delphi, Char and PChar types are now WideChar and PWideChar, respectively.
VCL now uses the UnicodeString type; it no longer represents string values as single byte or MBCS strings.
Format of UnicodeString Data Type
Field  | CodePage | Element Size | Reference Count | Length | String Data (element sized) | Null Term
Offset | -12      | -10          | -8              | -4     | 0                           | Length * elementsize
UnicodeString may be represented as the following Object Pascal structure:
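    type
      StrRec = record
        CodePage: Word;
        ElemSize: Word;
        refCount: Integer;
        Len: Integer;
        // The variant field names below are illustrative; the original listing omits them.
        case Integer of
          1: (AnsiData: array[0..0] of AnsiChar);
          2: (WideData: array[0..0] of WideChar);
      end;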
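To make the offsets in the table concrete, here is a minimal sketch in standard C++ that mirrors the header portion of the StrRec record and checks that each field sits at the documented negative offset from the first character. The names StrHeader and relativeOffset are hypothetical and are not part of the RTL; the field widths are assumptions taken from the Word and Integer declarations above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical C++ mirror of the StrRec header; not the RTL's own declaration.
struct StrHeader {
    std::uint16_t codePage;  // code page of the character data
    std::uint16_t elemSize;  // bytes per element: 1 for Ansi data, 2 for Unicode data
    std::int32_t  refCount;  // reference count
    std::int32_t  length;    // length in elements, not bytes
};

// Offset of a header field relative to the first character, which is offset 0.
constexpr int relativeOffset(std::size_t fieldOffset) {
    return static_cast<int>(fieldOffset) - static_cast<int>(sizeof(StrHeader));
}

static_assert(sizeof(StrHeader) == 12, "the header occupies 12 bytes");
static_assert(relativeOffset(offsetof(StrHeader, codePage)) == -12, "CodePage at -12");
static_assert(relativeOffset(offsetof(StrHeader, elemSize)) == -10, "ElemSize at -10");
static_assert(relativeOffset(offsetof(StrHeader, refCount)) == -8, "RefCount at -8");
static_assert(relativeOffset(offsetof(StrHeader, length)) == -4, "Length at -4");
```

The static assertions fail at compile time if the layout ever disagrees with the table, which is one way to keep documentation and code in step.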
UnicodeString adds code page and element size fields that describe the string contents. UnicodeString is assignment compatible with all other string types. However, assignments between AnsiString and UnicodeString still do the appropriate up or down conversions. Note that assigning a UnicodeString type to an AnsiString type is not recommended and can result in data loss.
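The data loss mentioned above can be illustrated with a small standard C++ sketch. It assumes a single-byte Latin-1 target purely for illustration; the real conversion depends on the code page of the destination AnsiString, and the function name narrowToLatin1 is hypothetical.

```cpp
#include <string>

// Illustrative only: narrows UTF-16 code units to a Latin-1 byte string.
// Any unit that does not fit in one byte is replaced with '?', so the
// original character cannot be recovered afterwards.
std::string narrowToLatin1(const std::u16string& wide) {
    std::string narrow;
    narrow.reserve(wide.size());
    for (char16_t unit : wide) {
        narrow.push_back(unit <= 0xFF ? static_cast<char>(unit) : '?');
    }
    return narrow;
}
```

Text that stays within the target character set survives the round trip, while anything outside it is silently replaced. This is why down-conversions deserve a review during migration.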
Note that AnsiString also has CodePage and ElemSize fields.
UnicodeString data is in UTF-16 for the following reasons:
Characters in UTF-16 may be 2 or 4 bytes, so the number of elements in a string is not necessarily equal to the number of characters. If the string has only BMP characters, the number of characters and elements are equal.
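The difference between elements and characters can be sketched in standard C++, using std::u16string as a stand-in for UTF-16 string data. The function name countCodePoints is hypothetical; it treats a high surrogate followed by a low surrogate as one character and every other element as one character.

```cpp
#include <cstddef>
#include <string>

// Counts characters (code points) in UTF-16 data. A valid surrogate pair
// occupies two elements but represents a single character.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool nextIsLow = i + 1 < s.size() &&
                               s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (isHigh && nextIsLow) {
            ++i;  // skip the low surrogate that completes the pair
        }
        ++count;
    }
    return count;
}
```

For a string holding the letter A followed by U+1D11E (a character outside the BMP), the element count is 3 while the character count is 2. For BMP-only text the two values match.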
UnicodeString offers the following benefits:
Instances of UnicodeString can index characters. Indexing is 1-based, just as for AnsiString. Consider the following code:
    var
      C: Char;
      S: string;
    begin
      ...
      C := S[1];
      ...
    end;
In a case such as shown above, the compiler needs to ensure that data in S is in the proper format. The compiler generates code to ensure that assignments to string elements are the proper type and that the instance is unique (that is, has a reference count of one) via a call to a UniqueString function. For the above code, since the string could contain Unicode data, the compiler needs to also call the appropriate UniqueString function before indexing into the character array.
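The idea of making an instance unique before writing to it can be sketched as copy-on-write in standard C++. This is only a model of the technique: std::shared_ptr stands in for the reference count, and the class CowString and its members are hypothetical, not the RTL's UniqueString implementation.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Minimal copy-on-write string model with 1-based indexing.
class CowString {
public:
    explicit CowString(std::u16string text)
        : data_(std::make_shared<std::u16string>(std::move(text))) {}

    // Reading an element never needs a private copy.
    char16_t get(std::size_t index1) const { return (*data_)[index1 - 1]; }

    // Writing first ensures that this instance is the only owner.
    void set(std::size_t index1, char16_t value) {
        makeUnique();
        (*data_)[index1 - 1] = value;
    }

    long refCount() const { return data_.use_count(); }

private:
    // Clone the shared buffer when another instance still refers to it.
    void makeUnique() {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::u16string>(*data_);
        }
    }

    std::shared_ptr<std::u16string> data_;
};
```

Copying a CowString shares the buffer and raises the count to two. The first write through either copy clones the data, so the other copy keeps its original contents.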
The UnicodeString class in C++Builder allows automatic conversion semantics similar to Delphi. For existing VCL event handlers that expect AnsiString parameters, this is somewhat transparent in that conversions are done on demand. This also allows users to gradually migrate to full Unicode on their own schedule.
However, in some cases this automatic conversion produces undesired results. The default VCL string type is now UnicodeString instead of AnsiString. However, for backward compatibility, the method UnicodeString::t_str() returns const char* instead of const wchar_t* by narrowing the wide data of the UnicodeString instance. This can result in unexpected behavior, because code might not expect a call to t_str() to corrupt the underlying data. This behavior is visible in code that displays the underlying data in the user interface, such as TListItem.
For example, in the following case, the data displayed by the TListView is corrupted after the call to t_str() on the last line of the method:
    void ProcessSelectedItem(const char* item);

    void __fastcall TForm6::ListView1DblClick(TObject *Sender)
    {
      int index = ListView1->Selected->Index;
      TListItem *ClassItem = ListView1->Items->Item[index];
      ProcessSelectedItem(ClassItem->Caption.t_str());
    }
The following operations don't depend on character size:
GetModuleFileName example:
    function ModuleFileName(Handle: HMODULE): string;
    var
      Buffer: array[0..MAX_PATH] of Char;
    begin
      SetString(Result, Buffer, GetModuleFileName(Handle, Buffer, Length(Buffer)));
    end;
GetWindowText example:
    function WindowCaption(Handle: HWND): string;
    begin
      SetLength(Result, 1024);
      SetLength(Result, GetWindowText(Handle, PChar(Result), Length(Result)));
    end;
String character indexing example:
    function StripHotKeys(const S: string): string;
    var
      I, J: Integer;
      LastChar: Char;
    begin
      SetLength(Result, Length(S));
      J := 0;
      LastChar := #0;
      for I := 1 to Length(S) do
      begin
        if (S[I] <> '&') or (LastChar = '&') then
        begin
          Inc(J);
          Result[J] := S[I];
        end;
        LastChar := S[I];
      end;
      SetLength(Result, J);
    end;
Some operations do depend on character size. The functions and features in the following list also include a “portable” version when possible. You can similarly rewrite your code to be portable, that is, the code works with both AnsiString and UnicodeString variables.
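One operation that depends on character size is computing how many bytes a string occupies. The following sketch shows the portable idea in standard C++, with std::string and std::u16string standing in for single-byte and UTF-16 string types. The function name byteLength is hypothetical. In Delphi the same idea is commonly written as Length(S) * SizeOf(Char).

```cpp
#include <cstddef>
#include <string>

// Size in bytes of the character data, for any element width.
// Multiplying by the element size avoids assuming one byte per element.
template <typename CharT>
std::size_t byteLength(const std::basic_string<CharT>& s) {
    return s.size() * sizeof(CharT);
}
```

Code that passes the element count where a byte count is expected works by accident with single-byte strings and copies only half of the data once the element size becomes two bytes.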
You may need to modify these constructs.
You should examine the following problematic code constructs:
The Byte Order Mark (BOM) should be added to files to indicate their encoding:
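As a rough illustration, the common BOM byte sequences can be recognized at the start of a file's contents as in the following standard C++ sketch. The names BomKind and detectBom are hypothetical. The sketch covers only UTF-8 and UTF-16, so a UTF-32 little-endian BOM, which begins with the same two bytes as UTF-16 little-endian, is outside its scope.

```cpp
#include <cstddef>
#include <string>

enum class BomKind { None, Utf8, Utf16LE, Utf16BE };

// Inspects the leading bytes of a buffer and reports which BOM, if any, is present.
BomKind detectBom(const std::string& bytes) {
    auto at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF) {
        return BomKind::Utf8;     // EF BB BF
    }
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE) {
        return BomKind::Utf16LE;  // FF FE
    }
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF) {
        return BomKind::Utf16BE;  // FE FF
    }
    return BomKind::None;
}
```

A reader that finds no BOM has to fall back on another rule, such as assuming the system's default ANSI code page.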
Users need to perform these steps:
New warnings have been added to the Delphi compiler related to possible errors in casting types (such as from a UnicodeString or a WideString down to an AnsiString or AnsiChar). When you are converting an application to Unicode, you should enable warnings 1057 and 1058 to assist in finding problem areas in your code.
Copyright(C) 2009 Embarcadero Technologies, Inc. All Rights Reserved.