RAD Studio (Common)
This topic describes various semantic code constructs you should review in your existing code to ensure that your applications are compatible with the UnicodeString type. Because Char now equals WideChar, and string equals UnicodeString, previous assumptions about the size in bytes of a character array or string might now be incorrect.
For general information on Unicode, see Unicode in RAD Studio.
Look for any code that:
Flags have been provided so that you can determine whether string is UnicodeString or AnsiString. You can use them to maintain code that supports older versions of Delphi and C++Builder in the same source. For most code that performs standard string operations, it should not be necessary to have separate UnicodeString and AnsiString code sections. However, if a procedure performs operations that depend on the internal structure of the string data or that interact with external libraries, it might be necessary to have separate code paths for UnicodeString and AnsiString.
Delphi
{$IFDEF UNICODE}
C++
#ifdef _DELPHI_STRING_UNICODE
The compiler has warnings related to errors in casting types (such as from UnicodeString or WideString down to AnsiString or AnsiChar). When you convert an application to Unicode, enable warnings 1057 and 1058 to find problem areas in your code.
Warning # | Warning Text/Name
1057 | Implicit string cast from '%s' to '%s' (IMPLICIT_STRING_CAST)
1058 | Implicit string cast with potential data loss from '%s' to '%s' (IMPLICIT_STRING_CAST_LOSS)
1059 | Explicit string cast from '%s' to '%s' (EXPLICIT_STRING_CAST)
1060 | Explicit string cast with potential data loss from '%s' to '%s' (EXPLICIT_STRING_CAST_LOSS)
Review calls to SizeOf on character arrays for correctness. Consider the following example:
var
  Count: Integer;
  Buffer: array[0..MAX_PATH - 1] of Char;
begin
  // Existing code - incorrect when string = UnicodeString
  Count := SizeOf(Buffer);
  GetWindowText(Handle, Buffer, Count);

  // Correct for Unicode
  Count := Length(Buffer); // <<-- Count should be chars not bytes
  GetWindowText(Handle, Buffer, Count);
end;
SizeOf returns the size of the array in bytes, but GetWindowText expects Count to be in characters. In this case, Length should be used instead of SizeOf. Length functions similarly with arrays and strings. Length applied to an array returns the number of array elements allocated to the array; with string types, Length returns the number of elements in the string.
To find the number of characters contained in a null-terminated string (PAnsiChar or PWideChar), use the StrLen function.
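The distinction between byte size, element count, and terminated length can also be sketched in portable C++. This is only an illustration, not RAD Studio code: char16_t is assumed here as a stand-in for the 2-byte Char, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption for illustration: char16_t plays the role of a 2-byte Char.
using WideUnit = char16_t;

// Size of a fixed array in bytes (the quantity SizeOf reports).
template <std::size_t N>
constexpr std::size_t size_in_bytes(const WideUnit (&)[N]) {
    return N * sizeof(WideUnit);
}

// Number of elements in a fixed array (the quantity Length reports).
template <std::size_t N>
constexpr std::size_t size_in_chars(const WideUnit (&)[N]) {
    return N;
}

// Number of code units before the null terminator (the quantity StrLen reports).
inline std::size_t terminated_length(const WideUnit* text) {
    assert(text != nullptr);
    return std::char_traits<WideUnit>::length(text);
}
```

For a 260-element buffer that holds a three-character string, the three helpers return three different numbers, so an API that expects a character count must not be given the byte size.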
Review calls to FillChar when used in conjunction with strings or Char. Consider the following code:
var
  Count: Integer;
  Buffer: array[0..255] of Char;
begin
  // Existing code - incorrect when string = UnicodeString (when char = 2 bytes)
  Count := Length(Buffer);
  FillChar(Buffer, Count, 0);

  // Correct for Unicode
  Count := Length(Buffer) * SizeOf(Char); // <<-- Specify buffer size in bytes
  FillChar(Buffer, Count, 0);
end;
Length returns the size in elements, but FillChar expects Count to be in bytes. In this example, Length multiplied by the size of Char should be used. In addition, because a Char now occupies 2 bytes, FillChar fills the buffer byte by byte, not Char by Char as it effectively did previously. For example, the following code is intended to fill the array with tab characters:
var
  Buf: array[0..32] of Char;
begin
  FillChar(Buf, Length(Buf), #9);
end;
However, this code fills the array with code point $0909 rather than code point $09, because each byte of every 2-byte Char is set to $09. To get the expected result, change the code to this:
var
  Buf: array[0..32] of Char;
begin
  // StrPCopy appends a #0 terminator, so leave one element free for it
  StrPCopy(Buf, StringOfChar(#9, Length(Buf) - 1));
  ...
end;
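The difference between a byte-wise fill and an element-wise fill can be reproduced in portable C++. The sketch below is an illustration only: char16_t is assumed as a stand-in for the 2-byte Char, std::memset plays the role of FillChar, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Byte-wise fill, analogous to FillChar: every byte becomes `value`,
// so each 2-byte element ends up holding the byte twice.
inline void fill_bytes(char16_t* buffer, std::size_t count_chars, unsigned char value) {
    assert(buffer != nullptr);
    std::memset(buffer, value, count_chars * sizeof(char16_t));
}

// Element-wise fill: every 2-byte element becomes `value`.
inline void fill_chars(char16_t* buffer, std::size_t count_chars, char16_t value) {
    assert(buffer != nullptr);
    std::fill(buffer, buffer + count_chars, value);
}
```

Filling with the byte 0x09 leaves 0x0909 in each element, while the element-wise fill leaves 0x0009, which is the tab character that was intended.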
Review calls to Move with strings or character arrays, as in the following example:
var
  Count: Integer;
  Buf1, Buf2: array[0..255] of Char;
begin
  // Existing code - incorrect when string = UnicodeString (when char = 2 bytes)
  Count := Length(Buf1);
  Move(Buf1, Buf2, Count);

  // Correct for Unicode
  Count := Length(Buf1) * SizeOf(Char); // <<-- Specify buffer size in bytes
  Move(Buf1, Buf2, Count);
end;
Length returns the size in elements, but Move expects Count to be in bytes. In this case, Length multiplied by the size of Char should be used.
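The same pitfall exists with std::memcpy in C++, which also takes a byte count. The following sketch is an illustration under the assumption that char16_t stands in for the 2-byte Char; copy_chars is a hypothetical helper, not a library routine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `count_chars` 2-byte elements. The element count is converted to
// bytes because std::memcpy, like Move, expects a size in bytes.
inline void copy_chars(char16_t* dest, const char16_t* src, std::size_t count_chars) {
    assert(dest != nullptr && src != nullptr);
    std::memcpy(dest, src, count_chars * sizeof(char16_t));
}
```

Passing the element count directly as the byte count would copy only half of the buffer when each element is 2 bytes wide.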
Review calls to TStream.Read/ReadBuffer when strings or character arrays are used. Consider the following example:
var
  S: string;
  L: Integer;
  Stream: TStream;
  Temp: AnsiString;
begin
  // Existing code - incorrect when string = UnicodeString
  Stream.Read(L, SizeOf(Integer));
  SetLength(S, L);
  Stream.Read(Pointer(S)^, L);

  // Correct for Unicode string data
  Stream.Read(L, SizeOf(Integer));
  SetLength(S, L);
  Stream.Read(Pointer(S)^, L * SizeOf(Char)); // <<-- Specify buffer size in bytes

  // Correct for Ansi string data
  Stream.Read(L, SizeOf(Integer));
  SetLength(Temp, L); // <<-- Use temporary AnsiString
  Stream.Read(Pointer(Temp)^, L * SizeOf(AnsiChar)); // <<-- Specify buffer size in bytes
  S := Temp; // <<-- Widen string to Unicode
end;
The solution depends on the format of the data being read. Use the TEncoding class to assist you in properly encoding stream text.
Review calls to TStream.Write/WriteBuffer when strings or character arrays are used. Consider the following example:
var
  S: string;
  Stream: TStream;
  Temp: AnsiString;
begin
  // Existing code - incorrect when string = UnicodeString
  Stream.Write(Pointer(S)^, Length(S));

  // Correct for Unicode data
  Stream.Write(Pointer(S)^, Length(S) * SizeOf(Char)); // <<-- Specify buffer size in bytes

  // Correct for Ansi data
  Temp := S; // <<-- Use temporary AnsiString
  Stream.Write(Pointer(Temp)^, Length(Temp) * SizeOf(AnsiChar)); // <<-- Specify buffer size in bytes
end;
The proper code depends on the format of the data being written. Use the TEncoding class to assist you in properly encoding stream text.
Calls to the Windows API function GetProcAddress should always use PAnsiChar, since there is no analogous wide function in the Windows API. This example shows the correct usage:
procedure CallLibraryProc(const LibraryName, ProcName: string);
var
  Handle: THandle;
  RegisterProc: function: HResult stdcall;
begin
  Handle := LoadOleControlLibrary(LibraryName, True);
  @RegisterProc := GetProcAddress(Handle, PAnsiChar(AnsiString(ProcName)));
end;
In RegQueryValueEx, the Len parameter receives and returns the number of bytes, not characters. The Unicode version thus requires a value twice as large for the Len parameter.
Here is a sample RegQueryValueEx call:
Len := MAX_PATH;
if RegQueryValueEx(reg, PChar(Name), nil, nil, PByte(@Data[0]), @Len) = ERROR_SUCCESS then
  SetString(Result, Data, Len - 1) // Len includes #0
else
  RaiseLastOSError;
This must be changed to this:
Len := MAX_PATH * SizeOf(Char);
if RegQueryValueEx(reg, PChar(Name), nil, nil, PByte(@Data[0]), @Len) = ERROR_SUCCESS then
  SetString(Result, Data, Len div SizeOf(Char) - 1) // Len includes #0, Len contains the number of bytes
else
  RaiseLastOSError;
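The byte-to-character arithmetic used above can be isolated into two small helpers. This is a portable C++ sketch rather than Windows API code: char16_t is assumed as the 2-byte character type, both function names are hypothetical, and the returned byte count is assumed to include the terminator, as in the sample above.

```cpp
#include <cassert>
#include <cstddef>

// Buffer size in bytes to pass in for a buffer of `max_chars` 2-byte characters.
inline std::size_t buffer_bytes_for(std::size_t max_chars) {
    return max_chars * sizeof(char16_t);
}

// Converts a returned byte count (terminator included) into the number of
// characters to keep (terminator excluded).
inline std::size_t chars_without_terminator(std::size_t byte_count) {
    assert(byte_count >= sizeof(char16_t));       // at least the terminator
    assert(byte_count % sizeof(char16_t) == 0);   // whole characters only
    return byte_count / sizeof(char16_t) - 1;
}
```

With 2-byte characters, a returned size of 8 bytes corresponds to three characters plus the terminator.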
The Unicode version of the Windows API function CreateProcess, CreateProcessW, behaves slightly differently than the ANSI version. To quote MSDN in reference to the lpCommandLine parameter:
"The Unicode version of this function, CreateProcessW, can modify the contents of this string. Therefore, this parameter cannot be a pointer to read-only memory (such as a const variable or a literal string). If this parameter is a constant string, the function might cause an access violation."
Because of this problem, existing code that calls CreateProcess might cause access violations.
Here are examples of such problematic code:
// Passing in a string constant
CreateProcess(nil, 'foo.exe', nil, nil, False, 0, nil, nil, StartupInfo, ProcessInfo);

// Passing in a constant expression
const
  cMyExe = 'foo.exe';
CreateProcess(nil, cMyExe, nil, nil, False, 0, nil, nil, StartupInfo, ProcessInfo);

// Passing in a string whose refcount is -1:
const
  cMyExe = 'foo.exe';
var
  sMyExe: string;
sMyExe := cMyExe;
CreateProcess(nil, PChar(sMyExe), nil, nil, False, 0, nil, nil, StartupInfo, ProcessInfo);
Previously, LeadBytes listed all values that could be the first byte of a double byte character on the local system. Replace code like this:
if Str[I] in LeadBytes then
with a call to the IsLeadChar function:
if IsLeadChar(Str[I]) then
In cases where a TMemoryStream is used to write a text file, it is useful to write a Byte Order Mark (BOM) before writing anything else to the file. Here is an example of writing the BOM to a file:
var
  Bom: TBytes;
  tms: TMemoryStream;
begin
  ...
  Bom := TEncoding.UTF8.GetPreamble;
  tms.Write(Bom[0], Length(Bom));
Any code that writes to a file needs to be changed to UTF-8 encode the Unicode string:
var
  Temp: Utf8String;
  tms: TMemoryStream;
begin
  ...
  Temp := Utf8Encode(Str); // Str is string being written to file
  tms.Write(Pointer(Temp)^, Length(Temp));
  //Write(Pointer(Str)^, Length(Str)); original call to write string to file
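To make the resulting byte layout concrete, here is a portable C++ sketch of what "write the BOM, then the UTF-8 encoded text" produces. It is an illustration only and does not use the RTL: to_utf8 is a hypothetical helper, std::u16string is assumed as the UTF-16 source, and unpaired surrogates are replaced with U+FFFD as a design choice of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes UTF-16 text as UTF-8, optionally prefixed
// with the UTF-8 byte order mark (EF BB BF).
inline std::string to_utf8(const std::u16string& text, bool with_bom) {
    std::string out;
    if (with_bom) out += "\xEF\xBB\xBF";
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::uint32_t cp = text[i];
        // Combine a valid surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()) {
            const std::uint32_t low = text[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        // Any surrogate still present here is unpaired; substitute U+FFFD.
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The length of the returned string is the number of bytes to write, which generally differs from the number of UTF-16 elements in the source text.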
Calls to the Windows API function MultiByteToWideChar can simply be replaced with an assignment. An example using MultiByteToWideChar:
procedure TWideCharStrList.AddString(const S: string);
var
  Size, D: Integer;
begin
  Size := Length(S);
  D := (Size + 1) * SizeOf(WideChar);
  FList[FUsed] := AllocMem(D);
  MultiByteToWideChar(0, 0, PChar(S), Size, FList[FUsed], D);
  Inc(FUsed);
end;
After the change to Unicode, this call was changed to support compiling under both ANSI and Unicode:
procedure TWideCharStrList.AddString(const S: string);
{$IFNDEF UNICODE}
var
  L, D: Integer;
{$ENDIF}
begin
{$IFDEF UNICODE}
  FList[FUsed] := StrNew(PWideChar(S));
{$ELSE}
  L := Length(S);
  D := (L + 1) * SizeOf(WideChar);
  FList[FUsed] := AllocMem(D);
  MultiByteToWideChar(0, 0, PAnsiChar(S), L, FList[FUsed], D);
{$ENDIF}
  Inc(FUsed);
end;
AppendStr is deprecated and is hard-coded to use AnsiString, and no UnicodeString overload is available. Replace calls like this:
AppendStr(String1, String2);
with code like this:
String1 := String1 + String2;
You can also use the new TStringBuilder class.
Existing Delphi code that uses named threads must change. In previous versions, when you used the new Thread Object item in the gallery to create a new thread, it created the following type declaration in the new thread's unit:
type
  TThreadNameInfo = record
    FType: LongWord;     // must be 0x1000
    FName: PChar;        // pointer to name (in user address space)
    FThreadID: LongWord; // thread ID (-1 indicates caller thread)
    FFlags: LongWord;    // reserved for future use, must be zero
  end;
The debugger's named thread handler expects the FName member to be ANSI data, not Unicode, so the above declaration needs to be changed to the following:
type
  TThreadNameInfo = record
    FType: LongWord;     // must be 0x1000
    FName: PAnsiChar;    // pointer to name (in user address space)
    FThreadID: LongWord; // thread ID (-1 indicates caller thread)
    FFlags: LongWord;    // reserved for future use, must be zero
  end;
New named threads are created with the updated type declaration. Only code that was created in a previous Delphi version needs to be manually updated.
If you want to use Unicode characters or strings in a thread name, you must encode the string in UTF-8 for the debugger to handle it properly. For instance:
ThreadNameInfo.FName := UTF8String('UnicodeThread_фис');
In versions prior to 2009, not all Pointer types supported pointer arithmetic. Because of this, the practice of casting various non-char pointers to PChar was used to enable pointer arithmetic. Now, enable pointer arithmetic by using the new $POINTERMATH compiler directive, which is specifically enabled for the PByte type.
Here is an example of code that casts pointer data to PChar for the purpose of performing pointer arithmetic on it:
function TCustomVirtualStringTree.InternalData(Node: PVirtualNode): Pointer;
begin
  if (Node = FRoot) or (Node = nil) then
    Result := nil
  else
    Result := PChar(Node) + FInternalDataOffset;
end;
You should change this to use PByte rather than PChar:
function TCustomVirtualStringTree.InternalData(Node: PVirtualNode): Pointer;
begin
  if (Node = FRoot) or (Node = nil) then
    Result := nil
  else
    Result := PByte(Node) + FInternalDataOffset;
end;
In the above sample, Node is not actually character data. It is cast to a PChar to use pointer arithmetic to access data that is a certain number of bytes after Node. This worked previously, because SizeOf(Char) equalled SizeOf(Byte). This is no longer true, so such code needs to be changed to use PByte rather than PChar. Without this change, Result points to incorrect data.
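The effect of the element size on pointer arithmetic can be demonstrated in portable C++. The sketch below is an illustration, not RAD Studio code: unsigned char is assumed as the counterpart of Byte, char16_t as the counterpart of the 2-byte Char, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Advances by `offset` bytes: the element size is 1, as with a PByte cast.
inline void* advance_bytes(void* node, std::size_t offset) {
    assert(node != nullptr);
    return static_cast<unsigned char*>(node) + offset;
}

// Advances by `offset` 2-byte elements, that is, offset * sizeof(char16_t)
// bytes. This mirrors what a PChar cast does once Char is 2 bytes wide.
inline void* advance_wide(void* node, std::size_t offset) {
    assert(node != nullptr);
    return static_cast<char16_t*>(node) + offset;
}
```

With the same offset, the second function lands twice as far from the start as the first, which is why an offset expressed in bytes needs a byte-sized pointer type.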
If you have code that uses TVarRec to handle variant open array parameters, you might need to augment it to handle UnicodeString. A new type vtUnicodeString is defined for UnicodeString. The UnicodeString data is in type vtUnicodeString. The following sample shows a case where new code has been added to handle the UnicodeString type.
procedure RegisterPropertiesInCategory(const CategoryName: string;
  const Filters: array of const); overload;
var
  I: Integer;
begin
  if Assigned(RegisterPropertyInCategoryProc) then
    for I := Low(Filters) to High(Filters) do
      with Filters[I] do
        case vType of
          vtPointer:
            RegisterPropertyInCategoryProc(CategoryName, nil,
              PTypeInfo(vPointer), '');
          vtClass:
            RegisterPropertyInCategoryProc(CategoryName, vClass, nil, '');
          vtAnsiString:
            RegisterPropertyInCategoryProc(CategoryName, nil, nil,
              string(vAnsiString));
          vtUnicodeString:
            RegisterPropertyInCategoryProc(CategoryName, nil, nil,
              string(vUnicodeString));
        else
          raise Exception.CreateResFmt(@sInvalidFilter, [I, vType]);
        end;
end;
Search for the following additional code constructs to locate Unicode enabling problems:
Copyright(C) 2009 Embarcadero Technologies, Inc. All Rights Reserved.