String Types

This topic describes the string data types available in the Delphi language. The following types are covered:

Short strings (ShortString)
ANSI strings (AnsiString)
Unicode strings (UnicodeString and WideString)

A string represents a sequence of characters. Delphi supports the following predefined string types.

String types

Type	Maximum length	Memory required	Used for
ShortString	255 characters	2 to 256 bytes	Backward compatibility
AnsiString	~2^31 characters	4 bytes to 2GB	8-bit (ANSI) characters, DBCS ANSI, MBCS ANSI, Unicode characters, etc.
UnicodeString	~2^30 characters	4 bytes to 2GB	Unicode characters, 8-bit (ANSI) characters, multi-user servers and multi-language applications
WideString	~2^30 characters	4 bytes to 2GB	Unicode characters; multi-user servers and multi-language applications. UnicodeString generally preferred

Note: WideString is provided to be compatible with the COM BSTR type. You should generally use UnicodeString for non-COM applications. UnicodeString is the preferred type for most purposes. The type string is an alias for UnicodeString, not AnsiString.

String types can be mixed in assignments and expressions; the compiler automatically performs required conversions. But strings passed by reference to a function or procedure (as var and out parameters) must be of the appropriate type. Strings can be explicitly cast to a different string type. However, casting a multi-byte string to a single byte string may result in data loss.
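The conversion and data-loss behavior described above can be sketched as follows. This is a minimal illustration, assuming a Win32 console project; the program name is made up, and the result depends on the system ANSI code page, so no particular output is claimed.

```delphi
program StringConversionSketch;
{$APPTYPE CONSOLE}

var
  U: UnicodeString;
  A: AnsiString;
begin
  // #$4E2D is a CJK character that most single-byte code pages cannot represent.
  U := 'abc' + #$4E2D;

  // Explicit cast: converts the UTF-16 data to the system ANSI code page.
  A := AnsiString(U);

  // Converting back shows whether anything was lost on the way down.
  if UnicodeString(A) = U then
    Writeln('Round trip preserved the text')
  else
    Writeln('Round trip lost data');
end.
```

On a system whose ANSI code page cannot encode the character, the second branch is the one to expect.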

There are some special string types worth mentioning:

Code paged AnsiStrings are defined like this: type mystring = type AnsiString(CODEPAGE); Such a type is an AnsiString that has an affinity to maintaining its internal data in a specific code page.
The RawByteString type is type AnsiString($FFFF). RawByteString enables the passing of string data of any code page without doing any code page conversions. RawByteString should only be used as a const or value type parameter or a return type from a function. It should never be passed by reference (passed by var), and should never be instantiated as a variable.
UTF8String represents a string encoded using UTF-8 (variable number of bytes Unicode). It is a code paged AnsiString type with a UTF-8 code page.
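As a rough sketch of how these code-paged types behave, the program below stores the same text in three string types and prints the element counts. The type name TLatin1String and the choice of code page 1252 are assumptions made only for this example.

```delphi
program CodePageSketch;
{$APPTYPE CONSOLE}

type
  // Hypothetical type name; 1252 is the Windows Western European code page.
  TLatin1String = type AnsiString(1252);

var
  U: UnicodeString;
  L: TLatin1String;
  U8: UTF8String;
begin
  U := 'caf' + #$00E9;    // "cafe" with an accented e (U+00E9)
  L := TLatin1String(U);  // one byte per character in code page 1252
  U8 := UTF8String(U);    // U+00E9 takes two bytes in UTF-8

  Writeln(Length(U));     // 4 (UTF-16 elements)
  Writeln(Length(L));     // 4 (bytes)
  Writeln(Length(U8));    // 5 (bytes)
end.
```

The explicit casts keep the compiler from warning about implicit string conversions.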

The reserved word string functions like a general string type identifier. For example,

var S: string;

creates a variable S that holds a string. On the Win32 platform, the compiler interprets string (when it appears without a bracketed number after it) as UnicodeString.

On the Win32 platform, you can use the {$H-} directive to turn string into ShortString. This is a potentially useful technique when using older 16-bit Delphi code or Turbo Pascal code with your current programs.
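A minimal sketch of the effect of the directive, assuming a Win32 console project (the program name is illustrative only):

```delphi
program ShortStringDirectiveSketch;
{$APPTYPE CONSOLE}
{$H-}  // from here on, the unadorned word "string" means ShortString

var
  S: string;
begin
  S := 'legacy';
  Writeln(SizeOf(S));  // 256: a length byte plus 255 character bytes
end.
```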

Note that the keyword string is also used when declaring ShortString types of specific lengths (see Short Strings, below).

Comparison of strings is defined by the ordering of the elements in corresponding positions. Between strings of unequal length, each character in the longer string without a corresponding character in the shorter string takes on a greater-than value. For example, 'AB' is greater than 'A'; that is, 'AB' > 'A' returns True. Zero-length strings represent the lowest values.
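The ordering rules above can be checked with a few comparisons. This is only a sketch; the program and variable names are placeholders.

```delphi
program CompareSketch;
{$APPTYPE CONSOLE}

var
  A, B: string;
begin
  A := 'AB';
  B := 'A';
  Writeln(A > B);    // TRUE: the extra character makes 'AB' greater
  Writeln('' < B);   // TRUE: the empty string is the lowest value
  Writeln('B' > A);  // TRUE: decided by the first differing element
end.
```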

You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains only characters in the Basic Multilingual Plane (BMP), every character occupies 2 bytes, so indexing the string yields whole characters. However, if some characters are not in the BMP, an indexed element may be one half of a surrogate pair, not an entire character.

The standard function Length returns the number of elements in a string. As noted above, the number of elements is not necessarily the number of characters. The SetLength procedure adjusts the length of a string. Note that the SizeOf function returns the number of bytes used to represent a variable or type. For a short string, SizeOf returns the statically allocated size (the declared maximum length plus one length byte), not the current length. For all other string types, SizeOf returns the number of bytes in a pointer, since those types are pointers.
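The difference between Length and SizeOf, and between elements and characters, can be sketched like this. The names are placeholders, and the surrogate pair is written with character codes so the example does not depend on source-file encoding.

```delphi
program LengthSketch;
{$APPTYPE CONSOLE}

var
  SS: ShortString;
  U: UnicodeString;
begin
  SS := 'abc';
  // U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
  U := #$D83D#$DE00;

  Writeln(Length(SS));                   // 3
  Writeln(SizeOf(SS));                   // 256
  Writeln(Length(U));                    // 2 elements for one character
  Writeln(SizeOf(U) = SizeOf(Pointer));  // TRUE
end.
```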

For a short string or AnsiString, S[i] is of type AnsiChar. For a UnicodeString or WideString, S[i] is of type WideChar. For single-byte (Western) locales, MyString[2] := 'A'; assigns the value A to the second character of MyString. The following code uses the standard UpCase function, which converts only the letters a to z, to convert MyString to uppercase.

var I: Integer;
begin
   I := Length(MyString);
   while I > 0 do
    begin
       MyString[I] := UpCase(MyString[I]);
       I := I - 1;
    end;
end;

Be careful indexing strings in this way, since overwriting the end of a string can cause access violations. Also, avoid passing string indexes as var parameters, because this results in inefficient code.

You can assign the value of a string constant - or any other expression that returns a string - to a variable. The length of the string changes dynamically when the assignment is made. Examples:

MyString := 'Hello world!';
MyString := 'Hello' + 'world';
MyString := MyString + '!';
MyString := ' '; { space }
MyString := '';  { empty string }

Short Strings

A ShortString is 0 to 255 single-byte characters long. While the length of a ShortString can change dynamically, its memory is a statically allocated 256 bytes; the first byte stores the length of the string, and the remaining 255 bytes are available for characters. If S is a ShortString variable, Ord(S[0]), like Length(S), returns the length of S; assigning a value to S[0], like calling SetLength, changes the length of S. ShortString is maintained for backward compatibility only.
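The role of the length byte can be illustrated with a short sketch (program and variable names are placeholders):

```delphi
program ShortStringLengthSketch;
{$APPTYPE CONSOLE}

var
  S: ShortString;
begin
  S := 'Delphi';
  Writeln(Ord(S[0]));  // 6, same as Length(S)

  S[0] := #3;          // writing the length byte shortens the string
  Writeln(S);          // Del

  SetLength(S, 2);     // the supported way to do the same thing
  Writeln(S);          // De
end.
```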

The Delphi language supports short-string types - in effect, subtypes of ShortString - whose maximum length is anywhere from 0 to 255 characters. These are denoted by a bracketed numeral appended to the reserved word string. For example,

var MyString: string[100];

creates a variable called MyString whose maximum length is 100 characters. This is equivalent to the declarations

type CString = string[100];
var MyString: CString;

Variables declared in this way allocate only as much memory as the type requires - that is, the specified maximum length plus one byte. In our example, MyString uses 101 bytes, as compared to 256 bytes for a variable of the predefined ShortString type.

When you assign a value to a short-string variable, the string is truncated if it exceeds the maximum length for the type.

The standard functions High and Low operate on short-string type identifiers and variables. High returns the maximum length of the short-string type, while Low returns zero.
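Truncation and the behavior of High, Low, and SizeOf on a short-string subtype can be sketched as follows. The names are illustrative only.

```delphi
program ShortStringSubtypeSketch;
{$APPTYPE CONSOLE}

var
  Source: ShortString;
  S: string[5];
begin
  Source := 'Hello world!';
  S := Source;         // truncated to the 5-character maximum
  Writeln(S);          // Hello
  Writeln(Low(S));     // 0
  Writeln(High(S));    // 5
  Writeln(SizeOf(S));  // 6: five characters plus the length byte
end.
```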

AnsiString

AnsiString represents a dynamically allocated string whose maximum length is limited only by available memory.

An AnsiString variable is a pointer to a structure containing string information. When the variable is empty (that is, when it contains a zero-length string), the pointer is nil and the string uses no additional storage. When the variable is nonempty, it points to a dynamically allocated block of memory that contains the string value. This memory is allocated on the heap, but its management is entirely automatic and requires no user code. The AnsiString structure contains a 32-bit length indicator, a 32-bit reference count, a 16-bit data length indicating the number of bytes per character, and a 16-bit code page.

An AnsiString represents a single byte string. With a single-byte character set (SBCS), each byte in a string represents one character. In a multibyte character set (MBCS), the elements are still single bytes, but some characters are represented by one byte and others by more than one byte. Multibyte character sets - especially double-byte character sets (DBCS) - are widely used for Asian languages. An AnsiString can contain MBCS characters.

Indexing of AnsiString is 1-based. Indexing multibyte strings is not reliable, since S[i] represents the ith byte (not necessarily the ith character) in S. The ith byte may be a single character or part of a character. However, the standard AnsiString string-handling functions have multibyte-enabled counterparts that also implement locale-specific ordering for characters. (Names of multibyte functions usually start with Ansi-. For example, the multibyte version of StrPos is AnsiStrPos.) Multibyte character support is operating-system dependent and based on the current locale.

Because AnsiString variables are pointers, two or more of them can reference the same value without consuming additional memory. The compiler exploits this to conserve resources and execute assignments faster. Whenever an AnsiString variable is destroyed or assigned a new value, the reference count of the old AnsiString (the variable's previous value) is decremented and the reference count of the new value (if there is one) is incremented; if the reference count of a string reaches zero, its memory is deallocated. This process is called reference-counting. When indexing is used to change the value of a single character in a string, a copy of the string is made if - but only if - its reference count is greater than one. This is called copy-on-write semantics.
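Copy-on-write can be observed with a small sketch like the one below. It builds the string with StringOfChar so that the value lives on the heap rather than in constant data; the program and variable names are placeholders.

```delphi
program CopyOnWriteSketch;
{$APPTYPE CONSOLE}

var
  S1, S2: AnsiString;
begin
  S1 := StringOfChar(AnsiChar('a'), 5);  // heap-allocated 'aaaaa'
  S2 := S1;                              // shares the same block

  Writeln(Pointer(S1) = Pointer(S2));    // TRUE: one block, two references

  S2[1] := 'b';                          // triggers a private copy for S2

  Writeln(Pointer(S1) = Pointer(S2));    // FALSE: S2 now has its own block
  Writeln(S1);                           // aaaaa
  Writeln(S2);                           // baaaa
end.
```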

UnicodeString

The UnicodeString type represents a dynamically allocated Unicode string whose maximum length is limited only by available memory.

In a Unicode character set, each character is represented by one or more bytes. Unicode has several Unicode Transformation Formats that use different but equivalent character encodings that can be easily transformed into each other.

In UTF-8, for instance, characters may be 1 to 4 bytes. In UTF-8, the first 128 Unicode characters map to the US-ASCII characters.

UTF-16 is another commonly used Unicode encoding in which characters are either 2 bytes or 4 bytes. The majority of the world's characters are in the Basic Multilingual Plane and can be represented in 2 bytes. The remaining characters require two 2-byte code units, known as a surrogate pair.

UTF-32 represents each character with 4 bytes.

The Win32 platform supports single-byte and multibyte character sets as well as Unicode. The Windows operating system supports UTF-16.

See the Unicode Standard for more information.

The UnicodeString type has exactly the same structure as the AnsiString type. UnicodeString data is encoded in UTF-16.

Since UnicodeString and AnsiString have the same structure, they function very similarly. When a UnicodeString variable is empty, it uses no additional memory. When it is not empty, it points to a dynamically allocated block of memory that contains the string value, and the memory handling for this is transparent to the user. UnicodeString variables are reference counted, and two or more of them can reference the same value without consuming additional memory.

UnicodeString values can be indexed. Indexing is 1-based, just as for AnsiString.

UnicodeString is assignment compatible with all other string types. However, assignments between AnsiString and UnicodeString do the appropriate up or down conversions. Note that assigning a UnicodeString type to an AnsiString type is not recommended and can result in data loss.

Delphi can also support Unicode characters and strings through the WideChar, PWideChar, and WideString types.

For more information on using Unicode, see Unicode in the IDE and Enabling Unicode in Your Application.

WideString

The WideString type represents a dynamically allocated string of 16-bit Unicode characters. In some respects it is similar to AnsiString. On Win32, WideString is compatible with the COM BSTR type.

WideString is appropriate for use in COM applications. However, WideString is not reference counted, and so UnicodeString is more flexible and efficient in other types of applications.

Indexing a WideString is not reliable when it contains characters outside the BMP, since S[i] represents the ith element (not necessarily the ith character) in S.

For Delphi, Char and PChar types are WideChar and PWideChar types, respectively.

Working with Null-Terminated Strings

Many programming languages, including C and C++, lack a dedicated string data type. These languages, and environments that are built with them, rely on null-terminated strings. A null-terminated string is a zero-based array of characters that ends with NUL (#0); since the array has no length indicator, the first NUL character marks the end of the string. You can use Delphi constructions and special routines in the SysUtils unit (see Standard routines and I/O) to handle null-terminated strings when you need to share data with systems that use them.

For example, the following type declarations could be used to store null-terminated strings.

type
  TIdentifier = array[0..15] of Char;
  TFileName = array[0..259] of Char;
  TMemoText = array[0..1023] of WideChar;

With extended syntax enabled ({$X+}), you can assign a string constant to a statically allocated zero-based character array. (Dynamic arrays won't work for this purpose.) If you initialize an array constant with a string that is shorter than the declared length of the array, the remaining characters are set to #0.
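A brief sketch of both cases, assigning to a variable array and initializing an array constant. The identifiers Greeting and Buffer are made up for this example.

```delphi
program CharArraySketch;
{$APPTYPE CONSOLE}
{$X+}

const
  // Shorter than the array, so the remaining elements are #0.
  Greeting: array[0..7] of Char = 'Hi';

var
  Buffer: array[0..15] of Char;
  S: string;
begin
  Buffer := 'Hello';          // string constant assigned to a static array
  S := Buffer;                // copies up to the first #0
  Writeln(S);                 // Hello

  Writeln(Ord(Greeting[2]));  // 0
  Writeln(Ord(Greeting[7]));  // 0
end.
```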

Using Pointers, Arrays, and String Constants

To manipulate null-terminated strings, it is often necessary to use pointers. (See Pointers and pointer types.) String constants are assignment-compatible with the PChar and PWideChar types, which represent pointers to null-terminated arrays of Char and WideChar values. For example,

var P: PChar;
  ...                    
P := 'Hello world!';

points P to an area of memory that contains a null-terminated copy of 'Hello world!' This is equivalent to

const TempString: array[0..12] of Char = 'Hello world!';
var P: PChar;
   ...
P := @TempString[0];

You can also pass string constants to any function that takes value or const parameters of type PChar or PWideChar - for example StrLen('Hello world!'). As with assignments to a PChar, the compiler generates a null-terminated copy of the string and gives the function a pointer to that copy. Finally, you can initialize PChar or PWideChar constants with string literals, alone or in a structured type. Examples:

const
  Message: PChar = 'Program terminated';
  Prompt: PChar = 'Enter values: ';
  Digits: array[0..9] of PChar = ('Zero', 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine');

Zero-based character arrays are compatible with PChar and PWideChar. When you use a character array in place of a pointer value, the compiler converts the array to a pointer constant whose value corresponds to the address of the first element of the array. For example,

var
  MyArray: array[0..32] of Char;
  MyPointer: PChar;
begin
  MyArray := 'Hello';
  MyPointer := MyArray;
  SomeProcedure(MyArray);
  SomeProcedure(MyPointer);
end;

This code calls SomeProcedure twice with the same value.

A character pointer can be indexed as if it were an array. In the previous example, MyPointer[0] returns H. The index specifies an offset added to the pointer before it is dereferenced. (For PWideChar variables, the index is automatically multiplied by two.) Thus, if P is a character pointer, P[0] is equivalent to P^ and specifies the first character in the array, P[1] specifies the second character in the array, and so forth; P[-1] specifies the 'character' immediately to the left of P[0]. The compiler performs no range checking on these indexes.
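Pointer indexing can be tried out with a sketch such as this one. The names Buf, P, and I are placeholders; the loop counts elements up to the terminating #0.

```delphi
program PointerIndexSketch;
{$APPTYPE CONSOLE}

var
  Buf: array[0..5] of Char;
  P: PChar;
  I: Integer;
begin
  Buf := 'Hello';
  P := Buf;            // the array converts to a pointer to Buf[0]

  Writeln(P[0]);       // H
  Writeln(P^);         // H, same element as P[0]
  Writeln(P[4]);       // o

  // Walk the pointer until the terminating #0 is found.
  I := 0;
  while P[I] <> #0 do
    Inc(I);
  Writeln(I);          // 5
end.
```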

The following function, a variant of StrUpper that writes into a separate destination buffer, illustrates the use of pointer indexing to iterate through a null-terminated string:

function StrUpper(Dest, Source: PChar; MaxLen: Integer): PChar;
var
  I: Integer;
begin
  I := 0;
  while (I < MaxLen) and (Source[I] <> #0) do
  begin
    Dest[I] := UpCase(Source[I]);
    Inc(I);
  end;
  Dest[I] := #0;
  Result := Dest;
end;

Mixing Delphi Strings and Null-Terminated Strings

You can mix strings (AnsiString and UnicodeString values) and null-terminated strings (PChar values) in expressions and assignments, and you can pass PChar values to functions or procedures that take string parameters. The assignment S := P, where S is a string variable and P is a PChar expression, copies a null-terminated string into a string.

In a binary operation, if one operand is a string and the other a PChar, the PChar operand is converted to a UnicodeString.

You can cast a PChar value as a UnicodeString. This is useful when you want to perform a string operation on two PChar values. For example,

S := string(P1) + string(P2);

You can also cast a UnicodeString or AnsiString string as a null-terminated string. The following rules apply.

If S is a UnicodeString, PChar(S) casts S as a null-terminated string; it returns a pointer to the first character in S. Such casts are used for the Windows API. For example, if Str1 and Str2 are UnicodeString, you could call the Win32 API MessageBox function like this: MessageBox(0, PChar(Str1), PChar(Str2), MB_OK);. Use PAnsiChar(S) if S is an AnsiString.
You can also use Pointer(S) to cast a string to an untyped pointer. But if S is empty, the typecast returns nil.
PChar(S) always returns a pointer to a memory block; if S is empty, a pointer to #0 is returned.
When you cast a UnicodeString or AnsiString variable to a pointer, the pointer remains valid until the variable is assigned a new value or goes out of scope. If you cast any other string expression to a pointer, the pointer is valid only within the statement where the typecast is performed.
When you cast a UnicodeString or AnsiString expression to a pointer, the pointer should usually be considered read-only. You can safely use the pointer to modify the string only when all of the following conditions are satisfied:
- The expression cast is a UnicodeString or AnsiString variable.
- The string is not empty.
- The string is unique - that is, has a reference count of one. To guarantee that the string is unique, call the SetLength, SetString, or UniqueString procedures.
- The string has not been modified since the typecast was made.
- The characters modified are all within the string. Be careful not to use an out-of-range index on the pointer.

The same rules apply when mixing WideString values with PWideChar values.
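The casting rules listed above can be sketched as follows. This is an illustration under the stated conditions (a variable, nonempty, made unique before writing), not a recommended pattern; the names are placeholders.

```delphi
program PointerCastSketch;
{$APPTYPE CONSOLE}

var
  S: string;
  P: PChar;
begin
  S := '';
  Writeln(Pointer(S) = nil);  // TRUE: an empty string is a nil pointer
  Writeln(PChar(S)^ = #0);    // TRUE: PChar(S) still points at a #0

  S := 'abc';
  UniqueString(S);            // guarantee a reference count of one
  P := PChar(S);
  P[0] := 'A';                // in range, and S is not modified in between
  Writeln(S);                 // Abc
end.
```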