RAD Studio (Common)
This topic describes the string data types available in the Delphi language: short strings, AnsiString, UnicodeString, WideString, and null-terminated strings.
A string represents a sequence of characters. Delphi supports the following predefined string types.
String types
Type | Maximum length | Memory required | Used for
ShortString | 255 characters | 2 to 256 bytes | Backward compatibility
AnsiString | ~2^31 characters | 4 bytes to 2 GB | 8-bit (ANSI) characters, DBCS ANSI, MBCS ANSI, Unicode characters, etc.
UnicodeString | ~2^30 characters | 4 bytes to 2 GB | Unicode characters, 8-bit (ANSI) characters, multi-user servers and multi-language applications
WideString | ~2^30 characters | 4 bytes to 2 GB | Unicode characters; multi-user servers and multi-language applications. UnicodeString generally preferred
String types can be mixed in assignments and expressions; the compiler automatically performs the required conversions. However, strings passed by reference to a function or procedure (as var and out parameters) must be of the appropriate type. Strings can be explicitly cast to a different string type, but casting a multibyte string to a single-byte string may result in data loss.
The reserved word string acts as a general string type identifier. For example,
var S: string;
creates a variable S that holds a string. On the Win32 platform, the compiler interprets string (when it appears without a bracketed number after it) as UnicodeString.
On the Win32 platform, you can use the {$H-} directive to turn string into ShortString. This is a potentially useful technique when using older 16-bit Delphi code or Turbo Pascal code with your current programs.
Note that the keyword string is also used when declaring ShortString types of specific lengths (see Short Strings, below).
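The effect of the {$H-} directive can be observed through SizeOf. The following is a minimal sketch for a Win32 console program built with a Unicode-enabled compiler; the variable names OldStyle and Modern are illustrative only.

```delphi
program LongStringsSwitch;
{$APPTYPE CONSOLE}

{$H-}
var
  OldStyle: string;    // declared under {$H-}: string means ShortString here
{$H+}
var
  Modern: string;      // declared under {$H+}: UnicodeString on Win32

begin
  Writeln(SizeOf(OldStyle));                  // 256
  Writeln(SizeOf(Modern) = SizeOf(Pointer));  // TRUE
end.
```

Because the directive is local, it can be switched on and off around individual declarations, which may help when only part of a program relies on legacy code.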
Comparison of strings is defined by the ordering of the elements in corresponding positions. Between strings of unequal length, each character in the longer string without a corresponding character in the shorter string takes on a greater-than value. For example, 'AB' is greater than 'A'; that is, 'AB' > 'A' returns True. Zero-length strings represent the lowest values.
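The ordering rules above can be checked with a few comparisons. This is a minimal sketch; the helper IsGreater is a hypothetical name introduced here and is not part of the runtime library.

```delphi
program CompareStrings;
{$APPTYPE CONSOLE}

// Hypothetical helper: wraps the built-in > operator for strings.
function IsGreater(const A, B: string): Boolean;
begin
  Result := A > B;
end;

begin
  Writeln(IsGreater('AB', 'A'));  // TRUE: the longer string wins when the prefix is equal
  Writeln(IsGreater('B', 'AB'));  // TRUE: the first differing element decides
  Writeln(IsGreater('A', ''));    // TRUE: a zero-length string is the lowest value
end.
```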
You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains only characters in the Basic Multilingual Plane (BMP), every character occupies one 2-byte element, so indexing the string yields whole characters. However, if some characters are outside the BMP, an indexed element may be one half of a surrogate pair rather than an entire character.
The standard function Length returns the number of elements in a string. As noted above, the number of elements is not necessarily the number of characters. The SetLength procedure adjusts the length of a string. The SizeOf function, by contrast, returns the number of bytes used to represent a variable or type. For a short string, this is the storage allocated for the type (the maximum length plus one length byte), not the current length. For all other string types, SizeOf returns the number of bytes in a pointer, since variables of those types are pointers.
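The difference between Length and SizeOf, and between elements and characters, can be illustrated as follows. This is a sketch that assumes a Unicode-enabled compiler in which string is UnicodeString; the variable names are illustrative.

```delphi
program LengthVersusSizeOf;
{$APPTYPE CONSOLE}

var
  SS: ShortString;
  US: string;   // UnicodeString, assuming the default {$H+} state
begin
  SS := 'abc';
  US := 'abc';

  Writeln(Length(SS));                    // 3: current number of elements
  Writeln(SizeOf(SS));                    // 256: length byte plus 255 bytes of storage
  Writeln(Length(US));                    // 3
  Writeln(SizeOf(US) = SizeOf(Pointer));  // TRUE: the variable itself is a pointer

  // U+1D11E lies outside the BMP, so UTF-16 stores it as a surrogate pair.
  US := #$D834#$DD1E;
  Writeln(Length(US));                    // 2: two elements for a single character
end.
```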
For a short string or AnsiString, S[i] is of type AnsiChar. For a UnicodeString or WideString, S[i] is of type WideChar. For single-byte (Western) locales, MyString[2] := 'A'; assigns the value A to the second character of MyString. The following code uses the standard UpCase function to convert MyString to uppercase.
var
  I: Integer;
begin
  I := Length(MyString);
  while I > 0 do
  begin
    MyString[I] := UpCase(MyString[I]);
    I := I - 1;
  end;
end;
Be careful indexing strings in this way, since overwriting the end of a string can cause access violations. Also, avoid passing string indexes as var parameters, because this results in inefficient code.
You can assign the value of a string constant - or any other expression that returns a string - to a variable. The length of the string changes dynamically when the assignment is made. Examples:
MyString := 'Hello world!';
MyString := 'Hello' + 'world';
MyString := MyString + '!';
MyString := ' ';    { space }
MyString := '';     { empty string }
A ShortString is 0 to 255 single-byte characters long. While the length of a ShortString can change dynamically, its memory is a statically allocated 256 bytes; the first byte stores the length of the string, and the remaining 255 bytes are available for characters. If S is a ShortString variable, Ord(S[0]), like Length(S), returns the length of S; assigning a value to S[0], like calling SetLength, changes the length of S. ShortString is maintained for backward compatibility only.
The Delphi language supports short-string types - in effect, subtypes of ShortString - whose maximum length is anywhere from 0 to 255 characters. These are denoted by a bracketed numeral appended to the reserved word string. For example,
var MyString: string[100];
creates a variable called MyString whose maximum length is 100 characters. This is equivalent to the declarations
type
  CString = string[100];
var
  MyString: CString;
Variables declared in this way allocate only as much memory as the type requires - that is, the specified maximum length plus one byte. In our example, MyString uses 101 bytes, as compared to 256 bytes for a variable of the predefined ShortString type.
When you assign a value to a short-string variable, the string is truncated if it exceeds the maximum length for the type.
The standard functions High and Low operate on short-string type identifiers and variables. High returns the maximum length of the short-string type, while Low returns zero.
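The behavior of short-string types described above (fixed storage, truncation, the length byte, and High and Low) can be sketched as follows. The type TShortId and the variable names are hypothetical and chosen only for illustration.

```delphi
program ShortStringDemo;
{$APPTYPE CONSOLE}

type
  TShortId = string[5];   // hypothetical short-string type, maximum length 5

var
  Source: AnsiString;
  Id: TShortId;
begin
  Source := 'Delphi 2009';
  Id := Source;                        // longer than 5 characters, so it is truncated
  Writeln(Id);                         // Delph
  Writeln(Length(Id));                 // 5
  Writeln(Ord(Id[0]));                 // 5: the length byte
  Writeln(SizeOf(Id));                 // 6: maximum length plus one byte
  Writeln(Low(Id), ' ', High(Id));     // 0 5

  Id[0] := AnsiChar(3);                // same effect as SetLength(Id, 3)
  Writeln(Id);                         // Del
end.
```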
AnsiString represents a dynamically allocated string whose maximum length is limited only by available memory.
An AnsiString variable is a pointer to a structure containing string information. When the variable is empty (that is, when it contains a zero-length string), the pointer is nil and the string uses no additional storage. When the variable is nonempty, it points to a dynamically allocated block of memory that contains the string value. This memory is allocated on the heap, but its management is entirely automatic and requires no user code. The AnsiString structure contains a 32-bit length indicator, a 32-bit reference count, a 16-bit data length indicating the number of bytes per character, and a 16-bit code page.
An AnsiString represents a single byte string. With a single-byte character set (SBCS), each byte in a string represents one character. In a multibyte character set (MBCS), the elements are still single bytes, but some characters are represented by one byte and others by more than one byte. Multibyte character sets - especially double-byte character sets (DBCS) - are widely used for Asian languages. An AnsiString can contain MBCS characters.
Indexing of AnsiString is 1-based. Indexing multibyte strings is not reliable, since S[i] represents the ith byte (not necessarily the ith character) in S. The ith byte may be a single character or part of a character. However, the standard AnsiString string-handling functions have multibyte-enabled counterparts that also implement locale-specific ordering for characters. (Names of multibyte functions usually start with Ansi-. For example, the multibyte version of StrPos is AnsiStrPos.) Multibyte character support is operating-system dependent and based on the current locale.
Because AnsiString variables are pointers, two or more of them can reference the same value without consuming additional memory. The compiler exploits this to conserve resources and execute assignments faster. Whenever an AnsiString variable is destroyed or assigned a new value, the reference count of the old AnsiString (the variable's previous value) is decremented and the reference count of the new value (if there is one) is incremented; if the reference count of a string reaches zero, its memory is deallocated. This process is called reference counting. When indexing is used to change the value of a single character in a string, a copy of the string is made if, and only if, its reference count is greater than one. This is called copy-on-write semantics.
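Sharing and copy-on-write can be observed by comparing the pointers held by two variables. This is a minimal sketch, not a description of the runtime's internals; it builds the string at run time so that the value lives on the heap, and the variable names are illustrative.

```delphi
program CopyOnWriteDemo;
{$APPTYPE CONSOLE}

var
  A, B: AnsiString;
begin
  A := StringOfChar(AnsiChar('x'), 4);  // built at run time, so it is heap-allocated
  B := A;                               // no copy: both variables share one block
  Writeln(Pointer(A) = Pointer(B));     // TRUE

  B[1] := 'y';                          // indexed write: B receives its own copy first
  Writeln(Pointer(A) = Pointer(B));     // FALSE
  Writeln(A);                           // xxxx
  Writeln(B);                           // yxxx
end.
```

The assignment B := A only increments the reference count; the copy is deferred until one of the variables is modified through an index.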
The UnicodeString type represents a dynamically allocated Unicode string whose maximum length is limited only by available memory.
In a Unicode character set, each character is represented by one or more bytes. Unicode has several Unicode Transformation Formats that use different but equivalent character encodings that can be easily transformed into each other.
In UTF-8, for instance, characters may be 1 to 4 bytes long. In UTF-8, the first 128 Unicode characters map to the US-ASCII characters.
UTF-16 is another commonly used Unicode encoding in which characters are either 2 bytes or 4 bytes. The majority of the world's characters are in the Basic Multilingual Plane and can be represented in 2 bytes. The remaining characters require two 2-byte elements, known as a surrogate pair.
UTF-32 represents each character with 4 bytes.
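The different sizes of a character in UTF-8 and UTF-16 can be illustrated with the standard UTF8Encode function. This is a sketch that assumes a Unicode-enabled compiler; Utf8ByteCount is a hypothetical helper introduced for this example.

```delphi
program EncodingSizes;
{$APPTYPE CONSOLE}

// Hypothetical helper: number of bytes S occupies when encoded as UTF-8.
function Utf8ByteCount(const S: string): Integer;
begin
  Result := Length(UTF8Encode(S));
end;

var
  Clef: string;
begin
  Writeln(Utf8ByteCount('A'));      // 1: US-ASCII range
  Writeln(Utf8ByteCount(#$00E9));   // 2: U+00E9, e with acute accent
  Writeln(Utf8ByteCount(#$20AC));   // 3: U+20AC, euro sign

  // Each character above is a single 2-byte element in UTF-16.
  // U+1D11E is outside the BMP and needs a surrogate pair.
  Clef := #$D834#$DD1E;
  Writeln(Length(Clef));            // 2 UTF-16 elements for one character
end.
```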
The Win32 platform supports single-byte and multibyte character sets as well as Unicode. The Windows operating system supports UTF-16.
See the Unicode Standard for more information.
The UnicodeString type has exactly the same structure as the AnsiString type. UnicodeString data is encoded in UTF-16.
Since UnicodeString and AnsiString have the same structure, they function very similarly. When a UnicodeString variable is empty, it uses no additional memory. When it is not empty, it points to a dynamically allocated block of memory that contains the string value, and the memory handling for this is transparent to the user. UnicodeString variables are reference counted, and two or more of them can reference the same value without consuming additional memory.
UnicodeString values can be indexed to access their elements. Indexing is 1-based, just as for AnsiString.
UnicodeString is assignment compatible with all other string types. However, assignments between AnsiString and UnicodeString do the appropriate up or down conversions. Note that assigning a UnicodeString type to an AnsiString type is not recommended and can result in data loss.
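The possible data loss can be checked with a round trip through AnsiString. This is a minimal sketch; SurvivesAnsiRoundTrip is a hypothetical helper, and the result for non-ASCII text depends on the active ANSI code page of the system, so no output is stated for that case.

```delphi
program ConversionDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: converts S to AnsiString and back, and reports
// whether the text survived the round trip unchanged.
function SurvivesAnsiRoundTrip(const S: UnicodeString): Boolean;
var
  Narrow: AnsiString;
begin
  Narrow := AnsiString(S);               // down-conversion to the ANSI code page
  Result := UnicodeString(Narrow) = S;   // up-conversion, then comparison
end;

begin
  Writeln(SurvivesAnsiRoundTrip('Hello'));   // TRUE: plain ASCII text
  // U+4E2D survives only if the active ANSI code page can represent it,
  // so the result of this call depends on the system locale.
  Writeln(SurvivesAnsiRoundTrip(#$4E2D));
end.
```

Explicit casts are used here so that the conversions are visible in the code; plain assignments would perform the same conversions implicitly.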
Delphi can also support Unicode characters and strings through the WideChar, PWideChar, and WideString types.
For more information on using Unicode, see Unicode in the IDE and Enabling Unicode in Your Application.
The WideString type represents a dynamically allocated string of 16-bit Unicode characters. In some respects it is similar to AnsiString. On Win32, WideString is compatible with the COM BSTR type.
WideString is appropriate for use in COM applications. However, WideString is not reference counted, and so UnicodeString is more flexible and efficient in other types of applications.
As with UnicodeString, indexing a WideString is not reliable for text containing characters outside the BMP, since S[i] represents the ith element (not necessarily the ith character) in S.
For Delphi, Char and PChar types are WideChar and PWideChar types, respectively.
Many programming languages, including C and C++, lack a dedicated string data type. These languages, and environments that are built with them, rely on null-terminated strings. A null-terminated string is a zero-based array of characters that ends with NUL (#0); since the array has no length indicator, the first NUL character marks the end of the string. You can use Delphi constructions and special routines in the SysUtils unit (see Standard routines and I/O) to handle null-terminated strings when you need to share data with systems that use them.
For example, the following type declarations could be used to store null-terminated strings.
type
  TIdentifier = array[0..15] of Char;
  TFileName = array[0..259] of Char;
  TMemoText = array[0..1023] of WideChar;
With extended syntax enabled ({$X+}), you can assign a string constant to a statically allocated zero-based character array. (Dynamic arrays won't work for this purpose.) If you initialize an array constant with a string that is shorter than the declared length of the array, the remaining characters are set to #0.
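The padding of an array constant can be verified directly. This sketch reuses the TIdentifier declaration from the example above; the constant name Id is illustrative.

```delphi
program CharArrayDemo;
{$APPTYPE CONSOLE}

type
  TIdentifier = array[0..15] of Char;

const
  // The literal has 6 characters; elements 6 through 15 are set to #0.
  Id: TIdentifier = 'Delphi';

begin
  Writeln(Id[0]);          // D
  Writeln(Id[5]);          // i
  Writeln(Id[6] = #0);     // TRUE
  Writeln(Id[15] = #0);    // TRUE
  Writeln(Length(Id));     // 16: the declared number of elements, not the text length
end.
```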
To manipulate null-terminated strings, it is often necessary to use pointers. (See Pointers and pointer types.) String constants are assignment-compatible with the PChar and PWideChar types, which represent pointers to null-terminated arrays of Char and WideChar values. For example,
var
  P: PChar;
  ...
P := 'Hello world!'
points P to an area of memory that contains a null-terminated copy of 'Hello world!' This is equivalent to
const
  TempString: array[0..12] of Char = 'Hello world!';
var
  P: PChar;
  ...
P := @TempString[0];
You can also pass string constants to any function that takes value or const parameters of type PChar or PWideChar, for example StrLen('Hello world!'). As with assignments to a PChar, the compiler generates a null-terminated copy of the string and gives the function a pointer to that copy. Finally, you can initialize PChar or PWideChar constants with string literals, alone or in a structured type. Examples:
const
  Message: PChar = 'Program terminated';
  Prompt: PChar = 'Enter values: ';
  Digits: array[0..9] of PChar =
    ('Zero', 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine');
Zero-based character arrays are compatible with PChar and PWideChar. When you use a character array in place of a pointer value, the compiler converts the array to a pointer constant whose value corresponds to the address of the first element of the array. For example,
var
  MyArray: array[0..32] of Char;
  MyPointer: PChar;
begin
  MyArray := 'Hello';
  MyPointer := MyArray;
  SomeProcedure(MyArray);
  SomeProcedure(MyPointer);
end;
This code calls SomeProcedure twice with the same value.
A character pointer can be indexed as if it were an array. In the previous example, MyPointer[0] returns H. The index specifies an offset added to the pointer before it is dereferenced. (For PWideChar variables, the index is automatically multiplied by two.) Thus, if P is a character pointer, P[0] is equivalent to P^ and specifies the first character in the array, P[1] specifies the second character in the array, and so forth; P[-1] specifies the 'character' immediately to the left of P[0]. The compiler performs no range checking on these indexes.
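Pointer indexing can be tried out with a short program. This is a minimal sketch; CountChar is a hypothetical helper written for this example, while StrLen is the standard SysUtils routine.

```delphi
program PCharIndexing;
{$APPTYPE CONSOLE}

uses
  SysUtils;   // StrLen

// Hypothetical helper: counts occurrences of C by indexing the
// null-terminated string until the terminating #0 is reached.
function CountChar(P: PChar; C: Char): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 0;
  while P[I] <> #0 do
  begin
    if P[I] = C then
      Inc(Result);
    Inc(I);
  end;
end;

var
  P: PChar;
begin
  P := 'Hello world!';
  Writeln(P[0]);               // H, the same element as P^
  Writeln(P[4]);               // o
  Writeln(StrLen(P));          // 12
  Writeln(CountChar(P, 'o'));  // 2
  Writeln(CountChar(P, 'l'));  // 3
end.
```

Since the compiler performs no range checking on these indexes, the loop relies entirely on the terminating #0 to stop.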
The StrUpper function illustrates the use of pointer indexing to iterate through a null-terminated string:
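function StrUpper(Dest, Source: PChar; MaxLen: Integer): PChar;
var
  I: Integer;
begin
  I := 0;
  { Copy at most MaxLen characters, stopping at the terminating #0. }
  while (I < MaxLen) and (Source[I] <> #0) do
  begin
    Dest[I] := UpCase(Source[I]);
    Inc(I);
  end;
  { Terminate the result; Dest must have room for MaxLen + 1 characters. }
  Dest[I] := #0;
  Result := Dest;
end;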
You can mix strings (AnsiString and UnicodeString values) and null-terminated strings (PChar values) in expressions and assignments, and you can pass PChar values to functions or procedures that take string parameters. The assignment S := P, where S is a string variable and P is a PChar expression, copies a null-terminated string into a string.
In a binary operation, if one operand is a string and the other a PChar, the PChar operand is converted to a UnicodeString.
You can cast a PChar value as a UnicodeString. This is useful when you want to perform a string operation on two PChar values. For example,
S := string(P1) + string(P2);
You can also cast a UnicodeString or AnsiString value as a null-terminated string; for example, if S is a string, PChar(S) returns a pointer to the first character of S.
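The conversions between strings and null-terminated strings described above can be sketched as follows. The variable names are illustrative, and the example assumes a Unicode-enabled compiler in which PChar is PWideChar.

```delphi
program MixingStrings;
{$APPTYPE CONSOLE}

uses
  SysUtils;   // StrLen

var
  P1, P2: PChar;
  S: string;
begin
  P1 := 'Hello ';
  P2 := 'world';

  S := P1;                         // copies the null-terminated string into S
  Writeln(Length(S));              // 6
  S := string(P1) + string(P2);    // cast first, then concatenate as strings
  Writeln(S);                      // Hello world
  Writeln(StrLen(PChar(S)));       // 11: PChar(S) points at the first character of S
end.
```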
Copyright(C) 2009 Embarcadero Technologies, Inc. All Rights Reserved.