C and C++ Strings

A string is a sequence of characters, and it represents text data. C++ has two string abstractions, which we refer to as C-style strings and C++ strings.

C-Style Strings

In C, a string is represented as an array of characters, each of which has the type char. The following initializes a string representing the characters in the word hello:

char str[6] = { 'h', 'e', 'l', 'l', 'o', '\0' };

Figure 103 Array representation of a string.

Character literals are enclosed in single quotes. For example, 'h' is the character literal corresponding to the lower-case letter h. The representation of the string in memory is shown in Figure 103.

A C-style string has a sentinel value at its end, the special null character, denoted by '\0'. This is not the same as a null pointer, which is denoted by nullptr, nor the character '0', which denotes the digit 0. The null character signals the end of the string, and algorithms on C-style strings rely on its presence to determine where the string ends.

A character array can also be initialized with a string literal:

char str2[6] = "hello";
char str3[] = "hello";

If the size of the array is specified, it must have sufficient space for the null terminator. In the second case above, the size of the array is inferred as 6 from the string literal that is used to initialize it. A string literal implicitly contains the null terminator at its end, so both str2 and str3 are initialized to end with a null terminator.
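The inferred size can be checked directly. The following is a minimal sketch; the function names inferred_size and last_is_null are hypothetical and exist only for illustration:

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in an array initialized from
// "hello" when the size is left for the compiler to infer.
std::size_t inferred_size() {
  char str3[] = "hello";
  return sizeof(str3);  // sizeof(char) is 1, so this is the element count
}

// Hypothetical helper: checks that the last element of the array is the
// null terminator that the string literal implicitly supplies.
bool last_is_null() {
  char str3[] = "hello";
  return str3[sizeof(str3) - 1] == '\0';
}
```

Here inferred_size() returns 6: five letters plus the null terminator.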

The char type is an atomic type that is represented by numerical values. The ASCII standard specifies the numerical values used to represent each character. For instance, the null character '\0' is represented by the ASCII value 0, the digit '0' is represented by the ASCII value 48, and the letter 'h' is represented by the ASCII value 104. Figure 104 illustrates the ASCII values that represent the string "hello".


Figure 104 ASCII values of the characters in a string.
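Since a char is represented by a number, converting it to int exposes that number. The sketch below uses a hypothetical helper named char_value; the specific values mentioned afterward assume an ASCII-compatible character set:

```cpp
// Hypothetical helper: the numerical value that represents a character.
int char_value(char ch) {
  return static_cast<int>(ch);
}
```

Under ASCII, char_value('\0') is 0, char_value('0') is 48, and char_value('h') is 104.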

An important feature of the ASCII standard is that the digits 0-9 are represented by consecutive values, the capital letters A-Z are also represented by consecutive values, and the lower-case letters a-z as well. The following function determines whether a character is a letter:

bool is_alpha(char ch) {
  return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}
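The same property supports other small utilities. The helpers below are a sketch, not library functions: the names is_digit, digit_value, and to_lower are hypothetical, and to_lower assumes an ASCII-compatible character set in which the letters are consecutive.

```cpp
// Hypothetical helper: true if ch is one of the digits '0' through '9'.
bool is_digit(char ch) {
  return ch >= '0' && ch <= '9';
}

// Hypothetical helper.
// REQUIRES: is_digit(ch)
// EFFECTS:  Returns the integer that the digit character denotes.
int digit_value(char ch) {
  return ch - '0';  // distance from '0' among consecutive digit values
}

// Hypothetical helper: converts an upper-case letter to lower case and
// returns any other character unchanged.
char to_lower(char ch) {
  if (ch >= 'A' && ch <= 'Z') {
    return static_cast<char>(ch - 'A' + 'a');  // same offset from 'a'
  }
  return ch;
}
```

For example, digit_value('7') is 7, and to_lower('H') is 'h'.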

In C++, atomic objects with value 0 are considered to have false truth values, while atomic objects with nonzero values are considered to be true. Thus, the null terminator is the only character that has a false truth value. We will make use of that when implementing algorithms on C-style strings.

Since C-style strings are just arrays, the pitfalls that apply to arrays also apply to C-style strings. For instance, a char array turns into a pointer to char when its value is required. Thus, comparisons and assignments on C-style strings cannot be done with the built-in operators:

char str1[6] = "hello";
char str2[6] = "hello";
char str3[6] = "apple";
char *ptr = str1;        // manually convert array into pointer;
                         // ptr points to first character of str1

// Test for equality?
str1 == str2;            // false; tests pointer equality

// Copy strings?
str1 = str3;             // does not compile; RHS turns into pointer

// Copy through pointer?
ptr = str3;              // sets ptr to point to first character of str3
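To make the distinction concrete, the following sketch separates comparing addresses from comparing contents. The helper names same_address and same_contents are hypothetical, and the caller is assumed to know the array size:

```cpp
// Hypothetical helper: true if both pointers hold the same address.
// This is what == does once the arrays have turned into pointers.
bool same_address(const char *a, const char *b) {
  return a == b;
}

// Hypothetical helper: true if the first size elements of a and b match.
// REQUIRES: a and b each point to at least size elements
bool same_contents(const char *a, const char *b, int size) {
  for (int i = 0; i < size; ++i) {
    if (a[i] != b[i]) {
      return false;
    }
  }
  return true;
}
```

With two separate arrays that both hold "hello", same_address() is false while same_contents() with a size of 6 is true.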

When initializing a variable from a string literal, the variable can be an array, in which case the individual characters are initialized from those in the string literal:

char str1[6] = "hello";

The variable can also be a pointer, in which case it just points to the first character in the string literal itself. String literals are stored in memory; however, the C++ standard prohibits us from modifying the memory used to store a string literal. Thus, we must use the const keyword when specifying the element type of the pointer:

const char *ptr = "hello";

String Traversal and Functions

The conventional pattern for iterating over a C-style string is to use traversal by pointer: walk a pointer across the elements until the end is reached. However, unlike the traversal pattern we saw previously where we already knew the length, we don’t know the end of a C-style string until we reach the null terminator. Thus, we iterate until we reach that sentinel value:

// REQUIRES: str points to a valid, null-terminated string
// EFFECTS:  Returns the length of str, not including the null
//           terminator.
int strlen(const char *str) {
  const char *ptr = str;
  while (*ptr != '\0') {
    ++ptr;
  }
  return ptr - str;
}

Here, we compute the length of a string by creating a new pointer that points to the first character. We then increment that pointer [1] until reaching the null terminator. Then the distance between that pointer and the original is equal to the number of non-null characters in the string.

We can also use the truth value of the null character in the test of the while loop:

int strlen(const char *str) {
  const char *ptr = str;
  while (*ptr) {
    ++ptr;
  }
  return ptr - str;
}

We can also use a for loop, with an empty initialization and body:

int strlen(const char *str) {
  const char *ptr = str;
  for (; *ptr; ++ptr);
  return ptr - str;
}

The standard <cstring> header declares strlen(), which returns a size_t rather than an int.
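The same sentinel-driven traversal applies to other computations on C-style strings. As a sketch, the hypothetical function count_char below counts how many times a character occurs:

```cpp
// Hypothetical helper.
// REQUIRES: str points to a valid, null-terminated string
// EFFECTS:  Returns the number of occurrences of target in str, not
//           counting the null terminator.
int count_char(const char *str, char target) {
  int count = 0;
  // Walk the pointer forward until it reaches the null terminator.
  for (const char *ptr = str; *ptr; ++ptr) {
    if (*ptr == target) {
      ++count;
    }
  }
  return count;
}
```

For example, count_char("hello", 'l') returns 2.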

We saw previously that we cannot copy C-style strings with the assignment operator. Instead, we need to use a function:

// REQUIRES: src points to a valid, null-terminated string;
//           dst points to an array with >= strlen(src) + 1 elements
// MODIFIES: *dst
// EFFECTS:  Copies the characters from src into dst, including the
//           null terminator.
void strcpy(char *dst, const char *src) {
  while (*src) {
    *dst = *src;
    ++src;
    ++dst;
  }
  *dst = *src;   // null terminator
}

The function takes in a destination pointer; the pointed-to type must be non-const, since the function will modify the elements. The function does not need to modify the source string, so the corresponding parameter is a pointer to const char. Then each non-null character from src is copied into dst. The last line also copies the null terminator into dst.
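The REQUIRES clause places the burden of providing a large enough destination on the caller. One possible way to guard against a destination that is too small is sketched below; copy_if_fits and its dst_size parameter are hypothetical and are not part of <cstring>:

```cpp
// Hypothetical helper.
// REQUIRES: src points to a valid, null-terminated string;
//           dst points to an array with dst_size elements
// MODIFIES: *dst
// EFFECTS:  If src fits in dst, including the null terminator, copies it
//           and returns true. Otherwise leaves dst alone and returns false.
bool copy_if_fits(char *dst, int dst_size, const char *src) {
  int needed = 1;  // room for the null terminator
  for (const char *ptr = src; *ptr; ++ptr) {
    ++needed;
  }
  if (needed > dst_size) {
    return false;
  }
  while (*src) {
    *dst = *src;
    ++src;
    ++dst;
  }
  *dst = '\0';
  return true;
}
```

Copying "hello" into a 6-element array succeeds, while copying it into a 3-element array is refused.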

The strcpy() function can be written more succinctly by relying on the behavior of the postfix increment operator. There are two versions of the increment operator, and their evaluation process is visualized in Figure 105:


Figure 105 Evaluation process for prefix and postfix increment.

  • The prefix increment operator, when applied to an atomic object, increments the object and evaluates to the object itself, which now contains the new value:

    int x = 3;
    cout << ++x;   // prints 4
    cout << x;     // prints 4
    
  • The postfix increment operator, when applied to an atomic object, increments the object but evaluates to the old value:

    int x = 3;
    cout << x++;   // prints 3
    cout << x;     // prints 4
    

There are also both prefix and postfix versions of the decrement operator (--).

A word of caution when writing expressions that have side effects, such as increment: in C++, the order in which subexpressions are evaluated within a larger expression is for the most part unspecified. For example, under language versions before C++17, the result of the following depends on the implementation:

int x = 3;
cout << ++x << "," << x;   // before C++17: can print 4,4 or 4,3

If the second x in the print statement is evaluated before ++x, then a 3 will be printed out for its value. On the other hand, if the second x is evaluated after ++x, a 4 will be printed out for its value. C++17 and later evaluate the operands of << from left to right, so this particular statement prints 4,4 there, but most other operators and the arguments of a function call still carry no such guarantee. Code like this, where a single statement contains two subexpressions that use the same variable and at least one of them modifies it, should be avoided.
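One way to sidestep the question is to give the side effect its own statement. The sketch below writes to a std::ostringstream instead of cout so that the result can be inspected; the function name print_in_order is hypothetical:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: performs the increment in a separate statement so
// that both later uses of x see the same, already-updated value.
std::string print_in_order() {
  std::ostringstream out;
  int x = 3;
  ++x;                    // side effect completes here
  out << x << "," << x;   // no modification of x in this statement
  return out.str();
}
```

This version produces 4,4 regardless of the language version or compiler.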

Another feature that our shorter version of strcpy() will rely on is that an assignment evaluates back to the left-hand-side object:

int x = 3;
int y = -4;
++(x = y);        // copies -4 into x, then increments x
cout << x;        // prints -3
cout << (y = x);  // prints -3

The succinct version of strcpy() is as follows:

void strcpy(char *dst, const char *src) {
  while (*dst++ = *src++);
}

The test increments both pointers, but since it is using postfix increment, the expressions themselves evaluate to the old values. Thus, in the first iteration, dst++ and src++ evaluate to the addresses of the first character in each string. The rest of the test expression dereferences the pointers and copies the source value to the destination. The assignment then evaluates to the left-hand-side object, so the test checks the truth value of that object’s value. As long as the character that was copied was not the null terminator, it will be true, and the loop will continue on to the next character. When the null terminator is reached, the assignment copies it to the destination but then produces a false value, so the loop terminates immediately after copying over the null terminator.

The <cstring> library also contains a version of strcpy().
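The same idiom extends to appending one string to another. The following is a sketch in the spirit of the succinct strcpy(); the name append_cstr is hypothetical, and the caller is assumed to provide enough room in the destination:

```cpp
// Hypothetical helper.
// REQUIRES: dst and src point to valid, null-terminated strings;
//           dst's array has room for both strings plus one terminator
// MODIFIES: *dst
// EFFECTS:  Appends the characters of src to the end of dst.
void append_cstr(char *dst, const char *src) {
  while (*dst) {
    ++dst;                      // advance to dst's null terminator
  }
  while ((*dst++ = *src++)) {}  // copy src, including its terminator
}
```

The extra parentheses around the assignment do not change its meaning; they signal that the assignment in the test is intentional, which quiets a warning that many compilers issue.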

Printing C-Style Strings

Previously, we saw that printing out an array prints out the address of its first element, since the array turns into a pointer. Printing out a pointer just prints out the address value contained in the pointer.

On the other hand, C++ output streams have special treatment of pointers to char. If a pointer to char is passed to cout, it will assume that the pointer is pointing into a C-style string and print out every character until it reaches a null terminator:

char str[] = "hello";
char *ptr = str;
cout << ptr;          // prints out hello
cout << str;          // str turns into a pointer; prints out hello

This means that we must ensure that a char * is actually pointing to a null-terminated string before passing it to cout. The following results in undefined behavior:

char array[] = { 'h', 'e', 'l', 'l', 'o' };  // not null-terminated
char ch = 'w';                                // just a character
cout << array;  // undefined behavior -- dereferences past end of array
cout << &ch;    // undefined behavior -- dereferences past ch

To print out the address value of a char *, we must convert it into a void *, which is a pointer that can point to any kind of object:

cout << static_cast<void *>(&ch);  // prints address of ch
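The two behaviors can be observed side by side. The sketch below writes to a std::ostringstream, which treats pointers the same way cout does; the function names streamed_text and streamed_address are hypothetical:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper.
// REQUIRES: str points to a valid, null-terminated string
// EFFECTS:  Returns what an output stream prints for a pointer to char.
std::string streamed_text(const char *str) {
  std::ostringstream out;
  out << str;  // prints characters up to the null terminator
  return out.str();
}

// Hypothetical helper: returns what an output stream prints once the
// pointer has been converted to a pointer to void.
std::string streamed_address(const char *str) {
  std::ostringstream out;
  out << static_cast<const void *>(str);  // prints the address value
  return out.str();
}
```

The exact text of the address is implementation-defined, so it should not be relied upon beyond debugging output.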

C++ Strings

C++ strings are class-type objects represented by the string type [2]. They are not arrays, though the implementation may use arrays under the hood. Thus, C++ strings are to C-style strings as vectors are to built-in arrays.

The following table compares C-style and C++ strings:

Feature          C-Style Strings            C++ Strings
Library Header   <cstring>                  <string>
Declaration      char cstr[]; char *cstr;   string str
Length           strlen(cstr)               str.length()
Copy Value       strcpy(cstr1, cstr2)       str1 = str2
Indexing         cstr[i]                    str[i]
Concatenate      strcat(cstr1, cstr2)       str1 += str2
Compare          !strcmp(cstr1, cstr2)      str1 == str2
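As a sketch of how the two columns differ in practice, the hypothetical functions below concatenate two strings in each style. With C-style strings the caller must supply a destination array that is large enough, whereas a C++ string manages its own storage:

```cpp
#include <cstring>
#include <string>

// Hypothetical helper, C-style.
// REQUIRES: a and b point to valid, null-terminated strings;
//           dst points to an array with at least
//           strlen(a) + strlen(b) + 1 elements
// MODIFIES: *dst
// EFFECTS:  Stores the concatenation of a and b in dst.
void join_c(char *dst, const char *a, const char *b) {
  std::strcpy(dst, a);  // copy value
  std::strcat(dst, b);  // concatenate
}

// Hypothetical helper, C++ style: returns the concatenation of a and b.
std::string join_cpp(const std::string &a, const std::string &b) {
  std::string result = a;  // copy value
  result += b;             // concatenate
  return result;
}
```

Both produce the same characters, but only join_c() can overrun its destination if the caller's array is too small.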

A C++ string can be converted into a C-style string by calling .c_str() on it:

const char *cstr = str.c_str();

A C-style string can be converted into a C++ string by explicitly or implicitly calling the string constructor:

string str1 = string(cstr);  // explicit call
string str = cstr;           // implicit call
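A brief sketch of converting in both directions follows; the function names length_via_c_str and from_cstr are hypothetical. Note that the pointer returned by .c_str() refers to storage owned by the string, so it should only be used while the string is alive and unmodified.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: passes a C++ string to a function that expects a
// C-style string.
std::size_t length_via_c_str(const std::string &str) {
  return std::strlen(str.c_str());
}

// Hypothetical helper.
// REQUIRES: cstr points to a valid, null-terminated string
// EFFECTS:  Returns a C++ string holding a copy of the characters.
std::string from_cstr(const char *cstr) {
  return std::string(cstr);
}
```

The string returned by from_cstr() owns its own copy, so later changes to the original array do not affect it.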

C++ strings can be compared with the comparison operators (==, <, and so on), which compare them lexicographically: the character values are compared one by one, and at the first position where the two strings differ, the string whose character has the lower ASCII value is considered less than the other. If one string is a prefix of the other, then the shorter one is less than the longer (for C-style strings, the same result falls out of comparing the null terminator to a non-null character).

C-style strings cannot be compared with the built-in operators – these would just do pointer comparisons. Instead, the strcmp() function can be used, and strcmp(str1, str2) returns:

  • a negative value if str1 is lexicographically less than str2

  • a positive value if str1 is lexicographically greater than str2

  • 0 if the two strings have equal values

The expression !strcmp(str1, str2) is often used to check for equality – if the two strings are equal, strcmp() returns 0, which has truth value false.
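The sketch below exercises both kinds of comparison. The helper names sign_of_strcmp and cpp_less are hypothetical; sign_of_strcmp reduces the result of strcmp() to -1, 0, or 1, since only the sign of the return value is meaningful.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper.
// REQUIRES: a and b point to valid, null-terminated strings
// EFFECTS:  Returns -1, 0, or 1 according to the sign of strcmp(a, b).
int sign_of_strcmp(const char *a, const char *b) {
  int result = std::strcmp(a, b);
  return (result > 0) - (result < 0);
}

// Hypothetical helper: lexicographic comparison of two C++ strings.
bool cpp_less(const std::string &a, const std::string &b) {
  return a < b;
}
```

For example, sign_of_strcmp("he", "hello") is -1 because "he" is a prefix of "hello", and, assuming ASCII, cpp_less("Zebra", "apple") is true because upper-case letters have lower values than lower-case ones.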