Strings
A string is a sequence of characters, and it represents text data. C++ has two string abstractions, which we refer to as C-style strings and C++ strings.
C-Style Strings
In the original C language, strings are represented as just an array of characters, which have the type char. The following initializes a string representing the characters in the word hello:
char str[6] = { 'h', 'e', 'l', 'l', 'o', '\0' };
Figure 25 Array representation of a string.
Character literals are enclosed in single quotes. For example 'h'
is the character literal corresponding to the lower-case letter h.
The representation of the string in memory is shown in
Figure 25.
A C-style string has a sentinel value at its end, the special null character, denoted by '\0'. This is not the same as a null pointer, which is denoted by nullptr, nor the character '0', which denotes the digit 0. The null character signals the end of the string, and algorithms on C-style strings rely on its presence to determine where the string ends.
A character array can also be initialized with a string literal:
char str2[6] = "hello";
char str3[] = "hello";
If the size of the array is specified, it must have sufficient space
for the null terminator. In the second case above, the size of the
array is inferred as 6 from the string literal that is used to
initialize it. A string literal implicitly contains the null terminator at its end, so both str2 and str3 are initialized to end with a null terminator.
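As a quick check of the sizes described above, the following sketch compares the number of elements in an array initialized from "hello" with the number of characters that precede the null terminator. The helper names hello_array_size() and hello_length() are hypothetical, and the second one relies on the standard strlen() function from <cstring>.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: number of elements in an array whose size is
// inferred from the string literal "hello".
std::size_t hello_array_size() {
  char str3[] = "hello";
  return sizeof(str3);  // sizeof(char) is 1, so this counts elements
}

// Hypothetical helper: number of characters before the null terminator.
std::size_t hello_length() {
  char str3[] = "hello";
  return std::strlen(str3);
}
```

Under these assumptions, hello_array_size() is 6 and hello_length() is 5; the difference is the null terminator.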
The char type is an atomic type that is represented by numerical values. The ASCII standard specifies the numerical values used to represent each character. For instance, the null character '\0' is represented by the ASCII value 0, the digit '0' is represented by the ASCII value 48, and the letter 'h' is represented by the ASCII value 104. Figure 26 illustrates the ASCII values that represent the string "hello".
Figure 26 ASCII values of the characters in a string.
An important feature of the ASCII standard is that the digits 0-9 are represented by consecutive values, the capital letters A-Z are also represented by consecutive values, and the lower-case letters a-z as well. The following function determines whether a character is a letter:
bool is_alpha(char ch) {
  return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}
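Because each of these ranges is consecutive, arithmetic on character values can also convert between them. The following is a minimal sketch, assuming an ASCII-compatible character set; the helpers to_upper() and digit_value() are hypothetical names used only for illustration.

```cpp
// Hypothetical helper: converts a lower-case letter to upper case and
// leaves every other character unchanged. Assumes an ASCII-compatible
// character set, where a-z and A-Z are each consecutive.
char to_upper(char ch) {
  if (ch >= 'a' && ch <= 'z') {
    // Offset of ch within a-z, applied to the start of A-Z.
    return static_cast<char>(ch - 'a' + 'A');
  }
  return ch;
}

// Hypothetical helper: numeric value of a digit character.
// REQUIRES: ch is between '0' and '9'
int digit_value(char ch) {
  return ch - '0';
}
```

For example, to_upper('h') evaluates to 'H', and digit_value('7') evaluates to 7.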
In C++, atomic objects with value 0 are considered to have false truth values, while atomic objects with nonzero values are considered to be true. Thus, the null terminator is the only character that has a false truth value. We will make use of that when implementing algorithms on C-style strings.
Since C-style strings are just arrays, the pitfalls that apply to arrays also apply to C-style strings. For instance, a char array turns into a pointer to char when its value is required. Thus, comparisons and assignments on C-style strings cannot be done with the built-in operators:
char str1[6] = "hello";
char str2[6] = "hello";
char str3[6] = "apple";
char *ptr = str1; // manually convert array into pointer;
// ptr points to first character of str1
// Test for equality?
str1 == str2; // false; tests pointer equality
// Copy strings?
str1 = str3; // does not compile; RHS turns into pointer
// Copy through pointer?
ptr = str3; // sets ptr to point to first character of str3
When initializing a variable from a string literal, the variable can be an array, in which case the individual characters are initialized from those in the string literal:
char str1[6] = "hello";
The variable can also be a pointer, in which case it just points to the first character in the string literal itself. String literals are stored in memory; however, the C++ standard prohibits us from modifying the memory used to store a string literal. Thus, we must use the const keyword when specifying the element type of the pointer:
const char *ptr = "hello";
We will discuss const in more detail next time.
String Traversal and Functions
The conventional pattern for iterating over a C-style string is to use traversal by pointer: walk a pointer across the elements until the end is reached. However, unlike the traversal pattern we saw previously where we already knew the length, we don’t know the end of a C-style string until we reach the null terminator. Thus, we iterate until we reach that sentinel value:
// REQUIRES: str points to a valid, null-terminated string
// EFFECTS: Returns the length of str, not including the null
// terminator.
int strlen(const char *str) {
  const char *ptr = str;
  while (*ptr != '\0') {
    ++ptr;
  }
  return ptr - str;
}
Here, we compute the length of a string by creating a new pointer that points to the first character. We then increment that pointer [1] until reaching the null terminator. Then the distance between that pointer and the original is equal to the number of non-null characters in the string.
We can also use the truth value of the null character in the test of the while loop:
int strlen(const char *str) {
  const char *ptr = str;
  while (*ptr) {
    ++ptr;
  }
  return ptr - str;
}
We can also use a for loop, with an empty initialization and body:
int strlen(const char *str) {
  const char *ptr = str;
  for (; *ptr; ++ptr);
  return ptr - str;
}
The standard <cstring> header provides a definition of strlen(). The library version returns the length as a size_t rather than an int.
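The same traversal-by-pointer pattern extends to other operations that the built-in operators cannot perform on C-style strings, such as comparing contents. The following is a minimal sketch; str_equal() is a hypothetical helper, not a function from <cstring>.

```cpp
// Hypothetical helper, not part of <cstring>.
// REQUIRES: a and b point to valid, null-terminated strings
// EFFECTS:  Returns true if both strings contain the same characters.
bool str_equal(const char *a, const char *b) {
  // Advance both pointers while the characters match and the first
  // string has not ended.
  while (*a && *a == *b) {
    ++a;
    ++b;
  }
  // The strings are equal only if both pointers stopped at a null
  // terminator.
  return *a == *b;
}
```

The loop stops at the first mismatch or at the end of the first string, so the final comparison distinguishes equal strings from cases where one string is a prefix of the other.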
We saw previously that we cannot copy C-style strings with the assignment operator. Instead, we need to use a function:
// REQUIRES: src points to a valid, null-terminated string;
// dst points to an array with >= strlen(src) + 1 elements
// MODIFIES: *dst
// EFFECTS: Copies the characters from src into dst, including the
// null terminator.
void strcpy(char *dst, const char *src) {
  while (*src) {
    *dst = *src;
    ++src;
    ++dst;
  }
  *dst = *src; // null terminator
}
The function takes in a destination pointer; the pointed-to type must be non-const, since the function will modify the elements. The function does not need to modify the source string, so the corresponding parameter is a pointer to const char. Then each non-null character from src is copied into dst. The last line also copies the null terminator into dst.
The strcpy() function can be written more succinctly by relying on the behavior of the postfix increment operator. There are two versions of the increment operator, and their evaluation process is visualized in Figure 27:
Figure 27 Evaluation process for prefix and postfix increment.
The prefix increment operator, when applied to an atomic object, increments the object and evaluates to the object itself, which now contains the new value:
int x = 3;
cout << ++x; // prints 4
cout << x; // prints 4
The postfix increment operator, when applied to an atomic object, increments the object but evaluates to the old value:
int x = 3;
cout << x++; // prints 3
cout << x; // prints 4
There are also both prefix and postfix versions of the decrement operator (--).
A word of caution when writing expressions that have side effects, such as increment: in C++, the order in which subexpressions are evaluated within a larger expression is for the most part unspecified. Thus, the following can produce different results depending on the implementation and language standard:
int x = 3;
cout << ++x << "," << x; // before C++17, can print 4,4 or 4,3
If the second x in the print statement is evaluated before ++x, then a 3 will be printed out for its value. On the other hand, if the second x is evaluated after ++x, a 4 will be printed out for its value. Since C++17, the operands of << are evaluated left to right, so this particular statement prints 4,4 under that standard; the order remains unspecified for most other operators and for function arguments. Code like this, where a single statement contains two subexpressions that use the same variable but at least one modifies it, should be avoided.
Another feature that our shorter version of strcpy() will rely on is that an assignment evaluates back to the left-hand-side object:
int x = 3;
int y = -4;
++(x = y); // copies -4 into x, then increments x
cout << x; // prints -3
cout << (y = x); // prints -3
The succinct version of strcpy() is as follows:
void strcpy(char *dst, const char *src) {
  while (*dst++ = *src++);
}
The test increments both pointers, but since it is using postfix increment, the expressions themselves evaluate to the old values. Thus, in the first iteration, dst++ and src++ evaluate to the addresses of the first character in each string. The rest of the test expression dereferences the pointers and copies the source value to the destination. The assignment then evaluates to the left-hand-side object, so the test checks the truth value of that object’s value. As long as the character that was copied was not the null terminator, it will be true, and the loop will continue on to the next character. When the null terminator is reached, the assignment copies it to the destination but then produces a false value, so the loop terminates immediately after copying over the null terminator.
The <cstring> library also contains a version of strcpy().
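Neither our strcpy() nor the library version can check that the destination is large enough, so the caller has to satisfy that requirement. One way to do so is sketched below, assuming the caller knows the size of the destination array; copy_if_fits() and its dst_size parameter are hypothetical and not part of <cstring>.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies src into dst only if dst has room for
// every character of src plus the null terminator.
// REQUIRES: src points to a valid, null-terminated string;
//           dst points to an array with dst_size elements
// MODIFIES: *dst
// EFFECTS:  Returns true and copies src into dst if it fits;
//           otherwise returns false and leaves dst unchanged.
bool copy_if_fits(char *dst, std::size_t dst_size, const char *src) {
  if (std::strlen(src) + 1 > dst_size) {
    return false;  // no room for the characters and the terminator
  }
  std::strcpy(dst, src);
  return true;
}
```

With a destination declared as char buf[6], copying "hello" succeeds because it needs exactly 6 elements, while copying "goodbye" is rejected.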
Printing C-Style Arrays
Previously, we saw that printing out an array prints out the address of its first element, since the array turns into a pointer. Printing out a pointer just prints out the address value contained in the pointer.
On the other hand, C++ output streams have special treatment of pointers to char. If a pointer to char is passed to cout, it will assume that the pointer is pointing into a C-style string and print out every character until it reaches a null terminator:
char str[] = "hello";
char *ptr = str;
cout << ptr; // prints out hello
cout << str; // str turns into a pointer; prints out hello
This means that we must ensure that a char * is actually pointing to a null-terminated string before passing it to cout. The following results in undefined behavior:
char array[] = { 'h', 'e', 'l', 'l', 'o' }; // not null-terminated
char ch = 'w'; // just a character
cout << array; // undefined behavior -- dereferences past end of array
cout << &ch; // undefined behavior -- dereferences past ch
To print out the address value of a char *, we must convert it into a void *, which is a pointer that can point to any kind of object:
cout << static_cast<void *>(&ch); // prints address of ch
C++ Strings
C++ strings are class-type objects represented by the string type [2]. They are not arrays, though the implementation may use arrays under the hood. Thus, C++ strings are to C-style strings as vectors are to built-in arrays.
[2] Technically, string is an alias for basic_string<char>, so you may see the latter in compiler errors.
The following table compares C-style and C++ strings:
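| | C-Style Strings | C++ Strings |
|---|---|---|
| Library Header | <cstring> | <string> |
| Declaration | char cstr[N]; or const char *cstr; | string str; |
| Length | strlen(cstr) | str.length() |
| Copy Value | strcpy(cstr1, cstr2) | str1 = str2 |
| Indexing | cstr[i] | str[i] |
| Concatenate | strcat(cstr1, cstr2) | str1 += str2 |
| Compare | strcmp(cstr1, cstr2) | str1 == str2 |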
A C++ string can be converted into a C-style string by calling .c_str() on it:
const char *cstr = str.c_str();
A C-style string can be converted into a C++ string by explicitly or implicitly calling the string constructor:
string str1 = string(cstr); // explicit call
string str2 = cstr; // implicit call
C++ strings can be compared with the built-in comparison operators, which compare them lexicographically: the ASCII values of elements are compared one by one, and if the two strings differ in a character, then the string whose character has a lower ASCII value is considered less than the other. If one string is a prefix of the other, then the shorter one is less than the longer (which results from comparing the ASCII value of the null terminator to a non-null character).
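The comparison rules above can be tried out with a small sketch. The helper comes_before() is a hypothetical name; it does nothing more than apply the built-in < operator to two C++ strings.

```cpp
#include <string>

// Hypothetical helper: returns true if a comes strictly before b in
// lexicographic order, using the built-in < on C++ strings.
bool comes_before(const std::string &a, const std::string &b) {
  return a < b;
}
```

For example, comes_before("apple", "hello") is true because 'a' has a lower ASCII value than 'h', and comes_before("hell", "hello") is true because the first string is a prefix of the second. Since capital letters have lower ASCII values than lower-case letters, comes_before("Zebra", "apple") is also true.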
C-style strings cannot be compared with the built-in operators – these would just do pointer comparisons. Instead, the strcmp() function can be used, and strcmp(str1, str2) returns:
- a negative value if str1 is lexicographically less than str2
- a positive value if str1 is lexicographically greater than str2
- 0 if the two strings have equal values
The expression !strcmp(str1, str2) is often used to check for equality – if the two strings are equal, strcmp() returns 0, which has truth value false.
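Since strcmp() only promises the sign of its result, code should test whether the result is negative, positive, or zero rather than compare it against a specific value such as -1. The following sketch illustrates this; compare_sign() and equal_strings() are hypothetical helpers built on the standard strcmp() from <cstring>.

```cpp
#include <cstring>

// Hypothetical helper: reduces the result of strcmp() to -1, 0, or 1.
// REQUIRES: str1 and str2 point to valid, null-terminated strings
int compare_sign(const char *str1, const char *str2) {
  int result = std::strcmp(str1, str2);
  if (result < 0) {
    return -1;  // str1 is lexicographically less than str2
  }
  if (result > 0) {
    return 1;   // str1 is lexicographically greater than str2
  }
  return 0;     // equal values
}

// Hypothetical helper: the equality idiom described above.
bool equal_strings(const char *str1, const char *str2) {
  return !std::strcmp(str1, str2);
}
```

For example, compare_sign("apple", "hello") is -1, compare_sign("hello", "apple") is 1, and equal_strings("hello", "hello") is true.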
Command-Line Arguments
So far, the programs we have considered have not worked with user input. More interesting programs, however, incorporate behavior that responds to user input. We will see two mechanisms for passing input to a program: command-line arguments and standard input.
Command-line arguments are arguments that are passed to a program when it is invoked from a shell or terminal. As an example, consider the following command:
$ g++ -Wall -O1 -std=c++17 -pedantic test.cpp -o test
Here, g++ is the program we are invoking, and the arguments tell g++ what to do. For instance, the -Wall argument tells the g++ compiler to warn about any potential issues in the code, -O1 tells the compiler to use optimization level 1, and so on.
Command-line arguments are passed to the program through arguments to main(). The main() function may have zero parameters, in which case the command-line arguments are discarded. It can also have two parameters [3], so the signature has the following form:
int main(int argc, char *argv[]);
[3] Implementations may also allow other signatures for main().
The first argument is the number of command-line arguments passed to the program, and it is conventionally named argc. The second, conventionally named argv, contains each command-line argument as a C-style string. As we saw last time, an array parameter is actually a pointer parameter, so the following signature is equivalent:
int main(int argc, char **argv);
Thus, the second parameter is a pointer to the first element of an array, each element of which is a pointer to the start of a C-style string, as shown in Figure 28.
Figure 28 Representation of command-line arguments.
The command-line arguments also include the name of the program as the first argument – this is often used in printing out error messages from the program.
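As a sketch of how the program name might be used in an error message, the following hypothetical helper builds a usage string from argv[0]. The function name usage_message() and the message format are assumptions made for illustration.

```cpp
#include <string>

// Hypothetical helper: builds a usage message that names the program
// as it was invoked. The message format is an assumption.
// REQUIRES: program_name points to a valid, null-terminated string
std::string usage_message(const char *program_name) {
  return std::string("Usage: ") + program_name + " NUMBER...";
}

// A main() function could call it when too few arguments are given
// (this part needs <iostream>):
//
//   int main(int argc, char *argv[]) {
//     if (argc < 2) {
//       std::cout << usage_message(argv[0]) << std::endl;
//       return 1;
//     }
//     // rest of the program
//   }
```

Because the name comes from argv[0], the message matches however the user invoked the program, for instance ./sum.exe.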
As an example, the following program takes an arbitrary number of integral arguments and computes their sum:
#include <iostream>
#include <cstdlib> // for atoi() function
using namespace std;
int main(int argc, char *argv[]) {
  int sum = 0;
  for (int i = 1; i < argc; ++i) {
    sum += atoi(argv[i]);
  }
  cout << "sum is " << sum << endl;
}
The first argument is skipped, since it is the program name. Each remaining argument is converted to an int by the atoi() function, which takes a C-style string as the argument and returns the integer that it represents. For example, atoi("123") returns the number 123 as an int.
The following is an example of running the program:
$ ./sum.exe 2 4 6 8 10
sum is 30
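One limitation of atoi() is that it returns 0 when no conversion can be performed, so an invalid argument cannot be told apart from a valid 0. The following sketch shows one possible way to detect invalid arguments using the standard strtol() function from <cstdlib>, which reports where parsing stopped. The helper parse_long() is hypothetical, and this sketch does not detect values that are out of range.

```cpp
#include <cstdlib>  // for strtol()

// Hypothetical helper.
// REQUIRES: text points to a valid, null-terminated string
// MODIFIES: value
// EFFECTS:  If text consists of optional leading whitespace followed
//           by a base-10 integer and nothing else, stores the integer
//           in value and returns true. Otherwise returns false and
//           leaves value unchanged. Out-of-range input is not
//           detected in this sketch.
bool parse_long(const char *text, long &value) {
  char *end = nullptr;
  long result = std::strtol(text, &end, 10);
  // end == text means no digits were consumed; a non-null character
  // at end means something followed the number.
  if (end == text || *end != '\0') {
    return false;
  }
  value = result;
  return true;
}
```

With this helper, parse_long("123", value) succeeds and sets value to 123, while parse_long("12x", value) and parse_long("abc", value) both fail.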
Input and Output (I/O)
User input can also be obtained through standard input, which receives data that a user types into the console. In C++, the cin stream reads data from standard input. Data is extracted into an object using the extraction operator >>, and the extraction interprets the raw character data according to the target data type. For example, the following code extracts into a string, which extracts individual words that are separated by whitespace:
string word;
while (cin >> word) {
  cout << "word = '" << word << "'" << endl;
}
The extraction operation evaluates to the cin stream, which has a truth value – if the extraction succeeds, it is true, but if the extraction fails, the truth value is false. Thus, the loop above will continue as long as extraction succeeds.
The following is an example of running the program:
$ ./words.exe
hello world!
word = 'hello'
word = 'world!'
goodbye
word = 'goodbye'
The program only receives input after the user presses the enter key. The first line the user entered contains two words, each of which gets printed out. Then the program waits for more input. Another word is entered, so the program reads and prints it out. Finally, the user in this example inputs an end-of-file character – on Unix-based systems, the sequence Ctrl-d enters an end of file, while Ctrl-z does so on Windows systems. The end-of-file marker denotes the end of a stream, so extracting from cin fails at that point, ending the loop above.
The program above prints output to standard output, represented by the cout stream. The insertion operator << inserts the text representation of a value into an output stream.
I/O Redirection
Shells allow input redirection, which passes the data from a file to standard input rather than reading from the keyboard. For instance, if the file words.in contains the data:
hello world!
goodbye
Then using the < symbol before the filename redirects the file to standard input at the command line:
$ ./words.exe < words.in
word = 'hello'
word = 'world!'
word = 'goodbye'
A file has an implicit end of file at the end of its data, and the program terminates upon reaching the end of the file.
We can also do output redirection, where the shell writes the contents of standard output to a file. The symbol for output redirection is >:
$ ./words.exe > result.out
hello world!
goodbye
$ cat result.out
word = 'hello'
word = 'world!'
word = 'goodbye'
Here, we redirect the output to the file result.out. We then enter input from the keyboard, ending with the Ctrl-d sequence. When the program ends, we use the cat command to display the contents of result.out.
Input and output redirection can also be used together:
$ ./words.exe < words.in > result.out
$ cat result.out
word = 'hello'
word = 'world!'
word = 'goodbye'
Example: Adding Integers
Using standard input, we can write a program that adds up integers entered by a user. The program will terminate either upon reaching an end of file or if the user types in the word done:
#include <iostream>
#include <string> // for stoi()
using namespace std;
int main() {
  int sum = 0;
  cout << "Enter some numbers to sum." << endl;
  string word;
  while (cin >> word && word != "done") {
    sum += stoi(word);
  }
  cout << "sum is " << sum << endl;
}
The code extracts to a string so that it can be compared to the string "done". (The latter is a C-style string, but C++ strings can be compared with C-style strings using the built-in comparison operators.) The stoi() function parses a C++ string to determine the int value it represents.
The following is an example of running the program:
$ ./sum
Enter some numbers to sum.
2
4
6
done
sum is 12
An alternate version of the program extracts directly to an int. However, it can only be terminated by an end of file or other failed extraction:
#include <iostream>
using namespace std;
int main() {
  int sum = 0;
  cout << "Enter some numbers to sum." << endl;
  int number;
  while (cin >> number) {
    sum += number;
  }
  cout << "sum is " << sum << endl;
}
File I/O
A program can also read and write files directly using file streams. It must include the <fstream> header, and it can then use an ifstream to read from a file and an ofstream to write to a file. The former supports the same interface as cin, while the latter has the same interface as cout.
An ifstream object can be created from a file name:
string filename = "words.in";
ifstream fin(filename);
Alternatively, the ifstream object can be created without a file name, and then its open() function can be given the name of the file to open:
string filename = "words.in";
ifstream fin;
fin.open(filename);
In general, a program should check if the file was successfully opened, regardless of the mechanism used to create the ifstream:
if (!fin.is_open()) {
  cout << "open failed" << endl;
  return 1;
}
Once we’ve determined the file is open, we can read from it like cin. The following program reads individual words from the file words.in and prints them:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main() {
  string filename = "words.in";
  ifstream fin;
  fin.open(filename);
  if (!fin.is_open()) {
    cout << "open failed" << endl;
    return 1;
  }
  string word;
  while (fin >> word) {
    cout << "word = '" << word << "'" << endl;
  }
  fin.close(); // optional
}
The program closes the file before exiting. Doing so explicitly is optional – it will happen automatically at the end of the ifstream object’s lifetime (e.g. when it goes out of scope if it is a local variable).
Best practice is to extract from an input stream, whether it is cin or an ifstream, in the test of a loop or conditional. That way, the test will evaluate to false if the extraction fails. The following examples all print the last word twice because they do not check for failure between extracting and printing a word:
while (!fin.fail()) {
  fin >> word;
  cout << word << endl;
}

while (fin.good()) {
  fin >> word;
  cout << word << endl;
}

while (!fin.eof()) {
  fin >> word;
  cout << word << endl;
}

while (fin) {
  fin >> word;
  cout << word << endl;
}
The following is printed when using any of the loops above:
$ ./main.exe
hello
world!
goodbye
goodbye
Multiple extractions can be placed in the test of a loop by chaining them. The test evaluates to true when all extractions succeed. For example, the following reads two words at a time:
string word1, word2;
while (fin >> word1 >> word2) {
  cout << "word1 = '" << word1 << "'" << endl;
  cout << "word2 = '" << word2 << "'" << endl;
}
For words.in, only the first two words are printed, since the test will fail in the second iteration when it tries to read a fourth word:
$ ./main.exe
word1 = 'hello'
word2 = 'world!'
An entire line can be read using the getline() function, which takes in an input stream and a target string (by reference) and returns whether or not reading the line succeeded. If so, the target string will contain the full line read:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main() {
  string filename = "words.in";
  ifstream fin;
  fin.open(filename);
  if (!fin.is_open()) {
    cout << "open failed" << endl;
    return 1;
  }
  string line;
  while (getline(fin, line)) {
    cout << "line = '" << line << "'" << endl;
  }
}
For words.in, this will result in:
$ ./main.exe
line = 'hello world!'
line = 'goodbye'
An ofstream works similarly to an ifstream, except that it is used for printing output to a file. The following program prints data to the file output.txt:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main() {
  const int SIZE = 4;
  int data[SIZE] = { 1, 2, 3, 4 };
  string filename = "output.txt";
  ofstream fout;
  fout.open(filename);
  if (!fout.is_open()) {
    cout << "open failed" << endl;
    return 1;
  }
  for (int i = 0; i < SIZE; ++i) {
    fout << "data[" << i << "] = " << data[i] << endl;
  }
  fout.close(); // optional
}
The following shows the resulting data in output.txt:
$ cat output.txt
data[0] = 1
data[1] = 2
data[2] = 3
data[3] = 4