Array-based Data Structures
A container is an abstract data type whose purpose is to hold objects of some other type. We have already seen two examples of containers: built-in arrays and vectors from the standard library. Both are containers that hold elements in a particular order, indexed starting from 0. A built-in array has a fixed size, while a vector can grow and shrink.
Containers use a specific data representation to keep track of their elements – the representation is called a data structure. A data structure may be based on an array, a linear linked structure, a branching structure, and numerous other formats. We begin our examination of data structures with array-based representations.
Let’s define our own container ADT, using built-in arrays as our building block. Rather than an ordered sequence, we will build a set abstraction, which represents an unordered collection of unique elements. We will start with a container that holds integers, generalizing it to arbitrary element types next time.
In designing our ADT, we will use the following process:
Determine which operations our ADT will support.
Write some code that uses our ADT. This will helps us figure out what the right interface should be, and it will serve as a test case. (Ideally, we would also write unit tests for each part of the public interface, following the pattern of test-driven development.)
Define the ADT’s public interface, including both function signatures and documentation.
Come up with a data representation. Define member variables and determine the representation invariants.
Write definitions for the ADT’s constructors and member functions, making sure that they adhere to the representation invariants.
Set Operations
The following are the operations our set will support:
Creating an empty set.
Inserting a value into a set. We will allow an existing item to be inserted – it will just do nothing.
Removing a value from a set. We will allow a nonexistent item to be removed – it will just do nothing.
Check if a value is contained in a set.
Count the number of items in a set.
Print a character representation of a set to an output stream.
Code Example
Here is a small program that uses a set:
int main() {
IntSet set;
set.insert(7);
set.insert(32);
set.insert(32);
cout << "Size: " << set.size() << endl; // prints 2
set.print(cout); // prints { 7, 32 } in some arbitrary order
set.insert(42);
set.remove(32);
set.remove(32);
cout << "Contains 32? " << set.contains(32) << endl; // prints 0 (false)
cout << "Contains 42? " << set.contains(42) << endl; // prints 1 (true)
}
Public Interface
The code example leads to the following public interface:
class IntSet {
public:
// Maximum size of a set.
static const int MAX_SIZE = 10;
// EFFECTS: Initializes this set to be empty.
IntSet();
// REQUIRES: size() < MAX_SIZE
// MODIFIES: *this
// EFFECTS: Adds value to the set, if it isn't already in the set.
void insert(int value);
// MODIFIES: *this
// EFFECTS: Removes value from the set, if it is in the set.
void remove(int value);
// EFFECTS: Returns whether value is in the set.
bool contains(int value) const;
// EFFECTS: Returns the number of elements.
int size() const;
// EFFECTS: Prints out the set in an arbitrary order.
void print(std::ostream &os) const;
};
The public interface includes a default constructor and member
functions to insert or remove an item, check if an item is in the set,
obtain its size, and print a representation of the set to a stream.
The latter three functions do not modify the set, so they are declared
as const, meaning that the this pointer will be a pointer to
const. The compiler will then enforce that those functions do not
modify any member variables.
static Data Members
The public interface also includes a constant, MAX_SIZE, that
denotes the maximum size of a set. We define the constant inside the
IntSet class. This distinguishes the constant from other constants
of the same name; there may be some other ADT with a MAX_SIZE. It
also makes use of encapsulation, which allows all of an ADT’s data and
functionality to be defined together within a class.
The constant is declared static, which has a specific meaning when
used in a class – it indicates that the given member is not
associated with an object of the class, but with the class as a whole.
For a member variable, this means that there is a single copy of that
variable in the program, located in static storage. Without the
static keyword, each object of the class would have its own copy,
located in the memory for that object. The static keyword also
allows a member variable to be a compile-time constant, provided it is
also declared as const (or constexpr) and initialized from an
expression that can be computed at compile time. [1]
A static member can be accessed by name from within the class, the same as a non-static member. It can be accessed from outside the class with the scope-resolution operator:
int main() {
IntSet set;
for (int i = 0; i < IntSet::MAX_SIZE; ++i) {
set.insert(i);
}
}
We need a compile-time constant because we will use a member variable that is a built-in array as part of our data representation. [2] In C++, a local or member variable that is an array must have a size that is known to the compiler. In the case of a local variable, the compiler needs to know how big it is so that the program can create activation records of the correct size. For a member variable, the compiler needs to know its size so that it can determine the size of the object as a whole. Pointer arithmetic (including array accesses) requires the compiler to know the size of an object at compile time, since it is in terms of whole objects rather than with respect to individual bytes.
In the future, we will see how to eliminate the fixed-size restriction by storing the elements indirectly using a dynamic array. We will see that this raises issues of memory management and learn how to resolve them.
Data Representation
We rely on existing types to represent the data of our IntSet. For
the elements themselves, we use a built-in array, with MAX_SIZE as
its capacity. However, our set will start out empty, after which we
can add and remove elements as we please. Thus, the size of the set
will generally not be the same as the size of the array, and we need a
separate member variable to keep track of the number of elements in
the set. We will use an int for this member.
class IntSet {
...
private:
int elements[MAX_SIZE];
int num_elements;
};
Now that we’ve determined a data representation, we need to figure out what values correspond to valid sets. We use representation invariants to specify the set of valid values.
For num_elements, a negative value does not make sense, since a
set cannot have a negative size. At the same time, our sets do not
have enough space to store more than MAX_SIZE elements, so that
places an upper bound on num_elements. The invariants for
num_elements are thus 0 <= num_elements && num_elements <=
MAX_SIZE.
For elements, we need to know where in the array the set’s
elements are actually stored. We will store them at the beginning of
the array, so one invariant is that the first num_elements items
in the array are the ones in the set. The set abstraction prohibits
duplicate elements, so another invariant is that there are no
duplicates among the first num_elements items in elements.
We document the representation invariants where we define our representation:
class IntSet {
...
private:
int elements[MAX_SIZE];
int num_elements;
// INVARIANTS:
// 0 <= num_elements && num_elements <= MAX_SIZE
// the first num_elements items in elements are the items in the set
// the first num_elements items in elements contain no duplicates
};
size_t
The size_t type represents only nonnegative integers, and the C++
standard guarantees that it is large enough to hold the size of any
object. Since it does not represent negative integers, we could have
used it for num_elements rather than defining 0 <=
num_elements as a representation invariant. We would then also use
size_t as the return type for size(), which is what the
standard-library containers do.
C++ has integral types that are unsigned, which do not represent
negative numbers, and size_t is just one example. The int type
is signed (unless it is preceded by the unsigned keyword).
Unsigned integers can lead to subtle programming errors. The following
is an example that compares a signed and unsigned integer:
int main() {
size_t s = 3;
int i = -1;
cout << (s < i) << endl; // prints 1 (true)
}
Perhaps non-intuitively, the comparison s < i is true, even though
s is positive but i is negative. This is because the compiler
converts i to a size_t in doing the comparison, and the data
that represents a negative signed integer actually represents a very
large unsigned integer.
Another common error is incorrectly iterating backwards over a sequence:
int main() {
vector<int> vec = { 1, 2, 3 };
for (size_t i = vec.size() - 1; i >= 0; --i) {
cout << vec[i] << endl;
}
}
This code goes off the end of the vector, resulting in undefined
behavior. The reason is that a size_t is always greater than or
equal to 0, and when i is equal to 0 and decremented, its value
wraps around to the largest possible size_t value. The following
iteration correctly avoids this problem:
int main() {
vector<int> vec = { 1, 2, 3 };
for (size_t i = vec.size(); i > 0; --i) {
cout << vec[i - 1] << endl;
}
}
The loop iterates from vec.size() down to 1, offsetting by 1 when
indexing into the vector. The following equivalent code makes use of
postfix decrement to avoid
the offset in the index expression:
int main() {
vector<int> vec = { 1, 2, 3 };
for (size_t i = vec.size(); i-- > 0;) {
cout << vec[i] << endl;
}
}
Given the pitfalls of unsigned integers, many C++ programmers avoid them, using signed integers instead. We too will generally stick to signed integers.
Implementation
We proceed to define the constructor and member functions of
IntSet, defining them as they would appear in a source file,
outside the class definition.
We require only a default constructor for IntSet. However, we
cannot rely on a compiler-generated implicit one: it would default
initialize num_elements to an undefined value, violating our
representation invariants. Thus, we need to define our own default
constructor. [3]
Alternatively, we can initialize num_elements to 0 at the
point it is declared. The implicitly defined default
constructor would then be sufficient; it would still have an
empty member-initializer list, but the compiler uses the inline
initialization of a member variable if it is not initialized in
the member-initializer list.
IntSet::IntSet() : num_elements(0) {}
We need not do anything with elements. It gets default initialized
to contain undefined values, but that doesn’t violate our
representation invariants; since num_elements is 0, the invariants
for elements are trivially satisfied.
We proceed to define contains(). The representation invariants tell
us that the valid items are the first num_elements in elements.
They can be stored in any order, so we need to iterate through them
to see if the value is in the set.
bool IntSet::contains(int value) const {
for (int i = 0; i < num_elements; ++i) {
if (elements[i] == value) {
return true;
}
}
return false;
}
When we find an item, we can return immediately without checking the rest of the elements. On the other hand, if we get through all the elements without finding the given value, it is not in the set so we return false.
For inserting into the set, we first assert the requires clause, which
ensures that our invariant of num_elements <= MAX_SIZE is never
violated. We also must check whether the value is in the set, since
adding it again would violation the invariant of no duplicates. If the
value is not in the set, the simplest way to insert it is to place it
at index num_elements and increment that variable. This meets the
invariant that the first num_elements items in the array are the
values in the set.
void IntSet::insert(int value) {
assert(size() < MAX_SIZE);
if (!contains(value)) {
elements[num_elements] = value;
++num_elements; // we could use num_elements++ in the line above instead
}
}
To remove an element, we first need to check whether the value is in
the set. If so, we need to know its location in the array, which would
require iterating through the elements to find it. Rather than
repeating this algorithm in both contains() and remove(), we
define a private helper function to do so and call it in both places.
class IntSet {
...
private:
int indexOf(int value) const;
};
int IntSet::indexOf(int value) const {
for (int i = 0; i < num_elements; ++i) {
if (elements[i] == value) {
return i;
}
}
return -1;
}
Rather than returning whether or not the value is in the set,
indexOf() returns its actual index if it is there. If not, it
returns an invalid index; we use -1, since indices start at 0. We can
then write contains() as follows:
bool IntSet::contains(int value) const {
return indexOf(value) != -1;
}
We proceed to define remove():
void IntSet::remove(int value) {
int index = indexOf(value);
if (index != -1) {
elements[index] = elements[num_elements - 1];
--num_elements;
}
}
In order to actually remove the item, we need to place another item at
its location. The simplest solution that meets the representation
invariants is to copy over the last item and then decrement
num_elements, which ensures that the first num_elements items
are the values in the set.
Figure 49 Removing an element from an unordered set can be done by copying the last element into the position vacated by the removed element.
The remaining member functions are defined as follows:
int IntSet::size() const {
return num_elements;
}
void IntSet::print(std::ostream &os) const {
os << "{ ";
for (int i = 0; i < num_elements; ++i) {
os << elements[i] << " ";
}
os << "}";
}
Time Complexity
When choosing a particular container or data structure, an important consideration is how efficient each operation is when applied to that ADT. Different data structures make different tradeoffs in which operations are more efficient and which ones are less so, and the appropriate choice of data structure depends on which operations are most important to our setting. There are two resources we consider: time, or how long an operation takes, and memory, or how much space a data structure or an operation over one uses. In this text, we focus on time complexity.
In general, the amount of time an operation on a data structure takes depends on how many elements it contains. We characterize this using orders of growth, which give us an approximate measure of the time an operation takes relative to the size of the data structure or input. As an example, if \(n\) is the input size and an operation takes \(5n^2 - 3n +2\) units of time (e.g. machine instructions), we ignore the lower-order terms and constant factors classify the time complexity as \(\textrm{O}(n^2)\) (pronounced “oh of en squared” or “big-oh of en squared”). This is because the runtime is dominated by the quadratic term, and at large input sizes, the runtime is proportional to \(n^2\).
Figure 50 Comparison of constant, linear, and quadratic growth.
Figure 50 illustrates the growth of three functions: \(f_1(n) = 1\), \(f_2(n) = n\), and \(f_3(n) = n^2\). The gap between any two of these functions increasing significantly as \(n\) gets larger, whereas the gap between two functions of the same order, such as \(f(n) = n^2\) and \(g(n) = 2n^2\) differ by the same ratio regardless of the size \(n\). Thus, we consider both of the latter functions to have the same \(\textrm{O}(n^2)\) order of growth.
In addition to simplifying complexity using orders of growth, we
usually care most about the efficiency of the worst-case input at
any particular size. As an example, consider the contains()
operation on an IntSet:
bool IntSet::contains(int value) const {
for (int i = 0; i < num_elements; ++i) {
if (elements[i] == value) {
return true;
}
}
return false;
}
In the best-case scenario, the value we are looking for is at the very
front of the elements array, requiring only a single comparison
before returning true. However, it is also possible for the value to
be at the end of the array or not in the set at all, in which case we
have to look through all of the elements before we determine the
answer. If the set has \(n\) elements, in the worst case we would
need to look at all \(n\) of them to determine if the value is in
the set, resulting in an \(\textrm{O}(n)\) worst-case complexity.
Contrast this with the size() operation:
int IntSet::size() const {
return num_elements;
}
This merely returns the value of the num_elements member variable.
Its runtime is not affected by how many elements the set contains, so
it is a constant-time or \(\textrm{O}(1)\) operation.
Finally, let us consider one more example, that of converting a vector potentially containing duplicates into a set. Here’s a possible implementation:
IntSet to_set(const std::vector<int> &vec) {
IntSet result;
for (std::size_t i = 0; i < vec.size(); ++i) {
result.insert(vec[i]);
}
return result;
}
To analyze the time complexity, we add up the contributions of each
piece of the algorithm. Constructing an empty set takes constant time.
If the vector contains \(n\) elements, the loop does \(n\)
iterations. However, each iteration does not necessarily take constant
time. The worst case input is actually when the vector has no
duplicates – insert() calls contains(), which takes linear
time when the given value is not in the set, as we concluded
previously. Thus, we have a linear number of loop iterations, each of
which takes linear time in the worst case, for a total of quadratic or
\(\textrm{O}(n^2)\) time. (More exactly, the total runtime for the
loop is something like \(1 + 2 + 3 + \cdots + n =
\frac{n(n+1)}{2}\), which still gives us \(\textrm{O}(n^2)\) order
of growth.)
Now that we have a sense of the time complexity of the IntSet
implementation, can we define a set abstraction that is more efficient
for at least some of the operations? We can indeed, as we will see
next time.