Array-based Containers
A container is an abstract data type whose purpose is to hold objects of some other type. We have already seen two examples of containers: built-in arrays and vectors from the standard library. Both are containers that hold elements in a particular order, indexed starting from 0. A built-in array has a fixed size, while a vector can grow and shrink.
Let’s define our own container ADT, using built-in arrays as our building block. Rather than an ordered sequence, we will build a set abstraction, which represents an unordered collection of unique elements. We will start with a container that holds integers, generalizing it to arbitrary element types next time.
In designing our ADT, we will use the following process:
Determine which operations our ADT will support.
Write some code that uses our ADT. This will helps us figure out what the right interface should be, and it will serve as a test case. (Ideally, we would also write unit tests for each part of the public interface, following the pattern of test-driven development.)
Define the ADT’s public interface, including both function signatures and documentation.
Come up with a data representation. Define member variables and determine the representation invariants.
Write definitions for the ADT’s constructors and member functions, making sure that they adhere to the representation invariants.
Set Operations
The following are the operations our set will support:
Creating an empty set.
Inserting a value into a set. We will allow an existing item to be inserted – it will just do nothing.
Removing a value from a set. We will allow a nonexistent item to be removed – it will just do nothing.
Check if a value is contained in a set.
Count the number of items in a set.
Print a character representation of a set to an output stream.
Code Example
Here is a small program that uses a set:
int main() {
IntSet set;
set.insert(7);
set.insert(32);
set.insert(32);
cout << "Size: " << set.size() << endl; // prints 2
set.print(cout); // prints { 7, 32 } in some arbitrary order
set.insert(42);
set.remove(32);
set.remove(32);
cout << "Contains 32? " << set.contains(32) << endl; // prints 0 (false)
cout << "Contains 42? " << set.contains(42) << endl; // prints 1 (true)
}
Public Interface
The code example leads to the following public interface:
class IntSet {
public:
// Maximum size of a set.
static const int MAX_SIZE = 10;
// EFFECTS: Initializes this set to be empty.
IntSet();
// REQUIRES: size() < MAX_SIZE
// MODIFIES: *this
// EFFECTS: Adds value to the set, if it isn't already in the set.
void insert(int value);
// MODIFIES: *this
// EFFECTS: Removes value from the set, if it is in the set.
void remove(int value);
// EFFECTS: Returns whether value is in the set.
bool contains(int value) const;
// EFFECTS: Returns the number of elements.
int size() const;
// EFFECTS: Prints out the set in an arbitrary order.
void print(std::ostream &os) const;
};
The public interface includes a default constructor and member
functions to insert or remove an item, check if an item is in the set,
obtain its size, and print a representation of the set to a stream.
The latter three functions do not modify the set, so they are declared
as const
, meaning that the this
pointer will be a pointer to
const. The compiler will then enforce that those functions do not
modify any member variables.
static
Data Members
The public interface also includes a constant, MAX_SIZE
, that
denotes the maximum size of a set. We define the constant inside the
IntSet
class. This distinguishes the constant from other constants
of the same name; there may be some other ADT with a MAX_SIZE
. It
also makes use of encapsulation, which allows all of an ADT’s data and
functionality to be defined together within a class.
The constant is declared static
, which has a specific meaning when
used in a class – it indicates that the given member is not
associated with an object of the class, but with the class as a whole.
For a member variable, this means that there is a single copy of that
variable in the program, located in static storage. Without the
static
keyword, each object of the class would have its own copy,
located in the memory for that object. The static
keyword also
allows a member variable to be a compile-time constant, provided it is
also declared as const
(or constexpr
) and initialized from an
expression that can be computed at compile time. [1]
A static member can be accessed by name from within the class, the same as a non-static member. It can be accessed from outside the class with the scope-resolution operator:
int main() {
IntSet set;
for (int i = 0; i < IntSet::MAX_SIZE; ++i) {
set.insert(i);
}
}
We need a compile-time constant because we will use a member variable that is a built-in array as part of our data representation. [2] In C++, a local or member variable that is an array must have a size that is known to the compiler. In the case of a local variable, the compiler needs to know how big it is so that the program can create activation records of the correct size. For a member variable, the compiler needs to know its size so that it can determine the size of the object as a whole. Pointer arithmetic (including array accesses) requires the compiler to know the size of an object at compile time, since it is in terms of whole objects rather than with respect to individual bytes.
In the future, we will see how to eliminate the fixed-size restriction by storing the elements indirectly using a dynamic array. We will see that this raises issues of memory management and learn how to resolve them.
Data Representation
We rely on existing types to represent the data of our IntSet
. For
the elements themselves, we use a built-in array, with MAX_SIZE
as
its capacity. However, our set will start out empty, after which we
can add and remove elements as we please. Thus, the size of the set
will generally not be the same as the size of the array, and we need a
separate member variable to keep track of the number of elements in
the set. We will use an int
for this member.
class IntSet {
...
private:
int elements[MAX_SIZE];
int num_elements;
};
Now that we’ve determined a data representation, we need to figure out what values correspond to valid sets. We use representation invariants to specify the set of valid values.
For num_elements
, a negative value does not make sense, since a
set cannot have a negative size. At the same time, our sets do not
have enough space to store more than MAX_SIZE
elements, so that
places an upper bound on num_elements
. The invariants for
num_elements
are thus 0 <= num_elements && num_elements <=
MAX_SIZE
.
For elements
, we need to know where in the array the set’s
elements are actually stored. We will store them at the beginning of
the array, so one invariant is that the first num_elements
items
in the array are the ones in the set. The set abstraction prohibits
duplicate elements, so another invariant is that there are no
duplicates among the first num_elements
items in elements
.
We document the representation invariants where we define our representation:
class IntSet {
...
private:
int elements[MAX_SIZE];
int num_elements;
// INVARIANTS:
// 0 <= num_elements && num_elements <= MAX_SIZE
// the first num_elements items in elements are the items in the set
// the first num_elements items in elements contain no duplicates
};
size_t
The size_t
type represents only nonnegative integers, and the C++
standard guarantees that it is large enough to hold the size of any
object. Since it does not represent negative integers, we could have
used it for num_elements
rather than defining 0 <=
num_elements
as a representation invariant. We would then also use
size_t
as the return type for size()
, which is what the
standard-library containers do.
C++ has integral types that are unsigned, which do not represent
negative numbers, and size_t
is just one example. The int
type
is signed (unless it is preceded by the unsigned
keyword).
Unsigned integers can lead to subtle programming errors. The following
is an example that compares a signed and unsigned integer:
int main() {
size_t s = 3;
int i = -1;
cout << (s < i) << endl; // prints 1 (true)
}
Perhaps non-intuitively, the comparison s < i
is true, even though
s
is positive but i
is negative. This is because the compiler
converts i
to a size_t
in doing the comparison, and the data
that represents a negative signed integer actually represents a very
large unsigned integer.
Another common error is incorrectly iterating backwards over a sequence:
int main() {
vector<int> vec = { 1, 2, 3 };
for (size_t i = vec.size() - 1; i >= 0; --i) {
cout << vec[i] << endl;
}
}
This code goes off the end of the vector, resulting in undefined
behavior. The reason is that a size_t
is always greater than or
equal to 0, and when i
is equal to 0 and decremented, its value
wraps around to the largest possible size_t
value. The following
iteration correctly avoids this problem:
int main() {
vector<int> vec = { 1, 2, 3 };
for (size_t i = vec.size(); i > 0; --i) {
cout << vec[i - 1] << endl;
}
}
The loop iterates from vec.size()
down to 1, offsetting by 1 when
indexing into the vector. The following equivalent code makes use of
postfix decrement to avoid
the offset in the index expression:
int main() {
vector<int> vec = { 1, 2, 3 };
for (size_t i = vec.size(); i-- > 0;) {
cout << vec[i] << endl;
}
}
Given the pitfalls of unsigned integers, many C++ programmers avoid them, using signed integers instead. We too will generally stick to signed integers.
Implementation
We proceed to define the constructor and member functions of
IntSet
, defining them as they would appear in a source file,
outside the class definition.
We require only a default constructor for IntSet
. However, we
cannot rely on a compiler-generated implicit one: it would default
initialize num_elements
to an undefined value, violating our
representation invariants. Thus, we need to define our own default
constructor. [3]
Alternatively, we can initialize num_elements
to 0 at the
point it is declared. The implicitly defined default
constructor would then be sufficient; it would still have an
empty member-initializer list, but the compiler uses the inline
initialization of a member variable if it is not initialized in
the member-initializer list.
IntSet::IntSet() : num_elements(0) {}
We need not do anything with elements
. It gets default initialized
to contain undefined values, but that doesn’t violate our
representation invariants; since num_elements
is 0, the invariants
for elements
are trivially satisfied.
We proceed to define contains()
. The representation invariants tell
us that the valid items are the first num_elements
in elements
.
They can be stored in any order, so we need to iterate through them
to see if the value is in the set.
bool IntSet::contains(int value) const {
for (int i = 0; i < num_elements; ++i) {
if (elements[i] == value) {
return true;
}
}
return false;
}
When we find an item, we can return immediately without checking the rest of the elements. On the other hand, if we get through all the elements without finding the given value, it is not in the set so we return false.
For inserting into the set, we first assert the requires clause, which
ensures that our invariant of num_elements <= MAX_SIZE
is never
violated. We also must check whether the value is in the set, since
adding it again would violation the invariant of no duplicates. If the
value is not in the set, the simplest way to insert it is to place it
at index num_elements
and increment that variable. This meets the
invariant that the first num_elements
items in the array are the
values in the set.
void IntSet::insert(int value) {
assert(size() < MAX_SIZE);
if (!contains(value)) {
elements[num_elements] = value;
++num_elements; // we could use num_elements++ in the line above instead
}
}
To remove an element, we first need to check whether the value is in
the set. If so, we need to know its location in the array, which would
require iterating through the elements to find it. Rather than
repeating this algorithm in both contains()
and remove()
, we
define a private helper function to do so and call it in both places.
class IntSet {
...
private:
int indexOf(int value) const;
};
int IntSet::indexOf(int value) const {
for (int i = 0; i < num_elements; ++i) {
if (elements[i] == value) {
return i;
}
}
return -1;
}
Rather than returning whether or not the value is in the set,
indexOf()
returns its actual index if it is there. If not, it
returns an invalid index; we use -1, since indices start at 0. We can
then write contains()
as follows:
bool IntSet::contains(int value) const {
return indexOf(value) != -1;
}
We proceed to define remove()
:
void IntSet::remove(int value) {
int index = indexOf(value);
if (index != -1) {
elements[index] = elements[num_elements - 1];
--num_elements;
}
}
In order to actually remove the item, we need to place another item at
its location. The simplest solution that meets the representation
invariants is to copy over the last item and then decrement
num_elements
, which ensures that the first num_elements
items
are the values in the set.
The remaining member functions are defined as follows:
int IntSet::size() const {
return num_elements;
}
void IntSet::print(std::ostream &os) const {
os << "{ ";
for (int i = 0; i < num_elements; ++i) {
os << elements[i] << " ";
}
os << "}";
}
Sorted Representation
The representation of IntSet
places no restrictions on the
ordering of elements in the array. This simplifies the implementation
of insert()
and remove()
, but it requires contains()
(and
indexOf()
) to iterate over every element in the worst case.
An alternate representation would be to require that the set elements are stored in sorted, increasing order. The member variables remain the same – it is the representation invariants that change.
class SortedIntSet {
...
private:
int elements[MAX_SIZE];
int num_elements;
// INVARIANTS:
// 0 <= num_elements && num_elements <= MAX_SIZE
// the first num_elements items in elements are the items in the set
// the first num_elements items in elements contain no duplicates
// the first num_elements items are in sorted, increasing order
};
To insert an item, we can no longer just put it after the existing elements, since the new value may be smaller than existing values. Instead, we need to store it at its appropriate sorted position. This requires moving existing elements out of the way if necessary. Our algorithm will start after the last element and repeatedly:
Compare the item to the left with the value we are inserting.
If the item to the left is less than the new value, the new value is placed at the current position, and we are done.
If the item is greater than the new value, then the old item is moved one position to the right, and we repeat the process one position over to the left.
If no more items are to the left, we place the new value at the current position (i.e. the beginning of the array).
void SortedIntSet::insert(int value) {
assert(size() < MAX_SIZE);
if (!contains(value)) {
int index = num_elements;
while (index > 0 && elements[index - 1] > value) {
elements[index] = elements[index - 1];
--index;
}
elements[index] = value;
++num_elements;
}
}
Similarly, to remove an item, we must shift the remaining elements leftward in order to maintain our representation invariants.
void SortedIntSet::remove(int value) {
int index = indexOf(value);
if (index != -1) {
for (; index < num_elements - 1; ++index) {
elements[index] = elements[index + 1];
}
--num_elements;
}
}
The advantage of sorting is that we don’t have to look through all the elements to determine the location of a value. Instead, we can use binary search, which eliminates half the search space in each step:
Compare the value we are looking for to the middle element among the remaining items.
If the value is equal to the middle element, we have found its index.
If the value is less than the middle element, we know it must be to the left, if it is in the set. Thus, we need only repeat the search on the items in the first half.
If the value is greater than the middle element, we know it must be to the right, if it is in the set. Thus, we need only repeat the search on the items in the second half.
If we have run out of elements to search, the value is not in the set.
The following implements this algorithm:
int SortedIntSet::indexOf(int value) const {
int start = 0;
int end = num_elements;
while (start < end) {
int middle = start / 2 + end / 2;
if (value == elements[middle]) {
return middle;
} else if (value < elements[middle]) {
end = middle;
} else {
start = middle + 1;
}
}
return -1;
}
Since half the search space is eliminated in each step, this algorithm takes time that is logarithmic in the size of the set. We denote this as \(\textrm{O}(\log n)\), where \(n\) is the number of elements in the set.
For a set of size \(n\), the following compares the worst-case runtime of each operation on unsorted and sorted sets:
Operation |
Unsorted Set |
Sorted Set |
|
\(\textrm{O}(n)\) |
\(\textrm{O}(n)\) |
|
\(\textrm{O}(n)\) |
\(\textrm{O}(n)\) |
|
\(\textrm{O}(n)\) |
\(\textrm{O}(\log n)\) |
|
\(\textrm{O}(1)\) |
\(\textrm{O}(1)\) |
|
\(\textrm{O}(1)\) |
\(\textrm{O}(1)\) |
The notation \(\textrm{O}(1)\) means that the operation takes
constant time: the time it takes is independent of the size of the
set. The insert()
and remove()
operations must first check
whether the item is in the set, so they can be no faster than
contains()
. The latter is significantly faster on a sorted set
than an unsorted set, demonstrating the advantage of sorting.