Array-based Containers

A container is an abstract data type whose purpose is to hold objects of some other type. We have already seen two examples of containers: built-in arrays and vectors from the standard library. Both are containers that hold elements in a particular order, indexed starting from 0. A built-in array has a fixed size, while a vector can grow and shrink.

Let’s define our own container ADT, using built-in arrays as our building block. Rather than an ordered sequence, we will build a set abstraction, which represents an unordered collection of unique elements. We will start with a container that holds integers, generalizing it to arbitrary element types next time.

In designing our ADT, we will use the following process:

Determine which operations our ADT will support.
Write some code that uses our ADT. This will helps us figure out what the right interface should be, and it will serve as a test case. (Ideally, we would also write unit tests for each part of the public interface, following the pattern of test-driven development.)
Define the ADT’s public interface, including both function signatures and documentation.
Come up with a data representation. Define member variables and determine the representation invariants.
Write definitions for the ADT’s constructors and member functions, making sure that they adhere to the representation invariants.

Set Operations

The following are the operations our set will support:

Creating an empty set.
Inserting a value into a set. We will allow an existing item to be inserted – it will just do nothing.
Removing a value from a set. We will allow a nonexistent item to be removed – it will just do nothing.
Check if a value is contained in a set.
Count the number of items in a set.
Print a character representation of a set to an output stream.

Code Example

Here is a small program that uses a set:

int main() {
  IntSet set;
  set.insert(7);
  set.insert(32);
  set.insert(32);

  cout << "Size: " << set.size() << endl;                       // prints 2
  set.print(cout);              // prints { 7, 32 } in some arbitrary order

  set.insert(42);
  set.remove(32);
  set.remove(32);

  cout << "Contains 32? " << set.contains(32) << endl;  // prints 0 (false)
  cout << "Contains 42? " << set.contains(42) << endl;   // prints 1 (true)
}

Public Interface

The code example leads to the following public interface:

class IntSet {
public:
  // Maximum size of a set.
  static const int MAX_SIZE = 10;

  // EFFECTS:  Initializes this set to be empty.
  IntSet();

  // REQUIRES: size() < MAX_SIZE
  // MODIFIES: *this
  // EFFECTS:  Adds value to the set, if it isn't already in the set.
  void insert(int value);

  // MODIFIES: *this
  // EFFECTS:  Removes value from the set, if it is in the set.
  void remove(int value);

  // EFFECTS:  Returns whether value is in the set.
  bool contains(int value) const;

  // EFFECTS:  Returns the number of elements.
  int size() const;

  // EFFECTS:  Prints out the set in an arbitrary order.
  void print(std::ostream &os) const;
};

The public interface includes a default constructor and member functions to insert or remove an item, check if an item is in the set, obtain its size, and print a representation of the set to a stream. The latter three functions do not modify the set, so they are declared as const, meaning that the this pointer will be a pointer to const. The compiler will then enforce that those functions do not modify any member variables.

`static` Data Members

The public interface also includes a constant, MAX_SIZE, that denotes the maximum size of a set. We define the constant inside the IntSet class. This distinguishes the constant from other constants of the same name; there may be some other ADT with a MAX_SIZE. It also makes use of encapsulation, which allows all of an ADT’s data and functionality to be defined together within a class.

The constant is declared static, which has a specific meaning when used in a class – it indicates that the given member is not associated with an object of the class, but with the class as a whole. For a member variable, this means that there is a single copy of that variable in the program, located in static storage. Without the static keyword, each object of the class would have its own copy, located in the memory for that object. The static keyword also allows a member variable to be a compile-time constant, provided it is also declared as const (or constexpr) and initialized from an expression that can be computed at compile time. [1]

A static member can be accessed by name from within the class, the same as a non-static member. It can be accessed from outside the class with the scope-resolution operator:

int main() {
  IntSet set;
  for (int i = 0; i < IntSet::MAX_SIZE; ++i) {
    set.insert(i);
  }
}

We need a compile-time constant because we will use a member variable that is a built-in array as part of our data representation. [2] In C++, a local or member variable that is an array must have a size that is known to the compiler. In the case of a local variable, the compiler needs to know how big it is so that the program can create activation records of the correct size. For a member variable, the compiler needs to know its size so that it can determine the size of the object as a whole. Pointer arithmetic (including array accesses) requires the compiler to know the size of an object at compile time, since it is in terms of whole objects rather than with respect to individual bytes.

Data Representation

We rely on existing types to represent the data of our IntSet. For the elements themselves, we use a built-in array, with MAX_SIZE as its capacity. However, our set will start out empty, after which we can add and remove elements as we please. Thus, the size of the set will generally not be the same as the size of the array, and we need a separate member variable to keep track of the number of elements in the set. We will use an int for this member.

class IntSet {
  ...
private:
  int elements[MAX_SIZE];
  int num_elements;
};

Now that we’ve determined a data representation, we need to figure out what values correspond to valid sets. We use representation invariants to specify the set of valid values.

For num_elements, a negative value does not make sense, since a set cannot have a negative size. At the same time, our sets do not have enough space to store more than MAX_SIZE elements, so that places an upper bound on num_elements. The invariants for num_elements are thus 0 <= num_elements && num_elements <= MAX_SIZE.

For elements, we need to know where in the array the set’s elements are actually stored. We will store them at the beginning of the array, so one invariant is that the first num_elements items in the array are the ones in the set. The set abstraction prohibits duplicate elements, so another invariant is that there are no duplicates among the first num_elements items in elements.

We document the representation invariants where we define our representation:

class IntSet {
  ...
private:
  int elements[MAX_SIZE];
  int num_elements;
  // INVARIANTS:
  // 0 <= num_elements && num_elements <= MAX_SIZE
  // the first num_elements items in elements are the items in the set
  // the first num_elements items in elements contain no duplicates
};

`size_t`

The size_t type represents only nonnegative integers, and the C++ standard guarantees that it is large enough to hold the size of any object. Since it does not represent negative integers, we could have used it for num_elements rather than defining 0 <= num_elements as a representation invariant. We would then also use size_t as the return type for size(), which is what the standard-library containers do.

C++ has integral types that are unsigned, which do not represent negative numbers, and size_t is just one example. The int type is signed (unless it is preceded by the unsigned keyword). Unsigned integers can lead to subtle programming errors. The following is an example that compares a signed and unsigned integer:

int main() {
  size_t s = 3;
  int i = -1;
  cout << (s < i) << endl;   // prints 1 (true)
}

Perhaps non-intuitively, the comparison s < i is true, even though s is positive but i is negative. This is because the compiler converts i to a size_t in doing the comparison, and the data that represents a negative signed integer actually represents a very large unsigned integer.

Another common error is incorrectly iterating backwards over a sequence:

int main() {
  vector<int> vec = { 1, 2, 3 };
  for (size_t i = vec.size() - 1; i >= 0; --i) {
    cout << vec[i] << endl;
  }
}

This code goes off the end of the vector, resulting in undefined behavior. The reason is that a size_t is always greater than or equal to 0, and when i is equal to 0 and decremented, its value wraps around to the largest possible size_t value. The following iteration correctly avoids this problem:

int main() {
  vector<int> vec = { 1, 2, 3 };
  for (size_t i = vec.size(); i > 0; --i) {
    cout << vec[i - 1] << endl;
  }
}

The loop iterates from vec.size() down to 1, offsetting by 1 when indexing into the vector. The following equivalent code makes use of postfix decrement to avoid the offset in the index expression:

int main() {
  vector<int> vec = { 1, 2, 3 };
  for (size_t i = vec.size(); i-- > 0;) {
    cout << vec[i] << endl;
  }
}

Given the pitfalls of unsigned integers, many C++ programmers avoid them, using signed integers instead. We too will generally stick to signed integers.

Implementation

We proceed to define the constructor and member functions of IntSet, defining them as they would appear in a source file, outside the class definition.

We require only a default constructor for IntSet. However, we cannot rely on a compiler-generated implicit one: it would default initialize num_elements to an undefined value, violating our representation invariants. Thus, we need to define our own default constructor. [3]

IntSet::IntSet() : num_elements(0) {}

We need not do anything with elements. It gets default initialized to contain undefined values, but that doesn’t violate our representation invariants; since num_elements is 0, the invariants for elements are trivially satisfied.

We proceed to define contains(). The representation invariants tell us that the valid items are the first num_elements in elements. They can be stored in any order, so we need to iterate through them to see if the value is in the set.

bool IntSet::contains(int value) const {
  for (int i = 0; i < num_elements; ++i) {
    if (elements[i] == value) {
      return true;
    }
  }
  return false;
}

When we find an item, we can return immediately without checking the rest of the elements. On the other hand, if we get through all the elements without finding the given value, it is not in the set so we return false.

For inserting into the set, we first assert the requires clause, which ensures that our invariant of num_elements <= MAX_SIZE is never violated. We also must check whether the value is in the set, since adding it again would violation the invariant of no duplicates. If the value is not in the set, the simplest way to insert it is to place it at index num_elements and increment that variable. This meets the invariant that the first num_elements items in the array are the values in the set.

void IntSet::insert(int value) {
  assert(size() < MAX_SIZE);
  if (!contains(value)) {
    elements[num_elements] = value;
    ++num_elements;   // we could use num_elements++ in the line above instead
  }
}

To remove an element, we first need to check whether the value is in the set. If so, we need to know its location in the array, which would require iterating through the elements to find it. Rather than repeating this algorithm in both contains() and remove(), we define a private helper function to do so and call it in both places.

class IntSet {
  ...
private:
  int indexOf(int value) const;
};

int IntSet::indexOf(int value) const {
  for (int i = 0; i < num_elements; ++i) {
    if (elements[i] == value) {
      return i;
    }
  }
  return -1;
}

Rather than returning whether or not the value is in the set, indexOf() returns its actual index if it is there. If not, it returns an invalid index; we use -1, since indices start at 0. We can then write contains() as follows:

bool IntSet::contains(int value) const {
  return indexOf(value) != -1;
}

We proceed to define remove():

void IntSet::remove(int value) {
  int index = indexOf(value);
  if (index != -1) {
    elements[index] = elements[num_elements - 1];
    --num_elements;
  }
}

In order to actually remove the item, we need to place another item at its location. The simplest solution that meets the representation invariants is to copy over the last item and then decrement num_elements, which ensures that the first num_elements items are the values in the set.

_images/11_unordered_set_remove.svg — Figure 46 Removing an element from an unordered set can be done by copying the last element into the position vacated by the removed element.

The remaining member functions are defined as follows:

int IntSet::size() const {
  return num_elements;
}

void IntSet::print(std::ostream &os) const {
  os << "{ ";
  for (int i = 0; i < num_elements; ++i) {
    os << elements[i] << " ";
  }
  os << "}";
}

Sorted Representation

The representation of IntSet places no restrictions on the ordering of elements in the array. This simplifies the implementation of insert() and remove(), but it requires contains() (and indexOf()) to iterate over every element in the worst case.

An alternate representation would be to require that the set elements are stored in sorted, increasing order. The member variables remain the same – it is the representation invariants that change.

class SortedIntSet {
  ...
private:
  int elements[MAX_SIZE];
  int num_elements;
  // INVARIANTS:
  // 0 <= num_elements && num_elements <= MAX_SIZE
  // the first num_elements items in elements are the items in the set
  // the first num_elements items in elements contain no duplicates
  // the first num_elements items are in sorted, increasing order
};

To insert an item, we can no longer just put it after the existing elements, since the new value may be smaller than existing values. Instead, we need to store it at its appropriate sorted position. This requires moving existing elements out of the way if necessary. Our algorithm will start after the last element and repeatedly:

Compare the item to the left with the value we are inserting.
If the item to the left is less than the new value, the new value is placed at the current position, and we are done.
If the item is greater than the new value, then the old item is moved one position to the right, and we repeat the process one position over to the left.
If no more items are to the left, we place the new value at the current position (i.e. the beginning of the array).

_images/11_ordered_set_insert.svg — Figure 47 Inserting into an ordered set requires shifting existing elements to make room for the new element at the proper location.

void SortedIntSet::insert(int value) {
  assert(size() < MAX_SIZE);
  if (!contains(value)) {
    int index = num_elements;
    while (index > 0 && elements[index - 1] > value) {
      elements[index] = elements[index - 1];
      --index;
    }
    elements[index] = value;
    ++num_elements;
  }
}

Similarly, to remove an item, we must shift the remaining elements leftward in order to maintain our representation invariants.

_images/11_ordered_set_remove.svg — Figure 48 Removing from an ordered set requires shifting elements to preserve the representation invariants.

void SortedIntSet::remove(int value) {
  int index = indexOf(value);
  if (index != -1) {
    for (; index < num_elements - 1; ++index) {
      elements[index] = elements[index + 1];
    }
    --num_elements;
  }
}

The advantage of sorting is that we don’t have to look through all the elements to determine the location of a value. Instead, we can use binary search, which eliminates half the search space in each step:

Compare the value we are looking for to the middle element among the remaining items.
If the value is equal to the middle element, we have found its index.
If the value is less than the middle element, we know it must be to the left, if it is in the set. Thus, we need only repeat the search on the items in the first half.
If the value is greater than the middle element, we know it must be to the right, if it is in the set. Thus, we need only repeat the search on the items in the second half.
If we have run out of elements to search, the value is not in the set.

The following implements this algorithm:

int SortedIntSet::indexOf(int value) const {
  int start = 0;
  int end = num_elements;
  while (start < end) {
    int middle = start / 2 + end / 2;
    if (value == elements[middle]) {
      return middle;
    } else if (value < elements[middle]) {
      end = middle;
    } else {
      start = middle + 1;
    }
  }
  return -1;
}

Since half the search space is eliminated in each step, this algorithm takes time that is logarithmic in the size of the set. We denote this as \(\textrm{O}(\log n)\), where \(n\) is the number of elements in the set.

For a set of size \(n\), the following compares the worst-case runtime of each operation on unsorted and sorted sets:

Operation	Unsorted Set	Sorted Set
`insert()`	\(\textrm{O}(n)\)	\(\textrm{O}(n)\)
`remove()`	\(\textrm{O}(n)\)	\(\textrm{O}(n)\)
`contains()`	\(\textrm{O}(n)\)	\(\textrm{O}(\log n)\)
`size()`	\(\textrm{O}(1)\)	\(\textrm{O}(1)\)
`constructor`	\(\textrm{O}(1)\)	\(\textrm{O}(1)\)

The notation \(\textrm{O}(1)\) means that the operation takes constant time: the time it takes is independent of the size of the set. The insert() and remove() operations must first check whether the item is in the set, so they can be no faster than contains(). The latter is significantly faster on a sorted set than an unsorted set, demonstrating the advantage of sorting.