Array-based Data Structures

A container is an abstract data type whose purpose is to hold objects of some other type. We have already seen two examples of containers: built-in arrays and vectors from the standard library. Both are containers that hold elements in a particular order, indexed starting from 0. A built-in array has a fixed size, while a vector can grow and shrink.

Containers use a specific data representation to keep track of their elements – the representation is called a data structure. A data structure may be based on an array, a linear linked structure, a branching structure, and numerous other formats. We begin our examination of data structures with array-based representations.

Let’s define our own container ADT, using built-in arrays as our building block. Rather than an ordered sequence, we will build a set abstraction, which represents an unordered collection of unique elements. We will start with a container that holds integers, generalizing it to arbitrary element types next time.

In designing our ADT, we will use the following process:

  1. Determine which operations our ADT will support.

  2. Write some code that uses our ADT. This will helps us figure out what the right interface should be, and it will serve as a test case. (Ideally, we would also write unit tests for each part of the public interface, following the pattern of test-driven development.)

  3. Define the ADT’s public interface, including both function signatures and documentation.

  4. Come up with a data representation. Define member variables and determine the representation invariants.

  5. Write definitions for the ADT’s constructors and member functions, making sure that they adhere to the representation invariants.

Set Operations

The following are the operations our set will support:

  • Creating an empty set.

  • Inserting a value into a set. We will allow an existing item to be inserted – it will just do nothing.

  • Removing a value from a set. We will allow a nonexistent item to be removed – it will just do nothing.

  • Check if a value is contained in a set.

  • Count the number of items in a set.

  • Print a character representation of a set to an output stream.

Code Example

Here is a small program that uses a set:

int main() {
  IntSet set;
  set.insert(7);
  set.insert(32);
  set.insert(32);

  cout << "Size: " << set.size() << endl;                       // prints 2
  set.print(cout);              // prints { 7, 32 } in some arbitrary order

  set.insert(42);
  set.remove(32);
  set.remove(32);

  cout << "Contains 32? " << set.contains(32) << endl;  // prints 0 (false)
  cout << "Contains 42? " << set.contains(42) << endl;   // prints 1 (true)
}

Public Interface

The code example leads to the following public interface:

class IntSet {
public:
  // Maximum size of a set.
  static const int MAX_SIZE = 10;

  // EFFECTS:  Initializes this set to be empty.
  IntSet();

  // REQUIRES: size() < MAX_SIZE
  // MODIFIES: *this
  // EFFECTS:  Adds value to the set, if it isn't already in the set.
  void insert(int value);

  // MODIFIES: *this
  // EFFECTS:  Removes value from the set, if it is in the set.
  void remove(int value);

  // EFFECTS:  Returns whether value is in the set.
  bool contains(int value) const;

  // EFFECTS:  Returns the number of elements.
  int size() const;

  // EFFECTS:  Prints out the set in an arbitrary order.
  void print(std::ostream &os) const;
};

The public interface includes a default constructor and member functions to insert or remove an item, check if an item is in the set, obtain its size, and print a representation of the set to a stream. The latter three functions do not modify the set, so they are declared as const, meaning that the this pointer will be a pointer to const. The compiler will then enforce that those functions do not modify any member variables.

static Data Members

The public interface also includes a constant, MAX_SIZE, that denotes the maximum size of a set. We define the constant inside the IntSet class. This distinguishes the constant from other constants of the same name; there may be some other ADT with a MAX_SIZE. It also makes use of encapsulation, which allows all of an ADT’s data and functionality to be defined together within a class.

The constant is declared static, which has a specific meaning when used in a class – it indicates that the given member is not associated with an object of the class, but with the class as a whole. For a member variable, this means that there is a single copy of that variable in the program, located in static storage. Without the static keyword, each object of the class would have its own copy, located in the memory for that object. The static keyword also allows a member variable to be a compile-time constant, provided it is also declared as const (or constexpr) and initialized from an expression that can be computed at compile time. [1]

A static member can be accessed by name from within the class, the same as a non-static member. It can be accessed from outside the class with the scope-resolution operator:

int main() {
  IntSet set;
  for (int i = 0; i < IntSet::MAX_SIZE; ++i) {
    set.insert(i);
  }
}

We need a compile-time constant because we will use a member variable that is a built-in array as part of our data representation. [2] In C++, a local or member variable that is an array must have a size that is known to the compiler. In the case of a local variable, the compiler needs to know how big it is so that the program can create activation records of the correct size. For a member variable, the compiler needs to know its size so that it can determine the size of the object as a whole. Pointer arithmetic (including array accesses) requires the compiler to know the size of an object at compile time, since it is in terms of whole objects rather than with respect to individual bytes.

Data Representation

We rely on existing types to represent the data of our IntSet. For the elements themselves, we use a built-in array, with MAX_SIZE as its capacity. However, our set will start out empty, after which we can add and remove elements as we please. Thus, the size of the set will generally not be the same as the size of the array, and we need a separate member variable to keep track of the number of elements in the set. We will use an int for this member.

class IntSet {
  ...
private:
  int elements[MAX_SIZE];
  int num_elements;
};

Now that we’ve determined a data representation, we need to figure out what values correspond to valid sets. We use representation invariants to specify the set of valid values.

For num_elements, a negative value does not make sense, since a set cannot have a negative size. At the same time, our sets do not have enough space to store more than MAX_SIZE elements, so that places an upper bound on num_elements. The invariants for num_elements are thus 0 <= num_elements && num_elements <= MAX_SIZE.

For elements, we need to know where in the array the set’s elements are actually stored. We will store them at the beginning of the array, so one invariant is that the first num_elements items in the array are the ones in the set. The set abstraction prohibits duplicate elements, so another invariant is that there are no duplicates among the first num_elements items in elements.

We document the representation invariants where we define our representation:

class IntSet {
  ...
private:
  int elements[MAX_SIZE];
  int num_elements;
  // INVARIANTS:
  // 0 <= num_elements && num_elements <= MAX_SIZE
  // the first num_elements items in elements are the items in the set
  // the first num_elements items in elements contain no duplicates
};

size_t

The size_t type represents only nonnegative integers, and the C++ standard guarantees that it is large enough to hold the size of any object. Since it does not represent negative integers, we could have used it for num_elements rather than defining 0 <= num_elements as a representation invariant. We would then also use size_t as the return type for size(), which is what the standard-library containers do.

C++ has integral types that are unsigned, which do not represent negative numbers, and size_t is just one example. The int type is signed (unless it is preceded by the unsigned keyword). Unsigned integers can lead to subtle programming errors. The following is an example that compares a signed and unsigned integer:

int main() {
  size_t s = 3;
  int i = -1;
  cout << (s < i) << endl;   // prints 1 (true)
}

Perhaps non-intuitively, the comparison s < i is true, even though s is positive but i is negative. This is because the compiler converts i to a size_t in doing the comparison, and the data that represents a negative signed integer actually represents a very large unsigned integer.

Another common error is incorrectly iterating backwards over a sequence:

int main() {
  vector<int> vec = { 1, 2, 3 };
  for (size_t i = vec.size() - 1; i >= 0; --i) {
    cout << vec[i] << endl;
  }
}

This code goes off the end of the vector, resulting in undefined behavior. The reason is that a size_t is always greater than or equal to 0, and when i is equal to 0 and decremented, its value wraps around to the largest possible size_t value. The following iteration correctly avoids this problem:

int main() {
  vector<int> vec = { 1, 2, 3 };
  for (size_t i = vec.size(); i > 0; --i) {
    cout << vec[i - 1] << endl;
  }
}

The loop iterates from vec.size() down to 1, offsetting by 1 when indexing into the vector. The following equivalent code makes use of postfix decrement to avoid the offset in the index expression:

int main() {
  vector<int> vec = { 1, 2, 3 };
  for (size_t i = vec.size(); i-- > 0;) {
    cout << vec[i] << endl;
  }
}

Given the pitfalls of unsigned integers, many C++ programmers avoid them, using signed integers instead. We too will generally stick to signed integers.

Implementation

We proceed to define the constructor and member functions of IntSet, defining them as they would appear in a source file, outside the class definition.

We require only a default constructor for IntSet. However, we cannot rely on a compiler-generated implicit one: it would default initialize num_elements to an undefined value, violating our representation invariants. Thus, we need to define our own default constructor. [3]

IntSet::IntSet() : num_elements(0) {}

We need not do anything with elements. It gets default initialized to contain undefined values, but that doesn’t violate our representation invariants; since num_elements is 0, the invariants for elements are trivially satisfied.

We proceed to define contains(). The representation invariants tell us that the valid items are the first num_elements in elements. They can be stored in any order, so we need to iterate through them to see if the value is in the set.

bool IntSet::contains(int value) const {
  for (int i = 0; i < num_elements; ++i) {
    if (elements[i] == value) {
      return true;
    }
  }
  return false;
}

When we find an item, we can return immediately without checking the rest of the elements. On the other hand, if we get through all the elements without finding the given value, it is not in the set so we return false.

For inserting into the set, we first assert the requires clause, which ensures that our invariant of num_elements <= MAX_SIZE is never violated. We also must check whether the value is in the set, since adding it again would violation the invariant of no duplicates. If the value is not in the set, the simplest way to insert it is to place it at index num_elements and increment that variable. This meets the invariant that the first num_elements items in the array are the values in the set.

void IntSet::insert(int value) {
  assert(size() < MAX_SIZE);
  if (!contains(value)) {
    elements[num_elements] = value;
    ++num_elements;   // we could use num_elements++ in the line above instead
  }
}

To remove an element, we first need to check whether the value is in the set. If so, we need to know its location in the array, which would require iterating through the elements to find it. Rather than repeating this algorithm in both contains() and remove(), we define a private helper function to do so and call it in both places.

class IntSet {
  ...
private:
  int indexOf(int value) const;
};

int IntSet::indexOf(int value) const {
  for (int i = 0; i < num_elements; ++i) {
    if (elements[i] == value) {
      return i;
    }
  }
  return -1;
}

Rather than returning whether or not the value is in the set, indexOf() returns its actual index if it is there. If not, it returns an invalid index; we use -1, since indices start at 0. We can then write contains() as follows:

bool IntSet::contains(int value) const {
  return indexOf(value) != -1;
}

We proceed to define remove():

void IntSet::remove(int value) {
  int index = indexOf(value);
  if (index != -1) {
    elements[index] = elements[num_elements - 1];
    --num_elements;
  }
}

In order to actually remove the item, we need to place another item at its location. The simplest solution that meets the representation invariants is to copy over the last item and then decrement num_elements, which ensures that the first num_elements items are the values in the set.

_images/11_unordered_set_remove.svg

Figure 49 Removing an element from an unordered set can be done by copying the last element into the position vacated by the removed element.

The remaining member functions are defined as follows:

int IntSet::size() const {
  return num_elements;
}

void IntSet::print(std::ostream &os) const {
  os << "{ ";
  for (int i = 0; i < num_elements; ++i) {
    os << elements[i] << " ";
  }
  os << "}";
}

Time Complexity

When choosing a particular container or data structure, an important consideration is how efficient each operation is when applied to that ADT. Different data structures make different tradeoffs in which operations are more efficient and which ones are less so, and the appropriate choice of data structure depends on which operations are most important to our setting. There are two resources we consider: time, or how long an operation takes, and memory, or how much space a data structure or an operation over one uses. In this text, we focus on time complexity.

In general, the amount of time an operation on a data structure takes depends on how many elements it contains. We characterize this using orders of growth, which give us an approximate measure of the time an operation takes relative to the size of the data structure or input. As an example, if \(n\) is the input size and an operation takes \(5n^2 - 3n +2\) units of time (e.g. machine instructions), we ignore the lower-order terms and constant factors classify the time complexity as \(\textrm{O}(n^2)\) (pronounced “oh of en squared” or “big-oh of en squared”). This is because the runtime is dominated by the quadratic term, and at large input sizes, the runtime is proportional to \(n^2\).

_images/12_orders_of_growth.svg

Figure 50 Comparison of constant, linear, and quadratic growth.

Figure 50 illustrates the growth of three functions: \(f_1(n) = 1\), \(f_2(n) = n\), and \(f_3(n) = n^2\). The gap between any two of these functions increasing significantly as \(n\) gets larger, whereas the gap between two functions of the same order, such as \(f(n) = n^2\) and \(g(n) = 2n^2\) differ by the same ratio regardless of the size \(n\). Thus, we consider both of the latter functions to have the same \(\textrm{O}(n^2)\) order of growth.

In addition to simplifying complexity using orders of growth, we usually care most about the efficiency of the worst-case input at any particular size. As an example, consider the contains() operation on an IntSet:

bool IntSet::contains(int value) const {
  for (int i = 0; i < num_elements; ++i) {
    if (elements[i] == value) {
      return true;
    }
  }
  return false;
}

In the best-case scenario, the value we are looking for is at the very front of the elements array, requiring only a single comparison before returning true. However, it is also possible for the value to be at the end of the array or not in the set at all, in which case we have to look through all of the elements before we determine the answer. If the set has \(n\) elements, in the worst case we would need to look at all \(n\) of them to determine if the value is in the set, resulting in an \(\textrm{O}(n)\) worst-case complexity.

Contrast this with the size() operation:

int IntSet::size() const {
  return num_elements;
}

This merely returns the value of the num_elements member variable. Its runtime is not affected by how many elements the set contains, so it is a constant-time or \(\textrm{O}(1)\) operation.

Finally, let us consider one more example, that of converting a vector potentially containing duplicates into a set. Here’s a possible implementation:

IntSet to_set(const std::vector<int> &vec) {
  IntSet result;
  for (std::size_t i = 0; i < vec.size(); ++i) {
    result.insert(vec[i]);
  }
  return result;
}

To analyze the time complexity, we add up the contributions of each piece of the algorithm. Constructing an empty set takes constant time. If the vector contains \(n\) elements, the loop does \(n\) iterations. However, each iteration does not necessarily take constant time. The worst case input is actually when the vector has no duplicates – insert() calls contains(), which takes linear time when the given value is not in the set, as we concluded previously. Thus, we have a linear number of loop iterations, each of which takes linear time in the worst case, for a total of quadratic or \(\textrm{O}(n^2)\) time. (More exactly, the total runtime for the loop is something like \(1 + 2 + 3 + \cdots + n = \frac{n(n+1)}{2}\), which still gives us \(\textrm{O}(n^2)\) order of growth.)

Now that we have a sense of the time complexity of the IntSet implementation, can we define a set abstraction that is more efficient for at least some of the operations? We can indeed, as we will see next time.