Binary Search Trees (BSTs)

Last time, we saw binary trees, which are a recursive data structure that is either empty, or consists of an element and two subtrees. A binary search tree (BST) is a binary tree whose elements are stored in an order that maintains a sorting invariant. Specifically, a binary search tree is either:

  • empty, or

  • a root datum with two subtrees such that:

    1. The two subtrees are themselves binary search trees.

    2. Every element in the left subtree is strictly less than the root datum.

    3. Every element in the right subtree is strictly greater than the root datum. [1]

Figure 85 shows an example of a binary search tree.

_images/19_binary_search_tree.svg

Figure 85 A binary search tree.

Every element in the left subtree is less that the root datum, and every element in the right subtree is greater than the root. The left and right subtrees meet the requirements for a binary search tree. Thus, the whole tree is a binary search tree.

Figure 86 depicts trees that do not meet the definition of a binary search tree.

_images/19_invalid_bsts.svg

Figure 86 Trees that violate the invariants for a binary search tree.

In the tree on the left, the left subtree contains the element 7, which is not smaller than the root element 5. This violates the second condition in the recursive definition above of a BST. In the tree on the right, the right subtree is empty, which is a valid binary search tree. However, the left subtree is not a valid BST, since it does not meet the sorting invariant for a BST. Thus, the tree on the right is not a binary search tree.

A binary search tree is thus named because searching for an element can be done efficiently, in time proportional to the height of the tree rather than the size. A search algorithm need only recurse on one side of the tree at each level. For example, in searching for the element 2 in the BST in Figure 85, the element must be in the left subtree, since 2 is less than the root element 6. Within the left subtree, it must again be to the left, since 2 is less than 5. Within the next subtree, the 2 must be to the right of the 1, leading us to the actual location of the 2.

More generally, the following algorithm determines whether or not a BST contains a particular value:

  • If the tree is empty, it does not contain the value.

  • Otherwise, if the root datum is equal to the value, the tree contains the element.

  • If the value is less than the root element, it must be in the left subtree if it is in the tree, so we repeat the search on the left subtree.

  • Otherwise, the value is greater than the root element, so it is in the right subtree if it is in the tree at all. Thus, we repeat the search on the right subtree.

The first two cases above are base cases, since they directly compute an answer. The latter two are recursive cases, and we take the recursive leap of faith that the recursive calls will compute the correct result on the subtrees.

The algorithm leads to the following implementation on our tree representation:

// REQUIRES: node represents a valid binary search tree
// EFFECTS:  Returns whether or not the given value is in the tree
//           represented by node.
bool contains(const Node *node, int value) {
  if (!node) {                             // empty tree
    return false;
  } else if (node->datum == value) {       // non-empty tree, equal to root
    return true;
  } else if (value < node->datum) {        // less than root
    return contains(node->left, value);
  } else {                                 // greater than root
    return contains(node->right, value);
  }
}

This implementation is linear recursive, since at most one recursive call is made by each invocation of contains(). Furthermore, every recursive call is a tail call, so the implementation is tail recursive. This example illustrates that the body of a linear- or tail-recursive function may contain more than one recursive call, as long as at most one of those calls is actually made.

Let us consider another algorithm, that of finding the maximum element of a BST, which requires there to be at least one element in the tree. If there is only one element, then the lone, root element is the maximum. The root is also the maximum element when the right subtree is empty; everything in the left subtree is smaller than the root, making the root the largest item. On the other hand, when the right tree is not empty, every element in that subtree is larger than the root and everything in the left subtree. Then the largest element in the whole tree is the same as the largest element in the right subtree.

Summarizing this, we have the following algorithm:

  • If the right subtree is empty, then the root is the maximum.

  • Otherwise, the maximum item is the maximum element in the right subtree.

We implement the algorithm as follows:

// REQUIRES: node represents a valid non-empty binary search tree
// EFFECTS:  Returns the maximum element in the tree represented by
//           node.
int tree_max(const Node *node) {
  if (!node->right) {                // base case
    return node->datum;
  } else {                           // recursive case
    return tree_max(node->right);
  }
}

As with contains(), tree_max() is tail recursive, and it runs in time proportional to the height of the tree.

The BinarySearchTree Interface

In the full linked-list definition we saw in Linked Lists, we defined a separate IntList class to act as the interface for the list, defining Node as a member of that class. We then generalized the element type, resulting in a List class template. We follow the same strategy here by defining a BinarySearchTree class template as the interface for a BST.

template <typename T>
class BinarySearchTree {
public:
  // EFFECTS:  Constructs an empty BST.
  BinarySearchTree();

  // EFFECTS:  Constructs a copy of the given tree.
  BinarySearchTree(const BinarySearchTree &other);

  // EFFECTS:  Replaces the contents of this tree with a copy of the
  //           given tree.
  BinarySearchTree & operator=(const BinarySearchTree &other);

  // EFFECTS:  Destructs this tree.
  ~BinarySearchTree();

  // EFFECTS:  Returns whether this tree is empty.
  bool empty() const;

  // EFFECTS:  Returns the number of elements in this tree.
  int size() const;

  // EFFECTS:  Returns whether the given item is contained in this
  //           tree.
  bool contains(const T &value) const;

  // REQUIRES: value is not in this tree
  // EFFECTS:  Inserts the given item into this tree.
  void insert(const T &value);

private:
  // Represents a single node of a tree.
  struct Node {
    T datum;
    Node *left;
    Node *right;
    // INVARIANTS: left and right are either null or pointers to
    //             valid Nodes
  };

  // The root node of this tree.
  Node *root;
  // INVARIANTS: root is either null or a pointer to a valid Node
};

As with a list, we define Node as a nested class of BinarySearchTree to encapsulate it within the latter ADT. Since it is an implementation detail and not part of the BST interface, we define Node as a private member.

The contains() member function differs from the one we defined before; the member function just takes in a data item, while our previous definition operates on a node as well. We define the latter as a private helper function and call it with the root node:

template <typename T>
class BinarySearchTree {
  ...
public:
  bool contains(const T &value) const {
    return contains_impl(root, value);
  }

private:
  bool contains_impl(const Node *node, const T &value) {
    if (!node) {
      return false;
    } else if (node->datum == value) {
      return true;
    } else if (value < node->datum) {
      return contains_impl(node->left, value);
    } else {
      return contains_impl(node->right, value);
    }
  }

  Node *root;
};

Observe that contains_impl() does not refer to a BinarySearchTree or any of its members. Thus, there is no need for it to have a this pointer to a BinarySearchTree object. We can declare the function as a static member function to eliminate the this pointer.

template <typename T>
class BinarySearchTree {
  ...
public:
  bool contains(const T &value) const {
    return contains_impl(root, value);
  }

private:
  static bool contains_impl(const Node *node, const T &value) {
    if (!node) {
      return false;
    } else if (node->datum == value) {
      return true;
    } else if (value < node->datum) {
      return contains_impl(node->left, value);
    } else {
      return contains_impl(node->right, value);
    }
  }

  Node *root;
};

Just like a static member variable is associated with a class rather than an individual object, a static member function is also not associated with an individual object, and it cannot refer to non-static member variables.

A public static member function can be called from outside the class using the scope-resolution operator, the same syntax for referring to a static member variable:

BinarySearchTree<int>::contains_impl(nullptr, -1);
// compile error because contains_impl() is not public

BST-Based Set

Previously, we have seen array-based set implementations, one that used an unsorted array and another a sorted array. We can also implement a set using a binary search tree to store the data:

template <typename T>
class BSTSet {
public:
  // EFFECTS: Inserts the given value into this set, if it is not
  //          already in the set.
  void insert(const T &value) {
    if (!elts.contains(value)) {
      elts.insert(value);
    }
  }

  // EFFECTS: Returns whether value is in this set.
  bool contains(const T &value) const {
    return elts.contains(value);
  }

  // EFFECTS: Returns the size of this set.
  int size() const {
    return elts.size();
  }

private:
  BinarySearchTree<T> elts;
};

If the underlying BST is balanced, meaning that each subtree within the BST has close to the same number of elements in its left and right child, then the height of the tree is in \(\textrm{O}(\log n)\), where \(n\) is the size of the tree. Thus, the contains() and insert() operations take logarithmic time, rather than the linear time they would take on an unsorted set.

Unfortunately, our BST implementation does not guarantee that it will be balanced. In fact, inserting items in increasing order results in a maximally unbalanced tree as in Figure 87, resembling a list rather than a tree structure.

_images/19_stick_tree.svg

Figure 87 Inserting elements into a non-balancing binary search tree in order results in a linear structure, with height equal to the number of elements.

There are more complicated binary-search-tree implementations that do guarantee balance, but they are beyond the scope of this course.

Maps and Pairs

A map is a data structure that maps keys to values. It is thus an associative data structure, since it associates a value with each key. As an example, we may want to associate exam scores with individual test takers:

aliceywu

100

akamil

23

taligoro

100

jjuett

73

We can store these data in a standard-library map:

std::map<std::string, int> scores;
scores["aliceywu"] = 100;
scores["akamil"] = 23;
scores["taligoro"] = 100;
scores["jjuett"] = 73;
cout << scores["akamil"] << endl;    // prints 23
cout << scores["aliceywu"] << endl;  // prints 100

The map class template has two required template parameters, the first for the key type and the second for the value type. We are mapping string IDs to integer scores, so we use a map<string, int>. We can then use the subscript operator with a key to insert a key-value pair into the map, or to look up the value associated with a key.

A key feature of the standard-library map is that it is not erroneous to look up a non-existent key. Instead, the map inserts and returns a value-initialized datum for the key. For primitive types, value initialization produces the zero value for the type. (This is distinct from default initialization, which produces an undefined value.)

cout << scores["gmatute"] << endl;  // prints 0

A map is similar to a set in that the keys in the map are unique; each key is associated with at most one value. However, the values themselves are not unique. The map above has two keys that are associated with the value 100.

Given the similarity with a set, we can also implement a map using a binary search tree. However, rather than using the key-value pair for ordering and uniqueness, we need to use just the key, and the value merely tags along for the ride, as shown in Figure 88.

_images/21_map_representation.svg

Figure 88 A map represented as a binary search tree.

To make this work, we need a heterogeneous data type for the datum member of a node, so that it can store separate key and value items. We can define our own struct or class, or we can use the pair template in the standard <utility> library.

std::pair<int, bool> p1;   // default constructor value initializes both items
p1.first -= 5;                                       // first now has value -5
p1.second = false;

std::pair<string, int> p2 = { "hello", 4 };         // explicit initialization
cout << p2.first << ": " << p2.second << endl;              // prints hello: 4

A pair stores a first and second item, which can be of different types. The pair template is parameterized by the types of these items.

We can use a pair and a BST to implement a map:

template <typename Key_type, typename Value_type>
class Map {
public:
  // EFFECTS: Returns whether this map is empty.
  bool empty() const;

  // EFFECTS: Returns the number of key-value pairs in this map.
  size_t size() const;

  // EFFECTS: Returns the value associated with the given key. If
  //          the key is not in the map, inserts the key,
  //          associating it with a value-initialized object of type
  //          Value_type, and returns the newly inserted value.
  Value_type& operator[](const Key_type& key);

private:
  // Type alias for a key-value pair.
  using Pair_type = std::pair<Key_type, Value_type>;

  // Comparator that compares pairs by their key components.
  class PairLess {
  public:
    bool operator()(...) const;
  };

  // Data representation.
  BinarySearchTree<Pair_type, PairLess> entries;
};

For this to work, we need a BinarySearchTree that can take a custom comparator, to compare key-value pairs by just the value component. This comparator can be defaulted to std::less, which compares elements according to the < operator:

template <typename T, typename Compare=std::less<T>>
class BinarySearchTree {
  ...
};

All comparisons within the tree now must use an object of the Compare type. We leave the details as an exercise for the reader.