p5-ml

EECS 280 Project 5: Machine Learning

Due Monday June 20th, 2022 8:00pm

Introduction

In this project, you will write a program that uses natural language processing and machine learning techniques to automatically identify the subject of posts from the EECS 280 Piazza. You will gain experience with recursion, binary trees, templates, comparators, and the map data structure. Another goal is to prepare you for future courses (like EECS 281) or your own independent programming projects, so we have given you a lot of freedom to design the structure of your overall application.

This project will be autograded for correctness, comprehensiveness of your test cases, and programming style. See the style checking tutorial for the criteria and how to check your style automatically on CAEN.

You may work alone or with a partner. Please see the syllabus for partnership rules.

Project Roadmap

  1. Set up your IDE

    See the Setup section to get your visual debugger configured with the starter files.

  2. Read the Project Introduction and Project Essentials

    See the first sections below for an introduction to the project as well as essential instructions for successfully completing the project.

  3. Test and implement the BinarySearchTree data structure

    We’ve provided header files with comments. Test and implement those functions. Be sure to use recursion and tail recursion where the comments require it.

  4. Test and implement the Map data structure

    Implement and test a Map ADT that internally uses your BinarySearchTree to provide an interface that works (almost) exactly like std::map from the STL! Appendix A has an example.

  5. Test and implement the Piazza Classifier Application

    This specification describes the interface for the overall application, but it’s up to you how to separate it into functions and data structures.

    Appendix B has tips and tricks for this part.

  6. Submit to the Autograder

    Submit the following files to the autograder.

    • BinarySearchTree.h
    • Map.h
    • main.cpp
    • BinarySearchTree_tests.cpp

Suggested Project Timeline

Project Introduction

The goal for this project is to write an intelligent program that can classify Piazza posts according to topic. This task is easy for humans - we simply read and understand the content of the post, and the topic is intuitively clear. But how do we compose an algorithm to do the same? We can’t just tell the computer to “look at it” and understand. This is typical of problems in artificial intelligence and natural language processing.

We know this is about Euchre, but how can we write an algorithm that “knows” that?

With a bit of introspection, we might realize each individual word is a bit of evidence for the topic about which the post was written. Seeing a word like “card” or “spades” leads us toward the Euchre project. We judge a potential label for a post based on how likely it is given all the evidence. Along these lines, information about how common each word is for each topic essentially constitutes our classification algorithm.

But we don’t have that information (i.e. that algorithm). You could try to sit down and write out a list of common words for each project, but there’s no way you’ll get them all. For example, the word “lecture” appears much more frequently in posts about exam preparation. This makes sense, but we probably wouldn’t come up with it on our own. And what if the projects change? We don’t want to have to put in all that work again.

Instead, let’s write a program to comb through Piazza posts from previous terms (which are already tagged according to topic) and learn which words go with which topics. Essentially, the result of our program is an algorithm! This approach is called (supervised) machine learning. Once we’ve trained the classifier on some set of Piazza posts, we can apply it to new ones written in the future.

Authors

This project was developed for EECS 280, Fall 2016 at the University of Michigan. Andrew DeOrio and James Juett wrote the original project and specification. Amir Kamil contributed to code structure, style, and implementation details.

Setup

We’ll remind you which parts of the EECS 280 Setup Mega Tutorial you need to look back at to get started on this project.

Group registration

Please register your partnership (or working alone) on the Autograder.

Project folder

Create a folder for this project (instructions). Your folder location might be different.

$ pwd
/Users/awdeorio/src/eecs280/p5-ml

Version control

Set up version control using the Version control tutorial. If you’ve used version control before on your computer, you’ll probably want to start with the Create a local repository section.

Be sure to check out the Version control for a team tutorial.

After you’re done, you should have a local repository with a “clean” status and your local repository should be connected to a remote GitLab repository.

$ pwd
/Users/awdeorio/src/eecs280/p5-ml
$ git status
On branch main
Your branch is up-to-date with 'origin/main'.

nothing to commit, working tree clean
$ git remote -v
origin	https://gitlab.eecs.umich.edu/awdeorio/p5-ml.git (fetch)
origin	https://gitlab.eecs.umich.edu/awdeorio/p5-ml.git (push)

You should have a .gitignore file (instructions).

$ pwd
/Users/awdeorio/src/eecs280/p5-ml
$ head .gitignore
# This is a sample .gitignore file that's useful for C++ projects.
...

Starter files

Download the starter files and unpack them. These instructions are adapted from the Setup Tutorial Starter files section.

$ pwd
/Users/awdeorio/src/eecs280/p5-ml
$ wget https://eecs280staff.github.io/p5-ml/starter-files.tar.gz
$ tar -xvzf starter-files.tar.gz
$ ls
starter-files  starter-files.tar.gz

Move the starter files from starter-files/ to your present working directory (.) and clean up the old directory and tarball.

$ mv starter-files/* .
$ rm -rf starter-files/ starter-files.tar.gz
$ ls
BinarySearchTree.h.starter          projects_exam.out.correct
BinarySearchTree_compile_check.cpp  sp16_projects_exam.csv
BinarySearchTree_public_test.cpp    test_small.csv
BinarySearchTree_tests.cpp.starter  test_small.out.correct
Makefile                            test_small_debug.out.correct
Map.h.starter                       train_small.csv
Map_compile_check.cpp               unit_test_framework.h
Map_public_test.cpp                 w14-f15_instructor_student.csv
TreePrint.h                         w16_instructor_student.csv
csvstream.h                         w16_projects_exam.csv
instructor_student.out.correct

Rename each .starter file to a .h or .cpp file.

$ mv BinarySearchTree.h.starter BinarySearchTree.h
$ mv BinarySearchTree_tests.cpp.starter BinarySearchTree_tests.cpp
$ mv Map.h.starter Map.h

Create these new files.

$ touch main.cpp

Your project directory should look like this.

$ ls
BinarySearchTree.h                  projects_exam.out.correct
BinarySearchTree_compile_check.cpp  sp16_projects_exam.csv
BinarySearchTree_public_test.cpp    test_small.csv
BinarySearchTree_tests.cpp.starter  test_small.out.correct
Makefile                            test_small_debug.out.correct
Map.h                               train_small.csv
Map_compile_check.cpp               unit_test_framework.h
Map_public_test.cpp                 w14-f15_instructor_student.csv
TreePrint.h                         w16_instructor_student.csv
csvstream.h                         w16_projects_exam.csv
instructor_student.out.correct

You can find a complete description of the starter files later in this spec.

Create a project in your IDE

Create a project in your IDE (visual debugger). Later, we’ll compile and debug.

VS Code

If you’ve already used VS Code on your computer before, all you need to do is start VS Code from your project directory. Otherwise, start the tutorial at the beginning. Stop after you’ve completed the Install the C/C++ extension section.

$ pwd
/Users/awdeorio/src/eecs280/p5-ml
$ code .

Visual Studio

If you’ve already used Visual Studio on your computer before, start the tutorial at the Create a project section, otherwise start at the beginning. Stop after you’ve completed the Add existing files section.

Xcode

If you’ve already used Xcode on your computer before, start the tutorial at the Create a project section, otherwise start at the beginning. Stop after you’ve completed the Add existing files section.

Compile

Next, we’ll get all our code to compile by adding function stubs.

Binary Search Tree

We’ve already included function stubs in BinarySearchTree.h. You should be able to compile and run the BinarySearchTree unit tests. The public tests will fail until you implement the functions. The test you write (BinarySearchTree_tests.cpp) will pass because the starter file only contains ASSERT_TRUE(true).

$ make BinarySearchTree_compile_check.exe
c++ -Wall -Werror -pedantic --std=c++11 -g BinarySearchTree_compile_check.cpp -o BinarySearchTree_compile_check.exe
$ make BinarySearchTree_public_test.exe
c++ -Wall -Werror -pedantic --std=c++11 -g BinarySearchTree_public_test.cpp -o BinarySearchTree_public_test.exe
$ ./BinarySearchTree_public_test.exe
Running test: bst_public_test
Assertion failed: (false), function find_impl, file ./BinarySearchTree.h, line 380.
$ make BinarySearchTree_tests.exe
c++ -Wall -Werror -pedantic --std=c++11 -g BinarySearchTree_tests.cpp -o BinarySearchTree_tests.exe
$ ./BinarySearchTree_tests.exe
Running test: test_stub
PASS

*** Results ***
** Test case "test_stub": PASS
*** Summary ***
Out of 1 tests run:
0 failure(s), 0 error(s)

Map

Edit Map.h, adding a function stub for every function prototype at the top of Map.h. Recall the example of function stubs in Project 1’s stats.cpp.

Pro-tip: Start by copy-pasting each function prototype from the class declaration at the top of Map.h to the bottom of the same file.

Here are a few examples to get you started. We’re using K, V, and C as shorthands for Key_type, Value_type, and Key_compare.

template <typename K, typename V, typename C>
bool Map<K, V, C>::empty() const {
  assert(false);
}

template <typename K, typename V, typename C>
typename Map<K, V, C>::Iterator Map<K, V, C>::find(const K& k) const {
  assert(false);
}

template <typename K, typename V, typename C>
V& Map<K, V, C>::operator[](const K& k) {
  assert(false);
}

template <typename K, typename V, typename C>
std::pair<typename Map<K, V, C>::Iterator, bool> Map<K, V, C>::insert(const Pair_type &val) {
  assert(false);
}

Now you should be able to compile and run the Map unit tests. The public tests will fail until you implement the functions.

$ make Map_compile_check.exe
c++ -Wall -Werror -pedantic --std=c++11 -g Map_compile_check.cpp -o Map_compile_check.exe
$ make Map_public_test.exe
c++ -Wall -Werror -pedantic --std=c++11 -g Map_public_test.cpp -o Map_public_test.exe
$ ./Map_public_test.exe
Running test: map_public_test
Assertion failed: (false), function operator[], file ./Map.h, line 146.

Main

Edit main.cpp to print a message, similar to Project 1’s main.cpp.

Make sure that the Project UID is in the comments at the top.

// Project UID db1f506d06d84ab787baf250c265e24e

You should be able to compile and run your main function.

$ make main.exe
c++ -Wall -Werror -pedantic --std=c++11 -g main.cpp -o main.exe
$ ./main.exe
hello from main!

Everything

You can check if everything compiles all at once by running make -j, which compiles everything in parallel. The tests will fail until you’ve implemented the functions.

$ make clean
rm -vrf *.o *.exe *.gch *.dSYM *.stackdump *.out.txt
$ make -j
c++ -Wall -Werror -pedantic --std=c++11 -g BinarySearchTree_compile_check.cpp -o BinarySearchTree_compile_check.exe
c++ -Wall -Werror -pedantic --std=c++11 -g BinarySearchTree_tests.cpp -o BinarySearchTree_tests.exe
c++ -Wall -Werror -pedantic --std=c++11 -g BinarySearchTree_public_test.cpp -o BinarySearchTree_public_test.exe
c++ -Wall -Werror -pedantic --std=c++11 -g Map_compile_check.cpp -o Map_compile_check.exe
c++ -Wall -Werror -pedantic --std=c++11 -g Map_public_test.cpp -o Map_public_test.exe
c++ -Wall -Werror -pedantic --std=c++11 -g main.cpp -o main.exe
./BinarySearchTree_tests.exe
Running test: test_stub
PASS

*** Results ***
** Test case "test_stub": PASS
*** Summary ***
Out of 1 tests run:
0 failure(s), 0 error(s)
./BinarySearchTree_public_test.exe
Running test: bst_public_test
Assertion failed: (false), function find_impl, file ./BinarySearchTree.h, line 380.
make: *** [test] Abort trap: 6

Commit and submit

Now that your code compiles with function stubs, it’s a great time to make a Git commit and make your first autograder submission.

Submit the code you have to the autograder. It’s best to find out early if there’s a mismatch between the autograder and your local development environment.

Make a git commit.

$ git status
$ git add .
$ git commit -m "Starter files with function stubs"
$ git push

Debug

Configure your visual debugger to compile and run the Binary Search Tree public tests.

VS Code

Follow the VS Code Run instructions and configure it to run BinarySearchTree_public_test.exe. Don’t forget to change stats_tests.exe to BinarySearchTree_public_test.exe in launch.json.

Finally, follow the Debug instructions to run your program with a breakpoint. During unit testing, add a breakpoint to the first line of one test.

Visual Studio

First, make sure the correct files are included in the build. You can figure out the right files by compiling at the command line. In this example, you can see that we only need one file, BinarySearchTree_public_test.cpp. All other files should be excluded from the build.

$ make clean
$ make BinarySearchTree_public_test.exe
c++ -Wall -Werror -pedantic --std=c++11 -g BinarySearchTree_public_test.cpp -o BinarySearchTree_public_test.exe

Now you should be able to run the Binary Search Tree tests from inside Visual Studio (instructions).

Finally, follow the Debug instructions to run your program with a breakpoint. During unit testing, add a breakpoint to the first line of one test.

Xcode

First, make sure the correct files are included in the Xcode compile sources. You can figure out the right files by compiling at the command line. In this example, you can see that we only need one file, BinarySearchTree_public_test.cpp. All other files should be excluded from the build.

$ make clean
$ make BinarySearchTree_public_test.exe
c++ -Wall -Werror -pedantic --std=c++11 -g BinarySearchTree_public_test.cpp -o BinarySearchTree_public_test.exe

Next, change the settings for running your program (instructions).

Finally, follow the Debug instructions to run your program with a breakpoint. During unit testing, add a breakpoint to the first line of one test.

Configure Address Sanitizer

The Address Sanitizer is very good at finding memory errors, including going off the end of an array or vector or making a mistake with a pointer. Use the Address Sanitizer Quick Start.

Project Essentials

The project consists of three main phases:

  1. Implement and test the static _impl member functions in BinarySearchTree.
  2. Implement and test Map by using the has-a pattern on top of BinarySearchTree.
  3. Design, implement, and test the top-level classifier application.

The focus of part 1 is on working with recursive data structures and algorithms. The framework and some of the implementation for BinarySearchTree is provided for you, but you must implement the core functionality in several static member functions. Be mindful of requirements for which implementations must use certain kinds of recursion.

Part 2 should not require a lot of additional implementation code. Make sure to reuse the functionality already present in BinarySearchTree wherever possible.

For your top-level application, you must use std::map in place of Map. This means a bug in parts 1 or 2 will not jeopardize your ability to complete part 3. Additionally, the implementation of BinarySearchTree (and consequently Map) we have you write will not be fast enough for the classifier.

Requirements and Restrictions

DO DO NOT
Put all top-level application code in main.cpp. Create additional files other than main.cpp.
Create any ADTs or functions you wish for your top-level classifier application. Modify the BinarySearchTree or Map public interfaces
Use any part of the STL for your top level classifier application, including map and set. Use STL containers in your implementation of BinarySearchTree or Map.
Use any part of the STL except for containers in your BinarySearchTree and Map implementations. Use your Map implementation for the top level application. It will be too slow.
Use recursion for the BST _impl functions. Use iteration for the BST _impl functions.
Follow course style guidelines. Use non-const static or global variables.
Check for undefined behavior using address sanitizer and other tools “It runs fine on my machine!”

Starter Code

The following table describes each file included in the starter code. As you begin development, rename files to remove .starter.

Filename Description
BinarySearchTree.h.starter Defines an ADT for a binary search tree.
BinarySearchTree_tests.cpp.starter Add your BST tests to this file.
BinarySearchTree_public_test.cpp A public test for BinarySearchTree
BinarySearchTree_compile_check.cpp A compilation test for BinarySearchTree.h
TreePrint.h Auxiliary file to support printing trees. You do not need to look at this file. Do not modify it.
Map.h.starter Map ADT
Map_public_test.cpp A sample test for Map. You are encouraged to write map tests, but do not submit them.
Map_compile_check.cpp A compilation test for Map.h.
Piazza Datasets (Four .csv files) Piazza post data from several past EECS 280 terms in Comma Separated Value (CSV) format.
csvstream.h A library for reading data in CSV format.
train_small.csv
test_small.csv
test_small.out.correct
test_small_debug.out.correct
Sample input training and testing files for the classifier application, as well as the corresponding correct output when run with those files.
sp16_projects_exam.csv
w14-f15_instructor_student.csv
w16_instructor_student.csv
w16_projects_exam.csv
instructor_student.out.correct
projects_exam.out.correct
Actual Piazza data from past terms, as well as the corresponding correct output when run with those files.
Makefile Used by the make command to compile the executable.
unit_test_framework.h The unit test framework you must use to write your test cases.

The BinarySearchTree ADT

A binary search tree supports efficiently storing and searching for elements.

Template Parameters

BinarySearchTree has two template parameters:

No Duplicates Invariant

In the context of this project, duplicate values are NOT allowed in a BST. This does not need to be the case, but it avoids some distracting complications.

Sorting Invariant

A binary search tree is special in that the structure of the tree corresponds to a sorted ordering of elements and allows efficient searches (i.e. in logarithmic time).

Every node in a well-formed binary search tree must obey this sorting invariant:

- OR -

Put briefly, go left and you’ll find smaller elements. Go right and you’ll find bigger ones. For example, the following are all well-formed sorted binary trees:

flowchart TB
  %% Binary tree 2
  subgraph tree2["Valid"]
    direction TB
    tree2_1((1)) --> tree2_1_L((" ")) & tree2_2((2))
    tree2_2 --> tree2_2_L((" ")) & tree2_4((4))
    tree2_4 --> tree2_4_L((" ")) & tree2_4_R((" "))
  end

  %% Binary tree 1
  subgraph tree1["Valid"]
    direction TB
    tree1_4((1)) --> tree1_2((2)) & tree1_6((6))
    tree1_2 --> tree1_1((1)) & tree1_3((3))
    tree1_1 --> tree1_1_L((" ")) & tree1_1_R((" "))
    tree1_3 --> tree1_3_L((" ")) & tree1_3_R((" "))

    tree1_6 --> tree1_5((5)) & tree1_7((7))
    tree1_5 --> tree1_5_L((" ")) & tree1_5_R((" "))
    tree1_7 --> tree1_7_L((" ")) & tree1_7_R((" "))
  end

While the following are not:

flowchart TB
  %% Binary tree 4
  subgraph tree4["Invalid"]
    direction TB
    tree4_3((3)) --> tree4_2((2)) & tree4_7((7))
    tree4_2 --> tree4_1((1)) & tree4_5((5))
    tree4_1 --> tree4_1_L((" ")) & tree4_1_R((" "))
    tree4_5 --> tree4_5_L((" ")) & tree4_5_R((" "))
  end

  %% Binary tree 3
  subgraph tree3["Invalid"]
    direction TB
    tree3_4((4)) --> tree3_3((3)) & tree3_6((6))
    tree3_3 --> tree3_2((2)) & tree3_1((1))
    tree3_2 --> tree3_2_L((" ")) & tree3_2_R((" "))
    tree3_1 --> tree3_1_L((" ")) & tree3_1_R((" "))

    tree3_6 --> tree3_7((7))
    tree3_7 --> tree3_7_L((" ")) & tree3_7_R((" "))
  end

  %% Binary tree 2
  subgraph tree2["Invalid"]
    direction TB
    tree2_1((1)) --> tree2_2((2)) & tree2_3((3))
    tree2_2 --> tree2_2_L((" ")) & tree2_2_R((" "))
    tree2_3 --> tree2_3_L((" ")) & tree2_3_R((" "))
  end

  %% Binary tree 1
  subgraph tree1["Invalid"]
    direction TB
    tree1_1((1)) --> tree1_2((2)) & tree1_1_R((" "))
    tree1_2 --> tree1_2_L((" ")) & tree1_2_R((" "))
  end

ProTip: When writing tests for check_sorting_invariant(), you can use an iterator to break the invariant. For example:

BinarySearchTree<int> b;
b.insert(1);
b.insert(0);
// change first datum to 2, resulting in the first broken tree above
*b.begin() = 2;
ASSERT_FALSE(b.check_sorting_invariant());

Data Representation

The data representation for BinarySearchTree is a tree-like structure of nodes similar to that described in lecture. Each Node contains an element and pointers to left and right subtrees. The structure is self-similar. A null pointer indicates an empty tree. You must use this data representation. Do not add member variables to BinarySearchTree or Node.

Public Member Functions and Iterator Interface

The public member functions and iterator interface for BinarySearchTree are already implemented in the starter code. DO NOT modify the code for any of these functions. They delegate the work to private, static implementation functions, which you will write.

Implementation Functions

The core of the implementation for BinarySearchTree is a collection of private, static member functions that operate on tree-like structures of nodes. You are responsible for writing the implementation of several of these functions.

To disambiguate these implementation functions from the public interface functions, we have used names ending with _impl. (This is not strictly necessary, because the compiler can differentiate them based on the Node* parameter.)

There are a few keys to thinking about the implementation of these functions:

We’ve structured the starter code so that the first bullet point above is actually enforced by the language. Because they are static member functions, they do not have access to a receiver object (i.e. there’s no this pointer). That means it’s actually impossible for these functions to try to do something bad with the BinarySearchTree object (e.g. trying to access the root member variable).

Instead, the implementation functions are called from the regular member functions to perform specific operations on the underlying nodes and tree structure, and are passed only a pointer to the root Node of the tree/subtree they should work with.

The empty_impl function must run in constant time. It must must be able to determine and return its result immediately, without using either iteration or recursion. The rest of the implementation functions must be recursive. There are additional requirements on the kind of recursion that must be used for some functions. See comments in the starter code for details. Iteration (i.e. using loops) is not allowed in any of the _impl functions.

Using the Comparator

The _impl functions that need to compare data take in a comparator parameter called less. Make sure to use less rather than the < operator to compare elements!

The insert_impl Function

The key to properly maintaining the sorting invariant lies in the implementation of the insert_impl function - this is essentially where the tree is built, and this function will make or break the whole ADT. Your insert_impl function should follow this procedure:

  1. Handle an originally empty tree as a special case.
  2. Insert the element into the appropriate place in the tree, keeping in mind the sorting invariant. You’ll need to compare elements for this, and to do so make sure to use the less comparator passed in as a parameter.
  3. Use the recursive leap of faith and call insert_impl itself on the left or right subtree. Hint: You do need to use the return value of the recursive call. (Why?)

Important: When recursively inserting an item into the left or right subtree, be sure to replace the old left or right pointer of the current node with the result from the recursive call. This is essential, because in some cases the old tree structure (i.e. the nodes pointed to by the old left or right pointer) is not reused. Specifically, if the subtree is empty, the only way to get the current node to “know” about the newly allocated node is to use the pointer returned from the recursive call.

Technicality: In some cases, the tree structure may become unbalanced (i.e. too many nodes on one side of the tree, causing it to be much deeper than necessary) and prevent efficient operation for large trees. You don’t have to worry about this.

Testing BinarySearchTree

You must write and submit tests for the BinarySearchTree class. Your test cases MUST use the unit test framework, otherwise the autograder will not be able to evaluate them. Since unit tests should be small and run quickly, you are limited to 50 TEST() items per file, and your whole test suite must finish running in less than 5 seconds. Please bear in mind that you DO NOT need 50 unit tests to catch all the bugs. Writing targeted test cases and avoiding redundant tests can help catch more bugs in fewer tests.

Pro-tip: When writing tests for functions that return a size_t (which is an unsigned integer type), compare against an unsigned literal. For example:

BinarySearchTree<int> b;
ASSERT_EQUAL(b.height(), 0u);

How We Grade Your Tests

We will autograde your BinarySearchTree unit tests by running them against a number of implementations of the module. If a test of yours fails for one of those implementations, that is considered a report of a bug in that implementation.

We grade your tests by the following procedure:

  1. We compile and run your test cases with a correct solution. Test cases that pass are considered valid. Tests that fail (i.e. falsely report a bug in the solution) are invalid. The autograder gives you feedback about which test cases are valid/invalid. Since unit tests should be small and run quickly, your whole test suite must finish running in less than 5 seconds.
  2. We have a set of intentionally incorrect implementations that contain bugs. You get points for each of these “buggy” implementations that your valid tests can catch.
  3. How do you catch the bugs? We compile and run all of your valid test cases against each buggy implementation. If any of these test cases fail (i.e. report a bug), we consider that you have caught the bug and you earn the points for that bug.

The Map ADT

The Map ADT works just like std::map. Map has three template parameters for the types of keys and values, as well as a customizable comparator type for comparing keys. The most important functions are find, insert, and the [] operator. The RMEs and comments in Map.h provide the details, and appendix A includes an example.

Note: Although you must implement Map, use std::map instead in your top-level application. Our implementation of Map is not fast enough for the classifier.

Building on the BST

The operation of a map is quite similar to that of a BST. The additional consideration for a map is that we want to store key-value pairs instead of single elements, but also have any comparisons (e.g. for searching) only depend on the key and be able to freely change the stored values without messing up the BST sorting invariant. We can employ the has-a pattern using a BinarySearchTree as the data representation for Map:

Finally, we can even reuse the iterators from the BST class, since the interface we want (based on std::map) calls for iterators to yield a key-value pair when dereferenced. Since the element type T of the BST is our Pair_type, BST iterators will yield pairs and will work just fine. We’ve provided this using declaration with the starter code to make Map::Iterator simply an alias for iterators from the corresponding BST:

using Iterator = typename BinarySearchTree<Pair_type, PairComp>::Iterator;

Testing Map

You are encouraged to write tests for the Map ADT, but they are not required for the project submission. Do not submit them to the autograder.

The Piazza Datasets

For this project, we retrieved archived Piazza posts from EECS 280 in past terms. We will focus on two different ways to divide Piazza posts into labels (i.e. categories).

The Piazza datasets are Comma Separated Value (CSV) files. The label for each post is found in the “tag” column, and the content in the “content” column. There may be other columns in the CSV file; your code should ignore all but the “tag” and “content” columns. You may assume all Piazza files are formatted correctly, and that post content and labels only contain lowercase characters, numbers, and no punctuation. You must use the csvstream.h library (see https://github.com/awdeorio/csvstream for documentation) to read CSV files in your application. The csvstream.h file itself is included with the starter code.

Your classifier should not hardcode any labels. Instead, it should use the exact set of labels that appear in the training data.

Appendix B contains code for splitting a string of content into a set of individual words.

We have included several Piazza datasets with the project:

For the real datasets, we have indicated which are intended for training vs. testing.

Classifying Piazza Posts with NLP and ML

At a high level, the classifier we’ll implement works by assuming a probabilistic model of how Piazza posts are composed, and then finding which label (e.g. our categories of “euchre”, “exam”, etc.) is the most probable source of a particular post.

All the details of natural language processing (NLP) and machine learning (ML) techniques you need to implement the project are described here. You are welcome to consult other resources, but there are many kinds of classifiers that have subtle differences. The classifier we describe here is a simplified version of a “Multi-Variate Bernoulli Naive Bayes Classifier”. If you find other resources, but you’re not sure they apply, make sure to check them against this specification.

This document provides a more complete description of the way the classifier works, in case you’re interested in the math behind the formulas here.

The Bag of Words Model

We will treat a Piazza post as a “bag of words” - each post is simply characterized by which words it includes. The ordering of words is ignored, as are multiple occurrences of the same word. These two posts would be considered equivalent:

Thus, we could imagine the post-generation process as a person sitting down and going through every possible word and deciding which to toss into a bag.

Background: Conditional Probabilities and Notation

We write \(P(A)\) to denote the probability (a number between 0 and 1) that some event \(A\) will occur. \(P(A \mid B)\) denotes the probability that event \(A\) will occur given that we already know event \(B\) has occurred. For example, \(P(``bower" \mid ``euchre") \approx 0.007\). This means that if a Piazza post is about the euchre project, there is a 0.7% chance it will contain the word bower (we should say “at least once”, technically, because of the bag of words model).

Training the Classifier

Before the classifier can make predictions, it needs to be trained on a set of previously labeled Piazza posts (e.g. train_small.csv or w16_projects_exam.csv). Your application should process each post in the training set, and record the following information:

Predicting a Label for a New Post

Given a new Piazza post \(X\), we must determine the most probable label \(C\), based on what the classifier has learned from the training set. A measure of the likelihood of \(C\) is the log-probability score given the post:

\[\ln P(C) + \ln P(w_1 \mid C) + \ln P(w_2 \mid C) + \cdots + \ln P(w_n \mid C)\]

Important: Because we’re using the bag-of-words model, the words \(w_1, w_2, \ldots, w_n\) in this formula are only the unique words in the post, not including duplicates! To ensure consistent results, make sure to add the contributions from each word in alphabetic order.

The classifier should predict whichever label has the highest log-probability score for the post. If multiple labels are tied, predict whichever comes first alphabetically.

\(\ln P(C)\) is the log-prior probability of label \(C\) and is a reflection of how common it is:

\[\ln P(C) = \ln \left( \frac{\text{number of training posts with label } C}{\text{number of training posts}} \right)\]

\(\ln P(w \mid C)\) is the log-likelihood of a word \(w\) given a label \(C\), which is a measure of how likely it is to see word \(w\) in posts with label \(C\). The regular formula for \(\ln P(w \mid C)\) is:

\[\ln P(w \mid C) = \ln \left( \frac{\text{number of training posts with label } C \text{ that contain } w}{\text{number of training posts with label } C} \right)\]

However, if \(w\) was never seen in a post with label \(C\) in the training data, we get a log-likelihood of \(-\infty\), which is no good. Instead, use one of these two alternate formulas:


\[\ln P(w \mid C) = \ln \left( \frac{\text{number of training posts that contain } w}{\text{number of training posts}} \right)\]

(Use when \(w\) does not occur in posts labeled \(C\) but does occur in the training data overall.)


\[\ln P(w \mid C) = \ln \left( \frac{1}{\text{number of training posts}} \right)\]

(Use when \(w\) does not occur anywhere at all in the training set.)


Implementing Your Top-Level Classifier Application

For submission to the autograder, your top-level application code must be entirely contained in a single file, main.cpp. However, the structure of your classifier application, including which procedural abstractions and/or ADTs to use for the classifier, is entirely up to you. Make sure your decisions are informed by carefully considering the classifier and top-level application described in this specification.

We strongly suggest you make a class to represent the classifier - the private data members for the class should keep track of the classifier parameters learned from the training data, and the public member functions should provide an interface that allows you to train the classifier and make predictions for new piazza posts.

Here is some high-level guidance:

  1. First, your application should read posts from a file (e.g. train_small.csv) and use them to train the classifier. After training, your classifier abstraction should store the information mentioned in the “Training the Classifier” section above.
  2. Your classifier should be able to compute the log-probability score of a post (i.e. a collection of words) given a particular label. To predict a label for a new post, it should choose the label that gives the highest log-probability score.
  3. Read posts from a file (e.g. test_small.csv) to use as testing data. For each post, predict a label using your classifier.

Some of these steps have output associated with them. See the “output” section below for the details.

You should write RMEs and appropriate comments to describe the interfaces for the abstractions you choose (ADTs, classes, functions, etc.). You should also write unit tests to verify each component works on its own.

You are welcome to use any part of the STL in your top-level classifier application. In particular, std::map and std::set will be useful.

Classifier Application Interface

Here is the usage message for the top-level application:

$ ./main.exe
Usage: main.exe TRAIN_FILE TEST_FILE [--debug]

The main application always requires files for both training and testing. The training file must have at least one post, but the test file may have no posts. You may assume all files are in the correct format, with a header that has at least the “tag” and “content” columns.

Use the provided small-scale files for initial testing and to check your output formatting:

$ ./main.exe train_small.csv test_small.csv
$ ./main.exe train_small.csv test_small.csv --debug

Correct output is in test_small.out.correct and test_small_debug.out.correct. The output format is discussed in detail below.

Error Checking

The program checks that the command line arguments obey the following rules:

If any of these are violated, print out the usage message and then quit by returning a non-zero value from main. Do not use the exit library function, as this fails to clean up local objects.

cout << "Usage: main.exe TRAIN_FILE TEST_FILE [--debug]" << endl;

If any file cannot be opened, print out the following message, where filename is the name of the file that could not be opened, and quit by returning a non-zero value from main.

cout << "Error opening file: " << filename << endl;

You do not need to do any error checking for command-line arguments or file I/O other than what is described on this page. However, you must use precisely the error messages given here in order to receive credit. (Just literally use the code given here to print them.)

As mentioned earlier, you may assume all Piazza data files are in the correct format.

Output

This section details the output your program should write to cout, using the small files mentioned above as an example. Some lines are indented by two spaces. Output only printed when the --debug flag is provided is indicated here with “(DEBUG)”.

Add this line at the beginning of your main function to set floating point precision:

cout.precision(3);

First, print information about the training data:

If the debug option is provided, also print information about the classifier trained on the training posts. Whenever classes or words are listed, they are in alphabetic order.

Finally, use the classifier to predict classes for each example in the testing data. Print information about the test data as well as these predictions.

The last thing printed should be a newline character. The output for this example can be found in test_small.out.correct and test_small_debug.out.correct. Use diff to compare against these files and check your formatting.

Results

In case you’re curious, here’s the performance for the large datasets. Not too bad!

Command Accuracy
./main.exe w16_projects_exam.csv sp16_projects_exam.csv 245 / 332
./main.exe w14-f15_instructor_student.csv w16_instructor_student.csv 2602 / 2988

Efficiency

While efficiency is not a primary goal for this project, you should aim for your code to run the largest test case above in no more than a minute. Some common causes of slowdown you should avoid:

Refer to the Project 2 perf tutorial for instructions on how to use the perf tool to identify slow functions.


Appendix A: Map Example

#include <iostream>
#include <string>
#include "Map.h"
using namespace std;

int main () {
  // A map stores two types, key and value
  Map<string, double> words;

  // One way to use a map is like an array
  words["hello"] = 1;

  // Maps store a std::pair type, which "glues" one key to one value.
  // The CS term is Tuple, a fixed-size heterogeneous container.
  pair<string, double> tuple;
  tuple.first = "world";
  tuple.second = 2;
  words.insert(tuple);

  // Here's the C++11 way to insert a pair
  words.insert({"pi", 3.14159});

  // Iterate over map contents using a C++11 range-for loop
  // This is the equivalent without C++11:
  // for (Map<string, double>::Iterator i=words.begin();
  //      i != words.end(); ++i) {
  for (auto i : words) {
    auto word = i.first; //key
    auto number = i.second; //value
    cout << word << " " << number << "\n";
  }

  // Check if a key is in the map.  find() returns an iterator.
  auto found_it = words.find("pi");
  if (found_it != words.end()) {
    auto word = (*found_it).first; //key
    auto number = (*found_it).second; //value
    cout << "found " << word << " " << number << "\n";
  }

  // When using the [] notation, an element not found is automatically created.
  // If the value type of the map is numeric, it will always be 0 "by default".
  cout << "bleh: " << words["bleh"] << endl;
}

Appendix B: Splitting a Whitespace-Delimited String

We’ve provided two versions that use istringstream. They do the same thing, so use whichever you like in your code.

// EFFECTS: Returns a set containing the unique "words" in the original
//          string, delimited by whitespace.
set<string> unique_words(const string &str) {
 istringstream source(str);
 set<string> words;
 string word;

 // Read word by word from the stringstream and insert into the set
 while (source >> word) {
   words.insert(word);
 }
 return words;
}
// EFFECTS: Returns a set containing the unique "words" in the original
//          string, delimited by whitespace.
set<string> unique_words(const string &str) {
 // Fancy modern C++ and STL way to do it
 istringstream source{str};
 return {istream_iterator<string>{source},
         istream_iterator<string>{}};
}