# EECS 280 Project 1: Statistics

Due Friday, September 17, 2021, 8:00 pm

How Couples Meet and Stay Together is a research study that surveyed how Americans met their spouses and romantic partners, and compared traditional to nontraditional couples. In this project, you will write a program to analyze data from this research study.

You will write two code modules in two files, as well as a number of unit tests. In `stats.cpp`, you will write functions that compute basic statistics like mean and standard deviation. You’ll test your functions using unit tests, which you will add to `stats_tests.cpp`. Finally, in `main.cpp`, you will write a driver program that reads a data file from the study and computes statistics.

This project uses the Standard Template Library (STL) `vector`, which is similar to an array. If you’ve never seen a vector before, check out Appendix B: `vector` How-to at the end of this document. Vectors are pretty convenient to use, and the appendix gives examples of exactly how to use them for this project.

1. Complete the setup tutorial
2. Implement statistics library in `stats.cpp`
3. Test statistics library in `stats_tests.cpp`
4. Write top level program in `main.cpp`

This is a complete list of all files in this project:

| File | Description |
| --- | --- |
| `Makefile` | Helper commands for building and submitting |
| `main_test.in` | Inputs for main program |
| `main_test.out.correct` | Correct output of main program |
| `main_test_data.tsv` | Data file for main program |
| `p1_library.cpp` | Provided code implementations |
| `p1_library.h` | Provided code function prototypes |
| `stats_tests.cpp` | Tests for statistics library. Add tests to this file. |
| `stats_public_test.cpp` | A “does my code compile” test case |
| `stats.h` | Function prototypes for statistics library |
| `stats.cpp` | Function implementations for statistics library. Write this. |
| `main.cpp` | Main program. Write this. |

The starter files are available at `https://eecs280staff.github.io/p1-stats/starter-files.tar.gz`. You should have downloaded and unpacked them in the starter tutorial.

We have provided several functions for your convenience in `p1_library.h` and `p1_library.cpp`. These will make your life MUCH easier!

# Statistics Library

The statistics library provides functions for another program to use. We have provided `stats.h`, which has function prototypes and a description of what each function does. Implement these functions in `stats.cpp`. You will also write test cases for these functions in `stats_tests.cpp`. Compile and run your test cases like this.

``````\$ g++ -Wall -Werror -pedantic -g --std=c++11 stats_tests.cpp stats.cpp p1_library.cpp -o stats_tests.exe
\$ ./stats_tests.exe
``````

Note that you don’t type the `\$` symbol; it just means “this is the command prompt.”

Keep in mind that `stats_tests.cpp` doesn’t depend on `main.cpp`.

We’ve also provided a public test case that will help you verify that your code compiles correctly. You can compile and run the `stats_public_test.exe` executable with the following commands:

``````\$ g++ -Wall -Werror -pedantic -g --std=c++11 stats_public_test.cpp stats.cpp p1_library.cpp -o stats_public_test.exe
\$ ./stats_public_test.exe
``````

# Statistics Program

Our statistics program will first ask the user for a filename and which data to read from the file. Then, it will compute several statistics and print a summary to standard output.

## Input file format

Input files are in tab-separated values (.tsv) format. The first line is a header, which has names for each column. Following lines contain numerical values. We have provided a simple example in `main_test_data.tsv`.

``````A        B
1        6
2        7
3        8
4        9
5        10
``````

## Example

Let’s run a complete example of the main program. First, we’ll compile and run the program at the command line.

``````\$ g++ -Wall -Werror -pedantic --std=c++11 -g main.cpp stats.cpp p1_library.cpp -o main.exe
\$ ./main.exe
``````

Our program asks for a file name

``````enter a filename
``````

The user types in a file name

``````main_test_data.tsv
``````

Next, it asks the user for a column name

``````enter a column name
``````

And we type that in, too

``````B
``````

Print an informational message about the column and file

``````reading column B from main_test_data.tsv
``````

Next, print a summary of the data, followed by a blank line.

``````Summary (value: frequency)
6: 1
7: 1
8: 1
9: 1
10: 1
``````

And finally, print these statistics

``````count = 5
sum = 40
mean = 8
stdev = 1.58114
median = 8
mode = 6
min = 6
max = 10
0th percentile = 6
25th percentile = 7
50th percentile = 8
75th percentile = 9
100th percentile = 10
``````
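One way to build the `Summary (value: frequency)` listing above is to walk a sorted vector and count runs of equal values. The sketch below assumes the data is already sorted in ascending order. The names `ValueCount`, `count_frequencies`, and `print_summary` are illustrative and are not the interface declared in `stats.h`.

```cpp
#include <cassert>
#include <iostream>
#include <vector>
using namespace std;

// Illustrative type: one row of the summary.
struct ValueCount {
  double value;
  int frequency;
};

// Hypothetical helper; assumes sorted_values is in ascending order.
vector<ValueCount> count_frequencies(vector<double> sorted_values) {
  vector<ValueCount> rows;
  for (size_t i = 0; i < sorted_values.size(); ++i) {
    // Check the "already sorted" assumption as we go.
    assert(i == 0 || sorted_values[i - 1] <= sorted_values[i]);

    if (!rows.empty() && rows.back().value == sorted_values[i]) {
      rows.back().frequency += 1;  // same value as the previous element
    } else {
      ValueCount row = {sorted_values[i], 1};  // first time we see this value
      rows.push_back(row);
    }
  }
  return rows;
}

// Prints rows in the "value: frequency" layout.
void print_summary(const vector<ValueCount> &rows) {
  for (size_t i = 0; i < rows.size(); ++i) {
    cout << rows[i].value << ": " << rows[i].frequency << "\n";
  }
}
```

Exact `==` is reasonable here because repeated values read from the same file are stored as identical doubles; computed values are a different story (see Appendix F).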

Next, let’s automate this so we can quickly check our program. First, we’ll use a file to contain all the user input. It’s annoying to keep typing it, so let’s type it once, and then reuse the file, called `main_test.in`.

``````main_test_data.tsv
B
``````

Now you can run your program and redirect the input from a file. This way you don’t have to type it every time! Notice that there’s no user input showing up in the output, where the user typed `main_test_data.tsv` and `B` in the previous example.

``````\$ ./main.exe < main_test.in
enter a filename
enter a column name
Summary (value: frequency)
6: 1
7: 1
8: 1
9: 1
10: 1

count = 5
sum = 40
mean = 8
stdev = 1.58114
median = 8
mode = 6
min = 6
max = 10
0th percentile = 6
25th percentile = 7
50th percentile = 8
75th percentile = 9
100th percentile = 10
``````


Next, we’ll run our program again and save the output to a file instead of printing it to the terminal. Hint: press the up arrow (or control-p) to avoid retyping the command. We will redirect output to a file called `main_test.out`, and at the same time read user input from `main_test.in`.

``````\$ ./main.exe < main_test.in > main_test.out
``````

Notice that there’s no output at the console, but we can peek at the output that was redirected to the file:

``````\$ cat main_test.out
``````

The last piece is a correct answer to compare against. We’ve provided the correct output in a file called `main_test.out.correct`. We could look at this file and compare it line by line with `main_test.out`, but let’s make the computer do it for us!

``````\$ diff main_test.out main_test.out.correct
``````

If there’s no output, that means the files match. If there is a problem, you can use `sdiff` to view the two files side by side and see where they are similar and where they are different.

``````\$ sdiff main_test.out main_test.out.correct
``````

If the output gets too long to see, then you can send the output of the `sdiff` command to the input of the `less` command using a pipe (`|` character). `less` is a pager, which lets you use the arrow keys to move up and down in the output, and quit using `q`.

``````\$ sdiff main_test.out main_test.out.correct | less
``````

Whew! That was a lot of typing. Let’s make it even easier. `make` is a command line program that remembers long commands for you. It reads a file in the same directory named `Makefile`. I know, it’s weird that it doesn’t have a file extension, but it’s just a plain text file. Makefiles run commands for you! We provided one for you and here’s an example that compiles and runs all the tests:

``````\$ make test
``````

We can also delete the temporary files created by the `Makefile`, like the executables:

``````\$ make clean
``````

## Real Data

Want to try it out with real data from the How Couples Meet and Stay Together study?

1. Use the following `wget` link to download the data in tsv format: `https://eecs280staff.github.io/p1-stats/data/HCMST_ver_3.04.tsv`.
2. The variables in the study are the first line of the tsv file.
3. Another file called the codebook describes the variables. It can be accessed here: https://stacks.stanford.edu/file/druid:ns183dp7831/HCMST_codebook_3_04.pdf.

Let’s see how many survey respondents have a spouse or partner:

``````\$ ./main.exe
enter a filename
HCMST_ver_3.04.tsv
enter a column name
qflag
Summary (value: frequency)
1: 3009
2: 993

count = 4002
sum = 4995
mean = 1.24813
stdev = 0.431979
median = 1
mode = 1
min = 1
max = 2
0th percentile = 1
25th percentile = 1
50th percentile = 1
75th percentile = 1
100th percentile = 2
``````

After reading the codebook, we can understand that “1” means partnered and “2” means no spouse or partner.

How many respondents identified as gay, lesbian or bisexual?

``````\$ ./main.exe
enter a filename
HCMST_ver_3.04.tsv
enter a column name
glbstatus
Summary (value: frequency)
0: 3047
1: 955

count = 4002
sum = 955
mean = 0.238631
stdev = 0.4263
median = 0
mode = 0
min = 0
max = 1
0th percentile = 0
25th percentile = 0
50th percentile = 0
75th percentile = 0
100th percentile = 1
``````

We can see that 955 people identified as gay, lesbian or bisexual.
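Because `glbstatus` is coded as 0 or 1, its mean is also the fraction of respondents coded 1. The short check below reproduces that arithmetic from the counts in the summary. The function name `fraction_of_ones` is made up for illustration.

```cpp
#include <cassert>

// Illustrative helper: for a column of 0s and 1s, the mean equals
// (number of 1s) / (number of rows).
double fraction_of_ones(int num_zeros, int num_ones) {
  assert(num_zeros + num_ones > 0);  // avoid dividing by zero
  return static_cast<double>(num_ones) / (num_zeros + num_ones);
}
```

With the counts above, `fraction_of_ones(3047, 955)` is 955 / 4002, which agrees with the reported `mean = 0.238631`: roughly 24% of the respondents in this data set.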

# Tips, Tricks and Restrictions

Put all of your statistics functions in `stats.cpp` and all statistics function tests in `stats_tests.cpp`. Write your statistics program in `main.cpp`.

These are the only libraries you may use:

``````#include "stats.h"
#include "p1_library.h"
#include <iostream>
#include <string>
#include <vector>
#include <cassert>
#include <cmath>
#include <iomanip>
#include <limits>
``````

No non-const global variables or static variables.

DO NOT INCLUDE a `main` function in your `stats.cpp` file. Remember, `stats.cpp` is a library of functions that `main.cpp` uses. The functions in `stats.cpp` must still work when compiled with a different main function.

# Testing

Testing is just as important as writing the original code! Write unit tests for each statistics function (the functions in `stats.h`). These are the requirements for tests:

• Your tests should be written as separate functions, as demonstrated by the sample test case in `stats_tests.cpp`
• Each function in `stats.cpp` must have at least one corresponding test function in `stats_tests.cpp`
• Use descriptive function names for your test cases
• Use `assert` to check things that should be true if your code is working correctly and passes the test. For example, if the mean should be 3, use `assert(mean == 3);`. Thus, a failed assert indicates a failed test case.
• You may print as much output as you like
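A test that follows these requirements might look like the sketch below. To keep the example self-contained it defines a stand-in `mean` function; in the project, the real one comes from `stats.h`. The test name and data are illustrative.

```cpp
#include <cassert>
#include <iostream>
#include <vector>
using namespace std;

// Stand-in for the mean() declared in stats.h, included only so this
// sketch compiles on its own.
double mean(vector<double> v) {
  assert(!v.empty());
  double total = 0;
  for (size_t i = 0; i < v.size(); ++i) {
    total += v[i];
  }
  return total / v.size();
}

// A descriptively named test: one function per behavior being checked.
void test_mean_small_data_set() {
  cout << "test_mean_small_data_set" << endl;

  vector<double> data;
  data.push_back(1);
  data.push_back(3);
  data.push_back(5);

  // 1 + 3 + 5 = 9, and 9 / 3 = 3 exactly, so == is safe here.
  assert(mean(data) == 3);

  cout << "PASS!" << endl;
}
```

Each test function is then called from `main` in the test file, so a failed `assert` stops the run at the first broken test.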

Pro-tip: Write tests for the functions first (e.g., write tests for `median()`, and then implement `median()`). It sounds like a pain, but you gain two important things by coding this way:

1. You avoid being under the illusion that your code works when it’s actually full of bugs.
2. When you make changes to code that you wrote previously, you can re-run your test cases and immediately know if you broke something (yes, you will break things).

This practice is called test-driven development.

Submit `main.cpp`, `stats.cpp`, and `stats_tests.cpp` to the autograder using the direct link at the top of this page.

We will grade your code on functional correctness and the presence of test cases. As a reminder, you may not share any part of your solution with others. This includes both code and test cases. Doing so will result in an honor code violation.

# Acknowledgments

The original project was written by Andrew DeOrio, spring 2015.

This project is based on research work by Rosenfeld, Michael J., Reuben J. Thomas, and Maja Falcon. 2015. How Couples Meet and Stay Together, Waves 1, 2, and 3: Public version 3.04, plus wave 4 supplement version 1.02 and wave 5 supplement version 1.0 [Computer files]. Stanford, CA: Stanford University Libraries.

# Appendix A: Percentile formula

You should use the following formula to compute percentile. Note that this formula uses indexing from 1. You need to adapt it to use indexing from 0.

# Appendix B: `vector` How-to

A `vector` is a data structure that is part of the Standard Template Library (STL). Vectors are great for storing sequences of items, and we’ll store a sequence of doubles in this project. You can use a vector by adding `#include <vector>` to the top of your program. Here is an example program that shows how to use a vector.

``````//vector_test.cpp
#include <iostream> //for cout
#include <vector>   //for vector
using namespace std;

int main() {
  //create a vector that holds doubles
  vector<double> v;

  //fill it with {1.0, 2.0, 3.0}
  v.push_back(1.0);
  v.push_back(2.0);
  v.push_back(3.0);

  // check the size, which is how many items live inside the vector
  cout << "There are " << v.size() << " elements in v\n";

  // access each item in the vector and print it
  for (size_t i = 0; i < v.size(); i += 1) {
    cout << "v[" << i << "] = " << v[i] << "\n";
  }

  return 0;
}
``````

Here’s how to compile and run

``````\$ g++ -Wall -Werror -pedantic --std=c++11 -g vector_test.cpp -o vector_test.exe
\$ ./vector_test.exe
``````

# Appendix C: `sqrt` How-to

You might find the square root function helpful in this project. It’s called `sqrt` (pronounced “squirt”) and lives in the `cmath` library. Here’s an example:

``````//sqrt_test.cpp
#include <iostream> //for cout
#include <cmath>    //for sqrt
using namespace std;

int main() {
  cout << "the square root of 4 is " << sqrt(4) << "\n";
  return 0;
}
``````
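As an illustration of `sqrt` in context, here is a sketch of a standard deviation computation. It assumes the sample formula (dividing by n - 1), which is consistent with the `stdev = 1.58114` shown for the values 6 through 10 in the example run; check `stats.h` for the definition the project actually requires. The name `sample_stdev` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>
using namespace std;

// Hypothetical sketch of the sample standard deviation.
double sample_stdev(vector<double> v) {
  assert(v.size() >= 2);  // n - 1 must be nonzero

  // First pass: compute the mean.
  double total = 0;
  for (size_t i = 0; i < v.size(); ++i) {
    total += v[i];
  }
  double mean = total / v.size();

  // Second pass: sum the squared distances from the mean.
  double sum_of_squares = 0;
  for (size_t i = 0; i < v.size(); ++i) {
    double diff = v[i] - mean;
    sum_of_squares += diff * diff;
  }

  return sqrt(sum_of_squares / (v.size() - 1));
}
```

For the values 6, 7, 8, 9, 10 the squared distances sum to 10, so the result is `sqrt(10 / 4)`, which is `sqrt(2.5)`.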

# Appendix D: `modf` How-to

`modf` breaks a double into its integral and fractional parts.

``````//modf_test.cpp
#include <iostream> //for cout
#include <cmath>    //for modf
using namespace std;

int main() {

  double pi = 3.14159265;
  double fractpart = 0;
  double intpart = 0;

  //use modf to extract fractional part and integral part of pi
  fractpart = modf(pi , &intpart);
  cout << pi << " = " << intpart << " + " << fractpart << "\n";

  return 0;
}
``````
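One place `modf` comes in handy is a percentile computed by linear interpolation, where a fractional position in the sorted data is split into a whole index and a remainder. The sketch below assumes such a definition, with the 0-based position taken as `p * (N - 1)`. That choice reproduces the percentile lines in the example output, but it is an assumption here; Appendix A is the authority on the formula the project requires. The name `interpolated_percentile` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>
using namespace std;

// Hypothetical sketch; p is a fraction between 0 and 1, and
// sorted_values must be non-empty and in ascending order.
double interpolated_percentile(vector<double> sorted_values, double p) {
  assert(!sorted_values.empty());
  assert(0 <= p && p <= 1);

  // Fractional 0-based position of the percentile in the sorted data.
  double position = p * (sorted_values.size() - 1);

  // Split the position into a whole index and a fractional remainder.
  double intpart = 0;
  double fractpart = modf(position, &intpart);
  size_t k = static_cast<size_t>(intpart);

  // At the last element there is no right-hand neighbor to interpolate with.
  if (k + 1 >= sorted_values.size()) {
    return sorted_values[k];
  }

  // Move fractpart of the way from element k toward element k + 1.
  return sorted_values[k] +
         fractpart * (sorted_values[k + 1] - sorted_values[k]);
}
```

With the values 6, 7, 8, 9, 10 and `p = 0.25`, the position is exactly 1, so the result is the element at index 1, which is 7.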

# Appendix E: `assert` How-to

`assert` is a programmer’s best friend. In this project, we’ll use it for checking the output of a function in a test program.

When the input to `assert()` is true, it does nothing. When the input to `assert()` is false, it crashes the program with a helpful debugging message. Here’s an example program:

``````//assert_test.cpp
#include <cassert> //for assert
#include <cmath>   //for sqrt
using namespace std;

int main() {
  //Check that the square root of 4 is 2.  This will pass :)
  assert(sqrt(4) == 2);

  //Check that the square root of 4 is 1.  This will fail :(
  //When an assertion fails, it prints a helpful debugging message
  assert(sqrt(4) == 1);

  return 0;
}
``````

# Appendix F: Comparisons How-To

This appendix covers comparing signed and unsigned integers, and comparing floating point numbers (`double`).

## Signed and unsigned integer comparisons

Example from `stats.cpp`:

``````double sum(vector<double> v) {
  double total = 0;
  for (int i = 0; i < v.size(); ++i) {
    total += v[i];
  }
  return total;
}
``````

Compile and get this error:

``````\$ make stats_tests.exe
g++-7 -Wall -Werror -pedantic -g --std=c++11 stats_tests.cpp stats.cpp p1_library.cpp -o stats_tests.exe
stats.cpp: In function 'double sum(std::vector<double>)':
stats.cpp:17:21: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
for (int i = 0; i < v.size(); ++i) {
~~^~~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [stats_tests.exe] Error 1
``````

The problem is `v.size()` returns a `size_t` type, which is an alias for an unsigned integer type. The loop variable `i` is an `int` type. The types don’t match.

### Solution 1: `size_t`

Change `int i` to `size_t i`. Now, the types match and the compiler is happy.

``````double sum(vector<double> v) {
  double total = 0;
  for (size_t i = 0; i < v.size(); ++i) {
    total += v[i];
  }
  return total;
}
``````

### Solution 2: `static_cast<>()`

Cast `v.size()` to an `int`. Again, the types match and the compiler is happy.

``````double sum(vector<double> v) {
  double total = 0;
  for (int i = 0; i < static_cast<int>(v.size()); ++i) {
    total += v[i];
  }
  return total;
}
``````
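A third option, not covered above, is to avoid the index altogether with a C++11 range-based `for` loop, which is available under the `--std=c++11` flag used throughout this project. With no index there is nothing to compare against `v.size()`, so the warning cannot occur. The name `sum_range_for` is used only to set this sketch apart from the versions above.

```cpp
#include <vector>
using namespace std;

// Sketch: sums the elements without an index variable.
double sum_range_for(vector<double> v) {
  double total = 0;
  for (double value : v) {  // value takes each element of v in order
    total += value;
  }
  return total;
}
```

This form only works when the loop body does not need the position of each element; when it does, Solution 1 or 2 still applies.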

## Floating point comparisons

Another comparison error you may encounter occurs when you compare two floating point numbers, like `double`s. Floating point numbers have limited precision. Due to rounding errors, two floating point numbers we expect to be equal may be slightly different.

For example:

``````//test.cpp
#include <iostream>
using namespace std;

int main() {
  double x = 1.0 / 3.0;
  double y = 1.0 - (2.0 / 3.0);
  cout << "x=" << x << endl;
  cout << "y=" << y << endl;
  if (x == y) {
    cout << "equal" << endl;
  } else {
    cout << "not equal" << endl;
  }
}
``````

Compile and run. The two numbers look the same, but when we compare them, they are not equal! Notice that `x` and `y` are rounded to 6 significant digits by default.

``````\$ g++ test.cpp -o test.exe
\$ ./test.exe
x=0.333333
y=0.333333
not equal
``````

Let’s look at the full precision. Modify your program to look like this.

``````//test.cpp
#include <iostream>
#include <limits>
using namespace std;

int main() {
  double x = 1.0 / 3.0;
  double y = 1.0 - (2.0 / 3.0);
  cout.precision(std::numeric_limits<double>::max_digits10);
  cout << "x=" << x << endl;
  cout << "y=" << y << endl;
  if (x == y) {
    cout << "equal" << endl;
  } else {
    cout << "not equal" << endl;
  }
}
``````

Compile and run. Notice that `x` and `y` are no longer rounded to 6 significant digits. We can see that they are slightly different.

``````\$ g++ test.cpp -o test.exe
\$ ./test.exe
x=0.33333333333333331
y=0.33333333333333337
not equal
``````

Next, we’ll compare within a tolerance `epsilon`, instead of an exact comparison. Again, modify your program. Notice the code `if (abs(x - y) < epsilon)`.

``````//test.cpp
#include <iostream>
#include <cmath>
#include <limits>
using namespace std;

// Precision for floating point comparison
const double epsilon = 0.00001;

int main() {
  double x = 1.0 / 3.0;
  double y = 1.0 - (2.0 / 3.0);
  cout.precision(std::numeric_limits<double>::max_digits10);
  cout << "x=" << x << endl;
  cout << "y=" << y << endl;
  if (abs(x - y) < epsilon) {
    cout << "equal" << endl;
  } else {
    cout << "not equal" << endl;
  }
}
``````

Compile and run. Notice that the comparison now reports equal.

``````\$ g++ test.cpp -o test.exe
\$ ./test.exe
x=0.33333333333333331
y=0.33333333333333337
equal
``````
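In unit tests, the same idea can be wrapped in a small helper so that each `assert` stays readable. The sketch below is one possible shape; the helper name `almost_equal`, the test name, and the choice of tolerance are illustrative, not something the project prescribes.

```cpp
#include <cassert>
#include <cmath>
using namespace std;

// Tolerance for floating point comparison (same value as above).
const double epsilon = 0.00001;

// Hypothetical helper: true when x and y differ by less than epsilon.
// fabs is the floating point absolute value from <cmath>.
bool almost_equal(double x, double y) {
  return fabs(x - y) < epsilon;
}

// Example test using the helper instead of ==.
void test_one_third() {
  double x = 1.0 / 3.0;
  double y = 1.0 - (2.0 / 3.0);
  assert(almost_equal(x, y));  // passes, even though x == y is false
}
```

A fixed tolerance like this suits values of moderate size, such as the statistics in this project; very large or very small values may call for a tolerance scaled to the numbers being compared.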