EECS 280 Project 1: Statistics

Project Due Tuesday, September 17, 2019, 8:00 pm

How Couples Meet and Stay Together is a research study that surveyed how Americans met their spouses and romantic partners, and compared traditional to nontraditional couples. In this project, you will write a program to analyze data from this research study.

You will write two code modules in two files, as well as a number of unit tests. In stats.cpp, you will write functions that compute basic statistics like mean and standard deviation. You’ll test your functions using unit tests, which you will add to stats_tests.cpp. Finally, in main.cpp, you will write a driver program that reads a data file from the study and computes statistics.

This project uses the Standard Template Library (STL) vector, which is similar to an array. If you’ve never seen a vector before, check out Appendix B: vector How-to at the end of this document. Vectors are convenient to use, and the appendix gives examples of exactly how to use them for this project.

1. Complete the setup tutorial
2. Implement statistics library in stats.cpp
3. Test statistics library in stats_tests.cpp
4. Write top level program in main.cpp

This is a complete list of all files in this project:

File                     Description
Makefile                 Helper commands for building and submitting
main_test.in             Inputs for main program
main_test.out.correct    Correct output of main program
main_test_data.tsv       Data file for main program
p1_library.cpp           Provided code implementations
p1_library.h             Provided code function prototypes
stats_tests.cpp          Tests for statistics library. Add tests to this file.
stats_public_test.cpp    A “does my code compile” test case
stats.h                  Function prototypes for statistics library
stats.cpp                Function implementations for statistics library. Write this.
main.cpp                 Main program. Write this.

The starter files are available at https://eecs280staff.github.io/p1-stats/starter-files.tar.gz. You should have downloaded and unpacked them in the starter tutorial.

We have provided several functions for your convenience in p1_library.h and p1_library.cpp. These will make your life MUCH easier!

Statistics Library

The statistics library provides functions for another program to use. We have provided stats.h, which has function prototypes and a description of what each function does. Implement these functions in stats.cpp. You will also write test cases for these functions in stats_tests.cpp. Compile and run your test cases like this (all on one line).

$ g++ -Wall -Werror -pedantic -g --std=c++11 stats_tests.cpp stats.cpp p1_library.cpp -o stats_tests.exe
$ ./stats_tests.exe

Note that you don’t type the $ symbol; it just means “this is the command prompt.”

Keep in mind that stats_tests.cpp doesn’t depend on main.cpp.

We’ve also provided a public test case that will help you verify that your code compiles correctly. You can compile and run the stats_public_test.exe executable with the following commands:

$ g++ -Wall -Werror -pedantic -g --std=c++11 stats_public_test.cpp stats.cpp p1_library.cpp -o stats_public_test.exe
$ ./stats_public_test.exe

Statistics Program

Our statistics program will first ask the user for a filename and which column to read from the file. Then, it will compute several statistics and print a summary to standard output.

Input file format

Input files are in tab separated value (.tsv) format. The first line is a header, which has names for each column. Following lines contain numerical values. We have provided a simple example in main_test_data.tsv.

A        B
1        6
2        7
3        8
4        9
5        10

Example

Let’s run a complete example of the main program. First, we’ll compile and run the program at the command line.

$ g++ -Wall -Werror -pedantic --std=c++11 -g main.cpp stats.cpp p1_library.cpp -o main.exe
$ ./main.exe

Our program asks for a file name

enter a filename

The user types in a file name

main_test_data.tsv

Next, it asks the user for a column name

enter a column name

And we type that in, too

B

Print an informational message about the column and file

reading column B from main_test_data.tsv

Next, print a summary of the data, followed by a blank line.

Summary (value: frequency)
6: 1
7: 1
8: 1
9: 1
10: 1

And finally, print these statistics

count = 5
sum = 40
mean = 8
stdev = 1.58114
median = 8
mode = 6
min = 6
max = 10
0th percentile = 6
25th percentile = 7
50th percentile = 8
75th percentile = 9
100th percentile = 10

Next, let’s automate this so we can quickly check our program. First, we’ll use a file to contain all the user input. It’s annoying to keep typing it, so let’s type it once, and then reuse the file, called main_test.in.

main_test_data.tsv
B

Now you can run your program and redirect the input from a file. This way you don’t have to type it every time! Notice that there’s no user input showing up in the output, where the user typed main_test_data.tsv and B in the previous example.

$ ./main.exe < main_test.in
enter a filename
enter a column name
reading column B from main_test_data.tsv
Summary (value: frequency)
6: 1
7: 1
8: 1
9: 1
10: 1

count = 5
sum = 40
mean = 8
stdev = 1.58114
median = 8
mode = 6
min = 6
max = 10
0th percentile = 6
25th percentile = 7
50th percentile = 8
75th percentile = 9
100th percentile = 10

Instructions for setting up redirection with your debugger:

Next, we’ll run our program again and save the output to a file instead of printing it to the terminal. Hint: press the up arrow (or control-p) to avoid retyping the command. We will redirect output to a file called main_test.out, and at the same time read user input from main_test.in.

$ ./main.exe < main_test.in > main_test.out

Notice that there’s no output at the console, but we can peek at the output that was redirected to the file:

$ cat main_test.out

The last piece is a correct answer to compare against. We’ve provided the correct output in a file called main_test.out.correct. We could look at this file and compare it line by line with main_test.out, but let’s make the computer do it for us!

$ diff main_test.out main_test.out.correct

If there’s no output, that means the files match. If there is a problem, you can debug it using the sdiff command, which shows the two files side by side so you can see where they are similar and where they differ.

$ sdiff main_test.out main_test.out.correct

If the output gets too long to see, then you can send the output of the sdiff command to the input of the less command using a pipe (| character). less is a pager, which lets you use the arrow keys to move up and down in the output, and quit using q.

$ sdiff main_test.out main_test.out.correct | less

Whew! That was a lot of typing. Let’s make it even easier. make is a command line program that remembers long commands for you. It reads a file in the same directory named Makefile. I know, it’s weird that it doesn’t have a file extension, but it’s just a plain text file. Makefiles run commands for you! We provided one for you and here’s an example that compiles and runs all the tests:

$ make test

We can also delete the temporary files created by the Makefile, like the executables:

$ make clean

Real Data

Want to try it out with real data from the How Couples Meet and Stay Together study?

1. Use the following wget link to download the data in tsv format: https://eecs280staff.github.io/p1-stats/data/HCMST_ver_3.04.tsv.
2. The variables in the study are the first line of the tsv file.
3. Another file called the codebook describes the variables. It can be accessed here: https://stacks.stanford.edu/file/druid:ns183dp7831/HCMST_codebook_3_04.pdf.

Let’s see how many survey respondents have a spouse or partner:

$ ./main.exe
enter a filename
HCMST_ver_3.04.tsv
enter a column name
qflag
reading column qflag from HCMST_ver_3.04.tsv
Summary (value: frequency)
1: 3009
2: 993

count = 4002
sum = 4995
mean = 1.24813
stdev = 0.431979
median = 1
mode = 1
min = 1
max = 2
0th percentile = 1
25th percentile = 1
50th percentile = 1
75th percentile = 1
100th percentile = 2

After reading the codebook, we can understand that “1” means partnered and “2” means no spouse or partner.

How many respondents identified as gay, lesbian or bisexual?

$ ./main.exe
enter a filename
HCMST_ver_3.04.tsv
enter a column name
glbstatus
reading column glbstatus from HCMST_ver_3.04.tsv
Summary (value: frequency)
0: 3047
1: 955

count = 4002
sum = 955
mean = 0.238631
stdev = 0.4263
median = 0
mode = 0
min = 0
max = 1
0th percentile = 0
25th percentile = 0
50th percentile = 0
75th percentile = 0
100th percentile = 1

We can see that 955 people identified as gay, lesbian or bisexual.

Tips, Tricks and Restrictions

Put all of your statistics functions in stats.cpp and all statistics function tests in stats_tests.cpp. Write your statistics program in main.cpp.

These are the only libraries you may use:

#include "stats.h"
#include "p1_library.h"
#include <iostream>
#include <string>
#include <vector>
#include <cassert>
#include <cmath>
#include <iomanip>
#include <limits>

No non-const global variables or static variables.

DO NOT INCLUDE a main function in your stats.cpp file. Remember, stats.cpp is a library of functions that main.cpp uses. The functions in stats.cpp must still work when compiled with a different main function.

Testing

Testing is just as important as writing the original code! Write unit tests for each statistics function (the functions in stats.h). These are the requirements for tests:

• Your tests should be written as separate functions, as demonstrated by the sample test case in stats_tests.cpp
• Each function in stats.cpp must have at least one corresponding test function in stats_tests.cpp
• Use descriptive function names for your test cases
• Use assert to check things that should be true if your code is working correctly and passes the test. For example, if the mean should be 3, use assert(mean == 3);. Thus, a failed assert indicates a failed test case.
• You may print as much output as you like
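
As an illustration of these requirements, here is a minimal sketch of one test function. The mean defined in it is only a stand-in so that the example compiles on its own; in the project, the real function comes from stats.cpp, and the prototype used here is an assumption (check stats.h). The name test_mean_basic is also hypothetical.

```cpp
#include <cassert>
#include <vector>
using namespace std;

// Stand-in for the real mean() so this sketch compiles on its own.
// The prototype is an assumption; the project's version is declared in stats.h.
double mean(vector<double> v) {
  double total = 0;
  for (size_t i = 0; i < v.size(); ++i) {
    total += v[i];
  }
  return total / v.size();
}

// One test case in its own function, with a descriptive name
// (hypothetical) that says which function and which case it covers.
void test_mean_basic() {
  vector<double> data;
  data.push_back(1);
  data.push_back(3);
  data.push_back(5);

  // (1 + 3 + 5) / 3 is exactly 3, so an exact comparison is safe here
  assert(mean(data) == 3);
}
```

A main function in the test file would then call test_mean_basic() along with the other test functions.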

Protip: Write tests for the functions first (e.g., write tests for median(), and then implement median()). It sounds like a pain, but you gain two important things by coding this way:

1. You avoid being under the illusion that your code works when it’s actually full of bugs.
2. When you make changes to code that you wrote previously, you can re-run your test cases and immediately know if you broke something (yes, you will break things).

This practice is called test-driven development.

Submit main.cpp, stats.cpp, and stats_tests.cpp to the autograder at https://autograder.io.

We will grade your code on functional correctness and the presence of test cases. As a reminder, you may not share any part of your solution with others. This includes both code and test cases. Doing so will result in an honor code violation.

Acknowledgments

The original project was written by Andrew DeOrio, spring 2015.

This project is based on research work by Rosenfeld, Michael J., Reuben J. Thomas, and Maja Falcon. 2015. How Couples Meet and Stay Together, Waves 1, 2, and 3: Public version 3.04, plus wave 4 supplement version 1.02 and wave 5 supplement version 1.0 [Computer files]. Stanford, CA: Stanford University Libraries.

Appendix A: Percentile formula

You should use the following formula to compute percentile. Note that this formula uses indexing from 1. You need to adapt it to use indexing from 0.

Appendix B: vector How-to

A vector is a data structure that is part of the Standard Template Library (STL). Vectors are great for storing sequences of items, and we’ll store a sequence of doubles in this project. You can use a vector by adding #include <vector> to the top of your program. Here is an example program that shows how to use a vector.

//vector_test.cpp
#include <iostream> //for cout
#include <vector>   //for vector
using namespace std;

int main() {
  //create a vector that holds doubles
  vector<double> v;

  //fill it with {1.0, 2.0, 3.0}
  v.push_back(1.0);
  v.push_back(2.0);
  v.push_back(3.0);

  // check the size, which is how many items live inside the vector
  cout << "There are " << v.size() << " elements in v\n";

  // access each item in the vector and print it
  for (size_t i = 0; i < v.size(); i += 1) {
    cout << "v[" << i << "] = " << v[i] << "\n";
  }

  return 0;
}

Here’s how to compile and run:

$ g++ -Wall -Werror -pedantic --std=c++11 -g vector_test.cpp -o vector_test.exe
$ ./vector_test.exe

Appendix C: sqrt How-to

You might find the square root function helpful in this project. It’s called sqrt (pronounced “squirt”) and lives in the cmath library. Here’s an example:

//sqrt_test.cpp
#include <iostream> //for cout
#include <cmath>    //for sqrt
using namespace std;

int main() {
  cout << "the square root of 4 is " << sqrt(4) << "\n";
  return 0;
}
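
For instance, sqrt is the last step of a standard deviation computation. The sketch below assumes the sample standard deviation (dividing by n - 1), which is consistent with the stdev = 1.58114 shown in the example output for the values 6 through 10; check stats.h for the definition your functions must follow. The function name sample_stdev is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>
using namespace std;

// Hypothetical helper: sample standard deviation (divides by n - 1).
// Assumes v contains at least two values.
double sample_stdev(const vector<double> &v) {
  assert(v.size() >= 2);

  // First pass: compute the mean
  double total = 0;
  for (size_t i = 0; i < v.size(); ++i) {
    total += v[i];
  }
  double mean = total / v.size();

  // Second pass: sum the squared differences from the mean
  double sum_sq = 0;
  for (size_t i = 0; i < v.size(); ++i) {
    double diff = v[i] - mean;
    sum_sq += diff * diff;
  }

  return sqrt(sum_sq / (v.size() - 1));
}
```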

Appendix D: modf How-to

modf breaks a double into its integral and fractional parts.

//modf_test.cpp
#include <iostream> //for cout
#include <cmath>    //for modf
using namespace std;

int main() {
  double pi = 3.14159265;
  double fractpart = 0;
  double intpart = 0;

  //use modf to extract fractional part and integral part of pi
  fractpart = modf(pi, &intpart);
  cout << pi << " = " << intpart << " + " << fractpart << "\n";

  return 0;
}
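
modf is handy when a computed position falls between two elements of a sorted vector. The sketch below assumes a linear-interpolation percentile: for a sorted vector of N values and a percentile p between 0 and 1, take the 0-based position p * (N - 1), split it with modf, and interpolate between the two neighboring values. This rule is an assumption, chosen because it reproduces the percentiles in the example output (6, 7, 8, 9, 10); Appendix A remains the authoritative definition. The name percentile_sketch is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>
using namespace std;

// Hypothetical helper, not part of the starter code.
// Assumes v is sorted, non-empty, and 0 <= p <= 1.
double percentile_sketch(const vector<double> &v, double p) {
  assert(!v.empty());
  assert(p >= 0 && p <= 1);

  // 0-based position of the percentile within the sorted data,
  // split into its integral and fractional parts
  double intpart = 0;
  double fractpart = modf(p * (v.size() - 1), &intpart);
  size_t k = static_cast<size_t>(intpart);

  // At the last element there is no right-hand neighbor to interpolate with
  if (k + 1 >= v.size()) {
    return v[v.size() - 1];
  }

  // Move fractpart of the way from v[k] toward v[k + 1]
  return v[k] + fractpart * (v[k + 1] - v[k]);
}
```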

Appendix E: assert How-to

assert is a programmer’s best friend. In this project, we’ll use it for checking the output of a function in a test program.

When the input to assert() is true, it does nothing. When the input to assert() is false, it crashes the program with a helpful debugging message. Here’s an example program:

//assert_test.cpp
#include <cassert> //for assert
#include <cmath>   //for sqrt
using namespace std;

int main() {
  //Check that the square root of 4 is 2.  This will pass :)
  assert(sqrt(4) == 2);

  //Check that the square root of 4 is 1.  This will fail :(
  //When an assertion fails, it prints a helpful debugging message
  assert(sqrt(4) == 1);

  return 0;
}

Appendix F: Comparisons How-To

This appendix covers comparing signed and unsigned integers, and comparing floating point numbers (double).

Signed and unsigned integer comparisons

Example from stats.cpp:

double sum(vector<double> v) {
  double total = 0;
  for (int i = 0; i < v.size(); ++i) {
    total += v[i];
  }
  return total;
}

Compile and get this error:

$ make stats_tests.exe
g++-7 -Wall -Werror -pedantic -g --std=c++11 stats_tests.cpp stats.cpp p1_library.cpp -o stats_tests.exe
stats.cpp: In function 'double sum(std::vector<double>)':
stats.cpp:17:21: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
for (int i = 0; i < v.size(); ++i) {
~~^~~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [stats_tests.exe] Error 1

The problem is v.size() returns a size_t type, which is an alias for an unsigned integer type. The loop variable i is an int type. The types don’t match.

Solution 1: size_t

Change int i to size_t i. Now, the types match and the compiler is happy.

double sum(vector<double> v) {
  double total = 0;
  for (size_t i = 0; i < v.size(); ++i) {
    total += v[i];
  }
  return total;
}

Solution 2: static_cast<>()

Cast v.size() to an int. Again, the types match and the compiler is happy.

double sum(vector<double> v) {
  double total = 0;
  for (int i = 0; i < static_cast<int>(v.size()); ++i) {
    total += v[i];
  }
  return total;
}

Floating point comparisons

Another comparison error you may encounter occurs when you compare two floating point numbers, like doubles. Floating point numbers have limited precision. Due to rounding errors, two floating point numbers we expect to be equal may be slightly different.

For example:

//test.cpp
#include <iostream>
using namespace std;

int main() {
  double x = 1.0 / 3.0;
  double y = 1.0 - (2.0 / 3.0);
  cout << "x=" << x << endl;
  cout << "y=" << y << endl;
  if (x == y) {
    cout << "equal" << endl;
  } else {
    cout << "not equal" << endl;
  }
}

Compile and run. The two numbers look the same, but when we compare them, they are not equal! Notice that x and y are rounded to 6 significant digits by default.

$ g++ test.cpp -o test.exe
$ ./test.exe
x=0.333333
y=0.333333
not equal

Let’s look at the full precision. Modify your program to look like this.

//test.cpp
#include <iostream>
#include <limits>
using namespace std;

int main() {
  double x = 1.0 / 3.0;
  double y = 1.0 - (2.0 / 3.0);
  cout.precision(std::numeric_limits<double>::max_digits10);
  cout << "x=" << x << endl;
  cout << "y=" << y << endl;
  if (x == y) {
    cout << "equal" << endl;
  } else {
    cout << "not equal" << endl;
  }
}

Compile and run. Notice that x and y are no longer rounded to 6 significant digits. We can see that they are slightly different.

$ g++ test.cpp -o test.exe
$ ./test.exe
x=0.33333333333333331
y=0.33333333333333337
not equal

Next, we’ll compare within a tolerance epsilon, instead of an exact comparison. Again, modify your program. Notice the code if (abs(x - y) < epsilon).

//test.cpp
#include <iostream>
#include <cmath>
#include <limits>
using namespace std;

// Precision for floating point comparison
const double epsilon = 0.00001;

int main() {
  double x = 1.0 / 3.0;
  double y = 1.0 - (2.0 / 3.0);
  cout.precision(std::numeric_limits<double>::max_digits10);
  cout << "x=" << x << endl;
  cout << "y=" << y << endl;
  if (abs(x - y) < epsilon) {
    cout << "equal" << endl;
  } else {
    cout << "not equal" << endl;
  }
}

Compile and run. Notice that the comparison now reports equal.

$ g++ test.cpp -o test.exe
$ ./test.exe
x=0.33333333333333331
y=0.33333333333333337
equal
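
The same idea can be packaged as a small helper for use with assert in unit tests. This is a sketch: the name almost_equal and the tolerance value are illustrative choices, not part of the provided starter code.

```cpp
#include <cassert>
#include <cmath>
using namespace std;

// Tolerance for floating point comparison (illustrative value)
const double epsilon = 0.00001;

// Hypothetical helper: returns true if x and y differ by less than epsilon.
// fabs is the floating point absolute value function from <cmath>.
bool almost_equal(double x, double y) {
  return fabs(x - y) < epsilon;
}
```

In a test, you could then write assert(almost_equal(x, y)); rather than assert(x == y); whenever the expected value is the result of floating point arithmetic.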