p1-stats

EECS 280 Project 1: Statistics

Due: Tuesday, 16 January 2018, 8pm

How Couples Meet and Stay Together is a research study that surveyed how Americans met their spouses and romantic partners, and compared traditional to nontraditional couples. In this project, you will write a program to analyze data from this research study.

You will write two code modules in two files, as well as a number of unit tests. In stats.cpp, you will write functions that compute basic statistics like mean and standard deviation. You’ll test your functions using unit tests, which you will add to stats_tests.cpp. Finally, in main.cpp, you will write a driver program that reads a data file from the study and computes statistics.

This project uses the Standard Template Library (STL) vector, which is similar to an array. If you’ve never seen a vector before, check out Appendix B: vector How-to at the end of this document. Vectors are pretty convenient to use, and the appendix gives examples of exactly how to use them for this project.

Project Roadmap

  1. Complete the setup tutorial
  2. Implement the statistics library in stats.cpp
  3. Test the statistics library in stats_tests.cpp
  4. Write the top-level program in main.cpp

This is a complete list of all files in this project:

File                    Description
Makefile                Helper commands for building and submitting
main_test.in            Inputs for main program
main_test.out.correct   Correct output of main program
main_test_data.tsv      Data file for main program
p1_library.cpp          Provided code implementations
p1_library.h            Provided code function prototypes
stats_tests.cpp         Tests for statistics library. Add tests to this file.
stats_public_test.cpp   A “does my code compile” test case
stats.h                 Function prototypes for statistics library
stats.cpp               Function implementations for statistics library. Write this.
main.cpp                Main program. Write this.

The starter files are available at https://eecs280staff.github.io/p1-stats/starter-files.tar.gz. You should have downloaded and unpacked them in the starter tutorial.

We have provided several functions for your convenience in p1_library.h and p1_library.cpp. These will make your life MUCH easier!

Statistics Library

The statistics library provides functions for another program to use. We have provided stats.h, which has function prototypes and a description of what each function does. Implement these functions in stats.cpp. You will also write test cases for these functions in stats_tests.cpp. Compile and run your test cases like this (all on one line).

$ g++ -Wall -Werror -pedantic -g --std=c++11 stats_tests.cpp stats.cpp p1_library.cpp -o stats_tests
$ ./stats_tests

Note that you don’t type the $ symbol; it just means “this is the command prompt.”

Keep in mind that stats_tests.cpp doesn’t depend on main.cpp.

We’ve also provided a public test case that will help you verify that your code compiles correctly. You can compile and run the stats_public_test executable with the following commands:

$ g++ -Wall -Werror -pedantic -g --std=c++11 stats_public_test.cpp stats.cpp p1_library.cpp -o stats_public_test
$ ./stats_public_test

Statistics Program

Our statistics program will first ask the user for a filename and which column of data to read from the file. Then, it will compute several statistics and print a summary to standard output.

Input file format

Input files are in tab separated value (.tsv) format. The first line is a header, which has names for each column. Following lines contain numerical values. We have provided a simple example in main_test_data.tsv.

A        B
1        6
2        7
3        8
4        9
5        10

Example

Let’s run a complete example of the main program. First, we’ll compile and run the program at the command line.

$ g++ -Wall -Werror -pedantic --std=c++11 -g main.cpp stats.cpp p1_library.cpp -o main
$ ./main

Our program asks for a file name

enter a filename

The user types in a file name

main_test_data.tsv

Next, it asks the user for a column name

enter a column name

And we type that in, too

B

Print an informational message about the column and file

reading column B from main_test_data.tsv

Next, print a summary of the data, followed by a blank line.

Summary (value: frequency)
6: 1
7: 1
8: 1
9: 1
10: 1

And finally, print these statistics

count = 5
sum = 40
mean = 8
stdev = 1.58114
median = 8
mode = 6
min = 6
max = 10
  0th percentile = 6
 25th percentile = 7
 50th percentile = 8
 75th percentile = 9
100th percentile = 10

Next, let’s automate this so we can quickly check our program. First, we’ll use a file to contain all the user input. It’s annoying to keep typing it, so let’s type it once, and then reuse the file, called main_test.in.

main_test_data.tsv
B

Now you can run your program and redirect the input from a file. This way you don’t have to type it every time! Notice that no user input shows up in the output, where the user typed main_test_data.tsv and B in the previous example.

$ ./main < main_test.in
enter a filename
enter a column name
reading column B from main_test_data.tsv
Summary (value: frequency)
6: 1
7: 1
8: 1
9: 1
10: 1

count = 5
sum = 40
mean = 8
stdev = 1.58114
median = 8
mode = 6
min = 6
max = 10
  0th percentile = 6
 25th percentile = 7
 50th percentile = 8
 75th percentile = 9
100th percentile = 10

Instructions for setting up redirection with your debugger:

Next, we’ll run our program again and save the output to file instead of printing it to the terminal. Hint: press the up arrow (or control-p) to avoid retyping the command. We will redirect output to a file called main_test.out, and at the same time read user input from main_test.in.

$ ./main < main_test.in > main_test.out

Notice that there’s no output at the console, but we can peek at the output that was redirected to the file:

$ cat main_test.out

The last piece is a correct answer to compare against. We’ve provided the correct output in a file called main_test.out.correct. We could look at this file and compare it line by line with main_test.out, but let’s make the computer do it for us!

$ diff main_test.out main_test.out.correct

If there’s no output, that means the files match. If there is a problem, you can debug it using sdiff, which shows the two files side by side so you can see where they are similar and where they are different.

$ sdiff main_test.out main_test.out.correct

If the output gets too long to see, then you can send the output of the sdiff command to the input of the less command using a pipe (| character). less is a pager, which lets you use the arrow keys to move up and down in the output, and quit using q.

$ sdiff main_test.out main_test.out.correct | less

Whew! That was a lot of typing. Let’s make it even easier. make is a command line program that remembers long commands for you. It reads a file in the same directory named Makefile. I know, it’s weird that it doesn’t have a file extension, but it’s just a plain text file. Makefiles run commands for you! We provided one for you and here’s an example that compiles and runs all the tests:

$ make test

We can also delete the temporary files created by the Makefile, like the executables:

$ make clean

Real Data

Want to try it out with real data from the How Couples Meet and Stay Together study?

  1. Download the data from http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/30103?q=30103.

    Scroll down to “Datasets” and click “Download all files”. You might need to register with your umich Google account.

  2. Unzip the file
    $ unzip ICPSR_30103.zip
    
  3. The data file in tsv format is ICPSR_30103/DS0001/30103-0001-Data.tsv
  4. The first line of the tsv file lists the variables in the study.

    Another file called the codebook describes the variables: ICPSR_30103/DS0001/30103-0001-Codebook.pdf

Let’s see how many survey respondents have a spouse or partner:

$ ./main 
enter a filename
ICPSR_30103/DS0001/30103-0001-Data.tsv
enter a column name
QFLAG
reading column QFLAG from ICPSR_30103/DS0001/30103-0001-Data.tsv
Summary (value: frequency)
1: 3009
2: 993

count = 4002
sum = 4995
mean = 1.24813
stdev = 0.431979
median = 1
mode = 1
min = 1
max = 2
  0th percentile = 1
 25th percentile = 1
 50th percentile = 1
 75th percentile = 1
100th percentile = 2

After reading the codebook, ICPSR_30103/DS0001/30103-0001-Codebook.pdf, we can see that “1” means partnered and “2” means no spouse or partner.

How many respondents identified as gay, lesbian or bisexual?

$ ./main
enter a filename
ICPSR_30103/DS0001/30103-0001-Data.tsv
enter a column name
GLBSTATUS
reading column GLBSTATUS from ICPSR_30103/DS0001/30103-0001-Data.tsv
Summary (value: frequency)
0: 3047
1: 955

count = 4002
sum = 955
mean = 0.238631
stdev = 0.4263
median = 0
mode = 0
min = 0
max = 1
  0th percentile = 0
 25th percentile = 0
 50th percentile = 0
 75th percentile = 0
100th percentile = 1

We can see that 955 people identified as gay, lesbian or bisexual.

Tips, Tricks and Restrictions

Put all of your statistics functions in stats.cpp and all statistics function tests in stats_tests.cpp. Write your statistics program in main.cpp.

These are the only libraries you may use:

#include "stats.h"
#include "p1_library.h"
#include <iostream>
#include <string>
#include <vector>
#include <cassert>
#include <cmath>
#include <iomanip>

No global variables or static variables.

DO NOT INCLUDE a main function in your stats.cpp file. Remember, stats.cpp is a library of functions that main.cpp uses. The functions in stats.cpp must still work when compiled with a different main function.

Testing

Testing is just as important as writing the original code! Write unit tests for each statistics function (the functions in stats.h). These are the requirements for tests:

Protip: Write tests for the functions first (i.e., write tests for median(), and then implement median()). It sounds like a pain, but you gain two important things by coding this way:

  1. You avoid being under the illusion that your code works when it’s actually full of bugs.
  2. When you make changes to code that you wrote previously, you can re-run your test cases and immediately know if you broke something (yes, you will break things).

This practice is called test-driven development.

Submission and Grading

Submit main.cpp, stats.cpp, and stats_tests.cpp to the autograder at https://autograder.io.

We will grade your code on functional correctness and the presence of test cases.

Acknowledgments

The original project was written by Andrew DeOrio, spring 2015.

This project is based on research work by Rosenfeld, Michael J., Reuben J. Thomas, and Maja Falcon. How Couples Meet and Stay Together (HCMST), Wave 1 2009, Wave 2 2010, Wave 3 2011, Wave 4 2013, United States. ICPSR30103-v7. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2014-09-02. http://doi.org/10.3886/ICPSR30103.v7

Appendix A: Percentile formula

You should use the following formula to compute percentile. Note that this formula uses indexing from 1. You need to adapt it to use indexing from 0.

[Figure: percentile formula]

Appendix B: vector How-to

A vector is a data structure that is part of the Standard Template Library (STL). Vectors are great for storing sequences of items, and we’ll store a sequence of doubles in this project. You can use a vector by adding #include <vector> to the top of your program. Here is an example program that shows how to use a vector.

//vector_test.cpp
#include <iostream> //for cout
#include <vector>   //for vector
using namespace std;


int main() {
  //create a vector that holds doubles
  vector<double> v;

  //fill it with {1.0, 2.0, 3.0}
  v.push_back(1.0);
  v.push_back(2.0);
  v.push_back(3.0);

  // check the size, which is how many items live inside the vector
  cout << "There are " << v.size() << " elements in v\n";

  // access each item in the vector and print it
  for (int i=0; i < int(v.size()); i += 1) {
    cout << "v[" << i << "] = " << v[i] << "\n";
  }

  return 0;
}

Here’s how to compile and run

$ g++ -Wall -Werror -pedantic --std=c++11 -g vector_test.cpp -o vector_test
$ ./vector_test

Further reading: http://www.cplusplus.com/reference/vector/vector/

Appendix C: sqrt How-to

You might find the square root function helpful in this project. It’s called sqrt (pronounced “squirt”) and lives in the cmath library. Here’s an example:

//sqrt_test.cpp
#include <iostream> //for cout
#include <cmath>    //for sqrt
using namespace std;

int main() {
  cout << "the square root of 4 is " << sqrt(4) << "\n";
  return 0;
}

Appendix D: modf How-to

modf breaks a double into its integral and fractional parts.

//modf_test.cpp
#include <iostream> //for cout
#include <cmath>    //for modf
using namespace std;

int main() {

  double pi = 3.14159265;
  double fractpart = 0;
  double intpart = 0;

  //use modf to extract fractional part and integral part of pi
  fractpart = modf(pi , &intpart);
  cout << pi << " = " << intpart << " + " << fractpart << "\n";

  return 0;
}

Further reading: http://www.cplusplus.com/reference/cmath/modf/

Appendix E: assert How-to

assert is a programmer’s best friend. In this project, we’ll use it for checking the output of a function in a test program.

When the input to assert() is true, it does nothing. When the input to assert() is false, it crashes the program with a helpful debugging message. Here’s an example program:

//assert_test.cpp
#include <cassert> //for assert
#include <cmath>   //for sqrt
using namespace std;

int main() {
  //Check that the square root of 4 is 2.  This will pass :)
  assert(sqrt(4) == 2);

  //Check that the square root of 4 is 1.  This will fail :(
  //When an assertion fails, it prints a helpful debugging message
  assert(sqrt(4) == 1);

  return 0;
}