Project description

Timeline

Proposal due Friday, March 10th

Draft due Friday, April 7th

Peer review in lab Monday, April 10th

Presentation + slides and final GitHub repo due Monday, April 24th to GitHub

Project report due Wednesday, April 26th to Gradescope

In addition to the above, part of your grade will come from a survey evaluating your group members, to be completed by the exam period.

About the project

Find a data set, develop a question you can answer with the data, and answer it.

Proposal 5pts

  • Find 3 data sets of interest. Each data set must have a mix of categorical and numeric variables and contain at least 200 observations and 6 variables, or have prior approval from Prof. Fisher. You may not use any data set we’ve used in class (including labs, homeworks, exams, aes, etc.).

  1. Identify the source of the data, when and how it was originally collected (by the curator, not necessarily how you found the data), and give a brief description of the observations.

  2. Identify a research question and associated hypothesis you can answer with each data set (and which variables will help you answer the question!).

  3. For each data set:

    • Place the file containing your data in the data folder of the project repo.

    • At the end of your document, provide a glimpse() of each data set.
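
As a rough illustration of the requirements above, the following sketch checks a candidate data set against the size and variable-type rules and then prints a glimpse of it. The helper name `meets_requirements()` and the simulated `candidate` data frame are hypothetical stand-ins, not part of the assignment; in a real proposal you would read your own file from the data folder instead. The sketch falls back to `str()` if dplyr is not installed.

```r
# Hypothetical helper: TRUE if a data frame satisfies the proposal's stated
# requirements (at least 200 observations, at least 6 variables, and a mix
# of categorical and numeric variables).
meets_requirements <- function(df, min_rows = 200, min_cols = 6) {
  is_numeric <- vapply(df, is.numeric, logical(1))
  is_categorical <- vapply(
    df,
    function(x) is.character(x) || is.factor(x),
    logical(1)
  )
  nrow(df) >= min_rows &&
    ncol(df) >= min_cols &&
    any(is_numeric) &&
    any(is_categorical)
}

# Simulated stand-in for a real data set (values are made up for illustration).
set.seed(1)
candidate <- data.frame(
  id     = seq_len(250),
  group  = sample(c("a", "b", "c"), 250, replace = TRUE),
  site   = factor(sample(c("north", "south"), 250, replace = TRUE)),
  height = rnorm(250, mean = 170, sd = 10),
  weight = rnorm(250, mean = 70, sd = 8),
  age    = sample(18:80, 250, replace = TRUE)
)

# Check the requirements before committing to the data set.
meets_requirements(candidate)

# Print a compact overview of the columns and their types.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(candidate)
} else {
  str(candidate)
}
```

A data set that fails the check can still be used with prior approval, as noted above.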

Your proposal should be no longer than 1 page (not including the glimpses). After you submit your proposal, I will offer feedback and help you decide which data set to choose for the final project. For this reason, please rank your proposal data sets with your favorite first.

Where to find data?

Report 50pts

The written report is worth 50 points, broken down as follows:

Introduction 7pts

The introduction provides motivation and context for your research.

To begin, introduce the data set in a few short sentences.

Complete the introduction by providing a concise, clear statement of your research question and hypotheses. Be sure to motivate why the research question is interesting / useful.

Example research question and hypotheses:

Can we predict body mass with bill depth? We hypothesize that penguins with deeper bills will also have more mass.
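
To show how a question like this maps onto an analysis, here is a minimal sketch that fits a simple linear regression of body mass on bill depth. The data are simulated purely for illustration (the variable names, sample size, and the positive relationship are all assumptions built into the simulation, not real penguin measurements), so the output says nothing about actual penguins.

```r
# Simulated data: a positive relationship is built in by construction.
set.seed(42)
n <- 200
bill_depth_mm <- rnorm(n, mean = 17, sd = 2)
body_mass_g <- 1000 + 190 * bill_depth_mm + rnorm(n, sd = 300)
penguins_sim <- data.frame(bill_depth_mm, body_mass_g)

# Fit body mass as a linear function of bill depth.
fit <- lm(body_mass_g ~ bill_depth_mm, data = penguins_sim)

# Estimate, standard error, t value, and p-value for the slope.
# The hypothesis corresponds to a positive slope.
slope <- coef(summary(fit))["bill_depth_mm", ]
slope
```

With your own data, replace `penguins_sim` with the data set you chose and check whether the sign and size of the slope support your hypothesis.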

Methodology 15pts

Here you should introduce any statistical methods you use and explain why you chose them to answer your question. You might also include any preliminary summary statistics or figures you used to explore the data.

Results 15pts

Place figure(s) here to illustrate the main results from your analysis. One beautiful figure is worth more than several poorly formatted figures. You must have at least one figure.

Provide only the main results from your analysis. The goal is not to do an exhaustive data analysis (calculate every possible statistic and create every possible model for all variables). Rather, you should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in writing about and interpreting results, and that you can accomplish these tasks using R. More is not better.

Discussion 8pts

This section is a conclusion and discussion. You should:

  1. Summarize your main finding in a sentence or two.

  2. Discuss your finding and why it is useful (put in the context of your motivation from the introduction).

  3. Critique your own analyses and include a brief paragraph on what you would do differently if you were able to start the project over.

  4. List a brief (1 or 2 sentence) summary of the relative contributions of each team member. E.g. “Aang built the models, Katara implemented them in R, and Sokka wrote the introduction and discussion.”

  5. Note: all team members should be comfortable describing all aspects of the project and understanding all code.

Formatting 5pts

Your written report should be professionally formatted. This means writing in complete sentences, labeling graphs and figures, turning off code chunks so that code does not appear in the rendered report, and following typical style guidelines. The only sections your report may contain are Introduction, Methodology, Results, and Discussion. You should include a citation of your data set, formatted in any style of your choosing (e.g. MLA, APA, etc.). If you include multiple citations, their formatting must be consistent.
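
If the report is written in R Markdown or Quarto with the knitr engine (an assumption; this page does not name the tool), one common way to turn off code chunks is to set chunk defaults in a setup chunk near the top of the document. A minimal sketch:

```r
# Assumption: the report is rendered with knitr (R Markdown or Quarto).
# Place this in a setup chunk near the top of the report.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo = FALSE,     # do not print source code in the rendered report
    message = FALSE,  # suppress package start-up messages
    warning = FALSE   # suppress warnings in the rendered output
  )
}
```

The code still runs, so figures and tables appear in the report; only the code itself and the messages are hidden.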

Presentation 30pts

Presentations will take place in class during the last lab of the semester.

The presentation must be no longer than 5 minutes. This will be strictly enforced.

Reproducibility + organization

All written work (with exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.

Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo.

The repo should be neatly organized as described above: there should be no extraneous files, and all text in the README should be easily readable.

Teamwork

You will be asked to fill out a survey where you rate the contribution and teamwork of each team member.

Filling out the survey is a prerequisite for getting credit on the teamwork portion of the grade.

The teamwork survey, together with GitHub commits, will be used to measure individual contribution to the assignment. All group members are expected to participate equally. In the event of team concerns and low-effort commits, individual grades may differ from the rest of the group.

If you have concerns with the teamwork and/or contribution from any team members, please reach out to me or the head TA as early as possible.

Overall grading

The grade breakdown is as follows:

Total                            100 pts
Project proposal                   5 pts
Peer review                        5 pts
Final report                      50 pts
Slides + presentation             30 pts
Reproducibility and organization   5 pts
Teamwork                           5 pts
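
As a quick sanity check on the arithmetic, the sketch below adds up the components listed above and the report sections listed earlier. The vector names are made up for illustration; the point values are taken from this page.

```r
# Point values as listed in the grade breakdown.
grade_breakdown <- c(
  proposal                     = 5,
  peer_review                  = 5,
  final_report                 = 50,
  slides_presentation          = 30,
  reproducibility_organization = 5,
  teamwork                     = 5
)

# Point values for the sections of the written report.
report_sections <- c(
  introduction = 7,
  methodology  = 15,
  results      = 15,
  discussion   = 8,
  formatting   = 5
)

total_points <- sum(grade_breakdown)
report_points <- sum(report_sections)

# The components add up to the stated total, and the report sections
# add up to the points assigned to the final report.
stopifnot(
  total_points == 100,
  report_points == grade_breakdown[["final_report"]]
)

# Share of the total grade carried by each component, in percent.
round(100 * grade_breakdown / total_points)
```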