Week 3: Reading and Wrangling Data
Objectives
This week we’ll dive into reading and manipulating, i.e. “wrangling” (cowboy style), the data. Yeehaw!
We’ll start by recapping the “conversation on code” we began on Github, especially through pull requests and issues.
We’ll also hear about best practices for data management from UCSB librarian Stephanie Tulley.
Schedule
- 8:30 - 9:30 am: Wrangling Data (individual)
  - wk03_dplyr: recap Github, command line navigation, readr, dplyr, tidyr
  - wrangling-webinar.pdf
  - individual assignment to work on `env-info/students/<user>.Rmd`
- 9:30 - 10:30 am: Data Management Plan (group)
  - Break [10 min]
  - Introduction to the Data Management Planning Tool (DMP Tool) by Stephanie Tulley from UCSB Library
  - group assignment to generate a data management plan
- 10:30 - 11:30 am: Wrangling Data (group)
  - group assignment to generate a data management plan
Assignment
Due: Jan 28, Thursday 5pm
Individual
- Ensure you have the latest from `ucsb-bren/env-info` by issuing a pull request to your `<user>/env-info`. (You may need to “switch the base”.) Since you have write permissions on `<user>/env-info`, you should then merge the changes.
- Work through the [**wk03_dplyr**](/ESM296-3W-2016/wk03_dplyr.html) and wrangling-webinar.pdf materials by typing in code as R chunks into your `env-info/students/<user>.Rmd`. I recommend starting this section with a `## Data Wrangling` header and using subheaders below to match the instructions, like `### Multiple Variables`. Be sure to knit to `students/<user>.html`, commit changes locally with a message, push to your `github.com/<user>/env-info` and submit as a pull request to `github.com/ucsb-bren/env-info`.
Group
- Generate a Data Management Plan
  - Use the DMP Tool and select the DMP Template for National Science Foundation > NSF-EAR: Earth Sciences.
  - Transfer the headings and your group’s specific text into an `index.Rmd` from your group project’s `<org>.github.io` repository. When you knit the `index.Rmd`, the output `index.html` will become your group project’s home page viewable at `http://<org>.github.io`.
  - Per your github workflow, be sure to pull the latest changes from other members, commit changes with a message, and push to your `github.com/<org>/<repo>`.
  - When I look at the github blame history of your group’s `index.Rmd` file, I want to see that every member has contributed by pulling and pushing changes from their computer.
- Wrangle Data
  - Add a `data` folder and csv/xls/etc files inside. (Note that Git does not recognize empty folders; a folder is tracked only once it has files inside.)
  - At the bottom of your group repo’s `index.Rmd`, add a header `## Data Question` and type a question similar to those in [**wk03_dplyr**](/ESM296-3W-2016/wk03_dplyr.html) for a csv of your choice (besides `surveys.csv` and hopefully relevant to your group’s area of study) like _How many observations of species ‘NL’ appear each year?_ Answering your question should require chaining the following `dplyr` functions:
    - `select()`
    - `filter()`
    - `group_by()`
    - `summarize()`

    Include the R chunk below the question and knit the `index.Rmd` into `index.html`. Be sure to push your results so they show up on the site `http://<org>.github.io`.

    When I look at the Blame for your `index.Rmd`, I would like to see that every member of the group contributed. You can make up another question, add comments, improve code, etc.
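As a rough illustration of the kind of chain the Data Question asks for, here is a minimal sketch that answers the example question about species ‘NL’. The small `surveys` data frame, its column names (`record_id`, `year`, `species_id`, `weight`) and its values are made up for this example and are assumptions, not the course data; your group’s csv will have its own columns, and you would typically load it with `readr::read_csv()` instead.

```r
library(dplyr)

# Hypothetical stand-in for a survey csv; the column names and values
# are made up for illustration and are not from the course data.
surveys <- data.frame(
  record_id  = 1:6,
  year       = c(1977, 1977, 1978, 1978, 1978, 1979),
  species_id = c("NL", "DM", "NL", "NL", "DM", "NL"),
  weight     = c(40, 35, 42, 39, 36, 41)
)

nl_per_year <- surveys %>%
  select(species_id, year) %>%      # keep only the columns the question needs
  filter(species_id == "NL") %>%    # keep only the rows for species 'NL'
  group_by(year) %>%                # form one group per year
  summarize(n_obs = n())            # count the rows in each group

print(nl_per_year)
```

With this toy data the result has one row per year: 1 observation in 1977, 2 in 1978 and 1 in 1979. Each step hands its result to the next through the `%>%` pipe, so the chain reads top to bottom in the same order as the question.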
Resources
Command Line
Data Management
- Best Practices Primer | DataONE
- Data Management Guide for Public Participation | DataONE
- Education Modules | DataONE
Data Wrangling in R
Git, Github and RStudio
- Git and GitHub cheat sheet
- Git and GitHub with RStudio
- PLOS Computational Biology: A Quick Introduction to Version Control with Git and GitHub