GitHub Workflows Recap

  • a “conversation” of code amongst us via pull requests & issues, with every change logged in the version history

  • flows:

    1. fork and pull request: fork, clone, pull, (branch,) commit, push and pull request

      • read only (no write) permissions on original repository

      • eg bren-ucsb/env-info fork to bbest/env-info

      • ie <org>/<repo> fork to <user>/<repo>

      • to update:

        • pull request <user>/<repo> -> <org>/<repo>, or

        • pull request <org>/<repo> -> <user>/<repo>

    2. pull and push: clone, pull, (branch,) commit and push

      • read and write permissions on original repository

      • eg bbest push directly to whaleroute/whaleroute.github.io

      • ie <user> push directly to <org>/<repo>

      • see GitHub Flow for the branching model

Where am I? Getting around in the Command Line

Knowing your present working directory is critical to using “relative” paths, ie paths relative to your present working directory. Relative paths (eg somedir/somefile.csv) are often preferred over “absolute” paths (eg C:/somedir/somefile.csv) since the project’s root folder can move around on the machine, or even to a different machine, and the paths still work, whereas an absolute path is locked to one exact machine-specific location. Here are a couple of aspects to keep in mind, however, when knitting Rmarkdown (*.Rmd) files:

  • When you open an RStudio project, the default present working directory is the top level folder for that project (the folder containing the *.Rproj file).

  • When you “Knit” an Rmarkdown file (*.Rmd), the working directory is set to the folder containing the *.Rmd and a new workspace is used.

The above differences mean that when writing chunks of R code, a path can work in the Console and fail when you go to “Knit” the Rmarkdown file (*.Rmd), or vice versa.
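For instance, here is a minimal sketch of that mismatch, using the species.csv file read later in this lesson:

# in the Console, with the working directory at the project root, this works:
read.csv('data/r-ecology/species.csv')

# when knitting students/<user>.Rmd, the working directory is students/,
# so reaching the same file requires stepping up one level:
read.csv('../data/r-ecology/species.csv')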

So let’s review some basic commands for navigating directories in both shell commands and R commands.

Bash Shell

The bash shell is the most common Unix-based command shell, found on Linux and Mac machines. It gets emulated on Windows by the Git Bash application installed along with git. Natively, Windows uses the less powerful DOS command prompt, which uses cd (for both pwd and cd) and dir (instead of ls).

# present working directory
pwd

# change working directory
cd

# list files
ls

# list files that end in '.jpg'
ls *.jpg

Note the use of the wildcard * to indicate any set of characters.

R

Now play with the same commands commented above, but in R.

# present working directory
getwd()

# change working directory
setwd('.')

# list files
list.files()

# list files that end in '.jpg'
list.files(pattern=glob2rx('*.jpg'))

# file exists
file.exists('test.png')

Look at the help for list.files() (?list.files, or F1 with the cursor over list.files() in the editing window) to see that the pattern argument expects a regular expression; glob2rx() converts the wildcard or globbing pattern into a regular expression.
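As a quick check in the Console, you can see the regular expression that glob2rx() produces (a minimal sketch; the second call is equivalent to the one above):

# glob2rx() converts the wildcard (globbing) pattern into a regular expression
glob2rx('*.jpg')  # returns "^.*\\.jpg$"

# so the earlier call is equivalent to passing that regular expression directly
list.files(pattern = '^.*\\.jpg$')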

To work on your students/<user>.Rmd, I recommend getting the Console and your Rmarkdown file onto the same working directory:

setwd('students')
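You can then confirm in the Console that you landed where you expected (basename() returns just the last folder of the path):

# confirm the Console's working directory now matches the folder with your .Rmd
getwd()
basename(getwd())  # should print "students"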

Install Packages

# Run this chunk only once in your Console
# Do not evaluate when knitting Rmarkdown

# list of packages
pkgs = c(
  'readr',        # read csv
  'readxl',       # read xls
  'dplyr',        # data frame manipulation
  'tidyr',        # data tidying
  'nycflights13', # test dataset of NYC flights for 2013
  'gapminder')    # test dataset of life expectancy and population

# install (and load) packages if not found
for (p in pkgs){
  if (!require(p, character.only = TRUE)){
    install.packages(p)
    library(p, character.only = TRUE)
  }
}

The gapminder dataset is “an excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007” (CRAN). Gapminder was the brainchild of Hans Rosling, who famously gave the TED Talk “The best stats you’ve ever seen”.
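Once installed, you can take a quick look at the dataset in the Console; the gapminder package provides a data frame of the same name:

library(gapminder)

# columns: country, continent, year, lifeExp, pop, gdpPercap
head(gapminder)
summary(gapminder)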

Readings

These are the main R packages we’ll be learning about this week: readr, tidyr and dplyr.

Reading CSV

utils::read.csv

Traditionally, you would read a CSV like so:

d = read.csv('../data/r-ecology/species.csv')
d
head(d)
summary(d)

readr::read_csv

Better yet, try read_csv:

library(readr)

d = read_csv('../data/r-ecology/species.csv')
d
head(d)
summary(d)

What are the differences in the data types of the columns when using read.csv vs read_csv? Especially compare the character and factor data types. For an intriguing read on the perils of using factors, check out level 8.2 of The R Inferno (R_inferno.pdf), which works through 9 levels of R hell (yes, a Dante reference).
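To see the difference yourself, compare the column classes side by side (a minimal sketch, reusing the species.csv path from above; in older versions of R, read.csv() defaults strings to factors, while read_csv() keeps them as characters):

# column classes from base read.csv() vs readr::read_csv()
sapply(read.csv('../data/r-ecology/species.csv'), class)
sapply(read_csv('../data/r-ecology/species.csv'), class)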

dplyr Demo

When performing data analysis in R, code can become quite messy, making it hard to revisit and determine the sequence of operations. Commenting helps. Good variable names help. Still, at least two common issues make code difficult to understand: multiple variables and nested functions. Let’s examine these issues by working through an analysis that presents both problems, and then see how dplyr offers an elegant alternative.

For example, let’s ask of the surveys.csv dataset: How many observations of species ‘NL’ appear each year?

Pseudocode

You can write the logic out as pseudocode, which can later become comments for the actual code:

# read in csv
# view data
# limit columns to species and year
# limit rows to just species "NL"
# get count per year
# write out csv

Multiple Variables

Now let’s approach this code sequentially using base functions, ie functions available natively in R without the need for additional libraries.

# read in csv
surveys = read.csv('../data/r-ecology/surveys.csv') 

# view data
head(surveys)
summary(surveys)

# limit columns to species and year
surveys_2 = surveys[,c('species_id', 'year')]

# limit rows to just species "NL"
surveys_3 = surveys_2[surveys_2$species_id  == 'NL',]

# get count per year
surveys_4 = aggregate(species_id ~ year, data=surveys_3, FUN='length')

# write to csv
write.csv(surveys_4, 'data/surveys_bbest.csv', row.names = FALSE)

Because the variables are named sequentially, ie surveys_2 to surveys_4, this is relatively easy to follow, but in the course of playing with data the names are often far less systematic, and we quickly lose track of which operations got applied to which variables.

Even with obvious variable names, there is a redundancy, as we’ll see shortly, in assigning a new variable name to the output of each operation only to pass it as the input of the next.

Nested Functions

Another common programming trick that avoids naming intermediate variables is to nest the output of one function as the input of the next.

# read in data
surveys = read.csv('../data/r-ecology/surveys.csv') 

# view data
head(surveys)
summary(surveys)

# limit data with [], aggregate to count, write to csv
write.csv(
  aggregate(
    species_id ~ year, 
    data = surveys[surveys$species_id == 'NL', c('species_id', 'year')], 
    FUN = 'length'), 
  'data/surveys_bbest.csv',
  row.names = FALSE)

So the code starts the same and continues using the same functions, but those functions now get applied from the innermost arguments outward to the containing functions, ie in a nested manner:

  1. surveys gets sliced [] into rows and columns in one call, which gets used as the data = argument to

  2. aggregate(), which applies the length() function to the formula species_id ~ year, counting species_id values within groups defined by year; its result then gets passed as the unnamed first argument to

  3. write.csv(), which takes an additional unnamed argument specifying the output file and a named argument turning off the default of writing row names.

Although we’ve saved space by not naming extra variables, we’ve made the code very difficult to read, since we must parse which functions are arguments to which other functions. The indentation helps readability a bit, but now let’s examine a far better solution than either approach above: dplyr.

Elegance with dplyr

Next, we’ll use the libraries readr, for improved versions of reading and writing csv files, and dplyr, for advanced data frame manipulation. Most importantly, dplyr uses the “then” operator %>%, which transfers the output on the left into the first argument of the function on the right. Most simply, surveys %>% summary() transfers the surveys data frame into the first argument of the summary() function. Use of this chaining operator seems excessive in such a simple example, but it is powerful when chaining together many operations on the same data frame. We’re able to efficiently write out operations and get past the previous problem of multiple variable names, without the obfuscation of nesting.

# load libraries
library(readr)
library(dplyr)
library(magrittr) # for %T>%

# read in csv
surveys = read_csv('../data/r-ecology/surveys.csv') 

# dplyr elegance
surveys %T>%                          # note tee operator %T>% for glimpse
  glimpse() %>%                       # view data
  select(species_id, year) %>%        # limit columns
  filter(species_id  == 'NL') %>%     # limit rows
  group_by(year) %>%                  # get count by first grouping
  summarize(n = n()) %>%              #   then summarize
  write_csv('data/surveys_bbest.csv') # write out csv

Now we can read from the top, starting with the data frame surveys, to see a very clear sequence of operations:

  1. glimpse()
  2. select()
  3. filter()
  4. group_by()
  5. summarize()
  6. write_csv()

Arguments are minimal, without repeating the name of the data frame or even needing quotation marks around column names.
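For instance, compare how selecting the same two columns reads in base R versus dplyr (a small side-by-side using the surveys data frame from above):

# base R: the data frame name is repeated and column names are quoted
surveys[, c('species_id', 'year')]

# dplyr: bare, unquoted column names and no repetition inside the call
select(surveys, species_id, year)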

The “tee” operator %T>% is similar to the “then” operator %>% in that the left side is passed to the right, but it is the left side (not the result of the right side) that then continues down the chain. This is useful here for glimpse(), since its output is simply printed to the Console and does not otherwise return the data frame needed to continue the sequence of operations. So the “tee” operator %T>% is most useful for injecting intermediate operations, like printing or plotting, that wouldn’t otherwise return an object for continuing operations.
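Here is a minimal sketch of the tee operator on a toy vector, separate from the surveys example; hist() is called purely for its plotting side effect, yet the original numbers keep flowing on to sum():

library(magrittr)

rnorm(100) %T>%  # tee: pass the vector to hist() for plotting...
  hist() %>%     # ...but send the original vector, not hist()'s result, onward
  sum()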

Summary

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information. - NYTimes (2014)

The tidyr and dplyr packages were created by Hadley Wickham of ggplot2 fame. The “gg” in ggplot2 stands for the “grammar of graphics”. Hadley similarly considers the functionality of the two packages dplyr and tidyr to provide the “grammar of data manipulation”.

Presentation