# set working directory if has students directory and at R Console (vs knitting)
if ('students' %in% list.files() & interactive()){
    setwd('students')
}

# ensure working directory is students
if (basename(getwd()) != 'students'){
  stop(sprintf("WHOAH! Your working directory is not in 'students'!\n   getwd(): %s", getwd()))
}

Content


Is the health of the Chilean ocean changing regionally?

I would like to determine the health of Chile's ocean regions. By measuring ocean health across a portfolio of goals, I can better understand potential trade-offs between goals and propose management targets for the future.

Techniques

Useful techniques will be:

Data

I have identified data on jobs, wages, and revenues per region in Chile.

The data will be used to calculate the Livelihoods and Economies goal in each region of Chile. This goal focuses on avoiding the loss of ocean-dependent livelihoods and productive coastal economies while maximizing livelihood quality. The Status of this goal is the average of the Status of its two sub-goals: livelihoods (i.e., jobs and wages) and economies (i.e., revenues).
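As a rough sketch of that Status calculation in R (not the actual OHI code; the data frame and column names below are made up for illustration):

# minimal sketch: Status of Livelihoods & Economies as the average of two
# hypothetical sub-goal status columns (scores on a 0-100 scale)
library(dplyr)
regions = data.frame(
  region             = c('I', 'II', 'III'),  # made-up regions
  status_livelihoods = c(80, 65, 90),        # made-up livelihoods sub-goal scores
  status_economies   = c(70, 75, 85))        # made-up economies sub-goal scores
regions %>%
  mutate(status_LE = (status_livelihoods + status_economies) / 2)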

### Data File

# read csv
d = read.csv('data/faundez_Iregion.csv')

# output summary
summary(d)

Reading and Wrangling Data

Where am I? Getting around in the Command Line

Knowing your present working directory is critical to using "relative" paths, i.e. paths relative to your present working directory. Relative paths (e.g. somedir/somefile.csv) are often preferred over "absolute" paths (e.g. C:/somedir/somefile.csv) since the project's root folder can move around on the machine, or even to a different machine, and still work, whereas an absolute path is locked down to one exact machine-specific location. Here are a couple of aspects to keep in mind, however, when knitting Rmarkdown (*.Rmd) files:

When you open an RStudio project, the default present working directory is the top-level folder for that project (the folder containing the *.Rproj file).

When you “Knit” an Rmarkdown file (.Rmd), the working directory is set to the folder containing the .Rmd and a new workspace is used.

The above differences mean that when writing chunks of R code, a path can work in the Console and fail when you go to “Knit” the Rmarkdown file (*.Rmd), or vice versa.
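One defensive pattern (just a sketch, with a made-up file name) is to test a relative path and fall back to another candidate, so the same chunk works from the Console and when knitting:

# sketch: pick whichever relative path exists from the current working directory
csv = 'data/somefile.csv'       # hypothetical file, relative to the project root
if (!file.exists(csv)){
  csv = file.path('..', csv)    # e.g. when knitting from a subfolder like students/
}
d = read.csv(csv)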

So let’s review some basic commands for navigating directories in both shell commands and R commands.

Bash Shell

The bash shell is the most common Unix-based command shell, found on Linux and Mac machines. It is emulated on Windows by the Git Bash application installed with git. Natively, Windows uses the less powerful DOS command prompt, which uses cd (in place of both pwd and cd) and dir (instead of ls).

present working directory pwd

change working directory cd

list files ls

list files that end in '.jpg' ls *.jpg. Note the use of the * wildcard to indicate any set of characters.

R

Now play with the same commands commented above, but in R.

# present working directory
getwd()

# change working directory
setwd('.')

# list files
list.files()

# list files that end in '.jpg'
list.files(pattern=glob2rx('*.jpg'))

# file exists
file.exists('test.png')

Look at the help for list.files() (?list.files or F1 with cursor over list.files() in editing window) to see that the pattern argument expects a regular expression and glob2rx() changes the wildcard or globbing pattern into a regular expression.
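For example, you can inspect the translated pattern directly:

# glob2rx() turns the wildcard ("glob") pattern into the regular expression
# that the pattern argument of list.files() expects
glob2rx('*.jpg')                        # print the resulting regular expression
list.files(pattern = glob2rx('*.jpg'))  # equivalent to list.files(pattern = '\\.jpg$')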

To work on your students/*.Rmd, I recommend getting the Console and your Rmarkdown file to use the same working directory:

setwd('students')

Install Packages

# list of packages
pkgs = c(
  'readr',        # read csv
  'readxl',       # read xls
  'dplyr',        # data frame manipulation
  'tidyr',        # data tidying
  'nycflights13', # test dataset of NYC flights for 2013
  'gapminder')    # test dataset of life expectancy and population

# install (and load) packages if not found
for (p in pkgs){
  if (!require(p, character.only=T)){
    install.packages(p)
    library(p, character.only=T)
  }
}

The gapminder dataset is "an excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007" (CRAN). Gapminder was the brainchild of Hans Rosling, who famously gave the TED Talk: The best stats you've ever seen - Hans Rosling.
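A quick look at the dataset once the package is installed:

# peek at the gapminder data frame (country, continent, year, lifeExp, pop, gdpPercap)
library(gapminder)
head(gapminder)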

Readings

These are the main R packages we’ll be learning about this week:

readr: column types
dplyr: introduction
tidyr: tidy data
dplyr & tidyr: data wrangling cheatsheet

Reading CSV

utils::read.csv

Traditionally, you would read a CSV like so:

# read csv with base R
d = read.csv('../data/r-ecology/species.csv')

# inspect the data frame
d
head(d)
summary(d)

Multiple Variables

Now let's approach this code sequentially using base functions, i.e. functions natively available in R without the need for additional libraries.

# read in csv
surveys = read.csv("../data/r-ecology/surveys.csv") 

# view data
head(surveys)
summary(surveys)

# limit columns to species and year
surveys_2 = surveys[,c('species_id', 'year')]

# limit rows to just species "NL"
surveys_3 = surveys_2[surveys_2$species_id  == 'NL',]

# get count per year
surveys_4 = aggregate(species_id ~ year, data=surveys_3, FUN='length')

# write to csv
write.csv(surveys_4, "./data/surveys_faundez.csv", row.names = FALSE)

Because the variables are named sequentially, i.e. surveys_2 to surveys_4, the code is relatively easy to follow, but often in the course of playing with data the names become far less consistent, and we quickly lose track of which operations were applied to which variables.

Even with obvious variable names, there is redundancy, as we'll see shortly, in assigning a new variable name to the output of each operation only to use it as the input of the next.

Nested Functions

Another common programming trick to avoid naming intermediate variables is to nest the output of one function as the input of the next.

# read in data
surveys = read.csv("../data/r-ecology/surveys.csv") 

# view data
head(surveys)
summary(surveys)

# limit data with [], aggregate to count, write to csv
write.csv(
  aggregate(
    species_id ~ year, 
    data = surveys[surveys$species_id == 'NL', c('species_id', 'year')], 
    FUN = 'length'), 
  "../data/surveys_faundez.csv",
  row.names = FALSE)

So the code starts the same and continues using the same functions, but they are applied from the innermost arguments out to the containing functions, i.e. in a nested manner:

surveys gets sliced [] into rows and columns in one call, which gets used as the data = argument to

aggregate(), which applies the length() function to the formula species_id ~ year, counting the species_id values within each year group, and whose result gets further passed as the unnamed first argument to

write.csv(), which has an additional unnamed argument specifying the output file and a named argument turning off the default option of prefixing row numbers.

Although we've saved space by not naming extra variables, we've made the code very difficult to read, since the reader must parse which functions are arguments to which others. The indentation helps readability a bit, but now let's examine a far better solution than either approach above: dplyr.

Elegance with dplyr

Next, we'll use the library readr for improved versions of reading and writing csv files, and dplyr for advanced data frame manipulation. Most importantly, dplyr uses the "then" operator %>%, which transfers the output on the left to the first argument of the function on the right. Most simply, surveys %>% summary() transfers the surveys data frame into the first argument of the summary function. Use of this chaining operator seems excessive in this simple example, but it is powerful when chaining together many operations on the same data frame. We're able to efficiently write out operations and get past the previous problem of multiple variable names, without the obfuscation of nesting.
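As a minimal illustration of the pipe, using the surveys data frame read in earlier, the two calls below are equivalent:

# classic function call vs. the "then" operator: same result
library(dplyr)
summary(surveys)
surveys %>% summary()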

# load libraries
library(readr)
library(dplyr)
library(magrittr) # provides the tee operator %T>% used below

# read in csv
surveys = read_csv( "../data/r-ecology/surveys.csv") 

# dplyr elegance
surveys %T>%                          # note tee operator %T>% for glimpse
  glimpse() %>%                       # view data
  select(species_id, year) %>%        # limit columns
  filter(species_id  == 'NL') %>%     # limit rows
  group_by(year) %>%                  # get count by first grouping
  summarize(n = n()) %>%              #   then summarize
  write_csv('data/surveys_faundez.csv') # write out csv

Now we can read from the top, starting with the data frame surveys, to see a very clear sequence of operations:

glimpse()
select()
filter()
group_by()
summarize()
write_csv()

Arguments are minimal without repeating the name of the data frame, or even needing quotations in the case of column names.

The "tee" operator %T>% is similar to the "then" operator %>% in that the left side is passed to the right, but it is the left side itself, not the right side's result, that continues down the chain. This is useful here for glimpse(), since its output is simply printed to the Console and it does not otherwise return the data frame needed to continue the sequence of operations. So the "tee" operator %T>% is most useful for injecting intermediate operations like printing or plotting that wouldn't otherwise return an object for continuing operations.
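A minimal sketch of the difference, assuming magrittr is loaded for %T>% (the vector, not the histogram, continues down the chain):

library(magrittr)
rnorm(100) %T>%  # pass the vector to hist() for its side effect (the plot)...
  hist() %>%     # ...but keep piping the vector itself, not hist()'s return value
  summary()      # so summary() still sees the original numbers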

CO2 emissions assignment

tidy

Task. Convert the following table of CO2 emissions per country since 1970 from wide to long format and output the first few rows into your Rmarkdown. I recommend consulting ?gather; you should have 3 columns in your output.

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.3
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readxl) # install.packages('readxl')
## Warning: package 'readxl' was built under R version 3.2.3
# xls downloaded from http://edgar.jrc.ec.europa.eu/news_docs/CO2_1970-2014_dataset_of_CO2_report_2015.xls
xls = '../data/co2_europa.xls'

print(getwd())
## [1] "C:/Users/Patricia/Box Sync/Winter_2016/Environmental_Informatics/env-info/students"
co2 = read_excel(xls, skip=12)
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 0c 00 e0 00 00 00 2c 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 0c 00 e0 00 00 00 2c 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 0c 00 e0 00 00 00 2c 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 0c 00 e0 00 00 00 2c 00
co2
## Source: local data frame [220 x 46]
## 
##                Country  1970.000000  1971.000000  1972.000000  1973.000000
##                  (chr)        (dbl)        (dbl)        (dbl)        (dbl)
## 1          Afghanistan  1813.981234  1840.794051  1794.831482  1817.997422
## 2              Albania  4435.433037  4425.968296  5035.309230  4638.057137
## 3              Algeria 18850.745443 22226.320843 34278.432478 46375.728534
## 4       American Samoa     6.183023     6.327022     6.462429     6.587474
## 5               Angola  8946.495011  8532.101439 10382.218539 11365.512295
## 6             Anguilla     2.168006     2.168036     2.263426     2.108604
## 7  Antigua and Barbuda   196.161773   203.964395   244.721752   263.402867
## 8            Argentina 87478.401090 87827.313960 88888.066154 93076.186184
## 9              Armenia 10951.012839 10972.887313 11783.384976 12844.589903
## 10               Aruba    40.380782    40.635025    41.138578    41.949184
## ..                 ...          ...          ...          ...          ...
## Variables not shown: 1974.000000 (dbl), 1975.000000 (dbl), 1976.000000
##   (dbl), 1977.000000 (dbl), 1978.000000 (dbl), 1979.000000 (dbl),
##   1980.000000 (dbl), 1981.000000 (dbl), 1982.000000 (dbl), 1983.000000
##   (dbl), 1984.000000 (dbl), 1985.000000 (dbl), 1986.000000 (dbl),
##   1987.000000 (dbl), 1988.000000 (dbl), 1989.000000 (dbl), 1990.000000
##   (dbl), 1991.000000 (dbl), 1992.000000 (dbl), 1993.000000 (dbl),
##   1994.000000 (dbl), 1995.000000 (dbl), 1996.000000 (dbl), 1997.000000
##   (dbl), 1998.000000 (dbl), 1999.000000 (dbl), 2000.000000 (dbl),
##   2001.000000 (dbl), 2002.000000 (dbl), 2003.000000 (dbl), 2004.000000
##   (dbl), 2005.000000 (dbl), 2006.000000 (dbl), 2007.000000 (dbl),
##   2008.000000 (dbl), 2009.000000 (dbl), 2010.000000 (dbl), 2011.000000
##   (dbl), 2012.000000 (dbl), 2013.000000 (dbl), 2014.000000 (dbl)

Question. Why use the skip=12 argument in read_excel()?

The skip argument is the number of rows to skip before reading any data. It is 12 here because the data we want begin on row 13.

tidyr

Two main functions: gather() and spread()

# install.packages("tidyr")
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.3
gather # gather to long
## function (data, key, value, ..., na.rm = FALSE, convert = FALSE, 
##     factor_key = FALSE) 
## {
##     key_col <- col_name(substitute(key), "key")
##     value_col <- col_name(substitute(value), "value")
##     if (n_dots(...) == 0) {
##         gather_cols <- setdiff(colnames(data), c(key_col, value_col))
##     }
##     else {
##         gather_cols <- unname(dplyr::select_vars(colnames(data), 
##             ...))
##     }
##     gather_(data, key_col, value_col, gather_cols, na.rm = na.rm, 
##         convert = convert, factor_key = factor_key)
## }
## <environment: namespace:tidyr>
spread # spread to wide
## function (data, key, value, fill = NA, convert = FALSE, drop = TRUE) 
## {
##     key_col <- col_name(substitute(key))
##     value_col <- col_name(substitute(value))
##     spread_(data, key_col, value_col, fill = fill, convert = convert, 
##         drop = drop)
## }
## <environment: namespace:tidyr>
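A toy round trip (the small data frame below is made up) shows how the two functions mirror each other:

# wide toy table: one column per year
library(tidyr)
df_wide = data.frame(Country = c('A', 'B'), `1970` = 1:2, `1971` = 3:4, check.names = FALSE)

# wide -> long: one row per Country x year
df_long = gather(df_wide, "year", "n", -Country)
df_long

# long -> wide again
spread(df_long, year, n)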

gather

co2
## Source: local data frame [220 x 46]
## 
##                Country  1970.000000  1971.000000  1972.000000  1973.000000
##                  (chr)        (dbl)        (dbl)        (dbl)        (dbl)
## 1          Afghanistan  1813.981234  1840.794051  1794.831482  1817.997422
## 2              Albania  4435.433037  4425.968296  5035.309230  4638.057137
## 3              Algeria 18850.745443 22226.320843 34278.432478 46375.728534
## 4       American Samoa     6.183023     6.327022     6.462429     6.587474
## 5               Angola  8946.495011  8532.101439 10382.218539 11365.512295
## 6             Anguilla     2.168006     2.168036     2.263426     2.108604
## 7  Antigua and Barbuda   196.161773   203.964395   244.721752   263.402867
## 8            Argentina 87478.401090 87827.313960 88888.066154 93076.186184
## 9              Armenia 10951.012839 10972.887313 11783.384976 12844.589903
## 10               Aruba    40.380782    40.635025    41.138578    41.949184
## ..                 ...          ...          ...          ...          ...
## Variables not shown: 1974.000000 (dbl), 1975.000000 (dbl), 1976.000000
##   (dbl), 1977.000000 (dbl), 1978.000000 (dbl), 1979.000000 (dbl),
##   1980.000000 (dbl), 1981.000000 (dbl), 1982.000000 (dbl), 1983.000000
##   (dbl), 1984.000000 (dbl), 1985.000000 (dbl), 1986.000000 (dbl),
##   1987.000000 (dbl), 1988.000000 (dbl), 1989.000000 (dbl), 1990.000000
##   (dbl), 1991.000000 (dbl), 1992.000000 (dbl), 1993.000000 (dbl),
##   1994.000000 (dbl), 1995.000000 (dbl), 1996.000000 (dbl), 1997.000000
##   (dbl), 1998.000000 (dbl), 1999.000000 (dbl), 2000.000000 (dbl),
##   2001.000000 (dbl), 2002.000000 (dbl), 2003.000000 (dbl), 2004.000000
##   (dbl), 2005.000000 (dbl), 2006.000000 (dbl), 2007.000000 (dbl),
##   2008.000000 (dbl), 2009.000000 (dbl), 2010.000000 (dbl), 2011.000000
##   (dbl), 2012.000000 (dbl), 2013.000000 (dbl), 2014.000000 (dbl)
d = gather(co2, "year", "n", -Country)
d
## Source: local data frame [9,900 x 3]
## 
##                Country        year            n
##                  (chr)       (chr)        (dbl)
## 1          Afghanistan 1970.000000  1813.981234
## 2              Albania 1970.000000  4435.433037
## 3              Algeria 1970.000000 18850.745443
## 4       American Samoa 1970.000000     6.183023
## 5               Angola 1970.000000  8946.495011
## 6             Anguilla 1970.000000     2.168006
## 7  Antigua and Barbuda 1970.000000   196.161773
## 8            Argentina 1970.000000 87478.401090
## 9              Armenia 1970.000000 10951.012839
## 10               Aruba 1970.000000    40.380782
## ..                 ...         ...          ...
# Other way to do the same: gather(co2, "year", "n", 2:ncol(co2)) or gather(co2, "year", "n", -1)

Converting my year column from factor to numeric

# install.packages("dplyr")
library(dplyr)

e = mutate(d, year = as.numeric(as.character(year)))

summarize CO2 emissions

Task. Report the top 5 emitting countries (not World or EU28) for 2014 using your long format table. (You may need to convert your year column from factor to numeric, e.g. mutate(year = as.numeric(as.character(year))). As with most analyses, there are multiple ways to do this. I used the following functions: filter, arrange, desc, head).

Task. Summarize the total emissions by country (not World or EU28) across years from your long format table and return the top 5 emitting countries. (As with most analyses, there are multiple ways to do this. I used the following functions: filter, arrange, desc, head).

new2 = filter(e, year > 2013)   # other way is new <- e[e$year == 2014,]

new3 = arrange(new2, desc(n))

new4 = filter(new3, !Country %in% c("World", "EU28"))  # exclude the aggregate rows (%in% handles both values)

head(new4)
## Source: local data frame [6 x 3]
## 
##                    Country  year          n
##                      (chr) (dbl)      (dbl)
## 1                    China  2014 10540749.6
## 2 United States of America  2014  5334529.7
## 3                    India  2014  2341896.8
## 4       Russian Federation  2014  1766427.3
## 5                    Japan  2014  1278921.8
## 6                  Germany  2014   767145.6

I tried the elegant way, but it didn't work:

dplyr elegance

e %T>%                                      # note tee operator %T>% for glimpse
  glimpse() %>%                             # view data
  filter(year > 2013) %>%                   # limit rows
  arrange(desc(n)) %>%                      # descendent order
  filter(Country != c("World", "EU28")) %>% # limit rows
  write_csv('data/co2_faundez.csv')         # write out csv
head(co2_faundez.csv)
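A few likely reasons the chain above fails, and a hedged rewrite under those assumptions: the tee operator %T>% comes from magrittr rather than dplyr, comparing Country against a two-element vector with != recycles the values instead of excluding both (use %in%), and head() needs an R object rather than the bare file name.

library(magrittr)  # assumption: %T>% failed because magrittr was not loaded
library(readr)
library(dplyr)

co2_2014 = e %T>%                            # keep piping e after glimpse()
  glimpse() %>%                              # view data
  filter(year > 2013) %>%                    # limit rows to 2014
  arrange(desc(n)) %>%                       # descending order
  filter(!Country %in% c("World", "EU28"))   # drop the aggregate rows

write_csv(co2_2014, 'data/co2_faundez.csv')  # write out csv
head(co2_2014)                               # top 2014 emitters

The second task above (total emissions per country across all years) is not computed in this document; one possible sketch using group_by() and summarize() with the long table e:

e %>%
  filter(!Country %in% c("World", "EU28")) %>%  # drop the aggregate rows
  group_by(Country) %>%                         # one group per country
  summarize(total_n = sum(n, na.rm = TRUE)) %>% # total emissions across years
  arrange(desc(total_n)) %>%                    # largest first
  head(5)                                       # top 5 emitting countries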