
I am interested in environmental remediation: pollution modeling and the data analysis that comes with it are my jam. I also think the tools we are learning in this class would be useful in a corporate setting; performance metrics and similar reporting could all be run through tools like these.

I also have a blog where I post projects and classwork I’m particularly passionate about.


Assignment 3

Data Wrangling

How does using read_csv from the readr package improve data readability?

Using read.csv (base R's default) is pretty messy and unhelpful:

# base R reader: returns a plain data.frame and converts strings to factors by default
d = read.csv('./data/sparrow925_EUSO2.csv')
head(d)
summary(d)

Compare that to using read_csv, which gives you useful info!

# Oh man, the head and summary here are so much more usable
library(readr) # provides read_csv
d2 = read_csv('./data/sparrow925_EUSO2.csv')
head(d2)
## Source: local data frame [6 x 9]
## 
##   country_code  city_name station_european_code type_of_station
##          (chr)      (chr)                 (chr)           (chr)
## 1           AT Klagenfurt               AT2KA71      Background
## 2           AT       Linz               AT4S415         Traffic
## 3           AT       Linz               AT4S416      Background
## 4           AT   Salzburg               AT51200      Background
## 5           AT       Graz               AT60138      Background
## 6           AT       Graz               AT60164         Traffic
## Variables not shown: station_type_of_area (chr), component_caption (chr),
##   statistics_year (int), ug/m3 (dbl), above_DLV? (int)
# summary(d2)
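
If you want to see exactly what read_csv guessed for each column, readr can also report the column specification directly. A small sketch, assuming a readr version (1.0 or later) that exports spec() and cols():

# show the column types readr guessed for this file
spec(d2)
# a wrong guess can be overridden on re-read, e.g.
# read_csv('./data/sparrow925_EUSO2.csv', col_types = cols(statistics_year = col_integer()))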

Now, let's use dplyr. Let's ask ourselves: how many cities are measured for each country?

Using Multiple Variables (The Old Way)

# read in csv
d = read_csv('./data/sparrow925_EUSO2.csv')
# limit columns to country and city
d2 = d[, c('country_code', 'city_name')]
# get count per country
d3 = aggregate(city_name ~ country_code, data = d2, FUN = 'length')
# write out csv
write.csv(d3, './data/sparrow925_EUSO2_citycount.csv', row.names = FALSE)

Using dplyr to Write Compact Code

library(readr)
library(dplyr)
library(magrittr) # provides the tee pipe %T>%
# read in csv
sparrow925_EUSO2_dplyr = read_csv('./data/sparrow925_EUSO2.csv')
sparrow925_EUSO2_dplyr %T>%
  glimpse() %>% # peek at the data, then pass it through unchanged
  select(country_code, city_name) %>% # limit columns
  group_by(country_code) %>% # group by country
  summarize(n = n()) %>% # count the city rows per country
  write.csv('./data/sparrow925_EUSO2_citycount.csv', row.names = FALSE)
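
For a tally like this, dplyr's count() is a handy shorthand that collapses the group_by() plus summarize() pair into one call. A minimal sketch using the same data and columns:

# count() is shorthand for group_by(country_code) %>% summarize(n = n())
sparrow925_EUSO2_dplyr %>%
  count(country_code) %>%
  write.csv('./data/sparrow925_EUSO2_citycount.csv', row.names = FALSE)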

4. Tidying up Data / Answers and Tasks

Task 1. Convert the following table of CO2 emissions per country since 1970 from wide to long format and output the first few rows into your R Markdown.

library(readxl) # install.packages('readxl')

url = 'http://edgar.jrc.ec.europa.eu/news_docs/CO2_1970-2014_dataset_of_CO2_report_2015.xls'
xls = '../data/co2_europa.xls'

# On Windows, download.file corrupts binary files like .xls unless mode = 'wb' is set.
# If the download still fails, fall back to the local copy used below.
print(getwd())
if (!file.exists(xls)){
  download.file(url, xls, mode = 'wb')
}

Quick workaround: read a local copy instead of downloading from the web.

library(readxl) # install.packages('readxl')
library(readr)
library(dplyr)
library(tidyr)
xls = '../data/co2_europa.xls'
co2 = read_excel(xls, skip = 12) # skip the first 12 rows of preamble
# Convert to long format and display a few rows as an example.
task1 = co2 %>%
  gather("year", "n", -1)

head(task1)
## Source: local data frame [6 x 3]
## 
##          Country        year            n
##            (chr)       (chr)        (dbl)
## 1    Afghanistan 1970.000000  1813.981234
## 2        Albania 1970.000000  4435.433037
## 3        Algeria 1970.000000 18850.745443
## 4 American Samoa 1970.000000     6.183023
## 5         Angola 1970.000000  8946.495011
## 6       Anguilla 1970.000000     2.168006
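
Side note: gather() was later superseded in tidyr. If you are on tidyr 1.0 or newer, the same wide-to-long reshape can be written with pivot_longer(); a sketch under that assumption:

# same reshape with the newer tidyr verb (tidyr >= 1.0)
task1_alt = co2 %>%
  pivot_longer(-1, names_to = "year", values_to = "n")
head(task1_alt)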

Task 2. Report the top 5 emitting countries (not World or EU28) for 2014 using your long format table. (You may need to convert your year column from factor to numeric, e.g. mutate(year = as.numeric(as.character(year))).)

task2 = co2 %>%
  gather("year", "n", -1) %>%
  mutate(year = as.numeric(as.character(year))) %>%
  arrange(desc(n)) %>%
  filter(year == 2014) %>%
  filter(Country != "World") %>%
  filter(Country != "EU28")
  
head(task2)
## Source: local data frame [6 x 3]
## 
##                    Country  year          n
##                      (chr) (dbl)      (dbl)
## 1                    China  2014 10540749.6
## 2 United States of America  2014  5334529.7
## 3                    India  2014  2341896.8
## 4       Russian Federation  2014  1766427.3
## 5                    Japan  2014  1278921.8
## 6                  Germany  2014   767145.6
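
One wrinkle: head() prints 6 rows by default, so the table above shows the top 6. To report exactly the top 5, pass the count to head() (or trim inside the pipeline with dplyr's top_n(), available since dplyr 0.3):

# exactly five rows, since head() defaults to six
head(task2, 5)
# or, keeping it in the pipeline:
# task2 %>% top_n(5, n)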

Task 3. Summarize the total emissions by country (not World or EU28) across years from your long format table and return the top 5 emitting countries. (As with most analyses, there are multiple ways to do this. I used the following functions: filter, arrange, desc, head).

task3 = co2 %>%
  gather("year", "n", -1) %>%
  mutate(year = as.numeric(as.character(year))) %>%
  group_by(Country) %>%
  summarise(Sum = sum(n)) %>%
  arrange(desc(Sum)) %>%
  filter(Country != "World") %>%
  filter(Country != "EU28")

# give the summary column a descriptive name with units
colnames(task3)[colnames(task3) == "Sum"] <- "Sum of CO2 Emissions, 1970-2014 (kton CO2)"

head(task3)
## Source: local data frame [6 x 2]
## 
##                    Country Sum of CO2 Emissions, 1970-2014 (kton CO2)
##                      (chr)                                      (dbl)
## 1 United States of America                                  231948899
## 2                    China                                  174045927
## 3       Russian Federation                                   81242427
## 4                    Japan                                   51276329
## 5                  Germany                                   43382205
## 6                    India                                   39004218
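
As an aside, the base-R colnames() assignment above can be folded into the pipeline with dplyr's rename() and a combined filter(); a sketch that should reproduce the same table:

# same summary, with the renaming done by rename() inside the pipe
task3_alt = co2 %>%
  gather("year", "n", -1) %>%
  group_by(Country) %>%
  summarise(Sum = sum(n)) %>%
  arrange(desc(Sum)) %>%
  filter(!Country %in% c("World", "EU28")) %>%
  rename(`Sum of CO2 Emissions, 1970-2014 (kton CO2)` = Sum)
head(task3_alt, 5)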