
Pollution Prevention



I am interested in water resources, ecosystem health, and conservation management, and I am especially enthusiastic about tools for data management. I believe good data management is a fundamental skill that both increases work efficiency and helps in excelling at any scientific position where data is involved…so basically, there are endless applications.

Here is a photo of me taking a water sample at one of my favorite locations on Earth: the Lyell Fork of the Tuolumne River.


will have some soon!


data wrangling, of course!


# read csv
data = read.csv("data/daalbu_SWEdataforTUM.csv")

# output summary
summary(data)
##    WaterYear         Feb              Mar              Apr       
##  Min.   :2010   Min.   : 1.500   Min.   : 1.000   Min.   : 0.00  
##  1st Qu.:2011   1st Qu.: 3.212   1st Qu.: 6.112   1st Qu.: 6.60  
##  Median :2012   Median :10.175   Median :11.725   Median :11.50  
##  Mean   :2012   Mean   :10.442   Mean   :13.075   Mean   :15.18  
##  3rd Qu.:2014   3rd Qu.:16.500   3rd Qu.:19.250   3rd Qu.:20.98  
##  Max.   :2015   Max.   :21.300   Max.   :28.000   Max.   :39.00  
##       May           WYTotal      
##  Min.   : 0.00   Min.   :  3.50  
##  1st Qu.: 0.00   1st Qu.: 15.20  
##  Median : 0.05   Median : 34.45  
##  Mean   :10.13   Mean   : 48.83  
##  3rd Qu.:19.00   3rd Qu.: 74.70  
##  Max.   :35.40   Max.   :123.70

Data for SWE sourced from here

Data Wrangling: Week 3

Bash Shell

present working directory


change working directory


list files


list files that end in ‘.jpg’


file exists
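
The shell commands themselves are not shown above, so here is a rough sketch of the same five operations done from inside R instead. The directory and file names are made up for illustration, and the comments name the usual shell counterparts as I understand them.

```r
# R equivalents of the shell operations listed above, run in a throwaway directory.
# The directory and file names are hypothetical.
demo_dir <- file.path(tempdir(), "shell_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("river.jpg", "meadow.jpg", "notes.txt")))

old_wd    <- setwd(demo_dir)                    # like cd; returns the previous directory
here      <- getwd()                            # like pwd
all_files <- list.files()                       # like ls
jpg_files <- list.files(pattern = "\\.jpg$")    # like ls *.jpg
has_notes <- file.exists("notes.txt")           # like test -e notes.txt
setwd(old_wd)                                   # go back to where we started

print(all_files)
print(jpg_files)
print(has_notes)
```

Note that `list.files()` takes a regular expression rather than a shell glob, which is why the pattern is `"\\.jpg$"` and not `"*.jpg"`.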


Set Working Directory

set working directory


Install Packages

Run this chunk only once in your Console

Do not evaluate when knitting Rmarkdown

list of packages

pkgs = c(
  'readr',        # read csv
  'readxl',       # read xls
  'dplyr',        # data frame manipulation
  'tidyr',        # data tidying
  'nycflights13', # test dataset of NYC flights for 2013
  'gapminder')    # test dataset of life expectancy and population

install packages if not found

for (p in pkgs){
  if (!require(p, character.only=TRUE)){
    install.packages(p)
  }
}
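
One variation on the loop above, as a sketch only: check which packages are missing without attaching any of them, then install just those. `missing_pkgs` is a helper name I made up here; it is not part of any package.

```r
# Return the subset of `pkgs` that is not installed, without attaching anything.
# `missing_pkgs` is a hypothetical helper name.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# 'stats' and 'utils' ship with R, so nothing should need installing here
to_install <- missing_pkgs(c("stats", "utils"))
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

Unlike `require()`, `requireNamespace()` only loads the package namespace and does not attach it to the search path, so this check should not change which functions are visible in the session.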

Reading CSV


library(readr)
d = read_csv('../data/r-ecology/species.csv')
d
head(d)
summary(d)
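
To see how `read_csv` from readr differs from base `read.csv`, here is a small sketch that writes a two-row file to a temporary path and reads it both ways. The file contents are invented for illustration and are not the real species.csv.

```r
library(readr)

# Hypothetical two-row file written to a temporary path for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c("species_id,genus,taxa",
             "NL,Neotoma,Rodent",
             "DM,Dipodomys,Rodent"), tmp)

# base R returns a plain data.frame
d_base <- read.csv(tmp, stringsAsFactors = FALSE)

# readr returns a tibble; stating col_types up front avoids type guessing
d_readr <- read_csv(tmp, col_types = cols(.default = col_character()))

class(d_base)
class(d_readr)
```

Both objects are data frames, but the readr result also carries the tibble classes, so it prints more compactly.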

Multiple Variables

read in csv

surveys = read.csv('../data/r-ecology/surveys.csv')

view data

head(surveys)
summary(surveys)

limit columns to species and year

surveys_2 = surveys[,c('species_id', 'year')]

limit rows to just species “NL”

surveys_3 = surveys_2[surveys_2$species_id == 'NL',]

get count per year

surveys_4 = aggregate(species_id ~ year, data=surveys_3, FUN='length')

write to csv

write.csv(surveys_4, 'data/surveys_bbest.csv', row.names = FALSE)
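
To make the column, row, and count steps easier to follow, here is the same sequence on a tiny made-up data frame. The `toy` data is hypothetical and only stands in for surveys.csv.

```r
# Hypothetical toy data standing in for surveys.csv (not real survey records)
toy <- data.frame(
  species_id = c("NL", "DM", "NL", "NL", "DM", "NL"),
  year       = c(1977, 1977, 1977, 1978, 1978, 1979),
  weight     = c(40, 35, 42, 38, 36, 41)
)

toy_2 <- toy[, c("species_id", "year")]          # limit columns
toy_3 <- toy_2[toy_2$species_id == "NL", ]       # limit rows to species NL
toy_4 <- aggregate(species_id ~ year, data = toy_3, FUN = length)  # count per year

print(toy_4)
#   year species_id
# 1 1977          2
# 2 1978          1
# 3 1979          1
```

The count ends up in a column still called `species_id`, because `aggregate` keeps the name from the left side of the formula.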

nested functions

read in data

surveys = read.csv('../data/r-ecology/surveys.csv')

view data

head(surveys)
summary(surveys)

limit data with [], aggregate to count, write to csv

write.csv(
  aggregate(
    species_id ~ year,
    data = surveys[surveys$species_id == 'NL', c('species_id', 'year')],
    FUN = 'length'),
  'data/surveys_bbest.csv',
  row.names = FALSE)

elegance with dplyr

load libraries

library(readr)
library(dplyr)
library(magrittr) # provides the tee operator %T>%

read in csv

surveys = read_csv('../data/r-ecology/surveys.csv')

dplyr elegance

surveys %T>%                          # note tee operator %T>% for glimpse
  glimpse() %>%                       # view data
  select(species_id, year) %>%        # limit columns
  filter(species_id == 'NL') %>%      # limit rows
  group_by(year) %>%                  # get count by first grouping
  summarize(n = n()) %>%              # then summarize
  write_csv('data/surveys_bbest.csv') # write out csv
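
For comparison, here is a sketch of the same count on a small made-up data frame, using `count()` as a shorthand for the `group_by()` plus `summarize(n = n())` pair. The `toy` data is hypothetical and is not taken from surveys.csv.

```r
library(dplyr)

# Hypothetical toy data standing in for surveys.csv (not real survey records)
toy <- data.frame(
  species_id = c("NL", "DM", "NL", "NL", "DM", "NL"),
  year       = c(1977, 1977, 1977, 1978, 1978, 1979)
)

nl_counts <- toy %>%
  filter(species_id == "NL") %>%   # limit rows to species NL
  count(year)                      # one row per year, with the count in column n

print(nl_counts)
#   year n
# 1 1977 2
# 2 1978 1
# 3 1979 1
```

Here the count lands in a column named `n`, which I find easier to read than the base R result where the count keeps the `species_id` name.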
