Organization

Pollution Prevention

@pollute

Interests

I am interested in water resources, ecosystem health, and conservation management, and I am especially enthusiastic about tools for data management. I believe good data management is a fundamental skill that can be applied both to increasing work efficiency and to excelling in any scientific position where data is involved… so basically, there are endless applications.

Here is a photo of me taking a water sample at one of my favorite locations on Earth: the Lyell Fork of the Tuolumne River.

Content

will have some soon!

Techniques

data wrangling, of course!

Data

# read csv
data = read.csv("data/daalbu_SWEdataforTUM.csv")

#output summary
summary(data)
##    WaterYear         Feb              Mar              Apr       
##  Min.   :2010   Min.   : 1.500   Min.   : 1.000   Min.   : 0.00  
##  1st Qu.:2011   1st Qu.: 3.212   1st Qu.: 6.112   1st Qu.: 6.60  
##  Median :2012   Median :10.175   Median :11.725   Median :11.50  
##  Mean   :2012   Mean   :10.442   Mean   :13.075   Mean   :15.18  
##  3rd Qu.:2014   3rd Qu.:16.500   3rd Qu.:19.250   3rd Qu.:20.98  
##  Max.   :2015   Max.   :21.300   Max.   :28.000   Max.   :39.00  
##       May           WYTotal      
##  Min.   : 0.00   Min.   :  3.50  
##  1st Qu.: 0.00   1st Qu.: 15.20  
##  Median : 0.05   Median : 34.45  
##  Mean   :10.13   Mean   : 48.83  
##  3rd Qu.:19.00   3rd Qu.: 74.70  
##  Max.   :35.40   Max.   :123.70

Data for SWE sourced from here

Data Wrangling: Week 3

Bash Shell Equivalents in R

present working directory

getwd()

change working directory

setwd('.')

list files

list.files()

list files that end in ‘.jpg’

list.files(pattern = glob2rx('*.jpg'))

file exists

file.exists('test.png')
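The calls above can be tried safely in a scratch folder. The following is a minimal sketch, assuming nothing about the course directory layout: the folder name `wd_demo` and the file names are made up for illustration, and everything is created under R's temporary directory.

```r
# Work inside a throwaway directory so nothing in the project is touched
tmp <- file.path(tempdir(), "wd_demo")   # hypothetical folder name
dir.create(tmp, showWarnings = FALSE)

# Create three empty files to list (hypothetical names)
invisible(file.create(file.path(tmp, c("a.jpg", "b.jpg", "notes.txt"))))

# glob2rx() turns a shell-style wildcard into a regular expression,
# which is the form that list.files(pattern = ...) expects
glob2rx("*.jpg")

# list only the .jpg files in the demo directory
jpgs <- list.files(tmp, pattern = glob2rx("*.jpg"))
jpgs

# file.exists() accepts a vector of paths and returns one TRUE/FALSE each
found <- file.exists(file.path(tmp, c("a.jpg", "test.png")))
found
```

Passing the directory to `list.files()` directly, instead of calling `setwd()` first, keeps the working directory unchanged, which tends to be easier to reason about when knitting.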

Set Working Directory

set working directory

setwd('students')

Install Packages

Run this chunk only once in your Console

Do not evaluate when knitting Rmarkdown

list of packages

pkgs = c(
  'readr',        # read csv
  'readxl',       # read xls
  'dplyr',        # data frame manipulation
  'tidyr',        # data tidying
  'nycflights13', # test dataset of NYC flights for 2013
  'gapminder')    # test dataset of life expectancy and population

install packages if not found

for (p in pkgs){
  if (!require(p, character.only = TRUE)){
    install.packages(p)
  }
}

Reading CSV

library(readr)

d = read_csv('../data/r-ecology/species.csv')
d
head(d)
summary(d)

Multiple Variables

read in csv

surveys = read.csv('../data/r-ecology/surveys.csv')

view data

head(surveys)
summary(surveys)

limit columns to species and year

surveys_2 = surveys[, c('species_id', 'year')]

limit rows to just species “NL”

surveys_3 = surveys_2[surveys_2$species_id == 'NL', ]

get count per year

surveys_4 = aggregate(species_id ~ year, data = surveys_3, FUN = 'length')

write to csv

write.csv(surveys_4, 'data/surveys_bbest.csv', row.names = FALSE)
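To see what each intermediate step produces without needing the surveys file, here is a small sketch of the same select, subset, and aggregate sequence on a toy data frame. The data frame `toy` and its values are invented for illustration and are not taken from surveys.csv.

```r
# Toy stand-in for the surveys table (values are made up)
toy <- data.frame(
  species_id = c("NL", "DM", "NL", "NL", "DM", "NL"),
  year       = c(1977, 1977, 1977, 1978, 1978, 1978),
  weight     = c(NA, 40, 180, 200, 42, 190)
)

# limit columns to species and year
toy_2 <- toy[, c("species_id", "year")]

# limit rows to just species "NL"
toy_3 <- toy_2[toy_2$species_id == "NL", ]

# count rows per year; length() counts the species_id entries in each group
toy_4 <- aggregate(species_id ~ year, data = toy_3, FUN = length)
toy_4
```

Note that `aggregate()` keeps the left-hand name of the formula, so the count column in `toy_4` is still called `species_id` even though it now holds the number of records per year.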

Nested Functions

read in data

surveys = read.csv('../data/r-ecology/surveys.csv')

view data

head(surveys)
summary(surveys)

limit data with [], aggregate to count, write to csv

write.csv(
  aggregate(
    species_id ~ year,
    data = surveys[surveys$species_id == 'NL', c('species_id', 'year')],
    FUN = 'length'),
  'data/surveys_bbest.csv',
  row.names = FALSE)

Elegance with dplyr

load libraries

library(readr)
library(dplyr)
library(magrittr) # provides the tee operator %T>%, which dplyr does not export

read in csv

surveys = read_csv('../data/r-ecology/surveys.csv')

dplyr elegance

surveys %T>%                          # note tee operator %T>% for glimpse
  glimpse() %>%                       # view data
  select(species_id, year) %>%        # limit columns
  filter(species_id == 'NL') %>%      # limit rows
  group_by(year) %>%                  # get count by first grouping
  summarize(n = n()) %>%              # then summarize
  write_csv('data/surveys_bbest.csv') # write out csv
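The grouping and counting steps of the pipeline can be checked on a toy table. The sketch below assumes dplyr is installed; the tibble `toy` and its values are invented for illustration. It also shows that `count()` gives the same counts as `group_by()` followed by `summarize(n = n())`, which may be a handy shortcut here.

```r
library(dplyr)

# Toy stand-in for the surveys table (values are made up)
toy <- tibble(
  species_id = c("NL", "DM", "NL", "NL", "DM", "NL"),
  year       = c(1977, 1977, 1977, 1978, 1978, 1978)
)

# long form: group first, then summarize with n()
long_form <- toy %>%
  filter(species_id == "NL") %>%
  group_by(year) %>%
  summarize(n = n())

# short form: count() groups and tallies in one step
short_form <- toy %>%
  filter(species_id == "NL") %>%
  count(year)

# both approaches yield the same per-year counts
identical(long_form$n, short_form$n)
```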
