Data Viz in R: Week 1
And also: attention to visualization principles while digging into the R universe
Package | What we love |
---|---|
RStudio | R interface |
tidyverse | data wrangling |
ggplot2 | data visualization |
R Markdown | data communication |
Leaflet, Plotly, DT | data and table interactivity |
Also… possibly “great frustration and much suckiness…” - Hadley Wickham
Schwabish (from Introduction): [T]hinking carefully about how data is presented is just as important as the data itself.
Wilke (from Chapter 1): Data visualization is part art and part science. The challenge is to get the art right without getting the science wrong, and vice versa.
Schwabish (from Chapter 1): Visual Processing
Data visualization is essential for data exploration, communication, and understanding. Imagine we have a small dataset with the following summary characteristics:
## # A tibble: 1 × 6
## dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 54.3 47.8 16.8 26.9 -0.0641
A set of summary statistics is at best a partial picture, until we see what it looks like.
R is the computational engine; RStudio is the interface.
Type commands at the console prompt (>) and press Enter to execute. Type ? before a function name in the console to get info in the Help pane.
“R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for this via projects.” From R for Data Science: Workflow: Projects
For any new project in R, create an R project. Projects allow RStudio to leave notes for itself (e.g., history), will always start a new R session when opened, and will always set the working directory to the Project directory. If you never have to set the working directory at the top of the script, that’s a good thing!2
And create a system for organizing the objects in this project!
Functions are the “verbs” that allow us to manipulate data. Packages contain functions, and all functions belong to packages.
R comes with about 30 packages (“base R”). There are over 10,000 user-contributed packages; you can discover these packages online in Comprehensive R Archive Network (CRAN), with more in active development on GitHub.
To use a package, install it (you only need to do this once): in RStudio’s Packages pane, type tidyverse (or a different package name), then click on Install. In each new R session, you’ll have to load the package if you want access to its functions: e.g., type library(tidyverse).
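The install-once, load-every-session pattern looks like this in a script (a minimal sketch; the install line is commented out so it doesn’t re-run each time):

```r
# Install once per machine -- this downloads the package from CRAN.
# Commented out so it doesn't re-run on every execution of the script.
# install.packages("tidyverse")

# Load in every new R session to get access to the package's functions
library(tidyverse)
```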
Let’s start working in R!
Part of the tidyverse, dplyr is a package for data manipulation. The package implements a grammar for transforming data, based on verbs/functions that define a set of common tasks.
dplyr functions are for data frames: the first argument to a dplyr function, and its result, is always a data frame.
\(\color{blue}{\text{select()}}\) - extract \(\color{blue}{\text{variables}}\)
\(\color{green}{\text{filter()}}\) - extract \(\color{green}{\text{rows}}\)
\(\color{green}{\text{arrange()}}\) - reorder \(\color{green}{\text{rows}}\)
Extract columns by name.
select(property, yearbuilt)
select() helpers include starts_with(), ends_with(), contains(), and matches().
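A quick sketch of the helpers in action, using a small toy stand-in for the property data (the column names here are hypothetical):

```r
library(dplyr)

# Toy stand-in for the property data; columns are hypothetical
property <- tibble::tibble(
  yearbuilt  = c(1990, 2005, 2021),
  yearremod  = c(NA, 2015, NA),
  totalvalue = c(2e5, 3e5, 5e5)
)

# Helpers match columns by name pattern instead of listing each one
select(property, starts_with("year"))   # yearbuilt, yearremod
select(property, contains("value"))     # totalvalue
select(property, ends_with("remod"))    # yearremod
```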
Extract rows that meet logical conditions.
filter(property, yearbuilt > 2020)
Logical tests | Boolean operators for multiple conditions |
---|---|
x < y: less than | a & b: and |
x >= y: greater than or equal to | a \| b: or |
x == y: equal to | xor(a, b): exactly or |
x != y: not equal to | !a: not |
x %in% y: is a member of | |
is.na(x): is NA | |
!is.na(x): is not NA | |
filter(property, cardtype == "R" & yearbuilt > 2020)
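The membership and missing-value tests from the table can be sketched the same way, here on a toy stand-in for the property data (column values are hypothetical):

```r
library(dplyr)

# Toy stand-in for the property data; values are hypothetical
property <- tibble::tibble(
  cardtype  = c("R", "R", "C", "R"),
  yearbuilt = c(2021, 2018, 2021, NA)
)

filter(property, cardtype %in% c("R", "C"))  # %in% tests membership: keeps all 4 rows
filter(property, !is.na(yearbuilt))          # drops the row with a missing yearbuilt
```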
Order rows from smallest to largest values (or vice versa) for designated column/s.
arrange(property, yearbuilt)
Reverse the order (largest to smallest) with desc()
arrange(property, desc(yearbuilt))
\(\color{green}{\text{slice()}}\) - extract \(\color{green}{\text{rows}}\) using index(es)
\(\color{green}{\text{distinct()}}\) - filter for unique \(\color{green}{\text{rows}}\)
\(\color{green}{\text{sample_n()/sample_frac()}}\) - randomly sample \(\color{green}{\text{rows}}\)
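A minimal sketch of these three row verbs, again on a toy stand-in for the property data:

```r
library(dplyr)

# Toy stand-in for the property data
property <- tibble::tibble(
  cardtype  = c("R", "R", "C"),
  yearbuilt = c(2021, 2018, 2021)
)

slice(property, 1:2)          # rows 1 and 2, selected by position
distinct(property, cardtype)  # unique values of cardtype ("R", "C")
sample_n(property, 2)         # 2 rows drawn at random
sample_frac(property, 0.5)    # a random 50% of the rows
```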
The pipe (%>%) allows you to chain together functions by passing (piping) the result on the left into the first argument of the function on the right.
To get the totalvalue and finsqft for property built in 2021, arranged in descending order of totalvalue, without the pipe we could nest the functions
arrange(
select(
filter(property, yearbuilt == 2021 & cardtype == "R"),
totalvalue, finsqft),
desc(totalvalue))
Or run each and save the intervening steps
tmp <- filter(property, yearbuilt == 2021 & cardtype == "R")
tmp <- select(tmp, totalvalue, finsqft)
arrange(tmp, desc(totalvalue))
With the pipe, we call each function in sequence (read the pipe as “and then…”)
property %>%
filter(yearbuilt == 2021 & cardtype == "R") %>%
select(totalvalue, finsqft) %>%
  arrange(desc(totalvalue))
\(\color{blue}{\text{summarize()}}\) - summarize \(\color{blue}{\text{variables}}\)
\(\color{green}{\text{group_by()}}\) - group \(\color{green}{\text{rows}}\)
\(\color{blue}{\text{mutate()}}\) - create new \(\color{blue}{\text{variables}}\)
Compute summaries. Useful summary functions are listed in the table below.
property %>%
filter(yearbuilt == 2021 & cardtype == "R") %>%
summarize(smallest = min(finsqft),
biggest = max(finsqft),
total = n())
Summary Functions | |
---|---|
first(): first value | last(): last value |
min(): minimum value | max(): maximum value |
mean(): mean value | median(): median value |
var(): variance | sd(): standard deviation |
nth(.x, n): nth value | quantile(.x, probs = .25): 25th percentile |
n_distinct(): number of distinct values | n(): number of values |
That’s not always interesting on its own, but when combined with group_by(), it is powerful!
Groups cases by common values of one or more columns.
property %>%
filter(yearbuilt == 2021 & cardtype == "R") %>%
group_by(esdistrict) %>%
summarize(smallest = min(finsqft),
biggest = max(finsqft),
avg_value = mean(totalvalue, na.rm = TRUE),
number = n()) %>%
arrange(desc(avg_value))
Create new columns or alter existing columns
property %>%
  filter(yearbuilt == 2020 & cardtype == "R") %>%
  mutate(value_sqft = totalvalue/finsqft) %>%
  group_by(esdistrict) %>%
  summarize(smallest = min(finsqft),
            biggest = max(finsqft),
            avg_value = mean(totalvalue, na.rm = TRUE),
            avg_value_sqft = mean(value_sqft, na.rm = TRUE),
            number = n()) %>%
  arrange(desc(avg_value))
mutate() pairs well with if_else() and case_when() for creating values conditionally.
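A minimal sketch of both conditional helpers inside mutate(), on a toy stand-in for the property data (the new column names are hypothetical):

```r
library(dplyr)

# Toy stand-in for the property data
property <- tibble::tibble(yearbuilt = c(1950, 1995, 2021))

property %>%
  mutate(
    # if_else(): one test, two outcomes
    new_build = if_else(yearbuilt >= 2020, "new", "older"),
    # case_when(): several tests, checked in order; TRUE is the catch-all
    era = case_when(
      yearbuilt < 1980 ~ "pre-1980",
      yearbuilt < 2020 ~ "1980-2019",
      TRUE             ~ "2020s"
    )
  )
```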
\(\color{blue}{\text{tally()}}\) - shorthand for summarize(n())
\(\color{blue}{\text{count()}}\) - shorthand for group_by() + tally()
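The equivalence is easy to see on a toy stand-in for the property data:

```r
library(dplyr)

# Toy stand-in for the property data
property <- tibble::tibble(cardtype = c("R", "R", "C"))

property %>% group_by(cardtype) %>% tally()  # counts per group
property %>% count(cardtype)                 # the same result in one verb
```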
\(\color{blue}{\text{summarize_all()}}\) - apply summary function to all \(\color{blue}{\text{variables}}\)
\(\color{blue}{\text{summarize_at()}}\) - apply summary function to selected \(\color{blue}{\text{variables}}\)
\(\color{blue}{\text{rename()}}\) - rename \(\color{blue}{\text{variables}}\)
\(\color{blue}{\text{recode()}}\) - modify values of \(\color{blue}{\text{variables}}\)
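A sketch of rename() and recode() together, on a toy stand-in for the property data (the new names and labels are hypothetical):

```r
library(dplyr)

# Toy stand-in for the property data
property <- tibble::tibble(cardtype = c("R", "C"))

property %>%
  rename(card_type = cardtype) %>%   # rename(): new_name = old_name
  mutate(card_type = recode(card_type,
                            R = "Residential",   # recode(): old value = new value
                            C = "Commercial"))
```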
The Grammar of Graphics: All data visualizations map data to aesthetic attributes (location, shape, color) of geometric objects (lines, points, bars).
Scales control the mapping from data to aesthetics and provide tools to read the plot (axes, legends). Geometric objects are drawn in a specific coordinate system.
A plot can contain statistical transformations of the data (counts, means, medians), and faceting can be used to generate the same plot for different subsets of the data.
Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis
ggplot(data, aes(x = var1, y = var2)) +
geom_point(aes(color = var3)) +
geom_smooth(color = "red") +
labs(title = "Helpful Title",
x = "x-axis label")
# geom_histogram(), geom_boxplot(), geom_bar(), etc.
So much more to come
head(ncdn_long)
## # A tibble: 6 × 10
## station date name month day avg_tmp max_tmp min_tmp location dates
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <date>
## 1 USC003463… 01-01 NORM… 1 1 38.7 49.2 28.3 Norman 0000-01-01
## 2 USC003463… 01-02 NORM… 1 2 38.7 49.2 28.2 Norman 0000-01-02
## 3 USC003463… 01-03 NORM… 1 3 38.6 49.2 28.1 Norman 0000-01-03
## 4 USC003463… 01-04 NORM… 1 4 38.6 49.2 28 Norman 0000-01-04
## 5 USC003463… 01-05 NORM… 1 5 38.6 49.2 27.9 Norman 0000-01-05
## 6 USC003463… 01-06 NORM… 1 6 38.5 49.2 27.8 Norman 0000-01-06
ggplot(ncdn_long, aes(x = dates, y = avg_tmp, color = location)) +
geom_line(size = 1) +
scale_x_date(name = "month", date_labels = "%b") +
scale_y_continuous(limits = c(15, 95),
breaks = seq(15, 95, by = 20),
name = "temperature (°F)") +
labs(title = "Average Daily Normal Temperatures")
head(mean_ncdn)
## # A tibble: 6 × 3
## location month mean
## <fct> <fct> <dbl>
## 1 Houston Jan 53.8
## 2 Houston Feb 57.8
## 3 Houston Mar 63.8
## 4 Houston Apr 69.9
## 5 Houston May 77.3
## 6 Houston Jun 83.0
ggplot(mean_ncdn, aes(x = month, y = location, fill = mean)) +
geom_tile(width = .95, height = 0.95) +
scale_fill_viridis_c(option = "B", begin = 0.15, end = 0.98,
name = "temp (°F)") +
scale_y_discrete(name = NULL) +
coord_fixed(expand = FALSE) +
theme(axis.line = element_blank(),
axis.ticks = element_blank()) +
labs(title = "Average Monthly Normal Temperatures")
location <- as.data.frame(unique(mean_ncdn$location))
names(location) <- "location"
ggplot(mean_ncdn, aes(x = month, y = mean)) +
geom_bar(aes(fill = mean), stat = "identity") +
facet_wrap(~location) +
geom_text(data = location, aes(x = 0, y = -75, label = location), size = 3) +
scale_fill_viridis_c(option = "plasma",
name = "temp (°F)") +
ylim(-80, 90) +
theme_minimal() +
theme(
axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
strip.text.x = element_blank()
) +
coord_polar(start = 0)
All of the code for Wilke’s book is on GitHub: https://github.com/clauswilke/dataviz
We’ll work with R scripts for the first couple of weeks and R Markdown in the last couple of weeks.↩︎
Especially since no one seems to understand paths and directories any more.↩︎