Wrangling & Visualization
ggplot2 expects tidy data, that is, data structured so that
- Each variable has its own column
- Each observation has its own row
- Each value has its own cell
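A minimal sketch of what tidying can look like, assuming a made-up table (`wide`) in which the year variable is spread across column names. `pivot_longer()` from tidyr is one way to reshape it; it is not mentioned elsewhere in these notes and is used here only for illustration.

```r
library(tidyr)

# Hypothetical untidy table: the values of "year" live in the column names.
wide <- data.frame(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(15, 25),
  check.names = FALSE
)

# Gather the year columns so that each row is one country-year observation.
tidy <- pivot_longer(
  wide,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)
```

After reshaping, `year` and `cases` are ordinary columns that can be mapped to aesthetics in a plot.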
Separate/Unite
separate
: Split a single column into multiple columns by breaking each cell's value apart at a separator.
separate(df, col = rate, into = c("cases", "pop"), sep = "/")
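To make the call above concrete, here is a small sketch on a made-up data frame; the `country` column and the values in `rate` are assumptions for illustration only.

```r
library(tidyr)

# Hypothetical input: "rate" stores cases and population as one string.
df <- data.frame(
  country = c("A", "B"),
  rate = c("745/19987071", "37737/172006362")
)

# Split "rate" at the slash into two new columns.
out <- separate(df, col = rate, into = c("cases", "pop"), sep = "/")
```

The new columns are character by default; passing `convert = TRUE` asks `separate()` to guess better types.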
unite
: Combine several columns into a single column by pasting their values together, row by row.
unite(df, col = year, century:year, sep = "")
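The same call on a made-up data frame, as a sketch; the column values are assumptions chosen so the result is easy to read.

```r
library(tidyr)

# Hypothetical input: the year is split into century and two-digit year.
df <- data.frame(
  country = c("A", "B"),
  century = c("19", "20"),
  year = c("99", "00")
)

# Paste century and year together with no separator; the inputs are dropped.
out <- unite(df, col = year, century:year, sep = "")
```

The result keeps `country` and replaces the two input columns with a single `year` column.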
Joins
Joins merge data sets based on key variables. The syntax is always name_join(x, y, by = "key")
Animated visuals created by Garrick Aden-Buie
full_join()
: keeps all observations in x and y
left_join()
: keeps all observations in x
right_join()
: keeps all observations in y
inner_join()
: keeps only observations whose key appears in both x and y
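A small sketch comparing the four joins on two made-up tables; `x`, `y`, and their columns are assumptions for illustration.

```r
library(dplyr)

# Hypothetical tables sharing the key column "key".
x <- data.frame(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- data.frame(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

full  <- full_join(x, y, by = "key")   # every key from x or y
left  <- left_join(x, y, by = "key")   # every key from x
right <- right_join(x, y, by = "key")  # every key from y
inner <- inner_join(x, y, by = "key")  # only keys found in both
```

Where a key has no match in the other table, the columns from that table are filled with `NA`.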
More Vis
- Amounts
- Proportions
- Scatterplots
- Slope graphs, dumbbell plots
To the Script
Bad Viz Examples
To the Slack!
Do No Harm
- “If I were one of the data points on this visualization, would I feel offended?” – Kim Bui
- “If I only saw this chart on Twitter, would I draw the correct conclusion?”
- Understand the data – how are they generated, what/whose purpose do they serve, who is included or excluded, and more
- Use language thoughtfully, use colors thoughtfully, consider missing groups
Color
Scales
Color used to distinguish groups requires a qualitative color scale that is
- finite and unordered
- readily distinguished
- approximately equivalent (no one color should stand out relative to the others)
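One way to get a qualitative scale in ggplot2, as a sketch: the built-in `iris` data and the ColorBrewer palette "Dark2" are choices made for this example, not recommendations from the notes.

```r
library(ggplot2)

# Species is an unordered grouping variable, so a qualitative palette fits.
p_qual <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")
```

Each of the three species gets its own distinct hue, with no implied ordering among them.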
Color used to represent values or comparative magnitude requires a sequential color scale that
- uses a many-valued gradient to distinguish larger/smaller values
- represents the distance between values
- may be single-hued, multi-hued, diverging
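A sketch of a sequential scale in ggplot2, assuming the `faithfuld` data set that ships with ggplot2 and the viridis palette as one multi-hued option.

```r
library(ggplot2)

# density is a continuous value, so it maps to a gradient rather than to groups.
p_seq <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_raster() +
  scale_fill_viridis_c()
```

For data with a meaningful midpoint, `scale_fill_gradient2()` gives a diverging scale instead.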
Color used to highlight a group or threshold value requires accent colors that
- stand out/pop relative to the rest of the colors
- may be a single color against a grey backdrop
- may be based on the intensity of colors in the color scale
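A sketch of the "single color against grey" approach, assuming `iris` as the data and "virginica" as the group to highlight; the `highlight` column and both hex/grey values are made up for this example.

```r
library(ggplot2)

# Collapse all non-highlighted groups into a single "other" category.
iris_hl <- transform(
  iris,
  highlight = ifelse(Species == "virginica", "virginica", "other")
)

# Grey for the backdrop, one saturated accent color for the group of interest.
p_accent <- ggplot(iris_hl, aes(Sepal.Length, Sepal.Width, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c(other = "grey70", virginica = "#D55E00"))
```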
Pitfalls
- Encoding too much information (e.g., too many groups)
  - Wilke suggests qualitative scales work best with 3 to 5 groups and work poorly beyond 8 groups
  - Labeling points is an alternative
- Coloring for the sake of coloring
  - And using oversaturated colors
- Using non-monotonic scales for values (e.g., the rainbow scale)
- Ignoring accessibility (e.g., color perception)
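On the accessibility point, one option is a palette designed with color-vision deficiency in mind. The sketch below assumes R 4.0.0 or later, where base R's `palette.colors()` includes the Okabe-Ito palette; the choice of `iris` and of which palette entries to use is arbitrary.

```r
library(ggplot2)

# Okabe-Ito colors from base R; the first entry is black, so skip it here.
okabe_ito <- unname(palette.colors(n = 4, palette = "Okabe-Ito"))[2:4]

# unname() matters: named values would be matched against the Species levels.
p_access <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_manual(values = okabe_ito)
```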
Colors in R
Emil Hvitfeldt’s
To the Script
R Markdown
R Markdown creates dynamic documents by combining markdown (an easy to write plain text format) with embedded R code chunks. When compiled, the code can be evaluated so that the code, its output, and your prose can be included in the final document to make reports reproducible.
- R Markdown documents (.Rmd files) can be rendered to multiple formats including HTML and PDF.
- The R code in an .Rmd document is processed by knitr, while the resulting .md file is rendered by pandoc to the final output formats (e.g. HTML or PDF).
R Markdown files contain
- A YAML header (YAML originally stood for "Yet Another Markup Language" and now stands for "YAML Ain't Markup Language"), offset by ---
- Text with markdown formatting
- Chunks of R code, opened by ```{r} and closed by ``` (RStudio keyboard shortcut: Cmd + Option + I on macOS, Ctrl + Alt + I on Windows/Linux)
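The three parts can be seen together in a minimal sketch that writes a tiny .Rmd file from R and renders it. The title, the text, and the chunk contents are made up for illustration, and the render step is guarded because it assumes the rmarkdown package and pandoc are installed.

```r
# Build the chunk fence programmatically so it does not clash with this block.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                        # YAML header opens
  'title: "Example report"',
  "output: html_document",
  "---",                        # YAML header closes
  "",
  "Some *markdown* text.",      # prose with markdown formatting
  "",
  paste0(fence, "{r}"),         # R chunk opens
  "summary(cars)",
  fence                         # R chunk closes
)

# Write the document to a temporary file.
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# knitr runs the chunks, then pandoc converts the result to HTML.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_path, quiet = TRUE)
}
```

In practice the .Rmd file is usually edited directly and rendered with the Knit button or `rmarkdown::render()`.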
Additional Resources