aw.stats

Awhile ago I wrote about trying my hand at creating a data visualization inspired by the The Pudding. I decided to do another one of those posts but this time inspired by a Flowing Data visualization looking at 2018 salary estimates from the Bureau of Labor Statistics. I stumbled upon this one on Twitter and thought I bet we can make something with ggplot2 and plotly. The finished version from the sites creator, Nathan Yau, is fantastic but without the D3 knowledge it can be hard to know where to start. As it just so happens this also coincides with work I have been doing the last couple months to teach a number of my co-workers R and this week we were covering some data visualization techniques and different packages. My usual requirement for the lessons are they need to be less than 20-30 minutes of work and each needs to illustrate the use of (mainly) one package. So this one centers around plotly which not only is super easy to implement but also fairly customizable and I think we can get to a nice finished product.

The data for this is available for download on the Bureau of Labor Statistics website.

library(tidyverse)
library(awtools)
library(plotly)
library(ggbeeswarm)

Now that we have those packages loaded we can import the salaries dataframe and the categories dataframe from github.

salaries <- read_excel("path/to/your/download.xlsx")
categories <- read.csv("https://raw.githubusercontent.com/awhstin/Dataset-List/master/2018categories.csv")

From there we create a field to join the two dataframes together and get rid of the columns we aren’t interested in and remove any duplicate rows that are left.

salaries <- salaries %>%
  mutate(ID = as.numeric(gsub("-.*", "", OCC_CODE))) %>%
  select(1, 2, 4:21) %>%
  distinct(salaries, OCC_TITLE, .keep_all = TRUE)

Now we join the two dataframes and manually convert the numbers to numbers by removing the commas. We need to remove the ALL category and all occupations that are the roll up for each category. Finally one thing that is great about plotly is the customization of tooltips so we create a nice text element for the labels.

salaries.cats <- salaries %>%
  left_join(., categories) %>%
  select(21, 2:4, 6, 15) %>%
  mutate(
    Median.Salary = as.numeric(gsub(",", "", A_MEDIAN)),
    TOT_EMP = as.numeric(gsub(",", "", TOT_EMP)),
    Cat = gsub("Occupations", "", Cat)
  ) %>%
  slice(-1) %>%
  filter(!grepl("Occupations", OCC_TITLE)) %>%
  mutate(
    text = paste(
      OCC_TITLE,
      "<br>Category:", Cat,
      "<br>Total:", formatC(TOT_EMP, format = "f", digits = 0, big.mark = ","),
      "<br>Median Salary:", paste("$", formatC(as.numeric(Median.Salary), format = "f", digits = 0, big.mark = ","))
    )
  )

To handle the large number of categories I am going to take one of my palettes and expand it.

# create color palette
colourCount <- length(unique(categories$ID))
getPalette <- colorRampPalette(bpalette)

Now we just create the ggplot2 object to feed to plotly. There is the handy text argument available which you can leverage to add our manually created tooltip to the object to pass to the final viz.

p <- ggplot(salaries.cats, aes(1, Median.Salary, color = Cat, text = text)) +
  geom_quasirandom(aes(size = TOT_EMP), bandwidth = 1, show.legend = FALSE) +
  scale_color_manual(values = getPalette(colourCount)) +
  scale_y_continuous(labels = scales::dollar) +
  a_plex_theme(emphasis = "y", grid = FALSE) +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(size = .1, color = "#444444")
  )

Finally here! We create the custom mouse-over labels, and pass our ggplot object to the plotly::ggplotly function.

# customized labels
label <- list(
  bgcolor = "white",
  bordercolor = "#4444444",
  font = list(
    family = "IBM Plex Mono",
    size = 10,
    color = "#444444"
  )
)

# plot(ly)
ggplotly(p,
  tooltip = c("text"),
  width = 650,
  height = 815
) %>%
  highlight("plotly_selected") %>%
  style(hoverlabel = label)

Median Salary for Occupations

Double click on categories in the legend to filter.

Data from 2018 Bureau of Labor Statistics
May 2018 National Occupational Employment and Wage Estimates United States

In the original article on Flowing Data this sort of visualization was discussed as showing a good overall picture but warranted visualizing every category as facets. The great part about using plotly to visualize this is the native feature of filtering from the legend. Double clicking on the entries in the legend filters on that category and double clicking again deactivates the filter. This adds an interesting layer where you can compare the distribution of salaries within that occupation category.

Median Salary for Occupations

Data from 2018 Bureau of Labor Statistics May 2018 National Occupational Employment and Wage Estimates United States

Data from 2018 Bureau of Labor Statistics
May 2018 National Occupational Employment and Wage Estimates United States