COVID-19 Cases, Hospitalizations, and Deaths Across Race

For this project, I’m seeking to explore the differing experiences of COVID-19 based on racial identity and state of residence. A lack of equitable healthcare infrastructure, types of employment, and location are all factors that can and have contributed to the disproportionate impact of COVID-19 on racial and ethnic minorities in the United States. This is evidenced by CDC data showing that Native Hawaiian or Pacific Island (NHOPI), Hispanic, and American Indian and Alaska Native (AIAN) people face a 1.5 times higher right of COVID-19 infection that White people. In addition, AIAN, Hispanic, NHOPI and Black people face a 2x higher likelihood of death than White people from COVID-19 infections. Compounding racial disparities are differences in state response to the pandemic.The extent of COVID-19 protection measures varied greatly by state, depending on factors such as the state’s political leanings and attitudes of leadership regarding the pandemic. Some states saw more lenient regulations vs others, which ultimately resulted in significantly different experiences with COVID-19. At times, these factors, compounded by pre-existing genetic conditions related to race, have contributed to heightened rates of infection, hospitalization, and deaths within these vulnerable communities. In this project, I’d like to do a deeper dive into the extent of the experience gap between these marginalized communities and white communities through variables such as location and race.

The data I’m using comes from a collaboration between The Covid Tracking Project at The Atlantic and the Boston University Center for Antiracist Research. It tracks data from April 15, 2020 to March 7, 2021, so almost a full year of data at the height of the pandemic. The data was then updated twice a week.

Data Cleaning

data <- read_csv("CovidRace.csv")
acs <- read_csv("ACSDP5Y2020.DP05-Data.csv")
saveRDS(data, "data.rds")

# remove all rows without a total number of cases
# remove all territories (Puerto Rico, Northern Mariana Islands, Guam, Virgin Islands) for just the 50 states and Washington D.C.

data <- data %>% filter(
  !is.na(Cases_Total),
  !State %in% c("PR","MP","GU","VI"))

data <- data %>% 
  mutate(date1 = ymd(Date)) 

Normalization for Population

Here I’m pulling the total population of each state, as well as population of each racial group for normalization purposes and renaming the variables to match the racial groups in the original data. Also adding postal codes so that I can join the two tables later.

acs <- acs %>% 
  select(GEO_ID, NAME, DP05_0001E, DP05_0037E, DP05_0038E, DP05_0039E, DP05_0044E, DP05_0052E, DP05_0057E, DP05_0058E, DP05_0070E) %>% 
  filter(!NAME == "Puerto Rico", !GEO_ID == "Geography") %>% 
  mutate(post_code = c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "HI", "ID", "IL", "IN", "IO", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH",
"NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")) 

colnames(acs)[c(3:11)] <- c("Total_pop", "White_pop", "Black_pop", "AIAN_pop", "Asian_pop", "NHPI_pop", "Other_pop", "Multiracial_pop", "Latinx_pop")

### Joining the tables using postal code!

norm_data <- acs %>% select(-Total_pop) %>% # don't want to add the total to the other cases, so removing it
  left_join(data, by = c("post_code" = "State")) 

norm_data_total <- acs %>% 
  left_join(data, by = c("post_code" = "State")) 

### Cases, hospitalizations, tests, and deaths normalized by population- without total cases

norm_data <- norm_data %>% mutate(
  White_cases_norm = Cases_White/as.numeric(White_pop),
  White_hosp_norm = Hosp_White/as.numeric(White_pop),
  White_tests_norm = Tests_White/as.numeric(White_pop),
  White_deaths_norm = Deaths_White/as.numeric(White_pop),
  Black_cases_norm = Cases_Black/as.numeric(Black_pop),
  Black_hosp_norm = Hosp_Black/as.numeric(Black_pop),
  Black_tests_norm = Tests_Black/as.numeric(Black_pop),
  Black_deaths_norm = Deaths_Black/as.numeric(Black_pop),
  AIAN_cases_norm = Cases_AIAN/as.numeric(AIAN_pop),
  AIAN_hosp_norm = Hosp_AIAN/as.numeric(AIAN_pop),
  AIAN_tests_norm = Tests_AIAN/as.numeric(AIAN_pop),
  AIAN_deaths_norm = Deaths_AIAN/as.numeric(AIAN_pop),
  Asian_cases_norm = Cases_Asian/as.numeric(Asian_pop),
  Asian_hosp_norm = Hosp_Asian/as.numeric(Asian_pop),
  Asian_tests_norm = Tests_Asian/as.numeric(Asian_pop),
  Asian_deaths_norm = Deaths_Asian/as.numeric(Asian_pop),
  NHPI_cases_norm = Cases_NHPI/as.numeric(NHPI_pop),
  NHPI_hosp_norm = Hosp_NHPI/as.numeric(NHPI_pop),
  NHPI_tests_norm = Tests_NHPI/as.numeric(NHPI_pop),
  NHPI_deaths_norm = Deaths_NHPI/as.numeric(NHPI_pop),
  Other_cases_norm = Cases_Other/as.numeric(Other_pop),
  Other_hosp_norm = Hosp_Other/as.numeric(Other_pop),
  Other_tests_norm = Tests_Other/as.numeric(Other_pop),
  Other_deaths_norm = Deaths_Other/as.numeric(Other_pop),
  Multiracial_cases_norm = Cases_Multiracial/as.numeric(Multiracial_pop),
  Multiracial_hosp_norm = Hosp_Multiracial/as.numeric(Multiracial_pop),
  Multiracial_tests_norm = Tests_Multiracial/as.numeric(Multiracial_pop),
  Multiracial_deaths_norm = Deaths_Multiracial/as.numeric(Multiracial_pop),
  Latinx_cases_norm = Cases_Latinx/as.numeric(Latinx_pop),
  Latinx_hosp_norm = Hosp_Latinx/as.numeric(Latinx_pop),
  Latinx_tests_norm = Tests_Latinx/as.numeric(Latinx_pop),
  Latinx_deaths_norm = Deaths_Latinx/as.numeric(Latinx_pop)) 

### Cases, hospitalizations, tests, and deaths normalized by population- With total cases 

norm_data_total <- norm_data_total %>% mutate(
  Total_cases_norm = Cases_Total/as.numeric(Total_pop),
  White_cases_norm = Cases_White/as.numeric(White_pop),
  Black_cases_norm = Cases_Black/as.numeric(Black_pop),
  AIAN_cases_norm = Cases_AIAN/as.numeric(AIAN_pop),
  Asian_cases_norm = Cases_Asian/as.numeric(Asian_pop),
  NHPI_cases_norm = Cases_NHPI/as.numeric(NHPI_pop),
  Other_cases_norm = Cases_Other/as.numeric(Other_pop),
  Multiracial_cases_norm = Cases_Multiracial/as.numeric(Multiracial_pop),
  Latinx_cases_norm = Cases_Latinx/as.numeric(Latinx_pop),
  
  Total_hosp_norm = Hosp_Total/as.numeric(Total_pop),
  White_hosp_norm = Hosp_White/as.numeric(White_pop),
  Black_hosp_norm = Hosp_Black/as.numeric(Black_pop),
  AIAN_hosp_norm = Hosp_AIAN/as.numeric(AIAN_pop),
  Asian_hosp_norm = Hosp_Asian/as.numeric(Asian_pop),
  NHPI_hosp_norm = Hosp_NHPI/as.numeric(NHPI_pop),
  Other_hosp_norm = Hosp_Other/as.numeric(Other_pop),
  Multiracial_hosp_norm = Hosp_Multiracial/as.numeric(Multiracial_pop),
  Latinx_hosp_norm = Hosp_Latinx/as.numeric(Latinx_pop),
  
  Total_tests_norm = Tests_Total/as.numeric(Total_pop),
  White_tests_norm = Tests_White/as.numeric(White_pop),
  Black_tests_norm = Tests_Black/as.numeric(Black_pop),
  AIAN_tests_norm = Tests_AIAN/as.numeric(AIAN_pop),
  Asian_tests_norm = Tests_Asian/as.numeric(Asian_pop),
  NHPI_tests_norm = Tests_NHPI/as.numeric(NHPI_pop),
  Other_tests_norm = Tests_Other/as.numeric(Other_pop),
  Multiracial_tests_norm = Tests_Multiracial/as.numeric(Multiracial_pop),
  Latinx_tests_norm = Tests_Latinx/as.numeric(Latinx_pop),
  
  Total_deaths_norm = Deaths_Total/as.numeric(Total_pop),
  White_deaths_norm = Deaths_White/as.numeric(White_pop),
  Black_deaths_norm = Deaths_Black/as.numeric(Black_pop),
  AIAN_deaths_norm = Deaths_AIAN/as.numeric(AIAN_pop),
  Asian_deaths_norm = Deaths_Asian/as.numeric(Asian_pop),
  NHPI_deaths_norm = Deaths_NHPI/as.numeric(NHPI_pop),
  Other_deaths_norm = Deaths_Other/as.numeric(Other_pop),
  Multiracial_deaths_norm = Deaths_Multiracial/as.numeric(Multiracial_pop),
  Latinx_deaths_norm = Deaths_Latinx/as.numeric(Latinx_pop)) 

Changing to long data type

cases_long_norm_start <- norm_data_total %>% 
  filter(date1 == as.Date("2020-04-15")) %>% 
  select(post_code, Total_cases_norm:Latinx_cases_norm) %>% 
  mutate(across(Total_cases_norm:Latinx_cases_norm, ~replace_na(.x, 0))) %>% 
  pivot_longer(-post_code, names_to = "race", values_to = "cases",
               names_prefix = "Cases_")

cases_long_norm_end <- norm_data_total %>% 
  filter(date1 == as.Date("2021-03-07")) %>% 
  select(post_code, Total_cases_norm:Latinx_cases_norm) %>% 
  mutate(across(Total_cases_norm:Latinx_cases_norm, ~replace_na(.x, 0))) %>% 
  pivot_longer(-post_code, names_to = "race", values_to = "cases",
               names_prefix = "Cases_")

Looking at the percentage of covid cases compared to overall population by state at the start of data collection and at the end

cases_long_norm_start %>% 
  filter(race == "Total_cases_norm") %>% 
  ggplot(aes(y = fct_reorder(post_code, cases), x = cases)) +
  geom_col(fill = "darkolivegreen4") +
  labs(y = "State", x = "Percentage of Population with COVID", title = "Percentage of Cases in Population by State in April 2020") +
  theme(axis.title = element_text(size = 10),
        axis.text = element_text(size = 5))

cases_long_norm_end %>% 
  filter(race == "Total_cases_norm") %>% 
  ggplot(aes(y = fct_reorder(post_code, cases), x = cases)) +
  geom_col(fill = "darkolivegreen4") +
  labs(y = "State", x = "Percentage of Population with COVID", title = "Percentage of Cases in Population by State in March 2021") +
  theme(axis.title = element_text(size = 10),
        axis.text = element_text(size = 5))

Comparing the proportion of cases by race in each state at the start of data collection and at the end

bw_start <- cases_long_norm_start %>% 
  filter(race %in% c("White_cases_norm", "Black_cases_norm")) %>% 
  ggplot(aes(x = reorder(post_code, desc(post_code)), y = cases, color = race)) +
  geom_point() +
  coord_flip() +
  scale_color_manual("Race", values = c("darkolivegreen4", "lightgoldenrod2")) +
  labs(x = "State", y = "Cases Normalized for Population", 
       title = "Comparing Proportion of Cases by Race in each State in April 2020") +
  theme(axis.title = element_text(size = 10),
        axis.text = element_text(size = 5))
bw_start

bw_end <- cases_long_norm_end %>% 
  filter(race %in% c("White_cases_norm", "Black_cases_norm")) %>% 
  ggplot(aes(x = reorder(post_code, desc(post_code)), y = cases, color = race)) +
  geom_point() +
  coord_flip() +
  scale_color_manual("Race", values = c("darkolivegreen4", "lightgoldenrod2")) +
  labs(x = "State", y = "Cases Normalized for Population", 
       title = "Comparing Proportion of Cases by Race in each State in March 2021") +
  theme(axis.title = element_text(size = 10),
        axis.text = element_text(size = 5))
bw_end

lxw_start <- cases_long_norm_start %>% 
  filter(race %in% c("White_cases_norm", "Latinx_cases_norm")) %>% 
  ggplot(aes(x = reorder(post_code, desc(post_code)), y = cases, color = race)) +
  geom_point() +
  coord_flip() +
  scale_color_manual("Race", values = c("darkolivegreen4", "lightgoldenrod2")) +
  labs(x = "State", y = "Cases Normalized for Population", 
       title = "Comparing Proportion of Cases by Race in each State in April 2020") +
  theme(axis.title = element_text(size = 10),
        axis.text = element_text(size = 5))
lxw_start

lxw_end <- cases_long_norm_end %>% 
  filter(race %in% c("White_cases_norm", "Latinx_cases_norm")) %>% 
  ggplot(aes(x = reorder(post_code, desc(post_code)), y = cases, color = race)) +
  geom_point() +
  coord_flip() +
  scale_color_manual("Race", values = c("darkolivegreen4", "lightgoldenrod2")) +
  labs(x = "State", y = "Cases Normalized for Population", 
       title = "Comparing Proportion of Cases by Race in each State in March 2021") +
  theme(axis.title = element_text(size = 10),
        axis.text = element_text(size = 5))
lxw_end

aw_start <- cases_long_norm_start %>% 
  filter(race %in% c("White_cases_norm", "Asian_cases_norm")) %>% 
  ggplot(aes(x = reorder(post_code, desc(post_code)), y = cases, color = race)) +
  geom_point() +
  coord_flip() +
  scale_color_manual("Race", values = c("darkolivegreen4", "lightgoldenrod2")) +
  labs(x = "State", y = "Cases Normalized for Population", 
       title = "Comparing Proportion of Cases by Race in each State in April 2020") +
  theme(axis.title = element_text(size = 10),
        axis.text = element_text(size = 5))
aw_start

aw_end <- cases_long_norm_end %>% 
  filter(race %in% c("White_cases_norm", "Asian_cases_norm")) %>% 
  ggplot(aes(x = reorder(post_code, desc(post_code)), y = cases, color = race)) +
  geom_point() +
  coord_flip() +
  scale_color_manual("Race", values = c("darkolivegreen4", "lightgoldenrod2")) +
  labs(x = "State", y = "Cases Normalized for Population", 
       title = "Comparing Proportion of Cases by Race in each State in March 2021") +
  theme(axis.title = element_text(size = 10),
        axis.text = element_text(size = 5))
aw_end

Some interesting insights here, namely that in all states but Indiana, Kansas, Kentucky, Nebraska, Oklahoma, and Tennessee, the proportion of cases in the Black community is higher than the proportion of cases in the White community after normalization for population at the end of the data collection period in March of 2021. Here, we can clearly see that data collection was not as robust at the beginning of the pandemic with only 27 states and D.C. reporting cases by race at this point in April of 2020. However, even at the beginning of the pandemic, all states but Montana see disproportionate cases in Black communities over White communities. For the Latinx demographic group, the results are less stark, instead portraying higher proportions of White cases compared to the overall White population, regardless of the year. The scatterplot for Asian Americans are even more ambiguous, with different states seeing a different group with higher proportion of cases compared to total population, regardless of the year.

Comparing the demographic composition of cases to the demographic composition of the population using filled/stack bars.

cases_long_counts_start <- norm_data_total %>% 
  filter(date1 == as.Date("2020-04-15")) %>% 
  select(post_code, Cases_Total:Cases_Unknown) %>% 
  mutate(across(Cases_Total:Cases_Unknown, ~replace_na(.x, 0))) %>% 
  pivot_longer(-post_code, names_to = "race", values_to = "cases",
               names_prefix = "Cases_")

cases_long_counts_end <- norm_data_total %>% 
  filter(date1 == as.Date("2021-03-07")) %>% 
  select(post_code, Cases_Total:Cases_Unknown) %>% 
  mutate(across(Cases_Total:Cases_Unknown, ~replace_na(.x, 0))) %>% 
  pivot_longer(-post_code, names_to = "race", values_to = "cases",
               names_prefix = "Cases_")

# long pop counts
pop_long_counts_start <- norm_data_total %>% 
  filter(date1 == as.Date("2020-04-15")) %>% 
  select(post_code, Total_pop:Latinx_pop) %>% 
  mutate(across(Total_pop:Latinx_pop, as.numeric)) %>% 
  pivot_longer(-post_code, names_to = "race", values_to = "pop") %>% 
  mutate(race = str_remove(race, "_pop"))

pop_long_counts_end <- norm_data_total %>% 
  filter(date1 == as.Date("2021-03-07")) %>% 
  select(post_code, Total_pop:Latinx_pop) %>% 
  mutate(across(Total_pop:Latinx_pop, as.numeric)) %>% 
  pivot_longer(-post_code, names_to = "race", values_to = "pop") %>% 
  mutate(race = str_remove(race, "_pop"))

# joining the two (and pivotting again)
cases_long_counts_start <- cases_long_counts_start %>% 
  left_join(pop_long_counts_start) %>% 
  pivot_longer(-c(post_code, race), names_to = "group", values_to = "count")

cases_long_counts_end <- cases_long_counts_end %>% 
  left_join(pop_long_counts_end) %>% 
  pivot_longer(-c(post_code, race), names_to = "group", values_to = "count")

# creating stacked bars to compare by state
cases_long_counts_start %>% 
  filter(race != "Total") %>% 
  ggplot(aes(x = group, y = count, fill = race)) + 
  scale_fill_manual(values = c("#800000FF", "#FFA319FF", "#8A9045FF", "#155F83FF", "#C16622FF", "#8F3931FF", "#58593FFF", "#767676FF", "#350E20FF")) +
  geom_col(position = "fill") +
  facet_wrap(~ post_code)

cases_long_counts_end %>% 
  filter(race != "Total") %>% 
  ggplot(aes(x = group, y = count, fill = race)) +
    scale_fill_manual(values = c("#800000FF", "#FFA319FF", "#8A9045FF", "#155F83FF", "#C16622FF", "#8F3931FF", "#58593FFF", "#767676FF", "#350E20FF")) +
  geom_col(position = "fill") +
  facet_wrap(~ post_code)

Something interesting that pops up in this view is the minimal proportion of cases in Latinx communities. The proportion of community members with cases is significantly smaller for Latinx communities than the total Latinx population might suggest there would be. In several states, such as Mississippi, South Carolina, Louisiana, and Maryland, there are higher proportions of Black cases than the total Black population in each state might suggest. This suggests a potentially disproportionate experience of COVID-19 among different communities.

Comparing COVID case prevalance with the results of the 2020 presidential election

Another question I wanted to explore with my data was whether certain states had higher rates of infection over others. I was specifically looking to test the hypothesis that more liberal states with stricter COVID-19 regulations had lower rates of infections, hospitalizations, and deaths of Black Americans compared to more conservative states with lenient COVID-19 regulations. I first looked at how the proportion of Black cases to the state’s Black population compared to 2020 election results. Then I looked at how the proportion of White cases to the state’s White population compared to 2020 election results.

party <- norm_data_total %>% 
  mutate(party2020 = ifelse(post_code %in% c("AK", "AL", "AR", "FL", "ID", "IN", "IA", "KS", "KY", "LA", "MS", "MO", "MT", "NE", "NC", "ND", "OH", "OK", "SC", "SD", "TN", "TX", "UT", "WV", "WY"), "Republican", "Democrat"
  ))

usstates <- ne_states(iso_a2 = "US", returnclass = "sf")

us48 <- usstates %>% 
  filter(!woe_name %in% c("Alaska", "Hawaii"))

party1 <- party %>% 
  filter(!post_code %in% "AK" | !post_code %in% "HI") %>%
  mutate(postal = post_code)
us48_join <- left_join(us48, party1, by = "postal")

plot6 <- ggplot(us48_join) +
geom_sf(aes(fill= party2020), alpha = 0.4) +
  scale_fill_manual("Party Affiliation \n in 2020 Election", values = c("dark blue", "dark red")) +
  theme(legend.position = "bottom") +
  labs(title = "2020 Election Results")
plot6

plot7 <- ggplot(us48_join) +
geom_sf(aes(fill= Black_cases_norm)) +
  scale_fill_carto_c(palette = "Burg", name = "Black COVID-19 Cases") +
  theme(legend.position = "bottom") +
  labs(title = "Black COVID-19 Cases by State")
plot7

plot7.1 <- ggplot(us48_join) +
geom_sf(aes(fill= White_cases_norm)) +
  scale_fill_carto_c(palette = "Burg", name = "White COVID-19 Cases") +
  theme(legend.position = "bottom") +
  labs(title = "White COVID-19 Cases by State")
plot7.1

At first glance, it doesn’t seem like the maps depicting proportion of members in a demographic community with COVID However, if you look closer at the scale, it’s clear that the proportion of members in white communities with COVID is much lower than the proportion in Black communities. In general, most states seem to have similar proportions across demographic communities, regardless of how their state voged in the 2020 election.

Another way of looking at the election data: A box plot

I also created a box plot to see the distribution of cases by 2020 Election results.

#total cases in a boxplot by how the state voted in the 2020 election histogram
#indicator variable for political party

partybplot <- party %>% 
  ggplot(aes(x = party2020, y = Total_cases_norm, fill = party2020)) +
  geom_boxplot(outlier.size = 0.5, alpha= 0.4) +
  scale_fill_manual("Party Affiliation \n in 2020 Election", values = c("dark blue", "dark red")) +
   labs(x = "Party of Candidate Voted for in the 2020 Election", y = "Distribution of Cases", title = "Distribution of Total Cases by 2020 Election Results ") 
partybplot
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

After normalizing for population differences, the states with the higher counts of COVID cases are states that voted for the Republican candidate in the 2020 election. The interquartile range for the Republican voting states is larger, indicating more variability, but ultimately the median proportion of cases to total state population and maximum cases is higher for Republican voting states than Democrat voting states. Other factors could contribute to this higher count, but there does seem to maybe be a relationship between state voting patterns and the prevalance of COVID.

Conclusion

Overall, my data visualizations seem to generally imply a disproportionate experience of COVID-19 based on race, with differences being especially stark between the Black and White communities. Among other racial groups, differences in proportions and case prevalance do seem to exist but to a lesser degree. I’m not surprised by the findings of this project, but I think it’s interesting that differences are less tangible for other racial groups other than the Blck community. I do wish there was more data available from earlier on in the pandemic at the start of data collection. I think this could’ve allowed for a more robust analysis, especially when comparing case counts and response from the start of the pandemic to partway through it. I think more could be done to explore the experiences of demographic groups other than larger minority populations, such as Native Hawaiian and Pacific Islanders and American Indian/Alaskan Natives with that data.