Link to our GitHub Webpage: https://criviere1.github.io/¶

In [ ]:
!git clone https://github.com/criviere1/criviere1.github.io.git
fatal: destination path 'criviere1.github.io' already exists and is not an empty directory.
In [ ]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In [ ]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sb

CMPS 3160-01 Fa23¶

Collette Riviere and Toby Mendels¶

Motivation¶

The team, Collette Riviere and Toby Mendels, will be investigating datasets related to Covid-19 infection rates, vaccine hesitancy rates, actual Covid-19 vaccine rates, and state by state demographic information. Most of the data will come from the US Census Bureau, specifically the Household Pulse Survey (HPS) and the American Community Survey (ACS). Daily Covid-19 infeciton rates were collected by the Center for Systems Science and Engineering at Johns Hopkins University from the World Health Organization (WHO) and the US Centers for Disease Control and Prevention (CDC). Finally, actual Covid-19 vaccine trends were colllected from the CDC.

The first dataset is state-by-state demographic information, such as sex, age, race, income, health insurance status, and educational attainment. This dataset could potentially indicate a relationship between state demographic information vaccine hesitancy or Covid-19 cases. The second dataset, from the HPS survey, we are interested in vaccine hesitancy by state. The Covid-19 vaccine was a devicive issue for many people so looking at vaccine hesitency before the Covid-19 pandemic could help indicate which parts of the US could be most reluctant or hesitant about the Covid-19 vaccines. This relates to the third dataset, of actual vaccine trends. This can help us highlight the difference between hesitancy and actual vaccine rates. The final dataset is daily Covid-19 infection reports across states. This dataset can help investigate how Covid-19 spread across the USA from Janurary 2020 to March 2023 and how different states were affected by the pandemic over time. It can also be compared to vaccine rates to determine if the vaccine actually helped slow the spread of the pandemic.

All of these datasets in conjunction can help these specific answer questions:

  • "Did states with more vaccine hesitancy have higher rates of Covid-19 infection, especially after the Covid-19 vaccine was released?"
  • "How do states with different demographic makeups differ in vaccine hesitancy and Covid-19 cases?"
  • "Did vaccines decrease the Covid-19 infection rate (assuming people with little vaccine hesitancy got the Covid-19 vaccine)?"

Inital steps to answer these questions will be to analyze the correlation between each dataset. Additionally, comparing the rates of Covid-19 cases before and after the vaccine was rolled out will help to address the question of whether the vaccine redcued Covid-19 rates. This is discussed more below in our model.

The data we are using from the Household Pulse Survey is found here: https://data.cdc.gov/Vaccinations/Vaccine-Hesitancy-for-COVID-19-County-and-local-es/q9mh-h2tw/data.

Selected Demographic information from the ACS is found here:

  • https://data.census.gov/cedsci/table?q=DP02#
  • https://data.census.gov/cedsci/table?q=DP03#
  • https://data.census.gov/cedsci/table?q=DP05#

COVID-19 infection reports can be found here: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/README.md.

Covid-19 Vaccination Trends can be found from the CDC here: https://data.cdc.gov/Vaccinations/COVID-19-Vaccination-Trends-in-the-United-States-N/rh2h-3yt2

Covid-19 timeline used: https://www.cdc.gov/museum/timeline/covid19.html#Early-2021

In [ ]:
us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
}

Table 1: Demographic Information by State¶

First, we have to read in the tables from the ACS and select the data that we want.

In [ ]:
demo1_df = pd.read_csv("../datasets/ACSDP1Y2022.DP02-2023-edu.csv")
demo2_df = pd.read_csv("../datasets/ACSDP1Y2022.DP03-2023-employment,income,healthinsurance.csv")
demo3_df = pd.read_csv("../datasets/ACSDP1Y2022.DP05-2023-sex,age,race.csv")
In [ ]:
edu_df = demo1_df.iloc[67:74,:-2]

inc_df = demo2_df.iloc[list(range(67,69)),:-2]
ins_df = demo2_df.iloc[[102,105],:-2]

age_df = demo3_df.iloc[[18],:-2]
sex_df = demo3_df.iloc[[2,3],:-2]
race_df = demo3_df.iloc[[76]+list(range(82,89)),:-2]
In [ ]:
df_all = pd.concat([edu_df,inc_df,ins_df,age_df,sex_df,race_df])
df_all.set_index([df_all.columns[0]], inplace=True)
df_all.head()
Out[ ]:
Alabama!!Estimate Alabama!!Percent Alaska!!Estimate Alaska!!Percent Arizona!!Estimate Arizona!!Percent Arkansas!!Estimate Arkansas!!Percent California!!Estimate California!!Percent ... Virginia!!Estimate Virginia!!Percent Washington!!Estimate Washington!!Percent West Virginia!!Estimate West Virginia!!Percent Wisconsin!!Estimate Wisconsin!!Percent Wyoming!!Estimate Wyoming!!Percent
Label (Grouping)
Less than 9th grade 118,034 3.4% 10,950 2.2% 241,467 4.8% 82,740 4.0% 2,376,805 8.8% ... 209,876 3.5% 192,061 3.5% 43,318 3.4% 99,448 2.4% 8,563 2.1%
9th to 12th grade, no diploma 271,263 7.8% 21,719 4.4% 306,781 6.1% 140,697 6.8% 1,738,878 6.5% ... 295,899 4.9% 236,434 4.3% 93,833 7.4% 167,049 4.1% 16,511 4.1%
High school graduate (includes equivalency) 1,057,155 30.4% 142,455 29.1% 1,202,946 23.8% 705,670 34.3% 5,506,855 20.5% ... 1,431,624 23.9% 1,176,424 21.5% 484,166 38.3% 1,209,135 29.6% 109,092 27.3%
Some college, no degree 716,144 20.6% 116,078 23.7% 1,160,936 23.0% 438,076 21.3% 5,192,854 19.3% ... 1,067,612 17.9% 1,161,011 21.2% 216,917 17.2% 798,565 19.5% 99,620 24.9%
Associate's degree 311,537 9.0% 48,370 9.9% 475,387 9.4% 167,634 8.1% 2,115,440 7.9% ... 451,737 7.6% 544,596 10.0% 111,246 8.8% 455,882 11.2% 47,844 12.0%

5 rows × 102 columns

Now to organize the data and get it ready for analysis so the dtypes are corrected and the table is tidied.

In [ ]:
df_all.T.dtypes
Out[ ]:
Label (Grouping)
        Less than 9th grade                                     object
        9th to 12th grade, no diploma                           object
        High school graduate (includes equivalency)             object
        Some college, no degree                                 object
        Associate's degree                                      object
        Bachelor's degree                                       object
        Graduate or professional degree                         object
        Median household income (dollars)                       object
        Mean household income (dollars)                         object
        With health insurance coverage                          object
        No health insurance coverage                            object
        Median age (years)                                      object
        Male                                                    object
        Female                                                  object
        Hispanic or Latino (of any race)                        object
            White alone                                         object
            Black or African American alone                     object
            American Indian and Alaska Native alone             object
            Asian alone                                         object
            Native Hawaiian and Other Pacific Islander alone    object
            Some Other Race alone                               object
            Two or More Races                                   object
dtype: object
In [ ]:
df_all.replace(',','', regex=True, inplace=True)
df_all.replace('%','', regex=True, inplace=True)
df_all.replace('(X)', None, regex=True, inplace=True)

cols = df_all.columns.to_series()
cols.replace('!!Estimate','', regex=True, inplace=True)
cols.replace('!!Percent','%', regex=True, inplace=True)
cols.replace(us_state_to_abbrev,inplace=True)
df_all.columns = cols

df_all = df_all.apply(pd.to_numeric, errors='ignore')

df_all
Out[ ]:
AL Alabama% AK Alaska% AZ Arizona% AR Arkansas% CA California% ... VA Virginia% WA Washington% WV West Virginia% WI Wisconsin% WY Wyoming%
Label (Grouping)
Less than 9th grade 118034.0 3.4 10950.0 2.2 241467.0 4.8 82740.0 4.0 2376805.0 8.8 ... 209876.0 3.5 192061.0 3.5 43318.0 3.4 99448.0 2.4 8563.0 2.1
9th to 12th grade, no diploma 271263.0 7.8 21719.0 4.4 306781.0 6.1 140697.0 6.8 1738878.0 6.5 ... 295899.0 4.9 236434.0 4.3 93833.0 7.4 167049.0 4.1 16511.0 4.1
High school graduate (includes equivalency) 1057155.0 30.4 142455.0 29.1 1202946.0 23.8 705670.0 34.3 5506855.0 20.5 ... 1431624.0 23.9 1176424.0 21.5 484166.0 38.3 1209135.0 29.6 109092.0 27.3
Some college, no degree 716144.0 20.6 116078.0 23.7 1160936.0 23.0 438076.0 21.3 5192854.0 19.3 ... 1067612.0 17.9 1161011.0 21.2 216917.0 17.2 798565.0 19.5 99620.0 24.9
Associate's degree 311537.0 9.0 48370.0 9.9 475387.0 9.4 167634.0 8.1 2115440.0 7.9 ... 451737.0 7.6 544596.0 10.0 111246.0 8.8 455882.0 11.2 47844.0 12.0
Bachelor's degree 609316.0 17.5 94168.0 19.2 1032052.0 20.4 323961.0 15.7 6056169.0 22.5 ... 1409942.0 23.6 1302707.0 23.8 188793.0 15.0 885548.0 21.7 71812.0 18.0
Graduate or professional degree 391475.0 11.3 55478.0 11.3 634087.0 12.5 198846.0 9.7 3879772.0 14.4 ... 1112092.0 18.6 856976.0 15.7 124293.0 9.8 472873.0 11.6 46354.0 11.6
Median household income (dollars) 59674.0 NaN 88121.0 NaN 74568.0 NaN 55432.0 NaN 91551.0 NaN ... 85873.0 NaN 91306.0 NaN 54329.0 NaN 70996.0 NaN 70042.0 NaN
Mean household income (dollars) 82956.0 NaN 109524.0 NaN 101316.0 NaN 76853.0 NaN 131504.0 NaN ... 119058.0 NaN 125847.0 NaN 75265.0 NaN 94085.0 NaN 90018.0 NaN
With health insurance coverage 4551357.0 91.2 624538.0 89.0 6503212.0 89.7 2739056.0 91.6 36056827.0 93.5 ... 7901373.0 93.5 7199483.0 93.9 1638252.0 94.1 5528608.0 94.8 505145.0 88.5
No health insurance coverage 437268.0 8.8 76973.0 11.0 748712.0 10.3 252143.0 8.4 2491843.0 6.5 ... 545005.0 6.5 467558.0 6.1 103070.0 5.9 302843.0 5.2 65619.0 11.5
Median age (years) 39.6 NaN 35.9 NaN 38.8 NaN 38.9 NaN 37.9 NaN ... 39.0 NaN 38.4 NaN 42.9 NaN 40.4 NaN 39.1 NaN
Male 2461248.0 48.5 385667.0 52.6 3678381.0 50.0 1504488.0 49.4 19536425.0 50.1 ... 4299314.0 49.5 3930411.0 50.5 882101.0 49.7 2958771.0 50.2 297855.0 51.2
Female 2613048.0 51.5 347916.0 47.4 3680816.0 50.0 1541149.0 50.6 19492917.0 49.9 ... 4384305.0 50.5 3855375.0 49.5 893055.0 50.3 2933768.0 49.8 283526.0 48.8
Hispanic or Latino (of any race) 246477.0 4.9 56491.0 7.7 2388520.0 32.5 255416.0 8.4 15732184.0 40.3 ... 905750.0 10.4 1093313.0 14.0 34343.0 1.9 447022.0 7.6 62803.0 10.8
White alone 3250182.0 64.1 421104.0 57.4 3814587.0 51.8 2054922.0 67.5 13160426.0 33.7 ... 5095130.0 58.7 4941456.0 63.5 1594128.0 89.8 4654031.0 79.0 473220.0 81.4
Black or African American alone 1296500.0 25.6 20664.0 2.8 322459.0 4.4 435637.0 14.3 2025218.0 5.2 ... 1594785.0 18.4 299537.0 3.8 59390.0 3.3 345678.0 5.9 3990.0 0.7
American Indian and Alaska Native alone 12694.0 0.3 93294.0 12.7 241688.0 3.3 12680.0 0.4 103030.0 0.3 ... 11738.0 0.1 69024.0 0.9 1880.0 0.1 33762.0 0.6 9589.0 1.6
Asian alone 76682.0 1.5 44905.0 6.1 257020.0 3.5 47592.0 1.6 5957867.0 15.3 ... 602881.0 6.9 755832.0 9.7 13134.0 0.7 172242.0 2.9 3577.0 0.6
Native Hawaiian and Other Pacific Islander alone 1954.0 0.0 14539.0 2.0 13887.0 0.2 14176.0 0.5 135777.0 0.3 ... 6869.0 0.1 52231.0 0.7 111.0 0.0 2592.0 0.0 294.0 0.1
Some Other Race alone 19927.0 0.4 4034.0 0.5 36042.0 0.5 11530.0 0.4 251618.0 0.6 ... 60419.0 0.7 50945.0 0.7 5238.0 0.3 19680.0 0.3 4997.0 0.9
Two or More Races 169880.0 3.3 78552.0 10.7 284994.0 3.9 213684.0 7.0 1663222.0 4.3 ... 406047.0 4.7 523448.0 6.7 66932.0 3.8 217532.0 3.7 22911.0 3.9

22 rows × 102 columns

In [ ]:
df_all.T.dtypes
Out[ ]:
Label (Grouping)
        Less than 9th grade                                     float64
        9th to 12th grade, no diploma                           float64
        High school graduate (includes equivalency)             float64
        Some college, no degree                                 float64
        Associate's degree                                      float64
        Bachelor's degree                                       float64
        Graduate or professional degree                         float64
        Median household income (dollars)                       float64
        Mean household income (dollars)                         float64
        With health insurance coverage                          float64
        No health insurance coverage                            float64
        Median age (years)                                      float64
        Male                                                    float64
        Female                                                  float64
        Hispanic or Latino (of any race)                        float64
            White alone                                         float64
            Black or African American alone                     float64
            American Indian and Alaska Native alone             float64
            Asian alone                                         float64
            Native Hawaiian and Other Pacific Islander alone    float64
            Some Other Race alone                               float64
            Two or More Races                                   float64
dtype: object

Now we can look at this information. If we just look at the state percentages, so the data is normalized, we can see if there are any interesting features we can draw about education, income, etc.

In [ ]:
df_all_est = df_all.T.iloc[0::2].T # not percentage columns
df_all_per = df_all.T.iloc[1::2].T # percentage columns
population = df_all_est.iloc[[12,13]].sum()
In [ ]:
df_all_est.T.describe().iloc[[3,4,5,6,7]].T
Out[ ]:
min 25% 50% 75% max
Label (Grouping)
Less than 9th grade 7709.0 43431.5 104599.0 200968.5 2376805.0
9th to 12th grade, no diploma 15759.0 54041.0 167049.0 295917.0 1738878.0
High school graduate (includes equivalency) 64543.0 329568.0 823109.0 1452257.0 5506855.0
Some college, no degree 55085.0 248816.5 635109.0 1027405.0 5192854.0
Associate's degree 15974.0 119963.5 271686.0 460914.5 2115440.0
Bachelor's degree 71812.0 249125.5 584999.0 1277447.0 6056169.0
Graduate or professional degree 46354.0 164184.0 391475.0 881625.5 3879772.0
Median household income (dollars) 52719.0 66885.5 71970.0 84105.5 101027.0
Mean household income (dollars) 72624.0 90063.5 97699.0 111890.5 148872.0
With health insurance coverage 505145.0 1696782.5 4181313.0 6975473.5 36056827.0
No health insurance coverage 19474.0 116320.5 302843.0 532875.0 4898967.0
Median age (years) 32.1 38.2 39.1 40.4 45.1
Male 297855.0 929513.0 2233158.0 3804396.0 19536425.0
Female 283526.0 927581.5 2279152.0 3768095.5 19492917.0
Hispanic or Latino (of any race) 14770.0 173793.0 442629.0 1010612.0 15732184.0
White alone 246745.0 1329689.0 2906492.0 5005741.0 13160426.0
Black or African American alone 3428.0 55358.0 343194.0 1311022.5 3508706.0
American Indian and Alaska Native alone 978.0 6126.5 12313.0 33507.0 273650.0
Asian alone 3577.0 39292.0 131817.0 371435.0 5957867.0
Native Hawaiian and Other Pacific Islander alone 0.0 955.0 2745.0 8061.5 135777.0
Some Other Race alone 2521.0 8167.0 19680.0 46158.0 251618.0
Two or More Races 22911.0 77525.5 213684.0 353171.0 1663222.0

From the raw numbers, we can observe a few things. Health insurance coverage, age, and sex don't vary much across the states. But education levels, income, and race do vary quite a bit, but mean and median income are very similar. From this we can remove the variables that don't vary much, and mean income, to have a set of important demographics to compare across states.

In [ ]:
df_all.drop([df_all.index[8],df_all.index[9],df_all.index[10],df_all.index[11],df_all.index[12],df_all.index[13]], inplace=True)
df_all_est.drop([df_all_est.index[8],df_all_est.index[9],df_all_est.index[10],df_all_est.index[11],df_all_est.index[12],df_all_est.index[13]], inplace=True)
df_all_per.drop([df_all_per.index[8],df_all_per.index[9],df_all_per.index[10],df_all_per.index[11],df_all_per.index[12],df_all_per.index[13]], inplace=True)
df_all
Out[ ]:
AL Alabama% AK Alaska% AZ Arizona% AR Arkansas% CA California% ... VA Virginia% WA Washington% WV West Virginia% WI Wisconsin% WY Wyoming%
Label (Grouping)
Less than 9th grade 118034.0 3.4 10950.0 2.2 241467.0 4.8 82740.0 4.0 2376805.0 8.8 ... 209876.0 3.5 192061.0 3.5 43318.0 3.4 99448.0 2.4 8563.0 2.1
9th to 12th grade, no diploma 271263.0 7.8 21719.0 4.4 306781.0 6.1 140697.0 6.8 1738878.0 6.5 ... 295899.0 4.9 236434.0 4.3 93833.0 7.4 167049.0 4.1 16511.0 4.1
High school graduate (includes equivalency) 1057155.0 30.4 142455.0 29.1 1202946.0 23.8 705670.0 34.3 5506855.0 20.5 ... 1431624.0 23.9 1176424.0 21.5 484166.0 38.3 1209135.0 29.6 109092.0 27.3
Some college, no degree 716144.0 20.6 116078.0 23.7 1160936.0 23.0 438076.0 21.3 5192854.0 19.3 ... 1067612.0 17.9 1161011.0 21.2 216917.0 17.2 798565.0 19.5 99620.0 24.9
Associate's degree 311537.0 9.0 48370.0 9.9 475387.0 9.4 167634.0 8.1 2115440.0 7.9 ... 451737.0 7.6 544596.0 10.0 111246.0 8.8 455882.0 11.2 47844.0 12.0
Bachelor's degree 609316.0 17.5 94168.0 19.2 1032052.0 20.4 323961.0 15.7 6056169.0 22.5 ... 1409942.0 23.6 1302707.0 23.8 188793.0 15.0 885548.0 21.7 71812.0 18.0
Graduate or professional degree 391475.0 11.3 55478.0 11.3 634087.0 12.5 198846.0 9.7 3879772.0 14.4 ... 1112092.0 18.6 856976.0 15.7 124293.0 9.8 472873.0 11.6 46354.0 11.6
Median household income (dollars) 59674.0 NaN 88121.0 NaN 74568.0 NaN 55432.0 NaN 91551.0 NaN ... 85873.0 NaN 91306.0 NaN 54329.0 NaN 70996.0 NaN 70042.0 NaN
Hispanic or Latino (of any race) 246477.0 4.9 56491.0 7.7 2388520.0 32.5 255416.0 8.4 15732184.0 40.3 ... 905750.0 10.4 1093313.0 14.0 34343.0 1.9 447022.0 7.6 62803.0 10.8
White alone 3250182.0 64.1 421104.0 57.4 3814587.0 51.8 2054922.0 67.5 13160426.0 33.7 ... 5095130.0 58.7 4941456.0 63.5 1594128.0 89.8 4654031.0 79.0 473220.0 81.4
Black or African American alone 1296500.0 25.6 20664.0 2.8 322459.0 4.4 435637.0 14.3 2025218.0 5.2 ... 1594785.0 18.4 299537.0 3.8 59390.0 3.3 345678.0 5.9 3990.0 0.7
American Indian and Alaska Native alone 12694.0 0.3 93294.0 12.7 241688.0 3.3 12680.0 0.4 103030.0 0.3 ... 11738.0 0.1 69024.0 0.9 1880.0 0.1 33762.0 0.6 9589.0 1.6
Asian alone 76682.0 1.5 44905.0 6.1 257020.0 3.5 47592.0 1.6 5957867.0 15.3 ... 602881.0 6.9 755832.0 9.7 13134.0 0.7 172242.0 2.9 3577.0 0.6
Native Hawaiian and Other Pacific Islander alone 1954.0 0.0 14539.0 2.0 13887.0 0.2 14176.0 0.5 135777.0 0.3 ... 6869.0 0.1 52231.0 0.7 111.0 0.0 2592.0 0.0 294.0 0.1
Some Other Race alone 19927.0 0.4 4034.0 0.5 36042.0 0.5 11530.0 0.4 251618.0 0.6 ... 60419.0 0.7 50945.0 0.7 5238.0 0.3 19680.0 0.3 4997.0 0.9
Two or More Races 169880.0 3.3 78552.0 10.7 284994.0 3.9 213684.0 7.0 1663222.0 4.3 ... 406047.0 4.7 523448.0 6.7 66932.0 3.8 217532.0 3.7 22911.0 3.9

16 rows × 102 columns

Now we can look how education, income, and race vary over the states.

In [ ]:
states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI',
       'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN',
       'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
       'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA',
       'WV', 'WI', 'WY']
In [ ]:
# Education across states
df_all_per.T[df_all_per.index[list(range(0,7))]].plot.bar(stacked=True)

plt.ylabel("Percent of state with eduction level")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
Out[ ]:
<matplotlib.legend.Legend at 0x16529b5e0>
In [ ]:
# Income across states
plt.figure(figsize=(15,5))
plt.bar(states, df_all_est.T[df_all_est.T.columns[7]])

plt.ylabel("Median income per state")
plt.show
Out[ ]:
<function matplotlib.pyplot.show(close=None, block=None)>
In [ ]:
# Race across states
plt.figure(figsize=(25,15))
df_all_per.T[df_all_per.index[list(range(8,16))]].plot.bar(stacked=True)


plt.ylabel("Percent of state of certain race")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
Out[ ]:
<matplotlib.legend.Legend at 0x1621cb730>
<Figure size 2500x1500 with 0 Axes>

These three graphs above show us what states are very different from eachother in terms of race or median income or education attainment. These differences help indicate which states we need to compare for vaccine and Covid-19 cases in our models.

Table 2: Vaccine hesitancy statistics by county¶

We start by reading the csv into a data frame and deleting the columns that we don't need such as the state boundaries, county boundaries, and geographical point (longtitide and latitude).

In [ ]:
url = "https://data.cdc.gov/api/views/q9mh-h2tw/rows.csv?accessType=DOWNLOAD"

hesitancy_df = pd.read_csv(url)
hesitancy_df.drop(labels=['County Boundary', 'State Boundary', 'Geographical Point'], axis=1, inplace=True)

Now we can remove the catogorical varibales that we will not need and group by state.

In [ ]:
hesitancy_df.drop(labels=['CVAC Level Of Concern', 'SVI Category', 'State', 'FIPS Code','County Name'], axis=1, inplace=True)
hesitancy_by_state = hesitancy_df.groupby('State Code').mean()
hesitancy_by_state.head()
Out[ ]:
Estimated hesitant Estimated hesitant or unsure Estimated strongly hesitant Social Vulnerability Index (SVI) CVAC level of concern for vaccination rollout Percent adults fully vaccinated against COVID-19 (as of 6/10/21) Percent Hispanic Percent non-Hispanic American Indian/Alaska Native Percent non-Hispanic Asian Percent non-Hispanic Black Percent non-Hispanic Native Hawaiian/Pacific Islander Percent non-Hispanic White
State Code
AK 0.217386 0.264245 0.156907 0.561724 0.744483 0.582296 0.053148 0.324583 0.053886 0.013293 0.005324 0.478703
AL 0.173507 0.231512 0.132409 0.703433 0.720149 0.320851 0.034864 0.005749 0.007072 0.287355 0.000269 0.648376
AR 0.212123 0.261232 0.139901 0.717733 0.644800 0.338800 0.052561 0.005515 0.006661 0.160659 0.001171 0.751871
AZ 0.159240 0.242547 0.121353 0.846000 0.901333 0.504733 0.313380 0.133180 0.013267 0.018427 0.001180 0.501993
CA 0.072274 0.111041 0.035807 0.633793 0.633276 0.506820 0.306224 0.017743 0.073066 0.029845 0.002805 0.537445

Then we can make a correlation table of all the quantitaive variables to get some ideas of what variables correlate with what. This will help give us a basic idea of what we should look at more and what we will be able to take away from this data table. We can also look at some basic summary statistics for the variables to give us a better picture of the data.

In [ ]:
variables = ['Estimated hesitant', 
             'Estimated hesitant or unsure', 
             'Estimated strongly hesitant', 
             'Social Vulnerability Index (SVI)',
              'CVAC level of concern for vaccination rollout',
               'Percent adults fully vaccinated against COVID-19 (as of 6/10/21)',
                'Percent Hispanic',
                'Percent non-Hispanic American Indian/Alaska Native',
                'Percent non-Hispanic Asian',
                'Percent non-Hispanic Black',
                'Percent non-Hispanic Native Hawaiian/Pacific Islander',
                'Percent non-Hispanic White'
                     ]
hesitancy_df[variables].corr()
Out[ ]:
Estimated hesitant Estimated hesitant or unsure Estimated strongly hesitant Social Vulnerability Index (SVI) CVAC level of concern for vaccination rollout Percent adults fully vaccinated against COVID-19 (as of 6/10/21) Percent Hispanic Percent non-Hispanic American Indian/Alaska Native Percent non-Hispanic Asian Percent non-Hispanic Black Percent non-Hispanic Native Hawaiian/Pacific Islander Percent non-Hispanic White
Estimated hesitant 1.000000 0.961811 0.975274 0.282310 0.338086 -0.199502 -0.210853 0.173319 -0.269042 0.159336 -0.047019 0.000838
Estimated hesitant or unsure 0.961811 1.000000 0.938115 0.386490 0.438484 -0.277029 -0.195632 0.154750 -0.303635 0.288680 -0.064783 -0.085858
Estimated strongly hesitant 0.975274 0.938115 1.000000 0.299722 0.350317 -0.205624 -0.205454 0.198035 -0.263052 0.171879 -0.048325 -0.023501
Social Vulnerability Index (SVI) 0.282310 0.386490 0.299722 1.000000 0.717859 -0.276399 0.313903 0.167560 -0.058608 0.475836 0.000462 -0.615726
CVAC level of concern for vaccination rollout 0.338086 0.438484 0.350317 0.717859 1.000000 -0.401594 0.323777 0.139995 -0.139841 0.359244 0.002739 -0.512439
Percent adults fully vaccinated against COVID-19 (as of 6/10/21) -0.199502 -0.277029 -0.205624 -0.276399 -0.401594 1.000000 0.096973 0.148729 0.266810 -0.253635 0.067014 0.042492
Percent Hispanic -0.210853 -0.195632 -0.205454 0.313903 0.323777 0.096973 1.000000 -0.042063 0.138102 -0.115655 0.017473 -0.601836
Percent non-Hispanic American Indian/Alaska Native 0.173319 0.154750 0.198035 0.167560 0.139995 0.148729 -0.042063 1.000000 -0.018333 -0.102222 0.017776 -0.293375
Percent non-Hispanic Asian -0.269042 -0.303635 -0.263052 -0.058608 -0.139841 0.266810 0.138102 -0.018333 1.000000 0.016760 0.375992 -0.279602
Percent non-Hispanic Black 0.159336 0.288680 0.171879 0.475836 0.359244 -0.253635 -0.115655 -0.102222 0.016760 1.000000 -0.036959 -0.591361
Percent non-Hispanic Native Hawaiian/Pacific Islander -0.047019 -0.064783 -0.048325 0.000462 0.002739 0.067014 0.017473 0.017776 0.375992 -0.036959 1.000000 -0.110420
Percent non-Hispanic White 0.000838 -0.085858 -0.023501 -0.615726 -0.512439 0.042492 -0.601836 -0.293375 -0.279602 -0.591361 -0.110420 1.000000
In [ ]:
hesitancy_df[variables].describe()
Out[ ]:
Estimated hesitant Estimated hesitant or unsure Estimated strongly hesitant Social Vulnerability Index (SVI) CVAC level of concern for vaccination rollout Percent adults fully vaccinated against COVID-19 (as of 6/10/21) Percent Hispanic Percent non-Hispanic American Indian/Alaska Native Percent non-Hispanic Asian Percent non-Hispanic Black Percent non-Hispanic Native Hawaiian/Pacific Islander Percent non-Hispanic White
count 3142.000000 3142.000000 3142.000000 3141.000000 3142.000000 2864.000000 3142.000000 3142.000000 3142.000000 3142.000000 3142.000000 3142.000000
mean 0.132600 0.191429 0.086653 0.500000 0.499965 0.399398 0.094184 0.018463 0.013621 0.089259 0.000919 0.762499
std 0.046337 0.053494 0.032938 0.288842 0.288832 0.142893 0.138647 0.076340 0.027665 0.144283 0.006240 0.202157
min 0.026900 0.049900 0.018600 0.000000 0.000000 0.001000 0.000000 0.000000 0.000000 0.000000 0.000000 0.006900
25% 0.098300 0.148500 0.062325 0.250000 0.250000 0.318000 0.022200 0.001200 0.002800 0.006500 0.000000 0.644450
50% 0.131800 0.190100 0.084900 0.500000 0.500000 0.400000 0.042300 0.002800 0.006100 0.021900 0.000100 0.837400
75% 0.161725 0.228800 0.104475 0.750000 0.750000 0.494000 0.096800 0.006900 0.012800 0.098400 0.000600 0.924875
max 0.267000 0.323300 0.182400 1.000000 1.000000 0.999000 0.991700 0.919000 0.417300 0.872300 0.272700 1.000000

We can see that the 3 columns that talk about hesitancy are very strongly correlated, and when we look at summary statistics, we can see that on average the "Estimated hesitant or unsure" is slightly higher than the "Estimated hesitant" and the "Estimated strongly hesitant" is slightly below. Knowing this, we will look only at the "Estimated hesitant" column in order to simplify things.

In [ ]:
hesitancy_df.drop(labels=['Estimated hesitant or unsure', 'Estimated strongly hesitant'], axis=1, inplace=True)
hesitancy_by_state.drop(labels=['Estimated hesitant or unsure', 'Estimated strongly hesitant'], axis=1, inplace=True)

The social vulernability index (SVI) is define as "a database that helps emergency response planners and public health officials identify, map, and plan support for communities that will most likely need support before, during, and after a public health emergency.". It acts as an estimate of how much support a county will need in the event of a public health emergency based off a large number of factors. Below we can make a scatter of the percent of a county that is white compared to the SVI.

In [ ]:
hesitancy_df.plot.scatter(x=['Percent non-Hispanic White'], 
                          y='Social Vulnerability Index (SVI)', alpha = 0.5)
Out[ ]:
<Axes: xlabel='[Percent non-Hispanic White]', ylabel='Social Vulnerability Index (SVI)'>

Table 3: CDC Vaccine Trends¶

Next we will open up a data set from the CDC on vaccine trends in the United States that can be found here https://data.cdc.gov/Vaccinations/COVID-19-Vaccination-Trends-in-the-United-States-N/rh2h-3yt2

In [ ]:
vaccines = pd.read_csv('../datasets/COVID-19_Vaccination_Trends_in_the_United_States_National_and_Jurisdictional.csv')
vaccines = vaccines[ vaccines['date_type'].str.contains('Admin')==False ]
vaccines.head()
Out[ ]:
Date date_type MMWR_week Location Administered_Daily Administered_Cumulative Administered_7_Day_Rolling_Average Admin_Dose_1_Daily Admin_Dose_1_Cumulative Admin_Dose_1_Day_Rolling_Average ... Booster_7_Day_Rolling_Average Additional_Doses_Vax_Pct Second_Booster_50Plus_Daily Second_Booster_50Plus_Cumulative Second_Booster_50Plus_7_Day_Rolling_Average Second_Booster_50Plus_Vax_Pct Bivalent_Booster_Daily Bivalent_Booster_Cumulative Bivalent_Booster_7_Day_Rolling_Average Bivalent_Booster_Pop_Pct
0 05/10/2023 Report 19 CO 15097 13033446 NaN 1527 4837792 NaN ... NaN 57.9 1062 794838 NaN 62.9 9725 1272115 NaN 22.1
1 05/10/2023 Report 19 AZ 16505 14647405 NaN 2955 5704677 NaN ... NaN 50.2 1312 794699 NaN 54.3 11388 1148060 NaN 15.8
2 05/10/2023 Report 19 MN 16020 12829141 NaN 1282 4461994 NaN ... NaN 63.6 1196 983284 NaN 67.8 5497 1510743 NaN 26.8
3 05/10/2023 Report 19 ID 3526 2894361 NaN 323 1146055 NaN ... NaN 48.8 281 173862 NaN 54.8 2032 248989 NaN 13.9
4 05/10/2023 Report 19 DC 31 2137377 NaN 264 836680 NaN ... NaN 51.4 106 80880 NaN 55.9 509 226857 NaN 32.1

5 rows × 29 columns

Next we can drop all of the rows that contain data from a US territory, as we will only be looking at the 50 states & DC. We also will convert the date column to datetime type, and drop any columns that we don't need.

In [ ]:
territories = ['VI','FM', 'MH','MP','AS','GU','PR','PW','US']
mask = vaccines["Location"].isin(territories)
vaccines = vaccines[~mask]
vaccines.Date = pd.to_datetime(vaccines['Date'])
In [ ]:
vaccines.drop(labels=['MMWR_week', 
                      'Bivalent_Booster_7_Day_Rolling_Average',
                      'Bivalent_Booster_Cumulative',
                      'Bivalent_Booster_Daily',
                      'Second_Booster_50Plus_Vax_Pct',
                      'Second_Booster_50Plus_7_Day_Rolling_Average',
                      'Second_Booster_50Plus_Cumulative',
                      'Second_Booster_50Plus_Daily',
                      'Booster_Daily',
                      'Booster_Cumulative',
                      'Booster_7_Day_Rolling_Average',
                      'Series_Complete_Day_Rolling_Average',
                      'Series_Complete_Cumulative',
                      'Series_Complete_Daily',
                      'Administered_daily_change_report_7dayroll',
                      'Administered_daily_change_report',
                      'Admin_Dose_1_Day_Rolling_Average',
                      'Admin_Dose_1_Cumulative',
                      'Admin_Dose_1_Daily',
                      'date_type'
                      ], axis=1, inplace=True)
In [ ]:
vaccines.head()
Out[ ]:
Date Location Administered_Daily Administered_Cumulative Administered_7_Day_Rolling_Average Administered_Dose1_Pop_Pct Series_Complete_Pop_Pct Additional_Doses_Vax_Pct Bivalent_Booster_Pop_Pct
0 2023-05-10 CO 15097 13033446 NaN 84.0 73.8 57.9 22.1
1 2023-05-10 AZ 16505 14647405 NaN 78.4 66.2 50.2 15.8
2 2023-05-10 MN 16020 12829141 NaN 79.1 72.4 63.6 26.8
3 2023-05-10 ID 3526 2894361 NaN 64.1 56.6 48.8 13.9
4 2023-05-10 DC 31 2137377 NaN 95.0 91.3 51.4 32.1

Now that we have the data that we want, we can make a plot to see vaccine trends overtime, as well as by state.

In [ ]:
vaccines.groupby("Date")['Administered_Daily'].sum().plot(title="Daily vaccines over time")
Out[ ]:
<Axes: title={'center': 'Daily vaccines over time'}, xlabel='Date'>
In [ ]:
vaccines.groupby("Location")['Administered_Dose1_Pop_Pct'].max().plot.bar(title="Percent of people with at least 1 dose by state")
Out[ ]:
<Axes: title={'center': 'Percent of people with at least 1 dose by state'}, xlabel='Location'>

Table 4: Confirmed Covid-19 cases by state, 2020:2023¶

This data is from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. Over the course of three years, (1/22/20 to 3/9/23) data on the daily confirmed cases, deaths, and recoveries from Covid-19 were published to this Github repository: https://github.com/CSSEGISandData/COVID-19.git. The data is reported by city and state, but for ease, we chose to look at the state level so rows were aggregated to one row per state (or territory). Additionally, this dataset is being used to look at Covid-19 cases over time (dates), all other columns were dropped. Then, each state can be normalized by its population so we can compare across states.

In [ ]:
confirmed_df = pd.read_csv("../datasets/time_series_covid19_confirmed_US.csv")
confirmed_df.drop(columns=['iso2','iso3','code3','FIPS','Country_Region','Lat','Long_','UID','Admin2','Combined_Key'], inplace=True)
In [ ]:
confirmed_state_by_day = confirmed_df.groupby('Province_State').sum() 
confirmed_state_by_day.drop(labels=['American Samoa','Diamond Princess','Grand Princess', 'Guam','Northern Mariana Islands','Puerto Rico','Virgin Islands'], axis='index', inplace=True)
confirmed_state_by_day.rename(us_state_to_abbrev, inplace=True)
confirmed_state_by_day = confirmed_state_by_day.div(population,axis=0)*100 
confirmed_state_by_day.head()
Out[ ]:
1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
Province_State
AL 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 32.287198 32.360390 32.360390 32.360390 32.360390 32.360390 32.360390 32.360390 32.409087 32.409087
AK 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 41.859340 41.859340 41.859340 41.859340 41.859340 41.859340 41.859340 41.938676 41.938676 41.938676
AZ 0.0 0.0 0.0 0.0 0.000014 0.000014 0.000014 0.000014 0.000014 0.000014 ... 33.082835 33.159786 33.159786 33.159786 33.159786 33.159786 33.159786 33.159786 33.203541 33.203541
AR 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 32.989880 32.989880 33.004984 33.013258 33.023962 33.025833 33.028559 33.041889 33.051280 33.059849
CA 0.0 0.0 0.0 0.0 0.000005 0.000005 0.000005 0.000005 0.000005 0.000008 ... 30.996200 30.996200 31.022445 31.028397 31.028397 31.028397 31.028397 31.056032 31.056032 31.078410

5 rows × 1143 columns

Now, we can look at the cases of Covid-19 over 3 years, state by state. This is give us an idea of trends of reported covid cases over time and how they vary state to state. Below is a plot of daily reported covid-19 cases, from 1/22/20 to 3/9/23.

In [ ]:
for i in confirmed_state_by_day.index:
    plt.plot(confirmed_state_by_day.loc[i])

plt.xticks(ticks=[0,1142], labels=['2020-01-22','2023-09-03'])
plt.xlabel('date')
plt.ylabel('percentage of covid cases in state')
plt.show()

Let's find the top 10 states.

In [ ]:
confirmed_state_by_day.sort_values(['3/9/23'], ascending=False)[0:10]
Out[ ]:
1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 3/6/23 3/7/23 3/8/23 3/9/23
Province_State
RI 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 41.989734 41.989734 42.061872 42.061872 42.061872 42.061872 42.061872 42.061872 42.061872 42.121485
AK 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 41.859340 41.859340 41.859340 41.859340 41.859340 41.859340 41.859340 41.938676 41.938676 41.938676
KY 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 37.967693 37.967693 37.977976 37.977976 37.977976 37.977976 37.977976 37.977976 38.084063 38.084063
ND 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 36.799609 36.812698 36.823349 36.823349 36.823349 36.823349 36.823349 36.823349 36.823349 36.823349
WV 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 36.010694 36.112544 36.112544 36.208649 36.208649 36.208649 36.208649 36.208649 36.208649 36.208649
TN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 35.630992 35.641004 35.655512 35.665779 35.668828 35.596105 35.596105 35.596105 35.668828 35.668828
SC 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 34.712437 34.712437 34.712437 34.712437 34.712437 34.712437 34.712437 34.766141 34.766141 34.766141
NY 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 34.468349 34.470498 34.482451 34.491213 34.496157 34.499959 34.512130 34.517345 34.524154 34.531107
LA 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 34.118143 34.217615 34.217615 34.217615 34.217615 34.217615 34.217615 34.217615 34.285999 34.285999
WI 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 33.955957 33.970009 33.982770 33.994820 33.994820 33.994820 34.016203 34.029474 34.040912 34.052927

10 rows × 1143 columns

From the plot, we can see that some states, like West Virginia, Wyoming, and Mississippi, have many more cases than others per capita. Overall, some states saw outbreaks when others didnt, and some didn't have many cases at all. This can help us answer questions about how Covid-19 impacted different states and what states were impacted the most at different times.

To look at this data a bit closer, we can look at the average number of cases over three months. Then we can look at the summary statistics for that data.

In [ ]:
confirmed_state_by_day.columns = pd.to_datetime(confirmed_state_by_day.columns, dayfirst=True, format='mixed')
confirmed_day = confirmed_state_by_day
In [ ]:
# 3 month average
threemonthaverage = confirmed_day.T.resample('Q').mean().T
threemonthaverage.describe()
Out[ ]:
2020-03-31 2020-06-30 2020-09-30 2020-12-31 2021-03-31 2021-06-30 2021-09-30 2021-12-31 2022-03-31 2022-06-30 2022-09-30 2022-12-31 2023-03-31 2023-06-30 2023-09-30 2023-12-31
count 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000
mean 0.523509 0.753765 1.404136 3.053974 9.083523 10.006541 10.977025 13.110492 24.043743 25.460957 27.085812 28.069560 30.466076 30.458059 30.493810 30.374839
std 0.183625 0.335680 0.542772 1.099395 2.358052 2.376262 2.536622 2.838387 3.729242 3.779098 3.923377 4.022893 4.295360 4.293824 4.299314 4.276433
min 0.091269 0.177454 0.235364 0.482400 2.494718 2.822579 3.753079 5.118809 16.102109 17.621519 18.990129 19.800811 21.845738 21.824123 21.858583 21.768570
25% 0.426930 0.585348 1.037146 2.546424 8.354899 9.527069 10.295712 12.307920 22.040358 23.343224 25.002910 25.781952 27.810816 27.806267 27.822505 27.740747
50% 0.533490 0.720621 1.437911 3.166962 9.544954 10.499170 11.379409 13.674178 24.408501 25.723154 27.584863 28.548954 30.765054 30.748463 30.779522 30.687674
75% 0.658690 0.962602 1.787737 3.653513 10.445870 11.275231 12.601067 14.956032 26.564371 27.780954 29.494269 30.417470 32.724562 32.735825 32.761352 32.618070
max 0.877756 1.732563 2.621345 6.248935 13.599858 14.374649 15.109814 18.279272 32.961085 35.490087 38.081019 38.994821 41.574649 41.526601 41.592979 41.424713
In [ ]:
plt.plot(threemonthaverage.T)
plt.xticks(rotation=45)
plt.xlabel('three month period')
plt.ylabel('average number of covid cases per three months')
plt.show()

From these summary statistics, there was a massive increase in cases from 1/21 to 4/21 in 2020 (increased 3 fold) and also from 1/22 to 4/22 (increased 2 fold). After the Covid-19 vaccine was released in December of 2020, the immediate three following months saw a massive increase in cases, then the next 9 months saw a plateu in cases. That intial big spike could be from the slow roll-out of the vaccine to the general public, but the plateu after seems to validate the effectiveness of the vaccine. The peak after about 9 months could be a result of winter and everyone being in closer quarters, and could also indicate the vaccine was effective for about 9 months, until another booster was encouraged to maintain increased levels of immune responses.

Model Plan¶

Using these datasets, we can find the correlation between different demographic information and vaccine hesitancy, actual vaccine trends, and Covid-19 cases. We can determine the correlations for each pair of the four datasets and as combinations. The datasets should be normalized to population so the correlation is interpretable. Graphs, such as hesitancy by state vs vaccine rates by state can help visually show the correlations found.

These correlations can also help address the question of the effectiveness of the vaccine on Covid-19 cases.

We also can also use demographic information, vaccine hesitancy, and Covid-19 to predict vaccine rates. This will include combining all the datasets, splitting the data into training and testing sets, and finally implementing a predictive model, using linear regression (linear_model in sklearn package). This model will then be evaluated using k-fold cross validation and performance metrics such as MSE, RMSE, and MAE.

Model¶

Using these four datsets, we can first look at the correlation matrix between all the demographic information, vaccine hesitancy, actual vaccine trends, and Covid-19 cases. Then, we can look at correlations of the first three datasets and covid cases. This second correlation matrix will help us see which features should be used in our model to predict Covid-19 cases.

Joining Datsets and Finding Correlations¶

In [ ]:
vaccines_on_days = vaccines[vaccines['Date'] == '12/13/2021'][['Location','Administered_Dose1_Pop_Pct']].merge(vaccines[vaccines['Date'] == '12/14/2022'][['Location','Administered_Dose1_Pop_Pct']],on = 'Location',suffixes = ('',' on 12/14/2022'))
vaccines_on_days = vaccines_on_days.merge(vaccines[vaccines['Date'] == '12/14/2022'][['Location','Administered_Dose1_Pop_Pct']],on = 'Location',suffixes = (' on 12/13/2021',' on 05/10/2023'))
vaccines_on_days.set_index("Location",inplace=True)
vaccines_on_days.head() 
Out[ ]:
Administered_Dose1_Pop_Pct on 12/13/2021 Administered_Dose1_Pop_Pct on 12/14/2022 Administered_Dose1_Pop_Pct on 05/10/2023
Location
NH 92.0 87.3 87.3
WA 74.1 84.7 84.7
CO 73.0 83.3 83.3
CT 86.2 95.0 95.0
MS 54.7 61.4 61.4
In [ ]:
df_demog = df_all_per
df_demog.columns = df_demog.columns.str.rstrip('%')
df_demog = df_demog.T
df_demog.rename(us_state_to_abbrev, inplace=True)
df_demog.drop('\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Median household income (dollars)',axis = 1,inplace=True)
df_demog.join(df_all_est.T[['\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Median household income (dollars)']])
df_demog.head()
Out[ ]:
Label (Grouping) Less than 9th grade 9th to 12th grade, no diploma High school graduate (includes equivalency) Some college, no degree Associate's degree Bachelor's degree Graduate or professional degree Hispanic or Latino (of any race) White alone Black or African American alone American Indian and Alaska Native alone Asian alone Native Hawaiian and Other Pacific Islander alone Some Other Race alone Two or More Races
AL 3.4 7.8 30.4 20.6 9.0 17.5 11.3 4.9 64.1 25.6 0.3 1.5 0.0 0.4 3.3
AK 2.2 4.4 29.1 23.7 9.9 19.2 11.3 7.7 57.4 2.8 12.7 6.1 2.0 0.5 10.7
AZ 4.8 6.1 23.8 23.0 9.4 20.4 12.5 32.5 51.8 4.4 3.3 3.5 0.2 0.5 3.9
AR 4.0 6.8 34.3 21.3 8.1 15.7 9.7 8.4 67.5 14.3 0.4 1.6 0.5 0.4 7.0
CA 8.8 6.5 20.5 19.3 7.9 22.5 14.4 40.3 33.7 5.2 0.3 15.3 0.3 0.6 4.3
In [ ]:
cases_vaccines = confirmed_state_by_day[['12/13/2021','12/14/2022','05/03/2023']].join(vaccines_on_days)
cases_vaccines.rename(columns={'2020-12-13':'Cases on 12/13/2020',
 '2021-12-13 00:00:00':'Cases on 12/13/2021',
 '2022-12-14 00:00:00':'Cases on 12/14/2022',
 '2023-05-03 00:00:00':'Cases on 05/03/2023'},inplace=True)
cases_vaccines_hesitancy = cases_vaccines.join(hesitancy_by_state[['Estimated hesitant','Social Vulnerability Index (SVI)']])
all_data = cases_vaccines_hesitancy.join(df_demog)
all_data.head()
Out[ ]:
2021-12-13 00:00:00 2022-12-14 00:00:00 2023-05-03 00:00:00 Administered_Dose1_Pop_Pct on 12/13/2021 Administered_Dose1_Pop_Pct on 12/14/2022 Administered_Dose1_Pop_Pct on 05/10/2023 Estimated hesitant Social Vulnerability Index (SVI) Less than 9th grade 9th to 12th grade, no diploma ... Bachelor's degree Graduate or professional degree Hispanic or Latino (of any race) White alone Black or African American alone American Indian and Alaska Native alone Asian alone Native Hawaiian and Other Pacific Islander alone Some Other Race alone Two or More Races
Province_State
AL 16.801818 30.773865 32.360390 57.6 64.8 64.8 0.173507 0.703433 3.4 7.8 ... 17.5 11.3 4.9 64.1 25.6 0.3 1.5 0.0 0.4 3.3
AK 20.970361 40.993180 41.859340 64.1 72.7 72.7 0.217386 0.561724 2.2 4.4 ... 19.2 11.3 7.7 57.4 2.8 12.7 6.1 2.0 0.5 10.7
AZ 17.917444 32.137745 33.159786 65.8 76.9 76.9 0.159240 0.846000 4.8 6.1 ... 20.4 12.5 32.5 51.8 4.4 3.3 3.5 0.2 0.5 3.9
AR 17.687630 32.054936 33.025833 61.7 69.5 69.5 0.212123 0.717733 4.0 6.8 ... 15.7 9.7 8.4 67.5 14.3 0.4 1.6 0.5 0.4 7.0
CA 13.239006 29.969714 31.028397 80.7 84.2 84.2 0.072274 0.633793 8.8 6.5 ... 22.5 14.4 40.3 33.7 5.2 0.3 15.3 0.3 0.6 4.3

5 rows × 23 columns

First we can look at the correlations of every feature we selected using a heatmap for easy visualization.

In [ ]:
sb.heatmap(all_data.corr(), vmin=-1, vmax=1, cmap='PiYG')
Out[ ]:
<Axes: >

From this heatmap, a few things are signficiantly correlated. Estimated hesitancy and actual vaccine doses are significantly negativly correlated. This makes sense because if people were more hesitant about the vaccine, they would be less likely to get it. Estimated hesitancy was also negativly correlated with higher education (Bachelor's and Graduate degrees). This indicates that more educated people are less hesitant about the vaccine, potentially because they have a better skill set to do more of their own research and critial thinking to form an opinion on the vaccine than less educated people. This indicates education is important to combat misfinformation and essential to make informed decisions, especially about health concerns. Finally, SVI became decreasingly correlated with education as education levels increased. This indicates that the more education a person attains, the less likely they are to be significantly affected by public health emergencies, such as a pandemic.

Next, we can zoom in on all the demographic, vaccine, and hesitancy correlations with just cases to see what we should include in our model.

In [ ]:
sb.heatmap(all_data.corr().iloc[0:3].T[3:], vmin=-1, vmax=1,cmap='PiYG')
Out[ ]:
<Axes: >

From this heatmap, we can see that estimated hesitancy was significantly and positively correlated with covid cases on 12/13/2021. This indicates the more hesitantcy there was in a state, the more covid cases they had. Additionally, over all three case dates, the administered vaccines are negativly correlated with cases. This shows that the more vaccines given out, the less covid cases there were, which signals towards the effectiveness of the vaccine. Additionally, this map shows that there was no significant difference in cases in education levels below a bachelor's degree, but there was a negative correlation between higher education and cases. This could be an important feature in our model.

Using these correlations, we can choose features highly correlated with covid cases, and drop anything not highly correlated. Here we will only drop Social Vulnerability Index (SVI) and not race because some races were significantly correlated while others weren't so for simplicity we are keeping every race. From here, we can use these features to predict covid-19 cases on 5/10/2023.

Linear Regression¶

In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_predict

Predicting covid-19 cases on '2023-05-10 00:00:00'¶

In [ ]:
all_data.columns = all_data.columns.astype(str)

For this model, we will use linear regression. We will randomly split the data into 60% for training and 40% for testing.

In [ ]:
features = all_data.columns[[0,1] + list(range(3,7)) + list(range(8,23))]
pred_feat = all_data.columns[2]

X = all_data[features]
y = all_data[pred_feat]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
In [ ]:
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

Below is our predicted data! Let's evaluate and compare it to the actual values.

In [ ]:
X_test
Out[ ]:
2021-12-13 00:00:00 2022-12-14 00:00:00 Administered_Dose1_Pop_Pct on 12/13/2021 Administered_Dose1_Pop_Pct on 12/14/2022 Administered_Dose1_Pop_Pct on 05/10/2023 Estimated hesitant Less than 9th grade 9th to 12th grade, no diploma High school graduate (includes equivalency) Some college, no degree ... Bachelor's degree Graduate or professional degree Hispanic or Latino (of any race) White alone Black or African American alone American Indian and Alaska Native alone Asian alone Native Hawaiian and Other Pacific Islander alone Some Other Race alone Two or More Races
Province_State
KS 16.672659 30.981598 67.7 75.8 75.8 0.151648 3.6 4.4 25.3 22.1 ... 22.0 13.7 13.0 73.1 5.0 0.4 2.9 0.1 0.5 4.9
TX 14.740353 27.125442 65.3 75.9 75.9 0.119811 7.2 6.7 24.2 20.3 ... 21.6 12.3 40.2 38.9 11.7 0.2 5.4 0.1 0.4 3.2
NH 12.764840 26.089085 92.0 87.3 87.3 0.103540 1.3 4.1 27.1 16.8 ... 24.6 16.7 4.5 86.6 1.3 0.1 2.6 0.1 0.5 4.4
ID 16.100087 26.260461 51.5 63.6 63.6 0.192023 3.5 4.5 26.0 23.9 ... 21.4 10.8 13.5 79.0 0.6 0.8 1.3 0.2 0.5 4.2
OR 9.470543 22.006553 72.8 81.1 81.1 0.112689 3.6 4.8 22.7 23.7 ... 22.2 14.1 14.4 71.6 1.8 0.7 4.5 0.4 0.6 6.0
IN 17.009772 29.079559 56.8 64.1 64.1 0.131465 3.6 6.2 32.1 19.2 ... 19.0 10.6 7.8 76.0 9.2 0.1 2.5 0.0 0.5 3.9
TN 18.999668 34.243922 57.7 64.2 64.2 0.138388 3.5 6.1 31.3 20.1 ... 19.7 11.4 6.3 71.9 15.5 0.1 1.9 0.1 0.4 4.0
HI 6.206030 25.684976 83.5 91.0 91.0 0.059500 3.2 3.9 26.8 19.6 ... 22.1 13.3 11.1 20.7 1.6 0.1 34.6 9.3 0.4 22.1
PA 14.208571 26.122178 84.0 89.9 89.9 0.118857 2.8 5.0 33.0 15.2 ... 20.8 14.3 8.6 73.1 10.1 0.1 3.8 0.0 0.5 3.8
NE 16.386668 27.783252 65.2 73.0 73.0 0.086788 3.3 3.9 25.1 21.6 ... 22.6 12.1 12.3 75.8 4.5 0.5 2.5 0.1 0.4 3.9
AK 20.970361 40.993180 64.1 72.7 72.7 0.217386 2.2 4.4 29.1 23.7 ... 19.2 11.3 7.7 57.4 2.8 12.7 6.1 2.0 0.5 10.7
MT 17.268742 28.512371 60.9 68.0 68.0 0.251386 2.2 3.8 27.7 22.3 ... 22.9 11.7 4.4 83.5 0.3 5.2 0.7 0.1 0.8 4.9
WA 10.194950 24.054566 74.1 84.7 84.7 0.077913 3.5 4.3 21.5 21.2 ... 23.8 15.7 14.0 63.5 3.8 0.9 9.7 0.7 0.7 6.7
MA 14.054206 30.424676 88.1 95.0 95.0 0.046279 4.4 4.3 23.0 14.3 ... 25.3 21.3 13.0 67.0 6.6 0.1 7.2 0.0 1.2 4.9
CO 15.013752 29.490562 73.0 83.3 83.3 0.078220 2.8 4.2 20.2 18.7 ... 28.8 17.1 22.5 65.0 3.8 0.4 3.1 0.1 0.5 4.6
CT 12.335651 25.812771 86.2 95.0 95.0 0.056987 3.9 4.5 25.9 15.9 ... 23.0 18.9 18.2 62.0 9.8 0.1 4.8 0.0 0.8 4.4
RI 18.625004 40.208405 85.8 95.0 95.0 0.066420 4.8 4.7 24.7 18.1 ... 23.9 15.7 17.6 68.2 4.7 0.1 3.4 0.0 0.9 5.2
OK 17.008782 30.653316 64.4 74.2 74.2 0.194045 3.5 6.9 30.6 22.1 ... 18.6 9.9 12.1 62.6 6.7 6.8 2.3 0.1 0.3 9.1
NC 14.723114 30.845372 73.1 91.3 91.3 0.135913 3.7 6.1 24.9 19.3 ... 22.8 13.2 10.4 60.7 20.1 0.9 3.2 0.1 0.5 4.1
IL 15.134757 31.176387 70.9 78.8 78.8 0.096085 4.6 5.0 25.2 19.1 ... 22.5 15.2 18.3 58.5 13.2 0.1 5.9 0.0 0.4 3.6
GA 15.412124 27.153539 60.0 68.0 68.0 0.161815 4.0 6.5 27.0 19.1 ... 20.7 14.0 10.4 49.6 30.7 0.1 4.4 0.1 0.5 4.2

21 rows × 21 columns

We can evaluate our model using several common metrics.

In [ ]:
from sklearn.metrics import mean_absolute_error,mean_squared_error
 
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
# squared=True returns MSE value, False returns RMSE value. 
mse = mean_squared_error(y_true=y_test,y_pred=y_pred) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)

print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.445853139227499
MSE: 1.1235586671662616
RMSE: 1.0599805032010077
In [ ]:
plt.scatter(y_test,y_pred,c='r')
plt.xlabel('Actual percent of state with Covid-19')
plt.ylabel('Predicted percent of state with Covid-19')
plt.title('Actual vs Predicted Covid-19 cases on 5/10/202 (40 percent of states)')
for i in range(y_test.shape[0]): 
    plt.text(y_test[i], y_pred[i], str(y.index[i]))

This is a pretty good model, but we can confirm our results with 5 fold cross validation.

Cross Validation¶

In [ ]:
lr = LinearRegression()
y_tot_pred = cross_val_predict(lr, X, y, cv=5)
In [ ]:
mae = mean_absolute_error(y_true=y,y_pred=y_tot_pred)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y,y_pred=y_tot_pred) #default=True
rmse = mean_squared_error(y_true=y,y_pred=y_tot_pred,squared=False)
 
print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)


plt.scatter(y,y_tot_pred,c='r')
plt.xlabel('Actual percent of state with Covid-19')
plt.ylabel('Predicted percent of state with Covid-19')
plt.title('Actual vs Predicted Covid-19 cases on 5/10/2023 (5-fold cross validation)')
for i in range(y.shape[0]):
    plt.text(y[i], y_tot_pred[i], str(y.index[i]))
MAE: 0.23393785092251881
MSE: 0.18849029765015374
RMSE: 0.43415469322599026

Linear Regression Coefficients¶

We were able to get a pretty good model based on our performance metrics and direct comparison between predicted and actual values. To gather some insight into the actual model, we can extract the coefficients from the linear model to see which variables were deemed the most, and least, important to predicting cases.

In [ ]:
coefficients = pd.DataFrame(zip(X.columns, model.coef_))
coefficients.columns = ['variable', 'coefficient']
coefficients 
Out[ ]:
variable coefficient
0 2021-12-13 00:00:00 -0.030062
1 2022-12-14 00:00:00 1.053670
2 Administered_Dose1_Pop_Pct on 12/13/2021 0.001830
3 Administered_Dose1_Pop_Pct on 12/14/2022 -0.001069
4 Administered_Dose1_Pop_Pct on 05/10/2023 -0.001069
5 Estimated hesitant 0.246702
6 Less than 9th grade 0.654279
7 9th to 12th grade, no diploma 0.696613
8 High school graduate (includes equival... 0.596443
9 Some college, no degree 0.567993
10 Associate's degree 0.481504
11 Bachelor's degree 0.691563
12 Graduate or professional degree 0.530945
13 Hispanic or Latino (of any race) 0.537597
14 White alone 0.552245
15 Black or African American alone 0.561912
16 American Indian and Alaska Native ... 0.600524
17 Asian alone 0.529313
18 Native Hawaiian and Other Pacific ... 0.267250
19 Some Other Race alone 1.005719
20 Two or More Races 0.477663

Conclusions¶

In conclusion, our model was very successful in predicting covid cases on 5/10/23 based on vaccination rates, estimated hesitancy, education levels, demographic information, and covid cases on 2 dates in past years. Our cross-validation RMSE was 0.43, which is very low and indicates low error. One potential flaw with our model would be that it did use covid cases from past days, as well as vaccine rates from past days and the test day. This means we would not be able to use it at the start of a new pandemic to predict cases if we did not have that data.

Looking at our coefficients, covid cases 6 months before had the highest coefficient, which is not surprising. The next highest coefficients were all of the education variables, but the coefficient for each level of education was about the same (between 0.48 and 0.7). This means that education levels did not actually have a large effect because they were all weighted the same. Vaccine rates had the smallest coefficients, which is surprising because on the correlation table there was a negative correlation between vaccines rates and cases. This is also surprising because estimated vaccine hesitancy did have a fairly significant coefficient (0.25). Estimated vaccine hesitancy did have a stronger correlation with cases than the vaccine rates did, which is interesting that it is more impactful than vaccines themselves. In future projects, more parameter tuning would be needed in order to better figure out how each variable is impacting the model.

Looking back at our initial questions asked, we can partially answer these looking at our correlation table. Estimated hesitancy had a correlation of between 0.3-0.6 with covid cases depending on which day we look at. These correlations are significant enough so we can say that states with more vaccine hesitancy had higher rates of Covid-19. Vaccine rates had a correlation of -0.3 with cases on the last day of data, which is not super strong.

In the future, some things that we would want to consider would be data on masks, as well as deaths. Cases are an important thing to measure and predict, but predicting covid deaths may be more useful, especially if we wanted to predict the impact of future pandemics. Data on mask wearing and mask mandates could also be interesting to look at because they were such a large controversy throughout the whole pandemic. We were able to predict cases fairly accurately with no data on masks which may suggest that they did not have a huge impact, however it is more likely that mask wearing is correlated with vaccine rates/vaccine hesitancy and it was therefore factored in through that.

In [ ]: