!git clone https://github.com/criviere1/criviere1.github.io.git
fatal: destination path 'criviere1.github.io' already exists and is not an empty directory.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
The team, Collette Riviere and Toby Mendels, will be investigating datasets related to Covid-19 infection rates, vaccine hesitancy rates, actual Covid-19 vaccine rates, and state by state demographic information. Most of the data will come from the US Census Bureau, specifically the Household Pulse Survey (HPS) and the American Community Survey (ACS). Daily Covid-19 infeciton rates were collected by the Center for Systems Science and Engineering at Johns Hopkins University from the World Health Organization (WHO) and the US Centers for Disease Control and Prevention (CDC). Finally, actual Covid-19 vaccine trends were colllected from the CDC.
The first dataset is state-by-state demographic information, such as sex, age, race, income, health insurance status, and educational attainment. This dataset could potentially indicate a relationship between state demographic information vaccine hesitancy or Covid-19 cases. The second dataset, from the HPS survey, we are interested in vaccine hesitancy by state. The Covid-19 vaccine was a devicive issue for many people so looking at vaccine hesitency before the Covid-19 pandemic could help indicate which parts of the US could be most reluctant or hesitant about the Covid-19 vaccines. This relates to the third dataset, of actual vaccine trends. This can help us highlight the difference between hesitancy and actual vaccine rates. The final dataset is daily Covid-19 infection reports across states. This dataset can help investigate how Covid-19 spread across the USA from Janurary 2020 to March 2023 and how different states were affected by the pandemic over time. It can also be compared to vaccine rates to determine if the vaccine actually helped slow the spread of the pandemic.
All of these datasets in conjunction can help these specific answer questions:
Inital steps to answer these questions will be to analyze the correlation between each dataset. Additionally, comparing the rates of Covid-19 cases before and after the vaccine was rolled out will help to address the question of whether the vaccine redcued Covid-19 rates. This is discussed more below in our model.
The data we are using from the Household Pulse Survey is found here: https://data.cdc.gov/Vaccinations/Vaccine-Hesitancy-for-COVID-19-County-and-local-es/q9mh-h2tw/data.
Selected Demographic information from the ACS is found here:
COVID-19 infection reports can be found here: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/README.md.
Covid-19 Vaccination Trends can be found from the CDC here: https://data.cdc.gov/Vaccinations/COVID-19-Vaccination-Trends-in-the-United-States-N/rh2h-3yt2
Covid-19 timeline used: https://www.cdc.gov/museum/timeline/covid19.html#Early-2021
us_state_to_abbrev = {
"Alabama": "AL",
"Alaska": "AK",
"Arizona": "AZ",
"Arkansas": "AR",
"California": "CA",
"Colorado": "CO",
"Connecticut": "CT",
"Delaware": "DE",
"Florida": "FL",
"Georgia": "GA",
"Hawaii": "HI",
"Idaho": "ID",
"Illinois": "IL",
"Indiana": "IN",
"Iowa": "IA",
"Kansas": "KS",
"Kentucky": "KY",
"Louisiana": "LA",
"Maine": "ME",
"Maryland": "MD",
"Massachusetts": "MA",
"Michigan": "MI",
"Minnesota": "MN",
"Mississippi": "MS",
"Missouri": "MO",
"Montana": "MT",
"Nebraska": "NE",
"Nevada": "NV",
"New Hampshire": "NH",
"New Jersey": "NJ",
"New Mexico": "NM",
"New York": "NY",
"North Carolina": "NC",
"North Dakota": "ND",
"Ohio": "OH",
"Oklahoma": "OK",
"Oregon": "OR",
"Pennsylvania": "PA",
"Rhode Island": "RI",
"South Carolina": "SC",
"South Dakota": "SD",
"Tennessee": "TN",
"Texas": "TX",
"Utah": "UT",
"Vermont": "VT",
"Virginia": "VA",
"Washington": "WA",
"West Virginia": "WV",
"Wisconsin": "WI",
"Wyoming": "WY",
"District of Columbia": "DC",
}
First, we have to read in the tables from the ACS and select the data that we want.
demo1_df = pd.read_csv("../datasets/ACSDP1Y2022.DP02-2023-edu.csv")
demo2_df = pd.read_csv("../datasets/ACSDP1Y2022.DP03-2023-employment,income,healthinsurance.csv")
demo3_df = pd.read_csv("../datasets/ACSDP1Y2022.DP05-2023-sex,age,race.csv")
edu_df = demo1_df.iloc[67:74,:-2]
inc_df = demo2_df.iloc[list(range(67,69)),:-2]
ins_df = demo2_df.iloc[[102,105],:-2]
age_df = demo3_df.iloc[[18],:-2]
sex_df = demo3_df.iloc[[2,3],:-2]
race_df = demo3_df.iloc[[76]+list(range(82,89)),:-2]
df_all = pd.concat([edu_df,inc_df,ins_df,age_df,sex_df,race_df])
df_all.set_index([df_all.columns[0]], inplace=True)
df_all.head()
Alabama!!Estimate | Alabama!!Percent | Alaska!!Estimate | Alaska!!Percent | Arizona!!Estimate | Arizona!!Percent | Arkansas!!Estimate | Arkansas!!Percent | California!!Estimate | California!!Percent | ... | Virginia!!Estimate | Virginia!!Percent | Washington!!Estimate | Washington!!Percent | West Virginia!!Estimate | West Virginia!!Percent | Wisconsin!!Estimate | Wisconsin!!Percent | Wyoming!!Estimate | Wyoming!!Percent | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Label (Grouping) | |||||||||||||||||||||
Less than 9th grade | 118,034 | 3.4% | 10,950 | 2.2% | 241,467 | 4.8% | 82,740 | 4.0% | 2,376,805 | 8.8% | ... | 209,876 | 3.5% | 192,061 | 3.5% | 43,318 | 3.4% | 99,448 | 2.4% | 8,563 | 2.1% |
9th to 12th grade, no diploma | 271,263 | 7.8% | 21,719 | 4.4% | 306,781 | 6.1% | 140,697 | 6.8% | 1,738,878 | 6.5% | ... | 295,899 | 4.9% | 236,434 | 4.3% | 93,833 | 7.4% | 167,049 | 4.1% | 16,511 | 4.1% |
High school graduate (includes equivalency) | 1,057,155 | 30.4% | 142,455 | 29.1% | 1,202,946 | 23.8% | 705,670 | 34.3% | 5,506,855 | 20.5% | ... | 1,431,624 | 23.9% | 1,176,424 | 21.5% | 484,166 | 38.3% | 1,209,135 | 29.6% | 109,092 | 27.3% |
Some college, no degree | 716,144 | 20.6% | 116,078 | 23.7% | 1,160,936 | 23.0% | 438,076 | 21.3% | 5,192,854 | 19.3% | ... | 1,067,612 | 17.9% | 1,161,011 | 21.2% | 216,917 | 17.2% | 798,565 | 19.5% | 99,620 | 24.9% |
Associate's degree | 311,537 | 9.0% | 48,370 | 9.9% | 475,387 | 9.4% | 167,634 | 8.1% | 2,115,440 | 7.9% | ... | 451,737 | 7.6% | 544,596 | 10.0% | 111,246 | 8.8% | 455,882 | 11.2% | 47,844 | 12.0% |
5 rows × 102 columns
Now to organize the data and get it ready for analysis so the dtypes are corrected and the table is tidied.
df_all.T.dtypes
Label (Grouping) Less than 9th grade object 9th to 12th grade, no diploma object High school graduate (includes equivalency) object Some college, no degree object Associate's degree object Bachelor's degree object Graduate or professional degree object Median household income (dollars) object Mean household income (dollars) object With health insurance coverage object No health insurance coverage object Median age (years) object Male object Female object Hispanic or Latino (of any race) object White alone object Black or African American alone object American Indian and Alaska Native alone object Asian alone object Native Hawaiian and Other Pacific Islander alone object Some Other Race alone object Two or More Races object dtype: object
df_all.replace(',','', regex=True, inplace=True)
df_all.replace('%','', regex=True, inplace=True)
df_all.replace('(X)', None, regex=True, inplace=True)
cols = df_all.columns.to_series()
cols.replace('!!Estimate','', regex=True, inplace=True)
cols.replace('!!Percent','%', regex=True, inplace=True)
cols.replace(us_state_to_abbrev,inplace=True)
df_all.columns = cols
df_all = df_all.apply(pd.to_numeric, errors='ignore')
df_all
AL | Alabama% | AK | Alaska% | AZ | Arizona% | AR | Arkansas% | CA | California% | ... | VA | Virginia% | WA | Washington% | WV | West Virginia% | WI | Wisconsin% | WY | Wyoming% | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Label (Grouping) | |||||||||||||||||||||
Less than 9th grade | 118034.0 | 3.4 | 10950.0 | 2.2 | 241467.0 | 4.8 | 82740.0 | 4.0 | 2376805.0 | 8.8 | ... | 209876.0 | 3.5 | 192061.0 | 3.5 | 43318.0 | 3.4 | 99448.0 | 2.4 | 8563.0 | 2.1 |
9th to 12th grade, no diploma | 271263.0 | 7.8 | 21719.0 | 4.4 | 306781.0 | 6.1 | 140697.0 | 6.8 | 1738878.0 | 6.5 | ... | 295899.0 | 4.9 | 236434.0 | 4.3 | 93833.0 | 7.4 | 167049.0 | 4.1 | 16511.0 | 4.1 |
High school graduate (includes equivalency) | 1057155.0 | 30.4 | 142455.0 | 29.1 | 1202946.0 | 23.8 | 705670.0 | 34.3 | 5506855.0 | 20.5 | ... | 1431624.0 | 23.9 | 1176424.0 | 21.5 | 484166.0 | 38.3 | 1209135.0 | 29.6 | 109092.0 | 27.3 |
Some college, no degree | 716144.0 | 20.6 | 116078.0 | 23.7 | 1160936.0 | 23.0 | 438076.0 | 21.3 | 5192854.0 | 19.3 | ... | 1067612.0 | 17.9 | 1161011.0 | 21.2 | 216917.0 | 17.2 | 798565.0 | 19.5 | 99620.0 | 24.9 |
Associate's degree | 311537.0 | 9.0 | 48370.0 | 9.9 | 475387.0 | 9.4 | 167634.0 | 8.1 | 2115440.0 | 7.9 | ... | 451737.0 | 7.6 | 544596.0 | 10.0 | 111246.0 | 8.8 | 455882.0 | 11.2 | 47844.0 | 12.0 |
Bachelor's degree | 609316.0 | 17.5 | 94168.0 | 19.2 | 1032052.0 | 20.4 | 323961.0 | 15.7 | 6056169.0 | 22.5 | ... | 1409942.0 | 23.6 | 1302707.0 | 23.8 | 188793.0 | 15.0 | 885548.0 | 21.7 | 71812.0 | 18.0 |
Graduate or professional degree | 391475.0 | 11.3 | 55478.0 | 11.3 | 634087.0 | 12.5 | 198846.0 | 9.7 | 3879772.0 | 14.4 | ... | 1112092.0 | 18.6 | 856976.0 | 15.7 | 124293.0 | 9.8 | 472873.0 | 11.6 | 46354.0 | 11.6 |
Median household income (dollars) | 59674.0 | NaN | 88121.0 | NaN | 74568.0 | NaN | 55432.0 | NaN | 91551.0 | NaN | ... | 85873.0 | NaN | 91306.0 | NaN | 54329.0 | NaN | 70996.0 | NaN | 70042.0 | NaN |
Mean household income (dollars) | 82956.0 | NaN | 109524.0 | NaN | 101316.0 | NaN | 76853.0 | NaN | 131504.0 | NaN | ... | 119058.0 | NaN | 125847.0 | NaN | 75265.0 | NaN | 94085.0 | NaN | 90018.0 | NaN |
With health insurance coverage | 4551357.0 | 91.2 | 624538.0 | 89.0 | 6503212.0 | 89.7 | 2739056.0 | 91.6 | 36056827.0 | 93.5 | ... | 7901373.0 | 93.5 | 7199483.0 | 93.9 | 1638252.0 | 94.1 | 5528608.0 | 94.8 | 505145.0 | 88.5 |
No health insurance coverage | 437268.0 | 8.8 | 76973.0 | 11.0 | 748712.0 | 10.3 | 252143.0 | 8.4 | 2491843.0 | 6.5 | ... | 545005.0 | 6.5 | 467558.0 | 6.1 | 103070.0 | 5.9 | 302843.0 | 5.2 | 65619.0 | 11.5 |
Median age (years) | 39.6 | NaN | 35.9 | NaN | 38.8 | NaN | 38.9 | NaN | 37.9 | NaN | ... | 39.0 | NaN | 38.4 | NaN | 42.9 | NaN | 40.4 | NaN | 39.1 | NaN |
Male | 2461248.0 | 48.5 | 385667.0 | 52.6 | 3678381.0 | 50.0 | 1504488.0 | 49.4 | 19536425.0 | 50.1 | ... | 4299314.0 | 49.5 | 3930411.0 | 50.5 | 882101.0 | 49.7 | 2958771.0 | 50.2 | 297855.0 | 51.2 |
Female | 2613048.0 | 51.5 | 347916.0 | 47.4 | 3680816.0 | 50.0 | 1541149.0 | 50.6 | 19492917.0 | 49.9 | ... | 4384305.0 | 50.5 | 3855375.0 | 49.5 | 893055.0 | 50.3 | 2933768.0 | 49.8 | 283526.0 | 48.8 |
Hispanic or Latino (of any race) | 246477.0 | 4.9 | 56491.0 | 7.7 | 2388520.0 | 32.5 | 255416.0 | 8.4 | 15732184.0 | 40.3 | ... | 905750.0 | 10.4 | 1093313.0 | 14.0 | 34343.0 | 1.9 | 447022.0 | 7.6 | 62803.0 | 10.8 |
White alone | 3250182.0 | 64.1 | 421104.0 | 57.4 | 3814587.0 | 51.8 | 2054922.0 | 67.5 | 13160426.0 | 33.7 | ... | 5095130.0 | 58.7 | 4941456.0 | 63.5 | 1594128.0 | 89.8 | 4654031.0 | 79.0 | 473220.0 | 81.4 |
Black or African American alone | 1296500.0 | 25.6 | 20664.0 | 2.8 | 322459.0 | 4.4 | 435637.0 | 14.3 | 2025218.0 | 5.2 | ... | 1594785.0 | 18.4 | 299537.0 | 3.8 | 59390.0 | 3.3 | 345678.0 | 5.9 | 3990.0 | 0.7 |
American Indian and Alaska Native alone | 12694.0 | 0.3 | 93294.0 | 12.7 | 241688.0 | 3.3 | 12680.0 | 0.4 | 103030.0 | 0.3 | ... | 11738.0 | 0.1 | 69024.0 | 0.9 | 1880.0 | 0.1 | 33762.0 | 0.6 | 9589.0 | 1.6 |
Asian alone | 76682.0 | 1.5 | 44905.0 | 6.1 | 257020.0 | 3.5 | 47592.0 | 1.6 | 5957867.0 | 15.3 | ... | 602881.0 | 6.9 | 755832.0 | 9.7 | 13134.0 | 0.7 | 172242.0 | 2.9 | 3577.0 | 0.6 |
Native Hawaiian and Other Pacific Islander alone | 1954.0 | 0.0 | 14539.0 | 2.0 | 13887.0 | 0.2 | 14176.0 | 0.5 | 135777.0 | 0.3 | ... | 6869.0 | 0.1 | 52231.0 | 0.7 | 111.0 | 0.0 | 2592.0 | 0.0 | 294.0 | 0.1 |
Some Other Race alone | 19927.0 | 0.4 | 4034.0 | 0.5 | 36042.0 | 0.5 | 11530.0 | 0.4 | 251618.0 | 0.6 | ... | 60419.0 | 0.7 | 50945.0 | 0.7 | 5238.0 | 0.3 | 19680.0 | 0.3 | 4997.0 | 0.9 |
Two or More Races | 169880.0 | 3.3 | 78552.0 | 10.7 | 284994.0 | 3.9 | 213684.0 | 7.0 | 1663222.0 | 4.3 | ... | 406047.0 | 4.7 | 523448.0 | 6.7 | 66932.0 | 3.8 | 217532.0 | 3.7 | 22911.0 | 3.9 |
22 rows × 102 columns
df_all.T.dtypes
Label (Grouping) Less than 9th grade float64 9th to 12th grade, no diploma float64 High school graduate (includes equivalency) float64 Some college, no degree float64 Associate's degree float64 Bachelor's degree float64 Graduate or professional degree float64 Median household income (dollars) float64 Mean household income (dollars) float64 With health insurance coverage float64 No health insurance coverage float64 Median age (years) float64 Male float64 Female float64 Hispanic or Latino (of any race) float64 White alone float64 Black or African American alone float64 American Indian and Alaska Native alone float64 Asian alone float64 Native Hawaiian and Other Pacific Islander alone float64 Some Other Race alone float64 Two or More Races float64 dtype: object
Now we can look at this information. If we just look at the state percentages, so the data is normalized, we can see if there are any interesting features we can draw about education, income, etc.
df_all_est = df_all.T.iloc[0::2].T # not percentage columns
df_all_per = df_all.T.iloc[1::2].T # percentage columns
population = df_all_est.iloc[[12,13]].sum()
df_all_est.T.describe().iloc[[3,4,5,6,7]].T
min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|
Label (Grouping) | |||||
Less than 9th grade | 7709.0 | 43431.5 | 104599.0 | 200968.5 | 2376805.0 |
9th to 12th grade, no diploma | 15759.0 | 54041.0 | 167049.0 | 295917.0 | 1738878.0 |
High school graduate (includes equivalency) | 64543.0 | 329568.0 | 823109.0 | 1452257.0 | 5506855.0 |
Some college, no degree | 55085.0 | 248816.5 | 635109.0 | 1027405.0 | 5192854.0 |
Associate's degree | 15974.0 | 119963.5 | 271686.0 | 460914.5 | 2115440.0 |
Bachelor's degree | 71812.0 | 249125.5 | 584999.0 | 1277447.0 | 6056169.0 |
Graduate or professional degree | 46354.0 | 164184.0 | 391475.0 | 881625.5 | 3879772.0 |
Median household income (dollars) | 52719.0 | 66885.5 | 71970.0 | 84105.5 | 101027.0 |
Mean household income (dollars) | 72624.0 | 90063.5 | 97699.0 | 111890.5 | 148872.0 |
With health insurance coverage | 505145.0 | 1696782.5 | 4181313.0 | 6975473.5 | 36056827.0 |
No health insurance coverage | 19474.0 | 116320.5 | 302843.0 | 532875.0 | 4898967.0 |
Median age (years) | 32.1 | 38.2 | 39.1 | 40.4 | 45.1 |
Male | 297855.0 | 929513.0 | 2233158.0 | 3804396.0 | 19536425.0 |
Female | 283526.0 | 927581.5 | 2279152.0 | 3768095.5 | 19492917.0 |
Hispanic or Latino (of any race) | 14770.0 | 173793.0 | 442629.0 | 1010612.0 | 15732184.0 |
White alone | 246745.0 | 1329689.0 | 2906492.0 | 5005741.0 | 13160426.0 |
Black or African American alone | 3428.0 | 55358.0 | 343194.0 | 1311022.5 | 3508706.0 |
American Indian and Alaska Native alone | 978.0 | 6126.5 | 12313.0 | 33507.0 | 273650.0 |
Asian alone | 3577.0 | 39292.0 | 131817.0 | 371435.0 | 5957867.0 |
Native Hawaiian and Other Pacific Islander alone | 0.0 | 955.0 | 2745.0 | 8061.5 | 135777.0 |
Some Other Race alone | 2521.0 | 8167.0 | 19680.0 | 46158.0 | 251618.0 |
Two or More Races | 22911.0 | 77525.5 | 213684.0 | 353171.0 | 1663222.0 |
From the raw numbers, we can observe a few things. Health insurance coverage, age, and sex don't vary much across the states. But education levels, income, and race do vary quite a bit, but mean and median income are very similar. From this we can remove the variables that don't vary much, and mean income, to have a set of important demographics to compare across states.
df_all.drop([df_all.index[8],df_all.index[9],df_all.index[10],df_all.index[11],df_all.index[12],df_all.index[13]], inplace=True)
df_all_est.drop([df_all_est.index[8],df_all_est.index[9],df_all_est.index[10],df_all_est.index[11],df_all_est.index[12],df_all_est.index[13]], inplace=True)
df_all_per.drop([df_all_per.index[8],df_all_per.index[9],df_all_per.index[10],df_all_per.index[11],df_all_per.index[12],df_all_per.index[13]], inplace=True)
df_all
AL | Alabama% | AK | Alaska% | AZ | Arizona% | AR | Arkansas% | CA | California% | ... | VA | Virginia% | WA | Washington% | WV | West Virginia% | WI | Wisconsin% | WY | Wyoming% | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Label (Grouping) | |||||||||||||||||||||
Less than 9th grade | 118034.0 | 3.4 | 10950.0 | 2.2 | 241467.0 | 4.8 | 82740.0 | 4.0 | 2376805.0 | 8.8 | ... | 209876.0 | 3.5 | 192061.0 | 3.5 | 43318.0 | 3.4 | 99448.0 | 2.4 | 8563.0 | 2.1 |
9th to 12th grade, no diploma | 271263.0 | 7.8 | 21719.0 | 4.4 | 306781.0 | 6.1 | 140697.0 | 6.8 | 1738878.0 | 6.5 | ... | 295899.0 | 4.9 | 236434.0 | 4.3 | 93833.0 | 7.4 | 167049.0 | 4.1 | 16511.0 | 4.1 |
High school graduate (includes equivalency) | 1057155.0 | 30.4 | 142455.0 | 29.1 | 1202946.0 | 23.8 | 705670.0 | 34.3 | 5506855.0 | 20.5 | ... | 1431624.0 | 23.9 | 1176424.0 | 21.5 | 484166.0 | 38.3 | 1209135.0 | 29.6 | 109092.0 | 27.3 |
Some college, no degree | 716144.0 | 20.6 | 116078.0 | 23.7 | 1160936.0 | 23.0 | 438076.0 | 21.3 | 5192854.0 | 19.3 | ... | 1067612.0 | 17.9 | 1161011.0 | 21.2 | 216917.0 | 17.2 | 798565.0 | 19.5 | 99620.0 | 24.9 |
Associate's degree | 311537.0 | 9.0 | 48370.0 | 9.9 | 475387.0 | 9.4 | 167634.0 | 8.1 | 2115440.0 | 7.9 | ... | 451737.0 | 7.6 | 544596.0 | 10.0 | 111246.0 | 8.8 | 455882.0 | 11.2 | 47844.0 | 12.0 |
Bachelor's degree | 609316.0 | 17.5 | 94168.0 | 19.2 | 1032052.0 | 20.4 | 323961.0 | 15.7 | 6056169.0 | 22.5 | ... | 1409942.0 | 23.6 | 1302707.0 | 23.8 | 188793.0 | 15.0 | 885548.0 | 21.7 | 71812.0 | 18.0 |
Graduate or professional degree | 391475.0 | 11.3 | 55478.0 | 11.3 | 634087.0 | 12.5 | 198846.0 | 9.7 | 3879772.0 | 14.4 | ... | 1112092.0 | 18.6 | 856976.0 | 15.7 | 124293.0 | 9.8 | 472873.0 | 11.6 | 46354.0 | 11.6 |
Median household income (dollars) | 59674.0 | NaN | 88121.0 | NaN | 74568.0 | NaN | 55432.0 | NaN | 91551.0 | NaN | ... | 85873.0 | NaN | 91306.0 | NaN | 54329.0 | NaN | 70996.0 | NaN | 70042.0 | NaN |
Hispanic or Latino (of any race) | 246477.0 | 4.9 | 56491.0 | 7.7 | 2388520.0 | 32.5 | 255416.0 | 8.4 | 15732184.0 | 40.3 | ... | 905750.0 | 10.4 | 1093313.0 | 14.0 | 34343.0 | 1.9 | 447022.0 | 7.6 | 62803.0 | 10.8 |
White alone | 3250182.0 | 64.1 | 421104.0 | 57.4 | 3814587.0 | 51.8 | 2054922.0 | 67.5 | 13160426.0 | 33.7 | ... | 5095130.0 | 58.7 | 4941456.0 | 63.5 | 1594128.0 | 89.8 | 4654031.0 | 79.0 | 473220.0 | 81.4 |
Black or African American alone | 1296500.0 | 25.6 | 20664.0 | 2.8 | 322459.0 | 4.4 | 435637.0 | 14.3 | 2025218.0 | 5.2 | ... | 1594785.0 | 18.4 | 299537.0 | 3.8 | 59390.0 | 3.3 | 345678.0 | 5.9 | 3990.0 | 0.7 |
American Indian and Alaska Native alone | 12694.0 | 0.3 | 93294.0 | 12.7 | 241688.0 | 3.3 | 12680.0 | 0.4 | 103030.0 | 0.3 | ... | 11738.0 | 0.1 | 69024.0 | 0.9 | 1880.0 | 0.1 | 33762.0 | 0.6 | 9589.0 | 1.6 |
Asian alone | 76682.0 | 1.5 | 44905.0 | 6.1 | 257020.0 | 3.5 | 47592.0 | 1.6 | 5957867.0 | 15.3 | ... | 602881.0 | 6.9 | 755832.0 | 9.7 | 13134.0 | 0.7 | 172242.0 | 2.9 | 3577.0 | 0.6 |
Native Hawaiian and Other Pacific Islander alone | 1954.0 | 0.0 | 14539.0 | 2.0 | 13887.0 | 0.2 | 14176.0 | 0.5 | 135777.0 | 0.3 | ... | 6869.0 | 0.1 | 52231.0 | 0.7 | 111.0 | 0.0 | 2592.0 | 0.0 | 294.0 | 0.1 |
Some Other Race alone | 19927.0 | 0.4 | 4034.0 | 0.5 | 36042.0 | 0.5 | 11530.0 | 0.4 | 251618.0 | 0.6 | ... | 60419.0 | 0.7 | 50945.0 | 0.7 | 5238.0 | 0.3 | 19680.0 | 0.3 | 4997.0 | 0.9 |
Two or More Races | 169880.0 | 3.3 | 78552.0 | 10.7 | 284994.0 | 3.9 | 213684.0 | 7.0 | 1663222.0 | 4.3 | ... | 406047.0 | 4.7 | 523448.0 | 6.7 | 66932.0 | 3.8 | 217532.0 | 3.7 | 22911.0 | 3.9 |
16 rows × 102 columns
Now we can look how education, income, and race vary over the states.
states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI',
'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN',
'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA',
'WV', 'WI', 'WY']
# Education across states
df_all_per.T[df_all_per.index[list(range(0,7))]].plot.bar(stacked=True)
plt.ylabel("Percent of state with eduction level")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
<matplotlib.legend.Legend at 0x16529b5e0>
# Income across states
plt.figure(figsize=(15,5))
plt.bar(states, df_all_est.T[df_all_est.T.columns[7]])
plt.ylabel("Median income per state")
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>
# Race across states
plt.figure(figsize=(25,15))
df_all_per.T[df_all_per.index[list(range(8,16))]].plot.bar(stacked=True)
plt.ylabel("Percent of state of certain race")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
<matplotlib.legend.Legend at 0x1621cb730>
<Figure size 2500x1500 with 0 Axes>
These three graphs above show us what states are very different from eachother in terms of race or median income or education attainment. These differences help indicate which states we need to compare for vaccine and Covid-19 cases in our models.
We start by reading the csv into a data frame and deleting the columns that we don't need such as the state boundaries, county boundaries, and geographical point (longtitide and latitude).
url = "https://data.cdc.gov/api/views/q9mh-h2tw/rows.csv?accessType=DOWNLOAD"
hesitancy_df = pd.read_csv(url)
hesitancy_df.drop(labels=['County Boundary', 'State Boundary', 'Geographical Point'], axis=1, inplace=True)
Now we can remove the catogorical varibales that we will not need and group by state.
hesitancy_df.drop(labels=['CVAC Level Of Concern', 'SVI Category', 'State', 'FIPS Code','County Name'], axis=1, inplace=True)
hesitancy_by_state = hesitancy_df.groupby('State Code').mean()
hesitancy_by_state.head()
Estimated hesitant | Estimated hesitant or unsure | Estimated strongly hesitant | Social Vulnerability Index (SVI) | CVAC level of concern for vaccination rollout | Percent adults fully vaccinated against COVID-19 (as of 6/10/21) | Percent Hispanic | Percent non-Hispanic American Indian/Alaska Native | Percent non-Hispanic Asian | Percent non-Hispanic Black | Percent non-Hispanic Native Hawaiian/Pacific Islander | Percent non-Hispanic White | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
State Code | ||||||||||||
AK | 0.217386 | 0.264245 | 0.156907 | 0.561724 | 0.744483 | 0.582296 | 0.053148 | 0.324583 | 0.053886 | 0.013293 | 0.005324 | 0.478703 |
AL | 0.173507 | 0.231512 | 0.132409 | 0.703433 | 0.720149 | 0.320851 | 0.034864 | 0.005749 | 0.007072 | 0.287355 | 0.000269 | 0.648376 |
AR | 0.212123 | 0.261232 | 0.139901 | 0.717733 | 0.644800 | 0.338800 | 0.052561 | 0.005515 | 0.006661 | 0.160659 | 0.001171 | 0.751871 |
AZ | 0.159240 | 0.242547 | 0.121353 | 0.846000 | 0.901333 | 0.504733 | 0.313380 | 0.133180 | 0.013267 | 0.018427 | 0.001180 | 0.501993 |
CA | 0.072274 | 0.111041 | 0.035807 | 0.633793 | 0.633276 | 0.506820 | 0.306224 | 0.017743 | 0.073066 | 0.029845 | 0.002805 | 0.537445 |
Then we can make a correlation table of all the quantitaive variables to get some ideas of what variables correlate with what. This will help give us a basic idea of what we should look at more and what we will be able to take away from this data table. We can also look at some basic summary statistics for the variables to give us a better picture of the data.
variables = ['Estimated hesitant',
'Estimated hesitant or unsure',
'Estimated strongly hesitant',
'Social Vulnerability Index (SVI)',
'CVAC level of concern for vaccination rollout',
'Percent adults fully vaccinated against COVID-19 (as of 6/10/21)',
'Percent Hispanic',
'Percent non-Hispanic American Indian/Alaska Native',
'Percent non-Hispanic Asian',
'Percent non-Hispanic Black',
'Percent non-Hispanic Native Hawaiian/Pacific Islander',
'Percent non-Hispanic White'
]
hesitancy_df[variables].corr()
Estimated hesitant | Estimated hesitant or unsure | Estimated strongly hesitant | Social Vulnerability Index (SVI) | CVAC level of concern for vaccination rollout | Percent adults fully vaccinated against COVID-19 (as of 6/10/21) | Percent Hispanic | Percent non-Hispanic American Indian/Alaska Native | Percent non-Hispanic Asian | Percent non-Hispanic Black | Percent non-Hispanic Native Hawaiian/Pacific Islander | Percent non-Hispanic White | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Estimated hesitant | 1.000000 | 0.961811 | 0.975274 | 0.282310 | 0.338086 | -0.199502 | -0.210853 | 0.173319 | -0.269042 | 0.159336 | -0.047019 | 0.000838 |
Estimated hesitant or unsure | 0.961811 | 1.000000 | 0.938115 | 0.386490 | 0.438484 | -0.277029 | -0.195632 | 0.154750 | -0.303635 | 0.288680 | -0.064783 | -0.085858 |
Estimated strongly hesitant | 0.975274 | 0.938115 | 1.000000 | 0.299722 | 0.350317 | -0.205624 | -0.205454 | 0.198035 | -0.263052 | 0.171879 | -0.048325 | -0.023501 |
Social Vulnerability Index (SVI) | 0.282310 | 0.386490 | 0.299722 | 1.000000 | 0.717859 | -0.276399 | 0.313903 | 0.167560 | -0.058608 | 0.475836 | 0.000462 | -0.615726 |
CVAC level of concern for vaccination rollout | 0.338086 | 0.438484 | 0.350317 | 0.717859 | 1.000000 | -0.401594 | 0.323777 | 0.139995 | -0.139841 | 0.359244 | 0.002739 | -0.512439 |
Percent adults fully vaccinated against COVID-19 (as of 6/10/21) | -0.199502 | -0.277029 | -0.205624 | -0.276399 | -0.401594 | 1.000000 | 0.096973 | 0.148729 | 0.266810 | -0.253635 | 0.067014 | 0.042492 |
Percent Hispanic | -0.210853 | -0.195632 | -0.205454 | 0.313903 | 0.323777 | 0.096973 | 1.000000 | -0.042063 | 0.138102 | -0.115655 | 0.017473 | -0.601836 |
Percent non-Hispanic American Indian/Alaska Native | 0.173319 | 0.154750 | 0.198035 | 0.167560 | 0.139995 | 0.148729 | -0.042063 | 1.000000 | -0.018333 | -0.102222 | 0.017776 | -0.293375 |
Percent non-Hispanic Asian | -0.269042 | -0.303635 | -0.263052 | -0.058608 | -0.139841 | 0.266810 | 0.138102 | -0.018333 | 1.000000 | 0.016760 | 0.375992 | -0.279602 |
Percent non-Hispanic Black | 0.159336 | 0.288680 | 0.171879 | 0.475836 | 0.359244 | -0.253635 | -0.115655 | -0.102222 | 0.016760 | 1.000000 | -0.036959 | -0.591361 |
Percent non-Hispanic Native Hawaiian/Pacific Islander | -0.047019 | -0.064783 | -0.048325 | 0.000462 | 0.002739 | 0.067014 | 0.017473 | 0.017776 | 0.375992 | -0.036959 | 1.000000 | -0.110420 |
Percent non-Hispanic White | 0.000838 | -0.085858 | -0.023501 | -0.615726 | -0.512439 | 0.042492 | -0.601836 | -0.293375 | -0.279602 | -0.591361 | -0.110420 | 1.000000 |
hesitancy_df[variables].describe()
Estimated hesitant | Estimated hesitant or unsure | Estimated strongly hesitant | Social Vulnerability Index (SVI) | CVAC level of concern for vaccination rollout | Percent adults fully vaccinated against COVID-19 (as of 6/10/21) | Percent Hispanic | Percent non-Hispanic American Indian/Alaska Native | Percent non-Hispanic Asian | Percent non-Hispanic Black | Percent non-Hispanic Native Hawaiian/Pacific Islander | Percent non-Hispanic White | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3142.000000 | 3142.000000 | 3142.000000 | 3141.000000 | 3142.000000 | 2864.000000 | 3142.000000 | 3142.000000 | 3142.000000 | 3142.000000 | 3142.000000 | 3142.000000 |
mean | 0.132600 | 0.191429 | 0.086653 | 0.500000 | 0.499965 | 0.399398 | 0.094184 | 0.018463 | 0.013621 | 0.089259 | 0.000919 | 0.762499 |
std | 0.046337 | 0.053494 | 0.032938 | 0.288842 | 0.288832 | 0.142893 | 0.138647 | 0.076340 | 0.027665 | 0.144283 | 0.006240 | 0.202157 |
min | 0.026900 | 0.049900 | 0.018600 | 0.000000 | 0.000000 | 0.001000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006900 |
25% | 0.098300 | 0.148500 | 0.062325 | 0.250000 | 0.250000 | 0.318000 | 0.022200 | 0.001200 | 0.002800 | 0.006500 | 0.000000 | 0.644450 |
50% | 0.131800 | 0.190100 | 0.084900 | 0.500000 | 0.500000 | 0.400000 | 0.042300 | 0.002800 | 0.006100 | 0.021900 | 0.000100 | 0.837400 |
75% | 0.161725 | 0.228800 | 0.104475 | 0.750000 | 0.750000 | 0.494000 | 0.096800 | 0.006900 | 0.012800 | 0.098400 | 0.000600 | 0.924875 |
max | 0.267000 | 0.323300 | 0.182400 | 1.000000 | 1.000000 | 0.999000 | 0.991700 | 0.919000 | 0.417300 | 0.872300 | 0.272700 | 1.000000 |
We can see that the 3 columns that talk about hesitancy are very strongly correlated, and when we look at summary statistics, we can see that on average the "Estimated hesitant or unsure" is slightly higher than the "Estimated hesitant" and the "Estimated strongly hesitant" is slightly below. Knowing this, we will look only at the "Estimated hesitant" column in order to simplify things.
hesitancy_df.drop(labels=['Estimated hesitant or unsure', 'Estimated strongly hesitant'], axis=1, inplace=True)
hesitancy_by_state.drop(labels=['Estimated hesitant or unsure', 'Estimated strongly hesitant'], axis=1, inplace=True)
The social vulernability index (SVI) is define as "a database that helps emergency response planners and public health officials identify, map, and plan support for communities that will most likely need support before, during, and after a public health emergency.". It acts as an estimate of how much support a county will need in the event of a public health emergency based off a large number of factors. Below we can make a scatter of the percent of a county that is white compared to the SVI.
hesitancy_df.plot.scatter(x=['Percent non-Hispanic White'],
y='Social Vulnerability Index (SVI)', alpha = 0.5)
<Axes: xlabel='[Percent non-Hispanic White]', ylabel='Social Vulnerability Index (SVI)'>
Next we will open up a data set from the CDC on vaccine trends in the United States that can be found here https://data.cdc.gov/Vaccinations/COVID-19-Vaccination-Trends-in-the-United-States-N/rh2h-3yt2
vaccines = pd.read_csv('../datasets/COVID-19_Vaccination_Trends_in_the_United_States_National_and_Jurisdictional.csv')
vaccines = vaccines[ vaccines['date_type'].str.contains('Admin')==False ]
vaccines.head()
Date | date_type | MMWR_week | Location | Administered_Daily | Administered_Cumulative | Administered_7_Day_Rolling_Average | Admin_Dose_1_Daily | Admin_Dose_1_Cumulative | Admin_Dose_1_Day_Rolling_Average | ... | Booster_7_Day_Rolling_Average | Additional_Doses_Vax_Pct | Second_Booster_50Plus_Daily | Second_Booster_50Plus_Cumulative | Second_Booster_50Plus_7_Day_Rolling_Average | Second_Booster_50Plus_Vax_Pct | Bivalent_Booster_Daily | Bivalent_Booster_Cumulative | Bivalent_Booster_7_Day_Rolling_Average | Bivalent_Booster_Pop_Pct | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 05/10/2023 | Report | 19 | CO | 15097 | 13033446 | NaN | 1527 | 4837792 | NaN | ... | NaN | 57.9 | 1062 | 794838 | NaN | 62.9 | 9725 | 1272115 | NaN | 22.1 |
1 | 05/10/2023 | Report | 19 | AZ | 16505 | 14647405 | NaN | 2955 | 5704677 | NaN | ... | NaN | 50.2 | 1312 | 794699 | NaN | 54.3 | 11388 | 1148060 | NaN | 15.8 |
2 | 05/10/2023 | Report | 19 | MN | 16020 | 12829141 | NaN | 1282 | 4461994 | NaN | ... | NaN | 63.6 | 1196 | 983284 | NaN | 67.8 | 5497 | 1510743 | NaN | 26.8 |
3 | 05/10/2023 | Report | 19 | ID | 3526 | 2894361 | NaN | 323 | 1146055 | NaN | ... | NaN | 48.8 | 281 | 173862 | NaN | 54.8 | 2032 | 248989 | NaN | 13.9 |
4 | 05/10/2023 | Report | 19 | DC | 31 | 2137377 | NaN | 264 | 836680 | NaN | ... | NaN | 51.4 | 106 | 80880 | NaN | 55.9 | 509 | 226857 | NaN | 32.1 |
5 rows × 29 columns
Next we can drop all of the rows that contain data from a US territory, as we will only be looking at the 50 states & DC. We also will convert the date column to datetime type, and drop any columns that we don't need.
territories = ['VI','FM', 'MH','MP','AS','GU','PR','PW','US']
mask = vaccines["Location"].isin(territories)
vaccines = vaccines[~mask]
vaccines.Date = pd.to_datetime(vaccines['Date'])
vaccines.drop(labels=['MMWR_week',
'Bivalent_Booster_7_Day_Rolling_Average',
'Bivalent_Booster_Cumulative',
'Bivalent_Booster_Daily',
'Second_Booster_50Plus_Vax_Pct',
'Second_Booster_50Plus_7_Day_Rolling_Average',
'Second_Booster_50Plus_Cumulative',
'Second_Booster_50Plus_Daily',
'Booster_Daily',
'Booster_Cumulative',
'Booster_7_Day_Rolling_Average',
'Series_Complete_Day_Rolling_Average',
'Series_Complete_Cumulative',
'Series_Complete_Daily',
'Administered_daily_change_report_7dayroll',
'Administered_daily_change_report',
'Admin_Dose_1_Day_Rolling_Average',
'Admin_Dose_1_Cumulative',
'Admin_Dose_1_Daily',
'date_type'
], axis=1, inplace=True)
vaccines.head()
Date | Location | Administered_Daily | Administered_Cumulative | Administered_7_Day_Rolling_Average | Administered_Dose1_Pop_Pct | Series_Complete_Pop_Pct | Additional_Doses_Vax_Pct | Bivalent_Booster_Pop_Pct | |
---|---|---|---|---|---|---|---|---|---|
0 | 2023-05-10 | CO | 15097 | 13033446 | NaN | 84.0 | 73.8 | 57.9 | 22.1 |
1 | 2023-05-10 | AZ | 16505 | 14647405 | NaN | 78.4 | 66.2 | 50.2 | 15.8 |
2 | 2023-05-10 | MN | 16020 | 12829141 | NaN | 79.1 | 72.4 | 63.6 | 26.8 |
3 | 2023-05-10 | ID | 3526 | 2894361 | NaN | 64.1 | 56.6 | 48.8 | 13.9 |
4 | 2023-05-10 | DC | 31 | 2137377 | NaN | 95.0 | 91.3 | 51.4 | 32.1 |
Now that we have the data that we want, we can make a plot to see vaccine trends overtime, as well as by state.
vaccines.groupby("Date")['Administered_Daily'].sum().plot(title="Daily vaccines over time")
<Axes: title={'center': 'Daily vaccines over time'}, xlabel='Date'>
vaccines.groupby("Location")['Administered_Dose1_Pop_Pct'].max().plot.bar(title="Percent of people with at least 1 dose by state")
<Axes: title={'center': 'Percent of people with at least 1 dose by state'}, xlabel='Location'>
This data is from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. Over the course of three years, (1/22/20 to 3/9/23) data on the daily confirmed cases, deaths, and recoveries from Covid-19 were published to this Github repository: https://github.com/CSSEGISandData/COVID-19.git. The data is reported by city and state, but for ease, we chose to look at the state level so rows were aggregated to one row per state (or territory). Additionally, this dataset is being used to look at Covid-19 cases over time (dates), all other columns were dropped. Then, each state can be normalized by its population so we can compare across states.
confirmed_df = pd.read_csv("../datasets/time_series_covid19_confirmed_US.csv")
confirmed_df.drop(columns=['iso2','iso3','code3','FIPS','Country_Region','Lat','Long_','UID','Admin2','Combined_Key'], inplace=True)
confirmed_state_by_day = confirmed_df.groupby('Province_State').sum()
confirmed_state_by_day.drop(labels=['American Samoa','Diamond Princess','Grand Princess', 'Guam','Northern Mariana Islands','Puerto Rico','Virgin Islands'], axis='index', inplace=True)
confirmed_state_by_day.rename(us_state_to_abbrev, inplace=True)
confirmed_state_by_day = confirmed_state_by_day.div(population,axis=0)*100
confirmed_state_by_day.head()
1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | 1/29/20 | 1/30/20 | 1/31/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Province_State | |||||||||||||||||||||
AL | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 32.287198 | 32.360390 | 32.360390 | 32.360390 | 32.360390 | 32.360390 | 32.360390 | 32.360390 | 32.409087 | 32.409087 |
AK | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 41.859340 | 41.859340 | 41.859340 | 41.859340 | 41.859340 | 41.859340 | 41.859340 | 41.938676 | 41.938676 | 41.938676 |
AZ | 0.0 | 0.0 | 0.0 | 0.0 | 0.000014 | 0.000014 | 0.000014 | 0.000014 | 0.000014 | 0.000014 | ... | 33.082835 | 33.159786 | 33.159786 | 33.159786 | 33.159786 | 33.159786 | 33.159786 | 33.159786 | 33.203541 | 33.203541 |
AR | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 32.989880 | 32.989880 | 33.004984 | 33.013258 | 33.023962 | 33.025833 | 33.028559 | 33.041889 | 33.051280 | 33.059849 |
CA | 0.0 | 0.0 | 0.0 | 0.0 | 0.000005 | 0.000005 | 0.000005 | 0.000005 | 0.000005 | 0.000008 | ... | 30.996200 | 30.996200 | 31.022445 | 31.028397 | 31.028397 | 31.028397 | 31.028397 | 31.056032 | 31.056032 | 31.078410 |
5 rows × 1143 columns
Now, we can look at the cases of Covid-19 over 3 years, state by state. This is give us an idea of trends of reported covid cases over time and how they vary state to state. Below is a plot of daily reported covid-19 cases, from 1/22/20 to 3/9/23.
for i in confirmed_state_by_day.index:
plt.plot(confirmed_state_by_day.loc[i])
plt.xticks(ticks=[0,1142], labels=['2020-01-22','2023-09-03'])
plt.xlabel('date')
plt.ylabel('percentage of covid cases in state')
plt.show()
Let's find the top 10 states.
confirmed_state_by_day.sort_values(['3/9/23'], ascending=False)[0:10]
1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | 1/29/20 | 1/30/20 | 1/31/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Province_State | |||||||||||||||||||||
RI | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 41.989734 | 41.989734 | 42.061872 | 42.061872 | 42.061872 | 42.061872 | 42.061872 | 42.061872 | 42.061872 | 42.121485 |
AK | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 41.859340 | 41.859340 | 41.859340 | 41.859340 | 41.859340 | 41.859340 | 41.859340 | 41.938676 | 41.938676 | 41.938676 |
KY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 37.967693 | 37.967693 | 37.977976 | 37.977976 | 37.977976 | 37.977976 | 37.977976 | 37.977976 | 38.084063 | 38.084063 |
ND | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 36.799609 | 36.812698 | 36.823349 | 36.823349 | 36.823349 | 36.823349 | 36.823349 | 36.823349 | 36.823349 | 36.823349 |
WV | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 36.010694 | 36.112544 | 36.112544 | 36.208649 | 36.208649 | 36.208649 | 36.208649 | 36.208649 | 36.208649 | 36.208649 |
TN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 35.630992 | 35.641004 | 35.655512 | 35.665779 | 35.668828 | 35.596105 | 35.596105 | 35.596105 | 35.668828 | 35.668828 |
SC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 34.712437 | 34.712437 | 34.712437 | 34.712437 | 34.712437 | 34.712437 | 34.712437 | 34.766141 | 34.766141 | 34.766141 |
NY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 34.468349 | 34.470498 | 34.482451 | 34.491213 | 34.496157 | 34.499959 | 34.512130 | 34.517345 | 34.524154 | 34.531107 |
LA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 34.118143 | 34.217615 | 34.217615 | 34.217615 | 34.217615 | 34.217615 | 34.217615 | 34.217615 | 34.285999 | 34.285999 |
WI | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 33.955957 | 33.970009 | 33.982770 | 33.994820 | 33.994820 | 33.994820 | 34.016203 | 34.029474 | 34.040912 | 34.052927 |
10 rows × 1143 columns
From the plot, we can see that some states, like West Virginia, Wyoming, and Mississippi, have many more cases than others per capita. Overall, some states saw outbreaks when others didnt, and some didn't have many cases at all. This can help us answer questions about how Covid-19 impacted different states and what states were impacted the most at different times.
To look at this data a bit closer, we can look at the average number of cases over three months. Then we can look at the summary statistics for that data.
confirmed_state_by_day.columns = pd.to_datetime(confirmed_state_by_day.columns, dayfirst=True, format='mixed')
confirmed_day = confirmed_state_by_day
# 3 month average
threemonthaverage = confirmed_day.T.resample('Q').mean().T
threemonthaverage.describe()
2020-03-31 | 2020-06-30 | 2020-09-30 | 2020-12-31 | 2021-03-31 | 2021-06-30 | 2021-09-30 | 2021-12-31 | 2022-03-31 | 2022-06-30 | 2022-09-30 | 2022-12-31 | 2023-03-31 | 2023-06-30 | 2023-09-30 | 2023-12-31 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 |
mean | 0.523509 | 0.753765 | 1.404136 | 3.053974 | 9.083523 | 10.006541 | 10.977025 | 13.110492 | 24.043743 | 25.460957 | 27.085812 | 28.069560 | 30.466076 | 30.458059 | 30.493810 | 30.374839 |
std | 0.183625 | 0.335680 | 0.542772 | 1.099395 | 2.358052 | 2.376262 | 2.536622 | 2.838387 | 3.729242 | 3.779098 | 3.923377 | 4.022893 | 4.295360 | 4.293824 | 4.299314 | 4.276433 |
min | 0.091269 | 0.177454 | 0.235364 | 0.482400 | 2.494718 | 2.822579 | 3.753079 | 5.118809 | 16.102109 | 17.621519 | 18.990129 | 19.800811 | 21.845738 | 21.824123 | 21.858583 | 21.768570 |
25% | 0.426930 | 0.585348 | 1.037146 | 2.546424 | 8.354899 | 9.527069 | 10.295712 | 12.307920 | 22.040358 | 23.343224 | 25.002910 | 25.781952 | 27.810816 | 27.806267 | 27.822505 | 27.740747 |
50% | 0.533490 | 0.720621 | 1.437911 | 3.166962 | 9.544954 | 10.499170 | 11.379409 | 13.674178 | 24.408501 | 25.723154 | 27.584863 | 28.548954 | 30.765054 | 30.748463 | 30.779522 | 30.687674 |
75% | 0.658690 | 0.962602 | 1.787737 | 3.653513 | 10.445870 | 11.275231 | 12.601067 | 14.956032 | 26.564371 | 27.780954 | 29.494269 | 30.417470 | 32.724562 | 32.735825 | 32.761352 | 32.618070 |
max | 0.877756 | 1.732563 | 2.621345 | 6.248935 | 13.599858 | 14.374649 | 15.109814 | 18.279272 | 32.961085 | 35.490087 | 38.081019 | 38.994821 | 41.574649 | 41.526601 | 41.592979 | 41.424713 |
plt.plot(threemonthaverage.T)
plt.xticks(rotation=45)
plt.xlabel('three month period')
plt.ylabel('average number of covid cases per three months')
plt.show()
From these summary statistics, there was a massive increase in cases from 1/21 to 4/21 in 2020 (increased 3 fold) and also from 1/22 to 4/22 (increased 2 fold). After the Covid-19 vaccine was released in December of 2020, the immediate three following months saw a massive increase in cases, then the next 9 months saw a plateu in cases. That intial big spike could be from the slow roll-out of the vaccine to the general public, but the plateu after seems to validate the effectiveness of the vaccine. The peak after about 9 months could be a result of winter and everyone being in closer quarters, and could also indicate the vaccine was effective for about 9 months, until another booster was encouraged to maintain increased levels of immune responses.
Using these datasets, we can find the correlation between different demographic information and vaccine hesitancy, actual vaccine trends, and Covid-19 cases. We can determine the correlations for each pair of the four datasets and as combinations. The datasets should be normalized to population so the correlation is interpretable. Graphs, such as hesitancy by state vs vaccine rates by state can help visually show the correlations found.
These correlations can also help address the question of the effectiveness of the vaccine on Covid-19 cases.
We also can also use demographic information, vaccine hesitancy, and Covid-19 to predict vaccine rates. This will include combining all the datasets, splitting the data into training and testing sets, and finally implementing a predictive model, using linear regression (linear_model in sklearn package). This model will then be evaluated using k-fold cross validation and performance metrics such as MSE, RMSE, and MAE.
Using these four datsets, we can first look at the correlation matrix between all the demographic information, vaccine hesitancy, actual vaccine trends, and Covid-19 cases. Then, we can look at correlations of the first three datasets and covid cases. This second correlation matrix will help us see which features should be used in our model to predict Covid-19 cases.
vaccines_on_days = vaccines[vaccines['Date'] == '12/13/2021'][['Location','Administered_Dose1_Pop_Pct']].merge(vaccines[vaccines['Date'] == '12/14/2022'][['Location','Administered_Dose1_Pop_Pct']],on = 'Location',suffixes = ('',' on 12/14/2022'))
vaccines_on_days = vaccines_on_days.merge(vaccines[vaccines['Date'] == '12/14/2022'][['Location','Administered_Dose1_Pop_Pct']],on = 'Location',suffixes = (' on 12/13/2021',' on 05/10/2023'))
vaccines_on_days.set_index("Location",inplace=True)
vaccines_on_days.head()
Administered_Dose1_Pop_Pct on 12/13/2021 | Administered_Dose1_Pop_Pct on 12/14/2022 | Administered_Dose1_Pop_Pct on 05/10/2023 | |
---|---|---|---|
Location | |||
NH | 92.0 | 87.3 | 87.3 |
WA | 74.1 | 84.7 | 84.7 |
CO | 73.0 | 83.3 | 83.3 |
CT | 86.2 | 95.0 | 95.0 |
MS | 54.7 | 61.4 | 61.4 |
df_demog = df_all_per
df_demog.columns = df_demog.columns.str.rstrip('%')
df_demog = df_demog.T
df_demog.rename(us_state_to_abbrev, inplace=True)
df_demog.drop('\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Median household income (dollars)',axis = 1,inplace=True)
df_demog.join(df_all_est.T[['\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Median household income (dollars)']])
df_demog.head()
Label (Grouping) | Less than 9th grade | 9th to 12th grade, no diploma | High school graduate (includes equivalency) | Some college, no degree | Associate's degree | Bachelor's degree | Graduate or professional degree | Hispanic or Latino (of any race) | White alone | Black or African American alone | American Indian and Alaska Native alone | Asian alone | Native Hawaiian and Other Pacific Islander alone | Some Other Race alone | Two or More Races |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AL | 3.4 | 7.8 | 30.4 | 20.6 | 9.0 | 17.5 | 11.3 | 4.9 | 64.1 | 25.6 | 0.3 | 1.5 | 0.0 | 0.4 | 3.3 |
AK | 2.2 | 4.4 | 29.1 | 23.7 | 9.9 | 19.2 | 11.3 | 7.7 | 57.4 | 2.8 | 12.7 | 6.1 | 2.0 | 0.5 | 10.7 |
AZ | 4.8 | 6.1 | 23.8 | 23.0 | 9.4 | 20.4 | 12.5 | 32.5 | 51.8 | 4.4 | 3.3 | 3.5 | 0.2 | 0.5 | 3.9 |
AR | 4.0 | 6.8 | 34.3 | 21.3 | 8.1 | 15.7 | 9.7 | 8.4 | 67.5 | 14.3 | 0.4 | 1.6 | 0.5 | 0.4 | 7.0 |
CA | 8.8 | 6.5 | 20.5 | 19.3 | 7.9 | 22.5 | 14.4 | 40.3 | 33.7 | 5.2 | 0.3 | 15.3 | 0.3 | 0.6 | 4.3 |
cases_vaccines = confirmed_state_by_day[['12/13/2021','12/14/2022','05/03/2023']].join(vaccines_on_days)
cases_vaccines.rename(columns={'2020-12-13':'Cases on 12/13/2020',
'2021-12-13 00:00:00':'Cases on 12/13/2021',
'2022-12-14 00:00:00':'Cases on 12/14/2022',
'2023-05-03 00:00:00':'Cases on 05/03/2023'},inplace=True)
cases_vaccines_hesitancy = cases_vaccines.join(hesitancy_by_state[['Estimated hesitant','Social Vulnerability Index (SVI)']])
all_data = cases_vaccines_hesitancy.join(df_demog)
all_data.head()
2021-12-13 00:00:00 | 2022-12-14 00:00:00 | 2023-05-03 00:00:00 | Administered_Dose1_Pop_Pct on 12/13/2021 | Administered_Dose1_Pop_Pct on 12/14/2022 | Administered_Dose1_Pop_Pct on 05/10/2023 | Estimated hesitant | Social Vulnerability Index (SVI) | Less than 9th grade | 9th to 12th grade, no diploma | ... | Bachelor's degree | Graduate or professional degree | Hispanic or Latino (of any race) | White alone | Black or African American alone | American Indian and Alaska Native alone | Asian alone | Native Hawaiian and Other Pacific Islander alone | Some Other Race alone | Two or More Races | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Province_State | |||||||||||||||||||||
AL | 16.801818 | 30.773865 | 32.360390 | 57.6 | 64.8 | 64.8 | 0.173507 | 0.703433 | 3.4 | 7.8 | ... | 17.5 | 11.3 | 4.9 | 64.1 | 25.6 | 0.3 | 1.5 | 0.0 | 0.4 | 3.3 |
AK | 20.970361 | 40.993180 | 41.859340 | 64.1 | 72.7 | 72.7 | 0.217386 | 0.561724 | 2.2 | 4.4 | ... | 19.2 | 11.3 | 7.7 | 57.4 | 2.8 | 12.7 | 6.1 | 2.0 | 0.5 | 10.7 |
AZ | 17.917444 | 32.137745 | 33.159786 | 65.8 | 76.9 | 76.9 | 0.159240 | 0.846000 | 4.8 | 6.1 | ... | 20.4 | 12.5 | 32.5 | 51.8 | 4.4 | 3.3 | 3.5 | 0.2 | 0.5 | 3.9 |
AR | 17.687630 | 32.054936 | 33.025833 | 61.7 | 69.5 | 69.5 | 0.212123 | 0.717733 | 4.0 | 6.8 | ... | 15.7 | 9.7 | 8.4 | 67.5 | 14.3 | 0.4 | 1.6 | 0.5 | 0.4 | 7.0 |
CA | 13.239006 | 29.969714 | 31.028397 | 80.7 | 84.2 | 84.2 | 0.072274 | 0.633793 | 8.8 | 6.5 | ... | 22.5 | 14.4 | 40.3 | 33.7 | 5.2 | 0.3 | 15.3 | 0.3 | 0.6 | 4.3 |
5 rows × 23 columns
First we can look at the correlations of every feature we selected using a heatmap for easy visualization.
sb.heatmap(all_data.corr(), vmin=-1, vmax=1, cmap='PiYG')
<Axes: >
From this heatmap, a few things are signficiantly correlated. Estimated hesitancy and actual vaccine doses are significantly negativly correlated. This makes sense because if people were more hesitant about the vaccine, they would be less likely to get it. Estimated hesitancy was also negativly correlated with higher education (Bachelor's and Graduate degrees). This indicates that more educated people are less hesitant about the vaccine, potentially because they have a better skill set to do more of their own research and critial thinking to form an opinion on the vaccine than less educated people. This indicates education is important to combat misfinformation and essential to make informed decisions, especially about health concerns. Finally, SVI became decreasingly correlated with education as education levels increased. This indicates that the more education a person attains, the less likely they are to be significantly affected by public health emergencies, such as a pandemic.
Next, we can zoom in on all the demographic, vaccine, and hesitancy correlations with just cases to see what we should include in our model.
sb.heatmap(all_data.corr().iloc[0:3].T[3:], vmin=-1, vmax=1,cmap='PiYG')
<Axes: >
From this heatmap, we can see that estimated hesitancy was significantly and positively correlated with covid cases on 12/13/2021. This indicates the more hesitantcy there was in a state, the more covid cases they had. Additionally, over all three case dates, the administered vaccines are negativly correlated with cases. This shows that the more vaccines given out, the less covid cases there were, which signals towards the effectiveness of the vaccine. Additionally, this map shows that there was no significant difference in cases in education levels below a bachelor's degree, but there was a negative correlation between higher education and cases. This could be an important feature in our model.
Using these correlations, we can choose features highly correlated with covid cases, and drop anything not highly correlated. Here we will only drop Social Vulnerability Index (SVI) and not race because some races were significantly correlated while others weren't so for simplicity we are keeping every race. From here, we can use these features to predict covid-19 cases on 5/10/2023.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_predict
all_data.columns = all_data.columns.astype(str)
For this model, we will use linear regression. We will randomly split the data into 60% for training and 40% for testing.
features = all_data.columns[[0,1] + list(range(3,7)) + list(range(8,23))]
pred_feat = all_data.columns[2]
X = all_data[features]
y = all_data[pred_feat]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Below is our predicted data! Let's evaluate and compare it to the actual values.
X_test
2021-12-13 00:00:00 | 2022-12-14 00:00:00 | Administered_Dose1_Pop_Pct on 12/13/2021 | Administered_Dose1_Pop_Pct on 12/14/2022 | Administered_Dose1_Pop_Pct on 05/10/2023 | Estimated hesitant | Less than 9th grade | 9th to 12th grade, no diploma | High school graduate (includes equivalency) | Some college, no degree | ... | Bachelor's degree | Graduate or professional degree | Hispanic or Latino (of any race) | White alone | Black or African American alone | American Indian and Alaska Native alone | Asian alone | Native Hawaiian and Other Pacific Islander alone | Some Other Race alone | Two or More Races | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Province_State | |||||||||||||||||||||
KS | 16.672659 | 30.981598 | 67.7 | 75.8 | 75.8 | 0.151648 | 3.6 | 4.4 | 25.3 | 22.1 | ... | 22.0 | 13.7 | 13.0 | 73.1 | 5.0 | 0.4 | 2.9 | 0.1 | 0.5 | 4.9 |
TX | 14.740353 | 27.125442 | 65.3 | 75.9 | 75.9 | 0.119811 | 7.2 | 6.7 | 24.2 | 20.3 | ... | 21.6 | 12.3 | 40.2 | 38.9 | 11.7 | 0.2 | 5.4 | 0.1 | 0.4 | 3.2 |
NH | 12.764840 | 26.089085 | 92.0 | 87.3 | 87.3 | 0.103540 | 1.3 | 4.1 | 27.1 | 16.8 | ... | 24.6 | 16.7 | 4.5 | 86.6 | 1.3 | 0.1 | 2.6 | 0.1 | 0.5 | 4.4 |
ID | 16.100087 | 26.260461 | 51.5 | 63.6 | 63.6 | 0.192023 | 3.5 | 4.5 | 26.0 | 23.9 | ... | 21.4 | 10.8 | 13.5 | 79.0 | 0.6 | 0.8 | 1.3 | 0.2 | 0.5 | 4.2 |
OR | 9.470543 | 22.006553 | 72.8 | 81.1 | 81.1 | 0.112689 | 3.6 | 4.8 | 22.7 | 23.7 | ... | 22.2 | 14.1 | 14.4 | 71.6 | 1.8 | 0.7 | 4.5 | 0.4 | 0.6 | 6.0 |
IN | 17.009772 | 29.079559 | 56.8 | 64.1 | 64.1 | 0.131465 | 3.6 | 6.2 | 32.1 | 19.2 | ... | 19.0 | 10.6 | 7.8 | 76.0 | 9.2 | 0.1 | 2.5 | 0.0 | 0.5 | 3.9 |
TN | 18.999668 | 34.243922 | 57.7 | 64.2 | 64.2 | 0.138388 | 3.5 | 6.1 | 31.3 | 20.1 | ... | 19.7 | 11.4 | 6.3 | 71.9 | 15.5 | 0.1 | 1.9 | 0.1 | 0.4 | 4.0 |
HI | 6.206030 | 25.684976 | 83.5 | 91.0 | 91.0 | 0.059500 | 3.2 | 3.9 | 26.8 | 19.6 | ... | 22.1 | 13.3 | 11.1 | 20.7 | 1.6 | 0.1 | 34.6 | 9.3 | 0.4 | 22.1 |
PA | 14.208571 | 26.122178 | 84.0 | 89.9 | 89.9 | 0.118857 | 2.8 | 5.0 | 33.0 | 15.2 | ... | 20.8 | 14.3 | 8.6 | 73.1 | 10.1 | 0.1 | 3.8 | 0.0 | 0.5 | 3.8 |
NE | 16.386668 | 27.783252 | 65.2 | 73.0 | 73.0 | 0.086788 | 3.3 | 3.9 | 25.1 | 21.6 | ... | 22.6 | 12.1 | 12.3 | 75.8 | 4.5 | 0.5 | 2.5 | 0.1 | 0.4 | 3.9 |
AK | 20.970361 | 40.993180 | 64.1 | 72.7 | 72.7 | 0.217386 | 2.2 | 4.4 | 29.1 | 23.7 | ... | 19.2 | 11.3 | 7.7 | 57.4 | 2.8 | 12.7 | 6.1 | 2.0 | 0.5 | 10.7 |
MT | 17.268742 | 28.512371 | 60.9 | 68.0 | 68.0 | 0.251386 | 2.2 | 3.8 | 27.7 | 22.3 | ... | 22.9 | 11.7 | 4.4 | 83.5 | 0.3 | 5.2 | 0.7 | 0.1 | 0.8 | 4.9 |
WA | 10.194950 | 24.054566 | 74.1 | 84.7 | 84.7 | 0.077913 | 3.5 | 4.3 | 21.5 | 21.2 | ... | 23.8 | 15.7 | 14.0 | 63.5 | 3.8 | 0.9 | 9.7 | 0.7 | 0.7 | 6.7 |
MA | 14.054206 | 30.424676 | 88.1 | 95.0 | 95.0 | 0.046279 | 4.4 | 4.3 | 23.0 | 14.3 | ... | 25.3 | 21.3 | 13.0 | 67.0 | 6.6 | 0.1 | 7.2 | 0.0 | 1.2 | 4.9 |
CO | 15.013752 | 29.490562 | 73.0 | 83.3 | 83.3 | 0.078220 | 2.8 | 4.2 | 20.2 | 18.7 | ... | 28.8 | 17.1 | 22.5 | 65.0 | 3.8 | 0.4 | 3.1 | 0.1 | 0.5 | 4.6 |
CT | 12.335651 | 25.812771 | 86.2 | 95.0 | 95.0 | 0.056987 | 3.9 | 4.5 | 25.9 | 15.9 | ... | 23.0 | 18.9 | 18.2 | 62.0 | 9.8 | 0.1 | 4.8 | 0.0 | 0.8 | 4.4 |
RI | 18.625004 | 40.208405 | 85.8 | 95.0 | 95.0 | 0.066420 | 4.8 | 4.7 | 24.7 | 18.1 | ... | 23.9 | 15.7 | 17.6 | 68.2 | 4.7 | 0.1 | 3.4 | 0.0 | 0.9 | 5.2 |
OK | 17.008782 | 30.653316 | 64.4 | 74.2 | 74.2 | 0.194045 | 3.5 | 6.9 | 30.6 | 22.1 | ... | 18.6 | 9.9 | 12.1 | 62.6 | 6.7 | 6.8 | 2.3 | 0.1 | 0.3 | 9.1 |
NC | 14.723114 | 30.845372 | 73.1 | 91.3 | 91.3 | 0.135913 | 3.7 | 6.1 | 24.9 | 19.3 | ... | 22.8 | 13.2 | 10.4 | 60.7 | 20.1 | 0.9 | 3.2 | 0.1 | 0.5 | 4.1 |
IL | 15.134757 | 31.176387 | 70.9 | 78.8 | 78.8 | 0.096085 | 4.6 | 5.0 | 25.2 | 19.1 | ... | 22.5 | 15.2 | 18.3 | 58.5 | 13.2 | 0.1 | 5.9 | 0.0 | 0.4 | 3.6 |
GA | 15.412124 | 27.153539 | 60.0 | 68.0 | 68.0 | 0.161815 | 4.0 | 6.5 | 27.0 | 19.1 | ... | 20.7 | 14.0 | 10.4 | 49.6 | 30.7 | 0.1 | 4.4 | 0.1 | 0.5 | 4.2 |
21 rows × 21 columns
We can evaluate our model using several common metrics.
from sklearn.metrics import mean_absolute_error,mean_squared_error
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
# squared=True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=y_pred) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.445853139227499 MSE: 1.1235586671662616 RMSE: 1.0599805032010077
plt.scatter(y_test,y_pred,c='r')
plt.xlabel('Actual percent of state with Covid-19')
plt.ylabel('Predicted percent of state with Covid-19')
plt.title('Actual vs Predicted Covid-19 cases on 5/10/202 (40 percent of states)')
for i in range(y_test.shape[0]):
plt.text(y_test[i], y_pred[i], str(y.index[i]))
This is a pretty good model, but we can confirm our results with 5 fold cross validation.
lr = LinearRegression()
y_tot_pred = cross_val_predict(lr, X, y, cv=5)
mae = mean_absolute_error(y_true=y,y_pred=y_tot_pred)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y,y_pred=y_tot_pred) #default=True
rmse = mean_squared_error(y_true=y,y_pred=y_tot_pred,squared=False)
print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
plt.scatter(y,y_tot_pred,c='r')
plt.xlabel('Actual percent of state with Covid-19')
plt.ylabel('Predicted percent of state with Covid-19')
plt.title('Actual vs Predicted Covid-19 cases on 5/10/2023 (5-fold cross validation)')
for i in range(y.shape[0]):
plt.text(y[i], y_tot_pred[i], str(y.index[i]))
MAE: 0.23393785092251881 MSE: 0.18849029765015374 RMSE: 0.43415469322599026
We were able to get a pretty good model based on our performance metrics and direct comparison between predicted and actual values. To gather some insight into the actual model, we can extract the coefficients from the linear model to see which variables were deemed the most, and least, important to predicting cases.
coefficients = pd.DataFrame(zip(X.columns, model.coef_))
coefficients.columns = ['variable', 'coefficient']
coefficients
variable | coefficient | |
---|---|---|
0 | 2021-12-13 00:00:00 | -0.030062 |
1 | 2022-12-14 00:00:00 | 1.053670 |
2 | Administered_Dose1_Pop_Pct on 12/13/2021 | 0.001830 |
3 | Administered_Dose1_Pop_Pct on 12/14/2022 | -0.001069 |
4 | Administered_Dose1_Pop_Pct on 05/10/2023 | -0.001069 |
5 | Estimated hesitant | 0.246702 |
6 | Less than 9th grade | 0.654279 |
7 | 9th to 12th grade, no diploma | 0.696613 |
8 | High school graduate (includes equival... | 0.596443 |
9 | Some college, no degree | 0.567993 |
10 | Associate's degree | 0.481504 |
11 | Bachelor's degree | 0.691563 |
12 | Graduate or professional degree | 0.530945 |
13 | Hispanic or Latino (of any race) | 0.537597 |
14 | White alone | 0.552245 |
15 | Black or African American alone | 0.561912 |
16 | American Indian and Alaska Native ... | 0.600524 |
17 | Asian alone | 0.529313 |
18 | Native Hawaiian and Other Pacific ... | 0.267250 |
19 | Some Other Race alone | 1.005719 |
20 | Two or More Races | 0.477663 |
In conclusion, our model was very successful in predicting covid cases on 5/10/23 based on vaccination rates, estimated hesitancy, education levels, demographic information, and covid cases on 2 dates in past years. Our cross-validation RMSE was 0.43, which is very low and indicates low error. One potential flaw with our model would be that it did use covid cases from past days, as well as vaccine rates from past days and the test day. This means we would not be able to use it at the start of a new pandemic to predict cases if we did not have that data.
Looking at our coefficients, covid cases 6 months before had the highest coefficient, which is not surprising. The next highest coefficients were all of the education variables, but the coefficient for each level of education was about the same (between 0.48 and 0.7). This means that education levels did not actually have a large effect because they were all weighted the same. Vaccine rates had the smallest coefficients, which is surprising because on the correlation table there was a negative correlation between vaccines rates and cases. This is also surprising because estimated vaccine hesitancy did have a fairly significant coefficient (0.25). Estimated vaccine hesitancy did have a stronger correlation with cases than the vaccine rates did, which is interesting that it is more impactful than vaccines themselves. In future projects, more parameter tuning would be needed in order to better figure out how each variable is impacting the model.
Looking back at our initial questions asked, we can partially answer these looking at our correlation table. Estimated hesitancy had a correlation of between 0.3-0.6 with covid cases depending on which day we look at. These correlations are significant enough so we can say that states with more vaccine hesitancy had higher rates of Covid-19. Vaccine rates had a correlation of -0.3 with cases on the last day of data, which is not super strong.
In the future, some things that we would want to consider would be data on masks, as well as deaths. Cases are an important thing to measure and predict, but predicting covid deaths may be more useful, especially if we wanted to predict the impact of future pandemics. Data on mask wearing and mask mandates could also be interesting to look at because they were such a large controversy throughout the whole pandemic. We were able to predict cases fairly accurately with no data on masks which may suggest that they did not have a huge impact, however it is more likely that mask wearing is correlated with vaccine rates/vaccine hesitancy and it was therefore factored in through that.