The following analysis uses data published by the Government of Ontario.
At some point during the pandemic I started feeling that news outlets were not reporting on the things I cared about. I care about numbers and actual data, not a news outlet's interpretation of them. Even worse is editorialized content that puts a spin on the data to push an agenda. I don't care about any of that; I just want to know what is going on.
The best way to do this is to download the data yourself and analyse it. Even if you don't know programming you could easily import this data into Excel and do something similar.
Since I am a Python hobbyist, this feels like a great use case for pandas, with Matplotlib and Seaborn for the visualizations.
I did my best to interpret the data in an unbiased way. However, it's easy to make mistakes, so if you see something that doesn't make sense or that you don't agree with, please drop me an email. I would like to hear from you.
You can reach out to me at [email protected]
Feedback is always welcome.
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
# Dataset #1 - Covid Cases in Ontario
df = pd.read_csv('../data/conposcovidloc.csv', index_col="Row_ID")
# The conposcovidloc.csv file is over 100 MB.
# If you prefer to download it directly from the source, use this instead:
# df = pd.read_csv('https://data.ontario.ca/dataset/f4112442-bdc8-45d2-be3c-12efae72fb27/resource/455fd63b-603d-4608-8216-7d8647f43350/download/conposcovidloc.csv', index_col="Row_ID")
# schema_df source: https://data.ontario.ca/dataset/f4112442-bdc8-45d2-be3c-12efae72fb27/resource/a2ea0536-1eae-4a17-aa04-e5a1ab89ca9a/download/conposcovidloc_data_dictionary.xlsx
# converted from xlsx to csv and available on linuxnorth.org
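# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0.
schema_df = pd.read_csv('https://www.linuxnorth.org/pandas/data/conposcovidloc_data_dictionary.csv', index_col="Variable Name", encoding="ISO-8859-1", on_bad_lines='skip')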
# Dataset #2 - Covid Effective Reproduction Number (Re) in Ontario
dfre = pd.read_csv('https://data.ontario.ca/dataset/8da73272-8078-4cbd-ae35-1b5c60c57796/resource/1ffdf824-2712-4f64-b7fc-f8b2509f9204/download/re_estimates_on.csv')
# Dataset #3 - Vaccine data for Ontario
dfvaccine = pd.read_csv('https://data.ontario.ca/dataset/752ce2b7-c15a-4965-a3dc-397bf405e7cc/resource/8a89caa9-511c-4568-af89-7f2174b4378c/download/vaccine_doses.csv')
# Dataset #4 - Vaccine Status
dfvacstatus = pd.read_csv('https://data.ontario.ca/dataset/752ce2b7-c15a-4965-a3dc-397bf405e7cc/resource/eed63cf2-83dd-4598-b337-b288c0a89a16/download/vac_status.csv.csv')
# taking a peek
df.head(10)
# Dataframe size (rows, columns)
df.shape
# Looking at the schema provided
schema_df = schema_df[['Definition', 'Additional Notes']].sort_index()
schema_df
# How many missing values in each column
df.isna().sum()
# Looking only at columns of interest
columns_of_interest = ['Accurate_Episode_Date', 'Case_Reported_Date', 'Age_Group', 'Client_Gender', 'Case_AcquisitionInfo',
                       'Outcome1', 'Outbreak_Related', 'Reporting_PHU_ID', 'Reporting_PHU']
# .copy() avoids a SettingWithCopyWarning when columns are reassigned further down.
df = df[columns_of_interest].copy()
df.columns = ['adate', 'rdate', 'age', 'gender', 'source', 'outcome', 'outbreak', 'phuid', 'phu']
df.dtypes
# Dates are stored as strings. Change them to pandas datetime
df['rdate'] = pd.to_datetime(df['rdate'])
df['adate'] = pd.to_datetime(df['adate'])
df.dtypes
# Take another peek....that's better
df.tail()
# Total number of covid cases reported in Ontario all time.
len(df)
# Change '<20' to '0-19'. This will make age distribution charts easier to read later.
df['age'] = df['age'].replace(['<20'],'0-19')
df.head(2)
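Since the case file is over 100 MB, one option is to select the columns and parse the dates while reading, instead of afterwards. The sketch below assumes the same column names as the real file and uses a tiny in-memory CSV with made-up rows in place of the download; `sample_csv`, `wanted` and `sample_df` are names invented for this illustration.

```python
import io

import pandas as pd

# Made-up rows standing in for conposcovidloc.csv (not real data).
sample_csv = io.StringIO(
    "Row_ID,Accurate_Episode_Date,Case_Reported_Date,Age_Group,Outcome1,Extra\n"
    "1,2020-03-01,2020-03-03,<20,Resolved,x\n"
    "2,2020-03-02,2020-03-04,80s,Fatal,y\n"
)

wanted = ["Row_ID", "Accurate_Episode_Date", "Case_Reported_Date", "Age_Group", "Outcome1"]

sample_df = pd.read_csv(
    sample_csv,
    index_col="Row_ID",
    usecols=wanted,  # columns not listed here are never loaded
    parse_dates=["Accurate_Episode_Date", "Case_Reported_Date"],  # datetime on load
)

# Same relabelling as above, applied to the sample.
sample_df["Age_Group"] = sample_df["Age_Group"].replace("<20", "0-19")
print(sample_df.dtypes)
```

Whether this is noticeably faster on the real file depends on how many columns are dropped, so treat it as something to try rather than a guaranteed win.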
We can see three distinct waves of Covid spread in Ontario: a smaller initial wave in March/April 2020 that devastated the elderly, then two distinct larger waves in January and May 2021, which were spread mostly by younger people.
plt.figure(figsize=(14,6))
plt.title('Ontario Covid Waves - Daily Cases', fontsize=20)
sns.lineplot(data=df['rdate'].value_counts())
plt.ylabel('Cases', fontsize=15)
plt.xlabel('Date', fontsize=15)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
plt.show()
Covid infects all genders proportionally.
print(df['gender'].value_counts())
gender_filter = df["gender"].isin(['MALE', 'FEMALE', 'UNSPECIFIED', 'GENDER DIVERSE'])
gdf = df[gender_filter]
plt.figure(figsize=(10,6))
plt.title("Ontario - Covid Infections by Gender", fontsize=20)
# Plot from the filtered frame so the data and the x column come from the same source.
sns.countplot(data=gdf, x="gender")
plt.xlabel('Gender', fontsize=13)
plt.ylabel('Count', fontsize=13)
plt.show()
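Raw counts make it hard to judge how even the split is, so it can help to look at shares as well. A minimal sketch with made-up values (the `gender` Series below is illustrative, not the Ontario data):

```python
import pandas as pd

# Made-up values standing in for df['gender'] (not real data).
gender = pd.Series(
    ["FEMALE", "MALE", "FEMALE", "MALE", "UNSPECIFIED", "FEMALE", "MALE", "MALE"]
)

# normalize=True returns fractions instead of counts; convert to percentages.
share = gender.value_counts(normalize=True).mul(100).round(1)
print(share)
```

A stricter comparison would also need each group's share of the overall population, which is not part of this dataset.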
My hometown is Timmins and I am originally from Sudbury. Let's compare Covid cases in the two communities. Timmins is represented by the Porcupine Health Unit area.
The Porcupine Health Unit area had an explosion of cases in May, especially in the James Bay area.
You can compare multiple areas easily.
df_tim = df[df.phu == "Porcupine Health Unit"]
df_sud = df[df.phu == "Sudbury & District Health Unit"]
df_wat = df[df.phu == "Region of Waterloo, Public Health"]
plt.figure(figsize=(14,6))
plt.title('Cases in Porcupine and Sudbury Health Unit Areas', fontsize=20)
plt.xlabel("")
plt.ylabel("Daily Cases", fontsize=15)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
sns.lineplot(data=df_tim['rdate'].value_counts(), label="Porcupine Health Unit")
sns.lineplot(data=df_sud['rdate'].value_counts(), label="Sudbury & District Health Unit")
#sns.lineplot(data=df_wat['rdate'].value_counts(), label="Region of Waterloo, Public Health")
plt.show()
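When comparing more than two or three areas, one filtered DataFrame per health unit gets repetitive. A possible alternative, sketched here on made-up rows, is to count cases per date and health unit in one step and let each unit become a column; `cases`, `units` and `daily_by_phu` are names invented for this example.

```python
import pandas as pd

# Made-up rows standing in for df[['rdate', 'phu']] (not real data).
cases = pd.DataFrame({
    "rdate": pd.to_datetime(
        ["2021-05-01", "2021-05-01", "2021-05-01", "2021-05-02", "2021-05-02"]
    ),
    "phu": [
        "Porcupine Health Unit",
        "Porcupine Health Unit",
        "Sudbury & District Health Unit",
        "Porcupine Health Unit",
        "Region of Waterloo, Public Health",
    ],
})

units = [
    "Porcupine Health Unit",
    "Sudbury & District Health Unit",
    "Region of Waterloo, Public Health",
]

# One row per date, one column per health unit, zero where nothing was reported.
daily_by_phu = (
    cases[cases["phu"].isin(units)]
    .groupby(["rdate", "phu"])
    .size()
    .unstack(fill_value=0)
)
print(daily_by_phu)
```

Calling `daily_by_phu.plot(figsize=(14, 6))` on the result would then draw one line per health unit.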
We can see that young people have been hit especially hard by Covid.
plt.figure(figsize=(10,6))
plt.title("Ontario - Infections by Age Category", fontsize=18)
sns.countplot(data=df, x='age', order=['0-19', '20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s'])
# Age groups are on the x axis and counts on the y axis.
plt.xlabel('Age Group', fontsize=15)
plt.ylabel('Infections', fontsize=15)
plt.show()
Note: These dates are approximate, estimated by eye from the Ontario daily cases graph above.
wave1 = (df['rdate'] > '2020-03-01') & (df['rdate'] < '2020-05-30')
wave2 = (df['rdate'] > '2020-10-01') & (df['rdate'] < '2021-02-28')
wave3 = (df['rdate'] > '2021-04-01') & (df['rdate'] < '2021-05-21')
wave4 = (df['rdate'] > '2021-07-26')
dfwave1 = df[wave1].sort_values(by='age')
dfwave2 = df[wave2].sort_values(by='age')
dfwave3 = df[wave3].sort_values(by='age')
dfwave4 = df[wave4].sort_values(by='age')
We can see a clear trend of the age distribution moving towards younger generations with each wave. There is a lot of speculation, and people are quick to criticize younger Canadians for not following Covid guidelines such as social distancing and not gathering in groups. I don't think that is entirely fair, as Ontario has been prioritizing older Ontarians during the vaccine rollout.
Also, more virulent variants have taken hold, and many young Canadians work in the service sector and may not have the luxury of working from home. They have no choice but to get out there.
Finally, as we see in the last wave, part of the "under 20" group, children under 12, has not had access to vaccination. The under-30 group now accounts for almost three quarters of new cases.
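A claim like "almost three quarters" is easier to check from a table of shares than from bar charts. The sketch below labels each case with a wave, using the same approximate date boundaries as above, and then computes the percentage of cases per age group within each wave. The rows are made up, and `cases`, `conditions` and `age_share` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for df[['rdate', 'age']] (not real data).
cases = pd.DataFrame({
    "rdate": pd.to_datetime([
        "2020-04-01", "2020-04-15", "2020-12-01", "2020-12-20",
        "2021-04-10", "2021-04-20", "2021-08-01", "2021-08-05",
    ]),
    "age": ["80s", "70s", "40s", "80s", "20s", "30s", "0-19", "20s"],
})

r = cases["rdate"]
conditions = [
    (r > "2020-03-01") & (r < "2020-05-30"),
    (r > "2020-10-01") & (r < "2021-02-28"),
    (r > "2021-04-01") & (r < "2021-05-21"),
    r > "2021-07-26",
]
# Rows that fall outside every window get the default label.
cases["wave"] = np.select(
    conditions, ["wave 1", "wave 2", "wave 3", "wave 4"], default="between waves"
)

# normalize="index" makes each row (wave) sum to 100%.
age_share = (
    pd.crosstab(cases["wave"], cases["age"], normalize="index").mul(100).round(1)
)
print(age_share)

under_30_wave4 = age_share.loc["wave 4", ["0-19", "20s"]].sum()
print("Under-30 share in wave 4:", under_30_wave4, "%")
```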
# wave 1 graph
plt.figure(figsize=(10,6))
plt.title("Wave 1 - Ontario - Infections by Age Category", fontsize=18)
sns.countplot(data=dfwave1, x='age')
plt.xlabel('Age Group', fontsize=15)
plt.ylabel('Wave 1 Infections', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# wave 2 graph
plt.figure(figsize=(10,6))
plt.title("Wave 2 - Ontario - Infections by Age Category", fontsize=18)
sns.countplot(data=dfwave2, x='age')
plt.xlabel('Age Group', fontsize=15)
plt.ylabel('Wave 2 Infections', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# wave 3 graph
plt.figure(figsize=(10,6))
plt.title("Wave 3 - Ontario - Infections by Age Category", fontsize=18)
sns.countplot(data=dfwave3, x='age')
plt.xlabel('Age Group', fontsize=15)
plt.ylabel('Wave 3 Infections', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# wave 4 graph
plt.figure(figsize=(10,6))
plt.title("Wave 4 - Ontario - Infections by Age Category since July 26", fontsize=18)
sns.countplot(data=dfwave4, x='age')
plt.xlabel('Age Group', fontsize=15)
plt.ylabel('Wave 4 Infections', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
#df.age.value_counts().sort_index()
print('Missing Information and Unspecified EPI Link have been omitted')
plt.figure(figsize=(14,6))
plt.title("Ontario - Source of Infection by Age Category", fontsize=18)
sns.countplot(data=df, x='age', hue='source', hue_order=('CC', 'NO KNOWN EPI LINK', 'OB', 'TRAVEL'),
              order=['0-19', '20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s'])
# Legend labels must follow hue_order: CC, NO KNOWN EPI LINK, OB, TRAVEL.
plt.legend(title='Source of Infection', loc=7,
           labels=('Contact of a Case', 'No Known Link', 'Outbreak', 'Travel'))
plt.xlabel('Age Group', fontsize=15)
plt.ylabel('Infections', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
The risk of death from Covid rises steeply with age. Despite most infections occurring in younger Ontarians, the elderly have suffered the most deaths.
dfdeath = df[df.outcome == 'Fatal'].age.value_counts().sort_index()
print(dfdeath)
plt.figure(figsize=(10,6))
plt.title('Deaths by Age Group', fontsize=20)
plt.ylabel('Number of Deaths', fontsize=15)
plt.xlabel('')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
dfdeath.plot(kind='bar')
#sns.countplot(data=df, x='age', hue='outcome', hue_order=['Fatal'], order=df.age.value_counts().index)
plt.show()
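Absolute deaths mix two things: how risky an infection is and how many infections there were. Dividing deaths by reported cases within each age group separates the two. The sketch below does this on made-up rows; `cases` and `fatal_share` are illustrative names, and the result is only a crude percentage among reported cases, not a true infection fatality rate.

```python
import pandas as pd

# Made-up rows standing in for df[['age', 'outcome']] (not real data).
cases = pd.DataFrame({
    "age": ["20s", "20s", "20s", "20s", "80s", "80s", "80s", "80s"],
    "outcome": ["Resolved", "Resolved", "Resolved", "Resolved",
                "Fatal", "Fatal", "Resolved", "Resolved"],
})

# The mean of a True/False column is the fraction of True values,
# so this is the share of reported cases in each age group that were fatal.
fatal_share = (
    cases["outcome"].eq("Fatal")
    .groupby(cases["age"])
    .mean()
    .mul(100)
    .round(1)
)
print(fatal_share)
```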
Ontario has recorded well over 9,000 deaths since the beginning of the Covid pandemic, and the vast majority have been in individuals over 70 years of age. Despite the increasing number of cases through the second and third waves, deaths dropped dramatically as infections moved to younger individuals, who are less likely to die as a result of infection.
Vaccination is also contributing to decreased rates of death.
df_fatal = df[df.outcome == 'Fatal'].sort_values(by='rdate')
print('There have been', len(df_fatal), 'Deaths Total.')
plt.figure(figsize=(14,6))
plt.title('Deaths Since Beginning of Covid Pandemic', fontsize=20)
plt.ylabel('Deaths', fontsize=15)
plt.xlabel('Date', fontsize=15)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
# value_counts() orders by count, so sort by date before drawing the line.
df_fatal['rdate'].value_counts().sort_index().plot()
plt.show()
df['outcome'].unique()
df['outcome'].value_counts()
No surprise that large urban centres had the highest number of infections.
plt.figure(figsize=(10,6))
plt.title("Infections by Top 10 PHU Area", fontsize=20)
sns.countplot(data=df, y='phu', order=df.phu.value_counts().iloc[:10].index)
plt.ylabel('Area', fontsize=15)
plt.xlabel('Count', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
df.phu.value_counts().iloc[:10].index
PHU IDs for reference: Toronto 3895, Peel 2253, York 2270, Durham 2230, Hamilton 2237, Waterloo 2265, Halton 2236, Porcupine 2256, Wellington-Dufferin-Guelph 2266, Simcoe-Muskoka 2260, and Grey Bruce 2233.
hotspots = df['phuid'].isin([3895, 2253, 2270, 2230, 2237, 2265, 2236, 2256, 2266, 2260])
dfhot = df.loc[hotspots]
dfhot.tail()
# Cases reported after July 1, 2021; sort by date so the line is drawn in order.
julyhot = dfhot['rdate'] > "2021-07-01"
dfhot.loc[julyhot]['rdate'].value_counts().sort_index().plot()
df4 = dfwave4  # dfwave4 is already restricted to the wave 4 date range
df_tim4 = df4[df4.phu == "Porcupine Health Unit"]
df_sud4 = df4[df4.phu == "Sudbury & District Health Unit"]
plt.figure(figsize=(14,6))
plt.title('4th wave Cases in Porcupine and Sudbury Health Unit Areas', fontsize=20)
plt.xlabel("")
plt.ylabel("Daily Cases", fontsize=15)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
sns.lineplot(data=df_tim4['rdate'].value_counts(), label="Porcupine Health Unit")
sns.lineplot(data=df_sud4['rdate'].value_counts(), label="Sudbury & District Health Unit")
plt.show()
The effective reproduction number (Re) is an estimate of the average number of people one person with COVID-19 will infect.
Source: https://data.ontario.ca/dataset/effective-reproduction-number-re-for-covid-19-in-ontario
Note: An Re above one means Covid case numbers are rising. An Re below one means they are shrinking.
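To see why one is the tipping point, here is a deliberately simplified worked example. It assumes Re stays constant and that every case passes the infection on in neat "generations", which real outbreaks do not do; `projected_cases` is a hypothetical helper written for this illustration.

```python
def projected_cases(initial_cases, re, generations):
    """Cases in each generation if every case infects `re` others on average."""
    return [initial_cases * re ** n for n in range(generations + 1)]


growing = projected_cases(100, 1.2, 4)
shrinking = projected_cases(100, 0.8, 4)

print([round(c) for c in growing])    # [100, 120, 144, 173, 207]
print([round(c) for c in shrinking])  # [100, 80, 64, 51, 41]
```

With Re at 1.2 the count roughly doubles in four generations, while at 0.8 it falls by more than half.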
# Make date_start and date_end Pandas datetime objects instead of strings.
dfre['date_start'] = pd.to_datetime(dfre['date_start'])
dfre['date_end'] = pd.to_datetime(dfre['date_end'])
dfre.dtypes
# Constant reference line at Re = 1.
dfre['Re_baseline'] = 1
The Re number is provided as a rolling average of the past 7 days in Ontario's data.
dfre.set_index('date_end', inplace=True)
dfre.tail()
The Re rate can be a powerful predictor of where we are headed in terms of an increasing or decreasing number of cases. Vaccination of Ontarians started in February and has really picked up steam in April, May and June. The Re rate seems to reflect this and has been on a continuous decline since April. However it may still be too early to tell for sure with the Delta variant taking hold.
We see a similar trend from January to the end of February, before the third wave hit. Vaccination was not yet a factor at that time.
It will be interesting to follow the Re rate over the coming months given high vaccination rates but also increased spread of the Delta variant (and future unknown variants). If vaccination manages to contain Re then we can get ahead of Covid and return to a more normal way of life. The wildcard in this will be variants. While vaccination appears to be working with current strains, new variants could take hold and push Re back up again, resulting in more waves.
Looking at the graph below and the upward trend, I predict that the decrease in cases will stop in August and numbers will climb by September (assuming no changes).
Wildcards: the Delta variant, success in getting first doses, and opening immunization to children under 12. All of these can impact Re.
# Re Graph
plt.figure(figsize=(14, 6))
plt.title("Ontario Covid Reproduction Rate (Re) vs Cases", fontsize=20)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
sns.lineplot(data=dfre[['Re', 'Re_baseline']])
plt.xlabel("")
plt.ylabel("Re Number", fontsize=15)
# Ontario Covid Case graph for comparison.
# Let's line up the dates with the Re dataset first.
df = df[df['rdate'] > '2020-03-19']
plt.figure(figsize=(14,6))
#plt.title('Ontario Covid Waves - Daily Cases', fontsize=20)
sns.lineplot(data=df['rdate'].value_counts())
plt.ylabel('Cases', fontsize=15)
plt.xlabel('')
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
plt.show()
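As a rough cross-check on the direction Re is pointing, one can compare each week's case count with the previous week's. This is not how Ontario computes Re; it is only a crude ratio where values above one mean cases grew week over week. The dates below are made up, and `rdate`, `weekly` and `ratio` are illustrative names.

```python
import pandas as pd

# Made-up report dates standing in for df['rdate'] (not real data).
rdate = pd.Series(pd.to_datetime(
    ["2021-07-05"] * 10 + ["2021-07-12"] * 12 + ["2021-07-19"] * 9
))

# Daily counts in date order, then summed into calendar weeks ending on Sunday.
weekly = rdate.value_counts().sort_index().resample("W").sum()

# This week's cases divided by last week's; the first week has nothing to compare to.
ratio = weekly / weekly.shift(1)
print(ratio)
```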
The following looks at vaccination rates in Ontario. We can see that Ontarians overall are being vaccinated in large numbers. As of June 27, 2021 we have not yet seen a plateau although rates are expected to slow down.
# Dataset #3 - Vaccine data for Ontario
dfvaccine = pd.read_csv('https://data.ontario.ca/dataset/752ce2b7-c15a-4965-a3dc-397bf405e7cc/resource/8a89caa9-511c-4568-af89-7f2174b4378c/download/vaccine_doses.csv')
#dfvaccine.tail()
# Create a 7 day rolling average column of daily vaccinations.
# The original selected column 1 by position; it is assumed here to be
# previous_day_total_doses_administered, and selecting by name is less fragile.
dfvaccine['7day'] = dfvaccine['previous_day_total_doses_administered'].rolling(window=7).mean()
dfvaccine[['report_date', 'previous_day_at_least_one', 'previous_day_fully_vaccinated',
           'previous_day_total_doses_administered', '7day']].set_index('report_date').tail(10)
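One detail of `rolling(window=7)` worth knowing: the first six rows of the new column are empty, because a full seven-day window is not yet available. The sketch below shows this on made-up numbers and also the `min_periods=1` variant, which averages whatever days exist so far; `doses`, `strict` and `lenient` are illustrative names.

```python
import pandas as pd

# Made-up daily dose counts (not real data).
doses = pd.Series(
    [70, 140, 210, 280, 350, 420, 490, 560],
    index=pd.date_range("2021-06-01", periods=8, freq="D"),
)

strict = doses.rolling(window=7).mean()                  # NaN until 7 days exist
lenient = doses.rolling(window=7, min_periods=1).mean()  # averages the days so far

print(pd.DataFrame({"doses": doses, "strict": strict, "lenient": lenient}))
```

Which variant is preferable depends on whether a partial average at the start of the series would be misleading on the chart.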
# Make report_date a pandas datetime object instead of a string.
dfvaccine['report_date'] = pd.to_datetime(dfvaccine['report_date'])
#dfvaccine.dtypes
plt.figure(figsize=(14,6))
plt.title('Daily Vaccine Doses - Ontario', fontsize=20)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
sns.lineplot(data=dfvaccine, x='report_date', y='7day', label='7 Day Rolling Average')
sns.lineplot(data=dfvaccine, x='report_date', y='previous_day_total_doses_administered', label='Daily Dose Count')
plt.xlabel('Date',fontsize=15)
plt.ylabel('Number Vaccinated',fontsize=15)
plt.show()
plt.figure(figsize=(14,6))
plt.title('First and Second Dose Daily Counts - Ontario', fontsize=20)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
sns.lineplot(data=dfvaccine, x='report_date', y='previous_day_at_least_one', label='First Dose')
sns.lineplot(data=dfvaccine, x='report_date', y='previous_day_fully_vaccinated', label='Second Dose')
plt.xlabel('')
plt.ylabel('Number Vaccinated',fontsize=15)
plt.show()
total_doses = dfvaccine['previous_day_total_doses_administered'].sum()
total_fully_vaccinated = dfvaccine['total_individuals_fully_vaccinated'].max()
total_first_doses = total_doses - total_fully_vaccinated
population = 14734014 # See sources (2)
eligible_pop = population - 1961438 # Estimated population under 12, see sources (3)
vaccine_rate = (total_first_doses / eligible_pop) * 100
vaccine_rate_tot = (total_first_doses /population) * 100
full_vaccine_rate = (total_fully_vaccinated / eligible_pop) * 100
full_vaccine_rate_tot = (total_fully_vaccinated / population) * 100
total_unvaccinated = int(eligible_pop - dfvaccine['total_individuals_at_least_one'].max())
unvaccinated_percentage = round((total_unvaccinated / eligible_pop) * 100,1)
print('Fact Sheet')
print("----------")
print("Data Published:", str(dfvaccine['report_date'].iloc[-1])[0:10])
print()
print('Eligible Population - 12 and over')
print('---------------------------------')
# total_first_doses is total doses minus second doses, i.e. everyone with at least one dose.
print("At Least One Dose:", round((vaccine_rate),1),"%")
print("Fully Vaccinated: ", round((full_vaccine_rate),1),"%")
print()
print('Total Population')
print('----------------')
print("At Least One Dose:", round((vaccine_rate_tot),1),"%")
print("Fully Vaccinated: ", round((full_vaccine_rate_tot),1),"%")
print()
print("Maximum Vaccinated in one day:", int(dfvaccine['previous_day_total_doses_administered'].max()) )
print("Vaccinated Yesterday:", int(dfvaccine['previous_day_total_doses_administered'].iloc[-1]) )
print()
print("Total individuals with at least one dose:", int(dfvaccine['total_individuals_at_least_one'].max()))
print("Total individuals fully vaccinated:", int(dfvaccine['total_individuals_fully_vaccinated'].max()))
print()
print("Total Percentage of Unvaccinated Individuals:", unvaccinated_percentage,"%")
print("Estimated total of eligible population foregoing vaccination:", total_unvaccinated )
(1) Vaccine data from the Ontario Open Data Portal
(2) Statistics Canada. Table 17-10-0005-01 Population estimates on July 1st, by age and sex
(3) 1,950,000 is an estimate of the population under 12, based on source (2) above. Statistics Canada only lists the population for ages 10-14 as a group. 1,961,438 represents 60% of that age group. Assumed an even distribution of ages.
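The percentages in the fact sheet all follow the same pattern: a count of people divided by a population, once for the eligible population and once for everyone. A minimal sketch of that arithmetic follows. The two population figures are the ones used above; the vaccination counts are hypothetical round numbers chosen for illustration, not published figures, and `coverage_percent` is a helper invented for this example.

```python
def coverage_percent(people, population):
    """Share of a population, as a percentage rounded to one decimal."""
    return round(people / population * 100, 1)


population = 14_734_014           # total population used above
under_12 = 1_961_438              # estimated population under 12 used above
eligible = population - under_12  # 12 and over

# Hypothetical counts for illustration only.
at_least_one = 10_000_000
fully = 8_000_000
first_dose_only = at_least_one - fully

print("Eligible, at least one dose:", coverage_percent(at_least_one, eligible), "%")
print("Eligible, fully vaccinated: ", coverage_percent(fully, eligible), "%")
print("Total, at least one dose:   ", coverage_percent(at_least_one, population), "%")
print("Eligible, not vaccinated:   ", coverage_percent(eligible - at_least_one, eligible), "%")
```

The same count always gives a higher percentage against the eligible population than against the total, which is why it matters to say which denominator a headline figure uses.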
dfvacstatus.set_index('Date', inplace=True)
# DataFrame.plot creates its own figure, so the size is passed directly instead of via plt.figure().
dfvacstatus[['covid19_cases_unvac', 'covid19_cases_partial_vac', 'covid19_cases_full_vac']].describe().plot(kind='bar', figsize=(14,6))
plt.show()
plt.figure(figsize=(16,6))
plt.title('Cases by Vaccine Status - Ontario', fontsize=20)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12, rotation=30)
plt.xlabel('')
plt.ylabel('Cases', fontsize=15)
sns.lineplot(data=dfvacstatus[['covid19_cases_unvac', 'covid19_cases_partial_vac', 'covid19_cases_full_vac']])
plt.show()
plt.figure(figsize=(16,6))
plt.title('Cases per 100k - Ontario', fontsize=20)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12, rotation=30)
plt.xlabel('')
plt.ylabel('Cases per 100k', fontsize=15)
sns.lineplot(data=dfvacstatus[['cases_unvac_rate_per100K', 'cases_partial_vac_rate_per100K',
'cases_full_vac_rate_per100K']])
plt.show()
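The per-100k columns exist because raw case counts are misleading once the groups differ a lot in size. A short worked example with hypothetical numbers (not taken from the dataset) shows the effect; `rate_per_100k` is a helper invented for this illustration, and how Ontario defines the denominators for its published rates is not covered here.

```python
def rate_per_100k(cases, population):
    """Cases per 100,000 people in the group."""
    return cases * 100_000 / population


# Hypothetical numbers for illustration only.
unvac_cases, unvac_pop = 500, 2_000_000
full_cases, full_pop = 300, 10_000_000

print("Unvaccinated:    ", rate_per_100k(unvac_cases, unvac_pop), "per 100k")  # 25.0
print("Fully vaccinated:", rate_per_100k(full_cases, full_pop), "per 100k")    # 3.0
```

In this made-up example the raw counts differ by less than a factor of two, but the rates differ by more than a factor of eight.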