Hello there, I've been gone for a long time, but you know, not like it was noticeable or anything. cries in Eigenvectors. Well, if you must know the reason for my being gone for such a long time then look no further than work, school, work, sleep, anime, sleep and just being generally lazy. Anyhow, 2020, I mean, right?

Moving on...

In [ ]:

TL;DR¶

The models show large daily confirmed cases and suggest that easing the lockdown would not be such a good idea at this point, even daily reported cases show that. Interstellar-manuever

Aim¶

With the established mathematical models in this post, I try to forecast the spread of the covid-19 virus over the next 12 days in Nigeria on a national level. It should give a sense of what is to come but do not take it wholeheartedly as a shortage of resources is a big factor in making the models. The plan initially was to make a model to predict on a state level from scraped data, but the downtimes and layout changes on NCDC's site ruined that plan making already scraped data inept for this task.

Limitation¶

In order to make reliable models for cases like this, modelers rely heavily on historic data from prior outbreaks which can help with insights and boosting metrics. There aren't any (beneficial) historic data in this case, and researchers know very little about the disease to consider as predictors for modeling. No actual telling how long the virus can live on everyday objects. No sure effect of weather conditions being a factor in the spread.

And as such the data is limited. Hence, overfitting is inevitable. Also, there's the questionable quality of the collected data due to human and machine error, especially with things being rushed.

There are also the pattern irregularities associated with pandemics, such as this and making pattern finding a little bit strenuous, and since the model is on a national level, external data that could have proven useful are not included; isolation and testing centers, population density, covid-19 aids by states.

Assumption¶

The model assumes the travel ban as announced by the Federal Government of Nigeria was enforced on the stipulated date, and equally works with the view of a subset of the public concerning the proposed lockdown and social distancing: Relaxed, Strict, and Not enforced. It also projects that said interventions will remain in place for future dates in which our testing falls.

In addition, the model (rightly) assumes that I'm not an Epidemiologist(check out SEIR models here) and that, in fact, I'm only a brooding Data Scientist.

Method¶

Two standalone nonlinear models and an ensemble after fitting reported confirmed cases and some derived features.

Let's proceed!

In [4]:

#Import packages/tools 

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

import plotly.express as px

from sklearn.preprocessing import StandardScaler

# from sklearn.base import BaseEstimator, TransformerMixin
# from sklearn.pipeline import Parallel, Pipeline, FeatureUnion

from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error as mse

from IPython.core.display import HTML

import warnings
warnings.filterwarnings('ignore')

In [5]:

sns.set(style='white', font_scale=1.2, rc={"figure.figsize": (15,10)})


colors_ = {
    'background': '#111111',
    'text': '#BEBEBE',
    'grid': '#333333',
    'red': '#BF0000'
}

tickFont = {
    'size':12,
    'color':"rgb(30,30,30)",
    'family':"Courier New, monospace"}

SEED = 9
RANDOM_STATE = 5

In [7]:

# from plotly.offline import init_notebook_mode, plot,iplot
# import cufflinks
# init_notebook_mode(connected=True)
# cufflinks.go_offline()

# import chart_studio.plotly as py

# pd.set_option("max_rows", 20)

Next line: Data containing the cumulated confirmed cases(as well as Recovery and Deaths) daily, country and the data is loaded. I had to drop columns I won't be using such as Province/State, Cumulated Deaths and Recovery and the date is parsed as datetime.

Data Source: JohnsHopkinsUni

In [10]:

# data = pd.read_csv("~/NG_covid19_ML/data/allData.csv")\
#         .drop(["Province/State"], axis=1)\
#         .astype({'date':'datetime64[ns]'})

# # data.head(3)

In [5]:

#rename Country/Region to just Country
data.rename(columns={'Country/Region':'Country'}, inplace=True)

In [6]:

data.isna().sum() #No null values. Good!

Out[6]:

Country         0
date            0
CumConfirmed    0
CumDeaths       0
CumRecovered    0
dtype: int64

In [ ]:

Let's dedicate a couple of lines to making graphs: worldwide and then Africa.

In [7]:

countries_to_plot = ["US", "Italy", "China", "Germany", "Ghana", "Morocco", "Nigeria", "South Africa", "Tunisia"]
country_group = data.groupby(['date', 'Country'], as_index=False)['CumConfirmed'].sum()


xc = country_group[country_group["Country"].isin(countries_to_plot)]


trend_fig = px.line(xc, x="date", y="CumConfirmed", color="Country").for_each_trace\
(lambda rmvequalsign: rmvequalsign.update(name=rmvequalsign.name.split("=")[-1]))
trend_fig.update_traces(
        line=dict(width=3),
        selector=dict(type="scatter", mode="lines")
        )
trend_fig.update_layout({'legend_orientation':'h', 'autosize':True, 'margin':dict(l=0.5,r=0.5,b=2,t=3)}, font=tickFont)\
        .update_xaxes(title="", tickfont=tickFont)\
        .update_yaxes(title="Cumulated Confirmed Cases")

#african countries have lower cases so it's hard sighting them on the graph\
#(thousands compared to a million)

In [8]:

countries_to_plot = ["Ghana", "Morocco", "Nigeria", "South Africa", "Tunisia", "Kenya"] #6 african countries
country_group = data.groupby(['date', 'Country'], as_index=False)['CumConfirmed'].sum()


xc = country_group[country_group["Country"].isin(countries_to_plot)]


trend_fig = px.line(xc, x="date", y="CumConfirmed", color="Country").for_each_trace\
(lambda rmvequalsign: rmvequalsign.update(name=rmvequalsign.name.split("=")[-1]))
trend_fig.update_traces(
        line=dict(width=3),
        selector=dict(type="scatter", mode="lines")
        )
trend_fig.update_layout({'legend_orientation':'h', 'autosize':True, 'margin':dict(l=0.5,r=0.5,b=2,t=3)}, font=tickFont)\
        .update_xaxes(title="", tickfont=tickFont)\
        .update_yaxes(title="Cumulated Confirmed Cases")

In [ ]:

In [9]:

#I shall be extracting data for Nigeria only
NG_data = data[data["Country"] == "Nigeria"]
NG_data.reset_index(drop=True, inplace=True)

In [10]:

#peep into the data and check last accounted date
NG_data.tail()  #19th, May 2020

Out[10]:

	Country	date	CumConfirmed	CumDeaths	CumRecovered
114	Nigeria	2020-05-15	5450	171	1320
115	Nigeria	2020-05-16	5621	176	1472
116	Nigeria	2020-05-17	5959	182	1594
117	Nigeria	2020-05-18	6175	191	1644
118	Nigeria	2020-05-19	6401	192	1734

In [ ]:

Going through the data I noticed that the 20th and 21st of April had the same value of 665, which would indicate that there was no case on the 21st but there was. One of the highest cases ever recorded: 117. I checked the NCDC's site to confirm and made the required changes to that index. Also, the first case in Nigeria wasn't till the 28th of February so I had to tailor the data starting from the 27th of February, just to get rid of that long list of zeros.

Let's investigate together!

In [11]:

NG_data[NG_data["CumConfirmed"] > 0].head(12) #easily identify first reported case - 28th. 
#No further case until the 9th of March. Take note of the index (37, so tailor from 36)

Out[11]:

	Country	date	CumConfirmed
37	Nigeria	2020-02-28	1
38	Nigeria	2020-02-29	1
39	Nigeria	2020-03-01	1
40	Nigeria	2020-03-02	1
41	Nigeria	2020-03-03	1
42	Nigeria	2020-03-04	1
43	Nigeria	2020-03-05	1
44	Nigeria	2020-03-06	1
45	Nigeria	2020-03-07	1
46	Nigeria	2020-03-08	1
47	Nigeria	2020-03-09	2
48	Nigeria	2020-03-10	2

In [12]:

NG_data = NG_data.iloc[36:, :] #Tailored!
NG_data.reset_index(drop=True, inplace=True)
NG_data.head()

Out[12]:

	Country	date	CumConfirmed
0	Nigeria	2020-02-27	0
1	Nigeria	2020-02-28	1
2	Nigeria	2020-02-29	1
3	Nigeria	2020-03-01	1
4	Nigeria	2020-03-02	1

In [13]:

NG_data.iloc[50:60] #note index 53 and 54

Out[13]:

	Country	date	CumConfirmed	CumDeaths	CumRecovered
50	Nigeria	2020-04-17	493	17	159
51	Nigeria	2020-04-18	542	19	166
52	Nigeria	2020-04-19	627	21	170
53	Nigeria	2020-04-20	665	22	188
54	Nigeria	2020-04-21	665	22	188
55	Nigeria	2020-04-22	873	28	197
56	Nigeria	2020-04-23	981	31	197
57	Nigeria	2020-04-24	1095	32	208
58	Nigeria	2020-04-25	1182	35	222
59	Nigeria	2020-04-26	1273	40	239

In [ ]:

In [14]:

fig = px.line(NG_data, x='date', y='CumDeaths', title="Cumulated Deaths by date")
fig.update_layout({
    'paper_bgcolor': '#f2f2f2'
})
fig.update_traces(line_color="red")

In [ ]:

In [15]:

fig = px.line(NG_data, x='date', y='CumRecovered', title="cumulated recovery by date")
fig.update_layout({
    'paper_bgcolor': '#f2f2f2'
})
fig.update_traces(line_color="green")

In [16]:

fig = px.line(NG_data, x='date', y='CumConfirmed', title="cumulated confirmed cases")
fig.update_layout({
    'paper_bgcolor': '#f2f2f2'
})
fig.update_traces(line_color="blue")

Could you tell that flat curve just after 'Apr 19,' yeah? That's the culprit; 665 both on the 20th and 21st, while this could be a good thing meaning Nigeria had no new case on the 21st, we know that's false. In fact, we had 117 new cases on the 21st, summing it to 782. Hence, (20th) 665 - (21st) 782 - (22nd) 873 and so forth.

Also, NCDC's correction for the 18th of April: 541 instead of 542. Index 51.

Proof has been included below and update is to be made.

NCDC_REPORT_21_04_2020

In [17]:

#index 54, column CumConfirmed, replace 665 with 782
NG_data.at[54, 'CumConfirmed'] = 782 

NG_data.iloc[53:56, :] #perfection!

Out[17]:

	Country	date	CumConfirmed	CumDeaths	CumRecovered
53	Nigeria	2020-04-20	665	22	188
54	Nigeria	2020-04-21	782	22	188
55	Nigeria	2020-04-22	873	28	197

Notice how I did not update the values for Cumulative Death and Recovered, as this won't be used in the model.

In [18]:

NG_data.at[51, 'CumConfirmed'] = 541
NG_data.iloc[50:53, :]

Out[18]:

	Country	date	CumConfirmed	CumDeaths	CumRecovered
50	Nigeria	2020-04-17	493	17	159
51	Nigeria	2020-04-18	541	19	166
52	Nigeria	2020-04-19	627	21	170

In [19]:

#re-draw.
fig = px.line(NG_data, x='date', y='CumConfirmed')
fig.update_layout({
    'paper_bgcolor': '#f2f2f2'
})
fig.update_traces(line_color="blue")

Now, let's use dataFrame.diff() on the CumConfirmed Column so we can get the daily cases from it, convert possible NaNs to zeros, which should be only one at the very top(0 - 'no value' = NaN)

In [20]:

NG_data["DailyConfirmed"] = NG_data["CumConfirmed"].diff()
NG_data.head()

Out[20]:

	Country	date	CumConfirmed	DailyConfirmed
0	Nigeria	2020-02-27	0	NaN
1	Nigeria	2020-02-28	1	1.0
2	Nigeria	2020-02-29	1	0.0
3	Nigeria	2020-03-01	1	0.0
4	Nigeria	2020-03-02	1	0.0

In [21]:

NG_data.fillna(axis=1, value=0, inplace=True)
NG_data.head(5)

Out[21]:

	Country	date	CumConfirmed	DailyConfirmed
0	Nigeria	2020-02-27	0	0.0
1	Nigeria	2020-02-28	1	1.0
2	Nigeria	2020-02-29	1	0.0
3	Nigeria	2020-03-01	1	0.0
4	Nigeria	2020-03-02	1	0.0

The below graph shows that daily cases are rising fast. Noting that the virus can have an incubated window of 14 days, it's telling that the quarantine is not quite effective.

In [22]:

#cases on a daily basis
fig = px.line(NG_data, x='date', y='DailyConfirmed')
fig.update_layout({
    'paper_bgcolor': '#f2f2f2'
})
fig.update_traces(line_color="purple")

In [23]:

#drop recovery and death cases
NG_data.drop(["Country", "CumRecovered", "CumDeaths"], axis=1, inplace=True)

Now, let's slip in data for our testing. I want to make predictions for the next 8 days, starting on the 28th, and then merge with our data on the row level.

In [24]:

post_date = ['2020-05-20','2020-05-21', '2020-05-22', '2020-05-23', '2020-05-24', '2020-05-25', '2020-05-26','2020-05-27', '2020-05-28','2020-05-29','2020-05-30','2020-05-31']
post_incident = pd.DataFrame(post_date, columns=["date"])
post_incident["date"] = post_incident['date'].astype('datetime64[ns]')
post_incident #the next 12 days

Out[24]:

	date
0	2020-05-20
1	2020-05-21
2	2020-05-22
3	2020-05-23
4	2020-05-24
5	2020-05-25
6	2020-05-26
7	2020-05-27
8	2020-05-28
9	2020-05-29
10	2020-05-30
11	2020-05-31

In [25]:

covid19_NG = pd.concat([NG_data, post_incident], sort=False, axis=0)
covid19_NG.reset_index(drop=True, inplace=True)
covid19_NG.tail(15)

Out[25]:

	date	CumConfirmed	DailyConfirmed
80	2020-05-17	5959.0	338.0
81	2020-05-18	6175.0	216.0
82	2020-05-19	6401.0	226.0
83	2020-05-20	NaN	NaN
84	2020-05-21	NaN	NaN
85	2020-05-22	NaN	NaN
86	2020-05-23	NaN	NaN
87	2020-05-24	NaN	NaN
88	2020-05-25	NaN	NaN
89	2020-05-26	NaN	NaN
90	2020-05-27	NaN	NaN
91	2020-05-28	NaN	NaN
92	2020-05-29	NaN	NaN
93	2020-05-30	NaN	NaN
94	2020-05-31	NaN	NaN

In [26]:

covid19_NG["day"] = covid19_NG.date.dt.strftime("%m%d").astype('int')

covid19_NG.tail(20)

Out[26]:

	date	CumConfirmed	DailyConfirmed	day
75	2020-05-12	4787.0	146.0	512
76	2020-05-13	4971.0	184.0	513
77	2020-05-14	5162.0	191.0	514
78	2020-05-15	5450.0	288.0	515
79	2020-05-16	5621.0	171.0	516
80	2020-05-17	5959.0	338.0	517
81	2020-05-18	6175.0	216.0	518
82	2020-05-19	6401.0	226.0	519
83	2020-05-20	NaN	NaN	520
84	2020-05-21	NaN	NaN	521
85	2020-05-22	NaN	NaN	522
86	2020-05-23	NaN	NaN	523
87	2020-05-24	NaN	NaN	524
88	2020-05-25	NaN	NaN	525
89	2020-05-26	NaN	NaN	526
90	2020-05-27	NaN	NaN	527
91	2020-05-28	NaN	NaN	528
92	2020-05-29	NaN	NaN	529
93	2020-05-30	NaN	NaN	530
94	2020-05-31	NaN	NaN	531

I checked around for news that could be of value in terms of engineering features.

21-03-2020: NG bans entry of travelers from 13 countries(incl. US, Uk, Italy, France and China) [ng.usembassy.gov] Ban to last for 4 weeks.

23-03-2020: Closure of land borders [vanguard]. Same with schools.

28-03-2020: GTB hands over 110-bed isolation centre to Lagos State Gov.

08-04-2020: 110-bed isolation centre by GTB begins operation. [Allafrica]('https://allafrica.com']

31-03-2020: Lockdown as announced by Pres. Buhari (Starts 11:30PM on the 30th, so next day then)

14-04-2020: Another lockdown starts. 04-03-2020: No single isolation centre in Abuja says Senate President. [Punch.ng] 27-01-2020: First Nigeria case. NCDC

21-02-2020: covid19 training starts ncdc.gov.ng

update Lockdown lifted on the 4th of May, 2020. Travel Ban still in place.

In [ ]:

In [27]:

def assumed_lkdown(val):
    if val <= 330:
        return "Not Enforced"
    if val >= 331 and val <= 402: #first three days after proposed lockdown
        return "Relaxed"
    elif val >= 403 and val <=410: #police mobilization and lockdown security goes up 
        return "Tight"
    elif val >= 504:
        return "Lifted"
    else: 
        return "Relaxed"
    

#Lagos, Abuja, and Kano hold high cases and the assumed lockdown was modelled from them.

In [28]:

covid19_NG["Lockdown"] = covid19_NG["day"].apply(assumed_lkdown)
covid19_NG[covid19_NG["Lockdown"] == "Lifted"].head()

Out[28]:

	date	CumConfirmed	DailyConfirmed	day	Lockdown
67	2020-05-04	2802.0	244.0	504	Lifted
68	2020-05-05	2950.0	148.0	505	Lifted
69	2020-05-06	3145.0	195.0	506	Lifted
70	2020-05-07	3526.0	381.0	507	Lifted
71	2020-05-08	3912.0	386.0	508	Lifted

In [29]:

covid19_NG["Travel Ban"] = covid19_NG["day"].apply(lambda x: "Enforced" if x >= 321 else "Not Enforced")
covid19_NG[covid19_NG["Travel Ban"] == "Enforced"].head(3)

#321 - 21st of March, when ban became effective.

Out[29]:

	date	CumConfirmed	DailyConfirmed	day	Lockdown	Travel Ban
23	2020-03-21	22.0	10.0	321	Not Enforced	Enforced
24	2020-03-22	30.0	8.0	322	Not Enforced	Enforced
25	2020-03-23	40.0	10.0	323	Not Enforced	Enforced

In [30]:

covid19_NG["Border closure"] = covid19_NG["day"].apply(lambda x: "Enforced" if x >= 323 else "Not Enforced")
covid19_NG[covid19_NG["Border closure"] == "Enforced"].head(3)

Out[30]:

	date	CumConfirmed	DailyConfirmed	day	Lockdown	Travel Ban	Border closure
25	2020-03-23	40.0	10.0	323	Not Enforced	Enforced	Enforced
26	2020-03-24	44.0	4.0	324	Not Enforced	Enforced	Enforced
27	2020-03-25	51.0	7.0	325	Not Enforced	Enforced	Enforced

In [31]:

covid19_NG["dayName"] = covid19_NG.date.dt.day_name()
covid19_NG.head()

Out[31]:

	date	CumConfirmed	DailyConfirmed	day	Lockdown	Travel Ban	Border closure	dayName
0	2020-02-27	0.0	0.0	227	Not Enforced	Not Enforced	Not Enforced	Thursday
1	2020-02-28	1.0	1.0	228	Not Enforced	Not Enforced	Not Enforced	Friday
2	2020-02-29	1.0	0.0	229	Not Enforced	Not Enforced	Not Enforced	Saturday
3	2020-03-01	1.0	0.0	301	Not Enforced	Not Enforced	Not Enforced	Sunday
4	2020-03-02	1.0	0.0	302	Not Enforced	Not Enforced	Not Enforced	Monday

In [32]:

sns.barplot(data=covid19_NG, x="dayName", y="DailyConfirmed")

Out[32]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9b4e3f6ba8>

In [ ]:

In [33]:

print("Highest reported cases in Nigeria: ", covid19_NG["DailyConfirmed"].max())
covid19_NG[covid19_NG["DailyConfirmed"] == covid19_NG["DailyConfirmed"].max()]

Highest reported cases in Nigeria:  386.0

Out[33]:

	date	CumConfirmed	DailyConfirmed	day	Lockdown	Travel Ban	Border closure	dayName
71	2020-05-08	3912.0	386.0	508	Lifted	Enforced	Enforced	Friday

In [34]:

covid19_NG.tail(11)

Out[34]:

	date	CumConfirmed	DailyConfirmed	day	Lockdown	Travel Ban	Border closure	dayName
84	2020-05-21	NaN	NaN	521	Lifted	Enforced	Enforced	Thursday
85	2020-05-22	NaN	NaN	522	Lifted	Enforced	Enforced	Friday
86	2020-05-23	NaN	NaN	523	Lifted	Enforced	Enforced	Saturday
87	2020-05-24	NaN	NaN	524	Lifted	Enforced	Enforced	Sunday
88	2020-05-25	NaN	NaN	525	Lifted	Enforced	Enforced	Monday
89	2020-05-26	NaN	NaN	526	Lifted	Enforced	Enforced	Tuesday
90	2020-05-27	NaN	NaN	527	Lifted	Enforced	Enforced	Wednesday
91	2020-05-28	NaN	NaN	528	Lifted	Enforced	Enforced	Thursday
92	2020-05-29	NaN	NaN	529	Lifted	Enforced	Enforced	Friday
93	2020-05-30	NaN	NaN	530	Lifted	Enforced	Enforced	Saturday
94	2020-05-31	NaN	NaN	531	Lifted	Enforced	Enforced	Sunday

In [35]:

covid19_NG.drop(["date"], axis=1, inplace=True) #drop date

In [36]:

#Log transform DailyConfirmed. NaNs won't be transformed
covid19_NG["DailyConfirmed"] = np.log1p(covid19_NG["DailyConfirmed"])

In [37]:

covid19_NG = pd.get_dummies(covid19_NG, drop_first=True) #get dummies 
covid19_NG.head()

Out[37]:

	CumConfirmed	DailyConfirmed	day	Lockdown_Not Enforced	Travel Ban_Not Enforced	Border closure_Not Enforced	dayName_Monday	dayName_Saturday	dayName_Sunday	dayName_Thursday
0	0.0	0.000000	227	1	1	1	0	0	0	1
1	1.0	0.693147	228	1	1	1	0	0	0	0
2	1.0	0.000000	229	1	1	1	0	1	0	0
3	1.0	0.000000	301	1	1	1	0	0	1	0
4	1.0	0.000000	302	1	1	1	1	0	0	0

Let's split the data into train, validation and testing.

In [38]:

train = covid19_NG[covid19_NG["day"] <= 514] #Train on this
val = covid19_NG[(covid19_NG["day"] >= 515) & (covid19_NG["day"] <= 519)] #evaluate on this
test = covid19_NG[covid19_NG["day"] >= 520] #test on this

In [39]:

print("Original data shape: ", covid19_NG.shape, "\nTrain shape: ", train.shape, "\nValidation data shape: ",val.shape, "\nTest data shape: ",test.shape)

Original data shape:  (95, 14) 
Train shape:  (78, 14) 
Validation data shape:  (5, 14) 
Test data shape:  (12, 14)

In [40]:

#Predict for the rest of the month starting from May 20th, 2020.

Get ready for TRAINING and TESTING!

fight!

In [41]:

Xtrain = train.drop(["CumConfirmed", "DailyConfirmed"], axis=1)
ytrain = train["DailyConfirmed"]

Xval = val.drop(["CumConfirmed", "DailyConfirmed"], axis=1)
yval = val["DailyConfirmed"]

Xtest = test.drop(["CumConfirmed", "DailyConfirmed"], axis=1)

Xtrain.head(3)

Out[41]:

	day	Lockdown_Not Enforced	Travel Ban_Not Enforced	Border closure_Not Enforced	dayName_Saturday	dayName_Thursday
0	227	1	1	1	0	1
1	228	1	1	1	0	0
2	229	1	1	1	1	0

In [42]:

from sklearn.preprocessing import MinMaxScaler

In [43]:

ss = StandardScaler()
mm = MinMaxScaler() #default=(0,1)

In [44]:

Xtrain_std = ss.fit_transform(Xtrain)
Xval_std = ss.transform(Xval)
Xtest_std = ss.transform(Xtest)

In [ ]:

In [45]:

from sklearn.metrics import mean_squared_error as mse

In [ ]:

In [46]:

cat_params = {
    'num_boost_round':10000,
    'learning_rate': 0.005,
    'silent':True,
    'eval_metric': 'RMSE',
    'early_stopping_rounds':200,
    'max_depth':9,
    'use_best_model':True,
    'random_state': RANDOM_STATE
}

In [47]:

cat_model = CatBoostRegressor(**cat_params)
rf_model = RandomForestRegressor(n_estimators=10000, random_state=RANDOM_STATE, max_depth=24)

#¶

In [48]:

%%time
cat_model.fit(Xtrain, ytrain, eval_set=[(Xtrain, ytrain)])
print("Best Score: ", cat_model.best_score_, "\nIteration stopped at: ", cat_model.best_iteration_)

Best Score:  {'learn': {'RMSE': 0.1798699129597255}, 'validation': {'RMSE': 0.17986991295972543}} 
Iteration stopped at:  9999
CPU times: user 1min 14s, sys: 9.53 s, total: 1min 23s
Wall time: 37.4 s

In [49]:

train_pred_cat = cat_model.predict(Xtrain)

cat_rmse = np.sqrt(mse(train_pred_cat, np.expm1(ytrain)))
print("RMSE: ", cat_rmse)

RMSE:  111.64232067925353

In [50]:

val_pred = cat_model.predict(Xval)


print("Predicted values: ", list(np.round(np.expm1(val_pred), decimals=0)))
print("Actual values: ", list(np.round(np.expm1(yval), decimals=0)))
print("RMSE: ", np.sqrt(mse(val_pred, np.expm1(yval))))

Predicted values:  [229.0, 149.0, 162.0, 154.0, 135.0]
Actual values:  [288.0, 171.0, 338.0, 216.0, 226.0]
RMSE:  249.6401677648684

In [51]:

test_pred_cat = cat_model.predict(Xtest)
test_pred_cat #log-transformed

Out[51]:

array([5.15031532, 5.25226949, 5.4382155 , 5.01015568, 5.09538887,
       5.04420315, 4.91303069, 5.15031532, 5.25226949, 5.4382155 ,
       5.01015568, 5.09538887])

In [52]:

# xg_model = XGBRegressor(n_estimators=20000, learning_rate=0.0005)

### Random Forest¶

In [53]:

%%time
rf_model.fit(Xtrain, ytrain)

CPU times: user 15.3 s, sys: 90.1 ms, total: 15.4 s
Wall time: 15.6 s

Out[53]:

RandomForestRegressor(max_depth=24, n_estimators=10000, random_state=5)

In [54]:

train_pred_rf = rf_model.predict(Xtrain)
print("RMSE (TRAIN): ", np.sqrt(mse(train_pred_rf, np.expm1(ytrain))))
print("######")

#evaluation
val_pred_rf = rf_model.predict(Xval)
print("Predicted values: ", list(np.round(np.expm1(val_pred_rf), decimals=0)))
print("Actual values: ", list(np.round(np.expm1(yval), decimals=0)))
print("RMSE (EVALUATION): ", np.sqrt(mse(val_pred_rf, np.expm1(yval))))

#test prediction
test_pred_rf = rf_model.predict(Xtest)

RMSE (TRAIN):  111.64716402156787
######
Predicted values:  [206.0, 207.0, 204.0, 208.0, 165.0]
Actual values:  [288.0, 171.0, 338.0, 216.0, 226.0]
RMSE (EVALUATION):  249.4737951255527

In [55]:

# list(np.round(np.expm1(covid19_NG.DailyConfirmed[:-9]), decimals=0))

In [ ]:

In [56]:

#Neural network modelling
from keras.wrappers.scikit_learn import KerasRegressor

Using TensorFlow backend.

In [57]:

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

In [58]:

def build_model():
    model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=[Xtrain.shape[1]]),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)
  ])

    optimizer = tf.keras.optimizers.RMSprop(0.003)

    model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mse'])
    return model

model = build_model()

WARNING:tensorflow:From /home/LEGION/.local/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor

In [ ]:

In [59]:

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 32)                416       
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
=================================================================
Total params: 961
Trainable params: 961
Non-trainable params: 0
_________________________________________________________________

In [60]:

Xtrain_mm = mm.fit_transform(Xtrain)
# Xval_mm = mm.transform(Xval)
Xtest_mm = mm.transform(Xtest)

In [61]:

model.fit(Xtrain_mm, ytrain)
model.predict(Xtrain_mm)

78/78 [==============================] - 1s 11ms/sample - loss: 11.3841 - mean_squared_error: 11.3841

Out[61]:

array([[0.11537717],
       [0.16368824],
       [0.24647242],
       [0.2926372 ],
       [0.19305873],
       [0.03961409],
       [0.44060448],
       [0.11201602],
       [0.17328389],
       [0.2324318 ],
       [0.29317167],
       [0.19484395],
       [0.03953261],
       [0.44319835],
       [0.11171436],
       [0.17483209],
       [0.23131046],
       [0.29370615],
       [0.19662909],
       [0.03945112],
       [0.44579223],
       [0.11354627],
       [0.17638029],
       [0.25839227],
       [0.0832982 ],
       [0.28961697],
       [0.19041249],
       [0.33069178],
       [0.25782177],
       [0.21045932],
       [0.312838  ],
       [0.38946822],
       [0.28766018],
       [0.4614352 ],
       [0.36653256],
       [0.405119  ],
       [0.2083361 ],
       [0.33869115],
       [0.37104473],
       [0.12444282],
       [0.25969392],
       [0.23318624],
       [0.37256083],
       [0.2107252 ],
       [0.20848835],
       [0.45374992],
       [0.33070752],
       [0.44189045],
       [0.3670893 ],
       [0.39853686],
       [0.36813706],
       [0.20959184],
       [0.45252687],
       [0.32909504],
       [0.4411442 ],
       [0.36783984],
       [0.39621386],
       [0.36540365],
       [0.2106954 ],
       [0.45091635],
       [0.3274826 ],
       [0.44039792],
       [0.36859027],
       [0.3940295 ],
       [0.3353359 ],
       [0.22244734],
       [0.42233208],
       [0.13740422],
       [0.30470452],
       [0.1816612 ],
       [0.28484467],
       [0.13777824],
       [0.22920337],
       [0.2575878 ],
       [0.1399696 ],
       [0.30447504],
       [0.18237339],
       [0.28519323]], dtype=float32)

In [62]:

from keras.callbacks import EarlyStopping#set early stopping monitor so the model stops training when it won't improve anymore

early_stopping_monitor = EarlyStopping(patience=3)

EPOCHS = 10000

history = model.fit(
  Xtrain_mm, ytrain,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[early_stopping_monitor]
)

In [63]:

hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

Out[63]:

	loss	mean_squared_error	val_loss	val_mean_squared_error	epoch
17	0.527280	0.527280	7.961093	7.961093	17
18	0.511112	0.511112	7.932344	7.932344	18
19	0.518492	0.518492	8.249502	8.249502	19
20	0.505430	0.505430	8.174744	8.174744	20
21	0.496137	0.496137	8.056219	8.056219	21

In [64]:

# test_predictions = model.predict(Xtest_mm).flatten()

# np.expm1(test_predictions)

In [ ]:

In [65]:

#ensemble by weight

In [66]:

cat_pred = np.round(np.expm1(test_pred_cat), decimals=0)
rf_pred = np.round(np.expm1(test_pred_rf), decimals=0)

ensemble = 0.6*cat_pred + 0.4*rf_pred

In [67]:

test_predictions = pd.DataFrame({
                                "date": post_incident["date"],
                                "New Cases (CAT)": [int(x) for x in cat_pred],
                                "New Cases (Random Forest)": [int(x) for x in rf_pred],
                                "Ensemble": [int(x) for x in ensemble],
                               })


test_predictions

Out[67]:

	date	New Cases (CAT)	New Cases (Random Forest)	Ensemble
0	2020-05-20	171	194	180
1	2020-05-21	190	211	198
2	2020-05-22	229	206	219
3	2020-05-23	149	207	172
4	2020-05-24	162	204	178
5	2020-05-25	154	208	175
6	2020-05-26	135	165	147
7	2020-05-27	171	194	180
8	2020-05-28	190	211	198
9	2020-05-29	229	206	219
10	2020-05-30	149	207	172
11	2020-05-31	162	204	178

Giphy-overfit Overfitting was expected due to the data being constrained to a small size.

In [68]:

test_predictions.to_csv("submission.csv", index=False, header=True)

In [ ]:

Disclaimer¶

The predicted values are not to be substituted for actual future cases. The table is merely a forecast, purely for learning purposes.

Resources¶

Know more about the COVID-19 virus here.
Ramadan advisory by NCDC here
If you'd like to view a COVID-19 Dashboard, I made one for just African countries: click here to view the dashboard

Giphy-out

In [ ]:

Covid-19 Forecasting (NG)

TL;DR¶

Aim¶

Limitation¶

Assumption¶

Method¶

#¶

### Random Forest¶

Disclaimer¶

Resources¶