Covid-19 Forecasting (NG)

Hello there, I've been gone for a long time, but you know, not like it was noticeable or anything. *cries in Eigenvectors*. Well, if you must know the reason for my long absence, look no further than work, school, work, sleep, anime, sleep, and just being generally lazy. Anyhow, 2020, I mean, right?

Moving on...


TL;DR

The models forecast high daily confirmed cases and suggest that easing the lockdown would not be a good idea at this point; the daily reported cases alone already show as much.

Aim

With the models established in this post, I try to forecast the spread of the covid-19 virus in Nigeria at a national level over the next 12 days. It should give a sense of what is to come, but don't take it at face value, as a shortage of resources was a big constraint in building the models. The initial plan was to predict at a state level from scraped data, but downtimes and layout changes on NCDC's site ruined that plan, rendering the already-scraped data unusable for this task.

Limitation

To build reliable models for cases like this, modelers rely heavily on historical data from prior outbreaks, which helps with insights and with boosting metrics. There is no (useful) historical data here, and researchers still know too little about the disease to derive predictors from it: no telling how long the virus survives on everyday objects, no confirmed effect of weather conditions on the spread.

The data is therefore limited, and overfitting is inevitable. There's also the questionable quality of the collected data due to human and machine error, especially with things being rushed.
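With so few rows, a single chronological split (which is what I settle for further down) leaves the validation score at the mercy of a handful of particular days. One common mitigation is walk-forward validation: train on an expanding window and always score on the days that immediately follow it. A minimal sketch on toy numbers with scikit-learn's `TimeSeriesSplit`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for a short daily-cases series (the numbers are made up).
X = np.arange(30).reshape(-1, 1)   # day index as the only feature
y = np.arange(30, dtype=float)     # pretend daily counts

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Every validation block sits strictly after its training block,
    # so the model never peeks at future days.
    assert train_idx.max() < val_idx.min()
    print(f"fold {fold}: train through day {train_idx.max()}, "
          f"validate days {val_idx.min()}-{val_idx.max()}")
```

Averaging the fold scores gives a steadier read on generalisation than any one slice of days.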

There are also the pattern irregularities associated with pandemics such as this one, which make pattern finding strenuous. And since the model is at a national level, external data that could have proven useful is not included: isolation and testing centres, population density, covid-19 aid by state.

Assumption

The model assumes the travel ban as announced by the Federal Government of Nigeria was enforced on the stipulated date, and encodes the public's observed compliance with the proposed lockdown and social distancing as levels: Not Enforced, Relaxed, Tight, and (later) Lifted. It also assumes that said interventions remain in place over the future dates our test window falls in.

In addition, the model (rightly) assumes that I'm not an Epidemiologist (check out SEIR models here) and that, in fact, I'm only a brooding Data Scientist.

Method

Two standalone nonlinear models, plus a weighted ensemble of the two, fit on reported confirmed cases and some derived features.

Let's proceed!

In [4]:
#Import packages/tools 

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

import plotly.express as px

from sklearn.preprocessing import StandardScaler

# from sklearn.base import BaseEstimator, TransformerMixin
# from sklearn.pipeline import Parallel, Pipeline, FeatureUnion

from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error as mse

from IPython.core.display import HTML

import warnings
warnings.filterwarnings('ignore')
In [5]:
sns.set(style='white', font_scale=1.2, rc={"figure.figsize": (15,10)})


colors_ = {
    'background': '#111111',
    'text': '#BEBEBE',
    'grid': '#333333',
    'red': '#BF0000'
}

tickFont = {
    'size':12,
    'color':"rgb(30,30,30)",
    'family':"Courier New, monospace"}

SEED = 9
RANDOM_STATE = 5
In [7]:
# from plotly.offline import init_notebook_mode, plot,iplot
# import cufflinks
# init_notebook_mode(connected=True)
# cufflinks.go_offline()

# import chart_studio.plotly as py

# pd.set_option("max_rows", 20)

Next: the data, containing the daily cumulated confirmed cases (as well as recoveries and deaths) per country, is loaded. I drop the Province/State column here (cumulated deaths and recoveries are dropped later) and parse the date column as datetime.

Data Source: JohnsHopkinsUni

In [10]:
data = pd.read_csv("~/NG_covid19_ML/data/allData.csv")\
        .drop(["Province/State"], axis=1)\
        .astype({'date':'datetime64[ns]'})

# data.head(3)
In [5]:
#rename Country/Region to just Country
data.rename(columns={'Country/Region':'Country'}, inplace=True)
In [6]:
data.isna().sum() #No null values. Good!
Out[6]:
Country         0
date            0
CumConfirmed    0
CumDeaths       0
CumRecovered    0
dtype: int64

Let's dedicate a couple of lines to making graphs: worldwide and then Africa.

In [7]:
countries_to_plot = ["US", "Italy", "China", "Germany", "Ghana", "Morocco", "Nigeria", "South Africa", "Tunisia"]
country_group = data.groupby(['date', 'Country'], as_index=False)['CumConfirmed'].sum()


xc = country_group[country_group["Country"].isin(countries_to_plot)]


trend_fig = px.line(xc, x="date", y="CumConfirmed", color="Country")\
        .for_each_trace(lambda t: t.update(name=t.name.split("=")[-1])) #strip the "Country=" prefix from legend names
trend_fig.update_traces(
        line=dict(width=3),
        selector=dict(type="scatter", mode="lines")
        )
trend_fig.update_layout({'legend_orientation':'h', 'autosize':True, 'margin':dict(l=0.5,r=0.5,b=2,t=3)}, font=tickFont)\
        .update_xaxes(title="", tickfont=tickFont)\
        .update_yaxes(title="Cumulated Confirmed Cases")

#African countries have far lower counts (thousands against a million-plus),
#so they're hard to make out on this graph.
In [8]:
countries_to_plot = ["Ghana", "Morocco", "Nigeria", "South Africa", "Tunisia", "Kenya"] #6 african countries
country_group = data.groupby(['date', 'Country'], as_index=False)['CumConfirmed'].sum()


xc = country_group[country_group["Country"].isin(countries_to_plot)]


trend_fig = px.line(xc, x="date", y="CumConfirmed", color="Country")\
        .for_each_trace(lambda t: t.update(name=t.name.split("=")[-1])) #strip the "Country=" prefix from legend names
trend_fig.update_traces(
        line=dict(width=3),
        selector=dict(type="scatter", mode="lines")
        )
trend_fig.update_layout({'legend_orientation':'h', 'autosize':True, 'margin':dict(l=0.5,r=0.5,b=2,t=3)}, font=tickFont)\
        .update_xaxes(title="", tickfont=tickFont)\
        .update_yaxes(title="Cumulated Confirmed Cases")
In [9]:
#I shall be extracting data for Nigeria only
NG_data = data[data["Country"] == "Nigeria"]
NG_data.reset_index(drop=True, inplace=True)
In [10]:
#peep into the data and check last accounted date
NG_data.tail()  #19th, May 2020
Out[10]:
Country date CumConfirmed CumDeaths CumRecovered
114 Nigeria 2020-05-15 5450 171 1320
115 Nigeria 2020-05-16 5621 176 1472
116 Nigeria 2020-05-17 5959 182 1594
117 Nigeria 2020-05-18 6175 191 1644
118 Nigeria 2020-05-19 6401 192 1734

Going through the data, I noticed that the 20th and 21st of April carry the same value of 665, which would mean there were no new cases on the 21st. But there were; in fact, one of the highest daily counts recorded up to then: 117. I checked NCDC's site to confirm and made the required change at that index. Also, the first case in Nigeria wasn't until the 28th of February, so I trimmed the data to start from the 27th of February, just to get rid of the long run of zeros.

Let's investigate together!

In [11]:
NG_data[NG_data["CumConfirmed"] > 0].head(12) #easily identify first reported case - 28th. 
#No further case until the 9th of March. Take note of the index (37, so tailor from 36)
Out[11]:
Country date CumConfirmed CumDeaths CumRecovered
37 Nigeria 2020-02-28 1 0 0
38 Nigeria 2020-02-29 1 0 0
39 Nigeria 2020-03-01 1 0 0
40 Nigeria 2020-03-02 1 0 0
41 Nigeria 2020-03-03 1 0 0
42 Nigeria 2020-03-04 1 0 0
43 Nigeria 2020-03-05 1 0 0
44 Nigeria 2020-03-06 1 0 0
45 Nigeria 2020-03-07 1 0 0
46 Nigeria 2020-03-08 1 0 0
47 Nigeria 2020-03-09 2 0 0
48 Nigeria 2020-03-10 2 0 0
In [12]:
NG_data = NG_data.iloc[36:, :] #Tailored!
NG_data.reset_index(drop=True, inplace=True)
NG_data.head()
Out[12]:
Country date CumConfirmed CumDeaths CumRecovered
0 Nigeria 2020-02-27 0 0 0
1 Nigeria 2020-02-28 1 0 0
2 Nigeria 2020-02-29 1 0 0
3 Nigeria 2020-03-01 1 0 0
4 Nigeria 2020-03-02 1 0 0
In [13]:
NG_data.iloc[50:60] #note index 53 and 54
Out[13]:
Country date CumConfirmed CumDeaths CumRecovered
50 Nigeria 2020-04-17 493 17 159
51 Nigeria 2020-04-18 542 19 166
52 Nigeria 2020-04-19 627 21 170
53 Nigeria 2020-04-20 665 22 188
54 Nigeria 2020-04-21 665 22 188
55 Nigeria 2020-04-22 873 28 197
56 Nigeria 2020-04-23 981 31 197
57 Nigeria 2020-04-24 1095 32 208
58 Nigeria 2020-04-25 1182 35 222
59 Nigeria 2020-04-26 1273 40 239
In [14]:
fig = px.line(NG_data, x='date', y='CumDeaths', title="Cumulated Deaths by date")
fig.update_layout({
    'paper_bgcolor': '#f2f2f2'
})
fig.update_traces(line_color="red")
In [15]:
fig = px.line(NG_data, x='date', y='CumRecovered', title="cumulated recovery by date")
fig.update_layout({
    'paper_bgcolor': '#f2f2f2'
})
fig.update_traces(line_color="green")
In [16]:
fig = px.line(NG_data, x='date', y='CumConfirmed', title="cumulated confirmed cases")
fig.update_layout({
    'paper_bgcolor': '#f2f2f2'
})
fig.update_traces(line_color="blue")

See that flat stretch just after Apr 19? That's the culprit: 665 on both the 20th and the 21st. While that would have been good news (no new case on the 21st), we know it's false. In fact, there were 117 new cases on the 21st, bringing the total to 782. The cumulated series should therefore run 665 (20th) → 782 (21st) → 873 (22nd), and so forth.

Also, NCDC's correction for the 18th of April: 541 instead of 542. Index 51.

Proof is included below, and the update is made next.

[Image: NCDC_REPORT_21_04_2020]

In [17]:
#index 54, column CumConfirmed, replace 665 with 782
NG_data.at[54, 'CumConfirmed'] = 782 

NG_data.iloc[53:56, :] #perfection!
Out[17]:
Country date CumConfirmed CumDeaths CumRecovered
53 Nigeria 2020-04-20 665 22 188
54 Nigeria 2020-04-21 782 22 188
55 Nigeria 2020-04-22 873 28 197

Notice that I did not update the Cumulated Deaths and Recovered values, as they won't be used in the model.

In [18]:
NG_data.at[51, 'CumConfirmed'] = 541
NG_data.iloc[50:53, :]
Out[18]:
Country date CumConfirmed CumDeaths CumRecovered
50 Nigeria 2020-04-17 493 17 159
51 Nigeria 2020-04-18 541 19 166
52 Nigeria 2020-04-19 627 21 170
In [19]:
#re-draw.
fig = px.line(NG_data, x='date', y='CumConfirmed')
fig.update_layout({
    'paper_bgcolor': '#f2f2f2'
})
fig.update_traces(line_color="blue")

Now, let's use DataFrame.diff() on the CumConfirmed column to derive the daily cases, then convert any NaNs to zeros. There should be only one, at the very top: the first row has nothing before it to difference against.

In [20]:
NG_data["DailyConfirmed"] = NG_data["CumConfirmed"].diff()
NG_data.head()
Out[20]:
Country date CumConfirmed CumDeaths CumRecovered DailyConfirmed
0 Nigeria 2020-02-27 0 0 0 NaN
1 Nigeria 2020-02-28 1 0 0 1.0
2 Nigeria 2020-02-29 1 0 0 0.0
3 Nigeria 2020-03-01 1 0 0 0.0
4 Nigeria 2020-03-02 1 0 0 0.0
In [21]:
NG_data.fillna(0, inplace=True) #only the first DailyConfirmed row is NaN
NG_data.head(5)
Out[21]:
Country date CumConfirmed CumDeaths CumRecovered DailyConfirmed
0 Nigeria 2020-02-27 0 0 0 0.0
1 Nigeria 2020-02-28 1 0 0 1.0
2 Nigeria 2020-02-29 1 0 0 0.0
3 Nigeria 2020-03-01 1 0 0 0.0
4 Nigeria 2020-03-02 1 0 0 0.0

The graph below shows that daily cases are rising fast. Given that the virus can have an incubation window of up to 14 days, it suggests the quarantine is not quite effective.
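A quick way to see that rise through the day-to-day reporting noise (a diagnostic only; the models below don't use it) is a trailing 7-day mean of the daily counts. A minimal sketch on made-up numbers in the same ballpark as ours:

```python
import pandas as pd

# Toy daily-count series (made-up numbers roughly in our range).
daily = pd.Series([117, 91, 87, 195, 208, 170, 91,
                   239, 245, 195, 184, 146, 184, 191])

# Trailing 7-day mean; min_periods=1 keeps the early, partial windows.
rolling7 = daily.rolling(window=7, min_periods=1).mean()
print(rolling7.round(1).tolist())
```

Plotting `rolling7` instead of the raw series makes the underlying trend far easier to read.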

In [22]:
#cases on a daily basis
fig = px.line(NG_data, x='date', y='DailyConfirmed')
fig.update_layout({
    'paper_bgcolor': '#f2f2f2'
})
fig.update_traces(line_color="purple")
In [23]:
#drop recovery and death cases
NG_data.drop(["Country", "CumRecovered", "CumDeaths"], axis=1, inplace=True)

Now, let's slip in rows for our testing. I want to make predictions for the next 12 days, starting on the 20th of May, and then append them to our data row-wise.

In [24]:
post_date = ['2020-05-20','2020-05-21', '2020-05-22', '2020-05-23', '2020-05-24', '2020-05-25', '2020-05-26','2020-05-27', '2020-05-28','2020-05-29','2020-05-30','2020-05-31']
post_incident = pd.DataFrame(post_date, columns=["date"])
post_incident["date"] = post_incident['date'].astype('datetime64[ns]')
post_incident #the next 12 days
Out[24]:
date
0 2020-05-20
1 2020-05-21
2 2020-05-22
3 2020-05-23
4 2020-05-24
5 2020-05-25
6 2020-05-26
7 2020-05-27
8 2020-05-28
9 2020-05-29
10 2020-05-30
11 2020-05-31
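As an aside, the same 12-day frame can be built without typing each date out, using pandas' `date_range`; a sketch equivalent to the cell above:

```python
import pandas as pd

# Equivalent to listing the date strings by hand:
# 12 consecutive days starting from the 20th of May.
post_incident = pd.DataFrame(
    {"date": pd.date_range("2020-05-20", periods=12, freq="D")})
print(post_incident["date"].dt.strftime("%Y-%m-%d").tolist())
```

The column comes out already typed as datetime64, so the `astype` step is unnecessary too.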
In [25]:
covid19_NG = pd.concat([NG_data, post_incident], sort=False, axis=0)
covid19_NG.reset_index(drop=True, inplace=True)
covid19_NG.tail(15)
Out[25]:
date CumConfirmed DailyConfirmed
80 2020-05-17 5959.0 338.0
81 2020-05-18 6175.0 216.0
82 2020-05-19 6401.0 226.0
83 2020-05-20 NaN NaN
84 2020-05-21 NaN NaN
85 2020-05-22 NaN NaN
86 2020-05-23 NaN NaN
87 2020-05-24 NaN NaN
88 2020-05-25 NaN NaN
89 2020-05-26 NaN NaN
90 2020-05-27 NaN NaN
91 2020-05-28 NaN NaN
92 2020-05-29 NaN NaN
93 2020-05-30 NaN NaN
94 2020-05-31 NaN NaN
In [26]:
covid19_NG["day"] = covid19_NG.date.dt.strftime("%m%d").astype('int')

covid19_NG.tail(20)
Out[26]:
date CumConfirmed DailyConfirmed day
75 2020-05-12 4787.0 146.0 512
76 2020-05-13 4971.0 184.0 513
77 2020-05-14 5162.0 191.0 514
78 2020-05-15 5450.0 288.0 515
79 2020-05-16 5621.0 171.0 516
80 2020-05-17 5959.0 338.0 517
81 2020-05-18 6175.0 216.0 518
82 2020-05-19 6401.0 226.0 519
83 2020-05-20 NaN NaN 520
84 2020-05-21 NaN NaN 521
85 2020-05-22 NaN NaN 522
86 2020-05-23 NaN NaN 523
87 2020-05-24 NaN NaN 524
88 2020-05-25 NaN NaN 525
89 2020-05-26 NaN NaN 526
90 2020-05-27 NaN NaN 527
91 2020-05-28 NaN NaN 528
92 2020-05-29 NaN NaN 529
93 2020-05-30 NaN NaN 530
94 2020-05-31 NaN NaN 531
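One caveat with this `%m%d` integer encoding: it jumps at month boundaries (February 29th encodes as 229, March 1st as 301), so consecutive days aren't evenly spaced. An evenly spaced alternative I could have used is a days-elapsed counter; a small sketch, assuming a `date` column like ours:

```python
import pandas as pd

dates = pd.Series(pd.date_range("2020-02-27", "2020-03-02"))

# The %m%d encoding used above: note the 229 -> 301 jump.
mmdd = dates.dt.strftime("%m%d").astype(int)
print(mmdd.tolist())      # [227, 228, 229, 301, 302]

# Evenly spaced alternative: days elapsed since the first row.
elapsed = (dates - dates.iloc[0]).dt.days
print(elapsed.tolist())   # [0, 1, 2, 3, 4]
```

Tree models can split around the jumps, so the harm is limited here, but the counter is the safer default.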

I checked around for news that could be of value in terms of engineering features.

21-03-2020: NG bans entry of travelers from 13 countries (incl. US, UK, Italy, France and China). [ng.usembassy.gov] Ban to last for 4 weeks.

23-03-2020: Closure of land borders [vanguard]. Same with schools.

28-03-2020: GTB hands over 110-bed isolation centre to Lagos State Gov.

08-04-2020: 110-bed isolation centre by GTB begins operation. [Allafrica](https://allafrica.com)

31-03-2020: Lockdown as announced by Pres. Buhari (Starts 11:30PM on the 30th, so next day then)

14-04-2020: Another lockdown phase starts.

04-03-2020: No single isolation centre in Abuja, says the Senate President. [Punch.ng]

27-02-2020: First Nigeria case. NCDC

21-02-2020: covid19 training starts ncdc.gov.ng

Update: lockdown lifted on the 4th of May, 2020. Travel ban still in place.

In [27]:
def assumed_lkdown(val):
    if val <= 330:
        return "Not Enforced"
    elif 331 <= val <= 402: #first three days after the proposed lockdown
        return "Relaxed"
    elif 403 <= val <= 410: #police mobilisation; lockdown security goes up
        return "Tight"
    elif val >= 504:
        return "Lifted"
    else:
        return "Relaxed"
    

#Lagos, Abuja, and Kano hold high cases and the assumed lockdown was modelled from them.
In [28]:
covid19_NG["Lockdown"] = covid19_NG["day"].apply(assumed_lkdown)
covid19_NG[covid19_NG["Lockdown"] == "Lifted"].head()
Out[28]:
date CumConfirmed DailyConfirmed day Lockdown
67 2020-05-04 2802.0 244.0 504 Lifted
68 2020-05-05 2950.0 148.0 505 Lifted
69 2020-05-06 3145.0 195.0 506 Lifted
70 2020-05-07 3526.0 381.0 507 Lifted
71 2020-05-08 3912.0 386.0 508 Lifted
In [29]:
covid19_NG["Travel Ban"] = covid19_NG["day"].apply(lambda x: "Enforced" if x >= 321 else "Not Enforced")
covid19_NG[covid19_NG["Travel Ban"] == "Enforced"].head(3)

#321 - 21st of March, when ban became effective.
Out[29]:
date CumConfirmed DailyConfirmed day Lockdown Travel Ban
23 2020-03-21 22.0 10.0 321 Not Enforced Enforced
24 2020-03-22 30.0 8.0 322 Not Enforced Enforced
25 2020-03-23 40.0 10.0 323 Not Enforced Enforced
In [30]:
covid19_NG["Border closure"] = covid19_NG["day"].apply(lambda x: "Enforced" if x >= 323 else "Not Enforced")
covid19_NG[covid19_NG["Border closure"] == "Enforced"].head(3)
Out[30]:
date CumConfirmed DailyConfirmed day Lockdown Travel Ban Border closure
25 2020-03-23 40.0 10.0 323 Not Enforced Enforced Enforced
26 2020-03-24 44.0 4.0 324 Not Enforced Enforced Enforced
27 2020-03-25 51.0 7.0 325 Not Enforced Enforced Enforced
In [31]:
covid19_NG["dayName"] = covid19_NG.date.dt.day_name()
covid19_NG.head()
Out[31]:
date CumConfirmed DailyConfirmed day Lockdown Travel Ban Border closure dayName
0 2020-02-27 0.0 0.0 227 Not Enforced Not Enforced Not Enforced Thursday
1 2020-02-28 1.0 1.0 228 Not Enforced Not Enforced Not Enforced Friday
2 2020-02-29 1.0 0.0 229 Not Enforced Not Enforced Not Enforced Saturday
3 2020-03-01 1.0 0.0 301 Not Enforced Not Enforced Not Enforced Sunday
4 2020-03-02 1.0 0.0 302 Not Enforced Not Enforced Not Enforced Monday
In [32]:
sns.barplot(data=covid19_NG, x="dayName", y="DailyConfirmed")
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9b4e3f6ba8>
In [33]:
print("Highest reported cases in Nigeria: ", covid19_NG["DailyConfirmed"].max())
covid19_NG[covid19_NG["DailyConfirmed"] == covid19_NG["DailyConfirmed"].max()]
Highest reported cases in Nigeria:  386.0
Out[33]:
date CumConfirmed DailyConfirmed day Lockdown Travel Ban Border closure dayName
71 2020-05-08 3912.0 386.0 508 Lifted Enforced Enforced Friday
In [34]:
covid19_NG.tail(11)
Out[34]:
date CumConfirmed DailyConfirmed day Lockdown Travel Ban Border closure dayName
84 2020-05-21 NaN NaN 521 Lifted Enforced Enforced Thursday
85 2020-05-22 NaN NaN 522 Lifted Enforced Enforced Friday
86 2020-05-23 NaN NaN 523 Lifted Enforced Enforced Saturday
87 2020-05-24 NaN NaN 524 Lifted Enforced Enforced Sunday
88 2020-05-25 NaN NaN 525 Lifted Enforced Enforced Monday
89 2020-05-26 NaN NaN 526 Lifted Enforced Enforced Tuesday
90 2020-05-27 NaN NaN 527 Lifted Enforced Enforced Wednesday
91 2020-05-28 NaN NaN 528 Lifted Enforced Enforced Thursday
92 2020-05-29 NaN NaN 529 Lifted Enforced Enforced Friday
93 2020-05-30 NaN NaN 530 Lifted Enforced Enforced Saturday
94 2020-05-31 NaN NaN 531 Lifted Enforced Enforced Sunday
In [35]:
covid19_NG.drop(["date"], axis=1, inplace=True) #drop date
In [36]:
#Log transform DailyConfirmed. NaNs won't be transformed
covid19_NG["DailyConfirmed"] = np.log1p(covid19_NG["DailyConfirmed"])
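The pairing matters later on: every prediction comes back on the log scale and must pass through `np.expm1`, the inverse of `np.log1p`, before it reads as a case count. A quick round-trip check on sample values:

```python
import numpy as np

counts = np.array([0.0, 1.0, 117.0, 386.0])   # sample daily counts, incl. our max

logged = np.log1p(counts)     # the scale the models are fit on
restored = np.expm1(logged)   # the scale predictions are reported on

# log1p and expm1 are exact inverses, and log1p(0) == 0,
# so zero-case days survive the transform untouched.
assert np.allclose(restored, counts)
print(np.round(logged, 3))
```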
In [37]:
covid19_NG = pd.get_dummies(covid19_NG, drop_first=True) #get dummies 
covid19_NG.head()
Out[37]:
CumConfirmed DailyConfirmed day Lockdown_Not Enforced Lockdown_Relaxed Lockdown_Tight Travel Ban_Not Enforced Border closure_Not Enforced dayName_Monday dayName_Saturday dayName_Sunday dayName_Thursday dayName_Tuesday dayName_Wednesday
0 0.0 0.000000 227 1 0 0 1 1 0 0 0 1 0 0
1 1.0 0.693147 228 1 0 0 1 1 0 0 0 0 0 0
2 1.0 0.000000 229 1 0 0 1 1 0 1 0 0 0 0
3 1.0 0.000000 301 1 0 0 1 1 0 0 1 0 0 0
4 1.0 0.000000 302 1 0 0 1 1 1 0 0 0 0 0

Let's split the data into training, validation and test sets.

In [38]:
train = covid19_NG[covid19_NG["day"] <= 514] #Train on this
val = covid19_NG[(covid19_NG["day"] >= 515) & (covid19_NG["day"] <= 519)] #evaluate on this
test = covid19_NG[covid19_NG["day"] >= 520] #test on this
In [39]:
print("Original data shape: ", covid19_NG.shape, "\nTrain shape: ", train.shape, "\nValidation data shape: ",val.shape, "\nTest data shape: ",test.shape)
Original data shape:  (95, 14) 
Train shape:  (78, 14) 
Validation data shape:  (5, 14) 
Test data shape:  (12, 14)
In [40]:
#Predict for the rest of the month starting from May 20th, 2020.

Get ready for TRAINING and TESTING!

fight!

In [41]:
Xtrain = train.drop(["CumConfirmed", "DailyConfirmed"], axis=1)
ytrain = train["DailyConfirmed"]

Xval = val.drop(["CumConfirmed", "DailyConfirmed"], axis=1)
yval = val["DailyConfirmed"]

Xtest = test.drop(["CumConfirmed", "DailyConfirmed"], axis=1)

Xtrain.head(3)
Out[41]:
day Lockdown_Not Enforced Lockdown_Relaxed Lockdown_Tight Travel Ban_Not Enforced Border closure_Not Enforced dayName_Monday dayName_Saturday dayName_Sunday dayName_Thursday dayName_Tuesday dayName_Wednesday
0 227 1 0 0 1 1 0 0 0 1 0 0
1 228 1 0 0 1 1 0 0 0 0 0 0
2 229 1 0 0 1 1 0 1 0 0 0 0
In [42]:
from sklearn.preprocessing import MinMaxScaler
In [43]:
ss = StandardScaler()
mm = MinMaxScaler() #default=(0,1)
In [44]:
Xtrain_std = ss.fit_transform(Xtrain)
Xval_std = ss.transform(Xval)
Xtest_std = ss.transform(Xtest)
In [45]:
from sklearn.metrics import mean_squared_error as mse
In [46]:
cat_params = {
    'num_boost_round':10000,
    'learning_rate': 0.005,
    'silent':True,
    'eval_metric': 'RMSE',
    'early_stopping_rounds':200,
    'max_depth':9,
    'use_best_model':True,
    'random_state': RANDOM_STATE
}
In [47]:
cat_model = CatBoostRegressor(**cat_params)
rf_model = RandomForestRegressor(n_estimators=10000, random_state=RANDOM_STATE, max_depth=24)
#
In [48]:
%%time
cat_model.fit(Xtrain, ytrain, eval_set=[(Xtrain, ytrain)]) #NB: eval_set is the training set itself, so early stopping never triggers; eval_set=[(Xval, yval)] would be the stricter choice
print("Best Score: ", cat_model.best_score_, "\nIteration stopped at: ", cat_model.best_iteration_)
Best Score:  {'learn': {'RMSE': 0.1798699129597255}, 'validation': {'RMSE': 0.17986991295972543}} 
Iteration stopped at:  9999
CPU times: user 1min 14s, sys: 9.53 s, total: 1min 23s
Wall time: 37.4 s
In [49]:
train_pred_cat = cat_model.predict(Xtrain)

#NB: here (and in the evaluation cells below) the predictions are still on the
#log scale while the actuals are inverted with expm1, so these RMSEs mix scales;
#treat them as relative scores rather than case counts.
cat_rmse = np.sqrt(mse(train_pred_cat, np.expm1(ytrain)))
print("RMSE: ", cat_rmse)
RMSE:  111.64232067925353
In [50]:
val_pred = cat_model.predict(Xval)


print("Predicted values: ", list(np.round(np.expm1(val_pred), decimals=0)))
print("Actual values: ", list(np.round(np.expm1(yval), decimals=0)))
print("RMSE: ", np.sqrt(mse(val_pred, np.expm1(yval))))
Predicted values:  [229.0, 149.0, 162.0, 154.0, 135.0]
Actual values:  [288.0, 171.0, 338.0, 216.0, 226.0]
RMSE:  249.6401677648684
In [51]:
test_pred_cat = cat_model.predict(Xtest)
test_pred_cat #log-transformed
Out[51]:
array([5.15031532, 5.25226949, 5.4382155 , 5.01015568, 5.09538887,
       5.04420315, 4.91303069, 5.15031532, 5.25226949, 5.4382155 ,
       5.01015568, 5.09538887])
In [52]:
# xg_model = XGBRegressor(n_estimators=20000, learning_rate=0.0005)
### Random Forest
In [53]:
%%time
rf_model.fit(Xtrain, ytrain)
CPU times: user 15.3 s, sys: 90.1 ms, total: 15.4 s
Wall time: 15.6 s
Out[53]:
RandomForestRegressor(max_depth=24, n_estimators=10000, random_state=5)
In [54]:
train_pred_rf = rf_model.predict(Xtrain)
print("RMSE (TRAIN): ", np.sqrt(mse(train_pred_rf, np.expm1(ytrain))))
print("######")

#evaluation
val_pred_rf = rf_model.predict(Xval)
print("Predicted values: ", list(np.round(np.expm1(val_pred_rf), decimals=0)))
print("Actual values: ", list(np.round(np.expm1(yval), decimals=0)))
print("RMSE (EVALUATION): ", np.sqrt(mse(val_pred_rf, np.expm1(yval))))

#test prediction
test_pred_rf = rf_model.predict(Xtest)
RMSE (TRAIN):  111.64716402156787
######
Predicted values:  [206.0, 207.0, 204.0, 208.0, 165.0]
Actual values:  [288.0, 171.0, 338.0, 216.0, 226.0]
RMSE (EVALUATION):  249.4737951255527
In [55]:
# list(np.round(np.expm1(covid19_NG.DailyConfirmed[:-9]), decimals=0))
In [56]:
#Neural network modelling
from keras.wrappers.scikit_learn import KerasRegressor
Using TensorFlow backend.
In [57]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
In [58]:
def build_model():
    model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=[Xtrain.shape[1]]),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)
  ])

    optimizer = tf.keras.optimizers.RMSprop(0.003)

    model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mse'])
    return model

model = build_model()
WARNING:tensorflow:From /home/LEGION/.local/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
In [59]:
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 32)                416       
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
=================================================================
Total params: 961
Trainable params: 961
Non-trainable params: 0
_________________________________________________________________
In [60]:
Xtrain_mm = mm.fit_transform(Xtrain)
# Xval_mm = mm.transform(Xval)
Xtest_mm = mm.transform(Xtest)
In [61]:
model.fit(Xtrain_mm, ytrain)
model.predict(Xtrain_mm)
78/78 [==============================] - 1s 11ms/sample - loss: 11.3841 - mean_squared_error: 11.3841
Out[61]:
array([[0.11537717],
       [0.16368824],
       [0.24647242],
       [0.2926372 ],
       [0.19305873],
       [0.03961409],
       [0.44060448],
       [0.11201602],
       [0.17328389],
       [0.2324318 ],
       [0.29317167],
       [0.19484395],
       [0.03953261],
       [0.44319835],
       [0.11171436],
       [0.17483209],
       [0.23131046],
       [0.29370615],
       [0.19662909],
       [0.03945112],
       [0.44579223],
       [0.11354627],
       [0.17638029],
       [0.25839227],
       [0.0832982 ],
       [0.28961697],
       [0.19041249],
       [0.33069178],
       [0.25782177],
       [0.21045932],
       [0.312838  ],
       [0.38946822],
       [0.28766018],
       [0.4614352 ],
       [0.36653256],
       [0.405119  ],
       [0.2083361 ],
       [0.33869115],
       [0.37104473],
       [0.12444282],
       [0.25969392],
       [0.23318624],
       [0.37256083],
       [0.2107252 ],
       [0.20848835],
       [0.45374992],
       [0.33070752],
       [0.44189045],
       [0.3670893 ],
       [0.39853686],
       [0.36813706],
       [0.20959184],
       [0.45252687],
       [0.32909504],
       [0.4411442 ],
       [0.36783984],
       [0.39621386],
       [0.36540365],
       [0.2106954 ],
       [0.45091635],
       [0.3274826 ],
       [0.44039792],
       [0.36859027],
       [0.3940295 ],
       [0.3353359 ],
       [0.22244734],
       [0.42233208],
       [0.13740422],
       [0.30470452],
       [0.1816612 ],
       [0.28484467],
       [0.13777824],
       [0.22920337],
       [0.2575878 ],
       [0.1399696 ],
       [0.30447504],
       [0.18237339],
       [0.28519323]], dtype=float32)
In [62]:
from keras.callbacks import EarlyStopping

#set an early-stopping monitor so the model stops training once it no longer improves
early_stopping_monitor = EarlyStopping(patience=3)

EPOCHS = 10000

history = model.fit(
  Xtrain_mm, ytrain,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[early_stopping_monitor]
)
In [63]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()
Out[63]:
loss mean_squared_error val_loss val_mean_squared_error epoch
17 0.527280 0.527280 7.961093 7.961093 17
18 0.511112 0.511112 7.932344 7.932344 18
19 0.518492 0.518492 8.249502 8.249502 19
20 0.505430 0.505430 8.174744 8.174744 20
21 0.496137 0.496137 8.056219 8.056219 21
In [64]:
# test_predictions = model.predict(Xtest_mm).flatten()

# np.expm1(test_predictions)
In [65]:
#ensemble by weight
In [66]:
cat_pred = np.round(np.expm1(test_pred_cat), decimals=0)
rf_pred = np.round(np.expm1(test_pred_rf), decimals=0)

ensemble = 0.6*cat_pred + 0.4*rf_pred
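The 0.6/0.4 split is hand-picked. Had I wanted to be less arbitrary, the weight could have been scanned against the validation days, comparing everything on the case-count scale. A sketch using the rounded validation numbers printed earlier; on these five days the scan happens to favour the Random Forest alone, which mostly says five days is too little data to tune a weight on:

```python
import numpy as np

# Rounded validation predictions printed earlier, and the actual values.
pred_cat = np.array([229., 149., 162., 154., 135.])
pred_rf = np.array([206., 207., 204., 208., 165.])
actual = np.array([288., 171., 338., 216., 226.])

def rmse(pred, y):
    return float(np.sqrt(np.mean((pred - y) ** 2)))

# Scan blends w*cat + (1 - w)*rf and keep the weight with the lowest RMSE.
weights = np.linspace(0, 1, 101)
best_w = min(weights, key=lambda w: rmse(w * pred_cat + (1 - w) * pred_rf, actual))
print(best_w, rmse(best_w * pred_cat + (1 - best_w) * pred_rf, actual))
```

With more validation days, a scan like this (or a simple regression over the two prediction vectors) would be a defensible way to set the blend.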
In [67]:
test_predictions = pd.DataFrame({
                                "date": post_incident["date"],
                                "New Cases (CAT)": [int(x) for x in cat_pred],
                                "New Cases (Random Forest)": [int(x) for x in rf_pred],
                                "Ensemble": [int(x) for x in ensemble],
                               })


test_predictions 
Out[67]:
date New Cases (CAT) New Cases (Random Forest) Ensemble
0 2020-05-20 171 194 180
1 2020-05-21 190 211 198
2 2020-05-22 229 206 219
3 2020-05-23 149 207 172
4 2020-05-24 162 204 178
5 2020-05-25 154 208 175
6 2020-05-26 135 165 147
7 2020-05-27 171 194 180
8 2020-05-28 190 211 198
9 2020-05-29 229 206 219
10 2020-05-30 149 207 172
11 2020-05-31 162 204 178

Overfitting was expected, given how small the data is. Notice, too, that both models' predictions repeat on a 7-day cycle: the day feature falls outside the training range, so the trees can only lean on the day-of-week dummies.

In [68]:
test_predictions.to_csv("submission.csv", index=False, header=True)

Disclaimer

The predicted values are not to be substituted for actual future cases. The table is merely a forecast, purely for learning purposes.

Resources

