Prediction Policy Problems: Linear Models & Lasso Regression

Introducing the Prediction Policy Framework

In the video lecture below you’ll be given a brief introduction to the prediction policy framework, and a primer on machine learning. Please take a moment to watch the 20-minute video.


Are you still wondering what the difference is between Machine Learning and Econometrics? Take a few minutes to watch the video below.


After watching the videos, we have a practical exercise.

Predicting social assistance beneficiaries

A key problem in the design of Social Policies is the identification of people in need of social assistance. Social policies usually work with tight budgets and limited fiscal space. To allocate resources efficiently, benefits need to be targeted to those who need them most. Yet, identifying needs isn’t easy and misclassifications can have severe and irreversible effects on people in need.

Think of a social protection programme that allocates food vouchers to families with children at risk of malnutrition, or a programme that establishes needs-based school grants. What happens when these limited and finite resources are given to people that could do without, and those who need them most are excluded from them?

In this block we’ll work with real-world data from the country of Malawi to predict cash-transfer programme beneficiaries: people who live in poverty and need government assistance to make ends meet. The data comes from McBride and Nichols’ (2018) paper Retooling poverty targeting using out-of-sample validation and machine learning.

Discussion Points

The points below are meant to help you think critically about why we’re about to embark on a machine learning targeting exercise.


If you’d like to learn more about Social Protection Policies, take a look at this video Alex has made for us, with a brilliant summary of the field.

(Yes, more videos!)

R practical

You can download the dataset by clicking on the button below.

The script below is a step-by-step guide to coding a predictive model using a linear regression. Despite its simplicity and transparency, i.e. the ease with which we can interpret its results, a linear model is not without challenges in machine learning.

  1. Preliminaries: working directory, libraries, data upload


rm(list = ls()) # this line cleans your Global Environment.
setwd("/Users/michellegonzalez/Documents/GitHub/Machine-Learning-for-Public-Policy") # set your working directory

# Libraries
cat("
If this is your first time using R, you need to install the libraries before loading them. 
To do that, you can uncomment the line that starts with install.packages(...) by removing the # symbol.    
    ")
## 
## If this is your first time using R, you need to install the libraries before loading them. 
## To do that, you can uncomment the line that starts with install.packages(...) by removing the # symbol.    
## 
#install.packages(c("dplyr", "tidyverse", "caret", "corrplot", "Hmisc", "modelsummary", "skimr", "plyr", "gt", "stargazer", "elasticnet")) # package names go inside a character vector c(...)

library(dplyr) # core package for dataframe manipulation. Usually installed and loaded with the tidyverse, but sometimes needs to be loaded in conjunction to avoid warnings.
library(tidyverse) # a large collection of packages for data manipulation and visualisation.  
library(caret) # a package with key functions that streamline the process for predictive modelling 
library(corrplot) # a package to plot correlation matrices
library(Hmisc) # a package for general-purpose data analysis 
library(modelsummary) # a package to describe model outputs
library(skimr) # a package to describe dataframes
library(plyr) # a package for data wrangling
library(gt) # a package to edit modelsummary (and other) tables
library(stargazer) # a package to visualise model output

data_malawi <- read_csv("malawi.csv") # the file is directly read from the working directory/folder previously set

  2. Get to know your data: visualisation and pre-processing


skim(data_malawi) # describes the dataset in a nice format 
Data summary
Name data_malawi
Number of rows 11280
Number of columns 38
_______________________
Column type frequency:
character 2
numeric 36
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
region 0 1 5 6 0 3 0
eatype 0 1 5 17 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lnexp_pc_month 0 1 7.360000e+00 6.800000e-01 4.7800e+00 6.890000e+00 7.310000e+00 7.760000e+00 1.106000e+01 ▁▇▇▁▁
hhsize 0 1 4.550000e+00 2.340000e+00 1.0000e+00 3.000000e+00 4.000000e+00 6.000000e+00 2.700000e+01 ▇▂▁▁▁
hhsize2 0 1 2.613000e+01 2.799000e+01 1.0000e+00 9.000000e+00 1.600000e+01 3.600000e+01 7.290000e+02 ▇▁▁▁▁
agehead 0 1 4.246000e+01 1.636000e+01 1.0000e+01 2.900000e+01 3.900000e+01 5.400000e+01 1.040000e+02 ▅▇▅▂▁
agehead2 0 1 2.070610e+03 1.618600e+03 1.0000e+02 8.410000e+02 1.521000e+03 2.916000e+03 1.081600e+04 ▇▃▁▁▁
north 0 1 1.500000e-01 3.600000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▂
central 0 1 3.800000e-01 4.900000e-01 0.0000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 ▇▁▁▁▅
rural 0 1 8.700000e-01 3.300000e-01 0.0000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 ▁▁▁▁▇
nevermarried 0 1 3.000000e-02 1.700000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▁
sharenoedu 0 1 1.700000e-01 2.600000e-01 0.0000e+00 0.000000e+00 0.000000e+00 2.500000e-01 1.000000e+00 ▇▂▁▁▁
shareread 0 1 6.100000e-01 3.800000e-01 0.0000e+00 3.300000e-01 6.700000e-01 1.000000e+00 1.000000e+00 ▅▁▅▂▇
nrooms 0 1 2.500000e+00 1.300000e+00 0.0000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.600000e+01 ▇▂▁▁▁
floor_cement 0 1 2.000000e-01 4.000000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▂
electricity 0 1 6.000000e-02 2.300000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▁
flushtoilet 0 1 3.000000e-02 1.700000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▁
soap 0 1 1.400000e-01 3.400000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▁
bed 0 1 3.200000e-01 4.700000e-01 0.0000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 ▇▁▁▁▃
bike 0 1 3.600000e-01 4.800000e-01 0.0000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 ▇▁▁▁▅
musicplayer 0 1 1.600000e-01 3.700000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▂
coffeetable 0 1 1.200000e-01 3.200000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▁
iron 0 1 2.100000e-01 4.000000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▂
dimbagarden 0 1 3.200000e-01 4.700000e-01 0.0000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 ▇▁▁▁▃
goats 0 1 2.100000e-01 4.100000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▂
dependratio 0 1 1.120000e+00 9.500000e-01 0.0000e+00 5.000000e-01 1.000000e+00 1.500000e+00 9.000000e+00 ▇▂▁▁▁
hfem 0 1 2.300000e-01 4.200000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▂
grassroof 0 1 7.400000e-01 4.400000e-01 0.0000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 ▃▁▁▁▇
mortarpestle 0 1 5.000000e-01 5.000000e-01 0.0000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 ▇▁▁▁▇
table 0 1 3.600000e-01 4.800000e-01 0.0000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 ▇▁▁▁▅
clock 0 1 2.000000e-01 4.000000e-01 0.0000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▂
ea 0 1 7.606000e+01 1.885500e+02 1.0000e+00 8.000000e+00 1.900000e+01 4.525000e+01 9.010000e+02 ▇▁▁▁▁
EA 0 1 2.372322e+07 7.241514e+06 1.0101e+07 2.040204e+07 2.090352e+07 3.053301e+07 3.120209e+07 ▂▁▆▁▇
hhwght 0 1 2.387900e+02 7.001000e+01 7.9000e+01 2.076000e+02 2.471000e+02 2.913000e+02 3.587000e+02 ▂▁▇▇▂
psu 0 1 2.372322e+07 7.241514e+06 1.0101e+07 2.040204e+07 2.090352e+07 3.053301e+07 3.120209e+07 ▂▁▆▁▇
strataid 0 1 1.560000e+01 8.090000e+00 1.0000e+00 9.000000e+00 1.500000e+01 2.200000e+01 3.000000e+01 ▅▇▇▆▅
lnzline 0 1 7.550000e+00 0.000000e+00 7.5500e+00 7.550000e+00 7.550000e+00 7.550000e+00 7.550000e+00 ▁▁▇▁▁
case_id 0 1 2.372322e+10 7.241514e+09 1.0101e+10 2.040204e+10 2.090352e+10 3.053301e+10 3.120209e+10 ▂▁▆▁▇


  • The dataset contains 38 variables and 11,280 observations.

  • Not all of these variables are relevant for our prediction model.

  • To find the labels and description of the variables, you can refer to the paper.

[hhsize, hhsize2, age_head, age_head2, regions, rural, never married, share_of_adults_without_education, share_of_adults_who_can_read, number of rooms, cement floor, electricity, flush toilet, soap, bed, bike, music player, coffee table, iron, garden, goats]

  • Luckily for us, we have no missing values (n_missing in summary output)!

    1. Many machine learning models cannot be trained when missing values are present (some exceptions exist).

    2. Dealing with missingness is a non-trivial task:

First and foremost, we should assess whether there is a pattern to missingness and, if so, 
what that means for what we can learn from our (sub)population. If there is no discernible pattern, 
we can proceed to delete the missing values or impute them. A more detailed explanation and course of action
can be found [here](https://stefvanbuuren.name/fimd/sec-MCAR.html).
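Even though skim() already reports n_missing, a quick way to double-check is to count missing values column by column (an optional sanity check, not part of the original script):

# count NA values per column; all zeros confirms there is nothing to impute or drop
colSums(is.na(data_malawi))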


Feature selection: subsetting the dataset

As part of our data pre-processing we will subset the dataframe so that only relevant variables are left. Relevant here means variables/features about a household that could help us determine whether it is in poverty. That way, we save some memory space; but also, we can call the full set of variables in a dataframe in one go!

variables to delete (not included in the identified set above):
[ea, EA, hhwght, psu, strataid, lnzline, case_id, eatype] 


N.B.: Feature selection is a critical process (and we normally don’t have a paper to guide us through it): from a practical point of view, a model with fewer predictors may be easier to interpret. Also, some models may be negatively affected by non-informative predictors. This process is similar to traditional econometric modelling, but we should not conflate predictive and explanatory modelling. Importantly, please note that we are not interested in knowing why something happens, but rather in what is likely to happen given some known data. Hence:

# object:vector that contains the names of the variables that we want to get rid of

cols <- c("ea", "EA", "psu","hhwght", "strataid", "lnzline", "case_id","eatype")


# subset of the data_malawi object:dataframe
data_malawi <- data_malawi[,-which(colnames(data_malawi) %in% cols)] # the minus sign indicates deletion of cols

A few notes for you:

  • a dataframe follows the form data[rows, columns]

  • colnames() is a function that identifies the column names of an object of class dataframe

  • which() is an R base function that gives you the position of some value

  • a minus sign will delete either the identified position in the row or the column space
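As an aside, the same subset can be obtained with dplyr, which some readers may find more legible (an equivalent alternative; run one version or the other, not both):

# drop the same columns using dplyr's select(); all_of(cols) refers to the vector of names defined above
data_malawi <- data_malawi %>% select(-all_of(cols))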

Data visualisation

A quick and effective way to take a first glance at our data is to plot histograms of relevant (numeric) features.

Recall (from the skim() dataframe summary output) that only two variables are non-numeric. However, among the numeric variables we still need to distinguish between those that should be of class factor (categorical) and those that are genuinely numeric.

# identify categorical variables to transform from class numeric to factor
# using a for-loop: print the number of unique values by variable

for (i in 1:ncol(data_malawi)) { # iterate over the length of columns in the data_malawi df
    
    # store the number of unique values in column.i 
    x <- length(unique(data_malawi[[i]]))
    
    # print the name of column.i
    print(colnames(data_malawi[i]))
    # print the number of unique values in column.i
    print(x)
    
}
## [1] "lnexp_pc_month"
## [1] 11266
## [1] "hhsize"
## [1] 19
## [1] "hhsize2"
## [1] 19
## [1] "agehead"
## [1] 88
## [1] "agehead2"
## [1] 88
## [1] "north"
## [1] 2
## [1] "central"
## [1] 2
## [1] "rural"
## [1] 2
## [1] "nevermarried"
## [1] 2
## [1] "sharenoedu"
## [1] 47
## [1] "shareread"
## [1] 28
## [1] "nrooms"
## [1] 16
## [1] "floor_cement"
## [1] 2
## [1] "electricity"
## [1] 2
## [1] "flushtoilet"
## [1] 2
## [1] "soap"
## [1] 2
## [1] "bed"
## [1] 2
## [1] "bike"
## [1] 2
## [1] "musicplayer"
## [1] 2
## [1] "coffeetable"
## [1] 2
## [1] "iron"
## [1] 2
## [1] "dimbagarden"
## [1] 2
## [1] "goats"
## [1] 2
## [1] "dependratio"
## [1] 62
## [1] "hfem"
## [1] 2
## [1] "grassroof"
## [1] 2
## [1] "mortarpestle"
## [1] 2
## [1] "table"
## [1] 2
## [1] "clock"
## [1] 2
## [1] "region"
## [1] 3
# If you want to optimise your code, using for loops is not ideal. In R, there exists a family of functions called Apply whose purpose is to apply some function to all the elements in an object. For the time being, iterating over 30 columns is fast enough and we don't need to think about optimising our code. We'll also see an example later on of how to use one of the functions from the apply family. You can also refer to this blog (https://www.r-bloggers.com/2021/05/apply-family-in-r-apply-lapply-sapply-mapply-and-tapply/) if you want to learn more about it. 


Notice that we have a few variables with 2 unique values and one variable with 3 unique values. We should transform these into factor() class. Recall from the introduction tab that in object-oriented programming correctly identifying the variable type (vector class) is crucial for data manipulation and arithmetic operations. We can do this one by one, or in one shot. I’ll give an example of both:

# == One by one == # 

# transform a variable in df to factor, and pass it on to the df to keep the change
# first, sneak peak at the first 5 observations (rows) of the vector
head(data_malawi$north) # returns 1 = North 
## [1] 1 1 1 1 1 1
data_malawi$north <- factor(data_malawi$north, levels = c(0, 1), labels = c("NotNorth", "North"))

str(data_malawi$north) # shows 2s: R stores the factor internally as integer levels (here 2 = North)
##  Factor w/ 2 levels "NotNorth","North": 2 2 2 2 2 2 2 2 2 2 ...
head(data_malawi$north) # returns north (which was = 1 before), hooray!
## [1] North North North North North North
## Levels: NotNorth North
# R stores factors internally as integers 1...n, but the labels keep track of the original values (1 = North, 0 = NotNorth)
# We have also explicitly told R that 0 is NotNorth and 1 is North (via the order of levels and labels in the call above)


# transform all binary/categorical data into factor class
min_count <- 3 # vector: 3 categories is our max number of categories found

# == Apply family: function apply and lapply==#
# apply a length(unique(x)) function to all the columns (rows = 1, columns = 2) of the data_malawi dataframe, then
# store boolean (true/false) if the number of unique values is lower or equal to the min_count vector
n_distinct2 <- apply(data_malawi, 2, function(x) length(unique(x))) <= min_count

# print(n_distinct2) # prints boolean indicator object (so you know what object you have created)

# select the identified categorical variables and transform them into factors
data_malawi[n_distinct2] <- lapply(data_malawi[n_distinct2], factor) 
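An equivalent tidyverse one-liner, for comparison (it produces the same result as the apply/lapply combination above; run one approach or the other):

# convert every column with at most min_count unique values to a factor, using dplyr
data_malawi <- data_malawi %>% mutate(across(where(~ n_distinct(.x) <= min_count), as.factor))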

Visualise your data: histograms, bar plots, tables

Let’s start with histograms of numeric (continuous) variables.

# == HISTOGRAMS == #

# Select all variables in the dataframe which are numeric, and can therefore be plotted as a histogram.
malawi_continuous <- as.data.frame(data_malawi %>% select_if(~is.numeric(.))) 

# a quick glance at the summary statistics of our continuous variables
datasummary_skim(malawi_continuous) # from modelsummary pkg, output as plot in Plot Viewer
Unique (#) Missing (%) Mean SD Min Median Max
lnexp_pc_month 11266 0 7.4 0.7 4.8 7.3 11.1
hhsize 19 0 4.5 2.3 1.0 4.0 27.0
hhsize2 19 0 26.1 28.0 1.0 16.0 729.0
agehead 88 0 42.5 16.4 10.0 39.0 104.0
agehead2 88 0 2070.6 1618.6 100.0 1521.0 10816.0
sharenoedu 47 0 0.2 0.3 0.0 0.0 1.0
shareread 28 0 0.6 0.4 0.0 0.7 1.0
nrooms 16 0 2.5 1.3 0.0 2.0 16.0
dependratio 62 0 1.1 0.9 0.0 1.0 9.0
# Hmisc package, quick and painless hist.data.frame() function
# but first, make sure to adjust the number of rows and columns to be displayed on your Plot Viewer
par(mfrow = c(3, 3)) # 3 rows * 3 columns (9 variables)
hist.data.frame(malawi_continuous)

Now, let’s plot bar graphs and write tables of factor (categorical) variables:

# == BAR GRAPHS == #
malawi_factor <- data_malawi %>% select_if(~is.factor(.)) # subset of the data with all factor variables

par(mfrow = c(3, 7)) # 3 rows * 7 columns (21 factor variables = ncol of the factor df)

for (i in 1:ncol(malawi_factor)) { # Loop over all the columns in the factor df subset
    
    # store data in column.i as x
    x <- malawi_factor[,i]
    
    # store name of column.i as x_name
    x_name <- colnames(malawi_factor[i])
    
    # Plot bar graph of x using Rbase plotting tools
    barplot(table(x),
            main = paste(x_name)
    )
    
}

# == TABLES == #

# We can also show tables of all factor variables (to get precise frequencies not displayed in bar plots)
llply(.data=malawi_factor, .fun=table) # create tables of all the variables in dataframe using the plyr package
## $north
## 
## NotNorth    North 
##     9600     1680 
## 
## $central
## 
##    0    1 
## 6960 4320 
## 
## $rural
## 
##    0    1 
## 1440 9840 
## 
## $nevermarried
## 
##     0     1 
## 10930   350 
## 
## $floor_cement
## 
##    0    1 
## 9036 2244 
## 
## $electricity
## 
##     0     1 
## 10620   660 
## 
## $flushtoilet
## 
##     0     1 
## 10962   318 
## 
## $soap
## 
##    0    1 
## 9740 1540 
## 
## $bed
## 
##    0    1 
## 7653 3627 
## 
## $bike
## 
##    0    1 
## 7194 4086 
## 
## $musicplayer
## 
##    0    1 
## 9426 1854 
## 
## $coffeetable
## 
##    0    1 
## 9956 1324 
## 
## $iron
## 
##    0    1 
## 8962 2318 
## 
## $dimbagarden
## 
##    0    1 
## 7658 3622 
## 
## $goats
## 
##    0    1 
## 8856 2424 
## 
## $hfem
## 
##    0    1 
## 8697 2583 
## 
## $grassroof
## 
##    0    1 
## 2953 8327 
## 
## $mortarpestle
## 
##    0    1 
## 5635 5645 
## 
## $table
## 
##    0    1 
## 7249 4031 
## 
## $clock
## 
##    0    1 
## 9039 2241 
## 
## $region
## 
## Centre  North  South 
##   4320   1680   5280


What have we learned from the data visualisation?

  1. Nothing worrying about the data itself. McBride and Nichols did a good job of pre-processing the data for us. No pesky missing values, or unknown categories.

  2. From the bar plots, we can see that for the most part, people tend not to own assets. Worryingly, there is a lack of soap, flush toilets and electricity, all of which are crucial for human capital (health and education).

  3. From the histograms, we can see that log per capita expenditure is roughly normally distributed, but if we remove the log, it’s heavily skewed. Poverty is endemic. Households tend not to have many educated individuals, and their size is non-trivially large (with fewer rooms than people need).

Relationships between features

To finalise our exploration of the dataset, we should define:

  • the target variable (a.k.a. outcome of interest)

  • correlational insights

Let’s visualise two distinct correlation matrices; for our numeric dataframe, which includes our target variable, we will plot a Pearson r correlation matrix. For our factor dataframe, to which we will add our continuous target, we will plot a Spearman rho correlation matrix. Both types of correlation coefficient are interpreted in the same way (0 = no correlation, 1 = perfect positive correlation, -1 = perfect negative correlation).

# = = PEARSON CORRELATION MATRIX = = #

M <- cor(malawi_continuous) # create a correlation matrix of the continuous dataset, cor() uses Pearson's correlation coefficient as default. This means we can only take the correlation between continuous variables
corrplot(M, method="circle", addCoef.col ="black", number.cex = 0.8) # visualise it in a nice way

We can already tell that household size and the dependency ratio are strongly negatively correlated with our target variable.

# = = SPEARMAN CORRELATION MATRIX = = #

malawi_factorC <- as.data.frame(lapply(malawi_factor,as.numeric)) # coerce dataframe to numeric, as the cor() command only takes in numeric types
malawi_factorC$lnexp_pc_month <- malawi_continuous$lnexp_pc_month # include target variable in the dataframe

M2 <- cor(malawi_factorC, method = "spearman")
corrplot(M2, method="circle", addCoef.col ="black", number.cex = 0.3) # visualise it in a nice way

Ownership of some assets stands out: soap, a cement floor, electricity, a bed… ownership of these (and a couple of other) assets is positively correlated with per capita expenditure. Living in a rural area, on the other hand, is negatively correlated with our target variable.

We can also spot some correlation coefficients that equal zero. In some situations, the data generating mechanism can create predictors that only have a single unique value (i.e. a “zero-variance predictor”). For many ML models (excluding tree-based models), this may cause the model to crash or the fit to be unstable. Here, the only \(0\) we’ve spotted is not in relation to our target variable.

But we do observe some near-zero-variance predictors. Besides being uninformative, these can also create unstable model fits. There are a few strategies to deal with them; the quickest solution is to remove them. A second option, which is especially interesting in scenarios with a large number of predictors/variables, is to work with penalised models. We’ll discuss this option below.
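If you want to flag these predictors explicitly, the caret package provides the nearZeroVar() function; an optional diagnostic, not part of the original script:

# compute frequency ratios and flag zero- and near-zero-variance predictors (caret)
nzv <- nearZeroVar(data_malawi, saveMetrics = TRUE)
nzv[nzv$nzv == TRUE, ] # rows (predictors) flagged as near-zero-variance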

  3. Model fit: data partition and performance evaluation parameters


We now have a general idea of the structure of the data we are working with, and of what we’re trying to predict: per capita expenditure, which we take as a proxy for poverty. In our dataset it is measured as the log of per capita monthly expenditure, in the variable named lnexp_pc_month.

The next step is to create a simple linear model (OLS) to predict per capita expenditure using the variables in our dataset, and to introduce the elements with which we will evaluate our model.

Data Partitioning

When we want to build predictive models for machine learning purposes, it is important to have (at least) two data sets: a training data set from which our model will learn, and a test data set containing the same features as the training data. We use the second dataset to see how well our predictive model extrapolates to other samples (i.e. is it generalisable?). To split our main data set into two, we will work with an 80/20 split.

The 80/20 split has its origins in the Pareto Principle, which states that ‘in most cases, 80% of effects come from 20% of causes’. Though there are other test/train splitting options, this partitioning method is a good place to start, and is indeed standard in the machine learning field.

# First, set a seed to guarantee replicability of the process
set.seed(1234) # you can use any number you want, but to replicate the results in this tutorial you need to use this number

# We could split the data manually, but the caret package includes a useful function

train_idx <- createDataPartition(data_malawi$lnexp_pc_month, p = .8, list = FALSE, times = 1)
head(train_idx) # notice that the 5th entry of the resample index is observation 7, and so on. We're shuffling and picking!
##      Resample1
## [1,]         1
## [2,]         2
## [3,]         4
## [4,]         5
## [5,]         7
## [6,]         8
Train_df <- data_malawi[ train_idx,]
Test_df  <- data_malawi[-train_idx,]

# Note that we have created training and testing dataframes as an 80/20 split of the original dataset.

Prediction with Linear Models

We will start by fitting a predictive model using the training dataset; that is, our target variable, the log of monthly per capita expenditure or lnexp_pc_month, will be the dependent variable \(Y\) in a linear model \(Y_i = \alpha + x_i'\beta + \epsilon_i\), and the remaining features in the data frame form the row vector of predictors \(x_i'\).

model1 <- lm(lnexp_pc_month ~ ., Train_df) # the dot after the tilde ~ asks the lm() function to use all other variables in the dataframe as predictors of the dependent variable lnexp_pc_month

stargazer(model1, type = "text") # printed in the console as text
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                           lnexp_pc_month       
## -----------------------------------------------
## hhsize                       -0.292***         
##                               (0.006)          
##                                                
## hhsize2                      0.012***          
##                              (0.0005)          
##                                                
## agehead                        0.003           
##                               (0.002)          
##                                                
## agehead2                    -0.00004**         
##                              (0.00002)         
##                                                
## northNorth                   0.080***          
##                               (0.014)          
##                                                
## central1                     0.252***          
##                               (0.010)          
##                                                
## rural1                       -0.055***         
##                               (0.017)          
##                                                
## nevermarried1                0.271***          
##                               (0.028)          
##                                                
## sharenoedu                   -0.105***         
##                               (0.020)          
##                                                
## shareread                    0.075***          
##                               (0.015)          
##                                                
## nrooms                       0.039***          
##                               (0.004)          
##                                                
## floor_cement1                0.102***          
##                               (0.017)          
##                                                
## electricity1                 0.372***          
##                               (0.027)          
##                                                
## flushtoilet1                 0.329***          
##                               (0.033)          
##                                                
## soap1                        0.215***          
##                               (0.014)          
##                                                
## bed1                         0.104***          
##                               (0.013)          
##                                                
## bike1                        0.094***          
##                               (0.011)          
##                                                
## musicplayer1                 0.111***          
##                               (0.015)          
##                                                
## coffeetable1                 0.137***          
##                               (0.019)          
##                                                
## iron1                        0.130***          
##                               (0.014)          
##                                                
## dimbagarden1                 0.102***          
##                               (0.010)          
##                                                
## goats1                       0.080***          
##                               (0.012)          
##                                                
## dependratio                  -0.045***         
##                               (0.006)          
##                                                
## hfem1                        -0.066***         
##                               (0.012)          
##                                                
## grassroof1                   -0.096***         
##                               (0.016)          
##                                                
## mortarpestle1                0.033***          
##                               (0.010)          
##                                                
## table1                       0.051***          
##                               (0.011)          
##                                                
## clock1                       0.058***          
##                               (0.014)          
##                                                
## regionNorth                                    
##                                                
##                                                
## regionSouth                                    
##                                                
##                                                
## Constant                     7.978***          
##                               (0.044)          
##                                                
## -----------------------------------------------
## Observations                   9,024           
## R2                             0.599           
## Adjusted R2                    0.597           
## Residual Std. Error      0.428 (df = 8995)     
## F Statistic         479.041*** (df = 28; 8995) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
# We can also estimate iid and robust standard errors, using the modelsummary package
ms <- modelsummary(model1,
             vcov = list("iid","robust"), # include iid and HC3 (robust) standard errors
             statistic = c("p = {p.value}","s.e. = {std.error}"),
             stars = TRUE,
             output = "gt"
             ) # plotted as an image / object in "gt" format

ms %>% tab_header(
    title = md("**Linear Models with iid and robust s.e.**"),
    subtitle = md("Target: (log) per capita monthly expenditure")
)
Linear Models with iid and robust s.e.
Target: (log) per capita monthly expenditure
(1) (2)
(Intercept) 7.978*** 7.978***
p = <0.001 p = <0.001
s.e. = 0.044 s.e. = 0.069
hhsize -0.292*** -0.292***
p = <0.001 p = <0.001
s.e. = 0.006 s.e. = 0.029
hhsize2 0.012*** 0.012***
p = <0.001 p = <0.001
s.e. = 0.000 s.e. = 0.003
agehead 0.003 0.003
p = 0.108 p = 0.137
s.e. = 0.002 s.e. = 0.002
agehead2 0.000* 0.000*
p = 0.015 p = 0.032
s.e. = 0.000 s.e. = 0.000
northNorth 0.080*** 0.080***
p = <0.001 p = <0.001
s.e. = 0.014 s.e. = 0.015
central1 0.252*** 0.252***
p = <0.001 p = <0.001
s.e. = 0.010 s.e. = 0.010
rural1 -0.055** -0.055**
p = 0.001 p = 0.001
s.e. = 0.017 s.e. = 0.017
nevermarried1 0.271*** 0.271***
p = <0.001 p = <0.001
s.e. = 0.028 s.e. = 0.037
sharenoedu -0.105*** -0.105***
p = <0.001 p = <0.001
s.e. = 0.020 s.e. = 0.021
shareread 0.075*** 0.075***
p = <0.001 p = <0.001
s.e. = 0.015 s.e. = 0.015
nrooms 0.039*** 0.039***
p = <0.001 p = <0.001
s.e. = 0.004 s.e. = 0.004
floor_cement1 0.102*** 0.102***
p = <0.001 p = <0.001
s.e. = 0.017 s.e. = 0.018
electricity1 0.372*** 0.372***
p = <0.001 p = <0.001
s.e. = 0.027 s.e. = 0.029
flushtoilet1 0.329*** 0.329***
p = <0.001 p = <0.001
s.e. = 0.033 s.e. = 0.042
soap1 0.215*** 0.215***
p = <0.001 p = <0.001
s.e. = 0.014 s.e. = 0.014
bed1 0.104*** 0.104***
p = <0.001 p = <0.001
s.e. = 0.013 s.e. = 0.013
bike1 0.094*** 0.094***
p = <0.001 p = <0.001
s.e. = 0.011 s.e. = 0.011
musicplayer1 0.111*** 0.111***
p = <0.001 p = <0.001
s.e. = 0.015 s.e. = 0.015
coffeetable1 0.137*** 0.137***
p = <0.001 p = <0.001
s.e. = 0.019 s.e. = 0.019
iron1 0.130*** 0.130***
p = <0.001 p = <0.001
s.e. = 0.014 s.e. = 0.014
dimbagarden1 0.102*** 0.102***
p = <0.001 p = <0.001
s.e. = 0.010 s.e. = 0.010
goats1 0.080*** 0.080***
p = <0.001 p = <0.001
s.e. = 0.012 s.e. = 0.012
dependratio -0.045*** -0.045***
p = <0.001 p = <0.001
s.e. = 0.006 s.e. = 0.009
hfem1 -0.066*** -0.066***
p = <0.001 p = <0.001
s.e. = 0.012 s.e. = 0.013
grassroof1 -0.096*** -0.096***
p = <0.001 p = <0.001
s.e. = 0.016 s.e. = 0.016
mortarpestle1 0.033** 0.033**
p = 0.001 p = 0.001
s.e. = 0.010 s.e. = 0.010
table1 0.051*** 0.051***
p = <0.001 p = <0.001
s.e. = 0.011 s.e. = 0.012
clock1 0.058*** 0.058***
p = <0.001 p = <0.001
s.e. = 0.014 s.e. = 0.014
Num.Obs. 9024 9024
R2 0.599 0.599
R2 Adj. 0.597 0.597
AIC 10317.3 10317.3
BIC 10530.5 10530.5
Log.Lik. -5128.658 -5128.658
RMSE 0.43 0.43
Std.Errors IID HC3
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
# You can see that the variable region was not included in the output (this is because we already have dummies for the north and central regions). We may need to clean our dataset in some further steps. 

Recall that one of the assumptions of a linear model is that errors are independent and identically distributed. We could run formal tests to check this, but the close agreement between the iid and robust (HC3) standard errors in the modelsummary output table already suggests that this is not something to worry about in our estimations.
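If you do want a formal check, one common option is the Breusch-Pagan test for heteroskedasticity from the lmtest package (not loaded in the preliminaries); a minimal sketch:

# install.packages("lmtest") # if needed
library(lmtest) # diagnostic tests for linear models
bptest(model1) # Breusch-Pagan test: a large p-value is consistent with homoskedastic errors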

Besides the regression output table, we can also visualise the magnitude and significance of the coefficients with a plot. The further away the variable (dot) marker is from the \(0\) line, the larger the magnitude.

modelplot(model1) + 
          aes(color = ifelse(p.value < 0.001, "Significant", "Not significant")) +
              scale_color_manual(values = c("grey", "black")
                                 )

           # grey points indicate statistically insignificant (p > 0.001) coefficient estimates
           # The scale of the plot is large due to the intercept estimate

Not unlike what we saw in our correlation matrix, household size, electricity, having a flush toilet… the magnitude of the impact of these variables on monthly per capita expenditure is significantly larger than that of other assets/predictors.

Performance Indicators

In predictive modelling, we are interested in the following performance metrics:

  • Model residuals: recall residuals are the observed value minus the predicted value. We can estimate a model’s Root Mean Squared Error (RMSE) or the Mean Absolute Error (MAE). Residuals allow us to quantify the extent to which the predicted response value (for a given observation) is close to the true response value. Small RMSE or MAE values indicate that the prediction is close to the true observed value.

  • The p-values: represented by stars *** (against a pre-defined critical threshold, e.g. 0.05), they point to the predictive power of each feature in the model; i.e. that the estimated association is unlikely to be due to chance. In the same vein, the magnitude of the coefficient is also important, especially given that we are interested in predictive power rather than causal interpretation.

  • The R-squared: arguably the go-to indicator for performance assessment. Low R^2 values are not uncommon, especially in the social sciences. However, when hoping to use a model for predictive purposes, a low R^2 is a bad sign, a large number of statistically significant features notwithstanding. The drawback of relying solely on this measure is that it does not take into consideration the problem of model over-fitting; i.e. you can inflate the R-squared by adding as many variables as you want, even if those variables have little predictive power. This will yield great results in the training data, but will underperform when extrapolating the model to the test (or indeed any other) data set.
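As a reference for the residual-based metrics above, here is a minimal sketch of how RMSE and MAE can be computed by hand; observed and predicted are placeholder numeric vectors, not objects created elsewhere in this script:

# RMSE: square root of the mean squared residual; MAE: mean of the absolute residuals
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
mae  <- function(observed, predicted) mean(abs(observed - predicted))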

Residuals

# = = Model Residuals = = #
print(summary(model1$residuals))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.483927 -0.288657 -0.007872  0.000000  0.266950  1.887123

Recall that the residual is estimated as the true (observed) value minus the predicted value. Thus, the max(imum) residual of 1.89 suggests that the model under-predicted expenditure by circa (log) $2 (roughly $6.5 in levels) for at least one observation. Fifty per cent of the residuals (those between the first and third quartiles) lie between roughly -0.29 and 0.27, i.e. for the middle half of observations the predictions are within about (log) $0.3 of the true value. From the prediction residuals we obtain the popular measure of performance evaluation known as the Root Mean Squared Error (RMSE, for short).

# Calculate the RMSE for the training dataset, or the in-sample RMSE.

# 1. Predict values on the training dataset
p0 <- predict(model1, Train_df)

# 2. Obtain the errors (predicted values minus observed values of target variable)
error0 <- p0 - Train_df[["lnexp_pc_month"]]

# 3. In-sample RMSE
RMSE_insample <- sqrt(mean(error0 ^ 2))
print(RMSE_insample)
## [1] 0.4271572
# TIP: Notice that the in-sample RMSE we have just estimated was also printed in the linear model (regression) output tables!
# The table printed with the stargazer package returns this value at the bottom, under the header Residual Std. Error (0.428)
# The table printed with the modelsummary package returns this value at the bottom, under the header RMSE (0.43), rounded up. 

The RMSE (0.4271) gives us an absolute number that indicates how much our predicted values deviate from the true (observed) number. This is all in reference to the target vector (a.k.a. our outcome variable). Think of the question, how far, on average, are the residuals away from zero? Generally speaking, the lower the value, the better the model fit. Besides being a good measure of goodness of fit, the RMSE is also useful for comparing the ability of our model to make predictions on different (e.g. test) data sets. The in-sample RMSE should be close or equal to the out-of-sample RMSE.

In this case, our RMSE is ~0.4 units away from zero. Given the range of the target variable (roughly 4 to 11), the number seems to be relatively small and close enough to zero.
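One quick, informal way to put this number in context is to scale the RMSE by the range of the target variable (a common heuristic, not something done in the original script):

# normalised RMSE: in-sample RMSE divided by the range of the observed target in the training data
# values well below 0.1 suggest errors are small relative to the spread of the target
RMSE_insample / diff(range(Train_df$lnexp_pc_month))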

P-values

Recall the large number of statistically significant features in our model. The coefficient plot, where statistical significance is defined by a p-value threshold of 0.001, a strict rule (given the popularity of the more relaxed 0.05 critical value), shows that only 4 out of 29 features/variables do not meet this criterion. We can conclude that the features we have selected are relevant predictors.

R-squared

print(summary(model1))
## 
## Call:
## lm(formula = lnexp_pc_month ~ ., data = Train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4839 -0.2887 -0.0079  0.2669  1.8871 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.978e+00  4.409e-02 180.945  < 2e-16 ***
## hhsize        -2.919e-01  6.429e-03 -45.397  < 2e-16 ***
## hhsize2        1.206e-02  4.811e-04  25.067  < 2e-16 ***
## agehead        2.749e-03  1.712e-03   1.606  0.10835    
## agehead2      -4.168e-05  1.717e-05  -2.428  0.01519 *  
## northNorth     8.001e-02  1.439e-02   5.560 2.78e-08 ***
## central1       2.521e-01  1.023e-02  24.639  < 2e-16 ***
## rural1        -5.483e-02  1.687e-02  -3.250  0.00116 ** 
## nevermarried1  2.706e-01  2.839e-02   9.533  < 2e-16 ***
## sharenoedu    -1.045e-01  2.006e-02  -5.212 1.91e-07 ***
## shareread      7.514e-02  1.502e-02   5.004 5.72e-07 ***
## nrooms         3.898e-02  4.032e-03   9.670  < 2e-16 ***
## floor_cement1  1.016e-01  1.742e-02   5.835 5.58e-09 ***
## electricity1   3.718e-01  2.696e-02  13.793  < 2e-16 ***
## flushtoilet1   3.295e-01  3.263e-02  10.099  < 2e-16 ***
## soap1          2.150e-01  1.413e-02  15.220  < 2e-16 ***
## bed1           1.040e-01  1.297e-02   8.019 1.20e-15 ***
## bike1          9.395e-02  1.064e-02   8.831  < 2e-16 ***
## musicplayer1   1.111e-01  1.451e-02   7.658 2.08e-14 ***
## coffeetable1   1.369e-01  1.866e-02   7.338 2.36e-13 ***
## iron1          1.302e-01  1.378e-02   9.455  < 2e-16 ***
## dimbagarden1   1.018e-01  1.017e-02  10.017  < 2e-16 ***
## goats1         7.964e-02  1.168e-02   6.820 9.68e-12 ***
## dependratio   -4.501e-02  5.848e-03  -7.697 1.55e-14 ***
## hfem1         -6.570e-02  1.238e-02  -5.307 1.14e-07 ***
## grassroof1    -9.590e-02  1.575e-02  -6.089 1.18e-09 ***
## mortarpestle1  3.288e-02  1.027e-02   3.200  0.00138 ** 
## table1         5.074e-02  1.149e-02   4.417 1.01e-05 ***
## clock1         5.840e-02  1.423e-02   4.103 4.11e-05 ***
## regionNorth           NA         NA      NA       NA    
## regionSouth           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4278 on 8995 degrees of freedom
## Multiple R-squared:  0.5986, Adjusted R-squared:  0.5973 
## F-statistic:   479 on 28 and 8995 DF,  p-value: < 2.2e-16

We have printed our model’s output once more, now using the R base command summary(). The other packages (stargazer and modelsummary) are great if you want to export your results in text, html or latex format, but they are not necessary if you just want to print your output. The estimated (Multiple) R-squared of 0.59 tells us that our model predicts around 60 per cent of the variation in the dependent variable (our target, the log of per capita monthly expenditure). Also note that when we have a large number of predictors, it’s best to look at the Adjusted R-squared (of 0.59), which corrects for this by only increasing when a new feature improves the model more than would be expected by chance.

Out of sample predictions

Now that we have built and evaluated our model in the training dataset, we can proceed to make out-of-sample predictions. That is, see how our model performs in the test dataset.

p <- predict(model1, Test_df)

cat("observed summary statistics of target variable in full dataset")
## observed summary statistics of target variable in full dataset
print(summary(data_malawi$lnexp_pc_month))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.777   6.893   7.305   7.359   7.758  11.064
cat("predictions based on the training dataset")
## predictions based on the training dataset
print(summary(p0))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.011   6.980   7.308   7.357   7.661  10.143
cat("predictions from the testing dataset")
## predictions from the testing dataset
print(summary(p))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.052   6.999   7.323   7.375   7.670   9.774

The summary statistics for the predictions with the train and test datasets are very close to one another. This is an indication that our model extrapolates well to other datasets. Compared to the observed summary statistics of the target variable, they’re relatively close, with the largest deviations observed at the minimum and maximum values.

The bias-variance tradeoff in practice

We previously mentioned that the RMSE metric could also be used to compare between train and test model predictions. Let us estimate the out-of-sample RMSE:

error <- p - Test_df[["lnexp_pc_month"]] # predicted values minus actual values
RMSE_test <- sqrt(mean(error^2))
print(RMSE_test) # this is known as the out-of-sample RMSE
## [1] 0.4284404

Notice that the in-sample RMSE [\(0.4271572\)] is very close to the out-of-sample RMSE [\(0.4284404\)]. This means that our model makes consistent predictions across different data sets. We also know by now that these predictions are reasonably good: the model accounts for around 60 per cent of the variation in the target. What we are observing here is a model that has found a balance between bias and variance. Both measures can still improve, however. For one, McBride and Nichols (2018) report over 80 per cent accuracy in their poverty predictions for Malawi. But please note that, at this point, we are not replicating their approach: they use a classification model, which means that they first applied a poverty-line score and split the sample into individuals below and above the poverty line.

What do we mean by bias in a model?

The bias is the difference between the average prediction of our model and the true (observed) value. Minimising the bias is analogous to minimising the RMSE.

What do we mean by variance in a model?

It is the variability of our model’s prediction for a given data point, i.e. how much the prediction would change if the model were fitted to a different sample. A model with high variance would yield low error values in the training data but high errors in the test data.

Hence, consistent in and out of sample RMSE scores = bias/variance balance.
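For reference, the expected squared prediction error at a point \(x\) decomposes into these two quantities plus irreducible noise (a standard result, stated here for intuition only):

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{irreducible error}}
\]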

Fine-tuning model parameters

Can we fine-tune model parameters? The quick answer is yes! Every algorithm has a set of parameters that can be adjusted/fine-tuned to improve our estimations. Even in the case of a simple linear model, we can try our hand at fine-tuning with, for example, cross-validation.

What is cross-validation?

Broadly speaking, it is a technique that allows us to assess the performance of our machine learning model. How so? Well, it looks at the ‘stability’ of the model. It’s a measure of how well our model would work on new, unseen data (is this ringing a bell yet?); i.e. it has correctly observed and recorded the patterns in the data and has not captured too much noise (what we know as the error term, or what we are unable to explain with our model). K-fold cross-validation is a good place to start for such a thing. In the words of The Internet™, what k-fold cross validation does is:

Split the input dataset into K groups
    For each group:
        - Take one group as the reserve or test data set.
        - Use remaining groups as the training dataset.
        - Fit the model on the training set and evaluate the performance of the model using the test set.

TL;DR We’re improving our splitting technique!

An Illustration by Eugenia Anello

# Let's rumble! 

set.seed(12345)

# create an object that defines the training method as cross-validation and number of folds (caret pkg)
cv_10fold <- trainControl(
    method = "cv", #cross-validation
    number = 10 # k-fold = 10-fold (split the data into 10 similar-sized samples)
)

set.seed(12345)

# train a model 
ols_kfold <- train(
    lnexp_pc_month ~ .,
    data = Train_df,
    method = 'lm', # runs a linear regression model (or ols)
    trControl = cv_10fold # use 10 folds to cross-validate
)
ols_kfold2 <- train(
    lnexp_pc_month ~ .,
    data = Test_df,
    method = 'lm', # runs a linear regression model (or ols)
    trControl = cv_10fold # use 10 folds to cross-validate
)

ols_kfold3 <- train(
    lnexp_pc_month ~ .,
    data = data_malawi,
    method = 'lm', # runs a linear regression model (or ols)
    trControl = cv_10fold # use 10 folds to cross-validate
)

cat(" Linear model with train dataset")
##  Linear model with train dataset
print(ols_kfold)
## Linear Regression 
## 
## 9024 samples
##   29 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 8121, 8121, 8123, 8122, 8121, 8121, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.4295064  0.5947776  0.3365398
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
cat(" Linear model with test dataset")
##  Linear model with test dataset
print(ols_kfold2)
## Linear Regression 
## 
## 2256 samples
##   29 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2030, 2030, 2030, 2030, 2031, 2031, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.4312944  0.5991526  0.3379473
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
cat(" Linear model with full dataset")
##  Linear model with full dataset
print(ols_kfold3)
## Linear Regression 
## 
## 11280 samples
##    29 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10152, 10152, 10152, 10152, 10152, 10152, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.4291473  0.5965825  0.3363802
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Having cross-validated our model with 10 folds, we can see that the results are virtually the same. We have an RMSE of around .43, an R^2 of .59 (and an MAE of .33). We ran the same model on the train, test and full datasets and find that the results are consistent. Recall that the R^2 tells us about the model’s predictive ability, and the RMSE and MAE about the model’s accuracy.

NOTE: k-fold cross-validation replaces our original 80/20 split (we don’t need to do that anymore!). Therefore, we use the full dataset in the train() function. Using the train, test and full datasets in the example above was just for pedagogical purposes; it is no longer necessary, and we can use the reported estimates of R^2 (close to 1!), RMSE (close to 0!) and MAE (let’s go low!) for model assessment without comparing them to another set of predictions. These reported parameters are estimated as the average of the R^2, RMSE, and MAE across all the folds.

  4. Feature selection with Lasso linear regression


An OLS regression is not the only model that can be written in the form \(Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi} + u_i\). In this section we will discuss ‘penalised’ models, which can also be expressed as a linear relationship between parameters. Penalised regression models are also known as regression shrinkage methods, and they take their name after the colloquial term for coefficient regularisation, ‘shrinkage’ of estimated coefficients. The goal of penalised models, as opposed to a traditional linear model, is not to minimise bias (the least squares approach), but to reduce variance by adding a constraint to the equation, effectively pushing coefficient parameters towards 0. As a result, the worst model predictors end up with coefficients of zero or close to zero.

Consider a scenario where you have hundreds of predictors. Which covariates are truly important for our known outcome? Including all of the predictors leads to over-fitting. We’ll find that the R^2 value is high, and conclude that our in-sample fit is good. However, this may lead to bad out-of-sample predictions. Model selection is a particularly challenging endeavour when we encounter high-dimensional data; i.e. when the number of variables is close to or larger than the number of observations. Some examples where you may encounter high-dimensional data include:

  1. Cross-country analyses: we have a small and finite number of countries, but we may collect/observe as many variables as we want.

  2. Cluster-population analyses: we wish to understand the outcome of some unique population \(n\), e.g. all students from classroom A. We collect plenty of information on these students, but the sample and the population are analogous \(n = N\), and thus the sample number of observations is small and finite.

The LASSO (Least Absolute Shrinkage and Selection Operator) imposes a shrinking penalty on predictors that do not actually belong in the model, and reduces the size of the estimated \(\beta\) coefficients towards and including zero (when the tuning parameter / shrinkage penalty \(\lambda\) is sufficiently large). The penalty, known as the L1-norm, corresponds to the sum of the absolute values of the coefficients, and \(\lambda\) governs how heavily it is applied.
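For reference, the lasso coefficients minimise the penalised least-squares criterion below (stated for intuition; setting \(\lambda = 0\) takes us back to OLS):

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \Big( Y_i - \alpha - \sum_{j=1}^{p} \beta_j X_{ji} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\]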

What does this mean for our case? Well, McBride and Nichols (2018) used two sources to compile their final dataset (which we are using). One of those sources is the Living Standards Measurement Study (LSMS) from the World Bank Group. These surveys collect more than 500 variables/ features per country case. You can subset the dataset with critical reasoning: using your knowledge of the phenomenon, which features might best describe it? But even then, we might still end up with a large number of features. Perhaps a LASSO approach could work here.

Our own data is not ideal for the approach, because it has a small number of features (\(29\)), compared to the large number of observations (\(11,280\)). But let’s run a lasso regression and see if my statement holds. How would I know if it holds? If the Lasso regression returns the same values for the performance assessment parameters as the OLS regression; this means that the penalisation parameter was either zero (leading back to an OLS) or too small to make a difference.

### small detour ... 
summary(model1)
## 
## Call:
## lm(formula = lnexp_pc_month ~ ., data = Train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4839 -0.2887 -0.0079  0.2669  1.8871 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.978e+00  4.409e-02 180.945  < 2e-16 ***
## hhsize        -2.919e-01  6.429e-03 -45.397  < 2e-16 ***
## hhsize2        1.206e-02  4.811e-04  25.067  < 2e-16 ***
## agehead        2.749e-03  1.712e-03   1.606  0.10835    
## agehead2      -4.168e-05  1.717e-05  -2.428  0.01519 *  
## northNorth     8.001e-02  1.439e-02   5.560 2.78e-08 ***
## central1       2.521e-01  1.023e-02  24.639  < 2e-16 ***
## rural1        -5.483e-02  1.687e-02  -3.250  0.00116 ** 
## nevermarried1  2.706e-01  2.839e-02   9.533  < 2e-16 ***
## sharenoedu    -1.045e-01  2.006e-02  -5.212 1.91e-07 ***
## shareread      7.514e-02  1.502e-02   5.004 5.72e-07 ***
## nrooms         3.898e-02  4.032e-03   9.670  < 2e-16 ***
## floor_cement1  1.016e-01  1.742e-02   5.835 5.58e-09 ***
## electricity1   3.718e-01  2.696e-02  13.793  < 2e-16 ***
## flushtoilet1   3.295e-01  3.263e-02  10.099  < 2e-16 ***
## soap1          2.150e-01  1.413e-02  15.220  < 2e-16 ***
## bed1           1.040e-01  1.297e-02   8.019 1.20e-15 ***
## bike1          9.395e-02  1.064e-02   8.831  < 2e-16 ***
## musicplayer1   1.111e-01  1.451e-02   7.658 2.08e-14 ***
## coffeetable1   1.369e-01  1.866e-02   7.338 2.36e-13 ***
## iron1          1.302e-01  1.378e-02   9.455  < 2e-16 ***
## dimbagarden1   1.018e-01  1.017e-02  10.017  < 2e-16 ***
## goats1         7.964e-02  1.168e-02   6.820 9.68e-12 ***
## dependratio   -4.501e-02  5.848e-03  -7.697 1.55e-14 ***
## hfem1         -6.570e-02  1.238e-02  -5.307 1.14e-07 ***
## grassroof1    -9.590e-02  1.575e-02  -6.089 1.18e-09 ***
## mortarpestle1  3.288e-02  1.027e-02   3.200  0.00138 ** 
## table1         5.074e-02  1.149e-02   4.417 1.01e-05 ***
## clock1         5.840e-02  1.423e-02   4.103 4.11e-05 ***
## regionNorth           NA         NA      NA       NA    
## regionSouth           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4278 on 8995 degrees of freedom
## Multiple R-squared:  0.5986, Adjusted R-squared:  0.5973 
## F-statistic:   479 on 28 and 8995 DF,  p-value: < 2.2e-16
# notice that the region dummies duplicate information already captured by the north and central dummies, and therefore return NA in the output. We should clean this before proceeding.
# redundant (perfectly collinear) predictors are troublesome, much like the zero-variance predictors we discussed earlier
data_malawi2 <- data_malawi[,-which(colnames(data_malawi)=="region")]
summary(lm(lnexp_pc_month ~ ., data = data_malawi2)) # no more excluded variables (recall the NAs in our original linear model, caused by the duplicated region information)
## 
## Call:
## lm(formula = lnexp_pc_month ~ ., data = data_malawi2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6626 -0.2906 -0.0096  0.2665  1.8788 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.965e+00  3.947e-02 201.781  < 2e-16 ***
## hhsize        -2.979e-01  5.854e-03 -50.893  < 2e-16 ***
## hhsize2        1.246e-02  4.401e-04  28.307  < 2e-16 ***
## agehead        3.545e-03  1.529e-03   2.319 0.020433 *  
## agehead2      -5.079e-05  1.532e-05  -3.315 0.000918 ***
## northNorth     7.452e-02  1.292e-02   5.766 8.32e-09 ***
## central1       2.552e-01  9.119e-03  27.979  < 2e-16 ***
## rural1        -5.488e-02  1.507e-02  -3.642 0.000272 ***
## nevermarried1  2.416e-01  2.507e-02   9.638  < 2e-16 ***
## sharenoedu    -1.053e-01  1.802e-02  -5.845 5.21e-09 ***
## shareread      7.838e-02  1.344e-02   5.832 5.63e-09 ***
## nrooms         4.182e-02  3.609e-03  11.587  < 2e-16 ***
## floor_cement1  9.460e-02  1.546e-02   6.119 9.73e-10 ***
## electricity1   3.719e-01  2.387e-02  15.579  < 2e-16 ***
## flushtoilet1   3.471e-01  2.886e-02  12.027  < 2e-16 ***
## soap1          2.163e-01  1.261e-02  17.154  < 2e-16 ***
## bed1           1.051e-01  1.156e-02   9.099  < 2e-16 ***
## bike1          9.405e-02  9.516e-03   9.883  < 2e-16 ***
## musicplayer1   1.090e-01  1.287e-02   8.466  < 2e-16 ***
## coffeetable1   1.311e-01  1.641e-02   7.991 1.47e-15 ***
## iron1          1.325e-01  1.231e-02  10.761  < 2e-16 ***
## dimbagarden1   9.140e-02  9.095e-03  10.049  < 2e-16 ***
## goats1         8.671e-02  1.049e-02   8.269  < 2e-16 ***
## dependratio   -4.086e-02  5.225e-03  -7.820 5.74e-15 ***
## hfem1         -6.761e-02  1.102e-02  -6.135 8.80e-10 ***
## grassroof1    -9.138e-02  1.394e-02  -6.556 5.77e-11 ***
## mortarpestle1  3.207e-02  9.193e-03   3.489 0.000487 ***
## table1         5.132e-02  1.029e-02   4.987 6.23e-07 ***
## clock1         6.254e-02  1.273e-02   4.912 9.16e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4278 on 11251 degrees of freedom
## Multiple R-squared:  0.5997, Adjusted R-squared:  0.5987 
## F-statistic: 601.9 on 28 and 11251 DF,  p-value: < 2.2e-16
set.seed(12345)
cv_10fold <- trainControl(
    method = "cv", 
    number = 10 
)
set.seed(12345)
model_lasso <- caret::train(
    lnexp_pc_month ~ .,
    data = data_malawi,
    method = 'lasso',
    preProcess = c("center", "scale"), #This will first center and then scale all relevant variables in the model
    trControl = cv_10fold # use 10 folds to cross-validate // we already know this fine-tuning may not be needed!
    )

print(model_lasso) # the best R^2 and RMSE are similar to those from the linear regression... (and so is the MAE!)
## The lasso 
## 
## 11280 samples
##    29 predictor
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10152, 10152, 10152, 10152, 10152, 10152, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE       Rsquared   MAE      
##   0.1       0.5953796  0.4234291  0.4653872
##   0.5       0.4485973  0.5660193  0.3533998
##   0.9       0.4295497  0.5957600  0.3368464
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.9.

Parameter-tuning in our Lasso regression

Let’s start with centering and scaling variables; why do we do that?

Centering and Scaling in Lasso: Recall that the L1-norm puts a constraint on the size of the Lasso regression coefficients. That size, of course, depends on the scale on which each variable is measured. Owning between \(1\) and \(55\) electric gadgets at home is not the same as earning between \(1000\) and \(100,000\) monthly (in whichever currency you want to imagine here). There are two immediate consequences of centering and scaling (or normalising): 1) there is no longer an intercept; 2) it is easier to rank the relative magnitude of the coefficients post-shrinkage.

Cross-validation in Lasso: Beyond splitting into training and testing dataframes, what a k-fold (commonly 10-fold) cross-validation does is resample the data to find an optimal penalisation parameter \(\lambda\) (lambda) for the lasso model and assess its predictive error. Recall: the optimal \(\lambda\) minimises the out-of-sample (or test) mean prediction error.
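
For reference, a standard way to write the Lasso is as least squares with an explicit budget \(t \ge 0\) on the L1-norm of the coefficients; the fraction reported by caret above can be read as this budget expressed as a share of the L1-norm of the unconstrained (OLS) solution:

\[ \hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - \alpha - x_i'\beta\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\left|\beta_j\right| \le t \]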

This tuning is already included in the chunk of code above.

# Let's visualise our lasso model:
plot(model_lasso)

cat("The x-axis is the fraction of the full solution (i.e., ordinary least squares with no penalty)")
## The x-axis is the fraction of the full solution (i.e., ordinary least squares with no penalty)

It seems that the selected fraction (0.9) is close to the full, non-penalised solution (i.e. OLS/linear). This supports our original claim that this particular set-up does not call for a Lasso approach. We may want to try other machine learning algorithms to see if we can improve on our initial predictions!

A final note: there is some discussion over whether k-fold or really any cross-validation technique is optimal for choosing the lambda parameter in Lasso models (see for example, this Coursera video). It is true that cross-validation is something that we want to do for ALL our machine learning models. Just make sure to make an informed choice of which cv technique you’ll implement in your model.

A setup with multiple categorical variables is problematic with Lasso!

Lasso models do not perform well with multiple categorical variables, as is our case. This is primarily because, to work with categorical data in lasso models, we encode each category as a dummy (for example, north = \(1\) and \(0\) otherwise, central = \(1\) and \(0\) otherwise). Lasso then shrinks each dummy individually, rather than treating 'region' as a single concept. An alternative is a Group Lasso. If you'd like to read more on that, you can have a look at this Towards Data Science post.

Python practical

You can download the dataset by clicking on the button below.


  1. Preliminaries: packages and data upload




import numpy as np # can be used to perform a wide variety of mathematical operations on arrays.
import pandas as pd # mainly used for data analysis and associated manipulation of tabular data in DataFrames
import matplotlib as mpl # comprehensive library for creating static, animated, and interactive visualizations in Python
import sklearn as sk #  implement machine learning models and statistical modelling.
import seaborn as sns # for dataframe/vector visualisation
import matplotlib.pyplot as plt # only pyplots from the matplotlib library of visuals
from sklearn.metrics import mean_absolute_error, mean_squared_error # evaluation metrics for ML models. Setting mean_squared_error squared to False will return the RMSE.
import statsmodels.api as sm # to visualise linear models in a nice way...


malawi = pd.read_csv('/Users/michellegonzalez/Documents/GitHub/Machine-Learning-for-Public-Policy/malawi.csv') # import csv file and store as malawi

  1. Get to know your data: pre-processing and visualisation


Pandas is a very complete package to manipulate DataFrames/datasets.

# first observations of dataframe
malawi.head()
##    lnexp_pc_month  hhsize  hhsize2  ...  strataid  lnzline      case_id
## 0        6.900896       7       49  ...         1    7.555  10101002025
## 1        7.064378       3        9  ...         1    7.555  10101002051
## 2        6.823851       6       36  ...         1    7.555  10101002072
## 3        6.894722       6       36  ...         1    7.555  10101002079
## 4        6.465989       6       36  ...         1    7.555  10101002095
## 
## [5 rows x 38 columns]
# descriptive statistics of variables
malawi.describe()
##        lnexp_pc_month        hhsize  ...       lnzline       case_id
## count    11280.000000  11280.000000  ...  1.128000e+04  1.128000e+04
## mean         7.358888      4.546809  ...  7.555000e+00  2.372322e+10
## std          0.675346      2.335525  ...  1.776436e-15  7.241514e+09
## min          4.776855      1.000000  ...  7.555000e+00  1.010100e+10
## 25%          6.892941      3.000000  ...  7.555000e+00  2.040204e+10
## 50%          7.305191      4.000000  ...  7.555000e+00  2.090352e+10
## 75%          7.757587      6.000000  ...  7.555000e+00  3.053301e+10
## max         11.063562     27.000000  ...  7.555000e+00  3.120209e+10
## 
## [8 rows x 36 columns]
# now let's summarise the dataframe in a general format
malawi.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 11280 entries, 0 to 11279
## Data columns (total 38 columns):
##  #   Column          Non-Null Count  Dtype  
## ---  ------          --------------  -----  
##  0   lnexp_pc_month  11280 non-null  float64
##  1   hhsize          11280 non-null  int64  
##  2   hhsize2         11280 non-null  int64  
##  3   agehead         11280 non-null  int64  
##  4   agehead2        11280 non-null  int64  
##  5   north           11280 non-null  int64  
##  6   central         11280 non-null  int64  
##  7   rural           11280 non-null  int64  
##  8   nevermarried    11280 non-null  int64  
##  9   sharenoedu      11280 non-null  float64
##  10  shareread       11280 non-null  float64
##  11  nrooms          11280 non-null  int64  
##  12  floor_cement    11280 non-null  int64  
##  13  electricity     11280 non-null  int64  
##  14  flushtoilet     11280 non-null  int64  
##  15  soap            11280 non-null  int64  
##  16  bed             11280 non-null  int64  
##  17  bike            11280 non-null  int64  
##  18  musicplayer     11280 non-null  int64  
##  19  coffeetable     11280 non-null  int64  
##  20  iron            11280 non-null  int64  
##  21  dimbagarden     11280 non-null  int64  
##  22  goats           11280 non-null  int64  
##  23  dependratio     11280 non-null  float64
##  24  hfem            11280 non-null  int64  
##  25  grassroof       11280 non-null  int64  
##  26  mortarpestle    11280 non-null  int64  
##  27  table           11280 non-null  int64  
##  28  clock           11280 non-null  int64  
##  29  ea              11280 non-null  int64  
##  30  EA              11280 non-null  int64  
##  31  hhwght          11280 non-null  float64
##  32  region          11280 non-null  object 
##  33  eatype          11280 non-null  object 
##  34  psu             11280 non-null  int64  
##  35  strataid        11280 non-null  int64  
##  36  lnzline         11280 non-null  float64
##  37  case_id         11280 non-null  int64  
## dtypes: float64(6), int64(30), object(2)
## memory usage: 3.3+ MB
  • The dataset contains 38 variables and 11,280 observations.

  • Not all of these variables are relevant for our prediction model.

  • To find the labels and description of the variables, you can refer to the paper.

[hhsize, hhsize2, age_head, age_head2, regions, rural, never married, share_of_adults_without_education, share_of_adults_who_can_read, number of rooms, cement floor, electricity, flush toilet, soap, bed, bike, music player, coffee table, iron, garden, goats]

  • Luckily for us, we have no missing values (Non-Null Count in info() output)!
  1. Many machine learning models cannot be trained when missing values are present (some exceptions exist).

  2. Dealing with missingness is a non-trivial task: First and foremost, we should assess whether there is a pattern to missingness and if so, what that means to what we can learn from our (sub)population. If there is no discernible pattern, we can proceed to delete the missing values or impute them. A more detailed explanation and course of action can be found here.
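
Since our Malawi data has no missing values, the sketch below is purely illustrative: a minimal example of how one might count missing values and then either drop or impute them with pandas (the imputation choice here, the median, is an arbitrary placeholder, not a recommendation).

# illustrative only: our dataframe has no missing values
missing_counts = malawi.isna().sum()          # number of missing values per column
print(missing_counts[missing_counts > 0])     # show only columns with at least one missing value

print(malawi.isna().any(axis=1).mean())       # share of rows with at least one missing value

# two simple (and simplistic) courses of action, to be used only after assessing the missingness pattern:
malawi_complete = malawi.dropna()                                  # listwise deletion
malawi_imputed = malawi.fillna(malawi.median(numeric_only=True))   # median imputation for numeric columns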

Feature selection: subsetting the dataset

As part of our data pre-processing, we will subset the dataframe so that only relevant variables are left. Relevant here means variables/features about a household/individual that could help us determine whether they are in poverty. That way we save some memory space, and we can also call the full set of variables in the dataframe in one go!

variables to delete (not included in the identified set above): [ea, EA, hhwght, psu, strataid, lnzline, case_id, eatype, region]

N.B.: Feature selection is a critical process (and we normally don't have a paper to guide us through it): from a practical point of view, a model with fewer predictors may be easier to interpret. Also, some models may be negatively affected by non-informative predictors. This process is similar to traditional econometric modelling, but we should not conflate predictive and causal modelling. Importantly, please note that we are not interested in knowing why something happens, but rather in what is likely to happen given some known data. Hence:

# deleting variables from pandas dataframe 

cols2delete = ['ea', 'EA', 'hhwght', 'psu', 'strataid', 'lnzline', 'case_id', 'eatype', 'region']
malawi = malawi.drop(cols2delete,axis=1) # axis=0 means delete rows and axis=1 means delete columns

# check if we have deleted the columns: 
print(malawi.shape)
## (11280, 29)

Same number of rows, but only 29 variables left. We have successfully deleted those pesky unwanted vectors!

Data visualisation

A quick and effective way to take a first glance at our data is to plot histograms of relevant features. We can use the seaborn and matplotlib libraries to visualise our dataframe.

  • The distribution of each variable matters for the type of plot
  • Start by looking at the number of unique values
  • If a variable has only a handful of unique values (say, 2 or 3), it is almost certainly categorical, even though our exploration with malawi.info() showed only integers/floats…
# print number of unique values by vector
malawi.nunique()
## lnexp_pc_month    11266
## hhsize               19
## hhsize2              19
## agehead              88
## agehead2             88
## north                 2
## central               2
## rural                 2
## nevermarried          2
## sharenoedu           47
## shareread            28
## nrooms               16
## floor_cement          2
## electricity           2
## flushtoilet           2
## soap                  2
## bed                   2
## bike                  2
## musicplayer           2
## coffeetable           2
## iron                  2
## dimbagarden           2
## goats                 2
## dependratio          62
## hfem                  2
## grassroof             2
## mortarpestle          2
## table                 2
## clock                 2
## dtype: int64
print(" There's a large number of variables with 2 unique values. We can plot numeric/continuous variables as histograms and binary variables as bar plots.")
##  There's a large number of variables with 2 unique values. We can plot numeric/continuous variables as histograms and binary variables as bar plots.
# for-loop that iterates over variables in dataframe, if they have 2 unique values, transform vector into categorical
for column in malawi:
    if malawi[column].nunique() == 2:
        malawi[column] = pd.Categorical(malawi[column])

# check if it worked
malawi.info() # hooray! we've got integers, floats (numerics), and categorical variables now. 
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 11280 entries, 0 to 11279
## Data columns (total 29 columns):
##  #   Column          Non-Null Count  Dtype   
## ---  ------          --------------  -----   
##  0   lnexp_pc_month  11280 non-null  float64 
##  1   hhsize          11280 non-null  int64   
##  2   hhsize2         11280 non-null  int64   
##  3   agehead         11280 non-null  int64   
##  4   agehead2        11280 non-null  int64   
##  5   north           11280 non-null  category
##  6   central         11280 non-null  category
##  7   rural           11280 non-null  category
##  8   nevermarried    11280 non-null  category
##  9   sharenoedu      11280 non-null  float64 
##  10  shareread       11280 non-null  float64 
##  11  nrooms          11280 non-null  int64   
##  12  floor_cement    11280 non-null  category
##  13  electricity     11280 non-null  category
##  14  flushtoilet     11280 non-null  category
##  15  soap            11280 non-null  category
##  16  bed             11280 non-null  category
##  17  bike            11280 non-null  category
##  18  musicplayer     11280 non-null  category
##  19  coffeetable     11280 non-null  category
##  20  iron            11280 non-null  category
##  21  dimbagarden     11280 non-null  category
##  22  goats           11280 non-null  category
##  23  dependratio     11280 non-null  float64 
##  24  hfem            11280 non-null  category
##  25  grassroof       11280 non-null  category
##  26  mortarpestle    11280 non-null  category
##  27  table           11280 non-null  category
##  28  clock           11280 non-null  category
## dtypes: category(20), float64(4), int64(5)
## memory usage: 1016.0 KB
# plot histograms or bar graphs for each of the columns in the dataframe, print them all in the same grid
# subset of numeric/continuous
numeric_df = malawi.select_dtypes(include=['float64', 'int64'])
numeric_df.shape # dataframe subset contains only 9 variables
## (11280, 9)
# plot histograms of all variables in numeric dataframe (df)
for column in numeric_df.columns:
    plt.figure(figsize=(3, 1))  # Adjust figsize as needed
    sns.histplot(numeric_df[column], bins=20, kde=True)
    plt.title(f'Histogram of {column}')
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.show()

We can apply a similar logic to plotting/visualising categorical data: bar graphs, but also tables work well.

# subset of categorical data
cat_df = malawi.select_dtypes(include=['category'])
cat_df.shape # dataframe subset contains 20 variables 
## (11280, 20)
# let's start with printing tables of the 20 variables with category/count

for column in cat_df.columns:
    table = pd.crosstab(index=cat_df[column], columns='count')
    print(f'Table for {column}:\n{table}\n')
## Table for north:
## col_0  count
## north       
## 0       9600
## 1       1680
## 
## Table for central:
## col_0    count
## central       
## 0         6960
## 1         4320
## 
## Table for rural:
## col_0  count
## rural       
## 0       1440
## 1       9840
## 
## Table for nevermarried:
## col_0         count
## nevermarried       
## 0             10930
## 1               350
## 
## Table for floor_cement:
## col_0         count
## floor_cement       
## 0              9036
## 1              2244
## 
## Table for electricity:
## col_0        count
## electricity       
## 0            10620
## 1              660
## 
## Table for flushtoilet:
## col_0        count
## flushtoilet       
## 0            10962
## 1              318
## 
## Table for soap:
## col_0  count
## soap        
## 0       9740
## 1       1540
## 
## Table for bed:
## col_0  count
## bed         
## 0       7653
## 1       3627
## 
## Table for bike:
## col_0  count
## bike        
## 0       7194
## 1       4086
## 
## Table for musicplayer:
## col_0        count
## musicplayer       
## 0             9426
## 1             1854
## 
## Table for coffeetable:
## col_0        count
## coffeetable       
## 0             9956
## 1             1324
## 
## Table for iron:
## col_0  count
## iron        
## 0       8962
## 1       2318
## 
## Table for dimbagarden:
## col_0        count
## dimbagarden       
## 0             7658
## 1             3622
## 
## Table for goats:
## col_0  count
## goats       
## 0       8856
## 1       2424
## 
## Table for hfem:
## col_0  count
## hfem        
## 0       8697
## 1       2583
## 
## Table for grassroof:
## col_0      count
## grassroof       
## 0           2953
## 1           8327
## 
## Table for mortarpestle:
## col_0         count
## mortarpestle       
## 0              5635
## 1              5645
## 
## Table for table:
## col_0  count
## table       
## 0       7249
## 1       4031
## 
## Table for clock:
## col_0  count
## clock       
## 0       9039
## 1       2241
    
# maybe bar graphs for a quick visual check of asset ownership scarcity?

for column in cat_df.columns:
    plt.figure(figsize=(6, 4)) 
    sns.countplot(x=column, data=cat_df)
    plt.title(f'Bar Plot of {column}')
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.show()

What have we learned from the data visualisation?

  1. Nothing worrying about the data itself. McBride and Nichols did a good job of pre-processing the data for us. No pesky missing values, or unknown categories.

  2. From the bar plots, we can see that for the most part, people tend not to own assets. Worryingly, there is a lack of soap, flush toilets and electricity, all of which are crucial for human capital (health and education).

  3. From the histograms, we can see that log per capita expenditure is approximately normally distributed, but if we remove the log, it is heavily skewed. Poverty is endemic. Furthermore, households tend not to have many educated individuals, and household size is non-trivially large (with fewer rooms than people need).

Relationships between features

To finalise our exploration of the dataset, we should define:

  • the target variable (a.k.a. outcome of interest)

  • correlational insights

Let’s visualise two distinct correlation matrices: for our numeric dataframe, which includes our target variable, we will plot a Pearson r correlation matrix; for our categorical dataframe, to which we will add our continuous target, we will plot a Spearman rho correlation matrix. Both types of correlation coefficients are interpreted the same way (0 = no correlation, 1 = perfect positive correlation, -1 = perfect negative correlation).

# Pearson correlation matrix (seaborn and matplotlib libraries again :)) 
correlation_matrix = numeric_df.corr() # Pearson correlation for continuously distributed variables (the default method is pearson)

# plot it to get a nice visual
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Pearson Correlation Matrix')
plt.show()

# Spearman correlation matrix
# unfortunately, only numeric values can be used in corr()
# detour: transform categorical into numeric type...
for column in cat_df.columns:
    if pd.api.types.is_categorical_dtype(cat_df[column]):
        cat_df[column] = cat_df[column].cat.codes
cat_df.info() # all integers now! hooray!
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 11280 entries, 0 to 11279
## Data columns (total 20 columns):
##  #   Column        Non-Null Count  Dtype
## ---  ------        --------------  -----
##  0   north         11280 non-null  int8 
##  1   central       11280 non-null  int8 
##  2   rural         11280 non-null  int8 
##  3   nevermarried  11280 non-null  int8 
##  4   floor_cement  11280 non-null  int8 
##  5   electricity   11280 non-null  int8 
##  6   flushtoilet   11280 non-null  int8 
##  7   soap          11280 non-null  int8 
##  8   bed           11280 non-null  int8 
##  9   bike          11280 non-null  int8 
##  10  musicplayer   11280 non-null  int8 
##  11  coffeetable   11280 non-null  int8 
##  12  iron          11280 non-null  int8 
##  13  dimbagarden   11280 non-null  int8 
##  14  goats         11280 non-null  int8 
##  15  hfem          11280 non-null  int8 
##  16  grassroof     11280 non-null  int8 
##  17  mortarpestle  11280 non-null  int8 
##  18  table         11280 non-null  int8 
##  19  clock         11280 non-null  int8 
## dtypes: int8(20)
## memory usage: 220.4 KB
# detour 2: concatenate target variable to our categorical dataframe
targetvar = malawi['lnexp_pc_month']
cat_df_new = pd.concat([cat_df, targetvar], axis=1) # recall axis=1 tells python to concatenate the column


spearman_matrix = cat_df_new.corr(method='spearman') # specify method spearman

# visualise spearman correlation matrix

plt.figure(figsize=(8, 6))
sns.heatmap(spearman_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, annot_kws={'size': 6})
plt.title('Spearman Correlation Matrix')
plt.show()

Ideally, we would print a correlation matrix with ALL the variables. For display purposes, we’ve separated them here. If we were interested in relationships between two variables that are not our target variable, we may want to think about a full correlation matrix.
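
If you did want a single matrix over all the features, a minimal sketch follows: it simply re-joins the numeric subset and the (already integer-recoded) categorical subset and computes one Spearman matrix over everything.

# one Spearman correlation matrix over all features (numeric + recoded categorical)
full_df = pd.concat([numeric_df, cat_df], axis=1)   # numeric_df already contains the target variable
full_corr = full_df.corr(method='spearman')

plt.figure(figsize=(12, 10))
sns.heatmap(full_corr, cmap='coolwarm', linewidths=0.5)
plt.title('Spearman Correlation Matrix - all features')
plt.show()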

From the Pearson Correlation Matrix, we can see that the size of the household and dependent ratio are highly negatively correlated to our target variable.

From the Spearman Correlation Matrix, we see that ownership of some assets stands out: soap, a cement floor, electricity, a bed… ownership of these (and a couple of other) assets is positively correlated with per capita expenditure. Living in a rural area, on the other hand, is negatively correlated with our target variable.

We can also spot some correlation coefficients that equal zero. In some situations, the data generating mechanism can create predictors that only have a single unique value (i.e. a “zero-variance predictor”). For many ML models (excluding tree-based models), this may cause the model to crash or the fit to be unstable. Here, the only \(0\) we’ve spotted is not in relation to our target variable.

But we do observe some near-zero-variance predictors. Besides being uninformative, these can also create unstable model fits. There are a few strategies to deal with them; the quickest solution is to remove them (see the sketch below). A second option, which is especially interesting in scenarios with a large number of predictors/variables, is to work with penalised models. We’ll discuss this option below.
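
As a minimal sketch of the 'remove them' option, scikit-learn's VarianceThreshold can flag low-variance features; the 0.01 cut-off below is an arbitrary choice for illustration, not a recommendation.

# flag near-zero-variance candidates among the recoded 0/1 categorical features
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)  # keep features whose variance exceeds 0.01
selector.fit(cat_df)                          # cat_df holds the integer-recoded categorical features at this point

low_variance_cols = cat_df.columns[~selector.get_support()]
print("Near-zero-variance candidates:", list(low_variance_cols))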

  1. Model fit: data partition and performance evaluation parameters


We now have a general idea of the structure of the data we are working with, and what we’re trying to predict: per capita expenditures, which we believe are a proxy for poverty prevalence. In our dataset this is measured as the log of monthly per capita expenditure, in the variable named lnexp_pc_month.

The next step is to create a simple linear model (OLS) to predict monthly per capita expenditure using the variables in our dataset, and to introduce the elements with which we will evaluate our model.

Data Partitioning

When we want to build predictive models for machine learning purposes, it is important to have (at least) two data sets. A training data set from which our model will learn, and a test data set containing the same features as our training data set; we use the second dataset to see how well our predictive model extrapolates to other samples (i.e. is it generalisable?). To split our main data set into two, we will work with an 80/20 split.

The 80/20 split has its origins in the Pareto Principle, which states that ‘in most cases, 80% of effects come from 20% of causes’. Though there are other test/train splitting options, this partitioning method is a good place to start, and indeed standard in the machine learning field.

# split train and test data using the sci-kit learn library (our go-to machine learning package for now!)

from sklearn.model_selection import train_test_split

# We need to define a matrix X that contains all the features (variables) that we are interested in, and a vector Y that is the target variable
# These two elements should be defined both for the train and test subsets of our data
X = malawi.iloc[:, 1:28] # X is the feature matrix; .iloc's end index is exclusive, so 1:28 keeps columns 1 to 27 (it drops column 0, which conveniently is our target variable, and also the last column, clock; use malawi.iloc[:, 1:] to keep every predictor)
# recall that python starts counting at 0, therefore the vector in position 0 is the first vector (our target), the vector in position 1 is the second, etc.
y = malawi.iloc[:, 0] # y is a vector containing our target variable 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12345) # random_state is for reproducibility purposes, analogous to setting a seed in R. 

X_train.head(3)
##       hhsize  hhsize2  agehead  agehead2  ... hfem grassroof mortarpestle table
## 9867       5       25       41      1681  ...    1         0            1     1
## 3782       5       25       41      1681  ...    0         1            1     0
## 5647       5       25       55      3025  ...    0         0            1     1
## 
## [3 rows x 27 columns]
X_test.head(3)
##       hhsize  hhsize2  agehead  agehead2  ... hfem grassroof mortarpestle table
## 3750       2        4       48      2304  ...    1         1            0     0
## 9030       3        9       42      1764  ...    0         1            1     1
## 3239       2        4       26       676  ...    0         1            0     0
## 
## [3 rows x 27 columns]
print(y_train)
## 9867    8.504023
## 3782    6.688002
## 5647    7.287024
## 7440    5.643893
## 7283    6.706208
##           ...   
## 4478    8.266976
## 4094    8.688864
## 3492    7.312614
## 2177    7.538856
## 4578    6.676927
## Name: lnexp_pc_month, Length: 9024, dtype: float64
print(y_test)
## 3750     7.936020
## 9030     7.464896
## 3239     7.288868
## 1463     7.816436
## 11236    6.859672
##            ...   
## 8570     7.232218
## 11080    7.318853
## 7124     9.199136
## 8818     7.928102
## 3167     7.507953
## Name: lnexp_pc_month, Length: 2256, dtype: float64

# everything looks okay, let's proceed with modelling... 

Prediction with Linear Models

We will start by fitting a predictive model using the training dataset; that is, our target variable, the log of monthly per capita expenditures or lnexp_pc_month, will be the dependent variable \(Y\) in a linear model \(Y_i = \alpha + x_i'\beta + \epsilon_i\), and the remaining features in the data frame correspond to the row vector of predictors \(x_i'\).

 # Importing more features from the scikit-learn package 
from sklearn.linear_model import LinearRegression

model1 = LinearRegression(fit_intercept=True).fit(X_train, y_train)

# visualise the output from the linear model

print(f'Coefficients: {model1.coef_}')
## Coefficients: [-2.88803518e-01  1.16454588e-02  4.00549713e-03 -5.26522790e-05
##   7.46404592e-02  2.59743764e-01 -5.70197786e-02  2.68613159e-01
##  -1.10722935e-01  8.48179021e-02  3.97403945e-02  1.22016217e-01
##   3.86242046e-01  3.64921504e-01  2.08221283e-01  1.15598316e-01
##   9.77483503e-02  1.12175940e-01  1.35028702e-01  1.38280053e-01
##   9.19268361e-02  8.76979599e-02 -4.31108804e-02 -7.06557671e-02
##  -8.59920259e-02  2.97336875e-02  5.09616025e-02]
print(f'Intercept: {model1.intercept_}')
## Intercept: 7.9286268681021985
# visualise in a nice way (using the statsmodels.api library)


X_train_with_constant = sm.add_constant(X_train) # add a constant to the X feature matrix
ols_model = sm.OLS(y_train, X_train_with_constant) # use the statsmodel OLS function to regress X features on Y vector

# Fit the OLS model
lm_results = ols_model.fit()

# Display the summary
print(lm_results.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:         lnexp_pc_month   R-squared:                       0.599
## Model:                            OLS   Adj. R-squared:                  0.598
## Method:                 Least Squares   F-statistic:                     498.3
## Date:                Fri, 01 Mar 2024   Prob (F-statistic):               0.00
## Time:                        11:35:20   Log-Likelihood:                -5189.2
## No. Observations:                9024   AIC:                         1.043e+04
## Df Residuals:                    8996   BIC:                         1.063e+04
## Df Model:                          27                                         
## Covariance Type:            nonrobust                                         
## ================================================================================
##                    coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------
## const            7.9286      0.044    179.409      0.000       7.842       8.015
## hhsize          -0.2888      0.006    -44.577      0.000      -0.302      -0.276
## hhsize2          0.0116      0.000     24.092      0.000       0.011       0.013
## agehead          0.0040      0.002      2.327      0.020       0.001       0.007
## agehead2     -5.265e-05   1.72e-05     -3.058      0.002   -8.64e-05   -1.89e-05
## north            0.0746      0.014      5.156      0.000       0.046       0.103
## central          0.2597      0.010     25.294      0.000       0.240       0.280
## rural           -0.0570      0.017     -3.384      0.001      -0.090      -0.024
## nevermarried     0.2686      0.028      9.519      0.000       0.213       0.324
## sharenoedu      -0.1107      0.020     -5.450      0.000      -0.151      -0.071
## shareread        0.0848      0.015      5.622      0.000       0.055       0.114
## nrooms           0.0397      0.004      9.707      0.000       0.032       0.048
## floor_cement     0.1220      0.017      7.012      0.000       0.088       0.156
## electricity      0.3862      0.027     14.506      0.000       0.334       0.438
## flushtoilet      0.3649      0.032     11.365      0.000       0.302       0.428
## soap             0.2082      0.014     14.720      0.000       0.180       0.236
## bed              0.1156      0.013      8.933      0.000       0.090       0.141
## bike             0.0977      0.011      9.108      0.000       0.077       0.119
## musicplayer      0.1122      0.014      7.787      0.000       0.084       0.140
## coffeetable      0.1350      0.018      7.483      0.000       0.100       0.170
## iron             0.1383      0.014     10.093      0.000       0.111       0.165
## dimbagarden      0.0919      0.010      9.003      0.000       0.072       0.112
## goats            0.0877      0.012      7.428      0.000       0.065       0.111
## dependratio     -0.0431      0.006     -7.396      0.000      -0.055      -0.032
## hfem            -0.0707      0.012     -5.687      0.000      -0.095      -0.046
## grassroof       -0.0860      0.016     -5.498      0.000      -0.117      -0.055
## mortarpestle     0.0297      0.010      2.867      0.004       0.009       0.050
## table            0.0510      0.012      4.411      0.000       0.028       0.074
## ==============================================================================
## Omnibus:                      204.679   Durbin-Watson:                   1.992
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):              314.826
## Skew:                           0.233   Prob(JB):                     4.33e-69
## Kurtosis:                       3.787   Cond. No.                     2.68e+04
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 2.68e+04. This might indicate that there are
## strong multicollinearity or other numerical problems.

Recall that one of the assumptions of a linear model is that errors are independent and identically distributed. We could run some tests to determine this, but another quick way to check is to contrast the iid and robust (HC3) standard errors in the summary output:

lm_robust = sm.OLS(y_train, X_train_with_constant).fit(cov_type='HC3') # HC3 are also known as robust std errors
print(lm_robust.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:         lnexp_pc_month   R-squared:                       0.599
## Model:                            OLS   Adj. R-squared:                  0.598
## Method:                 Least Squares   F-statistic:                     431.9
## Date:                Fri, 01 Mar 2024   Prob (F-statistic):               0.00
## Time:                        11:35:21   Log-Likelihood:                -5189.2
## No. Observations:                9024   AIC:                         1.043e+04
## Df Residuals:                    8996   BIC:                         1.063e+04
## Df Model:                          27                                         
## Covariance Type:                  HC3                                         
## ================================================================================
##                    coef    std err          z      P>|z|      [0.025      0.975]
## --------------------------------------------------------------------------------
## const            7.9286      0.066    120.963      0.000       7.800       8.057
## hhsize          -0.2888      0.027    -10.502      0.000      -0.343      -0.235
## hhsize2          0.0116      0.002      4.732      0.000       0.007       0.016
## agehead          0.0040      0.002      2.181      0.029       0.000       0.008
## agehead2     -5.265e-05   1.93e-05     -2.733      0.006   -9.04e-05   -1.49e-05
## north            0.0746      0.016      4.727      0.000       0.044       0.106
## central          0.2597      0.010     25.600      0.000       0.240       0.280
## rural           -0.0570      0.017     -3.267      0.001      -0.091      -0.023
## nevermarried     0.2686      0.036      7.507      0.000       0.198       0.339
## sharenoedu      -0.1107      0.021     -5.338      0.000      -0.151      -0.070
## shareread        0.0848      0.015      5.587      0.000       0.055       0.115
## nrooms           0.0397      0.005      8.706      0.000       0.031       0.049
## floor_cement     0.1220      0.018      6.733      0.000       0.086       0.158
## electricity      0.3862      0.030     13.071      0.000       0.328       0.444
## flushtoilet      0.3649      0.041      8.939      0.000       0.285       0.445
## soap             0.2082      0.014     14.659      0.000       0.180       0.236
## bed              0.1156      0.013      8.816      0.000       0.090       0.141
## bike             0.0977      0.011      9.053      0.000       0.077       0.119
## musicplayer      0.1122      0.015      7.607      0.000       0.083       0.141
## coffeetable      0.1350      0.019      7.079      0.000       0.098       0.172
## iron             0.1383      0.014     10.158      0.000       0.112       0.165
## dimbagarden      0.0919      0.010      8.983      0.000       0.072       0.112
## goats            0.0877      0.012      7.553      0.000       0.065       0.110
## dependratio     -0.0431      0.008     -5.189      0.000      -0.059      -0.027
## hfem            -0.0707      0.013     -5.408      0.000      -0.096      -0.045
## grassroof       -0.0860      0.015     -5.648      0.000      -0.116      -0.056
## mortarpestle     0.0297      0.010      2.856      0.004       0.009       0.050
## table            0.0510      0.012      4.313      0.000       0.028       0.074
## ==============================================================================
## Omnibus:                      204.679   Durbin-Watson:                   1.992
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):              314.826
## Skew:                           0.233   Prob(JB):                     4.33e-69
## Kurtosis:                       3.787   Cond. No.                     2.68e+04
## ==============================================================================
## 
## Notes:
## [1] Standard Errors are heteroscedasticity robust (HC3)
## [2] The condition number is large, 2.68e+04. This might indicate that there are
## strong multicollinearity or other numerical problems.

The coefficient estimates are identical and, although some standard errors change, all features remain statistically significant and the substantive conclusions do not move. We’ll continue with the assumption that our simple linear model is homoskedastic.

Performance Indicators

In predictive modelling, we are interested in the following performance metrics:

  • Model residuals: recall residuals are the observed value minus the predicted value. We can estimate a model’s Root Mean Squared Error (RMSE) or the Mean Absolute Error (MAE). Residuals allow us to quantify the extent to which the predicted response value (for a given observation) is close to the true response value. Small RMSE or MAE values indicate that the prediction is close to the true observed value.

  • The p-values: represented by stars *** (and a pre-defined critical threshold, e.g. 0.05), they point to the predictive power of each feature in the model; i.e. the estimated association is unlikely to be due to chance. In the same vein, the magnitude of the coefficient is also important, especially given that we are interested in predictive power and not causality.

  • The R-squared: arguably the go-to indicator for performance assessment. Low R^2 values are not uncommon, especially in the social sciences. However, when hoping to use a model for predictive purposes, a low R^2 is a bad sign, a large number of statistically significant features notwithstanding. The drawback of relying solely on this measure is that it does not take into consideration the problem of model over-fitting; i.e. you can inflate the R-squared by adding as many variables as you want, even if those variables have little predictive power. That approach will yield great results in the training data, but will underperform when extrapolating the model to the test (or indeed any other) data set.

Residuals

# = = Model Residuals = = #
# residuals of the train dataset 

# First, predict values of y based on the fitted linear model
y_pred = model1.predict(X_train)

# Second, calculate residuals. Recall: observed - predicted
residuals_model1 = y_train - y_pred

# Compute summary statistics of residuals
residual_mean = np.mean(residuals_model1)
residual_std = np.std(residuals_model1)
residual_min = np.min(residuals_model1)
residual_25th_percentile = np.percentile(residuals_model1, 25)
residual_median = np.median(residuals_model1)
residual_75th_percentile = np.percentile(residuals_model1, 75)
residual_max = np.max(residuals_model1)

# Print summary statistics
print(f"Mean of residuals: {residual_mean}")
## Mean of residuals: 3.2204341636290003e-16
print(f"Standard deviation of residuals: {residual_std}")
## Standard deviation of residuals: 0.43003194030066827
print(f"Minimum residual: {residual_min}")
## Minimum residual: -3.277590076409755
print(f"25th percentile of residuals: {residual_25th_percentile}")
## 25th percentile of residuals: -0.2895732594552083
print(f"Median of residuals: {residual_median}")
## Median of residuals: -0.013638733126892788
print(f"75th percentile of residuals: {residual_75th_percentile}")
## 75th percentile of residuals: 0.2657401219175035
print(f"Maximum residual: {residual_max}")
## Maximum residual: 1.8711412012693618

Recall that the residual is estimated as the true (observed) value minus the predicted value. Thus, the max(imum) residual of 1.87 suggests that the model under-predicted expenditure by about 1.9 log points (roughly a factor of 6.5 in levels) for at least one observation. Fifty per cent of the residuals (those between the first and third quartiles) lie within about 0.29 below and 0.27 above the true (log) value. From the prediction residuals we obtain the popular performance measure known as the Root Mean Squared Error (RMSE, for short).
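
For reference, the RMSE follows the standard formula below, where \(n\) is the number of observations, \(y_i\) the observed value and \(\hat{y}_i\) the predicted value:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \]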

# Calculate the RMSE for the training dataset, or the in-sample RMSE.

rmse_insample = mean_squared_error(y_train, y_pred, squared=False)
print(rmse_insample)
## 0.43003194030066827

The RMSE (0.4300) gives us an absolute number that indicates how much, on average, our predicted values deviate from the true (observed) values, expressed in the units of the target vector (a.k.a. our outcome variable). Think of the question: how far, on average, are the residuals away from zero? Generally speaking, the lower the value, the better the model fit. Besides being a good measure of goodness of fit, the RMSE is also useful for comparing the ability of our model to make predictions on different (e.g. test) data sets. Ideally, the in-sample RMSE should be close to the out-of-sample RMSE.

In this case, our RMSE is ~0.4 units away from zero. Given the range of the target variable (roughly 4 to 11), the number seems to be relatively small and close enough to zero. Good news!
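
The MAE was listed among our performance indicators but not computed above. As a minimal complement, using the mean_absolute_error function imported in the preliminaries (the exact value will depend on your split):

# Calculate the MAE for the training dataset, to complement the in-sample RMSE
mae_insample = mean_absolute_error(y_train, y_pred)
print(mae_insample)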

P-values Recall the large number of statistically significant features in our model summary (output table), identified by the headers P>|t| (or P>|z| for the robust fit) and values below a critical \(\alpha\) threshold (traditionally 0.05; more strictly, 0.001). We can conclude that the features we have selected are relevant predictors.

R-squared

from sklearn.metrics import r2_score # an evaluation metric (r-squared) that we didn't load at the beginning

r2_train = r2_score(y_train, model1.predict(X_train))

# Print the R-squared on the training set
print(f"R-squared on the training dataset: {r2_train}")
## R-squared on the training dataset: 0.5992973775467874

The estimated R-squared of 0.59 tells us that our model explains around 60 per cent of the variation in the target variable (the log of per capita monthly expenditures).
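
For reference, the R-squared returned by r2_score follows the standard definition, where \(\bar{y}\) is the mean of the observed values:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \]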

Out of sample predictions

Now that we have built and evaluated our model in the training dataset, we can proceed to make out-of-sample predictions. That is, see how our model performs in the test dataset.

# predict y (target) values for the test data, using the model trained on the training data
y_test_pred = model1.predict(X_test)

# Calculate Root Mean Squared Error on the test set
rmse_test = mean_squared_error(y_test, y_test_pred, squared=False)

# Estimate the R squared of the test model
r2_test = r2_score(y_test, model1.predict(X_test))

# Print the Mean Squared Error and R-squared on the test set
print(f"Root Mean Squared Error on the test set: {rmse_test}")
## Root Mean Squared Error on the test set: 0.41946592264787785
print(f"R-squared on the test dataset: {r2_test}")
## R-squared on the test dataset: 0.5947626941805604

The bias-variance tradeoff in practice

We previously mentioned that the RMSE metric could also be used to compare between train and test model predictions:

Notice that the in-sample RMSE [\(0.43\)] is very close to the out-of-sample RMSE [\(0.41\)]. This means that our model makes consistent predictions across different data sets. We also know by now that these predictions are relatively good: the model accounts for roughly 59 per cent of the variation in the target. What we are observing here is a model that has found a balance between bias and variance. However, both measures can still improve. For one, McBride and Nichols (2018) report > 80 per cent accuracy in their poverty predictions for Malawi. But please note that, at this point, we are not replicating their approach: they use a classification model, which means that they first applied a poverty-line score and divided the sample into individuals below and above the poverty line.

What do we mean by bias in a model?

The bias is the difference between the average prediction of our model and the true (observed) value. Minimising the bias is analogous to minimising the RMSE.

What do we mean by variance in a model?

It is the observed variability of our model prediction for a given data point (how much the model can adjust given the data set). A model with high variance would yield low error values in the training data but high errors in the test data.

Hence, consistent in and out of sample RMSE and R^2 scores = bias/variance balance.
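
A compact way to see this balance is the standard bias-variance decomposition of the expected squared prediction error, which assumes \(y = f(x) + \epsilon\) with irreducible noise variance \(\sigma^2\) and a fitted model \(\hat{f}\):

\[ \mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}} + \sigma^2 \]

Reducing one term typically increases the other; comparable in- and out-of-sample errors suggest we are near a sensible compromise.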

Fine-tuning model parameters

Can we fine-tune model parameters? The quick answer is yes! Every machine learning algorithm has a set of parameters that can be adjusted/fine-tuned to improve our estimations. Even in the case of a simple linear model, we can try our hand at fine-tuning with, for example, cross-validation.

What is cross-validation?

Broadly speaking, it is a technique that allows us to assess the performance of our machine learning model. How so? Well, it looks at the ‘stability’ of the model. It’s a measure of how well our model would work on new, unseen data (is this ringing a bell yet?); i.e. it has correctly observed and recorded the patterns in the data and has not captured too much noise (what we know as the error term, or what we are unable to explain with our model). K-fold cross-validation is a good place to start for such a thing. In the words of The Internet™, what k-fold cross validation does is:

Split the input dataset into K groups
+ For each group:
    + Take one group as the reserve or test data set.
    + Use remaining groups as the training dataset.
    + Fit the model on the training set and evaluate the performance of the model using the test set.

TL;DR We’re improving our splitting technique!

An Illustration by Eugenia Anello

Let’s rumble!


from sklearn.model_selection import cross_val_score, KFold

# Linear regression model with k-fold cross-validation
# This time around, we'll work with a) the full dataset (no need to split beforehand) and b) model objects instead of estimating the 
# model on-the-fly!

# object that contains an empty linear regression model
model_kcrossval = LinearRegression() 

# object that contains the number of folds (or partitions) that we want to have
k_folds = 10 # 5 or 10 are standard numbers. Choosing the optimal number of folds is beyond the scope of this tutorial. 

# object that contains a k-fold cross-validation 
model_kf = KFold(n_splits=k_folds, shuffle=True, random_state=12345) #recall we had used 12345 as our random state before as well... 

# estimate a k-fold cross-validated model!
kfold_mse_scores = -cross_val_score(model_kcrossval, X, y.ravel(), cv=model_kf, scoring='neg_mean_squared_error')

What is happening in the line of code that estimates the k-fold cross-validated model? Let’s see if we can make sense of it:

  • cross_val_score is the function used to estimate cross-validated models; the minus sign in front of it is explained below.
  • Inside the function, we specify the type of model (model_kcrossval, an empty LinearRegression).
  • We specify the dataframe of features X (the full dataset minus the target variable).
  • We specify the target variable y (full dataset); y.ravel() ensures the function reads it as a one-dimensional array.
  • We specify the type of cross-validation (cv) technique: our model_kf object defines a k-fold cross-validation with 10 folds.
  • We select scoring = 'neg_mean_squared_error'. Scikit-learn follows the convention that higher scores are better, so error metrics are returned as negative values; the minus sign in front of cross_val_score simply flips them back into positive mean squared errors.

The output that we get is the estimated mean squared error per fold, but we can also get the average of all folds and the root mean squared error that we’ve been using thus far.

# get the average mse from all 10 folds
average_kfold_mse = np.mean(kfold_mse_scores) # finally, numpy library in action

print("Mean Squared Error for each fold:", kfold_mse_scores)
## Mean Squared Error for each fold: [0.1721741  0.17878181 0.19410003 0.17626526 0.19455392 0.18466671
##  0.16980631 0.20032969 0.17379885 0.20905726]
print(f"Average Mean Squared Error: {average_kfold_mse}")
## Average Mean Squared Error: 0.18535339333850878
# get the RMSE, so that we can compare the performance of our cross-validated model
kfold_rmse_scores = np.sqrt(kfold_mse_scores)
average_kfold_rmse = np.sqrt(average_kfold_mse)

print("Root Mean Squared Error for each fold:", kfold_rmse_scores)
## Root Mean Squared Error for each fold: [0.41493867 0.42282598 0.44056784 0.41983957 0.44108267 0.42972865
##  0.41207561 0.44758204 0.4168919  0.4572278 ]
print(f"Average Root Mean Squared Error: {average_kfold_rmse}")
## Average Root Mean Squared Error: 0.43052687876427503

Having cross-validated our model with 10 folds, we can see that the results are virtually the same. We have an RMSE of around .43, but what about the R^2? We need to estimate it separately.

# loading (again!) another metric from scikit-learn
from sklearn.metrics import make_scorer

# creating an object for r^2 estimates
r2_scorer = make_scorer(r2_score)

# export r squared as scoring output instead of mean squared errors

kfold_r2_scores = cross_val_score(model_kcrossval, X, y.ravel(), cv=model_kf, scoring=r2_scorer) # note: this time, no negative sign at the beginning!


# get the average R^2 for all folds
average_r2_scores = np.average(kfold_r2_scores)

# print results

print("R-squared for each fold:", kfold_r2_scores)
## R-squared for each fold: [0.60735695 0.58411959 0.57381508 0.62260601 0.60060855 0.61087529
##  0.62923256 0.57307956 0.60035639 0.52943145]
print(f"Average R-squared:{average_r2_scores}")
## Average R-squared:0.5931481424710761

NOTE: k-fold cross-validation replaces our original 80/20 split. It is meant to improve our predictions when we have a limited number of observations to ‘learn’ from. It lets us get around the fact that we ‘lost’ about 20% of the finite sample to the testing dataset, by reshuffling/resampling the data and making sure that every observation is used for learning at some point.

Having said the above, based on the R^2 (\(0.59\)) and the RMSE (\(0.43\)) estimates, our model predictions have not improved after cross-validation. We’re still on the improvement path!


  1. Feature selection with Lasso linear regression


An OLS regression is not the only model that can be written in the form of \(Y_i = \alpha + \beta_1X_{1i} + \beta_2X_{2i} + ... + \beta_pX_{pi} + u_i\). In this section we will discuss ‘penalised’ models, which can also be expressed as a linear relationship between parameters. Penalised regression models are also known as regression shrinkage methods; they take their name from the colloquial term for coefficient regularisation, ‘shrinkage’ of the estimated coefficients. The goal of penalised models, as opposed to a traditional linear model, is not to minimise bias (the least squares approach), but to reduce variance by adding a constraint to the equation, effectively pushing coefficient parameters towards 0. As a result, the weakest predictors end up with coefficients that are zero or close to zero.

Consider a scenario where you have hundreds of predictors. Which covariates are truly important for our known outcome? Including all of the predictors leads to over-fitting. We’ll find that the R^2 value is high, and conclude that our in-sample fit is good. However, this may lead to bad out-of-sample predictions. Model selection is a particularly challenging endeavour when we encounter high-dimensional data; i.e. when the number of variables is close to or larger than the number of observations. Some examples where you may encounter high-dimensional data include:

  1. Cross-country analyses: we have a small and finite number of countries, but we may collect/observe as many variables as we want.

  2. Cluster-population analyses: we wish to understand the outcome of some unique population \(n\), e.g. all students from classroom A. We collect plenty of information on these students, but the sample and the population are analogous \(n = N\), and thus the sample number of observations is small and finite.

The LASSO (Least Absolute Shrinkage and Selection Operator) imposes a shrinking penalty on those predictors that do not actually belong in the model, and reduces the size of the estimated \(\beta\) coefficients towards, and including, zero (when the tuning parameter/shrinkage penalty \(\lambda\) is sufficiently large). The penalty is based on the L1-norm, the sum of the absolute values of the coefficients, and \(\lambda\) controls how strongly that penalty is applied.
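
Written out, the Lasso estimates its coefficients by minimising the penalised least-squares criterion below (a standard formulation; scikit-learn's Lasso, used later on, scales the squared-error term by \(1/(2n)\) and calls the penalty weight alpha):

\[ \hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\left\{ \sum_{i=1}^{n}\left(y_i - \alpha - x_i'\beta\right)^2 + \lambda \sum_{j=1}^{p}\left|\beta_j\right| \right\} \]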

What does this mean for our case? Well, McBride and Nichols (2018) used two sources to compile their final dataset (which we are using). One of those sources is the Living Standards Measurement Study (LSMS) from the World Bank Group. These surveys collect more than 500 variables/ features per country case. You can subset the dataset with critical reasoning: using your knowledge of the phenomenon, which features might best describe it? But even then, we might still end up with a large number of features. Perhaps a LASSO approach could work here.

Our own data is not ideal for the approach, because it has a small number of features (\(29\)), compared to the large number of observations (\(11,280\)). But let’s run a lasso regression and see if my statement holds. How would I know if it holds? If the Lasso regression returns the same values for the performance assessment parameters as the OLS regression; this means that the penalisation parameter was either zero (leading back to an OLS) or too small to make a difference.

# (linear) Lasso with scikit-learn
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# before we proceed, we must center and scale (normalise) our features
normalise_obj = StandardScaler()

# normalise both train and test data (we don't need to cross-validate here, remember? back to our 80/20 split!)
X_train_scaled = normalise_obj.fit_transform(X_train) # fit on train dataset and then normalise train dataset
X_test_scaled = normalise_obj.transform(X_test) # no  need to re-fit, just normalise test dataset

# create lasso model

lasso_model = Lasso(alpha=0.1)

print("alpha is the regularisation parameter. The closer the alpha value is to zero, the more the model is close to a simple linear regression. Let's start with the smallest value of alpha, and increase if needed.")
## alpha is the regularisation parameter. The closer the alpha value is to zero, the more the model is close to a simple linear regression. Let's start with the smallest value of alpha, and increase if needed.
# fit the lasso model on the training data
lasso_model.fit(X_train_scaled, y_train)
## Lasso(alpha=0.1)
# make predictions on the test dataset
y_lasso_pred = lasso_model.predict(X_test_scaled)


# Estimate Root Mean Squared Error on the test set
mse_lasso = mean_squared_error(y_test, y_lasso_pred)
rmse_lasso = np.sqrt(mse_lasso)

# Print the Mean Squared Error
print(f"Root Mean Squared Error on the scaled test set: {rmse_lasso}")
## Root Mean Squared Error on the scaled test set: 0.5049684278065933
# RMSE on train data
y_lasso_pred_train = lasso_model.predict(X_train_scaled)
mse_lasso_train = mean_squared_error(y_train, y_lasso_pred_train)
rmse_lasso_train = np.sqrt(mse_lasso_train)

print(f"Root Mean Squared Error on the scaled train set: {rmse_lasso_train}")
## Root Mean Squared Error on the scaled train set: 0.5172317065909608
# R squared

r2_train = r2_score(y_train, y_lasso_pred_train)
r2_test = r2_score(y_test, y_lasso_pred)

print(f"R squared on train: {r2_train} and test: {r2_test}")
## R squared on train: 0.4203163697877488 and test: 0.41272095987451374

With an in-sample RMSE of roughly \(0.52\) and an out-of-sample RMSE of roughly \(0.50\), a lasso with a small penalisation parameter performs worse than a simple linear regression. Similarly, our \(R^2\) estimates, which were at \(0.59\) before, have now decreased to \(0.42\). Do we increase or decrease the regularisation parameter \(\alpha\)? Let's try decreasing it first:

# decrease the size of alpha (getting closer to a simple linear regression)
lasso_model2 = Lasso(alpha=0.01)

# fit the model on the train
lasso_model2.fit(X_train_scaled, y_train)
## Lasso(alpha=0.01)
# make predictions on the test dataset
y_lasso2_pred = lasso_model2.predict(X_test_scaled)


# Estimate Root Mean Squared Error on the test set
mse_lasso2 = mean_squared_error(y_test, y_lasso2_pred)
rmse_lasso2 = np.sqrt(mse_lasso2)

# Print the Root Mean Squared Error for both values of alpha
print(f"Root Mean Squared Error of Lasso with 0.01 alpha: {rmse_lasso2} vs 0.1 alpha: {rmse_lasso}")
## Root Mean Squared Error of Lasso with 0.01 alpha: 0.4288848193436583 vs 0.1 alpha: 0.5049684278065933
# optional: plot variable relevance!
import pandas as pd # add these if they are not already loaded from earlier in the practical
import matplotlib.pyplot as plt

# store coefficients in a dataframe first (X holds the feature columns defined earlier)
coefficients = lasso_model2.coef_
df_coef = pd.DataFrame({"Att": X.columns, "Coefs": coefficients})

plt.figure(figsize=(10, 6))
plt.barh(df_coef['Att'], df_coef['Coefs'], color='red')
plt.title('Variable Importance - Lasso Regression with 0.01 alpha')
plt.xlabel('Coefficient Magnitude')
plt.ylabel('Feature')
plt.show()

Judging by the RMSE, the model with the smaller alpha (closer to \(0\), i.e. closer to no penalisation and hence to a simple linear regression) confirms that the best-performing model so far is still the simple linear regression, as we initially suggested given what we know about the structure of our data. Our dataset is not a good candidate for feature selection with LASSO modelling. Also, you should know that you don't have to search for the best value of alpha manually. You can import a search algorithm, also from the scikit-learn library, called GridSearchCV. It will search for the optimal value of alpha for you over a pre-defined range (which you do have to set yourself), as sketched below.
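To make that concrete, here is a minimal sketch of such a search. It assumes the objects defined above (X_train_scaled, y_train) are still in memory, and the grid of alpha values is an arbitrary illustrative choice, not a recommendation.

# hedged sketch: tuning alpha with GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
import numpy as np

# candidate values of alpha -- an illustrative range from 0.0001 to 1
alpha_grid = {"alpha": np.logspace(-4, 0, 20)}

# 5-fold cross-validated search on the training data, scored on (negative) RMSE
lasso_search = GridSearchCV(
    estimator=Lasso(max_iter=10000),
    param_grid=alpha_grid,
    scoring="neg_root_mean_squared_error",
    cv=5
)
lasso_search.fit(X_train_scaled, y_train)

print(f"Best alpha found: {lasso_search.best_params_['alpha']}")
print(f"Cross-validated RMSE at the best alpha: {-lasso_search.best_score_}")

scikit-learn also ships a LassoCV estimator that performs the cross-validated search over alpha directly, if you prefer a more compact alternative.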

Conclusion

At this point, our best predictions are made with the linear regression algorithm, which outperformed the lasso model. As we progress, we will explore other common machine learning algorithms to see whether our prediction capabilities improve!

Practice at home


You can go to our blog and ask questions about the practical exercise in the Prediction Policy Problems post. One of the instructors, or even one of your peers, is likely to reply to your thread! You can also leave messages there and propose a meeting on Gather.Town, where you can interact with other people following the course in other parts of the world.

You can replicate the exercises in Python or R using the Malawi dataset. However, if you’d like a challenge, you can also try running a Linear and a Lasso Linear model using a similar dataset for the country of Bolivia. The dataset was put together by McBride and Nichols for the same paper for which they used the Malawi dataset.


Let us know that you are engaging with the content by answering one simple question in this link. We are happy if you can replicate the script at home, and happier still if you can extrapolate your knowledge to a different dataset. So, please send a reply to our Qualtrics question - even incorrect answers count!


Copyright © 2022 Michelle González Amador & Stephan Dietrich . All rights reserved.