Prediction Policy Problems: Classification with Logistic Regression

Have you heard the English proverb, “Birds of a feather flock together”? It references an old saying indicating that people with similar characteristics tend to group and stay together. In Machine Learning, classification problems deal with the evaluation of models of categorical response, such as predicting whether a person is poor or not poor (the example we develop below).

In the video-lecture below you’ll get an intuitive explanation of what a logistic regression model is, and how we can use it in the context of a prediction policy framework.

After watching the video, below you’ll find a continuation of our previous exercise. Previously, we were working on predicting per capita monthly expenditures of a sample of individuals from Malawi. Our assumption is that by predicting how much a person spends per month, we can infer whether they are in poverty (or not) by contrasting that value to other relevant information, such as the cost of food and rent in the country. Another way to go about this is to use the estimated poverty line, and generate a variable that takes on the value \(1\) if the person’s expenditure is below the poverty line (they are poor) and \(0\) otherwise (not poor). Thus, our policy problem becomes one of classification.

R practical

We will continue to work with the Malawi dataset, which can be downloaded in the [Prediction Policy Problems](https://www.ml4publicpolicy.com/predictionpolicy.html) tab of this website.

  1. Preliminaries: working directory, libraries, data upload


rm(list = ls()) # this line cleans your Global Environment.
setwd("/Users/michellegonzalez/Documents/GitHub/Machine-Learning-for-Public-Policy") # set your working directory

# Do not forget to install a package with the install.packages() function if it's the first time you use it!
# install.packages(c("caTools", "plotROC")) # these two packages are new for us 

library(dplyr) # core package for dataframe manipulation. Usually installed and loaded with the tidyverse, but sometimes needs to be loaded in conjunction to avoid warnings.
library(tidyverse) # a large collection of packages for data manipulation and visualisation.  
library(caret) # a library with key functions that streamline the process for predictive modelling 
library(skimr) # a package with a set of functions to describe dataframes and more
library(plyr) # a package for data wrangling
library(caTools) # a library with several basic utility functions (e.g. ROC curves, LogitBoost classifier, etc.)
library(plotROC) # a companion to ggplot2 (loaded with the tidyverse) for plotting ROC curves

data_malawi <- read_csv("malawi.csv") # the file is directly read from the working directory/folder previously set

  2. Data pre-processing


This section will not be a thorough step-by-step of the pre-processing and visualisation of our data because we have already done that. However, we have to do something very important: recover a static variable from the original dataset that contains a single number: the poverty line in Malawi.

Feature selection: subsetting the dataset

The variable that we’re interested in recovering is lnzline. The code below reproduces the dataframe subsetting from our previous exercise. Except, this time we will NOT delete the static vector lnzline.

# object:vector that contains the names of the variables that we want to get rid of (notice this time lnzline is still there)
cols <- c("ea", "EA", "psu","hhwght", "strataid", "case_id","eatype")


# subset of the data_malawi object:dataframe
data_malawi <- data_malawi[,-which(colnames(data_malawi) %in% cols)] # the minus sign indicates deletion of cols

colnames(data_malawi) # print the names of the remaining vectors in our dataframe
##  [1] "lnexp_pc_month" "hhsize"         "hhsize2"        "agehead"       
##  [5] "agehead2"       "north"          "central"        "rural"         
##  [9] "nevermarried"   "sharenoedu"     "shareread"      "nrooms"        
## [13] "floor_cement"   "electricity"    "flushtoilet"    "soap"          
## [17] "bed"            "bike"           "musicplayer"    "coffeetable"   
## [21] "iron"           "dimbagarden"    "goats"          "dependratio"   
## [25] "hfem"           "grassroof"      "mortarpestle"   "table"         
## [29] "clock"          "region"         "lnzline"


At this point, we still need to do two more pre-processing steps: correctly define the vector/variable class in the dataframe, and create the binary outcome/target variable. We will repeat the class-transformation code chunk below so that you have all that is needed in one section. However, we won’t spend time explaining it in detail as that was done in the previous session.

# transform all binary/categorical data into factor class

min_count <- 3 # vector: treat variables with 3 or fewer unique values as categorical (3 is the largest number of categories found in the data)

# store boolean (true/false) if the number of unique values is lower or equal to the min_count vector
n_distinct2 <- apply(data_malawi, 2, function(x) length(unique(x))) <= min_count

# select the identified categorical variables and transform them into factors
data_malawi[n_distinct2] <- lapply(data_malawi[n_distinct2], factor) 

# recall poverty line contains 1 unique value (it is static), let's transform the variable into numeric again
data_malawi$lnzline <- as.numeric(as.character(data_malawi$lnzline))

# you can use ``skim(data_malawi)'' to check that the dataframe is in working order


Feature creation: create a binary variable


# print summary statistics of target variable
summary(data_malawi$lnexp_pc_month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.777   6.893   7.305   7.359   7.758  11.064
# if the log of per capita expenditure is below the estimated poverty line, classify individual as poor, else classify individual as not poor. Store as factor (default with text is class character)
data_malawi$poor <- as.factor(ifelse(data_malawi$lnexp_pc_month<= data_malawi$lnzline,"Y","N")) # Y(es) N(o)

# make sure that the factor target variable has poor = Y as reference category (this step is important when running the logistic regression)
data_malawi$poor <- relevel(data_malawi$poor, ref="Y") # make Y reference category


# print a proportions table to get a first impression of the state of poverty in Malawi
prop.table(table(data_malawi$poor))
## 
##    Y    N 
## 0.65 0.35

According to our sample, about 65% of Malawians are considered poor. This number is not unreasonable. According to The World Bank’s Country Report for Malawi, ca. \(70\%\) of the population lives on under \(\$2.15\) a day, and the national poverty rate is estimated at \(50\%\), i.e. about half of the population is labelled as poor. These estimates were based on \(2019\) data (so, a bit more recent than our sample).


# Final data pre-processing: delete static variable (poverty line)
# and along with it: remove the continuous target (as it perfectly predicts the binary target) 

which(colnames(data_malawi)=="lnzline") # returns column number 31
## [1] 31
which(colnames(data_malawi)=="lnexp_pc_month") # returns column number 1
## [1] 1
data_malawi <- data_malawi[,-c(1,31)] # delete columns no. 1 and 31 from the dataset

  3. Model Validation


Let’s use a simple 80:20 split of our data. We will use the caret package again.

set.seed(1234) # ensures reproducibility of our data split

# data partitioning: train and test datasets
train_idx <- createDataPartition(data_malawi$poor, p = .8, list = FALSE, times = 1) 

Train_df <- data_malawi[ train_idx,]
Test_df  <- data_malawi[-train_idx,]


Now, let’s fit a logistic model:

# Step 1: create trainControl object
TrControl <- trainControl(
    method = "cv",
    number = 5,
    summaryFunction = twoClassSummary,
    classProbs = TRUE, # IMPORTANT!
    verboseIter = FALSE
)

We’re going to pass the TrControl object onto the caret model estimation to ask for the following:

  • cross-validate with 5 folds;

  • show a model summary with the performance metrics for the case of two distinct classes (binary outcome), including the area under the ROC curve, the sensitivity, and the specificity;

  • the ROC curve is based on the predicted class probabilities, so the classProbs = TRUE parameter must accompany a twoClassSummary setup;

  • verboseIter = TRUE would show the output for each iteration, but we don’t want to display all those details at the moment, so we leave it at FALSE.

# Step 2: train the model.
set.seed(12345)
m <- train(
    poor ~ ., 
    Train_df, 
    method = "glm",
    family="binomial",
    trControl = TrControl,
    preProcess=c("center", "scale")
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.


Notice the warning. If we want to report the “Accuracy” metric instead, we should remove the twoClassSummary parameter specification from the TrControl object, as sketched below.
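Here is a minimal sketch of what that alternative specification could look like (the object names TrControl_acc and m_acc are just illustrative; everything else mirrors the setup above):

# Hypothetical alternative: without twoClassSummary/classProbs, caret reports Accuracy and Kappa instead of ROC/Sens/Spec
TrControl_acc <- trainControl(
    method = "cv",
    number = 5,
    verboseIter = FALSE
)

# m_acc <- train(poor ~ ., Train_df, method = "glm", family = "binomial",
#                trControl = TrControl_acc, preProcess = c("center", "scale"))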

# print the model's performance metrics
print(m) 
## Generalized Linear Model 
## 
## 9025 samples
##   29 predictor
##    2 classes: 'Y', 'N' 
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 7219, 7220, 7220, 7221, 7220 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8825702  0.8943044  0.6774223

Performance metrics (on the train set!)

  • ROC: it is a probability curve plotted with the True Positive Rate (y-axis) against the False Positive Rate (x-axis); you can think of it as plotting the tradeoff between maximising the true positive rate and minimising the false positive rate. The preferred area under the curve is \(1\). Our estimated \(0.88\) score indicates that a logistic classification is a good fit for our data (close to \(1\)).

  • Sensitivity: it is a measure of the proportion of the positive (\(1\) = poor) values that are correctly identified. Therefore, we have correctly identified \(89\%\) of the actual positives. The formula is: \(\frac{tp}{tp + fn}\); where tp = true positive and fn = false negative. In the video-lecture, Stephan used the term Recall, where we now use sensitivity. This means that our model does pretty well at predicting/identifying people living below the poverty line in Malawi!

  • Specificity: measures the proportion of actual negatives that are correctly identified by the model; i.e. the ability of our model to identify observations that do not belong to the category of interest. The formula is: \(\frac{tn}{tn + fp}\); where tn = true negative and fp = false positive. At roughly \(68\%\), the model correctly identifies about two-thirds of the actual negatives. Our model is not as good at identifying who doesn’t live below the poverty line in Malawi.


The performance metrics we have interpreted above are based on the training dataset only. We are interested in our model’s ability to make out-of-sample predictions. Therefore, we will use the same definitions, but take the scores from the test-dataset predictions for our final evaluation.
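If you would like to inspect those cross-validated metrics more closely, caret stores them inside the fitted object (a small sketch, reusing the model object m trained above):

# aggregated cross-validated metrics (ROC, Sens, Spec) and their standard deviations
m$results

# the same metrics, fold by fold
m$resample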

Out-of-sample performance

Notice that we have used cross-validation in our training dataset. In theory, our performance metrics have been validated in 5 different folds. To a certain extent, that means that our performance metrics above did reflect the model’s ability to extrapolate. Nevertheless, we will still see how our trained model performs in our test dataset. You can think of this step as predicting on a sixth fold. We know that the performance of a logistic classification model on the train set is relatively good; is it the same for the test dataset?

# First, use the logistic classification model (trained on the Train_df) to make predictions on the test dataset:

set.seed(12345)
pr1 <- predict(m, Test_df, type = "raw")
head(pr1) # Yes and No output
## [1] N Y Y Y Y Y
## Levels: Y N

We have specified the type of prediction we want: raw. This returns the predicted class (\(Y\) or \(N\)) as opposed to each individual’s estimated probability of falling into the category of interest \(Y\) (i.e. the estimated probability of being poor). The rule of thumb is that an observation is categorised as poor (or any chosen category) if its estimated probability is at least \(0.5\); the sketch below shows how to retrieve the probabilities and apply this rule by hand. With this information, we can create a Confusion Matrix, which will be accompanied by performance metrics.
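A small sketch of working with the probabilities yourself (type = "prob" is caret's built-in option; pr1_prob and pr1_manual are illustrative names):

# obtain class probabilities instead of raw class labels
pr1_prob <- predict(m, Test_df, type = "prob") # a dataframe with one column per class (Y and N)
head(pr1_prob$Y) # estimated probability of being poor for the first few observations

# applying the 0.5 rule of thumb by hand should reproduce the raw predictions in pr1
pr1_manual <- relevel(as.factor(ifelse(pr1_prob$Y >= 0.5, "Y", "N")), ref = "Y")
# table(pr1_manual, pr1) # sanity check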

# Next, we call the caret package's confusionMatrix function, and select the two elements to be contrasted:
# the predicted classification vector, and the actual observed vector from the test dataframe. 
confusionMatrix(pr1, Test_df[["poor"]], positive = "Y") # positive = "Y" indicates that our category of interest is Y (1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    Y    N
##          Y 1294  270
##          N  172  519
##                                          
##                Accuracy : 0.804          
##                  95% CI : (0.787, 0.8202)
##     No Information Rate : 0.6501         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5564         
##                                          
##  Mcnemar's Test P-Value : 3.953e-06      
##                                          
##             Sensitivity : 0.8827         
##             Specificity : 0.6578         
##          Pos Pred Value : 0.8274         
##          Neg Pred Value : 0.7511         
##              Prevalence : 0.6501         
##          Detection Rate : 0.5738         
##    Detection Prevalence : 0.6936         
##       Balanced Accuracy : 0.7702         
##                                          
##        'Positive' Class : Y              
## 


The first element returned by the above function is the confusion matrix, a 2x2 table that shows the predicted values from the model vs. the actual values from the test dataset. You may already be acquainted with this sort of table, but know it as a cross-tabulation. From the confusion matrix, we obtain the information that we need to estimate some performance metrics. If you need a reminder of what each cell in the 2x2 matrix represents, recall that the structure of our target variable is [\(Y(1),N(0)\)]. Therefore, the first cell is the intersection of Predicted \(Y\) vs Observed \(Y\) (or True Positive) = \(1294\), and the fourth cell is the intersection of Predicted \(N\) vs Observed \(N\) (or True Negative) = \(519\). These guys are the predictions that have hit the mark! On the other hand, the second cell is the intersection of Predicted \(Y\) vs Observed \(N\) (or False Positive) = \(270\), and the third cell is the intersection of Predicted \(N\) vs Observed \(Y\) (or False Negative) = \(172\). These were incorrect predictions. We use these counts (true positives, true negatives, false positives, false negatives) to estimate performance metrics.

Besides the performance metrics discussed previously, this function also shows the Accuracy of our model (or \(1\) - the error rate) which, at \(0.8\), indicates that our classification algorithm is highly accurate.
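As a quick check, you can reproduce these metrics by hand from the four cells printed above (a small sketch; the counts are copied from the confusion matrix):

# reproduce the main metrics from the confusion-matrix counts
tp <- 1294; tn <- 519; fp <- 270; fn <- 172

(tp + tn) / (tp + tn + fp + fn) # Accuracy, ~0.80
tp / (tp + fn)                  # Sensitivity, ~0.88
tn / (tn + fp)                  # Specificity, ~0.66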


Imbalanced data

When you have a large number of zeros (or No responses, in this case), the Accuracy metric may not be the most reliable one. If we look at its formula, number of correct predictions / total number of predictions, we see why this might be an issue: it is a lot easier to correctly predict the class of which there are plenty of instances (Yes) than the class for which we have fewer.


Imbalance is not a big problem for our target variable: the split is roughly 65/35, and the more frequent class is in fact the Yes (1) responses we care about. Nonetheless, this sets the stage for us to introduce the Kappa statistic (\(0.55\)), a measure of model accuracy that is adjusted by accounting for the possibility of a correct prediction by chance alone (a worked computation follows the list below). It ranges from 0 to 1, and can be interpreted using the following thresholds:

  • Poor = Less than 0.20

  • Fair = 0.20 to 0.40

  • Moderate = 0.40 to 0.60

  • Good = 0.60 to 0.80

  • Very good = 0.80 to 1.00

At \(0.55\), our classification model performs moderately well. Finally, Sensitivity and Specificity scores on the test dataset are very close to the ones obtained from the train dataset. This is a good sign for the out-of-sample stability of our model.
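To see where that \(0.55\) comes from, here is a worked computation of the Kappa statistic from the same confusion-matrix counts (a sketch; the formula compares the observed accuracy with the accuracy expected by chance alone):

# Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)
tp <- 1294; tn <- 519; fp <- 270; fn <- 172
n  <- tp + tn + fp + fn

p_obs <- (tp + tn) / n # observed accuracy, ~0.80
# expected accuracy: agreement we would get by chance, given the marginal frequencies
p_exp <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2

(p_obs - p_exp) / (1 - p_exp) # ~0.556, matching the Kappa reported by confusionMatrix()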

Model Visualisation

We can visualise the performance of our classification model in various ways. For now, we’ll focus on a simple ROC AUC.

# ROC AUC: Area Under the Curve
# colAUC function from the caTools library

# transform predicted values and observed values into class numeric (needed for the colAUC function)

pr_numeric <- as.numeric(as.factor(pr1))
# sanity check:
head(cbind(pr_numeric, pr1)) # the numeric values of both vectors seem to be the same. 
##      pr_numeric pr1
## [1,]          2   2
## [2,]          1   1
## [3,]          1   1
## [4,]          1   1
## [5,]          1   1
## [6,]          1   1
poor_numeric <- as.numeric(as.factor(Test_df$poor))
# sanity check
head(cbind(poor_numeric,Test_df$poor)) # all good 
##      poor_numeric  
## [1,]            1 1
## [2,]            1 1
## [3,]            1 1
## [4,]            1 1
## [5,]            1 1
## [6,]            1 1
# plot the ROC area under the curve
colAUC(pr_numeric, poor_numeric, plotROC = TRUE)

##              [,1]
## 1 vs. 2 0.7702343
# We can also plot the ROC AUC with ggplot
# First, we create a dataframe containing the observed and the predicted values (in numeric form)
roc_df <- data.frame(Observed = poor_numeric, Predicted = pr_numeric)

# Second, we add the geom_roc() layer to a ggplot object

roc_gg <- ggplot(roc_df, aes (d = Observed, m = Predicted)) + 
            geom_roc(labels = FALSE, color='orange') +
            style_roc(theme = theme_bw, guide = TRUE) # guide=TRUE adds a diagonal guideline, style_roc() adds minor grid lines, and optionally direct labels to ggplot objects containing a geom_roc layer

direct_label(roc_gg, nudge_y = 0.2) # direct_label tells you what the plotted line represents, nudge_y option places the label (you can play around with that number to see where different values place the label)
## Warning in verify_d(data$d): D not labeled 0/1, assuming 1 = 0 and 2 = 1!

## Warning in verify_d(data$d): D not labeled 0/1, assuming 1 = 0 and 2 = 1!

Notice the warning on the ggplot2 ROC AUC plot. The assumption it makes is correct, so we do not need to do anything else at this point. You can check this by contrasting the values of the labelled vs. the numeric vectors (using the head() function, as sketched below).
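For instance, a quick way to run that check (reusing the vectors created above):

# contrast the original labels with their numeric recoding
head(cbind(label = as.character(Test_df$poor), numeric = poor_numeric))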

Conclusion

How did our classification model do? Is a logistic regression the right algorithm? I trust you can form your own judgement based on the performance metrics above. Personally, I think we have improved from a linear regression, but perhaps we can do better with Ensemble Learning techniques!

Python practical

We will continue to work with the Malawi dataset, which can be downloaded in the [Prediction Policy Problems](https://www.ml4publicpolicy.com/predictionpolicy.html) tab of this website.

  1. Preliminaries: libraries and data upload


#==== Python version: 3.10.12 ====#

# Opening libraries

import sklearn as sk # our trusted Machine Learning library
from sklearn.model_selection import train_test_split # split the dataset into train and test
from sklearn.model_selection import cross_val_score # to obtain the cross-validation score
from sklearn.model_selection import cross_validate # to perform cross-validation
from sklearn.linear_model import LogisticRegression # computation of logistic model for classification
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix # returns a confusion matrix
from sklearn.metrics import roc_curve, roc_auc_score, auc # roc area under the curve
from sklearn.metrics import accuracy_score # performance metric for classification models
from sklearn.metrics import classification_report # general report for classification model

# Non-ML libraries
import random # for random state 
import csv # a library to read and write csv files 
import numpy as np # a library for handling numerical arrays and operations
import pandas as pd # a library to help us easily navigate and manipulate dataframes
import seaborn as sns # a data visualisation library
import matplotlib.pyplot as plt # a data visualisation library
import statsmodels.api as sm # computation of linear and logistic regressions


# Uploading data

malawi = pd.read_csv('/Users/michellegonzalez/Documents/GitHub/Machine-Learning-for-Public-Policy/malawi.csv')

  2. Data pre-processing


This section will not be a thorough step-by-step of the pre-processing and visualisation of our data because we have already done that in the previous session. However, we have to do something very important: recover a static variable from the original dataset that contains a single number: the poverty line in Malawi.

Feature selection: subsetting the dataset

The variable that we’re interested in recovering is lnzline. The code below reproduces the dataframe subsetting from our previous exercise. Except, this time we will make sure to store the lnzline vector for future use.


# Retrieve information from the poverty line static vector:
malawi['lnzline'].describe()
## count    1.128000e+04
## mean     7.555000e+00
## std      1.776436e-15
## min      7.555000e+00
## 25%      7.555000e+00
## 50%      7.555000e+00
## 75%      7.555000e+00
## max      7.555000e+00
## Name: lnzline, dtype: float64
# Let's store this in an object outside of the dataframe and get rid of it in the dataframe
# Static variables have zero-variance, and we already know zero-variance predictors are troublesome..
lnzline = malawi['lnzline']

# Instead of deleting case_id, we will set it as an index (we did not do this last time!). 
# This is essentially using the case_id variable as row names (and won't be included in your ML model)
malawi = malawi.set_index('case_id')

# sanity check: notice that case_id is there, but doesn't have a corresponding column (it is now only considered a row name)
malawi.head()
##              lnexp_pc_month  hhsize  hhsize2  ...       psu  strataid  lnzline
## case_id                                       ...                             
## 10101002025        6.900896       7       49  ...  10101002         1    7.555
## 10101002051        7.064378       3        9  ...  10101002         1    7.555
## 10101002072        6.823851       6       36  ...  10101002         1    7.555
## 10101002079        6.894722       6       36  ...  10101002         1    7.555
## 10101002095        6.465989       6       36  ...  10101002         1    7.555
## 
## [5 rows x 37 columns]
# deleting variables from pandas dataframe 
cols2delete = ['ea', 'EA', 'hhwght', 'psu', 'strataid', 'lnzline', 'eatype', 'region']
malawi = malawi.drop(cols2delete,axis=1) # axis=0 means delete rows and axis=1 means delete columns

# check if we have deleted the columns: we originally had 37 variables, we now should have 29
print(malawi.shape)
## (11280, 29)

At this point, we still need to do two more pre-processing steps: correctly define the vector/variable class in the dataframe, and create the binary outcome/target variable. We will repeat the class-transformation code chunk below so that you have all that is needed in one section. However, we won’t spend time explaining it in detail as that was done in the previous session.


Feature creation: create a binary variable

#==== Correctly identify each vector type: ====#

# for-loop that iterates over variables in dataframe, if they have 2 unique values, transform vector into categorical
for column in malawi:
    if malawi[column].nunique() == 2:
        malawi[column] = pd.Categorical(malawi[column])

#==== Create a binary target variable: ====#

# print summary statistics of target variable
malawi['lnexp_pc_month'].describe()
## count    11280.000000
## mean         7.358888
## std          0.675346
## min          4.776855
## 25%          6.892941
## 50%          7.305191
## 75%          7.757587
## max         11.063562
## Name: lnexp_pc_month, dtype: float64
# if the log of per capita expenditure is below the estimated poverty line, classify the individual as poor (1), else as not poor (0)

print(lnzline[0]) # copy the static number 7.555 to use as threshold / we're printing the first row of a static vector --- i.e. all rows contain the same number
## 7.5549998
# use numPy to create a binary vector (notice we have rounded up the threshold)
malawi['Poor'] = np.where(malawi['lnexp_pc_month'] <= 7.555, 1, 0)

# sanity check
malawi['Poor'].describe() # returns binary [0,1] float vector, let's turn this into a categorical vector
## count    11280.000000
## mean         0.650000
## std          0.476991
## min          0.000000
## 25%          0.000000
## 50%          1.000000
## 75%          1.000000
## max          1.000000
## Name: Poor, dtype: float64
malawi['Poor'] = pd.Categorical(malawi['Poor']) # use malawi['Poor'].info() if you want to see the transformation

# alternatively...
# malawi['Poor'] = (malawi['lnexp_pc_month'] <= 7.555).astype(bool) # also works, directly specifying a boolean type (True = poor / False = not poor)

# print a proportions table to get a first impression of the state of poverty in Malawi
malawi['Poor'].value_counts(normalize=True)
## 1    0.65
## 0    0.35
## Name: Poor, dtype: float64


According to our sample, about 65% of Malawians are considered poor. This number is not unreasonable. According to The World Bank’s Country Report for Malawi, ca. \(70\%\) of the population lives on under \(\$2.15\) a day, and the national poverty rate is estimated at \(50\%\), i.e. about half of the population is labelled as poor. These estimates were based on \(2019\) data (so, a bit more recent than our sample).



# Final data pre-processing: remove the continuous target (as it perfectly predicts the binary target in a non-informative way)

cont_target  = ['lnexp_pc_month']
malawi = malawi.drop(cont_target , axis=1)

  3. Model Validation


Let’s use a simple 80:20 split of our data.

# First, recall the df structure
malawi.info() # returns the column number, e.g. hhsize = column number 0, hhsize2 = 1... etc.
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 11280 entries, 10101002025 to 31202086374
## Data columns (total 29 columns):
##  #   Column        Non-Null Count  Dtype   
## ---  ------        --------------  -----   
##  0   hhsize        11280 non-null  int64   
##  1   hhsize2       11280 non-null  int64   
##  2   agehead       11280 non-null  int64   
##  3   agehead2      11280 non-null  int64   
##  4   north         11280 non-null  category
##  5   central       11280 non-null  category
##  6   rural         11280 non-null  category
##  7   nevermarried  11280 non-null  category
##  8   sharenoedu    11280 non-null  float64 
##  9   shareread     11280 non-null  float64 
##  10  nrooms        11280 non-null  int64   
##  11  floor_cement  11280 non-null  category
##  12  electricity   11280 non-null  category
##  13  flushtoilet   11280 non-null  category
##  14  soap          11280 non-null  category
##  15  bed           11280 non-null  category
##  16  bike          11280 non-null  category
##  17  musicplayer   11280 non-null  category
##  18  coffeetable   11280 non-null  category
##  19  iron          11280 non-null  category
##  20  dimbagarden   11280 non-null  category
##  21  goats         11280 non-null  category
##  22  dependratio   11280 non-null  float64 
##  23  hfem          11280 non-null  category
##  24  grassroof     11280 non-null  category
##  25  mortarpestle  11280 non-null  category
##  26  table         11280 non-null  category
##  27  clock         11280 non-null  category
##  28  Poor          11280 non-null  category
## dtypes: category(21), float64(3), int64(5)
## memory usage: 1.0 MB

# Then, split!
X = malawi.iloc[:, 0:28] # X is a matrix containing all variables except the last one (index 28), which conveniently is our binary target variable
y = malawi.iloc[:, 28] # y is a vector containing our target variable 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12345) # random_state is for reproducibility purposes


Fit a logistic model


#==== Create a Logistic Regression Object and Fit the Model ====#

# We are going to add a random_state (reproducibility) and increase the number of iterations from the default (100) to 1000
# We have also stated that we want to fit the intercept and have no penalty (remember the penalisation parameter from the Lasso regression? sklearn has the option of including it in the Logistic Regression as well). 

m = LogisticRegression(random_state=12345, max_iter=1000, fit_intercept=True, penalty=None).fit(X_train, y_train) 


We have successfully fit a logistic classification model. A limitation of the sklearn python library is that we cannot easily access a regression-style output table for the model. Instead, we immediately estimate performance metrics based on the testing dataset and evaluate accordingly. If you’d like to see the output of your model, I recommend re-estimating it using the statsmodels.api library (a sketch follows below). I won’t discuss it here, since we’re not particularly interested in interpreting coefficients, but in the overall predictive ability of our model. I also trust that you can do that yourself, given that we have an example of how to use the statsmodels.api library in the Linear Regression tutorial.
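For completeness, a minimal sketch of what such a re-estimation could look like (assuming the X_train and y_train objects created above; statsmodels does not add an intercept automatically, and the categorical columns need to be cast to numeric first; the object names X_train_sm and logit_sm are illustrative):

# Hypothetical re-estimation with statsmodels to inspect the coefficient table
X_train_sm = sm.add_constant(X_train.astype(float)) # add an intercept column
logit_sm = sm.Logit(y_train.astype(float), X_train_sm).fit()
print(logit_sm.summary()) # coefficients, standard errors, p-values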

Also, please note that the Logistic Regression algorithm included in the sklearn library has more capabilities than a simple logistic regression function. You might have already seen that given that we had to set a number of parameters before fitting the model. If you’d like to know more about what is included, you can read the documentation for it. Otherwise, what we have set above can be taken as a standard model fit.

Predictions and Performance

# Make out-of-sample predictions on the test dataset

y_pred = m.predict(X_test)

# Report the performance of your model using chosen metrics

print("Accuracy:", accuracy_score(y_test, y_pred))
## Accuracy: 0.8120567375886525
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
## Confusion Matrix:
##  [[ 532  276]
##  [ 148 1300]]
print("Classification Report:\n", classification_report(y_test, y_pred))
## Classification Report:
##                precision    recall  f1-score   support
## 
##            0       0.78      0.66      0.72       808
##            1       0.82      0.90      0.86      1448
## 
##     accuracy                           0.81      2256
##    macro avg       0.80      0.78      0.79      2256
## weighted avg       0.81      0.81      0.81      2256


  • The Accuracy of our model (or \(1\) - the error rate) is \(0.8\). The high score indicates that our classification algorithm is highly accurate.

  • We have printed a confusion matrix. If you need a reminder of what each cell in the 2x2 matrix represents, note that sklearn’s confusion_matrix places the observed (actual) classes in the rows and the predicted classes in the columns, with our target structured as [\(0[N],1[Y]\)]. Therefore, the first cell is the intersection of Predicted \(0\) vs Observed \(0\) (or True Negative) = \(532\), and the fourth cell is the intersection of Predicted \(1\) vs Observed \(1\) (or True Positive) = \(1300\). These guys are the predictions that have hit the mark! On the other hand, the second cell is the intersection of Predicted \(1\) vs Observed \(0\) (or False Positive) = \(276\), and the third cell is the intersection of Predicted \(0\) vs Observed \(1\) (or False Negative) = \(148\). These were incorrect predictions. We use these counts (true positives, true negatives, false positives, false negatives) to estimate performance metrics, for instance Accuracy = \(\frac{tp + tn}{tp + tn + fp + fn}\), as well as Precision, Recall, and the F1-score.

  • The Classification Report includes three more performance metrics (explained in the video-lecture by Stephan), besides the Accuracy of our model.

    • Precision: the proportion of the predictions for a given class that are correct, defined as the ratio of true positives to the sum of true positives and false positives. Based on the output above, we are more precise in our predictions of the poor (\(1\)) than in our predictions of the non-poor (\(0\)). Both values are high, close to 1, and indicate good performance.
    • Recall: a measure of the proportion of actual positive values that are correctly identified. Again, we correctly identify a larger percentage of \(1\) instances (\(90\%\) of the time!) than \(0\) instances (only \(66\%\)).
    • F1-score: the harmonic mean of precision and recall, computed as 2 * (Recall * Precision) / (Recall + Precision). It can be used to compare different classification models, and it is not a measure of global accuracy. Results are similar to the above: better prediction ability for the poor than the non-poor. (A worked computation of these metrics from the confusion-matrix counts follows below.)


Overall, we seem to be good at predicting \(1\) instances, but less so at predicting \(0\) instances. Since we are primarily interested in predicting who is poor (\(1\)), and all the corresponding scores are close to \(1\), we are doing well on that front. However, if we are unable to perform equally well for both categories, then we may still misallocate public resources.
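As a quick check, the sketch below reproduces the reported metrics by hand from the confusion-matrix counts (reusing y_test and y_pred from above; sklearn's confusion_matrix flattens a 2x2 matrix in the order tn, fp, fn, tp):

# reproduce the reported metrics for the positive class (1 = poor)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)                  # ~0.81
precision = tp / (tp + fp)                                   # ~0.82
recall    = tp / (tp + fn)                                   # ~0.90
f1        = 2 * (precision * recall) / (precision + recall)  # ~0.86
print(accuracy, precision, recall, f1)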

Model Visualisation

To conclude our evaluation, let’s compute the ROC AUC:

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

# Compute the ROC AUC score
roc_auc = auc(fpr, tpr)
print(f'ROC Area Under the Curve score: {roc_auc}')
## ROC Area Under the Curve score: 0.7781029484163885
# Plot the ROC curve! :) (it's nice to get a visual)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='orange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.1f})')
plt.plot([0, 1], [0, 1], color='darkblue', lw=2, linestyle='--')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()


The ROC (Receiver Operating Characteristic) is a probability curve plotted with the True Positive Rate (y-axis) against the False Positive Rate (x-axis); you can think of it as plotting the tradeoff between maximising the true positive rate and minimising the false positive rate. The preferred area under the curve is \(1\). Our estimated \(0.8\) (rounded up) indicates that a logistic classification is a good model for our data (close to \(1\)).

Conclusion

How did our classification model do? Is a logistic regression the right algorithm? I trust you can form your own judgement based on the performance metrics above. Personally, I think we have definitely improved from a linear regression, but perhaps we can do better with Ensemble Learning techniques!

Practice at home

As before, you can use the Bolivia dataset to try your hand at a logistic model. The data can be downloaded in the Practice at home sub-tab in the Prediction Policy Problems webpage of this site.

Let us know you are completing the tasks by answering a simple question via this Qualtrics link.

It is important for us to know that you are engaging with the course. Sending an answer - even if it is incorrect - to our Qualtrics question lets us know just that. So please say hi on Qualtrics! (And on our blog, too.)


Copyright © 2022 Michelle González Amador & Stephan Dietrich. All rights reserved.