In this tutorial, we will learn about binary logistic regression and its application to real-life data using Python. We have also covered binary logistic regression in R in another tutorial. Binary logistic regression remains one of the most widely used predictive modeling methods. It is a classification algorithm used to predict the probability of a categorical dependent variable, specifically a binary variable that takes two possible values, typically coded as 0 and 1.

You can download the data files for this tutorial here.

We’ll first recap a few aspects of binary logistic regression and then focus on statistical modeling, hypothesis testing and classification tables using Python. We’ll use a case study in the banking domain to demonstrate the method.

## Binary Logistic Regression in Python

Binary logistic regression models the relationship between a set of independent variables and a binary dependent variable. It is useful when the dependent variable is dichotomous in nature, such as death or survival, absence or presence, pass or fail. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X. Independent variables can be categorical or continuous, for example, gender, age, income, geographical region and so on. Binary logistic regression models the logit of p, where p is the probability that the dependent variable takes the value one.

## Statistical Model – For k Predictors

Let’s see what the statistical model in binary logistic regression looks like:

$$\log\left(\frac{p}{1-p}\right) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

In this equation, p is the probability that Y equals one given X, where Y is the dependent variable and the X’s are the independent variables. $b_0$ to $b_k$ are the parameters of the model, and they are estimated using the maximum likelihood method. The left-hand side of the equation, the logit of p, ranges from minus infinity to plus infinity.

where,

$p$ : Probability that Y=1 given X

$Y$ : Dependent Variable

$X_1, X_2, \dots, X_k$ : Independent Variables

$b_0, b_1, \dots, b_k$ : Parameters of Model
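The link between a probability and its logit can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `logit` and `inv_logit` are chosen for illustration and are not part of the tutorial's code.

```python
import math

def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Probability that corresponds to a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities near 0 map to large negative logits, near 1 to large positive ones.
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    z = logit(p)
    print(f"p = {p:.2f}  logit(p) = {z:+.3f}  back to p = {inv_logit(z):.2f}")
```

Because the logit is unbounded while p stays between 0 and 1, a linear combination of predictors can be placed on the right-hand side without producing invalid probabilities.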

## Case Study – Modeling Loan Defaults

Let’s explain the concept of binary logistic regression using a case study from the banking sector. Our bank has the demographic and transactional data of its loan customers. It wants to develop a model that predicts defaulters and helps the bank in its loan disbursal decisions. The objective is to predict whether customers applying for a loan will default or not. We will use a sample of size 700 to develop the model. The independent variables are age group, years at current address, years at current employer, debt to income ratio, credit card debt and other debt. All of these variables are collected during the loan application process. The dependent variable is the status observed after the loan is disbursed, which is one if the customer is a defaulter and zero if not.

## BLR Data Snapshot

Here’s a snapshot of the data. Our dependent variable is binary, whereas the independent variables are either categorical or continuous in nature.

## Binary Logistic Regression in Python

Let’s import our data and check the data structure in Python. As usual, we import the data using the **read_csv** function in the pandas library, and use the **info** method to check the data structure. We can see here that the Age variable is an integer type.

```python
# Import data and check data structure before running model
import pandas as pd

bankloan = pd.read_csv('BANK LOAN.csv')
bankloan.info()
```

# Output:

Age should be a categorical variable, and therefore needs to be converted into a category type. If it isn’t converted, Python will interpret it as a numeric variable, which is not correct, as we are considering **age groups** in our model.

```python
# Change 'AGE' variable into categorical
bankloan['AGE'] = bankloan['AGE'].astype('category')
bankloan.info()
```

Age is stored as an integer and needs to be converted to the “category” type for modeling purposes.

# Output:

Logistic regression uses the logit link function. As with the linear regression model, dependent and independent variables are separated using the tilde sign, and independent variables are separated by the plus sign.

So let’s see which independent variables affect whether customers turn into defaulters. After fitting the logistic regression model, we carry out individual hypothesis tests to identify significant variables. We then use the **summary** function on the model object to get detailed output. Variables whose p-value is less than 0.05 are considered statistically significant. Since the p-value is less than 0.05 for EMPLOY, ADDRESS, DEBTINC and CREDDEBT, these independent variables are significant.

## Logistic Regression using logit function

```python
import statsmodels.formula.api as smf

riskmodel = smf.logit(formula='DEFAULTER ~ AGE + EMPLOY + ADDRESS + DEBTINC + CREDDEBT + OTHDEBT', data=bankloan).fit()
```

logit() specifies the logistic regression model, and fit() estimates it from the data.

## BLR Model summary

```python
riskmodel.summary()
```

summary() generates a detailed summary of the model.

## Re-run the BLR Model in Python

Once the variables to be retained are finalized, we re-run the binary logistic regression model including only the significant variables. Again, the output of the summary function provides the revised coefficients for the model.

```python
riskmodel = smf.logit(formula='DEFAULTER ~ EMPLOY + ADDRESS + DEBTINC + CREDDEBT', data=bankloan).fit()
riskmodel.summary()
```

In this output, all independent variables are statistically significant and the signs are logical, so this model can be used for further diagnosis.

# Output:

## Odds Ratios In Python

After substituting values of parameter estimates this is how the final model will appear.

The probability of defaulting can be predicted if the values of the X variables are entered into the equation.
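As a rough illustration of entering X values into the fitted equation, the sketch below applies the inverse logit to a linear predictor. The coefficient values and the customer record are hypothetical placeholders, not the estimates from this tutorial's model; in practice the coefficients would come from `riskmodel.params`.

```python
import math

# Hypothetical coefficients for illustration only (NOT the fitted estimates).
coef = {
    "Intercept": -0.8,
    "EMPLOY": -0.25,
    "ADDRESS": -0.10,
    "DEBTINC": 0.09,
    "CREDDEBT": 0.57,
}

def predict_probability(coef, customer):
    """Apply p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = coef["Intercept"] + sum(coef[name] * value for name, value in customer.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical applicant: 5 years employed, 3 years at address, and so on.
customer = {"EMPLOY": 5, "ADDRESS": 3, "DEBTINC": 12.0, "CREDDEBT": 1.5}
p = predict_probability(coef, customer)
print(f"Predicted probability of default: {p:.3f}")
```

With real estimates, the same calculation reproduces what the model's predict function returns for a single customer.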

We use the odds ratio to measure the association between the independent variable and dependent variable. Once the parameter is estimated with confidence intervals, by simply taking the antilog we can get the Odds Ratios with confidence intervals. In Python the ‘**conf_int**’ function calculates the confidence interval for parameters, and then parameter estimates are added to the object. The antilog values are printed to give a table of odds ratios.

```python
import numpy as np

conf = riskmodel.conf_int()
conf['OR'] = riskmodel.params
conf.columns = ['2.5%', '97.5%', 'OR']
print(np.exp(conf))
```

conf_int(): calculates confidence intervals for parameters

riskmodel.params: holds the model's parameter estimates
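To make the antilog step concrete, here is a small sketch of how an odds ratio and its 95% interval follow from one coefficient and its standard error. The numbers are hypothetical, and the normal-approximation interval is assumed here for illustration.

```python
import math
from statistics import NormalDist

# Hypothetical coefficient and standard error on the log-odds scale.
coef, std_err = 0.57, 0.09

# Two-sided 95% critical value of the standard normal (about 1.96).
z = NormalDist().inv_cdf(0.975)

lower = coef - z * std_err
upper = coef + z * std_err

# Exponentiating each bound moves the interval to the odds-ratio scale.
print(f"OR = {math.exp(coef):.3f}, 95% CI = ({math.exp(lower):.3f}, {math.exp(upper):.3f})")
```

An interval on the log-odds scale that excludes zero becomes an odds-ratio interval that excludes one.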

## Odds Ratios in Python

From the output here, we can see that none of the confidence intervals for the odds ratios includes one, which indicates that all the variables included in the model are significant. The odds ratio for CREDDEBT is approximately 1.77.

So for a one-unit increase in CREDDEBT, the odds of being a defaulter are multiplied by about 1.77, holding the other variables constant.
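The arithmetic behind this interpretation can be sketched as follows. The odds ratio of 1.77 is the approximate value quoted above; the baseline odds of 0.25 are a hypothetical starting point chosen only to show the effect on a probability.

```python
import math

odds_ratio = 1.77              # approximate odds ratio for CREDDEBT quoted above
coef = math.log(odds_ratio)    # implied coefficient on the log-odds scale

# Odds ratios multiply: a change of several units compounds the effect.
for units in (1, 2, 3):
    print(f"{units}-unit increase multiplies the odds by {math.exp(coef * units):.2f}")

# Hypothetical baseline: odds of 0.25, i.e. a probability of 0.20.
base_odds = 0.25
new_odds = base_odds * odds_ratio
base_prob = base_odds / (1 + base_odds)
new_prob = new_odds / (1 + new_odds)
print(f"Probability moves from {base_prob:.3f} to {new_prob:.3f}")
```

Note that the multiplier applies to the odds, not to the probability itself, so the change in probability depends on the starting point.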

# Output:

## Predicting Probabilities in Python

We obtain the predicted probabilities from the final model using the **predict** function. The predicted probabilities are saved in the same **bankloan** dataset in a new variable, ‘pred’.

The last column in the data gives predicted probabilities using the final model.

## Classification Table

It’s important to measure the goodness of fit of any fitted model. Based on some cut off value of probability, the dependent variable Y is estimated to be either one or zero. A cross tabulation of observed values of Y and predicted values of Y is known as a classification table.

The accuracy percentage measures how accurate a model is in predicting the outcomes.

In the table, the dependent variable was observed and predicted to be **zero** 478 times, and observed and predicted to be **one** 92 times.

Therefore, the accuracy rate is calculated as (478 + 92) divided by the total sample size of 700, which gives 81.43%. The misclassification rate is the percentage of wrongly predicted observations. In this example, it is obtained as (39 + 91) divided by 700, which gives 18.57%.
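The calculation can be reproduced directly from the cell counts. In this sketch the false-positive count is not typed in but inferred as the remainder of the 700 observations, which is an assumption made here so that the four cells sum to the sample size.

```python
# Counts from the classification table at a 0.5 threshold.
tn, tp, fn = 478, 92, 91
n = 700

# Remaining cell, inferred from the total sample size (assumption).
fp = n - tn - tp - fn

accuracy = (tn + tp) / n
misclassification = (fp + fn) / n

print(f"False positives (inferred): {fp}")
print(f"Accuracy: {accuracy:.2%}")
print(f"Misclassification rate: {misclassification:.2%}")
```

Accuracy and misclassification rate always sum to one, which is a quick sanity check on any classification table.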

## Classification Table Terminology

Several terms are used to describe a classification table: **sensitivity**, **specificity**, **false positive rate** and **false negative rate**. The sensitivity of a model is the percentage of correctly predicted occurrences or events. It is the probability that the predicted value of Y is one, given that the observed value of Y is one. Conversely, specificity is the percentage of non-occurrences that are correctly predicted, that is, the probability that the predicted value of Y is zero, given that the observed value of Y is also zero. The false positive rate is the percentage of non-occurrences that are wrongly predicted as events. Similarly, the false negative rate is the percentage of occurrences that are wrongly predicted as non-occurrences.
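These four definitions translate into simple ratios of the table's cells. The sketch below uses a hypothetical helper, `classification_rates`, and the counts from the classification table discussed earlier, with the false-positive cell taken as 39 so that the cells sum to 700 (an inferred value).

```python
def classification_rates(tn, fp, fn, tp):
    """Return the four rates derived from a 2x2 classification table."""
    return {
        "sensitivity": tp / (tp + fn),          # P(predicted 1 | observed 1)
        "specificity": tn / (tn + fp),          # P(predicted 0 | observed 0)
        "false_positive_rate": fp / (tn + fp),  # P(predicted 1 | observed 0)
        "false_negative_rate": fn / (tp + fn),  # P(predicted 0 | observed 1)
    }

rates = classification_rates(tn=478, fp=39, fn=91, tp=92)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity and the false negative rate share a denominator (observed ones), as do specificity and the false positive rate (observed zeros), so each pair sums to one.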

## Sensitivity and Specificity calculations

This table represents the accuracy, sensitivity and specificity values for different cut off values. On the basis of the accuracy, sensitivity and specificity values, we can deduce that the cut off value of 0.3 is the best cut off value for the model.
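A comparison across cut off values can be sketched as a simple loop. The observed outcomes and predicted probabilities below are a tiny hypothetical sample, not the bank loan data, and `rates_at` is an illustrative helper name.

```python
# Hypothetical observed outcomes and predicted probabilities (illustration only).
observed = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.25, 0.40, 0.55, 0.90]

def rates_at(cutoff, observed, predicted):
    """Return (accuracy, sensitivity, specificity) when class 1 means p > cutoff."""
    tp = sum(1 for y, p in zip(observed, predicted) if y == 1 and p > cutoff)
    tn = sum(1 for y, p in zip(observed, predicted) if y == 0 and p <= cutoff)
    positives = sum(observed)
    negatives = len(observed) - positives
    return (tp + tn) / len(observed), tp / positives, tn / negatives

for cutoff in (0.2, 0.3, 0.4, 0.5):
    acc, sens, spec = rates_at(cutoff, observed, predicted)
    print(f"cutoff {cutoff:.1f}: accuracy {acc:.2f}, sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Lowering the cut off raises sensitivity at the cost of specificity, so the choice of cut off depends on which kind of error is more costly for the bank.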

## Classification table in Python

Let’s now obtain the classification table in Python. The **predict** function gives predicted probabilities. We set the threshold value to 0.5 and the predicted class is assigned a value of 1 if the predicted probability is greater than the threshold of 0.5. Finally, we use the confusion_matrix function to obtain a classification table using the observed defaulter status and the predicted class.

## Predicting Probabilities

```python
import numpy as np
from sklearn.metrics import confusion_matrix

predicted_values1 = riskmodel.predict()
threshold = 0.5
predicted_class1 = np.zeros(predicted_values1.shape)
predicted_class1[predicted_values1 > threshold] = 1
cm1 = confusion_matrix(bankloan['DEFAULTER'], predicted_class1)
print('Confusion Matrix : \n', cm1)
```

The confusion_matrix function creates a cross table of observed Y (defaulter) vs. predicted Y.

# Output:

## Sensitivity and Specificity in Python

Now let’s calculate sensitivity and specificity values in Python, using the definitions discussed earlier. The sensitivity of the model is 50.27%, whereas the specificity is 92.46%. The sensitivity is clearly lower than desired, so we can try different thresholds and choose a better one, as explained earlier.

## Sensitivity and Specificity

```python
sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Sensitivity : ', sensitivity)
specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Specificity : ', specificity)
```

# Output:

Sensitivity : 0.5027322404371585 Specificity : 0.9245647969052224

**Interpretation :**

The sensitivity is 50.27% and the specificity is 92.46%. Note that the threshold is set at 0.5.

## Precision & Recall values of the model

The precision and recall values of the model are routinely assessed in a classification model. Precision tells us what percentage of predicted positive cases are correctly predicted.

Recall tells us what percentage of actual positive cases are correctly predicted.
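As a quick worked example, precision and recall for the defaulter class can be computed from the same classification table counts. The false-positive count of 39 is inferred from the total of 700 rather than quoted directly, and the F1 score is added here only as a common companion measure.

```python
# Cell counts at a 0.5 threshold; fp is inferred so that the cells sum to 700.
tp, fp, fn = 92, 39, 91

precision = tp / (tp + fp)   # share of predicted defaulters who actually defaulted
recall = tp / (tp + fn)      # share of actual defaulters the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

Recall for the positive class is the same quantity as sensitivity, so it matches the 50.27% reported earlier.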

## Classification Report

The **classification_report** function in Python is also very useful. We import it from the **sklearn.metrics** module. It accepts the observed Y and the predicted class of Y as its two arguments. The output shows the recall, precision and accuracy of the model.

```python
# Classification Report
from sklearn.metrics import classification_report

print(classification_report(bankloan['DEFAULTER'], predicted_class1))
```

classification_report() gives recall, precision and accuracy along with other measures.

# Output:

## Quick Recap

Let’s quickly recap. In this session, we learned about binary logistic regression modelling and its application. We then used Python code to estimate model parameters and obtain a classification report.

This tutorial lesson is taken from Digita Schools Advanced Diploma in Data Analytics and the Postgraduate Diploma in Data Science. Continue to the follow on tutorial on Binary Logistic Regression in Python Part II
