A practical introduction to Logistic Regression for classification & prediction in Python
Logistic Regression
Logistic Regression is a supervised learning algorithm that, despite its name, is used mainly for classification tasks. In the binary case, the response variable belongs to one of two classes. The model predicts this categorical variable with the help of independent variables. Suppose there are two classes and we want to know which one a new data point belongs to: the algorithm computes a probability between 0 and 1 and assigns the point to a class based on that value. A typical example is predicting whether or not it will rain today. In logistic regression, the weighted sum of the inputs is passed through the sigmoid activation function, and the S-shaped curve that results is called the sigmoid curve.
The Math
Logistic regression uses an equation as the representation, very much like linear regression. Input values (x) are combined linearly using weights or coefficient values (referred to as the Greek capital letter Beta) to predict an output value (y). A key difference from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a numeric value.
Below is an example logistic regression equation:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Where y is the predicted output, b0 is the bias or intercept term and b1 is the coefficient for the single input value (x). Each column in your input data has an associated b coefficient (a constant real value) that must be learned from your training data. The actual representation of the model that you would store in memory or in a file is the coefficients in the equation (the beta value or b’s).
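To make the equation concrete, here is a minimal sketch that evaluates it for a single input. The coefficient values b0 = -3.0 and b1 = 0.5 are made up for illustration and are not learned from any data; the function name is also just a placeholder.

```python
import math

def predict_probability(x, b0, b1):
    """Evaluate y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) for one input x."""
    z = b0 + b1 * x                      # weighted sum of the input
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -3.0, 0.5

for x in (0, 6, 12):
    print(x, round(predict_probability(x, b0, b1), 3))
```

With these values the weighted sum is zero at x = 6, so the predicted probability there is exactly 0.5; smaller inputs fall below it and larger inputs rise above it, always staying between 0 and 1.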
Logistic vs Linear Regression
Linear regression predicts a continuous dependent variable from a given set of independent features, whereas logistic regression predicts a categorical one.
Linear regression is used to solve regression problems whereas logistic regression is used to solve classification problems.
In linear regression, the approach is to find the best-fit line to predict the output, whereas in logistic regression the approach is to fit an S-shaped curve that separates the two classes, 0 and 1.
The coefficients of linear regression are estimated by least squares, whereas those of logistic regression are estimated by maximum likelihood.
In linear regression, the output is continuous, like price or age, whereas in logistic regression the output is categorical, like Yes/No or 0/1.
Linear regression assumes a linear relationship between the dependent variable and the independent features, whereas logistic regression assumes a linear relationship between the independent features and the log-odds of the outcome, not the outcome itself.
Strong collinearity between independent features makes the coefficients unstable and hard to interpret in both models, so it is best avoided in linear regression and logistic regression alike.
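The difference in output type can be seen on a toy example. This is only a sketch: the ten data points are invented, and both models are fitted with scikit-learn's default settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented data: one feature, and a binary label that tends to 1 as x grows.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

X_new = np.array([[-5.0], [4.5], [15.0]])
linear_out = linear.predict(X_new)                   # unbounded real numbers
logistic_out = logistic.predict_proba(X_new)[:, 1]   # probabilities of class 1

print("linear  :", linear_out.round(2))
print("logistic:", logistic_out.round(2))
```

The straight line produces values below 0 and above 1 for inputs far from the training range, while the logistic model always returns a probability between 0 and 1.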
Python for Logistic Regression
Python is a popular language among data scientists for running machine learning algorithms, whether simple or complex. It has an extensive collection of machine learning packages that take care of much of the coding. In this article, we will be building and evaluating our logistic regression model using python’s scikit-learn package. The case we are going to solve is whether a telecommunication company's customers will stay with the company or leave. Let’s solve it in python!
Step-1: Importing Packages
For our logistic regression model, the primary packages include scikit-learn for building and training the model, pandas for data processing, and finally NumPy for working with arrays. Let’s import all the required packages in python!
Python Implementation:
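A minimal sketch of the imports this walkthrough relies on, assuming a recent scikit-learn release. The exact list is an assumption based on the functions named in the later steps; ‘jaccard_score’ stands in for ‘jaccard_similarity_score’, which recent releases no longer ship.

```python
import numpy as np                                    # working with arrays
import pandas as pd                                   # data processing

from sklearn.preprocessing import StandardScaler      # feature scaling
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.linear_model import LogisticRegression   # the model itself

# Evaluation metrics used in the last step.
from sklearn.metrics import (
    jaccard_score,
    precision_score,
    log_loss,
    classification_report,
    confusion_matrix,
)
```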
Our next step is importing and working with the data using pandas. We will also do some EDA and data cleaning in that step.
Step-2: Importing and Working with the Data
We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically it is less expensive to keep customers than acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company. This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.
The dataset includes information about:
Customers who left within the last month — the column is called Churn
Services that each customer has signed up for — phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer account information — how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers — gender, age range, and if they have partners and dependents
Let’s import and clean the data using python!
Python Implementation:
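A minimal sketch of loading and cleaning, under loud assumptions: the rows below are invented, and the column names ‘tenure’, ‘age’, ‘income’ and ‘churn’ are hypothetical stand-ins for the real dataset's columns. An in-memory CSV replaces the real file so the snippet runs on its own.

```python
import io
import pandas as pd

# Stand-in for the real file: a few made-up rows with hypothetical columns.
csv_text = """tenure,age,income,churn
11,33,136.0,1.0
33,33,33.0,1.0
23,30,30.0,0.0
38,35,76.0,0.0
7,35,80.0,0.0
"""

# With a real file you would pass its path to pd.read_csv instead.
df = pd.read_csv(io.StringIO(csv_text))

# Keep the columns of interest and store the label as an integer (0 or 1).
df = df[["tenure", "age", "income", "churn"]].copy()
df["churn"] = df["churn"].astype(int)

print(df.head())
```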
Output:
Now that we have imported our data into our python environment, it’s time to explore the dataset using pandas’ handy functions.
(i) Statistical view of data
Python Implementation:
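A minimal sketch of the statistical view, assuming it is produced with pandas' ‘describe’ method. The DataFrame here is randomly generated with hypothetical column names, so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data with hypothetical column names.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=200),
    "age": rng.integers(18, 78, size=200),
    "churn": rng.integers(0, 2, size=200),
})

# count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary)
```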
Output:
(ii) Data info
Python Implementation:
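A minimal sketch of the data info view, assuming pandas' ‘info’ method. The three rows and the column names are invented; one value is left missing on purpose to show how non-null counts reveal gaps.

```python
import io
import pandas as pd

# Invented rows; the missing income value is deliberate.
df = pd.DataFrame({
    "tenure": [11, 33, 23],
    "income": [136.0, 33.0, None],
    "churn": [1, 1, 0],
})

# df.info() on its own prints to the console; a buffer lets us keep the text.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report)

# Missing values per column, as a quick cleaning check.
missing = df.isnull().sum()
print(missing)
```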
Output:
With this, we come to the end of exploring our dataset. Next, we are going to split our dataset into two parts: a training set and a testing set. Let’s proceed to the next step.
Step-3: Splitting the dataset
As I mentioned before, in this step we are going to split our dataset into a training set and a testing set. For that, we first define the independent variables, which make up the ‘X’ variable, and the dependent variable, which is the ‘Y’ variable. Let’s define the variables in python!
Python Implementation:
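A minimal sketch of defining the two variables, assuming the features and the label live in one DataFrame. The data is randomly generated and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data with hypothetical column names.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=200),
    "age": rng.integers(18, 78, size=200),
    "income": rng.normal(70, 30, size=200).round(1),
    "churn": rng.integers(0, 2, size=200),
})

feature_cols = ["tenure", "age", "income"]

X = np.asarray(df[feature_cols])   # independent variables, one row per customer
y = np.asarray(df["churn"])        # dependent variable, 0 or 1

print("X sample:", X[:5])
print("y sample:", y[:5])
```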
Output:
Using the ‘StandardScaler’ class in scikit-learn, we are going to standardize the independent variables (the ‘X’ variable) so that every feature has zero mean and unit variance. Follow the code to scale the X variable in python.
Python Implementation:
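A minimal sketch of the scaling step on randomly generated features; the means and spreads used to generate them are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three made-up features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[30.0, 40.0, 70.0], scale=[20.0, 12.0, 35.0], size=(200, 3))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # subtract each column's mean, divide by its std

print("means:", X_scaled.mean(axis=0).round(6))
print("stds :", X_scaled.std(axis=0).round(6))
```

Here the scaler is fitted on all rows, as the step describes. A stricter workflow would fit it on the training rows only and reuse it on the test rows, so that no information from the test set leaks into training.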
Output:
Now we have all the required components to split our data into a training set and a testing set. We can easily split our data using the ‘train_test_split’ function provided by scikit-learn in python. Let’s split our data in python!
Python Implementation:
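A minimal sketch of the split. The 200 placeholder rows, the 30% test share and the ‘random_state’ value are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 200 rows, 2 features, alternating labels.
X = np.arange(400, dtype=float).reshape(200, 2)
y = np.tile([0, 1], 100)

# test_size and random_state are assumed values; fixing random_state
# makes the split reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

print("Train set:", X_train.shape, y_train.shape)
print("Test set:", X_test.shape, y_test.shape)
```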
Output:
After splitting the data into a training set and testing set, we are now ready for our Logistic Regression modeling in python. So let’s proceed to the next step.
Step-4: Modelling (Logistic Regression with scikit-learn)
Let's build our model using the ‘LogisticRegression’ class from the scikit-learn package. This class implements logistic regression and can use different numerical optimizers to find the parameters, including the ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, and ‘saga’ solvers. The scikit-learn version of logistic regression supports regularization, a technique used to reduce overfitting in machine learning models. The ‘C’ parameter is the inverse of the regularization strength and must be a positive float; smaller values specify stronger regularization. Now, let's fit our model with the train set in python.
Python Implementation:
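A minimal sketch of the fitting step. The values C=0.1 and solver='liblinear' are taken from the model printout in this step; the data is synthetic (generated with scikit-learn's ‘make_classification’), and the split settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=200, n_features=6, random_state=4)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

# Small C means strong regularization.
lr = LogisticRegression(C=0.1, solver="liblinear")
lr.fit(X_train, y_train)

print(lr)
print("coefficients:", lr.coef_.round(3))
print("intercept   :", lr.intercept_.round(3))
```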
Output:
LogisticRegression(C=0.1,class_weight=None,dual=False, fit_intercept=True,intercept_scaling=1,l1_ratio=None,max_iter=100,
multi_class='auto',n_jobs=None,penalty='l2',random_state=None, solver='liblinear',tol=0.0001,verbose=0,warm_start=False)
Now we can do some predictions on our test set using our trained Logistic Regression model. Follow the code to do predictions in python.
Python Implementation:
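A minimal sketch of the prediction step on synthetic data; the data and the split settings are assumptions, while the model settings follow the printout from the previous step.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=200, n_features=6, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)
lr = LogisticRegression(C=0.1, solver="liblinear").fit(X_train, y_train)

yhat = lr.predict(X_test)              # hard class labels, 0 or 1
yhat_prob = lr.predict_proba(X_test)   # one probability column per class

print("classes :", lr.classes_)        # tells you which column is which class
print("labels  :", yhat[:5])
print("probs   :", yhat_prob[:5].round(3))
```

Printing ‘lr.classes_’ is a cheap way to check which class each probability column refers to; each row of ‘yhat_prob’ sums to 1.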
Output:
In the above code, ‘predict_proba’ returns probability estimates for all classes, ordered by class label. So, the first column is the probability of class 0, P(Y=0|X), and the second column is the probability of class 1, P(Y=1|X).
Step-5: Model Evaluation
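In this step, we are going to evaluate our model using five evaluation metrics provided by scikit-learn, namely ‘jaccard_similarity_score’, ‘precision_score’, ‘log_loss’, ‘classification_report’, and finally the ‘confusion_matrix’. Note that ‘jaccard_similarity_score’ has been removed from recent scikit-learn releases, where ‘jaccard_score’ takes its place.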
(i) Jaccard similarity score or Jaccard index
We can define the Jaccard index as the size of the intersection divided by the size of the union of two label sets, here the predicted labels and the true labels. A score of 1.0 means the predicted labels match the true labels exactly, and a score of 0.0 means the two sets have nothing in common. Follow the code to use the ‘jaccard_similarity_score’ function to evaluate our model in python.
Python Implementation:
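A minimal sketch of the Jaccard index on eight invented labels. It uses ‘jaccard_score’, the function available in recent scikit-learn releases, rather than the older ‘jaccard_similarity_score’; that swap is an assumption about your installed version.

```python
from sklearn.metrics import jaccard_score

# Invented true and predicted labels.
y_test = [1, 1, 1, 0, 0, 0, 0, 1]
yhat = [1, 0, 1, 0, 0, 1, 0, 0]

# For the positive class: TP / (TP + FP + FN)
score = jaccard_score(y_test, yhat, pos_label=1)
print(score)
```

In this toy case there are 2 true positives, 1 false positive and 2 false negatives, so the index is 2 / 5 = 0.4.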
Output:
(ii) Precision Score
Now, let’s try the ‘precision_score’ evaluation metric to evaluate our model in python.
Python Implementation:
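A minimal sketch of ‘precision_score’ on eight invented labels, not on the real test set.

```python
from sklearn.metrics import precision_score

# Invented true and predicted labels.
y_test = [1, 1, 1, 0, 0, 0, 0, 1]
yhat = [1, 0, 1, 0, 0, 1, 0, 0]

# Of everything predicted as class 1, how much really is class 1?
precision = precision_score(y_test, yhat, pos_label=1)
print(precision)
```

The classifier predicts class 1 three times here and is right twice, so the precision is 2 / 3.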
Output:
(iii) Log Loss
Now, let’s try log loss for evaluation. In logistic regression, the output can be the probability that customer churn is yes (or equals 1). This probability is a value between 0 and 1. Log loss (logarithmic loss) measures the performance of a classifier whose predicted output is a probability between 0 and 1. Remember that the lower the log loss, the better the predicted probabilities match the actual labels. Let’s do it in python!
Python Implementation:
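A minimal sketch of ‘log_loss’ on four invented predictions, next to the same quantity computed by hand so the formula is visible. The probability rows follow the ‘predict_proba’ layout, one column per class in label order.

```python
import math
from sklearn.metrics import log_loss

# Invented labels and predicted probabilities [P(class 0), P(class 1)].
y_test = [1, 0, 1, 0]
yhat_prob = [
    [0.2, 0.8],
    [0.9, 0.1],
    [0.6, 0.4],
    [0.7, 0.3],
]

loss = log_loss(y_test, yhat_prob)

# By hand: average of -log(probability assigned to the true class).
manual = -sum(math.log(p[label]) for label, p in zip(y_test, yhat_prob)) / len(y_test)

print(loss, manual)
```

The third row, where the model gives the true class only 0.4, contributes the largest penalty; confident wrong predictions are punished hardest.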
Output:
(iv) Classification Report
The ‘classification_report’ function provides a summary of our model. For each class it lists the precision, recall, F1 score, and support (the number of test samples in that class). Reading a classification report, we can quickly judge how well our model performs on each class. Let’s do it in python!
Python Implementation:
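A minimal sketch of ‘classification_report’ on eight invented labels, not on the real test set.

```python
from sklearn.metrics import classification_report

# Invented true and predicted labels.
y_test = [1, 1, 1, 0, 0, 0, 0, 1]
yhat = [1, 0, 1, 0, 0, 1, 0, 0]

# One row per class, plus accuracy and averaged rows at the bottom.
report = classification_report(y_test, yhat)
print(report)
```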
Output:
Based on the count of each section, we can calculate the precision and recall of each label:
Precision is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)
Recall is the true positive rate. It is defined as: recall = TP / (TP + FN)
So, we can calculate the precision and recall of each class.
F1 score: Now we are in a position to calculate the F1 score for each label from the precision and recall of that label. The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0. It is a good way to show that a classifier has a good value for both recall and precision.
Finally, we can summarize the classifier with the average of the F1 scores of both labels, weighted by the number of test customers in each class, which is 0.74 in our case.
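The formulas above can be worked through by hand. This sketch uses the counts from this article's confusion matrix discussion (5 and 12 for actual churn 1, 1 and 42 for actual churn 0); the helper name is a placeholder.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1 (churn): 5 caught, 12 missed, 1 customer wrongly flagged.
p1, r1, f1_1 = precision_recall_f1(tp=5, fp=1, fn=12)

# Class 0 (no churn): the roles of the two error types swap.
p0, r0, f1_0 = precision_recall_f1(tp=42, fp=12, fn=1)

support_1, support_0 = 17, 43   # number of test customers in each class

macro_f1 = (f1_1 + f1_0) / 2
weighted_f1 = (support_1 * f1_1 + support_0 * f1_0) / (support_1 + support_0)

print("class 1:", round(p1, 2), round(r1, 2), round(f1_1, 2))
print("class 0:", round(p0, 2), round(r0, 2), round(f1_0, 2))
print("macro F1:", round(macro_f1, 2), "weighted F1:", round(weighted_f1, 2))
```

With these counts the support-weighted F1 rounds to 0.74, while the plain (macro) average of the two F1 scores is lower, at about 0.65, because the weak score on the churn class counts for as much as the strong score on the larger class.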
(v) Confusion Matrix
Seeing the confusion matrix in the form of a heatmap makes more sense than seeing it in the form of an array. Even though scikit-learn has a built-in function to plot a confusion matrix, we are going to define and plot it from scratch in python. Follow the code to implement a custom confusion matrix function in python.
Python Implementation:
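A minimal sketch of a from-scratch confusion matrix with a heatmap plot. Both function names are placeholders, the labels are invented, and matplotlib is assumed to be installed for the plotting part. The label order [1, 0] puts actual churn 1 in the first row.

```python
import numpy as np

def confusion_counts(y_true, y_pred, labels):
    """Rows are actual labels, columns are predicted labels, in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[index[actual], index[predicted]] += 1
    return cm

def plot_confusion_matrix(cm, class_names, title="Confusion matrix"):
    """Draw the matrix as a heatmap with the count written in each cell."""
    import matplotlib.pyplot as plt   # imported here so counting works without it

    fig, ax = plt.subplots()
    image = ax.imshow(cm, cmap="Blues")
    fig.colorbar(image, ax=ax)
    ax.set_xticks(range(len(class_names)))
    ax.set_xticklabels(class_names)
    ax.set_yticks(range(len(class_names)))
    ax.set_yticklabels(class_names)
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    ax.set_title(title)

    # Light text on dark cells, dark text on light cells.
    threshold = cm.max() / 2
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center",
                    color="white" if cm[i, j] > threshold else "black")
    fig.tight_layout()
    return fig

# Invented true and predicted labels.
y_test = [1, 1, 1, 0, 0, 0, 0, 1]
yhat = [1, 0, 1, 0, 0, 1, 0, 0]

cm = confusion_counts(y_test, yhat, labels=[1, 0])
print(cm)

# Uncomment to draw and save the heatmap:
# plot_confusion_matrix(cm, ["churn=1", "churn=0"]).savefig("confusion_matrix.png")
```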
Output:
Look at the first row. The first row is for customers whose actual churn value in the test set is 1. As you can calculate, out of the 60 customers in the test set, the churn value of 17 of them is 1. Out of these 17, the classifier correctly predicted 5 as 1 and wrongly predicted 12 as 0. In other words, for 5 customers the actual churn value was 1 and the classifier also predicted 1. However, while the actual label of 12 customers was 1, the classifier predicted 0 for them, which is not very good. We can consider this the model's error for the first row.
What about the customers with churn value 0? Let's look at the second row. There were 43 customers whose churn value was 0. The classifier correctly predicted 42 of them as 0 and wrongly predicted one of them as 1. So, it has done a good job of predicting the customers with churn value 0.
A good thing about the confusion matrix is that it shows the model’s ability to correctly predict or separate the classes. In the specific case of binary classifiers, such as this example, we can interpret these numbers as the counts of true positives, false positives, true negatives, and false negatives.
Final Thoughts!
After a long process of theory and practical implementation in python, we finally built a fully functional logistic regression model that can be used to solve real-world problems. Logistic regression rests on a vast body of math and statistics, of which we covered only a small part in this article. Even though packages like scikit-learn and NumPy handle the complex math for us, it is always good to have a strong math foundation. Also, the model that we built in this article is very basic, so there is a lot more to explore in building a logistic regression model. Remember that hands-on learning is really important when it comes to machine learning; without it, we tend to forget the concepts. So, never stop learning and never ever stop implementing what you learn. If you missed any of the coding parts, don’t worry, I’ve provided the full source code at the end of this article.
Happy Machine Learning!
Full code: