Logistic regression is a supervised machine learning algorithm that accomplishes binary classification tasks by predicting the probability of an outcome, event, or observation. The model delivers a binary (dichotomous) outcome with two possible values, such as yes/no, 0/1, or true/false.
Logistic regression originated in the field of statistics; its modern form is commonly credited to statistician David Cox, who published it in 1958. It has since become a fundamental tool in various fields, including medicine, social sciences, and machine learning, due to its simplicity and effectiveness in handling binary outcomes.
Logistic regression is widely used in data science and machine learning for tasks such as spam detection, credit scoring, and medical diagnosis. Its importance lies in its ability to provide probabilistic interpretations and its ease of implementation and interpretation.
In linear regression, the dependent variable is a metric (continuous) value, e.g. salary or electricity consumption. In logistic regression, the dependent variable is dichotomous, e.g. 0 or 1, true or false, positive or negative.
Logistic regression uses the logistic function, also called the sigmoid function, to map a linear combination of the predictors to a probability. The sigmoid is an S-shaped curve that converts any real value to a value between 0 and 1.
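As a minimal sketch of that mapping, the snippet below implements the sigmoid with NumPy and turns the resulting probabilities into class labels. The helper name `sigmoid`, the sample scores, and the 0.5 cutoff are illustrative assumptions, not part of the original discussion.

```python
import numpy as np

def sigmoid(z):
    """Map any real value (or array of values) into the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative raw scores (values of the linear predictor)
scores = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
probabilities = sigmoid(scores)

# Assumed decision rule: predict class 1 when the probability is at least 0.5
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```

Large negative scores map close to 0, large positive scores map close to 1, and a score of 0 maps to exactly 0.5, which is why 0.5 is a common default cutoff.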
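The logistic regression model predicts the probability $p$ that an outcome $y$ equals 1 (e.g., success) given a set of predictors $X$:

$$p = P(y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}}$$

where $\beta$ are the model coefficients.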
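To make the link to log-odds explicit, here is a short derivation. It assumes the standard form of the model, with $z$ introduced here as shorthand for the linear predictor.

```latex
\[
p = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n
\]
\[
1 - p = \frac{e^{-z}}{1 + e^{-z}}
\quad\Longrightarrow\quad
\frac{p}{1 - p} = e^{z}
\quad\Longrightarrow\quad
\ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n
\]
```

The model is therefore linear in the log-odds, which is why each coefficient is read as a change in log-odds per unit change in its predictor.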
Maximum likelihood estimation (MLE) is used to estimate the model parameters by finding the coefficient values that maximize the likelihood of the observed data.
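The following sketch shows one way to carry out this estimation: plain gradient ascent on the log-likelihood, run on synthetic data. The function names, learning rate, iteration count, and the "true" coefficients used to simulate the data are all assumptions for illustration; libraries such as scikit-learn use more sophisticated solvers.

```python
import numpy as np

def fit_logistic_mle(X, y, learning_rate=0.1, n_iter=5000):
    """Estimate coefficients by gradient ascent on the mean log-likelihood."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # current predicted probabilities
        gradient = X.T @ (y - p) / len(y)      # gradient of the mean log-likelihood
        beta += learning_rate * gradient       # step uphill
    return beta

def log_likelihood(X, y, beta):
    """Log-likelihood of the data: sum of y*z - log(1 + e^z)."""
    X = np.column_stack([np.ones(len(X)), X])
    z = X @ beta
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

# Synthetic data generated from assumed coefficients (intercept, x1, x2)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([-0.5, 2.0, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + X @ true_beta[1:])))
y = (rng.random(500) < p_true).astype(float)

beta_hat = fit_logistic_mle(X, y)
print("Estimated coefficients:", beta_hat)
print("Log-likelihood at estimate:", log_likelihood(X, y, beta_hat))
print("Log-likelihood at zeros:   ", log_likelihood(X, y, np.zeros(3)))
```

The fitted coefficients give a higher log-likelihood than the all-zero starting point, which is the sense in which they "maximize the likelihood of the observed data".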
Given a dataset that records each person's age, gender, smoking status, and whether they have the disease, logistic regression can estimate the probability that a person has the disease.
The logistic regression coefficients indicate the impact of each feature on the log-odds of the outcome. The coefficients can be interpreted as follows:
Let's assume the coefficients are as follows:
From the coefficients, you can see that the duration of the project has the most significant impact on the success of the project, followed by the team size and budget.
By understanding the logistic regression coefficients and model evaluation metrics, you can make informed decisions about the factors that most significantly impact project success. This information can help in planning and allocating resources more effectively to increase the likelihood of project success.
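To illustrate how a coefficient on the log-odds scale is read, the sketch below converts coefficients to odds ratios by exponentiating them. The feature names and coefficient values are hypothetical placeholders and are not the values from the project example.

```python
import numpy as np

# Hypothetical coefficients on the log-odds scale (illustrative only)
coefficients = {"feature_a": 0.8, "feature_b": -0.3, "feature_c": 0.0}

# exp(beta) is the multiplicative change in the odds per one-unit increase
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefficients.items()}

for name, beta in coefficients.items():
    odds_ratio = odds_ratios[name]
    percent_change = (odds_ratio - 1.0) * 100.0
    print(f"{name}: beta={beta:+.2f}, odds ratio={odds_ratio:.3f}, "
          f"change in odds={percent_change:+.1f}%")
```

A positive coefficient gives an odds ratio above 1 (the odds increase), a negative coefficient gives an odds ratio below 1, and a coefficient of 0 leaves the odds unchanged. Comparing coefficient magnitudes across features is only meaningful when the features are on comparable scales.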
ROC: Plots the true positive rate against the false positive rate at various threshold settings.
AUC: The area under the ROC curve, representing the model's ability to distinguish between classes.
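Both quantities can be computed with scikit-learn's `roc_curve` and `roc_auc_score`. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for real data; the dataset sizes and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative stand-in)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC analysis needs scores or probabilities, not hard class labels
y_prob = model.predict_proba(X_test)[:, 1]

# One (false positive rate, true positive rate) pair per threshold
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc_value = roc_auc_score(y_test, y_prob)

print(f"Number of thresholds evaluated: {len(thresholds)}")
print(f"AUC: {auc_value:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so values closer to 1 indicate a model that ranks positives above negatives more reliably.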
Binary logistic regression is the most common type, used when the response variable has two possible outcomes (e.g., success/failure, yes/no).
Example: Predicting whether a project will be successful or not.
Multinomial logistic regression is used when the response variable has more than two categories that are not ordered.
Example: Predicting the type of a project (e.g., construction, software, research).
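As a hedged sketch of the multi-class case, scikit-learn's `LogisticRegression` accepts a target with more than two classes and returns one probability per class. The Iris dataset bundled with scikit-learn is used here only as a convenient stand-in for any unordered multi-class target.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Three-class dataset used as an illustrative stand-in
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# One column of probabilities per class; each row sums to 1
proba = model.predict_proba(X_test)
accuracy = model.score(X_test, y_test)

print("Classes:", model.classes_)
print("First three probability rows:\n", proba[:3].round(3))
print("Test accuracy:", round(accuracy, 3))
```

The predicted class for each row is the one with the highest probability, which is what `model.predict` returns.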
Ordinal logistic regression is used when the response variable has more than two categories with a natural order.
Example: Predicting the severity of an issue (e.g., low, medium, high).
Regularized logistic regression adds a penalty on the size of the coefficients to prevent overfitting by discouraging overly complex models.
Example: Predicting customer churn with many features.
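The effect of the penalty can be seen by varying its strength. In scikit-learn, `LogisticRegression` applies an L2 penalty by default and `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty. The sketch below is illustrative only: the synthetic data, the two `C` values, and the helper `coefficient_norm` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with many features, only a few of them informative
X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

def coefficient_norm(C):
    """Fit a scaled, L2-penalized model and return the size of its coefficient vector."""
    pipeline = make_pipeline(StandardScaler(),
                             LogisticRegression(C=C, max_iter=5000))
    pipeline.fit(X, y)
    return float(np.linalg.norm(pipeline[-1].coef_))

strong_penalty = coefficient_norm(0.01)   # small C: strong regularization
weak_penalty = coefficient_norm(100.0)    # large C: weak regularization

print(f"Coefficient norm with C=0.01: {strong_penalty:.3f}")
print(f"Coefficient norm with C=100:  {weak_penalty:.3f}")
```

The features are standardized first because the penalty acts on the raw coefficient sizes, which depend on each feature's scale. In practice `C` is usually chosen by cross-validation rather than fixed by hand.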
Similar to regularized logistic regression, it applies a penalty to the coefficients to reduce overfitting.
Example: Disease prediction models where the number of predictors is high.
Logistic regression with interaction terms includes products of predictors to capture the combined effect of multiple features.
Example: Predicting sales success considering both marketing spend and the number of sales calls.
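One way to sketch this is to add the product of the two predictors as an extra feature, here via scikit-learn's `PolynomialFeatures` with `interaction_only=True`. The data are synthetic, and the variable names and the coefficients used to simulate the outcome are assumptions chosen so that the interaction matters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n = 1000
marketing_spend = rng.normal(size=n)  # synthetic, already standardized
sales_calls = rng.normal(size=n)      # synthetic, already standardized
X = np.column_stack([marketing_spend, sales_calls])

# Simulated outcome whose log-odds depend partly on the product of the predictors
log_odds = 0.5 * marketing_spend + 0.5 * sales_calls + 1.5 * marketing_spend * sales_calls
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)

# Resulting columns: marketing_spend, sales_calls, marketing_spend * sales_calls
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = interactions.fit_transform(X)

main_effects_only = LogisticRegression(max_iter=1000).fit(X, y)
with_interaction = LogisticRegression(max_iter=1000).fit(X_int, y)

print("Training accuracy, main effects only:", main_effects_only.score(X, y))
print("Training accuracy, with interaction: ", with_interaction.score(X_int, y))
print("Coefficients with interaction:", with_interaction.coef_[0].round(2))
```

The third coefficient belongs to the product term; a value clearly different from zero suggests that the effect of one predictor depends on the level of the other.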
Multilevel (hierarchical) logistic regression is used when data is nested or grouped, accounting for the hierarchical structure in the data.
Example: Predicting student success where students are nested within schools.
A method to reduce bias in the maximum likelihood estimates, especially useful for small sample sizes or rare events.
Example: Predicting rare adverse events in clinical trials.
Predict project success or failure to aid in risk assessment.
Predict software defects to improve quality control.
Segment customers based on behavior for tailored deliverables.
Analyze team performance to identify success factors.
Determine the probability of heart attacks.
Gathers relevant data for analysis.
Cleans and prepares data for regression.
Implements using tools like Python or R.
Analyzes model output to make data-driven decisions.
Here is the step-by-step process for implementing logistic regression to predict employee attrition using Python.
We'll use a dataset commonly referred to as the "IBM HR Analytics Employee Attrition & Performance" dataset.
Download the dataset here: hrfile.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Load dataset
# Make sure to replace 'path_to_file.csv' with the actual path to your downloaded dataset file
df = pd.read_csv('path_to_file.csv')
# Drop irrelevant columns
df.drop(['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber'], axis=1, inplace=True)
# Encode categorical variables
df = pd.get_dummies(df, drop_first=True)
# Define features and target variable
X = df.drop('Attrition_Yes', axis=1)
y = df['Attrition_Yes']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train the logistic regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
# Predict on the test set
y_pred = logreg.predict(X_test)
# Predict probabilities on the test set
y_prob = logreg.predict_proba(X_test)[:, 1] # Probabilities for the positive class (Attrition = Yes)
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
# Plot confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Print the coefficients of the features
coefficients = pd.DataFrame(logreg.coef_[0], X.columns, columns=['Coefficient'])
print(coefficients.sort_values(by='Coefficient', ascending=False))
# Plot the S-shaped curve for one feature (e.g., Age)
# Generate a range of values for Age
age_range = np.linspace(X['Age'].min(), X['Age'].max(), 300)
# Create a DataFrame with the mean values for all features except Age
mean_features = X_train.mean().to_frame().T
mean_features = mean_features.loc[np.repeat(mean_features.index.values, len(age_range))]
mean_features['Age'] = age_range
# Predict probabilities for the age range
age_prob = logreg.predict_proba(mean_features)[:, 1]
# Plot the S-shaped curve
plt.figure(figsize=(10, 6))
plt.plot(age_range, age_prob, label='Probability of Attrition')
plt.xlabel('Age')
plt.ylabel('Probability of Attrition')
plt.title('S-shaped Curve for Age and Attrition Probability')
plt.legend()
plt.grid()
plt.show()
# Python example using scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
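# Split the data into training and testing sets
# (X and y are the feature matrix and target defined in the previous example)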
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Handles nonlinear relationships well, but can be less interpretable.
Simple and interpretable, but assumes linearity.
Step-by-step guide to implementing logistic regression in Google Colab using Python.