Powering Predictions: Demonstrating Logistic Regression Using Excel

data analytics excel logistic regression machine learning regression Nov 05, 2023

Download the Excel template for this article.

As advisors in an ever-evolving digital landscape, we often find ourselves at crossroads, making pivotal decisions that could spell success or setbacks for our clients.

One of the most powerful tools in our decision-making toolbox is regression analysis, a statistical tool for machine learning.

But what happens when the outcomes are not about 'how much' but 'whether or not'? This is an case where we can use logistic regression, the well-known classifier that helps us navigate the binary terrain of yes-or-no decisions.

Unlike linear regression, which predicts values within a continuous range, logistic regression predicts the probability of an occurrence, whether it's the likelihood of a loan default, probability of a transaction being fraudulent, or the chance of a sale at a particular price point.

Understanding Logistic Regression

At its core, logistic regression is about predicting the probability of an event: will it happen, or won't it?

Imagine you're trying to predict whether a new financial product will be a hit or a miss, whether a loan application will be approved or denied, or whether a marketing campaign will convert a lead into a customer.

These are binary outcomes — something either is or isn't, it happens or it doesn't. Linear regression doesn’t work here because it doesn't naturally confine its predictions to the range between 0 and 1, which is necessary for predicting probabilities.

That's where logistic regression shines.

To understand logistic regression, we need to be comfortable with probabilities. The probability of an event is a value between 0 and 1, representing the likelihood of occurrence. An event with a probability of occurrence greater than 0.5 is more likely to occur than not.

The logistic function translates any input, through a special S-shaped curve, into a probability value between 0 and 1. In Excel, you can write the following formula:

=1/(1 + EXP(-(B0+B1*x)))

  • EXP() is the Excel function used to calculate the natural logarithm of a value.

  • B0 and B1 are the parameters of the model that we will estimate from data.

  • x is the value of the variable (e.g., income, age, credit score) you're using to predict the outcome. There could one variable or many variables for each record in the model.

When to Use Logistic Regression

Logistic regression is the method of choice when your dependent variable is categorical, particularly if it is binary (having two possible outcomes). It's well-suited for:

  • Predicting binary events: Whether an investor will buy or sell, if a customer will churn, or the approval of a credit application.

  • Understanding the impact of variables: Determining the effect of interest rates on loan defaults or the influence of demographics on people becoming victims of cybersecurity attacks.

While linear regression is used to predict continuous variables (like income, stock prices, or age), logistic regression is used for predicting categorical outcomes.

The key differences are:

  • Output interpretation: Logistic regression outputs probabilities, while linear regression outputs a continuous value that can be any number.

  • Nature of the dependent variable: Logistic regression requires the dependent (predicted) variable to be binary, while linear regression requires it to be continuous.

Logistic regression is specifically advantageous in classification because it provides a probability score for observations. This allows us to not only predict outcomes but also to have a quantifiable measure of certainty. These probabilities can then be used to rank predictions and make informed decisions based on quantitative thresholds.

Logistic regression is particularly powerful for advisors because it can be adapted to inform decisions in various scenarios, such as:

  • Default risk: Assessing the probability of loan default based on a customer's financial profile can inform lending decisions.

  • Audit risk: Predicting the likelihood of a tax return triggering an audit based on historical variables like total income, number of deductions claimed, presence of foreign assets, etc.

  • Marketing and sales strategies: Determining the probability of a sale or conversion can guide marketing efforts and resource allocation.

Step-by-Step Instructions for Implementing Logistic Regression in Excel

Implementing logistic regression in Excel involves several steps, from setting up your dataset to using the Solver add-in to estimate the model parameters.

Example Data

First, ensure your data is clean and ready for analysis. Here's how you can prepare your data:

  1. Organize Your Dataset: Arrange your data in columns, with each row representing an observation and each column representing a variable.

  2. Code Your Target Variable: Make sure your binary outcome is coded correctly. For example, "0" could mean "did not purchase" and "1" could mean "made a purchase."

  3. Use one-hot encoding for categorical variables: One-hot encoding is a process used to convert categorical data variables into a form that can be provided to machine learning algorithms. Since most machine learning models require input to be numeric, one-hot encoding converts categorical (non-numeric) data into a binary (0 or 1) table.

    • Identify Unique Categories: One-hot encoding identifies all the unique categories in a categorical feature. For example, if you have a feature 'referral source' with the values 'website', ‘client‘, or 'conference', one-hot encoding would create a separate column for each possible value.

    • Create Binary Columns: For each unique category, one-hot encoding creates a new binary column that represents the presence of that category with the number 1 and the absence with the number 0. For the ‘referral source’ example, one-hot encoding would create three features: ‘referral_website’, ‘referral_client’, and ‘referral_conference’.

    • Assign Binary Values: These columns are then filled with binary values corresponding to the presence of the categorical value in the original data. If a particular row had the referral source 'client', it would have a '1' in the referral_client column and '0' in the referral_website and referral_conference columns.

  4. Scale and Normalize Data: If you're using multiple independent variables, ensure they're on the same scale to aid convergence in the model optimization process. The easiest way to accomplish the scaling is to normalize the values, which you can do in Excel with the following formula: =(value-MIN(range_of_values))/(MAX(range_of_values)-MIN(range_of_values)).

One-hot encoded and normalized data

In Excel, you will also need to create a coefficients table and several columns to represent the logistic regression equation components:

Coefficient Table

  1. Create a Coefficient Table: Create a separate table with the coefficient values for the y-intercept and each independent variable. Initially, the guess for all of these coefficient values can be zero.

  2. Logistic Function Column: Use the equation =1/(1+EXP(-(intercept + coefficient*independent_variable))) to calculate the probability of the dependent variable being 1. The formula would be extended for each independent variable. Tip: Using Excel’s named ranges will make it easier to keep track of your variables.

  3. Log Loss Column: Set up another column to represent the log loss. For a binary classification with actual outcomes, and the predicted probability, the log loss for each observation can be computed as =IF(target_variable=1, LOG(logistic_function_results), LOG(1-logistic_function_results)).

  4. Predicted Outcome Column: Translate the probabilities into predicted classifications based on a chosen cutoff, like 0.5. That means any row with a probability of greater than 0.5 will be predicted as being included in the target group (e.g. a client who purchases the product).

Data Table with Logistic Regression Formulas

With your data and initial logistic function in place, you'll use Excel’s Solver add-in to find the best estimates for the intercept and coefficient:

  1. Define Objective: Set the objective cell to the one containing the sum of the log loss column. Choose the value of zero for the target.

  2. Adjust Cells: Specify the cells containing your intercept and coefficient(s) as the ones to change.

  3. Add Constraints: If necessary, add constraints to keep the guesses from being so large or small that they result in calculation errors.

  4. Choose Solving Method: Select the method to reduce the sum of log loss to the value of zero. For logistic regression, you'll typically use the GRG Nonlinear solving method.

Solver Configurations

Lastly, use a goodness-of-fit test to see how well your model fits the observed data. Data scientists may use statistical tools for this. For now, we can just calculate the accuracy as a percentage.

Practical Tips for Using Logistic Regression in Excel

Having built a logistic regression model in Excel, we can refine the approach to enhance model accuracy and utility.

Data Preparation and Selection

  • Ensure Data Quality: Garbage in, garbage out — make sure your data is accurate, relevant, and free from errors before running any analysis.

  • Select Relevant Predictors: Use domain knowledge to choose independent variables that are likely to influence the outcome you're predicting.

Interpreting Logistic Regression Coefficients

  • Understand the Directionality: A positive coefficient indicates a higher likelihood of the event occurring as the predictor increases.

  • Magnitude Matters: Larger absolute values of the coefficient signify a stronger effect on the outcome. However, the scale of the variable affects coefficient size, so this was made more interpretable by normalizing the inputs of the model.

Making Decisions with Model Output

  • Use Probabilities to Inform Actions: Rather than making binary decisions, use the model's probability output (the Logistic Function results) to understand risk levels and make informed, nuanced decisions.

  • Set Appropriate Thresholds: Depending on the context, you may set thresholds other than 0.5 to decide the predicted outcome, balancing the trade-off between sensitivity and specificity.

Limitations and Considerations in Excel

  • Be Wary of Overfitting: With Excel's limited diagnostic tools, be cautious of overfitting your model to the training data, which can reduce the models ability to be used beyond the training data.

  • Sample Size Considerations: Ensure you have a sufficient sample size to train your model, as small sample sizes can lead to unreliable estimates.

Validating and Testing Your Model

  • Cross-Validation: If possible, use a portion of your data for training and reserve some for testing to validate your model's predictive power.

  • Continuous Learning: Update your model with new data over time to maintain its accuracy, reflecting changing business conditions.

Conclusion

Logistic regression is a versatile tool in the advisor's repertoire, ideal for when the predictive focus is on binary outcomes.

Excel's capabilities to perform logistic regression can meet many practical needs in professional advisory contexts. However, understanding when and how to transition to more powerful analytics tools is part of advancing one's practice in line with industry standards.

Unlock the power of Excel PivotTables! Whether you're a beginner or an advanced user, this self-guided course will level up your skills.

FREE COURSE

Stay connected with news and updates!

Join our mailing list to receive the latest news and updates from our team.
Don't worry, your information will not be shared.

We hate SPAM. We will never sell your information, for any reason.