Catching Clickers: Using Regression to Understand Why People Fall for Phishing Attacks

Oct 29, 2023


Phishing attacks continue to be a significant cybersecurity threat, casting wide nets to ensnare unsuspecting individuals and organizations alike.

Although many people recognize and discard phishing attempts, others unknowingly take the bait, leading to compromised personal data, financial losses, and breached corporate networks.

Why do some people discern the danger and sidestep the snare, while others become ensnared victims? We might be tempted to attribute these actions to varying levels of tech-savviness, but the true underlying factors may be more intricate and multifaceted.

To explore this question, we can use regression analysis — a powerful statistical tool that sifts through layers of data to identify and quantify the relationships between variables.

Understanding Regression Analysis

Regression analysis is like a data detective, piecing together clues to unveil the relationships between variables. It can be used to predict and explain.

The simplest form of regression is linear regression. It assumes a straight-line relationship between the predictor and the outcome.

However, the real world is often more complex than a straight line. Sometimes, relationships curve or zig-zag. Non-linear regression helps capture these more intricate patterns.

Also, events seldom revolve around a single cause-and-effect relationship. Multiple regression allows us to factor in several predictors simultaneously, painting a comprehensive picture of the influences at play.
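To make the idea of multiple regression concrete, here is a minimal Python sketch that fits an intercept and two predictors by ordinary least squares. The data is entirely made up for illustration (it is not the dataset analyzed later in this article), and the helper names `solve_linear_system` and `fit_ols` are hypothetical. The outcome is generated from a known equation so we can see that the fit recovers it.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pivot on the largest remaining entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [intercept, b1, b2, ...] from the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical employees: (emails per month in thousands, completed training 0/1)
rows = [(2, 0), (5, 1), (8, 0), (11, 1), (14, 0), (17, 1)]
# Made-up outcome generated exactly from y = 1 + 0.3 * emails - 2 * trained
y = [1 + 0.3 * e - 2 * t for e, t in rows]

intercept, b_emails, b_trained = fit_ols(rows, y)
print(round(intercept, 6), round(b_emails, 6), round(b_trained, 6))
```

Because the made-up outcome contains no noise, the fitted values match the generating equation. Real data would leave residual error, which is where the reliability metrics discussed later come in.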

In the world of regression analysis, coefficients are the stars. For every predictor, the regression equation assigns a coefficient—a number that indicates the strength and direction of the relationship.

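A positive coefficient indicates that as the predictor increases, the outcome also tends to increase; a negative coefficient indicates that the outcome tends to decrease. The magnitude of the coefficient tells us about the size of this effect: the larger its absolute value, the greater the change in the outcome per one-unit change in the predictor. Because magnitude depends on the predictor's units, coefficients are directly comparable only when the predictors are measured on the same scale.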

Regression doesn't just provide a mathematical model; it also offers valuable metrics that gauge the model's reliability and accuracy. Terms like r-squared (R^2), also called the coefficient of determination, give us a measure of how well our model explains the variability in the outcome.
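As a rough illustration of what R^2 measures, the sketch below computes it as one minus the ratio of unexplained variation (residual sum of squares) to total variation around the mean. The observed and predicted values are hypothetical numbers chosen to keep the arithmetic easy to follow.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical observed outcomes and the values some model predicted for them
actual = [1, 3, 2, 5, 4]
predicted = [1.5, 2.5, 2.5, 4.5, 4.0]

r2 = r_squared(actual, predicted)
print(round(r2, 3))
```

Here the total variation is 10 and the unexplained variation is 1, so this hypothetical model accounts for 90% of the variability.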

P-values associated with each predictor tell us whether the relationships we've identified are statistically significant or could plausibly have occurred by chance.

Why Use Regression Analysis for Phishing Research

Phishing is a cunning use of social engineering. Crafted to deceive, malicious emails exploit a range of human tendencies, from trust and urgency to curiosity and fear.

While the art of deception remains a constant, the individuals targeted are a diverse lot, each with their unique behaviors, experiences, and reactions. So, how do we weave through this intricate tapestry to understand why some click while others don't?

Regression analysis thrives in complexity. By accommodating multiple variables, it can provide a holistic view, teasing out individual factors that significantly influence phishing susceptibility.

Regression also offers quantifiable insights, allowing us to estimate the impact of specific factors. For instance, we might find that completing a training program is associated with a reduction of, say, 30% in the risk of clicking on a phishing link.

Often, factors don't operate in isolation. An employee's response to a phishing email might be influenced by a combination of their training, the number of emails they receive daily, and their role in the organization. Regression can unearth these intertwined relationships, offering a multi-dimensional perspective on phishing risks.

With its predictive capabilities, regression analysis isn't just about understanding the past or present—it can also serve to predict the future. By identifying high-risk factors and groups, organizations can proactively tailor their cybersecurity measures, focusing resources where they're needed most.

Building the Regression Model

Every robust model begins with quality data. When researching phishing patterns in our organization, we may encounter the following:

  • Quantitative Data: Number of phishing incidents, average emails received per month, and other measurable metrics.

  • Categorical Data: Training completion status, department or role, and other non-numerical attributes.

  • Time-Stamped Data: To track changes over time, such as before and after a training intervention.

Not all regression models are created equal. Depending on the nature of our data and the relationships we're probing:

  • Linear Regression: Suitable when predicting a continuous outcome, like the number of phishing incidents.

  • Logistic Regression: Ideal when predicting binary outcomes, like whether an employee will click on a phishing link (Yes/No).
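To illustrate how the logistic case differs, the sketch below shows how a logistic model turns a linear score into a probability between 0 and 1. The coefficient values and the function name `click_probability` are assumptions invented for this example, not estimates from any real dataset.

```python
import math


def click_probability(emails_thousands, trained,
                      b0=-1.0, b_emails=0.2, b_trained=-1.5):
    """Probability of clicking, using hypothetical logistic coefficients."""
    # The linear score is the same weighted sum used in linear regression...
    score = b0 + b_emails * emails_thousands + b_trained * trained
    # ...but the logistic (sigmoid) function squeezes it into (0, 1).
    return 1 / (1 + math.exp(-score))


p_untrained = click_probability(5, trained=0)
p_trained = click_probability(5, trained=1)
print(round(p_untrained, 3), round(p_trained, 3))
```

With these made-up coefficients, an untrained employee receiving 5,000 emails a month has a predicted click probability of 0.5, and the same employee with training has a probability of roughly 0.18.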

Sometimes, the combined effect of two variables is greater than the sum of their individual effects. For instance, the impact of training might be different for those in customer service compared to those in IT. Including interaction terms in our model allows us to capture these nuanced relationships.
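A small sketch may help show what an interaction term does. It assumes a hypothetical linear model with a training indicator, a customer-service indicator, and their product; all coefficient values are invented for illustration.

```python
def predicted_incidents(trained, customer_service,
                        b0=2.0, b_trained=-0.5, b_cs=1.0, b_interaction=-1.0):
    """Hypothetical model with an interaction between training and department."""
    return (b0
            + b_trained * trained
            + b_cs * customer_service
            # The product term lets the training effect differ by department.
            + b_interaction * trained * customer_service)


# Effect of training for IT staff (customer_service = 0)
effect_it = predicted_incidents(1, 0) - predicted_incidents(0, 0)
# Effect of training for customer service staff (customer_service = 1)
effect_cs = predicted_incidents(1, 1) - predicted_incidents(0, 1)
print(effect_it, effect_cs)
```

In this made-up model, training lowers predicted incidents by 0.5 for IT staff but by 1.5 for customer service staff. Without the interaction term, the model would be forced to assign the same training effect to both groups.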

Building the perfect regression model isn't a one-shot endeavor. It's an iterative process of:

  • Adding or removing predictors based on their significance and contribution to the model.

  • Checking for multicollinearity (high correlation between predictors).

  • Re-evaluating model assumptions and fit.
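The multicollinearity check in the list above can be sketched for the simple case of two predictors. With exactly two predictors, the variance inflation factor (VIF) reduces to 1 / (1 - r^2), where r is their Pearson correlation. The predictor names and values below are hypothetical, and the rule of thumb that a VIF above roughly 5 to 10 is a warning sign is a common convention rather than a hard limit.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1 / (1 - r ** 2)


# Hypothetical predictors that measure nearly the same thing
emails_received = [1000, 2000, 3000, 4000, 5000]
emails_opened = [900, 2100, 2900, 4200, 4900]

r = pearson_r(emails_received, emails_opened)
vif = vif_two_predictors(emails_received, emails_opened)
print(round(r, 3), round(vif, 1))
```

These two made-up predictors are almost perfectly correlated, so the VIF is far above the usual warning range. In that situation, dropping or combining one of them would typically give more stable coefficients.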

Once our model is honed, the coefficients need translation into actionable insights. A coefficient tells us how much the dependent variable changes for a one-unit change in the predictor, holding all other predictors constant. It's the key to quantifying the impact of individual factors on phishing susceptibility.

Illustrative Analysis

Imagine that we applied regression analysis to phishing data for employees at Rob’s Robots. A series of illuminating patterns may emerge, painting a detailed portrait of the factors influencing phishing susceptibility.

We can perform regression analysis using Excel’s Analysis ToolPak.

We then obtain the following results from the linear regression analysis:

R Square (R^2): Approximately 0.43. This value indicates that about 43% of the variability in phishing incidents is explained by the combined influence of the number of emails received per month and the completion of training.

Coefficients:

  • Avg_Emails_per_month: Approximately 3.33 × 10^-4. This indicates that for every additional 10,000 emails received per month, there's an associated increase of about 3.33 phishing incidents, holding training constant.

  • Completed_Training: Approximately -1.805. This coefficient suggests that, on average, employees who completed the training have about 1.805 fewer phishing incidents than those who didn't, holding the number of emails constant.

Intercept: Approximately 0.9192. This is the predicted number of phishing incidents for an employee who receives zero emails and has not completed the training. (It’s important to remember that phishing attacks can come from vectors other than email.)
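To see how the intercept and coefficients combine into a prediction, the sketch below plugs the rounded values reported above into the regression equation. The function name `predict_incidents` and the example of 10,000 emails per month are chosen purely for illustration.

```python
# Rounded values reported in the regression output above
INTERCEPT = 0.9192
COEF_EMAILS = 3.33e-4      # Avg_Emails_per_month
COEF_TRAINING = -1.805     # Completed_Training (1 = completed, 0 = not)


def predict_incidents(avg_emails_per_month, completed_training):
    """Predicted phishing incidents from the fitted linear equation."""
    return (INTERCEPT
            + COEF_EMAILS * avg_emails_per_month
            + COEF_TRAINING * completed_training)


untrained = predict_incidents(10_000, completed_training=0)
trained = predict_incidents(10_000, completed_training=1)
print(round(untrained, 4), round(trained, 4))
```

For an employee receiving 10,000 emails per month, the equation predicts about 4.25 incidents without training and about 2.44 with it, a gap equal to the training coefficient. Note that for a trained employee with very few emails the same equation yields a negative prediction, which is one limitation of using a straight-line model for a count that cannot fall below zero.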

P-Values: For Avg_Emails_per_month, the p-value is 0.0002, and for Completed_Training, the p-value is 6.9 × 10^-12. Both are well below the conventional threshold of 0.05, which tells us that the relationships we've identified are statistically significant.

Additional Analysis: Interestingly, when we run the regression only on the subgroup of people who have had phishing incidents, the R^2 rises to 0.96, meaning that the model explains about 96% of the variability in incidents within that subgroup. For people who have had incidents, the number of emails received is very highly correlated with the number of incidents.

Conclusion: Training is associated with significantly fewer phishing incidents, which is consistent with our expectation that training reduces the likelihood of our employees becoming victims of phishing attacks. Furthermore, the most at-risk people appear to experience more incidents when they receive more emails. The cybersecurity team may want to explore increasing email filtering for those users.

Causation vs. Correlation: Navigating the Nuances

As we consider the results from a regression analysis, it's crucial to tread with caution and discernment, especially when interpreting the relationships between variables.

One of the most fundamental distinctions to grasp in this realm is the difference between causation and correlation, a distinction that often carries profound implications.

Correlation implies that two variables move together. If one increases, the other tends to increase (or decrease) as well. However, this doesn't necessarily mean one causes the other.

Causation goes a step further, suggesting that a change in one variable is responsible for a change in another.

While regression analysis excels at identifying and quantifying relationships, it inherently captures correlations. Determining causation typically requires additional evidence, often from controlled experiments or rigorous observational studies.

For instance, our analysis revealed a strong negative correlation between training and phishing incidents. While it is tempting to conclude that training directly reduces phishing susceptibility (a causal relationship), it's possible that other unmeasured factors play a role. Perhaps more cybersecurity-aware employees are both more likely to complete training and less likely to click on phishing emails.

Conclusion

Regression analysis may prove very useful in our endeavor to understand the nuances of why some individuals fall prey to phishing attacks. Through its systematic, data-driven approach, we can unravel the intricate tapestry of factors that shape phishing susceptibility.

Yet, as with any analytical tool, regression is not a silver bullet. It's a piece of the puzzle, offering valuable perspectives but necessitating careful interpretation and complementary insights. By acknowledging its strengths and limitations, we can harness its full potential, guiding our cybersecurity efforts with a blend of rigor and nuance.
