
The Hidden Power of Linear & Logistic Regression

Rare intricacies, overlooked behaviours, and real-world ML use cases.

21 Nov 2025 · 10 min read
AI · Machine Learning · Regression · Logistic · Data Science

Linear regression and logistic regression are often taught as the “simplest ML algorithms.” But in real production systems (credit scoring, fraud detection, pricing, lead ranking, recommendation scoring) they show subtle behaviours that most engineers don’t notice.

These two workhorse algorithms quietly power billions of decisions every day in fintech, banking, healthcare, and retail. Yet their deeper behaviours—heteroscedasticity, leverage points, separation, log-odds, feature symmetry—are rarely discussed in basic ML tutorials.

True mastery begins when we stop treating these models as “easy” and start understanding how they behave under real-world data imperfections.

1. Linear Regression: Simpler Than It Looks — Until It Isn’t

Linear regression tries to fit a straight-line relationship. But real-world data rarely behaves in a clean straight line. Noise, outliers, fat tails, and correlated variables can distort the fitted model in surprisingly strong ways.

1.1 Heteroscedasticity: When Noise Grows with the Target

Linear regression assumes constant error variance (homoscedasticity) across all observations. This is rarely true. For example, when predicting loan amount from monthly income:

  • Lower incomes → small errors (±100 units)
  • Higher incomes → large errors (±2,000 units)

The regression line still looks “fine,” but variance increases with income. Confidence intervals widen, and predictions become unreliable for high-value applicants.
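
One way to check for this is the Breusch-Pagan test from statsmodels. A minimal sketch on simulated income data (illustrative numbers, not a real dataset):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulate incomes where the error grows with income (illustrative only)
rng = np.random.default_rng(0)
income = rng.uniform(1_000, 20_000, size=500)
loan = 2.5 * income + rng.normal(0, 0.1 * income)  # noise scales with income

X_sim = sm.add_constant(income)
ols = sm.OLS(loan, X_sim).fit()

# Breusch-Pagan: a small p-value indicates heteroscedastic residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, X_sim)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")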

Insight:

Heteroscedasticity doesn’t break the model, but it makes standard errors and coefficient interpretations misleading. Weighted Least Squares (WLS) down-weights the noisier observations and restores reliable inference.

import statsmodels.api as sm

# y: target, X: feature matrix, predicted_variance: per-observation variance
# estimates (e.g. from a first OLS pass); weights are the inverse variances
X_wls = sm.add_constant(X)
model = sm.WLS(y, X_wls, weights=1.0 / predicted_variance)
results = model.fit()
print(results.summary())

1.2 Leverage Points: When One Point Pulls the Line

A leverage point is a data point with an extreme feature value. Even if its target is reasonable, its position gives it huge influence on the fitted line.

Example: A housing dataset with 499 homes priced 100k–300k and one mansion worth 12M. That single mansion pulls the regression line upward, corrupting predictions for ordinary homes.

Figure: A single leverage point dramatically tilts the fitted regression line (the code for this plot appears in the walkthrough in Section 4.1).

1.3 Multicollinearity: When Features Fight Each Other

When features are strongly correlated (age & experience, income & spending), coefficients become unstable. Predictions might still be good, but coefficients themselves become unreliable.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# age and experience overlap heavily (hypothetical example data)
age = np.array([25, 30, 35, 40, 45, 50])
experience = np.array([2, 7, 12, 18, 22, 27])

X = pd.DataFrame({"age": age, "experience": experience})
vif_age = variance_inflation_factor(X.values, 0)
vif_exp = variance_inflation_factor(X.values, 1)
print(vif_age, vif_exp)  # values well above ~5-10 signal problematic collinearity

2. Logistic Regression: A Log-Odds Machine

Logistic regression is not just a classifier — it is a log-odds estimator. The S-curve appears only when log-odds are converted into probabilities.
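
A minimal sketch of that view, using made-up parameter values: the linear part produces a log-odds score, and the sigmoid turns it into a probability.

import numpy as np

# Hypothetical fitted parameters (not from any real model)
intercept, coef = -3.0, 0.7
x = 5.0

log_odds = intercept + coef * x                # linear in the feature
probability = 1.0 / (1.0 + np.exp(-log_odds))  # sigmoid maps log-odds to probability
print(log_odds, probability)                   # 0.5, ~0.62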

2.1 Coefficients Multiply Odds, Not Probability

A coefficient of 0.7 means that a one-unit increase in the feature multiplies the odds by exp(0.7) ≈ 2.01; it does not mean “+70% probability”.

import numpy as np

coef = 0.7
odds_multiplier = np.exp(coef)
print(odds_multiplier)  # ≈ 2.01

This makes logistic regression powerful for regulated domains requiring explainability.

2.2 Perfect Separation: When a Rule Predicts 100%

If a feature perfectly splits the classes (e.g., a fraud rule that fires only for fraud cases), the maximum-likelihood coefficients tend toward infinity and unregularised training fails to converge, typically raising a “perfect separation” warning.

from sklearn.linear_model import LogisticRegression

# C is the inverse regularisation strength: a smaller C means a stronger L2
# penalty, which keeps the coefficients finite even under perfect separation
clf = LogisticRegression(
    penalty="l2",
    C=0.1,
    solver="lbfgs"
).fit(X, y)

L2 regularisation keeps optimisation stable.

2.3 Class Imbalance: When 99% Accuracy Is Useless

In fraud, AML, medical diagnosis, or credit default prediction, positive events are rare. Logistic regression may return 99% accuracy by predicting all negatives — which is useless.

  • Use class_weight="balanced" (sketched below)
  • Use oversampling (e.g., SMOTE)
  • Use probabilities instead of hard labels
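
A minimal sketch of the first remedy on synthetic imbalanced data (illustrative only, not a production setting):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic dataset with roughly 1% positives (illustrative only)
X_imb, y_imb = make_classification(
    n_samples=5000, n_features=10, weights=[0.99, 0.01], random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency,
# so the rare positive class is not drowned out by the majority class
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_imb, y_imb)

# Judge the model with precision and recall, not raw accuracy
print(classification_report(y_imb, clf.predict(X_imb)))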

2.4 S-Curve Enables Ranking

Because the sigmoid is monotonic, predicted probabilities preserve the ordering of the underlying linear scores, which makes logistic regression a natural scoring and ranking model.

Figure: Logistic regression’s S-curve is ideal for scoring and ranking.
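
A minimal sketch of ranking by predicted probability, on synthetic lead-scoring data (illustrative features only):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic lead-scoring data (illustrative only)
rng = np.random.default_rng(0)
X_leads = rng.normal(size=(200, 3))
y_leads = (X_leads[:, 0] + 0.5 * X_leads[:, 1] + rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_leads, y_leads)

# Sort leads by predicted probability of conversion, best first
scores = clf.predict_proba(X_leads)[:, 1]
top_leads = np.argsort(-scores)[:10]
print(top_leads)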

3. Where Linear & Logistic Regression Work Together

3.1 Expected Loss Models in Finance

  • Logistic → PD (Probability of Default)
  • Linear → LGD (Loss Given Default)

Expected Loss = PD × LGD (scaled by the exposure at default, EAD) is widely used in lending, underwriting, and capital allocation.
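
A minimal sketch of the combination on synthetic loan data (all numbers illustrative): a logistic model estimates PD, and a linear model estimates LGD on the defaulted subset.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic borrower features, default labels, and observed losses (illustrative)
rng = np.random.default_rng(0)
X_loans = rng.normal(size=(1000, 3))
defaulted = (X_loans[:, 0] + rng.normal(size=1000) > 1.0).astype(int)
lgd_observed = np.clip(0.4 + 0.1 * X_loans[:, 1] + 0.05 * rng.normal(size=1000), 0, 1)

pd_model = LogisticRegression().fit(X_loans, defaulted)                  # PD
lgd_model = LinearRegression().fit(X_loans[defaulted == 1],
                                   lgd_observed[defaulted == 1])         # LGD

pd_hat = pd_model.predict_proba(X_loans)[:, 1]
lgd_hat = np.clip(lgd_model.predict(X_loans), 0, 1)
expected_loss = pd_hat * lgd_hat  # per unit of exposure; multiply by EAD for money terms
print(expected_loss[:5])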

3.2 E-commerce Conversion & Revenue Prediction

  • Logistic → probability of purchase
  • Linear → expected basket size/value

Multiply them → expected revenue per user/session.
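
A minimal sketch of that multiplication, with hypothetical per-session predictions from the two models:

import numpy as np

# Hypothetical outputs: purchase probability (logistic) and basket value (linear)
p_purchase = np.array([0.02, 0.15, 0.60])
expected_basket = np.array([80.0, 45.0, 120.0])

expected_revenue = p_purchase * expected_basket  # expected revenue per session
print(expected_revenue)  # 1.6, 6.75 and 72.0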

4. A Simple Python Walkthrough

4.1 Linear Regression with a Leverage Point

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 10]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 30])  # the point at x=10 is the leverage point

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

plt.scatter(x, y, label="Data")
plt.plot(x, y_pred, label="Fitted line")
plt.legend()
plt.title("Leverage Point Effect")
plt.savefig("leverage-effect.png", dpi=150)
plt.show()

4.2 Logistic Probability Curve

from sklearn.linear_model import LogisticRegression
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

grid = np.linspace(1, 5, 100).reshape(-1, 1)
probs = clf.predict_proba(grid)[:, 1]

plt.plot(grid, probs)
plt.ylim(0, 1)
plt.title("Logistic Regression Probability Curve")
plt.savefig("logistic-curve.png", dpi=150)
plt.show()

5. Model Assumption Mismatches: The Hidden Reason Behind Simple Model Failures

Machine learning models don't just learn from data; they also quietly assume how the data behaves. When reality follows these assumptions, the models work beautifully. When it doesn't, they still run… but give unreliable or misleading results.

Here are the key assumption mismatches in Linear and Logistic Regression:

5.1 The "Straight-Line" Assumption (Linear Regression)

Linear Regression assumes the relationship between variables is roughly straight. But many real-world patterns rise and fall, peak and flatten.

Example: predicting queue length across the day:

  • Morning → short queue
  • Mid-day → long queue
  • Evening → short again

The reality is a curve, so forcing a straight line through it gives unrealistic predictions.

5.2 The "Even Variation" Assumption (Homogeneous Errors)

Linear Regression expects errors to be evenly spread. But real life has low-variance behaviours (snack prices) and high-variance behaviours (electronics).

5.3 The "No Overlapping Information" Assumption (Multicollinearity)

Regression expects each feature to provide unique information. Overlapping features (age and school grade) confuse the model and make coefficients unstable.

5.4 The "Smooth Change" Assumption (Logistic Regression)

Logistic Regression assumes probabilities rise smoothly, but many outcomes change abruptly.

Example: A playground only allows entry at 140 cm. Going from 139 → 140 cm is a sudden jump, not a smooth probability curve.

5.5 The "No Perfect Rules" Assumption

Logistic Regression fails when a rule is always true in historical data. Coefficients shoot toward infinity (perfect separation).

5.6 The "Balanced Outcomes" Assumption

Logistic Regression struggles when positive events are extremely rare. Predicting “NO” for everyone becomes highly accurate but completely useless.

Why This Matters

Assumption mismatches are one of the biggest hidden reasons why linear and logistic models behave strangely or fail silently. This doesn't make them flawed, but it means engineers must understand when assumptions don't match reality.

How to Fix These Assumptions

Fix 1: When the relationship isn't a straight line

  • Add squared terms (x²)
  • Use polynomial or spline regression (sketched below)
  • Use tree-based models that don't expect linearity
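
A minimal sketch of the polynomial option on the queue-length shape from 5.1 (synthetic, illustrative numbers):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic "rises then falls" pattern: queue length peaks around mid-day
hour = np.arange(8, 21).reshape(-1, 1)
queue = 40 - 1.5 * (hour.ravel() - 14) ** 2 + np.random.default_rng(0).normal(0, 2, 13)

# Adding a squared term lets an otherwise linear model fit the curve
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(hour, queue)
print(model.predict([[9], [14], [19]]))  # low, peak, low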

Fix 2: When large values behave unpredictably

  • Use Weighted Regression
  • Log-transform the target
  • Segment the data (small, medium, large)

Fix 3: When inputs overlap

  • Drop one correlated feature
  • Combine features into a single score
  • Use Lasso Regression to reduce duplicates (sketched below)
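
A minimal sketch of the Lasso option, with two nearly duplicate features (synthetic, illustrative):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# experience is almost a copy of age (synthetic data)
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=200)
experience = age - 22 + rng.normal(0, 0.5, size=200)
X_dup = StandardScaler().fit_transform(np.column_stack([age, experience]))
y_dup = 1.5 * age + rng.normal(0, 2.0, size=200)

# The L1 penalty tends to shrink one of the overlapping features towards zero
lasso = Lasso(alpha=0.5).fit(X_dup, y_dup)
print(lasso.coef_)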

Fix 4: When outcomes change abruptly

  • Use decision trees (sketched below)
  • Use random forests or gradient boosting
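
A minimal sketch of the tree option on the playground-entry example from 5.4 (synthetic heights, hypothetical 140 cm rule):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Entry allowed only at 140 cm or taller (illustrative rule)
heights = np.arange(120, 161).reshape(-1, 1)
allowed = (heights.ravel() >= 140).astype(int)

# A depth-1 tree learns the abrupt cutoff directly
tree = DecisionTreeClassifier(max_depth=1).fit(heights, allowed)
print(tree.predict([[139], [140]]))  # [0 1]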

Fix 5: When rules are “perfect”

  • Add L2 regularisation
  • Add synthetic counterexamples
  • Use tree-based models

Fix 6: When rare events are ignored

  • Oversample rare cases
  • Under-sample majority cases
  • Add class weights
  • Use anomaly detection algorithms

These fixes require understanding how data behaves, not heavy maths. When assumptions break, model reliability collapses. Knowing the patterns helps build trustworthy, robust ML systems.

6. Conclusion: Simple Algorithms with Deep Behaviour

Linear and Logistic Regression are not "beginner algorithms"; they are interpretable, mathematically stable, regulator-friendly tools that quietly power risk, pricing, credit, medical, and marketing systems.

Their strength lies not in simplicity, but in predictable behaviour under messy real-world data, once you understand their deeper assumptions and hidden intricacies.