Basic Machine Learning with Python

Author

Giuseppe Ragusa

Introduction to Machine Learning

This document introduces fundamental machine learning concepts, including regression models, evaluation methods, and regularization techniques (ridge and lasso). We’ll use Python with scikit-learn to implement these concepts effectively.

Throughout this lecture, we’ll illustrate key machine learning principles using the Italian employee dataset,, rfl2019.csv which provides information on employment conditions and personal characteristics of Italian workers. We will then cover classification using the Heart Disease dataset from UCI.

The Italian Employee Dataset

The dataset we’ll use for our regression examples contains information about Italian employees from 2019, including:

  1. retric: Monthly wage in euros (our target variable)
  2. orelav: Number of weekly working hours
  3. reg: Region of Italy where the employee works
  4. edulev: Highest education level attained (e.g., “Licenza media”, “Laurea vecchio ordinamento”)
  5. etam: Employee’s age
  6. sg11: Employee’s gender (“Femmina” or “Maschio”)
  7. qualifica: Employment type (e.g., “Operaio”, “Impiegato”)
  8. c27: Contract type (e.g., “tempo indeterminato”, “tempo determinato”)
  9. stacim: Full-time (“tempo pieno”) or part-time
  10. anniistruzione: Years of education completed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


# Load the dataset
rfl2019 = pd.read_csv('rfl2019.csv')

# Display the first few rows
print(rfl2019.head())

# Basic dataset information
print(f"Dataset shape: {rfl2019.shape}")
print("\nDataset information:")
rfl2019.info()

# Summary statistics
print("\nSummary statistics:")
rfl2019.describe()
   retric  etam  anniistruzione         edulev         rip3        rip5  \
0    1600    64              18         Laurea  Mezzogiorno       Isole   
1    1030    44               8  Licenza media         Nord  Nord-Ovest   
2     900    32              13    Diploma 4-5  Mezzogiorno         Sud   
3    1050    31              13    Diploma 4-5       Centro      Centro   
4     700    58               8  Licenza media         Nord    Nord-Est   

                   reg            c27     sg11         stacim  orelav  \
0              Sicilia  A tempo pieno  Femmina    Coniugato/a      36   
1              Liguria      Part-time  Femmina  Celibe/nubile      30   
2           Basilicata  A tempo pieno  Maschio  Celibe/nubile      36   
3                Lazio      Part-time  Femmina  Celibe/nubile      30   
4  Trentino-Alto Adige      Part-time  Femmina    Coniugato/a      20   

   qualifica  
0  Dirigente  
1  Impiegato  
2  Impiegato  
3  Impiegato  
4    Operaio  
Dataset shape: (5000, 12)

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   retric          5000 non-null   int64 
 1   etam            5000 non-null   int64 
 2   anniistruzione  5000 non-null   int64 
 3   edulev          5000 non-null   object
 4   rip3            5000 non-null   object
 5   rip5            5000 non-null   object
 6   reg             5000 non-null   object
 7   c27             5000 non-null   object
 8   sg11            5000 non-null   object
 9   stacim          5000 non-null   object
 10  orelav          5000 non-null   int64 
 11  qualifica       5000 non-null   object
dtypes: int64(4), object(8)
memory usage: 468.9+ KB

Summary statistics:
retric etam anniistruzione orelav
count 5000.000000 5000.000000 5000.000000 5000.000000
mean 1365.704000 45.588000 12.108600 35.562000
std 501.892462 10.101181 3.607743 8.084654
min 250.000000 25.000000 0.000000 10.000000
25% 1080.000000 38.000000 8.000000 31.000000
50% 1350.000000 46.000000 13.000000 40.000000
75% 1600.000000 54.000000 13.000000 40.000000
max 3000.000000 64.000000 18.000000 55.000000

Machine Learning: Basic Concepts

The Supervised Learning Framework

In supervised learning, we assume a relationship exists between a target (response) variable Y and a set of predictors (features) \(X_1, X_2, \dots, X_p\).

This relationship can be expressed mathematically as:

\[ Y = f(X) + \epsilon \]

Where: - \(f(X)\) is an unknown function linking the predictors to \(Y\) - \(\epsilon\) is the error term (the part of \(Y\) that cannot be explained by \(X\))

Our primary goal is to estimate the function \(f\) (denoted as \(\hat{f}\)) to make predictions about \(Y\) using the predictors \(X\). Once we have determined \(\hat{f}\), we can predict \(Y\) for new observations:

\[ \hat{Y} = \hat{f}(X^{new}) \]

The exact form of \(\hat{f}\) is often less important than its ability to make accurate predictions.

Types of Supervised Learning Problems

  1. Regression: When the target variable is continuous (e.g., predicting wages, house prices)
  2. Classification: When the target variable is categorical (e.g., predicting disease presence, spam detection)

For our wage prediction example, we’ll use regression techniques since wages are continuous values.

Training and Validation Sets

In typical machine learning applications, we’re concerned with making predictions on new, unseen data. This presents a fundamental challenge: how do we evaluate model performance when applying it to new data where we only know the predictors (\(X\)) but not the target values (\(Y\))?

To address this issue, we use a technique called data splitting. We divide our available dataset (which contains both \(X\) and \(Y\)) into two separate sets:

  1. Training set: Used to train the model (estimate model parameters)
  2. Validation set: Used to evaluate the model’s performance

This splitting approach allows us to simulate the real-world scenario where we’ll apply our model to new data. The validation set acts as a proxy for future unseen data, helping us assess how well our model generalizes beyond the data it was trained on.

# Set a random seed for reproducibility
np.random.seed(42)

# Define X (predictors) and y (target)
X = rfl2019[['etam']]  # Using age as our only predictor for simple linear regression
y = rfl2019['retric']   # Our target variable is wage

# Split theta: 80% for training, 20% for validation
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Print the shapes to confirm the split
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_validation shape: {X_validation.shape}")
print(f"y_validation shape: {y_validation.shape}")
X_train shape: (4000, 1)
y_train shape: (4000,)
X_validation shape: (1000, 1)
y_validation shape: (1000,)

Training the Simple Linear Regression Model

Now we’ll use the training dataset to build our first machine learning model: a simple linear regression. This model assumes a linear relationship between age and wage:

\[ wage = \beta_0 + \beta_1 \cdot age + \epsilon \]

Where: - \(\beta_0\) is the intercept (expected wage when age = 0) - \(\beta_1\) is the coefficient for age (expected increase in wage for each additional year) - \(\epsilon\) is the error term

In matrix notation, we can express this as:

\[ \mathbf{y} = \mathbf{X\beta} + \mathbf{\epsilon} \]

The ordinary least squares (OLS) solution for \(\beta\) minimizes the sum of squared residuals:

\[ \hat{\beta} = \arg\min_{\beta} ||\mathbf{y} - \mathbf{X\beta}||^2 \]

This can be easily implemented in scikilearn:

# Create and train the model
lm1 = LinearRegression()
lm1.fit(X_train, y_train)

# Extract the model coefficients
intercept = lm1.intercept_
coefficient = lm1.coef_[0]

print(f"Estimated model: retric = {intercept:.2f} + {coefficient:.2f} × etam")

# Visualize the fitted line against the training data
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, alpha=0.4, label='Training data')
plt.plot(
    [X_train.min(), X_train.max()], 
    [lm1.predict(pd.DataFrame([X_train.min()], columns=['etam']))[0], 
     lm1.predict(pd.DataFrame([X_train.max()], columns=['etam']))[0]], 
    'r-', linewidth=2, label='Fitted line'
)
plt.xlabel('etam')
plt.ylabel('retric')
plt.title('Simple Linear Regression: Wage vs. Age')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Estimated model: retric = 996.92 + 8.19 × etam

Evaluating Model Performance

To assess our model’s performance, we use:

  1. Mean Squared Error (MSE): \[MSE = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2\]

  2. Root Mean Squared Error (RMSE): \[RMSE = \sqrt{MSE}\]

  3. Coefficient of Determination (R²): \[R^2 = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}\]

We calculate these matrics on both in-sample and out-of-sample.

# Make predictions on both training and validation data
y_train_pred = lm1.predict(X_train)
y_validation_pred = lm1.predict(X_validation)

# Calculate metrics
train_mse = mean_squared_error(y_train, y_train_pred)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(y_train, y_train_pred)

validation_mse = mean_squared_error(y_validation, y_validation_pred)
validation_rmse = np.sqrt(validation_mse)
validation_r2 = r2_score(y_validation, y_validation_pred)

# Print the results
print("\nModel Performance Metrics:")
print("--------------------------")
print("In-sample (Training) Metrics:")
print(f"MSE: {train_mse:.2f}")
print(f"RMSE: {train_rmse:.2f}")
print(f"R²: {train_r2:.2f}")

print("\nOut-of-sample (Validation) Metrics:")
print(f"MSE: {validation_mse:.2f}")
print(f"RMSE: {validation_rmse:.2f}")
print(f"R²: {validation_r2:.2f}")

# Visualize actual vs. predicted values
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Training data
axes[0].scatter(y_train, y_train_pred, alpha=0.5)
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--')
axes[0].set_xlabel('Actual Wages')
axes[0].set_ylabel('Predicted Wages')
axes[0].set_title('Training Data: Actual vs. Predicted Wages')
axes[0].grid(True, alpha=0.3)

# Validation data
axes[1].scatter(y_validation, y_validation_pred, alpha=0.5)
axes[1].plot([y_validation.min(), y_validation.max()], [y_validation.min(), y_validation.max()], 'r--')
axes[1].set_xlabel('Actual Wages')
axes[1].set_ylabel('Predicted Wages')
axes[1].set_title('Validation Data: Actual vs. Predicted Wages')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Visualize the residuals
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Training data residuals
train_residuals = y_train - y_train_pred
axes[0].scatter(y_train_pred, train_residuals, alpha=0.5)
axes[0].axhline(y=0, color='r', linestyle='--')
axes[0].set_xlabel('Predicted Wages')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Training Data: Residual Plot')
axes[0].grid(True, alpha=0.3)

# Validation data residuals
validation_residuals = y_validation - y_validation_pred
axes[1].scatter(y_validation_pred, validation_residuals, alpha=0.5)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('Predicted Wages')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Validation Data: Residual Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Model Performance Metrics:
--------------------------
In-sample (Training) Metrics:
MSE: 247251.62
RMSE: 497.24
R²: 0.03

Out-of-sample (Validation) Metrics:
MSE: 235662.03
RMSE: 485.45
R²: 0.03

Interpretation of Results

Looking at the performance metrics and plots, we can evaluate how well our simple linear regression model captures the relationship between age and wage:

  1. Training vs. Validation Performance: Comparing the metrics between training and validation data helps us detect if we’re overfitting/undefitting. Similar performance on both sets suggests the model generalizes well.

  2. R² Value: The coefficient of determination tells us what proportion of the variance in wage is explained by age. A low R² suggests that age alone is not sufficient to predict wages accurately.

  3. Residual Plots: Patterns in residuals can reveal limitations in our model. Ideally, residuals should be randomly distributed around zero with no clear pattern.

Our simple model using only age as a predictor likely shows limited predictive power. This is expected since many other factors influence a person’s wage, such as education, experience, industry, and location. In the next sections, we’ll explore how to improve our predictions by incorporating more variables and using more sophisticated modeling techniques.

Multiple Linear Regression

In practice, predictions are rarely based on a single factor. We can easily multiple predictors:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \]

Implementing Multiple Linear Regression

# Prepare categorical and numerical features
categorical_features = ['edulev', 'reg', 'sg11', 'qualifica', 'stacim']
numerical_features = ['etam', 'orelav']

# Create preprocessor for mixed data types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features)
    ])

# Create pipeline with preprocessing and model
model = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression(copy_X=False))
])

# Prepare X and y
X = rfl2019[numerical_features + categorical_features]
y = rfl2019['retric']

# Split the data
# Note: using the same random state guarantees that we get 
# the sampe osswervations in the training and in the validation example as in the previous example.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate metrics
train_mse = mean_squared_error(y_train, y_train_pred)
train_rmse = np.sqrt(train_mse)
train_r2 = r2_score(y_train, y_train_pred)

test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)

print("\nMultiple Linear Regression Results")
print("\nIn-sample (Training) Metrics:")
print(f"MSE: {train_mse:.2f}")
print(f"RMSE: {train_rmse:.2f}")
print(f"R²: {train_r2:.2f}")

print("\nOut-of-sample (Testing) Metrics:")
print(f"MSE: {test_mse:.2f}")
print(f"RMSE: {test_rmse:.2f}")
print(f"R²: {test_r2:.2f}")

# Visualize actual vs. predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_test_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Wages')
plt.ylabel('Predicted Wages')
plt.title('Actual vs. Predicted Wages (Multiple Linear Regression)')
plt.show()

Multiple Linear Regression Results

In-sample (Training) Metrics:
MSE: 111050.22
RMSE: 333.24
R²: 0.56

Out-of-sample (Testing) Metrics:
MSE: 109845.24
RMSE: 331.43
R²: 0.55

Beyond Linearity: Polynomial Features

Linear models assume linear relationships between features and the target. However, in many real-world scenarios, relationships are non-linear. We can capture non-linearities by including polynomial terms.

For instance, the relationship between age and wage might not be linear—wages often increase more rapidly during mid-career years before plateauing.

We can make the model accomodate non-linearities with polynomial terms:

Including Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Create a pipeline with polynomial features
poly_model = Pipeline([
    ('preprocessor', preprocessor),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('regressor', LinearRegression(copy_X=False))
])

# Train the model
poly_model.fit(X_train, y_train)

# Make predictions
y_train_poly_pred = poly_model.predict(X_train)
y_test_poly_pred = poly_model.predict(X_test)

# Calculate metrics
train_poly_mse = mean_squared_error(y_train, y_train_poly_pred)
train_poly_rmse = np.sqrt(train_poly_mse)
train_poly_r2 = r2_score(y_train, y_train_poly_pred)

test_poly_mse = mean_squared_error(y_test, y_test_poly_pred)
test_poly_rmse = np.sqrt(test_poly_mse)
test_poly_r2 = r2_score(y_test, y_test_poly_pred)

print("\nPolynomial Regression Results")
print("\nIn-sample (Training) Metrics:")
print(f"MSE: {train_poly_mse:.2f}")
print(f"RMSE: {train_poly_rmse:.2f}")
print(f"R²: {train_poly_r2:.2f}")

print("\nOut-of-sample (Testing) Metrics:")
print(f"MSE: {test_poly_mse:.2f}")
print(f"RMSE: {test_poly_rmse:.2f}")
print(f"R²: {test_poly_r2:.2f}")

# Compare models
models = ['Simple Linear', 'Multiple Linear', 'Polynomial']
train_rmses = [train_rmse, train_rmse, train_poly_rmse]
test_rmses = [test_rmse, test_rmse, test_poly_rmse]

plt.figure(figsize=(10, 6))
x = np.arange(len(models))
width = 0.35

plt.bar(x - width/2, train_rmses, width, label='Training RMSE')
plt.bar(x + width/2, test_rmses, width, label='Testing RMSE')
plt.ylabel('RMSE')
plt.title('Model Performance Comparison')
plt.xticks(x, models)
plt.legend()
plt.show()

Polynomial Regression Results

In-sample (Training) Metrics:
MSE: 95807.79
RMSE: 309.53
R²: 0.62

Out-of-sample (Testing) Metrics:
MSE: 112653.24
RMSE: 335.64
R²: 0.54

The Overfitting Problem

As we increase model complexity by adding more features or polynomial terms, we risk overfitting —- a situation where the model performs well on training data but fails to generalize to new data.

Signs of overfitting include:

  • Training metrics much better than testing metrics

  • Testing performance deteriorates as model complexity increases

  • Very large coefficient values

Illustration of Overfitting

We create an interaction-heavy model to demonstrate overfitting:

# Create a model with many interaction terms
complex_model = Pipeline([
    ('preprocessor', preprocessor),
    ('poly', PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)),
    ('regressor', LinearRegression(copy_X=False))
])

# Train the model
complex_model.fit(X_train, y_train)

# Make predictions
y_train_complex_pred = complex_model.predict(X_train)
y_test_complex_pred = complex_model.predict(X_test)

# Calculate metrics
train_complex_mse = mean_squared_error(y_train, y_train_complex_pred)
train_complex_rmse = np.sqrt(train_complex_mse)
train_complex_r2 = r2_score(y_train, y_train_complex_pred)

test_complex_mse = mean_squared_error(y_test, y_test_complex_pred)
test_complex_rmse = np.sqrt(test_complex_mse)
test_complex_r2 = r2_score(y_test, y_test_complex_pred)

print("\nComplex Model Results (Potential Overfitting)")
print("\nIn-sample (Training) Metrics:")
print(f"MSE: {train_complex_mse:.2f}")
print(f"RMSE: {train_complex_rmse:.2f}")
print(f"R²: {train_complex_r2:.2f}")

print("\nOut-of-sample (Testing) Metrics:")
print(f"MSE: {test_complex_mse:.2f}")
print(f"RMSE: {test_complex_rmse:.2f}")
print(f"R²: {test_complex_r2:.2f}")

# Visualize overfitting
models = ['Simple Linear', 'Multiple Linear', 'Polynomial', 'Complex']
train_rmses = [train_rmse, train_rmse, train_poly_rmse, train_complex_rmse]
test_rmses = [test_rmse, test_rmse, test_poly_rmse, test_complex_rmse]

plt.figure(figsize=(12, 7))
x = np.arange(len(models))
width = 0.35

plt.bar(x - width/2, train_rmses, width, label='Training RMSE')
plt.bar(x + width/2, test_rmses, width, label='Testing RMSE')
plt.ylabel('RMSE')
plt.title('Model Performance Comparison - Detecting Overfitting')
plt.xticks(x, models)
plt.legend()
plt.show()

Complex Model Results (Potential Overfitting)

In-sample (Training) Metrics:
MSE: 70003.45
RMSE: 264.58
R²: 0.72

Out-of-sample (Testing) Metrics:
MSE: 1573466.20
RMSE: 1254.38
R²: -5.49

You should be able to observe that the complex model performs exceptionally well on training data but poorly on test data—a classic sign of overfitting.

Regularization: Preventing Overfitting

Regularization techniques help prevent overfitting by adding penalty terms to the model’s objective function, which discourage complex models with large coefficients.

Two common regularization methods are:

  1. Ridge Regression (L2 regularization): \[ \min_{\beta} \left\{ \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij})^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\} \]

  2. Lasso Regression (L1 regularization): \[ \min_{\beta} \left\{ \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij})^2 + \lambda \sum_{j=1}^p |\beta_j| \right\} \]

Where \(\lambda\) is a tuning parameter controlling the strength of the penalty.

Implementing Ridge and Lasso Regression

from sklearn.linear_model import Ridge, Lasso

# Create Ridge and Lasso pipelines
ridge_model = Pipeline([
    ('preprocessor', preprocessor),
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('regressor', Ridge(alpha=1.2, copy_X=False))  # alpha is the regularization strength
])

lasso_model = Pipeline([
    ('preprocessor', preprocessor),
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('regressor', Lasso(alpha=2.0, max_iter=10000, copy_X=False))
])

# Train the models
ridge_model.fit(X_train, y_train)
lasso_model.fit(X_train, y_train)

# Make predictions
y_test_ridge_pred = ridge_model.predict(X_test)
y_test_lasso_pred = lasso_model.predict(X_test)

# Calculate metrics
ridge_mse = mean_squared_error(y_test, y_test_ridge_pred)
ridge_rmse = np.sqrt(ridge_mse)
ridge_r2 = r2_score(y_test, y_test_ridge_pred)

lasso_mse = mean_squared_error(y_test, y_test_lasso_pred)
lasso_rmse = np.sqrt(lasso_mse)
lasso_r2 = r2_score(y_test, y_test_lasso_pred)

print("\nRegularized Models Results")
print("\nRidge Regression (L2):")
print(f"Test MSE: {ridge_mse:.2f}")
print(f"Test RMSE: {ridge_rmse:.2f}")
print(f"Test R²: {ridge_r2:.2f}")

print("\nLasso Regression (L1):")
print(f"Test MSE: {lasso_mse:.2f}")
print(f"Test RMSE: {lasso_rmse:.2f}")
print(f"Test R²: {lasso_r2:.2f}")

Regularized Models Results

Ridge Regression (L2):
Test MSE: 123754.76
Test RMSE: 351.79
Test R²: 0.49

Lasso Regression (L1):
Test MSE: 105907.35
Test RMSE: 325.43
Test R²: 0.56

Tuning the Regularization Parameter

The regularization parameter (λ or alpha) controls the trade-off between model complexity and training data fit. Let’s find the optimal value:

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {'regressor__alpha': np.logspace(-3, 3, 20)}

# Set up Grid Search for Ridge
ridge_grid = GridSearchCV(
    Pipeline([
        ('preprocessor', preprocessor),
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
        ('regressor', Ridge(copy_X=False))
    ]),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error'
)

# Train the model
ridge_grid.fit(X_train, y_train)

# Best parameters and score
print(f"Best alpha for Ridge: {ridge_grid.best_params_['regressor__alpha']:.4f}")
print(f"Best CV score for Ridge: {-ridge_grid.best_score_:.2f} (MSE)")

# Set up Grid Search for Lasso
lasso_grid = GridSearchCV(
    Pipeline([
        ('preprocessor', preprocessor),
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
        ('regressor', Lasso(max_iter=10000, copy_X=False))
    ]),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error'
)

# Train the model
lasso_grid.fit(X_train, y_train)

# Best parameters and score
print(f"Best alpha for Lasso: {lasso_grid.best_params_['regressor__alpha']:.4f}")
print(f"Best CV score for Lasso: {-lasso_grid.best_score_:.2f} (MSE)")

# Visualize regularization path
alphas = np.logspace(-3, 3, 20)
ridge_rmses = []
lasso_rmses = []

for alpha in alphas:
    # Ridge
    ridge = Pipeline([
        ('preprocessor', preprocessor),
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
        ('regressor', Ridge(alpha=alpha))
    ])
    ridge.fit(X_train, y_train)
    y_pred = ridge.predict(X_test)
    ridge_rmses.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    
    # Lasso
    lasso = Pipeline([
        ('preprocessor', preprocessor),
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
        ('regressor', Lasso(alpha=alpha, max_iter=10000))
    ])
    lasso.fit(X_train, y_train)
    y_pred = lasso.predict(X_test)
    lasso_rmses.append(np.sqrt(mean_squared_error(y_test, y_pred)))

# Plot the results
plt.figure(figsize=(12, 7))
plt.plot(alphas, ridge_rmses, 'bo-', label='Ridge')
plt.plot(alphas, lasso_rmses, 'ro-', label='Lasso')
plt.xscale('log')
plt.xlabel('Regularization parameter (alpha)')
plt.ylabel('Test RMSE')
plt.title('Impact of Regularization Strength on Model Performance')
plt.legend()
plt.grid(True)
plt.show()

# Use the best models for final prediction
best_ridge = ridge_grid.best_estimator_
best_lasso = lasso_grid.best_estimator_

y_pred_ridge = best_ridge.predict(X_test)
y_pred_lasso = best_lasso.predict(X_test)

# Compare all models
models = ['Simple', 'Multiple', 'Polynomial', 'Complex', 'Ridge', 'Lasso']
test_rmses = [
    test_rmse, 
    test_rmse, 
    test_poly_rmse, 
    test_complex_rmse,
    np.sqrt(mean_squared_error(y_test, y_pred_ridge)),
    np.sqrt(mean_squared_error(y_test, y_pred_lasso))
]

plt.figure(figsize=(14, 8))
plt.bar(models, test_rmses, color='skyblue')
plt.ylabel('Test RMSE (lower is better)')
plt.title('Performance Comparison Across All Models')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Best alpha for Ridge: 54.5559
Best CV score for Ridge: 110820.65 (MSE)
Best alpha for Lasso: 0.6952
Best CV score for Lasso: 109789.50 (MSE)

Classification Example: Heart Disease Prediction

Now let’s shift our focus to classification with the Heart Disease dataset.

Dataset Description

The Heart Disease dataset contains medical attributes that help predict whether a patient has heart disease:

  • age: Age in years
  • sex: (1 = male; 0 = female)
  • cp: Chest pain type (0-3)
  • trestbps: Resting blood pressure (in mm Hg)
  • chol: Serum cholesterol in mg/dl
  • fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
  • restecg: Resting electrocardiographic results (0-2)
  • thalach: Maximum heart rate achieved
  • exang: Exercise induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: Slope of the peak exercise ST segment (0-2)
  • ca: Number of major vessels (0-3) colored by fluoroscopy
  • thal: Thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
  • target: Heart disease (1 = present; 0 = absent)

Loading and Exploring the Heart Disease Data

# Load the heart disease dataset
from sklearn.datasets import fetch_openml

heart = fetch_openml(name='heart-disease', version=1, as_frame=True)

X_heart = heart.data
X_heart = X_heart.drop(columns=["target"])
y_heart = heart.data.target.astype(int)

print("Heart Disease Dataset Overview:")
print(f"Features: {X_heart.shape[1]}")
print(f"Samples: {X_heart.shape[0]}")
print(f"Target distribution: {np.bincount(y_heart)}")

# Display the first few rows
print("\nFeature Preview:")
print(X_heart.head())

# Explore feature distributions
plt.figure(figsize=(15, 10))
for i, col in enumerate(X_heart.columns[:6]):
    plt.subplot(2, 3, i+1)
    if X_heart[col].nunique() <= 10:
        sns.countplot(x=col, data=X_heart, hue=y_heart)
        plt.title(f'Distribution of {col}')
    else:
        sns.histplot(data=X_heart, x=col, hue=y_heart, bins=15, kde=True)
        plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()
Heart Disease Dataset Overview:
Features: 13
Samples: 303
Target distribution: [138 165]

Feature Preview:
    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0  63.0  1.0  3.0     145.0  233.0  1.0      0.0    150.0    0.0      2.3   
1  37.0  1.0  2.0     130.0  250.0  0.0      1.0    187.0    0.0      3.5   
2  41.0  0.0  1.0     130.0  204.0  0.0      0.0    172.0    0.0      1.4   
3  56.0  1.0  1.0     120.0  236.0  0.0      1.0    178.0    0.0      0.8   
4  57.0  0.0  0.0     120.0  354.0  0.0      1.0    163.0    1.0      0.6   

   slope   ca  thal  
0    0.0  0.0   1.0  
1    0.0  0.0   2.0  
2    2.0  0.0   2.0  
3    2.0  0.0   2.0  
4    2.0  0.0   2.0  

Implementing Logistic Regression for Classification

Logistic regression is a common algorithm for binary classification. It models the probability that an instance belongs to a particular class.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.preprocessing import StandardScaler

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_heart, y_heart, test_size=0.3, random_state=42)

# Create a pipeline with preprocessing and model
clf = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
y_pred_prob = clf.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_mat = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_prob)

print(f"Accuracy: {accuracy:.4f}")
print("\nConfusion Matrix:")
print(conf_mat)
print("\nClassification Report:")
print(report)
print(f"AUC-ROC: {auc:.4f}")

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Heart Disease Prediction')
plt.legend(loc='lower right')
plt.show()

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Disease', 'Disease'],
            yticklabels=['No Disease', 'Disease'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Heart Disease Prediction')
plt.show()
Accuracy: 0.8132

Confusion Matrix:
[[32  9]
 [ 8 42]]

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.78      0.79        41
           1       0.82      0.84      0.83        50

    accuracy                           0.81        91
   macro avg       0.81      0.81      0.81        91
weighted avg       0.81      0.81      0.81        91

AUC-ROC: 0.8824

The Confusion Matrix

When evaluating classification models, the confusion matrix provides a fundamental summary of prediction results:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Create a template confusion matrix for illustration
cm_template = np.array([
    ['True Negative (TN)', 'False Positive (FP)'],
    ['False Negative (FN)', 'True Positive (TP)']
])

# Plot the template
fig, ax = plt.subplots(figsize=(10, 7))
sns.heatmap(np.zeros((2,2)), annot=cm_template, fmt='', cmap='Blues', 
            xticklabels=['Predicted: No Disease', 'Predicted: Disease'],
            yticklabels=['Actual: No Disease', 'Actual: Disease'],
            annot_kws={"size": 14})
plt.tight_layout()
plt.show()
Figure 1: General structure of a confusion matrix for binary classification

For our heart disease example, each element of the matrix represents:

  • True Negative (TN): Correctly identified healthy patients
  • False Positive (FP): Healthy patients incorrectly diagnosed with heart disease
  • False Negative (FN): Patients with heart disease incorrectly classified as healthy
  • True Positive (TP): Correctly identified patients with heart disease
Note

In the medical context, false negatives (missed diagnoses) can be particularly dangerous as patients with heart disease might not receive needed treatment.

Precision

Precision answers the question: “Of all patients we diagnosed with heart disease, what percentage actually have it?”

\[ \text{Precision} = \frac{TP}{TP + FP} = \frac{\text{True Positives}}{\text{All Positive Predictions}} \]

Clinical Meaning: High precision means we minimize false alarms (FP). This is important for reducing unnecessary anxiety, additional testing, and treatments for patients who are actually healthy.

Recall (Sensitivity)

Recall answers the question: “Of all patients who truly have heart disease, what percentage did we correctly identify?”

\[ \text{Recall} = \frac{TP}{TP + FN} = \frac{\text{True Positives}}{\text{All Actual Positives}} \]

Clinical Meaning: High recall means we catch most actual cases of heart disease. This is critical for ensuring patients who need treatment receive it, potentially saving lives.

Important

In cardiac diagnostics, recall is often prioritized over precision because missing a case of heart disease (false negative) could be life-threatening.

Example from the Heart Disease Model

Let’s examine a sample confusion matrix from our heart disease classification model:

# Calculate precision and recall
tn, fp, fn, tp = conf_mat.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# Visualization
plt.figure(figsize=(8, 6))
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Disease', 'Disease'],
            yticklabels=['No Disease', 'Disease'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Heart Disease Prediction')
plt.show()

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
Figure 2: Confusion matrix from heart disease classification model
Precision: 0.82
Recall: 0.84
F1 Score: 0.83

Interpretation in Heart Disease Context

From our example confusion matrix:

\[ Precision = \frac{42}{42+9} = 0.8235294117647058 \]

82.35294117647058% of patients we diagnosed with heart disease actually have it

\[ Recall = \frac{42}{42+8} = 0.84 \] We correctly identified 84.0% of all patients who have heart disease

Clinical Trade-offs

In cardiac medicine, the balance between precision and recall has serious implications:

# Create data for illustration
thresholds = np.linspace(0, 1, 100)
# Simulated precision and recall curves for illustrative purposes
precision_curve = 1 - np.exp(-5*thresholds)
recall_curve = np.exp(-7*(thresholds-0.1)**2)

plt.figure(figsize=(10, 6))
plt.plot(thresholds, precision_curve, 'b-', linewidth=2, label='Precision')
plt.plot(thresholds, recall_curve, 'r-', linewidth=2, label='Recall')

# Add annotations
plt.annotate('High Recall Region\n(Catch most heart disease cases)',
             xy=(0.2, 0.7), xytext=(0.3, 0.5),
             arrowprops=dict(facecolor='black', shrink=0.05, width=1.5))
plt.annotate('High Precision Region\n(Few false positives)',
             xy=(0.8, 0.8), xytext=(0.6, 0.6),
             arrowprops=dict(facecolor='black', shrink=0.05, width=1.5))

plt.xlabel('Classification Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Trade-off in Heart Disease Detection')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Figure 3: Visualization of precision-recall tradeoffs in cardiac diagnostics

The trade-offs for heart disease detection:

  • High Precision, Lower Recall: We might miss some cases of heart disease (more FNs), but those we diagnose are very likely to have the condition. This approach could be dangerous as missed cases might lead to untreated heart disease.

  • High Recall, Lower Precision: We catch most cases of heart disease (fewer FNs), but might incorrectly diagnose some healthy patients (more FPs). This approach ensures most at-risk patients are identified but might lead to unnecessary treatments.

Tip

For heart disease detection, clinicians typically prioritize high recall (sensitivity) to ensure all potentially life-threatening conditions are detected, even at the cost of some false positives that can be ruled out with additional testing.

The F1 Score: Balancing Precision and Recall

The F1 score provides a single metric that balances both precision and recall:

\[ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

In our cardiac example:

\[ \text{F1 Score} = 2 × (0.88 × 0.81)/(0.88 + 0.81) = 0.84 \]

This balanced score helps evaluate our model’s overall effectiveness in the critical task of heart disease detection.

Implications for Model Selection

When selecting a machine learning model for heart disease detection, we should consider:

  1. Model Threshold Adjustment: Lowering the classification threshold increases recall at the expense of precision
  2. Cost-Sensitive Learning: Using asymmetric misclassification costs that penalize false negatives more heavily than false positives
  3. Ensemble Methods: Combining multiple models to improve overall detection performance
import pandas as pd
from IPython.display import display

# Sample comparison data (replace with your actual results)
models = ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'Neural Network']
precision_values = [0.88, 0.92, 0.85, 0.90]
recall_values = [0.81, 0.76, 0.89, 0.83]
f1_values = [0.84, 0.83, 0.87, 0.86]

# Create a comparison table
comparison_df = pd.DataFrame({
    'Model': models,
    'Precision': precision_values,
    'Recall': recall_values,
    'F1 Score': f1_values
})

display(comparison_df)
Table 1: Comparison of classification models for heart disease detection
Model Precision Recall F1 Score
0 Logistic Regression 0.88 0.81 0.84
1 Random Forest 0.92 0.76 0.83
2 Gradient Boosting 0.85 0.89 0.87
3 Neural Network 0.90 0.83 0.86

From Linear to Nonlinear Classification

In our previous sections, we explored linear models for regression (predicting wages) and classification (logistic regression for heart disease). While these models provide interpretable solutions, they have a fundamental limitation: they can only represent linear decision boundaries.

The Limitations of Linear Boundaries

Linear models like logistic regression separate classes with a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions). When the true relationship between features and the target is nonlinear, these models cannot capture the underlying pattern accurately.

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_curve, auc

# Set styling
plt.style.use('seaborn-v0_8-whitegrid')


# Generate a nonlinear dataset
X, y = make_moons(n_samples=200, noise=0.3, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Create a meshgrid for plotting decision boundaries
h = 0.02  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot logistic regression boundary
Z = lr.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axes[0].contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
axes[0].set_title('Logistic Regression (Linear Boundary)')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# Fit a random forest and plot its boundary
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
Z_rf = rf.predict(np.c_[xx.ravel(), yy.ravel()])
Z_rf = Z_rf.reshape(xx.shape)
axes[1].contourf(xx, yy, Z_rf, alpha=0.3, cmap=plt.cm.coolwarm)
axes[1].scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
axes[1].set_title('Random Forest (Nonlinear Boundary)')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

# Print accuracies
lr_acc = lr.score(X_test, y_test)
rf_acc = rf.score(X_test, y_test)
print(f"Logistic Regression Accuracy: {lr_acc:.4f}")
print(f"Random Forest Accuracy: {rf_acc:.4f}")
Figure 4: Linear models vs. nonlinear reality: A comparison of decision boundaries
Logistic Regression Accuracy: 0.8500
Random Forest Accuracy: 0.9000

As shown in the figure, the logistic regression model attempts to separate the classes with a single line, while the random forest can capture the nonlinear relationship that more accurately represents the true pattern in the data.

Decision Trees: The Building Blocks

Before understanding random forests, we need to understand decision trees, which are the foundation of tree-based ensemble methods.

How Decision Trees Work

A decision tree recursively partitions the feature space into regions, making decisions at each node based on feature values:

  1. Root Node: The starting point that considers all training samples
  2. Internal Nodes: Decision points that split data based on feature thresholds
  3. Leaf Nodes: Terminal nodes that assign class labels

The algorithm identifies the feature and threshold at each node that best separates the data according to a criterion such as Gini impurity or entropy.

# Create a simpler tree for visualization
small_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
small_tree.fit(X_train, y_train)

plt.figure(figsize=(15, 10))
plot_tree(small_tree, filled=True, feature_names=['Feature 1', 'Feature 2'], 
          class_names=['Class 0', 'Class 1'], rounded=True, fontsize=10)
plt.title('Simple Decision Tree Structure')
plt.tight_layout()
plt.show()
Figure 5: Visualization of a simple decision tree for heart disease classification

Nonlinearity in Decision Trees

Decision trees naturally capture nonlinear patterns because:

  1. Hierarchical Structure: Combinations of these splits form complex, nonlinear decision boundaries
  2. Automatic Feature Interaction: Trees inherently model interactions between features through their hierarchical structure

While individual decision trees capture nonlinearity well, they often suffer from high variance (overfitting). Random forests address this weakness through ensemble learning.

The Random Forest Algorithm

A random forest is an ensemble of decision trees that follow these principles:

  1. Bootstrap Aggregating (Bagging): Each tree is trained on a random subset of the training data, sampled with replacement
  2. Feature Randomization: At each split, only a random subset of features is considered
  3. Voting Mechanism: For classification, predictions are made by majority voting across all trees
# Illustrate the Random Forest concept
plt.figure(figsize=(12, 8))

# Sample data for illustration
np.random.seed(42)
X_sample = np.random.rand(20, 2)
y_sample = (X_sample[:,0] + X_sample[:,1] > 1).astype(int)

# Create multiple small decision trees
n_trees = 9
plt_rows, plt_cols = 3, 3
tree_max_depth = 2

# Create subplots for individual trees
fig, axes = plt.subplots(plt_rows, plt_cols, figsize=(15, 15))
axes = axes.flatten()

# Function to plot decision boundary
def plot_decision_boundary(ax, clf, X, y, title):
    h = 0.02
    x_min, x_max = 0, 1
    y_min, y_max = 0, 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    ax.set_title(title)
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    ax.set_xticks([])
    ax.set_yticks([])

# Train and plot individual trees
for i in range(n_trees):
    # Bootstrap sample
    indices = np.random.choice(len(X_sample), len(X_sample), replace=True)
    X_bootstrap = X_sample[indices]
    y_bootstrap = y_sample[indices]
    
    # Train tree with limited depth
    tree = DecisionTreeClassifier(max_depth=tree_max_depth)
    tree.fit(X_bootstrap, y_bootstrap)
    
    # Plot decision boundary
    plot_decision_boundary(axes[i], tree, X_bootstrap, y_bootstrap, f"Tree {i+1}")

# Create and plot the ensemble (final subplot)
rf = RandomForestClassifier(n_estimators=n_trees, max_depth=tree_max_depth, random_state=42)
rf.fit(X_sample, y_sample)

# Add an extra subplot for the ensemble
fig.add_subplot(plt_rows + 1, 1, plt_rows + 1)
plt.figure(figsize=(8, 6))
h = 0.02
x_min, x_max = 0, 1
y_min, y_max = 0, 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = rf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
plt.scatter(X_sample[:, 0], X_sample[:, 1], c=y_sample, cmap=plt.cm.coolwarm, edgecolors='k')
plt.title('Ensemble: Combined Random Forest')
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
plt.scatter(X_sample[:, 0], X_sample[:, 1], c=y_sample, cmap=plt.cm.coolwarm, edgecolors='k')
plt.title('Final Random Forest Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
<Figure size 1152x768 with 0 Axes>
(a) Conceptual illustration of Random Forest ensemble method
(b)
(c)
(d)
Figure 6

How Random Forests Capture Nonlinearity

Random forests excel at modeling nonlinear patterns through:

  1. Piecewise Approximation: Each tree creates a complex stair-like approximation of the decision boundary
  2. Ensemble Smoothing: The averaging effect across trees smooths these boundaries
  3. Adaptive Complexity: The depth and number of trees automatically adapt to the complexity of the relationship

For heart disease prediction, this ability to capture nonlinearity is crucial, as relationships between risk factors are rarely linear. For example, age and cholesterol might interact nonlinearly—their combined effect on heart disease risk may be greater than the sum of their individual effects.

Comparing Model Performance

Random forests often outperform linear models for classification tasks due to their ability to capture nonlinear patterns:

# This code would be used with the actual heart disease dataset
plt.figure(figsize=(10, 8))

# Plot ROC curves for each model
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})')

# Plot the random chance line
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()
Figure 7

Advantages and Disadvantages of Random Forests

Advantages

  1. Excellent Performance: Often produces state-of-the-art results with minimal tuning
  2. Robust to Overfitting: The ensemble nature reduces variance compared to individual trees
  3. Handles Nonlinearity: Automatically captures complex, nonlinear relationships
  4. Feature Importance: Provides a measure of variable importance
  5. Handles Mixed Data Types: Works well with both categorical and numerical features
  6. Minimal Preprocessing: No need for feature scaling or normalization

Disadvantages

  1. Interpretability: Less interpretable than linear models or single decision trees
  2. Computational Cost: Training many deep trees can be computationally expensive
  3. Memory Usage: Storing many trees requires more memory
  4. Not Extrapolative: Cannot make reliable predictions far outside the training data range

Tuning Random Forests

The performance of random forests can be improved through hyperparameter tuning:

# Create a table of important hyperparameters
params = {
    'Parameter': ['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'max_features'],
    'Description': [
        'Number of trees in the forest',
        'Maximum depth of each tree',
        'Minimum samples required to split an internal node',
        'Minimum samples required to be at a leaf node',
        'Number of features to consider for best split'
    ],
    'Effect on Nonlinearity': [
        'More trees smooth the decision boundary',
        'Deeper trees capture more complex patterns',
        'Smaller values allow more complex splits',
        'Smaller values create more detailed boundaries',
        'Controls the randomness and diversity of trees'
    ],
    'Typical Values': [
        '100-1000',
        '5-30 (or None for unlimited)',
        '2-10',
        '1-5',
        '\'sqrt\' or \'log2\' of feature count'
    ]
}

pd.DataFrame(params)
Table 2: Key hyperparameters for Random Forests
Parameter Description Effect on Nonlinearity Typical Values
0 n_estimators Number of trees in the forest More trees smooth the decision boundary 100-1000
1 max_depth Maximum depth of each tree Deeper trees capture more complex patterns 5-30 (or None for unlimited)
2 min_samples_split Minimum samples required to split an internal ... Smaller values allow more complex splits 2-10
3 min_samples_leaf Minimum samples required to be at a leaf node Smaller values create more detailed boundaries 1-5
4 max_features Number of features to consider for best split Controls the randomness and diversity of trees 'sqrt' or 'log2' of feature count