Applying the K-Fold Cross Validation method

In an attempt to fit a model with better predictions, I ran a t-test and a Monte Carlo test on the data. Both again gave a very small p-value, which adds to the evidence that the data is indeed heteroscedastic.
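
The exact test statistic is not shown here, so the following is only a minimal sketch of one way a Monte Carlo (permutation) check for heteroscedasticity could be set up. It assumes the statistic is the absolute correlation between the squared OLS residuals and the fitted values; the function name `permutation_heteroscedasticity_test` and the synthetic data are hypothetical and stand in for the CDC data.

```python
import numpy as np

def permutation_heteroscedasticity_test(X, y, n_permutations=2000, seed=0):
    """Approximate permutation test: are squared residuals related to fitted values?"""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)

    # Ordinary least squares fit with an intercept column
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    sq_resid = (y - fitted) ** 2

    # Observed statistic (assumed choice): |corr(fitted, squared residuals)|
    observed = abs(np.corrcoef(fitted, sq_resid)[0, 1])

    # Shuffling the squared residuals breaks any link to the fitted values,
    # which approximates the "constant variance" null hypothesis
    exceed = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(sq_resid)
        if abs(np.corrcoef(fitted, shuffled)[0, 1]) >= observed:
            exceed += 1

    # Add-one correction so the estimated p-value is never exactly zero
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value

# Synthetic example (not the CDC data): the noise spread grows with x
demo_rng = np.random.default_rng(1)
x_demo = demo_rng.uniform(1, 10, size=300)
y_demo = 2.0 + 0.5 * x_demo + demo_rng.normal(0, 0.3 * x_demo)

stat, p = permutation_heteroscedasticity_test(x_demo, y_demo)
print(f"statistic = {stat:.3f}, Monte Carlo p-value = {p:.4f}")
```

A small p-value from a check like this suggests that the residual spread changes with the fitted value, which is the pattern described above.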

As a further attempt to make better predictions for Diabetes, I performed K-fold cross-validation with 5 folds and with 10 folds. However, both runs gave a low mean R-squared (about 0.27). The code is below.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

# Specify the Excel file path
excel_file_path = 'C:\\Users\\Tiyasa\\Desktop\\cdc-diabetes-2018.xlsx'

# Using pandas `read_excel` function to read all sheets into a dictionary
xls = pd.read_excel(excel_file_path, sheet_name=None)

# Access individual DataFrames by sheet name
df1 = xls['Diabetes']
df2 = xls['Obesity']
df3 = xls['Inactivity']

df3.rename(columns={'FIPDS': 'FIPS'}, inplace=True)
# Inner join df1 and df2 on the 'FIPS' column
merged_df = pd.merge(df1, df2, on='FIPS', how='inner')

# Inner join the result with df3 on the 'FIPS' column
final_merged_df = pd.merge(merged_df, df3, on='FIPS', how='inner')

# Prepare the input features (X) and target variable (y)
X = final_merged_df[['% OBESE', '% INACTIVE']]
y = final_merged_df['% DIABETIC']

# Create a linear regression model
model = LinearRegression()

# Define the number of folds for cross-validation (set to 10 for the 10-fold run)
num_folds = 5

# Create a KFold cross-validation iterator
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# Perform cross-validation and get R-squared scores
cross_val_scores = cross_val_score(model, X, y, cv=kf, scoring='r2')

# Print R-squared scores for each fold
for fold, score in enumerate(cross_val_scores, start=1):
    print(f"Fold {fold}: R-squared = {score:.4f}")

# Calculate the mean and standard deviation of R-squared scores
mean_r2 = np.mean(cross_val_scores)
std_r2 = np.std(cross_val_scores)

print(f"\nMean R-squared: {mean_r2:.4f}")
print(f"Standard Deviation of R-squared: {std_r2:.4f}")

Output for 5 folds:

Fold 1: R-squared = 0.3947
Fold 2: R-squared = 0.4278
Fold 3: R-squared = 0.1305
Fold 4: R-squared = 0.2200
Fold 5: R-squared = 0.1851

Mean R-squared: 0.2716
Standard Deviation of R-squared: 0.1180

Output for 10 folds:

Fold 1: R-squared = 0.4598
Fold 2: R-squared = 0.3464
Fold 3: R-squared = 0.3879
Fold 4: R-squared = 0.4511
Fold 5: R-squared = 0.0416
Fold 6: R-squared = 0.2079
Fold 7: R-squared = 0.3571
Fold 8: R-squared = -0.0133
Fold 9: R-squared = 0.1856
Fold 10: R-squared = 0.2282

Mean R-squared: 0.2652
Standard Deviation of R-squared: 0.1553
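
To make the per-fold numbers easier to read, here is a rough sketch of what K-fold scoring does under the hood, written with NumPy only. The held-out R-squared is 1 - SS_res / SS_tot computed on the test fold, so it drops below zero whenever the fitted model predicts that fold worse than the fold's own mean would. That is how Fold 8 in the 10-fold run can come out at -0.0133. The helper name `kfold_r2_scores` and the synthetic data are hypothetical, and the shuffling will not reproduce the exact splits that scikit-learn's `KFold` makes.

```python
import numpy as np

def kfold_r2_scores(X, y, n_splits, seed=42):
    """Return the held-out R-squared of an OLS fit for each of n_splits folds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))          # shuffle the rows once
    folds = np.array_split(indices, n_splits)  # near-equal sized folds

    scores = []
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])

        # Fit OLS with an intercept on the training folds only
        design_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(design_train, y[train_idx], rcond=None)

        # Predict the held-out fold
        design_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        pred = design_test @ coef

        # R-squared on the held-out fold; negative if worse than the fold mean
        ss_res = np.sum((y[test_idx] - pred) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic stand-in data (not the CDC data): two features, noisy linear target
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(350, 2))
y_demo = (1.0 + 0.4 * X_demo[:, 0] + 0.3 * X_demo[:, 1]
          + demo_rng.normal(scale=0.7, size=350))

for n_splits in (5, 10):
    scores = kfold_r2_scores(X_demo, y_demo, n_splits)
    print(f"{n_splits}-fold: mean R-squared = {scores.mean():.4f}, "
          f"std = {scores.std():.4f}")
```

With more folds, each test fold holds fewer rows, so the individual scores tend to swing more from fold to fold. This matches the larger standard deviation in the 10-fold output above (0.1553 versus 0.1180).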
