In an attempt to improve the model's predictions, I ran a t-test and a Monte Carlo test on the data; both again gave very small p-values, further supporting the conclusion that the data are indeed heteroscedastic.
As a further attempt to make better predictions for Diabetes, I performed K-fold cross-validation with 5 folds and with 10 folds. However, both runs gave me a small R-squared value. Please find the code below (the Excel file path, sheet names, and feature column names are shown as placeholders).
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Specify the Excel file path (placeholder, substitute the actual file)
excel_path = "data.xlsx"
# Use pandas' read_excel to read all sheets into a dictionary of DataFrames
sheets = pd.read_excel(excel_path, sheet_name=None)
# Access individual DataFrames by sheet name (sheet names are placeholders)
df1 = sheets["Sheet1"]
df2 = sheets["Sheet2"]
df3 = sheets["Sheet3"]
df3.rename(columns={'FIPDS': 'FIPS'}, inplace=True)
# Inner join df1 with df2, then inner join the result with df3 on the 'FIPS' column
merged = df1.merge(df2, on="FIPS", how="inner").merge(df3, on="FIPS", how="inner")
# Prepare the input features (X) and target variable (y); column names are placeholders
X = merged.drop(columns=["FIPS", "Diabetes"])
y = merged["Diabetes"]
# Create a linear regression model
model = LinearRegression()
# Define the number of folds for cross-validation (5 or 10)
n_folds = 5
# Create a KFold cross-validation iterator
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
# Perform cross-validation and get R-squared scores
r2_scores = cross_val_score(model, X, y, cv=kf, scoring="r2")
# Print R-squared scores for each fold
for i, score in enumerate(r2_scores, start=1):
    print(f"Fold {i}: R-squared = {score:.4f}")
# Calculate the mean and standard deviation of R-squared scores
mean_r2 = np.mean(r2_scores)
std_r2 = np.std(r2_scores)
print(f"\nMean R-squared: {mean_r2:.4f}")
print(f"Standard Deviation of R-squared: {std_r2:.4f}")
Output for 5 folds:
Fold 1: R-squared = 0.3947
Fold 2: R-squared = 0.4278
Fold 3: R-squared = 0.1305
Fold 4: R-squared = 0.2200
Fold 5: R-squared = 0.1851
Mean R-squared: 0.2716
Standard Deviation of R-squared: 0.1180
Output for 10 folds:
Fold 1: R-squared = 0.4598
Fold 2: R-squared = 0.3464
Fold 3: R-squared = 0.3879
Fold 4: R-squared = 0.4511
Fold 5: R-squared = 0.0416
Fold 6: R-squared = 0.2079
Fold 7: R-squared = 0.3571
Fold 8: R-squared = -0.0133
Fold 9: R-squared = 0.1856
Fold 10: R-squared = 0.2282
Mean R-squared: 0.2652
Standard Deviation of R-squared: 0.1553
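For reference, the heteroscedasticity tests mentioned at the top are not reproduced in the code above. The snippet below is only a sketch of one common alternative check, the Breusch-Pagan test from statsmodels (not the t-test/Monte Carlo procedure I actually ran), reusing the X and y defined in the code above:

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit an OLS model on the same features and target used for cross-validation above
X_const = sm.add_constant(X)
ols_fit = sm.OLS(y, X_const).fit()
# Breusch-Pagan test: a small p-value indicates heteroscedastic residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, X_const)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4g}")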