The linear regression plots did not give us a very high R² value.
So I went ahead and checked whether the data was homoscedastic or heteroscedastic, using the Breusch-Pagan test. I discussed the p-value and the Breusch-Pagan test, which help determine the nature of the data points, in my earlier posts.
Here the null hypothesis is that the data is homoscedastic, and the alternative hypothesis is that the data is heteroscedastic.
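To make the two hypotheses concrete, here is a minimal sketch of the idea behind the test for a single predictor, using only the Python standard library. It is not the statsmodels implementation: it assumes the studentized (Koenker) form of the statistic, which is n times the R² from regressing the squared residuals on the predictor. The data is synthetic and the function names are made up for illustration.

```python
import math
import random


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R^2 of the least-squares line of y on x."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan_single(x, y):
    """Return (LM statistic, p-value) for a model with one predictor."""
    a, b = fit_line(x, y)
    squared_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    # LM = n * R^2 of the auxiliary regression; clamp tiny negative rounding.
    lm = max(len(x) * r_squared(x, squared_resid), 0.0)
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


# Synthetic data (illustration only): same trend, two kinds of noise.
random.seed(0)
x = [random.uniform(1, 10) for _ in range(500)]
y_constant = [2 + 3 * xi + random.gauss(0, 3) for xi in x]   # constant spread
y_growing = [2 + 3 * xi + random.gauss(0, xi) for xi in x]   # spread grows with x

for label, y in [("constant spread", y_constant), ("growing spread", y_growing)]:
    lm, p = breusch_pagan_single(x, y)
    print(f"{label}: LM = {lm:.2f}, p-value = {p:.3g}")
```

A very small p-value for the growing-spread series is what rejecting the null hypothesis looks like. With more than one predictor, the degrees of freedom change and the shortcut for the p-value used here no longer applies.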
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

excel_file_path = 'C:\\Users\\Tiyasa\\Desktop\\cdc-diabetes-2018.xlsx'
xls = pd.read_excel(excel_file_path, sheet_name=None)

# Access individual DataFrames by sheet name
df3.rename(columns={'FIPDS': 'FIPS'}, inplace=True)

# Inner join the result with df3 on the 'FIPS' column
# Prepare the input features (X) and target variable (y)
# Add a constant to the input features for the intercept term
# Fit a linear regression model
# Perform the Breusch-Pagan test
# Interpret the results
Output:
Breusch-Pagan Test p-value: 3.555846910402186e-05
Heteroscedasticity is detected (reject the null hypothesis).