ANOVA test and determining its feasibility

Analysis of Variance (ANOVA) is a statistical method for comparing the means of three or more groups or treatments to determine whether any of them differ significantly. It does this by partitioning the total variance in the data into between-group and within-group components and comparing the two. ANOVA assumes that the data within each group is approximately normally distributed and that the group variances are roughly equal. In the police shooting dataset, the variance of age differs considerably across the six racial groups (from 80.90 to 173.24, as computed below), so the equal-variance assumption is doubtful and it is best to avoid applying ANOVA to compare their mean ages.
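To make the variance partitioning concrete, here is a minimal sketch of the one-way ANOVA F statistic in plain Python. The function name `one_way_anova_f` and the toy samples are made up for illustration; they are not taken from the police shooting dataset.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of observations

    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical toy samples, for illustration only
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
print(f"F statistic: {f_stat:.2f}")
```

A large F means the group means differ by more than the within-group spread would explain. The within-group term pools all groups together, which is why the method leans on the groups having similar variances.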

Code to compare variances:

import pandas as pd

file_path = r'C:\Users\Tiyasa\Desktop\Courses_Sem1\MTH 522\fatal-police-shootings-data.xls'
df = pd.read_excel(file_path)

# Drop missing values in the "age" and "race" columns
df.dropna(subset=['age', 'race'], inplace=True)

# Group the data by the "race" column
grouped = df.groupby('race')

print("Variances per race:")

# Calculate and print the variance for each race category
for race, group in grouped:
    age_data = group['age']
    variance_age = age_data.var()

    print(f"Race: {race}")
    print(f"Variance: {variance_age:.2f}")


Variances per race:
Race: A
Variance: 134.38
Race: B
Variance: 129.70
Race: H
Variance: 115.42
Race: N
Variance: 80.90
Race: O
Variance: 139.15
Race: W
Variance: 173.24
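A quick way to summarize how far apart these variances are is the ratio of the largest to the smallest. The sketch below only uses the six values printed above; the dictionary and variable names are my own. It is a descriptive check and not a formal test, so judging whether the gap is statistically significant would need something like Levene's test, which is not done here.

```python
# Variances copied from the output above
variances = {"A": 134.38, "B": 129.70, "H": 115.42, "N": 80.90, "O": 139.15, "W": 173.24}

largest = max(variances, key=variances.get)
smallest = min(variances, key=variances.get)
ratio = variances[largest] / variances[smallest]

print(f"Largest variance: {largest} ({variances[largest]:.2f})")
print(f"Smallest variance: {smallest} ({variances[smallest]:.2f})")
print(f"Ratio of largest to smallest: {ratio:.2f}")
```

The largest variance (W) is a little more than twice the smallest (N).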

Learning about k-means, k-medoids, and DBSCAN clustering methods

Today I learnt about the different clustering methods discussed in class.

K-Means, a partitioning method, groups data points into K clusters by assigning each point to the nearest cluster center, where each center is the mean of the points in its cluster. This approach seeks to minimize the within-cluster sum of squared distances and is widely used in practice.
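A minimal sketch of the K-Means loop in plain Python, to show the assign-then-update cycle. The function name `k_means`, the toy points, and the choice to seed the centers with the first K points are all assumptions made for illustration; real implementations use smarter seeding and a convergence check.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Return (centers, clusters) after a fixed number of assign/update rounds."""
    # Assumption: use the first k points as the initial centers
    centers = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center)
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical toy data: two well-separated groups of three points
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, k=2)
print(centers)
```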

K-Medoids, on the other hand, shares similarities with K-Means but employs the medoid, the data point most centrally located within a cluster, as the representative of each cluster. This method is preferred in scenarios where robustness to outliers is crucial.
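The difference between a mean and a medoid is easy to see on a small example. The sketch below is only the medoid computation for a single cluster, not a full K-Medoids algorithm; the function name `medoid` and the points are made up.

```python
from math import dist

def medoid(cluster):
    """Return the member of the cluster with the smallest total distance to all members."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

# Hypothetical cluster: four nearby points and one outlier
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (20.0, 20.0)]

mean_center = tuple(sum(c) / len(cluster) for c in zip(*cluster))
medoid_center = medoid(cluster)

print("Mean:  ", mean_center)
print("Medoid:", medoid_center)
```

The outlier drags the mean to (6.0, 6.0), which is not close to any of the four nearby points, while the medoid stays at (3.0, 3.0), an actual data point inside the group.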

DBSCAN is a density-based approach that identifies clusters as dense regions separated by areas of lower density, making it particularly suitable for datasets with irregularly shaped clusters and noise. It employs two essential parameters, an epsilon distance threshold and a minimum number of data points required to define a dense region. DBSCAN, known for its ability to automatically discover the number of clusters, is less sensitive to initial configurations and can adapt to various data distributions.
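Here is a compact sketch of DBSCAN in plain Python to show how the two parameters are used. The names `dbscan`, `eps` and `min_pts` and the toy points are my own choices, and I am assuming that a point counts as its own neighbour. This version compares every pair of points, so it is only meant for small examples.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself)
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        # Point i is a core point, so it starts a new cluster
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point, so keep expanding
    return labels

# Hypothetical toy data: two dense groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Note that the number of clusters is never passed in. It falls out of the density structure of the data, and the isolated point is labelled as noise.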

Introduction to Data on Police Shootings by Washington Post

After examining the Washington Post dataset on police shootings, I’ve identified several interesting avenues for exploration. The dataset contains attributes such as agencies and IDs, which provide an opportunity to assess whether some law enforcement agencies are more frequently involved in shootings than others. By analyzing the data, we can also pinpoint the locations where these agencies are most active.

Age is another crucial factor to consider. We can investigate the age groups to which most victims belong. Additionally, we can compare the average ages of victims from different racial backgrounds. To perform this analysis, we might employ statistical tests like t-tests, but it’s essential to ensure that our data follows a normal distribution. Given the diversity of races and ethnicities involved, conducting a Monte Carlo test could be quite complex. As an alternative, an ANOVA test may be more practical.

Initially, I plan to create basic data distributions to gain insights and generate more ideas for the subsequent analysis.

Applying the K-Fold Cross Validation method

In an attempt to fit the model with better predictions, I ran a t-test and a Monte-Carlo test on the data, which again gave me a very small p-value, adding to the evidence that the data is indeed heteroscedastic.

As a further attempt to make better predictions for Diabetes, I performed a K-fold cross validation with 5 folds and 10 folds. However, both of these tests gave me a small R-squared value. Please find the code below.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

# Specify the Excel file path
excel_file_path = 'C:\\Users\\Tiyasa\\Desktop\\cdc-diabetes-2018.xlsx'

# Using pandas `read_excel` function to read all sheets into a dictionary
xls = pd.read_excel(excel_file_path, sheet_name=None)

# Access individual DataFrames by sheet name
df1 = xls['Diabetes']
df2 = xls['Obesity']
df3 = xls['Inactivity']

df3.rename(columns={'FIPDS': 'FIPS'}, inplace=True)
# Inner join df1 and df2 on the 'FIPS' column
merged_df = pd.merge(df1, df2, on='FIPS', how='inner')

# Inner join the result with df3 on the 'FIPS' column
final_merged_df = pd.merge(merged_df, df3, on='FIPS', how='inner')

# Prepare the input features (X) and target variable (y)
X = final_merged_df[['% OBESE', '% INACTIVE']]
y = final_merged_df['% DIABETIC']

# Create a linear regression model
model = LinearRegression()

# Define the number of folds for cross-validation
num_folds = 5

# Create a KFold cross-validation iterator
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# Perform cross-validation and get R-squared scores
cross_val_scores = cross_val_score(model, X, y, cv=kf, scoring='r2')

# Print R-squared scores for each fold
for fold, score in enumerate(cross_val_scores, start=1):
    print(f"Fold {fold}: R-squared = {score:.4f}")

# Calculate the mean and standard deviation of R-squared scores
mean_r2 = np.mean(cross_val_scores)
std_r2 = np.std(cross_val_scores)

print(f"\nMean R-squared: {mean_r2:.4f}")
print(f"Standard Deviation of R-squared: {std_r2:.4f}")

Output for 5 fold: 

Fold 1: R-squared = 0.3947
Fold 2: R-squared = 0.4278
Fold 3: R-squared = 0.1305
Fold 4: R-squared = 0.2200
Fold 5: R-squared = 0.1851

Mean R-squared: 0.2716
Standard Deviation of R-squared: 0.1180

Output for 10 fold: 

Fold 1: R-squared = 0.4598
Fold 2: R-squared = 0.3464
Fold 3: R-squared = 0.3879
Fold 4: R-squared = 0.4511
Fold 5: R-squared = 0.0416
Fold 6: R-squared = 0.2079
Fold 7: R-squared = 0.3571
Fold 8: R-squared = -0.0133
Fold 9: R-squared = 0.1856
Fold 10: R-squared = 0.2282

Mean R-squared: 0.2652
Standard Deviation of R-squared: 0.1553

Applying Breusch Pagan Test using Python (2nd October, Monday)


The linear regression plots did not give us a very high R-squared value.

Hence, I went ahead to check whether the data was homoscedastic or heteroscedastic, for which I used the Breusch-Pagan test. I have discussed the p-value and the Breusch-Pagan test, which help determine the nature of the data points, in my earlier posts.

Here the null hypothesis is that the data is homoscedastic and the alternative hypothesis is that the data is heteroscedastic.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

excel_file_path = 'C:\\Users\\Tiyasa\\Desktop\\cdc-diabetes-2018.xlsx'

xls = pd.read_excel(excel_file_path, sheet_name=None)

# Access individual DataFrames by sheet name
df1 = xls['Diabetes']
df2 = xls['Obesity']
df3 = xls['Inactivity']

df3.rename(columns={'FIPDS': 'FIPS'}, inplace=True)
# Inner join df1 and df2 on the 'FIPS' column
merged_df = pd.merge(df1, df2, on='FIPS', how='inner')

# Inner join the result with df3 on the 'FIPS' column
final_merged_df = pd.merge(merged_df, df3, on='FIPS', how='inner')

# Prepare the input features (X) and target variable (y)
X = final_merged_df[['% OBESE', '% INACTIVE']]
y = final_merged_df['% DIABETIC']

# Add a constant to the input features for the intercept term
X = sm.add_constant(X)

# Fit a linear regression model
model = sm.OLS(y, X).fit()

# Perform the Breusch-Pagan test
_, p_value, _, _ = het_breuschpagan(model.resid, X)
print("Breusch-Pagan Test p-value:", p_value)

# Interpret the results at the 5% significance level
if p_value < 0.05:
    print("Heteroscedasticity is detected (reject the null hypothesis).")
else:
    print("No significant evidence of heteroscedasticity (fail to reject the null hypothesis).")


Breusch-Pagan Test p-value: 3.555846910402186e-05
Heteroscedasticity is detected (reject the null hypothesis).

The concept of Cross Validation (25th September, Monday)

Today we learnt about the concept of cross-validation, which lets us use the existing data to estimate how a model would perform on new data. One limitation every data scientist comes across is that a model’s performance on unseen data is uncertain until it is actually tested, and more often than not we can only evaluate a model on the data we already have. One way to overcome this limitation is to use cross-validation.

Cross-validation involves partitioning the dataset into two or more subsets: a training set and a validation set (or multiple validation sets). The basic steps are:

  • Split the data into K roughly equal-sized folds or subsets.
  • Iteratively use K-1 folds for training and the remaining fold for validation.
  • Repeat this process K times, each time with a different fold as the validation set.
  • Compute performance metrics on each validation fold.

While there are several cross-validation methods, we focused on K-Fold Cross-Validation, where the dataset is divided into K subsets and the process is repeated K times, with each subset serving as the validation set once.
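The splitting steps above can be sketched in a few lines of plain Python. The generator name `k_fold_indices` is hypothetical, and unlike a shuffled split this sketch keeps the rows in their original order.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for fold_number, (train, validation) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={train} validation={validation}")
```

Every row lands in exactly one validation fold, so each observation is used for validation once and for training K-1 times.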

Cross-validation helps evaluate a model’s performance without needing a separate validation dataset. It helps detect overfitting and provides a more reliable estimate of a model’s performance by using multiple subsets of the data for evaluation.

Being cautious with the calculation of p-value and its interpretation

Venturing further into the concept of p-values, I found that while p-values are a valuable tool in statistical analysis, they should be interpreted cautiously and in conjunction with other statistical measures. Their reliability depends on various factors, including sample size, study design, and the correct formulation of the null and alternative hypotheses. We studied the pre-molt and post-molt data for lab-grown crabs today, and even though the data fit the linear model very closely, both variables were non-normally distributed and skewed, with high variance and high kurtosis. A descriptive comparison of the pre-molt and post-molt data showed that the shapes of the two distributions were very similar, with a small difference in mean. To find out whether there is a real difference between the pre-molt and post-molt means, we resorted to a t-test, which yielded a very small p-value, indicating that the null hypothesis (there is no real difference in means between the pre-molt and post-molt data) should be rejected.

However, the t-test is based on the assumption that the data fits a normal distribution, which is not the case with the pre-molt and post-molt dataset. It is hence suggested that we use a Monte-Carlo procedure to estimate a p-value for the observed difference in means, assuming a null hypothesis of no real difference between the pre-molt and post-molt data. I did not quite understand why we used the Monte-Carlo test here and will look into it further and seek the professor’s help on the same.
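As a starting point for understanding it, here is a minimal sketch of one such Monte-Carlo procedure, a permutation test. The idea is that if the null hypothesis holds, the labels "pre" and "post" carry no information, so we can shuffle the pooled values many times and count how often a random split produces a difference in means at least as large as the observed one. This needs no normality assumption. The function name, the seed, and the two small samples are made up for illustration and are not the crab data, and this may not be exactly the procedure used in class.

```python
import random
from statistics import mean

def permutation_p_value(sample_a, sample_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means by shuffling the pooled data."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)

    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Difference in means for a random relabelling of the pooled values
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return count / n_permutations

# Hypothetical, clearly separated samples
sample_a = [1, 2, 3, 4, 5, 6]
sample_b = [11, 12, 13, 14, 15, 16]
p_value = permutation_p_value(sample_a, sample_b)
print(f"Estimated p-value: {p_value:.4f}")
```

A small estimated p-value says that random relabelling almost never reproduces a gap as large as the observed one, which is evidence against the null hypothesis of no real difference.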

Learning about the Breusch-Pagan Test and Backward Reporting

I studied the Breusch-Pagan test in detail and learnt how it is used to assess the presence of heteroscedasticity in a regression model. Heteroscedasticity refers to the situation where the variance of the error terms in a regression model is not constant across all levels of the independent variables, violating one of the key assumptions of linear regression. The Breusch-Pagan test evaluates whether there is a significant relationship between the squared residuals of a regression model and the independent variables. If the test indicates a significant relationship, it suggests the presence of heteroscedasticity, and adjustments or transformations may be necessary to address this issue in the regression analysis.
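To see the mechanics, here is a sketch of one common form of the test statistic for a single predictor: fit the regression, square the residuals, regress the squared residuals on the predictor, and multiply the R-squared of that auxiliary regression by the number of observations. All function names and both toy datasets are assumptions made for illustration. The cutoff 3.841 is the 5% critical value of a chi-square distribution with one degree of freedom.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def residuals(x, y):
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def breusch_pagan_lm(x, y):
    """LM statistic: n * R-squared of the regression of squared residuals on x."""
    squared = [r ** 2 for r in residuals(x, y)]
    squared_mean = mean(squared)
    ss_res = sum(r ** 2 for r in residuals(x, squared))
    ss_tot = sum((s - squared_mean) ** 2 for s in squared)
    return len(x) * (1 - ss_res / ss_tot)

CHI2_CRITICAL_DF1 = 3.841  # 5% critical value, chi-square with 1 degree of freedom

x = [float(i) for i in range(1, 21)]
# Hypothetical data: error size grows with x (heteroscedastic)
y_hetero = [2 * xi + (-1) ** i * 0.2 * xi for i, xi in enumerate(x, start=1)]
# Hypothetical data: error size is the same everywhere (homoscedastic)
y_homo = [2 * xi + (-1) ** i * 1.0 for i, xi in enumerate(x, start=1)]

lm_hetero = breusch_pagan_lm(x, y_hetero)
lm_homo = breusch_pagan_lm(x, y_homo)
print(f"LM (spread grows with x): {lm_hetero:.2f}, reject: {lm_hetero > CHI2_CRITICAL_DF1}")
print(f"LM (constant spread):     {lm_homo:.2f}, reject: {lm_homo > CHI2_CRITICAL_DF1}")
```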

The major idea is to examine the collinearity between the variables inactivity and obesity by fitting linear models to them in order to predict trends in diabetes. Here, as suggested by the professor, we are using a top-down approach in which we backtrack from the conclusion to the cause. This kind of backward mapping is very beneficial in the data science world today, as it is the trends and predictions drawn from data that people are mostly captivated by.

P-value and significance of null Hypothesis

Today I learnt about the concept of the p-value, which is the probability of observing a result at least as extreme as the one actually observed, under the assumption that the null hypothesis is true. If this probability is so low that the observed result would be very unlikely to occur by chance under the null hypothesis, then the null hypothesis is rejected. The p-value thus measures how strong the evidence against the null hypothesis is. For example, say we hypothesize that a coin is fair (null hypothesis) and every toss of the coin comes up tails. With each additional tails, the probability of seeing such a run under the fair-coin assumption drops, and once the p-value falls below a chosen significance level, we say that such a run would be very unlikely for a fair coin. Hence, we reject the null hypothesis (it is not a fair coin). In a similar way, we can assess any hypothesis by finding its p-value: if the p-value is very low, the data provide strong evidence against the null hypothesis.
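The coin example can be worked out directly. Under the fair-coin null hypothesis, the probability of getting tails on every one of n tosses is 0.5 to the power n. The sketch below assumes a one-sided test and a significance level of 0.05; the function name is my own.

```python
def p_value_all_tails(n_tosses):
    """Probability of n tails in a row if the coin is fair."""
    return 0.5 ** n_tosses

SIGNIFICANCE_LEVEL = 0.05  # assumed threshold

for n in range(1, 8):
    p = p_value_all_tails(n)
    verdict = "reject" if p < SIGNIFICANCE_LEVEL else "fail to reject"
    print(f"{n} tails in a row: p-value = {p:.5f} -> {verdict} the fair-coin hypothesis")

# Smallest run of tails that leads to rejection at the assumed level
first_rejection = next(n for n in range(1, 100) if p_value_all_tails(n) < SIGNIFICANCE_LEVEL)
```

Four tails in a row gives 0.0625, which is not yet below 0.05, while five in a row gives 0.03125, so at this level the fair-coin hypothesis is first rejected after five consecutive tails.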

We also learnt about the Breusch-Pagan test, whose null hypothesis is that the variance of the errors is constant (homoscedastic). If the p-value is significantly low (less than 0.05) then we reject the null hypothesis and conclude that the data is heteroscedastic. I plan on learning to run the test on a smaller data set using Python and then testing the same on our real data.

An insight to Real Data from Real Sources

We started the course lecture discussing the course structure and ventured into the topic for our first project, which is linear regression. I for one like going back to the basics before I dig deep into a topic, so I went back and referred to the text. On reading some of it, I recollected that for any kind of analysis using statistical methods, we need to understand the data and connect with it to better familiarize ourselves with its nature. Since the data provided to us is from a real source, the predictions also need to be done in a more realistic manner instead of just trying to fit the data into a simple ideal model.

What needs to be considered at this point is that the data will have errors, and these errors need to be given enough significance to preserve the realistic nature of the data. I learnt about the linear least squares method of Carl Friedrich Gauss, which fits a line to the data points by minimizing the sum of the squared errors between the observed and fitted values. However, this model can be unstable (for example, it is sensitive to outliers) and hence unreliable. I plan on referring to the available data points and plotting individual models at first, so I can then move on to finding a correlation between obesity and inactivity to predict the percentage of diabetes.
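A minimal sketch of least squares for a single predictor, using the closed-form slope and intercept. The function name `least_squares_line` and the numbers are made up for illustration, and the second dataset changes one value to show how a single outlier can move the fitted line.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_clean = [2.0, 4.0, 6.0, 8.0, 10.0]     # hypothetical points lying exactly on y = 2x
y_outlier = [2.0, 4.0, 6.0, 8.0, 30.0]   # same points with the last value replaced by an outlier

clean_fit = least_squares_line(x, y_clean)
outlier_fit = least_squares_line(x, y_outlier)
print("Clean data:   intercept = %.2f, slope = %.2f" % clean_fit)
print("With outlier: intercept = %.2f, slope = %.2f" % outlier_fit)
```

On the clean points the fit recovers slope 2 and intercept 0. Changing one value out of five moves the fit to slope 6 and intercept -8, because squaring the errors gives large deviations a lot of weight.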