Learning about k-means, k-medoids, and DBSCAN clustering methods

Today I learnt about the different clustering methods discussed in class.

K-Means, a partitioning method, groups data points into K clusters based on their proximity to cluster centers, where each center is the mean of the points assigned to it. The algorithm iteratively reassigns points and recomputes the centers so as to minimize the sum of squared distances within clusters, and it is widely used in practice.
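
As a quick illustration beyond what we covered in class, here is a minimal scikit-learn sketch of K-Means on synthetic data; the blob data and the choice of K = 3 are placeholders of my own, not anything from the lecture:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three well-separated blobs (placeholder for real data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# Fit K-Means with K = 3; labels_ holds each point's cluster assignment
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)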

K-Medoids, on the other hand, shares the same partitioning idea as K-Means but employs the medoid, the most centrally located actual data point within a cluster, as each cluster's representative. Because medoids are real observations rather than averages, extreme values distort them less, so this method is preferred in scenarios where robustness to outliers is crucial.
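
scikit-learn itself does not ship a K-Medoids estimator; the sketch below assumes the separate scikit-learn-extra package, whose KMedoids class follows the same fit interface:

import numpy as np
from sklearn_extra.cluster import KMedoids

# Same kind of synthetic blobs as in the K-Means sketch (placeholder data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# Each cluster center is a medoid, i.e. an actual row of X
kmedoids = KMedoids(n_clusters=3, random_state=42).fit(X)
print(kmedoids.cluster_centers_)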

DBSCAN is a density-based approach that identifies clusters as dense regions separated by areas of lower density, making it particularly suitable for datasets with irregularly shaped clusters and noise. It relies on two essential parameters: an epsilon distance threshold (eps) that defines a point's neighborhood, and the minimum number of points (minPts) required for that neighborhood to count as dense. Because clusters emerge wherever the density criterion is met, DBSCAN discovers the number of clusters automatically, is less sensitive to initial configurations, and can adapt to various data distributions.
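
To make the two parameters concrete, here is a small sketch using scikit-learn's DBSCAN; the eps and min_samples values are illustrative, not tuned:

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a sprinkling of uniform noise (placeholder data)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, size=(40, 2)),
    rng.normal(5, 0.3, size=(40, 2)),
    rng.uniform(-2, 7, size=(10, 2)),
])

# eps is the neighborhood radius; min_samples is the density threshold
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Points labeled -1 are noise; the number of clusters is discovered automatically
print(set(db.labels_))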

Introduction to Data on Police Shootings by the Washington Post

After examining the Washington Post dataset on police shootings, I’ve identified several interesting avenues for exploration. The dataset contains attributes such as agencies and IDs, which provide an opportunity to assess whether some law enforcement agencies are more frequently involved in shootings than others. By analyzing the data, we can also pinpoint the locations where these agencies are most active.

Age is another crucial factor to consider. We can investigate which age groups most victims belong to, and we can compare the average ages of victims from different racial backgrounds. To compare two groups we might employ a t-test, but it's essential to first check that the data follow a roughly normal distribution. Given the number of races and ethnicities involved, running Monte Carlo tests for every pair of groups could become quite complex; as an alternative, a one-way ANOVA, which compares all group means at once, may be more practical.
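
As a rough sketch of what those comparisons could look like with scipy.stats (the age values below are invented placeholders, not figures from the dataset):

from scipy import stats

# Hypothetical victim ages for three groups (placeholder values only)
group_a = [23, 31, 27, 45, 36, 29]
group_b = [34, 41, 38, 30, 44, 39]
group_c = [25, 33, 40, 28, 37, 35]

# Two-sample t-test comparing the mean ages of two groups
# (assumes the ages within each group are roughly normally distributed)
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)
print(f"t-test p-value: {p_ttest:.4f}")

# One-way ANOVA extends the comparison to more than two groups at once
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA p-value: {p_anova:.4f}")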

Initially, I plan to create basic data distributions to gain insights and generate more ideas for the subsequent analysis.

Applying the K-Fold Cross-Validation method

In an attempt to fit a model with better predictions, I first ran a t-test and a Monte Carlo test on the data; both again gave very small p-values, reinforcing the earlier finding that the data are indeed heteroscedastic.
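
For context, a Monte Carlo test of a difference in means can be run as a permutation test; the sketch below uses placeholder samples rather than the actual diabetes data:

import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=100)  # placeholder sample 1
b = rng.normal(0.5, 1.0, size=100)  # placeholder sample 2

# Observed absolute difference in means between the two samples
observed = abs(a.mean() - b.mean())
pooled = np.concatenate([a, b])

# Shuffle the pooled data many times and count how often the shuffled
# difference in means is at least as large as the observed one
n_iter = 10_000
count = 0
for _ in range(n_iter):
    rng.shuffle(pooled)
    diff = abs(pooled[:100].mean() - pooled[100:].mean())
    if diff >= observed:
        count += 1

print(f"Monte Carlo p-value: {count / n_iter:.4f}")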

As a further attempt to improve the diabetes predictions, I performed k-fold cross-validation with both 5 and 10 folds. However, both runs produced small R-squared values. Please find the code below.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

# Specify the Excel file path
excel_file_path = 'C:\\Users\\Tiyasa\\Desktop\\cdc-diabetes-2018.xlsx'

# Using pandas `read_excel` function to read all sheets into a dictionary
xls = pd.read_excel(excel_file_path, sheet_name=None)

# Access individual DataFrames by sheet name
df1 = xls['Diabetes']
df2 = xls['Obesity']
df3 = xls['Inactivity']

# Fix the typo in the inactivity sheet's county-code column
df3.rename(columns={'FIPDS': 'FIPS'}, inplace=True)

# Inner join df1 and df2 on the 'FIPS' column
merged_df = pd.merge(df1, df2, on='FIPS', how='inner')

# Inner join the result with df3 on the 'FIPS' column
final_merged_df = pd.merge(merged_df, df3, on='FIPS', how='inner')

# Prepare the input features (X) and target variable (y)
X = final_merged_df[['% OBESE', '% INACTIVE']]
y = final_merged_df['% DIABETIC']

# Create a linear regression model
model = LinearRegression()

# Define the number of folds for cross-validation (5 here; 10 for the second run)
num_folds = 5

# Create a KFold cross-validation iterator
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# Perform cross-validation and get R-squared scores
cross_val_scores = cross_val_score(model, X, y, cv=kf, scoring='r2')

# Print the R-squared score for each fold
for fold, score in enumerate(cross_val_scores, start=1):
    print(f"Fold {fold}: R-squared = {score:.4f}")

# Calculate the mean and standard deviation of R-squared scores
mean_r2 = np.mean(cross_val_scores)
std_r2 = np.std(cross_val_scores)

print(f"\nMean R-squared: {mean_r2:.4f}")
print(f"Standard Deviation of R-squared: {std_r2:.4f}")

Output for 5 folds:

Fold 1: R-squared = 0.3947
Fold 2: R-squared = 0.4278
Fold 3: R-squared = 0.1305
Fold 4: R-squared = 0.2200
Fold 5: R-squared = 0.1851

Mean R-squared: 0.2716
Standard Deviation of R-squared: 0.1180

Output for 10 folds:

Fold 1: R-squared = 0.4598
Fold 2: R-squared = 0.3464
Fold 3: R-squared = 0.3879
Fold 4: R-squared = 0.4511
Fold 5: R-squared = 0.0416
Fold 6: R-squared = 0.2079
Fold 7: R-squared = 0.3571
Fold 8: R-squared = -0.0133
Fold 9: R-squared = 0.1856
Fold 10: R-squared = 0.2282

Mean R-squared: 0.2652
Standard Deviation of R-squared: 0.1553

Applying the Breusch-Pagan Test using Python (2nd October, Monday)

The linear regression models did not give us a very high R-squared value.

Hence, I went ahead and checked whether the data are homoscedastic or heteroscedastic, using the Breusch-Pagan test. I have discussed the p-value and the Breusch-Pagan test in earlier posts; together they help determine the nature of the data points.

Here the null hypothesis is that the data are homoscedastic, and the alternative hypothesis is that they are heteroscedastic.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Specify the Excel file path
excel_file_path = 'C:\\Users\\Tiyasa\\Desktop\\cdc-diabetes-2018.xlsx'

# Read all sheets into a dictionary of DataFrames
xls = pd.read_excel(excel_file_path, sheet_name=None)

# Access individual DataFrames by sheet name
df1 = xls['Diabetes']
df2 = xls['Obesity']
df3 = xls['Inactivity']

# Fix the typo in the inactivity sheet's county-code column
df3.rename(columns={'FIPDS': 'FIPS'}, inplace=True)

# Inner join df1 and df2 on the 'FIPS' column
merged_df = pd.merge(df1, df2, on='FIPS', how='inner')

# Inner join the result with df3 on the 'FIPS' column
final_merged_df = pd.merge(merged_df, df3, on='FIPS', how='inner')

# Prepare the input features (X) and target variable (y)
X = final_merged_df[['% OBESE', '% INACTIVE']]
y = final_merged_df['% DIABETIC']

# Add a constant to the input features for the intercept term
X = sm.add_constant(X)

# Fit a linear regression model
model = sm.OLS(y, X).fit()

# Perform the Breusch-Pagan test on the residuals
_, p_value, _, _ = het_breuschpagan(model.resid, X)
print("Breusch-Pagan Test p-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("Heteroscedasticity is detected (reject the null hypothesis).")
else:
    print("No significant evidence of heteroscedasticity (fail to reject the null hypothesis).")

Output:

Breusch-Pagan Test p-value: 3.555846910402186e-05
Heteroscedasticity is detected (reject the null hypothesis).