Learning about k-means, k-medoids, and DBSCAN clustering methods
Today I learnt about the different clustering methods discussed in class.
K-Means, a partitioning method, groups data points into K clusters based on their proximity to cluster centers, typically the means of the data points in each cluster. This approach seeks to minimize the within-cluster sum of squared distances between points and their cluster centers, and it is widely used in practice.
K-Medoids, on the other hand, shares similarities with K-Means but employs the medoid, the data point most centrally located within a cluster, as the representative of each cluster. This method is preferred in scenarios where robustness to outliers is crucial.
DBSCAN is a density-based approach that identifies clusters as dense regions separated by areas of lower density, making it particularly suitable for datasets with irregularly shaped clusters and noise. It employs two essential parameters, an epsilon distance threshold and a minimum number of data points required to define a dense region. DBSCAN, known for its ability to automatically discover the number of clusters, is less sensitive to initial configurations and can adapt to various data distributions.
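To make the contrast concrete, here is a minimal sketch using scikit-learn on synthetic two-moons data. The dataset, the parameter values, and the scikit-learn-extra dependency for K-Medoids are my own assumptions for illustration, not something covered in class.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn_extra.cluster import KMedoids  # K-Medoids lives in the separate scikit-learn-extra package

# Synthetic 2-D data with two crescent-shaped clusters
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# K-Means: minimizes the within-cluster sum of squared distances to the cluster means
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# K-Medoids: uses actual data points (medoids) as the cluster representatives
kmedoids_labels = KMedoids(n_clusters=2, random_state=42).fit_predict(X)

# DBSCAN: eps is the neighborhood radius, min_samples the density threshold;
# the number of clusters is discovered automatically and noise points get label -1
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(set(kmeans_labels), set(kmedoids_labels), set(dbscan_labels))

On data like this, K-Means and K-Medoids tend to cut the crescents roughly in half by distance, while DBSCAN can recover the two irregular shapes, which matches the intuition above.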
Plotting Age Distributions for All Ages and for Black and White People Respectively
Geospatial Visualization of the Location Data
Introduction to Data on Police Shootings by Washington Post
After examining the Washington Post dataset on police shootings, I’ve identified several interesting avenues for exploration. The dataset contains attributes such as agencies and IDs, which provide an opportunity to assess whether some law enforcement agencies are more frequently involved in shootings than others. By analyzing the data, we can also pinpoint the locations where these agencies are most active.
Age is another crucial factor to consider. We can investigate the age groups to which most victims belong. Additionally, we can compare the average ages of victims from different racial backgrounds. To perform this analysis, we might employ statistical tests like t-tests, but it’s essential to ensure that our data follows a normal distribution. Given the diversity of races and ethnicities involved, conducting a Monte Carlo test could be quite complex. As an alternative, an ANOVA test may be more practical.
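As a rough sketch of what that comparison might look like in code, here is one way to run a t-test and an ANOVA with SciPy. The file name, the column names ('age', 'race'), and the group codes ('B', 'W') are assumptions about the dataset layout rather than confirmed field names.

import pandas as pd
from scipy import stats

# Hypothetical sketch: compare victim ages across racial groups
df = pd.read_csv('fatal-police-shootings-data.csv')  # assumed file name
df = df.dropna(subset=['age', 'race'])

black_ages = df.loc[df['race'] == 'B', 'age']
white_ages = df.loc[df['race'] == 'W', 'age']

# Two-sample t-test; Welch's version (equal_var=False) does not assume equal variances,
# but it still assumes the ages are roughly normally distributed
t_stat, t_p = stats.ttest_ind(black_ages, white_ages, equal_var=False)
print(f"t-test: t = {t_stat:.3f}, p = {t_p:.4f}")

# One-way ANOVA across all racial groups at once
groups = [g['age'].values for _, g in df.groupby('race')]
f_stat, f_p = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.3f}, p = {f_p:.4f}")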
Initially, I plan to create basic data distributions to gain insights and generate more ideas for the subsequent analysis.
Project 1: Report
Applying the K-Fold Cross-Validation Method
In an attempt to improve the model's predictions, I ran a t-test and a Monte Carlo test on the data, which again gave me a very small p-value, adding to the evidence that the data is indeed heteroscedastic.
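The Monte Carlo test itself is not shown in this post, so purely as an illustration, here is one way such a check could be set up: a permutation test on the residual variances in the lower versus upper half of the fitted values. The inputs X and y are assumed to be the same feature matrix and target used in the cross-validation code below.

import numpy as np
from sklearn.linear_model import LinearRegression

def monte_carlo_heteroscedasticity(X, y, n_iter=5000, seed=0):
    # Fit a linear model, then compare residual variance between the
    # low-fitted-value half and the high-fitted-value half of the data
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    fitted = LinearRegression().fit(X, y).predict(X)
    residuals = y - fitted
    order = np.argsort(fitted)
    half = len(y) // 2
    low, high = residuals[order[:half]], residuals[order[half:]]
    observed = abs(np.var(high) - np.var(low))

    # Monte Carlo step: shuffle the residuals and count how often a random
    # split produces a variance gap at least as large as the observed one
    pooled = np.concatenate([low, high])
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if abs(np.var(pooled[:half]) - np.var(pooled[half:])) >= observed:
            count += 1
    return count / n_iter  # a small p-value points toward heteroscedasticity

# p_value = monte_carlo_heteroscedasticity(X, y)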
As a further attempt to make better predictions for diabetes, I performed K-fold cross-validation with 5 folds and with 10 folds. However, both runs gave me a small R-squared value. Please find the code below.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Specify the Excel file path
excel_file_path = 'C:\\Users\\Tiyasa\\Desktop\\cdc-diabetes-2018.xlsx'

# Use pandas `read_excel` to read all sheets into a dictionary of DataFrames
xls = pd.read_excel(excel_file_path, sheet_name=None)

# Access individual DataFrames by sheet name
# (sheet and column names are placeholders; adjust to the workbook's actual headers)
df1, df2, df3 = xls['Diabetes'], xls['Obesity'], xls['Inactivity']
df3.rename(columns={'FIPDS': 'FIPS'}, inplace=True)

# Inner join the first two sheets, then inner join the result with df3 on the 'FIPS' column
merged = df1.merge(df2, on='FIPS', how='inner').merge(df3, on='FIPS', how='inner')

# Prepare the input features (X) and target variable (y)
X = merged[['% OBESE', '% INACTIVE']]
y = merged['% DIABETIC']

# Create a linear regression model
model = LinearRegression()

# Define the number of folds for cross-validation (5 here; 10 for the second run)
n_folds = 5

# Create a KFold cross-validation iterator
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

# Perform cross-validation and get R-squared scores
r2_scores = cross_val_score(model, X, y, cv=kf, scoring='r2')

# Print R-squared scores for each fold
for i, r2 in enumerate(r2_scores, start=1):
    print(f"Fold {i}: R-squared = {r2:.4f}")

# Calculate the mean and standard deviation of R-squared scores
mean_r2 = np.mean(r2_scores)
std_r2 = np.std(r2_scores)
print(f"\nMean R-squared: {mean_r2:.4f}")
print(f"Standard Deviation of R-squared: {std_r2:.4f}")
Output for 5 folds:
Fold 1: R-squared = 0.3947
Fold 2: R-squared = 0.4278
Fold 3: R-squared = 0.1305
Fold 4: R-squared = 0.2200
Fold 5: R-squared = 0.1851
Mean R-squared: 0.2716
Standard Deviation of R-squared: 0.1180
Output for 10 folds:
Fold 1: R-squared = 0.4598
Fold 2: R-squared = 0.3464
Fold 3: R-squared = 0.3879
Fold 4: R-squared = 0.4511
Fold 5: R-squared = 0.0416
Fold 6: R-squared = 0.2079
Fold 7: R-squared = 0.3571
Fold 8: R-squared = -0.0133
Fold 9: R-squared = 0.1856
Fold 10: R-squared = 0.2282
Mean R-squared: 0.2652
Standard Deviation of R-squared: 0.1553
Applying the Breusch-Pagan Test using Python (2nd October, Monday)
The linear regression fits did not give us a very high R-squared value.
Hence, I went ahead and checked whether the data was homoscedastic or heteroscedastic, using the Breusch-Pagan test. I have discussed the p-value and the Breusch-Pagan test in my earlier posts; together they help determine the nature of the data points.
Here the null hypothesis is that the data is homoscedastic and the alternative hypothesis is that the data is heteroscedastic.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Read all sheets of the workbook into a dictionary of DataFrames
excel_file_path = 'C:\\Users\\Tiyasa\\Desktop\\cdc-diabetes-2018.xlsx'
xls = pd.read_excel(excel_file_path, sheet_name=None)

# Access individual DataFrames by sheet name
# (sheet and column names are placeholders; adjust to the workbook's actual headers)
df1, df2, df3 = xls['Diabetes'], xls['Obesity'], xls['Inactivity']
df3.rename(columns={'FIPDS': 'FIPS'}, inplace=True)

# Inner join the first two sheets, then inner join the result with df3 on the 'FIPS' column
merged = df1.merge(df2, on='FIPS', how='inner').merge(df3, on='FIPS', how='inner')

# Prepare the input features (X) and target variable (y)
X = merged[['% OBESE', '% INACTIVE']]
y = merged['% DIABETIC']

# Add a constant to the input features for the intercept term
X = sm.add_constant(X)

# Fit a linear regression model
ols_model = sm.OLS(y, X).fit()

# Perform the Breusch-Pagan test on the residuals
bp_stat, bp_p_value, _, _ = het_breuschpagan(ols_model.resid, ols_model.model.exog)

# Interpret the results
print(f"Breusch-Pagan Test p-value: {bp_p_value}")
if bp_p_value < 0.05:
    print("Heteroscedasticity is detected (reject the null hypothesis).")
else:
    print("Homoscedasticity cannot be rejected (fail to reject the null hypothesis).")
Output:
Breusch-Pagan Test p-value: 3.555846910402186e-05
Heteroscedasticity is detected (reject the null hypothesis).