The concept of Cross Validation (25th September, Monday)

Today we learnt about the concept of cross-validation, which lets us evaluate a model on the data we already have as if it were new, unseen data. A limitation every data scientist comes across is the uncertainty about how a model will perform on unseen data until it is actually tested, and more often than not we have only the available data to judge it by. One way to work around this limitation is cross-validation.

Cross-validation involves partitioning the dataset into two or more subsets: a training set and a validation set (or multiple validation sets). The basic steps are:

  • Split the data into K roughly equal-sized folds or subsets.
  • Iteratively use K-1 folds for training and the remaining fold for validation.
  • Repeat this process K times, each time with a different fold as the validation set.
  • Compute performance metrics on each validation fold.

While there are several cross-validation methods, we focused on K-Fold Cross-Validation, where the dataset is divided into K subsets and the training/validation process is repeated K times, with each subset serving as the validation set exactly once.
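To make this concrete for myself, here is a minimal sketch of K-Fold cross-validation using scikit-learn. The linear model and the synthetic X and y arrays are placeholders of my own, not our course data.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Placeholder data: 100 observations of one predictor with a noisy linear response.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# 5-fold cross-validation: each fold serves once as the validation set.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")

print("R^2 per fold:", np.round(scores, 3))
print("Mean R^2:", scores.mean())
```

The mean of the per-fold scores is the cross-validated estimate of how the model might do on unseen data.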

Cross-validation helps evaluate a model’s performance without needing a separate held-out validation dataset. It helps detect overfitting and provides a more reliable estimate of a model’s performance by using multiple subsets of the data for evaluation.

Being cautious with the calculation of p-value and its interpretation

Venturing further into the concept of p-values, I found that while they are a valuable tool in statistical analysis, they should be interpreted cautiously and in conjunction with other statistical measures. Their reliability depends on several factors, including sample size, study design, and the correct formulation of the null and alternative hypotheses. We studied the pre-molt and post-molt data for lab-grown crabs today, and even though the data very closely fit a linear model, both variables were non-normally distributed and skewed, with high variance and high kurtosis. A descriptive comparison of the pre-molt and post-molt data showed that their shapes were very similar in nature, with a small difference in means. To check whether there is essentially no real difference between the pre-molt and post-molt means, we ran a t-test, which produced a very small p-value, indicating that the null hypothesis (that there is no real difference between the pre-molt and post-molt means) should be rejected.
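As a note to myself, here is a rough sketch of that kind of t-test using scipy. The premolt and postmolt arrays are made-up placeholders, not the actual lab data, and I use an independent two-sample test for simplicity (a paired test may be more appropriate if the measurements come from the same crabs).

```python
import numpy as np
from scipy import stats

# Placeholder arrays standing in for the pre-molt and post-molt size measurements.
rng = np.random.default_rng(1)
premolt = rng.normal(loc=130.0, scale=10.0, size=200)
postmolt = premolt + rng.normal(loc=14.0, scale=2.0, size=200)

# Two-sample t-test for a difference in means (assumes roughly normal data).
t_stat, p_value = stats.ttest_ind(premolt, postmolt)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```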

However, the t-test is based on the assumption that the data follow a normal distribution, which is not the case with the pre-molt and post-molt dataset. It was therefore suggested that we use a Monte-Carlo procedure to estimate a p-value for the observed difference in means, under the null hypothesis of no real difference between the pre-molt and post-molt measurements. I did not quite understand why we used the Monte-Carlo procedure here and will look into it further and seek the professor's help on the same.
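While I still need to understand the reasoning better, here is a sketch of the Monte-Carlo (permutation) procedure as I currently picture it: repeatedly shuffle the pooled pre- and post-molt values, recompute the difference in means, and count how often a difference as large as the observed one comes up by chance. The arrays are again placeholders of my own.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder arrays standing in for the pre-molt and post-molt measurements.
premolt = rng.normal(loc=130.0, scale=10.0, size=200)
postmolt = rng.normal(loc=144.0, scale=11.0, size=200)

observed_diff = abs(postmolt.mean() - premolt.mean())

pooled = np.concatenate([premolt, postmolt])
n_pre = len(premolt)
n_iter = 10_000
count = 0
for _ in range(n_iter):
    shuffled = rng.permutation(pooled)  # under the null, the pre/post labels don't matter
    diff = abs(shuffled[n_pre:].mean() - shuffled[:n_pre].mean())
    if diff >= observed_diff:
        count += 1

p_value = count / n_iter
print(f"Monte-Carlo p-value: {p_value:.4f}")
```

The appeal, as far as I can tell, is that this estimate does not rely on the normality assumption the t-test needs.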

Learning about the Breusch-Pagan Test and Backward Reporting

I studied the Breusch-Pagan test in detail and learnt how it is used to assess the presence of heteroscedasticity in a regression model. Heteroscedasticity refers to the situation where the variance of the error terms in a regression model is not constant across all levels of the independent variables, violating one of the key assumptions of linear regression. The Breusch-Pagan test evaluates whether there is a significant relationship between the squared residuals of a regression model and the independent variables. If the test indicates a significant relationship, it suggests the presence of heteroscedasticity, and adjustments or transformations may be necessary to address this issue in the regression analysis.
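To see the mechanics, here is a minimal sketch of running the Breusch-Pagan test with statsmodels. The single predictor x and response y are synthetic placeholders with deliberately non-constant error variance, not our project data.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data where the error variance grows with x (heteroscedastic on purpose).
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)  # noise scale increases with x

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# The test relates the squared residuals to the explanatory variables.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.3g}")
```

A small p-value here would point toward heteroscedasticity, suggesting that adjustments or transformations are worth considering.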

The major idea is to examine the collinearity between the inactivity and obesity variables by plotting linear models of each, in order to predict trends in diabetes. Here, as suggested by the professor, we are using a top-down approach where we backtrack from the conclusion to the cause. This kind of backward mapping is very useful in the data science world today, as trends and predictions are what people are mostly captivated by.

P-value and the significance of the null hypothesis

Today I learnt about the concept of the p-value, which is the probability of observing a result at least as extreme as the one we actually observed, under the assumption that the null hypothesis is true. If that probability falls so low that the observed outcome looks essentially impossible to have happened by chance under the null hypothesis, the null hypothesis is rejected. The p-value thus measures the strength of the evidence against the null hypothesis and helps us judge whether it is plausible. For example, say our null hypothesis is that a coin is fair, and every toss of the coin comes up tails. With each additional tail, the probability of that whole run occurring with a fair coin shrinks, and once the p-value becomes significantly low we say that such an outcome would be extremely unlikely under the null hypothesis. Hence we conclude that the null hypothesis is probably not true and can be rejected (it is not a fair coin). In a similar way, we can assess the statistical significance of any hypothesis test by finding its p-value: if the p-value is very low, the data are hard to reconcile with the null hypothesis, and we reject it.
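To check my own understanding of the coin example, here is a tiny calculation: if the coin is fair, the probability of seeing, say, ten tails in a row is (1/2)^10, and that is the kind of number we compare against a significance threshold such as 0.05. The snippet below is just my own illustration, not anything from class.

```python
from scipy.stats import binomtest

# Probability of 10 tails in 10 tosses if the coin really is fair.
p_all_tails = 0.5 ** 10
print(f"P(10 tails | fair coin) = {p_all_tails:.5f}")  # about 0.001

# The same idea as a formal test: 10 tails out of 10 tosses
# against the null hypothesis that P(tails) = 0.5.
result = binomtest(k=10, n=10, p=0.5, alternative="greater")
print(f"p-value = {result.pvalue:.5f}")  # well below 0.05, so reject the fair-coin null
```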

We also learnt about the Breusch-Pagan Test, whose null hypothesis is that the error variance is constant across observations (homoscedasticity). If the p-value is significantly low (for example, less than 0.05), we conclude that the null hypothesis should be rejected and the data are heteroscedastic. I plan on learning to run the test on a smaller dataset using Python and then trying the same on our real data.

An insight into Real Data from Real Sources

We started the course lecture by discussing the course structure and then ventured into the topic of our first project, which is linear regression. I, for one, like going back to the basics before I dig deep into a topic, so I went back and referred to the text. On reading it, I recognized that for any kind of analysis using statistical methods, we need to understand the data and connect with it to become familiar with its nature. Since the data provided to us comes from a real source, the predictions also need to be made in a realistic manner instead of simply trying to fit the data to an ideal model.

What needs to be considered at this point is that the data will contain errors, and these errors need to be given enough weight to preserve the realistic nature of the data. I learnt about the linear least squares model given by Carl Friedrich Gauss. One way to fit a line is to compute the absolute values of the errors and minimize their sum, but that approach turns out to be unstable and hence unreliable; least squares instead minimizes the sum of the squared errors to arrive at a linear model. I plan on referring to the available data points and plotting individual models first, so that I can then move on to finding the correlation between obesity and inactivity to predict the percentage of diabetes.
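As a first small step toward the individual plots I have in mind, here is a minimal least squares sketch with numpy. The inactivity and diabetes arrays are made-up placeholders, not the actual project data.

```python
import numpy as np

# Placeholder data standing in for the %inactivity and %diabetes columns.
rng = np.random.default_rng(4)
inactivity = rng.uniform(15, 35, size=100)
diabetes = 2.0 + 0.2 * inactivity + rng.normal(0, 1.0, size=100)

# Least squares minimizes the sum of squared errors between the data and the fitted line.
slope, intercept = np.polyfit(inactivity, diabetes, deg=1)
residuals = diabetes - (slope * inactivity + intercept)

print(f"fitted line: diabetes ~ {intercept:.2f} + {slope:.2f} * inactivity")
print(f"sum of squared errors: {np.sum(residuals**2):.2f}")
```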