Bootstrapping

Bootstrapping is random sampling with replacement: we draw random samples from the original dataset n times to form a new dataset. "With replacement" means that each sample drawn is thrown back, so it can be drawn again. The statistic of interest, such as the mean, is then calculated across the resampled datasets.

If a classification algorithm is sensitive to the a priori class distribution, we can bootstrap one of the classes to reduce the bias of the model.

It also reduces the effect of noise in the model.
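
To make this concrete, here is a minimal Python sketch of bootstrapping the mean. The data is synthetic, invented only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample of 50 observations (illustrative data only)
data = rng.normal(loc=10, scale=2, size=50)

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Draw a resample of the same size, WITH replacement:
    # each value is "thrown back" and can be drawn again
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = resample.mean()

# The mean of the means in the resampled data
print("bootstrap estimate of the mean:", boot_means.mean())
# A 95% interval from the bootstrap distribution
print("95% interval:", np.percentile(boot_means, [2.5, 97.5]))
```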

Estimating Prediction Error

Prediction error estimates how well a model works on unseen data. It can be estimated in various ways; among these methods are cross-validation, a validation set, and the bootstrap.

In the validation-set approach, we use the model to predict values of the dependent variable for the validation set. We then compare the predicted values with the actual values of the dependent variable to calculate the prediction error.

In cross-validation, we iteratively split the data into subsamples, creating a new training set and validation set on each iteration, and then evaluate the model on the validation set.

Prediction error often stems from overfitting, where the model learns the fluctuations and noise of the training data too well. Using larger datasets helps prevent overfitting.
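
A minimal sketch of the validation-set approach in Python, using synthetic data in place of a real dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))           # hypothetical predictor
y = 3 * X.ravel() + rng.normal(0, 2, size=200)  # hypothetical noisy response

# Hold out 25% of the data as the validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# Compare predicted values against the actual dependent variable
y_pred = model.predict(X_val)
print("validation MSE:", mean_squared_error(y_val, y_pred))
```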

K-fold validation

In today's lecture, the discussion was on cross-validation.

Before diving into the topic, I first explored when the cross-validation technique is used.

Cross-validation is used when a single train-test split gives a biased performance estimate; because every part of the dataset is eventually used for testing, cross-validation avoids that bias in the performance check.

There are four types of cross-validation techniques:

  1. K-folds
  2. Stratified K-folds
  3. Leave-one-out
  4. Leave-p-out

Here we use K-fold cross-validation for our dataset. Cross-validation is a form of predictive analytics: it involves estimating the model on one data sample and evaluating how well the model performs on a separate sample from the same population.

In K-fold cross-validation, the original data is split randomly into k subsamples, or folds.

For each fold, the model is estimated on k-1 subsamples, with the kth subsample serving as the validation sample. The process is iterated until every subsample has served as validation data, and the results are averaged.

A subset of the original data is split off before the rest passes through the k-fold pipeline above; this held-out subset, used to evaluate the final model's performance, is the test data.

Both linear regression and logistic regression models can be used with k-fold cross-validation; here we use a linear regression model, as in the sketch below.
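
A minimal sketch of k-fold cross-validation with a linear regression model, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(150, 1))           # hypothetical predictor
y = 2 * X.ravel() + rng.normal(0, 1, size=150)  # hypothetical response

# 5-fold split: each fold serves once as the validation sample
kf = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring="neg_mean_squared_error")

# Results are averaged across the folds (negate to recover MSE)
print("per-fold MSE:", -scores)
print("mean CV MSE:", -scores.mean())
```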

Polynomial Regression & Performance Measurement

Polynomial regression is a type of regression analysis, a statistical technique that models the relationship between a dependent variable and one or more independent variables. It can model both linear and non-linear relationships between variables. Polynomial regression produces a best fit by minimizing the distance between the actual and predicted values.

Performance measurement tells us how accurately the model predicts during the training phase. In this regression, the measures considered are the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE).

The mathematical formula for MSE is MSE = (1/n) * Σ (y_pred − y_actual)²

The MSE is what we minimize during model training: the lower the MSE, the higher the accuracy.

The mathematical formula for RMSE is sqrt(MSE).

The lower the RMSE, the higher the accuracy.
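
A minimal sketch of polynomial regression with MSE and RMSE computed as above; the curved data is invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=100)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(0, 0.5, size=100)  # invented curved data

# Degree-2 polynomial features turn linear regression into polynomial regression
X_poly = PolynomialFeatures(degree=2).fit_transform(x.reshape(-1, 1))
model = LinearRegression().fit(X_poly, y)
y_pred = model.predict(X_poly)

mse = mean_squared_error(y, y_pred)  # (1/n) * Σ (y_pred − y_actual)²
rmse = np.sqrt(mse)                  # sqrt(MSE)
print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```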

T-Test

The t-test is used for testing a hypothesis when the sample size is small. It compares the means of two groups and tests whether their difference is significant.

The test statistic t is a random variable following the t-distribution with n-1 degrees of freedom.

I also got to know when we have to go for a t-test.

When choosing a t-test, we look at two criteria, which decide between the two types of t-test (paired-sample and independent two-sample):

  1. Whether we want to test for a difference in a specific direction.
  2. Whether the groups being compared come from the same population or from different populations.

The null hypothesis assumes that no significant difference exists between the two groups being compared, i.e., that the difference between the group means equals zero. Both test types are sketched below.
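
A minimal sketch of both t-test types using scipy, on hypothetical samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
group_a = rng.normal(10.0, 2.0, size=20)  # hypothetical small samples
group_b = rng.normal(11.0, 2.0, size=20)

# Independent two-sample t-test (groups from different populations)
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"independent: t = {t_stat:.3f}, p = {p_val:.3f}")

# Paired-sample t-test (the same subjects measured twice)
before = rng.normal(10.0, 2.0, size=15)
after = before + rng.normal(0.5, 1.0, size=15)
t_stat, p_val = stats.ttest_rel(before, after)
print(f"paired: t = {t_stat:.3f}, p = {p_val:.3f}")
```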

Quadratic Model, Multiple Regression & Overfitting

A quadratic model is used when the relationship between the independent and dependent variables is not linear, for example when the scatterplot is U-shaped; in that case a quadratic regression model best fits the given data. It is essentially a way to figure out the equation of a parabola.

To capture deeper relationships, we use the multiple regression model, which makes predictions from several independent variables.

Overfitting is encountered when the model fits the training data too closely and becomes unreliable on new data. It typically arises when the training dataset is too small for the model's complexity. There is no single technique that avoids overfitting, but there are several, such as cross-validation, which evaluates a machine-learning algorithm's performance by how well it makes predictions on data it has not been trained on. Here we use the R-squared value, a measure of how well the model fits: a value of 1 implies the model fits perfectly, and 0 implies the model does not fit. A quadratic fit and its R-squared are sketched below.
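
A minimal sketch of fitting a parabola and computing R-squared, on invented U-shaped data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-4, 4, size=80)
y = 2 * x**2 - x + 3 + rng.normal(0, 1.0, size=80)  # invented U-shaped data

# Fit the equation of a parabola: y = c2*x^2 + c1*x + c0
coeffs = np.polyfit(x, y, deg=2)
y_pred = np.polyval(coeffs, x)

# R-squared: 1 means a perfect fit, 0 means the model explains nothing
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print("coefficients:", coeffs)
print("R^2 =", 1 - ss_res / ss_tot)
```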

Analytical Data Interpretation

Today the class was centered on data interpretation. The dataset contains FIPS values, which I extracted from all the given data, and I found a few missing values. To fill these in, the features of the dataset have to be analyzed. Meanwhile, I calculated the mean, median, and standard deviation. In analyzing these, I also found that skewness exists in the data points extracted from the dataset. The dataset comprises three main features: diabetes, inactivity, and obesity. With these features, I tried to figure out the correlation between diabetes and inactivity, and they turned out to be correlated to some extent. The values range from -1 to 1: -1 indicates negative correlation, 1 indicates positive correlation, and 0 implies no correlation.
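
A minimal sketch of these checks with pandas; the column names mirror the features described above, but the numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical rows: the real dataset has these features per county (FIPS)
df = pd.DataFrame({
    "diabetes":   [8.1, 9.3, 10.2, 7.5, 11.0],
    "inactivity": [15.2, 17.8, 19.1, 14.0, 20.5],
    "obesity":    [28.3, 30.1, 31.5, 27.0, 33.2],
})

print(df.isna().sum())   # count missing values per feature
print(df.describe())     # mean, std, and quartiles (50% is the median)
print(df.skew())         # skewness of each feature
# Pearson correlation between diabetes and inactivity, always in [-1, 1]
print(df["diabetes"].corr(df["inactivity"]))
```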

Hypothesis Testing, P-value, Breusch-Pagan Test

Today’s lecture was focused on Hypothesis Testing, what’s P-value? And Breusch-Pagan Test.

Hypothesis testing is a statistical method for making a quantitative statement about a population based on sample data. There are two hypotheses: the null hypothesis and the alternative hypothesis.

Null hypothesis: a claim or statement about a population parameter that is assumed to be true until it is declared false. Its notation is H0. The null hypothesis may be rejected, and is accepted only after the appropriate tests.

Alternative hypothesis: any hypothesis complementary to the null hypothesis. Under the alternative hypothesis there is a difference between the two variables, and it is accepted if and only if we reject H0 (the null hypothesis). It is also known as the research hypothesis.

The p-value is the probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true; it tells us how probable the observed data value is under the null hypothesis. The p-value ranges between 0 and 1. A threshold of p = 0.05 is commonly used as the significance level. If the value my experiment comes up with falls in the extreme region, we reject the null hypothesis, saying the experimental value is far away from the mean. Significance levels are set by domain experts and can vary. Based on the p-value, we can also derive how many standard deviations a value lies away from the mean, using the z-test or t-test.

The Breusch-Pagan test helps detect heteroscedasticity in a regression model by regressing the squared residuals on the independent variables. If the resulting p-value is less than the significance level, we conclude that heteroscedasticity exists in the regression model, as in the sketch below.
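
A minimal sketch of the Breusch-Pagan test with statsmodels, on synthetic data constructed to be heteroscedastic:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=200)
# Noise grows with x, so the data is heteroscedastic by construction
y = 2 * x + rng.normal(0, 1, size=200) * (0.5 * x + 0.1)

# Fit OLS and take its residuals
X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

# The test regresses squared residuals on the independent variables
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
if lm_pvalue < 0.05:
    print(f"p = {lm_pvalue:.4f} < 0.05: heteroscedasticity detected")
else:
    print(f"p = {lm_pvalue:.4f}: no evidence of heteroscedasticity")
```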

Analytical Data Interpretation & Linear Regression

In today's first class, we discussed course preliminary instructions and then proceeded to our first project, which is on linear regression for predicting diabetes using the CDC diabetes data. In this dataset we can explore various factors, such as county, inactivity, diabetes, FIPS, and obesity, to predict diabetes. While exploring the dataset, I noticed various relationships among the factors obesity, inactivity, and diabetes that can be brought into the prediction. Finding the correlations between these features gives a rough idea of their relationships. In this session I learned about various measures used to analyze the data, such as skewness, kurtosis, and heteroscedasticity. Later I explored a little more about heteroscedasticity, homoscedasticity, and their differences.

Skewness: a data-analysis measure that describes the asymmetry in the shape of the data distribution.

Kurtosis: a measure of whether the distribution is heavy-tailed or light-tailed.

Heteroscedasticity: unequal variance of the errors in a regression analysis.

Homoscedasticity: the error terms have constant variance across the independent variable in a regression analysis.

I also learned about linear regression. Keeping it in simple terms, it is a statistical model that makes predictions based on the relationship between variables.

In mathematical terms, linear regression is Y = β0 + β1X + C

Y – Dependent Variable

X – Independent Variable

β1 – Slope

β0 – Intercept

C – Error

With this dataset, we are now going to predict %diabetes using %inactivity and %obesity. Here the target variable Y is %diabetes, and the independent variable can be either %obesity or %inactivity. In multiple linear regression, more independent variables can be used; here we can use %inactivity and %obesity as two independent variables to predict diabetes, as sketched below.
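
A minimal sketch of both the simple and the multiple linear regression described above; the three columns are hypothetical stand-ins for the CDC features:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
# Hypothetical stand-ins for the CDC columns %inactivity, %obesity, %diabetes
inactivity = rng.uniform(10, 25, size=100)
obesity = rng.uniform(25, 40, size=100)
diabetes = 1.0 + 0.3 * inactivity + 0.1 * obesity + rng.normal(0, 0.5, size=100)

# Simple linear regression: %diabetes ~ %inactivity  (Y = β0 + β1X + error)
X1 = sm.add_constant(inactivity)
print(sm.OLS(diabetes, X1).fit().params)   # [β0, β1]

# Multiple linear regression: %diabetes ~ %inactivity + %obesity
X2 = sm.add_constant(np.column_stack([inactivity, obesity]))
print(sm.OLS(diabetes, X2).fit().params)   # [β0, β1, β2]
```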