Analytical Data Interpretation & Linear Regression
In today's first class, we went over the preliminary course instructions and then moved on to our first project: using linear regression to predict diabetes from the CDC diabetes data. The dataset includes fields such as county, FIPS code, and the percentages of diabetes, obesity, and inactivity. While exploring it, I noticed several relationships among obesity, inactivity, and diabetes that could be useful for prediction. Computing the correlations between these features gives a rough idea of how strongly they are related. In this session I also learned about several measures used to analyze data, such as skewness, kurtosis, and heteroscedasticity. Afterwards, I read a little more about heteroscedasticity and homoscedasticity and how they differ.
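To make the correlation idea concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the five county values are made up for illustration; they are not taken from the CDC data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up percentages for five hypothetical counties (not CDC values).
inactivity = [14.0, 16.5, 18.0, 20.5, 23.0]
obesity = [17.0, 22.0, 18.5, 21.5, 19.0]
diabetes = [6.5, 7.2, 7.9, 8.8, 9.6]

print("inactivity vs diabetes:", round(pearson_r(inactivity, diabetes), 3))
print("obesity vs diabetes:   ", round(pearson_r(obesity, diabetes), 3))
```

A value near +1 or -1 suggests a strong linear relationship, while a value near 0 suggests a weak one. Correlation alone does not show that one factor causes the other.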
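Skewness: A measure of the asymmetry of a data distribution. It indicates whether the data have a longer tail on the left or on the right.
Kurtosis: A measure of how heavy or light the tails of a distribution are, usually compared with a normal distribution.
Heteroscedasticity: In regression analysis, the condition in which the variance of the error terms is not constant across values of the independent variable.
Homoscedasticity: In regression analysis, the condition in which the variance of the error terms is constant across values of the independent variable.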
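As a rough illustration of the first two measures, skewness and kurtosis can be computed from central moments. This is a minimal sketch using the simple population formulas; the helper names and sample values are my own, and statistics libraries often apply small-sample corrections, so their results may differ slightly.

```python
def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Made-up samples for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print("symmetric:   ", skewness(symmetric), excess_kurtosis(symmetric))
print("right skewed:", skewness(right_skewed), excess_kurtosis(right_skewed))
```

A positive skewness points to a longer right tail and a negative one to a longer left tail. A positive excess kurtosis points to heavier tails than a normal distribution.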
I also learned about linear regression. In simple terms, it is a statistical model that predicts a dependent variable from its linear relationship with one or more independent variables.
In mathematical terms, simple linear regression is written as Y = β0 + β1X + C, where:
Y – Dependent Variable
X – Independent Variable
β1 – Slope
β0 – Intercept
C – Error
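The equation above can be fitted by ordinary least squares. The sketch below uses the standard closed-form estimates for the slope and intercept; the function name `fit_simple` and the data points are hypothetical and only serve to show the calculation.

```python
def fit_simple(xs, ys):
    """Least-squares estimates (b0, b1) for Y = b0 + b1*X + error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared X deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up data points for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = fit_simple(x, y)
# Residuals play the role of the error term C for each observation.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("intercept:", round(b0, 4), "slope:", round(b1, 4))
print("residuals:", [round(r, 4) for r in residuals])
```

With an intercept in the model, the least-squares residuals sum to zero (up to floating-point rounding), which is a quick sanity check on the fit.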
With this dataset, we are going to predict %diabetes using %inactivity and %obesity. In simple linear regression, the target variable Y is %diabetes and the independent variable X is either %obesity or %inactivity. Multiple linear regression allows more than one independent variable, so here %inactivity and %obesity can be used together as two predictors of %diabetes.
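With two predictors the model becomes Y = β0 + β1X1 + β2X2 + C. The following is a minimal sketch of fitting it by solving the normal equations in plain Python. The helper names and all numbers are assumptions made for illustration: the target is generated from known coefficients (1.0, 0.2, 0.1) so the fit can be checked, and none of the values come from the CDC data. In practice a library routine would normally be used instead of hand-written elimination.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_multiple(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]  # leading 1.0 is the intercept column
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)

# Made-up predictor values for five hypothetical counties.
inactivity = [14.0, 16.5, 18.0, 20.5, 23.0]
obesity = [17.0, 22.0, 18.5, 21.5, 19.0]
# Synthetic target built from known coefficients, with no noise added.
diabetes = [1.0 + 0.2 * a + 0.1 * b for a, b in zip(inactivity, obesity)]

b0, b1, b2 = fit_multiple(inactivity, obesity, diabetes)
print("b0:", round(b0, 4), "b1:", round(b1, 4), "b2:", round(b2, 4))
```

Because the synthetic target contains no noise, the fit should recover the generating coefficients up to rounding. With real data the estimates would also reflect the error term.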