    Multiple Linear Regression
    MLAI/Regression 2020. 1. 19. 00:01

    1. Overview

    Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression (MLR) is to model the linear relationship between the explanatory (independent) variables and response (dependent) variable.

    2. Description

    2.1 Formula

    2.1.1 Population model
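
    In standard notation, with $k$ explanatory variables, unknown population parameters $\beta_{0},\ldots,\beta_{k}$ (the same $\beta$ tested in the hypothesis below), and an error term $\varepsilon$, the population model is:

    $$y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\cdots+\beta_{k}x_{k}+\varepsilon$$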

    2.1.2 Sample model
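
    The sample regression equation estimates the $\beta$'s with coefficients $b_{0},\ldots,b_{k}$ computed from the data, and it yields a predicted value $\hat{y}$ rather than an observed one, so there is no error term:

    $$\hat{y}=b_{0}+b_{1}x_{1}+b_{2}x_{2}+\cdots+b_{k}x_{k}$$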

    It is similar to simple regression. The main difference is that there are several independent variables rather than just the one we are interested in. The data stops being two-dimensional, and beyond three dimensions there is no visual way to represent it. So if it is not about the line, what is it about? It is about the best fitting model, and more variables usually mean a better fitting model.

    2.2 Adjusted $R^{2}$

    • $R^{2}$ measures how much of the total variability is explained by our model.
    • A multiple regression always fits at least as well as a simple one: with each additional variable you add, the explanatory power can only increase or stay the same.
    • The adjusted $R^{2}$ penalizes excessive use of variables (see the formula after this list).
    • The adjusted $R^{2}$ is usually smaller than the $R^{2}$; the exception occurs only in the extreme case of a small sample size combined with a high number of independent variables.
    • If adding a new variable increases $R^{2}$ but decreases the adjusted $R^{2}$, the variable can be omitted, since it holds no predictive power.
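
    For reference, the standard adjusted $R^{2}$ formula, where $n$ is the sample size and $p$ is the number of explanatory variables:

    $$\bar{R}^{2}=1-(1-R^{2})\frac{n-1}{n-p-1}$$

    As $p$ grows, $\frac{n-1}{n-p-1}$ grows with it, which is exactly how the excessive use of variables gets penalized.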

    2.2.1 Example

    Suppose we generated a variable that assigns 1, 2, or 3 randomly to each student. We are 100 percent sure that this variable cannot predict college GPA. So this is our new model:

    $$GPA=b_{0}+b_{1}SAT+b_{2}Rand123$$

    We notice that the new $R^{2}$ is 0.407, so it seems as if we have increased the explanatory power of the model. But our enthusiasm is dampened by the adjusted R-squared of 0.392: we were penalized for adding a variable with no strong explanatory power. We have added information but lost value. The point is that you should cherry-pick your data so as to exclude useless information; one would assume regression analysis is smarter than that, but it is not.

    Looking at the coefficients table, we see that a coefficient has been estimated for the Rand123 variable, but its p-value is 0.762. Remember the null hypothesis of the test:

    $$H_{0}:\beta=0$$

    We cannot reject the null hypothesis at the 76% significance level. This is an incredibly high p-value. Let me remind you that for a coefficient to be statistically significant, we want a p-value of less than 0.05.

    Our conclusion is that the variable Rand123 not only worsens the explanatory power of the model, as reflected by the lower adjusted R-squared, but is also insignificant. Therefore it should be dropped altogether. Dropping useless variables is important.
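
    As a minimal sketch, here is how this kind of check could be reproduced in Python with statsmodels. The data is synthetic (so the exact numbers will differ from the 0.407 and 0.392 above), and the names SAT and Rand123 simply follow the model formula:

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 84  # hypothetical sample size

    # Synthetic data: GPA loosely driven by SAT; Rand123 is pure noise
    sat = rng.normal(1850.0, 100.0, n)
    gpa = 0.3 + 0.0015 * sat + rng.normal(0.0, 0.25, n)
    rand123 = rng.integers(1, 4, n)  # randomly assigns 1, 2, or 3 to each student

    X = sm.add_constant(pd.DataFrame({"SAT": sat, "Rand123": rand123}))
    model = sm.OLS(gpa, X).fit()

    print(model.rsquared)            # R-squared never drops when a variable is added
    print(model.rsquared_adj)        # adjusted R-squared drops if the variable adds no value
    print(model.pvalues["Rand123"])  # high p-value: cannot reject H0 that beta = 0
    ```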

    3. Building a model

    3.1 All-in

    3.2 Backward Elimination

    As soon as all of the variables left in your model have p-values below the significance level, your model is ready.
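
    A minimal sketch of this procedure using statsmodels p-values; `X` is assumed to be a pandas DataFrame of candidate predictors and `y` the response, as in the example above:

    ```python
    import statsmodels.api as sm

    def backward_elimination(X, y, sl=0.05):
        """Fit, drop the least significant predictor, and repeat
        until every remaining p-value is below the significance level."""
        cols = list(X.columns)
        while cols:
            model = sm.OLS(y, sm.add_constant(X[cols])).fit()
            pvals = model.pvalues.drop("const")  # ignore the intercept
            worst = pvals.idxmax()
            if pvals[worst] >= sl:
                cols.remove(worst)               # remove the highest p-value predictor
            else:
                break                            # all p-values < SL: the model is ready
        return cols
    ```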

    3.3 Forward Selection

    When the condition P < SL is no longer true, we do not go back to Step 3; the regression is finished, because the variable we just added is no longer significant.
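
    A matching sketch of forward selection under the same assumed `X` and `y`: each round adds the candidate with the lowest p-value, and the loop finishes as soon as the condition P < SL stops holding:

    ```python
    import statsmodels.api as sm

    def forward_selection(X, y, sl=0.05):
        """Add the most significant remaining predictor each round;
        finish when the best candidate's p-value is not below SL."""
        selected, remaining = [], list(X.columns)
        while remaining:
            # Fit one candidate model per remaining predictor
            pvals = {}
            for col in remaining:
                model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
                pvals[col] = model.pvalues[col]
            best = min(pvals, key=pvals.get)
            if pvals[best] < sl:                 # condition P < SL still holds
                selected.append(best)
                remaining.remove(best)
            else:
                break                            # the new variable is not significant
        return selected
    ```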

    3.4 Bidirectional Elimination

    3.5 Score Comparison (All Possible Models)

    4. Reference

    https://www.investopedia.com/terms/m/mlr.asp

    http://www.stat.yale.edu/Courses/1997-98/101/linmult.htm
