The HR manager of a GTA division wants to predict the annual salaries of given employees.
The sample data used for this problem is available here.
Before running a regression analysis, a correlation analysis is run to determine whether any of the explanatory variables are highly correlated:
| | Salary | Years Previous Experience | Years Employed | Years Education | Gender | Department | Number Supervised |
|---|---|---|---|---|---|---|---|
| Salary | 1 | | | | | | |
| Years Previous Experience | 0.029354 | 1 | | | | | |
| Years Employed | 0.765174 | 0.031276694 | 1 | | | | |
| Years Education | 0.776991 | 0.080168828 | 0.6074856 | 1 | | | |
| Gender | -0.24868 | -0.21714487 | -0.209393 | -0.1926921 | 1 | | |
| Department | 0.337743 | -0.10550455 | 0.093467 | 0.08571375 | 0.012456 | 1 | |
| Number Supervised | 0.523925 | 0.216198121 | 0.3454442 | 0.50460899 | -0.10034 | 0.162588075 | 1 |

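None of the independent variables appear to be highly correlated with each other. The largest pairwise correlation among them is about 0.61, between Years Employed and Years Education. There is therefore no strong sign of multicollinearity, and no variable needs to be removed before running the regression analysis.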
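The multicollinearity check described above can be sketched in a few lines of Python. The correlations are copied from the table; the 0.8 cutoff is only an assumed rule of thumb (the original analysis does not state a threshold), and the function names are hypothetical.

```python
# Pairwise correlations among the explanatory variables, copied from the
# correlation table (Salary is the response variable, so it is left out).
CORRELATIONS = {
    ("Years Previous Experience", "Years Employed"): 0.031276694,
    ("Years Previous Experience", "Years Education"): 0.080168828,
    ("Years Employed", "Years Education"): 0.6074856,
    ("Years Previous Experience", "Gender"): -0.21714487,
    ("Years Employed", "Gender"): -0.209393,
    ("Years Education", "Gender"): -0.1926921,
    ("Years Previous Experience", "Department"): -0.10550455,
    ("Years Employed", "Department"): 0.093467,
    ("Years Education", "Department"): 0.08571375,
    ("Gender", "Department"): 0.012456,
    ("Years Previous Experience", "Number Supervised"): 0.216198121,
    ("Years Employed", "Number Supervised"): 0.3454442,
    ("Years Education", "Number Supervised"): 0.50460899,
    ("Gender", "Number Supervised"): -0.10034,
    ("Department", "Number Supervised"): 0.162588075,
}

# Assumed rule-of-thumb cutoff; the source does not name a specific value.
THRESHOLD = 0.8


def highly_correlated(correlations, threshold=THRESHOLD):
    """Return the (pair, r) entries whose absolute correlation meets the cutoff."""
    return [(pair, r) for pair, r in correlations.items() if abs(r) >= threshold]


def strongest_pair(correlations):
    """Return the (pair, r) entry with the largest absolute correlation."""
    return max(correlations.items(), key=lambda item: abs(item[1]))


flagged = highly_correlated(CORRELATIONS)
top_pair, top_r = strongest_pair(CORRELATIONS)
print("Pairs at or above the cutoff:", flagged)
print("Strongest pair:", top_pair, round(top_r, 4))
```

With a stricter or looser cutoff the flagged list may change, so the threshold is worth stating explicitly whenever this kind of screen is reported.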
Results of multiple regression for Salary

| Summary measures | Value |
|---|---|
| Multiple R | 0.9052 |
| R-Square | 0.8193 |
| Adj R-Square | 0.7915 |
| StErr of Est | 5022.3667 |

ANOVA Table

| Source | df | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Explained | 6 | 4460508938.9565 | 743418156.4928 | 29.4725 | 0.0000 |
| Unexplained | 39 | 983742464.0000 | 25224165.7436 | | |

Regression coefficients

| | Coefficient | Std Err | t-value | p-value | Lower limit | Upper limit |
|---|---|---|---|---|---|---|
| Constant | 19589.4707 | 2862.6377 | 6.8432 | 0.0000 | 13799.2451 | 25379.6963 |
| Years Previous Experience | -106.5479 | 213.0790 | -0.5000 | 0.6199 | -537.5405 | 324.4447 |
| Years Employed | 621.0566 | 125.4148 | 4.9520 | 0.0000 | 367.3814 | 874.7318 |
| Years Education | 1631.8308 | 362.7565 | 4.4984 | 0.0001 | 898.0872 | 2365.5744 |
| Gender | -1654.0746 | 1558.1146 | -1.0616 | 0.2950 | -4805.6558 | 1497.5066 |
| Department | 2134.2893 | 624.7683 | 3.4161 | 0.0015 | 870.5774 | 3398.0013 |
| Number Supervised | 134.0143 | 88.1399 | 1.5205 | 0.1365 | -44.2653 | 312.2939 |

Analyzing this output shows that three variables (Years Previous Experience, Gender, and Number Supervised) have a high p-value. The p-value is the probability of obtaining a coefficient estimate at least this far from zero if the true coefficient were 0 (that is, if there were no relationship). Keeping a variable only when its p-value is below .05 limits the chance of a Type I error to 5%. If the p-value is greater than .05, we do not want to use the variable as a predictor. I will start by removing the variable with the highest p-value, Years Previous Experience, and recalculate the results.
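The selection rule used at each step (drop the predictor with the largest p-value above .05, then refit) can be sketched as follows. The p-values are copied from the full-model output; the function name is hypothetical, and the sketch only picks the next variable to drop. It does not refit the regression, which has to be rerun after every removal because the remaining p-values change.

```python
ALPHA = 0.05  # significance level used in the analysis

# p-values of the predictors in the full model, copied from the output above.
P_VALUES_FULL_MODEL = {
    "Years Previous Experience": 0.6199,
    "Years Employed": 0.0000,
    "Years Education": 0.0001,
    "Gender": 0.2950,
    "Department": 0.0015,
    "Number Supervised": 0.1365,
}


def next_variable_to_drop(p_values, alpha=ALPHA):
    """Return the predictor with the largest p-value above alpha, or None."""
    candidates = {name: p for name, p in p_values.items() if p > alpha}
    if not candidates:
        return None  # every remaining predictor is significant at alpha
    return max(candidates, key=candidates.get)


print(next_variable_to_drop(P_VALUES_FULL_MODEL))
```

Removing one variable at a time, rather than all three at once, matters because a variable that looks insignificant alongside a correlated predictor can become significant once that predictor is gone.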
Results of multiple regression for Salary

| Summary measures | Value |
|---|---|
| Multiple R | 0.9045 |
| R-Square | 0.8181 |
| Adj R-Square | 0.7954 |
| StErr of Est | 4975.0615 |

ANOVA Table

| Source | df | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Explained | 5 | 4454201866.9565 | 890840373.3913 | 35.9917 | 0.0000 |
| Unexplained | 40 | 990049536.0000 | 24751238.4000 | | |

Regression coefficients

| | Coefficient | Std Err | t-value | p-value | Lower limit | Upper limit |
|---|---|---|---|---|---|---|
| Constant | 18888.0508 | 2471.8997 | 7.6411 | 0.0000 | 13892.1572 | 23883.9443 |
| Years Employed | 624.4838 | 124.0479 | 5.0342 | 0.0000 | 373.7737 | 875.1939 |
| Years Education | 1637.4521 | 359.1672 | 4.5590 | 0.0000 | 911.5485 | 2363.3558 |
| Gender | -1488.3057 | 1508.0995 | -0.9869 | 0.3296 | -4536.2872 | 1559.6759 |
| Department | 2178.1035 | 612.7670 | 3.5545 | 0.0010 | 939.6557 | 3416.5514 |
| Number Supervised | 123.9110 | 84.9847 | 1.4580 | 0.1526 | -47.8494 | 295.6714 |

There are still two variables with unacceptable p-values. I will rerun the regression removing Gender.
Results of multiple regression for Salary

| Summary measures | Value |
|---|---|
| Multiple R | 0.9021 |
| R-Square | 0.8137 |
| Adj R-Square | 0.7955 |
| StErr of Est | 4973.4790 |

ANOVA Table

| Source | df | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Explained | 4 | 4430096074.9565 | 1107524018.7391 | 44.7747 | 0.0000 |
| Unexplained | 41 | 1014155328.0000 | 24735495.8049 | | |

Regression coefficients

| | Coefficient | Std Err | t-value | p-value | Lower limit | Upper limit |
|---|---|---|---|---|---|---|
| Constant | 17945.7285 | 2279.3054 | 7.8733 | 0.0000 | 13342.5753 | 22548.8817 |
| Years Employed | 639.1730 | 123.1125 | 5.1918 | 0.0000 | 390.5422 | 887.8039 |
| Years Education | 1665.0995 | 357.9590 | 4.6516 | 0.0000 | 942.1862 | 2388.0128 |
| Department | 2156.3027 | 612.1739 | 3.5224 | 0.0011 | 919.9918 | 3392.6137 |
| Number Supervised | 124.0667 | 84.9575 | 1.4603 | 0.1518 | -47.5086 | 295.6419 |

There is still a variable with an unacceptable p-value. I will rerun the regression removing Number Supervised.
Results of multiple regression for Salary

| Summary measures | Value |
|---|---|
| Multiple R | 0.8967 |
| R-Square | 0.8040 |
| Adj R-Square | 0.7900 |
| StErr of Est | 5040.0913 |

ANOVA Table

| Source | df | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Explained | 3 | 4377345546.9565 | 1459115182.3188 | 57.4398 | 0.0000 |
| Unexplained | 42 | 1066905856.0000 | 25402520.3810 | | |

Regression coefficients

| | Coefficient | Std Err | t-value | p-value | Lower limit | Upper limit |
|---|---|---|---|---|---|---|
| Constant | 17168.4492 | 2245.9719 | 7.6441 | 0.0000 | 12635.8929 | 21701.0055 |
| Years Employed | 648.1660 | 124.6052 | 5.2018 | 0.0000 | 396.7023 | 899.6296 |
| Years Education | 1871.2850 | 333.3433 | 5.6137 | 0.0000 | 1198.5709 | 2543.9992 |
| Department | 2278.0393 | 614.5943 | 3.7066 | 0.0006 | 1037.7374 | 3518.3412 |

All of the p-values are now acceptable, and the R-square is still 80%, only about 1.5 percentage points lower than in the first regression with all the original variables included.
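As a cross-check, the summary measures can be recomputed from the ANOVA rows of the first and last regressions. This is a minimal sketch using the standard formulas (R-square = explained SS / total SS, with the adjusted version based on mean squares); the function name and dictionary keys are hypothetical, and the sums of squares and degrees of freedom are copied from the outputs.

```python
import math


def summary_measures(ss_explained, ss_unexplained, df_explained, df_unexplained):
    """Recompute R-square, adjusted R-square, F and StErr of Est from ANOVA rows."""
    ss_total = ss_explained + ss_unexplained
    df_total = df_explained + df_unexplained
    ms_unexplained = ss_unexplained / df_unexplained
    return {
        "r_square": ss_explained / ss_total,
        "adj_r_square": 1 - ms_unexplained / (ss_total / df_total),
        "f": (ss_explained / df_explained) / ms_unexplained,
        "st_err_of_est": math.sqrt(ms_unexplained),
    }


# First regression (six predictors) and final regression (three predictors).
full_model = summary_measures(4460508938.9565, 983742464.0, 6, 39)
final_model = summary_measures(4377345546.9565, 1066905856.0, 3, 42)

drop = full_model["r_square"] - final_model["r_square"]
print(f"R-square: {full_model['r_square']:.4f} -> {final_model['r_square']:.4f}")
print(f"Drop in R-square: {drop * 100:.1f} percentage points")
print(f"Adj R-square: {full_model['adj_r_square']:.4f} -> {final_model['adj_r_square']:.4f}")
```

The adjusted R-square barely moves between the two models, which is consistent with the three dropped variables adding little explanatory power.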

The regression equation is now identified and can be used to predict salaries. The manager also knows which variables are relevant to Salary, which saves time and money by avoiding further collection of information on variables that are not relevant.
Salary = 17168 + 648 (Years Employed) + 1871 (Years Education) + 2278 (Department)
According to the statistics provided, the model explains 80% of the variance in Salary and has a standard error of estimate of about 5040. The p-values of the independent variables are very low, indicating that all of the variables are valid predictors of Salary.
We will now use test data to see if the model makes sense and compare to actual values.
Test 1 – 20 years employed, 4 years education, department 4.
17168 + 648(20) + 1871(4) + 2278(4) = $46,724
Actual Data - 20 years employed, 4 years education, department 4.
Salary = $46,184
This comparison shows a difference of only $540 between the predicted and actual salary, which suggests that this could be a valid model for predicting salary.
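The test calculation can be reproduced with a small Python function. This is a sketch that uses the rounded coefficients from the final equation; the function and variable names are hypothetical, and Department is treated as a numeric code, as it is in the regression.

```python
# Rounded coefficients from the final regression equation.
CONSTANT = 17168
COEF_YEARS_EMPLOYED = 648
COEF_YEARS_EDUCATION = 1871
COEF_DEPARTMENT = 2278


def predict_salary(years_employed, years_education, department):
    """Predicted annual salary from the final three-variable model."""
    return (
        CONSTANT
        + COEF_YEARS_EMPLOYED * years_employed
        + COEF_YEARS_EDUCATION * years_education
        + COEF_DEPARTMENT * department
    )


predicted = predict_salary(years_employed=20, years_education=4, department=4)
actual = 46184  # actual salary of the matching employee in the sample data
difference = predicted - actual

print(predicted)   # 46724
print(difference)  # 540
print(f"{difference / actual:.1%} of the actual salary")
```

A single matching case is only a spot check. Running the same comparison over every employee in the sample would give a fuller picture of how far predictions typically fall from actual salaries.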