Project 2 – Regression Analysis

 

 

Define Objectives.

 

The HR manager of a GTA division wants to predict the annual salaries of given employees.

 

Variables.

 

 

The sample data used for this problem is available here.

 

 

Before running a regression analysis, a correlation analysis is run to determine whether any of the explanatory variables are highly correlated:

 

 

| | Salary | Years Previous Experience | Years Employed | Years Education | Gender | Department | Number Supervised |
|---|---|---|---|---|---|---|---|
| Salary | 1 | | | | | | |
| Years Previous Experience | 0.029354 | 1 | | | | | |
| Years Employed | 0.765174 | 0.031276694 | 1 | | | | |
| Years Education | 0.776991 | 0.080168828 | 0.6074856 | 1 | | | |
| Gender | -0.24868 | -0.21714487 | -0.209393 | -0.1926921 | 1 | | |
| Department | 0.337743 | -0.10550455 | 0.093467 | 0.08571375 | 0.012456 | 1 | |
| Number Supervised | 0.523925 | 0.216198121 | 0.3454442 | 0.50460899 | -0.10034 | 0.162588075 | 1 |

 

 

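None of the independent variables appears to be highly correlated with another; the largest correlation between two explanatory variables is about 0.61 (Years Employed and Years Education). A high correlation would suggest multicollinearity, in which case one of the variables would need to be removed before running the regression analysis. No variable needs to be removed at this stage.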
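
The screening step can be sketched in Python. The snippet below is only an illustration: it hard-codes the correlations between explanatory variables from the matrix above and flags any pair whose absolute correlation exceeds a cutoff. The cutoff of 0.8 and the names `CORR`, `THRESHOLD`, `flag_collinear_pairs` and `strongest_pair` are assumptions introduced here, not part of the original analysis.

```python
# Pairwise correlations between explanatory variables, copied from the
# matrix above (Salary is the response, so its row is left out).
CORR = {
    ("Years Previous Experience", "Years Employed"): 0.031276694,
    ("Years Previous Experience", "Years Education"): 0.080168828,
    ("Years Employed", "Years Education"): 0.6074856,
    ("Years Previous Experience", "Gender"): -0.21714487,
    ("Years Employed", "Gender"): -0.209393,
    ("Years Education", "Gender"): -0.1926921,
    ("Years Previous Experience", "Department"): -0.10550455,
    ("Years Employed", "Department"): 0.093467,
    ("Years Education", "Department"): 0.08571375,
    ("Gender", "Department"): 0.012456,
    ("Years Previous Experience", "Number Supervised"): 0.216198121,
    ("Years Employed", "Number Supervised"): 0.3454442,
    ("Years Education", "Number Supervised"): 0.50460899,
    ("Gender", "Number Supervised"): -0.10034,
    ("Department", "Number Supervised"): 0.162588075,
}

THRESHOLD = 0.8  # assumed rule-of-thumb cutoff, not stated in the report


def flag_collinear_pairs(corr, threshold=THRESHOLD):
    """Return the pairs whose absolute correlation exceeds the threshold."""
    return [pair for pair, r in corr.items() if abs(r) > threshold]


def strongest_pair(corr):
    """Return the (pair, r) entry with the largest absolute correlation."""
    return max(corr.items(), key=lambda item: abs(item[1]))


if __name__ == "__main__":
    print("Flagged pairs:", flag_collinear_pairs(CORR))
    print("Strongest pair:", strongest_pair(CORR))
```

With a stricter or looser cutoff the flagged list would change, so the threshold should be treated as a judgment call rather than a fixed rule.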

 

Initial Regression Analysis.

 

Results of multiple regression for Salary

Summary measures

| Measure | Value |
|---|---|
| Multiple R | 0.9052 |
| R-Square | 0.8193 |
| Adj R-Square | 0.7915 |
| StErr of Est | 5022.3667 |

ANOVA Table

| Source | df | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Explained | 6 | 4460508938.9565 | 743418156.4928 | 29.4725 | 0.0000 |
| Unexplained | 39 | 983742464.0000 | 25224165.7436 | | |

Regression coefficients

| | Coefficient | Std Err | t-value | p-value | Lower limit | Upper limit |
|---|---|---|---|---|---|---|
| Constant | 19589.4707 | 2862.6377 | 6.8432 | 0.0000 | 13799.2451 | 25379.6963 |
| Years Previous Experience | -106.5479 | 213.0790 | -0.5000 | 0.6199 | -537.5405 | 324.4447 |
| Years Employed | 621.0566 | 125.4148 | 4.9520 | 0.0000 | 367.3814 | 874.7318 |
| Years Education | 1631.8308 | 362.7565 | 4.4984 | 0.0001 | 898.0872 | 2365.5744 |
| Gender | -1654.0746 | 1558.1146 | -1.0616 | 0.2950 | -4805.6558 | 1497.5066 |
| Department | 2134.2893 | 624.7683 | 3.4161 | 0.0015 | 870.5774 | 3398.0013 |
| Number Supervised | 134.0143 | 88.1399 | 1.5205 | 0.1365 | -44.2653 | 312.2939 |

 

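The coefficient table shows that three variables have high p-values: Years Previous Experience (0.6199), Gender (0.2950) and Number Supervised (0.1365). The p-value is the probability of seeing a t-value at least this extreme if the true coefficient were 0, that is, if there were no relationship; it is therefore the risk of a Type I error we would accept by treating the variable as a predictor. If this value is greater than 0.05, we do not want to use the variable as a predictor. I will start by removing the variable with the highest p-value, Years Previous Experience, and recalculate the results.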
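
As a rough check on the initial regression output, each t-value is the coefficient divided by its standard error, and a p-value above 0.05 corresponds to a |t| below the two-sided critical value. The Python sketch below recomputes this from the reported figures. The critical value is backed out of the reported limits for the constant, on the assumption that they are 95% limits; the names `COEFFS`, `T_CRIT`, `t_value` and `weak_predictors` are introduced here for illustration.

```python
# (coefficient, standard error) pairs copied from the initial regression output.
COEFFS = {
    "Years Previous Experience": (-106.5479, 213.0790),
    "Years Employed": (621.0566, 125.4148),
    "Years Education": (1631.8308, 362.7565),
    "Gender": (-1654.0746, 1558.1146),
    "Department": (2134.2893, 624.7683),
    "Number Supervised": (134.0143, 88.1399),
}

# Constant: coefficient, standard error and reported upper limit.
CONST_COEF, CONST_SE, CONST_UPPER = 19589.4707, 2862.6377, 25379.6963

# Half-width of the interval divided by the standard error gives the
# critical t for 39 degrees of freedom, assuming the limits are 95% limits.
T_CRIT = (CONST_UPPER - CONST_COEF) / CONST_SE


def t_value(coef, se):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return coef / se


def weak_predictors(coeffs, t_crit=T_CRIT):
    """Variables whose |t| falls below the critical value, weakest first."""
    weak = [(abs(t_value(c, se)), name) for name, (c, se) in coeffs.items()
            if abs(t_value(c, se)) < t_crit]
    return [name for _, name in sorted(weak)]


if __name__ == "__main__":
    for name, (coef, se) in COEFFS.items():
        print(f"{name}: t = {t_value(coef, se):.4f}")
    print("Not significant at 0.05:", weak_predictors(COEFFS))
```

The weakest variable by |t| is the same one that has the highest p-value, which is why the two ways of ranking candidates for removal agree.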

 

Second Regression Analysis

 

 

Results of multiple regression for Salary

Summary measures

| Measure | Value |
|---|---|
| Multiple R | 0.9045 |
| R-Square | 0.8181 |
| Adj R-Square | 0.7954 |
| StErr of Est | 4975.0615 |

ANOVA Table

| Source | df | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Explained | 5 | 4454201866.9565 | 890840373.3913 | 35.9917 | 0.0000 |
| Unexplained | 40 | 990049536.0000 | 24751238.4000 | | |

Regression coefficients

| | Coefficient | Std Err | t-value | p-value | Lower limit | Upper limit |
|---|---|---|---|---|---|---|
| Constant | 18888.0508 | 2471.8997 | 7.6411 | 0.0000 | 13892.1572 | 23883.9443 |
| Years Employed | 624.4838 | 124.0479 | 5.0342 | 0.0000 | 373.7737 | 875.1939 |
| Years Education | 1637.4521 | 359.1672 | 4.5590 | 0.0000 | 911.5485 | 2363.3558 |
| Gender | -1488.3057 | 1508.0995 | -0.9869 | 0.3296 | -4536.2872 | 1559.6759 |
| Department | 2178.1035 | 612.7670 | 3.5545 | 0.0010 | 939.6557 | 3416.5514 |
| Number Supervised | 123.9110 | 84.9847 | 1.4580 | 0.1526 | -47.8494 | 295.6714 |

 

 

 

 

 

 

 

 

 

Two variables still have p-values above 0.05: Gender (0.3296) and Number Supervised (0.1526). I will remove Gender, which has the higher p-value, and rerun the regression.

 

Third Regression Analysis

 

Results of multiple regression for Salary

Summary measures

| Measure | Value |
|---|---|
| Multiple R | 0.9021 |
| R-Square | 0.8137 |
| Adj R-Square | 0.7955 |
| StErr of Est | 4973.4790 |

ANOVA Table

| Source | df | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Explained | 4 | 4430096074.9565 | 1107524018.7391 | 44.7747 | 0.0000 |
| Unexplained | 41 | 1014155328.0000 | 24735495.8049 | | |

Regression coefficients

| | Coefficient | Std Err | t-value | p-value | Lower limit | Upper limit |
|---|---|---|---|---|---|---|
| Constant | 17945.7285 | 2279.3054 | 7.8733 | 0.0000 | 13342.5753 | 22548.8817 |
| Years Employed | 639.1730 | 123.1125 | 5.1918 | 0.0000 | 390.5422 | 887.8039 |
| Years Education | 1665.0995 | 357.9590 | 4.6516 | 0.0000 | 942.1862 | 2388.0128 |
| Department | 2156.3027 | 612.1739 | 3.5224 | 0.0011 | 919.9918 | 3392.6137 |
| Number Supervised | 124.0667 | 84.9575 | 1.4603 | 0.1518 | -47.5086 | 295.6419 |

 

 

One variable, Number Supervised, still has a p-value above 0.05 (0.1518). I will rerun the regression without it.

 

Fourth Regression Analysis

 

Results of multiple regression for Salary

Summary measures

| Measure | Value |
|---|---|
| Multiple R | 0.8967 |
| R-Square | 0.8040 |
| Adj R-Square | 0.7900 |
| StErr of Est | 5040.0913 |

ANOVA Table

| Source | df | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Explained | 3 | 4377345546.9565 | 1459115182.3188 | 57.4398 | 0.0000 |
| Unexplained | 42 | 1066905856.0000 | 25402520.3810 | | |

Regression coefficients

| | Coefficient | Std Err | t-value | p-value | Lower limit | Upper limit |
|---|---|---|---|---|---|---|
| Constant | 17168.4492 | 2245.9719 | 7.6441 | 0.0000 | 12635.8929 | 21701.0055 |
| Years Employed | 648.1660 | 124.6052 | 5.2018 | 0.0000 | 396.7023 | 899.6296 |
| Years Education | 1871.2850 | 333.3433 | 5.6137 | 0.0000 | 1198.5709 | 2543.9992 |
| Department | 2278.0393 | 614.5943 | 3.7066 | 0.0006 | 1037.7374 | 3518.3412 |

 

All of the p-values are now acceptable, and the R-square is still about 80% (0.8040), only about 1.5 percentage points lower than in the first regression (0.8193) with all the original variables included.
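
The four runs above follow a backward-elimination pattern: fit, drop the variable with the largest p-value above 0.05, then refit. A minimal Python sketch of that loop is shown below. Because the sample data is not reproduced in this report, the fitting step is replaced by a lookup of the p-values reported in each run; `REPORTED_P_VALUES`, `lookup_fit` and `backward_eliminate` are hypothetical names, and a real version would refit the regression at each step.

```python
ALPHA = 0.05  # significance threshold used in the report

# p-values reported for each run, keyed by the set of variables in the model.
REPORTED_P_VALUES = {
    frozenset({"Years Previous Experience", "Years Employed", "Years Education",
               "Gender", "Department", "Number Supervised"}): {
        "Years Previous Experience": 0.6199, "Years Employed": 0.0000,
        "Years Education": 0.0001, "Gender": 0.2950,
        "Department": 0.0015, "Number Supervised": 0.1365},
    frozenset({"Years Employed", "Years Education", "Gender",
               "Department", "Number Supervised"}): {
        "Years Employed": 0.0000, "Years Education": 0.0000, "Gender": 0.3296,
        "Department": 0.0010, "Number Supervised": 0.1526},
    frozenset({"Years Employed", "Years Education", "Department",
               "Number Supervised"}): {
        "Years Employed": 0.0000, "Years Education": 0.0000,
        "Department": 0.0011, "Number Supervised": 0.1518},
    frozenset({"Years Employed", "Years Education", "Department"}): {
        "Years Employed": 0.0000, "Years Education": 0.0000,
        "Department": 0.0006},
}


def lookup_fit(variables):
    """Stand-in for refitting the regression: return the reported p-values."""
    return REPORTED_P_VALUES[frozenset(variables)]


def backward_eliminate(variables, fit=lookup_fit, alpha=ALPHA):
    """Drop the least significant variable until every p-value is <= alpha."""
    kept, dropped = list(variables), []
    while kept:
        p_values = fit(kept)
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break
        kept.remove(worst)
        dropped.append(worst)
    return kept, dropped


if __name__ == "__main__":
    start = ["Years Previous Experience", "Years Employed", "Years Education",
             "Gender", "Department", "Number Supervised"]
    kept, dropped = backward_eliminate(start)
    print("Dropped, in order:", dropped)
    print("Kept:", kept)
```

Removing one variable at a time matters because the remaining p-values shift after each refit, as the small changes between runs show.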

 

Implement and Use

 

The regression equation has now been identified and can be used to predict salaries. The manager also knows which variables are relevant to Salary, which saves time and money because information on the irrelevant variables no longer needs to be collected.

 

Salary = 17168 + 648 × Years Employed + 1871 × Years Education + 2278 × Department

 

According to the statistics provided, the model explains 80% of the variance in Salary and has a standard error of 5040. The p-values of the independent variables are very low, indicating that all of the variables are valid predictors of Salary.
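
The summary figures quoted here can be recovered from the ANOVA table of the fourth regression. The Python sketch below does so using the standard definitions of R-square, adjusted R-square, the standard error of estimate and the F statistic; the function and variable names are illustrative only.

```python
import math

# Fourth regression ANOVA table: degrees of freedom and sums of squares.
DF_EXPLAINED, SS_EXPLAINED = 3, 4377345546.9565
DF_UNEXPLAINED, SS_UNEXPLAINED = 42, 1066905856.0


def summary_measures(df_exp, ss_exp, df_unexp, ss_unexp):
    """Recompute the regression summary measures from an ANOVA table."""
    ss_total = ss_exp + ss_unexp
    df_total = df_exp + df_unexp  # n - 1
    r_square = ss_exp / ss_total
    # Adjusted R-square penalises the fit for the number of predictors.
    adj_r_square = 1 - (ss_unexp / df_unexp) / (ss_total / df_total)
    ms_exp = ss_exp / df_exp
    ms_unexp = ss_unexp / df_unexp
    return {
        "r_square": r_square,
        "adj_r_square": adj_r_square,
        "std_err_of_est": math.sqrt(ms_unexp),
        "f": ms_exp / ms_unexp,
    }


if __name__ == "__main__":
    measures = summary_measures(DF_EXPLAINED, SS_EXPLAINED,
                                DF_UNEXPLAINED, SS_UNEXPLAINED)
    for name, value in measures.items():
        print(f"{name}: {value:.4f}")
```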

 

Monitor performance:

 

We will now use test data to check whether the model makes sense by comparing its prediction with an actual value.

 

Test 1 – 20 years employed, 4 years education, department 4.

17168 + 648(20) + 1871(4) + 2278(4) = $46,724

 

Actual Data - 20 years employed, 4 years education, department 4.

Salary = $46,184

 

The predicted and actual salaries differ by only $540 (about 1.2% of the actual salary), which suggests this could be a valid model for predicting salary.
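
The fitted equation and the check above can be written as a small Python function. This is a sketch using the rounded coefficients from the equation in the Implement and Use section; `predict_salary` and the constant names are introduced here, and Department is assumed to be coded numerically as in the sample data.

```python
# Rounded coefficients from the fourth regression.
INTERCEPT = 17168
COEF_YEARS_EMPLOYED = 648
COEF_YEARS_EDUCATION = 1871
COEF_DEPARTMENT = 2278


def predict_salary(years_employed, years_education, department):
    """Predicted annual salary in dollars from the fitted equation."""
    return (INTERCEPT
            + COEF_YEARS_EMPLOYED * years_employed
            + COEF_YEARS_EDUCATION * years_education
            + COEF_DEPARTMENT * department)


if __name__ == "__main__":
    predicted = predict_salary(20, 4, 4)
    actual = 46184  # actual salary reported for this employee
    error = predicted - actual
    print(f"Predicted: ${predicted:,}")
    print(f"Error: ${error:,} ({error / actual:.1%} of actual)")
```

Running the same comparison over more employees would give a better picture of how the model performs than a single test case.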