My Blog: Assignment

1 Overview

Research on the factors that influence student achievement has long been a popular topic. Because student information is complex, the influence of school and family factors is uncertain, and subjects differ from one another, it is difficult to analyze quantitatively what drives students' performance.

2 Research Method

2.1 Decision Tree

     A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
     Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.
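
     To make the "conditional control statements" point concrete, here is a minimal sketch of a tiny tree written as nested if/else in R. The function name predict_band and the split thresholds are hypothetical and chosen only for illustration; they are not results from the data.

```r
# Hypothetical two-level decision tree written as plain conditionals.
# The thresholds are illustrative assumptions, not fitted values.
predict_band <- function(failures, goout) {
  if (failures == 0) {
    # first split passed: look at the second variable
    if (goout <= 3) "higher" else "middle"
  } else {
    "lower"
  }
}

predict_band(failures = 0, goout = 2)
predict_band(failures = 2, goout = 5)
```

     Each path from the root to a leaf is one chain of conditions, which is why a fitted tree can be read directly as a set of rules.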

2.2 Logistic Regression

     Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).
     Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail, which is represented by an indicator variable whose two values are labeled "0" and "1". In the logistic model, the log-odds (the logarithm of the odds) for the value labeled "1" is a linear combination of one or more independent variables ("predictors"); the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names.
     Analogous models with a different sigmoid function instead of the logistic function can also be used, such as the probit model. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the odds ratio.
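
     The relation between log-odds, probability and odds can be sketched in a few lines of R. The coefficients b0 and b1 below are assumed values for a hypothetical one-predictor model, not estimates from the student data.

```r
# Logistic function: converts log-odds to a probability.
logistic <- function(log_odds) 1 / (1 + exp(-log_odds))

# Hypothetical coefficients for a one-predictor model (assumed values).
b0 <- -1.0  # intercept, in logits
b1 <- 0.5   # change in log-odds per unit of x

x <- c(0, 2, 4)
log_odds <- b0 + b1 * x     # linear combination of the predictor
prob <- logistic(log_odds)  # probability of the value labeled "1"

# Raising x by one unit multiplies the odds by exp(b1), whatever x is.
odds <- prob / (1 - prob)
odds_ratio_per_unit <- exp(b1)
```

     Base R already provides the same function as plogis(), so the hand-written logistic() is only there to show the formula.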

3 Step by Step Visualization

3.1 Install and load R packages

     1) 'party' contains 'ctree', which we use to fit and draw the decision tree.
     2) 'corrplot' helps us visualize the correlations between the variables.
     3) 'glm', which we use for logistic regression, is part of the built-in 'stats' package, so it needs no extra installation ('Rcpp' is not required for it).
     4) 'psych' contains 'corr.test', which gives us the probability values (p-values) of the correlations.

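# plyr is listed before dplyr so that it does not mask dplyr functions
packages = c('plyr', 'readr', 'dplyr', 'tidyverse', 'party', 'corrplot', 'Rcpp', 'psych')
for(p in packages){
  # install the package only if it cannot be loaded
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}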

3.2 Load the data

data  <- read.csv('data/student-mat.csv')

3.3 Data Wrangling

3.3.1 Change Data Type

     Before running the decision tree or the logistic regression, we run a correlation analysis to select the variables that are actually associated with students' grades.
     For the correlation analysis, we need to convert some variables to numeric type and remove the variables we will not use.

# Bin the final grade G3 into four bands: 1 (G3 <= 5), 2 (6 to 10), 3 (11 to 15), 4 (G3 > 15)
data$Grade <- ifelse(data$G3 > 15, 4, ifelse(data$G3 > 10, 3, ifelse(data$G3 > 5, 2, 1)))

# Recode the yes/no columns as 1/0
data$Schoolsup <- ifelse(data$schoolsup == 'yes', 1, 0)
data$Famsup <- ifelse(data$famsup == 'yes', 1, 0)
data$Paid <- ifelse(data$paid == 'yes', 1, 0)
data$Activities <- ifelse(data$activities == 'yes', 1, 0)
data$Nursery <- ifelse(data$nursery == 'yes', 1, 0)
data$Higher <- ifelse(data$higher == 'yes', 1, 0)
data$Internet <- ifelse(data$internet == 'yes', 1, 0)
data$Romantic <- ifelse(data$romantic == 'yes', 1, 0)

# Recode sex as 1 = female, 0 = male
data$Gender <- ifelse(data$sex == 'F', 1, 0)

# Drop the unused columns by position, leaving the 20 numeric columns shown below
data <- data[, -c(1:12)]
data <- data[, -c(4:11)]
data <- data[, -c(11:13)]
str(data)
str(data)

'data.frame':   395 obs. of  20 variables:
 $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
 $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
 $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
 $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
 $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
 $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
 $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
 $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
 $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
 $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
 $ Grade     : num  2 2 2 3 2 3 3 2 4 3 ...
 $ Schoolsup : num  1 0 1 0 0 0 0 1 0 0 ...
 $ Famsup    : num  0 1 0 1 1 1 0 1 1 1 ...
 $ Paid      : num  0 0 1 1 1 1 0 0 1 1 ...
 $ Activities: num  0 0 0 1 0 1 0 0 0 1 ...
 $ Nursery   : num  1 0 1 1 1 1 1 1 1 1 ...
 $ Higher    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Internet  : num  0 1 1 1 0 1 1 0 1 1 ...
 $ Romantic  : num  0 0 0 1 0 0 0 0 0 0 ...
 $ Gender    : num  1 1 1 1 1 0 0 1 0 0 ...
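
     The same recoding can be written more compactly. The sketch below is one possible alternative, shown on a made-up toy data frame (the values are not from the real file); it uses cut() for the grade bands and recodes the yes/no columns in one pass. Unlike the code above, it overwrites the original columns instead of creating capitalized copies.

```r
# Toy data frame standing in for the real file (values are made up).
toy <- data.frame(
  G3        = c(0, 5, 6, 10, 11, 15, 16, 20),
  schoolsup = c("yes", "no", "no", "yes", "no", "no", "yes", "no"),
  internet  = c("no", "yes", "yes", "yes", "no", "yes", "yes", "no")
)

# cut() with right-closed intervals gives the same bands as the nested
# ifelse(): (-Inf,5] -> 1, (5,10] -> 2, (10,15] -> 3, (15,Inf) -> 4.
toy$Grade <- as.numeric(cut(toy$G3, breaks = c(-Inf, 5, 10, 15, Inf), labels = FALSE))

# Recode every listed yes/no column to 1/0 in one pass.
yes_no_cols <- c("schoolsup", "internet")
toy[yes_no_cols] <- lapply(toy[yes_no_cols], function(col) as.numeric(col == "yes"))
```

     Listing the columns by name also avoids repeating one ifelse() line per variable.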

3.3.2 Correlation Analysis

a <- cor(data)
corrplot(a)

    As we can see in the correlation plot, several variables, such as 'traveltime' and 'studytime', are correlated with Grade, although the correlations are modest. We still need to check the probability values to see whether these correlations are significant.

corr.test(data, use = 'complete')

Call:corr.test(x = data, use = "complete")
Correlation matrix 
           traveltime studytime failures famrel freetime goout  Dalc
traveltime       1.00     -0.10     0.09  -0.02    -0.02  0.03  0.14
studytime       -0.10      1.00    -0.17   0.04    -0.14 -0.06 -0.20
failures         0.09     -0.17     1.00  -0.04     0.09  0.12  0.14
famrel          -0.02      0.04    -0.04   1.00     0.15  0.06 -0.08
freetime        -0.02     -0.14     0.09   0.15     1.00  0.29  0.21
goout            0.03     -0.06     0.12   0.06     0.29  1.00  0.27
Dalc             0.14     -0.20     0.14  -0.08     0.21  0.27  1.00
Walc             0.13     -0.25     0.14  -0.11     0.15  0.42  0.65
health           0.01     -0.08     0.07   0.09     0.08 -0.01  0.08
absences        -0.01     -0.06     0.06  -0.04    -0.06  0.04  0.11
Grade           -0.13      0.11    -0.37   0.03     0.00 -0.14 -0.07
Schoolsup       -0.01      0.04     0.00   0.00    -0.05 -0.04 -0.02
Famsup           0.00      0.15    -0.06  -0.02     0.01 -0.02 -0.03
Paid            -0.07      0.17    -0.19   0.00    -0.06  0.01  0.06
Activities      -0.01      0.09    -0.07   0.04     0.09  0.05 -0.07
Nursery         -0.03      0.08    -0.10   0.00    -0.02  0.00 -0.08
Higher          -0.08      0.18    -0.30   0.02    -0.06 -0.04 -0.07
Internet        -0.11      0.06    -0.06   0.03     0.05  0.07  0.04
Romantic         0.02      0.05     0.09  -0.06    -0.01  0.01  0.02
Gender          -0.06      0.31    -0.04  -0.06    -0.24 -0.08 -0.27
            Walc health absences Grade Schoolsup Famsup  Paid
traveltime  0.13   0.01    -0.01 -0.13     -0.01   0.00 -0.07
studytime  -0.25  -0.08    -0.06  0.11      0.04   0.15  0.17
failures    0.14   0.07     0.06 -0.37      0.00  -0.06 -0.19
famrel     -0.11   0.09    -0.04  0.03      0.00  -0.02  0.00
freetime    0.15   0.08    -0.06  0.00     -0.05   0.01 -0.06
goout       0.42  -0.01     0.04 -0.14     -0.04  -0.02  0.01
Dalc        0.65   0.08     0.11 -0.07     -0.02  -0.03  0.06
Walc        1.00   0.09     0.14 -0.10     -0.09  -0.09  0.06
health      0.09   1.00    -0.03 -0.03     -0.03   0.03 -0.08
absences    0.14  -0.03     1.00  0.01      0.02   0.02  0.01
Grade      -0.10  -0.03     0.01  1.00     -0.09  -0.02  0.06
Schoolsup  -0.09  -0.03     0.02 -0.09      1.00   0.10 -0.02
Famsup     -0.09   0.03     0.02 -0.02      0.10   1.00  0.29
Paid        0.06  -0.08     0.01  0.06     -0.02   0.29  1.00
Activities -0.04   0.02    -0.01  0.00      0.05   0.00 -0.02
Nursery    -0.10  -0.02     0.02  0.03      0.05   0.06  0.10
Higher     -0.10  -0.02    -0.06  0.17      0.05   0.10  0.19
Internet    0.01  -0.08     0.10  0.10     -0.01   0.10  0.15
Romantic   -0.01   0.03     0.15 -0.08     -0.08   0.01  0.01
Gender     -0.27  -0.14     0.07 -0.09      0.14   0.15  0.13
           Activities Nursery Higher Internet Romantic Gender
traveltime      -0.01   -0.03  -0.08    -0.11     0.02  -0.06
studytime        0.09    0.08   0.18     0.06     0.05   0.31
failures        -0.07   -0.10  -0.30    -0.06     0.09  -0.04
famrel           0.04    0.00   0.02     0.03    -0.06  -0.06
freetime         0.09   -0.02  -0.06     0.05    -0.01  -0.24
goout            0.05    0.00  -0.04     0.07     0.01  -0.08
Dalc            -0.07   -0.08  -0.07     0.04     0.02  -0.27
Walc            -0.04   -0.10  -0.10     0.01    -0.01  -0.27
health           0.02   -0.02  -0.02    -0.08     0.03  -0.14
absences        -0.01    0.02  -0.06     0.10     0.15   0.07
Grade            0.00    0.03   0.17     0.10    -0.08  -0.09
Schoolsup        0.05    0.05   0.05    -0.01    -0.08   0.14
Famsup           0.00    0.06   0.10     0.10     0.01   0.15
Paid            -0.02    0.10   0.19     0.15     0.01   0.13
Activities       1.00    0.00   0.10     0.05     0.02  -0.10
Nursery          0.00    1.00   0.05     0.01     0.03   0.01
Higher           0.10    0.05   1.00     0.02    -0.11   0.15
Internet         0.05    0.01   0.02     1.00     0.09  -0.04
Romantic         0.02    0.03  -0.11     0.09     1.00   0.10
Gender          -0.10    0.01   0.15    -0.04     0.10   1.00
Sample Size 
[1] 395
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
           traveltime studytime failures famrel freetime goout Dalc
traveltime       0.00      1.00     1.00   1.00     1.00  1.00 0.94
studytime        0.05      0.00     0.09   1.00     0.70  1.00 0.02
failures         0.07      0.00     0.00   1.00     1.00  1.00 1.00
famrel           0.74      0.43     0.38   0.00     0.44  1.00 1.00
freetime         0.74      0.00     0.07   0.00     0.00  0.00 0.01
goout            0.57      0.21     0.01   0.20     0.00  0.00 0.00
Dalc             0.01      0.00     0.01   0.12     0.00  0.00 0.00
Walc             0.01      0.00     0.00   0.02     0.00  0.00 0.00
health           0.88      0.13     0.19   0.06     0.13  0.85 0.13
absences         0.80      0.21     0.21   0.38     0.25  0.38 0.03
Grade            0.01      0.03     0.00   0.53     0.94  0.00 0.14
Schoolsup        0.85      0.45     0.99   0.98     0.37  0.45 0.67
Famsup           0.95      0.00     0.27   0.69     0.83  0.76 0.53
Paid             0.19      0.00     0.00   0.99     0.20  0.84 0.22
Activities       0.88      0.07     0.17   0.42     0.07  0.36 0.19
Nursery          0.51      0.11     0.05   0.94     0.62  0.93 0.09
Higher           0.10      0.00     0.00   0.63     0.22  0.43 0.17
Internet         0.03      0.24     0.21   0.52     0.31  0.14 0.47
Romantic         0.66      0.29     0.06   0.21     0.82  0.88 0.76
Gender           0.24      0.00     0.38   0.24     0.00  0.13 0.00
           Walc health absences Grade Schoolsup Famsup Paid
traveltime 1.00   1.00     1.00  1.00      1.00   1.00 1.00
studytime  0.00   1.00     1.00  1.00      1.00   0.63 0.15
failures   0.75   1.00     1.00  0.00      1.00   1.00 0.03
famrel     1.00   1.00     1.00  1.00      1.00   1.00 1.00
freetime   0.53   1.00     1.00  1.00      1.00   1.00 1.00
goout      0.00   1.00     1.00  0.70      1.00   1.00 1.00
Dalc       0.00   1.00     1.00  1.00      1.00   1.00 1.00
Walc       0.00   1.00     1.00  1.00      1.00   1.00 1.00
health     0.07   0.00     1.00  1.00      1.00   1.00 1.00
absences   0.01   0.55     0.00  1.00      1.00   1.00 1.00
Grade      0.04   0.55     0.91  0.00      1.00   1.00 1.00
Schoolsup  0.08   0.50     0.66  0.06      0.00   1.00 1.00
Famsup     0.09   0.56     0.63  0.68      0.04   0.00 0.00
Paid       0.23   0.12     0.88  0.27      0.68   0.00 0.00
Activities 0.46   0.64     0.79  0.93      0.36   0.98 0.67
Nursery    0.05   0.71     0.70  0.59      0.36   0.24 0.04
Higher     0.05   0.75     0.27  0.00      0.28   0.05 0.00
Internet   0.82   0.11     0.04  0.05      0.85   0.04 0.00
Romantic   0.84   0.60     0.00  0.10      0.11   0.81 0.91
Gender     0.00   0.00     0.18  0.07      0.01   0.00 0.01
           Activities Nursery Higher Internet Romantic Gender
traveltime       1.00    1.00   1.00     1.00     1.00   1.00
studytime        1.00    1.00   0.08     1.00     1.00   0.00
failures         1.00    1.00   0.00     1.00     1.00   1.00
famrel           1.00    1.00   1.00     1.00     1.00   1.00
freetime         1.00    1.00   1.00     1.00     1.00   0.00
goout            1.00    1.00   1.00     1.00     1.00   1.00
Dalc             1.00    1.00   1.00     1.00     1.00   0.00
Walc             1.00    1.00   1.00     1.00     1.00   0.00
health           1.00    1.00   1.00     1.00     1.00   0.69
absences         1.00    1.00   1.00     1.00     0.38   1.00
Grade            1.00    1.00   0.11     1.00     1.00   1.00
Schoolsup        1.00    1.00   1.00     1.00     1.00   0.94
Famsup           1.00    1.00   1.00     1.00     1.00   0.42
Paid             1.00    1.00   0.03     0.38     1.00   1.00
Activities       0.00    1.00   1.00     1.00     1.00   1.00
Nursery          0.96    0.00   1.00     1.00     1.00   1.00
Higher           0.06    0.28   0.00     1.00     1.00   0.44
Internet         0.33    0.88   0.69     0.00     1.00   1.00
Romantic         0.70    0.59   0.04     0.08     0.00   1.00
Gender           0.05    0.87   0.00     0.38     0.04   0.00

 To see confidence intervals of the correlations, print with the short=FALSE option

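   Using the unadjusted probability values (the entries below the diagonal) with alpha set to 0.05, we choose 'traveltime', 'studytime', 'failures', 'goout' and 'Walc' to study their influence on students' grades. Note that after the adjustment for multiple tests (the entries above the diagonal) only 'failures' remains significant, and that 'Higher' also has an unadjusted probability value below 0.05 but is not carried forward here.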
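
   The selection step can also be done in code instead of reading the table by eye. The sketch below uses only base R on synthetic data (the data frame sim and its columns are made up for illustration). It tests each predictor against Grade and then applies the Holm adjustment, which is the default adjustment in corr.test; note that corr.test adjusts across all pairs in the matrix, while this sketch adjusts only across the predictors tested.

```r
set.seed(1)  # synthetic data, for illustration only
n <- 200
sim <- data.frame(failures = rpois(n, 0.4), noise = rnorm(n))
sim$Grade <- 3 - 0.8 * sim$failures + rnorm(n)

predictors <- setdiff(names(sim), "Grade")

# One Pearson test per predictor against Grade (unadjusted p-values).
p_raw <- sapply(predictors, function(v) cor.test(sim[[v]], sim$Grade)$p.value)

# Holm adjustment for multiple tests.
p_holm <- p.adjust(p_raw, method = "holm")

# Keep the predictors that are still significant at alpha = 0.05.
selected <- names(p_holm)[p_holm < 0.05]
```
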

3.4 Decision Tree

tree <- ctree(Grade~ traveltime + studytime + failures  + Walc  + goout,
              data = data)
plot(tree)

   From the tree above we can see that only 'failures' and 'goout' are used to divide the students into four groups. Students with no failures clearly have higher grades, especially those who go out less. Those with more than one failure have the lowest grades.

3.5 Logistic Regression

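     To run a logistic regression we need to convert the response variable 'Grade' to a factor. Note that glm() with family = binomial treats the first level of a factor response as failure and all other levels as success, so with four grade bands the model below compares Grade 1 against Grades 2 to 4.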

data$Grade <- factor(data$Grade)
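
     How a binomial glm() handles a factor response with more than two levels can be checked on synthetic data. In the sketch below, the variables x and band are made up for illustration; the second model spells out the response as "not in the first level", which is how the binomial family reads a factor.

```r
set.seed(42)  # synthetic data, for illustration only
x <- rnorm(300)
band <- factor(sample(1:4, 300, replace = TRUE))

# Binomial glm on a four-level factor response.
fit_factor <- glm(band ~ x, family = binomial)

# The same model with the response written as "not in the first level".
fit_binary <- glm(as.integer(band != levels(band)[1]) ~ x, family = binomial)

# Compare the two sets of coefficients.
all.equal(coef(fit_factor), coef(fit_binary))
```

     The two fits should give the same coefficients, so the model is a binary comparison of the first band against all the others rather than a four-class model.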

     Then we use the glm() function to specify the formula and fit the model.

mylogit <- glm(formula = Grade ~ traveltime + studytime + failures  + Walc  + goout,data = data, family = binomial)
print(summary(mylogit))


Call:
glm(formula = Grade ~ traveltime + studytime + failures + Walc + 
    goout, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5360   0.3275   0.3915   0.4690   1.2929  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.5975     0.8069   3.219  0.00129 ** 
traveltime   -0.1130     0.2354  -0.480  0.63118    
studytime     0.1287     0.2218   0.580  0.56189    
failures     -0.8716     0.1730  -5.037 4.72e-07 ***
Walc          0.3447     0.1572   2.192  0.02838 *  
goout        -0.3006     0.1650  -1.821  0.06853 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284.24  on 394  degrees of freedom
Residual deviance: 251.61  on 389  degrees of freedom
AIC: 263.61

Number of Fisher Scoring iterations: 5

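   The coefficients are on the log-odds scale. Because glm() with family = binomial treats the first level of a factor response as failure and all other levels as success, the model describes the odds of a student being above the lowest grade band (Grade > 1). Each additional past failure lowers those log-odds by 0.8716 (the odds are multiplied by about 0.42), and each one-unit increase in Walc raises them by 0.3447 (the odds are multiplied by about 1.41). Both effects are significant at alpha = 0.05, while 'traveltime', 'studytime' and 'goout' are not.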
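
   Coefficients of a logistic model are easier to read as odds ratios. A minimal sketch, with the two significant estimates copied by hand from the summary output above:

```r
# Estimates copied from the summary output above (log-odds scale).
est <- c(failures = -0.8716, Walc = 0.3447)

# exp() turns a change in log-odds into a multiplicative change in the odds.
odds_ratio <- exp(est)
round(odds_ratio, 2)
```

   With the fitted model object at hand, exp(coef(mylogit)) gives the same conversion for all coefficients at once.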

4 Conclusion

  Compared with the logistic regression model, the result of the decision tree model is more visual and lets the reader see the classification at a glance.
  Both models show that 'failures' is the major factor influencing student grades, which is a reminder for students to concentrate on their studies!