1 Overview
Factors influencing student achievement have long been an active research topic. Because student information is complex, school and family influences are uncertain, and subjects differ from one another, it is difficult to analyze these factors quantitatively.
2 Research Method
2.1 Decision Tree
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.
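As a minimal sketch of the "only conditional control statements" idea, a small tree can be written as nested if/else branches. The function name classify_student, the thresholds and the group labels are illustrative assumptions, not values fitted to any data.

```r
# Hypothetical two-split tree written as plain conditional statements.
# The variable names mirror the data set used later; the thresholds are
# illustrative assumptions, not fitted values.
classify_student <- function(failures, goout) {
  if (failures > 0) {
    "lower grade group"
  } else if (goout > 3) {
    "middle grade group"
  } else {
    "higher grade group"
  }
}

classify_student(failures = 0, goout = 2)
classify_student(failures = 2, goout = 4)
```

Each path from the first condition to a returned label corresponds to one root-to-leaf path in the drawn tree.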
2.2 Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression). Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail which is represented by an indicator variable, where the two values are labeled "0" and "1". In the logistic model, the log-odds (the logarithm of the odds) for the value labeled "1" is a linear combination of one or more independent variables ("predictors"); the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. Analogous models with a different sigmoid function instead of the logistic function can also be used, such as the probit model; the defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the odds ratio.
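The relationship between log-odds, probability and odds described above can be checked numerically. The sketch below uses base R only; the coefficients b0 and b1 are made-up values chosen for illustration.

```r
# Linear predictor (log-odds) for a hypothetical model with one predictor x.
# b0 and b1 are made-up coefficients for illustration only.
b0 <- -1.0
b1 <- 0.5
x  <- c(0, 1, 2, 3)

log_odds <- b0 + b1 * x
prob <- plogis(log_odds)          # logistic function: 1 / (1 + exp(-log_odds))
odds <- prob / (1 - prob)

# Raising x by one unit multiplies the odds by exp(b1), whatever the starting x.
odds_ratio <- odds[-1] / odds[-length(odds)]
all.equal(odds_ratio, rep(exp(b1), 3))

# qlogis() is the inverse (the logit): it recovers the log-odds from probabilities.
all.equal(qlogis(prob), log_odds)
```

The constant ratio between consecutive odds is the "multiplicative scaling at a constant rate" that defines the logistic model.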
3 Step by Step Visualization
3.1 Install and load R packages
1) 'party' contains 'ctree', which fits and plots the decision tree.
2) 'corrplot' visualizes the correlation matrix of the variables.
3) 'glm', used for logistic regression, comes from the 'stats' package that ships with R and is attached by default; 'Rcpp' is loaded below but does not provide it.
4) 'psych' contains 'corr.test', which reports the probability values (p-values) of the pairwise correlations.
packages <- c('readr', 'dplyr', 'tidyverse', 'plyr', 'party', 'corrplot', 'Rcpp', 'psych')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
3.2 Load the data
data <- read.csv('data/student-mat.csv')
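Some distributions of this data set use semicolons rather than commas as the field separator, in which case read.csv() with its default separator would read each row as a single column. A minimal sketch of a guard, assuming the header line reveals the separator; read_delimited is a hypothetical helper and the temporary file only stands in for the real one.

```r
# Read a delimited file after checking which separator its header line uses.
# read_delimited() is a hypothetical helper, not part of any package.
read_delimited <- function(path) {
  header <- readLines(path, n = 1)
  sep <- if (grepl(";", header, fixed = TRUE)) ";" else ","
  read.csv(path, sep = sep)
}

# Small self-contained demonstration with a temporary semicolon-separated file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("school;sex;G3", "GP;F;6", "GP;M;15"), tmp)
demo <- read_delimited(tmp)
dim(demo)
```

Checking dim() or str() right after loading is a quick way to confirm that the expected number of columns was parsed.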
3.3 Data Wrangling
3.3.1 Change Data Type
Before running the decision tree or logistic regression analysis, we carry out a correlation analysis to select the variables that actually influence students' grades.
For the correlation analysis, we need to convert some variables to numeric type and remove the variables that are not needed.
# Bin the final grade G3 into four levels: 1 (G3 <= 5), 2 (6-10), 3 (11-15), 4 (G3 > 15)
data$Grade <- ifelse(data$G3 > 15, 4, ifelse(data$G3 > 10, 3, ifelse(data$G3 > 5, 2, 1)))
# Recode the yes/no variables as 1/0
data$Schoolsup <- ifelse(data$schoolsup == 'yes', 1, 0)
data$Famsup <- ifelse(data$famsup == 'yes', 1, 0)
data$Paid <- ifelse(data$paid == 'yes', 1, 0)
data$Activities <- ifelse(data$activities == 'yes', 1, 0)
data$Nursery <- ifelse(data$nursery == 'yes', 1, 0)
data$Higher <- ifelse(data$higher == 'yes', 1, 0)
data$Internet <- ifelse(data$internet == 'yes', 1, 0)
data$Romantic <- ifelse(data$romantic == 'yes', 1, 0)
# Recode sex as 1 (female) / 0 (male)
data$Gender <- ifelse(data$sex == 'F', 1, 0)
# Drop the first 12 columns (background variables not used in the analysis)
data <- data[, -c(1:12)]
# Drop the original yes/no columns, now replaced by their 1/0 versions
data <- data[, -c(4:11)]
# Drop the raw grade columns G1, G2 and G3
data <- data[, -c(11:13)]
str(data)
'data.frame': 395 obs. of 20 variables:
$ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
$ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
$ failures : int 0 0 3 0 0 0 0 0 0 0 ...
$ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
$ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
$ goout : int 4 3 2 2 2 2 4 4 2 1 ...
$ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
$ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
$ health : int 3 3 3 5 5 5 3 1 1 5 ...
$ absences : int 6 4 10 2 4 10 0 6 0 0 ...
$ Grade : num 2 2 2 3 2 3 3 2 4 3 ...
$ Schoolsup : num 1 0 1 0 0 0 0 1 0 0 ...
$ Famsup : num 0 1 0 1 1 1 0 1 1 1 ...
$ Paid : num 0 0 1 1 1 1 0 0 1 1 ...
$ Activities: num 0 0 0 1 0 1 0 0 0 1 ...
$ Nursery : num 1 0 1 1 1 1 1 1 1 1 ...
$ Higher : num 1 1 1 1 1 1 1 1 1 1 ...
$ Internet : num 0 1 1 1 0 1 1 0 1 1 ...
$ Romantic : num 0 0 0 1 0 0 0 0 0 0 ...
$ Gender : num 1 1 1 1 1 0 0 1 0 0 ...
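Dropping columns by position depends on the exact column order of the input file. As a hedged alternative, the sketch below does the same kind of wrangling by column name and uses cut() for the grade binning. The toy data frame and its values are invented for illustration; only the column names follow the data set.

```r
# Toy stand-in for the loaded data; the values are invented for illustration.
toy <- data.frame(
  school    = c("GP", "GP", "MS"),
  studytime = c(2, 3, 1),
  failures  = c(0, 0, 2),
  higher    = c("yes", "yes", "no"),
  G3        = c(6, 15, 19)
)

# cut() with right-closed intervals reproduces the nested ifelse() binning:
# (-Inf, 5] -> 1, (5, 10] -> 2, (10, 15] -> 3, (15, Inf) -> 4.
toy$Grade <- cut(toy$G3, breaks = c(-Inf, 5, 10, 15, Inf), labels = FALSE)

# Recode a yes/no column as 1/0.
toy$Higher <- as.numeric(toy$higher == "yes")

# Keep columns by name instead of dropping them by position.
keep <- c("studytime", "failures", "Grade", "Higher")
toy <- toy[, keep]
str(toy)
```

Selecting by name keeps the step readable and fails loudly if an expected column is missing, instead of silently removing the wrong one.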
3.3.2 Correlation Analysis
a <- cor(data)
corrplot(a)
As we can see in the correlation plot, several variables, such as 'traveltime', 'studytime' and 'failures', are visibly correlated with Grade. However, we still need to check the probability values to see whether these correlations are significant.
corr.test(data, use = 'complete')
Call:corr.test(x = data, use = "complete")
Correlation matrix
traveltime studytime failures famrel freetime goout Dalc
traveltime 1.00 -0.10 0.09 -0.02 -0.02 0.03 0.14
studytime -0.10 1.00 -0.17 0.04 -0.14 -0.06 -0.20
failures 0.09 -0.17 1.00 -0.04 0.09 0.12 0.14
famrel -0.02 0.04 -0.04 1.00 0.15 0.06 -0.08
freetime -0.02 -0.14 0.09 0.15 1.00 0.29 0.21
goout 0.03 -0.06 0.12 0.06 0.29 1.00 0.27
Dalc 0.14 -0.20 0.14 -0.08 0.21 0.27 1.00
Walc 0.13 -0.25 0.14 -0.11 0.15 0.42 0.65
health 0.01 -0.08 0.07 0.09 0.08 -0.01 0.08
absences -0.01 -0.06 0.06 -0.04 -0.06 0.04 0.11
Grade -0.13 0.11 -0.37 0.03 0.00 -0.14 -0.07
Schoolsup -0.01 0.04 0.00 0.00 -0.05 -0.04 -0.02
Famsup 0.00 0.15 -0.06 -0.02 0.01 -0.02 -0.03
Paid -0.07 0.17 -0.19 0.00 -0.06 0.01 0.06
Activities -0.01 0.09 -0.07 0.04 0.09 0.05 -0.07
Nursery -0.03 0.08 -0.10 0.00 -0.02 0.00 -0.08
Higher -0.08 0.18 -0.30 0.02 -0.06 -0.04 -0.07
Internet -0.11 0.06 -0.06 0.03 0.05 0.07 0.04
Romantic 0.02 0.05 0.09 -0.06 -0.01 0.01 0.02
Gender -0.06 0.31 -0.04 -0.06 -0.24 -0.08 -0.27
Walc health absences Grade Schoolsup Famsup Paid
traveltime 0.13 0.01 -0.01 -0.13 -0.01 0.00 -0.07
studytime -0.25 -0.08 -0.06 0.11 0.04 0.15 0.17
failures 0.14 0.07 0.06 -0.37 0.00 -0.06 -0.19
famrel -0.11 0.09 -0.04 0.03 0.00 -0.02 0.00
freetime 0.15 0.08 -0.06 0.00 -0.05 0.01 -0.06
goout 0.42 -0.01 0.04 -0.14 -0.04 -0.02 0.01
Dalc 0.65 0.08 0.11 -0.07 -0.02 -0.03 0.06
Walc 1.00 0.09 0.14 -0.10 -0.09 -0.09 0.06
health 0.09 1.00 -0.03 -0.03 -0.03 0.03 -0.08
absences 0.14 -0.03 1.00 0.01 0.02 0.02 0.01
Grade -0.10 -0.03 0.01 1.00 -0.09 -0.02 0.06
Schoolsup -0.09 -0.03 0.02 -0.09 1.00 0.10 -0.02
Famsup -0.09 0.03 0.02 -0.02 0.10 1.00 0.29
Paid 0.06 -0.08 0.01 0.06 -0.02 0.29 1.00
Activities -0.04 0.02 -0.01 0.00 0.05 0.00 -0.02
Nursery -0.10 -0.02 0.02 0.03 0.05 0.06 0.10
Higher -0.10 -0.02 -0.06 0.17 0.05 0.10 0.19
Internet 0.01 -0.08 0.10 0.10 -0.01 0.10 0.15
Romantic -0.01 0.03 0.15 -0.08 -0.08 0.01 0.01
Gender -0.27 -0.14 0.07 -0.09 0.14 0.15 0.13
Activities Nursery Higher Internet Romantic Gender
traveltime -0.01 -0.03 -0.08 -0.11 0.02 -0.06
studytime 0.09 0.08 0.18 0.06 0.05 0.31
failures -0.07 -0.10 -0.30 -0.06 0.09 -0.04
famrel 0.04 0.00 0.02 0.03 -0.06 -0.06
freetime 0.09 -0.02 -0.06 0.05 -0.01 -0.24
goout 0.05 0.00 -0.04 0.07 0.01 -0.08
Dalc -0.07 -0.08 -0.07 0.04 0.02 -0.27
Walc -0.04 -0.10 -0.10 0.01 -0.01 -0.27
health 0.02 -0.02 -0.02 -0.08 0.03 -0.14
absences -0.01 0.02 -0.06 0.10 0.15 0.07
Grade 0.00 0.03 0.17 0.10 -0.08 -0.09
Schoolsup 0.05 0.05 0.05 -0.01 -0.08 0.14
Famsup 0.00 0.06 0.10 0.10 0.01 0.15
Paid -0.02 0.10 0.19 0.15 0.01 0.13
Activities 1.00 0.00 0.10 0.05 0.02 -0.10
Nursery 0.00 1.00 0.05 0.01 0.03 0.01
Higher 0.10 0.05 1.00 0.02 -0.11 0.15
Internet 0.05 0.01 0.02 1.00 0.09 -0.04
Romantic 0.02 0.03 -0.11 0.09 1.00 0.10
Gender -0.10 0.01 0.15 -0.04 0.10 1.00
Sample Size
[1] 395
Probability values (Entries above the diagonal are adjusted for multiple tests.)
traveltime studytime failures famrel freetime goout Dalc
traveltime 0.00 1.00 1.00 1.00 1.00 1.00 0.94
studytime 0.05 0.00 0.09 1.00 0.70 1.00 0.02
failures 0.07 0.00 0.00 1.00 1.00 1.00 1.00
famrel 0.74 0.43 0.38 0.00 0.44 1.00 1.00
freetime 0.74 0.00 0.07 0.00 0.00 0.00 0.01
goout 0.57 0.21 0.01 0.20 0.00 0.00 0.00
Dalc 0.01 0.00 0.01 0.12 0.00 0.00 0.00
Walc 0.01 0.00 0.00 0.02 0.00 0.00 0.00
health 0.88 0.13 0.19 0.06 0.13 0.85 0.13
absences 0.80 0.21 0.21 0.38 0.25 0.38 0.03
Grade 0.01 0.03 0.00 0.53 0.94 0.00 0.14
Schoolsup 0.85 0.45 0.99 0.98 0.37 0.45 0.67
Famsup 0.95 0.00 0.27 0.69 0.83 0.76 0.53
Paid 0.19 0.00 0.00 0.99 0.20 0.84 0.22
Activities 0.88 0.07 0.17 0.42 0.07 0.36 0.19
Nursery 0.51 0.11 0.05 0.94 0.62 0.93 0.09
Higher 0.10 0.00 0.00 0.63 0.22 0.43 0.17
Internet 0.03 0.24 0.21 0.52 0.31 0.14 0.47
Romantic 0.66 0.29 0.06 0.21 0.82 0.88 0.76
Gender 0.24 0.00 0.38 0.24 0.00 0.13 0.00
Walc health absences Grade Schoolsup Famsup Paid
traveltime 1.00 1.00 1.00 1.00 1.00 1.00 1.00
studytime 0.00 1.00 1.00 1.00 1.00 0.63 0.15
failures 0.75 1.00 1.00 0.00 1.00 1.00 0.03
famrel 1.00 1.00 1.00 1.00 1.00 1.00 1.00
freetime 0.53 1.00 1.00 1.00 1.00 1.00 1.00
goout 0.00 1.00 1.00 0.70 1.00 1.00 1.00
Dalc 0.00 1.00 1.00 1.00 1.00 1.00 1.00
Walc 0.00 1.00 1.00 1.00 1.00 1.00 1.00
health 0.07 0.00 1.00 1.00 1.00 1.00 1.00
absences 0.01 0.55 0.00 1.00 1.00 1.00 1.00
Grade 0.04 0.55 0.91 0.00 1.00 1.00 1.00
Schoolsup 0.08 0.50 0.66 0.06 0.00 1.00 1.00
Famsup 0.09 0.56 0.63 0.68 0.04 0.00 0.00
Paid 0.23 0.12 0.88 0.27 0.68 0.00 0.00
Activities 0.46 0.64 0.79 0.93 0.36 0.98 0.67
Nursery 0.05 0.71 0.70 0.59 0.36 0.24 0.04
Higher 0.05 0.75 0.27 0.00 0.28 0.05 0.00
Internet 0.82 0.11 0.04 0.05 0.85 0.04 0.00
Romantic 0.84 0.60 0.00 0.10 0.11 0.81 0.91
Gender 0.00 0.00 0.18 0.07 0.01 0.00 0.01
Activities Nursery Higher Internet Romantic Gender
traveltime 1.00 1.00 1.00 1.00 1.00 1.00
studytime 1.00 1.00 0.08 1.00 1.00 0.00
failures 1.00 1.00 0.00 1.00 1.00 1.00
famrel 1.00 1.00 1.00 1.00 1.00 1.00
freetime 1.00 1.00 1.00 1.00 1.00 0.00
goout 1.00 1.00 1.00 1.00 1.00 1.00
Dalc 1.00 1.00 1.00 1.00 1.00 0.00
Walc 1.00 1.00 1.00 1.00 1.00 0.00
health 1.00 1.00 1.00 1.00 1.00 0.69
absences 1.00 1.00 1.00 1.00 0.38 1.00
Grade 1.00 1.00 0.11 1.00 1.00 1.00
Schoolsup 1.00 1.00 1.00 1.00 1.00 0.94
Famsup 1.00 1.00 1.00 1.00 1.00 0.42
Paid 1.00 1.00 0.03 0.38 1.00 1.00
Activities 0.00 1.00 1.00 1.00 1.00 1.00
Nursery 0.96 0.00 1.00 1.00 1.00 1.00
Higher 0.06 0.28 0.00 1.00 1.00 0.44
Internet 0.33 0.88 0.69 0.00 1.00 1.00
Romantic 0.70 0.59 0.04 0.08 0.00 1.00
Gender 0.05 0.87 0.00 0.38 0.04 0.00
To see confidence intervals of the correlations, print with the short=FALSE option
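With alpha set to 0.05, the unadjusted probability values for Grade (the entries below the diagonal) are significant for 'traveltime', 'studytime', 'failures', 'goout' and 'Walc', so we use these five variables to examine their influence on students' grades. ('Higher' also falls below 0.05 in this table but is not carried into the models below.)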
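The selection step can also be scripted rather than read off the printed table. The following is a minimal sketch using only base R (cor.test and p.adjust) on simulated stand-in data; the column names follow the text, but the values, the effect sizes and the choice of the Holm adjustment are assumptions made for illustration.

```r
# Simulated stand-in data; the column names follow the text, the values do not.
set.seed(1)
n <- 200
toy <- data.frame(failures = rpois(n, 0.4), goout = sample(1:5, n, replace = TRUE))
toy$health <- sample(1:5, n, replace = TRUE)
toy$Grade <- 3 - 0.6 * toy$failures - 0.1 * toy$goout + rnorm(n)

predictors <- setdiff(names(toy), "Grade")

# One Pearson correlation test per predictor against Grade.
p_raw <- sapply(predictors, function(v) cor.test(toy[[v]], toy$Grade)$p.value)

# Holm adjustment is one common way to correct for testing several pairs.
p_holm <- p.adjust(p_raw, method = "holm")

alpha <- 0.05
data.frame(p_raw = round(p_raw, 4), p_holm = round(p_holm, 4),
           selected = p_raw < alpha)
```

Comparing the raw and adjusted columns shows how much the choice between them can change which variables pass the threshold.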
3.4 Decision Tree
tree <- ctree(Grade~ traveltime + studytime + failures + Walc + goout,
data = data)
plot(tree)
The tree above uses only 'failures' and 'goout' to divide the students into four groups. Students with no failures clearly have higher grades, especially those who go out less. Students with more than one failure have the lowest grades.
3.5 Logistic Regression
To fit a logistic regression, we first convert the response variable 'Grade' to a factor.
data$Grade <- factor(data$Grade)
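One point worth checking before fitting: Grade has four levels, and a binomial glm() treats a factor response as "first level versus all other levels". The sketch below, on simulated data (all names and values are illustrative assumptions), compares a fit on a four-level factor with a fit on the equivalent 0/1 indicator.

```r
# Simulated data: a four-level factor response and one numeric predictor.
set.seed(42)
n <- 300
x <- rnorm(n)
grade <- cut(x + rnorm(n), breaks = c(-Inf, -1, 0, 1, Inf), labels = 1:4)

# Binomial fit on the four-level factor ...
fit_factor <- glm(grade ~ x, family = binomial)

# ... and on an explicit 0/1 indicator for "not the first level".
above_lowest <- as.integer(grade != levels(grade)[1])
fit_binary <- glm(above_lowest ~ x, family = binomial)

# Both fits estimate the same coefficients.
all.equal(coef(fit_factor), coef(fit_binary))
```

Under this behavior, a binomial model of a four-level Grade describes whether a student is above the lowest level, not the four levels separately.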
Then use the glm() function to specify the formula and fit the model.
mylogit <- glm(formula = Grade ~ traveltime + studytime + failures + Walc + goout,data = data, family = binomial)
print(summary(mylogit))
Call:
glm(formula = Grade ~ traveltime + studytime + failures + Walc +
goout, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5360 0.3275 0.3915 0.4690 1.2929
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.5975 0.8069 3.219 0.00129 **
traveltime -0.1130 0.2354 -0.480 0.63118
studytime 0.1287 0.2218 0.580 0.56189
failures -0.8716 0.1730 -5.037 4.72e-07 ***
Walc 0.3447 0.1572 2.192 0.02838 *
goout -0.3006 0.1650 -1.821 0.06853 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 284.24 on 394 degrees of freedom
Residual deviance: 251.61 on 389 degrees of freedom
AIC: 263.61
Number of Fisher Scoring iterations: 5
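Because glm() with family = binomial treats the first level of a factor response as failure and all other levels as success, the coefficients describe the log-odds of a student being above the lowest grade level. At alpha = 0.05, each additional past failure lowers these log-odds by 0.8716, and each one-unit increase in Walc raises them by 0.3447.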
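The coefficients of a binomial glm() are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. A minimal sketch, with the two significant estimates copied by hand from the summary output above:

```r
# Coefficient estimates copied from the summary output above.
estimates <- c(failures = -0.8716, Walc = 0.3447)

# exp() of a log-odds coefficient gives the multiplicative change in the odds
# for a one-unit increase in that predictor.
odds_ratios <- exp(estimates)
round(odds_ratios, 2)
```

These come to roughly 0.42 for failures and 1.41 for Walc. With the fitted model in the session, exp(coef(mylogit)) computes the same quantities for all terms at once.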
4 Conclusion
Compared with the logistic regression model, the result of the decision tree model is more visual and lets the reader see the classification at a glance.
Both models show that failures is the major factor influencing student grades, which is a reminder for students to concentrate on their studies!