This document elaborates on the steps taken to run Multinomial Logistic Regression on the results of the Stated Preference Survey. Multinomial Logistic Regression serves to predict an individual’s mode choice (public bus, shared bicycles/PMDs, or Dalphin) from inputs such as the individual’s demographics, mode-specific variables, and non-mode-specific variables.
First, we read in the data from the stated preference survey, which is in CSV (comma-separated values) format.
modechoice <- read.csv("ModeChoice.csv")
Inspecting the structure of the dataset with str(), we see that it contains 1000 observations of 40 variables.
str(modechoice)
Some of the entries in the Income column are NA (missing). To fill in this missing information, we use linear regression to predict their values.
We separate the dataset into two: one with stated incomes, and another which has NA for Income.
incomenotNA <- subset(modechoice, !is.na(modechoice$Income))
incomeNA <- subset(modechoice,is.na(modechoice$Income))
There are 752 observations with stated incomes and 248 observations where income is NA.
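These counts can be verified directly with nrow():
nrow(incomenotNA)   # 752
nrow(incomeNA)      # 248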
We now create training and test sets from the observations where incomes are stated (incomenotNA). The caTools package is used.
library(caTools)
We set the random seed so that the results will be replicable.
set.seed(1)
We use the sample.split() function to obtain the training and test sets with a 65/35 split ratio. Splitting on the Age variable ensures that both the training and test sets maintain the same proportions of Age as in the known-income dataset, incomenotNA.
split <- sample.split(incomenotNA$Age, SplitRatio = 0.65)
incometraining <- subset(incomenotNA,split==TRUE)
incometest <- subset(incomenotNA,split==FALSE)
We now create a linear regression model with the training set. The first model includes all demographic information of the individual:
| Variable | Description |
|---|---|
| Age | The individual’s age group |
| Gender | The individual’s gender |
| HouseholdType | Household type |
| HouseholdSize | Number of people in the household |
| Residential | Residential location of the household |
| Cars | The number of cars in the household |
| Concession | The type of concession pass, if any |
income.model1 <- lm(Income ~ Age + Gender + HouseholdType +
HouseholdSize + Residential + Cars +
Concession,
data = incometraining)
summary(income.model1)
##
## Call:
## lm(formula = Income ~ Age + Gender + HouseholdType + HouseholdSize +
## Residential + Cars + Concession, data = incometraining)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4952 -0.9929 -0.1726 0.9612 2.9008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.531995 0.484056 1.099 0.272303
## Age 0.797357 0.048075 16.586 < 2e-16 ***
## Gender -0.360214 0.122404 -2.943 0.003410 **
## HouseholdType 0.011665 0.045024 0.259 0.795689
## HouseholdSize -0.293817 0.077210 -3.805 0.000160 ***
## Residential 0.003033 0.009273 0.327 0.743766
## Cars 0.354159 0.101643 3.484 0.000539 ***
## Concession 0.145744 0.079579 1.831 0.067655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.317 on 480 degrees of freedom
## Multiple R-squared: 0.4055, Adjusted R-squared: 0.3968
## F-statistic: 46.77 on 7 and 480 DF, p-value: < 2.2e-16
The model has an adjusted \(R^2\) value of 0.39682. HouseholdType and Residential are not significant at the 10% significance level.
We remove the insignificant variables, HouseholdType and Residential, from the model.
income.model2 <- lm(Income ~ Age + Gender + HouseholdSize + Cars + Concession,
                    data = incometraining)
summary(income.model2)
##
## Call:
## lm(formula = Income ~ Age + Gender + HouseholdSize + Cars + Concession,
## data = incometraining)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5561 -0.9873 -0.1662 0.9926 2.9186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.64785 0.35692 1.815 0.070124 .
## Age 0.79474 0.04702 16.901 < 2e-16 ***
## Gender -0.35916 0.12124 -2.962 0.003205 **
## HouseholdSize -0.30061 0.07509 -4.004 7.22e-05 ***
## Cars 0.36404 0.09577 3.801 0.000163 ***
## Concession 0.15273 0.07716 1.979 0.048336 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.315 on 482 degrees of freedom
## Multiple R-squared: 0.4053, Adjusted R-squared: 0.3991
## F-statistic: 65.69 on 5 and 482 DF, p-value: < 2.2e-16
This model has a higher adjusted \(R^2\) value of 0.39911. Concession and the intercept are now the least significant terms in the model.
Several other combinations of variables were tested. The combination of variables which gave the highest adjusted \(R^2\) was chosen.
| Model | Age | Gender | HouseholdType | HouseholdSize | Residential | Cars | Concession | Adjusted \(R^2\) |
|---|---|---|---|---|---|---|---|---|
| 1 | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | 0.39682 |
| 2 | \(\checkmark\) | \(\checkmark\) |  | \(\checkmark\) |  | \(\checkmark\) | \(\checkmark\) | 0.39911 |
| 3 | \(\checkmark\) | \(\checkmark\) |  | \(\checkmark\) |  | \(\checkmark\) |  | 0.39548 |
| 4 | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |  | \(\checkmark\) | \(\checkmark\) | 0.39794 |
| 5 | \(\checkmark\) | \(\checkmark\) |  | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | 0.39799 |
Model 2 has the highest adjusted \(R^2\). We now evaluate this model on the test set by calculating the test \(R^2\). This is given by the formula:
\[R^2 = 1 - \frac{SSE}{SST}\]
where \(SSE\) is the Sum of Squared Errors and \(SST\) is the Total Sum of Squares.
incomeprediction <- predict(income.model2, newdata=incometest)
sse <- sum((incomeprediction-incometest$Income)^2)
sst <- sum((incometest$Income-mean(incometraining$Income))^2)
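The test \(R^2\) then follows directly from the formula above:
1 - sse/sst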
The test \(R^2\) for model 2 is 0.37408. We now use model 2 to predict the NA entries in the dataset.
NAincomeprediction <- predict(income.model2, newdata=incomeNA)
# Replace only the NA entries of Income with their predicted values
modechoice$Income[is.na(modechoice$Income)] <- NAincomeprediction
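We can confirm that no NA entries remain in the Income column:
sum(is.na(modechoice$Income))   # 0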
Now that the dataset contains no NA values, we can proceed to conduct Multinomial Logistic Regression.
Each row in the dataset corresponds to one choice task of a particular individual. We randomly assign the 125 individuals to the training and test sets with a 70/30 split ratio, keeping each individual’s 8 choice tasks together.
set.seed(0)
mask <- rep(x = sample.split(rep(1, 125), SplitRatio = 0.7),
            each = 8)
modechoice.training <- subset(modechoice, mask == TRUE)
modechoice.test <- subset(modechoice, mask == FALSE)
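As a sanity check, every individual retained in the training set should contribute all 8 of their choice tasks:
all(table(modechoice.training$Case) == 8)   # TRUE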
The table below shows the first 16 rows of the training set. All 8 choice tasks of an individual are preserved. Case refers to the individual, while Task refers to the choice task.
library(knitr)
kable(modechoice.training[1:16, 2:6], row.names = FALSE)
| Case | Task | weather1 | weather2 | weather3 |
|---|---|---|---|---|
| 2 | 1 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 |
| 2 | 3 | 0 | 0 | 0 |
| 2 | 4 | 0 | 0 | 0 |
| 2 | 5 | 1 | 1 | 1 |
| 2 | 6 | 1 | 1 | 1 |
| 2 | 7 | 1 | 1 | 1 |
| 2 | 8 | 1 | 1 | 1 |
| 3 | 1 | 0 | 0 | 0 |
| 3 | 2 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 |
| 3 | 4 | 0 | 0 | 0 |
| 3 | 5 | 1 | 1 | 1 |
| 3 | 6 | 1 | 1 | 1 |
| 3 | 7 | 1 | 1 | 1 |
| 3 | 8 | 1 | 1 | 1 |
We use the mlogit package, which will help us conduct Multinomial Logistic Regression.
library(mlogit)
S <- mlogit.data(data = modechoice.training,
shape="wide",
choice="A",
varying=c(4:18),
sep="",
alt.levels=c("A.bus","A.cyclingPMD","A.dalphin"),
id.var="Case")
The mlogit.data() function shapes the data into a suitable form for use with the mlogit() function. Several parameters are defined in this call:
data specifies the dataset to be used. Here, we use the training set created previously.
shape = "wide" indicates that each row holds one choice task, with one column per alternative-and-attribute combination.
choice names the column containing the chosen alternative (here, A).
varying refers to the columns that correspond to the attributes of the choice.
sep is the separator between the attribute name and the alternative index in the column names (here, the empty string, as in weather1).
alt.levels specifies a list of the choices available (public bus, shared bicycles/PMDs, or Dalphin).
id.var identifies the column that indexes individuals (here, Case), since each individual completes several choice tasks.
The Akaike Information Criterion (AIC) is a measure of the quality of fit of a model which also takes model complexity into account. A lower AIC value is more desirable. The formula is given by:
\[AIC = -2 LL(\hat\beta) + 2|\hat\beta|\]
where \(LL(\hat\beta)\) is the Log-likelihood of the estimated model and \(|\hat\beta|\) is the number of coefficients to be estimated (including the intercept).
For convenience, we define a function calc.AIC() that takes in a model and returns its AIC value.
# AIC = -2 * log-likelihood of the fitted model + 2 * number of estimated coefficients
calc.AIC <- function (model) {
  return(-2*model$logLik[1] + 2*length(model$coefficients))
}
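As an illustration of how the candidate models below are fitted and scored, a model can be estimated with mlogit() and passed to calc.AIC(). The particular pair of variables shown here is an assumption for illustration; the actual combinations tested are listed in the tables that follow.
# Hypothetical example: mode-specific variables enter the first part of the formula
example.model <- mlogit(A ~ weather + fare, data = S)
calc.AIC(example.model)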
There are three broad types of variable selection:
Forward: Start with no variables and add one variable at a time, keeping it only if it is significant.
Backward: Start with all variables in the model and remove the least significant variable at each step.
Stepwise: A combination of the forward and backward techniques. After each step in which a variable is added, all variables in the model are checked to see whether their significance has dropped below the threshold level; any non-significant variable found is removed. At each step, a variable can thus be either added to or removed from the previous model.
Stepwise variable selection was used for its thoroughness.
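A minimal sketch of one forward step in this process, assuming a current model containing weather and fare and a set of remaining candidate variables (both assumptions for illustration); the AIC values actually obtained are tabulated below.
candidates <- c("wait", "ivtt", "access")              # assumed remaining mode-specific variables
for (v in candidates) {
  f <- as.formula(paste("A ~ weather + fare +", v))    # try adding one candidate at a time
  cat(v, ":", calc.AIC(mlogit(f, data = S)), "\n")     # keep the addition only if AIC improves
}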
| model | weather | fare | wait | ivtt | access | Age | Gender | Income | HouseholdType | HouseholdSize | Residential | Cars | Concession | AIC Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | X | 1357.116 | ||||||||||||
| 2 | X | X | 1333.944 | |||||||||||
| 3 | X | X | X | 1334.169 | ||||||||||
| 4 | X | X | X | X | 1335.254 |
Model 2 is the best of these models, as it has the lowest AIC. We now build upon Model 2 by including individual-specific attributes.
| model | weather | fare | wait | ivtt | access | Age | Gender | Income | HouseholdType | HouseholdSize | Residential | Cars | Concession | AIC Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | X | X | X | 1326.97 | ||||||||||
| 6 | X | X | X | X | 1328.847 | |||||||||
| 7 | X | X | X | X | 1320.019 | |||||||||
| 8 | X | X | X | X | X | 1323.093 | ||||||||
| 9 | X | X | X | X | 1328.914 | |||||||||
| 10 | X | X | X | X | X | 1324.68 | ||||||||
| 11 | X | X | X | X | X | 1318.375 | ||||||||
| 12 | X | X | X | X | X | X | 1319.543 | |||||||
| 13 | X | X | X | X | X | X | 1318.999 |
Now we try adding back the other mode-specific variables that were removed earlier.
| model | weather | fare | wait | ivtt | access | Age | Gender | Income | HouseholdType | HouseholdSize | Residential | Cars | Concession | AIC Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | X | X | X | X | X | X | X | 1319.172 | ||||||
| 15 | X | X | X | X | X | X | X | X | 1320.237 | |||||
| 16 | X | X | X | X | X | X | X | 1320.235 |
Model 11 is the best of these, as it has the lowest AIC and the highest proportion of significant coefficients. We now also build upon Model 4 by including individual-specific attributes.
| model | weather | fare | wait | ivtt | access | Age | Gender | Income | HouseholdType | HouseholdSize | Residential | Cars | Concession | AIC Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | X | X | X | X | X | 1328.258 | ||||||||
| 18 | X | X | X | X | X | X | 1330.134 | |||||||
| 19 | X | X | X | X | X | X | X | 1322.806 | ||||||
| 20 | X | X | X | X | X | X | X | X | 1325.969 | |||||
| 21 | X | X | X | X | X | X | X | 1332.142 | ||||||
| 22 | X | X | X | X | X | X | X | X | 1311.476 | |||||
| 23 | X | X | X | X | X | X | X | X | X | 1308.912 |
Overall, Model 23 is the best model with the lowest AIC value.
This is the full table of all the models constructed.
| model | weather | fare | wait | ivtt | access | Age | Gender | Income | HouseholdType | HouseholdSize | Residential | Cars | Concession | AIC Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | X | 1357.116 | ||||||||||||
| 2 | X | X | 1333.944 | |||||||||||
| 3 | X | X | X | 1334.169 | ||||||||||
| 4 | X | X | X | X | 1335.254 | |||||||||
| 5 | X | X | X | 1326.97 | ||||||||||
| 6 | X | X | X | X | 1328.847 | |||||||||
| 7 | X | X | X | X | 1320.019 | |||||||||
| 8 | X | X | X | X | X | 1323.093 | ||||||||
| 9 | X | X | X | X | 1328.914 | |||||||||
| 10 | X | X | X | X | X | 1324.68 | ||||||||
| 11 | X | X | X | X | X | 1318.375 | ||||||||
| 12 | X | X | X | X | X | X | 1319.543 | |||||||
| 13 | X | X | X | X | X | X | 1318.999 | |||||||
| 14 | X | X | X | X | X | X | X | 1319.172 | ||||||
| 15 | X | X | X | X | X | X | X | X | 1320.237 | |||||
| 16 | X | X | X | X | X | X | X | 1320.235 | ||||||
| 17 | X | X | X | X | X | 1328.258 | ||||||||
| 18 | X | X | X | X | X | X | 1330.134 | |||||||
| 19 | X | X | X | X | X | X | X | 1322.806 | ||||||
| 20 | X | X | X | X | X | X | X | X | 1325.969 | |||||
| 21 | X | X | X | X | X | X | X | 1332.142 | ||||||
| 22 | X | X | X | X | X | X | X | X | 1311.476 | |||||
| 23 | X | X | X | X | X | X | X | X | X | 1308.912 |
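Before predicting, the chosen model must be estimated with mlogit(); individual-specific variables appear after the | in the formula. The formula below is a sketch only: the actual variables of Model 23 are those marked in the table above, and the particular combination written here is an assumption.
# Assumed formula for illustration; Model 23's actual variables are those marked above
M23 <- mlogit(A ~ weather + fare | Age + Gender + Income + Cars, data = S)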
We use the predict() function to obtain the probability of an individual choosing each mode.
P <- predict(M23, newdata = S, type = "response")
Looking at the first three rows of the predicted probabilities:
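A sketch of how these rows can be displayed, rounding to three decimal places (the column renaming is for presentation only):
probs <- round(head(P, 3), 3)
colnames(probs) <- c("Bus", "Shared Bicycles/PMDs", "Dalphin")
kable(probs)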
| Bus | Shared Bicycles/PMDs | Dalphin |
|---|---|---|
| 0.278 | 0.614 | 0.108 |
| 0.265 | 0.584 | 0.151 |
| 0.232 | 0.511 | 0.257 |
For example, the individual corresponding to the first row will choose the bus with a 27.8% chance, Shared Bicycles/PMDs with a 61.4% chance, and the Dalphin with a 10.8% chance.
We take the predicted choice to be the most likely alternative (the one with the highest probability of being chosen).
PredictedChoice <- apply(P, 1, which.max)
Using a confusion matrix, we compare the actual choices in the training set with the choices predicted by the model.
ActualChoice <- modechoice.training$A
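A sketch of how this matrix can be produced with table(), with predicted choices as rows and actual choices as columns:
confusion <- table(PredictedChoice, ActualChoice)
confusion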
Predicted Choice (rows) vs. Actual Choice (columns)
| Bus | Shared Bicycles/PMDs | Dalphin | |
|---|---|---|---|
| Bus | 68 | 19 | 46 |
| Shared Bicycles/PMDs | 26 | 42 | 17 |
| Dalphin | 132 | 64 | 282 |
The diagonal entries of the confusion matrix are the instances where the model correctly predicts the actual choice of the individual. Accuracy is the proportion of instances where the model’s prediction matches the actual choice. This model has an accuracy of 56.32%.
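The accuracy follows from the diagonal of the confusion matrix (using the confusion object sketched above):
sum(diag(confusion)) / sum(confusion)   # 392/696 = 0.5632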