Based on a training and a testing data set, obtained using devices such as Jawbone Up, Nike FuelBand, and Fitbit, the goal of this project is to produce a model that predicts how well a certain exercise has been performed. The data contains measurements from 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The outcome is the “classe” variable in the training set.
So, the question is: “Given the available data, how well can we predict the quality of the execution of the activity?”
The data sets are loaded and explored.
# URLS
trainingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# FILES
trainingFile <- "pml-training.csv"
testingFile <- "pml-testing.csv"
# Download Files
download.file(url=trainingUrl, destfile=trainingFile)
download.file(url=testingUrl, destfile = testingFile)
# DATAFRAMES. Empty fields and "#DIV/0!" are read as NA
training <- read.csv(trainingFile, na.strings=c("NA", "", "#DIV/0!"), header=TRUE)
testing <- read.csv(testingFile, na.strings=c("NA", "", "#DIV/0!"), header=TRUE)
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
The provided testing set is in reality the submission set for the project quiz: with only 20 observations it is too small for a proper testing set, and it does not contain the classe field, so it cannot be used for evaluation. From now on it will be the “forSubmission” dataframe. I rename the dataframes to be in line with the course slides, then proceed with a classical training/testing split of the training set, as seen in the course lessons. As I am on a slow PC and training is the most time-consuming part, I reduce the size of the training set to .55 (testing .45). I am not sure this really affects the accuracy; if anything it is likely to give more conservative measures, since the verification is performed on a bigger testing set. I also do some cleaning of the training set, removing variables that are not useful for prediction as well as variables (columns) that are mostly NA.
# Need caret
library(caret)
# Data for final quiz submission
forSubmission <- testing
# This is actually the dataset to split
data <- training
# Set a seed for reproducibility
set.seed(54321)
# Partitioning the data (Data Selection)
# I partition .55 and .45 to make training faster
inTraining <- createDataPartition(data$classe, p=0.55, list=FALSE)
# Training and testing dataframes
training <- data[inTraining, ]
testing <- data[-inTraining, ]
# Verify sizes
dim(training); dim(testing)
## [1] 10795 160
## [1] 8827 160
# Cleaning data (Feature Selection):
# Some variables not useful for predictions:
# Index, Name, 3*timestamps, 2*window. They are the first 7.
training <- training[, -c(1:7)]
# I also look at the NA values: columns that are mostly NA are unlikely
# to be useful as predictors. I set the removal threshold to .4 (40%).
toRemove <- NULL
for(i in seq_along(training)) {
    if (sum(is.na(training[, i])) / nrow(training) >= .4) {
        toRemove <- c(toRemove, names(training)[i])
    }
}
training <- training[, !names(training) %in% toRemove]
# A check with nearZeroVar with the default parameters to exclude "near zero-variance" predictors
# reveals that all the selected predictors should be included
# nearZeroVar(training, saveMetrics=TRUE)
I do not need to remove these columns from the testing and forSubmission dataframes. The model built on training will not contain any of the removed variables, so they will simply be ignored during prediction.
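For example, a quick check along these lines should confirm that every predictor kept in the cleaned training set is still available in the other two dataframes (keptPredictors is just an illustrative name):
# Quick sanity check: the predictors kept after cleaning are still
# present in testing and forSubmission, so predict() can use them and
# will simply ignore the extra, unused columns
keptPredictors <- setdiff(names(training), "classe")
all(keptPredictors %in% names(testing))        # expected TRUE
all(keptPredictors %in% names(forSubmission))  # expected TRUE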
A check with nearZeroVar() using the default parameters (to exclude “near zero-variance” predictors) reveals that all the selected predictors should be kept. Since it does not change anything, I left the call commented out in the code above.
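For reference, if nearZeroVar() had flagged any predictors, a filter along these lines could have been used (with the current data nothing is flagged, so this is only a sketch):
# Inspect the near zero-variance diagnostics and, if needed, drop
# the flagged columns (none are flagged with the current data)
nzv <- nearZeroVar(training, saveMetrics = TRUE)
sum(nzv$nzv)   # 0: no near zero-variance predictors among those kept
# training <- training[, !nzv$nzv]   # would drop flagged columns, if any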
The problem at hand is a classification problem, as opposed to a regression problem. Therefore, I will build several classifiers among those seen in the course, compare their performance, and select the best one to produce the answers for the final quiz.
The first model I try is a basic classification tree, which I then plot and evaluate.
library(rpart)
# Set a seed for reproducibility:
set.seed(54321)
modCT <- rpart(classe ~ ., data=training, method="class")
# During development only
# print(modCT, digits=3)
library(rattle)
fancyRpartPlot(modCT)
# Evaluate against testing set
predictionsCT <- predict(modCT, newdata=testing, type = "class")
cmRP <- confusionMatrix(predictionsCT, testing$classe)
cmRP
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 360 54 181 78
## B 49 922 125 56 102
## C 56 226 1209 201 160
## D 85 123 107 919 81
## E 88 77 44 90 1202
##
## Overall Statistics
##
## Accuracy : 0.7346
## 95% CI : (0.7252, 0.7438)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6626
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8892 0.5398 0.7856 0.6351 0.7406
## Specificity 0.8935 0.9534 0.9118 0.9463 0.9585
## Pos Pred Value 0.7683 0.7352 0.6528 0.6989 0.8008
## Neg Pred Value 0.9531 0.8962 0.9527 0.9297 0.9425
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2529 0.1045 0.1370 0.1041 0.1362
## Detection Prevalence 0.3291 0.1421 0.2098 0.1490 0.1700
## Balanced Accuracy 0.8914 0.7466 0.8487 0.7907 0.8495
plot(cmRP$table, col = cmRP$byClass, main = paste("Confusion Matrix CT: Accuracy =", round(cmRP$overall['Accuracy'], 5)))
This model provides an accuracy of 0.7346, which is not good enough to reach the 0.8 required to pass the quiz.
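For completeness, the corresponding estimated out-of-sample error can be read straight from the confusion matrix object:
# Estimated out-of-sample error for the classification tree:
# 1 minus the accuracy measured on the held-out testing set
1 - cmRP$overall[['Accuracy']]   # about 0.265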
Here I generate a basic Random Forests model. Execution through the caret package takes much longer, so I call randomForest directly, with no parameters at all. Note that in Random Forests there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error: it is estimated internally during the run (the out-of-bag estimate). I nevertheless evaluate the model against the testing set, so that it can be compared with the other models.
library(caret)
library(randomForest)
# Set a seed for reproducibility:
set.seed(54321)
# This runs faster when called directly instead of using the caret package
modRF2 <- randomForest(classe ~ ., data=training)
# print(modRF2, digits=3)
# Evaluate against testing set
predictionsRF2 <- predict(modRF2, newdata=testing)
cmRF2 <- confusionMatrix(predictionsRF2, testing$classe)
cmRF2
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2500 2 0 0 0
## B 10 1700 10 0 0
## C 0 6 1525 19 0
## D 0 0 4 1427 3
## E 0 0 0 1 1620
##
## Overall Statistics
##
## Accuracy : 0.9938
## 95% CI : (0.9919, 0.9953)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9921
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9960 0.9953 0.9909 0.9862 0.9982
## Specificity 0.9997 0.9972 0.9966 0.9991 0.9999
## Pos Pred Value 0.9992 0.9884 0.9839 0.9951 0.9994
## Neg Pred Value 0.9984 0.9989 0.9981 0.9973 0.9996
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2832 0.1926 0.1728 0.1617 0.1835
## Detection Prevalence 0.2834 0.1949 0.1756 0.1625 0.1836
## Balanced Accuracy 0.9978 0.9963 0.9937 0.9926 0.9990
plot(cmRF2$table, col = cmRF2$byClass, main = paste("Confusion Matrix RF2: Accuracy =", round(cmRF2$overall['Accuracy'], 5)))
This model provides an accuracy of 0.994, which makes it a serious candidate to be the winner. The very narrow confidence interval also points to a consistent performance over the testing set.
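As a cross-check of the internal estimate mentioned above, the OOB error stored in the fitted randomForest object can be compared with the error measured on the testing set; something along these lines should work:
# OOB (out-of-bag) error estimate stored in the fitted model;
# print(modRF2) reports the same value as "OOB estimate of error rate"
modRF2$err.rate[modRF2$ntree, "OOB"]
# For comparison, the error measured on the held-out testing set
1 - cmRF2$overall[['Accuracy']]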
This particular model is not built for accuracy, but only to have something comparable with the Random Forests: the options are tuned to keep execution time low rather than to achieve higher precision. In this case I perform a 5-fold cross-validation.
# Set a seed for reproducibility:
set.seed(54321)
fitControl <- trainControl(method='cv', number=5, returnResamp='none', classProbs = TRUE)
modGBM <- train(classe ~ ., data= training, method="gbm", trControl=fitControl, verbose=F )
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
# print(modGBM, digits=3)
predictionsGBM <- predict(modGBM, newdata=testing)
cmGBM <- confusionMatrix(predictionsGBM, testing$classe)
cmGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2458 37 0 0 2
## B 38 1613 62 6 14
## C 3 48 1441 59 15
## D 8 9 35 1373 30
## E 3 1 1 9 1562
##
## Overall Statistics
##
## Accuracy : 0.957
## 95% CI : (0.9525, 0.9611)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9456
## Mcnemar's Test P-Value : 3.837e-08
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9793 0.9444 0.9363 0.9489 0.9624
## Specificity 0.9938 0.9831 0.9828 0.9889 0.9981
## Pos Pred Value 0.9844 0.9308 0.9202 0.9436 0.9911
## Neg Pred Value 0.9918 0.9866 0.9865 0.9900 0.9916
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2785 0.1827 0.1632 0.1555 0.1770
## Detection Prevalence 0.2829 0.1963 0.1774 0.1648 0.1785
## Balanced Accuracy 0.9866 0.9638 0.9596 0.9689 0.9802
plot(cmGBM$table, col = cmGBM$byClass, main = paste("Confusion Matrix GBM: Accuracy =", round(cmGBM$overall['Accuracy'], 5)))
Not a bad performer either when compared with the simple classification tree: sufficient to pass the quiz with very good confidence, but worse than the Random Forests.
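The 5-fold cross-validated accuracy estimates computed by caret are stored in the fit and can be inspected directly, for example:
# Cross-validated performance for the tuning grid explored by caret
modGBM$results[, c("n.trees", "interaction.depth", "Accuracy", "Kappa")]
# modGBM$bestTune holds the tuning values that caret finally selected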
I also generate a simple linear discriminant analysis model for comparison.
# Set a seed for reproducibility:
set.seed(54321)
modLDA <- train(classe ~ ., data= training, method="lda")
## Loading required package: MASS
print(modLDA, digits=3)
## Linear Discriminant Analysis
##
## 10795 samples
## 52 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 10795, 10795, 10795, 10795, 10795, 10795, ...
## Resampling results:
##
## Accuracy Kappa
## 0.693 0.612
##
##
predictionsLDA <- predict(modLDA, newdata=testing)
cmLDA <- confusionMatrix(predictionsLDA, testing$classe)
cmLDA
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2037 256 156 65 65
## B 66 1113 147 57 277
## C 178 190 998 166 140
## D 222 70 189 1105 153
## E 7 79 49 54 988
##
## Overall Statistics
##
## Accuracy : 0.707
## 95% CI : (0.6974, 0.7165)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6295
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8116 0.6516 0.6485 0.7636 0.6087
## Specificity 0.9142 0.9232 0.9075 0.9141 0.9738
## Pos Pred Value 0.7898 0.6705 0.5969 0.6354 0.8394
## Neg Pred Value 0.9243 0.9170 0.9244 0.9517 0.9170
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2308 0.1261 0.1131 0.1252 0.1119
## Detection Prevalence 0.2922 0.1881 0.1894 0.1970 0.1333
## Balanced Accuracy 0.8629 0.7874 0.7780 0.8389 0.7913
plot(cmLDA$table, col = cmLDA$byClass, main = paste("Confusion Matrix LDA: Accuracy =", round(cmLDA$overall['Accuracy'], 5)))
This model would also not pass the prediction quiz.
Comparing just the accuracy of the models on the testing data set, the best model is the Random Forests, followed by the Gradient Boosting Model; the other two models mainly provide a baseline for comparison. The simple Random Forests model provides an accuracy of 0.994 and therefore an estimated out-of-sample error rate of 1 - 0.994 = 0.006, which is pretty good in my opinion given that no optimization was performed. The second best model is the Gradient Boosting Model, which again I did not tune to any extent, with an accuracy of 0.957; on the forSubmission data set it gives predictions identical to the Random Forests, which is a good sign. The other two models, a Classification Tree and a Linear Discriminant Analysis model, do not perform satisfactorily in comparison, with accuracies of 0.734 and 0.707 respectively, so they would fail the prediction quiz, which requires at least 80% correct responses.
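A compact way to put these numbers side by side, using the confusion matrix objects computed above (modelAccuracy is just an illustrative name):
# Accuracy of each model on the testing set, and the corresponding
# estimated out-of-sample error (1 - accuracy)
modelAccuracy <- c(RF  = cmRF2$overall[['Accuracy']],
                   GBM = cmGBM$overall[['Accuracy']],
                   CT  = cmRP$overall[['Accuracy']],
                   LDA = cmLDA$overall[['Accuracy']])
round(rbind(Accuracy = modelAccuracy, OutOfSampleError = 1 - modelAccuracy), 4)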
# Building manually a comparison table
resultsTable <- NULL
resultsTable <- data.frame(predict(modRF2, newdata=forSubmission))
resultsTable <- cbind(resultsTable, predict(modGBM, newdata=forSubmission))
resultsTable <- cbind(resultsTable, predict(modCT, newdata=forSubmission, type="class"))
resultsTable <- cbind(resultsTable, predict(modLDA, newdata=forSubmission))
names(resultsTable)[1] <- "1: RF"
names(resultsTable)[2] <- "2: GBM"
names(resultsTable)[3] <- "3: CT"
names(resultsTable)[4] <- "4: LDA"
library(htmlTable)
htmlTable(t(resultsTable))
|        | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|--------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|----|----|----|----|
| 1: RF  | B | A | B | A | A | E | D | B | A | A  | B  | C  | B  | A  | E  | E  | A  | B  | B  | B  |
| 2: GBM | B | A | B | A | A | E | D | B | A | A  | B  | C  | B  | A  | E  | E  | A  | B  | B  | B  |
| 3: CT  | C | A | B | D | A | C | D | A | A | A  | C  | B  | A  | A  | E  | E  | A  | A  | A  | B  |
| 4: LDA | B | A | B | C | A | E | D | D | A | A  | D  | A  | B  | A  | E  | A  | A  | B  | B  | B  |
The Random Forests classifier is the model I will use to submit my answers to the quiz.
Like all the other students doing this project, I owe the data set and a word of thanks to:
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.
(http://groupware.les.inf.puc-rio.br/har)
Furthermore, many thanks to Len Greski for his beautiful guide on setting up gh-pages on github (https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-ghPagesSetup.md).