library(caret)
set.seed(12321)
barbell_lifts <- read.csv('pml-training.csv')
Let’s first evaluate what kind of prediction we need.
str(barbell_lifts$classe)
## Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
The outcome variable classe is a factor with five levels, so this is a classification problem: each observation has to be assigned to one of five classes. The appropriate error measure for this kind of classification is accuracy, which treats false positives and false negatives equally. Another possible measure is the area under the ROC curve (AUC), but we will use accuracy.
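As a quick illustration of the measure (with made-up toy vectors, not the real data), accuracy is simply the proportion of correct predictions:
truth <- factor(c("A", "B", "C", "D", "E", "A"))   # toy ground truth
pred  <- factor(c("A", "B", "C", "A", "E", "A"),   # toy predictions
                levels = levels(truth))
mean(pred == truth)                                # 5 of 6 correct, ~0.83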
To choose the training, testing, and validation data sets properly, we need to evaluate the amount of available data.
nrow(barbell_lifts)
## [1] 19622
We have thousands of observations in pml-training.csv, which is large enough. We can follow a typical splitting scheme for the datasets:
As the validation set, we will use the final results of the 20-question quiz.
We will use 10-fold cross-validation on the training data set for tuning the prediction functions. The testing data set will be used for the final out-of-sample error evaluation.
inTrain  <- createDataPartition(y = barbell_lifts$classe, p = 0.7, list = FALSE)  # 70/30 split
training <- barbell_lifts[inTrain, ]
testing  <- barbell_lifts[-inTrain, ]
The data set contains several variables that are numeric but were treated as factors during import, so we convert them back to numeric.
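The conversion code itself is not reproduced here; a minimal sketch of the idea (which columns actually need converting is an assumption about this particular data set):
# Sketch (assumed step): convert columns imported as factors back to numeric.
# The outcome and the genuinely categorical bookkeeping columns are left
# untouched; non-numeric strings in the converted columns become NA and are
# handled by the NA filter below.
factor_cols <- sapply(training, is.factor)
factor_cols[c("classe", "user_name", "cvtd_timestamp", "new_window")] <- FALSE
training[factor_cols] <- lapply(training[factor_cols],
                                function(x) as.numeric(as.character(x)))
We also drop predictors with near-zero variance: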
nzvars   <- nearZeroVar(training)  # indices of near-zero-variance predictors
training <- training[, -nzvars]    # drop them
After this stage there are 125 predictors left.
After excluding the variables which have more than 10% NA values (there were 70 such variables), there are 55 predictors left.
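The filtering code is not shown above; a sketch of one way to do it (the 10% threshold is the one quoted in the text, the rest is an assumption):
# Sketch (assumed step): keep only columns with at most 10% missing values
na_fraction <- colMeans(is.na(training))
training    <- training[, na_fraction <= 0.1]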
We will use 10-fold cross-validation (as our dataset is large enough). This should keep both bias and variance from being extreme.
trainOptions <- trainControl(
  method = "cv",
  number = 10  # number of folds
)
We will train a set of models from different families on the PCA-transformed predictors, using the same cross-validation options for parameter tuning.
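The construction of pcaPredictedTraining is not shown in the code above. A minimal sketch of how it could be built with caret’s preProcess (the object names, the default variance threshold, and the exact column handling are assumptions):
# Sketch (assumed step): PCA preprocessing of the predictors. Non-numeric
# columns pass through preProcess untouched; the outcome is re-attached
# after the transformation.
pcaPreProc <- preProcess(training[, names(training) != "classe"], method = "pca")
pcaPredictedTraining        <- predict(pcaPreProc, training[, names(training) != "classe"])
pcaPredictedTraining$classe <- training$classe
# the same transformation has to be applied to the held-out testing set
pcaPredictedTesting        <- predict(pcaPreProc, testing[, setdiff(names(training), "classe")])
pcaPredictedTesting$classe <- testing$classe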
rpartModel     <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "rpart")
rfModel        <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "rf")
gbmModel       <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "gbm")
ldaModel       <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "lda")
nbModel        <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "nb")
svmRadialModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "svmRadial")
nnetModel      <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "nnet")
trained <- list(
  rpart = rpartModel,
  rf    = rfModel,
  gbm   = gbmModel,
  lda   = ldaModel,
  nb    = nbModel,
  svm   = svmRadialModel,
  nnet  = nnetModel
)
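The code that produced the comparison below is not shown; one way the in-sample and out-of-sample accuracies could have been computed (using pcaPredictedTesting from the sketch above) is:
# Sketch: in-sample vs. out-of-sample accuracy for each trained model
accuracy <- function(model, data) {
  mean(predict(model, newdata = data) == data$classe)
}
data.frame(
  method                 = sapply(trained, function(m) m$modelInfo$label),
  in_sample_accuracy     = sapply(trained, accuracy, data = pcaPredictedTraining),
  out_of_sample_accuracy = sapply(trained, accuracy, data = pcaPredictedTesting),
  row.names = NULL
)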
##   method                                                     in_sample_accuracy out_of_sample_accuracy
## 1 CART                                                        0.4236733          0.4151232
## 2 Random Forest                                               1.0000000          0.9775701
## 3 Stochastic Gradient Boosting                                0.8632889          0.8200510
## 4 Linear Discriminant Analysis                                0.5310475          0.5286321
## 5 Naive Bayes                                                 0.6765669          0.6627018
## 6 Support Vector Machines with Radial Basis Function Kernel  0.9226177          0.9073917
## 7 Neural Network                                              0.6301958          0.6195412
We can see that the random forest is 100% accurate on the training sample, which looks like overfitting. However, its out-of-sample accuracy shows that it still performs better than the other methods.
We can check how consistent the trained models are in their classifications:
library(corrplot)
corrplot(modelCor(resamples(trained)),method='pie')
So the models are not strongly correlated. This is good: we can combine them to try to get an even better prediction that takes advantage of each of them.
Let’s try two combined predictors: one combining all the models and one combining only the most accurate models. The combination is done by simple voting via mode calculation (the most common predicted value across the models is chosen as the final value).
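A sketch of such a voting combiner (the helper name majority_vote and the evaluation on the PCA-transformed test set are illustrative, not the original code):
# Sketch: majority vote -- take the mode of the per-model predictions per row
majority_vote <- function(models, data) {
  preds <- sapply(models, function(m) as.character(predict(m, newdata = data)))
  apply(preds, 1, function(row) names(which.max(table(row))))
}
# e.g. combining the four strongest models and checking out-of-sample accuracy
combined <- majority_vote(trained[c("rf", "gbm", "svm", "nnet")], pcaPredictedTesting)
mean(combined == pcaPredictedTesting$classe)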
## method in_sample_accuracy out_of_sample_accuracy
## 1 all combined 0.8647448 0.8412914
## 2 rf,gbm,svm,nnet combined 0.9261848 0.9029737
It is reasonable to choose the “rf,gbm,svm,nnet combined” model, as the voting between models seems to protect it from overfitting. At the same time this predictor is accurate enough, with an out-of-sample accuracy of 0.9029737.
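For completeness, applying the chosen combination to the 20 quiz cases in pml-testing.csv might look like this (reusing the hypothetical helpers sketched above; the same preprocessing has to be reapplied, and the exact column handling is an assumption):
# Sketch: predict the 20 quiz cases with the chosen combined model
quiz    <- read.csv('pml-testing.csv')
quizPca <- predict(pcaPreProc, quiz[, setdiff(names(training), "classe")])
majority_vote(trained[c("rf", "gbm", "svm", "nnet")], quizPca)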
I got “17/20 points earned (85%)” for the final quiz with the chosen model.
An accuracy of 85% is in line with expectations, as it is only a little below the estimated out-of-sample accuracy.