Study design

  1. Error rate definition
  2. Splitting the data
  3. Picking features
  4. Picking prediction function
  5. Final model choice

Study

Initial preparation

library(caret)
set.seed(12321)  # fixed seed for reproducibility
barbell_lifts <- read.csv('pml-training.csv')

1. Choosing error rate

Let’s evaluate what kind of prediction we need.

str(barbell_lifts$classe)
##  Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

The outcome variable classe is a factor with 5 levels. This means we need to do classification: each observation has to be assigned to one of five classes. An appropriate error measure for this kind of classification is accuracy, which treats false positives and false negatives equally. Another possible measure is the area under the ROC curve (AUC), but we will use accuracy.
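
For reference, accuracy is simply the share of correctly classified observations; a toy illustration with made-up labels:

# Toy example (made-up labels): accuracy = share of correct predictions
predicted <- factor(c("A", "B", "A", "C", "A"), levels = c("A", "B", "C", "D", "E"))
actual    <- factor(c("A", "B", "C", "C", "A"), levels = c("A", "B", "C", "D", "E"))

mean(predicted == actual)                               # 0.8
confusionMatrix(predicted, actual)$overall["Accuracy"]  # same value via caret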

2. Splitting the data

To properly choose the training, testing and validation data sets, we first need to evaluate how much data is available.

nrow(barbell_lifts)
## [1] 19622

We have almost twenty thousand observations in pml-training.csv, which is large enough to use a typical splitting scheme for the dataset:

  • 70% training
  • 30% testing

The results of the 20-question final quiz will serve as our validation set.

We will use 10-fold cross-validation on the training set for tuning the prediction functions. The testing set will be used for the final out-of-sample error evaluation.

# 70/30 train/test split, stratified by the outcome variable classe
inTrain <- createDataPartition(y = barbell_lifts$classe, p = 0.7, list = FALSE)

training <- barbell_lifts[inTrain, ]
testing  <- barbell_lifts[-inTrain, ]

3. Picking features

3.1. Cleaning the data

Removing unrelated variables

The dataset contains several variables that we do not want to use as predictors:

  • user_name - the model should not depend on the particular person doing the dumbbell lifting
  • cvtd_timestamp - the model should not depend on the date and time of the lifting
  • X - this is just the dataset row number

After this filtering there are 156 possible predictors left.
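
The original write-up does not show the code for this step; a minimal sketch of the column dropping, applied to the training set:

# Drop identifier-like columns that should not act as predictors
dropCols <- c("X", "user_name", "cvtd_timestamp")
training <- training[, !(names(training) %in% dropCols)]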

Converting factors with ‘numeric’ levels to real numeric values

The data set contains several variables that are numeric but were treated as factors during import. We convert them back to numeric.
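
One way such a conversion could look; the predicate below is an assumption, since the exact set of affected columns is not listed in the original:

# Detect factor columns whose levels are numeric strings and convert them
isNumericFactor <- sapply(training, function(col) {
  is.factor(col) && !all(is.na(suppressWarnings(as.numeric(levels(col)))))
})
training[isNumericFactor] <- lapply(training[isNumericFactor],
                                    function(col) as.numeric(as.character(col)))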

Identifying and eliminating near-zero-variance predictors

nzvars <- nearZeroVar(training)
training <- training[,-nzvars]

After this stage there are 125 predictors left.

Eliminating variables with lots of NAs

After excluding the variables which have more than 10% NA values (there were 70 such variables), there are 55 predictors left.
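
A minimal sketch of this filtering, using the 10% threshold stated above:

# Keep only the columns where at most 10% of the values are NA
naShare  <- colMeans(is.na(training))
training <- training[, naShare <= 0.1]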

3.2. Transforming covariates

Handling correlated predictors

Looking at the correlation matrix of all 55 predictors:

predictorsOnly <- training[,names(training) != 'classe']
varM <- cor(predictorsOnly)
varMcol<- colorRampPalette(c("red", "white", "blue"))(20)
heatmap(x = varM, col = varMcol, symm = TRUE)

We can see that there are several clusters of correlated variables.

We will reduce the dimensionality of the predictor space by applying PCA with the default explained-variance threshold of 95%.

predictorsOnly <- training[,names(training) != 'classe']

pcaPreProc <- preProcess(predictorsOnly,method="pca")
pcaPredictedTraining <- predict(pcaPreProc,predictorsOnly)
pcaPredictedTraining$classe <- training$classe

After this stage we have 26 linearly independent principal-component predictors.

4. Picking prediction function

Prediction function parameter tuning with Cross Validation

We will use 10-fold cross-validation (our dataset is large enough for it). This should keep both bias and variance from becoming extreme.

trainOptions <- trainControl(
   method = "cv",
   number=10 #number of folds
)

Trying different prediction function families and tuning their parameters

We will train a set of models from different families, using the same cross-validation options for parameter tuning.

rpartModel     <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "rpart")
rfModel        <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "rf")
gbmModel       <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "gbm")
ldaModel       <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "lda")
nbModel        <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "nb")
svmRadialModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "svmRadial")
nnetModel      <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "nnet")

trained <- list(
  rpart=rpartModel,
  rf=rfModel,
  gbm=gbmModel,
  lda=ldaModel,
  nb=nbModel,
  svm=svmRadialModel,
  nnet=nnetModel
)

Evaluating in-sample and out-of-sample errors

##                                                      method
## 1                                                      CART
## 2                                             Random Forest
## 3                              Stochastic Gradient Boosting
## 4                              Linear Discriminant Analysis
## 5                                               Naive Bayes
## 6 Support Vector Machines with Radial Basis Function Kernel
## 7                                            Neural Network
##   in_sample_accuracy out_of_sample_accuracy
## 1          0.4236733              0.4151232
## 2          1.0000000              0.9775701
## 3          0.8632889              0.8200510
## 4          0.5310475              0.5286321
## 5          0.6765669              0.6627018
## 6          0.9226177              0.9073917
## 7          0.6301958              0.6195412
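
The table above is produced by code roughly along the following lines; note that the testing set is assumed to have gone through the same cleaning steps as the training set before the PCA pre-processing is applied (that part is not shown here).

# Apply the training-set PCA pre-processing to the (identically cleaned) testing set
pcaPredictedTesting <- predict(pcaPreProc, testing[, names(testing) != 'classe'])

accuracyFor <- function(model) {
  c(in_sample_accuracy     = mean(predict(model, pcaPredictedTraining) == training$classe),
    out_of_sample_accuracy = mean(predict(model, pcaPredictedTesting)  == testing$classe))
}

t(sapply(trained, accuracyFor))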

We can see that the Random Forest model is 100% accurate on the training sample, which looks like overfitting. However, its out-of-sample accuracy shows that it still performs better than the other methods.

Combining the models

We can check how consistent the trained models are in their classifications:

library(corrplot)
corrplot(modelCor(resamples(trained)),method='pie')

So the models are not very consistent with each other. This is good: we can combine them to try to get an even better prediction that takes advantage of each of them.

Ensembling the models using simple voting (mode calculation)

Let’s try two combined predictors: one combining all the models and one combining only the most accurate ones. The combination is done by simple voting via mode calculation (the most common predicted value across the individual models is chosen as the final value); a sketch follows below.
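
A sketch of the voting combiner; the voteMode helper below is hypothetical (not part of any package), and pcaPredictedTesting is the PCA-transformed testing set from the earlier sketch.

# Hypothetical helper: for each observation, pick the most common prediction (the mode)
voteMode <- function(models, newdata) {
  preds <- sapply(models, function(m) as.character(predict(m, newdata)))
  apply(preds, 1, function(row) names(which.max(table(row))))
}

best <- trained[c("rf", "gbm", "svm", "nnet")]

mean(voteMode(trained, pcaPredictedTraining) == training$classe)  # in-sample, all models
mean(voteMode(trained, pcaPredictedTesting)  == testing$classe)   # out-of-sample, all models
mean(voteMode(best, pcaPredictedTraining) == training$classe)     # in-sample, best models
mean(voteMode(best, pcaPredictedTesting)  == testing$classe)      # out-of-sample, best models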

##                     method in_sample_accuracy out_of_sample_accuracy
## 1             all combined          0.8647448              0.8412914
## 2 rf,gbm,svm,nnet combined          0.9261848              0.9029737

5. Final model choice

It is reasonable to choose the “rf,gbm,svm,nnet combined” model, as the voting seems to protect it from overfitting by balancing the individual models. At the same time this predictor is accurate enough, with an out-of-sample accuracy of 0.9029737.

Final quiz prediction

I got “17/20 points earned (85%)” for the final quiz with the chosen model.

An accuracy of 85% is consistent with expectations, as it is only slightly below the estimated out-of-sample accuracy.
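
For completeness, a sketch of how the quiz predictions could be produced with the chosen ensemble. The file name pml-testing.csv and the reuse of the hypothetical voteMode helper from the earlier sketch are assumptions, and the quiz data is assumed to contain the same predictor columns as the cleaned training set.

# Assumption: the 20 quiz cases are stored in pml-testing.csv with the same predictor columns
quiz <- read.csv('pml-testing.csv')
pcaPredictedQuiz <- predict(pcaPreProc, quiz[, names(predictorsOnly)])
voteMode(trained[c("rf", "gbm", "svm", "nnet")], pcaPredictedQuiz)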