# install.packages(c('caret', 'Rtsne', 'stringr', 'LiblineaR', 'ranger', 'pROC'))

Load packages

library(stringr)
library(DESeq2)
library(Rtsne)
library(ggplot2)
library(caret)
library(LiblineaR)
library(ranger)
library(pROC)

Set up working dir

setwd('~/share/Day4/cancer-ml')
load('rawdata.rd')

Predicting cancer vs. healthy

Before we dive right into machine learning, we need to consider the data we have. A full set of RNA-Seq data is going to be noisy (how many of our 28,500 genes are going to be relevant?), so it is a good idea to filter it down using some metric. Statistics like feature variance are often used for this, but we have something better at our disposal: differential expression. If a gene is DE between the conditions we are interested in, we can expect it to be informative for the model. Since we are interested in cancer vs. healthy, we simply make that contrast our DE model:

meta$hd <- factor(ifelse(meta$Patient.type == "Healthy Donor (HD)", 1, 0))
dds.hd <- DESeqDataSetFromMatrix(expr, meta, ~hd)
dds.hd <- DESeq(dds.hd)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 340 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing

We select genes with an adjusted P-value < 1% as our relevant set (which() conveniently drops the NA padj values that DESeq2 assigns to filtered low-count genes):

sel <- which(results(dds.hd)$padj < 0.01)

At this point it’s a good idea to take a look at the split of the data. You are already familiar with PCA or MDS plots, so I’d like to take this opportunity to introduce t-SNE, or t-distributed Stochastic Neighbor Embedding. It gives similar information to PCA and MDS plots, but it is much better at detecting nonlinear associations. If you are interested, you can read more about t-SNE here.

# Subset the variance-stabilized counts (vst) to the selected (padj < 0.01) genes
dat <- t(assay(vst))[,sel]

# Calculate t-SNE
sne <- Rtsne(dat, perplexity = 20, pca = FALSE)

# Plot x,y values, annotate patient gender via pch and cancer type via color
ggplot(
  data.frame(x = sne$Y[,1], 
             y = sne$Y[,2], 
             pch = meta$Gender, 
             col = meta$Patient.type
             ), 
  aes(x = x, y = y, col = col, pch = pch)
  ) + geom_point() + theme_bw()

The data doesn’t group nicely at all, which is great! If it did, classifying the data would be too easy. Let’s preprocess the data some more. We will use support vector machines (SVMs) to classify our samples into cancer/healthy. SVM is a supervised method, so we need to supply the true labels of our training data. To that end, we can simply extract the correct column from the metadata:

target <- meta$hd

We also want to hold back some data to validate our model, meaning we want to check whether it can accurately predict data it has never seen. We split 75:25 (training:testing):

index <- createDataPartition(target, p = 0.75, list = FALSE, times = 1)

tts <- list(
  'train' = list(
    'dat' = dat[index[,1],],
    'target' = target[index[,1]]
  ),
  'test' = list(
    'dat' = dat[-index[,1],],
    'target' = target[-index[,1]]
  )
)
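Note that createDataPartition() samples within each class, so the training split keeps roughly the same class proportions as the full data. A quick self-contained check on synthetic labels (the 80/20 class mix here is made up for illustration):

```r
library(caret)

set.seed(42)
# Imbalanced toy labels: 80 of class 0, 20 of class 1
y <- factor(rep(c(0, 1), times = c(80, 20)))
idx <- createDataPartition(y, p = 0.75, list = FALSE, times = 1)

# The training split should mirror the 0.8/0.2 mix of the full data
prop.table(table(y[idx[, 1]]))
prop.table(table(y))
```

A plain random 75% sample could, by bad luck, leave very few class-1 samples in the training set; stratified sampling avoids that.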

It’s time to train our model! The R package caret wraps a large number of other R machine learning libraries. Check here for a complete list. The train() function in caret will also try out several different combinations of parameters to see what works best (remember grid search?):

model <- train(x = tts$train$dat, y = tts$train$target, method = 'svmLinear3')

Let’s check the model parameters:

model
## L2 Regularized Support Vector Machine (dual) with Linear Kernel 
## 
##  173 samples
## 6090 predictors
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 173, 173, 173, 173, 173, 173, ... 
## Resampling results across tuning parameters:
## 
##   cost  Loss  Accuracy   Kappa    
##   0.25  L1    0.9339592  0.7566867
##   0.25  L2    0.9346239  0.7591968
##   0.50  L1    0.9339592  0.7566867
##   0.50  L2    0.9346239  0.7591968
##   1.00  L1    0.9339592  0.7566867
##   1.00  L2    0.9346239  0.7591968
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25 and Loss = L2.

Our model accuracy is fairly high at >90% (your results may differ from the above, as there is a certain amount of randomness involved). The Kappa value compares the observed accuracy with the accuracy expected by chance given the class frequencies.
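By default train() picks a small tuning grid on its own, but you can also supply one explicitly via the tuneGrid argument; for method = 'svmLinear3' the tunable parameters are cost and Loss. A minimal sketch on the built-in iris data (dataset, seed, and resampling settings are chosen here purely for illustration):

```r
library(caret)
library(LiblineaR)

set.seed(1)
# One row per parameter combination to evaluate
grid <- expand.grid(cost = c(0.25, 0.5, 1), Loss = c("L1", "L2"))

m <- train(x = iris[, 1:4], y = iris$Species,
           method = "svmLinear3",
           trControl = trainControl(method = "cv", number = 5),
           tuneGrid = grid)

m$bestTune  # the cost/Loss combination with the best resampled accuracy
```

With five-fold CV instead of the default bootstrap, each of the six grid rows is evaluated once per fold, and m$results holds one row per combination.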

Now we want to see whether our model generalizes well. To that end we can validate it on our held-out 25% test data. The best model found during tuning is stored in the model$finalModel object. Let’s predict the test data:

pred <- predict(model$finalModel, tts$test$dat)

We can make ROC (Receiver Operating Characteristic) curves and plot them:

auc <- roc(as.numeric(tts$test$target), as.numeric(pred$predictions))

plot(auc, ylim = c(0, 1), print.thres = TRUE, 
     main = paste('AUC (SVM):', round(auc$auc[[1]],2)),
     xlim = c(1, 0))
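The AUC has a handy interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A tiny made-up example where every positive outranks every negative, so the AUC is exactly 1:

```r
library(pROC)

truth <- c(0, 0, 0, 1, 1)                  # true classes
score <- c(0.10, 0.40, 0.35, 0.80, 0.70)   # classifier scores
r <- roc(truth, score)

as.numeric(r$auc)  # 1: all positives score above all negatives
```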

Another good check is the confusion matrix. This simply compares the true class of the test data to the predicted class and computes some statistics. You can read more about these statistics in the help text: ?confusionMatrix

cm <- confusionMatrix(pred$predictions, reference = tts$test$target)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 48  1
##          1  0  8
##                                           
##                Accuracy : 0.9825          
##                  95% CI : (0.9061, 0.9996)
##     No Information Rate : 0.8421          
##     P-Value [Acc > NIR] : 0.000651        
##                                           
##                   Kappa : 0.9309          
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8889          
##          Pos Pred Value : 0.9796          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.8421          
##          Detection Rate : 0.8421          
##    Detection Prevalence : 0.8596          
##       Balanced Accuracy : 0.9444          
##                                           
##        'Positive' Class : 0               
## 
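Most of the numbers in this printout can be recomputed by hand from the 2x2 table, which is a good way to internalize what they mean. A sketch using the counts above (remember that the 'positive' class here is 0):

```r
# Confusion matrix from above: rows = prediction, columns = reference
tab <- matrix(c(48, 0, 1, 8), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference  = c("0", "1")))

n    <- sum(tab)
acc  <- sum(diag(tab)) / n                      # observed accuracy
sens <- tab["0", "0"] / sum(tab[, "0"])         # true positives / all real 0s
spec <- tab["1", "1"] / sum(tab[, "1"])         # true negatives / all real 1s
pe   <- sum(rowSums(tab) * colSums(tab)) / n^2  # accuracy expected by chance
kap  <- (acc - pe) / (1 - pe)                   # Cohen's kappa

round(c(accuracy = acc, sensitivity = sens, specificity = spec, kappa = kap), 4)
# accuracy 0.9825, sensitivity 1, specificity 0.8889, kappa 0.9309
```

These match the confusionMatrix() output above, including the Kappa of 0.93: the model is far better than a classifier that simply guesses according to class frequencies (which would already reach the No Information Rate of 0.8421).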

Feature importance

Our next objective is to find out which genes are important to the predictions. Unfortunately, this is quite hard for SVMs (we would need to build thousands of models and check the differences). Random forests compute feature importances as part of training, so we can train one instead. We will use a specific set of parameters instead of a grid search to save some time:

# ranger wants the data as a single data frame, as opposed to the
# separate x and y that train() wanted
rf_data <- as.data.frame(tts$train$dat)
rf_data$target <- tts$train$target
model_rf <- ranger(target ~ ., 
                   data = rf_data, 
                   num.trees = 1000, 
                   importance = 'impurity_corrected', 
                   min.node.size = 1, 
                   splitrule = 'extratrees', 
                   mtry = 6089)

This particular method allows us to extract P-values for each feature (gene):

p_vals <- importance_pvalues(model_rf)
p_vals <- p.adjust(p_vals[,2], method = "BH")
imp_features <- p_vals < 0.05
p_vals[which(imp_features)]
## ENSG00000004534 ENSG00000008324 ENSG00000008438 ENSG00000008988 
##     0.000000000     0.041726697     0.032741634     0.000000000 
## ENSG00000012223 ENSG00000023445 ENSG00000025770 ENSG00000032219 
##     0.032741634     0.032741634     0.032741634     0.045055543 
## ENSG00000047410 ENSG00000051382 ENSG00000054654 ENSG00000056558 
##     0.019253386     0.041726697     0.000000000     0.000000000 
## ENSG00000061936 ENSG00000063177 ENSG00000064601 ENSG00000067248 
##     0.000000000     0.000000000     0.047675051     0.000000000 
## ENSG00000069493 ENSG00000069535 ENSG00000070961 ENSG00000072818 
##     0.032741634     0.000000000     0.000000000     0.006772152 
## ENSG00000072952 ENSG00000073605 ENSG00000073614 ENSG00000074966 
##     0.000000000     0.019253386     0.032741634     0.000000000 
## ENSG00000075914 ENSG00000076108 ENSG00000081059 ENSG00000083845 
##     0.000000000     0.032741634     0.000000000     0.029358000 
## ENSG00000086712 ENSG00000089009 ENSG00000089012 ENSG00000089048 
##     0.032741634     0.000000000     0.000000000     0.041726697 
## ENSG00000089157 ENSG00000090920 ENSG00000091129 ENSG00000096717 
##     0.041726697     0.000000000     0.000000000     0.000000000 
## ENSG00000099139 ENSG00000100201 ENSG00000100316 ENSG00000100354 
##     0.032741634     0.032741634     0.019253386     0.000000000 
## ENSG00000100376 ENSG00000100813 ENSG00000101425 ENSG00000102886 
##     0.041726697     0.000000000     0.000000000     0.000000000 
## ENSG00000103064 ENSG00000103175 ENSG00000104067 ENSG00000104529 
##     0.000000000     0.000000000     0.000000000     0.032741634 
## ENSG00000104689 ENSG00000104805 ENSG00000105193 ENSG00000105640 
##     0.000000000     0.032741634     0.000000000     0.000000000 
## ENSG00000105829 ENSG00000106948 ENSG00000107890 ENSG00000108064 
##     0.000000000     0.000000000     0.000000000     0.000000000 
## ENSG00000108107 ENSG00000108292 ENSG00000108465 ENSG00000108848 
##     0.000000000     0.000000000     0.036949315     0.000000000 
## ENSG00000109475 ENSG00000109920 ENSG00000110651 ENSG00000110848 
##     0.006772152     0.029358000     0.006772152     0.032741634 
## ENSG00000110944 ENSG00000111860 ENSG00000111879 ENSG00000113194 
##     0.041726697     0.045055543     0.025087745     0.040059464 
## ENSG00000113889 ENSG00000114391 ENSG00000115241 ENSG00000115268 
##     0.006772152     0.032741634     0.041726697     0.000000000 
## ENSG00000115541 ENSG00000115760 ENSG00000115977 ENSG00000116147 
##     0.032741634     0.006772152     0.000000000     0.032741634 
## ENSG00000116580 ENSG00000116754 ENSG00000117560 ENSG00000117616 
##     0.000000000     0.000000000     0.000000000     0.032741634 
## ENSG00000117643 ENSG00000118058 ENSG00000118520 ENSG00000119397 
##     0.000000000     0.000000000     0.040059464     0.041726697 
## ENSG00000119596 ENSG00000119707 ENSG00000120616 ENSG00000120694 
##     0.000000000     0.000000000     0.000000000     0.032741634 
## ENSG00000120800 ENSG00000121931 ENSG00000122224 ENSG00000122435 
##     0.041726697     0.047675051     0.000000000     0.000000000 
## ENSG00000122965 ENSG00000122986 ENSG00000124733 ENSG00000125498 
##     0.006772152     0.041726697     0.041726697     0.032741634 
## ENSG00000125691 ENSG00000125734 ENSG00000126353 ENSG00000126749 
##     0.000000000     0.040059464     0.000000000     0.041726697 
## ENSG00000127914 ENSG00000129003 ENSG00000129084 ENSG00000129351 
##     0.041726697     0.000000000     0.029358000     0.019253386 
## ENSG00000130921 ENSG00000131469 ENSG00000131797 ENSG00000131914 
##     0.000000000     0.000000000     0.041726697     0.000000000 
## ENSG00000132424 ENSG00000133112 ENSG00000134186 ENSG00000135213 
##     0.000000000     0.000000000     0.041726697     0.041726697 
## ENSG00000135596 ENSG00000135749 ENSG00000135968 ENSG00000135976 
##     0.032741634     0.000000000     0.000000000     0.000000000 
## ENSG00000136279 ENSG00000136942 ENSG00000137070 ENSG00000137103 
##     0.040059464     0.000000000     0.029358000     0.006772152 
## ENSG00000137818 ENSG00000138134 ENSG00000138326 ENSG00000138398 
##     0.000000000     0.000000000     0.000000000     0.032741634 
## ENSG00000138757 ENSG00000138795 ENSG00000139193 ENSG00000139668 
##     0.032741634     0.000000000     0.000000000     0.041726697 
## ENSG00000140386 ENSG00000141293 ENSG00000142102 ENSG00000142207 
##     0.000000000     0.040059464     0.000000000     0.025087745 
## ENSG00000142534 ENSG00000142541 ENSG00000142546 ENSG00000142676 
##     0.032741634     0.000000000     0.000000000     0.000000000 
## ENSG00000142937 ENSG00000143033 ENSG00000143924 ENSG00000144029 
##     0.000000000     0.041726697     0.000000000     0.036949315 
## ENSG00000144674 ENSG00000145241 ENSG00000145425 ENSG00000145592 
##     0.000000000     0.000000000     0.000000000     0.041726697 
## ENSG00000145734 ENSG00000146414 ENSG00000148843 ENSG00000149311 
##     0.000000000     0.000000000     0.000000000     0.000000000 
## ENSG00000149806 ENSG00000149970 ENSG00000151657 ENSG00000152818 
##     0.000000000     0.006772152     0.032741634     0.006772152 
## ENSG00000153006 ENSG00000154265 ENSG00000154358 ENSG00000154781 
##     0.006772152     0.029358000     0.041726697     0.006772152 
## ENSG00000154814 ENSG00000154874 ENSG00000155657 ENSG00000156256 
##     0.000000000     0.000000000     0.000000000     0.000000000 
## ENSG00000156467 ENSG00000157833 ENSG00000157992 ENSG00000158411 
##     0.032741634     0.006772152     0.006772152     0.036949315 
## ENSG00000159140 ENSG00000160185 ENSG00000160654 ENSG00000160799 
##     0.000000000     0.000000000     0.000000000     0.032741634 
## ENSG00000161016 ENSG00000161551 ENSG00000162777 ENSG00000162894 
##     0.000000000     0.041726697     0.045055543     0.000000000 
## ENSG00000163029 ENSG00000163492 ENSG00000163519 ENSG00000163682 
##     0.040059464     0.000000000     0.000000000     0.006772152 
## ENSG00000163714 ENSG00000164190 ENSG00000164587 ENSG00000165813 
##     0.000000000     0.041726697     0.032741634     0.006772152 
## ENSG00000166012 ENSG00000166762 ENSG00000167526 ENSG00000167548 
##     0.000000000     0.000000000     0.041726697     0.032741634 
## ENSG00000167840 ENSG00000167904 ENSG00000168010 ENSG00000168028 
##     0.041726697     0.041726697     0.041726697     0.041726697 
## ENSG00000168280 ENSG00000168421 ENSG00000168614 ENSG00000168685 
##     0.041726697     0.000000000     0.006772152     0.013461717 
## ENSG00000169246 ENSG00000170486 ENSG00000170776 ENSG00000171103 
##     0.041726697     0.000000000     0.047675051     0.032741634 
## ENSG00000171634 ENSG00000171817 ENSG00000171858 ENSG00000172757 
##     0.000000000     0.000000000     0.032741634     0.041726697 
## ENSG00000172809 ENSG00000172893 ENSG00000173530 ENSG00000173706 
##     0.000000000     0.041726697     0.032741634     0.000000000 
## ENSG00000174501 ENSG00000176476 ENSG00000177455 ENSG00000177600 
##     0.000000000     0.041726697     0.029358000     0.000000000 
## ENSG00000178188 ENSG00000178458 ENSG00000178988 ENSG00000179144 
##     0.000000000     0.029358000     0.047675051     0.000000000 
## ENSG00000180376 ENSG00000181007 ENSG00000181847 ENSG00000182230 
##     0.000000000     0.000000000     0.000000000     0.000000000 
## ENSG00000182319 ENSG00000182670 ENSG00000183019 ENSG00000183091 
##     0.040059464     0.041726697     0.029358000     0.006772152 
## ENSG00000183378 ENSG00000183426 ENSG00000183495 ENSG00000183625 
##     0.041726697     0.000000000     0.032741634     0.040059464 
## ENSG00000183662 ENSG00000183918 ENSG00000184206 ENSG00000184613 
##     0.025087745     0.041726697     0.000000000     0.019253386 
## ENSG00000185101 ENSG00000185485 ENSG00000186166 ENSG00000186481 
##     0.000000000     0.041726697     0.000000000     0.000000000 
## ENSG00000186854 ENSG00000189195 ENSG00000196199 ENSG00000196268 
##     0.000000000     0.041726697     0.000000000     0.000000000 
## ENSG00000196329 ENSG00000196683 ENSG00000196912 ENSG00000197561 
##     0.041726697     0.000000000     0.000000000     0.032741634 
## ENSG00000197756 ENSG00000197830 ENSG00000197958 ENSG00000198000 
##     0.000000000     0.000000000     0.000000000     0.000000000 
## ENSG00000198034 ENSG00000198399 ENSG00000198625 ENSG00000198722 
##     0.006772152     0.025087745     0.000000000     0.040059464 
## ENSG00000198851 ENSG00000198874 ENSG00000198945 ENSG00000204219 
##     0.000000000     0.000000000     0.032741634     0.000000000 
## ENSG00000204256 ENSG00000204305 ENSG00000204475 ENSG00000204574 
##     0.047675051     0.000000000     0.029358000     0.000000000 
## ENSG00000204628 ENSG00000205268 ENSG00000211745 ENSG00000211772 
##     0.032741634     0.041726697     0.000000000     0.000000000 
## ENSG00000211810 ENSG00000211821 ENSG00000213619 ENSG00000213741 
##     0.000000000     0.000000000     0.032741634     0.032741634 
## ENSG00000214106 ENSG00000214900 ENSG00000215630 ENSG00000215908 
##     0.041726697     0.000000000     0.047675051     0.000000000 
## ENSG00000219481 ENSG00000223705 ENSG00000224660 ENSG00000225113 
##     0.000000000     0.032741634     0.032741634     0.000000000 
## ENSG00000225241 ENSG00000226179 ENSG00000228314 ENSG00000229164 
##     0.032741634     0.000000000     0.041726697     0.000000000 
## ENSG00000229807 ENSG00000230629 ENSG00000231621 ENSG00000232815 
##     0.032741634     0.047675051     0.029358000     0.032741634 
## ENSG00000233927 ENSG00000233985 ENSG00000236333 ENSG00000239219 
##     0.000000000     0.000000000     0.032741634     0.000000000 
## ENSG00000239839 ENSG00000240038 ENSG00000241106 ENSG00000243536 
##     0.029358000     0.000000000     0.041726697     0.006772152 
## ENSG00000245164 ENSG00000246223 ENSG00000247774 ENSG00000251606 
##     0.032741634     0.000000000     0.032741634     0.019253386 
## ENSG00000261386 ENSG00000264290 ENSG00000266371 ENSG00000266897 
##     0.019253386     0.019253386     0.000000000     0.032741634 
## ENSG00000267373 ENSG00000268632 ENSG00000268746 ENSG00000269161 
##     0.006772152     0.041726697     0.029358000     0.032741634 
## ENSG00000269640 
##     0.032741634
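As a side note on the Benjamini-Hochberg adjustment used above: each P-value is scaled by n/rank (then made monotone from the largest down), which controls the false discovery rate rather than the family-wise error. A small worked example with made-up P-values:

```r
p <- c(0.001, 0.010, 0.020, 0.040, 0.200)  # already sorted for clarity

# BH: p * n / rank, e.g. 0.020 * 5 / 3 = 0.0333
p.adjust(p, method = "BH")
# 0.005000 0.025000 0.033333 0.050000 0.200000
```

At an FDR threshold of 0.05, four of the five toy features would be kept, whereas a Bonferroni correction (p * n) would keep only two.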