Overview

In this vignette, we demonstrate the power of easyml using a Prostate Cancer dataset.

Load the data

First we load our libraries and the Prostate Cancer dataset.

library(easyml)
## Loaded easyml 0.1.1. Also loading ggplot2.
## Loading required namespace: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

data("prostate", package = "easyml")
knitr::kable(head(prostate))
    lcavol   lweight  age       lbph  svi        lcp  gleason  pgg45        lpsa
-0.5798185  2.769459   50  -1.386294    0  -1.386294        6      0  -0.4307829
-0.9942523  3.319626   58  -1.386294    0  -1.386294        6      0  -0.1625189
-0.5108256  2.691243   74  -1.386294    0  -1.386294        7     20  -0.1625189
-1.2039728  3.282789   58  -1.386294    0  -1.386294        6      0  -0.1625189
 0.7514161  3.432373   62  -1.386294    0  -1.386294        6      0   0.3715636
-1.0498221  3.228826   50  -1.386294    0  -1.386294        6      0   0.7654678
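
Before modeling, it can help to take a quick look at the shape of the data and the dependent variable. A minimal sketch (the exact output depends on the packaged version of the dataset):

# Dimensions, a summary of the dependent variable, and a missing-value check
dim(prostate)
summary(prostate$lpsa)
colSums(is.na(prostate))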

Train a support vector machine model

To run an easy_support_vector_machine model, we can specify the following parameters:

  • the data set, prostate,
  • the name of the dependent variable, e.g. lpsa,
  • whether to run a gaussian or a binomial model,
  • how to preprocess the data, e.g. preprocess_scale to scale the data,
  • how many samples, train-test divisions, and iterations to run,
  • the random state,
  • whether to display a progress bar,
  • how many cores to run the analysis on in parallel.
results <- easy_support_vector_machine(prostate, "lpsa", 
                                       n_samples = 10, n_divisions = 10, 
                                       n_iterations = 2, random_state = 12345, 
                                       n_core = 1)
## [1] "Generating predictions for a single train test split:"
## [1] "Generating measures of model performance over multiple train test splits:"

Assess results

Now let’s assess the results of the easy_support_vector_machine model.

Predictions: Single Train-Test Split

We can examine both the in-sample and out-of-sample prediction plots for one particular train-test split, determined by the random state, and use the agreement between predicted and observed lpsa values as a goodness-of-fit check. Here, we see that the in-sample fit is somewhat better than the out-of-sample fit, but that both indicate the model fits relatively well.

results$plot_predictions_single_train_test_split_train

results$plot_predictions_single_train_test_split_test
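
To make concrete what these plots summarize, here is a minimal, self-contained sketch of a single train-test split using e1071::svm directly rather than easyml; the split fraction and seed are illustrative choices, not easyml's internals:

# Hypothetical single train-test split, independent of easyml
library(e1071)
data("prostate", package = "easyml")
set.seed(12345)

train_idx <- sample(seq_len(nrow(prostate)), size = floor(0.67 * nrow(prostate)))
train_data <- prostate[train_idx, ]
test_data <- prostate[-train_idx, ]

# SVM regression of lpsa on all other variables
fit <- svm(lpsa ~ ., data = train_data)

# In-sample and out-of-sample agreement between predicted and observed lpsa
cor(predict(fit, newdata = train_data), train_data$lpsa)
cor(predict(fit, newdata = test_data), test_data$lpsa)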

Metrics: Multiple Train-Test Splits

We can examine both the in-sample and out-of-sample model performance metrics across the n_divisions train-test splits (here 10; the default is usually 1,000). Again, we see that the in-sample performance is better than the out-of-sample performance, but that both indicate the model fits relatively well.

results$plot_model_performance_train

results$plot_model_performance_test
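
Similarly, the multiple-split performance plots can be thought of as summarizing the distribution of an out-of-sample metric over repeated train-test splits. A hedged sketch of that idea, again using e1071::svm directly and an illustrative number of splits:

# Repeat the split-fit-evaluate cycle and collect out-of-sample correlations
library(e1071)
data("prostate", package = "easyml")
set.seed(12345)

n_divisions <- 10
test_cors <- replicate(n_divisions, {
  idx <- sample(seq_len(nrow(prostate)), size = floor(0.67 * nrow(prostate)))
  fit <- svm(lpsa ~ ., data = prostate[idx, ])
  cor(predict(fit, newdata = prostate[-idx, ]), prostate$lpsa[-idx])
})

# Summarize the distribution of out-of-sample correlations across splits
summary(test_cors)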

Discuss

In this vignette we used easyml to easily build and evaluate a support vector machine model using a Prostate Cancer dataset.