First we load our libraries and the Prostate Cancer dataset.
```r
library(easyml)
```

```
## Loaded easyml 0.1.1. Also loading ggplot2.
## Loading required namespace: ggplot2
```
```r
library(dplyr)
```

```
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```
```r
library(ggplot2)
data("prostate", package = "easyml")
knitr::kable(head(prostate))
```
| lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 | lpsa |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| -0.5798185 | 2.769459 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.4307829 |
| -0.9942523 | 3.319626 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.1625189 |
| -0.5108256 | 2.691243 | 74 | -1.386294 | 0 | -1.386294 | 7 | 20 | -0.1625189 |
| -1.2039728 | 3.282789 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.1625189 |
| 0.7514161 | 3.432373 | 62 | -1.386294 | 0 | -1.386294 | 6 | 0 | 0.3715636 |
| -1.0498221 | 3.228826 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | 0.7654678 |
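Before modeling, it can help to confirm the data set's size and column types. A minimal sketch using base R (nothing here is specific to easyml):

```r
# Quick sanity check of the prostate data set
dim(prostate)  # number of rows and columns
str(prostate)  # column names and types
```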
To run an `easy_support_vector_machine` model, we pass in the following parameters:

* the data set `prostate`,
* the name of the dependent variable, `lpsa`,
* whether to run a `gaussian` or a `binomial` model,
* the function `preprocess_scale` to scale the data.

```r
results <- easy_support_vector_machine(prostate, "lpsa",
                                       n_samples = 10, n_divisions = 10,
                                       n_iterations = 2, random_state = 12345,
                                       n_core = 1)
```
## [1] "Generating predictions for a single train test split:"
## [1] "Generating measures of model performance over multiple train test splits:"
Now let's assess the results of the `easy_support_vector_machine` model.
We can examine both the in-sample and out-of-sample ROC curve plots for one particular train-test split, determined by the random state, and use the Area Under the Curve (AUC) as a goodness-of-fit metric. Here, we see that the in-sample AUC is higher than the out-of-sample AUC, but both metrics indicate that the model fits relatively well.
```r
results$plot_predictions_single_train_test_split_train
```

```r
results$plot_predictions_single_train_test_split_test
```
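As a reminder of what these plots summarize, here is a minimal, self-contained sketch of computing an AUC with the `pROC` package; the labels and scores below are made-up stand-ins, not fields of `results`.

```r
# Minimal AUC computation with pROC; y_true and y_score are
# hypothetical stand-ins for one split's outcomes and predictions.
library(pROC)
y_true  <- c(0, 0, 1, 1, 1, 0, 1, 0)
y_score <- c(0.2, 0.4, 0.8, 0.9, 0.6, 0.3, 0.7, 0.1)
auc(roc(y_true, y_score))
```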
We can examine both the in-sample and out-of-sample AUC metrics for `n_divisions` train-test splits (which usually defaults to 1,000; here we set it to 10). Again, we see that the in-sample AUC is higher than the out-of-sample AUC, but both metrics indicate that the model fits relatively well.
```r
results$plot_model_performance_train
```

```r
results$plot_model_performance_test
```
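To make the idea of performance over repeated splits concrete, here is an illustrative base-R sketch of the general pattern; a simple linear model and out-of-sample R-squared stand in for easyml's SVM and AUC, and this is not easyml's internal code.

```r
# Illustrative only: repeat a random 70/30 train-test split and collect
# an out-of-sample metric each time, analogous to what n_divisions
# controls inside easyml.
set.seed(12345)
metrics <- replicate(10, {
  idx   <- sample(nrow(prostate), size = floor(0.7 * nrow(prostate)))
  train <- prostate[idx, ]
  test  <- prostate[-idx, ]
  fit   <- lm(lpsa ~ ., data = train)   # stand-in model
  cor(predict(fit, test), test$lpsa)^2  # out-of-sample R-squared
})
summary(metrics)
```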