This document contains a tutorial-style vignette for the FMradio package. This tutorial is also tied to the radiomics practical as given during the 2019 Statistics for Omics Course. It assumes knowledge of R.
The exercises and code in this document are cumulative: each exercise and code block builds upon the preceding exercises and code. Code is contained in blocks, and in-text references to functions, function arguments, and objects will be denoted as foo. The data as well as supporting R files can be found on this GitHub page.
Following the current section containing preliminaries, there are sections highlighting FMradio functionality for factor analytic projection (Section 1) and usage of this projection in downstream classification and predictive modeling (Section 2). The FMradio package (Peeters, 2019) implements methodology as described in Peeters et al. (2019). (See Section FMradio References below).
Set the working directory at your convenience and load the required libraries:
## Set working directory to convenience
setwd(" ")
## Needed library
library(FMradio)
## Other requirements
require("Biobase")
require("rags2ridges")
require("DandEFA")
require("randomForestSRC")
require("pec")
require("survival")
## Convenience functions
source("Convenience.R")
The additional libraries are used in downstream predictive modeling. The source file contains some convenience functions for visualization of results. These are all of use in Section 2.
Radiomics refers to “the high-throughput mining of quantitative features from radiographic images. Such images stem from technology such as computed tomography (CT), positron emission tomography (PET), and magnetic resonance imaging (MRI). (…) The promise of radiomic data is that it may be a basis, through the information contained in standard-of-care images, for non-invasive medical decision support at low (additional) cost. As such, it is often viewed as an addition to or alternative for screening and classification based on molecular omics.” (Peeters et al., 2019). We will be interested in the downstream analysis of radiomics data.
Throughout, we will be using (anonymized and scrambled) radiomics data on primary (hypopharyngeal, oropharyngeal, and laryngeal) tumors in head and neck squamous cell carcinoma (HNSCC). These are cancers that arise from squamous cells (cells on the outer surfaces of organs) of the mouth, nose, and throat. The features concern \(p = 432\) radiomic variables extracted from PET/CT based images. The outcome measurements pertain to survival (in months) with the event of interest being death. The set contains \(n = 174\) primary tumors and \(55\) events. Additional information on the data can be found in Martens et al. (2019).
The data are contained as an expressionSet (AnonymousRadio) in the AnonymousRadio.Rdata object. An expressionSet is a container object that couples the measurement data to the sample information. exprs(AnonymousRadio) is a matrix containing measurements of \(p = 432\) radiomic features (rows) on \(n = 174\) samples (columns). pData(AnonymousRadio) is a data.frame containing clinical information on the samples. This information pertains to the event indicator, 0 (no death observed) or 1 (death observed), as well as the follow-up time for each patient.
|—— Exercise 1: Get acquainted with the data: ——|
## Data packaged as expressionset
## Will invoke basic functions for looking at data
## Load data, get to know objects
load("AnonymousRadio.Rdata")
## Look at radiomic measurements
exprs(AnonymousRadio)[10:30,1:5]
## Look at Clinical (sample) information
head(pData(AnonymousRadio))
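As a quick sanity check one may verify the stated dimensions and number of events. A minimal sketch (the Death_yesno column name is taken from the code used later in this tutorial):
## Quick sanity check on dimensions
dim(exprs(AnonymousRadio))
## Event counts; 55 deaths expected
table(pData(AnonymousRadio)$Death_yesno)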
The aim is to project the high-dimensional (and highly collinear) radiomic feature-space onto a lower-dimensional (and non-collinear) latent meta-feature space. These latent meta-features may then be used in downstream classification and prediction (Section 2). The factor analytic projection basically has 3 steps: (i) penalized estimation of a (redundancy filtered) correlation matrix, (ii) data compression by factor analysis, and (iii) extraction of factor scores. The result is a compact set of stable (meta) features that can be directly used in classification or prediction. This section then revolves around correlation estimation (Exercises 2–4), factor analysis (Exercise 5), and score extraction (Exercises 6–7). For clarity, each exercise is contained in a separate subsection.
Factor analytic projection revolves around the factor analytic decomposition of a correlation matrix. Hence, we first need an estimate of the correlation matrix. The correlation matrix estimate will have to be a regularized estimate as the number of features (\(p = 432\)) is larger than the number of observations (\(n = 174\)). Before embarking on finding a regularized estimate, we first make a basic assessment of the correlation matrix, especially regarding the issue of collinearity.
|—— Exercise 2: Assess redundancy in the correlation matrix ——|
First, we scale the (transposed) data, as the features have different scales and as the variability of the features may differ substantially. This scaling is not a necessity for finding the correlation matrix through the cor function. However, in high-dimensional prediction problems we typically want to scale the features. Hence, later on we need the scaled data object. Second, we visualize feature-redundancy through a thresholded correlation heatmap.
## Scale data and get raw correlation matrix
DATAscaled <- scale(t(exprs(AnonymousRadio)))
R <- cor(DATAscaled)
## Redundancy visualization
radioHeat(R, diag = FALSE,
threshold = TRUE,
threshvalue = .95,
labelsize = .01)
The resulting visualization gives a thresholded correlation heatmap of the radiomic features. Absolute correlations \(< .95\) were set to zero such that only correlations equalling or exceeding an absolute marginal correlation threshold of \(.95\) are visualized. Red indicates a positive correlation above the threshold, blue indicates a negative correlation below the threshold. Colored regions then visualize blocks of information redundancy. We notice quite some redundancy, i.e., the information contained in some features is almost completely represented by other features.
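The visual impression can be complemented with a simple numerical summary. A minimal base-R sketch (not part of FMradio) of the fraction of feature pairs at or above the threshold:
## Fraction of feature pairs with |correlation| >= .95
mean(abs(R[upper.tri(R)]) >= .95)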
There is quite some redundant information in the data. Hence, one could consider redundancy-filtering as a preprocessing step. The goal would be to remove truly redundant features.
|—— Exercise 3: Redundancy-filter the correlation matrix ——|
The RF function redundancy-filters a correlation matrix at a specified threshold t. It removes the minimal number of redundant features such that no pair of features remains with an absolute marginal correlation that equals or exceeds t. In practice we recommend setting the value of t to \(.9\) or \(.95\). Here we use \(.95\). The subSet function is a convenience function that subsets your data to those features that were retained after redundancy-filtering.
## Redundancy filtering
## And subsetting data
## 124 features remain
RFs <- RF(R, t = .95)
DATAscaledS <- subSet(DATAscaled, RFs)
Calling dim(RFs)[2] will show that, after filtering, \(124\) features remain. These features will still be highly collinear in the sense that they share many high correlations. However, they are not redundant from the perspective of the specified threshold t. The filtered correlation matrix, and the correspondingly subsetted data, will be the main inputs for the next step.
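To verify that the filter behaved as advertised, one may check that no off-diagonal entry of the filtered correlation matrix meets the threshold. A small sketch, assuming (as the text implies) that RF returns the filtered correlation matrix itself:
## No remaining pair should meet or exceed t = .95
dim(RFs)
max(abs(RFs[upper.tri(RFs)]))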
Now that the truly redundant features are removed we can embark on finding a regularized estimate of the correlation matrix. Regularization (i.e., penalization) ensures that the resulting estimate will be numerically stable. This numerical stability is needed for the factor analytic projection.
|—— Exercise 4: Find an optimal regularized correlation matrix ——|
The subsetted scaled data are used in the regcor function. This function determines the optimal value of the penalty parameter by application of the Brent algorithm (Brent, 1971) to the K-fold cross-validated negative log-likelihood score (using a regularized ridge estimator for the correlation matrix). The search for the optimal value is automatic. The penalty value at which the K-fold cross-validated negative log-likelihood score is minimized is deemed optimal. The output is an object of class list, with $optPen containing the optimal penalty value and $optCor containing the correlation matrix under the optimal value of the penalty parameter.
## Optimal penalty
set.seed(303)
OPT <- regcor(DATAscaledS, fold = 5)
## Look at optimal penalty-value
## Obtain regularized correlation matrix
## Conditioning can again be assessed with, e.g., CNplot from rags2ridges
OPT$optPen
Re <- OPT$optCor
The resulting regularized correlation matrix will be the basis for the factor analysis procedure.
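The comment above mentions assessing conditioning with CNplot from rags2ridges. A simpler base-R impression of the same idea compares the extreme eigenvalues of the raw and regularized correlation estimates; a smallest eigenvalue near zero signals ill-conditioning:
## Extreme eigenvalues of raw vs regularized correlation estimates
range(eigen(cor(DATAscaledS), only.values = TRUE)$values)
range(eigen(Re, only.values = TRUE)$values)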
Factor analysis is a technique that assumes that observed variables can be grouped on the basis of their correlation into a lower-dimensional linear combination of latent features. Hence, a main goal of this technique is to project (if you will) the information contained in a higher-dimensional observation space onto a lower-dimensional latent meta-feature space. This projection can be done orthogonally, such that the latent features are uncorrelated with each other. This technique then is a natural candidate to counter the main challenges of radiomics data (being high-dimensionality and collinearity). Here, we perform a form of regularized factor analysis, for which one of the main inputs will be our previously obtained regularized correlation matrix.
|—— Exercise 5: Perform factor analysis on the regularized correlation matrix ——|
The main function for performing a factor analysis is mlFA. Its inputs are R, the regularized correlation matrix, and m, the number of latent meta-features. Usually, the dimension of m will be unknown a priori. One may then precede the factor analysis with the dimGB function. This function calculates (upper-)bounds to the dimensionality of the latent vector. The function returns an object of class table, with entries corresponding to 3 separate bounds. Peeters et al. (2019) contains an extensive simulation study showing that in high-dimensional situations the first reported bound provides a reliable upper-bound to the latent dimension. The choice of the number of factors can be further assessed with the dimVAR function. The assessment provided by this latter function (which returns the total cumulative explained variance against the dimension of the factor solution, possibly with visualization) may inform whether the result of the first bound should be accepted or be treated as an upper-bound.
## Assess dimensionality factor solution
## 13 considered upper bound
## Variance explained would suggest 8 factors
dimGB(Re)
dimVAR(Re, 15, graph = TRUE)
## Assessing solutions around 8
## 9th factor seems weak
## Will keep solution at 8
## ML factor analysis with Varimax rotation
fito <- mlFA(R = Re, m = 8)
print(fito$Loadings, digits = 2, cutoff = .3, sort = TRUE)
The mlFA function performs a maximum likelihood factor analysis with an orthogonal simple structure rotation. It returns an object of class list. fito$Loadings contains a matrix whose elements represent the loadings of the observed features on the latent meta-features. This matrix may be used to interpret the structure of the factor solution and the possible meaning of the latent meta-features. fito$Uniqueness contains a diagonal matrix whose elements represent the unique variances (variance not explained by the model).
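For interpretation it can be convenient to list, per factor, the features with large absolute loadings. A minimal sketch for the first factor (the .3 cutoff mirrors the print call above):
## Features loading strongly on the first factor
L <- unclass(fito$Loadings)
rownames(L)[abs(L[, 1]) >= .3]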
The factor solution under the chosen latent dimension can be visualized as a Dandelion plot (Manukyan et al., 2014). This can be done with the dandelion function from the DandEFA package.
## Visualizing solution using Dandelion plot
dandpal <- rev(rainbow(100, start = 0.4, end = 0.6))
dandelion(fito$Loadings, bound = .3, mcex = c(1,1), palet = dandpal)
Exporting the plot to PDF gives a better visual than the R Plots window:
## Export to pdf for inspection
pdf(file = "Dandelion.pdf", width = 11, height = 11)
dandelion(fito$Loadings, bound = .3, mcex = c(1,1), palet = dandpal)
dev.off()
Once a factor model is fitted one may desire an estimate of the score each object/individual would obtain on each of the latent factors. Such scores are referred to as factor scores. These scores can then serve as the low-dimensional and non-collinear input into (standard) prediction procedures.
|—— Exercise 6: Obtain factor scores ——|
The facScore function provides several types of factor score estimates. The default estimates are regression-type estimates. Its input consists of the scaled (possibly subsetted) data and the matrices from the $Loadings and $Uniqueness slots stemming from a previous call to mlFA.
## Factor scores
Lambda <- fito$Loadings
Psi <- fito$Uniqueness
Scores <- facScore(DATAscaledS, Lambda, Psi)
The function returns an object of class data.frame. Observations are represented in the rows. Each column represents a latent factor.
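As the projection is orthogonal, the extracted scores should be close to uncorrelated. A quick check (regression-type scores need not be exactly orthogonal):
## Correlations between factor scores should be near zero off-diagonal
round(cor(Scores), 2)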
Before using the factor scores in a prediction setting one may desire to assess the quality of the factor extraction. This can be probed through the squared multiple correlations between the observed features and the common latent factors. It indicates how well a common factor can be predicted by the observed features or, from another perspective, to what extent the factor scores are determinate.
|—— Exercise 7: Assess factor scores ——|
facSMC is a function with which one may evaluate the factor scores. It indeed calculates the squared multiple correlations between the observed features and the common latent factors. The closer these are to unity, the better one is able to uniquely determine the factor scores, and the more useful these scores are in prediction and classification.
## Determinacy factor scores
## Highly determinate
DF <- facSMC(Re, Lambda); DF
The function returns a numeric vector indicating, for each factor, the squared multiple correlation between the observed features and the common latent factor.
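For intuition, these values can be reproduced manually under the classical determinacy formula \(\rho_j^2 = \lambda_j^\top P^{-1} \lambda_j\), with \(\lambda_j\) the \(j\)th column of the loadings matrix and \(P\) the (regularized) correlation matrix. A sketch, assuming facSMC implements exactly this formula:
## Manual determinacy computation (sketch)
L <- unclass(Lambda)
round(diag(t(L) %*% solve(Re) %*% L), 3)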
We now want to use the meta-features as predictors in a prediction model. Here we have a right-censored survival (time to event) outcome. The response concerns the time from a starting point \(t = 0\) to an event of interest or to censoring. Tied to this response is a status indicator that is \(1\) when the event of interest has occurred during the study-time and \(0\) when the event of interest did not occur during the study-time or when the subject was lost to follow-up (right-censoring). The general aim would then be to predict survival probabilities based on the prognostic features available at baseline. These prognostic features are our factor scores. As the factor scores represent low-dimensional and non-collinear data, we may use standard modeling techniques. The standard modeling technique would then be the Cox proportional hazards model.
We will also take interest in comparing the performance of the simple model with the projected data with a popular machine learning technique often used on radiomics data: random forests. The random survival forest is a learning method in which the predictor is an ensemble formed by combining many decision trees over the original observed radiomic features.
This section then revolves around setting up the necessary preliminaries for model comparison (Exercise 8), assessing model performance and model comparison (Exercises 9–10), and the visualization of the results (Exercise 11).
We will perform some convenient concatenation of the data that will support assessment of model performance as well as the model comparison. In particular, it will be convenient to collect all data of interest into a single data object.
|—— Exercise 8: Concatenate original and projected data ——|
We can use cbind (from the R core packages) to conveniently collect all data of interest column-wise. The data of interest concern the observed radiomic features, the factor scores, as well as the time-to-event data and its associated event indicator.
## Combine original (scaled) data with projected meta-features
DAT <- cbind(DATAscaled, Scores)
## Include the survival information
Status <- as.numeric(AnonymousRadio$Death_yesno) - 1
time <- AnonymousRadio$Death_followuptime_months
DAT <- cbind(time, Status, DAT)
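A quick look at the combined object confirms the column layout assumed in the model formulas of the next exercise: columns 1 and 2 hold time and Status, columns 3 to 434 the \(432\) radiomic features, and columns 435 to 442 the eight factor scores.
## Check dimensions of the combined data object
dim(DAT)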
The data above will be used in setting up the models of interest. It will be convenient to produce a list of models that can be used in an internal validation.
|—— Exercise 9: Set up model comparisons ——|
The outcome measure in survival modeling is the concatenation of the time measurement with the status indicator. Such a survival object can be made with the Surv function from the survival package. The as.formula function can then be combined with the paste function (both from the R core packages) to conveniently produce model formulas (over many features). The coxph (from the survival package) and rfsrc (from the randomForestSRC package) functions can subsequently be used to produce a list consisting of a Cox PH model and a random survival forest (over the respective features of interest).
## Formulating the model formulas
FitRSF <- as.formula(paste("Surv(time, Status)~",
paste(colnames(DAT)[3:434], collapse="+")))
FitMetaCox <- as.formula(paste("Surv(time, Status) ~",
paste(colnames(DAT)[435:442], collapse="+")))
models <- list("MetaCox" = coxph(FitMetaCox, data = DAT, x = TRUE, y = TRUE),
"RforestCox" = rfsrc(FitRSF, data = DAT))
The list of models produced previously can now be subjected to performance assessment. For model evaluation we focus on prediction error. We will use the time-dependent Brier score (Brier, 1950), which may be seen as a mean squared error of prediction. It is a measure of inaccuracy, so smaller values are desired. The Brier score can also be used to calculate a measure of (overall) explained residual variation (\(R^2\)).
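For a fixed time point \(t\), the inverse-probability-of-censoring-weighted estimate of the Brier score (the form used by pec with a marginal censoring model; Graf et al., 1999) is

\[\widehat{\text{BS}}(t) = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\hat{S}(t \mid x_i)^2\,\mathbb{1}\{T_i \le t,\ \delta_i = 1\}}{\hat{G}(T_i)} + \frac{\{1-\hat{S}(t \mid x_i)\}^2\,\mathbb{1}\{T_i > t\}}{\hat{G}(t)}\right],\]

where \(\hat{S}(t \mid x_i)\) is the model-based survival probability for subject \(i\), \(\delta_i\) the status indicator, and \(\hat{G}\) the Kaplan-Meier estimate of the censoring distribution.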
|—— Exercise 10: Compare models w.r.t. prediction error ——|
We have only one dataset available, so we have to guard against overoptimistic prediction errors. Hence, we will use a cross-validated prediction error, based on 5-fold cross-validation. To mitigate the dependency on the random choice of folds, we repeat the 5-fold cross-validation many times and average the corresponding Brier scores over these repeats. This can be conveniently done with the pec function from the pec package.
## Assessing prediction error
## Median follow-up time = 25.7
## (Averaged) repeated 5-fold cross-validation
set.seed(446464)
PredError <- pec(object = models,
formula = Surv(time, Status) ~ 1,
data = DAT,
exact = TRUE,
maxtime = median(time),
cens.model = "marginal",
splitMethod = "cv5",
B = 50,
verbose = TRUE)
## Summary results:
## Apparent and cross-validated prediction error and R2
crps(PredError)
Or2(crps(PredError))
Calling the crps function on the pec object will produce a table with apparent and cross-validated errors. The apparent error is the prediction error obtained when the data used for training are re-used for validation, which will result in optimistic prediction errors. The cross-validation will guard against this overoptimism. Next to the Cox model with the latent features and the random forest with the original observed radiomic features, the table will also give apparent and cross-validated errors for a reference model. The reference model is the Kaplan-Meier prediction rule, which acts as a null model by ignoring all feature information. One desires a model to do better than the reference model.
Looking at the results we see that both models do better than the reference model. The simple model with the factor-analytic meta-features has a better cross-validated prediction error than the random forest model. Moreover, it has a much smaller gap between the apparent and cross-validated errors. These results are mirrored in the (apparent and cross-validated) residual explained variation, which can be obtained with the Or2 function from the source file.
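Here, the residual explained variation at time \(t\) is understood as \(R^2(t) = 1 - \widehat{\text{BS}}_{\text{model}}(t)\,/\,\widehat{\text{BS}}_{\text{reference}}(t)\): the proportional reduction in prediction error relative to the Kaplan-Meier null model.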
It is often insightful to visualize the results.
|—— Exercise 11: Visualize the results ——|
The prediction error object can be visualized with the standard plot function. Using the argument what = "AppErr" will produce the apparent prediction error curve (the time-dependent Brier score plotted against time).
## Visualize apparent prediction error
plot(PredError, what = "AppErr",
xlab = "Time (months)",
ylab = "Apparent prediction error",
legend.cex = .9,
legend.lty = 1,
legend.lwd = 2,
legend.legend = c("Reference model",
"FMradio",
"Random survival forest"),
add.refline = TRUE,
lwd = 1.5,
legend.y.intersp = 1.7)
Using the argument what = "crossvalErr" will produce the cross-validated prediction error curve.
## Visualize cross-validated prediction error
plot(PredError, what = "crossvalErr",
xlab = "Time (months)",
ylab = "Averaged cross-validated prediction error",
legend.cex = .9,
legend.lty = 1,
legend.lwd = 2,
legend.legend = c("Reference model",
"FMradio",
"Random survival forest"),
add.refline = TRUE,
lwd = 1.5,
legend.y.intersp = 1.7)
The residual explained variation over time can be visualized with the plotR2Table function from the source file. Using the argument type = "AE" will produce the apparent residual explained variation plot.
## Visualize apparent residual explained variation
R2Table <- R2(PredError, times = seq(0,median(time),.01), reference = 1)
plotR2Table(R2Table, "AE", Xlab = "Time (months)")
Using the argument type = "CV" will produce the cross-validated residual explained variation plot.
## Visualize cross-validated residual explained variation
plotR2Table(R2Table, "CV", Xlab = "Time (months)")
Exporting the plots to PDF gives a better visual than the R Plots window:
## Exporting
pdf("Internal.pdf", width = 15, height = 15)
par(mfrow=c(2,2))
## Visualize apparent prediction error
plot(PredError, what = "AppErr",
xlab = "Time (months)",
ylab = "Apparent prediction error",
legend.cex = .9,
legend.lty = 1,
legend.lwd = 2,
legend.legend = c("Reference model",
"FMradio",
"Random survival forest"),
add.refline = TRUE,
lwd = 1.5,
legend.y.intersp = 1.7)
## Visualize cross-validated prediction error
plot(PredError, what = "crossvalErr",
xlab = "Time (months)",
ylab = "Averaged cross-validated prediction error",
legend.cex = .9,
legend.lty = 1,
legend.lwd = 2,
legend.legend = c("Reference model",
"FMradio",
"Random survival forest"),
add.refline = TRUE,
lwd = 1.5,
legend.y.intersp = 1.7)
## Visualize apparent residual explained variation
R2Table <- R2(PredError, times = seq(0,median(time),.01), reference = 1)
plotR2Table(R2Table, "AE", Xlab = "Time (months)")
## Visualize cross-validated residual explained variation
plotR2Table(R2Table, "CV", Xlab = "Time (months)")
dev.off()