--- title: "The T-Rex Selector: Usage and Simulations" author: | | Jasin Machkour^\#^, Simon Tien^\#^, Daniel P. Palomar^\*^, Michael Muma^\#^ | | ^\#^Technische Universität Darmstadt | ^\*^The Hong Kong University of Science and Technology date: "`r Sys.Date()`" output: html_document: theme: flatly highlight: pygments toc: yes toc_depth: 1 toc_float: yes css: vignette_styles.css mathjax: "https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-AMS_CHTML.js" prettydoc::html_pretty: theme: tactile highlight: vignette toc: yes toc_depth: 2 toc-title: "Table of Contents" csl: ieee.csl bibliography: refs.bib nocite: | @machkour2022terminating, @machkour2024dependency, @machkour2024TRexIndexTracking, @machkour2022TRexGVS, @machkour2023InformedEN, @machkour2023ScreenTRex, @machkour2024sparse, @scheidt2023FDRControlLaptop, @koka2024false, @efron2004least, @tibshirani1996regression, @zou2005regularization vignette: | %\VignetteKeyword{T-Rex selector, false discovery rate (FDR) control, T-LARS, high-dimensional variable selection, martingale theory, genome-wide association studies (GWAS)} %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{The T-Rex Selector: Usage and Simulations} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} # Store user's options() old_options <- options() library(knitr) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", fig.retina = 2, out.width = "85%", dpi = 96 # pngquant = "--speed=1" ) options(width = 80) ``` ----------- # Motivation The T-Rex selector performs fast variable/feature selection in large-scale high-dimensional settings. It provably controls the false discovery rate (FDR), i.e., the expected fraction of selected false positives among all selected variables, at the user-defined target level. In addition to controlling the FDR, it also achieves a high true positive rate (TPR) (i.e., power) by maximizing the number of selected variables. It performs terminated-random experiments (T-Rex) using the T-LARS algorithm ([R package](https://CRAN.R-project.org/package=tlars)) and fuses the selected active sets of all random experiments to obtain a final set of selected variables. The T-Rex selector can be applied in various fields, such as genomics, financial engineering, or any other field that requires a fast and FDR-controlling variable/feature selection method for large-scale high-dimensional settings (see, e.g., [@machkour2022terminating; @machkour2024dependency; @machkour2024TRexIndexTracking; @machkour2022TRexGVS; @machkour2023InformedEN; @machkour2023ScreenTRex; @machkour2024sparse; @scheidt2023FDRControlLaptop; @koka2024false]). $$ \DeclareMathOperator{\FDP}{FDP} \DeclareMathOperator{\FDR}{FDR} \DeclareMathOperator{\TPP}{TPP} \DeclareMathOperator{\TPR}{TPR} \newcommand{\A}{\mathcal{A}} \newcommand{\X}{\boldsymbol{X}} \newcommand{\XWK}{\boldsymbol{\tilde{X}}} \newcommand{\C}{\mathcal{C}} \newcommand{\coloneqq}{\mathrel{\vcenter{:}}=} $$ # Installation Before installing the 'TRexSelector' package, you need to install the required 'tlars' package. You can install the 'tlars' package from [CRAN](https://CRAN.R-project.org/package=tlars) (stable version) or [GitHub](https://github.com/jasinmachkour/tlars) (developer version) with: ```{r, eval=FALSE} # Option 1: Install stable version from CRAN install.packages("tlars") # Option 2: install developer version from GitHub install.packages("devtools") devtools::install_github("jasinmachkour/tlars") ``` Then, you can install the 'TRexSelector' package from [CRAN](https://CRAN.R-project.org/package=TRexSelector) (stable version) or [GitHub](https://github.com/jasinmachkour/TRexSelector) (developer version) with: ```{r, eval=FALSE} # Option 1: Install stable version from CRAN install.packages("TRexSelector") # Option 2: install developer version from GitHub install.packages("devtools") devtools::install_github("jasinmachkour/TRexSelector") ``` You can open the help pages with: ```{r, eval=FALSE} library(TRexSelector) help(package = "TRexSelector") ?trex ?random_experiments ?lm_dummy ?add_dummies ?add_dummies_GVS ?FDP ?TPP # etc. ``` To cite the package 'TRexSelector' in publications use: ```{r, eval=FALSE} citation("TRexSelector") ``` # Quick Start This section illustrates the basic usage of the 'TRexSelector' package to perform FDR-controlled variable selection in large-scale high-dimensional settings based on the T-Rex selector. 1. **First**, we generate a high-dimensional Gaussian data set with sparse support: ```{r} library(TRexSelector) # Setup n <- 75 # number of observations p <- 150 # number of variables num_act <- 3 # number of true active variables beta <- c(rep(1, times = num_act), rep(0, times = p - num_act)) # coefficient vector true_actives <- which(beta > 0) # indices of true active variables num_dummies <- p # number of dummy predictors (also referred to as dummies) # Generate Gaussian data set.seed(123) X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p) y <- X %*% beta + stats::rnorm(n) ``` 2. **Second**, we perform FDR-controlled variable selection using the T-Rex selector for a target FDR of 5%: ```{r} # Seed set.seed(1234) # Numerical zero eps <- .Machine$double.eps # Variable selection via T-Rex res <- trex(X = X, y = y, tFDR = 0.05, verbose = FALSE) selected_var <- which(res$selected_var > eps) paste0("True active variables: ", paste(as.character(true_actives), collapse = ", ")) paste0("Selected variables: ", paste(as.character(selected_var), collapse = ", ")) ``` So, for a preset target FDR of 5%, the T-Rex selector has selected all true active variables and there are no false positives in this example. Note that users have to choose the target FDR according to the requirements of their specific applications. # FDR and TPR ## False discovery rate (FDR) and true positive rate (TPR) We give a mathematical definition of two important metrics in variable selection, i.e., the false discovery rate (FDR) and the true positive rate (TPR): **Definitions** (FDR and TPR) Let $\widehat{\A} \subseteq \lbrace 1, \ldots, p \rbrace$ be the set of selected variables, $\A \subseteq \lbrace 1, \ldots, p \rbrace$ the set of true active variables, $| \widehat{\A} |$ the cardinality of $\widehat{\A}$, and define $1 \lor a \coloneqq \max\lbrace 1, a \rbrace$, $a \in \mathbb{R}$. Then, the false discovery rate (FDR) and the true positive rate (TPR) are defined by $$ \FDR \coloneqq \mathbb{E} \big[ \FDP \big] \coloneqq \mathbb{E} \left[ \dfrac{\big| \widehat{\A} \backslash \A \big|}{1 \lor \big| \widehat{\A} \big|} \right] $$ and $$ \TPR \coloneqq \mathbb{E} \big[ \TPP \big] \coloneqq \mathbb{E} \left[ \dfrac{| \A \cap \widehat{\A} |}{1 \lor | \A |} \right], $$ respectively. Ideally, the $\FDR = 0$ and the $\TPR = 1$. In practice, this is not always possible. Therefore, the FDR is controlled on a sufficiently low level, while the TPR is maximized. # Simulations Let us have a look at the behavior of the T-Rex selector for different choices of the target FDR. We conduct Monte Carlo simulations and plot the resulting averaged false discovery proportions (FDP) and true positive proportions (TPP) over the target FDR. Note that the averaged FDP and TPP are estimates of the FDR and TPR, respectively: ```{r} # Computations might take up to 10 minutes... Please wait... # Numerical zero eps <- .Machine$double.eps # Seed set.seed(1234) # Setup n <- 100 # number of observations p <- 150 # number of variables # Parameters num_act <- 10 # number of true active variables beta <- rep(0, times = p) # coefficient vector (all zeros first) beta[sample(seq(p), size = num_act, replace = FALSE)] <- 1 # coefficient vector (active variables with non-zero coefficients) true_actives <- which(beta > 0) # indices of true active variables tFDR_vec <- c(0.1, 0.15, 0.2, 0.25) # target FDR levels MC <- 100 # number of Monte Carlo runs per stopping point # Initialize results vectors FDP <- matrix(NA, nrow = MC, ncol = length(tFDR_vec)) TPP <- matrix(NA, nrow = MC, ncol = length(tFDR_vec)) # Run simulations for (t in seq_along(tFDR_vec)) { for (mc in seq(MC)) { # Generate Gaussian data X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p) y <- X %*% beta + stats::rnorm(n) # Run T-Rex selector res <- trex(X = X, y = y, tFDR = tFDR_vec[t], verbose = FALSE) selected_var <- which(res$selected_var > eps) # Results FDP[mc, t] <- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var)) TPP[mc, t] <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives)) } } # Compute estimates of FDR and TPR by averaging FDP and TPP over MC Monte Carlo runs FDR <- colMeans(FDP) TPR <- colMeans(TPP) ``` ```{r FDR_and_TPR, echo=FALSE, fig.align='center', message=FALSE, fig.width=12, fig.height=5, out.width = "95%"} # Plot results library(ggplot2) library(patchwork) tFDR_vec_percent <- 100 * tFDR_vec plot_data <- data.frame(tFDR_vec = tFDR_vec_percent, FDR = 100 * FDR, TPR = 100 * TPR) # data frame containing data to be plotted (FDR and TPR in %) # FDR vs. tFDR FDR_vs_tFDR <- ggplot(plot_data, aes(x = tFDR_vec_percent, y = FDR)) + labs(x = "Target FDR", y = "FDR") + scale_x_continuous(breaks = tFDR_vec_percent, minor_breaks = c(), limits = c(tFDR_vec_percent[1], tFDR_vec_percent[length(tFDR_vec_percent)])) + scale_y_continuous(breaks = seq(0, 100, by = 10), minor_breaks = c(), limits = c(0, 100)) + geom_line(size = 1.5, colour = "#336C68") + geom_abline(slope = 1, colour = "red", linetype = 2, size = 1) + geom_point(size = 2.5, colour = "#336C68") + theme_bw(base_size = 16) + theme(panel.background = element_rect(fill = "white", color = "black", size = 1)) + coord_fixed(ratio = 0.75 * (tFDR_vec_percent[length(tFDR_vec_percent)] - tFDR_vec_percent[1]) / (100 - 0)) # TPR vs. tFDR TPR_vs_tFDR <- ggplot(plot_data, aes(x = tFDR_vec_percent, y = TPR)) + labs(x = "Target FDR", y = "TPR") + scale_x_continuous(breaks = tFDR_vec_percent, minor_breaks = c(), limits = c(tFDR_vec_percent[1], tFDR_vec_percent[length(tFDR_vec_percent)])) + scale_y_continuous(breaks = seq(0, 100, by = 10), minor_breaks = c(), limits = c(0, 100)) + geom_line(size = 1.5, colour = "#336C68") + geom_point(size = 2.5, colour = "#336C68") + theme_bw(base_size = 16) + theme(panel.background = element_rect(fill = "white", color = "black", size = 1)) + coord_fixed(ratio = 0.75 * (tFDR_vec_percent[length(tFDR_vec_percent)] - tFDR_vec_percent[1]) / (100 - 0)) FDR_vs_tFDR + TPR_vs_tFDR ``` We observe that the T-Rex selector always controls the FDR (green line is always below the red and dashed reference line, i.e., maximum allowed value for the FDR). For more details and discussions, we refer the interested reader to the T-Rex paper [@machkour2022terminating]. # The T-Rex Framework The general steps that define the \textit{T-Rex} framework are illustrated in Figure 1. The key idea is to design randomized controlled experiments where fake variables, so-called dummies, act as a negative control group in the variable selection process. ```{r T-Rex_framework, echo=FALSE, fig.cap="Figure 1: Simplified overview of the T-Rex framework.", out.width = '85%'} knitr::include_graphics("./figures/T-Rex_framework.png") ``` Within the \textit{T-Rex} framework, a total of $K$ random experiments with independently generated dummy matrices are conducted. Figure 2 shows the structure of the enlarged predictor matrix. Without loss of generality, true active variables (green), non-active (null) variables (red), and dummies (yellow) are illustrated as blocks within the predictor matrix. Note that this is only for visualization purposes and in practice the active and null variables are interspersed. In the random experiments, the dummy variables (yellow) compete with the given input variables in $\X$ (green and red) to be included by a forward variable selection method, such as the LARS algorithm [@efron2004least], the Lasso [@tibshirani1996regression], or the elastic net [@zou2005regularization]. In each random experiment, the solution path is terminated early, as soon as a pre-defined number of $T$ dummies is included in the model. This results in the $K$ candidate sets $\C_{1, L}(T), \ldots, \C_{K, L}(T)$. The early stopping leads to a drastic reduction in computation time for sparse problems, where continuing the forward selection algorithm, beyond some point, only leads to including more null variables. Finally, a voting scheme is applied to the candidate sets which yields the final active set $\widehat{\A}_{L}(v^{*}, T^{*})$. As detailed in [@machkour2022terminating], the calibration process ensures that the FDR is controlled at the user-defined level $\alpha$ while maximizing the TPR by determining the optimal voting level $v^{*}$ and number of included dummies $T^{*}$ after which the forward selection process is terminated. ```{r EnlargedPredictorMatrix, echo=FALSE, fig.cap="Figure 2: The enlarged predictor matrix (predictor matrix with dummies).", out.width = '65%'} knitr::include_graphics("./figures/predictor_matrix_with_dummies.png") ``` For a more detailed description of Figures 1 and 2 and more details on the T-Rex selector in general, we refer the interested reader to the original paper [@machkour2022terminating]. # References {-} ```{r, include = FALSE} # Reset user's options() options(old_options) ```