## Past Seminar Presentations

**Thursday, August 5, 2021**[Recording]**Speaker:**Snigdha Panigrahi (University of Michigan)**Title:**Approximate Methods for Joint Estimation of Group-sparse Parameters post Selection**Abstract:**In this talk, I will present a post-selective Bayesian framework to jointly and consistently estimate parameters within automatic group-sparse regression models. When such models are selected through an indispensable class of learning algorithms, e.g. the Group LASSO, the overlapping Group LASSO, or the sparse Group LASSO, uncertainty estimates for the matched parameters are unreliable in the absence of adjustments for selection bias. However, state-of-the-art tools for the group-sparse problem are limited in their application: estimation is strictly tailored to (i) real-valued projections onto very specific selected subspaces and (ii) selection events admitting representations as linear inequalities in the data variables. The proposed approximate Bayesian methods address these gaps by deriving an adjustment factor in an easily feasible analytic form that eliminates bias from the selection of promising groups. Paying a very nominal price for this adjustment, experiments on simulated data demonstrate the efficiency of our methods at a joint estimation of group-sparse parameters learned from data.

This talk is based upon joint work with Peter W. Macdonald and Daniel Kessler.

**Discussant:**Joshua Loftus (London School of Economics)

**Thursday, July 29, 2021**[Link to join]**Speaker:**Wesley Tansey (Memorial Sloan Kettering Cancer Center)**Title:**Efficient, robust, and powerful machine learning approaches to conditional independence testing**Abstract:**In this talk, I will present two approaches to conditional independence testing using deep neural networks. The first half of the talk focuses on the model-X knockoffs framework. I will present an optimization approach, Deep Direct Likelihood Knockoffs (DDLK), to learning the knockoff distribution directly through minimizing an adversarial swap objective. In the second half of the talk, I will shift to the conditional randomization test (CRT) framework. CRTs have higher power than knockoffs but come with a computational burden that generally makes them intractable. I will present an information-theoretic approach to CRTs, the Decoupled Independence Test (DIET), that overcomes this burden by reducing the CRT to a series of marginal independence tests. DIET estimates the residual information about the response and target variable after removing mutual information with the covariates. Under mild conditions, testing for conditional independence then reduces to testing for marginal independence between these two residuals. Both DDLK and DIET achieve higher power than existing methods and empirically control the target error rate in a broad class of benchmarks on synthetic and semi-synthetic data.**Discussant:**Thomas Berrett (University of Warwick)**Links:**[Relevant papers][Slides]

**Thursday, July 22, 2021**[Recording]**Speaker:**Matthew Plumlee (Northwestern University)**Title:**Inexact computer model calibration: Concerns, controversy, credibility, and confidence**Abstract:**There has been a recent surge in statistical methods for calibration of inexact models. Alongside these developments, a controversy has emerged about the goals of calibration of inexact models. This talk will trace a swath of research stemming from twenty years ago, marking potential concerns along the way. The talk will also present some new ideas in this setting that might help close some of these philosophical and practical issues.**Discussant:**Rui Tuo (Texas A&M University)

**Thursday, July 15, 2021**[Recording]**Speaker:**Armin Schwartzman (UCSD)**Title:**Spatial inference for excursion sets**Abstract:**Spatial inference for excursion sets refers to the problem of estimating the set of locations where a function is greater than a threshold. This problem appears in analyses of 2D climate data and 3D brain imaging data. The purpose of solving such a problem is to provide an alternative to the standard large-scale multiple testing approach, where all locations in an image are tested for the presence of signal. As sample sizes in large imaging studies keep increasing, the statistical power becomes sufficient to detect the presence of signal in large portions of the image, making it difficult to localize important effects. Moreover, the multiple testing approach does not provide a measure of spatial uncertainty. We directly address the question of where the important effects are by estimating excursion sets and by constructing spatial confidence sets, given as nested regions that spatially bound the true excursion set with a given probability. We develop this approach for excursion sets of the mean function in a signal-plus-noise model, including coefficients in pointwise regression models, and further extend it to the Cohen's d parameter in order to handle spatial heteroscedasticity. Examples and computational issues are discussed for 3D fMRI data.**Discussant:**Jelle Goeman (Leiden University)**Links:**[Relevant papers: paper #1, paper #2, paper #3][Slides][Discussion Slides]

**Thursday, July 8, 2021: No seminar**

**Thursday, July 1, 2021**[Recording]**Speaker:**Xiao Li (UC Berkeley)**Title:**Whiteout: when do fixed-X knockoffs fail?**Abstract:**A core strength of knockoff methods is their virtually limitless customizability, allowing an analyst to exploit machine learning algorithms and domain knowledge without threatening the method’s robust finite-sample false discovery rate control guarantee. While several previous works have investigated regimes where specific implementations of knockoffs are provably powerful, negative results are more difficult to obtain for such a flexible method. In this work we recast the fixed-X knockoff filter for the Gaussian linear model as a conditional post-selection inference method. It adds user-generated Gaussian noise to the ordinary least squares estimator $\hat\beta$ to obtain a “whitened” estimator $\tilde\beta$ with uncorrelated entries, and performs inference using $\operatorname{sgn}(\tilde\beta_j)$ as the test statistic for $H_j : \beta_j = 0$. We prove equivalence between our whitening formulation and the more standard formulation based on negative control predictor variables, showing how the fixed-X knockoffs framework can be used for multiple testing on any problem with (asymptotically) multivariate Gaussian parameter estimates. Relying on this perspective, we obtain the first negative results that universally upper-bound the power of all fixed-X knockoff methods, without regard to choices made by the analyst. Our results show roughly that, if the leading eigenvalues of $\operatorname{Var}(\hat\beta)$ are large with dense leading eigenvectors, then there is no way to whiten $\hat\beta$ without irreparably erasing nearly all of the signal, rendering $\operatorname{sgn}(\tilde\beta_j)$ too uninformative for accurate inference. We give conditions under which the true positive rate (TPR) for any fixed-X knockoff method must converge to zero even while the TPR of Bonferroni-corrected multiple testing tends to one, and we explore several examples illustrating this phenomenon.**Discussant:**Asher Spector (Harvard University)

**Thursday, June 24, 2021**[Recording]**Speaker:**Jason Hsu (The Ohio State University)**Title:**Confident Directional Selective Inference, from Multiple Comparisons with the Best to Precision Medicine**Abstract:**MCB (multiple comparisons with the best, 1981, 1984), comparing treatments to the best without knowing which one is the best, can be considered an early example of selective inference. With the thinking that "there is only one true best", the relevance of MCB to this presentation is that it led to the Partitioning Principle, which is essential for deriving confidence sets for stepwise tests. Inference based on confidence sets controls the directional error rate; inference based on tests of equalities may not.

The FDA gave Accelerated Approval to Aduhelm™ (aducanumab) for Alzheimer's Disease (AD) on 8 June 2021, based on its reduction of beta-amyloid plaque (a surrogate biomarker endpoint). When clinical efficacy of a treatment for the overall population is not shown, genome-wide association studies (GWAS) are often used to discover SNPs that might predict efficacy in subgroups. In the process of working on GWAS with real data, we came to the realization that, if one causal SNP makes its zero-null hypothesis false, then all other zero-null hypotheses are statistically false as well. While the majority of no-association null hypotheses might well be true biologically, statistically they are false (if one is false) in GWAS. I will illustrate this with a causal SNP for the ApoE gene, which is involved in the clearance of beta-amyloid plaque in AD. We suggest our confidence interval CE4 approach instead.

Targeted therapies such as OPDIVO and TECENTRIQ naturally have patient subgroups, already defined by the extent to which the drug target is present or absent in them, subgroups that may derive differential efficacy. An additional danger of testing equality nulls in the presence of subgroups is that the illusory logical relationships among efficacy in subgroups and their mixtures created by exact equality nulls lead to too drastic a stepwise multiplicity reduction, resulting in inflated directional error rates, as I will explain. Instead, Partition Tests, which would be called Confident Direction methods in the language of Tukey, might be safer to use.

**Discussant:**Will Fithian (UC Berkeley)

**Thursday, June 17, 2021**[Recording]**Speaker:**Patrick Chao (University of Pennsylvania)**Title:**AdaPT-GMM: Powerful and robust covariate-assisted multiple testing**Abstract:**We propose a new empirical Bayes method for covariate-assisted multiple testing with false discovery rate (FDR) control, where we model the local false discovery rate for each hypothesis as a function of both its covariates and p-value. Our method refines the adaptive p-value thresholding (AdaPT) procedure by generalizing its masking scheme to reduce the bias and variance of its false discovery proportion estimator, improving the power when the rejection set is small or some null p-values concentrate near 1. We also introduce a Gaussian mixture model for the conditional distribution of the test statistics given covariates, modeling the mixing proportions with a generic user-specified classifier, which we implement using a two-layer neural network. Like AdaPT, our method provably controls the FDR in finite samples even if the classifier or the Gaussian mixture model is misspecified. We show in extensive simulations and real data examples that our new method, which we call AdaPT-GMM, consistently delivers high power relative to competing state-of-the-art methods. In particular, it performs well in scenarios where AdaPT is underpowered, and is especially well-suited for testing composite null hypotheses, such as whether the effect size exceeds a practical significance threshold.**Discussant:**Patrick Kimes (Genentech)

**Thursday, June 10, 2021**[Recording]**Speaker:**Wooseok Ha (UC Berkeley)**Title:**Interpreting deep neural networks in a transformed domain**Abstract:**Machine learning lies at the heart of new possibilities for scientific discovery, knowledge generation, and artificial intelligence. Its potential benefits to these fields require going beyond predictive accuracy and focusing on interpretability. In particular, many scientific problems require interpretations in domain-specific interpretable feature space (e.g. the frequency or wavelet domain) whereas attributions to the raw features (e.g. the pixel space) may be unintelligible or even misleading. To address this challenge, we propose TRIM (Transformation Importance), a novel approach which attributes importances to features in a transformed space and can be applied post-hoc to a fully trained model. We focus on a problem in cosmology, where it is crucial to interpret how a model trained on simulations predicts fundamental cosmological parameters. By using TRIM in interesting ways, we next introduce adaptive wavelet distillation (AWD), a method that aims to distill information from a trained neural network into a wavelet transform. Specifically, AWD penalizes feature attributions of a neural network in the wavelet domain to learn an effective multi-resolution wavelet transform. The resulting model is highly predictive, concise, computationally efficient, and has properties (such as a multi-scale structure) which make it easy to interpret. We showcase how AWD addresses challenges in two real-world settings: cosmological parameter inference and molecular-partner prediction. In both cases, AWD informs predictive features that are scientifically meaningful in the context of respective domains.**Discussant:**Sarah Tan (Facebook)

**Thursday, June 3, 2021**[Recording]**Speakers:**Song Zhai (UC Riverside)**Title:**Learning from Real World Data About Combinatorial Treatment Selection for COVID-19**Abstract:**COVID-19 is an unprecedented global pandemic with a serious negative impact on virtually every part of the world. Although much progress has been made in preventing and treating the disease, much remains to be learned about how best to treat the disease while considering patient and disease characteristics. This paper reports a case study of combinatorial treatment selection for COVID-19 based on real-world data from a large hospital in Southern China. In this observational study, 417 confirmed COVID-19 patients were treated with various combinations of drugs and followed for four weeks after discharge (or until death). Treatment failure is defined as death during hospitalization or recurrence of COVID-19 within four weeks of discharge. Using a virtual multiple matching method to adjust for confounding, we estimate and compare the failure rates of different combinatorial treatments, both in the whole study population and in subpopulations defined by baseline characteristics. Our analysis reveals that treatment effects are substantial and heterogeneous, and that the optimal combinatorial treatment may depend on baseline age, systolic blood pressure, and c-reactive protein level. Using these three variables to stratify the study population leads to a stratified treatment strategy that involves several different combinations of drugs (for patients in different strata). Our findings are exploratory and require further validation.**Discussant:**Hongyuan Cao (Florida State University)**Links:**[Slides]

**Thursday, May 27, 2021**[Recording]**Speaker:**Matthew Stephens (University of Chicago)**Title:**A simple new approach to variable selection in regression, with application to genetic fine-mapping**Abstract:**We introduce a simple new approach to variable selection in linear regression, with a particular focus on quantifying uncertainty in which variables should be selected. The approach is based on a new model — the “Sum of Single Effects” (SuSiE) model — which comes from writing the sparse vector of regression coefficients as a sum of “single-effect” vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure — Iterative Bayesian Stepwise Selection (IBSS) — which is a Bayesian analogue of stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We provide a formal justification of this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally yields convenient novel summaries of uncertainty in variable selection, providing a Credible Set of variables for each selection. Our methods are particularly well-suited to settings where variables are highly correlated and detectable effects are sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and illustrate their application to fine-mapping genetic variants influencing alternative splicing in human cell-lines. 
We also discuss the potential and challenges for applying these methods to generic variable selection problems.**Discussant:**Peter Bühlmann (ETH Zürich)**Links:**[Relevant papers: paper #1][Slides][Discussant Slides]

**Thursday, May 20, 2021**[Recording]**Speaker:**Dan Kluger (Stanford University)**Title:**A central limit theorem for the Benjamini-Hochberg false discovery proportion under a factor model**Abstract:**The Benjamini-Hochberg (BH) procedure remains widely popular despite having limited theoretical guarantees in the commonly encountered scenario of correlated test statistics. Of particular concern is the possibility that the method could exhibit bursty behavior, meaning that it might typically yield no false discoveries while occasionally yielding both a large number of false discoveries and a false discovery proportion (FDP) that far exceeds its own well controlled mean. In this paper, we investigate which test statistic correlation structures lead to bursty behavior and which ones lead to well controlled FDPs. To this end, we develop a central limit theorem for the FDP in a multiple testing setup where the test statistic correlations can be either short-range or long-range as well as either weak or strong. The theorem and our simulations from a data-driven factor model suggest that the BH procedure exhibits severe burstiness when the test statistics have many strong, long-range correlations, but does not otherwise.**Discussant:**Grant Izmirlian (NCI DCP Biometry Research Group)**Links:**[Relevant papers: paper #1][Slides][Discussion Slides]
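For readers less familiar with the procedure this abstract studies, the following is a minimal sketch of the standard Benjamini-Hochberg step-up rule. The function name `benjamini_hochberg` and the example p-values are illustrative assumptions and are not taken from the talk or its paper; the sketch does not model the correlation structures the talk analyzes.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the sorted indices rejected by the BH step-up rule at level alpha.

    Hypothetical helper for illustration: rejects the k* smallest p-values,
    where k* is the largest k with p_(k) <= alpha * k / m.
    """
    m = len(p_values)
    # Indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_star = 0
    for k, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * k / m:
            k_star = k  # keep the largest k that satisfies the step-up condition
    return sorted(order[:k_star])


# Made-up p-values, for illustration only.
rejected = benjamini_hochberg([0.001, 0.3, 0.012, 0.8, 0.04], alpha=0.1)
print(rejected)
```

The false discovery proportion discussed in the abstract is the fraction of these rejections that correspond to true nulls, which is observable only in simulations where the truth is known.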

**Thursday, May 13, 2021**[Recording]**Speaker:**Chirag Gupta (Carnegie Mellon University)**Title:**Recent advances in distribution-free uncertainty quantification**Abstract:**Uncertainty quantification seeks to supplement point predictions with estimates of confidence or reliability. In the distribution-free (DF) framework, we require these confidence estimates to make valid statistical claims that provably hold no matter how the data is distributed, as long as the training and test data follow the same distribution. We present some recent results in DF uncertainty quantification for classification and regression problems. First, we discuss nested conformal, a framework to produce prediction sets that are guaranteed to contain the true output with a pre-defined probability. We then describe an ensemble-based conformal algorithm, QOOB. QOOB has DF guarantees, is computationally efficient, and produces prediction sets that exhibit strong practical performance on regression tasks. Next, we describe the notion of calibration in binary classification and connect it to prediction sets and confidence intervals. This relationship leads to an impossibility result for continuous-output DF calibration. We then show DF calibration guarantees for a popular discrete-output calibration algorithm called histogram binning. Based on our guarantees, we make practical recommendations for choosing the number of bins in histogram binning.**Discussant:**Rina Foygel Barber (University of Chicago)
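As a rough illustration of the histogram binning idea mentioned in this abstract, the sketch below fits equal-mass bins on a calibration set and maps a new score to the average label of its bin. This is a generic textbook-style version under stated assumptions (distinct scores, at least one calibration point per bin); the function names `fit_histogram_binning` and `calibrated_probability` are hypothetical, and the speaker's exact algorithm and its guarantees may differ.

```python
from bisect import bisect_left


def fit_histogram_binning(scores, labels, n_bins):
    """Fit equal-mass bins; return (upper edges of all but the last bin, bin means).

    Assumes len(scores) >= n_bins so that every bin is non-empty.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    edges, means = [], []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        chunk = pairs[lo:hi]
        # Calibrated probability for the bin is the empirical label frequency.
        means.append(sum(y for _, y in chunk) / len(chunk))
        if b < n_bins - 1:
            edges.append(chunk[-1][0])  # largest score in bin b
    return edges, means


def calibrated_probability(edges, means, score):
    """Look up the bin containing `score` and return that bin's mean label."""
    return means[bisect_left(edges, score)]


# Made-up calibration data, for illustration only.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 1, 0, 1, 0, 1, 1]
edges, means = fit_histogram_binning(cal_scores, cal_labels, n_bins=2)
print(calibrated_probability(edges, means, 0.35))
```

The number of bins trades off resolution against the number of calibration points per bin, which is the choice the abstract's practical recommendations address.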

**Thursday, May 6, 2021**[Recording]**Speaker:**Marie Perrot-Dockès (Université de Paris)**Title:**Post hoc false discovery proportion inference under a Hidden Markov Model**Abstract:**We address the multiple testing problem under the assumption that the true/false hypotheses are driven by a Hidden Markov Model (HMM), which is recognized as a fundamental setting to model multiple testing under dependence since the seminal work of Sun and Cai (2009). While previous work has concentrated on deriving specific procedures with a controlled False Discovery Rate (FDR) under this model, following a recent trend in selective inference, we consider the problem of establishing confidence bounds on the false discovery proportion (FDP), for a user-selected set of hypotheses that can depend on the observed data in an arbitrary way. We develop a methodology to construct such confidence bounds first when the HMM model is known, then when its parameters are unknown and estimated, including the data distribution under the null and the alternative, using a nonparametric approach. In the latter case, we propose a bootstrap-based methodology to take into account the effect of parameter estimation error. We show that taking advantage of the assumed HMM structure allows for a substantial improvement of confidence bound sharpness over existing agnostic (structure-free) methods, as witnessed both via numerical experiments and real data examples.**Discussant:**Jesse Hemerik (Wageningen University)

**Thursday, April 29, 2021**[Recording]**Speaker:**Thorsten Dickhaus (University of Bremen)**Title:**Randomized p-values in replicability analysis**Abstract:**We will be concerned with testing replicability hypotheses for many endpoints simultaneously. This constitutes a multiple test problem with composite null hypotheses. Traditional p-values, which are computed under least favourable parameter configurations (LFCs), are over-conservative in the case of composite null hypotheses. As demonstrated in prior work, this poses severe challenges in the multiple testing context, especially when one goal of the statistical analysis is to estimate the proportion $\pi_0$ of true null hypotheses. We will discuss the application of randomized p-values in the sense of [1] in replicability analysis. By means of theoretical considerations as well as computer simulations, we will demonstrate that their usage typically leads to a much more accurate estimation of $\pi_0$ than the LFC-based approach. Furthermore, we will draw connections to other recently proposed methods for dealing with conservative p-values in the multiple testing context. Finally, we will present a real data example from genomics. The presentation is based on [2] and [3].**Discussant:**Ruth Heller (Tel Aviv University)**Links:**[Relevant papers: paper #1, paper #2, paper #3][Slides]

**Thursday, April 22, 2021**[Recording]**Speaker:**Feng Ruan (UC Berkeley)**Title:**A Self-Penalizing Objective Function for Scalable Interaction Detection**Abstract:**We tackle the problem of nonparametric variable selection with a focus on discovering interactions between variables. With $p$ variables there are $O(p^s)$ possible order-$s$ interactions, making exhaustive search infeasible. It is nonetheless possible to identify the variables involved in interactions with only linear computation cost, $O(p)$. The trick is to maximize a class of parametrized nonparametric dependence measures which we call metric learning objectives; the landscape of these nonconvex objective functions is sensitive to interactions but the objectives themselves do not explicitly model interactions. Three properties make metric learning objectives highly attractive:

(a) The stationary points of the objective are automatically sparse (i.e. they perform selection) -- no explicit ℓ1 penalization is needed.

(b) All stationary points of the objective exclude noise variables with high probability.

(c) Guaranteed recovery of all signal variables without needing to reach the objective's global maxima or special stationary points.

The second and third properties mean that all our theoretical results apply in the practical case where one uses gradient ascent to maximize the metric learning objective. While not all metric learning objectives enjoy good statistical power, we design an objective based on ℓ1 kernels that does exhibit favorable power: it recovers (i) main effects with $n \sim \log p$ samples, (ii) hierarchical interactions with $n \sim \log p$ samples and (iii) order-$s$ pure interactions with $n \sim p^{2(s-1)} \log p$ samples.

**Discussant:**Sumanta Basu (Cornell University)

**Thursday, April 15, 2021**[Recording]**Speaker:**Nikolaos Ignatiadis (Stanford University)**Title:**Confidence Intervals for Nonparametric Empirical Bayes Analysis**Abstract:**In an empirical Bayes analysis, we use data from repeated sampling to imitate inferences made by an oracle Bayesian with extensive knowledge of the data-generating distribution. Existing results provide a comprehensive characterization of when and why empirical Bayes point estimates accurately recover oracle Bayes behavior. In this work, we develop flexible and practical confidence intervals that provide asymptotic frequentist coverage of empirical Bayes estimands, such as the posterior mean or the local false sign rate. The coverage statements hold even when the estimands are only partially identified or when empirical Bayes point estimates converge very slowly. This is joint work with Stefan Wager.**Discussant:**Timothy Armstrong (Yale University)

**Thursday, April 8, 2021**[Recording]**Speaker:**Hongyuan Cao (Florida State University)**Title:**Optimal False Discovery Rate Control For Large Scale Multiple Testing With Auxiliary Information**Abstract:**Large-scale multiple testing is a fundamental problem in high dimensional statistical inference. It is increasingly common that various types of auxiliary information, reflecting the structural relationship among the hypotheses, are available. Exploiting such auxiliary information can boost statistical power. To this end, we propose a framework based on a two-group mixture model with varying probabilities of being null for different hypotheses a priori, where a shape constrained relationship is imposed between the auxiliary information and the prior probabilities of being null. An optimal rejection rule is designed to maximize the expected number of true positives when average false discovery rate is controlled. Focusing on the ordered structure, we develop a robust EM algorithm to estimate the prior probabilities of being null and the distribution of p-values under the alternative hypothesis simultaneously. We show that the proposed method has better power than state-of-the-art competitors while controlling the false discovery rate, both empirically and theoretically. Extensive simulations demonstrate the advantage of the proposed method. Datasets from genome-wide association studies are used to illustrate the new methodology.**Discussant:**James Scott (University of Texas at Austin)

**Thursday, April 1, 2021**[Recording]**Speaker:**Jingshen Wang (UC Berkeley)**Title:**Sharp Inference on Selected Subgroups in Observational Studies**Abstract:**In modern drug development, the broader availability of high-dimensional observational data provides opportunities for scientists to explore subgroup heterogeneity, especially when randomized clinical trials are unavailable due to cost and ethical constraints. However, a common practice that naively searches the subgroup with a high treatment level is often misleading due to the “subgroup selection bias.” More importantly, the nature of high-dimensional observational data has further exacerbated the challenge of accurately estimating the subgroup treatment effects. To resolve these issues, we provide new inferential tools based on resampling to assess the replicability of post-hoc identified subgroups from observational studies. Through careful theoretical justification and extensive simulations, we show that our proposed approach delivers asymptotically sharp confidence intervals and debiased estimates for the selected subgroup treatment effects in the presence of high-dimensional covariates. We further demonstrate the merit of the proposed methods by analyzing the UK Biobank data. The R package “debiased.subgroup” implementing the proposed procedures is available on GitHub.**Discussant:**Rui Wang (Harvard University)**Links:**[Relevant papers: paper #1]

**Thursday, March 25, 2021**[Recording]**Speaker:**Jackson Loper (Columbia University)**Title:**Smoothed Nested Testing on Directed Acyclic Graphs**Abstract:**We consider the problem of multiple hypothesis testing when there is a logical nested structure to the hypotheses. When one hypothesis is nested inside another, the outer hypothesis must be false if the inner hypothesis is false. We model the nested structure as a directed acyclic graph, including chain and tree graphs as special cases. Each node in the graph is a hypothesis and rejecting a node requires also rejecting all of its ancestors. We propose a general framework for adjusting node-level test statistics using the known logical constraints. Within this framework, we study a smoothing procedure that combines each node with all of its descendants to form a more powerful statistic. We prove a broad class of smoothing strategies can be used with existing selection procedures to control the familywise error rate, false discovery exceedance rate, or false discovery rate, so long as the original test statistics are independent under the null. When the null statistics are not independent but are derived from positively-correlated normal observations, we prove control for all three error rates when the smoothing method is arithmetic averaging of the observations. Simulations and an application to a real biology dataset demonstrate that smoothing leads to substantial power gains.**Discussant:**Wenge Guo (New Jersey Institute of Technology)**Links:**[Relevant papers: paper #1]

**Thursday, March 18, 2021**[Recording]**Speaker:**Ruodu Wang (University of Waterloo)**Title:**Multiple hypothesis testing with e-values and dependence**Abstract:**E-values have gained attention as potential alternatives to p-values as measures of uncertainty, significance and evidence. In brief, e-values are realized by random variables with expectation at most one under the null; examples include betting scores, (point null) Bayes factors, likelihood ratios and stopped supermartingales. We design a natural analog of the Benjamini-Hochberg (BH) procedure for false discovery rate (FDR) control that utilizes e-values, called the e-BH procedure, and compare it with the standard procedure for p-values. One of our central results is that, unlike the usual BH procedure, the e-BH procedure controls the FDR at the desired level---with no correction---for any dependence structure between the e-values. We illustrate that the new procedure is convenient in various settings of complicated dependence, structured and post-selection hypotheses, and multi-armed bandit problems. Moreover, the BH procedure is a special case of the e-BH procedure through calibration between p-values and e-values. Overall, the e-BH procedure is a novel, powerful and general tool for multiple testing under dependence, that is complementary to the BH procedure, each being an appropriate choice in different applications.**Discussant:**Lihua Lei (Stanford University)
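The e-BH rule described in this abstract can be sketched in a few lines. The version below assumes the commonly stated form of the procedure: with K e-values sorted in decreasing order, reject the hypotheses with the k* largest e-values, where k* is the largest k such that k times the k-th largest e-value is at least K / alpha. The function name `e_bh` and the example e-values are illustrative assumptions, not material from the talk.

```python
def e_bh(e_values, alpha):
    """Return the sorted indices rejected by the e-BH rule at level alpha.

    Hypothetical helper for illustration: rejects the k* largest e-values,
    where k* is the largest k with k * e_[k] >= K / alpha and e_[k] is the
    k-th largest e-value among K.
    """
    n = len(e_values)
    # Indices ordered from largest to smallest e-value.
    order = sorted(range(n), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for k, idx in enumerate(order, start=1):
        if k * e_values[idx] >= n / alpha:
            k_star = k  # keep the largest k that satisfies the condition
    return sorted(order[:k_star])


# Made-up e-values, for illustration only.
rejected = e_bh([60, 1, 30, 0.5, 8], alpha=0.1)
print(rejected)
```

Applying the same rule to the reciprocals 1/p of p-values reproduces the BH step-up condition, which is one way to read the abstract's remark that BH is a special case of e-BH through calibration.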

**Thursday, March 11, 2021**[Recording]**Speaker:**Stephen Bates (UC Berkeley)**Title:**Distribution-Free, Risk-Controlling Prediction Sets**Abstract:**While improving prediction accuracy has been the focus of machine learning in recent years, this alone does not suffice for reliable decision-making. Deploying learning systems in consequential settings also requires calibrating and communicating the uncertainty of predictions. To convey instance-wise uncertainty for prediction tasks, we show how to generate set-valued predictions from a black-box predictor that control the expected loss on future test points at a user-specified level. Our approach provides explicit finite-sample guarantees for any dataset by using a holdout set to calibrate the size of the prediction sets. This framework enables simple, distribution-free, rigorous error control for many tasks, and we demonstrate it in five large-scale machine learning problems: (1) classification problems where some mistakes are more costly than others; (2) multi-label classification, where each observation has multiple associated labels; (3) classification problems where the labels have a hierarchical structure; (4) image segmentation, where we wish to predict a set of pixels containing an object of interest; and (5) protein structure prediction. Lastly, we discuss extensions to uncertainty quantification for ranking, metric learning and distributionally robust learning.**Discussant:**Vladimir Vovk (Royal Holloway, University of London)

**Thursday, March 4, 2021**[Recording]**Speaker:**Boyan Duan (Carnegie Mellon University)**Title:**Interactive identification of individuals with positive treatment effect while controlling false discoveries**Abstract:**Out of the participants in a randomized experiment with anticipated heterogeneous treatment effects, is it possible to identify which ones have a positive treatment effect, even though each has only taken either treatment or control but not both? While subgroup analysis has received attention, claims about individual participants are more challenging. We frame the problem in terms of multiple hypothesis testing: we think of each individual as a null hypothesis (the potential outcomes are equal, for example) and aim to identify individuals for whom the null is false (the treatment potential outcome stochastically dominates the control, for example). We develop a novel algorithm that identifies such a subset, with nonasymptotic control of the false discovery rate (FDR). Our algorithm allows for interaction — a human data scientist (or a computer program acting on the human’s behalf) may adaptively guide the algorithm in a data-dependent manner to gain high identification power. We also propose several extensions: (a) relaxing the null to nonpositive effects, (b) moving from unpaired to paired samples, and (c) subgroup identification. We demonstrate via numerical experiments and theoretical analysis that the proposed method has valid FDR control in finite samples and reasonably high identification power.**Discussant:**Bikram Karmakar (University of Florida)

**Thursday, February 25, 2021**[Recording]**Speaker:**Anna Vesely (University of Padua)**Title:**Permutation-based true discovery guarantee by sum tests**Abstract:**Sum-based global tests are highly popular in multiple hypothesis testing. In this paper we propose a general closed testing procedure for sum tests, which provides confidence lower bounds for the proportion of true discoveries (TDP), simultaneously over all subsets of hypotheses. Our method allows for an exploratory approach, as simultaneity ensures control of the TDP even when the subset of interest is selected post hoc. It adapts to the unknown joint distribution of the data through permutation testing. Any sum test may be employed, depending on the desired power properties. We present an iterative shortcut for the closed testing procedure, based on the branch and bound algorithm. It converges to the full closed testing results, often after a few iterations. Even if it is stopped early, it controls the TDP. The feasibility of the method for high dimensional data is illustrated on brain imaging data. We compare the properties of different choices for the sum test through simulations.**Discussant:**Pierre Neuvial (Institut de Mathématiques de Toulouse (IMT))

**Thursday, February 18, 2021**[Recording]**Speaker:**Tijana Zrnic (UC Berkeley)**Title:**Post-Selection Inference via Algorithmic Stability**Abstract:**Modern approaches to data analysis make extensive use of data-driven model selection. The resulting dependencies between the selected model and data used for inference invalidate statistical guarantees derived from classical theories. The framework of post-selection inference (PoSI) has formalized this problem and proposed corrections which ensure valid inferences. Yet, obtaining general principles that enable computationally-efficient, powerful PoSI methodology with formal guarantees remains a challenge. With this goal in mind, we revisit the PoSI problem through the lens of algorithmic stability. Under an appropriate formulation of stability---one that captures closure under post-processing and compositionality properties---we show that stability parameters of a selection method alone suffice to provide non-trivial corrections to classical z-test and t-test intervals. Then, for several popular model selection methods, including the LASSO, we show how stability can be achieved through simple, computationally efficient randomization schemes. Our algorithms offer provable unconditional simultaneous coverage and are computationally efficient; in particular, they do not rely on MCMC sampling. Importantly, our proposal explicitly relates the magnitude of randomization to the resulting confidence interval width, allowing the analyst to tune interval width to the loss in utility due to randomizing selection. This is joint work with Michael I. Jordan.**Discussant:**Arun Kumar Kuchibhotla (Carnegie Mellon University)

**Thursday, February 11, 2021**[Recording]**Speaker:**Jelle Goeman (Leiden University)**Title:**Only closed testing procedures are admissible for controlling false discovery proportions**Abstract:**We consider a general class of procedures controlling the tail probability of the number or proportion of false discoveries, either in a single (random) set or in several such sets simultaneously. This class includes, among others, (generalized) familywise error, false discovery exceedance, simultaneous false discovery proportion control, and other selective inference methods. We put these procedures in a general framework, formulating all of them as special cases of true discovery guarantee procedures. We formulate both necessary and sufficient conditions for admissibility. Most importantly, we show that all such procedures are either a special case of closed testing, or they can be uniformly improved by a closed testing procedure. The practical value of our results is illustrated by giving uniform improvements of existing selective inference procedures, achieved by formulating them as closed testing procedures. In particular, we investigate when procedures controlling conditional familywise error rate, and data-splitting methods, can be uniformly improved by closed testing.**Discussant:**Will Fithian (UC Berkeley)

**Thursday, February 4, 2021**[Recording]**Speaker:**Arian Maleki (Columbia University)**Title:**Comparing Variable Selection Techniques Under a High-Dimensional Asymptotic**Abstract:**In this talk, we discuss the problem of variable selection for linear models under the high-dimensional asymptotic setting, where the number of observations, n, grows at the same rate as the number of predictors, p. We consider two-stage variable selection techniques (TVS) in which the first stage obtains an estimate of the regression coefficients, and the second stage simply thresholds this estimate to select the “important” predictors. The asymptotic false discovery proportion (AFDP) and true positive proportion (ATPP) of these TVS are evaluated, and their optimality will be discussed.**Discussant:**Pragya Sur (Harvard University)

**Thursday, January 28, 2021**[Recording]**Speaker:**Ali Shojaie (University of Washington)**Title:**Nonparametric Inference for Infinite-Dimensional Parameters via a Generalized Score Test**Abstract:**Infinite-dimensional parameters that can be defined as the minimizer of a population risk arise naturally in many applications. Classic examples include the conditional mean function and the density function. Though there is extensive literature on constructing consistent estimators for infinite-dimensional risk minimizers, there is limited work on quantifying the uncertainty associated with such estimates via, e.g., hypothesis testing and construction of confidence regions. We propose a general inferential framework for infinite-dimensional risk minimizers as a nonparametric extension of the score test. We illustrate that our framework requires only mild assumptions and is applicable to a variety of estimation problems. In examples, we specialize our proposed methodology to estimation of regression functions with continuous outcomes and also consider a partially additive model as an extension of the more classical partially linear model.**Discussant:**Mladen Kolar (University of Chicago Booth School of Business)**Links:**[Slides]

**Thursday, January 21, 2021**[Recording]**Speaker:**Etienne Roquain (Sorbonne Université)**Title:**Structured multiple testing: can one mimic the oracle?**Abstract:**Knowing the model structure can significantly help to perform a multiple testing inference. Hence, a general aim is to build a procedure mimicking the performance of the oracle, that is, of a benchmark procedure that knows (and uses) this structure. As a case in point, classical structures are derived from the famous two-group model or its extensions, by specifying particular assumptions on the corresponding parameters, such as the null/alternative distributions or the false/null occurrence process. We will discuss the issue of mimicking the oracle for the three following structures and various multiple testing error rates:

(1) structure = Gaussian null distribution family, error rate = FDR (see https://arxiv.org/abs/1912.03109, joint work with Nicolas Verzelen, and https://arxiv.org/abs/1809.08330, joint work with Alexandra Carpentier, Sylvain Delattre and Nicolas Verzelen)

(2) structure = stochastic block model for the false/null occurrence process, error rate = FDR (see https://arxiv.org/abs/1907.10176, joint work with Tabea Rebafka and Fanny Villers)

(3) structure = hidden Markov model for the false/null occurrence process, error rate = FDP confidence post hoc bound (preprint to come, joint work with Marie Perrot-Dockès, Gilles Blanchard and Pierre Neuvial)

We will emphasize work (1) above and show that building a confidence region for the structure parameter can be fruitful for determining whether mimicking the oracle is possible, and how to mimic it when it is.**Discussant:**Ery Arias-Castro (UC San Diego)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, January 14, 2021**[Recording]**Speaker:**Qingyuan Zhao (University of Cambridge)**Title:**Selecting and Ranking Individualized Treatment Rules With Unmeasured Confounding**Abstract:**It is common to compare individualized treatment rules based on the value function, which is the expected potential outcome under the treatment rule. Although the value function is not point-identified when there is unmeasured confounding, it still defines a partial order among the treatment rules under Rosenbaum’s sensitivity analysis model. We first consider how to compare two treatment rules with unmeasured confounding in the single-decision setting and then use this pairwise test to rank multiple treatment rules. We consider how to, among many treatment rules, select the best rules, and select the rules that are better than a control rule. The proposed methods are illustrated using two real examples, one about the benefit of malaria prevention programs to different age groups and another about the effect of late retirement on senior health in different gender and occupation groups.**Discussant:**Edward Kennedy (Carnegie Mellon University)**Links:**[Relevant paper] [Slides]

**Thursday, January 7, 2021**[Recording]**Speaker:**Yuval Benjamini (Hebrew University of Jerusalem)**Title:**Localizing differences between correlation matrix populations in resting-state fMRI**Abstract:**Resting state fMRI consists of continuous neural-activity recordings over a period of several minutes without structured experimental manipulation. These measurements are summarized into a correlation matrix between activity in p predetermined brain-regions (p between 90 and 500). Neurologists are interested in identifying localized differences in correlation between, e.g. disease and control populations, but the relatively high noise, small samples and many comparisons make mass univariate approaches impractical due to low signal. Therefore, resting-state fMRI analysis can be a model problem for data-adaptive pooling of hypotheses.

However, as I discuss in the talk, even static pooling of effects across different correlation values is not simple in this type of data. We reparametrize the matrix of differences between populations as p main effects representing change for each region, with the goal of replacing p^2/2 hypotheses with p main ones. For this new model, we derive likelihood estimators that require explicit or implicit characterisation of the dependence in the data. We show that the method performs well on simulations, and discuss an example from amnesia data.

This is joint work with Itamar Faran, Michael Peer and Shahar Arzi.**Discussant:**Lucy Gao (University of Waterloo)**Relevant links:**[Slides]

**Thursday, December 10, 2020**[Recording]**Speaker:**Toru Kitagawa (University College London)**Title:**Inference on Winners**Abstract:**Many empirical questions concern target parameters selected through optimization. For example, researchers may be interested in the effectiveness of the best policy found in a randomized trial, or the best-performing investment strategy based on historical data. Such settings give rise to a winner’s curse, where conventional estimates are biased and conventional confidence intervals are unreliable. This paper develops optimal confidence intervals and median-unbiased estimators that are valid conditional on the target selected and so overcome this winner’s curse. If one requires validity only on average over targets that might have been selected, we develop hybrid procedures that combine conditional and projection confidence intervals to offer further performance gains relative to existing alternatives. This is joint work with Isaiah Andrews and Adam McCloskey.**Discussant:**Kenneth Hung (Facebook)**Links:**[Relevant paper] [Slides]

**Thursday, December 3, 2020****Speaker**: Jingyi Jessica Li (UCLA)**Title**: Clipper: p-value-free FDR control on high-throughput data from two conditions**Abstract:**High-throughput biological data analysis commonly involves the identification of “interesting” features (e.g., genes, genomic regions, and proteins), whose values differ between two conditions, from numerous features measured simultaneously. To ensure the reliability of such analysis, the most widely-used criterion is the false discovery rate (FDR), the expected proportion of uninteresting features among the identified ones. Existing bioinformatics tools primarily control the FDR based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions, two requirements that are often unmet in biological studies. To address this issue, we propose Clipper, a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper is applicable to identifying both enriched and differential features from high-throughput biological data of diverse types. In comprehensive simulation and real-data benchmarking, Clipper outperforms existing generic FDR control methods and specific bioinformatics tools designed for various tasks, including peak calling from ChIP-seq data, differentially expressed gene identification from RNA-seq data, differentially interacting chromatin region identification from Hi-C data, and peptide identification from mass spectrometry data. Notably, our benchmarking results for peptide identification are based on the first mass spectrometry data standard that has a realistic dynamic range. Our results demonstrate Clipper’s flexibility and reliability for FDR control, as well as its broad applications in high-throughput data analysis.**Discussant:**Nikos Ignatiadis (Stanford University)**Links:**[Relevant paper] [Slides]

**Thursday, November 19, 2020**[Recording]**Speaker:**Oscar Hernan Madrid Padilla (UCLA)**Title:**Optimal post-selection inference for sparse signals: a nonparametric empirical-Bayes**Abstract:**Many recently developed Bayesian methods have focused on sparse signal detection. However, much less work has been done addressing the natural follow-up question: how to make valid inferences for the magnitude of those signals after selection. Ordinary Bayesian credible intervals suffer from selection bias, owing to the fact that the target of inference is chosen adaptively. Existing Bayesian approaches for correcting this bias produce credible intervals with poor frequentist properties, while existing frequentist approaches require sacrificing the benefits of shrinkage typical in Bayesian methods, resulting in confidence intervals that are needlessly wide. We address this gap by proposing a nonparametric empirical-Bayes approach for constructing optimal selection-adjusted confidence sets. Our method produces confidence sets that are as short as possible on average, while both adjusting for selection and maintaining exact frequentist coverage uniformly over the parameter space. Our main theoretical result establishes an important consistency property of our procedure: that under mild conditions, it asymptotically converges to the results of an oracle-Bayes analysis in which the prior distribution of signal sizes is known exactly. Across a series of examples, the method outperforms existing frequentist techniques for post-selection inference, producing confidence sets that are notably shorter but with the same coverage guarantee. This is joint work with Spencer Woody and James G. Scott.**Discussant:**Małgorzata Bogdan (Uniwersytet Wroclawski, Instytut Matematyki)**Links:**[Relevant paper] [Slides]

**Thursday, November 12, 2020****Speaker**: Peter Grünwald (Centrum Wiskunde & Informatica and Leiden University)**Title**:*E is the New P:*Tests that are safe under optional stopping, with an application to time-to-event data**Abstract:**The E-value is a notion of evidence which, unlike p-values, allows for effortlessly combining evidence from several tests, even in the common scenario where the decision to perform a new test depends on previous test outcomes. 'Safe' tests based on E-values generally preserve Type-I error guarantees under such 'optional continuation', thereby potentially alleviating one of the main causes of the reproducibility crisis.

E-values, also known as 'betting scores', are the basic constituents of test martingales and always-valid confidence sequences - a dormant cluster of ideas going back to Ville and Robbins and suddenly rapidly gaining popularity due to recent work by Vovk, Shafer, Ramdas and Wang. For simple nulls they are just likelihood ratios or Bayes factors, but for composite nulls it's trickier - we show how to construct them in this case using the 'joint information projection'. We then zoom in on time-to-event data and show how to define an E-value based on Cox's partial likelihood, illustrating with (hypothetical!) data on COVID vaccine RCTs. If all research groups were to report their results in terms of E-values rather than p-values, then in principle, one could even do meta-analysis that retains an overall Type-I error guarantee - thus saving greatly on 'research waste'.

Joint Work with R. de Heide, W. Koolen, A. Ly, M. Perez, R. Turner and J. Ter Schure.**Discussant:**Ruodu Wang (University of Waterloo)

**Thursday, November 5, 2020****Speaker**: Gilles Blanchard (Université Paris Sud)**Title**: Agnostic post hoc approaches to false positive control**Abstract:**Classical approaches to multiple testing grant control over the amount of false positives for a specific method prescribing the set of rejected hypotheses. In practice many users tend to deviate from a strictly prescribed multiple testing method and follow ad-hoc rejection rules, tune some parameters by hand, compare several methods and pick from their results the one that suits them best, etc. This will invalidate standard statistical guarantees because of the selection effect. To compensate for any form of such “data snooping”, an approach which has garnered significant interest recently is to derive “user-agnostic”, or post hoc, bounds on the false positives valid uniformly over all possible rejection sets; this allows arbitrary data snooping from the user. We present two contributions: starting from a common approach to post hoc bounds taking into account the p-value level sets for any candidate rejection set, we analyze how to calibrate the bound under different assumptions concerning the distribution of p-values. We then build towards a general approach to the problem using a family of candidate rejection subsets (call this a reference family) together with associated bounds on the number of false positives they contain, the latter holding uniformly over the family. It is then possible to interpolate from this reference family to find a bound valid for any candidate rejection subset. This general program encompasses in particular the p-value level sets considered earlier; we illustrate its interest in a different context where the reference subsets are fixed and spatially structured. (Joint work with Pierre Neuvial and Etienne Roquain.)**Discussant:**Arun Kumar Kuchibhotla (Carnegie Mellon University)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, October 29, 2020**[Recording]**Speaker:**Robert Lunde (University of Texas, Austin)**Title:**Resampling for Network Data**Abstract:**Network data, which represent complex relationships between different entities, have become increasingly common in fields ranging from neuroscience to social network analysis. To address key scientific questions in these domains, versatile inferential methods for network-valued data are needed. In this talk, I will discuss our recent work on network analogs of the three main resampling methods: subsampling, the jackknife, and the bootstrap. While network data are generally dependent, under the sparse graphon model, we show that these resampling procedures exhibit similar properties to their IID counterparts. I will also discuss related theoretical results, including central limit theorems for eigenvalues and a network Efron-Stein inequality. This is joint work with Purnamrita Sarkar and Qiaohui Lin.**Discussant:**Liza Levina (University of Michigan)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, October 22, 2020**[Recording]**Speaker:**Yuan Liao (Rutgers University)**Title:**Deep Learning Inference on Semi-Parametric Models with Weakly Dependent Data**Abstract:**Deep Neural Networks (DNNs) are nonlinear sieves that can approximate nonlinear functions of high dimensional variables more effectively than various linear sieves (or series). This paper considers efficient inference (estimation and confidence intervals) of functionals of nonparametric conditional moment restrictions via penalized DNNs, for weakly dependent beta-mixing time series data. The functionals of interest are either known or unknown expected functionals, such as weighted average derivatives, averaged partial means and averaged squared partial derivatives. Nonparametric conditional quantile instrumental variable models are a particular example of interest in this paper. This is joint work with Jiafeng Chen, Xiaohong Chen, and Elie Tamer.**Discussant:**Matteo Sesia (University of Southern California)**Links:**[Slides]

**Thursday, October 15, 2020**[Recording]**Speaker:**Zhimei Ren (Stanford University)**Title:**Derandomizing Knockoffs**Abstract:**Model-X knockoffs is a general procedure that can leverage any feature importance measure to produce a variable selection algorithm, which discovers true effects while rigorously controlling the number or fraction of false positives. Model-X knockoffs relies on the construction of synthetic random variables and is, therefore, random. In this paper, we propose a method for derandomizing model-X knockoffs. By aggregating the selection results across multiple runs of the knockoffs algorithm, our method provides stable decisions without compromising statistical power. The derandomization step is designed to be flexible and can be adapted to any variable selection base procedure. When applied to the base procedure of Janson et al. (2016), we prove that derandomized knockoffs controls both the per family error rate (PFER) and the k family-wise error rate (k-FWER). Further, we carry out extensive numerical studies demonstrating tight type-I error control and markedly enhanced power when compared with alternative variable selection algorithms. Finally, we apply our approach to multi-stage GWAS of prostate cancer and report locations on the genome that are significantly associated with the disease. When cross-referenced with other studies, we find that the reported associations have been replicated.**Discussant:**Richard Samworth (University of Cambridge)**Links:**[Relevant paper]

**Thursday, October 8, 2020**[Recording]**Speaker:**Nilesh Tripuraneni (UC Berkeley)**Title:**Single Point Transductive Prediction**Abstract:**Standard methods in supervised learning separate training and prediction: the model is fit independently of any test points it may encounter. However, can knowledge of the next test point $\mathbf{x}_{\star}$ be exploited to improve prediction accuracy? We address this question in the context of linear prediction, showing how techniques from semi-parametric inference can be used transductively to combat regularization bias. We first lower bound the $\mathbf{x}_{\star}$ prediction error of ridge regression and the Lasso, showing that they must incur significant bias in certain test directions. We then provide non-asymptotic upper bounds on the $\mathbf{x}_{\star}$ prediction error of two transductive prediction rules. We conclude by showing the efficacy of our methods on both synthetic and real data, highlighting the improvements single point transductive prediction can provide in settings with distribution shift. This is joint work with Lester Mackey.**Discussant:**Leying Guan (Yale University)**Links:**[Relevant paper] [Slides]

**Thursday, October 1, 2020**[Recording]**Speaker:**Asaf Weinstein (Hebrew University of Jerusalem)**Title:**A Power Analysis for Knockoffs with the Lasso Coefficient-Difference Statistic**Abstract:**In a linear model with possibly many predictors, we consider variable selection procedures given by $\{1\leq j\leq p: |\widehat{\beta}_j(\lambda)| > t\}$, where $\widehat{\beta}(\lambda)$ is the Lasso estimate of the regression coefficients, and where $\lambda$ and $t$ may be data dependent. Ordinary Lasso selection is captured by using $t=0$, thus allowing control of only $\lambda$, whereas thresholded-Lasso selection allows control of both $\lambda$ and $t$. Figuratively, thresholded-Lasso opens up the possibility to look further down the Lasso path, which typically leads to dramatic improvement in power. This phenomenon has been quantified recently leveraging advances in approximate message-passing (AMP) theory, but the implications are actionable only when assuming substantial knowledge of the underlying signal. In this work we study theoretically the power of a knockoffs-calibrated counterpart of thresholded-Lasso that enables us to control FDR in the realistic situation where no prior information about the signal is available. Although the basic AMP framework remains the same, our analysis requires a significant technical extension of existing theory in order to handle the pairing between original variables and their knockoffs. Relying on this extension we obtain exact asymptotic predictions for the true positive proportion achievable at a prescribed type I error level. In particular, we show that the knockoffs version of thresholded-Lasso can (still) perform much better than ordinary Lasso selection if $\lambda$ is chosen by cross-validation on the augmented matrix. This is joint work with Malgorzata Bogdan, Weijie Su, Rina Foygel Barber and Emmanuel Candes.**Discussant:**Zheng (Tracy) Ke (Harvard University)**Links:**[Relevant paper] [Slides]

**Thursday, September 24, 2020**[Recording]**Speaker:**Ruth Heller (Tel Aviv University)**Title:**Inference following aggregate level hypothesis testing**Abstract:**The practice of pooling several individual test statistics to form aggregate tests is common in many statistical applications where individual tests may be underpowered. Following aggregate-level testing, it is naturally of interest to infer on the individual units that drive the signal. Failing to account for selection will produce biased inference. We develop a hypothesis testing framework that guarantees control over false positives conditional on the selection by aggregate tests. We illustrate the usefulness of our procedures in two genomic applications: whole-genome expression quantitative trait loci (eQTL) analysis across multiple tissue types, and rare variant testing. This talk is based on joint works with Nilanjan Chatterjee, Abba Krieger, Amit Meir, and Jianxin Shi.**Discussant:**Jingshu Wang (University of Chicago)

**Thursday, September 17, 2020**[Recording]**Speaker:**Hannes Leeb (University of Vienna)**Title:**Conditional Predictive Inference for High-Dimensional Stable Algorithms**Abstract:**We investigate generically applicable and intuitively appealing prediction intervals based on leave-one-out residuals. The conditional coverage probability of the proposed intervals, given the observations in the training sample, is close to the nominal level, provided that the underlying algorithm used for computing point predictions is sufficiently stable under the omission of single feature/response pairs. Our results are based on a finite sample analysis of the empirical distribution function of the leave-one-out residuals and hold in non-parametric settings with only minimal assumptions on the error distribution. To illustrate our results, we also apply them to high-dimensional linear predictors, where we obtain uniform asymptotic conditional validity as both sample size and dimension tend to infinity at the same rate. These results show that despite the serious problems of resampling procedures for inference on the unknown parameters (cf. Bickel and Freedman, 1983; El Karoui and Purdom, 2015; Mammen, 1996), leave-one-out methods can be successfully applied to obtain reliable predictive inference even in high dimensions.

Joint work with Lukas Steinberger.**Discussant:**Yuansi Chen (ETH Zürich)**Links:**[Relevant paper] [Slides]

**Thursday, September 10, 2020**[Recording]**Speaker:**Michael Celentano (Stanford University)**Title:**The Lasso with general Gaussian designs with applications to hypothesis testing**Abstract:**The Lasso is a method for high-dimensional regression, which is now commonly used when the number of covariates p is of the same order as or larger than the number of observations n. Classical asymptotic normality theory is not applicable to this model for two fundamental reasons: (1) The regularized risk is non-smooth; (2) The distance between the estimator and the true parameter vector cannot be neglected. As a consequence, standard perturbative arguments that are the traditional basis for asymptotic normality fail.

On the other hand, the Lasso estimator can be precisely characterized in the regime in which both n and p are large, while n/p is of order one. This characterization was first obtained in the case of standard Gaussian designs, and subsequently generalized to other high-dimensional estimation procedures. We extend the same characterization to Gaussian correlated designs with non-singular covariance structure.

Using this theory, we study (i) the debiased Lasso, and show that a degrees-of-freedom correction is necessary for computing valid confidence intervals, (ii) confidence intervals constructed via a leave-one-out technique related to conditional randomization tests, and (iii) a simple procedure for hyper-parameter tuning which is provably optimal for prediction error under proportional asymptotics.

Based on joint work with Andrea Montanari and Yuting Wei.**Discussant:**Dongming Huang (National University of Singapore)**Links:**[Relevant paper] [Slides]

**Thursday, September 3, 2020**[Recording]**Speaker:**Rina Foygel Barber (University of Chicago)**Title:**Is distribution-free inference possible for binary regression?**Abstract:**For a regression problem with a binary label response, we examine the problem of constructing confidence intervals for the label probability conditional on the features. In a setting where we do not have any information about the underlying distribution, we would ideally like to provide confidence intervals that are distribution-free---that is, valid with no assumptions on the distribution of the data. Our results establish an explicit lower bound on the length of any distribution-free confidence interval, and construct a procedure that can approximately achieve this length. In particular, this lower bound is independent of the sample size and holds for all distributions with no point masses, meaning that it is not possible for any distribution-free procedure to be adaptive with respect to any type of special structure in the distribution.**Discussant:**Aaditya Ramdas (Carnegie Mellon University)**Links:**[Relevant paper] [Slides]

**Thursday, August 27, 2020**[Recording]**Speaker:**Daniel Yekutieli (Tel Aviv University)**Title:**Bayesian selective inference**Abstract:**I will discuss selective inference from a Bayesian perspective. I will revisit existing work. I will demonstrate the effectiveness of Bayesian methods for specifying FDR-controlling selection rules and providing valid selection-adjusted marginal inferences in two simulated multiple testing examples: (a) Normal sequence model with continuous-valued parameters and (b) two-group model with dependent Normal observations.**Discussant:**Zhigen Zhao (Temple University)

**Thursday, August 20, 2020**[Recording]**Speaker:**Eugene Katsevich (University of Pennsylvania)**Title:**The conditional randomization test in theory and in practice**Abstract:**Consider the problem of testing whether a predictor X is independent of a response Y given a covariate vector Z. If we have access to the distribution of X given Z (the Model-X assumption), the conditional randomization test (Candes et al., 2018) is a simple and powerful conditional independence test, which does not require any knowledge of the distribution of Y given X and Z. The key obstacle to the practical implementation of the CRT is its computational cost, due to its reliance on repeatedly refitting a statistical machine learning model on resampled data. This motivated the development of distillation, a technique which speeds up the CRT by orders of magnitude while sacrificing little or no power (Liu, Katsevich, Janson, and Ramdas, 2020). I will also discuss recent theoretical developments that help us understand how the choice of CRT test statistic impacts its power (Katsevich and Ramdas, 2020). Finally, I'll illustrate an application of the CRT to the analysis of single cell CRISPR regulatory screens, where it helps circumvent the difficulties of modeling single cell gene expression (Katsevich and Roeder, 2020).**Discussant:**Wesley Tansey (Memorial Sloan Kettering Cancer Center)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, August 13, 2020**[Recording]**Speaker:**Lucy Gao (University of Waterloo)**Title:**Selective Inference for Hierarchical Clustering**Abstract:**It is common practice in fields such as single-cell transcriptomics to use the same data set to define groups of interest via clustering algorithms and to test whether these groups are different. Because the same data set is used for both hypothesis generation and hypothesis testing, simply applying a classical statistical test (e.g. the t-test) in this setting would yield an extremely inflated Type I error rate. We propose a selective inference framework for testing the null hypothesis of no difference in means between two clusters obtained using agglomerative hierarchical clustering. Using this framework, we can efficiently compute exact p-values for many commonly used linkage criteria. We demonstrate the utility of our test in simulated data and in single-cell RNA-seq data. This is joint work with Jacob Bien and Daniela Witten.**Discussant:**Yuval Benjamini (Hebrew University of Jerusalem)**Links:**[Slides]

**Thursday, July 30, 2020**[Recording]**Speaker:**Kathryn Roeder (Carnegie Mellon University)**Title:**Adaptive approaches for augmenting genetic association studies with multi-omics covariates**Abstract:**To correct for a large number of hypothesis tests, most researchers rely on simple multiple testing corrections. Yet, new selective inference methodologies could improve power by enabling exploration of test statistics with covariates for informative weights while retaining desired statistical guarantees. We explore one such framework, adaptive p-value thresholding (AdaPT), in the context of genome-wide association studies (GWAS) under two types of regimes: (1) testing individual single nucleotide polymorphisms (SNPs) for schizophrenia (SCZ) and (2) the aggregation of SNPs into gene-based test statistics for autism spectrum disorder (ASD). In both settings, we focus on enriched expression quantitative trait loci (eQTLs) and demonstrate a substantial increase in power using flexible gradient boosted trees to account for covariates constructed with GWAS statistics from genetically-correlated phenotypes, as well as measures capturing association with gene expression and coexpression subnetwork membership. We address the practical challenges of implementing AdaPT in high-dimensional -omics settings, such as approaches for tuning gradient boosted trees without compromising error-rate control as well as handling the subtle issues of working with publicly available summary statistics (e.g., p-values reported to be exactly equal to one). Specifically, because a popular approach for computing gene-level p-values is based on an invalid approximation for the combination of dependent two-sided test statistics, it yields an inflated error rate. Additionally, the resulting improper null distribution violates the mirror-conservative assumption required for masking procedures. 
We believe our results are critical for researchers wishing to build new methods in this challenging area and emphasize that our pipeline of analysis can be implemented in many different high-throughput settings to ultimately improve power. This is joint work with Ronald Yurko, Max G’Sell, and Bernie Devlin.**Discussant:**Chiara Sabatti (Stanford University)**Links:**[Relevant paper] [Slides]

**Thursday, July 23, 2020**[Recording]**Speaker:**Will Fithian (UC Berkeley)**Title:**Conditional calibration for false discovery rate control under dependence**Abstract:**We introduce a new class of methods for finite-sample false discovery rate (FDR) control in multiple testing problems with dependent test statistics where the dependence is fully or partially known. Our approach separately calibrates a data-dependent p-value rejection threshold for each hypothesis, relaxing or tightening the threshold as appropriate to target exact FDR control. In addition to our general framework we propose a concrete algorithm, the dependence-adjusted Benjamini-Hochberg (dBH) procedure, which adaptively thresholds the q-value for each hypothesis. Under positive regression dependence the dBH procedure uniformly dominates the standard BH procedure, and in general it uniformly dominates the Benjamini–Yekutieli (BY) procedure (also known as BH with log correction). Simulations and real data examples illustrate power gains over competing approaches to FDR control under dependence. This is joint work with Lihua Lei.**Discussant:**Etienne Roquain (Sorbonne Université)**Links:**[Relevant paper] [Slides]

**Thursday, July 16, 2020**[Recording]**Speaker:**Arun Kumar Kuchibhotla (University of Pennsylvania)**Title:**Optimality in Universal Post-selection Inference**Abstract:**Universal post-selection inference refers to valid inference after an arbitrary variable selection in regression models. In the context of linear regression and GLMs, universal post-selection inference methods have been suggested by Berk et al. (2013, AoS) and Bachoc et al. (2020, AoS). Both these works use the so-called "max-t" approach to obtain valid inference after arbitrary variable selection. Although tight, this approach can lead to a conservative inference for several sub-models. (Tightness refers to the existence of a variable selection procedure for which the inference is exact/sharp.) In this talk, I present a different approach to universal post-selection inference called "Hierarchical PoSI" that scales differently for different sub-model sizes. The basic idea stems from pre-pivoting, introduced by Beran (1987, 1988, JASA) and also from multi-scale testing. Some numerical results will be presented to illustrate the benefits. No guarantees of optimality will be made.**Discussant:**Daniel Yekutieli (Tel Aviv University)

**Thursday, July 9, 2020**[Recording]**Speaker:**Lihua Lei (Stanford University)**Title:**AdaPT: An interactive procedure for multiple testing with side information**Abstract:**We consider the problem of multiple-hypothesis testing with generic side information: for each hypothesis we observe both a *p*-value *p*_i and some predictor *x*_i encoding contextual information about the hypothesis. For large-scale problems, adaptively focusing power on the more promising hypotheses (those more likely to yield discoveries) can lead to much more powerful multiple-testing procedures. We propose a general iterative framework for this problem, the adaptive *p*-value thresholding procedure which we call AdaPT, which adaptively estimates a Bayes optimal *p*-value rejection threshold and controls the false discovery rate in finite samples. At each iteration of the procedure, the analyst proposes a rejection threshold and observes partially censored *p*-values, estimates the false discovery proportion below the threshold and proposes another threshold, until the estimated false discovery proportion is below *α*. Our procedure is adaptive in an unusually strong sense, permitting the analyst to use any statistical or machine learning method she chooses to estimate the optimal threshold, and to switch between different models at each iteration as information accrues. This is joint work with Will Fithian.**Discussant:**Kun Liang (University of Waterloo)**Links:**[Relevant paper] [Slides]

**Thursday, July 2, 2020**[Recording]**Speaker:**Lucas Janson (Harvard University)**Title:**Floodgate: inference for model-free variable importance**Abstract:**Many modern applications seek to understand the relationship between an outcome variable Y and a covariate X in the presence of confounding variables Z = (Z_1,...,Z_p). Although much attention has been paid to testing whether Y depends on X given Z, in this paper we seek to go beyond testing by inferring the strength of that dependence. We first define our estimand, the minimum mean squared error (mMSE) gap, which quantifies the conditional relationship between Y and X in a way that is deterministic, model-free, interpretable, and sensitive to nonlinearities and interactions. We then propose a new inferential approach called floodgate that can leverage any regression function chosen by the user (including those fitted by state-of-the-art machine learning algorithms or derived from qualitative domain knowledge) to construct asymptotic confidence bounds, and we apply it to the mMSE gap. In addition to proving floodgate’s asymptotic validity, we rigorously quantify its accuracy (distance from confidence bound to estimand) and robustness. We demonstrate floodgate’s performance in a series of simulations and apply it to data from the UK Biobank to infer the strengths of dependence of platelet count on various groups of genetic mutations. This is joint work with Lu Zhang.**Discussant:**Weijie Su (University of Pennsylvania)**Links:**[Relevant paper] [Slides]

**Thursday, June 25, 2020**[Recording]**Speaker:**Alexandra Carpentier (Otto-von-Guericke-Universität Magdeburg)**Title:**Adaptive inference and its relations to sequential decision making**Abstract:**Adaptive inference - namely adaptive estimation and adaptive confidence statements - is particularly important in high or infinite dimensional models in statistics. Indeed, whenever the dimension becomes high or infinite, it is important to adapt to the underlying structure of the problem. While adaptive estimation is often possible, it is often the case that adaptive and honest confidence sets do not exist. This is known as the adaptive inference paradox, and it has consequences in sequential decision making. In this talk, I will present some classical results of adaptive inference and discuss how they impact sequential decision making. This is joint work with Andrea Locatelli, Matthias Loeffler, Olga Klopp, Richard Nickl, James Cheshire, and Pierre Menard.**Discussant:**Jing Lei (Carnegie Mellon University)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, June 18, 2020**[Recording]

(Seminar hosted jointly with the CIRM-Luminy meeting on Mathematical Methods of Modern Statistics 2)**Speaker:**Weijie Su (University of Pennsylvania)**Title:**Gaussian Differential Privacy**Abstract:**Privacy-preserving data analysis has been put on a firm mathematical foundation since the introduction of differential privacy (DP) in 2006. This privacy definition, however, has some well-known weaknesses: notably, it does not tightly handle composition. In this talk, we propose a relaxation of DP that we term "f-DP", which has a number of appealing properties and avoids some of the difficulties associated with prior relaxations. First, f-DP preserves the hypothesis testing interpretation of differential privacy, which makes its guarantees easily interpretable. It allows for lossless reasoning about composition and post-processing, and notably, a direct way to analyze privacy amplification by subsampling. We define a canonical single-parameter family of definitions within our class that is termed "Gaussian Differential Privacy", based on hypothesis testing of two shifted normal distributions. We prove that this family is focal to f-DP by introducing a central limit theorem, which shows that the privacy guarantees of any hypothesis-testing based definition of privacy (including differential privacy) converge to Gaussian differential privacy in the limit under composition. This central limit theorem also gives a tractable analysis tool. We demonstrate the use of the tools we develop by giving an improved analysis of the privacy guarantees of noisy stochastic gradient descent. This is joint work with Jinshuo Dong and Aaron Roth.**Discussant:**Yu-Xiang Wang (UC Santa Barbara)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, June 11, 2020**[Recording]**Speaker:**Dongming Huang (Harvard University)**Title:**Controlled Variable Selection with More Flexibility**Abstract:**The recent model-X knockoffs method selects variables with provable and non-asymptotic error control and with no restrictions or assumptions on the dimensionality of the data or the conditional distribution of the response given the covariates. The one requirement for the procedure is that the covariate samples are drawn independently and identically from a precisely-known distribution. In this talk, I will show that the exact same guarantees can be made without knowing the covariate distribution fully, but instead knowing it only up to a parametric model with as many as Ω(np) parameters, where p is the dimension and n is the number of covariate samples (including unlabeled samples if available). The key is to treat the covariates as if they are drawn conditionally on their observed value for a sufficient statistic of the model. Although this idea is simple, even in Gaussian models, conditioning on a sufficient statistic leads to a distribution supported on a set of zero Lebesgue measure, requiring techniques from topological measure theory to establish valid algorithms. I will demonstrate how to do this for medium-dimensional Gaussian models, high-dimensional Gaussian graphical models, and discrete graphical models. Simulations show the new approach remains powerful under the weaker assumptions. This talk is based on joint work with Lucas Janson.**Discussant:**Snigdha Panigrahi (University of Michigan)**Links:**[Relevant paper] [Slides]

**Thursday, June 4, 2020**[Recording]**Speaker:**Saharon Rosset (Tel Aviv University)**Title:**Optimal multiple testing procedures for strong control and for the two-group model**Abstract:**Multiple testing problems are a staple of modern statistics. The fundamental objective is to reject as many false null hypotheses as possible, subject to controlling an overall measure of false discovery, like family-wise error rate (FWER) or false discovery rate (FDR). We formulate multiple testing of simple hypotheses as an infinite-dimensional optimization problem, seeking the most powerful rejection policy which guarantees strong control of the selected measure. We show that for exchangeable hypotheses, for FWER or FDR and relevant notions of power, these problems lead to infinite programs that can provably be solved. We explore maximin rules for complex alternatives, and show they can be found in practice, leading to improved practical procedures compared to existing alternatives. We derive explicit optimal tests for FWER or FDR control for three independent normal means. We find that the power gain over natural competitors is substantial in all settings examined. We apply our optimal maximin rule to subgroup analyses in systematic reviews from the Cochrane library, leading to an increased number of findings compared to existing alternatives.

As time permits, I will also review our follow-up work on optimal rules for controlling FDR or positive FDR in the two-group model, in high dimension and under arbitrary dependence. Our results show substantial and interesting differences between the standard approach for controlling the mFDR and our new solutions; in particular, we attain substantially increased power (expected number of true rejections).

Joint work with Ruth Heller, Amichai Painsky and Udi Aharoni.**Discussant:**Wenguang Sun (University of Southern California)

**Thursday, May 28, 2020**[Recording]**Speaker:**Jingshu Wang (University of Chicago)**Title:**Detecting Multiple Replicating Signals using Adaptive Filtering Procedures**Abstract:**Replicability is a fundamental quality of scientific discoveries: we are interested in those signals that are detectable in different laboratories, study populations, across time etc. Unlike meta-analysis, which accounts for experimental variability but does not guarantee replicability, testing a partial conjunction (PC) null aims specifically to identify the signals that are discovered in multiple studies. In many contemporary applications, e.g. comparing multiple high-throughput genetic experiments, a large number M of PC nulls need to be tested simultaneously, calling for a multiple comparison correction. However, standard multiple testing adjustments on the M PC p-values can be severely conservative, especially when M is large and the signals are sparse. We introduce AdaFilter, a new multiple testing procedure that increases power by adaptively filtering out unlikely candidates of PC nulls. We prove that AdaFilter can control FWER and FDR as long as data across studies are independent, and has much higher power than other existing methods. We illustrate the application of AdaFilter with three examples: microarray studies of Duchenne muscular dystrophy, single-cell RNA sequencing of T cells in lung cancer tumors and GWAS for metabolomics.**Discussant:**Eugene Katsevich (Carnegie Mellon University)**Links:**[Relevant paper] [Slides]

**Thursday, May 21, 2020**[Recording]**Speaker:**Yoav Benjamini (Tel Aviv University)**Title:**Confidence Intervals for selected parameters**Abstract:**Practical or scientific considerations may lead to selecting a subset of parameters as ‘important’. Inferences about the selected parameters often are based on the same data used for selection. We present a taxonomy of error-rates for selective confidence intervals, then focus on controlling the probability that one or more intervals for selected parameters do not cover: the simultaneous over the selected (SoS) error-rate. We use two approaches to construct SoS-controlling confidence intervals for *k* location parameters out of *m*, deemed most important because their estimators are the largest. The new intervals improve substantially over Sidak intervals when *k* << *m*, and approach Bonferroni-corrected intervals when *k* is close to *m*. (Joint work with Yotam Hechtlinger and Philip Stark)**Discussant:**Aaditya Ramdas (Carnegie Mellon University)**Links:**[Relevant paper] [Slides]

**Thursday, May 14, 2020**[Recording]**Speaker:**Malgorzata Bogdan (Uniwersytet Wroclawski)**Title:**Adaptive Bayesian Version of SLOPE**Abstract:**Sorted L-One Penalized Estimation (SLOPE) is a convex optimization procedure for identifying predictors in large databases. It extends the popular Least Absolute Shrinkage and Selection Operator (LASSO) by replacing the L1 norm penalty with the Sorted L-One Norm. It provably controls FDR under orthogonal designs and yields asymptotically minimax estimators of regression coefficients in sparse high-dimensional regression. In this talk I will briefly introduce the method and explain problems with FDR control under correlated designs. We will then discuss a novel adaptive Bayesian version of SLOPE (ABSLOPE), which addresses these issues and allows for simultaneous variable selection and parameter estimation, despite the missing values. We will also discuss a strong screening rule for discarding predictors for SLOPE, which substantially speeds up the SLOPE and ABSLOPE algorithms.**Discussant:**Cynthia Rush (Columbia University)**Links:**[Slides] [Relevant papers: paper #1, paper #2, paper #3]

**Thursday, May 7, 2020**[Recording]**Speaker:**Aldo Solari (University of Milano-Bicocca)**Title:**Exploratory Inference for Brain Imaging**Abstract:**Modern data analysis can be highly exploratory. In brain imaging, for example, researchers often highlight patterns of brain activity suggested by the data, but false discoveries are likely to intrude into this selection. How confident can the researcher be about a pattern that has been found, if that pattern has been selected from so many potential patterns?

In this talk we present a recent approach - termed 'All-Resolutions Inference' (ARI) - that delivers lower confidence bounds to the number of true discoveries in any selected set of voxels. Notably, these bounds are simultaneously valid for all possible selections. This allows a truly interactive approach to post-selection inference, that does not set any limits on the way the researcher chooses to perform the selection.**Discussant:**Genevera Allen (Rice University)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, Apr 30, 2020**[Recording]**Speaker:**Yingying Fan (University of Southern California)**Title:**Universal Rank Inference via Residual Subsampling with Application to Large Networks**Abstract:**Determining the precise rank is an important problem in many large-scale applications with matrix data exploiting low-rank plus noise models. In this paper, we suggest a universal approach to rank inference via residual subsampling (RIRS) for testing and estimating rank in a wide family of models, including many popularly used network models such as the degree corrected mixed membership model as a special case. Our procedure constructs a test statistic via subsampling entries of the residual matrix after extracting the spiked components. The test statistic converges in distribution to the standard normal under the null hypothesis, and diverges to infinity with asymptotic probability one under the alternative hypothesis. The effectiveness of the RIRS procedure is justified theoretically, utilizing the asymptotic expansions of eigenvectors and eigenvalues for large random matrices recently developed in Fan et al. (2019a) and Fan et al. (2019b). The advantages of the newly suggested procedure are demonstrated through several simulation and real data examples. This work is joint with Xiao Han and Qing Yang.**Discussant:**Yuekai Sun (University of Michigan)**Links:**[Relevant paper] [Slides]

**Thursday, Apr 23, 2020**[Recording]**Speaker:**Aaditya Ramdas (Carnegie Mellon University)**Title:**Ville’s inequality, Robbins’ confidence sequences, and nonparametric supermartingales**Abstract:**

Standard textbook confidence intervals are only valid at fixed sample sizes, but scientific datasets are often collected sequentially and potentially stopped early, thus introducing a critical selection bias. A "confidence sequence" is a sequence of intervals, one for each sample size, that are uniformly valid over all sample sizes, and are thus valid at arbitrary data-dependent sample sizes. One can show that constructing the former at every time step guarantees false coverage rate control, while constructing the latter at each time step guarantees post-hoc familywise error rate control. We show that at a price of about two (doubling of width), pointwise asymptotic confidence intervals can be extended to uniform nonparametric confidence sequences. The crucial role of some beautiful nonnegative supermartingales will be made transparent in enabling "safe anytime-valid inference".

This talk will mostly feature joint work with Steven R. Howard (Berkeley, Voleon), Jon McAuliffe (Berkeley, Voleon), Jas Sekhon (Berkeley, Bridgewater) and recently Larry Wasserman (CMU) and Sivaraman Balakrishnan (CMU). I will also cover interesting historical and contemporary contributions to this area.

**Discussant:**Wouter Koolen (Centrum Wiskunde & Informatica)

**Thursday, Apr 16, 2020**[Recording]

**Speaker:**Emmanuel Candès (Stanford University)**Title:**Causal Inference in Genetic Trio Studies**Abstract:**

We introduce a method to rigorously draw causal inferences — inferences immune to all possible confounding — from genetic data that include parents and offspring. Causal conclusions are possible with these data because the natural randomness in meiosis can be viewed as a high-dimensional randomized experiment. We make this observation actionable by developing a novel conditional independence test that identifies regions of the genome containing distinct causal variants. The proposed Digital Twin Test compares an observed offspring to carefully constructed synthetic offspring from the same parents to determine statistical significance, and it can leverage any black-box multivariate model and additional non-trio genetic data to increase power. Crucially, our inferences are based only on a well-established mathematical model of recombination and make no assumptions about the relationship between the genotypes and phenotypes.

**Discussant:**Matthew Stephens (University of Chicago)**Links:**[Relevant paper] [Slides]