## Past Seminar Presentations

**Thursday, November 24, 2022 (no seminar)**

**Thursday, December 1, 2022**[link to join]**Speaker:**Alexandre Blain (Inria)**Title:**Notip: Non-parametric True Discovery Proportion control for brain imaging**Abstract:**Cluster-level inference procedures are widely used for brain mapping. These methods compare the size of clusters obtained by thresholding brain maps to an upper bound under the global null hypothesis, computed using Random Field Theory or permutations. However, the guarantees obtained by this type of inference - i.e. at least one voxel is truly activated in the cluster - are not informative with regard to the strength of the signal therein. There is thus a need for methods to assess the amount of signal within clusters; yet such methods have to take into account that clusters are defined based on the data, which creates circularity in the inference scheme. This has motivated the use of post hoc estimates that allow statistically valid estimation of the proportion of activated voxels in clusters. In the context of fMRI data, the All-Resolutions Inference framework introduced in Rosenblatt et al. (2018) provides post hoc estimates of the proportion of activated voxels. However, this method relies on parametric threshold families, which results in conservative inference. In this paper, we leverage randomization methods to adapt to data characteristics and obtain tighter false discovery control. We obtain Notip, for Non-parametric True Discovery Proportion control: a powerful, non-parametric method that yields statistically valid guarantees on the proportion of activated voxels in data-derived clusters. Numerical experiments demonstrate substantial gains in number of detections compared with state-of-the-art methods on 36 fMRI datasets. The conditions under which the proposed method brings benefits are also discussed.**Discussant:**Angela Andreella (University Ca’ Foscari Venezia)**Links:**[Relevant papers: paper #1][Slides][Discussion Slides]

**Thursday, November 17, 2022**[Recording]**Speaker:**Etienne Roquain (Sorbonne Université)**Title:**Machine learning meets false discovery rate**Abstract:**Classical false discovery rate (FDR) controlling procedures offer strong and interpretable guarantees but often lack flexibility to work with complex data. By contrast, machine learning-based classification algorithms have superior performances on modern datasets but typically fall short of error-controlling guarantees. In this paper, we make these two meet by introducing a new adaptive novelty detection procedure with FDR control, called AdaDetect. It extends the scope of recent works of multiple testing literature to the high dimensional setting, notably the one in Yang et al. (2021). We prove that AdaDetect comes with finite sample guarantees: it controls the FDR strongly and approximates the oracle in terms of the power, with explicit remainder terms that are small under mild conditions. In practice, AdaDetect can be used in combination with *any* machine learning-based classifier, which allows the user to choose the most relevant classification approach. We illustrate this with classical real-world datasets, for which random forest and neural network classifiers are particularly efficient. The versatility of our method is also shown with an astrophysical application.**Discussant:**Matteo Sesia (University of Southern California)**Links:**[Relevant papers: paper #1][Slides][Discussion Slides]

**Thursday, November 10, 2022**[Recording]**Speaker:**Zhimei Ren (University of Chicago)**Title:**Derandomized knockoffs: leveraging e-values for false discovery rate control**Abstract:**Model-X knockoffs is a flexible wrapper method for high-dimensional regression algorithms, which provides guaranteed control of the false discovery rate (FDR). Due to the randomness inherent to the method, different runs of model-X knockoffs on the same dataset often result in different sets of selected variables, which is undesirable in practice. In this paper, we introduce a methodology for derandomizing model-X knockoffs with provable FDR control. The key insight of our proposed method lies in the discovery that the knockoffs procedure is in essence an e-BH procedure. We make use of this connection, and derandomize model-X knockoffs by aggregating the e-values resulting from multiple knockoff realizations. We prove that the derandomized procedure controls the FDR at the desired level, without any additional conditions (in contrast, previously proposed methods for derandomization are not able to guarantee FDR control). The proposed method is evaluated with numerical experiments, where we find that the derandomized procedure achieves comparable power and dramatically decreased selection variability when compared with model-X knockoffs.**Discussant:**Ruodu Wang (University of Waterloo)
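
The e-BH step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that e-values from several runs are aggregated by a plain average (one simple choice; the talk's exact aggregation scheme is not stated here), and the function names `average_e_values` and `e_bh` as well as the toy numbers are hypothetical. Constructing the knockoff e-values themselves is not shown.

```python
def average_e_values(runs):
    """Average e-values across runs; runs[r][j] is the e-value of hypothesis j in run r.

    The arithmetic mean of e-values is again an e-value.
    """
    n_runs = len(runs)
    return [sum(column) / n_runs for column in zip(*runs)]


def e_bh(e_values, alpha):
    """Return the (sorted) indices rejected by the e-BH procedure at level alpha.

    With K hypotheses, e-BH rejects the k* largest e-values, where k* is the
    largest k such that the k-th largest e-value is at least K / (alpha * k).
    """
    n_hyp = len(e_values)
    order = sorted(range(n_hyp), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for k, idx in enumerate(order, start=1):
        if e_values[idx] >= n_hyp / (alpha * k):
            k_star = k
    return sorted(order[:k_star])


# Hypothetical e-values from two runs over four hypotheses (toy numbers only).
runs = [[40.0, 1.0, 30.0, 0.5],
        [60.0, 0.2, 20.0, 1.5]]
averaged = average_e_values(runs)
rejected = e_bh(averaged, alpha=0.1)
print(rejected)
```

Because the averaged e-values are deterministic given the collection of runs, the rejection set no longer depends on a single random knockoff draw, which is the motivation described in the abstract.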

**Thursday, November 3, 2022**[Recording]**Speaker:**Genevera Allen (Rice University)**Title:**Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles**Abstract:**To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model-refitting and data-splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model-refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model, and only assumes algorithmic stability. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. 
We validate our intervals on a series of synthetic and real data examples, including non-linear settings with interactions, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods. Joint work with Luqin Gan and Lili Zheng.**Discussant:**Byol Kim (University of Washington)

**Thursday, October 27, 2022**[link to join]**Speaker:**Weijie Su (University of Pennsylvania)**Title:**Statistical Estimation via a Truthful Owner-Assisted Scoring Mechanism**Abstract:**In 2014, NeurIPS received 1,678 paper submissions, while this number increased to 10,411 in 2022, putting a tremendous strain on the peer review process. In this talk, we attempt to address this challenge, starting with the following scenario: Alice submits a large number of papers to a machine learning conference and knows about the ground-truth quality of her papers; given noisy ratings provided by independent reviewers, can Bob obtain accurate estimates of the ground-truth quality of the papers by asking Alice a question about the ground truth? First, if Alice would truthfully answer the question because by doing so her payoff as additive convex utility over all her papers is maximized, we show that the questions must be formulated as pairwise comparisons between her papers. Moreover, if Alice is required to provide a ranking of her papers, which is the most fine-grained question via pairwise comparisons, we prove that she would be truth-telling. By incorporating the ground-truth ranking, we show that Bob can obtain an estimator with the optimal squared error in certain regimes based on any possible ways of truthful information elicitation. Moreover, the estimated ratings are substantially more accurate than the raw ratings when the number of papers is large and the raw ratings are very noisy. Finally, we conclude the talk with several extensions and some refinements for practical considerations.

**Discussant:**Davide Viviano (Stanford University)

**Thursday, October 20, 2022**[Recording]**Speaker:**Timothy Armstrong (University of Southern California)**Title:**Empirical Bayes Confidence Intervals, Average Coverage and the False Discovery Rate**Abstract:**This talk presents a general method for constructing intervals satisfying an average coverage property. Given an estimate of average squared bias of estimates of $n$ parameters, one computes a critical value that takes into account possible undercoverage due to bias, on average over the $n$ intervals. Applying our approach to shrinkage estimators in an empirical Bayes setting, we obtain confidence intervals that satisfy the empirical Bayes coverage property of Morris (1983), while avoiding parametric assumptions on the prior previously used to construct such intervals.

While tests based on average coverage intervals do not control size in the usual frequentist sense, certain results on false discovery rate (FDR) control of multiple testing procedures continue to hold when applied to such tests. In particular, the Benjamini and Hochberg (1995) step-up procedure still controls FDR in the asymptotic regime with many weakly dependent $p$-values, and certain adjustments for dependent $p$-values such as the Benjamini and Yekutieli (2001) procedure continue to yield FDR control in finite samples.

**Discussant:**Jiaying Gu (University of Toronto)

**Thursday, October 13, 2022**[Recording]**Speaker:**Aaditya Ramdas (Carnegie Mellon University)**Title:**E-values as unnormalized weights in multiple testing**Abstract:**The last two years have seen a flurry of new work on using e-values for multiple testing. This talk will summarize old ideas and present some new, unsubmitted work. I will briefly summarize what e-values and e-processes are (nonparametric, composite generalizations of likelihood ratios and Bayes factors), and recap the e-BH and e-BY procedures for FDR and FCR control, and their utility in a bandit context.

Then, I will present a simple, yet powerful, idea: using e-values as unnormalized weights in multiple testing. Most standard weighted multiple testing methods require the weights to deterministically add up to the number of hypotheses being tested (equivalently, the average weight is unity). But this normalization is not required when the weights are e-values obtained from independent data. This could result in a massive increase in power, especially if the non-null hypotheses have e-values much larger than one. More broadly, we study how to combine an e-value and a p-value, and design multiple testing procedures where both e-values and p-values are available for some hypotheses. A case study with RNA-seq and microarray data will demonstrate the practical power benefits.

These are joint works with Ruodu Wang, Neil Xu and Nikos Ignatiadis.

**Discussant:**Peter Grünwald (Centrum Wiskunde & Informatica and Leiden University)**Links:**[Relevant papers: paper #1, paper #2, paper #3, paper #4][Slides]

**Thursday, July 28, 2022**[Recording]**Speaker:**Trambak Banerjee (University of Kansas)**Title:**Nonparametric Empirical Bayes Estimation On Heterogeneous Data**Abstract:**The simultaneous estimation of many parameters based on data collected from corresponding studies is a key research problem that has received renewed attention in the high-dimensional setting. Many practical situations involve heterogeneous data where heterogeneity is captured by a nuisance parameter. Effectively pooling information across samples while correctly accounting for heterogeneity presents a significant challenge in large-scale estimation problems. We address this issue by introducing the "Nonparametric Empirical Bayes Structural Tweedie" (NEST) estimator, which efficiently estimates the unknown effect sizes and properly adjusts for heterogeneity via a generalized version of Tweedie's formula. For the normal means problem, NEST simultaneously handles the two main selection biases introduced by heterogeneity: (i) the selection bias in the mean, which cannot be effectively corrected without also correcting for (ii) the selection bias in the variance. Our theoretical results show that NEST has strong asymptotic properties, and in our simulation studies NEST outperforms competing methods, with substantial efficiency gains in many settings. The proposed method is demonstrated on estimating the batting averages of baseball players and Sharpe ratios of mutual fund returns.**Discussant:**Jake Soloff (University of Chicago)**Links:**[Relevant papers: paper #1]

**Thursday, July 21, 2022 (postponed)****Speaker:**Dacheng Xiu (University of Chicago)**Title:**Prediction When Factors are Weak**Abstract:**Principal component analysis (PCA) has been the most prevalent approach to the recovery of factors. Nevertheless, the theoretical justification of the PCA-based approach often relies on a convenient and critical assumption that factors are pervasive. To incorporate information from weaker factors in the context of prediction, we propose a new procedure based on supervised PCA, which iterates over selection, PCA, and projection. The selection step finds a subset of predictors most correlated with the prediction target, whereas the projection step permits multiple weak factors of distinct strength. We justify our procedure in an asymptotic scheme where both the sample size and the cross-sectional dimension increase at potentially different rates. Our empirical analysis highlights the role of weak factors in predicting inflation.**Discussant:**Yiqiao Zhong (Stanford University)

**Thursday, July 14, 2022 (100th ISSI seminar)**[Recording]**Speaker:**Yoav Benjamini (Tel Aviv University)**Title:**Trends and challenges in research about selective inference and its practice**Abstract:**The international seminar on selective inference gives us an opportunity to identify trends in this important research area, discuss common topics of interest and raise some challenges. I’ll try to use this opportunity for these purposes, but obviously the challenges will reflect my own point of view.

**Thursday, June 30, 2022**[Recording]**Speaker:**Zhanrui Cai (Carnegie Mellon University)**Title:**Robust Cross Validation with Confidence**Abstract:**Cross validation is one of the most popular tools for model selection and tuning parameter selection in the modern statistics and machine learning community. By dividing the sample into $K$ folds, cross validation first trains the models on $K-1$ folds of data, and tests the prediction error on the remaining fold. Then it chooses the model / tuning parameter that has the smallest test error. Recent studies aim to improve the confidence level for the models selected by cross validation (Lei, 2020), but may not be suitable for skewed/heavy-tailed data, or data with outliers. In this paper, we propose a robust cross validation method. Instead of comparing the mean of the prediction error, we propose to compare the quantiles of the test error because of its skewed nature. We illustrate the necessity of rank-sum comparison through motivating examples, and demonstrate the advantage of the proposed robust cross validation method through extensive simulation and real data analysis. In order to study the limiting distribution of the evaluation criterion, we develop the Gaussian approximation theory for high dimensional two sample U-statistics, which may be of independent interest.**Discussant:**Morgane Austern (Harvard University)

**Thursday, June 23, 2022**[Recording]**Speaker:**Yixiang Luo (University of California, Berkeley)**Title:**Improving knockoffs with conditional calibration**Abstract:**The knockoff filter of Barber and Candès (2015) is a flexible framework for multiple testing in supervised learning models, based on introducing synthetic predictor variables to control the false discovery rate (FDR). Using the conditional calibration framework of Fithian and Lei (2020), we introduce the calibrated knockoff procedure, a method that uniformly improves the power of any knockoff procedure. We implement our method for fixed-X knockoffs and show theoretically and empirically that the improvement is especially notable in two contexts where knockoff methods can be nearly powerless: when the rejection set is small, and when the structure of the design matrix prevents us from constructing good knockoff variables. In these contexts, calibrated knockoffs even outperform competing FDR-controlling methods like the (dependence-adjusted) Benjamini–Hochberg procedure in many scenarios.

This is joint work with Will Fithian and Lihua Lei.

**Discussant:**Lucas Janson (Harvard University)**Links:**[Relevant papers: paper #1]

**Thursday, June 16, 2022 (ISSI-STAMPS joint seminar)**[Recording]**Speaker:**Ann Lee (Carnegie Mellon University)**Title:**Likelihood-Free Frequentist Inference: Confidence Sets with Correct Conditional Coverage**Abstract:**Many areas of science make extensive use of computer simulators that implicitly encode likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, outside the asymptotic and low-dimensional regimes. Although new machine learning methods, such as normalizing flows, have revolutionized the sample efficiency and capacity of LFI methods, it remains an open question whether they produce confidence sets with correct conditional coverage. In this talk, I will describe our group's recent and ongoing research on developing scalable and modular procedures for (i) constructing Neyman confidence sets with finite-sample guarantees of nominal coverage, and for (ii) computing diagnostics that estimate conditional coverage over the entire parameter space. We refer to our framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic, like the likelihood ratio, can be adapted to LF2I to create valid confidence sets and diagnostics, without costly Monte Carlo samples at fixed parameter settings. In my talk, I will discuss where we stand with LF2I and challenges that still remain. (Part of these efforts are joint with Niccolo Dalmasso, Rafael Izbicki, Luca Masserano, Tommaso Dorigo, Mikael Kuusela, and David Zhao. Our general framework is described in arXiv:2107.03920)**Discussant:**Minge Xie (Rutgers University)

**Thursday, June 9, 2022**[Recording]**Speaker:**Anna Neufeld (University of Washington)**Title:**Inference after latent variable estimation for single-cell RNA sequencing data**Abstract:**In the analysis of single-cell RNA sequencing data, researchers often first characterize the variation between cells by estimating a latent variable, representing some aspect of the individual cell’s state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values and confidence intervals in the second step will fail to achieve statistical guarantees such as Type 1 error control or nominal coverage. Furthermore, approaches such as sample splitting that can be fruitfully applied to solve similar problems in other settings are not applicable in this context. In this paper, we introduce count splitting, an extremely flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study, and apply count splitting to a dataset of pluripotent stem cells differentiating to cardiomyocytes.**Discussant:**James Leiner (Carnegie Mellon University)**Links:**[Slides]
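
As a rough illustration of the splitting step described in the abstract above: if a count $X$ is Poisson with mean $\lambda$, then drawing $X^{\text{train}} \sim \text{Binomial}(X, \epsilon)$ and setting $X^{\text{test}} = X - X^{\text{train}}$ gives two independent Poisson counts with means $\epsilon\lambda$ and $(1-\epsilon)\lambda$ (the standard Poisson thinning property). The sketch below applies this entry by entry to a small matrix. The function name `count_split`, the parameter `eps`, and the toy matrix are assumptions for illustration, not the authors' software.

```python
import random


def count_split(counts, eps=0.5, seed=0):
    """Split a matrix of non-negative integer counts into train and test parts.

    Each count x is thinned: train ~ Binomial(x, eps), test = x - train.
    Under a Poisson model for x, the two parts are independent Poisson counts.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in counts:
        # Binomial(x, eps) drawn as the number of successes in x Bernoulli(eps) trials.
        train_row = [sum(rng.random() < eps for _ in range(x)) for x in row]
        train.append(train_row)
        test.append([x - t for x, t in zip(row, train_row)])
    return train, test


# Hypothetical cells-by-genes count matrix (toy numbers only).
counts = [[3, 0, 7],
          [12, 1, 0]]
train, test = count_split(counts, eps=0.5, seed=1)
print(train)
print(test)
```

In this reading, the latent variable would be estimated from `train` and the gene-level tests run on `test`; both downstream steps are outside the scope of the sketch.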

**Thursday, June 2, 2022**[Recording]**Speaker:**Matteo Sesia (University of Southern California)**Title:**Individualized conditional independence testing under model-X with heterogeneous samples and interactions**Abstract:**Model-X knockoffs and the conditional randomization test are methods that search for conditional associations in large data sets, controlling the type-I errors if the joint distribution of the predictors is known. However, they cannot test for interactions nor find whether an association is only significant within a latent subset of a heterogeneous population. We address this limitation by developing an extension of the knockoff filter that tests conditional associations within automatically detected subsets of individuals, provably controlling the false discovery rate for the selected hypotheses. Then, under the additional assumption of a partially linear model with a binary predictor, we extend the conditional randomization test as to make inferences about quantiles of individual effects that are robust to sample heterogeneity and interactions. The performances of these methods are investigated through simulations and with the analysis of data from a randomized blood donation experiment with several treatments.**Discussant:**Brad Ross (Stanford University)**Links:**[Relevant papers: paper #1]

**Thursday, May 26, 2022**[Recording]**Speaker:**James Leiner (Carnegie Mellon University)**Title:**Data ~~Blurring~~ Fission: sample splitting a single sample**Abstract:**Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2021) offer an alternative route to accomplishing this task through randomization of $X$ with additive Gaussian noise, which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data blurring, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.**Discussant:**Daniel Garcia Rasines (ICMAT - CSIC)

**Thursday, May 19, 2022**[Recording]**Speaker:**Daniel Wilhelm (University College London)**Title:**Inference for Ranks**Abstract:**It is often desired to rank different populations according to the value of some feature of each population. For example, it may be desired to rank neighborhoods according to some measure of intergenerational mobility or countries according to some measure of academic achievement. These rankings are invariably computed using estimates rather than the true values of these features. As a result, there may be considerable uncertainty concerning the rank of each population. In this paper, we consider the problem of accounting for such uncertainty by constructing confidence sets for the rank of each population. We consider both the problem of constructing marginal confidence sets for the rank of a particular population as well as simultaneous confidence sets for the ranks of all populations. We show how to construct such confidence sets under weak assumptions. An important feature of all of our constructions is that they remain computationally feasible even when the number of populations is very large. We apply our theoretical results to re-examine the rankings of both neighborhoods in the United States in terms of intergenerational mobility and developed countries in terms of academic achievement. The conclusions about which countries do best and worst at reading, math, and science are fairly robust to accounting for uncertainty. The confidence sets for the ranking of the 50 most populous commuting zones by measures of mobility are also found to be small. These rankings, however, become much less informative if one includes all commuting zones, if one considers neighborhoods at a more granular level (counties, Census tracts), or if one uses movers across areas to address concerns about selection.**Discussant:**Aldo Solari (University of Milano-Bicocca)

**Thursday, May 12, 2022**[Recording]**Speaker:**Colin Fogarty (Massachusetts Institute of Technology)**Title:**Sensitivity and Multiplicity**Abstract:**Corrections for multiple comparisons generally imagine that all other modeling assumptions are met for the hypothesis tests being conducted, such that the only reason for inflated false rejections is the fact that multiplicity has been ignored when performing inference. In reality, such modes of inference often rest upon unverifiable assumptions. Common expedients include the assumption of "representativeness" of the sample at hand for the population of interest; and of "no unmeasured confounding" when inferring treatment effects in observational studies. In a sensitivity analysis, one quantifies the magnitude of the departure from unverifiable assumptions required to explain away the findings of a study. Individually, both sensitivity analyses and multiplicity controls can reduce the rate at which true signals are detected and reported. In studies with multiple outcomes resting upon untestable assumptions, one may be concerned that correcting for multiple comparisons while also conducting a sensitivity analysis could render the study entirely devoid of power. We present results on sensitivity analysis for observational studies with multiple endpoints, where the researcher must simultaneously account for multiple comparisons and assess robustness to hidden bias. We find that of the two pursuits, it is recognizing the potential for hidden bias that plays the largest role in determining the conclusions of a study: individual findings that are robust to hidden bias are remarkably persistent in the face of multiple comparisons, while sensitive findings are quickly erased regardless of the number of comparisons.
Through simulation studies and empirical examples, we show that through the incorporation of the proposed methodology within a closed testing framework, in a sensitivity analysis one can often attain the same power for testing individual hypotheses that one would have attained had one not accounted for multiple comparisons at all. This suggests that once one commits to conducting a sensitivity analysis, the additional loss in power from controlling for multiple comparisons may be substantially attenuated.**Discussant:**Bo Zhang (Fred Hutchinson Cancer Center)**Links:**[Relevant papers: paper #1, paper #2, paper #3][Slides]

**Thursday, May 5, 2022**[Recording]**Speaker:**Ariane Marandon (Sorbonne Université, LPSM)**Title:**False clustering rate control in mixture models**Abstract:**The clustering task consists in delivering labels to the members of a sample. For most data sets, some individuals are ambiguous and intrinsically difficult to attribute to one or another cluster. However, in practical applications, misclassifying individuals is potentially disastrous. To overcome this difficulty, the idea followed here is to classify only a part of the sample in order to obtain a small misclassification rate. This approach is well known in the supervised setting, and referred to as classification with an abstention option. The purpose of this paper is to revisit this approach in an unsupervised mixture-model framework. The problem is formalized in terms of controlling the false clustering rate (FCR) below a prescribed level α, while maximizing the number of classified items. New procedures are introduced and their behavior is shown to be close to the optimal one by establishing theoretical results and conducting numerical experiments.**Discussant:**Gilles Blanchard (Université Paris Sud)

**Thursday, April 28, 2022**[Recording]**Speaker:**Pragya Sur (Harvard University)**Title:**A modern central limit theorem for the classical doubly robust estimator: variance inflation and beyond**Abstract:**Estimating the average treatment effect (ATE) is a central problem in causal inference. Modern advances in the field studied estimation and inference for the ATE in high dimensions through a variety of approaches. Doubly robust estimators form a popular approach in this context. However, the high-dimensional literature surrounding these estimators relies on sparsity conditions, either on the outcome regression (OR) or the propensity score (PS) model. This talk will introduce a new central limit theorem for the classical doubly robust (DR) estimator, that applies agnostic to such sparsity-type assumptions. Specifically, we will study properties of the cross-fit version of the estimator under well-specified OR and PS models, and the common modern regime where the number of features and samples are both large and comparable. In this regime, under assumptions on the covariate distribution, our CLT will uncover two crucial phenomena among others: (i) the DR estimator exhibits a substantial variance inflation that can be precisely quantified in terms of the signal-to-noise ratio and other problem parameters, (ii) the asymptotic covariance between the estimators used while cross-fitting is not negligible even on the root-n scale. These findings are strikingly different from their classical counterparts, and open a vista of possibilities for studying similar other high-dimensional effects. On the technical front, our work utilizes a novel interplay between three distinct tools—approximate message passing theory, the theory of deterministic equivalents, and the leave-one-out approach. Time permitting, I will outline some of these techniques. 
This is based on joint work with Kuanhao Jiang, Rajarshi Mukherjee, and Subhabrata Sen.**Discussant:**Michael Celentano (University of California, Berkeley)

**Thursday, April 21, 2022**[Recording]**Speaker:**Zongming Ma (University of Pennsylvania)**Title:**Testing equivalence of clustering**Abstract:**In this talk, we test whether two datasets measured on the same set of subjects share a common clustering structure. As a leading example, we focus on comparing clustering structures in two independent random samples from two deterministic two-component mixtures of multivariate Gaussian distributions. Mean parameters of these Gaussian distributions are treated as potentially unknown nuisance parameters and are allowed to differ. Assuming knowledge of mean parameters, we first determine the phase diagram of the testing problem over the entire range of signal-to-noise ratios by providing both lower bounds and tests that achieve them. When nuisance parameters are unknown, we propose tests that achieve the detection boundary adaptively as long as ambient dimensions of the datasets grow at a sub-linear rate with the sample size. The talk is based on a joint work with Chao Gao.**Discussant:**Kaizheng Wang (Columbia University)**Links:**[Relevant papers: paper #1]

**Thursday, April 14, 2022**[Recording]**Speaker:**Zheng (Tracy) Ke (Harvard University)**Title:**Power Analysis and Phase Transitions for FDR Control Methods**Abstract:**Many recent FDR control methods have been proposed under sparse linear regression models. In this talk, we are interested in two questions: 1) How to design an FDR control method to achieve good power? 2) Does the operation of adding fake variables in an FDR control method lead to any unwanted power loss? We tackle these questions by viewing an FDR control method as having three components: ranking algorithm, tampered design and symmetric statistic. We consider a collection of different combinations of the three components, where each combination corresponds to a specific FDR control method. This collection covers the recent methods of knockoff filter (Barber and Candes, 2015), Gaussian mirror (Xing, Zhao and Liu, 2021) and their variants. We evaluate the power of each FDR control method by deriving their theoretical phase diagrams under a Rare/Weak signal model. We then answer Question (1) by comparing the phase diagrams of different FDR control methods and deriving insights of power boost. We answer Question (2) by comparing the phase diagram of an FDR control method with its prototype – a method that uses an ideal threshold. We give encouraging examples where an FDR control method has a negligible power loss relative to its prototype.**Discussant:**Asaf Weinstein (Hebrew University of Jerusalem)**Links:**[Relevant papers: paper #1]

**Thursday, April 7, 2022**[Recording]**Speaker:**Yiqun Chen (University of Washington)**Title:**Selective inference for k-means clustering**Abstract:**We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate, because the clusters were obtained on the same data used for testing. To overcome this problem, we propose a selective inference approach. We describe an efficient algorithm to compute a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering. We apply our proposal in simulation, and on hand-written digits data and single-cell RNA-sequencing data.**Discussant:**Govinda Kamath (10x Genomics)

**Thursday, March 31, 2022**[Recording]**Speaker:**Jake Soloff (UC Berkeley)**Title:**The edge of discovery: Controlling the local false discovery rate at the margin**Abstract:**Despite the popularity of the false discovery rate (FDR) as an error control metric for large-scale multiple testing, its close Bayesian counterpart the local false discovery rate (lfdr), defined as the posterior probability that a particular null hypothesis is true, is a more directly relevant standard for justifying and interpreting individual rejections. However, the lfdr is difficult to work with in small samples, as the prior distribution is typically unknown. We propose a simple multiple testing procedure and prove that it controls the expectation of the maximum lfdr across all rejections; equivalently, it controls the probability that the rejection with the largest p-value is a false discovery. Our method operates without knowledge of the prior, assuming only that the p-value density is uniform under the null and decreasing under the alternative. We also show that our method asymptotically implements the oracle Bayes procedure for a weighted classification risk, optimally trading off between false positives and false negatives. We derive the limiting distribution of the attained maximum lfdr over the rejections, and the limiting empirical Bayes regret relative to the oracle procedure.

This is joint work with Daniel Xiang and Will Fithian.

**Discussant:**Jinjin Tian (Carnegie Mellon University)

**Thursday, March 24, 2022**[Recording]**Speaker:**Ziyu (Neil) Xu (Carnegie Mellon University)**Title:**Post-selection inference for e-value based confidence intervals**Abstract:**Suppose that one can construct a valid (1-delta)-CI for each of K parameters of potential interest. A data analyst uses an arbitrary data-dependent criterion to select some subset S of them for reporting, or highlighting. The confidence intervals for the selected parameters are no longer valid, due to the selection bias, so the question is how one must adjust these in order to account for selection. We focus on the popular notion of false coverage rate (FCR), which is the expected ratio of the number of selected intervals that miscover, to the number of selected intervals |S|. The main established method is the "BY procedure" from a seminal work by Benjamini and Yekutieli (JASA, 2005), that was inspired by the Benjamini-Hochberg (BH) procedure. Unfortunately, the BY procedure involves restrictions on the dependence between CIs and the selection criterion. We propose a natural and much simpler method (both in implementation and in proof) which is valid under any dependence structure between the original CIs, and any (unknown) selection criterion, but which only applies to a special, yet broad, class of CIs. Our procedure reports (1-delta|S|/K)-CIs for the selected parameters, and we prove that it controls the FCR at delta for confidence intervals that implicitly invert *e-values*; examples include those constructed via supermartingale methods, or via universal inference, or via Chernoff-style bounds on the moment generating function, among others.

Our work also has implications for multiple testing in sequential settings, since it applies at stopping times to continuously-monitored confidence sequences and multi-armed bandit sampling.

**Discussant:**Zhimei Ren (University of Chicago)

**Thursday, March 17, 2022**[Recording: (part 1) (part 2) ]**Speaker:**Yachong Yang (University of Pennsylvania)**Title:**Double robust prediction with covariate shift**Abstract:**Conformal prediction has received tremendous attention in recent years with several applications across health and social sciences. Recently, conformal inference has offered new solutions to problems in causal inference, which has led to advances in the modern discipline of semiparametric statistics for constructing novel, efficient prediction uncertainty quantification. In this paper, we consider the problem of obtaining distribution-free prediction regions when there is a shift in the distribution of the covariates between the training and test data. We propose a method built on the efficient influence function for the average treatment effect among the treated (ATT) functional that can be combined with an arbitrary training algorithm, without compromising asymptotic coverage. The prediction set attains nominal average coverage. This guarantee is a consequence of the product bias form of our proposal, which implies correct coverage if either the propensity score or the conditional distribution of the response can be estimated sufficiently well, also known as double robustness. We also discuss parameter tuning for optimal performance, and resolve a number of open problems at the intersection of causal inference, semiparametric theory, and conformal prediction.**Discussant:**James Robins (Harvard University)

**Thursday, March 10, 2022**[Recording]**Speaker:**Leying Guan (Yale University)**Title:**Localized Conformal Prediction**Abstract:**We propose an inference framework called localized conformal prediction. It generalizes conformal prediction and offers a single-test-sample adaptive construction by emphasizing a local region around it, and can be combined with different conformal score constructions. The proposed framework enjoys an assumption-free finite sample marginal coverage guarantee. In addition, it offers approximate/asymptotic conditional coverage guarantees under suitable assumptions. We demonstrate how to change from conformal prediction to localized conformal prediction using several conformal scores and an associated potential gain via numerical examples.**Discussant:**Rafael Izbicki (Federal University of São Carlos)**Links:**[Relevant papers: paper #1]

**Thursday, March 3, 2022**[Recording]**Speaker:**Ajit Tamhane (Northwestern University)**Title:**Testing Primary and Secondary Endpoints in Group Sequential Clinical Trials**Abstract:**In this talk I will give an overview of my work (in collaboration with others) over the last decade on the important practical problem of testing primary and secondary endpoints subject to a gatekeeping constraint in group sequential clinical trials. I will also mention some current work that is under way on interesting extensions. The focus of the talk will be the ideas behind the results and not the technical proofs. As such, this talk should be accessible to all.**Discussant:**Jason Hsu (The Ohio State University)**Links:**[Relevant papers: paper #1, paper #2, paper #3, paper #4]

**Thursday, February 24, 2022**[Recording]**Speaker:**Bradley Rava (University of Southern California)**Title:**A Burden Shared is a Burden Halved: A Fairness-Adjusted Approach to Classification**Abstract:**We study fairness in classification, where one wishes to make automated decisions for people from different protected groups. When individuals are classified, the decision errors can be unfairly concentrated in certain protected groups. We develop a fairness-adjusted selective inference (FASI) framework and data-driven algorithms that achieve statistical parity in the sense that the false selection rate (FSR) is controlled and equalized among protected groups. The FASI algorithm operates by converting the outputs from black-box classifiers to R-values, which are intuitively appealing and easy to compute. Selection rules based on R-values are provably valid for FSR control, and avoid disparate impacts on protected groups. The effectiveness of FASI is demonstrated through both simulated and real data. Joint work with Wenguang Sun, Gareth James and Xin Tong.**Discussant:**Yaniv Romano (Technion—Israel Institute of Technology)**Links:**[Relevant papers: paper #1]

**Thursday, February 17, 2022**[Recording]**Speaker:**Dillon Bowen (University of Pennsylvania)**Title:**Inference for Losers**Abstract:**Researchers frequently report league tables ranking units (neighborhoods or firms, for instance) based on estimated coefficients. Since the rankings are formed based on estimates, however, coefficients reported in league tables suffer from selection bias, with estimates for highly-ranked units biased upwards and those for low-ranked units biased downwards. Further, conventional confidence intervals can undercover. This paper introduces corrected estimators and confidence intervals that address these biases, ensuring that estimates and confidence intervals reported for each position in a league table are median-unbiased and have correct coverage, respectively.**Discussant:**Roger Koenker (University College London)**Links:**[Relevant papers: paper #1]

**Thursday, February 10, 2022**[Recording]**Speaker:**Ying Ding (University of Pittsburgh)**Title:**Logic Inference and Testing in Targeted Treatment Development with Survival Outcomes**Abstract:**There has been growing interest in discovering precision medicine in modern drug development and biomedical research. One aspect of precision medicine is to develop new therapies that target a subgroup of patients who exhibit enhanced treatment efficacy (as compared to the complement of the subgroup) through randomized controlled trials (RCTs). In this talk, we will address two important statistical problems in such a targeted treatment development process when the outcome is of time-to-event type: (1) establish a correct and logical inference procedure when the population is a mixture of subgroups with differential efficacy; (2) develop a multiple-testing-based procedure to simultaneously identify and infer subgroups with enhanced treatment efficacy. Specifically, we propose a subgroup mixable estimation (SME) procedure to estimate efficacy in subgroups and their mixtures. We also develop a confident effect (CE4) approach which formulates the multiple testing problem through contrasts and constructs their simultaneous confidence intervals. Such a testing procedure rigorously controls both within- and across-marker multiplicity. We illustrate the methods on a large RCT of an eye disease, age-related macular degeneration (AMD), by discovering consistent differential treatment effects on delaying AMD progression in subgroups defined by SNPs.**Discussant:**Thorsten Dickhaus (University of Bremen)

**Thursday, February 3, 2022**[Recording]**Speaker:**Yonghoon Lee (University of Chicago)**Title:**Distribution-free inference for regression: discrete, continuous, and in between**Abstract:**In data analysis problems where we are not able to rely on distributional assumptions, what types of inference guarantees can still be obtained? Many popular methods, such as holdout methods, cross-validation methods, and conformal prediction, are able to provide distribution-free guarantees for predictive inference, but the problem of providing inference for the underlying regression function (for example, inference on the conditional mean 𝔼[Y|X]) is more challenging. In the setting where the features X are continuously distributed, recent work has established that any confidence interval for 𝔼[Y|X] must have non-vanishing width, even as sample size tends to infinity. At the other extreme, if X takes only a small number of possible values, then inference on 𝔼[Y|X] is trivial to achieve. In this work, we study the problem in settings in between these two extremes. We find that there are several distinct regimes in between the finite setting and the continuous setting, where vanishing-width confidence intervals are achievable if and only if the effective support size of the distribution of X is smaller than the square of the sample size.**Discussant:**Ying Jin (Stanford University)**Links:**[Relevant papers: paper #1][Slides][Discussion Slides]

**Thursday, January 27, 2022**[Recording]**Speaker:**Richard Berk (University of Pennsylvania)**Title:**Improving Fairness in Criminal Justice Algorithmic Risk Assessments Using Optimal Transport and Conformal Prediction Sets**Abstract:**In the United States and elsewhere, risk assessment algorithms are being used to help inform criminal justice decision-makers. A common intent is to forecast an offender's "future dangerousness." Such algorithms have been correctly criticized for potential unfairness, and there is an active cottage industry trying to make repairs. In this paper, we use counterfactual reasoning to consider the prospects for improved fairness when members of a less privileged group are treated by a risk algorithm as if they are members of a more privileged group. We combine a machine learning classifier trained in a novel manner with an optimal transport adjustment for the relevant joint probability distributions, which together provide a constructive response to claims of bias-in-bias-out. A key distinction is between fairness claims that are empirically testable and fairness claims that are not. We then use confusion tables and conformal prediction sets to evaluate achieved fairness for projected risk. Our data are a random sample of 300,000 offenders at their arraignments for a large metropolitan area in the United States during which decisions to release or detain are made. We show that substantial improvement in fairness can be achieved consistent with a Pareto improvement for protected groups.**Discussant:**Emmanuel Candès (Stanford University)**Links:**[Relevant papers: paper #1][Slides][Discussion Slides]

**Thursday, January 20, 2022**[Recording]**Speaker:**Richard Samworth (University of Cambridge)**Title:**Optimal subgroup selection**Abstract:**In clinical trials and other applications, we often see regions of the feature space that appear to exhibit interesting behaviour, but it is unclear whether these observed phenomena are reflected at the population level. Focusing on a regression setting, we consider the subgroup selection challenge of identifying a region of the feature space on which the regression function exceeds a pre-determined threshold. We formulate the problem as one of constrained optimisation, where we seek a low-complexity, data-dependent selection set on which, with a guaranteed probability, the regression function is uniformly at least as large as the threshold; subject to this constraint, we would like the region to contain as much mass under the marginal feature distribution as possible. This leads to a natural notion of regret, and our main contribution is to determine the minimax optimal rate for this regret in both the sample size and the Type I error probability. The rate involves a delicate interplay between parameters that control the smoothness of the regression function, as well as exponents that quantify the extent to which the optimal selection set at the population level can be approximated by families of well-behaved subsets. Finally, we expand the scope of our previous results by illustrating how they may be generalised to a treatment and control setting, where interest lies in the heterogeneous treatment effect.**Discussant:**Charles Doss (University of Minnesota)

**Thursday, December 16, 2021**[Recording]**Speaker:**Marina Bogomolov (Technion - Israel Institute of Technology)**Title:**Adaptive methods for testing hypotheses with group structure while simultaneously controlling several error rates**Abstract:**In many statistical applications a large set of hypotheses is tested, and the hypotheses can be naturally classified into groups based on different criteria, defined by the characteristics of the problem. Examples of such applications include brain imaging, microbiome, and genome-wide association studies. In such settings, it may be of interest to identify groups containing signals, for each partition into groups, with control over false discoveries. This goal was addressed by Barber and Ramdas (2017) and by Ramdas, Barber, Wainwright, and Jordan (2019), who developed the p-filter method for controlling the group-level false discovery rate (FDR) simultaneously for all partitions. We address the same goal, and aim to increase the power of the p-filter method by capturing the group structure of the hypotheses using adaptive weights developed by Nandi, Sarkar, and Chen (2021). We prove that the modified p-filter method controls the group-level FDR for each partition into groups under independence, and show by simulations that it seems to retain the control under certain forms of positive dependence. Our simulation study shows that the proposed modification increases the power of the method in the settings where the signals are concentrated within some groups. We compare the performance of the modified method to that of the original p-filter on real brain imaging data, where the hypotheses are grouped with respect to two criteria. This is joint work with Ido Griness.**Discussant:**Shinjini Nandi (Montana State University)**Links:**[Relevant papers:]

**Thursday, December 9, 2021**[Recording (part I) Recording (part II)]**Speaker:**Jiaying Gu (University of Toronto)**Title:**Invidious Comparisons: Ranking and Selection as Compound Decisions**Abstract:**There is an innate human tendency, one might call it the “league table mentality,” to construct rankings. Schools, hospitals, sports teams, movies, and myriad other objects are ranked even though their inherent multi-dimensionality would suggest that – at best – only partial orderings were possible. We consider a large class of elementary ranking problems in which we observe noisy, scalar measurements of merit for n objects of potentially heterogeneous precision and are asked to select a group of the objects that are “most meritorious.” The problem is naturally formulated in the compound decision framework of Robbins’s (1956) empirical Bayes theory, but it also exhibits close connections to the recent literature on multiple testing. The nonparametric maximum likelihood estimator for mixture models (Kiefer and Wolfowitz (1956)) is employed to construct optimal ranking and selection rules. Performance of the rules is evaluated in simulations and an application to ranking U.S. kidney dialysis centers.**Discussant:**Soonwoo Kwon (Brown University)

**Thursday, December 2, 2021**[Recording]**Speaker:**Daniel Garcia Rasines (ICMAT - CSIC)**Title:**Splitting strategies for post-selection inference**Abstract:**We consider the problem of providing valid inference for a selected parameter in a sparse regression setting. It is well known that classical regression tools can be unreliable in this context due to the bias generated in the selection step. Many approaches have been proposed in recent years to ensure inferential validity. Here, we consider a simple alternative to data splitting based on randomising the response vector, which allows for higher selection and inferential power than the former and is applicable with an arbitrary selection rule. We provide a theoretical and empirical comparison of both methods and extend the randomisation approach to non-normal settings. Our investigations show that the gain in power can be substantial.**Discussant:**Tijana Zrnic (UC Berkeley)

**Thursday, November 18, 2021**[Recording]**Speaker:**Cynthia Rush (Columbia University)**Title:**Characterizing the Type 1-Type 2 Error Trade-off for SLOPE**Abstract:**Sorted L1 regularization has been incorporated into many methods for solving high-dimensional statistical estimation problems, including the SLOPE estimator in linear regression. In this talk, we study how this relatively new regularization technique improves variable selection by characterizing the optimal SLOPE trade-off between the false discovery proportion (FDP) and true positive proportion (TPP) or, equivalently, between measures of type I and type II error. Additionally, we show that on any problem instance, SLOPE with a certain regularization sequence outperforms the Lasso, in the sense of having a smaller FDP, larger TPP and smaller L2 estimation risk simultaneously. Our proofs are based on a novel technique that reduces a variational calculus problem to a class of infinite-dimensional convex optimization problems and a very recent result from approximate message passing (AMP) theory. With SLOPE being a particular example, we discuss these results in the context of a general program for systematically deriving exact expressions for the asymptotic risk of estimators that are solutions to a broad class of convex optimization problems via AMP.**Discussant:**Yuting Wei (University of Pennsylvania)

**Thursday, November 11, 2021**[Recording]**Speaker:**Shuangning Li (Stanford University)**Title:**Deploying the Conditional Randomization Test in High Multiplicity Problems**Abstract:**This paper introduces the sequential CRT, which is a variable selection procedure that combines the conditional randomization test (CRT) and Selective SeqStep+. Valid p-values are constructed via the flexible CRT, which are then ordered and passed through the selective SeqStep+ filter to produce a list of discoveries. We develop theory guaranteeing control on the false discovery rate (FDR) even though the p-values are not independent. We show in simulations that our novel procedure indeed controls the FDR and is competitive with, and sometimes outperforms, state-of-the-art alternatives in terms of power. Finally, we apply our methodology to a breast cancer dataset with the goal of identifying biomarkers associated with cancer stage.**Discussant:**Jingyi Jessica Li (UCLA)

**Thursday, November 4, 2021**[Recording]**Speaker:**Kai Zhang (The University of North Carolina at Chapel Hill)**Title:**BEAUTY Powered BEAST**Abstract:**We study nonparametric dependence detection with the proposed binary expansion approximation of uniformity (BEAUTY) approach, which generalizes the celebrated Euler's formula, and approximates the characteristic function of any copula with a linear combination of expectations of binary interactions from marginal binary expansions. This novel theory enables a unification of many important tests through approximations from some quadratic forms of symmetry statistics, where the deterministic weight matrix characterizes the power properties of each test. To achieve a robust power, we study test statistics with data-adaptive weights, referred to as the binary expansion adaptive symmetry test (BEAST). By utilizing the properties of the binary expansion filtration, we show that the Neyman-Pearson test of uniformity can be approximated by an oracle weighted sum of symmetry statistics. The BEAST with this oracle provides a benchmark of feasible power against any alternative by leading all existing tests with a substantial margin. To approach this oracle power, we develop the BEAST through a regularized resampling approximation of the oracle test. The BEAST improves the empirical power of many existing tests against a wide spectrum of common alternatives and provides clear interpretation of the form of dependency when significant.**Discussant:**Bhaswar Bhattacharya (University of Pennsylvania)

**Thursday, October 28, 2021**[Recording]**Speaker:**Chiara Sabatti (Stanford University)**Title:**Searching for consistent associations with a multi-environment knockoff filter**Abstract:**This paper develops a method based on model-X knockoffs to find conditional associations that are consistent across diverse environments, controlling the false discovery rate. The motivation for this problem is that large data sets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associations consistently replicated under different conditions may be more interesting. In fact, consistency sometimes provably leads to valid causal inferences even if conditional associations do not. While the proposed method is flexible and can be deployed in a wide range of applications, this paper highlights its relevance to genome-wide association studies, in which consistency across populations with diverse ancestries mitigates confounding due to unmeasured variants. The effectiveness of this approach is demonstrated by simulations and applications to the UK Biobank data.**Discussant:**Niklas Pfister (University of Copenhagen)

**Thursday, October 21, 2021**[Recording]**Speaker:**Yao Zhang (University of Cambridge)**Title:**Multiple conditional randomization tests**Abstract:**We propose a general framework for (multiple) conditional randomization tests that incorporate several important ideas in the recent literature. We establish a general sufficient condition on the construction of multiple conditional randomization tests under which their p-values are "independent", in the sense that their joint distribution stochastically dominates the product of uniform distributions under the null. Conceptually, we argue that randomization should be understood as the mode of inference precisely based on randomization. We show that under a change of perspective, many existing statistical methods, including permutation tests for (conditional) independence and conformal prediction, are special cases of the general conditional randomization test. The versatility of our framework is further illustrated with an example concerning lagged treatment effects in stepped-wedge randomized trials.**Discussant:**Panos Toulis (University of Chicago)**Links:**[Relevant papers: paper #1][Slides][Discussion slides]

**Thursday, October 14, 2021**[Recording]**Speaker:**Byol Kim (University of Chicago)**Title:**Predictive inference is free with the jackknife+-after-bootstrap**Abstract:**Ensemble learning is widely used in applications to make predictions in complex decision problems, for example, averaging models fitted to a sequence of samples bootstrapped from the available training data. While such methods offer more accurate, stable, and robust predictions and model estimates, much less is known about how to perform valid, assumption-lean inference on the output of these types of procedures. In this paper, we propose the jackknife+-after-bootstrap (J+aB), a procedure for constructing a predictive interval, which uses only the available bootstrapped samples and their corresponding fitted models, and is therefore "free" in terms of the cost of model fitting. The J+aB offers a predictive coverage guarantee that holds with no assumptions on the distribution of the data, the nature of the fitted model, or the way in which the ensemble of models is aggregated; at worst, the failure rate of the predictive interval is inflated by a factor of 2. Our numerical experiments verify the coverage and accuracy of the resulting predictive intervals on real data. This work is joint with Chen Xu and Rina Foygel Barber.**Discussant:**Yachong Yang (University of Pennsylvania)

**Thursday, October 7, 2021**[Recording]**Speaker:**Kenneth Hung (Facebook)**Title:**Statistical Methods for Replicability Assessment**Abstract:**Large-scale replication studies like the Reproducibility Project: Psychology (RP:P) provide invaluable systematic data on scientific replicability, but most analyses and interpretations of the data fail to agree on the definition of “replicability” and disentangle the inexorable consequences of known selection bias from competing explanations. We discuss three concrete definitions of replicability based on: (1) whether published findings about the signs of effects are mostly correct, (2) how effective replication studies are in reproducing whatever true effect size was present in the original experiment and (3) whether true effect sizes tend to diminish in replication. We apply techniques from multiple testing and post-selection inference to develop new methods that answer these questions while explicitly accounting for selection bias. Our analyses suggest that the RP:P dataset is largely consistent with publication bias due to selection of significant effects. The methods in this paper make no distributional assumptions about the true effect sizes.**Discussant:**Marcel van Assen (Tilburg University)

**Thursday, September 30, 2021**[Recording]**Speaker:**Pallavi Basu (Indian School of Business)**Title:**Empirical Bayes Control of the False Discovery Exceedance**Abstract:**We propose an empirical Bayes procedure that guarantees control of the False Discovery eXceedance (FDX) by ranking and thresholding hypotheses based on their local false discovery rate (lfdr) test statistic. In a two-group independent model or Gaussian with exchangeable hypotheses, we show that ranking by the lfdr delivers the "optimal" ranking for FDX control. We propose a computationally efficient procedure that does not empirically lose validity and power and illustrate its properties by analyzing two million stock trading strategies.

Joint work with Luella Fu, Alessio Saretto, and Wenguang Sun.

**Discussant:**Sebastian Döhler (Darmstadt University of Applied Sciences)**Links:**[Relevant papers: paper #1]

**Thursday, August 12, 2021**[Recording]**Speaker:**Sanat K. Sarkar (Temple University)**Title:**Adjusting the Benjamini-Hochberg method for controlling the false discovery rate in knockoff-assisted variable selection**Abstract:**The knockoff-based multiple testing setup of Barber & Candès (2015) for variable selection in multiple regression where sample size is as large as the number of explanatory variables is considered. The Benjamini-Hochberg method based on ordinary least squares estimates of the regression coefficients is adjusted to the setup, transforming it to a valid p-value based FDR controlling method not relying on any specific correlation structure of the explanatory variables. Simulations and real data applications show that our proposed method that is agnostic to $\pi_0$, the proportion of unimportant explanatory variables, and a data-adaptive version of it that uses an estimate of $\pi_0$ are powerful competitors of the FDR controlling methods in Barber & Candès (2015).**Discussant:**Lucas Janson (Harvard University)**Links:**[Relevant papers: paper #1]

**Thursday, August 5, 2021**[Recording]**Speaker:**Snigdha Panigrahi (University of Michigan)**Title:**Approximate Methods for Joint Estimation of Group-sparse Parameters post Selection**Abstract:**In this talk, I will present a post-selective Bayesian framework to jointly and consistently estimate parameters within automatic group-sparse regression models. Selected through an indispensable class of learning algorithms, e.g. the Group LASSO, the overlapping Group LASSO, the sparse Group LASSO etc., uncertainty estimates for the matched parameters are unreliable in the absence of adjustments for selection bias. Limiting however the application of state of the art tools for the group-sparse problem include estimation strictly tailored to (i) real-valued projections onto very specific selected subspaces, (ii) selection events admitting representations as linear inequalities in the data variables. The proposed approximate Bayesian methods address these gaps by deriving an adjustment factor in an easily feasible analytic form that eliminates bias from the selection of promising groups. Paying a very nominal price for this adjustment, experiments on simulated data demonstrate the efficiency of our methods at a joint estimation of group-sparse parameters learned from data.

This talk is based upon joint work with Peter W. Macdonald and Daniel Kessler.

**Discussant:**Joshua Loftus (London School of Economics)

**Thursday, July 29, 2021**[Link to join]**Speaker:**Wesley Tansey (Memorial Sloan Kettering Cancer Center)**Title:**Efficient, robust, and powerful machine learning approaches to conditional independence testing**Abstract:**In this talk, I will present two approaches to conditional independence testing using deep neural networks. The first half of the talk focuses on the model-X knockoffs framework. I will present an optimization approach, Deep Direct Likelihood Knockoffs (DDLK), to learning the knockoff distribution directly through minimizing an adversarial swap objective. In the second half of the talk, I will shift to the conditional randomization test (CRT) framework. CRTs have higher power than knockoffs but come with a computational burden that generally makes them intractable. I will present an information-theoretic approach to CRTs, the Decoupled Independence Test (DIET), that overcomes this burden by reducing the CRT to a series of marginal independence tests. DIET estimates the residual information about the response and target variable after removing mutual information with the covariates. Under mild conditions, testing for conditional independence then reduces to testing for marginal independence between these two residuals. Both DDLK and DIET achieve higher power than existing methods and empirically control the target error rate in a broad class of benchmarks on synthetic and semi-synthetic data.**Discussant:**Thomas Berrett (University of Warwick)**Links:**[Relevant papers][Slides]

**Thursday, July 22, 2021**[Recording]**Speaker:**Matthew Plumlee (Northwestern University)**Title:**Inexact computer model calibration: Concerns, controversy, credibility, and confidence**Abstract:**There has been a recent surge in statistical methods for calibration of inexact models. Alongside these developments, a controversy has emerged about the goals of calibration of inexact models. This talk will trace a swath of research stemming from twenty years ago, marking potential concerns along the way. The talk will also present some new ideas in this setting that might help close some of these philosophical and practical issues.**Discussant:**Rui Tuo (Texas A&M University)

**Thursday, July 15, 2021**[Recording]**Speaker:**Armin Schwartzman (UCSD)**Title:**Spatial inference for excursion sets**Abstract:**Spatial inference for excursion sets refers to the problem of estimating the set of locations where a function is greater than a threshold. This problem appears in analyses of 2D climate data and 3D brain imaging data. The purpose of solving such a problem is to provide an alternative to the standard large-scale multiple testing approach, where all locations in an image are tested for the presence of signal. As sample sizes in large imaging studies keep increasing, the statistical power becomes sufficient to detect the presence of signal in large portions of the image, making it difficult to localize important effects. Moreover, the multiple testing approach does not provide a measure of spatial uncertainty. We directly address the question of where the important effects are by estimating excursion sets and by constructing spatial confidence sets, given as nested regions that spatially bound the true excursion set with a given probability. We develop this approach for excursion sets of the mean function in a signal-plus-noise model, including coefficients in pointwise regression models, and further extend it to the Cohen's d parameter in order to handle spatial heteroscedasticity. Examples and computational issues are discussed for 3D fMRI data.**Discussant:**Jelle Goeman (Leiden University)**Links:**[Relevant papers: paper #1, paper #2, paper #3][Slides][Discussion Slides]

**Thursday, July 1, 2021**[Recording]**Speaker:**Xiao Li (UC Berkeley)**Title:**Whiteout: when do fixed-X knockoffs fail?**Abstract:**A core strength of knockoff methods is their virtually limitless customizability, allowing an analyst to exploit machine learning algorithms and domain knowledge without threatening the method’s robust finite-sample false discovery rate control guarantee. While several previous works have investigated regimes where specific implementations of knockoffs are provably powerful, negative results are more difficult to obtain for such a flexible method. In this work we recast the fixed-X knockoff filter for the Gaussian linear model as a conditional post-selection inference method. It adds user-generated Gaussian noise to the ordinary least squares estimator $\hat\beta$ to obtain a “whitened” estimator $\tilde\beta$ with uncorrelated entries, and performs inference using $\mathrm{sgn}(\tilde\beta_j)$ as the test statistic for $H_j : \beta_j = 0$. We prove equivalence between our whitening formulation and the more standard formulation based on negative control predictor variables, showing how the fixed-X knockoffs framework can be used for multiple testing on any problem with (asymptotically) multivariate Gaussian parameter estimates. Relying on this perspective, we obtain the first negative results that universally upper-bound the power of all fixed-X knockoff methods, without regard to choices made by the analyst. Our results show roughly that, if the leading eigenvalues of $\mathrm{Var}(\hat\beta)$ are large with dense leading eigenvectors, then there is no way to whiten $\hat\beta$ without irreparably erasing nearly all of the signal, rendering $\mathrm{sgn}(\tilde\beta_j)$ too uninformative for accurate inference. We give conditions under which the true positive rate (TPR) for any fixed-X knockoff method must converge to zero even while the TPR of Bonferroni-corrected multiple testing tends to one, and we explore several examples illustrating this phenomenon.**Discussant:**Asher Spector (Harvard University)

**Thursday, June 24, 2021**[Recording]**Speaker:**Jason Hsu (The Ohio State University)**Title:**Confident Directional Selective Inference, from Multiple Comparisons with the Best to Precision Medicine**Abstract:**MCB (multiple comparisons with the best, 1981, 1984), comparing treatments to the best without knowing which one is the best, can be considered an early example of selective inference. With the thinking that "there is only one true best", the relevance of MCB to this presentation is that it led to the Partitioning Principle, which is essential for deriving confidence sets for stepwise tests. Inference based on confidence sets controls the directional error rate; inference based on tests of equalities may not.

The FDA gave Accelerated Approval to Aduhelm™ (aducanumab) for Alzheimer's Disease (AD) on 8 June 2021, based on its reduction of beta-amyloid plaque (a surrogate biomarker endpoint). When clinical efficacy of a treatment for the overall population is not shown, genome-wide association studies (GWAS) are often used to discover SNPs that might predict efficacy in subgroups. In the process of working on GWAS with real data, we came to the realization that, if one causal SNP makes its zero-null hypothesis false, then all other zero-null hypotheses are statistically false as well. While the majority of no-association null hypotheses might well be true biologically, statistically they are false (if one is false) in GWAS. I will illustrate this with a causal SNP for the ApoE gene, which is involved in the clearance of beta-amyloid plaque in AD. We suggest our confidence interval CE4 approach instead.

Targeted therapies such as OPDIVO and TECENTRIQ naturally have patient subgroups, already defined by the extent to which the drug target is present or absent in them, subgroups that may derive differential efficacy. An additional danger of testing equality nulls in the presence of subgroups is that the illusory logical relationships among efficacy in subgroups and their mixtures, created by exact equality nulls, lead to too drastic a stepwise multiplicity reduction, resulting in inflated directional error rates, as I will explain. Instead, Partition Tests, which would be called Confident Direction methods in the language of Tukey, might be safer to use.

**Discussant:**Will Fithian (UC Berkeley)

**Thursday, June 17, 2021**[Recording]**Speaker:**Patrick Chao (University of Pennsylvania)**Title:**AdaPT-GMM: Powerful and robust covariate-assisted multiple testing**Abstract:**We propose a new empirical Bayes method for covariate-assisted multiple testing with false discovery rate (FDR) control, where we model the local false discovery rate for each hypothesis as a function of both its covariates and p-value. Our method refines the adaptive p-value thresholding (AdaPT) procedure by generalizing its masking scheme to reduce the bias and variance of its false discovery proportion estimator, improving the power when the rejection set is small or some null p-values concentrate near 1. We also introduce a Gaussian mixture model for the conditional distribution of the test statistics given covariates, modeling the mixing proportions with a generic user-specified classifier, which we implement using a two-layer neural network. Like AdaPT, our method provably controls the FDR in finite samples even if the classifier or the Gaussian mixture model is misspecified. We show in extensive simulations and real data examples that our new method, which we call AdaPT-GMM, consistently delivers high power relative to competing state-of-the-art methods. In particular, it performs well in scenarios where AdaPT is underpowered, and is especially well-suited for testing composite null hypotheses, such as whether the effect size exceeds a practical significance threshold.**Discussant:**Patrick Kimes (Genentech)

**Thursday, June 10, 2021**[Recording]**Speaker:**Wooseok Ha (UC Berkeley)**Title:**Interpreting deep neural networks in a transformed domain**Abstract:**Machine learning lies at the heart of new possibilities for scientific discovery, knowledge generation, and artificial intelligence. Its potential benefits to these fields require going beyond predictive accuracy and focusing on interpretability. In particular, many scientific problems require interpretations in a domain-specific interpretable feature space (e.g. the frequency or wavelet domain) whereas attributions to the raw features (e.g. the pixel space) may be unintelligible or even misleading. To address this challenge, we propose TRIM (Transformation Importance), a novel approach which attributes importances to features in a transformed space and can be applied post-hoc to a fully trained model. We focus on a problem in cosmology, where it is crucial to interpret how a model trained on simulations predicts fundamental cosmological parameters. By using TRIM in interesting ways, we next introduce adaptive wavelet distillation (AWD), a method that aims to distill information from a trained neural network into a wavelet transform. Specifically, AWD penalizes feature attributions of a neural network in the wavelet domain to learn an effective multi-resolution wavelet transform. The resulting model is highly predictive, concise, computationally efficient, and has properties (such as a multi-scale structure) which make it easy to interpret. We showcase how AWD addresses challenges in two real-world settings: cosmological parameter inference and molecular-partner prediction. In both cases, AWD informs predictive features that are scientifically meaningful in the context of respective domains.**Discussant:**Sarah Tan (Facebook)

**Thursday, June 3, 2021**[Recording]**Speakers:**Song Zhai (UC Riverside)**Title:**Learning from Real World Data About Combinatorial Treatment Selection for COVID-19**Abstract:**COVID-19 is an unprecedented global pandemic with a serious negative impact on virtually every part of the world. Although much progress has been made in preventing and treating the disease, much remains to be learned about how best to treat the disease while considering patient and disease characteristics. This paper reports a case study of combinatorial treatment selection for COVID-19 based on real-world data from a large hospital in Southern China. In this observational study, 417 confirmed COVID-19 patients were treated with various combinations of drugs and followed for four weeks after discharge (or until death). Treatment failure is defined as death during hospitalization or recurrence of COVID-19 within four weeks of discharge. Using a virtual multiple matching method to adjust for confounding, we estimate and compare the failure rates of different combinatorial treatments, both in the whole study population and in subpopulations defined by baseline characteristics. Our analysis reveals that treatment effects are substantial and heterogeneous, and that the optimal combinatorial treatment may depend on baseline age, systolic blood pressure, and C-reactive protein level. Using these three variables to stratify the study population leads to a stratified treatment strategy that involves several different combinations of drugs (for patients in different strata). Our findings are exploratory and require further validation.**Discussant:**Hongyuan Cao (Florida State University)**Links:**[Slides]

**Thursday, May 27, 2021**[Recording]**Speaker:**Matthew Stephens (University of Chicago)**Title:**A simple new approach to variable selection in regression, with application to genetic fine-mapping**Abstract:**We introduce a simple new approach to variable selection in linear regression, with a particular focus on quantifying uncertainty in which variables should be selected. The approach is based on a new model — the “Sum of Single Effects” (SuSiE) model — which comes from writing the sparse vector of regression coefficients as a sum of “single-effect” vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure — Iterative Bayesian Stepwise Selection (IBSS) — which is a Bayesian analogue of stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We provide a formal justification of this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally yields convenient novel summaries of uncertainty in variable selection, providing a Credible Set of variables for each selection. Our methods are particularly well-suited to settings where variables are highly correlated and detectable effects are sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and illustrate their application to fine-mapping genetic variants influencing alternative splicing in human cell-lines. 
We also discuss the potential and challenges for applying these methods to generic variable selection problems.**Discussant:**Peter Bühlmann (ETH Zürich)**Links:**[Relevant papers: paper #1][Slides][Discussant Slides]

**Thursday, May 20, 2021**[Recording]**Speaker:**Dan Kluger (Stanford University)**Title:**A central limit theorem for the Benjamini-Hochberg false discovery proportion under a factor model**Abstract:**The Benjamini-Hochberg (BH) procedure remains widely popular despite having limited theoretical guarantees in the commonly encountered scenario of correlated test statistics. Of particular concern is the possibility that the method could exhibit bursty behavior, meaning that it might typically yield no false discoveries while occasionally yielding both a large number of false discoveries and a false discovery proportion (FDP) that far exceeds its own well controlled mean. In this paper, we investigate which test statistic correlation structures lead to bursty behavior and which ones lead to well controlled FDPs. To this end, we develop a central limit theorem for the FDP in a multiple testing setup where the test statistic correlations can be either short-range or long-range as well as either weak or strong. The theorem and our simulations from a data-driven factor model suggest that the BH procedure exhibits severe burstiness when the test statistics have many strong, long-range correlations, but does not otherwise.**Discussant:**Grant Izmirlian (NCI DCP Biometry Research Group)**Links:**[Relevant papers: paper #1][Slides][Discussion Slides]

**Thursday, May 13, 2021**[Recording]**Speaker:**Chirag Gupta (Carnegie Mellon University)**Title:**Recent advances in distribution-free uncertainty quantification**Abstract:**Uncertainty quantification seeks to supplement point predictions with estimates of confidence or reliability. In the distribution-free (DF) framework, we require these confidence estimates to make valid statistical claims that provably hold no matter how the data is distributed, as long as the training and test data follow the same distribution. We present some recent results in DF uncertainty quantification for classification and regression problems. First, we discuss nested conformal, a framework to produce prediction sets that are guaranteed to contain the true output with a pre-defined probability. We then describe an ensemble-based conformal algorithm, QOOB. QOOB has DF guarantees, is computationally efficient, and produces prediction sets that exhibit strong practical performance on regression tasks. Next, we describe the notion of calibration in binary classification and connect it to prediction sets and confidence intervals. This relationship leads to an impossibility result for continuous-output DF calibration. We then show DF calibration guarantees for a popular discrete-output calibration algorithm called histogram binning. Based on our guarantees, we make practical recommendations for choosing the number of bins in histogram binning.**Discussant:**Rina Foygel Barber (University of Chicago)

**Thursday, May 6, 2021**[Recording]**Speaker:**Marie Perrot-Dockès (Université de Paris)**Title:**Post hoc false discovery proportion inference under a Hidden Markov Model**Abstract:**We address the multiple testing problem under the assumption that the true/false hypotheses are driven by a Hidden Markov Model (HMM), which is recognized as a fundamental setting to model multiple testing under dependence since the seminal work of Sun and Cai (2009). While previous work has concentrated on deriving specific procedures with a controlled False Discovery Rate (FDR) under this model, following a recent trend in selective inference, we consider the problem of establishing confidence bounds on the false discovery proportion (FDP), for a user-selected set of hypotheses that can depend on the observed data in an arbitrary way. We develop a methodology to construct such confidence bounds first when the HMM model is known, then when its parameters are unknown and estimated, including the data distribution under the null and the alternative, using a nonparametric approach. In the latter case, we propose a bootstrap-based methodology to take into account the effect of parameter estimation error. We show that taking advantage of the assumed HMM structure allows for a substantial improvement of confidence bound sharpness over existing agnostic (structure-free) methods, as witnessed both via numerical experiments and real data examples.**Discussant:**Jesse Hemerik (Wageningen University)

**Thursday, April 29, 2021**[Recording]**Speaker:**Thorsten Dickhaus (University of Bremen)**Title:**Randomized p-values in replicability analysis**Abstract:**We will be concerned with testing replicability hypotheses for many endpoints simultaneously. This constitutes a multiple test problem with composite null hypotheses. Traditional p-values, which are computed under least favourable parameter configurations (LFCs), are over-conservative in the case of composite null hypotheses. As demonstrated in prior work, this poses severe challenges in the multiple testing context, especially when one goal of the statistical analysis is to estimate the proportion $\pi_0$ of true null hypotheses. We will discuss the application of randomized p-values in the sense of [1] in replicability analysis. By means of theoretical considerations as well as computer simulations, we will demonstrate that their usage typically leads to a much more accurate estimation of $\pi_0$ than the LFC-based approach. Furthermore, we will draw connections to other recently proposed methods for dealing with conservative p-values in the multiple testing context. Finally, we will present a real data example from genomics. The presentation is based on [2] and [3].**Discussant:**Ruth Heller (Tel Aviv University)**Links:**[Relevant papers: paper #1, paper #2, paper #3][Slides]

**Thursday, April 22, 2021**[Recording]**Speaker:**Feng Ruan (UC Berkeley)**Title:**A Self-Penalizing Objective Function for Scalable Interaction Detection**Abstract:**We tackle the problem of nonparametric variable selection with a focus on discovering interactions between variables. With $p$ variables there are $O(p^s)$ possible order-$s$ interactions, making exhaustive search infeasible. It is nonetheless possible to identify the variables involved in interactions with only linear computation cost, $O(p)$. The trick is to maximize a class of parametrized nonparametric dependence measures which we call metric learning objectives; the landscape of these nonconvex objective functions is sensitive to interactions but the objectives themselves do not explicitly model interactions. Three properties make metric learning objectives highly attractive:

(a) The stationary points of the objective are automatically sparse (i.e. the objective performs selection) -- no explicit $\ell_1$ penalization is needed.

(b) All stationary points of the objective exclude noise variables with high probability.

(c) Guaranteed recovery of all signal variables without needing to reach the objective's global maxima or special stationary points.

The second and third properties mean that all our theoretical results apply in the practical case where one uses gradient ascent to maximize the metric learning objective. While not all metric learning objectives enjoy good statistical power, we design an objective based on $\ell_1$ kernels that does exhibit favorable power: it recovers (i) main effects with $n \sim \log p$ samples, (ii) hierarchical interactions with $n \sim \log p$ samples and (iii) order-$s$ pure interactions with $n \sim p^{2(s-1)} \log p$ samples.

**Discussant:**Sumanta Basu (Cornell University)

**Thursday, April 15, 2021**[Recording]**Speaker:**Nikolaos Ignatiadis (Stanford University)**Title:**Confidence Intervals for Nonparametric Empirical Bayes Analysis**Abstract:**In an empirical Bayes analysis, we use data from repeated sampling to imitate inferences made by an oracle Bayesian with extensive knowledge of the data-generating distribution. Existing results provide a comprehensive characterization of when and why empirical Bayes point estimates accurately recover oracle Bayes behavior. In this work, we develop flexible and practical confidence intervals that provide asymptotic frequentist coverage of empirical Bayes estimands, such as the posterior mean or the local false sign rate. The coverage statements hold even when the estimands are only partially identified or when empirical Bayes point estimates converge very slowly. This is joint work with Stefan Wager.**Discussant:**Timothy Armstrong (Yale University)

**Thursday, April 8, 2021**[Recording]**Speaker:**Hongyuan Cao (Florida State University)**Title:**Optimal False Discovery Rate Control For Large Scale Multiple Testing With Auxiliary Information**Abstract:**Large-scale multiple testing is a fundamental problem in high dimensional statistical inference. It is increasingly common that various types of auxiliary information, reflecting the structural relationship among the hypotheses, are available. Exploiting such auxiliary information can boost statistical power. To this end, we propose a framework based on a two-group mixture model with varying probabilities of being null for different hypotheses a priori, where a shape constrained relationship is imposed between the auxiliary information and the prior probabilities of being null. An optimal rejection rule is designed to maximize the expected number of true positives when average false discovery rate is controlled. Focusing on the ordered structure, we develop a robust EM algorithm to estimate the prior probabilities of being null and the distribution of p-values under the alternative hypothesis simultaneously. We show that the proposed method has better power than state-of-the-art competitors while controlling the false discovery rate, both empirically and theoretically. Extensive simulations demonstrate the advantage of the proposed method. Datasets from genome-wide association studies are used to illustrate the new methodology.**Discussant:**James Scott (University of Texas at Austin)

**Thursday, April 1, 2021**[Recording]**Speaker:**Jingshen Wang (UC Berkeley)**Title:**Sharp Inference on Selected Subgroups in Observational Studies**Abstract:**In modern drug development, the broader availability of high-dimensional observational data provides opportunities for scientists to explore subgroup heterogeneity, especially when randomized clinical trials are unavailable due to cost and ethical constraints. However, a common practice that naively searches the subgroup with a high treatment level is often misleading due to the “subgroup selection bias.” More importantly, the nature of high-dimensional observational data has further exacerbated the challenge of accurately estimating the subgroup treatment effects. To resolve these issues, we provide new inferential tools based on resampling to assess the replicability of post-hoc identified subgroups from observational studies. Through careful theoretical justification and extensive simulations, we show that our proposed approach delivers asymptotically sharp confidence intervals and debiased estimates for the selected subgroup treatment effects in the presence of high-dimensional covariates. We further demonstrate the merit of the proposed methods by analyzing the UK Biobank data. The R package “debiased.subgroup” implementing the proposed procedures is available on GitHub.**Discussant:**Rui Wang (Harvard University)**Links:**[Relevant papers: paper #1]

**Thursday, March 25, 2021**[Recording]**Speaker:**Jackson Loper (Columbia University)**Title:**Smoothed Nested Testing on Directed Acyclic Graphs**Abstract:**We consider the problem of multiple hypothesis testing when there is a logical nested structure to the hypotheses. When one hypothesis is nested inside another, the outer hypothesis must be false if the inner hypothesis is false. We model the nested structure as a directed acyclic graph, including chain and tree graphs as special cases. Each node in the graph is a hypothesis and rejecting a node requires also rejecting all of its ancestors. We propose a general framework for adjusting node-level test statistics using the known logical constraints. Within this framework, we study a smoothing procedure that combines each node with all of its descendants to form a more powerful statistic. We prove a broad class of smoothing strategies can be used with existing selection procedures to control the familywise error rate, false discovery exceedance rate, or false discovery rate, so long as the original test statistics are independent under the null. When the null statistics are not independent but are derived from positively-correlated normal observations, we prove control for all three error rates when the smoothing method is arithmetic averaging of the observations. Simulations and an application to a real biology dataset demonstrate that smoothing leads to substantial power gains.**Discussant:**Wenge Guo (New Jersey Institute of Technology)**Links:**[Relevant papers: paper #1]

**Thursday, March 18, 2021**[Recording]**Speaker:**Ruodu Wang (University of Waterloo)**Title:**Multiple hypothesis testing with e-values and dependence**Abstract:**E-values have gained attention as potential alternatives to p-values as measures of uncertainty, significance and evidence. In brief, e-values are realized by random variables with expectation at most one under the null; examples include betting scores, (point null) Bayes factors, likelihood ratios and stopped supermartingales. We design a natural analog of the Benjamini-Hochberg (BH) procedure for false discovery rate (FDR) control that utilizes e-values, called the e-BH procedure, and compare it with the standard procedure for p-values. One of our central results is that, unlike the usual BH procedure, the e-BH procedure controls the FDR at the desired level, with no correction, for any dependence structure between the e-values. We illustrate that the new procedure is convenient in various settings of complicated dependence, structured and post-selection hypotheses, and multi-armed bandit problems. Moreover, the BH procedure is a special case of the e-BH procedure through calibration between p-values and e-values. Overall, the e-BH procedure is a novel, powerful and general tool for multiple testing under dependence, that is complementary to the BH procedure, each being an appropriate choice in different applications.**Discussant:**Lihua Lei (Stanford University)
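
For readers unfamiliar with the e-BH procedure mentioned in the abstract above, here is a minimal Python sketch of its usual statement: with $K$ e-values sorted in decreasing order as $e_{[1]} \ge \dots \ge e_{[K]}$, reject the hypotheses with the $k^*$ largest e-values, where $k^* = \max\{k : k\, e_{[k]} / K \ge 1/\alpha\}$. The function name `e_bh` and the example inputs are hypothetical and chosen only for illustration; this is not the speaker's code.

```python
import numpy as np


def e_bh(e_values, alpha=0.05):
    """Sketch of the e-BH rule: return the sorted indices of rejected hypotheses.

    Assumes each entry of `e_values` is a valid e-value for its hypothesis
    (nonnegative, expectation at most one under the null).
    """
    e = np.asarray(e_values, dtype=float)
    K = e.size
    order = np.argsort(-e)            # indices ordered by decreasing e-value
    sorted_e = e[order]
    k = np.arange(1, K + 1)
    # k-th largest e-value passes if k * e_[k] / K >= 1 / alpha
    passes = k * sorted_e / K >= 1.0 / alpha
    if not passes.any():
        return np.array([], dtype=int)
    k_star = k[passes].max()          # largest k satisfying the condition
    return np.sort(order[:k_star])


# Hypothetical example: five e-values, target FDR level 0.1
rejected = e_bh([100, 1, 50, 0.5, 30], alpha=0.1)
print(rejected)
```

Unlike BH on p-values, the threshold here involves no dependence correction, which matches the abstract's claim that the level holds for any dependence structure between the e-values.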

**Thursday, March 11, 2021**[Recording]**Speaker:**Stephen Bates (UC Berkeley)**Title:**Distribution-Free, Risk-Controlling Prediction Sets**Abstract:**While improving prediction accuracy has been the focus of machine learning in recent years, this alone does not suffice for reliable decision-making. Deploying learning systems in consequential settings also requires calibrating and communicating the uncertainty of predictions. To convey instance-wise uncertainty for prediction tasks, we show how to generate set-valued predictions from a black-box predictor that control the expected loss on future test points at a user-specified level. Our approach provides explicit finite-sample guarantees for any dataset by using a holdout set to calibrate the size of the prediction sets. This framework enables simple, distribution-free, rigorous error control for many tasks, and we demonstrate it in five large-scale machine learning problems: (1) classification problems where some mistakes are more costly than others; (2) multi-label classification, where each observation has multiple associated labels; (3) classification problems where the labels have a hierarchical structure; (4) image segmentation, where we wish to predict a set of pixels containing an object of interest; and (5) protein structure prediction. Lastly, we discuss extensions to uncertainty quantification for ranking, metric learning and distributionally robust learning.**Discussant:**Vladimir Vovk (Royal Holloway, University of London)

**Thursday, March 4, 2021**[Recording]**Speaker:**Boyan Duan (Carnegie Mellon University)**Title:**Interactive identification of individuals with positive treatment effect while controlling false discoveries**Abstract:**Out of the participants in a randomized experiment with anticipated heterogeneous treatment effects, is it possible to identify which ones have a positive treatment effect, even though each has only taken either treatment or control but not both? While subgroup analysis has received attention, claims about individual participants are more challenging. We frame the problem in terms of multiple hypothesis testing: we think of each individual as a null hypothesis (the potential outcomes are equal, for example) and aim to identify individuals for whom the null is false (the treatment potential outcome stochastically dominates the control, for example). We develop a novel algorithm that identifies such a subset, with nonasymptotic control of the false discovery rate (FDR). Our algorithm allows for interaction — a human data scientist (or a computer program acting on the human’s behalf) may adaptively guide the algorithm in a data-dependent manner to gain high identification power. We also propose several extensions: (a) relaxing the null to nonpositive effects, (b) moving from unpaired to paired samples, and (c) subgroup identification. We demonstrate via numerical experiments and theoretical analysis that the proposed method has valid FDR control in finite samples and reasonably high identification power.**Discussant:**Bikram Karmakar (University of Florida)

**Thursday, February 25, 2021**[Recording]**Speaker:**Anna Vesely (University of Padua)**Title:**Permutation-based true discovery guarantee by sum tests**Abstract:**Sum-based global tests are highly popular in multiple hypothesis testing. In this paper we propose a general closed testing procedure for sum tests, which provides confidence lower bounds for the proportion of true discoveries (TDP), simultaneously over all subsets of hypotheses. Our method allows for an exploratory approach, as simultaneity ensures control of the TDP even when the subset of interest is selected post hoc. It adapts to the unknown joint distribution of the data through permutation testing. Any sum test may be employed, depending on the desired power properties. We present an iterative shortcut for the closed testing procedure, based on the branch and bound algorithm. It converges to the full closed testing results, often after few iterations. Even if it is stopped early, it controls the TDP. The feasibility of the method for high dimensional data is illustrated on brain imaging data. We compare the properties of different choices for the sum test through simulations.**Discussant:**Pierre Neuvial (Institut de Mathématiques de Toulouse (IMT))

**Thursday, February 18, 2021**[Recording]**Speaker:**Tijana Zrnic (UC Berkeley)**Title:**Post-Selection Inference via Algorithmic Stability**Abstract:**Modern approaches to data analysis make extensive use of data-driven model selection. The resulting dependencies between the selected model and data used for inference invalidate statistical guarantees derived from classical theories. The framework of post-selection inference (PoSI) has formalized this problem and proposed corrections which ensure valid inferences. Yet, obtaining general principles that enable computationally-efficient, powerful PoSI methodology with formal guarantees remains a challenge. With this goal in mind, we revisit the PoSI problem through the lens of algorithmic stability. Under an appropriate formulation of stability---one that captures closure under post-processing and compositionality properties---we show that stability parameters of a selection method alone suffice to provide non-trivial corrections to classical z-test and t-test intervals. Then, for several popular model selection methods, including the LASSO, we show how stability can be achieved through simple, computationally efficient randomization schemes. Our algorithms offer provable unconditional simultaneous coverage and are computationally efficient; in particular, they do not rely on MCMC sampling. Importantly, our proposal explicitly relates the magnitude of randomization to the resulting confidence interval width, allowing the analyst to tune interval width to the loss in utility due to randomizing selection. This is joint work with Michael I. Jordan.**Discussant:**Arun Kumar Kuchibhotla (Carnegie Mellon University)

**Thursday, February 11, 2021**[Recording]**Speaker:**Jelle Goeman (Leiden University)**Title:**Only closed testing procedures are admissible for controlling false discovery proportions**Abstract:**We consider a general class of procedures controlling the tail probability of the number or proportion of false discoveries, either in a single (random) set or in several such sets simultaneously. This class includes, among others, (generalized) familywise error, false discovery exceedance, simultaneous false discovery proportion control, and other selective inference methods. We put these procedures in a general framework, formulating all of them as special cases of true discovery guarantee procedures. We formulate both necessary and sufficient conditions for admissibility. Most importantly, we show that all such procedures are either a special case of closed testing, or they can be uniformly improved by a closed testing procedure. The practical value of our results is illustrated by giving uniform improvements of existing selective inference procedures, achieved by formulating them as closed testing procedures. In particular, we investigate when procedures controlling conditional familywise error rate, and data-splitting methods, can be uniformly improved by closed testing.**Discussant:**Will Fithian (UC Berkeley)

**Thursday, February 4, 2021**[Recording]**Speaker:**Arian Maleki (Columbia University)**Title:**Comparing Variable Selection Techniques Under a High-Dimensional Asymptotic**Abstract:**In this talk, we discuss the problem of variable selection for linear models under the high-dimensional asymptotic setting, where the number of observations, n, grows at the same rate as the number of predictors, p. We consider two-stage variable selection techniques (TVS) in which the first stage obtains an estimate of the regression coefficients, and the second stage simply thresholds this estimate to select the “important” predictors. The asymptotic false discovery proportion (AFDP) and true positive proportion (ATPP) of these TVS are evaluated, and their optimality will be discussed.**Discussant:**Pragya Sur (Harvard University)

**Thursday, January 28, 2021**[Recording]**Speaker:**Ali Shojaie (University of Washington)**Title:**Nonparametric Inference for Infinite-Dimensional Parameters via a Generalized Score Test**Abstract:**Infinite-dimensional parameters that can be defined as the minimizer of a population risk arise naturally in many applications. Classic examples include the conditional mean function and the density function. Though there is extensive literature on constructing consistent estimators for infinite-dimensional risk minimizers, there is limited work on quantifying the uncertainty associated with such estimates via, e.g., hypothesis testing and construction of confidence regions. We propose a general inferential framework for infinite-dimensional risk minimizers as a nonparametric extension of the score test. We illustrate that our framework requires only mild assumptions and is applicable to a variety of estimation problems. In examples, we specialize our proposed methodology to estimation of regression functions with continuous outcomes and also consider a partially additive model as an extension of the more classical partially linear model.**Discussant:**Mladen Kolar (University of Chicago Booth School of Business)**Links:**[Slides]

**Thursday, January 21, 2021**[Recording]**Speaker:**Etienne Roquain (Sorbonne Université)**Title:**Structured multiple testing: can one mimic the oracle?**Abstract:**Knowing the model structure can significantly help to perform a multiple testing inference. Hence, a general aim is to build a procedure mimicking the performances of the oracle, that is, of a benchmark procedure that knows (and uses) this structure. As a case in point, classical structures are derived from the famous two-group model or its extensions, by specifying particular assumptions on the corresponding parameters, as the null/alternative distributions, or the false/null occurrence process. We will discuss the issue of mimicking the oracle for the three following structures and various multiple testing error rates:

(1) structure = Gaussian null distribution family, error rate = FDR (see https://arxiv.org/abs/1912.03109, joint work with Nicolas Verzelen and https://arxiv.org/abs/1809.08330, joint work with Alexandra Carpentier, Sylvain Delattre and Nicolas Verzelen)

(2) structure = stochastic block model for the false/null occurrence process, error rate = FDR (see https://arxiv.org/abs/1907.10176, joint work with Tabea Rebafka and Fanny Villers)

(3) structure = hidden Markov model for the false/null occurrence process, error rate = FDP confidence post hoc bound (preprint to come, joint work with Marie Perrot-Dockès, Gilles Blanchard and Pierre Neuvial)

We will emphasize the work (1) above, and show that building a confidence region for the structure parameter can be fruitful to know whether mimicking the oracle is possible and how to mimic it when it is possible.**Discussant:**Ery Arias-Castro (UC San Diego)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, January 14, 2021**[Recording]**Speaker:**Qingyuan Zhao (University of Cambridge)**Title:**Selecting and Ranking Individualized Treatment Rules With Unmeasured Confounding**Abstract:**It is common to compare individualized treatment rules based on the value function, which is the expected potential outcome under the treatment rule. Although the value function is not point-identified when there is unmeasured confounding, it still defines a partial order among the treatment rules under Rosenbaum’s sensitivity analysis model. We first consider how to compare two treatment rules with unmeasured confounding in the single-decision setting and then use this pairwise test to rank multiple treatment rules. We consider how to, among many treatment rules, select the best rules, and select the rules that are better than a control rule. The proposed methods are illustrated using two real examples, one about the benefit of malaria prevention programs to different age groups and another about the effect of late retirement on senior health in different gender and occupation groups.**Discussant:**Edward Kennedy (Carnegie Mellon University)**Links:**[Relevant paper] [Slides]

**Thursday, January 7, 2021**[Recording]**Speaker:**Yuval Benjamini (Hebrew University of Jerusalem)**Title:**Localizing differences between correlation matrix populations in resting-state fMRI**Abstract:**Resting state fMRI consists of continuous neural-activity recordings over a period of several minutes without structured experimental manipulation. These measurements are summarized into a correlation matrix between activity in p predetermined brain-regions (p between 90 and 500). Neurologists are interested in identifying localized differences in correlation between, e.g. disease and control populations, but the relatively high noise, small samples and many comparisons make mass univariate approaches impractical due to low signal. Therefore, resting-state fMRI analysis can be a model problem for data-adaptive pooling of hypotheses.

However, as I discuss in the talk, even static pooling of effects across different correlation values is not simple in this type of data. We reparametrize the matrix of differences between populations as p main effects representing change for each region, with the goal of replacing p^2/2 hypotheses with p main ones. For this new model, we derive likelihood estimators that require explicit or implicit characterisation of the dependence in the data. We show that the method performs well on simulations, and discuss an example from Amnesia data.

This is joint work with Itamar Faran, Michael Peer and Shahar Arzi.**Discussant:**Lucy Gao (University of Waterloo)**Relevant links:**[Slides]

**Thursday, December 10, 2020**[Recording]**Speaker:**Toru Kitagawa (University College London)**Title:**Inference on Winners**Abstract:**Many empirical questions concern target parameters selected through optimization. For example, researchers may be interested in the effectiveness of the best policy found in a randomized trial, or the best-performing investment strategy based on historical data. Such settings give rise to a winner’s curse, where conventional estimates are biased and conventional confidence intervals are unreliable. This paper develops optimal confidence intervals and median-unbiased estimators that are valid conditional on the target selected and so overcome this winner’s curse. If one requires validity only on average over targets that might have been selected, we develop hybrid procedures that combine conditional and projection confidence intervals to offer further performance gains relative to existing alternatives. This is joint work with Isaiah Andrews and Adam McCloskey.**Discussant:**Kenneth Hung (Facebook)**Links:**[Relevant paper] [Slides]

**Thursday, December 3, 2020****Speaker:**Jingyi Jessica Li (UCLA)**Title:**Clipper: p-value-free FDR control on high-throughput data from two conditions**Abstract:**High-throughput biological data analysis commonly involves the identification of “interesting” features (e.g., genes, genomic regions, and proteins), whose values differ between two conditions, from numerous features measured simultaneously. To ensure the reliability of such analysis, the most widely-used criterion is the false discovery rate (FDR), the expected proportion of uninteresting features among the identified ones. Existing bioinformatics tools primarily control the FDR based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions, two requirements that are often unmet in biological studies. To address this issue, we propose Clipper, a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper is applicable to identifying both enriched and differential features from high-throughput biological data of diverse types. In comprehensive simulation and real-data benchmarking, Clipper outperforms existing generic FDR control methods and specific bioinformatics tools designed for various tasks, including peak calling from ChIP-seq data, differentially expressed gene identification from RNA-seq data, differentially interacting chromatin region identification from Hi-C data, and peptide identification from mass spectrometry data. Notably, our benchmarking results for peptide identification are based on the first mass spectrometry data standard that has a realistic dynamic range. Our results demonstrate Clipper’s flexibility and reliability for FDR control, as well as its broad applications in high-throughput data analysis.**Discussant:**Nikos Ignatiadis (Stanford University)**Links:**[Relevant paper] [Slides]

**Thursday, November 19, 2020**[Recording]**Speaker:**Oscar Hernan Madrid Padilla (UCLA)**Title:**Optimal post-selection inference for sparse signals: a nonparametric empirical-Bayes approach**Abstract:**Many recently developed Bayesian methods have focused on sparse signal detection. However, much less work has been done addressing the natural follow-up question: how to make valid inferences for the magnitude of those signals after selection. Ordinary Bayesian credible intervals suffer from selection bias, owing to the fact that the target of inference is chosen adaptively. Existing Bayesian approaches for correcting this bias produce credible intervals with poor frequentist properties, while existing frequentist approaches require sacrificing the benefits of shrinkage typical in Bayesian methods, resulting in confidence intervals that are needlessly wide. We address this gap by proposing a nonparametric empirical-Bayes approach for constructing optimal selection-adjusted confidence sets. Our method produces confidence sets that are as short as possible on average, while both adjusting for selection and maintaining exact frequentist coverage uniformly over the parameter space. Our main theoretical result establishes an important consistency property of our procedure: that under mild conditions, it asymptotically converges to the results of an oracle-Bayes analysis in which the prior distribution of signal sizes is known exactly. Across a series of examples, the method outperforms existing frequentist techniques for post-selection inference, producing confidence sets that are notably shorter but with the same coverage guarantee. This is joint work with Spencer Woody and James G. Scott.**Discussant:**Małgorzata Bogdan (Uniwersytet Wroclawski, Instytut Matematyki)**Links:**[Relevant paper] [Slides]

**Thursday, November 12, 2020****Speaker:**Peter Grünwald (Centrum Wiskunde & Informatica and Leiden University)**Title:** *E is the New P:* Tests that are safe under optional stopping, with an application to time-to-event data**Abstract:**The E-value is a notion of evidence which, unlike p-values, allows for effortlessly combining evidence from several tests, even in the common scenario where the decision to perform a new test depends on previous test outcomes. 'Safe' tests based on E-values generally preserve Type-I error guarantees under such 'optional continuation', thereby potentially alleviating one of the main causes for the reproducibility crisis.

E-values, also known as 'betting scores', are the basic constituents of test martingales and always-valid confidence sequences - a dormant cluster of ideas going back to Ville and Robbins and suddenly rapidly gaining popularity due to recent work by Vovk, Shafer, Ramdas and Wang. For simple nulls they are just likelihood ratios or Bayes factors, but for composite nulls it's trickier - we show how to construct them in this case using the 'joint information projection'. We then zoom in on time-to-event data and show how to define an E-value based on Cox's partial likelihood, illustrating with (hypothetical!) data on COVID vaccine RCTs. If all research groups were to report their results in terms of E-values rather than p-values, then in principle, one could even do meta-analysis that retains an overall Type-I error guarantee - thus saving greatly on 'research waste'.

Joint work with R. de Heide, W. Koolen, A. Ly, M. Perez, R. Turner and J. Ter Schure.**Discussant:**Ruodu Wang (University of Waterloo)

**Thursday, November 5, 2020****Speaker:**Gilles Blanchard (Université Paris Sud)**Title:**Agnostic post hoc approaches to false positive control**Abstract:**Classical approaches to multiple testing grant control over the amount of false positives for a specific method prescribing the set of rejected hypotheses. In practice many users tend to deviate from a strictly prescribed multiple testing method and follow ad-hoc rejection rules, tune some parameters by hand, compare several methods and pick from their results the one that suits them best, etc. This will invalidate standard statistical guarantees because of the selection effect. To compensate for any form of such “data snooping”, an approach which has garnered significant interest recently is to derive “user-agnostic”, or post hoc, bounds on the false positives valid uniformly over all possible rejection sets; this allows arbitrary data snooping from the user. We present two contributions: starting from a common approach to post hoc bounds taking into account the p-value level sets for any candidate rejection set, we analyze how to calibrate the bound under different assumptions concerning the distribution of p-values. We then build towards a general approach to the problem using a family of candidate rejection subsets (call this a reference family) together with associated bounds on the number of false positives they contain, the latter holding uniformly over the family. It is then possible to interpolate from this reference family to find a bound valid for any candidate rejection subset. This general program encompasses in particular the p-value level sets considered earlier; we illustrate its interest in a different context where the reference subsets are fixed and spatially structured. (Joint work with Pierre Neuvial and Etienne Roquain.)**Discussant:**Arun Kumar Kuchibhotla (Carnegie Mellon University)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, October 29, 2020**[Recording]**Speaker:**Robert Lunde (University of Texas, Austin)**Title:**Resampling for Network Data**Abstract:**Network data, which represent complex relationships between different entities, have become increasingly common in fields ranging from neuroscience to social network analysis. To address key scientific questions in these domains, versatile inferential methods for network-valued data are needed. In this talk, I will discuss our recent work on network analogs of the three main resampling methods: subsampling, the jackknife, and the bootstrap. While network data are generally dependent, under the sparse graphon model, we show that these resampling procedures exhibit similar properties to their IID counterparts. I will also discuss related theoretical results, including central limit theorems for eigenvalues and a network Efron-Stein inequality. This is joint work with Purnamrita Sarkar and Qiaohui Lin.**Discussant:**Liza Levina (University of Michigan)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, October 22, 2020**[Recording]**Speaker:**Yuan Liao (Rutgers University)**Title:**Deep Learning Inference on Semi-Parametric Models with Weakly Dependent Data**Abstract:**Deep Neural Networks (DNNs) are nonlinear sieves that can approximate nonlinear functions of high dimensional variables more effectively than various linear sieves (or series). This paper considers efficient inference (estimation and confidence intervals) of functionals of nonparametric conditional moment restrictions via penalized DNNs, for weakly dependent beta-mixing time series data. The functionals of interest are either known or unknown expected functionals, such as weighted average derivatives, averaged partial means and averaged squared partial derivatives. Nonparametric conditional quantile instrumental variable models are a particular example of interest in this paper. This is joint work with Jiafeng Chen, Xiaohong Chen, and Elie Tamer.**Discussant:**Matteo Sesia (University of Southern California)**Links:**[Slides]

**Thursday, October 15, 2020**[Recording]**Speaker:**Zhimei Ren (Stanford University)**Title:**Derandomizing Knockoffs**Abstract:**Model-X knockoffs is a general procedure that can leverage any feature importance measure to produce a variable selection algorithm, which discovers true effects while rigorously controlling the number or fraction of false positives. Model-X knockoffs relies on the construction of synthetic random variables and is, therefore, random. In this paper, we propose a method for derandomizing model-X knockoffs. By aggregating the selection results across multiple runs of the knockoffs algorithm, our method provides stable decisions without compromising statistical power. The derandomization step is designed to be flexible and can be adapted to any variable selection base procedure. When applied to the base procedure of Janson et al. (2016), we prove that derandomized knockoffs controls both the per family error rate (PFER) and the k family-wise error rate (k-FWER). Further, we carry out extensive numerical studies demonstrating tight type-I error control and markedly enhanced power when compared with alternative variable selection algorithms. Finally, we apply our approach to multi-stage GWAS of prostate cancer and report locations on the genome that are significantly associated with the disease. When cross-referenced with other studies, we find that the reported associations have been replicated.**Discussant:**Richard Samworth (University of Cambridge)**Links:**[Relevant paper]

**Thursday, October 8, 2020**[Recording]**Speaker:**Nilesh Tripuraneni (UC Berkeley)**Title:**Single Point Transductive Prediction**Abstract:**Standard methods in supervised learning separate training and prediction: the model is fit independently of any test points it may encounter. However, can knowledge of the next test point $\mathbf{x}_{\star}$ be exploited to improve prediction accuracy? We address this question in the context of linear prediction, showing how techniques from semi-parametric inference can be used transductively to combat regularization bias. We first lower bound the $\mathbf{x}_{\star}$ prediction error of ridge regression and the Lasso, showing that they must incur significant bias in certain test directions. We then provide non-asymptotic upper bounds on the $\mathbf{x}_{\star}$ prediction error of two transductive prediction rules. We conclude by showing the efficacy of our methods on both synthetic and real data, highlighting the improvements single point transductive prediction can provide in settings with distribution shift. This is joint work with Lester Mackey.**Discussant:**Leying Guan (Yale University)**Links:**[Relevant paper] [Slides]

**Thursday, October 1, 2020**[Recording]**Speaker:**Asaf Weinstein (Hebrew University of Jerusalem)**Title:**A Power Analysis for Knockoffs with the Lasso Coefficient-Difference Statistic**Abstract:**In a linear model with possibly many predictors, we consider variable selection procedures given by $\{1\leq j\leq p: |\widehat{\beta}_j(\lambda)| > t\}$, where $\widehat{\beta}(\lambda)$ is the Lasso estimate of the regression coefficients, and where $\lambda$ and $t$ may be data dependent. Ordinary Lasso selection is captured by using $t=0$, thus allowing control of only $\lambda$, whereas thresholded-Lasso selection allows control of both $\lambda$ and $t$. Figuratively, thresholded-Lasso opens up the possibility to look further down the Lasso path, which typically leads to dramatic improvement in power. This phenomenon has been quantified recently leveraging advances in approximate message-passing (AMP) theory, but the implications are actionable only when assuming substantial knowledge of the underlying signal. In this work we study theoretically the power of a knockoffs-calibrated counterpart of thresholded-Lasso that enables us to control FDR in the realistic situation where no prior information about the signal is available. Although the basic AMP framework remains the same, our analysis requires a significant technical extension of existing theory in order to handle the pairing between original variables and their knockoffs. Relying on this extension we obtain exact asymptotic predictions for the true positive proportion achievable at a prescribed type I error level. In particular, we show that the knockoffs version of thresholded-Lasso can (still) perform much better than ordinary Lasso selection if $\lambda$ is chosen by cross-validation on the augmented matrix. This is joint work with Malgorzata Bogdan, Weijie Su, Rina Foygel Barber and Emmanuel Candes.**Discussant:**Zheng (Tracy) Ke (Harvard University)**Links:**[Relevant paper] [Slides]

**Thursday, September 24, 2020**[Recording]**Speaker:**Ruth Heller (Tel Aviv University)**Title:**Inference following aggregate level hypothesis testing**Abstract:**The practice of pooling several individual test statistics to form aggregate tests is common in many statistical applications where individual tests may be underpowered. Following aggregate-level testing, it is naturally of interest to infer on the individual units that drive the signal. Failing to account for selection will produce biased inference. We develop a hypothesis testing framework that guarantees control over false positives conditional on the selection by aggregate tests. We illustrate the usefulness of our procedures in two genomic applications: whole-genome expression quantitative trait loci (eQTL) analysis across multiple tissue types, and rare variant testing. This talk is based on joint works with Nilanjan Chatterjee, Abba Krieger, Amit Meir, and Jianxin Shi.**Discussant:**Jingshu Wang (University of Chicago)

**Thursday, September 17, 2020**[Recording]**Speaker:**Hannes Leeb (University of Vienna)**Title:**Conditional Predictive Inference for High-Dimensional Stable Algorithms**Abstract:**We investigate generically applicable and intuitively appealing prediction intervals based on leave-one-out residuals. The conditional coverage probability of the proposed intervals, given the observations in the training sample, is close to the nominal level, provided that the underlying algorithm used for computing point predictions is sufficiently stable under the omission of single feature/response pairs. Our results are based on a finite sample analysis of the empirical distribution function of the leave-one-out residuals and hold in non-parametric settings with only minimal assumptions on the error distribution. To illustrate our results, we also apply them to high-dimensional linear predictors, where we obtain uniform asymptotic conditional validity as both sample size and dimension tend to infinity at the same rate. These results show that despite the serious problems of resampling procedures for inference on the unknown parameters (cf. Bickel and Freedman, 1983; El Karoui and Purdom, 2015; Mammen, 1996), leave-one-out methods can be successfully applied to obtain reliable predictive inference even in high dimensions.

Joint work with Lukas Steinberger.**Discussant:**Yuansi Chen (ETH Zürich)**Links:**[Relevant paper] [Slides]
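
The leave-one-out construction described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the interval is the full-sample prediction shifted by empirical quantiles of the leave-one-out residuals, and it uses a one-dimensional least-squares line as a stand-in for a stable algorithm. The names `fit_line` and `loo_prediction_interval` and the simulated data are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Fit y = a + b * x by least squares and return the predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def loo_prediction_interval(xs, ys, x_new, fit, alpha=0.1):
    """Prediction interval at x_new from leave-one-out residual quantiles."""
    n = len(xs)
    residuals = []
    for i in range(n):
        # Refit without observation i, then predict the held-out response.
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        residuals.append(ys[i] - model(xs[i]))
    residuals.sort()
    # Empirical alpha/2 and 1 - alpha/2 quantiles (order statistics).
    low = residuals[max(0, math.ceil(alpha / 2 * n) - 1)]
    high = residuals[min(n - 1, math.ceil((1 - alpha / 2) * n) - 1)]
    center = fit(xs, ys)(x_new)
    return center + low, center + high


# Simulated data (assumption): y = 1 + 2x + Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
lower, upper = loo_prediction_interval(xs, ys, x_new=2.0, fit=fit_line)
print(lower, upper)
```

Any fitting routine with the same interface can be passed as `fit`; the sketch makes no claim about the stability conditions under which the coverage statement in the abstract holds.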

**Thursday, September 10, 2020**[Recording]**Speaker:**Michael Celentano (Stanford University)**Title:**The Lasso with general Gaussian designs with applications to hypothesis testing**Abstract:**The Lasso is a method for high-dimensional regression, which is now commonly used when the number of covariates p is of the same order as or larger than the number of observations n. Classical asymptotic normality theory is not applicable to this model for two fundamental reasons: (1) The regularized risk is non-smooth; (2) The distance between the estimator and the true parameter vector cannot be neglected. As a consequence, standard perturbative arguments that are the traditional basis for asymptotic normality fail.

On the other hand, the Lasso estimator can be precisely characterized in the regime in which both n and p are large, while n/p is of order one. This characterization was first obtained in the case of standard Gaussian designs, and subsequently generalized to other high-dimensional estimation procedures. We extend the same characterization to Gaussian correlated designs with non-singular covariance structure.

Using this theory, we study (i) the debiased Lasso, and show that a degrees-of-freedom correction is necessary for computing valid confidence intervals, (ii) confidence intervals constructed via a leave-one-out technique related to conditional randomization tests, and (iii) a simple procedure for hyper-parameter tuning which is provably optimal for prediction error under proportional asymptotics.

Based on joint work with Andrea Montanari and Yuting Wei.**Discussant:**Dongming Huang (National University of Singapore)**Links:**[Relevant paper] [Slides]

**Thursday, September 3, 2020**[Recording]**Speaker:**Rina Foygel Barber (University of Chicago)**Title:**Is distribution-free inference possible for binary regression?**Abstract:**For a regression problem with a binary label response, we examine the problem of constructing confidence intervals for the label probability conditional on the features. In a setting where we do not have any information about the underlying distribution, we would ideally like to provide confidence intervals that are distribution-free---that is, valid with no assumptions on the distribution of the data. Our results establish an explicit lower bound on the length of any distribution-free confidence interval, and construct a procedure that can approximately achieve this length. In particular, this lower bound is independent of the sample size and holds for all distributions with no point masses, meaning that it is not possible for any distribution-free procedure to be adaptive with respect to any type of special structure in the distribution.**Discussant:**Aaditya Ramdas (Carnegie Mellon University)**Links:**[Relevant paper] [Slides]

**Thursday, August 27, 2020**[Recording]**Speaker:**Daniel Yekutieli (Tel Aviv University)**Title:**Bayesian selective inference**Abstract:**I will discuss selective inference from a Bayesian perspective. I will revisit existing work. I will demonstrate the effectiveness of Bayesian methods for specifying FDR-controlling selection rules and providing valid selection-adjusted marginal inferences in two simulated multiple testing examples: (a) Normal sequence model with continuous-valued parameters and (b) two-group model with dependent Normal observations.**Discussant:**Zhigen Zhao (Temple University)

**Thursday, August 20, 2020**[Recording]**Speaker:**Eugene Katsevich (University of Pennsylvania)**Title:**The conditional randomization test in theory and in practice**Abstract:**Consider the problem of testing whether a predictor X is independent of a response Y given a covariate vector Z. If we have access to the distribution of X given Z (the Model-X assumption), the conditional randomization test (Candes et al., 2018) is a simple and powerful conditional independence test, which does not require any knowledge of the distribution of Y given X and Z. The key obstacle to the practical implementation of the CRT is its computational cost, due to its reliance on repeatedly refitting a statistical machine learning model on resampled data. This motivated the development of distillation, a technique which speeds up the CRT by orders of magnitude while sacrificing little or no power (Liu, Katsevich, Janson, and Ramdas, 2020). I will also discuss recent theoretical developments that help us understand how the choice of CRT test statistic impacts its power (Katsevich and Ramdas, 2020). Finally, I'll illustrate an application of the CRT to the analysis of single cell CRISPR regulatory screens, where it helps circumvent the difficulties of modeling single cell gene expression (Katsevich and Roeder, 2020).**Discussant:**Wesley Tansey (Memorial Sloan Kettering Cancer Center)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, August 13, 2020**[Recording]**Speaker:**Lucy Gao (University of Waterloo)**Title:**Selective Inference for Hierarchical Clustering**Abstract:**It is common practice in fields such as single-cell transcriptomics to use the same data set to define groups of interest via clustering algorithms and to test whether these groups are different. Because the same data set is used for both hypothesis generation and hypothesis testing, simply applying a classical statistical test (e.g. the t-test) in this setting would yield an extremely inflated Type I error rate. We propose a selective inference framework for testing the null hypothesis of no difference in means between two clusters obtained using agglomerative hierarchical clustering. Using this framework, we can efficiently compute exact p-values for many commonly used linkage criteria. We demonstrate the utility of our test in simulated data and in single-cell RNA-seq data. This is joint work with Jacob Bien and Daniela Witten.**Discussant:**Yuval Benjamini (Hebrew University of Jerusalem)**Links:**[Slides]

**Thursday, July 30, 2020**[Recording]**Speaker:**Kathryn Roeder (Carnegie Mellon University)**Title:**Adaptive approaches for augmenting genetic association studies with multi-omics covariates**Abstract:**To correct for a large number of hypothesis tests, most researchers rely on simple multiple testing corrections. Yet, new selective inference methodologies could improve power by enabling exploration of test statistics with covariates for informative weights while retaining desired statistical guarantees. We explore one such framework, adaptive p-value thresholding (AdaPT), in the context of genome-wide association studies (GWAS) under two types of regimes: (1) testing individual single nucleotide polymorphisms (SNPs) for schizophrenia (SCZ) and (2) the aggregation of SNPs into gene-based test statistics for autism spectrum disorder (ASD). In both settings, we focus on enriched expression quantitative trait loci (eQTLs) and demonstrate a substantial increase in power using flexible gradient boosted trees to account for covariates constructed with GWAS statistics from genetically-correlated phenotypes, as well as measures capturing association with gene expression and coexpression subnetwork membership. We address the practical challenges of implementing AdaPT in high-dimensional -omics settings, such as approaches for tuning gradient boosted trees without compromising error-rate control as well as handling the subtle issues of working with publicly available summary statistics (e.g., p-values reported to be exactly equal to one). Specifically, because a popular approach for computing gene-level p-values is based on an invalid approximation for the combination of dependent two-sided test statistics, it yields an inflated error rate. Additionally, the resulting improper null distribution violates the mirror-conservative assumption required for masking procedures. 
We believe our results are critical for researchers wishing to build new methods in this challenging area and emphasize that our pipeline of analysis can be implemented in many different high-throughput settings to ultimately improve power. This is joint work with Ronald Yurko, Max G’Sell, and Bernie Devlin.**Discussant:**Chiara Sabatti (Stanford University)**Links:**[Relevant paper] [Slides]

**Thursday, July 23, 2020**[Recording]**Speaker:**Will Fithian (UC Berkeley)**Title:**Conditional calibration for false discovery rate control under dependence**Abstract:**We introduce a new class of methods for finite-sample false discovery rate (FDR) control in multiple testing problems with dependent test statistics where the dependence is fully or partially known. Our approach separately calibrates a data-dependent p-value rejection threshold for each hypothesis, relaxing or tightening the threshold as appropriate to target exact FDR control. In addition to our general framework we propose a concrete algorithm, the dependence-adjusted Benjamini-Hochberg (dBH) procedure, which adaptively thresholds the q-value for each hypothesis. Under positive regression dependence the dBH procedure uniformly dominates the standard BH procedure, and in general it uniformly dominates the Benjamini–Yekutieli (BY) procedure (also known as BH with log correction). Simulations and real data examples illustrate power gains over competing approaches to FDR control under dependence. This is joint work with Lihua Lei.**Discussant:**Etienne Roquain (Sorbonne Université)**Links:**[Relevant paper] [Slides]

**Thursday, July 16, 2020**[Recording]**Speaker:**Arun Kumar Kuchibhotla (University of Pennsylvania)**Title:**Optimality in Universal Post-selection Inference**Abstract:**Universal post-selection inference refers to valid inference after an arbitrary variable selection in regression models. In the context of linear regression and GLMs, universal post-selection inference methods have been suggested by Berk et al. (2013, AoS) and Bachoc et al. (2020, AoS). Both these works use the so-called "max-t" approach to obtain valid inference after arbitrary variable selection. Although tight, this approach can lead to a conservative inference for several sub-models. (Tightness refers to the existence of a variable selection procedure for which the inference is exact/sharp.) In this talk, I present a different approach to universal post-selection inference called "Hierarchical PoSI" that scales differently for different sub-model sizes. The basic idea stems from pre-pivoting, introduced by Beran (1987, 1988, JASA) and also from multi-scale testing. Some numerical results will be presented to illustrate the benefits. No guarantees of optimality will be made.**Discussant:**Daniel Yekutieli (Tel Aviv University)

**Thursday, July 9, 2020**[Recording]**Speaker:**Lihua Lei (Stanford University)**Title:**AdaPT: An interactive procedure for multiple testing with side information**Abstract:**We consider the problem of multiple‐hypothesis testing with generic side information: for each hypothesis we observe both a *p*‐value *p*_i and some predictor *x*_i encoding contextual information about the hypothesis. For large‐scale problems, adaptively focusing power on the more promising hypotheses (those more likely to yield discoveries) can lead to much more powerful multiple‐testing procedures. We propose a general iterative framework for this problem, the adaptive *p*‐value thresholding procedure which we call AdaPT, which adaptively estimates a Bayes optimal *p*‐value rejection threshold and controls the false discovery rate in finite samples. At each iteration of the procedure, the analyst proposes a rejection threshold and observes partially censored *p*‐values, estimates the false discovery proportion below the threshold and proposes another threshold, until the estimated false discovery proportion is below *α*. Our procedure is adaptive in an unusually strong sense, permitting the analyst to use any statistical or machine learning method she chooses to estimate the optimal threshold, and to switch between different models at each iteration as information accrues. This is joint work with Will Fithian.**Discussant:**Kun Liang (University of Waterloo)**Links:**[Relevant paper] [Slides]

**Thursday, July 2, 2020**[Recording]**Speaker:**Lucas Janson (Harvard University)**Title:**Floodgate: inference for model-free variable importance**Abstract:**Many modern applications seek to understand the relationship between an outcome variable Y and a covariate X in the presence of confounding variables Z = (Z_1,...,Z_p). Although much attention has been paid to testing whether Y depends on X given Z, in this paper we seek to go beyond testing by inferring the strength of that dependence. We first define our estimand, the minimum mean squared error (mMSE) gap, which quantifies the conditional relationship between Y and X in a way that is deterministic, model-free, interpretable, and sensitive to nonlinearities and interactions. We then propose a new inferential approach called floodgate that can leverage any regression function chosen by the user (including those fitted by state-of-the-art machine learning algorithms or derived from qualitative domain knowledge) to construct asymptotic confidence bounds, and we apply it to the mMSE gap. In addition to proving floodgate’s asymptotic validity, we rigorously quantify its accuracy (distance from confidence bound to estimand) and robustness. We demonstrate floodgate’s performance in a series of simulations and apply it to data from the UK Biobank to infer the strengths of dependence of platelet count on various groups of genetic mutations. This is joint work with Lu Zhang.**Discussant:**Weijie Su (University of Pennsylvania)**Links:**[Relevant paper] [Slides]

**Thursday, June 25, 2020**[Recording]**Speaker:**Alexandra Carpentier (Otto-von-Guericke-Universität Magdeburg)**Title:**Adaptive inference and its relations to sequential decision making**Abstract:**Adaptive inference - namely adaptive estimation and adaptive confidence statements - is particularly important in high- or infinite-dimensional models in statistics. Indeed, whenever the dimension becomes high or infinite, it is important to adapt to the underlying structure of the problem. While adaptive estimation is often possible, it is often the case that adaptive and honest confidence sets do not exist. This is known as the adaptive inference paradox, and it has consequences in sequential decision making. In this talk, I will present some classical results of adaptive inference and discuss how they impact sequential decision making. This is joint work with Andrea Locatelli, Matthias Loeffler, Olga Klopp, Richard Nickl, James Cheshire, and Pierre Menard.**Discussant:**Jing Lei (Carnegie Mellon University)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, June 18, 2020**[Recording]

(Seminar hosted jointly with the CIRM-Luminy meeting on Mathematical Methods of Modern Statistics 2)**Speaker:**Weijie Su (University of Pennsylvania)**Title:**Gaussian Differential Privacy**Abstract:**Privacy-preserving data analysis has been put on a firm mathematical foundation since the introduction of differential privacy (DP) in 2006. This privacy definition, however, has some well-known weaknesses: notably, it does not tightly handle composition. In this talk, we propose a relaxation of DP that we term "f-DP", which has a number of appealing properties and avoids some of the difficulties associated with prior relaxations. First, f-DP preserves the hypothesis testing interpretation of differential privacy, which makes its guarantees easily interpretable. It allows for lossless reasoning about composition and post-processing, and notably, a direct way to analyze privacy amplification by subsampling. We define a canonical single-parameter family of definitions within our class that is termed "Gaussian Differential Privacy", based on hypothesis testing of two shifted normal distributions. We prove that this family is focal to f-DP by introducing a central limit theorem, which shows that the privacy guarantees of any hypothesis-testing based definition of privacy (including differential privacy) converge to Gaussian differential privacy in the limit under composition. This central limit theorem also gives a tractable analysis tool. We demonstrate the use of the tools we develop by giving an improved analysis of the privacy guarantees of noisy stochastic gradient descent. This is joint work with Jinshuo Dong and Aaron Roth.**Discussant:**Yu-Xiang Wang (UC Santa Barbara)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, June 11, 2020**[Recording]**Speaker:**Dongming Huang (Harvard University)**Title:**Controlled Variable Selection with More Flexibility**Abstract:**The recent model-X knockoffs method selects variables with provable and non-asymptotical error control and with no restrictions or assumptions on the dimensionality of the data or the conditional distribution of the response given the covariates. The one requirement for the procedure is that the covariate samples are drawn independently and identically from a precisely-known distribution. In this talk, I will show that the exact same guarantees can be made without knowing the covariate distribution fully, but instead knowing it only up to a parametric model with as many as Ω(np) parameters, where p is the dimension and n is the number of covariate samples (including unlabeled samples if available). The key is to treat the covariates as if they are drawn conditionally on their observed value for a sufficient statistic of the model. Although this idea is simple, even in Gaussian models, conditioning on a sufficient statistic leads to a distribution supported on a set of zero Lebesgue measure, requiring techniques from topological measure theory to establish valid algorithms. I will demonstrate how to do this for medium-dimensional Gaussian models, high-dimensional Gaussian graphical models, and discrete graphical models. Simulations show the new approach remains powerful under the weaker assumptions. This talk is based on joint work with Lucas Janson.**Discussant:**Snigdha Panigrahi (University of Michigan)**Links:**[Relevant paper][Slides]

**Thursday, June 4, 2020**[Recording]**Speaker:**Saharon Rosset (Tel Aviv University)**Title:**Optimal multiple testing procedures for strong control and for the two-group model**Abstract:**Multiple testing problems are a staple of modern statistics. The fundamental objective is to reject as many false null hypotheses as possible, subject to controlling an overall measure of false discovery, like family-wise error rate (FWER) or false discovery rate (FDR). We formulate multiple testing of simple hypotheses as an infinite-dimensional optimization problem, seeking the most powerful rejection policy which guarantees strong control of the selected measure. We show that for exchangeable hypotheses, for FWER or FDR and relevant notions of power, these problems lead to infinite programs that can provably be solved. We explore maximin rules for complex alternatives, and show they can be found in practice, leading to improved practical procedures compared to existing alternatives. We derive explicit optimal tests for FWER or FDR control for three independent normal means. We find that the power gain over natural competitors is substantial in all settings examined. We apply our optimal maximin rule to subgroup analyses in systematic reviews from the Cochrane library, leading to an increased number of findings compared to existing alternatives.

As time permits I will also review our follow-up work on optimal rules for controlling FDR or positive FDR in the two-group model, in high dimension and under arbitrary dependence. Our results show substantial and interesting differences between the standard approach for controlling the mFDR and our new solutions, in particular we attain substantially increased power (expected number of true rejections).

Joint work with Ruth Heller, Amichai Painsky and Udi Aharoni.**Discussant:**Wenguang Sun (University of Southern California)

**Thursday, May 28, 2020**[Recording]**Speaker:**Jingshu Wang (University of Chicago)**Title:**Detecting Multiple Replicating Signals using Adaptive Filtering Procedures**Abstract:**Replicability is a fundamental quality of scientific discoveries: we are interested in those signals that are detectable in different laboratories, study populations, across time etc. Unlike meta-analysis, which accounts for experimental variability but does not guarantee replicability, testing a partial conjunction (PC) null aims specifically to identify the signals that are discovered in multiple studies. In many contemporary applications, e.g. comparing multiple high-throughput genetic experiments, a large number M of PC nulls need to be tested simultaneously, calling for a multiple comparison correction. However, standard multiple testing adjustments on the M PC p-values can be severely conservative, especially when M is large and the signals are sparse. We introduce AdaFilter, a new multiple testing procedure that increases power by adaptively filtering out unlikely candidates of PC nulls. We prove that AdaFilter can control FWER and FDR as long as data across studies are independent, and has much higher power than other existing methods. We illustrate the application of AdaFilter with three examples: microarray studies of Duchenne muscular dystrophy, single-cell RNA sequencing of T cells in lung cancer tumors and GWAS for metabolomics.**Discussant:**Eugene Katsevich (Carnegie Mellon University)**Links:**[Relevant paper] [Slides]

**Thursday, May 21, 2020**[Recording]**Speaker:**Yoav Benjamini (Tel Aviv University)**Title:**Confidence Intervals for selected parameters**Abstract:**Practical or scientific considerations may lead to selecting a subset of parameters as ‘important’. Inferences about the selected parameters often are based on the same data used for selection. We present a taxonomy of error-rates for selective confidence intervals, then focus on controlling the probability that one or more intervals for selected parameters do not cover, that is, the simultaneous over the selected (SoS) error-rate. We use two approaches to construct SoS-controlling confidence intervals for *k* location parameters out of *m*, deemed most important because their estimators are the largest. The new intervals improve substantially over Sidak intervals when *k* << *m*, and approach Bonferroni-corrected intervals when *k* is close to *m*. (Joint work with Yotam Hechtlinger and Philip Stark)**Discussant:**Aaditya Ramdas (Carnegie Mellon University)**Links:**[Relevant paper] [Slides]

**Thursday, May 14, 2020**[Recording]**Speaker:**Malgorzata Bogdan (Uniwersytet Wroclawski)**Title:**Adaptive Bayesian Version of SLOPE**Abstract:**Sorted L-One Penalized Estimation (SLOPE) is a convex optimization procedure for identifying predictors in large data bases. It extends the popular Least Absolute Shrinkage and Selection Estimator (LASSO) by replacing the L1 norm penalty with the Sorted L-One Norm. It provably controls FDR under orthogonal designs and yields asymptotically minimax estimators of regression coefficients in sparse high-dimensional regression. In this talk I will briefly introduce the method and explain problems with FDR control under correlated designs. We will then discuss a novel adaptive Bayesian version of SLOPE (ABSLOPE), which addresses these issues and allows for simultaneous variable selection and parameter estimation, despite the missing values. We will also discuss a strong screening rule for discarding predictors for SLOPE, which substantially speeds up the SLOPE and ABSLOPE algorithms.**Discussant:**Cynthia Rush (Columbia University)**Links:**[Slides] [Relevant papers: paper #1, paper #2, paper #3]

**Thursday, May 7, 2020**[Recording]**Speaker:**Aldo Solari (University of Milano-Bicocca)**Title:**Exploratory Inference for Brain Imaging**Abstract:**Modern data analysis can be highly exploratory. In brain imaging, for example, researchers often highlight patterns of brain activity suggested by the data, but false discoveries are likely to intrude into this selection. How confident can the researcher be about a pattern that has been found, if that pattern has been selected from so many potential patterns?

In this talk we present a recent approach - termed 'All-Resolutions Inference' (ARI) - that delivers lower confidence bounds to the number of true discoveries in any selected set of voxels. Notably, these bounds are simultaneously valid for all possible selections. This allows a truly interactive approach to post-selection inference, that does not set any limits on the way the researcher chooses to perform the selection.**Discussant:**Genevera Allen (Rice University)**Links:**[Relevant papers: paper #1, paper #2, paper #3] [Slides]

**Thursday, Apr 30, 2020**[Recording]**Speaker:**Yingying Fan (University of Southern California)**Title:**Universal Rank Inference via Residual Subsampling with Application to Large Networks**Abstract:**Determining the precise rank is an important problem in many large-scale applications with matrix data exploiting low-rank plus noise models. In this paper, we suggest a universal approach to rank inference via residual subsampling (RIRS) for testing and estimating rank in a wide family of models, including many popularly used network models such as the degree corrected mixed membership model as a special case. Our procedure constructs a test statistic via subsampling entries of the residual matrix after extracting the spiked components. The test statistic converges in distribution to the standard normal under the null hypothesis, and diverges to infinity with asymptotic probability one under the alternative hypothesis. The effectiveness of the RIRS procedure is justified theoretically, utilizing the asymptotic expansions of eigenvectors and eigenvalues for large random matrices recently developed in Fan et al. (2019a) and Fan et al. (2019b). The advantages of the newly suggested procedure are demonstrated through several simulation and real data examples. This work is joint with Xiao Han and Qing Yang.**Discussant:**Yuekai Sun (University of Michigan)**Links:**[Relevant paper] [Slides]

**Thursday, Apr 23, 2020**[Recording]**Speaker:**Aaditya Ramdas (Carnegie Mellon University)**Title:**Ville’s inequality, Robbins’ confidence sequences, and nonparametric supermartingales**Abstract:**

Standard textbook confidence intervals are only valid at fixed sample sizes, but scientific datasets are often collected sequentially and potentially stopped early, thus introducing a critical selection bias. A “confidence sequence” is a sequence of intervals, one for each sample size, that are uniformly valid over all sample sizes, and are thus valid at arbitrary data-dependent sample sizes. One can show that constructing the former at every time step guarantees false coverage rate control, while constructing the latter at each time step guarantees post-hoc familywise error rate control. We show that at a price of about two (doubling of width), pointwise asymptotic confidence intervals can be extended to uniform nonparametric confidence sequences. The crucial role of some beautiful nonnegative supermartingales will be made transparent in enabling “safe anytime-valid inference”.

This talk will mostly feature joint work with Steven R. Howard (Berkeley, Voleon), Jon McAuliffe (Berkeley, Voleon), Jas Sekhon (Berkeley, Bridgewater) and recently Larry Wasserman (CMU) and Sivaraman Balakrishnan (CMU). I will also cover interesting historical and contemporary contributions to this area.

**Discussant:**Wouter Koolen (Centrum Wiskunde & Informatica)

**Thursday, Apr 16, 2020**[Recording]

**Speaker:**Emmanuel Candès (Stanford University)**Title:**Causal Inference in Genetic Trio Studies**Abstract:**

We introduce a method to rigorously draw causal inferences — inferences immune to all possible confounding — from genetic data that include parents and offspring. Causal conclusions are possible with these data because the natural randomness in meiosis can be viewed as a high-dimensional randomized experiment. We make this observation actionable by developing a novel conditional independence test that identifies regions of the genome containing distinct causal variants. The proposed Digital Twin Test compares an observed offspring to carefully constructed synthetic offspring from the same parents to determine statistical significance, and it can leverage any black-box multivariate model and additional non-trio genetic data to increase power. Crucially, our inferences are based only on a well-established mathematical model of recombination and make no assumptions about the relationship between the genotypes and phenotypes.

**Discussant:**Matthew Stephens (University of Chicago)**Links:**[Relevant paper] [Slides]