International Seminar on Selective Inference
A weekly online seminar on selective inference, multiple testing, and post-selection inference.
Gratefully inspired by the Online Causal Inference Seminar
Mailing List
For announcements and Zoom invitations please subscribe to our mailing list.
Upcoming Seminar Presentations
All seminars take place Mondays at 8:30 am PT / 11:30 am ET / 4:30 pm London / 6:30 pm Tel Aviv. Past seminar presentations are posted here.
Monday, October 21st, 2024 [link to join]
Speaker: Kai Zhang (University of North Carolina at Chapel Hill)
Title: BET and BELIEF
Abstract: We study the problem of distribution-free dependence detection and modeling through the new framework of binary expansion statistics (BEStat). The binary expansion testing (BET) avoids the problem of non-uniform consistency and improves upon a wide class of commonly used methods (a) by achieving the minimax rate in sample size requirement for reliable power and (b) by providing clear interpretations of global relationships upon rejection of independence. The binary expansion approach also connects the symmetry statistics with the current computing system to facilitate efficient bitwise implementation. Modeling with the binary expansion linear effect (BELIEF) is motivated by the fact that two linearly uncorrelated binary variables must also be independent. Inferences from BELIEF are easily interpretable because they describe the association of binary variables in the language of linear models, yielding convenient theoretical insight and striking parallels with the Gaussian world. With BELIEF, one may study generalized linear models (GLM) through transparent linear models, providing insight into how modeling is affected by the choice of link. We explore these phenomena and provide a host of related theoretical results. This is joint work with Benjamin Brown and Xiao-Li Meng.
Discussant: Hongjian Shi (Technical University of Munich)
Monday, October 28th, 2024 [link to join]
Speaker: Nikos Ignatiadis (University of Chicago)
Title: Empirical partially Bayes multiple testing and compound χ² decisions
Abstract: A common task in high-throughput biology is to screen for associations across thousands of units of interest, e.g., genes or proteins. Often, the data for each unit are modeled as Gaussian measurements with unknown mean and variance and are summarized as per-unit sample averages and sample variances. The downstream goal is multiple testing for the means. In this domain, it is routine to "moderate" (that is, to shrink) the sample variances through parametric empirical Bayes methods before computing p-values for the means. Such an approach is asymmetric in that a prior is posited and estimated for the nuisance parameters (variances) but not the primary parameters (means). Our work initiates the formal study of this paradigm, which we term "empirical partially Bayes multiple testing." In this framework, if the prior for the variances were known, one could proceed by computing p-values conditional on the sample variances -- a strategy called partially Bayes inference by Sir David Cox. We show that these conditional p-values satisfy an Eddington/Tweedie-type formula and are approximated at nearly-parametric rates when the prior is estimated by nonparametric maximum likelihood. The estimated p-values can be used with the Benjamini-Hochberg procedure to guarantee asymptotic control of the false discovery rate. Even in the compound setting, wherein the variances are fixed, the approach retains asymptotic type-I error guarantees.
Discussant: Jelle Goeman (Leiden University)
Links: [Relevant papers: paper #1]
Monday, November 4th, 2024 [link to join]
Speaker: Anru Zhang (Duke University)
Title: Functional post-clustering selective inference with applications to EHR
Abstract: In electronic health records (EHR) analysis, clustering patients according to patterns in their data is crucial for uncovering new subtypes of diseases. Existing medical literature often relies on classical hypothesis testing methods to test for differences in means between these clusters. Due to selection bias induced by clustering algorithms, the implementation of these classical methods on post-clustering data often leads to an inflated type-I error. In this paper, we introduce a new statistical approach that adjusts for this bias when analyzing data collected over time. Our method extends classical selective inference methods for cross-sectional data to longitudinal data. We provide theoretical guarantees for our approach with upper bounds on the selective type-I and type-II errors. We apply the method to simulated data and real-world Acute Kidney Injury (AKI) EHR datasets, thereby illustrating the advantages of our approach.
Discussant:
Links: [Relevant papers: paper #1]
Format
The seminars are held on Zoom and last 60 minutes:
45 minutes of presentation
15 minutes of discussion, led by an invited discussant
Moderators collect questions using the Q&A feature during the seminar.
How to join
You can attend by clicking the link to join (there is no need to register in advance).
More instructions for attendees can be found here.
Organizers
Will Fithian (UC Berkeley)
Jelle Goeman (Leiden University)
Nikos Ignatiadis (University of Chicago)
Lihua Lei (Stanford University)
Zhimei Ren (University of Pennsylvania)
Former organizers
Rina Barber (University of Chicago)
Daniel Yekutieli (Tel Aviv University)
Contact us
If you have feedback or suggestions or want to propose a speaker, please e-mail us at selectiveinferenceseminar@gmail.com.
What is selective inference?
Broadly construed, selective inference means searching for interesting patterns in data, usually with inferential guarantees that account for the search process. It encompasses:
Multiple testing: testing many hypotheses at once (and paying disproportionate attention to rejections)
Post-selection inference: examining the data to decide what question to ask, or what model to use, then carrying out one or more appropriate inferences
Adaptive / interactive inference: sequentially asking one question after another of the same data set, where each question is informed by the answers to preceding questions
Cheating: cherry-picking, double dipping, data snooping, data dredging, p-hacking, HARKing, and other low-down dirty rotten tricks; basically any of the above, but done wrong!
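As a concrete illustration of the multiple testing setting above, here is a minimal sketch (not the seminar's own code) of the Benjamini-Hochberg procedure, which is mentioned in one of the abstracts as a standard way to control the false discovery rate across many simultaneous tests:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return indices of hypotheses rejected by the BH procedure at level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # BH step-up rule: find the largest k (1-indexed) with p_(k) <= (k/m) * alpha
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(sorted_p <= thresholds)[0]
    if len(below) == 0:
        return np.array([], dtype=int)
    k = below[-1]
    # Reject all hypotheses with the k smallest p-values
    return np.sort(order[: k + 1])

# Toy example: two strong signals among ten hypotheses
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.5, 0.7, 0.9]
rejected = benjamini_hochberg(pvals, alpha=0.05)  # rejects hypotheses 0 and 1
```

Note that naively rejecting every hypothesis with p < 0.05 would reject five hypotheses here; the step-up thresholds are what pay for the "disproportionate attention to rejections."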