# International Seminar on Selective Inference

A weekly online seminar on selective inference, multiple testing, and post-selection inference.

Gratefully inspired by the Online Causal Inference Seminar

## Mailing List

For announcements and Zoom invitations please subscribe to our mailing list.

## Upcoming Seminar Presentations

All seminars take place Thursdays at 8:30 am PT / 11:30 am ET / 4:30 pm London / 6:30 pm Tel Aviv. Past seminar presentations are posted here.

• Thursday, April 8, 2021 [Link to join]

• Speaker: Hongyuan Cao (Florida State University)

• Title: Optimal False Discovery Rate Control For Large Scale Multiple Testing With Auxiliary Information

• Abstract: Large-scale multiple testing is a fundamental problem in high dimensional statistical inference. It is increasingly common that various types of auxiliary information, reflecting the structural relationship among the hypotheses, are available. Exploiting such auxiliary information can boost statistical power. To this end, we propose a framework based on a two-group mixture model with varying prior probabilities of being null for different hypotheses, where a shape-constrained relationship is imposed between the auxiliary information and the prior probabilities of being null. An optimal rejection rule is designed to maximize the expected number of true positives when the average false discovery rate is controlled. Focusing on the ordered structure, we develop a robust EM algorithm to simultaneously estimate the prior probabilities of being null and the distribution of p-values under the alternative hypothesis. We show that the proposed method has better power than state-of-the-art competitors while controlling the false discovery rate, both empirically and theoretically. Extensive simulations demonstrate the advantage of the proposed method. Datasets from genome-wide association studies are used to illustrate the new methodology.

• Discussant: James Scott (University of Texas at Austin)

• Links: [Relevant papers: paper #1] [Slides]
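To make the two-group rejection rule in the abstract concrete, here is a toy sketch (our own illustration, not the paper's implementation): the per-hypothesis prior null probabilities `pi0` are taken as given rather than estimated by EM, the alternative p-value density is a fixed Beta(a, 1) rather than estimated nonparametrically, and hypotheses are rejected in order of local FDR while the running average local FDR stays below the target level.

```python
import numpy as np

def lfdr_reject(pvals, pi0, alpha=0.1, beta_a=0.2):
    """Toy two-group rejection rule: rank hypotheses by local FDR and
    reject while the running average lfdr stays <= alpha.

    pi0 holds per-hypothesis prior null probabilities (e.g. derived from
    auxiliary covariates); the alternative p-value density is taken as
    Beta(beta_a, 1). The paper estimates both quantities via EM instead.
    """
    p = np.asarray(pvals, dtype=float)
    pi0 = np.asarray(pi0, dtype=float)
    f1 = beta_a * p ** (beta_a - 1)           # Beta(a, 1) density, a < 1
    lfdr = pi0 / (pi0 + (1 - pi0) * f1)       # null p-value density is Uniform(0, 1)
    order = np.argsort(lfdr)
    running = np.cumsum(lfdr[order]) / np.arange(1, len(p) + 1)
    # running averages of sorted lfdr are nondecreasing, so we can search
    k = np.searchsorted(running, alpha, side="right")
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject
```

Note how auxiliary information enters: a hypothesis with a moderate p-value but a small prior null probability can be rejected ahead of one with a smaller p-value but a large prior null probability.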

• Thursday, April 15, 2021 [Link to join]

• Speaker: Nikolaos Ignatiadis (Stanford University)

• Title: Confidence Intervals for Nonparametric Empirical Bayes Analysis

• Abstract: In an empirical Bayes analysis, we use data from repeated sampling to imitate inferences made by an oracle Bayesian with extensive knowledge of the data-generating distribution. Existing results provide a comprehensive characterization of when and why empirical Bayes point estimates accurately recover oracle Bayes behavior. In this work, we develop flexible and practical confidence intervals that provide asymptotic frequentist coverage of empirical Bayes estimands, such as the posterior mean or the local false sign rate. The coverage statements hold even when the estimands are only partially identified or when empirical Bayes point estimates converge very slowly. This is joint work with Stefan Wager.

• Discussant: Timothy Armstrong (Yale University)

• Links: [Relevant papers: paper #1]
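A standard example of the empirical Bayes estimands discussed in the abstract is the posterior mean recovered by Tweedie's formula, E[θ | z] = z + σ² d/dz log f(z), where f is the marginal density of the observations. The sketch below (our own illustration; the function name, kernel choice, and bandwidth are ours, and it produces point estimates only, not the talk's confidence intervals) estimates f with a Gaussian kernel:

```python
import numpy as np

def tweedie_posterior_mean(z, sigma=1.0, bandwidth=0.3):
    """Empirical Bayes posterior means via Tweedie's formula:
    E[theta | z] = z + sigma^2 * d/dz log f(z), with the marginal
    density f estimated by a Gaussian kernel density estimate."""
    z = np.asarray(z, dtype=float)
    diffs = z[:, None] - z[None, :]                    # pairwise z_i - z_j
    k = np.exp(-0.5 * (diffs / bandwidth) ** 2)        # Gaussian kernel weights
    f = k.sum(axis=1)                                  # proportional to the KDE of f
    fprime = (-diffs / bandwidth**2 * k).sum(axis=1)   # its derivative (same constant)
    return z + sigma**2 * fprime / f                   # constants cancel in the ratio
```

If θ ~ N(0, 1) and z | θ ~ N(θ, 1), the oracle posterior mean is z/2; the kernel-based estimate shrinks by roughly the same factor without ever being told the prior.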

• Thursday, April 22, 2021 [Link to join]

• Speaker: Feng Ruan (UC Berkeley)

• Title: A Self-Penalizing Objective Function for Scalable Interaction Detection

• Abstract: We tackle the problem of nonparametric variable selection with a focus on discovering interactions between variables. With p variables there are O(p^s) possible order-s interactions making exhaustive search infeasible. It is nonetheless possible to identify the variables involved in interactions with only linear computation cost, O(p). The trick is to maximize a class of parametrized nonparametric dependence measures which we call metric learning objectives; the landscape of these nonconvex objective functions is sensitive to interactions but the objectives themselves do not explicitly model interactions. Three properties make metric learning objectives highly attractive:

(a) The stationary points of the objective are automatically sparse (i.e. they perform selection) -- no explicit ℓ1 penalization is needed.

(b) All stationary points of the objective exclude noise variables with high probability.

(c) Guaranteed recovery of all signal variables without needing to reach the objective's global maxima or special stationary points.

The second and third properties mean that all our theoretical results apply in the practical case where one uses gradient ascent to maximize the metric learning objective. While not all metric learning objectives enjoy good statistical power, we design an objective based on ℓ1 kernels that does exhibit favorable power: it recovers (i) main effects with n ∼ log p samples, (ii) hierarchical interactions with n ∼ log p samples and (iii) order-s pure interactions with n ∼ p^{2(s−1)} log p samples.

• Thursday, April 29, 2021 [Link to join]

• Speaker: Thorsten Dickhaus (University of Bremen)

• Title: Randomized p-values in replicability analysis

• Abstract: We will be concerned with testing replicability hypotheses for many endpoints simultaneously. This constitutes a multiple test problem with composite null hypotheses. Traditional p-values, which are computed under least favourable parameter configurations (LFCs), are over-conservative in the case of composite null hypotheses. As demonstrated in prior work, this poses severe challenges in the multiple testing context, especially when one goal of the statistical analysis is to estimate the proportion $\pi_0$ of true null hypotheses. We will discuss the application of randomized p-values in the sense of [1] in replicability analysis. By means of theoretical considerations as well as computer simulations, we will demonstrate that their usage typically leads to a much more accurate estimation of $\pi_0$ than the LFC-based approach. Furthermore, we will draw connections to other recently proposed methods for dealing with conservative p-values in the multiple testing context. Finally, we will present a real data example from genomics. The presentation is based on [2] and [3].

• Links: [Relevant papers: paper #1, paper #2, paper #3]

• Thursday, June 3, 2021 [Link to join]

• Speakers: Xinping Cui (UC Riverside) and Haibing Zhao (Shanghai University of Finance and Economics)

• Title: Constructing confidence intervals for selected parameters

• Abstract: In large-scale problems, it is common practice to select important parameters by a procedure such as the Benjamini and Hochberg procedure and construct confidence intervals (CIs) for further investigation while the false coverage-statement rate (FCR) for the CIs is controlled at a desired level. Although the well-known BY CIs control the FCR, they are uniformly inflated. In this paper, we propose two methods to construct shorter selective CIs. The first method produces shorter CIs by allowing a reduced number of selective CIs. The second method produces shorter CIs by allowing a prefixed proportion of CIs containing the values of uninteresting parameters. We theoretically prove that the proposed CIs are uniformly shorter than BY CIs and control the FCR asymptotically for independent data. Numerical results confirm our theoretical results and show that the proposed CIs still work for correlated data. We illustrate the advantage of the proposed procedures by analyzing microarray data from an HIV study.

• Links: [Relevant papers: paper #1]
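The BY baseline that the abstract improves on can be sketched for the normal-means case (our own illustration; the paper's two proposed methods produce uniformly shorter intervals than this): select via BH at level q, then give each of the R selected means a marginal CI at level 1 − Rq/m, which controls the FCR at q.

```python
import numpy as np
from statistics import NormalDist

def by_fcr_intervals(z, sigma=1.0, q=0.1):
    """FCR-adjusted CIs in the style of Benjamini-Yekutieli (2005):
    BH selection at level q on two-sided p-values, then a marginal CI
    at level 1 - R*q/m for each of the R selected normal means."""
    nd = NormalDist()
    z = np.asarray(z, dtype=float)
    m = len(z)
    p = np.array([2 * (1 - nd.cdf(abs(zi) / sigma)) for zi in z])  # two-sided p-values
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m                # BH step-up comparison
    R = int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
    selected = order[:R]
    if R == 0:
        return selected, np.empty((0, 2))
    half = sigma * nd.inv_cdf(1 - R * q / (2 * m))  # wider than the unadjusted quantile
    return selected, np.column_stack([z[selected] - half, z[selected] + half])
```

The inflation the abstract mentions is visible directly: unless all m parameters are selected, 1 − Rq/m exceeds the nominal 1 − q, so every interval is wider than an unadjusted one.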

## Format

The seminars are held on Zoom and last 60 minutes:

• 45 minutes of presentation

• 15 minutes of discussion, led by an invited discussant

Moderators collect questions using the Q&A feature during the seminar.

## How to join

You can attend by clicking the link to join (there is no need to register in advance).

More instructions for attendees can be found here.

## Organizers

If you have feedback or suggestions or want to propose a speaker, please e-mail us at selectiveinferenceseminar@gmail.com.

## What is selective inference?

Broadly construed, selective inference means searching for interesting patterns in data, usually with inferential guarantees that account for the search process. It encompasses:

• Multiple testing: testing many hypotheses at once (and paying disproportionate attention to rejections)

• Post-selection inference: examining the data to decide what question to ask, or what model to use, then carrying out one or more appropriate inferences

• Adaptive / interactive inference: sequentially asking one question after another of the same data set, where each question is informed by the answers to preceding questions

• Cheating: cherry-picking, double dipping, data snooping, data dredging, p-hacking, HARKing, and other low-down dirty rotten tricks; basically any of the above, but done wrong!
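As a minimal illustration of the multiple-testing setting above, here is a sketch of the Benjamini-Hochberg step-up procedure, the standard FDR-controlling method mentioned throughout this page (the function name is ours):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Return a boolean mask of rejections at FDR level q (BH step-up):
    sort the p-values, find the largest k with p_(k) <= q*k/m, and
    reject the k hypotheses with the smallest p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```

The "paying disproportionate attention to rejections" problem is exactly what the q·k/m thresholds account for: a fixed per-test cutoff of q would let the number of false discoveries grow with m.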