MIA: Lucy Gao, Data thinning to avoid double dipping; Primer by Yiqun Chen

Published on Apr 12, 2024

Models, Inference and Algorithms
March 20, 2024
Broad Institute of MIT and Harvard

Meeting: Data thinning to avoid double dipping

Lucy Gao
Assistant Professor of Statistics
University of British Columbia

"Double dipping" is the practice of using the same data to fit and validate a model. Problems typically arise when standard statistical procedures are applied in settings involving double dipping. To avoid the challenges surrounding double dipping, a natural approach is to fit a model on one dataset, and then validate the model on another independent dataset. When we only have access to one dataset, we typically accomplish this via sample splitting. Unfortunately, in some problems, sample splitting is unattractive or impossible. In this talk, we are motivated by unsupervised problems that arise in the analysis of single cell RNA sequencing data, where sample splitting does not allow us to avoid double dipping. We first propose Poisson thinning, which splits a single observation drawn from a Poisson distribution into two independent pseudo-observations. We show that Poisson count splitting allows us to avoid double dipping in unsupervised settings. We next generalize the Poisson thinning framework to a variety of distributions, and refer to this general framework as "data thinning". Data thinning is applicable far beyond the context of single-cell RNA sequencing data, and is particularly useful for problems where sample splitting is unattractive or impossible.

Primer: Testing data-driven hypotheses post-clustering

Yiqun Chen
Data Science Postdoctoral Fellow
Zou Lab, Stanford University

This primer talk is motivated by the practice of testing data-driven hypotheses. In the biomedical sciences, it has become increasingly common to collect massive datasets without a pre-specified research question. In this setting, a data analyst might use the data both to generate a research question and to test the associated null hypothesis. For example, in single-cell RNA-sequencing analyses, researchers often first cluster the cells, and then test for differences in the expected gene expression levels between the clusters to quantify up- or down-regulation of genes, annotate known cell types, and identify new cell types. However, this popular practice is invalid from a statistical perspective: once we have used the data to generate hypotheses, standard statistical inference tools are no longer valid. To tackle this problem, I developed a conditional selective inference approach to test for a difference in means between pairs of clusters obtained via hierarchical and k-means clustering. The proposed approach has appropriate statistical guarantees (e.g., selective Type I error control), and we demonstrate its use on single-cell RNA-sequencing data.
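The failure mode the primer addresses can be seen in a few lines: cluster pure noise, then run a standard two-sample t-test between the resulting clusters. The p-value is tiny even though no true group structure exists, because clustering already pushed the groups apart. This sketch (an illustration of the problem, not the talk's selective test; sample size and seeds are arbitrary) uses k-means via scikit-learn:

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))  # one feature, drawn under a global null: no real clusters

# Step 1: use the data to generate the hypothesis (which groups to compare).
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Step 2: use the SAME data to test that hypothesis with a naive t-test.
t, p = stats.ttest_ind(X[labels == 0, 0], X[labels == 1, 0])
print(p)  # far below 0.05 despite the null being true: double dipping inflates Type I error
```

A selective inference procedure instead computes a p-value conditional on the clustering event, restoring Type I error control.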

For more information visit: https://www.broadinstitute.org/talks/...

Copyright Broad Institute, 2024. All rights reserved.
