Biology-inspired data-driven quality control for scientific discovery in single-cell transcriptomics.

Adaptive QC; Biological variation; Data-driven; Exploratory data analysis (EDA); Quality control (QC); Single cell; scRNA-seq

Journal

Genome biology
ISSN: 1474-760X
Abbreviated title: Genome Biol
Country: England
NLM ID: 100960660

Publication information

Publication date:
27 12 2022
History:
received: 18 08 2021
accepted: 23 11 2022
entrez: 27 12 2022
pubmed: 28 12 2022
medline: 30 12 2022
Status: epublish

Abstract

Quality control (QC) of cells, a critical first step in single-cell RNA sequencing data analysis, has largely relied on arbitrarily fixed, data-agnostic thresholds applied to QC metrics such as gene complexity and the fraction of reads mapping to mitochondrial genes. The few existing data-driven approaches perform QC at the level of samples or studies without accounting for biological variation. We first demonstrate that QC metrics vary with both tissue and cell type across technologies, study conditions, and species. We then propose data-driven QC (ddqc), an unsupervised adaptive QC framework that performs flexible, data-driven QC at the level of cell types while retaining critical biological insights and improving power for downstream analysis. ddqc applies an adaptive threshold, based on the median absolute deviation, to four QC metrics (gene and UMI complexity, and the fractions of reads mapping to mitochondrial and ribosomal genes). ddqc retains over a third more cells than conventional data-agnostic QC filters. Finally, we show that ddqc recovers biologically meaningful trends in the gradation of gene complexity among cell types, which can help answer questions of biological interest such as which cell types express the fewest and the most transcripts overall, and ribosomal transcripts specifically. ddqc retains cell types such as metabolically active parenchymal cells and specialized cells such as neutrophils, which are often lost by conventional QC. Taken together, our work proposes a revised paradigm for quality-filtering best practices, iterative QC, and provides a data-driven QC framework compatible with observed biological diversity.
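
The idea of an adaptive, per-cell-type threshold based on the median absolute deviation (MAD) can be illustrated with a minimal sketch. This is not the ddqc implementation: the function names, the metric names, the cutoff of 2 MADs, the direction in which each metric is filtered, and the toy values are all assumptions made for illustration, and clusters stand in for cell types.

```python
from collections import defaultdict
from statistics import median

# Assumption: low gene/UMI complexity and high mitochondrial/ribosomal
# fractions are treated as signs of poor quality. The abstract does not
# state the direction of each cutoff; these choices are illustrative.
METRIC_DIRECTION = {
    "n_genes": "low",
    "n_umis": "low",
    "pct_mito": "high",
    "pct_ribo": "high",
}


def mad(values):
    """Median absolute deviation of the values around their median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def adaptive_qc(cells, n_mads=2.0, directions=METRIC_DIRECTION):
    """Keep cells that pass every per-cluster MAD cutoff.

    `n_mads` is a hypothetical default, not a value taken from the abstract.
    """
    by_cluster = defaultdict(list)
    for cell in cells:
        by_cluster[cell["cluster"]].append(cell)

    kept = []
    for members in by_cluster.values():
        # Cutoffs are computed from this cluster only, so each cluster
        # is judged against its own distribution.
        cutoffs = {}
        for metric, direction in directions.items():
            values = [c[metric] for c in members]
            sign = -1 if direction == "low" else 1
            cutoffs[metric] = median(values) + sign * n_mads * mad(values)
        for cell in members:
            passes = all(
                cell[m] >= cutoffs[m] if d == "low" else cell[m] <= cutoffs[m]
                for m, d in directions.items()
            )
            if passes:
                kept.append(cell)
    return kept


def make_cells(cluster, rows):
    """Build toy cell records from (n_genes, n_umis, pct_mito, pct_ribo) rows."""
    return [
        dict(cluster=cluster, n_genes=g, n_umis=u, pct_mito=m, pct_ribo=r)
        for g, u, m, r in rows
    ]


# Synthetic toy data: one cluster with a low mitochondrial baseline and one
# with a high baseline, each containing a single outlier cell (last row).
cells = make_cells("low_mito", [
    (1000, 3000, 2, 10), (1100, 3300, 3, 11), (1200, 3600, 4, 12),
    (1300, 3900, 5, 13), (200, 400, 4, 12),
]) + make_cells("high_mito", [
    (3000, 9000, 20, 5), (3200, 9600, 22, 6), (3400, 10200, 24, 7),
    (3600, 10800, 26, 8), (3400, 10200, 60, 7),
])

kept = adaptive_qc(cells)
fixed = [c for c in cells if c["pct_mito"] <= 10]  # fixed, data-agnostic filter

print(len(kept), sorted({c["cluster"] for c in kept}))
print(len(fixed), sorted({c["cluster"] for c in fixed}))
```

On this toy input the per-cluster filter drops only the one outlier in each cluster, whereas the fixed mitochondrial cutoff of 10% removes the entire high-baseline cluster and still keeps the low-complexity outlier in the other. That contrast is the behavior the abstract attributes to data-agnostic thresholds.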

Identifiers

pubmed: 36575523
doi: 10.1186/s13059-022-02820-w
pii: 10.1186/s13059-022-02820-w
pmc: PMC9793662

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

267

Copyright information

© 2022. The Author(s).

Authors

Ayshwarya Subramanian (A)

Klarman Cell Observatory, Broad Institute of MIT and Harvard, Cambridge, MA, USA. subraman@broadinstitute.org.
Brigham and Women's Hospital, Harvard Medical School, Boston, USA. subraman@broadinstitute.org.

Mikhail Alperovich (M)

Klarman Cell Observatory, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
MIT PRIMES, Massachusetts Institute of Technology, Cambridge, MA, USA.
Lexington High School, Lexington, MA, USA.
Present Address: Wake Technical Community College, Raleigh, USA.

Yiming Yang (Y)

Klarman Cell Observatory, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Center for Immunology and Inflammatory Diseases, Department of Medicine, Massachusetts General Hospital, Boston, MA, 02114, USA.
Present Address: Department of Cellular and Tissue Genomics, Genentech Inc., South San Francisco, CA, USA.

Bo Li (B)

Klarman Cell Observatory, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Center for Immunology and Inflammatory Diseases, Department of Medicine, Massachusetts General Hospital, Boston, MA, 02114, USA.
Present Address: Department of Cellular and Tissue Genomics, Genentech Inc., South San Francisco, CA, USA.
Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA.
