Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets.
16S rRNA gene
Aerodigestive tract
Habitat-specific database
Microbiome
Nasal
Naïve Bayesian RDP Classifier
Species-level taxonomy
Training set
V1–V3
eHOMD
Journal
Microbiome
ISSN: 2049-2618
Abbreviated title: Microbiome
Country: England
NLM ID: 101615147
Publication information
Publication date: 2020-05-15
History:
received: 2020-02-18
accepted: 2020-04-15
entrez: 2020-05-17
pubmed: 2020-05-18
medline: 2021-03-02
Status: epublish
Abstract
BACKGROUND
The low cost of 16S rRNA gene sequencing facilitates population-scale molecular epidemiological studies. Existing computational algorithms can resolve 16S rRNA gene sequences into high-resolution amplicon sequence variants (ASVs), which represent consistent labels comparable across studies. Assigning these ASVs to species-level taxonomy strengthens the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies and further facilitates data comparison across studies.
RESULTS
To achieve this, we developed a broadly applicable method for constructing high-resolution training sets based on the phylogenetic relationships among microbes found in a habitat of interest. When used with the naïve Bayesian Ribosomal Database Project (RDP) Classifier, this training set achieved species/supraspecies-level taxonomic assignment of 16S rRNA gene-derived ASVs. The key steps for generating such a training set are (1) constructing an accurate and comprehensive phylogeny-based, habitat-specific database; (2) compiling multiple 16S rRNA gene sequences to represent the natural sequence variability of each taxon in the database; (3) trimming the training set to match the sequenced regions, if necessary; and (4) placing species sharing closely related sequences into a training-set-specific supraspecies taxonomic level to preserve subgenus-level resolution. As proof of principle, we developed a V1-V3 region training set for the bacterial microbiota of the human aerodigestive tract using the full-length 16S rRNA gene reference sequences compiled in our expanded Human Oral Microbiome Database (eHOMD). We also overcame technical limitations to successfully use Illumina sequences for the 16S rRNA gene V1-V3 region, the most informative segment for classifying bacteria native to the human aerodigestive tract. Finally, we generated a full-length eHOMD 16S rRNA gene training set, which we used in conjunction with an independent PacBio single-molecule, real-time (SMRT)-sequenced sinonasal dataset to validate the representation of species in our training set. This also established the effectiveness of a full-length training set for assigning taxonomy to long-read 16S rRNA gene datasets.
CONCLUSION
Here, we present a systematic approach for constructing a phylogeny-based, high-resolution, habitat-specific training set that permits species/supraspecies-level taxonomic assignment to short- and long-read 16S rRNA gene-derived ASVs. This advancement enhances the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies.
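Steps (3) and (4) listed in the Results can be illustrated with a small Python sketch. It is not the authors' pipeline. It assumes that a region is delimited by two short anchor motifs (hypothetical stand-ins for primer sites) and that "closely related" means an identical trimmed sequence within a genus; the abstract does not state the actual criterion. The function names and the toy sequences are invented for illustration.

```python
from collections import defaultdict


def trim_to_region(seq, fwd_anchor, rev_anchor):
    """Return the stretch between the two anchor motifs, or None if either is missing."""
    start = seq.find(fwd_anchor)
    if start == -1:
        return None
    start += len(fwd_anchor)
    end = seq.find(rev_anchor, start)
    if end == -1:
        return None
    return seq[start:end]


def assign_supraspecies(trimmed):
    """Merge species of one genus that share an identical trimmed sequence.

    trimmed: list of (genus, species, region) tuples.
    Returns a list of (genus, supraspecies, species, region) tuples.
    Assumption: exact identity stands in for "closely related".
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_owner = {}  # (genus, region) -> first (genus, species) seen with that region
    for genus, species, region in trimmed:
        key = (genus, species)
        parent.setdefault(key, key)
        owner = first_owner.setdefault((genus, region), key)
        union(owner, key)

    # Collect the species names in each merged group and build a joint label.
    groups = defaultdict(set)
    for key in parent:
        groups[find(key)].add(key[1])
    label = {root: "_".join(sorted(names)) for root, names in groups.items()}
    return [(g, label[find((g, s))], s, r) for g, s, r in trimmed]


def build_training_set(references, fwd_anchor, rev_anchor):
    """references: list of (genus, species, full_length_sequence) tuples."""
    trimmed = []
    for genus, species, seq in references:
        region = trim_to_region(seq, fwd_anchor, rev_anchor)
        if region:  # skip references that lack the region
            trimmed.append((genus, species, region))
    return assign_supraspecies(trimmed)


# Toy references (invented sequences); "AAA" and "TTT" are hypothetical anchors.
toy_references = [
    ("Streptococcus", "mitis", "GGAAACCGGCATTTGG"),
    ("Streptococcus", "oralis", "CCAAACCGGCATTTCC"),
    ("Streptococcus", "salivarius", "GGAAACCGACATTTGG"),
    ("Streptococcus", "mitis", "GGAAACCGGTATTTGG"),
]
for row in build_training_set(toy_references, "AAA", "TTT"):
    print(row)
```

In this toy input, the two species whose trimmed regions coincide end up under one joint supraspecies label, while the third keeps its own name. Keeping several sequences per species, as in the second mitis entry, mirrors step (2), in which multiple sequences represent each taxon.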
Identifiers
pubmed: 32414415
doi: 10.1186/s40168-020-00841-w
pii: 10.1186/s40168-020-00841-w
pmc: PMC7291764
Chemical substances
RNA, Ribosomal, 16S
Publication types
Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
65
Grants
Agency: NIAID NIH HHS
ID: R01 AI101018
Country: United States
Agency: NIDCR NIH HHS
ID: R01 DE024468
Country: United States
Agency: NIDCR NIH HHS
ID: R01 DE016937
Country: United States
Agency: National Institute of General Medical Sciences (US)
ID: R01GM117174
Country: International
Agency: Division of Intramural Research, National Institute of Allergy and Infectious Diseases
ID: R01AI101018
Country: International
Agency: NIDCR NIH HHS
ID: R37 DE016937
Country: United States
Agency: NCATS NIH HHS
ID: UL1 TR001102
Country: United States
Agency: NIDCR NIH HHS
ID: R01DE024468
Country: United States
Agency: NIGMS NIH HHS
ID: R01 GM117174
Country: United States
Agency: NIDCR NIH HHS
ID: R37DE016937
Country: United States