Provable Boolean interaction recovery from tree ensemble obtained via random forests.
consistency
decision trees
ensemble methods
interaction selection
interpretable machine learning
Journal
Proceedings of the National Academy of Sciences of the United States of America
ISSN: 1091-6490
Titre abrégé: Proc Natl Acad Sci U S A
Pays: United States
ID NLM: 7505876
Informations de publication
Date de publication:
31 05 2022
31 05 2022
Historique:
entrez:
24
5
2022
pubmed:
25
5
2022
medline:
27
5
2022
Statut:
ppublish
Résumé
Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequentially, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
Identifiants
pubmed: 35609192
doi: 10.1073/pnas.2118636119
pmc: PMC9295780
doi:
Types de publication
Journal Article
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Langues
eng
Sous-ensembles de citation
IM
Pagination
e2118636119Références
Proc Natl Acad Sci U S A. 2020 Feb 25;117(8):3920-3929
pubmed: 32054788
Nucleic Acids Res. 2007 Jul;35(Web Server issue):W339-44
pubmed: 17553836
J Theor Biol. 1969 Oct;25(1):1-47
pubmed: 4390734
Proc Natl Acad Sci U S A. 2005 Mar 22;102(12):4470-5
pubmed: 15728384
BMC Bioinformatics. 2008 Jul 11;9:307
pubmed: 18620558
Trends Biochem Sci. 1996 Dec;21(12):460-6
pubmed: 9009826
Nucleic Acids Res. 2013 Jan;41(2):827-41
pubmed: 23221638
Curr Opin Microbiol. 2008 Dec;11(6):574-9
pubmed: 18935980
Brief Bioinform. 2013 May;14(3):315-26
pubmed: 22786785
Genes (Basel). 2019 Dec 02;10(12):
pubmed: 31810264
BMC Bioinformatics. 2011 Dec 12;12:469
pubmed: 22151604
EMBO J. 1999 Aug 2;18(15):4299-307
pubmed: 10428968
Bioinformatics. 2019 Jul 1;35(13):2343-2345
pubmed: 30462187
Bioinformatics. 2018 Nov 1;34(21):3711-3718
pubmed: 29757357
Proc Natl Acad Sci U S A. 2005 Apr 12;102(15):5310-1
pubmed: 15811940
BMC Bioinformatics. 2020 Jul 14;21(1):307
pubmed: 32664864
Genomics. 2012 Jun;99(6):323-9
pubmed: 22546560
BMC Bioinformatics. 2009 Jan 09;10:13
pubmed: 19134182
Proc Natl Acad Sci U S A. 2018 Feb 20;115(8):1943-1948
pubmed: 29351989
J Comput Graph Stat. 2017;26(3):589-597
pubmed: 30906174