Provable Boolean interaction recovery from tree ensemble obtained via random forests.

consistency decision trees ensemble methods interaction selection interpretable machine learning

Journal

Proceedings of the National Academy of Sciences of the United States of America
ISSN: 1091-6490
Titre abrégé: Proc Natl Acad Sci U S A
Pays: United States
ID NLM: 7505876

Informations de publication

Date de publication:
31 05 2022
Historique:
entrez: 24 5 2022
pubmed: 25 5 2022
medline: 27 5 2022
Statut: ppublish

Résumé

Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequentially, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.

Identifiants

pubmed: 35609192
doi: 10.1073/pnas.2118636119
pmc: PMC9295780
doi:

Types de publication

Journal Article Research Support, Non-U.S. Gov't Research Support, U.S. Gov't, Non-P.H.S.

Langues

eng

Sous-ensembles de citation

IM

Pagination

e2118636119

Références

Proc Natl Acad Sci U S A. 2020 Feb 25;117(8):3920-3929
pubmed: 32054788
Nucleic Acids Res. 2007 Jul;35(Web Server issue):W339-44
pubmed: 17553836
J Theor Biol. 1969 Oct;25(1):1-47
pubmed: 4390734
Proc Natl Acad Sci U S A. 2005 Mar 22;102(12):4470-5
pubmed: 15728384
BMC Bioinformatics. 2008 Jul 11;9:307
pubmed: 18620558
Trends Biochem Sci. 1996 Dec;21(12):460-6
pubmed: 9009826
Nucleic Acids Res. 2013 Jan;41(2):827-41
pubmed: 23221638
Curr Opin Microbiol. 2008 Dec;11(6):574-9
pubmed: 18935980
Brief Bioinform. 2013 May;14(3):315-26
pubmed: 22786785
Genes (Basel). 2019 Dec 02;10(12):
pubmed: 31810264
BMC Bioinformatics. 2011 Dec 12;12:469
pubmed: 22151604
EMBO J. 1999 Aug 2;18(15):4299-307
pubmed: 10428968
Bioinformatics. 2019 Jul 1;35(13):2343-2345
pubmed: 30462187
Bioinformatics. 2018 Nov 1;34(21):3711-3718
pubmed: 29757357
Proc Natl Acad Sci U S A. 2005 Apr 12;102(15):5310-1
pubmed: 15811940
BMC Bioinformatics. 2020 Jul 14;21(1):307
pubmed: 32664864
Genomics. 2012 Jun;99(6):323-9
pubmed: 22546560
BMC Bioinformatics. 2009 Jan 09;10:13
pubmed: 19134182
Proc Natl Acad Sci U S A. 2018 Feb 20;115(8):1943-1948
pubmed: 29351989
J Comput Graph Stat. 2017;26(3):589-597
pubmed: 30906174

Auteurs

Merle Behr (M)

Department of Statistics, University of California, Berkeley, CA 94720.

Yu Wang (Y)

Department of Statistics, University of California, Berkeley, CA 94720.

Xiao Li (X)

Department of Statistics, University of California, Berkeley, CA 94720.

Bin Yu (B)

Department of Statistics, University of California, Berkeley, CA 94720.
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.
Center for Computational Biology, University of California, Berkeley, CA 94720.

Articles similaires

Selecting optimal software code descriptors-The case of Java.

Yegor Bugayenko, Zamira Kholmatova, Artem Kruglov et al.
1.00
Software Algorithms Programming Languages

Exploring blood-brain barrier passage using atomic weighted vector and machine learning.

Yoan Martínez-López, Paulina Phoobane, Yanaima Jauriga et al.
1.00
Blood-Brain Barrier Machine Learning Humans Support Vector Machine Software

Understanding the role of machine learning in predicting progression of osteoarthritis.

Simone Castagno, Benjamin Gompels, Estelle Strangmark et al.
1.00
Humans Disease Progression Machine Learning Osteoarthritis
1.00
Humans Magnetic Resonance Imaging Brain Infant, Newborn Infant, Premature

Classifications MeSH