Towards self-describing and FAIR bulk formats for biomedical data.


Journal

PLoS computational biology
ISSN: 1553-7358
Abbreviated title: PLoS Comput Biol
Country: United States
NLM ID: 101238922

Publication information

Publication date:
03 2023
History:
received: 24 07 2022
accepted: 13 02 2023
revised: 23 03 2023
pubmed: 14 3 2023
medline: 28 3 2023
entrez: 13 3 2023
Status: epublish

Abstract

We introduce a self-describing serialized format for bulk biomedical data called the Portable Format for Biomedical (PFB) data. The Portable Format for Biomedical data is based upon Avro and encapsulates a data model, a data dictionary, the data itself, and pointers to third party controlled vocabularies. In general, each data element in the data dictionary is associated with a third party controlled vocabulary to make it easier for applications to harmonize two or more PFB files. We also introduce an open source software development kit (SDK) called PyPFB for creating, exploring and modifying PFB files. We describe experimental studies showing the performance improvements when importing and exporting bulk biomedical data in the PFB format versus using JSON and SQL formats.
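The idea of bundling a data model, a data dictionary, and the data itself, with each data element pointing to a controlled vocabulary term, can be sketched as follows. This is a minimal illustration only: it uses plain JSON from the Python standard library rather than Avro, it does not use or imitate the PyPFB API, and the names `make_container`, `harmonize`, the field names, and the `EXAMPLE:0001` term identifier are all hypothetical.

```python
import json


def make_container(data_model, data_dictionary, records):
    """Bundle model, dictionary, and data into one self-describing payload.

    Hypothetical layout for illustration; real PFB files are Avro, not JSON.
    """
    return json.dumps(
        {
            "data_model": data_model,
            "data_dictionary": data_dictionary,
            "records": records,
        }
    )


def harmonize(payload_a, payload_b):
    """Match fields across two payloads by their shared vocabulary term.

    Returns {term_id: (field_name_in_a, field_name_in_b)}.
    """

    def fields_by_term(payload):
        doc = json.loads(payload)
        return {
            entry["term_id"]: field
            for field, entry in doc["data_dictionary"].items()
        }

    a = fields_by_term(payload_a)
    b = fields_by_term(payload_b)
    return {term: (a[term], b[term]) for term in a.keys() & b.keys()}


# Two files that name the same concept differently but share a term ID.
file_a = make_container(
    data_model={"subject": ["sex"]},
    data_dictionary={"sex": {"vocabulary": "EXAMPLE", "term_id": "EXAMPLE:0001"}},
    records=[{"node": "subject", "sex": "female"}],
)
file_b = make_container(
    data_model={"participant": ["gender"]},
    data_dictionary={"gender": {"vocabulary": "EXAMPLE", "term_id": "EXAMPLE:0001"}},
    records=[{"node": "participant", "gender": "F"}],
)

matches = harmonize(file_a, file_b)
print(matches)  # {'EXAMPLE:0001': ('sex', 'gender')}
```

Because each payload carries its own dictionary, a consumer can align `sex` and `gender` without any outside documentation, which is the property the abstract attributes to the vocabulary pointers.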

Identifiers

pubmed: 36913405
doi: 10.1371/journal.pcbi.1010944
pii: PCOMPBIOL-D-22-01126
pmc: PMC10035862

Publication types

Journal Article
Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

e1010944

Grants

Agency: NHLBI NIH HHS
ID: U2C HL138346
Country: United States

Copyright information

Copyright: © 2023 Lukowski et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Authors

Michael Lukowski (M)

Center for Translational Data Science, University of Chicago, Chicago, Illinois, United States of America.

Andrew Prokhorenkov (A)

Center for Translational Data Science, University of Chicago, Chicago, Illinois, United States of America.

Robert L Grossman (RL)

Center for Translational Data Science, University of Chicago, Chicago, Illinois, United States of America.
Section of Biomedical Data Science, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America.
Department of Computer Science, University of Chicago, Chicago, Illinois, United States of America.
