Towards Capturing Provenance of the Data Curation Process at Domain-specific Repositories
Saito, Mak A.
Data repositories often transform submissions to improve understanding and reuse of data by researchers other than the original submitter. However, scientific workflows built by the data submitters often depend on the original data format. In some cases, this makes the repository’s final data product less useful to the submitter. As a result, these two workable but different versions of the data provide value to two disparate, non-interoperable research communities around what should be a single dataset. Data repositories could bridge these two communities by exposing provenance explaining the transform from original submission to final product. A further benefit of this provenance would be to make the value added by domain repository data curation transparent. To make its data management process more efficient, the Biological and Chemical Oceanography Data Management Office (BCO-DMO, https://www.bco-dmo.org) has been adopting the data containerization specification defined by the Frictionless Data project (https://frictionlessdata.io). Recently, BCO-DMO has been using the Frictionless Data Package Pipelines Python library (https://github.com/frictionlessdata/datapackage-pipelines) to capture the data curation processing steps that transform original submissions to final data products. Because these processing steps are stored using a declarative language, they can be converted to a structured provenance record using the Provenance Ontology (PROV-O, https://www.w3.org/TR/prov-o/). PROV-O abstracts the Frictionless Data elements of BCO-DMO’s workflow for capturing necessary curation provenance and enables interoperability with other external provenance sources and tools. Users who are familiar with PROV-O or the Frictionless Data Pipelines can use either record to reproduce the final data product in a machine-actionable way.
While there may still be some curation steps that cannot be easily automated, this process is a step towards end-to-end reproducible transforms throughout the data curation process. In this presentation, BCO-DMO will demonstrate how Frictionless Data Package Pipelines can be used to capture data curation provenance from original submission to final data product, exposing the concrete value-add of domain-specific repositories.
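The conversion described in the abstract, from declarative pipeline steps to a PROV-O record, can be illustrated with a minimal sketch. This is not BCO-DMO's implementation. It assumes a pipeline is a list of steps, each with a `run` key as in a datapackage-pipelines specification. The step names, file names, the `urn:example:` base, and the function `pipeline_to_prov` are all hypothetical. Each step becomes a `prov:Activity`, and the submission and final product become `prov:Entity` resources, emitted as plain triples without any RDF library.

```python
# Minimal sketch (hypothetical names throughout): map declarative pipeline
# steps onto PROV-O terms, emitted as (subject, predicate, object) triples.
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def pipeline_to_prov(pipeline_id, steps, source, product, base="urn:example:"):
    """Return PROV-O triples describing how `product` was derived from `source`."""
    src = base + source
    out = base + product
    triples = [
        (src, RDF_TYPE, PROV + "Entity"),
        (out, RDF_TYPE, PROV + "Entity"),
        (out, PROV + "wasDerivedFrom", src),
    ]
    previous = None
    for index, step in enumerate(steps):
        # One activity per step; step parameters are not mapped in this sketch.
        activity = f"{base}{pipeline_id}/step/{index}-{step['run']}"
        triples.append((activity, RDF_TYPE, PROV + "Activity"))
        if previous is None:
            # The first step reads the original submission.
            triples.append((activity, PROV + "used", src))
        else:
            # Later steps follow on from the step before them.
            triples.append((activity, PROV + "wasInformedBy", previous))
        previous = activity
    if previous is not None:
        # The last step produces the final data product.
        triples.append((out, PROV + "wasGeneratedBy", previous))
    return triples


# Illustrative steps only; real pipelines would list their own processors.
example_steps = [
    {"run": "load", "parameters": {"from": "submission.csv"}},
    {"run": "sort", "parameters": {"sort-by": "{date}"}},
    {"run": "dump_to_path", "parameters": {"out-path": "final"}},
]

triples = pipeline_to_prov(
    "example-dataset", example_steps, "submission.csv", "final/data.csv"
)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

A fuller mapping would presumably also record each step's parameters and the curator responsible (for example as a `prov:Agent`); those are omitted here to keep the sketch short.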
Presented at AGU Fall Meeting, American Geophysical Union, Washington, D.C., 10 – 14 Dec 2018
Suggested Citation
Presentation: Shepherd, Adam, Rauch, Shannon, Schloer, Conrad, Kinkade, Danie, Biddle, Matt, Copley, Nancy, Saito, Mak A., Wiebe, Peter, York, Amber, "Towards Capturing Provenance of the Data Curation Process at Domain-specific Repositories", Presented at AGU Fall Meeting, American Geophysical Union, Washington, D.C., 10 – 14 Dec 2018, DOI:10.1575/1912/10826, https://hdl.handle.net/1912/10826
Showing items related by title, author, creator and subject.
What role should a domain-specific repository play in treating code as a first class research product? [poster]
Biddle, Matt; Ake, Hannah; Copley, Nancy; Kinkade, Danie; Rauch, Shannon; Saito, Mak A.; Shepherd, Adam; Wiebe, Peter; York, Amber (2018-12-13)
The Biological and Chemical Oceanography Data Management Office (BCO-DMO) is a publicly accessible earth science data repository created to curate, publicly serve (publish), and archive digital data and information from ...

Sources of land-derived runoff to a coral reef-fringed embayment identified using geochemical tracers in nearshore sediment traps
Takesue, Renee K.; Bothner, Michael H.; Reynolds, Richard L. (Elsevier B.V., 2009-09-24)
Geochemical tracers, including Ba, Co, Th, 7Be, 137Cs and 210Pb, and magnetic properties were used to characterize terrestrial runoff collected in nearshore time-series sediment traps in Hanalei Bay, Kauai, during flood ...

Anderson, Chloe H.; Murray, Richard W.; Dunlea, Ann G.; Giosan, Liviu; Kinsley, Christopher W.; McGee, David; Tada, Ryuji (John Wiley & Sons, 2018-08-11)
We examine the paleoceanographic record over the last ∼400 kyr derived from major, trace, and rare earth elements in bulk sediment from two sites in the East China Sea drilled during Integrated Ocean Drilling Program ...