Towards Capturing Provenance of the Data Curation Process at Domain-specific Repositories

dc.contributor.author Shepherd, Adam
dc.contributor.author Rauch, Shannon
dc.contributor.author Schloer, Conrad
dc.contributor.author Kinkade, Danie
dc.contributor.author Biddle, Matt
dc.contributor.author Copley, Nancy
dc.contributor.author Saito, Mak A.
dc.contributor.author Wiebe, Peter
dc.contributor.author York, Amber
dc.date.accessioned 2019-01-14T20:57:47Z
dc.date.available 2019-01-14T20:57:47Z
dc.date.issued 2018-12-14
dc.description Presented at AGU Fall Meeting, American Geophysical Union, Washington, D.C., 10 – 14 Dec 2018
dc.description.abstract Data repositories often transform submissions to improve understanding and reuse of data by researchers other than the original submitter. However, scientific workflows built by the data submitters often depend on the original data format. In some cases, this makes the repository’s final data product less useful to the submitter. As a result, these two workable but different versions of the data serve two disparate, non-interoperable research communities around what should be a single dataset. Data repositories could bridge these two communities by exposing provenance that explains the transform from original submission to final product. A further benefit of this provenance would be to make the value added by domain repository data curation transparent. To make its data management process more efficient, the Biological and Chemical Oceanography Data Management Office (BCO-DMO, https://www.bco-dmo.org) has been adopting the data containerization specification defined by the Frictionless Data project (https://frictionlessdata.io). Recently, BCO-DMO has been using the Frictionless Data Package Pipelines Python library (https://github.com/frictionlessdata/datapackage-pipelines) to capture the data curation processing steps that transform original submissions to final data products. Because these processing steps are stored using a declarative language, they can be converted to a structured provenance record using the Provenance Ontology (PROV-O, https://www.w3.org/TR/prov-o/). PROV-O abstracts the Frictionless Data elements of BCO-DMO’s workflow for capturing necessary curation provenance and enables interoperability with other external provenance sources and tools. Users who are familiar with PROV-O or the Frictionless Data Pipelines can use either record to reproduce the final data product in a machine-actionable way.
While there may still be some curation steps that cannot be easily automated, this process is a step towards end-to-end reproducible transforms throughout the data curation process. In this presentation, BCO-DMO will demonstrate how Frictionless Data Package Pipelines can be used to capture data curation provenance from original submission to final data product, exposing the concrete value-add of domain-specific repositories. en_US
dc.description.sponsorship NSF #1435578 en_US
dc.identifier.doi 10.1575/1912/10826
dc.identifier.uri https://hdl.handle.net/1912/10826
dc.rights Attribution 4.0 International
dc.rights.uri http://creativecommons.org/licenses/by/4.0
dc.subject Provenance en_US
dc.subject Frictionless Data en_US
dc.subject Data management en_US
dc.title Towards Capturing Provenance of the Data Curation Process at Domain-specific Repositories en_US
dc.type Presentation en_US
dspace.entity.type Publication
relation.isAuthorOfPublication 5a1ec46b-03cf-40c3-b294-551ee5f54cf7
relation.isAuthorOfPublication fabbcd8e-ce7a-4ede-b638-e47af490c67c
relation.isAuthorOfPublication 18a11994-2e73-4de3-adf6-cda9a0ddddb6
relation.isAuthorOfPublication 09cddcd0-c893-4334-8a78-292171f697b4
relation.isAuthorOfPublication acaa04eb-34c3-4dcd-a8a7-e2a6c525e6cb
relation.isAuthorOfPublication 0fd499a5-2c8f-4e73-afd8-b33db071dd97
relation.isAuthorOfPublication 5ca83620-c5f3-4f10-9ad0-1356498a329c
relation.isAuthorOfPublication cb145654-8987-45bf-8412-902f2c36b648
relation.isAuthorOfPublication 8c6806d4-c72e-47a8-b713-fe927d8dce80
relation.isAuthorOfPublication.latestForDiscovery 5a1ec46b-03cf-40c3-b294-551ee5f54cf7
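Illustrative note on the abstract: the abstract states that declaratively stored processing steps can be converted to a PROV-O record. The following is a minimal sketch of that idea, not BCO-DMO's implementation. It assumes a step list shaped like the `run`/`parameters` entries of a Data Package Pipelines spec; the dataset id, file paths, step choices, the `urn:example:` identifiers, and the `ex:parameters` property are all hypothetical. Only the Python standard library is used, and the output is plain JSON-LD that uses PROV-O terms.

```python
import json

# Hypothetical pipeline description (an assumption for illustration only).
# Each step mirrors the declarative `run` + `parameters` shape of a pipeline spec.
PIPELINE = {
    "id": "example-dataset",
    "source": "submission/original.csv",
    "steps": [
        {"run": "load", "parameters": {"from": "submission/original.csv"}},
        {"run": "set_types", "parameters": {"types": {"depth": {"type": "number"}}}},
        {"run": "dump_to_path", "parameters": {"out-path": "final"}},
    ],
}

PROV_NS = "http://www.w3.org/ns/prov#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"


def pipeline_to_prov(pipeline, base="urn:example:"):
    """Map each declarative step to a prov:Activity and chain the steps
    together through intermediate prov:Entity nodes."""
    pid = pipeline["id"]
    source_id = f"{base}{pid}/source"
    graph = [
        {"@id": source_id, "@type": "prov:Entity", "rdfs:label": pipeline["source"]}
    ]

    previous = source_id
    for index, step in enumerate(pipeline["steps"], start=1):
        activity_id = f"{base}{pid}/step/{index}"
        output_id = f"{base}{pid}/entity/{index}"
        graph.append({
            "@id": activity_id,
            "@type": "prov:Activity",
            "rdfs:label": step["run"],
            "prov:used": {"@id": previous},
            # Hypothetical property: keeps the step parameters so the
            # record stays machine-actionable.
            "ex:parameters": json.dumps(step.get("parameters", {}), sort_keys=True),
        })
        graph.append({
            "@id": output_id,
            "@type": "prov:Entity",
            "prov:wasGeneratedBy": {"@id": activity_id},
            "prov:wasDerivedFrom": {"@id": previous},
        })
        previous = output_id

    return {
        "@context": {"prov": PROV_NS, "rdfs": RDFS_NS, "ex": base},
        "@graph": graph,
    }


if __name__ == "__main__":
    print(json.dumps(pipeline_to_prov(PIPELINE), indent=2))
```

In this sketch the last entity in the graph stands for the final data product, and following `prov:wasDerivedFrom` links from it leads back to the original submission.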
Files
Original bundle
Name: PROV of Data Management.pptx
Size: 11.04 MB
Format: Microsoft Powerpoint
Description: PROV of Data Management Presentation

Name: PROV of Data Management.pdf
Size: 2.78 MB
Format: Adobe Portable Document Format
Description: PROV of Data Management PDF

License bundle
Name: license.txt
Size: 1.89 KB
Format: Item-specific license agreed upon to submission