GitHub
GitHub is a cloud-based version-control platform well suited to storing, managing and sharing code and plain-text data files (e.g. .csv).
Hakai Dataset Repositories
Project- or research output-specific repositories are set up at an early stage of the project within the Hakai organizational account. These repositories are essentially the data packages.
The Hakai Dataset Repository template is strongly recommended when setting up a new repository, and includes recommended files and folders that need to be updated prior to release.
Contents of the repository are owned by the Hakai Institute and maintained by project collaborators. While initially these repositories can be private (accessible only to project collaborators), they should be made publicly accessible upon publication of the data or manuscript.
GitHub Actions can serve as a continuous integration and continuous delivery (CI/CD) platform: workflows run on selected triggers and can automate, for example:
- Running and testing a data pipeline
- Identifying and flagging outliers or potentially erroneous data
- Ensuring consistency in dataset, file structure and file naming conventions across collaborators and versions
- Generating files and data products
- Harvesting data to a database or visualization tools
- Uploading releases to an existing repository
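As an illustration of the naming-consistency check in the list above, a workflow step could run a small script along these lines. This is a minimal sketch, not Hakai's actual pipeline: the naming rule (lowercase words separated by underscores, with a .csv extension) and the function name `find_bad_names` are assumptions made for the example.

```python
import re

# Assumed convention, for illustration only: lowercase letters and digits,
# words separated by underscores, ".csv" extension.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*\.csv$")


def find_bad_names(filenames):
    """Return the file names that do not follow the assumed convention."""
    return [name for name in filenames if not NAME_PATTERN.match(name)]


bad = find_bad_names(["ctd_profiles_2021.csv", "Fish Data.csv", "zooplankton.csv"])
print(bad)  # ['Fish Data.csv']
```

In a workflow, a non-empty result would typically be turned into a non-zero exit code so that the run fails and collaborators are alerted.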
Releasing
A finalized data package or repository should include the following:
- data
- scripts
- documentation
- sampling and processing methods
- licensing
- data attribution
- necessary readme files
Together, these are the components needed to facilitate data reuse and reproducibility.
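Before a release, the checklist above can be verified with a short script. The sketch below is illustrative only: the entries in `REQUIRED` are a hypothetical layout, not the actual contents of the Hakai Dataset Repository template, and should be replaced with the files and folders your repository uses.

```python
from pathlib import Path
import tempfile

# Hypothetical layout; adjust to the files and folders your repository uses.
REQUIRED = ["data", "scripts", "README.md", "LICENSE"]


def missing_components(root, required=REQUIRED):
    """Return the required entries that do not exist under root."""
    root = Path(root)
    return [name for name in required if not (root / name).exists()]


# Demonstration on a throwaway directory that has only some components.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data").mkdir()
    (Path(tmp) / "README.md").write_text("Dataset description\n")
    demo_missing = missing_components(tmp)

print(demo_missing)  # ['scripts', 'LICENSE']
```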
A data package is made publicly available by generating a version release of the repository. Releases are essentially 'snapshots' of your repository. When you create a new release, the release tag must match the version element in your metadata record. Previous version releases are archived and remain accessible. Please provide a link to the 'Releases' page (e.g. Hakai Juvenile Salmon Program Releases) in the metadata record so that users who follow the link can navigate to their desired version.
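The requirement that the release tag match the metadata version can be checked automatically. The following is a minimal sketch under one stated assumption: a leading "v" on either value (as in "v1.2.0") is treated as a prefix and ignored. If your project requires the two strings to be identical, compare them directly instead. The function name `tag_matches_version` is hypothetical.

```python
def tag_matches_version(release_tag, metadata_version):
    """Compare a release tag with a metadata version string."""

    def normalize(value):
        value = value.strip()
        # Assumption: a leading "v" is a tag prefix, not part of the version.
        return value[1:] if value.lower().startswith("v") else value

    return normalize(release_tag) == normalize(metadata_version)


print(tag_matches_version("v1.2.0", "1.2.0"))  # True
print(tag_matches_version("v1.2.0", "1.3.0"))  # False
```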
If a repository contains sensitive or restricted data that cannot be made publicly accessible, an adapted pipeline will need to be developed to accommodate the sensitivities.
Zenodo Integration
GitHub releases can be preserved in a long-term trusted repository, for example through GitHub's integration with Zenodo, which automatically archives each version of your repository and assigns it a DOI when you create a GitHub release. To set up an automated integration, follow the GitHub instructions or contact data@hakai.org.
Make sure you reference the Zenodo DOI in the metadata record in the Hakai Catalogue. You will still have to mint a DOI for the record in the Hakai Catalogue through the metadata form, and indicate the relationship between the two records. If there are multiple versions (e.g. a time series dataset), the DOI generated through the Hakai metadata form serves as the DOI for the overall collection.