doi2dataset

doi2dataset is a Python tool designed to process DOIs and generate metadata for Dataverse.org datasets. It retrieves metadata from external APIs (such as OpenAlex and CrossRef), maps metadata fields, and can optionally upload the generated metadata to a Dataverse.org instance.

Features

  • DOI Validation and Normalization: Validates DOIs and converts them into a standardized format.
  • Metadata Retrieval: Fetches metadata such as title, abstract, license, and author information from external sources.
  • Standard Dataverse Metadata: Generates standard Dataverse citation metadata including:
    • Title, publication date, and alternative URL
    • Author information with affiliations and ORCID identifiers
    • Dataset contact information (corresponding authors)
    • Abstract and description
    • Keywords and subject classification
    • Grant/funding information
    • License information when available
  • Optional Upload: Allows uploading of metadata directly to a Dataverse.org server.
  • Progress Tracking: Uses the Rich library for user-friendly progress tracking and error handling.
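The DOI validation and normalization described above can be sketched roughly as follows. The function and pattern names here are illustrative only; the actual helpers in doi2dataset.py may differ in name and behavior:

```python
import re

# Core DOI shape: "10." + 4-9 digit registrant code + "/" + suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+", re.IGNORECASE)

def normalize_doi(doi: str) -> str:
    """Strip common URL/prefix forms and lowercase the DOI."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()

def is_valid_doi(doi: str) -> bool:
    """Return True if the normalized DOI matches the expected pattern."""
    return DOI_PATTERN.fullmatch(normalize_doi(doi)) is not None
```

This accepts a DOI whether it is given bare, with a `doi:` prefix, or as a full doi.org URL.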

Installation

Clone the repository:

git clone https://git.athemis.de/Athemis/doi2dataset
cd doi2dataset

Configuration

Before running the tool, configure the necessary settings in the config.yaml file located in the project root. This file contains configuration details such as:

  • Connection details: URL, API token, authentication credentials for Dataverse server
  • Principal Investigator (PI) information: Optional - used for fallback determination of corresponding authors when not explicitly specified in the publication
  • Default grant configurations: Funding information to be included in the metadata (supports multiple grants)

Configuration File Structure

The configuration file should follow this structure:

# Dataverse server connection details
dataverse:
  url: "https://your-dataverse-instance.org"
  api_token: "your-api-token"

# Default grant information (supports multiple grants)
default_grants:
  - funder: "Your Funding Agency"
    id: "GRANT123456"
  - funder: "Another Funding Agency"
    id: "GRANT789012"

# Principal investigators for fallback corresponding author determination (optional)
pis:
  - family_name: "Doe"
    given_name: "John"
    orcid: "0000-0000-0000-0000"
    email: "john.doe@university.edu"
    affiliation: "Department of Science, University"

See config_example.yaml for a complete example configuration.

Note: The PI section is optional. If no corresponding authors are found in the publication metadata and no PIs are configured, the tool will still generate metadata but may issue a warning about missing corresponding author information.
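The fallback behavior described in the note can be sketched as below. Field names (`is_corresponding`, `orcid`) and the function itself are illustrative, not the tool's actual API:

```python
def corresponding_authors(authors: list[dict], pis: list[dict]) -> list[dict]:
    """Return corresponding authors, falling back to configured PIs by ORCID."""
    # Prefer authors explicitly flagged as corresponding in the publication.
    explicit = [a for a in authors if a.get("is_corresponding")]
    if explicit:
        return explicit
    # Otherwise, treat any author whose ORCID matches a configured PI as corresponding.
    pi_orcids = {p["orcid"] for p in pis if p.get("orcid")}
    return [a for a in authors if a.get("orcid") in pi_orcids]
```

If the result is empty (no explicit corresponding author and no PI match), metadata is still generated, matching the warning behavior described above.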

Usage

Run doi2dataset from the command line by providing one or more DOIs:

python doi2dataset.py [options] DOI1 DOI2 ...

Command Line Options

  • -f, --file Specify a file containing DOIs (one per line).
  • -o, --output-dir Directory where metadata files will be saved.
  • -d, --depositor Name of the depositor.
  • -s, --subject Default subject for the metadata.
  • -m, --contact-mail Contact email address.
  • -u, --upload Upload metadata to a Dataverse.org server.
  • -r, --use-ror Use Research Organization Registry (ROR) identifiers for institutions when available.
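An argument parser mirroring the options above might look like this. This is an illustrative sketch; the actual argument handling in doi2dataset.py may differ in defaults and help text:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the CLI options listed above (illustrative)."""
    parser = argparse.ArgumentParser(prog="doi2dataset")
    parser.add_argument("dois", nargs="*", help="DOIs to process")
    parser.add_argument("-f", "--file", help="file containing DOIs, one per line")
    parser.add_argument("-o", "--output-dir", default=".", help="directory for metadata files")
    parser.add_argument("-d", "--depositor", help="name of the depositor")
    parser.add_argument("-s", "--subject", help="default subject for the metadata")
    parser.add_argument("-m", "--contact-mail", help="contact email address")
    parser.add_argument("-u", "--upload", action="store_true", help="upload to Dataverse")
    parser.add_argument("-r", "--use-ror", action="store_true", help="use ROR identifiers")
    return parser
```

For example, `python doi2dataset.py -u -o out 10.1038/srep45389` would process one DOI, write metadata into `out/`, and upload the result.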

Documentation

Documentation is generated using Sphinx. See the docs/ directory for detailed API references and usage examples.

Testing

Tests are implemented with pytest and cover the tool's core functionality. To run the tests, execute:

pytest

Or using the Python module syntax:

python -m pytest

Code Coverage

The project includes code coverage analysis using pytest-cov. Current coverage is approximately 61% of the codebase, with key utilities and test infrastructure at 99-100% coverage.

To run tests with code coverage analysis:

pytest --cov=.

Generate a detailed HTML coverage report:

pytest --cov=. --cov-report=html

This creates an htmlcov directory. Open htmlcov/index.html in a browser to view the detailed coverage report.

A .coveragerc configuration file is provided that:

  • Excludes test files, documentation, and boilerplate code from coverage analysis
  • Configures reporting to ignore common non-testable lines (like defensive imports)
  • Sets the output directory for HTML reports

Recent improvements have increased coverage from 48% to 61% by adding focused tests for:

  • Citation building functionality
  • License processing and validation
  • Metadata field extraction
  • OpenAlex integration
  • Publication data parsing and validation

Areas that could benefit from additional testing:

  • More edge cases in the MetadataProcessor class workflow
  • Additional CitationBuilder scenarios with diverse inputs
  • Complex network interactions and error handling

Test Structure

The test suite is organized into six main files:

  1. test_doi2dataset.py: Basic tests for core functions like name splitting, DOI validation, and filename sanitization.
  2. test_fetch_doi_mock.py: Tests API interactions using a mock OpenAlex response stored in srep45389.json.
  3. test_citation_builder.py: Tests for building citation metadata from API responses.
  4. test_metadata_processor.py: Tests for the metadata processing workflow.
  5. test_license_processor.py: Tests for license processing and validation.
  6. test_publication_utils.py: Tests for publication year extraction and date handling.

Test Categories

The test suite covers the following categories of functionality:

Core Functionality Tests

  • DOI Validation and Processing: Parameterized tests for DOI normalization, validation, and filename sanitization with various inputs.
  • Name Processing: Extensive tests for parsing and splitting author names in different formats (with/without commas, middle initials, etc.).
  • Email Validation: Tests for proper validation of email addresses with various domain configurations.

API Integration Tests

  • Mock API Responses: Tests that use a saved OpenAlex API response (srep45389.json) to simulate API interactions without making actual network requests.
  • Data Fetching: Tests for retrieving and parsing data from the OpenAlex API.
  • Abstract Extraction: Tests for extracting and cleaning abstracts from OpenAlex's inverted index format, including handling of empty or malformed abstracts.
  • Subject Mapping: Tests for mapping OpenAlex topics to controlled vocabulary subject terms.

Metadata Processing Tests

  • Citation Building: Tests for properly building citation metadata from API responses.
  • License Processing: Tests for correctly identifying and formatting license information from various license IDs.
  • Principal Investigator Matching: Tests for finding project PIs based on ORCID identifiers (used for fallback corresponding author determination).
  • Configuration Loading: Tests for properly loading and validating configuration from files.
  • Metadata Workflow: Tests for the complete metadata processing workflow.

These tests ensure that all components work correctly in isolation and together as a system, with special attention to edge cases and error handling.

Changelog

Version 2.0 - Generalization Update

This version generalizes the tool to suit broader use cases:

Breaking Changes:

  • Removed organizational-specific metadata blocks (project phases, organizational fields)
  • Removed Phase class and phase-related configuration
  • Simplified configuration structure

What's New:

  • Streamlined metadata generation focusing on standard Dataverse citation metadata
  • Reduced configuration requirements for easier adoption
  • Maintained PI information support for corresponding author fallback functionality

Migration Guide:

  • Remove the phase section from your configuration file
  • The tool will now generate only standard citation metadata blocks
  • PI information is still supported and used for fallback corresponding author determination

Contributing

Contributions are welcome! Please fork the repository and submit a pull request with your improvements.

License

This project is licensed under the MIT License. See the LICENSE.md file for details.