doi2dataset


doi2dataset is a Python tool designed to process DOIs and generate metadata for Dataverse.org datasets. It retrieves metadata from external APIs (such as OpenAlex and CrossRef), maps metadata fields, and can optionally upload the generated metadata to a Dataverse.org instance.

Features

  • DOI Validation and Normalization: Validates DOIs and converts them into a standardized format (see the sketch after this list).
  • Metadata Retrieval: Fetches metadata such as title, abstract, license, and author information from external sources.
  • Metadata Mapping: Automatically maps and generates metadata fields (e.g., title, description, keywords) including support for controlled vocabularies and compound fields.
  • Optional Upload: Allows uploading of metadata directly to a Dataverse.org server.
  • Progress Tracking: Uses the Rich library for user-friendly progress tracking and error handling.
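
For illustration, DOI normalization usually amounts to stripping resolver prefixes and validating the remaining identifier against the DOI syntax. The following minimal sketch shows the general idea; the function name and exact rules are illustrative, not doi2dataset's actual implementation:

import re

# Registrant prefix "10.<4-9 digits>", a slash, then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def normalize_doi(raw: str) -> str:
    """Strip resolver prefixes and whitespace, returning a bare, lowercased DOI."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"Invalid DOI: {doi}")
    return doi.lower()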

Installation

Clone the repository:

git clone https://git.athemis.de/Athemis/doi2dataset
cd doi2dataset
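
Then install the runtime dependencies listed in requirements.txt:

pip install -r requirements.txt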

Configuration

Before running the tool, configure the necessary settings in the config.yaml file located in the project root (the bundled config_example.yaml can serve as a starting point). This file contains configuration details such as (an illustrative sketch follows the list):

  • Connection details (URL, API token, authentication credentials)
  • Mapping of project phases
  • Principal Investigator (PI) information
  • Default grant configurations
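
A minimal sketch of what such a file might look like (every field name here is illustrative; consult the bundled config_example.yaml for the authoritative structure):

dataverse:
  url: https://dataverse.example.org
  api_token: your-api-token

phases:
  phase1:
    start: 2020
    end: 2023

pis:
  - family_name: Doe
    given_name: Jane
    orcid: 0000-0000-0000-0000
    affiliation: Example University

default_grants:
  - funder: Example Funding Agency
    id: GRANT-12345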

Usage

Run doi2dataset from the command line by providing one or more DOIs:

python doi2dataset.py [options] DOI1 DOI2 ...

Command Line Options

  • -f, --file: Specify a file containing DOIs (one per line).
  • -o, --output-dir: Directory where metadata files will be saved.
  • -d, --depositor: Name of the depositor.
  • -s, --subject: Default subject for the metadata.
  • -m, --contact-mail: Contact email address.
  • -u, --upload: Upload metadata to a Dataverse.org server.
  • -r, --use-ror: Use Research Organization Registry (ROR) identifiers for institutions when available.
Documentation

Documentation is generated using Sphinx. See the docs/ directory for detailed API references and usage examples.
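
Assuming a standard Sphinx setup (the exact source and build paths depend on the docs/ configuration), the HTML documentation can typically be built with:

pip install -r requirements-doc.txt
sphinx-build -b html docs docs/_build/html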

Testing

Tests are implemented with pytest. The suite covers the core functionality described under Test Categories below.

Running Tests

To run the tests, execute:

pytest
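
If pytest is not already available, installing the development dependencies should provide the test tooling (requirements-dev.txt is assumed to include pytest):

pip install -r requirements-dev.txt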

Code Coverage

The project includes code coverage analysis using pytest-cov. Current coverage is approximately 53% of the codebase, with key utilities and test infrastructure at 99-100% coverage.

To run tests with code coverage analysis:

pytest --cov=doi2dataset

Generate a detailed HTML coverage report:

pytest --cov=doi2dataset --cov-report=html

This creates an htmlcov directory. Open htmlcov/index.html in a browser to view the detailed coverage report.

A .coveragerc configuration file is provided; as sketched after this list, it:

  • Excludes test files, documentation, and boilerplate code from coverage analysis
  • Configures reporting to ignore common non-testable lines (like defensive imports)
  • Sets the output directory for HTML reports
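
A sketch of a .coveragerc along these lines (the repository's actual file may differ in detail):

[run]
omit =
    tests/*
    docs/*
    setup.py

[report]
exclude_lines =
    pragma: no cover
    except ImportError

[html]
directory = htmlcov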

To increase coverage:

  1. Focus on adding tests for the MetadataProcessor class
  2. Add tests for the LicenseProcessor and SubjectMapper with more diverse inputs
  3. Create tests for the Configuration loading system

Test Categories

The test suite includes the following categories of tests:

Core Functionality Tests

  • DOI Validation and Processing: Tests for DOI normalization, validation, and filename sanitization (see the test sketch after this list).
  • Phase Management: Tests for checking publication year against defined project phases.
  • Name Processing: Tests for proper parsing and splitting of author names in different formats.
  • Email Validation: Tests for proper validation of email addresses.
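
For instance, a parametrized test of DOI handling might take the following shape, reusing the hypothetical normalize_doi sketch from the Features section (the suite's actual tests differ):

import pytest

# normalize_doi is the hypothetical helper sketched under Features.

@pytest.mark.parametrize("raw, expected", [
    ("https://doi.org/10.1038/srep45389", "10.1038/srep45389"),
    ("doi:10.1038/srep45389", "10.1038/srep45389"),
])
def test_normalize_doi(raw, expected):
    assert normalize_doi(raw) == expected

def test_invalid_doi_is_rejected():
    with pytest.raises(ValueError):
        normalize_doi("not-a-doi")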

API Integration Tests

  • Mock API Responses: Tests that use a saved OpenAlex API response (srep45389.json) to simulate API interactions without making actual network requests (see the fixture sketch after this list).
  • Data Fetching: Tests for retrieving and parsing data from the OpenAlex API.
  • Abstract Extraction: Tests for extracting and cleaning abstracts from OpenAlex's inverted index format.
  • Subject Mapping: Tests for mapping OpenAlex topics to controlled vocabulary subject terms.
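
The mocking pattern can be as simple as loading the saved response in a fixture; the fixture name and file path below are illustrative:

import json
import pytest

@pytest.fixture
def openalex_response():
    # Load the saved OpenAlex response instead of making a network request.
    with open("tests/srep45389.json", encoding="utf-8") as f:
        return json.load(f)

def test_response_has_title(openalex_response):
    # OpenAlex work records expose the publication title in "title".
    assert openalex_response.get("title")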

Metadata Processing Tests

  • Citation Building: Tests for properly building citation metadata from API responses.
  • License Processing: Tests for correctly identifying and formatting license information.
  • Principal Investigator Matching: Tests for finding project PIs based on ORCID identifiers.
  • Configuration Loading: Tests for properly loading and validating configuration from files.
  • Metadata Workflow: Tests for the complete metadata processing workflow.

Together, these tests check that the components work correctly both in isolation and as a complete system.

Contributing

Contributions are welcome! Please fork the repository and submit a pull request with your improvements.

License

This project is licensed under the MIT License. See the LICENSE.md file for details.