doi2dataset
doi2dataset is a Python tool designed to process DOIs and generate metadata for Dataverse.org datasets. It retrieves metadata from external APIs (such as OpenAlex and CrossRef), maps metadata fields, and can optionally upload the generated metadata to a Dataverse.org instance.
Features
- DOI Validation and Normalization: Validates DOIs and converts them into a standardized format.
- Metadata Retrieval: Fetches metadata such as title, abstract, license, and author information from external sources.
- Standard Dataverse Metadata: Generates standard Dataverse citation metadata, including:
  - Title, publication date, and alternative URL
  - Author information with affiliations and ORCID identifiers
  - Dataset contact information (corresponding authors)
  - Abstract and description
  - Keywords and subject classification
  - Grant/funding information
  - License information when available
- Optional Upload: Allows uploading metadata directly to a Dataverse.org server.
- Progress Tracking: Uses the Rich library for user-friendly progress tracking and error handling.
Installation
Requirements
- Python 3.12 or higher
Installation from Source
Clone the repository:
```bash
git clone https://git.uni-due.de/cbm343e/doi2dataset
cd doi2dataset
```
Quick Start
```bash
# Install the package in development mode
pip install -e .

# Run with a DOI
doi2dataset 10.1038/nature12373
```
Note: The package is not yet available on PyPI. Please use the Git installation method above.
Configuration
Before running the tool, configure the necessary settings in the `config.yaml` file located in the project root. This file contains configuration details such as:
- Connection details: URL, API token, and authentication credentials for the Dataverse server
- Principal Investigator (PI) information: Optional; used as a fallback to determine corresponding authors when none are specified in the publication
- Default grant configurations: Funding information to include in the metadata (supports multiple grants)
Configuration File Structure
The configuration file should follow this structure:
```yaml
# Dataverse server connection details
dataverse:
  url: "https://your-dataverse-instance.org"
  api_token: "your-api-token"

# Default grant information (supports multiple grants)
default_grants:
  - funder: "Your Funding Agency"
    id: "GRANT123456"
  - funder: "Another Funding Agency"
    id: "GRANT789012"

# Principal investigators for fallback corresponding author determination (optional)
pis:
  - family_name: "Doe"
    given_name: "John"
    orcid: "0000-0000-0000-0000"
    email: "john.doe@university.edu"
    affiliation: "Department of Science, University"
```
See `config_example.yaml` for a complete example configuration.
Note: The PI section is optional. If no corresponding authors are found in the publication metadata and no PIs are configured, the tool will still generate metadata but may issue a warning about missing corresponding author information.
Environment Variables
For security and deployment flexibility, you can override Dataverse configuration values using environment variables. This is particularly useful for sensitive credentials like API tokens and passwords.
The following environment variables are supported:
- `DATAVERSE_URL` - Dataverse server URL
- `DATAVERSE_API_TOKEN` - API token for authentication
- `DATAVERSE_DATAVERSE` - Dataverse alias/name
- `DATAVERSE_AUTH_USER` - Basic authentication username
- `DATAVERSE_AUTH_PASSWORD` - Basic authentication password
Environment variables take precedence over values in the configuration file. You can set some or all of these variables - any unset variables will fall back to the config file values.
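As a rough illustration of that precedence, here is a minimal Python sketch, assuming a hypothetical `resolve_dataverse_config` helper rather than the tool's actual loader:

```python
import os

import yaml  # pyyaml

# Maps config keys to the environment variables documented above.
ENV_MAP = {
    "url": "DATAVERSE_URL",
    "api_token": "DATAVERSE_API_TOKEN",
    "dataverse": "DATAVERSE_DATAVERSE",
    "auth_user": "DATAVERSE_AUTH_USER",
    "auth_password": "DATAVERSE_AUTH_PASSWORD",
}

def resolve_dataverse_config(path="config.yaml"):
    """Environment variables win; config file values are the fallback."""
    with open(path) as fh:
        file_cfg = (yaml.safe_load(fh) or {}).get("dataverse", {})
    # os.environ.get returns None for unset variables, so the
    # config file value survives in that case.
    return {key: os.environ.get(env, file_cfg.get(key))
            for key, env in ENV_MAP.items()}
```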
Example Usage
```bash
# Set environment variables
export DATAVERSE_API_TOKEN="your-secure-token"
export DATAVERSE_AUTH_PASSWORD="your-secure-password"

# Run doi2dataset - it will use environment variables for credentials
python doi2dataset.py 10.1234/example.doi

# Or set them inline for a single run
DATAVERSE_API_TOKEN="token" python doi2dataset.py 10.1234/example.doi
```
This approach allows you to:
- Keep sensitive credentials out of version control
- Use different configurations for different environments (dev, staging, production)
- Deploy the tool with secure, environment-based configuration
Usage
doi2dataset can be used in several ways after installation:
Method 1: Console Command
```bash
# After installation with pip install -e .
doi2dataset [options] DOI1 DOI2 ...
```
Method 2: Python Module
```bash
# Use the CLI module directly
python -m doi2dataset.cli [options] DOI1 DOI2 ...

# Or use the main module
python -m doi2dataset.main [options] DOI1 DOI2 ...
```
Method 3: Python Import
```python
from pathlib import Path

from doi2dataset import MetadataProcessor

processor = MetadataProcessor(
    doi="10.1038/nature12373",
    output_path=Path("metadata.json"),
    depositor="Your Name"
)
metadata = processor.process()
```
Command Line Options
All methods support the same command-line options:
- `-f, --file`: Specify a file containing DOIs (one per line).
- `-o, --output-dir`: Directory where metadata files will be saved.
- `-d, --depositor`: Name of the depositor.
- `-s, --subject`: Default subject for the metadata.
- `-m, --contact-mail`: Contact email address.
- `-u, --upload`: Upload metadata to a Dataverse.org server.
- `-r, --use-ror`: Use Research Organization Registry (ROR) identifiers for institutions when available.
Examples
```bash
# Process a single DOI
doi2dataset 10.1038/nature12373

# Process multiple DOIs
doi2dataset 10.1038/nature12373 10.1126/science.1234567

# Process DOIs from a file with a custom output directory
doi2dataset -f dois.txt -o ./output -d "Your Name"

# Upload to Dataverse with a contact email
doi2dataset -u -m your.email@university.edu 10.1038/nature12373

# Use ROR identifiers for institutions
doi2dataset -r 10.1038/nature12373
```
Documentation
Documentation is generated using Sphinx and is available online at: https://doi2dataset-66f763.gitpages.uni
See the docs/ directory for detailed API references and usage examples.
Building Documentation
The documentation supports multiple versions (branches and tags) and can be built locally or deployed automatically via GitLab CI/CD.
Prerequisites
Install documentation dependencies:
```bash
pip install -r requirements-doc.txt
```
Local Building
```bash
# Build a single version (current branch)
cd docs
make html

# Build all versions (multiversion)
cd docs
make multiversion
```
Multiversion Configuration
The multiversion setup automatically builds documentation for:
- Main development branches (`main`, `master`, `develop`)
- Version tags matching the pattern `v*.*.*`
Configuration can be customized in `docs/source/conf.py`:

- `smv_branch_whitelist`: Pattern for included branches
- `smv_tag_whitelist`: Pattern for included tags
- `smv_latest_version`: Default version to display
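For instance, a minimal `docs/source/conf.py` excerpt using these sphinx-multiversion options might look like the following; the regular-expression values are illustrative assumptions, not the project's actual settings:

```python
# docs/source/conf.py (excerpt) with illustrative values
extensions = ["sphinx_multiversion"]

smv_branch_whitelist = r"^(main|master|develop)$"  # branches to build
smv_tag_whitelist = r"^v\d+\.\d+\.\d+$"            # tags like v1.2.3
smv_latest_version = "main"                        # version shown by default
```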
Deployment
Documentation is automatically built and deployed via GitLab CI/CD:
- Triggered on pushes to main branches and version tags
- Deployed to GitLab Pages
- Accessible at your project's Pages URL
Git Commit Message Linting
This project uses gitlint to enforce consistent commit message formatting. Commit messages should follow the Conventional Commits specification.
The linting is integrated into the development workflow through:
- Pre-commit hooks: Automatically validate commit messages when you commit
- Manual linting: Available through standalone scripts for individual commits or ranges
- CI/CD integration: Can be used in continuous integration pipelines
Commit Message Format
Commit messages must follow this format:
```text
<type>(<scope>): <description>

[optional body]

[optional footer(s)]
```
Types:
- `feat`: A new feature
- `fix`: A bug fix
- `docs`: Documentation-only changes
- `style`: Changes that do not affect the meaning of the code
- `refactor`: A code change that neither fixes a bug nor adds a feature
- `test`: Adding missing tests or correcting existing tests
- `chore`: Changes to the build process or auxiliary tools
- `ci`: Changes to CI configuration files and scripts
- `build`: Changes that affect the build system or dependencies
- `perf`: A code change that improves performance
- `revert`: Reverts a previous commit
Examples:
```text
feat(api): add support for DOI batch processing
fix(metadata): handle missing author information gracefully
docs: update installation instructions
test(citation): add tests for license processing
```
Linting Commit Messages
To lint commit messages, use the provided script:
```bash
# Lint the last commit
python scripts/lint-commit.py

# Lint a specific commit
python scripts/lint-commit.py --hash <commit-hash>

# Lint a range of commits
python scripts/lint-commit.py --range HEAD~3..

# Install as a git hook (optional)
python scripts/lint-commit.py --install-hook
```
Automated Validation with Pre-commit
The project includes a pre-commit configuration that automatically validates commit messages:
```bash
# Install pre-commit hooks (recommended)
pre-commit install --hook-type commit-msg

# Or install all hooks, including code formatting
pre-commit install
```
This sets up automatic validation that runs every time you commit, ensuring all commit messages follow the required format.
Manual Git Hook Installation
Alternatively, you can install a standalone git hook:
```bash
python scripts/lint-commit.py --install-hook
```
This creates a simple commit-msg hook that runs gitlint directly.
Testing
Tests are implemented with pytest. The test suite provides comprehensive coverage of core functionalities. To run the tests, execute:
```bash
pytest
```
Or using the Python module syntax:
```bash
python -m pytest
```
Code Coverage
The project includes code coverage analysis using pytest-cov. Current coverage is approximately 61% of the codebase, with key utilities and test infrastructure at 99-100% coverage.
To run tests with code coverage analysis:
```bash
pytest --cov=.
```
Generate a detailed HTML coverage report:
```bash
pytest --cov=. --cov-report=html
```
This creates an `htmlcov` directory. Open `htmlcov/index.html` in a browser to view the detailed coverage report.
A `.coveragerc` configuration file is provided that:

- Excludes test files, documentation, and boilerplate code from coverage analysis
- Configures reporting to ignore common non-testable lines (such as defensive imports)
- Sets the output directory for HTML reports
Recent improvements have increased coverage from 48% to 61% by adding focused tests for:
- Citation building functionality
- License processing and validation
- Metadata field extraction
- OpenAlex integration
- Publication data parsing and validation
Areas that could benefit from additional testing:
- More edge cases in the `MetadataProcessor` class workflow
- Additional `CitationBuilder` scenarios with diverse inputs
- Complex network interactions and error handling
Test Structure
The test suite is organized into six main files:
- `test_doi2dataset.py`: Basic tests for core functions such as name splitting, DOI validation, and filename sanitization.
- `test_fetch_doi_mock.py`: Tests API interactions using a mock OpenAlex response stored in `srep45389.json`.
- `test_citation_builder.py`: Tests for building citation metadata from API responses.
- `test_metadata_processor.py`: Tests for the metadata processing workflow.
- `test_license_processor.py`: Tests for license processing and validation.
- `test_publication_utils.py`: Tests for publication year extraction and date handling.
Test Categories
The test suite covers the following categories of functionality:
Core Functionality Tests
- DOI Validation and Processing: Parameterized tests for DOI normalization, validation, and filename sanitization with various inputs.
- Name Processing: Extensive tests for parsing and splitting author names in different formats (with/without commas, middle initials, etc.).
- Email Validation: Tests for proper validation of email addresses with various domain configurations.
API Integration Tests
- Mock API Responses: Tests that use a saved OpenAlex API response (`srep45389.json`) to simulate API interactions without making actual network requests.
- Data Fetching: Tests for retrieving and parsing data from the OpenAlex API.
- Abstract Extraction: Tests for extracting and cleaning abstracts from OpenAlex's inverted index format, including handling of empty or malformed abstracts (see the sketch after this list).
- Subject Mapping: Tests for mapping OpenAlex topics to controlled vocabulary subject terms.
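For context, OpenAlex returns abstracts as an inverted index (each word mapped to its positions in the text) rather than as plain prose. A minimal sketch of how such an index can be flattened back into text, using a hypothetical `reconstruct_abstract` helper rather than the tool's actual implementation:

```python
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild plain text from an OpenAlex-style abstract_inverted_index."""
    if not inverted_index:
        return ""
    # Pair every occurrence position with its word, then sort by position.
    positions = [
        (pos, word)
        for word, occurrences in inverted_index.items()
        for pos in occurrences
    ]
    return " ".join(word for _, word in sorted(positions))

# Example:
# reconstruct_abstract({"Hello": [0], "world": [1]}) -> "Hello world"
```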
 
Metadata Processing Tests
- Citation Building: Tests for properly building citation metadata from API responses.
- License Processing: Tests for correctly identifying and formatting license information from various license IDs.
- Principal Investigator Matching: Tests for finding project PIs based on ORCID identifiers (used for fallback corresponding author determination).
- Configuration Loading: Tests for properly loading and validating configuration from files.
- Metadata Workflow: Tests for the complete metadata processing workflow.
These tests ensure that all components work correctly in isolation and together as a system, with special attention to edge cases and error handling.
Changelog
Version 2.0 - Generalization Update
This version has been updated to make the tool more generalized and suitable for broader use cases:
Breaking Changes:
- Removed organizational-specific metadata blocks (project phases, organizational fields)
- Removed the `Phase` class and phase-related configuration
- Simplified the configuration structure
What's New:
- Streamlined metadata generation focusing on standard Dataverse citation metadata
- Reduced configuration requirements for easier adoption
- Maintained PI information support for corresponding author fallback functionality
Migration Guide:
- Remove the `phase` section from your configuration file
- The tool now generates only standard citation metadata blocks
- PI information is still supported and used for fallback corresponding author determination
Contributing
Contributions are welcome! Please fork the repository and submit a pull request with your improvements.
Development Setup
1. Install development dependencies: `pip install -r requirements-dev.txt`
2. Install the package in development mode: `pip install -e .`
3. Set up the commit message template: `git config commit.template .gitmessage`
4. Install pre-commit hooks: `pre-commit install --hook-type pre-commit --hook-type commit-msg`
5. Run the tests: `pytest`
6. Run code quality checks: `pre-commit run --all-files`
Package Structure
The project follows a modular architecture:
```text
doi2dataset/
├── cli.py                    # Command-line interface
├── main.py                   # Main entry point
├── core/                     # Core components
│   ├── config.py            # Configuration management
│   ├── models.py            # Data models (Person, Institution, etc.)
│   └── metadata_fields.py   # Dataverse metadata field types
├── api/                      # External API integration
│   ├── client.py            # HTTP client for API requests
│   └── processors.py        # License and abstract processors
├── processing/               # Business logic
│   ├── citation.py          # Citation building
│   ├── metadata.py          # Metadata processing pipeline
│   └── utils.py             # Processing utilities
└── utils/                    # General utilities
    └── validation.py        # Validation functions
```
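Given this layout, components can be imported individually as well as through the package root. A short sketch; only `MetadataProcessor` is confirmed above, the other names are assumptions based on the tree:

```python
# Package-level import (shown in the usage section above)
from doi2dataset import MetadataProcessor

# Direct submodule imports; names beyond MetadataProcessor are assumptions
from doi2dataset.core.models import Person
from doi2dataset.processing.citation import CitationBuilder
```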
Code Quality
- Follow the existing code style and formatting
- Write tests for new functionality
- Ensure all tests pass before submitting
- Use meaningful commit messages following the Conventional Commits format
- Run `python scripts/lint-commit.py` to validate commit messages
License
This project is licensed under the MIT License. See the LICENSE.md file for details.
