feat!: generalize script by removing organizational metadata

Remove Phase class, organizational metadata blocks, and unused project fields. Update configuration
to use 'default_grants' and simplify PI usage to fallback corresponding author determination only.

BREAKING CHANGES:
- Remove 'phase' and 'project' fields from configuration
- Use 'default_grants' instead of 'default_grant'
- Generate only standard Dataverse citation metadata
Alexander Minges 2025-07-07 14:41:39 +02:00
parent 01bc537bd8
commit 67b46d5140
Signed by: Athemis
SSH key fingerprint: SHA256:TUXshgulbwL+FRYvBNo54pCsI0auROsSEgSvueKbkZ4
11 changed files with 207 additions and 269 deletions

@ -8,7 +8,14 @@
- **DOI Validation and Normalization:** Validates DOIs and converts them into a standardized format.
- **Metadata Retrieval:** Fetches metadata such as title, abstract, license, and author information from external sources.
- **Metadata Mapping:** Automatically maps and generates metadata fields (e.g., title, description, keywords) including support for controlled vocabularies and compound fields.
- **Standard Dataverse Metadata:** Generates standard Dataverse citation metadata including:
  - Title, publication date, and alternative URL
  - Author information with affiliations and ORCID identifiers
  - Dataset contact information (corresponding authors)
  - Abstract and description
  - Keywords and subject classification
  - Grant/funding information
  - License information when available
- **Optional Upload:** Allows uploading of metadata directly to a Dataverse.org server.
- **Progress Tracking:** Uses the Rich library for user-friendly progress tracking and error handling.
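As a rough illustration of the DOI validation and normalization feature, the sketch below strips common prefixes, lowercases the identifier, and checks it against a simple pattern. The helper names `normalize_doi` and `is_valid_doi`, the prefix list, and the regular expression are assumptions made for this example and may differ from what the tool actually does.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Assumed set of prefixes that users commonly paste along with a DOI.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip a known prefix and lowercase the DOI (hypothetical helper)."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip().lower()


def is_valid_doi(raw: str) -> bool:
    """Return True if the normalized input matches the assumed DOI pattern."""
    return bool(_DOI_PATTERN.match(normalize_doi(raw)))
```

Lowercasing is safe here because DOIs are case-insensitive, so two spellings of the same DOI normalize to one form.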
@ -23,14 +30,41 @@ cd doi2dataset
## Configuration
Before running the tool, configure the necessary settings in the `config.yaml` file located in the project root. This file contains configuration details such as:
- Connection details (URL, API token, authentication credentials)
- Mapping of project phases
- Principal Investigator (PI) information
- Default grant configurations
- **Connection details**: URL, API token, authentication credentials for Dataverse server
- **Principal Investigator (PI) information**: Optional; used as a fallback to determine corresponding authors when the publication does not specify them
- **Default grant configurations**: Funding information to be included in the metadata (supports multiple grants)
### Configuration File Structure
The configuration file should follow this structure:
```yaml
# Dataverse server connection details
dataverse:
  url: "https://your-dataverse-instance.org"
  api_token: "your-api-token"
# Default grant information (supports multiple grants)
default_grants:
  - funder: "Your Funding Agency"
    id: "GRANT123456"
  - funder: "Another Funding Agency"
    id: "GRANT789012"
# Principal investigators for fallback corresponding author determination (optional)
pis:
  - family_name: "Doe"
    given_name: "John"
    orcid: "0000-0000-0000-0000"
    email: "john.doe@university.edu"
    affiliation: "Department of Science, University"
```
See `config_example.yaml` for a complete example configuration.
**Note**: The PI section is optional. If no corresponding authors are found in the publication metadata and no PIs are configured, the tool will still generate metadata but may issue a warning about missing corresponding author information.
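The fallback described in the note can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function names and the `is_corresponding` author key are hypothetical. Matching PIs by ORCID is assumed here because the test suite description mentions it.

```python
from typing import Optional


def find_pi_by_orcid(orcid: Optional[str], pis: list[dict]) -> Optional[dict]:
    """Return the configured PI whose ORCID matches, or None (hypothetical helper)."""
    if not orcid:
        return None
    # Accept both bare ORCID iDs and full "https://orcid.org/..." URLs.
    bare = orcid.rsplit("/", 1)[-1]
    for pi in pis:
        if pi.get("orcid") == bare:
            return pi
    return None


def corresponding_authors(authors: list[dict], pis: list[dict]) -> list[dict]:
    """Prefer authors flagged in the publication; otherwise fall back to PIs."""
    # "is_corresponding" is an assumed key name for this sketch.
    explicit = [a for a in authors if a.get("is_corresponding")]
    if explicit:
        return explicit
    # Fallback: any author who matches a configured PI by ORCID.
    return [a for a in authors if find_pi_by_orcid(a.get("orcid"), pis)]
```

If the result is an empty list, the caller would still generate the metadata and could emit the warning mentioned above.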
## Usage
@ -102,11 +136,13 @@ pytest --cov=. --cov-report=html
This creates a `htmlcov` directory. Open `htmlcov/index.html` in a browser to view the detailed coverage report.
A `.coveragerc` configuration file is provided that:
- Excludes test files, documentation, and boilerplate code from coverage analysis
- Configures reporting to ignore common non-testable lines (like defensive imports)
- Sets the output directory for HTML reports
Recent improvements have increased coverage from 48% to 61% by adding focused tests for:
- Citation building functionality
- License processing and validation
- Metadata field extraction
@ -114,6 +150,7 @@ Recent improvements have increased coverage from 48% to 61% by adding focused te
- Publication data parsing and validation
Areas that could benefit from additional testing:
- More edge cases in the MetadataProcessor class workflow
- Additional CitationBuilder scenarios with diverse inputs
- Complex network interactions and error handling
@ -122,7 +159,7 @@ Areas that could benefit from additional testing:
The test suite is organized into six main files:
1. **test_doi2dataset.py**: Basic tests for core functions like phase checking, name splitting and DOI validation.
1. **test_doi2dataset.py**: Basic tests for core functions like name splitting, DOI validation, and filename sanitization.
2. **test_fetch_doi_mock.py**: Tests API interactions using a mock OpenAlex response stored in `srep45389.json`.
3. **test_citation_builder.py**: Tests for building citation metadata from API responses.
4. **test_metadata_processor.py**: Tests for the metadata processing workflow.
@ -136,7 +173,6 @@ The test suite covers the following categories of functionality:
#### Core Functionality Tests
- **DOI Validation and Processing**: Parameterized tests for DOI normalization, validation, and filename sanitization with various inputs.
- **Phase Management**: Tests for checking publication year against defined project phases, including boundary cases.
- **Name Processing**: Extensive tests for parsing and splitting author names in different formats (with/without commas, middle initials, etc.).
- **Email Validation**: Tests for proper validation of email addresses with various domain configurations.
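To make the name-processing category concrete, here is a small sketch of a name splitter together with table-driven checks of the kind described above. `split_name` and its rules (split at the comma if present, otherwise treat the last word as the family name) are assumptions for illustration. The project's own function may handle more formats.

```python
def split_name(full_name: str) -> tuple[str, str]:
    """Return (given_name, family_name) for a display name (hypothetical helper)."""
    # Collapse repeated whitespace so "John   Doe" and "John Doe" behave the same.
    name = " ".join(full_name.split())
    if "," in name:
        # "Family, Given" format.
        family, given = (part.strip() for part in name.split(",", 1))
        return given, family
    parts = name.rsplit(" ", 1)
    if len(parts) == 1:
        # Single-word names are treated as a family name only.
        return "", parts[0]
    return parts[0], parts[1]


# Table-driven cases: (input, expected (given, family)).
CASES = [
    ("Doe, John", ("John", "Doe")),
    ("John Doe", ("John", "Doe")),
    ("John A. Doe", ("John A.", "Doe")),
    ("Doe, John A.", ("John A.", "Doe")),
    ("Doe", ("", "Doe")),
]

for raw, expected in CASES:
    assert split_name(raw) == expected, raw
```

The same table maps directly onto a parameterized test, with one test case per row.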
@ -151,12 +187,36 @@ The test suite covers the following categories of functionality:
- **Citation Building**: Tests for properly building citation metadata from API responses.
- **License Processing**: Tests for correctly identifying and formatting license information from various license IDs.
- **Principal Investigator Matching**: Tests for finding project PIs based on ORCID identifiers.
- **Principal Investigator Matching**: Tests for finding project PIs based on ORCID identifiers (used for fallback corresponding author determination).
- **Configuration Loading**: Tests for properly loading and validating configuration from files.
- **Metadata Workflow**: Tests for the complete metadata processing workflow.
These tests ensure that all components work correctly in isolation and together as a system, with special attention to edge cases and error handling.
## Changelog
### Version 0.2.0 - Generalization Update
This version generalizes the tool so that it is suitable for a broader range of use cases:
**Breaking Changes:**
- Removed organization-specific metadata blocks (project phases, organizational fields)
- Removed `Phase` class and phase-related configuration
- Simplified configuration structure
**What's New:**
- Streamlined metadata generation focusing on standard Dataverse citation metadata
- Reduced configuration requirements for easier adoption
- Maintained PI information support for corresponding author fallback functionality
**Migration Guide:**
- Remove the `phase` and `project` fields from your configuration file
- Replace `default_grant` with `default_grants`, which takes a list of grants
- The tool will now generate only standard citation metadata blocks
- PI information is still supported and used for fallback corresponding author determination
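The migration can also be expressed as a small transformation on an already-parsed configuration dictionary. This is a sketch under stated assumptions: `migrate_config` is a hypothetical helper, `phase` and `project` are assumed to be top-level keys, and the legacy `default_grant` value is assumed to be either a single mapping or a list of mappings.

```python
import copy


def migrate_config(old: dict) -> dict:
    """Return a copy of a pre-0.2.0 config adapted to the new layout (hypothetical)."""
    new = copy.deepcopy(old)  # leave the caller's dictionary untouched

    # Organization-specific sections are no longer used (assumed top-level keys).
    new.pop("phase", None)
    new.pop("project", None)

    # 'default_grant' becomes 'default_grants', which holds a list of grants.
    if "default_grant" in new:
        legacy = new.pop("default_grant")
        grants = legacy if isinstance(legacy, list) else [legacy]
        new.setdefault("default_grants", []).extend(grants)

    return new
```

The `pis` section needs no changes, so it is copied through as-is.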
## Contributing
Contributions are welcome! Please fork the repository and submit a pull request with your improvements.