feat!: generalize script by removing organizational metadata

Remove Phase class, organizational metadata blocks, and unused project fields. Update configuration to use 'default_grants' and simplify PI usage to fallback corresponding author determination only.

BREAKING CHANGES:
- Remove 'phase' and 'project' fields from configuration
- Use 'default_grants' instead of 'default_grant'
- Generate only standard Dataverse citation metadata
2025-07-07 14:41:39 +02:00
commit 67b46d5140
parent 01bc537bd8
11 changed files with 207 additions and 269 deletions
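
Migration note: the breaking changes listed in the commit message can be applied to an existing configuration mechanically. The following is a minimal sketch, not part of this commit. `migrate_config` is a hypothetical helper, and it assumes that `phase` and `project` are top-level keys and that the legacy `default_grant` value was either a single grant mapping or a list of them; neither assumption is confirmed by the diff below.

```python
from copy import deepcopy

# Keys dropped by this commit, per the BREAKING CHANGES note.
# Assumption: both live at the top level of the config mapping.
REMOVED_KEYS = ("phase", "project")


def migrate_config(old: dict) -> dict:
    """Return a copy of an old-style config dict in the new layout.

    Hypothetical helper for illustration; the input is left unmodified.
    """
    new = deepcopy(old)
    for key in REMOVED_KEYS:
        new.pop(key, None)

    if "default_grant" in new:
        legacy = new.pop("default_grant")
        # Assumption: the legacy value is one grant mapping or a list of them.
        grants = legacy if isinstance(legacy, list) else [legacy]
        # Keep an existing `default_grants` list if one is already present.
        new.setdefault("default_grants", grants)
    return new
```

The dict would typically come from loading the old `config.yaml` with a YAML parser and be written back the same way. If `project` was in fact nested (for example inside each PI entry), the removal loop would need to descend into those entries instead.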
--- a/README.md
+++ b/README.md
@@ -8,7 +8,14 @@

 - **DOI Validation and Normalization:** Validates DOIs and converts them into a standardized format.
 - **Metadata Retrieval:** Fetches metadata such as title, abstract, license, and author information from external sources.
-- **Metadata Mapping:** Automatically maps and generates metadata fields (e.g., title, description, keywords) including support for controlled vocabularies and compound fields.
+- **Standard Dataverse Metadata:** Generates standard Dataverse citation metadata including:
+  - Title, publication date, and alternative URL
+  - Author information with affiliations and ORCID identifiers
+  - Dataset contact information (corresponding authors)
+  - Abstract and description
+  - Keywords and subject classification
+  - Grant/funding information
+  - License information when available
 - **Optional Upload:** Allows uploading of metadata directly to a Dataverse.org server.
 - **Progress Tracking:** Uses the Rich library for user-friendly progress tracking and error handling.

@@ -23,14 +30,41 @@ cd doi2dataset

 ## Configuration

-Configuration
-
 Before running the tool, configure the necessary settings in the `config.yaml` file located in the project root. This file contains configuration details such as:

-- Connection details (URL, API token, authentication credentials)
-- Mapping of project phases
-- Principal Investigator (PI) information
-- Default grant configurations
+- **Connection details**: URL, API token, authentication credentials for Dataverse server
+- **Principal Investigator (PI) information**: Optional - used for fallback determination of corresponding authors when not explicitly specified in the publication
+- **Default grant configurations**: Funding information to be included in the metadata (supports multiple grants)
+
+### Configuration File Structure
+
+The configuration file should follow this structure:
+
+```yaml
+# Dataverse server connection details
+dataverse:
+  url: "https://your-dataverse-instance.org"
+  api_token: "your-api-token"
+
+# Default grant information (supports multiple grants)
+default_grants:
+  - funder: "Your Funding Agency"
+    id: "GRANT123456"
+  - funder: "Another Funding Agency"
+    id: "GRANT789012"
+
+# Principal investigators for fallback corresponding author determination (optional)
+pis:
+  - family_name: "Doe"
+    given_name: "John"
+    orcid: "0000-0000-0000-0000"
+    email: "john.doe@university.edu"
+    affiliation: "Department of Science, University"
+```
+
+See `config_example.yaml` for a complete example configuration.
+
+**Note**: The PI section is optional. If no corresponding authors are found in the publication metadata and no PIs are configured, the tool will still generate metadata but may issue a warning about missing corresponding author information.

 ## Usage

@@ -102,11 +136,13 @@ pytest --cov=. --cov-report=html
 This creates a `htmlcov` directory. Open `htmlcov/index.html` in a browser to view the detailed coverage report.

 A `.coveragerc` configuration file is provided that:
+
 - Excludes test files, documentation, and boilerplate code from coverage analysis
 - Configures reporting to ignore common non-testable lines (like defensive imports)
 - Sets the output directory for HTML reports

 Recent improvements have increased coverage from 48% to 61% by adding focused tests for:
+
 - Citation building functionality
 - License processing and validation
 - Metadata field extraction
@@ -114,6 +150,7 @@ Recent improvements have increased coverage from 48% to 61% by adding focused te
 - Publication data parsing and validation

 Areas that could benefit from additional testing:
+
 - More edge cases in the MetadataProcessor class workflow
 - Additional CitationBuilder scenarios with diverse inputs
 - Complex network interactions and error handling
@@ -122,7 +159,7 @@ Areas that could benefit from additional testing:

 The test suite is organized into six main files:

-1. **test_doi2dataset.py**: Basic tests for core functions like phase checking, name splitting and DOI validation.
+1. **test_doi2dataset.py**: Basic tests for core functions like name splitting, DOI validation, and filename sanitization.
 2. **test_fetch_doi_mock.py**: Tests API interactions using a mock OpenAlex response stored in `srep45389.json`.
 3. **test_citation_builder.py**: Tests for building citation metadata from API responses.
 4. **test_metadata_processor.py**: Tests for the metadata processing workflow.
@@ -136,7 +173,6 @@ The test suite covers the following categories of functionality:
 #### Core Functionality Tests

 - **DOI Validation and Processing**: Parameterized tests for DOI normalization, validation, and filename sanitization with various inputs.
-- **Phase Management**: Tests for checking publication year against defined project phases, including boundary cases.
 - **Name Processing**: Extensive tests for parsing and splitting author names in different formats (with/without commas, middle initials, etc.).
 - **Email Validation**: Tests for proper validation of email addresses with various domain configurations.

@@ -151,12 +187,36 @@ The test suite covers the following categories of functionality:

 - **Citation Building**: Tests for properly building citation metadata from API responses.
 - **License Processing**: Tests for correctly identifying and formatting license information from various license IDs.
-- **Principal Investigator Matching**: Tests for finding project PIs based on ORCID identifiers.
+- **Principal Investigator Matching**: Tests for finding project PIs based on ORCID identifiers (used for fallback corresponding author determination).
 - **Configuration Loading**: Tests for properly loading and validating configuration from files.
 - **Metadata Workflow**: Tests for the complete metadata processing workflow.

 These tests ensure that all components work correctly in isolation and together as a system, with special attention to edge cases and error handling.

+## Changelog
+
+### Version 0.2.0 - Generalization Update
+
+This version has been updated to make the tool more generalized and suitable for broader use cases:
+
+**Breaking Changes:**
+
+- Removed organizational-specific metadata blocks (project phases, organizational fields)
+- Removed `Phase` class and phase-related configuration
+- Simplified configuration structure
+
+**What's New:**
+
+- Streamlined metadata generation focusing on standard Dataverse citation metadata
+- Reduced configuration requirements for easier adoption
+- Maintained PI information support for corresponding author fallback functionality
+
+**Migration Guide:**
+
+- Remove the `phase` section from your configuration file
+- The tool will now generate only standard citation metadata blocks
+- PI information is still supported and used for fallback corresponding author determination
+
 ## Contributing

 Contributions are welcome! Please fork the repository and submit a pull request with your improvements.
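
Reviewer note: the README now describes PIs as a fallback for corresponding author determination, matched by ORCID, with a warning when nothing is found. The sketch below illustrates that described behaviour only; it is not the tool's code. The `Person` dataclass, `normalize_orcid`, and `corresponding_authors` are hypothetical names, and whether the real tool returns the matching author or the configured PI record is not visible in this diff.

```python
import warnings
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical stand-in for an author or a configured PI."""
    family_name: str
    given_name: str
    orcid: str = ""
    email: str = ""
    is_corresponding: bool = False


def normalize_orcid(orcid: str) -> str:
    """Strip a URL prefix so 'https://orcid.org/<id>' compares equal to '<id>'."""
    return orcid.strip().rsplit("/", 1)[-1]


def corresponding_authors(authors: list[Person], pis: list[Person]) -> list[Person]:
    """Prefer authors flagged in the publication; otherwise fall back to PIs."""
    explicit = [a for a in authors if a.is_corresponding]
    if explicit:
        return explicit

    # Fallback: authors whose ORCID matches a configured PI.
    pi_orcids = {normalize_orcid(p.orcid) for p in pis if p.orcid}
    fallback = [
        a for a in authors
        if a.orcid and normalize_orcid(a.orcid) in pi_orcids
    ]
    if not fallback:
        # Metadata can still be generated; only the contact block is affected.
        warnings.warn("No corresponding author found in publication or PI list")
    return fallback
```

With an empty `pis` list and no flagged authors, the function returns an empty list and emits a warning, which mirrors the README note that metadata is still generated in that case.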