feat!: generalize script by removing organizational metadata

Remove Phase class, organizational metadata blocks, and unused project fields. Update configuration to use 'default_grants' and simplify PI usage to fallback corresponding author determination only. BREAKING CHANGES: - Remove 'phase' and 'project' fields from configuration - Use 'default_grants' instead of 'default_grant' - Generate only standard Dataverse citation metadata
2025-07-07 14:41:39 +02:00 · 2025-07-07 14:41:39 +02:00 · 67b46d5140
commit 67b46d5140
parent 01bc537bd8
11 changed files with 207 additions and 269 deletions
--- a/.gitignore
+++ b/.gitignore
@ -1,5 +1,6 @@
 # Config file
 config.yaml
 .config.yaml
 # Processed DOIs
 *.json
--- a/README.md
+++ b/README.md
@ -8,7 +8,14 @@
 - **DOI Validation and Normalization:** Validates DOIs and converts them into a standardized format.
 - **Metadata Retrieval:** Fetches metadata such as title, abstract, license, and author information from external sources.
- **Metadata Mapping:** Automatically maps and generates metadata fields (e.g., title, description, keywords) including support for controlled vocabularies and compound fields.
+- **Standard Dataverse Metadata:** Generates standard Dataverse citation metadata including:
  - Title, publication date, and alternative URL
  - Author information with affiliations and ORCID identifiers
  - Dataset contact information (corresponding authors)
  - Abstract and description
  - Keywords and subject classification
  - Grant/funding information
  - License information when available
 - **Optional Upload:** Allows uploading of metadata directly to a Dataverse.org server.
 - **Progress Tracking:** Uses the Rich library for user-friendly progress tracking and error handling.
@ -23,14 +30,41 @@ cd doi2dataset
 ## Configuration
 Configuration
 Before running the tool, configure the necessary settings in the `config.yaml` file located in the project root. This file contains configuration details such as:
- Connection details (URL, API token, authentication credentials)
+- **Connection details**: URL, API token, authentication credentials for Dataverse server
- Mapping of project phases
+- **Principal Investigator (PI) information**: Optional - used for fallback determination of corresponding authors when not explicitly specified in the publication
- Principal Investigator (PI) information
+- **Default grant configurations**: Funding information to be included in the metadata (supports multiple grants)
- Default grant configurations
+
 ### Configuration File Structure
 The configuration file should follow this structure:
 ```yaml
 # Dataverse server connection details
 dataverse:
  url: "https://your-dataverse-instance.org"
  api_token: "your-api-token"
 # Default grant information (supports multiple grants)
 default_grants:
  - funder: "Your Funding Agency"
    id: "GRANT123456"
  - funder: "Another Funding Agency"
    id: "GRANT789012"
 # Principal investigators for fallback corresponding author determination (optional)
 pis:
  - family_name: "Doe"
    given_name: "John"
    orcid: "0000-0000-0000-0000"
    email: "john.doe@university.edu"
    affiliation: "Department of Science, University"
 ```
 See `config_example.yaml` for a complete example configuration.
 **Note**: The PI section is optional. If no corresponding authors are found in the publication metadata and no PIs are configured, the tool will still generate metadata but may issue a warning about missing corresponding author information.
 ## Usage
@ -102,11 +136,13 @@ pytest --cov=. --cov-report=html
 This creates a `htmlcov` directory. Open `htmlcov/index.html` in a browser to view the detailed coverage report.
 A `.coveragerc` configuration file is provided that:
 - Excludes test files, documentation, and boilerplate code from coverage analysis
 - Configures reporting to ignore common non-testable lines (like defensive imports)
 - Sets the output directory for HTML reports
 Recent improvements have increased coverage from 48% to 61% by adding focused tests for:
 - Citation building functionality
 - License processing and validation
 - Metadata field extraction
@ -114,6 +150,7 @@ Recent improvements have increased coverage from 48% to 61% by adding focused te
 - Publication data parsing and validation
 Areas that could benefit from additional testing:
 - More edge cases in the MetadataProcessor class workflow
 - Additional CitationBuilder scenarios with diverse inputs
 - Complex network interactions and error handling
@ -122,7 +159,7 @@ Areas that could benefit from additional testing:
 The test suite is organized into six main files:
-1. **test_doi2dataset.py**: Basic tests for core functions like phase checking, name splitting and DOI validation.
+1. **test_doi2dataset.py**: Basic tests for core functions like name splitting, DOI validation, and filename sanitization.
 2. **test_fetch_doi_mock.py**: Tests API interactions using a mock OpenAlex response stored in `srep45389.json`.
 3. **test_citation_builder.py**: Tests for building citation metadata from API responses.
 4. **test_metadata_processor.py**: Tests for the metadata processing workflow.
@ -136,7 +173,6 @@ The test suite covers the following categories of functionality:
 #### Core Functionality Tests
 - **DOI Validation and Processing**: Parameterized tests for DOI normalization, validation, and filename sanitization with various inputs.
 - **Phase Management**: Tests for checking publication year against defined project phases, including boundary cases.
 - **Name Processing**: Extensive tests for parsing and splitting author names in different formats (with/without commas, middle initials, etc.).
 - **Email Validation**: Tests for proper validation of email addresses with various domain configurations.
@ -151,12 +187,36 @@ The test suite covers the following categories of functionality:
 - **Citation Building**: Tests for properly building citation metadata from API responses.
 - **License Processing**: Tests for correctly identifying and formatting license information from various license IDs.
- **Principal Investigator Matching**: Tests for finding project PIs based on ORCID identifiers.
+- **Principal Investigator Matching**: Tests for finding project PIs based on ORCID identifiers (used for fallback corresponding author determination).
 - **Configuration Loading**: Tests for properly loading and validating configuration from files.
 - **Metadata Workflow**: Tests for the complete metadata processing workflow.
 These tests ensure that all components work correctly in isolation and together as a system, with special attention to edge cases and error handling.
 ## Changelog
 ### Version 0.2.0 - Generalization Update
 This version has been updated to make the tool more generalized and suitable for broader use cases:
 **Breaking Changes:**
 - Removed organizational-specific metadata blocks (project phases, organizational fields)
 - Removed `Phase` class and phase-related configuration
 - Simplified configuration structure
 **What's New:**
 - Streamlined metadata generation focusing on standard Dataverse citation metadata
 - Reduced configuration requirements for easier adoption
 - Maintained PI information support for corresponding author fallback functionality
 **Migration Guide:**
 - Remove the `phase` section from your configuration file
 - The tool will now generate only standard citation metadata blocks
 - PI information is still supported and used for fallback corresponding author determination
 ## Contributing
 Contributions are welcome! Please fork the repository and submit a pull request with your improvements.
--- a/init.py
+++ b/init.py
@ -8,9 +8,8 @@ from .doi2dataset import (
    LicenseProcessor,
    MetadataProcessor,
    NameProcessor,
    PIFinder,
    Person,
-    Phase,
+    PIFinder,
    SubjectMapper,
    sanitize_filename,
    validate_email_address,
--- a/config_example.yaml
+++ b/config_example.yaml
@ -1,23 +1,25 @@
-default_grant:
+dataverse:
  url: "https://your-dataverse-instance.org"
  api_token: "your-api-token-here"
  dataverse: "your-dataverse-alias"
  auth_user: "your-username"
  auth_password: "your-password"
 default_grants:
  - funder: "Awesome Funding Agency"
    id: "ABC12345"
-
+  - funder: "Another Funding Agency"
-phase:
+    id: "DEF67890"
  "Phase 1 (2021/2025)":
    start: 2021
    end: 2025
 pis:
  - family_name: "Doe"
    given_name: "Jon"
    orcid: "0000-0000-0000-0000"
-    email: "jon.doe@some-university.edu"
+    email: "jon.doe@iana.org"
    affiliation: "Institute of Science, Some University"
    project: ["Project A01"]
  - family_name: "Doe"
    given_name: "Jane"
    orcid: "0000-0000-0000-0001"
-    email: "jane.doe@some-university.edu"
+    email: "jane.doe@iana.org"
    affiliation: "Institute of Science, Some University"
    project: ["Project A02"]
--- a/doi2dataset.py
+++ b/doi2dataset.py
@ -109,36 +109,6 @@ class FieldType(Enum):
    COMPOUND = "compound"
    VOCABULARY = "controlledVocabulary"
@dataclass
 class Phase:
    """
    Represents a project phase with a defined time span.
    Attributes:
        name (str): The name of the project phase.
        start (int): The start year of the project phase.
        end (int): The end year of the project phase.
    """
    name: str
    start: int
    end: int
    def check_year(self, year: int) -> bool:
        """
        Checks whether a given year falls within the project's phase boundaries.
        Args:
            year (int): The year to check.
        Returns:
            bool: True if the year is within the phase boundaries, otherwise False.
        """
        if self.start <= year <= self.end:
            return True
        return False
@dataclass
 class BaseMetadataField[T]:
    """
@ -301,7 +271,7 @@ class Institution:
                "termName": self.display_name,
                "@type": "https://schema.org/Organization"
            }
-            return PrimitiveMetadataField("authorAffiliation", False, self.ror, expanded_value)
+            return PrimitiveMetadataField("authorAffiliation", False, self.ror, expanded_value=expanded_value)
        else:
            return PrimitiveMetadataField("authorAffiliation", False, self.display_name)
@ -316,14 +286,12 @@ class Person:
        orcid (str): ORCID identifier (optional).
        email (str): Email address (optional).
        affiliation (Institution): Affiliation of the person (optional).
        project (list[str]): List of associated projects.
    """
    family_name: str
    given_name: str
    orcid: str = ""
    email: str = ""
    affiliation: Institution | str = ""
    project: list[str] = field(default_factory=list)
    def to_dict(self) -> dict[str, str | list[str] | dict[str, str]]:
            """
@ -340,8 +308,7 @@ class Person:
                "family_name": self.family_name,
                "given_name": self.given_name,
                "orcid": self.orcid,
-                "email": self.email,
+                "email": self.email
                "project": self.project
            }
            if isinstance(self.affiliation, Institution):
@ -464,12 +431,10 @@ class ConfigData:
    Attributes:
        dataverse (dict[str, str]): Dataverse-related configuration.
        phase (dict[str, dict[str, int]]): Mapping of project phases.
        pis (list[dict[str, Any]]): List of principal investigator configurations.
        default_grants (list[dict[str, str]]): Default grant configurations.
    """
    dataverse: dict[str, str]
    phase: dict[str, dict[str, int]]
    pis: list[dict[str, Any]]
    default_grants: list[dict[str, str]]
@ -523,7 +488,6 @@ class Config:
        cls._config_data = ConfigData(
            dataverse=config_data.get('dataverse', {}),
            phase=config_data.get('phase', {}),
            pis=config_data.get('pis', []),
            default_grants=config_data.get('default_grants', [])
        )
@ -545,16 +509,6 @@ class Config:
            raise RuntimeError("Failed to load configuration")
        return cls._config_data
    @property
    def PHASE(self) -> dict[str, dict[str, int]]:
        """
        Get phase configuration.
        Returns:
            dict[str, dict[str, int]]: Mapping of phases.
        """
        return self.get_config().phase
    @property
    def PIS(self) -> list[dict[str, Any]]:
        """
@ -833,7 +787,10 @@ class AbstractProcessor:
            else:
                console.print(f"\n{ICONS['warning']} No abstract found in CrossRef!", style="warning")
        else:
            if license.name:
                console.print(f"\n{ICONS['info']} License {license.name} does not allow derivative works. Reconstructing abstract from OpenAlex!", style="info")
            else:
                console.print(f"\n{ICONS['info']} Custom license does not allow derivative works. Reconstructing abstract from OpenAlex!", style="info")
        openalex_abstract = self._get_openalex_abstract(data)
@ -1406,8 +1363,7 @@ class MetadataProcessor:
                            CompoundMetadataField("grantNumber", True, grants).to_dict()
                        ],
                        "displayName": "Citation Metadata"
-                    },
+                    }
                    "crc1430_org_v1": self._build_organization_metadata(data)
                },
                "files": []
            }
@ -1473,71 +1429,22 @@ class MetadataProcessor:
        """
        return data.get("publication_year", "")
    def _build_organization_metadata(self, data: dict[str, Any]) -> dict[str, Any]:
        """
        Build organization metadata fields (phase, project, PI names).
        Args:
            data (dict[str, Any]): The metadata.
        Returns:
            dict[str, Any]: Organization metadata.
        """
        publication_year = self._get_publication_year(data)
        if publication_year:
            phases = self._get_phases(int(publication_year))
        else:
            phases = []
        pis = self._get_involved_pis(data)
        projects: list[str] = []
        for pi in pis:
            for project in pi.project:
                projects.append(project)
        pi_names: list[str] = []
        for pi in pis:
            pi_names.append(pi.format_name())
        # Deduplicate projects and PI names
        unique_projects = list(set(projects))
        unique_pi_names = list(set(pi_names))
        return {
            "fields": [
                ControlledVocabularyMetadataField("crc1430OrgV1Phase", True, phases).to_dict(),
                ControlledVocabularyMetadataField("crc1430OrgV1Project", True, unique_projects).to_dict(),
                ControlledVocabularyMetadataField("crc1430OrgV1PI", True, unique_pi_names).to_dict()
            ]
        }
    def _get_phases(self, year: int) -> list[str]:
        """
        Determine the project phases matching a given publication year.
        Args:
            year (int): The publication year.
        Returns:
            list[str]: List of matching phase names.
        """
        config = Config()
        matching_phases: list[str] = []
        for phase_name, phase_info in config.PHASE.items():
            phase = Phase(phase_name, phase_info["start"], phase_info["end"])
            if phase.check_year(year):
                matching_phases.append(phase.name)
        return matching_phases
    def _get_involved_pis(self, data: dict[str, Any]) -> list[Person]:
        """
-        Identify involved principal investigators from the metadata.
+        Identify involved principal investigators from the metadata for use as fallback
        corresponding authors.
        This method matches authors in the publication metadata against the configured
        PIs and returns matching PIs. It is used as a fallback when no corresponding
        authors are explicitly declared in the publication metadata.
        Args:
-            data (dict[str, Any]): The metadata.
+            data (dict[str, Any]): The metadata from OpenAlex.
        Returns:
-            list[Person]: List of PIs.
+            list[Person]: List of matching PIs for use as corresponding authors.
        """
        involved_pis: list[Person] = []
        for authorship in data.get("authorships", []):
--- a/tests/config_test.yaml
+++ b/tests/config_test.yaml
@ -1,23 +1,16 @@
-default_grant:
+default_grants:
  - funder: "Awesome Funding Agency"
    id: "ABC12345"
 phase:
  "Phase 1 (2021/2025)":
    start: 2021
    end: 2025
 pis:
  - family_name: "Doe"
    given_name: "Jon"
    orcid: "0000-0000-0000-0000"
    email: "jon.doe@iana.org"
    affiliation: "Institute of Science, Some University"
    project: ["Project A01"]
  - family_name: "Doe"
    given_name: "Jane"
    orcid: "0000-0000-0000-0001"
    email: "jane.doe@iana.org"
    affiliation: "Institute of Science, Some University"
    project: ["Project A02"]
--- a/tests/test_citation_builder.py
+++ b/tests/test_citation_builder.py
@ -1,13 +1,9 @@
 import json
 import os
 import pytest
 from unittest.mock import MagicMock
-from doi2dataset import (
+import pytest
-    CitationBuilder,
+
-    PIFinder,
+from doi2dataset import CitationBuilder, Person, PIFinder
    Person
 )
@pytest.fixture
@ -27,8 +23,7 @@ def test_pi():
        given_name="Author",
        orcid="0000-0000-0000-1234",
        email="test.author@example.org",
-        affiliation="Test University",
+        affiliation="Test University"
        project=["Test Project"]
    )
--- a/tests/test_doi2dataset.py
+++ b/tests/test_doi2dataset.py
@ -3,21 +3,9 @@ import sys
 sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
-from doi2dataset import NameProcessor, Phase, sanitize_filename, validate_email_address
+from doi2dataset import NameProcessor, sanitize_filename, validate_email_address
 def test_phase_check_year():
    """Test that check_year correctly determines if a year is within the phase boundaries."""
    phase = Phase("TestPhase", 2000, 2010)
    # Within boundaries
    assert phase.check_year(2005) is True
    # Outside boundaries
    assert phase.check_year(1999) is False
    assert phase.check_year(2011) is False
    # Boundary cases
    assert phase.check_year(2000) is True
    assert phase.check_year(2010) is True
 def test_sanitize_filename():
    """Test the sanitize_filename function to convert DOI to a valid filename."""
    doi = "10.1234/abc.def"
--- a/tests/test_fetch_doi_mock.py
+++ b/tests/test_fetch_doi_mock.py
@ -8,12 +8,11 @@ from doi2dataset import (
    APIClient,
    CitationBuilder,
    Config,
    License,
    LicenseProcessor,
    MetadataProcessor,
    Person,
    PIFinder,
-    SubjectMapper
+    SubjectMapper,
 )
@ -158,8 +157,7 @@ def test_pi_finder_find_by_orcid():
        given_name="Jon",
        orcid="0000-0000-0000-0000",
        email="jon.doe@iana.org",
-        affiliation="Institute of Science, Some University",
+        affiliation="Institute of Science, Some University"
        project=["Project A01"]
    )
    # Create PIFinder with our test PI
--- a/tests/test_metadata_processor.py
+++ b/tests/test_metadata_processor.py
@ -1,7 +1,8 @@
 import json
 import os
 from unittest.mock import MagicMock
 import pytest
 from unittest.mock import MagicMock, patch
 from doi2dataset import MetadataProcessor
@ -40,7 +41,6 @@ def test_build_metadata_basic_fields(metadata_processor, openalex_data, monkeypa
    # Mock methods that might cause issues in isolation
    metadata_processor._build_description = MagicMock(return_value="Test description")
    metadata_processor._get_involved_pis = MagicMock(return_value=[])
    metadata_processor._build_organization_metadata = MagicMock(return_value={})
    # Call the method we're testing
    metadata = metadata_processor._build_metadata(openalex_data)
@ -81,7 +81,6 @@ def test_build_metadata_authors(metadata_processor, openalex_data, monkeypatch):
    # Mock methods that might cause issues in isolation
    metadata_processor._build_description = MagicMock(return_value="Test description")
    metadata_processor._get_involved_pis = MagicMock(return_value=[])
    metadata_processor._build_organization_metadata = MagicMock(return_value={})
    # Call the method we're testing
    metadata = metadata_processor._build_metadata(openalex_data)
@ -130,7 +129,6 @@ def test_build_metadata_keywords_and_topics(metadata_processor, openalex_data, m
    # Mock methods that might cause issues in isolation
    metadata_processor._build_description = MagicMock(return_value="Test description")
    metadata_processor._get_involved_pis = MagicMock(return_value=[])
    metadata_processor._build_organization_metadata = MagicMock(return_value={})
    # Call the method we're testing
    metadata = metadata_processor._build_metadata(openalex_data)
--- a/tests/test_person.py
+++ b/tests/test_person.py
@ -1,5 +1,5 @@
-import pytest
+from doi2dataset import Institution, Person
-from doi2dataset import Person, Institution
+
 def test_person_to_dict_with_string_affiliation():
    """Test Person.to_dict() with a string affiliation."""
@ -8,8 +8,7 @@ def test_person_to_dict_with_string_affiliation():
        given_name="John",
        orcid="0000-0001-2345-6789",
        email="john.doe@example.org",
-        affiliation="Test University",
+        affiliation="Test University"
        project=["Project A"]
    )
    result = person.to_dict()
@ -18,7 +17,6 @@ def test_person_to_dict_with_string_affiliation():
    assert result["given_name"] == "John"
    assert result["orcid"] == "0000-0001-2345-6789"
    assert result["email"] == "john.doe@example.org"
    assert result["project"] == ["Project A"]
    assert result["affiliation"] == "Test University"
@ -31,8 +29,7 @@ def test_person_to_dict_with_institution_ror():
        given_name="John",
        orcid="0000-0001-2345-6789",
        email="john.doe@example.org",
-        affiliation=inst,
+        affiliation=inst
        project=["Project A"]
    )
    result = person.to_dict()