feat: add pre-commit setup with gitlint

This commit is contained in:
Alexander Minges 2025-07-14 09:39:07 +02:00
parent b4e9943b7c
commit 9d270ec601
Signed by: Athemis
SSH key fingerprint: SHA256:TUXshgulbwL+FRYvBNo54pCsI0auROsSEgSvueKbkZ4
17 changed files with 1197 additions and 360 deletions

54
.pre-commit-config.yaml Normal file
View file

@ -0,0 +1,54 @@
# Pre-commit configuration for doi2dataset
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
# Built-in pre-commit hooks
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.6.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
- id: check-added-large-files
- id: check-merge-conflict
- id: check-json
- id: check-toml
- id: mixed-line-ending
args: ['--fix=lf']
# Python code formatting and linting
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.6.9
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
- id: ruff-format
# Git commit message linting with gitlint
- repo: https://github.com/jorisroovers/gitlint
rev: v0.19.1
hooks:
- id: gitlint
stages: [commit-msg]
# Optional: Check for common security issues
- repo: https://github.com/PyCQA/bandit
rev: 1.7.10
hooks:
- id: bandit
args: ["-c", "pyproject.toml"]
additional_dependencies: ["bandit[toml]"]
# Configuration for specific hooks
ci:
autofix_commit_msg: |
[pre-commit.ci] auto fixes from pre-commit hooks
for more information, see https://pre-commit.ci
autofix_prs: true
autoupdate_branch: ''
autoupdate_commit_msg: '[pre-commit.ci] pre-commit autoupdate'
autoupdate_schedule: weekly
skip: []
submodules: false

View file

@ -152,6 +152,73 @@ Documentation is automatically built and deployed via GitLab CI/CD:
- Deployed to GitLab Pages
- Accessible at your project's Pages URL
## Git Commit Message Linting
This project uses [gitlint](https://jorisroovers.github.io/gitlint/) to enforce consistent commit message formatting. Commit messages should follow the [Conventional Commits](https://www.conventionalcommits.org/) specification.
### Commit Message Format
Commit messages must follow this format:
```
<type>(<scope>): <description>
[optional body]
[optional footer(s)]
```
**Types:**
- `feat`: A new feature
- `fix`: A bug fix
- `docs`: Documentation only changes
- `style`: Changes that do not affect the meaning of the code
- `refactor`: A code change that neither fixes a bug nor adds a feature
- `test`: Adding missing tests or correcting existing tests
- `chore`: Changes to the build process or auxiliary tools
- `ci`: Changes to CI configuration files and scripts
- `build`: Changes that affect the build system or dependencies
- `perf`: A code change that improves performance
- `revert`: Reverts a previous commit
**Examples:**
```
feat(api): add support for DOI batch processing
fix(metadata): handle missing author information gracefully
docs: update installation instructions
test(citation): add tests for license processing
```
### Linting Commit Messages
To lint commit messages, use the provided script:
```bash
# Lint the last commit
python scripts/lint-commit.py
# Lint a specific commit
python scripts/lint-commit.py --hash <commit-hash>
# Lint a range of commits
python scripts/lint-commit.py --range HEAD~3..
# Install as a git hook (optional)
python scripts/lint-commit.py --install-hook
```
### Git Hook Installation
You can optionally install a git hook that automatically checks commit messages:
```bash
python scripts/lint-commit.py --install-hook
```
This will create a `commit-msg` hook that runs automatically when you commit, ensuring all commit messages follow the required format.
## Testing
Tests are implemented with pytest. The test suite provides comprehensive coverage of core functionalities. To run the tests, execute:
@ -270,6 +337,33 @@ This version has been updated to make the tool more generalized and suitable for
Contributions are welcome! Please fork the repository and submit a pull request with your improvements.
### Development Setup
1. Install development dependencies:
```bash
pip install -r requirements-dev.txt
```
2. Run tests to ensure everything works:
```bash
pytest
```
3. Install the git commit message hook (recommended):
```bash
python scripts/lint-commit.py --install-hook
```
### Code Quality
- Follow the existing code style and formatting
- Write tests for new functionality
- Ensure all tests pass before submitting
- Use meaningful commit messages following the conventional commits format
- Run `python scripts/lint-commit.py` to validate commit messages
## License
This project is licensed under the MIT License. See the [LICENSE.md](LICENSE.md) file for details.

View file

@ -1,47 +1,47 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
if "%1" == "multiversion" goto multiversion
if "%1" == "multiversion-clean" goto multiversion-clean
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:multiversion
sphinx-multiversion %SOURCEDIR% %BUILDDIR%\multiversion\html %SPHINXOPTS% %O%
goto end
:multiversion-clean
rmdir /s /q %BUILDDIR%\html 2>nul
sphinx-multiversion %SOURCEDIR% %BUILDDIR%\multiversion\html %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
if "%1" == "multiversion" goto multiversion
if "%1" == "multiversion-clean" goto multiversion-clean
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:multiversion
sphinx-multiversion %SOURCEDIR% %BUILDDIR%\multiversion\html %SPHINXOPTS% %O%
goto end
:multiversion-clean
rmdir /s /q %BUILDDIR%\html 2>nul
sphinx-multiversion %SOURCEDIR% %BUILDDIR%\multiversion\html %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd

View file

@ -0,0 +1,229 @@
Git Commit Message Linting
===========================
This project uses `gitlint <https://jorisroovers.github.io/gitlint/>`_ to enforce consistent commit message formatting. All commit messages must follow the `Conventional Commits <https://www.conventionalcommits.org/>`_ specification to ensure clear and standardized project history.
Why Commit Message Standards Matter
-----------------------------------
Standardized commit messages provide several benefits:
* **Improved readability**: Clear, consistent format makes it easier to understand changes
* **Automated changelog generation**: Tools can parse conventional commits to generate changelogs
* **Better collaboration**: Team members can quickly understand the nature of changes
* **Easier debugging**: Well-formatted commits help identify when bugs were introduced
* **Semantic versioning**: Conventional commits can trigger automated version bumps
Commit Message Format
---------------------
All commit messages must follow this format:
.. code-block:: text
<type>(<scope>): <description>
[optional body]
[optional footer(s)]
Components
~~~~~~~~~~
**Type (required)**
The type of change being made. Must be one of:
* ``feat``: A new feature
* ``fix``: A bug fix
* ``docs``: Documentation only changes
* ``style``: Changes that do not affect the meaning of the code (white-space, formatting, etc.)
* ``refactor``: A code change that neither fixes a bug nor adds a feature
* ``test``: Adding missing tests or correcting existing tests
* ``chore``: Changes to the build process or auxiliary tools and libraries
* ``ci``: Changes to CI configuration files and scripts
* ``build``: Changes that affect the build system or external dependencies
* ``perf``: A code change that improves performance
* ``revert``: Reverts a previous commit
**Scope (optional)**
The scope of the change, enclosed in parentheses. Common scopes for this project:
* ``api``: Changes to API functionality
* ``metadata``: Changes to metadata processing
* ``citation``: Changes to citation building
* ``config``: Changes to configuration handling
* ``tests``: Changes to test files
* ``docs``: Changes to documentation
* ``deps``: Changes to dependencies
**Description (required)**
A short description of the change:
* Use the imperative, present tense: "change" not "changed" nor "changes"
* Don't capitalize the first letter
* No period (.) at the end
* Maximum 50 characters
**Body (optional)**
A longer description of the change:
* Use the imperative, present tense
* Wrap at 72 characters
* Explain what and why vs. how
**Footer (optional)**
One or more footers may be provided:
* ``BREAKING CHANGE:`` description of breaking changes
* ``Closes #123``: reference to closed issues
* ``Co-authored-by: Name <email@example.com>``: additional authors
Examples
--------
**Simple feature addition:**
.. code-block:: text
feat(api): add support for DOI batch processing
**Bug fix with scope:**
.. code-block:: text
fix(metadata): handle missing author information gracefully
**Documentation update:**
.. code-block:: text
docs: update installation instructions
**Breaking change:**
.. code-block:: text
feat(api): change metadata output format
BREAKING CHANGE: The metadata output format has changed from JSON
to YAML. Users need to update their parsing code accordingly.
**Multi-line with body:**
.. code-block:: text
refactor(citation): improve author name parsing
The author name parsing logic has been refactored to handle
more edge cases, including names with multiple middle initials
and international characters.
Closes #45
Configuration
-------------
The project uses a ``.gitlint`` configuration file that enforces:
* Maximum title length of 50 characters
* Conventional commit format validation
* Maximum body line length of 72 characters
* Exclusion of certain words like "WIP", "TODO", "FIXME" in titles
* Automatic ignoring of merge commits and dependency updates
Linting Tools
-------------
Manual Linting
~~~~~~~~~~~~~~~
Use the provided script to lint commit messages:
.. code-block:: bash
# Lint the last commit
python scripts/lint-commit.py
# Lint a specific commit by hash
python scripts/lint-commit.py --hash <commit-hash>
# Lint a range of commits
python scripts/lint-commit.py --range HEAD~3..
# Check staged commit message
python scripts/lint-commit.py --staged
Git Hook Installation
~~~~~~~~~~~~~~~~~~~~~
Install an automated git hook to check commit messages:
.. code-block:: bash
python scripts/lint-commit.py --install-hook
This creates a ``commit-msg`` hook that automatically validates commit messages when you commit. The commit will be rejected if the message doesn't meet the requirements.
Direct Gitlint Usage
~~~~~~~~~~~~~~~~~~~~
You can also use gitlint directly:
.. code-block:: bash
# Lint last commit
gitlint
# Lint specific commit
gitlint --commit <commit-hash>
# Lint commit range
gitlint --commits HEAD~3..
Common Validation Errors
-------------------------
**Title too long**
Keep titles under 50 characters. If you need more space, use the body.
**Invalid type**
Use only the allowed types: ``feat``, ``fix``, ``docs``, ``style``, ``refactor``, ``test``, ``chore``, ``ci``, ``build``, ``perf``, ``revert``.
**Missing colon**
Don't forget the colon after the type/scope: ``feat(api): add feature``
**Capitalized description**
Don't capitalize the first letter of the description: ``feat: add feature`` not ``feat: Add feature``
**Trailing period**
Don't add a period at the end of the title: ``feat: add feature`` not ``feat: add feature.``
**Body line too long**
Keep body lines under 72 characters. Break long lines appropriately.
Troubleshooting
---------------
**Gitlint not found**
Install development dependencies:
.. code-block:: bash
pip install -r requirements-dev.txt
**Hook not working**
Ensure the hook is executable:
.. code-block:: bash
chmod +x .git/hooks/commit-msg
**Existing commits don't follow format**
The linting only applies to new commits. Existing commits can be left as-is or rebased if necessary.
Integration with CI/CD
----------------------
The commit message linting can be integrated into CI/CD pipelines to ensure all commits in pull requests follow the standard format. This helps maintain consistency across all contributors.
For more information on gitlint configuration and advanced usage, see the `official gitlint documentation <https://jorisroovers.github.io/gitlint/>`_.

View file

@ -115,20 +115,47 @@ Development Setup
pip install -r requirements-dev.txt
4. Make your changes
5. Run tests to ensure everything works
6. Submit a pull request
4. Install the git commit message hook (recommended):
.. code-block:: bash
python scripts/lint-commit.py --install-hook
5. Make your changes
6. Run tests to ensure everything works
7. Validate your commit messages follow the standards
8. Submit a pull request
Code Style
----------
Please follow the existing code style and conventions used in the project. Make sure to:
- Write clear, descriptive commit messages
- Write clear, descriptive commit messages following the :doc:`commit-messages` standards
- Add tests for new functionality
- Update documentation as needed
- Follow Python best practices
Commit Message Standards
~~~~~~~~~~~~~~~~~~~~~~~~
All commit messages must follow the Conventional Commits specification. See the :doc:`commit-messages` documentation for detailed information on:
- Required message format
- Available commit types
- Examples of proper commit messages
- How to use the linting tools
To validate your commit messages:
.. code-block:: bash
# Lint the last commit
python scripts/lint-commit.py
# Install automatic validation hook
python scripts/lint-commit.py --install-hook
Submitting Changes
------------------
@ -136,6 +163,7 @@ Submitting Changes
2. Make your changes with appropriate tests
3. Ensure all tests pass
4. Update documentation if needed
5. Submit a pull request with a clear description of your changes
5. Ensure all commit messages follow the conventional commits format
6. Submit a pull request with a clear description of your changes
Thank you for contributing to **doi2dataset**!

View file

@ -39,4 +39,5 @@ Key Features:
usage
modules
contributing
commit-messages
faq

File diff suppressed because it is too large Load diff

View file

@ -50,6 +50,7 @@ dev = [
"pytest-mock>=3.14.0,<4.0",
"pytest-cov>=6.0.0,<7.0",
"ruff>=0.11.1,<0.20",
"gitlint>=0.19.1,<0.20",
]
test = [
"pytest>=8.3.5,<9.0",
@ -132,3 +133,7 @@ ignore = [
[tool.ruff.lint.per-file-ignores]
"tests/*" = ["E501"]
[tool.bandit]
exclude_dirs = ["tests", "docs", ".venv", "build", "dist"]
skips = ["B101", "B601", "B404", "B603"]

View file

@ -2,3 +2,4 @@ pytest>=8.3.5,<9.0
pytest-mock>=3.14.0,<4.0
pytest-cov>=6.0.0,<7.0
ruff>=0.11.1,<0.20
gitlint>=0.19.1,<0.20

179
scripts/lint-commit.py Normal file
View file

@ -0,0 +1,179 @@
#!/usr/bin/env python3
"""
Simple script to lint git commit messages using gitlint.
This script can be used to:
1. Lint the last commit message
2. Lint a specific commit by hash
3. Lint commit messages in a range
4. Be used as a pre-commit hook
Usage:
python scripts/lint-commit.py # Lint last commit
python scripts/lint-commit.py --hash <hash> # Lint specific commit
python scripts/lint-commit.py --range <range> # Lint commit range
python scripts/lint-commit.py --staged # Lint staged commit message
This implementation enforces conventional commit message format.
"""
import argparse
import subprocess
import sys
from pathlib import Path
def run_command(cmd, check=True):
"""Run a shell command and return the result."""
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=check)
return result
except subprocess.CalledProcessError as e:
print(f"Error running command: {cmd}")
print(f"Exit code: {e.returncode}")
print(f"Output: {e.stdout}")
print(f"Error: {e.stderr}")
return e
def check_gitlint_installed():
"""Check if gitlint is installed."""
result = run_command(["which", "gitlint"], check=False)
if result.returncode != 0:
print("Error: gitlint is not installed.")
print("Please install it with: pip install gitlint")
print("Or install dev dependencies: pip install -r requirements-dev.txt")
sys.exit(1)
def lint_commit(commit_hash=None, commit_range=None, staged=False):
"""Lint commit message(s) using gitlint."""
# Build gitlint command
cmd = ["gitlint"]
if staged:
# Lint staged commit message
cmd.extend(["--staged"])
elif commit_range:
# Lint commit range
cmd.extend(["--commits", commit_range])
elif commit_hash:
# Lint specific commit
cmd.extend(["--commit", commit_hash])
else:
# Lint last commit (default)
cmd.extend(["--commit", "HEAD"])
print(f"Running: {' '.join(cmd)}")
print("-" * 50)
# Run gitlint
result = run_command(cmd, check=False)
if result.returncode == 0:
print("✅ All commit messages are valid!")
return True
else:
print("❌ Commit message validation failed:")
print(result.stdout)
if result.stderr:
print("Error output:")
print(result.stderr)
return False
def main():
"""Main entry point."""
parser = argparse.ArgumentParser(
description="Lint git commit messages using gitlint",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s # Lint last commit
%(prog)s --hash abc123 # Lint specific commit
%(prog)s --range HEAD~3.. # Lint last 3 commits
%(prog)s --staged # Lint staged commit message
""",
)
parser.add_argument("--hash", help="Specific commit hash to lint")
parser.add_argument("--range", help="Commit range to lint (e.g., HEAD~3..)")
parser.add_argument(
"--staged", action="store_true", help="Lint staged commit message"
)
parser.add_argument(
"--install-hook", action="store_true", help="Install as git commit-msg hook"
)
args = parser.parse_args()
# Check if gitlint is installed
check_gitlint_installed()
# Install hook if requested
if args.install_hook:
install_hook()
return
# Validate arguments
exclusive_args = [args.hash, args.range, args.staged]
if sum(bool(arg) for arg in exclusive_args) > 1:
print("Error: --hash, --range, and --staged are mutually exclusive")
sys.exit(1)
# Lint commits
success = lint_commit(
commit_hash=args.hash, commit_range=args.range, staged=args.staged
)
sys.exit(0 if success else 1)
def install_hook():
"""Install the script as a git commit-msg hook."""
git_dir = Path(".git")
if not git_dir.exists():
print("Error: Not in a git repository")
sys.exit(1)
hooks_dir = git_dir / "hooks"
hooks_dir.mkdir(exist_ok=True)
hook_file = hooks_dir / "commit-msg"
hook_content = """#!/usr/bin/env python3
# Git commit-msg hook for gitlint
# Python-based commit message linting with gitlint
import subprocess
import sys
# Run gitlint on the commit message
result = subprocess.run( # nosec B603
["gitlint", "--msg-filename", sys.argv[1]],
capture_output=True,
text=True
)
if result.returncode != 0:
print("Commit message validation failed:")
print(result.stdout)
if result.stderr:
print("Error output:")
print(result.stderr)
sys.exit(1)
print("✅ Commit message is valid!")
"""
hook_file.write_text(hook_content)
hook_file.chmod(0o755)
print(f"✅ Installed commit-msg hook at {hook_file}")
print("The hook will automatically run when you commit.")
if __name__ == "__main__":
main()

View file

@ -23,7 +23,7 @@ def test_pi():
given_name="Author",
orcid="0000-0000-0000-1234",
email="test.author@example.org",
affiliation="Test University"
affiliation="Test University",
)
@ -115,7 +115,9 @@ def test_build_authors_with_ror(openalex_data, pi_finder):
pytest.skip("Test data doesn't contain any ROR identifiers")
# Create builder with ror=True to enable ROR identifiers
builder = CitationBuilder(data=openalex_data, doi=doi, pi_finder=pi_finder, ror=True)
builder = CitationBuilder(
data=openalex_data, doi=doi, pi_finder=pi_finder, ror=True
)
# Get authors
authors, _ = builder.build_authors()
@ -129,11 +131,11 @@ def test_build_authors_with_ror(openalex_data, pi_finder):
for author in authors:
# Check if author has affiliation
if not hasattr(author, 'affiliation') or not author.affiliation:
if not hasattr(author, "affiliation") or not author.affiliation:
continue
# Check if affiliation is an Institution with a ROR ID
if not hasattr(author.affiliation, 'ror'):
if not hasattr(author.affiliation, "ror"):
continue
# Check if ROR ID is present and contains "ror.org"
@ -154,7 +156,7 @@ def test_build_authors_with_ror(openalex_data, pi_finder):
assert affiliation_field.value == institution_with_ror.ror
# Verify the expanded_value dictionary has the expected structure
assert hasattr(affiliation_field, 'expanded_value')
assert hasattr(affiliation_field, "expanded_value")
assert isinstance(affiliation_field.expanded_value, dict)
# Check specific fields in the expanded_value

View file

@ -13,6 +13,7 @@ def test_sanitize_filename():
result = sanitize_filename(doi)
assert result == expected
def test_split_name_with_comma():
"""Test splitting a full name that contains a comma."""
full_name = "Doe, John"
@ -20,6 +21,7 @@ def test_split_name_with_comma():
assert given == "John"
assert family == "Doe"
def test_split_name_without_comma():
"""Test splitting a full name that does not contain a comma."""
full_name = "John Doe"
@ -27,11 +29,13 @@ def test_split_name_without_comma():
assert given == "John"
assert family == "Doe"
def test_validate_email_address_valid():
"""Test that a valid email address is correctly recognized."""
valid_email = "john.doe@iana.org"
assert validate_email_address(valid_email) is True
def test_validate_email_address_invalid():
"""Test that an invalid email address is correctly rejected."""
invalid_email = "john.doe@invalid_domain"

View file

@ -20,6 +20,7 @@ class FakeResponse:
"""
A fake response object to simulate an API response.
"""
def __init__(self, json_data, status_code=200):
self._json = json_data
self.status_code = status_code
@ -30,6 +31,7 @@ class FakeResponse:
def raise_for_status(self):
pass
@pytest.fixture(autouse=True)
def load_config_test():
"""
@ -39,6 +41,7 @@ def load_config_test():
config_path = os.path.join(os.path.dirname(__file__), "config_test.yaml")
Config.load_config(config_path=config_path)
@pytest.fixture
def fake_openalex_response():
"""
@ -50,6 +53,7 @@ def fake_openalex_response():
data = json.load(f)
return data
def test_fetch_doi_data_with_file(mocker, fake_openalex_response):
"""
Test fetching DOI metadata by simulating the API call with a locally saved JSON response.
@ -88,7 +92,7 @@ def test_openalex_abstract_extraction(mocker, fake_openalex_response):
assert abstract_text is not None
# If abstract exists in the response, it should be properly extracted
if 'abstract_inverted_index' in fake_openalex_response:
if "abstract_inverted_index" in fake_openalex_response:
assert len(abstract_text) > 0
@ -152,7 +156,7 @@ def test_pi_finder_find_by_orcid():
given_name="Jon",
orcid="0000-0000-0000-0000",
email="jon.doe@iana.org",
affiliation="Institute of Science, Some University"
affiliation="Institute of Science, Some University",
)
# Create PIFinder with our test PI
@ -181,8 +185,10 @@ def test_metadata_processor_fetch_data(mocker, fake_openalex_response):
doi = "10.1038/srep45389"
# Mock API response
mocker.patch("doi2dataset.APIClient.make_request",
return_value=FakeResponse(fake_openalex_response, 200))
mocker.patch(
"doi2dataset.APIClient.make_request",
return_value=FakeResponse(fake_openalex_response, 200),
)
# Create processor with upload disabled and progress disabled
processor = MetadataProcessor(doi=doi, upload=False, progress=False)

View file

@ -3,37 +3,27 @@ from doi2dataset import License, LicenseProcessor
def test_license_processor_cc_by():
"""Test processing a CC BY license"""
data = {
"primary_location": {
"license": "cc-by"
}
}
data = {"primary_location": {"license": "cc-by"}}
license_obj = LicenseProcessor.process_license(data)
assert isinstance(license_obj, License)
assert license_obj.short == "cc-by"
assert license_obj.name == "CC BY 4.0"
assert license_obj.uri == "https://creativecommons.org/licenses/by/4.0/"
def test_license_processor_cc0():
"""Test processing a CC0 license"""
data = {
"primary_location": {
"license": "cc0"
}
}
data = {"primary_location": {"license": "cc0"}}
license_obj = LicenseProcessor.process_license(data)
assert isinstance(license_obj, License)
assert license_obj.short == "cc0"
assert license_obj.name == "CC0 1.0"
assert license_obj.uri == "https://creativecommons.org/publicdomain/zero/1.0/"
def test_license_processor_unknown_license():
"""Test processing an unknown license"""
data = {
"primary_location": {
"license": "unknown-license"
}
}
data = {"primary_location": {"license": "unknown-license"}}
license_obj = LicenseProcessor.process_license(data)
assert isinstance(license_obj, License)
assert license_obj.short == "unknown-license"
@ -41,17 +31,17 @@ def test_license_processor_unknown_license():
assert license_obj.name == "unknown-license" or license_obj.name == ""
assert hasattr(license_obj, "uri")
def test_license_processor_no_license():
"""Test processing with no license information"""
data = {
"primary_location": {}
}
data = {"primary_location": {}}
license_obj = LicenseProcessor.process_license(data)
assert isinstance(license_obj, License)
assert license_obj.short == "unknown"
assert license_obj.name == ""
assert license_obj.uri == ""
def test_license_processor_no_primary_location():
"""Test processing with no primary location"""
data = {}

View file

@ -33,7 +33,10 @@ def test_build_metadata_basic_fields(metadata_processor, openalex_data, monkeypa
abstract_mock = MagicMock()
abstract_mock.text = "This is a sample abstract"
abstract_mock.source = "openalex"
monkeypatch.setattr("doi2dataset.AbstractProcessor.get_abstract", lambda *args, **kwargs: abstract_mock)
monkeypatch.setattr(
"doi2dataset.AbstractProcessor.get_abstract",
lambda *args, **kwargs: abstract_mock,
)
# Mock the _fetch_data method to return our test data
metadata_processor._fetch_data = MagicMock(return_value=openalex_data)
@ -47,21 +50,23 @@ def test_build_metadata_basic_fields(metadata_processor, openalex_data, monkeypa
# Verify the basic metadata fields were extracted correctly
assert metadata is not None
assert 'datasetVersion' in metadata
assert "datasetVersion" in metadata
# Examine the fields inside datasetVersion.metadataBlocks
assert 'metadataBlocks' in metadata['datasetVersion']
citation = metadata['datasetVersion']['metadataBlocks'].get('citation', {})
assert "metadataBlocks" in metadata["datasetVersion"]
citation = metadata["datasetVersion"]["metadataBlocks"].get("citation", {})
# Check fields in citation section
assert 'fields' in citation
fields = citation['fields']
assert "fields" in citation
fields = citation["fields"]
# Check for basic metadata fields in a more flexible way
field_names = [field.get('typeName') for field in fields]
assert 'title' in field_names
assert 'subject' in field_names
assert 'dsDescription' in field_names # Description is named 'dsDescription' in the schema
field_names = [field.get("typeName") for field in fields]
assert "title" in field_names
assert "subject" in field_names
assert (
"dsDescription" in field_names
) # Description is named 'dsDescription' in the schema
def test_build_metadata_authors(metadata_processor, openalex_data, monkeypatch):
@ -73,7 +78,10 @@ def test_build_metadata_authors(metadata_processor, openalex_data, monkeypatch):
abstract_mock = MagicMock()
abstract_mock.text = "This is a sample abstract"
abstract_mock.source = "openalex"
monkeypatch.setattr("doi2dataset.AbstractProcessor.get_abstract", lambda *args, **kwargs: abstract_mock)
monkeypatch.setattr(
"doi2dataset.AbstractProcessor.get_abstract",
lambda *args, **kwargs: abstract_mock,
)
# Mock the _fetch_data method to return our test data
metadata_processor._fetch_data = MagicMock(return_value=openalex_data)
@ -86,33 +94,35 @@ def test_build_metadata_authors(metadata_processor, openalex_data, monkeypatch):
metadata = metadata_processor._build_metadata(openalex_data)
# Examine the fields inside datasetVersion.metadataBlocks
assert 'metadataBlocks' in metadata['datasetVersion']
citation = metadata['datasetVersion']['metadataBlocks'].get('citation', {})
assert "metadataBlocks" in metadata["datasetVersion"]
citation = metadata["datasetVersion"]["metadataBlocks"].get("citation", {})
# Check fields in citation section
assert 'fields' in citation
fields = citation['fields']
assert "fields" in citation
fields = citation["fields"]
# Check for author and datasetContact fields
field_names = [field.get('typeName') for field in fields]
assert 'author' in field_names
assert 'datasetContact' in field_names
field_names = [field.get("typeName") for field in fields]
assert "author" in field_names
assert "datasetContact" in field_names
# Verify these are compound fields with actual entries
for field in fields:
if field.get('typeName') == 'author':
assert 'value' in field
assert isinstance(field['value'], list)
assert len(field['value']) > 0
if field.get("typeName") == "author":
assert "value" in field
assert isinstance(field["value"], list)
assert len(field["value"]) > 0
if field.get('typeName') == 'datasetContact':
assert 'value' in field
assert isinstance(field['value'], list)
if field.get("typeName") == "datasetContact":
assert "value" in field
assert isinstance(field["value"], list)
# The datasetContact might be empty in test environment
# Just check it exists rather than asserting length
def test_build_metadata_keywords_and_topics(metadata_processor, openalex_data, monkeypatch):
def test_build_metadata_keywords_and_topics(
metadata_processor, openalex_data, monkeypatch
):
"""Test that _build_metadata correctly extracts keywords and topics"""
# Mock the console to avoid print errors
metadata_processor.console = MagicMock()
@ -121,7 +131,10 @@ def test_build_metadata_keywords_and_topics(metadata_processor, openalex_data, m
abstract_mock = MagicMock()
abstract_mock.text = "This is a sample abstract"
abstract_mock.source = "openalex"
monkeypatch.setattr("doi2dataset.AbstractProcessor.get_abstract", lambda *args, **kwargs: abstract_mock)
monkeypatch.setattr(
"doi2dataset.AbstractProcessor.get_abstract",
lambda *args, **kwargs: abstract_mock,
)
# Mock the _fetch_data method to return our test data
metadata_processor._fetch_data = MagicMock(return_value=openalex_data)
@ -134,27 +147,27 @@ def test_build_metadata_keywords_and_topics(metadata_processor, openalex_data, m
metadata = metadata_processor._build_metadata(openalex_data)
# Examine the fields inside datasetVersion.metadataBlocks
assert 'metadataBlocks' in metadata['datasetVersion']
citation = metadata['datasetVersion']['metadataBlocks'].get('citation', {})
assert "metadataBlocks" in metadata["datasetVersion"]
citation = metadata["datasetVersion"]["metadataBlocks"].get("citation", {})
# Check fields in citation section
assert 'fields' in citation
fields = citation['fields']
assert "fields" in citation
fields = citation["fields"]
# Check for keyword and subject fields
field_names = [field.get('typeName') for field in fields]
field_names = [field.get("typeName") for field in fields]
# If keywords exist, verify structure
if 'keyword' in field_names:
if "keyword" in field_names:
for field in fields:
if field.get('typeName') == 'keyword':
assert 'value' in field
assert isinstance(field['value'], list)
if field.get("typeName") == "keyword":
assert "value" in field
assert isinstance(field["value"], list)
# Check for subject field which should definitely exist
assert 'subject' in field_names
assert "subject" in field_names
for field in fields:
if field.get('typeName') == 'subject':
assert 'value' in field
assert isinstance(field['value'], list)
assert len(field['value']) > 0
if field.get("typeName") == "subject":
assert "value" in field
assert isinstance(field["value"], list)
assert len(field["value"]) > 0

View file

@ -8,7 +8,7 @@ def test_person_to_dict_with_string_affiliation():
given_name="John",
orcid="0000-0001-2345-6789",
email="john.doe@example.org",
affiliation="Test University"
affiliation="Test University",
)
result = person.to_dict()
@ -29,7 +29,7 @@ def test_person_to_dict_with_institution_ror():
given_name="John",
orcid="0000-0001-2345-6789",
email="john.doe@example.org",
affiliation=inst
affiliation=inst,
)
result = person.to_dict()
@ -48,7 +48,7 @@ def test_person_to_dict_with_institution_display_name_only():
family_name="Smith",
given_name="Jane",
orcid="0000-0001-9876-5432",
affiliation=inst
affiliation=inst,
)
result = person.to_dict()
@ -63,11 +63,7 @@ def test_person_to_dict_with_empty_institution():
# Create an Institution with empty values
inst = Institution("")
person = Person(
family_name="Brown",
given_name="Robert",
affiliation=inst
)
person = Person(family_name="Brown", given_name="Robert", affiliation=inst)
result = person.to_dict()
@ -79,9 +75,7 @@ def test_person_to_dict_with_empty_institution():
def test_person_to_dict_with_no_affiliation():
"""Test Person.to_dict() with no affiliation."""
person = Person(
family_name="Green",
given_name="Alice",
orcid="0000-0002-1111-2222"
family_name="Green", given_name="Alice", orcid="0000-0002-1111-2222"
)
result = person.to_dict()

View file

@ -14,44 +14,44 @@ def metadata_processor():
processor.console = MagicMock()
return processor
def test_get_publication_year_with_publication_year(metadata_processor):
"""Test that _get_publication_year extracts year from publication_year field"""
data = {"publication_year": 2020}
year = metadata_processor._get_publication_year(data)
assert year == 2020
def test_get_publication_year_with_date(metadata_processor):
"""Test that _get_publication_year returns empty string when publication_year is missing"""
data = {"publication_date": "2019-05-15"}
year = metadata_processor._get_publication_year(data)
assert year == ""
def test_get_publication_year_with_both_fields(metadata_processor):
"""Test that _get_publication_year prioritizes publication_year over date"""
data = {
"publication_year": 2020,
"publication_date": "2019-05-15"
}
data = {"publication_year": 2020, "publication_date": "2019-05-15"}
year = metadata_processor._get_publication_year(data)
assert year == 2020
def test_get_publication_year_with_partial_date(metadata_processor):
"""Test that _get_publication_year returns empty string when only publication_date is present"""
data = {"publication_date": "2018"}
year = metadata_processor._get_publication_year(data)
assert year == ""
def test_get_publication_year_with_missing_data(metadata_processor):
"""Test that _get_publication_year handles missing data"""
data = {"other_field": "value"}
year = metadata_processor._get_publication_year(data)
assert year == ""
def test_get_publication_year_with_invalid_data(metadata_processor):
"""Test that _get_publication_year returns whatever is in publication_year field"""
data = {
"publication_year": "not-a-year",
"publication_date": "invalid-date"
}
data = {"publication_year": "not-a-year", "publication_date": "invalid-date"}
year = metadata_processor._get_publication_year(data)
assert year == "not-a-year"