docs: update for modular package structure

- Update README.md with new usage methods and package structure
- Revise installation instructions for pip install and development setup
- Update usage documentation to reflect CLI module and Python imports
- Add package architecture overview to contributing guide
- Update API reference documentation for all new modules

Documentation changes:
- README.md: Updated installation, usage examples, development setup
- docs/source/installation.rst: Added verification and dev setup
- docs/source/usage.rst: Updated for new CLI and import methods
- docs/source/introduction.rst: Updated package overview
- docs/source/contributing.rst: Added package architecture overview
- docs/source/modules.rst: Complete API reference for all modules

All documentation now reflects the modular structure with proper
usage instructions for console commands, Python modules, and imports.
Author: Alexander Minges, 2025-07-22 11:54:07 +02:00
commit c60817702b (parent b1dd2917b2)
6 changed files with 395 additions and 32 deletions

README.md

@@ -28,13 +28,23 @@
- Python 3.12 or higher
Clone the repository from GitHub:
Clone the repository:
```bash
git clone https://git.athemis.de/Athemis/doi2dataset
git clone https://git.uni-due.de/cbm343e/doi2dataset
cd doi2dataset
```
### Quick Start
```bash
# Install the package
pip install -e .
# Run with a DOI
doi2dataset 10.1038/nature12373
```
## Configuration
Before running the tool, configure the necessary settings in the `config.yaml` file located in the project root. This file contains configuration details such as:
@@ -109,14 +119,43 @@ This approach allows you to:
## Usage
Run doi2dataset from the command line by providing one or more DOIs:
doi2dataset can be used in several ways after installation:
### Method 1: Console Command
```bash
python doi2dataset.py [options] DOI1 DOI2 ...
# After installation with pip install -e .
doi2dataset [options] DOI1 DOI2 ...
```
### Method 2: Python Module
```bash
# Use CLI module directly
python -m doi2dataset.cli [options] DOI1 DOI2 ...
# Or use main module
python -m doi2dataset.main [options] DOI1 DOI2 ...
```
### Method 3: Python Import
```python
from doi2dataset import MetadataProcessor
from pathlib import Path
processor = MetadataProcessor(
    doi="10.1038/nature12373",
    output_path=Path("metadata.json"),
    depositor="Your Name"
)
metadata = processor.process()
```
### Command Line Options
All methods support the same command-line options:
- `-f, --file`
Specify a file containing DOIs (one per line).
@@ -138,6 +177,25 @@ python doi2dataset.py [options] DOI1 DOI2 ...
- `-r, --use-ror`
Use Research Organization Registry (ROR) identifiers for institutions when available.
### Examples
```bash
# Process a single DOI
doi2dataset 10.1038/nature12373
# Process multiple DOIs
doi2dataset 10.1038/nature12373 10.1126/science.1234567
# Process DOIs from a file with custom output directory
doi2dataset -f dois.txt -o ./output -d "Your Name"
# Upload to Dataverse with contact email
doi2dataset -u -m your.email@university.edu 10.1038/nature12373
# Use ROR identifiers for institutions
doi2dataset -r 10.1038/nature12373
```
## Documentation
Documentation is generated using Sphinx and is available online at:
@@ -403,38 +461,66 @@ Contributions are welcome! Please fork the repository and submit a pull request
pip install -r requirements-dev.txt
```
2. Set up commit message template (recommended):
2. Install the package in development mode:
```bash
pip install -e .
```
3. Set up commit message template:
```bash
git config commit.template .gitmessage
```
3. Install pre-commit hooks (recommended):
4. Install pre-commit hooks:
```bash
pre-commit install --hook-type pre-commit --hook-type commit-msg
```
4. Run tests to ensure everything works:
5. Run tests:
```bash
pytest
```
5. Optionally run pre-commit on all files to check formatting:
6. Run code quality checks:
```bash
pre-commit run --all-files
```
### Package Structure
The project follows a modular architecture:
```
doi2dataset/
├── cli.py                    # Command-line interface
├── main.py                   # Main entry point
├── core/                     # Core components
│   ├── config.py             # Configuration management
│   ├── models.py             # Data models (Person, Institution, etc.)
│   └── metadata_fields.py    # Dataverse metadata field types
├── api/                      # External API integration
│   ├── client.py             # HTTP client for API requests
│   └── processors.py         # License and abstract processors
├── processing/               # Business logic
│   ├── citation.py           # Citation building
│   ├── metadata.py           # Metadata processing pipeline
│   └── utils.py              # Processing utilities
└── utils/                    # General utilities
    └── validation.py         # Validation functions
```
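
As a quick orientation, here is how the layout maps to Python imports. The `MetadataProcessor` import is shown in the usage examples above; the remaining names are assumptions based on the inline comments in the tree and may differ in detail:

```python
# Sketch only: names other than MetadataProcessor are assumed from the
# tree's inline comments above.
from doi2dataset import MetadataProcessor                # top-level re-export
from doi2dataset.core.models import Person, Institution  # data models (assumed)
from doi2dataset.core.config import Config               # configuration management (assumed)
```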
### Code Quality
- Follow the existing code style and formatting (enforced by pre-commit hooks)
- Follow the existing code style and formatting
- Write tests for new functionality
- Ensure all tests pass before submitting
- Use meaningful commit messages following the conventional commits format
- Pre-commit hooks will automatically validate commit messages and code formatting
- Run `python scripts/lint-commit.py` to manually validate commit messages
- Run `python scripts/lint-commit.py` to validate commit messages
## License

docs/source/contributing.rst

@@ -1,7 +1,67 @@
Contributing
============
We welcome contributions to **doi2dataset**! This guide provides information for developers who want to contribute to the project or build the documentation locally.
This guide provides information for developers who want to contribute to the project, understand the package architecture, or build the documentation locally.
Package Architecture
--------------------
**doi2dataset** has a modular architecture:
**Core Components (`core/`)**
- `config.py`: Configuration management with environment variable support
- `models.py`: Data models for Person, Institution, License, Abstract
- `metadata_fields.py`: Dataverse metadata field type definitions
**API Integration (`api/`)**
- `client.py`: HTTP client for external API requests
- `processors.py`: Processors for licenses and abstracts
**Processing Logic (`processing/`)**
- `citation.py`: Citation building from API data
- `metadata.py`: Metadata processing pipeline
- `utils.py`: Processing utilities (name processing, PI finding, subject mapping)
**Utilities (`utils/`)**
- `validation.py`: Validation functions for DOIs, emails, etc.
**User Interface**
- `cli.py`: Command-line interface implementation
- `main.py`: Entry point for the package
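
To see how these layers compose, here is a minimal sketch. The top-level ``MetadataProcessor`` import is documented in the README; the validation helper name below is an assumption based on the module descriptions above:

.. code-block:: python

   from pathlib import Path

   from doi2dataset import MetadataProcessor
   from doi2dataset.utils import validation  # DOI/email checks (names assumed)

   doi = "10.1038/nature12373"
   # Validate first (assumed helper name), then run the processing pipeline
   if validation.validate_doi(doi):
       MetadataProcessor(doi=doi, output_path=Path("metadata.json")).process()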
Development Setup
-----------------
1. Clone the repository and install:

   .. code-block:: bash

      git clone https://git.uni-due.de/cbm343e/doi2dataset.git
      cd doi2dataset
      pip install -e .
      pip install -r requirements-dev.txt

2. Set up development tools:

   .. code-block:: bash

      # Set up commit message template
      git config commit.template .gitmessage

      # Install pre-commit hooks
      pre-commit install --hook-type pre-commit --hook-type commit-msg

3. Run tests:

   .. code-block:: bash

      pytest

4. Run code quality checks:

   .. code-block:: bash

      pre-commit run --all-files
Building Documentation
----------------------

docs/source/installation.rst

@@ -5,13 +5,16 @@ There are several ways to install **doi2dataset**:
Using Git
---------
Clone the repository from GitHub by running the following commands in your terminal:
Clone the repository by running the following commands in your terminal:
.. code-block:: bash

   git clone https://git.uni-due.de/cbm343e/doi2dataset.git
   cd doi2dataset

   # Install in development mode
   pip install -e .
Using pip (if available)
-------------------------
You can also install **doi2dataset** via pip:
@@ -20,9 +23,37 @@ You can also install **doi2dataset** via pip:
pip install doi2dataset
Development Installation
------------------------
Install in editable mode for development:
.. code-block:: bash

   git clone https://git.uni-due.de/cbm343e/doi2dataset.git
   cd doi2dataset
   pip install -e .

   # Install development dependencies
   pip install -r requirements-dev.txt

   # Set up pre-commit hooks
   pre-commit install --hook-type pre-commit --hook-type commit-msg
Verification
------------
Check the installation:
.. code-block:: bash

   # Check console command
   doi2dataset --help

   # Or use module
   python -m doi2dataset.cli --help
Configuration
-------------
After installation, ensure that the tool is configured correctly.
Check the `config.yaml` file in the project root for necessary settings such as Dataverse connection details and PI information.
After installation, configure the tool by editing the `config.yaml` file in the project root.
Set Dataverse connection details and PI information as needed.
For more detailed instructions, please refer to the README file provided with the project.
See the README file for detailed configuration instructions.
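
As a quick sanity check that your configuration file parses, you can load it with PyYAML. This is only a sketch; the expected top-level keys depend on the ``config.yaml`` shipped with the repository:

.. code-block:: python

   # Requires PyYAML; prints the top-level keys (e.g. Dataverse and PI
   # settings) so you can compare them against the documented options.
   import yaml

   with open("config.yaml") as fh:
       cfg = yaml.safe_load(fh)

   print(sorted(cfg))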

docs/source/introduction.rst

@@ -1,8 +1,19 @@
Introduction
============
Welcome to the **doi2dataset** documentation. This guide provides an in-depth look at the tool, its purpose, and how it can help you generate standard citation metadata for Dataverse datasets.
**doi2dataset** is a Python tool that processes DOIs and generates metadata for Dataverse datasets.
The **doi2dataset** tool is aimed at researchers, data stewards, and developers who need to convert DOI-based metadata into a format compatible with Dataverse. It automates the retrieval of metadata from external sources (like OpenAlex and CrossRef) and generates standard Dataverse citation metadata blocks including title, authors, abstract, keywords, and funding information.
It retrieves metadata from external sources (OpenAlex and CrossRef) and generates Dataverse citation metadata blocks including title, authors, abstract, keywords, and funding information.
In the following sections, you'll learn about the installation process, usage examples, and a detailed API reference.
The package is organized into modules:
- `core/`: Configuration, data models, and metadata field definitions
- `api/`: HTTP client and API processors for external services
- `processing/`: Citation building and metadata processing logic
- `utils/`: Validation and utility functions
- `cli.py`: Command-line interface
- `main.py`: Entry point
The tool can be used as a command-line application or imported as a Python package.
The documentation covers installation, usage, and API reference.

docs/source/modules.rst

@@ -3,7 +3,113 @@ API Reference
This section contains the API reference generated from the source code docstrings.
Main Package
------------
.. automodule:: doi2dataset
   :members:
   :undoc-members:
   :show-inheritance:

Core Components
---------------

Configuration Management
~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: doi2dataset.core.config
   :members:
   :undoc-members:
   :show-inheritance:

Data Models
~~~~~~~~~~~

.. automodule:: doi2dataset.core.models
   :members:
   :undoc-members:
   :show-inheritance:

Metadata Fields
~~~~~~~~~~~~~~~

.. automodule:: doi2dataset.core.metadata_fields
   :members:
   :undoc-members:
   :show-inheritance:

API Integration
---------------

HTTP Client
~~~~~~~~~~~

.. automodule:: doi2dataset.api.client
   :members:
   :undoc-members:
   :show-inheritance:

API Processors
~~~~~~~~~~~~~~

.. automodule:: doi2dataset.api.processors
   :members:
   :undoc-members:
   :show-inheritance:

Processing Components
---------------------

Citation Building
~~~~~~~~~~~~~~~~~

.. automodule:: doi2dataset.processing.citation
   :members:
   :undoc-members:
   :show-inheritance:

Metadata Processing
~~~~~~~~~~~~~~~~~~~

.. automodule:: doi2dataset.processing.metadata
   :members:
   :undoc-members:
   :show-inheritance:

Processing Utilities
~~~~~~~~~~~~~~~~~~~~

.. automodule:: doi2dataset.processing.utils
   :members:
   :undoc-members:
   :show-inheritance:

Utilities
---------

Validation Functions
~~~~~~~~~~~~~~~~~~~~

.. automodule:: doi2dataset.utils.validation
   :members:
   :undoc-members:
   :show-inheritance:

Command Line Interface
----------------------

CLI Module
~~~~~~~~~~

.. automodule:: doi2dataset.cli
   :members:
   :undoc-members:
   :show-inheritance:

Main Entry Point
~~~~~~~~~~~~~~~~

.. automodule:: doi2dataset.main
   :members:
   :undoc-members:
   :show-inheritance:

docs/source/usage.rst

@@ -1,7 +1,7 @@
Usage
=====
Running **doi2dataset** is done from the command line. Below is an example of how to use the tool.
**doi2dataset** can be run from the command line or imported as a Python package.
Demo
----
@@ -11,13 +11,39 @@ Here's a demonstration of **doi2dataset** in action:
   :alt: doi2dataset demonstration
   :align: center
Basic Example
Usage Methods
-------------
To process one or more DOIs, run:
**doi2dataset** can be used in several ways:
**Method 1: Console Command**
.. code-block:: bash

   python doi2dataset.py 10.1234/doi1 10.5678/doi2
   doi2dataset 10.1234/doi1 10.5678/doi2

**Method 2: Python Module**

.. code-block:: bash

   # Use CLI module directly
   python -m doi2dataset.cli 10.1234/doi1 10.5678/doi2

   # Or use main module
   python -m doi2dataset.main 10.1234/doi1 10.5678/doi2

**Method 3: Python Import**

.. code-block:: python

   from doi2dataset import MetadataProcessor
   from pathlib import Path

   processor = MetadataProcessor(
       doi="10.1234/doi1",
       output_path=Path("metadata.json"),
       depositor="Your Name"
   )
   metadata = processor.process()
Command Line Options
--------------------
@@ -87,25 +113,68 @@ Example usage:
   export DATAVERSE_AUTH_PASSWORD="your-secure-password"

   # Run doi2dataset - it will use environment variables for credentials
   python doi2dataset.py 10.1234/example.doi
   doi2dataset 10.1234/example.doi

   # Or set them inline for a single run
   DATAVERSE_API_TOKEN="token" python doi2dataset.py 10.1234/example.doi
   DATAVERSE_API_TOKEN="token" doi2dataset 10.1234/example.doi
This approach allows you to:
- Keep sensitive credentials out of version control
- Use different configurations for different environments (dev, staging, production)
- Deploy the tool with secure environment-based configuration
- Use different configurations per environment
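
Conceptually, credential lookup can be thought of as "environment first, then ``config.yaml``". The snippet below is an illustrative sketch of that precedence, not the actual implementation in ``doi2dataset.core.config``:

.. code-block:: python

   import os

   def resolve_api_token(cfg: dict) -> str | None:
       # Environment variable takes precedence over the config file value
       # (assumed precedence; check core/config.py for the real behavior).
       return os.environ.get("DATAVERSE_API_TOKEN") or cfg.get("dataverse", {}).get("api_token")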
Usage Example with Configuration
----------------------------------
If you have configured your **config.yaml** and want to process DOIs from a file while uploading the metadata, you could run:
Usage Examples
---------------
Here are some practical examples of using **doi2dataset**:
**Process a single DOI:**

.. code-block:: bash

   python doi2dataset.py -f dois.txt -o output/ -d "Doe, John" -s "Medicine, Health and Life Sciences" -m "john.doe@example.com" -u -r
   doi2dataset 10.1038/nature12373

This command will use the options provided on the command line as well as the settings from **config.yaml**.

**Process multiple DOIs:**

For more details on usage and configuration, please refer to the rest of the documentation.

.. code-block:: bash

   doi2dataset 10.1038/nature12373 10.1126/science.1234567

**Process DOIs from a file with custom settings:**

.. code-block:: bash

   doi2dataset -f dois.txt -o output/ -d "Doe, John" -s "Medicine, Health and Life Sciences" -m "john.doe@example.com" -u -r

**Upload to Dataverse with ROR identifiers:**

.. code-block:: bash

   doi2dataset -u -r -m your.email@university.edu 10.1038/nature12373
Commands use options from the command line and settings from **config.yaml**.
Package Structure
-----------------
The **doi2dataset** package modules:
.. code-block:: text

   doi2dataset/
   ├── cli.py                    # Command-line interface
   ├── main.py                   # Main entry point
   ├── core/                     # Core components
   │   ├── config.py             # Configuration management
   │   ├── models.py             # Data models (Person, Institution, etc.)
   │   └── metadata_fields.py    # Dataverse metadata field types
   ├── api/                      # External API integration
   │   ├── client.py             # HTTP client for API requests
   │   └── processors.py         # License and abstract processors
   ├── processing/               # Business logic
   │   ├── citation.py           # Citation building
   │   ├── metadata.py           # Metadata processing pipeline
   │   └── utils.py              # Processing utilities
   └── utils/                    # General utilities
       └── validation.py         # Validation functions
See other documentation sections for more details.