Core Concepts
Gunz-ML is more than just a logging utility; it is a research-centric SDK designed to manage the entire lifecycle of deep learning experiments in a distributed environment.
1. The Research SDK Philosophy
Most experiment trackers are passive sinks for data. Gunz-ML acts as a bridge. It allows you to:
Write: Log high-frequency metrics and artifacts during training.
Read: Query past results to inform the current HPO (Hyperparameter Optimization) loop.
Extract: Programmatically download artifacts from Juno for downstream analysis in notebooks.
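The three roles above can be sketched as a single loop. This is a minimal in-memory illustration, not the Gunz-ML API: ExperimentStore and its methods are hypothetical stand-ins for the write, read, and extract operations.

```python
# Hypothetical stand-in for the write/read/extract loop; this is NOT
# the Gunz-ML API, only an illustration of the three roles.
class ExperimentStore:
    def __init__(self):
        self.metrics = {}    # run_id -> list of (step, value)
        self.artifacts = {}  # run_id -> {name: bytes}

    # Write: log high-frequency metrics during training.
    def log_metric(self, run_id, step, value):
        self.metrics.setdefault(run_id, []).append((step, value))

    # Read: query past results to steer the next HPO trial.
    def best_run(self):
        # Lower final metric is better (e.g. validation loss).
        return min(self.metrics, key=lambda r: self.metrics[r][-1][1])

    # Extract: pull an artifact for downstream notebook analysis.
    def load_artifact(self, run_id, name):
        return self.artifacts[run_id][name]


store = ExperimentStore()
store.log_metric("run_a", step=1, value=0.9)
store.log_metric("run_a", step=2, value=0.4)
store.log_metric("run_b", step=1, value=0.8)
store.log_metric("run_b", step=2, value=0.6)
store.artifacts["run_a"] = {"model.pt": b"\x00\x01"}

print(store.best_run())  # run_a: lowest final value
```

In the real SDK these operations go through MLflow, Optuna, and MinIO rather than in-process dictionaries, but the read path feeding the next HPO decision is the same idea.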
2. Tracking vs. Management
The library distinguishes between two levels of operation:
Tracking (gunz_ml.integrations): Low-level logic to ensure metrics reach MLflow and Optuna without database locks or race conditions.
Management (gunz_ml.management): High-level logic (e.g., TrackingManager) used to find the best runs, prune failed trials, and generate comparison reports across studies.
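The division of labour can be illustrated with a toy Management-layer sketch. The TrackingManager below is hypothetical, operating on plain trial records instead of a live Optuna study, to show what "find the best runs" and "prune failed trials" mean at this level.

```python
# Illustrative sketch of the Management layer; this TrackingManager is
# a hypothetical stand-in, not the class shipped in gunz_ml.management.
FAILED, COMPLETE = "FAILED", "COMPLETE"

class TrackingManager:
    def __init__(self, trials):
        self.trials = list(trials)  # each: {"id", "state", "value"}

    def prune_failed(self):
        # Drop trials that crashed or were killed by the scheduler.
        self.trials = [t for t in self.trials if t["state"] != FAILED]
        return self

    def best_trial(self):
        # Best = lowest objective value among completed trials.
        return min(
            (t for t in self.trials if t["state"] == COMPLETE),
            key=lambda t: t["value"],
        )


trials = [
    {"id": 0, "state": COMPLETE, "value": 0.31},
    {"id": 1, "state": FAILED, "value": float("inf")},
    {"id": 2, "state": COMPLETE, "value": 0.27},
]
manager = TrackingManager(trials).prune_failed()
print(manager.best_trial()["id"])  # trial 2 has the lowest value
```

The Tracking layer would be responsible for getting those trial records written safely in the first place; Management only ever reads and reorganizes them.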
3. Distributed Safety
In a Slurm-based cluster environment, multiple workers often try to initialize the same study simultaneously. Gunz-ML implements an Initialisation-First policy:
Studies are pre-scaffolded using the gunz-ml init CLI.
Workers use safe_set_experiment to verify the environment is ready before starting, preventing the common "database is locked" errors in MariaDB.
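The worker-side half of this policy boils down to "poll until the scaffold exists, never create it yourself." The sketch below shows that pattern with exponential backoff; wait_until_ready and scaffold_exists are hypothetical names, not Gunz-ML functions.

```python
import time

# Pattern sketch: a worker polls for the pre-scaffolded study instead
# of racing other workers to create it. Names here are hypothetical.
def wait_until_ready(is_ready, attempts=5, base_delay=0.01):
    """Poll is_ready() with exponential backoff; raise on timeout."""
    for attempt in range(attempts):
        if is_ready():
            return True
        time.sleep(base_delay * (2 ** attempt))
    raise TimeoutError("study not initialised; run the init step first")


# Simulate a scaffold that becomes visible on the third check.
calls = {"n": 0}
def scaffold_exists():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_ready(scaffold_exists))  # True, after three checks
```

Because workers only ever read, there is no concurrent write to the study tables at startup, which is what removes the lock contention.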
4. The Juno Ecosystem
Gunz-ML is designed to communicate with Juno, the unified experiment infrastructure.
MLflow: Stores run metadata, parameters, and time-series metrics.
Optuna (MariaDB): Stores the relational data for HPO trials.
MinIO (S3): Stores large binary artifacts (model checkpoints, .pt files, and plots).
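Wiring a process to these three backends typically comes down to a handful of settings. The snippet below is a configuration sketch: MLFLOW_TRACKING_URI and MLFLOW_S3_ENDPOINT_URL are standard MLflow environment variables, but every hostname, port, and credential shown is a placeholder, not a real Juno endpoint.

```python
import os

# Placeholder wiring for the three Juno backends (all values invented).
os.environ["MLFLOW_TRACKING_URI"] = "http://juno-mlflow.example:5000"
# Point MLflow's artifact client at MinIO's S3-compatible endpoint.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://juno-minio.example:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "minio-access-key"      # placeholder
os.environ["AWS_SECRET_ACCESS_KEY"] = "minio-secret-key"  # placeholder

# Optuna trials would live in MariaDB behind a SQLAlchemy-style URL:
optuna_storage = "mysql+pymysql://user:pass@juno-mariadb.example:3306/optuna"
```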
By standardizing on these backends, Gunz-ML ensures that your research is reproducible, queryable, and persistent.