Core Concepts
Gunz-ML is more than just a logging utility; it is a research-centric SDK designed to manage the entire lifecycle of deep learning experiments in a distributed environment.
1. The Research SDK Philosophy
Most experiment trackers are passive sinks for data. Gunz-ML acts as a bridge. It allows you to:
Write: Log high-frequency metrics and artifacts during training.
Read: Query past results to inform the current HPO (Hyperparameter Optimization) loop.
Extract: Programmatically download artifacts from Juno for downstream analysis in notebooks.
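The three roles above can be sketched as a single loop. This is a minimal in-memory illustration, not the Gunz-ML API: ExperimentStore and its methods are hypothetical stand-ins for the write, read, and extract operations.

```python
# Hypothetical stand-in for the write/read/extract loop; this is NOT
# the Gunz-ML API, only an illustration of the three roles.
class ExperimentStore:
    def __init__(self):
        self.metrics = {}    # run_id -> list of (step, value)
        self.artifacts = {}  # run_id -> {name: bytes}

    # Write: log high-frequency metrics during training.
    def log_metric(self, run_id, step, value):
        self.metrics.setdefault(run_id, []).append((step, value))

    # Read: query past results to steer the next HPO trial.
    def best_run(self):
        # Lower final metric is better (e.g. validation loss).
        return min(self.metrics, key=lambda r: self.metrics[r][-1][1])

    # Extract: pull an artifact for downstream notebook analysis.
    def load_artifact(self, run_id, name):
        return self.artifacts[run_id][name]


store = ExperimentStore()
store.log_metric("run_a", step=1, value=0.9)
store.log_metric("run_a", step=2, value=0.4)
store.log_metric("run_b", step=1, value=0.8)
store.log_metric("run_b", step=2, value=0.6)
store.artifacts["run_a"] = {"model.pt": b"\x00\x01"}

print(store.best_run())  # run_a: lowest final value
```

In the real SDK these operations go through MLflow, Optuna, and MinIO rather than in-process dictionaries, but the read path feeding the next HPO decision is the same idea.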
2. Tracking vs. Management
The library distinguishes between two levels of operation:
Tracking (gunz_ml.integrations): Low-level logic to ensure metrics reach MLflow and Optuna without database locks or race conditions.
Management (gunz_ml.management): High-level logic (e.g., TrackingManager) used to find the best runs, prune failed trials, and generate comparison reports across studies.
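The division of labour can be illustrated with a toy Management-layer sketch. The TrackingManager below is hypothetical, operating on plain trial records instead of a live Optuna study, to show what "find the best runs" and "prune failed trials" mean at this level.

```python
# Illustrative sketch of the Management layer; this TrackingManager is
# a hypothetical stand-in, not the class shipped in gunz_ml.management.
FAILED, COMPLETE = "FAILED", "COMPLETE"

class TrackingManager:
    def __init__(self, trials):
        self.trials = list(trials)  # each: {"id", "state", "value"}

    def prune_failed(self):
        # Drop trials that crashed or were killed by the scheduler.
        self.trials = [t for t in self.trials if t["state"] != FAILED]
        return self

    def best_trial(self):
        # Best = lowest objective value among completed trials.
        return min(
            (t for t in self.trials if t["state"] == COMPLETE),
            key=lambda t: t["value"],
        )


trials = [
    {"id": 0, "state": COMPLETE, "value": 0.31},
    {"id": 1, "state": FAILED, "value": float("inf")},
    {"id": 2, "state": COMPLETE, "value": 0.27},
]
manager = TrackingManager(trials).prune_failed()
print(manager.best_trial()["id"])  # trial 2 has the lowest value
```

The Tracking layer would be responsible for getting those trial records written safely in the first place; Management only ever reads and reorganizes them.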
3. Distributed Safety
In a Slurm-based cluster environment, multiple workers often try to initialize the same study simultaneously. Gunz-ML implements an Initialisation-First policy:
Studies are pre-scaffolded using the gunz-ml init CLI.
Workers use safe_set_experiment to verify the environment is ready before starting, preventing the common "database is locked" errors in MariaDB.
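The worker-side half of this policy boils down to "poll until the scaffold exists, never create it yourself." The sketch below shows that pattern with exponential backoff; wait_until_ready and scaffold_exists are hypothetical names, not Gunz-ML functions.

```python
import time

# Pattern sketch: a worker polls for the pre-scaffolded study instead
# of racing other workers to create it. Names here are hypothetical.
def wait_until_ready(is_ready, attempts=5, base_delay=0.01):
    """Poll is_ready() with exponential backoff; raise on timeout."""
    for attempt in range(attempts):
        if is_ready():
            return True
        time.sleep(base_delay * (2 ** attempt))
    raise TimeoutError("study not initialised; run the init step first")


# Simulate a scaffold that becomes visible on the third check.
calls = {"n": 0}
def scaffold_exists():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_ready(scaffold_exists))  # True, after three checks
```

Because workers only ever read, there is no concurrent write to the study tables at startup, which is what removes the lock contention.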
4. The Juno Ecosystem
Gunz-ML is designed to communicate with Juno, the unified experiment infrastructure.
MLflow: Stores run metadata, parameters, and time-series metrics.
Optuna (MariaDB): Stores the relational data for HPO trials.
MinIO (S3): Stores large binary artifacts (model checkpoints, .pt files, and plots).
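Wiring a process to these three backends typically comes down to a handful of settings. The snippet below is a configuration sketch: MLFLOW_TRACKING_URI and MLFLOW_S3_ENDPOINT_URL are standard MLflow environment variables, but every hostname, port, and credential shown is a placeholder, not a real Juno endpoint.

```python
import os

# Placeholder wiring for the three Juno backends (all values invented).
os.environ["MLFLOW_TRACKING_URI"] = "http://juno-mlflow.example:5000"
# Point MLflow's artifact client at MinIO's S3-compatible endpoint.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://juno-minio.example:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "minio-access-key"      # placeholder
os.environ["AWS_SECRET_ACCESS_KEY"] = "minio-secret-key"  # placeholder

# Optuna trials would live in MariaDB behind a SQLAlchemy-style URL:
optuna_storage = "mysql+pymysql://user:pass@juno-mariadb.example:3306/optuna"
```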
By standardizing on these backends, Gunz-ML ensures that your research is reproducible, queryable, and persistent.