Core Concepts

Gunz-ML is more than just a logging utility; it is a research-centric SDK designed to manage the entire lifecycle of deep learning experiments in a distributed environment.

1. The Research SDK Philosophy

Most experiment trackers are passive sinks for data. Gunz-ML instead acts as a two-way bridge between your training code and the experiment infrastructure. It allows you to:

  • Write: Log high-frequency metrics and artifacts during training.

  • Read: Query past results to inform the current HPO (Hyperparameter Optimization) loop.

  • Extract: Programmatically download artifacts from Juno for downstream analysis in notebooks.
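The write/read/extract loop can be pictured with a tiny in-memory analogue. Everything here (`ToyTracker` and its methods) is illustrative only and not the Gunz-ML API; it just shows why the "Read" step matters, since the best past run feeds the next HPO iteration:

```python
class ToyTracker:
    """In-memory stand-in for an experiment tracker (illustrative only)."""

    def __init__(self):
        # run_id -> {"metrics": {...}, "artifacts": {...}}
        self.runs = {}

    def log_metric(self, run_id, name, value):
        """'Write': record a metric for a run during training."""
        run = self.runs.setdefault(run_id, {"metrics": {}, "artifacts": {}})
        run["metrics"][name] = value

    def log_artifact(self, run_id, name, payload):
        """'Write': attach a binary artifact (e.g. a checkpoint) to a run."""
        run = self.runs.setdefault(run_id, {"metrics": {}, "artifacts": {}})
        run["artifacts"][name] = payload

    def best_run(self, metric, mode="min"):
        """'Read': query past results to pick the best run so far."""
        pick = min if mode == "min" else max
        return pick(self.runs, key=lambda r: self.runs[r]["metrics"][metric])

    def download_artifact(self, run_id, name):
        """'Extract': pull an artifact back out for downstream analysis."""
        return self.runs[run_id]["artifacts"].get(name)


tracker = ToyTracker()
tracker.log_metric("run-a", "val_loss", 0.42)
tracker.log_metric("run-b", "val_loss", 0.31)
tracker.log_artifact("run-b", "checkpoint", b"\x00weights")

best = tracker.best_run("val_loss")          # -> "run-b"
payload = tracker.download_artifact(best, "checkpoint")
```

In Gunz-ML the same three verbs are backed by the Juno services rather than a dictionary, but the shape of the loop is the same.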

2. Tracking vs. Management

The library distinguishes between two levels of operation:

  • Tracking (gunz_ml.integrations): Low-level logic to ensure metrics reach MLflow and Optuna without database locks or race conditions.

  • Management (gunz_ml.management): High-level logic (e.g., TrackingManager) used to find the best runs, prune failed trials, and generate comparison reports across studies.
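The management layer's "find the best runs, prune failed trials" responsibility boils down to filtering and ranking trial records. The sketch below works on plain dictionaries; the real `TrackingManager` operates on Optuna/MLflow objects and its method names may differ:

```python
# Hypothetical trial records; in practice these would come from Optuna.
trials = [
    {"id": 1, "state": "COMPLETE", "value": 0.31},
    {"id": 2, "state": "FAIL",     "value": None},
    {"id": 3, "state": "COMPLETE", "value": 0.27},
]

def prune_failed(trials):
    """Keep only trials that finished successfully."""
    return [t for t in trials if t["state"] == "COMPLETE"]

def best_trial(trials, mode="min"):
    """Rank the surviving trials by objective value."""
    pick = min if mode == "min" else max
    return pick(prune_failed(trials), key=lambda t: t["value"])

winner = best_trial(trials)   # -> trial 3 (value 0.27)
```

The tracking layer, by contrast, never ranks anything; it only guarantees that each record lands in the backend exactly once.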

3. Distributed Safety

In a Slurm-based cluster environment, multiple workers often try to initialize the same study simultaneously. Gunz-ML implements an Initialisation-First Policy:

  • Studies are pre-scaffolded using the gunz-ml init CLI.

  • Workers use safe_set_experiment to verify the environment is ready before starting, preventing the race conditions and lock-contention errors that concurrent initialization would otherwise cause in the MariaDB backend.
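The worker-side half of this policy is essentially "poll until the scaffolded experiment is visible, then proceed". A minimal sketch, assuming a hypothetical readiness check (the retry-with-backoff loop is the point; `safe_set_experiment` in Gunz-ML may behave differently):

```python
import time

def wait_for_experiment(check_ready, name, retries=5, delay=0.1):
    """Block until the pre-scaffolded experiment exists, with backoff.

    `check_ready` stands in for a backend query (e.g. asking MLflow
    whether the experiment name resolves). Workers never create the
    experiment themselves -- that is the job of `gunz-ml init`.
    """
    for attempt in range(retries):
        if check_ready(name):
            return True
        time.sleep(delay * (2 ** attempt))   # exponential backoff
    raise RuntimeError(
        f"experiment {name!r} was never initialised; run `gunz-ml init` first"
    )

# Fake backend for illustration: becomes ready on the third poll.
calls = {"n": 0}
def fake_ready(name):
    calls["n"] += 1
    return calls["n"] >= 3

ok = wait_for_experiment(fake_ready, "resnet-hpo", delay=0.01)  # -> True
```

Because every worker only ever reads until the study exists, no two workers race to create the same rows, which is what removes the lock contention.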

4. The Juno Ecosystem

Gunz-ML is designed to communicate with Juno, the unified experiment infrastructure.

  • MLflow: Stores run metadata, parameters, and time-series metrics.

  • Optuna (MariaDB): Stores the relational data for HPO trials.

  • MinIO (S3): Stores large binary artifacts (model checkpoints such as .pt files, plots, and other blobs).
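Wiring the three backends together typically means resolving one endpoint per role. The sketch below uses placeholder hostnames and environment-variable names; the actual Juno endpoints and the variables Gunz-ML reads are deployment-specific:

```python
import os

# Placeholder endpoints -- substitute your deployment's real URIs.
JUNO_BACKENDS = {
    # Run metadata, parameters, time-series metrics.
    "mlflow": os.environ.get(
        "MLFLOW_TRACKING_URI", "http://mlflow.juno.local:5000"),
    # Relational HPO trial storage.
    "optuna": os.environ.get(
        "OPTUNA_STORAGE", "mysql://optuna@mariadb.juno.local/optuna"),
    # Large binary artifacts via the S3 API.
    "artifacts": os.environ.get(
        "ARTIFACT_ENDPOINT", "http://minio.juno.local:9000"),
}

for role, uri in JUNO_BACKENDS.items():
    print(f"{role:9s} -> {uri}")
```

Keeping the mapping in one place is what makes runs reproducible: any worker, on any node, resolves the same three endpoints.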

By standardizing on these backends, Gunz-ML ensures that your research is reproducible, queryable, and persistent.