Workflow

At a high level, GENBoostGPU loads genotype and phenotype data from disk or memory, filters and scores SNPs, trains elastic net models in a boosting loop, and evaluates the variance explained. The diagram below shows the major stages and the functions behind each one.

+---------------+      +-------------------+      +------------------+      +------------------+
| Data input    | ---> | SNP preprocessing | ---> | Boosting elastic | ---> | Evaluation &     |
| (PLINK, CuPy) |      | (filtering, LD)   |      | net iterations   |      | persistence      |
+---------------+      +-------------------+      +------------------+      +------------------+
        |                        |                         |                         |
        v                        v                         v                         v
 data_io.load_*          snp_processing.*      enet_boosting.boosting_*    data_io.save_results
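
The boosting and evaluation stages can be illustrated with a toy, CPU-only sketch in plain Python. This is not GENBoostGPU's implementation: the helper names (fit_elastic_net, boost_elastic_net, r_squared), the penalty settings, and the learning rate are all hypothetical, and tuning and refit steps are left out. It assumes that "boosting loop" means repeatedly fitting an elastic net to the current residuals and accumulating shrunken coefficients.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_sweeps=100):
    """Coordinate descent for an elastic net without intercept (hypothetical helper)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # y - X @ beta, kept up to date after every update
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # an all-zero column carries no signal
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid)) / n
            new_bj = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid = [r - v * delta for r, v in zip(resid, col)]
                beta[j] = new_bj
    return beta


def boost_elastic_net(X, y, n_rounds=20, learning_rate=0.5, **enet_kwargs):
    """Fit an elastic net to the residuals each round and accumulate the steps."""
    p = len(X[0])
    total = [0.0] * p
    resid = list(y)
    for _ in range(n_rounds):
        step = fit_elastic_net(X, resid, **enet_kwargs)
        if all(b == 0.0 for b in step):
            break  # nothing left that the penalized model can explain
        for j in range(p):
            total[j] += learning_rate * step[j]
        resid = [
            r - learning_rate * sum(x * b for x, b in zip(row, step))
            for r, row in zip(resid, X)
        ]
    return total


def r_squared(y, y_hat):
    """Fraction of phenotype variance explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Toy, centered "genotype" matrix with three orthogonal columns (made-up data).
X = [
    [1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1],
    [-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1],
]
# Phenotype depends on the first and third column only.
y = [2 * row[0] - row[2] for row in X]

beta = boost_elastic_net(X, y)
y_hat = [sum(x * b for x, b in zip(row, beta)) for row in X]
r2 = r_squared(y, y_hat)
print(beta, r2)
```

On this toy input the second column, which has no effect on the phenotype, keeps a zero coefficient, while the other two are recovered with some shrinkage from the penalty.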

Module responsibilities

genboostgpu.data_io: Reads PLINK and phenotype files, emits CuPy/cuDF objects, and saves outputs to TSV/Parquet.
genboostgpu.snp_processing: Applies zero-variance filtering, missing value imputation, cis-window selection, and LD clumping.
genboostgpu.enet_boosting: Implements the boosting loop, Optuna-based ElasticNet tuning, and final ridge refit.
genboostgpu.cpg_orchestration: Provides CpG-centric orchestration utilities that schedule boosting tasks across traits, chromosomes, or distributed Dask workers.
genboostgpu.orchestration: High-level entry point. Launches genboostgpu.vmr_runner.run_single_window() across windows, optionally using dask_cuda.LocalCUDACluster for multi-GPU execution.
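
To make the first two preprocessing steps concrete, here is a minimal CPU-only sketch of mean imputation followed by zero-variance filtering. It uses only the Python standard library and does not call genboostgpu.snp_processing; the function impute_and_filter, the use of None for missing calls, and the SNP IDs are assumptions made for illustration. Cis-window selection and LD clumping are not shown.

```python
from statistics import fmean, pvariance


def impute_and_filter(genotypes, snp_ids):
    """Mean-impute missing dosages, then drop SNPs with zero variance.

    genotypes: one list of dosages per SNP; None marks a missing call.
    Returns the IDs and imputed dosage rows of the SNPs that are kept.
    """
    kept_ids, kept_rows = [], []
    for snp_id, dosages in zip(snp_ids, genotypes):
        observed = [d for d in dosages if d is not None]
        if not observed:
            continue  # no observed calls, so there is nothing to impute from
        mean = fmean(observed)
        filled = [mean if d is None else float(d) for d in dosages]
        if pvariance(filled) == 0.0:
            continue  # monomorphic SNP: it cannot explain any variance
        kept_ids.append(snp_id)
        kept_rows.append(filled)
    return kept_ids, kept_rows


# Made-up example: three SNPs measured in four samples.
snp_ids = ["rs1", "rs2", "rs3"]
genotypes = [
    [0, 1, 2, None],   # one missing call, imputed with the SNP mean
    [1, 1, 1, 1],      # constant across samples, filtered out
    [2, None, 0, 0],
]
kept_ids, kept_rows = impute_and_filter(genotypes, snp_ids)
print(kept_ids)
```

Imputing before filtering means a SNP whose observed calls are all identical is still removed, because filling the gaps with the mean leaves it constant.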

Putting it together

Build a list of windows (chromosome, start, end, phenotype ID/path).
Load or provide genotype/phenotype objects (genboostgpu.data_io).
Call genboostgpu.orchestration.run_windows_with_dask() to schedule work.
Inspect the resulting pandas DataFrame and the saved Parquet/TSV files.
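
The first step above, building the window list, might look like the sketch below. The Window class, its field names, and the example regions are hypothetical; the structure that run_windows_with_dask() actually expects is not documented in this section, so check the API reference before passing anything to it.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Window:
    """One genomic window to model (hypothetical structure)."""
    chrom: str
    start: int
    end: int
    phenotype: str  # phenotype ID or path to a phenotype file

    def __post_init__(self):
        # Reject malformed windows before any work is scheduled.
        if self.start < 0 or self.end <= self.start:
            raise ValueError(f"invalid window: {self.chrom}:{self.start}-{self.end}")


# Made-up regions: (chromosome, start, end, phenotype ID).
regions = [
    ("chr21", 1_000_000, 1_050_000, "pheno_0001"),
    ("chr21", 2_000_000, 2_020_000, "pheno_0002"),
]

# Plain dicts are easy to serialize and to ship to worker processes.
windows = [asdict(Window(*region)) for region in regions]
print(windows)

# The list would then be handed to genboostgpu.orchestration.run_windows_with_dask()
# together with the genotype/phenotype inputs; its arguments are not reproduced here.
```

Validating each window up front keeps malformed coordinates from surfacing later as a failure inside a worker.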

For more detailed orchestration examples, see tutorials/index.