Workflow

At a high level, GENBoostGPU loads data from disk or memory, filters and scores SNPs, trains elastic net models in a boosting loop, and evaluates the variance explained. The diagram below highlights the major stages.

+---------------+      +-------------------+      +------------------+      +--------------+
| Data input    | ---> | SNP preprocessing | ---> | Boosting elastic | ---> | Evaluation & |
| (PLINK, CuPy) |      | (filtering, LD)   |      | net iterations   |      | persistence  |
+---------------+      +-------------------+      +------------------+      +--------------+
        |                        |                         |                       |
        v                        v                         v                       v
 data_io.load_*          snp_processing.*      enet_boosting.boosting_*  data_io.save_results

Module responsibilities

genboostgpu.data_io

Reads PLINK and phenotype files, emits CuPy/cuDF objects, and saves outputs to TSV/Parquet.

genboostgpu.snp_processing

Applies zero-variance filtering, missing value imputation, cis-window selection, and LD clumping.
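
As an illustration of two of these steps, the following is a minimal NumPy sketch of mean imputation followed by zero-variance filtering. It is not the genboostgpu.snp_processing implementation: the function name impute_and_filter, the NaN encoding of missing calls, and the impute-then-filter ordering are assumptions made for the example. CuPy mirrors most of the NumPy calls used here, so a GPU version would look similar.

```python
import numpy as np

def impute_and_filter(genotypes):
    """Mean-impute missing calls, then drop zero-variance SNP columns.

    genotypes: (n_samples, n_snps) float array, NaN marks a missing call
    (an assumed encoding). Returns the filtered matrix and a boolean mask
    of the columns that were kept.
    """
    G = np.array(genotypes, dtype=float)   # copy, so the input is untouched
    col_means = np.nanmean(G, axis=0)      # per-SNP mean over observed calls
    rows, cols = np.where(np.isnan(G))
    G[rows, cols] = col_means[cols]        # fill each gap with its SNP mean
    keep = G.var(axis=0) > 0               # constant columns carry no signal
    return G[:, keep], keep

# Three samples, three SNPs: the middle SNP is constant, the last has a gap.
G = np.array([[0.0, 1.0, 2.0],
              [1.0, 1.0, np.nan],
              [2.0, 1.0, 0.0]])
filtered, keep = impute_and_filter(G)
```

Here the constant middle column is dropped and the missing call in the last column is replaced by that SNP's observed mean.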

genboostgpu.enet_boosting

Implements the boosting loop, Optuna-based ElasticNet tuning, and final ridge refit.
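
The general idea of boosting followed by a ridge refit can be sketched in a few lines of NumPy. This is a generic residual-boosting illustration, not the genboostgpu.enet_boosting code: closed-form ridge stands in for the elastic net fit, Optuna tuning is omitted, and the function names, random SNP batching, learning rate, and top_k selection rule are all hypothetical choices made for the example.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients: solve (X'X + alpha*I) b = X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def boost_and_refit(X, y, n_iter=50, batch_size=5, learning_rate=0.5,
                    alpha=1.0, top_k=2, seed=0):
    rng = np.random.default_rng(seed)
    n_snps = X.shape[1]
    Xc = X - X.mean(axis=0)            # center so no intercept is needed
    yc = y - y.mean()
    coef = np.zeros(n_snps)
    residual = yc.copy()
    for _ in range(n_iter):
        # Fit a penalized model to the current residuals on a random SNP batch
        # (ridge here, as a stand-in for elastic net).
        batch = rng.choice(n_snps, size=min(batch_size, n_snps), replace=False)
        beta = ridge_fit(Xc[:, batch], residual, alpha)
        coef[batch] += learning_rate * beta
        residual -= learning_rate * (Xc[:, batch] @ beta)
    # Keep the SNPs with the largest accumulated effects and refit them jointly.
    selected = np.sort(np.argsort(np.abs(coef))[-top_k:])
    final_beta = ridge_fit(Xc[:, selected], yc, alpha)
    pred = Xc[:, selected] @ final_beta
    r2 = 1.0 - np.sum((yc - pred) ** 2) / np.sum(yc ** 2)   # in-sample R^2
    return selected, final_beta, r2

# Synthetic data: only SNPs 0 and 3 affect the phenotype.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.1, size=200)
selected, final_beta, r2 = boost_and_refit(X, y)
```

The R² computed here is in-sample; an honest estimate of variance explained would use held-out samples.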

genboostgpu.cpg_orchestration

Provides CpG-centric orchestration utilities for scheduling boosting tasks across traits, chromosomes, or distributed Dask workers.

genboostgpu.orchestration

High-level entry point. Launches genboostgpu.vmr_runner.run_single_window() across windows, optionally using dask_cuda.LocalCUDACluster for multi-GPU execution.

Putting it together

  1. Build a list of windows (chromosome, start, end, phenotype ID/path).

  2. Load or provide genotype/phenotype objects (genboostgpu.data_io).

  3. Call genboostgpu.orchestration.run_windows_with_dask() to schedule work.

  4. Inspect the resulting pandas DataFrame along with the saved Parquet/TSV files.
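
Step 1 can be sketched in plain Python. The helper build_windows, the dictionary keys, and the 500 kb flank are hypothetical: the exact window layout expected by run_windows_with_dask() is not specified here, so check the API reference before passing such a list to it.

```python
def build_windows(features, window_size=500_000):
    """Build one cis-window per feature.

    features: iterable of (phenotype_id, chrom, position) tuples.
    Returns a list of dicts with assumed keys chrom/start/end/phenotype_id.
    """
    windows = []
    for phenotype_id, chrom, position in features:
        windows.append({
            "chrom": chrom,
            "start": max(0, position - window_size),  # clamp at chromosome start
            "end": position + window_size,
            "phenotype_id": phenotype_id,
        })
    return windows

# Two made-up features for illustration.
features = [("cg0001", "chr1", 120_000), ("cg0002", "chr2", 3_400_000)]
windows = build_windows(features)
```

Clamping the start at zero keeps windows valid for features near the beginning of a chromosome.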

For more detailed orchestration examples, see tutorials/index.