Workflow
At a high level, GENBoostGPU loads data from disk or memory, filters and scores SNPs, trains elastic net models in a boosting loop, and evaluates variance explained. The diagram below highlights the major stages.
+---------------+      +------------------+      +-----------------+      +------------------+
| Data input    | ---> | SNP preprocessing| ---> | Boosting elastic| ---> | Evaluation &     |
| (PLINK, CuPy) |      | (filtering, LD)  |      | net iterations  |      | persistence      |
+---------------+      +------------------+      +-----------------+      +------------------+
        |                       |                         |                        |
        v                       v                         v                        v
 data_io.load_*         snp_processing.*      enet_boosting.boosting_*   data_io.save_results
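To make the preprocessing stage concrete, here is a minimal pure-Python sketch of three of the steps it covers: cis-window selection, mean imputation of missing calls, and zero-variance filtering. It is an illustration only and does not use the snp_processing API. The function names, the list-based genotype layout, and the order of the steps are assumptions made for this example, and LD clumping is omitted.

```python
from statistics import fmean, pvariance
from typing import List, Optional, Sequence, Tuple


def impute_mean(column: Sequence[Optional[float]]) -> List[float]:
    """Replace missing calls (None) with the mean of the observed values."""
    observed = [g for g in column if g is not None]
    fill = fmean(observed) if observed else 0.0
    return [fill if g is None else float(g) for g in column]


def preprocess(
    genotypes: Sequence[Sequence[Optional[float]]],
    positions: Sequence[int],
    window_start: int,
    window_end: int,
) -> Tuple[List[int], List[List[float]]]:
    """Keep SNPs inside the window that still vary after imputation.

    `genotypes` holds one sequence per SNP (dosages per sample, None = missing);
    this layout is hypothetical and chosen only to keep the sketch short.
    """
    kept_positions: List[int] = []
    kept_columns: List[List[float]] = []
    for pos, column in zip(positions, genotypes):
        if not (window_start <= pos <= window_end):
            continue  # outside the cis window
        filled = impute_mean(column)
        if pvariance(filled) == 0.0:
            continue  # monomorphic SNP: no variance, nothing to learn from
        kept_positions.append(pos)
        kept_columns.append(filled)
    return kept_positions, kept_columns
```

Selecting the window first keeps the imputation and variance checks limited to SNPs that could enter the model; the package itself operates on CuPy/cuDF objects rather than Python lists.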
Module responsibilities
genboostgpu.data_io: Reads PLINK and phenotype files, emits CuPy/cuDF objects, and saves outputs to TSV/Parquet.
genboostgpu.snp_processing: Applies zero-variance filtering, missing value imputation, cis-window selection, and LD clumping.
genboostgpu.enet_boosting: Implements the boosting loop, Optuna-based ElasticNet tuning, and final ridge refit.
genboostgpu.cpg_orchestration: CpG-centric orchestration utilities for scheduling boosting tasks across traits, chromosomes, or distributed Dask workers.
genboostgpu.orchestration: High-level entry point. Launches genboostgpu.vmr_runner.run_single_window() across windows, optionally using dask_cuda.LocalCUDACluster for multi-GPU execution.
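The idea behind a boosting loop over elastic net fits can be sketched in plain Python. The example below uses one common formulation: fit an elastic net to the current residuals by coordinate descent, add a shrunken copy of the coefficients to the running model, and stop once variance explained no longer improves. This is an assumption about the general technique, not a description of enet_boosting, which also performs Optuna tuning and a final ridge refit on GPU arrays. All function names, defaults, and the stopping rule here are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are samples, columns are SNPs


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold (the L1 part of the update)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X: Matrix, y: List[float], alpha: float = 0.1,
                    l1_ratio: float = 0.5, n_sweeps: int = 50) -> List[float]:
    """Coordinate descent for
    (1/2n)*||y - Xb||^2 + alpha*l1_ratio*||b||_1 + 0.5*alpha*(1-l1_ratio)*||b||^2.
    No intercept is fitted, so y is assumed to be centered."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero column cannot contribute
            # Covariance of column j with the residual that excludes column j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, alpha * l1_ratio) / (col_sq[j] + alpha * (1 - l1_ratio))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def variance_explained(y: List[float], y_hat: List[float]) -> float:
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((v - h) ** 2 for v, h in zip(y, y_hat))
    return 1.0 - ss_res / ss_tot


def boost_elastic_net(X: Matrix, y: List[float], n_rounds: int = 20,
                      learning_rate: float = 0.5,
                      tol: float = 1e-4) -> Tuple[List[float], float]:
    """Repeatedly fit the residuals and accumulate shrunken coefficients."""
    beta = [0.0] * len(X[0])
    pred = [0.0] * len(X)
    best_r2 = variance_explained(y, pred)
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        step = fit_elastic_net(X, resid)
        beta = [b + learning_rate * s for b, s in zip(beta, step)]
        pred = [sum(x * b for x, b in zip(row, beta)) for row in X]
        r2 = variance_explained(y, pred)
        if r2 - best_r2 < tol:
            break  # improvement too small to justify another round
        best_r2 = r2
    return beta, variance_explained(y, pred)
```

With a learning rate below 1, each round moves only part of the way toward the current fit, which is what makes the stopping rule on variance explained meaningful.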
Putting it together
1. Build a list of windows (chromosome, start, end, phenotype ID/path).
2. Load or provide genotype/phenotype objects (genboostgpu.data_io).
3. Call genboostgpu.orchestration.run_windows_with_dask() to schedule work.
4. Inspect the resulting pandas DataFrame plus the saved Parquet/TSV files.
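As an illustration of the first step, the sketch below builds a window list from a table of regions. The dictionary keys, the helper name build_windows, and the flanking distance are assumptions for this example; consult the API reference for the exact fields that run_windows_with_dask() expects. The call to the orchestration function is deliberately left out because its signature is not shown here.

```python
from typing import Dict, List, Sequence, Tuple, Union

Region = Tuple[str, int, int, str]  # (chromosome, start, end, phenotype ID or path)
Window = Dict[str, Union[str, int]]


def build_windows(regions: Sequence[Region], flank: int = 500_000) -> List[Window]:
    """Expand each region by `flank` base pairs on both sides.

    The key names below are hypothetical placeholders, not the package schema.
    """
    windows: List[Window] = []
    for chrom, start, end, pheno in regions:
        windows.append({
            "chrom": chrom,
            "start": max(0, start - flank),  # never extend past the chromosome start
            "end": end + flank,
            "pheno_id": pheno,
        })
    return windows


# Example regions; the IDs and coordinates are made up for illustration.
example = build_windows([
    ("chr1", 1_000_000, 1_000_500, "region_a"),
    ("chr2", 100, 200, "region_b"),
])
```

The resulting list would then be handed to the orchestration entry point along with the genotype and phenotype objects from step 2.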
For more detailed orchestration examples, see tutorials/index.