Data & formats

GENBoostGPU accepts genomic inputs from a handful of common formats. Regardless of the source, data are converted to GPU-friendly representations before being passed to genboostgpu.enet_boosting.

Supported inputs

PLINK (BED/BIM/FAM)

Use genboostgpu.data_io.load_genotypes(), which wraps pandas_plink. Genotypes are returned as a CuPy array with samples on rows and SNPs on columns, plus the accompanying BIM/FAM tables (pandas or cuDF).

CuPy / NumPy arrays

When you already hold genotypes in memory, pass the CuPy array directly via geno_arr. NumPy arrays are accepted but will be copied to the GPU with cupy.asarray().

Parquet/TSV phenotypes

Phenotypes are typically stored as tab-separated files, one per CpG or VMR. Use genboostgpu.data_io.load_phenotypes() to read them into a cuDF DataFrame. Results written by genboostgpu.data_io.save_results() are Parquet or TSV files containing betas, variance explained, and metadata.

If you are preparing large-scale CpG inputs, see CpG pipeline for the recommended workflow. It documents the helper script scripts/prepare_cpg_inputs.R and the manifest/phenotype templates used by examples/cpg_test_million.py.

Sample and variant alignment

  • Samples must be in the same order across genotype and phenotype matrices. load_genotypes retains the FAM order and load_phenotypes preserves file order, so alignments are deterministic.

  • The BIM table needs chrom, pos, and snp columns. It can be a pandas DataFrame or cuDF DataFrame. The genboostgpu.snp_processing.filter_cis_window() helper relies on those names to pull the right SNP indices.

  • When providing preloaded arrays, supply the matching BIM metadata so cis-window filtering can locate positions.
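The alignment requirements above can be illustrated with a minimal CPU sketch using pandas and NumPy. It is not the implementation of filter_cis_window(): the helper names check_alignment and cis_window_indices, their arguments, and the toy BIM values are hypothetical. Only the chrom, pos, and snp column names and the samples-on-rows layout come from the text above.

```python
import numpy as np
import pandas as pd

# Toy BIM table with the three required columns (values are made up).
bim = pd.DataFrame({
    "chrom": ["1", "1", "1", "2"],
    "pos": [100, 5_000, 90_000, 4_000],
    "snp": ["rs1", "rs2", "rs3", "rs4"],
})

# Toy genotype matrix: samples on rows, SNPs on columns.
geno = np.array([[0, 1, 2, 0],
                 [1, 1, 0, 2],
                 [2, 0, 1, 1]], dtype=np.float32)


def check_alignment(geno, bim, n_samples):
    """Raise if the matrix shape does not match the sample count and BIM rows."""
    expected = (n_samples, len(bim))
    if geno.shape != expected:
        raise ValueError(f"expected shape {expected}, got {geno.shape}")


def cis_window_indices(bim, chrom, start, end, window):
    """Return column indices of SNPs within `window` bp of [start, end] on `chrom`."""
    mask = (
        (bim["chrom"].astype(str) == str(chrom))
        & (bim["pos"] >= start - window)
        & (bim["pos"] <= end + window)
    )
    return np.flatnonzero(mask.to_numpy())


check_alignment(geno, bim, n_samples=3)
idx = cis_window_indices(bim, chrom="1", start=4_000, end=6_000, window=5_000)
cis_geno = geno[:, idx]  # keep only the SNP columns inside the window
```

Because the indices are positions in the BIM table, they only select the right genotype columns when the BIM rows and the matrix columns are in the same order, which is why preloaded arrays need their matching BIM metadata.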

Handling missingness and QC

genboostgpu.snp_processing standardises preprocessing so the models see a clean matrix:

  • filter_zero_variance removes monomorphic SNPs (default threshold 1e-8).

  • impute_snps wraps cuml.preprocessing.SimpleImputer. The default strategy (most_frequent) is well suited to hard-call genotypes; use mean for dosage-style data.

  • run_ld_clumping performs phenotype-informed LD pruning using Pearson correlations from _corr_with_y_streaming.
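To make the first two steps concrete, here is a minimal NumPy sketch of zero-variance filtering followed by most-frequent imputation. It runs on the CPU and is not the library's code: the function names filter_zero_variance_cpu and impute_most_frequent_cpu are hypothetical, and whether the real threshold comparison is strict is an assumption. Only the 1e-8 default and the most_frequent strategy come from the list above.

```python
import numpy as np


def filter_zero_variance_cpu(geno, threshold=1e-8):
    """Drop columns whose variance (ignoring NaN) is not above `threshold`."""
    var = np.nanvar(geno, axis=0)
    keep = var > threshold  # assumption: strict comparison
    return geno[:, keep], np.flatnonzero(keep)


def impute_most_frequent_cpu(geno):
    """Replace NaN in each column with that column's most common observed value."""
    out = geno.copy()
    for j in range(out.shape[1]):
        col = out[:, j]  # view into `out`, so assignment below edits `out`
        missing = np.isnan(col)
        if missing.any() and not missing.all():
            values, counts = np.unique(col[~missing], return_counts=True)
            # np.unique sorts values, so ties resolve to the smallest value.
            col[missing] = values[np.argmax(counts)]
    return out


# Toy hard-call matrix: samples on rows, SNPs on columns, NaN marks missing calls.
geno = np.array([[0.0, 1.0, 2.0],
                 [0.0, np.nan, 2.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 2.0, np.nan]])

filtered, kept = filter_zero_variance_cpu(geno)  # first column is monomorphic
imputed = impute_most_frequent_cpu(filtered)
```

For dosage-style data the mode is rarely meaningful because values are continuous, which is the reason the mean strategy is suggested in that case.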

Missing CpG values are handled when the phenotype table is read, and the values are normalised in genboostgpu.vmr_runner.run_single_window(). If you need custom logic (e.g., dropping low-quality samples), apply it before invoking the pipeline and pass the cleaned CuPy vector via pheno.
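As one example of such custom logic, the sketch below drops samples whose phenotype is missing and applies the same mask to the genotype rows so the two inputs stay aligned. It uses NumPy on the CPU; the helper name drop_missing_samples and the toy values are hypothetical and not part of the package.

```python
import numpy as np


def drop_missing_samples(geno, pheno):
    """Remove samples (rows) whose phenotype is NaN from both inputs."""
    keep = ~np.isnan(pheno)
    return geno[keep], pheno[keep], keep


# Toy inputs: four samples, two SNPs, one missing phenotype value.
geno = np.array([[0, 1], [1, 2], [2, 0], [1, 1]], dtype=np.float32)
pheno = np.array([0.42, np.nan, 0.77, 0.10])

geno_clean, pheno_clean, keep = drop_missing_samples(geno, pheno)
# With CuPy installed and a GPU available, the cleaned arrays could then be
# moved to the device with cupy.asarray(...) before being passed on.
```

Returning the boolean mask as well makes it possible to subset any accompanying sample table in the same way.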