Data & formats
GENBoostGPU accepts genomic inputs from a handful of common formats. Regardless
of the source, data are converted to GPU-friendly representations before hitting
genboostgpu.enet_boosting.
Supported inputs
- PLINK (BED/BIM/FAM)
  Use genboostgpu.data_io.load_genotypes(), which wraps pandas_plink. Genotypes
  are returned as a CuPy array with samples on rows and SNPs on columns, plus
  the accompanying BIM/FAM tables (pandas or cuDF).
- CuPy / NumPy arrays
  When you already hold genotypes in memory, pass the CuPy array directly via
  geno_arr. NumPy arrays are accepted but will be copied to the GPU with
  cupy.asarray().
- Parquet/TSV phenotypes
  Phenotypes are typically stored as tab-separated files per CpG or VMR. Use
  genboostgpu.data_io.load_phenotypes() to read them into a cuDF DataFrame.
  Results written by genboostgpu.data_io.save_results() are Parquet or TSV
  files with betas, variance explained, and metadata.
If you are preparing large-scale CpG inputs, see CpG pipeline for the
recommended workflow. It documents the helper script
scripts/prepare_cpg_inputs.R and the manifest/phenotype templates used by
examples/cpg_test_million.py.
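As an illustration of the in-memory route, the sketch below lays out a small genotype matrix with samples on rows and SNPs on columns. The toy values, the variable names, and the choice of a contiguous float32 layout are assumptions made for the example, not requirements stated by GENBoostGPU. The host-to-device copy is shown only in a comment so that the snippet also runs on a machine without a GPU.

```python
import numpy as np

# Hypothetical in-memory genotypes: 4 samples (rows) x 3 SNPs (columns),
# coded as 0/1/2 allele counts.
geno_host = np.array(
    [[0, 1, 2],
     [1, 1, 0],
     [2, 0, 0],
     [0, 2, 1]],
    dtype=np.int8,
)

n_samples, n_snps = geno_host.shape  # rows are samples, columns are SNPs

# Assumption: a C-contiguous float32 matrix is a convenient form to copy
# to the GPU in a single transfer.
geno_ready = np.ascontiguousarray(geno_host, dtype=np.float32)

# On a machine with CuPy and a GPU, the copy described above would be:
#   import cupy as cp
#   geno_arr = cp.asarray(geno_ready)
```

Doing the conversion once up front avoids repeated host-to-device copies when the same matrix is reused.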
Sample and variant alignment
- Samples must be in the same order across genotype and phenotype matrices.
  load_genotypes retains the FAM order and load_phenotypes preserves file
  order, so alignments are deterministic.
- The BIM table needs chrom, pos, and snp columns. It can be a pandas DataFrame
  or cuDF DataFrame. The genboostgpu.snp_processing.filter_cis_window() helper
  relies on those names to pull the right SNP indices.
- When providing preloaded arrays, supply the matching BIM metadata so
  cis-window filtering can locate positions.
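The alignment rules above can be sketched on the CPU with pandas and NumPy. Both helpers below, align_samples and cis_window_indices, are hypothetical names written for this example. They are not the implementation or signature of filter_cis_window, and the window size and toy positions are made up.

```python
import numpy as np
import pandas as pd

def align_samples(geno_ids, pheno):
    """Reorder a phenotype Series (indexed by sample ID) to genotype order."""
    missing = [s for s in geno_ids if s not in pheno.index]
    if missing:
        raise ValueError(f"samples missing from phenotypes: {missing}")
    return pheno.loc[list(geno_ids)]

def cis_window_indices(bim, chrom, start, end, window=500_000):
    """Column indices of SNPs on `chrom` within `window` bp of [start, end]."""
    lo = max(start - window, 0)
    hi = end + window
    on_chrom = bim["chrom"].astype(str) == str(chrom)
    in_range = bim["pos"].between(lo, hi)  # inclusive on both ends
    return np.flatnonzero((on_chrom & in_range).to_numpy())

# Toy BIM table with the three required column names.
bim = pd.DataFrame({
    "chrom": ["1", "1", "1", "2"],
    "pos": [100, 5_000, 900_000, 150],
    "snp": ["rs1", "rs2", "rs3", "rs4"],
})

# Phenotypes stored in a different order from the genotype (FAM) order.
pheno = pd.Series([3.0, 1.0, 2.0], index=["s3", "s1", "s2"])
geno_ids = ["s1", "s2", "s3"]

aligned = align_samples(geno_ids, pheno)
idx = cis_window_indices(bim, chrom="1", start=4_000, end=6_000, window=1_000)
```

The returned indices refer to rows of the BIM table, which correspond to SNP columns of the genotype matrix, so the BIM row order has to match the column order of any preloaded array.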
Handling missingness and QC
genboostgpu.snp_processing standardises preprocessing so the models see a
clean matrix:
- filter_zero_variance removes monomorphic SNPs (default threshold 1e-8).
- impute_snps wraps cuml.preprocessing.SimpleImputer. The default strategy
  (most_frequent) is well suited for hard-call genotypes; use mean for
  dosage-style data.
- run_ld_clumping performs phenotype-informed LD pruning using Pearson
  correlations from _corr_with_y_streaming.
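To make the three steps concrete, here is a minimal CPU sketch in NumPy of what each one computes. It is not the library's code: the function names are hypothetical, the strict "greater than threshold" comparison is an assumption, and the correlation is computed in one pass rather than streamed.

```python
import numpy as np

def drop_zero_variance(geno, threshold=1e-8):
    """Keep SNP columns whose variance (ignoring NaNs) exceeds threshold."""
    keep = np.nanvar(geno, axis=0) > threshold
    return geno[:, keep], keep

def impute_most_frequent(geno):
    """Fill NaNs in each SNP column with that column's most common value."""
    out = geno.copy()
    for j in range(out.shape[1]):
        col = out[:, j]  # view into `out`, so assignment updates `out`
        observed = col[~np.isnan(col)]
        if observed.size == 0:
            continue  # nothing to learn from an all-missing column
        values, counts = np.unique(observed, return_counts=True)
        col[np.isnan(col)] = values[np.argmax(counts)]
    return out

def corr_with_y(geno, y):
    """Pearson correlation of every SNP column with the phenotype y."""
    gx = geno - geno.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((gx ** 2).sum(axis=0) * (yc ** 2).sum())
    return gx.T @ yc / denom

# Toy matrix: 4 samples x 3 SNPs, with one monomorphic SNP and two gaps.
geno = np.array([
    [0.0, 1.0, 2.0],
    [0.0, np.nan, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 2.0, np.nan],
])

filtered, keep = drop_zero_variance(geno)
imputed = impute_most_frequent(filtered)
y = 2.0 * imputed[:, 0] + 1.0  # phenotype perfectly linear in the first SNP
r = corr_with_y(imputed, y)
```

In a phenotype-informed pruning scheme, correlations such as r would typically be used to decide which SNP to retain within a group of SNPs in high LD; the exact rule used by run_ld_clumping is not specified here.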
Missing CpG values are handled when reading the phenotype table—the values are
normalised in genboostgpu.vmr_runner.run_single_window(). If you need
custom logic (e.g., dropping low-quality samples), perform it before invoking the
pipeline and pass the cleaned CuPy vector via pheno.
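One example of such custom logic is dropping samples whose phenotype is missing, applied to genotypes and phenotypes together so the row order stays aligned. The helper name drop_missing_pheno_samples and the toy values are hypothetical, and the GPU hand-off is shown only as a comment.

```python
import numpy as np

def drop_missing_pheno_samples(geno, pheno):
    """Remove samples (rows) with a NaN phenotype from both inputs."""
    keep = ~np.isnan(pheno)
    return geno[keep], pheno[keep], keep

# Toy inputs: 4 samples x 2 SNPs, with one missing phenotype value.
geno = np.array([[0.0, 1.0],
                 [1.0, 2.0],
                 [2.0, 0.0],
                 [1.0, 1.0]])
pheno = np.array([0.2, np.nan, 0.5, 0.9])

geno_clean, pheno_clean, keep = drop_missing_pheno_samples(geno, pheno)

# On a GPU machine, the cleaned vector would then be moved to the device
# before being passed via pheno:
#   import cupy as cp
#   pheno_gpu = cp.asarray(pheno_clean)
```

Because the same boolean mask indexes both arrays, the remaining samples keep their original relative order.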