Data & formats ============== GENBoostGPU accepts genomic inputs from a handful of common formats. Regardless of the source, data are converted to GPU-friendly representations before hitting :mod:`genboostgpu.enet_boosting`. Supported inputs ---------------- PLINK (BED/BIM/FAM) Use :func:`genboostgpu.data_io.load_genotypes`, which wraps :mod:`pandas_plink`. Genotypes are returned as a CuPy array with samples on rows and SNPs on columns, plus the accompanying BIM/FAM tables (pandas or cuDF). CuPy / NumPy arrays When you already hold genotypes in memory, pass the CuPy array directly via ``geno_arr``. NumPy arrays are accepted but will be copied to the GPU with :func:`cupy.asarray`. Parquet/TSV phenotypes Phenotypes are typically stored as tab-separated files per CpG or VMR. Use :func:`genboostgpu.data_io.load_phenotypes` to read them into a cuDF DataFrame. Results written by :func:`genboostgpu.data_io.save_results` are parquet or TSV files with betas, variance explained, and metadata. If you are preparing large-scale CpG inputs, see :doc:`cpg_pipeline` for the recommended workflow. It documents the helper script ``scripts/prepare_cpg_inputs.R`` and the manifest/phenotype templates used by ``examples/cpg_test_million.py``. Sample and variant alignment ---------------------------- * Samples must be in the same order across genotype and phenotype matrices. ``load_genotypes`` retains the FAM order and ``load_phenotypes`` preserves file order, so alignments are deterministic. * The BIM table needs ``chrom``, ``pos``, and ``snp`` columns. It can be a pandas DataFrame or cuDF DataFrame. The :func:`genboostgpu.snp_processing.filter_cis_window` helper relies on those names to pull the right SNP indices. * When providing preloaded arrays, supply the matching BIM metadata so cis-window filtering can locate positions. Handling missingness and QC --------------------------- :mod:`genboostgpu.snp_processing` standardises preprocessing so the models see a clean matrix: * ``filter_zero_variance`` removes monomorphic SNPs (default threshold ``1e-8``). * ``impute_snps`` wraps ``cuml.preprocessing.SimpleImputer``. The default strategy (``most_frequent``) is well suited for hard-call genotypes; use ``mean`` for dosage-style data. * ``run_ld_clumping`` performs phenotype-informed LD pruning using Pearson correlations from ``_corr_with_y_streaming``. Missing CpG values are handled when reading the phenotype table—the values are normalised in :func:`genboostgpu.vmr_runner.run_single_window`. If you need custom logic (e.g., dropping low-quality samples), perform it before invoking the pipeline and pass the cleaned CuPy vector via ``pheno``.