CpG pipeline
============

Overview
--------

This pipeline is designed for large-scale CpG heritability estimation using
GENBoostGPU's chromosome-wise CpG workflow and cross-chromosome hyperparameter
tuning entry points.

Inputs and expected files
-------------------------

The concrete templates and directory layout expected by the CpG pipeline are
defined by ``examples/cpg_test_million.py``:

* CpG manifest template: ``data/cpg_manifests/cpg_manifest_chr{chrom}.parquet``
* Phenotype template: ``data/phenotypes/pheno_chr{chrom}.parquet``
* Genotype PLINK prefix: ``data/genotypes/genotypes``

These templates are exposed directly as the ``--cpg-manifest-template`` and
``--pheno-template`` arguments in ``examples/cpg_test_million.py``.

Preparing inputs from BSseq
---------------------------

In-memory BSseq workflow
~~~~~~~~~~~~~~~~~~~~~~~~

If you already have a ``BSseq`` object in memory, first persist it to disk:

.. code-block:: r

   saveRDS(bs, "data/bsseq.rds")

Helper script command
~~~~~~~~~~~~~~~~~~~~~

Use the helper script to generate per-chromosome manifest and phenotype files:

.. code-block:: bash

   Rscript scripts/prepare_cpg_inputs.R --bsseq data/bsseq.rds --output data

Key flags
~~~~~~~~~

The most important options for large-scale runs are:

* ``--sample-id-col``: Column in ``colData(bs)`` that should be treated as the canonical sample identifier.
* ``--validate-fam``: Load the genotype ``.fam`` file and confirm its sample IDs match the BSseq sample IDs before writing outputs.
* ``--no-smooth``: Skip smoothing and use the raw methylation proportions.
* ``--min-cov``: Minimum coverage required for a CpG to be retained.
* ``--chromosomes``: Restrict processing to a subset of chromosomes.

Generated directories
~~~~~~~~~~~~~~~~~~~~~

By default, the helper script writes the following directories under the
``--output`` root:

* ``cpg_manifests/``: Per-chromosome CpG manifests.
* ``phenotypes/``: Per-chromosome phenotype matrices aligned to the manifests.

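
Before launching a run, it can help to confirm that the generated files match
the templates listed under "Inputs and expected files". The following is a
minimal sketch, not part of GENBoostGPU: ``missing_inputs`` is a hypothetical
helper, and the chromosome labels (autosomes 1 to 22, substituted into
``{chrom}`` as plain numbers) are an assumption that should be matched to
whatever was passed to ``--chromosomes``.

.. code-block:: python

   from pathlib import Path

   # Templates as listed in this document; adjust them if you used a
   # different output root.
   CPG_MANIFEST_TEMPLATE = "data/cpg_manifests/cpg_manifest_chr{chrom}.parquet"
   PHENO_TEMPLATE = "data/phenotypes/pheno_chr{chrom}.parquet"


   def missing_inputs(chromosomes, templates=(CPG_MANIFEST_TEMPLATE, PHENO_TEMPLATE)):
       """Return the expanded template paths that do not exist on disk.

       Hypothetical helper for illustration; it only checks file presence,
       not file contents.
       """
       missing = []
       for chrom in chromosomes:
           for template in templates:
               path = Path(template.format(chrom=chrom))
               if not path.is_file():
                   missing.append(str(path))
       return missing


   if __name__ == "__main__":
       # Assumption: autosomes 1-22 with numeric labels.
       for path in missing_inputs(range(1, 23)):
           print(f"missing: {path}")

An empty result means every expected manifest and phenotype file is present
for the chromosomes that were checked.
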
Running the pipeline
--------------------

A minimal end-to-end run looks like:

.. code-block:: bash

   python examples/cpg_test_million.py --geno-path data/genotypes/genotypes

Template overrides for non-default output roots
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you wrote outputs somewhere other than ``data/``, override the templates so
that ``{chrom}`` points at the new location:

.. code-block:: bash

   python examples/cpg_test_million.py \
     --geno-path alt_data/genotypes/genotypes \
     --cpg-manifest-template alt_data/cpg_manifests/cpg_manifest_chr{chrom}.parquet \
     --pheno-template alt_data/phenotypes/pheno_chr{chrom}.parquet

Sample alignment guidance
-------------------------

Sample alignment is critical: genotype ``.fam`` IDs must match the BSseq sample
IDs used to build the phenotype matrices. Use ``--validate-fam`` in
``scripts/prepare_cpg_inputs.R`` to catch mismatches early in the preparation
step.

Entry points to search for
--------------------------

To understand or customize the pipeline internals, start with these entry
points:

* ``examples/cpg_test_million.py``
* ``genboostgpu.run_cpgs_by_chromosome``
* ``genboostgpu.global_tune_cpg_params``
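
Returning to the sample alignment guidance above: if you want to double-check
alignment outside the R helper, the comparison can be sketched in a few lines
of Python. This is an illustration only and does not reproduce what
``--validate-fam`` does internally. ``read_fam_ids`` and
``compare_sample_ids`` are hypothetical helpers, and the sketch assumes that
the relevant identifier is the second column (within-family ID) of the PLINK
``.fam`` file. How sample IDs are stored in the phenotype files is not
specified here, so they are passed in as a plain list.

.. code-block:: python

   from pathlib import Path


   def read_fam_ids(fam_path):
       """Return the second column of a PLINK .fam file, in file order.

       Assumption: the within-family ID (column 2) is the identifier that
       should match the BSseq sample IDs.
       """
       ids = []
       for line in Path(fam_path).read_text().splitlines():
           fields = line.split()
           if len(fields) >= 2:  # skip blank or malformed lines
               ids.append(fields[1])
       return ids


   def compare_sample_ids(fam_ids, pheno_ids):
       """Summarize differences between genotype and phenotype sample IDs."""
       fam_set, pheno_set = set(fam_ids), set(pheno_ids)
       return {
           "only_in_fam": sorted(fam_set - pheno_set),
           "only_in_pheno": sorted(pheno_set - fam_set),
           # Reported for information; whether order matters depends on
           # how the pipeline joins samples.
           "same_order": list(fam_ids) == list(pheno_ids),
       }

With the default layout, the ``.fam`` file sits next to the PLINK prefix, so
the call would be ``read_fam_ids("data/genotypes/genotypes.fam")``. Two empty
``only_in_*`` lists indicate that both sides contain the same set of samples.
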