CpG pipeline

Overview

This pipeline estimates CpG heritability at large scale. It builds on two GENBoostGPU entry points: the chromosome-wise CpG workflow and cross-chromosome hyperparameter tuning.

Inputs and expected files

The concrete templates and directory layout expected by the CpG pipeline are defined by examples/cpg_test_million.py:

  • CpG manifest template: data/cpg_manifests/cpg_manifest_chr{chrom}.parquet

  • Phenotype template: data/phenotypes/pheno_chr{chrom}.parquet

  • Genotype PLINK prefix: data/genotypes/genotypes

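The manifest and phenotype templates are exposed directly as the --cpg-manifest-template and --pheno-template arguments in examples/cpg_test_million.py; the genotype prefix is passed with --geno-path. In each template, {chrom} is a placeholder for the chromosome label.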
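
To see which files a given template resolves to, the placeholder can be expanded with ordinary Python string formatting. The sketch below is illustrative only: expand_templates is a hypothetical helper, not part of GENBoostGPU, and the 1 to 22 chromosome labels are an assumption that should be adjusted to match your files.

```python
# Default templates, copied from the list above.
CPG_MANIFEST_TEMPLATE = "data/cpg_manifests/cpg_manifest_chr{chrom}.parquet"
PHENO_TEMPLATE = "data/phenotypes/pheno_chr{chrom}.parquet"


def expand_templates(chromosomes):
    """Return {chrom: (manifest_path, pheno_path)} for each chromosome label.

    Hypothetical helper for illustration; it only formats strings and does
    not check that the files exist.
    """
    return {
        str(chrom): (
            CPG_MANIFEST_TEMPLATE.format(chrom=chrom),
            PHENO_TEMPLATE.format(chrom=chrom),
        )
        for chrom in chromosomes
    }


# Assumption: autosomes are labelled 1..22.
paths = expand_templates(range(1, 23))
manifest_path, pheno_path = paths["21"]
print(manifest_path)
print(pheno_path)
```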

Preparing inputs from BSseq

In-memory BSseq workflow

If you already have a BSseq object in memory, first persist it to disk:

saveRDS(bs, "data/bsseq.rds")

Helper script command

Use the helper script to generate per-chromosome manifest and phenotype files:

Rscript scripts/prepare_cpg_inputs.R --bsseq data/bsseq.rds --output data

Key flags

The most important options for large-scale runs are:

  • --sample-id-col: Column in colData(bs) that should be treated as the canonical sample identifier.

  • --validate-fam: Load the genotype .fam file and confirm its sample IDs match the BSseq sample IDs before writing outputs.

  • --no-smooth: Skip smoothing and use the raw methylation proportions.

  • --min-cov: Minimum coverage required for a CpG to be retained.

  • --chromosomes: Restrict processing to a subset of chromosomes.

Generated directories

By default, the helper script writes the following directories under the --output root:

  • cpg_manifests/: Per-chromosome CpG manifests.

  • phenotypes/: Per-chromosome phenotype matrices aligned to the manifests.
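
Before launching a long run, it can help to confirm that every chromosome has both a manifest and a phenotype file. The following is a minimal sketch, assuming the helper script names its files according to the default templates (cpg_manifest_chr{chrom}.parquet and pheno_chr{chrom}.parquet); both function names are hypothetical.

```python
import re
from pathlib import Path


def chromosomes_in(directory, prefix):
    """Collect chromosome labels from files named <prefix><chrom>.parquet."""
    pattern = re.compile(re.escape(prefix) + r"(.+)\.parquet$")
    labels = set()
    # glob() yields nothing if the directory does not exist.
    for path in Path(directory).glob(f"{prefix}*.parquet"):
        match = pattern.match(path.name)
        if match:
            labels.add(match.group(1))
    return labels


def unpaired_chromosomes(output_root):
    """Return labels that have a manifest or a phenotype file, but not both."""
    root = Path(output_root)
    # Assumption: file names follow the default templates.
    manifests = chromosomes_in(root / "cpg_manifests", "cpg_manifest_chr")
    phenos = chromosomes_in(root / "phenotypes", "pheno_chr")
    return sorted(manifests ^ phenos)


# An empty list means every chromosome found has both files.
print(unpaired_chromosomes("data"))
```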

Running the pipeline

A minimal end-to-end run looks like:

python examples/cpg_test_million.py --geno-path data/genotypes/genotypes

Template overrides for non-default output roots

If you wrote outputs somewhere other than data/, override the templates so that they point at the new location, keeping the {chrom} placeholder intact:

python examples/cpg_test_million.py \
  --geno-path alt_data/genotypes/genotypes \
  --cpg-manifest-template alt_data/cpg_manifests/cpg_manifest_chr{chrom}.parquet \
  --pheno-template alt_data/phenotypes/pheno_chr{chrom}.parquet
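
When the output root changes often, the same command can be assembled programmatically. This is a sketch under the assumption that the alternative root mirrors the default layout (genotypes/, cpg_manifests/, phenotypes/); build_command is a hypothetical helper, and only the three flags shown above are used.

```python
import shlex
import sys


def build_command(root, script="examples/cpg_test_million.py"):
    """Build the argument list for a run whose inputs live under `root`."""
    root = str(root).rstrip("/")
    return [
        sys.executable,
        script,
        "--geno-path",
        root + "/genotypes/genotypes",
        # {chrom} is passed through literally; the script fills it in.
        "--cpg-manifest-template",
        root + "/cpg_manifests/cpg_manifest_chr{chrom}.parquet",
        "--pheno-template",
        root + "/phenotypes/pheno_chr{chrom}.parquet",
    ]


cmd = build_command("alt_data")
print(shlex.join(cmd))
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids any shell interpretation of the braces in the templates.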

Sample alignment guidance

Sample alignment is critical: the sample IDs in the genotype .fam file must match the BSseq sample IDs used to build the phenotype matrices. Pass --validate-fam to scripts/prepare_cpg_inputs.R to catch mismatches during preparation, before any outputs are written.
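
The same check can be reproduced by hand when debugging a mismatch. The sketch below is not the helper script's implementation. It relies on the standard PLINK .fam layout (whitespace-delimited, no header, individual ID in the second column) and assumes, as a labeled guess, that matching is done on that individual ID. How you obtain the phenotype sample IDs depends on your file layout, so they are passed in as a plain list.

```python
def parse_fam_ids(lines):
    """Return the individual IDs (second column) from PLINK .fam lines."""
    ids = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"line {number}: expected at least 2 columns")
        ids.append(fields[1])
    return ids


def compare_sample_ids(fam_ids, pheno_ids):
    """Summarize differences between genotype and phenotype sample IDs."""
    fam_set, pheno_set = set(fam_ids), set(pheno_ids)
    return {
        "only_in_fam": sorted(fam_set - pheno_set),
        "only_in_pheno": sorted(pheno_set - fam_set),
        "same_order": list(fam_ids) == list(pheno_ids),
    }


# Toy data for illustration; with a real file use:
#   with open("data/genotypes/genotypes.fam") as handle:
#       fam_ids = parse_fam_ids(handle)
fam_ids = parse_fam_ids(["F1 S1 0 0 1 -9", "F2 S2 0 0 2 -9", "F3 S3 0 0 1 -9"])
report = compare_sample_ids(fam_ids, ["S1", "S3", "S4"])
print(report)
```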

Entry points to search for

To understand or customize the pipeline internals, start with these entry points:

  • examples/cpg_test_million.py

  • genboostgpu.run_cpgs_by_chromosome

  • genboostgpu.global_tune_cpg_params
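
One way to explore these entry points without guessing at their parameters is to ask Python for their signatures and docstrings. The sketch below makes no assumption about what the functions accept; describe_entry_point is a hypothetical helper built on the standard importlib and inspect modules.

```python
import importlib
import inspect


def describe_entry_point(dotted_name):
    """Return '<name><signature>: <first docstring line>' for a callable."""
    module_name, _, attr = dotted_name.rpartition(".")
    module = importlib.import_module(module_name)
    obj = getattr(module, attr)
    doc = inspect.getdoc(obj) or ""
    summary = doc.splitlines()[0] if doc else "(no docstring)"
    return f"{dotted_name}{inspect.signature(obj)}: {summary}"


for name in (
    "genboostgpu.run_cpgs_by_chromosome",
    "genboostgpu.global_tune_cpg_params",
):
    try:
        print(describe_entry_point(name))
    except Exception as exc:  # e.g. package not installed in this environment
        print(f"{name}: not available here ({exc})")
```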