Integration YAML

This documentation explains the parameters of the integration configuration YAML file, which is generated by running panpipes integration config.
The steps to run the pipeline are described in the integration workflow.

When running the integration workflow, panpipes provides you with a basic pipeline.yml file. To run the workflow with your own data you need to specify the parameters described below in the pipeline.yml file to meet the requirements of your data. However, we do provide pre-filled versions of the pipeline.yml for individual tutorials.

You can download the different integration pipeline.yml files here:

  • Basic pipeline.yml file (not pre-filled) that is generated when calling panpipes integration config: Download here

  • pipeline.yml for Integration tutorial: View and Download here

For more information on the functionalities implemented in panpipes to read the configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *aliases, please check our documentation.
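As a generic illustration of the block-reuse mechanism mentioned above, the following sketch uses standard YAML syntax: "&" defines an anchor on a block and "*" refers back to it. The key names are hypothetical and are not taken from the panpipes configuration.

```yaml
# Hypothetical keys, shown only to illustrate YAML anchors and aliases
shared_settings: &shared_neighbors   # "&" attaches an anchor to this block
  npcs: 30
  k: 30
reused_settings: *shared_neighbors   # "*" reuses the anchored block unchanged
```

After parsing, reused_settings holds the same npcs and k values as shared_settings, so a block only has to be edited in one place.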

Compute resources options

resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires. Specified by the following parameters:

  • threads_high Integer, Default: 1
    Number of threads used for high intensity computing tasks. For each thread, there must be enough memory to load the MuData object that was created in the preprocessing step of the workflow. In this workflow, all integration, batch correction and dimensionality reduction tasks run with threads_high.

  • threads_medium Integer, Default: 1
    Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your MuData object and do computationally light tasks. In this workflow, collating results after integration and the scib metrics calculation run with threads_medium.

  • threads_low Integer, Default: 1
    Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting, which requires much less memory than the other two. In this workflow, plotting and the LISI calculation run with threads_low.

  • threads_gpu Integer, Default: 2
    Number of cores per GPU used for computing tasks. For each thread, there must be enough memory to compute the tasks above. In this workflow, if the gpu queue is defined below, the scvi algorithms and mofa can run on the GPU; otherwise the threads_high argument is used.

condaenv String
Path to the conda environment that should be used to run panpipes. Leave blank if you are running natively or if your cluster automatically inherits the login node environment.

queues
Allows for tweaking which queues jobs get submitted to, in case there is a special queue for long jobs, or you have access to a gpu-specific queue. The default queue should be specified in your .cgat.yml file. Leave blank if you do not want to use any alternative queues.

  • long

  • gpu
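Put together, the compute options above might look like the following fragment. This is a minimal sketch using the defaults listed in this section; the exact layout of the generated pipeline.yml may differ.

```yaml
resources:
  threads_high: 1
  threads_medium: 1
  threads_low: 1
  threads_gpu: 2
# Left blank: run natively or inherit the login node environment
condaenv:
queues:
  # Left blank: jobs go to the default queue from .cgat.yml
  long:
  gpu:
```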

Loading and merging data options

Data format

sample_prefix String, Mandatory parameter, Default: test
Prefix for the sample that comes out of the filtering/preprocessing steps of the workflow.

preprocessed_obj String, Mandatory parameter
Path to the output file from preprocessing (e.g. ../preprocess/test.h5mu). Ensure that the submission file is in the right format and that the correct path is provided.
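For example, using the default prefix and the example path given above, the two mandatory entries could be filled in as follows (a sketch; adjust the path to your own preprocessing output):

```yaml
sample_prefix: test
# Output of the preprocessing workflow, relative to the integration directory
preprocessed_obj: ../preprocess/test.h5mu
```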

Batch correction

Batch correction is done in unimodal mode, meaning each modality is batch corrected independently.

RNA modality

rna: Batch correction for the RNA modality is specified by the following parameters:

  • run Boolean, Default: True
    Defines if you want the batch correction to run. If set to False, PCA with default parameters is calculated.

  • tools String (comma-separated), Default: harmony,bbknn,scanorama,scvi
    Defines the method used to run batch correction, multiple can be selected and run simultaneously. If left blank, the integration will produce no correction outputs.

    Choices: harmony, bbknn, scanorama, scvi

  • column String (comma-separated), Default: sample_id

    The column name of the covariate you want to batch correct on. If a comma-separated list is specified, all columns will be used simultaneously.
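A minimal sketch of these three entries with their listed defaults is shown below. The nesting under rna follows the bullet structure of this page and is an assumption about the file layout.

```yaml
rna:
  run: True
  # Comma-separated, no spaces; remove tools you do not want to run
  tools: harmony,bbknn,scanorama,scvi
  column: sample_id
```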

Harmony arguments

  • harmony: Basic parameters required to run harmony:

    • sigma Float, Default: 0.1

    • theta Float, Default: 1.0

    • npcs Integer, Default: 30

    For more information on harmony check the harmony documentation

BBKNN arguments

  • bbknn:

    • neighbors_within_batch: Integer, Default: 3

For more information on bbknn check the bbknn documentation
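The harmony and bbknn arguments above could be written as follows. This sketch uses the listed defaults and assumes that both blocks are nested under rna, as the bullet structure suggests.

```yaml
rna:
  harmony:
    sigma: 0.1
    theta: 1.0
    npcs: 30
  bbknn:
    neighbors_within_batch: 3
```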

SCVI arguments

  • scvi: SCVI parameters are specified as

    • exclude_mt_genes: Boolean, Default: True

    • mt_column: String, Default: mt

    • model_args: Model argument parameters:

      • n_layers: Integer, Default: 1

      • n_latent: Integer, Default: 10

      • gene_likelihood: String, Default: zinb

    • training_args Training argument parameters:

      • max_epochs Integer, Default: 400

      • train_size Float, Default: 0.9

      • early_stopping: Boolean, Default: True

    • training_plan Training plan parameters:

      • lr Float, Default: 0.001

      • n_epochs_kl_warmup: Integer, Default: 40

      • reduce_lr_on_plateau: Boolean, Default: True

      • lr_scheduler_metric String, Default: elbo_validation

      • lr_patience Integer, Default: 8

      • lr_factor Float, Default: 0.1

For more information on scvi check the scvi documentation
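The scvi parameters above might be laid out as in the sketch below, which uses the listed defaults. The key names mt_column and training_args are assumed spellings, chosen to match the TotalVI section of this page, and the nesting under rna is likewise an assumption.

```yaml
rna:
  scvi:
    exclude_mt_genes: True
    mt_column: mt          # assumed key name, as in the TotalVI block
    model_args:
      n_layers: 1
      n_latent: 10
      gene_likelihood: zinb
    training_args:         # assumed spelling
      max_epochs: 400
      train_size: 0.9
      early_stopping: True
    training_plan:
      lr: 0.001
      n_epochs_kl_warmup: 40
      reduce_lr_on_plateau: True
      lr_scheduler_metric: elbo_validation
      lr_patience: 8
      lr_factor: 0.1
```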

KNN calculation on RNA modality

Parameters to compute the connectivity graph on RNA

  • neighbors: String

    • npcs Integer, Default: 30

      Number of principal components to calculate for neighbors and UMAP

    • k Integer, Default: 30
      Number of neighbors

    • metric String, Default: euclidean
      Metric can be either euclidean or cosine

    • method String, Default: scanpy
      The method can either be scanpy or hnsw
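As a sketch, the RNA neighbors block with its listed defaults could look like this. The anchor name &rna_neighbors is an assumption, inferred from the *rna_neighbors alias that appears in the WNN section.

```yaml
rna:
  neighbors: &rna_neighbors   # assumed anchor name, reused later by WNN
    npcs: 30
    k: 30
    metric: euclidean         # or cosine
    method: scanpy            # or hnsw
```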

Protein modality

prot: Batch correction for the protein modality is specified by the following parameters:

  • run Boolean, Default: True
    Defines if you want the batch correction to run on the Protein modality. If set to False, PCA with default parameters is calculated.

  • tools String (comma-separated), Default: harmony
    Defines the method used to run batch correction, multiple can be selected. If left blank, the integration will produce no correction outputs.

    Choices: harmony, bbknn, combat

  • column String (comma-separated), Default: sample_id

    The column you want to batch correct on, if a comma-separated list is specified then all will be used simultaneously

Harmony arguments

harmony
Basic parameters required to run harmony:

  • sigma Float, Default: 0.1

  • theta Float, Default: 1.0

  • npcs Integer, Default: 30

For more information on harmony check the harmony documentation

BBKNN arguments

bbknn

  • neighbors_within_batch: Integer, Default: 3

For more information on bbknn check the bbknn documentation

KNN calculation on Protein modality

Parameters to compute the connectivity graph on Protein

neighbors String, Default: &prot_neighbors

  • npcs Integer, Default: 30
    Number of principal components to calculate for neighbors and UMAP

  • k Integer, Default: 30
    Number of neighbors

  • metric String, Default: euclidean
    Metric can be either euclidean or cosine

  • method String, Default: scanpy
    The method can either be scanpy or hnsw
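Taken together, the protein settings might be written as in the sketch below. Values are the listed defaults, the nesting of harmony, bbknn and neighbors under prot is assumed by analogy with the RNA section, and the last key is spelled method as in the RNA and ATAC sections.

```yaml
prot:
  run: True
  tools: harmony
  column: sample_id
  harmony:
    sigma: 0.1
    theta: 1.0
    npcs: 30
  bbknn:
    neighbors_within_batch: 3
  neighbors: &prot_neighbors
    npcs: 30
    k: 30
    metric: euclidean
    method: scanpy
```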

ATAC modality

atac: Batch correction for the ATAC modality is specified by the following parameters:

  • run Boolean, Default: False
    Defines if you want the batch correction to run. If set to False, PCA with default parameters is calculated.

  • dimred String, Default: PCA
    Defines which dimensionality reduction to use. Available options are PCA and LSI.

  • tools String (comma-separated), Default: harmony
    Defines the method used to run batch correction. If left blank, the integration will produce no correction outputs. Multiple methods can be selected by specifying them as a comma-separated string without spaces. Available options are: harmony, bbknn, and combat.

  • column String (comma-separated), Default: sample_id
    The column you want to batch correct on. If a comma-separated list is provided then all will be used simultaneously.

Harmony arguments

  • harmony: Basic parameters required to run harmony:

    • sigma Float, Default: 0.1

    • theta Float, Default: 1.0

    • npcs Integer, Default: 30

    For more information on harmony check the harmony documentation

BBKNN arguments

  • bbknn:

    • neighbors_within_batch: Integer, Default: 3

For more information on bbknn check the bbknn documentation.

KNN calculation on ATAC modality

  • neighbors: String

    • npcs Integer, Default: 30

      Number of principal components to calculate for neighbors and UMAP.

    • k Integer, Default: 30
      Number of neighbors

    • metric String, Default: euclidean
      Metric can be either euclidean or cosine

    • method String, Default: scanpy
      The method can either be scanpy or hnsw
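Because ATAC batch correction is switched off by default, the sketch below shows one way it might be enabled, here with LSI and two tools. The choice of LSI and of harmony,bbknn is purely illustrative, and the anchor name &atac_neighbors is assumed from the *atac_neighbors alias used in the WNN section.

```yaml
atac:
  run: True                   # default is False
  dimred: LSI                 # default is PCA
  tools: harmony,bbknn        # comma-separated, no spaces
  column: sample_id
  neighbors: &atac_neighbors  # assumed anchor name
    npcs: 30
    k: 30
    metric: euclidean
    method: scanpy
```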

Multimodal integration

multimodal:

  • run Boolean, Default: True
    Set to False if you don’t want to run multimodal integration

  • tools String (list), Default: “WNN”
    Methods you want to use to run the multimodal integration. Options include:

    • methods with set modalities: totalVI (rna, prot) and multiVI (rna, atac)

      • totalvi

      • MultiVI

    • methods which accept any combination of modalities: MOFA and WNN

      • mofa

      • WNN

    You can specify multiple methods and they will be run simultaneously. It makes biological sense to include the rna modality if available, as it is the most informative in terms of cell type differences.

  • column_categorical String(Comma separated), Default: sample_id
    This is the column you want to run a batch correction on. Multiple columns can be selected simultaneously by providing them as a comma-separated string without spaces.
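A minimal sketch of the top-level multimodal entries follows. Writing tools as a YAML list is an assumption based on the “String (list)” type given above; the selection of two methods is only an example.

```yaml
multimodal:
  run: True
  tools:          # assumed list syntax; default is WNN only
    - WNN
    - mofa
  column_categorical: sample_id
```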

TotalVI arguments

TotalVI has to run on both rna and protein data

This is the minimal set of TotalVI parameters required, you can add more if it fits your analysis better.

  • totalvi:

    • modalities String(Comma separated), Default: rna,prot

    • exclude_mt_genes Boolean, Default: True

    • mt_column String, Default: mt

    • filter_by_hvg Boolean, Default: True

    • filter_prot_outliers Boolean, Default: False
      To filter outliers manually, create a column called prot_outliers in mdata[‘prot’]

    • model_args:

      • latent_distribution String, Default: “normal”

    • training_args:

      • max_epochs Integer, Default: 100

      • train_size Float, Default: 0.9

      • early_stopping Boolean, Default: True

    • training_plan String, Default: None
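The TotalVI block above could be sketched as follows, with values copied from the listed defaults. Nesting totalvi under multimodal is an assumption based on the bullet structure of this page.

```yaml
multimodal:
  totalvi:
    modalities: rna,prot
    exclude_mt_genes: True
    mt_column: mt
    filter_by_hvg: True
    filter_prot_outliers: False
    model_args:
      latent_distribution: normal
    training_args:
      max_epochs: 100
      train_size: 0.9
      early_stopping: True
    training_plan: None
```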

MultiVI arguments

MultiVI has to run on both rna and atac data

This is the minimal set of MultiVI parameters required, you can add more if it fits your analysis better.

Setting lowmem to True subsets the ATAC data to the top 25k highly variable features (HVF). This is recommended for large datasets, because scvi-tools currently requires the atac and rna data to be concatenated. Note that more than 100 GB of RAM is required to concatenate ATAC and RNA data with 15k cells and 120k total features (union of rna and atac).

  • MultiVI:

    • lowmem Boolean, Default: True

    • model_args String, Default: None

      • n_hidden String, Default: None

      • n_latent Boolean, Default: True

      • region_factors Boolean, Default: True

      • latent_distribution String, Default: normal

      • deeply_inject_covariates Boolean, Default: False

      • fully_paired Boolean, Default: False

    • training_args

      • max_epochs Integer, Default: 500

      • lr Float, Default: 0.0001

      • use_gpu String, Default: None
        Accepts a string, integer or Boolean. Leave blank for the default.

      • train_size Float, Default: 0.9

      • validation_size String, Default: None
        Leave blank for default

      • batch_size Integer, Default: 128

      • weight_decay Float, Default: 0.001

      • eps Float, Default: 1e-08

      • early_stopping Boolean, Default: True

      • save_best Boolean, Default: True

      • check_val_every_n_epoch String, Default: None
        Leave blank for the default integer

      • n_steps_kl_warmup String, Default: None
        Leave blank for the default integer

      • n_epochs_kl_warmup Integer, Default: 50

      • adversarial_mixing Boolean, Default: False

    • training_plan String, Default: None
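A partial sketch of the MultiVI block is given below. It shows only lowmem and a subset of the training arguments with their listed defaults; entries described as “leave blank” are left empty. The nesting under multimodal is assumed, and eps is written with a decimal point because some YAML 1.1 parsers (PyYAML among them) read 1e-08 as a string rather than a float.

```yaml
multimodal:
  MultiVI:
    lowmem: True
    training_args:
      max_epochs: 500
      lr: 0.0001
      use_gpu:                # left blank for the default
      train_size: 0.9
      validation_size:        # left blank for the default
      batch_size: 128
      weight_decay: 0.001
      eps: 1.0e-08
      early_stopping: True
      save_best: True
      n_epochs_kl_warmup: 50
      adversarial_mixing: False
```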

Mofa arguments

Requires at least two modalities, can run with three

This is the minimal set of Mofa parameters required, you can add more if it fits your analysis better.

  • mofa:

    • modalities String (Comma separated), Default: rna,prot,atac

    • filter_by_hvg Boolean, Default: True

    • n_factors Integer, Default: 10

    • n_iterations Integer, Default: 1000

    • convergence_mode String, Default: fast
      Choice between fast, medium, and slow

    • save_parameters Boolean, Default: False

    • outfile String, Default: path/to/h5ad/to_save_model_to
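The mofa block might look like the sketch below, using the listed defaults. The key filter_by_hvg is spelled as in the TotalVI section, the nesting under multimodal is assumed, and outfile keeps the placeholder path from the list above, which you would replace with your own.

```yaml
multimodal:
  mofa:
    modalities: rna,prot,atac
    filter_by_hvg: True
    n_factors: 10
    n_iterations: 1000
    convergence_mode: fast    # fast, medium or slow
    save_parameters: False
    outfile: path/to/h5ad/to_save_model_to   # placeholder path
```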

WNN arguments

Requires at least two modalities, can run with three

This is the minimal set of WNN parameters required, you can add more if it fits your analysis better. Panpipes uses muon’s implementation of WNN.

  • WNN:

    • modalities String (Comma separated), Default: rna,prot,atac

    • batch_corrected String, Default: None

      Set each modality to one method (“bbknn”, “scVI”, “harmony”, “scanorama”). If left as None, neighbours are calculated de novo on the non-corrected data for that modality, using the specified parameters.

      • rna String, Default: None
        Options here include “bbknn” and “harmony”

      • prot String, Default: None
        Options here include “harmony”

      • atac String, Default: None

    • knn:

      • rna String, Default: *rna_neighbors

      • prot String, Default: *prot_neighbors

      • atac String, Default: *atac_neighbors

    • n_neighbors String, Default: “leave blank”
      Leave blank to use the arithmetic mean of the number of neighbors across modalities

    • n_bandwidth_neighbors Integer, Default: 20

    • n_multineighbors Integer, Default: 200

    • metric String, Default: euclidean

    • low_memory Boolean, Default: True
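The sketch below shows how the WNN block could reuse the per-modality neighbors settings through YAML aliases. The three anchors are repeated in short form at the top so that the fragment parses on its own; in a real file they would already be defined in the unimodal sections. The anchor names &rna_neighbors and &atac_neighbors and the nesting under multimodal are assumptions.

```yaml
# Short stand-ins for the unimodal neighbors blocks (normally defined earlier)
rna:
  neighbors: &rna_neighbors {npcs: 30, k: 30, metric: euclidean, method: scanpy}
prot:
  neighbors: &prot_neighbors {npcs: 30, k: 30, metric: euclidean, method: scanpy}
atac:
  neighbors: &atac_neighbors {npcs: 30, k: 30, metric: euclidean, method: scanpy}

multimodal:
  WNN:
    modalities: rna,prot,atac
    batch_corrected:
      rna: None               # None: neighbours computed on uncorrected data
      prot: None
      atac: None
    knn:
      rna: *rna_neighbors     # aliases pull in the blocks anchored above
      prot: *prot_neighbors
      atac: *atac_neighbors
    n_neighbors:              # left blank
    n_bandwidth_neighbors: 20
    n_multineighbors: 200
    metric: euclidean
    low_memory: True
```

Using aliases here keeps the WNN neighbor settings in step with the unimodal ones, so a change to npcs or k only needs to be made once.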

KNN calculation for multimodal analysis

  • neighbors:

    • npcs Integer, Default: 30
      The number of principal components to calculate for neighbors and UMAP. If no correction is applied PCA will be calculated and used to run the UMAP. If harmony is chosen it will use the following components to create a corrected dimensionality reduction.

    • k Integer, Default: 30

    • metric String, Default: euclidean
      Options include euclidean and cosine

    • method String, Default: scanpy
      Options include scanpy and hnsw

Plotting parameters

  • plotqc:

    • grouping_var String, Default: sample_id
      Column name(s) of the covariate(s) you want to group the plot on. Must be a categorical variable. Must be provided as a comma-separated String, without spaces.

Specify other metrics you want to plot on each modality’s embedding. One plot per group will be created. Use the notation mod:variable. These can be categorical or numeric variables. Any metrics you may want to plot on all modality UMAPs should be listed under all.

  • all String, Default: rep:receptor_subtype

  • rna String, Default: rna:total_counts

  • prot String, Default: prot:total_counts

  • atac String, Default: atac:total_counts

  • multimodal String, Default: rna:total_counts
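As a sketch, the plotting parameters with their listed defaults might be written as follows. Placing the per-modality entries inside plotqc is an assumption about the file layout. Values such as rna:total_counts are read as plain strings, because the colon is not followed by a space.

```yaml
plotqc:
  grouping_var: sample_id
  # mod:variable notation; entries under "all" are plotted on every UMAP
  all: rep:receptor_subtype
  rna: rna:total_counts
  prot: prot:total_counts
  atac: atac:total_counts
  multimodal: rna:total_counts
```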

If you want to add any additional plots, simply remove the log file (logs/plot_batch_corrected_umaps.log) and run panpipes integration make plot_umaps.

scib metrics

To assess the unimodal data integration, we use the scib metrics, which are calculated with the scib-metrics package and stored for each modality separately.

  • scib:

    • rna String, Optional
      Obs column name for the rna modality containing the cell type labels. If not provided, only a subset of the metrics working without cell type labels will be calculated. Especially the bio conservation metrics will not be calculated, which also prevents creating the benchmarking plots.

    • prot String, Optional
      Obs column name for the prot modality containing the cell type labels.

    • atac String, Optional
      Obs column name for the atac modality containing the cell type labels.
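A minimal sketch of the scib block is shown below. The column name cell_type is hypothetical; use whichever obs column holds your cell type labels, or leave an entry blank to compute only the metrics that work without labels.

```yaml
scib:
  rna: cell_type   # hypothetical obs column with cell type labels
  prot:            # left blank: label-free metrics only
  atac:
```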

Creating the final object

Leave this final option blank until you have reviewed the results from running panpipes integration make full.

This step produces a MuData object with one layer for each modality; the embeddings generated from the multimodal runs are stored in the global MuData layer. To store the embeddings resulting from the batch correction algorithms applied to each modality, set the relevant include to True and specify which algorithm you want to retain. To select the uncorrected unimodal embeddings, use “no_correction” for the relevant modalities. Setting the include parameter to False for a specific modality will generate a MuData object without that modality.

Then run panpipes integration make merge_integration

  • final_obj:

    • rna:

      • include Boolean, Default: True

      • bc_choice String, Default: harmony

    • prot:

      • include Boolean, Default: True

      • bc_choice String, Default: harmony

    • atac:

      • include Boolean, Default: False

      • bc_choice String, Default: harmony

    • multimodal:

      • include Boolean, Default: True

      • bc_choice String, Default: totalvi
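With the listed defaults, the final_obj block could look like the sketch below. Replace a bc_choice value with no_correction to keep the uncorrected embedding for that modality.

```yaml
final_obj:
  rna:
    include: True
    bc_choice: harmony     # or no_correction
  prot:
    include: True
    bc_choice: harmony
  atac:
    include: False         # ATAC is left out of the final object
    bc_choice: harmony
  multimodal:
    include: True
    bc_choice: totalvi
```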