Integration YAML

This documentation explains the parameters of the integration configuration YAML file, which is generated by running panpipes integration config.
The steps to run the pipeline are described in the integration workflow.

When you run the integration workflow, panpipes provides a basic pipeline.yml file. To run the workflow on your own data, set the parameters described below in pipeline.yml to match your data. Pre-filled versions of pipeline.yml are also provided for the individual tutorials.

You can download the different integration pipeline.yml files here:

Basic pipeline.yml file (not pre-filled) that is generated when calling panpipes integration config: Download here
pipeline.yml for Integration tutorial: View and Download here

For more information on the functionality panpipes implements for reading configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *aliases, please check our documentation.
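As a rough illustration of block reuse, the sketch below defines a block of neighbor parameters once under an anchor and reuses it through an alias. The anchor name prot_neighbors and the parameter values are taken from the sections further down; the exact nesting of the keys is an assumption made for this example and should be checked against your generated pipeline.yml.

```yaml
# Define a block once and label it with an anchor (&name).
prot:
  neighbors: &prot_neighbors
    npcs: 30
    k: 30
    metric: euclidean
    method: scanpy

# Reuse the same block elsewhere with an alias (*name).
# Nesting WNN under multimodal is assumed here for illustration.
multimodal:
  WNN:
    knn:
      prot: *prot_neighbors
```

When the file is parsed, the alias expands to the same mapping as the anchored block, so changing npcs or k in one place changes it everywhere the alias is used.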

Compute resources options

resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires. Specified by the following parameters:

threads_high Integer, Default: 1
Number of threads used for high intensity computing tasks. For each thread, there must be enough memory to load your MuData object, which was created in the preprocessing step of the workflow. In this workflow, all the integration, batch correction and dimensionality reduction tasks run with threads_high.
threads_medium Integer, Default: 1
Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your mudata and do computationally light tasks. In this workflow, collating results after integration and scib metrics calculation run with threads_medium.
threads_low Integer, Default: 1
Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting, which requires much less memory than the other two settings. In this workflow, plotting and lisi calculation run with threads_low.
threads_gpu Integer, Default: 2
Number of cores per gpu used for computing tasks. For each thread, there must be enough memory to compute the tasks above. In this workflow, if the gpu queues are defined below, the scvi algorithms and mofa can run on gpu; otherwise the threads_high setting is used.

condaenv String
Path to the conda environment that should be used to run panpipes. Leave blank if you are running natively or if your cluster automatically inherits the login node environment.

queues
Allows for tweaking which queues jobs get submitted to, in case there is a special queue for long jobs, or you have access to a gpu-specific queue. The default queue should be specified in your .cgat.yml file. Leave blank if you do not want to use any alternative queues.

long
gpu
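Put together, the compute options above might look like the following sketch. The values are the documented defaults; placing condaenv and queues at the top level next to resources is an assumption about the file layout.

```yaml
resources:
  threads_high: 1
  threads_medium: 1
  threads_low: 1
  threads_gpu: 2
# Left blank: run natively or inherit the login node environment.
condaenv:
# Left blank: no alternative queues, the default from .cgat.yml is used.
queues:
  long:
  gpu:
```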

Loading and merging data options

Data format

sample_prefix String, Mandatory parameter, Default: test
Prefix for the sample that comes out of the filtering/preprocessing steps of the workflow.

preprocessed_obj String, Mandatory parameter
Path to the output file from preprocessing (e.g. ../preprocess/test.h5mu). Ensure that the submission file is in the right format and that the correct path is provided.
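A minimal sketch of these two entries, using the default prefix and the example path given above (top-level placement of both keys is assumed):

```yaml
sample_prefix: test
preprocessed_obj: ../preprocess/test.h5mu
```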

Batch correction

Batch correction is done in unimodal mode, meaning each modality is batch corrected independently.

RNA modality

rna: Batch correction for the RNA modality is specified by the following parameters:

run Boolean, Default: True
Defines if you want the batch correction to run. If set to False, PCA with default parameters is calculated.
tools String (comma-separated), Default: harmony,bbknn,scanorama,scvi
Defines the methods used to run batch correction. Multiple methods can be selected and are run simultaneously. If left blank, the integration will produce no correction outputs.

Choices: harmony, bbknn, scanorama, scvi
column String (comma-separated), Default: sample_id

The column name of the covariate you want to batch correct on. If a comma-separated list is specified, all columns are used simultaneously.

Harmony arguments

harmony: Basic parameters required to run harmony:
- sigma Float, Default: 0.1
- theta Float, Default: 1.0
- npcs Integer, Default: 30
For more information on harmony check the harmony documentation

BBKNN arguments

bbknn:
- neighbors_within_batch: Integer, Default: 3

For more information on bbknn check the bbknn documentation
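For orientation, an rna block that runs only harmony and bbknn could be sketched as below. The parameter values are the documented defaults; nesting the harmony and bbknn blocks inside rna is an assumption about the layout.

```yaml
rna:
  run: True
  # Comma-separated, no spaces; a subset of harmony,bbknn,scanorama,scvi.
  tools: harmony,bbknn
  column: sample_id
  harmony:
    sigma: 0.1
    theta: 1.0
    npcs: 30
  bbknn:
    neighbors_within_batch: 3
```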

SCVI arguments

scvi: SCVI parameters are specified as
- exclude_mt_genes: Boolean, Default: True
- mt_column: String, Default: mt
- model_args: Model argument parameters:
  - n_layers: Float, Default: 1.0
  - n_latent: Integer, Default: 10
  - gene_likelihood: String, Default: zinb
- training_args: Training argument parameters:
  - max_epochs Integer, Default: 400
  - train_size Float, Default: 0.9
  - early_stopping: Boolean, Default: True
- training_plan Training plan parameters:
  - lr Float, Default: 0.001
  - n_epochs_kl_warmup: Integer, Default: 40
  - reduce_lr_on_plateau: Boolean, Default: True
  - lr_scheduler_metric String, Default: elbo_validation
  - lr_patience Integer, Default: 8
  - lr_factor Float, Default: 0.1

For more information on scvi check the scvi documentation
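The nested structure of the scvi block can be sketched as follows. Only a subset of the parameters listed above is shown, with their documented defaults; the block is assumed to sit inside the rna section.

```yaml
scvi:
  exclude_mt_genes: True
  model_args:
    n_latent: 10
    gene_likelihood: zinb
  training_args:
    max_epochs: 400
    train_size: 0.9
    early_stopping: True
  training_plan:
    lr: 0.001
    n_epochs_kl_warmup: 40
    reduce_lr_on_plateau: True
    lr_scheduler_metric: elbo_validation
    lr_patience: 8
    lr_factor: 0.1
```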

KNN calculation on RNA modality

Parameters to compute the connectivity graph on RNA

neighbors: String
- npcs Integer, Default: 30
  
  Number of principal components to calculate for neighbors and UMAP
- k Integer, Default: 30
  Number of neighbors
- metric String, Default: euclidean
  Metric can be either euclidean or cosine
- method String, Default: scanpy
  The method can either be scanpy or hnsw

Protein modality

prot: Batch correction for the protein modality is specified by the following parameters:

run Boolean, Default: True
Defines if you want the batch correction to run on the Protein modality. If set to False, PCA with default parameters is calculated.
tools String (comma-separated), Default: harmony
Defines the methods used to run batch correction. Multiple methods can be selected. If left blank, the integration will produce no correction outputs.

Choices: harmony, bbknn, combat
column String (comma-separated), Default: sample_id

The column you want to batch correct on. If a comma-separated list is specified, all columns are used simultaneously.

Harmony arguments

harmony
Basic parameters required to run harmony:

sigma Float, Default: 0.1
theta Float, Default: 1.0
npcs Integer, Default: 30

For more information on harmony check the harmony documentation

BBKNN arguments

bbknn

neighbors_within_batch: Integer, Default: 3

For more information on bbknn check the bbknn documentation

KNN calculation on Protein modality

Parameters to compute the connectivity graph on Protein

neighbors String, Default: &prot_neighbors

npcs Integer, Default: 30
Number of principal components to calculate for neighbors and UMAP
k Integer, Default: 30
Number of neighbors
metric String, Default: euclidean
Metric can be either euclidean or cosine
method String, Default: scanpy
The method can either be scanpy or hnsw

ATAC modality

atac: Batch correction for the ATAC modality is specified by the following parameters:

run Boolean, Default: False
Defines if you want the batch correction to run. If set to False, PCA with default parameters is calculated.
dimred String, Default: PCA
Defines which dimensionality reduction to use. Available options are PCA and LSI.
tools String (comma-separated), Default: harmony
Defines the methods used to run batch correction. If left blank, the integration will produce no correction outputs. Multiple methods can be selected by specifying them as a comma-separated string without spaces. Available options are: harmony, bbknn, and combat.
column String (comma-separated), Default: sample_id
The column you want to batch correct on. If a comma-separated list is provided then all will be used simultaneously.
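Because run defaults to False for ATAC, the modality has to be switched on explicitly. A sketch that enables it with LSI as the dimensionality reduction, using only the options listed above, might read:

```yaml
atac:
  run: True
  # PCA (default) or LSI.
  dimred: LSI
  # Comma-separated, no spaces.
  tools: harmony,bbknn
  column: sample_id
```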

Harmony arguments

harmony: Basic parameters required to run harmony:
- sigma Float, Default: 0.1
- theta Float, Default: 1.0
- npcs Integer, Default: 30
For more information on harmony check the harmony documentation

BBKNN arguments

bbknn:
- neighbors_within_batch: Integer, Default: 3

For more information on bbknn check the bbknn documentation.

KNN calculation on ATAC modality

neighbors: String
- npcs Integer, Default: 30
  
  Number of principal components to calculate for neighbors and UMAP.
- k Integer, Default: 30
  Number of neighbors
- metric String, Default: euclidean
  Metric can be either euclidean or cosine
- method String, Default: scanpy
  The method can either be scanpy or hnsw

Multimodal integration

multimodal:

run Boolean, Default: True
Set to False if you don’t want to run multimodal integration
tools String(list), Default: “WNN”
Method(s) you want to use to run multimodal integration. Options include:
- methods with set modalities: totalVI (rna, prot) and multiVI (rna, atac)
  - totalvi
  - MultiVI
- methods which accept any combination of modalities: MOFA and WNN
  - mofa
  - WNN
You can specify multiple methods and they will be run simultaneously. It makes biological sense to include the rna modality if available, as it is the most informative in terms of cell type differences.
column_categorical String(Comma separated), Default: sample_id
This is the column you want to run a batch correction on. Multiple columns can be selected simultaneously by providing them as a comma-separated string without spaces.
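A sketch of the top of the multimodal block, running WNN and mofa together. The tools type is given above as String(list); writing it as a YAML list is an assumption, so check the generated pipeline.yml for the expected form.

```yaml
multimodal:
  run: True
  # Assumed list form; the spelling of each entry follows the options above.
  tools:
    - WNN
    - mofa
  column_categorical: sample_id
```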

TotalVI arguments

TotalVI has to run on both rna and protein data

This is the minimal set of TotalVI parameters required, you can add more if it fits your analysis better.

totalvi:
- modalities String(Comma separated), Default: rna,prot
- exclude_mt_genes Boolean, Default: True
- mt_column String, Default: mt
- filter_by_hvg Boolean, Default: True
- filter_prot_outliers Boolean, Default: False
  To filter manually, create a column called prot_outliers in mdata[‘prot’]
- model_args:
  - latent_distribution String, Default: “normal”
- training_args:
  - max_epochs Integer, Default: 100
  - train_size Float, Default: 0.9
  - early_stopping Boolean, Default: True
- training_plan String, Default: None
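Written out with the defaults listed above, the totalvi block could look like this sketch. It is assumed to sit inside the multimodal section, and the blank training_plan stands for the None default.

```yaml
totalvi:
  modalities: rna,prot
  exclude_mt_genes: True
  mt_column: mt
  filter_by_hvg: True
  # Requires a prot_outliers column in mdata['prot'] when set to True.
  filter_prot_outliers: False
  model_args:
    latent_distribution: normal
  training_args:
    max_epochs: 100
    train_size: 0.9
    early_stopping: True
  training_plan:
```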

MultiVI arguments

MultiVI has to run on both rna and atac data

This is the minimal set of MultiVI parameters required, you can add more if it fits your analysis better.

Setting lowmem to True subsets the ATAC data to the top 25k highly variable features (HVF). This is recommended for large datasets, because scvi-tools currently requires the ATAC and RNA data to be concatenated. Note that more than 100 GB of RAM is required to concatenate ATAC and RNA data with 15k cells and 120k total features (union of rna and atac).

MultiVI:
- lowmem Boolean, Default: True
- model_args String, Default: None
  - n_hidden String, Default: None
  - n_latent Boolean, Default: True
  - region_factors Boolean, Default: True
  - latent_distribution String, Default: normal
  - deeply_inject_covariates Boolean, Default: False
  - fully_paired Boolean, Default: False
- training_args
  - max_epochs Integer, Default: 500
  - lr Float, Default: 0.0001
  - use_gpu String, Default: None
    Leave blank for the default. Accepts str, int, or bool.
  - train_size Float, Default: 0.9
  - validation_size String, Default: None
    Leave blank for default
  - batch_size Integer, Default: 128
  - weight_decay Float, Default: 0.001
  - eps Float, Default: 1e-08
  - early_stopping Boolean, Default: True
  - save_best Boolean, Default: True
  - check_val_every_n_epoch String, Default: None
    Leave blank for the default integer
  - n_steps_kl_warmup String, Default: None
    Leave blank for the default integer
  - n_epochs_kl_warmup Integer, Default: 50
  - adversarial_mixing Boolean, Default: False
- training_plan String, Default: None
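The sketch below shows how a subset of these MultiVI settings might be laid out, using the documented defaults. Entries whose default is None are left blank; the block is assumed to sit inside the multimodal section.

```yaml
MultiVI:
  # Subset ATAC to the top 25k features before concatenation.
  lowmem: True
  model_args:
    region_factors: True
    latent_distribution: normal
    deeply_inject_covariates: False
    fully_paired: False
  training_args:
    max_epochs: 500
    lr: 0.0001
    use_gpu:
    train_size: 0.9
    validation_size:
    batch_size: 128
    weight_decay: 0.001
    early_stopping: True
    save_best: True
    n_epochs_kl_warmup: 50
    adversarial_mixing: False
  training_plan:
```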

Mofa arguments

Requires at least two modalities, can run with three

This is the minimal set of Mofa parameters required, you can add more if it fits your analysis better.

mofa:
- modalities String (Comma separated), Default: rna,prot,atac
- filter_by_hvg Boolean, Default: True
- n_factors Integer, Default: 10
- n_iterations Integer, Default: 1000
- convergence_mode String, Default: fast
  Choice between fast, medium, and slow
- save_parameters Boolean, Default: False
- outfile String, Default: path/to/h5ad/to_save_model_to

WNN arguments

Requires at least two modalities, can run with three

This is the minimal set of WNN parameters required, you can add more if it fits your analysis better. Panpipes uses muon’s implementation of WNN.

WNN:
- modalities String (Comma separated), Default: rna,prot,atac
- batch_corrected String, Default: None
  
  Set each modality to one method (“bbknn”, “scVI”, “harmony”, “scanorama”). If left as None, neighbours are calculated de novo on the non-corrected data for that modality, using the specified parameters.
  - rna String, Default: None
    Options here include “bbknn” and “harmony”
  - prot String, Default: None
    Options here include “harmony”
  - atac String, Default: None
- knn:
  - rna String, Default: *rna_neighbors
  - prot String, Default: *prot_neighbors
  - atac String, Default: *atac_neighbors
- n_neighbors String, Default: None (leave blank)
  Leave blank to use the arithmetic mean of the number of neighbors across modalities
- n_bandwidth_neighbors Integer, Default: 20
- n_multineighbors Integer, Default: 200
- metric String, Default: euclidean
- low_memory Boolean, Default: True
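As an illustration, a WNN block for an rna and prot dataset that reuses the harmony-corrected representations could be sketched as follows. The knn aliases are left out to keep the fragment self-contained, and nesting under the multimodal section is assumed.

```yaml
WNN:
  modalities: rna,prot
  batch_corrected:
    rna: harmony
    prot: harmony
    # Blank: not used here, no corrected representation selected.
    atac:
  # Blank: arithmetic mean of the per-modality neighbors.
  n_neighbors:
  n_bandwidth_neighbors: 20
  n_multineighbors: 200
  metric: euclidean
  low_memory: True
```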

KNN calculation for multimodal analysis

neighbors:
- npcs Integer, Default: 30
  The number of principal components to calculate for neighbors and UMAP. If no correction is applied PCA will be calculated and used to run the UMAP. If harmony is chosen it will use the following components to create a corrected dimensionality reduction.
- k Integer, Default: 30
- metric String, Default: euclidean
  Options include euclidean and cosine
- method String, Default: scanpy
  Options include scanpy and hnsw

Plotting parameters

plotqc:
- grouping_var String, Default: sample_id
  Column name(s) of the covariate(s) you want to group the plot on. Must be a categorical variable. Must be provided as a comma-separated String, without spaces.

Specify other metrics you want to plot on each modality’s embedding. One plot per group will be created. Use the notation mod:variable. These can be categorical or numeric variables. Any metrics you want to plot on all modality UMAPs should be listed under all.

all String, Default: rep:receptor_subtype
rna String, Default: rna:total_counts
prot String, Default: prot:total_counts
atac String, Default: atac:total_counts
multimodal String, Default: rna:total_counts

If you want to add any additional plots, simply remove the log file (logs/plot_batch_corrected_umaps.log) and run panpipes integration make plot_umaps.

scib metrics

To assess the unimodal data integration, we use the scib metrics, calculated with the scib-metrics package. The metrics are calculated and stored separately for each modality.

scib:
- rna String, Optional
  Obs column name for the rna modality containing the cell type labels. If not provided, only the subset of metrics that work without cell type labels is calculated. In particular, the bio conservation metrics are not calculated, which also prevents the benchmarking plots from being created.
- prot String, Optional
  Obs column name for the prot modality containing the cell type labels.
- atac String, Optional
  Obs column name for the atac modality containing the cell type labels.
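A sketch of the scib block where cell type labels are available only for the rna modality. The column name cell_type is hypothetical; replace it with the obs column that holds your labels.

```yaml
scib:
  # Hypothetical obs column with cell type labels.
  rna: cell_type
  # Blank: only the label-free metrics are calculated for these modalities.
  prot:
  atac:
```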

Creating the final object

Leave this final option blank until you have reviewed the results from running panpipes integration make full.

This step produces a MuData object with one layer for each modality. The embeddings generated from the multimodal runs are stored in the global MuData layer. To store the embeddings resulting from the batch correction algorithms applied to each modality, set the relevant include to True and specify which algorithm you want to retain. To select the uncorrected unimodal embeddings, use “no_correction” for the relevant modalities. Setting the include parameter to False for a specific modality will generate a MuData object without that modality.

Then run panpipes integration make merge_integration

final_obj:
- rna:
  - include Boolean, Default: True
  - bc_choice String, Default: harmony
- prot:
  - include Boolean, Default: True
  - bc_choice String, Default: harmony
- atac:
  - include Boolean, Default: False
  - bc_choice String, Default: harmony
- multimodal:
  - include Boolean, Default: True
  - bc_choice String, Default: totalvi
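For example, to keep the harmony embedding for rna, the uncorrected embedding for prot, and the totalvi multimodal embedding while dropping atac, the block might be filled in as in this sketch (the choices shown are illustrative, not a recommendation):

```yaml
final_obj:
  rna:
    include: True
    bc_choice: harmony
  prot:
    include: True
    # Keep the uncorrected unimodal embedding.
    bc_choice: no_correction
  atac:
    # The final MuData object is written without this modality.
    include: False
    bc_choice: harmony
  multimodal:
    include: True
    bc_choice: totalvi
```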