Integration YAML
This documentation explains the parameters of the integration configuration yaml file generated by running panpipes integration config.
The steps to run the pipeline are described in the integration workflow.
When running the integration workflow, panpipes provides you with a basic pipeline.yml file. To run the workflow with your own data, you need to set the parameters described below in the pipeline.yml file to meet the requirements of your data. We also provide pre-filled versions of the pipeline.yml for individual tutorials.
You can download the different integration pipeline.yml files here:
Basic pipeline.yml file (not pre-filled) that is generated when calling panpipes integration config: Download here
pipeline.yml for Integration tutorial: View and Download here
For more information on the functionality implemented in panpipes to read the configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *aliases, please check our documentation.
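As a minimal sketch of how YAML anchors and aliases work, the fragment below defines a neighbors block once and reuses it elsewhere. The anchor name rna_neighbors and the keys are taken from the sections further down this page; the exact nesting is an assumption and should be checked against your generated pipeline.yml.

```yaml
# Define a block once and label it with an anchor (&name)
rna:
  neighbors: &rna_neighbors
    npcs: 30
    k: 30
    metric: euclidean
    method: scanpy

# Reuse the whole block elsewhere through an alias (*name)
multimodal:
  WNN:
    knn:
      rna: *rna_neighbors
```

When the file is parsed, the alias expands to the same mapping as the anchored block, so changing the anchored values changes both places.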
Compute resources options
resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires.
Specified by the following parameters:
threads_high Integer, Default: 1
Number of threads used for high intensity computing tasks. For each thread, there must be enough memory to load your MuData object, which was created in the preprocessing step of the workflow. In this workflow, all the integration, batch correction and dimensionality reduction tasks run with threads_high.
threads_medium Integer, Default: 1
Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your MuData object and do computationally light tasks. In this workflow, collating results after integration and the scib metrics calculation run with threads_medium.
threads_low Integer, Default: 1
Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting, which requires much less memory than the other two. In this workflow, plotting and the lisi calculation run with threads_low.
threads_gpu Integer, Default: 2
Number of cores per GPU used for computing tasks. For each thread, there must be enough memory to compute the tasks above. In this workflow, if the gpu queues are defined below, the scvi algorithms and mofa can run on GPU; otherwise the threads_high argument is used.
condaenv String
Path to conda environment that should be used to run panpipes.
Leave blank if running natively or if your cluster automatically inherits the login node environment.
queues
Allows for tweaking which queues jobs get submitted to, in case there is a special queue for long jobs, or you have access to a gpu-specific queue.
The default queue should be specified in your .cgat.yml file.
Leave blank if you do not want to use any alternative queues.
long
gpu
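The compute options above can be sketched as the following fragment, using the documented defaults. The nesting of the thread settings under resources and of long and gpu under queues is an assumption based on the grouping on this page; blank values mean "not set".

```yaml
resources:
  threads_high: 1
  threads_medium: 1
  threads_low: 1
  threads_gpu: 2

# Leave blank if running natively or if the cluster inherits the login environment
condaenv:

# Leave blank to use the default queue from .cgat.yml
queues:
  long:
  gpu:
```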
Loading and merging data options
Data format
sample_prefix String, Mandatory parameter, Default: test
Prefix for the sample that comes out of the filtering/preprocessing steps of the workflow.
preprocessed_obj String, Mandatory parameter
Path to the output file from preprocessing (e.g. ../preprocess/test.h5mu).
Ensure that the submission file is in the right format and that the correct path is provided.
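A minimal sketch of the two mandatory entries, using the default prefix and the example path given above (both keys are assumed to sit at the top level of pipeline.yml):

```yaml
sample_prefix: test
preprocessed_obj: ../preprocess/test.h5mu
```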
Batch correction
Batch correction is done in unimodal mode, meaning each modality is batch corrected independently.
RNA modality
rna: Batch correction for the RNA modality is specified by the following parameters:
run Boolean, Default: True
Defines if you want the batch correction to run. If set to False, PCA with default parameters is calculated.
tools String (comma-separated), Default: harmony,bbknn,scanorama,scvi
Defines the methods used to run batch correction; multiple can be selected and run simultaneously. If left blank, the integration will produce no correction outputs. Choices: harmony, bbknn, scanorama, scvi
column String (comma-separated), Default: sample_id
The column name of the covariate you want to batch correct on. If a comma-separated list is specified, all columns will be used simultaneously.
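For illustration, the three RNA options above with their documented defaults might look like this in pipeline.yml. Placing them directly under an rna key is an assumption based on the "rna:" label above.

```yaml
rna:
  run: True
  # comma-separated, no spaces
  tools: harmony,bbknn,scanorama,scvi
  # one obs column, or several separated by commas
  column: sample_id
```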
Harmony arguments
harmony: Basic parameters required to run harmony:
sigma Float, Default: 0.1
theta Float, Default: 1.0
npcs Integer, Default: 30
For more information on harmony, check the harmony documentation.
BBKNN arguments
bbknn:
neighbors_within_batch Integer, Default: 3
For more information on bbknn, check the bbknn documentation.
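The harmony and bbknn arguments for the RNA modality can be sketched as two small blocks with the documented defaults. Nesting them under rna is an assumption drawn from the section layout.

```yaml
rna:
  harmony:
    sigma: 0.1
    theta: 1.0
    npcs: 30
  bbknn:
    neighbors_within_batch: 3
```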
SCVI arguments
scvi: SCVI parameters are specified as
exclude_mt_genes Boolean, Default: True
mt_column String, Default: mt
model_args: Model argument parameters:
n_layers Float, Default: 1.0
n_latent Integer, Default: 10
gene_likelihood String, Default: zinb
training_args: Training argument parameters:
max_epochs Integer, Default: 400
train_size Float, Default: 0.9
early_stopping Boolean, Default: True
training_plan: Training plan parameters:
lr Float, Default: 0.001
n_epochs_kl_warmup Integer, Default: 40
reduce_lr_on_plateau Boolean, Default: True
lr_scheduler_metric String, Default: elbo_validation
lr_patience Integer, Default: 8
lr_factor Float, Default: 0.1
For more information on scvi, check the scvi documentation.
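A sketch of the scvi block with the documented defaults follows. Two assumptions are made: the block is nested under rna, and mt_column is taken to be the key that names the mitochondrial-gene column, as in the TotalVI section.

```yaml
rna:
  scvi:
    exclude_mt_genes: True
    mt_column: mt
    model_args:
      # documented default is 1.0; scvi-tools expects an integer layer count
      n_layers: 1
      n_latent: 10
      gene_likelihood: zinb
    training_args:
      max_epochs: 400
      train_size: 0.9
      early_stopping: True
    training_plan:
      lr: 0.001
      n_epochs_kl_warmup: 40
      reduce_lr_on_plateau: True
      lr_scheduler_metric: elbo_validation
      lr_patience: 8
      lr_factor: 0.1
```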
KNN calculation on RNA modality
Parameters to compute the connectivity graph on RNA.
neighbors String
npcs Integer, Default: 30
Number of principal components to calculate for neighbors and UMAP.
k Integer, Default: 30
Number of neighbors.
metric String, Default: euclidean
Metric can be either euclidean or cosine.
method String, Default: scanpy
The method can be either scanpy or hnsw.
Protein modality
prot: Batch correction for the protein modality is specified by the following parameters:
run Boolean, Default: True
Defines if you want the batch correction to run on the Protein modality. If set to False, PCA with default parameters is calculated.
tools String (comma-separated), Default: harmony
Defines the methods used to run batch correction; multiple can be selected. If left blank, the integration will produce no correction outputs. Choices: harmony, bbknn, combat
column String (comma-separated), Default: sample_id
The column you want to batch correct on. If a comma-separated list is specified, all columns will be used simultaneously.
Harmony arguments
harmony: Basic parameters required to run harmony:
sigma Float, Default: 0.1
theta Float, Default: 1.0
npcs Integer, Default: 30
For more information on harmony, check the harmony documentation.
BBKNN arguments
bbknn:
neighbors_within_batch Integer, Default: 3
For more information on bbknn, check the bbknn documentation.
KNN calculation on Protein modality
Parameters to compute the connectivity graph on Protein.
neighbors String, Default: &prot_neighbors
npcs Integer, Default: 30
Number of principal components to calculate for neighbors and UMAP.
k Integer, Default: 30
Number of neighbors.
metric String, Default: euclidean
Metric can be either euclidean or cosine.
method String, Default: scanpy
The method can be either scanpy or hnsw.
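Putting the protein options together, a sketch of the prot block with the documented defaults could look like the fragment below. The nesting is an assumption; the &prot_neighbors anchor is the one named as the default for neighbors above.

```yaml
prot:
  run: True
  # choices: harmony, bbknn, combat (comma-separated)
  tools: harmony
  column: sample_id
  harmony:
    sigma: 0.1
    theta: 1.0
    npcs: 30
  bbknn:
    neighbors_within_batch: 3
  # anchored so the block can be reused by the multimodal WNN settings
  neighbors: &prot_neighbors
    npcs: 30
    k: 30
    metric: euclidean   # or cosine
    method: scanpy      # or hnsw
```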
ATAC modality
atac: Batch correction for the ATAC modality is specified by the following parameters:
run Boolean, Default: False
Defines if you want the batch correction to run. If set to False, PCA with default parameters is calculated.
dimred String, Default: PCA
Defines which dimensionality reduction to use. Available options are PCA and LSI.
tools String (comma-separated), Default: harmony
Defines the methods used to run batch correction. If left blank, the integration will produce no correction outputs. Multiple can be selected by specifying them as a comma-separated string without spaces. Available options are: harmony, bbknn, and combat.
column String (comma-separated), Default: sample_id
The column you want to batch correct on. If a comma-separated list is provided, all columns will be used simultaneously.
Harmony arguments
harmony: Basic parameters required to run harmony:
sigma Float, Default: 0.1
theta Float, Default: 1.0
npcs Integer, Default: 30
For more information on harmony, check the harmony documentation.
BBKNN arguments
bbknn:
neighbors_within_batch Integer, Default: 3
For more information on bbknn, check the bbknn documentation.
KNN calculation on ATAC modality
neighbors String
npcs Integer, Default: 30
Number of principal components to calculate for neighbors and UMAP.
k Integer, Default: 30
Number of neighbors.
metric String, Default: euclidean
Metric can be either euclidean or cosine.
method String, Default: scanpy
The method can be either scanpy or hnsw.
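The ATAC options follow the same pattern, with the extra dimred key. The fragment below is a sketch using the documented defaults; the nesting under atac is assumed, and the anchor name atac_neighbors is taken from the WNN defaults listed later on this page.

```yaml
atac:
  run: False
  dimred: PCA           # or LSI
  tools: harmony        # harmony, bbknn, combat (comma-separated, no spaces)
  column: sample_id
  harmony:
    sigma: 0.1
    theta: 1.0
    npcs: 30
  bbknn:
    neighbors_within_batch: 3
  neighbors: &atac_neighbors
    npcs: 30
    k: 30
    metric: euclidean
    method: scanpy
```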
Multimodal integration
multimodal:
run Boolean, Default: True
Set to False if you don't want to run multimodal integration.
tools String (list), Default: "WNN"
Method you want to use to run batch correction. Options include:
methods with set modalities: totalVI (rna, prot) and multiVI (rna, atac)
totalvi
MultiVI
methods which accept any combination of modalities: MOFA and WNN
mofa
WNN
You can specify multiple methods and they will be run simultaneously. It makes biological sense to include the rna modality if available, since it is the most informative in terms of cell type differences.
column_categorical String (comma-separated), Default: sample_id
This is the column you want to run a batch correction on. Multiple columns can be selected simultaneously by providing them as a comma-separated string without spaces.
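As a rough sketch, the top-level multimodal options with their documented defaults could be written as below. Writing tools as a YAML list is an assumption based on the "(list)" annotation above; check the generated pipeline.yml for the form it actually uses.

```yaml
multimodal:
  run: True
  # assumed list form; options are totalvi, MultiVI, mofa and WNN
  tools:
    - WNN
  column_categorical: sample_id
```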
TotalVI arguments
TotalVI has to run on both rna and protein data.
This is the minimal set of TotalVI parameters required; you can add more if it fits your analysis better.
totalvi:
modalities String (comma-separated), Default: rna,prot
exclude_mt_genes Boolean, Default: True
mt_column String, Default: mt
filter_by_hvg Boolean, Default: True
filter_prot_outliers Boolean, Default: False
To filter manually, create a column called prot_outliers in mdata['prot'].
model_args:
latent_distribution String, Default: "normal"
training_args:
max_epochs Integer, Default: 100
train_size Float, Default: 0.9
early_stopping Boolean, Default: True
training_plan String, Default: None
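The TotalVI settings above can be sketched as the following block with the documented defaults. Nesting totalvi under multimodal is an assumption; a blank value stands for None.

```yaml
multimodal:
  totalvi:
    modalities: rna,prot
    exclude_mt_genes: True
    mt_column: mt
    filter_by_hvg: True
    # to filter manually, first create a prot_outliers column in mdata['prot']
    filter_prot_outliers: False
    model_args:
      latent_distribution: normal
    training_args:
      max_epochs: 100
      train_size: 0.9
      early_stopping: True
    training_plan:
```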
MultiVI arguments
MultiVI has to run on both rna and atac data.
This is the minimal set of MultiVI parameters required; you can add more if it fits your analysis better.
Setting lowmem to True will subset the ATAC data to the top 25k HVF. This is recommended to deal with the concatenation of atac and rna on large datasets, which at the moment is required by scvi-tools.
Note that >100GB of RAM are required to concatenate ATAC and RNA data with 15k cells and 120k total features (union rna,atac).
MultiVI:
lowmem Boolean, Default: True
model_args String, Default: None
n_hidden String, Default: None
n_latent Boolean, Default: True
region_factors Boolean, Default: True
latent_distribution String, Default: normal
deeply_inject_covariates Boolean, Default: False
fully_paired Boolean, Default: False
training_args
max_epochs Integer, Default: 500
lr Float, Default: 0.0001
use_gpu String, Default: None
Leave blank for the default (accepts str, int or bool).
train_size Float, Default: 0.9
validation_size String, Default: None
Leave blank for the default.
batch_size Integer, Default: 128
weight_decay Float, Default: 0.001
eps Float, Default: 1e-08
early_stopping Boolean, Default: True
save_best Boolean, Default: True
check_val_every_n_epoch String, Default: None
Leave blank for the default integer.
n_steps_kl_warmup String, Default: None
Leave blank for the default integer.
n_epochs_kl_warmup Integer, Default: 50
adversarial_mixing Boolean, Default: False
training_plan String, Default: None
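A sketch of the MultiVI block follows, with the documented defaults and blank values standing for None. Several points are assumptions: the block is nested under multimodal, the model arguments are nested under model_args, and n_latent is left blank because scvi-tools' MULTIVI takes an optional integer there, although the list above gives a Boolean default.

```yaml
multimodal:
  MultiVI:
    lowmem: True          # subset ATAC to the top 25k HVF
    model_args:
      n_hidden:
      n_latent:           # left blank in this sketch, see the note above
      region_factors: True
      latent_distribution: normal
      deeply_inject_covariates: False
      fully_paired: False
    training_args:
      max_epochs: 500
      lr: 0.0001
      use_gpu:
      train_size: 0.9
      validation_size:
      batch_size: 128
      weight_decay: 0.001
      # written with a decimal point: YAML 1.1 parsers such as PyYAML
      # read a bare 1e-08 as a string rather than a float
      eps: 1.0e-08
      early_stopping: True
      save_best: True
      check_val_every_n_epoch:
      n_steps_kl_warmup:
      n_epochs_kl_warmup: 50
      adversarial_mixing: False
    training_plan:
```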
Mofa arguments
Requires at least two modalities, can run with three
This is the minimal set of Mofa parameters required, you can add more if it fits your analysis better.
mofa:
modalities String (comma-separated), Default: rna,prot,atac
filter_by_hvg Boolean, Default: True
n_factors Integer, Default: 10
n_iterations Integer, Default: 1000
convergence_mode String, Default: fast
Choice between fast, medium, and slow.
save_parameters Boolean, Default: False
outfile String, Default: path/to/h5ad/to_save_model_to
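For illustration, the Mofa settings with their documented defaults might be written as below. The nesting under multimodal is assumed, and the HVG filter key is spelled filter_by_hvg as in the TotalVI section, which is also an assumption. The outfile value is the placeholder from the list above, not a real path.

```yaml
multimodal:
  mofa:
    modalities: rna,prot,atac
    filter_by_hvg: True
    n_factors: 10
    n_iterations: 1000
    convergence_mode: fast   # fast, medium or slow
    save_parameters: False
    outfile: path/to/h5ad/to_save_model_to
```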
WNN arguments
Requires at least two modalities, can run with three
This is the minimal set of WNN parameters required, you can add more if it fits your analysis better. Panpipes uses muon’s implementation of WNN.
WNN:
modalities String (comma-separated), Default: rna, prot, atac
batch_corrected String, Default: None
Set each modality to one method ("bbknn", "scVI", "harmony", "scanorama"). If left as None, neighbours are calculated de novo on the non-corrected data for that modality using the specified parameters.
rna String, Default: None
Options here include "bbknn" and "harmony".
prot String, Default: None
Options here include "harmony".
atac String, Default: None
knn:
rna String, Default: *rna_neighbors
prot String, Default: *prot_neighbors
atac String, Default: *atac_neighbors
n_neighbors String, Default: "leave blank"
Leave blank to use the arithmetic mean of the neighbors across modalities.
n_bandwidth_neighbors Integer, Default: 20
n_multineighbors Integer, Default: 200
metric String, Default: euclidean
low_memory Boolean, Default: True
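The WNN settings can be sketched as follows. The aliases under knn only resolve if the matching anchors exist earlier in the same file, so this fragment repeats compact stand-ins for the three unimodal neighbors blocks; in a real pipeline.yml those anchors would sit in the rna, prot and atac sections. The nesting is an assumption.

```yaml
# Stand-ins for the anchored neighbors blocks of each modality
rna:
  neighbors: &rna_neighbors {npcs: 30, k: 30, metric: euclidean, method: scanpy}
prot:
  neighbors: &prot_neighbors {npcs: 30, k: 30, metric: euclidean, method: scanpy}
atac:
  neighbors: &atac_neighbors {npcs: 30, k: 30, metric: euclidean, method: scanpy}

multimodal:
  WNN:
    modalities: rna,prot,atac
    # blank means: compute neighbors de novo on the uncorrected data
    batch_corrected:
      rna:
      prot:
      atac:
    knn:
      rna: *rna_neighbors
      prot: *prot_neighbors
      atac: *atac_neighbors
    n_neighbors:          # blank: arithmetic mean across modalities
    n_bandwidth_neighbors: 20
    n_multineighbors: 200
    metric: euclidean
    low_memory: True
```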
KNN calculation for multimodal analysis
neighbors:
npcs Integer, Default: 30
The number of principal components to calculate for neighbors and UMAP. If no correction is applied, PCA will be calculated and used to run the UMAP. If harmony is chosen, it will use the following components to create a corrected dimensionality reduction.
k Integer, Default: 30
metric String, Default: euclidean
Options include euclidean and cosine.
method String, Default: scanpy
Options include scanpy and hnsw.
Plotting parameters
plotqc:
grouping_var String, Default: sample_id
Column name(s) of the covariate(s) you want to group the plot on. Must be a categorical variable. Must be provided as a comma-separated String, without spaces.
Specify other metrics you want to plot on each modality's embedding. One plot per group will be created.
Use the notation mod:variable.
These can be categorical or numeric variables.
Any metrics you may want to plot on all modality UMAPs should be listed under all.
all String, Default: rep:receptor_subtype
rna String, Default: rna:total_counts
prot String, Default: prot:total_counts
atac String, Default: atac:total_counts
multimodal String, Default: rna:total_counts
If you want to add any additional plots, simply remove the log file (logs/plot_batch_corrected_umaps.log) and run panpipes integration make plot_umaps.
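A sketch of the plotting block with the documented defaults is shown below. Each value uses the mod:variable notation; placing all keys directly under plotqc is an assumption.

```yaml
plotqc:
  grouping_var: sample_id
  # plotted on every modality's UMAP
  all: rep:receptor_subtype
  # plotted only on the named embedding
  rna: rna:total_counts
  prot: prot:total_counts
  atac: atac:total_counts
  multimodal: rna:total_counts
```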
scib metrics
To assess the unimodal data integration, we use the scib metrics.
The metrics are calculated using the scib-metrics package.
The metrics are calculated and stored for each modality separately.
scib:
rna String, Optional
Obs column name for the rna modality containing the cell type labels. If not provided, only the subset of metrics that work without cell type labels will be calculated. In particular, the bio conservation metrics will not be calculated, which also prevents creating the benchmarking plots.
prot String, Optional
Obs column name for the prot modality containing the cell type labels.
atac String, Optional
Obs column name for the atac modality containing the cell type labels.
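As an illustration, the scib block might be filled in as below. The column name cell_type is hypothetical and stands for whichever obs column holds your cell type labels; blank entries leave the label-dependent metrics out for that modality.

```yaml
scib:
  rna: cell_type   # hypothetical obs column name
  prot:
  atac:
```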
Creating the final object
Leave this final option blank until you have reviewed the results from running panpipes integration make full.
This step will produce a MuData object with one layer for each modality, and the multimodal embeddings are stored as a global view.
To store the embeddings resulting from the batch correction algorithms applied to each modality, set the relevant include to True and specify which algorithms you want to retain. The embeddings generated from the multimodal runs are stored in the global MuData layer.
To select the uncorrected unimodal embeddings, use "no_correction" for the relevant modalities.
Setting the include parameter to False for a specific modality will generate a MuData without that modality.
Then run panpipes integration make merge_integration
final_obj:
rna:
include Boolean, Default: True
bc_choice String, Default: harmony
prot:
include Boolean, Default: True
bc_choice String, Default: harmony
atac:
include Boolean, Default: False
bc_choice String, Default: harmony
multimodal:
include Boolean, Default: True
bc_choice String, Default: totalvi
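The final object selection can be sketched as the block below, using the documented defaults. The nesting follows the labels above and is otherwise an assumption; replace a bc_choice with no_correction to keep the uncorrected embedding for that modality.

```yaml
final_obj:
  rna:
    include: True
    bc_choice: harmony
  prot:
    include: True
    bc_choice: harmony
  atac:
    include: False      # the resulting MuData will not contain atac
    bc_choice: harmony
  multimodal:
    include: True
    bc_choice: totalvi
```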