Clustering YAML
This page explains the parameters of the clustering configuration YAML file.
This file is generated by running panpipes clustering config.
The individual steps run by the pipeline are described in the clustering workflow.
The clustering workflow works with outputs generated by the integration workflow. To run clustering on the multimodal embedding, it expects a MuData object with neighbors saved in the .uns of the global layer. If neighbors have already been calculated for the individual modality layers, they can be reused or recalculated on the fly.
When running the clustering workflow, panpipes provides a basic pipeline.yml file that you customize with your own parameters.
To run the workflow on your own data, specify the parameters described below in the pipeline.yml file so that they match your data.
We also provide pre-filled versions of the pipeline.yml file for individual tutorials.
For more information on the functionality implemented in panpipes for reading configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *aliases, please check our documentation.
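As an illustration of how anchors and aliases work in standard YAML, the sketch below defines a block of neighbor settings once and reuses it for a second modality. The anchor name nn_defaults is a hypothetical choice made for this example; the keys are the ones documented further down this page.

```yaml
neighbors:
  rna: &nn_defaults      # "&" defines an anchor; the name nn_defaults is hypothetical
    use_existing: False
    dim_red: X_pca
    n_dim_red: 30
    k: 30
    metric: euclidean
    method: scanpy
  prot: *nn_defaults     # "*" is an alias: prot reuses the whole rna block
```

Changing a value in the anchored block changes it for every alias that points to it, which keeps settings that are meant to be identical in one place.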
You can download the different clustering pipeline.yml files here:
Basic pipeline.yml file (not prefilled) that is generated when calling panpipes clustering config: Download here
pipeline.yml for Clustering Tutorial
Compute resources options
resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires. Specified by the following three parameters:
threads_high
Integer, Default: 2
Number of threads used for high intensity computing tasks. For each thread, there must be enough memory to load all your input files at once and create the MuData object.
threads_medium
Integer, Default: 2
Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your MuData object and do computationally light tasks.
threads_low
Integer, Default: 2
Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting; this requires much less memory than the other two.
fewer_jobs
Boolean, Default: True
condaenv
String (Path)
Path to the conda environment that should be used to run panpipes. Leave blank if running native or if your cluster automatically inherits the login node environment.
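A minimal sketch of this part of pipeline.yml with the default values listed above. The nesting (the three thread parameters under resources, with fewer_jobs and condaenv at the top level) is an assumption based on how the parameters are grouped on this page; check it against the file generated by panpipes clustering config.

```yaml
resources:
  threads_high: 2
  threads_medium: 2
  threads_low: 2
fewer_jobs: True
condaenv:            # left blank: no separate conda environment is activated
```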
Loading data
Data format
sample_prefix
String, Mandatory parameter, Default: mdata
Prefix for the sample that comes out of the filtering/preprocessing steps of the workflow.
scaled_obj
String, Mandatory parameter, Default: mdata_scaled.h5mu
Path to the output file from preprocessing (e.g. ../preprocessed/mdata_scaled.h5mu). Ensure that the path to the file is correct.
full_obj
String, Default:
Specify the full object if your scaled_obj contains only HVG. If your scaled_obj contains all the genes, leave full_obj blank. panpipes will use the full object for marker gene analysis (rank_genes_groups) and for plotting those genes.
modalities
Which modalities to run clustering on.
rna
Boolean, Default: True
If set to True, the workflow will stop if it doesn’t find a modality named ‘rna’.
prot
Boolean, Default: True
If set to True, the workflow will stop if it doesn’t find a modality named ‘prot’.
atac
Boolean, Default: False
If set to True, the workflow will stop if it doesn’t find a modality named ‘atac’.
spatial
Boolean, Default: False
If set to True, the workflow will stop if it doesn’t find a modality named ‘spatial’.
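For orientation, the data-loading parameters above could be filled in as follows for an RNA + protein dataset. The path is the example path given for scaled_obj; placing these keys at the top level of the file is an assumption based on this page's layout.

```yaml
sample_prefix: mdata
scaled_obj: ../preprocessed/mdata_scaled.h5mu
full_obj:            # blank: scaled_obj is assumed to contain all genes
modalities:
  rna: True
  prot: True
  atac: False
  spatial: False
```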
multimodal
rna_clustering
Boolean, Default: False
If set to True, runs clustering on the multimodal embedding.
integration_method
String, Default: None
If you have run WNN and want to run clustering on the WNN embedding, specify “WNN” here. The neighbours are saved with a different --neighbors_key parameter only for WNN; for every other method (totalvi, multivi, mofa), leave this parameter blank.
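A short sketch of the multimodal block for data integrated with WNN. The key names are taken as written on this page; compare them with your generated pipeline.yml before use.

```yaml
multimodal:
  rna_clustering: True
  integration_method: WNN   # leave blank for totalvi, multivi or mofa
```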
Parameters for finding neighbours
neighbors: Sets the number of neighbors to use when calculating the graph for clustering and umap.
rna:
use_existing
Boolean, Default: True
Use existing neighbours in .uns calculated in the integration workflow. If False, neighbours are recalculated using the following parameters.
dim_red
String, Default: X_pca
Defines which representation in .obsm to use for nearest neighbors.
n_dim_red
Integer, Default: 30
Number of components to use for clustering.
k
Integer, Default: 30
Number of neighbours.
metric
String, Default: euclidean
Options here include euclidean and cosine.
method
String, Default: scanpy
Options include scanpy and hnsw (from scvelo).
prot:
use_existing
Boolean, Default: True
Use existing neighbours in .uns calculated in the integration workflow. If False, neighbours are recalculated using the following parameters.
dim_red
String, Default: X_pca
Defines which representation in .obsm to use for nearest neighbors.
n_dim_red
Integer, Default: 30
Number of components to use for clustering.
k
Integer, Default: 30
Number of neighbours.
metric
String, Default: euclidean
Options here include euclidean and cosine.
method
String, Default: scanpy
Options include scanpy and hnsw (from scvelo).
atac:
use_existing
Boolean, Default: True
Use existing neighbours in .uns calculated in the integration workflow. If False, neighbours are recalculated using the following parameters.
dim_red
String, Default: X_lsi
Defines which representation in .obsm to use for nearest neighbors.
n_dim_red
Integer, Default: 1
Number of components to use for clustering.
k
Integer, Default: 30
Number of neighbours.
metric
String, Default: euclidean
Options here include euclidean and cosine.
method
String, Default: scanpy
Options include scanpy and hnsw (from scvelo).
spatial:
use_existing
Boolean, Default: False
Use existing neighbours in .uns calculated in the integration workflow. If False, neighbours are recalculated using the following parameters.
dim_red
String, Default: X_pca
Defines which representation in .obsm to use for nearest neighbors.
n_dim_red
Integer, Default: 30
Number of components to use for clustering.
k
Integer, Default: 30
Number of neighbours.
metric
String, Default: euclidean
Options here include euclidean and cosine.
method
String, Default: scanpy
Options include scanpy and hnsw (from scvelo).
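To make the nesting concrete, the sketch below keeps the existing neighbours for rna and recalculates them for prot with a cosine metric. The prot values for n_dim_red and k are illustrative only, not recommendations, and only two modalities are shown.

```yaml
neighbors:
  rna:
    use_existing: True     # reuse the graph stored by the integration workflow
    dim_red: X_pca
    n_dim_red: 30
    k: 30
    metric: euclidean
    method: scanpy
  prot:
    use_existing: False    # recalculate with the parameters below
    dim_red: X_pca
    n_dim_red: 15          # illustrative value
    k: 20                  # illustrative value
    metric: cosine
    method: scanpy
```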
Parameters for umap calculation
umap:
run
Boolean, Default: True
If set to True, runs the umap calculation and plotting.
rna:
mindist
Float, Default: 0.5
Can specify a single float or an array: 0.25,0.5
prot:
mindist
Float, Default: 0.5
Can specify a single float or an array: 0.25,0.5,0.8
atac:
mindist
Float, Default: 0.5
Can specify a single float or an array: 0.25,0.5,0.8
multimodal:
mindist
Float, Default: 0.5
Can specify a single float or an array: 0.25,0.5,0.8
spatial:
mindist
Float, Default: 0.5
Can specify a single float or an array: 0.25,0.5,0.8
The mindist values should be provided as a list, in the following format:
mindist:
- 0.25
- 0.5
Parameters for clustering
clusterspecs:
rna:
resolutions
Float, Default: 0.2, 0.6, 1
Can specify a single float or an array: 0.2,0.6,1
algorithm
String, Default: leiden
Options include louvain or leiden.
prot:
resolutions
Float, Default: 0.2, 0.6, 1
Can specify a single float or an array: 0.2,0.6,1
algorithm
String, Default: leiden
Options include louvain or leiden.
atac:
resolutions
Float, Default: 0.2, 0.6, 1
Can specify a single float or an array to compute in parallel: 0.2,0.6,1
algorithm
String, Default: leiden
Options include louvain or leiden.
multimodal:
resolutions
Float, Default: 0.5, 0.7
Can specify a single float or an array to compute in parallel: 0.2,0.6,1
algorithm
String, Default: leiden
Options include louvain or leiden.
spatial:
resolutions
Float, Default: 0.2, 0.6, 1
Can specify a single float or an array to compute in parallel: 0.2,0.6,1
algorithm
String, Default: leiden
Options include louvain or leiden.
The resolutions should be provided as a list, in the following format:
resolutions:
- 0.2
- 0.6
- 1
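Put together with the modality level, a clusterspecs block might look like the sketch below. It uses the default values listed above for rna and multimodal; the exact indentation is an assumption, and the remaining modalities follow the same pattern.

```yaml
clusterspecs:
  rna:
    resolutions:
      - 0.2
      - 0.6
      - 1
    algorithm: leiden
  multimodal:
    resolutions:
      - 0.5
      - 0.7
    algorithm: leiden
```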
Parameters for finding marker genes
This part of the configuration defines the parameters for marker analysis.
By default, pseudo_seurat is set to False, and scanpy.tl.rank_genes_groups is run.
When pseudo_seurat is set to True, a Python implementation of Seurat:::FindMarkers is run instead.
markerspecs:
rna:
run
Boolean, Default: True
layer
String, Default: logged_counts
Which layer stores counts for differential expression test.
method
String, Default: t-test_overestim_var
Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
mincells
Integer, Default: 10
Marker analysis is run for clusters with ncells >= mincells. Clusters with ncells < mincells are excluded from marker analysis.
pseudo_seurat
Boolean, Default: False
minpct
Float, Default: 0.1
Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True.
threshuse
Float, Default: 0.25
Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True.
prot:
run
Boolean, Default: True
layer
String, Default: clr
Which layer stores counts for differential expression test.
mincells
Integer, Default: 10
Marker analysis is run for clusters with ncells >= mincells. Clusters with ncells < mincells are excluded from marker analysis.
method
String, Default: wilcoxon
pseudo_seurat
Boolean, Default: False
minpct
Float, Default: 0.1
Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True.
threshuse
Float, Default: 0.25
Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True.
atac:
run
Boolean, Default: False
layer
String, Default: logged_counts
Which layer stores counts for differential expression test. Options include logged_counts, signac_norm, logTF_norm, and logIDF_norm.
mincells
Integer, Default: 10
Marker analysis is run for clusters with ncells >= mincells. Clusters with ncells < mincells are excluded from marker analysis.
method
String, Default: wilcoxon
Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
pseudo_seurat
Boolean, Default: False
minpct
Float, Default: 0.1
Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True.
threshuse
Float, Default: 0.25
Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True.
multimodal:
mincells
Integer, Default: 10
If a cluster contains fewer than this number of cells, marker analysis is not run for it.
method
String, Default: wilcoxon
Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
pseudo_seurat
Boolean, Default: False
minpct
Float, Default: 0.1
Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True.
threshuse
Float, Default: 0.25
Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True.
spatial:
run
Boolean, Default: True
layer
String, Default: norm_pearson_resid
Options include logged_counts, signac_norm, logTF_norm, and logIDF_norm.
method
String, Default: t-test_overestim_var
Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
mincells
Integer, Default: 10
Marker analysis is run for clusters with ncells >= mincells. Clusters with ncells < mincells are excluded from marker analysis.
pseudo_seurat
Boolean, Default: False
minpct
Float, Default: 0.1
Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True.
threshuse
Float, Default: 0.25
Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True.
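As a sketch of the case where pseudo_seurat is switched on, the rna block below sets minpct and threshuse explicitly, since both are mandatory in that case. All values are the defaults listed above except pseudo_seurat; only one modality is shown, and the nesting is an assumption based on this page's layout.

```yaml
markerspecs:
  rna:
    run: True
    layer: logged_counts
    method: t-test_overestim_var
    mincells: 10
    pseudo_seurat: True    # use the FindMarkers-style implementation
    minpct: 0.1            # mandatory when pseudo_seurat is True
    threshuse: 0.25        # mandatory when pseudo_seurat is True
```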
Plot specifications
Define which layers are used in the markers visualization
plotspecs:
layers:
rna
String, Default: logged_counts
prot
String, Default: clr
atac
String, Default: signac_norm
spatial
String, Default: None
Options include lognorm and norm_pearson_resid, depending on what was selected in preprocessing.
top_n_markers
Integer, Default: 10
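A possible plotspecs block using the defaults above, with norm_pearson_resid chosen for spatial as one of the two listed options. Placing top_n_markers directly under plotspecs is an assumption drawn from the order of parameters on this page.

```yaml
plotspecs:
  layers:
    rna: logged_counts
    prot: clr
    atac: signac_norm
    spatial: norm_pearson_resid   # or lognorm, depending on preprocessing
  top_n_markers: 10
```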