Clustering YAML

This page explains the parameters of the clustering configuration YAML file, which is generated by running panpipes clustering config.
The individual steps run by the pipeline are described in clustering workflow.

The clustering workflow works with outputs generated by the integration workflow. To run clustering on the multimodal embedding, it expects a MuData object with neighbors saved in the .uns of the global layer. If neighbors were calculated on each modality layer, these can be reused or recalculated on the fly.

For the clustering workflow, panpipes provides a basic pipeline.yml file to customize. To run the workflow on your own data, set the parameters described below in the pipeline.yml file so that they match your data.

We also provide pre-filled versions of the pipeline.yml file for individual tutorials.

For more information on the functionality implemented in panpipes to read the configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *aliases, please check our documentation.
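
As a minimal sketch of how anchors and aliases behave in standard YAML, the fragment below defines one block of neighbor settings and reuses it for a second modality. The anchor name neighbor_defaults is hypothetical, and sharing identical settings between rna and prot is an assumption made only for illustration.

```yaml
neighbors:
  # '&neighbor_defaults' names this block so it can be reused (hypothetical name)
  rna: &neighbor_defaults
    use_existing: False
    dim_red: X_pca
    n_dim_red: 30
    k: 30
    metric: euclidean
    method: scanpy
  # '*neighbor_defaults' resolves to the same block defined above
  prot: *neighbor_defaults
```

A YAML parser resolves the alias before the values reach the workflow, so prot receives the same six settings as rna.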

You can download the different clustering pipeline.yml files here:

Basic pipeline.yml file (not prefilled) that is generated when calling panpipes clustering config: Download here
pipeline.yml for Clustering Tutorial

Compute resources options

resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires. Specified by the following three parameters:
- threads_high Integer, Default: 2
  Number of threads used for high intensity computing tasks. For each thread, there must be enough memory to load all your input files at once and create the MuData object.
- threads_medium Integer, Default: 2
  Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your mudata and do computationally light tasks.
- threads_low Integer, Default: 2
  Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting; this requires much less memory than the other two.
fewer_jobs Boolean, Default: True
condaenv String (Path)
Path to conda environment that should be used to run panpipes. Leave blank if running native or if your cluster automatically inherits the login node environment.
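
A sketch of how this section might look in pipeline.yml, using the defaults listed above. Placing fewer_jobs and condaenv at the top level, outside the resources block, is an assumption; check the generated pipeline.yml for the exact nesting.

```yaml
resources:
  threads_high: 2
  threads_medium: 2
  threads_low: 2
# Assumed placement: top level, outside the resources block
fewer_jobs: True
# Left blank: run natively or inherit the login node environment
condaenv:
```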

Loading data

Data format

sample_prefix String, Mandatory parameter, Default: mdata
Prefix for the sample that comes out of the filtering/preprocessing steps of the workflow.
scaled_obj String, Mandatory parameter, Default: mdata_scaled.h5mu
Path to the output file from preprocessing (e.g. ../preprocessed/mdata_scaled.h5mu). Ensure that the path to the file is correct.
full_obj String, Default:
Specify the full object if your scaled_obj contains only highly variable genes (HVG). If your scaled_obj contains all the genes, leave full_obj blank. panpipes will use the full object for marker gene analysis (rank_genes_groups) and for plotting those genes.
modalities
Which modalities to run clustering on.
- rna Boolean, Default: True
  If set to True, the workflow will stop if it doesn’t find a modality named ‘rna’
- prot Boolean, Default: True
  If set to True, the workflow will stop if it doesn’t find a modality named ‘prot’
- atac Boolean, Default: False
  If set to True, the workflow will stop if it doesn’t find a modality named ‘atac’
- spatial Boolean, Default: False
  If set to True, the workflow will stop if it doesn’t find a modality named ‘spatial’
multimodal
- run_clustering Boolean, Default: False
  If set to True, runs clustering on the multimodal embedding.
- integration_method String, Default: None
  If you have run WNN and want to run clustering on the WNN embedding, specify “WNN” here. For WNN only, the neighbours are saved with a different --neighbors_key param; for every other method (totalvi, multivi, mofa) leave this parameter blank.
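
For illustration, the data-loading parameters above could be filled in as follows for an RNA + protein dataset integrated with WNN. The path is the example path given above, and the choice of WNN is an assumption; only keys described in this section are shown.

```yaml
sample_prefix: mdata
scaled_obj: ../preprocessed/mdata_scaled.h5mu
# Leave blank if scaled_obj already contains all genes
full_obj:
modalities:
  rna: True
  prot: True
  atac: False
  spatial: False
multimodal:
  # Only set for WNN; leave blank for totalvi, multivi, mofa
  integration_method: WNN
```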

Parameters for finding neighbours

neighbors: Sets the number of neighbors to use when calculating the graph for clustering and umap.
- rna:
  - use_existing Boolean, Default: True
    Use existing neighbours in .uns calculated in the integration workflow. If False, it will recalculate using the following parameters
  - dim_red String, Default: X_pca
    Defines which representation in .obsm to use for nearest neighbors
  - n_dim_red Integer, Default: 30
    Number of components to use for clustering
  - k Integer, Default: 30
    Number of neighbours
  - metric String, Default: euclidean
    Options here include euclidean and cosine
  - method String, Default: scanpy
    Options include scanpy and hnsw (from scvelo)
- prot:
  - use_existing Boolean, Default: True
    Use existing neighbours in .uns calculated in the integration workflow. If False, it will recalculate using the following parameters
  - dim_red String, Default: X_pca
    Defines which representation in .obsm to use for nearest neighbors
  - n_dim_red Integer, Default: 30
    Number of components to use for clustering
  - k Integer, Default: 30
    Number of neighbours
  - metric String, Default: euclidean
    Options here include euclidean and cosine
  - method String, Default: scanpy
    Options include scanpy and hnsw (from scvelo)
- atac:
  - use_existing Boolean, Default: True
    Use existing neighbours in .uns calculated in the integration workflow. If False, it will recalculate using the following parameters
  - dim_red String, Default: X_lsi
    Defines which representation in .obsm to use for nearest neighbors
  - n_dim_red Integer, Default: 1
    Number of components to use for clustering
  - k Integer, Default: 30
    Number of neighbours
  - metric String, Default: euclidean
    Options here include euclidean and cosine
  - method String, Default: scanpy
    Options include scanpy and hnsw (from scvelo)
- spatial:
  - use_existing Boolean, Default: False
    Use existing neighbours in .uns calculated in the integration workflow. If False, it will recalculate using the following parameters
  - dim_red String, Default: X_pca
    Defines which representation in .obsm to use for nearest neighbors
  - n_dim_red Integer, Default: 30
    Number of components to use for clustering
  - k Integer, Default: 30
    Number of neighbours
  - metric String, Default: euclidean
    Options here include euclidean and cosine
  - method String, Default: scanpy
    Options include scanpy and hnsw (from scvelo)
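
As a sketch, the fragment below recalculates the rna neighbors with a cosine metric and a smaller k, while reusing the existing prot neighbors from the integration workflow. The specific values k: 15 and metric: cosine are illustrative assumptions, not recommendations.

```yaml
neighbors:
  rna:
    # False: recalculate the graph with the parameters below
    use_existing: False
    dim_red: X_pca
    n_dim_red: 30
    k: 15
    metric: cosine
    method: scanpy
  prot:
    # True: reuse the neighbours already stored in .uns
    use_existing: True
    dim_red: X_pca
    n_dim_red: 30
    k: 30
    metric: euclidean
    method: scanpy
```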

Parameters for umap calculation

umap:
- run Boolean, Default: True
  Set to True runs the umap calculation and plotting.
- rna:
  - mindist Float, Default: 0.5
    Can specify a single float or an array: 0.25,0.5
- prot:
  - mindist Float, Default: 0.5
    Can specify a single float or an array: 0.25,0.5,0.8
- atac:
  - mindist Float, Default: 0.5
    Can specify a single float or an array: 0.25,0.5,0.8
- multimodal:
  - mindist Float, Default: 0.5
    Can specify a single float or an array: 0.25,0.5,0.8
- spatial:
  - mindist Float, Default: 0.5
    Can specify a single float or an array: 0.25,0.5,0.8

The mindist values should be entered as a list, in the following format:

    mindist:
      - 0.25
      - 0.5

Parameters for clustering

clusterspecs:
- rna:
  - resolutions Float, Default: 0.2, 0.6, 1
    Can specify a single float or an array: 0.2,0.6,1
  - algorithm String, Default: leiden
    Options include louvain or leiden.
- prot:
  - resolutions Float, Default: 0.2, 0.6, 1
    Can specify a single float or an array: 0.2,0.6,1
  - algorithm String, Default: leiden
    Options include louvain or leiden.
- atac:
  - resolutions Float, Default: 0.2, 0.6, 1
    Can specify a single float or an array to compute in parallel: 0.2,0.6,1
  - algorithm String, Default: leiden
    Options include louvain or leiden.
- multimodal:
  - resolutions Float, Default: 0.5, 0.7
    Can specify a single float or an array to compute in parallel: 0.2,0.6,1
  - algorithm String, Default: leiden
    Options include louvain or leiden.
- spatial:
  - resolutions Float, Default: 0.2, 0.6, 1
    Can specify a single float or an array to compute in parallel: 0.2,0.6,1
  - algorithm String, Default: leiden
    Options include louvain or leiden.

The resolutions should be entered as a list, in the following format:

    resolutions:
      - 0.2
      - 0.6
      - 1
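
A sketch of a complete clusterspecs section for two embeddings, using the default resolutions and algorithm listed above. Restricting the example to rna and multimodal is an assumption made to keep it short.

```yaml
clusterspecs:
  rna:
    resolutions:
      - 0.2
      - 0.6
      - 1
    algorithm: leiden
  multimodal:
    resolutions:
      - 0.5
      - 0.7
    algorithm: leiden
```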

Parameters for finding marker genes

This section defines the parameters for marker analysis. By default, pseudo_seurat is set to False and scanpy.tl.rank_genes_groups is run. When pseudo_seurat is set to True, a Python implementation of Seurat:::FindMarkers is run instead.

markerspecs:
- rna:
  - run Boolean, Default: True
  - layer String, Default: logged_counts
    Which layer stores counts for differential expression test.
  - method String, Default: t-test_overestim_var
    Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
  - mincells Integer, Default: 10
    Marker analysis is run for clusters with at least mincells cells. Clusters with fewer cells are excluded from marker analysis.
  - pseudo_seurat Boolean, Default: False
  - minpct Float, Default: 0.1
    Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True
  - threshuse Float, Default: 0.25
    Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True
- prot:
  - run Boolean, Default: True
  - layer String, Default: clr
    Which layer stores counts for differential expression test.
  - mincells Integer, Default: 10
    Marker analysis is run for clusters with at least mincells cells. Clusters with fewer cells are excluded from marker analysis.
  - method String, Default: wilcoxon
  - pseudo_seurat Boolean, Default: False
  - minpct Float, Default: 0.1
    Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True.
  - threshuse Float, Default: 0.25
    Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True.
- atac:
  - run Boolean, Default: False
  - layer String, Default: logged_counts
    Which layer stores counts for differential expression test. Options include logged_counts, signac_norm, logTF_norm, and logIDF_norm.
  - mincells Integer, Default: 10
    Marker analysis is run for clusters with at least mincells cells. Clusters with fewer cells are excluded from marker analysis.
  - method String, Default: wilcoxon
    Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
  - pseudo_seurat Boolean, Default: False
  - minpct Float, Default: 0.1
    Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True.
  - threshuse Float, Default: 0.25
    Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True.
- multimodal:
  - mincells Integer, Default: 10
    If a cluster contains fewer than this number of cells, the marker analysis won’t be run for it.
  - method String, Default: wilcoxon
    Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
  - pseudo_seurat Boolean, Default: False
  - minpct Float, Default: 0.1
    Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True.
  - threshuse Float, Default: 0.25
    Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True.
- spatial:
  - run Boolean, Default: True
  - layer String, Default: norm_pearson_resid
    Options include logged_counts, signac_norm, logTF_norm, and logIDF_norm.
  - method String, Default: t-test_overestim_var
    Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’
  - mincells Integer, Default: 10
    Marker analysis is run for clusters with at least mincells cells. Clusters with fewer cells are excluded from marker analysis.
  - pseudo_seurat Boolean, Default: False
  - minpct Float, Default: 0.1
    Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True.
  - threshuse Float, Default: 0.25
    Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True.
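
As a sketch of the pseudo_seurat case, the fragment below enables it for rna and therefore sets minpct and threshuse explicitly, since both are mandatory in that mode. Switching method to wilcoxon is an illustrative assumption; the other values are the defaults listed above.

```yaml
markerspecs:
  rna:
    run: True
    layer: logged_counts
    method: wilcoxon
    # Clusters with fewer than 10 cells are skipped
    mincells: 10
    pseudo_seurat: True
    # Both required because pseudo_seurat is True
    minpct: 0.1
    threshuse: 0.25
```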

Plot specifications

Define which layers are used in the markers visualization

plotspecs:
- layers:
  - rna String, Default: logged_counts
  - prot String, Default: clr
  - atac String, Default: signac_norm
  - spatial String, Default: None
    Options include lognorm and norm_pearson_resid, depending on what was selected during preprocessing.
- top_n_markers Integer, Default: 10
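
A sketch of the plot specifications with the defaults listed above. Nesting top_n_markers inside plotspecs, next to layers, is an assumption; compare it with the generated pipeline.yml before relying on it.

```yaml
plotspecs:
  layers:
    rna: logged_counts
    prot: clr
    atac: signac_norm
    # Left blank (default None); set to lognorm or norm_pearson_resid if spatial data is used
    spatial:
  # Assumed to sit inside plotspecs, at the same level as layers
  top_n_markers: 10
```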