Clustering YAML

This page explains the parameters of the clustering configuration YAML file. The file is generated by running panpipes clustering config.
The individual steps run by the pipeline are described in the clustering workflow documentation.

The clustering workflow works with outputs generated by the integration workflow and expects a MuData object with neighbors saved in the .uns of the global layer in order to run clustering on the multimodal embedding. If neighbors have been calculated on the individual modality layers, these can be reused or recalculated on the fly.

When running the clustering workflow, panpipes provides a basic pipeline.yml file to customize with parameters. To run the workflow on your own data, you need to specify the parameters described below in the pipeline.yml file to meet the requirements of your data.

However, we do provide pre-filled versions of the pipeline.yml file for individual tutorials.

For more information on the functionality implemented in panpipes for reading configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *aliases, please check our documentation.
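As an illustration of the anchor and alias syntax mentioned above, the following sketch reuses one list of resolutions for two modalities. The keys come from the clusterspecs section of this page; the anchor name default_res is hypothetical, and the reuse is assumed to follow standard YAML behaviour.

```yaml
clusterspecs:
  rna:
    # "&default_res" attaches an anchor (hypothetical name) to this list
    resolutions: &default_res
      - 0.2
      - 0.6
      - 1
    algorithm: leiden
  prot:
    # "*default_res" reuses the anchored list without retyping it
    resolutions: *default_res
    algorithm: leiden
```

Editing the anchored list then changes the resolutions for both modalities at once.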

You can download the different clustering pipeline.yml files here:

Compute resources options

  • resources
    Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires. Specified by the following three parameters:

    • threads_high Integer, Default: 2
      Number of threads used for high intensity computing tasks. For each thread, there must be enough memory to load all your input files at once and create the MuData object.

    • threads_medium Integer, Default: 2
      Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your mudata and do computationally light tasks.

    • threads_low Integer, Default: 2
      Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting; this requires much less memory than the other two.

    • fewer_jobs Boolean, Default: True

    • condaenv String (Path)
      Path to the conda environment that should be used to run panpipes. Leave blank if running natively or if your cluster automatically inherits the login node environment.
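A minimal sketch of the three thread settings described above, using the documented defaults. Only the thread parameters are shown; the values are examples and should be adapted to the memory available per thread on your system.

```yaml
resources:
  # high-intensity tasks: each thread must be able to hold all input files
  threads_high: 2
  # medium-intensity tasks: each thread loads the MuData object
  threads_medium: 2
  # low-intensity tasks: text files and plotting
  threads_low: 2
```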

Loading data

Data format

  • sample_prefix String, Mandatory parameter, Default: mdata
    Prefix for the sample that comes out of the filtering/preprocessing steps of the workflow.

  • scaled_obj String, Mandatory parameter, Default: mdata_scaled.h5mu
    Path to the output file from preprocessing (e.g. ../preprocessed/mdata_scaled.h5mu). Ensure that the path to the file is correct.

  • full_obj String, Default:
    Specify the full object if your scaled_obj contains only highly variable genes (HVGs). If your scaled_obj contains all the genes, leave full_obj blank. panpipes will use the full object for marker gene analysis (rank_genes_groups) and for plotting those genes.

  • modalities
    Which modalities to run clustering on.

    • rna Boolean, Default: True
      If set to True, the workflow will stop if it doesn’t find a modality named ‘rna’

    • prot Boolean, Default: True
      If set to True, the workflow will stop if it doesn’t find a modality named ‘prot’

    • atac Boolean, Default: False
      If set to True, the workflow will stop if it doesn’t find a modality named ‘atac’

    • spatial Boolean, Default: False
      If set to True, the workflow will stop if it doesn’t find a modality named ‘spatial’

  • multimodal

    • rna_clustering Boolean, Default: False
      If set to True, runs clustering on multimodal embedding

    • integration_method String, Default: None
      If you have run WNN and want to run clustering on the WNN embedding, specify “WNN” here. The neighbours are saved with a different --neighbors_key parameter only for WNN; for every other method (totalvi, multivi, mofa), leave this parameter blank.
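The data-loading parameters above could be filled in as follows for a CITE-seq style dataset integrated with WNN. This is a sketch under assumptions: the file path is the example path quoted above, and the top-level placement of the keys is inferred from the indentation of the list on this page, so compare it with your generated pipeline.yml.

```yaml
sample_prefix: mdata
# example path taken from the scaled_obj description above
scaled_obj: ../preprocessed/mdata_scaled.h5mu
# left blank because scaled_obj is assumed to contain all genes
full_obj:
modalities:
  rna: True
  prot: True
  atac: False
  spatial: False
multimodal:
  rna_clustering: True
  # only set for WNN; leave blank for totalvi, multivi or mofa
  integration_method: WNN
```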

Parameters for finding neighbours

  • neighbors: Sets the number of neighbors to use when calculating the graph for clustering and umap.

    • rna:

      • use_existing Boolean, Default: True
        Use existing neighbours in .uns calculated in the integration workflow. If False, it will recalculate using the following parameters

      • dim_red String, Default: X_pca
        Defines which representation in .obsm to use for nearest neighbors

      • n_dim_red Integer, Default: 30
        Number of components to use for clustering

      • k Integer, Default: 30
        Number of neighbours

      • metric String, Default: euclidean
        Options here include euclidean and cosine

      • method String, Default: scanpy
        Options include scanpy and hnsw (from scvelo)

    • prot:

      • use_existing Boolean, Default: True
        Use existing neighbours in .uns calculated in the integration workflow. If False, it will recalculate using the following parameters

      • dim_red String, Default: X_pca
        Defines which representation in .obsm to use for nearest neighbors

      • n_dim_red Integer, Default: 30
        Number of components to use for clustering

      • k Integer, Default: 30
        Number of neighbours

      • metric String, Default: euclidean
        Options here include euclidean and cosine

      • method String, Default: scanpy
        Options include scanpy and hnsw (from scvelo)

    • atac:

      • use_existing Boolean, Default: True
        Use existing neighbours in .uns calculated in the integration workflow. If False, it will recalculate using the following parameters

      • dim_red String, Default: X_lsi
        Defines which representation in .obsm to use for nearest neighbors

      • n_dim_red Integer, Default: 1
        Number of components to use for clustering

      • k Integer, Default: 30
        Number of neighbours

      • metric String, Default: euclidean
        Options here include euclidean and cosine

      • method String, Default: scanpy
        Options include scanpy and hnsw (from scvelo)

    • spatial:

      • use_existing Boolean, Default: False
        Use existing neighbours in .uns calculated in the integration workflow. If False, it will recalculate using the following parameters

      • dim_red String, Default: X_pca
        Defines which representation in .obsm to use for nearest neighbors

      • n_dim_red Integer, Default: 30
        Number of components to use for clustering

      • k Integer, Default: 30
        Number of neighbours

      • metric String, Default: euclidean
        Options here include euclidean and cosine

      • method String, Default: scanpy
        Options include scanpy and hnsw (from scvelo)
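The following sketch contrasts the two modes described above: neighbours are recalculated for rna with non-default settings, while the prot neighbours from the integration workflow are reused. The values k: 15 and metric: cosine are illustrative choices, not recommendations.

```yaml
neighbors:
  rna:
    # recalculate instead of reusing the graph stored in .uns
    use_existing: False
    dim_red: X_pca
    n_dim_red: 30
    k: 15
    metric: cosine
    method: scanpy
  prot:
    # reuse the existing graph; the remaining keys keep their defaults
    use_existing: True
    dim_red: X_pca
    n_dim_red: 30
    k: 30
    metric: euclidean
    method: scanpy
```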

Parameters for umap calculation

  • umap:

    • run Boolean, Default: True
      If set to True, runs the UMAP calculation and plotting.

    • rna:

      • mindist Float, Default: 0.5
        Can specify a single float or an array: 0.25,0.5

    • prot:

      • mindist Float, Default: 0.5
        Can specify a single float or an array: 0.25,0.5,0.8

    • atac:

      • mindist Float, Default: 0.5
        Can specify a single float or an array: 0.25,0.5,0.8

    • multimodal:

      • mindist Float, Default: 0.5
        Can specify a single float or an array: 0.25,0.5,0.8

    • spatial:

      • mindist Float, Default: 0.5
        Can specify a single float or an array: 0.25,0.5,0.8

The mindist parameters should be entered as a list, in the following format:

    mindist:
      - 0.25
      - 0.5
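Placed in context, a umap block that tests two mindist values for rna and a single value for the multimodal embedding might look like the sketch below. The choice of modalities and values is illustrative only.

```yaml
umap:
  run: True
  rna:
    # two values: one UMAP is computed per entry
    mindist:
      - 0.25
      - 0.5
  multimodal:
    # a single value is still written as a one-item list
    mindist:
      - 0.5
```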

Parameters for clustering

  • clusterspecs:

    • rna:

      • resolutions Float, Default: 0.2, 0.6, 1
        Can specify a single float or an array: 0.2,0.6,1

      • algorithm String, Default: leiden
        Options include louvain or leiden.

    • prot:

      • resolutions Float, Default: 0.2, 0.6, 1
        Can specify a single float or an array: 0.2,0.6,1

      • algorithm String, Default: leiden
        Options include louvain or leiden.

    • atac:

      • resolutions Float, Default: 0.2, 0.6, 1
        Can specify a single float or an array to compute in parallel: 0.2,0.6,1

      • algorithm String, Default: leiden
        Options include louvain or leiden.

    • multimodal:

      • resolutions Float, Default: 0.5, 0.7
        Can specify a single float or an array to compute in parallel: 0.2,0.6,1

      • algorithm String, Default: leiden
        Options include louvain or leiden.

    • spatial:

      • resolutions Float, Default: 0.2, 0.6, 1
        Can specify a single float or an array to compute in parallel: 0.2,0.6,1

      • algorithm String, Default: leiden
        Options include louvain or leiden.

The resolutions should be entered as a list, in the following format:

    resolutions:
      - 0.2
      - 0.6
      - 1

Parameters for finding marker genes

In this part of the analysis we define the parameters for marker analysis. By default, pseudo_seurat is set to False and scanpy.tl.rank_genes_groups is run. When pseudo_seurat is set to True, a Python implementation of Seurat::FindMarkers is run instead.

  • markerspecs:

    • rna:

      • run Boolean, Default: True

      • layer String, Default: logged_counts
        Which layer stores counts for differential expression test.

      • method String, Default: t-test_overestim_var
        Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’

      • mincells Integer, Default: 10
        Marker analysis is run only for clusters with at least mincells cells; smaller clusters are excluded from marker analysis.

      • pseudo_seurat Boolean, Default: False

      • minpct Float, Default: 0.1
        Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True

      • threshuse Float, Default: 0.25
        Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True

    • prot:

      • run Boolean, Default: True

      • layer String, Default: clr
        Which layer stores counts for differential expression test.

      • mincells Integer, Default: 10
        Marker analysis is run only for clusters with at least mincells cells; smaller clusters are excluded from marker analysis.

      • method String, Default: wilcoxon

      • pseudo_seurat Boolean, Default: False

      • minpct Float, Default: 0.1
        Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True

      • threshuse Float, Default: 0.25
        Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True

    • atac:

      • run Boolean, Default: False

      • layer String, Default: logged_counts
        Which layer stores counts for differential expression test. Options include logged_counts, signac_norm, logTF_norm, and logIDF_norm

      • mincells Integer, Default: 10
        Marker analysis is run only for clusters with at least mincells cells; smaller clusters are excluded from marker analysis.

      • method String, Default: wilcoxon
        Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’

      • pseudo_seurat Boolean, Default: False

      • minpct Float, Default: 0.1
        Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True

      • threshuse Float, Default: 0.25
        Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True

    • multimodal:

      • mincells Integer, Default: 10
        If a cluster contains fewer than this number of cells, the marker analysis won’t be run for it.

      • method String, Default: wilcoxon
        Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’

      • pseudo_seurat Boolean, Default: False

      • minpct Float, Default: 0.1
        Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True

      • threshuse Float, Default: 0.25
        Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True

    • spatial:

      • run Boolean, Default: True

      • layer String, Default: norm_pearson_resid
        Options include logged_counts, signac_norm, logTF_norm, and logIDF_norm

      • method String, Default: t-test_overestim_var
        Options include: ‘logreg’, ‘t-test’, ‘wilcoxon’, ‘t-test_overestim_var’

      • mincells Integer, Default: 10
        Marker analysis is run only for clusters with at least mincells cells; smaller clusters are excluded from marker analysis.

      • pseudo_seurat Boolean, Default: False

      • minpct Float, Default: 0.1
        Only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. This parameter is mandatory if pseudo_seurat is set to True

      • threshuse Float, Default: 0.25
        Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. This parameter is mandatory if pseudo_seurat is set to True
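As a sketch of how the marker parameters fit together, the block below enables the Seurat-style test for rna, which makes minpct and threshuse mandatory. Only the rna modality is shown; all values are the documented defaults except pseudo_seurat, which is switched on for illustration.

```yaml
markerspecs:
  rna:
    run: True
    layer: logged_counts
    method: t-test_overestim_var
    # clusters with fewer than 10 cells are skipped
    mincells: 10
    pseudo_seurat: True
    # both parameters below are mandatory once pseudo_seurat is True
    minpct: 0.1
    threshuse: 0.25
```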

Plot specifications

Define which layers are used in the marker visualization.

  • plotspecs:

    • layers:

      • rna String, Default: logged_counts

      • prot String, Default: clr

      • atac String, Default: signac_norm

      • spatial String, Default: None
        Options include lognorm and norm_pearson_resid, depending on what was selected during preprocessing.

  • top_n_markers Integer, Default: 10
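A sketch of the layers block for a dataset that includes a spatial modality normalised with Pearson residuals. The spatial value is an illustrative choice between the two options listed above; the other entries are the documented defaults.

```yaml
plotspecs:
  layers:
    rna: logged_counts
    prot: clr
    atac: signac_norm
    # must match the normalisation chosen during preprocessing
    spatial: norm_pearson_resid
```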