Clustering
The panpipes clustering pipeline runs the following steps:
Detect or generate nearest neighbor graph
Generate UMAPS with different minimum distance parameters
Calculate leiden or louvain clustering, at a range of resolutions, saving them as csv and in the MuData object.
Visualise clustering on UMAPs and as a clustree diagram. Note that it is important to specify more than 1 resolution for clustree to work.
Identify top marker genes for each cluster at each clustering resolution, using
scanpy.tl.rank_genes_groups
saved as csvs.
Steps to run
In a new folder, generate config file for clustering,
panpipes clustering config
and edit the pipeline.yml file.Run the clustering pipeline you can run the full pipeline with
panpipes clustering make full
or follow this step by step guide:panpipes clustering make cluster_analysis
. This will do the initial nearest neighbours and clustering for the parameters you specify.Decide on the best values for k nearest neighbours based on UMAPs and clustree results. (Optionally) Once decided delete the folders for the parameters you don’t need and delete those from the pipeline.yml.
Find markers for each of your cluster resolutions with
panpipes clustering make marker_analysis
(Again you could run all the clustering pipeline at once withpanpipes clustering make full
but by making decisions along the way you’ll reduce the computation and file size burden of the pipeline)
Expected structure of MuData object
The ideal way to run panpipes clustering
is to use the output MuData file from panpipes preprocess
or panpipes integration
, as this will make sure the MuData object has correctly names layers and slots.
The bare minimum MuData object required is normalised data in the .X
slot of each modality, a sample_id column in each slot of the obs and the outer obs, and the layers indicated in the find_markers section of the yml.
Check the clustering tutorial for an example of how to run the workflow.
A note on neighbors graph:
In order to take advantage of batch corrected data, we recommend using the integration
workflow which generates K nearest neighbor graphs (KNN) based on batch corrected/integrated data, before running clustering
. If you wish to create a new nearest neighbor graph (perhaps to test a different parameter set), then make sure to set the use_existing: False
in the clustering yml
configuration file and fill in the yml with your custom params, and repeat for each modality for which you want to recompute the neighbors :
neighbors:
rna:
use_existing: False
# which representation in .obsm to use for nearest neighbors
# if dim_red=X_pca and X_pca not in .obsm, will be computed with default parameters
dim_red: X_pca
#how many components to use for clustering
n_dim_red: 30
# number of neighbours
k: 30
# metric: euclidean | cosine
metric: euclidean
# scanpy | hnsw (from scvelo)
method: scanpy