Spatial Preprocessing YAML

This page explains the parameters of the preprocess_spatial configuration yaml file. The file is generated by running panpipes preprocess_spatial config.
The individual steps run by the pipeline are described in the spatial preprocessing workflow.

For more information on the functionalities implemented in panpipes to read configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *aliases, please check our documentation.
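As a generic illustration of block reuse in YAML, a block marked with & can be referenced elsewhere with *. The sketch below is plain YAML and is not taken from the panpipes template; all key names in it are hypothetical.

```yaml
# Generic YAML example; key names are hypothetical, not panpipes parameters.
default_thresholds: &thresholds   # "&thresholds" marks this block for reuse
  min:
    total_counts: 50
slide_a: *thresholds              # "*thresholds" inserts the marked block here
slide_b: *thresholds
```

After parsing, slide_a and slide_b hold the same content as default_thresholds.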

When running the preprocess_spatial workflow, panpipes provides a basic pipeline.yml file. To run the workflow on your own data, set the parameters described below in the pipeline.yml file to match your data. We also provide pre-filled versions of the pipeline.yml file for individual tutorials. You can download the different preprocess pipeline.yml files here:

Basic pipeline.yml file (not prefilled) that is generated when calling panpipes preprocess_spatial config: Download here
pipeline.yml file for Preprocessing spatial data Tutorial: Download here

0. Compute Resource Options

resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires. Specified by the following three parameters:

threads_high Integer, Default: 1
Number of threads used for high intensity computing tasks.
threads_medium Integer, Default: 1
Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your SpatialData and do computationally light tasks.
threads_low Integer, Default: 1
Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting. These tasks require much less memory than the other two.

condaenv String
Path to the conda environment that should be used to run panpipes. Leave blank if running natively or if your cluster automatically inherits the login node environment.
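A filled-in version of this section might look like the sketch below. The thread counts and the conda path are illustrative values, and the nesting of the three thread parameters under resources is inferred from the heading above rather than copied from the template.

```yaml
# Illustrative values only; nesting under `resources` is an assumption.
resources:
  threads_high: 4
  threads_medium: 2
  threads_low: 1
# Hypothetical path; leave empty if no conda environment needs to be activated.
condaenv: /path/to/conda/envs/panpipes
```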

1. Input Options

With the preprocess_spatial workflow, one or multiple SpatialData objects can be preprocessed in one run. The workflow reads in all .zarr objects of a directory. The SpatialData objects in the directory need to be of the same assay (Vizgen, Visium, or Xenium). The workflow then runs the preprocessing of each SpatialData object separately with the same parameters that are specified in the yaml file.

input_dir String, Mandatory parameter
Path to the folder containing all input zarr files.

assay ['visium', 'vizgen'], Default: 'visium'
Spatial transcriptomics assay of the zarr files in input_dir.
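For example, a run on Visium data could be configured as sketched here. The directory is a hypothetical path, and placing both keys at the top level of the file is an assumption.

```yaml
# Hypothetical folder that holds one or more .zarr SpatialData objects.
input_dir: ./data/spatialdata_zarr
# All objects in input_dir must come from this assay.
assay: visium
```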

2. Filtering Options

filtering

run Boolean, Default: False
Whether to run filtering. If False, will not filter the data and will not produce post-filtering plots.
keep_barcodes String, Default: None
Path to a headerless csv file containing the barcodes you want to keep. Barcodes that are not in the file will be removed from the dataset before the thresholds specified below are applied.

With the parameters below you can specify thresholds for filtering. The filtering is fully customisable to any columns in .obs or .var; you are not restricted to the columns given as defaults. When specifying a column name, please make sure it exactly matches the column name in the table of the SpatialData object.
Please also make sure that the specified metrics are present in all SpatialData objects of the input_dir, i.e. in every SpatialData object on which the preprocessing is run.

spatial:
    obs:
        min:
            total_counts: 
        max:
            pct_counts_mt:
        bool: 
    var:
        min:
            n_cells_by_counts: 
        max:
            total_counts:

3. Post-Filter Plotting

The parameters below specify which metrics of the filtered data to plot. As for the QC, violin and spatial embedding plots are generated for each slide separately.

plotqc

grouping_var String, Default: None
Comma-separated string without spaces, e.g. sample_id,batch of categorical columns in .obs. One violin will be created for each group in the violin plot. Not mandatory, can be left empty.
spatial_metrics String, Default: None
Comma-separated string without spaces, e.g. total_counts,n_genes_by_counts of columns in .obs or .var.
Specifies which metrics to plot. If a metric is present in both .obs and .var, both will be plotted.
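Using the example strings given above, the plotting section could be written as follows. The nesting under plotqc is inferred from the heading, and the column names must exist in your own data.

```yaml
plotqc:
  # One violin per group; comma-separated, no spaces.
  grouping_var: sample_id,batch
  # Metrics looked up in .obs and .var; comma-separated, no spaces.
  spatial_metrics: total_counts,n_genes_by_counts
```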

4. Normalization, HVG Selection, and PCA Options

4.1 Normalization and HVG Selection

Panpipes offers two different normalization and HVG selection flavours, 'seurat' and 'squidpy'.
The 'seurat' flavour first selects HVGs on the raw counts using analytic Pearson residuals, i.e. scanpy.experimental.pp.highly_variable_genes. Afterwards, analytic Pearson residual normalization is applied, i.e. scanpy.experimental.pp.normalize_pearson_residuals. Parameters of both functions can be specified by the user in the yaml file.
The 'squidpy' flavour runs the basic scanpy normalization and HVG selection functions, i.e. scanpy.pp.normalize_total, scanpy.pp.log1p, and scanpy.pp.highly_variable_genes.

norm_hvg_flavour ['squidpy', 'seurat'], Default: None
Normalization and HVG selection flavour to use. If None, neither normalization nor HVG selection is run.

Parameters for norm_hvg_flavour == 'squidpy'

squidpy_hvg_flavour ['seurat', 'cellranger', 'seurat_v3'], Default: 'seurat'
Flavour to select HVGs, i.e. the flavor parameter of the function scanpy.pp.highly_variable_genes.

min_mean Float, Default: 0.05
Parameter in scanpy.pp.highly_variable_genes.

max_mean Float, Default: 1.5
Parameter in scanpy.pp.highly_variable_genes.

min_disp Float, Default: 0.5
Parameter in scanpy.pp.highly_variable_genes.
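Put together, a 'squidpy' flavour configuration with the documented default values might read as sketched below. Placing these keys at the top level of the file is an assumption.

```yaml
# 'squidpy' flavour with the documented defaults; key placement is assumed.
norm_hvg_flavour: squidpy
squidpy_hvg_flavour: seurat
min_mean: 0.05
max_mean: 1.5
min_disp: 0.5
```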

Parameters for norm_hvg_flavour == 'seurat'

theta Float, Default: 100
The negative binomial overdispersion parameter for Pearson residuals. The same value is used for HVG selection and normalization.

clip Float, Default: None
Specifies clipping of the residuals.
clip can be specified as:

None: residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)]
A float value c: residuals are clipped to the interval [-c, c]
np.Inf: no clipping
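As a sketch, a 'seurat' flavour configuration with the documented defaults could look like this. Key placement at the top level is an assumption, and clip is left empty, which YAML parses as a null value corresponding to the default None.

```yaml
# 'seurat' flavour with the documented defaults; key placement is assumed.
norm_hvg_flavour: seurat
theta: 100
clip:            # empty = None, i.e. clip to [-sqrt(n_obs), sqrt(n_obs)]
n_top_genes: 2000  # mandatory for this flavour (see the shared parameters)
```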

Parameters for both norm_hvg_flavour flavours

n_top_genes Integer, Default: 2000
Number of genes to select. Mandatory for norm_hvg_flavour='seurat' and squidpy_hvg_flavour='seurat_v3'.

filter_by_hvg Boolean, Default: False
Subset the data to the HVGs.

hvg_batch_key String, Default: None
If specified, HVGs are selected within each batch separately and merged.
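The shared parameters could be set as in the sketch below. The value sample_id is a hypothetical batch column, and treating hvg_batch_key as the name of a column in .obs is an assumption.

```yaml
n_top_genes: 2000
filter_by_hvg: False
# Hypothetical batch column; leave empty to select HVGs across all cells at once.
hvg_batch_key: sample_id
```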

4.2 PCA

After normalization and HVG selection, PCA is run, and a PCA plot and an elbow plot are generated. The number of PCs specified below is used both for the PCA computation and for the elbow plot.

n_pcs Integer, Default: 50
Number of PCs to compute.
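In the configuration file this is a single entry, shown here with the documented default. Its placement at the top level is an assumption.

```yaml
# Used for both the PCA computation and the elbow plot.
n_pcs: 50
```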

5. Concatenation

In case multiple SpatialData objects have been preprocessed separately, the user has the option to concatenate the preprocessed objects in the end.

concat Boolean, Default: False
Whether to concatenate all preprocessed SpatialData objects. The concatenated object is saved to ./concatenated.data/concatenated.zarr
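To enable concatenation when several SpatialData objects are preprocessed in one run, the entry could be set as follows. Its placement at the top level is an assumption.

```yaml
# Writes the merged object to ./concatenated.data/concatenated.zarr
concat: True
```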