Spatial Preprocessing YAML
This documentation explains the parameters of the preprocess_spatial configuration yaml file.
This file is generated by running panpipes preprocess_spatial config.
The individual steps run by the pipeline are described in the spatial preprocessing workflow.
For more information on functionalities implemented in panpipes to read the configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *aliases, please check our documentation.
When running the preprocess workflow, panpipes provides a basic pipeline.yml file.
To run the workflow on your own data, you need to specify the parameters described below in the pipeline.yml file to meet the requirements of your data.
However, we do provide pre-filled versions of the pipeline.yml file for individual tutorials.
You can download the different preprocess pipeline.yml files here:
- Basic pipeline.yml file (not prefilled) that is generated when calling panpipes preprocess_spatial config: Download here
- pipeline.yml file for the Preprocessing spatial data Tutorial: Download here
0. Compute Resource Options
resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires.
Specified by the following three parameters:

threads_high Integer, Default: 1
Number of threads used for high intensity computing tasks.

threads_medium Integer, Default: 1
Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your mudata and do computationally light tasks.

threads_low Integer, Default: 1
Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting; this requires much less memory than the other two.

condaenv String
Path to the conda environment that should be used to run panpipes.
Leave blank if running natively or if your cluster automatically inherits the login node environment.
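As a sketch, the resources section of the pipeline.yml could look like the following (the thread counts and the environment path are illustrative values, not recommendations):

```yaml
resources:
  threads_high: 2     # high intensity tasks
  threads_medium: 2   # medium intensity tasks
  threads_low: 1      # low intensity tasks (plotting, text files)

# Path to the conda environment; leave blank if running natively
condaenv: /path/to/panpipes_env
```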
1. Input Options
With the preprocess_spatial workflow, one or multiple MuData
objects can be preprocessed in one run. The workflow reads in all .h5mu
objects of a directory. The MuData
objects in the directory need to be of the same assay (vizgen or visium). The workflow then runs the preprocessing of each MuData
object separately with the same parameters that are specified in the yaml file.
input_dir String, Mandatory parameter
Path to the folder containing all input h5mu files.

assay ['visium', 'vizgen'], Default: 'visium'
Spatial transcriptomics assay of the h5mu files in input_dir.
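A minimal input section might then look like this (the directory path is illustrative):

```yaml
input_dir: ./data      # folder containing all input .h5mu files
assay: visium          # 'visium' or 'vizgen'; must match all files in input_dir
```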
2. Filtering Options
filtering
run Boolean, Default: False
Whether to run filtering. If False, the data will not be filtered and no post-filtering plots will be produced.

keep_barcodes String, Default: None
Path to a headerless csv file containing the barcodes you want to keep. Barcodes that are not in the file will be removed from the dataset before the dataset is filtered with the thresholds specified below.
With the parameters below you can specify thresholds for filtering. The filtering is fully customisable to any columns in .obs or .var. You are not restricted to the columns given as default. When specifying a column name, please make sure it exactly matches the column name in the h5mu object.
Please also make sure that the specified metrics are present in all h5mu objects of the input_dir, i.e. all MuData objects for which the preprocessing is run.
spatial:
  obs:
    min:
      total_counts:
    max:
      pct_counts_mt:
    bool:
  var:
    min:
      n_cells_by_counts:
    max:
      total_counts:
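To illustrate, a filtering section with thresholds filled in could look like the sketch below. The numeric values are purely illustrative and should be chosen based on the QC plots of your own data:

```yaml
filtering:
  run: True
  keep_barcodes:
  spatial:
    obs:
      min:
        total_counts: 100      # remove observations with fewer total counts
      max:
        pct_counts_mt: 20      # remove observations with a high mitochondrial fraction
    var:
      min:
        n_cells_by_counts: 3   # remove genes detected in fewer cells
```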
3. Post-Filter Plotting
The parameters below specify which metrics of the filtered data to plot. As for the QC, violin and spatial embedding plots are generated for each slide separately.
plotqc
grouping_var String, Default: None
Comma-separated string without spaces, e.g. sample_id,batch, of categorical columns in .obs. One violin will be created for each group in the violin plot. Not mandatory, can be left empty.

spatial_metrics String, Default: None
Comma-separated string without spaces, e.g. total_counts,n_genes_by_counts, of columns in .obs or .var. Specifies which metrics to plot. If a metric is present in both .obs and .var, both will be plotted.
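A plotqc section using the example values mentioned above might look like this:

```yaml
plotqc:
  grouping_var: sample_id,batch                   # categorical .obs columns; one violin per group
  spatial_metrics: total_counts,n_genes_by_counts # .obs/.var columns to plot
```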
4. Normalization, HVG Selection, and PCA Options
4.1 Normalization and HVG Selection
Panpipes
offers two different normalization and HVG selection flavours, 'seurat'
and 'squidpy'
.
The 'seurat'
flavour first selects HVGs on the raw counts using analytic Pearson residuals, i.e. scanpy.experimental.pp.highly_variable_genes. Afterwards, analytic Pearson residual normalization is applied, i.e. scanpy.experimental.pp.normalize_pearson_residuals. Parameters of both functions can be specified by the user in the yaml file.
The 'squidpy'
flavour runs the basic scanpy normalization and HVG selection functions, i.e. scanpy.pp.normalize_total, scanpy.pp.log1p, and scanpy.pp.highly_variable_genes.
norm_hvg_flavour ['squidpy', 'seurat'], Default: None
Normalization and HVG selection flavour to use. If None, neither normalization nor HVG selection will be run.
Parameters for norm_hvg_flavour == 'squidpy'

squidpy_hvg_flavour ['seurat', 'cellranger', 'seurat_v3'], Default: 'seurat'
Flavour to select HVGs, i.e. the flavor parameter of the function scanpy.pp.highly_variable_genes.

min_mean Float, Default: 0.05
Parameter in scanpy.pp.highly_variable_genes.

max_mean Float, Default: 1.5
Parameter in scanpy.pp.highly_variable_genes.

min_disp Float, Default: 0.5
Parameter in scanpy.pp.highly_variable_genes.
Parameters for norm_hvg_flavour == 'seurat'

theta Float, Default: 100
The negative binomial overdispersion parameter for Pearson residuals. The same value is used for HVG selection and normalization.

clip Float, Default: None
Specifies clipping of the residuals. clip can be specified as:
- None: residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)]
- A float value c: residuals are clipped to the interval [-c, c]
- np.Inf: no clipping
Parameters for both norm_hvg_flavour flavours

n_top_genes Integer, Default: 2000
Number of genes to select. Mandatory for norm_hvg_flavour='seurat' and squidpy_hvg_flavour='seurat_v3'.

filter_by_hvg Boolean, Default: False
Subset the data to the HVGs.

hvg_batch_key String, Default: None
If specified, HVGs are selected within each batch separately and merged.
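Putting these parameters together, a 'squidpy'-flavour configuration might look like the following sketch (the values shown are the defaults described above; check the generated pipeline.yml for the exact nesting on your panpipes version):

```yaml
norm_hvg_flavour: squidpy    # 'squidpy' or 'seurat'; leave blank to skip normalization/HVG selection
squidpy_hvg_flavour: seurat  # passed as 'flavor' to scanpy.pp.highly_variable_genes
min_mean: 0.05
max_mean: 1.5
min_disp: 0.5
n_top_genes: 2000
filter_by_hvg: False         # set True to subset the data to the HVGs
hvg_batch_key:               # e.g. sample_id, to select HVGs per batch and merge
```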
4.2 PCA
After normalization and HVG selection, PCA is run and the PCA and elbow plots are generated. The user specifies a single number of PCs, which is used both for the PCA computation and for the elbow plot.

n_pcs Integer, Default: 50
Number of PCs to compute.
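For example, to compute 50 principal components (also setting the range of the elbow plot):

```yaml
n_pcs: 50
```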