Preprocess YAML
This page explains the parameters of the preprocess configuration YAML file, which is generated by running `panpipes preprocess config`.
The individual steps run by the pipeline are described in the preprocess workflow.
When running the preprocess workflow, panpipes provides a basic pipeline.yml file.
To run the workflow on your own data, you need to specify the parameters described below in the pipeline.yml file to meet the requirements of your data.
However, we do provide pre-filled versions of the pipeline.yml file for individual tutorials.
For more information on the functionalities panpipes implements to read the configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *scalars, please check our documentation.
You can download the different preprocess pipeline.yml files here:
Basic pipeline.yml file (not prefilled) that is generated when calling `panpipes preprocess config`: Download here.
Prefilled pipeline.yml file for the preprocess tutorial: Download here.
Compute resources options
resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires.
Specified by the following three parameters:
threads_high Integer, Default: 2
Number of threads used for high intensity computing tasks. For each thread, there must be enough memory to load all your input files at once and create the MuData object.
threads_medium Integer, Default: 2
Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your mudata and do computationally light tasks.
threads_low Integer, Default: 1
Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting; this requires much less memory than the other two.
condaenv String (Path)
Path to the conda environment that should be used to run panpipes.
Leave blank if running natively or if your cluster automatically inherits the login node environment.
For more information on this, please refer to the detailed explanation here.
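For example, a resources block using the defaults above might look like the following sketch; the conda environment path is a placeholder, and the exact placement of condaenv may differ slightly in your generated pipeline.yml:
resources:
  threads_high: 2
  threads_medium: 2
  threads_low: 1
condaenv: /path/to/conda/envs/panpipes_env  # placeholder path; leave blank if running natively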
General project specifications
sample_prefix String
Prefix for sample names.
unfiltered_obj String
If running this on prefiltered data, complete the following steps:
1. Leave unfiltered_obj (this parameter) blank
2. Rename your filtered file so that it matches the format PARAMS[‘sample_prefix’] + ‘.h5mu’
3. Put the renamed file in the same folder as the pipeline.yml
4. Set filtering run to False below
modalities
Specify which modalities are included in the data by setting the respective modality to True.
Leave empty (None) or False to signal this modality is not part of the experiment.
The modalities are processed in the order of the following list:
rna Boolean, Default: True
prot Boolean, Default: False
rep Boolean, Default: False
atac Boolean, Default: False
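As a sketch, a dataset containing RNA and protein data (and no ATAC or repertoire data) could be declared as follows:
modalities:
  rna: True
  prot: True
  rep: False
  atac: False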
Filtering Cells and Features
Filtering in panpipes is done sequentially for all modalities, filtering first cells and then features. For each modality, the pipeline.yml file contains a dictionary with the following structure:
MODALITY:
obs:
min:
max:
bool:
var:
min:
max:
bool:
This format can be applied to any modality by editing the filtering dictionary.
You are not restricted to the default columns; filtering is fully customizable to any column in the mudata .obs or .var object. When specifying a column name, make sure it exactly matches the column name in the h5mu object.
Example:
rna:
obs:
min: # Any column for which you want to run a minimum filter
n_genes_by_counts: 500 # i.e. will filter out cells with a value less than 500 in the n_genes_by_counts column
max: # Any column for which you want to run a maximum filter
pct_counts_mt: 20 # i.e. each cell may have a maximum of 20 in the pct_counts_mt column
# be careful with any columns named after gene sets.
# The column will be named based on the gene list input file,
# so if the mitochondrial genes are in group "mt"
# as in the example given in the resource file,
# then the column will be named "pct_counts_mt".
bool:
is_doublet: False # if you have any boolean columns you want to filter on,
# then use this section of the modality dictionary
# in this case any obs['is_doublet'] that are False will be retained in the dataset.
filtering
run Boolean, Default: True
If set to False, no filtering is applied to the MuData object.
keep_barcodes String (Path)
Path to a file containing specific cell barcodes you want to keep; leave blank if not applicable.
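Putting these two parameters together, the top of a filtering block might look like the sketch below; the barcode file path is a placeholder, and in the generated pipeline.yml the per-modality dictionaries described in the following sections sit alongside these keys, so check that file for the exact nesting:
filtering:
  run: True
  keep_barcodes:   # e.g. /path/to/barcodes_to_keep.csv (placeholder); leave blank to keep all barcodes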
RNA-specific filtering (rna)
obs
Parameters for obs, i.e. cell level filtering:
min
Filtering cells based on a minimum value in a column. Leave parameters blank if you do not want to filter by them.
n_genes_by_counts Integer
Minimum number of genes by counts per cell. For instance, setting the parameter to 500 will filter out cells with a value less than 500 in the n_genes_by_counts column.
max
Filtering cells based on a maximum value in a column. Leave parameters blank if you do not want to filter by them.
total_counts Integer
Cells with a total count greater than this value will be filtered out.
n_genes_by_counts Integer
Maximum number of genes by counts per cell.
pct_counts_mt Integer (in Percent)
Percent of counts that are mitochondrial genes. Cells with a value greater than this will be filtered out. Should be a value between 0 and 100 (%).
pct_counts_rp Integer (in Percent)
Percent of counts that are ribosomal genes. Cells with a value greater than this will be filtered out. Should be a value between 0 and 100 (%).
doublet_scores Float
If you want to apply a custom scrublet threshold per input sample, you can specify it here. Provide either one score for all samples (e.g. 0.25), or a csv file with two columns: sample_id and cut off.
bool
You can add a new column to the mudata[‘rna’].obs with boolean (True/False) values, and then list that column under this bool section. This can be done for any modality.
var
Parameters for var, i.e. gene (feature) level filtering:
min
n_cells_by_counts Integer
max
total_counts Integer
n_cells_by_counts Integer
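As a sketch, gene-level filtering for the RNA modality could be configured as follows; the threshold is illustrative only, and parameters left blank are skipped:
rna:
  var:
    min:
      n_cells_by_counts: 3   # illustrative: keep genes detected in at least 3 cells
    max:
      total_counts:          # leave blank to skip this filter
      n_cells_by_counts: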
Protein-specific filtering (prot)
obs
Parameters for obs, i.e. cell level filtering:
max
Filtering cells based on a maximum value in a column. Leave parameters blank if you do not want to filter by them.
total_counts Integer
Cells with a total count greater than this value will be filtered out.
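A minimal sketch of cell-level filtering for the protein modality, with an illustrative threshold:
prot:
  obs:
    max:
      total_counts: 20000   # illustrative: filter out cells with more than 20000 protein counts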
ATAC-specific filtering (atac)
var
Parameters for var, i.e. gene (feature) level filtering:
nucleosome_signal
Intersecting cell barcodes
intersect_mods String
Keep only the observations (barcodes) present in the modalities listed here, or in all modalities if set to None.
Provide a comma separated list of the modalities whose intersection of barcodes you want to keep, e.g. rna,prot
Downsampling cell barcodes
downsample_n Integer
Number of cells to downsample to; leave blank to keep all cells.
downsample_col String
If you want to equalise by dataset or sample_id, then specify a column in the obs of the adata to downsample by here.
If specified, the data will be subset to n cells per downsample_col value.
downsample_mods String (comma separated)
Specify which modalities you want to subsample.
If more than one modality is added, then these will be intersected.
Provide as a comma separated String, e.g. rna,prot
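For example, the following sketch keeps at most 1000 cells per sample in the intersected RNA and protein modalities (values are illustrative):
downsample_n: 1000          # illustrative value
downsample_col: sample_id   # subset to 1000 cells per sample_id
downsample_mods: rna,prot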
Plotting variables
plotqc
All metrics in this section should be provided as a comma separated string without spaces, e.g. a,b,c
Leave blank to avoid plotting.
grouping_var String (comma separated), Default: sample_id
Use these categorical variables to plot/split by.
rna_metrics String (comma separated), Default: pct_counts_mt,pct_counts_rp,pct_counts_hb,pct_counts_ig,doublet_scores
Specify the metrics in the metadata of the RNA modality to plot.
prot_metrics String (comma separated), Default: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype
Specify the metrics in the metadata of the Protein modality to plot.
atac_metrics String (comma separated)
Specify the metrics in the metadata of the ATAC modality to plot.
rep_metrics String (comma separated)
Specify the metrics in the metadata of the Rep modality to plot.
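For example, a plotqc block using the defaults listed above might look like this (ATAC and repertoire metrics are left blank, so they are not plotted):
plotqc:
  grouping_var: sample_id
  rna_metrics: pct_counts_mt,pct_counts_rp,pct_counts_hb,pct_counts_ig,doublet_scores
  prot_metrics: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype
  atac_metrics:
  rep_metrics: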
RNA preprocessing steps
Currently, only the standard preprocessing steps (sc.pp.normalize_total followed by sc.pp.log1p) are offered for the RNA modality.
log1p Boolean
, Default: True
If set to False, the log1p transformation is not applied to the RNA modality.
hvg
Options for the detection of highly variable genes (HVGs) in the RNA modality.
flavor String, Default: seurat
Choose one of the supported hvg_flavor options: “seurat”, “cell_ranger”, “seurat_v3”. For the dispersion based methods “seurat” and “cell_ranger”, you can specify the parameters min_mean, max_mean and min_disp (listed below). For “seurat_v3” a different method is used, and you need to specify how many variable genes to find via the parameter n_top_genes. If you specify n_top_genes, then the other parameters (min_mean, max_mean, min_disp) are nulled. For further reading on this, please refer to the scanpy API.
batch_key String
If batch_key is specified, highly-variable genes are selected within each batch separately and merged. For details on this, please refer to the scanpy API. If you want to use more than one obs column as covariates, specify this as “covariate1,covariate2” (comma separated list). Leave blank if no batch should be accounted for in the HVG detection (default behavior).
n_top_genes Integer, Default: 2000
Number of highly-variable genes to keep. You must specify this parameter if flavor is “seurat_v3”.
min_mean Float
Minimum mean expression of genes to be considered highly variable. Ignored if n_top_genes is specified or if flavor is set to “seurat_v3”.
max_mean Float
Maximum mean expression of genes to be considered highly variable. Ignored if n_top_genes is specified or if flavor is set to “seurat_v3”.
min_disp Float
Minimum dispersion of genes to be considered highly variable. Ignored if n_top_genes is specified or if flavor is set to “seurat_v3”.
exclude_file String (Path)
It may be useful to exclude some genes from the HVG selection. In this case, you can provide a file with a list of genes to exclude. We provide an example for genes that could be excluded when analyzing immune cells here. The file has three columns: the first specifies the modality, the second the gene id and the third the group to which the respective gene belongs. The workflow will exclude genes by their group name: by default, genes flagged as “exclude” in the group column are removed from HVG detection. You can customize the gene list and change the name of the gene group via the exclude: parameter (see below).
exclude String
This variable defines the group name tagging the genes to be excluded in the file specified in the previous parameter. Leave empty if you don’t want to exclude genes from HVG detection.
filter Boolean, Default: False
Set to True if you want to filter the object to retain only highly variable genes.
regress_variables String
Regression variables; specify the variables you want to regress out.
Leave blank if you don’t want to regress out anything.
We recommend not regressing out anything unless you have a good reason to.
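As a sketch, an HVG configuration using the default “seurat” flavour could look like the following; the exclusion file path is a placeholder, and the exact nesting of these keys may differ in your generated pipeline.yml:
hvg:
  flavor: seurat
  batch_key:            # leave blank to ignore batch in HVG detection
  n_top_genes: 2000
  min_mean:             # leave blank to use scanpy defaults
  max_mean:
  min_disp:
  exclude_file:         # e.g. /path/to/exclude_genes.csv (placeholder)
  exclude: exclude      # group name marking genes to exclude in that file
  filter: False
regress_variables:      # leave blank to skip regression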
Scaling
Scaling has the effect that all genes are weighted equally for downstream analysis. Whether or not to apply scaling is still a matter of debate, as stated in the Luecken et al. Best Practices paper:
“There is currently no consensus on whether or not to perform normalization over genes. While the popular Seurat tutorials (Butler et al, 2018) generally apply gene scaling, the authors of the Slingshot method opt against scaling over genes in their tutorial (Street et al, 2018). The preference between the two choices revolves around whether all genes should be weighted equally for downstream analysis, or whether the magnitude of expression of a gene is an informative proxy for the importance of the gene.”
run_scale Boolean, Default: True
Set to False if you do not want to scale the data.
scale_max_value Float
Clip to this value after scaling.
If left blank, scaling is run with default parameters, as described in the scanpy API.
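For example, a scaling configuration that clips nothing and relies on the scanpy defaults might look like this sketch:
run_scale: True
scale_max_value:   # leave blank to run scaling with scanpy's default parameters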
RNA Dimensionality Reduction
pca
Parameters for PCA dimensionality reduction.
n_pcs Integer, Default: 50
Number of principal components to compute.
solver String, Default: default
Setting this parameter to “default” will use the arpack solver. If you want to use a different solver, you can specify it as described in the scanpy API.
color_by String, Default: sample_id
The variable to color the PCA plot by. Should be a column in the obs of the adata.
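A sketch of the pca block using the defaults described above:
pca:
  n_pcs: 50
  solver: default      # uses the arpack solver
  color_by: sample_id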
Protein (PROT) preprocessing steps
prot
Parameters for the preprocessing of the protein modality.
normalisation_methods String (comma separated), Default: clr,dsb
Comma separated string of normalisation options. Available options are: dsb, clr. For more details, please refer to the muon documentation. Muon also provides separate information on dsb normalisation and clr normalisation methods. The normalised count matrices are stored in layers called ‘clr’ and ‘dsb’, along with a layer called ‘raw_counts’. If you choose to run both (dsb and clr), then ‘dsb’ is stored in X by default. For downstream visualisation, you can either specify the layer, or take the default stored in X.
clr_margin Integer (0 or 1), Default: 1
Parameter for CLR normalisation. The CLR margin determines whether you normalise per cell (as you would normalise RNA data) or per feature (recommended, due to the variable nature of protein assays). Hence, CLR margin 1 is recommended for informative QC plots in this pipeline.
0 = normalise row-wise (per cell)
1 = normalise column-wise (per feature)
background_obj String (Path)
Parameter for DSB normalisation. You must specify the path to the background MuData (h5mu) object created in the ingest pipeline in order to run dsb normalisation.
quantile_clipping Boolean, Default: True
Parameter for DSB normalisation. Whether to perform quantile clipping on the normalised data. Despite normalisation, some cells get extreme outliers which can be clipped as discussed here. The maximum value will be set at the 99.5% quantile value, applied per feature. Please note that this feature is in the default muon mu.pp.dsb code, but is manually implemented in this code.
store_as_X String
If you choose to run more than one normalisation method, specify which normalisation method should be stored in the X slot. If left blank, ‘dsb’ is the default that will be stored in X.
save_norm_prot_mtx Boolean, Default: False
Specify if you want to additionally save the normalised prot assay as a txt file.
pca Boolean, Default: False
Specify if you want to run PCA on the normalised protein data. This might be useful when you have more than 50 features in your protein assay.
n_pcs Integer, Default: 50
Number of principal components to compute. Make sure that n_pcs <= number of features - 1.
solver String, Default: default
Which solver to use for PCA. If set to “default”, the ‘arpack’ solver is used.
color_by String, Default: sample_id
Column to be fetched from the protein layer .obs to color the PCA plot by.
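As a sketch, a prot block running both normalisation methods could look like the following; the background object path is a placeholder, and the exact key layout may differ in your generated pipeline.yml:
prot:
  normalisation_methods: clr,dsb
  clr_margin: 1                              # normalise per feature
  background_obj: /path/to/background.h5mu   # placeholder; required for dsb normalisation
  quantile_clipping: True
  store_as_X: dsb
  save_norm_prot_mtx: False
  pca: False
  n_pcs: 50
  solver: default
  color_by: sample_id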
ATAC preprocessing steps
atac
Parameters for the preprocessing of the ATAC modality.
binarize Boolean, Default: False
If set to True, the data will be binarized.
normalize String, Default: TFIDF
Which normalisation method to use. Available options are “log1p” or “TFIDF”.
TFIDF_flavour String, Default: signac
TFIDF normalisation flavour. Leave blank if you don’t use TFIDF normalisation. Available options are: “signac”, “logTF” or “logIDF”.
feature_selection_flavour String, Default: signac
Flavour for selecting highly variable features (HVF). HVF selection is done either with scanpy’s pp.highly_variable_genes() function or with a pseudo-FindTopFeatures() function of the signac package. Accordingly, available options are: “signac” or “scanpy”.
min_mean Float, Default: 0.05
Applicable if feature_selection_flavour is set to “scanpy”. You can leave this parameter blank if you want to use the default value.
max_mean Float, Default: 1.5
Applicable if feature_selection_flavour is set to “scanpy”. You can leave this parameter blank if you want to use the default value.
min_disp Float, Default: 0.5
Applicable if feature_selection_flavour is set to “scanpy”. You can leave this parameter blank if you want to use the default value.
n_top_features Integer
Applicable if feature_selection_flavour is set to “scanpy”. Number of highly variable features to keep. If specified, overwrites the previous defaults for HVF selection.
filter_by_hvf Boolean, Default: False
Applicable if feature_selection_flavour is set to “scanpy”. Set to True if you want to filter the ATAC layer to retain only HVFs.
min_cutoff String, Default: q5
Applicable if feature_selection_flavour is set to “signac”. Can be specified as follows:
“q[x]”: “q” followed by the minimum percentile, e.g. q5 will set the top 95% most common features as highly variable.
“c[x]”: “c” followed by a minimum cell count, e.g. c100 will set features present in > 100 cells as highly variable.
“tc[x]”: “tc” followed by a minimum total count, e.g. tc100 will set features with total counts > 100 as highly variable.
“NULL”: All features are assigned as highly variable.
“NA”: Highly variable features won’t be changed.
dimred String, Default: LSI
Available options are: PCA or LSI. LSI will only be computed if TFIDF normalisation was used.
n_comps Integer, Default: 50
Number of components to compute.
solver String, Default: default
If using PCA, which solver to use. Setting this parameter to “default” will use the ‘arpack’ solver.
color_by String, Default: sample_id
Specify the covariate you want to use to color the dimensionality reduction plot.
dim_remove Integer
Whether to remove component(s) associated with technical artifacts. For instance, it is common to remove the first LSI component, as it is often associated with batch effects. Specify 1 to remove the first component. Leave blank to avoid removing any.
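Finally, a sketch of an atac block using TFIDF normalisation with signac-style feature selection; values follow the defaults described above, and the exact layout may differ in your generated pipeline.yml:
atac:
  binarize: False
  normalize: TFIDF
  TFIDF_flavour: signac
  feature_selection_flavour: signac
  min_mean:            # only used with the scanpy flavour
  max_mean:
  min_disp:
  n_top_features:
  filter_by_hvf: False
  min_cutoff: q5       # only used with the signac flavour
  dimred: LSI
  n_comps: 50
  solver: default
  color_by: sample_id
  dim_remove:          # e.g. 1 to drop the first LSI component; leave blank to keep all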