Preprocess YAML

This page explains the parameters of the preprocess configuration YAML file. This file is generated by running panpipes preprocess config.
The individual steps run by the pipeline are described in the preprocess workflow.

When running the preprocess workflow, panpipes provides a basic pipeline.yml file. To run the workflow on your own data, you need to specify the parameters described below in the pipeline.yml file to meet the requirements of your data. However, we do provide pre-filled versions of the pipeline.yml file for individual tutorials.

For more information on how panpipes reads configuration files, including reading blocks of parameters and reusing blocks with YAML &anchors and *aliases, please check our documentation.
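
As a minimal sketch of block reuse in standard YAML, the fragment below defines an anchor on one value and reuses it through an alias elsewhere. The anchor name group_col is hypothetical and chosen only for illustration; the keys are taken from the parameters described further down this page.

```yaml
plotqc:
  grouping_var: &group_col sample_id   # "&group_col" attaches an anchor to the value sample_id
pca:
  color_by: *group_col                 # "*group_col" is an alias that resolves to sample_id
```

Changing the anchored value then updates every place where the alias is used.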

You can download the different preprocess pipeline.yml files here:

Compute resources options

resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires. Specified by the following three parameters:

  • threads_high Integer, Default: 2
    Number of threads used for high intensity computing tasks. For each thread, there must be enough memory to load all your input files at once and create the MuData object.

  • threads_medium Integer, Default: 2
    Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your mudata and do computationally light tasks.

  • threads_low Integer, Default: 1
    Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting. This requires much less memory than the other two.

condaenv String (Path)
Path to the conda environment that should be used to run panpipes. Leave blank if you are running natively or if your cluster automatically inherits the login node environment. For more information on this, please refer to the detailed explanation here.
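
For orientation, the compute options above might look as follows in pipeline.yml. This is a sketch that uses the documented defaults and assumes the three thread parameters are nested under resources, as the description suggests.

```yaml
resources:
  threads_high: 2     # tasks that load all input files and create the MuData object
  threads_medium: 2   # tasks that load the MuData object for lighter computation
  threads_low: 1      # text file handling and plotting
condaenv:             # left blank: no separate conda environment is specified
```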

General project specifications

sample_prefix String
Prefix for sample names.

unfiltered_obj String
If running this on prefiltered data, complete the following steps:

  1. Leave unfiltered_obj (this parameter) blank.
  2. Rename your filtered file so that it matches the format PARAMS['sample_prefix'] + '.h5mu'.
  3. Put the renamed file in the same folder as the pipeline.yml.
  4. Set filtering run to False below.

modalities
Specify which modalities are included in the data by setting the respective modality to True. Leave empty (None) or False to signal this modality is not part of the experiment. The modalities are processed in the order of the following list:

  • rna Boolean, Default: True

  • prot Boolean, Default: False

  • rep Boolean, Default: False

  • atac Boolean, Default: False
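
The prefiltered-data case described above could be sketched like this. The prefix pbmc is a hypothetical value; with it, the filtered file would need to be named pbmc.h5mu and sit next to pipeline.yml.

```yaml
sample_prefix: pbmc   # hypothetical prefix; the filtered file is then pbmc.h5mu
unfiltered_obj:       # left blank because the data are already filtered
modalities:
  rna: True
  prot: True
  rep: False
  atac: False
filtering:
  run: False          # skip filtering for prefiltered data
```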

Filtering Cells and Features

Filtering in panpipes is done sequentially for all modalities, filtering first cells and then features. For each modality, the pipeline.yml file contains a dictionary with the following structure:

MODALITY:
    obs:
        min:
        max:
        bool:
    var:
        min:
        max:
        bool:

This format can be applied to any modality by editing the filtering dictionary. You are not restricted to the columns given as defaults: any column in the obs or var of the MuData object can be used. When specifying a column name, make sure it exactly matches the column name in the h5mu object.

Example:

rna:
  obs:
    min:  # Any column for which you want to run a minimum filter
      n_genes_by_counts: 500  # i.e. will filter out cells with a value less than 500 in the n_genes_by_counts column
    max:  # Any column for which you want to run a maximum filter
      pct_counts_mt: 20  # i.e. each cell may have a maximum of 20 in the pct_counts_mt column
                         # be careful with any columns named after gene sets. 
                         # The column will be named based on the gene list input file, 
                         # so if the mitochondrial genes are in group "mt" 
                         # as in the example given in the resource file,
                         # then the column will be named "pct_counts_mt".
    bool:
      is_doublet: False  # if you have any boolean columns you want to filter on,
                         # then use this section of the modality dictionary.
                         # In this case, any cell whose obs['is_doublet'] is False will be retained in the dataset.

filtering

  • run Boolean, Default: True
    If set to False, no filtering is applied to the MuData object.

  • keep_barcodes String (Path)
    Path to a file containing specific cell barcodes you want to keep; leave blank if not applicable.

RNA-specific filtering (rna)

obs
Parameters for obs, i.e. cell level filtering:

  • min
    Filtering cells based on a minimum value in a column. Leave parameters blank if you do not want to filter by them.

    • n_genes_by_counts Integer
      Minimum number of genes by counts per cell. For instance, setting this parameter to 500 will filter out cells with a value less than 500 in the n_genes_by_counts column.

  • max
    Filtering cells based on a maximum value in a column. Leave parameters blank if you do not want to filter by them.

    • total_counts Integer
      Cells with a total count greater than this value will be filtered out.

    • n_genes_by_counts Integer
      Maximum number of genes by counts per cell.

    • pct_counts_mt Integer (in Percent)
      Percent of counts that are mitochondrial genes. Cells with a value greater than this will be filtered out. Should be a value between 0 and 100 (%).

    • pct_counts_rp Integer (in Percent)
      Percent of counts that are ribosomal genes. Cells with a value greater than this will be filtered out. Should be a value between 0 and 100 (%).

    • doublet_scores Float or String (Path)
      To apply a custom scrublet threshold per input sample, specify it here. Provide either one score for all samples (e.g. 0.25), or a csv file with two columns: sample_id and cut off.

  • bool
    You can add a new column to mudata['rna'].obs with boolean (True/False) values, and then list that column under this bool section. This can be done for any modality.

var
Parameters for var, i.e. gene (feature) level filtering:

  • min

    • n_cells_by_counts Integer

  • max

    • total_counts Integer

    • n_cells_by_counts Integer
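
Putting the RNA filtering parameters together, a filled-in block might look like the sketch below. It assumes the rna dictionary is nested under the filtering key, and all threshold values are illustrative rather than recommendations. Parameters left blank are not used for filtering.

```yaml
filtering:
  run: True
  keep_barcodes:              # blank: no barcode whitelist
  rna:
    obs:
      min:
        n_genes_by_counts: 500
      max:
        total_counts:
        n_genes_by_counts:
        pct_counts_mt: 20     # percent, between 0 and 100
        pct_counts_rp:
        doublet_scores: 0.25  # one score for all samples
      bool:
    var:
      min:
        n_cells_by_counts: 3  # illustrative value
      max:
        total_counts:
        n_cells_by_counts:
```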

Protein-specific filtering (prot)

obs
Parameters for obs, i.e. cell level filtering:

  • max
    Filtering cells based on a maximum value in a column. Leave parameters blank if you do not want to filter by them.

    • total_counts Integer
      Cells with a total count greater than this value will be filtered out.

ATAC-specific filtering (atac)

var
Parameters for var, i.e. gene (feature) level filtering:

  • nucleosome_signal

Intersecting cell barcodes

intersect_mods String
Keep only the cell barcodes (observations) that are present in all of the modalities listed here, or in all modalities if set to None.
Provide a comma separated list of the modalities whose barcodes you want to intersect, e.g. rna,prot

Downsampling cell barcodes

downsample_n Integer
Number of cells to downsample to, leave blank to keep all cells.

downsample_col String
If you want to equalise by dataset or sample_id, specify the obs column to downsample by here. If specified, the data will be subset to n cells per downsample_col value.

downsample_mods String (comma separated)
Specify which modalities you want to subsample. If more than one modality is added then these will be intersected. Provide as a comma separated String, e.g.: rna,prot
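
As an illustration of how the intersection and downsampling parameters combine, the sketch below keeps barcodes shared by rna and prot and then downsamples per sample. The value 5000 is illustrative only.

```yaml
intersect_mods: rna,prot    # keep barcodes present in both modalities
downsample_n: 5000          # illustrative; leave blank to keep all cells
downsample_col: sample_id   # subset to downsample_n cells per sample_id value
downsample_mods: rna,prot   # modalities to subsample (intersected if more than one)
```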

Plotting variables

plotqc
All metrics in this section should be provided as a comma separated string without spaces, e.g. a,b,c. Leave blank to avoid plotting.

  • grouping_var String (comma separated), Default: sample_id
    Use these categorical variables to plot/split by.

  • rna_metrics String (comma separated), Default: pct_counts_mt,pct_counts_rp,pct_counts_hb,pct_counts_ig,doublet_scores
    Specify the metrics in the metadata of the RNA modality to plot.

  • prot_metrics String (comma separated), Default: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype
    Specify the metrics in the metadata of the Protein modality to plot.

  • atac_metrics String (comma separated)
    Specify the metrics in the metadata of the ATAC modality to plot.

  • rep_metrics String (comma separated)
    Specify the metrics in the metadata of the Rep modality to plot.
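
A plotqc block following the format above could look like this sketch. The second grouping variable, tissue, is a hypothetical obs column added only to show the comma separated form; the metric lists are the documented defaults.

```yaml
plotqc:
  grouping_var: sample_id,tissue   # "tissue" is a hypothetical obs column
  rna_metrics: pct_counts_mt,pct_counts_rp,pct_counts_hb,pct_counts_ig,doublet_scores
  prot_metrics: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype
  atac_metrics:                    # blank: nothing is plotted for ATAC
  rep_metrics:                     # blank: nothing is plotted for Rep
```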

RNA preprocessing steps

Currently, only the standard preprocessing steps (sc.pp.normalize_total followed by sc.pp.log1p) are offered for the RNA modality.

log1p Boolean, Default: True
If set to False, the log1p transformation is not applied to the RNA modality.

hvg
Options for the detection of highly variable genes (HVGs) in the RNA modality.

  • flavor String, Default: seurat
    Choose one of the supported hvg_flavor options: “seurat”, “cell_ranger”, “seurat_v3”. For the dispersion based methods “seurat” and “cell_ranger”, you can specify the parameters min_mean, max_mean and min_disp (listed below). For “seurat_v3” a different method is used, and you need to specify how many variable genes to find with the parameter n_top_genes. If you specify n_top_genes, then the other parameters (min_mean, max_mean, min_disp) are ignored. For further reading on this, please refer to the scanpy API.

  • batch_key String
    If batch_key is specified, highly-variable genes are selected within each batch separately and merged. For details on this, please refer to the scanpy API. If you want to use more than one obs column as covariates, specify them as “covariate1,covariate2” (comma separated list). Leave blank if no batch should be accounted for in the HVG detection (default behavior).

  • n_top_genes Integer, Default: 2000
    Number of highly-variable genes to keep. You must specify this parameter if flavor is “seurat_v3”.

  • min_mean Float
    Minimum mean expression of genes to be considered as highly variable genes. Ignored if n_top_genes is specified or if flavor is set to “seurat_v3”.

  • max_mean Float
    Maximum mean expression of genes to be considered as highly variable genes. Ignored if n_top_genes is specified or if flavor is set to “seurat_v3”.

  • min_disp Float
    Minimum dispersion of genes to be considered as highly variable genes. Ignored if n_top_genes is specified or if flavor is set to “seurat_v3”.

  • exclude_file String (Path)
    It may be useful to exclude some genes from the HVG selection. In this case, you can provide a file with a list of genes to exclude. We provide an example for genes that could be excluded when analyzing immune cells here. When examining this file, you will note that it has three columns, the first specifying the modality, the second one the gene id and the third the groups to which the respective gene belongs. This workflow will exclude the genes that are marked accordingly by their group name. By default, the workflows will remove the genes that are flagged as “exclude” in the group column from HVG detection. You can customize the gene list and change the name of the gene group in the exclude: parameter (see below) accordingly.

  • exclude String
    This variable defines the group name tagging the genes to be excluded in the file specified in the previous parameter. Leave empty if you don’t want to exclude genes from HVG detection.

  • filter Boolean, Default: False
    Set to True if you want to filter the object to retain only Highly Variable Genes.
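
To make the interplay of the HVG parameters concrete, here is a sketch of a “seurat_v3” configuration. It assumes the parameters are nested under hvg as listed above and that sample_id is the batch column; the dispersion thresholds are left blank because they are ignored when n_top_genes is specified.

```yaml
log1p: True
hvg:
  flavor: seurat_v3
  batch_key: sample_id   # assumed batch column; leave blank to ignore batches
  n_top_genes: 2000      # required for seurat_v3
  min_mean:              # ignored because n_top_genes is set
  max_mean:
  min_disp:
  exclude_file:          # blank: no exclusion list is used
  exclude:
  filter: False          # keep all genes in the object, HVGs are only flagged
```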

regress_variables String
Regression variables, specify the variables you want to regress out. Leave blank if you don’t want to regress out anything. We recommend not regressing out anything unless you have good reason to.

Scaling

Scaling has the effect that all genes are weighted equally for downstream analysis. Whether to apply scaling is still a matter of debate, as stated in the Luecken et al. Best Practices paper:

“There is currently no consensus on whether or not to perform normalization over genes. While the popular Seurat tutorials (Butler et al, 2018) generally apply gene scaling, the authors of the Slingshot method opt against scaling over genes in their tutorial (Street et al, 2018). The preference between the two choices revolves around whether all genes should be weighted equally for downstream analysis, or whether the magnitude of expression of a gene is an informative proxy for the importance of the gene.”

run_scale Boolean, Default: True
Set to False if you do not want to scale the data.

scale_max_value Float
Clip to this value after scaling. If left blank, scaling is run with default parameters, as described in the scanpy API.

RNA Dimensionality Reduction

pca
Parameters for PCA dimensionality reduction.

  • n_pcs Integer, Default: 50
    Number of principal components to compute.

  • solver String, Default: default
    Setting this parameter to “default” will use the arpack solver. If you want to use a different solver, you can specify it as described in the scanpy API.

  • color_by String, Default: sample_id
    The variable to color the PCA plot by. Should be a column in the obs of the adata.
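
The regression, scaling and PCA parameters described above might be combined as in the following sketch. The value for scale_max_value is illustrative only; leaving it blank runs scaling with the default parameters.

```yaml
regress_variables:       # blank: nothing is regressed out
run_scale: True
scale_max_value: 10      # illustrative clipping value
pca:
  n_pcs: 50
  solver: default        # uses the arpack solver
  color_by: sample_id
```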

Protein (PROT) preprocessing steps

prot
Parameters for the preprocessing of the protein modality.

  • normalisation_methods String (comma-separated), Default: clr,dsb
    Comma separated string of normalisation options. Available options are: dsb, clr. For more details, please refer to the muon documentation. Muon also provides separate information on dsb normalisation and clr normalisation methods. The normalised count matrices are stored in layers called ‘clr’ and ‘dsb’, along with a layer called ‘raw_counts’. If you choose to run both (dsb and clr), then ‘dsb’ is stored in X as default. For downstream visualisation, you can either specify the layer, or take the default stored in X.

  • clr_margin Integer (0 or 1), Default: 1
    Parameter for CLR normalisation. The CLR margin determines whether you normalise per cell (as you would normalise RNA data), or by feature (recommended, due to the variable nature of protein assays). Hence, CLR margin 1 is recommended for informative qc plots in this pipeline.

    • 0 = normalise row-wise (per cell)

    • 1 = normalise column-wise (per feature)

  • background_obj String (Path)
    Parameter for DSB normalisation. You must specify the path to the background MuData (h5mu) object created in the ingest pipeline in order to run dsb normalisation.

  • quantile_clipping Boolean, Default: True
    Parameter for DSB normalisation. Whether to perform quantile clipping on the normalised data. Despite normalisation, some cells get extreme outliers which can be clipped as discussed here. The maximum value will be set at the 99.5% quantile value, applied per feature. Please note that this feature is in the default muon mu.pp.dsb code, but manually implemented in this code.

  • store_as_X String
    If you choose to run more than one normalisation method, specify which normalisation method should be stored in the X slot. If left blank, ‘dsb’ is the default that will be stored in X.

  • save_norm_prot_mtx Boolean, Default: False
    Specify if you want to save the prot normalised assay additionally as a txt file.

  • pca Boolean, Default: False
    Specify if you want to run PCA on the normalised protein data. This might be useful when you have more than 50 features in your protein assay.

  • n_pcs Integer, Default: 50
    Number of principal components to compute. Must satisfy n_pcs <= number of features - 1.

  • solver String, Default: default
    Which solver to use for PCA. If set to “default”, the ‘arpack’ solver is used.

  • color_by String, Default: sample_id
    Column to be fetched from the protein layer .obs to color the PCA plot by.
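
As a sketch of a protein block that runs both normalisation methods and stores clr in X instead of the default dsb, consider the following. The background_obj path is a hypothetical placeholder for the background h5mu object created in the ingest pipeline.

```yaml
prot:
  normalisation_methods: clr,dsb
  clr_margin: 1                              # normalise per feature
  background_obj: path/to/background.h5mu    # hypothetical path, required for dsb
  quantile_clipping: True
  store_as_X: clr                            # override the default (dsb)
  save_norm_prot_mtx: False
  pca: False
  n_pcs: 50
  solver: default
  color_by: sample_id
```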

ATAC preprocessing steps

atac
Parameters for the preprocessing of the ATAC modality.

  • binarize Boolean, Default: False
    If set to True, the data will be binarized.

  • normalize String, Default: TFIDF
    What normalisation method to use. Available options are “log1p” or “TFIDF”.

  • TFIDF_flavour String, Default: signac
    TFIDF normalisation flavor. Leave blank if you don’t use TFIDF normalisation. Available options are: “signac”, “logTF” or “logIDF”.

  • feature_selection_flavour String, Default: signac
    Flavor for selecting highly variable features (HVF). HVF selection either with scanpy’s pp.highly_variable_genes() function or a pseudo-FindTopFeatures() function of the signac package. Accordingly, available options are: “signac” or “scanpy”.

  • min_mean Float, Default: 0.05
    Applicable if feature_selection_flavour is set to “scanpy”. You can leave this parameter blank if you want to use the default value.

  • max_mean Float, Default: 1.5
    Applicable if feature_selection_flavour is set to “scanpy”. You can leave this parameter blank if you want to use the default value.

  • min_disp Float, Default: 0.5
    Applicable if feature_selection_flavour is set to “scanpy”. You can leave this parameter blank if you want to use the default value.

  • n_top_features Integer
    Applicable if feature_selection_flavour is set to “scanpy”. Number of highly-variable features to keep. If specified, overwrites previous defaults for HVF selection.

  • filter_by_hvf Boolean, Default: False
    Applicable if feature_selection_flavour is set to “scanpy”. Set to True if you want to filter the ATAC layer to retain only HVFs.

  • min_cutoff String, Default: q5
    Applicable if feature_selection_flavour is set to “signac”. Can be specified as follows:

    • “q[x]”: “q” followed by the minimum percentile, e.g. q5 will set the top 95% most common features as highly variable.

    • “c[x]”: “c” followed by a minimum cell count, e.g. c100 will set features present in > 100 cells as highly variable.

    • “tc[x]”: “tc” followed by a minimum total count, e.g. tc100 will set features with total counts > 100 as highly variable.

    • “NULL”: All features are assigned as highly variable.

    • “NA”: Highly variable features won’t be changed.

  • dimred String, Default: LSI
    Available options are: PCA or LSI. LSI will only be computed if TFIDF normalisation was used.

  • n_comps Integer, Default: 50
    Number of components to compute.

  • solver String, Default: default
    If using PCA, which solver to use. Setting this parameter to “default” will use the ‘arpack’ solver.

  • color_by String, Default: sample_id
    Specify the covariate you want to use to color the dimensionality reduction plot.

  • dim_remove Integer
    Whether to remove the component(s) associated with technical artifacts. For instance, it is common to remove the first LSI component, as it is often associated with batch effects. Specify 1 to remove the first component. Leave blank to avoid removing any.
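
A signac-style ATAC configuration based on the parameters above could be sketched as follows. The scanpy-specific parameters are left blank because feature_selection_flavour is set to signac, and dim_remove is set to 1 to drop the first LSI component as in the example given above.

```yaml
atac:
  binarize: False
  normalize: TFIDF
  TFIDF_flavour: signac
  feature_selection_flavour: signac
  min_mean:            # only applicable to the scanpy flavour
  max_mean:
  min_disp:
  n_top_features:
  filter_by_hvf: False
  min_cutoff: q5       # top 95% most common features are set as highly variable
  dimred: LSI          # requires TFIDF normalisation
  n_comps: 50
  solver: default
  color_by: sample_id
  dim_remove: 1        # remove the first component
```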