Using custom genes annotations: gene list formats

It’s often practical to rely on known gene lists, for a series of tasks, like evaluating % of mitochondrial genes or ribosomal genes, or excluding genes from HVG selection such as those constituting the IG chains.

We pre-compile these lists for qc and for cell cycle.

Custom gene list

We provide an example of a preformatted gene lists file in resources/qc_genelist_1.0.csv.

All Custom Gene Lists files provided to the pipeline should be in a 3 columns format, where the column headers are “mod” (modality: “rna”, “prot”, or “atac”), feature and group. The group column is used to label and distinguish different gene groups, in the same way as you would submit them as a single vector. The same gene can belong to multiple groups by duplicating the row for that gene and changing the group label. Gene ids can come in a variety of formats, including upper or lower cases, letter,numbers or a combination of both. Users can provide any gene format in the custom gene files, depending on the reference they used to produce their count matrices.

mod: the modality for the feature in use. Modalities are always specified in lowercase.
feature: feature name, i.e. a gene id
group: the group the gene belongs to. Group can be upper or lowercase or a mix of both and will be interpreted as a string.

mod	feature	group
rna	gene_1	mt
rna	gene_2	rp
rna	gene_1	exclude
rna	gene_1	markerX
…	…	…

For example, if your count matrix uses Ensembl ids, you would specify ENSG00000163914 in the feature column and group photoreceptor

GeneCards Symbol: RHO
HGNC: 10012 
NCBI Gene: 6010 
Ensembl: ENSG00000163914

The gene lists are not pre-determined within panpipes in order to maximise flexibility and users can provide their own lists.

For a typical usecase, we provide example lists on our github page which are also used by default as specified in the next sections.

Cell cycle genes

The human-only cellcycle genes used in scanpy.score_genes_cell_cycle are stored in resources/cell_cycle_genes.csv

However, if you are working with mouse data, we supply an alternative cellcycle gene list with murine genes, which can be found in resources/mouse_cell_cycle_genes.tsv

Differently from the other custom gene file, the cell cycle file should be a tab separated file with two columns:

gene_name: the name of the gene
cc_phase: which phase of the cell cycle is the gene expression indicative of.

gene_name	cc_phase
MCM5	s
PCNA	s
TYMS	s
CCNB2	g2m
CKAP2L	g2m
…	…

Using custom gene lists to calculate QC metrics

Panpipes uses “actions” to define in which tasks to use the provided gene list. We encode three main actions to use gene lists to describe qualities of the cells and populate the metadata (.obs) of the object with the newly calculated cell QC metric. Specify the “group” name of the genes you want to use to apply a specific action to calculate the cell QC metric

If left blank, these actions will not be performed (i.e. no calculation of % of mt genes per cell will be included in the ingestion of the data) If the group name is not spelled correctly, the action will fail. See The section on

Supplying custom gene lists to calculate QC metrics

The human custom genelist file can be supplied by the user in the workflows configuration files:

Ingest workflow pipeline_ingest config file: (pipeline.yml)

Supply the gene list by customizing the following parameter
```
custom_genes_file: resources/qc_genelist_1.0.csv
```
Preprocess workflow pipeline_preprocess config file: (pipeline.yml)

Supply the gene list by customizing the following parameter
```
exclude_file: resources/qc_genelist_1.0.csv
```

Note that we have formatted an example file containing all genes to use in both workflows, and therefore supply the same file to both workflows but users can have independent files for each of them.

If the input is from mouse data

Ingest workflow

pipeline_ingest config file: (pipeline.yml)
```
custom_genes_file: resources/qc_gene_list_mouse.csv
```
Preprocess workflow

pipeline_preprocess config file: (pipeline.yml)
```
exclude_file: resources/qc_gene_list_mouse.csv
```

Explaining custom gene lists actions

Ingest workflow (pipeline_ingest.py)

In the qc section of the ingest workflow, we specify the main actions harnessing the gene list provided in the custom_genes_file param:

    custom_genes_file: resources/qc_genelist_1.0.csv

    calc_proportions: hb,mt,rp
    score_genes: mt

    ccgenes: default

calc_proportions: calculate proportion of reads mapping to X genes over total number of reads, per cell, using scanpy.pp.calculate_qc_metrics.

For example, for the rna modality, including a list of mitochondrial genes in the group mt, and setting
```
  calc_proportion: mt 
```
will calculate the proportion of reads mapping to the genes whose group is “mt” and and will add pct_counts_mt, and total_counts_mt to the mdata["rna"].obs assay.
score_genes: using scanpy.tl.score_genes, the action calculates the average expression of a set of genes, subtracted of the average expression of a reference set of genes. First introduced in Satija et al. Nature Biotechnology (2015).

This action will generate a column GroupName_score and add it to the mdata[MOD].obs assay. For example for the rna modality, including a list of Markers for a cell type in the group ‘MarkersNeutro’ and setting
```
score_genes: MarkersNeutro
```
will score the cells for the list of genes provided and add a column MarkersNeutro_score to the mdata["rna"].obs assay.

Preprocess workflow (pipeline_preprocess.py)

In the preprocess workflow, we specify what action to perform on the genes in the custom gene list using the exclude parameter.

  exclude_file: resources/qc_genelist_1.0.csv
  exclude: default

exclude: exclude these genes from the HVG selection, if they are deemed Highly Variable.

For the exclude action, if set to default the workflow will look for genes whose group is set to exclude in the supplied qc_genelist file. Alternatively, if you are specifying your custom gene list and you want to exclude another set of genes, for example a group you call TCR_genes, specify this group (i.e. exclude: TCR_genes) If left blank, no genes will be excluded from the HVG.

Cell cycle actions

As described before, we also rely on a user-supplied list of genes to calculate the cell cycle phase of a cell. We believe that this choice offers the maximum flexibility to use a trusted gene-set for the calculation of this metric. The cell cycle scoring happens in the ingest workflow using the ccgenes parameter. The cell cycle action is performed using scanpy.tl.score_genes_cell_cycle

# cell cycle action
ccgenes: default

ccgenes:
Setting the ccgenes param to default in the ingest workflow will calculate the phase of the cell cycle in which the cell is by using scanpy.tl.score_genes_cell_cycle using the file provided in panpipes/resources/cell_cicle_genes.tsv. Using this file, this action will produce at least 3 columns in the mdata["rna"].obs assay, namely ‘S_score’, ‘G2M_score’, ‘phase’.

Users can create their own list, and need to specify the path to this new file in in the ccgenes param to score the cells with their custom list. If left blank, the cellcycle score won’t be calculated.

Using Custom Gene lists to plot: the Visualization workflow

Users may also supply custom gene lists to plot markers using standard visualizations, such as UMAPs or dotplots of gene expressions. We have designated an entire workflow just for this purpose. The Visualization workflow accepts custom gene files in the same 3-column format as defined above.

These files can be specified in the viz configuration file as follows:

pipeline_vis config file: (pipeline.yml)

# the full list will be plotted in dot plots and matrix plots, one plot per group
full:
 - long_file1.csv
 - long_file2.csv
# the shorter list will be plotted on umaps as well as other plot types, one plot per group
minimal:
 - short_file1.csv

Generally in the visualization pipeline all gene groups in the input are plotted. In heatmaps and dotplots, one dotplot per group is plotted. For UMAPs, one plot per gene is plotted, and a new file is saved per group.

Plot Makers in the Visualization workflow

The custom maker csv file for full and minimal must contain three columns and follow the following structure:

mod	feature	group
prot	prot_CD8	Tcellmarkers
rna	CD8A	Tcellmarkers

The full list will be plotted in dot plots and matrix plots, with one plot per group.

The shorter list will be plotted on umaps as well as other plot types, with one plot per group.

feature_1	feature_2	colour
CD8A	prot_CD8
CD4	CD8A	doublet_scores

Plot metadata variables

The scatter_features.csv file should have the following format:

feature_1	feature_2	colour
rna:total_counts	prot:total_counts	doublet_scores

Final notes

Be deliberate and informative with the choice of group names for any gene set use, since the .obs column generated as output will be named based on the group of the gene list input file. The columns added to the MuData object will be used for filtering in the preprocess workflow (see filtering instructions ).

If the mitochondrial genes are in group “mt” as in the example given in the resource file, then the column generated with the calc_proportions action and containing the percentage of MT genes will be named “pct_counts_mt”.

So specifying pct_counts_mito instead of pct_counts_mt will not filter the mudata based on mitochondrial %, because the workflow can’t find the supplied column in the data.