Refmap YAML

This page explains the parameters of the refmap configuration yaml file. The file is generated by running `panpipes refmap config`.
The individual steps run by the pipeline are described in the Reference Mapping workflow.

When running the refmap workflow, panpipes provides a basic pipeline.yml file. To run the workflow on your own data, you need to set the parameters described below in the pipeline.yml file to meet the requirements of your data. We also provide pre-filled versions of the pipeline.yml file for individual tutorials.

For more information on the functionalities implemented in panpipes to read the configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *scalars (called aliases in the YAML specification), please check our documentation.

You can download the different refmap pipeline.yml files here:

Basic pipeline.yml file (not prefilled) that is generated when calling `panpipes refmap config`: Download here
pipeline.yml file for Reference Mapping Tutorial: Download here

Compute resources options

resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires. Specified by the following three parameters:

threads_high Integer, Default: 1
Number of threads used for high intensity computing tasks. For each thread, there must be enough memory to load all your input files at once and create the MuData object.
threads_medium Integer, Default: 1
Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your mudata and do computationally light tasks.
threads_low Integer, Default: 1
Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting. This requires much less memory than the other two.
condaenv String (Path)
Path to conda environment that should be used to run panpipes.
queues: String (Path)
Names of special queues to use, if one is required for long jobs or if the user has access to a GPU-specific queue. Otherwise, leave blank.
- long: String (Path)
- gpu: String (Path)
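
As an illustration, the compute resources block described above might look as follows in pipeline.yml. This is a minimal sketch: the key nesting is inferred from the parameter list above, and the thread count and conda path are hypothetical values, not recommendations.

```yaml
# Sketch of the compute resources section (values are hypothetical)
resources:
  threads_high: 2      # hypothetical; the default is 1
  threads_medium: 1
  threads_low: 1
condaenv: /path/to/conda/env   # hypothetical path to the panpipes environment
queues:
  long:    # left blank: no special queue for long jobs
  gpu:     # left blank: no GPU-specific queue
```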

Loading data options

Query Dataset

query String, Default: path/to/data
Path to the query data. Accepted formats include raw10x and preprocessed, quality-filtered mudata or anndata.
modality String, Default: rna
If mudata was provided then specify the modality to be used. Currently, only RNA modality is supported.
query_batch String, Default:
Fill in only if the query data contains batch information; if so, specify the column that holds it. If not, leave blank.
query_celltype String, Default:
Column containing the celltype annotations of the query, if it has any that should be compared to the transferred labels. If not, leave blank.
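
A query section filled in for a dataset with a batch column but no existing celltype annotations could be sketched like this. The column name `sample_id` is a hypothetical example, and the flat key layout is assumed from the parameter list above.

```yaml
# Sketch of the query dataset section (column name is hypothetical)
query: path/to/data
modality: rna            # currently the only supported modality
query_batch: sample_id   # hypothetical column holding batch information
query_celltype:          # left blank: no annotations to compare against
```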

Scvi tools parameters

reference_data String, Default: path/to/mudata
Specify one or more reference models to use. Users can also supply their own reference built using pipeline_integration. Leave blank for no model specification.
totalvi: String, Default: path/to/totalvi
Provide path to totalvi saved model. Multiple paths can be provided as a list:

totalvi: 
  - path_to_totalvi1
  - path_to_totalvi2

impute_proteins Boolean, Default: False
transform_batch String, Default:
transform_batch is a batch covariate specific to totalvi. It allows the model to use the batch information in the query to mitigate differences in protein sequencing depth.
scvi String, Default: path/to/scvi
Mandatory. Provide a path to the scvi model. Multiple paths can be provided as a list:

scvi:
  - path_to_scvi1
  - path_to_scvi2

scanvi String, Default: path/to/scanvi
Mandatory. Provide a path to the scanvi model.
run_randomforest Boolean, Default: False
Set to true if the reference model has a trained random forest classifier to transfer the labels.
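
Putting the parameters of this section together, a configuration that maps the query onto one totalvi and one scanvi reference might look like the sketch below. It assumes all keys sit at the top level of pipeline.yml (totalvi accepts a list of paths, so impute_proteins and transform_batch cannot be nested under it); the paths are the placeholder defaults listed above.

```yaml
# Sketch of the scvi tools section (paths are placeholders)
reference_data: path/to/mudata
totalvi:
  - path/to/totalvi      # one or more saved totalvi models
impute_proteins: False
transform_batch:         # left blank
scvi: path/to/scvi
scanvi: path/to/scanvi
run_randomforest: False  # set to True only if the reference has a trained classifier
```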

Training parameters

To reuse the same parameters in multiple locations, use anchors (&) and scalars (*) in the relevant places. For example, if you define &rna_neighbors, the same parameters are reused wherever *rna_neighbors is referenced. Check our documentation for more information on using anchors and scalars.

training_plan:
- totalvi: Default: array of training parameters.
  For the full list of parameters check here. To reuse the same parameters in other locations, use an anchor: for example, writing totalvi: &totalvitraining ensures the same array is reused wherever it is referenced as *totalvitraining. In this example, the &totalvitraining array contains the two parameters max_epochs and weight_decay.
  - max_epochs Integer, Default: 200
  - weight_decay Float, Default: 0.0
    Recommended weight decay is 0.0. This ensures the latent representation of the reference cells will remain exactly the same if passing them through this new query model.
- scvi Array of training parameters, Default: *totalvitraining (reuse the same array as specified above)
- scanvi Array of training parameters, Default: *totalvitraining (reuse the same array as specified above)
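
Written out as YAML, the training plan described above can be sketched as follows. The anchor name and default values are taken from the text; the exact indentation of the real pipeline.yml is an assumption.

```yaml
# Sketch of the training plan: one anchored block reused by all three models
training_plan:
  totalvi: &totalvitraining   # "&" defines the anchor
    max_epochs: 200
    weight_decay: 0.0
  scvi: *totalvitraining      # "*" reuses the anchored block unchanged
  scanvi: *totalvitraining
```

To give one model different settings, replace its `*totalvitraining` reference with its own block of parameters instead of editing the anchored block, since changes to the anchored block apply everywhere it is referenced.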

Neighbors parameters to calculate umaps

The umaps can be calculated on either the query alone or the query + reference dataset.

neighbors:
- npcs Integer, Default: 30
  Number of Principal Components to calculate for neighbours and umap. If no correction is applied, PCA is calculated and used to run UMAP and clustering. If Harmony is the method of choice, it uses these components to create a corrected dimensionality reduction.
- k Integer, Default: 30
  Number of neighbours to use.
- metric String, Default: euclidean
  Options here include cosine and euclidean.
- method String, Default: scanpy
  Options here include scanpy and hnsw (from scvelo).
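
For reference, the neighbors block with its default values might be written as below. This is a sketch assuming the four parameters are nested directly under `neighbors`.

```yaml
# Sketch of the neighbors section with the defaults listed above
neighbors:
  npcs: 30            # number of principal components
  k: 30               # number of neighbours
  metric: euclidean   # or cosine
  method: scanpy      # or hnsw
```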

Run scib metrics on query

scib metrics are run on the query data after transferring labels, where available (with the totalvi and scanvi models), or using default leiden clustering after training the VAE model (scvi). Check the documentation for the metrics used.

scib:
- run Boolean, Default: False
- cluster_key String, Default: predictions
  Used for ARI and NMI. If left empty, defaults to the leiden clustering calculated on the new latent representation after reference mapping.
- batch_key String, Default:
  Used for clisi_graph_embed. If no batch is present, the metric will not be included in the results. If left blank, defaults to the cluster_key default.
- celltype_key String, Default: celltype
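
A scib block that enables the metrics for a query without batch information could be sketched like this. The nesting under `scib` is assumed from the list above; `run: True` is the only value changed from the listed defaults.

```yaml
# Sketch of the scib section with metrics enabled
scib:
  run: True                 # default is False
  cluster_key: predictions  # transferred labels, used for ARI and NMI
  batch_key:                # left blank: no batch column in the query
  celltype_key: celltype
```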