Refmap YAML
This documentation explains the parameters of the refmap configuration YAML file.
This file is generated by running panpipes refmap config.
The individual steps run by the pipeline are described in the Reference Mapping workflow.
When running the refmap workflow, panpipes provides a basic pipeline.yml file.
To run the workflow on your own data, you need to specify the parameters described below in the pipeline.yml file to meet the requirements of your data.
However, we do provide pre-filled versions of the pipeline.yml file for individual tutorials.
For more information on the functionalities implemented in panpipes to read the configuration files, such as reading blocks of parameters and reusing blocks with &anchors and *scalars, please check our documentation.
You can download the different refmap pipeline.yml files here:
Basic pipeline.yml file (not prefilled) that is generated when calling panpipes refmap config: Download here
pipeline.yml file for Reference Mapping Tutorial: Download here
Compute resources options
resources
Computing resources to use, specifically the number of threads used for parallel jobs. Check threads_tasks_panpipes for more information on which threads each specific task requires.
Specified by the following three parameters:
threads_high
Integer, Default: 1
Number of threads used for high-intensity computing tasks. For each thread, there must be enough memory to load all your input files at once and create the MuData object.
threads_medium
Integer, Default: 1
Number of threads used for medium-intensity computing tasks. For each thread, there must be enough memory to load your MuData object and do computationally light tasks.
threads_low
Integer, Default: 1
Number of threads used for low-intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting; this requires much less memory than the other two.
condaenv
String (Path)
Path to the conda environment that should be used to run panpipes.
queues
String (Path)
Queues to use if a special queue is required for long jobs or if the user has access to a GPU-specific queue. Otherwise, leave blank.
long
String (Path)
gpu
String (Path)
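As a minimal sketch, the compute resources options described above could look like this in pipeline.yml. The nesting of the three thread parameters under resources and of long and gpu under queues follows the grouping in this section; placing condaenv at the top level is an assumption, and the environment path is a hypothetical placeholder. Compare against the file generated by panpipes refmap config before relying on it.

```yaml
# Sketch only: check key placement against your generated pipeline.yml.
resources:
  threads_high: 1    # tasks that load all input files and create the MuData object
  threads_medium: 1  # tasks that load the MuData object and do light computation
  threads_low: 1     # tasks that load text files and do plotting
# Hypothetical placeholder path; top-level placement of condaenv is assumed.
condaenv: /path/to/conda/env
queues:
  long:  # leave blank unless a special queue for long jobs is required
  gpu:   # leave blank unless a GPU-specific queue is available
```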
Loading data options
Query Dataset
query
String, Default: path/to/data
Path to the desired data. Accepted formats include raw 10x, preprocessed quality-filtered MuData, or AnnData as input query.
modality
String, Default: rna
If a MuData object was provided, specify the modality to be used. Currently, only the RNA modality is supported.
query_batch
String, Default:
Fill this in only if the provided data includes batch information; if so, specify the column that holds it. If not, leave blank.
query_celltype
String, Default:
Fill this in if the provided query has cell type annotations that should be compared to the transferred labels. If not, leave blank.
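A minimal sketch of the query dataset options with their documented defaults follows. It assumes these are flat top-level keys and that the modality key is spelled modality, as read from the flattened parameter list above; both points should be checked against the generated pipeline.yml.

```yaml
# Sketch only: flat key layout and the spelling of the modality key are assumptions.
query: path/to/data  # raw 10x, preprocessed quality-filtered MuData, or AnnData
modality: rna        # only the RNA modality is currently supported
query_batch:         # blank: the query has no batch column
query_celltype:      # blank: no existing cell type annotation to compare against
```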
scvi-tools parameters
reference_data
String, Default: path/to/mudata
Specify one or more reference models to be used as reference. Users can also specify their own reference built using pipeline_integration. Leave blank for no model specification.
totalvi
String, Default: path/to/totalvi
Provide the path to a saved totalvi model. Multiple paths can be provided as a list:
totalvi:
- path_to_totalvi1
- path_to_totalvi2
impute_proteins
Boolean, Default: False
transform_batch
String, Default:
transform_batch is a batch covariate specific to totalvi. It allows the model to use the batch information in the query to mitigate differences in protein sequencing depth.
scvi
String, Default: path/to/scvi
Mandatory. Provide a path to the scvi model. Multiple paths can be provided as a list:
scvi:
- path_to_scvi1
- path_to_scvi2
scanvi
String, Default: path/to/scanvi
Mandatory. Provide a path to the scanvi model.
run_randomforest
Boolean, Default: False
Set to True if the reference model has a trained random forest classifier to transfer the labels.
Training parameters
To reuse the same parameters in multiple locations, use anchors (&) and scalars (*) in the relevant places. For example, if you specify &rna_neighbors, the same parameters will be called by *rna_neighbors wherever it is referenced. Check our documentation for more information on using anchors and scalars.
training_plan:
totalvi
Array of training parameters.
For the full list of parameters check here. To reuse the same parameters in other locations, use an anchor: for example, writing totalvi: &totalvitraining ensures that the same array is reused when it is referenced as *totalvitraining. In this example, the &totalvitraining array contains the two parameters max_epochs and weight_decay.
max_epochs
Integer, Default: 200
weight_decay
Float, Default: 0.0
Recommended weight decay is 0.0. This ensures that the latent representation of the reference cells remains exactly the same when they are passed through the new query model.
scvi
Array of training parameters, Default: *totalvitraining (reuse the same array as specified above)
scanvi
Array of training parameters, Default: *totalvitraining (reuse the same array as specified above)
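The anchor mechanism described in this section can be sketched as follows, using the documented defaults. The anchor name totalvitraining and the two parameters come from the text above; the indentation is an assumption about how the block is laid out in pipeline.yml. Note that the YAML specification calls the *name form an alias, which this documentation refers to as a scalar.

```yaml
# Sketch only: define the training parameters once and reuse them for the other models.
training_plan:
  totalvi: &totalvitraining  # the anchor marks this block for reuse
    max_epochs: 200
    weight_decay: 0.0        # keeps the reference latent representation unchanged
  scvi: *totalvitraining     # resolves to the same max_epochs and weight_decay
  scanvi: *totalvitraining
```

When the file is parsed, scvi and scanvi receive the same two values as totalvi, so changing max_epochs under the anchor changes it for all three models. To give one model different settings, replace its *totalvitraining reference with its own block of parameters.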
Neighbors parameters to calculate umaps
These can be calculated on either the query alone or the query + reference dataset.
neighbors:
npcs
Integer, Default: 30
Number of principal components to calculate for neighbours and UMAP. If no correction is applied, PCA will be calculated and used to run UMAP and clustering. If Harmony is the method of choice, it will use these components to create a corrected dimensionality reduction.
k
Integer, Default: 30
The number of neighbours.
metric
String, Default: euclidean
Options include cosine and euclidean.
method
String, Default: scanpy
Options include scanpy and hnsw (from scvelo).
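For orientation, the neighbors block with the defaults listed above might be written as in the sketch below. The nesting of the four keys under neighbors is inferred from this section's grouping and should be checked against the generated pipeline.yml.

```yaml
# Sketch only: defaults as documented above.
neighbors:
  npcs: 30           # number of principal components
  k: 30              # number of neighbours
  metric: euclidean  # or cosine
  method: scanpy     # or hnsw (from scvelo)
```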
Run scib metrics on query
scib is run on the query data after transferring labels where available (with the totalvi and scanvi models), or using default Leiden clustering after training the VAE model (scvi). Check the documentation for the metrics used.
scib:
run
Boolean, Default: False
cluster_key
String, Default: predictions
Used for ARI and NMI. If left empty, this defaults to the Leiden clustering calculated on the new latent representation after reference mapping.
batch_key
String, Default:
Used for clisi_graph_embed. If no batch is present, the metrics will not be included in the results. If left blank, this defaults to the cluster_key default.
celltype_key
String, Default: celltype
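A minimal sketch of the scib block with the documented defaults follows. The nesting under scib is inferred from this section; set run to True to compute the metrics.

```yaml
# Sketch only: defaults as documented above.
scib:
  run: False                # set to True to compute scib metrics on the query
  cluster_key: predictions  # used for ARI and NMI
  batch_key:                # blank: no batch column in the query
  celltype_key: celltype
```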