Useful information on formatting and reading YAML files

panpipes workflows are configured by specifying their parameters in configuration files. Each workflow has its own configuration file, which is generated by running panpipes NAME_OF_WORKFLOW config.

The configuration files are written in YAML, a human-readable format for defining parameters, settings, and nested structures. This makes the configurations easy to modify and share, and keeps the behavior of panpipes and its components explicit.

If you’re not familiar with the format, please check out these resources:

How we use YAML files to configure panpipes actions

How does panpipes read the configuration files?

panpipes reads the whole pipeline.yml as PARAMS at the beginning of each pipeline execution:

import os

from cgatcore import pipeline as P

# Look for pipeline.yml next to the workflow script and in the working directory
PARAMS = P.get_parameters(
    ["%s/pipeline.yml" % os.path.splitext(__file__)[0],
     "pipeline.yml"])
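
To make the two entries of that list concrete, the short sketch below evaluates the same path expression for a hypothetical script location (the path /path/to/pipeline_example.py is an assumption used only for illustration, not a real panpipes file). The first entry points to a pipeline.yml inside a directory named after the script, and the second is a relative path, so it refers to a pipeline.yml in the current working directory.

```python
import os

# Hypothetical location of a workflow script; only the path logic matters here.
script_path = "/path/to/pipeline_example.py"

# Same expression as in the snippet above, with __file__ replaced by script_path.
candidates = [
    "%s/pipeline.yml" % os.path.splitext(script_path)[0],
    "pipeline.yml",
]

for path in candidates:
    print(path)
# /path/to/pipeline_example/pipeline.yml
# pipeline.yml
```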

Understanding mapping blocks and indentations

YAML organizes parameters in mapping blocks. A block starts where the indentation level increases and ends where it returns to the previous level. The indentation in the files is therefore essential: it tells the pipeline which keys belong to which block.
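
As a minimal sketch of why indentation matters, the snippet below parses two small YAML strings with PyYAML (the same library used later on this page). The strings are illustrative and are not taken from a panpipes file: the first nests its keys with indentation, while the second has the same keys with the indentation removed.

```python
import yaml

# Keys indented under `prot` become part of the `prot` mapping block.
indented = """
prot:
  run: True
  harmony:
    sigma: 0.1
"""

# Without indentation, every key sits at the top level and
# `prot` and `harmony` are left without a value.
not_indented = """
prot:
run: True
harmony:
sigma: 0.1
"""

nested = yaml.safe_load(indented)
flat = yaml.safe_load(not_indented)

print(nested)  # {'prot': {'run': True, 'harmony': {'sigma': 0.1}}}
print(flat)    # {'prot': None, 'run': True, 'harmony': None, 'sigma': 0.1}
```

Both strings are valid YAML, so a wrong indentation level does not necessarily raise an error; it can silently move a parameter to a different block.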

Here is an example of a mapping block from an excerpt of the integration pipeline.yml file.

prot:
  
  run: True 
  tools: harmony
  column: sample_id 
  #----------------------------
  # Harmony args
  #-----------------------------
  harmony:
  # sigma value, used by Harmony
    sigma: 0.1 
  # theta value used by Harmony, default is 1
    theta: 1.0
  # number of pcs, used by Harmony
    npcs: 30
  #----------------------------
  # BBKNN args # https://bbknn.readthedocs.io/en/latest/
  #-----------------------------
  bbknn:
    neighbors_within_batch: 
  #----------------------------
  # find neighbour parameters
  #-----------------------------
  neighbors: &prot_neighbors
    # number of Principal Components to calculate for neighbours and umap:
    #   -if no correction is applied, PCA will be calculated and used to run UMAP and clustering on
    #   -if Harmony is the method of choice, it will use these components to create a corrected dim red.)
    # note: scvelo default is 30
    npcs: 30
    # number of neighbours
    k: 30
    # metric: euclidean | cosine
    metric: euclidean
    # scanpy | hnsw (from scvelo)
    method: scanpy

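Because panpipes reads the whole pipeline.yml as PARAMS, a key nested directly under a top-level block can be accessed by joining the block name and the key with an underscore. PARAMS['prot_run'] therefore holds the value it was assigned in the YAML, True. Similarly, PARAMS['prot_tools'] holds harmony.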

When a block has several levels of indentation, it can be more convenient to read the whole block in one go, for example prot_params = PARAMS['prot']. PARAMS['prot']['neighbors'] is then equivalent to prot_params['neighbors'].
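
The two access styles can be illustrated with plain dictionaries. This is a sketch under stated assumptions: flatten_first_level is a hypothetical helper written for this example, not the cgatcore implementation, and it assumes that only keys directly under a top-level block receive a joined name such as prot_run.

```python
# Nested dictionary mirroring part of the `prot` block shown above.
params = {
    "prot": {
        "run": True,
        "tools": "harmony",
        "column": "sample_id",
        "neighbors": {"npcs": 30, "k": 30, "metric": "euclidean", "method": "scanpy"},
    }
}


def flatten_first_level(nested):
    """Hypothetical helper: add 'block_key' entries for each top-level block."""
    flat = dict(nested)  # keep the original nested entries as well
    for block, content in nested.items():
        if isinstance(content, dict):
            for key, value in content.items():
                flat[f"{block}_{key}"] = value
    return flat


PARAMS = flatten_first_level(params)

# Flattened access to a single value
print(PARAMS["prot_run"])    # True
print(PARAMS["prot_tools"])  # harmony

# Reading the whole block once, then indexing into it
prot_params = PARAMS["prot"]
print(prot_params["neighbors"] == PARAMS["prot"]["neighbors"])  # True
```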

You can read the pipeline.yml configuration file in Python and check how its values are parsed using:

import yaml
with open('pipeline.yml', 'r') as file:
    PARAMS = yaml.safe_load(file)

In panpipes, we read the YAML configuration file at the beginning of each workflow with:

import os

from cgatcore import pipeline as P

PARAMS = P.get_parameters(
    ["%s/pipeline.yml" % os.path.splitext(__file__)[0],
     "pipeline.yml"])

Anchors and Aliases

YAML offers features for reusing blocks of a file without repeating entire sections. One example is the pair of anchors (&) and aliases (*); the comments in the panpipes configuration files call aliases "scalars" (check the documentation links for more information on the terminology).

Let’s look at an example from the same excerpt as before. In this block, we configure the parameters used to construct the KNN connectivity graph on the protein modality.

neighbors: &prot_neighbors
  # number of Principal Components to calculate for neighbours and umap:
  #   -if no correction is applied, PCA will be calculated and used to run UMAP and clustering on
  #   -if Harmony is the method of choice, it will use these components to create a corrected dim red.)
  # note: scvelo default is 30
  npcs: 30
  # number of neighbours
  k: 30
  # metric: euclidean | cosine
  metric: euclidean
  # scanpy | hnsw (from scvelo)
  method: scanpy

It may be useful to reuse the same parameters in other sections of the YAML file without rewriting them (number of PCs, npcs: 30; number of neighbours, k: 30; Euclidean distance, metric: euclidean).

For these cases, we attach an anchor to the neighbors block with the following syntax:

neighbors: &prot_neighbors

This syntax lets us reuse the prot_neighbors parameters by referencing the anchor as *prot_neighbors wherever the same parameters are expected. Further down in the pipeline.yml file, this notation is used to reuse the KNN graph computation parameters for all the modalities:

WNN:
    # muon implementation of WNN 
    modalities: rna,prot,atac 
    # run wnn on batch corrected unimodal data, set each of the modalities you want to use to calc WNN to ONE method.
    # leave to None and it will default to de novo calculation of neighbours on non corrected data for that modality using specified params 
    batch_corrected:
      # options are: "bbknn", "scVI", "harmony", "scanorama"
      rna: None
      # options are "harmony", "bbknn"
      prot: None
      # options are "harmony"
      atac: None 
    # please use anchors (&) and scalars (*) in the relevant place
    # i.e. &rna_neighbors will be called by *rna_neighbors where referenced
    knn:
      rna: *rna_neighbors
      prot: *prot_neighbors
      atac: *atac_neighbors
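
The way an anchor is resolved can be checked with PyYAML. The snippet below is a minimal sketch that uses a shortened, illustrative configuration string (only the prot modality, with comments removed), not the full panpipes pipeline.yml. It also shows that referencing an anchor that is not defined in the same file fails at parse time, which is why an excerpt such as the WNN block above cannot be parsed on its own.

```python
import yaml

# Shortened configuration: the anchor is defined once and referenced once.
config = """
prot:
  neighbors: &prot_neighbors
    npcs: 30
    k: 30
    metric: euclidean
    method: scanpy
WNN:
  knn:
    prot: *prot_neighbors
"""

params = yaml.safe_load(config)

# The reference is replaced by the content of the anchored block.
print(params["WNN"]["knn"]["prot"])
# {'npcs': 30, 'k': 30, 'metric': 'euclidean', 'method': 'scanpy'}

# Referencing an anchor that was never defined raises a YAML error.
broken = """
knn:
  rna: *rna_neighbors
"""
try:
    yaml.safe_load(broken)
    alias_error = None
except yaml.YAMLError as err:
    alias_error = err

print(alias_error is not None)  # True
```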