# Map Reads onto a genome with EMA

at least 4 cores/threads available
a genome assembly in FASTA format: .fasta .fa .fasta.gz .fa.gz case insensitive
paired-end fastq sequence files gzipped recommended
- sample name: a-z 0-9 . _ - case insensitive
- forward: _F .F .1 _1 _R1_001 .R1_001 _R1 .R1
- reverse: _R .R .2 _2 _R2_001 .R2_001 _R2 .R2
- fastq extension: .fq .fastq case insensitive
patience because EMA is slow

The original haplotag manuscript uses BWA to map reads. The authors have since recommended the use of EMA (EMerald Aligner) for most applications. EMA is barcode-aware, meaning it considers sequences with identical barcodes to have originated from the same molecule, and therefore has higher mapping accuracy than using BWA. Here's a comparison from the EMA manuscript:

Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input, such as those derived using harpy qc . You can map reads onto a genome assembly with Harpy using the align ema module:

usage
harpy align ema OPTIONS... REFERENCE INPUTS...

example
harpy align ema genome.fasta Sequences/ 

# Running Options

In addition to the [!badgevariant="info" corners="pill" text="common runtime options"](/ Resources/common_options.md), the align ema module is configured using these command-line arguments:

argument	type	default	description
`REFERENCE`	file path		required Reference genome for read mapping
`INPUTS`	file/directory paths		required Files or directories containing input FASTQ files
`--contigs`	file path or list		Contigs to plot in the report
`--fragment-density` `-d`	toggle	false	Perform read fragment density optimization
`--ema-bins` `-e`	integer (1-1000)	`500`	Number of barcode bins for EMA
`--extra-params` `-x`	string		Additional EMA-align arguments, in quotes
`--keep-unmapped` `-u`	toggle	false	Output unmapped sequences too
`--min-quality` `-q`	integer (0-40)	`30`	Minimum `MQ` (SAM mapping quality) to pass filtering
`--platform` `-p`	string	`haplotagging`	required Linked read technology: `haplotagging` or `10x`
`--barcode-list` `-b`	file path		Path to barcode list (`--platform 10x` only)

# Barcode list

Some linked-read methods (e.g. 10x, Tellseq) require the inclusion of a barcode list (often referred to a "whitelist", unfortunately). This file is a simple text file that has one barcode per line so a given software knows what barcodes to expect in your data. If you need to process 10x data, then you will need to include the barcode list file (usually provided by 10x). Conveniently, haplotaggging data doesn't require this file.

# Quality filtering

The --min-quality argument filters out alignments below a given MQ threshold. The default, 30, keeps alignments that are at least 99.9% likely correctly mapped. Set this value to 1 if you only want alignments removed with MQ = 0 (0% likely correct). You may also set it to 0 to keep all alignments for diagnostic purposes. The plot below shows the relationship between MQ score and the likelihood the alignment is correct and will serve to help you decide on a value you may want to use. It is common to remove alignments with MQ <30 (<99.9% chance correct) or MQ <40 (<99.99% chance correct).

Every alignment in a BAM file has an associated mapping quality score (MQ) that informs you of the likelihood that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more likely correct. The math governing the MQ score actually calculates the percent chance the alignment is incorrect:

\%\ chance\ incorrect = 10^\frac{-MQ}{10} \times 100\\
\text{where }0\le MQ\le 40

You can simply subtract it from 100 to determine the percent chance the alignment is correct:

\%\ chance\ correct = 100 - \%\ chance\ incorrect\\
\text{or} \\
\%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100

# Marking PCR duplicates

EMA marks duplicates in the resulting alignments, however the read with invalid barcodes are aligned separately with BWA. Therefore, Harpy uses samtools markdup to mark putative PCR duplicates in the reads with invalid barcodes. Duplicate marking uses the -S option to mark supplementary (chimeric) alignments as duplicates if the primary alignment was marked as a duplicate. Duplicates get marked but are not removed.

# EMA workflow

leverages the BX barcode information to improve mapping
sometimes better downstream SV detection
slower
marks split alignments as secondary alignments ⚠️
lots of temporary files

Since EMA does extra things to account for barcode information, the EMA workflow is a bit more complicated under the hood. Reads with barcodes are aligned using EMA and reads without valid barcodes are separately mapped using BWA before merging all the alignments together again. EMA will mark duplicates within alignments, but the BWA alignments need duplicates marked manually using samtools markdup.

graph LR
    subgraph Inputs
        trm[FASTQ files]:::clean---geno[genome]:::clean
    end
    Inputs-->A & IDX
    A([EMA count]):::clean --> B([EMA preprocess]):::clean
    B-->C([EMA align barcoded]):::clean
    C-->D([sort BX alignments]):::clean
    D-->F([merge all alignments]):::clean
    IDX([index genome]):::clean-->C
    IDX-->Z([BWA align unbarcoded]):::clean
    Z-->Y([sort alignments]):::clean
    Y-->X([mark duplicates]):::clean
    X-->F
    F-->J([alignment stats]):::clean
    subgraph markdp [mark duplicates via `samtools`]
        direction LR
        collate:::clean-->fixmate:::clean
        fixmate-->sort:::clean
        sort-->markdup:::clean
    end
    style markdp fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px,rx:10px,ry:10px
    style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px,rx:10px,ry:10px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px

The default output directory is Align/ema with the folder structure below. Sample1 is a generic sample name for demonstration purposes. The resulting folder also includes a workflow directory (not shown) with workflow-relevant runtime files and information.

Align/ema
├── Sample1.bam
├── Sample1.bam.bai
├── logs
│   ├── Sample1.bwa.align.log
│   ├── Sample1.bwa.sort.log
│   ├── Sample1.ema.align.log
│   ├── Sample1.ema.sort.log
│   ├── Sample1.markdup.log
│   └── preproc
│       └── Sample1.preproc.log
└── reports
    ├── barcodes.summary.html
    ├── ema.stats.html
    ├── Sample1.html
    └── data
        ├── bxstats
        │   └── Sample1.bxstats.gz
        └── coverage
            └── Sample1.cov.gz

item	description
`*.bam`	sequence alignments for each sample
`*.bai`	sequence alignment indexes for each sample
`logs/preproc/*.preproc.log`	everything `ema preproc` writes to `stderr` during operation
`reports/`	various counts/statistics/reports relating to sequence alignment
`reports/ema.stats.html`	report summarizing `samtools flagstat and stats` results across all samples from `multiqc`
`reports/barcodes.summary.html`	interactive html report summarizing barcode-specific metrics across all samples
`reports/Sample1.html`	interactive html report summarizing BX tag metrics and alignment coverage
`reports/data/coverage/*.cov.gz`	output from samtools cov, used for plots
`reports/data/bxstats`	tabular data containing the information used to generate the BX stats in reports

By default, Harpy runs ema with these parameters (excluding inputs and outputs):

ema-h align -d -p haplotag -R "@RG\tID:samplename\tSM:samplename"

Below is a list of all ema align command line arguments, excluding those Harpy already uses or those made redundant by Harpy's implementation of EMA. These are taken directly from the EMA documentation.

ema arguments
-d: apply fragment read density optimization [off]
-i <index>: index to follow 'BX' tag in SAM output [1]

These are the summary reports Harpy generates for this workflow. You may right-click the images and open them in a new tab if you wish to see the examples in better detail.

Alignment BX Information

Molecule size and Coverage

Samtools Alignment stats

An aggregate report of barcode-specific alignment information for all samples.

Reports the inferred molecule sized based on barcodes in the alignments.

Reports the general statistics computed by samtools stats and flagstat

linked-read