# Downsample data by barcode

  • One of either:
    • one alignment file (.bam or .sam, case insensitive)
    • one set of paired-end reads in FASTQ format (.fq or .fastq, gzip recommended, case insensitive)
  • Barcodes in the BX:Z SAM tag for both BAM and FASTQ inputs

While downsampling (subsampling) FASTQ and BAM files is relatively simple with tools such as awk, samtools, seqtk, seqkit, etc., downsample lets you downsample a BAM file (or a pair of paired-end FASTQ files) by barcode rather than by read. That means you keep all the reads associated with d barcodes, where d is the value given to --downsample.

usage
harpy downsample OPTIONS... INPUT(S)...
example
# BAM file
harpy downsample -d 1000 -i 0.3 -p sample1.sub1000 sample1.bam

# FASTQ file
harpy downsample -d 1000 -i 0 -p sample1.sub1000 sample1.F.fq.gz sample1.R.fq.gz

# Running Options

In addition to the common runtime options, the downsample module is configured using the command-line arguments below.

| argument | short name | default | description |
|---|---|---|---|
| INPUT(S) | | | (required) One BAM file or both read files from a paired-end FASTQ pair |
| --downsample | -d | | (required) Number of barcodes to downsample to |
| --invalid | -i | 1 | Proportion of invalid barcodes to sample |
| --prefix | -p | downsampled | Prefix for output files |
| --random-seed | | | (optional) Random seed for sampling |

# Invalid barcodes

The --invalid option determines what proportion of invalid barcodes appears in the barcode pool. Bear in mind that the barcode pool still gets subsampled, so the --invalid proportion does not necessarily reflect how many invalid barcodes end up being sampled, only what proportion of them is considered for sampling. The proportions equate to:

  • 0: invalid barcodes are skipped
  • 1: all invalid barcodes appear in the barcode pool that gets subsampled
  • 0<i<1: that proportion of invalid barcodes appears in the barcode pool that gets subsampled
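
The two-stage behaviour described above (first build the pool, then subsample it) can be sketched in Python. This is only an illustration, not Harpy's actual implementation: the function names are hypothetical, and the default test for an invalid barcode (a "00" segment, as in haplotagging-style barcodes) is an assumption that may not match how downsample classifies barcodes.

```python
import random

def build_barcode_pool(barcodes, invalid_proportion, is_invalid, rng):
    """Return all valid barcodes plus a random share of the invalid ones."""
    if not 0 <= invalid_proportion <= 1:
        raise ValueError("invalid_proportion must be between 0 and 1")
    # Sort the unique barcodes so a fixed seed gives a reproducible result.
    unique = sorted(set(barcodes))
    valid = [bc for bc in unique if not is_invalid(bc)]
    invalid = [bc for bc in unique if is_invalid(bc)]
    # Rounding to a whole number of barcodes is a choice made for this sketch.
    n_keep = round(len(invalid) * invalid_proportion)
    return valid + rng.sample(invalid, n_keep)

def downsample_barcodes(barcodes, d, invalid_proportion=1.0, seed=None,
                        is_invalid=lambda bc: "00" in bc):
    """Pick up to d barcodes from the pool (hypothetical helper).

    ASSUMPTION: a barcode is treated as invalid when it contains "00".
    """
    rng = random.Random(seed)
    pool = build_barcode_pool(barcodes, invalid_proportion, is_invalid, rng)
    # The pool itself is subsampled, so invalid barcodes admitted to the
    # pool are not guaranteed to appear in the final selection.
    return set(rng.sample(pool, min(d, len(pool))))
```

With invalid_proportion set to 0, only valid barcodes can be selected; with 1, every invalid barcode competes with the valid ones for the d slots.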

# Downsample Workflow

graph LR
    subgraph fastq
        R1([read 1]):::clean---R2([read 2]):::clean
    end
    subgraph bam
        bamfile([bam]):::clean
    end
    fastq-->|bam conversion|bam
    bam-->sub([extract and\n subsample barcodes]):::clean
    sub-->exreads([extract reads]):::clean
    bam-->exreads
    fastq-->exreads
    style fastq fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style bam fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
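
The final "extract reads" step of the workflow amounts to keeping every record whose BX:Z barcode is among the sampled barcodes. A minimal Python sketch over plain-text SAM lines follows; the helper names are hypothetical, and the real workflow presumably relies on dedicated tools for BAM and FASTQ input rather than line-by-line text parsing.

```python
import re

# Matches the BX:Z tag in the optional fields of a SAM record.
BX_TAG = re.compile(r"\tBX:Z:([^\t\n]+)")

def barcode_of(sam_line):
    """Return the BX:Z barcode of a SAM record, or None if it has none."""
    match = BX_TAG.search(sam_line)
    return match.group(1) if match else None

def extract_reads(sam_lines, keep_barcodes):
    """Yield header lines and the records whose barcode was sampled."""
    for line in sam_lines:
        # Header lines start with "@" and are passed through unchanged.
        if line.startswith("@") or barcode_of(line) in keep_barcodes:
            yield line
```

Records without a BX:Z tag are dropped in this sketch, since they cannot be assigned to any sampled barcode.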