# Demultiplex Raw Sequences

  • at least 2 cores/threads available
  • paired-end reads from an Illumina sequencer in FASTQ format (gzip recommended)

When you pool samples and sequence them in parallel on an Illumina sequencer, you receive large multiplexed FASTQ files in return. These files contain the sequences for all of your samples and must be demultiplexed using barcodes, which separates the sequences for each sample into their own files (a forward and a reverse file per sample). These barcodes should have been added during sample DNA preparation in the laboratory. The demultiplexing strategy varies with the haplotag technology you are using (see Haplotag Types).

usage
harpy demultiplex METHOD OPTIONS... R1_FQ R2_FQ I1_FQ I2_FQ
example using wildcards
harpy demultiplex gen1 --threads 20 --schema demux.schema Plate_1_S001_R*.fastq.gz Plate_1_S001_I*.fastq.gz
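
The wildcard example works because the shell expands each pattern into its matching file names in sorted order, so R1 precedes R2 and I1 precedes I2, matching the positional order R1_FQ R2_FQ I1_FQ I2_FQ. The sketch below mimics that expansion in Python. The four file names are assumptions implied by the patterns, not files named in this documentation.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing, deliberately out of order.
files = [
    "Plate_1_S001_I2.fastq.gz",
    "Plate_1_S001_R2.fastq.gz",
    "Plate_1_S001_I1.fastq.gz",
    "Plate_1_S001_R1.fastq.gz",
]

def expand(pattern, names):
    """Mimic shell globbing: keep the matching names, in sorted order."""
    return sorted(name for name in names if fnmatchcase(name, pattern))

# The two patterns from the example command, expanded left to right.
args = expand("Plate_1_S001_R*.fastq.gz", files) + expand("Plate_1_S001_I*.fastq.gz", files)
print(args)
```

If a pattern matches more than two files (for example, several lanes), the expansion no longer lines up with the four positional arguments, so it is worth checking what the wildcard matches before running the command.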

# Running Options

In addition to the common runtime options, the demultiplex module is configured using these command-line arguments:

| argument | short name | type | required | description |
|:---|:---:|:---|:---:|:---|
| METHOD | | choice | ‼️ | Haplotag technology of the sequences [gen1] |
| R1_FQ | | file path | ‼️ | The forward multiplexed FASTQ file |
| R2_FQ | | file path | ‼️ | The reverse multiplexed FASTQ file |
| I1_FQ | | file path | ‼️ | The forward FASTQ index file provided by the sequencing facility |
| I2_FQ | | file path | ‼️ | The reverse FASTQ index file provided by the sequencing facility |
| --schema | -s | file path | ‼️ | Tab-delimited file of sample<tab>barcode |

# Haplotag Types

  • Barcode configuration: 13 + 13
  • Sequencing mask: 151+13+13+151
  • Sample identifier: Cxx barcode
  • Facility should not demultiplex

These are the original 13 + 13 barcodes described in Meier et al. 2021. Ask your sequencing facility not to demultiplex the sequences. This requires running bcl2fastq without a sample sheet and with the settings --use-bases-mask=Y151,I13,I13,Y151 and --create-fastq-for-index-reads. With Generation I beadtags, the C barcode is sample-specific, meaning all sequences from a single sample should share the same C barcode.
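
To make the link between the sequencing mask (151+13+13+151) and the bcl2fastq setting explicit, here is a minimal sketch that derives one from the other. The function name bases_mask is hypothetical, and the rule it applies (first and last segments are reads, everything between them is an index) is an assumption that fits this layout rather than a general bcl2fastq rule.

```python
def bases_mask(sequencing_mask: str) -> str:
    """Convert a mask such as '151+13+13+151' into 'Y151,I13,I13,Y151'.

    Assumes the first and last segments are the forward/reverse reads (Y)
    and every segment in between is an index read (I).
    """
    lengths = [int(part) for part in sequencing_mask.split("+")]
    if len(lengths) < 2:
        raise ValueError("expected at least a forward and a reverse read")
    last = len(lengths) - 1
    return ",".join(
        f"{'Y' if i in (0, last) else 'I'}{n}" for i, n in enumerate(lengths)
    )

mask = bases_mask("151+13+13+151")
# Only these two flags come from the text above; input and output
# options are deliberately left out of this sketch.
command = ["bcl2fastq", f"--use-bases-mask={mask}", "--create-fastq-for-index-reads"]
print(" ".join(command))
```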

# Demultiplexing Schema

Since Generation I haplotags use a unique Cxx barcode per sample, that barcode is used to assign sequences to samples. You will need to provide a simple text file to --schema (-s) with two columns: the first is the sample name and the second is the Cxx barcode (e.g., C19). The file must be tab- or space-delimited and must not contain column names.

example sample sheet
Sample01    C01
Sample02    C02
Sample03    C03
Sample04    C04

This will result in splitting the multiplexed reads into individual file pairs Sample01.F.fq.gz, Sample01.R.fq.gz, Sample02.F.fq.gz, etc.
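
Checking the schema before a long run can catch formatting mistakes early. The sketch below is not part of Harpy: read_schema and expected_outputs are hypothetical helpers that apply the rules described above (two whitespace-delimited columns, a Cxx barcode, one barcode per sample) and list the file pairs you would expect. The two-digit barcode pattern is an assumption based on the C19 example.

```python
import re

BARCODE = re.compile(r"C\d{2}")  # assumed Cxx format, e.g. C19

def read_schema(lines):
    """Return {barcode: sample} from schema lines, rejecting malformed rows."""
    schema = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()  # splits on tabs or spaces
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 columns, found {len(fields)}")
        sample, barcode = fields
        if not BARCODE.fullmatch(barcode):
            raise ValueError(f"line {number}: '{barcode}' is not a Cxx barcode")
        if barcode in schema:
            raise ValueError(f"line {number}: barcode {barcode} is used twice")
        schema[barcode] = sample
    return schema

def expected_outputs(schema):
    """List the forward/reverse file pair expected for each sample."""
    return [f"{sample}.{end}.fq.gz" for sample in schema.values() for end in ("F", "R")]

example = ["Sample01\tC01", "Sample02    C02"]
schema = read_schema(example)
print(expected_outputs(schema))
```

Note that a header row such as "sample barcode" would be rejected here because "barcode" is not a Cxx value, which matches the requirement that the file has no column names.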


# Gen I Demultiplex Workflow

Barcode correction and migration into the read headers is performed using demult_fastq (Harpy renames it to demuxGen1), which is distributed by the team behind haplotagging. Demultiplexing the pooled FASTQ files into individual samples is then performed in parallel using the beloved workhorse grep.
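
The second step can be pictured as a header search: once the barcode is in the read header, each record is routed to the sample that owns its Cxx segment. The following is a Python sketch of that idea, not Harpy's grep-based implementation. It assumes the barcode appears in the header as a BX:Z:AxxCxxBxxDxx tag, and split_by_sample and the example reads are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the barcode sits in the header as BX:Z:AxxCxxBxxDxx.
BX_TAG = re.compile(r"BX:Z:A\d{2}(C\d{2})B\d{2}D\d{2}")

def split_by_sample(fastq_lines, schema):
    """Group 4-line FASTQ records by the sample that owns their Cxx barcode."""
    by_sample = defaultdict(list)
    for i in range(0, len(fastq_lines) - 3, 4):
        record = fastq_lines[i:i + 4]
        match = BX_TAG.search(record[0])  # look only at the header line
        if match and match.group(1) in schema:
            by_sample[schema[match.group(1)]].append(record)
    return dict(by_sample)

# Made-up reads for illustration; read3 carries a barcode absent from the schema.
reads = [
    "@read1 BX:Z:A01C01B05D10", "ACGT", "+", "IIII",
    "@read2 BX:Z:A02C02B07D33", "TTGA", "+", "IIII",
    "@read3 BX:Z:A03C09B07D33", "GGCC", "+", "IIII",
]
schema = {"C01": "Sample01", "C02": "Sample02"}
result = split_by_sample(reads, schema)
print(sorted(result))
```

Reads whose Cxx barcode is not listed in the schema are simply left out in this sketch, and because each sample is handled independently, the per-sample searches can run in parallel.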

graph LR
    subgraph Inputs
        direction TB
        A[multiplexed FASTQ]:::clean---BX
        BX[Barcode Files]:::clean---SCH
        SCH[Sample Schema]:::clean
    end
    Inputs-->B([barcodes to headers]):::clean
    B-->C([demultiplex samples]):::clean
    C-->D([quality metrics]):::clean
    style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px

The default output directory is Demultiplex with the folder structure below. Sample1 and Sample2 are generic sample names for demonstration purposes. The resulting folder also includes a workflow directory (not shown) with workflow-relevant runtime files and information.

Demultiplex/
├── Sample1.F.fq.gz
├── Sample1.R.fq.gz
├── Sample2.F.fq.gz
├── Sample2.R.fq.gz
└── reports
    └── demultiplex.QC.html

| item | description |
|:---|:---|
| *.F.fq.gz | Forward reads from the multiplexed input belonging to samples in the schema |
| *.R.fq.gz | Reverse reads from the multiplexed input belonging to samples in the schema |
| reports/demultiplex.QC.html | Summary report of FASTQC quality metrics for the demultiplexed samples |

FASTQC metrics

This is the summary report Harpy generates for this workflow. You may right-click the image and open it in a new tab if you wish to see the example in better detail.

reports/demultiplex.QC.html