#
Demultiplex Raw Sequences
- at least 2 cores/threads available
- paired-end reads from an Illumina sequencer in FASTQ format (gzip recommended)
When you pool samples and sequence them in parallel on an Illumina sequencer, you receive large multiplexed FASTQ
files in return. These files contain the sequences for all of your samples and need to be demultiplexed using barcodes,
which separates the sequences for each sample into their own files (a forward and a reverse file per sample). These barcodes
should have been added during sample DNA preparation in the laboratory. The demultiplexing strategy varies with the
haplotag technology you are using (see Haplotag Types below).
harpy demultiplex METHOD OPTIONS... R1_FQ R2_FQ I1_FQ I2_FQ
harpy demultiplex gen1 --threads 20 --schema demux.schema Plate_1_S001_R*.fastq.gz Plate_1_S001_I*.fastq.gz
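The example command passes two globs where the usage line lists four input files. This works because the shell expands each glob into its matching files in sorted order. The following is a minimal Python sketch of that expansion, assuming hypothetical file names of the form Plate_1_S001_R1.fastq.gz (the names your facility delivers may differ):

```python
from fnmatch import fnmatch

# Hypothetical directory listing; actual names depend on how the run was converted.
files = [
    "Plate_1_S001_I2.fastq.gz",
    "Plate_1_S001_R2.fastq.gz",
    "Plate_1_S001_I1.fastq.gz",
    "Plate_1_S001_R1.fastq.gz",
]

def expand(pattern, names):
    """Mimic shell globbing: keep the matching names, in sorted order."""
    return sorted(name for name in names if fnmatch(name, pattern))

# Pattern order follows the usage line: read files first, then index files.
positional = (
    expand("Plate_1_S001_R*.fastq.gz", files)
    + expand("Plate_1_S001_I*.fastq.gz", files)
)
r1_fq, r2_fq, i1_fq, i2_fq = positional
print(positional)
```

If a glob would match more than two files (for example, several lanes of the same plate), it is safer to list the four files explicitly.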
#
Running Options
In addition to the common runtime options, the demultiplex module is configured using these command-line arguments:
#
Haplotag Types
gen1
- Barcode configuration: 13 + 13
- Sequencing mask: 151+13+13+151
- Sample identifier: Cxx barcode
- Facility should not demultiplex
These are the original 13 + 13 barcodes described in Meier et al. 2021. You should request that your sequencing facility
not demultiplex the sequences. This requires running bcl2fastq without a sample sheet and with the settings
--use-bases-mask=Y151,I13,I13,Y151 and --create-fastq-for-index-reads. With Generation I beadtags, the C barcode is
sample-specific, meaning a single sample should have the same C barcode for all of its sequences.
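To make the link between the bases mask and the 151+13+13+151 sequencing mask explicit, here is a small Python sketch that splits the mask into its segments. The function name parse_bases_mask is hypothetical, and the sketch only handles the simple form used here (Y or I followed by a cycle count); other mask forms that bcl2fastq accepts are not covered.

```python
import re

def parse_bases_mask(mask):
    """Split a mask such as 'Y151,I13,I13,Y151' into (kind, cycles) pairs."""
    segments = []
    for part in mask.split(","):
        match = re.fullmatch(r"([YI])(\d+)", part)
        if match is None:
            raise ValueError(f"unsupported mask segment: {part!r}")
        segments.append((match.group(1), int(match.group(2))))
    return segments

# Y marks a sequence read, I marks an index (barcode) read.
segments = parse_bases_mask("Y151,I13,I13,Y151")
layout = "+".join(str(cycles) for _, cycles in segments)
print(layout)
```

Read this way, the two 13-cycle index reads carry the 13 + 13 barcodes, and the two 151-cycle reads are the forward and reverse sequences.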
#
demultiplexing schema
Since Generation I haplotags use a unique Cxx barcode per sample, that's the barcode
that will be used to identify sequences by sample. You will need to provide a simple text
file to --schema (-s) with two columns, the first being the sample name, the second being
the Cxx barcode (e.g., C19). This file must be tab or space delimited and must have no column names.
Sample01 C01
Sample02 C02
Sample03 C03
Sample04 C04
This will result in splitting the multiplexed reads into individual file pairs Sample01.F.fq.gz, Sample01.R.fq.gz, Sample02.F.fq.gz, etc.
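Before running the workflow, it can help to sanity-check the schema file. The sketch below is an illustration in Python, not Harpy's own validation: the function names read_schema and expected_outputs are hypothetical, and it assumes a Cxx barcode is the letter C followed by two digits, as in the examples above.

```python
import re

def read_schema(text):
    """Parse schema text into a {barcode: sample} dict, checking each row."""
    barcode_to_sample = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()  # splits on tabs or spaces
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, found {len(fields)}")
        sample, barcode = fields
        # Assumption: barcodes look like C01..C99; a header row fails this check.
        if not re.fullmatch(r"C\d{2}", barcode):
            raise ValueError(f"line {lineno}: {barcode!r} is not a Cxx barcode")
        if barcode in barcode_to_sample:
            raise ValueError(f"line {lineno}: barcode {barcode} is listed twice")
        barcode_to_sample[barcode] = sample
    return barcode_to_sample

def expected_outputs(schema):
    """List the forward/reverse file pair expected for each sample."""
    return [f"{sample}.{end}.fq.gz" for sample in schema.values() for end in ("F", "R")]

schema = read_schema("Sample01\tC01\nSample02 C02\n")
print(expected_outputs(schema))
```

A row of column names such as "sample barcode" is rejected because its second field is not a Cxx barcode, which matches the requirement that the file has no header.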
#
Gen I Demultiplex Workflow
Barcode correction and migration into the read headers is performed using demult_fastq (Harpy renames it to demuxGen1),
which is distributed by the team behind haplotagging. Demultiplexing the pooled FASTQ files into
individual samples is performed in parallel using the beloved workhorse grep.
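The second step can be pictured as filtering reads by the Cxx segment of the barcode in each header. The following is a rough Python sketch of that idea, not Harpy's implementation (which uses grep on the files themselves). It assumes, for illustration only, that the barcode sits in a header tag of the form BX:Z:A01C02B03D04; the function name split_by_sample and the example reads are hypothetical.

```python
import re
from collections import defaultdict

# Assumed header layout: barcode stored as BX:Z:AxxCxxBxxDxx (illustrative only).
C_SEGMENT = re.compile(r"BX:Z:\S*?(C\d{2})")

def split_by_sample(fastq_lines, schema):
    """Group 4-line FASTQ records by the sample whose Cxx barcode is in the header."""
    by_sample = defaultdict(list)
    for i in range(0, len(fastq_lines) - 3, 4):
        record = fastq_lines[i:i + 4]
        match = C_SEGMENT.search(record[0])
        sample = schema.get(match.group(1)) if match else None
        if sample is not None:  # reads with unlisted barcodes are skipped
            by_sample[sample].extend(record)
    return dict(by_sample)

reads = [
    "@read1 BX:Z:A01C01B05D10", "ACGT", "+", "IIII",
    "@read2 BX:Z:A07C02B11D03", "TTGA", "+", "IIII",
    "@read3 BX:Z:A02C01B09D44", "GGCC", "+", "IIII",
]
demuxed = split_by_sample(reads, {"C01": "Sample01", "C02": "Sample02"})
print(sorted(demuxed))
```

Because each sample is matched independently of the others, the per-sample filtering can run in parallel, which is why the workflow benefits from multiple threads.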
graph LR
    subgraph Inputs
        direction TB
        A[multiplexed FASTQ]:::clean---BX
        BX[Barcode Files]:::clean---SCH
        SCH[Sample Schema]:::clean
    end
    Inputs-->B([barcodes to headers]):::clean
    B-->C([demultiplex samples]):::clean
    C-->D([quality metrics]):::clean
    style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
The default output directory is Demultiplex with the folder structure below. Sample1 and Sample2 are
generic sample names for demonstration purposes. The resulting folder also includes a workflow directory
(not shown) with workflow-relevant runtime files and information.
Demultiplex/
├── Sample1.F.fq.gz
├── Sample1.R.fq.gz
├── Sample2.F.fq.gz
├── Sample2.R.fq.gz
└── reports
    └── demultiplex.QC.html
demultiplex.QC.html is the summary report Harpy generates for this workflow.