Skip to article frontmatterSkip to article content

Compare Inversion Calls Across Treatments

Cornell University

What would be interesting to see is the inversion-calling performance in linked reads across inversion sizes and depths. Since we already created the assessment tables from the individual size-class data exploration, we can read those in without having to reprocess those data.

library(ggplot2)
library(dplyr)
library(ggh4x)
options(warn = -1)
Output

Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Read in the linked-read variant assessment

.assess <- read.csv("assess_called_sv/linkedread.sv.assessment", header = T)
.assess$zygosity <- gsub("hom", "homozygote", .assess$zygosity)
.assess$zygosity <- gsub("het", "heterozygote", .assess$zygosity)
head(.assess)
Loading...

Read in the long-read variant assessment

longread <- read.csv("longread_workflow/longread.sv.assessment", header = T)
longread$state <- gsub("hom", "homozygote", longread$state)
longread$state <- gsub("het", "heterozygote", longread$state)

head(longread)
Loading...

Ok so maybe I wasn’t as consistent with naming as I could have been between the linked-read and long-read workflows, but let’s unify that here. First by renaming the columns in the long-read data to match the linked-read data, then to convert samples to integers.

longread <- rename(longread, "simulated" = present, "zygosity" = state, "technology" = platform)
longread$sample <- as.integer(gsub("sample_", "", longread$sample))
head(longread)
Loading...

Next we keep only the relevant columns from the linked-read data (1, 4 through 9, and 12) and reorganize the columns in the long-read data to match.

assess <- rbind(
    .assess[, c(1, 4:9, 12)],
    longread[, c(2,9,1,5:8, 10)]
)

Let’s also make some of the columns factors

assess$depth <- as.factor(assess$depth)
assess$size <- factor(assess$size, ordered = T, levels = c("small","medium","large","xl"))
assess$technology <- factor(assess$technology, ordered = T, levels = c("linkedread", "ontshort", "ontlong", "pacbio"))
assess$zygosity <- as.factor(assess$zygosity)
assess$assessment <- as.factor(assess$assessment)
head(assess, 10)
Loading...

Sample-level Calling

Prepare the Data

It would probably make sense to collapse these data somewhat to show a representation of true positives vs false negatives (as a ratio) across depths (x axis) and faceted for hom/het in the non-pooled data. We will also remove inversions that weren’t simulated for a sample b/c we aren’t interested in true negatives (since there were no false positives).

samples <- assess[assess$sample != 11 & assess$simulated,]

samples <- group_by(samples, size, depth, zygosity, assessment, technology) %>%
    summarize(count = n()) %>% ungroup()

head(samples)
Loading...

Some of the treatments don’t have true positives or false negatives, so we’ll need to add in rows with count 0 for those treatments.

# add missing rows with value 0 in the event there were no true positives or false negatives for a treatment
# instantiate empty def
missings <- samples[0,]

for(size in c("small","medium","large","xl")){
    for(depth in levels(samples$depth)){
        for(zyg in c("homozygote","heterozygote")){
            for(tech in c("linkedread", "ontshort", "ontlong", "pacbio")){
                .df <- samples[samples$size == size & samples$depth == depth & samples$zygosity == zyg & samples$technology == tech,]
                if(!("false negative" %in% .df$assessment)){
                    missings[nrow(missings)+1,] <- list(size, depth, zyg,"false negative", tech, 0)
                }
                if(!("true positive" %in% .df$assessment)){
                    missings[nrow(missings)+1,] <- list(size, depth, zyg,"true positive", tech, 0)
                }
            }
        }
    }
}

samples <- rbind(samples,missings) %>% arrange(size, depth, zygosity, technology)
head(samples)
Loading...

One last thing we’ll need to do is convert counts to proportions

samples <- group_by(samples, size, depth, zygosity, technology) %>% mutate(proportion = round(count/sum(count),2))
head(samples)
Loading...

Let’s also set a color palette for the technology comparisons using the Arches theme and the Redwoods theme for the depths.

tech_colors <- c(
    "ontshort" = "#9b6981",
    "ontlong" = "#682c37",
    "pacbio" = "#f6955e" ,
    "linkedread" = "#a8cdec"
)

size_colors <- c(
    "#c0c6cf",
    "#a2aab5",
    "#6e7684",
    "#54546c"
)

zygosity_colors <- c(
    "#85a98eff",
    "#7fa9b4ff"
)

Visualization

We are now interested in seeing how these performed. There are several dimensions to consider here:

  • depth

  • inversion size

  • zygotic state

  • sequencing technology

Detection between technologies across inversion classes

We’ll approach this a few different ways to try to reveal meaningful information. First let’s compare how each of the technologies did within a given inversion size class.

Source
options(repr.plot.width = 12, repr.plot.height = 12)
.strip <- strip_themed(background_x = elem_list_rect(fill = zygosity_colors), background_y = elem_list_rect(fill = size_colors))

filter(samples, assessment == "true positive") %>%
ggplot(aes(x = depth, y = proportion, color = technology, group = technology, shape = technology)) +
    geom_line() +
    geom_point(size = 4) +
    scale_color_manual(values = tech_colors) +
    scale_shape_manual(values = c(19,15,18,17)) +
    theme_light() +
    labs(title = "Inversion Detection by Zygotic State and Variant Size", color = "Technology", shape = "Technology") +
    ylab("Proportion of Inversions Correctly Detected") +
    xlab("Sequencing Depth") +
    theme(panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank()) +
    facet_grid2(cols = vars(zygosity), rows = vars(size), strip = .strip)
plot without title

Detection between inversion class within technologies

This is effectively the same thing, but it’s a more 1:1 comparison across size classes within a given technology.

Source
options(repr.plot.width = 12, repr.plot.height = 12)
.strip <- strip_themed(background_x = elem_list_rect(fill = zygosity_colors), background_y = elem_list_rect(fill = tech_colors[c(4,1,2,3)]))

filter(samples, assessment == "true positive") %>%
ggplot(aes(x = depth, y = proportion, color = size, group = size, shape = size)) +
    geom_line() +
    geom_point(size = 4) +
    scale_shape_manual(values = c(19,15,18,17)) +
    scale_color_manual(values = size_colors) +
    theme_light() +
    labs(title = "Inversion Detection as a Product of Variant Size", color = "Inversion Size", shape = "Inversion Size") +
    ylab("Proportion of Inversions Correctly Detected") +
    xlab("Sequencing Depth") +
    theme(panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank()) +
    facet_grid2(cols = vars(zygosity), rows = vars(technology), strip = .strip)
plot without title

Holistic comparison of technologies

What does it look like when we compare the performance of hom/het within and across technologies for each inversion class?

Source
options(repr.plot.width = 15, repr.plot.height = 5)
.strip <- strip_themed(background_x = elem_list_rect(fill = size_colors))

filter(samples, assessment == "true positive") %>%
ggplot(aes(x = depth, y = proportion, color = technology, group = paste(technology, zygosity), shape = zygosity)) +
    geom_line() +
    geom_point(aes(shape = zygosity), size = 4) +
    theme_light() +
    scale_color_manual(values = tech_colors) +
    scale_shape_manual(values = c("homozygote"= 19,"heterozygote" = 18)) +
    labs(title = "Inversion Detection as a Product of Zygotic State", color = "Technology", shape = "Zygotic State") +
    ylab("Proportion of Inversions Correctly Detected") +
    xlab("Sequencing Depth") +
    theme(panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank()) +
    facet_grid2(cols = vars(size), strip = .strip)
plot without title

Pooled Data

Now let’s look at the analogous characteristics of the pooled data. It’s very important to note here, the comparison is between the pooled linked-read data and the single-sample long-read data. That is to say, we are interested in seeing if sequencing these 10 samples with each of these technologies yields different results. Above, we show that when structural variants are called for individual samples, linked reads tend to underperform long-reads at each depth. However, if we pool the 10 linked-read samples together, we effectively boost the sequencing depth x10.

Prepare the Data

We need to filter for pooled samples only. The contig names don’t matter so much here as much as “common” or “rare” does, so we need to rename the contigs and sum their values. For plotting purporses, we’ll also want to get a count of the number of inversions for each category so there’s an indication of sample size for each treatment.

pooled <- assess[assess$sample == 11 & assess$simulated | assess$technology != "linkedread",]
pooled$treatment <- case_when(
    pooled$contig == "2L" ~ "Common",
    pooled$contig == "2R" ~ "Common",
    pooled$contig == "3L" ~ "Common",
    pooled$contig == "3R" ~ "Rare"
)

pooled <- group_by(pooled, treatment, size, depth, assessment, technology) %>%
    summarize(count = n()) %>%
    group_by(treatment, size, depth, technology) %>%
    mutate(total = sum(count)) %>%
    ungroup() %>%
    arrange(size, depth, technology)

head(pooled)
Loading...

Again, some of the treatments are missing false negative or true positive rows, so we need to add them with a count of 0.

Source
# add missing rows with value 0 in the event there were no true positives or false negatives for a treatment
# instantiate empty def
missings <- pooled[0,]

for(.size in c("small","medium","large","xl")){
    for(.depth in levels(pooled$depth)){
        for(.treat in c("Common", "Rare")){
            for(.tech in levels(pooled$technology)){
                .df <- filter(pooled, size == .size,  depth == .depth, treatment == .treat, technology == .tech)
                .tot <- sum(.df$total)
                if(!("false negative" %in% .df$assessment)){
                    missings[nrow(missings)+1,] <- list(.treat, .size, .depth, "false negative", .tech, 0, .tot)
                }
                if(!("true positive" %in% .df$assessment)){
                    missings[nrow(missings)+1,] <- list(.treat, .size, .depth, "true positive", .tech, 0, .tot)
                }
            }
        }
    }
}

pooled <- rbind(pooled,missings) %>% arrange(size, depth, technology, treatment)
head(pooled)
Loading...

Lastly, convert the counts to proportions. Let’s also rename the depths to properly reflect that with aggregates, the depth is actually 10X higher. Seems silly to have to do as.numeric(as.character()), but if you don’t convert it to a Character first, 0.5 gets misinterpereted as 1.

pooled <- group_by(pooled, treatment, size, depth, technology) %>% mutate(proportion = round(count/sum(count),2))
#agg_depths <- paste0(as.numeric(as.character(pooled$depth)) * 10, " (", pooled$depth, ")")
#pooled$agg_depth <- factor(agg_depths, levels = unique(agg_depths))
head(pooled)
Loading...

It would behoove us to show a more informative X-axis that has both the single-sample sequencing depth and the pooled aggregate depth, so let’s set that up here.

depth_labels <- c("0.5X (5X)", "2X (10X)", "5X (50X)", "10X (100X)", "20X (200X)")
freq_colors <- c(
    "#45865fff",
    "#8ebba1ff"
)

Visualization

Detection by frequency

Source
options(repr.plot.width = 12, repr.plot.height = 12)
.strip <- strip_themed(background_x = elem_list_rect(fill = freq_colors), background_y = elem_list_rect(fill = size_colors))

filter(pooled, assessment == "true positive") %>%
ggplot(aes(x = depth, y = proportion, color = technology, group = technology, shape = technology)) +
    geom_line() +
    geom_point(size = 4) +
    scale_shape_manual(values = c(19,15,18,17)) +
    scale_x_discrete(labels = depth_labels) +
    scale_color_manual(values = tech_colors) +
    theme_light() +
    labs(title = "Inversion Detection of Pooled Linked-Read Data vs Single-Sample Long-Read Data", color = "Technology", shape = "Technology") +
    ylab("Proportion of Inversions Correctly Detected") +
    xlab("Sequencing Depth (Aggregate Depth)") +
    theme(
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank()
    ) +
    facet_grid2(cols = vars(treatment), rows = vars(size), strip = .strip)
plot without title

True Positives Across Inversion Size Classes

Consolidate views of just true positives across treatments

Source
options(repr.plot.width = 12, repr.plot.height = 12)
.strip <- strip_themed(background_x = elem_list_rect(fill = freq_colors), background_y = elem_list_rect(fill = tech_colors[c(4,1,2,3)]))

filter(pooled, assessment == "true positive") %>%
ggplot(aes(x = depth, y = proportion, color = size, group = size, shape = size)) +
    geom_line() +
    geom_point(size = 4) +
    scale_color_manual(values = size_colors) +
    scale_shape_manual(values = c(19,15,18,17)) +
    scale_x_discrete(labels = depth_labels) +
    theme_light() +
    labs(title = "Inversion Detection of Pooled Linked-Read Data vs Single-Sample Long-Read Data", color = "Inversion Size", shape = "Inversion Size") +
    ylab("Proportion of Inversions Correctly Detected") +
    xlab("Sequencing Depth (Aggregate Depth)") +
    theme(panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank()) +
    facet_grid2(cols = vars(treatment), rows = vars(technology), strip = .strip)
plot without title

Split by contigs

The plot below is the same as the one above, but separated out by inversion size class

Source
.strip <- strip_themed(background_x = elem_list_rect(fill = size_colors), background_y = elem_list_rect(fill = tech_colors[c(4,1,2,3)]))

filter(pooled, assessment == "true positive") %>%
ggplot(aes(x = depth, y = proportion, color = treatment, shape = treatment, group = treatment)) +
    geom_line() +
    geom_point(size = 4) +
    theme_light() +
    scale_color_manual(values = freq_colors) +
    labs(title = "Inversion Detection of Pooled Linked-Read Data vs Single-Sample Long-Read Data", color = "Inversion Frequency", shape = "Inversion Frequency", size = "Total Inversions") +
    ylab("Proportion of Inversions Detected") +
    xlab("Sequencing Depth (Aggregate Depth)") +
    theme(panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank()) +
    facet_grid2(cols = vars(size), rows = vars(technology), strip = .strip)
plot without title

Scaled Sequencing Depth

The comparison of pooled linked-read data to single-sample long-read data isn’t exactly 1:1 unless we are discussing sequencing effort. It would be useful to scale the above plots so that the pooled sample’s aggregate depth aligns with the single-sample depth. That is, if the aggregate depth is 5X, compare it to the 5X single-sample data.

pooled$effective_depth <- as.numeric(as.character(pooled$depth))
pooled$effective_depth[pooled$technology == "linkedread"] <- pooled$effective_depth[pooled$technology == "linkedread"] * 10
pooled$effective_depth <- as.factor(pooled$effective_depth)
head(pooled)
Loading...
Source
.strip <- strip_themed(background_x = elem_list_rect(fill = freq_colors), background_y = elem_list_rect(fill = size_colors))

filter(pooled, assessment == "true positive") %>%
ggplot(aes(x = effective_depth, y = proportion, color = technology, group = technology, shape = technology)) +
    geom_line() +
    geom_point(size = 4) +
    scale_shape_manual(values = c(19,15,18,17)) +
    scale_color_manual(values = tech_colors) +
    theme_light() +
    labs(title = "Pooled Linked-Read vs Single-Sample Long-Read Inversion Detection", subtitle = "Scaled by effective sequencing depth", color = "Technology", shape = "Technology") +
    ylab("Proportion of Inversions Correctly Detected") +
    xlab("Effective Sequencing Depth") +
    theme(
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank()
    ) +
    facet_grid2(cols = vars(treatment), rows = vars(size), strip = .strip)
plot without title

Holistic comparison between technologies

This panel shows the overall performance of the technologies, not scaling the sequencing depths like shown above.

Source
options(repr.plot.width = 15, repr.plot.height = 5)
.strip <- strip_themed(background_x = elem_list_rect(fill = size_colors))

filter(pooled, assessment == "true positive") %>%
ggplot(aes(x = depth, y = proportion, color = technology, group = paste(technology, treatment), shape = treatment)) +
    geom_line() +
    geom_point(aes(shape = treatment), size = 4) +
    theme_light() +
    scale_color_manual(values = tech_colors) +
    scale_shape_manual(values = c("Common"= 19,"Rare" = 18)) +
    labs(title = "Inversion Detection by Frequency in a Population", subtitle = "With linked-read variants called on the sample pool", color = "Technology", shape = "Occurance") +
    ylab("Proportion of Inversions Correctly Detected") +
    xlab("Single-Sample Sequencing Depth") +
    theme(panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank()) +
    facet_grid2(cols = vars(size), strip = .strip)
plot without title