Demultiplexing with custom barcoded primers
Contributors (wetlab and computational): Michelli Faria Oliveira, Thomas Vollbrecht, Nikesh Kumar, and Ben Murrell
This notebook shows how to demultiplex PacBio reads using NextGenSeqUtils.jl, where RAD is run on each demultiplexed population.
5bp barcodes on the forward and reverse primers were used to barcode 25 PCR reactions. DNA was extracted from both NL43-infected cells and from the 8E5 cell line, and mixed to provide the starting templates for these PCR reactions, so each well could potentially contain up to two variants.
The objective of this experiment was to discover whether, from these PCR conditions, we could recover SGA-like sequences from wells where >1 variant is present, or if PCR recombination would render this impossible.
Loading packages, and data filtering.
using PyPlot, NextGenSeqUtils function sizes_to_names(sizes; prefix = "") return ["$(prefix)seq_$(i)_$(sizes[i])" for i in 1:length(sizes)] end
length_vs_qual(MultiplexedMixtures.fastq,alpha = 0.01)
fastq_filter(MultiplexedMixtures.fastq, "MultiplexedMixtures_filtered.fastq", error_rate = 0.01, min_length = 2100, max_length = 2800); seqs,phreds,seqnames = read_fastq("MultiplexedMixtures_filtered.fastq");
Demultiplexing samples.
fwd_primers = [ "GCGCGGAGCAGAAGACAGTG", "CTAGCGAGCAGAAGACAGTG", "TGCATGAGCAGAAGACAGTG", "ATCTGGAGCAGAAGACAGTG", "CGATGGAGCAGAAGACAGTG", "GCTGAGAGCAGAAGACAGTG", "CGTCAGAGCAGAAGACAGTG", "GATATGAGCAGAAGACAGTG", "TCGACGAGCAGAAGACAGTG", "AGAGTGAGCAGAAGACAGTG"] rev_primers = [ "TCAGTCCACTTGCCACCCAT", "TATGCCCACTTGCCACCCAT", "ATACGCCACTTGCCACCCAT", "CGTAGCCACTTGCCACCCAT", "CTGCTCCACTTGCCACCCAT", "GAGTCCCACTTGCCACCCAT", "GCACACCACTTGCCACCCAT", "GCTATCCACTTGCCACCCAT", "GTCGACCACTTGCCACCCAT", "TAGAGCCACTTGCCACCCAT"] demux_dic = demux_dict(seqs,fwd_primers,rev_primers, phreds = phreds,verbose = false);
Primer trimming, and running FAD on each sample.
seq_collection = String[] name_collection = String[] size_collection = Float64[] #Looping through each sample in the demux dictionary. for k in keys(demux_dic) if length(demux_dic[k]) > 10 #Ignore samples with <=10 reads. #Use the first 5bp of each primer as the barcode ID f_bc = fwd_primers[k[1]][1:5] r_bc = rev_primers[k[2]][1:5] print("F:", f_bc," R:",r_bc) println(" : ",length(demux_dic[k]), " reads") #Trimming the full primers from the reads. #Note that these need not be the same length as the segments used to demux. trimmed = [ double_primer_trim(j[1],j[2], "NNNNNGAGCAGAAGACAGTGGCAATGA","NNNNNCCACTTGCCACCCATBTTATAGCA") for j in demux_dic[k] ] reads,phredarr = [s[1] for s in trimmed],[s[2] for s in trimmed] #Running FAD to identify the variants in each sample. variants,sizes = FAD(reads,phreds = phredarr); variant_names = sizes_to_names(sizes,prefix = f_bc*"_"*r_bc*"_") println("FAD identifies $(length(variant_names)) variant(s):") for vn in variant_names println(vn) end size_filt = sizes .> 10 #Collecting the FAD variants from each sample. append!(seq_collection,variants[size_filt]) append!(name_collection,variant_names[size_filt]) append!(size_collection,sizes[size_filt]) end end write_fasta("LP157_demuxed_FAD.fasta", seq_collection,names = name_collection)
Recovered variants.
Paste these into an alignment viewer for further analysis.
print_fasta(seq_collection,name_collection) println()
Supplementary analysis
#The NL43 reference sequence. NL43 = "GAGTGAAGGAGAAGTATCAGCACTTGTGGAGATGGGGGTGGAAATGGGGCACCATGCTCCTTGGGATATTGATGATCTGTAGTGCTACAGAAAAATTGTGGGTCACAGTCTATTATGGGGTACCTGTGTGGAAGGAAGCAACCACCACTCTATTTTGTGCATCAGATGCTAAAGCATATGATACAGAGGTACATAATGTTTGGGCCACACATGCCTGTGTACCCACAGACCCCAACCCACAAGAAGTAGTATTGGTAAATGTGACAGAAAATTTTAACATGTGGAAAAATGACATGGTAGAACAGATGCATGAGGATATAATCAGTTTATGGGATCAAAGCCTAAAGCCATGTGTAAAATTAACCCCACTCTGTGTTAGTTTAAAGTGCACTGATTTGAAGAATGATACTAATACCAATAGTAGTAGCGGGAGAATGATAATGGAGAAAGGAGAGATAAAAAACTGCTCTTTCAATATCAGCACAAGCATAAGAGATAAGGTGCAGAAAGAATATGCATTCTTTTATAAACTTGATATAGTACCAATAGATAATACCAGCTATAGGTTGATAAGTTGTAACACCTCAGTCATTACACAGGCCTGTCCAAAGGTATCCTTTGAGCCAATTCCCATACATTATTGTGCCCCGGCTGGTTTTGCGATTCTAAAATGTAATAATAAGACGTTCAATGGAACAGGACCATGTACAAATGTCAGCACAGTACAATGTACACATGGAATCAGGCCAGTAGTATCAACTCAACTGCTGTTAAATGGCAGTCTAGCAGAAGAAGATGTAGTAATTAGATCTGCCAATTTCACAGACAATGCTAAAACCATAATAGTACAGCTGAACACATCTGTAGAAATTAATTGTACAAGACCCAACAACAATACAAGAAAAAGTATCCGTATCCAGAGGGGACCAGGGAGAGCATTTGTTACAATAGGAAAAATAGGAAATATGAGACAAGCACATTGTAACATTAGTAGAGCAAAATGGAATGCCACTTTAAAACAGATAGCTAGCAAATTAAGAGAACAATTTGGAAATAATAAAACAATAATCTTTAAGCAATCCTCAGGAGGGGACCCAGAAATTGTAACGCACAGTTTTAATTGTGGAGGGGAATTTTTCTACTGTAATTCAACACAACTGTTTAATAGTACTTGGTTTAATAGTACTTGGAGTACTGAAGGGTCAAATAACACTGAAGGAAGTGACACAATCACACTCCCATGCAGAATAAAACAATTTATAAACATGTGGCAGGAAGTAGGAAAAGCAATGTATGCCCCTCCCATCAGTGGACAAATTAGATGTTCATCAAATATTACTGGGCTGCTATTAACAAGAGATGGTGGTAATAACAACAATGGGTCCGAGATCTTCAGACCTGGAGGAGGCGATATGAGGGACAATTGGAGAAGTGAATTATATAAATATAAAGTAGTAAAAATTGAACCATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTGGTGCAGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGTTCCTTGGGTTCTTGGGAGCAGCAGGAAGCACTATGGGCGCAGCGTCAATGACGCTGACGGTACAGGCCAGACAATTATTGTCTGATATAGTGCAGCAGCAGAACAATTTGCTGAGGGCTATTGAGGCGCAACAGCATCTGTTGCAACTCACAGTCTGGGGCATCAAACAGCTCCAGGCAAGAATCCTGGCTGTGGAAAGATACCTAAAGGATCAACAGCTCCTGGGGATTTGGGGTTGCTCTGGAAAACTCATTTGCACCACTGCTGTGCCTTGGAATGCTAGTTGGAGTAATAAATCTCTGGAACAGATTTGGAATAACATGACCTGGATGGAGTGGGACAGAGAAATTAACAATTACACAAGCTTAATACACTCCTTAATTGAAGAATCGCAAAACCAGCAAGAAAAGAATGAACAAGAATTATTGGAATTAGATAAATGGGCAAGTTTGTGGAATTGGTTTAACATAACAAATTGGCTGTGGTATATAAAATTATTCATAATGATAGTAGGAGGCTTGGTAGGTTTAAGAATAGTTTTTGCTGTACTTTCTATAGTGAATAGAGTTAGGCAGGGATATTCACCATTATCGTTTCAGACCCACCTCCCAATCCCGAGGGGACCCGACAGGCCCGAAGGAATAGAAGAAGAAGGTGGAGAGAGAGACAGAGACAGATCCATTCGATTAGTGAACGGATCCTTAGCACTTATCTGGGACGATCTGCGGAGCCTGTGCCTCTTCAGCTACCACCGCTTGAGAGACTTACTCTTGATTGTAACGAGGATTGTGGAACTTCTGGGACGCAGGGGGTGGGAAGCCCTCAAATATTGGTGGAATCTCCTACAGTATTGGAGTCAGGAACTAAAGAATAGTGCTGTTAACTTGCTCAATGCCACAGCCATAGCAGTAGCTGAGGGGACAGATAGGGTTATAGAAGTATTACAAGCAGCTTATAGAGCTATTCGCCACATACCTAGAAGAATAAGACAGGGCTTGGAAAGGATTT";
Compute distances from each read to NL43 reference sequence.
d = [kmer_seeded_edit_dist(seq,NL43) for seq in seq_collection]; coloring = ["red","green","blue"][(d.<=2) .+ (d.<=47) .+ 1] fig = figure(figsize=(10, 4)) for i in 2:length(d) if split(name_collection[i],"_")[[1,2]] != split(name_collection[i-1],"_")[[1,2]] plot([i-0.5,i-0.5],[-5,maximum(d)+5],color="grey",alpha = 0.2) end end scatter(1:length(d),d,s = size_collection./5,c=coloring,alpha = 0.75) #Plotting vertical bars between different wells xlabel("Sequence index") ylabel("Distance from NL43") fig
Variants are either very close to NL43, or ~48bp from NL43 (which corresponds to the 8E5 sequence, which is an NL43 relative). One anomalous variant (number 30 above) is 8bp from NL43. Let us investigate how this differs from NL43:
ali_ref,ali_seq = kmer_seeded_align(NL43,seq_collection[29]); print_diffs(ali_ref,ali_seq)
7 out of the 8 differences are G->A mutations, suggesting that APOBEC activity, and not PCR recombination, is the cause of this difference. Examining the alignment confirms that none of these mutations occur in 8E5, so this variant cannot be caused by PCR recombination.