UGENE Forum - Merge Forward and Reverse Reads

Oct 15th, 2020 at 11:00pm

Jonas Offline
YaBB Newbies

Posts: 4

Dear all,

First of all, thanks a lot for this nice software. I have been looking for a good, free solution for editing and assembling Sanger reads for a long time and may finally have found it.

I have the following common situation: I have forward and reverse Sanger chromatogram reads (ab1) for the same sequence. I want to align the forward and reverse sequences and check for incorrect base calls etc. while looking at both chromatograms at once. After finishing the editing, I want to merge the reads to end up with one sequence per sample. I need to do this efficiently for many samples.

My best attempt so far is to map all ab1 chromatograms to a reference sequence, which nicely determines the forward and reverse sequences and shows all chromatograms at once. Perfect so far. From that point on, however, I failed spectacularly to produce a single FASTA file that contains one merged sequence per sample.

(1) I managed to export a consensus sequence for the whole reference mapping alignment, which is not what I need. (Export consensus)

(2) I produced a FASTA file that contains all sequences, i.e. all forward and reverse sequences. (Export alignment without chromatograms)

(3) I spent several hours trying existing workflows and creating my own using "read sequence", "mark sequence by name", "group", "write fasta/alignment" etc., but I can't figure out how to do it.

My ab1 files are named in the following format: <ID>_<primer>.ab1 (e.g. 1234_ar_f.ab1). In Geneious, there is an option like "assemble sequences by 1st part of the name separated by '_' (underscore)", or similar.

Is there any way to solve this issue, or are there even workflows around that can do such a thing based on regular expressions on the file name or something similar?
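
To illustrate the kind of grouping meant here, a minimal Python sketch (not a UGENE feature) that groups file names by the part before the first underscore. The function name `group_reads_by_sample` and all file names except 1234_ar_f.ab1 are made up for the example.

```python
from collections import defaultdict
from pathlib import Path


def group_reads_by_sample(filenames, separator="_"):
    """Group chromatogram file names by the text before the first separator."""
    groups = defaultdict(list)
    for name in filenames:
        # "1234_ar_f.ab1" -> stem "1234_ar_f" -> sample ID "1234"
        sample_id = Path(name).stem.split(separator, 1)[0]
        groups[sample_id].append(name)
    return {sample_id: sorted(names) for sample_id, names in groups.items()}


# Hypothetical file names following the <ID>_<primer>.ab1 pattern.
files = ["1234_ar_f.ab1", "1234_ar_r.ab1", "5678_ar_f.ab1", "5678_ar_r.ab1"]
groups = group_reads_by_sample(files)
for sample_id, names in sorted(groups.items()):
    print(sample_id, names)
```

Each group then holds the forward and reverse read of one sample, which is the unit that needs to be assembled together.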

I want to use UGENE in a course with students, soon. Any help is greatly appreciated!

Best,
Jonas

Reply #1 - Oct 16th, 2020 at 2:22pm

Dmitrii Sukhomlinov Offline
YaBB Administrator
Russia

Posts: 78

Hello,
I'm not sure how you want to merge the reads. Do you need one multi-FASTA file, or do you want to join the reads end to end into one long FASTA sequence? In the first case, use "Export alignment without chromatograms...", as you have already figured out. In the dialog that appears, choose "Fasta" as the format and uncheck "Include reference sequence". Now you have the multi-FASTA file.
It may also be necessary to remove the first sequence (the consensus) from the resulting multi-FASTA file. If you want to join all these sequences into one long sequence, follow these steps:
1. Create the multi-FASTA file as described above;
2. Save and close this file;
3. Click "File -> Open as...", choose the saved file, choose "FASTA" and click OK;
4. In the dialog that appears, choose "Merge sequences into a single sequence ..." and set "Number of bases" to 0.
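
For reference, the effect of joining all records of a multi-FASTA file with 0 bases in between can be sketched in a few lines of Python. This is only an illustration of the result, not how UGENE does it internally; the helper names and the `gap_symbol` default are assumptions made for the example.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def merge_records(records, gap_bases=0, gap_symbol="N"):
    """Join all sequences end to end, with gap_bases spacer symbols between them.

    gap_symbol is an assumption for this sketch; with gap_bases=0 it is unused.
    """
    spacer = gap_symbol * gap_bases
    return spacer.join(sequence for _, sequence in records)


example = ">read1\nACGT\n>read2\nTTGA\n"
merged = merge_records(read_fasta(example))
print(merged)  # ACGTTTGA
```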
Is that what you wanted or not?

Best regards,
Dmitrii Sukhomlinov,
The UGENE team.

Reply #2 - Oct 16th, 2020 at 3:44pm

Jonas

Dear Dmitrii,

Thanks a lot for the fast response! I'll try to clarify:

I need a multi-FASTA file (multiple sequences in one file) with one consensus sequence per sample, built from that sample's forward and reverse reads.

I have forward and reverse reads of the same gene from the same sample, for many samples. The sample can be identified by an ID in the sequence name (<ID>_<primer>.ab1). Typically, each end of the amplified region is covered in good quality by only one of the two reads, either the forward or the reverse. This results in two nearly identical sequences per sample, except that the beginning and the end are each represented by only one of them. I attached a screenshot of this situation (see the overview at the bottom). In the example, the sample IDs would be 02, 04, and 05. The samples come from different individuals/species. The resulting FASTA file should consequently contain 3 sequences (>02, >04, and >05), each covering the whole range of forward + reverse read. Ideally, ambiguities would be translated into IUPAC codes (N, R, Y, S, ...).

Thanks for your help again!

Best,
Jonas

Attachment: UGENE_screenshot.png (215 KB)

Reply #3 - Oct 19th, 2020 at 2:16pm

Dmitrii Sukhomlinov

Let me get this right: do you need to extract each read together with the corresponding part of the consensus sequence? I'm still not sure what you mean.

Best regards,
Dmitrii Sukhomlinov,
The UGENE team.

Reply #4 - Oct 19th, 2020 at 4:45pm

Jonas

Sorry for the confusion.

I guess a small worked example would be best to explain:

After mapping my reads to a reference, I can export a fasta file that contains all reads as separate sequences (excluding the reference):

exported_fasta.fa
=~=~=~=~=~=~=~=~=~=~=~=~=~=
>individual1_gene1_Primer1_fwd
AAGGCTCTGGCTAGCTTGAAACC-----
>individual1_gene1_Primer2_rev
-----TCTGGCTAGCTTGAAACCGCTSG
>individual2_gene1_Primer1_fwd
AAGGATCTGGCTAGCTTGAAACC-----
>individual2_gene1_Primer2_rev
-----CCTGGCTAGCTTGAAACCGCTSG
=~=~=~=~=~=~=~=~=~=~=~=~=~=

What I need is a FASTA file that contains the iteratively built consensus sequences of the forward and reverse reads per individual/sample, which would be obtained by the following steps:
1. filter fasta for a single individual
2. build consensus sequence of forward and reverse read(s)
3. repeat for all individuals
4. write fasta that only contains the consensus sequences.

The result would look like this:
final_fasta.fa
=~=~=~=~=~=~=~=~=~=~=~=~=~=
>individual1_gene1
AAGGCTCTGGCTAGCTTGAAACCGCTSG
>individual2_gene1
AAGGAYCTGGCTAGCTTGAAACCGCTSG
=~=~=~=~=~=~=~=~=~=~=~=~=~=

Note that site number 6 in the sequence of individual2 has the IUPAC degeneracy 'Y'.
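
The four steps above can be sketched as a small Python script that works on the exported, gapped FASTA. This is a rough sketch under stated assumptions, not a UGENE function: the sample ID is taken to be the first two underscore-separated fields of the read name (use `fields=1` for names like <ID>_<primer>), every disagreement between reads becomes an IUPAC ambiguity code without looking at base quality, and columns that are gaps in all reads are dropped. All function names are made up for the example.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR_BASES = {frozenset(bases): code for code, bases in IUPAC.items()}


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def sample_id(read_name, fields=2, separator="_"):
    """Assumed convention: the first `fields` name parts identify the sample."""
    return separator.join(read_name.split(separator)[:fields])


def consensus_symbol(column):
    """Union of all bases in one alignment column, encoded as an IUPAC code."""
    bases = set()
    for symbol in column:
        if symbol != "-":  # gaps do not contribute to the consensus
            bases.update(IUPAC[symbol.upper()])
    return CODE_FOR_BASES[frozenset(bases)] if bases else ""


def consensus_per_sample(records):
    """Steps 1-3: group aligned reads by sample and build one consensus each."""
    groups = {}
    for name, sequence in records:
        groups.setdefault(sample_id(name), []).append(sequence)
    consensus = {}
    for sid, sequences in groups.items():
        if len({len(s) for s in sequences}) != 1:
            raise ValueError(f"reads of {sid} are not aligned to equal length")
        consensus[sid] = "".join(consensus_symbol(col) for col in zip(*sequences))
    return consensus


exported = """\
>individual1_gene1_Primer1_fwd
AAGGCTCTGGCTAGCTTGAAACC-----
>individual1_gene1_Primer2_rev
-----TCTGGCTAGCTTGAAACCGCTSG
>individual2_gene1_Primer1_fwd
AAGGATCTGGCTAGCTTGAAACC-----
>individual2_gene1_Primer2_rev
-----CCTGGCTAGCTTGAAACCGCTSG
"""

# Step 4: write only the consensus sequences, one FASTA record per sample.
result = consensus_per_sample(read_fasta(exported))
for sid, sequence in result.items():
    print(f">{sid}\n{sequence}")
```

With the example input this prints the two records of final_fasta.fa shown above, including the 'Y' at site 6 of individual2.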

It would be even better if the final FASTA file could be saved directly, without the intermediate step of the reference-mapping FASTA.

I hope that makes things clearer. This should be a very common need for anyone who sequences forward and reverse reads for many specimens, and therefore a desirable feature.

Meanwhile, I found an R-package that can do the trick: sangeranalyseR
They have a manual page for the issue on sangeranalyser[dot]readthedocs[dot]io (Advanced User Guide - SangerAlignment)

However, since I plan to use UGENE with students in a course, I try to avoid using too many different software packages, in particular for presumably easy tasks like this one.

Best,
Jonas

Reply #5 - Oct 20th, 2020 at 4:11pm

Dmitrii Sukhomlinov

OK, I get the idea. First, if you need a consensus sequence built from just the forward and reverse sequences, set only these 2 sequences as reads when you configure the alignment. After the alignment has finished, use the "Export consensus" button to extract the corresponding consensus (turn off the "Keep gaps" parameter).

As far as I understand, you want to perform all these steps automatically, without clicking through them every time. This is partly possible with the Workflow Designer. Open it ("Run or create workflow") and find the "Trim and map Sanger reads" workflow in the panel on the left. Close the wizard and click on the first "Read Sequence" element. Here you can create several datasets (the green + button on the right) and assign one forward-reverse read pair to each dataset. Unfortunately, you still have to open the results one by one to extract each consensus. An "Extract Consensus from Sanger Alignment" element, which will let you extract all consensus sequences as part of the workflow, is under development; it is not available yet and will possibly be released in version 37.

Best regards,
Dmitrii Sukhomlinov,
The UGENE team.

Reply #6 - Oct 20th, 2020 at 8:05pm

Jonas

OK, I see. In that case, I will process the reads sample by sample with the students. For production work, I'll probably map all reads against a reference, handle the FASTA file from "Export alignment without chromatograms" with custom scripts, and look forward to version 37.

Thank you very much for your help!

Reply #7 - Nov 16th, 2020 at 10:09pm