UGENE Forum - List of primers to features

Jul 17th, 2012 at 2:14am

hatta0 Offline
YaBB Newbies

Posts: 6

I have a FASTA file with a few dozen primers in it. I also have a .gb file representing a plasmid. I want to use UGENE to search the plasmid for each of the primers and, if a match is found, create a feature at the appropriate place.

I gather Workflow Designer is how you do batch jobs like this in UGENE.

What I don't understand is why the Smith-Waterman tool in WD only has one input. Shouldn't it have two inputs? One for the query, and one for the subject?

Also, Write Annotations appears to require two inputs, but only has space for one connector. When I hover over the red 'unset' in Write Annotations, I don't get a clickable mouse icon.

I've attached a screenshot of what my Workflow Design looks like. What is wrong with this?

Also, it looks like the sequence read tool in WD only takes files as input. UGENE has this nice little object browser. Can't I just load all my sequences into a project, and then select the sequences from there instead of going back to the filesystem every time? If not, what is the point of the object browser?

UGENE-WD-SW.png (78 KB)


Reply #1 - Jul 17th, 2012 at 2:34am

hatta0 Offline
YaBB Newbies

Posts: 6

Apologies for the large image. I didn't know it would default to displaying the whole thing inline.

Also, you can't see my cursor in this screenshot, but it's hovering over the red text, and is not a clickable hand. Other links in other tools are clickable.


Reply #2 - Jul 17th, 2012 at 3:30am

Konstantin Okonechnikov Offline
Global Moderator

Posts: 173

Hi!

Quote:

I have a FASTA file with a few dozen primers in it. I also have a .gb file representing a plasmid. I want to use UGENE to search the plasmid for each of the primers and, if a match is found, create a feature at the appropriate place.

I gather Workflow Designer is how you do batch jobs like this in UGENE.

Your workflow schema looks OK!

Quote:

What I don't understand is why the Smith-Waterman tool in WD only has one input. Shouldn't it have two inputs? One for the query, and one for the subject?

One important concept of the Workflow Designer is that it works like a single data bus. Once read, the input data goes through every element of the workflow. In this design, adding a second input to Smith-Waterman for the query would require an additional filter after the element to keep the queries from being pushed further through the data flow (users are usually not interested in them). That's why the queries are defined as a parameter.
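
To make the data-bus idea concrete, here is a minimal conceptual sketch in Python. It is not UGENE code: the `Message` dict, the element functions, and the slot names are all hypothetical, chosen only to show how a query supplied as a parameter stays out of the data flow while everything that was read is carried forward to the next element.

```python
# Conceptual model only; none of these names come from UGENE.
from typing import Callable, Dict, List

Message = Dict[str, object]            # one item travelling along the "bus"
Element = Callable[[Message], Message]


def read_sequence(name: str, sequence: str) -> Message:
    """Source element: puts the sequence on the bus."""
    return {"name": name, "sequence": sequence}


def make_search(query: str) -> Element:
    """Build a search element; the query is a parameter, not a message."""
    def search(msg: Message) -> Message:
        seq = str(msg["sequence"])
        hits: List[int] = []
        start = seq.find(query)
        while start != -1:
            hits.append(start)
            start = seq.find(query, start + 1)
        # Earlier slots are kept; the element only adds a new one.
        return {**msg, "annotations": hits}
    return search


def run(msg: Message, elements: List[Element]) -> Message:
    """Push one message through every element in order."""
    for element in elements:
        msg = element(msg)
    return msg


result = run(read_sequence("plasmid", "ACGTACGT"), [make_search("CGT")])
print(sorted(result))  # ['annotations', 'name', 'sequence']
```

In this toy model the query never appears in `result`, so nothing downstream has to filter it out, which is the trade-off described above.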

Quote:

Also, Write Annotations appears to require two inputs, but only has space for one connector. When I hover over the red 'unset' in Write Annotations, I don't get a clickable mouse icon.
I've attached a screenshot of what my Workflow Design looks like. What is wrong with this?

You have to specify the input data port, i.e. the source of the annotations. See the attached screenshot.

Quote:

Also, it looks like the sequence read tool in WD only takes files as input. UGENE has this nice little object browser. Can't I just load all my sequences into a project, and then select the sequences from there instead of going back to the filesystem every time? If not, what is the point of the object browser?

That is a good suggestion. I believe it is already planned as a feature for one of the upcoming releases.

annotations_source.gif (20 KB)


Reply #3 - Jul 17th, 2012 at 4:33am

hatta0 Offline
YaBB Newbies

Posts: 6

Quote:

One important concept of the Workflow Designer is that it works like a single data bus. Once read, the input data goes through every element of the workflow.

So when I execute my workflow schema, Read Sequence passes the entire sequence to Smith-Waterman. Smith-Waterman searches the sequence for the given features and then: A) passes the entire input from RS (verbatim) on to the Write Annotations tool, B) passes only the newly crafted annotation table on to WA, C) passes both A and B on to WA (in this case, which happens first?), or D) something else I'm not anticipating?

Is there a 'dummy' tool that I could place in between two other tools so that I could get a "wiretap" of what is actually passing over the connection?

Quote:

In this design, adding a second input to Smith-Waterman for the query would require an additional filter after the element to keep the queries from being pushed further through the data flow (users are usually not interested in them). That's why the queries are defined as a parameter.

You could just add the filter to the Smith Waterman tool. It seems to me that the only thing coming out of the SW tool should be the list of matches. Apparently branching on output is supported, so if the user wants more than just a list of annotations he could bind those annotations with the branched output of Read Sequence.

Please tell me what is wrong with this idea, it will help me understand UGENE better.

Anyway, I set the data input parameter as suggested. I encountered some more problems.

First, UGENE choked on my primer names. My primers contain 3' and 5' in their names. This didn't work. UGENE should really be escaping special characters in its input. I couldn't find a bug report on this issue, should I open one?

Next, after running sed -i "s/'/p/g" on my primer file, I was able to complete the schema successfully. However, the resulting file contains no sequence information. This is despite setting the parameter "Sequence" to "By Sequence" under "Write Annotations".
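
For anyone hitting the same problem, a slightly safer variant of that substitution touches only the FASTA header lines and replaces every apostrophe on them. This is a minimal sketch: the function name and the choice of `p` as the replacement character are assumptions for illustration, not anything UGENE requires.

```python
def sanitize_fasta_headers(text: str, replacement: str = "p") -> str:
    """Replace apostrophes in FASTA header lines; leave sequence lines alone."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            # Only header lines carry the primer name.
            line = line.replace("'", replacement)
        out.append(line)
    return "\n".join(out) + "\n"


example = ">fwd_5'_primer\nACGTACGT\n>rev_3'_primer\nTTGACA\n"
print(sanitize_fasta_headers(example), end="")
```

Restricting the change to header lines means the sequence data cannot be altered by accident.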

I tried branching off of "Read Sequence" and sending that through "Write Sequence" set up to write to the same file as "Write Annotations" but that just ended up overwriting the file with the original sequence and annotations. What am I doing wrong?

And finally, what packages do I need to install to get pretty fonts in UGENE? I'm running UGENE on a RHEL6 install on a VM, forwarded with NX to a Debian machine. I checked under Settings->Preferences, and didn't see anything concerning fonts.


Reply #4 - Jul 17th, 2012 at 1:21pm

Yuriy Vaskin Offline
Global Moderator

Posts: 138

Hello!

As for the Smith-Waterman scheme: I think the Workflow Designer infrastructure is ready to let the Smith-Waterman worker take multiple Read Sequence elements as input. Here is the issue on that: https://ugene.unipro.ru/tracker/browse/UGENE-1100

There will be two Read Sequence elements (reference and patterns). The Smith-Waterman worker will search for each pattern in the reference and write the annotated reference (there will be filtering/grouping after the SW worker to filter out unlikely patterns).

Then you could take only matches or only annotations from the annotated reference using built-in workers.
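
As a rough illustration of what such a schema computes, the sketch below searches a reference for each primer on both strands and formats the hits as GenBank-style feature lines. It assumes exact matching rather than Smith-Waterman, a linear (not circular) reference, and plain A/C/G/T primers; the function names, the `primer_bind` feature key, and the `/note` qualifier are illustrative choices, not UGENE output.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, int, int, int]  # (name, start, end, strand), 1-based inclusive

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(haystack: str, needle: str) -> List[int]:
    """0-based start positions of every exact match, overlapping ones included."""
    hits: List[int] = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


def primer_features(reference: str, primers: Dict[str, str]) -> List[Feature]:
    """Search both strands of the reference for each primer."""
    ref = reference.upper()
    features: List[Feature] = []
    for name, primer in primers.items():
        fwd = primer.upper()
        rev = reverse_complement(fwd)
        for pos in find_all(ref, fwd):
            features.append((name, pos + 1, pos + len(fwd), 1))
        if rev != fwd:  # skip the second pass for palindromic primers
            for pos in find_all(ref, rev):
                features.append((name, pos + 1, pos + len(rev), -1))
    return features


def to_genbank_lines(features: List[Feature]) -> List[str]:
    """Feature key starts in column 6, location and qualifiers in column 22."""
    lines: List[str] = []
    for name, start, end, strand in features:
        loc = f"{start}..{end}" if strand == 1 else f"complement({start}..{end})"
        lines.append("     " + "primer_bind".ljust(16) + loc)
        lines.append(" " * 21 + f'/note="{name}"')
    return lines


hits = primer_features("TTACGGATTTCCGTAA", {"p1": "ACGGA", "p2": "GGGGG"})
for line in to_genbank_lines(hits):
    print(line)
```

Primers with mismatches or overhangs would need a real alignment step, which is where the Smith-Waterman element comes in.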

By the way, have a look at the "Find substring in sequence" sample scheme in the Samples tab.

Quote:

My primers contain 3' and 5' in their names. This didn't work. UGENE should really be escaping special characters in its input. I couldn't find a bug report on this issue, should I open one?

You should definitely create one.
Have you tried to use a simple scheme like "Read Sequence"->"Write Sequence" to check it?

Quote:

However, the resulting file contains no sequence information. This is despite setting the parameter "Sequence" to "By Sequence" under "Write Annotations".

This also seems to be a bug. Could you explain it in more detail? This option applies only to CSV files. Which annotation file format are you using?

Quote:

I tried branching off of "Read Sequence" and sending that through "Write Sequence" set up to write to the same file as "Write Annotations" but that just ended up overwriting the file with the original sequence and annotations. What am I doing wrong?

Try leaving the Source URL of the "Write Annotations" element empty; otherwise it will write the result to the source file. Pick an output file (the "Output" option) and also set the "Existing file" option.

Quote:

I checked under Settings->Preferences, and didn't see anything concerning fonts.

UGENE uses system native fonts. Try to delve into the font settings of your VM/RHEL.

wd_source_url_001.jpg (135 KB)


Reply #5 - Jul 18th, 2012 at 12:01am

hatta0 Offline
YaBB Newbies

Posts: 6

Yuriy Vaskin wrote on Jul 17th, 2012 at 1:21pm:

Hello!

As for the Smith-Waterman scheme: I think the Workflow Designer infrastructure is ready to let the Smith-Waterman worker take multiple Read Sequence elements as input. Here is the issue on that:

Awesome, I knew I couldn't be the only one who thought it would work like that.

Quote:

Quote:

My primers contain 3' and 5' in their names. This didn't work. UGENE should really be escaping special characters in its input. I couldn't find a bug report on this issue, should I open one?

You should definitely create one.
Have you tried to use a simple scheme like "Read Sequence"->"Write Sequence" to check it?

Hm, I just tried that and it worked fine. It must be something in the write annotations tool. I'll have to troubleshoot that later.

Quote:

Quote:

However, the resulting file contains no sequence information. This is despite setting the parameter "Sequence" to "By Sequence" under "Write Annotations".

This also seems to be a bug. Could you explain it in more detail? This option applies only to CSV files. Which annotation file format are you using?

I'm just using GenBank files. I've not heard of using CSV for annotations or sequence data before. It sounds...scary. CSV is too freeform. Which rows and columns are sequence and which are annotations? Wouldn't you need another specification to determine all that, at which point you're not really supporting "CSV" but a special subset of CSV?

If I understand correctly, the "Sequence" parameter only has an effect if I'm using CSV files, and since I'm using GenBank it should "just work"?

Quote:

Quote:

I tried branching off of "Read Sequence" and sending that through "Write Sequence" set up to write to the same file as "Write Annotations" but that just ended up overwriting the file with the original sequence and annotations. What am I doing wrong?

Try leaving the Source URL of the "Write Annotations" element empty; otherwise it will write the result to the source file. Pick an output file (the "Output" option) and also set the "Existing file" option.

I'm not clear on this advice. Do I need a "Write Sequence" tool in addition to the "Write Annotations", or is the "Write Annotations" tool supposed to pass on the sequence it receives untouched?

"Source URL" is unset. "Existing File" defaults to rename. I tried setting the "Write Sequence" tool to overwrite, and the "Write Annotations" to "append", hoping that it would add the annotations to the sequence that was previously written, but UGENE wasn't able to read the resulting .gb file.


Reply #6 - Jul 18th, 2012 at 2:20pm

Yuriy Vaskin Offline
Global Moderator

Posts: 138

Quote:

If I understand correctly, the "Sequence" parameter only has an effect if I'm using CSV files, and since I'm using GenBank it should "just work"?

The "Sequence" parameter is only needed for the "Write sequence names" parameter (see screenshot), and that applies to CSV files only. It has nothing to do with the rest of the "Write Annotations" element. You can use the "Annotations name" parameter to set the name of the resulting annotations.

1. If you want to write ANNOTATIONS ONLY, use the "Write Annotations" element. These annotations will be independent of the sequence they were attached to before. What sequence information do you need to see in the annotations file?
2. If you want to write ANNOTATED SEQUENCES or SEQUENCES ONLY, use the "Write Sequence" element. You can choose which annotations are included, as Konstantin showed you.

Quote:

I tried setting the "Write Sequence" tool to overwrite, and the "Write Annotations" to "append", hoping that it would add the annotations to the sequence that was previously written, but UGENE wasn't able to read the resulting .gb file.

Now I think I understand what you're trying to do. Source URL has nothing to do with that...

Suppose you have a file F (in GenBank format) containing an annotated sequence: A1 (annotations 1) and S (the sequence). You read the file and perform some calculation on S (Smith-Waterman, for instance) that produces annotations A2. If you want to write the result back to file F, you just need to use the "Write Sequence" element with the "overwrite" option. File F will then contain the annotated sequence: A1, A2 and S. In the Workflow Designer, annotations are already attached to their sequence, so you don't need to connect them manually using "Write Annotations".

With the "Append" option, an element just appends its result to the end of the file. It is not smart enough to merge the annotations into a sequence that is already there.

Hope this helps. Please correct me if I misunderstood.

wd_write_annotations.jpg (87 KB)
