
Message started by EngineerInBiology on Jul 5th, 2013 at 8:58pm

Title: Smith-waterman db vs query sequence alignment
Post by EngineerInBiology on Jul 5th, 2013 at 8:58pm
I just found UGENE and I'm trying to execute my usual tasks in this software environment rather than in SSEARCH software.
I'm interested in aligning a single query sequence against an entire db, on both strands. I would like to use the command-line tool. Query and db are in FASTA format.

I tried the following command
CMD> ugene.exe find-sw --in=.\db65536seqs.fasta --ptrn=query_seq.fasta --filtern=none --score=0 --matrix=dna

My questions are the following:
- score=0 means that every db sequence is going to be aligned with the query, hence all S-W scores > 0 are reported (regardless of the similarity percentage in practice), right?
- Is there any way to obtain the Smith-Waterman score ranking (top N), as in SSEARCH?
- With this command, is any hardware acceleration (CUDA, SSE2) enabled automatically? Is it possible to select the desired acceleration method, as in the UGENE GUI?
- Is there a way to specify gap opening and extension from the command line?

Thank you very much for your help, I'm stuck on these problems. I hope to hear from one of you soon. Best regards

Title: Re: Smith-waterman db vs query sequence alignment
Post by EngineerInBiology on Jul 8th, 2013 at 5:38pm
Ok, partially solved. The solution for adding additional parameters is to
- open the Workflow Designer schema named "find-sw" (find-sw.uwl file in the UGENE directory)
- add the additional command-line parameters (gap opening, extension and algorithm) as described in chapter 13 of the Workflow Designer manual
- save the new Workflow Designer schema
- execute the new workflow from the command line, using the schema name as the task parameter (i.e. ugene.exe <new_schema_name> ...)

Everything seems to work just fine, except
- when running from the command line with a FASTA db of 65k sequences (mean length of 8k bp) and a FASTA query of length 512 bp, the program seems to crash. After working for a long time, the message printed on standard output is "QThread::start: Failed to create thread", followed by a text in Italian that can be translated as "(access code not valid)". This also happened with the standard "find-sw" task, not only with my new modified schema.

- I'm not yet able to obtain a top results ranking, as in SSEARCH

Anyone willing to help me?

Further details:
I'm running on Windows 7 32bit, 4GB RAM, 4 core

Title: Re: Smith-waterman db vs query sequence alignment
Post by EngineerInBiology on Jul 9th, 2013 at 10:11pm
The problem may be due to UGENE's high RAM usage and the task memory limits.

As I said before, system RAM is about 4GB, but the Task Memory maximum limit is 1.536GB (from Application Settings\Resource).
Although each sequence of my (big) DB is completely loaded by the "Read Reference Sequence" Workflow Designer block, used RAM increases while the "Smith-Waterman Search" block does its alignment operations. It keeps increasing until the limit is reached (but only 10% of the db is aligned), then the UGENE GUI becomes partially stuck, the CPU activity of UGENE drops from 100% (an entire core) to 0% and no more sequences are aligned.

Maybe it is not designed for S-W alignment of big dbs/queries? Please let me know.

Title: Re: Smith-waterman db vs query sequence alignment
Post by Yuriy Vaskin on Jul 12th, 2013 at 12:49pm
Hello! Sorry for the long delay...

It’s great that you coped with updating the find-sw scheme with additional parameters. We're going to do that for all the schemes in one of the coming releases (
The main difference between SSEARCH and UGENE SW is the following:
-SSEARCH is designed to work with a query sequence and a database.
-UGENE SW is designed to work with one reference sequence and a query sequence. With Workflow Designer you are able to apply the SW operation to a bunch of sequences in the db.
It means that UGENE SW in the Workflow Designer treats all sequences in the database independently and it cannot infer in-group relations like the top-scoring sequences. But it is able to annotate any number of sequences in the database and filter those (see “Samples” of the Workflow Designer).
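
To make the difference concrete, here is a minimal Python sketch of what an SSEARCH-style "top N" ranking involves: every db sequence is scored on its own (as in the Workflow Designer), and the ranking is an extra step over all scores afterwards. This is not UGENE or SSEARCH code. The function names, the scoring values (match=5, mismatch=-4) and the single linear gap penalty are assumptions chosen only for illustration; real tools use a substitution matrix and separate gap opening and extension penalties.

```python
import heapq

def reverse_complement(seq):
    """Reverse complement of a DNA string; unknown symbols become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))

def sw_score(a, b, match=5, mismatch=-4, gap=-10):
    """Best Smith-Waterman local score with a linear gap penalty.

    Only two rows of the matrix are kept, so memory is O(len(b)).
    The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            if cur[j] > best:
                best = cur[j]
        prev = cur
    return best

def top_hits(query, db, n=10):
    """Score each (name, sequence) pair independently on both strands,
    then keep the n best (score, name) pairs."""
    rc = reverse_complement(query)
    scored = (
        (max(sw_score(query, seq), sw_score(rc, seq)), name)
        for name, seq in db
    )
    return heapq.nlargest(n, scored)
```

The point of the sketch is that `top_hits` needs to see the scores of all db sequences at once, which is exactly the in-group relation that a per-sequence workflow element does not have.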

I think it fails to process your query because of the score you set. Score is the threshold for a similarity measure of two sequences. If the current sequence from the database matches the query sequence with the score N and the score is greater than the threshold, it is reported as a result. If you set a 0% threshold, each subsequence of the current sequence matches the query, since every two sequences have similarity >0%. It results in a huuuge amount of nonsense annotations that flood your RAM. I would suggest setting a reasonable threshold score, >70% for instance.
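
The effect of the threshold can be illustrated with a small Python sketch. It counts how many cells of a Smith-Waterman matrix (each cell is the end point of a candidate local alignment) reach a given percentage of the maximum possible score. This is an assumption-laden illustration, not how UGENE computes or filters its results: the function name, the scoring values and the definition of the percentage (relative to a perfect match of the whole query) are all hypothetical.

```python
def count_candidates(query, seq, min_percent, match=5, mismatch=-4, gap=-10):
    """Count matrix cells with a positive score that reaches min_percent
    of the best possible score (a perfect match of the whole query).

    Linear gap penalty and scoring values are illustrative assumptions.
    """
    threshold = len(query) * match * min_percent / 100.0
    prev = [0] * (len(seq) + 1)
    count = 0
    for i in range(1, len(query) + 1):
        cur = [0] * (len(seq) + 1)
        for j in range(1, len(seq) + 1):
            same = query[i - 1] == seq[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            if cur[j] > 0 and cur[j] >= threshold:
                count += 1
        prev = cur
    return count

if __name__ == "__main__":
    # Compare how many candidates survive at different thresholds.
    for percent in (0, 70, 100):
        print(percent, count_candidates("ACGT", "ACGTACGT", percent))
```

Even on this toy input the count at 0% is several times larger than at 70%. On a db of tens of thousands of sequences of several kbp each, that difference is multiplied accordingly, which fits the memory growth described above.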
