UGENE Forum - Print Page

Hello,
It would be very useful to have a plugin for the analysis of data from chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq), e.g. from the Illumina GA. Ideally, the plugin should be able to align reads to the reference sequence, plot their distribution, and search for specific peaks when a background control sample (i.e. a negative control ChIP) is defined.

I'm happy to discuss the details if this suggestion is of interest to the UGENE developers.

Have you tested Cocas?
It works for Agilent chips and is written in Java, which means it is platform independent.

Nikolay,
this is a very interesting suggestion.

A ChIP-seq workflow would truly be a useful feature for our users. UGENE already includes some "assembly to reference" functionality. We're planning to support the Illumina data format too.

It would be great if you could provide more details on this topic. What kind of ChIP analysis do you have in mind? Perhaps some specific methods or existing tools?

Any references would be welcome and we are ready to discuss this idea.

Hi Konstantin,
we don't actually carry out the sequencing ourselves; rather, we submit our ChIP DNA samples to the University's central sequencing facility. They prepare libraries, perform high-throughput sequencing and finally provide us with a huge FASTA file (5 GB+) which contains tens of millions of reads and the corresponding quality scores.
From this point on, all the downstream analysis is carried out by a bioinformatician in our lab. He is a computer scientist fluent in programming, so for most of what he needs to do he actually writes code. However, I thought it would be helpful if biologists could perform at least a basic data analysis using some sort of graphical interface rather than annoy a bioinformatician with simple but repetitive tasks.
The way I see it working is:
1) We receive a data file from the sequencing facility.
2) The reads are aligned to the reference sequence (in my case it would be the human genome; I know that many people would be interested in the mouse genome) to create a read distribution map.
3) The distribution maps of two (or more) samples are compared to identify sample-specific peaks. The sample used for comparison (or as a background reading) is usually a negative control ChIP, which represents only non-specifically adsorbed DNA (at least in theory).
4) The list of sample-specific peaks is created. Ideally, there should be some functionality to compare this list with any other given list to test for an overlap. The criterion for an overlap may be the coordinates of the genomic location or simply the name of a gene.
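
To make the overlap test in step 4 concrete, here is a minimal Python sketch. It is only an illustration, not how UGENE or any existing tool does it: the `Peak` record, the function names and the example peaks are all made up, and intervals are assumed to be half-open (`start` included, `end` excluded).

```python
from collections import namedtuple

# Hypothetical peak record: chromosome, start, end (half-open), associated gene name.
Peak = namedtuple("Peak", ["chrom", "start", "end", "gene"])


def overlaps(a, b):
    """True if two peaks share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def overlap_by_coordinates(list_a, list_b):
    """Peaks from list_a that overlap any peak in list_b (simple pairwise scan)."""
    return [a for a in list_a if any(overlaps(a, b) for b in list_b)]


def overlap_by_gene(list_a, list_b):
    """Peaks from list_a whose gene name also appears in list_b."""
    genes_b = {b.gene for b in list_b}
    return [a for a in list_a if a.gene in genes_b]


# Made-up example lists, for illustration only.
mine = [Peak("chr1", 100, 200, "GENE_A"), Peak("chr2", 500, 600, "GENE_B")]
other = [Peak("chr1", 150, 250, "GENE_C"), Peak("chr3", 10, 50, "GENE_B")]

by_coord = overlap_by_coordinates(mine, other)  # the chr1 peak overlaps
by_gene = overlap_by_gene(mine, other)          # the GENE_B peak matches by name
```

The pairwise scan is fine for short peak lists; for lists with many thousands of peaks a sorted sweep or an interval tree would be the usual choice.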

As I understand from our bioinformatician, he uses something called 'bowtie' for alignment (step 2) and something called 'useq' for peak finding (step 3). I'm not sure, though, how much of this can be run on a regular desktop/laptop computer... he has a very powerful monster of a machine and it still takes quite a lot of time. To look at the aligned reads we use the UCSC genome browser and bigBed-type files which are stored on the lab's server. This is not particularly pretty but at least it works at a reasonable speed. Any other software I tried for this (e.g. CLCBio) inevitably kills my computer when it tries even to show the distribution map, let alone perform the alignment calculations.

That's about it. I'm not sure I described clearly what I intended to, so if there are any questions, please ask. It might be easier to explain in Russian, but I thought I should stick with English for the general forum.

Nikolay,
thanks for this detailed description.
Steps 1 and 2 are rather clear: UGENE already performs step 2, and we have Bowtie integrated. Soon we're going to introduce a solution for the analysis of large data sets (~10 GB), which will involve cloud computing.
The next steps require some clarification. As I understand it, a "distribution map" is a particular alignment of reads to the genome. What is the "sample" exactly? Is it a part of the "distribution map" or is it a stand-alone sequence?
What do you mean by "distribution map between two samples"?
P.S. If it is more comfortable for you to explain in Russian, feel free to move this discussion to the forum in Russian language.

Konstantin,
sorry for the late response. A "distribution map" is indeed a particular alignment of reads to the genome, but I'd like to add that it should be possible to visualize this alignment as an image, something like this one:
http://www.nature.com/nmeth/journal/v6/n4/fig_tab/nmeth.f.247_F2.html
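
As a rough illustration of what such a distribution map boils down to, the sketch below counts aligned reads in fixed-width bins along each chromosome and prints a crude text histogram. It assumes the alignment step has already produced a list of (chromosome, start position) pairs; the function names, the bin size and the example reads are hypothetical.

```python
from collections import Counter


def read_distribution(read_positions, bin_size=100):
    """Count reads per fixed-width bin.

    read_positions: iterable of (chrom, start) pairs for aligned reads.
    Returns a Counter keyed by (chrom, bin_index).
    """
    counts = Counter()
    for chrom, start in read_positions:
        counts[(chrom, start // bin_size)] += 1
    return counts


def text_plot(counts, chrom, bin_size=100):
    """Render one chromosome as text lines, one '#' per read in each bin."""
    bins = sorted(b for (c, b) in counts if c == chrom)
    return [f"{chrom}:{b * bin_size:>8} {'#' * counts[(chrom, b)]}" for b in bins]


# Made-up aligned read starts, for illustration only.
reads = [("chr1", 10), ("chr1", 50), ("chr1", 120), ("chr2", 5)]
dist = read_distribution(reads)

for line in text_plot(dist, "chr1"):
    print(line)
```

A real viewer would draw the same per-bin counts as a coverage track rather than text, and would count read coverage over the whole read length instead of start positions only.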

Quote:

What is the "sample" exactly? Is it a part of the "distribution map" or is it a stand alone sequence?

By "sample" I mean a sequencing readout (i.e. all aligned reads) for each individual ChIP DNA (= each individual antibody used in the experiment). One experiment often includes more than one sample (more than one antibody) to enable identification of specific binding sites, see below.

Quote:

What do you mean by "distribution map between two samples"?

This means that the software should identify reads which are in one sample but not in the other. Suppose you're looking for the binding sites of the FOXA3 protein, as in the image I referred to above, and use a mouse monoclonal antibody to perform the ChIP. In this case you may need to carry out an additional ChIP using another mouse monoclonal antibody which does not recognize anything in human cells. This can be, for example, an antibody raised against the jellyfish protein GFP.
Sequencing of both samples will return some reads because some non-specific binding always takes place. To account for this you would need to compare the reads from the FOXA3 ChIP with the reads from the GFP ChIP. Reads that are found only in the FOXA3 ChIP but not in the GFP ChIP are specific for the FOXA3 protein. Reads that are found in both the FOXA3 and GFP ChIPs most likely originate from the non-specific binding mentioned above.
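
The comparison described above could be sketched like this in Python. This is a deliberately naive fold-enrichment filter, not the statistical model used by real peak finders such as the one mentioned earlier in the thread. It assumes both samples have already been reduced to per-bin read counts; the function name, the thresholds and the example counts are all made up.

```python
def specific_bins(sample, control, min_reads=5, min_fold=3.0):
    """Return bins enriched in `sample` relative to `control`.

    sample, control: dicts mapping a bin key, e.g. (chrom, bin_index), to a read count.
    Returns a dict mapping each enriched bin key to its fold enrichment.
    """
    # Scale the control to the sample's sequencing depth so the counts are comparable.
    total_sample = sum(sample.values())
    total_control = sum(control.values())
    scale = total_sample / total_control if total_control else 1.0

    enriched = {}
    for key, n in sample.items():
        background = control.get(key, 0) * scale
        fold = n / (background + 1.0)  # pseudocount of 1 avoids division by zero
        if n >= min_reads and fold >= min_fold:
            enriched[key] = fold
    return enriched


# Made-up bin counts, for illustration only.
foxa3 = {("chr1", 0): 20, ("chr1", 1): 12, ("chr1", 2): 2}
gfp = {("chr1", 1): 10, ("chr1", 2): 3, ("chr1", 5): 21}

peaks = specific_bins(foxa3, gfp)
```

With these numbers only the first bin passes: the second bin has about as many reads in the GFP control as in the FOXA3 sample, and the third has too few reads to count.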

UGENE Forum: General Category >> Feature Requests >> ChIP-seq data analysis
https://forum.ugene.net/forum/YaBB.pl?num=1274820148
Message started by Nikolay on May 26th, 2010 at 3:42am