Saturday, 22 December 2012

bioinformatics


A sequence for the human Pax6 gene was obtained from NCBI. A 10 kb region at the 5' end containing the putative promoter was selected and downloaded in GenBank and FASTA formats.
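The region selection described above can be sketched in a few lines of Python. This is only an illustration: the helper names `read_fasta` and `subsequence` are hypothetical, and the toy record stands in for the real NCBI download, which is not reproduced here.

```python
from io import StringIO

def read_fasta(handle):
    """Return (header, sequence) for the first record in a FASTA stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop reading
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks)

def subsequence(seq, start, end):
    """Slice a sequence using 1-based, inclusive coordinates."""
    if start < 1 or end > len(seq) or start > end:
        raise ValueError("coordinates out of range")
    return seq[start - 1:end]

# Toy record (hypothetical content), not the Pax6 sequence.
toy = StringIO(">toy_record hypothetical\nACGTACGTAC\nGTACGTACGT\n")
header, seq = read_fasta(toy)
region = subsequence(seq, 3, 12)
print(header, len(seq), region)
```

For the real file, the same `subsequence` call with coordinates 1 and 10000 would return the first 10 kb, assuming the record is at least that long.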

Hidden Markov Model
The sequence was analysed using Cister, a program based on a hidden Markov model. Cister predicts the positions and probabilities of cis-regulatory elements from the way such elements cluster along the sequence. Its parameters can be changed to model different distributions of cis-elements.

The pink regions correspond to the protein-coding region, which lies predominantly between positions 6000 and 10000. The several discontinuous protein-coding regions indicate the presence of introns between exons.

The parameters were adjusted to detect the positions and respective probabilities of each cis-regulatory element for different distributions. In all graphs a consensus cis-cluster is detected around positions 8000-10000, although the probability of the cluster varies with the parameter values. It is inferred that a cis-cluster exists around this position.





Possible functional cis-elements
type   position      strand  sequence          probability
Sp1    8723 to 8735  -       cgccccgcccccg     0.49
Sp1    8827 to 8839  +       aaggggcgggagc     0.42
Sp1    8739 to 8751  -       gcccccaccccat     0.37
CCAAT  8383 to 8398  -       cagccgattggatgct  0.29
Sp1    8793 to 8805  -       accccagcccacc     0.24
CCAAT  8114 to 8129  -       atgctgattggtgatg  0.23
Sp1    8625 to 8637  -       ccccactccccct     0.22
Sp1    8562 to 8574  -       cggcccgccctgc     0.19
CCAAT  8413 to 8428  -       ggtgcaattggtccgt  0.14
AP-1   9168 to 9178  +       gttgactaagt       0.12
Sp1    3042 to 3054  -       cgccccgccctga     0.1
Sp1    9228 to 9240  +       gaggggaggggga     0.1


Parameter Settings:

a average distance between motifs within a cluster: 35
b average number of motifs in a cluster: 6
g average distance between clusters: 30000
W half-width of sliding window for local base composition: 1000
Motif probability threshold: 0.1
Pseudocount: 1



The highest probability is only 0.49, so the parameters are changed.



Possible functional cis-elements
type   position      strand  sequence          probability
Sp1    8723 to 8735  -       cgccccgcccccg     0.7
Sp1    8827 to 8839  +       aaggggcgggagc     0.6
Sp1    8739 to 8751  -       gcccccaccccat     0.53
CCAAT  8383 to 8398  -       cagccgattggatgct  0.42
CCAAT  8114 to 8129  -       atgctgattggtgatg  0.38
Sp1    8793 to 8805  -       accccagcccacc     0.34
Sp1    8625 to 8637  -       ccccactccccct     0.31
Sp1    8562 to 8574  -       cggcccgccctgc     0.26
Sp1    3042 to 3054  -       cgccccgccctga     0.25
AP-1   9168 to 9178  +       gttgactaagt       0.22
Sp1    3057 to 3069  -       agcccagcccctc     0.2
CCAAT  8413 to 8428  -       ggtgcaattggtccgt  0.19
Sp1    2917 to 2929  +       gaggggaggagct     0.18
Sp1    9228 to 9240  +       gaggggaggggga     0.18
Sp1    8986 to 8998  -       ctccccgcccttt     0.12
Sp1    3181 to 3193  -       ggcccagcccctg     0.12
AP-1   8438 to 8448  +       tatgaatacac       0.12
Sp1    3125 to 3137  +       ggggggatggggt     0.12
Sp1    6129 to 6141  -       cctcccgccccca     0.12

Parameter Settings:

a average distance between motifs within a cluster: 35
b average number of motifs in a cluster: 6
g average distance between clusters: 10000
W half-width of sliding window for local base composition: 1000
Motif probability threshold: 0.1
Pseudocount: 1

Here the highest probability is 0.7. The parameters were changed again.


Possible functional cis-elements
type   position      strand  sequence          probability
Sp1    8723 to 8735  -       cgccccgcccccg     0.79
CCAAT  8383 to 8398  -       cagccgattggatgct  0.75
Sp1    8827 to 8839  +       aaggggcgggagc     0.73
CCAAT  8114 to 8129  -       atgctgattggtgatg  0.68
Sp1    8739 to 8751  -       gcccccaccccat     0.55
Sp1    8625 to 8637  -       ccccactccccct     0.46
AP-1   9168 to 9178  +       gttgactaagt       0.43
Sp1    8562 to 8574  -       cggcccgccctgc     0.42
Sp1    8793 to 8805  -       accccagcccacc     0.36
Sp1    8986 to 8998  -       ctccccgcccttt     0.33
Sp1    9228 to 9240  +       gaggggaggggga     0.32
CCAAT  8413 to 8428  -       ggtgcaattggtccgt  0.3
AP-1   8438 to 8448  +       tatgaatacac       0.18
AP-1   8209 to 8219  -       cctctgtcatc       0.16
Sp1    9274 to 9286  +       gctgggagggatt     0.16
Sp1    8470 to 8482  -       ggtcccggctcct     0.14
CCAAT  7868 to 7883  -       gttttgtttggttggg  0.12
Sp1    8617 to 8629  -       ggtcgcgccccca     0.12
Sp1    9201 to 9213  +       gagaggaggggaa     0.12
CCAAT  8180 to 8195  -       ctcctcactggcccat  0.12
Sp1    9256 to 9268  +       ggggggcggatga     0.11
Sp1    8266 to 8278  +       atgaggccgagcc     0.11
AP-1   8298 to 8308  -       cactaatcact       0.11
AP-1   9264 to 9274  +       gatgaccaatg       0.11
Sp1    8972 to 8984  -       tcctccaccccgc     0.1
Sp1    8159 to 8171  -       cgcccggcctcgc     0.1

Parameter Settings:

a average distance between motifs within a cluster: 50
b average number of motifs in a cluster: 10
g average distance between clusters: 30000
W half-width of sliding window for local base composition: 1000
Motif probability threshold: 0.1
Pseudocount: 1


Possible functional cis-elements
type   position      strand  sequence          probability
Sp1    8723 to 8735  -       cgccccgcccccg     0.15
Sp1    8739 to 8751  -       gcccccaccccat     0.13

Parameter Settings:

a average distance between motifs within a cluster: 10
b average number of motifs in a cluster: 10
g average distance between clusters: 30000
W half-width of sliding window for local base composition: 1000
Motif probability threshold: 0.1
Pseudocount: 1


The highest probability in graph 1 is 0.49, for an Sp1-binding site at positions 8723-8735 on the minus strand. This is not conclusive evidence of its presence, so the parameters were changed. When the highest probability rises to 0.7, several other cis-regulatory sites are also detected with higher probabilities. The cis-regulatory elements whose probability increases in graph 2 are considered further; elements whose probability stays the same or decreases are probably not significant. The elements with the highest probabilities in graph 1 also increase in graph 2, which strengthens the case that they are present.
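The comparison between the first two runs can be sketched as follows. The probabilities are copied from the first two result tables; the function name `compare_runs` and the choice of (type, start, strand) as a site key are assumptions made for this illustration.

```python
# Probabilities from the first two result tables; key = (type, start, strand).
run1 = {
    ("Sp1", 8723, "-"): 0.49, ("Sp1", 8827, "+"): 0.42, ("Sp1", 8739, "-"): 0.37,
    ("CCAAT", 8383, "-"): 0.29, ("Sp1", 8793, "-"): 0.24, ("CCAAT", 8114, "-"): 0.23,
    ("Sp1", 8625, "-"): 0.22, ("Sp1", 8562, "-"): 0.19, ("CCAAT", 8413, "-"): 0.14,
    ("AP-1", 9168, "+"): 0.12, ("Sp1", 3042, "-"): 0.10, ("Sp1", 9228, "+"): 0.10,
}
run2 = {
    ("Sp1", 8723, "-"): 0.70, ("Sp1", 8827, "+"): 0.60, ("Sp1", 8739, "-"): 0.53,
    ("CCAAT", 8383, "-"): 0.42, ("CCAAT", 8114, "-"): 0.38, ("Sp1", 8793, "-"): 0.34,
    ("Sp1", 8625, "-"): 0.31, ("Sp1", 8562, "-"): 0.26, ("Sp1", 3042, "-"): 0.25,
    ("AP-1", 9168, "+"): 0.22, ("Sp1", 3057, "-"): 0.20, ("CCAAT", 8413, "-"): 0.19,
    ("Sp1", 2917, "+"): 0.18, ("Sp1", 9228, "+"): 0.18, ("Sp1", 8986, "-"): 0.12,
    ("Sp1", 3181, "-"): 0.12, ("AP-1", 8438, "+"): 0.12, ("Sp1", 3125, "+"): 0.12,
    ("Sp1", 6129, "-"): 0.12,
}

def compare_runs(before, after):
    """Split sites into those that increased, those that did not, and new ones."""
    shared = before.keys() & after.keys()
    changes = {site: after[site] - before[site] for site in shared}
    # Largest increase first.
    increased = sorted((s for s in changes if changes[s] > 0), key=lambda s: -changes[s])
    not_increased = sorted(s for s in changes if changes[s] <= 0)
    new_sites = sorted(after.keys() - before.keys())
    return increased, not_increased, new_sites

increased, not_increased, new_sites = compare_runs(run1, run2)
print(len(increased), len(not_increased), len(new_sites))
```

With these two tables, every site from the first run increases in the second, so the increase alone does not separate candidates; the size of the increase and the absolute probability still have to be weighed.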

Positions 8723-8735 on the minus strand are highly likely to contain an Sp1-binding site, as the site is predicted in both graphs 1 and 2. Another probable Sp1-binding site lies at 8827-8839 on the plus strand. A third candidate is 8739-8751 on the minus strand, which is still predicted in the run where all probabilities are low.

For CCAAT, the likely sites are 8383-8398 and 8114-8129 on the minus strand. Both appear in both graphs with increasing probabilities, and both lie within the cis-element cluster.

There are fewer AP-1 binding sites. In the first graph, 9168-9178 on the plus strand is predicted with a low probability of 0.12. Its probability remains below that of the leading Sp1 sites in the second and third graphs, which assign higher probabilities to more Sp1-binding sites. Hence the likelihood of AP-1 binding sites in this cluster is low.

The cluster probability is high around positions 8000-8600, indicating the presence of many cis-regulatory elements in this region.

Several predicted cis-elements lie within the protein-coding region. This indicates either that exons also serve as cis-regulatory elements or that Cister is not fully reliable. Cister tends to assign elevated probabilities to candidate sites inside a probable cluster, even in regions that are unlikely to contain such elements. Hence many predicted cis-elements are probably not real, especially those with a probability below 0.2. The predicted AP-1 binding site has a low probability and may not be present at all. The most probable cis-elements in the cluster are the Sp1-binding sites and the CCAAT sites mentioned above. Further investigation should be carried out on these.

Cister has four variable parameters: a, the average distance between cis-elements within a cluster; b, the average number of cis-elements in a cluster; g, the average distance between clusters; and W, the half-width of the sliding window used for local base composition. For graph 1, the parameter values were a=35, b=6, g=30000 and W=1000. For graph 2, the parameter values were a=50, b=10, g=30000 and W=1000.
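One way to build intuition for a, b and g is to read each as the mean of a geometric distribution, so that the matching per-step probability is its reciprocal. This is a simplifying assumption for illustration, not a description of how Cister is implemented, and the function name `transition_probabilities` is hypothetical.

```python
def transition_probabilities(a, b, g):
    """Per-step probabilities under an assumed geometric-length model."""
    return {
        "start_cluster_per_base": 1.0 / g,  # background -> cluster
        "start_motif_per_base": 1.0 / a,    # gap inside a cluster -> motif
        "end_cluster_per_motif": 1.0 / b,   # leave the cluster after a motif
    }

# Settings from the first two parameter listings in this post.
first = transition_probabilities(a=35, b=6, g=30000)
second = transition_probabilities(a=35, b=6, g=10000)

ratio = second["start_cluster_per_base"] / first["start_cluster_per_base"]
print(ratio)
```

Under this reading, lowering g from 30000 to 10000 makes entering a cluster about three times more likely per base, which is in line with the higher site probabilities seen when only g was reduced.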

When the probability of finding a particular cis-regulatory element in a particular region is low, as in graph 1, fewer cis-regulatory elements are predicted. This decreases sensitivity but increases specificity. Conversely, when the probability of finding a cis-regulatory element in a particular region is high, as in graph 2, many cis-regulatory elements are predicted, including some in protein-coding regions. This increases sensitivity but decreases the specificity of the method.
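The trade-off can be made concrete with the standard definitions of sensitivity and specificity. The counts below are invented purely for illustration, since no experimentally validated set of sites is available in this analysis.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real sites that were predicted."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of non-sites that were correctly left unpredicted."""
    return true_neg / (true_neg + false_pos)

# Invented counts, for illustration only (not measured on the Pax6 region).
strict = {"tp": 4, "fn": 6, "fp": 1, "tn": 89}
permissive = {"tp": 8, "fn": 2, "fp": 15, "tn": 75}

for name, c in (("strict", strict), ("permissive", permissive)):
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    print(name, round(sens, 2), round(spec, 2))
```

The permissive setting recovers more of the hypothetical real sites but also accepts more false ones, which is the pattern described above for graph 2.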

Altering a and g has a large effect on the results. A higher a-value increases the probabilities of cis-elements but decreases specificity. A higher g-value decreases the probabilities of cis-elements.

The predicted cis-elements are scattered across the protein-coding regions. Those at the 5' end or 3' end are more likely to be genuine than those within the protein-coding regions. The number of separate protein-coding regions indicates that the gene contains several exons.

FirstEF
FirstEF predicts the position of a 5' terminal exon and its promoter. A potential first donor site is a GT dinucleotide. For each GT, FirstEF predicts the position of a first exon by calculating the probabilities of a promoter, a donor site and an exon.
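The first step, listing every GT as a candidate donor site, is easy to sketch. This covers only the candidate scan, not the scoring FirstEF applies afterwards; the function name `find_gt_sites` and the toy sequence are hypothetical.

```python
def find_gt_sites(seq):
    """Return 1-based positions of every GT dinucleotide on the given strand."""
    seq = seq.upper()
    positions = []
    i = seq.find("GT")
    while i != -1:
        positions.append(i + 1)  # convert 0-based index to 1-based position
        i = seq.find("GT", i + 1)
    return positions

# Toy sequence (hypothetical), not part of the Pax6 region.
toy = "ccaggtaagtcgt"
gt_positions = find_gt_sites(toy)
print(gt_positions)
```

On a 10 kb sequence this scan returns many candidates, which is why each one then has to be scored before a first exon is called.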

On the direct strand the predicted promoter sequences were at 1904-2473, 1978-2547 and 8322-8891. On the complementary strand they were at 7684-7115 and 3416-2847. The promoter at 8322-8891 is plausible, as Cister predicted cis-regulatory elements in that region.
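The agreement between FirstEF and Cister can be checked by counting how many Cister sites fall inside each predicted promoter window. The sketch below uses the 12 sites from the first Cister table and ignores strand, which is a simplification; the names `promoters` and `sites_inside` are hypothetical.

```python
# FirstEF promoter windows as reported above (complementary ones run high to low).
promoters = {
    "direct 1904-2473": (1904, 2473),
    "direct 1978-2547": (1978, 2547),
    "direct 8322-8891": (8322, 8891),
    "complementary 7684-7115": (7684, 7115),
    "complementary 3416-2847": (3416, 2847),
}

# (start, end) of the 12 sites in the first Cister table.
sites = [
    (8723, 8735), (8827, 8839), (8739, 8751), (8383, 8398),
    (8793, 8805), (8114, 8129), (8625, 8637), (8562, 8574),
    (8413, 8428), (9168, 9178), (3042, 3054), (9228, 9240),
]

def sites_inside(window, sites):
    """Return the sites lying entirely within the window, whatever its direction."""
    lo, hi = min(window), max(window)
    return [s for s in sites if lo <= s[0] and s[1] <= hi]

counts = {name: len(sites_inside(w, sites)) for name, w in promoters.items()}
for name, n in counts.items():
    print(name, n)
```

Eight of the twelve sites fall inside the 8322-8891 window and one inside 3416-2847, while the other three windows contain none.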

The Promoter 2.0 Prediction Server was also used. It predicts transcription start sites for vertebrate Pol II promoters. The highly likely predicted transcription start sites were at positions 400 and 4300, with 4300 having the higher score. The server is fairly reliable, as its highly likely TSS predictions are 95% accurate. However, the promoter sequence predicted by FirstEF, 8322-8891, is not consistent with the most probable transcription start site predicted by Promoter 2.0.

McPromoter predicts the TSS at position 3201-3202. This is not consistent with the promoter sequences predicted by FirstEF.

Promoter 2.0 predictions:
  Position  Score  Likelihood
       400  1.034  Highly likely prediction
      1500  0.652  Marginal prediction
      2900  0.633  Marginal prediction
      4300  1.140  Highly likely prediction
      5400  0.717  Marginal prediction
      5800  0.540  Marginal prediction
      7000  0.733  Marginal prediction
      8200  0.647  Marginal prediction
      8600  0.530  Marginal prediction
      9400  0.646  Marginal prediction
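
The score table above, which matches the Promoter 2.0 results described earlier, can be parsed to pull out the highly likely predictions and the top-scoring position. This is a minimal sketch; `parse_predictions` is a hypothetical helper and the text is pasted in rather than fetched from the server.

```python
RAW = """\
  Position  Score  Likelihood
       400  1.034  Highly likely prediction
      1500  0.652  Marginal prediction
      2900  0.633  Marginal prediction
      4300  1.140  Highly likely prediction
      5400  0.717  Marginal prediction
      5800  0.540  Marginal prediction
      7000  0.733  Marginal prediction
      8200  0.647  Marginal prediction
      8600  0.530  Marginal prediction
      9400  0.646  Marginal prediction
"""

def parse_predictions(text):
    """Parse whitespace-aligned rows into (position, score, likelihood) tuples."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        position, score, likelihood = line.split(None, 2)
        rows.append((int(position), float(score), likelihood.strip()))
    return rows

rows = parse_predictions(RAW)
highly_likely = [pos for pos, _, label in rows if label.startswith("Highly")]
best = max(rows, key=lambda r: r[1])
print(highly_likely, best[0])
```

Only positions 400 and 4300 are labelled highly likely, and 4300 has the top score of 1.140.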

