CAGR LAB


Computational Analysis of Gene Regulation & Genome Rearrangement


Research | People | Rotation Projects | Graduate Group


RESEARCH ACTIVITIES

Gene regulation is critical to life. Transcriptional regulation is an important component of overall gene regulation, and a better understanding of it may lead to therapeutic intervention in disease processes. With the availability of several genomes and genome-wide datasets, it is now possible to investigate transcriptional regulation using computational approaches.

PolII promoter prediction

The PolII enzyme transcribes all protein-coding and several non-coding genes in mammals. The genomic region of a few hundred base pairs immediately flanking the transcription start site is called the core promoter. Gross core promoter attributes such as common core motifs (TATA-box, CAAT-box, etc.) and CpG islands are not sufficiently informative to provide the required PolII location specificity. We have developed a method that exploits motif positions and inter-motif distances in the promoter region to improve promoter modeling, as these attributes are necessary for proper interaction among the transcription factor proteins and PolII. For lack of one superior tool, and as a practical compromise, we have recently integrated several state-of-the-art tools using an artificial neural network. There does not seem to be a good general promoter model. By recognizing functionally relevant promoter subclasses and developing subclass-specific models, we are likely to improve promoter identification accuracy.

Figure 1. General transcription factors, the associated factors, the pre-initiation complex, the core promoter region and distal elements forming a module. CAAT: CAAT-box, GC: GC-box, INR: initiator signal, DPE: downstream promoter element, TBP: TATA binding protein, TAF: TBP associated factors.

1. Hannenhalli, S. and S. Levy (2001). "Promoter prediction in the human genome." Bioinformatics 17 Suppl 1: S90-6.

2. Wang, J. and S. Hannenhalli (2005). "Generalizations of Markov model to characterize biological sequences." BMC Bioinformatics 6(1): 219.

3. Wang, J. and S. Hannenhalli (2006). "A mammalian promoter model links cis elements to genetic networks." Biochem Biophys Res Commun 347(1): 166-177.

4. Wang, J., Ungar, L., Tseng, H. and S. Hannenhalli (2007). "MetaProm: a neural network based meta-predictor for alternative human promoter prediction." BMC Genomics 8: 374.

 

Identification of transcription factor binding sites

Transcription factors (TFs) bind to short and often degenerate DNA motifs. From an information-theoretic viewpoint, there is insufficient information in these motifs to identify the binding sites accurately. Evolutionarily conserved non-coding regions may be under purifying selection and are thus likely to be functional. We (and others) have used evolutionary conservation, an approach known as phylogenetic footprinting, to reduce false positives in binding site recognition. The position weight matrix (PWM) is the most common representation of the DNA binding specificity of a TF.
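To make the PWM representation concrete, here is a minimal sketch of how a log-odds PWM can be built from aligned binding sites and used to scan a sequence. It is an illustration only, not our published code: the function names, the uniform background, the pseudocount and the threshold are all assumptions chosen for the example.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log-odds PWM (one dict per position) from equal-length sites.

    Assumptions for this sketch: uniform background unless one is given,
    and a flat pseudocount added to every base count.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background[b]) for b in BASES})
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds; positions are treated as independent."""
    return sum(column[base] for column, base in zip(pwm, window))

def scan(pwm, sequence, threshold):
    """Return (offset, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        s = score(pwm, sequence[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits
```

Because the motif is short, a low threshold will report many chance matches in a long sequence, which is the false-positive problem that conservation filtering is meant to reduce.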

We have closely investigated the degenerate motifs and found that the binding sites for a TF often fall into distinct clusters. By modeling the TF's DNA binding as a mixture of PWMs instead of a single PWM, we can predict the binding sites more accurately. The biological relevance of these clusters is, however, not known. One possibility is that the clusters correspond to different contexts in which the TF binds, for instance, different interaction partners. This requires further investigation.
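The mixture idea can be sketched with a small expectation-maximization (EM) loop. This is a simplified illustration under stated assumptions, not the procedure from our paper: each component is initialized from a user-supplied seed site, the pseudocount and iteration count are arbitrary, and all names are hypothetical.

```python
import math

BASES = "ACGT"

def column_probs(sites, weights, pseudocount=0.5):
    """Weighted per-position base probabilities (a probability-form PWM)."""
    probs = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site, w in zip(sites, weights):
            counts[site[i]] += w
        total = sum(counts.values())
        probs.append({b: counts[b] / total for b in BASES})
    return probs

def site_likelihood(probs, site):
    """Probability of a site under one PWM, assuming independent positions."""
    return math.prod(column[base] for column, base in zip(probs, site))

def fit_mixture(sites, seeds, iterations=50):
    """Fit a mixture of PWMs by EM; `seeds` (an assumption of this sketch)
    are example sites used to initialize one component each."""
    k = len(seeds)
    components = [column_probs([seed], [1.0]) for seed in seeds]
    mix = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each site.
        resp = []
        for site in sites:
            joint = [mix[j] * site_likelihood(components[j], site) for j in range(k)]
            z = sum(joint)
            resp.append([x / z for x in joint])
        # M-step: re-estimate each PWM and its mixing weight.
        for j in range(k):
            w = [r[j] for r in resp]
            components[j] = column_probs(sites, w)
            mix[j] = sum(w) / len(sites)
    return mix, components

def mixture_log_likelihood(mix, components, site):
    """Log-probability of a site under the fitted mixture."""
    return math.log(sum(m * site_likelihood(c, site) for m, c in zip(mix, components)))
```

When the sites really do fall into distinct clusters, a site from one cluster typically receives a higher likelihood under the mixture than under a single PWM averaged over all sites.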

Another possible reason for relatively precise in vivo TF-DNA binding despite a degenerate binding motif is that a TF's binding depends not only on its own motif but also on the presence or absence of other motifs. This is reasonable because TFs almost always act as a physically interacting group. Such an interaction-dependent model of TF-DNA binding has shown some promise.

The PWM model assumes independence between distinct positions within a binding site. Because multiple bases can interact with a single TF residue, this may not be the case. If two positions are indeed interdependent, then a chance mutation at one position is likely to change the selection pressure at the other position. We have compared the evolutionary patterns at pairs of positions within TF binding sites (TFBS) and have found a prevalence of interdependence. Better evolutionary models as well as better controls are however needed to rule out epistasis among functionally relevant positions.
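One simple way to look for dependence between two positions is the mutual information between the corresponding columns of aligned binding sites. Note that this is a different and much cruder statistic than the comparison of evolutionary patterns described above; it is shown only as an assumed, illustrative stand-in, and it does not correct for small sample sizes.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Mutual information (in bits) between positions i and j of aligned sites.

    Zero means the two columns look independent in this sample; larger values
    mean knowing the base at one position helps predict the other.
    """
    n = len(sites)
    count_i = Counter(site[i] for site in sites)
    count_j = Counter(site[j] for site in sites)
    count_ij = Counter((site[i], site[j]) for site in sites)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi
```

With few sites, chance alone produces nonzero values, so in practice the observed value would need to be compared against a shuffled or otherwise appropriate control.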

Ultimately, any sequence-only approach to analyzing transcription is limited by the fact that transcription is a highly dynamic process while sequence is static. Future work must incorporate TF protein levels, the epigenomic state of the DNA, and the post-translational modification status of histones and TFs.

1.      Levy, S. and S. Hannenhalli (2002). "Identification of transcription factor binding sites in the human genome sequence." Mamm Genome 13(9):

2.      Hannenhalli, S. and L. S. Wang (2005). "Enhanced position weight matrices using mixture models." Bioinformatics 21 Suppl 1: i204-i212.

3.      Wang, L. S., S. T. Jensen and S. Hannenhalli (2005). "An Interaction-dependent model for transcription factor binding." RECOMB-Regulatory Genomics, Lecture Notes in Bioinformatics (to appear).

4.      Vardhanabhuti, S., Wang, J. and S. Hannenhalli (2007). "Position and distance specificity are important determinants of cis-regulatory motifs in addition to evolutionary conservation." Nucleic Acids Res 35: 3339-3354.

 

TF-TF interaction and Cis Regulatory Module

Transcription factors do not act alone but as groups of interacting TFs, known as cis regulatory modules (CRMs), that co-regulate functionally related genes. Identifying TF-TF interactions and CRMs is thus important. We have exploited genome-wide co-occurrence of binding sites for specific TF pairs as an indication of their interaction. Consider a bipartite graph of genes and TFs where a TF-gene edge indicates that the gene might be regulated by the TF. A completely connected subgraph, or bipartite clique, is likely to represent a CRM. Applications of this approach to several biological contexts have yielded useful results. We have recently extended this to finding dense subgraphs in a multi-partite graph (see figure), where the various parts may represent functional annotation and expression profiles of the genes; this provides further biological interpretation of the detected CRMs. This approach can also be applied to detect subclasses of a motif where each subclass regulates a different set of functionally related genes.
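The bipartite clique idea can be illustrated with a brute-force sketch: for every small group of TFs, intersect their putative target gene sets and keep the groups that share enough genes. This exhaustive enumeration is only practical for small inputs and is not the stochastic dense-subgraph search used in our work; the function name, the parameters and their defaults are assumptions made for the example.

```python
from itertools import combinations

def find_bicliques(tf_targets, min_tfs=2, min_genes=2, max_tfs=3):
    """Enumerate bipartite cliques in a TF-gene graph.

    tf_targets maps each TF name to the set of genes it may regulate.
    Returns (tf_group, shared_genes) pairs in which every TF in the group
    has an edge to every shared gene. Cliques are not filtered for maximality.
    """
    results = []
    tfs = sorted(tf_targets)
    for size in range(min_tfs, max_tfs + 1):
        for group in combinations(tfs, size):
            shared = set.intersection(*(tf_targets[tf] for tf in group))
            if len(shared) >= min_genes:
                results.append((group, shared))
    return results
```

For example, with hypothetical inputs `{"TF_A": {"g1", "g2", "g3"}, "TF_B": {"g2", "g3", "g4"}, "TF_C": {"g5"}}`, the only group returned is `("TF_A", "TF_B")` with shared genes `g2` and `g3`. The number of TF groups grows combinatorially with `max_tfs`, which is one reason heuristic search is needed at genome scale.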

Figure 2. A multi-partite graph representing genes, the tissues in which the genes are expressed, and the TFs that may regulate the genes. A dense tri-partite subgraph in this network represents a CRM that regulates tissue-specific expression.

1.      Hannenhalli, S. and S. Levy (2002). "Predicting transcription factor synergism." Nucleic Acids Res 30(19): 4278-84.

2.      Hannenhalli, S. and S. Levy (2003). "Transcriptional regulation of protein complexes and biological pathways." Mamm Genome 14(9): 611-9.

3.      Keeley, M. B., M. A. Wood, C. Isiegas, J. Stein, K. Hellman, S. Hannenhalli and T. Abel (2006). "Differential transcriptional response to nonassociative and associative components of classical fear conditioning in the amygdala and hippocampus." Learn Mem 13(2): 135-42.

4.      Hannenhalli, S., M. E. Putt, J. M. Gilmore, J. Wang, M. S. Parmacek, J. A. Epstein, E. E. Morrisey, K. B. Margulies and T. P. Cappola (2006). "Transcriptional genomics associates FOX transcription factors with human heart failure." Circulation 114(12): 1269-76.

5.      Everett, L., L. S. Wang and S. Hannenhalli (2006). "Dense subgraph computation via stochastic search: application to detect transcriptional modules." Bioinformatics 22(14): e117-e123.

6.      Singh, L.N., Wang, L.S., and S. Hannenhalli (2007). "TREMOR – A tool for retrieving transcriptional modules by incorporating motif covariance." Nucleic Acids Res (in press).

 

Evolution of transcriptional regulation

The links between chance DNA mutations and organismal evolution are of fundamental interest. These links are mediated by systems-level interactions between genes, all the way up to the interaction between an individual and its environment. Because such a systems-level understanding is lacking, investigations so far have been gene-centric, asking, for example, what determines the fate of a duplicated gene. The relationship between expression divergence and protein sequence divergence among paralogs has been investigated by several researchers. Expression and coding sequence represent two modes of divergence; relationships between other modes of divergence, especially those with quantifiable functional consequences, will elucidate the selection pressures acting during the evolution of a gene family. We have found that for TF gene paralogs the expression divergence is inversely related to the divergence in their DNA binding motifs. We will extend this analysis to several other modes of divergence. We are working towards a close investigation of the evolution of developmentally important regulatory networks based on the duplication and diversification of individual TFs in the network.

Figure 3. Scatter plots showing the correlation between the DNA binding similarity of a pair of human TFs and the expression divergence of the corresponding TF genes. The greater the expression divergence, the smaller the DNA binding divergence (that is, the greater the similarity).

Natural selection on cis elements

Polymorphisms in the non-coding portion of the human genome are likely to underlie a significant component of the phenotypic variability among humans and between humans and other primates. If so, these genomic regions may be undergoing rapid evolutionary change, due in part to natural selection. However, the non-coding region is a heterogeneous mix of functional elements, each under potentially different selection regimes. Our preliminary genome-scale investigation of natural selection specifically on putative transcription factor binding sites in human proximal promoters, based on HapMap and Perlegen SNP data and several population-genetic techniques, indicates that a sizable portion of human-specific and primate-specific binding sites may be evolving under positive selection, while the sites conserved between primates and rodents are likely to be under purifying selection. Furthermore, a larger-than-expected fraction of high-frequency derived alleles in the human-specific sites yields a binding site gain as opposed to a loss. A closer look at these cases, coupled with experimental validation, may provide insights into human adaptation.

1.      Singh, L.N. and S. Hannenhalli (2008). "Functional diversification of paralogous transcription factors via divergence in DNA binding site motif and in expression" PLoS ONE (in press). 

Computational analysis of genome rearrangements

Genome rearrangement events such as inversions, transpositions, translocations and duplications present modes of evolution alternative to single-base substitutions or small insertions/deletions. Besides being part of the normal evolutionary process, these genome shuffling events are also common in various human diseases, including cancer. There has been some debate over whether genomic breakages are randomly distributed or whether there are breakage hotspots.

Figure 4. (a) The region between human markers X and Y is conserved in chimpanzee, dog and chicken but is disrupted in mouse and rat. (b) Under the assumption of parsimony, the breaks between X and Y can be explained by a single break in the rodent lineage (denoted by the dashed line). The region between human markers A and B in (a) is conserved in mouse, dog and chicken but disrupted in chimpanzee and rat. This can only be explained by two ‘independent’ breaks in the chimpanzee and rat lineages (denoted by dotted lines).

Recently we carried out a multiple-species analysis in which we localized a large number of genomic breaks to specific lineages, and we found a prevalence of breakages of the same region in two independent lineages, i.e., hotspots. The specific genomic attributes of the hotspots are of interest, not only for evolutionary breakages but also for breakages that occur within a population, especially in congenital diseases and cancer. Finally, the effect of evolutionary genome rearrangement on gene expression remains largely unexplored at a global scale.
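The parsimony argument from Figure 4 can be sketched with Fitch's small-parsimony algorithm, which counts the minimum number of state changes needed on a tree to explain which species show a region as conserved (0) or disrupted (1) relative to human. This is an illustrative sketch, not our analysis pipeline: the nested-tuple tree below is an assumed species topology, and the method counts changes in either direction rather than breaks specifically.

```python
def fitch(tree, states):
    """Fitch small parsimony on a binary tree given as nested 2-tuples.

    Leaves are species names; `states` maps each species to 0 (conserved)
    or 1 (disrupted). Returns (possible_root_states, minimum_changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, states)
    right_states, right_changes = fitch(right, states)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # No shared state: one change is required on a branch below this node.
    return left_states | right_states, left_changes + right_changes + 1

# Assumed topology for the six species mentioned in Figure 4.
TREE = (((("human", "chimpanzee"), ("mouse", "rat")), "dog"), "chicken")

# Region X-Y: disrupted in mouse and rat only.
region_xy = {"human": 0, "chimpanzee": 0, "mouse": 1, "rat": 1, "dog": 0, "chicken": 0}
# Region A-B: disrupted in chimpanzee and rat only.
region_ab = {"human": 0, "chimpanzee": 1, "mouse": 0, "rat": 1, "dog": 0, "chicken": 0}

xy_changes = fitch(TREE, region_xy)[1]  # one event, on the rodent branch
ab_changes = fitch(TREE, region_ab)[1]  # two independent events
```

A region that needs two or more events on the tree, like A-B here, is a candidate hotspot in the sense used above.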

1.      Hannenhalli, S., C. Chappey, E. V. Koonin and P. A. Pevzner (1995). "Genome sequence comparison and scenarios for gene rearrangements: a test case." Genomics 30(2): 299-311.

2.      Hannenhalli, S. and P. Pevzner (1999). "Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals." J. ACM 46(1): 1-27. This work is mentioned in the New York Times (Science Times) article “The history of chromosome may shape the future of disease”, August 30, 2005.

3.      Hinsch, H. and S. Hannenhalli (2006). "Recurring genomic breaks in independent lineages support genomic fragility." BMC Evol Biol 6: 90.

 


PEOPLE

Sridhar Hannenhalli, Asst. Professor

Larry Singh, Postdoc

Logan Everett, Graduate student

Praveen Sethupathy, Graduate student

 

Past lab members

Junwen Wang, Postdoc (currently at U. Hong Kong)

Saran Vardhanabhuti, Research Programmer (currently Penn Biostatistics PhD student)

 

Past rotation students

    Swetha Garrimalla (CMU)

    Rithun Mukherjee

    Tom Petty

    Rumen Kostadinov

    Perry Evans

    Adam Ewing

    Le Ba Nguyen

 



ROTATION PROJECTS

CLICK HERE for the project list. Access is currently restricted to within PCBI. I do not update this list often enough, so stop by for current ideas.



GRADUATE GROUP AFFILIATION

Genomics and Computational Biology (GCB)