### DIMACS Workshop on Detecting and Processing Regularities in High Throughput Biological Data

#### June 20 - 22, 2005DIMACS Center, CoRE Building, Rutgers University

Organizer:
Laxmi Parida, IBM T J Watson Research, parida@us.ibm.com
Presented under the auspices of the DIMACS/BioMaPS/MB Center Special Focus on Information Processing in Biology.

This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).

#### Abstracts:

Gabriela Alexe, Institute for Advanced Study
Gyan Bhanot, IBM Research and Institute for Advanced Study
Arnold Levine, Institute for Advanced Study and Robert Wood Johnson School of Medicine and Dentistry, and
Gustavo Stolovitsky, IBM Research

Title: Robust diagnosis of non-Hodgkin lymphoma phenotypes validated on gene expression data from different laboratories

A major challenge in cancer diagnosis from microarray data is the need for robust, highly accurate, robust classification models which are independent of the analysis techniques used and can combine data from different laboratories. We propose such a classification scheme developed originally developed for phenotype identification from mass spectrometry data. The method uses a robust multivariate gene selection procedure and combines the results of several machine learning tools trained on raw and pattern data to produce an accurate meta-classifier. We illustrate and validate our method by applying it to distinguish diffuse large B-cell lymphoma (DLBCL) from follicular lymphoma (FL) on two independent gene expression datasets: the oligonucleotide HuGeneFL microarray dataset of Shipp et al. (www. genome.wi.mit.du/MPR/lymphoma) and the Hu95Av2 Affymetrix dataset (DallaFavera's laboratory, Columbia University). The meta-classification technique achieves higher predictive accuracies than each of the individual classifiers trained on the same dataset, is robust against various data perturbations and provides subsets of related predictive genes. Our techniques predict that combinations of some genes in the p53 pathway are highly predictive of phenotype. In particular, we find that in 80% of DLBCL cases the mRNA level of at least one of the three genes p53, PLK1 and CDK2 is elevated, while in 80% of FL cases, the mRNA level of at most one of them is elevated.

Alberto Apostolico, Purdue University and University of Padova

Title: The Unbearable Monotony of Surprises

The extraction of surprisingly recurrent words and motifs often exposes more candidates than one can afford to inspect. In some cases, scores of surprise span regions of monotonicity over which only extremal terms and values need to be considered. This talk reviews past accomplishments and future directions in this line of research.

Sudha Balla, University of Connecticut

Title: Efficient Algorithms for Motif Search

Motif search is an important problem in biology. This problem in general requires finding short patterns of interest from voluminous data. Several variants of this motif search problem have been identified in the literature. In this talk we survey algorithms that have been proposed for the following three versions:

Problem 1. - Input are $n$ sequences of length $m$ each. Input also are two integers $l$ and $d$. The problem is to find a motif (i.e., a sequence) $M$ of length $l$. It is given that each input sequence contains a variant of $M$. The variants of interest are sequences that are at a hamming distance of $d$ from $M$. (This problem is also known as the planted $(l, d)$-motif problem.)

Problem 2 . The input is a database {\em DB} of sequences $S_1,S_2,\ldots,S_n$. Input also are integers $l$, $d$, and $q$. Output should be all the patterns in DB such that each pattern is of length $l$ and it occurs in at least $q$ of the $n$ sequences. A pattern $U$ is considered an occurrence of another pattern $V$ as long as the edit distance between $U$ and $V$ is at most $d$.

Problem 3. A pattern is a string of symbols (also called residues) and ?'s. A "?" refers to a wild card character. A pattern cannot begin or end with ?. AB?D, EB??DS?R, etc. are examples of patterns. The length of a pattern is the number of characters in it (including the wildcard characters). This problem takes as input a database DB of sequences. The goal is to identify all the patterns of length at most $l$ (with anywhere from 0 to $\lfloor l/2\rfloor$ wild card characters). In particular, the output should be all the patterns together with a count of how many times each pattern occurs. Optionally a threshold value for the number of occurrences could be supplied.

J. Baussand, Université Pierre et Marie Curie
C. Derembleb, Laboratoire de Biochimie Théorique, rue Pierre et Marie
A. Carbonea, Laboratoire de Biochimie Théorique, rue Pierre et Marie

Title: Multiple alignment using hydrophobic clusters: a tool to identify and align distantly related proteins

We present the first automatic alignment method which detects and uses hydrophobic clusters, that is groups of hydrophobic amino acids which are likely to be in contact once the protein is folded, as a basic pattern to align distantly related proteins. This tool is part of a larger computational system devoted to the detection of protein-protein and protein-DNA interaction sites. The use of hydrophobic clusters as a structural pattern detectable from sequences to align distantly related proteins has already shown its efficiency. Hydrophobic Cluster Analysis [1] is a method based on hydrophobic clusters and has been successfully applied on pairs of proteins of 15-25% sequence identity to show their structural relatedness [2,3], but these examples are based on manual pairwise alignment and no program realizing an automatic sequence alignment with the help of hydrophobic clusters has been available so far. Our tool consists on an automatic multiple sequence alignment that detects hydrophobic clusters in one-dimension by a prescreening of the amino-acids sequence, and integrates this structural information in the multi-alignment procedure ClustalW [4] thanks to extra structural conditions. Analysis of 145 families of homologous proteins of less than 30% sequence identity shows that hydrophobicity is the most conserved physical-chemical-geometrical property in sequence, secondary structures and hydrophobic clusters leading to trust the use of hydrophobic positions to identify and align distantly related proteins. Since aligning weakly homologous proteins needs substitution matrices and gap penalties that take into account the structural context of residues [5,6], these families of proteins have been used to construct two suitable substitution matrices for amino-acids sitting within and without hydrophobic clusters. Performance of the matrices has been tested with a large range of gap insertion on the alignment of 8 pairs of proteins of less than 30% identity and with known structure (1 +, 4 all , 3 all ). Comparison of the results to ClustalW alignments (using Blosum 62, Blosum 30, Gonnet and HSDM matrices) confirms the accuracy of the use of hydrophobic cluster information for the alignment of distantly related proteins and encourage further developpement of the hydrophobic clusters fitting approach.

Reference

1. C. Gaboriaud, V. Bissery, T. Benchetrit and J.P. Mornon. Hydrophobic Cluster Analysis, an efficient new way to compare and analyse amino acids sequences. Febs Letters, 224:149-155,1987.
2. I.H. Hung, R.L.B. Casareno, G. Labesse, F.S. Mathews and J.D. Gitlin. HAH1 is a copper binding protein with distinct amino acids residues mediating copper homeostasis and antioxidant defense. J. Biol. Chem, 273:1749-1754, 1998.
3. C. Carret, S. Delbecq, G. Labesse, B. Carcy, E. Precigout, K. Moudri, T.P. Schetters and A. Gorenflot. Characterization and molecular cloning of an adenosine kinase from Babesia canis rossi. Eur. J. Biochem .,13:211-222,1999.
4. D. Thompson, D.G. Higgins and T.J. Gibson. ClustalW: improving the sensitivity of progressive multiple alignment through sequence weighting position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 22:4673-4680, 1994.
5. A.M. Lesk, M. Levitt, and C. Clothia. Alignment of the amino acid sequence of distantly related protein using variable gap penalties. Prot. Eng., 1:77-78, 1986
6. O. Teodorescu, T. Galor, J. Pillardy and R. Elber. Enriching the sequence substitution matrix by structural information. Proteins, 54:41-48, 2004.

Gyan Bhanot, IBM Research, and Institute for Advanced Study
Gabriela Alexe, Institute for Advanced Study
Ramakrishna Ramaswamy, Institute for Advanced Study
Lilian Chiang, Institute for Advanced Study
Jorge Lepre, IBM Research, and
Gustavo Stolovitzky, IBM Research

Title: Gene expression patterns of breast cancer phenotype revealed by molecular profiling

Among the most significant biomedical studies undertaken in the last few years is the examination of the genetic basis of breast cancer. The two dominant breast cancer phenotypes (estrogen receptor positive and estrogen receptor negative) are known to have different prognosis expectations and to respond to different therapies, although the outcome pattern is susceptible to a significant amount of variation. To address the outcome heterogeneity problem, recent extensive analyses performed on gene array data at the Stanford Genomics Breast Cancer Consortium and elsewhere revealed the existence of several novel breast cancer subtypes within the major ones and suggested that they might benefit from specialized clinical intervention.

In our study we re-investigated and validated the existence of different breast cancer phenotypes by applying robust meta-consensus clustering techniques on independent microarray datasets from NCBI's Gene Expression Omnibus database. Since accurate detection of the breast cancer subtypes might have a significant impact in patient's outcome, we further focused on the identification of libraries of comprehensive robust combinatorial biomarkers having a high specificity (above 90%) in identifying individual cancer phenotypes. By applying these robust meta-learning techniques we expect a 5-10% accuracy increase in the correct identification of the cancer phenotype versus the standard techniques.

Andrea Califano, Columbia University

Title: Discovering key modulators of genetic activity by network reverse engineering.

Several algorithms exist that can recover signaling and transcriptional interactions in the cell and arrange them into a putative topology. However, the true genetic-network topology cannot be static. Rather, it must change, dramatically at times, as a function of the specific cellular or molecular phenotype. For instance, lack of expression of a tissue-specific kinase may shut down an entire sub-network causing a significant rearrangement of the network interactions.

Thus, the recovery of a single "average" cellular network is likely an ill-posed problem. In this talk we present results from the systematic reverse-engineering of network topologies corresponding to thousands of specific molecular constraints. Reconstruction of the individual networks was performed using the ARACNE algorithm on specific subsets of ~300 B cell microarray expression profiles. The identification of constraints producing statistically significant topological rearrangements leads to the selection of candidate regulators of key network genes, such as c-MYC or BCL6, two proto-oncogenes that play an important role in B cell oncogenesis. For Myc, for instance, we show that only two GO categories are significantly enriched across putative modulators, kinases and transcription factors, accounting for close to one half of the predicted modulator pool. A synthetic network model, which explicitly represents phosphorilation and formation of transcriptional complexes, further validates the approach. The method is efficient enough to be applied to the discovery of the regulators associated with each known gene in the network.

Title: Varun: Extraction of Over-Represented Extensible Patterns

The discovery of motifs in biosequences is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. In particular, the variety of motifs described by strings that include don't care'' patterns escalates exponentially with the length of the motif, and this gets only worse if a don't care is allowed to stretch up to some prescribed maximum length. This circumstance tends to generates daunting computational burdens, and often gives rise to tables that are impossible to visualize and digest. This is unfortunate, as it seems to preclude precisely those massive analyses that have become conceivable with the increasing availability of massive genomic and protein data. While part of the problem is endemic, another part of it seems rooted in the various characterizations offered for the notion of a motif, that are typically based either on syntax or on statistics alone. It seems worthwhile to consider alternatives that result from a prudent combination of these two aspects in the model.

Results:
We introduce and study a notion of extensible motif in a sequence which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. We show that a combination of appropriate saturation conditions (expressed in terms of minimum number of don't cares compatible with a given list of occurrences) and the monotonicity of probabilistic scores over regions of constant frequency afford us significant parsimony in the generation and testing of candidate over-represented motifs.

The merits of the method are documented by results obtained in implementation, which specifically targeted protein sequence families. In all cases tested, the motif reported in PROSITE as most important in terms of functional/structural relevance emerges among the top thirty extensible motifs returned by our algorithm, often right at the top. Of equal importance seems the fact that the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constraints.

Title: False Negatives and False Positives in HTS

Traditionally a single cutoff threshold is applied to high throughput screening (HTS) results to decide which compounds should be followed upon. Consecutive retests will reveal how many of the hits can be confirmed and how many hits are false positive. Unfortunately there also may be false negatives that are never discovered and represent a lost opportunity. We have developed an informatics solution to deal with the false negative/positive compounds which takes into account many different features of the HTS. We can use mixtures of normal and uniform distributions to identify appropriate cutoff thresholds. Based on the historical screening data we define selective and cross-reactive compounds. Based on the structure we predict which compounds are more likely to be novel or appealing to medicinal chemists. We also take into account positions of compounds on plates and compensate for the variability of HTS results among wells. The results of these computational applications on a number of diverse HTS programs at Millennium will be discussed.

Rave Harpaz, CUNY

Title: Linear Manifold Embeddings of Pattern Clusters

In this paper we propose a new unifying approach for clustering gene expression data which has both theoretical and practical merit, and which is based on the concept of linear manifolds. The power of linear manifold models lies in their ability to capture various types of linear dependencies and correlations among different attributes of the data. We abstract the problem of clustering gene expression data by showing how various types of pattern clusters, including the ones based on the "bicluster" model can be embedded in arbitrarily oriented linear manifolds of various dimensions. On the theoretical side this abstraction can shed light on the important aspects of the problem. In contrast to existing methods, our approach does not require any assumptions such as the underlying distribution of the data, nor does it require to be posed in advance the type of patterns or number of clusters that are present.

Based on this insight we present an efficient and scalable algorithm which exploits the power of randomness and the geometric properties of linear manifolds, to identify pattern clusters that may be hidden in subspaces of the data. As a consequence of our unifying approach this algorithm is capable of detecting patterns that are not necessarily visible in only a subset of the original measurement features, but also patterns that may be induced by linear combinations of the original measurement features. In addition the algorithm does not require the number of clusters to be predetermined, nor the does it require data transformations as preprocessing steps to support any apriori assumptions, which in turn enables it to detect different types of patterns that may co-exit in the same data set. Another advantage of our method is that it can easily be extended to support a probabilistic description of the clusters, which besides aiding in the interpretation of the results, supports the concept of overlapping clusters. The algorithm requires two parameters, a sampling level and a goodness of fit or mean squared error threshold similar to the "biclustering" threshold, and operates by repeatedly sampling minimal subsets of points to create trial manifolds from which the one that best fits a subset of the data is selected as a candidate linear manifold embedding of a pattern cluster.

Steffen Heber, North Carolina State University

Title: Common Intervals of Graphs

Let p be a permutation of [n] = {1, 2, ?, n}. An interval of p is a set of the form {p(i), p(i+1), ..., p(j)} for 1 £ i < j £ n. For a family P = (p0, ..., pk-1) of permutations of [n] we call a subset S of [n] a common interval of P if S is an interval of pi for 0 £ i £ k-1. Common intervals have various applications in bioinformatics. They are used to detect possible functional associations between genes, to compute the reversal distance between genomes, and to define a similarity measure for gene order permutations.

Uno and Yagiura (2000) presented an optimal output-sensitive algorithm for finding all common intervals of two permutations. In Heber and Stoye (2001) and Heber and Savage (2005) this result was generalized to find common intervals in multiple permutations, multichromosomal permutations, signed permutations, circular permutations, and labeled trees. In this paper we consider the problem of finding common intervals in a family of labeled connected graphs, a generalization of the concept of common intervals in permutations or trees.

Let G = (V,E) be an undirected, connected graph with vertex set V = [n]. A subset U of V is called convex if the induced subgraph G[U] is connected, and includes every shortest path with end-vertices in H. We call a subset S of [n] an interval of G if |S| > 1 and G[S] is convex. Let G = (G0, G1, ?, Gk-1) be a family of undirected, connected graphs, each with vertex set [n ]. A common interval of G is a set S of [n] such that S is an interval of Gi for i = 0, ?, k-1.

If we restrict this approach to permutations (paths) or trees, our new definition coincides with the previous definitions of common intervals. Although graphs can have exponentially many common intervals we show that for any family of undirected, connected graphs with vertex set [n] there is a set of at most n(n-1)/2 irreducible common intervals which form a generating set for the set of all common intervals. We give a simple algorithm to compute the irreducible intervals.

Our approach allows us to apply the concept of common intervals to infer conserved modules in graph-structured data (e.g. protein-protein interaction networks, co-expression networks, gene regulation networks, etc.).

This is joint work with Dr. Carla Savage, Jiangtian Li, and Oliver Serang.

Christine Heitsch, University of Wisconsin

Title: Deciphering the Information Encoded in RNA Viral Genomes

The formation of base pairs within single-stranded RNA molecules, such as the Hepatitis C viral genome, creates structure and affects function, thereby conveying biological information. Improving our understanding of how structure and function are encoded in RNA sequences is an essential problem, particularly for efforts to cure numerous RNA-related diseases. Yet, despite the abundance of experimental sequence data now available, deciphering the coding of biological information in RNA molecules remains a fundamental scientific challenge.

Like many RNA molecules, the overall structure and function of RNA viral genomes are significantly related to the base pairings of their secondary structure. Single-stranded RNA sequences are understood to self-bond with a complex interplay between energetically beneficial stacked base pairs, or "helices," and destabilizing single-stranded structures called "loops." A nested RNA secondary structure can be modeled as a plane tree -- a rooted tree whose subtrees are linearly ordered. Plane trees nicely capture the essential arrangement of an RNA secondary structure because the different components of the tree correspond to energetically distinct types of RNA substructures. Using this combinatorial model, we address fundamental questions regarding the coding of biological information in RNA secondary structures.

We investigate the importance of regularities in the base pairing of RNA secondary structures by analyzing a related combinatorial question. Specifically, we provide a tight lower bound on the number of distinct"oriented" integers necessary and sufficient to encode the structure of a given plane tree. This result demonstrates the importance of local constraints in specifying a global structure and pertains directly to the theoretical minimum number of different complementary segments needed to encode the helices of a folded RNA molecule. Based on this result, we analyze the regularities in the encoding of viral RNA secondary structures as a means of determining how the selective hybridization of base sequences encodes important functional substructures. Likewise, we investigate means of characterizing regularities in the loop energy configurations for RNA secondary structures. In particular, we show that there are natural energy minima corresponding to plane trees maximizing number of vertices with two children. Not only does this result resolve an apparent contradiction in RNA loop energy calculations, but it leads to new analysis of the branching degrees in RNA viral genomes. We expect this and other related work to advance understanding and applications by allowing us to detect RNA viral substructures whose base pairings encode significant structural and functional information.

Kush Kapila, Matrox Imaging, Laxmi Parida, and Ruhong Zhou, IBM T. J. Watson Research, and Sudhir Mudur, Concordia University

Title: Animated Terrain Evolution for Visual Analysis of Protein Folding Trajectory

In the field of bioinformatics, the use of visualization techniques to support the analysis of voluminous data is becoming increasingly crucial to understanding the underlying biology and answering important questions. Similar to clustering of gene expression data is that of clustering the characteristics of protein folding trajectory and a good visualization tool can help visual analysis and can provide faster and deeper insights into the manner in which a protein folds. In this paper we present a visualization technique which uses the metaphor of an evolving terrain in which mountains represent protein states and the growth of these mountains their occurrences over time, to assist in greater understanding of how a protein fold was obtained. We believe that this novel 3D technique can be used effectively to analyze a protein's trajectory existing states as well as to identify new states. The technique has been implemented in the program PROTERAN and is available at www.kapila.ca\proteran.html.

Md. Enamul Karim, Laxmi Parida and Arun Lakhotia University of Louisiana at Lafayette

Title: Inferring Phylogeny using Permutation Patterns on Genomic Data

Existing approaches to inferring phylogeny using genomic data usually use inversion or break point information to measure distance matrix. We notice that the traces of evolutionary operations such as inversion, transposition etc are preserved in permutation of segments of various lengths for gene order data. This paper uses that fact to develop a novel approach to inferring phylogeny. Accuracy of this approach has been verified for simulated evolution on synthetic data and found to be highly satisfactory. We then infer phylogeny for 13 chloroplast genome arrangements including tobacco as the outgroup and compare it with published results. We also show how distance measures can be applied on permuted segments to develop a hybrid scheme.

A.E. Kister, UMDNJ and Rutgers University I.M. Gelfand, Rutgers University

Title: The structures of about eight thousand sandwich proteins can be described by few supermotifs.

The classification of the spatial organizations of secondary structures of the proteins is critical for understanding the basic principles of the packing of strands in the beta structures. On the first stage of our analysis we investigate the structural motifs of a large group of beta proteins, so called the sandwich proteins. This type of architecture unites of about 8000 proteins from very different protein superfamilies, which have no detectable sequence homology. The analysis of the arrangements of strands has revealed 65 structural supersecondary structural motifs, which describe all sandwich proteins. The architectural properties of these proteins are dictated by a set of rules. These rules imply the existence of certain invariant supersecondary structure, which we called "interlock". The aim of the second stage of analysis is to find the main regularities of the packing of strands in the structural motifs. In this work we introduce a new supersecondary unit - the set of consecutive antiparallel strands connected by the hydrogen bonds. Here, we call this set a "strandon". The examination of the arrangement of the strandons reveals 10 "supermotifs" which describe about 90% of all known currently sandwich-like structures. Moreover, all of these arrangements follow specific regularities, which can serve as the background for understanding protein topology. These constraint rules allow one to determine all permissible motifs of the sandwich-like proteins.

Michael A. Langston, University of Tennessee and Oak Ridge National Laboratory

Title: Novel Computational and Integrative Tools for the Analysis of Gene Co-Expression Data

Powerful computational techniques born of fixed-parameter tractability are applied to high-throughput biological data, with the analysis of microarray data serving as a primary goal. Using mRNA samples obtained from recombinant inbred Mus musculus strains, we solve immense instances of the NP-complete clique problem to derive sets of putatively co-regulated genes. The depth of quantitative genetic analysis we can perform is vastly enhanced by integrating these results with knowledge of cis-regulatory elements, ontological classifications, and causal structures that may be imposed with quantitative trait locus mapping. Techniques for dealing with noisy data are important concerns. A long-term aim is gene regulatory network discovery.

Relevant citations:

• M. A. Langston, L. Lin, X. Peng, N. E. Baldwin, C. T. Symons, J. R. Snoddy and B. Zhang, "A Combinatorial Approach to the Analysis of Differential Gene Expression Data: The Use of Graph Algorithms for Disease Prediction and Screening," Proceedings, International Conference for the Critical Assessment of Microarray Data Analysis (CAMDA), Durham, North Carolina, November, 2003.
• F. N. Abu-Khzam, R. L. Collins, M. R. Fellows, M. A. Langston, W. H. Suters and C. T. Symons, "Kernelization Algorithms for the Vertex Cover Problem: Theory and Experiments," Proceedings, Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, Louisiana, January, 2004.
• N. E. Baldwin, R. L. Collins, M. A. Langston, M. R. Leuze, C. T. Symons and B. H. Voy, "High Performance Computational Tools for Motif Discovery," Proceedings, IEEE International Workshop on High Performance Computational Biology (HiCOMB), Santa Fe, New Mexico, April, 2004.
• N. E. Baldwin, E. J. Chesler, S. Kirov, M. A. Langston, J. R. Snoddy, R. W. Williams and B. Zhang, "Computational, Integrative and Comparative Methods for the Elucidation of Genetic Co-Expression Networks," Journal of Biomedicine and Biotechnology, accepted for publication.
• E. J. Chesler, L. Lu, S. Shou, Y. Qu, J. Gu, J. Wang, H. C. Hsu, J. D. Mountz, N. E. Baldwin, M. A. Langston, J. B. Hogenesch, D. W. Threadgill, K. F. Manly and R. W. Williams, "Genetic Dissection of Gene Expression Reveals Polygenic Networks Modulating Brain Structure and Function," Nature Genetics, accepted for publication.

Cindy Y. Li, Leszek Gasieniec, Paul Sant, and Prudence W.H. Wong

Title: Efficient probe selection in microarray design

The DNA microarray technology, originally developed to measure the level of gene expression, becomes one of the most widely used tools in genomic study. Microarrays have been proved to benefit areas including gene discovery, disease diagnosis, and multi-virus discovery. The crux of microarray design lies in how to select a unique probe that distinguishes a given genomic sequence from other sequences. However, in cases that the existence of a unique probe is unlikely, e.g., in the context of a large family of closely homologous genes, the use of a limited number of non-unique probes is still desirable.

Because of its significance, probe selection attracts a lot of attention. Various probe selection algorithms have been developed in recent years. Good probe selection algorithms should produce as small number of candidate probes as possible. Efficiency is also crucial because the data involved is usually huge. Most existing algorithms usually select probes by filtering, which is usually not selective enough and quite a large number of probes are returned. We propose a new direction to tackle the problem and give an efficient algorithm to select (randomly) a small set of probes and demonstrate that such a small set of probes is sufficient to distinguish each sequence from all the other sequences. Based on the algorithm, we have developed a probe selection software, which runs efficiently and effectively in practice.

A number of experiments have been carried out and the results will be discussed.

Bud Mishra, Courant and NYU Biology

Title: Comparative genomics: The lion, the leopard, the wolf or the boar, Why not more?

I will discuss a novel algorithm, PRIZM, for comparing two large and distant genomes efficiently and accurately, even when the genomes are highly dissimilar, e.g. homology level around 65%. This algorithm incorporates background knowledge about genome evolution (using various priors based on Juke-Cantor one parameter and Kimura's two parameters models of evolution). Accordingly, it can achieve a remarkable efficiency; for instance, it takes about 30 seconds to compare mouse chromosome 10 with rat chromosome 1 to find a shared syntenic region 20Mb long at the beginning of the two chromosomes, even when the rest of the chromosomes are similar only with a homology level of 40%. A short demo is promised.

David W. Mount, University of Arizona

Title: Mining and Pattern Analysis in Large Data Sets for Biological Information

We are developing computational tools to enable identification of patterns in high throughput data that have biological meaning. For example, a list of genes that are over- or under-expressed in cancer cells or tissues based on microarray data are analyzed using existing regulatory and metabolic pathway databases. We develop interactive graphs that reveal all of the known relationships and these graphs may then be filtered in order to focus on the most important ones. Similarly, the same gene lists are examined for other indicators of biological activity, including gene ontologies, gene-gene, and protein-protein interaction data (see http:://www.biorag.org ). These methods are particularly useful for identifying genes that play a central regulatory role, as in cancer cells, for example. These genes may then become drug targets if the cancer cells are dependent on them for growth and if there is no backup function as there is in normal cells due to the high loss of genetic information in cancer cells. These methods are being extended to use state of the art statistical software e.g. mixed models in order to find genes that are most likely to be changing between cancer cells and normal cells. We are also trying to estimate the expected regulatory interactions among genes that are key regulatory changes in cancer progression based on cancer models and promoter sequence data as prior information. . The goal is to find genetic changes that are predictive for cancer progression or that provide suitable drug targets for therapeutic intervention.

T. M. Murali, Deept Kumar, Greg Grothaus, Maulik Shukla, Jamie Garst, Graham Jack, Richard Helm, Malcolm Potts, and Naren Ramakrishnan, Virginia Polytechnic Institute and State University

Title: ActiveNetworks: A Framework for Cross-Condition Analysis of Functional Genomic Data

The increasing availability of diverse types of molecular interaction data sets and measurements of molecular profiles opens up the possibility of obtaining improved understanding of cellular processes by an integrated analysis of these data sets. We describe a general computational framework called ActiveNetworks that integrates and analyses such data sets to (i) detect cellular networks that are activated in a particular cell state or in response to an external stimulus and (ii) determine core cellular networks that are activated in response to multiple stimuli.

Our system starts by consolidating datasets that contain information on known molecular interactions into a universal network. When queried with a molecular profile measured after application of an external stimulus to the cell, our system searches the universal network using an algorithm for finding heavily-weighted subgraphs. This search retrieves an ActiveNetwork, which is a sub-network whose constituent molecules act in concert (as determined by the combined similarity of their measured profiles) to determine the cell's response to the stimulus. Given molecular profiles for multiple stimuli, we compute ActiveNetworks for each stimulus. Finally, we use an algorithm that detects frequent subgraphs to simultaneously analyse all computed ActiveNetworks. This analysis (i) identifies core ActiveNetworks that are activated in multiple stimuli and (ii) determines variations in cellular responses to different stimuli. We apply this approach to Saccharomyces cerevisiea by integrating protein-protein interactions and protein-DNA binding data with a stress-response gene expression data set.

Richard Mushlin and Aaron Kershenbaium, IBM T.J. Watson Research Center

Title: An Algorithm for Exploring Patterns in Clinical Genomic Data

Many problems in clinical genomics can be analyzed in terms of finding patterns in clinical genomic data. As high throughput techniques become more widely available, the need for algorithms capable of handling large quantities of data become increasingly important. This includes algorithms for finding the patterns as well as ones to explore the patterns as a group and draw general conclusions from the entire collection of patterns found. We present a general model for clinical genomic data in terms of a bipartite graph representing the relationships between people (cases and controls) and features (e.g., genetic alleles, clinical history). The basic problem is to identify significant feature patterns (i.e., collections of features shared by a significantly different percentage of cases and controls). These are represented as bi-cliques in the bipartite graph. We present an algorithm capable of finding maximal bi-cliques in such graphs. The general problem is NP-complete, but our algorithm trades computational complexity against the tightness of the guarantee for solution quality, allowing the user to obtain high quality results even for problems of realistic size (thousands of people and features). The algorithm is based on the widely used technique of building cliques from "singletons" containing individual features and taking unions of feature sets while taking intersections of support sets (people sharing all the features). Its scalability lies in carefully doing a directed search based on figures of merit (e.g., Q-values, likelihood ratios, support set size, etc.) associated with the feature sets.

Given the bi-cliques found by the algorithm above, we then create a lattice of patterns based on meet and join functions defined in terms of unions and intersections of the feature and support sets in the bi-cliques. This allows us to explore groups of patterns close to one another in the lattice and recognize more general patterns in the overlaps among the bi-cliques. This reduces the number of patterns to explore and also allows us to relax the strict requirement that the patterns be cliques; i.e., it is possible to examine patterns where most of the people share most of the properties rather than requiring that all people share all properties. We have implemented this approach and present computational experience using it on problems of realistic size.

M.D. Prokhorov and V.I. Ponomarenko, Saratov Department of the Institute of RadioEngineering and Electronics of Russian Academy of Sciences and
V.I. Gridnev, Institute of Cardiology, Chernyshevskaya

Title: Detecting synchronization between the signals from multivariate and univariate biological data

Complex rhythmic processes interacting with each other are typical in living organisms [1]. These rhythms have different origin and take place at subcellular and cellular level, as well as in several organs and in intact organisms. It is known that interaction between self-oscillating processes and biological rhythmic processes, in particular, can result in their synchronization [2]. The concept of synchronization is widely used in experimental studies and the modeling of interaction between different biological systems demonstrating oscillating behavior [3]. As the result of synchronization, the system dynamics can become more regular and the presence or absence of synchronization can reflect healthy dynamics.

In our study we try to access the following problem: suppose we observe a system with a complex structure that is not known exactly, and record multivariate or univariate data at its outputs. Our goal is to extract information on the interaction of some subsystems or rhythms within the system and to find out whether these subsystems are synchronized or not. We have shown that synchronization between different rhythmic processes can be detected even from the analysis of univariate data. To derive the rhythmic components from the complicated signal and to extract their instantaneous phases and frequencies we apply the three different methods using bandpass filtration and the Hilbert transform, empirical mode decomposition and the Hilbert transform, and wavelet transform. These methods can be successfully applied for a variety of data of different nature.

The main ideas of our research are illustrated with the data of the human cardiovascular system as an example. We investigate synchronization between the three main rhythmic processes governing the cardiovascular dynamics in humans, namely, the main heart rhythm, respiration and the process of low-frequency regulation of blood pressure and heart rate. The synchronization is studied based on the analysis of simultaneously measured signals of respiration, blood pressure and electrocardiogram and on the analysis of univariate data in the form of the heartbeat time series. For the cases of spontaneous respiration and paced respiration with a fixed frequency and linearly increasing frequency, the analysis of the experimental records reveals synchronous regimes of different orders n:m between all the three main rhythms. It is shown that the presence and duration of synchronization between the main rhythms operating within the cardiovascular system can be used for diagnostics of its state.

This work is supported by the Russian Foundation for Basic Research, Grant No. 03v02v17593, U.S. Civilian Research Development Foundation for the Independent States of the Former Soviet Union, Award No. RECv006, and INTAS, Grant No. 03v55v920.

References

[1] L. Glass, M.C. Mackey, From Clocks to Chaos: The Rhythms of Life (Princeton University Press, Princeton, 1988). [2] A. Pikovsky, M. Rosenblum, J. Kurths, Synchronization: A Universal Concept in Nonlinear Science (Cambridge University Press, Cambridge, 2001). [3] M.D. Prokhorov, V.I. Ponomarenko, V.I. Gridnev, M.B. Bodrov, A.B. Bespyatov "Synchronization between main rhythmic processes in the human cardiovascular system", Phys. Rev. E, 2003, V.68, 041913.

Ajay Royyuru, IBM T J Watson Research

Title: The Genographic Project

National Geographic and IBM are embarking on a landmark five-year study that will map how mankind populated the Earth ( http://nationalgeographic.com/genographic). It is the most ambitious genetic anthropology research initiative ever. The Genographic Project will use sophisticated computer analysis of DNA contributed by hundreds of thousands of people - including indigenous populations and the general public - to reveal mankind's migratory history and to better understand the connections and differences that make up the human species. With global field science support from the Waitt Family Foundation, the global team of researchers from National Geographic and IBM, and leading scientists from 10 institutions worldwide, will conduct ground breaking field and laboratory research. An Advisory Board of leading international authorities and representatives from indigenous communities will oversee the testing and research protocols. It will result in one of the largest collections of DNA samples ever assembled, culminating in a virtual museum of human history for all to access.

Andrey Rzhetsky, Columbia University

Title: Chains of collective reasoning in biological publications

I will discuss probabilistic modeling of molecular interaction data extracted from a large corpus of full-text research articles.

David Sankoff, University of Ottawa

Title: Statistical and algorithmic approaches to genome rearrangements based on sequence data

The key to comparing whole genome sequences in order to study evolutionary rearrangements such as inversion and translocations is being able to partition each genome into segments conserved by two genomes since their divergence. To do this requires a method for disregarding ÒnoiseÓ, including small segments below some level of resolution. Given the partition, and the resulting order of segments along the chromosomes of the two genomes, there are a number of methods available to infer aspects of the evolutionary divergence history. In the higher animals and many other eukaryotes, however, the partitioning step is inevitably somewhat problematic, because our results are strongly dependant on level of resolution we choose for the partition. This is because the number of segments keeps increasing as the resolution becomes finer, and the analyses depend in complex ways on the new, albeit small, segments. These segments, however, may be due not only to rearrangements, but also to retrotranspositions, paralogy and other repetitive and/or inserted sequences not removed by repeat masking, as well as to deletions, conversion and uneven sequence divergence.

In this paper we investigate the quantitative nature and analytical consequences of resolution dependence in the comparison of complete animal genome sequences. We develop our data sets on syntenies and segments from the continuously updated output of the UCSC Genome Browser. We first derive an estimator for the number of reciprocal translocations responsible for the number of conserved syntenies (chromosomes in the two genomes sharing at least one segment) between two genomes, and use simulations to show that the bias and standard deviation of this estimator are less than 5 %, under two models of random translocation, with and without strict conservation of the centromere. By contrasting the number of conserved segments with the number of conserved syntenies, we can also estimate the number inversions or other intra-chromosomal events. Our results include:

A loss of stability of the data at resolutions starting at about 100 Kb.
A highly variable proportion of translocations relative to inversions across lineages, for a fixed level of resolution.
The relative stability of translocation rates compared to inversion rates, as resolution is refined.
The absence of correlation between accumulated rearrangements and chronological time elapsed, especially beyond 20 Myr.

We then discuss the trade-off between decreased resolution, with its relatively stable data, and artifacts caused by poorer resolution when applying rearrangement-inferring algorithms, focusing on the notion of breakpoint re-use introduced by Pevzner and Tesler. Analytic and simulation methods show that re-use is sensitive to the proportion of segments excluded from a partition. Exclusion of a significant proportion of segments renders the order of the remaining segments indistinguishable from a random order.

Ugur Sezerman, Sabanci University, Computational Biology Laboratory, 34956 Tuzla, Istanbul, TURKEY

Title: Predicting Novel Genes and Interactions using the Scale Free Behavior of Genetic Networks

Modeling the interacting genes over the whole genome as a graph constitutes a type of genetic networks. We propose a graph matching as a method by which we examine the compatibility among the genetic networks of different organisms. Given a plausible mapping between the graphs, the missing nodes might indicate the existence of unidentified genes while the missing links might represent unidentified interactions. Properties of genetic networks are explored and a suitable inexact sub-graph matching algorithm is developed. The networks are shown to exhibit scale free behaviour. A search base graph matching tool is developed exploiting distinguishing properties of scale free networks for limiting the search space. Genetic networks of two inherently similar bacteria Mycoplasma Pneumonia and Mycoplasma Genitalium are matched as a test case which reported missing genes and interactions.

Frank Suits and Laxmi Parida, IBM T J Watson Research

Title: Analysis of High Bandwidth Molecular Dynamics Results from the Blue Gene Project

The IBM Blue Gene project started in 1999 with the aim to construct the world's most powerful supercomputer, and use it to study protein folding. Currently both the machine and the software are generating scientific results at a high rate, placing demands on the infrastructure needed to store and analyze the streams of emitted data. This talk describes some of the molecular dynamics work being done as part of the Blue Gene project, and the corresponding analysis techniques required to process the large volumes of data. In addition, the storage and data management requirements will be discussed. Although this talk is specific to molecular dynamics work, many of the issues discussed are relevant to other scientific efforts involving high bandwidth data streams, and the resulting reduction and storage of information needed to manage and interpret the results.

Jason Wang, NJIT

Title: A method for aligning RNA secondary structures and its application to RNA motif detection

We present here an efficient tool called RSmatch for aligning RNA secondary structures and for motif detection. Motivated by widely used algorithms for RNA folding, we decompose an RNA secondary structure into a set of atomic structure components that are further organized by a tree model to capture the structural particularities of RNA. RSmatch can find the optimal global and local alignment between two RNA secondary structures using two scoring matrices, one for single-stranded regions and the other for double-stranded regions. The time complexity of RSmatch is O(mn) where m is the size of the query structure and n that of the subject structure. When applied to searching a structure database, RSmatch can find similar RNA substructures, and is capable of conducting multiple structure alignment and iterative database search. Therefore it can be used to identify functional RNA motifs. The accuracy of RSmatch is tested by experiments using a number of known RNA structures, including simple stem loops and complex structures containing junctions. This tool shall be useful to researchers interested in comparing RNA structures obtained from wet lab experiments or RNA folding programs.

The talk is based on the following paper:
J. Liu, J.T.L Wang, J. Hu and B. Tian. A method for aligning RNA secondary structures and its application to RNA motif detection. BMC Bioinformatics 2005, 6:89.

Gabriela Alexe, Institute for Advanced Study
Sorin Alexe, Rutgers University
D. E. Axelrod, Rutgers University
P. L. Hammer, Rutgers University and
D. Weissmann, Robert Wood Johnson University Hospital

Title: Logical Analysis of Diffuse Large B-Cell Lymphomas

The goal of this study is to reexamine the oligonucleotide microarray dataset of Schipp et al. (www.genome.wi.mit.du/MPR/lymphoma), which contains the intensity levels of 6817 genes of 58 patients with diffuse large B-cell lymphoma (DLBCL) and 19 with follicular lymphoma (FL), by means of the combinatorics, optimization and logic based methodology of LAD (Logical Analysis of Data).

For the differential diagnosis of DLBCL vs. FL, a model based on 8 significant genes is constructed and shown to have a sensitivity of 94.7% and a specificity of 100% on the test set. For the prognosis of good vs. poor outcome among the DLBCL patients, a model is constructed on another set consisting also of 8 significant genes, and shown to have a sensitivity of 87.5% and a specificity of 90% on the test set. The genes selected by LAD also work well as a basis for other kinds of statistical analysis, indicating their robustness.

Beside the construction of these two models whose accuracies compare favorably to those in the original study, the current study also provides a ranking by importance of the genes in the selected significant subsets, as well as a library of dozens of combinatorial biomarkers (i.e. pairs or triplets of genes) which can serve as collections of mathematically generated, statistically significant research hypotheses in need of biological explanation.

Ho-Leung Chan, Tak-Wah Lam, Wing-King Sung, Prudence W.H. Wong, Siu-Ming Yiu, University of Liverpool

Title: Locating conserved genes in whole genome scale

Given the genomes of two related species, one important task is to uncover and locate the conserved genes, i.e., genes sharing similar functions. This task is non-trivial as most parts of a genome are non-coding areas and the information of genes of each genome is often not available. Delcher et al. had given a very useful observation: A conserved gene rarely contain the same entire sequence in the two genomes, but it shares a lot of short common substrings that are unique to this conserved gene (MUM pair). However, not every MUM pair corresponds to conserved genes. The key step is how to select the right pairs.

Different approaches have been proposed to select the right MUM pairs. Earlier work uses longest common subsequence (LCS) approach. Some other work models the problem as a combinatorial clustering problem. Recently a new structural optimization problem related to mutations is proposed. In this talk, different approaches will be reviewed and the results will be presented.

Mohammed Zaki, RPI

Title: TriCluster: Mining Coherent Clusters in 3D Microarray Data

Traditional clustering algorithms work in the full dimensional space, which consider the value of each point in all the dimensions and try to group the similar points together. In biclustering if some points are similar in only a subset of dimensions (a subspace), they will be clustered together in that subspace. Biclustering has proved of great value for finding the interesting patterns in the microarray expression data, which records the expression levels of many genes (the rows/points), for different biological samples (the columns/dimensions).

Besides biclustering along the gene-sample dimensions, there has been a lot of interest in mining gene expression patterns across time. The previously proposed approaches are also mainly two-dimensional, i.e., finding patterns along the gene-time dimensions. We, on the other hand, are interested in mining tri-clusters, i.e., mining coherent clusters along the gene-sample-time (temporal) or gene-sample-region (spatial) dimensions.

We present a novel, efficient, deterministic, triclustering method called TriCluster, that addresses the above challenges. The key features of our approach include: 1) We mine only the maximal triclusters satisfying certain homogeneity criteria. 2) The clusters can be arbitrarily positioned anywhere in the input data matrix and they can have arbitrary overlapping regions. 3) We use a flexible definition of a cluster which can mine several types of triclusters, such as triclusters having identical or approximately identical values for all dimensions or a subset of the dimensions, and triclusters that exhibit a scaling or shifting expression values (where one dimension is a approximately constant multiple of or is at an approximately constant offset from another dimension, respectively). 4) TriCluster is a deterministic and complete algorithm, which utilizes the inherent unbalanced property (number of genes being a lot more than the number of samples or time-slices) in microarray data, for efficient mining. 5) TriCluster can optionally merge/delete triclusters that have large overlaps, and can also automatically relax the similarity criteria. It can thus tolerate some noise in the dataset, and lets the user focus on the most important clusters. 6) We present a useful set of metrics to evaluate the clustering quality, and we show that TriCluster can find substantially significant triclusters in the real microarray datasets. To the best of our knowledge, TriCluster is the first 3D microarray subspace clustering method.

Previous: Program
Workshop Index
DIMACS Homepage
Contacting the Center