Similarity Search in Protein Structure Databases


Charles University in Prague
Faculty of Mathematics and Physics

DOCTORAL THESIS

Jakub Galgonek

Similarity Search in Protein Structure Databases

Department of Software Engineering

Supervisor of the doctoral thesis: doc. RNDr. Tomáš Skopal, Ph.D.
Study programme: Computer Science
Specialization: Software Systems

Prague 2012

I declare that I carried out this doctoral thesis independently, and only with the cited sources, literature and other professional sources.

I understand that my work relates to the rights and obligations under the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that the Charles University in Prague has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 paragraph 1 of the Copyright Act.

In ... date ... signature of the author

Annotation

Title: Similarity Search in Protein Structure Databases
Author: Jakub Galgonek
Department: Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague
Supervisor: doc. RNDr. Tomáš Skopal, Ph.D.

Abstract: Proteins are among the most important biopolymers, with a wide range of functions in living organisms. Their huge functional diversity is achieved by their ability to fold into various 3D structures. Moreover, it has been shown that proteins sharing a similar structure often share other properties as well (e.g., a biological function or an evolutionary origin). This is why protein structures and methods for identifying their similarities are so widely studied. In this thesis, we introduce a system for similarity search in protein structure databases. Given a query structure, the system retrieves all database structures that are similar to it. It employs several key components. We have introduced a novel similarity measure that assigns similarity scores to pairs of protein structures. We have designed a specific access method that is based on LAESA metric indexing and uses the proposed measure. The access method searches for similar structures more efficiently than a sequential scan of the database. To achieve further speedup, the measure and the access method have been parallelized, resulting in almost linear speedup with respect to the number of available cores. The last component is a web user interface that accepts a query structure and presents a list of the result structures once the retrieval is finished. The web server thus makes our method easily accessible to the wide scientific community.

Keywords: protein structure databases, similarity search, metric access methods

Annotation (Czech)

Title: Similarity Search in Protein Structure Databases
Author: Jakub Galgonek, galgonek@ksi.mff.cuni.cz
Department: Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague
Supervisor: doc. RNDr. Tomáš Skopal, Ph.D., skopal@ksi.mff.cuni.cz

Abstract: Proteins are among the most important biopolymers, since they perform a great variety of vital functions in an organism. Their functional diversity is made possible mainly by their great structural diversity. Moreover, it turns out that proteins sharing a similar structure also share other properties (e.g., function or evolutionary origin). This is why so much attention is paid to the study of protein structures and to ways of identifying similar structures. In this thesis, we present a system for similarity search in protein structure databases. For a given query structure, the system finds those database structures that are structurally similar to the query. The system consists of several key parts. A custom similarity measure was designed to measure the similarity between a pair of protein structures. An access method based on the LAESA metric access method was created specifically for this measure. The access method makes it possible to search for similar structures much faster than a sequential scan of the database would. To achieve further speedup, both parts were parallelized, reaching almost linear speedup. The last part is a user interface in the form of a web server, which accepts a query and then presents the results found. This makes the presented system easily accessible to the wider professional community.

Keywords: protein structure databases, similarity search, metric access methods

Contents

Preface
    Focus of the Thesis
    Summary of Contributions
    Structure of the Thesis
    Acknowledgments

1 Introduction
    Origin of Proteins
    Protein Chemistry
    Amino Acids
    Peptide Bond
    Protein Structure
    Primary Structure
    Secondary Structure
    Tertiary Structure
    Quaternary Structure
    Protein Function
    Protein Databases
    PDB: Protein Data Bank
    SCOP: Structural Classification of Proteins

2 Protein Sequence Similarity
    Sequence Similarity Measure
    Alignment
    Alignment Scoring
    Scoring Matrix
    PAM
    BLOSUM
    Pairwise Alignment Algorithms
    Needleman-Wunsch Algorithm
    Smith-Waterman Algorithm
    Database Search Algorithms
    BLAST

3 Protein Structure Similarity
    Superposition Methods
    RMSD
    MaxSub
    TM-score
    Structure Similarity Methods
    DALI
    SSAP
    ProtDex
    CE
    MAMMOTH
    Vorolign
    Vorometric
    PPM
    FAST
    Sabertooth
    db-itm
    D-BLAST
    SARST

4 SProt Similarity Measure
    Method Description
    Representation of a Protein
    Sphere Similarity
    Alignment and Superposition
    Optimizations
    Results
    Protein Classification
    Information Retrieval in Protein Structure Databases
    Quality of Structural Alignments
    Summary

5 Speed-up by Indexing
    Metric Access Methods
    LAESA Access Method
    Capability of Indexing
    SProt Metric Properties
    Metric Axioms
    Indexability
    SProt Access Method
    Results
    Summary

6 Speed-up by Parallel Computing
    Parallel Implementation
    Optimizations
    Parallel Approach
    Results
    Summary

7 Web Server
    Web Server Implementation
    Usage of P3S
    Query Submission
    Result Presentation
    Results Download
    Results
    Comparison of Web Application Interfaces
    Comparison of Databases
    Effectiveness of P3S
    Summary

8 Conclusion

List of Figures

    Central dogma of molecular biology
    Generic structure of an amino acid
    Amino acids
    Amino acid properties
    Peptide bond
    Protein chain
    Secondary structure
    Tertiary structure
    Sequence and structure of trypsin
    Yearly growth of total structures
    Two-domain protein structure (1kzl)
    The sum of local alignments
    The weight of an edge
    Example of an amino acid viewpoint
    Angles κ and α
    An example of an aa-sphere
    Average precision-recall curves
    Asymmetry of SProt distances
    Histogram of SProt distances
    RBQ modifiers
    Access method efficiency (number of compared pairs)
    Access method efficiency (computation time)
    Retrieval errors
    SCOP retrieval errors: family
    SCOP retrieval errors: superfamily
    SCOP retrieval errors: fold
    Measured speedup for each version of the implementation
    Measured speedup of the algorithm parts
    P3S architecture
    The P3S web server user interface
    Average precision-recall curves

List of Tables

    Standard genetic code
    Classification accuracy
    Average precision
    Comparison of the alignment quality
    Effects of modifiers
    Required computational time
    Basic features of web applications
    Used databases and query times

List of Algorithms

    MaxSub algorithm
    TM-score algorithm
    LAESA access method
    SProt access method

Preface

Proteins are among the most important biopolymers. The basic building blocks of a protein polymer chain are called amino acids. Only 20 different kinds of amino acids are used in the synthesis of proteins, and their order in a protein chain is called the sequence of the protein. The template for the sequence of each protein is stored as a structural gene in the DNA of an organism.

Proteins have a wide range of functions in living organisms, such as enzymatic, signaling, transport and building functions. Proteins carry out their biological function through interactions with other proteins or small molecules, and these interactions depend on the spatial arrangement of their polymer chains. Therefore, the study of protein structures brings invaluable insight into the function of proteins [1].

The structure of a protein is largely determined by its sequence. In the native environment of a cell, a newly created protein folds into its native structure solely on the basis of its amino acid sequence. However, sequence similarity cannot substitute for structure similarity, because a structural relationship does not imply a sequence relationship [2]. During evolution, DNA is subjected to many kinds of mutations. The mutations also affect structural genes and thus change the sequences of proteins. In turn, changes in the sequence of a protein influence its structure. However, structures are more conserved during evolution than sequences are. The explanation is that evolution tends to preserve the function of a protein, and thus it tends to preserve its structure rather than its sequence [3, 4, 5]. Therefore, changes in the sequence of a protein that have a small impact on its structure are in many cases accepted by natural selection.

Similarity retrieval in databases of protein structures has applications in various fields of computational proteomics. It can be used, for example, for automatic classification of protein structures. In this case, the class of a structure is set to the class of the most similar structure selected from a database of already classified structures. A similar approach can also be used to determine other properties of a protein, for example its biological function. This approach is based on the hypothesis that proteins with similar structures also have similar biological functions.

Similarity retrieval in protein structure databases can also be used to identify homologous proteins, which originate from different organisms but have the same ancestor. Such a set of proteins can be used for studying molecular evolution or for phylogenetic analysis. Phylogenetics studies relationships among various biological species and represents the observed results in the form of a so-called phylogenetic tree. The nodes of the tree represent the studied species, and the lengths of the branches represent the evolutionary distances between the species. The estimation of evolutionary distances can be based, for example, on the sequence or structure similarities between proteins that are shared by all studied species [6]. For proteins with low sequence similarity, sequence-based estimates may be unreliable, and a phylogenetic tree built on a structure-based similarity is more appropriate [7].

Focus of the Thesis

As stated above, identifying proteins with similar structures can help to solve various biological problems. From the computer science point of view, the task can be defined as similarity-based retrieval from a database of protein structures. For a given query structure, a similarity retrieval tool has to return a list of protein structures that are similar to the query structure. The retrieval tool consists of several essential parts.

First, it is necessary to develop a method (algorithm) that measures the similarity between a pair of protein structures. This requires an interdisciplinary approach: knowledge of both biology and computer science is needed. Second, an access method (database index) has to be introduced that employs the similarity measure and retrieves the structures similar to a query structure. The most straightforward solution, comparing the query structure to each database structure, is typically too slow due to the size of the database and the high time complexity of the measure algorithm. Thus, an access method requiring fewer similarity comparisons has to be developed. The last important part of the tool is a user interface. It accepts a query structure and presents the list of result structures when the retrieval process is finished. A web interface is often used for this type of tool.
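The straightforward solution mentioned above (comparing the query with every database structure) and the nearest-neighbor classification described in the Preface can be sketched as follows. This is only an illustrative sketch under stated assumptions: `toy_similarity` is a hypothetical stand-in that counts matching positions of two strings and is not the SProt measure, and the names `knn_scan` and `classify` are invented for this example.

```python
import heapq

def toy_similarity(a, b):
    # Hypothetical stand-in for a structure similarity measure:
    # counts the positions at which the two strings carry the same symbol.
    return sum(1 for s, t in zip(a, b) if s == t)

def knn_scan(query, database, similarity, k=1):
    # Sequential scan: the query is compared with every database entry,
    # so the number of similarity computations equals the database size.
    scored = ((similarity(query, item), key) for key, item in database.items())
    return heapq.nlargest(k, scored)

def classify(query, database, labels, similarity):
    # The query inherits the class of its most similar database entry.
    _, best_key = knn_scan(query, database, similarity, k=1)[0]
    return labels[best_key]

database = {"d1": "HHHEEE", "d2": "EEEEEE"}  # toy "structures"
labels = {"d1": "alpha", "d2": "beta"}
print(classify("HHHHEE", database, labels, toy_similarity))
```

Because every query costs one similarity computation per database entry, an expensive measure makes this baseline slow, which is the motivation for an access method that needs fewer comparisons.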
The thesis focuses on the development of a novel tool for similarity retrieval from large-scale databases of protein structures. We introduce our own approach to each of the parts mentioned above.

Summary of Contributions

In this thesis, we propose a novel tool for retrieving similar structures from a protein structure database. The contributions are summarized in the following list:

- We introduce a novel algorithm, called SProt, that measures the similarity of protein structures [8, 9]. For each pair of amino acids coming from the compared structures, SProt computes the (local) structure similarity of their spatial neighborhoods. The local similarities are then aggregated in an alignment algorithm to produce the resulting global alignment. The quality of the alignment is then evaluated by a superposition algorithm.
- The SProt algorithm is computationally very expensive. To reduce the time required to retrieve similar structures from a database, we design a specific access method based on metric indexing (a variant of LAESA), which makes the search process faster by an order of magnitude [9].
- To achieve even faster search, we propose a parallel approach using Intel's Threading Building Blocks (TBB) library. It fully exploits the computational power of current CPU architectures and achieves almost linear speedup with respect to the number of available CPU cores.
- Finally, we introduce a web application, called P3S, that employs the SProt similarity measure and the proposed (parallel) access method. Given a query structure, the application identifies the set of the most similar structures in a selected database. The result set can be browsed interactively, including visual inspection of the structure superposition, or it can be downloaded as a zip archive.

Structure of the Thesis

For better readability of the thesis, we briefly describe its structure. The thesis is divided into eight chapters. Due to the interdisciplinary nature of the topic, the first chapter provides insight into the structural biology of proteins. Although this thesis is not focused on sequence similarity, many methods originally used in that area were later adopted in the area of structure similarity search. Therefore, the second chapter describes the basic methods used in sequence similarity search. Chapter 3 then presents the state of the art in the field of structure similarity.

The rest of the thesis describes the components of the proposed system for similarity search in protein structure databases. The fourth chapter introduces a novel structure similarity measure, which we have called SProt. The access method developed specifically for this measure is presented in the fifth chapter. The parallel implementation of the measure algorithm and the access method is described in the sixth chapter. The web user interface of the proposed tool is described in the seventh chapter. Finally, the last chapter summarizes the results of the thesis.

Acknowledgments

I would like to thank all those who supported me in my doctoral studies and in the work on my thesis. I greatly appreciate the help and advice received from my supervisor Tomáš Skopal and my consultant David Hoksza. I am grateful for their numerous corrections and comments. I must also express my thanks to all the anonymous reviewers of my papers for their helpful remarks and ideas.

My thanks also go to the institutions that provided financial support for my research work. During my doctoral studies, my work was partially supported by the following projects:

- Federation of European Biochemical Societies, Short-Term Fellowship
- Grant Agency of Charles University (GAUK), project number
- Czech Science Foundation (GAČR), project number 201/09/0683
- Specific Academic Research, project number SVV
- Specific Academic Research, project number SVV
- Specific Academic Research, project number SVV
- Ministry of Education of the Czech Republic, grant MSM

Chapter 1
Introduction

Proteins are very important biochemical compounds. From the chemical point of view, a protein is a linear polymer chain of amino acids. The protein chain is very flexible and can adopt different conformations (i.e., spatial structures), which allows proteins to perform very diverse functions. In the following sections, we briefly introduce the origin, chemistry, structure, and function of proteins. We also introduce databases in which information about protein structures can be found.

1.1 Origin of Proteins

A living organism is an incredibly complicated system that is often compared to a program-driven machine whose main task is to defend, repair and make copies of itself. We use this analogy in the following introduction to the biology of living organisms.

The program of the machine is written in DNA (deoxyribonucleic acid) as a linear sequence of four bases: adenine (abbreviated as A), cytosine (C), guanine (G) and thymine (T). The bases can form pairs (adenine with thymine, cytosine with guanine), which allows DNA to be organized as a double helix in which one strand is a mirror copy of the other. The double-helix organization makes DNA stable and fault-tolerant, and it also allows DNA to be copied in a simple way.

The components of the machine are mostly made up of proteins, which perform most of the major functions in a living organism. Proteins create channels and pumps for distributing chemicals. Cargo proteins carry other components directly into target cell compartments. Proteins called chaperonins help other proteins to fold. Synthetases and polymerases construct complex chemical compounds from simpler ones. Enzymes enabling digestion, antibodies guarding the organism, and hormones used for signaling are also special kinds of proteins. Proteins are linear chains made of only twenty kinds of amino acids. Their huge functional diversity is achieved by their ability to fold into various 3D structures.

Figure 1.1: Central dogma of molecular biology (DNA, transcription, mRNA, translation, protein)

As already stated, one of the main tasks of the living machine is to reproduce itself, so the program of the machine (i.e., DNA) has to contain production plans for every protein used in the machine. The flow of genetic information and the origin of proteins are described by the so-called central dogma of molecular biology [10], which is shown schematically in Figure 1.1. A part of a DNA sequence representing the plan for one given protein is called a structural gene. DNA holds a very large number of structural genes; even the simplest organisms have thousands of them.

When a protein needs to be synthesized, the structural gene corresponding to the required protein is transcribed into messenger RNA (ribonucleic acid) by RNA polymerase. The messenger RNA is a single-strand mirror copy of the structural gene and is very similar to DNA, except that RNA contains the base uracil (abbreviated as U) where DNA contains thymine. The messenger RNA is translated into a protein by ribosomes. Each triplet of consecutive bases in the messenger RNA, called a codon, encodes one amino acid in the amino acid sequence of the required protein. There are also special codons for the initiation and the termination of the encoded amino acid sequence. This coding scheme is called the genetic code (see Table 1.1) and it is shared by all organisms.
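Table 1.1: Standard genetic code

1st    2nd base: U        2nd base: C        2nd base: A        2nd base: G        3rd
U      UUU Phenylalanine  UCU Serine         UAU Tyrosine       UGU Cysteine       U
       UUC Phenylalanine  UCC Serine         UAC Tyrosine       UGC Cysteine       C
       UUA Leucine        UCA Serine         UAA STOP           UGA STOP           A
       UUG Leucine        UCG Serine         UAG STOP           UGG Tryptophan     G
C      CUU Leucine        CCU Proline        CAU Histidine      CGU Arginine       U
       CUC Leucine        CCC Proline        CAC Histidine      CGC Arginine       C
       CUA Leucine        CCA Proline        CAA Glutamine      CGA Arginine       A
       CUG Leucine        CCG Proline        CAG Glutamine      CGG Arginine       G
A      AUU Isoleucine     ACU Threonine      AAU Asparagine     AGU Serine         U
       AUC Isoleucine     ACC Threonine      AAC Asparagine     AGC Serine         C
       AUA Isoleucine     ACA Threonine      AAA Lysine         AGA Arginine       A
       AUG Methionine     ACG Threonine      AAG Lysine         AGG Arginine       G
G      GUU Valine         GCU Alanine        GAU Aspartic acid  GGU Glycine        U
       GUC Valine         GCC Alanine        GAC Aspartic acid  GGC Glycine        C
       GUA Valine         GCA Alanine        GAA Glutamic acid  GGA Glycine        A
       GUG Valine         GCG Alanine        GAG Glutamic acid  GGG Glycine        G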
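To make the coding scheme concrete, the following sketch builds the standard genetic code as a lookup table and translates a short messenger RNA. It is a minimal illustration, not part of the thesis: the function names are invented for this example, amino acids are written in one-letter codes with `*` standing for STOP, the input DNA is assumed to be the coding strand (so transcription only replaces T with U), and translation simply starts at position 0 rather than searching for a start codon.

```python
BASES = "UCAG"
# One-letter amino acid codes ('*' = STOP), ordered by first, second and
# third base of the codon, each running over U, C, A, G.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def transcribe_coding_strand(dna):
    # Assumption: `dna` is the coding strand, so the mRNA has the same
    # sequence with uracil (U) in place of thymine (T).
    return dna.upper().replace("T", "U")

def translate(mrna):
    # Read codon by codon from position 0 and stop at the first STOP codon.
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe_coding_strand("ATGGCTTAA")
print(mrna, translate(mrna))
```

Here the codons AUG, GCU and UAA are read as methionine, alanine and STOP, so the example yields a two-residue chain.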

A synthesized protein folds into its prescribed structure and is transported to the place where it becomes a functional component of the machine.

1.2 Protein Chemistry

To understand protein structure, it is essential to know the basic building blocks of proteins (i.e., amino acids) and how these blocks are assembled together (i.e., by peptide bonds).

Amino Acids

All amino acids share the same generic atomic structure (see Figure 1.2). The central point of each amino acid is a carbon atom called the α-carbon and denoted Cα. An amino group (of atoms) and a carboxyl group are attached to the α-carbon. The last group attached to the α-carbon is the so-called side chain. This chain is different for every type of amino acid and represents the variable part of amino acids. The carbon atom of the side chain that is directly attached to the α-carbon is called the β-carbon and is denoted Cβ.

Figure 1.2: Generic structure of an amino acid

The amino acids used for the synthesis of proteins are listed in Figure 1.3. The chemical properties of amino acids depend on the properties of their side chains. These properties can be very different: for example, an amino acid can be charged positively or negatively, it can be hydrophilic or hydrophobic, and so on [11, 12]. This diversity makes it possible to construct proteins that have very different properties and perform very different tasks. An overview of various chemical properties of amino acids is shown in Figure 1.4.

Figure 1.3: Amino acids (structural formulas of Glycine (G), Alanine (A), Serine (S), Threonine (T), Valine (V), Leucine (L), Isoleucine (I), Methionine (M), Aspartic acid (D), Asparagine (N), Glutamic acid (E), Glutamine (Q), Histidine (H), Cysteine (C), Proline (P), Phenylalanine (F), Lysine (K), Arginine (R), Tyrosine (Y), Tryptophan (W))

Figure 1.4: Amino acid properties [13]

Peptide Bond

An amino group and a carboxyl group (each belonging to a different amino acid) can be joined by a so-called peptide bond. Peptide bonds make it possible to construct sequences of amino acids, i.e., proteins. An example of a peptide bond is shown in Figure 1.5. The peptide bond is planar, which means that all atoms of the peptide bond (O=C-N-H) lie in the same plane. On the other hand, the Cα-C bond and the Cα-N bond allow free rotation of the attached groups of atoms. These rotations make the protein chain very flexible, so it can adopt different spatial conformations. Protein structure is described in more detail in the next section.

Figure 1.5: Peptide bond

1.3 Protein Structure

Protein structure can be studied at different levels: as an amino acid sequence (the primary structure), as local conformations of an amino acid chain (the secondary structure), as the global conformation of an amino acid chain (the tertiary structure), or as the conformation of multi-chain complexes (the quaternary structure).

Primary Structure

As stated above, a protein is a sequence of amino acids linked by peptide bonds (see Figure 1.6). This sequence is sometimes called the primary structure of the protein. In the primary structure, the positions of atoms are not important, so the primary structure can simply be captured as a sequence of letters, one for each amino acid in the protein.

Figure 1.6: Protein chain (labels: main chain; side chains; first AA, second AA, ..., last AA; atoms: hydrogen, oxygen, nitrogen, carbon, side chain)

To study protein structure at higher levels, the atoms forming a protein chain are often divided into two parts: the side chains of the amino acids, and the main chain (also called the protein backbone), which includes the Cα atoms and the atoms of the peptide bonds (see Figure 1.6) [14].

Secondary Structure

The secondary structure refers to the general three-dimensional form of local segments of the protein backbone. These segments are stabilized by so-called hydrogen bonds. A hydrogen bond is a weak chemical interaction that can be established between a hydrogen atom and an electronegative atom (e.g., nitrogen or oxygen). Several types of secondary structure are observed in proteins. The most common ones are the alpha helix and the parallel and anti-parallel beta sheets, which are shown in Figure 1.7. Atoms (and whole side chains) are drawn as balls using the same colors as in Figure 1.6. Hydrogen bonds are denoted by dashed lines. Gray shapes show the so-called cartoon representations of secondary structures.

Tertiary Structure

The tertiary structure refers to the overall three-dimensional structure of a protein chain. It is stabilized by hydrogen bonds, van der Waals forces, hydrophobic interactions, and other weak chemical interactions. The tertiary structure (often called simply the structure) is described by the coordinates of the individual atoms in space.

Figure 1.7: Secondary structure: (a) α-helix, (b) anti-parallel β-sheet, (c) parallel β-sheet

The protein structure can be visualized in different ways. A visualization of all atoms is very confusing (see Figure 1.8a), so it is not preferred. The trace representation is a polyline connecting the Cα atoms of the amino acids in their sequence order (see Figure 1.8b). The tube representation is similar, except that a smooth curve is used to connect the atoms (see Figure 1.8c). The most abstract variant, which is probably also the clearest, is the cartoon representation based on the cartoon representation of secondary structures (see Figure 1.8d).

Figure 1.8: Tertiary structure: (a) atoms, (b) trace, (c) tube, (d) cartoon

Quaternary Structure

The quaternary structure refers to a complex of multiple protein chains that are held together by various weak chemical interactions. Complexes are named according to the number of units (chains) using Greek numerals and the suffix -mer (e.g., dimer, trimer, tetramer, and so on). Since this thesis is not primarily focused on protein complexes, we do not describe the quaternary structure in more detail here.

1.4 Protein Function

The biological function of a protein is mainly determined by its spatial conformation (i.e., by its structure). The part of the protein structure that performs a biological function is called the active site of the protein. The active site usually consists of amino acids that can be very distant in the protein sequence. Thus, only a correctly folded protein can properly fulfill its function. For example, trypsin is a digestive enzyme that can cleave peptide bonds [15]. The trypsin active site is formed by three amino acids that are very far apart in the sequence, as can be seen in Figure 1.9a, which shows the trypsin sequence with the amino acids of the active site marked in red. When the protein folds into the correct structure, these amino acids become spatially close and form biological scissors. The active site is shown in Figure 1.9b, where the atoms of the active-site amino acids are represented by balls and the rest of the protein structure by a tube.

Figure 1.9: Sequence and structure of trypsin: (a) sequence, (b) structure. The sequence shown in panel (a) is:
VDDDDKIVGG YTCGANTVPY QVSLNSGYHF CGGSLINSQW VVSAAHCYKS GIQVRLGEDN INVVEGNEQF ISASKSIVHP SYNSNTLNND IMLIKLKSAA SLNSRVASIS LPTSCASAGT QCLISGWGNT KSSGTSYPDV LKCLKAPILS DSSCKSAYPG QITSNMFCAG YLEGGKDSCQ GDSGGPVVCS GKLQGIVSWG SGCAQKNKPG VYTKVCNYVS WIKQTIASN

The structure into which a protein folds in its native environment often depends only on its amino acid sequence. However, structure similarity cannot be replaced by sequence similarity. Although the structure is determined by the sequence, it has been shown that structure similarity does not imply sequence similarity [2]. The explanation is that evolution tends to preserve the structure (and thus the function) of a protein rather than its sequence [3, 4, 5]. Thus, the main reason for the enormous effort spent on protein structure research is that the protein structure is closer to the function than the sequence is. Moreover, the biological motivation for protein structure similarity search stems from the hypothesis that proteins with similar structures also share similar functions (and other properties). Hence, it is very useful to have tools for measuring protein structure similarity, so that similar structures can be identified in a database of protein structures with already known functions.

1.5 Protein Databases

Protein structures are determined by physical methods. The most frequently used one is X-ray crystallography, which is based on the X-ray diffraction of a protein crystal and allows the structure of the protein to be calculated from the measured diffraction patterns. Another widely used method is nuclear magnetic resonance spectroscopy, which is based on the different resonant frequencies of the nuclei in protein structures. Both methods are very expensive and time-consuming. Thus, obtaining a large database of protein structures has required, and still requires, the combined efforts of many laboratories. To support this effort, solved protein structures are deposited in freely available databases. In the following subsections, we briefly describe the two databases that are probably the most important ones in the research area on which this thesis is focused.

PDB: Protein Data Bank

The Protein Data Bank (PDB) is the primary protein structure database [16]. After a protein structure is solved, it obtains an ID and is deposited into the PDB database together with additional information. The PDB web interface allows users to browse the database and to search it according to this additional information, which provides details about the protein (e.g., the name, the source organism, ...), details about the physical experiment used to solve the structure, the names of the authors, and so on.

All information is stored in the form of a so-called PDB file [17]. It is a text-based file format that captures information about a protein and its structure in a form that can be read by programs. This makes it possible to create many different programs that work with protein structures stored in the standardized PDB file format.

The number of structures deposited each year is still increasing, as can be seen in Figure 1.10. The figure also shows the total number of structures deposited in the database. Despite the continuous growth, the number of solved protein structures constantly falls behind the number of known protein sequences.

Figure 1.10: Yearly growth of total structures (number of structures by year; series: added yearly, total count)

SCOP: Structural Classification of Proteins

Protein structures can be cataloged according to their structural properties. One of the most widely used repositories is the SCOP database, which has become established as the gold standard for organizing protein structures. SCOP does not focus on whole protein structures; instead, each structure is decomposed into protein domains. A domain is (vaguely) defined as a portion of the protein chain that folds into a compact semi-independent unit [18]. An example of a two-domain protein can be seen in Figure 1.11, where the individual domains are distinguished by different colors.

Figure 1.11: Two-domain protein structure (1kzl)

SCOP [19] is a manually curated hierarchical evolutionary classification of protein domains, which are stored in the leaves of a four-level hierarchy:

- family: Domains that have high sequence similarity (> 30%) or very similar functions and structures are clustered into families.
- superfamily: Families whose domains have a similar structure and probably a common evolutionary origin are clustered into superfamilies.
- fold: Superfamilies that share the same major secondary structures in the same arrangement with the same topological connections are clustered into folds.
- class: Folds are clustered into classes on the basis of similar relative amounts of the types of secondary structures.

The information stored in the SCOP database is used to produce the ASTRAL database [20]. For each SCOP domain, the ASTRAL database contains the sequence and the structure of the domain. ASTRAL also provides representative subsets of domains that are based on the sequence similarity of the domains. ASTRAL together with SCOP is very useful for testing the quality of similarity search methods: ASTRAL provides the structures on which the search methods can be tested, and SCOP provides information about the similarities of domains according to the judgment of biology experts, which can be used to evaluate the quality of the search.
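As a small illustration of how programs can read the text-based PDB file format described in this section, the sketch below extracts the Cα atoms from ATOM records, which yields the data needed for the trace representation of a chain. It is only a sketch under stated assumptions: it relies on the fixed column layout of ATOM records (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, coordinates in 31-54), it ignores alternate locations, insertion codes and multiple models, and both function names, including the helper `atom_line` that fabricates toy records, are hypothetical.

```python
def atom_line(serial, name, res, chain, res_seq, x, y, z):
    # Hypothetical helper for this example: formats a minimal ATOM record
    # in the fixed columns of the PDB format (occupancy etc. are omitted).
    return (f"ATOM  {serial:5d} {name:^4s} {res:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

def parse_ca_atoms(pdb_text):
    # Collect (chain, residue number, residue name, (x, y, z)) per C-alpha atom.
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((line[21], int(line[22:26]), line[17:20], xyz))
    return atoms

# Toy input with made-up coordinates, not a real PDB entry.
pdb_text = "\n".join([
    atom_line(1, "N", "GLY", "A", 1, 0.0, 0.0, 0.0),
    atom_line(2, "CA", "GLY", "A", 1, 1.458, 0.0, 0.0),
    atom_line(3, "CA", "ALA", "A", 2, 3.9, 3.1, 0.5),
])
for atom in parse_ca_atoms(pdb_text):
    print(atom)
```

Slicing by column position rather than splitting on whitespace is a deliberate choice here, because neighboring fields in a fixed-column record may touch without a separating space.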

Chapter 2
Protein Sequence Similarity

Measuring sequence similarity is an older problem than measuring structure similarity. It is also a relatively simpler one. There is a widely accepted representation of sequences together with a widely accepted definition of sequence similarity; the differences lie only in the parameters of the measure. Moreover, the task of finding similar sequences is satisfactorily solved by the BLAST method. Although this thesis does not focus on protein sequence similarity, several approaches and algorithms used to measure sequence similarity can also be employed to measure structure similarity. Thus, in this chapter we introduce the basic concepts and algorithms used in this area.

2.1 Sequence Similarity Measure

The basic concept of sequence similarity measures is the so-called alignment. An alignment describes a pairing between amino acids of the compared sequences. The similarity of the compared sequences is defined on the basis of the similarities of the aligned pairs of amino acids. We formally describe the method in the following sections.

Alignment

An alignment can be defined as follows. Let us assume that there are two protein sequences over the alphabet of amino acid types $\Sigma$. The sequences are denoted as $P^x = \{p^x_1, \ldots, p^x_{n_x}\} \in \Sigma^{n_x}$ and $P^y = \{p^y_1, \ldots, p^y_{n_y}\} \in \Sigma^{n_y}$, and their lengths are $n_x$ and $n_y$, respectively. Then $A$ is an alignment of the sequences $P^x$ and $P^y$ if it satisfies the following conditions:

1. $A$ maps indexes of the sequence $P^x$ to indexes of the sequence $P^y$:

$A \subseteq \{1, \ldots, n_x\} \times \{1, \ldots, n_y\}$ (2.1)

2. $A$ is an injective map:

$\forall \{(i_x, i_y), (j_x, j_y)\} \subseteq A : i_x = j_x \Leftrightarrow i_y = j_y$ (2.2)

3. $A$ is strictly increasing:

$\forall \{(i_x, i_y), (j_x, j_y)\} \subseteq A : i_x < j_x \Leftrightarrow i_y < j_y$ (2.3)

Note that the components of an element $i \in A$ are denoted $i_x$ and $i_y$, i.e., $i = (i_x, i_y)$. The monotonicity condition allows an alignment to be written as an ordered sequence of aligned pairs. It also allows an alignment to be shown (implicitly) using the compared sequences. The compared sequences are written on two lines in such a way that aligned amino acids are displayed underneath each other. If a specific amino acid in one sequence is not aligned, a special character - (called a gap) is written in the other sequence. Consider the two sequences MANANA and ANNA; their alignment {(2, 1), (3, 2), (4, 4)} can be shown as:

MAN-ANA
-ANNA

Alignment Scoring

There is a huge number of alignments that can be established between a pair of sequences. Thus, the task of a similarity algorithm is to select the one that has the maximum value of a score. There exist several variants of sequence scores differing in their purpose. Each score is based on a so-called scoring matrix, which describes similarities between two types of amino acids. The scoring matrix $S$ is a small matrix whose rows and columns are indexed by the elements of the alphabet $\Sigma$. The main difference between the variants lies in how they treat unaligned amino acids, i.e., how they score gaps.

Zero Gap Penalty Model

In the simplest case, the score of an alignment (between sequences $P^x$ and $P^y$) depends only on the similarities between the types of aligned amino acids:

$M(P^x, P^y, A) = \sum_{i \in A} S(p^x_{i_x}, p^y_{i_y})$ (2.4)

where $S$ is a scoring matrix indexed by the types of amino acids.

Constant Gap Penalty Model

Ignoring gaps can cause problems, especially when the scoring matrix contains elements with negative values. In such a case, it would always be preferable to use a pair of a vertical and a horizontal gap instead of aligning amino acids whose substitution score is negative.
A possible solution is to introduce a penalty for a gap into the final score:

$M_G(P^x, P^y, A) = \sum_{i \in A} S(p^x_{i_x}, p^y_{i_y}) + gp \, (n_x - l) + gp \, (n_y - l)$ (2.5)

where $l$ is the length of the alignment and $gp$ is the gap penalty.

Sometimes it is required to align only contiguous subsequences of the compared sequences. This is useful when we know that the compared sequences, which can be very different, contain a common subsequence. To solve this problem, a so-called local alignment is used. It is equivalent to the previous variant, called global alignment, except that the gaps at the beginning and at the end of the compared sequences are penalty-free:

$M_L(P^x, P^y, A) = \sum_{i \in A} S(p^x_{i_x}, p^y_{i_y}) + gp \, ((a^x_l - a^x_1 + 1) - l) + gp \, ((a^y_l - a^y_1 + 1) - l)$ (2.6)

where $l$ is the length of the alignment, $gp$ is the gap penalty, and $a^x_i$ and $a^y_i$ denote the indexes of the $i$-th aligned pair in $P^x$ and $P^y$, respectively.

Affine Gap Penalty Model

Insertions or deletions that produce gaps in protein sequences rarely occur as single-point mutations. In many cases, a whole subsequence of amino acids is affected by one insertion mutation or one deletion mutation. Unfortunately, the constant gap penalty model does not reflect this observation. As a solution, the affine gap penalty model has been introduced. A fragment of contiguous gaps is scored as a whole, and the value of its score depends on the length of the fragment:

$gap(k) = \begin{cases} 0 & \text{if } k = 0 \\ ogp + (k - 1) \, egp & \text{otherwise} \end{cases}$ (2.7)

where $k$ is the length of the fragment, $ogp$ is the so-called open gap penalty, and $egp$ is the so-called extend gap penalty. Based on this, the score for a global alignment is defined as:

$M_G(P^x, P^y, A) = \sum_{i \in A} S(p^x_{i_x}, p^y_{i_y}) + gap(a^x_1 - 1) + gap(n_x - a^x_l) + \sum_{i=1}^{l-1} gap(a^x_{i+1} - a^x_i - 1) + gap(a^y_1 - 1) + gap(n_y - a^y_l) + \sum_{i=1}^{l-1} gap(a^y_{i+1} - a^y_i - 1)$ (2.8)
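As a minimal sketch of Equations 2.7 and 2.8, the following Python fragment evaluates the affine-gap global score of a given (non-empty) alignment. It assumes that penalties are negative numbers added to the score, as the formulas suggest. The function names, the toy match/mismatch scoring function, and the values ogp = -10 and egp = -1 are illustrative assumptions and are not taken from the thesis.

```python
def gap(k, ogp, egp):
    """Affine gap score of a fragment of k contiguous gaps (Equation 2.7)."""
    return 0 if k == 0 else ogp + (k - 1) * egp


def global_affine_score(px, py, alignment, score, ogp, egp):
    """Score of a global alignment under the affine gap model (Equation 2.8).

    `alignment` is a non-empty list of 1-based index pairs (i_x, i_y) that is
    strictly increasing in both components.
    """
    ax = [ix for ix, _ in alignment]
    ay = [iy for _, iy in alignment]
    total = sum(score(px[ix - 1], py[iy - 1]) for ix, iy in alignment)
    for idx, n in ((ax, len(px)), (ay, len(py))):
        total += gap(idx[0] - 1, ogp, egp)     # unaligned prefix
        total += gap(n - idx[-1], ogp, egp)    # unaligned suffix
        for a, b in zip(idx, idx[1:]):
            total += gap(b - a - 1, ogp, egp)  # unaligned run between pairs
    return total


def toy_score(a, b):
    # Hypothetical stand-in for a real scoring matrix such as PAM or BLOSUM.
    return 2 if a == b else -1


# The MANANA / ANNA alignment used as the running example in the text.
example = global_affine_score("MANANA", "ANNA", [(2, 1), (3, 2), (4, 4)],
                              toy_score, ogp=-10, egp=-1)
```

With these toy values, the three aligned pairs contribute 6, while the three gap fragments (of lengths 1, 2, and 1) contribute -10, -11, and -10, which shows how a single long fragment is cheaper than several short ones.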

And the score for a local alignment is defined as:

$M_L(P^x, P^y, A) = \sum_{i \in A} S(p^x_{i_x}, p^y_{i_y}) + \sum_{i=1}^{l-1} gap(a^x_{i+1} - a^x_i - 1) + \sum_{i=1}^{l-1} gap(a^y_{i+1} - a^y_i - 1)$ (2.9)

Experiments have shown that the affine gap penalty model serves as a good approximation to biologically realistic gap penalties [21].

2.2 Scoring Matrix

The scoring matrix has the biggest influence on the properties of a sequence measure. The scoring matrix has to reflect similarities between amino acids. For example, it can be based on similarities of chemical properties (see Figure 1.4). However, the most widely used scoring matrices are based on evolutionary relationships. Note that evolution-based matrices express chemical and other amino acid similarities implicitly, because a mutation is more likely to be propagated to future generations if the original and the new amino acid have some properties in common.

A sequence score tries to estimate the ratio between the probabilities of two hypotheses. The first one is that the compared sequences are related, i.e., that they share a common ancestor, and the second one is that the compared sequences are not related:

$S(P^x, P^y) = \frac{P(P^x \text{ and } P^y \text{ are related})}{P(P^x \text{ and } P^y \text{ are not related})}$ (2.10)

If the issue of gaps is ignored and only the alignment $A$ is taken into account, then the score is defined as:

$S(P^x, P^y, A) = \frac{P(P^x_A \text{ and } P^y_A \text{ are related})}{P(P^x_A \text{ and } P^y_A \text{ are not related})}$ (2.11)

where $P^x_A$ is the subsequence containing all aligned amino acids from the sequence $P^x$; similarly for $P^y_A$. If it is assumed that individual mutations are independent, the score can be expressed as a product of probability ratios:

$S(P^x, P^y, A) = \prod_{(i,j) \in A} \frac{P(p^x_i \text{ and } p^y_j \text{ have a common ancestor})}{P(p^x_i \text{ and } p^y_j \text{ are aligned by chance})}$ (2.12)

The logarithm is used to convert the score into a sum of log-odds ratios:

$S(P^x, P^y, A) = \sum_{(i,j) \in A} c \log_b \frac{P(p^x_i \text{ and } p^y_j \text{ have a common ancestor})}{P(p^x_i \text{ and } p^y_j \text{ are aligned by chance})}$ (2.13)

where $c$ is a scale constant and $b$ is the base of the logarithm. From the above, it follows that an element of an evolution-based scoring matrix has the form:

$S_{ij} = c \log_b \frac{P(i \text{ and } j \text{ have a common ancestor})}{P(i \text{ and } j \text{ are aligned by chance})}$ (2.14)

The scoring matrices in use differ mainly in how the probabilities and the frequencies are obtained. In the following sections, we describe the construction of the two most commonly used series of scoring matrices, PAM and BLOSUM.

PAM

The PAM matrix is constructed on the basis of 71 groups of closely related protein sequences [22]. For each group, a phylogenetic tree is constructed; the leaves of the tree represent the sequences, and the internal nodes represent inferred ancestral sequences. The phylogenetic trees are used to obtain three pieces of information: the accepted point mutation matrix $A_{ij}$, capturing the total numbers of accepted point mutations (i.e., the total numbers of amino acid replacements that are accepted by natural selection) between amino acids $i$ and $j$; the relative mutability $m_i$, which is proportional to the probability that amino acid $i$ will mutate in a given small evolutionary interval, specified as the time needed for one accepted point mutation per 100 amino acids; and the amino acid frequency $f_i$, representing the exposure of amino acid $i$ to mutation.

On the basis of the obtained information, the mutation probability matrix $M_{ij}$ is constructed. The matrix captures the probability that amino acid $j$ will be replaced by amino acid $i$ after the given evolutionary interval (in this case 1 PAM). The matrix is defined as:

$M_{ij} = \begin{cases} \frac{\lambda m_j A_{ij}}{\sum_i A_{ij}} & \text{if } i \neq j \\ 1 - \lambda m_j & \text{if } i = j \end{cases}$ (2.15)

where $\lambda$ is the proportionality constant.
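The construction in Equation 2.15 can be sketched in a few lines of Python. The fragment below is only an illustration: the three-letter alphabet, the counts in `A`, the mutabilities in `m`, and the value of `lam` are made-up toy data, and the sketch assumes that the diagonal of the accepted point mutation matrix is zero, so that every column of the result sums to one. Repeated multiplication then gives the matrix for a longer evolutionary interval.

```python
def mutation_probability_matrix(A, m, lam):
    """Build the 1-PAM matrix M of Equation 2.15.

    M[i][j] is the probability that amino acid j is replaced by amino acid i.
    Assumption of this sketch: A has a zero diagonal.
    """
    n = len(A)
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return [[1 - lam * m[j] if i == j else lam * m[j] * A[i][j] / col_sums[j]
             for j in range(n)]
            for i in range(n)]


def mat_mul(P, Q):
    """Plain matrix product of two square matrices."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, power):
    """M multiplied by itself, `power` factors in total (power >= 1)."""
    result = M
    for _ in range(power - 1):
        result = mat_mul(result, M)
    return result


# Toy data for a hypothetical three-letter alphabet; not real PAM counts.
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]
m = [1.0, 0.8, 0.5]
M1 = mutation_probability_matrix(A, m, lam=0.01)
M250 = mat_pow(M1, 250)  # analogue of the 250-PAM matrix
```

Because each column of `M1` is a probability distribution over the replacement outcomes of one amino acid, the same holds for every power of `M1`.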

The mutation probability matrix for a longer evolutionary interval can be obtained by matrix multiplication. For example, $M^{250}$ is the mutation probability matrix for 250 PAMs. Note that the number of accepted point mutations (per 100 amino acids) is not equal to the number of changed amino acids (per 100 amino acids), because one amino acid can mutate multiple times during the given evolutionary interval. For this reason, the evolutionary interval can be longer than 100 PAMs.

Consider two sequences at an evolutionary distance of $n$ PAMs. The probability that their alignment contains an aligned pair of amino acid $j$ (in the first sequence) and amino acid $i$ (in the second sequence) is $f_j M^n_{ij}$, where $M^n$ denotes the mutation probability matrix for $n$ PAMs. On the other hand, the probability that the same pair will be observed in an alignment of unrelated sequences is $f_i f_j$, where $f_i$ ($f_j$) is the frequency of an amino acid of type $i$ ($j$). Based on these probabilities, the score of the PAM-n matrix has the following form:

$(\text{PAM-}n)_{ij} = c \log_b \left( \frac{M^n_{ij}}{f_i} \right)$ (2.16)

where $c = 10$ is a scale factor and $b = 10$ is the base of the logarithm.

BLOSUM

Unlike the PAM matrix, which uses an explicit evolutionary model, the BLOSUM matrix (Blocks of Amino Acid Substitution Matrix) is based on an implicit evolutionary model derived from local alignments of related segments of protein sequences [23]. To reduce the effects of very similar sequences, groups of similar sequence segments are clustered and counted as one average segment. The clustering is parameterized by the so-called clustering percentage (denoted as $n$). A sequence segment A belongs to a cluster if the cluster contains a sequence segment B such that A and B are at least $n$ percent identical, i.e., at least $n$ percent of their aligned positions are equal.

Consider a fixed clustering percentage $n$. The local alignments are directly used to obtain the frequencies of aligned pairs of amino acids. The frequency of the unordered pair $(i, j)$ of amino acids is denoted as $f_{ij}$.
Based on the frequencies, the probability that a pair $(i, j)$ occurs in an alignment of related sequences is expressed as:

$q_{ij} = \frac{f_{ij}}{\sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}}$ (2.17)

The probability that an amino acid $i$ occurs in a pair of amino acids is then defined as:

$p_i = q_{ii} + \frac{\sum_{j \neq i} q_{ij}}{2}$ (2.18)

The probabilities $p_i$ are used to express the probability that the pair $(i, j)$ occurs in an alignment of unrelated sequences:

$e_{ij} = \begin{cases} p_i p_j & \text{if } i = j \\ p_i p_j + p_j p_i & \text{if } i \neq j \end{cases}$ (2.19)
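To make the three BLOSUM quantities concrete, the following Python sketch computes $q_{ij}$, $p_i$, and $e_{ij}$ from a table of unordered pair counts, and then the log-odds score with the scale factor and base stated for BLOSUM (c = 2, b = 2). The three-letter alphabet and the counts are invented for illustration, and the function names are assumptions of this sketch, not part of any BLOSUM tool. Real BLOSUM entries are additionally rounded to integers, which the sketch omits.

```python
import math
from itertools import combinations_with_replacement


def blosum_probabilities(f, alphabet):
    """Return (q, p, e) of Equations 2.17-2.19.

    `f` maps unordered pairs (a, b), written with a <= b, to pair counts.
    """
    alphabet = sorted(alphabet)
    pairs = list(combinations_with_replacement(alphabet, 2))
    total = sum(f.get(pair, 0) for pair in pairs)
    q = {pair: f.get(pair, 0) / total for pair in pairs}         # Eq. 2.17
    p = {}
    for a in alphabet:                                            # Eq. 2.18
        off_diagonal = sum(q[tuple(sorted((a, b)))]
                           for b in alphabet if b != a)
        p[a] = q[(a, a)] + off_diagonal / 2
    e = {(a, b): p[a] * p[b] if a == b else 2 * p[a] * p[b]      # Eq. 2.19
         for a, b in pairs}
    return q, p, e


def log_odds(q, e, c=2, b=2):
    """Unrounded log-odds scores c * log_b(q / e) for pairs with q > 0."""
    return {pair: c * math.log(q[pair] / e[pair], b)
            for pair in q if q[pair] > 0}


# Invented pair counts over a hypothetical three-letter alphabet.
counts = {("A", "A"): 40, ("A", "M"): 10, ("A", "N"): 20,
          ("M", "M"): 10, ("M", "N"): 5, ("N", "N"): 15}
q, p, e = blosum_probabilities(counts, "AMN")
scores = log_odds(q, e)
```

In this toy example the pair (A, A) is observed more often than expected by chance and receives a positive score, whereas (A, N) is observed less often than expected and receives a negative one.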

According to Equation 2.14, the score of the BLOSUM-n matrix has the following form:

$(\text{BLOSUM-}n)_{ij} = c \log_b \left( \frac{q_{ij}}{e_{ij}} \right)$ (2.20)

where $c = 2$ is a scale factor and $b = 2$ is the base of the logarithm.

2.3 Pairwise Alignment Algorithms

The task of finding the best alignment between a pair of sequences can be solved by dynamic programming. The Needleman-Wunsch algorithm is used to find a global alignment [24], and the Smith-Waterman algorithm is used to find a local alignment [25]. Over time, the algorithms have been improved; for example, efficient handling of the affine gap penalty model was introduced later [26]. However, the improved versions of the algorithms are still referred to by their original names. The currently used forms of both algorithms are described in the following sections.

Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm is based on a recursive expression of the similarity score formula [24]. Consider the sequence $X = x_1, x_2, \ldots, x_n$ of length $n$ and the sequence $Y = y_1, y_2, \ldots, y_m$ of length $m$. We denote the subsequence $x_1, x_2, \ldots, x_i$ as $X_i$, and the subsequence $y_1, y_2, \ldots, y_j$ as $Y_j$. Furthermore, we denote the best alignment between the subsequences $X_i$ and $Y_j$ as $A^G_{i,j}$, and its score as $G[i, j]$. The best alignment not aligning $x_i$ (i.e., the alignment has to contain a gap after $y_j$) is denoted as $A^V_{i,j}$, and its score as $V[i, j]$. Similarly, the best alignment not aligning $y_j$ is denoted as $A^H_{i,j}$, and its score as $H[i, j]$.

The alignment $A^V_{i,j}$ is either equal to the alignment $A^V_{i-1,j}$ (so the last gap is extended), or it is equal to the alignment $A^G_{i-1,j}$ (so a new gap is opened).
Thus, the score $V[i, j]$ can be recursively expressed as:

$V[i, j] = \max \begin{cases} V[i-1, j] + egp \\ G[i-1, j] + ogp \end{cases}$ (2.21)

Similarly, the score $H[i, j]$ can be recursively expressed as:

$H[i, j] = \max \begin{cases} H[i, j-1] + egp \\ G[i, j-1] + ogp \end{cases}$ (2.22)

Finally, the alignment $A^G_{i,j}$ is either equal to the alignment $A^G_{i-1,j-1}$ extended by the pair $(i, j)$, or it is equal to $A^V_{i,j}$ or $A^H_{i,j}$. Thus, the score $G[i, j]$ can be recursively expressed as:

$G[i, j] = \max \begin{cases} G[i-1, j-1] + S(x_i, y_j) \\ V[i, j] \\ H[i, j] \end{cases}$ (2.23)

Based on the recursive equations, the problem can be solved by iterative calculation of the matrices $G$, $H$, and $V$ with the following initial values:

$G[0, 0] = 0$
$G[i, 0] = V[i, 0] = ogp + (i - 1) \, egp$ for $i \in \{1, \ldots, n\}$
$G[0, j] = H[0, j] = ogp + (j - 1) \, egp$ for $j \in \{1, \ldots, m\}$ (2.24)

The score of the best alignment will be in the element $G[n, m]$. The alignment can be obtained by backtracking through the matrix $G$. We begin with the element $G[n, m]$ and we backtrack until the element $G[0, 0]$ is reached. The following rules are used:

If $G[i, j]$ is equal to $G[i-1, j-1] + S(x_i, y_j)$, then the alignment contains the pair $(i, j)$ and we go back to the element $G[i-1, j-1]$.
If $G[i, j]$ is equal to $V[i, j]$, then we go back to the element $G[i-s, j]$, where $G[i, j] = G[i-s, j] + ogp + (s - 1) \, egp$.
If $G[i, j]$ is equal to $H[i, j]$, then we go back to the element $G[i, j-s]$, where $G[i, j] = G[i, j-s] + ogp + (s - 1) \, egp$.

Note that if the constant gap penalty model is used, then the matrices $V$ and $H$ can be omitted. In such a case, $ogp$ is equal to $egp$, so it is clear that $V[i, j]$ is equal to $G[i-1, j] + ogp$; similarly for $H$.

Smith-Waterman Algorithm

The Smith-Waterman alignment can be obtained by a small modification of the previous algorithm [25]. Consider the same definitions of $A^G$, $A^V$, $A^H$, and their scores $G$, $V$, $H$, with only one difference: the leading gaps of the alignments of subsequences are not scored. The score of the best local alignment between sequences $X$ and $Y$ is then equal to:

$S(X, Y) = \max_{i \in \{1, \ldots, n\}, \, j \in \{1, \ldots, m\}} G[i, j]$ (2.25)

There is only one change that has to be made in the recursive equations:

$G[i, j] = \max \begin{cases} G[i-1, j-1] + S(x_i, y_j) \\ V[i, j] \\ H[i, j] \\ 0 \end{cases}$ (2.26)
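The recurrences of Equations 2.21 to 2.23 and 2.26 translate almost directly into code. The following Python sketch computes only the optimal score (no backtracking) for both the global and the local variant. It assumes that the penalties ogp and egp are negative numbers, uses minus infinity for undefined border cells of V and H, and relies on a toy match/mismatch function; the function names and all numeric values are illustrative assumptions rather than part of the original algorithms' reference implementations.

```python
NEG_INF = float("-inf")


def alignment_score(x, y, score, ogp, egp, local=False):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) score."""
    n, m = len(x), len(y)
    G = [[0] * (m + 1) for _ in range(n + 1)]
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    H = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Initial values of Equation 2.24: leading gaps are penalized.
        for i in range(1, n + 1):
            G[i][0] = V[i][0] = ogp + (i - 1) * egp
        for j in range(1, m + 1):
            G[0][j] = H[0][j] = ogp + (j - 1) * egp
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend an existing gap or open a new one (Eq. 2.21 and 2.22).
            V[i][j] = max(V[i - 1][j] + egp, G[i - 1][j] + ogp)
            H[i][j] = max(H[i][j - 1] + egp, G[i][j - 1] + ogp)
            g = max(G[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                    V[i][j], H[i][j])                      # Eq. 2.23
            if local:
                g = max(g, 0)                              # Eq. 2.26
            G[i][j] = g
            best = max(best, g)                            # Eq. 2.25
    return best if local else G[n][m]


def toy_score(a, b):
    # Hypothetical stand-in for a real scoring matrix.
    return 2 if a == b else -1


global_score = alignment_score("MANANA", "ANNA", toy_score, ogp=-3, egp=-1)
local_score = alignment_score("MANANA", "ANNA", toy_score, ogp=-3, egp=-1,
                              local=True)
```

For this toy input, the local score is higher than the global one because the unaligned leading M of MANANA is not penalized in the local variant.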

The additional zero term in Equation 2.26 allows the alignment to begin with an arbitrary pair of amino acids without any penalty. Based on the new recursive equations, the problem can also be solved by the iterative calculation of the matrices $G$, $H$, and $V$. As in the previous case, the local alignment can be obtained by backtracking through the matrix $G$. We begin with the element $G[i, j]$ having the maximum value and we backtrack until we reach an element $G[k, l]$ that has the value zero. The rules of the backtracking step stay the same. Note that if the constant gap penalty model is used, then the matrices $V$ and $H$ can be omitted as in the previous case.

2.4 Database Search Algorithms

The speed of the described alignment algorithms is sufficient when only one alignment is required. Their computational complexity becomes the limiting factor when the task is to compare a query sequence against all sequences in a database and to select the most similar ones. Due to the large sizes of sequence databases (for example, more than 535 thousand sequences are deposited in UniProtKB/Swiss-Prot), other approaches have to be used.

BLAST

One of the most widely used methods for similarity search in sequence databases is the BLAST method (Basic Local Alignment Search Tool) [27, 28]. The method is based on a heuristic approach used to identify sequences that may be similar to the query sequence. Very briefly, the BLAST method employs the following steps [18]:

1. Low complexity regions, which are not useful for obtaining a meaningful alignment, are optionally removed from the query sequence.

2. The query sequence is used to generate a list of contiguous subsequences of length 3, called words. Thus, if the query sequence has the length $k$, then the list includes $k - 2$ words.

3. Each word from the list is compared against all 3-letter words (over the amino acid alphabet). A given scoring matrix and the ungapped alignment are used to score a pair of words. Only so-called high-scoring words with a score greater than or equal to the threshold $T$ are selected for further processing.

4. The obtained high-scoring words are organized into an efficient search tree. The tree is then used to scan all database sequences to find exact matches.

5. Each exact match is extended to obtain a high-scoring segment pair (HSP). The original version of BLAST employs a simple algorithm. The seed alignment defined by the exact match is extended (without gaps) in both directions until the score falls a certain value below the best score found for shorter extensions. The newer version uses another approach, which is more time efficient. It extends only pairs of exact matches that are separated by the same distance in the query and in the database sequence, and whose distance is below the threshold $A$ [28]. For each such pair, the exact matches are merged and extended as in the older version.

6. All HSPs having a score below the cutoff score $S$ are filtered out.

7. The full local alignment is performed between the query sequence and the retrieved database sequences (identified by HSPs). For each alignment, an E-score is computed.

8. Alignments having an E-score below a user-defined threshold value are reported as the results of the search.
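Steps 2 and 3 of the list above, the generation of query words and the selection of their high-scoring neighbors, can be sketched as follows. This is a brute-force illustration only: real BLAST implementations use far more efficient data structures, and the three-letter alphabet, the toy scoring function, and the threshold value used here are assumptions made for the example.

```python
from itertools import product


def query_words(query, w=3):
    """All contiguous words of length w; a query of length k yields k - w + 1."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def high_scoring_words(word, alphabet, score, threshold):
    """Words over `alphabet` whose ungapped score against `word` is >= threshold."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            result.append("".join(candidate))
    return result


def toy_score(a, b):
    # Hypothetical stand-in for a real scoring matrix such as BLOSUM62.
    return 2 if a == b else -1


words = query_words("MANANA")
neighbors = high_scoring_words("MAN", "AMN", toy_score, threshold=3)
```

With the full 20-letter amino acid alphabet, each query word is compared against 8000 candidate words, which is why the threshold T has a strong influence on both the sensitivity and the speed of the search.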

Chapter 3
Protein Structure Similarity

As already stated, the fundamental part of the thesis focuses on measuring protein structure similarity. In this chapter, we present some of the state-of-the-art algorithms solving this task. The similarity methods can be divided into two groups: global similarity measures, which take into account the whole structure, and local similarity measures, which compare only active sites, that is, the sites where the protein links to its binding ligand or protein. The latter methods can be more precise in linking functional similarity to structural similarity, since they focus directly on the site where the action occurs. On the other hand, for new structures the active site is often not known, as localizing the exact ligand position requires the structures to be resolved at high resolution. The current version (v.2011) of sc-pdb [29], an annotated database of druggable binding sites from the Protein DataBank (PDB) [16], contains binding site information for 3034 proteins, while at the same time (March 27, 2012) PDB contains 3D information for 80,402 structures. This disproportion favors the first group of methods, which are able to assess pairwise protein structure similarity without the need to know the exact binding site positions. We focus here only on the methods belonging to the first group.

Although there are exceptions, measuring protein structure similarity often consists of three subsequent steps:

1. Finding a correspondence between pairs of amino acids in the compared protein structures. The general algorithm determines the similarities (based on local, or possibly global, properties) of all possible amino acid pairs and selects the optimal subset. The output of this step is a vector of amino acid pairs, i.e., an alignment.

2. Computing the superposition between the protein structures, i.e., finding the transformation of one protein structure that minimizes its distance to the other protein structure.
Since a protein structure is not anchored in the Euclidean space, it is difficult to find such a transformation (rotation and translation) that minimizes the mutual distance of the compared structures.

3. Evaluating a similarity measure. The formula used is based on the obtained alignment (and also on the obtained transformation, if it is needed).

We start with a description of the algorithms used to superimpose two aligned protein structures. They are widely used as an integral part of many similarity algorithms, so it is useful to describe them separately. In the rest of the chapter, we focus on the similarity algorithms themselves.

3.1 Superposition Methods

A superposition algorithm tries to find a transformation (rotation and translation) that, applied to one of the structures, brings aligned amino acids spatially close to each other. The task is formally defined as finding the transformation that minimizes a given formula based on the spatial distances between aligned amino acids. Thus, the alignment has to be obtained before a superposition algorithm can be used.

Superposition algorithms do not deal with unaligned amino acids; thus, to simplify their descriptions, the structures are often restricted to the aligned amino acids only. Consider an alignment $A$ (of length $L$) of protein structures $S^x$ and $S^y$. If we denote the restrictions as $X$ and $Y$, then $X[i]$ and $Y[i]$ contain the coordinates of the $i$-th aligned pair of amino acids: $X[i]$ contains the coordinates of the $i$-th aligned amino acid from the structure $S^x$, and $Y[i]$ contains the coordinates of the $i$-th aligned amino acid from the structure $S^y$. The distance between them (after the superposition) is denoted as $d_i$. We use these symbols in the descriptions of all superposition algorithms. In the rest of the section, we describe three of probably the most used superposition algorithms, namely RMSD, MaxSub, and TM-score.

RMSD

One of the most commonly used superposition algorithms is the algorithm minimizing the root mean square deviation (RMSD) of the distances between aligned amino acids. For a given superposition, the RMSD score is formally defined as:

$\mathrm{RMSD} = \sqrt{\frac{\sum_{i=1}^{L} d_i^2}{L}}$ (3.1)

Probably the biggest advantage of such a definition of the score is that the superposition which minimizes the score has an algebraic solution [30].
The computation is based on the center of mass vectors $T_X$ and $T_Y$ and a covariance matrix $R$:

$T_X = \frac{\sum_{i=1}^{L} X[i]}{L}$ (3.2)

$T_Y = \frac{\sum_{i=1}^{L} Y[i]}{L}$ (3.3)

$R = \sum_{i=1}^{L} (X[i] - T_X)(Y[i] - T_Y)^T$ (3.4)

The rotation $U$ and the translation $T$ of the superposition can then be expressed as:

$U = W \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & d \end{pmatrix} V^T$ (3.5)

$T = T_Y - U T_X$ (3.6)

where $d = \mathrm{sign}(\det(R))$, and $R = V S W^T$ is the singular value decomposition of the matrix $R$.

Probably the biggest advantage of the RMSD superposition is that the algorithm based on the previous formulas works in linear time; thus, it is very fast. Unfortunately, the RMSD superposition also has several disadvantages.

RMSD is very sensitive to misaligned pairs of amino acids (called outliers). Such pairs have a very negative effect on the score and on the transformation itself.

The previous problem is linked to the problem of balancing the quality of the score against the alignment cover (i.e., what percentage of the amino acids in the structures is aligned). Putting too much emphasis on the alignment cover carries the risk that the alignment will include many outliers and the score will be too high. On the other hand, putting too much emphasis on a low score brings the risk that the alignment cover will be very low; in extreme cases, the alignment can be one amino acid long.

MaxSub

The problems with outliers and with balancing the score quality against the alignment cover can be solved, for example, by using the MaxSub score. The MaxSub algorithm maximizes the sum of values based on the distances between aligned amino acids. Outliers do not contribute to the total sum, so they have no negative effect on the score. Their effect is only indirect: the misaligned amino acids cannot be aligned with different amino acids, with which they could have created better pairs and which could have contributed to the total sum. This scoring allows the alignment algorithm to focus on the alignment cover without a loss of score quality. The MaxSub algorithm maximizes a formula of the following form [31]:

$\mathrm{MaxSub} = \frac{1}{L_T} \sum_{i=1}^{L} \frac{1}{1 + \left( \frac{d'_i}{d} \right)^2}$ (3.7)

where $L_T$ is the size of the query structure, $d$ is a distance threshold, and $d'_i$ equals $d_i$ if $d_i < d$; pairs with $d_i \geq d$ are outliers and their terms are taken to be 0.
However, in contrast to RMSD, there is no efficient algorithm that would find the exact superposition maximizing the formula quickly. Therefore, a heuristic approach, which may yield a suboptimal superposition, is the only practical solution.

The MaxSub algorithm tries to find the maximal subset of an alignment in which the distances between aligned pairs are smaller than the threshold $d$. The maximal subset is then used to evaluate the score formula. Finding the maximal subset is based on the idea that two at least partially similar protein structures are likely to have structure subsets that are very similar. For such subsets, the RMSD superposition would not be much different from the optimal MaxSub superposition. Therefore, the algorithm computes RMSD superpositions for various subsets of the alignment (we will call them cuts). Each initial cut is iteratively extended to contain as many amino acid pairs below the threshold as possible.

The pseudo-code of the MaxSub algorithm is outlined in Algorithm 1. As input, it receives the length of the query structure $L_T$ and two sequences of coordinates $X$ and $Y$, both of length $L$, representing the coordinates of the aligned pairs of amino acids for the first and the second protein, respectively. The result is the value of the computed score and the corresponding transformation. The outer loop (lines 2-17) of the algorithm takes all contiguous parts of the alignment of length $l$ as the initial cuts. Each initial cut is extended by $k$ iterations of the internal loop (lines 4-9). Each iteration (denoted as $j$) computes the superposition for the current cut (line 5) and constructs a new cut that contains only those pairs of amino acids that are closer than the threshold $jd/k$ after the superposition (lines 6-9). Then a new RMSD superposition is computed (line 10), and pairs having distances greater than the threshold $d$ are removed from the cut (lines 11-14). The extended cut having the maximum length is selected (lines 15-17). The final cut is then used for the computation of the score value (line 18).

TM-score

Both of the above algorithms share the same problem. The score value that we can obtain for a pair of random (unrelated) structures depends on the length of the compared structures.
For example, the authors of TM-score state that a MaxSub score of 0.3 can reflect a significant alignment for structures of 400 amino acids, but it is close to a random selection for structures of 40 amino acids [32]. This creates problems in situations where it is necessary to select a similarity threshold to distinguish significant and insignificant alignments. The TM-score solves this issue by modifying the MaxSub score to use a scale function $d_0$ that depends on the length of the query protein [32]. The algorithm tries to find the superposition that maximizes the following formula:

$\text{TM-score} = \frac{1}{L_T} \sum_{i=1}^{L} \frac{1}{1 + \left( \frac{d_i}{d_0(L_T)} \right)^2}$ (3.8)

where $L_T$ is the size of the query structure. The scale function has been empirically determined as:

$d_0(L_T) = 1.24 \sqrt[3]{L_T - 15} - 1.8$ (3.9)
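For a fixed superposition, the three scores of this section can be evaluated directly from the distances $d_i$ between aligned pairs. The following Python sketch does exactly that and deliberately leaves out the search for the superposition itself. The function names are assumptions of this sketch; the constants 1.24, 15, and 1.8 in the scale function are those of the published TM-score definition [32]; the MaxSub threshold is passed in explicitly, and pairs at or beyond the threshold are simply skipped, in line with line 13 of Algorithm 1. The scale function is only meaningful for queries considerably longer than 15 amino acids, and the sketch does not handle shorter ones.

```python
import math


def rmsd(distances):
    """Root mean square deviation of the pair distances (Equation 3.1)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def maxsub_score(distances, query_length, d):
    """MaxSub score (Equation 3.7); pairs with distance >= d contribute 0."""
    total = sum(1 / (1 + (di / d) ** 2) for di in distances if di < d)
    return total / query_length


def tm_d0(query_length):
    """Length-dependent scale d0 (Equation 3.9); assumes query_length > 15."""
    return 1.24 * (query_length - 15) ** (1 / 3) - 1.8


def tm_score(distances, query_length):
    """TM-score of a fixed superposition (Equation 3.8)."""
    d0 = tm_d0(query_length)
    return sum(1 / (1 + (di / d0) ** 2) for di in distances) / query_length


# Illustrative distances in angstroms; the last pair is an outlier.
example_distances = [0.0, 1.75, 7.0]
example_maxsub = maxsub_score(example_distances, query_length=4, d=3.5)
example_rmsd = rmsd(example_distances)
```

In the example, the outlier dominates the RMSD value, whereas it simply drops out of the MaxSub sum, which illustrates the different sensitivity of the two scores to misaligned pairs.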

Such a definition of the scale function ensures that the TM-score has a value of approximately 0.17 (independent of the sizes of the structures) if random (unrelated) structures are compared. As in the case of MaxSub, there is no efficient algorithm that would find the exact superposition maximizing the formula quickly. The heuristic algorithm is similarly based on the idea that two at least partially similar protein structures are likely to have structure subsets that are very similar. The algorithm computes RMSD superpositions for various subsets of the alignment, called cuts. A TM-score value is computed for each RMSD superposition, and the maximal value (and the corresponding transformation) is returned as the result of the algorithm.

Algorithm 1: MaxSub algorithm
Input: $L_T \in \mathbb{N}$, $L \in \mathbb{N}$, $X \in (\mathbb{R}^3)^L$, $Y \in (\mathbb{R}^3)^L$
Result: $score \in \mathbb{R}$, $U_{max} \in \mathbb{R}^{3 \times 3}$, $T_{max} \in \mathbb{R}^3$
Parameters: $k = 4$, $d$
begin
 1  $cut_{max} \leftarrow \{\}$
 2  for $i = 1$ to $L - l + 1$ do
 3      $cut \leftarrow \{i, \ldots, i + l - 1\}$
 4      for $j = 1$ to $k$ do
 5          $(U, T) \leftarrow \mathrm{RMSD}(X[cut], Y[cut])$
 6          $cut \leftarrow \{\}$
 7          for $i = 0$ to $L - 1$ do
 8              if $\| X[i] - (U Y[i] + T) \| < jd/k$ then
 9                  $cut \leftarrow cut \cup \{i\}$
10      $(U, T) \leftarrow \mathrm{RMSD}(X[cut], Y[cut])$
11      $cut_f \leftarrow \{\}$
12      foreach $i \in cut$ do
13          if $\| X[i] - (U Y[i] + T) \| < d$ then
14              $cut_f \leftarrow cut_f \cup \{i\}$
15      if $|cut_f| > |cut_{max}|$ then
16          $cut_{max} \leftarrow cut_f$
17          $(U_{max}, T_{max}) \leftarrow (U, T)$
18  $score \leftarrow \frac{1}{L_T} \sum_{i \in cut_{max}} \frac{1}{1 + \left( \frac{\| X[i] - (U_{max} Y[i] + T_{max}) \|}{d} \right)^2}$
Result: $score$, $(U_{max}, T_{max})$

Algorithm 2: TM-score algorithm
Input: $L_T \in \mathbb{N}$, $L \in \mathbb{N}$, $X \in (\mathbb{R}^3)^L$, $Y \in (\mathbb{R}^3)^L$
Result: $score \in \mathbb{R}$, $U_{max} \in \mathbb{R}^{3 \times 3}$, $T_{max} \in \mathbb{R}^3$
Parameters: $max_c = 20$
begin
 1  $score_{max} \leftarrow 0$
 2  foreach $l \in \{L, L/2, \ldots, 4\}$ and $i \in \{0, \ldots, L - l\}$ do
 3      $cut \leftarrow \{i, \ldots, i + l - 1\}$
 4      for $c = 0$ to $max_c$ do
 5          $(U, T) \leftarrow \mathrm{RMSD}(X[cut], Y[cut])$
 6          $score \leftarrow \frac{1}{L_T} \sum_{i \in cut} \frac{1}{1 + \left( \frac{\| X[i] - (U Y[i] + T) \|}{d_0(L_T)} \right)^2}$
 7          if $score > score_{max}$ then
 8              $score_{max} \leftarrow score$
 9              $(U_{max}, T_{max}) \leftarrow (U, T)$
10          $cut' \leftarrow cut$
11          $cut \leftarrow \{\}$
12          for $i = 0$ to $L - 1$ do
13              if $\| X[i] - (U Y[i] + T) \| < 3 d_0$ then
14                  $cut \leftarrow cut \cup \{i\}$
15          if $cut = cut'$ then
16              break
Result: $score_{max}$, $(U_{max}, T_{max})$

The pseudo-code of the TM-score algorithm is outlined in Algorithm 2. As input, it receives the length of the query structure $L_T$ and two sequences of coordinates $X$ and $Y$, both of length $L$, representing the coordinates of the aligned pairs of amino acids for the first and the second protein, respectively. The result is the value of the computed TM-score and the corresponding superposition. The outer loop (lines 2-16) of the algorithm takes all contiguous parts of the alignment of lengths $L, L/2, L/4, \ldots, 4$ as the initial cuts. Each iteration of the internal loop (lines 4-16) computes the superposition for the current cut (line 5) and constructs a new cut that contains only those pairs of amino acids that are closer than the threshold $3 d_0$ after the superposition (lines 12-14). The internal loop continues until the cut stabilizes or until the maximal number of iterations is reached. Afterwards, the outer loop continues by picking another initial cut for testing.

3.2 Structure Similarity Methods

During the last decade, many methods have been introduced to measure protein structure similarity. They use different properties to identify similar parts of the compared structures. Some of them compare inter-amino acid distances measured within the individual structures (e.g., DALI [33], SSAP [34], or CE [35]). A very important advantage of such an approach is that no superposition is needed for distance comparisons. A different approach is to use inter-amino acid distances measured between the compared structures (e.g., PPM [36]). Angles between specific parts of a structure are also widely used (e.g., 3D-BLAST [37], SARST [38]). Other methods are based on measuring the similarity of amino acid neighborhoods (e.g., Vorolign [39], Vorometric [40]). In the following sections, these methods will be described in more detail.

DALI

One of the earliest approaches to the assessment of protein structure similarity is DALI [33]. It is based on the idea that similar protein structures also have similar inter-residue distances. More precisely, consider that protein A and protein B are similar and that there are two pairs of aligned residues $i = (i_A, i_B)$ and $j = (j_A, j_B)$. Then the distance $d^A_{ij}$ between residues $i_A$ and $j_A$ in protein A should be similar to the distance $d^B_{ij}$ between residues $i_B$ and $j_B$ in protein B. DALI adopts this observation and defines a similarity score based on the comparison of inter-residue distances for every pair of aligned pairs of residues. The score has the form of an additive similarity score:

$S = \sum_{i=1}^{L} \sum_{j=1}^{L} \varphi(i, j)$ (3.10)

where $i$ and $j$ label pairs of aligned residues, $L$ is the length of the alignment, and $\varphi$ is the similarity measure based on the distances $d^A_{ij}$ and $d^B_{ij}$. The authors of DALI introduced two forms of $\varphi(i, j)$:

1. Rigid similarity score:

$\varphi^R(i, j) = \theta^R - |d^A_{ij} - d^B_{ij}|$ (3.11)

where the superscript R stands for rigid, and $\theta^R = 1.5$ Å is the zero level of the similarity.

2. Elastic similarity score:

$\varphi^E(i, j) = \begin{cases} \left( \theta^E - \frac{|d^A_{ij} - d^B_{ij}|}{d^*_{ij}} \right) w(d^*_{ij}) & \text{if } i \neq j \\ \theta^E & \text{otherwise,} \end{cases}$ (3.12)

where $\bar{d}_{ij}$ is the average of $d^A_{ij}$ and $d^B_{ij}$, $\theta^E = 0.20$ is the similarity threshold, and $w$ is an envelope function defined as:

$$w(r) = e^{-r^2 / \alpha^2} \tag{3.13}$$

where $\alpha = 20$Å, calibrated for the size of a typical domain. The elastic similarity score is preferred by the authors, because it is more tolerant to the cumulative effect of gradual geometrical distortions.

Based on this definition of the similarity, the DALI alignment algorithm tries to find the alignment having the best similarity score. The algorithm employs two steps. In the first step, a systematic pairwise comparison of hexapeptide-hexapeptide fragments is performed. In this scheme, residues $(i_A, \ldots, i_A + 5, j_A, \ldots, j_A + 5)$ in protein A are paired with residues $(i_B, \ldots, i_B + 5, j_B, \ldots, j_B + 5)$ in protein B, where the hexapeptide $i_A, \ldots, i_A + 5$ is aligned with $i_B, \ldots, i_B + 5$ and the hexapeptide $j_A, \ldots, j_A + 5$ is aligned with $j_B, \ldots, j_B + 5$. The 40,000 pairs of hexapeptide-hexapeptide fragments that have the best scores are selected and used as the input for the second step. This step is based on a Monte Carlo algorithm and tries to assemble pairs of fragments into larger consistent sets of alignments.

3.2.2 SSAP

Another method using inter-amino-acid distances is the SSAP (Sequential Structure Alignment Program) method [34]. Consider amino acid $i$ from protein A and amino acid $k$ from protein B. The score of aligning amino acid $i$ with amino acid $k$ can be described by comparing their distances to other amino acids. If it is known that amino acid $j$ from protein A is equivalent to amino acid $l$ from protein B, then the following comparison can be performed:

$$s = \frac{a}{|d^A_{ij} - d^B_{kl}| + b} \tag{3.14}$$

where $d^A_{ij}$ is the distance between the C$_\beta$ atoms of amino acids $i$ and $j$ in protein A, and similarly for $d^B_{kl}$; $a$ and $b$ are parameters. Unfortunately, when the similarity between a pair of amino acids is examined, no information about equivalences between amino acids is known. For this reason, the similarity between amino acids $i$ and $k$ is measured by the first level of dynamic programming using the (lower level) scoring matrix defined as:

$$S_{jl} = \frac{a}{|d^A_{ij} - d^B_{kl}| + b} \tag{3.15}$$

where $j$ ranges over the length of protein A, and $l$ ranges over the length of protein B. As a consequence, comparing one pair of amino acids produces one alternative alignment of the whole structures. SSAP feeds the alternative alignments into the

second level of dynamic programming to produce the final alignment. Scores along the alternative alignments are summed in the higher level scoring matrix to produce a consensus of the alternative alignments (see Figure 3.1). The higher level scoring matrix is then used by the second level of dynamic programming.

Figure 3.1: The sum of local alignments

3.2.3 ProtDex2

Similarly to the previous methods, ProtDex2 is also based on comparisons of properties that can be measured between local parts of a single structure. In contrast to DALI, ProtDex2 uses a higher number of more complicated properties defined for larger fragments of structures [41]. The ProtDex2 method breaks the protein structure down into secondary structure elements (SSEs). ProtDex2 distinguishes only the two common types of SSEs: helix (H) and sheet (E). The inter-SSE contact between a pair of SSEs is described by a so-called feature vector. Each protein is then represented as a set of feature vectors, one vector for each pair of SSEs. Two proteins are considered to be similar if they contain a sufficient number of similar feature vectors. The feature vector representation is derived from two classical concepts, namely:

1. SSE vector representation: SSEs can be roughly approximated as vectors in 3D space. The equations for calculating the start and the end points of an SSE vector have been adopted from [42]. The vector $V^a$ of SSE $a$ has the start point denoted as $V^a_{start}$ and the end point denoted as $V^a_{end}$.

2. Contact region: The contact region $C^{a,b}$ for SSEs $a$ and $b$ is the matrix of distances defined as:

$$C^{a,b}_{i,j} = d_{s_a + i, s_b + j} \tag{3.16}$$

where $i \in \{0, \ldots, l_a - 1\}$ and $j \in \{0, \ldots, l_b - 1\}$, $s_a$ and $s_b$ are the starting amino acids of SSEs $a$ and $b$, $l_a$ and $l_b$ are the lengths of SSEs $a$ and $b$, and $d_{x,y}$ denotes the distance between amino acids $x$ and $y$.

Based on these concepts, a seven-dimensional feature vector describing the contact between SSEs $a$ and $b$ is defined having the following elements:

1. Angle between $a$ and $b$:

$$\theta(a, b) = \begin{cases} 0 & \text{if } V^a = V^b \\ \cos^{-1} \left( \dfrac{V^a \cdot V^b}{|V^a| \, |V^b|} \right) & \text{otherwise} \end{cases} \tag{3.17}$$

2. Vertex distance between $a$ and $b$:

$$VD(a, b) = \min \left\{ |V^a_{end} - V^b_{start}|, \; |V^a_{start} - V^b_{end}|, \; |V^a_{end} - V^b_{end}|, \; |V^a_{start} - V^b_{start}| \right\} \tag{3.18}$$

3. Square root of the area of $C^{ab}$:

$$SA(a, b) = \sqrt{l_a \, l_b} \tag{3.19}$$

4. Aspect ratio of $C^{ab}$:

$$AR(a, b) = \min \left( \frac{l_a}{l_b}, \frac{l_b}{l_a} \right) \tag{3.20}$$

5. Mean C$_\alpha$-C$_\alpha$ distance in $C^{ab}$:

$$MD(a, b) = \frac{\sum_{i=0}^{l_a - 1} \sum_{j=0}^{l_b - 1} C^{ab}_{i,j}}{l_a \, l_b} \tag{3.21}$$

6. Standard deviation of the C$_\alpha$-C$_\alpha$ distance in $C^{ab}$:

$$SD(a, b) = \sqrt{\frac{\sum_{i=0}^{l_a - 1} \sum_{j=0}^{l_b - 1} \left( C^{ab}_{i,j} - MD(a, b) \right)^2}{l_a \, l_b}} \tag{3.22}$$

7. Type of $C^{ab}$:

$$CT(a, b) = \begin{cases} 0 & \text{if } (a \text{ is H}) \wedge (b \text{ is H}) \wedge (a = b) \\ 1 & \text{if } (a \text{ is H}) \wedge (b \text{ is H}) \wedge (a \neq b) \\ 2 & \text{if } (a \text{ is E}) \wedge (b \text{ is E}) \wedge (a = b) \\ 3 & \text{if } (a \text{ is E}) \wedge (b \text{ is E}) \wedge (a \neq b) \\ 4 & \text{if } ((a \text{ is H}) \wedge (b \text{ is E})) \vee ((a \text{ is E}) \wedge (b \text{ is H})) \end{cases} \tag{3.23}$$

As stated above, two proteins are considered to be similar if they contain enough similar feature vectors. Given a query protein structure $Q$ and a protein structure $P$ in the database, the similarity score $\psi$ compares all pairs of feature vectors (one feature vector is always from $Q$ and the other belongs to $P$). If the pair is considered similar, the total score is increased by the score contribution $\delta$:

$$\psi(Q, P) = \sum_{\varphi(T_Q, T_P) > 0} \delta(T_Q, T_P) \tag{3.24}$$

The score contribution is based on the similarity of the vectors (denoted $\varphi$) and their weights (denoted $w$):

$$\delta(T, T') = \frac{w(Q, T) \cdot w(P, T') \cdot \varphi(T, T')}{W_Q \cdot W_P} \tag{3.25}$$

The similarity score $\varphi$ of two given feature vectors $T$ and $T'$ is determined as:

$$\varphi(T, T') = \sum_{r \in \{\theta, VD, SA, AR, MD, SD, CT\}} \varphi_r(t_r, t'_r) \tag{3.26}$$

where $\varphi_r$ is the partial similarity score for the attribute $r$, which is in turn defined as:

$$\varphi_r(t_r, t'_r) = \begin{cases} \sigma_r \, e^{-\left( |t_r - t'_r| / \xi_r \right)} & \text{if } |t_r - t'_r| \leq \xi_r \\ 0 & \text{otherwise} \end{cases} \tag{3.27}$$

where $\xi_r$ is the threshold value for the allowed difference between two attribute values $t_r$ and $t'_r$, and $\sigma_r$ is the relative importance of the attribute $r$. The weight $w(Q, T)$ of the feature vector $T$ from the query $Q$ is calculated as:

$$w(Q, T) = (\lg f_{Q,T} + 1) \cdot \left( \lg \frac{N}{f_T} + 1 \right) \tag{3.28}$$

and the weight $w(P, T)$ of the feature vector $T$ from the database protein $P$ is calculated as:

$$w(P, T) = \lg f_{P,T} + 1 \tag{3.29}$$

where $N$ is the total number of protein structures in the database, $f_T$ is the number of proteins in which $T$ occurs, $f_{Q,T}$ is the number of occurrences of $T$ in $Q$, and $f_{P,T}$ is the number of occurrences of $T$ in $P$. Finally, $W_x$ is the size of protein $x \in \{Q, P\}$ in terms of the feature vectors it contains:

$$W_x = \sqrt{\sum_{T \in x} \left( w(x, T) \right)^2} \tag{3.30}$$
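The four attributes of the feature vector that depend only on the contact region (square root of the area, aspect ratio, mean distance, and standard deviation) can be illustrated by a short sketch. It is not taken from ProtDex2 itself: the function names are hypothetical, and it assumes that each amino acid is represented by a single coordinate (for example its C$_\alpha$ atom) and that SSEs are given by their starting index and length.

```python
import math


def contact_region(coords, s_a, l_a, s_b, l_b):
    """Distance matrix C[i][j] between residue s_a + i and residue s_b + j."""
    return [[math.dist(coords[s_a + i], coords[s_b + j]) for j in range(l_b)]
            for i in range(l_a)]


def contact_features(c):
    """Attributes SA, AR, MD and SD of a contact region given as nested lists."""
    l_a, l_b = len(c), len(c[0])
    cells = l_a * l_b
    md = sum(sum(row) for row in c) / cells
    sd = math.sqrt(sum((v - md) ** 2 for row in c for v in row) / cells)
    return {
        "SA": math.sqrt(cells),              # square root of the area
        "AR": min(l_a / l_b, l_b / l_a),     # aspect ratio
        "MD": md,                            # mean distance
        "SD": sd,                            # standard deviation of distances
    }


# Toy example (hypothetical coordinates): four residues on a line,
# SSE a = residues 0-1, SSE b = residues 2-3.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
features = contact_features(contact_region(coords, 0, 2, 2, 2))
print(features)
```

The remaining attributes (angle, vertex distance, and contact type) need the SSE vectors and SSE types and are therefore left out of this sketch.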

ProtDex2 has been designed to allow quick search in a database. The database that uses the ProtDex2 method is in fact an inverted file. Each feature vector is assigned a list of IDs of the structures in which the vector is present. When a database search is performed, feature vectors of the query structure are compared with feature vectors from the database. The score $\varphi$, which reflects how similar the vectors are, is computed for each compared pair of feature vectors. If the score $\varphi$ is non-zero, it is added to every structure ID listed with the feature vector in the database. Finally, the structures with the highest total scores are returned as the result. This method makes the time complexity of the search more dependent on the number of different feature vectors and much less dependent on the number of structures in the database.

3.2.4 CE

The CE algorithm is based on combinatorial extension (hence the name) of an alignment path defined by aligned fragment pairs (AFPs) [35]. First, sufficiently similar aligned fragment pairs are identified. In the following step, the fragments are combined to obtain an alignment. Finally, an optimization process is performed.

AFPs are defined as gapless alignments having a given fixed length (denoted as $m$). An AFP $i$ is unambiguously determined by its starting amino acid position in protein A (denoted as $p^A_i$) and its starting amino acid position in protein B (denoted as $p^B_i$). An AFP $i$ is considered to satisfy the similarity criteria if and only if the following inequality is satisfied:

$$\frac{1}{m^2} \left( \sum_{k=0}^{m-1} \sum_{l=0}^{m-1} \left| d^A_{p^A_i + k, \, p^A_i + l} - d^B_{p^B_i + k, \, p^B_i + l} \right| \right) < D_0 \tag{3.31}$$

where $D_0$ is a similarity threshold (3Å by default).

An alignment is constructed from AFPs and has the form of an alignment path. Two AFPs $i$ and $i + 1$ can be consecutive parts of an alignment path if one of the following conditions is true:

$$p^A_{i+1} = p^A_i + m \text{ and } p^B_{i+1} = p^B_i + m \tag{3.32}$$

or

$$p^A_{i+1} > p^A_i + m \text{ and } p^B_{i+1} = p^B_i + m \text{ and } p^A_{i+1} \leq p^A_i + m + G \tag{3.33}$$

or

$$p^A_{i+1} = p^A_i + m \text{ and } p^B_{i+1} > p^B_i + m \text{ and } p^B_{i+1} \leq p^B_i + m + G \tag{3.34}$$

where $G$ is the maximum allowable size of gaps. Applying these conditions reduces the combinatorial space and makes the combinatorial search significantly faster. Each suitable fragment can be used as a seed of an alignment path. The combinatorial extension is then based on a distance between AFPs. The distance between

two AFPs $i$ and $j$ is defined as:

$$D_{ij} = \frac{1}{m} \left( \left| d^A_{p^A_i, \, p^A_j} - d^B_{p^B_i, \, p^B_j} \right| + \left| d^A_{p^A_i + m - 1, \, p^A_j + m - 1} - d^B_{p^B_i + m - 1, \, p^B_j + m - 1} \right| + \sum_{k=1}^{m-2} \left| d^A_{p^A_i + k, \, p^A_j + m - 1 - k} - d^B_{p^B_i + k, \, p^B_j + m - 1 - k} \right| \right) \tag{3.35}$$

The combinatorial extension follows these steps:

1. All AFPs that can extend the alignment are selected.

2. The best AFP is chosen based on the following condition:

$$\frac{1}{n - 1} \sum_{i=0}^{n-1} D_{in} < D_1 \tag{3.36}$$

3. The decision to extend or terminate the path is made based on the following condition:

$$\frac{1}{n^2} \sum_{i=0}^{n} \sum_{j=0}^{n} D_{ij} < D_1 \tag{3.37}$$

where $n$ is the next AFP to be considered for addition to the alignment path of $n - 1$ AFPs, and $D_1$ is a required threshold (by default 4Å).

After the combinatorial phase ends, optimizations of the best candidate alignments are performed. These optimizations contribute up to 2Å improvement in the RMSD score. During the optimization phase, the twenty alignments having the best RMSD score are selected. To optimize each of them, two strategies are used:

The first one tries to relocate gaps in the alignment. If the new alignment has a better RMSD score, the changes in the gap boundaries are accepted.

The second strategy performs an iterative optimization using dynamic programming which employs the affine gap penalty model (5 for opening and 0.5 for extension) and the substitution score $M_{ij} = d_0 - d_{ij}$, where $d_0$ is a constant initialized at $d_0 = 2$ and incremented by 0.5 in every iteration. The distance $d_{ij}$ between amino acid $i$ (from protein A) and $j$ (from protein B) is based on the superposition obtained from the alignment within the previous step. The iterations are repeated until either of two conditions is true:

1. The alignment length is less than 95% of the alignment length before the optimization.

2. RMSD is less than 110% of RMSD at the iteration when the condition 1 was first satisfied.
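The AFP similarity criterion (Equation 3.31) and the rules for chaining two consecutive AFPs (Equations 3.32 to 3.34) translate almost directly into code. The following sketch is illustrative only: the function names are hypothetical, each amino acid is assumed to be represented by a single coordinate, and the values of the fragment length $m$ and the maximal gap $G$ used in the example are arbitrary rather than taken from the text.

```python
import math


def distance_matrix(coords):
    """All pairwise distances within one structure."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def afp_is_similar(dist_a, dist_b, p_a, p_b, m, d0=3.0):
    """Equation 3.31: mean absolute difference of intra-fragment distances."""
    total = 0.0
    for k in range(m):
        for l in range(m):
            total += abs(dist_a[p_a + k][p_a + l] - dist_b[p_b + k][p_b + l])
    return total / (m * m) < d0


def can_follow(p_a, p_b, q_a, q_b, m, max_gap):
    """Equations 3.32-3.34: may the AFP starting at (q_a, q_b) follow (p_a, p_b)?"""
    no_gap = q_a == p_a + m and q_b == p_b + m
    gap_in_a = p_a + m < q_a <= p_a + m + max_gap and q_b == p_b + m
    gap_in_b = q_a == p_a + m and p_b + m < q_b <= p_b + m + max_gap
    return no_gap or gap_in_a or gap_in_b


# Toy example (hypothetical): a straight chain compared with itself.
chain = [(3.8 * i, 0.0, 0.0) for i in range(20)]
dist = distance_matrix(chain)
print(afp_is_similar(dist, dist, 0, 5, m=8))
print(can_follow(0, 0, 8, 8, m=8, max_gap=30))
print(can_follow(0, 0, 10, 10, m=8, max_gap=30))
```

The last call returns `False` because a gap would have to be opened in both proteins at once, which none of the three conditions allows.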

3.2.5 MAMMOTH

Another method, called MAMMOTH, is based on defining an amino acid similarity that can be used by a dynamic programming algorithm to obtain an alignment [43]. The representation of an amino acid used by MAMMOTH is based on heptapeptide fragments. For any given fragment, each pair of consecutive amino acids defines a vector. The vectors are normalized to unit size and moved into the origin, thus the fragment is mapped into vectors in the unit sphere. The measure of similarity between amino acids is then based on the unit-vector root mean square (URMS) distance of the corresponding spheres. For a pair of amino acids A and B, the URMS distance is computed and denoted as $URMS_{AB}$. To obtain the local similarity measure, the URMS is then normalized by the formula:

$$S_{AB} = \begin{cases} 10 \cdot \dfrac{URMS_R - URMS_{AB}}{URMS_R} & \text{if } URMS_R > URMS_{AB} \\ 0 & \text{otherwise} \end{cases} \tag{3.38}$$

where $URMS_R$ is the expected minimum URMS distance between two random sets of $n$ unit vectors ($n$ equals six in this case) that is estimated as:

$$URMS_R = \sqrt{2.0 - \frac{2.84}{\sqrt{n}}} \tag{3.39}$$

Therefore, the amino acid similarity ranges from 0 to 10. Based on the amino acid similarity, a local alignment algorithm using the affine gap penalty model is used to obtain the alignment. To measure the quality of the alignment, MAMMOTH employs the MaxSub algorithm to express the percentage of structural identity (PSI), defined as the percentage of aligned residues below 4.0Å. The PSI score is transformed into a z-score $z$ and normalized to express the probability of obtaining the given proportion of aligned residues by chance (P-value):

$$P(Z > z) = 1 - e^{-e^{-\left( \frac{\pi}{\sqrt{6}} z + \gamma \right)}} \tag{3.40}$$

where $\gamma$ is the well-known Euler-Mascheroni constant. The normalization is based on the Type-I extreme value distribution focusing on the largest extreme (also known as the Gumbel distribution).

3.2.6 Vorolign

A similar approach based on a definition of the local similarity is also used in the Vorolign method [39]. Unlike the previous method, the Vorolign technique does not represent an amino acid by the amino acids that are sequentially close to it, but by the amino acids that lie in its spatial neighborhood.

Vorolign uses a Voronoi tessellation to decide if two amino acids are neighbors. A Voronoi tessellation defined for $n$ generating points decomposes the space into $n$ convex polyhedra, called Voronoi cells. Each Voronoi cell belongs to one generating point and contains all points in the space that are closer to the given generating point than to the other generating points. Vorolign uses the C$_\beta$ atoms of amino acids as generating points; two amino acids are then considered to be neighbors if the corresponding Voronoi cells share a common face. Each amino acid is represented by the neighbor list that contains its neighbor amino acids. The list is sorted to respect the sequence ordering. For each amino acid in the neighbor list, the type of the amino acid and the type of the secondary structure element in which the amino acid occurs are stored. The similarity of two neighbor lists is then defined by the first level of dynamic programming using the weighted sum of two scores as the similarity function:

$$Sim(x_i, y_j) = \omega_1 AA(x_i, y_j) + \omega_2 SSE(x_i, y_j) \tag{3.41}$$

where $AA(x_i, y_j)$ corresponds to an amino acid substitution score and $SSE(x_i, y_j)$ corresponds to a similarity of the corresponding secondary structure elements. The amino acid similarity score is finally used in the second level of dynamic programming to obtain a global alignment between the pair of compared structures.

3.2.7 Vorometric

A similar approach to representing an amino acid is also used by the Vorometric method [40]. However, there are some differences. Vorometric uses the C$_\alpha$ atoms of amino acids to generate the Voronoi tessellation. The first level of dynamic programming is performed separately for upstream amino acids and downstream amino acids (relative to the represented amino acid). The similarity function used by the dynamic programming is modified to respect the metric axioms, thus the amino acid similarity score also respects the metric axioms. Therefore, metric indexing methods can be applied to the neighbor lists stored in a database.

When similar structures are being searched, each neighbor list from the query structure is used as a query to the database of neighbor lists. The results (hits) are grouped according to the database proteins they belong to. For each database protein represented by some hits, the hits are sorted in descending order according to their similarities. Hits are then extended to obtain high-scoring segment pairs (HSPs). The hits that have been processed during previous extensions are pruned. The heuristic used for this extension is similar to the one used in the BLAST method. A hit is extended by constructing gapped local alignments in both the forward and the backward direction. The extension is limited by the condition that the score may not fall by more than a fraction of the best score found so far. Finally, the alignment defined by the HSPs is iteratively optimized by computing an RMSD superposition from which a new (better) alignment is derived.

3.2.8 PPM

A weak point of many methods is the use of rigid body superposition. Protein structures are flexible objects, so using rigid body superposition to measure their similarities can be limiting. The phenotypic plasticity method (PPM) solves this problem by measuring the cost of morphing one structure into another [36]. The method identifies the similar substructures and measures the changes in their mutual topology.

The PPM algorithm consists of several steps. In the first step, similar parts of the structures, called core blocks, are identified. A core block $a$ is a subsequence of one protein structure that is rigidly superposable to some subsequence $b$ of the other protein structure. The similarity of such a mapping $v = (a, b)$ is denoted as $sim(a, b) = sim(v)$. The obtained core block mappings are filtered using global topology information to reduce their number and to avoid shift errors.

In the second step, a weighted graph of the core block mappings is constructed. The core block mappings are used as vertices of the graph. The edge between two core block mappings $(a_1, b_1)$ and $(a_2, b_2)$ is weighted by the value describing the structural mutation changing the topology $(a_1, a_2)$ into the topology $(b_1, b_2)$. To reduce the number of edges, the edge is used only if the distances between $a_1$ and $a_2$, and between $b_1$ and $b_2$, are under a given threshold.

In the third step, an alignment that is represented by a tree $T$ is constructed. Consider the tree $T$ containing $n$ vertices; then there are $n - 1$ edges that can be used to connect a vertex $v$ to the other nodes of $T$. To reduce the alignment plasticity, the edge having the $k$-th smallest weight is used to connect $v$. The alignment represented by the tree with the maximal score is identified by the A* algorithm, while the score of the alignment represented by the tree $T = (v_1, \ldots, v_n; e_1, \ldots, e_{n-1})$ is defined as:

$$PPM(A^k_T) = \sum_{i=1}^{n} sim(v_i) + \sum_{j=1}^{n-1} weight(e_j) \tag{3.42}$$

As the final step, the normalized score is computed. It takes into account the length of the query structure and the number of unaligned amino acids (denoted as $na$):

$$PPM_{norm}(P_1, P_2) = \frac{3 \, PPM(A_T(P_1, P_2)) - 0.1 \, na}{length(P_1)} \tag{3.43}$$

3.2.9 FAST

Another method that is based on graph theory is the FAST method [44]. The problem of finding an alignment can be transformed into the problem of finding the maximal clique (i.e., the maximal complete sub-graph) in a graph. A vertex $(i, j)$ of the graph represents an aligned pair of amino acids, $i$ from protein A and $j$ from protein B. The edge between vertices $(i, j)$ and $(m, n)$ denotes the compatibility of inter-amino-acid distances, i.e., it denotes that the distance between amino acids $i$ and $m$ in protein A is similar to the distance between

amino acids $j$ and $n$ in protein B. The vertices of the maximal clique then determine the alignment. All inter-amino-acid distances between aligned pairs are compatible, so the maximal clique identifies a similar substructure of the compared protein structures.

Figure 3.2: The weight of an edge

Unfortunately, the maximum clique detection problem is NP-hard. Thus, the FAST method uses a heuristic approach: the graph $G = (V, E)$ of aligned pairs is pruned until it is close to a clique. This is done in several steps:

1. Local geometric comparison

For a given vertex $(i, j) \in V$ representing the pair of aligned amino acids ($i$ from A, and $j$ from B), the score $L_{ij}$ is defined as:

$$L_{ij} = \min \left\{ D(d^A_{i-2,i+2}, d^B_{j-2,j+2}), \; D(d^A_{i-2,i+1}, d^B_{j-2,j+1}), \; D(d^A_{i-1,i+2}, d^B_{j-1,j+2}) \right\} \tag{3.44}$$

where $d^S_{k,l}$ denotes the distance between amino acids $k$ and $l$ in the structure $S$, and $D$ is defined as:

$$D(d_1, d_2) = 0.1 - \frac{|d_1 - d_2|}{d_1 + d_2} \tag{3.45}$$

Based on the $L$ score, vertices having negative scores or having isolated high scores are removed from the graph.

2. Scoring scheme for edge computation

Based on distances and angles between amino acids, the weight $e_{ij;mn}$ of the edge between a pair of vertices $(i, j)$ and $(m, n)$ is defined as:

$$e_{ij;mn} = \left( 1 - \max \{ k_d t_d, k_\alpha t_\alpha, k_\beta t_\beta, k_\gamma t_\gamma \} \right) \cdot e^{-\left( \frac{d^A_{i,m} + d^B_{j,n}}{2 d_0} \right)^2} \tag{3.46}$$

where

$$t_\alpha = \max \{ |\alpha_1 - \alpha'_1|, |\alpha_{-1} - \alpha'_{-1}| \}$$
$$t_\beta = \max \{ |\beta_1 - \beta'_1|, |\beta_{-1} - \beta'_{-1}| \}$$
$$t_\gamma = \max \{ |\gamma_1 - \gamma'_1|, |\gamma_{-1} - \gamma'_{-1}| \}$$
$$t_d = \frac{|d^A_{i,m} - d^B_{j,n}|}{d^A_{i,m} + d^B_{j,n}}$$

where $\alpha_1, \alpha_{-1}, \beta_1, \beta_{-1}, \gamma_1, \gamma_{-1}$ (and their primed counterparts in protein B) are angles defined by pairs of vectors connecting pairs of selected amino acids (see Figure 3.2), and $k_d, k_\alpha, k_\beta, k_\gamma$ are empirically determined scaling factors. The score $e_{ij;mn}$ is used for further pruning of the vertices $V$ and also for scoring the alignment. For an alignment $X$, the similarity score $S_X$ is defined as:

$$S_X = \sum_{(i,j) \in X; \; (m,n) \in X; \; (i,j) \neq (m,n)} e_{ij;mn} \tag{3.47}$$

3. Further pruning and initial alignment

Based on the weights of edges, the new score $T_{ij}$ of vertex $(i, j)$ is defined as:

$$T_{ij} = \sum_{(m,n) \in V} \max \{ e_{ij;mn}, 0 \} \tag{3.48}$$

The score $T_{ij}$ is used for iterative pruning of the graph vertices $V$ using several heuristics; for example, vertices representing isolated pairs of amino acids or vertices receiving a low score $T$ are eliminated. To measure the improvement achieved by such pruning, the degree of unanimity is defined as the number of edges with positive weights divided by the total number of possible edges. Pruning is repeated until there is no further improvement in the unanimity. After pruning, the global optimum should be dominant. Dynamic programming using $T_{ij}$ as the scoring matrix is then employed to obtain the initial alignment $X_0$.

4. Alignment refinement

The obtained initial alignment $X_0$ is further refined. Based on the alignment, the compatibility score of a vertex $(i, j)$ is defined as:

$$R_{ij} = \sum_{(m,n) \in X_0; \; (m,n) \neq (i,j)} e_{ij;mn} \tag{3.49}$$

The scoring matrix $R_{ij}$ is used by the dynamic programming to obtain a new alignment. The refinement is performed for up to five rounds (with a new $R$ matrix and an updated alignment constructed in each step).

Finally, the raw score $S_X$ of the final alignment is normalized by the following equation:

$$S = \frac{S_X}{\sqrt{M N}} \tag{3.50}$$

where $M$ and $N$ are the numbers of amino acids in the compared structures.
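The local geometric comparison of the first pruning step can be sketched in a few lines. The sketch assumes that Equation 3.45 has the form $D(d_1, d_2) = 0.1 - |d_1 - d_2| / (d_1 + d_2)$, that each amino acid is represented by one coordinate, and that the indices $i \pm 2$ and $j \pm 2$ lie inside the chains; the function names are hypothetical.

```python
import math


def relative_distance_score(d1, d2):
    """D(d1, d2): positive only when the two distances differ by a small fraction."""
    return 0.1 - abs(d1 - d2) / (d1 + d2)


def local_score(a, b, i, j):
    """L_ij: the worst of three local distance comparisons around the pair (i, j)."""
    def dist(s, k, l):
        return math.dist(s[k], s[l])

    return min(
        relative_distance_score(dist(a, i - 2, i + 2), dist(b, j - 2, j + 2)),
        relative_distance_score(dist(a, i - 2, i + 1), dist(b, j - 2, j + 1)),
        relative_distance_score(dist(a, i - 1, i + 2), dist(b, j - 1, j + 2)),
    )


# Toy example (hypothetical): a chain, an identical copy, and a stretched copy.
chain = [(3.8 * k, 0.0, 0.0) for k in range(7)]
stretched = [(7.6 * k, 0.0, 0.0) for k in range(7)]
same = local_score(chain, chain, 3, 3)
different = local_score(chain, stretched, 3, 3)
print(same, different)
```

With this form of $D$, the pair taken from identical chains keeps a positive score, while the pair taken from the stretched copy receives a negative score and would be removed from the graph.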

3.2.10 Sabertooth

The Sabertooth method represents a protein structure by the so-called structural profile that is based on the contact vector [45, 46]. The $i$-th element of the contact vector represents the number of contacts of an amino acid $i$. Amino acids are considered to be in contact if their C$_\alpha$ atoms are closer than 17Å and their distance in the sequence is greater than 3. The structural profile is defined as the normalized contact vector:

$$c_i = \frac{v_i}{\bar{v}} \tag{3.51}$$

where $v_i$ is the contact vector and $\bar{v}$ is the average of the elements of $v_i$. The scoring used by the Sabertooth alignment algorithm is a variant of the edit distance. For given structures A and B, the substitution score $S_{ij}$ between an amino acid $i$ (from A) and an amino acid $j$ (from B) is defined as:

$$S_{ij} = \left| c^A_i - c^B_j \right|^{p_{aligne}} + p_{substf} \left( 1 - P(a^A_i, a^B_j) \right)^{p_{subste}} \tag{3.52}$$

where $c^S_i$ is the $i$-th element of the structural profile of protein $S$, $a^S_i$ is the type of the $i$-th amino acid of protein $S$, $P(x, y)$ is the probability that amino acids of types $x$ and $y$ share the same ancestor, and $p_{aligne}$, $p_{substf}$, and $p_{subste}$ denote parameters of the measure. The insertion of amino acid $i$ into structure $S$ is penalized by:

$$I^S_i = p_{insertf} \left( c^S_i \right)^{p_{inserte}} \tag{3.53}$$

where $p_{insertf}$ and $p_{inserte}$ denote parameters of the measure. Finally, breaking the structure $S$ between positions $i$ and $i + 1$, which represents the opposite of an insertion, is penalized by:

$$B^S_i = p_{breakf} \left( \frac{c^S_i + c^S_{i+1}}{2} \right)^{p_{breake}} \tag{3.54}$$

where $p_{breakf}$ and $p_{breake}$ denote parameters of the measure. Sabertooth uses Dijkstra's shortest path algorithm; however, dynamic programming can also be used. The values of the parameters have been tuned on a training set to obtain the best possible results. After the alignment is done, the superposition is obtained using the MaxSub algorithm.

The final alignment can be improved by a post-processing phase. The obtained superposition is used to identify fragments that are spatially close. A second run of the alignment algorithm is then performed, but only pairs of amino acids that are part of the fragments or that are located between two consecutive fragments can be aligned. After that, the new superposition is computed. Similarly to the MAMMOTH method, the Sabertooth score is evaluated as the z-score of the percentage of structural identity (PSI) obtained by the MaxSub algorithm.
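The contact vector and the structural profile of Equation 3.51 are simple to compute. The sketch below follows the contact definition given in the text (C$_\alpha$ distance below 17Å, sequence separation greater than 3); the function names and the handling of a structure without any contacts are assumptions made for the illustration.

```python
import math


def contact_vector(ca, cutoff=17.0, min_separation=3):
    """Number of contacts per residue, given a list of C-alpha coordinates."""
    counts = [0] * len(ca)
    for i in range(len(ca)):
        for j in range(i + 1, len(ca)):
            if j - i > min_separation and math.dist(ca[i], ca[j]) < cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


def structural_profile(counts):
    """Contact vector divided by its mean (Equation 3.51)."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        # Assumption: a structure without contacts gets an all-zero profile.
        return [0.0] * len(counts)
    return [c / mean for c in counts]


# Toy example (hypothetical): six residues on a straight line, 3.8 Å apart.
ca = [(3.8 * k, 0.0, 0.0) for k in range(6)]
counts = contact_vector(ca)
profile = structural_profile(counts)
print(counts, profile)
```

Because of the normalization, the elements of a structural profile always average to one, which makes profiles of proteins of different sizes comparable.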

3.2.11 db-itm

The method db-itm represents our older approach [47, 48]. A protein structure (of size $n$) is represented by a set of $n$ feature vectors, each of them describing the neighborhood of an individual amino acid. We present several semantics of feature vectors based on the density of amino acids in nested 3-dimensional rings centered in the amino acid from which the protein is viewed (see Figure 3.3). Based on the widths or perimeters of those rings, feature vectors are extracted, which we call viewpoint tags (VPTs) since they are blueprints of the protein according to the given amino acids. One of the semantics is, for example, sdens: let $vp$ represent a particular viewpoint; then $vp[i]$ stands for the $i$-th ring and its value is the density (sum) of the residues in the ring.

Figure 3.3: Example of an amino acid viewpoint

We utilize the weighted Euclidean distance for the comparison of VPTs. Weighting is used to emphasize the fact that, for assessing the similarity of a pair of viewpoints, their close neighborhood is more important than the more distant one. Specifically, for the $i$-th coordinates of the feature vectors, we define the weighting scheme $w$ as $w(i) = n - \log(i)$. Using VPTs and the distance functions, we apply dynamic programming to find optimal pairs of amino acids (similar viewpoints) that follow the sequence order. We employ Needleman-Wunsch dynamic programming when a global alignment is needed, or Smith-Waterman dynamic programming in the case of a local alignment.

The obtained alignment is scored by TM-score. One of the major qualities of the TM-score formula lies in concentrating on strong local structural similarities. Hence, an alignment with highly similar regions shows a high TM-score. On the other hand, a superposition that looks good visually can obtain a lower TM-score than its appearance suggests. In such a case, a correction of the alignment by dynamic programming should increase the TM-score value. Therefore, we employ the Needleman-Wunsch algorithm with the scoring function $S$, defined as:

$$S(i, j) = \frac{1}{1 + \left( \frac{d_{ij}}{d_0} \right)^2} \tag{3.55}$$

where $d_{ij}$ is the Euclidean distance of the $i$-th and $j$-th residues of the proteins, and $d_0$ is the TM-score normalization factor. The optimal path through the dynamic programming matrix represents the alignment having the best TM-score value according to the given superposition. We modify the Needleman-Wunsch dynamic programming to increase speed and to avoid extensive modifications of the alignment by considering only an area (belt) of constant width in the dynamic programming matrix going along the original alignment (we set the width of the belt to 21). Based on the newly obtained alignment, the whole process of computing the TM-score superposition can be iteratively repeated (we run two iterations of dynamic programming).

3.2.12 3D-BLAST

Another approach is presented by the 3D-BLAST method [37]. It is derived from the BLAST method, which is considered the state of the art in protein sequence similarity search, and it allows the same high-quality search algorithm to be used. To use an approach based on the BLAST method, it is necessary to introduce a small structural alphabet and a corresponding substitution matrix. 3D-BLAST defines a structural alphabet that is inferred from the $\kappa$ and $\alpha$ angles. For a given amino acid $i$, the angle $\kappa$ is a bond angle formed by the three C$_\alpha$ atoms of amino acids $i - 2$, $i$, and $i + 2$. The angle $\alpha$ is a dihedral angle formed by the four C$_\alpha$ atoms of amino acids $i - 1$, $i$, $i + 1$, and $i + 2$ (see Figure 3.4). Combinations of the angles $\kappa$ and $\alpha$ are factorized and clustered based on their frequency. The result is a small alphabet having 23 letters. A structure is then represented as a sequence of these letters, where one letter corresponds to one amino acid. A substitution matrix (called SASM) is created analogously to the creation of the BLOSUM62 substitution matrix for protein sequences.

3.2.13 SARST

The method SARST (Structural similarity search Aided by Ramachandran Sequential Transformation) is very similar to the previous method [38]. Unlike the previous method, however, it uses the well-known Ramachandran plot based on

$\phi$ and $\psi$ backbone dihedral angles. For a given amino acid, $\phi$ and $\psi$ describe the angles around the N-C$_\alpha$ and C$_\alpha$-C bonds. Also in this case, combinations of the angles are factorized and clustered based on their frequency. The result is 22 clusters denoted by letters. Moreover, a special letter X is used when it is not possible to obtain the required dihedral angles. A substitution matrix is created similarly to the creation of the BLOSUM substitution matrix. This approach, which represents structures as sequences of letters and introduces a substitution matrix, allows classical sequence similarity search methods to be used to perform the similarity retrieval.

Figure 3.4: Angles $\kappa$ and $\alpha$ [49]
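Both structural alphabets start from backbone angles. As an illustration, the $\kappa$ bond angle and the $\alpha$ dihedral angle used by 3D-BLAST can be computed from C$_\alpha$ coordinates as sketched below. The helper names are hypothetical, the sign convention of the dihedral angle is one common choice and not necessarily the one used by 3D-BLAST, and the subsequent clustering of angle pairs into letters is not shown.

```python
import math


def sub(p, q):
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def dot(p, q):
    return p[0] * q[0] + p[1] * q[1] + p[2] * q[2]


def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def bond_angle(a, b, c):
    """Angle (degrees) at point b between the directions b->a and b->c."""
    u, v = sub(a, b), sub(c, b)
    cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    length = math.sqrt(dot(b2, b2))
    unit_b2 = (b2[0] / length, b2[1] / length, b2[2] / length)
    m1 = cross(n1, unit_b2)
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))


def kappa(ca, i):
    """Bond angle formed by the C-alpha atoms of residues i-2, i and i+2."""
    return bond_angle(ca[i - 2], ca[i], ca[i + 2])


def alpha(ca, i):
    """Dihedral angle formed by the C-alpha atoms of residues i-1, i, i+1, i+2."""
    return dihedral(ca[i - 1], ca[i], ca[i + 1], ca[i + 2])


# Toy coordinates (hypothetical, not a real backbone).
ca = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
      (0.0, 1.0, 0.0), (0.0, 1.0, 1.0)]
print(kappa(ca, 2), alpha(ca, 2))
```

Both angles are undefined near the chain termini, where the required neighboring residues do not exist; SARST handles the analogous situation for $\phi$ and $\psi$ with its special letter X.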

Chapter 4 SProt Similarity Measure

In contrast to most of the presented methods (see Section 3.2), in our solution we put a lot of emphasis on high-quality modeling of local similarities of the amino acids. We believe that representing proteins by various derived features might cause a loss of information that is essential for a quality alignment. In this chapter we present our solution, called SPROTorg, SProt, which aims to avoid the possible loss-of-information drawback.

4.1 Method Description

As is clear from the surveyed methods, determining the alignment and superposition of protein structures is a nontrivial problem. However, what holds true for whole protein structures does not have to be valid for small substructures. If we want to align two small parts of two protein backbones, the natural way is to execute a gapless alignment of these parts. When aligning only a few amino acids, it does not make sense to introduce gaps, and thus the alignment is defined unambiguously. We further employ this alignment in a subsequent step where we add to the alignment those amino acids that are spatially close to the already aligned backbone amino acids. These do not have to be close in terms of sequence order. In this way, we are able to take the spatial neighborhood into account when modeling local similarity. As stated in the survey (see Section 3.1), given an alignment, the computation of the superposition is a relatively easy task. The principle outlined above is the central point of the local measure used in SProt.

Before we describe the details of the algorithm in the following sections, we briefly present the main ideas. SProt represents each amino acid A by the amino acids that are spatially close to A (Section 4.1.1). To compute the local similarity between such representations of amino acids, an alignment and a superposition are subsequently performed (Section 4.1.2), as motivated above. The computed local similarities are then used by a dynamic programming method to obtain the global structural alignment. The quality of this alignment is expressed in terms of a TM-score value (Section 4.1.3).

4.1.1 Representation of a Protein

Each amino acid A is represented by the amino acids located within the Euclidean sphere centered in A with a given radius. Since the representation of A is based on its spatial neighborhood bounded by the sphere, we call the representation an aa-sphere. SProt treats the position of each amino acid as its α-carbon position. However, when testing the intersection of an amino acid with a sphere, all heavy atoms of the amino acid are considered, not only the α-carbon. Such an approach allows us to include in the aa-sphere those amino acids whose α-carbons are too far from the aa-sphere's center but whose side chains are still close enough.

We divide the content of each aa-sphere into several categories:

Spherical backbone is the maximal continuous part of the amino acid sequence that is included in the aa-sphere and contains the central amino acid. A spherical backbone is divided into the upstream spherical backbone and the downstream spherical backbone. In the former, the amino acids precede the central amino acid in the protein sequence, while in the latter the amino acids follow the central amino acid.

Upstream neighborhood contains the amino acids in the aa-sphere that precede the central amino acid in the protein sequence and are not included in the spherical backbone.

Downstream neighborhood contains the amino acids in the aa-sphere that follow the central amino acid in the protein sequence and are not included in the spherical backbone.

See Figure 4.1 for an example of an aa-sphere, including the categories. The example shows an aa-sphere for the 26th amino acid of ubiquitin (PDB ID: 1ubq). Each amino acid is represented by a ball centered in its α-carbon position. The tube corresponds to the protein backbone and denotes the protein sequence. The Euclidean sphere (gray) is centered in the 26th amino acid (black) and has a radius of 9 Å. The different colors emphasize the amino acids included in the aa-sphere. Some of the colored amino acids have their α-carbons outside the Euclidean sphere, but some of their heavy atoms intersect the sphere, and thus these amino acids are also included in the aa-sphere. The figure was generated by VMD [50].

For the purposes of the following steps, the amino acids in each category preserve the original protein sequence ordering. We also define the term quantity characteristics for each aa-sphere to denote the numbers of amino acids belonging to the particular categories. The whole protein can then be modeled by a sequence of aa-spheres built for every amino acid.
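The categories above can be illustrated with a short sketch. This is not the SProt implementation: the input layout (each residue as a list of heavy-atom coordinates with the α-carbon first) and the names `AaSphere` and `build_aa_sphere` are assumptions made for illustration only. The default radius of 9 Å is the value used in the example of Figure 4.1.

```python
import math
from dataclasses import dataclass


@dataclass
class AaSphere:
    """Hypothetical container: residue indices per category, in sequence order."""
    center: int
    upstream_backbone: list
    downstream_backbone: list
    upstream_neighborhood: list
    downstream_neighborhood: list

    def quantity_characteristics(self):
        # Sizes of the four categories: (q_ub, q_db, q_un, q_dn).
        return (len(self.upstream_backbone), len(self.downstream_backbone),
                len(self.upstream_neighborhood), len(self.downstream_neighborhood))


def build_aa_sphere(residues, center, radius=9.0):
    """residues[i] is a list of (x, y, z) heavy atoms, alpha-carbon first (assumed layout)."""
    ca = residues[center][0]
    # A residue is in the sphere if ANY of its heavy atoms lies within the radius.
    inside = [any(math.dist(atom, ca) <= radius for atom in res) for res in residues]

    # Spherical backbone: maximal contiguous in-sphere run containing the center.
    lo = center
    while lo > 0 and inside[lo - 1]:
        lo -= 1
    hi = center
    while hi < len(residues) - 1 and inside[hi + 1]:
        hi += 1

    return AaSphere(
        center=center,
        upstream_backbone=list(range(lo, center)),
        downstream_backbone=list(range(center + 1, hi + 1)),
        # In-sphere residues outside the contiguous run form the neighborhoods.
        upstream_neighborhood=[i for i in range(lo) if inside[i]],
        downstream_neighborhood=[i for i in range(hi + 1, len(residues)) if inside[i]],
    )
```

In this sketch a residue whose α-carbon lies outside the sphere is still included when one of its other heavy atoms falls within the radius, which mirrors the inclusion rule described above.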

[Figure 4.1: An example of an aa-sphere. Legend: upstream backbone, upstream neighborhood, central amino acid, downstream backbone, downstream neighborhood, non-sphere amino acids.]

4.1.2 Sphere Similarity

We measure the similarity of aa-spheres using the alignment and superposition of their content, as this is simpler for aa-spheres than for entire protein structures. Assigning a similarity to a pair of aa-spheres consists of five steps, where the first three steps construct the alignment and the last two evaluate it:

1. Generating seed spherical backbone alignment. Spherical backbones are aligned using a gapless alignment. The alignment is unique since it is gapless and the central amino acids are aligned to each other.

2. Computing spherical backbone superposition. The alignment from the previous step determines the spherical superposition, which is computed by the Kabsch algorithm of linear complexity [30].

3. Generating spherical alignment. In the previous step, we have superposed the spherical backbones. However, to assign a similarity to the whole aa-spheres, we also have to consider the remaining aa-sphere content. Therefore, the obtained superposition is used to align the rest of the amino acids in the aa-sphere (upstream and downstream neighborhoods). We apply the Needleman-Wunsch algorithm [24] (global alignment) separately to the upstream and downstream neighborhoods. The algorithm utilizes a scoring function in the form

$$S_{ij} = \frac{1}{1 + \left(\frac{d_{ij}}{d_s}\right)^2}, \tag{4.1}$$

where $d_{ij}$ is the Euclidean distance of the i-th and j-th amino acids according to the superposition of the aa-spheres, and $d_s$ is a scale parameter (empirically determined).

4. Computing raw spherical measure (SM-raw). The raw spherical measure for aa-spheres x and y is computed for the whole spherical alignment (steps 1, 2, 3) as:

$$\text{SM-raw}(x, y) = \frac{1}{\max_{[x][y]}} \sum_{i=1}^{L_A} \frac{1}{1 + \left(\frac{d_i}{d_s}\right)^2}, \tag{4.2}$$

where $L_A$ is the length of the alignment, $d_i$ is the distance between the i-th pair of amino acids according to the spherical superposition, $d_s$ is the same scale parameter as in the previous step, and $\max_{[x][y]}$ is a normalization factor (the maximal value of the sum that can be obtained for aa-spheres with the same quantity characteristics as the aa-spheres x and y).

5. Computing normalized spherical measure. The SM-raw value that is expected to occur for a pair of aa-spheres only by chance depends highly on the quantity characteristics of the compared aa-spheres, because better superpositions are more probable for smaller aa-spheres. Hence, a problem arises when comparing the similarities between pairs of aa-spheres with different quantity characteristics. Therefore, we compute empirical cumulative distribution functions (ECDF) for SM-raw that are specific to the quantity characteristics of the compared aa-spheres x and y (denoted as $F_{[x][y]}$). The usage of the ECDF allows us to express the probability that a better result could not be obtained by chance for aa-spheres with identical quantity characteristics. However, such a modification is not yet sufficient. For example, if aa-sphere w is obtained from aa-sphere z by removing some amino acids, then SM-raw(w, z) will be maximal for the given quantity characteristics. This implies that the ECDF value of SM-raw(w, z) will be maximal as well, which is not appropriate because the two aa-spheres differ. Therefore, we added the factor f that captures the differences in the quantity characteristics of aa-spheres x and y:

$$f(x, y) = \frac{\sum_{q \in \{q_{ub}, q_{db}, q_{un}, q_{dn}\}} \min(q(x), q(y)) + 1}{\sum_{(s,t) \in \{(q_{ub}, q_{un}), (q_{db}, q_{dn})\}} \max(s(x) + t(x), s(y) + t(y)) + 1}, \tag{4.3}$$

where $q_{ub}$, $q_{db}$, $q_{un}$ and $q_{dn}$ denote the individual quantity characteristics of an aa-sphere. The full normalized measure of the aa-spheres x and y is then

$$\text{SM-score}(x, y) = f(x, y) \cdot F_{[x][y]}(\text{SM-raw}(x, y)). \tag{4.4}$$

4.1.3 Alignment and Superposition

To generate the global alignment of two protein structures, the logarithm of the SM-score is used as a scoring function together with the linear gap penalty model. The SM-score estimates the probability that a matching of a given pair of spheres is significant. Thus, using the logarithm of the SM-score inside the Needleman-Wunsch algorithm maximizes the probability that the resulting alignment is significant.
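The global alignment step can be sketched as a standard Needleman-Wunsch recurrence over log-scores. The sketch below is an illustration under stated assumptions, not the SProt code: the function name `global_align`, the precomputed matrix `sm_score` and the clamping constant `eps` (which avoids taking the logarithm of zero) are hypothetical, and the default gap value log(0.75) is the gap penalty the text reports in its discussion of parameters.

```python
import math


def global_align(sm_score, gap=math.log(0.75), eps=1e-12):
    """Needleman-Wunsch over log(SM-score) with a linear gap penalty.

    sm_score[i][j] is assumed to be the SM-score (in [0, 1]) of aa-sphere i of
    the first structure and aa-sphere j of the second structure.
    Returns (total log-score, list of aligned index pairs).
    """
    n, m = len(sm_score), len(sm_score[0])
    # best[i][j]: best score for aligning the first i and the first j aa-spheres.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap

    def match(i, j):
        return best[i - 1][j - 1] + math.log(max(sm_score[i - 1][j - 1], eps))

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(match(i, j), best[i - 1][j] + gap, best[i][j - 1] + gap)

    # Traceback: recompute the same expressions to find which move was taken.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == match(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

Because the scores are logarithms of values in [0, 1], maximizing their sum corresponds to maximizing the product of the underlying SM-scores and gap factors. In this sketch, pairing two aa-spheres is preferred over skipping both of them only when their SM-score exceeds 0.75 squared, that is, about 0.56.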

After obtaining the alignment, we employ the widely used TM-score algorithm to get the superposition and the final score [32]. The TM-score algorithm was designed to maximize the following formula:

$$\text{TM-score} = \frac{1}{L_T} \sum_{i=1}^{L_A} \frac{1}{1 + \left(\frac{d_i}{d_0(L_T)}\right)^2}, \tag{4.5}$$

where $L_A$ is the length of the alignment, $L_T$ is the size of the query structure, $d_i$ is the distance between the i-th pair of amino acids according to the superposition computed by the TM-score algorithm, and $d_0(L_T)$ is a scale function.

When speaking about a similarity measure, we understand high scores as high similarities. However, for some applications it is more convenient to treat similarity as a distance, so that similar structures exhibit a low distance. Since the TM-score is a similarity measure that reaches 1 for identical structures, it can be easily converted to a distance function as $d(x, y) = 1 - \text{TM-score}(x, y)$.

4.1.4 Optimizations

The proposed SProt similarity measure depends on the following parameters, which must be tuned to obtain high-quality results.

Sphere Radius

This parameter determines the number of amino acids in an aa-sphere. A small radius results in a low number of amino acids in an aa-sphere, which leads to decreased accuracy. On the other hand, using a large radius increases the time needed to compute the aa-sphere similarity, because a large aa-sphere increases the runtime of the Needleman-Wunsch algorithm (which is of quadratic complexity). In our experiments, we used a sphere radius of 9 Å as a trade-off between time and accuracy.

Scale Parameter $d_s$

The SM-raw measure is a variant of the TM-score, which uses a scale parameter dependent on the size of the compared proteins. However, the TM-score parameterization is not suitable for aa-spheres, because they are much smaller than whole protein structures. Therefore, we used a constant-value scale parameter, as the predecessors of TM-score did. For example, MaxSub [31] used the value 3.5 Å and S-score [51] used the value 5 Å. We decided to set the parameter to 2 Å due to the generally smaller sizes of aa-spheres in comparison to the average protein size.

SM-raw Empirical Cumulative Distribution Functions

The empirical cumulative distribution functions (ECDF) of the SM-raw measure were produced from all-to-all comparisons of proteins taken from the ASTRAL-25 v1.65 database [20]. Since the ECDF computation is highly space-consuming if every possible combination of quantity characteristics has to be taken into account, a downsampling technique was used to decrease the space complexity. The upstream and downstream neighborhood characteristics were downsampled by a factor of 2, backbones of sizes 0 and 1 were treated identically, and so were all quantity characteristics exceeding the value 7.

Gap Penalty

The gap penalty value has an essential impact on the quality of the measure. We used log(0.75) as the gap penalty value, which gave the best results in most of the evaluations. This gap penalty is low enough that only amino acids with significant similarity are paired.

4.2 Results

In order to evaluate the quality of the proposed measure, we focus on expressing how well the measure fits the view of experts on protein structure similarity. The difficulty of this task lies in the absence of a large-scale expert-moderated database of pairwise protein structure similarities that we could use as a standard of truth. However, there exists the expert-moderated hierarchical evolutionary classification SCOP (Structural Classification of Proteins) that can be used for this purpose [19]. Using SCOP, we are able to (indirectly) compare SProt with the domain experts' conception of structure similarity.

The SCOP hierarchy consists of four levels: family, superfamily, fold and class. Proteins in the same family have either high sequence similarity (> 30 %), or they have a lower sequence similarity (> 15 %) but share a very similar function or structure. Proteins that share a common evolutionary origin (based on structural and functional features) but have different sequences reside in the same superfamily. Structures that share major secondary structures in a similar topological arrangement are in the same fold. Finally, similar folds are grouped into classes.
Therefore, SCOP can provide us with the information whether two protein structures are considered similar or not (at the given level) by a human observer. Although such a binary measure (similar or dissimilar) is not able to express detailed qualities of the similarity measure, such as the quality of the alignment or superposition, it is suitable for expressing the performance of the measure in terms of its classification and retrieval ability.

4.2.1 Protein Classification

Automatic classification of protein structures is one of the traditional problems. The task is to determine the SCOP classification of a query protein according to the investigated measure. The category of the query protein is derived from the category of the database protein that is most similar to the query. The accuracy of classification at a given level is measured as the percentage of correctly classified queries.

We used the dataset that was originally introduced for the evaluation of the Vorolign method (the Vorolign dataset). The dataset utilizes ASTRAL-25 v1.65 [20] containing 4,357 structures. As the query set, 979 structures from the difference set between SCOP v1.67 and v1.65 are used.

Results on the dataset are summarized in Table 4.1. The values of the db-itm method are taken from [47], and the values of the other compared methods (except FAST) are taken from [40]. The table describes the classification accuracy for the family, superfamily and fold levels. It also shows average values of several characteristics describing the algorithms from different points of view. Namely, the table contains the average TM-score, the average RMSD and the average alignment cover (i.e., the percentage of amino acids of a query that are aligned) between each query and its most similar structure used for classification.

[Table 4.1: Classification accuracy. Columns: Method, Family, Superfamily, Fold, RMSD, Cover, TM-score. Rows: SProt, Vorometric, PPM, db-itm, FAST, Vorolign, CE, BLAST. The numeric values are not preserved in this transcript.]

At the superfamily and fold levels, SProt outperforms the other solutions, while at the family level SProt is slightly outperformed by Vorometric. It is interesting to note that although the other solutions stand out in terms of the average values of the various characteristics, SProt outperforms them in terms of classification accuracy. Thus, better partial characteristics do not necessarily lead to better real-world results.

4.2.2 Information Retrieval in Protein Structure Databases

In the previous section, we measured the hit rate based on the most similar database structure. Thus, the most similar structure was the only determinant of the quality. However, the user often wants to obtain all relevant structures, not just the most similar one.
The result can then be presented as a list of database structures ordered according to the given measure, with the most similar structure on top. The correctness of such an ordering can be measured in terms of precision and recall, which are used as standard effectiveness measures in the area of information retrieval [52]. Precision expresses the percentage of structures at the given cut-off rank in the result list that are relevant. Recall expresses the percentage of all relevant results that are obtained at the given cut-off rank in the result list. The precision-recall dependence can be expressed in a graph that describes the average precision of queries for different recall levels. As a single-value evaluation metric, it is possible to use the widely accepted mean average precision [52]. For a single query, the average precision is defined as the average of the precision values computed for those prefixes of the result list that end with a relevant structure. The mean of these values over all queries then determines the mean average precision.

Another single-value evaluation metric is described in the Vorometric paper [40], where it is also called average precision. To avoid confusion, we will call it average precision for standard recall levels. This evaluation metric is defined as the mean of the average precision values for the 10 standard recall levels (10% to 100%).

For this experiment we used the ProtDex2 dataset consisting of 34,055 proteins, which was first used for the evaluation of the ProtDex2 method. As the query set, 108 structures from medium-size families of the dataset were selected. We consider a selected database structure as relevant if it comes from the same SCOP family as the query.

The precision-recall graph for the used dataset is presented in Figure 4.2. The data for the compared methods are taken from [40] and [38]. SProt has a better precision-recall curve than the other methods, except FAST and Vorometric. In comparison with Vorometric, the curve of SProt is slightly worse for medium recall levels, while it is noticeably better for high levels.

[Figure 4.2: Average precision-recall curves. Axes: recall (%) and average precision (%). Methods: SProt, FAST, Vorometric, CE, MAMMOTH, 3D BLAST, SARST, PSI BLAST, ProtDex.]

When measuring the single-value evaluation metrics defined above, SProt outperforms the other methods (probably except FAST, for which we have no precise data), as Table 4.2 demonstrates. The mean average precision values of the compared methods are taken from [37], and the average precision for standard recall levels values of the compared methods are taken from [40].

[Table 4.2: Average precision. Columns: Method, Mean average precision, Average precision for standard recall levels. Rows: SProt, Vorometric (one value n/a), CE, MAMMOTH, 3D-BLAST, PSI-BLAST. Table note: based on returning top 100 hits. The numeric values are not preserved in this transcript.]

4.2.3 Quality of Structural Alignments

It is also appropriate to investigate the quality of the alignments and scores that the measure produces. For this purpose, 10 difficult pairs of structures were introduced in [53]. As Table 4.1 shows, SProt does not produce a high alignment cover and TM-score. However, to produce a better alignment and TM-score, it is possible to apply an iterative improvement of the TM-score. In this case, the superposition obtained by the original SProt is used to produce a new, better alignment. A similar approach was also utilized by other methods, e.g., Vorometric. For the purpose of the improvement, the Needleman-Wunsch algorithm is used with the scoring function

$$S_{ij} = \begin{cases} \dfrac{1}{1 + \left(\frac{d_{ij}}{d_0(L_T)}\right)^2} & \text{if } d_{ij} < 3 d_0(L_T) \\ -\infty & \text{otherwise} \end{cases} \tag{4.6}$$

where $d_{ij}$ is the distance between the i-th and j-th amino acids according to the superposition, $L_T$ is the length of the query protein, and $d_0(L_T)$ is the scale function used in the TM-score. The $3 d_0(L_T)$ threshold is used to prevent aligning amino acids that are too distant. The resulting alignment is then used in the TM-score algorithm to obtain a new score and superposition. This procedure is repeated as long as the score improves.

As shown in Table 4.3, this approach (denoted SProt + TM-optimization) significantly improves the cover and the score. The tests were performed on the special set of 10 difficult pairs of structures, and average values of various characteristics are presented. The values of the compared methods are taken from [40]. On the other hand, extensive use of the iterative concept does not improve the results of the previous evaluations, while it noticeably degrades the performance of the algorithm.
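The scoring and the iteration described in this section can be sketched as follows. The sketch is illustrative only: the value of the scale function is passed in as the plain number `d0` rather than computed, the use of negative infinity for pairs beyond the threshold is an assumption of this sketch, and `realign` and `evaluate` are hypothetical placeholders for the Needleman-Wunsch step and the TM-score algorithm, respectively.

```python
import math


def tm_score(distances, query_length, d0):
    """TM-score sum for a fixed superposition; distances[i] is d_i of aligned pair i."""
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / query_length


def refinement_score(d_ij, d0):
    """Thresholded pair score; -inf beyond 3*d0 is an assumption of this sketch."""
    if d_ij < 3.0 * d0:
        return 1.0 / (1.0 + (d_ij / d0) ** 2)
    return -math.inf


def iterate(alignment, realign, evaluate):
    """Repeat re-alignment as long as the score improves.

    evaluate(alignment) -> (score, superposition) and
    realign(superposition) -> alignment are placeholder callables.
    """
    score, superposition = evaluate(alignment)
    while True:
        candidate = realign(superposition)
        new_score, new_superposition = evaluate(candidate)
        if new_score <= score:
            return alignment, score          # no further improvement
        alignment, score, superposition = candidate, new_score, new_superposition
```

A distance in the sense used elsewhere in the text would then be obtained as `1.0 - tm_score(...)`.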

[Table 4.3: Comparison of the alignment quality. Columns: Method, RMSD, Cover, TM-score. Rows: SProt + TM-optimization, SProt, Vorometric, Vorolign, DaliLite, SSAP, CE. The numeric values are not preserved in this transcript.]

4.3 Summary

We proposed a novel algorithm for measuring protein structure similarity that puts emphasis on high-quality modeling of local similarities of the amino acids. This is achieved by representing each amino acid by its spatial neighborhood containing close amino acids. The approach leads to good real-world results, especially for superfamily/fold classification accuracy and for precision at high recall levels. It reaches high efficiency and is comparable to or even better than most of the state-of-the-art methods in terms of effectiveness.

Chapter 5 Speed-up by Indexing

The proposed SProt measure is computationally very expensive. This poses a challenge, especially in the task of selecting the most similar structures from a large structure database, where many SProt computations have to be performed. One possible solution is to employ indexing methods. In particular, metric access methods could be successfully employed for this purpose. In this chapter we briefly introduce the basic concepts of metric access methods, explore the possibility of using these methods for indexing SProt, and finally introduce our own method based on metric access methods and evaluate the speed-up it provides.

5.1 Metric Access Methods

Most domain-specific applications of similarity search employ pairwise similarity only as a step within the process of database search. Typically, we search a database for the object most similar to a given query. The most straightforward solution in such a scenario is to sequentially scan the database, compare the query object to each object in the database and identify the most similar object (the nearest neighbor) or the k most similar objects (the k nearest neighbors).

The metric access methods (or metric indexes) [54] form a set of index structures that make it possible to filter out database objects not similar to the query, thus greatly decreasing the runtime while maintaining the accuracy of the search. This goal is achieved by relying on metric distance functions, which is a requirement of all metric access methods. Hence, only the domains where the distance d between objects fulfills the metric axioms can benefit from the metric access methods (without loss of accuracy). The metric axioms are as follows ($\forall x, y, z$):

1. Non-negativity: $d(x, y) \ge 0$
2. Identity of indiscernibles: $x = y$ iff $d(x, y) = 0$
3. Symmetry: $d(x, y) = d(y, x)$
4. Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$

The axiom of triangle inequality is the most important one for metric access methods. This axiom, in conjunction with the other ones, makes it possible to compute a lower bound $d_{LB}(q, o)$ of the distance $d(q, o)$ between a query object q and a database object o through another database object p (often called a pivot). Specifically, the following relation follows directly from the axioms:

$$d_{LB}(q, o) = |d(p, o) - d(p, q)| \le d(q, o) \tag{5.1}$$

It is possible to compute multiple lower bounds of the distance by using different pivots and to select the maximum lower bound, which is the one closest to the distance. This can provide a good estimate of the distance between q and o. If the estimate is large enough, object o can be filtered out, because it surely cannot be close to the query and so cannot be a part of the result set.

5.1.1 LAESA Access Method

One of the metric access methods, representing the so-called pivot-based approach, is LAESA [55, 56] (Linear Approximating and Eliminating Search Algorithm), which is suitable for time-expensive measures because of its filtering abilities [57]. LAESA uses a small part of the database as the set of pivots. The pivots are used during the query process to estimate the distances between a query and all the database objects. Based on these estimates, it is possible to eliminate some of the database objects from the search, so that the expensive distance computations between the query and these objects need not be performed. To compute the distance estimates as fast as possible, all distances between the pivots and the database objects are precomputed and stored in a so-called metric index.

To perform the k nearest neighbor query, LAESA maintains a set S containing the not yet eliminated objects that might still be included in the result. The elimination process is based on estimates of the distances between the query and the database objects.
Thus, LAESA also maintains an estimate e(o) of the distance between the query and each database object o. These estimates are continuously updated as more and more pivots are taken into account. During the execution of the algorithm, the k nearest neighbors from the set of already processed objects are stored in a set R. At the end of the algorithm, the set R contains the final result, i.e., the k nearest neighbor objects.

The pseudo-code of the LAESA algorithm is outlined in Algorithm 3 and can be described as follows:

1. Initialization: At the beginning, all database objects might be included in the result, therefore the set S contains all database objects (line 1). The lower-bound estimates of the distances between the query and the database objects are set to 0 (line 4) and the set R is empty (line 2).

2. The first pivot selection: An arbitrary pivot is selected and denoted s (line 5).

3. The main loop: While s is defined (line 6):

(a) Distance computation: Remove s from the set S (line 8) and compute the distance d(q, s) (line 9). Update the set R to contain the k already processed objects having the smallest distances to the query object q (line 10).

(b) Approximation: If s is a pivot, use it to make the estimates more accurate. That is, for each database object o, compute a lower bound of its distance to the query and set the related estimate e(o) to the value of the lower bound if the lower bound is greater than the original value of the estimate (lines 13-14).

(c) Elimination: Use the greatest distance between the query and an object from R as a threshold (line 11) and eliminate from S all objects o having e(o) greater than the threshold (lines 15-16). The distance between o and the query q is never smaller than the related estimate e(o), thus the eliminated objects cannot be included in the result. However, pivots contained in the set S are explicitly protected against elimination during the first few steps. The number of such steps is a parameter of the algorithm (denoted f_preserved in the code).

(d) The next object selection: If S contains pivots, select a pivot p ∈ S having the smallest estimate e(p) and denote it as s (line 18). Otherwise, select an object b ∈ S having the smallest estimate e(b) and denote it as s (line 20). If S is empty, s becomes undefined and the algorithm terminates.

4. Result: The set R contains the results of the search.

Algorithm 3: LAESA access method
Input: q ∈ 𝔻, k ∈ ℕ
Parameters: f_preserved ∈ ℕ
Data: D ⊂ 𝔻, P ⊆ D, d_i : D × P → ℝ₀⁺
Result: R ⊆ D
begin
 1    S ← D
 2    R ← {}
 3    iter ← 0
 4    ∀o ∈ D : e(o) ← 0
 5    s ← arbitrary member of P
 6    while S ≠ {} do
 7        iter ← iter + 1
 8        S ← S \ {s}
          // distance measuring
 9        d(s) ← d(q, s)
10        R ← k-nearest objects from set R ∪ {s}
11        threshold ← max{d(o); o ∈ R}
12        foreach o ∈ S do
              // approximation
13            if s ∈ P then
14                e(o) ← max{|d_i(o, s) − d(s)|, e(o)}
              // elimination
15            if e(o) > threshold and (o ∉ P or iter > f_preserved) then
16                S ← S \ {o}
17        if S ∩ P ≠ {} then
18            s ← select o ∈ S ∩ P having the lowest e(o)
19        else
20            s ← select o ∈ S having the lowest e(o)
Result: R

5.1.2 Capability of Indexing

From the description of LAESA (step 3c) it follows that the speed-up is directly proportional to the number of objects eliminated during the query process. It has been shown that the elimination ability (indexability) depends on the distribution of the distances between objects in the metric space [58].
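The link between eliminated objects and saved distance computations can be seen in a small sketch of Algorithm 3 that counts how many times the expensive distance is evaluated. The sketch assumes a symmetric metric distance; the function name `laesa_knn`, its argument layout, and the choice to postpone elimination until k candidates are known are assumptions of this illustration, not details taken from the algorithm above.

```python
import math


def laesa_knn(query, database, pivots, pivot_dist, dist, k, f_preserved=0):
    """k-NN search in the spirit of Algorithm 3.

    database: list of objects; pivots: set of indices into database;
    pivot_dist[(o, p)]: precomputed dist(database[o], database[p]).
    Returns (sorted list of (distance, index), number of dist() calls).
    """
    S = set(range(len(database)))          # not yet eliminated objects
    R = []                                 # current k best (distance, index)
    e = {o: 0.0 for o in S}                # lower-bound estimates
    s = next(iter(pivots))                 # arbitrary first pivot
    iteration = computed = 0
    while S:
        iteration += 1
        S.discard(s)
        d_s = dist(query, database[s])     # the only expensive operation
        computed += 1
        R = sorted(R + [(d_s, s)])[:k]
        # Assumption: no elimination until k candidates have been collected.
        threshold = R[-1][0] if len(R) == k else math.inf
        for o in list(S):
            if s in pivots:                # approximation
                e[o] = max(abs(pivot_dist[(o, s)] - d_s), e[o])
            if e[o] > threshold and (o not in pivots or iteration > f_preserved):
                S.discard(o)               # elimination
        if not S:
            break
        candidates = (S & pivots) or S     # prefer remaining pivots
        s = min(candidates, key=lambda o: e[o])
    return R, computed
```

For points on a line with the absolute difference as the distance, for instance, the lower bounds are tight and most objects are eliminated without their distance to the query ever being computed.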
In an extreme case, there exist measures that cannot be successfully indexed by metric access methods based on estimates of distances. For example, for a metric measure having values (for non-identical objects) in the range from 0.5 to 1, the lower bound formula gives values from 0 to 0.5. Since every actual distance, and thus every elimination threshold, is at least 0.5, the results of the lower bound formula are useless, as they cannot filter out anything. For such a measure, metric access methods work correctly, but all distances between the query object and the database objects have to be computed and thus no speed-up is achieved.

If a distance exhibits a low degree of indexability, it can be improved by applying a convex function on top of the original distance, the so-called similarity-preserving modifier [58]. The modifier virtually makes the object clusters in the database tighter, so that the indexability is increased. However, the use of such a modifier may violate the triangle inequality axiom to some extent. In particular, for some triplets of database objects x, y, z the triangle inequality does not hold, which can cause inaccuracies in the search. In such a case the search becomes only approximate. Therefore, the modifier has to be chosen carefully, since it represents a trade-off between accuracy and speed.

The concept of intrinsic dimensionality is widely used to estimate the indexability of a metric measure. For a metric measure d and a dataset S, the intrinsic dimensionality is defined as:

$$\rho(d, S) = \frac{\mu^2}{2\sigma^2} \tag{5.2}$$

where $\mu$ is the mean and $\sigma^2$ is the variance of the distance distribution on the dataset S [59]. A high intrinsic dimensionality indicates poor indexability of the measure, and vice versa. However, the intrinsic dimensionality should not be considered an absolute criterion of indexability, because it is not easy to say what a sufficiently low value of the intrinsic dimensionality is. In fact, the biggest benefit of the intrinsic dimensionality lies in the possibility of exploring the effects of modifiers. By comparing the intrinsic dimensionality before and after applying a modifier, it is possible to estimate the effect of the modifier on the indexability of the measure.

5.2 SProt Metric Properties

In this section, we discuss the possibility of using an adapted metric access method to index the SProt measure. Unfortunately, SProt does not meet the requirement stated in the previous section: it is not a metric measure, because it does not satisfy all the metric axioms. However, methods based on the metric access methods can be used (with certain restrictions) also for measures that violate some of the metric axioms.
In the following we explore the relationship of SProt to the metric axioms and discuss the impact of violating the metric axioms on the applicability of metric access methods. Finally, we also discuss the potential indexability of SProt.

5.2.1 Metric Axioms

The fulfillment or violation of the metric axioms is a consequence of the TM-score used (in the form of a distance) in the final step of measuring structure similarity. As stated above (see Section 5.1), there are four metric axioms. We discuss the axioms separately in the following paragraphs.

Non-negativity

Although this axiom is very simple, it is still important. It implies that the values of the measure are lower-bounded. The lower-bound value (i.e., zero) is the distance of identical structures. As follows from the definition of the measure, the range of the measure is from 0 to 1, so the value of the measure is always non-negative.

Identity of Indiscernibles

If two compared structures are identical, their distance is zero. Unfortunately, the value of the distance measure is also zero in all cases where the first compared structure is a substructure of the second compared structure. Thus, the identity of indiscernibles axiom is not satisfied and only the following weaker condition applies:

$$\text{SProt}(x, y) = 0 \text{ iff } x \subseteq y. \tag{5.3}$$

Fortunately, the violation of the axiom does not cause any trouble for most of the metric access methods. Moreover, even if the axiom is required for the proper functioning of a method, it is possible to define a new measure, SProt′, as:

$$\text{SProt}'(x, y) = \begin{cases} \text{SProt}(x, y) + \varepsilon & \text{if } x \ne y \\ 0 & \text{otherwise} \end{cases} \tag{5.4}$$

where $\varepsilon$ is a very small positive number. The new measure satisfies the axiom and preserves the validity of the other metric axioms.

From a user's perspective, there is no trouble either. If a query structure is a substructure of some database structure, then the database structure is probably relevant. In such a case, a zero (or very low) distance is fully acceptable. It is also worth mentioning that a zero distance between a pair of structures is very unlikely. Even two structures of the same protein obtained from two different physical experiments have a non-zero distance due to inaccuracies of the physical experiments.

Symmetry

From the definition of the measure it is clear that the measure is asymmetric. The reason is that only the length of the first structure is used to normalize the computed distance. The asymmetry of distances can be very large, as shown in Figure 5.1. Each point in the graph represents one pair of compared structures x and y from the dataset ASTRAL-40 v1.75. The coordinates of a point are equal to the distances SProt(x, y) and SProt(y, x). In the case of a symmetric measure, all points would lie on the diagonal.

The absence of symmetry does not pose a serious problem either, since a small change in the lower bound formula (see Formula 5.1) can fix it:

$$d_{LB}(q, o) = \max(d(p, o) - d(p, q),\; d(q, p) - d(o, p)) \le d(q, o) \tag{5.5}$$
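A short sketch can show why the asymmetric form of the lower bound is needed. The toy distance `d` below (movement to the right costs 1 per unit, movement to the left 0.5 per unit) is a hypothetical stand-in chosen because it is asymmetric yet satisfies the triangle inequality; it is not SProt, and the function names are illustrative.

```python
def d(x, y):
    """Toy asymmetric distance on numbers that satisfies the triangle inequality."""
    t = y - x
    return t if t >= 0 else -0.5 * t


def lower_bound(d_po, d_pq, d_qp, d_op):
    """Lower bound of d(q, o) through one pivot p for an asymmetric distance."""
    return max(d_po - d_pq, d_qp - d_op)


def best_lower_bound(q, o, pivots):
    """Maximum of the single-pivot bounds, i.e., the tightest available estimate."""
    return max(lower_bound(d(p, o), d(p, q), d(q, p), d(o, p)) for p in pivots)
```

With p = 0, q = 5 and o = 2, the symmetric formula gives |d(p, o) - d(p, q)| = 3, which exceeds the true distance d(q, o) = 1.5 and would therefore wrongly eliminate o, while the asymmetric bound evaluates to exactly 1.5.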

Figure 5.1: Asymmetry of SProt distances

It is important to note that Formula 5.5 requires (due to the asymmetry) both distances between structures q and p, i.e., SProt(q, p) and SProt(p, q). Both computations share the same alignment, which accounts for more than 90% of the computation time. Hence, SProt(p, q) can be computed relatively cheaply once SProt(q, p) has been computed.

Triangle Inequality

A substantial problem is that SProt also violates the triangle inequality, the most important metric axiom for indexing. The axiom makes it possible to estimate distances, which is the basic concept of most indexing methods (see Section 5.1). For this reason, a metric access method cannot simply be modified to avoid using the triangle inequality. However, even in this situation metric access methods can be used if the number of violating object triplets is small. The method then works as an approximate access method. To evaluate the degree to which the triangle inequality is fulfilled, the T-error can be used. For a given measure d and a given dataset S, the T-error is defined as the relative number of non-triangular triplets (triplets whose three distance values cannot be the side lengths of a triangle):

$$\text{T-error}(d, S) = \binom{|S|}{3}^{-1} \sum_{\{x, y, z\} \subseteq S} T(x, y, z) \qquad (5.6)$$

where T(x, y, z) is 1 if the distances of the given objects are not triangular, and 0 otherwise. The experiment performed on the ASTRAL-40 v1.75 dataset, containing 10,569 structures, shows that the T-error of the measure is very low ( in the used dataset). However, such a low value may be merely a consequence of the distribution of distances between objects, as has been shown in Section . The distribution of SProt distances computed for the same dataset is captured by the histogram in Figure 5.2. Only a small portion of the distances is less than 0.5, thus a low T-error is reasonably expected.

Figure 5.2: Histogram of SProt distances (x-axis: distance; y-axis: density)

For this reason, we compare the measured T-error with the T-error that can be expected for a measure having the same distribution of distances but no dependencies (i.e., no correlation) between unrelated distances. If we assume correlation only between distances computed for the same pair of objects in opposite directions, then the expected T-error (denoted as T-error_exp) can be expressed as:

$$\text{T-error}_{exp}(d) = \int_{\mathbb{R}^2} \int_{\mathbb{R}^2} \int_{\mathbb{R}^2} T(d_{xy}, d_{yz}, d_{xz})\, \rho(d_{xy})\, \rho(d_{yz})\, \rho(d_{xz})\; \mathrm{d}d_{xy}\, \mathrm{d}d_{yz}\, \mathrm{d}d_{xz} \qquad (5.7)$$

where d_xy, d_yz, d_xz ∈ R^2 are the distances (in both directions) between the denoted objects, ρ is the joint probability density of distances between pairs of objects, and T(d_xy, d_yz, d_xz) is 1 if the distances are not triangular, and 0 otherwise. To evaluate the expected T-error, we computed the empirical joint probability density for the used dataset, based on the pairwise comparison of structures.
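The T-error of Formula 5.6 can be sketched by brute force over all triplets. The code below is only an illustration under stated assumptions: the helper names are hypothetical, and for an asymmetric measure a triplet is counted as non-triangular as soon as any ordering of its members violates the triangle inequality, which is one possible reading of T(x, y, z).

```python
from itertools import combinations, permutations


def is_triangular(d, x, y, z):
    """True if d(a, c) <= d(a, b) + d(b, c) holds for every ordering of the triplet."""
    return all(d(a, c) <= d(a, b) + d(b, c) for a, b, c in permutations((x, y, z)))


def t_error(d, objects):
    """Fraction of unordered triplets that are not triangular (needs at least 3 objects)."""
    triplets = list(combinations(objects, 3))
    violations = sum(not is_triangular(d, *triplet) for triplet in triplets)
    return violations / len(triplets)


points = [0, 1, 2, 10]


def absolute(a, b):
    return abs(a - b)


def squared(a, b):
    return (a - b) ** 2


print(t_error(absolute, points))  # a metric: no violating triplet
print(t_error(squared, points))   # squared distance on collinear points: every triplet violates
```

The brute-force loop is cubic in the number of objects, so for a dataset of the size used here one would presumably sample triplets rather than enumerate them.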

The obtained value of the expected T-error is , which is approximately 62 times greater than the measured T-error. Thus, such a low T-error value of the SProt measure is not only a consequence of the distribution of distances.

Indexability

The distribution of the SProt distances (see Figure 5.2) does not promise good indexability; the intrinsic dimensionality computed for the same dataset is also extremely high. On the other hand, the T-error is very low, so we decided to use similarity-preserving modifiers to improve the indexability. The SProt measure ranges from 0 to 1, while most of the distances (approximately 95%) are higher than 0.7. Therefore, based on our experience, we decided to use a modifier that, simply put, smoothly expands the interval [0.7, 1] at the expense of the interval [0, 0.7], which is condensed. One such modifier is the RBQ_(0.7,0.15)(w) modifier [58], parameterized by a weight w. In general, the RBQ_(a,b)(w) modifier is defined as the rational Bézier quadric curve starting at point [0, 0], going toward the control point [a, b] ([0.7, 0.15] in this case), and arriving at point [1, 1]. The weight w determines the degree of deflection of the curve toward the control point (i.e., the convexity of the function). Thus, the weight determines the ratio of expansion and condensation, and thereby also impacts the indexability. The RBQ_(0.7,0.15)(w) modifier for various weights w is depicted in Figure 5.3.

Figure 5.3: RBQ modifiers (x-axis: original distance; y-axis: modified distance; one curve per weight)

Table 5.1: Effects of modifiers (columns: weight, T-error, expected T-error, intrinsic dimension)

The effects of the modifiers on the T-error, the expected T-error, and the intrinsic dimensionality are summarized in Table 5.1. The intrinsic dimensionality decreases dramatically with increasing weight, thus the use of modifiers has a very good influence on indexability. Unfortunately, the T-error increases with increasing weight and approaches the expected T-error. However, it still seems reasonably low and suitable for indexing.

5.3 SProt Access Method

We decided to base our access method on LAESA for its good filtering ability. Due to the violations of the metric axioms, several issues need to be addressed. The pseudo-code of the modified algorithm is outlined in Algorithm 4. The first issue was to modify the method to work correctly with asymmetric measures. The lower-bound formula had to be changed according to Formula 5.5 (see line 21). For this purpose, an index of reverse distances also had to be added (denoted as d^R_i in the code). Also, as stated above, the computation of the alignment is shared by the computations of the distances in both directions (see lines 10-13). The second issue was the violation of the triangle inequality axiom. Although we stated that the T-error of SProt is very low, it is important to realize that even a relatively small probability that a triplet violates the axiom can lead to a high probability that an estimation produced by LAESA during the execution is overvalued and thus incorrect.
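How quickly a small per-triplet violation probability compounds across pivots can be checked with a few lines. This is a sketch under a simplifying assumption that is not taken from the thesis: each pivot is treated as an independent trial, so the chance that at least one of n lower bounds is overvalued is 1 - (1 - p)^n. The probabilities used below are hypothetical illustration values, not measured ones.

```python
def overvaluation_probability(p_violation, pivots):
    """Chance that at least one of `pivots` independent lower bounds is overvalued."""
    return 1.0 - (1.0 - p_violation) ** pivots


# Hypothetical per-triplet violation probabilities, each combined with 1000 pivots.
for p in (1e-5, 1e-4, 1e-3):
    print(p, round(overvaluation_probability(p, 1000), 4))
```

Even a violation probability of one in ten thousand already yields an overvalued estimate for roughly one object in ten under this independence assumption.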
For example, suppose that the probability that a triplet does not satisfy the axiom is . Then, if we used 1000 pivots to estimate a distance, the probability that the estimation is incorrect would be approximately 1 ( ) %. The reason is that the estimation of a distance is always set to the maximum of the lower bounds produced by the different pivots. Thus, if one of the lower bounds is overvalued, the estimation is overvalued as well and becomes incorrect. Therefore, it is desirable to make the method more robust against incorrect estimations. To do so, we introduced two enhancements:

Algorithm 4: SProt access method

Input: q ∈ 𝔻, k ∈ ℕ
Parameters: f_approx ∈ ℝ+, f_order ∈ ℕ, f_preserved ∈ ℕ
Data: D ⊂ 𝔻, P ⊆ D, d_i : D × P → ℝ+0, d^R_i : P × D → ℝ+0
Result: R ⊆ D

begin
 1    S ← D; S' ← D
 2    R ← {}
 3    iter ← 0; stop ← 0
 4    ∀o ∈ D : e(o) ← 0
 5    s ← arbitrary member of P
 6    while stop ≤ f_order do
 7        iter ← iter + 1
 8        S ← S \ {s}
 9        S' ← S' \ {s}
          // distance measuring
10        A ← SProt_alignment(q, s)
11        d(s) ← 1 − TM-score(q, s, A)
12        if s ∈ P then
13            d^R(s) ← 1 − TM-score(s, q, A)
14        R' ← R
15        R ← k-nearest objects from set R ∪ {s}
16        if R ≠ R' then
17            stop ← 0
18        threshold ← (1 + f_approx · 10^−2) · max{d(o); o ∈ R}
19        foreach o ∈ S do
              // approximation
20            if s ∈ P then
21                e(o) ← max{ d(s) − d_i(o, s), d^R_i(s, o) − d^R(s), e(o) }
              // elimination
22            if e(o) > threshold and (o ∉ P or iter > f_preserved) then
23                S ← S \ {o}
24        if S ∩ P ≠ {} then
25            s ← select o ∈ S ∩ P having the lowest e(o)
26        else if S ≠ {} then
27            s ← select o ∈ S having the lowest e(o)
28        else
29            s ← select o ∈ S' having the lowest e(o)
30            stop ← stop + 1
Result: R

1. An object t is eliminated during the LAESA elimination step if the estimation e(t) of d(q, t) is greater than a threshold θ. In such a case, the distance d(q, t) is greater than θ as well. However, this may not be true if the estimation is overvalued. Hence, we introduced the requirement that the estimation must exceed θ by more than v percent, which makes the algorithm robust against small overvaluations of the estimations (see line 18). If the estimations are not overvalued by more than v percent, then the result of the algorithm is equal to that of the sequential scan. We call the value v the approximation error tolerance factor (denoted as f_approx in the code).

2. The second improvement does not depend directly on the rate of overvaluation. Assume that s is included in the result corresponding to the sequential scan. Then, if the algorithm processes s in the main loop, s has to be added into the set R and will never be pushed out by any other object. This is because there are no more than k − 1 objects in the database having smaller distances to the query than s has. Thus, every incorrectness in the result can be interpreted as a too early elimination of an object (due to its overvalued estimation) before it could be processed. Once all objects are eliminated, the main loop is terminated. Hence, the second improvement is intended to delay the termination of the main loop and to process some of the eliminated objects. Originally, the main loop terminates when there is no object s left to be selected from S. Thus, we modify the selection of the next object. If the set S is empty, the eliminated objects are taken into account, and the eliminated object b with the smallest estimation e(b) is selected and denoted as s (see line 29). This type of selection can be performed up to r times since the last change of the set R. In other words, once the original stop condition is true, the stability of the set R must additionally be confirmed by r consecutive iterations of the main loop during which R must not change (see lines 6, 17, 30). We call the value r the order error tolerance factor (denoted as f_order in the code). This factor makes the method more robust against incorrectness caused by a wrong order of object selection due to incorrect estimations.

Proper settings of the introduced factors protect the result against incorrect estimations. As we show later, this protection is so good that the use of modifiers improving indexability is possible. However, it is important to note that the search is still approximate.

5.4 Results

To evaluate the speedup possibilities based on indexing, we utilized the ProtDex2 dataset. This dataset is large enough to show the advantages of indexing. On the other hand, its size still allows a sequential scan to be performed in a reasonable time. Hence, the results of the sequential scan can be used for comparison. The following settings were used. We selected 1651 protein structures as pivots, one from each family in the dataset. The approximation error tolerance factor was set to 2.5%

and the order error tolerance factor was set to 128. The number of steps over which the pivots were protected against elimination was set to 1/64 of the total number of pivots. These settings provide sufficient robustness to prevent overvalued estimations for the used dataset. The efficiency of the SProt access method was measured for different weights of the modifier and for different numbers of requested nearest neighbors. All 108 queries of the ProtDex2 dataset were utilized and the average values are presented. The efficiency is expressed both in terms of the number of protein structure pairs being compared, relative to the sequential scan (see Figure 5.4), and in terms of the computation time, relative to the sequential scan (see Figure 5.5). Figure 5.5 also includes the absolute time, which was measured on a machine containing an Intel Xeon E GHz processor. The sequential scan takes 39.4 minutes on average. Vertical dashed lines denote the minimal, average and maximal size of the query families in the dataset. As shown in the figures, the computation time and the number of protein structure pairs being compared increase with decreasing weight, and they also naturally increase with the number of requested nearest neighbors.

Figure 5.4: Access method efficiency (number of compared pairs) (x-axis: number of nearest neighbors; y-axis: average number of compared pairs in %; one curve per weight)

It is also important to describe the precision of such approximate searches. The precision of an approximate k-nearest neighbor query is measured as the retrieval error between the query result returned by the SProt access method (R(q)) and the accurate query result obtained by a sequential scan of the database (R_seq(q)):

$$E = \frac{|R_{seq}(q) \setminus R(q)|}{|R_{seq}(q)|} \qquad (5.8)$$

The retrieval error describes what percentage of the structures included in the sequential scan result are missing in the result of the SProt access method.
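The retrieval error of Formula 5.8 reduces to a set difference. The following minimal sketch assumes that results are given as collections of structure identifiers; the function name, the identifiers, and the convention of returning 0 for an empty reference result are illustrative choices, not part of the original method.

```python
def retrieval_error(sequential_result, method_result):
    """Fraction of the sequential-scan result that is missing from the method's result."""
    reference = set(sequential_result)
    if not reference:
        return 0.0  # assumption: nothing can be missing from an empty reference result
    missing = reference - set(method_result)
    return len(missing) / len(reference)


# Hypothetical identifiers: one of four reference structures was not retrieved.
r_seq = ["s1", "s2", "s3", "s4"]
r_method = ["s1", "s2", "s3", "s9"]
error = retrieval_error(r_seq, r_method)
print(f"{100 * error:.0f}%")
```

Note that extra structures in the method's result (here the hypothetical "s9") do not affect the error directly; they matter only because they displace a structure of the reference result.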

Figure 5.5: Access method efficiency (computation time) (x-axis: number of nearest neighbors; y-axes: average relative computation time in % and average computation time in minutes; one curve per weight)

However, with the increasing number of requested nearest neighbors, the retrieval error becomes less significant. The reason is that the missing structures are often located at the back positions of the result. As shown in the information retrieval experiment, relatively few of the relevant structures (as judged by domain experts) are located at the back positions. So, we also introduce SCOP retrieval errors that take into account the SCOP categories (family, superfamily or fold):

$$E_L = \begin{cases} \dfrac{|(R_{seq}(q) \cap S_L(q)) \setminus R(q)|}{|R_{seq}(q) \cap S_L(q)|} & \text{if } |R_{seq}(q) \cap S_L(q)| \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (5.9)$$

where R(q) is the result set obtained by the method for query q, R_seq(q) is the sequential scan result set, and S_L(q) is the set of all structures having the same SCOP category at level L (the family, superfamily or fold level) as the query structure q. Thus, the SCOP errors describe what percentage of the relevant structures included in the sequential scan result are missing in the result of the SProt access method.

The retrieval errors of the SProt access method were measured for different weights of the modifier and for different numbers of requested nearest neighbors. All 108 queries of the ProtDex2 dataset were utilized and the average values are presented. The errors are expressed as the percentage of protein structures included in the sequential scan result and missing in the result of the SProt access method. In the first case (Figure 5.6), the whole result of the sequential scan is considered and the retrieval error E is measured. In the other cases, only the subsets of the result sharing the same SCOP family (Figure 5.7), superfamily (Figure 5.8) or fold (Figure 5.9) with the query structure are taken into account, leading to different SCOP retrieval errors. An error of 0% means that none of the considered structures were missing; conversely, an error of 100% means that all considered structures were missing. Vertical dashed lines denote the minimal, average and maximal size of the query families in the dataset.

Figure 5.6: Retrieval errors (x-axis: number of nearest neighbors; y-axis: average retrieval error in %; one curve per weight)

As shown in Figure 5.6, the retrieval error naturally increases with increasing weight. Moreover, with the increasing number of requested nearest neighbors, the error increases. An exception is the retrieval error for high weights and low numbers of requested nearest neighbors, where it also increases. As can be seen in Figure 5.7, Figure 5.8 and Figure 5.9, the SCOP errors still naturally increase with increasing weight. Nevertheless, the errors do not negatively depend on the number of requested nearest neighbors and they are very small. Again, the exceptions are the errors for high weights and low numbers of requested nearest neighbors.

For searching in the ProtDex2 dataset, we can conclude that a modifier weight close to 1 results in reasonably fast retrieval and low retrieval errors. However, the optimal configuration of the parameters of the SProt access method may vary depending on the dataset used (especially on its size and indexability). Importantly, the above factors (except for the pivot selection) do not need to be known during database indexing and can be set at query time. Thus, a user who is not satisfied with the obtained results is free to change the settings and run the query again using the same index.

Figure 5.7: SCOP retrieval errors (family) (x-axis: number of nearest neighbors; y-axis: average SCOP retrieval error at the family level in %; one curve per weight)

Figure 5.8: SCOP retrieval errors (superfamily) (x-axis: number of nearest neighbors; y-axis: average SCOP retrieval error at the superfamily level in %; one curve per weight)

Figure 5.9: SCOP retrieval errors (fold) (x-axis: number of nearest neighbors; y-axis: average SCOP retrieval error at the fold level in %; one curve per weight)

5.5 Summary

The focus on the quality of the modeling results in high computational demands of the SProt measure. We resolve this handicap by introducing the SProt access method, a modification of the LAESA metric access method that greatly decreases the runtime needed for scanning large datasets of protein structures. The speedup makes SProt competitive with the best contemporary solutions not only in effectiveness but also in efficiency.

Chapter 6

Speed-up by Parallel Computing

Even though indexing improves performance significantly, the methods are still quite slow for real-time applications. We propose a parallel approach to this problem in order to fully exploit the computational power of current CPU architectures and achieve almost linear speedup with respect to the number of available processors/cores.

6.1 Parallel Implementation

This section addresses the implementation aspects of the parallel version of our algorithm (which is shown at the end of the section). Furthermore, several optimizations that increase the performance of both the serial and the parallel version of the code are proposed.

Optimizations

The key to efficient and scalable programming is to eliminate, or at least reduce, the parts of the code that require explicit waiting or other means of synchronization across running threads. The more independent the concurrently running parts of the code are, the better performance and scalability can be expected. One of the most serious synchronization hotspots can be the memory allocator, if such a resource is simply shared amongst the threads. Our implementation uses one local allocator per thread, thus all temporary allocations are performed lock-free. Furthermore, the allocation sequence of our implementation can be easily reordered so that allocation calls and the corresponding memory release calls are well paired (see footnote 1). This modification allows us to use a simple stack-like allocator, which is very fast and does not fragment the memory. This allocator is also used in the sequential version for its many benefits.

1 If memory block A is allocated earlier than memory block B (w.r.t. the program execution), then B must be released before A.

Another implemented optimization improves the TM-score algorithm. The original implementation explicitly constructs the cut index set while traversing the current alignment (Algorithm 2: lines 12-14). The cut contains the indices of those aligned pairs that are (within the given superposition) closer than a defined threshold. Subsequently, the subalignment, which is passed to the RMSD procedure (Algorithm 2: line 5), is explicitly created from the cut set. Thus, the implementation traverses the current alignment to create a cut, then traverses the cut to create the subalignment, which is finally traversed by the RMSD procedure to compute the superposition. We can combine all three phases and compute the RMSD superposition directly during the alignment traversal, avoiding the explicit construction of the cut and the subalignment. Thus, we can reduce the amount of memory required by the algorithm and optimize the usage of CPU caches. Our implementation benefits from the fact that only the matrix R and the vectors T_X and T_Y are required [30] to compute the RMSD superposition for the given cut. These are defined as follows:

$$T_X = \frac{\sum_{i \in cut} X[i]}{|cut|} \qquad (6.1)$$

$$T_Y = \frac{\sum_{i \in cut} Y[i]}{|cut|} \qquad (6.2)$$

$$R = \sum_{i \in cut} (X[i] - T_X)(Y[i] - T_Y)^T \qquad (6.3)$$

where X and Y are the sequences of coordinates that correspond to the aligned pairs of amino acids (X for the first protein and Y for the second). The vectors T_X and T_Y can be easily constructed from the length of the cut and the sums of the coordinates that belong to the cut. To compute the matrix R, we additionally need the sum of the direct products of the coordinates that belong to the cut:

$$R = \sum_{i \in cut} X[i]\,Y[i]^T + |cut| \left( T_X T_Y^T - T_X T_Y^T - T_X T_Y^T \right) = \sum_{i \in cut} X[i]\,Y[i]^T - |cut|\, T_X T_Y^T \qquad (6.4)$$

The implementation only needs to sum the coordinates that belong to the cut and to accumulate their direct products while traversing the current alignment. Thus, the memory required for intermediate data is not proportional to the length of the cut. Since we have omitted the explicit cut construction, we have to update the terminal condition as well (see Algorithm 2: line 15).
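The single-pass accumulation behind Formula 6.4 can be sketched as follows. This is an illustrative reimplementation in Python, not the thesis code: the function names are hypothetical, and cut membership is passed in as a precomputed list of flags, whereas the real algorithm decides it from the distance of each aligned pair under the current superposition.

```python
def streamed_superposition_inputs(X, Y, in_cut):
    """Single pass over the alignment; returns (T_X, T_Y, R) for the flagged pairs."""
    n = 0
    sum_x = [0.0, 0.0, 0.0]
    sum_y = [0.0, 0.0, 0.0]
    sum_xy = [[0.0] * 3 for _ in range(3)]
    for x, y, keep in zip(X, Y, in_cut):
        if not keep:
            continue
        n += 1
        for a in range(3):
            sum_x[a] += x[a]
            sum_y[a] += y[a]
            for b in range(3):
                sum_xy[a][b] += x[a] * y[b]  # running sum of direct products
    t_x = [v / n for v in sum_x]
    t_y = [v / n for v in sum_y]
    r = [[sum_xy[a][b] - n * t_x[a] * t_y[b] for b in range(3)] for a in range(3)]
    return t_x, t_y, r


def two_pass_r(X, Y, in_cut):
    """Reference: builds the cut explicitly and applies Formulas 6.1 to 6.3."""
    cut = [(x, y) for x, y, keep in zip(X, Y, in_cut) if keep]
    n = len(cut)
    t_x = [sum(x[a] for x, _ in cut) / n for a in range(3)]
    t_y = [sum(y[a] for _, y in cut) / n for a in range(3)]
    return [[sum((x[a] - t_x[a]) * (y[b] - t_y[b]) for x, y in cut)
             for b in range(3)] for a in range(3)]


# Hypothetical coordinates; the last pair lies outside the cut.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
Y = [(0.0, 0.0, 1.0), (1.0, 1.0, 0.0), (0.0, 2.0, 2.0), (9.0, 9.0, 9.0)]
flags = [True, True, True, False]

_, _, r_streamed = streamed_superposition_inputs(X, Y, flags)
r_reference = two_pass_r(X, Y, flags)
```

The streamed variant keeps only a counter, two 3-vectors and one 3×3 matrix regardless of the cut length, which is the memory property the text describes. Both functions assume a non-empty cut.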
The original condition compares a newly constructed cut with its previous version to see whether there were any changes. We have chosen to compare the superpositions instead. If two consecutive superpositions are the same, the next constructed cut will be the same as the current cut, thus the computation has reached a stable state. Furthermore, the superpositions can be compared in constant time (unlike the cut comparison, which requires linear time).

Parallel Approach

There are many approaches to the parallel processing of our task. The most straightforward would be to process multiple queries concurrently. In such a case, each query is evaluated by a separate task and the tasks are processed concurrently by the available processors; however, each task itself is processed serially. This approach is very easy to implement and, since the tasks are completely independent, it is also very efficient. Unfortunately, a sufficient number of queries has to be submitted to the system at the same time in order to occupy all available processors. Furthermore, if there are a few long-lasting queries amongst a block of short queries, the system will be left with a few big tasks that linger on one CPU each while the remaining processors become idle.

The second possible approach is to modify the method itself, so that one query evaluation utilizes multiple cores. For this purpose, we have profiled our code to determine how long each part of the algorithm takes. The times measured in the profiling run and their relative contributions are summarized in Table 6.1. The ProtDex2 dataset [41], containing 34,055 structures and 108 queries, was used for the profiling. The search yielded the 20 most similar structures for each query.

Table 6.1: Required computational time

Method Subpart                          Time (min:sec)    Relative Time (%)
Query Model Loads                       0 :               %
Alignments                              96 :              %
Alignments: Score Matrix                95 :              %
TM-score                                13 :              %
LAESA: Approximation and Estimation     2 :               %
Total Query Evaluation                  111 :             %

The profiling results clearly show that the alignment computation (especially the score matrix computation) dominates the entire evaluation. Therefore, this part would benefit the most from concurrent execution. However, according to Amdahl's law [60], serial parts create serious scalability issues, thus we parallelize the TM-score computation and the LAESA approximation and estimation as well. We do not parallelize the query model loading, as the I/O operations are serial in nature and the loading itself takes an insignificant part of the overall time.
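The effect of the remaining serial parts can be illustrated with Amdahl's law. The sketch below is not taken from the thesis; the parallel fractions are illustrative assumptions only (0.86 is loosely motivated by the whole-minute alignment share in Table 6.1, 0.99 is purely hypothetical), and scheduling overhead is ignored.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only `parallel_fraction` of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Illustrative fractions: alignments only versus (almost) everything parallelized.
for fraction in (0.86, 0.99):
    print(fraction, round(amdahl_speedup(fraction, 24), 1))
```

With only about 86% of the work parallelized, 24 cores could deliver a speedup of less than 6, which is why parallelizing the smaller parts pays off as well.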
We can also choose a completely different method and try a data-parallel approach. In such a case, the database is divided into fragments, which are treated as independent (small) datasets and queried concurrently. Each fragment yields a local result, and the local results of all fragments are merged into the final result. Even though this method has minimal scheduling overhead and does not depend on the number of queries available, it is not suitable for our algorithm. Our search method employs LAESA to prune the number of computed distances (structure similarities). LAESA performs better on larger datasets and with larger numbers of pivots. If the database fragments are too small, the effect of LAESA is significantly diminished. In the worst case, it can deteriorate into a simple scan of the entire database. For these reasons, we have abandoned this approach and focus solely on the previous two approaches.

Concurrent Alignment Computation

According to the profiling results, the alignment algorithm spends almost all of its time computing the scoring matrix. Since each item of the matrix is computed independently, the computation can be easily modified to use a two-dimensional parallel for-loop instead of its sequential version. The matrix is evenly divided into tiles and each tile is processed as a concurrent task. The size of the tiles is chosen carefully so that one task computing one tile consists of at least 10^5 CPU instructions (see footnote 2). The tile-wise approach is preferred to both the row-wise and the column-wise approach as it is better optimized for CPU caches.

Concurrent TM-score Computation

In the case of the TM-score, we chose to parallelize the outer loop only. The internal loops are rather short, thus the overhead of task scheduling would be too great. The outer loop iterates over initial cuts. An initial cut can be defined by its length and the index of the first pair of aligned amino acids (we call the length-index tuple the initial configuration). Our approach first generates all initial configurations into an array and then traverses the array with a parallel for-loop. Hence, each configuration is processed independently as a separate task. In order to eliminate explicit synchronization, each thread keeps its own maximum of the computed scores and the corresponding superposition. The best score is picked from the local values when the parallel for-loop terminates. The concurrent computation of initial cuts introduces nondeterminism into the process, since we arbitrarily change the order of score comparisons. Even though this does not affect the method, a deterministic approach is preferred by the users and required for the verification of our parallel implementation. We have modified the score comparison so that the configuration index is used as a secondary key in case two score values are equal.
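The deterministic reduction described above can be sketched as follows. This is only an illustration of the tie-breaking logic, written with Python's standard `concurrent.futures` rather than TBB: the scores are precomputed stand-ins for TM-score evaluations, all function names are hypothetical, and the choice that the lower configuration index wins a tie is an assumption, since the text does not state the direction.

```python
from concurrent.futures import ThreadPoolExecutor


def better(a, b):
    """Pick the better (score, index) candidate; ties go to the lower index (assumed)."""
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    return a if a[1] < b[1] else b


def local_best(candidates):
    """Sequential reduction, as one thread would perform over its own share."""
    best = candidates[0]
    for candidate in candidates[1:]:
        best = better(best, candidate)
    return best


def parallel_best(scores, workers):
    """Each worker reduces its share to a local best; the local bests are merged at the end."""
    candidates = [(score, index) for index, score in enumerate(scores)]
    shares = [candidates[w::workers] for w in range(workers)]
    shares = [share for share in shares if share]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(local_best, shares))
    return local_best(partial)


# Two configurations tie at 0.9; the result must not depend on the work split.
scores = [0.4, 0.9, 0.7, 0.9, 0.2]
results = [parallel_best(scores, workers) for workers in range(1, 6)]
print(results)
```

Because `better` defines a total order on (score, index) pairs, the merged result is the same for every partition of the configurations among threads.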
Using the configuration index as a secondary key ensures that the choice of the best superposition is deterministic despite the concurrent processing. Our implementation generates the initial configurations sequentially and then processes them concurrently. In theory, it would be better to generate the configurations in one thread and dispatch them immediately to the other available threads. We could simply utilize a parallel while-loop [61] provided with an initial configuration generator. However, this approach requires significantly more thread synchronization and thus has greater overhead. Empirical results show that it is not as efficient as the concurrent processing of configurations that have been pregenerated sequentially.

2 Tasks containing 10^5 to 10^6 CPU instructions were observed as optimal [61].

LAESA Approximation and Estimation Loop

The main loop of the LAESA method cannot be effectively parallelized, since each iteration updates the set of candidates as well as the lower-bound estimates, which are required by the next iteration. Fortunately, we are still able to parallelize the internal loop, also called the approximation and estimation loop (see Algorithm 4: lines 19-23). The internal loop updates the estimates e(o), filters the candidate set S, and finds the candidate s with the lowest estimate e(s). To avoid synchronization, the threads work on different parts of S (and of e(o), respectively). When the candidate s needs to be found, each thread t keeps its own candidate s_t, and after the entire set S has been processed concurrently, the candidate s is quickly found in the set {s_t | t ∈ Threads}. Since we are parallelizing the internal loop, a barrier must be present at the end of the outer loop; therefore, the whole process is slightly less efficient than in the other parts of the algorithm.

Implementation Details

We use the Intel Threading Building Blocks (TBB) library for our parallel implementation in C++ [61]. This library provides thread management, task scheduling, a thread-based memory allocator, and high-level parallel primitives such as the parallel for-loop. The Intel C++ compiler was used, configured to produce optimized code.

6.2 Results

The experiments were performed on three different versions of the parallel implementation and compared with the performance of the serial implementation. The first one (denoted Query-Parallel) employs the first parallelization approach: queries are evaluated concurrently, but each query runs serially. The second one (Intern-Parallel) utilizes the second approach to parallelism: each query is parallelized internally, but the queries are submitted for sequential evaluation. The third one (Full-Parallel) combines the previous two methods to achieve even higher scalability. Each of these versions highlights different aspects of parallel processing. The Query-Parallel version is the most efficient, since it reduces the overhead of task spawning, scheduling, and synchronization. However, it presumes that there are enough tasks to keep the available CPU cores occupied.
The Intern-Parallel implementation better reflects the real-world situation in which only one query is being processed and we still want to exploit parallelism to expedite its evaluation. The combination of these two methods provides the ultimate solution (Full-Parallel), which reaches the limits of optimal scalability. We measure the efficiency of parallelization by the speedup factor:

$$S_p = \frac{T_{seq}}{T_p} \qquad (6.5)$$

where T_seq is the execution time of the sequential algorithm and T_p is the execution time of the parallel version running on p CPU cores. Note that we do not distinguish between two physical CPUs and two CPU cores on one die. Even though there are some differences, such as the NUMA factor or L3 cache-sharing issues, these hardware properties had a negligible impact on the overall performance in our case. On the other hand, we do not consider two logical cores mapped to one physical core by hyper-threading technology to be independent cores, and all tests were performed on independent physical cores only.

We used a Dell M910 server containing four Intel Xeon E7540 CPUs with six cores each, clocked at 2.0 GHz (i.e., 24 cores in total). The server was equipped with 128 GB of RAM organized as a 4-node cache-coherent NUMA. Red Hat Enterprise Linux 6 was used as the operating system. The experiments were conducted using the ProtDex2 dataset [41], containing 34,055 structures and 108 testing queries. The speedup of all three implementations measured for up to 24 cores is depicted in Figure 6.1. We have used an approximation curve that demonstrates the speedup of the algorithm. The approximation curve minimizes the squared differences from the measured values.

Figure 6.1: Measured speedup for each version of the implementation (x-axis: number of cores; y-axis: speedup; curves: Query-Parallel version, Intern-Parallel version, Full-Parallel version, linear speedup)

Figure 6.1 indicates that the Query-Parallel version exhibits lower scalability and a larger variance of the measured results than the other versions. This is due to the coarse granularity of the tasks (as we execute only 108 queries), thus the performance strongly depends on the distribution of tasks over the available threads. It is safe to assume that the performance would be better if the number of tasks were significantly larger than the number of available processors. The Intern-Parallel version performs better than Query-Parallel, as it produces more fine-grained tasks. However, this approach still suffers from the synchronization between two subsequent queries: the evaluation of one query must be completed before another query is started. These synchronization points between queries create performance gaps when the CPUs are heavily underutilized, thus the speedup is still less than optimal (16 on 24 cores). On the other hand, the Intern-Parallel version simulates the situation in which the queries are provided by the user sequentially. We observed that even in such a case a significant speedup can be achieved on multiple cores. If we combine both approaches, we preserve the benefit of fine-grained tasks and eliminate the synchronization points between queries. The Full-Parallel version reached a speedup of 21.4 on 24 cores. If we also consider the inevitable overhead of task scheduling and the presence of data loading parts, which are sequential in nature, it is safe to say that the scalability of the Full-Parallel version is almost optimal. This version is particularly useful when multiple users search the database.

To compare the efficiency of the individual parts of the parallel implementation, we have also measured their individual speedups (see Figure 6.2). The results match our expectations. The scoring matrix computation is the most independent part and thus exhibits the best speedup. The TM-score algorithm, which is often adopted by other similarity methods, also scales exceptionally well. The weakest part of the algorithm is the LAESA method, since it is iterative and only the internal loop was parallelized. We have observed a smaller speedup due to the synchronization within the main loop and the small size of the tasks being dispatched to the threads.

Figure 6.2: Measured speedup of the algorithm parts (x-axis: number of cores; y-axis: speedup; curves: scoring matrix computation, TM-score algorithm, LAESA approximation and estimation loop, linear speedup)
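The reported speedups can be put into perspective with two small derived quantities. The sketch below is an added illustration, not part of the original evaluation: parallel efficiency is simply S_p / p, and the experimentally determined serial fraction follows the Karp-Flatt metric, which is introduced here as an assumption and is not used in the thesis. Only the two speedup values quoted in the text (16 and 21.4 on 24 cores) are taken from the source.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / cores


def karp_flatt_serial_fraction(speedup, cores):
    """Serial fraction implied by a measured speedup on `cores` cores (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


cores = 24
for name, speedup in (("Intern-Parallel", 16.0), ("Full-Parallel", 21.4)):
    efficiency = parallel_efficiency(speedup, cores)
    serial = karp_flatt_serial_fraction(speedup, cores)
    print(f"{name}: efficiency {efficiency:.2f}, implied serial fraction {serial:.4f}")
```

Under this reading, the Full-Parallel version behaves as if only about half a percent of the work were serial, which is consistent with the claim that its scalability is almost optimal.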

6.3 Summary

We have compared three approaches to the parallel implementation of the SProt measure and its access method. Our best version reached an excellent speedup and scales almost linearly with the number of CPU cores. Furthermore, we believe that the achieved speedup is almost optimal, since there is some inevitable overhead and there are sequential parts that cannot be parallelized. We have also presented a concurrent version of the TM-score superposition algorithm. Since the TM-score is used in other methods as well, its concurrent version may find applications beyond protein structure similarity measures.
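As a rough illustration of how small the non-parallel share must be to reach such a speedup, the reported values can be interpreted with Amdahl's law. This is a simplified model that was not used in the measurements above. The estimated fraction (the Karp-Flatt metric) lumps together scheduling overhead, synchronization and sequential data loading, and the function names are hypothetical.

```python
def amdahl_speedup(serial_fraction, num_cores):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_cores)


def karp_flatt_serial_fraction(speedup, num_cores):
    """Serial fraction implied by a measured speedup (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / num_cores) / (1.0 - 1.0 / num_cores)


if __name__ == "__main__":
    # Speedups reported in the text for 24 cores.
    for name, speedup in [("Full-Parallel", 21.4), ("Intern-Parallel", 16.0)]:
        fraction = karp_flatt_serial_fraction(speedup, 24)
        efficiency = speedup / 24
        print(f"{name}: efficiency {efficiency:.2f}, "
              f"implied serial fraction {fraction:.4f}")
```

Under this model, a speedup of 21.4 on 24 cores corresponds to roughly 0.5% of non-parallelizable work, whereas a speedup of 16 corresponds to roughly 2.2%.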

Chapter 7 Web Server

In this chapter we introduce a web application that, given a query structure, identifies the set of the most similar structures in a selected database. The application employs the SProt similarity measure and an indexing technique based on the LAESA method (adjusted specifically for the SProt measure).

7.1 Web Server Implementation

From the beginning, the P3S web server has been conceived as an interactive web application. For this reason, we use Google Web Toolkit (https://developers.google.com/web-toolkit), which has been designed especially for writing this type of application. The framework makes it possible to develop the client-side part of the application (running in the user's web browser) together with the required application server. The Jmol Java applet has been used for the dynamic visualization of the proteins. In order to obtain a high-performance solution, the computational server (including the SProt measure and the indexing method) has been developed in C++ using the Intel Threading Building Blocks (TBB) library [61], which supports the parallel implementation. The computational server runs on a dedicated machine and communicates with the application server using the well-established CORBA technology.

7.2 Usage of P3S

The web server is available online. The user interface is divided into three panels: Query (Fig. 7.2A), Result (Fig. 7.2B) and Information (Fig. 7.2F). The Query panel is used to submit a query. The Result panel presents the retrieved structures, and the Information panel shows a link to the application documentation and other relevant information.

Figure 7.1: P3S Architecture.

7.2.1 Query Submission

To perform a search, the user has to select a query structure, the target database, and the number of the most similar structures to be retrieved from the database. The query structure can be defined by a SCOP ID [19] or uploaded manually. In the first case, the structure stored on the server under the given code is used. Once at least three characters of an ID have been typed, the application starts suggesting possible IDs. The second option is to upload a user-defined PDB file. The PDB file should use the current version of the PDB format and should include all heavy atoms. The application allows the user to check the messages from the PDB parser, which is useful when the PDB file is rejected due to format errors. If the PDB file contains multiple models, only the first one is taken into account. The model has to contain at least ten amino acids.

Currently, P3S offers two databases to choose from. The first one is the Astral [20] 1.75 database containing 33,352 structures (only structures having different sequences are included). The second one is the ProtDex2 database containing 34,055 structures [41]. It is intended mainly for comparison with other methods. Note that the databases do not contain multi-domain protein structures. Therefore, the query should preferably be a single-domain structure.

The number of the most similar structures to be retrieved can be set together with other parameters that influence the precision of the access method. Ideally, these additional parameters should not be changed. However, if the user finds the method insufficiently precise, it is possible to decrease the RBQ modifier weight parameter or to increase the values of the other factors. On the other hand, if the query takes too much time, it can be useful to increase the RBQ modifier weight parameter.
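The two rules stated for uploaded files (only the first model is used, and it must contain at least ten amino acids) can be sketched as follows. This is a minimal illustration and not the P3S parser: it assumes fixed-column ATOM records as defined by the PDB format, counts residues by chain identifier, residue number and insertion code, and does not verify that the residues are amino acids. All function names are hypothetical.

```python
MIN_RESIDUES = 10  # threshold stated in the text


def make_atom_line(serial, atom, res_name, chain, res_seq):
    """Build the leading columns of an ATOM record (demo helper only)."""
    return "ATOM  %5d %-4s %3s %1s%4d    " % (serial, atom, res_name, chain, res_seq)


def residues_in_first_model(pdb_text):
    """Return distinct residue keys found in ATOM records of the first model."""
    residues = []
    seen = set()
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # every model after the first one is ignored
        if record != "ATOM":
            continue
        # Columns 22 (chain), 23-26 (residue number), 27 (insertion code).
        key = (line[21:22], line[22:26].strip(), line[26:27])
        if key not in seen:
            seen.add(key)
            residues.append(key)
    return residues


def is_acceptable_query(pdb_text):
    """Check the minimum-size rule on the first model."""
    return len(residues_in_first_model(pdb_text)) >= MIN_RESIDUES


if __name__ == "__main__":
    model_1 = [make_atom_line(i, "CA", "ALA", "A", i) for i in range(1, 13)]
    model_2 = [make_atom_line(i, "CA", "GLY", "A", i) for i in range(1, 4)]
    text = "\n".join(["MODEL        1"] + model_1 + ["ENDMDL"]
                     + ["MODEL        2"] + model_2 + ["ENDMDL"])
    print(len(residues_in_first_model(text)), is_acceptable_query(text))
```

In the usage example, the second model is never read, so the query is judged only by the twelve residues of the first model.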
Although setting the number of retrieved structures and the RBQ modifier weight can speed up the search substantially, the time needed to evaluate a query also depends on the size of the query structure and on the number of (similar) structures in the database. To inform the user about the query progress, a progress bar shows the fraction of the database that has already been searched. Note that because of the non-linear behavior of the search algorithm the progress pace

Figure 7.2: The P3S web server user interface
