New Techniques for the Location of Hot Spots in Proteins and Exons in DNA Using Digital Filters

by

Parameswaran Ramachandran

M.A.Sc., University of Victoria, Canada, 2005

B.E., Bharathidasan University, India, 2001

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

in the Department of Electrical and Computer Engineering

© Parameswaran Ramachandran, 2010

University of Victoria

Supervisory Committee

Dr. A. Antoniou, Co-Supervisor (Department of Electrical and Computer Engineering)

Dr. W.-S. Lu, Co-Supervisor (Department of Electrical and Computer Engineering)

Dr. P. Agathoklis, Department Member (Department of Electrical and Computer Engineering)

Dr. D. Olesky, Outside Member (Department of Computer Science)

ABSTRACT

The development, implementation, and performance evaluation of new techniques for the location of hot spots in proteins and exons in DNA using digital filters are presented.

The application of bandpass notch (BPN) digital filters for locating hot spots in proteins is first investigated. A technique is proposed for designing the appropriate BPN filter for a specific protein sequence in which the area under the amplitude response is minimized to achieve maximum selectivity for a chosen stability margin. The minimization is performed using the golden-section search. A tuning technique is also proposed for improving the accuracy of the BPN filter. The tuning is carried out using a least-squares polynomial model. Several example protein sequences are used to illustrate these techniques.

BPN filters are then employed for locating exons in DNA. An additional step of lowpass filtering is introduced in order to detect the strength of the bandpass filtered signal as a function of nucleotide location. For the character-to-numerical mapping, the application of the electron-ion interaction potentials (EIIPs) of the nucleotides as well as their binary sequences is investigated.

The performance of the techniques is then evaluated using metrics such as sensitivity, specificity, accuracy, precision, and computational efficiency. These metrics are used in conjunction with the so-called receiver operating characteristic (ROC) technique to establish a reliable framework for the comparisons. For exon location, a technique based on the short-time discrete Fourier transform (STDFT) reported in the literature is also included in the comparison. The effect of using different window functions on the prediction accuracy of the technique is explored. Using a set of examples, it is shown that BPN filters predict short exons with better accuracy than the STDFT. The test dataset comprised 66 protein sequences and 160 DNA sequences obtained from the protein data bank and the HMR195 database, respectively. Results show that among the techniques considered, BPN filters perform best for the location of both protein hot spots and DNA exons in terms of accuracy and computational efficiency. User-friendly MATLAB implementations of the techniques incorporating graphical interfaces are also described.

Optimized numerical mapping schemes are proposed for exon location using both EIIP and binary sequences. Characteristic numerical values are obtained for the four nucleotides using a training procedure in which the prediction accuracy is maximized using a quasi-Newton algorithm based on the Broyden-Fletcher-Goldfarb-Shanno updating formula. A training set of 80 DNA sequences is chosen from the HMR195 database and the objective function is formulated using the ROC technique. The procedure is initialized using EIIP values. Unbiased testing of the optimized values is carried out using a test set that has no overlap with the training set. Simulation results show that the optimized values yield more accurate exon locations than those obtained using the actual EIIP values. In addition, they perform significantly better than a set of existing optimized complex values. By employing a similar strategy to optimize the weights of the binary sequences, it is shown that, in practice, only three out of four binary sequences are necessary to obtain accurate estimates of exon locations. Consequently, a computational saving of 25% can be achieved, which is substantial considering that DNA sequences encountered in practice are very long.

List of Tables

Table 2.1 EIIP Values for the Protein Amino Acids

Table 3.1 Hot-Spot Locations Obtained Using BPN Filters and Comparisons

Table 3.2 Potential Hot-Spot Locations Identified by BPN Filters

Table 3.3 Average CPU Times for Different Hot-Spot Location Techniques

Table 4.1 EIIP Values for the DNA Nucleotides

Table 4.2 Protein Functional Groups Investigated

Table 4.3 Passband and Stopband Edges of the Inverse-Chebyshev Filters Used for Hot-Spot Location

Table 4.4 Best Operating Thresholds and Euclidean Distances for Hot-Spot Location

Table 4.5 Evaluation Metrics at Best Operating Thresholds for Hot-Spot Location

Table 4.6 Best Operating Thresholds and Euclidean Distances for Exon Location

Table 4.7 Evaluation Metrics at Best Operating Thresholds for Exon Location

Table 4.8 Average CPU Times for Different Exon-Location Techniques

Table 6.1 Initial and Optimized Numerical Parameters

Table 6.2 Optimized Weights for the Binary Sequences

List of Figures

Figure 2.1 DNA and its building blocks

Figure 2.2 The DNA double helix

Figure 2.3 Relationship between the cell, chromosomes, DNA, and proteins

Figure 2.4 The central dogma of molecular biology

Figure 2.5 An illustration of how a protein fits into its target

Figure 2.6 Three-dimensional structure of a protein with hot spots

Figure 2.7 An example consensus spectrum

Figure 2.8 Filter-based hot-spot location system

Figure 2.9 Exons and introns

Figure 2.10 Alternative splicing

Figure 3.1 BPN and BSN amplitude responses

Figure 3.2 Allpass BPN zero-pole plot

Figure 3.3 Stability triangle in the (d0, d1) space

Figure 3.4 A unimodal objective function

Figure 3.5 Amplitude response of the BPN filter example with ω0 = 0.3

Figure 3.6 Coefficient d1 versus BPN notch frequency

Figure 3.7 Sample hot-spot location plot

Figure 4.1 Filter-based exon location system

Figure 4.2 Plot of the amplitude-modulated signal for gene AF039307

Figure 4.3 Confusion matrix based on classifier outcomes

Figure 4.4 The ROC plane illustrating the significant points and typical ROC curves

Figure 4.5 ROC curves for hot-spot location obtained by varying the classification threshold for the filters

Figure 4.6 Points corresponding to best operating thresholds for locating hot spots

Figure 4.7 Exon locations predicted using the BPN filter for gene AF009614

Figure 4.8 Exon locations predicted using the BPN filter for gene AF060229

Figure 4.9 Exon locations predicted using the BPN filter for gene U89486

Figure 4.10 ROC curves for exon location obtained using the STDFT by applying three different windows

Figure 4.11 ROC curves for exon location obtained using binary sequences

Figure 4.12 ROC curves for exon location obtained using EIIP sequences

Figure 4.13 Points corresponding to best operating thresholds for exon location (all six cases)

Figure 4.14 Identification of short exons for gene AF001689 using the STDFT and a BPN filter

Figure 4.15 Identification of short exons for gene AF037438 using the STDFT and a BPN filter

Figure 4.16 Plot of the overall accuracy versus computational efficiency of the six cases pertaining to exon location

Figure 5.1 Screen shot of the hot-spot location GUI

Figure 5.2 Screen shot of the tuning slider

Figure 5.3 Screen shot of the exon location GUI

Figure 6.1 Area under an ROC curve

Figure 6.2 An ROC curve and its exponential model

Figure 6.3 Training set ROC curves corresponding to EIIP and pseudo-EIIP values

Figure 6.4 Test set ROC curves corresponding to EIIP and pseudo-EIIP values

Figure 6.5 Test set ROC curves corresponding to pseudo-EIIP and complex values

Figure 6.6 ROC curves corresponding to the initial and the optimized binary weights, obtained using data set 1

Figure 6.7 ROC curves corresponding to the initial and the optimized binary weights, obtained using data set 2

Figure 6.8 ROC curves corresponding to the initial and the optimized binary weights, obtained using data set 3

Figure 6.9 Exon-location ROC curves obtained using both equal and optimized weights for data set 3 with only binary sequences A, G, and C

Figure 6.10 Exon-location ROC curve obtained using data set 1 with only binary sequences A, G, and C

Figure 6.11 Exon-location ROC curve obtained using data set 2 with only binary sequences A, G, and C

Figure 6.12 Exon-location ROC curve obtained using data set 3 with only binary sequences A, G, and C

Figure 6.13 Exon-location ROC curves obtained using data set 3 with only one binary sequence at a time

Figure 6.14 Exon-location ROC curves obtained using data set 3 with only two binary sequences at a time

Figure 6.15 Exon-location ROC curves obtained using data set 3 with only three

List of Abbreviations

3-D three-dimensional

A adenine

ACT average CPU time

ASEdb alanine scanning energetics database

AUC area under the curve

C cytosine

CPU central processing unit

DFT discrete Fourier transform

DNA deoxyribonucleic acid

DSP digital signal processing

EIIP electron-ion interaction potential

FFT fast Fourier transform

FIR finite-duration impulse response

FN false negative

FP false positive

FPR false positive rate

G guanine

IIR infinite-duration impulse response

mRNA messenger RNA

PDB protein data bank

RNA ribonucleic acid

ROC receiver operating characteristic

RRM resonant recognition model

STDFT short-time discrete Fourier transform

T thymine

TN true negative

TP true positive

TPR true positive rate

Acknowledgments

The support and encouragement of a number of individuals have been indispensable to the successful completion of my doctoral studies. I take this opportunity to express my sincere gratitude to them.

First and foremost, I thank my supervisors, Profs. Andreas Antoniou and Wu-Sheng Lu, for believing in my abilities and encouraging me to explore a new and exciting area of research. Their invaluable guidance and support provided the enthusiasm, knowledge, and conducive atmosphere required to carry out my research. Their emphasis on a good writing style and the suggestions on proper use of language have improved my writing skills to a large extent. I am also very thankful for their patience during the time I took to learn the fundamentals of genomics.

I would like to gratefully acknowledge the advice and support of my supervisory committee members, Profs. Pan Agathoklis and Dale Olesky.

I am also thankful to the entire staff of the Department of Electrical and Computer Engineering who have been very helpful in many different ways.

My deepest gratitude goes to my family for the unconditional love and emotional support they have provided all along. The trust of my parents and sister, irrespective of my shortcomings, has made me strive to great lengths towards my goals. The unbeatable blend of love, companionship, encouragement, and support of my wife, which only she can provide, has been priceless. Our graduation from being friends to soulmates during the course of my doctoral studies has proved to be a real blessing.

Dedication

Chapter 1 Introduction

Man is descended from a hairy, tailed quadruped, probably arboreal in its habits.

—CHARLES DARWIN(1809–1882)

1.1 History and Motivation

Proteins were discovered in the early 19th century, long before the beginning of genetics [1–3]. They were the first class of cellular molecules to be studied, and were initially called albuminoids since they were discovered when egg white (albumin) coagulated on heating. Subsequent to their discovery, intense debates ensued regarding their nature and composition for almost a century until a consensus was reached and chemists agreed that proteins were polymers made of subunits called amino acids. It was another long gap before the present list of 20 amino acids was completed in 1935. Since then, numerous studies to understand the structure and function of proteins have been conducted and will, no doubt, continue to be conducted for many years to come.

Research on deoxyribonucleic acid (DNA) and genetics began in the latter half of the 19th century. In the 1860s, a Central European monk named Gregor Mendel performed extensive experiments of self-pollination and cross-pollination on a set of pea plants. He observed that the inheritance of traits was brought about by discrete units that were passed on from one generation to the next. Based on his experiments, Mendel hypothesized a set of laws referred to as Mendel’s Laws of Inheritance. Another set of pioneering experiments is credited to a Swiss biologist, Friedrich Miescher, who carried out the first series of carefully conceived chemical studies on cell nuclei in 1868 [4]. Using the nuclei of pus cells obtained from discarded surgical bandages, Miescher detected a phosphorus-containing substance that he named nuclein. He showed that nuclein has an acidic portion, which we know today as DNA, and a basic portion, now recognized as a class of proteins called histones that are responsible for packaging DNA.

Spearheaded by these two works, research on the so-called nuclein continued for almost a century with the debate as to whether it is indeed the carrier of genetic information. This was finally confirmed in 1943 when three scientists from the Rockefeller Institute, Oswald Avery, Colin MacLeod, and Maclyn McCarty, discovered that DNA taken from a virulent strain of bacteria permanently transformed a non-virulent strain into a virulent one. From these very important experiments and a wealth of other corroborating evidence, it is now certain that DNA is the carrier of genetic information in all living cells [5].

Despite the proof that DNA is the hereditary entity, the structure of DNA and the mechanism by which genetic information is inherited remained unresolved until the Nobel-winning work of James Watson and Francis Crick in 1953. Their double helical structure for DNA paved the way for the modern fields of molecular biology, genetics, and biotechnology. An important milestone in modern genetics is the completion of the sequencing of the human genome in 2003, which sparked an unprecedented interest in genomics research [6]. It brought together a diverse set of researchers such as biologists, statisticians, computer scientists, and engineers working in tandem to decipher the functioning of DNA and proteins and the complex interactions between them.

It is now well understood that, at the fundamental level, the cells of every living organism contain the complete set of instructions for building and maintaining the organism, known as the organism’s genome. A genome contains genes which are the hereditary material containing the instructions for making proteins. Genes are further divided into coding regions called exons and noncoding regions called introns. Cells have robust mechanisms in place for decoding the genes by removing introns and stringing the exons together in order to manufacture proteins. Once manufactured, proteins are programmed to assemble themselves into complex three-dimensional (3-D) structures. By virtue of these structures, proteins carry out their functions by selectively interacting with targets which are typically other types of proteins or DNA fragments. The protein-target interactions are brought about by regions in the protein molecules known as active sites. These further contain subregions called hot spots that supply the binding energy required for a successful protein-target interaction.

Genes, as the hereditary material, and proteins, as the building blocks, together drive all life processes. Hence it is of utmost importance to deduce their operations in order to achieve a comprehensive understanding of life processes. Two fundamental aspects that need to be addressed as part of this endeavor are the development of efficient techniques for the accurate location of hot spots in proteins and exons in DNA. Several experimental techniques¹ exist, such as site-directed mutagenesis for hot-spot location [7, 8] and radiation hybrid mapping and classical genetic mapping for exon location [9, 10]. Mutagenesis schemes involve systematic mutational analysis of selected amino acids to detect hot-spot locations, while genetic mapping involves performing a variety of chemical tests on a given genome to identify and annotate various types of DNA markers. Although these techniques are very effective, they involve several delicate steps that need to be flawlessly executed at a microscopic level. Hence they are very time-consuming, laborious, and costly. Therefore, there is a strong need for computational techniques that can yield good estimates of hot-spot and exon locations using a minimal amount of time, effort, and money. From the estimates obtained, biologists can construct blueprints of the locations and then selectively perform laboratory tests to confirm them, thereby saving a considerable amount of time and resources.

¹ An experimental technique is carried out in a wet laboratory as opposed to a computational technique.

Existing computational techniques include computational alanine scanning [11] and molecular dynamics simulations [12, 13] for hot-spot location and GENOMESCAN [14] for exon location. These techniques yield good predictive results, but are not suitable for predictions involving newly-discovered sequences as they require secondary structural information that is not available for new sequences. Hence, there is a strong need for simpler computational techniques that yield reasonably good predictions using only the amino-acid and nucleotide sequences.

In this dissertation, new filter-based DSP techniques for the location of hot spots in proteins and exons in DNA that involve a minimal amount of cost and effort are proposed. The proposed techniques are based on a set of unique characteristics exhibited by hot spots and exons that can be effectively employed for detecting them using digital filters. The techniques exploit the fact that protein and DNA sequences are inherently discrete in nature, which makes them suitable for digital filtering upon mapping them into appropriate numerical sequences. The implementation details and extensive performance analysis of the techniques are also presented.

1.2 Contributions of the Dissertation

The major contributions of the dissertation can be summarized as follows:

1. A new technique is proposed for accurate and efficient location of hot spots in proteins using second-order bandpass notch (BPN) digital filters. Efficient design and tuning strategies for the BPN filters are also proposed.

2. The application of BPN filters for the location of exons in DNA is investigated. Results show that BPN filters are well-suited for this purpose.

3. Extensive performance analysis of the filter-based hot-spot and exon location techniques is carried out using a number of evaluation metrics and the so-called receiver operating characteristic (ROC) technique.

4. Using the ROC technique, an optimized numerical mapping scheme is proposed that significantly improves the accuracy of exon-location predictions.

5. The ROC technique is also employed to demonstrate that three out of the four binary DNA sequences are sufficient for obtaining accurate exon-location predictions, thus yielding a computational saving of 25%.

6. The proposed techniques have been implemented in MATLAB and integrated into a software package by incorporating user-friendly graphical interfaces.

The dissertation is divided into seven chapters. The first two provide introductory information and the background necessary to put the work into context. The remaining chapters present new material and discuss the contributions in detail.

In Chapter 2, a brief review of the fundamentals of the location of hot spots in proteins and exons in DNA is provided. The chapter begins with a description of proteins and DNA and explains the importance of hot spots and exons. Then, some of the existing computational techniques for locating hot spots and exons are briefly examined including our own work [15, 16].

Chapter 3 investigates the location of hot spots in proteins using BPN digital filters. A design technique is first proposed for BPN filters and the hot-spot location system using these filters is then described in detail. An automatic tuning technique is then proposed that can be employed for obtaining significant improvements in prediction accuracy. A variety of illustrative examples are presented. Some of the work in this chapter has been published in [17–20].

In Chapter 4, BPN filters are employed for locating exons in DNA. The suitability of these filters for exon location is first examined and then the technique is described in detail. A number of illustrative examples are also presented. Then the performance of filter-based techniques for the location of both hot spots and exons is investigated using a variety of evaluation metrics. These metrics are used in conjunction with the so-called ROC technique to establish a reliable framework for the comparisons. For exon location, the technique based on the short-time discrete Fourier transform (STDFT) is also included in the comparison. The effect of using different window functions on the prediction accuracy of the technique is explored. Using a set of examples, it is shown that BPN filters predict short exons with better accuracy than the STDFT. A large test data set comprising 66 protein sequences and 160 DNA sequences is used for obtaining the ROC plots. Some of the work in this chapter has been published in [21].

Chapter 5 describes the software package that has been developed as part of this dissertation. It features a user-friendly MATLAB graphical user interface (GUI) that implements the proposed hot-spot and exon location techniques. The software is expected to be of significant use to biologists for analyzing newly-discovered proteins and DNA sequences.

In Chapter 6, a training procedure is proposed that maximizes the accuracy of the exon-location technique of Chapter 4 by using a quasi-Newton algorithm based on the Broyden-Fletcher-Goldfarb-Shanno updating formula. The objective function is formulated using the ROC technique and an exponential model. The procedure is first employed to obtain an alternative set of characteristic numerical values that can replace the EIIP values and thereby improve the accuracy of exon-location predictions. The new values obtained are shown to yield better accuracy than the optimized complex values reported in [22, 23]. The procedure is then used to demonstrate that, in practice, one of the four binary DNA sequences can be ignored without affecting the accuracy of the exon-location predictions, thus yielding a computational saving of 25%. Several sets of simulation results obtained using separate training and test data sets are presented. Part of the work in this chapter will appear in [24].

Chapter 7 summarizes the results and the contributions made. The chapter concludes with a discussion on directions for future research.

Chapter 2 Fundamentals of the Location of Hot Spots in Proteins and Exons in DNA

The sun, with all those planets revolving around it and dependent on it, can still ripen a bunch of grapes as if it had nothing else to do.

—GALILEO GALILEI(1564–1642)

2.1 Introduction

Life on Earth, as we see it today, is a result of millions of years of evolution. Primitive life forms known as cyanobacteria are believed to have appeared on Earth about three and a half billion years ago. Before that time, our planet was a violent combination of enormous volcanic eruptions and an oxygenless atmosphere filled with gases such as ammonia and methane.

Evolution is the mechanism in nature by means of which living organisms improve their capacity to survive [25–27]. It is made possible by errors in reproduction. These occur because any copy of information is most likely to be inexact, and their occurrence provides an opportunity for an organism to evolve into a more enduring life form.

In the world of today, there are over four million different kinds of plants and animals. They form a very complex network in which every kind of plant or animal depends on several others for survival. Despite the tremendous diversity of life, all living things are made of the same set of building blocks, namely, proteins and DNA molecules.

In this chapter, a brief overview of proteins and DNA is presented followed by a description of hot spots and exons. Then, some of the existing computational techniques for locating hot spots and exons are examined.

2.2 Cells, DNA, and Proteins

The cell is the fundamental independent unit of life. Any living organism is built and maintained by the collection and coordination of a huge number of such cells (about 75-100 trillion in humans). Cells have mechanisms to obtain their food, and to maintain and create copies of themselves when desired. Their extraordinary ability to reproduce distinguishes them from other microscopic entities such as viruses, and enables them to be characterized as living.

Based on the cell structure, living organisms are classified into two fundamental types, namely, eucaryotes and procaryotes [28]. Eucaryotic cells have a distinct, membrane-bound central part called the nucleus while procaryotic cells do not have a nucleus. Bacteria, the simplest of the present-day living cells, are procaryotes. Many types of yeasts and amoebas, plants, animals, and human beings are examples of eucaryotes.

For the purpose of studying and understanding life processes, biologists have chosen a distinct set of representative organisms for intense analysis. They are the bacterium E. coli, the simple eucaryotic cell S. cerevisiae (brewer’s yeast), the plant Arabidopsis, the fly Drosophila, the worm C. elegans, the mouse, and the human being.

The complete set of instructions to build and maintain a living organism is called the organism’s genome. The genome is an extremely long, linear sequence of code. Nearly all the cells of an organism contain the genome. In procaryotes, the genome is not isolated from other parts of the cell and is found floating near the center of the cell, wadded-up like a ball of string. In advanced organisms such as humans, genomes are extremely long and are organized into packets called chromosomes inside the nucleus. This organization makes the genome more manageable, ensures that it is completely contained within the nucleus, and facilitates accurate replication during cell division.

2.2.1 Heredity and Genes

The genome is passed on across generations through the reproductive cells. This process is known as heredity and the genome is said to be hereditary. Heredity is responsible for the physical, mental, and behavioral traits of organisms. It is the reason why organisms always give birth to offspring resembling themselves: a dog always gives birth to a dog, and a cat always gives birth to a cat. Specific portions of the genome, called genes, carry the instructions to make proteins, which are the building blocks of cells. Other portions instruct cells on controlling the amounts and functions of proteins. Numerous other types of instructions are also encoded in the genome, of which only a very small fraction has been deciphered so far.

2.2.2 The DNA

The genome is made up of DNA which consists of two long chains composed of four types of subunits called nucleotides. Nucleotides contain a five-carbon sugar called deoxyribose with one or more phosphate groups and a nitrogen-containing base. The base can be any of the four types adenine (A), thymine (T), guanine (G), or cytosine (C). The building blocks of DNA are illustrated schematically in Figure 2.1(a) [28]. The nucleotides are covalently linked forming a long backbone of alternating sugar and phosphate as shown in Figure 2.1(b). Since only the bases differ between nucleotides, the nucleotides are referred to by the names of their bases.

The way in which the nucleotides are linked together gives a chemical polarity to the DNA strand. This is indicated by referring to the phosphate end as the 5’ end and the sugar end as the 3’ end. This convention is based on the details of the chemical linkages between the nucleotide subunits.

Figure 2.1. DNA and its building blocks. (a) The formation of a single nucleotide. (b) Individual nucleotides combine to form a DNA strand.

DNA occurs as a double-stranded helical structure, popularly known as the DNA double helix. This is schematically shown in Figure 2.2(a), and is often straightened out as in Figure 2.2(b) for simplicity. The two strands are held together by cross linkages between their bases called hydrogen bonds. The base-pairing has a specific pattern. A always pairs with T and G always pairs with C. Thus the two strands are complementary to each other in the sense that knowledge of one of them automatically reveals the other one. The discovery of this fact in the early 1950s solved the mystery of how DNA is copied [29]. During copying, each strand acts as a template for the synthesis of a complementary new strand. Hence, the daughter double helix has an original strand and a new strand. Due to this fact, DNA replication is said to be “semiconservative”.

Figure 2.2. (a) The DNA double helix. (b) DNA double-strand, straightened out for simplicity. In reality, DNA exists as a double helix.

As each strand of a DNA double strand uniquely determines the other one, a double-stranded DNA molecule can be represented by either of the two character strings. The directionality (polarity), by convention, is taken to be from the 5’ end to the 3’ end represented from left to right. For example, consider the following section of a hypothetical DNA double strand:

5’ - C-G-T-A-G-C-T-T-A-C-T-G - 3’
3’ - G-C-A-T-C-G-A-A-T-G-A-C - 5’

This double strand can be represented either by the character string CGTAGCTTACTG, which is the top string written as it is, or by the character string CAGTAAGCTACG, which is the bottom string reversed to be in the 5’-to-3’ direction.

DNA strands that are complementary to themselves are known as self-complementary or palindromes [30]. For example, the strand ATCGTACGAT is a palindrome.
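The complementarity rule and the palindrome property described above can be checked with a few lines of code. The following is a minimal sketch; the function names are hypothetical, and only the base-pairing rule (A with T, G with C) and the example strings are taken from the text.

```python
# Base-pairing rule from the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand):
    """Return the complementary strand, written in the 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def is_self_complementary(strand):
    """A strand is a palindrome if it equals its own reverse complement."""
    return strand == reverse_complement(strand)

print(reverse_complement("CGTAGCTTACTG"))   # CAGTAAGCTACG
print(is_self_complementary("ATCGTACGAT"))  # True
```

The first call reproduces the bottom strand of the example above, read in the 5’-to-3’ direction.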


history of molecular biology. No other molecule has reached such an iconic status. A good perspective on the evolution of the double helix at the hands of both science and art can be obtained from [31].

An interesting fact of nature is that, while most natural signals such as heat, sound, and electromagnetic waves occur as continuous signals, the fundamental signal of life, the genetic information, occurs as a discrete signal. A good description of the discrete nature of the genetic information can be found in [32].

2.2.3 Proteins

Most of the dry mass of a cell is composed of proteins. If we ignore the fat, our bodies consist of about 20% protein by weight. Biochemically, proteins play a variety of roles in life processes. They form the structural components (e.g., viral coat proteins, skin proteins, and proteins of the cytoskeleton), catalyze chemical reactions (e.g., enzymes), transport and store various materials (e.g., hemoglobin), regulate cell processes (e.g., hormones), control genetic transcription, and protect the organism from foreign invasion (e.g., antibodies). The multiplicity of functions performed by proteins arises from the unique three-dimensional (3-D) shapes they can adopt [33]. Proteins are manufactured in cells using the instructions encoded in genes. The relationship between the cell, chromosomes, DNA, and proteins is illustrated in Figure 2.3.

Proteins are long polymers of subunits called amino acids. The individual amino acids are linked by covalent linkages called peptide bonds. An amino acid consists of a carboxylic acid group, an amino group, and a variable side chain, all attached to a central carbon atom called the α-carbon. The side chain is the only component that varies from one amino acid to another. The varying side chains are responsible for the chemical variety of amino acids. Although many different amino acids are theoretically possible, only 20 of them are commonly found in proteins. These make up the proteins found in all kinds of living organisms. The reason for the specific choice of this set of amino acids can only be attributed to millions of years of evolution.

Figure 2.3. Cells in advanced organisms contain a nucleus that houses the chromosomes. The chromosomes together constitute the genome, which contains the instructions for making proteins.

Although proteins can be conceptualized as linear chains of amino acids, they do not occur that way in reality. They fold into complex 3-D structures forming weak noncovalent bonds between their own atoms. It is this folding ability that enables them to perform extremely specific functions. The information necessary to specify the 3-D structure of a protein is contained in its amino-acid sequence. This can be inferred from the fact that when a purified protein is heated or brought to conditions far from the normal physiological environment, it ‘unfolds’ or ‘denatures’ into a disordered, biologically-inactive structure. However, as soon as the abnormal conditions are removed, the protein spontaneously folds back or ‘renatures’ into its natural conformation. This spontaneous folding-back can happen only if the complete folding information of a protein is contained in its amino-acid sequence [34].

2.2.4 From DNA to Proteins

The genetic information in genes directs the synthesis of proteins by suitably joining the amino acids. Each amino acid is represented by a group of three nucleotides (a triplet), known as a codon. The mapping of the codons to the amino acids is known as the genetic code. Since the size of the DNA alphabet is four and since there are three nucleotides in a codon, there are $4^3 = 64$ possible codons. As this number is greater than the total number of amino acids (twenty), more than one codon can correspond to an amino acid. The genetic code is thus said to be degenerate. Hayes in [35] traces the history of cracking the genetic code.
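The codon count can be verified by enumerating every ordered triplet over the four-letter DNA alphabet. This is only an illustrative sketch; the variable names are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # the four-letter DNA alphabet

# Every ordered triplet of nucleotides is a possible codon.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

print(len(codons))       # 64
print(len(codons) > 20)  # True: more codons than amino acids
```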

In order to speed up protein synthesis, cells make multiple copies of the genetic information in genes using another type of nucleic acid called ribonucleic acid (RNA). These are then used as templates to synthesize multiple protein molecules simultaneously. The copying step is known as transcription and the synthesis step is known as translation. Like DNA, RNA is a linear polymer made of four types of nucleotides linked by covalent bonds. The nucleotides, however, contain the sugar ribose instead of the deoxyribose found in DNA. Also, RNA contains the base uracil (U) instead of the thymine (T) found in DNA. Several types of RNA molecules exist. The ones that are copies of the genes in DNA are called messenger RNA (mRNA).

As seen above, the flow of genetic information in cells is from DNA to RNA to protein. This fundamental principle is common to all living organisms and is therefore called the central dogma of molecular biology. It is illustrated schematically in Figure 2.4.

The nuclei of nearly all the cells of an organism contain the organism’s complete genome, but each type of cell is made of different proteins. For example, a muscle cell is completely different in structure and function from a blood cell or a nerve cell, even though all of them contain the complete genome of the organism. This means that not all cells make all the proteins. An individual cell, depending on its type, has a mechanism of selectively producing only those proteins that are needed for that type of cell. This selective production of proteins is achieved by controlling when and how often a given gene is transcribed into mRNA. In the vicinity of every gene in the genome there are regions called regulatory DNA. Specialized proteins called regulatory proteins bind to regulatory DNA, turning the gene on. Absence of these proteins turns the gene off. In this manner, genes are selectively turned on or off depending on the type of cell, thus enabling the cell to manufacture only the required proteins.

Figure 2.4. From DNA to proteins: The central dogma of molecular biology.

2.3 Hot Spots in Proteins

Proteins perform their functions by chemically interacting with their targets by virtue of their 3-D structures. These interactions are very specific in nature. They occur at predefined locations in the 3-D structures known as active sites [28, 33]. Active sites can be conceptualized as unique patterns in the arrangement of amino acids. The shapes of the active sites are such that they can fit into the target molecules in a way analogous to a hand fitting into a glove, as shown in Figure 2.5.

In and around the active sites are subregions known as hot spots that are responsible both for the chemical stability of the active sites and for supplying the binding energy for the protein-target interactions [11, 36, 37]. Depending on the protein, a hot spot may consist of one or more amino-acid locations. A schematic of a protein with hot spots is shown in Figure 2.6.

Figure 2.5. An illustration of how a protein fits into its target.

Figure 2.6. Three-dimensional structure of a protein with hot spots. The protein molecule shown is malate dehydrogenase and the interaction is between the chains A (left) and C (right) of the molecule. The circled regions contain the hot spots. Protein structures are generally represented in terms of structural motifs known as α-helix and β-sheet, which appear as the coils and sheets in the figure. This structure was obtained from the Protein Data Bank (PDB) at http://www.pdb.org using the protein ID ‘1guy’.

Due to the indispensable role played by hot spots in enabling proteins to perform their functions, thorough knowledge about their locations is essential for understanding protein function. Therefore, reliable and efficient hot-spot location techniques are required [38–49].

2.4 Location of Hot Spots in Proteins

2.4.1 Alanine Scanning Mutagenesis

A popular experimental technique carried out in a wet laboratory for locating active sites and hot spots is site-directed mutagenesis [7, 8, 50, 51]. In this technique, an amino acid that is suspected to belong to a hot spot is replaced by another amino acid. Such replacements are known as mutations. If a mutated location is part of a hot spot, then the protein’s biological function gets considerably hampered as a result of the mutation. From this, we can confirm that the mutated location indeed belongs to a hot spot. The procedure must then be repeated for every suspected location in order to determine all the hot spots.

The amino acid alanine is usually chosen as the replacement for performing the mutations [51]. This choice is due to alanine’s simple molecular structure and the nonreactive nature of its side chain, which ensure that the properties of the protein arising from the unmutated amino acids are not affected by the mutation. Furthermore, alanine is abundant in proteins. A limitation of this technique is that it cannot be used if a suspected hot-spot amino acid happens to be alanine. In such circumstances, alternative techniques or another suitable substitute amino acid may be used. Nevertheless, the situation is rare due to the fact that the amino acids tryptophan, arginine, and tyrosine are much more likely to appear in hot spots than alanine [36]. For obvious reasons, this experimental hot-spot location procedure is called alanine-scanning mutagenesis (ASM).

Although conceptually simple, ASM is a delicate and expensive procedure that requires the use of specialized chemicals and laboratory apparatus. Therefore, simpler and less expensive computational techniques that can yield estimates of the hot-spot locations are of immense help to biologists in minimizing unnecessary mutations. By using the estimates obtained, biologists can selectively perform ASM to confirm the hot-spot locations, thereby saving a considerable amount of laboratory resources.

In concrete terms, a hot spot is defined using a thermodynamic quantity known as Gibbs free energy. This is the difference between the enthalpy of a system and the product of its absolute temperature and entropy, and is a measure of the capacity of the system to do work. (It is measured in kilojoules or kilocalories per mole and is denoted as ∆G.)

The lower the free energy value, the easier it is for the system to do work. If the free energy is negative, the system will have a tendency to do work spontaneously, as in the case of an exothermic chemical reaction. In the context of a protein-target interaction, the work involved is the binding of the two molecules, and hence the term used is binding free energy. In order to determine whether a given amino acid is a hot spot, it is mutated to alanine and the binding free energy of the mutated protein-target complex is measured. The change in the binding free energy before and after the mutation is denoted as ∆∆G. If the amino acid in question is a hot spot, then the mutation reduces the binding affinity or, in other words, the binding becomes more difficult after the mutation. A reduction in the binding affinity corresponds to a positive ∆∆G. A hot spot is defined as an amino acid whose mutation to alanine leads to a ∆∆G of at least 2.0 kcal/mol. This definition is commonly accepted in the biological community and has been widely used in the past (e.g., in [36, 48]).
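The hot-spot criterion above amounts to a threshold test on ∆∆G. The sketch below assumes the sign convention implied by the text, namely ∆∆G taken as the mutant binding free energy minus the wild-type binding free energy, so that reduced affinity gives a positive value. The function names and the numerical inputs are hypothetical.

```python
HOT_SPOT_THRESHOLD = 2.0  # kcal/mol, the cutoff quoted in the text

def delta_delta_g(dg_wild_type, dg_mutant):
    """Change in binding free energy caused by the alanine mutation (kcal/mol).

    Assumed convention: mutant minus wild type, so weaker binding is positive.
    """
    return dg_mutant - dg_wild_type

def is_hot_spot(dg_wild_type, dg_mutant):
    """Apply the 2.0 kcal/mol criterion to a single mutated residue."""
    return delta_delta_g(dg_wild_type, dg_mutant) >= HOT_SPOT_THRESHOLD

# Illustrative numbers only, not measured data.
print(is_hot_spot(-10.0, -7.5))  # True  (ddG = 2.5 kcal/mol)
print(is_hot_spot(-10.0, -9.5))  # False (ddG = 0.5 kcal/mol)
```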

Next we describe computational techniques for hot-spot location.

2.4.2 Structure-Based Computational Techniques

Many of the existing computational hot-spot location techniques require complex structural information of a protein such as its architecture, chemistry, size, shape, number of hydrogen bonds, and information about bound water molecules. For example, computational alanine scanning, proposed in [11], makes use of the atomic coordinates obtained from x-ray crystallography studies along with various other physical and chemical properties of a protein for modeling the effects of alanine mutations. The model is then used to make predictions of the most probable hot-spot locations. An alternative technique involves estimating the free energy of association using molecular dynamics simulations and then using this information for predicting hot-spot locations [12, 13]. Although very effective, techniques of this type suffer from two major drawbacks. First, they are complex to implement due to the types of models involved and the amount of information they require. Secondly, they cannot be used to predict the hot-spot locations of newly-discovered proteins since detailed structural and physical information for these proteins is not available at the time of their discovery. Therefore, there is a strong need for computational techniques that yield good estimates of hot-spot locations using simple models and minimal information about the protein sequence of interest. Such techniques are available as detailed in the next section.

2.4.3 Computational Techniques Based on Digital Signal Processing

A simple and effective strategy for computationally locating hot spots is to employ the resonant recognition model (RRM) [52]. In this model, the protein character sequences are mapped onto numerical sequences using physical parameter values known as electron-ion interaction potentials (EIIPs). Every amino acid is assigned an EIIP value that denotes the average energy of the valence electrons in the amino acid and is known to correlate well with a protein’s biological properties. EIIP values can be computed using formulas based on the general-model pseudopotential as described in [53]. The EIIP values for the 20 amino acids are listed in Table 2.1. The use of the EIIP values is based on the fact that the strength of the electromagnetic field surrounding a molecule correlates well with its capability to take part in biochemical processes [54]. Among over 200 different types of numerical mappings, EIIP values have been shown to be very suitable for a frequency-based analysis of protein sequences [55]. The protein numerical sequences obtained using EIIP values can be subjected to digital signal processing (DSP) for detailed analysis.

Table 2.1. EIIP Values for the 20 Amino Acids in Columnwise Ascending Order

| Amino acid | EIIP | Amino acid | EIIP |
|---|---|---|---|
| Leucine (Leu) | 0.0000 | Tyrosine (Tyr) | 0.0516 |
| Isoleucine (Ile) | 0.0000 | Tryptophan (Trp) | 0.0548 |
| Asparagine (Asn) | 0.0036 | Glutamine (Gln) | 0.0761 |
| Glycine (Gly) | 0.0050 | Methionine (Met) | 0.0823 |
| Valine (Val) | 0.0057 | Serine (Ser) | 0.0829 |
| Glutamic acid (Glu) | 0.0058 | Cysteine (Cys) | 0.0829 |
| Proline (Pro) | 0.0198 | Threonine (Thr) | 0.0941 |
| Histidine (His) | 0.0242 | Phenylalanine (Phe) | 0.0946 |
| Lysine (Lys) | 0.0371 | Arginine (Arg) | 0.0959 |
| Alanine (Ala) | 0.0373 | Aspartic acid (Asp) | 0.1263 |

Veljković and co-workers observed that the discrete Fourier transforms (DFTs) of the EIIP sequences of proteins of the same functional group share a unique frequency component known as the characteristic frequency of the functional group [56]. A protein and its target must have the same characteristic frequency with opposite phase for a successful interaction. This resembles resonance and hence the characteristic frequency is said to provide resonant recognition between the protein and its target. The model is thus called the resonant recognition model. Based on the model, it is possible to predict whether a given protein will interact with an arbitrary target by examining whether or not they share a common characteristic frequency. For further information on the RRM and its application to various protein sequences, the reader is referred to [52, 57–60].
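The character-to-numeric mapping used by the RRM can be sketched as a simple table lookup. The values below are those of Table 2.1; the standard one-letter amino-acid codes used as keys, the function name, and the short example sequence are assumptions made for illustration.

```python
# EIIP values from Table 2.1, keyed by the standard one-letter amino-acid codes.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}

def to_eiip(protein):
    """Map a protein character sequence onto its EIIP numerical sequence."""
    return [EIIP[residue] for residue in protein.upper()]

# Hypothetical four-residue fragment (Met-Lys-Leu-Ala).
print(to_eiip("MKLA"))  # [0.0823, 0.0371, 0.0, 0.0373]
```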

Figure 2.7. Consensus spectrum of the epidermal growth factor functional group. (Plot of squared magnitude versus frequency, showing a peak at the characteristic frequency.)

The common characteristic frequency of a functional group of M proteins can be determined by computing the cross-spectral function

$$S(e^{j\omega}) = \left| X_1(e^{j\omega})\, X_2(e^{j\omega}) \cdots X_M(e^{j\omega}) \right| \tag{2.1}$$

where $X_1, X_2, \ldots, X_M$ are the DFTs corresponding to the M proteins. In simpler terms, Eq. (2.1) corresponds to the product of the amplitude spectra of the protein sequences belonging to a functional group. Such a product has a distinct peak at the characteristic frequency and is known as the consensus spectrum of the group. The consensus spectrum for a set of epidermal growth factor (EGF) proteins is shown in Figure 2.7.
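The product in Eq. (2.1) can be illustrated numerically. The sketch below is not applied to real protein data: it uses three synthetic sequences of equal length (an assumption standing in for zero-padded EIIP sequences) that share one frequency component, and evaluates the DFT directly rather than with a fast algorithm. All names are hypothetical.

```python
import cmath
import math

def amplitude_spectrum(x):
    """Magnitudes of the N-point DFT of x (direct O(N^2) evaluation)."""
    n_points = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                for n in range(n_points)))
        for k in range(n_points)
    ]

def consensus_spectrum(sequences):
    """Pointwise product of the amplitude spectra, as in Eq. (2.1)."""
    spectra = [amplitude_spectrum(x) for x in sequences]
    return [math.prod(values) for values in zip(*spectra)]

# Synthetic stand-ins for EIIP sequences: all three share a component at
# DFT bin 10, and each has one unrelated component of its own.
N, SHARED_BIN = 64, 10
sequences = [
    [math.cos(2 * math.pi * SHARED_BIN * n / N + phase)
     + 0.8 * math.cos(2 * math.pi * other_bin * n / N)
     for n in range(N)]
    for phase, other_bin in [(0.0, 5), (1.0, 17), (2.0, 23)]
]

s = consensus_spectrum(sequences)
half = s[: N // 2 + 1]  # real input, so the spectrum is symmetric
peak_bin = max(range(len(half)), key=half.__getitem__)
print(peak_bin)  # 10
```

Components that appear in only one sequence are multiplied by near-zero values from the other spectra, which is why only the shared component survives in the product.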

The number of protein sequences, M, required for a typical consensus spectrum varies from case to case. Typically, a sufficient number of protein sequences should be used to achieve a distinct peak at the characteristic frequency in the consensus spectrum. To start with, a set of two protein sequences may be tried. If there is ambiguity (i.e., if there are two or more peaks of approximately the same amplitude), then one more protein sequence from the functional group of interest is included in the computation. This procedure is repeated until the ambiguity is resolved, i.e., there is only one prominent peak with all other peaks well below it, thus clearly identifying the characteristic frequency.

If a protein performs more than one function, then, according to the RRM, each function will correspond to a unique characteristic frequency which can be identified by considering several consensus spectra, one for each function.

The hot-spot locations in a protein or a target molecule can be identified by determining the regions in the corresponding numerical sequence where the characteristic frequency is dominant. A simple strategy for identifying such regions would be to alter the amplitude of the DFT coefficient corresponding to the characteristic frequency and determine the amino acids that are most affected by this alteration. This strategy is described in [52]. Its disadvantage is that a change in a single DFT coefficient affects all the elements of the original protein numerical sequence and, consequently, a hot-spot location technique based on this strategy is not reliable. A technique using wavelets has been described in [58].

2.4.4 Transform-Based DSP Technique

In [15, 61], we proposed a hot-spot location technique based on the use of the short-time discrete Fourier transform (STDFT). In this technique, the STDFT of the protein numerical sequence is first computed using a suitable window and its columns are then multiplied by the consensus spectrum. On plotting the squared magnitude of the modified STDFT, the hot-spot locations can be identified in the form of distinct peaks. Subsequently, we employed this technique for predicting hot spots in the tubulin family of proteins as well as determining relative differences in binding affinities to these hot spots [16]. From these as well as our other trials, we can infer that the technique yields results with reasonable accuracy but is computationally very expensive. The accuracy and computational efficiency can both be significantly improved by employing digital filters.


2.4.5 Filter-Based DSP Techniques

According to the RRM, the characteristic frequency of a protein numerical sequence, obtained by means of the consensus spectrum, uniquely correlates with its biological function as was discussed in Section 2.4.3. Hot spots can be located by determining the regions in the protein sequence where the characteristic frequency is dominant. Thus a hot-spot location technique based on the RRM should be capable of selecting the characteristic frequency from the large number of insignificant frequencies present in a protein numerical sequence. An ideal technique for the identification of the characteristic frequency is to use a narrowband bandpass digital filter centered at the characteristic frequency [62, 63]. A plot of the squared magnitude of the output sequence will reveal the hot-spot locations as distinct peaks in the power output.

A step-by-step procedure of the technique is as follows:

1. Convert several protein sequences from the functional group of interest into numerical sequences using EIIP values.

2. Compute their DFTs and the consensus spectrum to determine the characteristic frequency.

3. Filter the protein sequence of interest using a narrowband bandpass digital filter centered at the characteristic frequency.

4. Plot the squared magnitude of the filtered output to locate the hot spots.

The lengths of protein sequences are usually less than $2^{16}$ and, consequently, $2^{16}$-point DFTs are sufficient in practice. Evidently, the numerical sequences need to be adjusted to $2^{16}$ points before computing the DFTs by appending the appropriate number of trailing zeros.

Steps 3 and 4 can be performed by using a filtering system of the type illustrated in Figure 2.8. The type of digital filter employed has a significant influence on the resulting accuracy and computational efficiency. Several factors need to be taken into account while choosing the filter type. The two most critical requirements for our application are the order and selectivity of the filter.

Figure 2.8. The filter-based hot-spot location system. (Block diagram: the protein sequence passes through an EIIP transformation to give $x_P[n]$, then through zero-phase narrowband bandpass filtering, consisting of $H_P(z)$, R, $H_P(z)$, and R in cascade, to give $y_P[n]$, and finally through a power computation, $(y_P[n])^2$, to give the processed protein sequence.)

Filter Order: The higher the order of the filter, the longer would be its transient response. This would result in an inefficient filtering of the protein sequence because a significant portion of the sequence would have already passed through the filter by the time steady state is attained. Thus, it is critical to use a filter of low order.

Selectivity: The frequency spectrum of a protein numerical sequence consists of a number of other frequency components in addition to the characteristic frequency component. Our aim is to select only the characteristic frequency component while attenuating all the other frequency components to an insignificant level. As it is possible to have unwanted frequencies very close to the characteristic frequency in the frequency spectrum, the digital filter must have a high selectivity (i.e., narrow transition bands).

These requirements can be simultaneously satisfied by using an IIR filter. This is because, in an IIR filter, the poles of the transfer function can be placed close to the unit circle. Hence a high selectivity can easily be achieved with a low-order transfer function. In a finite-duration impulse response (FIR) filter, on the other hand, with the poles fixed at the origin, high selectivity can be achieved only by using a relatively high order for the transfer function. Moreover, linear phase response, the single most important benefit provided by FIR filters, is not of relevance to our application as the filter delay can be eliminated altogether using zero-phase filtering due to the fact that this is a nonreal-time application.

In the zero-phase filtering block of Figure 2.8 (see Sec. 12.5 in [62] for relevant theory), the signal is filtered through a cascade arrangement of two IIR filters characterized by $H(z)$ and $H(z^{-1})$. The transfer function $H(z^{-1})$ is realized by using a first-in last-out register, R, followed by a filter with transfer function $H(z)$ followed by a second first-in last-out register, R. The frequency response of the cascade arrangement is real and, as a result, it has a zero phase response. The protein numerical sequence is first fed to the filter characterized by $H(z)$ and the resulting output is then reversed and fed to the same filter again. The output of the second filtering operation is then reversed to obtain the final output. The delay introduced by the first filtering operation is canceled by the second filtering operation since the signal is fed backwards the second time. Thus, upon zero-phase filtering, the characteristic frequency component is not delayed at the output, and the need to compute the phase response of the IIR filter is eliminated.

The output block in Figure 2.8 squares the values of the filtered sequence and, in effect, produces a sequence whose values are proportional to the power inherent in the filtered sequence.
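The filtering and power-computation stages of Figure 2.8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the filters studied in this work: a generic second-order resonator stands in for the narrowband bandpass filter, the input is a synthetic signal rather than an EIIP sequence, and all function names and parameter values (centre frequency, pole radius 0.9) are hypothetical.

```python
import cmath
import math
import random

def resonator_coeffs(w0, r):
    """Second-order IIR bandpass (two-pole resonator, zeros at z = +1 and -1),
    scaled to unit gain at the centre frequency w0 (rad/sample)."""
    a1, a2 = -2.0 * r * math.cos(w0), r * r
    z1 = cmath.exp(-1j * w0)  # z^-1 evaluated on the unit circle
    gain = abs((1 - z1 * z1) / (1 + a1 * z1 + a2 * z1 * z1))
    return 1.0 / gain, a1, a2

def iir_filter(x, b0, a1, a2):
    """y[n] = b0*(x[n] - x[n-2]) - a1*y[n-1] - a2*y[n-2], zero initial state."""
    y = []
    for n, xn in enumerate(x):
        x2 = x[n - 2] if n >= 2 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(b0 * (xn - x2) - a1 * y1 - a2 * y2)
    return y

def zero_phase_filter(x, coeffs):
    """Filter forward, reverse, filter again, reverse (the H(z)H(z^-1) cascade)."""
    forward = iir_filter(x, *coeffs)
    backward = iir_filter(forward[::-1], *coeffs)
    return backward[::-1]

# Synthetic input: weak noise plus a burst at the "characteristic frequency"
# occupying samples 180..219, playing the role of a hot-spot region.
random.seed(0)
W0 = 2 * math.pi * 0.1
x = [random.uniform(-0.05, 0.05) for _ in range(400)]
for n in range(180, 220):
    x[n] += math.cos(W0 * n)

y = zero_phase_filter(x, resonator_coeffs(W0, r=0.9))
power = [v * v for v in y]  # output block: squared filtered sequence
peak = max(range(len(power)), key=power.__getitem__)
print(180 <= peak < 220)  # the power peak falls inside the burst
```

Because the second pass runs over the reversed sequence, the peak in the power output stays aligned with the burst instead of being shifted by the filter delay.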

Choosing an IIR digital filter for our application offers another advantage. Due to the much lower order of the filter, the filtering would involve only a small amount of computation, which would lead to an efficient implementation of the hot-spot location system.

A number of classical IIR filters are available, namely, Bessel-Thomson, Butterworth, Chebyshev, inverse-Chebyshev, and elliptic. In [17, 18, 61], we investigated these filters and determined that inverse-Chebyshev filters are most suitable for our application among the available classical filter types as they provide a reasonably good selectivity and a low filter order.

In Chapter 3, we explore the application of BPN filters for locating hot spots and demonstrate that these filters yield better accuracy and computational efficiency compared to inverse-Chebyshev filters.

Figure 2.9. The arrangement of exons and introns in a eucaryotic gene.

2.5 Exons in DNA

The organization of genes is fundamentally different in procaryotes and eucaryotes. A procaryotic gene occurs as an uninterrupted stretch of DNA that is transcribed into RNA that, without any further processing, can directly serve as an mRNA. In contrast, a eucaryotic gene is separated into many fragments called exons. These fragments, when put together, form the actual uninterrupted gene. The portions between the exons, as shown in Figure 2.9, are called introns. Introns do not code for proteins and hence are referred to as noncoding regions. Usually, exons are much shorter than introns and thus the coding portion of a gene is often only a small fraction of its total length.

Before protein-coding occurs in a eucaryotic cell, the introns are removed by the cellular mechanism and the exons are joined together to form an uninterrupted gene. By lacing genes with introns, a eucaryotic cell is able to produce different proteins from a single gene by joining the exons in different combinations. This procedure is known as alternative splicing and represents a type of data compression developed through evolution for the purpose of manufacturing a wide variety of proteins from a small number of genes. This concept is illustrated in Figure 2.10.

Figure 2.10. Alternative splicing: a single gene can produce many different proteins.

2.6 Location of Exons in DNA

Accurate location of exons in genomes is a very important first step in tackling the larger problem of understanding life processes. With a gene sequence and the locations of its exons in hand, the corresponding protein sequence can be determined, thereby leading to the next step of analyzing the structure and function of the protein. If the protein happens to be already known, then the newly identified exons can be associated with this protein. With this knowledge, genetic engineers can attempt to customize the protein for performing a desired function. In this manner, knowledge pertaining to exon locations may result in the design of customized drugs and new cures for diseases. The problem of locating genes and exons is often referred to as the gene-finding problem.

Initial gene-finding techniques involved painstaking experimentation on living cells that resulted in detailed genetic maps. Today, with sophisticated genome sequencing techniques and powerful computational resources at our disposal, gene finding has been redefined as a largely computational problem [64, 65].

There are two basic approaches to computational gene finding, namely, comparative and ab-initio. In comparative techniques, a query sequence is compared with a library of known sequences and those library sequences that resemble the query sequence beyond a certain threshold are identified. For example, following the discovery of a previously unknown mouse gene, a researcher will typically perform a search of the human genome to see if humans carry a similar gene. If a match is found, then it is likely that the mouse gene performs the same function as the human gene. The so-called basic local alignment search tool (BLAST) is a popular comparative technique [66].

Comparative techniques are based on the evolutionary fact that nature tends to conserve sequences with similar functions across different organisms. However, the success of comparative techniques is dependent upon the existence of databases consisting of genes from a variety of organisms with similar sequences and known functions, which are often unavailable. Moreover, there are many exceptions where genes with similar sequences perform completely different functions in different organisms and genes with similar functions have different DNA sequences.

Ab-initio techniques, on the other hand, use the genomic DNA sequence alone to systematically search for certain tell-tale signs of genes and exons. Examples of such signs are splice sites, promoters, CpG islands, certain statistical regularities, and periodicities [67, 68]. Ab-initio techniques can be broadly classified as model-dependent and model-independent techniques. Model-dependent techniques operate in two stages. They first apply complex probabilistic models such as hidden Markov models or machine learning techniques such as support vector machines on previously known sequences to characterize the discriminative properties between coding and noncoding regions [69–72]. This information is then used in the second stage for predicting genes and exons in unknown sequences. These techniques yield results with good accuracy but have certain limitations. First, due to the complex nature of the models involved, the techniques are difficult to implement and operate. Secondly, a set of previously known sequences capable of reliably characterizing the discriminative properties for a given organism may not always be available. Hence, it is worthwhile to investigate model-independent techniques that rely on generic discriminative properties such as certain distinct periodicities exhibited by exons. Such techniques are much simpler to implement and can potentially yield accuracies as good as those obtained with model-dependent techniques. A generic discriminative property that has caught the attention of many researchers recently is the period-3 property.

The period-3 property stems from the fact that the power spectra of DNA segments corresponding to exons tend to exhibit a strong component at the period-3 frequency, i.e., 2π/3, whereas segments corresponding to introns do not [73]. Thus exons can be located by mapping the DNA characters into numbers in some way and then tracking the strength of the period-3 component along the length of the DNA sequence of interest. In a popular character-to-numeric mapping scheme proposed by Voss [74], a DNA sequence is represented by four binary indicator sequences, one for each of the four types of nucleotides. The digits ‘1’ and ‘0’ are used to represent the presence or absence of the nucleotide of interest. For example, if we consider the DNA sequence ‘ATCCGCTTAGC’, the indicator sequence corresponding to nucleotide A would be ‘10000000100’ and that corresponding to nucleotide C would be ‘00110100001’.
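The Voss mapping and the strength of the period-3 component can be sketched in a few lines. The indicator sequences follow the description above; summing the squared magnitudes of the four indicator spectra at 2π/3 is one common way of summarizing the period-3 strength and is an assumption here, as are the function names and the artificial test sequences.

```python
import cmath
import math

def indicator_sequences(dna):
    """Voss mapping: one binary indicator sequence per nucleotide."""
    return {base: [1 if ch == base else 0 for ch in dna] for base in "ACGT"}

def period3_power(dna):
    """Sum over the four indicator sequences of |X(e^{jw})|^2 at w = 2*pi/3."""
    total = 0.0
    for seq in indicator_sequences(dna).values():
        x = sum(v * cmath.exp(-2j * math.pi * n / 3) for n, v in enumerate(seq))
        total += abs(x) ** 2
    return total

ind = indicator_sequences("ATCCGCTTAGC")
print("".join(map(str, ind["A"])))  # 10000000100
print("".join(map(str, ind["C"])))  # 00110100001

# Artificial sequences, not real exons or introns: a perfectly
# codon-periodic repeat versus a sequence with no period-3 structure.
print(round(period3_power("ATG" * 20)))  # 1200
print(round(period3_power("A" * 60)))    # 0
```

In practice this quantity would be evaluated over a sliding window so that its variation along the DNA sequence can be tracked.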

Indicator sequences have been employed for locating exons using DSP techniques in [22, 75–77]. The STDFT was used in conjunction with the rectangular window in [22, 75]. The use of the rectangular window essentially amounts to truncating the sequence before computing the DFT. This introduces unwanted frequency components in the Fourier spectrum known as Gibbs oscillations that significantly reduce the prediction accuracy. These oscillations can be reduced by using non-rectangular window functions such as the triangular window (also known as the Bartlett window) employed in [76]. The triangular window effectively reduces the Gibbs oscillations, but it is still a fixed window and does not have an adjustable parameter to vary the mainlobe width and the sidelobe attenuation. Such a parameter would give more control over the characteristics of the window used. Hence, for our simulations in Chapter 4, we use an adjustable window known as the Kaiser window to implement the STDFT and compare its performance with the inverse-Chebyshev and BPN filters.
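
As a rough illustration of a windowed STDFT evaluated at the period-3 frequency, the sketch below slides a Kaiser window along a toy indicator sequence. It is a sketch under stated assumptions, not the implementation used in Chapter 4: the helper names, the window length of 30, and the value `beta=4.0` are hypothetical, and the window is parameterized by `beta` (the argument of the Bessel function), which may differ from the parameterization used elsewhere in this work. Setting `beta = 0` reduces the Kaiser window to the rectangular window.

```python
import cmath
import math

def bessel_i0(x, terms=30):
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / 2.0) / k
        total += term * term
    return total

def kaiser_window(length, beta):
    """Kaiser window of the given length; beta = 0 gives the rectangular window."""
    if length == 1:
        return [1.0]
    window = []
    for n in range(length):
        r = 2.0 * n / (length - 1) - 1.0  # runs from -1 to 1 across the window
        arg = beta * math.sqrt(max(0.0, 1.0 - r * r))
        window.append(bessel_i0(arg) / bessel_i0(beta))
    return window

def stdft_period3(x, window):
    """Magnitude of the windowed DFT at 2*pi/3 for every window position."""
    w0 = 2.0 * math.pi / 3.0
    kernel = [wn * cmath.exp(-1j * w0 * n) for n, wn in enumerate(window)]
    span = len(window)
    return [abs(sum(x[s + n] * kernel[n] for n in range(span)))
            for s in range(len(x) - span + 1)]

# Toy indicator sequence (illustrative only): a period-3 stretch flanked by zeros.
x = [0] * 60 + [1, 0, 0] * 20 + [0] * 60
profile = stdft_period3(x, kaiser_window(30, beta=4.0))
peak = profile.index(max(profile))
print(peak)
```

The profile is zero where the window covers only the flanking zeros and reaches its maximum where the window lies entirely over the period-3 stretch. Increasing `beta` widens the mainlobe and increases the sidelobe attenuation, which is the adjustability that a fixed window lacks.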

An alternative character-to-numeric mapping scheme involves the use of the EIIP values described in Section 2.4.3 for proteins. Since these values are physicochemical parameters, they exist for DNA nucleotides as well. Whereas the Voss mapping scheme requires four indicator sequences to be processed, the use of EIIP values requires only one, which reduces the amount of processing by 75%. In addition, the EIIP values can be optimized to obtain maximum prediction accuracy. Due to these advantages, we investigate the application of EIIP values for exon location by employing digital filters in Chapter 4 and optimize them for maximum accuracy in Chapter 6.
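
The single-sequence nature of the EIIP mapping can be seen in a short sketch. The nucleotide values below are those commonly quoted in the literature and are an assumption here, since they are not listed in this section; the function names are hypothetical.

```python
import cmath

# EIIP values for the four nucleotides as commonly quoted in the literature
# (an assumption; they are not given in this section).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def eiip_sequence(dna):
    """Map a DNA string to a single numerical sequence of EIIP values."""
    return [EIIP[ch] for ch in dna.upper()]

def period3_magnitude(x):
    """Magnitude of the DFT of x evaluated at the period-3 frequency 2*pi/3."""
    w = 2 * cmath.pi / 3
    return abs(sum(v * cmath.exp(-1j * w * n) for n, v in enumerate(x)))

x = eiip_sequence("ATCCGCTTAGC")
print(len(x))  # 11
```

Only one sequence has to be transformed or filtered here, compared with four under the Voss mapping, which is the source of the 75% saving mentioned above.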

2.7 Summary

An overview of the fundamentals of locating hot spots and exons has been provided. In summary, the complete set of instructions to make and maintain a living organism is called the organism’s genome. Portions of the genome, called genes, contain the instructions for making proteins. The genes of advanced organisms are organized into coding regions called exons and noncoding regions called introns. Before protein manufacture, all the exons of a gene are joined together to obtain the uninterrupted protein code. This interleaved arrangement is a type of data compression developed by nature to facilitate the encoding of a large variety of proteins into a small number of genes.

Proteins are the building blocks of living organisms. They exist as unique 3-D structures and perform their functions by interacting with their targets by virtue of these structures. Hot spots are regions that play a critical role in protein-target interactions.

The development of accurate and efficient techniques for locating hot spots in proteins and exons in DNA is indispensable for understanding life processes. In what follows, we investigate the application of digital filters towards this goal, and make contributions in the development, testing, and performance analysis of such techniques.

Chapter 3 Location of Hot Spots in Proteins Using Bandpass Notch Digital Filters

I cannot persuade myself that a beneficent and omnipotent God would have designedly created parasitic wasps with the express intention of their feeding within the living bodies of caterpillars.

—CHARLES DARWIN (1809–1882)

3.1 Introduction

The transform-based technique proposed in [15] for locating hot spots using the resonant recognition model (RRM) yields results with reasonable accuracy but is computationally very expensive. For an easy-to-use hardware or software implementation of a hot-spot location system, improved computational efficiency is highly desirable. Such improvements can be achieved by using digital filters.

In this chapter, we investigate the design and application of infinite-duration impulse response (IIR) second-order bandpass notch (BPN) filters for locating hot spots. In addition to improved computational efficiency, these filters yield significant improvements in accuracy upon tuning. Towards this objective, an efficient tuning strategy based on a least-squares polynomial model is also proposed.
