Similarity and Diversity in Chemical Design

  • Tamar Schlick
Part of the Interdisciplinary Applied Mathematics book series (IAM, volume 21)





Chapter 15 Notation

Matrices & Vectors

A        dataset matrix (n × m)
A_k      rank-k approximation to A
C        covariance matrix (m × m), elements c_jj′
P_k      projection matrix
U        SVD factor of A (n × n), contains left singular vectors
V        SVD factor of A (m × m), contains right singular vectors; also eigenvector matrix of C
V_k      low-rank approximation to eigenvector matrix (m × k)
Σ        SVD factor of A (n × m), contains singular values
Σ_k      low-rank approximation to Σ
u_j      left singular vector
v_j      right singular vector
X_i      vector of compound i (components X_i1, X_i2, ⋯, X_im)
X̃_i      scaled version of X_i
Y_i      projection of X_i; also principal component of C

Scalars & Functions

d_ij     intercompound distance ij in the projected space
f, E     target optimization functions
l_ij     lower bound on intercompound distance ij
u_ij     upper bound on intercompound distance ij
m        number of dataset descriptors
n        number of dataset compounds
N        number of variables
T_d      total number of distance segments satisfying a given deviation from target
δ_ij     Euclidean distance (with upper/lower bounds u, l)
μ        mean value
w_ij     weights used in target optimization function
σ_j      singular values
Every sentence I utter must be understood not as an affirmation but as a question.

Niels Bohr (1885–1962).

15.1 Introduction to Drug Design

Following a simple introduction to drug discovery research, this chapter presents some mathematical formulations and approaches to problems involved in chemical database analysis that might interest mathematical/physical scientists. With continued advances in structure determination, genomics, and high-throughput screening and related (more focused) techniques, in silico drug design is playing an important role as never before. Thus, traditional structure-directed library design methods in combination with newer approaches like fragment-based drug design [496, 1447], virtual screening [453, 1179], and system-scale approaches to drug design [236, 278, 649] will form important areas of research.

For a historical perspective of drug discovery, see [7, 159, 335, 507, 589, 727, 772], for example, and for specialized treatments in drug design modeling consult the texts by Leach [709] and Cohen [254].

15.1.1 Chemical Libraries

The field of combinatorial chemistry was recognized by Science in 1997 as one of nine “discoveries that transform our ideas about the natural world and also offer potential benefits to society”. Indeed, the systematic assembly of chemical building blocks to form potential biologically-active compounds and their rapid testing for bioactivity has experienced a rapid growth in both experimental and theoretical approaches (e.g., [640, 692, 1241]); see the editorial overview on combinatorial chemistry [207] and the associated group of articles. Two combinatorial chemistry journals were launched in 1997, with new journals since then, and a Gordon Research conference on Combinatorial Chemistry was created. The number of new-drug candidates reaching the clinical-trial stage is greater than ever. Indeed, it was stated in 1999: “Recent advances in solid-phase synthesis, informatics, and high-throughput screening suggest combinatorial chemistry is coming of age” [151].

Accelerated (automated and parallel) synthesis techniques combined with screening by molecular modeling and database analysis are the tools of combinatorial chemists. These tools can be applied to propose candidate molecules that resemble antibiotics, to find novel catalysts for certain reactions, to design inhibitors for the HIV protease, or to construct molecular sieves for the chemical industries based on zeolites. Thus, combinatorial technology is used to develop not only new drugs but also new materials, such as for electronic devices. Indeed, as electronic instruments become smaller, thin insulating materials for integrated-circuit technology are needed. For example, the design of a new thin-film insulator at Bell Labs of Lucent Technologies [333] combined an optimal mixture of the metals zirconium (Zr), tin (Sn), and titanium (Ti) with oxygen.

As such experimental synthesis techniques become cheaper and faster, huge chemical databases are becoming available for computer-aided [159] and structure-based [41, 453, 1179, 1447] drug design; the development of reliable computational tools for the study of these database compounds is thus becoming more important than ever. The term cheminformatics (chemical informatics, also called chemoinformatics) has been coined to describe this emerging discipline, which aims at transforming such data into information, and that information into knowledge useful for faster identification and optimization of lead drugs.

15.1.2 Early Drug Development Work

Before the 1970s, proposals for new drug candidates came mostly from laboratory syntheses or extractions from Nature. A notable example of the latter is Carl Djerassi’s use of locally grown yams near his laboratory in Mexico City to synthesize cortisone; a year later, this led to his creation of the first steroid effective as a birth control pill [323]. Synthetic technology has certainly risen, but natural products have been and remain vital as pharmaceuticals (see [666, 1006] and Box 15.1 for a historical perspective).

A pioneer in the systematic development of therapeutic substances is James W. Black, who won the Nobel Prize in Physiology or Medicine in 1988 for his research on drugs beginning in 1964, including histamine H2-receptor antagonists. Black’s team at Smith Kline & French in England systematically synthesized and tested compounds to block histamine, a natural component produced in the stomach that stimulates secretion of gastric juices. Their work led to the development of a classic ‘rationally-designed’ drug in 1972 known as Tagamet (cimetidine). This drug effectively inhibits gastric-acid production and has revolutionized the treatment of peptic ulcers.

Later, the term rational drug design was introduced as our understanding of biochemical processes increased, as computer technology improved, and as the field of molecular modeling gained wider acceptance. ‘Rational drug design’ refers to the systematic study of correlations between compound composition and its bioactive properties.

Box 15.1: Natural Pharmaceuticals

Though burdened by political, environmental, and economic issues, pharmaceutical industries have long explored unusual venues for disease remedies, many in remote parts of the world and involving indigenous cures. Micro-organisms and fungi, in particular, are globally available and can be reproduced readily. For example, among the world’s 25 top-selling drugs in 1997, seven were derived from natural sources. Some notable examples of products derived from Nature are listed below.
  • A fungus found on a Japanese golf course is being used by Merck to make the cholesterol-lowering drug Mevacor, one of the 25 top-sellers of 1997.

  • A fungus found on a Norwegian mountain is the basis for another 1997 top-seller, the transplant drug Cyclosporin, made by Novartis.

  • A fungus from a Pacific yew tree is also the source of the anticancer agent paclitaxel (taxol).

  • The rosy periwinkle of Madagascar is the source of Eli Lilly’s two cancer drugs vincristine and vinblastine, which have helped fight testicular cancer and childhood leukemia since the 1960s.

  • A microbe discovered in a Yellowstone hot spring is the source of a heat-resistant enzyme now key in DNA amplification processes.

  • Ocean salmon is a source for osteoporosis drugs (Calcimar and Miacalcin), and coral extracts are used for bone replacement.

  • The versatile polymer chitosan, extracted from crab and shrimp shells, is a well known fat-binding weight-loss aid, in addition to its usage in paper additives, pool cleaners, cosmetics, and hair gels.

  • The Artemisia annua plant (also known as sweet wormwood), which grows in China, Vietnam, and some parts of the United States, provides the raw material for the malaria drug artemisinin.

  • Frog-skin secretions serve as models for development of painkillers with fewer side effects than morphine. This chemical secret, long exploited by Amazon rain forest tribesmen, is now being pursued with frogs from Ecuador by Abbott Labs.

  • Marine organisms from the Philippines are being investigated as sources of chemicals toxic to cancer cells.

  • The venomous lizard known as the Gila monster, which inhabits the region around Phoenix, Arizona, may provide a powerful peptide, exendin, for treating diabetes, because it stimulates insulin secretion and aids digestion in lizards that gorge thrice-yearly.

  • A compound isolated from a flowering plant in a Malaysian rainforest, calanolide A, is a promising drug candidate for AIDS therapy, in the class of non-nucleoside reverse transcriptase inhibitors.

  • A protein from a West African berry was identified by University of Wisconsin scientists as 2000 times sweeter than sugar; sweeteners are being developed from this source to make possible sweeter food products by gene insertion.

  • A natural marine product (ecteinascidin 743) derived from the Caribbean sea squirt Ecteinascidia turbinata was found to be an active inhibitor of cell proliferation in the late 1960s, but only recently purified, synthesized, and tested in clinical trials against certain cancers.

  • A Caribbean marine fungus extract (developed as halimide) shows early promise against cancer, including some breast cancers resistant to other drugs.

One of the most challenging aspects of using natural products as pharmaceutical agents is a sourcing problem, namely extracting and purifying adequate supplies of the target chemicals. For example, biochemical variations within species combined with international laws restricting collection (e.g., of frogs from Ecuador whose skins contain an alkaloid compound with powerful painkilling effects) limit available natural sources. In the case of the frog-skin chemical, this sourcing problem prompted the synthetic design of a new type of analgesic that is potentially nonaddictive [1006].

15.1.3 Molecular Modeling in Rational Drug Design

Since the 1980s, further improvements in modeling methodology and computer technology, as well as in X-ray crystallography and NMR spectroscopy for biomolecules, have increased the participation of molecular modeling in this lucrative field. Molecular modeling is playing a more significant role in drug development [453, 496, 666, 772, 1179, 1301, 1376] as more disease targets are being identified and solved at atomic resolution (e.g., HIV-1 protease, HIV integrase, adenovirus receptor, protein kinases), as our understanding of the molecular and cellular aspects of disease is enhanced (e.g., regarding pain signaling mechanisms, or the immune evasion mechanism of HIV), and as viral genomes are sequenced [529]. Indeed, in analogy to genomics and proteomics — which broadly define the enterprises of identifying and classifying the genes and the proteins in the genome — the discipline of chemogenomics [198] has been associated with the delineation of drugs for all possible drug targets.

As described in the first chapter, examples of drugs made famous by molecular modeling include HIV-protease inhibitors (AIDS treatments), SARS virus inhibitor, thrombin inhibitors (for blood coagulation and clotting diseases), neuropeptide inhibitors (for blocking the pain signals resulting from migraines), PDE-5 inhibitors (for treating impotence by blocking a chemical reaction which controls muscle relaxation and resulting blood flow rate), various antibacterial agents, and protein kinase inhibitors for metastatic lung cancer and other tumors [913]. See Figure 15.1 for illustrations of popular drugs for migraine, HIV/AIDS, and blood-flow related diseases.
Fig. 15.1

Popular drug examples. Top: Zolmitriptan (Zomig) for migraines, a 5-HT1 receptor agonist that enhances the action of serotonin. Middle: Nelfinavir Mesylate (Viracept), a protease inhibitor for AIDS treatment. Bottom: Sildenafil Citrate (Viagra) for penile dysfunction, a temporary inhibitor of phosphodiesterase-5, which regulates associated muscle relaxation and blood flow by converting cyclic guanosine monophosphate to guanosine monophosphate. See other household examples in Figure 15.3.

Such computer modeling and analysis — rather than using trial and error and exhaustive database studies — was thought to lead to dramatic progress in the design of drugs. However, some believe that the field of rational drug design has not lived up to its expectations.

One reason for the restrained success is the limited reliability of modeling molecular interactions between drugs and target molecules; such interactions must be described very accurately energetically to be useful in predictions. Newer approaches consider multiple targets [278] and work in system-oriented approaches [649] to improve success.

Another reason for the limited success of drug modeling is that the design of compounds with the correct binding properties (e.g., dissociation constants in the micromolar range and higher) is only a first step in the complex process of drug design; many other considerations and long-term studies are needed to determine the drug’s bioactivity and its effects on the human body [1364]. For example, a compound may bind well to the intended target but be inactive biologically if the reaction that the drug targets is influenced by other components (see Box 15.2 for an example). Even when a drug binds well to an appropriate target, an optimal therapeutic agent must be delivered precisely to its target [999], be screened for undesirable drug/drug interactions [1061], lack toxicity and carcinogenicity (likewise for its metabolites), be stable, and have a long shelf life.

The problems of viability and efficacy are even more important now with the increased development and usage of biologics or biotherapeutics — biological molecules like proteins derived from living cells and used as drugs — rather than small-molecule drugs. Such biologics, which include various vaccines, are typically administered by injection or infusion. Successful recent examples are Wyeth’s Enbrel for rheumatoid arthritis, Genentech’s Avastin for cancer, and Amgen’s Epogen for anemia. Many large pharmaceutical companies are increasing their work on biologics because such drugs are more complex and expensive to replicate and hence much less vulnerable to the usual patent expiration, which allows introduction of generics and thereby restricts the profits of the original manufacturers. However, the big challenge in biologics is dealing with the characteristic heterogeneity of such biological molecules and better understanding their mechanism of action related to the disease target and long-term effects.

15.1.4 The Competition: Automated Technology

Even accepting those limitations of computer-based approaches, rational drug design has avid competition from automated technology: new synthesis techniques, such as robotic systems that can run hundreds of concurrent synthetic reactions, have emerged, thereby enhancing synthesis productivity enormously. With high-throughput screening, these candidates can then be rapidly screened to analyze binding affinities, determine transport properties, and assess conformational flexibility.

Many believe that such en masse production is the key to establishing diverse databases of drug candidates. Thus, at this time, it might be argued that drug design need not be ‘rational’ if it can be exhaustive. Still, others advocate a more focused design approach, based on structures of ligands or receptors [453], fragment-based drug design [1447], or virtual screening approaches applied to smaller subsets of compounds [453, 1179].

Another convincing argument for the focused design approach is that the number of synthesized compounds is so vast (and so rapidly generated) that computers will be essential for sorting through the huge databases for compound management and applications. Such applications involve clustering analysis and similarity and diversity sampling (see below), preliminary steps in generating drug candidates or optimizing bioactive compounds.

This information explosion explains the resurrection of computer-aided drug design and its enhancement in scope under the new title combinatorial chemistry, affectionately endorsed as ‘the darling of chemistry’ [1376].

15.1.5 Chapter Overview

In this chapter, a brief introduction to some mathematical questions involved in this discipline of chemical library design is presented, namely similarity and diversity sampling for ligand-based drug design. Some ideas on cluster analysis and database searching are also described. This chapter is only intended to whet the appetite for chemical design and to invite mathematical scientists to work on related problems.

Because medicinal chemistry applications are an important subfield of chemical design, this last chapter also provides some perspectives on current developments in drug design, as well as mentioning emerging areas such as pharmacogenomics, personalized medicine, and biochips (see Boxes 15.3 and 15.4).

15.2 Problems in Chemical Libraries

Chemical libraries consist of compounds (known chemical formulas) with potential and/or demonstrated therapeutic activities. Most libraries are proprietary, residing in pharmaceutical houses, but public sources also exist, like the National Cancer Institute’s (NCI’s) 3D structure database.

Both target-independent and target-specific libraries exist. The name ‘combinatorial libraries’ stems from the important combinatorial problems associated with the experimental design of compounds in chemical libraries, as well as computational searches for potential leads using concepts of similarity and diversity as introduced below.

15.2.1 Database Analysis

In broad terms, two general problem categories can be defined in chemical library analysis and design:

Database systematics: analysis and compound grouping, compound classification, elimination of redundancy in compound representation (dimensionality reduction), data visualization, etc., and

Database applications: efficient formulation of quantitative links between compound properties and biological activity for compound selection and design optimization experiments.

Both of these general database problems are associated with several mathematical disciplines, including multivariate statistical analysis and numerical linear algebra, multivariate nonlinear optimization (for continuous formulations), combinatorial optimization (for discrete formulations), distance geometry techniques, and configurational sampling.
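As a small illustration of the linear-algebra machinery behind dimensionality reduction, the sketch below (in Python with NumPy; the 4 × 3 dataset matrix is invented for illustration, not drawn from any real library) computes the SVD factorization A = UΣVᵀ of a dataset matrix A, a rank-k approximation A_k, and the projection of each compound vector onto the leading principal axes of the covariance matrix C:

```python
import numpy as np

# Toy dataset matrix A: n = 4 compounds, m = 3 descriptors (values invented).
A = np.array([[1.0, 2.0, 0.5],
              [0.9, 2.1, 0.4],
              [3.0, 0.1, 2.2],
              [2.8, 0.2, 2.0]])

# SVD: A = U diag(sigma) V^T, with left/right singular vectors u_j, v_j.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation A_k, keeping the k largest singular values.
k = 1
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# Covariance matrix C (m x m) of the column-centered data; its eigenvectors
# are the right singular vectors of the centered dataset.
X = A - A.mean(axis=0)          # center each descriptor column
C = X.T @ X / (A.shape[0] - 1)

# Project each compound vector onto the first k principal axes.
_, _, Vt_c = np.linalg.svd(X, full_matrices=False)
Y = X @ Vt_c[:k, :].T
```

For realistic libraries, where n and m run into the thousands, truncated or randomized SVD routines would replace the dense factorization used in this sketch.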

15.2.2 Similarity and Diversity Sampling

Two specific problems, described formally in the next section after the introduction of chemical descriptors, are the similarity and diversity problems.

The similarity problem in drug design involves finding molecules that are ‘similar’ in physical, chemical, and/or biological characteristics to a known target compound. Deducing compound similarity is important, for example, when one drug is known and others are sought with similar physicochemical and biological properties, and perhaps with reduced side effects.

One example is the target bone-building drug raloxifene, whose chemical structure is somewhat related to the breast cancer drug tamoxifen (see Figure 15.2) (e.g., [1093]). Both are members of the family of selective estrogen receptor modulators (SERMs) that bind to estrogen receptors in the breast cancer cells and exert a profound influence on cell replication. It is hoped that raloxifene will be as effective for treating breast tumors but will reduce the increased risk of endometrial cancer noted for tamoxifen. Perhaps raloxifene will also not lose its effectiveness after five years like tamoxifen.
Fig. 15.2

Related pairs of drugs: the antiestrogens raloxifene and tamoxifen, and the tricyclic compounds with aliphatic side-chains at the middle ring quinacrine and chlorpromazine.

Another example of a related pair of drugs is chlorpromazine (for treating schizophrenia) and quinacrine (antimalarial drug). These tricyclic compounds with aliphatic side chains at the middle ring group (see Figure 15.2) were suggested as candidates for treating Creutzfeldt-Jakob and other prion diseases [677].

Because similarity in structure might serve as a first criterion for similarity in activity/function, similarity searching can be performed using 3D structural and energetic searches (e.g., induced fit or ‘docking’ [41, 818]) or using the concept of molecular descriptors introduced in the next section, possibly in combination with other discriminatory criteria.

The diversity problem in drug design involves delineating the most diverse subset of compounds within a given library. Diversity sampling is important for practical reasons. The smaller, representative subsets of chemical libraries (in the sense of being most ‘diverse’) might be searched first for lead compounds, thereby reducing the search time; representative databases might also be used to prioritize the choice of compounds to be purchased and/or synthesized, similarly resulting in an accelerated discovery process, not to speak of economic savings.
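Both sampling problems can be made concrete in a few lines. In the toy sketch below (Python with NumPy; the descriptor vectors are invented), similarity searching ranks compounds by Euclidean intercompound distance to a query compound, and a diverse subset is grown by a standard greedy max-min heuristic that repeatedly adds the compound farthest from those already selected; this is an illustrative heuristic, not a specific algorithm from this chapter:

```python
import numpy as np

# Invented descriptor vectors X_i for six hypothetical compounds (m = 2).
X = np.array([[0.0, 0.0],
              [0.1, 0.1],
              [5.0, 5.0],
              [5.1, 4.9],
              [0.0, 5.0],
              [5.0, 0.0]])

def pairwise_dist(X):
    """Euclidean intercompound distances d_ij."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

D = pairwise_dist(X)

# Similarity: rank compounds by distance to a query (here, compound 0).
query = 0
ranked = np.argsort(D[query])   # nearest first; ranked[0] is the query itself

# Diversity: greedy max-min selection of a k-compound diverse subset.
def diverse_subset(D, k, start=0):
    chosen = [start]
    while len(chosen) < k:
        # pick the compound whose minimum distance to the chosen set is largest
        min_to_chosen = D[:, chosen].min(axis=1)
        min_to_chosen[chosen] = -1.0   # exclude already-chosen compounds
        chosen.append(int(min_to_chosen.argmax()))
    return chosen

subset = diverse_subset(D, k=3)
```

The max-min rule tends to select compounds near the periphery of descriptor space first; practical library designs typically combine such a heuristic with clustering or cell-based partitioning of the descriptor space.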

Box 15.2: Treatments for Chronic Pain

Amazing breakthroughs have been achieved recently in the treatment of chronic pain. Such advances were made possible by an increased understanding of the distinct cellular mechanisms that cause pain due to different triggers, from sprained backs to arthritis to cancer.

What is Pain? Pain signals start when nerve fibers known as nociceptors, found throughout the human body, react to some disturbances in nearby tissues. The nerve fibers send chemical pain messengers that collect in the dorsal horn of the spinal cord. Their release depends on the opening of certain pain gates. Only when these messengers are released into the brain is pain felt in the body.

Natural Ammunition. Fortunately, the body has a battery of natural painkillers that can close those pain gates or send signals from the brain. These compensatory agents include endorphins (peptides with opium-like action), adrenaline, and serotonin. Many painkillers enhance or mimic the action of these natural aids (e.g., opium-based drugs such as morphine, codeine, and methadone). However, these opiates have many undesirable side effects.

Painkiller Targets. To address the problem of pain, new treatments are targeting specific opiate receptors. For example, Actiq, developed by Anesta Corp. for intense cancer pain, is a lozenge placed in the cheek that is absorbed quickly into the bloodstream, avoiding the gut. Other pain relievers include a class of drugs known as COX-2 inhibitors, like Monsanto’s Celebrex (celecoxib) and Merck’s Vioxx, which relieve aches and inflammation with fewer stomach-damaging effects. They do so by targeting only one (COX-2) of two enzymes called cyclo-oxygenases (COX), which are believed to cause inflammation and thereby trigger pain.

While regular non-steroidal anti-inflammatory drugs (NSAIDs, like Aspirin, Ibuprofen, and Naproxen) and others available by prescription attack both COX-1 and COX-2, COX-1 is also known to protect the stomach lining; this explains the stomach pain that many people experience with NSAIDs and the pain relief without the side effects that COX-2 inhibitors can offer.

Modern pain treatment also involves compounds that stop pain signals before the brain gets the message, either by intercepting the signals in the spinal cord or by blocking their route to the spine. Evidence is emerging that a powerful chemical called ‘substance P’ can be used as an agent to deliver pain blockers to receptors found throughout the body; an experimental drug based on this idea (marketed by Pfizer) has proven effective at easing tooth pain.

15.2.3 Bioactivity Relationships

Besides database systematics, such as similarity and diversity sampling, the establishment of clear links between compound properties and bioactivity is, of course, the heart of drug design. In many respects, this association is not unlike the protein prediction problem in which we seek some target energy function that upon global minimization will produce the biologically relevant, or native, structure of a protein.

In our context, formulating that ‘function’ to relate sequence and structure while not ignoring the environment might be even more difficult, since we are studying small molecules for which the evolutionary relationships are not as clear as they might be for proteins. Further, the bioactive properties of a drug depend on much more than its chemical composition, three-dimensional (3D) structure, and energetic properties. A complex orchestration of cellular machinery is often involved in a particular human ailment or symptom, and this network must be understood to alleviate the condition safely and successfully.

A successful drug has usually passed many rounds of chemical modifications that enhanced its potency, optimized its selectivity, and reduced its toxicity. An example involves obesity treatments by the hormone leptin. Limited clinical studies have shown that leptin injections do not lead to clear trends of weight loss in people, despite demonstrating dramatic slimming of mice. Though not a quick panacea in humans, leptin has nonetheless opened the door to pharmacological manipulations of body weight, a dream with many medical — not to speak of monetary — benefits. Therapeutic manipulations will require an understanding of the complex mechanism associated with leptin regulation of our appetite, such as its signaling the brain on the status of body fat.

Box 15.2 contains another illustration of the need to understand such complex networks in connection with drug development for chronic pain. These examples clearly show that lead generation, the first step in drug development, is followed by lead optimization, the challenging, slower phase.

In fact, this complexity of the molecular machinery that underlies disease has given rise to the subdisciplines of molecular medicine and personalized medicine (see Boxes 15.3 and 15.4), where DNA technology plays an important role. Specifically, DNA chips — small glass wafers like computer chips studded with bits of DNA instead of transistors — can analyze the activities of thousands of genes at a time, helping to predict disease susceptibility in individuals, classify certain cancers, and design treatments [400].

For example, DNA chips can study expression patterns in the tumor suppressor gene p53 (the gene with the single most common mutations in human cancers), and such patterns can be useful for understanding and predicting response to chemotherapy and other drugs. DNA microarrays have also been used to identify genes that selectively stimulate metastasis (the spread of tumor cells from the original growth to other sites) in melanoma cells.

Besides developments on more personalized medicine, which will also be enhanced by a better understanding of the human body and its ailments, new advances in drug delivery systems may be important for improving the rate and period of drug delivery in general [1304].

Box 15.3: Molecular and Personalized Medicine

Pauling’s Groundwork. Molecular medicine seeks to enhance our therapeutic solutions by understanding the molecular basis of disease. Linus Pauling laid the groundwork for this field in his seminal 1949 paper [977], which demonstrated that the hemoglobin from sickle cell anemia sufferers has a different electric charge than that from healthy people. This difference was later explained by Vernon Ingram as arising from a single amino acid difference [590]. These pioneering works relied on electrophoretic mobility measurements and fingerprinting techniques (electrophoresis combined with paper chromatography) for peptides.

Disease Simulations. A modern incarnation of molecular medicine involves conducting virtual experiments by computer simulation with the goal of developing new hypotheses regarding disease mechanisms and prevention. For example, scientists at Entelos Inc. (Menlo Park, California) are simulating cell inflammation caused by asthma to try to learn how blocking certain inflammation factors might affect cellular receptors and then to identify targets for steroid inhalers.

From SNPs to Tailored Drugs. Another significant current trend in medicine is personalized medicine, the tailoring of drugs to individual genetic makeup. User-specific drugs have great potential to be more potent and to eliminate adverse side effects experienced by some individuals. Pharmacogenetics is the field of studying how genetic factors influence drug response. Its newer sibling pharmacogenomics involves using genomics to describe individual responses to drugs. Pharmacogenomics (also abbreviated as Pgx or pgx) has become possible with the advent of microarray technology (e.g., [544, 1174]): these make possible large-scale genome-wide analyses to test thousands of genes for related activity with a specific drug. Developing tailored diets and vitamins based on individual responses to diet (determined in part by one’s genes) is another growing field called nutritional genomics or nutrigenomics.

Specifically, the drug tailoring idea is based on identifying the small variations in people’s DNA where a single nucleotide differs from the standard sequence. These mutations, or individual variations in genome sequence that occur once every couple of hundred base pairs, are called single-nucleotide polymorphisms, known as SNPs (pronounced “snips”). The presence of SNPs can be signaled visually using DNA chips or biochips, instruments of fancy of the biotechnology industry (see [400] and the Box in Chapter 1). Other genomic factors besides SNPs also serve as distinguishing factors in pharmacogenomics studies.

Pharmacogenetics gained momentum in April 1999 when eleven pharmaceutical and technology companies and the Wellcome Trust announced a genome mapping consortium for SNPs. The consortium’s goal is to construct a fine-grained map of order 300,000 SNPs to permit searching for SNP patterns that correlate with particular drug responses. Efforts are ongoing, and many companies have specialized in this area. Pharmacogenomics now receives considerable attention both from the professional medical circles and the popular press. It has potential to markedly improve medical intervention, reduce hospitalization costs, and alleviate human suffering by increasing the efficacy and decreasing adverse effects in the drug treatment of various human diseases.

Some notable examples of successes of pharmacogenomics include the genotype-based dosing of the blood-thinning drug Warfarin; administration of Abacavir (an RT inhibitor) to HIV patients; Herceptin treatment for HER2-positive breast cancer patients; and Gleevec and other cancer drugs for individual cancer patients. See Box 15.4 for details of some of these drugs.

Directed drugs are also under development to treat or diagnose diabetes, neurological diseases like Alzheimer’s, prostate cancer, and ailments requiring antibiotics. Though this new field faces many hurdles, not to mention the possible financial drawbacks of genotyping, it is hoped that some benefits of cost savings in prescriptions and in hospitalizations for adverse drug effects could be realized in the not-too-distant future [588].

Box 15.4: Examples of Successes in Pharmacogenomics

Warfarin. Warfarin is the “darling” of pharmacogenomics because international collaborations by the International Warfarin Pharmacogenetics Consortium and the Pharmacogenetics Research Network have led to the development of a dosing algorithm [591]. This is a milestone in the evolution of drug prescription from trial and error to exact science [591]. Warfarin is an anti-coagulation agent given to patients at risk of heart disease. However, adverse effects can be catastrophic since the patient may bleed to death. Practical experience has shown that reactions to the drug vary widely from person to person. But why? Pharmacogenomics analyses revealed that a patient’s response to Warfarin depends on variants of two genes encoding two proteins: CYP2C9, which metabolizes warfarin, and VKORC1, which recycles vitamin K and affects clotting factors. Patients with certain genotypes are much more sensitive to the drug. In 2007, the FDA modified Warfarin labels to highlight the potential relevance of genetic information to prescribing decisions.

Abacavir. Abacavir is a guanosine analog reverse-transcriptase inhibitor used as an anti-retroviral treatment against HIV infection. However, 5 to 8% of the white population develops adverse side effects, namely a toxic skin reaction. In 2002, it was discovered that the HLA-B*5701 gene variant is highly associated with this hypersensitivity. Genotyping has thus been used to effectively reduce the number of such adverse reactions. Genetic testing for sensitivity to abacavir is now widely used.

Herceptin. Herceptin (Trastuzumab) is an antibody used to treat breast cancer. Studies have shown that Herceptin is effective for patients with over-expression of the human epidermal growth factor receptor HER2, which occurs in invasive breast carcinomas. Herceptin has now been approved by the FDA for patients with invasive breast cancer that overexpresses HER2.

Codeine for Breast-Feeding Mothers. Codeine is a painkiller often prescribed to help women with post-delivery pain. Codeine is metabolized into morphine, but it was generally considered to be safe for breast-feeding mothers. In 2005, a 13-day-old male baby in Toronto who was breastfed by a codeine-treated mother died of a morphine overdose [673]. Investigations revealed that the mother was an “ultra-metabolizer” of codeine, which led to an unusually high level of morphine in the baby. Studies have shown that the metabolism of codeine is governed by the CYP2D6 gene. Subsequently, genetic testing for this variant has been suggested for mothers who want to breastfeed and receive codeine for post-delivery pain [1375]. Alternatively, breast-feeding can be avoided or reduced, and/or the level of morphine in the neonate monitored carefully to prevent unnecessary deaths.

15.3 General Problem Definitions

15.3.1 The Dataset

Our given dataset of size n contains information on compounds with potential biological activity (drugs, herbicides, pesticides, etc.). A schematic illustration is presented in Figure 15.3. The value of n is large, say one million or more. Because of the enormous dataset size, the problems described below are simple to solve in principle but extremely challenging in practice because of the large associated computational times. Any systematic schemes to reduce this computing time can thus be valuable.
Fig. 15.3

A chemical library can be represented by n compounds i (known or potential drugs), each associated with m characteristic descriptors ({Xi k }) and activities {Bi j } with respect to m B biological targets (known or potential).

15.3.2 The Compound Descriptors

Each compound in the database is characterized by a vector (the descriptor). The vector can have real or binary elements. There are many ways to formulate these descriptors so as to reduce the database search time and maximize success in generation of lead compounds.

Conventionally, each compound i is described by a list of chemical descriptors, which may reflect molecular composition, such as atom number, atom connectivity, or number of functional groups (like aromatic or heterocyclic rings, tertiary aliphatic amines, alcohols, and carboxamides); molecular geometry, such as number of rotatable bonds; electrostatic properties, such as charge distribution; and various physicochemical measurements that are important for bioactivity.

These descriptors are currently available from many commercial packages like Molconn-X and Molconn-Z (Hall Associates Consulting, Quincy, MA). Descriptors fall into many classes. Examples include:
2D descriptors

also called molecular connectivity or topological indices — reflecting molecular connectivity and other topological invariants;

binary descriptors

simpler encoded representations indicating the presence or absence of a property, such as whether or not the compound contains at least three nitrogen atoms, doubly-bonded nitrogens, or alcohol functional groups;

3D descriptors

reflecting geometric structural factors like van der Waals volume and surface area; and

electronic descriptors

characterizing the ionization potential, partial atomic charges, or electron densities.

See also [8] for further examples.

Binary descriptors allow rapid database analysis using Boolean algebra operations. The Molconn-X and Molconn-Z programs, for example, generate topological descriptors based on molecular connectivity indices (e.g., number of atoms, number of rings, molecular branching paths, atom types, bond types, etc.). Such descriptors have been found to be a convenient and reasonably successful approximation to quantify molecular structure and relate structure to biological activity (see review in [6]). These descriptors can be used to characterize compounds in conjunction with other selectivity criteria based on activity data for a training set (e.g., [322, 582]). The search for the most appropriate descriptors is an ongoing enterprise, not unlike force-field development for macromolecules.
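To illustrate the Boolean operations mentioned above, the sketch below screens hypothetical binary fingerprints with NumPy logical operations; the fingerprint bits and the queried properties are invented for illustration only.

```python
import numpy as np

# Hypothetical binary descriptors: one row per compound, one column per
# encoded property (e.g., "contains at least three nitrogen atoms").
fingerprints = np.array([[1, 0, 1, 1],
                         [1, 1, 0, 0],
                         [0, 0, 1, 1]], dtype=bool)

# Hypothetical query: retain compounds that have both property 0 and property 2.
query = np.array([True, False, True, False])
matches = np.all(fingerprints[:, query], axis=1)
print(matches.tolist())  # [True, False, False]
```

Because the test reduces to bitwise operations on packed Boolean arrays, millions of compounds can be screened this way far faster than with real-valued distance computations.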

The number of these descriptors, m, is roughly on the order of 1000, thus much smaller than n (the number of compounds) but too large to permit standard systematic comparisons for the problems that arise.

Let us define the vector Xi associated with compound i to be the row m-vector
$$Xi = (X{i}_{1},\,X{i}_{2},\,\cdots \,,X{i}_{m})\,.$$
Our dataset \(\mathcal{S}\) can then be described as the collection of n vectors
$$\mathcal{S} =\{ X1,X2,X3,\ldots,Xn\}\,,$$
or expressed as a rectangular matrix A n ×m by listing, in rows, the m chemical descriptors of the n database compounds:
$$A = \left (\begin{array}{cccc} X{1}_{1} & X{1}_{2} & \cdots &X{1}_{m} \\ X{2}_{1} & X{2}_{2} & \cdots &X{2}_{m} \\ \vdots & \vdots & &\vdots \\ X{n}_{1} & X{n}_{2} & \cdots &X{n}_{m}\\ \end{array} \right ).$$
In practice, this rectangular n ×m matrix has \(n \gg m\) (i.e., the matrix is long and narrow), where n is on the order of millions and m is on the order of several hundred.

The compound descriptors are generally highly redundant. Yet, it is far from trivial how to select the “principal descriptors”. Thus, various statistical techniques (principal component analysis, classic multivariate regression; see below) have been used to assess the degree of correlation among variables so as to eliminate highly-correlated descriptors and reduce the dimension of the problems involved.

15.3.3 Characterizing Biological Activity

Another aspect of each compound in such databases is its biological activity. Pharmaceutical scientists might describe this property by associating a simple affirmative or negative score with each compound to indicate various areas of activity (e.g., with respect to various ailments or targets, which may include categories like headache, diabetes, protease inhibitors, etc.).

Drugs may enhance/activate (e.g., agonists) or inhibit (e.g., antagonists, inhibitors) certain biochemical processes. This bioactivity aspect of database problems is far less quantitative than the simple chemical descriptors. Of course, it also requires synthesis and biological testing for activity determination. Studies of several drug databases have suggested that active compounds can be associated with certain ranges of physicochemical properties like molecular weight and occurrence of functional groups [451].

For the purpose of the problems outlined here, it suffices to think of such an additional set of descriptors associated with each compound. For example, a matrix \({B}_{n\times {m}_{B}}\) may complement the n ×m database matrix A; see Figure 15.3. Each row i of B may correspond to measures of activity of compound i with respect to specific targets (e.g., binary variables for active/nonactive target response).

The ultimate goal in drug design is to find a compound that yields the desired pharmacological effect. This quest has led to the broad area termed SAR, an acronym for Structure/Activity Relationship [709]. This discipline applies various statistical, modeling, or optimization techniques to relate compound properties to associated pharmacological activity. A simple linear model, for example, might attempt to solve for variables in the form of a matrix \({X}_{m\times {m}_{B}}\), satisfying
$$AX = B\,.$$
Explained more intuitively, SAR formulations attempt to relate the given compound descriptors to experimentally-determined bioactivity markers. While earlier models for ‘quantitative SAR’ (QSAR) involved simple linear formulations for fitting properties and various statistical techniques (e.g., multivariate regression, principal component analysis), nonlinear optimization techniques combined with other visual and computational techniques are more common today [448]. The problem remains very challenging, with rigorous frameworks continuously being sought.

15.3.4 The Target Function

To compare compounds in the database to each other and to new targets, a quantitative assessment can be based on common structural features. Whether characterized by topological (chemical-formula based) or 3D features, this assessment can be broadly based on the vectorial chemical descriptors provided by various computer packages. A target function f is defined, typically based on the Euclidean distance function between vector pairs, δ, where
$$f(Xi,Xj) = {\delta }_{ij} \equiv \| Xi - Xj\| = \sqrt{\sum\limits_{k=1}^{m}{(X{i}_{k} - X{j}_{k})}^{2}}\,.$$

Thus, to measure the similarity or diversity for each pair of compounds Xi and Xj, the function f(Xi, Xj) is often set to the simple distance function δ ij . Other functions of distance are also appropriate depending upon the objectives of the optimization task.
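The distance function above is straightforward to implement; a minimal sketch with NumPy, applied to made-up three-component descriptor vectors:

```python
import numpy as np

def delta(xi, xj):
    """Euclidean distance between two m-dimensional descriptor vectors."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    return float(np.sqrt(np.sum((xi - xj) ** 2)))

# Two hypothetical descriptor vectors (m = 3):
print(delta([0.0, 3.0, 0.0], [4.0, 0.0, 0.0]))  # 5.0
```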

15.3.5 Scaling Descriptors

Scaling the descriptor components is important for proper assessment of the score function [1372]. This is because the individual chemical descriptors can vary drastically in their magnitudes as well as in their variance within the dataset. Consequently, a few large descriptors can overwhelm the similarity or diversity measures. For example, actual descriptor components of a database compound may look like the following:

(Sample descriptor values omitted; the entries of a typical descriptor vector span widely different ranges.)

Clearly, the ranges of individual descriptors vary (e.g., 0 to 1 versus 0 to 1000). Thus, given no chemical/physical guidance, it is customary to scale the vector entries before analysis. In practice, however, it is very difficult to determine the appropriate scaling and displacement factors for the specific application problem [1372]. A general scaling of each Xi k to produce \(\hat{X{i}}_{k}\) can be defined using two real numbers α k and β k , termed the scaling and displacement factors, respectively, where α k > 0. Namely, for k = 1, 2, …, m, we define the scaled components as
$$\hat{X{i}}_{k} = {\alpha }_{k}\,(X{i}_{k} - {\beta }_{k}),\quad \quad 1 \leq i \leq n\,.$$
The following two scaling procedures are often used. The first maps each column into the range [0, 1]: each column of the matrix A is modified using eq. (15.4) by setting the factors as
$$\begin{array}{rcl} {\beta }_{k}& =& {\min }_{1\leq i\leq n}X{i}_{k}\,, \\ {\alpha }_{k}& =& 1/\left({\max }_{1\leq i\leq n}X{i}_{k} - {\beta }_{k}\right).\end{array}$$
This scaling procedure is also termed “standardization of descriptors”.
The second scaling produces a new matrix A where each column has a mean of zero and a standard deviation of one. It does so by setting the factors (for k = 1, 2, …, m) as
$$\begin{array}{rcl} {\beta }_{k}& =& \frac{1} {n}\sum\limits_{i=1}^{n}X{i}_{ k}\,, \\ {\alpha }_{k}& =& 1/\sqrt{ \frac{1} {n}\sum\limits_{i=1}^{n}{(X{i}_{k} - {\beta }_{k})}^{2}}\,.\end{array}$$

Both scaling procedures defined by eqs. (15.5) and (15.6) are based on the assumption that no one descriptor dominates the overall distance measures.
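Both scaling procedures can be sketched directly from eqs. (15.4)–(15.6); the small 3 × 2 matrix below is an artificial example:

```python
import numpy as np

def scale_01(A):
    """Eq. (15.5): map each descriptor column into the range [0, 1]."""
    A = np.asarray(A, float)
    beta = A.min(axis=0)
    alpha = 1.0 / (A.max(axis=0) - beta)
    return alpha * (A - beta)

def scale_standard(A):
    """Eq. (15.6): give each column zero mean and unit standard deviation
    (using the 1/n normalization of the text)."""
    A = np.asarray(A, float)
    beta = A.mean(axis=0)
    alpha = 1.0 / np.sqrt(((A - beta) ** 2).mean(axis=0))
    return alpha * (A - beta)

# Artificial dataset: column ranges 0..2 versus 100..300.
A = np.array([[0.0, 100.0], [1.0, 300.0], [2.0, 200.0]])
S = scale_standard(A)
print(S.mean(axis=0))  # ~[0, 0]
print(S.std(axis=0))   # ~[1, 1]
```

After either scaling, no single descriptor dominates the Euclidean distances purely because of its units.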

15.3.6 The Similarity and Diversity Problems

The Euclidean distance function f(Xi, Xj) = δ ij based on the chemical descriptors can be used in performing similarity searches among the database compounds and between these compounds and a particular target. This involves optimization of the distance function over i = 1, , n, for a fixed j:
$${ \mbox{ Minimize}}_{\ \ Xi\in \mathcal{S}\ }\{f({\delta }_{ij})\}\,.$$
More difficult and computationally-demanding is the diversity problem. Namely, we seek to reduce the database of the n compounds by selecting a “representative subset” of the compounds contained in \(\mathcal{S}\), that is one that is “the most diverse” in terms of potential chemical activity. We can formulate the diversity problem as follows:
$$\mbox{ Maximize}\sum\limits_{Xi,Xj\in {\mathcal{S}}_{0}}\;\{f({\delta }_{ij})\;\}\,$$
for a given subset \({\mathcal{S}}_{0}\) of size n 0.

The molecular diversity problem naturally arises since pharmaceutical companies must scan huge databases each time they search for a specific pharmacological activity. Thus reducing the set of n compounds to n 0 representative elements of the set \({\mathcal{S}}_{0}\) is likely to accelerate such searches. ‘Combinatorial library design’ corresponds to this attempt to choose the best set of substituents for combinatorial synthetic schemes so as to maximize the likelihood of identifying lead compounds.

The molecular diversity problem involves maximizing the volume spanned by the elements of \({\mathcal{S}}_{0}\) as well as the separation between those elements. Geometrically, we seek a well separated, uniform-like distribution of points in the high-dimensional compound space in which each chemical cluster has a ‘representative’.

A simple, heuristic formulation of this problem might be based on the similarity problem above: successively minimize f(δ ij ) over all i, for a fixed (target) j, so as to eliminate a subset {Xi} of compounds that are similar to Xj. This approach thus identifies groupings that maximize intracluster similarity as well as intercluster diversity.

The combinatorial optimization problem, an example of a very difficult computational task, is NP-complete (see footnote in Chapter 11, Section 11.2). An exhaustive calculation of the above distance-sum function over a fixed set \({\mathcal{S}}_{0}\) of n 0 elements requires a total of \(\mathcal{O}({n}_{0}^{2}m)\) operations, and there are many possible subsets of \(\mathcal{S}\) of size n 0 to examine, namely \({C}_{n}^{{n}_{0}}\) of them, where
$$\begin{array}{rcl}{ C}_{n}^{{n}_{0} }& =& \frac{n!} {{n}_{0}!\;(n - {n}_{0})!} \\ & =& \frac{n(n - 1)(n - 2)\cdots (n - {n}_{0} + 1)} {{n}_{0}!} \,.\end{array}$$
As a simple example, for n = 4, we have \({C}_{4}^{1} = 4/1! = 4\) subsets of one element; \({C}_{4}^{2} = (4 \times 3)/2! = 6\) different subsets of two elements, \({C}_{4}^{3} = (4 \times 3 \times 2)/(3!) = 4\) subsets of three elements, and \({C}_{4}^{4} = (4 \times 3 \times 2 \times 1)/(4!) = 1\) subset of four elements.
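These counts can be checked with Python's math.comb; the last line illustrates how quickly the number of subsets explodes for realistic library sizes.

```python
from math import comb

# C(4, n0) for n0 = 1..4, matching the example in the text:
print([comb(4, k) for k in range(1, 5)])  # [4, 6, 4, 1]

# For realistic library sizes the number of candidate subsets explodes:
print(comb(1000, 10))  # roughly 2.6e23 distinct 10-compound subsets
```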

Typically, these combinatorial optimization problems are solved by stochastic and heuristic approaches. These include genetic algorithms, simulated annealing, and tabu-search variants. (See Agrafiotis [5], for example, for a review).

As in other applications, the efficiency of simulated annealing depends strongly on the choice of cooling schedule and other parameters. Several potentially valuable annealing algorithms such as deterministic annealing, multiscale annealing, and adaptive simulated annealing, as well as other variants, have been extensively studied.

Various formulations of the diversity problem have been used in practice. Examples include the maximin function, which maximizes the minimum intermolecular distance:
$$\mbox{ Maximize}_{\stackrel{\;}{Xi\in {\mathcal{S}}_{0}}}\;\left\{{\min }_{\stackrel{j\neq i}{Xj\in {\mathcal{S}}_{0}}}\;({\delta }_{ij})\right\}\,,$$
or its variant, which maximizes the sum of these minimum distances:
$$\mbox{ Maximize}_{\stackrel{\;}{{\mathcal{S}}_{0}\subset \mathcal{S}}}\;\sum\limits_{i:\,Xi\in {\mathcal{S}}_{0}}\;{\min }_{j\neq i}\;({\delta }_{ij})\,.$$
The maximization problem above can be formulated as a minimization problem by standard techniques if f(x) is normalized so it is monotonic with range [0, 1], since we can often write
$$\max [f(x)] \Leftrightarrow \min [-f(x)]\ \ \mbox{ or}\ \ \min [1 - f(x)]\,.$$

In special cases, combinatorial optimization problems can be formulated as integer programming and mixed-integer programming problems. In this approach, linear programming techniques such as interior-point methods can be applied to the solution of combinatorial optimization problems, leading to branch-and-bound algorithms, cutting-plane algorithms, and dynamic programming algorithms. Parallel implementation of combinatorial optimization algorithms is also important in practice to improve performance.

Other important research areas in combinatorial optimization include the study of various algebraic structures (such as matroids and greedoids) within which some combinatorial optimization problems can more easily be solved [263].

Currently, practical algorithms for addressing the diversity problem in drug design are relatively simple heuristic schemes that have computational complexity of at most \(\mathcal{O}({n}^{2})\), already a huge number for large n.
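One such simple heuristic can be sketched as follows: a generic greedy maximin scheme (an illustrative sketch, not any specific published algorithm) that selects n 0 compounds by repeatedly adding the compound farthest from those already chosen, at O(n n 0 m) cost.

```python
import numpy as np

def greedy_maximin(X, n0, seed_index=0):
    """Greedy heuristic for the maximin diversity problem: starting from a
    seed compound, repeatedly add the compound whose distance to the nearest
    already-chosen compound is largest. A sketch only; stochastic methods
    (simulated annealing, genetic algorithms) are used in practice."""
    X = np.asarray(X, float)
    chosen = [seed_index]
    # distance of every compound to its nearest chosen compound
    d = np.linalg.norm(X - X[seed_index], axis=1)
    for _ in range(n0 - 1):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

# Toy 2D "descriptor" vectors: compounds 0 and 1 are near-duplicates.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 5.0]])
print(greedy_maximin(X, 3))  # [0, 2, 3] -- the near-duplicate is skipped
```

Note how the redundant compound (index 1) is never selected: its distance to the chosen set stays small, which is exactly the clustering behavior the diversity problem seeks.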

15.4 Data Compression and Cluster Analysis

Dimensionality reduction and data visualization are important aids in handling the similarity and diversity problems outlined above. Principal component analysis (PCA) is a classic technique for data compression (or dimensionality reduction). It has already been shown to be useful in analyzing microarray data (e.g., [1009]), as discussed in Chapter 1. The singular value decomposition (SVD) is another closely related approach. Data visualization for cluster analysis requires dimensionality reduction in the form of a projection from a high-dimensional space to 2D or 3D so that the dataset can be easily visualized. Cluster analysis is heuristic in nature.

In this section we outline the PCA and SVD approaches for dimensionality reduction in turn, continue with the distance refinement that can follow such analyses, and illustrate projection and clustering results with some examples.

15.4.1 Data Compression Based on Principal Component Analysis (PCA)

PCA transforms the input system (our database matrix A) into a smaller matrix described by a few uncorrelated variables called the principal components (PCs). These PCs are related to the eigenvectors of the covariance matrix defined by the component variables. The basic idea is to choose the orthogonal components so that the original data variance is well approximated. That is, the relations of similarity/dissimilarity among the compounds can be well approximated in the reduced description. This is done by performing eigenvalue analysis on the covariance matrix that describes the statistical relations among the descriptor variables.

Covariance Matrix and PCs

Let a ij be an element of our n ×m database matrix A. The covariance matrix C m ×m is formed by elements c jj′ where each entry is obtained from the sum
$${c}_{jj^\prime} = \frac{1} {n - 1}\sum\limits_{i=1}^{n}({a}_{ ij} - {\mu }_{j})\,({a}_{ij^\prime} - {\mu }_{j^\prime})\,.$$
Here μ j is the mean of the column associated with descriptor j:
$${\mu }_{j} = \frac{1} {n}\sum\limits_{i=1}^{n}{a}_{ ij}\,.$$
C is a symmetric positive semi-definite matrix and thus has the spectral decomposition
$$C = V \Sigma {V }^{T}\,,$$
where the superscript T denotes the matrix transpose, and the matrix V (m ×m) is the orthogonal eigenvector matrix satisfying VV T = I m ×m with m component vectors {v i }. The diagonal matrix Σ of dimension m contains the m ordered eigenvalues
$${\lambda }_{1} \geq {\lambda }_{2} \geq \cdots \geq {\lambda }_{m} \geq 0\,.$$
We then define the m PCs Yj for j = 1, 2, ⋯, m as the product of the original matrix A and the eigenvectors v j :
$$Y j = A{v}_{j}\,,\;\;\;\;\;\;\;\;\;\;j = 1,2,\cdots \,,m\,.$$
We also define the n ×m matrix Y corresponding to eq. (15.15), related to V, as the matrix that holds the m PCs Y 1, Y 2, ⋯, Ym as its columns; this allows us to write eq. (15.15) in the matrix form Y = AV. Since VV T = I, we then obtain an expression for the dataset matrix A in terms of the PCs:
$$A = Y {V }^{T}\,.$$

Dimensionality Reduction

The problem dimensionality can be reduced based on eq. (15.16). First note that eq. (15.16) can be written as:
$$A =\sum\limits_{j=1}^{m}Y j \cdot {v}_{ j}^{T}\,.$$
Second, note that Xi, the vector of compound i, is the transpose of the ith row vector of A:
$$Xi = {A}^{T}\,{e}_{ i}\,,$$
where e i is an n ×1 unit vector with 1 in the ith component and 0 elsewhere. Thus, compound Xi is expressed as the linear combination of the orthonormal set of eigenvectors {v j } of the covariance matrix C derived from A:
$$Xi =\sum\limits_{j=1}^{m}(Y {j}_{ i})\,{v}_{j}\,,\;\;\;i = 1,2,\cdots \,,n\,,$$
where Yj i is the ith component of the column vector Yj.
Based on eq. (15.19), the problem dimensionality m can be reduced by constructing a k-dimensional approximation to Xi, Xi k , in terms of the first k PCs:
$$X{i}^{k} =\sum\limits_{j=1}^{k}(Y {j}_{ i})\,{v}_{j}\,,\;\;\;i = 1,2,\cdots \,,n\,.$$
The index k of the approximation can be chosen according to a criterion involving the threshold variance γ, where
$$\left (\sum\limits_{i=1}^{k}{\lambda }_{ i}\right )/\left (\sum\limits_{i=1}^{m}{\lambda }_{ i}\right ) \geq \gamma \,.$$
The eigenvalues of C represent the variances of the PCs. Thus, the measure γ = 1 for k = m reflects a 100% variance representation. In practice, good approximations to the overall variance (e.g., γ > 0.7) can be obtained for \(k \ll m\) for large databases.

For such a suitably chosen k, the smaller database represented by components {Xi k } for i = 1, 2, ⋯, n approximates the variance of the original database A reasonably, making it valuable for cluster analysis.
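The procedure above can be sketched compactly with NumPy; the two-column synthetic dataset is an illustrative assumption, and for brevity the PCs are formed as Y = AV (eq. 15.15) without centering A itself.

```python
import numpy as np

def pca_reduce(A, gamma=0.7):
    """PCA sketch: eigendecompose the covariance matrix C of the columns of A
    and keep the first k PCs whose eigenvalues capture a fraction >= gamma
    of the total variance."""
    A = np.asarray(A, float)
    C = np.cov(A, rowvar=False)          # (m x m), 1/(n-1) normalization
    lam, V = np.linalg.eigh(C)           # eigh returns ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]       # reorder: lambda_1 >= ... >= lambda_m
    k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), gamma) + 1)
    Y = A @ V[:, :k]                     # the first k principal components
    return Y, k

# Synthetic dataset: second descriptor nearly proportional to the first,
# so one PC should capture almost all of the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
A = np.hstack([base, 2 * base + 0.01 * rng.normal(size=(200, 1))])
Y, k = pca_reduce(A, gamma=0.9)
print(k)  # 1
```

The strongly correlated columns collapse onto a single component, exactly the redundancy elimination described for chemical descriptors.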

As we show below, the singular value decomposition can be used to compute the factorization of the covariance matrix C when the ‘natural scaling’ of eq. (15.6) is used.

15.4.2 Data Compression Based on the Singular Value Decomposition (SVD)

SVD is a procedure for data compression used in many practical applications like image processing and cryptanalysis (code deciphering) [296, for example]. Essentially, it is a factorization for rectangular matrices that is a generalization of the eigenvalue decomposition for square matrices. Image processing techniques are common tools for managing large datasets, such as digital encyclopedias, or images transmitted to earth from space shuttles on limited-speed modems.

SVD defines two appropriate orthogonal coordinate systems for the domain and range of the mapping defined by a rectangular n ×m matrix A. This matrix maps a vector \(x \in {\mathcal{R}}^{m}\) to a vector \(y = Ax \in {\mathcal{R}}^{n}\). The SVD determines the orthonormal coordinate system of \({\mathcal{R}}^{n}\) (the columns of an n ×n matrix U) and the orthonormal coordinate system of \({\mathcal{R}}^{m}\) (the columns of an m ×m matrix V ) so that the mapping between them is diagonal.

The SVD is used routinely for storing computer-generated images. If a photograph is stored as a matrix where each entry corresponds to a pixel in the photo, fine resolution requires storage of a huge matrix. The SVD can factor this matrix and determine its best rank-k approximation. This approximation is computed not as an explicit matrix but rather as a sum of k outer products, each term of which requires the storage of two vectors, one of dimension n and another of dimension m (m + n storage for the pair). Hence, the total storage required for the image is reduced from nm to (m + n)k.

The SVD also provides the rank of A (the number of independent columns), thus specifying how the data may be stored more compactly via the best rank-k approximation. This reformulation can reduce the computational work required for evaluation of the distance function used for similarity or diversity sampling.

SVD Factorization

The SVD decomposes the real matrix A as:
$$A = U\Sigma {V }^{T},$$
where the matrices U (n ×n) and V (m ×m) are orthogonal, i.e., UU T = I n ×n and VV T = I m ×m . The matrix Σ (n ×m) contains at most m nonzero entries (σ i , i = 1, ⋯, m), known as the singular values, in the first m diagonal elements:
$$\Sigma = \left (\begin{array}{cccc} {\sigma }_{1} & & & \\ & {\sigma }_{2} & & \\ & & \ddots & \\ & & & {\sigma }_{m} \\ 0 & 0 & \cdots & 0\\ \vdots & & &\vdots \\ 0 & 0 & \cdots & 0 \end{array} \right ),$$
$${\sigma }_{1} \geq {\sigma }_{2} \geq \ldots \geq {\sigma }_{r}\ldots \geq {\sigma }_{m} \geq 0\,.$$
The columns of U, namely u 1, …, u n , are the left singular vectors; the columns of V, namely v 1, …, v m , are the right singular vectors. In addition, r = rank of A = number of nonzero singular values. Thus if \(r \ll m\), a rank-r approximation of A is natural. Otherwise, we can set k to be smaller than r by neglecting the singular values beyond a certain threshold.

Low-Rank Approximation

The rank-k approximation to A can be obtained by noting that A can be written as the sum of rank-1 matrices:
$$A =\sum\limits_{j=1}^{r}{\sigma }_{ j}\,{u}_{j}\,{v}_{j}^{T}\,.$$
The rank-k approximation, A k , is simply formed by extending the summation in eq. (15.24) from 1 to k instead of 1 to r. In practice, this means storing k left singular vectors and k right singular vectors. This matrix A k can also be written as
$${A}_{k} =\sum\limits_{j=1}^{k}{\sigma }_{ j}\,{u}_{j}\,{v}_{j}^{T}\, =\, U{\Sigma }_{ k}{V }^{T}$$
$${\Sigma }_{k} = \mbox{ diag }({\sigma }_{1},\ldots,{\sigma }_{k},0,\ldots,0)\,.$$
This matrix is closest to A in the sense that
$$\|A - {A}_{k}\| = {\sigma }_{k+1}$$
for the standard Euclidean norm.
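The rank-k construction and this error identity are easy to verify numerically with NumPy's SVD on a small random matrix (the matrix itself is arbitrary; only its shape matters):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 5))            # a small n x m matrix, n > m

U, s, Vt = np.linalg.svd(A)            # A = U diag(s) V^T, s sorted descending
k = 2
# Rank-k approximation: sum of the first k rank-1 outer products,
# equivalently (U_k * s_k) V_k^T.
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]

# The 2-norm error equals the first neglected singular value sigma_{k+1}.
err = np.linalg.norm(A - Ak, 2)
print(np.isclose(err, s[k]))  # True
```

Storing A_k requires only the k leading singular triplets, i.e., (n + m)k numbers instead of nm, as discussed above.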
Recall that we can express each Xi as:
$$\mbox{ Row }i\mbox{ of }(A) = {({A}^{T}\,{e}_{ i})}^{T}\,,$$
where e i is an n ×1 unit vector with 1 in the ith component and 0 elsewhere. Using the decomposition of eq. (15.24), we have:
$${A}^{T}\,{e}_{ i} =\sum\limits_{j=1}^{r}{\sigma }_{ j}\,{v}_{j}\,{u}_{j}^{T}\,{e}_{ i} =\sum\limits_{j=1}^{r}({\sigma }_{ j}\,{u}_{{j}_{i}})\,{v}_{j}.$$
The SVD transforms this row vector to [(A k ) T e i ] T , where:
$${({A}_{k})}^{T}\,{e}_{ i} =\sum\limits_{j=1}^{k}({\sigma }_{ j}\,{u}_{{j}_{i}})\,{v}_{j}.$$


This transformation can be used to project a vector onto the first k principal components. That is, the projection matrix \({P}_{k} =\sum\limits_{j=1}^{k}[{v}_{j}{v}_{j}^{T}]\) maps a vector from m to k dimensions. For example, for k = 2, we have:
$$\begin{array}{rcl}{ P}_{2}{A}^{T}\,{e}_{ i}& =& ({v}_{ 1}\,{v}_{1}^{T} + {v}_{ 2}\,{v}_{2}^{T})\sum\limits_{j=1}^{r}({\sigma }_{ j}\,{u}_{{j}_{i}})\,{v}_{j} \\ & =& ({\sigma }_{1}\,{u}_{{1}_{i}})\,{v}_{1} + ({\sigma }_{2}\,{u}_{{2}_{i}})\,{v}_{2}\,. \end{array}$$
Thus, this projection maps the m-dimensional row vector Xi onto the two-dimensional (2D) vector Yi with components \({\sigma }_{1}{u}_{{1}_{i}}\) and \({\sigma }_{2}{u}_{{2}_{i}}\). This mapping generalizes to a projection onto the k-dimensional space where km:
$$Y {i}^{k} = ({\sigma }_{ 1}\,{u}_{{1}_{i}}\,,{\sigma }_{2}\,{u}_{{2}_{i}}\,,\cdots \,,{\sigma }_{k}\,{u}_{{k}_{i}})\,.$$

15.4.3 Relation Between PCA and SVD

It can be shown that the eigenvectors {v i } of the covariance matrix (eq. (15.14)) coincide with the right singular vectors {v i } defined above when the second scaling (eq. (15.6)) is applied to the database matrix. Recall that this scaling makes all columns have zero mean and unit variance.

Moreover, the left SVD vectors {u i } can be related to the singular values {σ i } and PC vectors {Yi} of eq. (15.15) by
$${u}_{i} = A\,{v}_{i}/{\sigma }_{i} = Y i/{\sigma }_{i}\,.$$
Therefore, we can use the SVD factorization as defined above (eq. (15.22)) to compute the PCs {Yi} of the covariance matrix C. The SVD approach is more efficient since formulation of the covariance matrix is not required.
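This correspondence can be checked numerically: after the scaling of eq. (15.6), the right singular vectors of the scaled matrix agree, up to sign, with the eigenvectors of its covariance matrix. The random 100 × 4 dataset below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 4))

# Apply the 'natural scaling' of eq. (15.6): zero mean, unit std per column.
Ahat = (A - A.mean(axis=0)) / A.std(axis=0)

# Right singular vectors of the scaled matrix ...
_, _, Vt = np.linalg.svd(Ahat, full_matrices=False)

# ... versus eigenvectors of the covariance matrix of Ahat.
C = (Ahat.T @ Ahat) / (len(Ahat) - 1)
lam, V = np.linalg.eigh(C)
V = V[:, ::-1]                          # descending eigenvalue order

# Columns agree up to an overall sign per column.
agree = np.allclose(np.abs(Vt.T), np.abs(V), atol=1e-6)
print(agree)  # True
```

The SVD route never forms C explicitly, which is why it is preferred for the long, narrow matrices that arise here.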

The ARPACK package [728] can compute the first k PCs, saving significant storage. It requires \(\mathcal{O}(nk)\) memory and \(\mathcal{O}(n{m}^{2})\) floating-point operations.

15.4.4 Data Analysis via PCA or SVD and Distance Refinement

The SVD or the PCA projection is a first step in database visualization. The second step refines this projection so that the original Euclidean distances {δ ij } in the m-dimensional space are closely related to the corresponding distances {d ij } in the reduced, k-D space. Here,
$${\delta }_{ij} \equiv \vert \vert Xi - Xj\vert \vert $$
$${d}_{ij} \equiv \vert \vert Y i - Y j\vert \vert $$
for all i, j, where the vectors {Y i } are the k-D vectors produced by SVD defined by eq. (15.28).

Projection Refinement

This distance refinement is a common task in distance geometry refinement of NMR models. In the NMR context, a set of interatomic distances is given and the objective is to find the 3D coordinate vector (the molecular structure) that best fits the data. Since such a problem is typically overdetermined — there are \(\mathcal{O}({n}^{2})\) distances but only \(\mathcal{O}(n)\) Cartesian coordinates for a system of n atoms — an optimal approximate solution is sought.

For example, optimization work on evolutionary trees [1001] solved an identical mathematical problem in an unusual context that is closely related to the molecular similarity problem here. Specifically, the experimental distance-data in evolutionary studies reflect complex factors rather than simple spatial distances (e.g., interspecies data arise from immunological studies which compare the genetic material among taxa and assign similarity scores). Finding a 3D evolutionary tree by the distance-geometry approach, rather than the conventional 2D tree which conveys evolutionary linkages, helps identify subgroup similarities.

Distance Geometry

The distance-geometry problem in our evolutionary context can be formulated as follows. We are given a set of pairwise distances with associated lower and upper bounds:
$$\{{l}_{ij} \leq {\delta }_{ij} \leq {u}_{ij}\},\quad \mbox{ for}\quad i,j = 1,2,\ldots,n,$$
where each δ ij is a target interspecies distance with associated lower and upper bounds l ij and u ij , respectively, and n is the number of species. Our goal is to compute a 3D “tree” for those species based on the measured distance/similarity data.
This distance geometry problem can be reduced to finding a coordinate vector that minimizes the objective function
$$E(Y ) =\sum\limits_{i<j}{\omega }_{ij}\,{\left ({d}_{ij}^{2}(Y ) - {\delta }_{ ij}^{2}\right )}^{2}\,,$$
where d ij (Y ) is the Euclidean distance between points i and j computed from the vector Y, and the {ω ij } are appropriately chosen weights.

In the combinatorial chemistry context, we use the same function E(Y ), where Y is the vector of 2n components listing the 2D projections of each compound in turn. Details of this data clustering approach are described in [1399, 1402]. Minimization then produces low-dimensional coordinates that approximate the high-dimensional distance relationships.
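A minimal sketch of this refinement follows, using SciPy's truncated-Newton minimizer in place of TNPACK; the dataset and the weighting choice ω_ij = 1/δ_ij² are assumptions made for illustration only:

```python
# Sketch: refine a k-D projection Y by minimizing
#   E(Y) = sum_{i<j} w_ij (d_ij^2 - delta_ij^2)^2
# starting from the SVD projection.  TNPACK is not publicly wrapped here,
# so SciPy's truncated-Newton method ("TNC") stands in for it.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 8))        # 40 compounds, 8 descriptors (synthetic)
k = 2
delta = pdist(X)                        # condensed vector of original delta_ij
w = 1.0 / np.maximum(delta, 1e-8)**2    # an assumed weighting choice

def E(y):
    # Objective over the flattened projection vector (2n components for k = 2).
    d = pdist(y.reshape(-1, k))
    return np.sum(w * (d**2 - delta**2)**2)

# Start from the rank-k SVD projection and refine.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
y0 = (U[:, :k] * S[:k]).ravel()
res = minimize(E, y0, method="TNC")
assert res.fun <= E(y0) + 1e-9          # refinement should not worsen E
```

Any gradient-based minimizer can be substituted; the truncated-Newton choice simply mirrors the TNPACK approach cited above.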

Besides the value of the objective function (eq. (15.30)), a useful measure of the distance approximation in the low-dimensional space is the percentage of intercompound distances {i, j} (out of \(n(n - 1)/2\)) that are within a certain threshold of the original distances. We first define the deviations from the targets by a percentage η so that
$$\begin{array}{rcl} \vert d({Y }_{i},{Y }_{j}) - {\delta }_{ij}\vert \leq \eta \,{\delta }_{ij}\quad & \mbox{ when }& {\delta }_{ij} > {d}_{\mathrm{min}}\,, \\ d({Y }_{i},{Y }_{j}) \leq \tilde{\epsilon }\quad & \mbox{ when }& {\delta }_{ij} \leq {d}_{\mathrm{min}}\,,\end{array}$$
where \(\eta \), \(\tilde{\epsilon }\), and d min are given small positive numbers less than one. For example, η = 0.1 specifies a 10% accuracy; the other values may be set to small positive numbers such as \({d}_{\mathrm{min}} = {10}^{-12}\) and \(\tilde{\epsilon } = {10}^{-8}\). The second case above (very small original distance) may occur when two compounds in the dataset are highly similar.
With this definition, the total number T d of the distance segments d(Y i , Y j ) satisfying eq. (15.31) can be used to assess the degree of distance preservation of our mapping. We define the percentage ρ of the distance segments satisfying eq. (15.31) as
$$\rho = \frac{{T}_{d}} {n(n - 1)/2} \times 100\,.$$
The greater the ρ value (the maximum is 100), the better the mapping and the more information that can be inferred from the projected views of the database compounds.
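The accuracy measure ρ is straightforward to compute; the sketch below follows the two-case criterion and the tolerance values quoted in the text, applied to a synthetic dataset:

```python
# Sketch of the accuracy measure rho: the percentage of compound pairs whose
# projected distance is within a fraction eta of the original distance
# (with the special case for near-duplicate compounds).
import numpy as np
from scipy.spatial.distance import pdist

def rho(X, Y, eta=0.1, d_min=1e-12, eps=1e-8):
    delta = pdist(X)                  # original distances delta_ij (condensed)
    d = pdist(Y)                      # projected distances d_ij
    ok = np.where(delta > d_min,
                  np.abs(d - delta) <= eta * delta,  # |d - delta| <= eta*delta
                  d <= eps)                          # near-duplicate compounds
    return 100.0 * np.count_nonzero(ok) / delta.size  # T_d / (n(n-1)/2) * 100

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 6))
# The identity "projection" preserves every distance exactly.
assert rho(X, X) == 100.0
```

Truncating such a dataset to fewer coordinates would shrink distances and drive ρ below 100, which is the behavior the refinement stage is designed to counteract.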

This minimization procedure (projection refinement) is quite difficult for scaled datasets. Experiments with several chemical datasets of 58 to 27255 compounds show that the percentage ρ of distances within a threshold deviation η of 10% (eq. (15.31)) is in the range of 40% [1399, 1402]. Nonetheless, these low values can be brought close to 100% with projections onto 10-dimensional space. This is illustrated in Figure 15.4, which shows the percentage of distances satisfying eq. (15.31) for η = 0.1 as a function of the projection dimension for the database ARTF.

A similar improvement can be achieved with larger tolerances η (e.g., distances that are within 25% of the original values rather than 10%) [1399, 1402].
Fig. 15.4

Performance of the SVD and SVD/minimization protocols for the ARTF chemical database in terms of the percentage of distances satisfying eq. (15.31) for η = 0.1 (reflecting 10% distance deviations) as a function of the projection dimension [1399, 1402].

15.4.5 Projection, Refinement, and Clustering Example

As an illustration, consider the model database ARTF of 402 compounds and m = 312 descriptors containing eight chemical subgroups. We have analyzed this database by performing 2D and 3D projections based on the SVD factorization followed by minimization refinement by TNPACK [1121, 1122, 1397] for performance assessment in terms of accuracy as well as visual analysis of the compound interrelationships.

From Figure 15.4 we note that the refinement stage that follows the SVD projection is important for increasing the accuracy in every dimension. Namely, the accuracy is increased by 25–40% in this example.

The 2D and 3D projection patterns obtained for ARTF in Figure 15.5 show the utility of such a projection approach. The resemblance between the 2D and 3D views is evident, and the various 3D views offer different perspectives of the intercompound relationships.
Fig. 15.5

Two- and three-dimensional projections of the chemical database ARTF of 402 compounds composed of the eight chemical subgroups ecdysteroids (EC), estrogens (ES), D1 agonists (D+), D1 antagonists (D), H1 ligands (HL), DHFR inhibitors (DH), AchE inhibitors (AC), and 5HT ligands (HT) using the projection/refinement SVD/TNPACK approach [1399, 1402]. Three views are shown for the 3D projection. The accuracy of the 2D projection is about 46% and that of the 3D is 63% (with η = 0.1); see eq. (15.31). The 2D projection was obtained by refining the 3D projection. The nine chemical structures labeled in the projections are drawn in Figure 15.6.

We note that compounds within clusters corresponding to individual pharmacological subsets appear very close to one another, though partial overlap of clusters is evident. The ecdysteroid group forms a diverse but separate set of points. The estrogen class is also clustered and somewhat separate from the others. The strong overlap of the three clusters corresponding to D1 agonists, D1 antagonists, and H1 receptor ligands is reasonable given the relative chemical similarity of these compounds: all act at receptors of the same pharmacological class (i.e., G-protein coupled receptors). Thus, such data compression and visualization techniques can be used as a quick analysis tool for probing database structure.

The chemical structures in Figure 15.6 reveal that compounds that are nearer in the projection are more closely related than those that are distant; this is seen when compounds are compared both within the same subgroup and across different subgroups. For example, the two labeled estrogen representatives that are distant in the projection appear chemically quite different, while the three clustered H1 ligands appear similar to each other and perhaps to the nearby D1 agonist representative.

An example of a database projection in 2D by the alternative PCA approach followed by distance refinement is shown in Figures 15.7 and 15.8 for 832 compounds from the MDL Drug Data Report (MDDR) database using topological indices. (This work was performed in collaboration with Merck Research Laboratories.) The accuracy of this projection (the percentage of distances satisfying eq. (15.31) for η = 0.1) is only 0.2% after PCA and 24.8% after PCA/TNPACK. Figure 15.7 shows that compounds close in the projection appear similar, and Figure 15.8 shows that more distantly related compounds tend to be different. Without knowing the grouping of these compounds according to bioactivity, the clusters identified in Figure 15.8 suggest a ‘diversity subset’ consisting of a few members from each cluster.
Fig. 15.6

Selected chemical structures from the ARTF projection shown in Figure 15.5 reveal similarity of nearby structures and dissimilarity of distant compounds.

The approach described here appears promising, but further work is required to make the technique viable for very large databases.

15.5 Future Perspectives

Similarity and diversity sampling of combinatorial chemistry libraries is a field in its infancy. The choice of descriptors, as well as of the metrics used to define similarity and diversity, is empirical and perhaps application dependent. Thus, many challenges remain for future developments in the field, and the added involvement of mathematical scientists, along with new approaches borrowed from allied disciplines, might be fruitful.
Fig. 15.7

2D projection using PCA for 832 compounds in the MDDR database showing the similarity of four compound pairs that are near in the projection.

Fig. 15.8

2D projection using PCA for 832 compounds in the MDDR database showing the diversity of compounds that represent different clusters in the projection (distinguished by letters). A representative subset may thus consist of one or only a few members from each cluster.

Developments are needed for formulation of descriptor sets, rigorous mathematical frameworks for their analysis, and efficient algorithms for very large-scale problems based on statistics, cluster analysis, and optimization. The algorithmic challenge of manipulating large datasets might also explain the tendency toward smaller and focused libraries [555]; still, as argued in [621], this assumed defeat is premature!

The central assumption of structure/activity relationships of course remains a challenge to validate, develop, and further apply.

More broadly, structure-based drug design is likely to increase in importance as many more protein targets are identified and synthesized [1301], and as modeling programs improve in their ability to predict binding affinities of certain ligands (e.g., peptide-like) that share chemical groups with macromolecules, the focus of many biomodeling packages. The difficulty in determining membrane protein structures continues to be a limitation since membrane receptors are important pharmacological targets.

While perhaps not the dominant technique, it is clear that structure-based drug design will be an important component of drug modification and optimization after available leads have been generated. The search for the needle in the haystack (i.e., a successful drug) will likely be guided by the steady light generated by computer modeling. And, with additional genetic and genomic screening, disease treatment is likely to move forward to a new phase of greater scientific precision and success.


  1. L. Adams and J. L. Nazareth, editors. Linear and Nonlinear Conjugate Gradient-Related Methods. SIAM, Philadelphia, PA, 1996.
  5. D. K. Agrafiotis. Stochastic algorithms for maximizing molecular diversity. J. Chem. Inf. Comput. Sci., 37:841–851, 1997.
  6. D. K. Agrafiotis. Diversity of chemical libraries. In P. von Ragué Schleyer (Editor-in-Chief), N. L. Allinger, T. Clark, J. Gasteiger, P. A. Kollman, and H. F. Schaefer, III, editors, Encyclopedia of Computational Chemistry, volume 1, pages 742–761. John Wiley & Sons, West Sussex, England, 1998.
  7. D. K. Agrafiotis, V. S. Lobanov, and F. R. Salemme. Combinatorial informatics in the post-genomics era. Nat. Rev. Drug Disc., 1:337–346, 2002.
  8. D. K. Agrafiotis, J. C. Myslik, and F. R. Salemme. Advances in diversity profiling and combinatorial series design. Mol. Div., 4:1–22, 1999.
  41. L. M. Amzel. Structure-based drug design. Curr. Opin. Biotech., 9:366–369, 1998.
  151. S. Borman. Reducing time to drug discovery. Chem. Eng. News, 77:33–48, 1998.
  159. D. B. Boyd. Computer-aided molecular design. In A. Kent (Executive) and C. M. Hall (Administrative), editors, Encyclopedia of Library and Information Science, volume 59, pages 54–84. Marcel Dekker, New York, NY, 1997. Supplement 22.
  198. P. R. Caron, M. D. Mullican, R. D. Mashal, K. P. Wilson, M. S. Su, and M. A. Murcko. Chemogenomic approaches to drug discovery. Curr. Opin. Chem. Biol., 5:464–470, 2001.
  207. T. Caulfield and K. Burgess. Combinatorial chemistry. Focused diversity and diversity of focus. Curr. Opin. Chem. Biol., 5:241–242, 2001.
  236. C. H. Cho and M. E. Nuttall. Emerging techniques for the discovery and validation of therapeutic targets for skeletal diseases. Expert Opin. Ther. Targets, 6:679–689, 2002.
  254. N. C. Cohen, editor. Guidebook on Molecular Modeling in Drug Design. Academic Press, San Diego, CA, 1996.
  263. W. J. Cook, W. H. Cunningham, W. R. Pulleyblank, and A. Schrijver. Combinatorial Optimization. John Wiley & Sons, New York, NY, 1998.
  278. P. Csermely, V. Agoston, and S. Pongor. The efficiency of multi-target drugs: The network approach might help drug design. Trends in Pharm. Sci., 26:178–182, 2005.
  296. J. W. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, PA, 1997.
  322. S. L. Dixon and H. O. Villar. Investigation of classification methods for the prediction of activity in diverse chemical libraries. J. Comput.-Aided Mol. Design, 13:533–545, 1999.
  323. C. Djerassi. The Pill, Pygmy Chimps, and Degas’ Horse. The Remarkable Autobiography of the Award-Winning Scientist Who Synthesized the Birth Control Pill. Basic Books, New York, NY, 1992.
  333. (Structural Genomics Supplement).
  335. H. R. Drew, R. M. Wing, T. Takano, C. Broka, S. Tanaka, K. Itakura, and R. E. Dickerson. Structure of a B-DNA dodecamer: Conformation and dynamics. Proc. Natl. Acad. Sci. USA, 78:2179–2183, 1981.
  400. M. J. Field. A Practical Introduction to the Simulation of Molecular Systems. Cambridge University Press, Cambridge, UK, second edition, 2007.
  448. B. García-Archilla, J. M. Sanz-Serna, and R. D. Skeel. Long-time-step methods for oscillatory differential equations. SIAM J. Sci. Comput., 20:930–963, 1998.
  451. C. A. Gelfand, G. E. Plum, S. Mielewczyk, D. P. Remeta, and K. J. Breslauer. A quantitative method for evaluating the stabilities of nucleic acid complexes. Proc. Natl. Acad. Sci. USA, 96:6113–6118, 1999.
  453. A. K. Ghose, V. N. Viswanadhan, and J. J. Wendoloski. A knowledge-based approach in designing combinatorial or medicinal chemistry libraries for drug discovery. 1. A qualitative and quantitative characterization of known drug databases. J. Comb. Chem., 1:55–68, 1999.
  496. J. M. Haile. Molecular Dynamics Simulations: Elementary Methods. John Wiley & Sons, New York, NY, 1992.
  507. P. Hammarström, F. Schneider, and J. W. Kelly. Trans-suppression of misfolding in an amyloid disease. Science, 293:2459–2462, 2001.
  529. M. A. El Hassan and C. R. Calladine. Conformational characteristics of DNA: Empirical classifications and a hypothesis for the conformational behaviour of dinucleotide steps. Phil. Trans. Math. Phys. Engin. Sci., 355:43–100, 1997.
  544. D. K. Hendrix, S. E. Brenner, and S. R. Holbrook. RNA structural motifs: building blocks of a modular biomolecule. Q. Rev. Biophys., 38:221–243, 2005.
  555. R. W. Hockney and J. W. Eastwood. Computer Simulation Using Particles. McGraw-Hill, New York, NY, 1981.
  582. P. H. Hünenberger and J. A. McCammon. Effect of artificial periodicity in simulations of biomolecules under Ewald boundary conditions: A continuum electrostatics study. Biophys. Chem., 78:69–88, 1999.
  588. W. Im, D. Beglov, and B. Roux. Continuum solvation model: Computation of electrostatic forces from numerical solutions to the Poisson-Boltzmann equation. Comput. Phys. Comm., 111:59–75, 1998.
  589. W. Im, J. Chen, and C. L. Brooks, III. Peptide and protein folding and conformational equilibria: Theoretical treatment of electrostatics and hydrogen bonding with implicit solvent models. Adv. Protein Chem., 72:173–197, 2006.
  590. M. Ingelman-Sundberg. Pharmacogenomic biomarkers for prediction of severe adverse drug reactions. N. Engl. J. Med., 358:637–639, 2008.
  591. J. Inglese, D. S. Auld, A. Jadhav, R. L. Johnson, A. Simeonov, A. Yasgar, W. Zheng, and C. P. Austin. Quantitative high-throughput screening (qHTS): A titration-based approach that efficiently identifies biological activities in large chemical libraries. Proc. Natl. Acad. Sci. USA, 103:11473–11478, 2006.
  621. W. L. Jorgensen and J. Tirado-Rives. Monte Carlo vs. molecular dynamics for conformational sampling. J. Phys. Chem., 100:14508–14513, 1996.
  640. J. Khandogin, A. Hu, and D. M. York. Electronic structure properties of solvated biomolecules: A quantum approach for macromolecular characterization. J. Comput. Chem., 21:1562–1571, 2000.
  649. Y. C. Kim and G. Hummer. Coarse-grained models for simulations of multiprotein complexes: application to ubiquitin binding. J. Mol. Biol., 375:1416–1433, 2008.
  666. P. Koehl and M. Levitt. A brighter future for protein structure prediction. Nature Struc. Biol., 6:108–111, 1999.
  673. M. W. Konrad and J. I. Bolonick. Molecular dynamics simulation of DNA stretching is consistent with the tension observed for extension and strand separation and predicts a novel ladder structure. J. Amer. Chem. Soc., 118:10989–10994, 1996.
  677. N. Korolev, A. P. Lyubartsev, A. Laaksonen, and L. Nordenskiöld. On the competition between water, sodium ions, and spermine in binding to DNA: A molecular dynamics simulation study. Biophys. J., 82:2860–2875, 2002.
  692. C. Laing, S. Jung, A. Iqbal, and T. Schlick. Tertiary motifs revealed in analyses of higher-order RNA junctions. J. Mol. Biol., 393:67–82, 2009.
  709. T. Lazaridis and M. Karplus. “New view” of protein folding reconciled with the old through multiple unfolding simulations. Science, 278:1928–1931, 1997.
  727. J. H. Lee, M. D. Canny, A. De Erkenez, D. Krilleke, Y. S. Ng, D. T. Shima, A. Pardi, and F. Jucker. A therapeutic aptamer inhibits angiogenesis by specifically targeting the heparin binding domain of VEGF165. Proc. Natl. Acad. Sci. USA, 102:18902–18907, 2005.
  728. T.-S. Lee, D. M. York, and W. Yang. Linear-scaling semiempirical quantum calculations for macromolecules. J. Chem. Phys., 105:2744–2750, 1996.
  772. E. Lindahl, B. Hess, and D. van der Spoel. GROMACS 3.0: A package for molecular simulation and trajectory analysis. J. Mol. Model., 7:306–317, 2001.
  818. G. Maisuradze, A. Liwo, and H. Scheraga. Principal component analysis for protein folding dynamics. J. Mol. Biol., 385:312–329, 2009.
  913. L. Nilsson and M. Karplus. Empirical energy functions for energy minimization and dynamics of nucleic acids. J. Comput. Chem., 7:591–616, 1986.
  977. L. Pauling. The Nature of the Chemical Bond. Third edition, Cornell University Press, New York, NY, 1960.
  999. A. T. Phan, J.-L. Leroy, and M. Guéron. Determination of the residence time of water molecules hydrating B′-DNA and B-DNA, by one-dimensional zero-enhancement nuclear Overhauser effect spectroscopy. J. Mol. Biol., 286:505–519, 1999.
  1001. L. Piela, J. Kostrowicki, and H. A. Scheraga. The multiple-minima problem in conformational analysis of molecules. Deformation of the potential energy hypersurface by the diffusion equation method. J. Phys. Chem., 93:3339–3346, 1989.
  1006. R. M. Pitzer. The barrier to internal rotation in ethane. Acc. Chem. Res., 16:207–210, 1983.
  1009. R. H. A. Plasterk. RNA silencing: The genome’s immune system. Science, 296:1263–1265, 2002.
  1061. R. A. Robinson and R. H. Stokes. Electrolyte Solutions: The Measurement and Interpretation of Conductance, Chemical Potential and Diffusion in Solutions of Simple Electrolytes. Butterworth & Co., London, England, second edition, 1965.
  1093. B. Sandak. Multiscale fast summation of long-range charge and dipolar interactions. J. Comput. Chem., 22:717–731, 2001.
  1121. T. Schlick. Molecular-dynamics based approaches for enhanced sampling of long-time, large-scale conformational changes in biomolecules. F1000 Biol. Rep., 1:51, 2009.
  1122. T. Schlick. Monte Carlo, harmonic approximation, and coarse-graining approaches for enhanced sampling of biomolecular structure. F1000 Biol. Rep., 1:48, 2009.
  1174. D. E. Shaw, M. M. Deneroff, R. O. Dror, J. S. Kuskin, R. H. Larson, J. K. Salmon, C. Young, B. Batson, K. J. Bowers, J. C. Chao, M. P. Eastwood, J. Gagliardo, J. Grossman, C. R. Ho, D. J. Ierardi, I. Kolossváry, J. L. Klepeis, T. Layman, C. McLeavey, M. A. Moraes, R. Mueller, E. C. Priest, Y. Shan, J. Spengler, M. Theobald, B. Towles, and S. C. Wang. Anton: A special-purpose machine for molecular dynamics simulation. In Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 1–12, San Diego, CA, 2007. ACM.
  1179. M. M. Shi, D. Mehrens, and K. Dacus. Pharmacogenomics: Changing the health care paradigm. Mod. Drug Disc., 4:27–32, 2001.
  1241. S. J. Stuart, R. Zhou, and B. J. Berne. Molecular dynamics with multiple time scales: The selection of efficient reference system propagators. J. Chem. Phys., 105:1426–1436, 1996.
  1301. W. F. van Gunsteren and M. Karplus. Effect of constraints on the dynamics of macromolecules. Macromolecules, 15:1528–1543, 1982.
  1304. J. VandeVondele and U. Rothlisberger. Canonical adiabatic free energy sampling (CAFES): A novel method for the exploration of free energy surface. J. Phys. Chem. B, 106:203–208, 2002.
  1364. S. J. Weiner, P. A. Kollman, D. T. Nguyen, and D. A. Case. An all atom force field for simulations of proteins and nucleic acids. J. Comput. Chem., 7:230–252, 1986.
  1372. E. Westhof and L. Jaeger. RNA pseudoknots. Curr. Opin. Struct. Biol., 2:327–333, 1992.
  1375. J. H. White. An introduction to the geometry and topology of DNA structures. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, chapter 9. CRC Press, Boca Raton, Florida, 1989.
  1376. H. Wille, M. D. Michelitsch, V. Guénebaut, S. Supattapone, A. Serban, F. E. Cohen, D. A. Agard, and S. B. Prusiner. Structural studies of the scrapie prion protein by electron crystallography. Proc. Natl. Acad. Sci. USA, 99:3563–3568, 2002.
  1397. B. Wu, P. Dröge, and C. A. Davey. Site selectivity of platinum anticancer therapeutics. Nat. Chem. Biol., 4:110–112, 2008.
  1399. X. Wu and S. Wang. Enhancing systematic motion in molecular dynamics simulation. J. Chem. Phys., 110:9401–9410, 1999.
  1402. D. Xie and T. Schlick. Efficient implementation of the truncated Newton method for large-scale chemistry applications. SIAM J. Opt., 10(1):132–154, 1999.
  1447. Y. Zhang. Pseudobond Ab Initio QM/MM approach and its applications to enzyme reactions. Theor. Chem. Acc., 116:43–50, 2006.

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. 1.Courant Institute of Mathematical Sciences and Department of ChemistryNew York UniversityNew YorkUSA
