1.1 What Is Bioinformatics

Bioinformatics has become a hot research topic in recent years, not only in biology but also in several disciplines that were previously not so closely linked with it. One piece of side evidence is that the 2007 Graduate Summer School on Bioinformatics of China received more than 800 applications from graduate students all over the nation, from disciplines as diverse as biological sciences, mathematics and statistics, automation and electrical engineering, computer science and engineering, medical sciences, environmental sciences, and even social sciences. So what is bioinformatics?

It is always challenging to define a new term, especially one like bioinformatics that carries many meanings. As an emerging discipline, it covers a wide range of topics, from the storage of DNA data and the mathematical modeling of biological sequences, to the analysis of possible mechanisms behind complex human diseases, to the understanding and modeling of the evolutionary history of life.

Another term that often goes together with bioinformatics is computational molecular biology, joined in recent years by computational systems biology, with computational biology as a more general term. People sometimes use these terms to mean different things, but sometimes use them interchangeably. In our personal understanding, computational biology is a broad term covering all scientific investigations on or related to biology that involve mathematics and computation. Computational molecular biology, on the other hand, concentrates on the molecular aspects of biology within computational biology and therefore has more or less the same meaning as bioinformatics. Bioinformatics studies the storage, manipulation, and interpretation of biological data, especially data of nucleic acids and amino acids, and studies the molecular rules and systems that govern or affect the structure, function, and evolution of various forms of life through computational approaches. The word “computational” does not only mean “with computers”; it refers to data analysis with mathematical, statistical, and algorithmic methods, most of which need to be implemented as computer programs. As computational biology or bioinformatics studies biology with quantitative data, people also call it quantitative biology.

Most molecules do not work independently in living cells, and most biological functions are accomplished by the coordinated interaction of multiple molecules. In recent years, the new term systems biology came into being. Systems biology studies cells and organisms as systems of multiple molecules and their interactions with the environment. Bioinformatics plays key roles in analyzing such systems. People have invented the term computational systems biology, which, from a general viewpoint, can be seen as the branch of bioinformatics that focuses more on systems than on individual elements.

For a certain period, people regarded bioinformatics as the development of software tools that help to store, manipulate, and analyze biological data. While this is still an important role of bioinformatics, more and more scientists realize that bioinformatics can and should do more. As advances in modern biochemistry, biophysics, and biotechnology enable people to accumulate massive data on multiple aspects of biology at an exponential rate, scientists have begun to believe that bioinformatics and computational biology must play a key role in understanding biology.

People study bioinformatics in different ways. Some are devoted to developing new computational tools, from both software and hardware viewpoints, for better handling and processing of biological data. They develop new models and new algorithms for existing questions, and propose and tackle new questions when new experimental techniques bring in new data. Others take bioinformatics as the study of biology from the viewpoint of informatics and systems. These people also develop tools when needed, but they are more interested in understanding biological processes and mechanisms. They do not restrict their research to computational study, but try to integrate computational and experimental investigations.

1.2 Some Basic Biology

No matter what type of bioinformatics one is interested in, a basic understanding of existing knowledge of biology, especially molecular biology, is a must. This chapter was designed as the first course in the summer school, providing students with non-biology backgrounds a very basic and abstract understanding of molecular biology. It can also give biology students a sense of how biology is understood by researchers from other disciplines, which may help them to communicate better with bioinformaticians.

1.2.1 Scale and Time

Biology is the science of things that live in nature. There are many forms of life on the earth. Some forms are visible to the naked human eye, like animals and plants. Some can only be observed under light or electron microscopes, like many types of cells at the scale of 1–100 μm and some viruses at the scale of 100 nm. The basic components of these life forms are molecules of various types, at the scale of around 1–10 nm. Because of the difficulty of direct observation at such tiny scales, scientists have to invent various techniques that can measure some aspects of the molecules and cells. These techniques produce a large amount of data, from which biologists and bioinformaticians infer the complex mechanisms underlying various life processes.

Life has a long history. The earliest form of life appeared on the earth about 4 billion years ago, not long after the formation of the earth. Since then, life has gone through a long course of evolution to reach today’s variety and complexity. If the entire history of the earth is scaled to a 30-day month, the origin of life happened during days 3–4, but there has been abundant life only since day 27. Many higher organisms appeared in the last few days: the first land plants and the first land animals appeared on day 28, mammals began to exist on day 29, and birds and flowering plants came into being on the last day. Modern humans, named Homo sapiens in biology, appeared in the last 10 min of the last day. Recorded human history takes up only the last 30 s of the last day. The process by which life gradually changes into different and often more complex or higher forms is called evolution. When studying the biology of a particular organism, it is important to realize that it is one leaf or branch on the huge tree of evolution. Comparison between related species is a major approach when investigating the unknown.

1.2.2 Cells

The basic component of all organisms is the cell. Many organisms are unicellular, which means one cell itself is an organism. However, for higher species like animals and plants, an organism can contain thousands of billions of cells.

Cells are of two major types: prokaryotic cells and eukaryotic cells. Eukaryotic cells have a true nucleus, while prokaryotic cells do not. Living organisms are accordingly categorized into two major groups, prokaryotes and eukaryotes, according to whether their cells have a nucleus. Prokaryotes, which include bacteria and archaea, are the earlier forms of life on the earth. All higher organisms are eukaryotes, including unicellular organisms like yeasts as well as plants and animals. The bacterium E. coli is a widely studied prokaryote. Figure 1.1 shows the structure of an E. coli cell as a representative of prokaryotic cells.

Fig. 1.1 A prokaryotic cell

Eukaryotic cells have more complex structures, as shown in the example of a human plasma cell in Fig. 1.2. In eukaryotic cells, the key genetic material, DNA, resides in the nucleus in the form of chromatin or chromosomes. When a cell is not dividing, the nuclear DNA and proteins are aggregated as chromatin, which is dispersed throughout the nucleus. In a dividing cell, the chromatin is packed into dense bodies called chromosomes. A chromosome consists of two arms, the shorter P-arm and the longer Q-arm, separated by the centromere.

Fig. 1.2 A eukaryotic cell

1.2.3 DNA and Chromosome

DNA is short for deoxyribonucleic acid, the molecule that stores the major genetic information in cells. DNA is a chain of nucleotides. A nucleotide consists of three parts: a phosphate group, a pentose sugar (deoxyribose in DNA), and a base. The bases are of four types: adenine (A), guanine (G), cytosine (C), and thymine (T). A and G are purines with two fused rings; C and T are pyrimidines with a single ring. Besides DNA, there is another type of nucleic acid called RNA, or ribonucleic acid, whose sugar is ribose. In RNA the bases are also of four types, except that thymine (T) is replaced by uracil (U).

DNA usually consists of two strands running in opposite directions. The backbone of each strand is a series of pentose sugars and phosphate groups. Hydrogen bonds between purines and pyrimidines hold the two strands of DNA together, forming the famous double helix. In these hydrogen bonds, a base A always pairs with a base T on the other strand, and a G always pairs with a C. This mechanism is called base pairing. RNA is usually single stranded. When an RNA strand pairs with a DNA strand, the base-pairing rule becomes A-U, T-A, G-C, and C-G.

The pentose sugar is so named because it contains five carbons, numbered 1′–5′. The direction of a DNA or RNA strand is defined based on this numbering, so the two ends of a strand are called the 5′ end and the 3′ end. The series of bases along the strand is called the DNA or RNA sequence and can be viewed as a character string over the alphabet “A,” “C,” “G,” and “T” (“U” for RNA). We always read a sequence from the 5′ end to the 3′ end. In a DNA double helix, the two strands run in opposite directions. Figure 1.3 shows an example segment of a double-strand DNA sequence. Because of the DNA base-pairing rule, we only need to store one strand of the sequence.

Fig. 1.3 An example segment of a double-strand DNA sequence
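Because the two strands determine each other, recovering the complementary strand is a simple mechanical operation. The following minimal Python sketch (function and variable names are ours, for illustration only) computes the reverse complement of a strand read 5′ to 3′:

```python
# Reverse complement of a DNA strand written 5' to 3'.
# A pairs with T and G pairs with C; the complementary strand runs
# in the opposite direction, so we also reverse the order of bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(reverse_complement("ATGCGT"))  # -> ACGCAT
```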

DNA molecules have very complicated structures. A DNA molecule binds with histones to form a vast number of nucleosomes, which look like “beads” on a DNA “string.” Nucleosomes pack into a coil that twists into another larger coil and so forth, producing condensed, supercoiled chromatin fibers. The coils fold to form loops, which coil further to form a chromosome. The total length of the DNA in a single human cell is about 2 m, but with this complicated packing it fits into a nucleus with a diameter of around 5 μm.

1.2.4 The Central Dogma

The central dogma in genetics describes the typical mechanism by which the information stored in DNA sequences fulfills its job: information coded in a DNA sequence is passed on to a type of RNA called messenger RNA (mRNA), and information in the mRNA is then passed on to proteins. The former step is called transcription, and the latter step is called translation.

Transcription is governed by the rule of complementary base pairing between the DNA bases and the transcribed RNA bases. That is, an A in the DNA is transcribed to a U in the RNA, a T to an A, a G to a C, and a C to a G.

Proteins are chains of amino acids. There are 20 types of standard amino acids used in living organisms. The procedure of translation converts the information from the language of nucleotides to the language of amino acids. The translation follows a special dictionary: the genetic code, organized in codons. Figure 1.4 shows the codon table. Every three nucleotides code for one particular amino acid; the three nucleotides are called a triplet or codon. Because three nucleotides can encode 4 × 4 × 4 = 64 unique combinations, there is redundancy in this coding scheme, as shown in Fig. 1.4. Many amino acids are coded by more than one codon. For the redundant codons, usually the first and second nucleotides are consistent, but some variation in the third nucleotide is tolerated. AUG is the start codon that starts a protein sequence, and there are three stop codons, UAA, UAG, and UGA, that terminate the sequence.

Fig. 1.4 The genetic codes
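To make the dictionary metaphor concrete, here is a minimal Python sketch of translation. Only a handful of codon-table entries from Fig. 1.4 are filled in, and the helper names are ours, for illustration only:

```python
# Translate an mRNA fragment codon by codon, starting at the AUG
# start codon and stopping at a stop codon (UAA, UAG, or UGA).
# Only a few entries of the full codon table are included here.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def translate(mrna):
    start = mrna.find("AUG")
    if start == -1:                 # no start codon found
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):        # read triplets
        aa = CODON_TABLE.get(mrna[i:i + 3], "Xaa")  # Xaa = not in our mini table
        if aa == "STOP":
            break
        protein.append(aa)
    return "-".join(protein)

print(translate("GCAUGUUUGGCAAAUAAGC"))  # -> Met-Phe-Gly-Lys
```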

Figure 1.5a illustrates the central dogma in prokaryotes. First, the DNA double helix is opened, and one strand is used as a template to transcribe the mRNA. The mRNA is then translated into protein in the ribosome with the help of tRNAs.

Fig. 1.5 The central dogma

Figure 1.5b illustrates the central dogma in eukaryotes. There are several differences from the prokaryotic case. In eukaryotic cells, DNA resides in the nucleus, where it is transcribed to mRNA in a way similar to the prokaryotic case. However, this transcript is only the preliminary form of messenger RNA, or pre-mRNA. The pre-mRNA is processed in several steps: some parts are removed (a step called splicing), and a tail of 150–200 As (the poly-A tail) is added at one end. The processed mRNA is exported out of the nucleus to the cytoplasm, where it is translated into protein.

The process by which genes are transcribed into mRNAs, which are then translated into proteins, is called the expression of genes. The abundance of the mRNA molecules of a gene is usually called the expression value (level) of that gene, or simply the expression of the gene.

1.2.5 Genes and the Genome

We believe that the Chinese translation “基因” of the term “gene” is one of the best-translated scientific terms ever. Besides the fact that its pronunciation is very close to the English word, the literal meaning of the two characters is also very close to the definition of the term: basic elements. Genes are the basic genetic elements that, together with interactions with the environment, determine the phenotypes.

Armed with knowledge of the central dogma and the genetic code, people had long taken a gene to be a fragment of the DNA sequence that finally produces some protein product. This is still true in many contexts today. More strictly, such DNA segments should be called protein-coding genes, as scientists have found that there are many other parts of the genome that are not involved in protein products but also play important genetic roles. Some people call them nonprotein-coding genes, or noncoding genes for short. One important type of noncoding gene is the so-called microRNAs or miRNAs. There are several other known types of noncoding genes, and there may be more unknown ones. In most current literature, people still use “gene” to refer to protein-coding genes and add attributes like “noncoding” or “miRNA” when referring to other types of genes. We also follow this convention in this chapter.

The length of a DNA segment is often counted by the number of nucleotides (nt) in the segment. Because DNA usually stays as a double helix, we can also use the number of base pairs (bp) as the measure of length. For convenience, people usually use “k” to represent “1,000”: for example, 10 kb means the sequence is 10,000 bp long. A protein-coding gene stretches from several hundred bp to several kb of the DNA sequence. Figure 1.6 shows an example of the structure of a gene in higher eukaryotes.

Fig. 1.6 The structure of a gene

The site on the DNA sequence where transcription of a gene starts is called the transcription start site or TSS. The sequences around (especially upstream of) the TSS contain several elements that play important roles in the regulation of transcription. These elements are called cis-elements. Transcription factors bind to such elements to start, enhance, or repress the transcription process. Therefore, sequences upstream of the TSS are called promoters. The promoter is a loosely defined concept, and it can be divided into three parts: (1) a core promoter of about 100 bp around the TSS containing binding sites for RNA polymerase II (Pol II) and general transcription factors, (2) a proximal promoter of several hundred base pairs containing primary specific regulatory elements located immediately upstream of the core promoter, and (3) a distal promoter up to thousands of base pairs long providing additional regulatory information.

In eukaryotes, the preliminary transcript of a gene undergoes a processing step called splicing, during which some parts are cut out and the remaining parts are joined. The retained parts are called exons, and the removed parts are called introns. There can be multiple exons and introns in a gene. After the introns are removed, the exons are connected to form the processed mRNA. Only processed mRNAs are exported to the cytoplasm, and only part of each mRNA is translated into protein. There may be untranslated regions (UTRs) at both ends of the mRNA: the one at the TSS end is called the 5′-UTR, and the one at the tail end is called the 3′-UTR. The parts of the exons that are translated are called the coding DNA sequence or CDS. Usually exons constitute only a small part of the sequence of a gene.

In higher eukaryotes, a single gene can have more than one exon-intron arrangement. Such genes have multiple forms of protein products (called isoforms). One isoform may contain only some of the exons, and the boundaries of some exons may also differ among isoforms. This phenomenon is called alternative splicing. It is an important mechanism for increasing the diversity of protein products without increasing the number of genes.

The term “genome” literally means the set of all genes of an organism. For prokaryotes and some lower eukaryotes, the majority of the genome is composed of protein-coding genes. However, as more knowledge about genes and DNA sequences in human and other higher eukaryotes became available, people learned that protein-coding genes take up only a small proportion of all the DNA sequences in the eukaryotic genome. Now people tend to use “genome” to refer to all the DNA sequences of an organism or a cell. (The genomes of most cell types in an organism are the same.)

The human genome is arranged in 24 distinct chromosomes, with a total length of about 3 billion base pairs (3 × 10⁹ bp). There are 22 autosomes (Chr. 1–22) and 2 sex chromosomes (X and Y). The 22 autosomes are ordered by their lengths (with the exception that Chr. 21 is slightly shorter than Chr. 22): Chr. 1 is the longest chromosome and Chr. 21 is the shortest autosome. A normal human somatic cell contains 23 pairs of chromosomes: two copies of chromosomes 1–22, plus two copies of the X chromosome in females or one X and one Y in males. The largest human chromosome (Chr. 1) has about 250 million bp, and the Y chromosome has about 50 million bp.

There are about 20,000–25,000 protein-coding genes in the human genome, spanning about one third of the genome. The average human gene consists of some 3,000 base pairs, but sizes vary greatly, from several hundred to several million base pairs. The protein-coding parts take up only about 1.5–2 % of the whole genome. Besides these regions, there are regulatory sequences such as promoters, intronic sequences, and intergenic (between-gene) regions. Recent high-throughput transcriptomic studies (transcriptomics is the study of all RNA transcripts) revealed that more than half of the human genome is transcribed, although only a very small part of the transcripts is processed into mature mRNAs. Among the transcripts are the well-known microRNAs and several other types of noncoding RNAs. The functional roles of the majority of the transcripts are still largely unknown. There are many repetitive sequences in the genome, and they have not been observed to have direct functions.

Humans are regarded as the most advanced form of life on the earth, but the human genome is not the largest. Bacteria like E. coli have genomes of several million bp, yeast has about 15 million bp, Drosophila (the fruit fly) has a genome of roughly 180 million bp, and some plants can have genomes as large as 100 billion bp. The number of genes in a genome is also not directly correlated with the complexity of the organism. The unicellular organism yeast has about 6,000 genes, the fruit fly has about 15,000 genes, and the rice that we eat every day has about 40,000 genes. In lower species, protein-coding genes are more densely distributed on the genome. The human genome also has a much greater portion (about 50 %) of repeat sequences than lower organisms like the worm (7 %) and the fly (3 %).

1.2.6 Measurements Along the Central Dogma

For many years, molecular biology could only study one or a small number of objects (genes, mRNAs, or proteins) at a time. This picture changed with the development of a series of high-throughput technologies, so called because they can measure thousands of objects in one experiment in a short time. The emergence of massive genomic and proteomic data generated with these high-throughput technologies was a major motivation that promoted the birth and development of bioinformatics as a scientific discipline. In some sense, what bioinformatics does is to manipulate and analyze massive biological data and to aid scientific reasoning based on such data. It is therefore crucial to have a basic understanding of how the data are generated and what the data are for.

1.2.7 DNA Sequencing

The sequencing reaction is a key technique that enabled the completion of the human genome sequence. Figure 1.7 illustrates the principle of the widely used Sanger sequencing technique.

Fig. 1.7 Sequencing reaction

The technique is based on the complementary base-pairing property of DNA. When a single-strand DNA fragment is isolated and placed with primers, DNA polymerase, and the four types of deoxyribonucleoside triphosphates (dNTPs), a new DNA strand complementary to the existing one will be synthesized. In the DNA sequencing reaction, dideoxyribonucleoside triphosphates (ddNTPs) are added in addition to the above components, and the four types of ddNTPs are labeled with four different fluorescent dyes. The synthesis of a new strand stops when a ddNTP instead of a dNTP is incorporated. Therefore, with abundant template single-strand DNA fragments, we obtain a set of complementary DNA segments of all different lengths, each one terminated by a colored ddNTP. Under electrophoresis, these segments of different lengths run at different speeds, with the shortest segments running fastest and the longest running slowest. By scanning the color of all segments ordered by their length, we can read the nucleotide at each position of the complementary sequence and therefore read the original template sequence. This technique is implemented in the first generation of sequencing machines.

The sequencing reaction can only measure sequence fragments of up to ∼800 nt (it is very difficult to separate larger DNA fragments that differ by only one nucleotide in length by current capillary electrophoresis). For the whole human genome, scientists first mark the long genome with DNA sequence tags whose genomic positions can be uniquely identified and cut the DNA into large fragments (∼million bp). These fragments are still too long for the sequencing machine, so scientists invented the so-called shotgun strategy to sequence them. The DNA is randomly broken into shorter fragments of 500–800 bp, which can be sequenced by the sequencing machine to obtain reads. Multiple overlapping reads are gathered by several rounds of fragmentation followed by sequencing. Computer programs then piece together those overlapping reads, aligning and merging them into the original larger sequences. Efficient sequence assembly has raised many challenges for bioinformatics as well as for computing power, and the completion of the human genome project would have been impossible without the help of powerful bioinformatics tools.
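As an intuition for the assembly step, the toy sketch below greedily merges the pair of reads with the longest suffix-prefix overlap. Real assemblers must additionally handle sequencing errors, repeats, and both strands, so this is an illustration only (all names and numbers are ours):

```python
# Toy greedy assembly: repeatedly merge the two reads with the
# longest suffix-prefix overlap until no sufficient overlap remains.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:                       # no overlaps left to merge
            break
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))  # -> ['ATGGCGTACCTTA']
```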

In the last 2 years, a new generation of sequencing techniques has emerged, called second-generation sequencing or deep sequencing. The new technology can read huge amounts of shorter DNA sequences at much higher efficiency. Figure 1.8 shows the basic concept of such sequencing methods. The DNA is first cut into short fragments and ligated with adaptor sequences. Next, in vitro amplification is performed to generate an array of millions of PCR colonies or “polonies”; each polony is physically isolated from the others and contains many copies of a single DNA fragment. Then, with the sequencing-by-synthesis method, serial extension of primed templates is performed, and the fluorescent labels incorporated at each extension are captured by a high-resolution image-based detection system. The nucleotide incorporated (complementary to the template DNA fragment) in each polony at each cycle is called by processing the serial image data. Compared with the Sanger sequencing technology, the deep sequencing method is highly parallel and can produce gigabases of sequence data in a single day. Up to the end of 2007, there were mainly three deep sequencing platforms: 454, Solexa, and SOLiD. The 454 system, which is based on the pyrosequencing technology, can produce about 100 Mb of sequence in a single instrument run with read lengths of up to 200–400 nt. Solexa and SOLiD can produce about one to two Gb of sequence in a single run, but with read lengths of only up to about 36 nt. The key advantage of the 454 system is its longer read length, and its applications focus on de novo genome sequencing (sequencing an unknown genome). In contrast, the Solexa and SOLiD platforms are mainly used for genome resequencing (like SNP detection), transcriptomic analysis, ChIP-seq analysis, etc. With the rapid development of new technologies, sequencing personal genomes is becoming a realistic goal. In October 2006, the X Prize Foundation established the Archon X Prize, which intends to award $10 million to “the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98 % of the genome, and at a recurring cost of no more than $10,000 (US) per genome.”

Fig. 1.8 Flowchart of the second-generation sequencing (This picture is derived from [1])

1.2.8 Transcriptomics and DNA Microarrays

The genome can be viewed as the original blueprint of the cell. When a gene is to take effect, it is transcribed into mRNAs as the acting copy, according to which proteins are made. This procedure is called the expression of a gene. When more proteins of some type are needed, more mRNAs will be made. Therefore, the abundance of the mRNA in the cell indicates the expression level of the corresponding gene. Sometimes people call it the expression of the gene for simplicity.

The genomic information in all or most cells of an organism is the same, but genes are expressed differently at different developmental stages and in different tissues. It is estimated that only about one third of all genes are expressed at a given time in a certain tissue. Genes that perform basic functions in cells and are therefore expressed in all tissues are called housekeeping genes. On the other hand, many genes show distinctive tissue-specific expression patterns: they may be expressed highly in one type of cell but not in others. Different cell types in a multicellular organism express different sets of genes at different times and in different quantities. Basic cellular processes are realized by tightly regulated gene expression programs.

Therefore, it is important to study the expression profiles of the whole repertoire of genes. The set of all transcripts is called the transcriptome, and its study is called transcriptomics. The DNA microarray is a key high-throughput technique in transcriptomic investigations. It can simultaneously measure the abundance of the mRNAs of thousands or more genes. As mRNAs often degrade rapidly, usually complementary DNAs (cDNAs) reverse transcribed from the mRNAs are used in the measurement.

The basic principle of microarrays is also the complementary base-pairing hybridization of DNA. Pieces of different DNA fragments (called probes) are placed on a small chip. The probes are designed so that they represent individual genes. When the sample’s cDNAs are applied to the chip, they hybridize with the probes whose sequences are complementary to the cDNA sequences, and cDNAs that do not hybridize to any probe are washed off. With proper fluorescent labeling of the cDNAs, their abundances can be “read” from the fluorescence intensities at each probe location. These readings measure the expression levels of the genes represented by the probes.

There are two different types of DNA microarrays: the printed cDNA microarray (cDNA microarray for short) and the oligonucleotide microarray. The major difference lies in the way the probes are prepared. Figure 1.9 illustrates the basic principles of the two types of methods.

Fig. 1.9 DNA microarrays. (a) Printed cDNA microarray. (b) Oligonucleotide microarray

In cDNA microarrays, probes are relatively long segments of genes made from cloned cDNA libraries. They are spotted on the chip with techniques similar to inkjet printing. Different labs can prepare their own probes according to the genes they want to study. However, this flexibility also brings the disadvantage that the quantity of each probe can hardly be controlled precisely, so data reproducibility and comparison of data between labs can be a problem. To tackle this problem, usually two samples of identical amount, labeled with different fluorescent dyes, are applied to the chip. If a gene is expressed at different abundances in the two samples, the two dyes will show different intensities as the result of competitive hybridization, and the ratio of the two intensities will reflect the ratio of the gene’s expression in the two samples. This strategy can partially eliminate the effect of possible variation in probe quantity. The two samples can be one patient and one control. When studying multiple samples, each of them can be compared with a common control or a matched control. Different experimental designs have different characteristics and can have different implications for the bioinformatics processing of the data.

Oligonucleotide microarrays use much shorter probes (about 25 nt), and multiple probes may be used for a single gene to increase specificity. Taking the Affymetrix GeneChip as an example, the oligonucleotide probes are grown on the chip by light-directed oligonucleotide synthesis, so the quantity of the probes can be precisely controlled. Usually only one sample is applied to the chip, and the reading is the expression level of each gene in the sample rather than the ratio of two expressions. These chips have higher density than cDNA arrays because the probes are much shorter; the latest Affymetrix expression microarray contains probes for all known human genes. Data obtained in two labs with the same system can be compared more reliably. The disadvantage of oligonucleotide microarrays is that the chips are factory made and therefore less flexible: individual labs cannot design their own chips, and ordering customized chips can be much more expensive. The data quality of oligonucleotide microarrays is generally regarded as better than that of cDNA microarrays.

No matter what type of microarray is used, the original data take the form of scanned images. From reading the intensities on the images to obtaining the expression of genes, there are many problems in which bioinformatics plays major roles. After obtaining the expression of multiple genes in different samples, bioinformatics then becomes the major player in analyzing the data.

A typical microarray-based study compares the expression of many genes between two groups of samples, or across a certain time course. For example, in studying the molecular features of two subtypes of the same cancer, patient samples of the two subtypes are collected and microarrays are used to measure gene expression. A vector of the expression of all genes is obtained for each sample, and a gene expression matrix is obtained over all the samples, with columns representing samples and rows representing genes. A typical bioinformatics task is to identify the genes underlying the distinction between the two subtypes of the cancer.
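As a sketch of how such a matrix is analyzed, the following Python example builds a small made-up expression matrix (genes in rows, samples in columns) and applies a per-gene two-sample t-test, one of many possible choices for finding genes that differ between the two groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy expression matrix: 5 genes (rows) x 8 samples (columns).
# Columns 0-3 are subtype A, columns 4-7 are subtype B (made-up data).
expr = rng.normal(loc=5.0, scale=1.0, size=(5, 8))
expr[2, 4:] += 3.0          # make gene 2 clearly higher in subtype B

t_stat, p_val = stats.ttest_ind(expr[:, :4], expr[:, 4:], axis=1)
for g, p in enumerate(p_val):
    print(f"gene {g}: p = {p:.3g}")
```

In real studies, multiple-testing correction and methods tailored to microarray noise are also needed, as discussed in the statistics examples later in this chapter.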

Only a small proportion of the genome contains protein-coding genes, but many other parts of the genome are also transcribed. The microarrays introduced above are now often called gene expression microarrays. Many other types of microarrays have emerged in recent years, following the same general principle. For example, microRNAs (miRNAs) are a type of small noncoding RNA that plays important regulatory roles in cells. By using probes for those microRNAs, microarrays can be designed to measure the expression of microRNAs in a sample. As the density of microarray chips increases, scientists have developed so-called tiling arrays that have probes tiling the whole genome at high resolution. With such tiling arrays, we can measure the abundance of all transcribed parts of the genome, whether they are known protein-coding genes, microRNAs, or previously unknown transcripts. It is with such techniques that scientists found that most of the human genome is transcribed. However, the noise level of high-density microarrays is still very high, which poses further challenges to bioinformatics in processing the data.

In the early days, transcriptome study often meant the study of the expression of all genes at the mRNA level, similar to “expression profiling.” However, with more and more noncoding transcripts discovered, the term is now closer to its literal meaning: the study of all or many transcripts.

It should be noted that with the development of second-generation sequencing, microarrays are no longer the only choice for transcriptomic study. Deep sequencing can be applied to count the fragments of cDNAs, so that the expression of RNAs can be measured digitally.

1.2.9 Proteomics and Mass Spectrometry

For protein-coding genes, mRNAs are only an intermediate product. Proteins are even more complicated than DNA and RNA. For many genes, the abundance of the protein products is not linearly correlated with the expression at the mRNA level. Therefore, studying the expression of proteins is important for understanding the molecular system of cells. Proteomics is the study of all or many proteins. Due to mechanisms like alternative splicing and posttranslational modification, the diversity of proteins is much larger than that of genes. There is not even a common understanding of the approximate number of protein types in human; it may be several orders of magnitude larger than the number of genes.

Two key features for identifying proteins are their electrical charge and their molecular mass. Scientists have developed techniques to separate protein mixtures according to these factors. A representative technique is 2D gel electrophoresis (2D gel for short), which separates protein mixtures first by isoelectric focusing (IEF) and then by mass.

Mass spectrometry is a technique widely used in proteomics studies. The key principle is time-of-flight mass spectrometry (TOF-MS): ionized proteins are placed on a surface or matrix, and an electric field is applied. The charged proteins or peptides (protein segments) fly in the electric field and reach a detector. The time of flight before reaching the detector depends on the mass-to-charge ratio of the molecule, and the strength of the signal at the detector is proportional to the accumulation of molecules, which reflects the quantity of the protein. Figure 1.10 illustrates the basic principle of TOF-MS.

Fig. 1.10 TOF-MS
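The quantitative relation behind TOF-MS, although not spelled out above, is standard physics: an ion of mass m and charge z accelerated through a potential V along a flight tube of length L arrives after a time

t = L \sqrt{\frac{m}{2zeV}}, \qquad \text{equivalently} \qquad \frac{m}{z} = \frac{2eVt^{2}}{L^{2}},

so heavier or less charged ions arrive later, which is why a mass spectrum is indexed by the mass-to-charge ratio.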

Mass spectrometry has three typical types of applications in proteomics studies. The first is to identify proteins or peptides in a mixture. This is rather straightforward: on the mass spectrum of a mixture, a peak corresponds to the abundance of a certain protein or peptide. By searching protein databases for proteins with the same or very close molecular weight as the peak location, the protein can be identified if it has been reported before.

The second type of application is de novo amino acid sequencing. A protein segment is broken into all possible pieces before being applied to the MS machine. On the mass spectrum, multiple peaks can be detected, corresponding to peptide segments of different lengths. Different amino acid sequence segments result in different molecular masses and hence peaks at different locations. Therefore, from all the peaks, it is theoretically possible to resolve the sequence. However, this is a combinatorial problem and a challenging task for bioinformatics algorithms.

Usually tandem mass spectrometry (or MS/MS) is adopted in such applications. Tandem mass spectrometry is two (or more) rounds of mass spectrometry. For example, the first round can isolate one peptide from the protein mixture, and the second round is used to resolve the sequence.

Another typical application of mass spectrometry is to study the expression of multiple proteins in samples, much as microarrays are used for measuring mRNA abundances. The mass spectrum of each sample provides an expression profile of the proteins in the sample. By aligning the peaks between different samples, we can detect proteins that are differentially expressed between groups of samples, and we can also study the different patterns of multiple peaks between the compared samples.

1.2.10 ChIP-Chip and ChIP-Seq

Chromatin immunoprecipitation, or ChIP, is an experimental technique that can capture the DNA segments bound by particular proteins. The ChIP-chip technique combines ChIP with DNA microarrays (especially tiling arrays) to detect the DNA regions bound by specific proteins such as transcription factors. This technique is widely used in the study of transcription factor binding sites and histone modification states. ChIP-sequencing, or ChIP-seq, is a more recently developed technology that shares the same idea but replaces the microarrays with deep sequencing. Figure 1.11 illustrates the basic principle of the ChIP-seq method. In the first step, the DNA-binding proteins (transcription factors or histones) are cross-linked in vivo to the DNA sites they bind. Then the total DNA is extracted and cut into fragments of hundreds of base pairs in length by sonication or a nuclease. Next, the DNA-protein complexes are selected using an antibody specific to the protein of interest, and the DNA fragments not bound by this protein are washed away. In the next step, the DNA-protein complexes are reverse cross-linked and the remaining DNA is purified. After size selection, all the resulting ChIP-DNA fragments are sequenced simultaneously using deep sequencing machines like the Solexa Genome Analyzer, which can generate millions of short reads of about 36 bp in a single sequencing run. Finally, the sequenced reads are mapped to the genome by efficient read-mapping programs like ELAND, RMAP, or ZOOM, and the genomic loci with high read density are identified as putative binding sites of the protein of interest.

Fig. 1.11 Summary of ChIP-seq method (This figure is adapted from [2])
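The last step, finding regions of high read density, can be sketched with a toy Python peak caller that simply counts read start positions in fixed windows. Real peak-calling programs model the background and the fragment shift; the window size, cutoff, and positions below are arbitrary, for illustration only:

```python
# Toy peak detection: count mapped read start positions per fixed-size
# window along one chromosome and report windows above a read-count cutoff.
from collections import Counter

def call_peaks(read_starts, window=200, min_reads=10):
    counts = Counter(pos // window for pos in read_starts)
    return [(w * window, (w + 1) * window, n)
            for w, n in sorted(counts.items()) if n >= min_reads]

# read_starts would come from mapping reads to the genome (e.g. ELAND output)
read_starts = [1005, 1010, 1020] * 5 + [5000, 9000]   # made-up positions
print(call_peaks(read_starts))   # -> [(1000, 1200, 15)]
```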

1.3 Example Topics of Bioinformatics

Having gone through this quick review of warm-up biological knowledge, far from adequate for building a solid background, we now move on to some typical examples of bioinformatics research. Some of the topics will be covered in depth by the following chapters. The purpose of this section is to give readers an overall feeling for what types of questions can be, and should be, answered by bioinformatics and computational biology.

1.3.1 Examples of Algorithmic Topics

The birth and growth of bioinformatics as a discipline was accompanied by the generation and accumulation of data in molecular biology. When data size increases, even a simple task of manipulating the data may become non-straightforward, and special algorithms are needed. An "algorithm" is a step-by-step computational procedure for solving a problem.

For example, when we have many DNA sequences, storing them in computer databases seems trivial. However, when we obtain a new sequence segment (called a query) and ask whether this sequence has already been deposited in the database, the task becomes less trivial, or even challenging, when the database is huge. When we are looking not only for sequence segments exactly identical to the query but also for sequences similar to the query, it becomes even more difficult. This is what sequence analysis is about.

The sequence database query problem can be boiled down to finding the best local alignment between two sequences, or two strings in computer science jargon. Figure 1.12 presents a very simple illustration of the problem. It was a breakthrough when the dynamic programming approach for such problems was proposed by Temple Smith and Michael Waterman in 1981, although at that time only a few people realized how important the work was. Sequence alignment is a very basic problem in bioinformatics. The question has many variations, and it is the foundation for many other topics.

Fig. 1.12 An example of pairwise sequence alignment
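To make the dynamic programming idea concrete, the following Python sketch computes the Smith-Waterman local alignment score with an arbitrary simple scoring scheme (real tools also perform traceback to report the aligned segments):

```python
# Smith-Waterman local alignment score (no traceback) with a simple
# match/mismatch/gap scoring scheme; the scores here are arbitrary choices.
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-2):
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]   # DP matrix, floored at zero
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if s[i-1] == t[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("ACACACTA", "AGCACACA"))
```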

From the brief introduction of the shotgun sequencing method, we can see that assembling a long sequence from many short reads is a challenging task for algorithms; it is like solving a huge jigsaw puzzle. With the availability of massive deep sequencing data, a related problem is how to efficiently map the short sequence reads back to the genome.

Multiple sequence alignment brings another dimension of complexity to the problem. Comparative genomics is based on multiple sequence alignment. The genomes of multiple organisms can be compared to infer the evolutionary history of the species. Building the phylogenetic tree is an important challenge for algorithms.

Besides sequence-related problems, there are also many other types of algorithmic problems in bioinformatics, such as finding hidden patterns in a noisy microarray matrix, inferring amino acid sequences from possible combinations, and analyzing biological network graphs.

1.3.2 Examples of Statistical Topics

When we search for a short query sequence in a long genome segment, we need to design powerful algorithms to find matching targets efficiently. But when we want to draw biological conclusions from the search result, we need to ask questions like “what is the probability of finding the matching targets in the candidate sequence under a certain biological context?” This is one type of question that statistics helps to answer.

From a data point of view, there are two types of bioinformatics tasks: one is the processing of the data themselves, and the other is inferring answers to biological questions from the data. Most, if not all, biological data can be viewed as noisy samples generated by some underlying probabilistic rules. Statistical inference is the discipline of inferring the underlying rules from data. A key concept is the so-called p-value, which estimates the probability of observing data at least as extreme as the actual observation when the hypothesized rule does not apply. For example, when a particular short sequence pattern (called a motif) is found in the promoters of a set of genes that tend to be expressed in a coordinated manner, one will ask what the probability is of observing so many occurrences of such a sequence pattern by chance. The question can be answered with statistical models of DNA sequences. If the probability is small enough, then one tends to believe that the sequence motif has some role in the coordination of the genes. This example is a simplified description of the motif discovery problem, which plays a key role in many bioinformatics and functional genomics studies.
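As a sketch of this kind of reasoning, assume a deliberately simple background model in which each scanned position independently matches a given 6-mer motif with probability 0.25^6; the chance of seeing at least k matches in n positions is then a binomial tail probability (real motif-discovery tools use much richer sequence models, and the numbers below are made up):

```python
# Probability of observing >= k motif hits in n positions by chance,
# assuming independent positions with background hit probability p0.
from scipy.stats import binom

def motif_pvalue(k, n, p0):
    return binom.sf(k - 1, n, p0)   # P(X >= k) for X ~ Binomial(n, p0)

# e.g. 12 hits in 10,000 scanned promoter positions, background 0.25**6
p0 = 0.25 ** 6
print(motif_pvalue(12, 10_000, p0))
```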

In microarray studies of cancers, a basic question is which genes are differentially expressed between cancer and normal samples. This is a typical statistical question, and many standard statistical methods can be applied. However, due to the special characteristics of microarray data, new methods are also needed.

The complex nature of many types of biological data raises many new challenges to established statistical models. How to build proper statistical models based on biological knowledge and make inferences from data is a key question in much bioinformatics research. For example, in gene recognition tasks, scientists have built very sophisticated hidden Markov models that incorporate existing knowledge about gene structure.

1.3.3 Machine Learning and Pattern Recognition Examples

Building statistical models is one way to describe the data and make predictions. Another approach is to build a prediction machine directly from the data. This approach is called machine learning, which is an important topic in the field of intelligent information processing. When the target to be predicted is discrete classes, the task is called pattern recognition or pattern classification.

Machine learning has been widely used in bioinformatics. For example, recognizing genes and other functional elements on the genome is an important topic in bioinformatics and genomics, and scientists have developed machine learning methods such as artificial neural networks and support vector machines for these types of tasks. A learning machine is actually also a model, though not necessarily a statistical one, and data reported from biological experiments are used to train the model. The hidden Markov model (HMM) can also be regarded as a machine learning method: it uses a sequential statistical model to describe the data, and its parameters also need to be trained with known data.

Another typical example is using microarray data or proteomic expression data to classify cancers. For each patient, the gene expression values measured by microarrays compose a vector, which can be viewed as the original features for classifying the samples. One can select a smaller number of genes to distinguish a certain type of cancer from normal cells or to classify subtypes of the cancer. It looks like a standard pattern recognition task. However, microarray data have several unique properties: the data dimension can be very high (tens of thousands of dimensions), but the sample size is usually small (hundreds or fewer). Some traditional machine learning programs cannot work in such an extreme scenario, and many people have developed new or improved machine learning methods for this type of question.
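A minimal sketch of such a classifier on made-up data: a simple gene (feature) selection step followed by a linear support vector machine, wrapped in a pipeline so that selection is redone within each cross-validation fold (all numbers here are arbitrary, and real studies need much more careful validation):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Made-up data: 60 samples x 5,000 genes, two classes of 30 samples each.
X = rng.normal(size=(60, 5000))
y = np.repeat([0, 1], 30)
X[y == 1, :20] += 1.0        # 20 genes carry a real class difference

# Select the 50 most discriminative genes, then fit a linear SVM.
clf = make_pipeline(SelectKBest(f_classif, k=50), SVC(kernel="linear"))
print(cross_val_score(clf, X, y, cv=5).mean())
```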

Besides supervised machine learning problems, unsupervised machine learning also has broad applications in bioinformatics. Among many other examples, hierarchical clustering can be used to cluster genes into groups with possible functional correlation according to their expression profiles, and it can also be used to cluster samples into groups based on their gene expressions, as sketched below.
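As a small illustration, the following sketch clusters made-up gene expression profiles with average-linkage hierarchical clustering; the data and parameters are invented for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Made-up expression profiles: 10 genes x 6 conditions, in two groups
# whose expression levels differ by a clear offset.
expr = np.vstack([rng.normal(0, 1, (5, 6)), rng.normal(3, 1, (5, 6))])

Z = linkage(expr, method="average")              # Euclidean distances by default
print(fcluster(Z, t=2, criterion="maxclust"))    # cluster labels for the 10 genes
```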

1.3.4 Basic Principles of Genetics

Up to this point, we have compiled an incomplete compendium of research areas of modern biology from the bioinformatician’s perspective. One important area worth a separate discussion here is genetics. As we elaborated at the beginning of this chapter, it is hard to quantify the precise scope of bioinformatics, as a result of its multidisciplinary nature. Genetics, however, has seldom been taken as a part of bioinformatics. This sounds surprising, since both fields are entrenched in a shared methodological background: statistics and algorithms. But it is understandable: while the large body of bioinformatics is focused on a single representative sequence of the genome, the principal concept of genetics is interindividual variation, which makes it somewhat detached from the rest of biology. On the other hand, we emphasize that the development of modern genetics would not be possible without the advancement of biotechnology and aid from bioinformatics; bioinformaticians should be acquainted with the basic principles of genetics in order to communicate better with geneticists. In this section, we take a historical approach to distill the essential concepts of genetics within the context of disease gene mapping.

1.3.4.1 Mendel and Morgan’s Legacy

The dawn of modern genetics is unanimously attributed to Mendel’s seminal work on the pea plant. More than 140 years ago, Mendel observed that crossing purebred peas differing in one binary trait (e.g., yellow versus green seed color) resulted in offspring showing one trait (yellow seeds) rather than a mixture of the two; after selfing of the F1 generation, seed color (yellow/green) exhibited a 3:1 ratio. Similarly, when crossing plants differing in two binary traits (e.g., purple or white flower color plus spherical or wrinkled seed shape), a 9:3:3:1 ratio was observed among the F2 generation for all combinations of traits. Mendel postulated that each binary trait was controlled by a distinct factor (later called a gene), which had two different forms (alleles), recessive and dominant. Genes normally occur in pairs in a normal body cell: one is maternally derived and the other paternally derived. Within an individual, if the two alleles are identical, the individual is called homozygous for that gene; otherwise, the individual is called heterozygous. An individual’s appearance is determined by the set of alleles it happens to possess (its genotype) and the environment. In a heterozygote, the dominant allele hides the effect of the recessive allele. During the formation of sex cells (gametes), the two alleles of a gene segregate and pass on to eggs or sperm, each of which receives one randomly chosen allele copy (law of segregation). And alleles of different genes pass on to the offspring independently of each other, so there is no relation between, for example, seed shape and flower color (law of independent assortment).

The significance of Mendel’s work was the proposition of the concept of the gene as the discrete hereditary unit whose different alleles control different traits. It took another 40 years until the importance of Mendel’s idea was recognized. Soon after geneticists rediscovered Mendel’s laws, they found that independent assortment of different traits was not always the case. Instead, they observed groups of traits that tended to be inherited together (linked) by the offspring rather than assorted independently (unlinked). This dependence of inheritance (linkage) led Morgan et al. to the chromosome theory of inheritance, in which chromosomes were thought to harbor the genetic material. In diploid organisms, chromosomes come in pairs; each homolog comes from one parent. During meiosis, the process that produces gametes, one parent provides one chromosome from each homologous pair. During the first division of meiosis, several crossover events take place between homologous positions of the two parental chromosomes, such that the transmitted chromosome consists of alternating segments from the two parental chromosomes. The chromosome theory elucidated the biological basis of Mendel’s law of segregation and also reconciled the contradiction between linked traits and the violation of the law of independent assortment: it turned out that the genes controlling Mendel’s pea traits were either on different chromosomes or located so far apart on the same chromosome that an obligatory crossover must occur between them. The chromosome theory postulated that genes are arranged linearly along the chromosomes; the combination of nearby alleles along the same chromosome (a haplotype) tends to be transmitted jointly unless shuffled by crossover.

The distance separating two genes on the same chromosome determines the frequency of recombinants between them (genetic distance) and hence the probability that the corresponding traits will be inherited together by the offspring. By analyzing the co-inheritance patterns of many linked traits from experimental crosses or family pedigrees, it is possible to place the corresponding genes in order and estimate the genetic distances between neighboring genes. Rigorous statistical methods were developed to construct such genetic maps. It is truly remarkable in retrospect that early-day geneticists were able to know where genes were and their relative positions even though they had no idea about the molecular structure of genes.
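One standard way to turn observed recombination fractions into additive map distances, not derived here, is Haldane's map function, which assumes crossovers occur as a Poisson process along the chromosome:

d = -\frac{1}{2}\ln(1 - 2\theta), \qquad \theta = \frac{1}{2}\left(1 - e^{-2d}\right),

where \theta is the recombination fraction between two loci and d is their distance in Morgans; for small \theta, d \approx \theta.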

1.3.4.2 Disease Gene Mapping in the Genomic Era

The early-day practice of taking a gene as a polymorphic landmark naturally spawned the concept of genetic markers (or loci) in the genomic era. Alleles giving rise to different Mendelian pea traits are just coding variants that produce different protein isoforms among individuals (called non-synonymous variants; recall also that alternative splicing creates protein isoforms within the same individual). There are many more types of variation whose different forms (also termed alleles), coding or noncoding, can be directly assayed at the DNA level. While some alleles may cause changes in phenotypes, for example, increasing the risk of diseases, most are neutral (with little phenotypic consequence) and occur commonly within the human population. Among them, two types of variation have shown the greatest practical utility: single base-pair changes (single nucleotide polymorphisms, SNPs) and short sequences of 1–6 bp repeated in tandem (microsatellites).

A microsatellite locus typically has tens of alleles (copy numbers of the repeating unit), which can be determined via PCR amplification from unique flanking sequences. The highly variable alleles among human individuals make microsatellites ideal markers for constructing the human genetic map from extended pedigrees. A map of ordered DNA markers has huge practical value. It allows geneticists to localize loci (e.g., protein-coding genes and regulatory elements) whose mutations are responsible for a trait of interest (e.g., disease status or crop yield) onto the grid of prearranged genomic landmarks, a process known as gene mapping. The idea of gene mapping via linkage analysis is not new, inheriting the legacy of Mendel and Morgan: both DNA tags and trait loci are taken as genetic markers, and their relative order is determined by tracing the co-inheritance patterns of traits with markers in families or experimental crosses. Linkage studies using human pedigrees during the past 30 years have led to the mapping of thousands of genes in which single mutations cause severe disorders (Mendelian diseases), like Tay-Sachs disease and cystic fibrosis, among others (see Online Mendelian Inheritance in Man for a complete compendium).

Encouraged by the huge success of mapping genes for rare Mendelian diseases, geneticists were eager to apply linkage analysis to common and complex diseases (like hypertension and diabetes), which also exhibit familial aggregation. But this time they fell short of luck. At least two distinct features of common diseases are known to compromise the power of linkage analysis: first, the risk of getting the disease for carriers of causal variants is much lower than in Mendelian cases; second, there may be multiple genes that, possibly through interaction with the environment, influence disease susceptibility.

An alternative approach emerged during the mid-1990s. Rather than tracing segregation patterns within families, we can pinpoint disease mutations by systematically testing each common genetic variant for allele frequency differences between unrelated cases and controls sampled from the population (association mapping). Aside from its practical tractability, the focus on common variants is based on the “common disease-common variant” (CDCV) hypothesis, which proposes that variants conferring susceptibility to common diseases occur commonly in the population (with allele frequency >5 % as an operational criterion). While the idea of association study is absolutely simple, transforming this blueprint into practice took more than a decade.

As a first step toward this goal, great efforts were made in parallel with the human genome project to compile a comprehensive catalog of sequence variations and map them to the reference genome backbone. SNPs are the most abundant form of variant. In contrast to the high variability of microsatellites, they typically have two alleles at each locus, which can be measured by hybridization (genotyping). Two homologous chromosomes within an individual differ on average at 1 in every 1,000 bases in their aligned regions (heterozygosity), and more than 95 % of those heterozygous loci have minor allele frequencies >5 % within the population. It has been estimated that more than 70 % of the total 10 million common SNPs have by now been discovered and deposited in public databases. Other forms of variation, including those altering the copy numbers of large DNA chunks, have also been mapped at an accelerated pace recently. Nevertheless, high abundance and ease of genotyping make SNPs the primary choice for association studies. Meanwhile, off-the-shelf SNP genotyping microarrays nowadays can simultaneously genotype more than half a million SNPs in one individual with more than 99 % accuracy. With both genomic resources and cutting-edge technologies at hand, genome-wide association study seemed tantalizing.

But questions remained: do we really need to type all the variants in a genome-wide association study (which is still infeasible)? And even if we could type all common SNPs, if the disease-causing variant is not a SNP, would we still be able to find it? To answer these questions, we need to take an evolutionary perspective.

Variations do not come out of nowhere. All the variation that we observe in the present-day population results from historical mutations that happened on chromosomes passed on to the next generation. Each SNP is typically biallelic, arising from a unique point mutation event earlier in human history (because the point mutation rate is very low, about 10⁻⁸ per site per generation, recurrent mutation is negligible). As mentioned above, most variation is neutral, so the frequency of a newly arisen allele is subject to random fluctuation because the population size is finite (genetic drift). As time goes by, most newly arisen alleles are removed from the population, while some happen to spread across the entire population (fixation). So the polymorphisms we observe are those old mutations that have neither become extinct nor reached fixation until today. Some mutations can influence an individual’s fitness in the environment, for example, by causing a severe disorder at an early age. In such cases, the probability of the allele being transmitted to the next generation is reduced, since the carrier is unlikely to survive to reproductive age. The frequencies of such deleterious alleles, including those causing Mendelian diseases, are kept low as a consequence of purifying selection. Most common diseases, however, have only a mild impact on an individual’s reproduction, so variants that predispose individuals to common diseases can rise to moderate frequencies, consistent with (but not proving) the CDCV hypothesis.

Variations do not come alone. Whenever a new allele is born, it is embedded in the particular background of a specific combination of existing alleles (a haplotype) at that time. In subsequent generations, the haplotype background of that allele is reshuffled by meiotic crossovers. Because nearby markers undergo fewer crossovers, alleles of closely linked loci (be they SNPs, indels, copy number variations, etc.) exhibit allelic associations with each other (termed linkage disequilibrium, abbreviated as LD). This implies that even if the disease-causing mutations are not directly typed and tested for association, they can still be “tagged” by the alleles of nearby SNPs. And by properly selecting markers based on the LD patterns of the human population, genome-wide association studies can be conducted in a cost-effective way. Both marker selection and result interpretation therefore require knowledge about the interrelationships between variants.
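Linkage disequilibrium between two biallelic loci is commonly quantified, although no formula is given above, by

D = p_{AB} - p_{A}\,p_{B}, \qquad r^{2} = \frac{D^{2}}{p_{A}(1-p_{A})\,p_{B}(1-p_{B})},

where p_{AB} is the frequency of the haplotype carrying allele A at the first locus and allele B at the second; an r^{2} close to 1 means that one SNP is an almost perfect tag for the other.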

The International HapMap Project was completed to achieve this goal, with priority given to the common SNPs. We now know that there are regions tens or even hundreds of kilobases long in which the diversity of SNP haplotypes is limited. These “haplotype blocks” are separated by sharp breakdowns of LD as a result of the punctuated distribution of crossover events (with ∼80 % of crossovers happening within recombination hotspots). Within blocks, a reduced number of common SNPs can serve as proxies to predict the allelic status of the remaining common SNPs and even of other common genetic variations (like copy number gains or losses). Half a million SNPs can provide adequate power in association studies to test most of the common SNPs in East Asian and European populations. These findings, together with the maturity of technology and statistical methodology, paved the way for the first wave of association studies during the past 2 years. More than a hundred loci have now been identified as bona fide loci reproducibly associated with common forms of human diseases.

Never satisfied with the initial success, geneticists want to extend the power of association mapping to rare variants. To this end, they call for a map that catalogs and describes the relationships among almost all variants, be they common or rare. Armed with cutting-edge sequencers, the 1000 Genomes Project has been launched with this ambition. Geneticists and experts from other disciplines are now working in an ever closer manner.