Definition

An index on an attribute provides an efficient way to access data records associated with a given range of values for the indexed attribute. Typically, an index stores a list of RIDs (called a RID-list) of all the records associated with each distinct value v of the indexed attribute. In a bitmap index, each RID-list is represented in the form of a bit vector (i.e., bitmap) where the size of each bitmap is equal to the cardinality of the indexed relation, and the i-th bit in each bitmap corresponds to the i-th record in the indexed relation. The simplest bitmap index design is the Value-List index, which is illustrated in Fig. 1b for an attribute A of a 12-record relation R in Fig. 1a. In this bitmap index, there is one bitmap \( {E}^v \) associated with each attribute value \( v\in \left[0,9\right] \) such that the i-th bit of \( {E}^v \) is set to 1 if and only if the i-th record has a value v for the indexed attribute.
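As a rough illustration of the Value-List design, the following Python sketch builds one bitmap per distinct value, using an arbitrary-precision integer as the bit vector (bit i stands for the i-th record, counting from 0). The column `A` and the helper names `build_value_list_index` and `rids` are assumptions made for this example; the column is illustrative and is not claimed to be the data of Fig. 1a.

```python
from collections import defaultdict


def build_value_list_index(column):
    """Return {value: bitmap}; bit i of a bitmap is 1 iff column[i] == value."""
    index = defaultdict(int)
    for i, v in enumerate(column):
        index[v] |= 1 << i
    return dict(index)


def rids(bitmap):
    """Decode a bitmap into the sorted list of record positions it covers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical 12-record column with values in [0, 9].
A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
E = build_value_list_index(A)

print(rids(E[2]))                # records with A = 2
print(rids(E[0] | E[1] | E[2]))  # records with A <= 2, via bitwise OR
```

An equality predicate reads a single bitmap, whereas a range predicate on this design ORs one bitmap per value in the range, which is the cost that the other encodings in this entry try to reduce.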

Bitmap Index, Fig. 1

Examples of bitmap indexes. (a) indexed attribute A, (b) equality-encoded index (or value-list index), (c) range-encoded index (or base-10 bit-sliced index), (d) interval-encoded index

Historical Background

The idea of using bitmap indexes to speed up selection predicate evaluation has been recognized since the early 1970s [4]. Some early implementations of bitmap processing techniques include PC DBMSs (e.g., FoxPro, Interbase), a scientific/statistical database application developed at Lawrence Berkeley Laboratory [11], and Model 204, which is a commercial DBMS for the IBM mainframe [7].

The main advantage of using a bitmap index is the CPU efficiency of bitmap operations (AND, OR, XOR, NOT). Furthermore, compared to RID-based indexes, bitmap indexes are more space-efficient for attributes with low cardinality and more I/O-efficient for evaluating selection predicates with low selectivity (i.e., predicates that select a large fraction of the records). For example, assuming each RID requires four bytes (32 bits) of storage and ignoring any compression, bitmap indexes are more space-efficient if the attribute cardinality is less than 32, and reading a bitmap is more I/O-efficient than reading a RID-list if the selectivity factor of the selection predicate is more than \( \frac{1}{32}\left(\approx 3.1\%\right) \). Another advantage of bitmap indexes is that they are very amenable to parallelization due to the equal-sized bitmaps and the nature of the bitwise operations.
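The two thresholds follow from a simple count of bits. As a sketch, let N be the number of records, C the attribute cardinality, and s the selectivity factor (symbols introduced here for illustration), and assume 32-bit RIDs, no compression, and a Value-List index with one bitmap per value. Every record appears in exactly one RID-list, and a predicate with selectivity factor s matches sN records, so:

\[
\begin{aligned}
\text{space:}\quad & \underbrace{C\cdot N}_{\text{all bitmaps, in bits}} < \underbrace{32\cdot N}_{\text{all RID-lists, in bits}} &&\iff\quad C<32,\\
\text{I/O:}\quad & \underbrace{N}_{\text{one bitmap, in bits}} < \underbrace{32\cdot s\cdot N}_{\text{one RID-list, in bits}} &&\iff\quad s>\tfrac{1}{32}.
\end{aligned}
\]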

A variety of bitmap index designs have been proposed since the early days. Besides the simple Value-List index illustrated in Fig. 1b, another early bitmap index design is the Bit-Sliced index (BSI), which is implemented in Model 204 and Sybase IQ [7]. A BSI for an attribute with a cardinality of C consists of \( k=\left\lceil {\log}_2(C)\right\rceil \) bitmaps, with one bitmap associated with each bit position in the binary representation of the attribute values. Compared to the Value-List index, the BSI is more space-efficient, with an attribute value v being encoded by a string of k bits corresponding to its binary representation. The BSI design can be generalized to use a nonbinary base b such that it consists of k(b − 1) bitmaps \( \left\{{B}_i^j:1\le i\le k,0\le j<b-1\right\} \), where \( k=\left\lceil {\log}_b(C)\right\rceil \). Each attribute value v is expressed in base b as a sequence of k base-b digits \( {v}_k{v}_{k-1}\dots {v}_2{v}_1 \), and each bitmap \( {B}_i^j \) represents the set of records with \( {v}_i\le j \). Using a larger base improves the index's performance for evaluating range predicates at the expense of more space. An example of a base-10 BSI is shown in Fig. 1c. Both Model 204 and Sybase IQ implemented base-10 Bit-Sliced indexes [7].
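A minimal Python sketch of a binary BSI is given below. It assumes the convention that slice j has its i-th bit set exactly when bit j of the i-th record's value is 1; the column `A` and the function names `build_binary_bsi` and `bsi_equals` are hypothetical and chosen only for this example.

```python
def build_binary_bsi(column, cardinality):
    """Return k = ceil(log2(C)) slices; bit i of slice j is bit j of column[i]."""
    k = max(1, (cardinality - 1).bit_length())
    slices = [0] * k
    for i, v in enumerate(column):
        for j in range(k):
            if (v >> j) & 1:
                slices[j] |= 1 << i
    return slices


def bsi_equals(slices, n_records, value):
    """Bitmap of the records whose value equals `value`."""
    all_ones = (1 << n_records) - 1
    result = all_ones
    for j, s in enumerate(slices):
        # Keep only records whose j-th bit agrees with the j-th bit of `value`.
        result &= s if (value >> j) & 1 else (s ^ all_ones)
    return result


# Hypothetical 12-record column with values in [0, 10).
A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
slices = build_binary_bsi(A, cardinality=10)
matches = bsi_equals(slices, len(A), 2)
print(len(slices), bin(matches))
```

For C = 10 the sketch stores 4 bitmaps instead of the 10 of a Value-List index, but an equality predicate now touches every slice rather than a single bitmap.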

Several bitmap index designs have also been implemented in a scientific/statistical database application at Lawrence Berkeley Laboratory [11]. These bitmap indexes include binary encoded indexes (equivalent to binary BSI), unary encoded indexes (equivalent to non-binary BSI), K-of-N encoded indexes (generalizations of Value-List indexes where each attribute value is encoded by an N-bit string with exactly K bits set to 1), and superimposed encoded indexes based on superimposed encoding, which is useful for indexing set-valued attributes.

Interest in bitmap indexes was revived in the mid-1990s due to the emergence of data warehousing applications which are characterized by read-mostly query workloads dominated by large, complex ad hoc queries [7]. All the major DBMS vendors (IBM, Microsoft, Oracle, and Sybase) also started to support bitmap indexes in their products around this time.

Foundations

Chan and Ioannidis proposed a two-dimensional framework to characterize the design space of bitmap indexes [2]. The two orthogonal parameters identified for bitmap indexes (with an attribute cardinality of C) are (i) the arithmetic used to represent attribute values; i.e., how an attribute value is decomposed into digits according to some base (e.g., base-C arithmetic is used in a Value-List index); and (ii) the encoding scheme of each decomposed digit in bits (e.g., each attribute value in a Value-List index is encoded by turning on exactly one out of C bits).

Consider an attribute value \( v\in \left[0,C\right) \) and a sequence of n base numbers \( B=\left\langle {b}_n,{b}_{n-1},\dots, {b}_1\right\rangle \), where \( {b}_n=\left\lceil \frac{C}{\prod_{i=1}^{n-1}{b}_i}\right\rceil \) and \( {b}_i\ge 2 \) for \( i\in \left[1,n\right] \). Using B, v can be decomposed into a unique sequence of n digits \( \left\langle {v}_n,{v}_{n-1},\dots, {v}_1\right\rangle \) as follows: \( {v}_i={V}_i\operatorname{mod}{b}_i \), where \( {V}_1=v \) and \( {V}_i=\left\lfloor \frac{V_{i-1}}{b_{i-1}}\right\rfloor \) for \( 1<i\le n \). Thus, \( v={v}_n\left(\prod_{j=1}^{n-1}{b}_j\right)+\dots +{v}_i\left(\prod_{j=1}^{i-1}{b}_j\right)+\dots +{v}_2{b}_1+{v}_1 \). Note that each \( {v}_i \) is a base-\( {b}_i \) digit (i.e., \( 0\le {v}_i<{b}_i \)). Each choice of n and base sequence B gives a different representation of attribute values and therefore a different index (known as a base-B index). The index consists of n components (i.e., one component per digit), where each component is itself a collection of bitmaps. Figure 2b shows a base-<3,4> Value-List index that consists of two components: the first component has four bitmaps \( \left\{{E}_1^3,{E}_1^2,{E}_1^1,{E}_1^0\right\} \) and the second component has three bitmaps \( \left\{{E}_2^2,{E}_2^1,{E}_2^0\right\} \). Note that the k-th bit in each bitmap \( {E}_i^j \) is set to 1 if and only if \( {v}_i=j \), where \( \left\langle {v}_2,{v}_1\right\rangle \) is the <3,4>-decomposition of the k-th record's indexed attribute value.

For the encoding scheme dimension, there are two basic encoding schemes: equality encoding and range encoding. Consider the i-th component of an index with base \( {b}_i \), and a value \( {v}_i\in \left[0,{b}_i-1\right] \). In the equality encoding scheme, \( {v}_i \) is encoded by \( {b}_i \) bits, where all the bits are set to 0 except for the bit corresponding to \( {v}_i \), which is set to 1. Thus, an equality-encoded component (with base \( {b}_i \)) consists of \( {b}_i \) bitmaps \( \left\{{E}_i^{b_i-1},\dots, {E}_i^0\right\} \) such that the k-th bit in each bitmap \( {E}_i^j \) is set to 1 if and only if \( {v}_i=j \), where \( {v}_i \) is the i-th digit of the decomposition of the k-th record's indexed attribute value. In the range encoding scheme, \( {v}_i \) is again encoded by \( {b}_i \) bits, with the \( {v}_i \) rightmost bits set to 0 and the remaining bits (starting from the one corresponding to \( {v}_i \) and to the left) set to 1. The k-th bit in each bitmap \( {R}_i^j \) is set to 1 if and only if \( {v}_i\le j \), where \( {v}_i \) is the i-th digit of the decomposition of the k-th record's indexed attribute value. Since the bitmap \( {R}_i^{b_i-1} \) has all bits set to 1, it does not need to be stored, so a range-encoded component consists of \( \left({b}_i-1\right) \) bitmaps \( \left\{{R}_i^{b_i-2},\dots, {R}_i^0\right\} \). Value-List and Bit-Sliced indexes therefore correspond to equality-encoded and range-encoded indexes, respectively. Figures 1c and 2c show the range-encoded indexes corresponding to the equality-encoded indexes in Figs. 1b and 2b. Details of query processing algorithms and space-time tradeoffs of equality/range-encoded, multicomponent bitmap indexes are given in [4].
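The decomposition and the two encodings can be sketched in Python as follows. The column `A`, the function names, and the convention of passing the bases least-significant first (so base-<3,4> becomes `[4, 3]`) are assumptions made for this example. The two predicate evaluations at the end are worked out by hand for these particular constants and are not the general query algorithms of the cited work.

```python
def decompose(v, bases):
    """Split v into digits [v_1, ..., v_n] for bases [b_1, ..., b_n]
    (least significant first): v_i = V_i mod b_i, V_{i+1} = V_i // b_i."""
    digits = []
    for b in bases:
        digits.append(v % b)
        v //= b
    return digits


def build_components(column, bases, encoding):
    """components[i][j] is E_{i+1}^j ("equality") or R_{i+1}^j ("range")."""
    components = []
    for i, b in enumerate(bases):
        # The all-ones bitmap R^{b-1} of a range-encoded component is not stored.
        n_bitmaps = b if encoding == "equality" else b - 1
        bitmaps = [0] * n_bitmaps
        for k, v in enumerate(column):
            digit = decompose(v, bases)[i]
            for j in range(n_bitmaps):
                hit = (digit == j) if encoding == "equality" else (digit <= j)
                if hit:
                    bitmaps[j] |= 1 << k
        components.append(bitmaps)
    return components


# Hypothetical 12-record column with values in [0, 12).
A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
bases = [4, 3]  # base-<3,4>: b_1 = 4, b_2 = 3
E = build_components(A, bases, "equality")
R = build_components(A, bases, "range")

# A = 2 has digits <v_2, v_1> = <0, 2>: intersect one bitmap per component.
equals_2 = E[0][2] & E[1][0]

# A <= 5, where 5 = <1, 1>: either v_2 < 1, or v_2 = 1 and v_1 <= 1.
at_most_5 = R[1][0] | ((R[1][1] & ~R[1][0]) & R[0][1])
```

With these bases the equality-encoded index stores 4 + 3 = 7 bitmaps and the range-encoded index 3 + 2 = 5, compared with 12 bitmaps for a single-component Value-List index over the same domain.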

Bitmap Index, Fig. 2

Example of base-<3,4> indexes. (a) Indexed attribute A, (b) equality-encoded index, (c) range-encoded index, (d) interval-encoded index

A new encoding scheme, called interval encoding, was proposed in [5]. For an attribute with a cardinality of C, a value \( v\in \left[0,C\right) \) is encoded using \( \left\lceil \frac{C}{2}\right\rceil \) bits such that if \( v<\left\lceil \frac{C}{2}\right\rceil \), then v is encoded by setting the (v + 1) rightmost bits to 1 and the remaining bits to 0; otherwise, v is encoded by setting the (C − 1 − v) leftmost bits to 1 and the remaining bits to 0. Thus, an interval-encoded index consists of \( \left\lceil \frac{C}{2}\right\rceil \) bitmaps \( \left\{{I}^{\left\lceil \frac{C}{2}\right\rceil -1},\dots, {I}^0\right\} \), where each bitmap \( {I}^j \) is associated with a range of (m + 1) values [j, j + m], \( m=\left\lfloor \frac{C}{2}\right\rfloor -1 \), such that the k-th bit in a bitmap \( {I}^j \) is set to 1 if and only if the k-th record's indexed attribute value is in [j, j + m]. Figures 1d and 2d show the interval-encoded indexes corresponding to the equality-encoded indexes in Figs. 1b and 2b. Note that interval encoding has a better space-time tradeoff than range encoding: it has the same worst-case evaluation cost of two bitmap scans as range encoding, but its space requirement is about half that of range encoding.
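A small Python sketch of interval encoding is shown below. It builds the bitmaps directly from the range definition (the bitmap for j covers the values [j, j + m]). The column `A` and the names `build_interval_index` and `intervals` are hypothetical, and the equality evaluation shown is worked out by hand for one constant only.

```python
def build_interval_index(column, cardinality):
    """Return bitmaps I^0 .. I^(ceil(C/2)-1); I^j covers values in [j, j + m]."""
    m = cardinality // 2 - 1            # floor(C/2) - 1
    n_bitmaps = (cardinality + 1) // 2  # ceil(C/2)
    bitmaps = [0] * n_bitmaps
    for k, v in enumerate(column):
        for j in range(n_bitmaps):
            if j <= v <= j + m:
                bitmaps[j] |= 1 << k
    return bitmaps


# Hypothetical 12-record column with values in [0, 10).
A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
intervals = build_interval_index(A, cardinality=10)

# I^2 covers [2, 6] and I^3 covers [3, 7], so the value 2 is the only one
# that falls in I^2 but not in I^3: two bitmap scans answer A = 2.
equals_2 = intervals[2] & ~intervals[3]
print(len(intervals), bin(equals_2))
```

For C = 10 this stores 5 bitmaps, against the 9 bitmaps of a single-component range-encoded index.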

Wu and Bachmann proposed a variant of binary BSI called encoded bitmap index (EBI) [12, 13]. Instead of encoding each attribute value simply in terms of its binary representation, an EBI uses a lookup table to map each attribute value to a distinct bit string; this flexibility enables optimization of the value-to-bit-string mapping, by exploiting knowledge of the query workload, to reduce the number of bitmap scans for query evaluation. Thus, binary BSI is a special case of EBI. Another similar index design called Encoded-Vector index (EVI) is used in IBM DB2. Instead of storing the index as a collection of \( \left\lceil {\log}_2(C)\right\rceil \) bitmaps as in EBI, EVI is organized as a single vector of \( \left\lceil {\log}_2(C)\right\rceil \)-bit strings, and the purpose of the lookup table optimization is to reduce the CPU cost of bit string comparisons when evaluating selection queries of the form \( A\in \left\{{v}_1,{v}_2,\dots, {v}_n\right\} \). Another related index is the Projection index [7], which is implemented in Sybase IQ.

Complex, multitable join queries (such as star-join queries) can also be evaluated very efficiently using bitmapped join indexes [6], which are indexes that combine the advantages of join indexes and bitmap representation. A join index for the join between two relations R and S is a precomputation of their join result defined by \( {\Pi}_{R. rid,S. rid}\left(R{\bowtie}_pS\right) \), where p is the join predicate between R and S. Thus, a join index on \( R\bowtie S \) can be thought of as a conventional index on the table R, where the attribute being indexed is the “virtual” attribute S.rid; i.e., each distinct S.rid value v is associated with a list of all R.rid values that are related to v via the join. A bitmapped join index [6] is simply a join index with the RID-lists represented using bitmaps. Bitmapped join indexes are implemented in Informix Red Brick Warehouse and Oracle. Bitmap indexes can also be applied to evaluate queries that involve aggregate functions (e.g., SUM, MIN/MAX, MEDIAN); evaluation algorithms for Value-List and Bit-Sliced indexes are discussed in a paper by O’Neil and Quass [7]. Efficient algorithms for performing arithmetic operations (addition and subtraction) on binary BSIs are proposed in [8].
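To illustrate how an aggregate can be computed from bitmaps alone, the Python sketch below evaluates SUM over a binary BSI by weighting the population count of each slice by its power of two. This is a simplified sketch of the general idea, not a reproduction of the algorithms in [7]; the column `A`, the bitmap `first_six`, and all function names are assumptions for this example.

```python
def build_binary_bsi(column, n_slices):
    """Bit i of slice j is bit j of column[i]."""
    slices = [0] * n_slices
    for i, v in enumerate(column):
        for j in range(n_slices):
            if (v >> j) & 1:
                slices[j] |= 1 << i
    return slices


def popcount(bitmap):
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(bitmap).count("1")


def bsi_sum(slices, found):
    """SUM of the indexed attribute over the records selected by `found`."""
    return sum((1 << j) * popcount(s & found) for j, s in enumerate(slices))


# Hypothetical 12-record column with values in [0, 16).
A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
slices = build_binary_bsi(A, n_slices=4)

all_records = (1 << len(A)) - 1
first_six = (1 << 6) - 1  # stand-in for the result bitmap of some predicate

total = bsi_sum(slices, all_records)
partial = bsi_sum(slices, first_six)
print(total, partial)
```

The base data is never read: the result bitmap of the selection is ANDed with each slice, and only bit counts are accumulated.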

As bitmap indexes become less efficient for larger attribute cardinality, a number of approaches have been developed to reduce their space requirement. Besides using multicomponent bitmap indexes [2, 11], another common space-reduction technique is to apply compression. In Model 204 [7], the bitmaps are compressed by using a hybrid representation; specifically, each individual bitmap is partitioned into a number of fixed-size segments, and segments that are dense are stored as verbatim bitmaps, while sparse segments are converted into RID-lists. While generic compression techniques (e.g., LZ77) are effective in reducing both the disk storage and retrieval cost of bitmap indexes, the savings in I/O cost can be offset by the high CPU cost incurred for decompressing the compressed bitmaps before they can be combined with other bitmaps. A number of specialized compression techniques that enable bitmaps to be operated on without a complete decompression have been proposed: Byte-aligned Bitmap Code (BBC), which is used by Oracle, and Word-Aligned Hybrid code (WAH) [15]. Some performance studies of compressed bitmap indexes are reported in [2176,2177,3, 15].
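The hybrid dense/sparse representation can be sketched in Python as follows. The segment size of 256 bits, the 32-bit RID width, and the rule that a segment counts as sparse when its RID-list would need fewer bits than the verbatim segment are all assumptions chosen for illustration; the text above does not state the actual parameters used by Model 204.

```python
def compress_hybrid(bitmap, n_bits, segment_bits=256, rid_bits=32):
    """Split `bitmap` into fixed-size segments; keep dense segments verbatim
    and convert sparse ones into lists of record positions (RID-lists)."""
    segments = []
    mask = (1 << segment_bits) - 1
    for start in range(0, n_bits, segment_bits):
        seg = (bitmap >> start) & mask
        positions = [start + i for i in range(segment_bits) if (seg >> i) & 1]
        if len(positions) * rid_bits < segment_bits:
            segments.append(("rids", positions))
        else:
            segments.append(("bitmap", start, seg))
    return segments


def decompress_hybrid(segments):
    """Rebuild the original bitmap from its hybrid representation."""
    bitmap = 0
    for seg in segments:
        if seg[0] == "rids":
            for p in seg[1]:
                bitmap |= 1 << p
        else:
            bitmap |= seg[2] << seg[1]
    return bitmap


# Hypothetical 1024-bit bitmap: two isolated bits plus one dense run of 100 bits.
example = (1 << 5) | (1 << 300) | (((1 << 100) - 1) << 512)
parts = compress_hybrid(example, n_bits=1024)
print([p[0] for p in parts])
```

Because every segment covers a fixed, known range of record positions, two bitmaps compressed this way can be combined segment by segment, converting a RID-list back to bits only where the other operand is dense.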

A different approach proposed to reduce the size of bitmaps is to use range-based bitmaps (RBB) [14], which have been applied to index large data sets in tertiary storage systems as well as large, multidimensional data sets in scientific applications [15]. Unlike Value-List indexes where there is one bitmap for each distinct attribute value, the RBB approach partitions the attribute domain into a number of disjoint ranges and constructs one bitmap for each range of values (this is also known as binning). Thus, RBB provides a form of lossy compression which requires additional postprocessing to filter out false positives. Koudas [5] has examined space-optimal RBBs for equality queries when both the attribute and query distributions are known. More recently, Sinha and Winslett have proposed multiresolution bitmap indexes to avoid the cost of filtering out false positives for RBB [10]. For example, a two-resolution bitmap index has a lower-resolution index consisting of RBB (i.e., with each bitmap representing a range of attribute values) and a higher-resolution index consisting of one bitmap for each distinct attribute value. By combining the efficiency of lower-resolution indexes and the precision of higher-resolution indexes, queries can be evaluated efficiently without false positive filtering.
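The false-positive filtering that binning requires can be seen in the following Python sketch. The bin edges, the column `A`, and the function names are hypothetical; values are assumed to lie below the last edge.

```python
import bisect


def build_binned_index(column, edges):
    """edges = [e_0, ..., e_B]; bitmap b covers values in [e_b, e_{b+1})."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(column):
        b = bisect.bisect_right(edges, v) - 1
        bitmaps[b] |= 1 << i
    return bitmaps


def query_equals(column, edges, bitmaps, value):
    """Return (candidates, hits): the bin's records, then the true matches."""
    b = bisect.bisect_right(edges, value) - 1
    candidates = [i for i in range(len(column)) if (bitmaps[b] >> i) & 1]
    # Postprocessing step: the bitmap only says "somewhere in this bin",
    # so each candidate record must be checked against the base data.
    hits = [i for i in candidates if column[i] == value]
    return candidates, hits


# Hypothetical 12-record column with values in [0, 12) and three bins.
A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
edges = [0, 4, 8, 12]
bins = build_binned_index(A, edges)

candidates, hits = query_equals(A, edges, bins, 2)
print(len(bins), candidates, hits)
```

Here three bitmaps replace the nine that a Value-List index would need for this column, at the price of fetching seven candidate records to find the four true matches.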

Key Applications

Today, bitmap indexes are supported by all major database systems, and they are particularly suitable for data warehousing applications [7]. Bitmap indexes have also been used in scientific databases (e.g., [10, 11, 15]), indexing data on tertiary storage systems (e.g., [1]), and data mining applications.

Future Directions

The design space for bitmap indexes is characterized by four key parameters: levels of resolution (which affects the number of levels of bitmap indexes and the index granularity at each level), attribute value representation (which affects the number and size of the index components at each level), encoding scheme (which affects how the bitmaps in each component are encoded), and storage format (i.e., uncompressed, compressed, or a combination of compressed and uncompressed). While several performance studies have examined various combinations of these parameters (e.g., [16]), a comprehensive investigation into the space-time tradeoffs of the entire design space is still lacking and deserves further exploration. The results of such a study could be applied to further enhance automated physical database tuning tools.

Cross-References