Understanding this section requires some computer-science exposure – specifically, graph theory. If you need to bone up, a data-structures/algorithms book that is reasonable and not overwhelming is Sedgewick.
This section is relevant to the ad hoc query of non-sparse, heterogeneous data as modeled using an EAV/CR framework, though it also applies to a conventionally modeled schema. 18 The Universal Database Concept
A traditionally implemented (column-modeled) database schema can be regarded as an
undirected graph data structure, where a table can be regarded as a vertex or node, while a primary-key/foreign key relationship between two tables can be treated as an edge. Undirected means that one is permitted to navigate in either direction along any edge, even though the ends of the edge have a specific meaning in terms of primary key vs. foreign key.
In theory, if one wants to retrieve columns from several tables, through criteria on columns that are dispersed across several tables, a program could help in composing part of the SQL – specifically the join expression:
For a pair of physical tables that are directly related, the join expressions is:
primary-key-table. primary-key-field = foreign-key-table. foreign-key-field.If there are multiple columns in the join, the above is repeated, and the individual column-pairs are combined with an AND operator.
If the paths between the tables whose columns the user has specified are known,
the software could automatically include intermediate tables in the path, even without the user having specified these tables.
An idea originating in the late 1980s with David Maier and Jeffrey Ullman – the “Universal database” (UDB) – takes this insight to an extreme level. The UDB concept attempts to create the illusion, for the end-user, that a database consisting possibly of hundreds of tables is one gigantic table with a large number of columns. In other words, the user concerns oneself only with individual columns, and need not be bothered with the existence of individual tables: foreign keys in particular, can be hidden from the user. The software SAP Business Objects (BO) is based on this premise: in BO terminology, the virtual giant table is the “Universe” and the “Objects” are the individual columns.
Weaknesses in the UDB Premise
In terms of graph theory, the basic UDB approach is guaranteed to work only for schemas that happen to be
, where there is
only one path that connects any two tables
. (If there is more than one path, the schema is said to have
.) In real schemas, such an assumption never holds. There are two common conditions where the assumption is violated: see Fig.
Cycles in an undirected graph
One instance of cycles is where one table is directly joined to another more once. (In Fig.
, there are two links between A and D.) An example is an employee-hierarchy table, consisting of two columns, Manager ID and Employee ID, both these columns are linked to a Persons table that contains details of the individual. Another example is Ship-From and Ship-to address IDs in a sales order, both of which point to an Addresses table. 15.8
Another cycle is where a table is directly linked to other tables, which also happen to be linked to each other by different paths. (In Fig.
, C is linked directly to A and B, but A is also linked to B, so to get from A to C, one can use the direct route A-C, or the indirect one A-B-C.) This happens in reference databases of biomedical information where multiple tables store different kinds of facts, but each fact needs to be backed up with a bibliographic citation. If all citations are stored centrally in a single table, then multiple tables may contain a foreign-key column, Citation ID, which points to the citations table. 15.8
Another concern here is that it is not enough to know that two tables are connected, but how they should be connected in a specific circumstance. Most of the time, one needs a standard (inner) join, but at other time, one needs a left or right outer join, and very rarely a full outer join (the union of left and right joins).
In other words, the basic UDB paradigm will take you only so far: at some point, the user must leave Never Land and become aware of the real world, where separate tables exist.
Cycle Detection and Elimination
In Business Objects, setting up a “Universe” involves detecting and eliminating cycles. The standard algorithm for cycle-detection in undirected graphs is based on depth-first search: see Bard
for an accessible explanation. Essentially, one maintains lists of edges and vertices: all vertices are set to “unvisited”. Starting with any arbitrary vertex, you move along each of the edges connected to it that have not already been used. Any unvisited vertex encountered is set to “visited”. If a previously visited vertex is encountered, the edge that got you to this vertex is a cycle, or part of one. For example, in Fig. 19 , starting from A, you would reach D, B and C. Moving on to B, you would reach C, which has already been visited: therefore A–C is a cycle. Moving from D to A would similarly detect a cycle. 15.8
To eliminate cycles, you need to duplicate the multiply-connected vertexes: in terms of database schemas, you introduce alternative aliases for a table. You keep repeating this process until no cycles are detected. In Fig.
, which is a transformation of Fig.
, adding the two aliases A2 and D2 for A and D would
to have eliminated cycles.
Cycle elimination by vertex duplication
However, the problem gets more complicated. Remember that in the double-link between A and D of Fig.
, 15.8 two separate columns in one table (the foreign keys) are linked to the primary key of the second table. Let’s assume that D is the foreign-key table with two columns Col1 and Col2. (These could be Ship-From and Ship-to address IDs.) Let’s assume that Col1 is linked to A, and Col2 is linked to A2. Then, in Fig. , the only way to go from Col2 to B is via A2 and C (which may not be what the user wants). Similarly, any route from Col1 must involve B, which again may not be what is needed. So one must duplicate each D node once again, creating two new nodes D3 and D4, with D3 linked to A, and D4 linked to A2. 15.9
You must deal with the outer-join problem in a similar way. Let’s say that you mostly link A and B with an inner join, but occasionally need to use A right-outer-join B. You would duplicate A to create A3 (you could alternatively duplicate B – the choice is immaterial). You could then designate the link A–B as an inner join and A3–B as a right outer join. In general, if there are N cycles, you have to create N*2 extra aliases in the worst case.
Can Database Schemas be “Hidden” Successfully Using a UDB Approach?
As the number of multiple links between tables grows, the setup of a “Universe” becomes increasingly laborious. For the dubious “benefit” of hiding the physical schema and inter-table relationships from a supposedly naïve and E-R-diagram-phobic user, the virtual UDB schema can become much more complex than the actual physical schema. There are several practical issues here.
The UDB concept originated at a time when GUIs were non-existent, and E-R diagrams existed only on paper as a modeling tool SQL, including join expressions, had to be entered mostly by typing. In most modern environments, E-R diagrams are nothing to be scared of. Modern query GUIs such as MS-Access’s VQBE generate the join expression between two tables for you if you click and drag between the two columns that you want to connect. (You can specify the join type by changing the connection’s properties.) VQBE will actually utilize existing relationships in your schema, if you select two tables that happen to be directly related, pre-connecting them. (If you select two tables that are not directly connected, VQBE is currently not smart enough to include intermediate tables. You need to refer to the E-R diagram and add intermediate tables manually – which is not too big a deal.)
If you still insist on going the BO route, a fairly challenging issue involves giving the duplicated column aliases meaningful names that would somehow indicate the path desired by the user when a particular alias (or aliases, such as D4 and A2 in our running example) were picked implicitly. This is a problem that I believe contains an intractable contradiction: the user who is supposedly too naïve to know about separate tables must now be forced to not only deal with the tables but also the paths between them.
In my experience, serious analysts, who know SQL and are quite comfortable with E-R diagrams, find the BO approach particularly condescending and confining: they are likely to spend more time fighting against the software than getting work done. Such individuals would be more productive with direct read-only SQL access to the schema. (An amusing episode involved watching a friend and colleague of mine, the Universe designer/maintainer for an institutional Business Objects installation, bypass BO completely and type SQL to answer queries posed to him by a researcher. This contradicts the well-known dictum about “eating your own dog food”: don’t force your users to use something that you wouldn’t use yourself.)
Finally, in the case of EAV designs, where the same table can be joined to itself multiple times for set intersection operations, and many join operations are not readily predictable, setup of a “Universe” becomes so complicated that your typical developers – unless they are billing by the hour and primarily focused on job security – are likely to throw up their hands in frustration and turn in their resignations.
This is not to say that automatic join composition cannot be a source of developer productivity. The danger is of creating software that so insists on helping a user that it becomes a straitjacket or gatekeeper that persistently underestimates the user’s intelligence: it then becomes a database analog of the infamous Microsoft Office Assistant. Good software should know when to get out of the way, and should allow the power user ways of bypassing it if desired: VQBE, for example, lets developers type SQL.