Models for Big Data

Chapter in Big Data Technologies and Applications

Abstract

The principal performance driver of a Big Data application is the data model in which the Big Data resides. Unfortunately, most extant Big Data tools impose a data model upon a problem and thereby cripple their performance in some applications. The aim of this chapter is to discuss some of the principal data models that exist and are imposed, and then to argue that an industrial-strength Big Data solution needs to be able to move between these models with a minimum of effort.

This chapter has been adapted from a LexisNexis white paper authored by David Bayliss.

Notes

  1.

    In fact, the performance is so crippled that the application simply “doesn’t happen”. It is impossible to list all the things that didn’t happen, although our paper “Math and the Multi-Component Key” does give a detailed example of a problem that would take 17 hours in a key-value model and which runs at about 60 transactions per second in a structured model. It is easy to imagine that the key-value version would not happen.

  2.

    This is changing very rapidly in some areas. The other models are tackled later, but it is probably still true that today this one deserves to be tackled first.

  3.

    A rather longer and more formal treatment is given here: http://en.wikipedia.org/wiki/Database_normalization.

  4.

    Structured data does not need to imply SQL, but SQL is, without doubt, the leading method through which structured data is accessed.

  5.

    Many people use ‘exponentially’ as an idiom for ‘very’. In this chapter the term is used correctly, to denote a problem that grows as a power of the scaling factor. In this case, if you have three choices as to how to perform a join between two files, then between 10 files you have at least 3¹⁰ = 59,049 choices. In fact you have rather more, as you can also choose the order in which the files are processed; there are 10! = 3,628,800 orderings of 10 files, giving a total of about 2.14 × 10¹¹ ways to optimize a query across 10 files.
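
    The arithmetic is easy to check directly; a minimal Python sketch, using the note's own figures of three join strategies and ten files:

    ```python
    from math import factorial

    strategies = 3                              # join strategies per join (from the note)
    files = 10                                  # files participating in the query

    strategy_choices = strategies ** files      # 3^10 = 59,049 strategy combinations
    orderings = factorial(files)                # 10! = 3,628,800 file orderings

    total_plans = strategy_choices * orderings  # ~2.14e11 candidate plans
    print(f"{strategy_choices:,} x {orderings:,} = {total_plans:.2e}")
    ```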

  6.

    Because of the way most key systems work, it does not in general provide a fast access path for Year alone. Some good structured systems can access Year quickly if the earlier key component has low cardinality.

  7.

    Again, this is not an idiomatic expression. If a field is evenly distributed with a cardinality of N, then adding it as a component of a key in the search path reduces the amount of data read by a factor of N. Thus if you add two or three fields, each with a cardinality of 100, you have produced a system that will go 4–6 orders of magnitude (10,000–1,000,000×) faster.
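
    The factors simply multiply; a minimal sketch, assuming three evenly distributed fields of cardinality 100 as in the note:

    ```python
    # Each evenly distributed key component of cardinality N cuts the data
    # read by a factor of N; independent factors multiply.
    cardinalities = [100, 100, 100]   # three added fields (the note's assumption)

    speedup = 1
    for n in cardinalities:
        speedup *= n

    print(f"reduction factor: {speedup:,}x")   # 1,000,000x, i.e. 6 orders of magnitude
    ```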

  8.

    Many people refer to text as ‘unstructured data’. I have generally avoided that term as good text will usually follow the structure of the grammar and phonetics of the underlying language. Thus text is not genuinely unstructured so much as ‘structured in a way that is too complex and subtle to be readily analyzed by a computer using the technology we have available today.’ Although see the section on semi-structured data.

  9.

    Of course, people are researching this field. Watson is an example of a system that appears to be able to derive information from a broad range of text. However, if one considers that ‘bleeding edge’ systems in this field are correct about 75% of the time, it can immediately be seen that this would be a very poor way to represent data that one actually cared about (such as a bank balance!).

  10.

    Google pioneered a shift from this model; the ‘page ranking’ scheme effectively places the popularity of a page ahead of the relevance of the page to the actual search. Notwithstanding this, the relevance ranking of a page is still computed as discussed.

  11.

    Of course, one can build multi-billion-dollar empires by ‘tweaking’ this formula correctly.

  12.

    Or not care; if one is just ‘surfing the web’ then, as long as the page offered is ‘interesting enough’, one is happy; whether or not it was the ‘best’ response to the question is immaterial.

  13.

    XQuery has probably surpassed XPath in more modern installations.

  14.

    A good XML database such as MarkLogic will allow for optimization of complex queries provided the access paths can be predicted and declared to the system.

  15.

    Within the academic literature there have been numerous attempts to extend XPath towards more relational or graph-based data.

  16.

    http://en.wikipedia.org/wiki/HPCC.

  17.

    Usually including performance.

  18.

    Usually speed of update or standards conformance.

  19.

    http://en.wikipedia.org/wiki/ECL,_data-centric_programming_language_for_Big_Data.

  20.

    For those familiar with COBOL, this was a method of having narrower records whereby collections of fields would only exist based upon a ‘type’ field in the parent record.
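
    A rough modern analogue of such variant records, sketched in Python with the struct module; the record layout and type codes here are invented purely for illustration:

    ```python
    import struct

    # A 1-byte 'type' field in the parent record decides which trailing
    # fields exist, so each record is only as wide as its type requires.
    def parse_record(buf: bytes) -> dict:
        (rec_type,) = struct.unpack_from("B", buf, 0)
        record = {"type": rec_type}
        if rec_type == 1:       # hypothetical type 1 carries a 4-byte amount
            (record["amount"],) = struct.unpack_from("<i", buf, 1)
        elif rec_type == 2:     # hypothetical type 2 carries two 2-byte codes
            record["code_a"], record["code_b"] = struct.unpack_from("<hh", buf, 1)
        return record           # any other type: just the parent fields

    print(parse_record(struct.pack("<Bi", 1, 1234)))   # {'type': 1, 'amount': 1234}
    print(parse_record(struct.pack("<Bhh", 2, 7, 8)))  # {'type': 2, 'code_a': 7, 'code_b': 8}
    ```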

  21.

    Our premier entity resolution system uses multi-component keys to handle the bulk of queries and falls back to a system similar to key-value if the multi-component keys are not applicable.

  22.

    Referred to as ‘comma-separated values’, although there are many variants, most of which don’t include commas!

  23.

    In standard ASCII and Unicode formats.

  24.

    There may be other flags and weights for some applications.

  25.

    When accessing inverted indexes naively, you need to read every entry for every value that is being searched upon; this can require an amount of sequential data reading that would cripple performance. ‘Smart stepping’ is a technique whereby the reading of the data is interleaved with the merging of the data, allowing, on occasion, vast quantities of the data for one or more values to be skipped (or stepped) over. The ‘local’ case is where this is done on a single machine; ‘global’ is the case where we achieve this even when the merge is spread across multiple machines.
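
    As a single-machine illustration of the skipping idea (a sketch only, not HPCC's actual implementation), the following Python intersects sorted posting lists, using binary search to step over runs of entries instead of reading them sequentially:

    ```python
    from bisect import bisect_left

    def intersect_with_stepping(postings):
        """Intersect sorted, duplicate-free posting lists, skipping ahead
        with binary search rather than scanning every entry."""
        postings = sorted(postings, key=len)     # drive from the rarest term
        result = []
        for doc in postings[0]:
            present = True
            for plist in postings[1:]:
                i = bisect_left(plist, doc)      # steps over everything < doc
                if i == len(plist) or plist[i] != doc:
                    present = False
                    break
            if present:
                result.append(doc)
        return result

    print(intersect_with_stepping([[2, 5, 9, 13], [1, 2, 3, 5, 8, 13, 21], [5, 13, 40]]))
    # -> [5, 13]
    ```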

  26.

    Specifically, the case where the linguistic rules of grammar are not followed uniformly and thus the data really is ‘unstructured’.

  27.

    A brief treatment of these is given here: http://en.wikipedia.org/wiki/GLR_parser.

  28.

    This works particularly well (and easily) where the text being parsed is generated against a BNF grammar.

  29.

    Although it has any number of values, not just one per record.

  30.

    Fuller details of this will be published in due course, probably accompanied by a product module offering.

  31.

    This term is being used technically, not pejoratively. ECL works naturally with any combination of N-tuples; asserting that everything must be triples (or 3-tuples) is one very simple case of that.
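
    A minimal sketch of why triples are just a special case; the fact, field names and identifier below are invented for illustration:

    ```python
    # One 4-tuple fact, kept whole as a single record:
    fact = ("alice", "purchased", "widget", "2015-06-01")

    # The same fact forced into triples requires inventing a fact id
    # and decomposing the record into one triple per field:
    triples = [
        ("f1", "subject",   "alice"),
        ("f1", "predicate", "purchased"),
        ("f1", "object",    "widget"),
        ("f1", "date",      "2015-06-01"),
    ]

    # Recovering the original tuple now takes a join over the fact id:
    fields = {name: value for _, name, value in triples}
    rebuilt = tuple(fields[k] for k in ("subject", "predicate", "object", "date"))
    print(rebuilt == fact)  # True
    ```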

  32.

    Again, this is a mathematical claim, not a marketing one.

  33.

    See the independent Lawrence Livermore study referenced in the next note.

  34.

    Yoo and Kaplan from Lawrence Livermore have produced an excellent and independent study of the advantages of DAS for graph processing: http://dsl.cs.uchicago.edu/MTAGS09/a05-yoo-slides.pdf.

  35.

    A team led by LexisNexis Risk Solutions, including Sandia National Labs and Berkeley Labs, has an active proposal for further development of this system.

  36.

    As a recent example: ECL provides two sets of string libraries, one of which is Unicode-compliant and one of which is not. The non-compliant libraries execute five times faster than the compliant ones. That said, string-library performance is not usually the dominant factor in Big Data performance.

  37.

    Put simply: if you can write it in one line of ECL, the ECL compiler knows exactly what you are trying to do; if you write the same capability in 100 lines of Java, the system has no high-level understanding of what you are doing.

  38.

    As an aside, the HPCC was one of the first systems available that could linearly scale work across hundreds of servers. As such it could often provide a two-orders-of-magnitude (100×) performance uplift over the extant single-server solutions, which is clearly game-changing. For this chapter, given the title includes ‘Big Data’, it is presumed that HPCC is being contrasted with other massively parallel solutions.

  39.

    One subtle form of deception is ‘measuring something else’; the overall performance of a system needs to include latency, transaction throughput, startup time, resource requirements and system availability. There are systems that are highly tuned to one of these; that is legitimate, provided it is declared.

  40.

    This is a mathematical zero, not a marketing one. ECL/HPCC supports all of these models, and hybrids of them, within the same system and code. Many, if not all, ECL deployments run data in most of these models simultaneously and will often run the same data in different models at the same time on the same machine!

  41.

    Where a native model is an extremely unnatural model imposed upon the data by an earlier system.

Copyright information

© 2016 Springer International Publishing Switzerland

Cite this chapter

Bayliss, D. (2016). Models for Big Data. In: Big Data Technologies and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-44550-2_9

  • Print ISBN: 978-3-319-44548-9

  • Online ISBN: 978-3-319-44550-2
