Big Data

Fox, Charles

doi:10.1007/978-3-319-72953-4_10

Part of the book series: Springer Textbooks in Earth Sciences, Geography and Environment (STEGE)

Abstract

There is no standard definition of “big” or “small” data, but we will define: small data sets are those which can be held and analyzed in a computer’s memory by consumer applications such as spreadsheets and scripting languages.

Notes

1.
People who draw these diagrams have job titles like “Enterprise architect” and command some of the highest salaries in IT, such as 100k+ GBP/year or 1000+ GBP/day in 2018.
2.
Typically by consultants who want to charge you a lot of money to do something about them. Often the same people who draw enterprise diagrams.
3.
This view of AI is the source of the old media caricature of robots crashing with a “does not compute” error when faced with contradictions. Later AI research did manage to avoid such spectacular logical blow-ups by using different logical foundations, such as paraconsistent logic in “Truth Maintenance Systems”. In these systems, you might prove that the car is red and also prove that the car is blue, but it no longer follows from this that 0 = 1, because the contradictions are contained in logical domains and indicate problems with your assumptions rather than with the state of the world. This is based on human reasoning, which very often deduces contradictory results without them exploding. See (Doyle 1979, A truth maintenance system, Artificial Intelligence 12(3):231–271).
4.
This makes life hard for other researchers who happen to have their own conference seasons at the same time, and it is not unknown for dirty tricks to be played to optimize one group’s compute time at the expense of others’ when this happens. For some reason, physicists always seem to get top priority. Depending on your point of view, this may be because their work is more fundamental to the progress of human knowledge, or because they got the funding to set up the cluster.
5.
blog.cloudera.com/blog/2014/08/bayesian-machine-learning-on-apache-spark/.
6.
This may also depend on the style of object orientation used by the programming language. In statically typed languages like C++, all classes used in the program must be defined at compile time, but a dynamically “duck-typed” language like Python can construct new classes on the fly at run time, which could in theory be used to represent arbitrary relations as objects. I don’t know whether anyone has tried this yet.
7.
LMDB, the “Lightning Memory-Mapped Database”, also has this form, though it is optimized for very fast retrieval on a single machine, which is useful for machine learning training.
8.
For example, GIS Tools for Hadoop: esri.github.io/gis-tools-for-hadoop.
9.
For example, http://blog.cloudera.com/blog/2014/08/bayesian-machine-learning-on-apache-spark/ is a project to run distributed PyMC on Spark.
10.
However, Hadoop is critically dependent on the use of tab, newline and carriage return characters. Binary data may contain these in some bytes, which you need to remove. One way to do this is to replace them with some other long strings using regexes, then replace those strings with the original characters when you are ready to process the data in your mapper. Alternatively, libraries and “SeqFiles” can be used to do something similar.
11.
Native Linux users should prefix with sudo.
12.
As suggested by ITS Leeds student Aseem Awad. More generally and usefully this could be used to predict total flows as counted by sensors such as temporary induction loops, after the temporary sensors have been used for calibration then removed.
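
Note 6 mentions that Python can construct new classes at run time. A minimal sketch of that idea follows, using the built-in `type()` constructor. The helper `make_relation_class` and the `Detection` relation with its field names are hypothetical, introduced only for illustration; they are not from the chapter.

```python
from typing import Any, Tuple


def make_relation_class(name: str, fields: Tuple[str, ...]) -> type:
    """Build a new class at run time with one attribute per field.

    Hypothetical helper: the name and fields would typically come from
    a schema that is only known while the program is running.
    """

    def __init__(self, **values: Any) -> None:
        missing = set(fields) - set(values)
        extra = set(values) - set(fields)
        if missing or extra:
            raise TypeError(
                f"{name}: missing={sorted(missing)}, unexpected={sorted(extra)}"
            )
        for field in fields:
            setattr(self, field, values[field])

    def __repr__(self) -> str:
        args = ", ".join(f"{f}={getattr(self, f)!r}" for f in fields)
        return f"{name}({args})"

    # type(name, bases, namespace) creates the class object itself.
    return type(
        name,
        (object,),
        {"__init__": __init__, "__repr__": __repr__, "fields": fields},
    )


# A made-up relation, defined only at run time.
Detection = make_relation_class(
    "Detection", ("sensor_id", "timestamp", "speed_kmh")
)
row = Detection(sensor_id=17, timestamp="2018-02-28T09:00:00", speed_kmh=48.5)
print(row)
```

Whether this is a practical way to map arbitrary relations onto objects is left open, as in the note; the sketch only shows that the language mechanism exists.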
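
The escaping step described in note 10 can be sketched as below. This is an assumption-laden illustration, not the chapter's own code: instead of long placeholder strings it uses short backslash escapes (and escapes the backslash itself), so that the round trip is unambiguous for any input bytes. The function names `encode_record` and `decode_record` are hypothetical.

```python
import re

# Map each problematic byte (and the escape byte itself) to a safe
# two-byte sequence containing no tab, newline or carriage return.
_ESCAPES = {b"\\": b"\\\\", b"\t": b"\\t", b"\n": b"\\n", b"\r": b"\\r"}
_UNESCAPES = {escaped: raw for raw, escaped in _ESCAPES.items()}

_ENCODE_RE = re.compile(rb"[\\\t\n\r]")   # bytes that must be escaped
_DECODE_RE = re.compile(rb"\\[\\tnr]")    # escape sequences to restore


def encode_record(raw: bytes) -> bytes:
    """Return a copy of raw that contains no tab, newline or CR bytes."""
    return _ENCODE_RE.sub(lambda m: _ESCAPES[m.group(0)], raw)


def decode_record(safe: bytes) -> bytes:
    """Invert encode_record, e.g. at the start of a mapper."""
    return _DECODE_RE.sub(lambda m: _UNESCAPES[m.group(0)], safe)


# Round trip over every possible byte value.
payload = bytes(range(256))
safe = encode_record(payload)
restored = decode_record(safe)
```

The encoded form can then be written one record per line, and `decode_record` applied when the data is ready to be processed.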

Author information

Authors and Affiliations

Institute for Transport Studies, University of Leeds, Leeds, UK
Charles Fox

Corresponding author

Correspondence to Charles Fox.

About this chapter

Cite this chapter

Fox, C. (2018). Big Data. In: Data Science for Transport. Springer Textbooks in Earth Sciences, Geography and Environment. Springer, Cham. https://doi.org/10.1007/978-3-319-72953-4_10

DOI: https://doi.org/10.1007/978-3-319-72953-4_10
Published: 28 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72952-7
Online ISBN: 978-3-319-72953-4
eBook Packages: Earth and Environmental Science, Earth and Environmental Science (R0)
