Big Data

Data Science for Transport

Abstract

There is no standard definition of “big” or “small” data, but we will define: small data sets are those which can be held and analyzed in a computer’s memory by consumer applications such as spreadsheets and scripting languages.

Notes

  1.

    People who draw these diagrams have job titles like “Enterprise architect” and command some of the highest salaries in IT, such as 100k+ GBP/year or 1000+ GBP/day in 2018.

  2.

    Typically by consultants who want to charge you a lot of money to do something about them. Often the same people who draw enterprise diagrams.

  3.

    This view of AI is the source of the old media caricature of robots crashing with a “does not compute” error when faced with contradictions. Later AI research did manage to avoid such spectacular logical blow-ups by using different logical foundations, such as paraconsistent logic in “Truth Maintenance Systems”. In these systems, you might prove that the car is red and also prove that the car is blue, but it no longer follows from this that 0 = 1, because the contradictions are contained within logical domains and indicate problems with your assumptions rather than with the state of the world. This is modelled on human reasoning, which very often deduces contradictory results without them exploding. See (Doyle 1979, A truth maintenance system, Artificial Intelligence 12(3):231–271).

  4.

    This makes life hard for other researchers who happen to have their own conference seasons at the same time, and it is not unknown for dirty tricks to be played to optimize one group’s compute time at the expense of others’ when this happens. For some reason, physicists always seem to get top priority. Depending on your point of view, this may be because their work is more fundamental to the progress of human knowledge, or because they got the funding to set up the cluster.

  5.

    blog.cloudera.com/blog/2014/08/bayesian-machine-learning-on-apache-spark/.

  6.

    This may also depend on the style of object orientation used by the programming language. In statically typed languages like C++, all classes used in the program must be defined in advance, but a “duck-typed” language like Python can construct new classes on the fly at run time, which could in theory be used to represent arbitrary relations as objects. I don’t know if anyone has tried this yet.

  7.

    Also, LMDB, the “Lightning Memory-Mapped Database”, has this form, though it is optimized for very fast retrieval on a single machine, which is useful for machine learning training.

  8.

    For example, GIS Tools for Hadoop, esri.github.io/gis-tools-for-hadoop.

  9.

    For example, http://blog.cloudera.com/blog/2014/08/bayesian-machine-learning-on-apache-spark/ is a project to run distributed PyMC on Spark.

  10.

    However, Hadoop depends critically on tab, newline and carriage return characters as delimiters. Binary data may contain these values in some bytes, so you need to remove them first. One way to do this is to replace them with other, long strings using regexes, then replace those strings with the original characters when you are ready to process the data in your mapper. Alternatively, libraries and “SeqFiles” can be used to do something similar.

  11.

    Native Linux users should prefix with sudo.

  12.

    As suggested by ITS Leeds student Aseem Awad. More generally and usefully this could be used to predict total flows as counted by sensors such as temporary induction loops, after the temporary sensors have been used for calibration then removed.
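The idea raised in note 6, building classes at run time to represent relations, can be sketched with Python's built-in `type()`. This is only an illustration under assumptions: the helper `make_relation_class`, the relation name `DetectedAt`, and its field names are hypothetical and do not come from the chapter.

```python
def make_relation_class(name, fields):
    """Build a new class at run time with one attribute per field.

    Hypothetical helper, for illustration only.
    """
    fields = tuple(fields)

    def __init__(self, *values):
        if len(values) != len(fields):
            raise TypeError(f"{name} expects {len(fields)} values")
        for field, value in zip(fields, values):
            setattr(self, field, value)

    def __repr__(self):
        args = ", ".join(f"{f}={getattr(self, f)!r}" for f in fields)
        return f"{name}({args})"

    # type(name, bases, namespace) creates the class object dynamically.
    namespace = {"__init__": __init__, "__repr__": __repr__, "fields": fields}
    return type(name, (object,), namespace)


# A relation whose shape was not known when the program was written
# (assumed example schema).
DetectedAt = make_relation_class(
    "DetectedAt", ["vehicle_id", "sensor_id", "timestamp"]
)
row = DetectedAt("AB12CDE", 7, "2018-01-01T08:00:00")
print(row)
```

The field list could equally be read from a database schema at run time, which is the case the note has in mind; a statically compiled language would need those classes declared before the program is built.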
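The regex substitution described in note 10 might look like the following minimal sketch. The placeholder tokens and the function names `escape_record` and `restore_record` are assumptions made for illustration, not part of Hadoop or of the chapter; the scheme only round-trips correctly if the tokens never occur in the original data.

```python
import re

# Hypothetical placeholder tokens: long byte strings assumed not to
# occur anywhere in the binary data being escaped.
PLACEHOLDERS = {
    b"\t": b"<<TAB_8f3a>>",
    b"\n": b"<<NL_8f3a>>",
    b"\r": b"<<CR_8f3a>>",
}
RESTORE = {token: char for char, token in PLACEHOLDERS.items()}

_ESCAPE_RE = re.compile(rb"[\t\n\r]")
_RESTORE_RE = re.compile(b"|".join(re.escape(token) for token in RESTORE))


def escape_record(data: bytes) -> bytes:
    """Replace tab, newline and carriage return bytes with placeholders."""
    return _ESCAPE_RE.sub(lambda m: PLACEHOLDERS[m.group(0)], data)


def restore_record(data: bytes) -> bytes:
    """Inverse of escape_record, to be called inside the mapper."""
    return _RESTORE_RE.sub(lambda m: RESTORE[m.group(0)], data)


# Example binary payload containing all three delimiter bytes.
payload = bytes(range(256))
escaped = escape_record(payload)
recovered = restore_record(escaped)
```

After escaping, each record can be written on its own line without the payload's bytes being mistaken for delimiters, and the mapper calls `restore_record` before doing any real processing.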

Author information

Corresponding author

Correspondence to Charles Fox.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Fox, C. (2018). Big Data. In: Data Science for Transport. Springer Textbooks in Earth Sciences, Geography and Environment. Springer, Cham. https://doi.org/10.1007/978-3-319-72953-4_10
