Getting Started with Databricks

Large Scale Data Analytics

This segment introduces large scale data analytics and some of the tools being used today.

Keywords

  • Databricks
  • large scale data analytics
  • data analytics tools

About this video

Author(s)
Robert Ilijason
First online
01 May 2021
DOI
https://doi.org/10.1007/978-1-4842-6919-0_1
Online ISBN
978-1-4842-6919-0
Publisher
Apress
Copyright information
© Robert Ilijason 2021

Video Transcript

So what is large-scale data analytics? Well, it is analytics on a larger scale. Traditionally, when you did data analytics, you pulled all the data from your sources, stored it in a data warehouse, organized it, and presented the result either in reports or in a tool like Qlik or Tableau, where the end users could play around with it.

We’ve had statistics and machine learning for many, many years, but they weren’t really viable at large scale unless you used proprietary tools and very expensive machinery. Right now, we’re in the middle of a breakthrough in this field, one that makes large-scale data analytics, meaning analytics on a massive amount of data, possible. One of the drivers is the availability of data. We produce more data today than we ever have before, and the amount just keeps increasing.

We also have new types of data, or at least new types of data that we can do analytics on. For instance, doing analytics on images, videos, sounds, and human-readable text was almost impossible before. Not totally impossible, but it was both very expensive and hard. Today, it isn’t anymore. We also have the ability to store almost everything that happens, so all your tweets, everything that happens on Facebook, and everything that happens in machine-to-machine communication can be stored for analytics. Today, we can buy petabytes of storage for the money we used to pay for megabytes 20 years ago.

Another driver is cloud technology. So in the past, if you wanted to do large-scale data analytics, you basically had to set up your own data center. This was very expensive and only available to a few companies in the world, the really big ones.

Today, however, we have cloud providers like Amazon, Microsoft, and Google who can give you access to their data centers, where they take care of all the technical stuff, and you just pay as you go. You can use the machines you need and pay only for the minutes you use them. This gives you an opportunity to scale hugely across hundreds of machines for the day or week you actually need them, without having to pay the hardware cost up front.

Finally, we also have the tools. The interest in this field has created more tools, and the best thing is that they are free, both as in speech and as in beer. You can literally go to the sites of these tools, download them, and use them without paying a dime. And if you’re interested, you can look at the source code to see how they actually work underneath the surface.

This has made data analytics, overall, more accessible to more people, which has fueled even more tools. So together, data, cloud, and tools have enabled large-scale data analytics for a lot of people. It used to be the realm of big enterprises. Today, if you have $100, you can do some pretty darn big data analytics. So let’s look at one of those tools in the next session.