By Michael Burke | October 7, 2019

What makes a ‘data lake’ a lake? You can throw as much stuff into it as you want, as fast as you want. In my last post we talked about Map+Reduce–one of Hadoop’s ‘tricks’ for ingesting and storing a ton of data of just about any type imaginable at breakneck speed. This week we’ll talk about “schema-on-read”, the other big piece of the puzzle. Much of the data that organizations collect may not fit into the standard relational databases that companies use. Organizations have traditionally used a ‘schema-on-write’ approach (typically implemented via ETL–extract, transform, and load), where data has to fit a specified schema before it is stored. This is all fine and dandy, but the catch is that you have to know how you’re going to define and use the data before you load it.

For example, if I had data that said “Michael Burke, 13B79 $25, 10/29/18, Big Fish”, I’d need a database table schema that told me that some guy named Michael Burke, with a customer ID of 13B79, purchased a Blu-ray copy of “Big Fish” for $25 on October 29 of last year. With most enterprise databases, it’s not just that the data doesn’t make any sense without this schema–the database is literally configured to prevent you from inputting undefined data. With Big Data, that level of definition isn’t always feasible: data that might be highly useful doesn’t necessarily fit into a schema, and you certainly don’t always know what you’re going to use it for. Additionally, the speed and volume at which you’re collecting it make fitting it all into a schema impractical. Hadoop pioneered the “data lake” concept–capture all your data now, and decide what to do with it later.
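To make the contrast concrete, here’s a minimal schema-on-write sketch in Python using SQLite. The table and column names are hypothetical, invented purely for this illustration–the point is that the schema has to exist, and the raw string has to be parsed to fit it, before anything can be stored.

```python
import sqlite3

# Minimal schema-on-write sketch (table and column names are hypothetical).
# The schema must be defined before any row can be stored.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE purchases (
        customer_name TEXT,
        customer_id   TEXT,
        price_usd     REAL,
        purchase_date TEXT,
        title         TEXT
    )
""")

# The raw string "Michael Burke, 13B79 $25, 10/29/18, Big Fish" has to be
# broken into the predefined columns at write time.
conn.execute(
    "INSERT INTO purchases VALUES (?, ?, ?, ?, ?)",
    ("Michael Burke", "13B79", 25.00, "2018-10-29", "Big Fish"),
)
conn.commit()

# A row with the wrong number of columns would be rejected outright.
```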

Lessons on Read Vs. Write from Scrooge

Before we get too deep into schema-on-read, it’s important to understand what “write” and “read” mean in database parlance. Writing is nothing more than putting data in–think of an old-school ledger, where Ebenezer Scrooge inks the following:

“Bob Cratchit, Hired: 12/25/1873 at 6 shillings/week; pay as of 12/25/1883: 6 shillings/week”

The old miser has performed a database “write”. What’s a “read”? When he puts on his glasses and reads what he’s written, he’s performed a database “read”. Similarly, when we input data into or query information from a database, we’re performing a “write” and a “read”, respectively.
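If it helps to see it in code, here’s a toy version of Scrooge’s ledger in Python–appending a line is a “write”, pulling it back out is a “read”. The file name is made up for the example.

```python
# A toy ledger: appending a line is a database-style "write",
# pulling it back out is a "read". The file name is made up.
LEDGER = "scrooge_ledger.txt"

# Write: put data in.
with open(LEDGER, "a") as ledger:
    ledger.write("Bob Cratchit, Hired: 12/25/1873 at 6 shillings/week; "
                 "pay as of 12/25/1883: 6 shillings/week\n")

# Read: get the data back out.
with open(LEDGER) as ledger:
    for entry in ledger:
        print(entry.strip())
```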

Avoiding the Traffic Jams of Big Data

Imagine if, in order to get into your favorite pop star’s concert, everyone had to fill out a form with the ticket taker–even a very short form would dramatically slow everything down. Likewise, when the old-school ETL approach is used with Big Data, bottlenecks occur, servers crash, data gets discarded, and generally bad things happen. Hadoop makes Big Data useful in its raw form–I could actually store “Michael Burke, 13B79 $25, 10/29/18, Big Fish” without a neatly defined table. The ‘schema’ is applied when you’re ready to do something with the data, rather than when it’s loaded.

The “schema-on-read” approach–made possible by the Map+Reduce paradigm–allows us to apply structure to data only when it suits our needs. It also allows the data to be manipulated during “reads” in such a way that the original data “written” to the system is never altered. The basic idea: collect the data now, and decide what to do with it once you actually know what you need to do. Contrast this with the ETL/schema-on-write/data-warehousing approach, where you’re forced to make assumptions about how the data might eventually be used before storing it, and you begin to see the value of Hadoop.
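Here’s a rough sketch of the idea in Python. The raw records (including a second made-up one) are stored exactly as they arrive, and a schema–with field names I’ve invented for illustration–is applied only when a read actually needs one. A different read could interpret the same raw lines entirely differently, and the originals are never altered.

```python
# Raw records land in the "lake" exactly as they arrived -- no table required.
raw_records = [
    "Michael Burke, 13B79 $25, 10/29/18, Big Fish",
    "Jane Doe, 77C12 $15, 11/02/18, Casablanca",
]

def read_as_purchases(records):
    """Apply a schema at read time. The field names are invented for the
    example; another read could parse the same raw lines differently,
    and the raw records themselves are never altered."""
    for line in records:
        customer, id_and_price, date, title = [p.strip() for p in line.split(",")]
        customer_id, price = id_and_price.split()
        yield {
            "customer": customer,
            "customer_id": customer_id,
            "price": float(price.lstrip("$")),
            "date": date,
            "title": title,
        }

for row in read_as_purchases(raw_records):
    print(row)
```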

Other advantages of the schema-on-read approach include the ability for multiple business units to use the same data simultaneously–no single department has to own a dataset, so it can be made available to everyone.

Fault Tolerance: “It’s Not My Fault”

Both clustering and the schema-on-read paradigm support what’s called “fault tolerance”–the ability of the system to keep running when a server goes down, rather than being thrown into chaos. With clustering, every block of information is retained as a copy on at least one other node (though no single node contains all the information). Imagine that you had a group of five or six people in your department, but for every team member there was at least one other who shared their knowledge and skill set–if a team member quit or fell ill, you’d be able to carry on fairly easily. Similarly, with Hadoop, when a server goes down, the other servers can instantly adapt and continue operating without a hiccup.
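The sketch below is a toy illustration of that replication idea in Python–not HDFS itself, just the principle that when every block lives on more than one node, losing a single node loses nothing.

```python
import itertools

# Toy block replication (not HDFS itself): every block is stored on more than
# one node, so losing any single node never loses data.
REPLICATION_FACTOR = 2
nodes = {f"node-{i}": set() for i in range(1, 5)}
blocks = ["block-A", "block-B", "block-C", "block-D"]

# Place each block on REPLICATION_FACTOR different nodes, round-robin.
node_cycle = itertools.cycle(nodes)
for block in blocks:
    for _ in range(REPLICATION_FACTOR):
        nodes[next(node_cycle)].add(block)

def readable_blocks(live_nodes):
    """Every block still reachable from the nodes that remain up."""
    return set().union(*(nodes[n] for n in live_nodes))

# Simulate node-2 going down: all four blocks are still readable elsewhere.
survivors = [name for name in nodes if name != "node-2"]
print(readable_blocks(survivors))
```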

The way Hadoop pulls this off is similar to how human teams do it. Most teams have a manager who determines who does what and can reassign work when someone leaves or is out sick. Similarly, Hadoop clusters (‘teams’ of computers, or servers, assigned to a data set) have a “Job Tracker” that tells the other servers (or ‘nodes’), called “Task Trackers”, what to do. When a Task Tracker goes down, the Job Tracker reassigns its work.
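Here’s a toy version of that arrangement in Python–not actual Hadoop code, just a sketch of a “job tracker” handing out tasks and reassigning them when a “task tracker” disappears.

```python
# A toy "job tracker" (not real Hadoop code) that hands tasks to task trackers
# and reassigns work when one of them goes down.
class JobTracker:
    def __init__(self, task_trackers):
        self.task_trackers = set(task_trackers)
        self.assignments = {}  # task -> task tracker

    def assign(self, tasks):
        trackers = sorted(self.task_trackers)
        for i, task in enumerate(tasks):
            self.assignments[task] = trackers[i % len(trackers)]

    def tracker_failed(self, dead_tracker):
        """Drop the failed tracker and hand its tasks to the survivors."""
        self.task_trackers.discard(dead_tracker)
        orphaned = [t for t, tr in self.assignments.items() if tr == dead_tracker]
        self.assign(orphaned)

jt = JobTracker(["tracker-1", "tracker-2", "tracker-3"])
jt.assign(["map-1", "map-2", "map-3", "reduce-1"])
jt.tracker_failed("tracker-2")   # map-2 is quietly reassigned to a live tracker
print(jt.assignments)
```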

So that pretty much wraps it up. In sum, Hadoop is a system for handling the 3 Vs of Big Data: volume, velocity, and variety. It is purpose-built for massive amounts of data and currently isn’t ideal for smaller data volumes. However, because it’s designed to be built on–it’s really an ‘ecosystem’ of applications rather than a single application–the sky is the limit for what can be done with it, so we can expect to see some very interesting things.

Related blog topics:

Making Sense of the Wild World of Hadoop
What is Virtualization Anyway, and What Are Its “Virtues”?
Removing “The Cloud” from the Mysterious Ether