By Michael Burke | February 3, 2020
For a business to have a systematic process and infrastructure for deriving insight from data, which should be a goal of every organization, it needs an organized repository of all the data that its operations and analytical questions might require. That repository is generally known as the ‘data warehouse’.
The library vs. the book
What’s the difference between a data warehouse and a database? Honestly, it gets a little murky, because in some instances a data warehouse might actually be one giant database. But generally speaking, you can think of a data warehouse as the library and a database as the book. A data warehouse stores and organizes data from a variety of systems, whereas a database usually collects data from one source.
Then there’s the ‘data lake’, often built on a platform such as Hadoop, which on the surface might sound a lot like a data warehouse. The difference is in how they’re organized (or not organized). A data warehouse, like a relational database, has a predefined structure (a schema). Just as you wouldn’t see a librarian throwing books on any old shelf, a data warehouse is designed to store data in a very specific way. If data doesn’t fit its schema, it doesn’t go in. Another way of saying it is that the data has been ‘processed’, or put in a form that’s more or less ready to use (note the ‘more or less’: this is the case in theory but not necessarily in practice).
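To make ‘if data doesn’t fit its schema, it doesn’t go in’ concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for a warehouse. The table name, columns, and the load helper are hypothetical, invented purely for illustration; a real warehouse would be a much larger system, but the principle of rejecting nonconforming rows is the same.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)


def load(row):
    """Try to insert one row; return True if the schema accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sales (order_id, customer, amount) VALUES (?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        # The row violated a constraint, so it does not go in.
        return False


accepted = load((1, "Acme", 120.50))
rejected = load((2, None, 75.00))  # missing customer violates NOT NULL
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Only the first row ends up in the table; the second is turned away at the door because it has no customer.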
The data lake, on the other hand, is a place where you can throw any data in its raw, unprocessed form. This makes it perfect for organizations that are collecting data faster than they know what to do with it. If the idea of throwing a bunch of stuff in a pristine lake is unappealing, think of it as a garage. But we’ll talk more about data lakes in our section on Hadoop. Suffice it to say for now that a data lake takes anything, while a data warehouse is highly selective.
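The ‘takes anything’ side of that contrast can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the lake list, the ingest and read_orders helpers, and the sample records are all hypothetical, and a real data lake would sit on distributed storage rather than an in-memory list. The point is that nothing is checked on the way in, and structure is only applied when someone reads the data.

```python
import json

# A hypothetical "lake": raw records kept exactly as they arrived.
lake = []


def ingest(raw):
    """Store a record as-is; nothing is validated or rejected."""
    lake.append(raw)


ingest('{"order_id": 1, "customer": "Acme", "amount": 120.5}')
ingest('{"clicked": "signup-button", "ts": "2020-02-03T10:15:00"}')
ingest("free-form log line, not even JSON")


def read_orders(records):
    """Apply structure at read time: keep only records that look like orders."""
    required = {"order_id", "customer", "amount"}
    orders = []
    for raw in records:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON at all; skip it for this particular question
        if isinstance(doc, dict) and required <= doc.keys():
            orders.append(doc)
    return orders


orders = read_orders(lake)
```

All three records land in the lake, but only one of them turns out to be usable as an order once a structure is imposed at read time.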
Though there be data warehouses many…
Data warehouses can have subunits too, called ‘data marts’: big companies sometimes find that it makes more sense to have smaller repositories that are specific to certain departments. Like databases, data warehouses can be physically housed in several different ways. A company might house its warehouse in its own data center, or on servers that it rents from a cloud provider like AWS or Google.
Regardless of where they reside, data warehouses share a fundamental goal: data consistency. You see, a big problem in data management is that there are multiple copies of the same data, and those copies don’t always agree with each other, which can lead to very sloppy analytics and business intelligence that is, well… not so intelligent. The data warehouse, therefore, can be seen as the system that imposes order on what would otherwise be data chaos.
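The ‘copies that don’t agree’ problem is easy to picture with a small Python sketch. The two systems, the customer IDs, and the find_conflicts helper below are hypothetical examples, not part of any particular product; the sketch simply compares the same records as held by two sources and reports where they differ.

```python
from collections import defaultdict

# Hypothetical copies of the same customer records held by two systems.
crm = {"C001": "alice@example.com", "C002": "bob@example.com"}
billing = {"C001": "alice@example.com", "C002": "robert@example.com"}


def find_conflicts(sources):
    """Return the keys whose values differ between source systems."""
    seen = defaultdict(set)
    for system in sources.values():
        for key, value in system.items():
            seen[key].add(value)
    # A key is in conflict when more than one distinct value was seen for it.
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


conflicts = find_conflicts({"crm": crm, "billing": billing})
```

Here customer C002 has two different email addresses, which is exactly the kind of disagreement a warehouse is meant to resolve into a single agreed version.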