By Michael Burke, Account Director
A database is probably one of the most basic structures of information technology, yet as a concept it can actually be kind of a tricky one. This is partially because of some disagreement about what actually constitutes a “database”, and also because of the rise of something that is referred to as “distributed computing”, not to mention confusing terminology and the penchant of techies to use conflicting terms to describe the same thing.
Generally speaking, a database is a collection of data points, which have some kind of logical connection to each other, and are stored on a computer in some kind of structured form that makes it possible to retrieve information. It’s also helpful to understand what a database isn’t:
- Software. Software just happens to be the medium for organizing it, and most database software is referred to as a “database management system” (dbms), rather than a “database”. It’s not a database until data is stored on it.
- Storage device. To be sure, a database has to be stored somewhere, and it always resides on some kind of storage hardware, whether it be on a single computer, an enterprise-grade server or a group of servers in the cloud (more commonly known as a “cluster”). If a database was a person, the storage device would be the house.
- Data. Although it may seem counterintuitive, data does not always reside on a computer in database form, and the mere presence of a bunch of data on a drive doesn’t constitute a database. In fact one of the major technologies enabling Big Data, Hadoop, is not considered a database, even though it is a vast collection of data. A simple way to think about it is that there is kind of a minimum threshold of structure and purpose required for a collection of data to be considered a database.
Relational Database–the grandfather of databases
A relational database is perhaps the most familiar kind of database, and it uses a structure that’s familiar to anyone who has ever looked at a spreadsheet. It’s often described as “relational” because it organizes each piece of data in in relation to another piece of data in the database. Personally, I find that description pretty useless, and it’s easier to think of it by visualizing an Excel or Google spreadsheet, where you’ve got columns and rows of data. But there is a little more to it than that. What you’re probably visualizing right now actually isn’t the whole database, but rather a section of the database referred to as a “table”. A relational database can have numerous tables, and sometimes the information in one table can be mixed and matched with the information in another table.
You may be thinking that a table is basically the same as a “tab” on a spreadsheet, but you’d only be partially right. A table is really just a presentation of the data that exists in a database, whereas in a spreadsheet, the tab is the actual data. Additionally, a relational database may be much, much bigger than anything you’d store in a spreadsheet, sometimes containing millions of rows of data. Also, most relational databases are a bit more finicky than a spreadsheet, and are set up to only contain certain kinds of data (for example, some columns can only store integers, and some can only hold a certain number of letters). This is a good thing! Otherwise they’d be full of even more inconsistencies and junk than they often are.
In a relational database, you can create tables on the fly by mixing and matching columns that may be present in other tables using something that you’ll hear referenced all the time in the database world: SQL (pronounced either “sequel” or as an acronym). SQL stands for “structured query language”, and it’s basically just a way to talk to the database and tell it to do things (the way C3PO talks to the Millenium Falcon in computerese). So when you want to get information from a relational database, you typically use some form of SQL or near-SQL language.
There are a ton of different relational databases, and the one you use probably depends on your needs and capabilities, and companies make decisions based on budget, the amount of data they anticipate having to deal with, and the purpose of the database (will you be analyzing data? Do transactions rely on the data?). Organizations that opt for an open source relational database (meaning that it’s code is open and it is typically free) may look at SQLite for “small data”, or PostgreSQL or MySQL for web applications that don’t require massive scaling. Those requiring a more secure, scalable, enterprise-grade solution may look at OracleDB or Microsoft’s SQL Server, which come with a significant price tag but are designed to serve banks and other institutions where the stakes for database performance are very high.
The term “NoSQL” applies to a set of databases that are built for very large scale usage, where the need to organize massive amounts of data outweighs some of the needs for consistency that a relational database applies. They’re usually leveraged in “distributed” environments, or in other words computing environments that leverage more than one computer. Some prefer to refer to the category as “not only SQL” rather than NoSQL, in order to emphasize that they’re not trying to replace relational technologies. The database management systems employ different schemas to accomplish their varied objectives, but typically these are simpler than those used in relational databases, and that’s a part of why they can achieve such massive scale. Other than that, there’s not really a common thread tying them together, which makes it somewhat difficult to find a coherent definition of the category. However, it’s important to note that these kinds of technologies are a large part of what has enabled companies like Google, Netflix and Facebook to thrive–the kind of scale they require couldn’t be achieved on relational databases.
There are a number of different players in the market, but you’ll often hear names like Aerospike, Redis, CouchDB, MongoDB, Google BigTable, Cassandra and HBase thrown around in database discussions. There’s also an entire category that sits between relational and NoSQL, referred to as NewSQL, which attempts to provide the best of both worlds, including MariaDB and its recently acquired ClustrixDB.
How companies decide which database to use
As mentioned earlier, budget, requirements for security and the nature of the data being collected all play heavily into database decisions. Additionally, something called “CAP theorem” also plays an important role. In short, the theorem holds that there are three competing aspects of database design, and you can’t excel in one without sacrificing in the other two areas, so you’ll always have to decide between:
- Consistency. This means that the data has “every read receives the most recent write or an error”. If that’s not very helpful don’t worry, it isn’t for most people. Think of it this way: If I withdraw the last $20 from my bank account, Wells Fargo’s database had better “write” that into the database immediately, otherwise my wife could be simultaneously withdrawing it from another ATM.
- Availability. This refers to how much of the time the system is operational. If you can’t afford to have your database down for even one minute, you require 100 percent availability, for example.
- Partition tolerance. This involves the ability to break up the database into parts and store it on multiple computers–this is the “distributed computing” that we referenced earlier, and is extremely important for databases that have to handle massive amounts of data (such as many of those used by Facebook, Google and other internet giants).
As you can imagine, companies often find that their applications really do need all three of these capabilities, so sometimes they have to make some tough decisions in building their database infrastructure. Fortunately technology is making its way around CAP theorem, so the sacrifices aren’t always as stark as they used to be.
As Porky Pig used to say, “TTTTTThat’s all, folks!”
…but for those seeking higher wisdom and knowledge, check out: Hybrid IT, in Plain English