To start, even though Data Warehouse (DWH) is often used as a term for a specific piece of technology, such as Snowflake or BigQuery, if we dig into the definition, DWH is not about a technology but about a use case. Let me quote Wikipedia:
[A] data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence.
Often a DWH system is some kind of database, which is a nice segue to Operational Databases. The name clearly states the use case: operational. But what is a database? Let's ask Wikipedia again:
[A] database is an organized collection of data stored and accessed electronically.
Often when we talk about operational databases (DB), we are (or should be) talking about relational databases such as PostgreSQL. Those DBs could also be document-oriented, such as MongoDB. The critical point here is the organized collection: databases are about structured data, be it tables or JSON, not any blob of files on some filesystem.
But as the question was about technologies, let's look at the different characteristics of these two use cases and what they require from the technology.
On a high level, these two use cases, Operational vs. Analytical, have different query patterns, traditionally divided into OLTP and OLAP:
In operational systems, the pattern is (usually) OLTP: there are large numbers of short and small queries (and transactions). Queries touch single or few rows at a time, but as there are a lot of users, there are a lot of queries happening all the time. You are accessing your data; some other user is happily accessing theirs. Databases built for these use cases focus on fast query processing and data integrity. Transactions are the key! One wants to ensure that the data is always correct, no matter what. For example, you want to ensure that an order is released only when the payment is made. Both operations need to happen atomically together.
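The order-and-payment example can be sketched with Python's built-in sqlite3. The schema, table names, and values here are invented for illustration, not from any real shop:

```python
import sqlite3

# Hypothetical shop schema, made up for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE payments (order_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 'pending');
""")

def pay_order(order_id: int, amount: float) -> None:
    # Both statements commit together or not at all: that is the
    # atomicity an OLTP database guarantees.
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO payments VALUES (?, ?)", (order_id, amount))
        conn.execute(
            "UPDATE orders SET status = 'released' WHERE id = ?", (order_id,)
        )

pay_order(1, 9.99)
status = conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()[0]
print(status)  # released
```

If anything inside the `with conn:` block raises, neither the payment row nor the status update survives, which is exactly the "both or nothing" behavior described above.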
In analytical systems, the access pattern is entirely different: there are only a few queries, but those touch a lot of data simultaneously. These queries usually have a lot of aggregations and group by's: roll-ups (consolidation), drill-downs, and slicing & dicing. Databases built for these use cases focus on quickly processing a lot of data spread across many rows, like calculating the total sales for a year from all the individual transactions. Often you don't even need the latest real-time data.
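A typical OLAP-style roll-up, again sketched with sqlite3 (the sales table and numbers are made up): the query reads every row but returns only a few aggregated ones.

```python
import sqlite3

# Illustrative only: the table and values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2023-01-05", 10.0), ("2023-01-20", 5.0), ("2023-02-02", 7.5)],
)

# Roll up individual transactions into monthly totals: touches all
# rows, aggregates them down to one row per month.
rows = conn.execute(
    "SELECT substr(day, 1, 7) AS month, SUM(amount) "
    "FROM sales GROUP BY month ORDER BY month"
).fetchall()
print(rows)  # [('2023-01', 15.0), ('2023-02', 7.5)]
```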
And yes, some systems try to work for both; these systems are called HTAP, Hybrid Transactional/Analytical Processing; examples include TiDB and the new Unistore in Snowflake.
In addition, as computers have gotten faster and technology has advanced, doing analytical work doesn't necessarily require an OLAP-optimized system. Databases designed for OLTP workloads, such as PostgreSQL, can often be used for analytical purposes, but the opposite doesn't hold: using an OLAP-optimized system such as BigQuery as an operational database will get messy.
Why do different use cases require different types of systems? As always, technology is all about trade-offs; you cannot have everything. Usually, OLTP systems are row-oriented and OLAP systems are column-oriented. But what does this mean?
Even though we think of data in table format, eventually it gets written to a hard drive, where it is just an array of bytes. So how we "flatten the table" dictates how the data gets stored and accessed. In a row-oriented system, a single row is stored together; in a column-oriented system, a single column is stored together. Let's look at this through an example.
A simple table:
In the row-oriented system, the data would be stored like this:
And in column-oriented:
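The table and the two flattened layouts can be shown as a minimal Python stand-in; the table and its values are made up for illustration:

```python
# A simple made-up table: (id, name, sales) per row.
table = [
    (1, "Alice", 100),
    (2, "Bob", 200),
    (3, "Carol", 300),
]

# Row-oriented: each row's values sit next to each other on disk.
row_oriented = [value for row in table for value in row]
print(row_oriented)
# [1, 'Alice', 100, 2, 'Bob', 200, 3, 'Carol', 300]

# Column-oriented: each column's values sit next to each other.
column_oriented = [value for column in zip(*table) for value in column]
print(column_oriented)
# [1, 2, 3, 'Alice', 'Bob', 'Carol', 100, 200, 300]

# Summing the sales column only needs the last contiguous chunk:
sales = column_oriented[2 * len(table):]
print(sum(sales))  # 600
```

In the row layout, reading row 2 means reading one contiguous slice; in the column layout, summing sales means reading one contiguous slice. Each layout makes one access pattern cheap and the other expensive.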
But why does this matter? Remember hard drives, the ones with the spinning disk? On those, reading sequential data is much faster than random access. So in a system that usually touches single rows at a time, having the row's data together makes it super duper fast to read a single row. And in a system that wants to read, say, monthly sales, just reading the sales column in the range that covers the month is fast.
OLTP systems often also offer different ways to index the data, so that finding the right rows is fast based on the common search patterns. But keeping indices correct when data changes is a super complex problem. For this reason, OLAP systems usually don't have indices, and as such, finding a single row requires scanning through all the data; it can be fast, but it does require resources, and so a column-oriented system only works well when you need to access a lot of data in a single query.
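You can watch an index change the access path in miniature with SQLite's query planner; this is a sketch, with an invented table and index name:

```python
import sqlite3

# Made-up schema for demonstrating index vs. full scan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(1000)],
)

def plan(sql: str) -> str:
    """Ask SQLite how it would execute the query."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM users WHERE email = 'user42@example.com'"
plan_before = plan(query)  # without an index: a full SCAN of the table
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan(query)   # with the index: a SEARCH of just the matching rows
print(plan_before)
print(plan_after)
```

The first plan has to look at every row to find one email; after `CREATE INDEX`, the planner switches to searching the index, which is exactly the point-lookup speedup OLTP systems are built around.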
Now that disks are faster, if one needs an HTAP system, then unless the data is huge, a row-oriented OLTP system, such as PostgreSQL, can be a suitable solution.
But which database to choose for which use case? Well, as usual, it depends. I usually say that you can select any database as long as it's PostgreSQL. It's my go-to always, unless there are some special requirements. Most analytical use cases are small data, and computers are super powerful; even a MacBook Air packs enough power to crunch through most analytical workloads.
Even though I like PostgreSQL a lot, even with hosted services such as Aiven, AWS RDS, Google Cloud SQL, and Neon, it requires some management, like deciding what size of machine to run. And that sucks ass when you want to run queries and not think about anything else. Because of this, if you can use Google Cloud, BigQuery is a fantastic choice for a DWH: it simply works, and you pay for what you use.
With BigQuery, one does need to be careful about the cost. You are charged by the amount of data your query processes, and as discussed in the previous section, with a column-oriented system without indices, fetching a single row will require you to scan the whole fucking table. Just be careful, ok?