DataLakes & DataWarehouses: How they are used in SEO

Although the concepts of DataWarehouses and DataLakes became part of the everyday language of Data Analysts and Data Scientists a long time ago, we have only been hearing about them in other industries for the past few years.
For example, Web Analysts and SEO Experts are beginning to take a serious look at these concepts, due to the nature of their jobs and the strong connection that exists between what they do and data manipulation. Many recent articles talk about the interest of implementing an SEO DataLake or an SEO DataWarehouse, treating the two terms as interchangeable and without making a distinction between the two.

In this article, we will guide you in determining the differences between DataLakes and DataWarehouses in order to understand their purposes and their use cases in SEO and web analytics.

DataWarehouse: structured storehouse for data

The first use of the term “DataWarehouse” dates back to 1988 in a paper by Paul Murphy and Barry Delvin, An architecture for a business and information systems. This article gives us a first definition of the concept as an easy-to-access, relational database environment, bringing together all the business data useful for strategic decision-making.

What does a DataWarehouse contain?

The DataWarehouse is used to gather the business data useful for strategic decision-making for the company in a single place. We are talking about business data that can cover anything from customer data, to inventory information, to conversions on a commercial website or organic visits (from a search engine such as Google for example).

It is commonly accepted that the data sent to a DataWarehouse are structured, pre-processed data used to offload operational databases, which ultimately allows these operational databases to be solicited as little as possible for query purposes.
The main goal of a DataWarehouse and those who administer it is to compile data from various, heterogeneous sources (both internal and external) in order to standardize it so that the various sources can communicate with each other. The end goal is to use this data to carry out analyses, reporting, decision-making support, etc.

Who are the daily users of a DataWarehouse?

Because of the nature of the DataWarehouse and the format and type of data it contains, it is an ideal playground for Data and Web Analysts.
Data Analysts works alongside the DataWarehouse administrator (or administration team). They define the business needs and use cases. They identify the data sources and the actions needed to process the data upstream. These data will then be used by the Data Analysts at the end of the chain.

How do users communicate with a DataWarehouse?

Once the data sources have been identified, and the data processed, ingested and linked in the DataWarehouse, the Data Analyst can use this data in analyses and to create new data combinations. This process can be used to maintain reporting dashboards, alerting dashboards, etc.

The most commonly used programming language for querying in a DataWarehouse is SQL (or SQL-like languages). SQL allows Data Analysts to manipulate and process data in order to meet business needs: monitoring, strategic decision-making, etc.

What use cases and project types do DataWarehouses serve?

It is impossible to draw up an exhaustive list of use cases involving the use of a DataWarehouse. However, here are a few examples of projects on which a Data Analyst is likely to work:

Improvement of a DataWarehouse:
This type of project is often encountered when setting up a DataWarehouse, but also when a new need or business use case is identified.
It is a question here of adding new data to a DWH (again, this can be internal or external data).
In this case we often speak of an ETL (Extraction-Transformation-Loading) process:

Realization of statistical analyses:
This is a very frequent use of DWHs. The goal may be to prove X or Y through the data, to produce statistics based on the available historical data, or to establish causal links to explain a finding, etc.
Reporting & alerting:
This is, once again, a very frequent use case. In fact, as the data in a DWH are highly structured and formatted (sharing a fixed and predefined schema), it is all suitable for pushing data to reporting or alerting dashboards.

This is a recurring request from top management, who need to be able to monitor operational teams and the health of results, sales, etc. in the simplest and quickest way possible.

If we summarize all of these, we have more or less 2 types of projects: data acquisition and integration projects (which can also be compared to a form of data storage and historization) and data analysis and evaluation projects (through monitoring/dashboarding and alerting).

The concept of a DWH has been present in the everyday language of those who work with data for a long time. How it works and its numerous use cases have long since been confirmed, and DWHs can be found in many companies of varying maturity where questions of data management are concerned.

This is less the case for the concept of DataLakes, which is much younger and much less widespread.

[oc-redirect num=1]

DataLake: lake of megadata (BigData)

The origin of this concept is attributed to James Dixon, CTO of Penthao, who defines it as a solution for storing and exploiting large volumes of data, without pre-processing and without necessarily a specific use case… Unlike DWHs, which are very much oriented towards immediate activation.
The DL tries to fill the gap, which is more and more important with the emergence of BigData, of what to do with all this mass of data that we are able to collect today and how to take advantage of it.

What does a DataLake contain?

I’ll start by quoting James Dixon who uses a very evocative comparison, serving both as an explanation for the “lake” name of his concept and as a differentiation with the DWH:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

This quote perfectly illustrates the difference between the type of data contained in a DWH, which is structured and organized in tables with precise, fixed patterns, and the type of data contained in a DataLake, which is raw, without prior processing, available to take samples from as needed, whether exploratory or not.

Where a DWH is restricted to accommodate structured data, the DataLake is made to store all kinds of raw data (structured or not). A debate between Tamara Dull (Amazon Web Service) and Anne Buff (Microsoft SAS) gives us a slightly more concrete vision of the content of a DataLake:

“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”

Who are the daily users of DataLakes?

Where a Data Analyst was perfectly suited to work with the structured data contained in a DHW, raw data is instead the specialty of Data Scientists, who are often better equipped to manipulate this type of data.
This change in data profile and main user also results in different programming languages and use cases.

What use cases and project types do DataLakes serve?

Due to its unstructured nature and the considerable volume of data that a DataLake can contain, the use cases can be very different from those previously found in the DWH framework, for example:

How do users communicate with a DataLake?

Given the use cases and users (Data Scientists), we will very often find programming languages such as Python, Java, R, Scala, etc…
For the most part, these languages have been present in the field of data science for a long time.

A DataLake is therefore a tool for managing BigData. It relies on the massive storage of raw data for advanced analysis and visualization purposes, thus allowing the enhancement of data that had not previously been put to much use.

To summarize, here is a table of the differentiating elements established since the beginning of this article:

DataWarehouse DataLake
Type of data Structured, pre-processed data, organized in tables with defined schemas Raw data, stored in a structured or unstructured manner
Users Data Analysts, Web Analysts Data Scientists
(sometimes Data Analysts)
Data volume Small – Large
(Depending on need and use case)
Potentially very large
(Big Data)
Programming language used SQL or SQL-like Python, R, Java, Scala, among others
Type of project Analytical & statistical projects, Reporting, Alerting, ELT (export, transform, load) type projects, some predictive and data-driven analysis Predictive analysis, machine learning, transit zone between data sources and DWH, advanced visualization – BI, data-driven analysis

Predictive analysis, machine learning, transit zone between data sources and DWH, advanced visualization – BI, data-driven analysis

It is these differences that make these two concepts complementary tools. In many cases, depending on the maturity of a company’s governance and data management, they may rely on a combination of these two tools.
A DWH is mainly used for traditional reporting and analysis, whereas a DataLake serves as a data source before reaching its full potential as the company approaches maturity on data subjects.

In my opinion, DataLakes are more a response to the new data issues of the 21st century, especially with the emergence of BigData and the increasing capacity of companies to collect data, than a replacement for DWHs, as some might think.
Both have their advantages, disadvantages, strengths and weaknesses. The best way to make the most of both is still to use both together to be able to deal with any eventuality and to address a wider variety of needs.

Now that we’ve clearly defined the concepts, we will finally focus on the use of DataWarehouses and DataLakes for marketing and more specifically for SEO (even if in many cases, what is true for the former will be true for the latter, and vice versa).

DataWarehouse and DataLake SEO

We will talk here of a DataWarehouse or a DataLake (or both) where at least part of the data present can be used for SEO use cases.

Why associate DataLakes and DataWarehouses with Marketing and SEO?

SEO (and, more generally, marketing) has already taken a very marked turn towards data in recent years. More and more tasks require the use of various data sources:

To this list we should also add the use of APIs of tools such as Search Console, Majestic, Google Analytics for example, which naturally pushes us towards the kind of solutions described earlier in this article.
It is this strong connection between SEO and Data that pushes more and more Web Analysts and SEO Experts to learn about new ways to organize their data pipeline.

However, the drivers for this transition are not only about the potential and interconnectedness of SEO and Data. Many everyday use cases resonate with the project types listed above for DWHs and DLs.

The use cases of an SEO DataWarehouse or an SEO DataLake.

I will first start from pain points commonly encountered by SEO Experts before explaining how the use of a DataLake or a DataWarehouse is an answer to be considered when addressing them.
Among the main pain points, the following stand out:

Again, this is a non-exhaustive list of pain points and possible use cases for DataWarehouse – DataLake SEO. Limits are encountered more through the lack of imagination of those who use them than in the tools themselves.

Choosing a DataWarehouse or DataLake for your SEO uses

To conclude, contrary to what you may often hear or read, DataWarehouses and DataLakes are separate structures for data storage and collection, and are not incompatible. There is no need to choose one over the other, quite the opposite. Both have different use cases and there are even some adhesions.

The case of SEO is a telling example, and reinforces the need for DataWarehouses and DataLakes in general. Data is omnipresent in SEO: we have to manipulate huge amounts of data from different sources. So it’s not surprising that we talk about DataWarehouses and DataLakes in this context. We can imagine a lot of use cases of DataWarehouses or DataLakes in SEO, whether it is for automation purposes, to perform “augmented” analysis through data, or simply to solve recurring issues (pain points).