The Data Lakehouse is an approach that has the potential to significantly simplify data management for business intelligence and analytics.
In addition to the classic data warehouse, data lakes have become increasingly important as central storage locations. Integrating both approaches is therefore a key issue for the further development of a BI and data strategy. Behind the term Data Lakehouse lies a data management approach that promises to combine the functions of both systems in a single technical concept:
- The storage of structured, semi-structured and unstructured raw data in the open, scalable architecture of the Data Lake.
- The data structuring and processing functionalities of the data warehouse.
In this blog article, we explain the approach and take a closer look at the structure and benefits of a data lakehouse. Databricks is currently the best-known provider for setting up a lakehouse, which is presumably why SAP has entered into a strategic partnership with the company.
With the SAP HANA platform and the SAP Datasphere and SAP BW applications, SAP offers state-of-the-art data warehouse solutions. The focus of the investments is on Datasphere, SAP's strategic cloud data warehouse solution, which is geared towards the modern requirements of specialist departments for data-driven self-service.
1. The data warehouse concept: definition and basics
The classic data warehouse
With a powerful database, the classic data warehouse has integrated storage and computing resources and thus ensures consistently high performance in data transformation and retrieval. It also offers high data quality and a rich semantic description of business-relevant data. However, it also requires complex data preparation and is less flexible when ingesting semi-structured and unstructured data such as social media or sensor data (see Fig. 1). The constant availability of high-performance computing resources can also lead to high costs, which could be avoided in phases with few read and write operations.
Data Lake
A data lake is designed for the centralized storage of large amounts of data without providing computing power of its own. Services that provide the corresponding computing resources are required for write or read operations. Data lakes provide the basis for use cases such as machine learning and predictive analytics, which are becoming increasingly important for data-driven business decisions. Data lakes and data warehouses have therefore been used in combination in recent years (see Fig. 2).
The Data Lakehouse concept
The Data Lakehouse concept transfers this two-stage combination into a single approach and thus combines the best aspects of both worlds (see Fig. 3). It offers the flexibility of a data lake for data acquisition and storage as well as the structured organization and quality of a data warehouse. The central feature is the separation of data storage and computing power (storage and compute resources), which means that computing resources are only used when required. This separation makes it possible to store large volumes of data cost-effectively while still performing powerful analyses. With these characteristics, the Data Lakehouse aims to simplify data-driven decision-making processes, increase flexibility and reduce costs. The concept has therefore gained considerable popularity in practice in recent years.
Whitepaper
The Data Lakehouse in Databricks and SAP Datasphere - An application example
2. Functionality of the Data Lakehouse
2.1 Data consistency and quality with open-table formats
The Data Lakehouse is based on a virtual layer that accesses files in a data lake (see Fig. 4). These files are stored in standardized, platform-independent formats that are optimized for column-based storage of large amounts of data. These open and standardized formats enable flexible data management as they are not tied to a specific technology.
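To illustrate why column-oriented storage suits analytical queries, here is a minimal, library-free Python sketch (the `ColumnStore` class is purely illustrative; real lakehouses use columnar file formats such as Parquet):

```python
# Minimal column-oriented store: all values of one column are held
# contiguously, so an aggregate only touches that column's data.
class ColumnStore:
    def __init__(self, columns):
        # e.g. {"order_id": [...], "amount": [...]}
        self.columns = {name: list(values) for name, values in columns.items()}

    def sum(self, column):
        # Scans a single column; a row-oriented layout would have to
        # read every field of every row to answer the same query.
        return sum(self.columns[column])

store = ColumnStore({"order_id": [1, 2, 3], "amount": [10.0, 20.5, 5.25]})
print(store.sum("amount"))  # 35.75
```

Formats such as Parquet additionally compress each column and store per-column statistics, which this sketch omits.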
Challenges of the standard formats:
However, standardized file formats have disadvantages such as inefficient writing processes and problems with simultaneous transactions, which can lead to data inconsistencies. To overcome these challenges, the files are enriched by a virtual metadata layer, the so-called open-table format. This metadata layer forms the technical foundation of the Lakehouse approach and offers functions that are traditionally only available in classic databases, thereby solving several challenges.
Advantages of the open-table format:
- ACID-compatible transactions (Atomicity, Consistency, Isolation, Durability): Open-table formats ensure that read and write transactions are atomic and isolated from each other and that the data is consistent and permanently available at all times. This enables several people to work simultaneously on the same tables or files. In this respect, open-table formats guarantee the reliability and integrity of the data.
- Versioning and historization ("Time Travel"): Changes to tables can be traced back to a specific point in time and earlier versions can be restored. This improves the traceability of changes and allows different versions to be compared. As only metadata of the changes is recorded, previous versions do not need to be physically saved.
- Schema enforcement and evolution: By default, it is ensured that the schema of new data corresponds to the existing table schema in order to guarantee the consistency of the data ("schema-on-write"). In addition, schema evolution enables flexible adjustments to the table schema (e.g. adding new columns) without having to completely recreate the table.
- Partition evolution: Partitions are saved as separate files, which simplifies queries and changes in certain partitions. New rows and columns can be added to the metadata without having to rewrite the partitions, which optimizes performance.
- Governance and metadata management: A transaction log written to the table location enables optimized queries and transactions by providing metadata such as schema and partitions.
- Combined batch and streaming processing: Open-table formats enable the processing of real-time data streams and historical data in micro-batches. This reduces the effort required to manage heterogeneous data pipelines and facilitates the provision of information for informed decision-making in real time.
- Connection of data science and machine learning: Thanks to the open and standardized access, machine learning tools can quickly access existing data. In addition, the data is already stored in a form that can be read by common machine learning libraries.
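The transaction-log mechanics behind several of these features can be sketched in a few lines of plain Python. This is an illustrative toy, not a real Delta Lake, Iceberg or Hudi implementation: commits append metadata entries instead of rewriting data, which yields atomic visibility, schema enforcement and "time travel":

```python
import copy

# Toy open-table format: an append-only commit log over row batches.
class VersionedTable:
    def __init__(self, schema):
        self.schema = set(schema)  # enforced on write ("schema-on-write")
        self.log = []              # ordered list of committed batches

    def append(self, rows):
        for row in rows:
            if set(row) != self.schema:
                raise ValueError(f"schema mismatch: {set(row)}")
        # Atomic commit: the whole batch becomes visible as one log entry.
        self.log.append(copy.deepcopy(rows))

    def snapshot(self, version=None):
        # Reconstruct the table as of a given commit ("time travel").
        end = len(self.log) if version is None else version
        return [row for batch in self.log[:end] for row in batch]

t = VersionedTable({"id", "value"})
t.append([{"id": 1, "value": "a"}])
t.append([{"id": 2, "value": "b"}])
print(len(t.snapshot(1)))  # 1 — the table as of the first commit
print(len(t.snapshot()))   # 2 — the current table
```

Real open-table formats write such log entries as files next to the data in the lake and add concurrency control, compaction and statistics on top of this basic idea.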
Performance optimization and open source solutions
To ensure good performance, much of this metadata is cached, which reduces the latency of queries and transactions. Well-known open-table formats such as Delta Lake, Iceberg and Hudi are open source and can be used in the data lakes and compute services of hyperscalers such as Microsoft Azure, AWS or Google Cloud Platform. Despite different implementations, these formats offer similar performance features.
These technical foundations and functions make open-table formats a key component of the Data Lakehouse concept, which combines the advantages of data lakes and data warehouses and thus enables efficient and consistent data management.
2.2 Data on the podium: The medallion structure
The data lakehouse requires a structured setup in order to ensure the necessary governance and to be able to react flexibly to the challenges of constantly changing data landscapes. To this end, a hierarchical data structure is introduced in the data lakehouse that meets the requirements of data scientists for data analysis on the one hand and the needs of data warehouse experts for reporting and BI applications on the other. This hierarchy is referred to as a medallion structure (see Fig. 5) and was originally proposed by Databricks. It divides the lakehouse into three layers representing different data states: raw, processed and aggregated. The data passes through these layers from the lowest to the highest, with each layer having its own modeling conventions and loading rules.
This type of processing is not a new idea; it implements the transformation logic and ETL pipelines of classic data warehouse approaches in the lakehouse and will be familiar to many data warehouse experts.
The layers of the medallion structure:
- Bronze layer:
- Function: Storage location for raw data from various sources such as ERP and CRM systems, IoT devices and other data sources.
- Characteristics: The data is stored in its original form, without transformation. Cost-effective storage in the data lake makes it feasible to meet even strict documentation requirements.
- Staging layer (optional):
- Function: Intermediate level between bronze and silver layer.
- Characteristics: Standardization and conversion of the raw data into the open-table format, collection of additional metadata and correction of obvious errors. This facilitates the transformation into the silver layer.
- Silver layer:
- Function: Central persistence layer for transformed data.
- Characteristics: Data is transformed according to hard technical rules and, where necessary, business rules, and stored in a harmonized raw layer (write-optimized). The modular structure of data vault modeling ensures flexibility and scalability.
- Gold layer:
- Function: Query-optimized preparation of data for analyses and reports.
- Characteristics: Data is denormalized and presented in dimensions and facts, similar to classic data warehouses, to enable the creation of analyses and reports.
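The flow through the three layers can be sketched in plain Python. The records, hard rules and aggregation below are illustrative assumptions, not prescribed by the medallion architecture itself:

```python
# Bronze: raw records exactly as ingested, including quality issues.
bronze = [
    {"order": "1001", "amount": " 10.0 ", "region": "emea"},
    {"order": "1002", "amount": "20.5",   "region": "EMEA"},
    {"order": "1003", "amount": "5.0",    "region": "amer"},
    {"order": None,   "amount": "3.0",    "region": "amer"},  # invalid row
]

# Silver: apply hard rules — drop invalid rows, harmonize types and values.
silver = [
    {"order": int(r["order"]),
     "amount": float(r["amount"]),
     "region": r["region"].strip().upper()}
    for r in bronze if r["order"] is not None
]

# Gold: aggregate into a denormalized, query-optimized, report-ready shape.
gold = {}
for r in silver:
    gold[r["region"]] = gold.get(r["region"], 0.0) + r["amount"]

print(gold)  # {'EMEA': 30.5, 'AMER': 5.0}
```

In a real lakehouse, each layer would be a set of open-table-format tables, and the transformations would run as batch or streaming jobs rather than in-memory list comprehensions.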
Advantages of the medallion structure
The medallion architecture gives the data lakehouse a structure that resembles modern data warehouses and meets the different requirements of data engineers, data scientists, data governance specialists and business users. Thanks to technical innovations such as the separation of compute and storage resources and the use of open-table formats, the challenge lies less in setting up load pipelines and business logic than in how these innovations can be integrated into existing data strategies and used optimally within the company.
3. Conclusion
The data lakehouse combines the strengths of data lakes and data warehouses by enabling flexible, cost-efficient and high-performance data processing. The separation of data storage and computing power can contribute to more efficient use of cloud resources and cost reductions.
Open-table formats ensure data consistency and quality through ACID-compliant transactions, versioning, schema enforcement and support for batch and streaming processing. The medallion structure (bronze, silver and gold layers) provides a clear data organization that meets the requirements of both data scientists and data warehouse experts. At the same time, implementing a data lakehouse approach in a company requires careful planning in order to meet the requirements for consistency, accessibility and governance of the data and to benefit from the opportunities the approach offers.
Author: Damian Garrell & Martin Peitz
Your contact person
Christopher Kampmann
Head of Business Unit
Data & Analytics
christopher.kampmann@isr.de
+49 (0) 151 422 05 448