Data Lake
Central platform for modern data strategies

A data lake is a central platform for storing and analyzing large amounts of data. It enables companies to efficiently manage structured and unstructured data, process it flexibly, and make it available for analysis. Thanks to its scalability and cost savings, a data lake plays a central role in data-driven business models.

What is a data lake? A definition

Alongside the data warehouse, a data lake forms a central foundation for a data-driven company. The data warehouse and data lake should be viewed as a logical unit that is technically and functionally distinct. A data lake is a central storage platform that collects, stores, and manages large amounts of data in its raw format. It is a flexible and scalable repository that accepts structured, semi-structured, and unstructured data. The data is stored in a data lake without prior transformation or modeling, which distinguishes it from traditional databases or data warehouses.

Key features of a data lake

Raw data format: Data is stored in its original state without having to be processed or structured beforehand. This allows for flexible use, as different analyses can be performed on the same data set.

Diversity of data types: Supports structured data (e.g., tables), semi-structured data (e.g., JSON, XML), and unstructured data (e.g., videos, images, audio, log files).

Scalability: Data lakes are often implemented in the cloud and can grow almost indefinitely, making them ideal for big data applications.

Low cost: Compared to data warehouses, storage in a data lake is generally less expensive because data is stored in its raw state and does not require expensive data models or transformations. The storage is often object storage, in which large amounts of data can be stored cost-effectively in buckets.

Flexibility in data analysis: Data can be processed, analyzed, and extracted in various formats at any time, e.g., for machine learning, AI analyses, or real-time applications.

Both approaches offer advantages that complement each other well. A data lakehouse is therefore a logical further development of data architecture that combines the strengths of data lakes and data warehouses. It is particularly suitable for companies that want to efficiently store, process, and analyze large and diverse amounts of data without sacrificing the consistency and reliability of traditional systems.


The Data Lakehouse – A Use Case in Databricks and SAP Datasphere

In this white paper, we use sample data to explain step by step how to build a data lakehouse using SAP Datasphere and Databricks, and discuss why using a lakehouse makes sense, especially in the business context of many German companies.

Advantages of a data lake

A data lake offers you a cost-effective, flexible, and scalable solution for storing and analyzing large and diverse amounts of data. It is particularly useful for modern data-intensive applications such as big data, AI, machine learning, and real-time analytics. Data lakes offer the following advantages:

Flexible & scalable

Stores structured and unstructured data in raw form, without prior transformation. Easy to add new data types and scale horizontally.

Cost Efficiency

Low storage and operating costs thanks to cloud technologies. Ideal for large amounts of data such as IoT or streaming data.

Central platform

Bundles data from different sources (CRM, ERP, IoT, etc.) for company-wide use and unified analysis

Real-time capability & strong analytics

Integrates streaming technologies such as Apache Kafka or Flink. Supports modern analytics and machine learning tools directly on raw data.

Schema-on-read

Flexibility in data analysis: structures are only applied during reading – perfect for exploratory analyses.

Accessible to all user groups

Whether business analysts, developers, or data scientists: everyone accesses the same data pool, depending on their needs and tools.

Long-term storage & integration

Any amount of data can be stored permanently and used with cloud and big data technologies such as Spark, Hadoop, or Presto.

Flexibility in data storage

A data lake can store data in its raw form, regardless of its structure.

  • Support for structured, semi-structured, and unstructured data (e.g., tables, JSON files, images, videos, logs).
  • No need to transform data before storing it.
  • Easily add new data types without changing the architecture.
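To make the bucket idea from above concrete, here is a minimal sketch of the object-store model underlying most data lakes: opaque values stored as-is under keys inside buckets, with no schema imposed at write time. The in-memory `store`, `put_object`, and `get_object` names are illustrative stand-ins for a real service such as Amazon S3 or Azure Data Lake Storage.

```python
# A toy in-memory object store: buckets map keys to opaque bytes.
# (Illustrative stand-in for a real service such as Amazon S3,
# Azure Data Lake Storage, or Google Cloud Storage.)
store = {}

def put_object(bucket: str, key: str, data: bytes) -> None:
    """Store the value as-is: no schema is imposed at write time."""
    store.setdefault(bucket, {})[key] = data

def get_object(bucket: str, key: str) -> bytes:
    return store[bucket][key]

# Structured, semi-structured, and unstructured data side by side:
put_object("lake", "tables/orders.csv", b"id,amount\n1,100\n")
put_object("lake", "events/click.json", b'{"user": "anna"}')
put_object("lake", "media/logo.png", b"\x89PNG...")
```

Because every value is just bytes under a key, new data types can be added without any change to the storage layer.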

Cost Efficiency

Data lakes use cost-effective storage solutions, often cloud-based.

  • Scalability at low cost.

  • Ideal for storing large amounts of data, such as IoT data, streaming data, or historical data.
  • Reduction in storage costs compared to traditional data warehouses.

Scalability

A data lake is designed for storing and processing huge amounts of data.

  • Horizontal scaling with growing data volumes.
  • Suitable for big data applications and real-time data streams.

Support for modern analytics

Data lakes enable the use of modern data analysis and machine learning methods.

  • Easy access for data scientists and AI developers.
  • Support for machine learning frameworks such as TensorFlow, PyTorch, or Spark MLlib.
  • Enables the development of algorithms directly on raw data.

Central data platform

A data lake serves as a central storage location for data from different sources.

  • Integration of heterogeneous data sources (e.g., CRM, ERP, IoT devices, social media).
  • A single platform for storage and analysis, regardless of the source or structure of the data.
  • Enables the standardization of data for company-wide use.

Support for schema-on-read

Data can be structured and analyzed during reading instead of writing.

  • Flexibility in data analysis, as different users can apply different schemas.
  • Data does not need to be transformed in a complex manner before storage.
  • Enables exploratory analyses and different processing methods.
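A minimal Python sketch of schema-on-read (the sample records and reader functions are invented for illustration): the same raw JSON lines are stored once and interpreted differently by two consumers at read time.

```python
import json

# Raw events land in the lake untouched; note the inconsistent types
# and fields, which schema-on-write systems would reject at load time.
raw_records = [
    '{"user": "anna", "amount": "19.99", "ts": "2024-05-01"}',
    '{"user": "ben", "amount": 5, "extra": {"coupon": "X1"}}',
]

def read_as_sales(lines):
    """One consumer's schema: user plus a numeric amount."""
    return [
        {"user": r["user"], "amount": float(r["amount"])}
        for r in map(json.loads, lines)
    ]

def read_as_audit(lines):
    """Another consumer keeps every field, untyped."""
    return [json.loads(line) for line in lines]

sales = read_as_sales(raw_records)   # schema applied at read time
audit = read_as_audit(raw_records)   # different schema, same raw data
```

Neither reader required a transformation step before storage, which is exactly the flexibility the schema-on-read approach provides.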

Support for real-time data

Modern data lakes can integrate streaming data and real-time sources.

  • Processing of live data streams for applications such as fraud detection, IoT analytics, or real-time dashboards.
  • Integration of technologies such as Apache Kafka or Flink.
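As a rough illustration of the fraud-detection use case mentioned above, the following sketch consumes a stream of (card, amount) events and flags spending bursts. In practice the event iterable would be fed by a system such as Apache Kafka or Flink; the windowing rule here is deliberately simplified and invented for illustration.

```python
from collections import deque

def detect_bursts(events, window=3, threshold=250.0):
    """Flag a card whenever the sum of its last `window` amounts
    exceeds `threshold` (a deliberately simplified fraud rule)."""
    recent = {}
    for card, amount in events:
        q = recent.setdefault(card, deque(maxlen=window))
        q.append(amount)
        if sum(q) > threshold:
            yield card

# In production this iterable would be a live Kafka/Flink stream.
stream = [("A", 100.0), ("B", 20.0), ("A", 90.0), ("A", 80.0)]
flagged = list(detect_bursts(stream))
```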

Democratization of data

A data lake enables broad access to data within the company.

  • Data is accessible to various departments, such as marketing, finance, or IT.
  • Promoting a data-driven corporate culture.
  • Self-service for analysts and data scientists without heavy reliance on IT departments.

Long-term storage of historical data

Data lakes store data indefinitely, even if it is not currently needed.

  • Historical data is available for future analyses and machine learning models.
  • None of the storage-space or cost restrictions found in other systems.

Integration with big data and cloud technologies

Data lakes work seamlessly with modern big data technologies.

  • Use of Hadoop, Spark, Presto, and other big data frameworks.
  • Scalable cloud storage solutions such as Amazon S3, Azure Data Lake, or Google Cloud Storage.

Support for different user groups

A data lake can be used by different types of users.

  • Business analysts: Access to structured data for reports.
  • Data scientists: Use of raw data for exploratory analyses.
  • Developers: Use of data for applications and algorithms.
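A small sketch of how the same raw records can serve two of these groups at once (sample data invented; SQLite stands in for a real lake query engine): data scientists read the raw JSON directly, while analysts query a structured SQL view derived from it.

```python
import json
import sqlite3

# The same raw records serve both audiences.
raw = ['{"user": "anna", "amount": 19.5}', '{"user": "ben", "amount": 5.0}']

# Data scientists explore the raw JSON directly...
amounts = [json.loads(r)["amount"] for r in raw]

# ...while analysts query a structured SQL view derived from it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (user TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(d["user"], d["amount"]) for d in map(json.loads, raw)],
)
total = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Both views are derived on demand from the same stored raw data, so neither user group blocks the other.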


Use cases for a data lake

Data lakes offer a flexible and scalable solution for storing and processing large and diverse amounts of data, and they are particularly well suited to a range of data-intensive use cases.

What tools and technologies are available?

A data lake is based on a variety of modern technologies that work together seamlessly to store, integrate, analyze, and secure large and diverse amounts of data. The selection of tools depends on specific requirements, such as the type of data (structured, unstructured), the desired use case (batch or real-time processing), and the preferred infrastructure (cloud or on-premises).

Data lake architecture

Unlike a data warehouse architecture, which is divided into staging area, core DWH, and data marts because the data structures are defined before loading ("schema on write"), a data lake follows the "schema on read" approach: analysis structures are only created when the data is used. The data therefore first flows into the data lake in its original structure.

First, data from source systems is loaded into the object store in the data lake, as before. Based on this object store, the data is transformed into "open table formats." This allows structured data to be accessed using SQL language. This takes place in the "bronze," "silver," and "gold" layers. These correspond to the stage, core, and data mart layers in the data warehouse architecture. Thus, in the classic data lakehouse architecture, we find the data lake and the data warehouse integrated in a technically coordinated environment.
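A toy Python sketch of the bronze/silver/gold refinement described above (the sample rows and cleaning rules are invented): bronze keeps raw ingested records, silver deduplicates and types them, and gold aggregates them into an analysis-ready form.

```python
# Bronze: raw ingested rows, duplicates and bad values included.
bronze = [
    {"order_id": "1", "amount": "100", "country": "DE"},
    {"order_id": "1", "amount": "100", "country": "DE"},   # duplicate
    {"order_id": "2", "amount": "oops", "country": "FR"},  # unparseable
    {"order_id": "3", "amount": "250", "country": "DE"},
]

# Silver: deduplicated and typed; invalid rows are dropped.
seen, silver = set(), []
for row in bronze:
    try:
        amount = float(row["amount"])
    except ValueError:
        continue  # discard rows that fail type checks
    if row["order_id"] in seen:
        continue  # discard duplicate order ids
    seen.add(row["order_id"])
    silver.append({**row, "amount": amount})

# Gold: aggregated, analysis-ready revenue per country.
gold = {}
for row in silver:
    gold[row["country"]] = gold.get(row["country"], 0.0) + row["amount"]
```

In a real lakehouse these layers would live as open table formats (e.g., Delta, Iceberg) on object storage rather than as Python lists, but the refinement steps are the same in spirit.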

Modern lakehouse platforms


Databricks
Databricks is one of the best-known lakehouse platforms and is based on Apache Spark. It offers a unified platform for data processing, machine learning, and business intelligence (BI).
  •  Delta Lake as core technology. 
  • Support for batch and streaming data. 
  • Seamless integration with tools such as Power BI and Tableau. 
  • Highly optimized for machine learning and AI. 
Snowflake

Snowflake is often referred to as a cloud data warehouse, but has increasingly integrated lakehouse capabilities by supporting unstructured data.

  • Cloud-native with multi-cloud support (AWS, Azure, Google Cloud). 
  • Support for semi-structured data (e.g., JSON, Parquet).
  • Powerful data sharing features.
  • Scalability without downtime.
Google BigQuery
BigQuery, part of Google Cloud Platform, is a serverless data warehouse solution with lakehouse capabilities.
  • Integration with Google Cloud Storage as a data lake.
  • SQL queries on huge amounts of data.
  • Support for machine learning with BigQuery ML.
  • Low latency for analyses.
Azure Synapse Analytics

Azure Synapse combines data warehousing and big data analytics in a single platform.

  • Integration with Azure Data Lake.
  • Support for Spark-based analytics
  • Seamless connection to Power BI.
  • Support for T-SQL for queries.
AWS Lake Formation

Amazon Web Services combines Amazon S3 as a data lake with Amazon Redshift as a data warehouse to create a lakehouse architecture.

  • Data preparation and cataloging with AWS Lake Formation. 
  • Advanced analytics with Redshift Spectrum.
  • Integration with other AWS services (Glue, Athena, SageMaker).
Dremio

Dremio is a specialized lakehouse platform that simplifies and accelerates access to data in data lakes.

  • SQL queries directly on data lakes.
  • Optimization through a "query accelerator."
  • Support for open formats such as Parquet and ORC.
Apache Iceberg & Apache Hudi

These open-source projects offer lakehouse functionality for companies that want to use their own infrastructure.

Features:

  • Iceberg: Column-based storage management, transactions, and time-travel queries.
  • Hudi: Support for upserts and efficient data processing in data lakes.
Cloudera Data Platform (CDP)

Cloudera combines data lake and data warehouse in a hybrid platform.

Features:

  • On-premise and cloud support.
  • Integration of Apache Hadoop, Spark, and Hive.
  • Governance and security functions.

Data warehouse vs. data lake vs. data lakehouse

Data warehouse, data lake, and data lakehouse are three approaches to storing and processing large amounts of data. While a data warehouse optimizes structured data for analysis, a data lake stores raw data in any form. The data lakehouse combines both concepts to unite flexibility and performance.

Data storage
  • Data warehouse: structured data; in-memory & column-based storage; data vault & dimensional modeling; potentially expensive storage
  • Data lake: unstructured data; cloud storage & file-based; no generic data model; affordable storage
  • Data lakehouse: structured & unstructured data; cloud storage & file-based; data vault & dimensional modeling; affordable storage

Data usage
  • Data warehouse: enterprise-ready reporting; SQL interface and reporting tools; low-code/no-code possible; high query performance & low latency
  • Data lake: advanced analytics and data science; machine learning & AI; code-only access
  • Data lakehouse: integration of DWH and DL; SQL, Python, and R interfaces plus reporting tools; heavy virtualization reduces query performance

Data processing
  • Data warehouse: fixed-size main memory; schema-on-write; ETL & ELT; CRUD operations; ACID transactions
  • Data lake: storage only, no processing; schema-on-read
  • Data lakehouse: parallel & scalable processing separated from storage; schema-on-read; CRUD operations; ACID transactions

Data management
  • Data warehouse: single source of truth (SSOT); centralized control & governance; row & column level security (RCLS)
  • Data lake: no central control; no SSOT; no RCLS
  • Data lakehouse: comprehensive authorization concept; SSOT; RCLS

Data warehouse

Data storage. A data warehouse stores structured data in a column-based in-memory format. Both data vault and dimensional models are used, which guarantees analytically optimized storage—but at a tendency toward higher storage costs.

Data usage. The focus is on company-wide reporting: Common reporting tools and low- or no-code platforms can be connected via an SQL interface. This enables the system to achieve high query performance with minimal latency, even for complex analyses.

Data processing. Processing takes place in a fixed-size main memory according to the schema-on-write principle. Classic ETL and ELT pipelines support CRUD operations and ensure ACID transactions, so that data remains consistent and reliable.

Data management. As a single source of truth (SSOT), the data warehouse provides centralized governance with fine-grained row and column-level security (RCLS). This allows access rights to be precisely controlled and compliance requirements to be reliably met.

Data lake

Data storage. A data lake primarily stores unstructured data in cost-effective, file-based cloud storage. Since no generic data model is enforced, data from any format and source can be flexibly imported and later structured as needed.

Data usage. The environment is geared toward advanced analytics, data science, machine learning, and AI workloads. Queries and analyses are typically performed using code only in Python, R, or similar languages, allowing complex models and experiments to be implemented without additional layers.

Data processing. The lake functions exclusively as storage; computing resources are added via external clusters as needed. Analyses run according to the schema-on-read principle, meaning that data is only interpreted and converted into the desired format when it is read.

Data management. Governance is organized on a project or team-specific basis: there is no central control, no "single source of truth," and no row or column-level security (RCLS). Security and compliance requirements must therefore be implemented individually.

Data lakehouse

Data storage: A data lakehouse combines the advantages of a warehouse and a lake by storing both structured and unstructured data cost-effectively in file-based cloud storage. Despite this flexible storage, it supports data vault and dimensional modeling approaches, so that analytical structures are quickly available.

Data usage: Through the seamless integration of data warehouse and data lake, the lakehouse combines classic BI and modern data science workloads. Users can access it via SQL, Python, or R interfaces or via common reporting tools. The strong virtualization of this mixed environment allows for great freedom, but can slightly reduce query performance compared to pure warehouse operation.

Data processing: Computing clusters process data in parallel and in a scalable manner, but are physically separated from storage. Thanks to schema-on-read, data can be interpreted flexibly as needed; at the same time, the engines support CRUD operations and ACID transactions, ensuring consistency and reliability.
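The ACID guarantee mentioned here can be illustrated in miniature with SQLite, whose transactions behave the same way: if any statement in a transaction fails, the whole transaction is rolled back.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
con.execute("INSERT INTO orders VALUES (1, 100.0)")
con.commit()

# One transaction: either every statement applies or none does.
try:
    with con:  # sqlite3 wraps the block in a transaction
        con.execute("UPDATE orders SET amount = 50.0 WHERE id = 1")
        con.execute("INSERT INTO orders VALUES (1, 0.0)")  # PK violation
except sqlite3.IntegrityError:
    pass  # the failed transaction was rolled back in full

amount = con.execute("SELECT amount FROM orders WHERE id = 1").fetchone()[0]
```

Because the insert violates the primary key, the update in the same transaction is rolled back with it and the stored amount remains 100.0; lakehouse engines provide this atomicity at data-lake scale.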

Data management: A comprehensive authorization concept establishes the lakehouse as a single source of truth (SSOT) and offers fine-grained row and column-level security (RCLS). This allows governance and compliance requirements to be implemented centrally and efficiently.

Webinar

The Data Lakehouse – A Use Case in Databricks and SAP Datasphere

In this webinar, we will guide you through a possible approach using Azure Databricks and SAP Datasphere based on a specific use case. We will also provide an outlook on future developments and show how the concept contributes to making data-driven decisions more efficient and cost-effective.

Data lake(house) with ISR

To be honest, in the world of IT and analytics, there are always buzzwords that everyone is talking about, but which have no real meaning and do not lead to any real change in procedures. With data lakes and data lakehouses, it's different. From 1,000 km up in the air, a data lake is nothing more than a data warehouse: it stores data and makes it available for analysis. But if you take a closer look, there are many differences, and it is worth considering whether and for what purpose you want to use a data lake in your own specific environment. In many customer situations, the first step is therefore to record the relevant use cases and requirements:

  • Who could benefit from a data lake?
  • What data could be collected in it?
  • What forms of analysis would be possible in this case?
  • Are there already ideas that could be implemented on a data lake?
  • Can the data lake be integrated with a DWH and replace parts of the DWH architecture?

However, caution is advised at this point, because data lakes are often used to support innovative or even disruptive approaches. Innovation cannot be planned, and the data lake as infrastructure is intended to enable the agile and exploratory search for unknown value potential in data. The use of a data lake is therefore often accompanied by a paradigm shift in data usage: away from the predictable implementation of structured, standardized reporting and toward data-driven innovation processes that are hypothesis-driven wherever possible, but often also involve open-ended exploration. This approach usually requires changes in governance and organization. In our projects, we also accompany the initial process of idea generation and strategic adjustment on request. Only the question of amortization can almost never be answered ex ante, because the most valuable innovations are precisely those that are only discovered through data-driven work on the data lake. Strategic courage and foresight are therefore required here.

Data & analytics consulting with ISR
Why you should choose us!

Are you looking for a partner on your journey to becoming a data-driven company? We are here to support you with our many years of expertise in data and analytics consulting! With our in-depth understanding of business analytics, we accompany you from strategy to implementation. Our experienced data analytics consultants enable you to discover potential and base strategic decisions on a solid data foundation. Thanks to our long-standing partnerships with SAP, IBM, and Microsoft, we are happy to advise you in a vendor-neutral and cross-technology manner.

We can support you in other areas, too.
Further consulting pages

Data Science Consulting
With our solutions, we design your data processes to be efficient and future-proof, enabling you to gain competitive advantages.
Data Analytics Consulting
We help you gain valuable insights from your data for the sustainable success of your enterprise.
Data Management Consulting
With tailored Data Lake, Data Warehouse, and Data Lakehouse solutions, we support you from inception to completion.

Read more on the ISR blog
Valuable content on the topic of data lakes

Get in touch with us now
We would be pleased to advise you.

Take advantage of our data analytics consultants' many years of expertise to make your data strategy future-proof. Let's work together to develop customized data analytics solutions that will prepare your company for the future in the long term. Feel free to contact us for a no-obligation initial consultation.