Data Lake
Central platform for modern data strategies

A data lake is a central platform for storing and analyzing large amounts of data. It enables companies to efficiently manage structured and unstructured data, process it flexibly, and make it available for analysis. Thanks to its scalability and cost savings, a data lake plays a central role in data-driven business models.

What is a data lake? A definition

Alongside the data warehouse, a data lake forms a central foundation for a data-driven company. The data warehouse and data lake should be viewed as a logical unit that is technically and functionally distinct. A data lake is a central storage platform that collects, stores, and manages large amounts of data in its raw format. It is a flexible and scalable repository that accepts structured, semi-structured, and unstructured data. The data is stored in a data lake without prior transformation or modeling, which distinguishes it from traditional databases or data warehouses.

Key features of a data lake

Raw data format: Data is stored in its original state without having to be processed or structured beforehand. This allows for flexible use, as different analyses can be performed on the same data set.

Diversity of data types: Supports structured data (e.g., tables), semi-structured data (e.g., JSON, XML), and unstructured data (e.g., videos, images, audio, log files).

Scalability: Data lakes are often implemented in the cloud and can grow almost indefinitely, making them ideal for big data applications.

Low cost: Compared to data warehouses, storage in a data lake is generally less expensive because data is stored in its raw state and does not require expensive data models or transformations. The storage is often object storage, in which large amounts of data can be stored cost-effectively in buckets.

Flexibility in data analysis: Data can be processed, analyzed, and extracted in various formats at any time, e.g., for machine learning, AI analyses, or real-time applications.

Both approaches offer advantages that complement each other well. A data lakehouse is therefore a logical further development of data architecture that combines the strengths of data lakes and data warehouses. It is particularly suitable for companies that want to efficiently store, process, and analyze large and diverse amounts of data without sacrificing the consistency and reliability of traditional systems.


The Data Lakehouse – A Use Case in Databricks and SAP Datasphere

In this white paper, we use sample data to explain step by step how to build a data lakehouse using SAP Datasphere and Databricks, and discuss why using a lakehouse makes sense, especially in the business context of many German companies.

Advantages of a data lake

A data lake offers you a cost-effective, flexible, and scalable solution for storing and analyzing large and diverse amounts of data. It is particularly useful for modern data-intensive applications such as big data, AI, machine learning, and real-time analytics. Data lakes offer the following advantages:

Flexible & scalable

Stores structured and unstructured data in raw form, without prior transformation. Easy to add new data types and scale horizontally.

Cost Efficiency

Low storage and operating costs thanks to cloud technologies. Ideal for large amounts of data such as IoT or streaming data.

Central platform

Bundles data from different sources (CRM, ERP, IoT, etc.) for company-wide use and unified analysis

Real-time capability & strong analytics

Integrates streaming technologies such as Apache Kafka or Flink. Supports modern analytics and machine learning tools directly on raw data.

Schema-on-read

Flexibility in data analysis: structures are only applied during reading – perfect for exploratory analyses.

Accessible to all user groups

Whether business analysts, developers, or data scientists: everyone accesses the same data pool, depending on their needs and tools.

Long-term storage & integration

Any amount of data can be stored permanently and used with cloud and big data technologies such as Spark, Hadoop, or Presto.

Flexibility in data storage

A data lake can store data in its raw form, regardless of its structure.

  • Support for structured, semi-structured, and unstructured data (e.g., tables, JSON files, images, videos, logs).
  • No need to transform data before storing it.
  • Easily add new data types without changing the architecture.
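To make the bucket idea from above concrete, here is a minimal sketch of the object-store model underlying most data lakes: opaque values stored as-is under keys inside buckets, with no schema imposed at write time. The in-memory `store`, `put_object`, and `get_object` names are illustrative stand-ins for a real service such as Amazon S3 or Azure Data Lake Storage.

```python
# A toy in-memory object store: buckets map keys to opaque bytes.
# (Illustrative stand-in for a real service such as Amazon S3,
# Azure Data Lake Storage, or Google Cloud Storage.)
store = {}

def put_object(bucket: str, key: str, data: bytes) -> None:
    """Store the value as-is: no schema is imposed at write time."""
    store.setdefault(bucket, {})[key] = data

def get_object(bucket: str, key: str) -> bytes:
    return store[bucket][key]

# Structured, semi-structured, and unstructured data side by side:
put_object("lake", "tables/orders.csv", b"id,amount\n1,100\n")
put_object("lake", "events/click.json", b'{"user": "anna"}')
put_object("lake", "media/logo.png", b"\x89PNG...")
```

Because every value is just bytes under a key, new data types can be added without any change to the storage layer.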

Cost Efficiency

Data lakes use cost-effective storage solutions, often cloud-based.

  • Scalability at low cost.

  • Ideal for storing large amounts of data, such as IoT data, streaming data, or historical data.
  • Reduction in storage costs compared to traditional data warehouses.

Scalability

A data lake is designed for storing and processing huge amounts of data.

  • Horizontal scaling with growing data volumes.
  • Suitable for big data applications and real-time data streams.

Support for modern analytics

Data lakes enable the use of modern data analysis and machine learning methods.

  • Easy access for data scientists and AI developers.
  • Support for machine learning frameworks such as TensorFlow, PyTorch, or Spark MLlib.
  • Enables the development of algorithms directly on raw data.

Central data platform

A data lake serves as a central storage location for data from different sources.

  • Integration of heterogeneous data sources (e.g., CRM, ERP, IoT devices, social media).
  • A single platform for storage and analysis, regardless of the source or structure of the data.
  • Enables the standardization of data for company-wide use.

Support for schema-on-read

Data can be structured and analyzed during reading instead of writing.

  • Flexibility in data analysis, as different users can apply different schemas.
  • Data does not need to be transformed in a complex manner before storage.
  • Enables exploratory analyses and different processing methods.
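A minimal Python sketch of schema-on-read (the sample records and reader functions are invented for illustration): the same raw JSON lines are stored once and interpreted differently by two consumers at read time.

```python
import json

# Raw events land in the lake untouched; note the inconsistent types
# and fields, which schema-on-write systems would reject at load time.
raw_records = [
    '{"user": "anna", "amount": "19.99", "ts": "2024-05-01"}',
    '{"user": "ben", "amount": 5, "extra": {"coupon": "X1"}}',
]

def read_as_sales(lines):
    """One consumer's schema: user plus a numeric amount."""
    return [
        {"user": r["user"], "amount": float(r["amount"])}
        for r in map(json.loads, lines)
    ]

def read_as_audit(lines):
    """Another consumer keeps every field, untyped."""
    return [json.loads(line) for line in lines]

sales = read_as_sales(raw_records)   # schema applied at read time
audit = read_as_audit(raw_records)   # different schema, same raw data
```

Neither reader required a transformation step before storage, which is exactly the flexibility the schema-on-read approach provides.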

Support for real-time data

Modern data lakes can integrate streaming data and real-time sources.

  • Processing of live data streams for applications such as fraud detection, IoT analytics, or real-time dashboards.
  • Integration of technologies such as Apache Kafka or Flink.
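As a rough illustration of the fraud-detection use case mentioned above, the following sketch consumes a stream of (card, amount) events and flags spending bursts. In practice the event iterable would be fed by a system such as Apache Kafka or Flink; the windowing rule here is deliberately simplified and invented for illustration.

```python
from collections import deque

def detect_bursts(events, window=3, threshold=250.0):
    """Flag a card whenever the sum of its last `window` amounts
    exceeds `threshold` (a deliberately simplified fraud rule)."""
    recent = {}
    for card, amount in events:
        q = recent.setdefault(card, deque(maxlen=window))
        q.append(amount)
        if sum(q) > threshold:
            yield card

# In production this iterable would be a live Kafka/Flink stream.
stream = [("A", 100.0), ("B", 20.0), ("A", 90.0), ("A", 80.0)]
flagged = list(detect_bursts(stream))
```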

Democratization of data

A data lake enables broad access to data within the company.

  • Data is accessible to various departments, such as marketing, finance, or IT.
  • Promoting a data-driven corporate culture.
  • Self-service for analysts and data scientists without heavy reliance on IT departments.

Long-term storage of historical data

Data lakes store data indefinitely, even if it is not currently needed.

  • Historical data is available for future analyses and machine learning models.
  • None of the storage-space or cost restrictions found in other systems.

Integration with big data and cloud technologies

Data lakes work seamlessly with modern big data technologies.

  • Use of Hadoop, Spark, Presto, and other big data frameworks.
  • Scalable cloud storage solutions such as Amazon S3, Azure Data Lake, or Google Cloud Storage.

Support for different user groups

A data lake can be used by different types of users.

  • Business analysts: Access to structured data for reports.
  • Data scientists: Use of raw data for exploratory analyses.
  • Developers: Use of data for applications and algorithms.
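A small sketch of how the same raw records can serve two of these groups at once (sample data invented; SQLite stands in for a real lake query engine): data scientists read the raw JSON directly, while analysts query a structured SQL view derived from it.

```python
import json
import sqlite3

# The same raw records serve both audiences.
raw = ['{"user": "anna", "amount": 19.5}', '{"user": "ben", "amount": 5.0}']

# Data scientists explore the raw JSON directly...
amounts = [json.loads(r)["amount"] for r in raw]

# ...while analysts query a structured SQL view derived from it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (user TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(d["user"], d["amount"]) for d in map(json.loads, raw)],
)
total = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Both views are derived on demand from the same stored raw data, so neither user group blocks the other.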


Use cases for a data lake

Data lakes offer a flexible and scalable solution for storing and processing large and diverse amounts of data, and they are particularly well suited to a range of data-intensive use cases.

What tools and technologies are available?

A data lake is based on a variety of modern technologies that work together seamlessly to store, integrate, analyze, and secure large and diverse amounts of data. The selection of tools depends on specific requirements, such as the type of data (structured, unstructured), the desired use case (batch or real-time processing), and the preferred infrastructure (cloud or on-premises).

Data lake architecture

Unlike a data warehouse architecture, which is divided into staging area, core DWH, and data marts because the data structures are defined before loading ("schema on write"), a data lake follows the "schema on read" approach: analysis structures are only created when the data is used. The data therefore first flows into the data lake in its original structure.

First, data from source systems is loaded into the object store in the data lake, as before. Based on this object store, the data is transformed into "open table formats." This allows structured data to be accessed using SQL language. This takes place in the "bronze," "silver," and "gold" layers. These correspond to the stage, core, and data mart layers in the data warehouse architecture. Thus, in the classic data lakehouse architecture, we find the data lake and the data warehouse integrated in a technically coordinated environment.
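A toy Python sketch of the bronze/silver/gold refinement described above (the sample rows and cleaning rules are invented): bronze keeps raw ingested records, silver deduplicates and types them, and gold aggregates them into an analysis-ready form.

```python
# Bronze: raw ingested rows, duplicates and bad values included.
bronze = [
    {"order_id": "1", "amount": "100", "country": "DE"},
    {"order_id": "1", "amount": "100", "country": "DE"},   # duplicate
    {"order_id": "2", "amount": "oops", "country": "FR"},  # unparseable
    {"order_id": "3", "amount": "250", "country": "DE"},
]

# Silver: deduplicated and typed; invalid rows are dropped.
seen, silver = set(), []
for row in bronze:
    try:
        amount = float(row["amount"])
    except ValueError:
        continue  # discard rows that fail type checks
    if row["order_id"] in seen:
        continue  # discard duplicate order ids
    seen.add(row["order_id"])
    silver.append({**row, "amount": amount})

# Gold: aggregated, analysis-ready revenue per country.
gold = {}
for row in silver:
    gold[row["country"]] = gold.get(row["country"], 0.0) + row["amount"]
```

In a real lakehouse these layers would live as open table formats (e.g., Delta, Iceberg) on object storage rather than as Python lists, but the refinement steps are the same in spirit.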

Modern lakehouse platforms


Databricks
Databricks is one of the best-known lakehouse platforms and is based on Apache Spark. It offers a unified platform for data processing, machine learning, and business intelligence (BI).
  •  Delta Lake as core technology. 
  • Support for batch and streaming data. 
  • Seamless integration with tools such as Power BI and Tableau. 
  • Highly optimized for machine learning and AI. 
Snowflake

Snowflake is often referred to as a cloud data warehouse, but has increasingly integrated lakehouse capabilities by supporting unstructured data.

  • Cloud-native with multi-cloud support (AWS, Azure, Google Cloud). 
  • Support for semi-structured data (e.g., JSON, Parquet).
  • Powerful data sharing features.
  • Scalability without downtime.
Google BigQuery
BigQuery, part of Google Cloud Platform, is a serverless data warehouse solution with lakehouse capabilities.
  • Integration with Google Cloud Storage as a data lake.
  • SQL queries on huge amounts of data.
  • Support for machine learning with BigQuery ML.
  • Low latency for analyses.
Azure Synapse Analytics

Azure Synapse combines data warehousing and big data analytics in a single platform.

  • Integration with Azure Data Lake.
  • Support for Spark-based analytics
  • Seamless connection to Power BI.
  • Support for T-SQL for queries.
AWS Lake Formation

Amazon Web Services combines Amazon S3 as a data lake with Amazon Redshift as a data warehouse to create a lakehouse architecture.

  • Data preparation and cataloging with AWS Lake Formation. 
  • Advanced analytics with Redshift Spectrum.
  • Integration with other AWS services (Glue, Athena, SageMaker).
Dremio

Dremio is a specialized lakehouse platform that simplifies and accelerates access to data in data lakes.

  • SQL queries directly on data lakes.
  • Optimization through a "query accelerator."
  • Support for open formats such as Parquet and ORC.
Apache Iceberg & Apache Hudi

These open-source projects offer lakehouse functionality for companies that want to use their own infrastructure.

Features:

  • Iceberg: Column-based storage management, transactions, and time-travel queries.
  • Hudi: Support for upserts and efficient data processing in data lakes.
Cloudera Data Platform (CDP)

Cloudera combines data lake and data warehouse in a hybrid platform.

Features:

  • On-premise and cloud support.
  • Integration of Apache Hadoop, Spark, and Hive.
  • Governance and security functions.

Data warehouse vs. data lake vs. data lakehouse

Data warehouse, data lake, and data lakehouse are three approaches to storing and processing large amounts of data. While a data warehouse optimizes structured data for analysis, a data lake stores raw data in any form. The data lakehouse combines both concepts to unite flexibility and performance.

Data storage
  • Data warehouse: structured data; in-memory & column-based storage; data vault & dimensional modeling; potentially expensive storage
  • Data lake: unstructured data; cloud storage & file-based; no generic data model; affordable storage
  • Data lakehouse: structured & unstructured data; cloud storage & file-based; data vault & dimensional modeling; affordable storage

Data usage
  • Data warehouse: enterprise-ready reporting; SQL interface and reporting tools; low-code/no-code possible; high query performance & low latency
  • Data lake: advanced analytics and data science; machine learning & AI; code-only access
  • Data lakehouse: integration of DWH and DL; SQL, Python, and R interfaces plus reporting tools; heavy virtualization reduces query performance

Data processing
  • Data warehouse: fixed-size main memory; schema-on-write; ETL & ELT; CRUD operations; ACID transactions
  • Data lake: storage only, no processing; schema-on-read
  • Data lakehouse: parallel & scalable processing separated from storage; schema-on-read; CRUD operations; ACID transactions

Data management
  • Data warehouse: single source of truth (SSOT); centralized control & governance; row & column level security (RCLS)
  • Data lake: no central control; no SSOT; no RCLS
  • Data lakehouse: comprehensive authorization concept; SSOT; RCLS

Data warehouse

Data storage. A data warehouse stores structured data in a column-based in-memory format. Both data vault and dimensional models are used, which guarantees analytically optimized storage—but at a tendency toward higher storage costs.

Data usage. The focus is on company-wide reporting: Common reporting tools and low- or no-code platforms can be connected via an SQL interface. This enables the system to achieve high query performance with minimal latency, even for complex analyses.

Data processing. Processing takes place in a fixed-size main memory according to the schema-on-write principle. Classic ETL and ELT pipelines support CRUD operations and ensure ACID transactions, so that data remains consistent and reliable.

Data management. As a single source of truth (SSOT), the data warehouse provides centralized governance with fine-grained row and column-level security (RCLS). This allows access rights to be precisely controlled and compliance requirements to be reliably met.

Data lake

Data storage. A data lake primarily stores unstructured data in cost-effective, file-based cloud storage. Since no generic data model is enforced, data from any format and source can be flexibly imported and later structured as needed.

Data usage. The environment is geared toward advanced analytics, data science, machine learning, and AI workloads. Queries and analyses are typically performed using code only in Python, R, or similar languages, allowing complex models and experiments to be implemented without additional layers.

Data processing. The lake functions exclusively as storage; computing resources are added via external clusters as needed. Analyses run according to the schema-on-read principle, meaning that data is only interpreted and converted into the desired format when it is read.

Data management. Governance is organized on a project or team-specific basis: there is no central control, no "single source of truth," and no row or column-level security (RCLS). Security and compliance requirements must therefore be implemented individually.

Data lakehouse

Data storage: A data lakehouse combines the advantages of a warehouse and a lake by storing both structured and unstructured data cost-effectively in file-based cloud storage. Despite this flexible storage, it supports data vault and dimensional modeling approaches, so that analytical structures are quickly available.

Data usage: Through the seamless integration of data warehouse and data lake, the lakehouse combines classic BI and modern data science workloads. Users can access it via SQL, Python, or R interfaces or via common reporting tools. The strong virtualization of this mixed environment allows for great freedom, but can slightly reduce query performance compared to pure warehouse operation.

Data processing: Computing clusters process data in parallel and in a scalable manner, but are physically separated from storage. Thanks to schema-on-read, data can be interpreted flexibly as needed; at the same time, the engines support CRUD operations and ACID transactions, ensuring consistency and reliability.
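The ACID guarantee mentioned here can be illustrated in miniature with SQLite, whose transactions behave the same way: if any statement in a transaction fails, the whole transaction is rolled back.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
con.execute("INSERT INTO orders VALUES (1, 100.0)")
con.commit()

# One transaction: either every statement applies or none does.
try:
    with con:  # sqlite3 wraps the block in a transaction
        con.execute("UPDATE orders SET amount = 50.0 WHERE id = 1")
        con.execute("INSERT INTO orders VALUES (1, 0.0)")  # PK violation
except sqlite3.IntegrityError:
    pass  # the failed transaction was rolled back in full

amount = con.execute("SELECT amount FROM orders WHERE id = 1").fetchone()[0]
```

Because the insert violates the primary key, the update in the same transaction is rolled back with it and the stored amount remains 100.0; lakehouse engines provide this atomicity at data-lake scale.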

Data management: A comprehensive authorization concept establishes the lakehouse as a single source of truth (SSOT) and offers fine-grained row and column-level security (RCLS). This allows governance and compliance requirements to be implemented centrally and efficiently.

Webinar

The Data Lakehouse – A Use Case in Databricks and SAP Datasphere

In this webinar, we will guide you through a possible approach using Azure Databricks and SAP Datasphere based on a specific use case. We will also provide an outlook on future developments and show how the concept contributes to making data-driven decisions more efficient and cost-effective.

Data lake(house) with ISR

To be honest, in the world of IT and analytics, there are always buzzwords that everyone is talking about, but which have no real meaning and do not lead to any real change in procedures. With data lakes and data lakehouses, it's different. From 1,000 km up in the air, a data lake is nothing more than a data warehouse: it stores data and makes it available for analysis. But if you take a closer look, there are many differences, and it is worth considering whether and for what purpose you want to use a data lake in your own specific environment. In many customer situations, the first step is therefore to record the relevant use cases and requirements:

  • Who could benefit from a data lake?
  • What data could be collected in it?
  • What forms of analysis would be possible in this case?
  • Are there already ideas that could be implemented on a data lake?
  • Can the data lake be integrated with a DWH and replace parts of the DWH architecture?

However, caution is advised at this point, because data lakes are often used to support innovative or even disruptive approaches. Innovation cannot be planned, and the data lake as infrastructure is intended to enable the agile and exploratory search for unknown value potential in data. The use of a data lake is therefore often accompanied by a paradigm shift in data usage: away from the predictable implementation of structured, standardized reporting and toward data-driven innovation processes that are hypothesis-driven wherever possible, but often also involve open-ended exploration. This approach usually requires changes in governance and organization. In our projects, we also accompany the initial process of idea generation and strategic adjustment on request. Only the question of amortization can almost never be answered ex ante, because the most valuable innovations are precisely those that are only discovered through data-driven work on the data lake. Strategic courage and foresight are therefore required here.

Data & analytics consulting with ISR
Why you should choose us!

Are you looking for a partner on your journey to becoming a data-driven company? We are here to support you with our many years of expertise in data and analytics consulting! With our in-depth understanding of business analytics, we accompany you from strategy to implementation. Our experienced data analytics consultants enable you to discover potential and base strategic decisions on a solid data foundation. Thanks to our long-standing partnerships with SAP, IBM, and Microsoft, we are happy to advise you in a vendor-neutral and cross-technology manner.

We can support you in other areas, too.
Further consulting pages

Data Science Consulting
With our solutions, we design your data processes to be efficient and future-proof, enabling you to gain competitive advantages.
Data Analytics Consulting
We help you gain valuable insights from your data for the sustainable success of your enterprise.
Data Management Consulting
With tailored Data Lake, Data Warehouse, and Data Lakehouse solutions, we support you from inception to completion.

Read more on the ISR blog
Valuable content on the topic of data lakes

Get in touch with us now
We would be pleased to advise you.

Take advantage of our data analytics consultants' many years of expertise to make your data strategy future-proof. Let's work together to develop customized data analytics solutions that will prepare your company for the future in the long term. Feel free to contact us for a no-obligation initial consultation.