Data lakes are being adopted by more and more companies seeking efficient storage for their data assets. The idea behind them is quite simple compared with the industry-standard data warehouse. This post explains the logical foundation of data lakes and presents a practical use case with a tool called Delta Lake. Enjoy!
What is a data lake?
A centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions. (Source: Amazon Web Services)
The rationale behind data lakes is similar to that of the widely used data warehouse. Although both fall into the same category, the logic behind them is quite different. Data stored in a data warehouse is already pre-processed: the reason for storing it has to be known and the data model has to be well defined. A data lake takes a different approach: neither the purpose of storing the data nor its data model has to be defined up front. The two variants compare as follows:
|Characteristics|Data Warehouse|Data Lake|
|---|---|---|
|Data|Structured data of transactional systems|Unstructured or semi-structured data|
|Schema|Schema on write|Schema on read|
|Storage|Requires high-cost storage|Uses low-cost storage|
|Users|Business analysts|Data scientists|
|Analytics|BI and visualization|Machine Learning and Data Science|
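The "Schema" row of the comparison can be illustrated with a small sketch in plain Scala, with no Spark involved. Everything here is hypothetical and only meant to show the idea: the `Order` model, the "id,amount" line format, and the `warehouseStore`, `lakeStore`, and `lakeRead` helpers are assumptions made for this example, not part of any library.

```scala
import scala.util.Try

// Hypothetical illustration of schema on write vs. schema on read.
object SchemaDemo {
  // The data model that a warehouse would require up front.
  final case class Order(id: Int, amount: Double)

  // Parse one "id,amount" line; returns None if the line does not fit the model.
  def parse(line: String): Option[Order] =
    line.split(",").map(_.trim) match {
      case Array(id, amount) =>
        for {
          i <- Try(id.toInt).toOption
          a <- Try(amount.toDouble).toOption
        } yield Order(i, a)
      case _ => None
    }

  // Schema on write: validate while storing; non-conforming records are dropped.
  def warehouseStore(lines: Seq[String]): Seq[Order] = lines.flatMap(parse)

  // Schema on read: store the raw lines exactly as they arrive...
  def lakeStore(lines: Seq[String]): Seq[String] = lines

  // ...and apply a schema only when a consumer reads them.
  def lakeRead(raw: Seq[String]): Seq[Order] = raw.flatMap(parse)
}
```

In this sketch the lake keeps the malformed line, so a later consumer with a different model could still interpret it, while the warehouse has already discarded it at write time.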
Using Delta Lake OSS to create a data lake
Now let's apply that theory using Delta Lake OSS. Delta Lake is an open source framework built on Apache Spark, used to ingest, manage, and transform data in a data lake. Getting started is quite simple: you will need an existing Apache Spark project. First, add Delta Lake as an SBT dependency:
libraryDependencies += "io.delta" %% "delta-core" % "0.5.0"
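For context, a fuller `build.sbt` might look like the sketch below. Only the `delta-core` line comes from this post; the Scala and Spark versions are assumptions chosen because, to my knowledge, Delta Lake 0.5.0 targets Spark 2.4.2 or later. Adjust them to match your own project.

```scala
// build.sbt (sketch; every version except delta-core is an assumption)
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  // Spark SQL provides SparkSession and the DataFrame API
  "org.apache.spark" %% "spark-sql" % "2.4.4",
  // Delta Lake itself, as shown above
  "io.delta" %% "delta-core" % "0.5.0"
)
```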
Saving data to Delta
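Next, let's create a first table. For this, you will need a Spark DataFrame, which can hold an arbitrary set of values or data read from another format, such as JSON or Parquet.
// Create a DataFrame with a single `id` column holding the values 0 to 49
val data = spark.range(0, 50)
// Write it to the given path in Delta format
data.write.format("delta").save("/data/delta-table")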
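The snippet above assumes a `spark` session already exists, as it does in `spark-shell`. In a standalone application you would create it yourself. The following is a minimal sketch under that assumption; the object name, the app name, the sample rows, and the `/data/delta-people` path are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SaveToDelta {
  // Build a small DataFrame with named columns (hypothetical sample data).
  def samplePeople(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, "Ann"), (2, "Bob")).toDF("id", "name")
  }

  // Write the DataFrame to `path` in Delta format using Spark's default save mode.
  def save(df: DataFrame, path: String): Unit =
    df.write.format("delta").save(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delta-lake-demo") // assumed name
      .master("local[*]")         // run locally on all available cores
      .getOrCreate()

    save(samplePeople(spark), "/data/delta-people") // hypothetical path
    spark.stop()
  }
}
```

Spark's default save mode is `ErrorIfExists`, so running this twice against the same path should fail unless you set a mode such as `overwrite` or `append`.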
Reading data from Delta
Reading the data is as simple as writing it. Just specify the path and the correct format, the same way you would with CSV or JSON data.
// Load the Delta table from its path
val df = spark.read.format("delta").load("/data/delta-table")
// Print the first rows (20 by default) to the console
df.show()
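Delta Lake also records every write as a new table version, and an older version can be read by number through the `versionAsOf` reader option (often called time travel). The sketch below wraps both kinds of read in helper functions; the `ReadVersions` object and its method names are hypothetical, and the path is whatever table you have written.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadVersions {
  // Read the latest version of the Delta table at `path`.
  def latest(spark: SparkSession, path: String): DataFrame =
    spark.read.format("delta").load(path)

  // Read a specific earlier version using the `versionAsOf` reader option.
  def atVersion(spark: SparkSession, path: String, version: Long): DataFrame =
    spark.read.format("delta").option("versionAsOf", version).load(path)
}
```

For example, `ReadVersions.atVersion(spark, "/data/delta-table", 0)` would return the table as it was after its first write, even if it has been changed since.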
Updating the data in Delta
Delta Lake OSS supports a range of update options, thanks to its ACID transaction model. Let's use that to run a batch update that overwrites the existing data:
// Overwrite the table with a new range of ids (0 to 99)
val newData = spark.range(0, 100)
newData.write.format("delta").mode("overwrite").save("/data/delta-table")
// Re-read the table to see the new contents
spark.read.format("delta").load("/data/delta-table").show()
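Overwriting is the bluntest of those update options. Delta Lake also exposes a `DeltaTable` API for conditional updates and deletes on rows already in the table. The following is a minimal sketch assuming a table with a numeric `id` column, like the one written above; the `ConditionalUpdate` object and its method names are hypothetical.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object ConditionalUpdate {
  // Add 100 to every even id in the table at `path`.
  def bumpEvenIds(spark: SparkSession, path: String): Unit = {
    val table = DeltaTable.forPath(spark, path)
    table.update(
      condition = expr("id % 2 == 0"),
      set = Map("id" -> expr("id + 100")))
  }

  // Delete every row whose id is odd.
  def deleteOddIds(spark: SparkSession, path: String): Unit =
    DeltaTable.forPath(spark, path).delete(condition = expr("id % 2 == 1"))
}
```

Each call is committed as its own transaction, so readers see the table either before or after the change, never a partial result.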
I hope this post has convinced you that Delta Lake is worth getting familiar with. As the comparison above suggests, data lakes offer low-cost storage and schema-on-read flexibility that data warehouses do not. The technology is evolving fast, so expect more methods and tools to gain popularity in the near future.