Choosing the right real-time processing framework
The Big Data landscape is growing fast in the area of data stream processing, and new technologies and tools keep emerging. This post will help you find your way through this jungle of technical details and choose the framework that suits you best. Let’s get started.
From its humble beginnings in the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key big data distributed processing frameworks in the world. Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing. You’ll find it used by banks, telecommunications companies, games companies, governments, and all of the major tech giants such as Apple, Facebook, IBM, and Microsoft.
Ease of use
Dynamic in nature
Extra components like GraphX and MLlib
No automatic optimization process
No file management system of its own
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. Apache Flink features APIs on different levels of abstraction. Analytical use cases can be easily realized with Flink’s Table API or SQL. These APIs are designed as unified APIs for batch and stream sources, meaning that the same query produces the same result regardless of whether it is evaluated over a file or a Kafka topic (given that both contain the same data).
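The "same query, same result" idea can be illustrated without any framework. The sketch below is plain Python, not Flink's Table API: the record shape, the function names, and the `topic_like_source` generator are all made up for illustration. It evaluates one aggregation over a bounded list (the "file") and over a one-at-a-time source (the "topic") and checks that the final results agree.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, int]  # (user, amount); hypothetical record shape


def total_per_user(events: Iterable[Event]) -> Dict[str, int]:
    """Batch-style evaluation: read everything, then return one result."""
    totals: Counter = Counter()
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


def total_per_user_streaming(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Stream-style evaluation: emit an updated result after every event."""
    totals: Counter = Counter()
    for user, amount in events:
        totals[user] += amount
        yield dict(totals)


def topic_like_source(events: List[Event]) -> Iterator[Event]:
    """Stand-in for a message topic: hands out one event at a time."""
    for event in events:
        yield event


DATA: List[Event] = [("ann", 5), ("bob", 3), ("ann", 2)]

# Bounded input: one final answer.
batch_result = total_per_user(DATA)

# Unbounded-style input: a sequence of updates; keep only the latest one.
stream_result: Dict[str, int] = {}
for stream_result in total_per_user_streaming(topic_like_source(DATA)):
    pass

print(batch_result, stream_result)
```

Once the stream has delivered the same records as the file, the last streaming update equals the batch answer. What differs is that the streaming version also exposes every intermediate result along the way.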
Low latency on minimal resources
Variety of sources and sinks
High level API
Exactly once processing
Lack of initial adoption
Community not as big as Spark
Little adoption of Flink for batch processing
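"Exactly once processing" in the list above is easier to picture with a toy model. The sketch below is a simplified illustration in plain Python and not Flink's actual checkpointing implementation: the `CheckpointedCounter` class, the `run` function, and the crash simulation are assumptions made for the example. The idea it shows is that operator state and input position are snapshotted together, so that after a failure the job rolls back both and replays the input, and each record ends up counted once in the state.

```python
import copy
from typing import Dict, List, Optional, Tuple


class CheckpointedCounter:
    """Counts words; state and input position are snapshotted together."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}
        self.offset = 0  # index of the next record to read
        # (offset, counts) as of the last completed checkpoint
        self._snapshot: Tuple[int, Dict[str, int]] = (0, {})

    def checkpoint(self) -> None:
        self._snapshot = (self.offset, copy.deepcopy(self.counts))

    def restore(self) -> None:
        self.offset, counts = self._snapshot
        self.counts = copy.deepcopy(counts)

    def process(self, word: str) -> None:
        self.counts[word] = self.counts.get(word, 0) + 1
        self.offset += 1


def run(log: List[str], checkpoint_every: int, crash_at: Optional[int]) -> Dict[str, int]:
    """Process a replayable log, optionally simulating one crash."""
    op = CheckpointedCounter()
    crashed = False
    while op.offset < len(log):
        if crash_at is not None and not crashed and op.offset == crash_at:
            crashed = True
            op.restore()  # drop all work done since the last checkpoint
            continue      # and replay from the checkpointed offset
        op.process(log[op.offset])
        if op.offset % checkpoint_every == 0:
            op.checkpoint()
    return op.counts


LOG = ["a", "b", "a", "c", "b", "a"]
clean = run(LOG, checkpoint_every=2, crash_at=None)
recovered = run(LOG, checkpoint_every=2, crash_at=3)
print(clean, recovered)
```

In the crashed run, the record at index 2 is physically processed twice, but the rollback discards the first attempt, so the final counts match the clean run. Note that this only covers internal state; side effects written to external systems need additional care (for example transactional or idempotent sinks).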
Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use! Apache Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
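The "guarantees your data will be processed" claim refers to an at-least-once guarantee: a tuple that is not acknowledged is emitted again. The following sketch is a toy model in plain Python, not Storm's API; Storm's real mechanism tracks whole tuple trees, which is not modeled here. The function name and the `fail_first_attempt` parameter are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Set


def run_at_least_once(tuples: List[str], fail_first_attempt: Set[str]) -> Dict[str, int]:
    """Replay every tuple until the processing step acknowledges it.

    Returns how many times each tuple was handed to the processing step.
    """
    pending: Deque[str] = deque(tuples)
    attempts: Dict[str, int] = {}
    while pending:
        item = pending.popleft()
        attempts[item] = attempts.get(item, 0) + 1
        # Simulated failure: selected tuples are not acked on their first try.
        acked = not (item in fail_first_attempt and attempts[item] == 1)
        if not acked:
            pending.append(item)  # no ack: the source emits the tuple again
    return attempts


attempts = run_at_least_once(["t1", "t2", "t3"], fail_first_attempt={"t2"})
print(attempts)
```

No tuple is lost, but `t2` is handed to the processing step twice. Downstream logic therefore has to tolerate duplicates, which is the practical difference between at-least-once and exactly-once processing.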
Very low latency
Excellent for non-complicated streaming use cases
No state management
No advanced features such as event-time processing, aggregation, windowing, or sessions
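For readers unfamiliar with event-time processing and windowing, here is a minimal sketch of what such a feature computes. It is plain Python and does not use any framework's API; the event shape and the function name are assumptions for the example. Events are grouped into fixed-size (tumbling) windows by the timestamp they carry, not by the order in which they arrive.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

Event = Tuple[int, str]  # (event_time_seconds, key); hypothetical shape


def tumbling_window_counts(events: List[Event], size: int) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, key) using each event's own timestamp."""
    counts: DefaultDict[Tuple[int, str], int] = defaultdict(int)
    for event_time, key in events:
        window_start = (event_time // size) * size
        counts[(window_start, key)] += 1
    return dict(counts)


# Arrival order differs from event-time order (the event at t=4 arrives after t=12).
ARRIVALS: List[Event] = [(1, "click"), (12, "click"), (4, "click"), (15, "view"), (9, "click")]

result = tumbling_window_counts(ARRIVALS, size=10)
print(result)
```

The out-of-order events at t=4 and t=9 still land in the window starting at 0. A real engine additionally needs a way to decide when a window is complete (typically watermarks), which this sketch leaves out because it sees the whole input at once.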