Chapter 1: Why Apache Flink?
traditional architecture problems
no single database, but letting the data flow
Message transport and Message processing
An effective stream-first architecture follows a common pattern built from two main kinds of components: a message transport layer and a stream processing system
transport layer: ideal capabilities
performance with persistence: serves as a safety queue upstream from the processing step, a buffer that holds event data as a kind of short-term insurance against an interruption in processing
persistent transport layers are replayable: because messages are retained, the stream processor can replay and recompute a specified part of the stream of events
decoupling of multiple producers from multiple consumers: collects data from many sources (producers) and makes it available to multiple consumers
geo-distributed replication of streams
useful between data centers; message offsets need to be preserved so that updates from any data center can be propagated to any of the others, in bidirectional and cyclic replication of data.
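The transport-layer capabilities above (persistence, replay, and decoupled consumers) can be sketched with a toy append-only log. This is only an illustration in plain Python: `ReplayableLog`, `poll`, `seek`, and the consumer names are hypothetical and do not correspond to any real broker's API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ReplayableLog:
    """Toy append-only log (hypothetical, not a real message broker API)."""
    _events: List[Any] = field(default_factory=list)
    _offsets: Dict[str, int] = field(default_factory=dict)

    def append(self, event: Any) -> int:
        """Persist an event and return the offset it was stored at."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer: str, max_events: int = 10) -> List[Any]:
        """Return the next events for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer: str, offset: int) -> None:
        """Move a consumer to a given offset so it can replay events."""
        if not 0 <= offset <= len(self._events):
            raise ValueError("offset out of range")
        self._offsets[consumer] = offset


log = ReplayableLog()
# Several producers write to the same log.
for producer, value in [("sensor-a", 1), ("sensor-b", 2), ("sensor-a", 3)]:
    log.append((producer, value))

# Each consumer keeps its own offset, so reading does not remove events.
first_pass = log.poll("analytics")
other_consumer = log.poll("billing")

# Rewinding one consumer replays part of the stream without affecting the other.
log.seek("analytics", 1)
replayed = log.poll("analytics")
```

Because each consumer tracks its own offset, producers never need to know who reads the data, and a consumer can rewind after a failure without disturbing the others.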
What does Flink do?
Counting with Batch and Lambda Architecture vs Streaming Architecture
Notions of time
Mechanisms to group and collect a bunch of events by time or other characteristics
Grouping elements based on their counts instead of their timestamps
/* If the number of elements never reaches the required count, a "trigger" should be used to close the window */
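A minimal sketch of count-based grouping with a fallback trigger, in plain Python rather than Flink's API. The function name `count_windows` and the `timeout` parameter are assumptions for illustration. For simplicity the timeout is only checked when the next event arrives; a real trigger would typically rely on a timer.

```python
from typing import Iterable, Iterator, List, Tuple


def count_windows(
    events: Iterable[Tuple[float, str]], size: int, timeout: float
) -> Iterator[List[str]]:
    """Group (timestamp, value) events into windows of `size` elements.

    A window is also closed early when an arriving event is more than
    `timeout` later than the first event of the open window.
    """
    window: List[str] = []
    opened_at = 0.0
    for ts, value in events:
        # Fallback trigger: the open window has waited too long.
        if window and ts - opened_at > timeout:
            yield window
            window = []
        if not window:
            opened_at = ts
        window.append(value)
        # Normal case: the count condition is fulfilled.
        if len(window) == size:
            yield window
            window = []
    if window:
        # End of input flushes whatever is left.
        yield window


events = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (20, "e"), (21, "f"), (22, "g")]
windows = list(count_windows(events, size=3, timeout=5))
```

Here the element "d" would wait indefinitely for two more elements; the timeout closes its window once the next event shows that more than 5 time units have passed.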
A session is a period of activity that is preceded and followed by a period of inactivity
/* how long we want to wait until we believe that a session has ended */
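The session idea can be sketched as grouping timestamps separated by less than an inactivity gap. This is a plain-Python illustration, not Flink code; `sessionize` and `gap` are names chosen for the example.

```python
from typing import Iterable, List


def sessionize(timestamps: Iterable[float], gap: float) -> List[List[float]]:
    """Split event timestamps into sessions separated by more than `gap`."""
    sessions: List[List[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            # Still within the inactivity gap: same session.
            sessions[-1].append(ts)
        else:
            # Gap exceeded (or first event): start a new session.
            sessions.append([ts])
    return sessions


clicks = [1, 2, 4, 30, 31, 100]
sessions = sessionize(clicks, gap=10)
```

Choosing the gap is the trade-off noted above: a short gap splits one visit into several sessions, while a long gap delays the decision that a session has ended.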
Time travel means rewinding the stream to some time in the past and restarting the processing from there, eventually catching up with the present.
Watermarks: regular records embedded in the stream that, based on event time, inform computations that a certain time has been reached.
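A rough sketch of how such time-marker records (watermarks, in Flink's terminology) can drive event-time windows. This is plain Python, not Flink's API: the record layout, `tumbling_counts`, and the rule for late events are assumptions made for the example. A watermark with time `t` is treated here as a promise that no events earlier than `t` will follow.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tumbling_counts(stream: Iterable[tuple], size: int) -> List[Tuple[int, int]]:
    """Count events per tumbling event-time window of length `size`.

    Records are ("event", event_time, value) or ("watermark", event_time).
    A window [start, start + size) is emitted once a watermark reaches its end.
    """
    open_windows: Dict[int, int] = defaultdict(int)  # window start -> count
    current_wm = float("-inf")
    results: List[Tuple[int, int]] = []
    for record in stream:
        if record[0] == "event":
            _, ts, _value = record
            start = ts - ts % size
            if start + size <= current_wm:
                continue  # late event: its window already fired (dropped here)
            open_windows[start] += 1
        else:
            _, current_wm = record
            # Fire every window whose end the watermark has reached.
            for start in sorted(open_windows):
                if start + size <= current_wm:
                    results.append((start, open_windows.pop(start)))
    return results


stream = [
    ("event", 1, "a"), ("event", 7, "b"), ("event", 3, "c"),
    ("watermark", 5),
    ("event", 12, "d"), ("event", 8, "e"),
    ("watermark", 10),
    ("event", 2, "late"),
    ("watermark", 20),
]
counts = tumbling_counts(stream, size=5)
```

Events arrive out of order (7 before 3, 12 before 8), yet each window's count is correct because results are only emitted when a watermark says the window is complete.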
Batch is a Special Case of Streaming
Why is Apache Flink considered the next distributed data processing revolution?
Comparing Spark and Apache Flink performances for batch processing and stream processing
Distributed Data Processing with Apache Flink Architecture
What can Flink do that others can't?
Image for example
State: Checkpoints, Savepoints, and Fault-tolerance
How does Apache Flink handle stateful stream processing, and how can distributed stream processing and data-driven applications be managed efficiently with Flink's checkpoints and savepoints?
The Architecture of Apache Flink
Back pressure in Flink
Disadvantages of Flink
Pipelined execution in Flink has some limitations with regard to memory management
https://books.google.it/books?id=nHc5DwAAQBAJ&pg=PA288&lpg=PA288&dq=flink+uses+raw+bytes+disadvantage& source=bl&ots=GjoXKSb1VH&sig=ACfU3U11PntsGFzIQZogmHxwtRakdCHVgA&hl=en&sa=X&ved=2ahUKEwjWtN2LluvoAhWJzaQKHSXxA6MQ6AEwAHoECA0QKQ#v=onepage&q=flink uses raw bytes disadvantage&f=false
/* The limitation only applies to Flink's DataStream/Streaming API when using iterations. When using the DataSet/Batch API, there are no limitations. */
Spark Streaming vs Flink vs Storm vs Kafka Streams vs Samza : Choose Your Stream Processing Framework
Why did Alibaba choose Flink?
Monitoring Apache Flink Applications