Chapter 1: Why Apache Flink?
Stream-First Architecture
traditional architecture problems
streaming architecture
no single database, but letting the data flow
Message transport and Message processing
An effective stream-first architecture follows a common pattern built from two main kinds of components: a message transport layer and a stream processing system
transport layer: ideal capabilities
performance with persistence: serves as a safety queue upstream from the processing step, a buffer that holds event data as short-term insurance against an interruption in processing
persistent transport layers are replayable: messages can be read again, which lets the stream processor replay and recompute a specified part of the stream of events
decoupling of multiple producers from multiple consumers: data is collected from many sources (producers) and made available to multiple consumers
geo-distributed replication of streams
useful between data centers; message offsets need to be preserved so that updates from any data center can be propagated to any of the others, in bidirectional and cyclic replication of data
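The replay and decoupling capabilities above can be illustrated with a toy in-memory log. This is only a sketch under stated assumptions: `ReplayableLog`, `append`, and `readFrom` are hypothetical names invented for the example, not the API of any real message transport, and real systems persist to disk and replicate.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory stand-in for a persistent, replayable transport (hypothetical, not a real client API). */
public class ReplayableLog {
    private final List<String> messages = new ArrayList<>();

    /** Appends a message and returns the offset it was stored at. */
    public long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns every message from the given offset onward; reading never removes anything. */
    public List<String> readFrom(long offset) {
        return new ArrayList<>(messages.subList((int) offset, messages.size()));
    }

    public static void main(String[] args) {
        ReplayableLog log = new ReplayableLog();
        log.append("click:a");
        long checkpoint = log.append("click:b");
        log.append("click:c");

        // One consumer reads everything; another replays from a remembered offset.
        System.out.println(log.readFrom(0));          // [click:a, click:b, click:c]
        System.out.println(log.readFrom(checkpoint)); // [click:b, click:c]
    }
}
```

Because consumers only remember an offset and never delete messages, any number of them can read the same stream independently, and a processor can rewind to an earlier offset to recompute part of it.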
What does Flink do?
Handling Time
Counting with Batch and Lambda Architecture vs Streaming Architecture
Notions of time
Windows
Mechanisms to group and collect events by time or other characteristics
Time Windows
Tumbling Window
Sliding Window
Count Window
Grouping elements based on their counts instead of their timestamps
/* If the number of elements never fulfills the condition, a "trigger" should be used to close the window */
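stream.countWindow(4)     /* tumbling count window: evaluated once 4 elements have arrived */
stream.countWindow(4, 2)  /* sliding count window: covers 4 elements, evaluated every 2 elements */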
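To make the difference between the two calls concrete, here is a plain-Java simulation that does not use Flink. `CountWindows`, `tumbling`, and `sliding` are hypothetical helper names; the sliding variant assumes that a window fires every `slide` elements with at most the last `size` elements, which is how the sliding count window is understood here.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java simulation of count windows (illustrative only, not Flink code). */
public class CountWindows {
    /** Tumbling: emit a window each time `size` elements have arrived; leftovers never fire. */
    public static <T> List<List<T>> tumbling(List<T> events, int size) {
        List<List<T>> fired = new ArrayList<>();
        List<T> buffer = new ArrayList<>();
        for (T e : events) {
            buffer.add(e);
            if (buffer.size() == size) {
                fired.add(new ArrayList<>(buffer));
                buffer.clear();
            }
        }
        return fired;
    }

    /** Sliding: every `slide` elements, emit the most recent (up to) `size` elements. */
    public static <T> List<List<T>> sliding(List<T> events, int size, int slide) {
        List<List<T>> fired = new ArrayList<>();
        for (int seen = slide; seen <= events.size(); seen += slide) {
            int from = Math.max(0, seen - size);
            fired.add(new ArrayList<>(events.subList(from, seen)));
        }
        return fired;
    }

    public static void main(String[] args) {
        List<Integer> events = List.of(1, 2, 3, 4, 5, 6);
        System.out.println(tumbling(events, 4));   // [[1, 2, 3, 4]]
        System.out.println(sliding(events, 4, 2)); // [[1, 2], [1, 2, 3, 4], [3, 4, 5, 6]]
    }
}
```

In the tumbling case elements 5 and 6 stay buffered and never produce a window, which is the situation the note about triggers refers to.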
Session Window
A session is a period of activity that is preceded and followed by a period of inactivity
/* how long we want to wait until we believe that a session has ended */
stream.window(SessionWindows.withGap(Time.minutes(5)))
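The gap rule can be sketched without Flink as follows. `SessionSketch` and `sessions` are hypothetical names, timestamps are assumed to arrive in ascending order, and treating a pause of exactly the gap length as part of the same session is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of session grouping by inactivity gap (illustrative only, not Flink code). */
public class SessionSketch {
    /** Splits ascending timestamps into sessions; a pause longer than gapMillis starts a new session. */
    public static List<List<Long>> sessions(List<Long> timestamps, long gapMillis) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : timestamps) {
            boolean pauseTooLong = !current.isEmpty()
                    && ts - current.get(current.size() - 1) > gapMillis;
            if (pauseTooLong) {
                result.add(current);          // close the session that just went quiet
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);              // the last session is closed at end of input
        }
        return result;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        List<Long> clicks = List.of(0L, 60_000L, 120_000L, 600_000L, 630_000L);
        // Prints [[0, 60000, 120000], [600000, 630000]]
        System.out.println(sessions(clicks, fiveMinutes));
    }
}
```

The eight-minute pause between 120000 and 600000 exceeds the five-minute gap, so the activity is split into two sessions.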
Time Travel
Time travel means rewinding the stream to some time in the past and restarting the processing from there, eventually catching up with the present.
Watermarks
Regular records embedded in the stream that, based on event time, inform computations that a certain time has been reached.
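A minimal sketch of how a watermark can drive the closing of event-time windows, written in plain Java rather than against Flink. `WatermarkSketch`, `onEvent`, and `onWatermark` are hypothetical names; the sketch assumes non-negative timestamps, fixed-size tumbling windows, and a simplified rule that a window closes once the watermark reaches its end. Late events are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of watermark-driven tumbling windows (illustrative only, not Flink code). */
public class WatermarkSketch {
    private final long windowSize;
    // Window start time -> number of events counted so far, ordered by start time.
    private final TreeMap<Long, Integer> openWindows = new TreeMap<>();

    public WatermarkSketch(long windowSize) {
        this.windowSize = windowSize;
    }

    /** Assigns the event to its window by event time, regardless of arrival order. */
    public void onEvent(long eventTime) {
        long start = eventTime - (eventTime % windowSize);
        openWindows.merge(start, 1, Integer::sum);
    }

    /** Closes every window whose end is at or before the watermark; returns "start=count" entries. */
    public List<String> onWatermark(long watermark) {
        List<String> closed = new ArrayList<>();
        while (!openWindows.isEmpty() && openWindows.firstKey() + windowSize <= watermark) {
            Map.Entry<Long, Integer> e = openWindows.pollFirstEntry();
            closed.add(e.getKey() + "=" + e.getValue());
        }
        return closed;
    }

    public static void main(String[] args) {
        WatermarkSketch counter = new WatermarkSketch(10);
        counter.onEvent(3);
        counter.onEvent(12);
        counter.onEvent(7);                          // arrives out of order, still counted in window 0
        System.out.println(counter.onWatermark(10)); // [0=2]
        counter.onEvent(15);
        System.out.println(counter.onWatermark(20)); // [10=2]
    }
}
```

The window starting at 0 is only emitted when the watermark says event time 10 has been reached, so the out-of-order event at time 7 is still included in its count.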
Stateful Computation
Batch is a Special Case of Streaming
Intro
https://www.zdnet.com/article/apache-flink-does-the-world-need-another-streaming-engine/
Why is Apache Flink considered the next distributed data processing revolution?
Comparing Spark and Apache Flink performances for batch processing and stream processing
Distributed Data Processing with Apache Flink Architecture
https://www.xenonstack.com/blog/data-processing-apache-flink/
What can Flink do that others cannot?
Image for example
https://mapr.com/blog/distributed-stream-and-graph-processing-apache-flink/
State: Checkpoints, Savepoints, and Fault-tolerance
How does Apache Flink handle stateful stream processing, and how can distributed stream processing and data-driven applications be managed efficiently with Flink's checkpoints and savepoints?
https://www.infoq.com/presentations/distributed-stream-processing-flink/
The Architecture of Apache Flink
https://learning.oreilly.com/library/view/stream-processing-with/9781491974285/ch03.html
Performance
https://lsds.doc.ic.ac.uk/blog/do-we-need-distributed-stream-processing
Back pressure in Flink
https://www.ververica.com/blog/how-flink-handles-backpressure
https://learning.oreilly.com/library/view/learning-apache-flink/9781786466228/ch10s07.html
Disadvantages of Flink
https://learning.oreilly.com/library/view/data-lake-for/9781787281349/3ced0f87-601d-4016-9285-359a45bcdf8b.xhtml
Pipelined execution in Flink has some limitations with regard to memory management
https://books.google.it/books?id=nHc5DwAAQBAJ&pg=PA288&lpg=PA288&dq=flink+uses+raw+bytes+disadvantage&source=bl&ots=GjoXKSb1VH&sig=ACfU3U11PntsGFzIQZogmHxwtRakdCHVgA&hl=en&sa=X&ved=2ahUKEwjWtN2LluvoAhWJzaQKHSXxA6MQ6AEwAHoECA0QKQ#v=onepage&q=flink%20uses%20raw%20bytes%20disadvantage&f=false
file:///C:/Users/sokol/Downloads/paper-final.pdf
/* The limitation only applies to Flink's DataStream/Streaming API when using iterations. When using the DataSet/Batch API, there are no limitations. */
Spark Streaming vs Flink vs Storm vs Kafka Streams vs Samza : Choose Your Stream Processing Framework
https://medium.com/@chandanbaranwal/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Why did Alibaba choose Flink?
https://www.alibabacloud.com/blog/why-did-alibaba-choose-apache-flink-anyway_595190
Monitoring Apache Flink Applications