Introduction to Apache Flink, Stream Processing for Real Time and Beyond

Chapter 1: Why Apache Flink?
- data has traditionally been treated as finite rather than as a continuous stream, because finite data is easier to store. Flink provides stream processing for large-volume data and also lets you do batch analytics with the same technology
- goals for processing continuous event data: low latency, deal with interruptions and restart after failure (fault tolerance), no overhead when there are no failures, recognize sessions based on when the events occur rather than on an arbitrary processing interval, track events in the correct order, handle events that arrive out of order, easy to use, easily maintained
- evolution of stream processing technologies:
  - Apache Storm: brought very low latency, but high throughput was hard to achieve; it didn't provide the level of correctness needed (no state consistency with exactly-once processing) and didn't handle event-time processing
  - lambda architecture: provided accurate but delayed results via batch MapReduce jobs, plus an in-the-moment preliminary view of new results via Storm. The drawback is that the processing logic needs to be programmed twice, once for the batch system and once for the streaming system
  - Apache Spark Streaming: breaks a continuous stream of events into a series of small, atomic batch jobs (micro-batching). This makes exactly-once guarantees of state consistency achievable: if a micro-batch fails, it can be rerun.
  - Apache Flink: the Flink runtime is the core fabric, with the DataStream API (stream processing) and the DataSet API (batch processing) on top. DataStream is a fluent API for defining analytics on possibly infinite data streams, and it is the main data structure. Flink:
    - is distributed: it can run on hundreds or thousands of machines, splitting a large computation into small chunks, with each machine executing a chunk. The Flink framework automatically takes care of correctly restoring the computation in the event of machine and other failures.
    - internally uses fault-tolerant streaming data flows, allowing developers to analyse never-ending streams of data.
    - supports program visualization to help understand how programs are running
    - stream-based architecture nicely supports a microservice approach and Flink provides stream processing that is needed for this type of work.
    - " the ease of working with it, and the wide range of ways it can be used to advantage make it an attractive option "
Stream-First Architecture
- traditional architecture problems
  - centralized database which represents the state of the business
  - the pipeline from data ingestion to analytics is too complex and too slow
  - traditional architecture is too monolithic
  - complex failure models
  - hard to maintain the state in large distributed systems
- streaming architecture
  - no single database, but letting the data flow
  - Message transport and Message processing
    
    An effective stream-first architecture follows a common pattern built from two main kinds of components: a message transport and a stream processing system
    - message transport: collects and delivers data from continuous events from a variety of sources (producers) and makes it available to the applications that subscribe to it (consumers)
    - stream processing system: to consistently move data between applications, aggregate and process events, maintain local application state
- transport layer: ideal capabilities
  - performance with persistence: serves as a safety queue upstream from the processing step, a buffer that holds event data as a kind of short-term insurance against an interruption in processing
    
    persistent transport layers are replayable: messages stay available after they are read, so the stream processor can replay and recompute a specified part of the stream of events
  - decoupling of multiple producers from multiple consumers: data is collected from many sources (producers) and made available to multiple consumers
- geo distributed replication of streams
  
  useful between data centers. Message offsets need to be preserved so that updates from any data center can be propagated to any of the others, in bidirectional and cyclic replication of data.
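
The replayable transport layer described above can be pictured with a tiny in-memory log. This is only a sketch to fix the idea: `ReplayableLog`, `append` and `readFrom` are hypothetical names invented for these notes, not the API of any real message transport. The point is that reading does not consume messages, so any consumer can start again from an earlier offset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a persistent, replayable transport layer.
final class ReplayableLog {
    private final List<String> messages = new ArrayList<>();

    // Producers append; the returned offset is the message's position in the log.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Consumers read from any offset. Reading never removes messages,
    // so the same range can be read again later (replay).
    List<String> readFrom(long offset) {
        return new ArrayList<>(messages.subList((int) offset, messages.size()));
    }

    public static void main(String[] args) {
        ReplayableLog log = new ReplayableLog();
        log.append("click:a");
        long remembered = log.append("click:b");
        log.append("click:c");

        // One consumer reads everything, another replays from a remembered offset.
        System.out.println(log.readFrom(0));          // [click:a, click:b, click:c]
        System.out.println(log.readFrom(remembered)); // [click:b, click:c]
    }
}
```

Because producers only append and consumers only hold an offset, the two sides stay decoupled: adding a consumer does not change anything for the producers.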
What does Flink do?
- versatile and able to address correctness: Flink offers correctness through a natural fit between the way computation windows are defined and the way data naturally occurs.
- Processing Time vs Event Time: the separation of different types of time in Flink is part of what makes it able to do more than older streaming systems.
- Accuracy under failures, keeping track of state: to stay accurate, a computation must keep track of state; if the computational framework does not do it, the developer has to. In continuous stream processing there is no end point at which you stop and tally up, so the state must be updated as you go. Flink addresses this challenge with checkpoints. Checkpoints enable fault tolerance by keeping track of intermediate conditions of state, so that the system can be accurately reset after a failure, and they do this with low overhead. Checkpoints also give Flink the ability to reprocess data.
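
A minimal sketch of the checkpoint idea, assuming a replayable input: a snapshot stores the operator state together with the input position that state reflects, so recovery means restoring the snapshot and replaying from that position. This is not how Flink implements checkpoints internally; `CheckpointedSum`, `Snapshot` and the method names are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical single-operator model: a running sum over a replayable input.
final class CheckpointedSum {
    // A snapshot pairs the state with the input offset it corresponds to.
    static final class Snapshot {
        final int offset;
        final long sum;
        Snapshot(int offset, long sum) { this.offset = offset; this.sum = sum; }
    }

    private int offset = 0; // next input position to read
    private long sum = 0;   // running state

    // Consume input elements from the current offset up to (excluding) `end`.
    void processUntil(List<Integer> input, int end) {
        while (offset < end) {
            sum += input.get(offset);
            offset++;
        }
    }

    Snapshot checkpoint() { return new Snapshot(offset, sum); }

    void restore(Snapshot s) { offset = s.offset; sum = s.sum; }

    long currentSum() { return sum; }

    public static void main(String[] args) {
        List<Integer> events = Arrays.asList(5, 1, 7, 3, 4);

        CheckpointedSum op = new CheckpointedSum();
        op.processUntil(events, 2);
        Snapshot snap = op.checkpoint(); // sum = 6 at offset 2
        op.processUntil(events, 4);      // progress made after the checkpoint

        // Simulated failure: a fresh operator restores the snapshot and
        // replays the input from the checkpointed offset.
        CheckpointedSum recovered = new CheckpointedSum();
        recovered.restore(snap);
        recovered.processUntil(events, events.size());
        System.out.println(recovered.currentSum()); // 20, each event counted once
    }
}
```

The work done between the checkpoint and the failure is lost and redone, but no event is counted twice in the final state, which is the exactly-once property for state that the notes refer to.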
Handling Time
- Counting with Batch and Lambda Architecture vs Streaming Architecture
  - Batch and Lambda Architecture problems: too many moving parts and too many systems, which makes them hard to administer; time is treated implicitly; early alerts are hard to provide; out-of-order events; unclear batch boundaries
  - Streaming Architecture: batches, called windows, are embedded entirely in the application logic of the Flink program
- Notions of time
  - Event Time: the time at which an event actually happened in the real world
  - Processing Time: the time at which the event is observed by the machine that is processing it
- Windows
  
  Mechanisms to group and collect a bunch of events by time or other characteristics
  - Time Windows
    - Tumbling Window
    - Sliding Window
  - Count Window
    
    Grouping elements based on their counts instead of their timestamps
```
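/* If the number of elements never reaches the window size, the window never closes on its own; use a "trigger" to close it. */
stream.countWindow(4)    // tumbling count window: evaluated every 4 elements
stream.countWindow(4, 2) // sliding count window: size 4, evaluated every 2 elements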
```
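
To make the two `countWindow` variants above concrete, here is a plain-Java model that does not use Flink. It assumes a simplified semantics: every `slide` elements, emit the last (up to) `size` elements. `CountWindows` and its methods are hypothetical names for this sketch only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, non-Flink model of count windows (assumed semantics, see lead-in).
final class CountWindows {
    // Sliding variant: fire every `slide` elements over the last (up to) `size` elements.
    static List<List<Integer>> countWindow(List<Integer> events, int size, int slide) {
        List<List<Integer>> fired = new ArrayList<>();
        for (int seen = slide; seen <= events.size(); seen += slide) {
            int start = Math.max(0, seen - size);
            fired.add(new ArrayList<>(events.subList(start, seen)));
        }
        return fired;
    }

    // Tumbling variant: the slide equals the size, so windows do not overlap.
    static List<List<Integer>> countWindow(List<Integer> events, int size) {
        return countWindow(events, size, size);
    }

    public static void main(String[] args) {
        List<Integer> events = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(countWindow(events, 4));    // [[1, 2, 3, 4]]
        System.out.println(countWindow(events, 4, 2)); // [[1, 2], [1, 2, 3, 4], [3, 4, 5, 6]]
    }
}
```

In the tumbling case, elements 5 and 6 are never emitted because a second group of four never completes. That is the situation the comment about triggers refers to.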
  - Session Window
    
    Session is a period of activity that is preceded and followed by a period of inactivity
```
/* the gap is how long we wait without new events before we consider the session ended */
stream.window(SessionWindows.withGap(Time.minutes(5)))
```
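
The session gap can be illustrated without Flink: given event timestamps in ascending order, a new session starts whenever the pause since the previous event is longer than the gap. A rough sketch under that assumption; `SessionGrouping.sessions` is a hypothetical helper, and real session windows also have to deal with out-of-order events, which this ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Non-Flink sketch: split sorted timestamps (ms) into sessions by inactivity gap.
final class SessionGrouping {
    static List<List<Long>> sessions(List<Long> sortedTimestamps, long gapMillis) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            // A pause longer than the gap closes the current session.
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gapMillis) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        // Three events one minute apart, an 8-minute pause, then two more events.
        List<Long> events = Arrays.asList(0L, 60_000L, 120_000L, 600_000L, 660_000L);
        System.out.println(sessions(events, fiveMinutes));
        // [[0, 60000, 120000], [600000, 660000]]
    }
}
```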
- Time Travel
  
  Time travel means rewinding the stream to some time in the past and restarting the processing from there, eventually catching up with the present.
  - Watermarks
    
    Regular records embedded in the stream that, based on event time, inform computations that a certain point in time has been reached.
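
A sketch of how watermarks let event-time windows close even though events arrive out of order. The assumptions are mine, not taken from the notes: the watermark trails the largest timestamp seen by a fixed `maxDelay`, a tumbling window `[start, start + size)` fires once the watermark reaches its end, and events for already-closed windows are dropped. `WatermarkDemo.run` is a hypothetical name, and real Flink semantics differ in details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Non-Flink model of tumbling event-time windows driven by a watermark.
final class WatermarkDemo {
    // Returns "windowStart=count" entries in the order the windows are closed.
    // Assumes non-negative timestamps.
    static List<String> run(long[] eventTimes, long size, long maxDelay) {
        TreeMap<Long, Integer> open = new TreeMap<>(); // window start -> event count
        List<String> fired = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        for (long ts : eventTimes) {
            long start = ts - (ts % size);
            if (start + size <= watermark) {
                continue; // late event: its window was already closed
            }
            open.merge(start, 1, Integer::sum);
            // The watermark only moves forward, trailing the largest timestamp seen.
            watermark = Math.max(watermark, ts - maxDelay);
            while (!open.isEmpty() && open.firstKey() + size <= watermark) {
                Map.Entry<Long, Integer> closed = open.pollFirstEntry();
                fired.add(closed.getKey() + "=" + closed.getValue());
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        // Event 8 arrives after 12 but is still counted; event 9 arrives after
        // the watermark has passed 10 and is dropped.
        long[] events = {1, 4, 12, 8, 15, 9, 25};
        System.out.println(run(events, 10, 3)); // [0=3, 10=2]
    }
}
```

The window starting at 20 is still open at the end of the input, because no watermark has reached 30 yet.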
Stateful Computation
Batch is a Special Case of Streaming

IDEAS

Main notes
- Intro
  
  https://www.zdnet.com/article/apache-flink-does-the-world-need-another-streaming-engine/
  - Apache Flink as distributed processing engine over unbounded and bounded data streams
- Why is Apache Flink considered the next distributed data processing revolution?
  - https://www.kdnuggets.com/2017/07/apache-flink-distributed-data-processing-revolution.html
  Comparing Spark and Apache Flink performances for batch processing and stream processing
- Distributed Data Processing with Apache Flink Architecture
  - Problems in Stream Processing with Batch Engine
  - Does Backpressure Occur in Case of Flink? (where is Flink not efficient, and what to do about it?)
    - How can backpressure be handled?
  https://www.xenonstack.com/blog/data-processing-apache-flink/
- What can Flink do that others couldn't?
  - Image for example
  https://mapr.com/blog/distributed-stream-and-graph-processing-apache-flink/
- State: Checkpoints, Savepoints, and Fault-tolerance
- how Apache Flink handles stateful stream processing and how to manage distributed stream processing and data driven applications efficiently with Flink's checkpoints and savepoints?
  
  https://www.infoq.com/presentations/distributed-stream-processing-flink/
- The Architecture of Apache Flink
  
  https://learning.oreilly.com/library/view/stream-processing-with/9781491974285/ch03.html
- Performance
  
  https://lsds.doc.ic.ac.uk/blog/do-we-need-distributed-stream-processing
- Back pressure in Flink
  
  https://www.ververica.com/blog/how-flink-handles-backpressure
  
  https://learning.oreilly.com/library/view/learning-apache-flink/9781786466228/ch10s07.html
- Disadvantages of Flink
  
  https://learning.oreilly.com/library/view/data-lake-for/9781787281349/3ced0f87-601d-4016-9285-359a45bcdf8b.xhtml
  - Pipelined execution in Flink has some limitations with regard to memory management
    
    https://books.google.it/books?id=nHc5DwAAQBAJ&pg=PA288&lpg=PA288&dq=flink+uses+raw+bytes+disadvantage& source=bl&ots=GjoXKSb1VH&sig=ACfU3U11PntsGFzIQZogmHxwtRakdCHVgA&hl=en&sa=X&ved=2ahUKEwjWtN2LluvoAhWJzaQKHSXxA6MQ6AEwAHoECA0QKQ#v=onepage&q=flink uses raw bytes disadvantage&f=false
    
    file:///C:/Users/sokol/Downloads/paper-final.pdf
    
    /* The limitation only applies to Flink's DataStream/Streaming API when using iterations. When using the DataSet/Batch API, there are no limitations. */
- Spark Streaming vs Flink vs Storm vs Kafka Streams vs Samza : Choose Your Stream Processing Framework
  
  https://medium.com/@chandanbaranwal/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b
  
  https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
- Why did Alibaba choose Flink?
  
  https://www.alibabacloud.com/blog/why-did-alibaba-choose-apache-flink-anyway_595190
- Monitoring Apache Flink Applications