Big Data – Real-Time Streaming

Real-Time Streaming

Real-time streaming means processing large data streams at or near real time, with the goal of running “continuous queries” against them to produce real-time analytics. CCI has worked with many open source frameworks in this area to provide solutions for our clients. Specifically, we have experience with the following commonly used open source streaming frameworks:

Apache Spark Streaming runs on Apache Spark to process and analyze both real-time and historical data. It represents a continuous stream of data as a DStream (discretized stream), a sequence of small batches. It enables high-volume, scalable, fault-tolerant processing of live data streams. Spark also comes with tools to facilitate machine learning and graph processing on these streams.
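To make the idea of a discretized stream concrete, here is a minimal plain-Python sketch. It does not use the Spark API; the function names `discretize` and `word_counts`, the sample events, and the one-second batch interval are all assumptions chosen for illustration. It groups timestamped events into consecutive micro-batches and runs the same small computation on each batch.

```python
from collections import defaultdict

def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches.

    Batch k covers the half-open time interval
    [k * batch_interval, (k + 1) * batch_interval).
    Intervals with no events are skipped in this simplified sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    # Return the batches in time order.
    return [batches[k] for k in sorted(batches)]

def word_counts(batch):
    """The per-batch computation: count occurrences of each value."""
    counts = defaultdict(int)
    for word in batch:
        counts[word] += 1
    return dict(counts)

# Hypothetical input: (timestamp in seconds, word).
events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (1.9, "a"), (2.5, "c")]
micro_batches = discretize(events, batch_interval=1.0)
results = [word_counts(batch) for batch in micro_batches]
print(results)
```

Each micro-batch is processed like a small batch job, which is why the same kind of logic can serve both live and historical data.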

Apache Storm provides event collection and real-time processing at massive scale. It includes batch support through Trident, a high-level abstraction for real-time computing on top of Storm. It supports building stream processing logic in multiple languages and guarantees that every message is processed at least once.

Apache Flink is a distributed stream processing engine that produces accurate results even for out-of-order or late-arriving data. It maintains application state with exactly-once consistency, which makes for a robust, stateful streaming solution that can recover seamlessly from failures. It performs well at large scale, with high throughput and low latency.
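One common way to get correct results from out-of-order data is to group events by the time they occurred rather than the time they arrived, and to close a window only after a watermark has passed its end. The plain-Python sketch below illustrates that idea in simplified form. It is not the Flink API; the names `windowed_sums`, `window_size`, and `max_delay`, and the sample events, are assumptions made for illustration.

```python
from collections import defaultdict

def windowed_sums(events, window_size, max_delay):
    """Sum values per event-time window, tolerating out-of-order arrival.

    events: iterable of (event_time, value) pairs in ARRIVAL order.
    A window [start, start + window_size) is emitted once the watermark
    (largest event time seen so far minus max_delay) reaches its end.
    Events for an already-closed window are collected as dropped.
    """
    open_windows = defaultdict(int)  # window start -> running sum
    emitted = []                     # (window start, sum), in emit order
    dropped = []                     # events that arrived too late
    watermark = float("-inf")

    for event_time, value in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            dropped.append((event_time, value))
            continue
        open_windows[start] += value
        watermark = max(watermark, event_time - max_delay)
        # Close every window whose end the watermark has reached.
        ready = sorted(s for s in open_windows if s + window_size <= watermark)
        for s in ready:
            emitted.append((s, open_windows.pop(s)))

    # End of input: flush the windows that are still open.
    for s in sorted(open_windows):
        emitted.append((s, open_windows.pop(s)))
    return emitted, dropped

# Hypothetical events in arrival order; times 4 and 3 arrive out of order.
arrivals = [(1, 1), (12, 1), (4, 1), (17, 1), (3, 1), (25, 1)]
emitted, dropped = windowed_sums(arrivals, window_size=10, max_delay=5)
print(emitted, dropped)
```

Here the event at time 4 arrives after the event at time 12 but is still counted in the first window, because the watermark has not yet passed that window's end. The event at time 3 arrives after the window has closed and is reported as dropped. A larger `max_delay` tolerates more disorder at the cost of higher latency.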

Kafka Streams simplifies application development by building on the Kafka producer and consumer libraries and leveraging Kafka's native capabilities to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity. It is best suited for streaming applications that use Kafka in their data pipeline.