Conventional big data analytics systems (e.g., MapReduce, Dryad, Spark) were originally designed to work in an offline, batch-based manner: all data must be available in advance and is processed as a whole. However, data is often generated continuously and needs to be processed in real time; network traffic data in a telecommunications environment is one example. To solve this problem, a CUHK research team has developed AF-Stream, a novel distributed stream processing system for online big data analytics. It provides a high-performance, fault-tolerant, and generic platform for a variety of analytics applications, such as data synopsis, stream database queries, and online machine learning. AF-Stream realizes a novel concept called approximate fault tolerance, which reduces the number of backup operations to mitigate the performance overhead of maintaining fault tolerance, while ensuring that the errors in stream processing results after a failure are bounded. To address diverse application needs, AF-Stream can tune the trade-off between performance and accuracy with only a few parameters. Therefore, our technology can process more data, and process it faster, than other systems without fault tolerance.
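The idea of approximate fault tolerance described above can be illustrated with a minimal Python sketch. This is not AF-Stream's API or implementation: the class `ApproxBackupCounter` and the parameters `theta` and `max_pending` are hypothetical names chosen for illustration. The sketch assumes a simple per-key counter as the operator state and an in-memory copy standing in for a remote backup. A backup is taken only when the accumulated change since the last backup reaches `theta`, or when `max_pending` items have been processed since then, so the state lost on a failure stays below those limits.

```python
from collections import Counter


class ApproxBackupCounter:
    """Per-key counter that backs up its state lazily (illustrative only)."""

    def __init__(self, theta, max_pending):
        self.theta = theta              # hypothetical: bound on un-backed-up state change
        self.max_pending = max_pending  # hypothetical: bound on items since last backup
        self.state = Counter()          # live operator state
        self.backup = Counter()         # stands in for a remote backup store
        self.divergence = 0             # upper bound on L1 distance between state and backup
        self.pending = 0                # items processed since the last backup
        self.backups_taken = 0

    def process(self, key, value=1):
        """Apply one stream item, backing up only if a bound is reached."""
        self.state[key] += value
        self.divergence += abs(value)
        self.pending += 1
        if self.divergence >= self.theta or self.pending >= self.max_pending:
            self._save()

    def _save(self):
        """Copy the live state to the backup and reset the bookkeeping."""
        self.backup = Counter(self.state)
        self.divergence = 0
        self.pending = 0
        self.backups_taken += 1

    def recover(self):
        """Simulate a failure: roll the live state back to the last backup."""
        self.state = Counter(self.backup)
        self.divergence = 0
        self.pending = 0


op = ApproxBackupCounter(theta=5, max_pending=100)
for _ in range(12):
    op.process("flow-a")

lost = op.state["flow-a"] - op.backup["flow-a"]  # updates a failure would lose now
op.recover()
print(op.backups_taken, op.state["flow-a"], lost)  # prints: 2 10 2
```

In this sketch, 12 updates trigger only 2 backups instead of 12, and the recovered count differs from the true count by 2, which is below `theta`. Raising `theta` or `max_pending` means fewer backups and a larger possible error; lowering them does the opposite, which is the performance/accuracy trade-off in miniature.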
Large-scale, real-time analysis of continuous, unbounded data streams.
Network measurements (e.g., anomaly detection, flow size distribution, failure diagnosis); data mining and machine learning (e.g., frequent pattern mining, classification, regression, prediction).
Telecommunications, IT service operators, and the big data analytics industry.