Big Data vs Fast Data

posted Oct 31, 2012, 4:30 AM by Symeon Papadopoulos
The Twitter engineering team has recently released an "experience" report with many interesting details and insights into the currently hot trade-off between Big and Fast Data. The paper is available on arXiv. As the authors note, the use of Hadoop was motivated by the availability of supportive tools (e.g. Pig) and the team's extensive experience with this technology; however, after building a first prototype of the system, they found that the latency introduced by the batch processing paradigm was prohibitive for keeping up with the breaking news cycle (5-10 minutes at most). They therefore designed a custom in-memory processing system that could address their needs. In the end, they point out the need for a new architectural paradigm that brings together the benefits of both worlds (batch and real-time processing).
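To make the contrast concrete, the following is a toy sketch (my own illustration, not Twitter's actual engine) of the in-memory approach: a sliding-window co-occurrence counter for related-query suggestion. Queries issued in the same user session are treated as related, and counts older than the window are evicted, so suggestions track the breaking-news cycle instead of waiting for a batch job to finish. All class and method names here are assumptions for illustration.

```python
import time
from collections import defaultdict, deque

# Toy illustration of an in-memory sliding-window co-occurrence counter
# for related-query suggestion (not Twitter's actual implementation).
WINDOW_SECONDS = 10 * 60  # keep only the last 10 minutes of sessions


class RelatedQueries:
    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        # query -> {related query -> co-occurrence count within the window}
        self.counts = defaultdict(lambda: defaultdict(int))
        # (timestamp, q1, q2) events, oldest first, for eviction
        self.events = deque()

    def observe_session(self, queries, now=None):
        """Record co-occurrences among the queries of one user session."""
        now = time.time() if now is None else now
        self._evict(now)
        for i, q1 in enumerate(queries):
            for q2 in queries[i + 1:]:
                if q1 == q2:
                    continue
                self.counts[q1][q2] += 1
                self.counts[q2][q1] += 1
                self.events.append((now, q1, q2))

    def _evict(self, now):
        """Drop co-occurrences that have fallen out of the time window."""
        while self.events and now - self.events[0][0] > self.window:
            _, q1, q2 = self.events.popleft()
            for a, b in ((q1, q2), (q2, q1)):
                self.counts[a][b] -= 1
                if self.counts[a][b] <= 0:
                    del self.counts[a][b]

    def suggest(self, query, k=3, now=None):
        """Return the top-k queries most often co-occurring with `query`."""
        self._evict(time.time() if now is None else now)
        related = self.counts.get(query, {})
        return [q for q, _ in sorted(related.items(), key=lambda kv: -kv[1])[:k]]
```

A batch (Hadoop-style) pipeline would recompute these counts from logs every run, adding minutes or hours of latency; the in-memory structure updates on every observed session and answers queries immediately, which is the essence of the trade-off the paper describes.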

Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
Abstract: We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.