DOI: 10.1145/2723372.2742788
research-article
Open access

Twitter Heron: Stream Processing at Scale

Published: 27 May 2015

Abstract

Storm has long served as the main platform for real-time analytics at Twitter. However, as the scale of data being processed in real-time at Twitter has increased, along with an increase in the diversity and the number of use cases, many limitations of Storm have become apparent. We need a system that scales better, has better debug-ability, has better performance, and is easier to manage -- all while working in a shared cluster infrastructure. We considered various alternatives to meet these needs, and in the end concluded that we needed to build a new real-time stream data processing system. This paper presents the design and implementation of this new system, called Heron. Heron is now the de facto stream data processing engine inside Twitter, and in this paper we also share our experiences from running Heron in production. In this paper, we also provide empirical evidence demonstrating the efficiency and scalability of Heron.

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. real-time data processing
  2. stream data processing systems

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate: 106 of 415 submissions, 26%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
