Udon: Efficient Debugging of User-Defined Functions in Big Data Systems with Line-by-Line Control

Published: 12 December 2023

Abstract

Many big data systems are written in languages such as C, C++, Java, and Scala to process large amounts of data efficiently, while data analysts often use Python to conduct data wrangling, statistical analysis, and machine learning. User-defined functions (UDFs) are commonly used in these systems to bridge the gap between the two ecosystems. In this paper, we propose Udon, a novel debugger to support fine-grained debugging of UDFs. Udon encapsulates the modern line-by-line debugging primitives, such as the ability to set breakpoints, perform code inspections, and make code modifications while executing a UDF on a single tuple. It includes a novel debug-aware UDF execution model to ensure the responsiveness of the operator during debugging. It utilizes advanced state-transfer techniques to satisfy breakpoint conditions that span across multiple UDFs. It incorporates various optimization techniques to reduce the runtime overhead. We conduct experiments with multiple UDF workloads on various datasets and show its high efficiency and scalability.
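
To make the idea of a conditional, line-level breakpoint on a single tuple concrete, here is a minimal sketch in Python. It is not Udon's implementation: it relies only on the standard `sys.settrace` hook, and the names `clean_price` and `run_with_conditional_breakpoint` are hypothetical, introduced for illustration. Instead of pausing execution, it records the UDF's local variables when the chosen line is about to run and the condition holds.

```python
import sys

# Hypothetical UDF applied to one tuple (a dict). No blank lines or
# docstring inside, so line offsets from the `def` line are 1, 2, 3.
def clean_price(row):
    price = float(row["price"])
    discounted = price * (1 - row["discount"])
    return round(discounted, 2)

def run_with_conditional_breakpoint(udf, row, line_offset, condition):
    """Run udf(row); snapshot locals when the line at `line_offset`
    (relative to the `def` line) is about to execute and
    `condition(locals)` is true. Assumes the UDF has no decorators."""
    code = udf.__code__
    target_line = code.co_firstlineno + line_offset
    snapshots = []

    def tracer(frame, event, arg):
        # Trace only the UDF's own frame; ignore any functions it calls.
        if frame.f_code is not code:
            return None
        if event == "line" and frame.f_lineno == target_line:
            local_vars = dict(frame.f_locals)
            if condition(local_vars):
                snapshots.append(local_vars)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = udf(row)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, snapshots

# Break before the `return` line (offset 3) only when the value is negative.
rows = [{"price": "10.0", "discount": 0.5}, {"price": "10.0", "discount": 1.5}]
for row in rows:
    result, hits = run_with_conditional_breakpoint(
        clean_price, row, 3, lambda v: v["discounted"] < 0
    )
    print(result, hits)
```

Tracing every line this way slows the UDF down, which is the kind of runtime overhead the abstract says Udon's optimizations aim to reduce.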

Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 4
PACMMOD
December 2023
1317 pages
EISSN: 2836-6573
DOI: 10.1145/3637468
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data systems
  2. debugging
  3. user-defined functions (UDFs)

Qualifiers

  • Research-article

Cited By

  • (2024) Texera: A System for Collaborative and Interactive Data Analytics Using Workflows. Proceedings of the VLDB Endowment 17:11 (3580-3588). DOI: 10.14778/3681954.3682022. Online publication date: 1-Jul-2024
  • (2024) Window Function Expression: Let the Self-Join Enter. Proceedings of the VLDB Endowment 17:9 (2162-2174). DOI: 10.14778/3665844.3665848. Online publication date: 1-May-2024
  • (2024) Proximity Queries on Point Clouds using Rapid Construction Path Oracle. Proceedings of the ACM on Management of Data 2:1 (1-26). DOI: 10.1145/3639261. Online publication date: 26-Mar-2024
  • (2024) Demonstration of Udon: Line-by-line Debugging of User-Defined Functions in Data Workflows. Companion of the 2024 International Conference on Management of Data (476-479). DOI: 10.1145/3626246.3654756. Online publication date: 9-Jun-2024