
Reducing Energy in GPGPUs through Approximate Trivial Bypassing

Published: 04 January 2021

Abstract

General-purpose computing on graphics processing units (GPGPUs) is an attractive option for accelerating applications with massively data-parallel tasks. While the performance of modern GPGPUs is increasing rapidly, their power consumption is becoming a major concern. In particular, the execution units and the register file are among the three most power-hungry components in GPGPUs. In this work, we exploit trivial instructions to reduce power consumption in GPGPUs.
Trivial instructions are instructions whose results can be determined without computation, e.g., multiplication by one. We found that a GPGPU executes many trivial instructions over the course of a program's execution, and executing them wastes power. We propose trivial bypassing, which skips the execution of trivial instructions and avoids allocating resources to them. By power gating execution units and skipping trivial computations, trivial bypassing reduces both static and dynamic power. It also reduces the dynamic energy of the register file by avoiding register file accesses for the source and/or destination operands of trivial instructions. While trivial bypassing reduces the energy of GPGPUs, it degrades performance, because a power-gated execution unit requires several cycles to resume normal operation. Conventional warp schedulers are oblivious to the status of execution units, so we propose a new warp scheduler that prioritizes warps based on the availability of execution units. We also propose a set of new power management techniques that further reduce the performance penalty of power gating. To increase the energy savings of trivial bypassing, we propose approximating the operands of instructions: we offer a set of new techniques that approximate both integer and floating-point instructions and thereby enlarge the pool of trivial instructions. Our evaluations using a diverse set of benchmarks show that the proposed techniques reduce the energy of execution units by 11.2% and the dynamic energy of the register file by 12.2% with minimal performance and quality degradation.
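
To make the notion of a trivial instruction concrete, the following is a minimal software sketch, not the hardware mechanism evaluated in the article. The function names `is_trivial` and `snap`, the set of opcodes, and the tolerance-based snapping rule are all assumptions introduced for illustration; the abstract does not specify how operands are approximated.

```python
def is_trivial(op, a, b):
    """Return (True, result) when op(a, b) can be resolved from one operand
    alone, so no execution unit is needed; otherwise return (False, None).

    Illustrative only: the opcode names are hypothetical. The float cases
    assume finite operands and ignore the sign of zero.
    """
    if op == "mul":
        if a == 0:
            return True, a      # 0 * b == 0
        if b == 0:
            return True, b      # a * 0 == 0
        if a == 1:
            return True, b      # 1 * b == b
        if b == 1:
            return True, a      # a * 1 == a
    elif op == "add":
        if a == 0:
            return True, b      # 0 + b == b
        if b == 0:
            return True, a      # a + 0 == a
    elif op == "sub":
        if b == 0:
            return True, a      # a - 0 == a
    elif op == "div":
        if b == 1:
            return True, a      # a / 1 == a
    return False, None


def snap(x, tol):
    """Hypothetical operand approximation: round x to exactly 0.0 or 1.0
    when it lies within tol of that value, otherwise leave it unchanged.
    This is an assumed scheme, not the one proposed in the article."""
    if abs(x) <= tol:
        return 0.0
    if abs(x - 1.0) <= tol:
        return 1.0
    return x
```

Under these assumptions, `is_trivial("mul", 7.0, 1.0004)` reports a non-trivial instruction, whereas `is_trivial("mul", 7.0, snap(1.0004, 1e-3))` is trivial and yields the other operand. This illustrates how approximation can enlarge the pool of trivial instructions at the cost of a small error in the result.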

    Published In

    ACM Transactions on Embedded Computing Systems, Volume 20, Issue 2
    March 2021
    230 pages
    ISSN:1539-9087
    EISSN:1558-3465
    DOI:10.1145/3446664
    • Editor: Tulika Mitra

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 January 2021
    Accepted: 01 October 2020
    Revised: 01 August 2020
    Received: 01 June 2020
    Published in TECS Volume 20, Issue 2

    Author Tags

    1. GPGPUs
    2. Trivial computing
    3. approximate computing
    4. energy

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2023) TrivialSpy: Identifying Software Triviality via Fine-grained and Dataflow-based Value Profiling. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1145/3581784.3607052. Online publication date: 12-Nov-2023
    • (2023) Mixed-Precision Architecture for GPU Tensor Cores. 2023 IEEE Smart World Congress (SWC), 1-8. DOI: 10.1109/SWC57546.2023.10448789. Online publication date: 28-Aug-2023
    • (2023) Selective High-Latency Arithmetic Instruction Reuse in Multicore Processors. 2023 27th International Conference on System Theory, Control and Computing (ICSTCC), 410-415. DOI: 10.1109/ICSTCC59206.2023.10308483. Online publication date: 11-Oct-2023
    • (2023) PTTS: Power-aware tensor cores using two-sided sparsity. Journal of Parallel and Distributed Computing 173, 70-82. DOI: 10.1016/j.jpdc.2022.11.004. Online publication date: Mar-2023
    • (2022) AxBy-ViT: Reconfigurable Approximate Computation Bypass for Vision Transformers. 2022 23rd International Symposium on Quality Electronic Design (ISQED), 1-5. DOI: 10.1109/ISQED54688.2022.9806143. Online publication date: 6-Apr-2022
    • (2021) G-SEPM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1145/3458817.3476170. Online publication date: 14-Nov-2021
    • (2021) Sparsity-aware Power Gating for Tensor Cores. 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 94-103. DOI: 10.1109/SBAC-PAD53543.2021.00021. Online publication date: Oct-2021
