Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recurrent failures of the GitHub Actions performance test #583

Open
marshallward opened this issue Mar 13, 2024 · 5 comments
Open

Recurrent failures of the GitHub Actions performance test #583

marshallward opened this issue Mar 13, 2024 · 5 comments

Comments

@marshallward
Copy link
Member

No doubt many have noticed the intermittent failure of the "Performance Monitor" test. This does not appear to have anything to do with MOM6, but rather the parsing of the perf report output.

python tools/parse_perf.py -f work/p0/opt/perf.data > work/p0/opt/profile.json
Traceback (most recent call last):
  File "/home/runner/work/MOM6/MOM6/.testing/tools/parse_perf.py", line 71, in <module>
    main()
  File "/home/runner/work/MOM6/MOM6/.testing/tools/parse_perf.py", line 19, in main
    profile = parse_perf_report(args.data)
  File "/home/runner/work/MOM6/MOM6/.testing/tools/parse_perf.py", line 63, in parse_perf_report
    period = int(tokens[3])
ValueError: invalid literal for int() with base 10: 'std::char_traits<wchar_t>'

This is a very simple parser that makes a lot of assumptions about what it is reading. In this case, the output is expected to look something like this:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 147K of event 'cycles:u'
# Event count (approx.): 114907202161
#
# Overhead  Symbol                                                                      Period  IPC   [IPC Coverage]
# ........  ....................................................................  ............  ....................
#
    12.18%  [.] mom_barotropic_mp_btstep_                                          13995565662  -      -            
     8.53%  [.] diag_manager_mod_mp_diag_send_data_                                 9804461284  -      -            
     8.16%  [.] mpp_domains_mod_mp_mpp_do_group_update_r8_                          9378762297  -      -            
     7.95%  [.] mom_vert_friction_mp_vertvisc_coef_                                 9132010079  -      -            
     6.43%  [.] mom_hor_visc_mp_horizontal_viscosity_                               7389425771  -      -            
     3.03%  [.] mom_eos_wright_mp_int_density_dz_wright_                            3484215718  -      -            
     3.03%  [.] mom_continuity_ppm_mp_zonal_flux_layer_                             3477704292  -      -            
...

The reality is that perf output can be hard to predict. The performance metrics are usually unique to the CPU; AMD and Intel chips will rarely have the same metrics. Different platforms may have different headers or even different formats, and reading the fourth token may just not work as a test.

Unfortunately, this is incredibly hard to replicate. I have a branch that is trying to catch this and report the output, but have so far failed to replicate this error.

@marshallward
Copy link
Member Author

We got our first perf error log:

parse_perf.py: Error extracting symbol count
line: '     0.00%  [.] std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()        250062  -      -            \n'
tokens: ['0.00%', '[.]', 'std::__cxx11::basic_string<char,', 'std::char_traits<char>,', 'std::allocator<char>', '>::~basic_string()', '250062', '-', '-']

This is happening in the C++ library, and there is poor handling of the template syntax (basic_string<char, ...>). It is not helped by the resolution at the end.

This could be resolved by either tracking <, > as delimiters or perhaps just index-counting from the opposite end.

@marshallward
Copy link
Member Author

marshallward commented May 6, 2024

A second error:

line: '     0.00%  [.] std::__cxx11::messages<char>::messages(unsigned long)                  250062  -      -            \n'
tokens: ['0.00%', '[.]', 'std::__cxx11::messages<char>::messages(unsigned', 'long)', '250062', '-', '-']

This one is due to a type (or type recast) in the output: messages<char>::messages(unsigned long), which would also require tracking (, ) delimiters.

C++ is the gift which keeps on giving.

@marshallward
Copy link
Member Author

I think that #632 should fix this issue.

@marshallward
Copy link
Member Author

Fixed by #632

@marshallward
Copy link
Member Author

Unfortunately not yet fixed... an error in the prototype leaked over into the production script. #664 will fix this one.

@marshallward marshallward reopened this Jun 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant