8328866: Add raw monitor rank support to the debug agent #19044

plummercj · 2024-05-01T21:32:46Z

This PR adds ranked monitor support to the debug agent. The debug agent has a large number of monitors, and it's really hard to know which order to grab them in, and for that matter which monitors might already be held at any given moment. By imposing a rank on each monitor, we can check to make sure they are always grabbed in the order of their rank. Having this in place when I was working on JDK-8324868 would have made it much easier to detect a deadlock that was occuring, and the reason for it. That's what motivated me to do this work

There were 2 or 3 minor rank issues discovered as a result of these changes. I also learned a lot about some of the more ugly details of the locking support in the process.

Tested with the following on all supported platforms and with virtual threads:

com/sun/jdi
vmTestbase/nsk/jdi
vmTestbase/nsk/jdb
vmTestbase/nsk/jdwp

Still need to run tier2 and tier5.

Details of the changes follow in the first comment.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8328866: Add raw monitor rank support to the debug agent (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19044/head:pull/19044
$ git checkout pull/19044

Update a local copy of the PR:
$ git checkout pull/19044
$ git pull https://git.openjdk.org/jdk.git pull/19044/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19044

View PR using the GUI difftool:
$ git pr show -t 19044

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19044.diff

Webrev

Link to Webrev Comment

bridgekeeper · 2024-05-01T21:33:36Z

👋 Welcome back cjplummer! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2024-05-01T21:34:39Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2024-05-01T21:35:03Z

@plummercj The following label will be automatically applied to this pull request:

serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

plummercj · 2024-05-01T21:36:08Z

I was able to minimize changes for users of RawMonitor by keeping the same client wrapper APIs (eg. debugMonitorEnter), and for the most part passing in the same argument (the monitor to enter). The only change is that the client is passing a new type called DebugRawMonitor* instead of a jrawMonitorID. The DebugRawMonitor encapsulates the jrawMonitorID and contains other monitor info that needs to be tracked (like the rank and current owner).

Ranks were determined largely by trial and error by running tests and fixing the rank violations asserts as I hit them. In the enum that contains all the rankings, I explain the rational for each monitor's rank.

For the moment I have included a flag (rankedmonitors) to control whether or not ranked monitors should be used. This is just a safety net in case someone runs into an issue. The flag is not documented and will likely be removed at some point.

The ranked monitor supported ended up needing a RawMonitor itself (dbgRawMonitor_lock). This RawMonitor is therefore not part of the ranked monitor support. It could probably use a better name. Suggestions are welcomed.

There are a few issues I fixed while working on the ranked monitor support:

During debug agent initialization we need to call util_initialize() before commonRef_initialize() because the later creates a monitor, and that depends on util_initialize() having already been called.

getLocks() in threadControl.c wasn't grabbing locks in the right order. This could have lead to a deadlock, but seems the structure of how locks are used prevented one from happening.

In invoker_completeInvokeRequest() I had to add grabbing the stepLock in order to maintain proper lock order when getLocks() was eventually called (which also grabbed stepLock). Other than triggering a rank violation assert, this doesn't seem to have caused any real issue, but did leave the potential for a deadlock.

In handleInterrupt() I noticed that a local JNI ref was not being freed. I noticed because there were other places where I had added calls to threadControl_currentThread() that resulted in a noticeable JNI ref leak. When I fixed those, I fixed the one in handleInterrupt() too even tough it was not causing any problems.

The ranked monitor APIs needed some extra safeguards during VM exit that are not needed for the unranked monitor APIs. This is because the unranked APIs didn't have a problem with the current thread being NULL, but the ranked APIs require a valid current thread. That is why you see the following check:

    if (gdata->vmDead && current_thread == NULL) {
        return;
    }

mlbridge · 2024-05-01T21:42:11Z

Webrevs

alexmenkov · 2024-05-02T01:24:31Z

src/jdk.jdwp.agent/share/native/libjdwp/util.c

 error = ignore_vm_death(error);
 if (error != JVMTI_ERROR_NONE) {
 EXIT_ERROR(error, "on destruction of raw monitor");
 }
+
+ dbg_monitor->monitor = NULL;


I think it would be better to protect this with dbgRawMonitor

I don't see how that helps. Access to the field is not protected.

I guess there could be a race if one thread is destroying this monitor while another is trying to use it. Thus a thread could be doing something like a RawMonitorEnter in the middle of (or after) the monitor being destroyed. Fixing that would require holding dbgRawMonitor during RawMonitorEnter, and that would cause deadlocks all over the place. Also, this same potential issue exists already, but doesn't seem to ever arise. It seems we only call debugMonitorDestroy for cmdQueueLock. Not sure why that is.

FWIW Deleting monitors is a tricky business and needs to be done with great care. You have to ensure all threads using the monitor are completely done with it. Simply locking it first to check it is "unused" is not sufficient.

This is probably the reason that the debug agent only destroys one of the 20 or so monitors it creates. I'm not sure how it got to that point. It may have destroyed more at one point but there were issues, and only one survived destruction without issue.

If debugger attaches to the debuggee, detaches and re-attaches again, are the monitors recreated again?
(with rankedMonitor you added an assert if DebugRawMonitor::monitor is not null)

If debugger attaches to the debuggee, detaches and re-attaches again, are the monitors recreated again?
(with rankedMonitor you added an assert if DebugRawMonitor::monitor is not null)

Almost all of the debugMonitorCreate() calls seem to come (indirectly) from initialize(), which is only called once.

cmdQueueLock, which is the one monitor we call debugMonitorDestroy() on, is called from debugLoop_run(), which is called whenever a debugger connection is initiated. This is the only case of creating, destroying, and then recreating. I wonder if there is some need for that.

popFrameEventLock and popFrameProceedLock are created lazily when first needed. There is a check to make sure they are not recreated if already created. They are never destroyed.

That seems to be it. So we pretty much create all monitors upon initialization and never recreate those that are created during initialization.

With regard to the assert, this can only be an issue for cmdQueueLock, and since debugMonitorDestroy() is called on it (and that clears the needed fields), the assert shouldn't be triggered.

src/jdk.jdwp.agent/share/native/libjdwp/util.c

src/jdk.jdwp.agent/share/native/libjdwp/threadControl.c

src/jdk.jdwp.agent/share/native/libjdwp/util.c

alexmenkov · 2024-05-02T19:38:36Z

src/jdk.jdwp.agent/share/native/libjdwp/util.c

+
+ // Need to lock during initialization so verifyMonitorRank() can be guaranteed that
+ // if the monitor field is set, then the monitor if fully initialized. More specifically,
+ // that the rank field has been set.


rank for all monitors can be set in util_initialize():

for (i = 0; i < NUM_DEBUG_RAW_MONITORS; i++) { dbg_monitors[i]->rank = i; }

I decided not to make this change. It was a bit disruptive with the need for forward declarations. Plus I wasn't convinced I wouldn't still need the locking since there is also the assignment of the name field (although that is only used if something has gone wrong) and also I was a little worried that tearing could happen on some platforms (a partial ptr value is read in), although not on any platform that we support. I did update the comment to make it less specific about the rank field.

…tors()

sspitsyn · 2024-05-07T00:25:18Z

Chris, I'm looking at this fix.

plummercj · 2024-05-07T00:28:43Z

I added a verifyMonitorRank() when doing a debugMonitorWait() since wait does an exit and enter, so we should do the same rank check that we do during an enter. I was a bit worried that we might be doing wait() while holding higher ranking lock, especially given how getLocks() pro-actively grabs locks before they are needed, but all tests (surprisingly) passed after this change.

…d enter, so we should do the same rank check that we do during an enter.

sspitsyn · 2024-05-07T00:29:14Z

src/jdk.jdwp.agent/share/native/libjdwp/util.c

+ JDI_ASSERT(monitor != NULL);
+
+ // Need to lock during initialization so verifyMonitorRank() can be guaranteed that
+ // if the monitor field is not NULL, then the monitor if fully initialized.


Nit: It looks like a typo: "the monitor if fully" => "the monitor is fully"

plummercj · 2024-05-07T18:54:37Z

What do you think about the terminology usage of "rank". In this PR the term "higher ranked" means that it must be entered first (before any monitor with a lower rank), but 0 is the highest rank, and bigger numbers mean a lower rank. Sometimes when we rank things (in general) we say #1 is the best, but sometimes we instead rank with a score, and the highest score is the best. I'm wondering if for this PR the rank ordering should be reversed, so a "higher rank" is actually a higher number.

Update: Hotspot seems to use the term "higher/lowest rank" in the same way, but applies the lowest rank number to the lowest ranked mutex, so this would suggest that I should reverse the rank orders.

sspitsyn · 2024-05-08T15:30:19Z

What do you think about the terminology usage of "rank".

I slightly prefer to have the same rank order as Hotspot uses.
At the same time, I'm not sure if it won't cause any issues/difficulties. But I'd give it a try.

sspitsyn · 2024-05-08T15:44:37Z

src/jdk.jdwp.agent/share/native/libjdwp/debugLoop.c

@@ -63,7 +63,7 @@ lastCommand(jdwpCmdPacket *cmd)
 void
 debugLoop_initialize(void)
 {
- vmDeathLock = debugMonitorCreate("JDWP VM_DEATH Lock");
+ vmDeathLock = debugMonitorCreate(vmDeathLockForDebugLoop_Rank, "JDWP VM_DEATH Lock");


Nit: Just a suggestion to consider (there was a similar suggestion from by Alex)...
Can all the debugMonitor's created/initialized by one function debugMonitor_initialize() called by util_initialize()? Then each monitor consumer can cache needed debugMonitor's like this:

vmDeathLock = getDebugMonitor(vmDeathLockForDebugLoop_Rank);

I think you are suggesting that all monitors be created up front during util_intialize(). The debug agent in the past has always created them lazily (and sometimes some of them end up never being created because they are not needed). I don't really see a big advantage in creating them all up front.

Alex's suggestion was a very different one, and was just a suggestion to initialize all the DebugRawMonitor ranks up front rather than when the jMonitorID is created, and the reason for the suggestion was to avoid having to grab the dbgRawMonitor when setting the rank, but I wasn't convinced that it actually allowed you to not grab the dbgRawMonitor.

I just wanted to say that Alex's suggestion is in a similar direction.
I do not see a big advantage to save just on creation of a few monitors. It is not a scalability problem as their number is constant. Creation of the monitors at initialization time is a simplification as we do not care about synchronization at this phase. Also, it is easier to understand because there is no interaction with other threads.

The init code would have to know the name of each monitor since it is no longer being passed in. Something like the following might work:

static DebugRawMonitor dbg_monitors[] = {
{ NULL, "JDWP VM_DEATH Lock", vmDeathLockForDebugLoop_Rank, NULL, 0 },
{ NULL, "JDWP Callback Block", callbackBlock_Rank, NULL, 0 },
{ NULL, "JDWP Callback Lock", callbackLock_Rank, NULL, 0 },
...
};

And then the init() function can iterate over the array and allocate the monitor for each entry. Note that this array needs to be kept in sync with the DebugRawMonitorRank enum (same entries and in the same order). I can assert during the initialization that the rank field for each entry is equal to its array index and that the array size is NUM_DEBUG_RAW_MONITORS.

Can we do as below? :

static DebugRawMonitor dbg_monitors[]; // initialized with all zeros by default . . . static void createDebugRawMonitor(DebugRawMonitorRank rank, const char* name) { dbg_monitors[rank] = debugMonitorCreate(rank, name); } static void monitors_initialize() { createDebugRawMonitor( vmDeathLockForDebugLoop_Rank, "JDWP VM_DEATH Lock"); createDebugRawMonitor( callbackBlock_Rank, "JDWP Callback Block"); createDebugRawMonitor( callbackLock_Rank, "JDWP Callback Lock"); . . . }

The pair of functions createDebugRawMonitor()+debugMonitorCreate() can be refactored into just one function. There monitors can be created in any order but they will be set according to the DebugRawMonitorRank enum.

src/jdk.jdwp.agent/share/native/libjdwp/util.c

sspitsyn · 2024-05-08T16:46:33Z

src/jdk.jdwp.agent/share/native/libjdwp/util.c

+struct DebugRawMonitor {
+ jrawMonitorID monitor;
+ const char* name;
+ DebugRawMonitorRank rank;


Nit: Please, consider to add same comment for lines #1116-1117 as for the line #1119 below.
It might be useful to move the field ownerThread before the other fields (not sure about the monitor field), so the same comment can be shared by multiple fields.

I kind of don't want to distract from the point that ownerThread is protected by the dbgRawMonitor, and that is the main purpose of dbgRawMonitor. Once you have verified that ownerThread is the current thread, you can release the dbgRawMonitor and access the rank and name fields safely.

There is a race issue with all the fields during debugMonitorCreate() and verifyMonitorRank(), which is why the debugMonitorCreate() must hold the dbgRawMonitor(). However, there is not a race with the fields (other than ownerThread) for the DebugRawMonitor passed into the debugMonitorXXX() APIs because they are not called until the DebugRawMonitor is done being initialized. So this means that debugMonitorXXX() can access the monitor, name, and rank fields without holding dbgRawMonitor, because they are guaranteed to be fully initialized by that point, and they don't change. ownerThread does change, so you must access it while holding the dbgRawMonitor. Technically speaking entryCount could also be changed by other threads, but we don't access it until we know the current thread owns the DebugRawMonitor, so it is safe to access (no other thread would modify or access it).

To put this another way, the debugMonitorXXX() APIs can safely access most of the DebugRawMonitor fields for the DebugRawMonitor passed in because we know they are initialized by this point and don't change. ownerThread (and to some extent entryCount) are the exception. verifyMonitorRank() introduces additional synchronization issues because it iterates over all DebugRawMonitors, and some might be in the middle of creation, so some synchronization with the creation code is needed.

However, I now realize that if we initialized all the monitors up front, then there would be no need to hold dbgRawMonitor during debugMonitorCreate(), although this does not allow for any improvements in verifyMonitorRank() because it still needs to hold the dbgRawMonitor due to accessing the ownerThread field.

No pressure, it is up to you. I just wonder if there is any room to make it a little bit more clear and simple.

…nk().

src/jdk.jdwp.agent/share/native/libjdwp/threadControl.c

alexmenkov · 2024-05-17T20:03:11Z

src/jdk.jdwp.agent/share/native/libjdwp/util.c

+ * verifyMonitorRank(), we should also hold it during monitor initialization.
+ */
+
+static jrawMonitorID dbgRawMonitor;


As the monitor is used to synchronize access to dbg_monitors array, maybe rename it to something like dbg_monitors_lock?

@alexmenkov Actually I don't think dbg_monitors_lock is a good name. It doesn't synchronize access to the dbg_monitors array, but rather to the ownerThread field of any given entry in the array. Maybe something like ownerThread_field_access_lock would be better, although that's getting a bit wordy, and maybe more specific than it needs to be (it would prevent other uses of the lock).

alexmenkov · 2024-05-17T20:08:43Z

src/jdk.jdwp.agent/share/native/libjdwp/util.c

+}
+
+static void
+assertIsCurrentThread(JNIEnv *env, jthread thread, jthread current_thread)


The function gets both threads as arguments, I think assertIsSameThread would be more correct name

dholmes-ora · 2024-05-20T01:47:31Z

Note we can't check if the current thread owns a lock without grabbing dbgRawMonitor. In fact the main purpose of dbgRawMonitor is to allow the current thread to check if it owns the monitor.

That smells fishy to me. A thread can always safely check if it owns something because it can't race with itself and get a wrong answer.

plummercj · 2024-05-20T03:23:25Z

That smells fishy to me. A thread can always safely check if it owns something because it can't race with itself and get a wrong answer.

Unfortunately checking ownership means comparing jobjects, which you can't safely do if you are comparing to a jobject that could be freed at any moment (I'm referring the JNI ref being freed, not the object being GC'd). I ran into asserts in JNI IsSameObject() due to this happening. Basically the local ref became invalid between the time it was fetched from the DebugRawMonitor and when is was used by IsSameObject().

dholmes-ora · 2024-05-20T04:25:49Z

Unfortunately checking ownership means comparing jobjects, which you can't safely do if you are comparing to a jobject that could be freed at any moment

Okay but how does this dbgRawMonitor lock prevent the GC from clearing a jobject?

plummercj · 2024-05-20T04:36:22Z

Okay but how does this dbgRawMonitor lock prevent the GC from clearing a jobject?

The JNI ref. It's actually a global ref (I said local ref earlier). As long as any setters of DebugRawMonitor::ownerThread (setting the thread or clearing it) grab dbgRawMonitor first, and the current thread accessing ownerThread does the same, then you know that whatever ownerThread points to is either NULL, the current thread, or some other thread that is still alive and is actually holding the monitor that the DebugRawMonitor is encapsulating. As long as the current thread continues to hold dbgRawMonitor, no other thread will be able to change ownerThread. This means that if some other thread currently owns the monitor that the DebugRawMonitor is encapsulating, then it will first block on dbgRawMonitor before trying to release it and clear ownerThread.

bridgekeeper · 2024-06-19T22:59:28Z

@plummercj This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

plummercj · 2024-06-24T17:35:22Z

keep alive

bridgekeeper · 2024-07-22T19:22:39Z

@plummercj This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

plummercj · 2024-07-22T19:29:14Z

Keep alive. I'll be continuing work on this soon.

bridgekeeper · 2024-09-14T01:27:33Z

@plummercj This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Add ranked monitor support to the debug agent.

c86d507

openjdk bot changed the title ~~8328866~~ 8328866: Add raw monitor rank support to the debug agent. May 1, 2024

openjdk bot added the serviceability [email protected] label May 1, 2024

Fix jchecks errors

bcb92b5

openjdk bot added the rfr Pull request is ready for review label May 1, 2024

alexmenkov reviewed May 2, 2024

View reviewed changes

plummercj added 8 commits May 5, 2024 22:57

Optimize loop iterations during rank verification.

77d4e2c

Free localrefs allocated by GetThreadInfo().

5f5790a

dbgRawMonitor reference should be dbg_monitor->monitor in dumpRawMoni…

183ba77

…tors()

pass already fetched current_thread to assertIsCurrentThread()

6f4265f

Renamed thread -> savedOwnerThread and entryCount -> savedEntryCount.

d376327

Print threads names instead of addresses in assertIsCurrentThread()

6f000e2

Don't do anything in assertIsCurrentThread() if asserts are disabled

1224197

Minor comment fix in debugMonitorCreate()

ece5f34

Call verifyMonitorRank() when doing a wait since wait does an exit an…

c782279

…d enter, so we should do the same rank check that we do during an enter.

openjdk bot removed the rfr Pull request is ready for review label May 7, 2024

sspitsyn reviewed May 7, 2024

View reviewed changes

plummercj added 2 commits May 7, 2024 18:47

Fix comment typo.

c0a7924

Fix jcheck extra whitespace.

049d388

openjdk bot added the rfr Pull request is ready for review label May 7, 2024

sspitsyn reviewed May 8, 2024

View reviewed changes

plummercj added 2 commits May 8, 2024 19:23

Fix minor comment typo.

301da7d

Flip rank order. Some cleanup and better comments for verifyMonitorRa…

1c6a2e3

…nk().

openjdk bot removed the rfr Pull request is ready for review label May 9, 2024

Simplify by getting rid of special logic around leaf monitors.

c3bd171

openjdk bot added the rfr Pull request is ready for review label May 16, 2024

kevinjwalls reviewed May 16, 2024

View reviewed changes

src/jdk.jdwp.agent/share/native/libjdwp/threadControl.c Outdated Show resolved Hide resolved

kevinjwalls reviewed May 16, 2024

View reviewed changes

src/jdk.jdwp.agent/share/native/libjdwp/threadControl.c Outdated Show resolved Hide resolved

Fix a couple of minor typos in comments.

33fcea3

alexmenkov reviewed May 17, 2024

View reviewed changes

plummercj added 4 commits July 29, 2024 18:04

Merge

dc978d7

tweak to raw monitor dump output

6a20e55

Fix some rank orders after merge with 8332738.

185233a

Merge

0cb80d4

plummercj changed the title ~~8328866: Add raw monitor rank support to the debug agent.~~ 8328866: Add raw monitor rank support to the debug agent Aug 15, 2024

plummercj added 2 commits August 16, 2024 23:49

Dump raw monitors as part of JVMTI DataDumpRequest event

3ef379a

add newline before dumping raw monitors

ced4e27

8328866: Add raw monitor rank support to the debug agent #19044

Are you sure you want to change the base?

8328866: Add raw monitor rank support to the debug agent #19044

Conversation

plummercj commented May 1, 2024 • edited by openjdk bot Loading

Progress

Issue

Reviewing

Webrev

bridgekeeper bot commented May 1, 2024

openjdk bot commented May 1, 2024

openjdk bot commented May 1, 2024

plummercj commented May 1, 2024

mlbridge bot commented May 1, 2024 • edited Loading

Webrevs

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sspitsyn commented May 7, 2024

plummercj commented May 7, 2024

Choose a reason for hiding this comment

plummercj commented May 7, 2024 • edited Loading

sspitsyn commented May 8, 2024

sspitsyn May 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sspitsyn May 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dholmes-ora commented May 20, 2024

plummercj commented May 20, 2024

dholmes-ora commented May 20, 2024

plummercj commented May 20, 2024

bridgekeeper bot commented Jun 19, 2024

plummercj commented Jun 24, 2024

bridgekeeper bot commented Jul 22, 2024

plummercj commented Jul 22, 2024

bridgekeeper bot commented Sep 14, 2024

plummercj commented May 1, 2024 •

edited by openjdk bot

Loading

mlbridge bot commented May 1, 2024 •

edited

Loading

plummercj commented May 7, 2024 •

edited

Loading

sspitsyn May 8, 2024 •

edited

Loading

sspitsyn May 9, 2024 •

edited

Loading