Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrency issues with package cache in validator #1641

Open
dehall opened this issue May 30, 2024 · 21 comments
Open

Concurrency issues with package cache in validator #1641

dehall opened this issue May 30, 2024 · 21 comments
Labels

Comments

@dehall
Copy link
Contributor

dehall commented May 30, 2024

When multiple validator sessions are starting up at the same time we've seen Exceptions that appear to be concurrency issues when reading from certain files in the package cache.
Similar to #1491 - this issue was initially observed using the org.hl7.fhir.validator-wrapper but able to reproduce just with the core validation library.
An example stack trace:

java.io.IOException: Error parsing JSON source: Unexpected content at start of JSON: Eof at Line 1 (path=[null]) from '/Users/dehall/.fhir/packages/hl7.fhir.uv.extensions.r4#1.0.0/usage.ini'
	at org.hl7.fhir.utilities.json.parser.JsonLexer.error(JsonLexer.java:120)
	at org.hl7.fhir.utilities.json.parser.JsonParser.parseSource(JsonParser.java:244)
	at org.hl7.fhir.utilities.json.parser.JsonParser.parseJsonObject(JsonParser.java:230)
	at org.hl7.fhir.utilities.json.parser.JsonParser.parseObject(JsonParser.java:71)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackageInfo(FilesystemPackageCacheManager.java:262)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackageFromCacheOnly(FilesystemPackageCacheManager.java:463)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackage(FilesystemPackageCacheManager.java:615)
	at org.hl7.fhir.validation.IgLoader.loadIg(IgLoader.java:108)
	at org.hl7.fhir.validation.IgLoader.loadIg(IgLoader.java:121)
	at org.hl7.fhir.validation.cli.services.ValidationService.loadIgsAndExtensions(ValidationService.java:549)
	at org.hl7.fhir.validation.cli.services.ValidationService.buildValidationEngine(ValidationService.java:495)
	at org.hl7.fhir.validation.cli.services.ValidationService.initializeValidator(ValidationService.java:476)

In practice with the validator-wrapper, I seem to recall other files causing issues but maybe it's only usage.ini. Other examples:

Error parsing JSON source: Unexpected content at start of JSON: Eof at Line 1 (path=[null]) from '/home/ktor/.fhir/packages/hl7.terminology#5.5.0/usage.ini'
Error parsing JSON source: Unexpected content at start of JSON: Eof at Line 1 (path=[null]) from '/home/ktor/.fhir/packages/hl7.fhir.uv.extensions.r4#1.0.0/usage.ini'

I suspect the issue is that FilesystemPackageCacheManager.loadPackageInfo writes to the file without locking it, so another thread can enter this at the same time and catch the file before the first thread is done with it. It may be as simple as wrapping the following line in a FilesystemPackageCacheLock.doWriteWithLock block but my initial attempts to do that just got me different errors.

Sample code that (usually, not always unfortunately) reproduces the issue:

  @Test
  public void multithreadingTest() throws IOException {
    ValidationService myService = new ValidationService();
    final AtomicInteger totalSuccessful = new AtomicInteger();    

    final String[] packages = {
        "hl7.fhir.us.core#3.1.1",
        "hl7.fhir.us.core#4.0.0",
        "hl7.fhir.us.core#5.0.1",
        "hl7.fhir.us.core#6.1.0",
        "hl7.fhir.us.core#7.0.0-ballot"
    };
    
    List<Thread> threads = new ArrayList<>();
    int i = 0;
    for (String pckage : packages) {
      final int index = i++;
      Thread t = new Thread(() -> {
        try {
          myService.initializeValidator(
              new CliContext().setTxServer(null).setIgs(List.of(pckage)),
              "hl7.fhir.r4.core", 
              new TimeTracker(), 
              null
          );
          
          totalSuccessful.incrementAndGet();
          System.out.println("Thread " + index + " completed");
        } catch (Throwable e) {
          e.printStackTrace();
          System.err.println("Thread " + index + " failed");
        }
      });
      t.start();
      threads.add(t);
    }
    threads.forEach(t -> {
      try {
        t.join();
      } catch (InterruptedException e) {

      }
    });
    assertEquals(packages.length, totalSuccessful.get());
  }
@grahamegrieve
Copy link
Collaborator

The information in this file is only used when displaying the cache to a user and advising when an entry in the cache was last used. There is a tool that does this, but not all tools set the usage date, so it's not reliable.

We should surround this code with a try..catch and ignore any errors from it; they simply don't matter.

@dehall
Copy link
Contributor Author

dehall commented Jun 10, 2024

Update: there is one other error we've seen multiple times, looks like it's related to one thread trying to read a file while another is trying to delete it. I haven't had time to try to replicate this yet but this is the log snippet.

Installing fhir.dicom#2022.4.202210062024-06-07 17:50:20.380 [ktor-jetty-3500-2] ERROR ktor.application - 1 exception(s): [org.apache.commons.io.IOIndexedException: IOException #0: Cannot delete file: /home/ktor/.fhir/packages/44eb56b7-82f4-4808-b3d8-6df033b81f72/package]
 done.
Clean up package /home/ktor/.fhir/packages/fhir.dicom#2022.4.20221006 because installation failed: File /home/ktor/.fhir/packages/fhir.dicom#2022.4.20221006/package/package.json not found
java.io.IOException: File /home/ktor/.fhir/packages/fhir.dicom#2022.4.20221006/package/package.json not found
2024-06-07 17:50:20.383 [ktor-jetty-3500-3] INFO  ktor.application - 500 Internal Server Error: POST - /validate
	at org.hl7.fhir.utilities.TextFile.fileToString(TextFile.java:117)
	at org.hl7.fhir.utilities.npm.NpmPackage.loadFiles(NpmPackage.java:401)
	at org.hl7.fhir.utilities.npm.NpmPackage.fromFolder(NpmPackage.java:360)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackageInfo(FilesystemPackageCacheManager.java:265)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.lambda$addPackageToCache$2(FilesystemPackageCacheManager.java:546)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheLock.doWriteWithLock(FilesystemPackageCacheLock.java:37)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.addPackageToCache(FilesystemPackageCacheManager.java:508)
	at org.hl7.fhir.utilities.npm.FilesystemPackageCacheManager.loadPackage(FilesystemPackageCacheManager.java:651)
	at org.hl7.fhir.validation.IgLoader.loadIg(IgLoader.java:107)
	at org.hl7.fhir.validation.IgLoader.loadIg(IgLoader.java:120)
	at org.hl7.fhir.validation.cli.services.ValidationService.loadIgsAndExtensions(ValidationService.java:553)
	at org.hl7.fhir.validation.cli.services.ValidationService.buildValidationEngine(ValidationService.java:494)
	at org.hl7.fhir.validation.cli.services.ValidationService.initializeValidator(ValidationService.java:475)
	at org.hl7.fhir.validation.cli.services.ValidationService.validateSources(ValidationService.java:95)
	at controller.validation.ValidationControllerImpl.validateRequest(ValidationControllerImpl.kt:18)
	at controller.validation.ValidationModuleKt$validationModule$1.invokeSuspend(ValidationModule.kt:43)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at io.ktor.util.pipeline.SuspendFunctionGun.resumeRootWith(SuspendFunctionGun.kt:138)
	at io.ktor.util.pipeline.SuspendFunctionGun.loop(SuspendFunctionGun.kt:112)
	at io.ktor.util.pipeline.SuspendFunctionGun.access$loop(SuspendFunctionGun.kt:14)
	at io.ktor.util.pipeline.SuspendFunctionGun$continuation$1.resumeWith(SuspendFunctionGun.kt:62)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
2024-06-07 17:50:20.393 [ktor-jetty-3500-4] ERROR ktor.application - File /home/ktor/.fhir/packages/fhir.dicom#2022.4.20221006/package/package.json not found
2024-06-07 17:50:20.394 [ktor-jetty-3500-1] INFO  ktor.application - Received Validation Request. FHIR Version: 4.0.1 IGs: [hl7.fhir.us.core#5.0.1] Memory (free/max): 4921594968/8283750400
No such cached session exists for session id 008b523b-bb4c-44ae-8a0e-44ac1bcd5599, re-instantiating validator.
  Load FHIR v4.0.1 from hl7.fhir.r4.core#4.0.12024-06-07 17:50:20.395 [ktor-jetty-3500-4] INFO  ktor.application - 500 Internal Server Error: POST - /validate

@grahamegrieve
Copy link
Collaborator

Yes this is. big concurrency issue. scheduled to look at next week

@dotasek
Copy link
Collaborator

dotasek commented Aug 1, 2024

@dehall @grahamegrieve

I've done some initial work with this. I created multithreadingTestMinimal() which uses ValidationEngine more directly; I hope to eventually strip that away and use the FileSystemPackageCacheManager directly instead. It originally displayed the same failure as Dylan's test.

I have the beginnings of a solution, which gets me repeated passes for multithreadingTestMinimal()

master...do-20240722-thread-safety

The issue is that there is no read lock... we check to make sure two engines aren't writing the same data, but never that one engine might be reading something that is still in the process of being written.

What I've done in the branch above is first ask for another write lock when we need to read. Then, if a write lock is permitted (no one is writing), the lock is downgraded to a read. There are multiple read threads allowed on these locks, but just one write, so this will give us a bit more efficiency, rather than only allowing a single thread to read at a time.

This is by no means a complete solution, just something to start with.

Thoughts?

@dotasek
Copy link
Collaborator

dotasek commented Aug 1, 2024

Hah, and I see that there's some DDOS protection on the server, because my repeated tests are displaying 'Too Many Requests' errors.

@dotasek
Copy link
Collaborator

dotasek commented Aug 1, 2024

Oh... and this is weird. Dylan's test STILL fails, but for a different reason; the txCache is getting repeatedly cleared because at some point packages.ini contains a null for the package cache version... which it shouldn't. Unless something is in the middle of rewriting it.

@dehall
Copy link
Contributor Author

dehall commented Aug 5, 2024

Thanks @dotasek , I haven't dug into the details of how files get read here, but conceptually this sounds good.

One thought though - it sounds like the issue is more that there currently is no read-locking being done, rather than an existing read-lock mechanism is broken. Is the always-get-a-write-lock-to-read-lock approach needed? Since the write lock is exclusive, seems like that will result in read locks essentially being made exclusive too

@dotasek
Copy link
Collaborator

dotasek commented Aug 5, 2024

@dehall I'm thinking along similar lines. I ran this several times this morning only to get different failures for other files, which have no locks on them at all.

The lock downgrade mechanism actually releases the write lock, so it doesn't make read locks exclusive. It gets a write lock, then gets a read lock, then releases the write lock, leaving only the read lock active while it does its work. The intent there was to prevent reads while writing is happening. This should work OK for the package manager, where it only writes a file once.

I do have to check if there's a case where certain files will be re-written; I'm thinking particularly packages.ini

And, looking at how the terminology cache works, this concurrency will likely be a big issue there, and a more complex one, because the cache appends to files instead of just creating and writing once.

@dotasek
Copy link
Collaborator

dotasek commented Aug 5, 2024

Some very odd errors cropping up today that I don't recall seeing before:

Load FHIR v4.0 from hl7.fhir.r4.coreError parsing Observation-HVS-item-example-88122-7.json: [SQLITE_READONLY_DBMOVED] The database file has been moved since it was opened (attempt to write a readonly database)
Error parsing Observation-HVS-item-example-88123-5.json: Cannot read the array length because "bytes" is null
Error parsing Observation-HVS-item-example-88124-3.json: Cannot read the array length because "bytes" is null
Error parsing Observation-HVS-panel-example-88121-9.json: statement is not executing

I don't know if this is related to concurrency at all. Even reverting to a state before my changes results in this behaviour.

@dotasek
Copy link
Collaborator

dotasek commented Aug 6, 2024

To get beyond the issue I was experiencing, I started running my test using local packages, so at least I'm not hammering the package server as much.

I made some changes that seem to get @dehall 's test to repeatedly pass (I set it to loop until failure in IntelliJ, and got to 4 before writing this). The main changes were ensuring that there's only one lock management system per package cache directory, and adding a special lock for changes to the whole package cache directory (and, I guess, the package.ini file).

This is NOT in any way complete. There's so many other places it could crap out, and I removed the .lock file mechanism during the refactor in order to reduce complexity.

Dylan, could you take a look at this and see if what I did makes any sense?

aa23ed8

@dehall
Copy link
Contributor Author

dehall commented Aug 6, 2024

Sure thing, will take a look tomorrow AM. Thanks so much for digging into this

@dotasek
Copy link
Collaborator

dotasek commented Aug 6, 2024

No problem. This is a good find, and the replicable test is great.

Staring at this now, I realize that maybe that bit of 'cleverness' with lock downgrading isn't so clever. I think it will wait for read locks to be released before allowing a write lock, so functionally, I might as well just have a write lock there. I'll need to revisit that.

@dehall
Copy link
Contributor Author

dehall commented Aug 7, 2024

Overall I think this makes sense but that's with the caveat that I haven't looked at the existing and surrounding code nearly as much as you and Grahame have, and I'm not an expert in concurrency.

In terms of the specifics of the locks, from reading the docs on ReadWriteLock https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReadWriteLock.html the write lock is completely exclusive but multiple read locks may be held at the same time - that sounds good. So in doReadWithLock I think the write lock downgrade would be unnecessary if it could be guaranteed that the function it calls is truly only a read; ie, if you're sure you only need to read, just get a read lock and ReadWriteLock will handle the rest. But I can tell making sure the function really is read-only is not as easy as it sounds. I'm not sure if there's a way to do that beyond manually reviewing every function you might call. loadPackageInfo sounds like it should be a pure read but no, turns out it also writes back to the usage.ini file. Having doReadWithLock always get a write lock first is, as far as I can tell, totally safe in terms of locking, but could potentially slow things down unnecessarily. (Not sure how "slow" it would be in practice given the lock is per-package and there shouldn't be that many concurrent readers anyway)

I'm wondering if this "Manager" approach, where a function is passed into the Manager to call after acquiring the lock, is the best way to do this. Another way to do it might be to pass the Lock around into the specific methods that do file operations and let them get the readLock or writeLock as appropriate. That way the caller doesn't have to know the specifics of what the method is going to do. But that may be a major refactoring and more trouble than it's worth

@dotasek
Copy link
Collaborator

dotasek commented Aug 7, 2024

I think I spoke entirely too soon. I'm looking at it again today, and I think yesterday I got the cache into a 'lucky' state. I'm getting all sorts of errors now.

usage.ini is a bit of a pain. From the git blame, this looks to have been a tracking mechanism for the validator website to see what packages are being validated against. But it hasn't been used at all, as I've never gone to look at it. I'm going to suggest to Grahame that we move that.

This is kind of the problem with the cache; it mixes reading and writing in numerous locations instead of clearly distinguishing between the two. I think sorting THAT out first might make concurrency much easier to solve.

What disturbs me about taking another approach (passing locks around) is precisely that read/write locking would be left up to individual methods. The moment we do that, we'll end up with n implementations of concurrency peppered throughout the cache instead of being in one encapsulated location. But I'm open to altering that opinion if you can expand on the approach, maybe with some pseudocode.

@dehall
Copy link
Contributor Author

dehall commented Aug 8, 2024

This is kind of the problem with the cache; it mixes reading and writing in numerous locations instead of clearly distinguishing between the two. I think sorting THAT out first might make concurrency much easier to solve.

Definitely agree. If the methods can be cleanly split out then the specifics of the locking approach don't matter as much.

@dotasek
Copy link
Collaborator

dotasek commented Aug 8, 2024

After talking with Grahame, we agreed on a few things to make this easier.

There are two tracking mechanisms which are unused:

usage.ini (ironically) is completely unused to date. It can be totally removed.

packages.ini was also slated for total removal, except that it contains the cache version. This is important; if we update the cache implementation, this is the mechanism that allows us to identify obsolete caches and initiate a cleanup. However, the tracking mechanism for package install timestamps is totally unused, and THAT was causing file rewrites mid package install. I will remove that.

I'm going to try to do two things:

  • Isolate package install and cache cleaning into a single test that only involves the FilesystemPackageCacheManager.
  • Clean up the cache init process first and see what that gets us.

@dotasek
Copy link
Collaborator

dotasek commented Aug 9, 2024

The last round of commits to this branch are pretty much what I outlined in my last comment: master...do-20240722-thread-safety

Still to be done:

  • Apply the locks to the rest of the FilesystemPackageCacheManager
  • Figure out a way to test not just the same method, but combinations of methods from different threads.
  • Re-introduce a file lock, and make sure it can poll (at present, it actually just failed when the lock was held, instead of waiting).
  • Create an issue to apply a similar approach to TerminologyCache too.

@dotasek

This comment was marked as outdated.

@dotasek
Copy link
Collaborator

dotasek commented Aug 16, 2024

I hid the last comment, as it was too soon. I have a consistently failing Windows test now that can't delete the lock files for some reason. Will try to fix...

@dehall
Copy link
Contributor Author

dehall commented Aug 16, 2024

Apologies, I didn't have time to take a close look at this today as I'm wrapping a lot of things up before I go on vacation (back 8/28). At a glance the tests look good. Maybe I'm missing something but I don't see much in terms of new Exceptions. "Error reading package" and "Timeout waiting for lock file deletion" I think are fine. More specific exceptions for situations like this are tough because if you could explain to the user exactly what broke you could potentially handle it anyway.

When you're ready for a stress test, let me know and we (the Inferno team) can try it out in the scenario we've seen this in. It doesn't happen 100% of the time but it happens often enough that we could tell pretty quickly if it's resolved

@dotasek
Copy link
Collaborator

dotasek commented Aug 16, 2024

@dehall I'm not ready for a stress test; unfortunately. The Ubuntu tests are failing, and I don't know why.

Windows and Mac both appear to be working just fine, but I think I may have to debug in WSL or something to find out what's going on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants