Fix multiple instances of Tribler on Flatpak #7621

Closed
wants to merge 2 commits

Conversation

Contributor

@xoriole commented Oct 6, 2023

This PR fixes multiple instances of Tribler running on Flatpak. If the app can connect to a previous instance via QLocalSocket, another instance is already running. In that case, the new Tribler instance now exits after passing its arguments to the running process via the Qt local socket.
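For reference, a minimal sketch of what such a "connected to previous instance" check typically looks like with Qt's local socket; the function name, socket name, and timeouts below are illustrative assumptions, not Tribler's actual code:

```python
from PyQt5.QtNetwork import QLocalSocket


def connect_to_previous_instance(socket_name: str, arguments: str) -> bool:
    """Return True if another instance already owns the local socket.

    The socket name and the 500 ms timeouts are illustrative assumptions.
    """
    socket = QLocalSocket()
    socket.connectToServer(socket_name)
    if socket.waitForConnected(500):
        # Another instance is listening: hand over our CLI arguments and let
        # the caller terminate this process.
        socket.write(arguments.encode('utf-8'))
        socket.waitForBytesWritten(500)
        socket.disconnectFromServer()
        return True
    return False
```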

In addition, for the core process, the PR now checks whether the REST API is accessible when the API port is known. If the API is not accessible, the core process is considered not running. Checking the PID alone is not sufficient in the Flatpak environment because PIDs are no longer unique inside the sandbox.
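Roughly, that check amounts to probing the REST API's /docs endpoint, as the diff further down also shows. A sketch, with the timeout value assumed for illustration:

```python
import requests

API_CHECK_TIMEOUT_IN_SECONDS = 2  # assumed value for illustration


def core_api_is_accessible(api_port: int) -> bool:
    """Treat the core process as running only if its REST API responds."""
    try:
        requests.get(f"http://localhost:{api_port}/docs",
                     timeout=API_CHECK_TIMEOUT_IN_SECONDS)
        return True
    except requests.RequestException:
        return False
```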

This PR addresses the issue of multiple instances of the Flatpak application running in parallel and crashing Tribler when one of the instances is closed. It is addressed in the following way.

If the last primary process from the database has a PID that either matches the current process PID or can be confirmed to be running, then the current process instance is terminated. This prevents the second instance of Tribler from running.

It could be argued that a PID could be reused, leading to Tribler terminating itself with no instance running at all. However, the likelihood of the last Tribler process having the same PID as the current process while the database still holds a process record with primary set to 1 is, in my opinion, negligible. Furthermore, if such a case does happen, Tribler will most likely start on the next run because the PID would change, which in my opinion is acceptable.
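A minimal sketch of that rule (function and variable names are illustrative; the actual check operates on the process records in processes.sqlite):

```python
import os

import psutil  # used here only to illustrate "can be confirmed to be running"


def second_instance_should_exit(last_primary_pid: int) -> bool:
    """Exit the new instance if the recorded primary process appears alive."""
    if last_primary_pid == os.getpid():
        # Inside a Flatpak sandbox the PID namespace makes PIDs non-unique,
        # so the recorded primary PID can equal our own PID even though it
        # belongs to another Tribler process.
        return True
    return psutil.pid_exists(last_primary_pid)
```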

Fixes #7626
Partially fixes #7603

@xoriole marked this pull request as ready for review October 6, 2023 14:56
@xoriole requested a review from a team as a code owner October 6, 2023 14:56
@xoriole requested review from egbertbouman and removed request for a team October 6, 2023 14:56
@kozlovsky self-requested a review October 9, 2023 06:56
@kozlovsky
Collaborator

Thank you for your efforts in addressing this issue. After looking it over, I have some concerns about the correctness of the suggested approach. I'm diving deeper into the details and will provide a more comprehensive review in about half an hour.

Collaborator

@kozlovsky left a comment


In my opinion, this is an interesting attempt to solve the issue, but unfortunately, I think it can introduce significant new issues and make the code logic more complicated. I think we need to find a different approach to this issue.

@@ -64,7 +64,7 @@ def run_gui(api_port: Optional[int], api_key: Optional[str], root_state_dir, par
     translator = get_translator(settings.value('translation', None))
     app.installTranslator(translator)

-    if not current_process_is_primary:
+    if not current_process_is_primary or app.connected_to_previous_instance:
Collaborator


This PR contains two fixes for the GUI and Core processes. Let's first look at the fix for the GUI process, which is implemented as an extended check at line 67 that adds the "app.connected_to_previous_instance" condition.

It looks like this PR can indeed solve the problem on Flatpak for the case when the first (primary) Flatpak-based Tribler instance is already running and a second (secondary) Flatpak-based Tribler instance is starting. Before your fix, the second GUI process incorrectly considers itself a primary instance because it has the same PID as the first GUI process. With your changes, the second GUI process is still regarded as a primary process but finishes immediately because it can connect to the first GUI process via the local socket.

However, the solution proposed in this PR creates a race condition when two instances of the Tribler application start at the same moment. For some obscure reason this situation is very common and needs to be handled properly. With the simultaneous start of two GUI processes, the first process may claim itself primary in the processes.sqlite lock file while the second process is the first to grab the local socket. In that case, the first (primary) process can connect to the local socket of the secondary process, and both processes stop simultaneously: the second because it is not primary, and the first because it was able to connect to the local socket of the secondary process.

In other words, this change fixes the Flatpak issue but re-creates a race condition for the non-Flatpak case. Previously, the race condition manifested itself as two GUI processes running in parallel; now it would appear as a Tribler application that refuses to start at all.

Contributor Author


Thank you for your feedback. Here's a more detailed explanation:

  1. The app.connected_to_previous_instance flag was introduced to strengthen the current check that determines whether the process is primary. Although the existing primary-process check should ideally handle the scenario where multiple instances of Tribler exist, it doesn't perform as expected in environments like Flatpak. The added condition is supplemental (an OR condition), so it doesn't disrupt the standard execution flow.

  2. The value of app.connected_to_previous_instance is set to True only when the new GUI process successfully connects to the local socket. If a race condition were to occur, this value would be set to False, retaining the original execution behavior, in my opinion.

I haven't encountered or been able to reproduce a race condition scenario in the current setup. However, I'm keen to understand your perspective and would welcome any suggestions on potential improvements or alternative solutions.

Collaborator

@kozlovsky Oct 9, 2023


It is possible to encounter the following case:

  1. Two GUI processes started at the same time (that was a very frequent issue until we implemented a new SQLite-based lock; the previous triblerd.lock implementation had race conditions in this scenario).
  2. The first GUI process was first to grab the file lock and mark itself as a primary process.
  3. The second GUI process was first to grab the local socket.

I believe this is a pretty frequent scenario, and in my opinion the current logic in the PR does not handle it properly. To handle it correctly, the GUI process should grab the local socket while still holding the file lock, which complicates the logic a bit and increases the time the process holds the lock.
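To make the suggested ordering concrete, here is a rough sketch under several assumptions: a filelock-based lock stands in for the processes.sqlite lock, and the socket name, lock path, and function name are invented for illustration:

```python
from typing import Optional

from filelock import FileLock  # stand-in for the processes.sqlite lock
from PyQt5.QtNetwork import QLocalServer, QLocalSocket

SOCKET_NAME = "tribler-gui"   # illustrative socket name
LOCK_PATH = "primary.lock"    # illustrative lock file path


def claim_primacy_and_socket() -> Optional[QLocalServer]:
    """Claim the primary role and the local socket in one critical section.

    The ordering is the point: the socket is grabbed while the lock is still
    held, so two GUI processes starting simultaneously cannot end up with one
    owning the primary flag and the other owning the socket.
    """
    with FileLock(LOCK_PATH):
        probe = QLocalSocket()
        probe.connectToServer(SOCKET_NAME)
        if probe.waitForConnected(500):
            return None                          # another instance owns the socket: exit

        QLocalServer.removeServer(SOCKET_NAME)   # clean up a stale socket, if any
        server = QLocalServer()
        server.listen(SOCKET_NAME)
        return server                            # we are primary and own the socket
```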


try:
    docs_url = f"http://localhost:{self.api_port}/docs"
    _ = requests.get(docs_url, timeout=API_CHECK_TIMEOUT_IN_SECONDS)
Collaborator

@kozlovsky Oct 9, 2023


The second fix proposed in this PR is an HTTP request to the Tribler REST API added to the TriblerProcess.is_running method.

Note that the is_running method is called, for example, when TriblerProcess.__str__() is called. It looks a bit extreme to me to issue an HTTP request whenever someone tries to print an instance of a core process.

Second, it is not clear why this check is helpful for the Core process, for two reasons:

  1. If the logic for GUI processes is correct, the second GUI instance should understand that it is secondary and should not start the second Core process at all. So, the check for the Core is less important than for the GUI if the GUI check is implemented correctly.
  2. It is hard to say what this check allows us to achieve. Initially, when the Core process has just started, its api_port is always None. Note that with the current implementation, the GUI process does not pass the API port number to the Core process; the Core process opens the REST API on any available port and writes the value to the database so the GUI process can look it up. So when the Core process is just starting, its api_port is always None, and this check will not be triggered. When two Core processes are already running by mistake, their api_port values will differ: each Core process has a different API port and can connect to it. (By the way, can the code create a deadlock when the Core process connects to its own port? The call to requests.get is synchronous, so it looks like it should block the asyncio loop; a non-blocking variant is sketched below.) It is hard to say why this check of HTTP endpoint availability is useful and how correctly it works in all the different use cases.
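On the blocking concern specifically: if a reachability probe were kept inside the Core process, it could be made non-blocking, for example with aiohttp. A sketch; the /docs endpoint and timeout constant mirror the diff above, while the function name and timeout value are assumptions:

```python
import asyncio

import aiohttp

API_CHECK_TIMEOUT_IN_SECONDS = 2  # assumed value for illustration


async def api_is_accessible(api_port: int) -> bool:
    """Probe the REST API without blocking the asyncio event loop."""
    timeout = aiohttp.ClientTimeout(total=API_CHECK_TIMEOUT_IN_SECONDS)
    try:
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(f"http://localhost:{api_port}/docs"):
                return True
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return False
```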

@xoriole marked this pull request as draft October 10, 2023 07:39