Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider HealthInfo with negative seeders and leechers as invalid #7487

Merged
merged 1 commit into from
Jun 20, 2023

Conversation

drew2a
Copy link
Contributor

@drew2a drew2a commented Jun 14, 2023

This PR fixes #7430 by suppressing all errors in PopularityCommunity.gossip_random_torrents_health().

In #7430 the root cause of the error is a packing error in the attempt to pack the signed number -2 into an unsigned 32-bit integer.

We have several strategies for addressing this bug:

  1. Inspect all instances where the leecher count is created/modified to ensure it consistently remains positive.
  2. Guarantee that num_leechers is positive before invoking Community.ez_send().
  3. Suppress all errors within the community.

I've opted for the third approach, in line with the existing ipv8 policy of error suppression within communities to prevent Tribler from crashing due to, for instance, a packet of death.

This bug uncovers the inconsistency in ipv8 error handling methods: some suppress exceptions, while others do not. Ideally, ipv8 error handling would be made more consistent, but due to strict contribution rules, this isn't easily achievable (it's nearly impossible for Tribler developers to introduce changes to ipv8). Hence, we should address this issue on Tribler's end.

I've implemented an annotation This annotation can be appended to any method (synchronous or asynchronous), and it will suppress all exceptions that arise within the method. However, it's important to note that this isn't a comprehensive solution. The complete solution would involve incorporating the same logic into all fragile community methods. Implementing this fully is a significant change for Tribler, so it would be more prudent to reserve such a comprehensive overhaul for the next release.

@drew2a drew2a changed the title Add suppress exceptions Add suppress_exceptions annotation Jun 14, 2023
@drew2a drew2a marked this pull request as ready for review June 14, 2023 14:25
@drew2a drew2a requested review from a team, xoriole and kozlovsky and removed request for a team June 14, 2023 14:25
Copy link
Contributor

@xoriole xoriole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should suppress exceptions on the calls like gossip_random_torrents_health() that are triggered by the user itself. This only affects the user trying to send the message and not any other users receiving the message. If there are exceptions, we should resolve the errors instead. In this case, I would be in favor of handling the pack error.

Regarding suppression of exceptions, I would agree on suppressing exceptions on message handlers so they do not crash Tribler for an innocent user who received a death packet.

@drew2a
Copy link
Contributor Author

drew2a commented Jun 14, 2023

@xoriole

I don't think we should suppress exceptions on the calls like gossip_random_torrents_health() that are triggered by the user itself.

The calls are initiated by the TaskManager, which means users do not have any control over these processes. Unfortunately, an erroneous call can result in a crash of the Tribler application on the user's machine. A portion of these users might encounter the FeedbackDialog and choose to send us an error report, while others might simply opt for a different BitTorrent client without any feedback.

From my perspective, it is crucial that we suppress all errors that do not directly impact Tribler's core functionality. To retain our user base, we should aim to suppress all errors of this nature, like the packing error we are currently facing. The user experience is paramount, and taking such measures helps ensure minimal disruption for our users.

@qstokkink
Copy link
Contributor

I see IPv8. I consider myself summoned. 🧙

This bug uncovers the inconsistency in ipv8 error handling methods: some suppress exceptions, while others do not.

This is by design. I recall a heated debate with @egbertbouman in #5438. Long story short (though you can read the entire exchange in that issue if you want): I argued for the same "catch all" approach as you are arguing for right now. Back then I was convinced by @egbertbouman's following two sentences, which I hope will be of guidance to you too:

Crashing the app gives us error reports, and revealed this bug. If this didn't happen we would never know about this issue.

Besides that, regarding how to suppress errors (if you do end up doing this), I would like to point you to the ignore argument of register_task(). If you call self.register_task("gossip_random_torrents", self.gossip_random_torrents_health, interval=PopularityCommunity.GOSSIP_INTERVAL_FOR_RANDOM_TORRENTS) with ignore=(Exception,), all exceptions are caught and logged for this particular task.

Other than this recounting of history and objective info, I won't offer any opinion on what approach to take.

@drew2a
Copy link
Contributor Author

drew2a commented Jun 14, 2023

@qstokkink thank you for sharing the historical context behind this design decision and for the tip regarding ignore=(Exception,).

I agree with the statement:

Crashing the app gives us error reports, and revealed this bug. If this didn't happen we would never know about this issue.

By allowing Tribler to be susceptible to such failures, we indeed increase the chance of uncovering bugs, assuming that some of our users will experience these issues and kindly send us reports. This is a valid point. However, I do have concerns about the cost associated with this approach. In this case, the cost is our users. I don't have exact statistics, but from my estimation, it might look like this: out of 10 users who experience a bug, 4 might uninstall Tribler as a result, and only one sends us a report.

I'd like to suggest a slightly modified approach:

  1. Improve our test coverage by writing additional unit tests.
  2. Expand the number of scenarios for the Application Tester.
  3. Maintain Tribler's fragility to these tests.
  4. Strengthen Tribler's resilience for the end users.

I recognize that this paradigm shift might be a departure from what we've been doing, but it's a topic worth discussing. I suggest we could include this in our agenda for next week's meeting.

Another potential strategy could be:

  1. Enhance Tribler's stability for the end users.
  2. Develop a new reporting system that anonymously reports all suppressed exceptions.

I look forward to further discussions on this matter. Thank you again for your insights.

@drew2a drew2a changed the title Add suppress_exceptions annotation Suppress exceptions in PopularityCommunity.gossip_random_torrents_health() Jun 15, 2023
@egbertbouman
Copy link
Member

egbertbouman commented Jun 15, 2023

Since I'm being mentioned I'll also respond.

The argument that I made against a catchall in tunnel-related logic was mostly due to the amount of logic that is behind these calls. Even with a proper stack trace debugging an error could potentially take days. Besides, it wasn't even needed in this specific case since the input was already checked. As the former maintainer of the tunnel community I was always keen on being informed of any serious errors. The tradeoff could be different for other (less vital) components.

Anyway, that's why I added the ignore kwarg to the TaskManager. That way developers can make these tradeoffs for themselves.

Note that if you await the task it will still throw the error. It's just meant to prevent the exception from ending up in the asyncio exception handler (the one that's registered using set_exception_handler).

Copy link
Contributor

@kozlovsky kozlovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a distant analogy from another domain - the biological field. The DNA of a living organism can undergo mutations. Some of these mutations are genuinely dangerous, rendering the organism unviable. However, many mutations are mildly harmful. They do not lead to the immediate death of the organism, but they make it slightly less viable. With each mildly harmful mutation, an individual becomes a bit weaker, less intelligent, and less attractive, but the effect is negligible. For instance, a mildly harmful mutation might make the organism 0.01% less adapted.

Surprisingly, such mildly harmful mutations can be far more dangerous than truly harmful ones, which quickly kill the individual or make it an easy prey for predators. The effect of mildly harmful mutations is too insignificant to be filtered out by natural selection. If an animal is only 0.01% less intelligent than its counterparts, other random factors (such as luck) play a more significant role in survival.

As a result, natural selection cannot effectively filter out these mildly harmful mutations; they spread from one individual to the entire population, leading to a decrease in the fitness of the species. To combat these mutations, specialized mechanisms are required to work in conjunction with natural selection. These mechanisms detect combinations of several mildly harmful mutations and eliminate the embryo at early stages.

In a way, bugs in a software product resemble these mutations. Severely harmful mutations are akin to bugs that cause the application to crash, often leading to the permanent 'death' of the instance (i.e., the installed Tribler application). Mildly harmful mutations are akin to bugs that are not outwardly noticeable but degrade the application's performance and functionality without the user noticing.

I am concerned that by converting many bugs from the category of severely harmful mutations to mildly harmful mutations, these mildly harmful mutations will accumulate. Eventually, it will become apparent that Tribler is performing slower and worse, but no one will understand why, and will have no idea how to fix it.

Coming back to the current PR, in my opinion, it is more accurate to say that this fix doesn't correct bug #7430, but rather masks it. The root of the issue is that a negative number is being passed as an argument instead of a positive one, not that the user sees an error.

I agree that users might be frustrated by application crashes in case of non-critical errors. However, this means that we should develop an alternative way of anonymously reporting such errors without halting the application. I believe it would be sufficient to anonymously gossip the exception traceback, keeping the developers informed that there is an error in the code.

If the error report contains only the traceback, we could propagate it without requiring the user's consent, allowing us to more accurately assess the quantity of such errors.

Therefore, I believe that in the context of the current reporting system, it's better to address and rectify the root issue (specifically, the negative number of seeders) rather than conceal the error.

@kozlovsky
Copy link
Contributor

Regarding

  1. Improve our test coverage by writing additional unit tests.
  2. Expand the number of scenarios for the Application Tester.
  3. Maintain Tribler's fragility to these tests.
  4. Strengthen Tribler's resilience for the end users.

I think unit tests are helpful in finding regressions of known bugs, but it may be hard to find a genuinely new bug with tests. The same with Application Tester - it is very useful and allows us to discover some important issues. Still, as it only follows pre-defined scenarios (even if they are combined in random order), it cannot detect important bugs when a user does something slightly unusual.

Realistically, we can't add and maintain hundreds of scenarios to Application Tester in the hope they allow us to find all future bugs, as we do not have enough resources for it: it is one thing to add a specific scenario to check a particular case and another thing to add all possible scenarios to find all possible bugs, not known in advance.

Unit tests are very helpful, as well as the Application Tester, but we can't expect we can find all critical bugs with them.

@drew2a
Copy link
Contributor Author

drew2a commented Jun 16, 2023

@kozlovsky I appreciate the use of analogies for better understanding, however, I would exercise caution when using these particular analogies as software often operates in varied environments and under distinct regulations.

Such comparisons might potentially lead to misunderstandings or give a false sense of comprehension of the issue. Here's how it might transpire: You begin with a valid example (in this case, evolution and DNA mutations), you create a wrong analogy (bugs functioning like DNA mutations), and then you draw a false conclusion (we should avoid artificially selecting certain bugs).

When discussing software products, it might be more beneficial to use terminology and concepts specific to this field, primarily because software does not function exactly like organisms in nature. I'd like to see a Tribler who produces offspring :)

Again, I appreciate your input and look forward to more insightful discussions.

@egbertbouman thank you for your balanced view. You're absolutely correct, it is about carefully determining the trade-offs for each component of Tribler. In the case of critical sections like tunnels and libtorrent, allowing Tribler to crash might be acceptable. This is because there's a possibility, albeit uncertain, that we might receive a report from the user, enabling us to address the issue.

However, for other, less critical parts of Tribler, this might not be as significant. For instance, if there's a minor rare bug in the TorrentChecker or PopularityCommunity (or do I also mention BandwidthCommunity #7336 to summon @devos50 :)?), it might be acceptable to simply overlook it.


Just to clarify, I'm not advocating for completely ignoring errors, nor am I suggesting that every error should result in crashing the application and displaying the FeedbackDialog. My position favors a balanced approach. In the case of non-critical components of Tribler, errors could potentially be overlooked. However, for crucial parts, they absolutely should not be ignored. I appreciate your understanding in this matter.

@drew2a
Copy link
Contributor Author

drew2a commented Jun 16, 2023

@kozlovsky

Coming back to the current PR, in my opinion, it is more accurate to say that this fix doesn't correct bug #7430, but rather masks it. The root of the issue is that a negative number is being passed as an argument instead of a positive one, not that the user sees an error.

The root cause of the appearance of negative numbers is still a mystery. The negative number could come to DB from some old version of Tribler or be returned by a tracker.

We see the error because of the inconsistency of these two data structures:

@dataclass(order=True)
class HealthInfo:
    infohash: bytes = field(repr=False)
    seeders: int = 0
    leechers: int = 0
    last_check: int = field(default_factory=lambda: int(time.time()))
    self_checked: bool = False
@vp_compile
class TorrentInfoFormat(VariablePayload):
   format_list = ['20s', 'I', 'I', 'Q']
   names = ['infohash', 'seeders', 'leechers', 'timestamp']
   length = 36

Given that we aim to retain the current payloads of PopularityCommunity as they are, one potential approach to 'fix the bug' would be to more rigorously define types in HealthInfo and conduct careful checks. However, this might necessitate substantial refactoring and I question whether the benefit would justify the effort.

A more straightforward, and perhaps more appropriate approach in this situation, might be to simply ignore this and all potential errors, as suggested in commit dcff75d

Alternatively, there's another option available in commit 9b8b31e

Could you please let me know which one you prefer? Would you consider accepting it as a resolution for the bug?

@kozlovsky
Copy link
Contributor

@drew2a after some investigation, I think I understand the root cause of negative leechers. It is indeed possible for some versions of BitTorrent trackers to return a negative number of leechers: webtorrent/bittorrent-tracker#65

That means Tribler should be able to handle this properly.

The solutions in dcff75d and 9b8b31e are focused on preventing the gossiping of wrong health information to other peers. But it may be better not to store the invalid health information in the database as well, as it may somehow break the logic of some queries. For this reason, it is better to reject invalid health when receiving it from a tracker.

In Tribler 7.13, we now use a convenient class HealthInfo, that allows us to handle this check nicely

The place where we process the incoming health is this:

    async def get_tracker_response(self, session: TrackerSession) -> TrackerResponse:
        ...
        with db_session:
            for health in result.torrent_health_list:
                self.update_torrent_health(health)

And inside the update_torrent_health method, we do the following:

    def update_torrent_health(self, health: HealthInfo) -> bool:
        if not health.is_valid():
            self._logger.warning(f'Invalid health info ignored: {health}')
            return False
        ...

So, we can extend the HealthInfo.is_valid method to check the negative value of seeders and leechers. Instead

    def is_valid(self) -> bool:
        return self.last_check < int(time.time()) + TOLERABLE_TIME_DRIFT

we can do

    def is_valid(self) -> bool:
        return self.seeders >= 0 and self.leechers >= 0 and \
                self.last_check < int(time.time()) + TOLERABLE_TIME_DRIFT

This way, we not only protect other peers from receiving incorrect information but also protect the local database and an in-memory TorrentChecker._torrents_checked dictionary

@drew2a
Copy link
Contributor Author

drew2a commented Jun 16, 2023

@kozlovsky nice finding! I'll implement the fix. And what should we do with the numbers that are already in the DB?

@kozlovsky
Copy link
Contributor

kozlovsky commented Jun 16, 2023

And what should we do with the numbers that are already in the DB?

I looked at my local database and found no negative seeders or leechers. Considering that we have only a single Sentry report with this type of error, the typical user database contains very few incorrect health records.

It is possible to find all such records and set the number of leechers to zero with a query like

with db_session:
    for ts in TorrentState.select(lambda ts: ts.leechers < 0):
        ts.leechers = 0

Or with raw SQL query:

with db_session:
    db.execute("""
        update TorrentState set leechers = 0 where leechers < 0
    """)

This query can be part of a database upgrade. While the actual number of invalid records is probably low (maybe zero), the query is still potentially slow and can take 10-15 seconds on the user's machine.

But I checked database queries, and they should work fine for records with a negative number of leechers. So, another option is to leave the database as is and fix HealthInfo records after retrieving them from the database by resetting the negative number of leechers to zero. It can be implemented in the following way:

  1. Add the handling of negative leechers to the TorrentState.to_health() method:
    class TorrentState(db.Entity):
        ...
        def to_health(self) -> HealthInfo:
            leechers = max(0, self.leechers)  # fix the rare case of negative leechers in the DB
            return HealthInfo(self.infohash, self.seeders, leechers, self.last_check, self.self_checked)

And use the to_health method when constructing HealthInfo from the database. For this, we need to replace this:

    @db_session
    def load_torrents_checked_from_db(self) -> Dict[bytes, HealthInfo]:
        ...
        for torrent in checked_torrents:
            result[torrent.infohash] = HealthInfo(torrent.infohash, torrent.seeders, torrent.leechers,
                                                  last_check=torrent.last_check, self_checked=True)
        return result

With this:

    @db_session
    def load_torrents_checked_from_db(self) -> Dict[bytes, HealthInfo]:
        ...
        for torrent_state in checked_torrents:
            result[torrent.infohash] = torrent_state.to_health()
        return result

I think fixing TorrentState.to_health this way without fixing the database records should be enough

@drew2a
Copy link
Contributor Author

drew2a commented Jun 19, 2023

@kozlovsky

I looked at my local database and found no negative seeders or leechers. Considering that we have only a single Sentry report with this type of error, the typical user database contains very few incorrect health records.

Then perhaps the easiest correct solution is just to ignore errors that occur because of these records, right?

@drew2a drew2a requested review from xoriole and kozlovsky June 19, 2023 14:56
@drew2a drew2a changed the title Suppress exceptions in PopularityCommunity.gossip_random_torrents_health() Consider HealthInfo with negative seeders and leechers as invalid Jun 19, 2023
@drew2a drew2a merged commit fe5c07d into Tribler:release/7.13 Jun 20, 2023
@drew2a drew2a deleted the fix/7430 branch June 20, 2023 06:50
@drew2a drew2a added this to the 7.13.0 milestone Jun 20, 2023
@drew2a drew2a self-assigned this Jun 20, 2023
@drew2a drew2a removed this from the 7.13.0 milestone Jun 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants