
Clean up logging #2216

Merged: 2 commits merged into master from neel/logger-cleanup on Jan 4, 2024
Conversation

sl0thentr0py (Member):

closes #2031


codecov bot commented Jan 2, 2024

Codecov Report

Merging #2216 (c588979) into master (49a628e) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2216      +/-   ##
==========================================
- Coverage   97.43%   97.42%   -0.02%     
==========================================
  Files         102      102              
  Lines        3817     3800      -17     
==========================================
- Hits         3719     3702      -17     
  Misses         98       98              
Component            | Coverage Δ
sentry-ruby          | 98.13% <100.00%> (-0.02%) ⬇️
sentry-rails         | 95.01% <ø> (ø)
sentry-sidekiq       | 94.70% <ø> (ø)
sentry-resque        | 92.06% <ø> (ø)
sentry-delayed_job   | 94.44% <ø> (ø)
sentry-opentelemetry | 100.00% <ø> (ø)

File                                               | Coverage Δ
sentry-ruby/lib/sentry/client.rb                   | 100.00% <100.00%> (ø)
sentry-ruby/lib/sentry/configuration.rb            | 98.78% <100.00%> (ø)
sentry-ruby/lib/sentry/event.rb                    | 98.55% <ø> (-0.25%) ⬇️
sentry-ruby/lib/sentry/transport.rb                | 99.09% <100.00%> (ø)
sentry-ruby/lib/sentry/transport/http_transport.rb | 100.00% <100.00%> (ø)

@sl0thentr0py force-pushed the neel/logger-cleanup branch 2 times, most recently from 6224ee9 to da84af7, on January 2, 2024 at 17:10
loggable_event_type = event_type.capitalize
log_error("#{loggable_event_type} sending failed", e, debug: configuration.debug)

event_info = Event.get_log_message(event.to_hash)
Collaborator:

In the past we relied on this message to look into the potentially problematic event payload. Do we have an alternative for this now?

sl0thentr0py (Member Author):

I want to move away from logging whole payloads for the end user; one can always log the event in before_send if one really wants to. Even on our side, we once spent a lot of logging resources in Python by logging huge payloads.
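
As a concrete illustration of the before_send suggestion above, a minimal sketch assuming standard sentry-ruby configuration; the log message and the 2,000-character truncation are arbitrary illustrative choices:

```ruby
require "json"
require "sentry-ruby"

Sentry.init do |config|
  config.dsn = ENV["SENTRY_DSN"]

  # Log the outgoing payload yourself instead of relying on the SDK's
  # internal error logging. Returning the event keeps it being sent.
  config.before_send = lambda do |event, _hint|
    payload = event.to_hash.to_json
    config.logger.debug("outgoing event: #{payload[0, 2_000]}")
    event
  end
end
```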

Collaborator:

The logging here should be triggered very rarely. Most expected errors should already have been caught at HTTPTransport#send_data (network errors, etc.), so they don't even reach here.
What is most likely captured here are errors from the before_send* callbacks and event/envelope serialisation errors, and having the event data in that log makes debugging much easier for users and for us.

sl0thentr0py (Member Author) commented on Jan 4, 2024:

send_data re-raises here:

    raise Sentry::ExternalError, error_info
  end
rescue SocketError, *HTTP_ERRORS => e
  on_error if respond_to?(:on_error)
  raise Sentry::ExternalError.new(e&.message)

so these will end up being logged for network errors too.

I can add back some logging, but I'd make it debug at best.

Collaborator:

Sorry that I missed it 🤦‍♂️
OK, let's remove it for now and find a better place to log them if we later find it necessary.

sl0thentr0py (Member Author):

Yes, but I agree that debugging before_send errors is not ideal. We've also discussed this internally.

@sl0thentr0py merged commit 918b83d into master on Jan 4, 2024
122 of 123 checks passed
@sl0thentr0py deleted the neel/logger-cleanup branch on January 4, 2024 at 13:22
  handle_rate_limited_response(response) if has_rate_limited_header?(response)
elsif response.code == "429"
  handle_rate_limited_response(response)
end

charligoulet:

👋🏻 @sl0thentr0py @st0012 I was reading into the logic for 429s, as we noticed errors weren't being raised anymore after upgrading from 5.14 to 5.16. This change seems to be the reason: 429s were previously handled by the else branch, which did call raise Sentry::ExternalError, error_info.

Reading through the issue/PR, the changes seemed to be only about cleaning up logging, so I'm curious whether no longer raising an error for 429s was intentional. Thanks!

sl0thentr0py (Member Author):

@charligoulet yes, this was done deliberately. In general, we shouldn't raise exceptions in userland for SDK-internal things, and if you were using this to check whether you are being rate limited on our SaaS offering, there is visibility in the product (the Stats section).

charligoulet:

Thanks for the prompt response!

My concern here is around retrying rate-limited events. We are currently leveraging Sentry in the context of Fluentd (we built a plugin with the Ruby SDK).

Fluentd has a retry mechanism that leverages an exponential back-off strategy. For Fluentd to trigger retries, errors must be escalated from the SDK to Fluentd. For example, we're currently able to retry 503s and 504s rather than dropping events, because the errors for those are raised by the SDK. With this change, there is no way of programmatically knowing that a specific event has been rate-limited and that it should be retried. As I think you'll agree, digging through the Stats section or logs isn't realistic for retry purposes.

Unless I'm misunderstanding the docs, these events are also not retried for us on the Sentry side. So without an error raised here, and without Sentry retrying rate-limited events for us, isn't this a breaking change from a behavioural standpoint? It blocks users from retrying events that have been rate-limited and essentially leaves dropping events as our only option in this scenario.

sl0thentr0py (Member Author):

I see, thanks for the explanation of your use case.
A few points:

  • we never retry on our side; a dropped event is a dropped event
  • the server's rate limiting is merely interpreted as 'stop sending for a while'
  • technically it is breaking, yes, but this is considered internal to the HTTP transport implementation and not really part of the API contract

That said, there are a couple of options for you now:

  • make your own transport deriving from Sentry::HTTPTransport and override send_data there to do what you wish (a sketch follows below)
  • I can expose an on_rate_limit hook in the transport class, similar to on_error, so you just need to define that one method to retry

Does that work?
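
A minimal sketch of the first option, assuming sentry-ruby 5.x: the class name and error message are made up for illustration, and the rate_limits lookup relies on the transport's internal bookkeeping (a hash of category => expiry Time), so treat this as a starting point rather than a drop-in implementation:

```ruby
require "sentry-ruby"

# Custom transport that surfaces rate limiting as an error so an outer retry
# mechanism (Fluentd in this case) can re-enqueue the event.
class RaisingHTTPTransport < Sentry::HTTPTransport
  def send_data(*args)
    super

    # If the last response installed a rate limit, escalate instead of letting
    # later events be dropped silently. Note: this also fires when a 2xx
    # response carried a rate-limit header, so refine as needed.
    limited = rate_limits.values.compact.any? { |until_time| until_time > Time.now }
    raise Sentry::ExternalError, "rate limited by the Sentry server" if limited
  end
end

Sentry.init do |config|
  config.dsn = ENV["SENTRY_DSN"]
  config.transport.transport_class = RaisingHTTPTransport
end
```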

charligoulet:

Option 2 (exposing an on_rate_limit hook) seems like a great way forward.

Appreciate you taking the time to understand our context and coming up with a solution here! Let me know if you need anything further from my end. Looking forward to being able to use the hook.

sl0thentr0py (Member Author):

Cool, I'll make a new issue to track that.
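
For context, a purely hypothetical sketch of how the proposed on_rate_limit hook might be used, assuming it mirrors the existing on_error pattern (a method on a transport subclass that the SDK calls when applicable); the hook did not exist in the SDK at the time of this exchange, and the class name is illustrative:

```ruby
require "sentry-ruby"

class FluentdTransport < Sentry::HTTPTransport
  # Hypothetical hook: called by the SDK when the server rate-limits a send.
  def on_rate_limit
    # Escalate so Fluentd's exponential back-off retry can re-enqueue the
    # event instead of it being silently dropped.
    raise Sentry::ExternalError, "rate limited by Sentry, retry later"
  end
end

Sentry.init do |config|
  config.dsn = ENV["SENTRY_DSN"]
  config.transport.transport_class = FluentdTransport
end
```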

Development
Successfully merging this pull request may close these issues: Reduce the logging
3 participants