Need source telemetry to capture 'the first hop' when mirroring traffic #11093
Comments
BTW, if/when this is fixed, it would be even better to have a way to discriminate mirrored traffic from normal traffic in telemetry. The current situation leads to a somewhat inconsistent graph, as described by @jmazzitelli, but it has the side-effect advantage of illustrating via telemetry that some mirroring is happening, through the discrepancy between source reporting and destination reporting. I'm not sure what that discriminator could be. A new label?
But there is nothing to say this discrepancy is related to mirrored traffic: having only reporter="destination" does not necessarily mean the traffic is mirrored (is "having only reporter=destination means mirror" always true?). Also note that this discrepancy occurs ONLY on the first hop. The rest of the mirrored traffic today looks identical to normal traffic - there is no way to tell that reviews-v2 to ratings-v1 is mirrored traffic rather than, say, a normal request initiated by a job from reviews-v2. So as far as I can see, even today there is no way to really know whether traffic is mirrored - there would have to be some other way, some other attribute perhaps, to indicate this.
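One candidate attribute, offered only as a sketch: Envoy (the proxy Istio uses) by default appends "-shadow" to the Host/authority header of a mirrored request. Assuming the first-hop destination can read that header, a heuristic check might look like the following. The function name `looks_mirrored` is hypothetical, and this only identifies the first hop; requests further downstream carry ordinary host names, so it does not address the downstream concern raised above.

```python
def looks_mirrored(host_header):
    """Heuristic: True if the Host/:authority value ends with "-shadow".

    Assumption: the mirroring proxy appended the suffix to the host name,
    placing it before any ":port" part.
    """
    host = host_header.strip().lower()
    # Drop a trailing ":port" if present (bracketed IPv6 literals are not handled).
    name, sep, port = host.rpartition(":")
    if sep and port.isdigit():
        host = name
    return host.endswith("-shadow")


# Hypothetical header values as a destination workload might see them.
for value in ("reviews-shadow:9080", "reviews:9080"):
    print(value, looks_mirrored(value))
```

A service whose real host name happens to end in "-shadow" would be misclassified, which is one reason an explicit telemetry label would still be preferable.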
@jmazzitelli I'm saying this because I actually appreciated this "bug" while demoing traffic shadowing & visualization in Kiali. It's true that I would prefer an explicit mention of "this is mirroring" in telemetry. But to the audience I was able to explain: Also, the fact that it only affects the first hop isn't a problem: it shows that, from the destination PoV, requests are real, and hence that a user has to be careful about not "reversing" the mirrored request into a production workflow with side effects, such as writing to a DB (the kind of thing explained by C. Posta here: https://blog.christianposta.com/microservices/advanced-traffic-shadowing-patterns-for-microservices-with-istio-service-mesh/ ). So even if I admit that it's mainly due to coincidence, this bug is kind of helpful when we want to show shadowing. As I wrote above, a good solution for both points of view would be to flag the requests as being mirrored.
PS: in Kiali, to illustrate source versus destination PoV, we just have to double-click on a node, in which case the reporter used becomes node-centric.
OK, so it's just by coincidence that this bug is sort of helpful. I get it. I would prefer something more of a "feature" though; as you say, we need a "good solution" here rather than relying on this bug. First of all, you are still kind of guessing: "well, I think this is mirrored because the source telemetry edge from productpage to reviews is missing." You would have to drill down in Kiali to see it (or look at the destination telemetry if you are doing this outside of Kiali). Without Kiali (just looking at raw telemetry), it would be even harder to find. And even then, I'm not sure this necessarily means mirrored traffic (are there other cases where destination telemetry exists but source telemetry is missing?). I also think that having no indication that the second hop and everything after it is mirrored is a problem. If I'm looking at a screen of my production mesh and I see traffic "downstream", I have no way of knowing (even from the destination point of view) whether this is dark traffic. E.g., looking at ratings-v1, even from a destination point of view (ratings-v1 being the destination), we can't tell. Double-clicking and drilling down into ratings-v1 in Kiali won't help in that case (because there is both source and destination telemetry here - it looks like normal traffic). And I can see people wanting to track dark traffic all the way through their mesh to see how the services behave for that traffic (not just on the first hop). I see the purpose of dark traffic as being able to test the performance of that traffic all the way through the services it touches - not just the first one. So you would have to be able to discern where that dark traffic is flowing. Perhaps we should write an Istio GitHub issue requesting this enhancement?
I think that the proposal for New Virtual Service Attributes could potentially add what's needed for this enhancement. Is there work being done on that proposal? Being able to distinguish between the requested service and the resulting service request(s) would, I think, allow for a visualization.
This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
I think this is still relevant to future telemetry needs.
This is still relevant to Istio 1.4 telemetry discussions.
@douglas-reid I believe this is still relevant.
@douglas-reid this is set for 1.5 and I rashly added the
This did not get into 1.5, and the "requested service" info is currently listed as P1 for 1.6. I am hoping to see it added, as I don't think it's too difficult to add these fields to the set of default telemetry labels.
@kyessenov for comment / work
From what I understand, the underlying issue is the lack of extension support on the client side for mirrored traffic. It's fairly complicated, since mirrored traffic is currently fire-and-forget. For peer metadata exchange to work, we have to wait for the peer to write its prologue for mirrored requests, unless we have some sort of cache. Related to this issue is the confusing tracing data. Mirrored traffic receives unsanitized tracing headers, so the span is duplicated (but not reported twice). See envoyproxy/envoy#10257.
This was originally reported on "Discuss Istio" - I was asked to create a github issue regarding this bug. See: https://discuss.istio.io/t/need-source-telemetry-to-capture-the-first-hop-when-mirroring-traffic/369
Describe the bug
I have traffic mirroring set up in bookinfo (see this yaml if you are interested) where productpage-v1 sends its requests to reviews-v1 but mirrors to reviews-v2.
Visualizing the resulting telemetry in Kiali looks like this:
Notice that the “first hop” in the mirrored traffic is missing - that is to say, the request going from productpage-v1 to reviews-v2 is missing. This is because there is no reporter="source" metric. However, as that mirrored traffic flows “downstream”, there is source telemetry for the rest of the request as it flows from service to service (which is why you see edges from reviews-v2 to ratings-v1 and -v2 as well as ratings-v2 to mongodb-v1).
The Istio implementation should be changed so the full mirrored traffic (starting at that “first hop”) is represented by source telemetry. As it is now, the source telemetry has a “hole” in it as you see when the telemetry is visualized.
Side note: There is reporter="destination" telemetry for that first hop - but that is from the point of view of the reviews-v2 workload (thus reporter="destination"). The Kiali graph is visualizing reporter="source" telemetry because that is the only side that provides information about client-side errors (like injected faults, network errors, etc).
Expected behavior
I expect traffic mirroring to have the same telemetry as "normal" traffic. Specifically, I expected to see source telemetry (reporter="source") for the "first hop" of the mirrored traffic.
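The expectation above can be checked mechanically. The following is a minimal sketch, assuming the label sets of the `istio_requests_total` series have already been fetched (for example from Prometheus) into plain dictionaries; the function name `edges_missing_source` and the sample data are hypothetical, chosen to resemble the bookinfo setup described in this issue.

```python
from collections import defaultdict


def edges_missing_source(samples):
    """Return (source_workload, destination_workload) pairs that have a
    reporter="destination" series but no reporter="source" series."""
    reporters = defaultdict(set)
    for labels in samples:
        edge = (
            labels.get("source_workload", "unknown"),
            labels.get("destination_workload", "unknown"),
        )
        reporters[edge].add(labels.get("reporter"))
    return sorted(
        edge
        for edge, seen in reporters.items()
        if "destination" in seen and "source" not in seen
    )


# Hypothetical label sets shaped like istio_requests_total series.
samples = [
    {"reporter": "source", "source_workload": "productpage-v1", "destination_workload": "reviews-v1"},
    {"reporter": "destination", "source_workload": "productpage-v1", "destination_workload": "reviews-v1"},
    # Mirrored first hop: only the destination side reports.
    {"reporter": "destination", "source_workload": "productpage-v1", "destination_workload": "reviews-v2"},
    {"reporter": "source", "source_workload": "reviews-v2", "destination_workload": "ratings-v1"},
    {"reporter": "destination", "source_workload": "reviews-v2", "destination_workload": "ratings-v1"},
]

print(edges_missing_source(samples))
```

With this sample data, only the productpage-v1 to reviews-v2 edge is flagged. As discussed in the comments, a flagged edge is a hint rather than proof of mirroring.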
Steps to reproduce the bug
istio_requests_total{destination_workload="reviews-v2"}
Notice that the reporter="source" timeseries is missing - there is only a reporter="destination" timeseries. You will see something like this:
Version
Istio 1.0.5
Installation
Using helm