
Need source telemetry to capture 'the first hop' when mirroring traffic #11093

Open · jmazzitelli opened this issue Jan 19, 2019 · 15 comments

Labels: area/extensions and telemetry, kind/enhancement, lifecycle/staleproof (indicates a PR or issue has been deemed to be immune from becoming stale and/or automatically closed)

@jmazzitelli (Member)

This was originally reported on "Discuss Istio" - I was asked to create a GitHub issue regarding this bug. See: https://discuss.istio.io/t/need-source-telemetry-to-capture-the-first-hop-when-mirroring-traffic/369

Describe the bug
I have traffic mirroring set up in bookinfo (see this yaml if you are interested) where productpage-v1 sends its requests to reviews-v1 but mirrors to reviews-v2.
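For reference, a minimal sketch of what such a mirroring configuration looks like (the host/subset values are assumptions based on the standard bookinfo sample and require a DestinationRule defining the v1/v2 subsets; the linked yaml above is the authoritative version):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
  namespace: bookinfo
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1   # "normal" requests are routed to reviews-v1
      mirror:
        host: reviews
        subset: v2       # requests are also mirrored ("dark" traffic) to reviews-v2
```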

Visualizing the resulting telemetry in Kiali looks like this:

[Kiali graph screenshot: missing-telemetry]

Notice that “first hop” in the mirrored traffic is missing - that is to say, the request going from productpage-v1 to reviews-v2 is missing. This is because there is no reporter=“source” metric. However, as that mirrored traffic flows “downstream”, there is source telemetry for the rest of the request as it flows from service to service (which is why you see edges from reviews-v2 to ratings-v1 and -v2 as well as ratings-v2 to mongodb-v1).

The Istio implementation should be changed so the full mirrored traffic (starting at that “first hop”) is represented by source telemetry. As it is now, the source telemetry has a “hole” in it as you see when the telemetry is visualized.

Side note: There is reporter=“destination” telemetry for that first hop - but that is from the point of view of the reviews-v2 workload (thus reporter=“destination”). The Kiali graph is visualizing reporter=“source” telemetry because that is the only side that provides information about client-side errors (like injected faults, network errors, etc).

Expected behavior
I expect traffic mirroring to have the same telemetry as "normal" traffic. Specifically, I expect to see source telemetry (reporter="source") for the "first hop" of the mirrored traffic.

Steps to reproduce the bug

  1. Install bookinfo demo
  2. Add the virtual service/destination rule that turns on mirroring using this yaml.
  3. Send one request from your browser to the bookinfo's productpage web page. (The result of this is that the virtual service sends the "normal" request to reviews-v1 but mirrors traffic to reviews-v2).
  4. Look in Prometheus for all timeseries metrics dealing with the mirrored traffic going to reviews-v2. You do this using the query istio_requests_total{destination_workload="reviews-v2"}. Notice there is a missing reporter="source" timeseries - there is only a reporter="destination" timeseries. You will see something like this:

istio_requests_total{connection_security_policy="none",
destination_app="reviews",
destination_principal="unknown",
destination_service="reviews.bookinfo.svc.cluster.local",
destination_service_name="reviews",
destination_service_namespace="bookinfo",
destination_version="v2",
destination_workload="reviews-v2",
destination_workload_namespace="bookinfo",
instance="172.17.0.12:42422",
job="istio-mesh",
reporter="destination",
request_protocol="http",
response_code="200",
source_app="productpage",
source_principal="unknown",
source_version="v1",
source_workload="productpage-v1",
source_workload_namespace="bookinfo"}
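A quick, illustrative check (not part of the original report) is to query for the source-reported series directly; on an affected mesh this returns nothing for the mirrored traffic, while the equivalent reporter="destination" selector returns the timeseries shown above:

```
istio_requests_total{destination_workload="reviews-v2", reporter="source"}
```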

Version
Istio 1.0.5

Installation
Using helm

@jotak (Contributor) commented Feb 1, 2019

BTW, if/when this is fixed, I think it would be even better if there were a way to discriminate mirrored traffic from normal traffic in the telemetry. The current situation leads to a somewhat inconsistent graph, as described by @jmazzitelli, but on the other hand it has the side-effect advantage of somehow illustrating, via telemetry, that some mirroring is happening, through the discrepancy between source reporting and destination reporting.

I'm not sure what that discriminator could be. A new label?
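For illustration only, such a discriminator might look like a hypothetical request_mirrored label on the metric (this label does not exist in Istio today; it is purely an assumed example):

```
istio_requests_total{reporter="source", source_workload="productpage-v1", destination_workload="reviews-v2", request_mirrored="true"}
```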

@jmazzitelli (Member, Author) commented Feb 1, 2019

> it has the side-effect advantage of somehow illustrating, via telemetry, that some mirroring is happening, through the discrepancy between source reporting and destination reporting.

But there is nothing to say this discrepancy is related to mirrored traffic - having only reporter="destination" does not necessarily mean this is mirrored traffic (I mean there is nothing that specifically says "having only reporter=destination means mirroring" - is that always true?). And note - this discrepancy exists ONLY on the first hop. The rest of the mirrored traffic today looks identical to normal traffic - there is no way to tell that reviews-v2 to ratings-v1 is mirrored traffic rather than, say, a normal request initiated by a job from reviews-v2.

So as far as I can see, even today there is no way to really know if traffic is mirrored or not - there would have to be some other way - some other attribute perhaps - to indicate this.

@jotak (Contributor) commented Feb 1, 2019

@jmazzitelli I'm saying this because I actually appreciated this "bug" while demoing traffic shadowing & visualization in Kiali. It's true that I would prefer an explicit mention of "this is mirroring" in the telemetry. But to the audience I was able to explain:
"See, I've set up shadowing. From the source's point of view there's no traffic - because it's not really aware that its requests are duplicated - but from the destination's PoV the traffic exists and is real: this is shadowing."

Also, the fact that it only affects the first hop isn't a problem: it shows that from the destination PoV, requests are real. And hence that a user has to be careful about not "reversing" the mirrored request into a production workflow with side effects, such as writing to a DB (the kind of thing explained by C. Posta here: https://blog.christianposta.com/microservices/advanced-traffic-shadowing-patterns-for-microservices-with-istio-service-mesh/ )

So even if I admit that it's mainly a coincidence, this bug is kind of helpful when we want to show shadowing.

As I wrote above, a good solution for both points of view would be to flag the requests as being mirrored.

@jotak (Contributor) commented Feb 1, 2019

PS: in Kiali, to illustrate the source versus destination PoV, we just have to double-click on a node, in which case the reporter used becomes node-centric.

@jmazzitelli (Member, Author)

OK, so it's just by coincidence that this bug is sort of helpful. I get it. I would prefer something more of a "feature" though; as you say, we need a "good solution" here rather than relying on this bug. First of all, you are still kind of guessing ... "well, I think this is mirrored because this source telemetry edge from productpage to reviews is missing." You would have to drill down in Kiali to see it (or look at the destination telemetry if you are doing this outside of Kiali). Without Kiali (just looking at raw telemetry), it would be even harder to find. And even then, I'm not sure this necessarily means mirrored traffic (are there other cases where destination telemetry exists but source telemetry is missing?).

I also think not seeing any indication that the second hop and beyond is mirrored is a problem. If I'm looking at a screen of my production mesh and I see traffic "downstream", I have no way of knowing (even from the destination point of view) whether this is dark traffic or not. E.g. looking at ratings-v1, even from a destination point of view (ratings-v1 being the destination) we can't tell. Double-clicking and drilling down into ratings-v1 in Kiali won't help in that case (because there is both source and destination telemetry here - it looks like normal traffic). And I can see people wanting to track dark traffic all the way through their mesh to see the behavior of the services for that traffic (not just on the first hop). I see the purpose of dark traffic as being able to test the performance of that traffic all the way through the services it touches - not just the first one. So you would have to be able to discern where that dark traffic is flowing.

Perhaps we should write an Istio GitHub issue for an enhancement request for this?

@jshaughn

I think that the proposal for New Virtual Service Attributes could potentially add what's needed for this enhancement. Is there work being done on that proposal? Being able to distinguish between the requested service and the resulting service request(s) would, I think, allow for a visualization.

stale bot commented Jun 25, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

The stale bot added the stale label on Jun 25, 2019.
@jshaughn

I think this is still relevant to future telemetry needs.

stale bot commented Sep 23, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

The stale bot added the stale label on Sep 23, 2019.
@jshaughn

This is still relevant to Istio 1.4 telemetry discussions.

@jmazzitelli (Member, Author)

@douglas-reid I believe this is still relevant.

@esnible (Contributor) commented Mar 2, 2020

@douglas-reid this is set for 1.5 and I rashly added the User Experience label last year. Is this still a problem with Telemetry v2 and is there a story for fixing it?

@jshaughn commented Mar 2, 2020

This did not get into 1.5, and the "requested service" info is currently listed as P1 for 1.6. I am hoping to see it added, as I don't think it's too difficult to add these fields to the set of default telemetry labels.

@douglas-reid (Contributor)

@kyessenov for comment / work

@kyessenov (Contributor)

From what I understand, the underlying issue is the lack of extension support on the client side for mirrored traffic. It's fairly complicated, since currently mirrored traffic is fire-and-forget. For peer metadata exchange to work, we have to wait for the peer to write its prologue for mirrored requests, unless we have some sort of cache.

Related to this issue is the confusing tracing data. Mirrored traffic receives unsanitized tracing headers, so the span is duplicated (but not reported twice). See envoyproxy/envoy#10257.

@esnible modified the milestone: 1.5 → Backlog (Apr 21, 2020)
@istio-policy-bot added the lifecycle/stale label (indicates a PR or issue hasn't been manipulated by an Istio team member for a while) on Sep 3, 2020
@douglas-reid added the lifecycle/staleproof label (immune from becoming stale and/or automatically closed) and removed the lifecycle/stale label on Sep 3, 2020
Projects: Status: P1