
Need source telemetry to capture 'the first hop' when mirroring traffic #11093

Open · jmazzitelli opened this issue Jan 19, 2019 · 15 comments

Labels: area/extensions and telemetry, kind/enhancement, lifecycle/staleproof (indicates a PR or issue has been deemed to be immune from becoming stale and/or automatically closed)

@jmazzitelli (Member)

This was originally reported on "Discuss Istio" - I was asked to create a GitHub issue regarding this bug. See: https://discuss.istio.io/t/need-source-telemetry-to-capture-the-first-hop-when-mirroring-traffic/369

Describe the bug
I have traffic mirroring set up in bookinfo (see this yaml if you are interested) where productpage-v1 sends its requests to reviews-v1 but mirrors to reviews-v2.
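For reference, a minimal sketch of what such a mirroring configuration looks like (the host/subset values are assumptions based on the standard bookinfo sample and require a DestinationRule defining the v1/v2 subsets; the linked yaml above is the authoritative version):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
  namespace: bookinfo
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1   # "normal" requests are routed to reviews-v1
      mirror:
        host: reviews
        subset: v2       # requests are also mirrored ("dark" traffic) to reviews-v2
```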

Visualizing the resulting telemetry in Kiali looks like this:

[Kiali graph screenshot: missing-telemetry]

Notice that “first hop” in the mirrored traffic is missing - that is to say, the request going from productpage-v1 to reviews-v2 is missing. This is because there is no reporter=“source” metric. However, as that mirrored traffic flows “downstream”, there is source telemetry for the rest of the request as it flows from service to service (which is why you see edges from reviews-v2 to ratings-v1 and -v2 as well as ratings-v2 to mongodb-v1).

The Istio implementation should be changed so the full mirrored traffic (starting at that “first hop”) is represented by source telemetry. As it is now, the source telemetry has a “hole” in it as you see when the telemetry is visualized.

Side note: There is reporter=“destination” telemetry for that first hop - but that is from the point of view of the reviews-v2 workload (thus reporter=“destination”). The Kiali graph is visualizing reporter=“source” telemetry because that is the only side that provides information about client-side errors (like injected faults, network errors, etc).

Expected behavior
I expect traffic mirroring to have the same telemetry as "normal" traffic. Specifically, I expect to see source telemetry (reporter="source") for the "first hop" of the mirrored traffic.

Steps to reproduce the bug

  1. Install bookinfo demo
  2. Add the virtual service/destination rule that turns on mirroring using this yaml.
  3. Send one request from your browser to the bookinfo's productpage web page. (The result of this is that the virtual service sends the "normal" request to reviews-v1 but mirrors traffic to reviews-v2).
  4. Look in Prometheus for all timeseries metrics dealing with the mirrored traffic going to reviews-v2. You do this using the query istio_requests_total{destination_workload="reviews-v2"}. Notice there is a missing reporter="source" timeseries - there is only a reporter="destination" timeseries. You will see something like this:

istio_requests_total{connection_security_policy="none",
destination_app="reviews",
destination_principal="unknown",
destination_service="reviews.bookinfo.svc.cluster.local",
destination_service_name="reviews",
destination_service_namespace="bookinfo",
destination_version="v2",
destination_workload="reviews-v2",
destination_workload_namespace="bookinfo",
instance="172.17.0.12:42422",
job="istio-mesh",
reporter="destination",
request_protocol="http",
response_code="200",
source_app="productpage",
source_principal="unknown",
source_version="v1",
source_workload="productpage-v1",
source_workload_namespace="bookinfo"}
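A quick, illustrative check (not part of the original report) is to query for the source-reported series directly; on an affected mesh this returns nothing for the mirrored traffic, while the equivalent reporter="destination" selector returns the timeseries shown above:

```
istio_requests_total{destination_workload="reviews-v2", reporter="source"}
```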

Version
Istio 1.0.5

Installation
Using helm

@jotak (Contributor) commented Feb 1, 2019

BTW, if/when this is fixed, I think it would be even better if there were a way to discriminate mirrored traffic from normal traffic in the telemetry. The current situation leads to a somewhat inconsistent graph, as described by @jmazzitelli, but on the other hand it has the side-effect advantage of somehow illustrating, via telemetry, that some mirroring is happening, through the discrepancy between source reporting and destination reporting.

I'm not sure what that discriminator could be. A new label?
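For illustration only, such a discriminator might look like a hypothetical request_mirrored label on the metric (this label does not exist in Istio today; it is purely an assumed example):

```
istio_requests_total{reporter="source", source_workload="productpage-v1", destination_workload="reviews-v2", request_mirrored="true"}
```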

@jmazzitelli (Member, Author) commented Feb 1, 2019

> it has the side-effect advantage of somehow illustrating, via telemetry, that some mirroring is happening, through the discrepancy between source reporting and destination reporting.

But there is nothing to say this discrepancy is related to mirrored traffic - having only reporter="destination" does not necessarily mean this is mirrored traffic (I mean there is nothing that specifically says "having only reporter=destination means mirroring" - is that always true?). And note - this discrepancy exists ONLY on the first hop. The rest of the mirrored traffic today looks identical to normal traffic - there is no way to tell that reviews-v2 to ratings-v1 is mirrored traffic rather than, say, a normal request initiated by a job from reviews-v2.

So as far as I can see, even today there is no way to really know if traffic is mirrored or not - there would have to be some other way - some other attribute perhaps - to indicate this.

@jotak (Contributor) commented Feb 1, 2019

@jmazzitelli I'm saying this because I actually appreciated this "bug" while demoing traffic shadowing & visualization in Kiali. It's true that I would prefer an explicit mention of "this is mirroring" in the telemetry. But to the audience I was able to explain:
"See, I've set up shadowing. From the source's point of view there's no traffic - because it's not really aware that its requests are duplicated - but from the destination's PoV the traffic exists and is real: this is shadowing."

Also, the fact that it only affects the first hop isn't a problem: it shows that from the destination PoV, requests are real. And hence that a user has to be careful about not "reversing" the mirrored request into a production workflow with side effects, such as writing to a DB (the kind of thing explained by C. Posta here: https://blog.christianposta.com/microservices/advanced-traffic-shadowing-patterns-for-microservices-with-istio-service-mesh/ )

So even if I admit that it's mainly a coincidence, this bug is kind of helpful when we want to show shadowing.

As I wrote above, a good solution for both points of view would be to flag the requests as being mirrored.

@jotak (Contributor) commented Feb 1, 2019

PS: in Kiali, to illustrate the source versus destination PoV, we just have to double-click on a node, in which case the reporter used becomes node-centric.

@jmazzitelli (Member, Author)

OK, so it's just by coincidence that this bug is sort of helpful. I get it. I would prefer something more of a "feature" though; as you say, we need a "good solution" here rather than relying on this bug. First of all, you are still kind of guessing ... "well, I think this is mirrored because this source telemetry edge from productpage to reviews is missing." You would have to drill down in Kiali to see it (or look at the destination telemetry if you are doing this outside of Kiali). Without Kiali (just looking at raw telemetry), it would be even harder to find. And even then, I'm not sure this necessarily means mirrored traffic (are there other cases where destination telemetry exists but source telemetry is missing?).

I also think not seeing any indication that the second hop and beyond is mirrored is a problem. If I'm looking at a screen of my production mesh and I see traffic "downstream", I have no way of knowing (even from the destination point of view) whether this is dark traffic or not. E.g. looking at ratings-v1, even from a destination point of view (ratings-v1 being the destination) we can't tell. Double-clicking and drilling down into ratings-v1 in Kiali won't help in that case (because there is both source and destination telemetry here - it looks like normal traffic). And I can see people wanting to track dark traffic all the way through their mesh to see the behavior of the services for that traffic (not just on the first hop). I see the purpose of dark traffic as being able to test the performance of that traffic all the way through the services it touches - not just the first one. So you would have to be able to discern where that dark traffic is flowing.

Perhaps we should write an Istio GitHub issue for an enhancement request for this?

@jshaughn

I think that the proposal for New Virtual Service Attributes could potentially add what's needed for this enhancement. Is there work being done on that proposal? Being able to distinguish between the requested service and the resulting service request(s) would, I think, allow for a visualization.

stale bot commented Jun 25, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

The stale bot added the stale label on Jun 25, 2019.
@jshaughn

I think this is still relevant to future telemetry needs.

stale bot commented Sep 23, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

The stale bot added the stale label on Sep 23, 2019.
@jshaughn

This is still relevant to Istio 1.4 telemetry discussions.

@jmazzitelli (Member, Author)

@douglas-reid I believe this is still relevant.

@esnible (Contributor) commented Mar 2, 2020

@douglas-reid this is set for 1.5 and I rashly added the User Experience label last year. Is this still a problem with Telemetry v2 and is there a story for fixing it?

@jshaughn commented Mar 2, 2020

This did not get into 1.5, and the "requested service" info is currently listed as P1 for 1.6. I am hoping to see it added, as I don't think it's too difficult to add these fields to the set of default telemetry labels.

@douglas-reid (Contributor)

@kyessenov for comment / work

@kyessenov (Contributor)

From what I understand, the underlying issue is the lack of extension support on the client side for mirrored traffic. It's fairly complicated, since currently mirrored traffic is fire-and-forget. For peer metadata exchange to work, we have to wait for the peer to write its prologue for mirrored requests, unless we have some sort of cache.

Related to this issue is the confusing tracing data. Mirrored traffic receives unsanitized tracing headers, so the span is duplicated (but not reported twice). See envoyproxy/envoy#10257.

@esnible modified the milestone: 1.5 → Backlog (Apr 21, 2020)
@istio-policy-bot added the lifecycle/stale label (indicates a PR or issue hasn't been manipulated by an Istio team member for a while) on Sep 3, 2020
@douglas-reid added the lifecycle/staleproof label (immune from becoming stale and/or automatically closed) and removed the lifecycle/stale label on Sep 3, 2020
Projects: Status: P1