
v2dns makes SRV records unusable #21325

Closed
setaou opened this issue Jun 13, 2024 · 9 comments

Comments

@setaou

setaou commented Jun 13, 2024

Overview of the Issue

Since Consul 1.19 (with v2dns enabled by default), SRV requests return the same hostname for every allocation. If allocations use different ports, there is no way to tell which IP corresponds to which port, rendering DNS SRV queries useless.


Reproduction Steps

Given a service "livetiler" tagged "prod" in the datacenter "paris" with 5 allocations on different hosts, here is the result of an SRV request on Consul 1.19.0:

# dig _livetiler._prod.service.paris.consul @127.0.0.1 -p8600 SRV +short
1 1 31072 livetiler.service.paris.consul.
1 1 21978 livetiler.service.paris.consul.
1 1 25530 livetiler.service.paris.consul.
1 1 25915 livetiler.service.paris.consul.
1 1 25013 livetiler.service.paris.consul.

Obviously, the hosts can be fetched using an A query on livetiler.service.paris.consul., but it is impossible to know which port corresponds to which host.

By contrast, on Consul 1.18 (or 1.19.0 with the v1dns option), an SRV request returns a different name for each allocation:

# dig _livetiler._prod.service.paris.consul @127.0.0.1 -p8600 SRV +short
1 1 31072 c0a80226.addr.paris.consul.
1 1 25013 c0a80269.addr.paris.consul.
1 1 25530 c0a8026e.addr.paris.consul.
1 1 21978 c0a80274.addr.paris.consul.
1 1 25915 c0a80212.addr.paris.consul.
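To make the ambiguity concrete, here is a minimal Python sketch (not Consul code; the names and IPs are invented for illustration). Reconstructing a port-to-IP mapping from SRV answers only works when each SRV target resolves to exactly one address, which is what the v1dns per-allocation names guaranteed:

```python
# Sketch: why unique SRV targets matter when allocations use different ports.
# Each SRV answer is (priority, weight, port, target); 'a_records' maps a
# target name to the list of IPs its A query returns.

def port_to_ip(srv_answers, a_records):
    """Return {port: ip} if every target resolves to exactly one IP, else None."""
    mapping = {}
    for _prio, _weight, port, target in srv_answers:
        ips = a_records[target]
        if len(ips) != 1:
            return None  # ambiguous: cannot tell which IP serves this port
        mapping[port] = ips[0]
    return mapping

# v1dns-style answers: one synthetic name per allocation -> unambiguous.
v1 = port_to_ip(
    [(1, 1, 31072, "c0a80226.addr"), (1, 1, 25013, "c0a80269.addr")],
    {"c0a80226.addr": ["192.168.2.38"], "c0a80269.addr": ["192.168.2.105"]},
)

# v2dns-style answers: the same target for every port -> the mapping is lost.
v2 = port_to_ip(
    [(1, 1, 31072, "livetiler.service"), (1, 1, 25013, "livetiler.service")],
    {"livetiler.service": ["192.168.2.38", "192.168.2.105"]},
)
```

With the v1dns-style data the mapping is fully recoverable; with the v2dns-style data it cannot be reconstructed at all.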

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 4
	services = 4
build:
	prerelease = 
	revision = bf0166d8
	version = 1.19.0
	version_metadata = 
consul:
	acl = disabled
	known_servers = 5
	server = false
runtime:
	arch = amd64
	cpu_count = 24
	goroutines = 198
	max_procs = 24
	os = linux
	version = go1.22.4
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 97
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 31794
	members = 79
	query_queue = 0
	query_time = 1

{"bind_addr":"192.168.2.80","data_dir":"/opt/consul","datacenter":"paris","enable_script_checks":true,"limits":{"http_max_conns_per_client":2048},"retry_join":["aa.paris.gnet","bb.paris.gnet","cc.paris.gnet","dd.paris.gnet","ee.paris.gnet"]}
Server info
agent:
	check_monitors = 2
	check_ttls = 0
	checks = 12
	services = 12
build:
	prerelease = 
	revision = bf0166d8
	version = 1.19.0
	version_metadata = 
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 192.168.2.28:8300
	server = true
raft:
	applied_index = 54373494
	commit_index = 54373494
	fsm_pending = 0
	last_contact = 16.055954ms
	last_log_index = 54373494
	last_log_term = 679
	last_snapshot_index = 54361357
	last_snapshot_term = 677
	latest_configuration = [{Suffrage:Voter ID:a2b3aa47-fc91-9d78-924a-38158b01ce40 Address:192.168.2.21:8300} {Suffrage:Voter ID:d34285d8-b411-ab76-7cb5-c69bc16c24ad Address:192.168.2.28:8300} {Suffrage:Voter ID:9b2f6214-c833-5e75-0d65-8fa0158aea3e Address:192.168.2.158:8300} {Suffrage:Voter ID:63a5960e-d185-bdf2-04c9-338c7f15e4dc Address:192.168.2.55:8300} {Suffrage:Voter ID:fdd2392d-7dab-2e51-02f5-661e1f800e28 Address:192.168.2.24:8300}]
	latest_configuration_index = 0
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 679
runtime:
	arch = amd64
	cpu_count = 48
	goroutines = 1267
	max_procs = 48
	os = linux
	version = go1.22.4
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 97
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 31794
	members = 79
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4272
	members = 5
	query_queue = 0
	query_time = 1

{"bind_addr":"192.168.2.21","bootstrap_expect":3,"client_addr":"0.0.0.0","data_dir":"/opt/consul","datacenter":"paris","enable_script_checks":true,"server":true,"ui":true}

Operating system and Environment details

Ubuntu Linux 22.04 LTS, amd64

Log Fragments

n/a

@DanStough
Member

Hey @setaou, thank you for the report. It definitely looks like a regression. I'm looking into it now and will report back with a fix or any additional information.

For anyone else trying out 1.19 in the interim who is having problems with SRV records or anything DNS-related: you can set experiments: [ "v1dns" ] in the agent config to restore the old behavior.
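In a JSON agent configuration that workaround looks like the following (a sketch; the surrounding fields are placeholders, only the experiments key is the workaround itself):

```json
{
  "datacenter": "paris",
  "data_dir": "/opt/consul",
  "experiments": ["v1dns"]
}
```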

@maxadamo

maxadamo commented Jun 14, 2024

@DanStough in case it wasn't clear from the description of the issue, the problem is not only that services share the same name: whatever string you prepend to the record, it will still resolve.
For instance, in this example we have _livetiler._prod.service.paris.consul, but if we try to resolve blablahblah._livetiler._prod.service.paris.consul it also works and resolves to the same addresses.
With version 1.18 I was getting NXDOMAIN.
This breaks tag usage.

dig hey.there.how.is.it.going.consul.service.ha.geant.net @127.0.0.1 -p8600 -t SRV +short

1 1 8300 test-consul02.node.test-geant.ha.geant.net.
1 1 8300 test-consul03.node.test-geant.ha.geant.net.
1 1 8300 test-consul01.node.test-geant.ha.geant.net.

@faryon93

faryon93 commented Jun 14, 2024

The title of the issue does not tell the whole story. When issuing a standard A lookup against Consul DNS, the tag is ignored and all registered instances are returned.

Since #21336 was closed as a duplicate of this, this issue should clearly indicate that it is not an SRV-record-only problem: filtering by tag via Consul DNS is broken in general. When I first noticed problems, I did not consider looking into this issue because it did not resemble my observations. Making this a little clearer in the title would help other users find the workaround faster than I did.

I can confirm that turning v1dns back on makes standard A record lookups work as expected.

@maxadamo

@faryon93 I am going to re-open.

@setaou
Author

setaou commented Jun 17, 2024

> @DanStough in case it wasn't clear from the description of the issue, the problem is not only about services having the same name, but whichever string you prepend to the record, it will always be resolved. For instance in this example we have _livetiler._prod.service.paris.consul, but if we try to resolve blablahblah._livetiler._prod.service.paris.consul it will also work and it resolves to the same addresses. With version 1.18 I was getting NXDOMAIN. This is breaking tags usage.

Actually, querying SRV or A for blablahblah._livetiler._prod.service.paris.consul or _livetiler._blahblahblah.service.paris.consul (RFC 2782 lookup) works correctly by returning NXDOMAIN, but blablahblah.livetiler.service.paris.consul (standard lookup) does not: it returns the full unfiltered list. However, this is a different issue from the one I reported here, even though it is also related to v2dns.

@DanStough
Member

Thanks all for the details. I've reproduced both the tag behavior (#21336) and the SRV results (this issue). I should have a PR up today or tomorrow with a fix.

@mpilone

mpilone commented Jun 22, 2024

FWIW, we also found that this change appears to break/confuse HAProxy DNS service discovery. This older page describes the setup, but the most relevant part is:

> Using DNS A records gives you each server's IP address, but you must hardcode the port. You can also configure HAProxy to query for DNS SRV records in order to set the port in addition to the IP address. An SRV record returns a hostname and port number.
> First, you'll need to change the A records so that instead of having a hostname, myservice.example.local, resolve to multiple IP addresses, each A record will have a different hostname, such as host1, host2, and host3. Then, add the same number of SRV records and configure them to resolve a service name, such as _myservice._tcp.example.local, to the hosts you defined in your A records.

It looks like HAProxy constantly detects different IPs for the same service, because the SRV records all return the same hostname, which in turn resolves to different IPs. That causes it to constantly flap between backend servers. For a Consul service "dashboard" we see a lot of messages in our HAProxy logs like:

Jun 14 09:01:08 haproxy13 haproxy[1275181]: [WARNING] 165/090108 (1275181) : dashboard_DEV/dashboard1 changed its IP from 192.168.166.178 to 192.168.166.176 by DNS additional record.
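For reference, an HAProxy backend that consumes Consul SRV records via server-template typically looks something like this (a sketch only; the resolver address, backend name, and slot count are assumptions, not taken from this issue):

```
resolvers consul
    nameserver consul 127.0.0.1:8600
    accepted_payload_size 8192

backend dashboard_DEV
    # server-template expands SRV answers into backend server slots.
    # With v2dns every SRV target is the same name, so HAProxy keeps
    # reshuffling which resolved IP belongs to which slot, producing
    # the "changed its IP ... by DNS additional record" flapping above.
    server-template dashboard 5 _dashboard._tcp.service.consul resolvers consul resolve-prefer ipv4 check
```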

@DanStough
Member

This should be resolved with the linked PR. We're discussing putting out 1.19.1 sooner than expected, so be on the lookout for that release.

@havedill

havedill commented Jul 1, 2024

I should note that this version breaks HAProxy Enterprise dynamic backend configurations with Consul, so I would consider it high priority to release the patch before bigger firms start having issues.

Fortunately my issues were on my dev cluster, and your configuration change seemed to fix my problems.
