
tracking: Equinix open source program support #151

Open · vielmetti opened this issue Nov 11, 2022 · 26 comments
@vielmetti

This is a tracking item to coordinate efforts around access to Equinix Metal resources, per an application to our open source program by @dtaht .

Some possible goals include access to the platform by the team, automation to support testing, performance characterization and tuning, NIC card support, and the like. We'll also be interested in communicating results, both internally and externally.

I'm tasked with establishing a budget for this, so that we can keep tabs on usage. This is mostly about identifying needs and figuring out the most effective way to fulfill them, rather than establishing some arbitrary dollar target.

More to follow.

@dtaht
Collaborator

dtaht commented Nov 11, 2022

There are quite a few things where this relationship can be mutually beneficial. My own view of what "normal" internet traffic "looks like" is colored by the 90's perception that it is fractal in nature:

#149

Recent research in the DC shows it to be dominated by RPC instead, and things like Homa are being designed to address it. But what does normal ISP <-> home traffic look like? How is it changing as we add more and more (mostly wifi) devices to it?

Kathie Nichols has suggested we add an inter-quartile estimator to our ebpf work: #148

There are some amazing things we can do to speed up "slow start" and web PLT, were an ISP or DC to adopt multiple IPv6 things (like fiddling with the flow), along with cake's per-host FQ + AQM + ECN capabilities: #144 BBRv1 and BBRv2 are also becoming more popular (and don't deal with slow start much as yet).
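For anyone who hasn't seen what "per-host FQ + AQM + ECN" looks like in practice, here's a minimal, hypothetical tc sketch - interface name and rate are placeholders, and this is not the LibreQoS configuration itself:

```bash
# Hypothetical example: cake as the root qdisc on one interface.
# "dual-dsthost" gives per-destination-host fairness, "nat" makes that
# work behind NAT, and cake's COBALT AQM does ECN marking by default.
sudo tc qdisc replace dev eth0 root cake \
    bandwidth 1gbit besteffort dual-dsthost nat ack-filter

# Inspect the per-tin statistics afterwards:
tc -s qdisc show dev eth0
```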

Further instrumenting the kernel to report on DROP_REASON is on my mind: #143

One of the reasons why I like bare metal is that we can drop latencies from ingress to egress down to the bare minimum. What IS the bare minimum? How can that be improved? Fq_codel/cake/fq-pie all basically implement a software analog of "cut-through switching"...

#141

Similarly, what more can we accomplish with smarter offloads? (Is there a card with JUST a way to map an LPM lookup to a CPU to interrupt?) I really was excited by the appearance of DPU cards but haven't had the budget to try one. Can we crack the 100 Gbit, 100k-ISP-customer barrier with one, or with more CPUs?

What can we do to consolidate 5G processing?

The list goes on and on. Of all of these - honestly, we're just trying to get LibreQoS v1.3 out the door, stable to six sigmas, right now. Getting set up to "go deep" on speeding up the internet in the next release is what we're talking about here. Thanks so much for your interest and some level of support!

@dtaht
Collaborator

dtaht commented Nov 11, 2022

Also, I'm a scrounger. I'd love to find a source of older "flight proven" hardware that could be repurposed by an ISP to shape and monitor traffic.

@rchac
Member

rchac commented Nov 11, 2022

My goal here is a bit simpler. :p I'm just hoping we can test the max throughput of LibreQoS. We'd need 3 servers:

  • A server running Microsoft Ethr as an endpoint for speed tests
  • A server running LibreQoS - to shape traffic
  • A server to simulate customer traffic, also running Microsoft Ethr. Here we'd use multiple instances with the IP bind option to have one server simulate dozens or hundreds of hosts (see the sketch below).

That's my big goal for Equinix testing. It would help us to find potential opportunities for optimization, and to see what its real world throughput capability is.
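A rough sketch of what that could look like with Ethr - the `-ip` bind flag, the addresses, and the counts below are assumptions for illustration, so check `ethr -h` on whatever version actually gets deployed:

```bash
# Endpoint box: run an Ethr server (listens on TCP port 8888 by default).
./ethr -s

# "Customer" box: start many bandwidth clients, each bound to a different
# local IP so the shaper sees them as separate hosts. The addresses and
# the loop count are placeholders.
for i in $(seq 1 32); do
    ./ethr -c 10.0.2.10 -ip 10.0.3.$i -t b -d 60s -n 4 &
done
wait
```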

@thebracket
Collaborator

I'm with @rchac here - let's walk before we try running. Solving any scalability issues with larger hardware and NIC throughputs than we currently have access to seems like the most important near-term goal. I don't think we've tested above 10 Gbit/s yet.

@vielmetti
Author

@rchac That sounds like a nice, constrained, doable first step in getting started.

The instance types typically have either 2x 10G, 2x 25G, or 4x 25G NIC configurations. I'd suggest that the configuration initially should run in a couple of our "small" instances, specifically the "m3.small", which you can see more about here:

https://metal.equinix.com/product/servers/m3-small/

which is an 8C/16T "Rocket Lake" Intel Xeon E-2378G with a 2x 25G NIC. That should give you enough of a testbed to set up the network in an interesting way, test above 10 Gbit/s, and get all of the tooling in place. Beyond that there are bigger machines, but if we start there it should be possible to get meaningful results.

We'll want to talk about setting up the network appropriately to support this since there are a couple of configurations there. I would also refer you to these docs about setting up layer 2 / layer 3 configurations because that may be meaningful for a test configuration.

https://metal.equinix.com/developers/docs/layer2-networking/overview/

@dtaht
Collaborator

dtaht commented Nov 11, 2022

I am very tickled to discover that the fellow joining us on the call also ran CeroWrt back in the day.

@thebracket
Collaborator

thebracket commented Nov 11, 2022

That looks workable. If I'm reading the documentation correctly, I think we'd wind up with:

  • Traffic Server - an m3-small instance with 2x 25 Gbps NICs.
    • NIC 1 would remain bonded, to provide external access.
    • NIC 2 would be removed from the bond and added to a VLAN (I'm going to call it VLAN 2).
  • Shaper Server - an m3-small instance with 4x 25 Gbps NICs (I didn't see 3 as an option).
    • NIC 1 would be un-bonded, and in the same VLAN as the Traffic Server - VLAN 2.
    • NIC 2 would be un-bonded, and in a second VLAN (I'm going to call it VLAN 3).
    • NICs 3/4 remain bonded, allowing us to access/update the machine. (If there's a better way to ensure we can get in, let me know.)
    • br0 would then bridge NIC 1 and NIC 2, providing the traffic shaping interface (a sketch follows the diagram below).
  • Traffic Client - an m3-small instance with 2x 25 Gbps NICs.
    • NIC 1 is un-bonded, and part of VLAN 3.
    • NIC 2 remains in the bond, providing management access.
graph TD;
    Management-->TrafficServer;
    Management-->Shaper;
    Management-->TrafficClient;
    TrafficServer-->Shaper;
    Shaper-->TrafficServer;
    Shaper-->TrafficClient;
    TrafficClient-->Shaper;
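If it helps, a minimal sketch of the br0 piece on the shaper, assuming iproute2 and placeholder interface names; the un-bonding and VLAN assignment themselves happen on the Equinix Metal side and aren't shown here:

```bash
# enp1s0f0 = NIC 1 (VLAN 2 side), enp1s0f1 = NIC 2 (VLAN 3 side).
# These names are placeholders - substitute whatever the OS enumerates.
sudo ip link add name br0 type bridge
sudo ip link set enp1s0f0 master br0
sudo ip link set enp1s0f1 master br0
sudo ip link set enp1s0f0 up
sudo ip link set enp1s0f1 up
sudo ip link set br0 up
```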

@dtaht
Collaborator

dtaht commented Nov 11, 2022

I know I'm thinking too far ahead, OK? MPLS and PPPoE encapsulation/decapsulation has been a problem that I don't know that others have solved. I am under the impression that modern kernels' flow dissectors can handle it. Also, the PANDA flow dissector is rumored to be faster, more complete, and aimable at VRFs and UDP encapsulations... https://netdevconf.info/0x15/session.html?Replacing-Flow-Dissector-with-PANDA-Parser And then there's the nightmare of 5G... waaay too far ahead, please ignore me....

@dtaht
Collaborator

dtaht commented Nov 11, 2022

What's the NIC? We have had good results with the Intel 40 Gbit NIC, and less good results with a 25 Gbit NIC whose name I have to go remember....

@thebracket
Collaborator

MPLS/PPPoE at least partially works in the Linux flow dissector. We'd have to extend the xdp-cpumap-tc (and the similar-but-not-quite-the-same epping) code bases so that their dissectors handle the additional frame types. Right now, I think we're OK telling people to de-encapsulate before they hit the shaper.

I assumed we'd separate out down to layer 2 without an LACP bond. In theory, I'd expect LACP to work so long as the bond interface is attached to the bridge (rather than the raw interfaces) - in practice, I think I've seen some people say it caused problems.

@vielmetti
Author

For the 4x25G NIC option for the "shaper", the recommendation would be our n3.xlarge, described at

https://metal.equinix.com/product/servers/n3-xlarge/

It has SR-IOV turned on by default on the NICs if that's of interest.

Alternatively, a 2x25G NIC "shaper" system could be an m3.small, in "pure layer 2" mode, and remote access via our serial-over-SSH console called SOS. Details at

https://metal.equinix.com/developers/docs/layer2-networking/layer2-mode/

I think that the relevant NICs here are the Intel E810 (using the "ice" driver). On our call today we'll have someone who can speak with confidence to the NIC setup on all the machines; there is some variation in the fleet.

For a traffic generator, a recommendation was for TRex at https://trex-tgn.cisco.com, which supports generating "realistic" loads. "TRex can scale up to 200Gb/sec with one server" - and our biggest config "only" has 4x 25G.
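For reference, a typical stateful TRex run looks roughly like the following; the install path and the numbers are placeholders, so check the TRex manual for whichever release gets installed:

```bash
# Replay the bundled "sfr" mixed-traffic profile at a 20x rate multiplier
# for 120 seconds across 4 cores. cap2/sfr.yaml ships with TRex.
cd /opt/trex/latest        # wherever the TRex tarball was unpacked
sudo ./t-rex-64 -f cap2/sfr.yaml -m 20 -d 120 -c 4
```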

@dtaht
Collaborator

dtaht commented Nov 11, 2022

I was thinking that encapsulating after leaving the shaper would save some ISPs another box down the line. E.g. internet router -> shaper -> encapsulator(s) is pretty common, and it adds latency and makes for poorer MTBF.

TRex is GOOD - it gives us pps at various packet sizes. Trafgen is also helpful.

Other pain points have been Steam downloads and Twitch streaming. Insight into DASH traffic behaviors. Videoconferencing. There's a new Broadband Forum benchmark... the FCC benchmarks...

And flent, always flent, of course.
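For completeness, the usual flent invocation for the RRUL test is along these lines; the hostname is a placeholder, and it needs netperf's netserver running on the target:

```bash
# Run the Realtime Response Under Load test for 60 seconds and save a plot.
flent rrul -p all_scaled -l 60 \
      -H testserver.example.net \
      -t "equinix-m3small-baseline" \
      -o rrul-equinix.png
```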

@dtaht
Collaborator

dtaht commented Nov 11, 2022

The repo for UDPST is here: https://github.com/BroadbandForum/obudpst
We are also working to standardize the protocol that UDPST uses to measure RFC 9097:
https://datatracker.ietf.org/doc/html/draft-ietf-ippm-capacity-protocol-03
and potentially many other aspects of network and application performance.

@dtaht
Collaborator

dtaht commented Nov 11, 2022

git log drivers/net/ethernet/intel/ice

The E810 is a nice card: 100 Gbit support was added recently, plus xmit_more and BQL support, skb_edit offloads, and PTP.
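A few quick, generic checks on the deployed boxes to confirm the driver and offload state; the interface name is a placeholder:

```bash
ethtool -i enp1s0f0    # driver (expect "ice" for the E810) and firmware
ethtool -k enp1s0f0    # which offloads are currently enabled
ethtool -T enp1s0f0    # hardware timestamping / PTP capabilities

# And, in a kernel source tree, the history check from above:
git log --oneline -20 -- drivers/net/ethernet/intel/ice
```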

@rchac
Member

rchac commented Nov 11, 2022

> For a traffic generator, a recommendation was for TRex at https://trex-tgn.cisco.com which will support generating "realistic" loads. "TRex can scale up to 200Gb/sec with one server" - and our biggest config "only" has 4x25G.

This is awesome! Way better solution.

@dtaht
Collaborator

dtaht commented Nov 11, 2022

CGNAT is also not working out for a lot of people....

@vielmetti
Author

A summary of sorts of our 2:00p Eastern call today.

We talked architecture, setup, etc., and seem to have settled on a configuration of 3 m3.small servers in a single metro. These are readily available in nearly all of our metros, and you can check the capacity dashboard at https://metal.equinix.com/developers/capacity-dashboard/ to double-check your options before setting up.

The web console does not tell you precisely which data center you are in (just the metro), but our API does allow picking a facility. See https://metal.equinix.com/developers/docs/locations/facilities/ for the facility information, including facility codes. Based on geography and availability, our Dallas (DA) metro might be the best choice.
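For anyone scripting this, here's a sketch of pinning a deploy to a metro with the Equinix Metal CLI. Flag names and slugs are as I understand them, so verify with `metal device create --help`; the project ID and hostname are placeholders:

```bash
metal device create \
  --project-id "$METAL_PROJECT_ID" \
  --hostname shaper-test-1 \
  --plan m3.small.x86 \
  --metro da \
  --operating-system ubuntu_22_04
```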

Equinix brand guidelines and brand assets are at https://brand.equinix.com - when we get to a point of putting some acknowledgement somewhere, please follow that.

I know you have a release coming up so I don't expect anything like a deep engagement with this until that's out the door.

As I understand it, @rchac will be lead on anything related to the Metal account, which I confirmed has been set up under the name "LibreQoE". Just after the 1st of the month you'll get an invoice; the amount will reflect usage, but the total will be $0 with credits.

I'll be a primary point of contact, and you should always feel free to open a ticket with support if the docs don't answer the questions. If we need to peek behind the curtain to figure anything out, we can escalate. You have my Calendly if you want to schedule another call, and I'll pull in someone from engineering or developer relations to support.

@vielmetti
Author

Checking in here - I would love to hear from @dtaht et al. any and all initial impressions, any glitches or gotchas you saw, and any comments (good and bad) about the documentation, either what exists now or what you'd like to see.

I saw a mention in another issue about SR-IOV; we do have systems with that feature turned on in the NICs, so when you are ready to spin up a project specific to that support, let me know. That machine is the 4x 25G NIC n3.xlarge: https://deploy.equinix.com/product/servers/n3-xlarge/

@dtaht
Collaborator

dtaht commented Nov 16, 2022

I spun up a server quickly (in Amsterdam, to test the throughput; why Amsterdam? 'cause it was 4 AM there). Very impressed with the management console, and very impressed by the throughput and latency to my Linode cluster in Germany. I should have spun up two Metal boxes in Amsterdam (or some other underused DC at 2-4 in the morning). But this was very good compared to Linode-to-Linode communications:

[plot: rrul_-_first-equinix-test-cake-32]

I wasn't sure how to clone that as-built box... and we've been very busy buttoning down libreqos.io v1.3.

I figure most of the jitter above is coming from the Linode VM. I'm also puzzled at 1 Gbit per flow. Many thoughts, not a lot of time.

After thinking about it this way: if you could identify your least-used, cheapest-on-power datacenter, anywhere in the world, we needn't use Dallas for the 3-piece setup we plan to do.

@dtaht
Collaborator

dtaht commented Nov 23, 2022

I did a very quick test between two servers in Dallas. Now, there can be MANY reasons for the jitter and latency experienced in the OS itself (see the third panel), but I am totally open to finding the most idle DC you have and leveraging that.

[image]

@dtaht
Collaborator

dtaht commented Nov 23, 2022

rsync -av lqos.taht.net::p2p-tests .

I did not do anything special to build these machines - I just took the defaults and didn't know how to get them onto the same switch - and the results from run to run are so wildly variable that I have no idea what's going on underneath.

@interduo
Collaborator

This looks very nice... If we want to make the tests more efficient, we could slow down the CPU on the shaper, or just use the slowest Equinix instance for the shaper.

That would make it easier to generate a big enough (complicated/varied) amount of traffic.
@dtaht did you hit the bottleneck, or just get close to the interface limit?
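One low-effort way to slow the shaper down, assuming the cpupower utility (from linux-tools) is available; the frequency value is just an illustrative placeholder:

```bash
# Cap the maximum CPU frequency so the shaper becomes the bottleneck sooner;
# revert by setting the max back to the hardware limit.
sudo cpupower frequency-set -u 1.2GHz

# Check what actually took effect:
cpupower frequency-info
```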

@dtaht
Collaborator

dtaht commented Jan 16, 2023

I just wanted to thank Equinix here for making the development of lqosd and the Bifrost bridge (which is about 30% more efficient than our prior code) possible. Work continues... see https://payne.taht.net for a demo. I'm planning a blog piece soon, in part talking about the Equinix hardware we used and the support we have had for the v1.4 release. There's also a podcast we are on in late January.

@vielmetti
Author

Thanks @dtaht - always happy to hear progress, and to read/review and share internally for comments any interesting and useful results you get.

@dtaht
Collaborator

dtaht commented Feb 24, 2023

@vielmetti It appears our grant will expire fairly soon. In the interval since we've talked, we've given y'all credit on the website, in a couple of blogs, and on our recent podcast on Packet Pushers, here: https://packetpushers.net/podcast/heavy-networking-666-improving-quality-of-experience-with-libreqos/

We have a pretty good demo of the v1.4 release here: https://payne.taht.net/ emulating various ISP plans.

So - I dropped you an email a week or so back, and we are very appreciative of all your help, but if we cannot renew we will have to start planning to move off of Equinix soon. We absolutely would not have gotten this far with LibreQoS without you!

@vielmetti
Author

@dtaht As noted today in email we'll be renewing the grant - I will let our team know about the Packet Pushers post!
