Microsoft Azure silently install management agents with vulns on your Linux VMs (twitter.com/gossithedog)
299 points by sidcool 4 months ago | 69 comments

This backdoor is present on all VM providers. Qemu has one as well. I wrote about it three years ago: https://raymii.org/s/blog/Linux_on_Microsoft_Azure_Disable_t...

The issue is that the unfixed Azure agent allows remote unauthenticated root access, if you send it a request without an Authorization: header. It is supposed to verify that requests come from Azure’s management infrastructure, but the check is broken.

I recently tried an Oracle Cloud VM following an HN discussion[1]: an AMD 2-core/1GB instance with Ubuntu 20.04. Nearly 50% of memory was already occupied by desktop packages (because the image is not the server version, which isn't available) and, more importantly, by Oracle-cloud-agent, installed through snap by default.

I don't remember management agents being pre-installed by other cloud providers I've used over the past several years. I've not used Azure instances much, only their serverless services, and it seems like Azure and Oracle are in the same boat as far as management services baked into the images by default (of course, vulnerabilities have not surfaced for the latter ...yet).

But Oracle seems to be supporting Bring Your Own Image (BYOI), which I assume those who are serious about security are already using.

[1] https://news.ycombinator.com/item?id=28497181#28501062

This would be a failed assignment in an entry level uni programming course. How on earth does a Microsoft developer accomplish such a thing as bringing an incomplete and unsafe program to production deployment?

What, do you think that just because the mistake is easy to understand and identify, that experienced developers won’t make it?

Experienced, smart, and savvy programmers will make all sorts of mistakes in their code, even stupid mistakes. It takes an immense amount of effort and savvy to reduce, mitigate, and recover from bugs in production code. Meanwhile, management applies pressure to ship new features. The kinds of teams and people that manage to ship high-quality code on a regular basis are teams with a lot of different types of people on them—you need someone who can make the case to management that these efforts need resources and are in the best interests of the company, you need someone to drive team culture and figure out what kind of practices will reduce bugs, you need people who work on automated tooling, you need people to run disaster scenarios, you need technical coders who can make frameworks that are easy to use but hard to misuse.

Nobody I met came with any of those skills out of college.

Multiply the difficulty when you’re working with distributed systems, like this one.

> Nobody I met came with any of those skills out of college.

Exactly. Experience is key.

> Meanwhile, management applies pressure to ship new features.

This quantity over quality mindset, combined with the industry's rampant ageism and veneration of newness over all else, is making things worse at an alarming rate.

Yes, experienced smart developers can make dumb mistakes, but this is pretty dumb. More importantly, it's the type of thing that should have been caught in a pull request or with a test, if not immediately after writing the code in question. Their process is severely lacking if a basic auth bug of this nature got through to production.

> Experienced, smart, and savvy programmers will make all sorts of mistakes in their code, even stupid mistakes

This is the reason I've been uninstalling the agent on each Azure VM for 5 years: you can't make mistakes in code you don't have, at the cost of losing integration with the dashboard.

> This would be a failed assignment in an entry level uni programming course

This wouldn't be an assignment in an entry level uni course.

"Write an agent running on a machine capable of providing remote command execution, authentication and that must report OS metrics externally". Next week's lab : "Recursion".

This wouldn’t be any type of assignment in an entry level uni programming course. Entry level uni programming courses are things like “implement a noughts and crosses game in Java”, not “implement a bespoke authentication workflow for a large-scale public cloud provider”.

It's called quality assurance. Or the lack of it. In a normal development process you have design and testing, all accompanied by reviews. But then comes tailoring, and you skip some testing (it uses code from a previously tested project) and maybe some reviews. At the end of the project you should have some lessons learned, but they are there just to please management. Add to this the fact that the people who will have to fix the mess are never the same ones who did the original development, and you have the picture. And unfortunately this is not MS specific.

Lack of negative unit or integration tests.

This is the correct reason. Add security audit (internal/external) as well. Don't blame the developers. Blame the process. There will be days on which even the world's smartest person makes mistakes.

I agree auditing would be prudent, but frankly the developer is the first line of defence and gets a share of my blame.

> Simply remove the auth header and you are root. remotely. on all machines. Is this really 2021?

Is that accurate? Is this some kind of joke?

One would assume that absence of credentials would necessarily = auth failure.

Like, the basic flow would check the validity and, implicitly, the presence of the auth header. To bypass auth in the case of the absence of the header itself would need to be an explicit conditional. IF no header, then authenticated. Right? That’s crazy.

I suppose I could look at the code.

On the other hand MS enforced strict auth policies to access their Office APIs in a ridiculous fashion. When I needed to register my applications at MS, I just dropped integration into their services and I never looked back.

That's the kind of thing where a unit test would be useful and easy..

I'd probably forget to write it... but it would be useful and easy.
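For what it's worth, here is roughly what such a negative test could look like. This is only a sketch: `handle_request` and `VALID_TOKEN` are made-up names for illustration and have nothing to do with the actual OMI code, which I haven't read.

```python
import unittest

# Hypothetical token and handler, invented for this example (not OMI's code).
VALID_TOKEN = "example-token"


def handle_request(headers):
    """Return an HTTP-style status code for a request's headers."""
    token = headers.get("Authorization")
    if token is None:
        return 401  # no credentials at all: reject
    if token != VALID_TOKEN:
        return 403  # credentials present but wrong: reject
    return 200


class NegativeAuthTests(unittest.TestCase):
    def test_missing_header_is_rejected(self):
        # The case that matters here: no Authorization header at all.
        self.assertEqual(handle_request({}), 401)

    def test_wrong_token_is_rejected(self):
        self.assertEqual(handle_request({"Authorization": "nope"}), 403)

    def test_valid_token_is_accepted(self):
        self.assertEqual(handle_request({"Authorization": VALID_TOKEN}), 200)
```

Saved as a module, this runs with `python -m unittest`. The first test is the one that would have failed against a handler that treats a missing header as authenticated.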

The basic firewall (Network Security Groups) blocks network access by default. So you have to grant the attacker access to the port and IP.

Does it block access within the same group by default for the lateral motion case? That would definitely help somewhat, although it's certainly too common for people to have allow-all rules for internal traffic.

has no one replied that any VM that handles HTTP(s) traffic MUST open ports to start functioning, and is therefore fully vulnerable? what am I missing here

Opening http(s) ports != opening all ports, or even the ones that the management services run on

Open ports to webservers like Apache, nginx etc. aren't affected by this issue.

has to be -- I hope

That’s still better than MS SCVMM which installs an agent which craps a brick, takes out your VM network and then there’s zero support other than an abandoned GitHub repo.

They could have supported cloud-init but went for the usual not invented here approach which was a shit show.

If you want first class Linux support look elsewhere. Anywhere else!!!

Been using cloud-init on Azure Linux VMs for a long time (5+ yrs). Did you experience this on a specific distro?

This was SCVMM on on-premises Hyper-V infrastructure, not Azure.

This announcement seems badly timed given the OMIGOD vulnerability:

"Microsoft announces passwordless future – available across Microsoft Edge and Microsoft 365 apps": https://blogs.windows.com/windowsexperience/2021/09/15/micro...

I just started using Azure cause I occasionally need a remote Windows desktop, and it's insane how complicated it is. There's so much infrastructure it expects you to micromanage, so many moving parts. What's it all doing, how much does it cost? Not very clear.

I'm sure it's fine for someone who does IT work for a medium/large business, but for an independent user the UX just plain sucks compared to any number of VPS providers.

I don't know your use case, but if you need access to a Windows Desktop you can download some ready made VMs from Microsoft


I'm a complete dork in terms of cloud infrastructure and hosting, so I'm heavily drawn to providers with great UX/DX, documentation, simple interfaces, and predictable cost such as Linode, Cloudflare, railway.app, Vercel to name a few examples.

Whenever I have to work with Azure I get uneasy... (SAML...)

This is not great. In terms of ethical concerns I would rather use Azure than AWS for example.

>Is this really 2021?

Spoiler: we'll still be finding this stuff in 2040

Hello fellow time traveler

Issue is not the existence of the management agent, but the fact that it seems it’s an abysmal codebase.

It doesn't seem that abysmal, more like not maintained by enough people.

That’s most of their open source stuff. And probably most of their closed source too.

These background agents are needed for various VM recovery scenarios. It's not a silent install. Very much needed.

The vuln is that API calls with no auth headers run as root.

They're not mandatory; we don't use these agents, and instead consider every VM to be replaceable.

Is that an official statement?

Are they optional? As far as I understand, AWS doesn't do the same.


By default, SSM Agent is preinstalled on instances created from the following Amazon Machine Images (AMIs):

Amazon Linux

Amazon Linux 2

Amazon Linux 2 ECS-Optimized Base AMIs

macOS 10.14.x (Mojave) and 10.15.x (Catalina)

Ubuntu Server 16.04, 18.04, and 20.04

Windows Server 2008-2012 R2 AMIs published in November 2016 or later

Windows Server 2016 and 2019

But the AWS SSM agent doesn't listen on the network [0]. The connection is initiated by the agent towards the cloud API, so any commands that come in aren't new connections established over a possibly insecure network.

Of course, if the agent's verification of who it's talking to is as good as in the case of Azure, all bets are off.


[0] I've just checked this on an Ubuntu EC2 instance. The SSM agent is running, but it doesn't listen on any interface. No custom configuration was done to it.
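If anyone wants to repeat a check like [0] on their own box, one possible approach is to parse `/proc/net/tcp` and list the sockets in the LISTEN state. A minimal sketch, assuming a Linux host with a little-endian CPU and IPv4 only (IPv6 sockets live in `/proc/net/tcp6` and are not handled); `listening_sockets` and `SAMPLE` are names made up for the example.

```python
import socket

LISTEN_STATE = "0A"  # hex state code for TCP_LISTEN in /proc/net/tcp

# Made-up sample in the /proc/net/tcp layout, so the sketch runs anywhere.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 20000 1
   1: 0100007F:0277 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 20001 1
   2: 0A00000A:0016 0B00000A:C350 01 00000000:00000000 00:00000000 00000000     0        0 20002 1
"""


def listening_sockets(proc_net_tcp_text):
    """Return (ip, port) pairs for IPv4 sockets in the LISTEN state."""
    results = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[3] != LISTEN_STATE:
            continue
        hex_ip, hex_port = fields[1].split(":")
        # Assumption: little-endian host, so the address bytes are reversed.
        ip = socket.inet_ntoa(bytes.fromhex(hex_ip)[::-1])
        results.append((ip, int(hex_port, 16)))
    return results


print(listening_sockets(SAMPLE))
```

On a real machine you would pass in the contents of `/proc/net/tcp` instead of `SAMPLE`. An agent that only makes outbound connections should not show up in the result at all.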

Amazon does do the same from what I understand, their official AMIs contain a management agent - I don't believe it's required though.

It's not, and by default it's not allowed to talk to Systems Manager. You have to explicitly allow this through an instance role.

You do lose some functionality, though.

Oracle on OCI does the same. You can perform some administrative tasks directly from a web panel for instance.

Earlier discussion still on front page as I type this: https://news.ycombinator.com/item?id=28532531

If your threat model is your cloud provider, aren't you already screwed since they already have physical access to your VMs?

Correct. Sadly, the issue is that the cloud provider installed an agent that only blocks a request if it carries an authentication header with incorrect credentials.

If you remove the authentication header, that check never fires, and it considers you authenticated. Then it proceeds to let you run any command.

Now the point is, anyone who can send you messages can strip the authentication header, so anyone who can send you messages can execute arbitrary commands.
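To make the pattern concrete, here is a minimal sketch of the fail-open logic described above next to a fail-closed version. It is illustrative only: the function names and `EXPECTED_TOKEN` are hypothetical, and this is not taken from the agent's source.

```python
import hmac

# Hypothetical shared secret, for illustration only.
EXPECTED_TOKEN = "s3cret-token"


def is_authorized_fail_open(headers):
    """Buggy pattern: only rejects when a header is present AND wrong."""
    token = headers.get("Authorization")
    if token is not None and not hmac.compare_digest(token, EXPECTED_TOKEN):
        return False
    # A request with no header never enters the branch above,
    # so it falls through and is treated as authenticated.
    return True


def is_authorized_fail_closed(headers):
    """Safer pattern: anything that is not a valid token is rejected."""
    token = headers.get("Authorization")
    if token is None:
        return False
    return hmac.compare_digest(token, EXPECTED_TOKEN)


print(is_authorized_fail_open({}))    # missing header accepted
print(is_authorized_fail_closed({}))  # missing header rejected
```

The two functions agree on every request that carries a header. They only differ when the header is absent, which is exactly the case an attacker controls.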

I don't think that's the threat model here - this is more a lateral movement vector for an attacker that's able to get inside your service network perimeter.

The threat is not the cloud provider and this has nothing to do with physical access.

The software vulnerability can be exploited by anyone with network access to the machine.

The threat model isn't the cloud provider, it's anyone who can spoof the IP of your cloud provider's metadata service CIDR block. These tend to be link-local IPs, so it's common for the cloud boundary firewall itself to block anything incoming from that block, so the attacker would need to already be in the perimeter, but it isn't exactly hard to get inside the data center just by being another tenant. This is one reason why it's common to block at the host level any packets with a src IP in the metadata service's CIDR block, just in case. You give up certain forms of remote management capabilities, but that is often worth it not to open up back doors developers are rarely even aware of.
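As a rough illustration of the matching rule (not of how you would enforce it, which would normally be a host firewall rule), here is a sketch using Python's `ipaddress` module. It assumes the metadata service sits in the IPv4 link-local range 169.254.0.0/16; `BLOCKED_SOURCES` and `should_drop` are hypothetical names.

```python
import ipaddress

# Assumption: the metadata service lives in the IPv4 link-local range.
BLOCKED_SOURCES = [ipaddress.ip_network("169.254.0.0/16")]


def should_drop(src_ip):
    """Return True if a packet's source address falls in a blocked range."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net for net in BLOCKED_SOURCES)


print(should_drop("169.254.169.254"))  # True
print(should_drop("10.0.0.5"))         # False
```

The trade-off is the one mentioned above: dropping this traffic also drops whatever legitimate management calls the provider sends from that range.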

I’ve worked for and consulted at several security places that have built “agents”. These tools are typically C, C++ or Objective-C so that they have the smallest footprint possible for the platform they’re deployed on. Some are for scanning and reporting binaries for viruses, worms etc. some have “remediation” features that let the agent execute commands sent to it remotely. Most of them can be updated in place remotely.

Most of these tools are janky, poorly tested and I’m sure contain dozens of vulnerabilities of their own.

Edit: also, it’s invariably one dude with poor hygiene working on the agent at these companies. He’s usually at the back, in the closet. Most of the engineering work is on the backend and reporting, so the agent gets no peer review or formal security review.

Most (if not all) major cloud providers do this if you use their provided images. If you want to avoid agents, then go with a provider that allows you to upload your own ISO which means you can install a vanilla OS.

And use disk encryption. I've encountered a situation where the hoster mounted the root partition and wrote their scripts inside after OS installation. Fortunately it broke the network, so I noticed it. Very sneaky and unexpected.

There is an important qualification, however: the AWS SSM agent does not start a network server at all (it connects to their API endpoint) and it doesn't do anything if you don't add the appropriate permissions to the instance's role.

The GCP agent similarly does not listen to the network.

Both the Google and AWS agents are written in Go, too, so they're unlikely to have the classic C/C++ errors, more likely to use libraries rather than reinventing the wheel, and using a higher-level language often makes the logic easier to understand. Neither of those are foolproof or prevent logic errors, of course, but I would still expect a lower bug density all other things being equal.

> more likely to use libraries rather than reinventing the wheel

I love programming in go, but I disagree with this point. The golang library ecosystem is absolutely less mature compared to C++.

Rust and Go are a pleasure to write in, but they don’t magically fix every problem and frequently CREATE problems because they’re still under development. In this case, the missing auth header vuln has nothing to do with the underlying language.

Yeah, it's definitely not a simple good/bad decision. My thought is that a Go library would be more likely to have implemented something like a mandatory auth check, but the counterpoint is that if such a library were vulnerable, it would potentially affect a very large number of services.

The problem with C++, is that while std::{array, string, vector} exist, and there are compiler options that make operator[]() behave just like at(), there are still lots of people that will happily use char * instead.

> so they're unlikely to have the classic C/C++ errors

Two things here :

- This vulnerability is a logic error (no header => root), not a buffer overflow. It could have happened in any language.

- C and C++ don't play in the same band anymore. Most security vulnerabilities affecting C generally do not affect C++ (no stack based string handling, no VLA, no void* everywhere, proper RAII, proper type safety)

And yes, for developing a minimum-memory-footprint system daemon in 2021, I would use C++ or Rust but definitely not C.

Yes, note that I didn’t say this was a classic C vulnerability but that I’d expect to see more bugs of that class all other things being equal. C++ has been getting better but that doesn’t automatically remediate all of the code in the world and retrain every developer.

OMI may listen on the network (depending on which Azure feature is configuring it), but you will find that the most common Azure feature pushing OMI, Log Analytics, does not configure it to listen on the network.

Yeah, the discussion made it clear that this is a configurable problem. I think the puzzling part for me is that a new network service got so little review – after Microsoft’s decades trying to recover from bad calls in the 90s I’d have expected that to trigger more review.

You can avoid unwanted agents by avoiding cloud providers.

Bare metal ig.

On some azure machines at work, I removed the OMI updater crontab entry for root, killed running procs and deleted everything in /opt/omi

I hope that should do it...

You may have just borked your VMs. OMI is installed because it is needed for certain feature integrations (most notably Log Analytics/Azure Monitor).

It's worth mentioning that many of the features which deploy OMI do so in a way that does not set up a listening port. Therefore, the impact of the vulnerability is pretty low. Indeed, I just spun up an Ubuntu VM and while it got a vulnerable version of OMI, it doesn't have an open port.

The LA table VMBoundPort will let you assess if you've got one configured for listening on the OMI ports. I think a lot of the people in this thread, and possibly the people writing the articles, aren't Azure SMEs, hence all the panic and grumbling and eye rolling.

I get what you're saying, but... they're still un-borked and survived a reboot.

Also I don't care about the monitoring and analytics, so there's that

Embrace, extend, and extinguish all over again, only now Apple, Google and Amazon seem to do it to this scale too.
