Unable to create AKS cluster due to Service Principal Not Found occurring in multiple regions #1206

Closed
jnoller opened this issue Sep 13, 2019 · 16 comments

@jnoller
Contributor

jnoller commented Sep 13, 2019

This is the tracking issue for resolution of this problem.

AKS and the larger Azure team have been investigating an issue in which creating a new AKS cluster without passing in a pre-created Service Principal (SP) may fail with a Service Principal Not Found error.

This error impacts cluster creation in all regions as well as the CLI and Azure Portal.

Azure engineering has root caused this issue to a data replication / caching issue. Teams are working on both short-term mitigation and longer term changes. This issue will be updated as the fixes are deployed globally.

Mitigation/Working around the error

Use the following workarounds:

  • Pass an existing service principal, one that has already propagated across regions, into AKS at cluster create time.
  • If using automation scripts, add a time delay (up to 90 seconds) between service principal creation and AKS cluster creation.
  • If using the Azure portal, return to the cluster settings during create and retry the validation page after a few minutes.
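
For the automation-script workaround, a fixed sleep can be replaced by polling with an upper bound. The following is a minimal sketch, not an official mitigation: `wait_for_service_principal` and the `lookup` callable are hypothetical names, and `lookup` stands in for whatever call your script already uses to fetch the SP (it is assumed to return `None` when the SP is not found). The 90-second budget is taken from the delay suggested above; the 5-second polling interval is an assumption.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_service_principal(
    lookup: Callable[[], Optional[T]],
    timeout_s: float = 90.0,    # upper bound suggested in the workaround above
    interval_s: float = 5.0,    # assumed polling interval, tune as needed
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Poll `lookup` until it returns a non-None value or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        result = lookup()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(
                f"service principal not visible after {timeout_s} seconds"
            )
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the helper can be exercised without real waiting; a script would normally leave them at their defaults and start cluster creation once the call returns.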

Please see the AKS FAQ for more information.

Issue Details

AKS creates a Service Principal (SP) on behalf of the user and then attempts to look up the newly created SP within 15 seconds (with retries). The lookup fails even though the SP was created successfully.

The failure occurs because the lookup response does not return the SP. Lookup requests are geo-load-balanced, so traffic may be directed to a different data center than the one that accepted the write request. The not found error results from increased global replication time as well as replica propagation at the storage layer.
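
The read-after-write behavior described here can be illustrated with a toy model. This is a simplified sketch, not a description of how Azure AD is actually implemented: the `Replica` class, the `simulate_lookups` function, and the round-robin router are all assumptions made for illustration. One replica accepts the write, replication has not yet happened, and reads are spread across all replicas.

```python
from itertools import cycle


class Replica:
    """One data center's copy of the directory (toy model)."""

    def __init__(self, name):
        self.name = name
        self.objects = set()


def simulate_lookups(replica_count, lookups):
    """Write to replica 0 only, then read round-robin across all replicas."""
    replicas = [Replica(f"dc{i}") for i in range(replica_count)]
    replicas[0].objects.add("new-sp")   # write accepted by a single data center
    router = cycle(replicas)            # stand-in for geo load balancing
    # Each lookup lands on the next replica; only replica 0 has the object so far.
    return ["new-sp" in next(router).objects for _ in range(lookups)]


print(simulate_lookups(replica_count=2, lookups=6))
# prints [True, False, True, False, True, False]
```

With two replicas in this model, every second lookup misses until replication catches up, which is why a single successful lookup does not prove the SP is visible everywhere.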

The error is non-destructive. Users may use the linked workarounds until the fixes are deployed.

@ahelwer

ahelwer commented Sep 23, 2019

This issue is intermittent - with a pre-created service principal you can run Test-AzResourceGroupDeployment until it stops spitting out the ServicePrincipalNotFound error, but then you run New-AzResourceGroupDeployment and it fails with ServicePrincipalNotFound. This is in line with my tests where about half of Get-AzADServicePrincipal calls return $null for a while after service principal creation. I've worked around this in my script by calling Get-AzADServicePrincipal with 10 retries before declaring nonexistence. Are similar retries implemented for pre-existing service principals with AKS template validation? You mentioned 90 seconds, is that the SLA for service principal propagation?
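
Because lookups can succeed once and then fail again, one way to tighten a retry loop like the one described above is to require several successes in a row before treating the SP as present. The sketch below is an assumption-laden illustration, not the script referred to in this comment: `is_stably_visible` and `lookup` are hypothetical names, `lookup` is assumed to return `None` on a miss, and the threshold of three consecutive successes is arbitrary. Only the limit of 10 attempts comes from the comment.

```python
import time
from typing import Callable, Optional


def is_stably_visible(
    lookup: Callable[[], Optional[object]],
    required_successes: int = 3,   # assumed threshold, not a documented value
    max_attempts: int = 10,        # mirrors the 10 retries mentioned above
    delay_s: float = 2.0,          # assumed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `lookup` succeeds `required_successes` times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        if lookup() is not None:
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # a miss suggests some replica still lacks the object
        if attempt < max_attempts - 1:
            sleep(delay_s)
    return False
```

A streak of successes lowers the chance of proceeding too early but does not guarantee that every replica has the object, so the deployment step may still need its own retry.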

@jnoller
Contributor Author

jnoller commented Sep 30, 2019

Reopening. Engineering teams have put mitigations in place for this failure in the Azure portal. Customers using the CLI or other tools are advised to continue using the other mitigations.

@katbyte

katbyte commented Dec 6, 2019

@jluk is there any timeframe for a fix? This still affects Terraform.

@ghost

ghost commented Dec 28, 2019

Is there any workaround for this issue? The first terraform apply run fails, and the next run succeeds.

@jluk
Contributor

jluk commented Jan 6, 2020

We are planning the AKS-side short-term mitigation for the CLI and other clients such as Terraform, now that the portal has been addressed. I've reached out to Tom so we can figure out how to mitigate the SP propagation latency for TF.

The long-term improvement for AAD propagation is being discussed on the Active Directory side.

@stvhwrd

stvhwrd commented Jan 22, 2020

I am no longer seeing this when creating a cluster via the Azure Portal GUI 👍

@torumakabe

torumakabe commented Mar 24, 2020

@jnoller @jluk Is Managed Identity, which recently became GA, a solution? I am not sure whether AKS cluster creation waits for or retries the AAD sync of a Managed Identity, but I hope so.

https://docs.microsoft.com/en-us/azure/aks/use-managed-identity

@jluk
Contributor

jluk commented Mar 24, 2020

@torumakabe are you still seeing this error occur? We've introduced some improvements for both the Portal and the CLI which should mitigate this problem, but I'm curious whether you're still seeing it and, if so, which clients you are using.

@ahelwer

ahelwer commented Mar 24, 2020

I am still seeing it when creating a service principal and then deploying an AKS cluster using an ARM template.

@torumakabe

@jluk I use Terraform, so I implemented a workaround (a sleep) after AAD app creation, like this: https://github.com/ToruMakabe/container-handson/blob/1342101525cfd2b8de7f357a0cf8481ee85f16f9/prep/modules/aks/main.tf#L25

If Managed Identity could solve this AAD propagation problem, I would not use SP anymore.

@oddball

oddball commented Apr 18, 2020

I worked around it by running the same command 35 times.
The first 34 times I got:

az aks create --resource-group kubernetes-cluster-group --name kubernetes-cluster-v3 --node-count 1 --generate-ssh-keys --node-vm-size Standard_D4s_v3
Finished service principal creation[##################################]  100.0000%
Operation failed with status: 'Bad Request'. Details: The credentials in ServicePrincipalProfile were invalid. Please see https://aka.ms/aks-sp-help for more details. (Details: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"unauthorized_client","error_description":"jfggk: Application with identifier 'aafe21' was not found in the directory '8f'. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You may have sent your authentication request to the wrong tenant.\r\nTrace ID: bc\r\nCorrelation ID: 6f\r\nTimestamp: 2020-04-18 10:56:07Z","error_codes":[700016],"timestamp":"2020-04-18 10:56:07Z","trace_id":"bc","correlation_id":"6e","error_uri":"https://login.microsoftonline.com/error?code=700016"})

Then the 35th time it worked. Confidence inspiring
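
Rerunning a command by hand like this can be automated with a bounded retry loop and growing delays. The following is a generic sketch rather than anything specific to the Azure CLI: `run_with_retries` is a hypothetical helper, the attempt count and delays are assumptions, and it retries on any non-zero exit code, so it would also retry failures unrelated to SP propagation.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retries(
    command: Sequence[str],
    max_attempts: int = 10,          # assumed cap on attempts
    initial_delay_s: float = 5.0,    # assumed first wait
    max_delay_s: float = 60.0,       # assumed ceiling for the wait
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run `command` until it exits with 0 or the attempts are used up."""
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay_s)  # double the wait, up to the cap
    return result  # last failed attempt; caller inspects returncode and stderr
```

The command is passed as a list of arguments, for example the `az aks create` invocation above split into its individual words. Inspecting `result.stderr` for the specific error text before retrying would be a safer refinement.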

@stale

stale bot commented Jul 20, 2020

This issue has been automatically marked as stale because it has not had activity in 90 days. It will be closed if no further activity occurs. Thank you!

@stale stale bot added the stale Stale issue label Jul 20, 2020
@ahelwer

ahelwer commented Jul 21, 2020

This is still active.

@palma21
Member

palma21 commented Jul 22, 2020

This should not be experienced on the latest client versions. Are you still seeing this?

@palma21 palma21 removed the stale Stale issue label Jul 22, 2020
@ahelwer

ahelwer commented Jul 22, 2020

Okay, I've run some tests again and it looks like this has been fixed for ARM templates.

@jluk
Contributor

jluk commented Jul 22, 2020

Closing this as I believe it is addressed, but we can revisit if the issue lingers.

@jluk jluk closed this as completed Jul 22, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Aug 22, 2020