Unable to create AKS cluster due to Service Principal Not Found occurring in multiple regions #1206

jnoller · 2019-09-13T21:06:10Z

Tracking issue for the resolution of this error

AKS and the larger Azure team have been investigating an issue in which creating a new AKS cluster without passing in a pre-created Service Principal (SP) may fail with a Service Principal Not Found error.

This error affects cluster creation in all regions, through both the CLI and the Azure Portal.

Azure engineering has traced the root cause of this issue to a data replication / caching problem. Teams are working on both short-term mitigation and longer-term changes. This issue will be updated as the fixes are deployed globally.

Mitigation/Working around the error

Use the following workarounds:

Pass AKS an existing service principal, one that has already propagated across regions, at cluster creation time.
If using automation scripts, add a time delay (up to 90 seconds) between service principal creation and AKS cluster creation.
If using the Azure portal, return to the cluster settings during creation and retry the validation page after a few minutes.
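
For automation scripts, the delay in the second workaround can also be written as a bounded poll rather than a fixed sleep. The following is a minimal sketch, not an official mitigation: `lookup` is a hypothetical callable supplied by the caller that returns True once the service principal can be read back, the 90-second budget comes from the workaround above, and the 5-second polling interval is an assumption.

```python
import time
from typing import Callable


def wait_until_visible(
    lookup: Callable[[], bool],
    timeout_s: float = 90.0,   # upper bound taken from the workaround above
    interval_s: float = 5.0,   # assumed polling interval, not from the issue
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `lookup` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if lookup():
            return True
        if clock() >= deadline:
            return False  # give up; the caller decides whether to fail or retry later
        sleep(interval_s)
```

A script would call this between service principal creation and cluster creation and only proceed when it returns True. The `sleep` and `clock` parameters exist so the function can be exercised without real waiting.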

Please see the AKS FAQ for more information.

Issue Details

AKS creates a Service Principal (SP) on behalf of the user and then attempts to look up the newly created SP within 15 seconds (with retries). The lookup fails even though the SP was created successfully.

The failure occurs because the lookup response does not return the SP. Lookup requests are geo-load-balanced, so traffic can be directed to a data center other than the one that accepted the write request. The not found error is due to increased global replication time as well as replica propagation at the storage layer.
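
The read-after-write gap described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the replica names, the 30-second lag, and the round-robin routing are invented for illustration and do not describe how Azure AD is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    # key -> time at which the key becomes readable on this replica
    visible_at: dict = field(default_factory=dict)

    def has(self, key: str, now: float) -> bool:
        return key in self.visible_at and now >= self.visible_at[key]


class Directory:
    """Toy model: a write lands on one replica and reaches the others after `lag_s`."""

    def __init__(self, replica_names, lag_s: float):
        self.replicas = [Replica(n) for n in replica_names]
        self.lag_s = lag_s
        self._next = 0  # round-robin cursor standing in for geo load balancing

    def write(self, key: str, now: float, home: int = 0) -> None:
        for i, replica in enumerate(self.replicas):
            replica.visible_at[key] = now if i == home else now + self.lag_s

    def read(self, key: str, now: float) -> bool:
        replica = self.replicas[self._next]
        self._next = (self._next + 1) % len(self.replicas)
        return replica.has(key, now)


directory = Directory(["dc-a", "dc-b"], lag_s=30.0)  # hypothetical data centers
directory.write("sp-1", now=0.0)
# Lookups at t = 0, 5, 10, 15 s alternate between the two replicas.
results = [directory.read("sp-1", now=t) for t in (0.0, 5.0, 10.0, 15.0)]
# results == [True, False, True, False]
```

In this model every lookup inside a 15-second window that is routed to the lagging replica misses, even though the write succeeded, which is the shape of the failure described above.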

The error is non-destructive - users may use the workarounds listed above until the fixes are deployed.

ahelwer · 2019-09-23T20:53:17Z

This issue is intermittent - with a pre-created service principal you can run Test-AzResourceGroupDeployment until it stops spitting out the ServicePrincipalNotFound error, but then you run New-AzResourceGroupDeployment and it fails with ServicePrincipalNotFound. This is in line with my tests where about half of Get-AzADServicePrincipal calls return $null for a while after service principal creation. I've worked around this in my script by calling Get-AzADServicePrincipal with 10 retries before declaring nonexistence. Are similar retries implemented for pre-existing service principals with AKS template validation? You mentioned 90 seconds, is that the SLA for service principal propagation?
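
Because a single successful lookup does not guarantee that the next one succeeds, one variation on the retry approach described above is to require several consecutive hits before proceeding. The sketch below is not the commenter's script: `lookup` is a hypothetical callable (for example, a wrapper around whatever call reads the service principal back), the 10 attempts mirror the retry count mentioned above, and the streak length and delay are assumptions.

```python
import time
from typing import Callable


def confirm_propagated(
    lookup: Callable[[], bool],
    required_hits: int = 3,   # assumed: consecutive successes before trusting the result
    max_attempts: int = 10,   # mirrors the 10 retries mentioned in the comment above
    delay_s: float = 5.0,     # assumed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `lookup` has succeeded `required_hits` times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        if lookup():
            streak += 1
            if streak >= required_hits:
                return True
        else:
            streak = 0  # a miss means some replica still lacks the object
        if attempt < max_attempts - 1:
            sleep(delay_s)
    return False
```

Requiring a streak lowers the chance that a later request lands on a replica that has not caught up, but it cannot rule that out, since the caller does not control which replica serves each lookup.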

jnoller · 2019-09-30T22:56:34Z

Reopening. Engineering teams have put mitigations in place for this failure in the Azure portal; customers using the CLI or other tools are advised to continue using the other mitigations.

katbyte · 2019-12-06T22:59:00Z

@jluk is there any timeframe for this to be fixed? It still affects Terraform.

ghost · 2019-12-28T17:36:32Z

Is there any workaround for this issue? The first terraform apply command fails. On the next run, it succeeds.

jluk · 2020-01-06T17:13:18Z

We are planning the AKS-side short-term mitigation for the CLI and other clients such as Terraform, now that the portal has been resolved. I've reached out to Tom for us to figure out how to mitigate the SP propagation latency for TF.

The long-term improvement for AAD propagation is being discussed by the Active Directory team.

stvhwrd · 2020-01-22T00:19:33Z

I am no longer seeing this when creating a cluster via the Azure Portal GUI 👍

torumakabe · 2020-03-24T00:25:17Z

@jnoller @jluk Is Managed Identity, which recently became GA, a solution? I am not sure whether AKS cluster creation waits for or retries the AAD sync of the Managed Identity, but I hope so.

https://docs.microsoft.com/en-us/azure/aks/use-managed-identity

jluk · 2020-03-24T01:13:01Z

@torumakabe are you still seeing this error occur? We've introduced some improvements for both Portal and CLI which should mitigate this problem, but I'm curious whether you're still seeing it and, if so, which clients you are using.

ahelwer · 2020-03-24T01:34:05Z

I am still seeing it when creating a service principal and then deploying an AKS cluster using an ARM template.

torumakabe · 2020-03-24T03:34:27Z

@jluk I use Terraform, so I implemented a workaround (a sleep) after AAD app creation, like this: https://github.com/ToruMakabe/container-handson/blob/1342101525cfd2b8de7f357a0cf8481ee85f16f9/prep/modules/aks/main.tf#L25

If Managed Identity could solve this AAD propagation problem, I would not use SP anymore.

oddball · 2020-04-18T11:12:49Z

I worked around it by running the same command 35 times.
The first 34 times I got:

az aks create --resource-group kubernetes-cluster-group --name kubernetes-cluster-v3 --node-count 1 --generate-ssh-keys --node-vm-size Standard_D4s_v3
Finished service principal creation[##################################]  100.0000%
Operation failed with status: 'Bad Request'. Details: The credentials in ServicePrincipalProfile were invalid. Please see https://aka.ms/aks-sp-help for more details. (Details: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"unauthorized_client","error_description":"jfggk: Application with identifier 'aafe21' was not found in the directory '8f'. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You may have sent your authentication request to the wrong tenant.\r\nTrace ID: bc\r\nCorrelation ID: 6f\r\nTimestamp: 2020-04-18 10:56:07Z","error_codes":[700016],"timestamp":"2020-04-18 10:56:07Z","trace_id":"bc","correlation_id":"6e","error_uri":"https://login.microsoftonline.com/error?code=700016"})

Then the 35th time it worked. Confidence inspiring
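
Re-running a command by hand until it succeeds can be automated with a bounded retry that backs off between attempts and only retries when the output looks like the propagation error. This is a sketch under assumptions: the marker strings are substrings of the error messages quoted in this thread, while the attempt count, base delay, and the `runner` hook are illustrative choices rather than anything the CLI provides.

```python
import subprocess
import time
from typing import Callable, Sequence, Tuple

# Substrings taken from the error messages quoted in this thread.
TRANSIENT_MARKERS = ("ServicePrincipalNotFound", "was not found in the directory")


def _run(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run `cmd` and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def run_with_retry(
    cmd: Sequence[str],
    max_attempts: int = 5,        # assumed
    base_delay_s: float = 10.0,   # assumed; doubles after each failed attempt
    sleep: Callable[[float], None] = time.sleep,
    runner: Callable[[Sequence[str]], Tuple[int, str]] = _run,
) -> Tuple[int, str]:
    """Retry `cmd` while it fails with what looks like an SP propagation error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        code, output = runner(cmd)
        if code == 0:
            return code, output
        transient = any(marker in output for marker in TRANSIENT_MARKERS)
        if not transient or attempt == max_attempts:
            return code, output  # unrelated failure, or out of attempts
        sleep(base_delay_s * 2 ** (attempt - 1))
    return code, output
```

The caller passes the full command as a list of arguments, for example the `az aks create` invocation shown above split into its individual arguments. Failures that do not match a marker are returned immediately so that unrelated errors are not masked by retries.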

stale · 2020-07-20T18:53:53Z

This issue has been automatically marked as stale because it has not had activity in 90 days. It will be closed if no further activity occurs. Thank you!

ahelwer · 2020-07-21T14:18:36Z

This is still active.

palma21 · 2020-07-22T08:44:10Z

This should not be experienced on the latest client versions. Are you still seeing this?

ahelwer · 2020-07-22T23:42:00Z

Okay, I've run some tests again and it looks like it has been fixed for ARM templates.

jluk · 2020-07-22T23:42:24Z

Closing this as I believe this is addressed, but we can revisit if the issue lingers

triage-new-issues bot added the triage label Sep 13, 2019

jnoller added the known-issue label Sep 13, 2019

triage-new-issues bot removed the triage label Sep 13, 2019

jnoller pinned this issue Sep 13, 2019

ahelwer mentioned this issue Sep 23, 2019

Service Principal creation lags behind final validation (ServicePrincipalNotFound) #1165

Closed

jluk mentioned this issue Sep 25, 2019

az aks create fails to obtain SP credentials Azure/azure-cli#9585

Closed

jnoller closed this as completed Sep 30, 2019

jnoller unpinned this issue Sep 30, 2019

jnoller removed the known-issue label Sep 30, 2019

jnoller reopened this Sep 30, 2019

triage-new-issues bot added the triage label Sep 30, 2019

jluk added the known-issue label Oct 9, 2019

triage-new-issues bot removed the triage label Oct 9, 2019

techspeque mentioned this issue Oct 31, 2019

Service principal creation isn't finished before other resources start provisioning hashicorp/terraform-provider-azuread#156

Closed

jluk mentioned this issue Nov 14, 2019

Issue creating AKS Cluster with Terraform - Service Principal Invalid #1316

Closed

mikhailshilkov mentioned this issue Nov 25, 2019

azure-cs-aks: "Error creating Managed Kubernetes Cluster" pulumi/examples#480

Closed

This was referenced Dec 4, 2019

aks generation script fails with Bad Request when ran as a script, but not when ran interactively. Azure/azure-cli#10213

Closed

az aks create fails with Internal Server Error Azure/azure-cli#10281

Closed

jluk added the Azure/ActiveDirectory label Jan 6, 2020

arodriguezdlc mentioned this issue Feb 12, 2020

Error creating Managed Kubernetes Cluster: ServicePrincipalNotFound hashicorp/terraform-provider-azurerm#5703

Closed

penpyt mentioned this issue May 14, 2020

Add kubernetes module scalar-labs/scalar-kubernetes#1

Merged

7 tasks

stale bot added the stale Stale issue label Jul 20, 2020

palma21 removed the stale Stale issue label Jul 22, 2020

palma21 assigned jluk Jul 22, 2020

jluk closed this as completed Jul 22, 2020

ghost locked as resolved and limited conversation to collaborators Aug 22, 2020
