Add "wait" and "retry" deployment options #1013

Open
rshariy opened this issue Nov 26, 2020 · 85 comments

Comments

@rshariy

rshariy commented Nov 26, 2020

ARM template deployment often fails with errors like:

"Another operation is in progress on the selected item. If there is an in-progress operation, please retry after it has finished."

"BMSUserErrorObjectLocked","message":"Another operation is in progress on the selected item."

Just to clarify - this is not a dependency issue. An ARM deployment may fail if, for example, you try to add a VM to an RSV while another VM is being added at the same time: for a few seconds the RSV will not accept new clients, and as a result your deployment will fail.

I would like to have an option to pause a deployment and/or retry it - maybe introduce "wait" and "retry" deployment conditions, e.g.:

resource blob 'Microsoft.Storage/storageAccounts/blobServices/containers@2019-06-01' = {
    wait: 30
    retry: 5
    name: '${stg.name}/default/logs'
}
@rshariy rshariy added the enhancement New feature or request label Nov 26, 2020
@ghost ghost added the Needs: Triage 🔍 label Nov 26, 2020
@alex-frankel alex-frankel added intermediate language Related to the intermediate language and removed Needs: Triage 🔍 labels Nov 30, 2020
@alex-frankel
Collaborator

Understood. This is something we have been considering, but haven't scheduled the work yet. If you (or others) have other examples that you have run into, it would be great to capture those here.

I know RBAC replication (and replication delays in general) are another place where something like this would be helpful.

@anthony-c-martin
Member

> I know RBAC replication (and replication delays in general) are another place where something like this would be helpful.

@alex-frankel I'm assuming this is something we're planning on also addressing in the underlying platform? This feels like a leaky abstraction, not something that the end-user should have to deal with by adding delays.

@alex-frankel
Collaborator

> This feels like a leaky abstraction, not something that the end-user should have to deal with by adding delays.

Agreed. @bmoore-msft and I were also discussing this yesterday. Ideally, ARM will co-locate all the calls end-to-end so a user never has to think about this. Not sure if/when that will be possible, and this may be a necessary evil in the meantime.

@bmoore-msft
Contributor

The OP doesn't sound like replication (feels like concurrency), though I could see that you could potentially address both with something like retry. The problem in this case (or in either case, really) is indefinite postponement. This feels like a problem with the RP - common operations returning frequent 400s instead of maybe 429.

The challenge with this workaround is that not only does the user have to fail and then implement a non-deterministic workaround (that's expensive on the service), it will also mask problems across ARM, RPs and user code.

@rshariy - have you raised this issue with the RSV team? It doesn't appear to be an uncommon problem and seems like it should be addressed by the RSV... either it shouldn't happen or we're not helping customers figure out how to effectively use RSV.

@rshariy
Author

rshariy commented Dec 2, 2020

@bmoore-msft I raised a similar issue with the Azure Firewall product team about a year ago - the only solution we found is to use a PowerShell function to check the Azure FW status (make sure it is not "updating") before kicking off a new ARM deployment to the FW.

Just logged ticket 120120226003381 about the RSV issue - let's see what MS support will come up with.

@alex-frankel alex-frankel added provider bug revisit and removed enhancement New feature or request intermediate language Related to the intermediate language labels Dec 3, 2020
@alex-frankel
Collaborator

alex-frankel commented Dec 3, 2020

> it will mask problems across ARM, RPs and user code.

This point is what makes us cautious about implementing something like this. We have some potential solutions to deal with the replication delay in particular that we will explore before introducing a wait.

@rshariy - please let us know the resolution of the case.

@Agazoth

Agazoth commented Mar 31, 2021

I have a main template that looks like this:

module kv 'keyvault.bicep' = {
  name: 'kvSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    enableSoftDelete: false
  }
}

module kvaccpol 'keyvaultaccesspolicy.bicep' = {
  name: 'kvAccPolSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    action: 'add'
    objectId: objectId
    access: keyVaultAccessPolicyAccess
  }
}

When that runs, the deployment breaks with:

{
   "error": {
     "code": "ParentResourceNotFound",
     "message": "Can not perform requested operation on nested resource. Parent resource 'kv-kvaccpoltest' not found."
   }
} (Code:NotFound)

Running the deployment again deploys the policy.
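
One possible reading of this failure, not confirmed in the thread: both modules receive the same keyVaultName parameter, so nothing in the template tells ARM that kvaccpol must wait for kv, and the two modules may be started in parallel. A minimal sketch of making the ordering explicit follows. It reuses the module and parameter names from the comment above and assumes that rg, keyVaultName, objectId and keyVaultAccessPolicyAccess are declared as in the original template.

```bicep
module kvaccpol 'keyvaultaccesspolicy.bicep' = {
  name: 'kvAccPolSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    action: 'add'
    objectId: objectId
    access: keyVaultAccessPolicyAccess
  }
  // Explicit dependency: do not start this module until the
  // Key Vault module ('kv') has completed.
  dependsOn: [
    kv
  ]
}
```

Passing a value from kv.outputs into kvaccpol would create the same dependency implicitly, but that requires keyvault.bicep to declare a suitable output, which the comment does not show.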

@eja-git

eja-git commented Apr 14, 2021

I ran into a scenario where I'd like a wait. There is not much code to show: I deploy a Function App and then want to output its default key for use in API Management. The problem is that the function app takes some time to spin up before the app keys are present...

resource functionApp 'Microsoft.Web/sites@2020-06-01' = {
  name: functionAppName
  location: location
  kind: 'functionapp'
...

output functionappdefaultkey string = listKeys('${functionApp.id}/host/default', functionApp.apiVersion).functionKeys.default

Workaround is to run the initial deployment of the function app twice.

@bmoore-msft
Contributor

@eja-git this isn't a "wait" scenario, it's a bug in the deployment engine job scheduling... the listKeys job is scheduled too early... so that's the fix for your particular scenario.

@Pietervanhove

Pietervanhove commented Jul 1, 2021

Hi,

I've logged the following issue projectkudu/kudu#3312 (comment) that could also benefit from the wait option during a deployment.

Best Regards
Pieter

@azMantas
Contributor

azMantas commented Oct 1, 2021

I am trying to simplify firewall rule collection deployment by using loadTextContent and then looping over each variable. Each workload-x.json file contains all the properties for a rule collection.

var workloads = [
  json(loadTextContent('./workload-1.json'))
  json(loadTextContent('./workload-2.json'))
  json(loadTextContent('./workload-3.json'))
]

resource afwPolicy 'Microsoft.Network/firewallPolicies@2021-02-01' existing = {
  name: 'bicepRules'
}

resource collectionGroups 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2021-02-01' = [for workload in workloads: {
  name: workload.name
  parent: afwPolicy
  properties: workload.properties
}]

Here is the error I get:

Rule Collection Group workload-2 can not be updated because Parent Firewall Policy bicepRules is in Updating state from previous operation

I am sure that a short delay between deployments would let us loop through the whole array.

@SenthuranSivananthan

Only one Rule Collection Group can be updated at a time with Azure Firewall Policy. Since the update refreshes all of the connected Azure Firewall instances, the amount of time it takes to update is non-deterministic. Therefore you will need to serialize the deployment using the batchSize decorator.

Can you try:

@batchSize(1)
resource collectionGroups 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2021-02-01' = [for workload in workloads: {
  name: workload.name
  parent: afwPolicy
  properties: workload.properties
}]

@SQLDBAWithABeard

I have two scenarios that come to mind from recent experience.

Overarching enterprise management level policy being applied to a resource that has been created which I reference in next resource/module causing the Another Operation error. A retry would be useful here as I have no control or influence over the Policies.

I have also faced situations where a newly created resource is not available when referenced immediately afterwards which I assume is a replication/caching issue as the next run works flawlessly.

@wsucoug69

My scenario includes creating a Cosmos Account, this typically takes a few minutes and sometimes up to 10 minutes. In this case I am unable to use the resource output to set the connection string for use in subsequent modules e.g. passing into keyVault and functionAppSettings

@alex-frankel
Collaborator

> My scenario includes creating a Cosmos Account, this typically takes a few minutes and sometimes up to 10 minutes.

@markjbrown - do you mind taking a look at this one? I'd expect the Cosmos Account not to report complete until it is fully provisioned. @zapadoody -- do you happen to have the code sample of the repro and a correlation ID when the error occurred?

@markjbrown

For run-time deployment errors you should raise a support ticket as they are best equipped to diagnose specific errors with an activity id.

However, I am happy to look at an existing bicep file to see if there are any issues.

I do have a sample on how to output the endpoint and key from a Cosmos account and input into appSettings for an App Service here if that helps.

https://github.com/Azure/azure-quickstart-templates/blob/master/quickstarts/microsoft.documentdb/cosmosdb-webapp/main.bicep

@wsucoug69

wsucoug69 commented Nov 9, 2021

here's my cosmosAccount.bicep

param location string
param cosmosAccountName string
param cosmosDefaultConsistencyPolicy string 
param cosmosPrimaryRegion string
param cosmosSecondaryRegion string

var lowerCosmosAcctName = toLower(cosmosAccountName)
var locations = [
  {
    locationName: cosmosPrimaryRegion
    failoverPriority: 0
    isZoneRedundant: false
  }
  {
    locationName: cosmosSecondaryRegion
    failoverPriority: 1
    isZoneRedundant: false
  }
]

resource cosmosAccountResource 'Microsoft.DocumentDB/databaseAccounts@2021-06-15' = {
  name: lowerCosmosAcctName
  kind: 'GlobalDocumentDB'
  location: location
  properties: {
    locations: locations
    databaseAccountOfferType: 'Standard'
    enableAutomaticFailover: true
    consistencyPolicy: {
      defaultConsistencyLevel: cosmosDefaultConsistencyPolicy
    }
  }
}


output cosmosAccountResourceName string = cosmosAccountResource.name

here's the KeyVault.bicep

param location string 
param keyVaultName string
param productionPrincipalId string
param productionTenantId string
param stagingPrincipalId string
param stagingTenantId string

@secure()
param cosmosPrimaryConnectionString string

@secure()
param cosmosSecondaryConnectionString string

@secure()
param serviceStorageConnectionString string

@secure()
param appStorageConnectionString string


resource keyVault 'Microsoft.KeyVault/vaults@2019-09-01' = {
  name: keyVaultName
  location: location
  properties: {
    enabledForDeployment: true
    enabledForTemplateDeployment: true
    enabledForDiskEncryption: true
    tenantId: productionTenantId
    accessPolicies: [
      {
        tenantId: productionTenantId
        objectId: productionPrincipalId
        permissions: {
          secrets: [
            'get'
            'list'
          ]
        }
      }
      {
        tenantId: stagingTenantId
        objectId: stagingPrincipalId
        permissions: {
          secrets: [
            'get'
            'list'
          ]
        }
      }
    ]
    sku: {
      name: 'standard'
      family: 'A'
    }
  }  
}

resource cosmosPrimaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/cosmosPrimaryConnectionString'
  properties: {
    value: cosmosPrimaryConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

resource cosmosSecondaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/cosmosSecondaryConnectionString'
  properties: {
    value: cosmosSecondaryConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

resource serviceStorageConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/dbConnectionString'
  properties: {
    value: serviceStorageConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

resource appStorageConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/appStorageConnectionString'
  properties: {
    value: appStorageConnectionString
  }
  dependsOn:[
    keyVault
  ]
}

output appStorageConnectionStringUri string = appStorageConnectionStringSecret.properties.secretUri
output serviceStorageConnectionStringUri string = serviceStorageConnectionStringSecret.properties.secretUri
output cosmosPrimaryConnectionStringUri string = cosmosPrimaryConnectionStringSecret.properties.secretUri
output cosmosSecondaryConnectionStringUri string = cosmosSecondaryConnectionStringSecret.properties.secretUri

and here's the main.bicep

/// cosmos db account, database and container module
module cosmosAccountMod '../cosmosAccount.bicep' = {
  name: 'cosmosAccount-${environmentName}-${buildNumber}'
  params: {
    cosmosAccountName: cosmosAccountName
    cosmosDefaultConsistencyPolicy: cosmosDefaultConsistencyPolicy
    cosmosPrimaryRegion: cosmosPrimaryRegion
    cosmosSecondaryRegion: cosmosSecondaryRegion
    location: location
  }
}

module cosmosDatabaseMod '../cosmosDbContainer.bicep' = {
  name: 'cosmosDBContainer-${environmentName}-${buildNumber}'
  params: {
    cosmosAccountName: cosmosAccountMod.outputs.cosmosAccountResourceName
    cosmosContainerName: cosmosContainerName
    cosmosDatabaseName: cosmosDatabaseName
    cosmosThroughput: cosmosThroughput
  }
  dependsOn: [
    cosmosAccountMod
  ]
}

// storage account module - storage for the tenants application 
module appStorageAccountMod '../storageAccount.bicep' = {
  name: 'appStorageAcctName-${environmentName}-${buildNumber}'
  params: {
    storageAcctName: appStorageAcctName
    storageSkuName: appStorageAcctSku
    location: location
  }
}

// app insights module
module appInsightsMod '../appInsights.bicep' = {
  name: 'appInsightsName-${environmentName}-${buildNumber}'
  params: {
    name: appInsightsName
    resourceGroupLocation: location
  }
}

// app service plan module
module appServicePlanMod '../appServicePlan.bicep' = {
  name: 'appServicePlan-${environmentName}-${buildNumber}'
  params: {
    appSvcPlanSku: appSvcPlanSku
    appSvcPlanTier: appSvcPlanTier
    appSvcPlanName: appSvcPlanName
    appPlanLocation: location
  }
}

// function app module
module functionAppMod '../functionApp.bicep' = {
  name: 'functionApp-${environmentName}-${buildNumber}'
  params: {
    appSvcPlanName: appSvcPlanName
    functionAppName: functionAppName
    location: location
  }
  dependsOn: [
    appStorageAccountMod
    appServicePlanMod
    cosmosAccountMod
  ]
}

// service storage account module - storage for the function app 
module serviceStorageAccountMod '../storageAccount.bicep' = {
  name: 'serviceStorageAcctName-${environmentName}-${buildNumber}'
  params: {
    storageAcctName: serviceStorageAcctName
    storageSkuName: serviceStorageAcctSku
    location: location
  }
}

// key vault module
module keyVaultMod '../keyVault.bicep' = {
  name: 'keyVaultName-${environmentName}-${buildNumber}'
  params: {
    keyVaultName: keyVaultName
    location: location
    cosmosPrimaryConnectionString: listConnectionStrings(resourceId('Microsoft.DocumentDB/databaseAccounts', cosmosAccountName), '2020-04-01').connectionStrings[0].connectionString
    cosmosSecondaryConnectionString: listConnectionStrings(resourceId('Microsoft.DocumentDB/databaseAccounts', cosmosAccountName), '2020-04-01').connectionStrings[1].connectionString
    productionPrincipalId: functionAppMod.outputs.productionPrincipalId
    productionTenantId: functionAppMod.outputs.productionTenantId
    stagingPrincipalId: functionAppMod.outputs.stagingPrincipalId
    stagingTenantId: functionAppMod.outputs.stagingTenantId
    serviceStorageConnectionString: serviceStorageAccountMod.outputs.storageAccountConnectionString
    appStorageConnectionString: appStorageAccountMod.outputs.storageAccountConnectionString
  }
  dependsOn:[
    functionAppMod
    cosmosAccountMod
    cosmosDatabaseMod
  ]
}

// function app settings module
module functionAppSettingMod '../functionAppSettings.bicep' = {
  name: 'functionAppSettings-${environmentName}-${buildNumber}'
  params: {
    appInsightsKey: appInsightsMod.outputs.appInsightsKey
    cosmosConnectionStringUri: keyVaultMod.outputs.cosmosPrimaryConnectionStringUri
    appStorageConnectionStringUri: keyVaultMod.outputs.appStorageConnectionStringUri
    serviceStorageConnectionStringUri: keyVaultMod.outputs.serviceStorageConnectionStringUri
    functionAppName: functionAppMod.outputs.prodSlotFunctionAppName
    functionAppStagingName: functionAppMod.outputs.stagingSlotFunctionAppName
  }
  dependsOn:[
    functionAppMod
    appInsightsMod
    cosmosAccountMod
    keyVaultMod
  ]
}

@wsucoug69

Also, to clarify: previously I was using the output in cosmosAccount.bicep, but I changed to the query approach to try and get away from the error. Thanks for the tip on raising the support ticket.

@wsucoug69

wsucoug69 commented Nov 9, 2021

> For run-time deployment errors you should raise a support ticket as they are best equipped to diagnose specific errors with an activity id.
>
> However, I am happy to look at an existing bicep file to see if there are any issues.
>
> I do have a sample on how to output the endpoint and key from a Cosmos account and input into appSettings for an App Service here if that helps.
>
> https://github.com/Azure/azure-quickstart-templates/blob/master/quickstarts/microsoft.documentdb/cosmosdb-webapp/main.bicep

@alex-frankel Can you take a look at that? It seems the dependsOn is being fulfilled with the ack of the started and/or accepted responses rather than succeeded

@wsucoug69

> My scenario includes creating a Cosmos Account, this typically takes a few minutes and sometimes up to 10 minutes.
>
> @markjbrown - do you mind taking a look at this one? I'd expect the Cosmos Account not to report complete until it is fully provisioned. @zapadoody -- do you happen to have the code sample of the repro and a correlation ID when the error occurred?

@alex-frankel any thoughts on the bicep here? Also I have opened a support case for this if you need that ref # let me know and I can send direct.

@markjbrown

The problem is this listConnectionStrings function. I've never seen it before. I tried testing in an ARM template and it doesn't work (not sure why the template didn't fail validation).

If you want to output the endpoint and keys use this syntax below. To make it as a connection string just concat them together with "AccountEndpoint=" and ";AccountKey="

"[reference(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName'))).documentEndpoint]"
"[listKeys(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName')), '2021-04-15').primaryMasterKey]"

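The suggestion above is written as ARM template expressions. A rough Bicep equivalent (a sketch only, not tested against the deployment discussed here) references the account as an existing resource and joins the two values into a connection string. The symbolic name cosmosAccount and the variable name are illustrative; the API version is the one used in cosmosAccount.bicep earlier in the thread.

```bicep
param cosmosAccountName string

// Reference the already-deployed account; nothing is created here.
resource cosmosAccount 'Microsoft.DocumentDB/databaseAccounts@2021-06-15' existing = {
  name: toLower(cosmosAccountName)
}

// Endpoint and primary key joined as described above.
// The result is a secret: pass it only to @secure() parameters.
var cosmosPrimaryConnectionString = 'AccountEndpoint=${cosmosAccount.properties.documentEndpoint};AccountKey=${cosmosAccount.listKeys().primaryMasterKey}'
```

If the account is created by a module in the same deployment, whatever consumes this value presumably still needs a dependency on that module (for example the existing dependsOn on cosmosAccountMod), so that the keys are not read before the account exists.
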
@wsucoug69

> The problem is this listConnectionStrings function. I've never seen it before. I tried testing in an ARM template and it doesn't work (not sure why the template didn't fail validation).
>
> If you want to output the endpoint and keys use this syntax below. To make it as a connection string just concat them together with "AccountEndpoint=" and ";AccountKey="
>
> "[reference(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName'))).documentEndpoint]"
> "[listKeys(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName')), '2021-04-15').primaryMasterKey]"

@markjbrown apologies thank you for the assistance!!!

@brwilkinson
Collaborator

Thank you @tejas-nagchandi keep us posted, if you are able to resolve, otherwise we can continue to investigate.

@tejas-nagchandi

@brwilkinson: I tested with dependsOn as well, but the extension dependencies are still not resolved.
My bicep:

module componentVM 'virtualMachines.bicep' = [for (vm, index) in component: {
  name: '${vmType}VM-${vm.name}'
  params: {
    location: location
    vmName: vmName[index].name
    zone: vm.zone
    subnet: subnet
    vmProperties: properties
    keyVaultName: keyVaultName
    vnetName: vnetName
    vnetResourceGroup: vnetResourceGroup
    infraEncryptionKeyId: infraEncryptionKeyId
    uaiForDiskid: uaiForDiskid
    uaiForVMid: uaiForVMid
    lbProperties: lbProperties
  }
}]

module protectVM 'protectedItems.bicep' = [for (vm, index) in component: {
  name: 'protect-${vm.name}'
  dependsOn: componentVM
  params: {
    location: location
    policyId: policyId
    vaultName: vaultName
    vmName: vm.name
    resourceSuffix: resourceSuffix
  }
}]

This deployment fails with message

"message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
"details": [
  {
    "code": "ResourceDeploymentFailure",
    "target": "/subscriptions/xxxxx-xxxx-xxxx-xxxx-xxxxxxxx/resourceGroups/xxxx-d-rg/providers/Microsoft.RecoveryServices/vaults/xxxx-rsv/backupFabrics/Azure/protectionContainers/iaasvmcontainer;iaasvmcontainerv2;xxxx-d-rg;xxxxxxx/protectedItems/vm;iaasvmcontainerv2;xxxx-d-rg;xxxxxxxx",
    "message": "The 'AzureAsyncOperationWaiting' resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "UserErrorGuestAgentStatusUnavailable",
        "message": "VM agent is unable to communicate with the Azure Backup Service."
      }]}
  ]}

This happens because the deployment of the "Agent" and "VM Extensions" is initiated by policies set by the organization.

And, when I wait for these extensions to be "Provisioned Successfully" and retry the failed deployment, the deployment goes smooth and my VMs are included in the ProtectedItems of my recoveryServiceVault.

@brwilkinson
Collaborator

Hi @tejas-nagchandi

This is an interesting scenario that I see. I am not sure if you have uncovered a bug OR if it is actually doing the correct thing.

Let me explain.

dependsOn is actually an array of resource references.

E.g.

  dependsOn: [
    componentVM
  ]
  • in your case since componentVM is actually an array, it appears to look correct, however I am unsure if it is actually working correctly.
    • I will do some more testing on that, however in the meantime can you please update the syntax as above.

You can also use the form below. The form above waits for all VMs to complete; the one below waits only for the VM in the same loop iteration. Both should work; however, if you need a longer delay, use the one above, i.e. wait for ALL VMs to complete before deploying the VM protection.

 dependsOn: [
   componentVM[index]
 ]

if you still hit issues, perhaps also consider adding the following as well.

@batchSize(1)
module protectVM 'protectedItems.bicep' = [for (vm, index) in component: {
  name: 'protect-${vm.name}'
    dependsOn: [
    componentVM
  ]
  • Please let us know the outcome, hopefully you don't need the batchsize.

@tejas-nagchandi

@brwilkinson: Same outcome after adding dependsOn as array and batchSize annotation.
I was actually expecting the same outcome as the dependsOn without array was also resolving into the correct dependency, I checked the compiled ARM before the deployment as well.

The main issue is that the VM resource deployment is reported successful before the extensions (initiated by policy) are getting provisioned.

@brwilkinson
Collaborator

@tejas-nagchandi
If your issue is still not resolved after the dependsOn change, I think it would be best to just open up a separate discussion for this topic of installing and configuring the backup agent. We can keep that topic outside of this issue/thread, then report back here on the outcome.

@tejas-nagchandi

Sure @brwilkinson, I will open a separate discussion on this. Thanks for the quick response so far.

@brwilkinson
Collaborator

Thank you @tejas-nagchandi for opening the separate discussion.

We were able to determine that the conflict was from setting the backup in Bicep as well as in Azure Policy. So the recommendation was to remove this configuration from Bicep and allow the Policy to deploy the desired VM protection configuration.

@tejas-nagchandi

> Thank you @tejas-nagchandi for opening the separate discussion.
>
> We were able to determine that the conflict was from setting the backup in Bicep as well as in Azure Policy. So the recommendation was to remove this configuration from Bicep and allow the Policy to deploy the desired VM protection configuration.

@brwilkinson: The final solution is to gain all control within Bicep, so that the dependencies are managed easily. So, not to wait for policies to initiate but include extensions as well as protectedItems in bicep.

@Kaloszer

Same issue when you have a Sentinel Analytic Rule whose query uses a newly created watchlist. Even though the watchlist resource is in dependsOn, it will still fail initially, because it takes time for the watchlist to become available for querying (even after a successful deployment). A retry with a timer would help here.

@brwilkinson
Collaborator

@Kaloszer can you share the info back on that other watchlist discussion?

@bowlerma

bowlerma commented Aug 7, 2023

We're hitting similar problems when deploying Azure SQL. We have a template that deploys a logical Azure SQL server and then performs a number of additional configuration steps such as enabling audit, adding an AD Admin user, setting the connection policy, configuring firewall rules and adding elastic pools. All of these child resources use 'dependsOn' to ensure that they run one after the other in series rather than in parallel.

Most of the time this works, but occasionally the template deployment fails with an 'Internal Server Error'. When we raise this with the Microsoft support team they just tell us "The server is currently busy. Please wait a few minutes and try again." Retrying the template deployment doesn't always work, and there is no built-in mechanism to add this delay.

In this particular case I'd have thought a better response would be to return a 429 rather than a 500, so that the deployment of each child resource can be automatically retried with an exponential backoff between each retry.

It's little issues like this that make working with ARM such a frustrating experience. Just because something deployed OK once, there's no guarantee that it will deploy successfully the next time.

@mdjx

mdjx commented Sep 3, 2023

When Entra Domain Services (previously Azure AD Domain Services / AADDS) is deployed via Bicep the deployment completes within Bicep, but the actual resource remains in the "Deploying" state in Azure for at least ~20 minutes longer.


A wait/retry mechanism would help ensure the service is fully provisioned before further deployments kick off that depend on it, or at least allow them to retry.

@Kaloszer

Kaloszer commented Sep 18, 2023

Yet another case when you're trying to assign >1 federated identity to an uami within the same module:

Too many Federated Identity Credentials are written concurrently for the managed identity '/subscriptions/<sub>/resourcegroups/<rg>/providers/microsoft.managedidentity/userassignedidentities/<uami01'. Concurrent Federated Identity Credentials writes under the same managed identity are not supported. (Code: ConcurrentFederatedIdentityCredentialsWritesForSingleManagedIdentity)

PS:
Workaround is to deploy another module with the second binding with a dependency on the first one, but still...
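
A possible alternative to chaining modules, offered as an untested sketch rather than a confirmed fix for this error: declare the credentials in a single loop and serialize it with @batchSize(1), so that only one write against the managed identity is in flight at a time. The parameter names and the shape of the federatedCredentials items (name, issuer, subject) are hypothetical, and the audience value is only the commonly used default.

```bicep
param identityName string

// Hypothetical input shape: [{ name: '...', issuer: '...', subject: '...' }, ...]
param federatedCredentials array

resource uami 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' existing = {
  name: identityName
}

// batchSize(1) makes the loop run one iteration at a time instead of in parallel.
@batchSize(1)
resource fic 'Microsoft.ManagedIdentity/userAssignedIdentities/federatedIdentityCredentials@2023-01-31' = [for cred in federatedCredentials: {
  parent: uami
  name: cred.name
  properties: {
    issuer: cred.issuer
    subject: cred.subject
    audiences: [
      'api://AzureADTokenExchange'
    ]
  }
}]
```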

@sserjeglobant

Hello, I would like to know if you continue with this very necessary development, here is another example of what is happening:

It turns out that I have to create a vnet and multiple subnets,

I have a module for vnets and another module for subnets.

In the main, I call each module as follows:

vnet module plus its parameters

subnet module plus its parameters, with a dependsOn on the vnet module name and a for loop that reads the object of the subnets it has to create.

What happens is that sometimes, while subnet 0 is being created, the Azure deployment has not finished that operation by the time it goes on to create subnet 1. An error then reports that a previous creation operation is still in progress and that the next subnet cannot be created, which breaks the deployment.

Does anyone have an idea how else I can solve this problem? Or maybe MS can help us with this valuable feature of adding waiting times to the modules.

@SvenAelterman

> Hello, I would like to know if you continue with this very necessary development, here is another example of what is happening:
>
> It turns out that I have to create a vnet and multiple subnets,
>
> I have a module for vnets and another module for subnets.
>
> In the main, I call each module as follows:
>
> vnet module plus its parameters
>
> subnet module plus its parameters, with a dependsOn on the vnet module name and a for loop that reads the object of the subnets it has to create.
>
> What happens is that sometimes, while subnet 0 is being created, the Azure deployment has not finished that operation by the time it goes on to create subnet 1. An error then reports that a previous creation operation is still in progress and that the next subnet cannot be created, which breaks the deployment.
>
> Does anyone have an idea how else I can solve this problem? Or maybe MS can help us with this valuable feature of adding waiting times to the modules.

This is a very different issue.

If you're expecting to be able to redeploy the module for your virtual network, you'll need to make sure you create your subnets with the virtual network, not separately (that's an anti-pattern). If you try to redeploy your virtual network only (no subnets) once you have created subnets and deployed resources in them, the deployment of the virtual network will attempt to delete your subnets, which is neither desired nor possible and will thus cause your virtual network deployment to fail.

If you are looking to deploy additional subnets in an existing virtual network (and will then never again deploy the virtual network unless you pull the full subnet configuration again), then you need to use the @batchSize(1) decorator in the subnet loop.
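
The first option described above (subnets created together with the virtual network) can be sketched as follows. This is a minimal illustration, not the commenter's template: the parameter names, address ranges and API version are assumptions chosen for the example.

```bicep
param location string = resourceGroup().location
param vnetName string
param vnetAddressPrefix string = '10.0.0.0/16'

// Hypothetical subnet list; each item needs a name and an address prefix.
param subnets array = [
  {
    name: 'snet-app'
    addressPrefix: '10.0.0.0/24'
  }
  {
    name: 'snet-data'
    addressPrefix: '10.0.1.0/24'
  }
]

resource vnet 'Microsoft.Network/virtualNetworks@2023-05-01' = {
  name: vnetName
  location: location
  properties: {
    addressSpace: {
      addressPrefixes: [
        vnetAddressPrefix
      ]
    }
    // Subnets are part of the virtual network definition itself, so they are
    // sent in one request and a redeployment does not try to remove them.
    subnets: [for subnet in subnets: {
      name: subnet.name
      properties: {
        addressPrefix: subnet.addressPrefix
      }
    }]
  }
}
```

Because the subnets travel in the same request as the virtual network, there are no separate subnet operations to collide with each other, which is why no @batchSize(1) is needed in this variant.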

@aslan-im

what is the status?

@matzter

matzter commented Mar 20, 2024

I have another, similar issue deploying a Front Door profile and a metricAlert in the same deployment.

'Microsoft.Cdn/profiles@2023-05-01'
'Microsoft.Insights/metricAlerts@2018-03-01'

The error is "Couldn't find a metric named OriginHealthPercentage"
And yes, the metricAlerts deployment is depending on the profile deployment.

@devdeer-alex

devdeer-alex commented Apr 27, 2024

Just to be clear here: Isn't that contradicting the statement from the documentation?

Repeatable results: Repeatedly deploy your infrastructure throughout the development lifecycle and have confidence your resources are deployed in a consistent manner. Bicep files are idempotent, which means you can deploy the same file many times and get the same resource types in the same state. You can develop one file that represents the desired state, rather than developing lots of separate files to represent updates.

@mattias-fjellstrom

Whatever solution is planned for this, will it be Bicep-specific or will it be available in ARM-templates as well?

I encountered an issue with Azure Policy where I use a policy-set containing a number of policies that each enables a given Defender for Cloud plan (Storage, CosmosDB, ARM, etc) if it is not enabled for a given subscription (each policy uses the deployIfNotExists effect).
When I create a new subscription these policies all run at the same time and some of them will error out with a Conflict ... error message. As far as I understand there seems to be no retry-operation built-in to Azure Policy (been waiting a few hours to make sure). So this would be a good scenario for specifying a retry in the ARM-template defined inline of the policy.

@WhitWaldo

@mattias-fjellstrom Likely ARM-level given that Bicep is generating ARM under the hood for deployments (as evidenced by the artifacts in Azure following such a deployment).

@alex-frankel
Collaborator

@WhitWaldo is correct!

@mattias-fjellstrom

@WhitWaldo Very true, that makes sense 👍🏻

@NickSpag

NickSpag commented Aug 30, 2024

Has this been assigned or further discussed @alex-frankel? We're an ISV with an azure managed application in the marketplace so IaC-based environments are part of our CICD.

There are a few classes of errors here where this would be helpful. To highlight one: in the past few years alone we regularly see the metric alerts issue that's been discussed here, where metrics aren't "ready," and once or twice a year it results in multi-day disruptions to our customer updates and development cycle when the wait time needed is beyond anything we can orchestrate by manually pushing the alerts module down the deployment chain.

I'm sure this proposal is extensive work and cuts against the spirit of a declarative DSL but as a practical effect for our org: we're essentially at the point where we are going to have to extend our entire deployment approach to include a packaged C#-based runner, and/or network-connected DevOps pipelines into customer tenants, exclusively in order to achieve wait/retry functionality (and graceful failure, if I had a wish list).

Unfortunately the Resource Providers simply aren't reliable enough to depend on here and we need appropriate tools to account for that reality.
