
[AVM Module Issue]: DNS Zones - Deployment History limit #1237

Open
1 task done
aavdberg opened this issue Apr 22, 2024 · 24 comments
Assignees
Labels
Language: Bicep 💪 This is related to the Bicep IaC language Needs: Core Team 🧞 This item needs the AVM Core Team to review it Type: AVM 🅰️ ✌️ Ⓜ️ This is an AVM related issue

Comments

@aavdberg

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Issue Type?

I'm not sure

Module Name

avm/res/network/dns-zone

(Optional) Module Version

No response

Description

I am deploying DNS zones through the 'br/public:avm/res/network/dns-zone:0.2.4' module, but I am running into the following problem:

(screenshot of the error)

I deploy seventy domains with records in them.

(Optional) Correlation Id

No response

@aavdberg aavdberg added the Needs: Triage 🔍 Maintainers need to triage still label Apr 22, 2024

@aavdberg, thanks for submitting this issue for the avm/res/network/dns-zone module!

Important

A member of the @azure/avm-res-network-dnszone-module-owners-bicep or @azure/avm-res-network-dnszone-module-contributors-bicep team will review it soon!

@ChrisSidebotham
Contributor

Hi @aavdberg.

Thanks for raising this issue. I'm intrigued: are you supplying domains + records totalling more than 800?

The 800 limit comes from the resource group level limits as noted here:
https://learn.microsoft.com/en-us/azure/azure-resource-manager/templates/deployment-history-deletions?tabs=azure-powershell

Could it be that your deployment history contains more than 800 entries? In the CI environment we have a script to remove deployment history where we see failures, but there is a time delay between the deployments disappearing from the visible view in the portal and the metadata being removed on the server side.

@AlexanderSehr may have some more supporting information on this.

@ChrisSidebotham ChrisSidebotham removed the Needs: Triage 🔍 Maintainers need to triage still label Apr 26, 2024
@AlexanderSehr
Contributor

AlexanderSehr commented Apr 26, 2024

Hey @aavdberg,

thanks for bringing this up. This is indeed a curious case, since there is an auto-cleanup mechanism for deployments at this scope: if you're closing in on the 800-deployment limit, it will start removing deployments on a FIFO basis.

That means you somehow managed to create so many deployments that Azure is actually not able to remove them in time. This could, for example, happen in the extreme case of creating 800 deployments at once, but it can happen for fewer. You mention 'only' 70, which should only run into the aforementioned issue if a single run adds more deployments than the headroom between the cleanup threshold (700, as per the link @ChrisSidebotham provided) and the 800 limit. I assume you deploy other resources with your DNS Zones that may pump the number of deployments to above 100?
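As a back-of-the-envelope estimate (an assumption about the module structure, not something verified here): if each zone module nests one child deployment per record type, then 70 zones with, say, 9 record types each would produce roughly 70 × (1 + 9) = 700 deployments in a single run, far more than the roughly 100 deployments of headroom between the cleanup threshold and the 800 limit.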

@alex-frankel, would you happen to have any advice or a best practice in mind for how to avoid this issue when deploying templates that make heavy use of deployments?

@ChrisSidebotham, the script you're referring to is most important at the management group level, as there is no auto-cleanup there. But we did in fact also use it at the subscription level in the past, when we tested our entire library multiple times a day.

@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs: Attention 👋 Reply has been added to issue, maintainer to review label Apr 26, 2024
@teemukom

If the resource group has a CanNotDelete lock (as it does when using this module with the WAF parameters), it prevents deletions from the deployment history.

@ChrisSidebotham
Contributor

@aavdberg - Do you have an update on this with regard to the above comments?

@aavdberg
Author

aavdberg commented May 8, 2024

We have seventy domains that we deploy through an Azure DevOps pipeline. We now use @batchSize(1) to give the automatic cleanup time to do its thing, and then it works, but that makes the deployment take longer. We also tried a batch size of ten, but that fails as well because the automatic cleanup does not go fast enough.

The domains get deployed to a resource group, and the scope in the Bicep file is also set to the resource group.

Every domain has several records.

So every record in a domain gets its own deployment created.

// Module Public DNS Zone
@batchSize(1)
module publicdnszone 'br/public:avm/res/network/dns-zone:0.2.4' = [
  for dnszone in varPublicDnsZones: {
    name: 'deploy-PublicDnsZone-${dnszone.name}'
    params: {
      name: dnszone.name
      location: 'global'
      lock: {
        kind: 'None'
        name: 'lock'
      }
      a: dnszone.recordSets.?a
      aaaa: dnszone.recordSets.?aaaa
      caa: dnszone.recordSets.?caa
      cname: dnszone.recordSets.?cname
      mx: dnszone.recordSets.?mx
      ptr: dnszone.recordSets.?ptr
      soa: dnszone.recordSets.?soa
      srv: dnszone.recordSets.?srv
      txt: dnszone.recordSets.?txt
      tags: {
        createdBy: 'Bicep through Azure DevOps'
        solutionName: parSolutionName
      }
    }
  }
]

Hope this gives more clarity about the problem.

@AlexanderSehr
Contributor

AlexanderSehr commented May 8, 2024

> We have seventy domains that we deploy through an Azure DevOps pipeline. We now use @batchSize(1) to give the automatic cleanup time to do its thing, and then it works, but that makes the deployment take longer. [...] Hope this gives more clarity about the problem.

It does, thanks @aavdberg.

And it does yet again surface two challenges with Deployment objects in Azure:

  1. The cleanup apparently is too slow (which is quite surprising to me, as the threshold to start deleting is 700 and the limit of deployments is 800)
  2. Deployments take a lot longer than native resource deployments. For solutions that deploy some heavyweights that take a few minutes to deploy anyway, they make a lot of sense - but for something where you deploy e.g. 70 instances of a resource, I would personally recommend using the AVM module as a reference to implement a Bicep-native resource deployment (see the sketch at the end of this comment). This would potentially create only 1 deployment instead of 70 (if implemented accordingly) and be faster as a result.
    I used to see similar things happening with e.g. policy assignments, where someone tried to deploy around 300 of them and it just took ages, while a Bicep-native implementation took only a few minutes or less.

I guess what I'm trying to say is that there is always a trade-off. The good news is that the PG is aware and is working on speeding deployments up. ETA unknown, though.
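To make that concrete, here is a minimal sketch (an illustration under assumptions, not the AVM module itself) of what a Bicep-native variant could look like for the scenario above. Only A records are shown; the other record types would follow the same pattern, and varARecordEntries is a hypothetical flattened list of { zoneName, recordName, ttl, ipv4Address } objects derived from varPublicDnsZones.

// Native resources instead of module calls: the whole loop runs inside a single deployment
resource publicDnsZones 'Microsoft.Network/dnsZones@2018-05-01' = [
  for dnszone in varPublicDnsZones: {
    name: dnszone.name
    location: 'global'
  }
]

// A records only, for brevity. The '<zone>/<record>' name is used because nested child
// resources are not allowed inside a resource loop; Bicep only raises a linter suggestion for it.
resource aRecords 'Microsoft.Network/dnsZones/A@2018-05-01' = [
  for entry in varARecordEntries: {
    name: '${entry.zoneName}/${entry.recordName}'
    properties: {
      TTL: entry.ttl
      ARecords: [
        {
          ipv4Address: entry.ipv4Address
        }
      ]
    }
    dependsOn: [
      publicDnsZones
    ]
  }
]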

@ChrisSidebotham
Contributor

@alex-frankel - Would you be able to assist with an ETA for the deployments work Alex mentioned above? Is there a better place for this issue regarding deployments than the current BRM repository?

@ChrisSidebotham
Contributor

@sydkar - Any update on moving this to the correct area?

@sydkar

sydkar commented May 22, 2024

@ChrisSidebotham No update right now; I'm trying to get an ETA on the deployment speed-up work. I'll let you know by the end of the week.

@sydkar sydkar transferred this issue from Azure/bicep-registry-modules May 22, 2024
@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs: Triage 🔍 Maintainers need to triage still label May 22, 2024
@ChrisSidebotham ChrisSidebotham removed their assignment May 23, 2024
@anthony-c-martin anthony-c-martin removed the Needs: Attention 👋 Reply has been added to issue, maintainer to review label Jun 19, 2024
@alex-frankel

Sorry about the delay on this one. Is @teemukom's comment accurate? If so, then that is likely the source of the issue, no?

@aavdberg, can you take a look at the Deployment resources in the Resource Group (there is a "Deployments" tab in the table of contents of the Resource Group blade in the portal)? Does it eventually get pruned back down to 700? If so, then that suggests @AlexanderSehr is correct that we are not able to keep up with the pace of deployments getting created.

The fact that the deployment works with @batchSize(1) also suggests it's our inability to keep up.

@aavdberg if this is still occurring, can you share a recent correlation ID and we can investigate further?

@alex-frankel

I'm also curious what the varPublicDnsZones variable looks like. That will help us understand how many sub-modules will be created for a given record.

@alex-frankel alex-frankel removed the Needs: Triage 🔍 Maintainers need to triage still label Jul 1, 2024
@aavdberg
Author

aavdberg commented Jul 2, 2024

Sent you a ping in Teams, @alex-frankel.

@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs: Triage 🔍 Maintainers need to triage still label Jul 2, 2024
@aavdberg
Author

aavdberg commented Jul 2, 2024

When using @batchSize(1), the deployment slows down and the deployment history cleanup has time to clean up the old deployments.

@alex-frankel

Just got off a call with @aavdberg, and he is creating a large number of DNS records. With the current structure, each new record (even if the record is a new module) will produce a unique deployment. All the deployments happen in parallel, which is why the deployment cleanup process is not able to keep up.

@ChrisSidebotham / @AlexanderSehr / @jtracey93 -- can we consider refactoring this AVM module to not produce so many nested modules? While we ideally would be able to (and plan to) raise the limit, it won't happen in the short-term, so refactoring the module is the only short-term fix that I feel is practical.

The refactor would require creating all of the records in the entry point file for the module instead of in a nested deployment. I realize this reduces the separation of concerns, but that might be a worthy tradeoff to increase usability for the AVM consumer. Thoughts?
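For illustration only, a rough sketch of what that could look like inside the module's entry point file (the parameter and property names here are assumptions, not the module's actual interface): record sets become direct child resources of the zone instead of nested module calls, so one module invocation produces a single deployment.

resource dnsZone 'Microsoft.Network/dnsZones@2018-05-01' = {
  name: name
  location: 'global'
}

// Example for a hypothetical 'a' parameter; the other record-type parameters would follow the same pattern
resource aRecordSets 'Microsoft.Network/dnsZones/A@2018-05-01' = [
  for record in (a ?? []): {
    parent: dnsZone
    name: record.name
    properties: {
      TTL: record.?ttl ?? 3600
      ARecords: record.?aRecords
    }
  }
]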

@AlexanderSehr
Contributor

AlexanderSehr commented Jul 8, 2024

> When using @batchSize(1), the deployment slows down and the deployment history cleanup has time to clean up the old deployments.

1 is maybe a bit ... low and 50 or 100 may do the trick 😄
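For example, the loop from the earlier snippet with only the decorator value changed (50 is just an illustrative value):

@batchSize(50)
module publicdnszone 'br/public:avm/res/network/dns-zone:0.2.4' = [
  for dnszone in varPublicDnsZones: {
    name: 'deploy-PublicDnsZone-${dnszone.name}'
    params: {
      name: dnszone.name
      // ...remaining parameters as in the earlier snippet
    }
  }
]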

@alex-frankel

@AlexanderSehr, @aavdberg mentions that even with a batchSize of 10, he is running into this issue:

> We also tried a batch size of ten, but that fails as well because the automatic cleanup does not go fast enough.

Can we consider refactoring this module?

@AlexanderSehr
Contributor

AlexanderSehr commented Jul 8, 2024

> Just got off a call with @aavdberg, and he is creating a large number of DNS records. [...] can we consider refactoring this AVM module to not produce so many nested modules? [...]

Hola Mr @alex-frankel,
we could consider it - but - I want to provide some context that may / should influence the ultimate path. Bear with me 😄

As you may know, coming from CARML, we've traditionally written modules in a way that aligns closely with the Azure REST API's structure (i.e., how resource types are nested in provider namespaces and may have child resource types) and treated child resource types as full-on modules that we published independently of their parent.

Now, since we migrated the library to AVM, we don't do this anymore due to technical limitations in the backend (certain manual updates that can make this quite painful for every new module). There is hope to work through this eventually, but it's what we have today. We've already seen several issues opened asking for the child modules, so there's definitely a need beyond what we ourselves use within Microsoft and at customers.

Coming back to the issue at hand: the intuitive alignment with the API structure and the comprehensive publishing are the main drivers behind the design of the modules today - applied consistently. As a result, child resource types are always nested in their own module file and are tested against the same rule set as their parent (with the sole exception of the missing version.json file).

Some additional thoughts:
If we tried to radically reduce the number of deployments in AVM modules, we could do so essentially everywhere there is no nested loop. This would / should affect a lot of modules, but it would come at the price of making it impossible to publish them as their own modules.
The compromise would be to do so only for modules where you may not need a separately published child module. This may even be the case for the DNS Zone; personally, I can't judge that, even though I've seen a wide range of customer setups that slice and dice the permissions they grant in very specific ways. The other problem is that it would blow a big hole into the guidelines, compliance tests & consistency, effectively requiring a debate over every resource type of every resource.

Personally, I value consistency a lot, as its benefits usually massively outweigh the benefits of inconsistency. I can't, however, argue with the fact that in this case it comes at the very unfortunate cost of deployment speed and the deployment cleanup. So much for my personal thoughts on the matter. I'm sure others have some thoughts too :)

@alex-frankel

Ultimately, your call :)

I don't think the deployment speed fix is a meaningful fix to this issue. That being said, @anthony-c-martin what is the status on the rollout for the performance improvements with nested modules?

@AlexanderSehr
Contributor

AlexanderSehr commented Jul 9, 2024

> Ultimately, your call :)
>
> I don't think the deployment speed fix is a meaningful fix to this issue. [...]

That's true. The 'speed' is an issue by itself, while this issue is about the number of deployments and their cleanup. May have gotten a bit sidetracked towards the end 😏 Will update.

Regardless, it surely can be 'our' call as a collective. The information provided just paints the picture of why the structure is the way it is, and what my subjective thoughts on the matter are 😉. I'm sure there's a wide variety of opinions and maybe even more ideas (:

@alex-frankel

@aavdberg as an aside, if you'd like to be onboarded to the Deployments performance improvement that @AlexanderSehr referenced, we can onboard you manually. Just email me your subscription ID(s). We are in the process of rolling it out to everyone, but we are going to do that slowly and progressively.

@aavdberg
Author

@alex-frankel sent you a ping in Teams with the subscription ID.

@cedricbraekevelt

Hi @alex-frankel,

Is the deployment performance improvement open to anyone?

Important

The "Needs: Triage 🔍" label must be removed once the triage process is complete!

Tip

For additional guidance on how to triage this issue/PR, see the AVM Issue Triage documentation.

@alex-frankel alex-frankel transferred this issue from Azure/bicep Jul 25, 2024
@AlexanderSehr AlexanderSehr added Type: AVM 🅰️ ✌️ Ⓜ️ This is an AVM related issue Language: Bicep 💪 This is related to the Bicep IaC language Needs: Core Team 🧞 This item needs the AVM Core Team to review it and removed Needs: Triage 🔍 Maintainers need to triage still labels Jul 25, 2024
@AlexanderSehr AlexanderSehr self-assigned this Jul 25, 2024
Projects
Status: Needs: Triage
Status: Todo
Development

No branches or pull requests

8 participants