
[AVM Module Issue]: DNS Zones - Deployment History limit #1237

Open
1 task done
aavdberg opened this issue Apr 22, 2024 · 24 comments
Assignees
Labels
Language: Bicep 💪 This is related to the Bicep IaC language Needs: Core Team 🧞 This item needs the AVM Core Team to review it Type: AVM 🅰️ ✌️ Ⓜ️ This is an AVM related issue

Comments

@aavdberg

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Issue Type?

I'm not sure

Module Name

avm/res/network/dns-zone

(Optional) Module Version

No response

Description

I am deploying DNS zones through the 'br/public:avm/res/network/dns-zone:0.2.4' module, but I am running into the following problem:

(screenshot of the error)

I deploy seventy domains with records in them.

(Optional) Correlation Id

No response

@aavdberg aavdberg added the Needs: Triage 🔍 Maintainers need to triage still label Apr 22, 2024

@aavdberg, thanks for submitting this issue for the avm/res/network/dns-zone module!

Important

A member of the @azure/avm-res-network-dnszone-module-owners-bicep or @azure/avm-res-network-dnszone-module-contributors-bicep team will review it soon!

@ChrisSidebotham
Contributor

Hi @aavdberg.

Thanks for raising this issue. I'm intrigued: are you supplying domains + records totalling more than 800?

The 800 limit comes from the resource group level limits as noted here:
https://learn.microsoft.com/en-us/azure/azure-resource-manager/templates/deployment-history-deletions?tabs=azure-powershell

Could it be that your deployment history contains more than 800 entries? In the CI environment we have a script to remove deployment history where we see failures, but there is a time delay between the deployments disappearing from the visible view in the portal and the metadata being removed on the server side.

@AlexanderSehr may have some more supporting information on this.

@ChrisSidebotham ChrisSidebotham removed the Needs: Triage 🔍 Maintainers need to triage still label Apr 26, 2024
@AlexanderSehr
Contributor

AlexanderSehr commented Apr 26, 2024

Hey @aavdberg,

thanks for bringing this up. This is indeed a curious case, since there is an auto-cleanup mechanism for deployments at this scope: if you're closing in on the 800-deployment limit, it will start removing deployments on a FIFO basis.

That means you somehow managed to create so many deployments that Azure is actually not able to remove them in time. This could, for example, happen in the extreme case of creating 800 deployments at once, but it can happen for fewer. You mention 'only' 70, which should only run into the aforementioned issue if a single run adds more deployments than the headroom between the cleanup threshold (700, as per the link @ChrisSidebotham provided) and the 800 limit. I assume you deploy other resources with your DNS Zones that may pump the number of deployments to above 100?
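As a back-of-the-envelope estimate (an assumption about the module structure, not something verified here): if each zone module nests one child deployment per record type, then 70 zones with, say, 9 record types each would produce roughly 70 × (1 + 9) = 700 deployments in a single run, far more than the roughly 100 deployments of headroom between the cleanup threshold and the 800 limit.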

@alex-frankel, would you happen to have any advice or a best practice in mind for how to avoid this issue when deploying templates that make heavy use of deployments?

@ChrisSidebotham, the script you're referring to is most important at the management group level, as there is no auto-cleanup there. But we did in fact also use it at the subscription level in the past, when we tested our entire library multiple times a day.

@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs: Attention 👋 Reply has been added to issue, maintainer to review label Apr 26, 2024
@teemukom

If the resource group has a CanNotDelete lock (as it does when using this module with the WAF parameters), it prevents deletions from the deployment history.

@ChrisSidebotham
Contributor

@aavdberg - Do you have an update on this with regard to the above comments?

@aavdberg
Author

aavdberg commented May 8, 2024

We have seventy domains that we deploy through an Azure DevOps pipeline. We now use @batchSize(1) to give the automatic cleanup time to do its thing, and then it works, but that makes the deployment take longer. We also tried a batch size of ten, but that fails as well because the automatic cleanup does not go fast enough.

The domains get deployed to a resource group, and the scope in the Bicep file is also set to the resource group.

Every domain has several records.

So every record in a domain gets its own deployment created.

// Module Public DNS Zone
@batchSize(1)
module publicdnszone 'br/public:avm/res/network/dns-zone:0.2.4' = [
  for dnszone in varPublicDnsZones: {
    name: 'deploy-PublicDnsZone-${dnszone.name}'
    params: {
      name: dnszone.name
      location: 'global'
      lock: {
        kind: 'None'
        name: 'lock'
      }
      a: dnszone.recordSets.?a
      aaaa: dnszone.recordSets.?aaaa
      caa: dnszone.recordSets.?caa
      cname: dnszone.recordSets.?cname
      mx: dnszone.recordSets.?mx
      ptr: dnszone.recordSets.?ptr
      soa: dnszone.recordSets.?soa
      srv: dnszone.recordSets.?srv
      txt: dnszone.recordSets.?txt
      tags: {
        createdBy: 'Bicep through Azure DevOps'
        solutionName: parSolutionName
      }
    }
  }
]

Hope this gives more clarity about the problem.

@AlexanderSehr
Contributor

AlexanderSehr commented May 8, 2024

> We have seventy domains that we deploy through an Azure DevOps pipeline. We now use @batchSize(1) to give the automatic cleanup time to do its thing, and then it works, but that makes the deployment take longer. [...] Hope this gives more clarity about the problem.

It does, thanks @aavdberg.

And it does yet again surface two challenges with Deployment objects in Azure:

  1. The cleanup apparently is too slow (which is quite surprising to me, as the threshold to start deleting is 700 and the limit of deployments is 800)
  2. Deployments take a lot longer than native resource deployments. For solutions that deploy some heavyweights that take a few minutes to deploy anyway, they make a lot of sense - but for something where you deploy e.g. 70 instances of a resource, I would personally recommend using the AVM module as a reference to implement a Bicep-native resource deployment (see the sketch at the end of this comment). This would potentially create only 1 deployment instead of 70 (if implemented accordingly) and be faster as a result.
    I used to see similar things happening with e.g. policy assignments, where someone tried to deploy around 300 of them and it just took ages, while a Bicep-native implementation took only a few minutes or less.

I guess what I'm trying to say is that there is always a trade-off. The good news is that the PG is aware and is working on speeding deployments up. ETA unknown, though.
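To make that concrete, here is a minimal sketch (an illustration under assumptions, not the AVM module itself) of what a Bicep-native variant could look like for the scenario above. Only A records are shown; the other record types would follow the same pattern, and varARecordEntries is a hypothetical flattened list of { zoneName, recordName, ttl, ipv4Address } objects derived from varPublicDnsZones.

// Native resources instead of module calls: the whole loop runs inside a single deployment
resource publicDnsZones 'Microsoft.Network/dnsZones@2018-05-01' = [
  for dnszone in varPublicDnsZones: {
    name: dnszone.name
    location: 'global'
  }
]

// A records only, for brevity. The '<zone>/<record>' name is used because nested child
// resources are not allowed inside a resource loop; Bicep only raises a linter suggestion for it.
resource aRecords 'Microsoft.Network/dnsZones/A@2018-05-01' = [
  for entry in varARecordEntries: {
    name: '${entry.zoneName}/${entry.recordName}'
    properties: {
      TTL: entry.ttl
      ARecords: [
        {
          ipv4Address: entry.ipv4Address
        }
      ]
    }
    dependsOn: [
      publicDnsZones
    ]
  }
]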

@ChrisSidebotham
Contributor

@alex-frankel - Would you be able to assist with an ETA for the deployments work Alex mentioned above? Is there a better place for this issue regarding deployments than the current BRM repository?

@ChrisSidebotham
Contributor

@sydkar - Any update on moving this to the correct area?

@sydkar

sydkar commented May 22, 2024

@ChrisSidebotham No update right now; I'm trying to get an ETA on the deployment speed-up work. I'll let you know by the end of the week.

@sydkar sydkar transferred this issue from Azure/bicep-registry-modules May 22, 2024
@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs: Triage 🔍 Maintainers need to triage still label May 22, 2024
@ChrisSidebotham ChrisSidebotham removed their assignment May 23, 2024
@anthony-c-martin anthony-c-martin removed the Needs: Attention 👋 Reply has been added to issue, maintainer to review label Jun 19, 2024
@alex-frankel

Sorry about the delay on this one. Is @teemukom's comment accurate? If so, then that is likely the source of the issue, no?

@aavdberg, can you take a look at the Deployment resources in the Resource Group (there is a "Deployments" tab in the table of contents of the Resource Group blade in the portal)? Does it eventually get pruned back down to 700? If so, then that suggests @AlexanderSehr is correct that we are not able to keep up with the pace of deployments getting created.

The fact that the deployment works with @batchSize(1) also suggests it's our inability to keep up.

@aavdberg if this is still occurring, can you share a recent correlation ID and we can investigate further?

@alex-frankel

I'm also curious what the varPublicDnsZones variable looks like. That will help us understand how many sub-modules will be created for a given record.

@alex-frankel alex-frankel removed the Needs: Triage 🔍 Maintainers need to triage still label Jul 1, 2024
@aavdberg
Author

aavdberg commented Jul 2, 2024

Sent you a ping in Teams, @alex-frankel.

@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs: Triage 🔍 Maintainers need to triage still label Jul 2, 2024
@aavdberg
Author

aavdberg commented Jul 2, 2024

When using @batchSize(1), the deployment slows down and the deployment history cleanup has time to clean up the old deployments.

@alex-frankel

Just got off a call with @aavdberg, and he is creating a large number of DNS records. With the current structure, each new record (even if the record is a new module) will produce a unique deployment. All the deployments happen in parallel, which is why the deployment cleanup process is not able to keep up.

@ChrisSidebotham / @AlexanderSehr / @jtracey93 -- can we consider refactoring this AVM module to not produce so many nested modules? While we ideally would be able to (and plan to) raise the limit, it won't happen in the short-term, so refactoring the module is the only short-term fix that I feel is practical.

The refactor would require creating all of the records in the entry point file for the module instead of in a nested deployment. I realize this reduces the separation of concerns, but that might be a worthy tradeoff to increase usability for the AVM consumer. Thoughts?
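For illustration only, a rough sketch of what that could look like inside the module's entry point file (the parameter and property names here are assumptions, not the module's actual interface): record sets become direct child resources of the zone instead of nested module calls, so one module invocation produces a single deployment.

resource dnsZone 'Microsoft.Network/dnsZones@2018-05-01' = {
  name: name
  location: 'global'
}

// Example for a hypothetical 'a' parameter; the other record-type parameters would follow the same pattern
resource aRecordSets 'Microsoft.Network/dnsZones/A@2018-05-01' = [
  for record in (a ?? []): {
    parent: dnsZone
    name: record.name
    properties: {
      TTL: record.?ttl ?? 3600
      ARecords: record.?aRecords
    }
  }
]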

@AlexanderSehr
Contributor

AlexanderSehr commented Jul 8, 2024

> When using @batchSize(1), the deployment slows down and the deployment history cleanup has time to clean up the old deployments.

1 is maybe a bit ... low and 50 or 100 may do the trick 😄
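For example, the loop from the earlier snippet with only the decorator value changed (50 is just an illustrative value):

@batchSize(50)
module publicdnszone 'br/public:avm/res/network/dns-zone:0.2.4' = [
  for dnszone in varPublicDnsZones: {
    name: 'deploy-PublicDnsZone-${dnszone.name}'
    params: {
      name: dnszone.name
      // ...remaining parameters as in the earlier snippet
    }
  }
]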

@alex-frankel

@AlexanderSehr, @aavdberg mentions that even with a batchSize of 10, he is running into this issue:

> We also tried a batch size of ten, but that fails as well because the automatic cleanup does not go fast enough.

Can we consider refactoring this module?

@AlexanderSehr
Contributor

AlexanderSehr commented Jul 8, 2024

> Just got off a call with @aavdberg, and he is creating a large number of DNS records. [...] can we consider refactoring this AVM module to not produce so many nested modules? [...]

Hola Mr @alex-frankel,
we could consider it - but - I want to provide some context that may / should influence the ultimate path. Bear with me 😄

As you may know, coming from CARML, we've traditionally written modules in a way that aligns closely with the Azure REST API's structure (i.e., how resource types are nested in provider namespaces and may have child resource types) and treated child resource types as full-on modules that we published independently of their parent.

Now, since we migrated the library to AVM, we don't do this anymore due to technical limitations in the backend (certain manual updates that can make this quite painful for every new module). There is hope to work through this eventually, but it's what we have today. We've already seen several issues opened asking for the child modules, so there's definitely a need beyond what we ourselves use within Microsoft and at customers.

Coming back to the issue at hand: the intuitive alignment with the API structure and the comprehensive publishing are the main drivers behind the design of the modules today - applied consistently. As a result, child resource types are always nested in their own module file and are tested against the same rule set as their parent (with the sole exception of the missing version.json file).

Some additional thoughts:
If we tried to radically reduce the number of deployments in AVM modules, we could do so essentially everywhere there is no nested loop. This would / should affect a lot of modules, but it would come at the price of making it impossible to publish them as their own modules.
The compromise would be to do so only for modules where you may not need a separately published child module. This may even be the case for the DNS Zone; personally, I can't judge that, even though I've seen a wide range of customer setups that slice and dice the permissions they grant in very specific ways. The other problem is that it would blow a big hole into the guidelines, compliance tests & consistency, effectively requiring a debate over every resource type of every resource.

Personally, I value consistency a lot, as its benefits usually massively outweigh the benefits of inconsistency. I can't, however, argue with the fact that in this case it comes at the very unfortunate cost of deployment speed and the deployment cleanup. So much for my personal thoughts on the matter. I'm sure others have some thoughts too :)

@alex-frankel

Ultimately, your call :)

I don't think the deployment speed fix is a meaningful fix to this issue. That being said, @anthony-c-martin what is the status on the rollout for the performance improvements with nested modules?

@AlexanderSehr
Contributor

AlexanderSehr commented Jul 9, 2024

> Ultimately, your call :)
>
> I don't think the deployment speed fix is a meaningful fix to this issue. [...]

That's true. The 'speed' is an issue by itself, while this issue is about the number of deployments and their cleanup. May have gotten a bit sidetracked towards the end 😏 Will update.

Regardless, it surely can be 'our' call as a collective. The information provided just paints the picture of why the structure is the way it is, and what my subjective thoughts on the matter are 😉. I'm sure there's a wide variety of opinions and maybe even more ideas (:

@alex-frankel

@aavdberg as an aside, if you'd like to be onboarded to the Deployments performance improvement that @AlexanderSehr referenced, we can onboard you manually. Just email me your subscription ID(s). We are in the process of rolling it out to everyone, but we are going to do that slowly and progressively.

@aavdberg
Author

@alex-frankel sent you a ping in Teams with the subscription ID.

@cedricbraekevelt

Hi @alex-frankel,

Is the deployment performance improvement open to anyone?

Important

The "Needs: Triage 🔍" label must be removed once the triage process is complete!

Tip

For additional guidance on how to triage this issue/PR, see the AVM Issue Triage documentation.

@alex-frankel alex-frankel transferred this issue from Azure/bicep Jul 25, 2024
@AlexanderSehr AlexanderSehr added Type: AVM 🅰️ ✌️ Ⓜ️ This is an AVM related issue Language: Bicep 💪 This is related to the Bicep IaC language Needs: Core Team 🧞 This item needs the AVM Core Team to review it and removed Needs: Triage 🔍 Maintainers need to triage still labels Jul 25, 2024
@AlexanderSehr AlexanderSehr self-assigned this Jul 25, 2024
Projects
Status: Needs: Triage
Status: Todo
Development

No branches or pull requests

8 participants