
Some resources are failing in CD because they are not unique enough #443

Merged
juandiegopalomino merged 1 commit into dev from jd/fixing-cd on Oct 26, 2021

Conversation

juandiegopalomino
Collaborator

Description

Honestly, I'm surprised it took this long to spot this. Just adding some random hashes, based on the failures here:
https://github.com/run-x/opta/actions/runs/1379261534

Safety checklist

  • This change is backwards compatible and safe to apply by existing users
  • This change will NOT lead to data loss
  • This change will NOT lead to downtime for users who already have an env/service set up

How has this change been tested, besides unit tests?

Ran locally, verified resources were not being destroyed

Honestly, surprised it took this long to spot this
resource "aws_docdb_cluster_instance" "cluster_instances" {
  count      = var.instance_count
- identifier = "opta-${var.layer_name}-${var.module_name}-${count.index}"
+ identifier = "opta-${var.layer_name}-${var.module_name}-${random_string.db_name_hash.result}-${count.index}"
}
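The new identifier references random_string.db_name_hash, whose definition is not shown in this excerpt. A minimal sketch of how such a resource is typically declared (the length and character settings here are assumptions, not taken from the PR):

```hcl
# Hypothetical definition of the hash referenced above. random_string keeps
# its value in Terraform state after the first apply, so the identifier
# stays stable across runs and existing instances are not replaced.
resource "random_string" "db_name_hash" {
  length  = 4     # assumed length; short, to stay under name-length limits
  special = false # AWS identifiers only allow letters, digits, and hyphens
  upper   = false # DB instance identifiers must be lowercase
}
```

This matches the "resources were not being destroyed" verification in the description: because the suffix is generated once and then persisted, adding it changes names only for newly created resources.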
Collaborator

Would it be better to add the env name instead of a random hash?

Collaborator Author

Then we would start facing name-length limits. We would still hit them if the layer and module names are too long, but I just wanted to change one thing at a time.
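One way to defuse the length concern raised here would be to truncate the user-influenced part of the name while keeping the unique suffix intact. A sketch, not from the PR (the 63-character cap matches AWS DB instance identifier limits; the budget arithmetic is illustrative):

```hcl
# Hypothetical guard against over-long identifiers: cap the prefix that
# contains user-provided names, leaving room for "-<hash>-<index>".
locals {
  base = "opta-${var.layer_name}-${var.module_name}"
  # 63 total - 1 hyphen - 4 hash chars - 1 hyphen - up to 2 index digits = 55
  prefix = substr(local.base, 0, 55)
}
```

Truncation has its own collision risk for very similar long names, which is exactly what the random hash suffix compensates for.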

Collaborator

Are we 100% sure this is backwards compatible?

Collaborator

Using the *_prefix fields would be better here when available (identifier_prefix in this case). Not sure if that is something we can seamlessly switch to here, though.
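For reference, the *_prefix alternative mentioned above would look roughly like this; Terraform appends a unique suffix itself, so no random_string is needed. This is a sketch, not what the PR does:

```hcl
# Hypothetical use of identifier_prefix instead of identifier. Caveat:
# switching an existing resource from identifier to identifier_prefix
# forces replacement, which is likely why it cannot be adopted seamlessly
# for already-provisioned clusters.
resource "aws_docdb_cluster_instance" "cluster_instances" {
  count             = var.instance_count
  identifier_prefix = "opta-${var.layer_name}-${var.module_name}-"
  # (cluster_identifier, instance_class, etc. omitted)
}
```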

Collaborator

Generally, we also want to be careful about generating names that are too long for the resource type, especially since we include user-provided values in those names and don't have strong control over their length.

Collaborator

We need validators for every user-input field so we can check these constraints up front and throw a more helpful error message.
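At the Terraform layer, one form such a validator could take is a variable validation block, which fails early with a readable message instead of an opaque AWS API error. A sketch under assumed limits (the 24-character cap and the variable name are illustrative; opta could equally enforce this in its own CLI before generating Terraform):

```hcl
# Hypothetical input validation; the specific limit is an assumption.
variable "layer_name" {
  type = string
  validation {
    condition     = length(var.layer_name) <= 24 && can(regex("^[a-z0-9-]+$", var.layer_name))
    error_message = "layer_name must be at most 24 characters of lowercase letters, digits, or hyphens."
  }
}
```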


codecov bot commented Oct 25, 2021

Codecov Report

Merging #443 (f79dd31) into dev (c0337f5) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##              dev     #443   +/-   ##
=======================================
  Coverage   70.11%   70.11%           
=======================================
  Files          88       88           
  Lines        5167     5167           
=======================================
  Hits         3623     3623           
  Misses       1544     1544           
Flag Coverage Δ
unittests 70.11% <ø> (ø)

Flags with carried forward coverage won't be shown.


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0337f5...f79dd31.

@juandiegopalomino juandiegopalomino merged commit 57c6157 into dev Oct 26, 2021
@juandiegopalomino juandiegopalomino deleted the jd/fixing-cd branch October 26, 2021 16:01
bigbitbus added a commit that referenced this pull request Nov 1, 2021
* Attempting more graceful usage of git dependency (#393)

Import is not within a try-catch and only imported if the code for its usage is invoked.

* Nicer plan displayer (#391)

* Nicer plan displayer

1. Using tables to show data about changes
2. Only 2 security levels "LOW RISK" and "HIGH RISK". Low risk represents no expected dataloss and minimal, recoverable downtime

* more colors

* typo

* addressing cr

* max node count increase should be benign

* addressing cr

* Preemptive GCP instances. (#394)

* Fix unnecessary logs and other minor issues (#395)

* fix-stuff

* revert-pipfile

* fix-dict

* fix-lint

* fix-lint

* Update nightly.yaml

* Checking b/w 2 versions will now verify no dataloss (#398)

* Feature/dashlocal (#397)

Added --local functionality for Opta local runs.

* Check in ci that pdb is not mentioned anywhere in the code (#401)

* Check in ci that pdb is not mentioned anywhere in the code

* wip

* wip

* Tiny fix for our runx module (#405)

* Jd/fixing displayer (#406)

* plan displayer handle delete

* destroy was not a valid value

* adding deletions to test

* Fixed local yaml quote issue (#404)

* Jd/fixing aws destroy (#407)

* AWS destroy facing issue b/c cli can't clean up auto created sec group

terraform-aws-modules/terraform-aws-vpc#283

* wip

* terraform fmt

* Adding Secondary Gcp NodePool Opta module (#403)

* Adding Secondary Gcp NodePool Opta module

* Update the GCP Node Pool name

* Update gcp-env.yml example and fix terraform lint

* Add IAM Member permissions

* Supporting a list of helm values files (#409)

Also, now we check for relative path and support it

* Updating to linkerd v2.10.2 (#408)

* Updating to linkerd v2.10.2

OK, looks like all of the important work was already taken care of earlier as we skipped the outbound mysql and postgres ports already.

More good news: the linkerd visualization components are now their own separate charts so linkerd resource overhead will drop a lot.

https://linkerd.io/2.10/tasks/upgrading-2.10-ports-and-protocols/#
https://linkerd.io/2.10/tasks/upgrade/#upgrade-notice-stable-2-10-0

* terraform fmt

* Updated release helper script (#412)

Prettier, more compact, less repetitive. Scrolling still not there but the updates makes that far less likely

* Fixing azure destroy (#414)

* Fixing azure destroy

Sometime in the past week Azure terraform started failing when trying to destroy the acr key vault key because we disabled purge and yet destroy causes a purge. Found the new toggle to just do a soft delete on purge and confirmed that it worked.

* Disabling regular rule for purge

* Reverting azure provider to version 2.78.0

Honestly, I'm very disappointed with Azure:
Azure/AKS#2584

* Fixing gcp dns delegation

* Disabling ssl for gcp postgres (#415)

Doing this because in order to enable ssl for postgres a user would need to download the ssl CA/key files and include them in all outgoing connections, which makes psql incredibly cumbersome to use. So for now, we won't be supporting it.

* Forgot to update regula to not complain about missing gcp postgres ssl (#416)

* Merge main to dev (#420)

* Add yaml syntax highlighting (#402)

* Release 0.15.0 (#410)

* Attempting more graceful usage of git dependency (#393)

Import is not within a try-catch and only imported if the code for its usage is invoked.

* Nicer plan displayer (#391)

* Nicer plan displayer

1. Using tables to show data about changes
2. Only 2 security levels "LOW RISK" and "HIGH RISK". Low risk represents no expected dataloss and minimal, recoverable downtime

* more colors

* typo

* addressing cr

* max node count increase should be benign

* addressing cr

* Preemptive GCP instances. (#394)

* Fix unnecessary logs and other minor issues (#395)

* fix-stuff

* revert-pipfile

* fix-dict

* fix-lint

* fix-lint

* Update nightly.yaml

* Checking b/w 2 versions will now verify no dataloss (#398)

* Feature/dashlocal (#397)

Added --local functionality for Opta local runs.

* Check in ci that pdb is not mentioned anywhere in the code (#401)

* Check in ci that pdb is not mentioned anywhere in the code

* wip

* wip

* Tiny fix for our runx module (#405)

* Jd/fixing displayer (#406)

* plan displayer handle delete

* destroy was not a valid value

* adding deletions to test

* Fixed local yaml quote issue (#404)

* Jd/fixing aws destroy (#407)

* AWS destroy facing issue b/c cli can't clean up auto created sec group

terraform-aws-modules/terraform-aws-vpc#283

* wip

* terraform fmt

* Adding Secondary Gcp NodePool Opta module (#403)

* Adding Secondary Gcp NodePool Opta module

* Update the GCP Node Pool name

* Update gcp-env.yml example and fix terraform lint

* Add IAM Member permissions

* Supporting a list of helm values files (#409)

Also, now we check for relative path and support it

* Updating to linkerd v2.10.2 (#408)

* Updating to linkerd v2.10.2

OK, looks like all of the important work was already taken care of earlier as we skipped the outbound mysql and postgres ports already.

More good news: the linkerd visualization components are now their own separate charts so linkerd resource overhead will drop a lot.

https://linkerd.io/2.10/tasks/upgrading-2.10-ports-and-protocols/#
https://linkerd.io/2.10/tasks/upgrade/#upgrade-notice-stable-2-10-0

* terraform fmt

* Fixing gcp dns delegation

* Fixed relative path issues for yaml files (#411)

* Disabling ssl for gcp postgres (#415)

Doing this because in order to enable ssl for postgres a user would need to download the ssl CA/key files and include them in all outgoing connections, which makes psql incredibly cumbersome to use. So for now, we won't be supporting it.

* Forgot to update regula to not complain about missing gcp postgres ssl (#416)

* Fixing azure destroy (#414) (#418)

* Fixing azure destroy

Sometime in the past week Azure terraform started failing when trying to destroy the acr key vault key because we disabled purge and yet destroy causes a purge. Found the new toggle to just do a soft delete on purge and confirmed that it worked.

* Disabling regular rule for purge

* Reverting azure provider to version 2.78.0

Honestly, I'm very disappointed with Azure:
Azure/AKS#2584

* Fixed terraform local working dir (#413)

Co-authored-by: Juan Diego Palomino <[email protected]>
Co-authored-by: Nilesh Sarupriya <[email protected]>
Co-authored-by: Sachin Agarwal <[email protected]>

Co-authored-by: Anthony Campolo <[email protected]>
Co-authored-by: Nitin Aggarwal <[email protected]>
Co-authored-by: Juan Diego Palomino <[email protected]>
Co-authored-by: Nilesh Sarupriya <[email protected]>

* Add a prompt for configuration file. (#419)

* Add a prompt for configuration file.

* Added Test cases.

* Refactoring

* Update comment

Co-authored-by: Nitin Aggarwal <[email protected]>

* Improvements to logs and helper strings (#423)

* improvements

* undo-pipfile

* ci

* Deleting that old debugger I made ages ago and was hidden since February (#425)

* Adding the tags for alb ingress to vpc (#426)

https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/subnet_discovery/
Adding tags hurts nothing so should be zero problem

* Feat/nilesh/runx 800 better creds error (#424)

* Compare System Configured Credentials (AWS/GCP)

* Comment fix.

* Update Error Message

* Updating helm module docs to be less confusing (#428)

* Add support for Multiple Instance in aws-documentdb (#427)

* Add support for Multiple Instance in aws-documentdb

* Updated Test Cases. Validate Document DB Instance count.

* Lint fixes.

* Opta Module Uniqueness check (#429)

* Opta Module uniqueness check

* Adding the Uniqueness check for required Opta modules.

* Adding the Uniqueness check for required Opta modules.

* Added Uniqueness check for external-ssl-cert

* Persistent storage option for k8s services (#430)

* Persistent storage option for k8s services

* Terraform fmt

* Addressing cr

* Addressing cr

* Unified helm chart for k8s service (#433)

Turns out that the only difference inside the chart was the service account annotations, which we can just put placeholders for lol

* Fixing a bug where due to a silly fix gcp k8s base needed gcp dns (#435)

* Retool example (#431)

* retool

* updates

* Enhanced amplitude event properties (#438)

* Enhanced amplitude event properties

1. All will have parent name if applicable
2. There will be a module_* count for certain modules we wish to keep track of
3. There is a new event for recording the end of an apply.

* Addressing cr

* fixing tests

* Fixing tests

* addressing cr

* addressing cr

* lol

* Add support for custom JSON encoding (#440)

* Add support for custom JSON encoding

* Add tests

* Fix lint issues

* Validate encoding matches stdlib

Co-authored-by: Patrick Fiedler <[email protected]>

* Not sure how it happened, but I forgot a couple of metric counts (#442)

* Some resources are failing in CD because they are not unique enough (#443)

Honestly, surprised it took this long to spot this

* AWS TF resources with Valid Resource Names (#447)

* AWS TF resources with Valid Resource Names

* Fix TF Formatting.

* Check if DynamoDB Exists (#449)

* Check if DynamoDB Exists

* Fix Lint.

* Postgres db identifier bugfix (#446)

This is not backwards incompatible because
1. We have lifecycle ignore changes on the db identifier
2. Upper case letters in the db identifier is not allowed, so there would be no identifiers of running clusters who would be changed

* Handling routing without domain via all/path,  */path and /path (#450)

* Dynamodb module (#444)

* Dynamodb module

no local or secondary index for now

* linting/formating

* addressing cr

* addressing cr

* addressing cr

* testfixing

* addressing nitn's comments

* Ignore capitalization of auto (#452)

* Ignore capitalization of auto

* terraform fmt

* adding more tests

* Make local k8s service use universal helm chart (#434)

I think it should work as straightforward as with the other clouds
also there were some features missing from the local k8s service which should now be present.

Deleting

* Fix k8s-service failing when public_uri is not specified (#454)

Co-authored-by: Patrick Fiedler <[email protected]>

Co-authored-by: Juan Diego Palomino <[email protected]>
Co-authored-by: Nilesh Sarupriya <[email protected]>
Co-authored-by: Nitin Aggarwal <[email protected]>
Co-authored-by: Anthony Campolo <[email protected]>
Co-authored-by: Patrick Fiedler <[email protected]>
Co-authored-by: Patrick Fiedler <[email protected]>