
Some resources are failing in CD because they are not unique enough #443

Merged
juandiegopalomino merged 1 commit into dev from jd/fixing-cd on Oct 26, 2021

Conversation

juandiegopalomino
Collaborator

Description

Honestly, I'm surprised it took this long to spot this. Just adding some random hashes, based on the failures here:
https://github.com/run-x/opta/actions/runs/1379261534

Safety checklist

  • This change is backwards compatible and safe to apply by existing users
  • This change will NOT lead to data loss
  • This change will NOT lead to downtime for users who already have an env/service set up

How has this change been tested, besides unit tests?

Ran locally, verified resources were not being destroyed

Honestly, surprised it took this long to spot this
resource "aws_docdb_cluster_instance" "cluster_instances" {
  count      = var.instance_count
- identifier = "opta-${var.layer_name}-${var.module_name}-${count.index}"
+ identifier = "opta-${var.layer_name}-${var.module_name}-${random_string.db_name_hash.result}-${count.index}"
}
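The new identifier references random_string.db_name_hash, whose definition is not shown in this excerpt. A minimal sketch of how such a resource is typically declared (the length and character settings here are assumptions, not taken from the PR):

```hcl
# Hypothetical definition of the hash referenced above. random_string keeps
# its value in Terraform state after the first apply, so the identifier
# stays stable across runs and existing instances are not replaced.
resource "random_string" "db_name_hash" {
  length  = 4     # assumed length; short, to stay under name-length limits
  special = false # AWS identifiers only allow letters, digits, and hyphens
  upper   = false # DB instance identifiers must be lowercase
}
```

This matches the "resources were not being destroyed" verification in the description: because the suffix is generated once and then persisted, adding it changes names only for newly created resources.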
Collaborator

Would it be better to add the env name instead of a random hash?

Collaborator Author

Then we would start facing name-length limits. We would still hit them if the layer and module names are too long, but I just wanted to change one thing at a time.
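One way to defuse the length concern raised here would be to truncate the user-influenced part of the name while keeping the unique suffix intact. A sketch, not from the PR (the 63-character cap matches AWS DB instance identifier limits; the budget arithmetic is illustrative):

```hcl
# Hypothetical guard against over-long identifiers: cap the prefix that
# contains user-provided names, leaving room for "-<hash>-<index>".
locals {
  base = "opta-${var.layer_name}-${var.module_name}"
  # 63 total - 1 hyphen - 4 hash chars - 1 hyphen - up to 2 index digits = 55
  prefix = substr(local.base, 0, 55)
}
```

Truncation has its own collision risk for very similar long names, which is exactly what the random hash suffix compensates for.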

Collaborator

Are we 100% sure this is backwards compatible?

Collaborator

Using the *_prefix fields would be better here when available (identifier_prefix in this case). Not sure if that is something we can seamlessly switch to here, though.
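For reference, the *_prefix alternative mentioned above would look roughly like this; Terraform appends a unique suffix itself, so no random_string is needed. This is a sketch, not what the PR does:

```hcl
# Hypothetical use of identifier_prefix instead of identifier. Caveat:
# switching an existing resource from identifier to identifier_prefix
# forces replacement, which is likely why it cannot be adopted seamlessly
# for already-provisioned clusters.
resource "aws_docdb_cluster_instance" "cluster_instances" {
  count             = var.instance_count
  identifier_prefix = "opta-${var.layer_name}-${var.module_name}-"
  # (cluster_identifier, instance_class, etc. omitted)
}
```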

Collaborator

Generally, we also want to be careful about generating names that are too long for the resource type, especially since we include user-provided values in those names and don't have strong control over their length.

Collaborator

We need validators for every user-input field so we can check these constraints up front and throw a more helpful error message.
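At the Terraform layer, one form such a validator could take is a variable validation block, which fails early with a readable message instead of an opaque AWS API error. A sketch under assumed limits (the 24-character cap and the variable name are illustrative; opta could equally enforce this in its own CLI before generating Terraform):

```hcl
# Hypothetical input validation; the specific limit is an assumption.
variable "layer_name" {
  type = string
  validation {
    condition     = length(var.layer_name) <= 24 && can(regex("^[a-z0-9-]+$", var.layer_name))
    error_message = "layer_name must be at most 24 characters of lowercase letters, digits, or hyphens."
  }
}
```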


codecov bot commented Oct 25, 2021

Codecov Report

Merging #443 (f79dd31) into dev (c0337f5) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##              dev     #443   +/-   ##
=======================================
  Coverage   70.11%   70.11%           
=======================================
  Files          88       88           
  Lines        5167     5167           
=======================================
  Hits         3623     3623           
  Misses       1544     1544           
Flag Coverage Δ
unittests 70.11% <ø> (ø)

Flags with carried forward coverage won't be shown.


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0337f5...f79dd31.

@juandiegopalomino juandiegopalomino merged commit 57c6157 into dev Oct 26, 2021
@juandiegopalomino juandiegopalomino deleted the jd/fixing-cd branch October 26, 2021 16:01
bigbitbus added a commit that referenced this pull request Nov 1, 2021
* Attempting more graceful usage of git dependency (#393)

Import is not within a try-catch and only imported if the code for its usage is invoked.

* Nicer plan displayer (#391)

* Nicer plan displayer

1. Using tables to show data about changes
2. Only 2 security levels "LOW RISK" and "HIGH RISK". Low risk represents no expected dataloss and minimal, recoverable downtime

* more colors

* typo

* addressing cr

* max node count increase should be benign

* addressing cr

* Preemptive GCP instances. (#394)

* Fix unnecessary logs and other minor issues (#395)

* fix-stuff

* revert-pipfile

* fix-dict

* fix-lint

* fix-lint

* Update nightly.yaml

* Checking b/w 2 versions will now verify no dataloss (#398)

* Feature/dashlocal (#397)

Added --local functionality for Opta local runs.

* Check in ci that pdb is not mentioned anywhere in the code (#401)

* Check in ci that pdb is not mentioned anywhere in the code

* wip

* wip

* Tiny fix for our runx module (#405)

* Jd/fixing displayer (#406)

* plan displayer handle delete

* destroy was not a valid value

* adding deletions to test

* Fixed local yaml quote issue (#404)

* Jd/fixing aws destroy (#407)

* AWS destroy facing issue b/c cli can't clean up auto created sec group

terraform-aws-modules/terraform-aws-vpc#283

* wip

* terraform fmt

* Adding Secondary Gcp NodePool Opta module (#403)

* Adding Secondary Gcp NodePool Opta module

* Update the GCP Node Pool name

* Update gcp-env.yml example and fix terraform lint

* Add IAM Member permissions

* Supporting a list of helm values files (#409)

Also, now we check for relative path and support it

* Updating to linkerd v2.10.2 (#408)

* Updating to linkerd v2.10.2

OK, looks like all of the important work was already taken care of earlier as we skipped the outbound mysql and postgres ports already.

More good news: the linkerd visualization components are now their own separate charts so linkerd resource overhead will drop a lot.

https://linkerd.io/2.10/tasks/upgrading-2.10-ports-and-protocols/#
https://linkerd.io/2.10/tasks/upgrade/#upgrade-notice-stable-2-10-0

* terraform fmt

* Updated release helper script (#412)

Prettier, more compact, less repetitive. Scrolling still not there but the updates makes that far less likely

* Fixing azure destroy (#414)

* Fixing azure destroy

Sometime in the past week Azure terraform started failing when trying to destroy the acr key vault key because we disabled purge and yet destroy causes a purge. Found the new toggle to just do a soft delete on purge and confirmed that it worked.

* Disabling regular rule for purge

* Reverting azure provider to version 2.78.0

Honestly, I'm very disappointed with Azure:
Azure/AKS#2584

* Fixing gcp dns delegation

* Disabling ssl for gcp postgres (#415)

Doing this because in order to enable ssl for postgres a user would need to download the ssl CA/key files and include them in all outgoing connections, which makes psql incredibly cumbersome to use. So for now, we won't be supporting it.

* Forgot to update regula to not complain about missing gcp postgres ssl (#416)

* Merge main to dev (#420)

* Add yaml syntax highlighting (#402)

* Release 0.15.0 (#410)

* Attempting more graceful usage of git dependency (#393)

Import is not within a try-catch and only imported if the code for its usage is invoked.

* Nicer plan displayer (#391)

* Nicer plan displayer

1. Using tables to show data about changes
2. Only 2 security levels "LOW RISK" and "HIGH RISK". Low risk represents no expected dataloss and minimal, recoverable downtime

* more colors

* typo

* addressing cr

* max node count increase should be benign

* addressing cr

* Preemptive GCP instances. (#394)

* Fix unnecessary logs and other minor issues (#395)

* fix-stuff

* revert-pipfile

* fix-dict

* fix-lint

* fix-lint

* Update nightly.yaml

* Checking b/w 2 versions will now verify no dataloss (#398)

* Feature/dashlocal (#397)

Added --local functionality for Opta local runs.

* Check in ci that pdb is not mentioned anywhere in the code (#401)

* Check in ci that pdb is not mentioned anywhere in the code

* wip

* wip

* Tiny fix for our runx module (#405)

* Jd/fixing displayer (#406)

* plan displayer handle delete

* destroy was not a valid value

* adding deletions to test

* Fixed local yaml quote issue (#404)

* Jd/fixing aws destroy (#407)

* AWS destroy facing issue b/c cli can't clean up auto created sec group

terraform-aws-modules/terraform-aws-vpc#283

* wip

* terraform fmt

* Adding Secondary Gcp NodePool Opta module (#403)

* Adding Secondary Gcp NodePool Opta module

* Update the GCP Node Pool name

* Update gcp-env.yml example and fix terraform lint

* Add IAM Member permissions

* Supporting a list of helm values files (#409)

Also, now we check for relative path and support it

* Updating to linkerd v2.10.2 (#408)

* Updating to linkerd v2.10.2

OK, looks like all of the important work was already taken care of earlier as we skipped the outbound mysql and postgres ports already.

More good news: the linkerd visualization components are now their own separate charts so linkerd resource overhead will drop a lot.

https://linkerd.io/2.10/tasks/upgrading-2.10-ports-and-protocols/#
https://linkerd.io/2.10/tasks/upgrade/#upgrade-notice-stable-2-10-0

* terraform fmt

* Fixing gcp dns delegation

* Fixed relative path issues for yaml files (#411)

* Disabling ssl for gcp postgres (#415)

Doing this because in order to enable ssl for postgres a user would need to download the ssl CA/key files and include them in all outgoing connections, which makes psql incredibly cumbersome to use. So for now, we won't be supporting it.

* Forgot to update regula to not complain about missing gcp postgres ssl (#416)

* Fixing azure destroy (#414) (#418)

* Fixing azure destroy

Sometime in the past week Azure terraform started failing when trying to destroy the acr key vault key because we disabled purge and yet destroy causes a purge. Found the new toggle to just do a soft delete on purge and confirmed that it worked.

* Disabling regular rule for purge

* Reverting azure provider to version 2.78.0

Honestly, I'm very disappointed with Azure:
Azure/AKS#2584

* Fixed terraform local working dir (#413)

Co-authored-by: Juan Diego Palomino <[email protected]>
Co-authored-by: Nilesh Sarupriya <[email protected]>
Co-authored-by: Sachin Agarwal <[email protected]>

Co-authored-by: Anthony Campolo <[email protected]>
Co-authored-by: Nitin Aggarwal <[email protected]>
Co-authored-by: Juan Diego Palomino <[email protected]>
Co-authored-by: Nilesh Sarupriya <[email protected]>

* Add a prompt for configuration file. (#419)

* Add a prompt for configuration file.

* Added Test cases.

* Refactoring

* Update comment

Co-authored-by: Nitin Aggarwal <[email protected]>

* Improvements to logs and helper strings (#423)

* improvements

* undo-pipfile

* ci

* Deleting that old debugger I made ages ago and was hidden since February (#425)

* Adding the tags for alb ingress to vpc (#426)

https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/subnet_discovery/
Adding tags hurts nothing so should be zero problem

* Feat/nilesh/runx 800 better creds error (#424)

* Compare System Configured Credentials (AWS/GCP)

* Comment fix.

* Update Error Message

* Updating helm module docs to be less confusing (#428)

* Add support for Multiple Instance in aws-documentdb (#427)

* Add support for Multiple Instance in aws-documentdb

* Updated Test Cases. Validate Document DB Instance count.

* Lint fixes.

* Opta Module Uniqueness check (#429)

* Opta Module uniqueness check

* Adding the Uniqueness check for required Opta modules.

* Adding the Uniqueness check for required Opta modules.

* Added Uniqueness check for external-ssl-cert

* Persistent storage option for k8s services (#430)

* Persistent storage option for k8s services

* Terraform fmt

* Addressing cr

* Addressing cr

* Unified helm chart for k8s service (#433)

Turns out that the only difference inside the chart was the service account annotations, which we can just put placeholders for lol

* Fixing a bug where due to a silly fix gcp k8s base needed gcp dns (#435)

* Retool example (#431)

* retool

* updates

* Enhanced amplitude event properties (#438)

* Enhanced amplitude event properties

1. All will have parent name if applicable
2. There will be a module_* count for certain modules we wish to keep track of
3. There is a new event for recording the end of an apply.

* Addressing cr

* fixing tests

* Fixing tests

* addressing cr

* addressing cr

* lol

* Add support for custom JSON encoding (#440)

* Add support for custom JSON encoding

* Add tests

* Fix lint issues

* Validate encoding matches stdlib

Co-authored-by: Patrick Fiedler <[email protected]>

* Not sure how it happened, but I forgot a couple of metric counts (#442)

* Some resources are failing in CD because they are not unique enough (#443)

Honestly, surprised it took this long to spot this

* AWS TF resources with Valid Resource Names (#447)

* AWS TF resources with Valid Resource Names

* Fix TF Formatting.

* Check if DynamoDB Exists (#449)

* Check if DynamoDB Exists

* Fix Lint.

* Postgres db identifier bugfix (#446)

This is not backwards incompatible because
1. We have lifecycle ignore changes on the db identifier
2. Upper case letters in the db identifier is not allowed, so there would be no identifiers of running clusters who would be changed

* Handling routing without domain via all/path,  */path and /path (#450)

* Dynamodb module (#444)

* Dynamodb module

no local or secondary index for now

* linting/formating

* addressing cr

* addressing cr

* addressing cr

* testfixing

* addressing nitn's comments

* Ignore capitalization of auto (#452)

* Ignore capitalization of auto

* terraform fmt

* adding more tests

* Make local k8s service use universal helm chart (#434)

I think it should work as straightforward as with the other clouds
also there were some features missing from the local k8s service which should now be present.

Deleting

* Fix k8s-service failing when public_uri is not specified (#454)

Co-authored-by: Patrick Fiedler <[email protected]>

Co-authored-by: Juan Diego Palomino <[email protected]>
Co-authored-by: Nilesh Sarupriya <[email protected]>
Co-authored-by: Nitin Aggarwal <[email protected]>
Co-authored-by: Anthony Campolo <[email protected]>
Co-authored-by: Patrick Fiedler <[email protected]>
Co-authored-by: Patrick Fiedler <[email protected]>