This project is an example of how you can combine the AWS Cloud Development Kit (CDK) and Amazon Elastic Kubernetes Service (EKS) to quickly deploy a more complete and "production ready" Kubernetes environment on AWS.
I describe it a bit more in a recent blog post - https://jason-umiker.medium.com/automating-the-provisioning-of-a-production-ready-kubernetes-cluster-with-aws-eks-cdk-b1f0e8a12723
- An appropriate VPC (a /22 CIDR with 1024 IPs by default - though you can edit this in `eks_cluster.py`) with public and private subnets across three Availability Zones.
  - Alternatively, flip `create_new_vpc` to `False` and then specify the name of your VPC under `existing_vpc_name` in `eks_cluster.py`.
- A new EKS cluster with:
  - A dedicated new IAM role to create it with. The role that creates the cluster is a permanent, and rather hidden, full admin role that doesn't appear in, and isn't subject to, the aws-auth ConfigMap. So you want a dedicated role explicitly for that purpose, like CDK creates for you here, which you can then restrict access to assume unless you need it (e.g. if you lock yourself out of the cluster by making a mistake in the aws-auth ConfigMap).
  - A new Managed Node Group with 3 x m5.large instances spread across 3 Availability Zones.
    - You can change the instance type and quantity by changing `eks_node_quantity` and/or `eks_node_type` at the top of `cluster-bootstrap/eks_cluster.py`.
  - All control plane logging to CloudWatch Logs enabled (defaulting to 1 month's retention within CloudWatch Logs).
- The AWS Load Balancer Controller (https://kubernetes-sigs.github.io/aws-load-balancer-controller) to allow you to seamlessly use ALBs for Ingress and NLB for Services.
- External DNS (https://github.com/kubernetes-sigs/external-dns) to allow you to automatically create/update Route53 entries to point your 'real' names at your Ingresses and Services.
- A new managed Amazon Elasticsearch Domain behind a private VPC endpoint as well as an aws-for-fluent-bit DaemonSet (https://github.com/aws/aws-for-fluent-bit) to ship all your container logs there - including enriching them with the Kubernetes metadata using the kubernetes fluent-bit filter.
- (Temporarily until the AWS Managed Prometheus/Grafana are available) The kube-prometheus Operator (https://github.com/prometheus-operator/kube-prometheus) which deploys you a Prometheus on your cluster that will collect all your cluster metrics as well as a Grafana to visualise them.
- TODO: Add some initial alerts for sensible common items in the cluster via Prometheus/Alertmanager
- The AWS EBS CSI Driver (https://github.com/kubernetes-sigs/aws-ebs-csi-driver). Note that new development on EBS functionality has moved out of the Kubernetes mainline to this externalised CSI driver.
- The AWS EFS CSI Driver (https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html). Note that new development on EFS functionality has moved out of the Kubernetes mainline to this externalised CSI driver.
- An OPA Gatekeeper to enforce preventative security and operational policies (https://github.com/open-policy-agent/gatekeeper). A set of example policies is deployed by default - see `gatekeeper-policies/README.md`.
- The cluster autoscaler (CA) (https://github.com/kubernetes/autoscaler). This will scale your EC2 instances to ensure you have enough capacity to launch all of your Pods as they are deployed/scaled.
- The metrics-server (required for the Horizontal Pod Autoscaler (HPA)) (https://github.com/kubernetes-sigs/metrics-server)
- The Calico Network Policy Provider (https://docs.aws.amazon.com/eks/latest/userguide/calico.html). This enforces any NetworkPolicies that you specify.
- The AWS Systems Manager (SSM) agent. This allows for various management activities (e.g. Inventory, Patching, Session Manager, etc.) of your Instances/Nodes by AWS Systems Manager.
All of the add-ons are optional, and you control whether you get them with variables at the top of `cluster-bootstrap/eks_cluster.py` that you flip to `True`/`False`.
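To give a feel for this, the toggles and sizing settings near the top of `cluster-bootstrap/eks_cluster.py` are of roughly the following shape. The variable names below are the ones mentioned in this README, but the exact list and the default values in the file may differ, so treat this as an illustrative sketch rather than the file's real contents:

```python
# Illustrative sketch of the settings at the top of cluster-bootstrap/eks_cluster.py
# (variable names are from this README; values are examples, not the real defaults)

# Networking - either create a new /22 VPC or reuse an existing one
create_new_vpc = True
existing_vpc_name = ""            # only used when create_new_vpc is False

# Managed Node Group sizing
eks_node_quantity = 3
eks_node_type = "m5.large"

# Optional components - flip to True/False to control what gets deployed
deploy_bastion = True             # EC2 instance running code-server
deploy_vpn = False                # AWS Client VPN endpoint
```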
The Cloud Development Kit (CDK) is a tool that lets you write infrastructure-as-code in 'actual' code (TypeScript, Python, C#, and Java). It takes these languages and 'compiles' them into a CloudFormation template for the AWS CloudFormation engine to then deploy and manage as stacks.
When you develop and deploy infrastructure with the CDK you don't edit the intermediate CloudFormation but, instead, let CDK regenerate it in response to changes in the upstream CDK code.
What makes CDK uniquely good when it comes to our EKS Quickstart is:
- It handles the IAM Roles for Service Accounts (IRSA) rather elegantly and creates the IAM Roles and Policies, creates the Kubernetes service accounts, and then maps them to each other (see the sketch after this list).
- It has implemented custom CloudFormation resources with Lambda invoking kubectl and helm to deploy manifests and charts as part of the cluster provisioning.
- Until we have Managed Add-Ons for the common things with EKS like those above, this can fill the gap and provision us a complete cluster with all the add-ons we need.
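To give a feel for how that looks in CDK Python, here is a minimal sketch - not the actual `eks_cluster.py` code; the construct IDs, Kubernetes version, chart and policy below are placeholders, and it assumes the CDK v1 Python modules (`aws_cdk.core`) - showing a service account wired up with IRSA and a Helm chart deployed through CDK's Lambda-backed kubectl/Helm custom resources:

```python
from aws_cdk import core
from aws_cdk import aws_eks as eks, aws_iam as iam


class SketchStack(core.Stack):
    def __init__(self, scope: core.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # A minimal cluster - the real stack also configures the VPC, logging,
        # the Managed Node Group, etc.
        cluster = eks.Cluster(
            self, "Cluster",
            version=eks.KubernetesVersion.V1_21,
            default_capacity=0,
        )

        # One call creates the IAM Role, the Kubernetes ServiceAccount and the
        # IRSA trust relationship between them
        sa = cluster.add_service_account(
            "ExternalDNSSA", name="external-dns", namespace="kube-system"
        )
        sa.add_to_principal_policy(iam.PolicyStatement(
            actions=["route53:ChangeResourceRecordSets"],
            resources=["arn:aws:route53:::hostedzone/*"],
        ))

        # Helm charts (and raw manifests) are applied by CDK's custom resources,
        # which invoke helm/kubectl via Lambda during provisioning
        cluster.add_helm_chart(
            "ExternalDNSChart",
            chart="external-dns",
            repository="https://charts.bitnami.com/bitnami",
            namespace="kube-system",
            values={"serviceAccount": {"create": False, "name": "external-dns"}},
        )
```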
You can either deploy this from your machine or leverage CodeBuild.
To use the CodeBuild CloudFormation Template:
- Generate a personal access token on GitHub - https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token
- Run `aws codebuild import-source-credentials --server-type GITHUB --auth-type PERSONAL_ACCESS_TOKEN --token <token_value>` to provide your token to CodeBuild
- Deploy `cluster-codebuild/EKSCodeBuildStack.template.json`:
  - using the console and specifying the `GithubUserParameter` with your GitHub login
  - or using the CLI: `aws cloudformation create-stack --stack-name "aws-alb-ineks-quickstart-bootstrap-stack" --template-body file:https://cluster-codebuild/EKSCodeBuildStack.template.json --capabilities CAPABILITY_NAMED_IAM --parameters ParameterKey=GithubUserParameter,ParameterValue=<your-GitHub-login>`
- Go to the CodeBuild console, click on the Build project that starts with `EKSCodeBuild`, and then click the Start build button (or start it from the CLI - see the sketch below)
- (Optional) You can click the Tail logs button to follow along with the build process
NOTE: This also enables a GitOps pattern where changes to the `cluster-bootstrap` folder on the branch mentioned (`main` by default) will re-trigger this CodeBuild to do another `cdk deploy` via webhook.
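If you would rather kick the build off programmatically than click Start build in the console, a small boto3 sketch like the following should work; it assumes only that the project name starts with `EKSCodeBuild`, as noted above:

```python
import boto3

codebuild = boto3.client("codebuild")

# Find the project the CloudFormation stack created (its name starts with
# "EKSCodeBuild") and start a build - the equivalent of clicking "Start build"
project_name = next(
    name for name in codebuild.list_projects()["projects"]
    if name.startswith("EKSCodeBuild")
)
build = codebuild.start_build(projectName=project_name)
print(f"Started build {build['build']['id']}")
```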
There are some prerequisites you will likely need to install on the machine doing your environment bootstrapping, including Node, Python, the AWS CLI, the CDK, fluxctl and Helm.
Run `sudo ./ubuntu-prepreqs.sh`
- Install Homebrew (https://brew.sh/)
- Run `./mac-prereqs.sh`
- Edit your `~/.zshrc` and/or your `~/.bash_profile` to put `/usr/local/bin` at the start of your PATH statement so that the tools installed by brew take precedence over the built-in, often outdated, options like python2.
- Make sure that you have your AWS CLI configured with administrative access to the AWS account in question (e.g. an `aws s3 ls` works)
  - This can be via setting your access key and secret in your .aws folder via `aws configure` or in your environment variables by copying and pasting from AWS SSO etc.
- Run `cd eks-quickstart/cluster-bootstrap`
- Run `sudo npm install --upgrade -g aws-cdk` to ensure your CDK is up to date
- Run `pip install --upgrade -r requirements.txt` to install the required Python bits of the CDK
- Run `export CDK_DEPLOY_REGION=ap-southeast-2`, replacing ap-southeast-2 with your region of choice
- Run `export CDK_DEPLOY_ACCOUNT=123456789123`, replacing 123456789123 with your AWS account number
- (Optional) If you want to make an existing IAM User or Role the cluster admin rather than creating a new one, then edit `eks_cluster.py`, comment out the current `cluster_admin_role` and uncomment the one beneath it, and fill in the ARN of the User/Role you'd like there (see the sketch after this list)
- (Only required the first time you use the CDK in this account) Run `cdk bootstrap` to create the S3 bucket where the CDK puts its artifacts
- (Only required the first time Elasticsearch in VPC mode is used in this account) Run `aws iam create-service-linked-role --aws-service-name es.amazonaws.com`
- Run `cdk deploy --require-approval never`
- (Temporary until it is added to our Helm chart - PR open) Run `kubectl edit configmap fluentbit-aws-for-fluent-bit --namespace=kube-system` and add the following to the bottom: `Replace_Dots On`
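For the optional cluster admin step above, the edit in `eks_cluster.py` amounts to something along these lines - a sketch only, as the construct ID and the role ARN shown are placeholders rather than the file's actual contents:

```python
from aws_cdk import aws_iam as iam

# Default behaviour: create a brand new role to administer the cluster
# cluster_admin_role = iam.Role(
#     self, "ClusterAdminRole",
#     assumed_by=iam.AccountRootPrincipal(),
# )

# Alternative: import an existing IAM User/Role by ARN and use it as the cluster admin
cluster_admin_role = iam.Role.from_role_arn(
    self, "ClusterAdminRole",
    role_arn="arn:aws:iam::123456789123:role/YourExistingAdminRole",
)
```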
We've deployed Flux to deploy - and then keep in sync via GitOps - our default Gatekeeper policies and constraints. In order for that to work, though, we'll need to get the SSH key that Flux generated and add it to GitHub to give us the required access.
fluxctl and the required access are set up on the Bastion - if you have deployed that:
- Connect to the Bastion via Systems Manager Session Manager or code-server
- Run `fluxctl identity --k8s-fwd-ns kube-system`
- Take the SSH key that has been output and add it to GitHub by following these instructions - https://docs.github.com/en/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account
- (Optional) If you don't want to wait up to 5 minutes for Flux to sync you can run `fluxctl sync --k8s-fwd-ns kube-system`
If you set `deploy_bastion` to `True` in `eks_cluster.py` then the template will deploy an EC2 instance running Code Server, which is Visual Studio Code running in your browser.
So there are two ways to reach your Bastion and, through it, your EKS cluster:
- Go to the Systems Manager service in the AWS Console
- Go to Managed Instances on the left hand navigation pane
- Select the instance with the name `EKSClusterStack/CodeServerInstance`
- Under the Instance Actions menu on the upper right choose Start Session
- You need to run `sudo bash` to get to root's profile where we've set up kubectl
- Run `kubectl get nodes` to see that all the tools are there and set up for you
The stack will have an Output with the address to the Bastion and the password for the web interface defaults to the Instance ID of the Bastion Instance (which you can get from the EC2 Console).
NOTE: Since this defaults to HTTP rather than HTTPS, to accommodate accounts without a public Route 53 zone and associated certificates, modern browsers won't allow you to paste with Ctrl-V. You can, however, paste with Shift-Insert (Insert = fn + return on a Mac, so Shift-fn-return on a Mac to paste).
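If you would rather fetch that Instance ID programmatically than look it up in the EC2 Console, a boto3 sketch like this should do it, assuming the instance keeps the `EKSClusterStack/CodeServerInstance` Name tag mentioned above:

```python
import boto3

ec2 = boto3.client("ec2")

# Look the Bastion up by its Name tag and print its Instance ID, which is the
# default password for the code-server web interface
response = ec2.describe_instances(Filters=[
    {"Name": "tag:Name", "Values": ["EKSClusterStack/CodeServerInstance"]},
    {"Name": "instance-state-name", "Values": ["running"]},
])
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"])
```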
Here are a few things to help you familiarise yourself with the Bastion:
- Click the three dashes (I call it the hamburger menu) in the upper left corner, then click `Terminal` and then `New Terminal`.
- Run `kubectl get nodes` and see that we've already installed the tools for you, run the `aws eks update-kubeconfig`, and it is all working.
- Click the three dashes in the upper left, then click View, then Command Palette. In that box type Browser Preview and choose `Browser Preview: Open Preview`. This browser is running on the Instance in the private VPC and you can use it to reach Kibana and Grafana etc.
If you set `deploy_vpn` to `True` in `eks_cluster.py` then the template will deploy a Client VPN.
Note that you'll also need to create client and server certificates and upload them to ACM by following these instructions - https://docs.aws.amazon.com/vpn/latest/clientvpn-admin/client-authentication.html#mutual - and update `eks_cluster.py` with the certificate ARNs for this to work.
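The exact variable names in `eks_cluster.py` may differ, but the edit amounts to pasting in the two ACM certificate ARNs, roughly like this hypothetical sketch:

```python
# Hypothetical variable names - check eks_cluster.py for the real ones.
# Paste in the ARNs of the server and client certificates you put into ACM.
vpn_server_certificate_arn = "arn:aws:acm:ap-southeast-2:123456789123:certificate/SERVER-CERT-ID"
vpn_client_certificate_arn = "arn:aws:acm:ap-southeast-2:123456789123:certificate/CLIENT-CERT-ID"
```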
Once it has created your VPN you then need to configure the client:
- Open the AWS VPC Console and go to the Client VPN Endpoints on the left panel
- Click the Download Client Configuration button
- Edit the downloaded file and add:
  - A section at the bottom for the server cert in between `<cert>` and `</cert>` tags
  - Then under that a section for the client private key between `<key>` and `</key>` tags
- Install the AWS Client VPN Client - https://aws.amazon.com/vpn/client-vpn-download/
- Create a new profile pointing it at that configuration file
- Connect to the VPN
Once you are connected it is a split tunnel - meaning only the addresses in your EKS VPC will get routed through the VPN tunnel.
You then need to add the EKS cluster to your local kubeconfig by running the command in the `clusterConfigCommand` Output of the EKSClusterStack.
Then you should be able to run `kubectl get all -A` and see everything running on your cluster.
We put the Elasticsearch both in the VPC (i.e. not on the Internet) as well as within the same Security Group we use for controlling access to our EKS Control Plane.
We did this so that when we allow the Bastion or Client VPN access to that security group then it has access from a network perspective to both manage EKS as well as query logs in Elasticsearch/Kibana.
Since this Elasticsearch can only be reached if you are coming in via the Bastion/VPC/Direct Connect, it is not that risky to allow 'open access' to it - especially in a Proof of Concept (POC) environment. As such, we've configured its default access policy so that no login and password are required, and we are controlling access to it from a network perspective rather than with authentication/authorisation by default.
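In CDK terms, the relevant bit of the stack looks roughly like the sketch below - the construct ID and Elasticsearch version are placeholders, and `vpc` and `eks_cluster` are assumed to have been defined earlier in the stack:

```python
from aws_cdk import aws_elasticsearch as es

# Place the domain inside the VPC (no public endpoint) and reuse the EKS control
# plane security group, so anything allowed to manage EKS can also reach
# Elasticsearch/Kibana over the network
domain = es.Domain(
    self, "LoggingDomain",
    version=es.ElasticsearchVersion.V7_9,
    vpc=vpc,
    security_groups=[eks_cluster.cluster_security_group],
)
```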
- Once that new access policy has applied, click on the Kibana link on the Elasticsearch Domain's Overview Tab
- Click "Explore on my own" on the Welcome page
- Click "Connect to your Elasticsearch index" under "Use Elasticsearch Data"
- Close the About index patterns box
- Click the Create Index Pattern button
- In the Index pattern name box enter `fluent-bit*` and click Next step
- Pick @timestamp from the dropdown box and click Create index pattern
- Then go back Home and click Discover
TODO: Walk through how to do a few basic things in Kibana with searching and dashboarding your logs.
We have deployed an in-VPC private Network Load Balancer (NLB) to access your Grafana service to visualise the metrics from the Prometheus we've deployed onto the cluster.
To access this, run `kubectl get service grafana-nlb --namespace=kube-system` and find the address under EXTERNAL-IP. Alternatively, you can find the Grafana NLB in the AWS EC2 console and get its address from there.
Once you go to that page the default login/password is `admin`/`prom-operator`.
There are some default dashboards that ship with this which you can see by going to Home on top. This will take you to a list view of the available dashboards. Some good ones to check out include:
- Kubernetes / Compute Resources / Cluster
  - This gives you a whole-cluster view
- Kubernetes / Compute Resources / Namespace (Pods)
  - There is a namespace dropdown at the top and it'll show you the graphs, including the consumption in that namespace broken down by Pod
- Kubernetes / Compute Resources / Namespace (Workloads)
  - Similar to the Pod view but instead focuses on Deployment, StatefulSet and DaemonSet views
Within all of these dashboards you can click on names as links and it'll drill down to show you details relevant to that item.
TODO: Walk through deploying some apps that show off some of the cluster add-ons we've installed
Since we are explicit both with the EKS Control Plane version as well as the Managed Node Group AMI version, upgrading these is simply a matter of incrementing those versions, saving `cluster-bootstrap/eks_cluster.py` and then running a `cdk deploy`.
As per the EKS Upgrade Instructions you start by upgrading the control plane, then any required add-on versions and then the worker nodes.
Upgrade the control plane by changing `eks_version` at the top of `eks_cluster.py`. You can see what to put there by looking at the CDK documentation for `KubernetesVersion`. Then run `cdk deploy` - or let the CodeBuild GitOps provided in `cluster-codebuild` do it for you.
Upgrade the worker nodes by updating `eks_node_ami_version` at the top of `cluster-bootstrap/eks_cluster.py` with the new version. You can find the version to put there in the EKS Documentation as shown here:
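Put together, the two version pins being edited are of this general shape - a hedged sketch using the variable names from this README, with example values only:

```python
from aws_cdk import aws_eks as eks

# Control plane version - see the CDK docs for the available KubernetesVersion values
eks_version = eks.KubernetesVersion.V1_21

# Managed Node Group AMI release version - look the current one up in the EKS docs
eks_node_ami_version = "1.21.2-20210830"   # example value only
```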
Each of our add-ons is deployed via a Helm chart, and we are explicit about the chart version being deployed. In the comment above each chart version we link to the GitHub repo for that chart, where you can see what the current chart version is and what changes may have been rolled in since the one cited in the template.
To upgrade an add-on, update the chart version to the upstream version you see there, save the file, and then do a `cdk deploy`.
NOTE: I was thinking about parameterising this to the top of `cluster-bootstrap/eks_cluster.py`, but it is possible that, as the chart versions change, the values you have to specify might also change. As such, I have not done so, as a reminder that this change might require a bit of testing and research rather than just popping a new version number in and expecting it'll work.
- Investigate replacing the current instance ID password for the VS Code (code-server) on the Bastion with something more secure, such as generating a longer string and storing it in Secrets Manager - or drop it altogether in favour of just the Client VPN and/or Systems Manager Session Manager to the Bastion.