
How to Secure Your Kubernetes Cluster on GKE

Google Kubernetes Engine (GKE) is easy to get started with, but it requires additional security controls. The documentation can be hard to navigate: many features and changes are tied to specific Kubernetes versions or require enabling beta functionality, and some of it is out of date, which can catch people out.

There is also an outstanding bug we have raised against Kubernetes that affects GKE. It is an edge case that only occurs if you run pods directly rather than through deployments, and only if you enable pod security policies (which we recommend doing for security reasons).

If you are heading to production with sensitive workloads, we advise implementing everything in this article.

Securing Your Kubernetes Infrastructure

When you deploy a default GKE cluster with no additional options provided, you’ll get some sensible security defaults:

  1. Basic authentication is disabled.
  2. Encryption at rest for the Operating System (OS) image is enabled.
  3. Client certificate issuance is disabled.
  4. A secure operating system is used (Container-Optimized OS, or COS).
  5. VM integrity monitoring is enabled to protect against malware and rootkits.
  6. A restricted service account is used for the nodes.

So, why did Google enable these things by default? Essentially it means:

  • You can only authenticate to the Kubernetes API via Google Cloud Platform (GCP) Single Sign-On (SSO). That means no basic authentication or client certificates can be used.
  • The OS is locked down and container optimized with a limited and read-only root filesystem with integrity checks. Integrity monitoring of the OS is enabled, to protect against rootkits and malware.
  • The service account that the Kubernetes nodes are using is restricted and will only have privileges to access cloud services it needs, such as Stackdriver for logging and monitoring.

The things you need to make sure you enable:

  • Enable secure boot to validate the authenticity of the OS and kernel modules. (If you decided not to use the default OS, this might cause issues.)

Using the virtual Trusted Platform Module (vTPM) to sign the OS and kernel images means their authenticity can be established. It also guarantees that nothing has been tampered with and that kernel modules have not been replaced with ones containing malware or rootkits.

  • Enable intranode visibility so you can see the data flowing between pods and nodes.

Making sure all traffic between pods and nodes is logged and tracked will help you identify any potential risks that arise later on. This isn’t necessarily something you need for development, but it is something you should do for production.

  • Put the Kubernetes API on a private network.

  • Put the node pool on a private network.

  • Provide a list of authorized networks that should be allowed to talk to the API.

Making your nodes and Kubernetes API private means they aren’t subject to the network scans constantly being run by bots, hackers and script kiddies.

Putting it behind an internal network that you can only access through a VPN is also good; however, this is a much more involved process with GKE and isn’t as simple as a feature flag like the others. You can read about what is involved here.
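As a rough sketch of how those options come together at cluster creation time (the project, cluster name, region and CIDRs are placeholders, and flag names can differ between gcloud SDK versions, so check gcloud container clusters create --help):

```bash
# Sketch: a hardened, private GKE cluster with secure boot, integrity monitoring,
# intranode visibility and an authorized-networks allow list for the API
gcloud container clusters create my-secure-cluster \
  --project my-gke-project \
  --region europe-west2 \
  --enable-shielded-nodes \
  --shielded-secure-boot \
  --shielded-integrity-monitoring \
  --enable-intra-node-visibility \
  --enable-ip-alias \
  --enable-private-nodes \
  --enable-private-endpoint \
  --master-ipv4-cidr 172.16.0.0/28 \
  --enable-master-authorized-networks \
  --master-authorized-networks 10.0.0.0/8
```

With --enable-private-endpoint set, the API is only reachable from inside the VPC (or over a VPN/interconnect into it), which is the more involved setup mentioned above.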

Securing Your Application Containers

Again, Google enables some default features to get you started. However, there are still gaps that you will need to fill. 

What do you get?

  • Managed certificates
  • Role Based Access Controls (RBAC)

What you will want to enable:

  • Pod Security Policies. This is in beta and requires you to tell Google that you want to create the cluster in beta mode (see the example after this list):

  • Network Policies.
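A minimal sketch of enabling both at cluster creation time (the pod security policy flag sits behind the beta gcloud surface; names are placeholders):

```bash
# Sketch: pod security policies require the gcloud beta surface; network policy enforcement is a standard flag
gcloud beta container clusters create my-secure-cluster \
  --project my-gke-project \
  --enable-pod-security-policy \
  --enable-network-policy
```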

This sounds great, but what does it actually mean? Basically, these responsibilities are split into two: some are for application development teams to own, and the others are for the cluster administrator.

Application Developer

  • Set network policies at an application level so you only allow the right level of access, i.e., AppA can talk to AppB but AppC can’t talk to either of them (see the sketch after this list).
  • Encrypt traffic to your service endpoint for your desired domain, e.g., https://www.mydomain.com.
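As a minimal sketch of the AppA/AppB example (the namespace, labels and port are hypothetical), a NetworkPolicy on AppB that only admits traffic from pods labelled as AppA:

```yaml
# Sketch: only pods labelled app: app-a may reach app-b pods on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-a-to-app-b
  namespace: our-teams-app1-dev
spec:
  podSelector:
    matchLabels:
      app: app-b
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: app-a
      ports:
        - protocol: TCP
          port: 8080
```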

Kubernetes can be a bit of a learning curve. There are technologies such as Helm that make dependencies simpler, allowing you to deploy application dependencies with pre-defined deployments. But there is no real substitute for understanding the main Kubernetes components: Network Policies, Ingress, certificate management, Deployments, ConfigMaps, Secrets and Service resources.

The main security components are network policies, secrets and certificate management. Network policies allow you to control the traffic to and from your applications. Secrets are only base64 encoded, so there is no real security in how they are stored; making sure the cluster administrator has enabled secret encryption (as mentioned further down) will add that additional layer.

Certificate management will make sure the traffic to your service is encrypted, but if you’re communicating between services, then you should also add TLS between your applications. Having the cluster administrator install something like cert-manager makes it easier to encrypt traffic between services, as sketched below. There are also products like Istio, but as Istio does a lot more than just certificates, it can add more complexity than necessary.
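For illustration only, assuming cert-manager is installed and a ClusterIssuer named letsencrypt-prod exists (both are assumptions, not something GKE provides), an Ingress can request a certificate for your domain via an annotation:

```yaml
# Sketch: cert-manager issues a certificate for the host and stores it in the referenced secret
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - www.mydomain.com
      secretName: my-app-tls
  rules:
    - host: www.mydomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```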

Cluster Administrator

  • Have the ability to control what users can do and in which namespaces (RBAC), e.g., Bob the developer can manage deployments (create, update, delete) but can’t view secrets in the namespace our-teams-app1-dev (see the first sketch after this list).
  • Implement a deny-all network policy for both egress and ingress (outbound and inbound). This can be controversial as it can catch teams out, so making sure you communicate it is key. It will, however, force teams to define network policies to get their applications working (see the second sketch after this list).
  • Enforce an application deployment security stance with Pod Security Policies. This means preventing containers running in the cluster from escalating privileges, mounting devices on the nodes (including binding to the host network) or making privileged kernel system calls.
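A minimal sketch of the RBAC example for Bob (the user name and namespace are placeholders): a namespaced Role that allows managing deployments but contains no rule for secrets, plus a RoleBinding for Bob.

```yaml
# Sketch: Bob can manage deployments in our-teams-app1-dev but has no access to secrets
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-manager
  namespace: our-teams-app1-dev
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: bob-deployment-manager
  namespace: our-teams-app1-dev
subjects:
  - kind: User
    name: bob@mydomain.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployment-manager
  apiGroup: rbac.authorization.k8s.io
```

And a sketch of a default deny-all policy, applied per namespace (the namespace is again a placeholder); an empty pod selector matches every pod in the namespace:

```yaml
# Sketch: deny all ingress and egress for every pod in the namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: our-teams-app1-dev
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```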

You want to make sure that development teams can’t deploy insecure applications, attempt to escalate their privileges, mount devices they shouldn’t or make unnecessary kernel system calls. Pod Security Policies offer a way to restrict how users, teams or service accounts can deploy applications into the cluster, enforcing a good security posture.

RBAC and Pod Security Policies go hand in hand. Once you define a good Pod Security Policy, you have to create a role that references it and then bind a user, group and/or service account to it, either cluster-wide or at a namespace level.

Note: GKE uses an authorization webhook that is consulted before Kubernetes RBAC. This means that if you are an administrator in Google Cloud Identity and Access Management (IAM), you will always be a cluster admin, so you can recover from accidental lock-outs. This is abstracted away inside the control plane and is managed by GKE itself.

We recommend the PSP below as a good starting point. It will make sure that users, service accounts or groups can only deploy containers that meet its criteria.
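The original manifest isn’t reproduced here, so the following is only a sketch of a restrictive PSP along those lines (the name and exact settings are assumptions, not the article’s original policy):

```yaml
# Sketch: a restrictive PodSecurityPolicy - no privileged containers, no host namespaces,
# no privilege escalation, non-root users and a limited set of volume types
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  fsGroup:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  volumes:
    - configMap
    - emptyDir
    - projected
    - secret
    - downwardAPI
    - persistentVolumeClaim
```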

If you wanted to create a role that uses the PSP defined above, it would look something like the example below; note that this is a ClusterRole as opposed to a standard Role. To then enforce it on, say, all authenticated users, you would create a binding that applies to the “system:authenticated” group.
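A sketch of that ClusterRole, plus a ClusterRoleBinding for the “system:authenticated” group, referencing the restricted PSP sketched above:

```yaml
# Sketch: allow "use" of the restricted PSP, and bind that permission to all authenticated users
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restricted
rules:
  - apiGroups: ["policy"]
    resources: ["podsecuritypolicies"]
    resourceNames: ["restricted"]
    verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-restricted-authenticated
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: system:authenticated
```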

Remember that because this is cluster-wide, any applications that need more privileged permissions will stop working; some of these will be components Google adds into Kubernetes, such as kube-proxy, which runs in the kube-system namespace.

You can read more information on RBAC and PSPs on kubernetes.io.

Securing Sensitive Data and Application Cloud Services

We’ll break this down into two parts.

1. Encrypting Your Secrets in Kubernetes

The recommendation for encrypting secrets with Google Cloud’s KMS service is to segregate roles and responsibilities by creating a new Google project, separate from the project that will host Kubernetes and your applications. This makes sure the encryption keys, and more importantly the key that wraps the other keys (envelope encryption), don’t reside in the same project that could potentially be compromised.

For encrypting secrets you need to:

  • Set up a new Google project to implement segregation of duties (if necessary).
  • Create the master key ring and key inside that project.
  • Only allow the GKE service account from the project hosting Kubernetes access to the key in the dedicated key project.
  • Grant that service account only the permissions it needs (encrypt and decrypt).

The documentation on how to do this can be found here. But the main things to remember are:

  • The key has to be in the same location as the cluster (this is to reduce latency and guard against the loss of a zone).
  • Access permissions have to be correct for the service account.
  • The binding is a KMS IAM policy binding.
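A sketch of those steps with gcloud (the projects, key ring, key and location are placeholders; the member shown is the GKE service agent of the cluster’s project, which takes the form service-<PROJECT_NUMBER>@container-engine-robot.iam.gserviceaccount.com):

```bash
# Sketch: create the key ring and key in a dedicated key project
gcloud kms keyrings create gke-secrets \
  --project my-key-project \
  --location europe-west2

gcloud kms keys create gke-secrets-key \
  --project my-key-project \
  --location europe-west2 \
  --keyring gke-secrets \
  --purpose encryption

# Grant only encrypt/decrypt to the GKE service agent of the cluster's project
gcloud kms keys add-iam-policy-binding gke-secrets-key \
  --project my-key-project \
  --location europe-west2 \
  --keyring gke-secrets \
  --member serviceAccount:service-123456789@container-engine-robot.iam.gserviceaccount.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter
```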

Once this is set up, you can pass the full resource path of the key to the “gcloud container clusters create” command:
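For example (all names are placeholders; the key is referenced by its full resource path):

```bash
# Sketch: enable application-layer secrets encryption with the key created above
gcloud container clusters create my-secure-cluster \
  --project my-gke-project \
  --region europe-west2 \
  --database-encryption-key projects/my-key-project/locations/europe-west2/keyRings/gke-secrets/cryptoKeys/gke-secrets-key
```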

Note: If any of the above is incorrect, you will get a 500 internal server error when you try to create the cluster. This could mean the key path is incorrect, the location is wrong or the permissions are not right.

2. Consuming Cloud Services Inside of Kubernetes

There are four different ways to allow application containers to consume cloud services. All of them have limitations in some way: they are either less user-friendly and less automated for developers (making them wait for the relevant access to be provisioned), or they offer less visibility and more complexity when tying together auditability for Kubernetes and cloud administrators.

  1. Workload Identity (this is still in beta and is the longer-term direction Google is going in): requires an ongoing process of managing Google IAM as well as Kubernetes service accounts to tie them together. This means managing Google IAM roles, policies and service accounts for specific application service lines as well as Kubernetes service accounts. It does, however, improve auditability (see the sketch after this list).
  2. Google Cloud service account keys stored as secrets inside Kubernetes namespaces (where the application will be living): similar to the above, but without the binding. This means provisioning a service account key and placing it as a secret inside the application’s namespace for it to consume natively. It has the downside of not giving full auditability across Google Cloud and Kubernetes.
  3. Use something like Vault to broker between cloud and applications: Vault provides an abstraction over the cloud and will generate short-lived access keys for applications. However, a secret is still required to talk to the Vault service, so the permissions problem is the same, just abstracted down one level. It also disjoints auditability between Google Cloud and Kubernetes.
  4. Using the default GKE node service account: much simpler and more developer-friendly, but riskier. It would mean allowing applications to use the default node service account and modifying its role to cater for all the cloud services applications need, increasing the scope and capability of the node service account across most Google Cloud services.
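As a sketch of option one (all names are placeholders, and it assumes Workload Identity has already been enabled on the cluster and its node pools):

```bash
# Sketch: allow a Kubernetes service account (KSA) to impersonate a Google service account (GSA)
gcloud iam service-accounts add-iam-policy-binding \
  app1-gsa@my-gke-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-gke-project.svc.id.goog[our-teams-app1-dev/app1-ksa]"

# Annotate the KSA so pods that use it obtain the GSA's credentials from the metadata server
kubectl annotate serviceaccount app1-ksa \
  --namespace our-teams-app1-dev \
  iam.gke.io/gcp-service-account=app1-gsa@my-gke-project.iam.gserviceaccount.com
```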

Having Good Visibility: Current and Historic

Note: As of today, there is no access transparency on when Google accesses your Kubernetes cluster (Google Access Transparency). This could be problematic for a lot of organizations that want assurances around the provider’s data access.

When the cluster is provisioned, all audit logs and monitoring data are pushed to Stackdriver. As the control plane is managed by Google, you don’t get to override or modify the audit format or log, or extend it to your own webhook.

It does mean that you can search your audit logs for everything happening inside of Kubernetes in one place. For example, to query all events against a cluster in a specific Google project for your own user ID, you can run something like the query below:
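For example, with gcloud (the cluster, project and user are placeholders; the filter uses the Cloud Logging query language):

```bash
# Sketch: pull recent Kubernetes audit log entries for a given user against one cluster
gcloud logging read \
  'resource.type="k8s_cluster" AND resource.labels.cluster_name="my-secure-cluster" AND protoPayload.authenticationInfo.principalEmail="me@mydomain.com"' \
  --project my-gke-project \
  --limit 20 \
  --format json
```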

From this point, you could take it further and add additional security rules, such as creating custom metrics that alert when cluster-admin changes are made or when specific roles (such as cluster-admin) are modified, as sketched below.
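As a sketch (the metric name and filter are assumptions, and the exact methodName values depend on the audit log format), a log-based metric you could then alert on in Cloud Monitoring:

```bash
# Sketch: count audit events that create, update, patch or delete cluster role bindings
gcloud logging metrics create k8s-clusterrolebinding-changes \
  --project my-gke-project \
  --description "Changes to ClusterRoleBindings in GKE audit logs" \
  --log-filter 'resource.type="k8s_cluster" AND protoPayload.methodName:"clusterrolebindings" AND (protoPayload.methodName:"create" OR protoPayload.methodName:"update" OR protoPayload.methodName:"patch" OR protoPayload.methodName:"delete")'
```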

Summary

Security is something everyone wants but, as you can see, it can be quite an involved process to achieve. It also requires domain knowledge to get enough context to assess the risk and what it might mean to your applications and business.

Security without automation can slow down productivity. Not enough security can put your business at risk. Enabling security features that are still in beta may also not be suitable for the business; if only generally available features are acceptable, that compromises on security.

As a general rule, hardening your clusters and enforcing secure ways of working with Kubernetes, containers, cloud services and your applications will get you the best outcome in the end. There may be frustrating learning curves, but as the industry matures, these will slowly be remediated.

To learn more about containerized infrastructure and cloud native technologies, consider coming to KubeCon + CloudNativeCon EU in Amsterdam. The CNCF has made the decision to postpone the event (originally set for March 30 to April 2, 2020), which will instead be held in July or August 2020.

Lewis Marshall

Lewis Marshall is a cloud-native delivery advocate at Appvia. Over his 28 years of experience in development and operations, he has helped transform software delivery systems at Pfizer, the London Stock Exchange, the Home Office and many others. Having worked through everything from x86 assembly to Golang and Kubernetes, Lewis now helps organisations with their digital transformation efforts. He is passionate about OneWheel and all things space.
