Managing SSH Known Hosts

This proposal describes a method for more securely managing and authenticating remote hosts over the SSH protocol within Tarmak environments on AWS.

Background

Currently we rely solely on the external OpenSSH command line tool to connect to remote instances on EC2, both for interactive shell sessions and for tunnelling and proxy commands for other services, including connections to the Vault cluster and the private Kubernetes API server endpoint. Currently in development is the replacement of our programmatic use cases of SSH with an in-package Go implementation, a choice stemming from pain points in developing more sophisticated utility functions for Tarmak and from the desire for greater control over connections to remote hosts.

During development of this replacement it became clear that proper care must be taken when authenticating host public keys during connection and when manually managing our ssh_known_hosts cluster file. Our current implementation lets OpenSSH maintain this file, but it does not exit with an error if public keys do not match, because the flag StrictHostKeyChecking is set to no. Not only does a mismatch in public keys not cause an error, but when populating the file each machine always trusts whatever public key the host presents, meaning the set of known public keys can differ between users accessing the same cluster.

Objective

By implementing stricter enforcement of the ssh_known_hosts file and passing its management to Tarmak, we can improve the security of SSH connections to remote hosts. The key high-level points to achieve this are as follows:

  • Disable writes from the OpenSSH command to the ssh_known_hosts file and enforce strict checking.
  • Enforce that our in package implementation of SSH connections adheres to this file also.
  • Collect public keys during instance start-up and store them tightly coupled with that host. These keys can then be used as a source of truth by other authenticated users attempting to connect to remote hosts in the cluster with an empty or incomplete ssh_known_hosts file.

Changes

Firstly, we must prevent the OpenSSH command line tool from editing the ssh_known_hosts file and strictly enforce host key checking by updating the generator for the ssh_config file. This enables Tarmak to take control of managing the ssh_known_hosts file.
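
As a minimal sketch, the generated ssh_config could enforce this with standard OpenSSH options (the known hosts path here is illustrative):

Host *
    StrictHostKeyChecking yes
    UserKnownHostsFile ~/.tarmak/<cluster>/ssh_known_hosts

With StrictHostKeyChecking set to yes, OpenSSH refuses connections to hosts missing from the file rather than adding new entries itself, leaving the file entirely under Tarmak's control.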

In order to create a source of truth for each host's public key, each instance will have its public keys attached as tags shortly after boot time, like the following:

tarmak.io/ssh-host-ed25519-host-0 AAAAC3NzaC1lZDI1NTE5AAAAIE90XYYm6GSDlNGejM+aY5dZEe5vK4XyU++89WdGJcDc==EOF

The population of these tags will happen at boot time for all instances, regardless of whether they were created by a direct Terraform apply or via an Amazon Auto Scaling Group. At execution time, Wing - present on every instance - will invoke an Amazon Lambda function for instance tagging. Passed to this function will be a collection of the instance's public keys, its Amazon identity document and the matching PKCS7 document.
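
Matching the TagInstanceRequest structure shown later, the request payload might look like the following (all values are illustrative and the identity document is abbreviated; Go byte slices marshal to base64 in JSON):

{
  "publicKeys": {
    "ed25519": "c3NoLWVkMjU1MTkgQUFBQUMzTnphQzFsWkRJMU5URTUuLi4="
  },
  "instanceID": {
    "instanceId": "i-0123456789abcdef0",
    "accountId": "123456789012",
    "region": "eu-west-1"
  },
  "pkcs7CMS": "MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJ..."
}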

Upon receiving this request, the Lambda function will verify the authenticity of the request and identity document by verifying the PKCS7 document against the instance identity document and the public AWS certificate. Further details on this can be found in the AWS documentation on instance identity documents. Once verified, the function will split the public keys into chunks of at most 256 characters - the maximum length of an EC2 tag value.
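
A minimal sketch of how the verify method stubbed out below could be completed, assuming the github.com/fullsailor/pkcs7 library and the parse-attach-verify pattern also used by Vault's aws-ec2 authentication backend (additional imports: crypto/x509, encoding/base64, encoding/json, encoding/pem):

// verify checks the PKCS7 signature against the AWS public certificate and
// confirms the signed content describes the same instance as the request
func (t TagInstanceRequest) verify() error {
      block, _ := pem.Decode([]byte(AWSCert))
      if block == nil {
              return fmt.Errorf("failed to decode AWS certificate PEM")
      }
      cert, err := x509.ParseCertificate(block.Bytes)
      if err != nil {
              return err
      }

      // assumes any whitespace has been stripped from the base64 input
      der, err := base64.StdEncoding.DecodeString(t.PKCS7CMS)
      if err != nil {
              return err
      }
      p7, err := pkcs7.Parse(der)
      if err != nil {
              return err
      }

      // the metadata service does not embed the signer certificate, so
      // attach the known AWS certificate before verifying the signature
      p7.Certificates = []*x509.Certificate{cert}
      if err := p7.Verify(); err != nil {
              return err
      }

      // the signed content must describe the same instance as the request
      var doc EC2InstanceIdentityDocument
      if err := json.Unmarshal(p7.Content, &doc); err != nil {
              return err
      }
      if doc.InstanceID != t.InstanceDocument.InstanceID {
              return fmt.Errorf("instance identity mismatch")
      }

      return nil
}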

Finally, the function will test for the existence of these tags and do one of three actions:

  • if tags exist and match, exit success
  • if tags exist and mismatch, exit failure
  • if tags do not exist, create tags and exit success

Once an instance has requested the creation of its tags, all subsequent requests should succeed with no action.

The Lambda function will have access only to resources within the Tarmak VPC. This gives assurance not only that the request comes from an Amazon instance with permission to invoke the function, but also that the instance resides in the Tarmak VPC, since the Lambda function only has permission to add tags to instances within it.
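
An illustrative sketch of how the function's role could be scoped (the account, region and VPC values are placeholders, and the applicability of the ec2:Vpc condition key to instances is an assumption to be validated against the EC2 IAM reference; Describe* actions do not support resource-level permissions):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:CreateTags",
      "Resource": "arn:aws:ec2:eu-west-1:123456789012:instance/*",
      "Condition": {
        "StringEquals": {
          "ec2:Vpc": "arn:aws:ec2:eu-west-1:123456789012:vpc/vpc-0123456789abcdef0"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeTags",
      "Resource": "*"
    }
  ]
}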

All SSH connections will rely on the contents of the ssh_known_hosts file; however, if a host is not present in the file, Tarmak will attempt to use the AWS instance's public key tags to populate its entry.
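
A minimal sketch of that population step, assuming the aws-sdk-go EC2 client and the PublicKey_<name>_<n> tag scheme produced by createTags below (imports: github.com/aws/aws-sdk-go/aws, github.com/aws/aws-sdk-go/service/ec2, strconv, strings):

// fetchHostKeys reassembles the chunked public key tags of an instance
// into complete keys, keyed by key name
func fetchHostKeys(svc *ec2.EC2, instanceID string) (map[string]string, error) {
      out, err := svc.DescribeTags(&ec2.DescribeTagsInput{
              Filters: []*ec2.Filter{{
                      Name:   aws.String("resource-id"),
                      Values: []*string{aws.String(instanceID)},
              }},
      })
      if err != nil {
              return nil, err
      }

      // group chunk values by key name and chunk index; assumes key names
      // contain no underscores
      chunks := make(map[string]map[int]string)
      for _, tag := range out.Tags {
              parts := strings.Split(aws.StringValue(tag.Key), "_")
              if len(parts) != 3 || parts[0] != "PublicKey" {
                      continue
              }
              i, err := strconv.Atoi(parts[2])
              if err != nil {
                      continue
              }
              if chunks[parts[1]] == nil {
                      chunks[parts[1]] = make(map[int]string)
              }
              chunks[parts[1]][i] = aws.StringValue(tag.Value)
      }

      keys := make(map[string]string)
      for name, cs := range chunks {
              var key string
              for i := 0; i < len(cs); i++ {
                      key += cs[i]
              }
              // strip the ==EOF marker appended by createTags
              keys[name] = strings.TrimSuffix(key, "==EOF")
      }
      return keys, nil
}

Each reassembled key can then be written to the ssh_known_hosts file as a standard entry (hostname, key type, base64 key) before the connection proceeds.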

Notable items

Public DSA keys will not be tagged.

Care must be taken to ensure that Terraform does not override tags set by the Lambda function.
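
One way to achieve this, assuming the instances are declared as aws_instance resources (illustrative; the same lifecycle argument applies to other taggable resources), is Terraform's ignore_changes lifecycle setting:

resource "aws_instance" "worker" {
  # ...

  lifecycle {
    ignore_changes = ["tags"]
  }
}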

A start has been made on the code for the Lambda function:

package main

import (
      "context"
      "fmt"
      "time"

      "github.com/aws/aws-lambda-go/lambda"
)

const (
      AWSCert = "global aws public cert"
      tagSize = 256
)

type EC2InstanceIdentityDocument struct {
      DevpayProductCodes []string  `json:"devpayProductCodes"`
      AvailabilityZone   string    `json:"availabilityZone"`
      PrivateIP          string    `json:"privateIp"`
      Version            string    `json:"version"`
      Region             string    `json:"region"`
      InstanceID         string    `json:"instanceId"`
      BillingProducts    []string  `json:"billingProducts"`
      InstanceType       string    `json:"instanceType"`
      AccountID          string    `json:"accountId"`
      PendingTime        time.Time `json:"pendingTime"`
      ImageID            string    `json:"imageId"`
      KernelID           string    `json:"kernelId"`
      RamdiskID          string    `json:"ramdiskId"`
      Architecture       string    `json:"architecture"`
}

type TagInstanceRequest struct {
      PublicKeys       map[string][]byte           `json:"publicKeys"`
      InstanceDocument EC2InstanceIdentityDocument `json:"instanceID"`
      PKCS7CMS         string                      `json:"pkcs7CMS"`
}

func HandleRequest(ctx context.Context, t TagInstanceRequest) error {
      if err := t.verify(); err != nil {
              return err
      }

      tags := t.createTags()

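      // if the tags already exist and match, there is nothing more to do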
      exists, err := t.checkTagsAgainstInstance(tags)
      if err != nil || exists {
              return err
      }

      // attach tags to the ec2 instance; with aws-sdk-go this would be a
      // CreateTags call along the lines of:
      //
      // _, err = svc.CreateTags(&ec2.CreateTagsInput{
      //         Resources: []*string{aws.String(t.InstanceDocument.InstanceID)},
      //         Tags:      ec2Tags, // tags converted to []*ec2.Tag
      // })
      // if err != nil {
      //         return err
      // }

      return nil
}

// verify the pkcs7 doc against the instance identity content and AWS global
// cert
func (t TagInstanceRequest) verify() error {
      return nil
}

// check generated tags against the ec2 instance
// if existing and matching, exit gracefully
// if mismatched, exit failure
// if not existing, we need to create them
func (t TagInstanceRequest) checkTagsAgainstInstance(tags map[string][]byte) (tagsExist bool, err error) {
      return false, nil
}

// split up public keys into correct sizes for AWS tags
func (t TagInstanceRequest) createTags() map[string][]byte {
      tags := make(map[string][]byte)

      for keyName, data := range t.PublicKeys {
              data = append(data, []byte("==EOF")...)

              for i := 0; i < len(data); i += tagSize {
                      end := i + tagSize

                      if end > len(data) {
                              end = len(data)
                      }

                      tagName := fmt.Sprintf("PublicKey_%s_%d", keyName, i/tagSize)
                      tags[tagName] = data[i:end]
              }
      }

      return tags
}

func main() {
      lambda.Start(HandleRequest)
}
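
To illustrate how checkTagsAgainstInstance could be completed, a sketch using the aws-sdk-go EC2 client (imports: github.com/aws/aws-sdk-go/aws, github.com/aws/aws-sdk-go/aws/session, github.com/aws/aws-sdk-go/service/ec2, strings; error handling kept minimal):

// checkTagsAgainstInstance fetches the instance's existing PublicKey_ tags
// and compares them with the generated set
func (t TagInstanceRequest) checkTagsAgainstInstance(tags map[string][]byte) (tagsExist bool, err error) {
      svc := ec2.New(session.Must(session.NewSession()))

      out, err := svc.DescribeTags(&ec2.DescribeTagsInput{
              Filters: []*ec2.Filter{{
                      Name:   aws.String("resource-id"),
                      Values: []*string{aws.String(t.InstanceDocument.InstanceID)},
              }},
      })
      if err != nil {
              return false, err
      }

      existing := make(map[string]string)
      for _, tag := range out.Tags {
              if strings.HasPrefix(aws.StringValue(tag.Key), "PublicKey_") {
                      existing[aws.StringValue(tag.Key)] = aws.StringValue(tag.Value)
              }
      }

      // no tags yet: the caller should create them
      if len(existing) == 0 {
              return false, nil
      }

      // tags exist: they must match the generated set exactly
      for name, value := range tags {
              if existing[name] != string(value) {
                      return true, fmt.Errorf("tag %s does not match instance", name)
              }
      }
      return true, nil
}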

Limitations

Whilst we can restrict access permissions for both the Lambda function and EC2 instances, we do not have a cryptographic signature over the public keys coming from the EC2 instance.

Whilst using Hashicorp's Vault to set up an SSH CA for the environment would be advantageous, bootstrapping that process itself requires SSH connections from the client to EC2 instances, which rules it out as an option.

Out of scope

We should not disrupt the current flow of key generation on the host instances, for example by switching to key injection. At no point should private keys be in flight.

We should not store or rely on the public key being stored in the Terraform state, as this would require all commands that rely on SSH to also fetch and update the Terraform state - significantly increasing completion time for even trivial tasks.