Managing SSH Known Hosts
This proposal suggests a method to more securely manage and ensure authentication of remote hosts through the SSH protocol within Tarmak environments on AWS.
Background
Currently we rely solely on the external OpenSSH command line tool to connect to remote instances on EC2, both for interactive shell sessions and for the tunnelling and proxy commands used by other services, including connections to the Vault cluster and the private Kubernetes API server endpoint. Work is in progress to replace our programmatic uses of SSH with the in-package Go solution, a choice that stems from pain points in developing more sophisticated utility functions for Tarmak and from the desire for better control over connections to remote hosts.
During development of this replacement it became clear that proper care must be taken when authenticating host public keys during connection and when manually managing our cluster ssh_known_hosts file. Our current implementation allows OpenSSH to maintain this file; however, it does not exit with an error if public keys do not match, because the flag StrictHostKeyChecking is set to no. Not only does a mismatch in public keys not cause an error, but each user's machine also populates its known public keys with whatever key the EC2 host presents, meaning the set of public keys could differ between users accessing the same cluster.
Objective
By implementing stricter enforcement of the ssh_known_hosts file and passing its management to Tarmak, we can improve the security of SSH connections to remote hosts. The key high-level points to achieving this are as follows:
- Disable writes from the OpenSSH command to the ssh_known_hosts file and enforce strict checking.
- Enforce that our in-package implementation of SSH connections also adheres to this file.
- Collect public keys during instance start-up and store them, tightly coupled with that host. These keys can then be used as a source of truth by other authenticated users connecting to remote hosts on the cluster whose ssh_known_hosts file is empty or incomplete.
Changes
Firstly, we must prevent the OpenSSH command line tool from editing the ssh_known_hosts file and make it strictly enforce the file’s contents, by updating the generator for the ssh_config file. This enables Tarmak to take control of managing the ssh_known_hosts file.
In order to create a source of truth for each host’s public keys, each instance will have its public keys attached as tags shortly after boot time, like the following:
tarmak.io/ssh-host-ed25519-host-0 | AAAAC3NzaC1lZDI1NTE5AAAAIE90XYYm6GSDlNGejM+aY5dZEe5vK4XyU++89WdGJcDc==EOF |
The population of these tags will happen at boot time for all instances, regardless of whether they have been created from a direct Terraform apply or via an Amazon Auto Scaling Group. At execution time, Wing - present on every instance - will invoke an AWS Lambda function for instance tagging. Passed to this function will be a collection of the instance’s public keys, its Amazon identity document and the matching PKCS7 document.
Upon receiving this request, the Lambda function will verify the authenticity of the request and identity document by verifying the instance identity and PKCS7 document against the public AWS certificate. Further details on this can be found here. Once verified, the function will split each public key into chunks of at most 256 characters, the maximum length of an EC2 tag value.
Finally, the function will test for the existence of these tags and do one of three actions:
- if tags exist and match, exit success
- if tags exist and mismatch, exit failure
- if tags do not exist, create tags and exit success
Once an instance has requested the creation of its tags, all subsequent requests should succeed with no action.
The Lambda function will only have access to resources within the Tarmak VPC. This gives assurance not only that the request comes from an Amazon instance with permission to invoke the function, but also that the instance resides in the Tarmak VPC, as the Lambda function only has permission to add tags to instances in it.
All SSH connections will rely on the contents of the ssh_known_hosts file; however, if the host is not present in the file, Tarmak will attempt to use the AWS instance’s public key tags to populate its entry.
Notable items
Public DSA keys will not be tagged.
Care must be taken to ensure that Terraform does not override tags set by the Lambda function.
A start has been made on the code for the Lambda function:
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-lambda-go/lambda"
)

const (
	// AWSCert is a placeholder for the public AWS certificate.
	AWSCert = "global aws public cert"
	// tagSize is the maximum length of an EC2 tag value.
	tagSize = 256
)

// EC2InstanceIdentityDocument mirrors the instance identity document JSON.
type EC2InstanceIdentityDocument struct {
	DevpayProductCodes []string  `json:"devpayProductCodes"`
	AvailabilityZone   string    `json:"availabilityZone"`
	PrivateIP          string    `json:"privateIp"`
	Version            string    `json:"version"`
	Region             string    `json:"region"`
	InstanceID         string    `json:"instanceId"`
	BillingProducts    []string  `json:"billingProducts"`
	InstanceType       string    `json:"instanceType"`
	AccountID          string    `json:"accountId"`
	PendingTime        time.Time `json:"pendingTime"`
	ImageID            string    `json:"imageId"`
	KernelID           string    `json:"kernelId"`
	RamdiskID          string    `json:"ramdiskId"`
	Architecture       string    `json:"architecture"`
}

// TagInstanceRequest is the payload sent by Wing to the Lambda function.
type TagInstanceRequest struct {
	PublicKeys       map[string][]byte           `json:"publicKeys"`
	InstanceDocument EC2InstanceIdentityDocument `json:"instanceID"`
	PKCS7CMS         string                      `json:"pkcs7CMS"`
}

func HandleRequest(ctx context.Context, t TagInstanceRequest) error {
	if err := t.verify(); err != nil {
		return err
	}

	tags := t.createTags()
	exists, err := t.checkTagsAgainstInstance(tags)
	// On error return it; if the tags already exist and match, err is nil
	// and there is nothing left to do.
	if err != nil || exists {
		return err
	}

	// attach tags to ec2 instance using real call
	//err := ec2.Tag{
	//	InstanceID: t.InstanceDocument.InstanceID,
	//	Tags: ....
	//}
	// if err != nil {
	//	return err
	//}

	return nil
}

// verify the pkcs7 doc against the instance identity content and AWS global
// cert
func (t TagInstanceRequest) verify() error {
	return nil
}

// check generated tags against the ec2 instance
// if existing and match exit gracefully
// if mismatch, exit failure
// if not exist, we need to create
func (t TagInstanceRequest) checkTagsAgainstInstance(tags map[string][]byte) (tagsExist bool, err error) {
	return false, nil
}

// split up public keys into correct sizes for AWS tags
func (t TagInstanceRequest) createTags() map[string][]byte {
	tags := make(map[string][]byte)

	for keyName, data := range t.PublicKeys {
		// mark the end of the key so chunks can be reassembled
		data = append(data, []byte("==EOF")...)

		for i := 0; i < len(data); i += tagSize {
			end := i + tagSize
			if end > len(data) {
				end = len(data)
			}

			// the chunk index is an int, so it is formatted with %d
			tagName := fmt.Sprintf("PublicKey_%s_%d", keyName, i/tagSize)
			tags[tagName] = data[i:end]
		}
	}

	return tags
}

func main() {
	lambda.Start(HandleRequest)
}
Limitations
Whilst we can restrict permissions of access for both the Lambda function and EC2 instances, we do not have a cryptographic signature of the public keys coming from the EC2 instance.
Whilst using HashiCorp’s Vault to set up an SSH CA for the environment would be advantageous, bootstrapping this process requires SSH connections from the client to EC2 instances, which rules it out as an option.
Out of scope
We should not disrupt the current flow of key generation on the host instances, for example by using key injection. At no point should private keys be in flight.
We should not store the public key in the Terraform state or rely on it being stored there, as this would require all commands that rely on SSH to also rely on fetching and updating the Terraform state - significantly increasing completion time for even trivial tasks.