[DRAFT] Caching Scan Results by Image Reference¶
TL;DR¶
To find vulnerabilities in container images, Starboard creates asynchronous Kubernetes (K8s) Jobs. Even though running a vulnerability scanner as a K8s Job is expensive, Starboard does not reuse scan results in any way. For example, if a workload refers to an image that has already been scanned, Starboard will go ahead and create another (similar) K8s Job.
To some extent, the problem of wasteful and long-running K8s Jobs can be mitigated by using Starboard with Trivy in ClientServer mode instead of the default Standalone mode. In that case, a configured Trivy server caches the results of scanning image layers. However, there is still unnecessary overhead in managing K8s Jobs and in the communication between the Trivy client and server. (The only real difference is that some Jobs may complete faster for already scanned images.)
To solve the above-mentioned problems, we could cache scan results by image reference. For example, a CRD-based implementation can store scan results as instances of a ClusterVulnerabilityReport object named after a hash of the repo digest. An alternative implementation may cache vulnerability reports in an AWS S3 bucket or a similar key-value store.
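For the CRD-based option, a minimal sketch of what such a cluster-scoped type could look like in Go is shown below. The field layout is an assumption that simply mirrors the existing namespaced VulnerabilityReport and reuses its report payload struct; nothing here is a final API definition.

```go
// Illustrative sketch only: one possible Go type for the proposed cluster-scoped
// cache entry. Everything beyond the standard TypeMeta/ObjectMeta embedding is an
// assumption; VulnerabilityScanResult stands for whatever payload struct the
// namespaced VulnerabilityReport already uses.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClusterVulnerabilityReport caches the result of scanning a single image
// reference. It is cluster scoped and named after a hash of the repo digest.
type ClusterVulnerabilityReport struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Report holds the same payload as the namespaced VulnerabilityReport:
	// scanner and artifact details, a severity summary, and the vulnerabilities.
	Report VulnerabilityScanResult `json:"report"`
}

// ClusterVulnerabilityReportList is the standard list wrapper required by the
// Kubernetes API machinery.
type ClusterVulnerabilityReportList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`

	Items []ClusterVulnerabilityReport `json:"items"`
}
```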
Example¶
With the proposed cluster-scoped (or global) cache, Starboard can check whether the image with the specified reference has already been scanned. If so, it will just read the corresponding ClusterVulnerabilityReport, copy its payload, and finally create an instance of a namespaced VulnerabilityReport.
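For illustration, here is a minimal Go sketch of that cache-first flow. The reportCache interface, the VulnerabilityData type, and the scan callback are hypothetical placeholders for the real Kubernetes API calls and the scan Job machinery.

```go
// Sketch of the cache-first scan flow: reuse a cached report when one exists
// for the image reference, otherwise run a scan Job and populate the cache.
package scanner

import (
	"context"
	"fmt"
)

// VulnerabilityData stands in for the report payload (scanner info, summary,
// vulnerabilities). The concrete type is whatever the CRDs already use.
type VulnerabilityData struct {
	Summary map[string]int
}

// reportCache abstracts the cluster-scoped cache keyed by image reference.
type reportCache interface {
	// Get returns the cached payload for an image reference, if present.
	Get(ctx context.Context, imageRef string) (*VulnerabilityData, bool, error)
	// Put stores the payload under the image reference.
	Put(ctx context.Context, imageRef string, data *VulnerabilityData) error
}

// scanOrReuse returns vulnerability data for imageRef, running a scan Job
// only when the cache has no entry for that reference.
func scanOrReuse(ctx context.Context, cache reportCache, imageRef string,
	scan func(ctx context.Context, imageRef string) (*VulnerabilityData, error),
) (*VulnerabilityData, error) {
	if data, ok, err := cache.Get(ctx, imageRef); err != nil {
		return nil, err
	} else if ok {
		// Cache hit: the caller can copy this payload into a namespaced
		// VulnerabilityReport without creating a scan Job.
		return data, nil
	}
	// Cache miss: run the (expensive) scan Job, then populate the cache so
	// the next workload using imageRef can reuse the result.
	data, err := scan(ctx, imageRef)
	if err != nil {
		return nil, err
	}
	if err := cache.Put(ctx, imageRef, data); err != nil {
		return nil, fmt.Errorf("caching report for %s: %w", imageRef, err)
	}
	return data, nil
}
```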
Let's consider two nginx:1.16 Deployments in two different namespaces, foo and bar. In the current implementation Starboard will spin up two K8s Jobs to run a scanner and eventually create two VulnerabilityReports in the foo and bar namespaces respectively.
In a cluster where Starboard is installed for the first time, when we scan the nginx Deployment in the foo namespace there is obviously no ClusterVulnerabilityReport for nginx:1.16. Therefore, Starboard will spin up a K8s Job and wait for its completion. On completion, it will create a cluster-scoped ClusterVulnerabilityReport named after the hash of nginx:1.16. It will also create a namespaced VulnerabilityReport named after the current revision of the nginx Deployment.
NOTE Because a repo digest is not a valid name for a K8s API object, we may, for example, calculate a (safe) hash of the repo digest and use it as the name instead.
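For example, a short hex-encoded SHA-256 prefix of the image reference would give a deterministic, DNS-1123 friendly name. The sketch below is just one possible scheme (including the 10-character length), not a decision; names such as 84bcb5cd46 in the output further down have exactly this shape.

```go
// Sketch of deriving a valid, stable object name from an image reference.
package naming

import (
	"crypto/sha256"
	"encoding/hex"
)

// reportNameForRef maps an image reference (ideally the repo digest) to a
// short, deterministic name usable for a cluster-scoped report.
func reportNameForRef(imageRef string) string {
	sum := sha256.Sum256([]byte(imageRef))
	// 10 hex characters keep names short while making collisions unlikely
	// for the number of distinct images found in a single cluster.
	return hex.EncodeToString(sum[:])[:10]
}
```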
$ kubectl get clustervulnerabilityreports
No resources found
$ starboard scan vulnerabilityreports deploy/nginx -n foo -v 3
I1008 19:58:19.355462 62385 scanner.go:72] Getting Pod template for workload: {Deployment nginx foo}
I1008 19:58:19.358802 62385 scanner.go:89] Checking if images were already scanned
I1008 19:58:19.360411 62385 scanner.go:95] Cached scan reports: 0
I1008 19:58:19.360421 62385 scanner.go:101] Scanning with options: {ScanJobTimeout:0s DeleteScanJob:true}
I1008 19:58:19.365155 62385 runner.go:79] Running task and waiting forever
I1008 19:58:19.365190 62385 runnable_job.go:74] Creating job "starboard/scan-vulnerabilityreport-cbf8c9b99"
I1008 19:58:19.376902 62385 reflector.go:219] Starting reflector *v1.Event (30m0s) from pkg/mod/k8s.io/client-go@v0.22.2/tools/cache/reflector.go:167
I1008 19:58:19.376920 62385 reflector.go:255] Listing and watching *v1.Event from pkg/mod/k8s.io/client-go@v0.22.2/tools/cache/reflector.go:167
I1008 19:58:19.376902 62385 reflector.go:219] Starting reflector *v1.Job (30m0s) from pkg/mod/k8s.io/client-go@v0.22.2/tools/cache/reflector.go:167
I1008 19:58:19.376937 62385 reflector.go:255] Listing and watching *v1.Job from pkg/mod/k8s.io/client-go@v0.22.2/tools/cache/reflector.go:167
I1008 19:58:19.386049 62385 runnable_job.go:130] Event: Created pod: scan-vulnerabilityreport-cbf8c9b99-4nzkb (SuccessfulCreate)
I1008 19:58:51.243554 62385 runnable_job.go:130] Event: Job completed (Completed)
I1008 19:58:51.247251 62385 runnable_job.go:109] Stopping runnable job on task completion with status: Complete
I1008 19:58:51.247273 62385 runner.go:83] Stopping runner on task completion with error: <nil>
I1008 19:58:51.247278 62385 scanner.go:130] Scan job completed: starboard/scan-vulnerabilityreport-cbf8c9b99
I1008 19:58:51.247297 62385 scanner.go:262] Getting logs for nginx container in job: starboard/scan-vulnerabilityreport-cbf8c9b99
I1008 19:58:51.674449 62385 scanner.go:123] Deleting scan job: starboard/scan-vulnerabilityreport-cbf8c9b99
Now, if we scan the nginx Deployment in the bar namespace, Starboard will see that there's already a ClusterVulnerabilityReport (84bcb5cd46) for the same image reference nginx:1.16 and will skip creation of a K8s Job. It will just read and copy the report as a VulnerabilityReport object to the bar namespace.
$ kubectl get clustervulnerabilityreports -o wide
NAME REPOSITORY TAG DIGEST SCANNER AGE CRITICAL HIGH MEDIUM LOW UNKNOWN
84bcb5cd46 library/nginx 1.16 Trivy 17s 21 50 33 104 0
$ starboard scan vulnerabilityreports deploy/nginx -n bar -v 3
I1008 19:59:23.891718 62478 scanner.go:72] Getting Pod template for workload: {Deployment nginx bar}
I1008 19:59:23.895310 62478 scanner.go:89] Checking if image nginx:1.16 was already scanned
I1008 19:59:23.903058 62478 scanner.go:95] Cache hit
I1008 19:59:23.903078 62478 scanner.go:97] Copying ClusterVulnerabilityReport to VulnerabilityReport
As you can see, Starboard eventually created two VulnerabilityReports by spinning up only one K8s Job.
$ kubectl get vulnerabilityreports -A
NAMESPACE NAME REPOSITORY TAG SCANNER AGE
bar replicaset-nginx-6d4cf56db6-nginx library/nginx 1.16 Trivy 5m38s
foo replicaset-nginx-6d4cf56db6-nginx library/nginx 1.16 Trivy 6m10s
Life-cycle management¶
Just like any other cache, it is important that it stays up to date and contains correct information. To ensure this, we need an automated way of cleaning up ClusterVulnerabilityReports after some time.
My suggestion is to solve this problem just like we did in PR #879.
For each ClusterVulnerabilityReport created, we should annotate the report with starboard.aquasecurity.github.io/cluster-vulnerability-report-ttl.
When the TTL expires, the TTL controller will automatically delete the existing ClusterVulnerabilityReport, and the next time the image is used in the cluster a normal vulnerability scan will happen again.
I suggest a default TTL of 72 hours for these reports. This is a new feature, but I don't see why we shouldn't enable it by default.
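A sketch of the expiry check such a controller could perform is shown below. The annotation key is the one proposed above, while the assumption that its value is a Go-style duration string (e.g. 72h) is mine.

```go
// Sketch of the TTL check a cleanup controller could run for each
// ClusterVulnerabilityReport it reconciles.
package ttl

import (
	"time"
)

const ttlAnnotation = "starboard.aquasecurity.github.io/cluster-vulnerability-report-ttl"

// expired reports whether a report created at the given time, annotated with a
// TTL value (assumed here to be a Go duration string such as "72h"), should be
// deleted. The second return value is how long to wait before checking again.
func expired(annotations map[string]string, created, now time.Time) (bool, time.Duration, error) {
	ttl, err := time.ParseDuration(annotations[ttlAnnotation])
	if err != nil {
		return false, 0, err
	}
	age := now.Sub(created)
	if age >= ttl {
		return true, 0, nil
	}
	// Not expired yet: requeue the report shortly after the TTL runs out.
	return false, ttl - age, nil
}
```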
Vulnerability reports¶
From a vulnerability report's point of view, cluster admins need a simple way to tell whether a given VulnerabilityReport was generated from the cache and, if so, from which ClusterVulnerabilityReport. We could either do this by setting a status on the VulnerabilityReport that gets created or, since this feature won't be on by default, use annotations; I suggest the latter.
For example, the annotation starboard.aquasecurity.github.io/ClusterVulnerabilityReportName: 84bcb5cd46 would make the source report easy to find.
We can't use something like an ownerReference, since deleting a ClusterVulnerabilityReport would then delete all the VulnerabilityReports derived from it at the same time.
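As a sketch, the copy step could simply stamp the namespaced report as follows; the helper is hypothetical and the annotation key is the one proposed above.

```go
// Sketch of recording a back-reference from a copied, namespaced report to its
// source. A plain annotation (rather than an ownerReference) keeps the copies
// alive if the cached cluster-scoped report is later deleted.
package report

const sourceAnnotation = "starboard.aquasecurity.github.io/ClusterVulnerabilityReportName"

// annotateSource records which ClusterVulnerabilityReport a namespaced
// VulnerabilityReport was copied from.
func annotateSource(annotations map[string]string, clusterReportName string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[sourceAnnotation] = clusterReportName
	return annotations
}
```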
Summary¶
- This solution might be the first step towards more efficient vulnerability scanning.
- It's backward compatible and can be implemented as an experimental feature behind a gate.
- Both Starboard CLI and Starboard Operator can read and leverage ClusterVulnerabilityReports.