Troubleshooting the Trivy Operator
The Trivy Operator installs several resources into your Kubernetes cluster.
This section covers the common steps to check whether the operator is running correctly and to troubleshoot common issues.
In addition to this section, you might want to check the issues, discussion forum, or Slack to see whether someone from the community has run into similar problems before.
Also note that the Trivy Operator is based on an existing Aqua OSS project, Starboard, and shares some of its design, principles, and code. Existing content that relates to the Starboard Operator might therefore also be relevant for the Trivy Operator, and Starboard's issues, discussion forum, or Slack might also be worth checking.
In some cases you might want to refer to Starboard's Design documents.
Installation
Make sure that the latest version of the Trivy Operator is installed. For this, have a look at the installation options.
For instance, if you are using the Helm deployment, you need to check the Helm Chart version deployed to your cluster. You can check it with the following command:
helm list -n trivy-system
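If the chart is outdated, you can update the repository and upgrade the release. A minimal sketch, assuming the chart repository was added under the aqua alias and the release is named trivy-operator:
helm repo update
helm upgrade trivy-operator aqua/trivy-operator -n trivy-system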
Operator Pod Not Running
The Trivy Operator runs a pod inside your cluster. If you have followed the installation guide, you will have installed the Operator in the trivy-system namespace.
Make sure that the pod is in the Running status:
kubectl get pods -n trivy-system
This is how it will look if it is running okay:
NAMESPACE NAME READY STATUS RESTARTS AGE
trivy-system trivy-operator-6c9bd97d58-hsz4g 1/1 Running 5 (19m ago) 30h
If the pod is in the Failed, Pending, or Unknown state, check the events and the logs of the pod.
First, check the events, since they might be more descriptive of the problem. However, if the events do not give a clear reason why the pod cannot spin up, then you want to check the logs, which provide more detail.
kubectl describe pod <POD-NAME> -n trivy-system
To check the logs, use the following command:
kubectl logs deployment/trivy-operator -n trivy-system
If your pod is not running, look for errors in the logs, as they can give an indication of the problem.
If there are too many log messages, try deleting the Trivy Operator pod and observing its behaviour upon restart. A new pod should spin up automatically after the failed pod is deleted.
ImagePullBackOff or ErrImagePull
Check the status of the Trivy Operator pod running inside your Kubernetes cluster. If the status is ImagePullBackOff or ErrImagePull, it means that the Operator either
- is trying to access the wrong image, or
- cannot pull the image from the registry.
Make sure that you are providing the right image references (and, for private registries, the right pull secrets) when installing the Trivy Operator.
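To verify which image the Operator is actually trying to pull, you can inspect the Deployment directly, for example:
kubectl get deployment trivy-operator -n trivy-system -o jsonpath='{.spec.template.spec.containers[0].image}'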
CrashLoopBackOff
If your pod is in the CrashLoopBackOff state, the container starts but keeps crashing and is repeatedly restarted on the node it was scheduled to.
In this case, you want to investigate whether there is an issue with the pod's configuration or with the node itself. It could, for instance, be the case that the node does not have sufficient resources.
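To check whether the node is under resource pressure, you can describe it and, if the metrics server is installed, look at current usage:
kubectl describe node <NODE-NAME> | grep -A 10 "Allocated resources"
kubectl top nodes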
Reconciliation Error
It can happen that the pod appears to be running normally but does not reconcile the resources inside your Kubernetes cluster.
Check the logs for Reconciliation errors:
kubectl logs deployment/trivy-operator -n trivy-system
If this is the case, the Trivy Operator likely does not have the right configuration or permissions to access the resources it is trying to reconcile.
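To narrow the logs down to reconciliation problems, you can filter for the relevant messages, for example:
kubectl logs deployment/trivy-operator -n trivy-system | grep -i "reconcil"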
Operator does not create VulnerabilityReports
VulnerabilityReports are owned and controlled by the immediate Kubernetes workload; a report for a pod managed by a ReplicaSet is therefore linked to that ReplicaSet. If the Trivy Operator does not create a VulnerabilityReport for your workloads, it might not be monitoring the namespace that your workloads are running in.
An easy way to check this is by looking for the ClusterRoleBinding for the Trivy Operator:
kubectl get ClusterRoleBinding | grep "trivy-operator"
Alternatively, you could use the kubectl-who-can plugin by Aqua:
$ kubectl who-can list vulnerabilityreports
No subjects found with permissions to list vulnerabilityreports assigned through RoleBindings
CLUSTERROLEBINDING SUBJECT TYPE SA-NAMESPACE
cluster-admin system:masters Group
trivy-operator trivy-operator ServiceAccount trivy-system
system:controller:generic-garbage-collector generic-garbage-collector ServiceAccount kube-system
system:controller:namespace-controller namespace-controller ServiceAccount kube-system
system:controller:resourcequota-controller resourcequota-controller ServiceAccount kube-system
system:kube-controller-manager system:kube-controller-manager User
If the ClusterRoleBinding does not exist, Trivy currently cannot monitor any namespace outside of the trivy-system namespace.
For instance, if you are using the Helm Chart, you want to make sure to set the targetNamespace to the namespace that you want the Operator to monitor.
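A minimal sketch of setting this via Helm, assuming the chart exposes the target namespace setting as a targetNamespaces value (check your chart version's values.yaml for the exact key):
helm upgrade --install trivy-operator aqua/trivy-operator -n trivy-system --set targetNamespaces=applications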
The operator might also not be configured to scan the workload types you are expecting. Check that OPERATOR_TARGET_WORKLOADS is set correctly in your configuration; it specifies which workload types are scanned.
For example, the default Helm Chart values configure the following Kubernetes workloads to be scanned: "pod,replicaset,replicationcontroller,statefulset,daemonset,cronjob,job".
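If you only need a subset of workload types, a hedged sketch of overriding this in your Helm values file (the targetWorkloads key is assumed; confirm it against your chart's values.yaml):
targetWorkloads: "pod,replicaset,statefulset,daemonset"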
Installing the Operator in a namespace with default deny-all egress/ingress network policies
If you are trying to install the Trivy-Operator in a namespace where there are default deny-all egress/ingress network policies (see example below), you might need to configure some extra network policies yourself to make sure the traffic can flow as expected and the operator does not enter an error state.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-egress
namespace: trivy-system
spec:
podSelector: {}
policyTypes:
- Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: trivy-system
spec:
podSelector: {}
policyTypes:
- Ingress
Notice that the namespace is trivy-system; the network policies above assume that you installed the trivy-operator (and the trivy-server, when applicable) there. Keep in mind that the same kind of network policies might exist in other namespaces (like kube-system or default) where important Kubernetes components live, such as CoreDNS and the default kubernetes service.
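To see which namespaces already carry such policies, you can list NetworkPolicies across the whole cluster:
kubectl get networkpolicy --all-namespaces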
We'll now create a namespace in which to deploy a pod with vulnerabilities. This namespace will be called applications:
kubectl create namespace applications
The next step is to create the pod with vulnerabilities. To do this, we can run the following command:
kubectl run nginx --image=nginx -n applications
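You can confirm the pod came up before expecting any reports:
kubectl get pods -n applications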
At this point, we should expect the Trivy Operator to generate the VulnerabilityReports custom resources in our applications namespace. However, if we try to get these resources in the applications namespace, we'll see that we don't get any reports:
kubectl get vulnerabilityreports -n applications
No resources found in applications namespace.
If we look at the output of get pods, the description of the trivy-operator pod, and its logs, we'll see some interesting messages that help us understand why the reports aren't being generated.
kubectl get pods -n trivy-system
NAME READY STATUS RESTARTS AGE
trivy-operator-846f8c6446-clzlk 0/1 CrashLoopBackOff 6 (2m41s ago) 8m28s
kubectl describe pods trivy-operator-846f8c6446-clzlk -n trivy-system | grep Events -A 10
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m11s default-scheduler Successfully assigned trivy-system/trivy-operator-846f8c6446-clzlk to k3d-kon-test-server-0
Normal Created 6m26s (x4 over 7m11s) kubelet Created container trivy-operator
Normal Started 6m26s (x4 over 7m11s) kubelet Started container trivy-operator
Normal Pulled 5m38s (x5 over 7m11s) kubelet Container image "ghcr.io/aquasecurity/trivy-operator:0.16.4" already present on machine
Warning BackOff 2m4s (x32 over 7m9s) kubelet Back-off restarting failed container trivy-operator in pod trivy-operator-846f8c6446-clzlk_trivy-system(ddbfdf6d-751b-4137-860e-5561c71b6f8d)
The pod is in a CrashLoopBackOff state, and the description confirms that the container is constantly being restarted.
kubectl logs trivy-operator-846f8c6446-clzlk -n trivy-system
2023/11/21 06:04:02 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
{"level":"info","ts":"2023-11-21T06:04:02Z","logger":"main","msg":"Starting operator","buildInfo":{"Version":"0.16.4","Commit":"c2f0e0f4f773f090f61c07489fd6dc062d465b2d","Date":"2023-10-29T08:18:47Z","Executable":""}}
{"level":"info","ts":"2023-11-21T06:04:02Z","logger":"operator","msg":"Resolved install mode","install mode":"AllNamespaces","operator namespace":"trivy-system","target namespaces":[],"exclude namespaces":"","target workloads":["pod","replicaset","replicationcontroller","statefulset","daemonset","cronjob","job"]}
{"level":"info","ts":"2023-11-21T06:04:02Z","logger":"operator","msg":"Watching all namespaces"}
unable to run trivy operator: failed getting configmap: trivy-operator: Get "https://10.43.0.1:443/api/v1/namespaces/trivy-system/configmaps/trivy-operator": dial tcp 10.43.0.1:443: connect: connection refused
We see that the trivy-operator is correctly configured to watch all namespaces, which means that the VulnerabilityReports should be generated across all namespaces, including the applications namespace.
The first red flag that something is wrong with the networking configuration can be found in the following message:
unable to run trivy operator: failed getting configmap: trivy-operator: Get "https://10.43.0.1:443/api/v1/namespaces/trivy-system/configmaps/trivy-operator": dial tcp 10.43.0.1:443: connect: connection refused
The IP address in the error, 10.43.0.1, belongs to the kube-api (the Kubernetes API server). We can confirm this by looking at the service called kubernetes:
kubectl get svc kubernetes
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.43.0.1 <none> 443/TCP 28d
Basically, the trivy-operator cannot reach the kube-api to make the calls it needs to look up different resources. In the example above, we can see that the trivy-operator was looking for the trivy-operator ConfigMap in the trivy-system namespace. Our first task is to enable traffic between the trivy-operator pods and the kube-api service so that the trivy-operator can successfully get what it needs from the kube-api.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-egress-from-trivy-system-to-kube-api
namespace: trivy-system
spec:
podSelector: {}
egress:
- to:
- ipBlock:
cidr: 10.43.0.1/32
If we run kubectl logs -n trivy-system deployment/trivy-operator, we'll see that the error referencing 10.43.0.1:443 has disappeared. This means that we have successfully enabled egress traffic from the trivy-system namespace to the kube-api.
NOTE: For faster results, restart the trivy-operator deployment:
kubectl rollout restart deployment trivy-operator -n trivy-system
We also notice new errors in the logs referencing port 53:
failed to download vulnerability DB: database download error: OCI repository error: 1 error occurred:\n\t* Get \"https://ghcr.io/v2/\": dial tcp: lookup ghcr.io on 10.43.0.10:53:
This means that the trivy-operator cannot resolve DNS records. The cause is that traffic to the kube-dns service in the kube-system namespace is blocked by the deny-all egress network policy in the trivy-system namespace. We can confirm this by running the following command and looking at the service information:
kubectl get svc -n kube-system kube-dns
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.43.0.10 <none> 53/UDP,53/TCP,9153/TCP 20m
To remediate this issue, we'll need to create a network policy allowing traffic on port 53 to the kube-system namespace. This will allow pods in trivy-system to perform DNS lookups via the CoreDNS pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-egress-allow-kube-system-dns
namespace: trivy-system
spec:
egress:
- ports:
- port: 53
protocol: TCP
- port: 53
protocol: UDP
to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
podSelector: {}
policyTypes:
- Egress
When we look at the trivy-operator logs again, we'll see that the error logs referencing port 53 are gone. Now we see a new error mentioning port 443:
{"level":"error","ts":"2024-02-25T07:46:25Z","logger":"reconciler.scan job","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-57ff7d8c55","container":"1bad6981-ddcb-4845-98cd-e8bb5b25926c","status.reason":"Error","status.message":"2024-02-25T07:46:22.300Z\t\u001b[34mINFO\u001b[0m\tNeed to update DB\n2024-02-25T07:46:22.300Z\t\u001b[34mINFO\u001b[0m\tDB Repository: ghcr.io/aquasecurity/trivy-db\n2024-02-25T07:46:22.300Z\t\u001b[34mINFO\u001b[0m\tDownloading DB...\n2024-02-25T07:46:22.816Z\t\u001b[31mFATAL\u001b[0m\tinit error: DB error: failed to download vulnerability DB: database download error: oci download error: failed to fetch the layer: Get \"https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:2f0f866f6f274de192d9dfcd752c892e2099126fe0362dc8b4c7bb0b7e75956d?se=2024-02-25T07%3A55%3A00Z&sig=r9L1Phopnozwr%2B5TOTj8tF7D7bixyUqdsJNDESU1TPI%3D&sp=r&spr=https&sr=b&sv=2019-12-12\": dial tcp 185.199.111.154:443: connect: connection refused\n"
This means that the trivy-operator cannot talk to the internet over port 443 to download the vulnerability database. We need to create a new network policy to allow this traffic.
Before proceeding with the creation of our next network policy, it is important to understand a few things. The trivy-operator itself does not download the vulnerability database. Instead, it spawns scan pods, created via a Job, that download the vulnerability database over port 443.
We can confirm this by running watch kubectl get pods -n trivy-system to watch the pods in the trivy-system namespace, then restarting the trivy-operator via kubectl rollout restart deployment -n trivy-system:
NAME READY STATUS RESTARTS AGE
trivy-operator-6b4dc78c5-nzzcm 1/1 Running 0 11s
scan-vulnerabilityreport-6f9cb46645-pzx7w 1/1 Running 0 8s
Here we see the scanning pod being spawned. We now need a suitable label so we can create a network policy for it. We do so by grabbing the pod name and getting its labels via yq while we watch for the pod in another terminal:
kubectl get pods -n trivy-system scan-vulnerabilityreport-6dfb8dc69f-fwpbh -o yaml | yq '.metadata.labels'
We get the output:
app.kubernetes.io/managed-by: trivy-operator
controller-uid: 10aba790-6ee6-4802-81ed-ad77908ea10d
job-name: scan-vulnerabilityreport-6dfb8dc69f
resource-spec-hash: 764dd688f
trivy-operator.resource.kind: ReplicaSet
trivy-operator.resource.name: trivy-operator-6b65576869
trivy-operator.resource.namespace: trivy-system
vulnerabilityReport.scanner: Trivy
We can probably use app.kubernetes.io/managed-by: trivy-operator, as this label follows the standard format Kubernetes recommends.
We proceed to create the network policy as follows:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-egress-443-trivy-operator
namespace: trivy-system
spec:
egress:
- ports:
- port: 443
protocol: TCP
to:
- ipBlock:
cidr: 0.0.0.0/0
podSelector:
matchLabels:
app.kubernetes.io/managed-by: trivy-operator
policyTypes:
- Egress
We use the CIDR 0.0.0.0/0 to denote that we want to allow the target pods to talk to any IP address, and we specify port 443 as the allowed port. If we query the logs again, we'll see that the error is gone. Moreover, if we run kubectl get vulnerabilityreport -n applications, we'll see that the report for the nginx pod has recently been generated:
NAME REPOSITORY TAG SCANNER AGE
pod-nginx-nginx library/nginx latest Trivy 2m28s
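To inspect the findings in detail, you can describe the report or print it as YAML:
kubectl describe vulnerabilityreport pod-nginx-nginx -n applications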
Trivy-server
When deploying the trivy-operator + trivy-server for downloading the vulnerability database, you will need to create similar network policies to the ones created for the trivy-operator as a standalone component.
After installing the trivy-server in the current cluster, the pod entered a CrashLoopBackOff state. Upon inspecting the logs of the trivy-server StatefulSet, we encounter the following error:
2024-02-28T04:53:50.195Z FATAL failed to download vulnerability DB: database download error: OCI repository error: 1 error occurred:
* Get "https://ghcr.io/v2/": dial tcp 140.82.114.34:443: connect: connection refused
This means that the trivy-server cannot connect to the image registry over port 443. This can be fixed by applying a network policy like allow-egress-443-trivy-operator, which we created for the trivy-operator, but first we must get the label that selects the pods the trivy-server creates. Running kubectl get pods trivy-0 -n trivy-system --show-labels, we obtain the following output:
NAME READY STATUS RESTARTS AGE LABELS
trivy-0 1/1 Running 0 3m40s app.kubernetes.io/instance=trivy,app.kubernetes.io/name=trivy,controller-revision-hash=trivy-7494747496,statefulset.kubernetes.io/pod-name=trivy-0
We can make use of the label app.kubernetes.io/name=trivy, so the resulting network policy will look like this:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-egress-443-trivy-server
namespace: trivy-system
spec:
egress:
- ports:
- port: 443
protocol: TCP
to:
- ipBlock:
cidr: 0.0.0.0/0
podSelector:
matchLabels:
app.kubernetes.io/name: trivy
policyTypes:
- Egress
We proceed to restart the trivy-server StatefulSet with kubectl rollout restart sts -n trivy-system trivy and see that the previous error is gone. The trivy-server was able to download the DB and is listening on port 4954.
kubectl logs -n trivy-system statefulset/trivy
2024-02-28T05:17:53.590Z INFO Need to update DB
2024-02-28T05:17:53.590Z INFO DB Repository: ghcr.io/aquasecurity/trivy-db
2024-02-28T05:17:53.590Z INFO Downloading DB...
2024-02-28T05:17:57.550Z INFO Listening 0.0.0.0:4954...
When we restart the trivy-operator to test whether everything works as it should, we see that it outputs the following error in the logs:
failed to do request: Post \"http://trivy.trivy-system:4954/twirp/trivy.cache.v1.Cache/MissingBlobs\": dial tcp 10.43.158.111:4954: connect: connection refused"
The trivy-operator has to reach the trivy-server on port 4954 in order to access the downloaded vulnerability database. We need to enable that connection via a network policy as well (you guessed it). We can delete the previously created allow-egress-443-trivy-operator network policy via kubectl delete networkpolicy allow-egress-443-trivy-operator -n trivy-system and create a new one that also allows egress traffic to port 4954, with a name that reflects its new purpose:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-egress-443-and-4954-trivy-operator
namespace: trivy-system
spec:
egress:
- ports:
- port: 443
protocol: TCP
- port: 4954
protocol: TCP
to:
- ipBlock:
cidr: 0.0.0.0/0
podSelector:
matchLabels:
app.kubernetes.io/managed-by: trivy-operator
policyTypes:
- Egress
After saving the changes to the policy, we can restart the trivy-operator with kubectl rollout restart deployment trivy-operator -n trivy-system. When looking at the logs for the trivy-operator, we see errors indicating it still cannot connect to port 4954:
dial tcp 10.43.158.111:4954: connect: connection refused
So far we have created network policies to allow egress traffic. There is one last missing network policy, and it is of type ingress. This network policy will allow the trivy-server to receive traffic on port 4954.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-ingress-4954-trivy-server
namespace: trivy-system
spec:
ingress:
- ports:
- port: 4954
protocol: TCP
podSelector:
matchLabels:
app.kubernetes.io/name: trivy
policyTypes:
- Ingress
This network policy uses the appropriate matchLabels to target only the trivy-server. When restarting the trivy-operator with kubectl rollout restart deployment trivy-operator -n trivy-system, we see that the errors are gone. When running kubectl get vulnerabilityreport -n applications, we see that there is a newly generated vulnerabilityreport for our nginx pod:
NAME REPOSITORY TAG SCANNER AGE
pod-nginx-nginx library/nginx latest Trivy 12s
We have successfully added all the necessary network policies for our trivy-operator to work in client/server mode.
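As a final check, you can list all the network policies that are now in place in the trivy-system namespace:
kubectl get networkpolicy -n trivy-system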