Prometheus Integration

Purpose

A Prometheus server can collect metrics from the Antrea Controller and Agent components and provide visibility into their behavior. This document gives general guidelines for configuring a Prometheus server to work with the Antrea components.

About Prometheus

Prometheus is an open source monitoring and alerting server. It can collect metrics from various Kubernetes components, store them, and raise alerts based on them. Prometheus can also provide visibility by integrating with other products such as Grafana.

One of Prometheus's capabilities is automatic discovery of Kubernetes services that expose metrics. Prometheus can therefore scrape the metrics of any additional components that are added to the cluster, without further configuration changes.

Antrea Configuration

Enable the Prometheus metrics listener by setting the enablePrometheusMetrics parameter to true in both the Controller and the Agent configurations.
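
As a minimal sketch, the relevant line would look like the following in each configuration file. Only the parameter name and value come from this document; the comment naming the files assumes the antrea-agent.conf and antrea-controller.conf files referenced in the scraping section.

```yaml
# Assumed location: set this in both antrea-agent.conf and antrea-controller.conf.
enablePrometheusMetrics: true
```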

Prometheus Configuration

Prometheus version

Prometheus integration with Antrea is validated as part of CI using Prometheus v2.19.3.

Prometheus RBAC

Prometheus requires access to Kubernetes API resources for the service discovery capability. Reading metrics also requires access to the "/metrics" API endpoints.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
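
A ClusterRole only takes effect once it is bound to the identity that Prometheus runs as. The following sketch binds the role above to a ServiceAccount. The ServiceAccount name (prometheus) and the namespace (monitoring) are assumptions for illustration and should be replaced with the values used by the actual Prometheus deployment.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus         # the ClusterRole defined above
subjects:
- kind: ServiceAccount
  name: prometheus         # assumed ServiceAccount name
  namespace: monitoring    # assumed namespace
```

A similar binding would be needed for any other ClusterRole granted to the same ServiceAccount.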

Antrea Metrics Listener Access

To scrape metrics from the Antrea Controller and Agent, Prometheus needs the following permissions:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: prometheus-antrea
rules:
- nonResourceURLs:
  - /metrics
  verbs:
  - get

Antrea Components Scraping Configuration

Add the following jobs to the Prometheus scraping configuration to enable metrics collection from the Antrea components. The Antrea Agent metrics endpoint is exposed through the Antrea apiserver on the port set by the apiport config parameter in antrea-agent.conf (default value is 10350). The Antrea Controller metrics endpoint is exposed through the Antrea apiserver on the port set by the apiport config parameter in antrea-controller.conf (default value is 10349).

Controller Scraping

- job_name: 'antrea-controllers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_container_name]
    action: keep
    regex: kube-system;antrea-controller
  - source_labels: [__meta_kubernetes_pod_node_name, __meta_kubernetes_pod_name]
    target_label: instance

Agent Scraping

- job_name: 'antrea-agents'
  kubernetes_sd_configs:
  - role: pod
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_container_name]
    action: keep
    regex: kube-system;antrea-agent
  - source_labels: [__meta_kubernetes_pod_node_name, __meta_kubernetes_pod_name]
    target_label: instance

For further reference, see the enclosed configuration file (/build/yamls/antrea-prometheus.yml).

This configuration file can be used to deploy a Prometheus server with a scraping configuration for the Antrea components. To deploy it, run:

kubectl apply -f build/yamls/antrea-prometheus.yml

Antrea Prometheus Metrics

The Antrea Controller and Agents expose various metrics. Some are provided by the Antrea components themselves; others are provided by third-party components that the Antrea components use.

The lists below cover both the metrics provided by the Antrea components and those provided by third parties.

Antrea Agent Metrics

  • antrea_agent_conntrack_antrea_connection_count: Number of connections in the Antrea ZoneID of the conntrack table. This metric gets updated at an interval specified by flowPollInterval, a configuration parameter for the Agent.
  • antrea_agent_conntrack_max_connection_count: Size of the conntrack table. This metric gets updated at an interval specified by flowPollInterval, a configuration parameter for the Agent.
  • antrea_agent_conntrack_total_connection_count: Number of connections in the conntrack table. This metric gets updated at an interval specified by flowPollInterval, a configuration parameter for the Agent.
  • antrea_agent_egress_networkpolicy_rule_count: Number of egress networkpolicy rules on local node which are managed by the Antrea Agent.
  • antrea_agent_ingress_networkpolicy_rule_count: Number of ingress networkpolicy rules on local node which are managed by the Antrea Agent.
  • antrea_agent_local_pod_count: Number of pods on local node which are managed by the Antrea Agent.
  • antrea_agent_networkpolicy_count: Number of networkpolicies on local node which are managed by the Antrea Agent.
  • antrea_agent_ovs_flow_count: Flow count for each OVS flow table. The TableID is used as a label.
  • antrea_agent_ovs_flow_ops_count: Number of OVS flow operations, partitioned by operation type (add, modify and delete).
  • antrea_agent_ovs_flow_ops_error_count: Number of OVS flow operation errors, partitioned by operation type (add, modify and delete).
  • antrea_agent_ovs_flow_ops_latency_milliseconds: The latency of OVS flow operations, partitioned by operation type (add, modify and delete).
  • antrea_agent_ovs_total_flow_count: Total flow count of all OVS flow tables.
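
To illustrate how the Agent metrics above might be used, here is a sketch of a Prometheus alerting rule file that fires when the conntrack table is close to full. The group name, alert name, 80% threshold, 10 minute duration and severity label are all hypothetical choices, not values recommended by Antrea.

```yaml
groups:
- name: antrea-agent-example            # hypothetical group name
  rules:
  - alert: AntreaConntrackTableNearlyFull   # hypothetical alert name
    # Ratio of tracked connections to the conntrack table size, per Agent.
    expr: antrea_agent_conntrack_total_connection_count / antrea_agent_conntrack_max_connection_count > 0.8
    for: 10m                            # assumed duration before firing
    labels:
      severity: warning
    annotations:
      summary: "Conntrack table on {{ $labels.instance }} is more than 80% full"
```

Because both metrics are refreshed only once per flowPollInterval, a "for" duration shorter than that interval would add little.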

Antrea Controller Metrics

  • antrea_controller_address_group_processed: The total number of address-group processed
  • antrea_controller_address_group_sync_duration_milliseconds: The duration of syncing address-group
  • antrea_controller_applied_to_group_processed: The total number of applied-to-group processed
  • antrea_controller_applied_to_group_sync_duration_milliseconds: The duration of syncing applied-to-group
  • antrea_controller_length_address_group_queue: The length of AddressGroupQueue
  • antrea_controller_length_applied_to_group_queue: The length of AppliedToGroupQueue
  • antrea_controller_length_network_policy_queue: The length of InternalNetworkPolicyQueue
  • antrea_controller_network_policy_processed: The total number of internal-networkpolicy processed
  • antrea_controller_network_policy_sync_duration_milliseconds: The duration of syncing internal-networkpolicy
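
The queue length metrics above can indicate that the Controller is falling behind. The following alerting rule is a sketch under that assumption; the alert name, the threshold of 100 items and the 15 minute duration are hypothetical and would need tuning for a real cluster.

```yaml
groups:
- name: antrea-controller-example       # hypothetical group name
  rules:
  - alert: AntreaNetworkPolicyQueueBacklog  # hypothetical alert name
    # Fires when the internal NetworkPolicy work queue stays long.
    expr: antrea_controller_length_network_policy_queue > 100
    for: 15m                            # assumed duration before firing
    labels:
      severity: warning
    annotations:
      summary: "Antrea Controller NetworkPolicy queue has stayed above 100 items"
```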

Common Metrics Provided by Infrastructure

Apiserver Metrics

  • apiserver_audit_event_total: Counter of audit events generated and sent to the audit backend.
  • apiserver_audit_requests_rejected_total: Counter of apiserver requests rejected due to an error in audit logging backend.
  • apiserver_client_certificate_expiration_seconds: Distribution of the remaining lifetime on the certificate used to authenticate a request.
  • apiserver_current_inflight_requests: Maximal number of currently used inflight request limit of this apiserver per request kind in last second.
  • apiserver_envelope_encryption_dek_cache_fill_percent: Percent of the cache slots currently occupied by cached DEKs.
  • apiserver_longrunning_gauge: Gauge of all active long-running apiserver requests broken out by verb, group, version, resource, scope and component. Not all requests are tracked this way.
  • apiserver_registered_watchers: Number of currently registered watchers for a given resources
  • apiserver_request_duration_seconds: Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
  • apiserver_request_total: Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response contentType and code.
  • apiserver_response_sizes: Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
  • apiserver_storage_data_key_generation_duration_seconds: Latencies in seconds of data encryption key(DEK) generation operations.
  • apiserver_storage_data_key_generation_failures_total: Total number of failed data encryption key(DEK) generation operations.
  • apiserver_storage_envelope_transformation_cache_misses_total: Total number of cache misses while accessing key decryption key(KEK).
  • apiserver_watch_events_sizes: Watch event size distribution in bytes
  • apiserver_watch_events_total: Number of events sent in watch clients

Authenticated Metrics

  • authenticated_user_requests: Counter of authenticated requests broken out by username.

Authentication Metrics

  • authentication_attempts: Counter of authenticated attempts.
  • authentication_duration_seconds: Authentication duration in seconds broken out by result.
  • authentication_token_cache_active_fetch_count:
  • authentication_token_cache_fetch_total:
  • authentication_token_cache_request_duration_seconds:
  • authentication_token_cache_request_total:

Go Metrics

  • go_gc_duration_seconds: A summary of the GC invocation durations.
  • go_goroutines: Number of goroutines that currently exist.
  • go_info: Information about the Go environment.
  • go_memstats_alloc_bytes: Number of bytes allocated and still in use.
  • go_memstats_alloc_bytes_total: Total number of bytes allocated, even if freed.
  • go_memstats_buck_hash_sys_bytes: Number of bytes used by the profiling bucket hash table.
  • go_memstats_frees_total: Total number of frees.
  • go_memstats_gc_cpu_fraction: The fraction of this program's available CPU time used by the GC since the program started.
  • go_memstats_gc_sys_bytes: Number of bytes used for garbage collection system metadata.
  • go_memstats_heap_alloc_bytes: Number of heap bytes allocated and still in use.
  • go_memstats_heap_idle_bytes: Number of heap bytes waiting to be used.
  • go_memstats_heap_inuse_bytes: Number of heap bytes that are in use.
  • go_memstats_heap_objects: Number of allocated objects.
  • go_memstats_heap_released_bytes: Number of heap bytes released to OS.
  • go_memstats_heap_sys_bytes: Number of heap bytes obtained from system.
  • go_memstats_last_gc_time_seconds: Number of seconds since 1970 of last garbage collection.
  • go_memstats_lookups_total: Total number of pointer lookups.
  • go_memstats_mallocs_total: Total number of mallocs.
  • go_memstats_mcache_inuse_bytes: Number of bytes in use by mcache structures.
  • go_memstats_mcache_sys_bytes: Number of bytes used for mcache structures obtained from system.
  • go_memstats_mspan_inuse_bytes: Number of bytes in use by mspan structures.
  • go_memstats_mspan_sys_bytes: Number of bytes used for mspan structures obtained from system.
  • go_memstats_next_gc_bytes: Number of heap bytes when next garbage collection will take place.
  • go_memstats_other_sys_bytes: Number of bytes used for other system allocations.
  • go_memstats_stack_inuse_bytes: Number of bytes in use by the stack allocator.
  • go_memstats_stack_sys_bytes: Number of bytes obtained from system for stack allocator.
  • go_memstats_sys_bytes: Number of bytes obtained from system.
  • go_threads: Number of OS threads created.

Process Metrics

  • process_cpu_seconds_total: Total user and system CPU time spent in seconds.
  • process_max_fds: Maximum number of open file descriptors.
  • process_open_fds: Number of open file descriptors.
  • process_resident_memory_bytes: Resident memory size in bytes.
  • process_start_time_seconds: Start time of the process since unix epoch in seconds.
  • process_virtual_memory_bytes: Virtual memory size in bytes.
  • process_virtual_memory_max_bytes: Maximum amount of virtual memory available in bytes.
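
The infrastructure metrics are shared by many components, so it can help to aggregate them per scrape job. The sketch below is a Prometheus recording rule that tracks CPU usage for the two Antrea jobs defined in the scraping configuration. The group name and the recorded series name are hypothetical, and the 5 minute rate window is an assumption.

```yaml
groups:
- name: antrea-process-example          # hypothetical group name
  rules:
  # Average CPU cores used over 5 minutes, summed per Antrea scrape job.
  - record: job:process_cpu_seconds:rate5m   # hypothetical series name
    expr: sum by (job) (rate(process_cpu_seconds_total{job=~"antrea-controllers|antrea-agents"}[5m]))
```
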