ClusterLoaderV2 Access Token Test Failing: Troubleshooting Guide
This document analyzes the failing ClusterLoaderV2.access-tokens test and provides a guide to troubleshoot the problem.
The ClusterLoaderV2.access-tokens overall test, defined at /home/prow/go/src/k8s.io/perf-tests/clusterloader2/testing/access-tokens/config.yaml, has been consistently failing in the sig-release-master-informing and ec2-master-scale-performance jobs. This guide covers the failures, their timeline, potential causes, and relevant links for further investigation.
Jobs and Tests Affected
The primary focus is on the ClusterLoaderV2.access-tokens test, specifically its performance within the following jobs:
- sig-release-master-informing
- ec2-master-scale-performance
It's crucial to understand how these jobs are configured and what performance baselines are expected for the access-tokens test in each environment.
Failure Timeline
The failures have been occurring for approximately two weeks, with the first instance observed on November 3, 2025. Here's a breakdown:
- First Failure: Mon, 03 Nov 2025 07:03:34 UTC
- Latest Failure: Mon, 17 Nov 2025 07:02:25 UTC
This consistent failure pattern suggests a regression or environmental change introduced around the initial failure date. Analyzing changes to the Kubernetes codebase, cluster configuration, or testing infrastructure around November 3rd could provide valuable clues.
Analyzing the Failure Reason
The logs indicate that the APIResponsivenessPrometheus measurement is failing: the perc99 (99th percentile) latency for API calls exceeds the expected 30-second limit. Here are excerpts from the logs:
```
F 2025-11-17 04:02:25 -0300
[measurement call APIResponsivenessPrometheus - APIResponsivenessPr...73s, perc99: 1m0s Count:137 SlowCount:73}; expected perc99 <= 30s]]

F 2025-11-16 04:03:11 -0300
[measurement call APIResponsivenessPrometheus - APIResponsivenessPr...59s, perc99: 1m0s Count:172 SlowCount:88}; expected perc99 <= 30s]]
```
This means that a significant share of API calls are taking far longer than expected: the 99th-percentile latency sits at 1 minute against a 30-second threshold. The SlowCount values (73 of 137 and 88 of 172 calls) show that roughly half of the measured requests exceeded a latency threshold (likely related to the 30s perc99 target), suggesting a systemic performance issue rather than isolated incidents.
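If you have access to the Prometheus instance the test framework scrapes, you can get a rough per-resource breakdown of that latency yourself. The sketch below is only an approximation of what the measurement computes internally; PROM_URL is a placeholder for your Prometheus endpoint, and the standard apiserver_request_duration_seconds histogram exposed by the API server is assumed to be available.

```sh
# Approximate p99 API request latency by resource and verb over the last hour.
# Long-running WATCH/CONNECT verbs are excluded, as they would skew the quantile.
PROM_URL="http://prometheus.example:9090"   # placeholder endpoint
QUERY='histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[1h]))
  by (resource, verb, le))'

curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[] | [(.metric.resource // "-"), (.metric.verb // "-"), .value[1]] | @tsv' \
  | sort -k3 -rn | head -20
```

Sorting the output by the latency column surfaces which resource/verb combinations are driving the perc99 number.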
To effectively diagnose the problem, you need to investigate the root cause of this increased API latency. The possibilities are numerous, but let's examine a few likely suspects:
Potential Causes
- Resource Contention: The cluster might be experiencing resource contention (CPU, memory, I/O) on the control plane nodes. Monitoring the resource utilization of the API servers, etcd, and other control plane components is crucial. Look for spikes in CPU usage, memory pressure, or disk I/O that correlate with the increased latency. Use tools like top, kubectl top, and Prometheus to gather this data (see the first sketch after this list).
- Etcd Performance: etcd is a key-value store that Kubernetes uses to store all of its data. If etcd is slow, it can cause the entire cluster to slow down. Investigate etcd's performance by checking its logs for errors, monitoring its latency metrics, and ensuring it has sufficient resources (see the second sketch after this list). Slow disk I/O, network latency, or excessive load on etcd can all contribute to API slowdowns.
- Network Issues: Network latency between the API servers and other components (etcd, kubelets) can also contribute to the problem. Examine network performance metrics, looking for packet loss, high latency, or other network-related issues. Tools like ping, traceroute, and network monitoring solutions can help identify network bottlenecks.
- API Server Overload: The API servers themselves might be overloaded with requests. Analyze API server logs to identify the types of requests that are contributing to the increased load. Look for patterns in the requests, such as a sudden increase in the number of requests or a specific type of request that is particularly slow. Consider increasing the number of API server replicas if necessary.
- Access Token Issues: While the test name suggests access tokens, the problem might not be directly related to token validation itself. However, inefficient token handling within the API server could be a contributing factor. Ensure that token validation mechanisms are optimized and not introducing unnecessary overhead.
- Code Regression: A recent code change in Kubernetes itself or in a related component could have introduced a performance regression. Review recent commits to the Kubernetes codebase, focusing on changes that might affect API server performance, resource management, or networking.
- Cluster Configuration: Changes to cluster configuration, such as enabling new features or modifying existing settings, can sometimes impact performance. Review recent configuration changes to identify any settings that might be contributing to the problem.
- Kubernetes Version: Verify the Kubernetes version being used in these tests. Sometimes, performance issues are specific to certain Kubernetes versions and are resolved in later releases.
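For the resource-contention and API-server-overload suspects, a quick first pass is possible from any machine with admin access to the cluster. A minimal sketch, assuming metrics-server is installed (for kubectl top) and that your kubeconfig points at the affected cluster:

```sh
# Control plane resource usage at a glance (requires metrics-server).
kubectl top nodes
kubectl top pods -n kube-system --sort-by=cpu | head -20

# Saturation signals straight from the API server's /metrics endpoint:
# in-flight requests and Priority & Fairness queueing indicate request backlog.
kubectl get --raw /metrics | grep -E '^apiserver_current_inflight_requests'
kubectl get --raw /metrics | grep -E '^apiserver_flowcontrol_current_inqueue_requests' | head
```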
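For the etcd suspect, etcdctl can confirm whether the datastore is healthy and how quickly it is serving requests. The sketch below assumes kubeadm-style certificate paths and an etcd endpoint on localhost; adjust both for the clusters these jobs actually provision.

```sh
# Run on (or port-forwarded to) a control plane node hosting etcd.
export ETCDCTL_API=3
ETCD_FLAGS="--endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"

# Membership health, DB size, and current leader.
etcdctl $ETCD_FLAGS endpoint health
etcdctl $ETCD_FLAGS endpoint status --write-out=table

# Built-in performance check (writes test keys; use with care outside test clusters).
etcdctl $ETCD_FLAGS check perf
```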
Troubleshooting Steps
- Gather Data: Collect as much data as possible about the cluster's performance, including CPU usage, memory usage, disk I/O, network latency, and API server logs. Use tools like kubectl, top, vmstat, iostat, and Prometheus to gather this data.
- Analyze Logs: Carefully examine the logs from the API servers, etcd, kube-scheduler, and kube-controller-manager for errors or warnings. Look for patterns in the logs that might indicate the cause of the problem (a log-grepping sketch follows this list).
- Identify Slow Queries: Use tools like etcdctl to identify slow queries to etcd. This can help you determine if etcd is the bottleneck.
- Profile the API Server: Use profiling tools to identify the code paths that are consuming the most CPU time in the API server (see the pprof sketch after this list). This can help you pinpoint the source of the performance problem.
- Isolate the Problem: Try to isolate the problem by running the ClusterLoaderV2.access-tokens test in a smaller, more controlled environment (see the clusterloader2 invocation after this list). This can help you rule out environmental factors that might be contributing to the problem.
- Rollback Changes: If you suspect that a recent code change or configuration change is the cause of the problem, try rolling back to a previous version to see if that resolves the issue.
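For the log-analysis step, one useful signal is the slow-request traces the API server itself emits. A minimal sketch, assuming a static-pod API server labelled component=kube-apiserver (kubeadm-style); in managed or Prow-provisioned clusters the logs may instead live in the job artifacts:

```sh
# The kube-apiserver logs "Trace[...]" entries for requests that exceed its
# internal latency threshold; the trace shows which handler phase was slow.
kubectl -n kube-system logs -l component=kube-apiserver --tail=2000 \
  | grep -A 3 'Trace\[' | head -60
```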
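For the profiling step, the API server exposes Go pprof endpoints as long as profiling is enabled (it is by default unless the server was started with --profiling=false). A minimal sketch, assuming the Go toolchain is available on your workstation:

```sh
# Capture a 30-second CPU profile from the kube-apiserver via its pprof endpoint.
kubectl get --raw '/debug/pprof/profile?seconds=30' > apiserver-cpu.pprof

# Show the hottest functions; replace -top with -http=:8080 for an interactive view.
go tool pprof -top apiserver-cpu.pprof
```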
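For the isolation step, the same test config can be run by hand from a perf-tests checkout against a smaller cluster you control. The flags below reflect common clusterloader2 usage; the provider, kubeconfig, and report directory are placeholders for your environment, so check the clusterloader2 README in the repository for the options your checkout supports.

```sh
# Re-run the access-tokens config against a small test cluster.
cd "$GOPATH/src/k8s.io/perf-tests/clusterloader2"
go run cmd/clusterloader.go \
  --testconfig=testing/access-tokens/config.yaml \
  --provider=gce \
  --kubeconfig="$HOME/.kube/config" \
  --report-dir=/tmp/access-tokens-report
# --provider accepts other values (e.g. aws, kind, local) depending on your cluster.
```

Comparing the APIResponsivenessPrometheus output from this controlled run against the failing jobs helps separate environmental factors from genuine regressions.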
Relevant Links
The following links provide access to the test results and triage information:
- Prow Job View: This link provides detailed logs and information about the specific test run.
- Testgrid Link: This link provides a historical view of the test's performance over time.
- Kubernetes Triage Dashboard: This link provides a summary of the test failures and potential causes.
Conclusion
The failing ClusterLoaderV2.access-tokens test indicates a performance regression related to API responsiveness. By systematically investigating resource utilization, etcd performance, network latency, API server load, and recent code changes, you can identify the root cause and implement corrective actions. Remember to gather as much data as possible, analyze logs carefully, and isolate the problem to effectively troubleshoot this issue. The links provided offer valuable insights into the test's history and potential causes.
For more in-depth information on Kubernetes performance at scale, refer to the official Kubernetes documentation on considerations for large clusters and to the clusterloader2 documentation in the kubernetes/perf-tests repository.