TSB - Application Owner Troubleshooting Guide (Version: Latest)

Troubleshooting Quick Reference

Quick Reference or Full Guide?

If you're familiar with the Tetrate Platform, this Quick Reference provides a useful checklist for common troubleshooting processes.

Otherwise, you may wish to start with the Troubleshooting Guide for a step-by-step walkthrough of the platform and its capabilities.

The following tips will help you to identify the error condition that is affecting your application. The Tetrate platform provides a number of insights - performance, response codes, logs and traces - that will help you to identify the issue you face. Additionally, if you check metrics over a long time window, you may be able to identify when an issue began, and correlate that with a related action or change.

My service is not available...

  1. Send a test request to your service

    • if you get a timeout when connecting, the edge gateway is not responding. Verify DNS resolution (are you connecting to the correct IP?)
    • a 301, 302, 303 or other 3xx response code redirects you to try a different location, often provided in the Location response header
    • if you get a 404 response code, verify that you are connecting to the right gateway. Then verify that the gateway configuration is correct, specifically that the hostnames and paths are as you would expect
    • if you get a 401 or 403 response code, you need to provide a correct authentication token. Additionally, a terse '403 RBAC: access denied' response indicates that a Tetrate-managed security policy has denied access via the mesh
    • a 422 or other higher 4xx response code indicates that an issue was detected that blocked the request, perhaps by a web application firewall, OpenAPI firewall or other content check
    • if you get a 503 response code, your service is not responding, cannot be found or does not exist. Check the configuration of the final tier2 gateway (are you targeting the correct service?), and verify the correct deployment of your service
  2. View your service in the TSB UI (dashboard)

    • look at the dependencies first. Is your service receiving traffic, and is the source of the traffic what you expect (for example, the downstream gateway)?
    • check the access logs for your service to understand what traffic it is receiving. No traffic suggests a routing or upstream configuration problem
    • look at the metrics for the service over a larger time window. Can you identify at what time the service stopped receiving traffic? Can you correlate that with a configuration change or other event?
  3. Look at the configuration in the TSB UI (workspaces)

    • generate a configuration report for the hostname used to access your service. Is the configuration status green, and is the configuration deployed to the clusters and namespaces you expect?
    • locate the gateway configuration for the service. Does it use the expected hostnames and paths, and does it route to the expected target service name and port?
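
The response-code checks in step 1 can be sketched as a small probe. This is a minimal illustration using only the Python standard library, not a Tetrate tool: the function names `explain_status` and `probe` are hypothetical, and the hint text simply paraphrases the list above. Redirects are deliberately not followed so that a 3xx response is reported rather than hidden.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following 3xx responses so they can be reported."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def explain_status(status: int) -> str:
    """Map a response code to the check suggested in step 1 (paraphrased)."""
    if 200 <= status < 300:
        return "the request succeeded"
    if 300 <= status < 400:
        return "redirect: try the address in the Location header"
    if status in (401, 403):
        return "access denied: check the authentication token and mesh security policy"
    if status == 404:
        return "not found: check you reached the right gateway, hostnames and paths"
    if 400 <= status < 500:
        # Assumption: remaining 4xx codes are treated as a blocked request.
        return "blocked: possibly a firewall or other content check"
    if status == 503:
        return "service not responding: check the tier2 gateway target and the service deployment"
    return "unclassified: check the service logs"


def probe(url: str, timeout: float = 5.0) -> str:
    """Send one GET request and return a one-line diagnosis."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return f"{resp.status}: {explain_status(resp.status)}"
    except urllib.error.HTTPError as exc:
        # 3xx, 4xx and 5xx responses arrive here as HTTPError.
        location = exc.headers.get("Location", "") if exc.headers else ""
        suffix = f" (Location: {location})" if location else ""
        return f"{exc.code}: {explain_status(exc.code)}{suffix}"
    except OSError as exc:
        # Covers DNS failures, refused connections and timeouts.
        return f"no response: verify DNS resolution and the edge gateway ({exc})"
```

For example, `print(probe("https://app.example.com/"))` prints a single line such as the status code followed by the suggested check; the hostname here is a placeholder for the hostname of your own service.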

My users are reporting intermittent errors...

  1. Send test requests to the service to attempt to replicate the issue

    Tip: see "How to send a test request". If necessary, use the developer tools in your browser to access your application and look for errors in certain requests.

    • a 429 status code indicates that rate limiting has been triggered and the request has been denied. Other response codes are explained above and here
    • timeouts and intermittent 5xx response codes may indicate an infrastructure or application problem. This problem may be correlated with certain requests that take a long time to process.
  2. Look at the 'endpoint metrics' for your tier2 gateway

    • view the endpoint metrics report for the tier2 gateway that manages traffic to your service. Can you identify certain requests (URIs/paths) that have high rates of errors?
  3. Examine request traces for your service

    • bearing in mind that only 1% of requests are sampled, attempt to find a trace that matches the intermittent problem. Traces for errored requests are highlighted red in the TSB UI
    • if you can provoke the error, add the header X-B3-Sampled: 1 to your request to force-generate a trace
    • examine the trace to identify the source of the error. Bear in mind that the hop that generated the error may not report a trace span, so you will need to look at spans from downstream services which targeted the potentially failing service
  4. Check the logs for your erroring service, looking for requests that match the error signature

    • for example, if erroring requests have a certain response code or URI, filter the logs based on that signature
    • if you cannot locate any logs, check downstream proxies which may have handled and blocked/dropped the request
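
One way to replicate an intermittent error (step 1) while forcing a trace for every attempt (step 3) is to repeat the request with the X-B3-Sampled: 1 header and tally the response codes. The sketch below uses only the Python standard library; the names `fetch_status` and `tally` are hypothetical, and the default of 20 attempts is an arbitrary choice rather than a recommendation from the platform.

```python
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable

# Header named in step 3: asks the mesh to sample (trace) this request.
TRACE_HEADERS = {"X-B3-Sampled": "1"}


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Send one traced GET request and return its response code."""
    request = urllib.request.Request(url, headers=TRACE_HEADERS)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx and 5xx responses still carry a status code


def tally(url: str, attempts: int = 20,
          fetch: Callable[[str], int] = fetch_status) -> Counter:
    """Repeat the request and count each outcome."""
    counts: Counter = Counter()
    for _ in range(attempts):
        try:
            counts[fetch(url)] += 1
        except OSError:
            # Connection failures and timeouts have no response code.
            counts["no response"] += 1
    return counts
```

A result such as `Counter({200: 17, 503: 2, 429: 1})` would suggest both an occasional backend failure and a rate limit being triggered. The `fetch` parameter exists so that the counting logic can be exercised without a live service.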

My service is suffering from high latency...

  1. Go straight to the TSB dashboard

    • check the performance summary, including the Apdex score and the P50-P99 metrics. Do these indicate a latency problem?
    • look at the endpoint metrics, which break latency out by path. Does the latency problem affect certain requests only, or is it more widespread?
    • look at response time over time. Is the latency consistently high, or is it correlated with certain events or traffic levels?
  2. Is it the platform's fault?

    • look at the Average Metrics summary report for your service. Is the Envoy sidecar proxy adding significant latency to transactions? If so, this may indicate a configuration problem (such as in communication with an external authorization process) or a resourcing/capacity issue
    • look at request traces for requests to your service which exhibit the latency problem. Can you identify which hop/component is adding the greatest latency?
  3. Is there a capacity problem?

    • the Tetrate platform does not report capacity metrics directly; these may be available through other dashboards provided by your platform team
    • if you have access to the Kubernetes platform, look for performance metrics, eviction logs and other error conditions
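
To help interpret the Apdex score and percentile metrics mentioned in step 1, the sketch below computes them from a list of latency samples. It follows the standard Apdex definition (samples at or below a threshold T are satisfied, samples up to 4T are tolerating and count half) and a nearest-rank percentile. The threshold and percentile method that the TSB dashboard uses are not stated in this guide, so the values here are assumptions for illustration only.

```python
import math
from typing import Sequence


def apdex(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Standard Apdex: (satisfied + tolerating / 2) / total samples."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    satisfied = sum(1 for x in latencies_ms if x <= threshold_ms)
    tolerating = sum(
        1 for x in latencies_ms if threshold_ms < x <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(latencies_ms)


def percentile(latencies_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, for example pct=50 for P50 or pct=99 for P99."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical samples in milliseconds, with an assumed threshold of 250 ms.
samples = [100, 120, 150, 200, 900, 2500]
print(apdex(samples, threshold_ms=250))  # 0.75
print(percentile(samples, 50))           # 150
print(percentile(samples, 99))           # 2500
```

In this example the median looks healthy while P99 is dominated by a single slow request, which is the pattern to look for when deciding whether latency affects only certain requests or is more widespread.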

Depending on the nature of the error you observe:

  • if your service is failing to respond, you may need to redeploy the service or assign additional resources
  • if traffic is not delivered to your service, you will need to verify the gateway configuration or consider security and firewall policies that are blocking requests
  • intermittent problems may relate to resource constraints (causing timeouts or errors), rate limits, or firewall rules that are occasionally triggered

If you are unable to diagnose or fix the issue you are facing, the tests in this document should help you to raise an accurate and informed error report for the attention of your platform team.