Envoy hangs with ext_authz_error

Hi,

We’ve been using Pomerium for a couple of years with Azure AD as the IdP. Until recently we were running Pomerium 0.25.2 on Ubuntu 20.04. We updated to 0.27.2, and then replaced the VM with one running Ubuntu 24.04.

Since the Ubuntu update, Pomerium stops forwarding connections after a few hours of use. Envoy logs an “ext_authz_error” after about 10 seconds, and the browser sees an HTTP 500 error.

Here’s an example pair of logs for a single request. In the first log, from the authorize service, it looks like the user has already authenticated. In the second log, from Envoy, the duration is ~10 seconds and the response-code-details field says only “ext_authz_error”.

2024-11-01T13:11:26.902863+00:00 pomeriumhost pomerium[56016]: {"level":"info","service":"authorize","request-id":"fd9a7bb7-3145-4538-a66e-ef6a89a23b07","check-request-id":"fd9a7bb7-3145-4538-a66e-ef6a89a23b07","method":"GET","path":"/some/url","host":"targethost.pomerium.example.com","ip":"1.2.3.4","session-id":"e5b74232-02d3-4a8a-b238-c837bb57cb41","user":"uuid-here","email":"legituser@example.com","allow":true,"allow-why-true":["claim-ok"],"deny":false,"deny-why-false":[],"time":"2024-11-01T13:11:26Z","message":"authorize check"}

2024-11-01T13:11:37.385182+00:00 pomeriumhost pomerium[56016]: {"level":"info","service":"envoy","upstream-cluster":"","method":"GET","authority":"targethost.pomerium.example.com","path":"/some/url","user-agent":"Mozilla/5.0 ...","referer":"...","forwarded-for":"1.2.3.4","request-id":"fd9a7bb7-3145-4538-a66e-ef6a89a23b07","duration":10000.525646,"size":0,"response-code":500,"response-code-details":"ext_authz_error","time":"2024-11-01T13:11:37Z","message":"http-request"}

Restarting the service fixes it for another few hours.

I checked a few things:

  1. In tcpdump, it doesn’t look like Pomerium is contacting the IdP (Microsoft) when connections come in. I didn’t watch for very long, though, and it’s possible Pomerium doesn’t contact the IdP on demand, or caches the result, which could also explain the absence of traffic.
  2. I didn’t see any obvious failures when I ran strace -fp on the Pomerium process.
  3. I didn’t see any external connections stuck in SYN_SENT in the output of ss -ant. (Rough versions of the commands I used are sketched below.)
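
For reference, the commands were along these lines (the interface, the Microsoft endpoint, and the process name are from my setup and may need adjusting):

  sudo tcpdump -ni eth0 host login.microsoftonline.com   # watch for outbound IdP traffic
  sudo strace -f -p "$(pgrep -xo pomerium)"              # trace syscalls across all threads
  ss -nt state syn-sent                                  # list connections stuck in SYN_SENT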

I think the problem is between Envoy and Pomerium, but I’m not sure how to dig deeper.

What’s your environment like?

  • Pomerium version (retrieve with pomerium --version): 0.27.2
  • Server Operating System/Architecture/Cloud: Ubuntu 24.04, amd64, Azure

What’s your config.yaml?

address: :443
authenticate_service_url: https://pomerium.example.com
autocert: true
autocert_ca: ...
autocert_eab_key_id: ...
autocert_eab_mac_key: ...
idp_provider: azure
idp_provider_url: ...
idp_client_id: ...
idp_client_secret: ...
idp_request_params:
  domain_hint: example.com

routes:
  # just a couple of example routes
  - from: https://targethost.pomerium.example.com
    to: https://internalhost:8443
    preserve_host_header: true
    set_request_headers:
      X-Forwarded-Port: "443"
    policy:
      - allow:
          or:
            - claim/groups: "uuid-here"

  - from: tcp+https://targethost2.pomerium.example.com:12345
    to: tcp://internalhost2:12345
    policy:
      - allow:
          or:
            - claim/groups: "uuid-here"

Hi,

Did the problem start with the upgrade from 0.25.2 to 0.27.2, or with the upgrade from Ubuntu 20.04 to Ubuntu 24.04? Or were the upgrades so close in time that it’s unclear?

Thanks,

  • Caleb

It was after the Ubuntu upgrade, but unfortunately they’re close enough together (1-2 days) that it’s not definitive. One thing I could do is downgrade to 0.25.2. I’ll wait for it to crash one more time to get a better sense of the timing, and then I’ll downgrade.

Thanks!

I looked through the changes, but nothing stands out to me. Are you using all-in-one mode with the in-memory databroker? (There were changes to add keepalive, but those should only apply to split-service mode.)

It is possible to increase the Envoy log level: see Proxy Log Level | Pomerium. There might be something in those messages to indicate what’s going wrong, but unfortunately the logs are so verbose, and the issue is so rare, that it may be more trouble than it’s worth.

Based on the description of the problem it sounds like the authorize service is accepting gRPC requests (or at least not overtly rejecting them) but not responding in a timely fashion. Is a CPU core pegged at 100% when this happens (I wonder if the code is stuck in an infinite loop somewhere)?
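
A quick way to check while it’s happening, assuming the process is named pomerium (mpstat comes from the sysstat package):

  mpstat -P ALL 1                      # per-core utilization at one-second intervals
  top -H -p "$(pgrep -xo pomerium)"    # per-thread CPU usage for the pomerium process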

Yes, this is all-in-one mode. I’m not sure about the databroker, but I don’t think I have any config options with “broker” in the name, so it’d be on the defaults.
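
If I’m reading the docs right, that means the in-memory backend, i.e. the equivalent of explicitly setting (assuming I have the option name right):

databroker_storage_type: memory   # the default when no databroker storage is configured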

Both CPUs appear to be almost completely idle when this is happening (and also when things are normal).

In the config.yaml I enabled debugging and restarted the service:

pomerium_debug: true
proxy_log_level: debug

but all the messages I see in syslog are still "level":"info", so I’m unsure if I’ve done it right.

This might be provoked by connections to the Apache Flink web dashboard. Every crash seems to be associated with people looking at it. We’ve been using that dashboard for years, but a couple of additional users have taken an interest in the past two weeks.

I’ve downgraded to 0.25.2 to see if it makes a difference. There haven’t been any crashes in the week since I did that. Still unclear if it’s 0.27.2 or a different usage pattern that’s triggering it.

I’m still not seeing debug logs from Envoy despite adding some debug settings:

pomerium_debug: true
proxy_log_level: debug

Would they go to syslog, like the other logs?

You additionally need to set log_level: debug.
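
In other words, both settings together in config.yaml (a minimal sketch):

log_level: debug         # Pomerium’s overall log level; proxy debug output depends on this
proxy_log_level: debug   # the Envoy proxy log level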