Databroker stuck getting no_healthy_upstream errors

What happened?

Using Pomerium 17.1 in RKE2 Kubernetes clusters.
The VMs are running RHEL 8.5.
There is an IdP set up, along with service routes.

Clusters running on AWS VMs work fine. On the other, privately provisioned VMs, the pod appears healthy, but accessing the configured routes returns 500 errors and there are errors in the logs. The service seems to be stuck creating a gRPC connection (see the logs below).

It seems that the internal Envoy is struggling to reach its destination.
I’m stumped as to how this difference comes about, but it is consistent: the privately provisioned clusters all fail, while the AWS-provisioned clusters all work.
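
For reference, the 500s can be reproduced by hitting any configured route directly; the hostname below is a placeholder, not a real route.

# Placeholder hostname - substitute one of the configured routes.
curl -kv https://some-route.example.com/
# On the privately provisioned clusters this returns HTTP 500; on the AWS clusters it does not.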

What did you expect to happen?

Routes on the privately provisioned clusters should work just as they do on the AWS-provisioned clusters.

How’d it happen?

  1. Deployed Pomerium in an RKE2 cluster on privately provisioned RHEL 8.5 VMs
  2. Accessed one of the configured routes
  3. Saw a 500 error, and no_healthy_upstream errors in the Pomerium logs

What’s your environment like?

  • Pomerium version (retrieve with pomerium --version): 17.1
  • Server Operating System/Architecture/Cloud: RHEL 8.5 VMs (AWS-provisioned and privately provisioned), RKE2 Kubernetes

What’s your config.yaml?

# Paste your configs here
# Be sure to scrub any sensitive values

What did you see in the logs?

2022-09-29T12:10:43-07:00 {"level":"debug","service":"envoy","name":"http","time":"2022-09-29T19:10:43Z","message":"[C1] new stream"}
2022-09-29T12:10:43-07:00 {"level":"debug","service":"envoy","name":"http","time":"2022-09-29T19:10:43Z","message":"[C1][S1843344355433927730] request headers complete (end_stream=false):\\n\\':method\\', \\'POST\\'\\n\\':scheme\\', \\'http\\'\\n\\':path\\', \\'/databroker.DataBrokerService/SyncLatest\\'\\n\\':authority\\', \\'127.0.0.1:34565\\'\\n\\'content-type\\', \\'application/grpc\\'\\n\\'user-agent\\', \\'grpc-go/1.44.1-dev\\'\\n\\'te\\', \\'trailers\\'\\n\\'x-request-id\\', \\'KQSvrQjexCCbxBWjAKAYxf\\'\\n\\'jwt\\', \\'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2NjQ0ODIyNDN9.f5Sc_vzK27mxslXF7PF0howXN_XZOz3rdWEn7GyP4XE\\'\\n\\'grpc-trace-bin\\', \\'AABv2eR5O9J5CfesoSofESslAWrK/AqgM/UuAgA\\'"}
2022-09-29T12:10:43-07:00 {"level":"debug","service":"envoy","name":"router","time":"2022-09-29T19:10:43Z","message":"[C1][S1843344355433927730] cluster \\'pomerium-databroker\\' match for URL \\'/databroker.DataBrokerService/SyncLatest\\'"}
2022-09-29T12:10:43-07:00 {"level":"debug","service":"envoy","name":"upstream","time":"2022-09-29T19:10:43Z","message":"no healthy host for HTTP connection pool"}
2022-09-29T12:10:43-07:00 {"level":"debug","service":"envoy","name":"http","time":"2022-09-29T19:10:43Z","message":"[C1][S1843344355433927730] Sending local reply with details no_healthy_upstream"}
2022-09-29T12:10:43-07:00 {"level":"debug","service":"envoy","name":"http","time":"2022-09-29T19:10:43Z","message":"[C1][S1843344355433927730] encoding headers via codec (end_stream=true):\\n\\':status\\', \\'200\\'\\n\\'content-type\\', \\'application/grpc\\'\\n\\'grpc-status\\', \\'14\\'\\n\\'grpc-message\\', \\'no healthy upstream\\'\\n\\'date\\', \\'Thu, 29 Sep 2022 19:10:43 GMT\\'\\n\\'server\\', \\'envoy\\'"}
2022-09-29T12:10:43-07:00 {"level":"debug","service":"envoy","name":"http","time":"2022-09-29T19:10:43Z","message":"[C1][S1843344355433927730] doEndStream() resetting stream"}
2022-09-29T12:10:43-07:00 {"level":"debug","service":"envoy","name":"http","time":"2022-09-29T19:10:43Z","message":"[C1][S1843344355433927730] stream reset"}
2022-09-29T12:10:43-07:00 {"level":"debug","service":"envoy","name":"http2","time":"2022-09-29T19:10:43Z","message":"[C1] stream closed: 0"}
2022-09-29T12:10:43-07:00 {"level":"error","syncer_id":"databroker","syncer_type":"type.googleapis.com/pomerium.config.Config","error":"rpc error: code = Unavailable desc = no healthy upstream","time":"2022-09-29T19:10:43Z","message":"error during initial sync"}
2022-09-29T12:10:43-07:00 {"level":"error","syncer_id":"databroker","syncer_type":"type.googleapis.com/pomerium.config.Config","error":"rpc error: code = Unavailable desc = no healthy upstream","time":"2022-09-29T19:10:43Z","message":"sync"}
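
grpc-status 14 ("no healthy upstream") means Envoy has no usable endpoint in its pomerium-databroker cluster. Assuming a split-mode install with a separate databroker Service (the namespace and Service name below are illustrative), a first check is whether that Service has any endpoints at all:

# Namespace and Service name are assumptions - adjust to your deployment.
kubectl -n pomerium get service,endpoints pomerium-databroker -o wide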

The log from the working scenario looks like:

{"level":"debug","service":"envoy","name":"http","time":"2022-09-29T20:41:52Z","message":"[C5] new stream"}
{"level":"debug","service":"envoy","name":"http","time":"2022-09-29T20:41:52Z","message":"[C5][S5053413184369283876] request headers complete (end_stream=false):\\n\\':method\\', \\'POST\\'\\n\\':scheme\\', \\'http\\'\\n\\':path\\', \\'/databroker.DataBrokerService/Put\\'\\n\\':authority\\', \\'127.0.0.1:33479\\'\\n\\'content-type\\', \\'application/grpc\\'\\n\\'user-agent\\', \\'grpc-go/1.44.1-dev\\'\\n\\'te\\', \\'trailers\\'\\n\\'grpc-tags-bin\\', \\'AAAMZ3JwY19zZXJ2aWNlHGRhdGFicm9rZXIuRGF0YUJyb2tlclNlcnZpY2UAB3NlcnZpY2UTcG9tZXJpdW0tZGF0YWJyb2tlcgAEaG9zdA8xMjcuMC4wLjE6MzM0NzkAC2dycGNfbWV0aG9kA1B1dA\\'\\n\\'x-request-id\\', \\'QMovzT8osJ1fsW2Ftrtf6S\\'\\n\\'jwt\\', \\'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2NjQ0ODc3MTJ9.49ViGWo8r0C0-AWE8DpEJvBNeyXo7An-reyi2uhvocU\\'\\n\\'grpc-trace-bin\\', \\'AADiU9nA0zLe+sECgXrXJGAPAQsFaHyaoNAiAgA\\'"}
{"level":"debug","service":"envoy","name":"router","time":"2022-09-29T20:41:52Z","message":"[C5][S5053413184369283876] cluster \\'pomerium-databroker\\' match for URL \\'/databroker.DataBrokerService/Put\\'"}
{"level":"debug","service":"envoy","name":"router","time":"2022-09-29T20:41:52Z","message":"[C5][S5053413184369283876] router decoding headers:\\n\\':method\\', \\'POST\\'\\n\\':scheme\\', \\'http\\'\\n\\':path\\', \\'/databroker.DataBrokerService/Put\\'\\n\\':authority\\', \\'127.0.0.1:33479\\'\\n\\'content-type\\', \\'application/grpc\\'\\n\\'user-agent\\', \\'grpc-go/1.44.1-dev\\'\\n\\'te\\', \\'trailers\\'\\n\\'grpc-tags-bin\\', \\'AAAMZ3JwY19zZXJ2aWNlHGRhdGFicm9rZXIuRGF0YUJyb2tlclNlcnZpY2UAB3NlcnZpY2UTcG9tZXJpdW0tZGF0YWJyb2tlcgAEaG9zdA8xMjcuMC4wLjE6MzM0NzkAC2dycGNfbWV0aG9kA1B1dA\\'\\n\\'x-request-id\\', \\'QMovzT8osJ1fsW2Ftrtf6S\\'\\n\\'jwt\\', \\'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2NjQ0ODc3MTJ9.49ViGWo8r0C0-AWE8DpEJvBNeyXo7An-reyi2uhvocU\\'\\n\\'grpc-trace-bin\\', \\'AADiU9nA0zLe+sECgXrXJGAPAQsFaHyaoNAiAgA\\'\\n\\'x-forwarded-proto\\', \\'http\\'"}
{"level":"debug","service":"envoy","name":"pool","time":"2022-09-29T20:41:52Z","message":"[C18] using existing connection"}
{"level":"debug","service":"envoy","name":"pool","time":"2022-09-29T20:41:52Z","message":"[C18] creating stream"}
{"level":"debug","service":"envoy","name":"router","time":"2022-09-29T20:41:52Z","message":"[C5][S5053413184369283876] pool ready"}
{"level":"debug","service":"envoy","name":"http","time":"2022-09-29T20:41:52Z","message":"[C5][S5053413184369283876] request end stream"}

Additional context


The recommended Kubernetes deployment method is all-in-one, as it naturally avoids the split-mode issues that can be extremely time-consuming to debug without cluster access.
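
A quick way to confirm which mode is actually deployed (the namespace below is illustrative): split mode typically shows separate proxy, authenticate, authorize, and databroker deployments, while all-in-one shows a single Pomerium deployment.

# Namespace is an assumption - adjust to your install.
kubectl -n pomerium get deployments,pods -o wide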

  1. Are you running it with the Ingress Controller, or just standalone Pomerium?
  2. It sounds like a DNS issue. If you use the debug image, you should be able to shell into the container and check the output of http://localhost:9091/clusters - it will show which upstream IP addresses were resolved. For example, here pomerium-databroker.pomerium-master-postgres.svc.cluster.local resolved to 10.28.2.8 and 10.28.5.29, and both upstreams are healthy (example commands follow the output below):
pomerium-databroker::10.28.5.29:443::cx_active::1
pomerium-databroker::10.28.5.29:443::cx_connect_fail::0
pomerium-databroker::10.28.5.29:443::cx_total::2
pomerium-databroker::10.28.5.29:443::rq_active::0
pomerium-databroker::10.28.5.29:443::rq_error::0
pomerium-databroker::10.28.5.29:443::rq_success::3853
pomerium-databroker::10.28.5.29:443::rq_timeout::0
pomerium-databroker::10.28.5.29:443::rq_total::3853
pomerium-databroker::10.28.5.29:443::hostname::pomerium-databroker.pomerium-master-postgres.svc.cluster.local
pomerium-databroker::10.28.5.29:443::health_flags::healthy
pomerium-databroker::10.28.5.29:443::weight::1
pomerium-databroker::10.28.5.29:443::region::
pomerium-databroker::10.28.5.29:443::zone::
pomerium-databroker::10.28.5.29:443::sub_zone::
pomerium-databroker::10.28.5.29:443::canary::false
pomerium-databroker::10.28.5.29:443::priority::0
pomerium-databroker::10.28.5.29:443::success_rate::-1
pomerium-databroker::10.28.5.29:443::local_origin_success_rate::-1
pomerium-databroker::10.28.2.8:443::cx_active::2
pomerium-databroker::10.28.2.8:443::cx_connect_fail::0
pomerium-databroker::10.28.2.8:443::cx_total::4
pomerium-databroker::10.28.2.8:443::rq_active::1
pomerium-databroker::10.28.2.8:443::rq_error::0
pomerium-databroker::10.28.2.8:443::rq_success::3858
pomerium-databroker::10.28.2.8:443::rq_timeout::0
pomerium-databroker::10.28.2.8:443::rq_total::3859
pomerium-databroker::10.28.2.8:443::hostname::pomerium-databroker.pomerium-master-postgres.svc.cluster.local
pomerium-databroker::10.28.2.8:443::health_flags::healthy
pomerium-databroker::10.28.2.8:443::weight::1
pomerium-databroker::10.28.2.8:443::region::
pomerium-databroker::10.28.2.8:443::zone::
pomerium-databroker::10.28.2.8:443::sub_zone::
pomerium-databroker::10.28.2.8:443::canary::false
pomerium-databroker::10.28.2.8:443::priority::0
pomerium-databroker::10.28.2.8:443::success_rate::-1
pomerium-databroker::10.28.2.8:443::local_origin_success_rate::-1
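
To run that check from outside the pod, something along these lines should work with the debug image (the namespace, deployment name, and Service FQDN are illustrative, and curl/getent are assumed to be available in the image):

# Dump Envoy's view of the databroker cluster from inside the Pomerium pod.
kubectl -n pomerium exec deploy/pomerium -- \
  curl -s http://localhost:9091/clusters | grep pomerium-databroker
# Confirm the databroker Service name resolves in-cluster at all.
kubectl -n pomerium exec deploy/pomerium -- \
  getent hosts pomerium-databroker.pomerium.svc.cluster.local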