404 response for all routes after upgrade - probable config error

What happened?

We have 4 Kubernetes clusters running Pomerium, installed with Helm. They were on version 0.15.7 and working absolutely fine. A couple of days ago I upgraded them to v0.18.0, but now when I try to access any of the routes I get a 404 from /.pomerium/sign_in.

I know the request reaches the authenticate pod because I get this in its logs:

2:21AM INF http-request authority=authenticate.domain.com duration=2.086012 forwarded-for=192.168.29.64,127.0.0.6 method=GET path=/.pomerium/sign_in referer= request-id=9ad1527b-7af6-4aed-a65d-1e3ab886be90 response-code=404 response-code-details=via_upstream service=envoy size=19 upstream-cluster=pomerium-control-plane-http user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"

I downgraded back to the original version and it worked again. I then upgraded a little at a time and found that it breaks after 0.16.2: upgrading to 0.16.4 makes it fail again.

I then found out that the configuration layout for routes had changed at some point, although I hadn’t seen any mention of this in the upgrade notes or changelog. I updated the config file (on version 0.16.4) but it still didn’t work. I tried upgrading to 0.18.0, but still no joy.

I’m sure there must be something wrong in my config, but I can’t figure out what it could be.

What did you expect to happen?

I hoped it would keep working as before.

What’s your environment like?

Kubernetes v1.23.4

What’s your config.yaml?

autocert: false
dns_lookup_family: V4_ONLY
address: :80
grpc_address: :80
authenticate_service_url: https://authenticate.domain.com
authorize_service_url: http://pomerium-authorize.auth.svc.cluster.local
databroker_service_url: http://pomerium-databroker.auth.svc.cluster.local
idp_provider: oidc
idp_scopes: openid, profile, email
idp_provider_url: https://gitlab.domain.com
insecure_server: true
grpc_insecure: true
administrators: "sarah@domain.com"

pomerium_debug: true
idp_client_id: <redacted>
idp_client_secret: <redacted>
idp_service_account: <redacted>
routes:
  - from: https://kibana.domain.com
    policy:
    - allow:
        or:
        - domain:
            is: domain.com
    timeout: 30s
    to: http://kibana-kibana.kube-logging:5601
  - from: https://prometheus.domain.com
    policy:
    - allow:
        or:
        - domain:
            is: domain.com
    timeout: 30s
    to: http://wx-kube-prometheus-stack-prometheus.monitoring:9090
  - from: https://alertmanager.domain.com
    policy:
    - allow:
        or:
        - domain:
            is: domain.com
    timeout: 30s
    to: http://wx-kube-prometheus-stack-alertmanager.monitoring:9093
  - from: https://grafana.domain.com
    pass_identity_headers: true
    policy:
    - allow:
        or:
        - domain:
            is: domain.com
    timeout: 30s
    to: http://wx-grafana.monitoring
  - from: https://authenticate.domain.com
    to: http://pomerium-authenticate.auth.svc.cluster.local
    allow_public_unauthenticated_access: true

What did you see in the logs?

# Databroker continuously repeats this
3:06AM WRN failed to refresh directory users and groups error="unknown directory provider oidc" service=identity_manager
3:06AM INF put id=identity_manager_last_user_group_refresh_errors type=type.googleapis.com/pomerium.events.LastError

# Authorize has this
{"level":"warn","time":"2022-08-17T03:03:34Z","msg":"stapling OCSP","service":"autocert","error":"no OCSP stapling for [domain.com authorize.domain.com pomerium-authorize.auth.svc.cluster.local]: no OCSP server specified in certificate"}

# Authenticate has the same warning as authorize
{"level":"warn","time":"2022-08-17T02:43:35Z","msg":"stapling OCSP","service":"autocert","error":"no OCSP stapling for [domain.com authenticate.domain.com pomerium-authenticate.auth.svc.cluster.local]: no OCSP server specified in certificate"}
# and also the 404 error on sign_in
2:21AM INF http-request authority=authenticate.domain.com duration=2.086012 forwarded-for=192.168.29.64,127.0.0.6 method=GET path=/.pomerium/sign_in referer= request-id=9ad1527b-7af6-4aed-a65d-1e3ab886be90 response-code=404 response-code-details=via_upstream service=envoy size=19 upstream-cluster=pomerium-control-plane-http user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"

# Proxy shows this on request
3:13AM INF http-request authority=grafana.domain.com duration=49.552422 forwarded-for=192.168.29.23,127.0.0.6 method=GET path=/ referer= request-id=6d16b8bd-80ae-49f0-b2a3-b5d808cbfbf0 response-code=302 response-code-details=ext_authz_denied service=envoy size=1276 upstream-cluster= user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"

I know the ‘no OCSP server’ errors were not in the logs in the previous working versions as I checked when I rolled back.

Additional context

We use Istio in our k8s clusters, which is why I set insecure_server and grpc_insecure to true, but I don’t see how that can be the issue as it works fine with earlier versions.

Update

It seems to be related to user authentication: when I remove the policy and add allow_public_unauthenticated_access: true to the routes, it works fine.
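
For illustration, this is roughly what a route looks like with that workaround applied. It is a sketch based on the Grafana route from the config above, with the policy block dropped. The other routes were changed in the same way.

```yaml
routes:
  - from: https://grafana.domain.com
    to: http://wx-grafana.monitoring
    pass_identity_headers: true
    timeout: 30s
    # The policy block (allow -> or -> domain is domain.com) is removed here
    # and replaced with unauthenticated access, for testing only.
    allow_public_unauthenticated_access: true
```

With this the request goes straight through to the upstream, so the failure only shows up when a route needs the user to sign in.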

Would anybody be able to help me figure out where the issue is?

I know the settings for GitLab must be correct, as they work with a previous version of Pomerium, but maybe some other config has changed?