Massive RAM usage while using K8s CRD

Hi everyone!

What happened?

We have switched to Pomerium all-in-one with the Kubernetes CRD for both HTTPS and TCP rules.
Since then, RAM usage has skyrocketed and pods take a long time to come online.
Persistence is configured with Postgres.

If the TCP rules are removed from the CRD, RAM usage and startup times are back to normal.

What’s your environment like?

  • Pomerium version (retrieve with pomerium --version):
  • Server Operating System/Architecture/Cloud:

What’s your config.yaml?

using CRD

What did you see in the logs?

During runtime, this error message is printed several times:

"Deprecated field: type envoy.type.matcher.v3.RegexMatcher Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.google_re2' from file regex.proto. This configuration will be removed from Envoy soon. Please see Version history — envoy 1.26.0-dev-2d259e documentation for details. If continued use of this field is absolutely necessary, see Runtime — envoy 1.26.0-dev-2d259e documentation for how to apply a temporary and highly discouraged override."

no other errors are printed.
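For reference, the "override" that warning points to is Envoy's runtime deprecated-feature mechanism. Below is a minimal sketch of what such a runtime layer looks like in an Envoy bootstrap; note this only illustrates the Envoy mechanism — the Pomerium all-in-one image manages Envoy's bootstrap itself, so this may not be directly configurable there:

```yaml
# Sketch only: an Envoy static runtime layer re-enabling a deprecated field.
# The key follows Envoy's "envoy.deprecated_features:<fully.qualified.field>"
# naming convention; Envoy's own docs call this a temporary, discouraged override.
layered_runtime:
  layers:
    - name: static_layer
      static_layer:
        envoy.deprecated_features:envoy.type.matcher.v3.RegexMatcher.google_re2: true
```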

Additional context

Please find attached the monitoring screenshots.



Here is the log after a pod crash:

{"level":"error","error":"rpc error: code = Canceled desc = context canceled","time":"2023-01-20T09:52:00Z","message":"access log stream error, disconnecting"}

{"level":"error","error":"rpc error: code = Canceled desc = context canceled","time":"2023-01-20T09:52:00Z","message":"access log stream error, disconnecting"}

{"level":"error","error":"rpc error: code = Canceled desc = context canceled","time":"2023-01-20T09:52:00Z","message":"access log stream error, disconnecting"}

{"level":"error","error":"rpc error: code = Canceled desc = context canceled","time":"2023-01-20T09:52:00Z","message":"access log stream error, disconnecting"}

{"level":"error","error":"rpc error: code = Canceled desc = context canceled","time":"2023-01-20T09:52:00Z","message":"access log stream error, disconnecting"}

{"level":"error","syncer_id":"databroker","syncer_type":"type.googleapis.com/pomerium.config.Config","error":"error receiving sync record: rpc error: code = Unavailable desc = error reading from server: EOF","time":"2023-01-20T09:52:00Z","message":"sync"}

{"level":"error","error":"rpc error: code = Canceled desc = context canceled","time":"2023-01-20T09:52:00Z","message":"access log stream error, disconnecting"}

{"level":"error","error":"rpc error: code = Canceled desc = context canceled","time":"2023-01-20T09:52:00Z","message":"access log stream error, disconnecting"}

{"level":"error","error":"rpc error: code = Canceled desc = context canceled","time":"2023-01-20T09:52:00Z","message":"access log stream error, disconnecting"}

{"level":"error","syncer_id":"databroker","syncer_type":"type.googleapis.com/pomerium.config.Config","error":"error calling sync: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:36961: connect: connection refused\"","time":"2023-01-20T09:52:01Z","message":"sync"}

{"level":"fatal","pid":19,"time":"2023-01-20T09:52:01Z","message":"envoy: subprocess exited"}

This happens sometimes on startup and sometimes when adding or editing a rule via CRD.

Hi,

Could you please clarify which release you upgraded from, and to which?

Could you please give us a rough estimate of the number of Ingress resources you have, and how many of them use TCP? I assume you mean Ingress objects annotated with ingress.pomerium.io/tcp_upstream: 'true' (Ingress Configuration | Pomerium)?

  1. This is not a crash but rather various Pomerium internal services winding down (note the "context canceled" errors). There must be some other error earlier in the log that actually caused the issue.

  2. When you say CRD, do you mean an Ingress resource or the Pomerium global settings? I’m a bit confused here.

There’s a lot happening here; would you join our Slack channel so we can troubleshoot it synchronously?

Hi @denis ,

We were running 0.19.1 via the Helm chart and switched to the latest version via the all-in-one deployment.

Sure thing, I’m already in the Slack channel.

Thank you

I can confirm that we are using TCP Ingress resources:

447 Ingresses, of which 341 are TCP.
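For anyone wanting to reproduce such a count, here is a rough sketch, assuming kubectl access and that TCP routes are marked with the ingress.pomerium.io/tcp_upstream annotation (the exact quoting of 'true' may differ per manifest):

```shell
# Live cluster (hypothetical, not from this thread):
#   kubectl get ingress -A -o yaml | grep -c "tcp_upstream: 'true'"
# Demo of the same grep on an inline sample of annotation lines:
printf '%s\n' \
  "ingress.pomerium.io/tcp_upstream: 'true'" \
  'other: annotation' \
  "ingress.pomerium.io/tcp_upstream: 'true'" \
  | grep -c "tcp_upstream: 'true'"
```

grep -c counts matching lines, so this prints 2 for the sample: one count per annotated Ingress.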

thank you

Here is the config:

pomerium.yaml

apiVersion: ingress.pomerium.io/v1
kind: Pomerium
metadata:
  name: global
spec:
  secrets: pomerium/bootstrap
  authenticate:
    url: https://authenticate.redacted
  identityProvider:
    provider: oidc
    url: https://keycloak.redacted
    secret: pomerium/idp
  certificates:
    - pomerium/pomerium-wildcard-tls
  storage:
    postgres:
      secret: pomerium/dbsecret
  jwtClaimHeaders:
    additionalProperties: email, groups, user, preferred_username

Example TCP Ingress:

apiVersion: v1
kind: Service
metadata:
  name: redacted-ssh-service-tcp
spec:
  type: ExternalName
  externalName: redacted
  ports:
    - protocol: TCP
      name: ssh
      port: 22

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: jente-demo-ssh-ssh-ingress-tcp
  namespace: pomerium
  annotations:
    ingress.pomerium.io/tcp_upstream: 'true'
    ingress.pomerium.io/allowed_idp_claims: |
      groups:
        - redacted
      preferred_username:
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
        - redacted
spec:
  ingressClassName: pomerium
  tls:
    - hosts:
        - redacted
      secretName: pomerium-wildcard-tls
  rules:
    - host: redacted
      http:
        paths:
          - pathType: ImplementationSpecific
            backend:
              service:
                name: redacted
                port:
                  name: ssh