A symptom-first index for the most common operational issues. If a recipe doesn't resolve it, escalate to your support contact with the structured-log evidence each section asks you to capture. Since v1.0.
Placeholders (<region>, <your-hostname>, etc.) stand in for your own values.
Every error response and structured-log line carries an operator-readable code:
| Prefix | Surface | Meaning |
| --- | --- | --- |
| S2R-ADM-#### | Admin API | Admin-plane errors (auth, validation, RBAC, settings, credential key) |
| S2R-RUN-#### | Runtime | REST-runtime errors (timeout, payload size, mapping, upstream) |
| S2R-LIC-#### | License | License validation (expired, over-cap, signature) |
| S2R-WRK-#### | Worker | Background job errors (aggregation, ingest derivation) |
| S2R-DEPLOY-#### | Deploy / startup | Migration, env-var binding, secret-resolution errors |
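When triaging a batch of log lines it can help to route each code to its surface automatically. The following is a minimal sketch, not part of the product: `SURFACES` copies the prefixes from the table above, and `surface_for` is a hypothetical helper name.

```python
# Prefix-to-surface mapping copied from the table above.
SURFACES = {
    "S2R-ADM": "Admin API",
    "S2R-RUN": "Runtime",
    "S2R-LIC": "License",
    "S2R-WRK": "Worker",
    "S2R-DEPLOY": "Deploy / startup",
}


def surface_for(code: str) -> str:
    """Return the surface for a code such as 'S2R-RUN-0413' (hypothetical helper)."""
    # Split off the trailing numeric part; everything before it is the prefix.
    prefix, sep, number = code.rpartition("-")
    if not sep or not number.isdigit():
        raise ValueError(f"not an S2R error code: {code!r}")
    try:
        return SURFACES[prefix]
    except KeyError:
        raise ValueError(f"unknown prefix: {prefix!r}") from None
```

For example, `surface_for("S2R-ADM-0419")` returns `"Admin API"`, which tells you to start with the admin-plane sections of this page.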
The ones you will see most often:
S2R-RUN-0413: sync payload exceeded 30 MB.
S2R-ADM-0419: credential (envelope) key not bound on this revision.
S2R-RUN-0413
A synchronous request body exceeded the 30 MB ceiling and was rejected with
HTTP 413. This ceiling is fixed for v1.0.
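A caller can avoid the rejection by measuring the body before sending it. The sketch below assumes a JSON body and reads "30 MB" as 30 × 10^6 bytes, the stricter interpretation; both are assumptions, and `fits_sync_limit` is a hypothetical helper, not a platform API. The size your HTTP client actually sends may differ slightly from this estimate.

```python
import json

# Assumption: "30 MB" means 30 * 10**6 bytes. Adjust if the platform counts MiB.
MAX_SYNC_BODY_BYTES = 30 * 10**6


def encoded_size(payload: dict) -> int:
    """Size in bytes of the payload when serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_sync_limit(payload: dict, limit: int = MAX_SYNC_BODY_BYTES) -> bool:
    """True if the serialized payload is at or under the synchronous ceiling."""
    return encoded_size(payload) <= limit
```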
What to do:
The rejected request is logged with result_code=rejected and S2R-RUN-0413; find it
by the correlation ID in the live traffic log.
The default backend timeout is 100 ms and retries are 2. A slow backend will surface as upstream timeouts.
What to do: tune the per-operation timeout and retry count for the affected service under its runtime settings — these are operator-configurable and should be raised to match a known-slow backend. Confirm the backend itself is healthy before raising the timeout broadly.
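Before raising either setting, it is worth estimating how long a caller may wait in the worst case. The sketch below assumes attempts run one after another with no backoff between them; the platform's actual retry scheduling is not described here, so treat the result as a rough lower bound on the worst case. `worst_case_wait_ms` is a hypothetical helper name.

```python
def worst_case_wait_ms(timeout_ms: float, retries: int) -> float:
    """Upper bound on time spent waiting on the backend for one request.

    Assumes one initial attempt plus `retries` further attempts, each allowed
    to run until the timeout, with no delay between attempts.
    """
    if timeout_ms < 0 or retries < 0:
        raise ValueError("timeout_ms and retries must be non-negative")
    return timeout_ms * (1 + retries)
```

With the defaults (100 ms, 2 retries) this gives 300 ms; raising the timeout to 2000 ms while keeping 2 retries gives 6000 ms, which may matter for callers with their own deadlines.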
S2R-ADM-0419: credential (envelope) key not bound
The S2R_CREDENTIAL_KEY environment variable is missing on the running revision,
so the platform cannot decrypt the pgcrypto-encrypted backend credentials. This
typically appears on the first backend-profile request after a deploy when
the secret binding was dropped (Cloud Run does not inherit env vars/secrets from
the prior revision).
What to do:
Redeploy through the deploy script (not a manual gcloud run deploy), which
re-applies the centralized secret bindings.
Reuse the existing key value. Regenerating S2R_CREDENTIAL_KEY
invalidates every previously-encrypted credential row.
Confirm the binding is present on the new revision:
gcloud run revisions describe <revision> `
  --region=<region> `
  --format='value(spec.containers[0].env)' | Select-String S2R_CREDENTIAL_KEY
See the deployment runbook and the configuration reference.
Symptom: the Dashboard shows zero gateway-edge traffic in the last few minutes; the live traffic log is empty.
Authoritative check first — is data actually landing?
SELECT MAX(created_at) FROM s2r_f5_request_log;
If that timestamp is fresh, traffic is flowing regardless of any zero-window alert (the alert observes one signal; the table is the truth).
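The freshness decision can be made explicit once you have the `MAX(created_at)` value. This is a minimal sketch: the 5-minute threshold is an assumption standing in for "the last few minutes", the timestamp is assumed to be timezone-aware, and `is_fresh` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(
    last_created_at: Optional[datetime],
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(minutes=5),  # assumed threshold
) -> bool:
    """True if the newest request-log row is recent enough to call ingest healthy."""
    if last_created_at is None:
        # MAX() over an empty table returns NULL: nothing has ever landed.
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_created_at <= max_age
```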
If it is stale, there can be more than one ingest target on the install (a public ingest service, an edge ingest service, and — if you run one — a relay-VM container). They must all run the matching image. Check each:
Public ingest service:
gcloud run services describe s2r-worker --region=<region> --format='value(status.imageDigest)'
— compare to the digest your last deploy reported; redeploy worker if stale.
Edge ingest service:
gcloud run services describe s2r-f5-ingest --region=<region> --format='value(status.imageDigest)'
— same comparison; redeploy if stale.
Relay VM container — SSH to the relay VM and compare the local image digest. If it doesn't match, run the persisted refresh helper, which pulls the current image and restarts the container:
sudo /opt/update_relay.sh
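Once you have collected the digest from each target, the comparison itself is mechanical. The sketch below assumes you pass in the digest your last deploy reported and a mapping of target name to observed digest; `stale_targets` and the `relay-vm` label are hypothetical, and the digests in the usage note are made up.

```python
def stale_targets(expected_digest: str, observed: dict[str, str]) -> list[str]:
    """Return the names of ingest targets whose image digest differs from the expected one."""
    return sorted(
        name for name, digest in observed.items() if digest != expected_digest
    )
```

For example, `stale_targets("sha256:aaa", {"s2r-worker": "sha256:aaa", "s2r-f5-ingest": "sha256:bbb", "relay-vm": "sha256:aaa"})` returns `["s2r-f5-ingest"]`, so only the edge ingest service needs a redeploy.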
Other common causes:
Docker container logs have filled the root filesystem (/). Truncate them
(sudo find /var/lib/docker/containers -name "*-json.log" -exec truncate -s 0 {} +)
and restart the container.
A worker change that touches the database schema always needs the relay refreshed afterward (see deployment runbook); otherwise a stale relay's SQL mismatches the schema and ingest silently stops.
Symptom: after an upgrade, dashboards show wrong numbers, the worker logs repeated errors, or a service fails to start citing a schema/migration problem.
Cause: the schema-owning service (admin-api) and the services depending on the schema are out of step — usually because the deploy order was not followed, or a relay/ingest target wasn't refreshed.
What to do:
Migrations are forward-only; if a release notes a schema change that a rollback cannot tolerate, restore from the pre-upgrade backup. See the upgrade guide.
Symptom: KPIs are inconsistent, or the page takes tens of seconds to load.
Cause: the worker's aggregation refresh is failing or stalled, so dashboards read stale aggregation tables.
Check the worker's aggregation logs for repeated ERROR entries (a common
cause is schema drift after an incomplete migration). Redeploy worker to apply
pending migrations and resume the refresh. If the refresh is healthy but numbers
still look off, drill into Service detail — per-service counters and the global
KPIs come from different aggregations, and a mismatch points at which one is
stale. Treat a failing aggregation refresh as high-priority — the dashboards
depend on it.
You authenticate at your IdP but the UI shows no admin role.
Recovery path: redeploy admin-api with the bootstrap admin email set
(S2R_ADMIN_BOOTSTRAP_EMAILS). The revision comes up in recovery auth mode;
log in as that email to be granted admin, after which the mode auto-flips back to
protected. Confirm protected under Settings → Auth Mode; if it does not flip
back, redeploy without the bootstrap flag. See RBAC & roles.
After approving an update, one service fails to roll out and the platform rolls
back. Identify the failed service from Settings → Updates, pin
traffic to the last-known-good revision, capture the failure logs for your
support contact, and retry once the underlying issue is resolved. For Helm:
helm rollback s2r <revision> -n s2r.
Symptom: the first request after a quiet period takes much longer than steady-state; subsequent requests are fast.
Cause: the runtime scaled to zero (Cloud Run), with an additional warm-up for VPC-attached services.
Fix: keep a warm minimum — min-instances ≥ 1 on the Cloud Run runtime
service, or replicas ≥ 1 on the Helm runtime deployment. Trade-off: always-on
cost vs no cold starts.
Every log line is JSON and carries service, correlationId, userId,
errorCode, message, and a context. The correlation ID is propagated
across admin-api → runtime → worker → DB, so a single transaction is greppable
end to end.
resource.type="cloud_run_revision"
resource.labels.service_name="s2r-admin-api" -- or s2r-runtime / s2r-worker / s2r-admin-ui
severity>=WARNING
Useful additions:
jsonPayload.correlationId="<uuid>" matches every line for one transaction.
jsonPayload.errorCode=~"^S2R-RUN-" matches runtime errors only.
textPayload:"S2R-DEPLOY" matches the deploy-provenance line (one per service per
startup; the authoritative answer to "which code is running").
For Compose/Helm, the same JSON lines flow to container stdout:
docker compose logs -f <service> or kubectl -n s2r logs deploy/<service>.
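Outside Cloud Logging there is no query UI, so following one transaction means filtering the stdout stream yourself. The sketch below is one way to do that, assuming each log line contains a single JSON object, possibly preceded by a container-name prefix; `lines_for` is a hypothetical helper, and only the `correlationId` field name comes from the log format described above.

```python
import json
from typing import Iterable, Iterator


def lines_for(correlation_id: str, lines: Iterable[str]) -> Iterator[dict]:
    """Yield the parsed log entries that belong to one transaction."""
    for raw in lines:
        # Skip any prefix a log tool may add before the JSON object.
        start = raw.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(raw[start:])
        except json.JSONDecodeError:
            continue  # not a structured log line
        if isinstance(entry, dict) and entry.get("correlationId") == correlation_id:
            yield entry
```

Feeding it `sys.stdin` while piping in the container logs gives every entry for one correlation ID across services, which is the evidence the escalation step asks for.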
Capture the correlation ID, the error code, and the deploy-provenance line for the affected service, then escalate via Support.