Troubleshooting

A symptom-first index for the most common operational issues. If a recipe doesn't resolve it, escalate to your support contact with the structured-log evidence each section asks you to capture. Since v1.0.

Placeholders (<region>, <your-hostname>, etc.) stand in for your own values.

Error-code prefixes

Every error response and structured-log line carries an operator-readable code:

| Prefix | Surface | Meaning |
| --- | --- | --- |
| S2R-ADM-#### | Admin API | Admin-plane errors (auth, validation, RBAC, settings, credential key) |
| S2R-RUN-#### | Runtime | REST-runtime errors (timeout, payload size, mapping, upstream) |
| S2R-LIC-#### | License | License validation (expired, over-cap, signature) |
| S2R-WRK-#### | Worker | Background job errors (aggregation, ingest derivation) |
| S2R-DEPLOY-#### | Deploy / startup | Migration, env-var binding, secret-resolution errors |

The ones you will see most often:

S2R-RUN-0413 — sync payload exceeded 30 MB.
S2R-ADM-0419 — credential (envelope) key not bound on this revision.
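
When triaging a batch of log lines, it can help to map each code back to its surface. The following is a minimal sketch, not part of the product: `classify` is a hypothetical helper, and it assumes that `####` in the prefix table means exactly four digits.

```python
import re
from typing import Optional, Tuple

# Prefix-to-surface mapping copied from the table above.
SURFACES = {
    "ADM": "Admin API",
    "RUN": "Runtime",
    "LIC": "License",
    "WRK": "Worker",
    "DEPLOY": "Deploy / startup",
}

# Assumption: "####" in the table stands for exactly four digits.
CODE_RE = re.compile(r"^S2R-([A-Z]+)-(\d{4})$")


def classify(code: str) -> Optional[Tuple[str, int]]:
    """Return (surface, numeric code), or None if the code is not recognised."""
    match = CODE_RE.match(code)
    if match is None or match.group(1) not in SURFACES:
        return None
    return SURFACES[match.group(1)], int(match.group(2))


print(classify("S2R-RUN-0413"))
```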

"Request rejected — payload too large" (`S2R-RUN-0413`)

A synchronous request body exceeded the 30 MB ceiling and was rejected with HTTP 413. This ceiling is fixed for v1.0.

What to do:

Confirm the caller is sending an unexpectedly large body (a misbehaving client, an embedded attachment, or batched payloads).
For genuinely large transfers, split the payload or use an asynchronous transport rather than a single synchronous call.
The rejection is logged with result_code=rejected and S2R-RUN-0413; find it by the correlation ID in the live traffic log.
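
If the oversized body is a batch, one way to split it on the caller side is to group records so that no group exceeds the ceiling. This is only a sketch under stated assumptions: `chunk_records` is a hypothetical helper, "30 MB" is taken here as 30 × 1024 × 1024 bytes (the exact unit is not specified above), and any envelope or framing overhead your client adds is not counted.

```python
from typing import List

# Assumption: 30 MB interpreted as 30 * 1024 * 1024 bytes.
LIMIT_BYTES = 30 * 1024 * 1024


def chunk_records(records: List[bytes], limit: int = LIMIT_BYTES) -> List[List[bytes]]:
    """Group records in order so that each group's total size is <= limit."""
    chunks: List[List[bytes]] = []
    current: List[bytes] = []
    size = 0
    for record in records:
        if len(record) > limit:
            # A single record over the ceiling cannot be sent synchronously.
            raise ValueError("single record exceeds the payload limit")
        if current and size + len(record) > limit:
            chunks.append(current)
            current, size = [], 0
        current.append(record)
        size += len(record)
    if current:
        chunks.append(current)
    return chunks


# Tiny limit used only to make the grouping visible.
print(chunk_records([b"aaaa", b"bbbb", b"cccc"], limit=8))
```

Leave some headroom below the real ceiling in practice, since the request wrapper around the records also counts toward the body size.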

"Backend timeout" / upstream errors

The default backend timeout is 100 ms and the default retry count is 2, so a slow backend surfaces as upstream timeouts.

What to do: tune the per-operation timeout and retry count for the affected service under its runtime settings — these are operator-configurable and should be raised to match a known-slow backend. Confirm the backend itself is healthy before raising the timeout broadly.
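
Before raising a timeout, it is worth estimating how long a caller could wait in the worst case. The sketch below assumes that "retries" means additional attempts after the first one and that attempts run back to back with no backoff; neither detail is stated above, so treat the result as a rough bound.

```python
def worst_case_ms(timeout_ms: int, retries: int) -> int:
    """Upper bound on time spent waiting for the backend, in milliseconds.

    Assumes one initial attempt plus `retries` further attempts,
    each allowed to run for the full timeout, with no delay between them.
    """
    return timeout_ms * (retries + 1)


# With the documented defaults (100 ms, 2 retries) this prints 300.
print(worst_case_ms(100, 2))
```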

`S2R-ADM-0419` — credential (envelope) key not bound

The S2R_CREDENTIAL_KEY environment variable is missing on the running revision, so the platform cannot decrypt the pgcrypto-encrypted backend credentials. This typically appears on the first backend-profile request after a deploy when the secret binding was dropped (Cloud Run does not inherit env vars/secrets from the prior revision).

What to do:

Redeploy through the deploy script (not a manual gcloud run deploy), which re-applies the centralized secret bindings.
Reuse the existing key value. Regenerating S2R_CREDENTIAL_KEY invalidates every previously-encrypted credential row.

Confirm the binding is present on the new revision (PowerShell syntax shown; in bash, replace the trailing backticks with \ and Select-String with grep):

gcloud run revisions describe <revision> `
    --region=<region> `
    --format='value(spec.containers[0].env)' | Select-String S2R_CREDENTIAL_KEY
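
The same check can be scripted against the JSON form of the revision (`--format=json`). This is a sketch only: `has_env_binding` is a hypothetical helper, and the `spec.containers[].env[].name` layout is assumed from the format path used in the command above.

```python
def has_env_binding(revision: dict, name: str) -> bool:
    """Return True if any container in the revision declares the env var `name`.

    `revision` is the parsed JSON of a revision description. Only the
    presence of the name is checked, never the secret value itself.
    """
    containers = revision.get("spec", {}).get("containers", [])
    for container in containers:
        for entry in container.get("env") or []:
            if entry.get("name") == name:
                return True
    return False


# Hand-written sample document, not real command output.
sample = {"spec": {"containers": [{"env": [{"name": "S2R_CREDENTIAL_KEY"}]}]}}
print(has_env_binding(sample, "S2R_CREDENTIAL_KEY"))
```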

See the deployment runbook and the configuration reference.

Ingest stopped flowing

Symptom: the Dashboard shows zero gateway-edge traffic in the last few minutes; the live traffic log is empty.

Authoritative check first — is data actually landing?

SELECT MAX(created_at) FROM s2r_f5_request_log;

If that timestamp is fresh, traffic is flowing regardless of any zero-window alert (the alert observes one signal; the table is the truth).
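
The freshness decision can be expressed as a small function once you have the timestamp from the query. This is a sketch: `is_ingest_fresh` is a hypothetical helper, and the five-minute window is an assumption standing in for "the last few minutes"; pick a window that matches your traffic pattern.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_ingest_fresh(
    latest: Optional[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=5),  # assumed threshold
) -> bool:
    """Return True if the newest row is no older than `window`.

    `latest` is the MAX(created_at) value; None means the table is empty.
    """
    if latest is None:
        return False
    return now - latest <= window


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_ingest_fresh(now - timedelta(minutes=2), now))
```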

If it is stale, note that an install can have more than one ingest target (a public ingest service, an edge ingest service, and, if you run one, a relay-VM container). All of them must run the same image. Check each:

Public ingest service: gcloud run services describe s2r-worker --region=<region> --format='value(status.imageDigest)' — compare to the digest your last deploy reported; redeploy worker if stale.
Edge ingest service: gcloud run services describe s2r-f5-ingest --region=<region> --format='value(status.imageDigest)' — same comparison; redeploy if stale.
Relay VM container — SSH to the relay VM and compare the local image digest. If it doesn't match, run the persisted refresh helper, which pulls the current image and restarts the container:
```
sudo /opt/update_relay.sh
```
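
Once you have collected a digest from each target, the comparison itself is mechanical. The sketch below uses a hypothetical helper, `stale_targets`, and made-up target names and digests purely for illustration.

```python
from typing import Dict, List


def stale_targets(expected_digest: str, observed: Dict[str, str]) -> List[str]:
    """Return the names of targets whose digest differs from the expected one."""
    return sorted(
        name for name, digest in observed.items() if digest != expected_digest
    )


# Illustrative values only; real digests come from your last deploy output.
observed = {
    "public-ingest": "sha256:aaa",
    "edge-ingest": "sha256:aaa",
    "relay-vm": "sha256:bbb",
}
print(stale_targets("sha256:aaa", observed))
```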

Other common causes:

Relay VM disk full — container JSON logs filled /. Truncate them (sudo find /var/lib/docker/containers -name "*-json.log" -exec truncate -s 0 {} +) and restart the container.
Gateway feed paused — confirm your gateway's outbound syslog/log target still points at the platform's ingest endpoint and the network path is intact.

A worker change that touches the database schema always needs the relay refreshed afterward (see deployment runbook); otherwise a stale relay's SQL mismatches the schema and ingest silently stops.

Schema mismatch after an upgrade

Symptom: after an upgrade, dashboards show wrong numbers, the worker logs repeated errors, or a service fails to start citing a schema/migration problem.

Cause: the schema-owning service (admin-api) and the services depending on the schema are out of step — usually because the deploy order was not followed, or a relay/ingest target wasn't refreshed.

What to do:

Deploy admin-api first — it applies the Flyway migrations and advances the schema. Confirm it came up cleanly.
Then redeploy/cycle runtime and worker against the migrated schema.
Refresh any relay/ingest target so its embedded SQL matches.
Confirm the aggregation refresh is healthy — a failing refresh leaves dashboards reading stale tables (see below).

Migrations are forward-only; if a release notes a schema change that a rollback cannot tolerate, restore from the pre-upgrade backup. See the upgrade guide.

Dashboard shows wrong numbers / impossibly slow

Symptom: KPIs are inconsistent, or the page takes tens of seconds.

Cause: the worker's aggregation refresh is failing or stalled, so dashboards read stale aggregation tables.

Check the worker's aggregation logs for repeated ERROR entries (a common cause is schema drift after an incomplete migration). If migrations are pending, deploy admin-api first so that it applies them, then redeploy worker to resume the refresh.

If the refresh is healthy but numbers still look off, drill into Service detail: per-service counters and the global KPIs come from different aggregations, so a mismatch shows which one is stale. Treat a failing aggregation refresh as high priority, because the dashboards depend on it.

"I'm locked out of the admin UI"

You authenticate at your IdP but the UI shows no admin role.

Recovery path: redeploy admin-api with the bootstrap admin email set (S2R_ADMIN_BOOTSTRAP_EMAILS). The revision comes up in recovery auth mode; log in as that email to be granted admin, after which the mode auto-flips back to protected. Confirm protected under Settings → Auth Mode; if it does not flip back, redeploy without the bootstrap flag. See RBAC & roles.

"Update applied but service won't start"

After approving an update, one service fails to roll out and the platform rolls back. Identify the failed service from Settings → Updates, pin traffic to the last-known-good revision, capture the failure logs for your support contact, and retry once the underlying issue is resolved. For Helm: helm rollback s2r <revision> -n s2r.

Cold-start latency spikes

Symptom: the first request after a quiet period takes much longer than steady-state; subsequent requests are fast.

Cause: the runtime scaled to zero (Cloud Run), with an additional warm-up for VPC-attached services.

Fix: keep a warm minimum — min-instances ≥ 1 on the Cloud Run runtime service, or replicas ≥ 1 on the Helm runtime deployment. Trade-off: always-on cost vs no cold starts.

Where to find logs

Structured-log shape

Every log line is JSON and carries service, correlationId, userId, errorCode, message, and a context. The correlation ID is propagated across admin-api → runtime → worker → DB, so a single transaction is greppable end to end.
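
Because every line is JSON with a `correlationId` field, following one transaction through exported or container logs takes only a few lines of code. This is a minimal sketch: `lines_for_transaction` is a hypothetical helper, and the sample lines are invented to match the field names listed above.

```python
import json
from typing import Iterable, List


def lines_for_transaction(lines: Iterable[str], correlation_id: str) -> List[dict]:
    """Return the parsed log records that carry the given correlation ID.

    Lines that are not valid JSON objects are skipped rather than treated
    as errors, since container output can include stray non-JSON text.
    """
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("correlationId") == correlation_id:
            matches.append(record)
    return matches


sample = [
    '{"service": "s2r-runtime", "correlationId": "abc", "errorCode": "S2R-RUN-0413", "message": "rejected"}',
    '{"service": "s2r-worker", "correlationId": "xyz", "message": "ok"}',
    "plain text startup banner",
]
for record in lines_for_transaction(sample, "abc"):
    print(record["service"], record.get("errorCode"))
```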

Cloud Logging filters (GCP-native)

resource.type="cloud_run_revision"
resource.labels.service_name="s2r-admin-api"   -- or s2r-runtime / s2r-worker / s2r-admin-ui
severity>=WARNING

Useful additions:

jsonPayload.correlationId="<uuid>" — every line for one transaction.
jsonPayload.errorCode=~"^S2R-RUN-" — runtime errors only.
textPayload:"S2R-DEPLOY" — the deploy-provenance line (one per service per startup; the authoritative answer to "which code is running").
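
If you build these filters often, a small helper keeps the clauses consistent. This is a sketch: `build_filter` is a hypothetical function that only assembles the filter text shown above. It relies on Cloud Logging treating clauses on separate lines as joined by AND, and it does not escape quotes in its inputs.

```python
from typing import Optional


def build_filter(
    service: str,
    min_severity: str = "WARNING",
    correlation_id: Optional[str] = None,
    code_prefix: Optional[str] = None,
) -> str:
    """Assemble a Cloud Logging filter for one service, one clause per line."""
    clauses = [
        'resource.type="cloud_run_revision"',
        f'resource.labels.service_name="{service}"',
        f"severity>={min_severity}",
    ]
    if correlation_id:
        clauses.append(f'jsonPayload.correlationId="{correlation_id}"')
    if code_prefix:
        # Regex match anchored at the start of the error code.
        clauses.append(f'jsonPayload.errorCode=~"^{code_prefix}"')
    return "\n".join(clauses)


print(build_filter("s2r-runtime", code_prefix="S2R-RUN-"))
```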

For Compose/Helm, the same JSON lines flow to container stdout — docker compose logs -f <service> or kubectl -n s2r logs deploy/<service>.

Still stuck?

Capture the correlation ID, the error code, and the deploy-provenance line for the affected service, then escalate via Support.
