Threshold Tuning for Downtime Windows

Promoting a PostgreSQL extension is only safe if you can prove, before ALTER EXTENSION UPDATE ever runs, that the operation will fit inside the stall budget your SLO allows. The failure this page addresses is the self-inflicted outage: an update whose ACCESS EXCLUSIVE lock request queues behind a long-running transaction and, while it waits, blocks every subsequent query on the affected objects, turning a two-second catalog rewrite into a multi-minute pile-up. Threshold tuning is the discipline of converting an abstract downtime SLO into concrete, enforced pipeline gates (lock-wait ceilings, WAL ceilings, latency ceilings) so that a promotion either completes inside the window or fails fast and hands off to rollback, and never half-applies while connections drain. This page is for database SREs and platform engineers who own the maintenance-window contract on production PostgreSQL fleets.

Up: Extension Upgrade Planning & Compatibility Validation — the staged validation pipeline this gate belongs to, and where the compatibility and simulation stages that feed threshold tuning are defined.

The Threshold Decision Flow

Threshold tuning is the final gate before promotion. It consumes the simulated cost of the update, compares each dimension against a hard ceiling derived from the SLO, and only opens the promotion gate when every dimension fits — otherwise it blocks and routes to triage.

Each arrow into the decision node is an independently measured cost dimension, and the feedback edge is what keeps the ceilings honest: after every promotion the observed lock wait, WAL volume, and latency degradation are folded back into the ceiling estimate, so the gate tracks the real workload instead of a one-time guess.

Prerequisites

Threshold tuning reads catalog and activity state and never issues DDL of its own; the actual update is owned separately by ALTER EXTENSION Automation. Confirm the environment meets these assumptions before wiring the gate into a pipeline.

PostgreSQL version: 9.6 or newer. The dry-run reads pg_available_extension_versions (present since 9.1, the release that introduced extensions), pg_locks, and pg_stat_activity; the lock_timeout GUC used to bound acquisition has existed since 9.3, so any supported server qualifies.
Python packages: Python 3.8+ with psycopg 3.x and psycopg_pool (pip install "psycopg[pool]"). Only the standard library plus these is required. Async fleets can port the sampler to asyncpg; the driver trade-offs are covered under ALTER EXTENSION Automation.
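Required privileges: every query the gate runs is read-only, so a role with CONNECT and SELECT on the system catalogs covers the dry-run. The contention snapshot additionally needs membership in pg_read_all_stats or pg_monitor (PostgreSQL 10+), because pg_stat_activity hides the state, transaction start, and query text of other roles' sessions from unprivileged users. The role that later executes the update needs ownership of the extension; keep that credential separate and least-privilege per Security Boundaries & Permissions.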
Catalog state: the target version must be reachable on the node, meaning an unbroken chain of update scripts (files named extension--from--to.sql) is present on disk. Validate reachability against a maintained compatibility matrix before threshold evaluation begins, and resolve the transition order with Dependency Tree Analysis so the estimated window covers every intermediate hop, not just the final one.
A representative cost signal: either historical telemetry from prior promotions or a shadow-replica replay. The most accurate window estimate comes from routing the candidate through Async Upgrade Simulation against production-sized data before the gate runs.
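
The reachability prerequisite can be checked against the server itself. The sketch below is one possible approach, not the pipeline's own code: it relies on PostgreSQL's built-in pg_extension_update_paths() function, whose path column is NULL when no chain of update scripts connects two versions. The names UPDATE_PATH_SQL, parse_update_path, and update_hops are hypothetical, and cur is assumed to be a psycopg-style cursor.

```python
from typing import List, Optional, Tuple

# pg_extension_update_paths() returns (source, target, path) rows; path looks
# like '1.8--1.9--1.10' and is NULL when the target cannot be reached.
UPDATE_PATH_SQL = """
SELECT p.path
FROM pg_extension e
CROSS JOIN LATERAL pg_extension_update_paths(e.extname) AS p
WHERE e.extname = %s
  AND p.source = e.extversion
  AND p.target = %s
"""


def parse_update_path(path: Optional[str]) -> List[Tuple[str, str]]:
    """Split '1.8--1.9--1.10' into hops [('1.8', '1.9'), ('1.9', '1.10')]."""
    if not path:
        return []
    versions = path.split("--")
    return list(zip(versions, versions[1:]))


def update_hops(cur, ext: str, target: str) -> List[Tuple[str, str]]:
    """Ordered hops from the installed version of `ext` to `target`.

    An empty list means there is no path, the extension is not installed,
    or it is already at the target version.
    """
    cur.execute(UPDATE_PATH_SQL, (ext, target))
    row = cur.fetchone()
    return parse_update_path(row[0] if row else None)
```

The hop count is what matters for the window: each hop is a separate script whose cost has to fit inside the same budget.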

Core Concept: Why the Window Is Not the Script Runtime

The naive model — “downtime equals how long the update script takes” — is wrong on both ends, and both errors are what threshold tuning exists to correct.

Lock acquisition, not lock hold, dominates. ALTER EXTENSION UPDATE takes an ACCESS EXCLUSIVE lock on each catalog object it rewrites. That lock cannot be granted while any other transaction holds a conflicting lock — even a plain SELECT on a dependent table holds ACCESS SHARE, which conflicts. Worse, once the ALTER is waiting for the lock, PostgreSQL queues every newer lock request behind it, so a single idle-in-transaction session can convert a fast update into a full stall on the table for the entire duration of that transaction. The measured cost that matters most is therefore the acquisition wait under live load, which is why the gate snapshots pg_locks and pg_stat_activity before deciding.

Transactionality decides the rollback shape. Most extension update scripts execute inside a single implicit transaction: if any statement fails, the whole update rolls back and the catalog is untouched — a clean failure. But a script that contains a non-transactional command (CREATE INDEX CONCURRENTLY, ALTER TYPE ... ADD VALUE, VACUUM) forces PostgreSQL to run outside that safety net, so a mid-flight failure can leave an INVALID index or a partially applied change behind. The threshold model must know which shape it is dealing with, because a non-transactional update needs a wider window (concurrent index builds are slow) and an explicit recovery path.

The window has three serial phases. Total stall ≈ lock-acquisition wait + lock-held rewrite (catalog rows in pg_proc/pg_class/pg_type plus cached-plan invalidation for live sessions) + on replicas, WAL apply lag until the standby catches up. Threshold tuning derives one ceiling per phase from the SLO and gates on all three, because a candidate can fit the rewrite budget and still blow the replica-lag budget on a busy standby. The replica dimension is developed in depth under Tuning Maintenance Windows for High-Availability Clusters.
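
A small worked check makes the per-phase gating concrete. The ceilings match the 2000 ms SLO derived in Step 1; the observed numbers are purely illustrative, and phase_breaches is a hypothetical helper name.

```python
from typing import Dict, List


def phase_breaches(observed_ms: Dict[str, int], ceilings_ms: Dict[str, int]) -> List[str]:
    """Return the phases whose observed cost exceeds that phase's own ceiling."""
    return [phase for phase, cost in observed_ms.items() if cost > ceilings_ms[phase]]


# Per-phase ceilings for a 2000 ms stall SLO: 840 + 360 + 800.
ceilings = {"lock_wait": 840, "rewrite": 360, "replica_apply": 800}
# Illustrative candidate: quick grant, cheap rewrite, busy standby.
observed = {"lock_wait": 200, "rewrite": 150, "replica_apply": 1100}

total = sum(observed.values())
print(total, phase_breaches(observed, ceilings))
# prints: 1450 ['replica_apply']
```

The total of 1450 ms sits under the 2000 ms SLO, yet the candidate is still blocked: a single-number check would have let the replica-lag breach through.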

lock_timeout is the enforcement primitive. Setting lock_timeout for the session that runs the ALTER turns “wait indefinitely and block the table” into “fail with 55P03 lock_not_available after N milliseconds and leave the table alone.” That single GUC is what makes the acquisition-wait ceiling enforceable rather than merely estimated — the gate computes the ceiling, and lock_timeout guarantees the server honors it.
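
As a rough illustration of that enforcement, the sketch below wraps a statement in a transaction with a transaction-local lock_timeout. It assumes a psycopg 3 connection (conn.transaction(), conn.execute(), and the sqlstate attribute on raised errors); run_with_lock_timeout is a hypothetical helper, and the statement is passed in fully composed because the executing role and update command live outside this gate.

```python
from typing import Any

LOCK_NOT_AVAILABLE = "55P03"


def run_with_lock_timeout(conn: Any, statement: str, lock_timeout_ms: int) -> bool:
    """Run `statement` under a transaction-local lock_timeout.

    Returns True if the transaction committed and False if the server refused
    the lock (SQLSTATE 55P03). Any other error propagates to the caller.
    """
    try:
        with conn.transaction():
            # set_config(..., true) is the function form of SET LOCAL, so the
            # timeout vanishes when this transaction ends.
            conn.execute(
                "SELECT set_config('lock_timeout', %s, true)",
                (f"{lock_timeout_ms}ms",),
            )
            conn.execute(statement)
        return True
    except Exception as exc:
        # psycopg exposes the server's SQLSTATE on the exception object.
        if getattr(exc, "sqlstate", None) == LOCK_NOT_AVAILABLE:
            return False
        raise
```

Scoping the timeout to the transaction keeps it from leaking into whatever the pooled connection runs next.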

Step-by-Step Implementation

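The gate is four steps: translate the SLO into ceilings, snapshot live contention, run a read-only dry-run, then decide with an exit code a CI/CD stage can branch on. Steps 1 and 3 run on their own; Step 4 reuses the classes and functions they define, so import or paste those definitions (without their __main__ demos) into the same script.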

Step 1 — Translate the SLO into hard ceilings

Start from the contract the business signed — a maximum tolerable stall — and derive per-phase ceilings from it. Keeping the derivation in one place makes the gate auditable: every number traces back to the SLO.

from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class WindowSLO:
    """The maintenance-window contract, in milliseconds and megabytes."""
    max_total_stall_ms: int = 2000   # hard SLO: worst-case client-visible stall
    replica_apply_budget_ms: int = 800   # portion reserved for standby catch-up
    wal_ceiling_mb: int = 100        # WAL a single update may generate
    max_p95_latency_ms: int = 150    # query p95 must stay under this during window


@dataclass(frozen=True)
class Thresholds:
    """Derived ceilings the gate actually enforces."""
    max_lock_wait_ms: int
    max_rewrite_ms: int
    max_wal_mb: int
    max_p95_latency_ms: int

    @classmethod
    def from_slo(cls, slo: WindowSLO) -> "Thresholds":
        # The primary and replica phases must both fit inside the stall budget.
        on_primary = slo.max_total_stall_ms - slo.replica_apply_budget_ms
        if on_primary <= 0:
            raise ValueError("replica_apply_budget_ms leaves no budget for the primary")
        # Split the primary budget 70/30: acquisition is the volatile part, so it
        # gets the larger share; the remainder covers the deterministic rewrite.
        # Integer arithmetic keeps the split exact (no float truncation in int()).
        max_lock_wait_ms = on_primary * 7 // 10
        return cls(
            max_lock_wait_ms=max_lock_wait_ms,
            max_rewrite_ms=on_primary - max_lock_wait_ms,
            max_wal_mb=slo.wal_ceiling_mb,
            max_p95_latency_ms=slo.max_p95_latency_ms,
        )


if __name__ == "__main__":
    slo = WindowSLO()
    print(asdict(Thresholds.from_slo(slo)))
    # {'max_lock_wait_ms': 840, 'max_rewrite_ms': 360, 'max_wal_mb': 100, 'max_p95_latency_ms': 150}

Step 2 — Snapshot live contention before committing to a window

The acquisition-wait ceiling is meaningless without the current lock landscape. This query surfaces the sessions that would block an ACCESS EXCLUSIVE grant on the extension’s objects — long transactions and idle-in-transaction sessions are the usual culprits.

-- Sessions holding granted relation-level locks (every lock mode conflicts with
-- ACCESS EXCLUSIVE), plus how long they have been running. A row whose relation
-- is one the update touches is a session the ALTER would queue behind.
SELECT
    a.pid,
    a.state,
    now() - a.xact_start                   AS xact_age,
    now() - a.query_start                  AS query_age,
    l.relation::regclass                   AS relation,
    l.mode,
    a.query
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE a.datname = current_database()
  AND l.locktype = 'relation'
  AND l.granted
  AND a.pid <> pg_backend_pid()
  AND (a.state = 'idle in transaction' OR now() - a.xact_start > interval '5 seconds')
ORDER BY xact_age DESC NULLS LAST;

A window should not open while a session with a multi-minute xact_age is live: the ALTER would wait behind it and, worse, block all new traffic on the table for that entire span. Gate the promotion on this returning zero rows, or on the oldest transaction being younger than max_lock_wait_ms.
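
The rule in the previous paragraph can be sketched as a pure function over the xact_age column of the snapshot. This assumes the driver returns PostgreSQL intervals as datetime.timedelta (psycopg does); contention_allows_window is a hypothetical name.

```python
from datetime import timedelta
from typing import Iterable


def contention_allows_window(xact_ages: Iterable[timedelta], max_lock_wait_ms: int) -> bool:
    """True when the snapshot is empty or its oldest transaction is under the ceiling."""
    ages = list(xact_ages)
    if not ages:
        return True  # zero rows: nothing to queue behind
    oldest_ms = max(ages).total_seconds() * 1000
    return oldest_ms < max_lock_wait_ms
```

Comparing only the oldest transaction is deliberately conservative: one multi-minute session is enough to hold the window closed, however quiet the rest of the database is.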

Step 3 — Run a read-only dry-run of the update cost

The dry-run never mutates state: it verifies the target version is reachable, measures the dependency fan-out that drives rewrite cost, and returns a structured estimate. In production the per-dependency coefficients come from Async Upgrade Simulation telemetry rather than the static factors shown here.

import json
from contextlib import contextmanager
from typing import Any, Dict

import psycopg
from psycopg_pool import ConnectionPool


@contextmanager
def readonly_txn(pool: ConnectionPool):
    """Borrow a connection pinned to a read-only transaction. Never writes."""
    conn = pool.getconn()
    try:
        conn.autocommit = False
        with conn.cursor() as cur:
            cur.execute("SET TRANSACTION READ ONLY")
        yield conn
        conn.rollback()   # discard: the dry-run must leave no trace
    except Exception:
        conn.rollback()
        raise
    finally:
        pool.putconn(conn)


def dry_run(conn: psycopg.Connection, ext: str, target: str) -> Dict[str, Any]:
    with conn.cursor() as cur:
        # 1. Reachability: is the target version installable from disk on THIS node?
        cur.execute(
            "SELECT 1 FROM pg_available_extension_versions "
            "WHERE name = %s AND version = %s",
            (ext, target),
        )
        if cur.fetchone() is None:
            raise ValueError(f"{target!r} is not reachable for extension {ext!r}")

        # 2. Dependency fan-out: how many objects the rewrite must touch.
        # refclassid pins the match to pg_extension, because OIDs are only
        # unique within a single catalog.
        cur.execute(
            """
            SELECT count(*)
            FROM pg_depend d
            JOIN pg_extension e
              ON d.refclassid = 'pg_extension'::regclass
             AND d.refobjid = e.oid
            WHERE e.extname = %s AND d.deptype = 'e'
            """,
            (ext,),
        )
        deps = cur.fetchone()[0]

    # 3. Cost model. Replace coefficients with simulation telemetry in production.
    return {
        "extension": ext,
        "target_version": target,
        "dependency_count": deps,
        "est_rewrite_ms": deps * 12,        # catalog rewrite + plan invalidation
        "est_lock_wait_ms": 150 + deps * 5, # baseline contention + per-object waits
        # Rounded so float noise (14 * 0.8 == 11.200000000000001) stays out of the record.
        "est_wal_mb": round(max(1.0, deps * 0.8), 1),
        "mode": "read_only_dry_run",
    }


if __name__ == "__main__":
    dsn = "postgresql://readonly@localhost:5432/production_db"
    with ConnectionPool(dsn, min_size=1, max_size=2) as pool:
        with readonly_txn(pool) as conn:
            print(json.dumps(dry_run(conn, "pg_stat_statements", "1.10"), indent=2))

Step 4 — Gate the promotion and emit a CI/CD exit code

The gate compares the dry-run estimate against the derived ceilings and returns a status a pipeline stage can branch on: 0 promote, 1 threshold violation (block, route to triage), 2 execution error. Because the gate itself stays read-only, a promote does not run the update; it logs the exact SET lock_timeout statement that the executing session must apply so the server honors the acquisition ceiling.

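import json
import logging
import sys
from typing import Any, Dict, List

# WindowSLO and Thresholds (Step 1) and ConnectionPool, readonly_txn, and
# dry_run (Step 3) are assumed to be defined earlier in the same script.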

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)


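def check_ceilings(metrics: Dict[str, Any], t: "Thresholds") -> List[str]:
    """Return the list of violated dimensions (empty == the window fits)."""
    violations = []
    if metrics["est_lock_wait_ms"] > t.max_lock_wait_ms:
        violations.append("lock_wait")
    if metrics["est_rewrite_ms"] > t.max_rewrite_ms:
        violations.append("rewrite")
    if metrics["est_wal_mb"] > t.max_wal_mb:
        violations.append("wal_volume")
    # The static dry-run carries no latency estimate. If simulation telemetry
    # adds one (the key name est_p95_latency_ms is an assumption), gate on it too.
    p95 = metrics.get("est_p95_latency_ms")
    if p95 is not None and p95 > t.max_p95_latency_ms:
        violations.append("p95_latency")
    return violations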


def gate(metrics: Dict[str, Any], t: "Thresholds") -> int:
    violations = check_ceilings(metrics, t)
    if violations:
        logging.error("Threshold violation on: %s", ", ".join(violations))
        logging.info(json.dumps({"status": "blocked", "violations": violations,
                                 "metrics": metrics}))
        return 1

    # The window fits. lock_timeout enforces the acquisition ceiling on the
    # session that will run the ALTER, so a slow grant fails 55P03 instead of
    # queuing behind live traffic and blocking the table.
    logging.info("SET lock_timeout = '%dms';  -- enforce acquisition ceiling",
                 t.max_lock_wait_ms)
    logging.info(json.dumps({"status": "promote", "metrics": metrics}))
    return 0


if __name__ == "__main__":
    try:
        slo = WindowSLO()
        thresholds = Thresholds.from_slo(slo)
        dsn = "postgresql://readonly@localhost:5432/production_db"
        with ConnectionPool(dsn, min_size=1, max_size=2) as pool:
            with readonly_txn(pool) as conn:
                estimate = dry_run(conn, "pg_stat_statements", "1.10")
        sys.exit(gate(estimate, thresholds))
    except Exception as exc:  # noqa: BLE001 - top-level CI boundary
        logging.critical("Gate execution failed: %s", exc)
        sys.exit(2)

Dry-Run & Validation Gate

The gate’s contract is that a promotion only proceeds on a 0, and a 0 is only possible when every measured dimension fits its ceiling. A passing run emits a machine-readable record downstream stages pin to:

{
  "status": "promote",
  "metrics": {
    "extension": "pg_stat_statements",
    "target_version": "1.10",
    "dependency_count": 14,
    "est_rewrite_ms": 168,
    "est_lock_wait_ms": 220,
    "est_wal_mb": 11.2,
    "mode": "read_only_dry_run"
  }
}

Wire this as a blocking stage. In GitHub Actions or GitLab CI, run the gate script and let its exit code decide the pipeline path: a 1 halts the promotion and can trigger an asynchronous shadow-replica replay for a more precise estimate; a 0 opens the promotion gate. The GitLab CI form looks like this:

validate_extension_window:
  stage: pre-deploy
  script:
    - python gate_thresholds.py
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

Because the dry-run runs in a READ ONLY transaction and rolls back, the stage is safe to run on every merge request, not only inside a maintenance window — the gate never touches production state and can advise long before the change is scheduled.

Failure Modes & Error Taxonomy

When the window is mis-tuned, PostgreSQL reports it with specific SQLSTATE codes. Categorizing them into transient-versus-terminal is the job of Error Categorization Frameworks; the codes that surface specifically from a downtime-window miss are:

| SQLSTATE | Condition | What it means for the window | First response |
|---|---|---|---|
| `55P03` | `lock_not_available` | `lock_timeout` fired before the `ACCESS EXCLUSIVE` grant: the acquisition wait exceeded its ceiling. This is the gate working as designed. | Retry in a quieter window or clear the blocking session; do not raise `lock_timeout` blindly. |
| `57014` | `query_canceled` | `statement_timeout` cancelled the update mid-flight: the rewrite phase overran its budget. | Widen `max_rewrite_ms` only if the SLO allows, or shrink the change set. |
| `40P01` | `deadlock_detected` | The `ALTER` and a concurrent transaction locked objects in opposite order. | Reduce concurrency during the window; retry with jitter. |
| `53300` | `too_many_connections` | The pool filled while sessions queued behind the lock (the pile-up symptom). | Cap pool size; the gate's contention snapshot should have blocked this window. |
| `55006` | `object_in_use` | A dependent object could not be altered because a session was actively using it. | Drain the object's sessions or reschedule. |

A 55P03 is a success of threshold tuning, not a failure of the upgrade: the server refused to trade an SLO breach for a completed update. Treat repeated 55P03s as a signal that the window is scheduled against the wrong traffic profile, not as a reason to loosen the ceiling.
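
A bounded retry with jitter, the first response the table suggests for transient refusals, might look like the sketch below. It is an illustration under stated assumptions: attempt is a hypothetical callable that returns False on a clean 55P03 and raises on anything else, and the "full jitter" delay scheme and default values are choices made here, not something this page prescribes.

```python
import random
import time
from typing import Callable, List


def backoff_delays(retries: int, base_s: float = 1.0, cap_s: float = 30.0,
                   rng: Callable[[], float] = random.random) -> List[float]:
    """Full-jitter delays: retry n waits a random time in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap_s, base_s * 2 ** n) for n in range(retries)]


def retry_on_lock_timeout(attempt: Callable[[], bool], max_attempts: int = 4,
                          sleep: Callable[[float], None] = time.sleep,
                          rng: Callable[[], float] = random.random) -> bool:
    """Call attempt() until it returns True or max_attempts is exhausted.

    The lock_timeout ceiling is never widened between attempts; only the
    moment at which the update is tried changes.
    """
    delays = backoff_delays(max_attempts - 1, rng=rng)
    for try_no in range(max_attempts):
        if attempt():
            return True
        if try_no < len(delays):
            sleep(delays[try_no])
    return False
```

Keeping max_attempts small matters: once the retries are exhausted the right move is to reschedule the window, in line with the paragraph above.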

Rollback & Recovery Path

The recovery path depends entirely on the transactionality established in the core concept above.

Transactional update that failed (the common case). If the update script ran inside its implicit transaction and any statement failed, including a 55P03 or 57014 cancellation, the whole transaction rolled back and the catalog is logically unchanged: no object or version change from the update is visible. There is nothing to undo; confirm with a version check and reschedule:
```
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_stat_statements';
-- extversion still shows the PRE-update version → clean rollback, safe to retry
```
Non-transactional update that failed mid-flight. A script containing CREATE INDEX CONCURRENTLY can leave an INVALID index behind. Find and drop it before retrying, or the next attempt errors on the leftover:
```
SELECT c.relname
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE i.indisvalid = false;
-- DROP INDEX CONCURRENTLY <relname>;  for each invalid index, then re-run
```
Catalog corruption or a partially applied change that will not clean up. This is the case a forward fix cannot reach. Restore prior state from a snapshot taken before the window, or roll the timeline back with a point-in-time recovery as described in Snapshot & Point-in-Time Recovery. Always take that snapshot before opening the window, so the recovery target predates any change.

Because the extension update path is one-directional — most extensions ship no downgrade scripts — the pre-window snapshot is the only reliable rollback for a terminal failure. Never treat “re-run the update” as recovery for a non-transactional partial apply.

Performance & Scale Considerations

Iterate per database, gate per node. pg_extension is per-database, so a fleet promotion runs the gate once per database and aggregates: the window opens only when the slowest database still fits the SLO, not the average.
Bound acquisition, never the hold. Keep lock_timeout tight (the derived max_lock_wait_ms) and let the rewrite run unbounded once the lock is held — cancelling mid-rewrite via statement_timeout risks a non-transactional partial apply. Tune the two GUCs independently.
Retry with jitter, not with a wider ceiling. On a busy cluster the right response to 55P03 is exponential backoff with jitter across a few attempts, so the ALTER slips into a natural lull. Widening the ceiling to force a grant just trades a clean refusal for a real stall.
Do not parallelize updates that share catalog rows. Two concurrent extension updates touching pg_proc or a shared type serialize on the same locks anyway and multiply deadlock risk; run them sequentially and let the gate schedule around each other’s windows.
Feed actuals back after every promotion. Capture the real acquisition wait, WAL bytes, and p95 degradation, and fold them into the ceiling estimate with an exponential moving average. This closes the loop the flow diagram opens with, so the gate converges on the workload’s true cost instead of drifting on stale coefficients.
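
The feedback step in the last item can be sketched with a plain exponential moving average. The function names, the dictionary keys, and the default smoothing factor of 0.2 are assumptions for illustration; a real pipeline would choose alpha from how quickly its workload drifts.

```python
from typing import Dict


def ema_update(previous: float, observed: float, alpha: float = 0.2) -> float:
    """Blend one observation into the running estimate; alpha must be in (0, 1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return alpha * observed + (1.0 - alpha) * previous


def fold_actuals(estimates: Dict[str, float], actuals: Dict[str, float],
                 alpha: float = 0.2) -> Dict[str, float]:
    """Return new estimates with each observed dimension blended in.

    Dimensions without an observation pass through unchanged.
    """
    return {
        key: ema_update(value, actuals[key], alpha) if key in actuals else value
        for key, value in estimates.items()
    }
```

A higher alpha makes the estimate react faster to a change in workload at the cost of more noise from any single unusual promotion.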

FAQ

Why enforce a lock_timeout instead of just running the update in a maintenance window?

A quiet window reduces the probability of a long lock wait; lock_timeout bounds the consequence. Even at 3 a.m. a single forgotten idle-in-transaction session can hold ACCESS SHARE on a dependent table, and without lock_timeout the ALTER queues behind it and blocks all new traffic for the life of that transaction. The GUC converts that open-ended risk into a deterministic 55P03 after a known number of milliseconds, which is what makes the acquisition ceiling enforceable rather than aspirational.

Is downtime just the runtime of the ALTER EXTENSION UPDATE statement?

No — and assuming so is the most common tuning error. Client-visible stall is dominated by lock acquisition wait under live load, not the script’s own execution, and on replicas it extends through WAL apply lag until the standby catches up. The threshold model splits the SLO across acquisition, rewrite, and replica phases precisely because a candidate can fit the script runtime and still breach the SLO on acquisition or replica lag.

What is the difference between a transactional and a non-transactional extension update for rollback?

A transactional update wraps every intermediate script in one transaction: a failure rolls the whole thing back and leaves the catalog untouched, so recovery is a no-op. A non-transactional update — any script containing CREATE INDEX CONCURRENTLY, ALTER TYPE ... ADD VALUE, or VACUUM — runs outside that safety net, so a mid-flight failure can leave an INVALID index or partial change that you must clean up by hand before retrying. The window and the recovery path both differ, so the gate must know which shape it faces.

Should I raise the ceiling when a promotion keeps failing with 55P03?

Almost never. A 55P03 means threshold tuning refused to trade an SLO breach for a completed update — the gate did its job. Repeated 55P03s indicate the window is scheduled against the wrong traffic profile or a chronic blocking session, not that the ceiling is too strict. Fix the contention or move the window; loosening the ceiling just converts a clean refusal into a real, client-visible stall.

How do I make the estimate reflect production instead of static coefficients?

Route the candidate through a shadow-replica replay under synthetic production load before the gate runs, and use the observed lock wait, WAL generation, and p95 degradation as the cost coefficients. That is exactly what Async Upgrade Simulation produces, and feeding its telemetry into the exponential moving average keeps the ceilings tracking real workload drift rather than a one-time guess.
