Routing Staging Upgrades to Mirror Production Topology

Staging environments must replicate production topology at the catalog, extension, and connection-routing layers before any version promotion. Divergence in pg_extension state, shared library paths, or parameterized routing rules causes silent upgrade failures during CI/CD execution. This reference details exact synchronization patterns, upgrade routing logic, and deterministic recovery procedures for PostgreSQL extension lifecycle management.

Topology Parity & Extension State Synchronization

Begin by extracting production extension state and routing configuration. Use pg_dump --schema-only --section=pre-data combined with direct catalog queries against pg_extension and pg_available_extension_versions to generate a baseline manifest. Validate parity with:

SELECT extname, extversion, extrelocatable, extconfig
FROM pg_extension
ORDER BY extname;

Cross-reference staging against this manifest using a Python-driven diff pipeline. Any mismatch in extversion or missing pg_catalog entries indicates a broken Extension Upgrade Planning & Compatibility Validation workflow. Enforce strict shared_preload_libraries alignment across both clusters; mismatched library paths trigger FATAL: could not access file during CREATE EXTENSION or ALTER EXTENSION UPDATE. Consult the official PostgreSQL documentation on extension catalogs for authoritative catalog structure definitions.

Route all staging upgrade traffic through isolated endpoints to prevent cross-contamination with production connection pools via Test Environment Routing. Implement a pre-flight validation script that asserts identical max_connections, work_mem, and shared_buffers ratios, as parameter skew can mask extension memory allocation failures during promotion.

CI/CD Routing Logic & Upgrade Execution

Route staging upgrades through a deterministic CI/CD gate. Use pg_isready and connection pooling state checks to ensure zero active transactions before applying extension updates. Implement a transactional wrapper that captures pre-upgrade catalog snapshots:

psql -d staging_db -c "BEGIN; CREATE TABLE ext_pre_upgrade_snapshot AS SELECT * FROM pg_extension; COMMIT;"

Execute version promotion using ALTER EXTENSION <name> UPDATE TO '<target_version>';. Monitor pg_stat_activity for waiting states on AccessShareLock or ExclusiveLock. If routing logic routes traffic through a proxy (PgBouncer/HAProxy), drain connections via pg_terminate_backend() for idle sessions, then apply the update. For Python DevOps builders, wrap execution in psycopg2 with autocommit=False and explicit savepoints to guarantee atomic rollback on lock timeout:

import psycopg2
from psycopg2 import errors

conn = psycopg2.connect(dsn)
conn.autocommit = False
cur = conn.cursor()
try:
    # Bound the lock wait so a contended catalog lock raises LockNotAvailable
    # (SQLSTATE 55P03) instead of blocking indefinitely.
    cur.execute("SET lock_timeout = '30s';")
    cur.execute("ALTER EXTENSION postgis UPDATE TO '3.4.2';")
    conn.commit()
except errors.LockNotAvailable:
    conn.rollback()
    raise
finally:
    cur.close()
    conn.close()

Refer to psycopg2 transaction control documentation for precise savepoint semantics and connection lifecycle management. Ensure your CI/CD pipeline enforces a 30-second idle drain window before triggering the ALTER EXTENSION command to prevent lock escalation during concurrent schema migrations.

Diagnostics & Edge-Case Resolution

Catalog Corruption on Interrupted Upgrades: If ALTER EXTENSION UPDATE is interrupted by a network partition, OOM killer, or forced connection termination, the extension control file and pg_extension catalog may diverge. Symptoms include ERROR: extension "<name>" has no update path from version "<current>" to "<target>", orphaned SQL functions, or missing type definitions in pg_type.

Step-by-Step Resolution:

  1. Halt Routing: Immediately pause all application traffic to the affected staging database.
  2. Audit Catalog State: Query pg_extension alongside filesystem control files in $(pg_config --sharedir)/extension/. Verify version alignment.
  3. Force Catalog Repair: If the control file matches the target version but the catalog lags, you can update the catalog row directly — but this requires superuser AND SET allow_system_table_mods = on;. Editing a system catalog by hand is dangerous; a wrong value corrupts the catalog. Prefer DROP EXTENSION ... CASCADE followed by CREATE EXTENSION ... VERSION '<target_version>' in a controlled maintenance window whenever the extension supports it.
  4. Deterministic Rollback: Use the ext_pre_upgrade_snapshot table to restore exact pre-upgrade state if the extension supports downgrade paths. Otherwise, execute DROP EXTENSION <name> CASCADE followed by CREATE EXTENSION <name> WITH VERSION '<target_version>' in a single maintenance window.
  5. Post-Upgrade Validation: Run SELECT * FROM pg_available_extension_versions WHERE name = '<name>'; to confirm routing rules recognize the new version. Re-execute extension-specific diagnostic queries (e.g., SELECT PostGIS_Version(); or SELECT pg_stat_statements_reset();) to verify functional parity.

Implement automated health checks that compare pg_extension.extversion against the expected CI/CD manifest. Any deviation should trigger an immediate pipeline halt and route to a deterministic recovery playbook, ensuring staging remains a faithful production mirror for subsequent promotion cycles.