NEWSLayers closes first external funding round led by LOI VentureRead more
‹ All Articles

Using Multiple PgBouncers to Isolate Application and Worker Connections

Jake CastoJake Casto6 min read

Originally posted on Medium.

Read the original

On February 27, we had a rough afternoon. Multiple waves of automated traffic hit our platform while several merchants were already in busy windows. We’d been mitigating it since mid-morning, but at 1:51 PM, the combined bot load and real customer traffic caused a 13-minute service interruption that affected a few tenants; our first broad service interruption in over two years.

During the postmortem, we traced a single chain of failure. First, a surge in background job traffic triggered a rush of worker activity. Next, the shared connection pool between our web app and workers became fully saturated by the workers. As a result, app requests were blocked, queued, and ultimately timed out for customers. The database itself handled the load, but the shared pool was the point of contention.

Lesson learned: separate your pools. That’s the single biggest takeaway from this incident. The fix was giving each consumer their own pool. Here’s how we did it.

Our setup

We run Layers on PlanetScale Postgres on an M-640 metal cluster in AWS us-east-1, with six replicas. It’s a high-throughput setup built for the query volumes that come with serving multiple large Shopify storefronts. That translates to hundreds of thousands of search queries, bulk-collection membership writes, and concurrent materialized-view refreshes.

Previously, both the app and workers connected through PlanetScale’s local PgBouncer on port 6432 as a single shared pool. At the time, we set the pool size to 300 max client connections, which was more than enough for routine activity and left a healthy buffer for spikes on busy days. In our setup, a single worker process maintains a client connection; each worker can process multiple jobs or requests, allowing us to comfortably scale to 295 workers (reserving a few connections for monitoring). This setup worked under normal conditions but became a bottleneck under intense pressure.

The problematic prior architecture, in which both the web app and queue workers share a single PgBouncer pool pinned to Replica 1, leaves five other replicas unused. Writes route traffic through a separate local PgBouncer to the primary, while all read traffic, regardless of source, is routed through a single node.

Why a shared pool breaks under load

PlanetScale runs PgBouncer in transaction pooling mode, which keeps a fixed number of server-side connections open to Postgres and multiplexes client connections across them. It’s efficient and scales well for OLTP workloads. But when one consumer saturates the pool, everything else waits.

Web requests are short and transactional, allowing connections to be released quickly. Queue workers hold connections longer; tasks like materialized view refreshes take seconds, and catalog sync jobs fire large writes in sequence. Workers also spike during high-traffic periods.

When workers surged on February 27, they filled the pool. The database itself was healthy the entire time. The bottleneck was entirely at the pooling layer.

A sequence diagram of the failure mode: queue workers exhaust the shared PgBouncer pool with catalog sync, facet refresh, and embedding jobs; when bot traffic triggers a surge in web app connections, requests queue and time out, surfacing 500 errors to customers at 1:51 PM, while the underlying database remained healthy throughout.

The fix: one bouncer per consumer

PlanetScale lets you provision dedicated PgBouncer instances separately from the local default. Each one has its own isolated pool of server-side connections, its own name, and its own credentials. You can create as many as you need. While alternative mitigations like statement queuing or increasing the pool size were options, we chose multiple PgBouncers because they offered true isolation and avoided potential side effects like increased queuing delays or contention within a single shared pool. This approach gave us clearer operational boundaries and more predictable behavior during surges.

We created two:

  • app-bouncer for all web application traffic
  • worker-bouncer for all queue worker traffic

Both target our replica fleet. The connection format is straightforward: append the bouncer name to your database username (postgres.{branchId}|app-bouncer) and connect on port 6432. PlanetScale handles the rest.

The resolved setup, showing the web app and queue workers routing through separate dedicated PgBouncer pools (app-bouncer and worker-bouncer) on port 6432, with reads distributed across six replicas and writes going directly to the primary on port 5432.

With this in place, a worker surge saturating the worker-bouncer has no effect on the app-bouncer. The pools are separate. Workers and app traffic can spike independently without interfering with each other.

If you want to validate this isolation in your own environment, try running a simple load script or chaos test: simulate a heavy worker spike while tracking latency and error rates on app requests. This lets you verify that worker activity does not impact the web pool under stress, and gives your team confidence in the setup before peak traffic hits.

Because both bouncers target the full replica fleet, read queries are distributed across all six replica nodes rather than being pinned to a single one. This is a nice secondary benefit at our query volumes.

Before and after:

Previously, both the web app and queue workers shared a single local PgBouncer on port 6432 for reads, with writes going through the same pool to the primary. Now, the web app routes read through app-bouncer and workers through worker-bouncer; each is a dedicated replica pool on port 6432. Writes from workers and all migrations continue to use a direct connection on port 5432 to the primary, bypassing PgBouncer entirely, as they always should.

The failover behavior

Dedicated PgBouncers handle failover events more reliably than the local default. After February 27, this proved especially important to us.

With the local PgBouncer, a failover drops all client connections. Your application must detect these disconnects and attempt to reconnect, which can increase outage time during an incident.

With dedicated PgBouncers, failovers are handled using PgBouncer’s PAUSE command. During a failover, incoming transactions are queued inside the dedicated bouncer while the database transitions. Once Postgres comes back online, transactions automatically resume, eliminating dropped client connections and minimizing service disruption. From the application’s view, there’s a brief increase in latency but no connection loss.

For a cluster with six replicas where maintenance events aren’t rare, that behavior matters.

In short

If your app and queue workers share a connection pool, you have a coupling that works in normal conditions but becomes a liability under load. That’s a bad time to discover it. Take a moment today to audit your own pool setup before a bottleneck catches you by surprise.

Dedicated PgBouncers on PlanetScale provide isolated pools for each consumer. This lets them operate independently under stress, preventing one from impacting the other.

I wish we’d done this before February 27.

Jake Casto

Jake Casto · Founder, Layers

Jake Casto is the founder of Layers, the enterprise search and merchandising platform built for Shopify Plus. He previously co-founded Proton, a Shopify Plus engineering studio that shipped more than 400 storefronts, where Layers began as an internal tool for a problem that kept repeating. He writes about search infrastructure, performance, and the engineering behind discovery at scale.

Connect on LinkedIn