Detecting Observer Leaks in Meteor Publications

SkySignal Team | April 21, 2026 | 13 min read

Your Meteor server is leaking memory. Slowly. Every night at 3am it gets restarted by a cron job your team added years ago "just in case," and in the morning everything is fine again. The APM dashboards look clean. Method response times are normal. So what's climbing?

Most of the time it's observers. And most APMs can't see them.

What an observer actually is

When a client subscribes to a publication, Meteor's server does this internally:

1. It runs your publication function with the given arguments.
2. Your function typically returns a cursor (or an array of cursors).
3. Meteor calls await cursor.observeChanges(callbacks, { nonMutatingCallbacks: true }) on each returned cursor. You can see this in the Meteor source in _publishCursor.
4. The returned handle is an observer. It's a long-lived object that watches MongoDB (via change streams, the oplog, or polling) and pushes diffs to the subscribing client.
5. When the client unsubscribes or disconnects, Meteor calls handle.stop() to tear the observer down.
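The lifecycle above can be modelled in a few lines of plain JavaScript. This is a simplified sketch, not Meteor's actual implementation: FakeCursor, runSubscription, and the sub object are hypothetical stand-ins for the real cursor, _publishCursor, and the subscription context.

```javascript
// Simplified model of the subscribe / observe / stop lifecycle.
// FakeCursor and runSubscription are hypothetical stand-ins, not Meteor APIs.
class FakeCursor {
  constructor() {
    this.activeObservers = 0;
  }

  // Mirrors the Meteor 3 shape: observeChanges resolves to a handle.
  async observeChanges(callbacks) {
    this.activeObservers += 1;
    let stopped = false;
    return {
      stop: () => {
        if (stopped) return; // stopping twice is harmless in this model
        stopped = true;
        this.activeObservers -= 1;
      },
    };
  }
}

async function runSubscription(publishFn, args) {
  const stopCallbacks = [];
  const sub = { onStop: (fn) => stopCallbacks.push(fn) };

  // Run the publication function and normalise the result to a list of cursors.
  const result = publishFn.apply(sub, args);
  const cursors = result == null ? [] : [].concat(result);

  // Observe each returned cursor and register its teardown.
  for (const cursor of cursors) {
    const handle = await cursor.observeChanges({});
    sub.onStop(() => handle.stop());
  }

  // Invoked when the client unsubscribes or disconnects.
  return { stop: () => stopCallbacks.forEach((fn) => fn()) };
}
```

In this model, teardown only covers observers that are registered through sub.onStop. A publication that starts its own observer and returns nothing never reaches that registration, which is the gap a leak slips through.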

That last step is the leaky one. If a code path creates an observer but doesn't guarantee cleanup, the observer keeps running forever. Typical culprits are a publication that manually calls cursor.observeChanges inside a handler without wiring up this.onStop(), or a library that opens a server-side subscription over a DDP connection and never stops it. The only way out is a process restart.

Why it's invisible to standard APMs

Standard APMs are built around method calls and HTTP requests. They instrument the entry points of your app — Meteor.methods, WebApp handlers, background jobs — and report latency and error rates. All of those are transient: a request comes in, something happens, a response goes out.

Observers are the opposite. A single observer can live for days, firing callbacks on every MongoDB change, pushing DDP messages to a client that may have disconnected hours ago. Standard APMs have no slot in their data model for "things that live forever and fire events in the background."

Symptoms you're leaking

Server RSS memory climbs linearly with uptime on a roughly stable user count.
MongoDB load — especially change-stream or oplog tailer CPU — scales with server uptime rather than concurrent users.
A nightly or weekly restart "fixes" perf problems that nobody can reproduce in staging.
Observer count keeps climbing in your dashboard even when your active DDP connection count is stable or falling.
Deploying fresh containers in a multi-container setup improves performance more than the user count would predict, because each new container starts with zero leaked observers.

Naive detection: age thresholds

The obvious first idea: flag any observer older than N hours. Set N to, say, 6 hours. Alert on anything older.

This catches real leaks. It also fires constantly on legitimate long-lived observers:

An admin dashboard left open in a Chrome tab overnight — the subscription is perfectly valid, the observer is happy, but it's 14 hours old by morning.
Meteor autopublish (if it's still enabled in a dev/staging env) — one observer per client, lives as long as the connection.
Internal service-to-service DDP connections — some teams have long-running admin clients that stay subscribed for weeks.
Queue-worker subscriptions — a background worker subscribes to a "pending jobs" publication and stays subscribed.

You end up with an alert channel that everyone mutes by week two. Bad signal-to-noise ratio kills the whole detection system.
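To make the false-positive problem concrete, here is a minimal sketch of the age-only detector. The observer record shape (id, createdAt, connectionAlive) is an assumption for illustration, not SkySignal's data model.

```javascript
// Naive detector: flags by age alone. The record shape is hypothetical.
const HOUR_MS = 60 * 60 * 1000;

function flagByAge(observers, now, maxAgeHours = 6) {
  return observers.filter((o) => now - o.createdAt > maxAgeHours * HOUR_MS);
}

const now = Date.UTC(2026, 3, 21, 9, 0, 0);
const observers = [
  // Admin dashboard left open overnight: old, but the client is still connected.
  { id: 'admin-dashboard', createdAt: now - 14 * HOUR_MS, connectionAlive: true },
  // Genuine leak: just as old, and the client is gone.
  { id: 'orphan', createdAt: now - 14 * HOUR_MS, connectionAlive: false },
  { id: 'fresh', createdAt: now - 1 * HOUR_MS, connectionAlive: true },
];

// Age cannot tell the first two apart, so both get flagged.
const flagged = flagByAge(observers, now).map((o) => o.id);
console.log(flagged);
```

The detector never looks at connectionAlive, so the healthy dashboard and the real leak produce identical alerts.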

A better approach: 7-signal confidence scoring

A real leak has multiple fingerprints. A legitimate long-lived observer typically only has one or two of them. SkySignal's ObserverLeakDetectionService scores each observer against seven independent signals and sums the weights into a confidence score between 0 and 100.

Lifespan (20 points)

How long has this observer been alive? Longer is more suspicious, but only in combination with other signals. A 48-hour observer with no activity and no DDP liveness is extremely suspicious. A 48-hour observer attached to a live admin dashboard that's actively emitting updates is not.

Updates per minute (15 points)

How often is this observer actually firing its callbacks? An observer that has been alive for 12 hours and has fired 400,000 callbacks is averaging roughly 555 callbacks per minute, which is a lot of work done on someone's behalf. If that someone is a disconnected client, that's waste.

Document count (10 points)

How many documents is this observer tracking? Observers that grow to track tens of thousands of documents are often the result of under-scoped queries, such as a find({}) that pulls in the whole collection because a filter argument arrived undefined.

Collection write frequency (15 points)

How often does the underlying collection get written to? A stale observer on a hot collection is more expensive than a stale observer on a write-once config collection. This signal captures the cost of letting the leak run, not just the likelihood that it is one.

DDP liveness (15 points)

Is the DDP connection that started this observer still alive? If you can trace the observer back to a session via connectionId and the session is gone, that's the single highest-confidence leak signal you can get. The observer exists, the subscriber doesn't.

Publication context (10 points)

Can we trace this observer back to a specific Meteor.publish call? Observers with attached publication context (via AsyncLocalStorage) are easier to reason about — we know what created them, what the pub name is, what the session is. Observers without publication context are often the truly orphaned ones — they came from some manual cursor.observeChanges call deep in library code that nobody remembers wiring up.

Growth rate (15 points)

Is the observer's document count climbing unboundedly? A subscription to "my notifications" should converge to a steady-state count. If it keeps growing, something is wrong — either the query is too broad, or client-side cleanup is missing.
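The weighted sum can be sketched as follows. The weights are the point values listed above. Everything else is an assumption: the function name, and the idea that each signal is first normalised to a strength between 0 and 1, are illustrative rather than a description of how ObserverLeakDetectionService works internally.

```javascript
// Weights are the point values from the seven signals; they sum to 100.
const WEIGHTS = {
  lifespan: 20,
  updatesPerMinute: 15,
  documentCount: 10,
  collectionWriteFrequency: 15,
  ddpLiveness: 15,
  publicationContext: 10,
  growthRate: 15,
};

// signals maps each name to a strength in [0, 1] (an assumed normalisation).
// 1 means "looks fully leak-like", e.g. ddpLiveness: 1 means the session is gone.
function confidenceScore(signals) {
  let score = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    const strength = Math.min(1, Math.max(0, signals[name] ?? 0));
    score += weight * strength;
  }
  return Math.round(score);
}

// Long-lived admin dashboard: old and lightly active, nothing else suspicious.
const dashboard = confidenceScore({ lifespan: 1, updatesPerMinute: 0.2 });

// Old, busy, dead session, hot collection, growing, no publication context.
const orphan = confidenceScore({
  lifespan: 1,
  updatesPerMinute: 1,
  collectionWriteFrequency: 1,
  ddpLiveness: 1,
  publicationContext: 1,
  growthRate: 1,
});
```

Clamping each strength keeps a single runaway metric from contributing more than its weight, so a high score has to come from several signals agreeing.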

How the score maps to severity

score >= 80  => critical
score 60-79  => warning
score 40-59  => info
score < 40   => suppressed

No single signal can reach 40 on its own, since the largest weight is 20 points. A leak that's old, busy, attached to a dead DDP session, on a hot collection, with unbounded growth, and no pub context? That lights up six of the seven signals and scores in the 90s. That's the one you actually want to see in your alerts.
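The thresholds above translate directly into a small lookup. The function name severityFor is made up for this sketch.

```javascript
// Maps a 0-100 confidence score to the severity bands listed above.
function severityFor(score) {
  if (score >= 80) return 'critical';
  if (score >= 60) return 'warning';
  if (score >= 40) return 'info';
  return 'suppressed';
}
```

Checking from the highest band down means each comparison only needs a lower bound.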

Handling autopublish and legit long-lived subs

Autopublish is the classic false positive. When it's on, every collection auto-publishes to every connected client, creating one observer per client per collection. These observers can live for hours and look suspicious on every axis except the one that matters: they correspond to a real, still-connected client.

The heuristic: if the publication name is null (a universal publication, which is how autopublish publishes its cursors), the lifespan is long, and there is some activity, we mark the observer as isAutoPublish: true and apply a 3× multiplier to the lifespan threshold (72 hours instead of 24). This suppresses the noise without hiding genuinely stuck autopublish observers where the client really did disappear.
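A minimal sketch of that heuristic, under stated assumptions: the field names (publicationName, ageHours, updatesPerMinute) are hypothetical, "long lifespan" is read as "at least the base 24-hour threshold", and "some activity" as any non-zero update rate.

```javascript
const BASE_LIFESPAN_THRESHOLD_HOURS = 24;
const AUTOPUBLISH_MULTIPLIER = 3;

// Returns the autopublish flag and the lifespan threshold to apply.
function lifespanThresholdHours(observer) {
  const isAutoPublish =
    observer.publicationName === null &&
    observer.ageHours >= BASE_LIFESPAN_THRESHOLD_HOURS && // assumed meaning of "long"
    observer.updatesPerMinute > 0; // assumed meaning of "some activity"

  return {
    isAutoPublish,
    thresholdHours: isAutoPublish
      ? BASE_LIFESPAN_THRESHOLD_HOURS * AUTOPUBLISH_MULTIPLIER
      : BASE_LIFESPAN_THRESHOLD_HOURS,
  };
}
```

A null-named observer with no activity at all keeps the stricter 24-hour threshold, which is what lets a genuinely stuck autopublish observer still surface.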

For explicit long-running subs — admin dashboards, worker pipelines — the DDP liveness signal does the work: as long as the connection is still exchanging heartbeats, the observer's score stays low regardless of its age.

Per-connection DDP liveness

Tracking liveness at the DDP level requires knowing which connection started the observer. SkySignal's agent attaches a connectionId to each observer record at creation time, read from the publication's this.connection handle. When the DDP session ends — client disconnect, server-side kick, network failure — we mark all observers with that connectionId as having lost liveness. Any that persist past that point are almost certainly leaks, not long-lived subs.
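The bookkeeping described here can be sketched with a map from connectionId to observer records. LivenessTracker and its method names are hypothetical, and this is not the agent's actual code. In a Meteor server, one plausible place to call connectionClosed is a connection.onClose callback registered via Meteor.onConnection, though that wiring is an assumption too.

```javascript
// Tracks which observers belong to which DDP connection.
class LivenessTracker {
  constructor() {
    this.byConnection = new Map(); // connectionId -> Set of observer records
  }

  // Call when an observer is created inside a publication.
  trackObserver(observer) {
    const set = this.byConnection.get(observer.connectionId) ?? new Set();
    set.add(observer);
    this.byConnection.set(observer.connectionId, set);
  }

  // Call when the observer's handle is stopped normally.
  untrackObserver(observer) {
    this.byConnection.get(observer.connectionId)?.delete(observer);
  }

  // Call when the DDP session ends. Anything still tracked has outlived
  // its subscriber, so it is marked and returned for scoring.
  connectionClosed(connectionId) {
    const survivors = [...(this.byConnection.get(connectionId) ?? [])];
    for (const observer of survivors) observer.lostLiveness = true;
    return survivors;
  }
}
```

Observers that were stopped cleanly are already untracked by the time the connection closes, so only the leftovers lose liveness.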

Fixing a leak once you've found it

SkySignal clusters leaking observers into groups: observers that share the same collection, publication name, and query fingerprint. That's the unit of fixing. You almost never have one leaky observer; you have one leaky code path that's produced forty observers so far.
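Grouping by those three fields might look like the sketch below. The fingerprint here is an assumption (a key-order-independent serialisation of the selector), and the record fields collection, publicationName, and selector are hypothetical names.

```javascript
// Stable fingerprint: serialises with sorted keys, so {a, b} and {b, a} match.
function queryFingerprint(value) {
  if (Array.isArray(value)) {
    return `[${value.map(queryFingerprint).join(',')}]`;
  }
  if (value && typeof value === 'object') {
    const parts = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${queryFingerprint(value[key])}`);
    return `{${parts.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Buckets observers that share collection, publication name, and fingerprint.
function groupLeaks(observers) {
  const groups = new Map();
  for (const o of observers) {
    const key = [o.collection, o.publicationName, queryFingerprint(o.selector)].join('|');
    const group = groups.get(key) ?? { key, observers: [] };
    group.observers.push(o);
    groups.set(key, group);
  }
  return [...groups.values()];
}
```

Sorting keys before serialising means two observers created by the same code path land in the same group even if their selector objects were built in a different order.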

The fix patterns:

Missing `this.onStop`

// broken — manually observing without cleanup
Meteor.publish('dashboard', function () {
  const handle = Dashboards.find().observeChanges({
    added: (id, fields) => this.added('dashboards', id, fields),
    changed: (id, fields) => this.changed('dashboards', id, fields),
    removed: (id) => this.removed('dashboards', id),
  });
  this.ready();
  // handle never stops!
});

// fixed — wire up onStop
Meteor.publish('dashboard', async function () {
  const handle = await Dashboards.find().observeChanges({
    added: (id, fields) => this.added('dashboards', id, fields),
    changed: (id, fields) => this.changed('dashboards', id, fields),
    removed: (id) => this.removed('dashboards', id),
  });
  this.onStop(() => handle.stop());
  this.ready();
});

The fixed version accounts for the Meteor 3 reality that observeChanges returns a Promise: it awaits the handle, then registers a stop callback. If you skip this.onStop, Meteor has no way to tear your observer down when the client unsubscribes.

Server-side subscriptions to remote DDP

If you're subscribing to a remote DDP server from your Meteor server (e.g., a polling bridge to another service), make sure the subscription handle is stored somewhere and explicitly stopped on shutdown. A common leak: a library re-initializes the connection on each startup but never stops the old handles.
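One way to make "stored somewhere and explicitly stopped" hard to get wrong is a small registry. HandleRegistry is a hypothetical helper written for this post, not part of Meteor. It works with any object that exposes stop(), which is the shape of a subscription handle.

```javascript
// Registry for handles that must be stopped explicitly.
class HandleRegistry {
  constructor() {
    this.handles = new Map(); // key -> handle with a stop() method
  }

  // Registering under an existing key stops the previous handle first,
  // so re-initialising a connection cannot strand the old subscription.
  register(key, handle) {
    this.handles.get(key)?.stop();
    this.handles.set(key, handle);
    return handle;
  }

  // Call on shutdown to stop everything that is still registered.
  stopAll() {
    for (const handle of this.handles.values()) handle.stop();
    this.handles.clear();
  }
}
```

In a server, usage might look like registry.register('jobs', remote.subscribe('pendingJobs')), where remote is the connection returned by DDP.connect, plus a shutdown hook that calls registry.stopAll(). The key 'jobs' and the publication name are placeholders.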

Cursor scope

Narrow the query. An observer tracking 50,000 documents is, on its own, a performance problem even before you ask whether it's leaked. Add a time window ({ createdAt: { $gte: since } }), a user scope, or a status filter. Reactive publications should publish the smallest cursor that answers the UI's question.
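A small guard can keep a missing argument from silently widening the query. This is a sketch under assumptions: scopedSelector is a hypothetical helper, and userId and status are illustrative field names (createdAt comes from the example above).

```javascript
// Builds a scoped selector and refuses to run with a missing filter value,
// so an undefined argument cannot quietly turn into "match everything".
function scopedSelector({ userId, since, status }) {
  if (typeof userId !== 'string' || userId.length === 0) {
    throw new Error('userId is required');
  }
  if (!(since instanceof Date) || Number.isNaN(since.getTime())) {
    throw new Error('since must be a valid Date');
  }
  const selector = { userId, createdAt: { $gte: since } };
  // status is optional; it is only added when the caller supplies it.
  if (status !== undefined) selector.status = status;
  return selector;
}
```

Failing loudly in the publication is cheaper than discovering a 50,000-document observer in production.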

`this.ready()` and error paths

If your publication throws before calling this.ready() but after starting an observer manually, make sure the observer stops. The safest pattern is to create observers inside a try/catch and stop them explicitly on error:

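Meteor.publish('risky', async function () {
  // `query` and `callbacks` stand in for your own selector and observeChanges callbacks.
  let handle;
  try {
    handle = await Items.find(query).observeChanges(callbacks);
    // Normal teardown path: the client unsubscribes or disconnects.
    this.onStop(() => handle?.stop());
    this.ready();
  } catch (err) {
    // Error path: stop the observer before the error propagates.
    handle?.stop();
    throw err;
  }
});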

Wrap-up

Observer leaks are the quietest class of performance bug in a Meteor app. They don't spike on a chart, they don't throw errors, they just slowly steal memory and MongoDB bandwidth. Catching them requires instrumentation that understands long-lived server state, not just request flow. Once you have that — and a scoring model that isn't fooled by legitimately long subs — most Meteor apps find at least one real leak on the first scan.
