Upload Error Recovery Patterns

Networks fail mid-transfer, tokens expire, and phones walk out of coverage — so the difference between a robust uploader and a fragile one is entirely in how it handles failure. The goal is to turn every recoverable error into a checkpoint rather than a restart: classify the failure, back off with jitter, retry the same chunk idempotently against the same offset, and pause cleanly when the device goes offline. This topic sits under Frontend UX, Chunking & Progress Tracking and supplies the transitions that drive the retrying state in resumable upload state machines.

Prerequisites

[ ] Node 20+ with a modern ESM bundler
[ ] TypeScript 5.x, strict enabled
[ ] An idempotent chunk endpoint keyed by byte offset or part number
[ ] A durable offset store (IndexedDB) so retries resume from a checkpoint
[ ] Access to the browser online/offline events and navigator.onLine

How error recovery works

Not all errors deserve a retry. A 400, 401, 403, 404, or 422 is the server telling you the request itself is wrong — retrying just repeats the mistake. A 500, 502, 503, 504, 408, or 429, or a raw network failure with no response, is transient — those are worth retrying. The classifier is the gate; everything downstream depends on getting it right, because retrying a fatal error wastes the budget and not retrying a transient one drops the upload.

Retries must back off and they must jitter. Fixed-interval retries from many clients synchronize into a thundering herd that re-saturates a recovering server; exponential growth with full jitter spreads them out. And because a retried chunk re-sends bytes the server may already hold, the chunk handler must be idempotent — keyed by offset or part number — so duplicate delivery overwrites instead of corrupting.

A failed chunk is classified; fatal errors terminate, retryable ones check budget and connectivity, wait a jittered backoff, and resume from the checkpoint offset.
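
The flow above can be sketched as a pure decision function. This is a minimal illustration rather than the uploader itself: `nextAction`, `FailureContext`, and the action names are hypothetical, and the caller is assumed to have already classified the error as retryable or fatal.

```typescript
// Hypothetical names throughout: a sketch of the decision flow, not a full uploader.
type RecoveryAction = "fail" | "wait-for-online" | "backoff-then-resume";

interface FailureContext {
  retryable: boolean;  // result of classifying the error
  attempt: number;     // failed attempts so far, including this one
  maxAttempts: number; // total attempt budget
  online: boolean;     // e.g. navigator.onLine in a browser
}

function nextAction(ctx: FailureContext): RecoveryAction {
  if (!ctx.retryable) return "fail";                 // fatal error: terminate
  if (ctx.attempt >= ctx.maxAttempts) return "fail"; // budget spent
  if (!ctx.online) return "wait-for-online";         // pause instead of burning attempts
  return "backoff-then-resume";                      // jittered wait, then resend from the checkpoint
}
```

Keeping the decision free of timers and network calls makes it easy to unit test; the steps below supply the side effects.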

Step 1: Classify the error

Decide retryability from the HTTP status, or treat a missing response (a thrown network error) as retryable. Keep the fatal set explicit so an unexpected status defaults to a cautious retry rather than a silent drop.

export class UploadError extends Error {
  constructor(message: string, readonly status: number | null) {
    super(message);
    this.name = "UploadError";
  }
}

// Statuses that mean the request itself is wrong; retrying repeats the rejection.
// Offset conflicts (409, 412) need a resync, not a blind retry.
const FATAL = new Set([400, 401, 403, 404, 405, 409, 410, 412, 422]);

export function isRetryable(err: unknown): boolean {
  if (!(err instanceof UploadError)) return true; // unknown throw: assume transient
  if (err.status === null) return true;           // no response = network failure
  // Anything not explicitly fatal (5xx, 408, 429, unexpected codes) gets a cautious retry.
  return !FATAL.has(err.status);
}

Expected: isRetryable(new UploadError("x", 503)) is true; isRetryable(new UploadError("x", 403)) is false; a plain new Error("offline") is true.

Step 2: Compute exponential backoff with full jitter

Grow the wait exponentially, cap it, and randomize across the whole interval. Full jitter (uniform between 0 and the exponential bound) is the most effective at de-correlating clients.

export function backoffDelay(
  attempt: number,
  baseMs = 500,
  capMs = 30_000,
): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp); // full jitter
}

// Honor a server's Retry-After header when present (seconds or HTTP-date).
export function retryAfterMs(header: string | null): number | null {
  if (!header) return null;
  const secs = Number(header);
  if (Number.isFinite(secs)) return secs * 1000;
  const when = Date.parse(header);
  return Number.isNaN(when) ? null : Math.max(0, when - Date.now());
}

Expected: backoffDelay(0) returns 0–500 ms, backoffDelay(3) returns 0–4000 ms, and any attempt is capped at 30,000 ms; retryAfterMs("2") returns 2000.
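
One way the two helpers might be combined is to prefer the server's hint when there is one and fall back to jittered backoff otherwise. The sketch below is self-contained, so it restates both formulas under different names; `nextDelayMs`, `jitteredBackoff`, `parseRetryAfter`, and the injectable `random` and `now` parameters are hypothetical and exist mainly to make the logic testable.

```typescript
// Sketch only: these names are illustrative, not part of the code above.

/** Full-jitter backoff, same formula as backoffDelay; `random` is injectable for tests. */
function jitteredBackoff(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number,
): number {
  const bound = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * bound);
}

/** Parse Retry-After as delta-seconds or an HTTP-date; null when absent or unparseable. */
function parseRetryAfter(header: string | null, now: number = Date.now()): number | null {
  if (header === null || header.trim() === "") return null;
  const secs = Number(header);
  if (Number.isFinite(secs)) return Math.max(0, secs * 1000);
  const when = Date.parse(header);
  return Number.isNaN(when) ? null : Math.max(0, when - now);
}

/** Prefer the server's Retry-After (capped at capMs); otherwise use jittered backoff. */
function nextDelayMs(
  attempt: number,
  retryAfterHeader: string | null,
  baseMs = 500,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const hinted = parseRetryAfter(retryAfterHeader);
  if (hinted !== null) return Math.min(hinted, capMs);
  return jitteredBackoff(attempt, baseMs, capMs, random);
}
```

Capping the server hint at `capMs` is a design choice of this sketch. A client that must strictly honor `Retry-After` would instead wait the full hinted time or surface the delay to the user.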

Step 3: Gate retries on connectivity

Retrying while the device is offline burns the attempt budget against a dead link. Wait for the online event instead, so backoff measures real connectivity, not wall-clock time spent in a tunnel.

export function waitUntilOnline(): Promise<void> {
  if (navigator.onLine) return Promise.resolve();
  return new Promise((resolve) => {
    const handler = () => {
      window.removeEventListener("online", handler);
      resolve();
    };
    window.addEventListener("online", handler);
  });
}

Expected: while offline, the promise stays pending; the moment the browser fires online, it resolves and the loop continues — no attempts are consumed in between.

Step 4: Make each chunk retry idempotent

A retried chunk must address the same bytes so re-delivery is a no-op on the server. Key the request by absolute byte offset (or part number) and let the server overwrite that range rather than append.

import { UploadError } from "./errors.js";

export async function putChunkAt(
  endpoint: string,
  uploadId: string,
  blob: Blob,
  offset: number,
): Promise<void> {
  const res = await fetch(`${endpoint}/${uploadId}`, {
    method: "PATCH",
    headers: {
      // tus-style: the server commits these bytes at exactly this offset.
      "Upload-Offset": String(offset),
      "Content-Type": "application/offset+octet-stream",
    },
    body: blob,
  }).catch(() => {
    throw new UploadError("network error", null);
  });

  if (res.status === 409) {
    // Offset conflict: our checkpoint is stale, caller must re-handshake.
    throw new UploadError("offset conflict", 409);
  }
  if (!res.ok) {
    // Carry the status so the classifier can decide whether to retry.
    throw new UploadError(`chunk failed with status ${res.status}`, res.status);
  }
}

Expected: re-sending the same offset after a transient failure produces the same committed object; a 409 signals a stale checkpoint that must be reconciled via the resume handshake.
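
To see why offset-keyed writes make re-delivery harmless, the following sketch contrasts an overwrite-at-offset store with an append-only one. Both classes are hypothetical in-memory stand-ins for a server's storage layer, not the backend this article assumes.

```typescript
// Hypothetical in-memory stand-ins for server-side storage.

/** Writes land at an absolute offset, so repeating a write changes nothing. */
class RangeStore {
  private readonly bytes: Uint8Array;

  constructor(totalSize: number) {
    this.bytes = new Uint8Array(totalSize);
  }

  writeAt(offset: number, chunk: Uint8Array): void {
    if (offset < 0 || offset + chunk.length > this.bytes.length) {
      throw new RangeError("chunk falls outside the upload bounds");
    }
    this.bytes.set(chunk, offset); // overwrite the same range on every delivery
  }

  snapshot(): Uint8Array {
    return this.bytes.slice();
  }
}

/** Append-only contrast: a duplicate delivery grows the object. */
class AppendLog {
  private readonly parts: number[] = [];

  append(chunk: Uint8Array): void {
    this.parts.push(...chunk);
  }

  size(): number {
    return this.parts.length;
  }
}
```

Delivering the same chunk twice to `RangeStore` leaves the stored bytes unchanged, while `AppendLog` ends up with the data duplicated, which is the corruption an idempotent endpoint is meant to prevent.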

Step 5: Assemble the retry loop with checkpoint resume

Combine classification, backoff, connectivity gating, and idempotent delivery. On a 409, re-read the authoritative offset before continuing so the loop never fights a stale checkpoint.

import { isRetryable, UploadError } from "./errors.js";
import { backoffDelay } from "./backoff.js";
import { waitUntilOnline } from "./connectivity.js";

export async function uploadWithRecovery(
  send: (offset: number) => Promise<void>,
  resync: () => Promise<number>,
  startOffset: number,
  maxAttempts = 6,
): Promise<void> {
  let offset = startOffset;
  let attempt = 0;
  for (;;) {
    try {
      await send(offset);
      return;
    } catch (err) {
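      // 409/412: the checkpoint is stale, so adopt the server's offset and resend.
      if (err instanceof UploadError && (err.status === 409 || err.status === 412)) {
        offset = await resync();       // re-handshake for the authoritative offset
        continue;                      // a conflict does not cost an attempt
      }
      attempt += 1;                    // failed attempts so far
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      await waitUntilOnline();
      // attempt - 1 so the first retry waits within baseMs, matching backoffDelay(0).
      await new Promise((r) => setTimeout(r, backoffDelay(attempt - 1)));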
    }
  }
}

Expected: a single 503 triggers one jittered wait then a successful resend; six straight failures throw after the budget is spent; a 403 throws immediately without retrying.
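
To exercise the loop without a real server, a scripted test double can stand in for `send`. The helper below is a hypothetical sketch: `scriptedSend` and `Outcome` are illustration-only names, and the error factory is passed in so a real test can build `UploadError` instances that the loop recognizes.

```typescript
// Hypothetical test double: fails with the scripted statuses in order, then succeeds.
type Outcome = number | null | "ok"; // HTTP status, null = network failure, "ok" = success

function scriptedSend(
  outcomes: Outcome[],
  makeError: (status: number | null) => Error,
) {
  const offsets: number[] = []; // every offset the loop tried, in order
  let call = 0;
  const send = async (offset: number): Promise<void> => {
    offsets.push(offset);
    const outcome: Outcome = call < outcomes.length ? outcomes[call] : "ok";
    call += 1;
    if (outcome !== "ok") throw makeError(outcome);
  };
  return { send, offsets };
}

// Usage with the loop above (assumes UploadError and uploadWithRecovery are imported):
// const { send, offsets } = scriptedSend([503, "ok"], (s) => new UploadError("scripted", s));
// await uploadWithRecovery(send, async () => 0, 0);
// offsets then holds the same offset twice: one failed send and one successful resend.
```

Outside a browser, `waitUntilOnline` needs `navigator` and `window` to exist, so a Node-based test would have to stub them or run under a DOM-like environment.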

Configuration reference

| Option | Type | Default | Effect |
| --- | --- | --- | --- |
| `maxAttempts` | number | `6` | Total send attempts (the first try plus retries) before the upload transitions to `failed` |
| `baseMs` | number | `500` | Base backoff window; doubles on each attempt until it reaches `capMs` |
| `capMs` | number | `30_000` | Maximum single backoff wait, in milliseconds |
| `jitter` | `"full"` \| `"none"` | `"full"` | Randomization strategy across the backoff window |
| `honorRetryAfter` | boolean | `true` | Use the server’s `Retry-After` over computed backoff |

Edge cases & gotchas

Retry storms after a server recovers

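If every client retries on the same schedule, a recovering server is immediately re-flooded. Full jitter is the fix: avoid fixed intervals for a fleet of uploaders hitting one endpoint, and prefer full jitter over equal jitter, which keeps half of every wait deterministic and so spreads clients over a narrower window.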

Stale checkpoint causing 409 / 412 loops

When the persisted offset disagrees with the server, blind retries loop forever. Treat 409/412 as a signal to re-run the resume handshake and adopt the server’s offset before resending, exactly as the loop in resumable upload state machines does.
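
A resume handshake could look like the sketch below. It assumes a tus-style server that answers a `HEAD` request with its committed offset in an `Upload-Offset` header; `fetchServerOffset`, `parseUploadOffset`, and the `HeadFetch` type are hypothetical names, and the fetch function is injected so the logic can be tested without a network.

```typescript
// Assumption: HEAD on the upload URL returns the committed offset in `Upload-Offset`.
interface HeadResponse {
  ok: boolean;
  status: number;
  headers: { get(name: string): string | null };
}
type HeadFetch = (url: string, init: { method: "HEAD" }) => Promise<HeadResponse>;

/** Accept only a non-negative integer; a missing or malformed header is an error. */
function parseUploadOffset(value: string | null): number {
  const offset = value === null || value.trim() === "" ? NaN : Number(value);
  if (!Number.isInteger(offset) || offset < 0) {
    throw new Error(`invalid Upload-Offset header: ${String(value)}`);
  }
  return offset;
}

/** Ask the server where the upload really stands. */
async function fetchServerOffset(doFetch: HeadFetch, url: string): Promise<number> {
  const res = await doFetch(url, { method: "HEAD" });
  if (!res.ok) throw new Error(`offset lookup failed with status ${res.status}`);
  return parseUploadOffset(res.headers.get("Upload-Offset"));
}
```

In a browser the global `fetch` can be passed as `doFetch`, and the result can serve as the `resync` callback in Step 5. Persisting the returned offset to the checkpoint store before resending keeps the stale value from coming back after a reload.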

Token expiry mid-upload

A presigned URL that expires returns 403, which the classifier marks fatal — correctly, because retrying the dead URL is pointless. Catch 403 one level up, request a fresh token from the backend, and restart the chunk with the new credentials rather than counting it against the retry budget.
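
The "one level up" handling might be sketched as a wrapper around the whole retry loop. Everything here is hypothetical: `withTokenRefresh`, `refreshToken`, and `maxRefreshes` are illustration-only names, and the wrapper only assumes that the thrown error carries a numeric `status`, as `UploadError` does.

```typescript
// Hypothetical wrapper: refresh credentials on 403 instead of spending the retry budget.
function statusOf(err: unknown): number | null | undefined {
  return typeof err === "object" && err !== null && "status" in err
    ? (err as { status: number | null }).status
    : undefined;
}

async function withTokenRefresh<T>(
  run: (token: string) => Promise<T>,   // e.g. runs the retry loop with this token
  initialToken: string,
  refreshToken: () => Promise<string>,  // asks the backend for fresh credentials
  maxRefreshes = 1,
): Promise<T> {
  let token = initialToken;
  for (let refreshes = 0; ; refreshes += 1) {
    try {
      return await run(token);
    } catch (err) {
      // Anything other than 403, or a 403 after the refresh limit, is passed through.
      if (statusOf(err) !== 403 || refreshes >= maxRefreshes) throw err;
      token = await refreshToken();
    }
  }
}
```

Because each call to `run` starts a fresh retry loop, the 403 never counts against `maxAttempts`. The `maxRefreshes` limit keeps a genuinely forbidden request from refreshing forever.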

navigator.onLine false positives

navigator.onLine only knows the OS believes a network exists, not that the internet is reachable. Use it to pause aggressively, but still rely on real request failures and backoff to handle captive portals and dead-but-connected links.
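
Where a stronger signal than `navigator.onLine` is wanted, one optional approach is a small reachability probe against your own backend. The sketch below is hypothetical: `isReachable`, the `ProbeFetch` type, and the idea of a dedicated probe URL are assumptions, and the fetch function is injected so the timeout logic can be tested in isolation.

```typescript
// Hypothetical probe: true only if a HEAD request succeeds within the timeout.
type ProbeFetch = (
  url: string,
  init: { method: "HEAD"; cache: "no-store" },
) => Promise<{ ok: boolean }>;

async function isReachable(
  doFetch: ProbeFetch,
  url: string,
  timeoutMs = 3000,
): Promise<boolean> {
  let cancel: () => void = () => {};
  const timeout = new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    cancel = () => clearTimeout(timer);
  });
  try {
    // A rejected fetch (DNS failure, TLS interception, dead link) counts as unreachable.
    const probe = doFetch(url, { method: "HEAD", cache: "no-store" }).then(
      (res) => res.ok,
      () => false,
    );
    return await Promise.race([probe, timeout]);
  } finally {
    cancel(); // do not leave the timer running after the race settles
  }
}
```

The timeout here only stops waiting; it does not abort the underlying request. A probe like this complements the backoff loop rather than replacing it, since a link can still die between the probe and the next chunk.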

Verification

Simulate a transient server error and confirm the client recovers, then assert the classifier in isolation.

# Force a 503 once, then a 204, and watch the client retry and succeed.
curl -i -X PATCH https://api.example.com/uploads/abc123 \
  -H 'Upload-Offset: 5242880' \
  -H 'Content-Type: application/offset+octet-stream' \
  --data-binary @chunk-01.bin
# HTTP/1.1 503 Service Unavailable  (first hit)
# HTTP/1.1 204 No Content           (retry succeeds)

import { isRetryable, UploadError } from "./errors.js";

console.assert(isRetryable(new UploadError("x", 503)) === true, "5xx retryable");
console.assert(isRetryable(new UploadError("x", 422)) === false, "422 fatal");
console.assert(isRetryable(new Error("dropped")) === true, "network err retryable");

FAQ

Which HTTP status codes should I retry?

Retry 500, 502, 503, 504, 408, and 429, plus any failure with no response at all (a thrown network error). Do not retry 400, 401, 403, 404, or 422 — those mean the request is wrong, so retrying repeats the same rejection. Keep the fatal list explicit so unknown codes default to a cautious retry.

Why add jitter instead of plain exponential backoff?

Without jitter, many clients that failed together retry together, re-saturating a recovering server in synchronized waves. Full jitter — a random wait between zero and the exponential bound — spreads retries out, which both protects the server and improves each client’s odds of getting through.

How do retries avoid duplicating data?

Key every chunk by its absolute byte offset or part number so a retried request overwrites the same range rather than appending. Re-sending an already-committed chunk then becomes a harmless no-op, which is what makes the retry loop safe to run aggressively.

What happens when the user goes offline mid-upload?

Gate the retry loop on the browser’s online event so it pauses instead of burning attempts against a dead link, and keep the committed offset persisted. When connectivity returns, re-run the resume handshake and continue from the checkpoint. The same mechanism lets real-time upload progress events hold the bar steady through the gap.
