From 3 Hours to 11 Seconds: Rebuilding an ERPNext Shopify Inventory Sync


April 29, 2026 · Imesha Sudasingha

We maintain an open-source Shopify connector for the ERPNext / Frappe ecosystem. Our customers use it to sync items, orders, and inventory between Shopify storefronts and ERPNext-based back offices (including our own product, NexWave). Last week we shipped the single biggest performance change the connector has ever had. This is the story.

TL;DR: The inventory sync used to make two REST calls per item with a one-second sleep between each. A store with 7,300 items burned through ~14,600 API calls and over three hours before timing out. We rewrote it to use Shopify’s inventorySetQuantities GraphQL mutation in 250-item batches, with a cached inventory_item_id on the ERPNext side and throttle-based back-pressure instead of a blanket sleep. Same store now finishes in about eleven seconds with ~30 API calls. That is roughly a 486× reduction in API calls and a 1000× reduction in wall time. The rewrite is live in version-15 and version-16 of the connector. The lesson is that we kept raising the timeout until it became obvious the timeout was never the real problem.

Shopify inventory sync: 3 hours to 11 seconds


The Connector, Briefly

nexwave_shopify_connector is an open-source Shopify integration for ERPNext and NexWave. It handles item sync, order sync, inventory sync, fulfilments, and multi-store / multi-company mappings. The repo is at github.com/one-highflyer/nexwave_shopify_connector, GPL-licensed, and runs on any Frappe v15 or v16 site.

The piece that broke was inventory sync: pushing stock-on-hand figures from ERPNext warehouses into Shopify locations so the storefront shows the right numbers.

What the Old Path Actually Did

For each item enabled for Shopify, for each mapped location, the sync did three things:

  1. Fetch the Shopify variant (1 REST call) to resolve the inventory_item_id.
  2. Call inventory_levels.set to write the quantity (1 REST call).
  3. Sleep for one second to stay under Shopify’s 2 calls/sec REST leaky bucket.

For a store with 7,300 items, that is 14,600 REST calls and 7,300 seconds of sleep. The sleep alone puts you north of two hours before you count a single network round trip.

In practice it was worse. One-off 429s triggered retries. A handful of items weren’t tracked on Shopify at all, so their inventory_levels.set returned 422 and the old code failed the whole store. A bench upgrade tightened the HTTP timeout and cut the sync off partway through. We were losing data without anyone noticing, because the logs said “running.”

The Incremental Patches (That Didn’t Actually Fix It)

Before the rewrite, we kept applying the obvious patches to the obvious symptoms. Each of them was merged. Each of them shipped. None of them fixed the real problem.

  • PR #40: rate limiting. We were hammering the leaky bucket and generating hundreds of 429s per run. Fix: enforce a one-second sleep between items and skip items with inventory tracking disabled on Shopify instead of failing them.
  • PR #41: duplicate job prevention. The scheduler ran every five minutes. A new store with no last_inventory_sync timestamp always qualified. By the time the first sync had been running for ten minutes, two more had been queued behind it. Fix: re-check the interval at the start of the worker.
  • PR #42: raise the job timeout from 30 minutes to 60 minutes. Some stores were still timing out.
  • PR #50: raise the job timeout from 60 minutes to 180 minutes. Large stores were still timing out.

By PR #50 we had a three-hour timeout on what was nominally a stock-level push. It was at this point we stopped treating the timeout as a dial and started treating it as a signal.

The per-item REST loop wasn’t slow because the timeout was too low. It was slow because REST + sleep is structurally wrong for this workload. You cannot make a 14,600-call-with-sleep pipeline fast. You have to stop doing 14,600 calls.

The Shape of the Rewrite

Shopify’s GraphQL Admin API has an inventorySetQuantities mutation that takes up to 250 inventory levels in a single call. It also returns a throttleStatus object on every response, telling you exactly how many cost points you have left in the bucket. If you use both of those, you get two things the REST path can’t give you:

  1. Batching. 7,300 items in batches of 250 is 30 API calls instead of 14,600. That one change accounts for most of the speedup.
  2. Back-pressure instead of blanket sleep. Rather than sleeping one second between every item (regardless of bucket state), you inspect throttleStatus.currentlyAvailable on each response. If you’re below a threshold (we picked 200 cost points out of the default 1000 bucket), pause just long enough to let the bucket refill. Most of the time you don’t pause at all.

The mutation itself looks like this (this is verbatim from inventory_graphql.py in the connector):

mutation inventorySetQuantities($input: InventorySetQuantitiesInput!) {
  inventorySetQuantities(input: $input) {
    inventoryAdjustmentGroup {
      createdAt
      changes {
        name
        delta
        quantityAfterChange
        item { id }
        location { id }
      }
    }
    userErrors {
      field
      message
      code
    }
  }
}

Two things worth calling out:

  • userErrors is per-row. If 3 of the 250 items in a batch fail (bad SKU, deleted variant, whatever), the other 247 still succeed. You have to map each userErrors[i].field path back to the source row to know which ones failed, but that is a local problem, not an API problem.
  • throttleStatus is on every response. We parse extensions.cost.throttleStatus into a small ThrottleStatus dataclass and pause between batches only when we’re near the limit.
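A minimal version of that throttle handling might look like the following. The field names match the `extensions.cost.throttleStatus` payload Shopify returns on every GraphQL response; the dataclass shape and the 200-point threshold are our choices, and the function names are illustrative rather than the connector's exact code.

```python
import time
from dataclasses import dataclass

@dataclass
class ThrottleStatus:
    maximum_available: float
    currently_available: float
    restore_rate: float  # cost points restored per second

    @classmethod
    def from_response(cls, body: dict) -> "ThrottleStatus":
        ts = body["extensions"]["cost"]["throttleStatus"]
        return cls(ts["maximumAvailable"], ts["currentlyAvailable"], ts["restoreRate"])

MIN_AVAILABLE = 200  # pause when fewer cost points than this remain

def pause_if_throttled(status: ThrottleStatus, sleep=time.sleep) -> float:
    """Sleep only long enough for the bucket to refill past the threshold.

    Returns seconds slept (0.0 most of the time)."""
    if status.currently_available >= MIN_AVAILABLE:
        return 0.0
    deficit = MIN_AVAILABLE - status.currently_available
    wait = deficit / status.restore_rate
    sleep(wait)
    return wait
```

The contrast with the old loop is the point: the REST path slept unconditionally 7,300 times; this path sleeps only when the bucket actually says it is nearly empty.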

The Cache That Killed the Per-Item Variant Lookup

Batching the writes is the big win. But there was still a step 1 in the old loop: for each item, fetch the Shopify variant to get its inventory_item_id. Even with batching, doing this per item would add thousands of calls back in.

The fix is a simple cache. We added a shopify_inventory_item_id field to the Item Shopify Store child table on the ERPNext side. It is populated in two places:

  • Product sync. When we fetch a Shopify product (already done for other reasons), we pull the variants[0].inventory_item_id off the response and store it on the child row. No extra API call.
  • SKU mapping. When the user runs “Fetch Products & Map by SKU”, we capture the inventory_item_id in the same pass.

On the first sync cycle after upgrading, a batched nodes GraphQL query backfills the cache lazily for any rows that don’t have one yet:

query variantInventoryItems($ids: [ID!]!) {
  nodes(ids: $ids) {
    ... on ProductVariant {
      id
      inventoryItem { id tracked }
    }
  }
}

One call, 250 variants at a time, and the cache is warm for good. On every subsequent run, inventory sync reads the cached IDs from our database and goes straight to the batched mutation.
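Parsing the backfill response reduces to a small pure function. This is a hypothetical helper rather than the connector's actual implementation, but it shows the one wrinkle worth knowing: deleted variants come back as null entries in `nodes`, and you want to skip them rather than crash mid-backfill.

```python
def parse_variant_nodes(body: dict) -> dict:
    """Map variant GID -> (inventory_item_gid, tracked) from a nodes query response.

    Deleted variants (or IDs that aren't ProductVariants) come back as
    null nodes; skip them so the caller can leave those cache rows empty
    and log them instead of failing the backfill."""
    cache = {}
    for node in body["data"]["nodes"]:
        if not node:
            continue
        inv = node["inventoryItem"]
        cache[node["id"]] = (inv["id"], inv["tracked"])
    return cache
```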

The Retry Logic That Doesn’t Hide Failures

Shopify occasionally returns 429 (rate limited) with a Retry-After header, or 5xx during their own incidents. The old path silently swallowed both categories into “failed item” counts, which meant a Shopify outage looked identical to a genuinely broken item. Operators couldn’t tell the difference.

The new path classifies errors explicitly:

  • 429 with Retry-After: wait the header value, retry once. If it still fails, record the batch as failed and move on.
  • 5xx: exponential backoff (2s, 4s), up to two retries. After that, batch fails.
  • Non-retryable 4xx (bad request, auth failure, schema errors): fail immediately with a real error message. These are bugs, not transients.
  • JSON parse errors: the response isn’t valid JSON (usually a Shopify error HTML page). Treated as non-retryable, because retrying an HTML error page is unlikely to produce different HTML.
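The classification above can be sketched as one decision loop. The status codes, retry counts, and backoff values match the list; the function itself and the `do_request` interface are illustrative, not the connector's exact code.

```python
import json
import time

MAX_5XX_RETRIES = 2

def send_with_retries(do_request, sleep=time.sleep):
    """Run one batch request with explicit error classification.

    do_request() returns an object with .status_code, .headers, .text.
    Returns the parsed JSON body on success; raises RuntimeError on
    anything non-retryable so the caller can record the batch as failed."""
    retried_429 = False
    attempts_5xx = 0
    backoff = 2  # seconds: 2s, then 4s
    while True:
        resp = do_request()
        if resp.status_code == 429 and not retried_429:
            # Honour Shopify's Retry-After header, retry once.
            sleep(float(resp.headers.get("Retry-After", 1)))
            retried_429 = True
            continue
        if 500 <= resp.status_code < 600 and attempts_5xx < MAX_5XX_RETRIES:
            sleep(backoff)
            backoff *= 2
            attempts_5xx += 1
            continue
        if resp.status_code != 200:
            # Second 429, exhausted 5xx, or a non-retryable 4xx.
            raise RuntimeError(f"non-retryable HTTP {resp.status_code}")
        try:
            return json.loads(resp.text)
        except json.JSONDecodeError:
            # Usually a Shopify HTML error page; retrying won't help.
            raise RuntimeError("non-JSON response from Shopify")
```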

This matters more than it might sound. In the old system, a bad auth token would cause 100% of items to “fail to sync” in the log, and the summary would read “0 synced, 7300 failed” for hours until someone noticed. In the new system, the first batch fails with 401 Unauthorized, the whole run aborts, and the operator gets a Shopify Log entry that names the problem.

Three Operational Safeguards That Saved a Support Call

Rewriting the loop was the hard part. Three small safeguards came with it that we think are worth copying into any scheduled-job integration:

Force-bypass on manual triggers. The old “Sync Inventory” button on the store form respected the same frequency guard as the scheduled job. An operator who clicked it within an hour of the last scheduled run would silently get nothing, because _should_sync_inventory() returned False. Now the button passes force=True and bypasses the guard. If you clicked it, you meant it.

Conditional timestamp advancement. last_inventory_sync used to be updated at the start of every run. A bad token would therefore mask itself for an hour: the sync would fail, the timestamp would advance, and the next scheduled run would skip. Now the timestamp only advances if at least one batch succeeded. A totally broken run leaves the timestamp where it was and the next scheduled tick tries again.

Per-store kill switch via bench config. Edge cases happen. We added a nexwave_shopify_disable_graphql_inventory_sync bench config entry that takes a list of store domains. Any store in that list is skipped without a release. If a specific store turns out to have weird data, we can disable its sync in 30 seconds rather than pushing a hotfix.
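The kill switch itself reduces to a membership check. As a sketch (assuming the bench/site config is readable as a dict, which is how Frappe exposes it via frappe.conf on a real site; the function name is hypothetical):

```python
def is_inventory_sync_disabled(conf: dict, store_domain: str) -> bool:
    """Return True if GraphQL inventory sync is switched off for this store.

    conf is the site/bench config mapping; the entry holds a list of
    Shopify store domains to skip without shipping a release."""
    disabled = conf.get("nexwave_shopify_disable_graphql_inventory_sync") or []
    return store_domain in disabled
```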

None of these are technically sophisticated. They just make the integration behave sensibly when something goes wrong, which, if you remember nothing else from this post, is what makes an integration production-grade.

The Numbers

On the specific store we tested against (about 7,300 items, single location, warm cache):

Path                                  API calls    Wall time
REST, per-item with sleep (old)       ~14,600      3+ hours (timed out)
GraphQL batched, warm cache (new)     ~30          ~11 seconds

That is roughly 486× fewer API calls and 1000× less wall time for the same semantic operation.

The cold-cache first run is slower by a few seconds because of the nodes backfill query, but once the cache is populated it stays populated.

We also relaxed two operational parameters that had been fighting against the slow loop:

  • The default per-store sync interval moved from 30 minutes to 60 minutes. The sync is now fast enough that a 30-minute cadence is overkill.
  • The scheduler cron changed from */5 to */10. No need to check every five minutes for stores whose syncs take eleven seconds to finish.

What We Upstream

We keep the connector open source because we think this kind of work should be shared. If you are running ERPNext with Shopify, or if you’re building your own Shopify integration on any stack and looking for a concrete example of GraphQL batch sync with throttle-aware back-pressure, the code is there.

Grab the connector

Repo: github.com/one-highflyer/nexwave_shopify_connector (GPL, Frappe v15 and v16).

The rewrite PRs: #51 (version-15) and #53 (version-16 port).

Install it on any Frappe bench with bench get-app and bench install-app nexwave_shopify_connector. The README covers configuration, OAuth setup, and the multi-store / multi-company mapping model.

The tests are worth flagging too: 34 FrappeTestCase cases covering throttle parsing, batch payload shape, userError mapping, lazy backfill, disabled stores, 429/5xx retry paths, zero and negative qty clamping, partial-failure timestamp semantics, and force-bypass on manual sync. Integration code without tests is just hope with a chance of bugs.

The Lesson (Which Is Not “Use GraphQL”)

“Use GraphQL” is the boring answer. The interesting answer is the thing we kept not doing: stop treating timeout errors as something to tune, and start treating them as something to diagnose.

We raised the timeout three times. 30 minutes, 60 minutes, 3 hours. Each raise felt like a small fix. Each one was us agreeing, collectively and without saying it out loud, that the integration’s time budget was “however long the loop needs.” The real budget was always tens of seconds, because that is what the business actually needed. Nothing about “push stock levels to Shopify” should take longer than a lunch break, and the fact that it was taking hours was telling us the architecture was wrong.

Timeouts are diagnostic tools. If you keep bumping one, something deeper is probably broken.


If You Need This Built

HighFlyer builds and maintains integrations for NZ businesses: Shopify, WooCommerce, Xero, Stripe, Akahu, bank feeds, payment gateways, and anything else your stack talks to. We maintain the open-source ERPNext Shopify connector described above, and we build custom connectors when the off-the-shelf options don’t fit.

See our services or get in touch.

Tags

Shopify · ERPNext · NexWave · GraphQL · API Integration · Performance · New Zealand · Auckland · Open Source · Inventory Sync


About the Author

Imesha Sudasingha

Head of Engineering

Imesha is the Head of Engineering at HighFlyer and a member of the Apache Software Foundation with 10+ years of experience across integration, cloud, and AI. He maintains the open-source Shopify connector for ERPNext / NexWave that this post describes.

