Tested restores: how to verify your backup strategy is real

A working runbook for backup restore testing: what each cadence should cover, and why a backup that has never been restored is not a backup.

· Atticus Rowan

A backup that has never been restored is not a backup. It is an optimistic assumption. The gap between “the backup job completed successfully” and “the data is recoverable” is wider than most firms realize, and that gap is what determines whether a ransomware event or a hardware failure is a 4-hour inconvenience or a multi-week rebuild.

Restore testing is the discipline that closes the gap. It is also the discipline most firms know they should do and do not. Cyber insurance renewals, PE diligence processes and operational-resilience audits all now ask for a dated restore-test record, and “we run restores when we need to” is not the answer any of them want.

Here is the working runbook we use to stand up a restore-testing practice at a mid-market firm, including the cadence, the scope and the evidence artifacts the modern underwriting environment actually asks for.

What “tested” actually means

A real restore test has five characteristics.

  • A restore from the backup itself. Not reading a file in place, not re-mounting the production volume, but an actual restore to an isolated environment.
  • Against a real dataset. The data that would actually need to be recovered in an incident, not a synthetic placeholder.
  • With documented timing. Start time, completion time, total elapsed duration, broken down by phase (retrieval, staging, application-level restore, verification).
  • With verification. Did the restored data open, load, mount or run as expected? A file that lands in the right place but arrives corrupted is not a successful restore.
  • With a written record. Date, who performed it, what system, what the result was, any anomalies found.

A backup job dashboard showing 100% success rates is not any of these.
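
Those five characteristics map directly onto a small record structure. As a minimal sketch, assuming a Python dataclass appended to a JSON-lines log, a single test entry might look like the following; the field names are illustrative, not a prescribed schema.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
import json

@dataclass
class RestoreTestRecord:
    """One entry in the dated restore-test log."""
    system: str                # system under test, e.g. "ERP"
    scope: str                 # what was restored: single object, full system, DR cluster
    performed_by: str          # who ran the test
    started: datetime          # restore start time
    completed: datetime        # restore completion time
    verified: bool             # did the data open, load, mount or run as expected
    anomalies: list[str] = field(default_factory=list)  # anything unexpected, even on a pass

    @property
    def elapsed_minutes(self) -> float:
        return (self.completed - self.started).total_seconds() / 60

def append_to_log(record: RestoreTestRecord, path: str = "restore-test-log.jsonl") -> None:
    """Append one test result per line to a JSON-lines log file."""
    entry = asdict(record)
    entry["started"] = record.started.isoformat()
    entry["completed"] = record.completed.isoformat()
    entry["elapsed_minutes"] = record.elapsed_minutes
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```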

The three-tier cadence

Most mid-market firms can implement a workable restore-testing practice with three tiers.

Monthly, critical-system subset

A rotating selection of the most critical systems, tested monthly. At any given firm this is usually 3 to 8 systems: ERP, email, finance, customer database, file server. The rotation should be set so that every critical system is restore-tested at least every 60 days.

What the test covers:

  • Restore a single representative object from each system (a mailbox, an ERP record, a file tree)
  • Time the restore end to end
  • Verify the restored object is intact
  • Document the result

Estimated effort: 2 to 4 hours of IT or MSP time per month.
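
One way to make the timing and verification steps above concrete: a minimal sketch of a monthly single-object test, assuming a generic command-line restore tool. The `backup-tool` command, the restore target path and the checksum comparison are placeholders, not any real vendor's CLI.

```python
import hashlib
import subprocess
import time
from pathlib import Path

# Placeholder command: the real invocation depends entirely on the backup product in use.
RESTORE_CMD = ["backup-tool", "restore", "--target", "/mnt/restore-test"]
RESTORE_DIR = Path("/mnt/restore-test")

def sha256(path: Path) -> str:
    """Checksum used to confirm the restored object arrived intact, not just present."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def run_monthly_test(object_name: str, expected_sha256: str) -> dict:
    """Restore one representative object, time it end to end, verify it, return the result."""
    started = time.monotonic()
    proc = subprocess.run(RESTORE_CMD + [object_name], capture_output=True, text=True)
    elapsed_minutes = (time.monotonic() - started) / 60

    restored = RESTORE_DIR / object_name
    verified = (
        proc.returncode == 0
        and restored.exists()
        and sha256(restored) == expected_sha256
    )
    return {
        "object": object_name,
        "elapsed_minutes": round(elapsed_minutes, 1),
        "verified": verified,
        "notes": proc.stderr.strip(),
    }
```

The expected checksum comes from the production copy at the time the test is planned. The tooling matters less than the habit: the timing and the verification land in the written record, not in anyone's memory.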

Quarterly, full-system scope

A full-system restore of one critical system, rotated quarterly across the critical-system list. At the end of 2 years, every critical system has had at least one full-system restore test.

What the test covers:

  • Restore the entire system to an isolated environment
  • Bring the application up against the restored data
  • Confirm the application functions normally with the restored dataset
  • Time the end-to-end restore and document the runtime against the documented RTO (recovery time objective) for that system

Estimated effort: 1 to 2 days of IT or MSP time per quarter.
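
Verification at this tier means more than a checksum: the application has to come up against the restored data. A minimal smoke-check sketch, assuming the restored system exposes an HTTP health endpoint and a record known to exist in production; the URLs and the record ID are hypothetical.

```python
import json
import urllib.request

# Hypothetical endpoints inside the isolated restore environment.
HEALTH_URL = "http://restore-test.internal:8080/health"
KNOWN_RECORD_URL = "http://restore-test.internal:8080/api/orders/1001"

def smoke_check() -> dict:
    """Confirm the restored application is up and serving real data, not just installed."""
    checks = {}
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
            checks["health_endpoint"] = resp.status == 200
    except OSError:
        checks["health_endpoint"] = False
    try:
        with urllib.request.urlopen(KNOWN_RECORD_URL, timeout=10) as resp:
            record = json.load(resp)
            # A record that existed in production should exist in the restored dataset.
            checks["known_record_present"] = record.get("id") == 1001
    except OSError:
        checks["known_record_present"] = False
    checks["passed"] = all(checks.values())
    return checks
```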

Annual, disaster-recovery rehearsal

A scheduled off-hours exercise that simulates a broader failure. Multiple systems, coordinated recovery, documented runtimes against documented RTOs.

What the test covers:

  • Restore a cluster of systems in the order the DR runbook prescribes
  • Exercise the dependencies between systems (authentication before application, database before app server); a sketch of deriving that order follows below
  • Identify any dependency mismatches, missing runbook steps or recovery sequence issues
  • Produce a written after-action review

Estimated effort: 1 to 3 days of IT or MSP time per year, plus preparation.
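
The recovery order in the runbook is a dependency problem: authentication before application, database before app server. A minimal sketch of deriving a safe restore sequence from a dependency map using a topological sort; the system names and dependencies are examples, not a real environment.

```python
from graphlib import TopologicalSorter

# Example dependency map: each system lists what must be recovered before it.
DEPENDS_ON = {
    "active-directory": [],
    "database": ["active-directory"],
    "license-server": ["active-directory"],
    "erp-app-server": ["database", "license-server"],
    "file-server": ["active-directory"],
}

def recovery_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return a restore sequence that never brings a system up before its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())

if __name__ == "__main__":
    for step, system in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(f"{step}. restore {system}")
```

A cycle in the map, two systems that each list the other as a prerequisite, raises graphlib.CycleError, which is itself a useful rehearsal finding.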

Skipping the annual exercise is the most common failure mode at firms that have the first two tiers in place. The annual test is where the dependency-ordering errors, the undocumented runbook gaps and the “nobody noticed that system was not in the backup scope” problems surface.

Common failure modes the runbook catches

Firms that start a restore-testing practice consistently find the same issues. A representative list from real-world engagements:

  • A system was not actually being backed up. Often a SaaS application or a server added after the backup scope was last reviewed.
  • The backup was running but retention was too short. The system was backed up nightly but copies were retained for only 7 days, and the recovery scenario needed a 30-day-old copy.
  • Restores were possible but the runbook was out of date. The documented recovery steps referenced tools, credentials or contacts that had changed.
  • The restore took much longer than the documented RTO. The RTO was 4 hours. The actual restore was 11 hours. This is the gap that matters most and is invisible without testing.
  • The restore succeeded but the application could not come up. Database restored, application server restored, but a third dependency (a license server, a certificate, an authentication trust) was missed.
  • Immutability was claimed but not actually enforced. A restore attempt uncovered that the “immutable” backup could be modified by an admin account after all.

Each of these is common. Each is fixable. Each is invisible until a real restore test catches it.

Evidence the modern environment asks for

A restore-testing practice produces the evidence artifacts that modern underwriting, diligence and audit processes all expect.

  • A dated restore-test log, per system, per test, with the fields above (date, performer, system, scope, runtime, verification result, anomalies).
  • RTO compliance reporting, showing actual restore times against documented recovery objectives.
  • Dependency maps, produced as a byproduct of DR rehearsals.
  • Runbook revision history, showing that the recovery procedure is maintained and tested.

Firms that can produce the evidence in under an hour are in credible shape for any of the three use cases. Firms that cannot are not, regardless of what the backup vendor’s dashboard shows.
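
RTO compliance reporting falls straight out of the test log. A minimal sketch, assuming the JSON-lines log from the earlier sketch and a per-system table of documented RTOs; the systems and numbers are illustrative.

```python
import json

# Documented recovery time objectives, in minutes (illustrative numbers).
DOCUMENTED_RTO_MINUTES = {"erp": 240, "email": 120, "file-server": 480}

def rto_compliance(log_path: str = "restore-test-log.jsonl") -> list[dict]:
    """Compare the worst observed restore time per system against its documented RTO."""
    worst: dict[str, float] = {}
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            worst[entry["system"]] = max(worst.get(entry["system"], 0.0),
                                         entry["elapsed_minutes"])

    report = []
    for system, rto in DOCUMENTED_RTO_MINUTES.items():
        observed = worst.get(system)
        report.append({
            "system": system,
            "documented_rto_minutes": rto,
            "worst_observed_minutes": observed,
            "within_rto": observed is not None and observed <= rto,
            "never_tested": observed is None,   # a gap worth flagging on its own
        })
    return report
```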

What a first-time implementation looks like

For a mid-market firm that has never run formal restore tests, the first cycle is almost always where the biggest issues surface.

  • Week 1. Inventory backup scope. Identify the critical-system subset. Write the first monthly test plan.
  • Weeks 2 to 4. Execute the first monthly test. Document every anomaly. Usually the anomalies drive immediate remediation.
  • Month 2. Second monthly test on a different critical system. Add any new findings to the remediation list.
  • Month 3. First quarterly full-system restore test. Compare actual runtime against documented RTO. Update the RTO if the documented number is not achievable.
  • Months 4 to 12. Continue the monthly rotation. Schedule the annual DR rehearsal for an appropriate window.

The first 3 months usually produce a list of 10 to 20 remediation items per firm. After the first year of cycles, the remediation list is usually under 5 items per quarter. That is the shape of a working practice.

Where we fit

Our default cadence for managed clients includes monthly documented restore testing on a critical-systems subset, quarterly full-scope restore testing and an annual off-hours disaster-recovery rehearsal. The evidence library sits ready for cyber insurance renewal, PE diligence or any customer security questionnaire.

The practical point is that restore testing is a discipline that sits alongside whatever backup vendor the firm uses. Veeam, Rubrik, Cohesity, Acronis: any of them can produce good restore results. None of them produces a working restore-testing practice without someone actually running the tests and documenting the results.

If your backup posture has never been exercised against a restore-testing cadence, or if a recent cyber insurance application surfaced weak answers to the restore-testing questions, schedule a discovery call. We can walk through the current state and scope the first 90 days of a working restore-testing practice.