Wayback Machine: Retrieving Archived Web Content
Step-by-step Wayback Machine tutorial for investigators: search captures, pivot across snapshots, cite archived URLs, and preserve evidence before it disappears.
Web pages vanish, get edited, or go behind paywalls. The Wayback Machine gives investigators a way to recover what a page looked like at a specific point in time. This tutorial walks through searching, pivoting, and preserving archived content in a way that holds up under scrutiny.
Who this is for
Beginner
Journalists, researchers, and civic investigators who need to surface deleted statements, prior website versions, or infrastructure changes on a target domain.
What you'll need
- A modern browser.
- No account. The Internet Archive is free to use.
- A note-taking tool and a local folder for screenshots (PNG or WebP) and WARC/HTML exports.
- Optional: a command line with `curl` or `wget` for batch capture.
How it works
The Internet Archive's Wayback Machine crawls the public web and stores snapshots (called captures) of individual URLs. Each capture is addressable through a canonical URL that encodes the capture timestamp. Because snapshots are immutable and hosted by a third party, they function as a reasonable public-record form of preservation, though not a forensic one. For stronger integrity, pair a Wayback URL with a local copy and a cryptographic hash.
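The canonical capture URL can be assembled mechanically from a timestamp and a target URL; a minimal shell sketch (the timestamp and target values here are hypothetical, for illustration only):

```shell
# Build the canonical Wayback capture URL from its two components.
TIMESTAMP="20220410153000"            # YYYYMMDDhhmmss, UTC (hypothetical)
TARGET="https://example.com/page"     # hypothetical target URL
CAPTURE_URL="https://web.archive.org/web/${TIMESTAMP}/${TARGET}"
echo "$CAPTURE_URL"
```

The same pattern works in reverse: any `web.archive.org/web/...` link you encounter can be decomposed into its capture time and original URL.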
Step-by-step walkthrough
1. Open the Wayback Machine. Go to `https://web.archive.org/` and paste the target URL into the search box. You will see a calendar heatmap showing every date with at least one capture.
2. Pick a capture date. Click a highlighted day, then select a specific timestamp. The resulting URL has the pattern `https://web.archive.org/web/YYYYMMDDhhmmss/https://example.com/page`. Copy that full URL; it is your archival citation.
3. Compare versions. Use the "Changes" view (the two-diamond icon) to diff two captures. This surfaces quiet edits, for example a removed sentence in a press release or a changed corporate officer on an About page.
4. Trigger a new capture if none exists. Use the "Save Page Now" form: `https://web.archive.org/save/https://example.com/page`. Keep the returned snapshot URL for your notes. If the page sets `x-robots-tag: noarchive`, Wayback will refuse the capture; fall back to archive.today or a local WARC.
5. Pivot with the site index. Append `*` to the query to list every captured path on a domain, for example: `https://web.archive.org/web/*/example.com/*`. This reveals deleted staff pages, orphaned PDFs, and subdomains that never made it into search results.
6. Download a clean copy. For litigation or publication, capture a local WARC alongside the Wayback snapshot:

   ```
   wget --warc-file=example-2026-03-02 \
        --warc-cdx \
        --page-requisites \
        --convert-links \
        https://example.com/page
   ```

   Compute a SHA-256 hash of the resulting WARC:

   ```
   sha256sum example-2026-03-02.warc.gz
   ```

   Record the hash in your source log.
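The hash-and-log routine at the end of the walkthrough can be wrapped in a small helper; a sketch, assuming a plain-text source log (the `log_capture` name and the log line format are illustrative, not a standard):

```shell
# log_capture FILE [LOG]: append "UTC-time  sha256  filename" to a source log.
log_capture() {
  file="$1"
  log="${2:-source-log.txt}"          # default log name is an assumption
  hash="$(sha256sum "$file" | awk '{print $1}')"
  printf '%s  %s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$hash" "$file" >> "$log"
}
```

Usage: `log_capture example-2026-03-02.warc.gz` after each download, so every local copy is tied to a timestamped hash record.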
Common pitfalls
- Confusing capture time with publication time. A snapshot dated 2022-04-10 proves the content existed on that day, not that it was published that day.
- Trusting a single capture. Pages load assets dynamically. Missing CSS or blocked scripts can make a snapshot misleading. Check two or three adjacent captures.
- Citing the live URL instead of the Wayback URL. If you link the live URL in your published work, your evidence disappears the moment the target edits the page.
- Ignoring robots.txt retroactive removals. The Internet Archive honors retroactive `robots.txt` exclusions. A capture you saw last month may not be accessible today. Save local copies.
- Assuming Wayback covers everything. JavaScript-heavy single-page apps, authenticated pages, and most media embeds are captured poorly or not at all.
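Because the capture timestamp is encoded in the archival URL itself, the capture time from the first pitfall can be read back mechanically; a POSIX shell sketch (the `capture_time` helper is hypothetical):

```shell
# capture_time URL: print the capture moment encoded in a Wayback URL.
capture_time() {
  # Isolate the 14-digit timestamp between "/web/" and the next "/".
  ts=$(printf '%s' "$1" | sed 's|.*/web/||; s|/.*||')
  printf '%s-%s-%s %s:%s:%s UTC\n' \
    "$(printf '%s' "$ts" | cut -c1-4)" \
    "$(printf '%s' "$ts" | cut -c5-6)" \
    "$(printf '%s' "$ts" | cut -c7-8)" \
    "$(printf '%s' "$ts" | cut -c9-10)" \
    "$(printf '%s' "$ts" | cut -c11-12)" \
    "$(printf '%s' "$ts" | cut -c13-14)"
}
```

This gives you the capture date for a citation; the publication date, if it matters, must come from the page content itself.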
Verifying your findings
Pair every Wayback citation with: (1) the exact archival URL, (2) a local screenshot, (3) a local HTML or WARC export, and (4) a hash of the local file. Cross-check against archive.today, which captures client-rendered content differently. If two independent archives agree, the finding is stronger.
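Enumerating every capture of a URL makes these cross-checks easier; the Wayback Machine's public CDX API (`web.archive.org/cdx/search/cdx`) supports this. A sketch that builds such a query (the `cdx_query` helper name is illustrative):

```shell
# cdx_query TARGET: print a CDX API URL listing every capture of TARGET,
# returning timestamp, original URL, and HTTP status for each.
cdx_query() {
  printf 'https://web.archive.org/cdx/search/cdx?url=%s&output=json&fl=timestamp,original,statuscode\n' "$1"
}
# Fetch with: curl -s "$(cdx_query example.com/page)"
```

Scanning the returned timestamps is a quick way to find the adjacent captures recommended above.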
For a full workflow on turning raw captures into documented evidence, see the analysis phase guide.
Related tutorials
- WHOIS and DNS lookup for pivoting from a page to the infrastructure hosting it.
- Reverse image search for verifying images surfaced in archived content.
- Metadata extraction for documents linked from archived pages.
Apply this in practice
See archival techniques used in the verifying a viral image case study. For broader context on how published investigations rely on archived records, read the Epstein Revealed investigation series.