Preserving Digital Evidence: Screenshots, Archives, and Hashing

Digital evidence degrades. Pages get edited. Posts get deleted. Registrars change. Screenshots in a folder are not evidence; they are artifacts without provenance. A preservation workflow is what turns collection into something a colleague can audit, a lawyer can defend, and a court can accept.

This post walks through the workflow. For tool-level detail, see /tools/wayback-machine/ and /tools/metadata/.

What Preservation Must Prove

A preserved artifact has to answer four questions:

What is it? (a page, a document, a video)
Where did it come from? (URL, agency, system)
When was it captured? (date and time in a consistent timezone)
Has it changed since capture? (integrity verification)

Without answers to all four, a finding built on the artifact is defensible only to someone who already trusts you.

The Minimum Viable Preservation

For any artifact you plan to rely on:

Archive to a third-party service. The Wayback Machine via web.archive.org/save/URL is the default. Archive.today (archive.ph) is a useful backup for sites that block the Wayback Machine.
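Screenshot the page with the browser URL bar and a timestamp visible. Full-page captures (via browser dev tools or tools like FireShot) preserve scrolled content but generally leave out the browser chrome, so pair them with a window-level screenshot that shows the URL bar.
Save the raw source. wget --page-requisites --convert-links URL or the browser's "Save as MHTML" preserves HTML and assets. Because --convert-links rewrites links in the saved HTML, add --backup-converted if you need the unmodified file alongside it.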
Record a cryptographic hash of the captured files.
Log the artifact in the collection log with URL, timestamp, archive URL, hash, and retrieval method.

That is the baseline. Anything less leaves gaps.
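As a rough illustration of the logging step, here is a minimal Python sketch of an append-only collection log. The field names, the `LogEntry` class, and the JSON Lines file layout are assumptions made for this example, not a standard; adapt them to whatever log format your team already uses.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class LogEntry:
    # Hypothetical field names mirroring the five items listed above.
    url: str          # where the artifact came from
    captured_at: str  # capture time, ISO 8601 UTC
    archive_url: str  # third-party archive copy
    sha256: str       # hash of the captured file
    method: str       # retrieval method, e.g. "wget" or "browser MHTML"


def utc_now() -> str:
    """Current time as an ISO 8601 UTC string such as 2026-03-29T14:22:00Z."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def append_entry(log_path: Path, entry: LogEntry) -> None:
    """Append one entry as a single JSON line; earlier lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(entry), sort_keys=True) + "\n")
```

Opening the file in append mode means an existing entry is never overwritten by a later one, which keeps the log itself easy to audit.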

Hashing in Practice

A cryptographic hash is a fixed-length fingerprint of a file. If the file changes, the hash changes. SHA-256 is the standard. MD5 and SHA-1 still catch accidental corruption, but practical collision attacks exist for both, so avoid them whenever anyone might have an incentive to tamper.

Command line:

sha256sum evidence.pdf
# 9f4e2c1a8b... evidence.pdf

# Windows PowerShell:
Get-FileHash -Algorithm SHA256 evidence.pdf

Record the hash in the collection log at the time of capture. Re-hash before citing. A matching hash shows the file is byte-for-byte identical to what you hashed at capture; a mismatch is a red flag that demands explanation.
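The same check can be scripted. The sketch below uses Python's standard hashlib; the function names `sha256_file` and `matches_logged_hash` are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large videos do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_logged_hash(path: Path, logged_hex: str) -> bool:
    """Re-hash the file and compare it with the value recorded at capture."""
    return sha256_file(path) == logged_hex.strip().lower()
```

Running `matches_logged_hash(Path("evidence.pdf"), logged_value)` before citing gives a yes or no answer instead of a by-eye comparison of two long hex strings.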

For multi-file archives, hash each file individually and publish the manifest:

9f4e2c1a... evidence.pdf
3a8c9d4e... screenshot.png
7b1f0e6d... page.html
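A manifest like this is easy to generate and re-check in a few lines. The following is a sketch only: the helper names are hypothetical, and the "hash, two spaces, filename" layout is chosen to resemble sha256sum output. Keep the manifest file outside the folder it describes so it does not end up listing itself.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    # Reads the whole file at once; fine for documents, chunk it for large video.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to the folder) to its SHA-256."""
    return {
        p.relative_to(folder).as_posix(): sha256_file(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def write_manifest(manifest: dict[str, str], out_path: Path) -> None:
    lines = [f"{digest}  {name}" for name, digest in sorted(manifest.items())]
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(folder: Path, manifest_path: Path) -> list[str]:
    """Return the names of files that are missing or no longer match."""
    problems = []
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        digest, name = line.split("  ", 1)
        target = folder / name
        if not target.is_file() or sha256_file(target) != digest:
            problems.append(name)
    return problems
```

An empty list from `verify_manifest` means every listed file is present and unchanged.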

Timestamping

Timestamps matter. Two approaches:

Archive-service timestamp. The Wayback Machine records capture time in UTC and exposes it in the URL (web.archive.org/web/20260329142200/...). This is independent of your local clock.
OpenTimestamps (opentimestamps.org) produces a Bitcoin-anchored proof that a hash existed at a given time. Free, trustless, and increasingly accepted in journalism preservation practice.

Log timestamps in ISO 8601 UTC: 2026-03-29T14:22:00Z. Local-timezone strings introduce ambiguity.
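Because the Wayback Machine puts a 14-digit UTC timestamp in the archive URL, the log value can be derived rather than typed by hand. A small sketch, with a made-up function name:

```python
import re
from datetime import datetime, timezone

# Matches the YYYYMMDDhhmmss segment that follows /web/ in a Wayback URL.
_WAYBACK_TS = re.compile(r"/web/(\d{14})")


def wayback_capture_time(archive_url: str) -> str:
    """Convert the timestamp embedded in a Wayback URL to ISO 8601 UTC."""
    match = _WAYBACK_TS.search(archive_url)
    if match is None:
        raise ValueError("no 14-digit Wayback timestamp found in URL")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

For example, a URL beginning `https://web.archive.org/web/20260329142200/` yields `2026-03-29T14:22:00Z`.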

Screenshots Without Provenance Are Junk

A screenshot pulled from a phone gallery with no URL, no timestamp, and no browser context is not evidence. It is an image.

Minimum for a usable screenshot:

Browser URL bar visible in the frame
Operating-system clock visible in the frame, or OpenTimestamps proof attached
Captured via a reproducible method (browser dev tools' full-page screenshot or a documented tool)
Saved as PNG (lossless), not JPEG
Hashed immediately after capture
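Two of these checks, the file format and the immediate hash, can be automated at the moment a screenshot is filed. The sketch below is one possible approach; `screenshot_record` is a hypothetical helper, and it inspects only the file signature, not the image content.

```python
import hashlib
from pathlib import Path

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def screenshot_record(path: Path) -> dict:
    """Refuse non-PNG files, then return the name, size, and SHA-256 for the log."""
    data = path.read_bytes()
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError(f"{path.name} is not a PNG file")
    return {
        "file": path.name,
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

A JPEG renamed to .png fails the signature check, which catches the most common way a lossy file slips into the archive.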

For investigations documenting rights violations in the field — the kind of work catalogued by the ICE Encounter rights guides — video capture follows analogous rules: continuous recording, visible context, no edits before hashing.

Page-Source Preservation

HTML source often contains information that a rendered screenshot hides: tracking IDs, hidden fields, metadata tags, JavaScript references. For any page that matters:

wget --page-requisites --convert-links --adjust-extension \
     --no-parent --timestamping URL

Or the browser-based equivalent via "Save As → Webpage, Complete." Archive the resulting folder as a single ZIP, hash it, log it.
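The "ZIP, hash, log" step might look like the following in Python. This is a sketch under stated assumptions: `zip_and_hash` is a made-up name, and the ZIP must be written outside the folder being archived.

```python
import hashlib
import zipfile
from pathlib import Path


def zip_and_hash(folder: Path, zip_path: Path) -> str:
    """Pack every file under `folder` into one ZIP and return the ZIP's SHA-256."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in sorted(folder.rglob("*")):
            if p.is_file():
                # Store paths relative to the folder so the archive is portable.
                zf.write(p, arcname=p.relative_to(folder).as_posix())
    return hashlib.sha256(zip_path.read_bytes()).hexdigest()
```

ZIP archives store file modification times, so zipping the same folder again later can produce a different hash. Hash the ZIP you actually keep, and treat that file as the artifact.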

For dynamic pages that render via JavaScript, use a headless browser:

chromium --headless --disable-gpu --dump-dom URL > page.html
chromium --headless --disable-gpu --screenshot=screen.png \
         --window-size=1920,1080 URL

Document Preservation

PDFs, DOCX, and XLSX files carry metadata that lasts only until the next save. Preserve the original:

Download with wget or curl, not through a browser's "save" dialog, which may re-render.
Hash immediately.
Record source URL, download time, HTTP response headers (particularly Last-Modified and ETag).
Store unchanged; work from copies.
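Where a scripted download is preferable to running wget or curl by hand, the four steps can be combined. The sketch below uses only the Python standard library; the function names, the set of headers kept, and the record layout are assumptions for illustration.

```python
import hashlib
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

# Headers worth keeping for provenance; extend the tuple as needed.
PROVENANCE_HEADERS = ("Last-Modified", "ETag", "Content-Type", "Content-Length")


def provenance_headers(headers) -> dict:
    """Pick the provenance headers out of any mapping-like object with .get()."""
    return {
        name: headers.get(name)
        for name in PROVENANCE_HEADERS
        if headers.get(name) is not None
    }


def download_original(url: str, dest: Path) -> dict:
    """Fetch the bytes as served, store them read-only, and return a log record."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read()
        headers = provenance_headers(resp.headers)
    dest.write_bytes(body)
    dest.chmod(0o444)  # read-only: work from copies, not from this file
    return {
        "source_url": url,
        "downloaded_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "sha256": hashlib.sha256(body).hexdigest(),
        "headers": headers,
    }
```

The hash is computed from the bytes in memory before anything else touches the file, so the logged value describes exactly what the server sent.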

Metadata extraction (exiftool, pdfinfo) runs against the preserved original:

exiftool evidence.pdf > evidence-metadata.txt

See /tools/metadata/ for the full tutorial.

Video and Audio

Video is harder to preserve because it is large, often platform-gated, and frequently re-encoded between platforms. Practical workflow:

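Pull with yt-dlp where possible: it fetches the best streams the platform serves without re-encoding them, though it may remux separate video and audio streams into a single container.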
Hash the original file before any processing.
For platform-restricted content, screen-record with OBS using consistent settings, and capture the URL bar and system clock in the same frame.
Log the video with a description of the capture method (direct download vs screen recording) so downstream analysts know whether to expect pristine or re-encoded content.

Chain of Custody

For any artifact that may end up in legal or journalistic review, maintain a chain-of-custody log:

Artifact: evidence.pdf
SHA-256: 9f4e2c1a...
Captured: 2026-03-29T14:22:00Z
Capturer: Angel Reyes
Source URL: https://example.gov/foia/2026-03/evidence.pdf
Archive URL: https://web.archive.org/web/20260329142200/...
Transfers:
  2026-04-01T10:00Z → Editor, via encrypted share, hash re-verified 9f4e2c1a...
  2026-04-15T09:00Z → Legal review, via case management system, hash re-verified 9f4e2c1a...

Every transfer re-verifies the hash. A broken chain is a finding-killer.
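One way to make that rule hard to skip is to let the log refuse a transfer when the hash no longer matches. The sketch below is illustrative: the `CustodyRecord` and `Transfer` classes and their fields are assumptions modelled on the example log above, not an established schema.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Transfer:
    at: str               # ISO 8601 UTC time of the hand-off
    recipient: str
    channel: str          # e.g. "encrypted share"
    verified_sha256: str  # hash observed at the moment of transfer


@dataclass
class CustodyRecord:
    artifact: str
    sha256: str           # hash recorded at capture
    captured: str
    capturer: str
    source_url: str
    archive_url: str
    transfers: list = field(default_factory=list)

    def record_transfer(self, path: Path, at: str, recipient: str, channel: str) -> None:
        """Re-hash the file and log the transfer only if it still matches."""
        current = hashlib.sha256(path.read_bytes()).hexdigest()
        if current != self.sha256:
            raise ValueError(f"hash mismatch for {self.artifact}: chain is broken")
        self.transfers.append(Transfer(at, recipient, channel, current))
```

A mismatch raises an error instead of quietly adding a line, so a broken chain surfaces at the hand-off rather than in legal review.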

Tooling Stack

A serviceable preservation stack costs next to nothing; the software is free and only the cold storage has a price:

Wayback Machine ("Save Page Now")
Archive.today as backup
wget, curl, yt-dlp for direct capture
sha256sum / Get-FileHash for hashing
exiftool, pdfinfo, mediainfo for metadata
OpenTimestamps for trusted timestamping
A cold-stored, read-only archive (external drive, S3 Glacier, or similar) for originals

For investigators handling large document productions, structured tools that integrate preservation into the review workflow save time — many operations use DocumentCloud for journalism or the Subthesis legal document analysis tool for legal-grade filings, both of which embed provenance into the document-review layer.

When Preservation Fails

Failures to expect:

Paywalled content: the Wayback Machine often cannot capture pages behind a paywall. Document your access method, and consider contacting the publisher for an archive-eligible copy.
JavaScript-heavy SPAs: the default Wayback capture often misses content that is rendered client-side. Use a headless browser as a fallback.
Rate limiting on the Wayback Machine: "Save Page Now" is rate-limited. For bulk work, call web.archive.org/save/ in a loop with backoff.
Geo-restricted content: capture may succeed or fail depending on where the archive servers are. Document the outcome either way.
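For the rate-limiting case, a loop with exponential backoff could be sketched as follows. The endpoint is the one named earlier in this post; the retry policy (retry on 429 and 5xx, treat any 2xx as success, double the delay each time) and all function names are assumptions for this example, not documented Wayback Machine behavior.

```python
import time
import urllib.error
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"


def _request_save(url: str) -> int:
    """Ask Save Page Now to capture `url` and return the HTTP status code."""
    try:
        with urllib.request.urlopen(SAVE_ENDPOINT + url, timeout=60) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises for 4xx/5xx; we want the code itself


def save_with_backoff(url, attempt=_request_save, sleep=time.sleep,
                      max_tries=5, base_delay=5.0) -> bool:
    """Retry on 429 and 5xx responses, doubling the wait between tries."""
    delay = base_delay
    for try_number in range(1, max_tries + 1):
        status = attempt(url)
        if 200 <= status < 300:
            return True
        if status != 429 and status < 500:
            return False  # other client errors will not improve with a retry
        if try_number < max_tries:
            sleep(delay)
            delay *= 2
    return False
```

The `attempt` and `sleep` parameters exist so the loop can be exercised without touching the network. Log both outcomes: a failed capture attempt is part of the record too.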

The Reporting Impact

Reports built on preserved artifacts read differently. Instead of "the company's website said X," the finding is "the company's website, at [URL] as archived on [date] at [archive URL, hash], stated X. The current live page does not contain this statement as of [date]."

That phrasing is harder to attack and easier to verify. It is also the standard the reporting phase of OSINT methodology asks for.

The Two Rules

If it matters, archive it now. The link you plan to save tomorrow is 20% likely to be gone by then.
If you cannot prove when you captured it, you captured nothing. Hash, timestamp, log.

Every investigator learns these by losing a source. Better to learn them before.