Phase 2

Collection

Gather evidence from public sources with disciplined provenance — timestamps, hashes, and chain of custody from the first click.

Collection is where the investigation meets the real world, and where most of the errors that will later blow up the analysis are quietly introduced. The phase is deceptively simple — go find the information — but the discipline is not in what you find; it is in how you record finding it. A claim with a screenshot is gossip. A claim with a timestamped, hashed, archived, and logged capture is evidence.

The goal of collection

The collection phase produces two artefacts: the data itself, and the record of how the data was acquired. The second artefact is usually more important than practitioners realise. If your investigation is ever contested — by a source, a lawyer, a regulator, an editor, or a referee — you will be asked not what you found but how you found it, when you found it, and whether you can prove the thing you found was there at the time you say it was.

Assume from the first click that you will need to reconstruct your collection path six months later for somebody hostile. Design the workflow for that reconstruction.

What good collection looks like

For every piece of evidence captured, record:

  • Source URL, including query parameters, in full.
  • Timestamp of capture, in UTC, to at least minute precision.
  • Capture method — browser screenshot, wget, API call, archive submission.
  • Hash of the captured artefact (SHA-256 is standard).
  • Archival copy submitted to a third-party archive (Wayback Machine, archive.today) before any further action on the page.
  • Investigator notes — what the capture is intended to prove, and anything about the access conditions (logged in, specific region, specific device) that could affect reproducibility.

A spreadsheet, a markdown log, or a dedicated tool like Hunchly all work. What matters is that the schema is consistent and the log is written at the time of collection, not reconstructed from memory afterwards.
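For investigators who prefer a scripted log, the fields above can be sketched as a minimal append-only JSONL record. This is an illustrative schema, not a standard: the field names, the `capture_log.jsonl` path, and the `log_capture` helper are all assumptions for the sketch.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("capture_log.jsonl")  # hypothetical log location

def sha256_of(path: Path) -> str:
    """Hash the captured artefact so later tampering is detectable."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def log_capture(url: str, artefact: Path, method: str,
                notes: str, archive_url: str = "") -> dict:
    """Append one capture record, written at collection time, to the log."""
    entry = {
        "source_url": url,          # full URL, query parameters included
        "captured_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "method": method,           # e.g. "screenshot", "wget", "api"
        "sha256": sha256_of(artefact),
        "archive_url": archive_url, # third-party archive copy, if submitted
        "notes": notes,             # what this capture is intended to prove
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because each record is written at the moment of capture and carries its own hash and UTC timestamp, the log doubles as the reconstruction path described above.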

Source classes and their handling

Primary government records — registries, filings, court dockets, regulatory submissions. Treat as high evidentiary weight. Capture the page, capture the underlying document if linked, and record the registry's own version or revision identifier where available. Do not paraphrase; quote.

Secondary reporting — news articles, trade press, academic papers. Treat as leads, not findings. Reporting is useful to point you at primary sources but is itself a chain-of-citation problem. Trace each claim to its origin before relying on it.

Social media — posts, profiles, public interactions. Treat as volatile. Archive immediately, because the content can be edited or deleted between the moment you find it and the moment you cite it. Preserve the full thread context, not just the target post.

Technical artefacts — DNS records, IP addresses, domain registration data, certificate transparency logs, server banners. Treat as corroborative. A registrant record matches or does not match a claim elsewhere; on its own it rarely proves anything.
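The corroborative character of technical artefacts can be made concrete with a small check: does a domain currently resolve to the address a claim asserts? A minimal sketch using the standard library only; note that today's answer says nothing about what the record was at any earlier date, which is exactly why such artefacts corroborate rather than prove.

```python
import socket

def resolves_to(domain: str, claimed_ip: str) -> bool:
    """Return True if the domain currently resolves to the claimed address.

    Corroborative only: a match supports a claim made elsewhere; it does
    not prove it, and historical records require a passive-DNS source.
    """
    try:
        _, _, addrs = socket.gethostbyname_ex(domain)
    except socket.gaierror:
        # Non-resolving domains simply fail to corroborate.
        return False
    return claimed_ip in addrs
```

As with any live lookup, the result itself should be captured and logged at the moment it is obtained, not re-run later and assumed stable.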

Leaked or breached data — treat as radioactive until you have a legal and ethical opinion on handling. Many jurisdictions distinguish between publicly indexed leaks and directly obtained breach data, and some impose obligations on possession regardless of source. Do not assume permission from availability.

Preserving provenance

The single most common failure in OSINT is losing the chain from a published finding back to the moment of its collection. The fix is boring and reliable: archive before you act. Before you tweet about a finding, before you pivot on it, before you even tell a colleague — submit the source URL to the Wayback Machine and archive.today, save a local PDF or WARC, and hash both. The cost is thirty seconds per artefact. The cost of not doing it is an investigation that collapses when the page changes.
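The archive-before-you-act step can be scripted against the Wayback Machine's public save endpoint (`https://web.archive.org/save/<url>`). A hedged sketch: the `User-Agent` string is a placeholder, rate limits apply to the public endpoint, and sustained use needs authenticated access rather than this bare call.

```python
import urllib.request

WAYBACK_SAVE = "https://web.archive.org/save/"  # public save endpoint

def save_request(url: str) -> urllib.request.Request:
    """Build the archive-submission request for a source URL."""
    return urllib.request.Request(
        WAYBACK_SAVE + url,
        headers={"User-Agent": "capture-log/0.1"},  # hypothetical UA; identify yourself
    )

def archive_url_of(url: str, timeout: int = 60) -> str:
    """Submit the page and return the snapshot URL the endpoint redirects to.

    Network call: subject to rate limiting, and the snapshot URL should be
    written into the capture log before any further action on the page.
    """
    with urllib.request.urlopen(save_request(url), timeout=timeout) as resp:
        return resp.geturl()
```

Submitting to a second, independent archive as well (the text names archive.today) guards against any single archive being unavailable when the finding is contested.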

A common trap: using a satellite image or a public photo to confirm a location before cross-checking the caption against a reverse image search. The image may be genuine while the caption has been recycled from another time or place, and the mistake will survive publication if no independent provenance was captured.


Tools relevant to this phase

Match the tool to the source class. Shodan is wasted effort on a question about corporate ownership; a company registry search is wasted effort on a question about exposed services.

Common pitfalls

Collecting before planning is complete. This is the seductive mistake: you find a thread, you pull it, and an hour later you have a hundred tabs open and no idea which ones you can cite.

Contaminating the subject. Logged-in searches, account visits, and direct messages may be visible to the subject. For sensitive work, use a clean browser profile, consider a VPN, and avoid any action that creates a notification or an access-log entry against the target.

Over-collecting. Evidence you did not scope is evidence you may legally need to disclose or ethically need to consider. Collect what answers the question; resist the urge to hoover.

Relying on live links in the final report. The page you cited will move or disappear. Every citation should point to an archive, with the original URL noted for transparency.
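The archive-first citation convention is simple enough to enforce with a one-line formatter. The layout below is illustrative, not a citation standard; the point is only that the archive link leads and the original URL is preserved for transparency.

```python
def citation(original_url: str, archive_url: str, captured_utc: str) -> str:
    """Render a citation that points at the archive copy first,
    keeping the original URL and capture time for transparency."""
    return f"{archive_url} (original: {original_url}, captured {captured_utc})"
```

Generating citations from the capture log, rather than by hand, also guarantees that nothing reaches the final report without an archived copy behind it.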

Deliverables checklist

By the end of collection you should have:

  • A complete capture log with timestamps, hashes, and archive URLs for every artefact.
  • Archived copies of every live source submitted to at least one independent archive.
  • Organised evidence folders keyed to the intelligence requirement from planning.
  • A running list of gaps — claims you cannot yet substantiate — to carry into analysis.
  • A written note of any access conditions, VPN use, or accounts that might affect reproducibility.

Collection ends not when you run out of sources but when the evidence against the plan is sufficient to make a defensible finding — or to conclude the question cannot be answered from open sources. Either outcome is legitimate.

Previous phase: Planning. Next phase: Analysis.