Following the Flight Logs
Domain: financial
Tools used
- metadata
- foia
- google-dorking
- wayback-machine
Outcome
A published methodology for turning a scanned log set into a searchable, hashed, and cross-referenced dataset suitable for long-form investigation.
This case study references publicly known investigations into the Epstein flight logs, which have been discussed and excerpted at length in filings in the Southern District of New York, in reporting by the Miami Herald, and in the long-running Epstein Revealed investigation series. All techniques described are for lawful, ethical use and rely only on material already in the public record.
Context
Flight-log records attached to public litigation, FOIA releases, and court exhibits regularly arrive as image-only PDFs of handwritten or typewritten pages. These documents are simultaneously evidentially heavy and technically hostile: OCR errors are frequent, columns shift across pages, and handwriting resists automated extraction. The documents referenced in the Epstein investigations are a widely cited example, with scanned logs forming a core part of the public record released through unsealed litigation.
This case study describes the methodology an investigator would apply to turn such a record into a structured, cross-referenced dataset. The workflow is generalisable to any large release of scanned, semi-structured logs.
Question
Given a release of scanned flight logs, can a researcher produce a structured, searchable dataset whose every row is traceable back to a specific page and hash in the original release?
Subquestions:
- What is the canonical source for each log page, and how is its integrity established?
- How are rows normalised (date, aircraft, origin, destination, passengers) when OCR is unreliable?
- How are passenger names cross-referenced against other public records without introducing errors?
Methodology
Planning. The investigator defined the scope narrowly: a specific released log set, a fixed release URL, a frozen working copy. The output would be a CSV with one row per flight-log entry, each row carrying a page reference and a hash of the source PDF.
Collection.
- The release was downloaded from the court docket and archived via Wayback Machine. The PDF was hashed with SHA-256 and the hash published alongside the working dataset.
- Metadata extraction on the PDF confirmed creator/producer fields consistent with courthouse scanning equipment. This is not authentication on its own, but it corroborates the document's stated provenance.
- Commercial OCR (high-accuracy, not free-tier) was applied to the PDF to produce a first-pass text layer. The OCR output was stored alongside the original, never replacing it.
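The hashing step above can be sketched in a few lines of Python. The demo writes a stand-in file so the snippet runs on its own; in practice the path would point at the frozen working copy of the release PDF.

```python
import hashlib
import os
import tempfile

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large scanned PDFs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a stand-in file; in practice, point this at the frozen release PDF
# and publish the digest alongside the working dataset.
fd, demo = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as f:
    f.write(b"%PDF-1.4 demo bytes")
digest = sha256_of(demo)
os.remove(demo)
```

Streaming matters here: court releases routinely run to hundreds of megabytes, and the digest must be computed over the exact bytes that were archived.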
Structured extraction.
- A row schema was defined before extraction began:
  date,tail_number,origin,destination,passenger_name_raw,passenger_name_canonical,source_page,source_hash,ocr_confidence,human_reviewed
- Every extracted row was reviewed against the scanned page. OCR output below a confidence threshold was flagged for manual transcription. No row entered the dataset without a page reference.
- Passenger names were normalised into a canonical form only after cross-referencing with at least two independent public sources — other court exhibits, published articles, public corporate filings. Ambiguous names remained in their raw form.
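The admission rule above (no row without a page reference; low-confidence OCR flagged for manual transcription) can be expressed as a small validation gate. Field names follow the schema in the text and the 0.90 threshold matches the evidence snapshot; the gate itself is an illustrative sketch, not the investigator's actual tooling.

```python
from dataclasses import dataclass

OCR_CONFIDENCE_THRESHOLD = 0.90  # below this, a human must transcribe the row

@dataclass
class LogRow:
    date: str
    tail_number: str
    origin: str
    destination: str
    passenger_name_raw: str
    passenger_name_canonical: str  # empty until two-source corroboration
    source_page: int
    source_hash: str
    ocr_confidence: float
    human_reviewed: bool = False

def admit(row: LogRow) -> LogRow:
    """Reject any row that cannot be traced back to the scanned release."""
    if not row.source_page or not row.source_hash:
        raise ValueError("row lacks a page reference or source hash")
    if row.ocr_confidence < OCR_CONFIDENCE_THRESHOLD and not row.human_reviewed:
        raise ValueError("low-confidence OCR row needs manual transcription")
    return row
```

Encoding the rule as a gate means a row can never silently enter the dataset: either it carries its provenance fields, or the pipeline stops.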
Cross-referencing. Canonical passenger names were matched against:
- Public corporate directorships via OpenCorporates and Companies House (company registry workflow).
- Published reporting, via Google dorking on the distinctive name and date combinations.
- Additional FOIA-released records using the FOIA request process for tangentially relevant agency files.
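The two-source rule can be sketched as a simple gate: a raw name maps to a canonical form only when at least two independent source categories support the identification. The function signature and source-category labels are placeholders, not part of the original workflow.

```python
def canonicalise(raw_name: str, evidence: dict) -> str:
    """Return a canonical name only if backed by >= 2 independent source types.

    evidence maps a candidate canonical name to the set of source categories
    (e.g. "court_exhibit", "published_article", "corporate_filing") that
    support that identification.
    """
    for canonical, sources in evidence.items():
        if len(sources) >= 2:
            return canonical
    return raw_name  # ambiguous names stay in their raw form

# One corroborated identification, one that stays raw:
assert canonicalise("J. Doe", {"Jane Doe": {"court_exhibit", "published_article"}}) == "Jane Doe"
assert canonicalise("J. D?e", {"Jane Doe": {"court_exhibit"}}) == "J. D?e"
```

Keeping the raw spelling as the fallback preserves the principle that canonicalisation is itself a sourced finding, never a default.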
Tools used
- Metadata extraction for source-PDF provenance.
- Wayback Machine for canonical source preservation.
- Google dorking for locating existing reporting on specific entries.
- FOIA request process for ancillary records.
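Dork queries on distinctive name and date combinations can be generated programmatically rather than typed by hand; the helper below and its example values are illustrative, not drawn from the dataset.

```python
def dork(name: str, date: str, site: str = "") -> str:
    """Build a quoted-phrase search query for a name + date combination.

    Quoting both terms forces exact-phrase matching; an optional site:
    operator restricts results to one domain (e.g. a court or news site).
    """
    q = f'"{name}" "{date}"'
    if site:
        q += f" site:{site}"
    return q
```

A batch of such queries, one per candidate row, makes the reporting cross-check systematic instead of ad hoc.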
Evidence snapshot
Release PDF SHA-256 published alongside dataset.
Each dataset row carries source_page and source_hash. Rows with OCR confidence below 0.90 were human-transcribed. Names entered canonical form only after two-source corroboration.
Findings
- The methodology above produces a dataset whose every claim is traceable back to the scanned page and the hashed release.
- OCR alone is not sufficient; human review is the load-bearing step and must be built into the schema from the start.
- The strongest findings in such a dataset are almost always structural — frequency patterns, network edges, origin/destination clusters — rather than single-row revelations.
- Published methodology writeups matter as much as the dataset itself. An investigation without a reproducible method cannot be defended.
Lessons learned
- Provenance and hashing must come before analysis. A dataset without its source hash is a liability.
- Canonicalisation is an investigative act. Deciding that two differently spelled names refer to the same person is a finding in itself and must be sourced.
- Release dates are not capture dates. The fact that a log was released in year X does not mean it was made in year X; every row needs an independent date anchor.
- Separate the dataset from the interpretation. Publish the dataset with its methodology so that others can re-analyse it independently.
Ethical considerations
All techniques described here rely solely on material already in the public record and are intended for lawful, ethical use. For the long-form treatment, see the Epstein Revealed investigation series; for processing scanned, semi-structured legal document sets at scale, see the Subthesis legal document analysis tool.