Metadata Extraction: Document and Image Metadata Analysis

Extract and interpret metadata from documents and images (EXIF, XMP, PDF /Info dictionaries, Office core properties), then build a defensible metadata log.

Most files carry a second layer: metadata the creator did not intend to publish. Camera model, original filename, author, editing software, GPS coordinates, and revision history are often written into photos and documents by default. For an investigator, extracting that layer can be a five-minute shortcut to a lead that a week of open-source searching would not find.

Who this is for

Intermediate

Journalists verifying leaked documents, researchers analysing document trails, civic investigators documenting evidence.

What you'll need

  • ExifTool installed (https://exiftool.org/) — the canonical metadata tool.
  • Poppler utilities (pdfinfo, pdftotext) for PDFs.
  • Optional: MAT2 for metadata stripping (useful to understand what a privacy-aware subject would have removed).
  • A scratch directory you will delete after the case — metadata often contains personal data.

How it works

Metadata formats vary by file type. Images use EXIF (camera), IPTC (editorial), and XMP (cross-application). PDFs carry an /Info dictionary, XMP streams, and, in many cases, an embedded document history. Office files (.docx, .xlsx) are ZIP archives whose docProps/ directory stores author, company, last editor, revision count, and sometimes the total editing time. Embedded JPEG thumbnails can even preserve a pre-edit version of an image that was later cropped or blurred.
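Because an Office file is an ordinary ZIP archive, its core properties can also be read with a few lines of Python. The following is a minimal sketch using only the standard library; the helper name read_core_properties and the choice of fields are illustrative assumptions, not part of any tool mentioned here.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Label in the result -> namespaced element to look up.
FIELDS = {
    "creator": "dc:creator",
    "lastModifiedBy": "cp:lastModifiedBy",
    "revision": "cp:revision",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(path):
    """Return selected core properties from an Office file (hypothetical helper).

    The archive is opened read-only; missing properties come back as None.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    props = {}
    for label, tag in FIELDS.items():
        node = root.find(tag, NS)
        props[label] = node.text if node is not None else None
    return props
```

Calling read_core_properties("suspect.docx") returns a dictionary that can be copied into the case notes. A file with no docProps/core.xml raises KeyError, which is itself worth recording.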

Step-by-step walkthrough

  1. Hash the original before touching it.

    sha256sum suspect.pdf > suspect.pdf.sha256
    

    Metadata extraction is read-only with ExifTool's defaults, but chain-of-custody requires the pre-analysis hash.
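To confirm later that the working copy still matches the recorded hash, the comparison can be scripted. This is a minimal sketch assuming the .sha256 file holds a single line in the usual sha256sum layout (hex digest, whitespace, filename); both function names are illustrative.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large evidence files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, sha256_file):
    """Compare the file's current hash with the digest stored by sha256sum.

    Assumes the first whitespace-separated field on the first line is the digest.
    """
    with open(sha256_file, "r", encoding="utf-8") as handle:
        recorded = handle.readline().split()[0]
    return sha256_of(path) == recorded.lower()
```

For example, matches_recorded_hash("suspect.pdf", "suspect.pdf.sha256") should return True both before and after the extraction steps that follow.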

  2. Run ExifTool on anything.

    exiftool -a -G1 -s suspect.jpg
    exiftool -a -G1 -s suspect.pdf
    exiftool -a -G1 -s suspect.docx
    

    -a keeps duplicate tags, -G1 prefixes each tag with its specific (family 1) group, such as IFD0, ExifIFD, or XMP-dc, and -s prints short tag names instead of descriptions.

  3. For images, read the trio of tag groups.

    • EXIF: Make, Model, DateTimeOriginal, GPSLatitude, GPSLongitude, LensModel.
    • IPTC: By-line, Caption-Abstract, Keywords.
    • XMP: editing software, history, and any embedded rights metadata.

    Extract the GPS position as decimal degrees for pasting into a mapping tool:

    exiftool -c "%.6f" -GPSPosition suspect.jpg
    
  4. For PDFs, layer three lookups.

    pdfinfo suspect.pdf
    exiftool suspect.pdf
    pdftotext -layout suspect.pdf suspect.txt
    

    The Creator and Producer fields often reveal which software generated the PDF, which can corroborate or contradict the ostensible author's account of how the document was made. For PDFs assembled from scanned pages, these fields can name the scanning software or device model, a useful corroborator of provenance.

  5. For Office files, unzip.

    unzip -p suspect.docx docProps/core.xml
    unzip -p suspect.docx docProps/app.xml
    

    core.xml carries creator, lastModifiedBy, revision, and timestamps. app.xml carries Company and Application. Tracked changes may still be present in word/document.xml (as w:ins and w:del elements) when the markup was only hidden from view rather than accepted or rejected.
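Leftover tracked changes can be listed without opening the file in a word processor. The sketch below is an illustration under stated assumptions: it looks only at w:ins and w:del elements in word/document.xml and reports their author and date attributes; the function name tracked_changes is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in word/document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tracked_changes(path):
    """List insertion and deletion markers that remain in a .docx file.

    Returns one dict per w:ins / w:del element with its kind, author, and date.
    An empty list means no such markers were found in word/document.xml.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    changes = []
    for kind in ("ins", "del"):
        for node in root.iter(f"{{{W}}}{kind}"):
            changes.append({
                "kind": kind,
                "author": node.get(f"{{{W}}}author"),
                "date": node.get(f"{{{W}}}date"),
            })
    return changes
```

An author name here that differs from creator or lastModifiedBy in core.xml is a lead worth following up, not a conclusion.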

  6. Check thumbnails for pre-edit state.

    exiftool -b -ThumbnailImage suspect.jpg > thumb.jpg
    

    Compare thumb.jpg with the main image. Discrepancies suggest the image was edited after capture by software that did not regenerate the thumbnail.

  7. Document the extraction in a metadata log. For every file: filename, hash, extraction tool and version, extraction timestamp, and the specific tags that matter to the investigation. Keep the log separate from the files themselves.
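One way to keep such a log consistent is to append each entry to a CSV file stored outside the evidence directory. The sketch below is only one possible layout: the column names, the helper names log_entry and append_to_log, and the CSV format are assumptions for illustration. The tool version string is passed in by the caller (ExifTool prints its version with exiftool -ver).

```python
import csv
import hashlib
import os
from datetime import datetime, timezone

# Illustrative column layout; adapt it to your own case-log conventions.
LOG_FIELDS = ["filename", "sha256", "tool", "tool_version", "extracted_at_utc", "tags"]


def log_entry(path, tool, tool_version, tags):
    """Build one log row: file name, hash, tool details, UTC timestamp, key tags."""
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    return {
        "filename": os.path.basename(path),
        "sha256": digest,
        "tool": tool,
        "tool_version": tool_version,
        "extracted_at_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        # Only the tags that matter to the investigation, flattened to one cell.
        "tags": "; ".join(f"{k}={v}" for k, v in sorted(tags.items())),
    }


def append_to_log(log_path, entry):
    """Append a row to the CSV log, writing the header if the log is new."""
    new_file = not os.path.exists(log_path)
    with open(log_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(entry)
```

Appending rather than rewriting keeps earlier entries untouched, which makes the log easier to defend if the extraction is ever questioned.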

Common pitfalls

  • Treating metadata as proof. Metadata can be edited trivially. It is a lead and a corroborator, not a standalone fact.
  • Stripping before extracting. Any tool that "cleans up" a file destroys evidence. Work only on copies, and keep the original untouched and hashed.
  • Ignoring XMP on PDFs. Many investigators stop at pdfinfo. XMP on a PDF often contains the richer author and history data.
  • Trusting timestamps uncritically. Device clocks drift, time zones are stored inconsistently, and scanning software sometimes stamps the scan time rather than the document's actual date.
  • Leaking metadata in your own publications. The investigator who cannot strip metadata cleanly from a redacted release often exposes sources. Test every export before publication.
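The timestamp pitfall is easier to handle when every time is converted to UTC before comparison. As a rough sketch: EXIF DateTimeOriginal is written as "YYYY:MM:DD HH:MM:SS" with no time zone, and newer files may carry the offset separately in OffsetTimeOriginal. The function name exif_to_utc is hypothetical, and returning None for an unknown offset is a design choice of this example, not a convention of any tool.

```python
from datetime import datetime, timedelta, timezone


def exif_to_utc(date_time_original, offset=None):
    """Convert an EXIF local timestamp plus a "+HH:MM" offset to UTC.

    Returns None when no offset is available, because the local time alone
    cannot be placed on a UTC timeline without guessing the time zone.
    """
    naive = datetime.strptime(date_time_original, "%Y:%m:%d %H:%M:%S")
    if offset is None:
        return None
    sign = -1 if offset.startswith("-") else 1
    hours, minutes = offset.lstrip("+-").split(":")
    tz = timezone(sign * timedelta(hours=int(hours), minutes=int(minutes)))
    return naive.replace(tzinfo=tz).astimezone(timezone.utc)


print(exif_to_utc("2023:05:01 14:22:10", "+02:00"))  # 2023-05-01 12:22:10+00:00
```

Refusing to guess a missing offset keeps the uncertainty visible in the case log instead of hiding it inside a converted value.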

Verifying your findings

A metadata-derived claim needs independent corroboration. GPS placing a photo in a city is stronger when the visible architecture in the photo matches; an Office "creator" field is stronger when the same name appears on unrelated public filings. Document both the metadata finding and the corroborator in your case log. See the analysis phase guide.

Apply this in practice

The flight logs case study relies on systematic metadata extraction across a large document set. For analysing document collections at scale with structured outputs, use the Subthesis legal document analysis tool.