Metadata Extraction: Document and Image Metadata Analysis
Extract and interpret metadata from documents and images: EXIF, XMP, PDF /Info dictionaries, Office core properties, and building a defensible metadata log.
Every file carries a second layer: the metadata the creator did not intend to publish. Camera model, original filename, author, editing software, GPS, and revision history sit inside photos and documents by default. For an investigator, extracting that layer is often a five-minute shortcut to a lead that a week of open-source search would not find.
Who this is for
Intermediate
Journalists verifying leaked documents, researchers analysing document trails, civic investigators documenting evidence.
What you'll need
- ExifTool installed (https://exiftool.org/), the canonical metadata tool.
- Poppler utilities (pdfinfo, pdftotext) for PDFs.
- Optional: MAT2 for metadata stripping (useful to understand what a privacy-aware subject would have removed).
- A scratch directory you will delete after the case — metadata often contains personal data.
How it works
Metadata formats vary by file type. Images use EXIF (camera), IPTC (editorial), and XMP (cross-application). PDFs carry an /Info dictionary, XMP streams, and, for many PDFs, an embedded document history. Office files (.docx, .xlsx) are ZIP archives whose docProps/ directory stores author, company, last editor, revision count, and sometimes the total editing time. JPEG thumbnails can even preserve a pre-edit version of an image that was cropped or blurred.
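To make the Office case concrete, here is a minimal sketch that opens a .docx as a ZIP archive and reads a few fields from docProps/core.xml using only the Python standard library. The function names (read_core_properties, make_sample_docx) and the sample values are hypothetical and exist only for illustration. The namespaces are the standard ones for OOXML core properties.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Standard namespaces used inside docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> element path inside core.xml.
FIELDS = {
    "creator": "dc:creator",
    "lastModifiedBy": "cp:lastModifiedBy",
    "revision": "cp:revision",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(source):
    """Return selected core properties from an Office file (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    props = {}
    for name, path in FIELDS.items():
        node = root.find(path, NS)
        props[name] = node.text if node is not None else None  # None = field absent
    return props


def make_sample_docx():
    """Build an in-memory archive holding only docProps/core.xml (made-up values)."""
    core = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}" '
        f'xmlns:dcterms="{NS["dcterms"]}">'
        "<dc:creator>A. Example</dc:creator>"
        "<cp:lastModifiedBy>B. Example</cp:lastModifiedBy>"
        "<cp:revision>7</cp:revision>"
        "<dcterms:created>2024-01-02T03:04:05Z</dcterms:created>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", core)
    return buffer


props = read_core_properties(make_sample_docx())
print(props)
```

On a real case file, pass the path of a working copy instead of the sample archive. A creator that differs from lastModifiedBy, or a high revision count on a supposedly fresh document, is the kind of lead worth noting.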
Step-by-step walkthrough
- Hash the original before touching it.
    sha256sum suspect.pdf > suspect.pdf.sha256
  Metadata extraction is read-only with ExifTool's defaults, but chain of custody requires the pre-analysis hash.
- Run ExifTool on anything.
    exiftool -a -G1 -s suspect.jpg
    exiftool -a -G1 -s suspect.pdf
    exiftool -a -G1 -s suspect.docx
  -a shows duplicate tags, -G1 shows the specific group of each tag, and -s prints tag names instead of descriptions.
- For images, read the trio of tag groups.
  - EXIF: Make, Model, DateTimeOriginal, GPSLatitude, GPSLongitude, LensModel.
  - IPTC: By-line, Caption-Abstract, Keywords.
  - XMP: editing software, history, and any embedded rights metadata.
  Extract the GPS position in decimal degrees for pasting into a mapping tool:
    exiftool -c "%.6f" -GPSPosition suspect.jpg
- For PDFs, layer three lookups.
    pdfinfo suspect.pdf
    exiftool suspect.pdf
    pdftotext -layout suspect.pdf suspect.txt
  The Creator and Producer fields often reveal which software generated the PDF, which frequently matches or contradicts the ostensible author. For PDFs assembled from scanned pages, Creator can identify the specific scanner model, a strong corroborator of provenance.
- For Office files, unzip.
    unzip -p suspect.docx docProps/core.xml
    unzip -p suspect.docx docProps/app.xml
  core.xml carries creator, lastModifiedBy, revision, and timestamps. app.xml carries Company and Application. Tracked changes may still be present in word/document.xml even when they are hidden from view.
- Check thumbnails for pre-edit state.
    exiftool -b -ThumbnailImage suspect.jpg > thumb.jpg
  Compare thumb.jpg with the main image. Discrepancies suggest post-capture editing.
- Document the extraction in a metadata log. For every file: filename, hash, extraction tool and version, extraction timestamp, and the specific tags that matter to the investigation. Keep the log separate from the files themselves.
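One way the metadata log from the last step might be kept is as a JSON Lines file with one entry per extraction. The sketch below is an assumption about layout, not a prescribed format: the function names (sha256_of, exiftool_tags, log_entry, append_to_log) and the field names are hypothetical. The only ExifTool options used are -ver, -j (JSON output), -a, and -G1, and exiftool_tags assumes the exiftool binary is on your PATH.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Hash the file in chunks so large evidence files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exiftool_tags(path):
    """Run ExifTool read-only; return (version, tags). Assumes exiftool is installed."""
    version = subprocess.run(
        ["exiftool", "-ver"], capture_output=True, text=True, check=True
    ).stdout.strip()
    output = subprocess.run(
        ["exiftool", "-j", "-a", "-G1", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    return version, json.loads(output)[0]  # -j returns a list with one object per file


def log_entry(path, tool, version, tags, wanted):
    """Build one log record holding only the tags that matter to the investigation."""
    return {
        "filename": Path(path).name,
        "sha256": sha256_of(path),
        "tool": tool,
        "tool_version": version,
        "extracted_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "tags": {key: tags[key] for key in wanted if key in tags},
    }


def append_to_log(log_path, entry):
    """Append the record as one JSON line; existing lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")
```

A typical call sequence would be `version, tags = exiftool_tags("suspect.pdf")`, then `append_to_log("case.jsonl", log_entry("suspect.pdf", "exiftool", version, tags, ["PDF:Creator", "PDF:Producer"]))`. The tag keys shown are illustrative; use whichever group-qualified names your own ExifTool output contains. Keeping the log append-only makes it easier to show that earlier entries were not altered.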
Common pitfalls
- Treating metadata as proof. Metadata can be edited trivially. It is a lead and a corroborator, not a standalone fact.
- Stripping before extracting. Any tool that "cleans up" a file destroys evidence. Work only on copies, and keep the original untouched and hashed.
- Ignoring XMP on PDFs. Many investigators stop at pdfinfo. XMP on a PDF often contains the richer author and history data.
- Trusting timestamps uncritically. Device clocks drift, time zones are stored inconsistently, and scanning software sometimes stamps the scan time rather than the document's actual date.
- Leaking metadata in your own publications. The investigator who cannot strip metadata cleanly from a redacted release often exposes sources. Test every export before publication.
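The time zone pitfall can be illustrated with EXIF timestamps. DateTimeOriginal is stored as local wall-clock time in the form YYYY:MM:DD HH:MM:SS with no zone, and a separate OffsetTimeOriginal tag (for example +02:00) is present only on some newer devices. The sketch below, with the hypothetical helper names parse_offset and normalise_exif_time, converts to UTC when an offset exists and otherwise flags the value as zone-unknown rather than guessing.

```python
from datetime import datetime, timedelta, timezone

EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"


def parse_offset(offset):
    """Turn an EXIF offset string such as '+02:00' or '-05:00' into a tzinfo object."""
    sign = -1 if offset.startswith("-") else 1
    hours, minutes = offset.lstrip("+-").split(":")
    return timezone(sign * timedelta(hours=int(hours), minutes=int(minutes)))


def normalise_exif_time(date_time_original, offset_time_original=None):
    """Return (iso_string, has_zone).

    With an offset, the result is converted to UTC. Without one, the local
    wall-clock value is returned unchanged and has_zone is False, so the log
    records that the zone is unknown instead of silently assuming one.
    """
    local = datetime.strptime(date_time_original, EXIF_FORMAT)
    if not offset_time_original:
        return local.isoformat(), False
    aware = local.replace(tzinfo=parse_offset(offset_time_original))
    return aware.astimezone(timezone.utc).isoformat(), True


print(normalise_exif_time("2023:05:14 09:30:00", "+02:00"))
print(normalise_exif_time("2023:05:14 09:30:00"))
```

Recording the has_zone flag next to each timestamp keeps the uncertainty visible when timelines from several devices are later merged.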
Verifying your findings
A metadata-derived claim needs independent corroboration. GPS placing a photo in a city is stronger when the visible architecture in the photo matches; an Office "creator" field is stronger when the same name appears on unrelated public filings. Document both the metadata finding and the corroborator in your case log. See the analysis phase guide.
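For the GPS example, one simple corroboration check is the distance between the coordinates in the metadata and a location you identified independently, for instance from visible architecture. The sketch below assumes the position string looks like `48.858400 N, 2.294500 E`, which is an assumption about what a decimal-degree coordinate format produces; adjust the pattern to your actual output. The function names and the sample coordinates are hypothetical.

```python
import math
import re

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def parse_gps_position(text):
    """Parse a string like '48.858400 N, 2.294500 E' into signed (lat, lon) degrees."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([NS]),\s*([\d.]+)\s*([EW])\s*", text)
    if match is None:
        raise ValueError(f"unrecognised GPS position: {text!r}")
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    return lat, lon


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


from_metadata = parse_gps_position("48.858400 N, 2.294500 E")  # made-up sample value
from_imagery = (48.8606, 2.3376)  # hypothetical independently identified location
print(f"{haversine_km(from_metadata, from_imagery):.2f} km apart")
```

A small distance does not prove the photo was taken there, since GPS tags can be edited, but a large one is a clear signal that the metadata and the visual evidence disagree and that both need a closer look.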
Related tutorials
- Reverse image search to pair image metadata with external appearances.
- FOIA request process for processing released PDFs that arrive with rich metadata.
- Company registry searches for cross-referencing authorship fields to organisations.
Apply this in practice
The "Following the flight logs" case study relies on systematic metadata extraction across a large document set. For analysing document collections at scale with structured outputs, use the Subthesis legal document analysis tool.