reader.me

PDF glossary: terms and formats

What every PDF term and format actually means, in plain language. The jargon you run into, explained.

Formats

Concepts

OCR

OCR (Optical Character Recognition) turns a picture of text into actual, selectable characters. A scanned page or a photo of a document is, to a computer, just a grid of pixels: there is no text in it, only an image that happens to look like words. OCR analyses the shapes of the letters and rebuilds the underlying string of characters.

AcroForm

An AcroForm is PDF's native, built-in form technology, the kind of interactive form that has been part of the format since the late 1990s. The fillable fields you see in a tax return or an application form (text boxes, checkboxes, radio buttons, dropdowns and signature fields) are AcroForm objects defined directly in the PDF's object structure.

XFA

XFA (XML Forms Architecture) is Adobe's alternative form technology, in which the form is defined not by native PDF objects but by an XML payload embedded inside the PDF wrapper. It was designed for complex, dynamic forms: layouts that grow as you add rows, fields that appear or disappear based on earlier answers, and tight binding to back-end data schemas.

Metadata

Metadata is data about your data: the information a PDF carries beyond the visible page content. There are two main stores: the legacy Document Information Dictionary (title, author, subject, keywords, the software that created it, and creation and modification dates) and XMP, an XML-based block that holds the same fields plus richer, extensible properties.
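To make the XMP half concrete, here is a minimal sketch that reads two properties out of an XMP packet using only Python's standard library. The packet is a hand-written sample, not taken from a real file, and the title and tool name in it are invented for illustration. Pulling the packet out of an actual PDF is a separate step not shown here.

```python
import xml.etree.ElementTree as ET

# Hand-written sample XMP packet; the values are made up for illustration.
xmp_packet = """
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">Quarterly report</rdf:li></rdf:Alt>
      </dc:title>
      <xmp:CreatorTool>Example Writer 1.0</xmp:CreatorTool>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
"""

# Namespace prefixes used in the lookups below.
ns = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}

root = ET.fromstring(xmp_packet)
title = root.find(".//dc:title/rdf:Alt/rdf:li", ns).text
creator_tool = root.find(".//xmp:CreatorTool", ns).text

print("Title:", title)
print("Created with:", creator_tool)
```

Because XMP is ordinary XML, any XML parser can read it, which is part of what makes it more extensible than the older dictionary.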

Compression

Compression is what keeps PDF file sizes manageable, and a single document usually mixes several methods because it mixes several kinds of content. Text and vector drawing instructions compress losslessly with Flate (the same Deflate algorithm behind ZIP), so every character comes back exactly as it went in.
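The lossless round trip can be tried with Python's built-in zlib module, which implements the same Deflate compression. This is only a sketch: the sample bytes are an invented fragment in the style of PDF page instructions, repeated to give the compressor something to work with.

```python
import zlib

# Invented sample: a line of PDF-style text-drawing instructions, repeated 50 times.
content = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET\n" * 50

compressed = zlib.compress(content)      # Deflate, wrapped in the zlib format
restored = zlib.decompress(compressed)   # reverse the process

print(len(content), "bytes before,", len(compressed), "bytes after")
print("Identical after round trip:", restored == content)
```

The repeated text shrinks considerably, and decompressing returns exactly the bytes that went in.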

Embedded fonts

Embedded fonts are typefaces packaged inside the PDF itself rather than borrowed from the computer that opens it. This is the feature that makes PDF genuinely portable: if the font travels with the document, the text renders identically everywhere, even on a machine that has never had that typeface installed.

Text layer

The text layer is the part of a PDF that holds real, machine-readable characters: the content you can select with the cursor, copy, search and have read aloud. A PDF built from a word processor or page-layout app has this layer natively, with each character mapped to a position and a font.

Watermark

A watermark is text or an image laid over a PDF's pages to mark status or ownership: a faint "DRAFT" or "CONFIDENTIAL" running diagonally across the page, a company logo, or a copyright line. It signals intent without obscuring the underlying content, usually by being semi-transparent or sitting behind the main text.

Linearization (Fast Web View)

Linearization, marketed by Adobe as Fast Web View, is a way of reorganising a PDF's internal byte order so it can be displayed before the whole file has arrived. In a normal PDF the cross-reference table that indexes every object sits at the very end, so a viewer technically needs the complete file to know where things are.
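The reason a viewer has to look at the end first can be sketched in a few lines. A standard PDF finishes with the keyword startxref, followed by the byte offset of the cross-reference table, then %%EOF. The function name find_startxref and the toy byte string below are illustrative assumptions. The toy data only mimics the layout and is not a valid PDF.

```python
def find_startxref(pdf_bytes: bytes) -> int:
    """Read the cross-reference offset from the tail of the file."""
    tail = pdf_bytes[-1024:]              # the pointer lives near the very end
    marker = tail.rfind(b"startxref")
    if marker == -1:
        raise ValueError("no startxref keyword found near end of file")
    # The offset is the first number after the keyword.
    fields = tail[marker + len(b"startxref"):].split()
    return int(fields[0])


# Toy data that mimics the layout of a non-linearized PDF.
body = b"%PDF-1.7\n...objects...\n"
xref_offset = len(body)                   # the table starts right after the body
toy = (
    body
    + b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"
    + b"startxref\n" + str(xref_offset).encode() + b"\n%%EOF\n"
)

offset = find_startxref(toy)
print("Cross-reference table starts at byte", offset)
```

Nothing in the first bytes of the toy file says where the table is. Only the last few lines do, which is the problem linearization sets out to solve.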

Security

Images

Vector graphic

Vector graphics describe an image as mathematics (points, lines, curves and fills) rather than as a fixed grid of coloured dots. A circle is stored as a centre, a radius and a colour, so the computer redraws it at whatever size is asked for. The consequence is the defining property of vector art: it scales to any size with no loss of sharpness.
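The circle example can be written out as a small sketch. The Circle class and its field names are hypothetical, chosen to mirror the description above rather than any particular file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Circle:
    """A circle stored as a description, not as pixels."""
    cx: float      # centre, x
    cy: float      # centre, y
    r: float       # radius
    colour: str

    def scaled(self, factor: float) -> "Circle":
        # Scaling only multiplies the numbers; no detail is lost or invented.
        return Circle(self.cx * factor, self.cy * factor, self.r * factor, self.colour)


small = Circle(cx=10, cy=10, r=5, colour="red")
huge = small.scaled(100)

print(small)
print(huge)
```

The enlarged circle is still four values, just as exact as the original, which is why the renderer can draw it crisply at any size.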

Raster image

A raster image is a rectangular grid of pixels, each holding a colour value, the model behind every photograph and scan. Unlike a vector, a raster has a fixed native resolution: it stores exactly so many dots across and down, and all its detail is baked into that grid.
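A tiny sketch shows what "baked into the grid" means in practice. The function upscale_nearest is a hypothetical helper that enlarges a grid by repeating each pixel, one of the simplest ways to scale a raster. Real image software usually uses smoother methods, but none of them can recover detail that was never stored.

```python
def upscale_nearest(pixels, factor):
    """Enlarge a grid of pixel values by repeating each pixel factor x factor times."""
    out = []
    for row in pixels:
        # Repeat each value across the row...
        wide = [value for value in row for _ in range(factor)]
        # ...then repeat the widened row downwards.
        out.extend(list(wide) for _ in range(factor))
    return out


# A 2 x 2 greyscale image: 0 is black, 255 is white.
grid = [
    [0, 255],
    [255, 0],
]

big = upscale_nearest(grid, 2)
for row in big:
    print(row)
```

The result has four times as many pixels but exactly the same information: each original dot has simply become a larger square.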

JPG / JPEG

JPG (also written JPEG, after the Joint Photographic Experts Group that defined it) is the lossy raster format built for photographs. It works by transforming the image into frequency components and discarding the fine detail the human eye is least likely to miss, which is how it squeezes a full-colour photo into a small file.
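The idea of keeping the coarse frequencies and dropping the fine ones can be sketched on a single row of eight brightness values. This is a deliberate simplification: real JPEG works on 8 x 8 blocks in two dimensions and reduces the precision of the fine components rather than zeroing them outright. The function names dct and idct and the sample values are assumptions made for illustration.

```python
import math


def dct(values):
    """Turn a row of samples into frequency components (orthonormal DCT-II)."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(
            v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values)
        ))
    return out


def idct(coeffs):
    """Rebuild samples from frequency components (inverse of dct above)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(
            math.sqrt((1 if k == 0 else 2) / n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)
        ))
    return out


row = [52, 55, 61, 66, 70, 61, 64, 73]    # invented brightness values

coeffs = dct(row)
exact = idct(coeffs)                      # all components kept: lossless
kept = coeffs[:4] + [0.0] * 4             # discard the four finest components
approx = idct(kept)                       # close to the original, but not identical

print("Original:   ", row)
print("Approximate:", [round(v) for v in approx])
```

Keeping every component reproduces the row. Dropping the fine ones gives a smoother approximation that needs only half the numbers, and the discarded detail cannot be recovered afterwards.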

PNG

PNG (Portable Network Graphics) is the lossless raster format for graphics with sharp edges and flat colour: screenshots, logos, icons, diagrams and anything containing text. Lossless means it stores the image exactly: re-save it as often as you like and not a single pixel changes, the opposite of JPEG's generational decay.

WebP

WebP is an image format from Google that aims to replace both JPEG and PNG with one container. Its trick is supporting two modes: lossy compression for photographs, like JPEG, and lossless compression for graphics, like PNG, while typically producing smaller files than either at comparable quality.

TIFF

TIFF (Tagged Image File Format) is the heavyweight raster format used in archiving, scanning and professional imaging. Its name comes from its structure: a flexible set of tags describing the image, which lets a single TIFF hold uncompressed or losslessly compressed data, high bit depths, embedded colour profiles and a great deal of technical metadata.

SVG

SVG (Scalable Vector Graphics) is an open, XML-based vector format: an image written as readable text describing shapes, paths, colours and text. Because it is vector, it scales to any size with perfectly crisp edges, and because it is XML, it can be styled with CSS, animated, and even searched or edited in a plain text editor.
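Because an SVG is just text, it can be read and changed with ordinary XML tools. The sketch below uses Python's standard library on a one-circle drawing written for this example. The shape and colours are arbitrary.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)         # keep the output free of ns0: prefixes

# A complete SVG image: one teal circle on a 100 x 100 canvas.
svg_text = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
    '<circle cx="50" cy="50" r="40" fill="teal"/>'
    '</svg>'
)

root = ET.fromstring(svg_text)
circle = root.find("svg:circle", {"svg": SVG_NS})

print("Radius as stored in the file:", circle.get("r"))

# Recolour the circle by editing one attribute, then write the image back out.
circle.set("fill", "crimson")
edited = ET.tostring(root, encoding="unicode")
print(edited)
```

The same change could be made by opening the file in a text editor and replacing one word.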

DPI / PPI

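DPI (dots per inch) and PPI (pixels per inch) both measure resolution: how many dots or pixels of detail are packed into each inch. Strictly, DPI describes printed output and PPI describes a digital image, though the two terms are often used interchangeably. The higher the number, the finer the detail and the larger the file. It is the single setting that most often decides whether a scan or an export looks crisp or disappointing.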
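The link between resolution and file size is simple arithmetic: pixels equal inches multiplied by dots per inch. The sketch below works it out for an A4 page (210 mm x 297 mm). The helper name pixels_for is an assumption made for this example.

```python
MM_PER_INCH = 25.4


def pixels_for(width_mm, height_mm, dpi):
    """Pixel dimensions needed to cover a physical size at a given resolution."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


a4_at_150 = pixels_for(210, 297, 150)
a4_at_300 = pixels_for(210, 297, 300)

print("A4 at 150 DPI:", a4_at_150)   # (1240, 1754)
print("A4 at 300 DPI:", a4_at_300)   # (2480, 3508)
```

Doubling the DPI doubles both dimensions, so the page holds roughly four times as many pixels, which is why high-resolution scans grow so quickly.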