Understanding PDF Structure and Content Streams

May 14

PDF files look simple on the surface, but internally they are highly structured documents built from objects, streams, and drawing instructions. If you're working with tools like DynaPDF's parser functions, understanding how PDFs are organized is essential.

1. The High-Level Structure of a PDF

A PDF file consists of four main parts:

Header – Defines the PDF version (e.g., %PDF-1.7)
Body – Contains all objects (pages, fonts, images, etc.)
Cross-reference table (xref) – Maps each object number to its byte offset in the file
Trailer – Points to the root object and metadata

Nearly everything in the body is stored as an indirect object, identified by an object number and a generation number (for example, 5 0 obj), so that other objects can refer to it as 5 0 R.
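
To make the four parts concrete, here is a minimal Python sketch that assembles a tiny PDF by hand and then finds the header, the xref offset, and the root reference again. The helper names build_minimal_pdf and locate_parts are made up for this illustration, and the regular expressions only cope with a simple file like this one, not with arbitrary PDFs.

```python
import re

def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF and record each object's byte offset."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.7\n")                      # header
    offsets = []
    for num, body in enumerate(objects, start=1):       # body
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)                                 # cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                     # entry for object 0 (free)
    for off in offsets:
        out += b"%010d 00000 n \n" % off                # each entry is 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos       # pointer to the xref table
    return bytes(out)

def locate_parts(pdf: bytes) -> dict:
    """Find the version, the xref offset, and the root reference."""
    version = re.match(rb"%PDF-(\d\.\d)", pdf).group(1).decode()
    xref_offset = int(re.search(rb"startxref\s+(\d+)\s+%%EOF", pdf).group(1))
    num, gen = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", pdf).group(1, 2)
    return {"version": version, "xref_offset": xref_offset,
            "root": (int(num), int(gen))}

pdf = build_minimal_pdf()
parts = locate_parts(pdf)
print(parts)
```

A reader starts at the end of the file: startxref gives the position of the xref table, the trailer names the root object, and the xref entries give the byte offset of every object in the body.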

2. Objects in a PDF

Objects are the building blocks of a PDF. Common object types include:

Dictionaries
Arrays
Strings
Numbers
Streams

For example, a page itself is just a dictionary object referencing other objects:

<<
  /Type /Page
  /Parent 2 0 R
  /Contents 5 0 R
  /Resources 6 0 R
>>

The important part here is /Contents: this is where the actual drawing instructions live.
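
As a rough illustration of how such a dictionary points at other objects, the following Python sketch pulls the indirect references out of the page dictionary shown above. The regular expression and the indirect_refs helper are assumptions for this example only. A real parser has to handle nested dictionaries, arrays, and comments, which this does not.

```python
import re

PAGE_DICT = """<<
  /Type /Page
  /Parent 2 0 R
  /Contents 5 0 R
  /Resources 6 0 R
>>"""

# Matches "/Key <object number> <generation number> R"
REF = re.compile(r"/(\w+)\s+(\d+)\s+(\d+)\s+R\b")

def indirect_refs(dict_text: str) -> dict:
    """Map each key to the (object number, generation number) it references."""
    return {key: (int(num), int(gen)) for key, num, gen in REF.findall(dict_text)}

refs = indirect_refs(PAGE_DICT)
print(refs["Contents"])  # the object that holds the drawing instructions
```

The /Type entry is skipped because its value is a name (/Page), not a reference to another object.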

3. What is a Content Stream?

A content stream is a special type of object that contains instructions describing how to render a page. These instructions are written in a compact, stack-based syntax similar to PostScript.

A content stream looks like this internally:

5 0 obj
<< /Length 17 >>
stream
0 0 m
100 100 l
S
endstream
endobj

This example draws a line from (0,0) to (100,100).
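
The /Length entry has to match the number of bytes of stream data between the stream and endstream keywords, not counting the line breaks that delimit them. A minimal Python sketch of wrapping drawing instructions into a stream object, assuming single line-feed line endings and no compression filter (make_stream_object is a made-up helper name):

```python
def make_stream_object(obj_num: int, data: bytes) -> bytes:
    """Wrap raw content-stream bytes in an uncompressed stream object."""
    return (b"%d 0 obj\n<< /Length %d >>\nstream\n%s\nendstream\nendobj\n"
            % (obj_num, len(data), data))

# The three instructions from the example, separated by line feeds
content = b"0 0 m\n100 100 l\nS"
stream_object = make_stream_object(5, content)
print(stream_object.decode("ascii"))
```

In real files the stream data is usually compressed (for example with /Filter /FlateDecode), in which case /Length is the size of the compressed bytes.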

4. Operators Inside Content Streams

Content streams consist of operators and operands, written in postfix order: the operands come first, followed by the operator that consumes them.

m → MoveTo
l → LineTo
c → CurveTo
re → Rectangle
S → Stroke path
f → Fill path
Tj → Show text

Each operator modifies the drawing state or produces visible output.
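
To show how operands and operators pair up, here is a small Python sketch that splits a content stream into (operator, operands) steps. It is deliberately simplified: it assumes whitespace-separated tokens and numeric operands only, so strings, names, arrays, and inline images would need a proper tokenizer. The parse_operations helper is a name invented for this example.

```python
OPERATOR_NAMES = {
    "m": "MoveTo", "l": "LineTo", "c": "CurveTo", "re": "Rectangle",
    "S": "Stroke path", "f": "Fill path", "Tj": "Show text",
}

def parse_operations(stream: str) -> list:
    """Group the numeric operands with the operator that follows them."""
    operations, operands = [], []
    for token in stream.split():
        try:
            operands.append(float(token))         # a number: keep collecting
        except ValueError:
            operations.append((token, operands))  # an operator: close the group
            operands = []
    return operations

for op, args in parse_operations("0 0 m\n100 100 l\nS"):
    print(OPERATOR_NAMES.get(op, op), args)
```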

Content streams are the heart of a PDF page. Everything visible—text, shapes, images—comes from these instructions.

By analyzing them, you can:

Extract text or vector graphics
Remove unwanted elements
Modify drawings
Rebuild page layouts

5. How DynaPDF Represents Content

When using DynaPDF.Parser.Content, these low-level instructions are converted into structured JSON. This makes it far easier to analyze or modify a page programmatically.

For example, a simple path might become:

{
  "Operator": "DrawPath",
  "OPNames": ["MoveTo", "LineTo"],
  "Vertices": [
    { "x": 0, "y": 0 },
    { "x": 165, "y": 0.5 }
  ],
  "Mode": 1,
  ...
}

Instead of parsing raw PDF syntax, you now work with clean data:

Operator – High-level command
Vertices – Geometry points
Mode – Stroke/fill behavior
OPNames – Underlying PDF operators
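
Once the data is JSON, ordinary tooling is enough to inspect it. The Python sketch below computes a bounding box for each path. It assumes the JSON is an array of entries shaped like the example above and uses only the fields shown there. The sample string and the bounding_box helper are made up for illustration, and the real output may contain further fields.

```python
import json

# Hypothetical sample, reduced to the fields shown in the example above
SAMPLE = """[
  {"Operator": "DrawPath",
   "OPNames": ["MoveTo", "LineTo"],
   "Vertices": [{"x": 0, "y": 0}, {"x": 165, "y": 0.5}],
   "Mode": 1}
]"""

def bounding_box(entry: dict) -> tuple:
    """Return (min_x, min_y, max_x, max_y) over the entry's vertices."""
    xs = [v["x"] for v in entry["Vertices"]]
    ys = [v["y"] for v in entry["Vertices"]]
    return (min(xs), min(ys), max(xs), max(ys))

entries = json.loads(SAMPLE)
paths = [e for e in entries if e.get("Operator") == "DrawPath"]
boxes = [bounding_box(e) for e in paths]
print(boxes)
```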

6. Editing Content Streams

With DynaPDF, the workflow typically looks like this:

Parse the page with DynaPDF.Parser.ParsePage.
Retrieve the JSON via DynaPDF.Parser.Content, optionally filtering by operator (e.g., "DrawPath").
Mark entries for deletion with DynaPDF.Parser.Delete.
Use the DynaPDF.Parser.FindText and DynaPDF.Parser.ReplaceSelText functions to search and replace text.
Write the changes back to the page with the DynaPDF.Parser.WriteToPage function.

This allows precise control over individual drawing commands instead of rewriting the entire document.
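
The step between retrieving the JSON and marking entries for deletion is deciding which entries to remove. That decision can be sketched in plain Python, independent of the plugin calls themselves. The example assumes entries shaped like the JSON shown earlier and picks every path that lies entirely inside a horizontal band, such as a footer rule. The positions_in_band helper, the "ShowText" operator name, and the zero-based positions are assumptions. How the selection is passed to DynaPDF.Parser.Delete depends on that function's parameters, which are not covered here.

```python
def positions_in_band(entries: list, y_min: float, y_max: float) -> list:
    """Positions of DrawPath entries whose vertices all lie in [y_min, y_max]."""
    hits = []
    for pos, entry in enumerate(entries):
        if entry.get("Operator") != "DrawPath":
            continue
        ys = [v["y"] for v in entry.get("Vertices", [])]
        if ys and all(y_min <= y <= y_max for y in ys):
            hits.append(pos)
    return hits

entries = [
    {"Operator": "DrawPath", "Vertices": [{"x": 0, "y": 0}, {"x": 165, "y": 0.5}]},
    {"Operator": "DrawPath", "Vertices": [{"x": 0, "y": 300}, {"x": 165, "y": 300}]},
    {"Operator": "ShowText"},  # hypothetical non-path entry, ignored by the filter
]
to_delete = positions_in_band(entries, 0, 20)
print(to_delete)
```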

7. Mental Model: How a PDF Page is Rendered

Think of a PDF page like a script executed step-by-step:

Set graphics state (color, line width, font)
Define paths (MoveTo, LineTo, etc.)
Draw them (stroke/fill)
Render text
Place images

Each instruction builds on the previous state, which is why order matters.
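
A short Python sketch shows why order matters. It tracks a single piece of graphics state, the line width set by the w operator, and reports which width is in effect at each stroke. This is a toy interpreter for illustration only: it assumes whitespace-separated tokens and ignores everything except w and S. The stroke_widths name is made up.

```python
def stroke_widths(stream: str) -> list:
    """Return the line width in effect at each stroke (S) operator."""
    line_width = 1.0                  # default line width in PDF
    operands, widths = [], []
    for token in stream.split():
        try:
            operands.append(float(token))
            continue                  # a number: wait for its operator
        except ValueError:
            pass
        if token == "w":              # set line width
            line_width = operands[-1]
        elif token == "S":            # stroke with the current state
            widths.append(line_width)
        operands = []
    return widths

width_first = "5 w 0 0 m 100 0 l S 0 10 m 100 10 l S"
width_later = "0 0 m 100 0 l S 5 w 0 10 m 100 10 l S"
print(stroke_widths(width_first))
print(stroke_widths(width_later))
```

Both streams contain the same instructions, but in the second one the first line is stroked before the width changes, so it is drawn with the default width.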

Conclusion

A PDF is not just a static document—it’s a sequence of drawing commands stored in structured objects. The content stream is where the real action happens, and tools like DynaPDF expose this layer in a developer-friendly way.

Once you understand content streams, manipulating PDFs becomes far more predictable and powerful.