Convert PDF to Markdown using Claude Vision API – A Python Script

A Python Tool for Converting PDFs to Markdown Using Claude Vision

This Python script uses ImageMagick and Anthropic’s Claude vision capabilities to convert PDF documents into markdown files. It processes a PDF page by page, rendering each page as a 300 DPI PNG image and then asking Claude to extract and format the content.

Development Background

This tool was developed through an interactive session with Claude 3.5 Sonnet, Anthropic’s latest language model. The development process involved:

  • Initial script design and requirements gathering through conversation with Claude
  • Iterative development of the core functionality using Claude’s code generation capabilities
  • Integration of multiple tools: ImageMagick for PDF processing, Anthropic’s API for vision analysis, and Python’s standard libraries for file handling
  • Testing and refinement of the prompt used to instruct Claude’s vision system for optimal text extraction

The entire tool, including this documentation, was created through conversation with Claude. This demonstrates the potential of AI-assisted development for creating practical tools that combine multiple technologies.

Key Features

The script offers several powerful features:

  • High-resolution PDF to PNG conversion using ImageMagick
  • Intelligent text extraction and formatting using Claude Vision API
  • Document structure (headers, lists, and similar elements) preserved as markdown
  • Progress tracking and error handling
  • Temporary file cleanup
  • A fixed one-second pause between API calls to reduce the chance of hitting rate limits

Prerequisites

Before using the script, you’ll need:

  • Python 3.x
  • ImageMagick installed on your system, along with Ghostscript, which ImageMagick relies on to read PDFs
  • An Anthropic API key
  • The anthropic Python package

Installation

pip install anthropic
# On Ubuntu/Debian:
sudo apt-get install imagemagick
# On macOS:
brew install imagemagick
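
Before running the script, it can help to confirm that the prerequisites are actually in place. The following is a minimal sketch, not part of the original script: the function name `find_missing_prerequisites` is hypothetical, and the Ghostscript check assumes the binary is called `gs`, which is typical on Linux and macOS but not on Windows.

```python
import os
import shutil


def find_missing_prerequisites(env=None, which=shutil.which):
    """Return a list of problems; an empty list means the script should be able to run."""
    env = os.environ if env is None else env
    problems = []

    # ImageMagick 7 installs `magick`; ImageMagick 6 installs `convert`.
    if not (which("magick") or which("convert")):
        problems.append("ImageMagick not found on PATH (looked for 'magick' and 'convert')")

    # Assumption: Ghostscript is installed as `gs` (the usual name on Linux/macOS).
    if not which("gs"):
        problems.append("Ghostscript not found on PATH (looked for 'gs')")

    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")

    return problems


for problem in find_missing_prerequisites():
    print(f"Missing: {problem}")
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.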

The Script

#!/usr/bin/env python3
import anthropic
import argparse
import os
import subprocess
import base64
from pathlib import Path
import time

def convert_pdf_to_images(pdf_path, output_dir):
    """Convert PDF to PNG images using ImageMagick"""
    os.makedirs(output_dir, exist_ok=True)
    output_pattern = str(Path(output_dir) / 'page_%03d.png')
    
    # Render at 300 DPI for better OCR. On ImageMagick 7 the command is
    # 'magick'; 'convert' is the ImageMagick 6 name.
    cmd = ['convert', '-density', '300', pdf_path, '-quality', '100', output_pattern]
    subprocess.run(cmd, check=True)
    
    # Return sorted list of generated image paths
    return sorted(Path(output_dir).glob('page_*.png'))

def encode_image(image_path):
    """Encode image as base64"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def process_image_with_claude(client, image_path):
    """Send image to Claude and get text description"""
    base64_image = encode_image(image_path)
    
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": """Please analyze this page and:
1. Extract all text content, preserving the original structure and formatting in markdown
2. For any diagrams, charts, or technical figures:
   - Provide a detailed description of the visual content
   - If it's a flow chart, sequence diagram, or similar, recreate it using mermaid.js syntax
   - For graphs and charts, describe the data representation and key trends
   - For complex technical diagrams, break down the components and their relationships
3. For general images:
   - Provide a detailed description of what the image shows
   - Note any important details or context
   - Explain how the image relates to the surrounding text
4. Maintain the document's logical flow by placing image descriptions and diagram recreations at appropriate points in the text

Format everything in clean markdown, preserving headers, lists, and other formatting elements."""
                },
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64_image
                    }
                }
            ]
        }
    ]

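    # Model ID as originally published; substitute a current vision-capable
    # model if this one is no longer available to your account.
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=4096,
        messages=messages
    )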
    
    return response.content[0].text

def main():
    parser = argparse.ArgumentParser(description='Convert PDF to Markdown using Claude Vision')
    parser.add_argument('pdf_path', help='Path to input PDF file')
    parser.add_argument('--output', '-o', help='Output markdown file path')
    parser.add_argument('--temp-dir', default='temp_images', help='Directory for temporary image files')
    args = parser.parse_args()

    # Get API key from environment variable
    api_key = os.getenv('ANTHROPIC_API_KEY')
    if not api_key:
        raise ValueError("Please set ANTHROPIC_API_KEY environment variable")

    # The SDK client accepts api_key only as a keyword argument
    client = anthropic.Anthropic(api_key=api_key)
    
    # Set output path
    if not args.output:
        args.output = str(Path(args.pdf_path).with_suffix('.md'))

    try:
        # Convert PDF to images
        print("Converting PDF to images...")
        image_paths = convert_pdf_to_images(args.pdf_path, args.temp_dir)
        
        # Process each image with Claude
        print("Processing images with Claude...")
        markdown_content = []
        for i, image_path in enumerate(image_paths, 1):
            print(f"Processing page {i} of {len(image_paths)}...")
            try:
                page_content = process_image_with_claude(client, image_path)
                markdown_content.append(page_content)
                # Add a brief delay to avoid rate limits
                time.sleep(1)
            except Exception as e:
                print(f"Error processing page {i}: {e}")
                markdown_content.append(f"\n\n[Error processing page {i}]\n\n")

        # Write markdown file
        print("Writing markdown file...")
        with open(args.output, 'w', encoding='utf-8') as f:
            f.write('\n\n'.join(markdown_content))
        
        print(f"Conversion complete! Output saved to: {args.output}")

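    finally:
        # Cleanup temporary images if they exist. Glob the directory instead of
        # using image_paths, which is undefined if the conversion step failed.
        if os.path.exists(args.temp_dir):
            print("Cleaning up temporary files...")
            for image_path in Path(args.temp_dir).glob('page_*.png'):
                try:
                    os.remove(image_path)
                except Exception as e:
                    print(f"Error removing temporary file {image_path}: {e}")
            try:
                os.rmdir(args.temp_dir)
            except OSError as e:
                print(f"Could not remove {args.temp_dir}: {e}")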

if __name__ == '__main__':
    main()
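
The `-density 300` setting in `convert_pdf_to_images` decides how many pixels each page becomes, and therefore how much memory a rendered page occupies. The sketch below works through that arithmetic for a US Letter page. The helper name `rendered_page_size` is hypothetical, and the figure of 4 bytes per pixel assumes an uncompressed 8-bit RGBA image; the PNG files on disk are compressed and will be smaller.

```python
def rendered_page_size(width_in, height_in, dpi=300, bytes_per_pixel=4):
    """Return (width_px, height_px, raw_bytes) for a page rendered at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# US Letter is 8.5 x 11 inches.
for dpi in (150, 300):
    w, h, raw = rendered_page_size(8.5, 11, dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {raw / 2**20:.1f} MiB uncompressed")
```

Doubling the density quadruples the pixel count, so for long PDFs it may be worth testing whether a lower density still gives acceptable extraction quality.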

Usage

To use the script:

  1. Set your Anthropic API key as an environment variable:
    export ANTHROPIC_API_KEY='your-key-here'
  2. Run the script with your PDF file:
    python pdf_to_markdown.py input.pdf -o output.md

How It Works

The script follows these steps:

  1. Converts each page of the PDF to a high-resolution PNG image using ImageMagick
  2. Processes each image with Claude’s vision capabilities to extract text and understand layout
  3. Formats the extracted content as markdown, preserving the original document structure
  4. Combines all pages into a single markdown file
  5. Cleans up temporary image files
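
Steps 2 to 4 can be sketched without any network access by swapping the Claude call for a stand-in. This is an illustration of the control flow only: `pages_to_markdown` and `fake_extract` are hypothetical names, and the stand-in simulates one failing page to show how the placeholder ends up in the combined output.

```python
def pages_to_markdown(page_ids, extract_page):
    """Run `extract_page` on each page and join the results into one document.

    A page that raises is replaced by a placeholder so the rest still converts.
    """
    sections = []
    for number, page_id in enumerate(page_ids, 1):
        try:
            sections.append(extract_page(page_id))
        except Exception as exc:
            print(f"Error processing page {number}: {exc}")
            sections.append(f"[Error processing page {number}]")
    return "\n\n".join(sections)


def fake_extract(page_id):
    """Stand-in for the Claude call so the sketch runs offline."""
    if page_id == "page_001.png":
        raise RuntimeError("simulated API failure")
    return f"# Content of {page_id}"


print(pages_to_markdown(["page_000.png", "page_001.png", "page_002.png"], fake_extract))
```

In the real script the role of `extract_page` is played by `process_image_with_claude`.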

Technical Implementation Details

The tool integrates several key technologies:

  • ImageMagick: Used for high-quality PDF to image conversion, ensuring optimal input for the vision system
  • Claude Vision API: Sends each page image to Claude through Anthropic’s Messages API for text extraction and understanding of document layout
  • Python Libraries: Uses pathlib for robust file handling, argparse for command-line interface, and base64 for image encoding
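
To make the base64 step concrete, here is a small sketch of how raw image bytes become the image content block that the script places in its message. The helper name `build_image_block` is hypothetical; the dictionary shape is copied from the script, and the eight-byte PNG signature is used in place of a real page image so the example runs on its own.

```python
import base64


def build_image_block(image_bytes, media_type="image/png"):
    """Wrap raw image bytes in the content-block shape the script sends to Claude."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(image_bytes).decode("utf-8"),
        },
    }


# The eight-byte PNG signature stands in for a real page image.
block = build_image_block(b"\x89PNG\r\n\x1a\n")
print(block["source"]["media_type"], block["source"]["data"])
```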

The development process highlighted several interesting technical considerations:

  • The importance of high-resolution image conversion for optimal OCR results
  • The need for rate limiting to prevent API throttling
  • The balance between memory usage and image quality when handling large PDFs
  • The importance of proper error handling for both the conversion and API interaction processes
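
The script handles rate limits with a fixed `time.sleep(1)` after each page. One common alternative is to retry a failed call with exponentially growing delays. The sketch below is an assumption about how that could look, not what the script does: `call_with_backoff` is a hypothetical helper, and it retries on any exception for simplicity, where a real version would retry only on rate-limit errors.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `func()`; on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Demonstration with a function that fails twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
print(result, len(waits))
```

In the script, the per-page call could be wrapped as `call_with_backoff(lambda: process_image_with_claude(client, image_path))`.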

Update: Enhanced Image Processing (December 14, 2024)

The prompt sent to Claude has been updated to improve the handling of diagrams, charts, and images. The script shown above already includes this change, which adds:

  • Automatic recreation of flow charts and sequence diagrams using mermaid.js syntax
  • Detailed descriptions of graphs and data visualizations, including trend analysis
  • Comprehensive breakdown of technical diagrams and their components
  • Enhanced context preservation between images and surrounding text

Development Process Log

This tool was developed through an interactive conversation with Claude 3.5 Sonnet. Here’s the exact sequence of requests and development steps:

  1. Initial Request:
    "Can you write two bash command lines? The first command line needs to take Wikipedia file and print each page to a PNG using image magic. The second one uses a command line to feed the PNG images into some sort of AI vision recognition system do a web search to find a command line one"
  2. Command Line Interface:
    "Is there a command line client for anthropic Claude?"

    Response: Created a Python-based command line interface for Claude

  3. Main Script Development:
    "Can you write a python script that takes a PDF file as an argument uses image magic to print out the PDF to PNG files one page profile and then passes it through Claude anthropic with a prompt telling it to describe and essentially do optical character recognition on each image and then take all of that text and turn it into a markdown file. Essentially take a PDF and turn it into Marc down using AI to look at each page."

    Response: Created the initial version of the PDF to Markdown converter

  4. Publishing:
    "Can you publish this script as a WordPress post and also write a bit of a description of what it does. Make sure you format it using Html and put the code into a code block. When you've posted it tell me what the url is"

    Response: Created the initial blog post

  5. Development Context:
    "Can you update the post to include a description of how it was built, using Claude desktop, and these tools and so on"

    Response: Added development background and technical implementation details

  6. Enhanced Image Processing:
    "Can you update the prompt in the program to include phrasing around describing images and also creating AI art of things like flow charts and graphs where possible. Once you've done this then update the word press posting. Including a note about this addendum."

    Response: Enhanced the image processing capabilities and added the update section

  7. Development Log:
    "Yes, can you update the word press posting with a new section that includes all the instructions I've given you in the order I gave you, essentially so people can see exactly what I said, and did to get you to do this."

    Response: Added development process log

  8. Content Completion Fix:
    "You need to update the WordPress posting with all the data not just placeholder include the script and all the previous content too. Also make sure you include this instruction in the section where all the instructions are as a perfect example of where the AI tends to goof up."

    Response: Fixed the post to include all content instead of placeholders, and added this interaction as an example of AI limitations and the importance of clear communication

This log demonstrates the iterative development process using AI assistance, showing how a complex tool can be built through natural language interaction with an AI model. Each step built upon the previous ones, with the AI understanding context and maintaining consistency throughout the development process.

The development process also revealed an important lesson about AI behavior: AIs can sometimes take shortcuts or use placeholders when updating content, requiring explicit instructions to maintain and update all existing content. This is demonstrated in step 8, where the initial attempt to update the post would have lost previous content if not corrected.

The entire development process took place in a single conversation, showcasing how AI can be used for:

  • Initial concept development and prototyping
  • Code generation and refinement
  • Documentation creation and publishing
  • Iterative improvements and feature additions
  • Learning from and correcting mistakes in the development process

This development log is itself part of the conversation, demonstrating the recursive and self-documenting nature of AI-assisted development, including the ability to recognize and correct potential issues in the development process.