2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

## How to develop

-For local development, simply use :
+For local development, simply use:

```bash
$ yarn install
7 changes: 7 additions & 0 deletions gatsby-browser.js
@@ -262,6 +262,13 @@ export const onRouteUpdate = ({ location, prevLocation }) => {
) {
pageHeadTittle = "PDF Services API Extract PDF";
} else if (
window.location.pathname.indexOf(
"pdf-services-api/howtos/pdf-to-markdown-api/"
) >= 0
) {
pageHeadTittle = "PDF Services API PDF to Markdown API";
}
else if (
window.location.pathname.indexOf(
"pdf-services-api/howtos/pdf-properties/"
) >= 0
13 changes: 13 additions & 0 deletions gatsby-config.js
@@ -33,6 +33,11 @@ module.exports = {
description: 'Create, combine and export PDFs',
path: '../document-services/apis/pdf-services/'
},
{
title: 'PDF to Markdown',
description: 'Convert PDF documents to Markdown format',
path: '../document-services/apis/pdf-to-markdown/'
},
{
title: 'PDF Accessibility Auto-Tag',
description: 'Auto-tag PDF content to improve accessibility',
@@ -229,6 +234,10 @@ module.exports = {
title: 'Extract PDF',
path: 'overview/pdf-services-api/howtos/extract-pdf.md'
},
{
title: 'PDF to Markdown API',
path: 'overview/pdf-services-api/howtos/pdf-to-markdown-api.md'
},
{
title: 'Get PDF Properties',
path: 'overview/pdf-services-api/howtos/pdf-properties.md'
@@ -716,6 +725,10 @@ module.exports = {
title: 'Extract PDF',
path: 'overview/legacy-documentation/pdf-services-api/howtos/extract-pdf.md'
},
{
title: 'PDF to Markdown API',
path: 'overview/legacy-documentation/pdf-services-api/howtos/pdf-to-markdown-api.md'
},
{
title: 'Get PDF Properties',
path: 'overview/legacy-documentation/pdf-services-api/howtos/pdf-properties.md'
126 changes: 126 additions & 0 deletions src/pages/overview/pdf-services-api/howtos/pdf-to-markdown-api.md
@@ -0,0 +1,126 @@
---
title: PDF to Markdown API | Adobe PDF Services
description: Learn about the PDF to Markdown API service that converts PDF documents into well-formatted Markdown text.
---

# PDF to Markdown API

The PDF to Markdown API (included with the PDF Services API) is a cloud-based web service that automatically converts PDF documents, whether native or scanned, into well-formatted Markdown text. The service preserves the document's structure and formatting, and its output is in a format widely used in LLM workflows, content authoring, and documentation.

## Structured Information Output Format

The output of a PDF to Markdown operation includes:

- A primary `.md` file containing the converted Markdown content

### Output Structure

The following is a summary of key elements in the converted Markdown:

#### Elements

Ordered list of semantic elements converted from the PDF document, preserving the natural reading order and document structure. The conversion handles:

- Text content with proper Markdown syntax
- Document hierarchy and structure
- Inline formatting and emphasis
- Links and references
- Images and figures
- Tables and complex layouts

#### Content Types

The API processes various content types as follows:

##### Text Elements

- **Headings**: Converted to appropriate Markdown heading levels (H1-H6)
- **Paragraphs**: Preserved with proper spacing and formatting
- **Lists**: Both ordered and unordered lists with proper nesting
- **Text Emphasis**: Bold, italic, and other text formatting
- **Links**: Preserved with proper Markdown link syntax
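
Because headings arrive as ordinary Markdown heading lines, downstream code can use them to break the converted text into sections, for example before indexing or passing content to an LLM. The following is a minimal sketch, not part of the API: `splitByHeadings` is a hypothetical helper, and it assumes headings are emitted in ATX style (`#` through `######`).

```javascript
// Sketch: split converted Markdown into sections keyed by ATX headings.
// Assumption: headings are emitted as "#"-prefixed lines (H1-H6).
// Lines inside fenced code blocks are never treated as headings.
function splitByHeadings(markdown) {
  const sections = [];
  let current = { level: 0, title: "", lines: [] };
  let inFence = false;

  // A section is worth keeping if it has a title or any non-blank line.
  const hasContent = (s) => s.title !== "" || s.lines.some((l) => l.trim() !== "");

  for (const line of markdown.split(/\r?\n/)) {
    if (/^\s*(```|~~~)/.test(line)) {
      inFence = !inFence; // toggle on every opening or closing fence
    }
    const match = inFence ? null : /^(#{1,6})\s+(.+?)\s*$/.exec(line);
    if (match) {
      // Close the previous section before starting a new one.
      if (hasContent(current)) sections.push(current);
      current = { level: match[1].length, title: match[2], lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  if (hasContent(current)) sections.push(current);

  return sections.map((s) => ({
    level: s.level,
    title: s.title,
    body: s.lines.join("\n").trim(),
  }));
}
```

Text that appears before the first heading is returned as a section with `level` 0 and an empty `title`.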

##### Images and Figures

- Provided as base64-embedded images in the Markdown output
- Referenced correctly in the Markdown output
- Original quality preserved
- Proper alt text and captions maintained
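
Consumers that need the images as separate files can pull them back out of the Markdown. The sketch below is illustrative only: `extractEmbeddedImages` is a hypothetical helper, and it assumes the embedded images use the common `![alt](data:<mime>;base64,<payload>)` form, which this page does not spell out. It runs under Node.js, where `Buffer` is a global, and it does not handle an optional link title after the data URI.

```javascript
// Sketch: collect base64-embedded images from converted Markdown.
// Assumption: images look like ![alt](data:image/png;base64,....).
function extractEmbeddedImages(markdown) {
  const pattern =
    /!\[([^\]]*)\]\(\s*data:([a-z]+\/[a-z0-9.+-]+);base64,([A-Za-z0-9+/=\s]+?)\s*\)/gi;
  const images = [];
  for (const match of markdown.matchAll(pattern)) {
    // Payloads may be wrapped across lines; strip whitespace before decoding.
    const payload = match[3].replace(/\s+/g, "");
    images.push({
      alt: match[1],
      mimeType: match[2].toLowerCase(),
      bytes: Buffer.from(payload, "base64"),
    });
  }
  return images;
}
```

Each returned entry carries the alt text, the MIME type, and the decoded bytes, which can then be written to disk with `fs.writeFileSync` or uploaded elsewhere.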

##### Tables

- Converted to Markdown table syntax
- Column alignment preserved
- Cell content formatting maintained
- Complex table structures supported
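
To show how the preserved column alignment can be read back, here is a small sketch that parses a pipe-style Markdown table. It assumes the output uses the widely supported pipe table syntax, where colons in the delimiter row mark alignment. `parseMarkdownTable` is a hypothetical helper, and it does not handle escaped pipes (`\|`) inside cells.

```javascript
// Sketch: parse a pipe-style Markdown table into header, alignments, and rows.
// Assumption: the second line is a delimiter row such as | :--- | ---: |.
function parseMarkdownTable(tableText) {
  // Strip the optional outer pipes, then split on the inner ones.
  const splitRow = (row) =>
    row
      .trim()
      .replace(/^\|/, "")
      .replace(/\|$/, "")
      .split("|")
      .map((cell) => cell.trim());

  const rows = tableText.trim().split(/\r?\n/).map(splitRow);
  if (rows.length < 2) {
    throw new Error("a table needs a header row and a delimiter row");
  }

  const [header, delimiter, ...body] = rows;
  const alignments = delimiter.map((cell) => {
    if (!/^:?-+:?$/.test(cell)) {
      throw new Error(`not a delimiter cell: "${cell}"`);
    }
    const left = cell.startsWith(":");
    const right = cell.endsWith(":");
    if (left && right) return "center";
    if (right) return "right";
    if (left) return "left";
    return "default";
  });

  return { header, alignments, rows: body };
}
```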

#### Element Types and Paths

The API recognizes and converts the following structural elements:

| Category | Element Type | Description |
| --------- | ----------------- | --------------------------------------------------------- |
| Aside | Aside | Content which is not part of regular content flow |
| Figure | Figure | Non-reflowable constructs like graphs, images, flowcharts |
| Footnote | Footnote | Footnote |
| Headings | H, H1, H2, etc | Heading levels |
| List | L, Li, Lbl, Lbody | List and list item elements |
| Paragraph | P, ParagraphSpan | Paragraphs and paragraph segments |
| Reference | Reference | Links |
| Section | Sect | Logical section of the document |
| StyleSpan | StyleSpan | Styling variations within text |
| Table | Table, TD, TH, TR | Table elements |
| Title | Title | Document title |

### Reading Order

The reading order in the output Markdown maintains:

- Natural document flow
- Proper content hierarchy
- Column-based layouts
- Page transitions
- Inline elements and references

## Use Cases

The PDF to Markdown API is particularly valuable for:

- LLM-friendly content ingestion and prompt creation
- Training or fine-tuning LLMs on PDF content
- Content migration from PDF to documentation platforms
- Legacy document conversion
- Content repurposing for modern documentation systems
- Integration with Markdown-based workflows
- Automated document processing pipelines
- Searchable internal knowledge repositories

## API Limitations

### File Constraints

- **File Size**: Maximum of 100 MB per file
- **Page Count**:
  - Non-scanned PDFs: Up to 400 pages
  - Scanned PDFs: Up to 150 pages
- **Page Dimensions**: Between 6" and 17.5" in each dimension
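
A client can check these constraints before uploading, which avoids a round trip for files that would be rejected. The sketch below is one way to do that, under stated assumptions: `checkFileConstraints` and `LIMITS` are hypothetical names, the caller already knows the page count, page sizes in inches, and whether the PDF is scanned, and "100 MB" is interpreted as 100 × 1024 × 1024 bytes.

```javascript
// Sketch: pre-flight check against the documented file constraints.
// Assumption: 100 MB is taken as 100 * 1024 * 1024 bytes.
const LIMITS = {
  maxBytes: 100 * 1024 * 1024,
  maxPagesNative: 400,
  maxPagesScanned: 150,
  minInches: 6,
  maxInches: 17.5,
};

// Returns a list of human-readable problems; an empty list means the file
// is within the documented limits.
function checkFileConstraints({ sizeBytes, pageCount, scanned, pageSizesInches }) {
  const problems = [];

  if (sizeBytes > LIMITS.maxBytes) {
    problems.push(`file is ${sizeBytes} bytes; limit is ${LIMITS.maxBytes}`);
  }

  const maxPages = scanned ? LIMITS.maxPagesScanned : LIMITS.maxPagesNative;
  if (pageCount > maxPages) {
    problems.push(`document has ${pageCount} pages; limit is ${maxPages}`);
  }

  const inRange = (v) => v >= LIMITS.minInches && v <= LIMITS.maxInches;
  pageSizesInches.forEach(([width, height], index) => {
    // One problem per offending page, even if both dimensions are out of range.
    if (!inRange(width) || !inRange(height)) {
      problems.push(`page ${index + 1} is ${width}" x ${height}"; each side must be 6" to 17.5"`);
    }
  });

  return problems;
}
```

For example, a 5 MB, 12-page native PDF with US Letter pages (`[8.5, 11]`) yields an empty list.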

### Processing Limits

- **Rate Limits**: Maximum 25 requests per minute
- **Language Support**: Optimized for English; other Latin-based languages are also supported
- **OCR Quality**: Dependent on scan quality (minimum 200 DPI recommended)
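
Callers that submit many documents may want to stay under the request limit on the client side instead of relying on error responses. The following sliding-window limiter is a minimal sketch, not an Adobe-provided utility: the class name and its parameters are hypothetical, and how the service itself counts requests within a minute is not specified here, so treat the window as a conservative approximation.

```javascript
// Sketch: client-side sliding-window limiter for 25 requests per minute.
// The clock is injectable so the behavior can be tested without waiting.
class SlidingWindowLimiter {
  constructor(maxRequests = 25, windowMs = 60000, now = () => Date.now()) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.now = now;
    this.timestamps = []; // send times of requests still inside the window
  }

  // Returns 0 and records the request if it may be sent now;
  // otherwise returns the number of milliseconds to wait before retrying.
  tryAcquire() {
    const t = this.now();
    // Drop requests that have aged out of the window.
    while (this.timestamps.length > 0 && t - this.timestamps[0] >= this.windowMs) {
      this.timestamps.shift();
    }
    if (this.timestamps.length < this.maxRequests) {
      this.timestamps.push(t);
      return 0;
    }
    return this.windowMs - (t - this.timestamps[0]);
  }
}
```

A caller would invoke `tryAcquire()` before each request and, when it returns a positive number, delay for that many milliseconds (for example with `setTimeout`) before trying again.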

### Document Requirements

- Files must be unprotected or allow content copying
- No support for:
  - Hidden objects (JavaScript, OCG)
  - XFA and fillable forms
  - Complex annotations
  - CAD drawings or vector art
  - Password-protected content

## REST API

See our public API Reference for [PDF to Markdown API](../../../apis/#tag/PDF-to-Markdown).