HTML to Markdown Conversion¶

The HTML conversion module converts HTML documents into well-structured markdown suitable for RAG ingestion and chunking with MarkdownChunker. It supports both standard HTML (semantic tags) and IETF RFC markup (pre-formatted text with span-based headings).

Overview¶

When building RAG systems, HTML content from web pages, documentation sites, or standards bodies needs to be converted into markdown before chunking. The HTMLConverter class handles this conversion, auto-detecting the document format and applying the appropriate strategy.

Supported Formats¶

Format	Detection	Description
Standard HTML	Default	Semantic tags: `<h1>`-`<h6>`, `<p>`, `<ul>`, `<ol>`, `<table>`, `<pre>`, `<blockquote>`, `<dl>`
IETF RFC markup	Auto-detected	Pre-formatted text with `<span class="h1">`-`<span class="h6">` headings inside `<pre>` tags, as served by datatracker.ietf.org

Quick Start¶

from dataknobs_xization import HTMLConverter, html_to_markdown

# Simple conversion
markdown = html_to_markdown("<h1>Title</h1><p>Hello world.</p>")

# With configuration
converter = HTMLConverter(base_heading_level=2, wrap_width=80)
markdown = converter.convert(html_string)

# From a file
from pathlib import Path
markdown = converter.convert(Path("document.html"))

# RFC documents are auto-detected
markdown = converter.convert(rfc_html, title="RFC 6749")

HTMLConverter Class¶

Initialization¶

from dataknobs_xization import HTMLConverter, HTMLConverterConfig

# Using keyword arguments
converter = HTMLConverter(
    base_heading_level=1,       # Minimum heading level in output (1 = #)
    include_links=True,         # Preserve hyperlinks as markdown links
    strip_nav=True,             # Remove <nav>, <header>, <footer>
    strip_scripts=True,         # Remove <script> and <style>
    preserve_code_blocks=True,  # Keep <pre>/<code> as fenced blocks
    link_style="inline",        # "inline", "reference", or "text"
    strip_images=False,         # Whether to remove images
    wrap_width=0,               # Wrap paragraph text (0 = no wrap)
    frontmatter=None,           # Optional YAML frontmatter dict
)

# Or using a config object
config = HTMLConverterConfig(base_heading_level=2, include_links=False)
converter = HTMLConverter(config=config)

Converting HTML¶

# From a string
markdown = converter.convert("<h1>Title</h1><p>Content.</p>")

# From a file path
markdown = converter.convert(Path("page.html"))

# With an explicit title (prepended as top-level heading)
markdown = converter.convert(html_string, title="My Document")

Standard HTML Conversion¶

The converter handles all common semantic HTML elements:

Headings¶

<h1>Title</h1>
<h2>Section</h2>
<h3>Subsection</h3>

Converts to:

# Title

## Section

### Subsection

Use base_heading_level to offset all headings (e.g., base_heading_level=2 turns <h1> into ##).

Inline Markup¶

HTML	Markdown
`<strong>bold</strong>` / `<b>bold</b>`	`bold`
`<em>italic</em>` / `<i>italic</i>`	`italic`
`<code>code</code>`	`code`
`<a href="url">text</a>`	`[text](url)`
`<img src="url" alt="alt">`	`![alt](url)`

Lists¶

Unordered and ordered lists, including nested lists:

<ul>
  <li>First</li>
  <li>Second
    <ul><li>Nested</li></ul>
  </li>
</ul>

Becomes:

- First
- Second
  - Nested

Tables¶

<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>foo</td><td>bar</td></tr>
</table>

Becomes:

| Name | Value |
| --- | --- |
| foo | bar |

Code Blocks¶

<pre> tags convert to fenced code blocks. Language detection uses <code class="language-python">:

<pre><code class="language-python">print("hello")</code></pre>

Becomes:

```python
print("hello")
```

Other Elements¶

Blockquotes: <blockquote> becomes > prefixed lines
Definition lists: <dl>/<dt>/<dd> becomes bold terms with : definition lines
Horizontal rules: <hr> becomes ---
Figures: <figcaption> text rendered in italics

Element Stripping¶

By default, these elements are removed:

<script>, <style>, <noscript>, <svg>, <iframe> (always when strip_scripts=True)
<nav>, <header>, <footer> (when strip_nav=True)

IETF RFC Conversion¶

The converter auto-detects IETF RFC markup by looking for:

A <div class="rfcmarkup"> container
<span class="h1"> through <span class="h6"> heading elements inside <pre> tags

RFC-Specific Handling¶

Heading spans: <span class="h2">1. Introduction</span> becomes ## 1. Introduction
Page breaks: <hr> elements and page footer/header lines are stripped
Table of Contents: Auto-detected and removed from output
Preamble: Abstract, Status, and Copyright sections before the first heading are skipped
Bullet lists: RFC-style o prefixed items become markdown bullet lists
ASCII art: Box-drawing diagrams are wrapped in fenced code blocks
Links: External links (http://, https://) are preserved; internal RFC links render as plain text

Example: Converting an RFC¶

from dataknobs_xization import html_to_markdown

# Fetch RFC HTML from datatracker.ietf.org
markdown = html_to_markdown(rfc_html, title="RFC 6749 - OAuth 2.0")

The converter extracts the document title from the first <span class="h1"> if no title is provided.

YAML Frontmatter¶

Prepend YAML frontmatter to the output:

markdown = html_to_markdown(
    html_content,
    frontmatter={
        "source": "https://example.com/page",
        "type": "article",
        "tags": ["oauth", "security"],
    },
)

Output:

---
source: https://example.com/page
type: article
tags:
  - oauth
  - security
---

# Title
...

ContentTransformer Integration¶

HTML conversion is integrated into the ContentTransformer dispatch:

from dataknobs_xization import ContentTransformer

transformer = ContentTransformer(base_heading_level=2)

# Via the generic dispatch method
markdown = transformer.transform(html_string, format="html")

# Or directly
markdown = transformer.transform_html(html_string, title="My Page")

Convenience Function¶

For quick one-off conversions:

from dataknobs_xization import html_to_markdown

markdown = html_to_markdown(
    content,                    # HTML string or Path
    title="Document Title",     # Optional title
    base_heading_level=1,       # Starting heading level
    include_links=True,         # Preserve hyperlinks
    frontmatter={"key": "val"}, # Optional YAML frontmatter
)

Integration with RAG Pipeline¶

The typical flow for HTML-to-RAG ingestion:

from dataknobs_xization import HTMLConverter, parse_markdown, chunk_markdown_tree

# Step 1: Convert HTML to markdown
converter = HTMLConverter(base_heading_level=1)
markdown = converter.convert(html_content, title="My Document")

# Step 2: Parse markdown into tree
tree = parse_markdown(markdown)

# Step 3: Chunk for vector storage
chunks = chunk_markdown_tree(tree, max_chunk_size=500)

for chunk in chunks:
    # Store in vector database
    pass

API Reference¶

HTMLConverterConfig¶

@dataclass
class HTMLConverterConfig:
    base_heading_level: int = 1
    include_links: bool = True
    strip_nav: bool = True
    strip_scripts: bool = True
    preserve_code_blocks: bool = True
    link_style: Literal["inline", "reference", "text"] = "inline"
    strip_images: bool = False
    wrap_width: int = 0              # 0 = no wrapping
    frontmatter: dict[str, Any] | None = None

HTMLConverter¶

class HTMLConverter:
    def __init__(
        self,
        config: HTMLConverterConfig | None = None,
        **kwargs: Any,
    ) -> None: ...

    def convert(
        self,
        content: str | Path,
        title: str | None = None,
    ) -> str: ...

html_to_markdown¶

def html_to_markdown(
    content: str | Path,
    title: str | None = None,
    base_heading_level: int = 1,
    include_links: bool = True,
    frontmatter: dict[str, Any] | None = None,
) -> str: ...