HTML to Markdown Conversion¶
The HTML conversion module converts HTML documents into well-structured markdown suitable for RAG ingestion and chunking with MarkdownChunker. It supports both standard HTML (semantic tags) and IETF RFC markup (pre-formatted text with span-based headings).
Overview¶
When building RAG systems, HTML content from web pages, documentation sites, or standards bodies needs to be converted into markdown before chunking. The HTMLConverter class handles this conversion, auto-detecting the document format and applying the appropriate strategy.
Supported Formats¶
| Format | Detection | Description |
|---|---|---|
| Standard HTML | Default | Semantic tags: <h1>-<h6>, <p>, <ul>, <ol>, <table>, <pre>, <blockquote>, <dl> |
| IETF RFC markup | Auto-detected | Pre-formatted text with <span class="h1">-<span class="h6"> headings inside <pre> tags, as served by datatracker.ietf.org |
Quick Start¶
from dataknobs_xization import HTMLConverter, html_to_markdown
# Simple conversion
markdown = html_to_markdown("<h1>Title</h1><p>Hello world.</p>")
# With configuration
converter = HTMLConverter(base_heading_level=2, wrap_width=80)
markdown = converter.convert(html_string)
# From a file
from pathlib import Path
markdown = converter.convert(Path("document.html"))
# RFC documents are auto-detected
markdown = converter.convert(rfc_html, title="RFC 6749")
HTMLConverter Class¶
Initialization¶
from dataknobs_xization import HTMLConverter, HTMLConverterConfig
# Using keyword arguments
converter = HTMLConverter(
base_heading_level=1, # Minimum heading level in output (1 = #)
include_links=True, # Preserve hyperlinks as markdown links
strip_nav=True, # Remove <nav>, <header>, <footer>
strip_scripts=True, # Remove <script> and <style>
preserve_code_blocks=True, # Keep <pre>/<code> as fenced blocks
link_style="inline", # "inline", "reference", or "text"
strip_images=False, # Whether to remove images
wrap_width=0, # Wrap paragraph text (0 = no wrap)
frontmatter=None, # Optional YAML frontmatter dict
)
# Or using a config object
config = HTMLConverterConfig(base_heading_level=2, include_links=False)
converter = HTMLConverter(config=config)
Converting HTML¶
# From a string
markdown = converter.convert("<h1>Title</h1><p>Content.</p>")
# From a file path
markdown = converter.convert(Path("page.html"))
# With an explicit title (prepended as top-level heading)
markdown = converter.convert(html_string, title="My Document")
Standard HTML Conversion¶
The converter handles all common semantic HTML elements:
Headings¶
Converts to:
Use base_heading_level to offset all headings (e.g., base_heading_level=2 turns <h1> into ##).
Inline Markup¶
| HTML | Markdown |
|---|---|
<strong>bold</strong> / <b>bold</b> |
**bold** |
<em>italic</em> / <i>italic</i> |
*italic* |
<code>code</code> |
`code` |
<a href="url">text</a> |
[text](url) |
<img src="url" alt="alt"> |
 |
Lists¶
Unordered and ordered lists, including nested lists:
Becomes:
Tables¶
Becomes:
Code Blocks¶
<pre> tags convert to fenced code blocks. Language detection uses <code class="language-python">:
Becomes:
Other Elements¶
- Blockquotes:
<blockquote>becomes>prefixed lines - Definition lists:
<dl>/<dt>/<dd>becomes bold terms with: definitionlines - Horizontal rules:
<hr>becomes--- - Figures:
<figcaption>text rendered in italics
Element Stripping¶
By default, these elements are removed:
<script>,<style>,<noscript>,<svg>,<iframe>(always whenstrip_scripts=True)<nav>,<header>,<footer>(whenstrip_nav=True)
IETF RFC Conversion¶
The converter auto-detects IETF RFC markup by looking for:
- A
<div class="rfcmarkup">container <span class="h1">through<span class="h6">heading elements inside<pre>tags
RFC-Specific Handling¶
- Heading spans:
<span class="h2">1. Introduction</span>becomes## 1. Introduction - Page breaks:
<hr>elements and page footer/header lines are stripped - Table of Contents: Auto-detected and removed from output
- Preamble: Abstract, Status, and Copyright sections before the first heading are skipped
- Bullet lists: RFC-style
oprefixed items become markdown bullet lists - ASCII art: Box-drawing diagrams are wrapped in fenced code blocks
- Links: External links (
http://,https://) are preserved; internal RFC links render as plain text
Example: Converting an RFC¶
from dataknobs_xization import html_to_markdown
# Fetch RFC HTML from datatracker.ietf.org
markdown = html_to_markdown(rfc_html, title="RFC 6749 - OAuth 2.0")
The converter extracts the document title from the first <span class="h1"> if no title is provided.
YAML Frontmatter¶
Prepend YAML frontmatter to the output:
markdown = html_to_markdown(
html_content,
frontmatter={
"source": "https://example.com/page",
"type": "article",
"tags": ["oauth", "security"],
},
)
Output:
ContentTransformer Integration¶
HTML conversion is integrated into the ContentTransformer dispatch:
from dataknobs_xization import ContentTransformer
transformer = ContentTransformer(base_heading_level=2)
# Via the generic dispatch method
markdown = transformer.transform(html_string, format="html")
# Or directly
markdown = transformer.transform_html(html_string, title="My Page")
Convenience Function¶
For quick one-off conversions:
from dataknobs_xization import html_to_markdown
markdown = html_to_markdown(
content, # HTML string or Path
title="Document Title", # Optional title
base_heading_level=1, # Starting heading level
include_links=True, # Preserve hyperlinks
frontmatter={"key": "val"}, # Optional YAML frontmatter
)
Integration with RAG Pipeline¶
The typical flow for HTML-to-RAG ingestion:
from dataknobs_xization import HTMLConverter, parse_markdown, chunk_markdown_tree
# Step 1: Convert HTML to markdown
converter = HTMLConverter(base_heading_level=1)
markdown = converter.convert(html_content, title="My Document")
# Step 2: Parse markdown into tree
tree = parse_markdown(markdown)
# Step 3: Chunk for vector storage
chunks = chunk_markdown_tree(tree, max_chunk_size=500)
for chunk in chunks:
# Store in vector database
pass
API Reference¶
HTMLConverterConfig¶
@dataclass
class HTMLConverterConfig:
base_heading_level: int = 1
include_links: bool = True
strip_nav: bool = True
strip_scripts: bool = True
preserve_code_blocks: bool = True
link_style: Literal["inline", "reference", "text"] = "inline"
strip_images: bool = False
wrap_width: int = 0 # 0 = no wrapping
frontmatter: dict[str, Any] | None = None
HTMLConverter¶
class HTMLConverter:
def __init__(
self,
config: HTMLConverterConfig | None = None,
**kwargs: Any,
) -> None: ...
def convert(
self,
content: str | Path,
title: str | None = None,
) -> str: ...