Ox Content

Architecture

This document provides a deep dive into the architecture and design decisions of Ox Content.

Overview

Ox Content is designed as a modular, high-performance Markdown processing toolkit. The architecture follows the Oxc philosophy of prioritizing speed, memory efficiency, and correctness.

Architecture Overview

Crate Structure

ox-content/
├── crates/
│   ├── ox_content_allocator/   # Foundation: Arena allocator
│   ├── ox_content_ast/         # Core: AST definitions
│   ├── ox_content_parser/      # Core: Markdown parser
│   ├── ox_content_renderer/    # Core: HTML renderer
│   ├── ox_content_search/      # Core: Full-text search engine
│   ├── ox_content_ssg/         # Core: Static site generation
│   ├── ox_content_napi/        # Bindings: Node.js via napi-rs
│   ├── ox_content_wasm/        # Bindings: WebAssembly
│   ├── ox_content_vite/        # Integration: Vite plugin
│   ├── ox_content_og_image/    # Feature: OG image generation
│   └── ox_content_docs/        # Feature: Source code documentation
├── playground/                 # Interactive playground
└── docs/                       # Documentation (self-hosted)

Memory Management

Arena Allocation with bumpalo

Ox Content uses bumpalo for arena-based allocation. This is the key to our performance advantage.

How Arena Allocation Works

Arena Allocation

Traditional: 4 separate heap allocations, 4 separate deallocations

Arena: 1 contiguous region, 1 deallocation (drop arena)

Benefits

  1. Fast Allocation - Just bump a pointer, no free list traversal

  2. Zero-Copy Parsing - AST nodes can reference source slices directly

  3. Efficient Deallocation - Drop the entire arena at once

  4. Cache-Friendly - Related data stored contiguously in memory

  5. No Fragmentation - Memory is allocated linearly

Implementation

// ox_content_allocator/src/lib.rs

use bumpalo::Bump;

/// Arena allocator for AST nodes.
pub struct Allocator {
    bump: Bump,
}

impl Allocator {
    /// Creates a new allocator with default capacity.
    pub fn new() -> Self {
        Self { bump: Bump::new() }
    }

    /// Allocates a value in the arena.
    pub fn alloc<T>(&self, value: T) -> &mut T {
        self.bump.alloc(value)
    }

    /// Allocates a string slice in the arena.
    pub fn alloc_str(&self, s: &str) -> &str {
        self.bump.alloc_str(s)
    }

    /// Creates a new Vec in the arena.
    pub fn new_vec<T>(&self) -> Vec<'_, T> {
        Vec::new_in(&self.bump)
    }
}

// Re-export arena-aware types with standard names
pub type Box<'a, T> = bumpalo::boxed::Box<'a, T>;
pub type Vec<'a, T> = bumpalo::collections::Vec<'a, T>;
pub type String<'a> = bumpalo::collections::String<'a>;

Usage Pattern

fn process_markdown(source: &str) -> String {
    // Create arena - all allocations happen here
    let allocator = Allocator::new();

    // Parse document - AST allocated in arena
    let parser = Parser::new(&allocator, source);
    let document = parser.parse().unwrap();

    // Render to HTML - output is owned String
    let mut renderer = HtmlRenderer::new();
    let html = renderer.render(&document);

    html
    // allocator dropped here - all AST memory freed at once
}

AST Design

mdast Specification

The AST follows the mdast specification, which is part of the unified ecosystem. This ensures compatibility with existing tools and plugins.

Node Hierarchy

Document (root)
├── Block Nodes
│   ├── Paragraph
│   │   └── Inline Nodes...
│   ├── Heading (depth: 1-6)
│   │   └── Inline Nodes...
│   ├── CodeBlock (lang, meta, value)
│   ├── BlockQuote
│   │   └── Block Nodes...
│   ├── List (ordered, start, spread)
│   │   └── ListItem (checked)
│   │       └── Block Nodes...
│   ├── Table
│   │   └── TableRow
│   │       └── TableCell
│   │           └── Inline Nodes...
│   ├── ThematicBreak
│   └── Html (raw)

└── Inline Nodes
    ├── Text (value)
    ├── Emphasis
    │   └── Inline Nodes...
    ├── Strong
    │   └── Inline Nodes...
    ├── InlineCode (value)
    ├── Link (url, title)
    │   └── Inline Nodes...
    ├── Image (url, alt, title)
    ├── Break
    ├── Delete (GFM)
    │   └── Inline Nodes...
    └── FootnoteReference (identifier)

Span Information

Every node includes source location information:

/// Source span (byte offsets).
#[derive(Debug, Clone, Copy)]
pub struct Span {
    /// Start byte offset (inclusive).
    pub start: u32,
    /// End byte offset (exclusive).
    pub end: u32,
}

impl Span {
    pub fn new(start: u32, end: u32) -> Self {
        Self { start, end }
    }
}

This enables:

  • Error messages with precise source locations

  • Source maps for debugging

  • Syntax highlighting in editors

  • Incremental re-parsing

Visitor Pattern

The AST can be traversed using the visitor pattern:

/// Trait for visiting AST nodes.
pub trait Visit<'a> {
    fn visit_document(&mut self, document: &Document<'a>) {
        for node in &document.children {
            self.visit_node(node);
        }
    }

    fn visit_node(&mut self, node: &Node<'a>) {
        match node {
            Node::Paragraph(n) => self.visit_paragraph(n),
            Node::Heading(n) => self.visit_heading(n),
            Node::CodeBlock(n) => self.visit_code_block(n),
            // ... other variants
        }
    }

    fn visit_paragraph(&mut self, paragraph: &Paragraph<'a>) {
        for child in &paragraph.children {
            self.visit_node(child);
        }
    }

    fn visit_heading(&mut self, heading: &Heading<'a>) {
        for child in &heading.children {
            self.visit_node(child);
        }
    }

    // ... other visit methods with default implementations
}

Example: Table of Contents Generator

use ox_content_ast::{Visit, Document, Heading, Node, Text};

struct TocGenerator {
    entries: Vec<TocEntry>,
}

struct TocEntry {
    depth: u8,
    text: String,
    id: String,
}

impl<'a> Visit<'a> for TocGenerator {
    fn visit_heading(&mut self, heading: &Heading<'a>) {
        let mut text = String::new();
        for child in &heading.children {
            if let Node::Text(t) = child {
                text.push_str(t.value);
            }
        }

        let id = slugify(&text);
        self.entries.push(TocEntry {
            depth: heading.depth,
            text,
            id,
        });
    }
}

fn generate_toc(document: &Document<'_>) -> Vec<TocEntry> {
    let mut generator = TocGenerator { entries: vec![] };
    generator.visit_document(document);
    generator.entries
}

Parser Design

Architecture

Parser Architecture

Parser Options

/// Parser configuration options.
#[derive(Debug, Clone, Default)]
pub struct ParserOptions {
    /// Enable GFM (GitHub Flavored Markdown) extensions.
    pub gfm: bool,
    /// Enable footnotes.
    pub footnotes: bool,
    /// Enable task lists.
    pub task_lists: bool,
    /// Enable tables.
    pub tables: bool,
    /// Enable strikethrough.
    pub strikethrough: bool,
    /// Enable autolinks.
    pub autolinks: bool,
    /// Maximum nesting depth for block elements.
    pub max_nesting_depth: usize,
}

impl ParserOptions {
    /// Creates options with all GFM extensions enabled.
    pub fn gfm() -> Self {
        Self {
            gfm: true,
            footnotes: true,
            task_lists: true,
            tables: true,
            strikethrough: true,
            autolinks: true,
            max_nesting_depth: 100,
        }
    }
}

Parsing Strategy

  1. Block-First - Parse block structure first (paragraphs, headings, etc.)

  2. Inline Later - Parse inline content within blocks

  3. Lazy Evaluation - Only parse what's needed

  4. Error Recovery - Continue parsing after errors when possible

CommonMark Compliance

The parser follows the CommonMark spec:

  • ATX headings (# Heading)

  • Setext headings (underlined)

  • Fenced code blocks (` or ~~~)

  • Indented code blocks

  • Block quotes (>)

  • Lists (ordered and unordered)

  • Thematic breaks (---, ***, ___)

  • Emphasis and strong emphasis

  • Links and images

  • Hard and soft line breaks

Renderer Design

HTML Renderer

/// HTML renderer with customizable options.
pub struct HtmlRenderer {
    options: HtmlRendererOptions,
    output: String,
}

/// Renderer configuration.
#[derive(Debug, Clone)]
pub struct HtmlRendererOptions {
    /// Use XHTML-style self-closing tags.
    pub xhtml: bool,
    /// Soft break string.
    pub soft_break: String,
    /// Hard break string.
    pub hard_break: String,
    /// Enable syntax highlighting.
    pub highlight: bool,
    /// Sanitize HTML output.
    pub sanitize: bool,
}

Renderer Trait

Custom renderers can be implemented:

/// Trait for rendering AST to output format.
pub trait Renderer {
    type Output;

    fn render(&mut self, document: &Document<'_>) -> RenderResult<Self::Output>;
}

HTML Escaping

The renderer properly escapes HTML entities:

Character Entity
& &amp;
< &lt;
> &gt;
" &quot;
' &#39;

URL encoding is also handled for link/image URLs.

NAPI Bindings

Architecture

NAPI Architecture

Data Transfer

  • AST is serialized to JSON for JavaScript interop

  • HTML rendering happens in Rust for maximum performance

  • Async support for large documents

Thread Safety

The NAPI bindings are designed to be thread-safe:

  • Each parse operation creates its own allocator

  • No shared mutable state between calls

Vite Integration

Environment API

Ox Content integrates with Vite's Environment API for SSG:

// Creates a server-side environment for Markdown processing
const mdEnv = new Environment('markdown', {
  // Custom module resolution for .md files
  resolve: {
    extensions: ['.md'],
  },
  // Transform .md to JS modules
  transform: async (code, id) => {
    if (id.endsWith('.md')) {
      const result = await parseAndRender(code);
      return `export default ${JSON.stringify(result)}`;
    }
  },
});

Hot Module Replacement

The Vite plugin supports HMR for Markdown files:

  1. File change detected

  2. Re-parse changed file

  3. Send update to client

  4. Update rendered content without full reload

Performance Characteristics

Memory Usage

Content Size Traditional Parser Ox Content
1 KB ~50 KB heap ~8 KB arena
10 KB ~500 KB heap ~80 KB arena
100 KB ~5 MB heap ~800 KB arena

Parse Speed (approximate)

Content Size Traditional Parser Ox Content
1 KB ~1 ms ~0.1 ms
10 KB ~10 ms ~1 ms
100 KB ~100 ms ~10 ms

Benchmarks vary by content complexity and hardware.

Security Considerations

HTML Sanitization

When rendering untrusted Markdown, enable sanitization:

let options = HtmlRendererOptions {
    sanitize: true,
    ..Default::default()
};
let mut renderer = HtmlRenderer::with_options(options);

Nesting Limits

The parser enforces maximum nesting depth to prevent stack overflow:

let options = ParserOptions {
    max_nesting_depth: 50,  // Limit nesting
    ..Default::default()
};

Input Validation

  • Maximum input size limits

  • Invalid UTF-8 handling

  • Malformed Markdown graceful handling

Future Directions

  • Incremental Parsing - Re-parse only changed portions

  • Streaming Parser - Parse large documents in chunks

  • WASM Build - Run in browsers without NAPI

  • Custom Syntax Extensions - Plugin system for custom blocks

  • Source Maps - Full source map generation