ribbit/TOKENIZER_DESIGN.md

HopDown Tokenizer Design

Problem

The regex-based inline parser and serializer can't reliably distinguish structural delimiters from literal text characters. This causes:

  • toMarkdown escaping bugs (over-escaping inside inline tags, under-escaping in text nodes)
  • Round-trip failures (toHTML(toMarkdown(html)) !== html)
  • Fragile interactions between features (underscore normalization + strikethrough, HTML passthrough + escaping)

Invariants

  1. toHTML satisfies rules 1-15 of the GFM spec's emphasis and strong emphasis section
  2. toMarkdown always emits the canonical form
  3. toHTML(toMarkdown(html)) === html (single-pass round-trip)

Architecture

Token types

text      — literal characters, will be escaped during serialization
delimiter — structural marker (**, *, ~~, `, etc.)
html      — raw HTML tag passthrough
break     — hard line break (<br>)
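One way to model these four kinds is a discriminated union, sketched below. The field names (`type`, `value`) and the `describe` helper are assumptions made for illustration, not the actual contents of types.ts.

```typescript
// Hypothetical shape for the four token kinds listed above.
type Token =
  | { type: "text"; value: string } // literal characters, escaped on output
  | { type: "delimiter"; value: string } // structural marker such as ** or ~~
  | { type: "html"; value: string } // raw tag, passed through untouched
  | { type: "break" }; // hard line break

// Exhaustive switch: the compiler flags any token kind left unhandled.
function describe(token: Token): string {
  switch (token.type) {
    case "text":
      return `text ${JSON.stringify(token.value)}`;
    case "delimiter":
      return `delim ${token.value}`;
    case "html":
      return `html ${token.value}`;
    case "break":
      return "break";
  }
}
```

Because the kind is carried in the type tag, later stages can branch on `token.type` instead of re-inspecting characters.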

Inline tokenizer (markdown → tokens)

Scans the input left to right, one character at a time, maintaining a stack of open delimiters, and produces a flat token stream:

Input:  "hello **bold *nested*** end"
Tokens: [text "hello "] [open **] [text "bold "] [open *] [text "nested"] [close *] [close **] [text " end"]

The tokenizer handles:

  • Backslash escapes: \* → text token containing *
  • Entity resolution: &amp; → text token containing &
  • Flanking rules: only emit delimiter tokens when flanking conditions are met
  • Code spans: a backtick run opens a code span that consumes everything up to the next backtick run of the same length
  • Links: [text](url) parsed as a unit
  • Autolinks: <url> and bare URLs
  • Hard line breaks: two or more trailing spaces, or a \, before a newline
  • HTML tags: <span> etc. passed through as html tokens
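The scanning loop can be sketched roughly as follows. This is a deliberately reduced illustration, not the real tokenizer.ts: it covers only backslash escapes, single-backtick code spans, and the `**`, `~~`, `*` delimiters, and it skips flanking rules, entities, links, autolinks, line breaks, and HTML. The names `InlineToken`, `tokenizeInline`, and the `open`/`close`/`code` token kinds are assumptions that follow the notation of the example above.

```typescript
type InlineToken =
  | { type: "text"; value: string }
  | { type: "open" | "close"; value: string }
  | { type: "code"; value: string };

const DELIMS = ["**", "~~", "*"]; // longest first so ** wins over *
const PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"; // escapable ASCII punctuation

function tokenizeInline(src: string): InlineToken[] {
  const out: InlineToken[] = [];
  const open: string[] = []; // stack of currently open delimiters
  let text = "";
  const flush = () => {
    if (text) out.push({ type: "text", value: text });
    text = "";
  };

  let i = 0;
  while (i < src.length) {
    const ch = src.charAt(i);
    // Backslash escape: the escaped character becomes literal text.
    if (ch === "\\" && i + 1 < src.length && PUNCT.includes(src.charAt(i + 1))) {
      text += src.charAt(i + 1);
      i += 2;
      continue;
    }
    // Code span: everything up to the next backtick is opaque.
    if (ch === "`") {
      const end = src.indexOf("`", i + 1);
      if (end !== -1) {
        flush();
        out.push({ type: "code", value: src.slice(i + 1, end) });
        i = end + 1;
        continue;
      }
    }
    // Prefer closing the innermost open delimiter.
    const top = open[open.length - 1];
    if (top !== undefined && src.startsWith(top, i)) {
      flush();
      open.pop();
      out.push({ type: "close", value: top });
      i += top.length;
      continue;
    }
    // Otherwise a delimiter run opens a new span (no flanking check here).
    const d = DELIMS.find((x) => src.startsWith(x, i));
    if (d !== undefined) {
      flush();
      open.push(d);
      out.push({ type: "open", value: d });
      i += d.length;
      continue;
    }
    text += ch;
    i += 1;
  }
  flush();
  return out;
}
```

On the input "hello **bold *nested*** end" this sketch yields the same eight tokens as the example above. Openers that never find a closer are left as `open` tokens, so a later stage would need to demote them to text.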

Inline parser (tokens → HTML)

Walks the token stream, matches open/close delimiter pairs using a stack, and produces an HTML string. It handles:

  • Delimiter pairing with precedence (*** before ** before *)
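  • Multiple-of-3 rule (when either delimiter run can both open and close, the two runs do not pair if their combined length is a multiple of 3, unless both lengths are multiples of 3)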
  • Nesting validation (no em inside em, no links inside links)
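The stack-based pairing can be illustrated with the sketch below. It assumes the tokens already carry an open/close role, as in the earlier example, and it leaves out the multiple-of-3 rule and nesting validation. The names `Tok`, `TAGS`, and `renderInline` are hypothetical.

```typescript
type Tok =
  | { type: "text"; value: string }
  | { type: "open" | "close"; value: string };

// Assumed delimiter-to-tag mapping for the sketch.
const TAGS: Record<string, string> = { "**": "strong", "*": "em", "~~": "del" };

const escapeHtml = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

function renderInline(tokens: Tok[]): string {
  // Each frame buffers the HTML produced since its opener; frame 0 is the root.
  const stack: { delim: string; html: string }[] = [{ delim: "", html: "" }];
  const top = () => stack[stack.length - 1];

  for (const tok of tokens) {
    if (tok.type === "text") {
      top().html += escapeHtml(tok.value);
    } else if (tok.type === "open" && tok.value in TAGS) {
      stack.push({ delim: tok.value, html: "" });
    } else if (tok.type === "close" && stack.length > 1 && top().delim === tok.value) {
      // Matching closer: wrap the buffered HTML and append it to the parent.
      const frame = stack.pop()!;
      const tag = TAGS[frame.delim];
      top().html += `<${tag}>${frame.html}</${tag}>`;
    } else {
      // Unknown or unmatched delimiter: fall back to literal text.
      top().html += escapeHtml(tok.value);
    }
  }
  // Openers that were never closed are emitted literally.
  while (stack.length > 1) {
    const frame = stack.pop()!;
    top().html += escapeHtml(frame.delim) + frame.html;
  }
  return stack[0].html;
}
```

Buffering per frame means an opener can still be demoted to literal text at the end without rewriting output that was already emitted.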

Serializer (DOM → tokens → markdown)

Walks the DOM tree. For each node:

  • Text nodes → text tokens (the serializer knows these need escaping)
  • Element nodes → look up the tag, emit delimiter tokens + recurse into children
  • Unknown elements → recurse into children
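A rough sketch of this walk is shown below. To keep it runnable outside a browser it uses a minimal stand-in node type (`MiniNode`) rather than real DOM nodes; that type, the `DELIMITER_FOR_TAG` table, and `toTokens` are assumptions for illustration only.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; value: string };

// Minimal stand-in for a DOM node so the sketch runs without a browser.
type MiniNode =
  | { kind: "text"; data: string }
  | { kind: "element"; tag: string; children: MiniNode[] };

// Assumed tag-to-delimiter table; the real one would come from tags.ts.
const DELIMITER_FOR_TAG: Record<string, string | undefined> = {
  strong: "**",
  em: "*",
  del: "~~",
};

function toTokens(node: MiniNode, out: Token[] = []): Token[] {
  if (node.kind === "text") {
    // Text nodes become text tokens; escaping is deferred to string output.
    out.push({ type: "text", value: node.data });
    return out;
  }
  // Known elements wrap their children in delimiter tokens;
  // unknown elements contribute only their children.
  const delim = DELIMITER_FOR_TAG[node.tag];
  if (delim !== undefined) out.push({ type: "delimiter", value: delim });
  for (const child of node.children) toTokens(child, out);
  if (delim !== undefined) out.push({ type: "delimiter", value: delim });
  return out;
}
```

The walk never inspects the characters inside a text node, which is what keeps literal asterisks and structural asterisks apart.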

Then the token stream is serialized to a string:

  • Delimiter tokens → emitted verbatim (they're structural)
  • Text tokens → characters that would be misinterpreted as delimiters are backslash-escaped. The serializer knows exactly which characters are dangerous because it knows what delimiters exist.
  • HTML tokens → emitted verbatim
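The string-emission step might look like the sketch below. The set of escaped characters in `DANGEROUS` is an assumption chosen for the example (the real set would be derived from the registered delimiters), and `serializeTokens` is a hypothetical name.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; value: string }
  | { type: "html"; value: string };

// Assumed set of characters that could be read as structure.
// Backslash is included so literal backslashes survive a round trip.
const DANGEROUS = /[\\*_~`\[\]<]/g;

function serializeTokens(tokens: Token[]): string {
  return tokens
    .map((t) =>
      // Only text tokens are escaped; delimiter and html tokens pass through.
      t.type === "text" ? t.value.replace(DANGEROUS, "\\$&") : t.value
    )
    .join("");
}
```

With this sketch, a `**` delimiter pair around the text token "hello * world" serializes to `**hello \* world**`, while the same pair around delimiter-wrapped "world" serializes to `**hello *world***`.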

Why this solves the round-trip problem

The key insight: delimiter tokens and text tokens are different types. When serializing <strong>hello <em>world</em></strong>, the token stream is:

[delim **] [text "hello "] [delim *] [text "world"] [delim *] [delim **]

The * around "world" are delimiter tokens (from the nested <em>). If instead the text contained a literal *:

<strong>hello * world</strong>

The token stream would be:

[delim **] [text "hello * world"] [delim **]

The * is part of a text token, so during serialization the text escaper sees it and emits \* because * is a known delimiter character. Delimiter tokens are never escaped, so there is no ambiguity.

Files

  • types.ts — Token type, updated Tag interface
  • tokenizer.ts — Inline tokenizer (markdown → tokens)
  • serializer.ts — DOM → tokens → markdown string
  • hopdown.ts — Orchestrator (block parsing, delegates inline to tokenizer)
  • tags.ts — Tag definitions (simplified: no more regex patterns)

Migration

The Tag interface changes:

  • pattern field removed (tokenizer handles delimiter matching)
  • toMarkdown returns Token[] instead of string
  • match stays the same (block-level matching is already clean)
  • toHTML stays the same
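The change to toMarkdown can be pictured with the trimmed-down sketch below. Both interfaces are hypothetical: they show only the field whose return type changes, and the parameter types are guesses rather than the real Tag signatures.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; value: string };

// Hypothetical "before" shape: markdown is assembled as a string,
// so structural and literal asterisks become indistinguishable.
interface TagBefore {
  toMarkdown(inner: string): string;
}

// Hypothetical "after" shape: the tag contributes typed tokens instead.
interface TagAfter {
  toMarkdown(inner: Token[]): Token[];
}

const strongBefore: TagBefore = {
  toMarkdown: (inner) => `**${inner}**`,
};

const strongAfter: TagAfter = {
  toMarkdown: (inner) => [
    { type: "delimiter", value: "**" },
    ...inner,
    { type: "delimiter", value: "**" },
  ],
};
```

Returning tokens lets the final string emitter decide what to escape, instead of each tag guessing at the context it will be embedded in.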

The HopDown public API stays the same:

  • toHTML(markdown) — unchanged
  • toMarkdown(html) — unchanged
  • findCompletePair, findUnmatchedOpener — reimplemented on top of the tokenizer
  • getTagForElement, getEditableSelector — unchanged