HopDown Tokenizer Design

Problem

The regex-based inline parser and serializer can't reliably distinguish structural delimiters from literal text characters. This causes:

toMarkdown escaping bugs (over-escaping inside inline tags, under-escaping in text nodes)
Round-trip failures (toHTML(toMarkdown(html)) !== html)
Fragile interactions between features (underscore normalization + strikethrough, HTML passthrough + escaping)

Invariants

toHTML satisfies GFM spec rules 1-15
toMarkdown always emits the canonical form
toHTML(toMarkdown(html)) === html (single-pass round-trip)

Architecture

Token types

text      — literal characters, will be escaped during serialization
delimiter — structural marker (**, *, ~~, `, etc.)
html      — raw HTML tag passthrough
break     — hard line break (<br>)
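
As a rough illustration, the four token kinds could be modelled as a discriminated union. The field name `value` and the helper `describeToken` are assumptions made for this sketch, not names taken from types.ts.

```typescript
// Hypothetical shape for the four token kinds listed above.
type Token =
  | { type: "text"; value: string }       // literal characters
  | { type: "delimiter"; value: string }  // structural marker such as ** or ~~
  | { type: "html"; value: string }       // raw HTML tag passthrough
  | { type: "break" };                    // hard line break

// Renders a token in the bracket notation used in this document.
function describeToken(token: Token): string {
  switch (token.type) {
    case "text":
      return `[text ${JSON.stringify(token.value)}]`;
    case "delimiter":
      return `[delim ${token.value}]`;
    case "html":
      return `[html ${token.value}]`;
    case "break":
      return "[break]";
  }
}
```

Because the `type` field discriminates the union, the compiler can check that every consumer handles all four kinds.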

Inline tokenizer (markdown → tokens)

Scans left-to-right, character by character. Maintains a stack of open delimiters. Produces a flat token stream:

Input:  "hello **bold *nested*** end"
Tokens: [text "hello "] [open **] [text "bold "] [open *] [text "nested"] [close *] [close **] [text " end"]

The tokenizer handles:

Backslash escapes: \* → text token containing *
Entity resolution: &amp; → text token containing &
Flanking rules: only emit delimiter tokens when flanking conditions are met
Code spans: ` opens a code span that consumes everything until the matching `
Links: [text](url) parsed as a unit
Autolinks: <url> and bare URLs
Hard line breaks: trailing spaces or \ before newline
HTML tags: <span> etc. passed through as html tokens
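
A minimal sketch of the scanning loop, covering only backslash escapes, single-backtick code spans, and runs of * and ~~. Flanking rules, entities, links, autolinks, hard breaks, and HTML tags are deliberately left out, and the name `tokenizeInline` and the token shape are assumptions for illustration.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; value: string };

// ASCII punctuation, the characters a backslash may escape.
const PUNCT = /[!-\/:-@\[-`{-~]/;

function tokenizeInline(src: string): Token[] {
  const tokens: Token[] = [];
  let buf = ""; // pending literal text
  const flush = (): void => {
    if (buf !== "") tokens.push({ type: "text", value: buf });
    buf = "";
  };

  let i = 0;
  while (i < src.length) {
    const ch = src.charAt(i);
    if (ch === "\\" && PUNCT.test(src.charAt(i + 1))) {
      // Backslash escape: the escaped character becomes literal text.
      buf += src.charAt(i + 1);
      i += 2;
    } else if (ch === "`") {
      const end = src.indexOf("`", i + 1);
      if (end === -1) {
        buf += ch; // no closing backtick: treat as literal
        i += 1;
      } else {
        // Code span: everything up to the matching backtick is text.
        flush();
        tokens.push({ type: "delimiter", value: "`" });
        tokens.push({ type: "text", value: src.slice(i + 1, end) });
        tokens.push({ type: "delimiter", value: "`" });
        i = end + 1;
      }
    } else if (ch === "*" || ch === "~") {
      // Collect the whole run of identical characters.
      let j = i;
      while (src.charAt(j) === ch) j += 1;
      const run = src.slice(i, j);
      if (ch === "~" && run.length !== 2) {
        buf += run; // only ~~ is structural in this sketch
      } else {
        flush();
        tokens.push({ type: "delimiter", value: run });
      }
      i = j;
    } else {
      buf += ch;
      i += 1;
    }
  }
  flush();
  return tokens;
}
```

Note that an escaped \* ends up inside a text token, so later stages never see it as a delimiter.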

Inline parser (tokens → HTML)

Walks the token stream and matches open/close delimiter pairs using a stack. Produces HTML string. Handles:

Delimiter pairing with precedence (*** before ** before *)
Multiple-of-3 rule
Nesting validation (no em inside em, no links inside links)
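
The stack-based pairing can be sketched as follows. This simplified version only closes a delimiter against the top of the stack and falls back to literal text for unmatched openers; it does not implement *** precedence, the multiple-of-3 rule, or nesting validation. `renderInline` and `TAGS` are hypothetical names.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; value: string };

// Assumed mapping from delimiter to HTML tag name.
const TAGS: Record<string, string> = { "**": "strong", "*": "em", "~~": "del" };

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function renderInline(tokens: Token[]): string {
  const out: string[] = [];
  // Each entry remembers where its opening tag sits in `out`.
  const stack: { value: string; index: number }[] = [];

  for (const token of tokens) {
    const tag = token.type === "delimiter" ? TAGS[token.value] : undefined;
    if (tag === undefined) {
      // Text tokens and unknown delimiters are emitted as escaped text.
      out.push(escapeHtml(token.value));
      continue;
    }
    const top = stack[stack.length - 1];
    if (top !== undefined && top.value === token.value) {
      stack.pop(); // closes the most recent opener
      out.push(`</${tag}>`);
    } else {
      stack.push({ value: token.value, index: out.length });
      out.push(`<${tag}>`);
    }
  }

  // Openers that never found a closer revert to literal text.
  for (const open of stack) out[open.index] = escapeHtml(open.value);
  return out.join("");
}
```

Feeding it the token stream from the earlier "hello **bold *nested*** end" example yields nested strong and em elements.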

Serializer (DOM → tokens → markdown)

Walks the DOM tree. For each node:

Text nodes → text tokens (the serializer knows these need escaping)
Element nodes → look up the tag, emit delimiter tokens + recurse into children
Unknown elements → recurse into children
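
A sketch of the tree walk, using a minimal stand-in node type so it runs outside a browser. `SketchNode`, `DELIMS`, and `nodeToTokens` are hypothetical names; the real serializer would walk actual DOM nodes and consult the tag definitions.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; value: string };

// Minimal stand-in for DOM text and element nodes.
type SketchNode =
  | { kind: "text"; data: string }
  | { kind: "element"; tag: string; children: SketchNode[] };

// Assumed mapping from tag name to markdown delimiter.
const DELIMS: Record<string, string> = { strong: "**", em: "*", del: "~~" };

function nodeToTokens(node: SketchNode): Token[] {
  if (node.kind === "text") {
    // Text nodes become text tokens; escaping happens later.
    return [{ type: "text", value: node.data }];
  }
  const inner: Token[] = [];
  for (const child of node.children) inner.push(...nodeToTokens(child));

  const marker = DELIMS[node.tag];
  if (marker === undefined) return inner; // unknown element: children only
  const delimiter: Token = { type: "delimiter", value: marker };
  return [delimiter, ...inner, delimiter];
}
```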

Then the token stream is serialized to a string:

Delimiter tokens → emitted verbatim (they're structural)
Text tokens → characters that would be misinterpreted as delimiters are backslash-escaped. The serializer knows exactly which characters are dangerous because it knows what delimiters exist.
HTML tokens → emitted verbatim
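
The string-emission step might look like the sketch below. The escaped character set is an assumption chosen for illustration, and this version escapes those characters unconditionally, whereas a production serializer could be more selective about context. Emitting a hard break as a backslash followed by a newline is also an assumption.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; value: string }
  | { type: "html"; value: string }
  | { type: "break" };

// Backslash-escape characters that could be read as markup (assumed set).
function escapeText(value: string): string {
  return value.replace(/[\\*_~`\[\]<]/g, "\\$&");
}

function tokensToMarkdown(tokens: Token[]): string {
  let out = "";
  for (const token of tokens) {
    switch (token.type) {
      case "text":
        out += escapeText(token.value); // only text tokens are escaped
        break;
      case "delimiter":
      case "html":
        out += token.value; // structural: emitted verbatim
        break;
      case "break":
        out += "\\\n"; // backslash before newline
        break;
    }
  }
  return out;
}
```

Escaping is decided purely by token type, so the serializer never has to guess whether a * in the output came from an element or from text.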

Why this solves the round-trip problem

The key insight: delimiter tokens and text tokens are different types. When serializing <strong>hello <em>world</em></strong>, the token stream is:

[delim **] [text "hello "] [delim *] [text "world"] [delim *] [delim **]

The * characters around "world" are delimiter tokens (from the nested <em>). If the text instead contained a literal *:

<strong>hello * world</strong>

The output would be:

[delim **] [text "hello * world"] [delim **]

The * is a text token. During serialization, the text token scanner sees * and escapes it to \* because * is a known delimiter character. The delimiter tokens are never escaped. No ambiguity.

Files

types.ts — Token type, updated Tag interface
tokenizer.ts — Inline tokenizer (markdown → tokens)
serializer.ts — DOM → tokens → markdown string
hopdown.ts — Orchestrator (block parsing, delegates inline to tokenizer)
tags.ts — Tag definitions (simplified: no more regex patterns)

Migration

The Tag interface changes:

pattern field removed (tokenizer handles delimiter matching)
toMarkdown returns Token[] instead of string
match stays the same (block-level matching is already clean)
toHTML stays the same
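
To make the toMarkdown change concrete, here is a heavily trimmed sketch. Only the Token[] return type comes from the list above; the `SketchTag` interface, its `name` field, and the `children` parameter are assumptions, and the real Tag interface also carries match and toHTML.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; value: string };

// Hypothetical, reduced Tag shape. There is no `pattern` field, and
// toMarkdown returns tokens rather than a pre-escaped string.
interface SketchTag {
  name: string;
  toMarkdown(children: Token[]): Token[];
}

const strongTag: SketchTag = {
  name: "strong",
  toMarkdown: (children) => [
    { type: "delimiter", value: "**" },
    ...children,
    { type: "delimiter", value: "**" },
  ],
};
```

Returning tokens lets each tag declare its delimiters without needing to know how the text of its children will be escaped.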

The HopDown public API stays the same:

toHTML(markdown) — unchanged
toMarkdown(html) — unchanged
findCompletePair, findUnmatchedOpener — reimplemented on tokenizer
getTagForElement, getEditableSelector — unchanged
