4 Internals

src -------> target/stage ------> target/<format>
    frontend              backend
  • Frontend (Sec. 4.1): format-independent processing through Pandoc
  • Backend (Secs. 4.2.1, 4.2.2): format-specific rendering through Pandoc and mdBook/LaTeX

4.1 Frontend

src/*.md

    |
    | prepend-heading
    v

[md]

    |
    | pandoc (frontend)
    v

[json]

    |
    | preprocessor
    v

target/stage/src/*.json

Each Markdown file is transformed individually into an output-independent JSON file using Pandoc. Using JSON ensures that all details are preserved, including citations (which would be lost if HTML were used for the intermediate files). The preprocessor is not yet implemented.

Each file is transformed independently of the others; no state is maintained between transformations.
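A minimal sketch of the per-file Pandoc step, written in Python for brevity. The function names are hypothetical, the `markdown` input format is an assumption, and the prepend-heading step is left out; only the src/*.md to target/stage/src/*.json mapping comes from the diagram above.

```python
from pathlib import PurePosixPath


def staged_json_path(md_path):
    """Map src/<name>.md to target/stage/src/<name>.json."""
    rel = PurePosixPath(md_path).relative_to("src")
    return str(PurePosixPath("target/stage/src") / rel.with_suffix(".json"))


def frontend_command(md_path):
    """Build the argument list for one Pandoc run (no shared state)."""
    return [
        "pandoc",
        "--from", "markdown",  # assumption: the sources are Pandoc Markdown
        "--to", "json",        # Pandoc's JSON serialization of its AST
        "--output", staged_json_path(md_path),
        md_path,
    ]
```

Because the command depends only on its own input path, the files can be processed in any order, or in parallel.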

Even though the transformations are independent, the Makefile still has to know which files to transform. This is what target/stage/src/SUMMARY.mk stores: the list of book items, in the correct order, as the Makefile variable $(item_names). It is generated by .local/bin/get-book-items.
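As an illustration, the following Python sketch derives such a list from SUMMARY.md. It is not the actual get-book-items script: the function names are hypothetical, and it assumes that SUMMARY.md lists items as Markdown links and that an item name is the link target without its .md suffix.

```python
import re

# Matches a Markdown link such as "[Title](chapter/intro.md)".
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)]+)\)")


def book_items(summary_text):
    """Return item names in SUMMARY.md order (assumed: link target minus '.md')."""
    items = []
    for line in summary_text.splitlines():
        match = LINK_RE.search(line)
        if match is None:
            continue  # headings, blank lines, draft items without a target
        target = match.group(1).strip()
        if target.endswith(".md"):
            items.append(target[: -len(".md")])
    return items


def summary_mk(summary_text):
    """Render the items as a single Makefile variable assignment."""
    return "item_names := " + " ".join(book_items(summary_text)) + "\n"
```

Item names containing spaces would need extra care, since Make splits variable values on whitespace.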

4.2 Backend

Using the files in the staging directory (target/stage), we use mdBook to parse the structure of the book items (chapters, sections, etc.) in SUMMARY.md and render the output files. This is handled by the following custom Rust tools:

  • html-mdbook-toml: translates the Makefile variable $(metadata) into target/stage/html/book.toml for mdBook
  • html-merge: combines the input files into an amalgamation so that pandoc-crossref works correctly (also appends the bibliography if present)
  • html-fix-links: makes the links in the amalgamation work correctly after splitting
  • html-split: splits the amalgamation back into individual pages for mdBook
  • latex-merge: reads SUMMARY.md and combines the various target/stage/src/*.json files into a single JSON file, target/pdf/book.json, to be consumed by Pandoc for PDF generation
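The merge-then-split idea behind html-merge and html-split can be sketched as follows, in Python rather than Rust. This is an assumption-laden illustration, not the tools' real behavior: the boundary marker format and the function names are hypothetical, and the bibliography and link fixing are left out. The only thing taken as given is Pandoc's JSON layout (a top-level object with `pandoc-api-version`, `meta`, and `blocks`).

```python
import re

# Hypothetical page-boundary marker; the real tools may use something else.
MARKER = "<!-- page: {} -->"
MARKER_RE = re.compile(r"<!-- page: (.*?) -->\n?")


def merge(docs):
    """Concatenate Pandoc JSON documents ({name: AST dict}, non-empty) into one AST."""
    blocks = []
    for name, doc in docs.items():
        # A raw HTML comment survives rendering and marks where a page starts.
        blocks.append({"t": "RawBlock", "c": ["html", MARKER.format(name)]})
        blocks.extend(doc["blocks"])
    first = next(iter(docs.values()))
    return {
        "pandoc-api-version": first["pandoc-api-version"],
        "meta": first["meta"],  # assumption: metadata of the first file wins
        "blocks": blocks,
    }


def split(html):
    """Split rendered HTML back into {name: page_html} using the markers."""
    parts = MARKER_RE.split(html)
    # parts = [text before first marker, name1, body1, name2, body2, ...]
    return {parts[i]: parts[i + 1] for i in range(1, len(parts) - 1, 2)}
```

Running the filters on the merged document is what lets cross-references resolve across pages; splitting afterwards restores the one-page-per-item layout that mdBook expects.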

4.2.1 HTML Output

target/stage/src/*.json

        |
        | html-merge
        v

[json]

        |
        | pandoc-crossref
        | +
        | pandoc-citeproc
        | +
        | html-fix-links
        | +
        | pandoc (HTML)
        v

[html]

        |
        | html-split
        v

target/stage/html/src/*.htm

        |
        | mdbook
        v

target/html

The JSON files are translated into .htm files using Pandoc. The choice of .htm over .html is mainly a workaround for an mdBook quirk (Sec. 4.2.1.2).

In principle, the final step should be straightforward: just call mdbook build <dir>. In practice, we have to use a patched version of mdBook because we want to render Markdown using Pandoc rather than the default pulldown-cmark.

4.2.1.1 The mdBook patch

We use a patched version of mdBook that renders directly from HTML rather than from Markdown. This is because we want to use Pandoc to preprocess the input files, and “standard” Markdown is not very expressive.

Originally, the plan was to preprocess the files with Pandoc and save them as CommonMark so that mdBook could parse them. However, (a) Pandoc does not support a lossless conversion to CommonMark, and (b) pulldown-cmark does not adhere to CommonMark very well (e.g., it does not preserve whitespace within <pre> elements as the spec demands).

Therefore, we simply disable the Markdown processing altogether and use HTML. To avoid duplicating a large chunk of code from mdBook, we opted to patch the original code instead.

4.2.1.2 Using .htm instead of .html

After rendering, mdBook copies all files except .md files into the output directory. If we named the input files .html, they would immediately overwrite the files we just generated! Moreover, using .htm makes it easy to delete the input files from the output directory afterward.
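The cleanup this enables can be sketched in a few lines of Python. The function names are hypothetical and this is not the project's actual cleanup code; it only shows that an exact .htm suffix test is enough to tell staged inputs apart from the generated .html pages.

```python
from pathlib import Path, PurePosixPath


def staged_inputs(paths):
    """Return the paths whose suffix is exactly '.htm' (the copied inputs)."""
    return [p for p in paths if PurePosixPath(p).suffix == ".htm"]


def remove_staged_inputs(out_dir):
    """Delete every *.htm file under out_dir and return the removed paths."""
    removed = sorted(Path(out_dir).rglob("*.htm"))
    for path in removed:
        path.unlink()
    return removed
```

Had the inputs been named .html, no such suffix test could separate them from the rendered pages.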

4.2.1.3 Keeping target/html tidy

Part of the reason for having target/stage is that it lets us select precisely which files we want. mdBook is not very selective: it copies everything in src into dest and then purges all the .md files.

4.2.2 PDF Output

target/stage/src/*.json

        |
        | latex-merge
        v

target/pdf/book{_head.tex,{_before,,_after}.json}

        |
        | pandoc-multiref
        | +
        | pandoc (LaTeX)
        v

target/pdf/book.tex

        |
        | latexmk
        v

target/pdf/book.pdf

latex-merge is responsible for combining the chapters into a single JSON file, shifting the heading levels as necessary. Then Pandoc is invoked to generate the .tex file. Finally, we run latexmk to produce the PDF.
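The heading shift can be illustrated on Pandoc's JSON AST, where a heading is a block of the form {"t": "Header", "c": [level, attr, inlines]}. The Python sketch below is not the actual latex-merge code: the function names are hypothetical, and it assumes each item's nesting depth in SUMMARY.md determines the shift.

```python
import copy


def shift_headings(blocks, delta):
    """Return a copy of `blocks` with every top-level Header shifted by `delta`."""
    shifted = copy.deepcopy(blocks)  # leave the per-file AST untouched
    for block in shifted:
        # Only top-level blocks are visited; headers nested in other
        # containers (e.g. Div) are not handled in this sketch.
        if block.get("t") == "Header":
            block["c"][0] += delta
    return shifted


def merge_chapters(chapters):
    """Combine (depth, blocks) pairs into one block list; depth 0 = chapter."""
    merged = []
    for depth, blocks in chapters:
        merged.extend(shift_headings(blocks, depth))
    return merged
```

With this scheme every file can start at heading level 1, and a file nested one level down in the book has its level-1 heading rendered as a level-2 heading in the combined document.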

The main reason for invoking Pandoc within latexbook is that there are auxiliary files (e.g., the frontmatter file) that would otherwise get entangled with the Makefile.

4.2.2.1 Location of the .bib file

It’s much easier to have the .bib file at the top level. The reason is that we want it to work with both pandoc-citeproc, which treats it as a path relative to the top-level directory, and natbib, which treats it as a path relative to target/pdf. If we put it inside src, then pandoc-citeproc would complain, and mdbook would unwittingly copy the .bib file into target/html.
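To make the two views of the same file concrete, this small Python sketch computes how a top-level .bib file would be addressed from each working directory. The file name refs.bib and the function name are hypothetical; how the build actually passes the path to each tool is not shown here.

```python
import posixpath


def bib_path_for(bib, working_dir):
    """Express a path given relative to the top level as relative to working_dir."""
    return posixpath.relpath(bib, start=working_dir)


# As seen from the top-level directory (pandoc-citeproc's view).
print(bib_path_for("refs.bib", "."))
# As seen from target/pdf (natbib's view).
print(bib_path_for("refs.bib", "target/pdf"))
```

With the file at the top level, the second path is just two levels up, independent of how the sources under src are organized.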

Example of a bibliographic citation: (Doe 2000)