Generating HTML pages from LaTeX



While LaTeX was not designed for web content, it is very useful to be able to generate a web version of a LaTeX document. LaTeX is meant for typesetting layouts on a predefined page, but when you want to share the information with others, it's generally a lot easier for them to go to a webpage than it is to download and open a PDF. A webpage is also usually easier to read than a PDF because the content is continuous, and one can scroll around and click hyperlinks far more fluidly.

Now that MathML and SVG are becoming better supported by web browsers, there is a strong case for sharing mathy documents on the web in addition to paper documents (or PDFs, which are only slightly more readable than paper).

To this end, I've been evaluating various LaTeX-to-HTML converters. I've tried the following on Linux (Ubuntu):

  1. TTH
  2. LaTeX2HTML
  3. TeX4ht
  4. LaTeXML

By far my favorite is LaTeXML. It generates crisp, simple pages using MathML and CSS, making it easy to customize the style. It doesn't support many of the packages that I would like to use (like algorithm2e), but then again neither do the others. Also, the arXiv project is working on a branch of LaTeXML, so there is promise that it will grow quickly to support a lot of the best packages.

Document Setup

My current approach to generating both PDF and HTML from LaTeX source is to use a separate top-level document for each. The directory structure looks something like this:

    document
     |- document_html.tex
     |- document_pdf.tex
     |- document.tex
     |- preamble_common.tex
     |- preamble_html.tex
     |- preamble_pdf.tex
     \- references.bib

The two versions of document_[output].tex are the top-level files. The HTML one looks like this:

%document_html.tex

\documentclass[10pt]{article}
\input{preamble_common}
\input{preamble_html} 
\begin{document}
\input{document}
\end{document}

The PDF version is the same except that it inputs preamble_pdf. Note that in LaTeX you cannot nest \include directives, but you can nest \input directives. Also, \include starts a new page, so there is no reason to use it here. Instead, document.tex may \include its chapters as separate tex files or the like.

Makefile

To ease the process of generating the different output types, I'm using a makefile.

# The following definitions are the specifics of this project
PDF_OUTPUT  :=  document.pdf
HTML_OUTPUT :=  document.html

PDF_MAIN    :=  document_pdf.tex
HTML_MAIN   :=  document_html.tex

COMMON_TEX  :=  document.tex \
                preamble_common.tex

PDF_TEX     :=  $(COMMON_TEX) \
                document_pdf.tex \
                preamble_pdf.tex

HTML_TEX    :=  $(COMMON_TEX) \
                document_html.tex \
                preamble_html.tex

BIB         :=  references.bib



# these variables are the dependencies for the outputs
PDF_SRC     := $(PDF_TEX) $(BIB)
HTML_SRC    := $(HTML_TEX) $(BIB)

# the 'all' target will make both the pdf and html outputs
all: pdf html

# the 'pdf' target will make the pdf output
pdf: $(PDF_OUTPUT)

# the 'html' target will make the html output
html: $(HTML_OUTPUT)

# the pdf output depends on the pdf tex files and the bibliography
# we use a shell script to optionally run pdflatex multiple times until the
# output does not suggest that we rerun latex
$(PDF_OUTPUT): $(PDF_SRC)
    @echo "Running pdflatex on $(PDF_MAIN)"
    @pdflatex $(basename $(PDF_MAIN)) > $(basename $(PDF_MAIN))_0.log
    @echo "Running bibtex"
    @-bibtex   $(basename $(PDF_MAIN)) > bibtex_pdf.log 
    @echo "Checking for rerun suggestion"
    @for ITER in 1 2 3 4; do \
        STABELIZED=`grep "Rerun" $(basename $(PDF_MAIN)).log`; \
        if [ -z "$$STABELIZED" ]; then \
            echo "Document stabilized after $$ITER iterations"; \
            break; \
        fi; \
        echo "Document not stabilized, rerunning pdflatex"; \
        pdflatex $(basename $(PDF_MAIN)) > $(basename $(PDF_MAIN))_$$ITER.log; \
    done
    @echo "Copying pdf to target file"
    @cp $(basename $(PDF_MAIN)).pdf $(PDF_OUTPUT)

# the html output depends on the html tex files
# we have to process all of the bibliography files separately into xml files, 
# and then include them all in the call to the postprocessor
$(HTML_OUTPUT): $(HTML_SRC)
    @echo "Running latexml on $(HTML_MAIN)"
    @latexml $(HTML_MAIN) -dest=$(basename $(HTML_OUTPUT)).xml > $(basename $(HTML_MAIN)).log 2>&1
    @BIBSTRING=""; \
    for BIBFILE in $(BIB); do \
        echo "Running latexml on $$BIBFILE"; \
        XMLFILE=`basename "$$BIBFILE" .bib`.xml; \
        LOGFILE=`basename "$$BIBFILE" .bib`_html.log; \
        latexml $$BIBFILE -dest=$$XMLFILE > $$LOGFILE 2>&1; \
        BIBSTRING="$$BIBSTRING -bibliography=$$XMLFILE"; \
    done; \
    echo $$BIBSTRING > bibstring.txt
    @echo "postprocessing with `cat bibstring.txt`"
    @latexmlpost $(basename $(HTML_OUTPUT)).xml `cat bibstring.txt` -dest=$(HTML_OUTPUT) -css=navbar-left.css

# the 2>/dev/null redirects stderr to the null device so that we don't get error
# messages in the console when rm has nothing to remove
clean:
    @-rm -v *.log 2>/dev/null
    @-rm -v *.out 2>/dev/null
    @-rm -v *.aux 2>/dev/null
    @-rm -v *.xml 2>/dev/null
    @-rm -v *.pdf 2>/dev/null
    @-rm -v *.html 2>/dev/null
    @-rm -v bibstring.txt 2>/dev/null

Some notes on the makefile. I execute bibtex ignoring errors (the dash before 'bibtex') because bibtex will exit with an error if it doesn't find any citations, or if there is no bibliography. The output of each pdflatex run goes to a logfile named "document_pdf_<i>.log", where "<i>" is the iteration number. The output of pdflatex and bibtex is suppressed by dumping it to these logfiles (I find the verbosity useless to have in the console).
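
Outside of make, the "ignore the error and log the output" behavior can be sketched in plain shell. This is only an illustration: the helper name run_ignoring_errors and the file demo_step.log are made up, `false` stands in for a bibtex run that finds nothing to cite, and unlike the Makefile it also sends stderr to the log.

```shell
# Run a command, send its output to a log file, and never fail the caller.
# Hypothetical helper: roughly what make's "-" prefix plus "> logfile" does,
# except that stderr is captured in the log as well.
run_ignoring_errors() {
    logfile=$1
    shift
    if "$@" > "$logfile" 2>&1; then
        echo "ok: $*"
    else
        # Inside the else branch, $? still holds the failed command's status.
        echo "ignored failure (exit $?): $*"
    fi
    return 0
}

# 'false' always exits with status 1, like bibtex with no citations.
run_ignoring_errors demo_step.log false
run_ignoring_errors demo_step.log echo "hello"
```

The helper always returns 0, so a script running under `set -e` would carry on after the failed step, much as make carries on after the dashed bibtex line.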

The shell script in the PDF recipe iterates up to four times. The first thing it does is grep the log of the most recent pdflatex run (document_pdf.log, which pdflatex itself rewrites on every run), looking for a line where LaTeX recommends that we "Rerun" it. If it finds such a line, it sets the shell variable STABELIZED to that line. Otherwise the variable gets the empty string. Then we test whether the string is empty. If it is, we're done, so we break out of the loop. If it's not, we rerun pdflatex.
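
The same rerun logic can be pulled out of the recipe into a standalone shell sketch, which makes it easier to test without make. The function names needs_rerun and rerun_until_stable are my own, and the build command is passed in as arguments (for example `pdflatex document_pdf`). One deliberate difference from the Makefile: the sketch returns a failure status when it runs out of attempts, whereas the recipe simply stops.

```shell
# Succeed (exit 0) if the given LaTeX log asks for another run, e.g. via
#   "LaTeX Warning: Label(s) may have changed. Rerun to get cross-references right."
needs_rerun() {
    grep -q "Rerun" "$1"
}

# Usage: rerun_until_stable LOGFILE MAX_RERUNS COMMAND [ARGS...]
# Assumes COMMAND has already been run once and rewrites LOGFILE on each run.
rerun_until_stable() {
    log=$1
    max_reruns=$2
    shift 2
    iter=1
    while [ "$iter" -le "$max_reruns" ]; do
        if ! needs_rerun "$log"; then
            # iter counts the initial run plus any reruns so far.
            echo "stable after $iter run(s)"
            return 0
        fi
        "$@" > /dev/null 2>&1
        iter=$((iter + 1))
    done
    echo "gave up after $max_reruns rerun(s)"
    return 1
}
```

A call such as `rerun_until_stable document_pdf.log 4 pdflatex document_pdf` would then play the role of the loop in the recipe.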

The shell script in the HTML recipe iterates over each of the (potentially multiple, potentially zero) bibliography files, processing each of them with latexml. It then appends the string "-bibliography=<filename>.xml" to the BIBSTRING shell variable. The last thing it does is echo the contents of that shell variable to the file "bibstring.txt". This is so that subsequent commands run by make can find it: make runs each recipe line in its own shell, so a shell variable set on one line is gone by the next.
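
As a rough sketch, the option string can also be built by a small shell function, which avoids the temporary file when the whole step lives in one script. The function name bib_options and the second file name in the example call are hypothetical; the option spelling follows the Makefile.

```shell
# Print the latexmlpost bibliography options for a list of .bib files.
# Each "path/name.bib" becomes "-bibliography=name.xml", matching the
# Makefile, which writes the XML files into the current directory.
bib_options() {
    opts=""
    for bibfile in "$@"; do
        xmlfile=$(basename "$bibfile" .bib).xml
        opts="$opts -bibliography=$xmlfile"
    done
    # Strip the leading space; printf avoids echo's option parsing.
    printf '%s\n' "${opts# }"
}

# Example with one real and one made-up bibliography file name.
bib_options references.bib extra/books.bib
```

Inside a script, the result could be used directly with command substitution, in the same way the recipe uses `cat bibstring.txt`.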
