[TeX/LaTeX] Are there any open research problems in the world of TeX?

big-list, latex-misc, research, tex-core

For a year-long capstone project in computer science, I have to do some 'original research.' I'm still not entirely sure what this entails, but it just struck me that perhaps I can do research and simultaneously help the only open source project I truly believe in.

Are there any open problems in TeX or LaTeX that I could attempt to solve over the course of a year?

If possible, one problem per answer. (Note you can still post multiple answers.)


To clarify, I'm not looking for ideas that would make cool packages per se. I'm not looking to add some neat package to CTAN; that isn't nearly fundamental enough. I'm not looking for a large-scale programming project either; those are things you can just truck through (with the possible exception of a BibLaTeX style editor; I understand that's a doozy). I'm looking for problems that

  • have (or need) a clear definition of the problem
  • are fundamentally applicable to the core of TeX's typesetting (To clarify, I'm not expecting to re-code TeX for this, but it's not out of the question. Solving the 'rivers' problem may well require it, whoever takes it on.)
  • are within the scope of an undergraduate/graduate final year. (I go to a weird school; we have some pretty amazing research going on at the undergraduate level. See the project's official specification, keeping in mind that each department has its own spin.)

I ended up going with 're-evaluating and improving Knuth-Plass for modern hardware' as my proposal, and my adviser was enthusiastic about it (at least upon realizing I didn't want to extend the language of TeX). So I began my research and, lo and behold, it's already been looked at! (This isn't surprising, but it's definitely depressing.) Perhaps this algorithm could be implemented in one of the newer TeX engines (for example, to avoid rivers and stacks, to improve pagination, etc.), but that is beyond the scope of CS research. Thanks for all of your suggestions!

Best Answer

I don't know whether these are open problems or not, but since you are looking for a capstone project, you might be interested to explore if the basic algorithmic aspects of TeX can be improved.

  1. Line-breaking algorithm. The current line-breaking algorithm is a gold standard that all line-breaking software emulates, but is it the best way to break lines? Knuth and Plass's algorithm made specific premature optimization choices (pun intended!) like separating page breaking from line breaking, assigning badness based on the raggedness of the line but not accounting for rivers, etc. The only real advance since then has been character protrusion, and from what I understand, it still follows the same basic line-breaking algorithm. Now don't get me wrong. I am not saying that these choices were wrong. But a lot of them were made because the computational resources of that time could not really handle anything more sophisticated. Now that we have computers that are 1000 times more powerful than those of the 1970s, it should be possible to explore other options and see whether the line-breaking algorithm can be improved by taking more factors into account, especially page breaking, footnotes, side notes, and floats. Which is better: perfect line breaks but huge vertical spaces to balance the page, or slightly underfull lines but no vertical spaces? There is no way to play around with these trade-offs in the current framework (please correct me if I am wrong).

  2. Automatic breaking of display equations. Currently the breqn package implements the ideas of Michael J. Downes, but AFAIK, the algorithmic aspects are not as well understood as those of line breaking of text. Is it possible to cast line breaking of display equations as an optimization problem and determine a solution based on penalties and badness?

  3. Parsing natural math. There are recurring questions asking if it is possible to automatically translate <= to \le, sin(x/y) to \sin\left(\frac{x}{y}\right), etc. Although it is possible to do so with varying degrees of success in TeX and LuaTeX (e.g., the calcmath module in ConTeXt), I haven't seen any work that tries to understand how to parse math without markup. Given how sophisticated current NLP techniques are, it should be possible to do better than simple heuristics for parsing natural math.
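
To make the "line breaking as an optimization problem" framing from item 1 concrete, here is a minimal Python sketch that compares a greedy first-fit breaker with a total-fit dynamic program. It is only an illustration under strong assumptions: the function names (`greedy_breaks`, `total_fit_breaks`, `raggedness`) are made up for this example, every character is one unit wide, and the cost is simply the squared leftover space on each line except the last. That cost is a stand-in, not TeX's actual badness or demerits formula, and there is no stretchable glue, hyphenation, or penalty handling.

```python
def raggedness(lines, width):
    """Sum of squared leftover space over all lines except the last."""
    return sum((width - len(line)) ** 2 for line in lines[:-1])


def greedy_breaks(words, width):
    """First-fit: put each word on the current line if it still fits."""
    lines, current = [], []
    for word in words:
        if current and len(" ".join(current + [word])) > width:
            lines.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines


def total_fit_breaks(words, width):
    """Choose all breaks at once so the whole paragraph's cost is minimal."""
    if any(len(word) > width for word in words):
        raise ValueError("this sketch assumes every word fits on one line")
    n = len(words)
    # best[i] is the minimal cost of setting words[i:];
    # next_break[i] is the index where the first of those lines ends.
    best = [float("inf")] * (n + 1)
    next_break = [n] * (n + 1)
    best[n] = 0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            length = sum(len(w) for w in words[i:j]) + (j - i - 1)
            slack = width - length
            if slack < 0:
                break  # overfull; adding more words only makes it worse
            cost = 0 if j == n else slack * slack  # last line is free
            if cost + best[j] < best[i]:
                best[i] = cost + best[j]
                next_break[i] = j
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:next_break[i]]))
        i = next_break[i]
    return lines


if __name__ == "__main__":
    words = "aaa bb cc ddddd".split()
    for breaker in (greedy_breaks, total_fit_breaks):
        lines = breaker(words, 6)
        print(breaker.__name__, lines, raggedness(lines, 6))
```

On the toy input above with a width of 6, the greedy breaker fills the first line completely and leaves "cc" stranded on a very short second line, while the total-fit version accepts a slightly loose first line to get a lower overall cost. Extending the cost function to cover rivers or page-level effects is where the open questions begin; the quadratic loop here is only a convenient starting point for experiments.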