[TeX/LaTeX] What research exists on speech to LaTeX / MathML?

Tags: accessibility, conversion, research

Preface

I'm currently working in the area of speech recognition and simultaneous lecture translation. Essentially, you have one speaker (the lecturer) who speaks in German (for example) and many students who don't understand German. The students can get a spoken translation of the lecturer's words, in any language they like, just a few seconds after the words are spoken.

So the process involves:

  1. Transforming German speech into German text;
  2. Transforming German text into text in a foreign language; and
  3. Synthesizing speech for text in a foreign language.

This system has already been built and works amazingly well. It's far from perfect, but it works well enough for you to get an idea of what the lecture is about.

An example of this system is shown right at the beginning of this video:

Video: Simultaneous machine translation (in German)

But where the system currently really sucks is math. For example, when you say something like:

The sum from n equals 1 to infinity of 1 over n squared goes to infinity.

It will recognize exactly this. But I think it should recognize

$\sum_{n=1}^\infty \frac{1}{n^2} \rightarrow \infty$

or something similar. And yes, I am aware of the fact that even humans cannot always tell how speech (especially without context) would translate to LaTeX without any ambiguity. But even providing several alternatives would be a big improvement.

My Question

Are there any papers that deal with the problem "math speech to text" or "math speech to (La)TeX" or "math speech to MathML"?

I have seen a question asking about LaTeX to MathML to speech but not the other way round.

I've also found metroplexvoice.com, but what they use is not natural language. The speakers make unnaturally long pauses for the system, and they say things like "newline".

Summary of papers I know

When people who are interested in the same topic read this question, they should not have to read each paper in full to get a rough idea of what it is about. So I'll try to summarize the ones I've read.

How can we speak math?

The first part of this paper is about text to speech (or rather math to speech). It explains ambiguity of math speech and recommends a multimodal approach (speech and text) for on-line recognition of math.

Best Answer

You are taking on a REALLY hard problem that is IMHO not solvable in your context. When humans speak math, they frequently speak it ambiguously. A simple example is "b plus c over d". Even if you get 100% recognition of the tokens of the language, which is a challenge with single-letter tokens, there is no way to know whether this is $\frac{b+c}{d}$ or $b+\frac{c}{d}$.
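To make the ambiguity concrete, here is a minimal Python sketch that enumerates every way of grouping a flat sequence of spoken tokens. The `OPERATORS` table and the `readings` function are hypothetical names made up for this illustration, and the sketch assumes the tokens were recognized perfectly and strictly alternate between operands and operators.

```python
from itertools import product

# Hypothetical operator table (an assumption for this sketch):
# spoken word -> LaTeX template with left/right operands.
OPERATORS = {
    "plus": "{l}+{r}",
    "over": "\\frac{{{l}}}{{{r}}}",
}

def readings(tokens):
    """Return every LaTeX reading of an alternating operand/operator list."""
    if len(tokens) == 1:
        return [tokens[0]]
    results = []
    # Try each operator position as the top-level (last applied) operator.
    for i in range(1, len(tokens), 2):
        template = OPERATORS[tokens[i]]
        lefts = readings(tokens[:i])
        rights = readings(tokens[i + 1:])
        for left, right in product(lefts, rights):
            results.append(template.format(l=left, r=right))
    # Different groupings can render identically; keep the first of each.
    return list(dict.fromkeys(results))

alternatives = readings("b plus c over d".split())
for latex in alternatives:
    print(latex)
```

For "b plus c over d" this yields exactly the two readings mentioned above. The number of groupings grows quickly with the length of the utterance, which is part of why offering alternatives only helps for short expressions.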

When computers speak math for people who are blind, they need to add somewhat unnatural delimiters such as "fraction ... over ... end fraction" to make the speech unambiguous. Fateman's paper discusses this need and suggests one way of speaking 2D expressions. Because you probably don't want to force your speakers to speak a certain way, I suspect translating the words and leaving the listener to piece together the expression in their mind is better than trying to create the math notation in TeX or MathML and being wrong a non-trivial fraction of the time.
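As a rough illustration of why explicit delimiters remove the ambiguity, the following sketch parses a tiny delimited vocabulary into LaTeX. The grammar (only "plus" and "fraction ... over ... end fraction") and the function names are assumptions made for this example; this is not the scheme from Fateman's paper.

```python
# Hypothetical reserved words of the mini-grammar (an assumption).
KEYWORDS = {"fraction", "over", "end", "plus"}

def to_latex(speech):
    """Translate delimited spoken math into LaTeX."""
    words = speech.split()
    latex, pos = _expr(words, 0)
    if pos != len(words):
        raise ValueError(f"unexpected word {words[pos]!r}")
    return latex

def _expr(words, pos):
    # expr := term ("plus" term)*
    latex, pos = _term(words, pos)
    while pos < len(words) and words[pos] == "plus":
        right, pos = _term(words, pos + 1)
        latex += "+" + right
    return latex, pos

def _term(words, pos):
    # term := "fraction" expr "over" expr "end" "fraction" | symbol
    if pos >= len(words):
        raise ValueError("unexpected end of input")
    if words[pos] == "fraction":
        num, pos = _expr(words, pos + 1)
        pos = _expect(words, pos, "over")
        den, pos = _expr(words, pos)
        pos = _expect(words, pos, "end")
        pos = _expect(words, pos, "fraction")
        return f"\\frac{{{num}}}{{{den}}}", pos
    if words[pos] in KEYWORDS:
        raise ValueError(f"unexpected keyword {words[pos]!r}")
    return words[pos], pos + 1

def _expect(words, pos, word):
    if pos >= len(words) or words[pos] != word:
        raise ValueError(f"expected {word!r} at word {pos}")
    return pos + 1

print(to_latex("fraction b plus c over d end fraction"))
print(to_latex("b plus fraction c over d end fraction"))
```

Each utterance now has exactly one parse, but only because the speaker is forced to say "fraction" and "end fraction", which is the unnatural style a lecturer is unlikely to adopt.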

In addition to ambiguous speech, another big challenge that you will face is that there are many, many ways to speak the same notation. Some are simple variants ("loop integral" and "contour integral"). However, others have many variations and are inconsistently used even by a single speaker. As an example, someone might speak a parenthesized quantity with a pause and rate change one time, say "quantity" or "all" or some other phrase another time, or might even speak the parens (open/close) in one of several ways yet another time.
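The simple variants can be handled with a synonym table that maps each spoken phrase to one canonical token, as in this hedged sketch. The `CANONICAL` table and its entries are assumptions chosen for illustration. The harder cases (pauses, rate changes, "quantity", "all") carry structural information that no lookup table of this kind can capture.

```python
import re

# Hypothetical synonym table (an assumption): spoken variant -> canonical token.
CANONICAL = {
    "loop integral": "\\oint",
    "contour integral": "\\oint",
    "open paren": "(",
    "left parenthesis": "(",
    "close paren": ")",
    "right parenthesis": ")",
}

# Longest phrases first, so a longer variant wins over a shorter one
# that starts at the same position.
_PHRASES = sorted(CANONICAL, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in _PHRASES) + r")\b"
)

def normalize(speech):
    """Replace every known spoken variant with its canonical token."""
    # A function replacement avoids backslash handling in the result.
    return _PATTERN.sub(lambda m: CANONICAL[m.group(0)], speech.lower())

print(normalize("Loop integral"))
print(normalize("open paren a plus b close paren"))
```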

If you are willing to train your speakers to speak math in some specified way to eliminate the ambiguity and reduce the number of variations you need to deal with, then I think the problem is solvable. But even with those simplifications, it is still a really hard problem because you are going to need to drive down recognition errors with some clever context recognition that likely requires a large body of sample equations.