[Tex/LaTex] How to use mathpix (a LaTeX OCR tool) to identify LaTeX from images

ocr python

I recently heard about Mathpix, a tool that recognizes formulas in images and generates the corresponding LaTeX code. I have some printed handouts from a teacher of mine, written 10 years ago in Chinese and containing many math formulas. I don't have the original digital files, only the paper copies on my shelves. I want to digitize them, that is, obtain the Chinese text and the math formulas as LaTeX code so that I can reproduce and reprint them. Doing this by hand is a lot of work, so I am looking for a smarter way. I think Mathpix can help a lot, but I have two main questions about it:

1. Suppose I have a picture like this, containing a lot of inline math (just a demo, not the actual document I have):

   Can I get a result that contains both the English words and the inline math as LaTeX? (I mean, get the resulting string "Suppose $A$ is bounded subset of $\Bbb R^n$. If ...") It seems that I need a plain-text OCR tool and Mathpix to work together nicely. How can I achieve this?

2. If I have a large batch of images to process, I guess I need to write a Python program using the Mathpix API. But the sample code provided there does not work in my Python 3. I'm not good at Python; how should I modify it? Or is there some other clever way to do this? (Maybe I should ask this on another board, but I think fewer people there would know LaTeX.)

Best Answer

If you only have the hard copy version, you could also try using the Mathpix Android or iOS apps to take a picture of the documents and it will render the LaTeX. You can then export the LaTeX. Try it out and see if that works any better for you!
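
For the batch case raised in the question, a Python 3 script along the following lines may be a starting point. This is only a sketch under assumptions: the endpoint `https://api.mathpix.com/v3/text`, the `app_id`/`app_key` headers, and the `src`, `formats`, and `math_inline_delimiters` fields are how I understand the Mathpix API documentation and should be checked against its current version. The helper names (`build_request`, `recognize`, `recognize_folder`) are made up for illustration.

```python
import base64
import json
import urllib.request
from pathlib import Path

# Assumed endpoint; verify against the current Mathpix API documentation.
MATHPIX_URL = "https://api.mathpix.com/v3/text"


def build_request(image_bytes, app_id, app_key, mime="image/jpeg"):
    """Build (but do not send) a POST request for a single image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        # The image travels as a base64 data URI (assumed field name).
        "src": f"data:{mime};base64,{encoded}",
        # Ask for running text with embedded math (assumed field name).
        "formats": ["text"],
        # Assumed option: delimit inline math with $...$.
        "math_inline_delimiters": ["$", "$"],
    }
    return urllib.request.Request(
        MATHPIX_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "app_id": app_id,
            "app_key": app_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def recognize(path, app_id, app_key):
    """Send one image and return the recognized text (needs network access)."""
    request = build_request(Path(path).read_bytes(), app_id, app_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("text", "")


def recognize_folder(folder, app_id, app_key, pattern="*.jpg"):
    """Write a .tex file next to every matching image in `folder`."""
    for image in sorted(Path(folder).glob(pattern)):
        text = recognize(image, app_id, app_key)
        image.with_suffix(".tex").write_text(text, encoding="utf-8")
```

Calling `recognize_folder("scans", "YOUR_APP_ID", "YOUR_APP_KEY")` would then process every `.jpg` file in a (hypothetical) `scans` directory. Building the request separately from sending it makes it easy to inspect the payload before spending API quota.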

Related Solutions

[Tex/LaTex] How to publish a package that includes scripts and/or executables

Scripts vs. binaries

Firstly, the good news for you is that scripts are much easier to integrate into TeX distributions than compiled programs. To get the latter into TeX Live or MiKTeX it is best to get in touch with the distribution maintainer(s) through the appropriate mailing list.

Submission to CTAN

You need to decide whether to submit your package to CTAN as a flat .zip archive or as a .tds.zip archive in TDS format (see also the TDS submission guidelines). TDS avoids ambiguity in the file layout, but be sure to adhere to the specification: a flat .zip with no subdirectories is preferred over a messed-up .tds.zip. Test your .tds.zip before submitting, e.g., install it into TEXMFHOME and see if everything works (see below on how to test the executable scripts). Here's the layout I would suggest for your package pythontex:

doc/
   +-- latex/ 
            +-- pythontex/
                         +-- pythontex.pdf
                         +-- README
scripts/
       +-- pythontex/
                    +-- pythontex.bat
                    +-- pythontex.py
                    +-- pythontex_types.py
                    +-- pythontex_utils.py
source/
      +-- latex/ 
               +-- pythontex/
                            +-- pythontex.dtx
tex/
   +-- latex/ 
            +-- pythontex/
                         +-- pythontex.sty

You should include all package files in the archive, not just the source files (e.g., not only .dtx, but also the .sty file derived from it). You should also include a short and clear README file specifying the purpose of the package, its license (needs to be free to include in TL), its contents (files and their purpose) and any other requirements needed to install and use the package (e.g., external dependencies like Python).
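
Assembling the archive by hand is error-prone, so it may be worth scripting. The following is a minimal sketch, assuming all package files sit in one flat directory; the helper names `build_tds_zip` and `install_tds_zip` and the `TDS_LAYOUT` table are hypothetical and simply mirror the layout suggested above.

```python
import zipfile
from pathlib import Path

# TDS directory for each file, following the layout suggested above.
TDS_LAYOUT = {
    "pythontex.pdf": "doc/latex/pythontex",
    "README": "doc/latex/pythontex",
    "pythontex.bat": "scripts/pythontex",
    "pythontex.py": "scripts/pythontex",
    "pythontex_types.py": "scripts/pythontex",
    "pythontex_utils.py": "scripts/pythontex",
    "pythontex.dtx": "source/latex/pythontex",
    "pythontex.sty": "tex/latex/pythontex",
}


def build_tds_zip(source_dir, archive_path, layout=TDS_LAYOUT):
    """Pack the flat files in `source_dir` into a TDS-style archive."""
    source_dir = Path(source_dir)
    # Refuse to build an incomplete archive.
    missing = sorted(n for n in layout if not (source_dir / n).is_file())
    if missing:
        raise FileNotFoundError("missing package files: " + ", ".join(missing))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, tds_dir in layout.items():
            archive.write(source_dir / name, arcname=f"{tds_dir}/{name}")
    return Path(archive_path)


def install_tds_zip(archive_path, texmf_root):
    """Unpack the archive into a TEXMF tree (e.g. TEXMFHOME) for testing."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(texmf_root)
```

For a test installation, the target directory can be obtained with `kpsewhich -var-value=TEXMFHOME` and passed as `texmf_root`.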

Executable scripts

Directories with executables (i.e., those added to PATH) are not covered by the TDS specification, but as a package author you don't need to worry about that. Just put your scripts under scripts/<package name> and make clear in the package README which script is the main program to be executed. TeX distros will then add a symlink (TL on Unix) or a launching wrapper (win32, TL and MiKTeX) in the bin directory.

Whether to include a wrapper for launching the script is up to you. In principle, this is not needed nowadays for TeX Live and MiKTeX, since both have their own specialized wrappers for this purpose. However, some users may need to install your package directly from CTAN (e.g., to use with an older TL version), so adding at least a .bat wrapper for Windows (see example below) may be nice. For Unix, just start your main script with #!/usr/bin/env python (for portability across systems, /usr/bin/env is recommended over hardcoding the interpreter's absolute path).

For Windows I can suggest the following wrapper (if saved as pythontex.bat, it will execute the pythontex.py script).

@echo off
setlocal enableextensions
rem assuming the main script is in the same directory
if not exist "%~dpn0.py" (
  echo %~nx0: main script "%~dpn0.py" not found>&2
  exit /b 1
)
rem check if interpreter is on the PATH
for %%I in (python.exe) do set "PYTHONEXE=%%~$PATH:I"
if not defined PYTHONEXE (
  echo %~nx0: Python interpreter not installed or not on the PATH>&2
  exit /b 1
)
"%PYTHONEXE%" "%~dpn0.py" %*

As I mentioned, TeX Live and MiKTeX use their own methods of launching scripts, though I'm only familiar with TL's side of things. TeX Live uses the runscript.tlu utility for this, and users can also use it for their own custom or manually installed scripts. Package authors can use it for testing as well, e.g., to check that a .tds.zip works correctly. For details see the output of runscript -h (add the -v switch to learn all the gory details of the actual implementation). Here's an excerpt from it:

The following script types and their file extensions are currently
supported and searched in that order:

  Lua      (.tlu;.texlua;.lua) --  included
  Perl     (.pl)               --  included
  Ruby     (.rb)               --  requires installation
  Python   (.py)               --  requires installation
  Tcl      (.tcl)              --  requires installation
  Java     (.jar)              --  requires installation
  VBScript (.vbs)              --  part of Windows
  JScript  (.js)               --  part of Windows
  Batch    (.bat;.cmd)         --  part of Windows

Finally, Unix-style extensionless scripts are searched as last and
the interpreter program is established based on the she-bang (#!)
specification on the very first line of the script.  This can be
an arbitrary program but it must be present on the search path.

It is recommended to write new utilities in Lua if at all possible, since a Lua interpreter is now available out of the box on all platforms thanks to LuaTeX. A close second is Perl, which is shipped with TL on win32. Anything else has to be installed separately on Windows.

Finding script/package resources

Complex scripts might be spread over multiple files, and there is no silver bullet for locating such files. The standard way of finding files in TeX Live (which now works in MiKTeX too) is to use Kpathsea and its kpsewhich utility, e.g., kpsewhich -format texmfscripts pythontex_utils.py will output the full path to pythontex_utils.py if it finds it in the scripts subdirectory of one of the TEXMF trees. In LuaTeX, the Kpathsea library is built in and can be accessed directly. There might be other, perhaps better, ways that are specific to Python, Perl, etc., but this should be asked elsewhere.
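
From a Python script, the kpsewhich lookup can be wrapped in a small helper. This is a sketch, not necessarily how pythontex itself does it: the function names are hypothetical, and the command line is the one quoted above.

```python
import shutil
import subprocess
from pathlib import Path


def kpsewhich_command(filename, executable="kpsewhich"):
    """Return the command line that asks Kpathsea for a script file."""
    return [executable, "-format", "texmfscripts", filename]


def find_script(filename, executable="kpsewhich"):
    """Return the full path reported by kpsewhich, or None if not found."""
    # Without a TeX distribution on the PATH there is nothing to query.
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        kpsewhich_command(filename, executable),
        capture_output=True,
        text=True,
        check=False,
    )
    path = result.stdout.strip()
    return Path(path) if result.returncode == 0 and path else None
```

A main script could call `find_script("pythontex_utils.py")` at startup and add the parent directory of the result to `sys.path` before importing its helper modules.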
