[Tex/LaTex] take a CSV file, process the data, and generate a report file, where can I go to learn how to do this

datatoolMATLABreporttexstudio

I have access to MATLAB for data processing, and TeXstudio for the report. I am familiar with MATLAB about 100x more than LaTeX. The data flow I am trying to create:

  1. CSV file is generated with 50 columns of data, 1 timestamp column
    for everything, 1 column where timestamps for events are noted by
    the user since data is collected continuously (they push a button
    while data is being collected to create the event timestamp)

  2. MATLAB looks in the event timestamps column and finds the data at those times by comparing it against the timestamp column to locate the segments of data for processing.

  3. MATLAB finds the parameters of the data segments that the user requests (min, max, median, rise times, fall times, etc)

  4. MATLAB stores all of this information in a file somewhere or builds a large table with the information. There may also be graphs created from those data segments in (2)

Here's where LaTeX comes in:

  1. LaTeX will create the report for this data. With pre-entered statements like during Task X the maximum value reached was Y, where Y is taken from the MATLAB table that was created for that task in (3).

  2. This is repeated for each task. Some data columns may be grouped, and there may be some plots, table and figures that would also need to be dispersed throughout the final report.

I can figure out most of the MATLAB portions myself, it is the combining of the two together and the LaTeX portion i want to learn. Where are the best references or examples to do what I am proposing?

The data should look something like this (sorry I cant seem to figure out how to format a table on this thing but with a zillion more rows and data lines:

Time        Data Line 1  Data Line 2  Data Line 3  Data Line 4  Data Line 5 

7:49:23 AM  0.43493256   851.4415556  0.032704144  -1.24928     -0.016921   

7:49:24 AM  0.52979029   851.4415556  0.032704144  -1.24928     -0.016921   

Best Answer

Here's an example document using Python, with both code and output.

To use this example, you will need Python. Anaconda is a good option for this sort of thing. You also need the pythontex package. It's in TeX Live. It can also be installed manually; download the latest version from GitHub, extract, then run the Python installer in the pythontex directory.

To compile the document, you will need to use a three-step compile when Python code needs to be executed. Run LaTeX, then pythontex, then LaTeX again. For example, for a file analysis.tex, you might use

  • pdflatex -interaction=nonstopmode analysis.tex
  • pythontex analysis.tex
  • pdflatex -interaction=nonstopmode analysis.tex

When pythontex is properly installed, you should have a wrapper/symlink that allows it to be run from the command line.

Data file data.csv:

Time,A,B,C,D,E
7:49:23 AM,0.43493256,851.4415556,0.032704144,-1.24928,-0.016921
7:49:24 AM,0.52979029,851.4415556,0.132704144,-3.07928,-0.016921
7:49:25 AM,0.34579029,851.4415556,0.173704144,-2.24258,-0.012351

TeX source:

\documentclass{article}

\usepackage{graphicx}
\usepackage{pythontex}

\begin{pycode}
# Import functions etc. that may be needed
from numpy import median, average, mean, std
import collections
from matplotlib import pyplot as plt

# Read in data and parse each column into a list of values
# Put the data in a dictionary, with keys corresponding to column labels
# This assumes simple, well-behaved CSV, with no quoting
with open('data.csv') as f:
    raw_data = f.readlines()
# Add the data file as a dependency to be tracked
# This causes Python code to be re-executed when changes are detected
# Optional, but can be useful
pytex.add_dependencies('data.csv')

# Need a way to convert times into a format that allows easy comparison
# Could also find a library to do this
def time_to_ISO8601_int(t):
    t = t.strip()
    hms, meridian = t.split(' ', 1)
    h, m, s = map(int, hms.split(':'))
    if 'AM' in meridian and h == 12:
        h = 0
    elif 'PM' in meridian and h != 12:
        h += 12
    return h*10000 + m*100 + s

# Store processed data in an ordered dictionary
# Ordered dictionaries store data in order of insertion
# So columns maintain their ordering
data = collections.OrderedDict()
# Create an entry in the data dictionary for each column
# Create an empty list for each column, which will be filled in later
# Python is zero-indexed
header_row = raw_data[0]
for item in header_row.split(','):
    data[item.strip()] = []
# Process data into dictionary
# For very large data sets, a more efficient approach might be beneficial
for line in raw_data[1:]:
    if line:
        vals = list(line.split(','))
        vals[0] = time_to_ISO8601_int(vals[0])
        vals[1:] = map(float, vals[1:])
        for n, data_list in enumerate(data.values()):
            data_list.append(vals[n])
\end{pycode}


\begin{document}

Make a table of max values.

\begin{table}[h!]
\begin{center}
\begin{pycode}
print('\\begin{tabular}{|c|c|}')
print('\\hline')
print('Field & Max \\\\')
print('\\hline')
for key in data:
    if key != 'Time':
        print(key + '&' + str(max(data[key])) + '\\\\')
        print('\\hline')
print('\\end{tabular}')
\end{pycode}
\end{center}
\caption{Maximum values}
\end{table}


Plot some data.

\begin{figure}[h!]
\begin{pycode}
plt.figure(figsize=(4,3))
plt.plot(data['Time'], data['A'], label='A')
plt.plot(data['Time'], data['C'], label='C')
plt.plot(data['Time'], data['E'], label='E')
# Prevent plotting from reformatting tick labels
plt.ticklabel_format(useOffset=False, axis='x')
plt.xticks(data['Time'])
plt.xlabel('Time (ISO 8601)')
plt.legend()
plt.savefig('fig.pdf', bbox_inches='tight')
\end{pycode}
\begin{center}
\includegraphics{fig}
\end{center}
\caption{Figure caption.}
\end{figure}

\end{document}

Output:

enter image description here