I suggest you look at the script make-index.py (and related files) in the scripts folder of the download page at the Stacks Project (http://www.math.columbia.edu/algebraic_geometry/stacks-git/). The index it generates isn't really ideal, but at least their strategy will give you some idea of how to get started. They seem to take the approach that (in a gigantic math textbook) the things which most deserve to be in the index are the italicized word(s) or phrase(s) in each definition environment. In my experience using math books, the most common reason I look something up in the index is to learn its definition, so this seems appropriate, although maybe not for books in other subjects. However, you might be able to use the Stacks Project script as a guide to automate the creation of an index which suits your own needs, even if they are very different.
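As a rough illustration of that strategy (this is not the actual make-index.py, just a minimal sketch of the idea), the following Python snippet collects the emphasized phrases inside definition environments. It assumes that definitions are wrapped in a `definition` environment and that defined terms are marked with `\emph{...}`; the function name and the sample text are made up for the example.

```python
import re
from collections import defaultdict

# Assumption: definitions use \begin{definition}...\end{definition}
# and defined terms are marked with \emph{...} (no nested braces).
DEFINITION_RE = re.compile(
    r"\\begin\{definition\}(.*?)\\end\{definition\}", re.DOTALL
)
EMPH_RE = re.compile(r"\\emph\{([^{}]*)\}")


def collect_index_terms(tex_source):
    """Map each emphasized term to the (1-based) definitions it occurs in."""
    index = defaultdict(list)
    for number, match in enumerate(DEFINITION_RE.finditer(tex_source), start=1):
        for term in EMPH_RE.findall(match.group(1)):
            # Collapse whitespace and ignore case so variants share one entry.
            normalized = " ".join(term.split()).lower()
            if number not in index[normalized]:
                index[normalized].append(number)
    return dict(index)


sample = r"""
\begin{definition}
A space is \emph{quasi-compact} if every open cover has a finite subcover.
\end{definition}
Some \emph{ignored} remark outside of any definition.
\begin{definition}
A map is \emph{Quasi-compact} if ... and \emph{affine} if ...
\end{definition}
"""

terms = collect_index_terms(sample)
for term in sorted(terms):
    print(term, terms[term])
```

A real document would map the definitions to labels or page numbers instead of a running counter, and would probably need a proper parser rather than regular expressions once macros nest.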
Not much of an answer, more a couple of loose thoughts ...
Off-hand I'm not aware of any such system and also not aware of any
research that deals with automatic newspaper layout. As far as I know
there have been only very limited attempts to approach the subject
of automatic typesetting with more complex layout rules and
dependencies that go beyond what is largely a linear process. You
can count them on your hands:
- Michael Plass (under Knuth)
- Graham Asher in 1990 or so (Type & Set) - not sure what happened to that
- Anne Brüggemann-Klein in the mid-90s
- Richard Furuta and a few others in the 90s
- Stefan Wohlfeil 1997 (PhD: On the Pagination of Complex Book-like Documents)
and to my knowledge nada otherwise. And those are all looking more at the questions arising from "book-like" documents rather than newspapers/journals. But I might be very wrong as I
didn't follow that area closely in the last 10 years.
But assuming my knowledge is correct for a moment, it isn't
really surprising, is it? What you have is a global optimization
problem of a constraint system where the possibilities that you need
to test grow astronomically the moment you have more than a single
column and a good number of floats with a certain set of
constraints. And so far every serious attempt to do much better than
choosing the trivial way out (no floats, just linear typesetting - aka
MS-Word model) or a simple greedy algorithm that never looks back
(like LaTeX does) has been defeated by the complexity of the task.
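To get a feeling for that growth, here is a rough counting sketch in Python. The model is an assumption made purely for illustration (each float goes into one of a fixed number of candidate slots, at most one float per slot), and the function names are hypothetical; real pagination has far messier constraints.

```python
import math


def placements_keeping_order(slots, floats):
    """Ways to pick slots when floats must stay in call-out order."""
    # With ordered slots, choosing the subset fixes the assignment.
    return math.comb(slots, floats)


def placements_any_order(slots, floats):
    """Ways to place the floats when they may also be reordered."""
    return math.perm(slots, floats)


# Assumed example: 20 pages, two columns, one float slot per column.
for floats in (2, 5, 10):
    print(
        floats,
        placements_keeping_order(40, floats),
        placements_any_order(40, floats),
    )
```

Even this crude model, which ignores float sizes, text reflow and call-out distance, leaves far too many candidates to test one by one, which is why a greedy pass is so tempting.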
Now newspaper typesetting on one hand comes with the additional
complexity (but perhaps also the freedom) of having multiple input
streams of limited length which allow for reordering (to some
extent). On the other hand it will have much different requirements on
picture order and call-outs.
By the way, to my knowledge it is quite common in newspaper writing
that the authors have to write to length and if they don't they get
edited to it. Are you thinking of taking that into account? Because if
so, that would probably simplify the task considerably.
So I think the first task would be to understand and research the
constraint system, e.g., what kind of rules make newspapers or journals
tick. Those will not be universal and most likely they will
contradict each other if taken all together. But they form a basis
of what an algorithm needs to be able to be configured for. And only
when those boundaries are known can one delve deeper into the
question of designing such an algorithm. How close one can get to an
ideal, I don't know. In some respects, I would assume that it might in
fact be simpler for newspapers due to the flexibility of reordering
stories but in any case I believe this is an open research topic that is
so far unsolved (just like "the pagination of complex book-like
documents" effectively is). --- I'm certainly interested and have been
for more than two decades, even if I had to take a longer break after
I don't know if Wohlfeil's PhD work is still easily available (it was difficult for me to get back then) but a quick search on the web brought up a shorter paper by Brüggemann-Klein/Klein/Wohlfeil, "On the Pagination of Complex Documents", which is from around the same time. And I also found "Pagination reconsidered" by the same authors (no date to go with it, but judging from its number it was probably earlier).
I'm sure that there are many other sources, but one good book that I think is worth looking at for those who speak German is "Praxishandbuch Gestaltungsraster" by Andreas and Regina Maxbauer. Its focus isn't the newspaper angle but rather the grid one, but that naturally covers a good number of possible rules.
By the way, a good way to do some research (though far from perfect at the moment) is to look around in Microsoft's Academic Search. For example, that gives you some more background on what Anne was doing over the years and which papers she co-authored. But you have to be aware that there is a lot of rubbish in the data they have and it is horribly incomplete in parts.
Upon reading a bit in Stefan's PhD thesis again (which I incorrectly labeled habil initially) I came across the work of Krista Lagus, who wrote her master's thesis on "Automated pagination of the generalized newspaper using simulated annealing". I didn't find the thesis on the web but perhaps it is worth exploring further.
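For readers who haven't met simulated annealing before, here is a toy sketch in Python of what the technique looks like when applied to a layout-like problem. To be clear, this is not Lagus's formulation (I don't have the thesis at hand): the cost function, the single-story move, the linear cooling schedule and all names are assumptions chosen to keep the example short.

```python
import math
import random


def layout_cost(assignment, lengths, n_pages, capacity):
    """Toy penalty: squared deviation of each page's fill from capacity."""
    fill = [0] * n_pages
    for story, page in enumerate(assignment):
        fill[page] += lengths[story]
    return sum((f - capacity) ** 2 for f in fill)


def anneal(lengths, n_pages, capacity, steps=5000, start_temp=50.0, seed=0):
    """Assign stories to pages, returning the best assignment seen and its cost."""
    rng = random.Random(seed)
    current = [rng.randrange(n_pages) for _ in lengths]
    current_cost = layout_cost(current, lengths, n_pages, capacity)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Linear cooling; the small offset avoids division by zero.
        temp = start_temp * (1 - step / steps) + 1e-9
        # Move: put one randomly chosen story on a randomly chosen page.
        candidate = list(current)
        candidate[rng.randrange(len(lengths))] = rng.randrange(n_pages)
        cand_cost = layout_cost(candidate, lengths, n_pages, capacity)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


story_lengths = [30, 20, 50, 40, 10, 25, 25]  # made-up lengths in lines
pages, cost = anneal(story_lengths, n_pages=2, capacity=100)
print(pages, cost)
```

The occasional acceptance of a worse layout is what distinguishes this from a greedy search: it lets the process back out of a local optimum, at the price of many more evaluations.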
Yes, it is possible. But there are two problems here: typesetting and design.
Typesetting: Of course TeX handles typesetting very well. But you will run into things like shaped paragraphs. For these you need to calculate the shape of the image that will blend into the paragraph. Or if you have a rectangular graphic that sits in the middle of two columns, you need to cut its shape out of both columns. This all gets really difficult in LaTeX.
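To give an idea of the bookkeeping involved, here is a minimal Python sketch that computes a `\parshape` specification for the simplest case: a single paragraph with a rectangular cut-out on one side. The function name and its parameters are hypothetical, all lengths are assumed to be in points, and it ignores everything that makes the real problem hard (paragraph breaks inside the cut-out, page breaks, non-rectangular images).

```python
def parshape_for_cutout(text_width, lines_before, cut_lines, cut_width,
                        side="right"):
    """Build a \\parshape spec for a rectangular cut-out in one paragraph."""
    if not 0 < cut_width < text_width:
        raise ValueError("cut-out must be narrower than the text width")
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")

    # Each pair is (indent, line length) for one line of the paragraph.
    pairs = [(0, text_width)] * lines_before
    indent = cut_width if side == "left" else 0
    pairs += [(indent, text_width - cut_width)] * cut_lines
    # TeX reuses the last pair for all remaining lines: back to full width.
    pairs.append((0, text_width))

    spec = " ".join(f"{i:g}pt {length:g}pt" for i, length in pairs)
    return f"\\parshape {len(pairs)} {spec}"


# Two full lines, then three lines shortened by 100pt on the right.
print(parshape_for_cutout(300, lines_before=2, cut_lines=3, cut_width=100))
```

For a graphic that sits between two columns you would need one such specification per column, with the cut-out on the right of the left column and on the left of the right column, and the line counts would have to agree with where the paragraphs actually fall.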
Design: You can't fit a text into a grid easily. You need to take into account how long the text is. You have one or two images along with the text. Do you want them to be above the text, below? Which stories go on the first page? These questions require a lot of designing. And putting the design instructions into LaTeX code is not a fun task.
At my company we offer TeX-based software that solves problem #1, but #2 has to be done nevertheless.