[TeX/LaTeX] pdfcrop content remains

crop pdf pdfcrop

Here's a minimal working example to illustrate the issue. Suppose we make a document like this

\documentclass[12pt]{article}
\begin{document}
11111111111111111111111111111

\vspace{5mm}

22222222222222222222222222222
\end{document}

Then crop the resulting 1.pdf with pdfcrop in two ways:

pdfcrop 1.pdf a.pdf --margin="0 0 -5 -530"

and

pdfcrop 1.pdf b.pdf --margin="0 0 -5 -560"

Then make a second document

\documentclass[12pt]{article}
\usepackage{graphicx}
\begin{document}
\includegraphics{a.pdf}

\vspace{5mm}

\includegraphics{b.pdf}
\end{document}

Here's a screenshot of what results:
[screenshot]

The line of twos which was supposed to be cropped is still visible when highlighted!

Is there a way to definitively crop a PDF so that no content remains in this way, whether with pdfcrop or with another tool?

In the past I've succeeded in removing such leftovers by converting to PS and back with pdf2ps and ps2pdf, but in more complex situations this round trip tends to rasterize certain parts of vector graphics, which is what I'm trying to avoid.

Best Answer

Cropping with pdfcrop, with the viewport/trim options of \includegraphics, or with most other cropping tools is done by shrinking the dimensions of the visible area, that is, by setting smaller values for /MediaBox or /CropBox. This is fast and easy to implement, but the contents of the whole page are left untouched. The parts outside the crop area are therefore still present: usually invisible, but present, as you have seen by selecting the "invisible" text.
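To make this concrete, here is a minimal Python sketch of such a "shallow" crop. It is only an illustration under assumptions: PAGE is a hand-written fragment resembling an uncompressed page dictionary followed by its content stream, not a complete PDF file, and shallow_crop is a hypothetical helper, not what pdfcrop actually runs. The point is that only the box entry changes, while the drawing instructions for the line of twos stay in the data.

```python
import re

# Hand-written fragment (an assumption for illustration, not a complete PDF):
# a page dictionary followed by two text-drawing lines.
PAGE = (
    b"<< /Type /Page /MediaBox [0 0 612 792] /Contents 4 0 R >>\n"
    b"BT /F1 12 Tf 72 700 Td (11111) Tj ET\n"
    b"BT /F1 12 Tf 72 670 Td (22222) Tj ET\n"
)


def shallow_crop(page, box):
    """Rewrite the /MediaBox entry and leave all other bytes untouched."""
    new_box = b"/MediaBox [%d %d %d %d]" % box
    return re.sub(rb"/MediaBox\s*\[[^\]]*\]", new_box, page, count=1)


# Shrink the visible area so that only the first line lies inside it.
cropped = shallow_crop(PAGE, (72, 690, 300, 720))

print(b"/MediaBox [72 690 300 720]" in cropped)  # the box was changed
print(b"(22222) Tj" in cropped)                  # the cropped-away text is still there
```

In a real file the replacement would also have to keep the byte length unchanged or update the cross-reference offsets, which this fragment ignores.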

Removing these contents is expensive and complicated, because the whole content of the PDF page needs to be analyzed and rewritten so that only the visible things remain, at the right places. And if objects (images, graphics elements, characters, ...) lie partly inside and partly outside the crop area, it gets ugly very quickly.

I am interested in tools that can do such a "deep" cropping with removal of objects, but I do not know of one.

  • To some degree it can be done manually, depending on the crop area, the objects to be removed, how the PDF page stream is written, and how deep one's knowledge of PDF page stream syntax is. First, the PDF page stream needs to be uncompressed, e.g., via pdftk. Then the lines that generate the unwanted contents are identified; they can be removed by replacing their first byte with the comment character %. The size of the stream should not change, otherwise many file offset values in the PDF structures would need to be corrected. Of course, the removal must not violate the syntax, and the current transformation matrix and other graphics state operators may need to be corrected as well.

  • Another option might be a vector graphics program that can import and export PDF files; the unwanted objects might then be removable there.

  • With some loss of quality, the PDF page can be converted to a bitmap image, which can then easily be cropped with many tools and image editors.
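The manual approach from the first bullet can be sketched in Python as follows. This is a rough illustration, not a tested tool: comment_out_lines and the sample stream are hypothetical, the stream is assumed to be already uncompressed, and each unwanted operator is assumed to sit on a line of its own so that commenting it out does not break the syntax (a string literal spanning several lines, for instance, would not be handled).

```python
def comment_out_lines(stream, is_unwanted):
    """Replace the first byte of each unwanted line with the PDF comment
    character '%', keeping the total length of the stream unchanged."""
    out = []
    for line in stream.splitlines(keepends=True):
        # Skip empty lines: there is no content byte to overwrite.
        if line.strip() and is_unwanted(line):
            line = b"%" + line[1:]
        out.append(line)
    result = b"".join(out)
    # Same length, so file offsets elsewhere in the PDF stay valid.
    assert len(result) == len(stream)
    return result


# Hypothetical uncompressed page stream with one operator sequence per line.
stream = (
    b"BT\n"
    b"/F1 12 Tf 72 700 Td (11111) Tj\n"
    b"ET\n"
    b"BT\n"
    b"/F1 12 Tf 72 670 Td (22222) Tj\n"
    b"ET\n"
)

# Comment out the line that shows the row of twos.
cleaned = comment_out_lines(stream, lambda line: b"(22222)" in line)
print(cleaned.decode("ascii"))
```

The commented-out bytes are still physically in the file, but a PDF interpreter skips them, so the text is neither drawn nor selectable. Choosing which lines to disable remains the hard part and has to be done by inspecting the stream.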