[Tex/LaTex] pdfcrop content remains


Here's a minimal working example to illustrate the issue. Suppose we compile a document like this as 1.pdf:

\documentclass[12pt]{article}
\begin{document}
11111111111111111111111111111

\vspace{5mm}

22222222222222222222222222222
\end{document}

Then crop it with pdfcrop in two ways:

pdfcrop 1.pdf a.pdf --margin="0 0 -5 -530"

and

pdfcrop 1.pdf b.pdf --margin="0 0 -5 -560"

Then make a second document:

\documentclass[12pt]{article}
\usepackage{graphicx}
\begin{document}
\includegraphics{a.pdf}

\vspace{5mm}

\includegraphics{b.pdf}
\end{document}

Here's a screenshot of what results: [screenshot not preserved]

The line of twos which was supposed to be cropped is still visible when highlighted!

Is there a way to definitively crop a PDF so that no content remains in this way, whether with pdfcrop or with some other tool?

In the past I've succeeded in removing such leftovers by converting to ps and back with pdf2ps and ps2pdf, but for more complex situations it tends to rasterize certain parts of vector graphics, which is what I'm trying to avoid.

Best Answer

Cropping with pdfcrop, with the viewport/trim options of \includegraphics, or with nearly every other cropping tool is done by shrinking the dimensions of the visible area, that is, by setting smaller values for /MediaBox or /CropBox. This is fast and easy to implement, but the contents of the page are left untouched. The parts outside the crop area are therefore still present: usually invisible, but still there, as you have seen by selecting the "invisible" text.
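
To make this concrete, here is a small sketch of the arithmetic that such a box-based crop performs. It assumes pdfcrop's margin order (left, top, right, bottom, in bp) and a made-up bounding box; apply_margins is a hypothetical helper for illustration, not part of pdfcrop. Only the four box numbers change; nothing in the page content stream is touched.

```shell
# Hypothetical helper: apply pdfcrop-style margins to a bounding box.
# $1 = "llx lly urx ury"        (bounding box in bp; values below are made up)
# $2 = "left top right bottom"  (margins in bp; negative values cut into the box)
apply_margins() {
  echo "$1 $2" | awk '{
    # fields 1-4: llx lly urx ury; fields 5-8: left top right bottom
    print $1 - $5, $2 - $8, $3 + $7, $4 + $6
  }'
}

# A bottom margin of -530 raises the lower edge of the visible area by 530 bp;
# whatever was drawn below that edge is hidden, not deleted.
apply_margins "133 89 479 668" "0 0 -5 -530"   # prints: 133 619 474 668
```

The output is simply the new box that ends up in the PDF; the text of the second line is still in the file, just outside that box.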

Removing this content is expensive and complicated, because the whole PDF page content stream needs to be analyzed and rewritten so that only the visible things remain, at the right places. And if objects (images, graphics elements, characters, ...) lie partly inside and partly outside the crop area, it gets ugly very quickly.

I am interested in tools that can do such a "deep" crop with removal of objects, but I do not know of one.

To some degree it can be done manually, depending on the crop area, the objects to be removed, how the PDF page stream is written, and how well you know the syntax of PDF page streams. First, the PDF page stream needs to be uncompressed, e.g., via pdftk. Then identify the lines that generate the unwanted content; they can be removed by replacing their first byte with the comment character %. The size of the stream should not change, otherwise many file offset values need to be corrected in the PDF structures. Of course the removal must not violate the syntax, and the current transformation matrix and other graphics state operators might need to be corrected as well.
Another option might be a vector graphics program that can import and export PDF files. The unwanted objects might then be removable there.
With some loss of quality, the PDF page can be converted to a bitmap image. The image can then be cropped easily with many tools and image editors.
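
The manual approach can be sketched roughly as follows. This is only an illustration on a toy text stream: comment_out is a hypothetical helper, and the stream in the usage example is made up (real pdfTeX output looks different, and an uncompressed PDF can still contain binary data for which a byte-oriented editor may be the safer choice). The point is that overwriting the first byte with % disables the line without changing the byte count.

```shell
# Hypothetical helper: comment out every line matching a pattern by
# overwriting its first byte with "%", so the stream keeps its length.
# $1 = basic regular expression selecting the lines (must not contain "/")
comment_out() {
  # LC_ALL=C makes sed treat "." as a single byte, not a multibyte character
  LC_ALL=C sed "/$1/ s/^./%/"
}

# Usage on an uncompressed file (file names are placeholders):
#   pdftk b.pdf output b-plain.pdf uncompress
#   comment_out '2222' < b-plain.pdf > b-edited.pdf
#
# Toy example: disable the line that shows the twos.
printf 'BT\n(1111) Tj\n(2222) Tj\nET\n' | comment_out '2222'
```

Because only one byte per line is overwritten, the file offsets in the cross-reference table stay valid; whether the remaining operators still form a sensible stream has to be checked by hand.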

Related Solutions

[Tex/LaTex] pdfcrop generates larger file

Here is my version of an improved pdfcrop.

Default operation is to remove white margins from the PDF input, optionally leaving a user-defined extra margin (option -m ...).

Alternative operation is to trim the page edges by user-defined amounts (option -t ...).

pdfcrop.sh uses gs (Ghostscript) for page-wise determination of the tightly enclosing bounding box, pdftk for uncompressing/compressing the PDF files and getting the order of pages (which doesn't need to be linear), and perl for replacing original page dimensions by the tight bounding boxes found.

Unlike the original pdfcrop, the bash script below preserves the interactive parts of the PDF (links, annotations, etc.). The output file size is about the same as that of the input.

Update: Option -two added for two-sided page layout

Usage examples:

#getting help
pdfcrop.sh -help

#default operation
pdfcrop.sh orig.pdf cropped.pdf
pdfcrop.sh -m 10 orig.pdf cropped.pdf
pdfcrop.sh -hires orig.pdf cropped.pdf

#trimming pages
pdfcrop.sh -t "10 20 30 40" orig.pdf trimmed.pdf
#same for two-sided layout
pdfcrop.sh -t "10 20 30 40" -two orig.pdf trimmed.pdf

Content of pdfcrop.sh:

#!/bin/bash

function usage () {
  echo "Usage: `basename $0` [Options] <input.pdf> [<output.pdf>]"
  echo
  echo " * Removes white margins from every page in the file. (Default operation)"
  echo " * Trims page edges by given amounts. (Alternative operation)"
  echo
  echo "If only <input.pdf> is given, it is overwritten with the cropped output."
  echo
  echo "Options:"
  echo
  echo " -m \"<left> [<bottom> [<right> <top>]]\""
  echo "    adds extra margins in default operation mode. Unit is bp. A single number"
  echo "    is used for all margins, two numbers \"<left> <bottom>\" are applied to the"
  echo "    right and top margins alike."
  echo
  echo " -t \"<left> [<bottom> [<right> <top>]]\""
  echo "    trims outer page edges by the given amounts. Unit is bp. A single number"
  echo "    is used for all trims, two numbers \"<left> <bottom>\" are applied to the"
  echo "    right and top trims alike."
  echo
  echo " -two"
  echo "    to be used for documents with two-sided page layout; the meaning of <left>"
  echo "    and <right> changes to <inner> and <outer> for options -m and -t"
  echo
  echo " -hires"
  echo "    %%HiResBoundingBox is used in default operation mode."
  echo
  echo " -help"
  echo "    prints this message."
}

c=0
mar=(0 0 0 0); tri=(0 0 0 0)
bbtype=BoundingBox
two=0

while getopts m:t:h: opt
do
  case $opt
  in
    m)
    eval mar=($OPTARG)
    [[ -z "${mar[1]}" ]] && mar[1]=${mar[0]}
    [[ -z "${mar[2]}" || -z "${mar[3]}" ]] && mar[2]=${mar[0]} && mar[3]=${mar[1]}
    c=0
    ;;
    t)
    if [[ "$OPTARG" == "wo" ]]
    then
      two=1
    else
      eval tri=($OPTARG)
      [[ -z "${tri[1]}" ]] && tri[1]=${tri[0]}
      [[ -z "${tri[2]}" || -z "${tri[3]}" ]] && tri[2]=${tri[0]} && tri[3]=${tri[1]}
      c=1
    fi
    ;;
    h)
    if [[ "$OPTARG" == "ires" ]]
    then
      bbtype=HiResBoundingBox
    else
      usage 1>&2; exit 0
    fi
    ;;
    \?)
    usage 1>&2; exit 1
    ;;
  esac
done
shift $((OPTIND-1))

[[ -z "$1" ]] && echo "`basename $0`: missing filename" 1>&2 && usage 1>&2 && exit 1
input=$1;output=$1;shift;
[[ -n "$1" ]] && output=$1 && shift;

(
    [[ "$c" -eq 0 ]] && gs -dNOPAUSE -q -dBATCH -sDEVICE=bbox "$input" 2>&1 | grep "%%$bbtype"
    pdftk "$input" output - uncompress
) | perl -w -n -s -e '
  BEGIN {@m=split /\s+/, $mar; @t=split /\s+/, $tri; @mb=(); $p=-1;}
  sub fixMB {
    if($c){
      if($two && $p%2) {
        $mb[0]+=$t[2];$mb[1]+=$t[1];$mb[2]-=$t[0];$mb[3]-=$t[3];
      }
      else {
        $mb[0]+=$t[0];$mb[1]+=$t[1];$mb[2]-=$t[2];$mb[3]-=$t[3];
      }
      print "/MediaBox [", join(" ", @mb), "]\n";
    } else {
      @bb=split /\s+/, $bbox[$p];
      if($two && $p%2) {
        $bb[0]+=$mb[0];$bb[1]+=$mb[1];$bb[2]+=$mb[0];$bb[3]+=$mb[1];
        $bb[0]-=$m[2];$bb[1]-=$m[1];$bb[2]+=$m[0];$bb[3]+=$m[3];
      }
      else {
        $bb[0]+=$mb[0];$bb[1]+=$mb[1];$bb[2]+=$mb[0];$bb[3]+=$mb[1];
        $bb[0]-=$m[0];$bb[1]-=$m[1];$bb[2]+=$m[2];$bb[3]+=$m[3];
      }
      print "/MediaBox [", join(" ", @bb), "]\n";
    }
  }
  if (/BoundingBox:\s+([\d\.\s]+\d)/) { push @bbox, $1; next;}
  elsif (/\/MediaBox\s+\[([\d\.\s]+\d)\]/) {
    @mb=split /\s+/, $1; next if($p<0);
    fixMB; @mb=(); $p=-1; next;
  }
  elsif (/pdftk_PageNum\s+(\d+)/) {
    $p=$1-1; next unless(@mb);
    fixMB; @mb=(); $p=-1; next;
  }
  print;
' -- -mar="${mar[*]}" -tri="${tri[*]}" -c=$c -two=$two | pdftk - output "$output" compress

[Tex/LaTex] The perfect pdfcrop

To be able to crop a vector graphic reliably you must "print" it to see where the black dots are.

"Printing" always involves a resolution: the black dots must have a positive size.

pdfcrop uses the bbox device of Ghostscript. According to the Ghostscript documentation, the default resolution of this device is 4000 dpi.

You can change this resolution, but simply increasing it does not give you a more "perfect" result. To decide whether a crop is "perfect" you must again "print" it, e.g. to a screen, to see where the black dots are, and at the lower resolution of a screen you will see your "exact" crop only at a very large zoom.
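
To get a feeling for the numbers, here is a rough sketch that converts a resolution into the size of one "dot" in bp (1 bp = 1/72 inch). pixel_size_bp is a hypothetical helper, and treating one device pixel as the granularity of the detected box is a simplifying assumption of mine.

```shell
# Hypothetical helper: edge length of one device pixel in bp at a given dpi.
# $1 = resolution in dots per inch
pixel_size_bp() {
  awk -v dpi="$1" 'BEGIN { printf "%.4f\n", 72 / dpi }'
}

pixel_size_bp 4000   # default resolution of the bbox device
pixel_size_bp 300    # a typical printer resolution, for comparison

# To try another resolution with Ghostscript directly (file name is a placeholder):
#   gs -dNOPAUSE -dBATCH -q -sDEVICE=bbox -r8000 1.pdf
```

At 4000 dpi a dot is already far smaller than anything visible on screen without extreme zoom, which is why raising the resolution further changes little in practice.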
