[TeX/LaTeX] Included PDF image has invisible, but selectable text


I ran into a problem when importing a PDF as a graphic.
The PDF originally had some text beside the figure; I cropped it to remove that text and then included it using \includegraphics.

[Screenshot: the included graphic, with the hidden text highlighted in the PDF reader]

However, as shown in the figure, the original text is still there and can be 'seen' when it is highlighted in a PDF reader. In fact, copying the selection and pasting it into a Word file makes the original text readable again.

Does anyone know what causes this, and is there a solution for it (other than converting the PDF to a raster image)?

Best Answer

What (likely) happened

It is very likely that your PDF was imported with its complete content, and that part of that content is merely hidden by a 'CropBox' which is smaller than the 'MediaBox' of the original PDF. This kind of "cropping" only requires changing a very small part of the PDF code.

How it typically happens

It typically happens when you use the macOS 'Preview' app to "crop" parts of the pages. Users often get fooled: they think they have removed everything that is now invisible, and that their company secrets are protected when they send such a page "extract" to a customer. However, a simple change of two numbers in a text editor will make all page content visible again!

How to better understand what happened

To understand how this type of "cropping" works, imagine a page full of content that is covered by a white sheet of paper. Now cut a little window into that sheet so that you see just the "cropped" area. The rest of the content is not gone; it is merely invisible for now, and it becomes visible again as soon as you remove the white sheet with the little window in it. Since PDF viewers do not show you the white areas around the visible part, your mind is easily fooled into imagining that the remaining content is gone for good. To really crop, you'd have to cut away all the unwanted content from the original paper and burn it...
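
The window analogy can be sketched in a few lines of Python. This is only a toy model, not how any PDF viewer is actually implemented: the box values, the text items and the helper names `inside` and `split_by_visibility` are made up for illustration. It shows that shrinking the CropBox changes what is displayed while the stored page content stays the same.

```python
# Toy model of CropBox-based "cropping"; all names and numbers are hypothetical.
# A box is (llx, lly, urx, ury) in PDF points, as in /MediaBox and /CropBox.

def inside(box, x, y):
    """Return True if the point (x, y) lies within the box."""
    llx, lly, urx, ury = box
    return llx <= x <= urx and lly <= y <= ury

def split_by_visibility(items, crop_box):
    """Split page content into what a viewer shows and what it merely hides."""
    shown = [text for text, x, y in items if inside(crop_box, x, y)]
    hidden = [text for text, x, y in items if not inside(crop_box, x, y)]
    return shown, hidden

media_box = (0, 0, 595, 842)      # full A4 page (the original sheet)
crop_box = (100, 300, 400, 600)   # the "little window" cut into the white sheet

# Page content as (text, x, y); it is stored regardless of the CropBox.
items = [
    ("Figure 1", 150, 450),
    ("confidential note", 450, 700),
]

shown, hidden = split_by_visibility(items, crop_box)
print("shown:", shown)    # only the item inside the window
print("hidden:", hidden)  # still part of the page, just not displayed

# Setting the CropBox back to the MediaBox reveals everything again.
shown_again, _ = split_by_visibility(items, media_box)
print("after reset:", shown_again)
```

Selecting or copying text in a viewer works on the stored content, which is why the "hidden" item in this model would still turn up in a copy and paste.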

How to solve the problem

To make the content outside such a cropped area really go away (that is, to remove it permanently from the page's /Contents stream), you have to process the seemingly cropped page one more time. Running it through Ghostscript like this should achieve that result:

gs -sDEVICE=pdfwrite -o really-cropped.pdf seemingly-cropped.pdf

How to check if the solution worked

To verify, you should then run these two pdfinfo commands:

 pdfinfo -box seemingly-cropped.pdf
 pdfinfo -box really-cropped.pdf

Only a PDF file that has the same values for MediaBox and CropBox can be relied on not to carry extra (hidden) content outside the area displayed by PDF viewers. Such files you can safely import into your documents. Any PDF file whose CropBox is smaller than its MediaBox may have extra invisible content travelling with it...
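
If you have many files to check, the comparison can be automated. The following Python sketch assumes that `pdfinfo -box` prints lines of the form `MediaBox: llx lly urx ury` and `CropBox: llx lly urx ury`, which is what Poppler's pdfinfo does as far as I know; the function names `parse_boxes`, `may_hide_content` and `check_file` are my own and not part of any tool. Without further options pdfinfo reports the boxes of the first page only, so this checks just that page.

```python
import re
import subprocess

# Matches e.g. "CropBox:   100.00   300.00   400.00   600.00"
BOX_LINE = re.compile(
    r"(MediaBox|CropBox):\s+(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)"
)

def parse_boxes(pdfinfo_output):
    """Extract MediaBox and CropBox as tuples of floats from pdfinfo text."""
    boxes = {}
    for name, *coords in BOX_LINE.findall(pdfinfo_output):
        # Keep the first occurrence of each box (the first reported page).
        boxes.setdefault(name, tuple(float(c) for c in coords))
    return boxes

def may_hide_content(pdfinfo_output, tolerance=0.01):
    """Return True if CropBox differs from MediaBox, i.e. content may be hidden."""
    boxes = parse_boxes(pdfinfo_output)
    media, crop = boxes["MediaBox"], boxes["CropBox"]
    return any(abs(m - c) > tolerance for m, c in zip(media, crop))

def check_file(path):
    """Run `pdfinfo -box` on a file; requires pdfinfo to be on the PATH."""
    result = subprocess.run(
        ["pdfinfo", "-box", path], capture_output=True, text=True, check=True
    )
    return may_hide_content(result.stdout)

# Hypothetical output for a seemingly cropped file:
sample = """\
MediaBox:           0.00     0.00   595.00   842.00
CropBox:          100.00   300.00   400.00   600.00
"""
print(may_hide_content(sample))
```

Calling `check_file("seemingly-cropped.pdf")` and `check_file("really-cropped.pdf")` would then give a quick yes/no answer for the two files from the commands above.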