MATLAB: Quickly Search Strings inside PDF files

optimization, pdf, search, strfind, Text Analytics Toolbox, wile e. coyote

I have ~25,000 PDF files that I want to classify based on the presence of keywords in their text. I know there's a PDF Toolbox that provides MATLAB with an interface for reading PDF text, but it is hosted on SourceForge, which makes it difficult to obtain (this is for work), and its reliance on Java seems likely to make the process very slow, especially when searching so many files. Is there a simpler, faster way to parse these documents if all I want to do is essentially run strfind on the text to check for keywords?

Best Answer

PDFs are designed to guarantee identical output on different machines. You want to create a catalogue of the strings they contain. These two goals do not match.

How about converting the PDFs with one of the many pdf2text tools and then working on the resulting text files? E.g. http://www.foolabs.com/xpdf, http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-NET
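As a rough sketch of that workflow, and not a tested solution for 25,000 files: the script below assumes the pdftotext command-line tool from Xpdf is installed and on the system path. The folder, the keyword list and the names pdfDir, keywords, hasKeywords and found are placeholders made up for this example.

```matlab
% Sketch: convert each PDF to a text file with an external tool, then search the text.
% Assumption: the pdftotext command-line tool (from Xpdf) is on the system path.
pdfDir   = pwd;                               % placeholder: folder holding the PDFs
keywords = {'optimization', 'regression'};    % placeholder keyword list

% Case-insensitive check of one text against every keyword (one logical per keyword)
hasKeywords = @(txt, kws) cellfun(@(k) contains(txt, k, 'IgnoreCase', true), kws);

files = dir(fullfile(pdfDir, '*.pdf'));
found = false(numel(files), numel(keywords)); % rows: files, columns: keywords

for k = 1:numel(files)
    pdfPath = fullfile(files(k).folder, files(k).name);
    txtPath = [tempname '.txt'];
    % Usage: pdftotext <input.pdf> <output.txt>
    status = system(sprintf('pdftotext "%s" "%s"', pdfPath, txtPath));
    if status ~= 0 || ~isfile(txtPath)
        warning('Could not convert %s', pdfPath);
        continue
    end
    txt = fileread(txtPath);
    delete(txtPath);
    found(k, :) = hasKeywords(txt, keywords);
end
```

After the loop, found(k, j) is true when keyword j occurs in file k. The conversion step will probably dominate the run time, so it may be worth keeping the text files and searching them separately instead of deleting them.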

Related Solutions

MATLAB: Read in file that has raw binary image data and an XML footer

You've got the right idea.

Since you know the size of the image, open the mixed image/XML file with fopen, then use fseek to position the reader just after the image. Then use fread to read the tail of the file. Write this text to a temporary file.
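A small self-contained sketch of those steps. It first writes a demo file so that it can run anywhere; the 3-by-4 uint8 "image", the footer text and all variable names are made up for the illustration and would be replaced by the real file and image size.

```matlab
% Build a small demo file: fake image bytes followed by an XML footer.
demoFile   = [tempname '.bin'];
imageBytes = 3 * 4 * 1;                 % rows * cols * bytes per pixel (uint8 here)
footer     = '<meta><width>4</width></meta>';
fid = fopen(demoFile, 'w');
fwrite(fid, zeros(1, imageBytes, 'uint8'), 'uint8');
fwrite(fid, uint8(footer), 'uint8');
fclose(fid);

% Read it back: jump past the image, then read everything up to end of file.
fid = fopen(demoFile, 'r');
fseek(fid, imageBytes, 'bof');                  % offset counted from beginning of file
xmlText = fread(fid, Inf, 'uint8=>char')';      % column of chars -> row char vector
fclose(fid);

% Write the footer text to a temporary file.
xmlFile = [tempname '.xml'];
fid = fopen(xmlFile, 'w');
fwrite(fid, uint8(xmlText), 'uint8');
fclose(fid);
```

Reading the tail as bytes and converting to char assumes the footer is plain ASCII; a footer in another encoding would need the matching encoding passed to fopen.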

MATLAB: How to extract images from a PDF using MATLAB

MATLAB ships with the Apache PDFBox Java library, which allows importing and processing PDF files. Use the following MATLAB function extractImagePDF() to extract images from a native PDF and save them as JPG files:

function extractImagePDF(pdfFile)
% Extract every image object from a native PDF and save each one as a JPG file.
% pdfFile is the name of a PDF file in the current folder.
import java.io.*
import javax.imageio.ImageIO.*
import org.apache.pdfbox.*
filename = fullfile(pwd,pdfFile);
jFile = File(filename);
document = pdmodel.PDDocument.load(jFile);
catalog = document.getDocumentCatalog();
pages = catalog.getPages();

iter = pages.iterator();
i = 1; % running image counter across all pages, so earlier files are not overwritten
% look for image objects on each page of the PDF
while (iter.hasNext())
    page = iter.next();
    resources = page.getResources();
    pageImages = resources.getXObjectNames;
    if ~isempty(pageImages)
        imageIter = pageImages.iterator();
        % extract each image object from page and write to destination folder
        while (imageIter.hasNext())
            key = imageIter.next();
            if (resources.isImageXObject(key))
                xObject = resources.getXObject(key);
                img = xObject.getImage();
                outputfile = File("Img_"+i +".jpg");
                write(img, "jpg", outputfile);
                i = i+1; % count only the objects that were written as images
            end
        end
    end
end
document.close();

Note that the above code will not work for scanned PDF files.
