SmallLaw: Making PDF Files Searchable

By Ross Kodner | Monday, December 21, 2009

Originally published on December 7, 2009 in our free SmallLaw newsletter.

Another SmallLaw column, another Acrobat tip. Below you'll learn how to transform a piece of paper into a fully searchable, text-based PDF file. You can then search through these files using any desktop search software, litigation review software, or document management system, including many practice management systems with DMS capabilities. Also, you can copy and paste text from these documents into any other application. Both of these functions rely on OCR — optical character recognition.

Acrobat Succeeds Through a Less Is More Approach …

The process of OCRing documents is hardly a new concept, yet it remains one dreaded by most legal staffers, often requiring a lot of manual labor. Even today, in an era of significant hardware horsepower and versions of OCR software well into the "teens," results are often mixed — after an aggravatingly slow process.

The reason is simple. Recognizing the text itself is no longer especially challenging; all OCR products handle this function with aplomb. What remains elusive is the next step: turning all those little black marks on a page into a Word document that matches the layout and formatting of the source document.

Since version 7, Acrobat has included built-in text recognition. For this task, Acrobat is much more efficient than dedicated OCR software because it only needs to recognize text, the easier part of the equation. Acrobat recognizes the text and then stores a list of words in an index "beneath" the visual surface layer of the PDF file. This indexed word list is then accessible and searchable by desktop search and document management tools. Because Acrobat isn't encumbered with the task of creating a word processing file, the process is faster and yields text you can copy and paste into any other application. After all, most of the time, the need is limited to leveraging the text from the source document, not the formatting.
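To see why a stored word list makes a document findable, consider this rough Python sketch of how a search tool might index recognized text. It is only an illustration under stated assumptions: the file names and sample text are hypothetical, and neither Acrobat nor any particular desktop search product is claimed to work exactly this way.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical recognized text, standing in for what OCR stores inside each PDF.
recognized = {
    "lease.pdf": "This Lease Agreement is made between Landlord and Tenant",
    "deposition.pdf": "The witness stated that the tenant vacated in March",
}
index = build_index(recognized)
print(sorted(search(index, "tenant")))
print(sorted(search(index, "lease tenant")))
```

An image-only PDF would contribute no words to such an index, which is why it never turns up in a text search.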

When you scan paper, the resulting PDF is an "image": essentially a digital photograph of the source paper. There is no usable or selectable text in an image PDF. Some scanning software has the ability to both scan and convert documents into searchable PDF files in a single step. In most situations, however, this approach isn't desirable, at least when applied to every document. The reason is efficiency: converting an image PDF to searchable form takes much more time than merely scanning to an image PDF. The better approach is to individually select the PDFs you wish to make searchable, on an as-needed, ad hoc basis.

Acrobat Standard and Professional Edition convert image PDF files to searchable versions one document at a time. Both editions can also make batches of PDFs searchable, either by selecting multiple documents within a folder or by selecting entire folders. The batch approach might seem appealing, but the process can be so time-consuming that you may lose the use of your PC for hours. There are better approaches to batch conversion discussed below.

How to Make Scans Searchable in Acrobat 9 …

The process of searchable PDF conversion in Acrobat 9 Standard or Pro is as follows:

Scan the paper document or open an existing PDF.

Go to the Document menu and select "OCR Text Recognition," and then "Recognize Text Using OCR" from the submenu.

From the Recognize Text dialog box, the option for "All Pages" is the default — click OK to start the process.

A progress indicator will appear in the bottom right corner of the Acrobat display, showing the conversion of each page into searchable format.

When the progress indicator disappears, the process is complete — just remember to re-save your PDF files.

The file is now a searchable, or in AdobeSpeak, an "accessible" PDF — ready for you to highlight and select text that can be copied/pasted, or ready to have its text found in a variety of different types of text searches.

To convert a batch of files, the process is similar, with the variation as follows:

Instead of selecting "Recognize Text Using OCR" from the OCR Text Recognition menu, select "Recognize Text in Multiple Files Using OCR."

From the "Paper Capture Multiple Files" dialog box, click the "Add Files" button and select the option to either Add Files or Add (entire) Folders.

Navigate to the files or folders desired and they'll be added to the batch for processing.

Click OK and then wait for the batch to complete. Acrobat will automatically save the newly searchable files.
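Before queuing an entire folder, it can help to know which PDFs are still image-only so you don't spend hours reprocessing files that are already searchable. The Python sketch below is one rough way to triage a folder. It relies on an assumed heuristic that is not part of Acrobat: a PDF whose raw bytes contain no /Font entry probably has no text layer. That check can misjudge newer PDFs that compress their internal structure, so treat the result as a hint rather than a verdict. The function names and sample files are hypothetical.

```python
import tempfile
from pathlib import Path

def looks_image_only(pdf_path):
    """Assumed heuristic: no /Font entry in the raw bytes suggests no text layer."""
    return b"/Font" not in Path(pdf_path).read_bytes()

def find_unconverted(folder):
    """Return PDFs under folder, including subfolders, that look image-only."""
    return sorted(
        path
        for path in Path(folder).rglob("*")
        if path.is_file()
        and path.suffix.lower() == ".pdf"
        and looks_image_only(path)
    )

# Demonstration with stand-in files; real PDFs would come from your scanner.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "scan.pdf").write_bytes(b"%PDF-1.4\n<< /Subtype /Image >>\n")
    (root / "done.pdf").write_bytes(b"%PDF-1.4\n<< /Type /Font >>\n")
    (root / "notes.txt").write_bytes(b"not a PDF")
    found_names = [path.name for path in find_unconverted(root)]

print(found_names)
```

In practice you would call find_unconverted on a real client or matter folder and add only the files it lists to the batch.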

A third-party product called Autobahn DX from the UK-based company Aquaforest can streamline the batch searchable conversion process considerably. With Autobahn DX installed on a Windows Server, the network version of the software can run automatically at scheduled times. The program will identify all image PDF files in the designated folders and convert them to searchable PDFs without tying up anyone's PC or wasting a staffer's valuable time. The product is not inexpensive at $2,999 (including 12 months of software maintenance and support), but many small firms have found the cost reasonable versus the value of staff time otherwise wasted baby-sitting Acrobat's resource-hungry, workstation-based batch conversion process.
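Whether that price makes sense for your firm comes down to simple arithmetic. Here is a minimal Python sketch of the break-even calculation. The $2,999 figure is the price quoted above; the $40 hourly staff cost is purely a hypothetical assumption that you should replace with your own number.

```python
SOFTWARE_COST = 2999.00  # price quoted in this column, including 12 months of support

def break_even_hours(software_cost, hourly_staff_cost):
    """Return the staff hours of baby-sitting that cost as much as the software."""
    if hourly_staff_cost <= 0:
        raise ValueError("hourly_staff_cost must be positive")
    return software_cost / hourly_staff_cost

# Hypothetical assumption: a staffer whose time costs the firm $40 per hour.
hours = break_even_hours(SOFTWARE_COST, 40.0)
print(f"Break-even after about {hours:.0f} staff hours")
print(f"That is roughly {hours / 52:.1f} hours per week over one year")
```

Under that assumption, the software pays for itself once it saves about 75 hours of staff time, or a little under an hour and a half per week over a year.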

The Bottom Line …

Searchable PDF files are infinitely more useful in the daily grind of law practice than mere image PDF files. Whether converting files individually or in batches, searchability and copy-ability open up a broad range of text-handling opportunities you would otherwise miss.

Written by Ross Kodner of MicroLaw.

