Herbert W. Armstrong Searchable Library
Notes on Searchability
PDF Best Practices #5: Acrobat's Find & Search
Text retrieval functionality is key with
'digital haystacks' of information
19 April 2002
By Shlomo Perets
of MicroType
(www.microtype.com)
Some documents,
publications or books are read from beginning to end and are never opened
again; others are read partially or only browsed, and then referred to again
when looking for more information on a specific item. Electronic documentation
is the ideal format for this "reference" mode, as in addition to the
traditional table of contents and indexes and their online implementations, it
supports search functions to efficiently locate all instances of required
items.
Text retrieval functions
make the difference between a digital haystack where items are "known to
exist somewhere" but nevertheless cannot be located without a significant
effort, and a document collection where required items can be found instantly
even when there are thousands of pages in multiple files.
Acrobat offers two text
retrieval functions that differ in concept and implementation: the basic Find
function, and the more advanced Search function. The differences between these
functions will be discussed in detail later in this column.
What You See Is Not Necessarily There
Both Find and Search
suffer from a number of problems related to text representation. For both
functions to work, text in the electronic document must be identical to the
text in the original document. This may seem obvious, but many insidious
side-effects are introduced when a document is converted to PDF, and often even
the PDF producer, let alone the reader, is unaware of the implications this has
on text retrieval.
Correct text
representation, however, should not be taken for granted, as there are still
several issues that can prevent text from being consistently searchable:
· Older drivers, ATM versions and printer driver settings can cause text in some
fonts to be "garbled" internally (it prints and displays as intended, but it
is not searchable).
It is also worth
noting that documents scanned into PDF are not searchable. Scanned PDFs to
which optical character recognition (OCR) was applied may be partially
searchable, but with a significant number of errors -- with words not found or
mis-recognized.
Leaving these issues
aside, one has to remember that PDF is essentially a presentation format and
not a document in the sense of text flow. Searches will not work when the
phrase being searched for is split between pages. Depending on the applications
used to author and create the PDF, there may be additional problems of lines
placed in a reversed or unexpected order, so that phrases split across lines
are not located. Items split between lines in table cells or in multi-column
layouts may also be impossible to locate (as the word sequence uses a logic
different from what is expected). Hyphenated words or phrases with natural
hyphens split between lines may also pose difficulties, depending on the
specific authoring applications or PDF creators. When creating "Tagged
PDFs" (authored with Word2000 + PDFMaker 5 or FrameMaker 7.0), the
additional information stored in the PDF file significantly improves
Find/Search functionality. "Structured PDFs" (as authored with Word98
+ PDFMaker or FrameMaker 6.0) make no difference with respect to text searches,
despite the extra structure information embedded in the PDF.
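The line- and page-splitting problem can be illustrated with a minimal Python sketch. It does not show how Acrobat works internally; the sample `pages` list and the function names `naive_find` and `flow_find` are hypothetical, and the text is assumed to have been extracted already, one string per page.

```python
import re

def naive_find(pages, phrase):
    """Search page by page and line by line; return the 1-based pages with a hit."""
    hits = []
    for number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            if phrase.lower() in line.lower():
                hits.append(number)
                break  # one hit per page is enough for this sketch
    return hits

def flow_find(pages, phrase):
    """Join all pages into one text flow, collapse whitespace, then search."""
    flow = re.sub(r"\s+", " ", " ".join(pages))
    return phrase.lower() in flow.lower()

# Hypothetical two-page document; "\n" marks a line break within a page.
pages = [
    "Acrobat offers two text\nretrieval functions: Find and",
    "Search. Both depend on correct text representation.",
]

print(naive_find(pages, "text retrieval"))   # phrase is split between lines
print(flow_find(pages, "text retrieval"))
print(flow_find(pages, "Find and Search"))   # phrase is split between pages
```

The line-oriented search misses both phrases, while the search over the joined text flow finds them. This only works when the lines are stored in reading order, which, as noted above, is not guaranteed.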
Find
The Find function (Edit > Find) does not require
special preparations on the part of the PDF producer, other than verifying that
text is interpreted correctly in Acrobat. Find locates the specified text (word
or phrase) in the currently open PDF file only (locally or in a web browser);
options include matching letter case and "whole word".
In terms of speed, the
Find function is rather slow, even with the fastest computers. (Try locating a
phrase that appears on one of the last pages of a PDF file containing a
few hundred pages, and you'll see the status bar stepping through the page
numbers one by one.)
Search
Compared to Find, the Search function (Edit >
Search > Query) is much more efficient. Search supports powerful text
retrieval functions such as looking for multiple words, together with logical
operators (And, Not, Or), with optional Proximity (locating multiple items only
if they are in approximately the same three-page zone, or a larger zone if
there is not much text per page), as well as word stemming, "sounds
like", thesaurus and wildcards options.
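The logical operators and the Proximity option can be sketched in a few lines of Python. This is an illustration of the concepts only, not Acrobat's query engine: the function names are hypothetical, a document is assumed to be a list of page texts, and the fixed three-page window is a simplification of the "approximately three-page zone" described above.

```python
import re

def pages_with(pages, word):
    """Return the 1-based numbers of the pages that contain `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return {n for n, text in enumerate(pages, start=1) if pattern.search(text)}

def match_and(pages, a, b):
    return bool(pages_with(pages, a)) and bool(pages_with(pages, b))

def match_or(pages, a, b):
    return bool(pages_with(pages, a)) or bool(pages_with(pages, b))

def match_not(pages, a, b):
    """True when `a` occurs in the document and `b` does not."""
    return bool(pages_with(pages, a)) and not pages_with(pages, b)

def match_and_proximity(pages, a, b, zone=3):
    """AND, but only when both words fall inside some window of `zone` consecutive pages."""
    return any(abs(x - y) < zone
               for x in pages_with(pages, a)
               for y in pages_with(pages, b))

# Hypothetical five-page document.
document = [
    "Catalog builds the index",   # page 1
    "filler text",                # page 2
    "more filler text",           # page 3
    "still more filler text",     # page 4
    "the query dialog box",       # page 5
]

print(match_and(document, "catalog", "query"))            # both words occur somewhere
print(match_and_proximity(document, "catalog", "query"))  # but four pages apart
```

With Proximity enabled, a document where the two words are far apart no longer counts as a match, which is what makes the option useful for narrowing results in long files.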
The Search function can
also use PDF metadata, i.e. file-specific DocInfo fields such as Title,
Keywords, Author, Subject (and optionally custom fields). When including these
fields in the search query, fields and their value range can either be typed
directly, or can be added to the Search dialog box (Preferences, Search,
Include in Query); custom fields can only be typed directly. It is possible to
search based on field values exclusively, or to combine phrases with field
values.
Search is cross-document
and very fast compared to the Find function -- both factors are related to the
mechanics of the Search function: the PDF producer uses Acrobat Catalog in
advance to prepare a "full-text search index", listing all words in
the document collection being indexed. This "index" (.pdx file
pointing to a folder structure with index-specific files) provides the Search
function with pointers to all occurrences of different words (including text in
vector graphics, if it is retained as text). When the user searches for a word,
it is the pre-prepared index that is being searched, and not the document
itself. When a word is found, pointers to the locations in documents within the
collection are displayed. This means that the user is not searching within the
current document, and can search for a word without any documents being open.
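The idea of a pre-prepared index can be shown with a minimal inverted index in Python. This is a sketch of the general technique, not of Catalog or of the .pdx format; the file names, the sample text and the function names `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict

def build_index(collection):
    """Map each lowercased word to a sorted list of (file name, page number) pointers."""
    index = defaultdict(set)
    for name, pages in collection.items():
        for number, text in enumerate(pages, start=1):
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word].add((name, number))
    return {word: sorted(locations) for word, locations in index.items()}

def search(index, word):
    """Look the word up in the index; the documents themselves are not read."""
    return index.get(word.lower(), [])

# Hypothetical collection: file name -> list of page texts.
collection = {
    "chapter1.pdf": ["Find scans the open file", "Search uses an index"],
    "chapter2.pdf": ["Catalog builds the index"],
}
index = build_index(collection)

print(search(index, "index"))
# [('chapter1.pdf', 2), ('chapter2.pdf', 1)]
```

The cost of reading every page is paid once, when the index is built. Each later query is a single dictionary lookup, regardless of how many files and pages the collection contains.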
With the Search function,
the user must first select/activate the index [shown above] to be used (Edit
> Search > Select Indexes, or the Indexes button in the Search dialog
box). The PDF producer should assist, whenever possible, by associating the
index with PDFs in the document collection -- either with PDFs that are
considered main entry points, or with all PDFs. This way, end-users will
automatically have the index activated without having to select it manually. An
exception to this is when the same PDFs are distributed individually or placed
on a site; Acrobat will display an error message if the attached index is not
present. (Acrobat Search requires the index and all PDFs to be stored on a
local or network drive, maintaining the relative path present when the index
was created. Search won't work if the index or PDF files are stored in a web
site; there are, however, third-party products that support PDF searches on the
web).
The Search function typically searches a group of
documents; the search results it displays consist of a list of
all matching PDF files, each shown by its title and a score. Clicking any of
the titles takes you to the first page with corresponding hits in that file,
highlighted. Clicking the Previous Highlight or Next Highlight buttons takes
you to the previous or next occurrence, moving transparently to the next or
previous file in the list of results. The search results can be narrowed down
by searching within the Search Results, rather than searching the entire
collection again (hold down the Control or Option keys, and the Search button
changes to Refine).
If there is only one hit,
Acrobat takes you directly to the location in which hits are found --
highlighting the matching words (without displaying the Search Results box with
document titles).
Having a meaningful PDF
document title is essential, as the file name -- displayed instead of a title
-- is not descriptive or "friendly" enough. It is also a good idea to
set the opening mode of all files to show the title in the title bar to
maintain orientation as to the item currently being viewed, so that it will be
of use even if the document is opened in the middle of the file, as can often
happen when clicking the Next/Previous Highlight buttons (or, for that matter,
when following cross-file links or bookmarks). In Acrobat 5, select File >
Document Properties > Open Options, "Display Document Title"; when
the PDF is displayed with previous versions of Acrobat, select "Resize
Window to Initial Page" for a similar effect.
To take advantage of the
Search function and display the list of results so that the specific section of
interest can be selected directly, it is essential that the PDF document or
document collection is constructed as a set of independent chapters, each being
a separate PDF, and not as book/s converted to single-file PDFs. Each PDF
should have its own unique Title, Subject, Author and Keywords fields --
chapter-specific -- applied consistently throughout the document collection;
these also help to pinpoint subjects of interest.
When searching a
single-file book with the Search function, the reader has to click the Next
Highlight button continuously, with no clues as to the location/context
(similar to the situation when using Find), meaning that readers are back in
the digital haystack.
When the same source
material is split into multiple files and the Search function is used, the list
of results indicates the probable sections, so that the reader can decide,
based on the title, whether to click the document. Having separate chapters
also means that it is possible to open multiple windows if necessary, each with
its own title displayed in the title bar. (Splitting a book into separate
chapters need not compromise navigation, which can be preserved through the use
of cross-file links and bookmarks.)
It is recommended to
provide a meaningful title for the index, and also include a brief description
(including information as to options enabled or disabled for the index).
Even when all groundwork
for powerful and efficient searches is there, readers can be helped in various
ways:
· First and foremost, "Reader with Search" should be indicated as a required
version (free download; the Search function is not available in the somewhat
smaller-size Reader).
Searching for specific
information does not exclude other access/navigation mechanisms, including
bookmarks and links in items such as a table of contents or a standard index;
these complement one another. Whereas the table of contents and the index list
items directly so that they can be selected, one has to know precisely what to
look for when using Find or Search.
PDFs in Acrobat 5 CD
Large Single-File PDFs
The major shortcoming of
the Acrobat 5 PDFs, in my opinion, is the inefficient use of the Search
function. Acrobat Help (page 222) rightly advises: "Consider creating a
separate PDF file for each chapter or section of a document. When you separate
a document into parts and then search it, search performance is
optimized." However, all PDFs in the Acrobat 5 CD were constructed as a
single PDF for an entire book. This applies to the Acrobat Help itself, but
also to the PDF Reference (696 pages) and even to the gigantic Acrobat Core API
Reference (2755 pages). When searching for "event", for example, we
get 16 books listed, with no clues as to specific sections within these books
where items are located. (It is possible to formulate the search query for a
better focus and fewer items listed, but the end result is still entries that
show the entire book.)
The Core API Reference
demonstrates another potential problem, where Acrobat Catalog splits very large
PDFs into two or more parts. In the Search Results, we see two entries which
relate to the same PDF: "Acrobat Core API Reference" and
"Acrobat Core API Reference: Pages 2389 to 2755." While it may be
possible to minimize this separation by modifying Catalog preferences, it is
best to avoid having such large PDFs in the first place.
Text Representation Problems
Text in PDFs in the
Acrobat 5.0 CD is generally "well-behaved" -- no major anomalies are
found.
In a few documents,
spaces are missing in the "internal representation". As an example,
inspect the Contents page in the Acrobat Development Overview
(DevelopmentOverview.pdf in the Getting_Started folder in the SDK
documentation). When trying to locate the phrase "This Document",
which appears three times in the top area of the page, you will not succeed.
Select the text with the Text Select Tool, copy and paste it to a text editor;
you will then be able to see that spaces are missing in different locations.
Trying to find
"ThisDocument" will succeed in locating these instances. A similar
problem can be seen in the "Acrobat Developer FAQ" PDF.
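A search tool could work around missing (or extra) spaces by ignoring whitespace when matching. The following Python sketch shows one way to do this; it is an assumption about a possible workaround, not a feature of Acrobat, and the `extracted` string is a made-up sample rather than the actual text of the Development Overview.

```python
import re

def find_ignoring_spaces(text, phrase):
    """Return the start offsets where `phrase` matches once all whitespace is ignored."""
    letters = [re.escape(ch) for ch in phrase if not ch.isspace()]
    # Allow any amount of whitespace (including none) between consecutive characters.
    pattern = re.compile(r"\s*".join(letters), re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]

# Hypothetical extracted text: the first instance has lost its space.
extracted = "About ThisDocument\nPurpose of This Document"

print(extracted.find("This Document"))                   # plain search: second instance only
print(find_ignoring_spaces(extracted, "This Document"))  # both instances
```

The trade-off is a risk of false matches across word boundaries, so such a search is better offered as a fallback than as the default.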
Extra spaces added in
random locations within words are actually a more common problem in PDFs, but
in the case of the Acrobat 5 PDFs this was not traced.
The Acrobat Help file
(Help > Acrobat Help) demonstrates the problem associated with hyphenation.
The document uses moderate hyphenation, where only longer words are hyphenated,
with 5 or more characters left on either side. These hyphens -- such as in
"accessi-bility", "appli-cation" -- cause text to be
interpreted differently. Searching for plain "accessibility" and
"application" will not locate the hyphenated versions, but "accessi
bility" and "appli cation" (with a hyphen or spaces in the
hyphen's location) will succeed.
The opposite problem --
of a hyphen discarded at end of line -- is seen in the Acrobat JavaScript PDF
(Help > Acrobat JavaScript Guide). Trying to find "client-side"
(typing either "client-side" or "client side"), we get one
match. But is it the only instance? No. Using Find with "clientside"
we locate another instance where "client-side" is split between lines
at the "natural" hyphen.
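When checking a document for this kind of problem, it helps to try all spellings of a hyphenated term systematically. Here is a minimal Python sketch, assuming the text has already been copied out of the PDF; the function names and the `sample` string are hypothetical.

```python
def hyphen_variants(term):
    """Return the spellings worth trying for a term that contains a natural hyphen."""
    if "-" not in term:
        return [term]
    return [term, term.replace("-", " "), term.replace("-", "")]

def count_variants(text, term):
    """Count case-insensitive occurrences of each variant in `text`."""
    lowered = text.lower()
    return {variant: lowered.count(variant.lower()) for variant in hyphen_variants(term)}

# Hypothetical extracted text: the second instance lost its hyphen at a line end.
sample = "Client-side scripts run in the viewer; clientside checks are limited."

print(hyphen_variants("client-side"))
# ['client-side', 'client side', 'clientside']
print(count_variants(sample, "client-side"))
```

A non-zero count for the joined spelling, for a term that is never written that way on the page, is a sign that hyphens were discarded at line ends.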
The Acrobat Distiller
Parameters (DistillerParameters.pdf in PDF_Creation_APIs) demonstrates the
impact of having information arranged in tabular form, with multi-line items.
Acrobat has no idea of the presence of table columns, which significantly
reduces retrieval of phrases split between lines. When searching for the phrase
"sampled images", several instances are located, but not the one on
page 37.
In Batch Sequences
(BatchSequences.pdf), the title in the first page was converted to a bitmap --
making it impossible to locate; a similar problem is seen in ADBC.pdf. This problem,
where larger-size characters are transformed to bitmaps, is related to the
PostScript driver being used.
Additional Examples
To see potential problems
with products that support advanced typography features, such as ligatures,
small caps and old-style numerals, see the Adobe OpenType User Guide,
authored with Adobe InDesign and exported directly to PDF:
· "2002" is present in the first page below the title -- but cannot be located
because old-style figures are used.
While these OpenType
features result in a superior typography, they should be avoided in online
documents, until Acrobat Find and Search functions are enhanced to support the
additional characters.
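For ligatures, one possible enhancement is to fold the text before matching. The Python sketch below uses Unicode compatibility normalization (NFKC) from the standard `unicodedata` module. It assumes the ligatures are stored as Unicode presentation-form characters such as U+FB01, which is an assumption: a PDF may use font-specific codes instead, and old-style figures or small caps usually have no separate Unicode characters, so this technique would not help with them.

```python
import unicodedata

def normalize_for_search(text):
    """Fold compatibility characters (such as the 'fi' ligature) to plain letters, then casefold."""
    return unicodedata.normalize("NFKC", text).casefold()

# Hypothetical extracted text containing U+FB03 (ffi ligature) and U+FB01 (fi ligature).
extracted = "The O\ufb03ce pro\ufb01le"

print("office" in extracted.lower())                # the ligature hides the word
print("office" in normalize_for_search(extracted))  # found after folding
```

The same folding has to be applied to the text when it is indexed and to the query when it is typed, so that both sides are compared in the same form.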
As an example of a PDF
with text that is internally deformed, see the Adobe InDesign Programming Guide.
It includes numerous code fragments (see pages 419 and onwards) set in a
monospace font, and the same font is used in regular text to indicate function
names or related items. None of these are searchable. Copy and paste the
text and you'll see why: "matrix passed" is understood internally as
"2#___A".#%%_&"". With this type of document, users
could have happily used the copy and paste function to reduce typing
time/errors when studying or implementing the techniques discussed, but results
in this case are of no value.
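A PDF producer could catch this kind of deformation before distribution with a rough check on the copied text. The Python sketch below is a heuristic only: the function name `looks_garbled` and the 0.5 threshold are assumptions, and a low share of letters can also occur in legitimate text such as tables of numbers.

```python
def looks_garbled(text, threshold=0.5):
    """Flag text in which fewer than `threshold` of the characters are letters or spaces."""
    if not text:
        return False
    readable = sum(1 for ch in text if ch.isalpha() or ch.isspace())
    return readable / len(text) < threshold

print(looks_garbled("matrix passed"))      # text as it appears on the page
print(looks_garbled('2#___A".#%%_&"'))     # the same text as copied from the PDF
```

Running such a check on a few passages in each font used in the document is usually enough to reveal a font whose text is not stored in a searchable form.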