How to manage PDF metadata
This article provides an introduction to PDF: what it is, how to enter metadata into PDF documents, and special considerations for metadata management in PDF document collections.
PDF is popular with publishers because it allows them to convert documents into a Web-compatible, universally accessible format with very little additional cost. For that reason, electronic content purchased from external vendors (e.g. market research reports) is usually in PDF format.
For users, PDF is both bane and boon. Some dislike it because PDF files take longer to download and require learning another program (Adobe Acrobat Reader). Others like it because the fee-based Acrobat program lets them annotate and edit the document content.
How to make PDF files findable
Metadata is not absolutely necessary to make a collection of PDF documents findable. Search engines can find them by scanning the full text of each document. But metadata can make search results easier to use and more accurate. For example, using the topic search or advanced search function of a metadata-aware search engine, a user could find all documents where AUTHOR = "Joe Jones" and PUBLISHER = "Nature." Metadata also makes it possible to sort search results by date and display only those documents that the user has permission to view.
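As a minimal sketch of the fielded search described above, the following Python assumes each document's metadata is available as a simple record. The Doc class, its field names, the group names, and the sample records are all hypothetical; they illustrate the idea and do not reflect how any particular search engine stores metadata.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    title: str
    author: str
    publisher: str
    published: date
    allowed_groups: frozenset  # groups permitted to view the document (assumed model)


def fielded_search(docs, user_groups, **criteria):
    """Return docs that match every field=value criterion and that the
    user may view, sorted newest first."""
    hits = [
        d for d in docs
        if all(getattr(d, field) == value for field, value in criteria.items())
        and d.allowed_groups & user_groups
    ]
    return sorted(hits, key=lambda d: d.published, reverse=True)


docs = [
    Doc("Paper A", "Joe Jones", "Nature", date(2004, 3, 1), frozenset({"staff"})),
    Doc("Paper B", "Joe Jones", "Nature", date(2005, 6, 15), frozenset({"staff", "guests"})),
    Doc("Paper C", "Joe Jones", "Science", date(2005, 1, 10), frozenset({"staff"})),
    Doc("Paper D", "Ann Smith", "Nature", date(2003, 9, 9), frozenset({"staff"})),
    Doc("Paper E", "Joe Jones", "Nature", date(2006, 2, 2), frozenset({"board"})),
]

# AUTHOR = "Joe Jones" and PUBLISHER = "Nature", as seen by a member of "staff"
results = fielded_search(docs, frozenset({"staff"}), author="Joe Jones", publisher="Nature")
titles = [d.title for d in results]
```

None of these three operations (exact field matching, date sorting, permission filtering) is possible from full text alone, which is why the metadata has to be recorded somewhere the search engine can read it.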
Most search experts believe that metadata is necessary to provide a reasonably good user experience. A metadata investment pays off in greater user productivity.
How to add metadata to PDF documents
In many respects, adding metadata to PDF files is no different from adding it to other types of files. All document creation programs include some system-supplied metadata (e.g. CREATION DATE) as well as standard metadata elements (e.g. TITLE) that can be filled in by the author. Some even allow the author to create custom metadata elements (see below).
You can also add metadata to a PDF document by selecting FILE > DOCUMENT PROPERTIES in the Adobe Acrobat program (not the free Acrobat Reader program). In either case, the metadata is entered by a human and embedded in the document, where it can be accessed by a metadata-aware search engine (a program such as Ultraseek or AtomZ that can read and act on document metadata values).
PDF metadata and the publishing workflow
Should authors enter metadata?
To create usable metadata, authors need a consistent organization scheme (taxonomy or schema) as well as a list of accepted values for each element in the scheme. They also need some training in indexing (assigning the right keywords) so that users — often in another job function or discipline — can find relevant documents. Organizations that live or die by their content and intellectual property (e.g. publishers, e-commerce firms, consulting firms, law firms) can usually justify the metadata investment.
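The idea of a scheme with a list of accepted values for each element can be sketched in a few lines of Python. The element names and accepted values below are invented for illustration; a real organization would substitute its own taxonomy.

```python
# Hypothetical schema: element name -> set of accepted values,
# or None when any non-empty text is allowed.
SCHEMA = {
    "title": None,
    "doc_type": {"research report", "newsletter", "presentation"},
    "business_unit": {"sales", "engineering", "legal"},
}


def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for element, accepted in SCHEMA.items():
        value = record.get(element, "").strip()
        if not value:
            problems.append(f"{element}: missing")
        elif accepted is not None and value.lower() not in accepted:
            problems.append(f"{element}: '{value}' is not an accepted value")
    for element in record:
        if element not in SCHEMA:
            problems.append(f"{element}: not part of the schema")
    return problems


good = {"title": "Q3 storage forecast", "doc_type": "Research Report", "business_unit": "sales"}
bad = {"title": "", "doc_type": "memo", "topic": "storage"}

good_problems = validate(good)  # conforms (values are compared case-insensitively)
bad_problems = validate(bad)    # empty title, unknown doc_type, missing unit, stray element
```

A check like this catches inconsistent values at entry time, but it cannot judge whether the chosen keywords are the right ones; that is where indexing training comes in.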
Auto-categorization tools can save time and reduce costs for some kinds of "cut-and-dried" content such as news stories, press releases, regulatory filings, or proposals. But the human touch is needed for more conceptual content that spans multiple disciplines or that is intended to promote collaboration and out-of-the-box thinking. The two approaches are not mutually exclusive. Many auto-categorization programs offer human experts the ability to "tweak" results, "train" the software, or create classification rules (see Quiver's QKS Classifier: a hybrid categorization tool).
Metadata from external sources
Metadata for third-party content is often delivered as an electronic data feed. A feed is roughly the equivalent of an electronic collection of library catalog cards. Like a catalog card, a feed record includes basic bibliographic information (e.g. TITLE, PUBLICATION DATE) plus a pointer (call number or URL) to the item's location, not the document itself. Feeds, usually in XML format, can be imported into a database whose contents can then be formatted into an A - Z index or read by a metadata-aware search engine.
Feed formats range from the very simple with minimal metadata (e.g. see the Montague Institute RSS news feed) to the complex with a great variety of metadata (e.g. see this group of sample records from the National Library of Medicine).
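As a rough illustration of importing a feed into a database, the Python sketch below parses a small XML feed and loads it into SQLite using only the standard library. The feed layout (record, title, pubdate, url elements), the table name, and the example.com URLs are assumptions made for the example; real vendor feeds each have their own format.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical feed: one <record> per document, pointing at the PDF by URL.
FEED = """<feed>
  <record>
    <title>Storage market outlook</title>
    <pubdate>2005-04-12</pubdate>
    <url>http://example.com/reports/storage.pdf</url>
  </record>
  <record>
    <title>Wireless trends</title>
    <pubdate>2005-05-03</pubdate>
    <url>http://example.com/reports/wireless.pdf</url>
  </record>
</feed>"""


def import_feed(xml_text, conn):
    """Parse the feed records, store them, and return how many were read."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "url TEXT PRIMARY KEY, title TEXT, pubdate TEXT)"
    )
    rows = [
        (rec.findtext("url"), rec.findtext("title"), rec.findtext("pubdate"))
        for rec in ET.fromstring(xml_text).iter("record")
    ]
    # INSERT OR REPLACE keeps a re-import of the same feed from duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO records VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = import_feed(FEED, conn)

# A simple A - Z index is then just an ordered query over the titles.
index = [row[0] for row in conn.execute("SELECT title FROM records ORDER BY title")]
```

Using the URL as the primary key reflects the catalog-card analogy: the record identifies and points to the document rather than containing it.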
How to capture and store PDF metadata in a repository
Once imported, the records can be enhanced with local data, either via automated rules or human data entry.
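Enhancement by automated rules plus human data entry might look like the following Python sketch. The keyword rules, topic names, and record layout are hypothetical; in practice the rules would be derived from the organization's own taxonomy.

```python
# Hypothetical rules: if the keyword appears in the title, add the local topic.
RULES = [
    ("storage", "Hardware"),
    ("wireless", "Networking"),
    ("forecast", "Market sizing"),
]


def enhance(record, manual_topics=()):
    """Return a copy of the record with a sorted 'topics' list added."""
    title = record["title"].lower()
    topics = {topic for keyword, topic in RULES if keyword in title}
    topics.update(manual_topics)  # human-assigned topics supplement the rules
    return {**record, "topics": sorted(topics)}


imported = {"title": "Wireless storage forecast", "url": "http://example.com/r1.pdf"}
enhanced = enhance(imported, manual_topics=["Competitors"])
```

The imported record is left unchanged and the local topics are added to a copy, so the vendor-supplied data can always be told apart from what was added locally.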
1. [Government agency]
Our primary motivation for using metadata is to improve the performance of our enterprise search engine (Ultraseek). We can "tune" Ultraseek to give documents with Dublin Core metadata a higher relevancy rank, so they show up higher in the search results. Metadata also ensures that documents are placed into the correct category on our portal Web site.
Acrobat 6.0 and subsequent versions have Dublin Core fields built into the Document Properties. To see them, you can go to Document Properties > Description > Additional Metadata > Advanced. Dublin Core is one of the schemas. There is also a tab for custom fields.
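For readers who want to see what such fields look like to a program, the Python sketch below reads Dublin Core values from an XMP packet, the RDF/XML format in which this kind of metadata is typically embedded. It assumes the packet has already been extracted from the PDF as text; the sample packet is hand-written for illustration and the dublin_core function is a hypothetical helper, not part of any Adobe tool.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample packet (assumed already extracted from a PDF).
XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Annual report</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>Joe Jones</rdf:li></rdf:Seq></dc:creator>
   <dc:subject><rdf:Bag><rdf:li>budget</rdf:li><rdf:li>statutes</rdf:li></rdf:Bag></dc:subject>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def dublin_core(xmp_text):
    """Map each dc: element name to the list of values found in the packet."""
    root = ET.fromstring(xmp_text)
    prefix = "{" + NS["dc"] + "}"
    fields = {}
    for desc in root.iter("{" + NS["rdf"] + "}Description"):
        for child in desc:
            if child.tag.startswith(prefix):
                # Values usually sit in rdf:li items; fall back to plain text.
                values = [li.text for li in child.iterfind(".//rdf:li", NS)]
                fields[child.tag[len(prefix):]] = values or [child.text]
    return fields


fields = dublin_core(XMP)
```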
The only usability test on LIV terms that I'm aware of was done several years ago by a search engine expert who is also a librarian. An excerpt from the report is given below:
I use LIV to help with search terms as well as matching rules in the Ultraseek Content Classification Engine. I always encourage agency web page creators to look at their search logs and use both terms from the thesaurus as well as highly-used search phrases that are synonyms. It seems to work pretty well for our dual purposes.
I am also using the Quick Links feature in Ultraseek more for that purpose. For example, the legislature pages are linked to certain phrases, and the hits show up above the search results. For example, if a user types "statutes" into the search box, links to the main Statutes page as well as the Bills page show above other search results.
2. [Software company]
Our internal market research portal contains thousands of PDF documents from vendors such as IDC, Forrester, and Gartner. For the most part, these documents are produced externally and sent to us already in PDF form. We add metadata directly into the PDF (title, author, short summary) using Adobe Acrobat. We can also add metadata via a homegrown Content Management System (CMS).
With both methods we use topics from our extensive taxonomy, which currently resides in Ultraseek. Topics cover document type (research report, newsletter, presentation, etc.), company business unit, research source, and many content/business areas. Metadata is added by an information professional who selects the documents from research vendors and uploads them each morning to the portal using the CMS. Keywords are not extensively used at this point but could be in the future. Often we use basic metadata provided by the vendor (e.g. author, title) but add topics from our own taxonomy.
Our Web portal uses Ultraseek to leverage both the full-text indexing of the PDFs and the added metadata from the CMS, displaying the key metadata about the documents (title, summary, research source, author, date published and number of pages) in the search results.
Metadata is dynamically pulled from our database of PDF documents using a custom front end to Ultraseek. Users can either browse the research collection by topic or search the full text of each document. The system automatically sends out email alerts informing subscribers of new documents based on the topics associated with each document in the CMS.
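Topic-based alerting of this kind can be sketched as a match between each new document's topics and each subscriber's chosen topics. The Python below is an illustration only: the data structures, topic names, and addresses are hypothetical, it does not describe the company's actual system, and it stops short of sending any email.

```python
def build_alerts(new_docs, subscriptions):
    """Map each subscriber address to the titles of new documents that
    share at least one topic with that subscriber's interests."""
    alerts = {}
    for address, wanted in subscriptions.items():
        titles = [doc["title"] for doc in new_docs if wanted & set(doc["topics"])]
        if titles:
            alerts[address] = titles
    return alerts


new_docs = [
    {"title": "Storage market outlook", "topics": ["Hardware", "Research report"]},
    {"title": "Wireless trends", "topics": ["Networking"]},
]
subscriptions = {
    "ana@example.com": {"Hardware", "Networking"},
    "raj@example.com": {"Networking"},
    "lee@example.com": {"Legal"},
}

alerts = build_alerts(new_docs, subscriptions)
```

Subscribers with no matching documents are left out of the result, so no empty alert would be sent.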
In addition to using metadata to enhance the search experience, we are also interested in using Adobe Policy Server [NOTE 2] to control access to PDF documents for employees. When employees leave the company, the Policy Server would prevent them from accessing any PDF documents that they take with them.
Created on August 26, 2012 | Updated on August 9, 2012