Related articles
Adobe Acrobat 5: The Swiss Army knife of knowledge base publishing
Quiver's QKS Classifier: A hybrid categorization tool
How to manage PDF metadata
August, 2006
Question
A Society member is considering adding metadata to PDF documents published on her organization's public Web site. In this case, metadata means descriptive information such as author, title, and publication date. The objective is to make it easier to find PDF documents with the AtomZ search engine. Her colleague, the Web site publisher, wants to add only the basic PDF metadata found in the Adobe document properties (see below).
PDF metadata as displayed in Adobe Acrobat 7. To view metadata in a PDF document, open it with Acrobat or Acrobat Reader and select "Document Properties" in the File menu.
She is interested in hearing what kind of metadata others have added to PDF documents, when it is added (i.e. at what point in the process), how it has been added, and by whom.
This article provides an introduction to PDF: what it is, how to enter metadata into PDF documents, and special considerations for metadata management in PDF document collections.
Background
PDF (Portable Document Format) is an interchange file format used for displaying and printing electronic documents. PDF has three primary virtues:
1. The reader does not need to install the program used to create the original document.
2. The document displays or prints exactly as the author or publisher intended, with all the original fonts, columns, graphics, and other page elements.
3. Publishers can control how a document is used — i.e. whether it can be printed, copied, or excerpted.
PDF is popular with publishers because it allows them to convert documents into a Web-compatible, universally accessible format with very little additional cost. For that reason, electronic content purchased from external vendors (e.g. market research reports) is usually in PDF format.
For users, PDF is both bane and boon. Some dislike it because PDF files take longer to download and require the user to learn additional program commands (i.e. learn how to use Adobe Acrobat Reader). Others like it because with the fee-based Acrobat program, they can annotate and edit the document content.
How to make PDF files findable
PDF files are replacing books as corporate libraries "go virtual," and the descriptive information associated with them (the metadata) is becoming the electronic equivalent of the library catalog card. To make a collection of PDF files as findable as a collection of library books, you need to do the following:
1. Create an organization scheme (taxonomy) that accurately describes each file. A typical scheme (also called a schema) includes such elements as author, title, publication date, publisher, and keywords. In addition, some of the elements might have standardized values, such as "article", "white paper", "slide show", or "spreadsheet" for the element FORMAT.
2. For each PDF file, assign values for the descriptive elements (e.g. AUTHOR = "Joe Jones"). The element name (AUTHOR) plus the element value ("Joe Jones") together constitute a metadata element.
3. Create a search and retrieval mechanism that uses metadata, such as an A - Z index or search engine.
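The first two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names follow the examples in the list, the `SCHEMA` table and the `validate` function are hypothetical, and a real scheme would likely have more elements and longer value lists.

```python
# Hypothetical scheme: element name -> accepted values.
# None means free text; a set means a standardized (controlled) list.
SCHEMA = {
    "AUTHOR": None,
    "TITLE": None,
    "PUBLICATION_DATE": None,
    "PUBLISHER": None,
    "KEYWORDS": None,
    "FORMAT": {"article", "white paper", "slide show", "spreadsheet"},
}


def validate(record):
    """Return a list of problems found in one document's metadata record."""
    problems = []
    for element, value in record.items():
        if element not in SCHEMA:
            problems.append(f"unknown element: {element}")
            continue
        allowed = SCHEMA[element]
        if allowed is not None and value not in allowed:
            problems.append(f"{element}: {value!r} is not an accepted value")
    return problems


# Step 2: assign values for one file, then check them against the scheme.
record = {"AUTHOR": "Joe Jones", "FORMAT": "white paper"}
print(validate(record))
print(validate({"FORMAT": "memo"}))
```

Checking values at entry time is one way to keep a standardized list such as FORMAT from drifting as more people add metadata.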
Metadata is not absolutely necessary to make a collection of PDF documents findable. Search engines can find them by scanning the full text of each document. But metadata can make search results easier to use and more accurate. For example, using the topic search or advanced search function of a metadata-aware search engine, a user could find all documents where AUTHOR = "Joe Jones" and PUBLISHER = "Nature." Metadata also makes it possible to sort search results by date and display only those documents that the user has permission to view.
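To make the idea concrete, here is a minimal sketch of a fielded search over metadata. It is not how any particular search engine works: the in-memory `DOCUMENTS` list, the `allowed_groups` field, and the `fielded_search` function are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical in-memory index; a real search engine would build this from
# the metadata embedded in, or stored alongside, each document.
DOCUMENTS = [
    {"url": "/docs/a.pdf", "AUTHOR": "Joe Jones", "PUBLISHER": "Nature",
     "DATE": date(2006, 3, 1), "allowed_groups": {"staff", "public"}},
    {"url": "/docs/b.pdf", "AUTHOR": "Joe Jones", "PUBLISHER": "Nature",
     "DATE": date(2005, 11, 15), "allowed_groups": {"staff"}},
    {"url": "/docs/c.pdf", "AUTHOR": "Ann Lee", "PUBLISHER": "Nature",
     "DATE": date(2006, 1, 20), "allowed_groups": {"public"}},
]


def fielded_search(docs, user_groups, **criteria):
    """Match every criterion exactly, drop documents the user may not
    view, and return the remaining hits sorted newest first."""
    hits = [
        d for d in docs
        if all(d.get(name) == value for name, value in criteria.items())
        and d["allowed_groups"] & user_groups
    ]
    return sorted(hits, key=lambda d: d["DATE"], reverse=True)


results = fielded_search(DOCUMENTS, {"public"},
                         AUTHOR="Joe Jones", PUBLISHER="Nature")
print([d["url"] for d in results])
```

None of the three operations (exact field match, permission filter, date sort) is possible from full text alone, which is the practical argument for metadata.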
Most search experts believe that metadata is necessary to provide a reasonably good user experience. A metadata investment pays off in greater user productivity.
How to add metadata to PDF documents
Metadata can be associated with a PDF document in one of four ways:
1. entered by the author using the original composition program and copied automatically during the PDF conversion process;
2. entered manually by an indexer or editor after the document has been converted to PDF;
3. entered by an indexer or editor in a database (metadata repository);
4. automatically entered by an auto-categorization program that electronically scans the document and assigns metadata based on computer algorithms or human-created rules.
In many respects, adding metadata to PDF files is no different than adding it to other types of files. All document creation programs include some system-supplied metadata (e.g. CREATION DATE) as well as standard metadata elements (e.g. TITLE) that can be filled in by the author. Some even allow the author to create custom metadata elements (see below).
Custom properties in Word. In addition to the standard metadata elements defined under the Word Properties Summary tab at left, you can define your own metadata elements (properties) using the Custom tab. Shown here are custom properties for DESCRIPTION, IDENTIFIER, and PUBLISHER.
To enter metadata values in a Word document, select "Properties" under the "File" menu.
You can also add metadata to a PDF document by selecting FILE > DOCUMENT PROPERTIES in the Adobe Acrobat program (not the free Acrobat Reader program). In either case, the metadata is entered by a human and embedded in the document, where it can be accessed by a metadata-aware search engine (a program such as Ultraseek or AtomZ that can read and act on document metadata values).
PDF metadata and the publishing workflow
Adding metadata to documents before they are converted to PDF can be tricky. Depending on the conversion program used, metadata entered in the authoring program (e.g. Word) may or may not show up in the resulting PDF file. Even when the metadata does survive the PDF conversion process, the metadata elements might not match. For example, in Word the producer might be called COMPANY, while in the PDF version, it might be called PUBLISHER. If you expect authors to add metadata during composition, make sure that it will appear correctly after the PDF conversion process.
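One way to run that check is to compare the properties entered in the authoring program with those found in the converted file. The sketch below assumes both sets of properties have already been read into plain dictionaries (how you read them depends on your tools). The `WORD_TO_PDF` mapping and the `check_conversion` function are hypothetical, and the COMPANY to PUBLISHER pairing simply mirrors the example above.

```python
# Hypothetical mapping from authoring-program property names to the names
# expected in the converted PDF. Adjust to what your converter produces.
WORD_TO_PDF = {"TITLE": "TITLE", "AUTHOR": "AUTHOR", "COMPANY": "PUBLISHER"}


def check_conversion(word_props, pdf_props):
    """List source properties that are missing or changed in the PDF."""
    problems = []
    for word_name, pdf_name in WORD_TO_PDF.items():
        if word_name not in word_props:
            continue  # the author left this property empty
        expected = word_props[word_name]
        actual = pdf_props.get(pdf_name)
        if actual is None:
            problems.append(f"{word_name} -> {pdf_name}: missing in PDF")
        elif actual != expected:
            problems.append(
                f"{word_name} -> {pdf_name}: {expected!r} became {actual!r}")
    return problems


word = {"TITLE": "Q3 Report", "COMPANY": "Acme"}
pdf = {"TITLE": "Q3 Report"}
print(check_conversion(word, pdf))
```

Running a check like this on a handful of test documents before rollout is cheaper than discovering the mismatch after authors have tagged a whole collection.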
Should authors enter metadata?
A debate still rages over whether information managers should expect authors to enter metadata. On the one hand, the author knows the material better than anyone else and, with the right tools and training, can enter metadata values quickly and easily.
On the other hand, to create usable metadata, authors need a consistent organization scheme (taxonomy or schema) as well as a list of accepted values for each element in the scheme. They also need some training in indexing (assigning the right keywords) so that users, often in another job function or discipline, can find relevant documents. Organizations that live or die by their content and intellectual property (e.g. publishers, e-commerce firms, consulting firms, law firms) can usually justify the metadata investment.
Auto-categorization tools can save time and reduce costs for some kinds of "cut-and-dried" content such as news stories, press releases, regulatory filings, or proposals. But the human touch is needed for more conceptual content that spans multiple disciplines or that is intended to promote collaboration and out-of-the-box thinking. The two approaches are not mutually exclusive. Many auto-categorization programs offer human experts the ability to "tweak" results, "train" the software, or create classification rules (see Quiver's QKS Classifier: a hybrid categorization tool).
Metadata from external sources
PDF content purchased from third parties may or may not contain meaningful metadata. Moreover, you can't dictate what organization scheme will be used. Even when a publisher uses a recognized metadata standard such as Dublin Core, its categories and keywords might not be appropriate for your organization's internal uses. You can use the Acrobat program to add or change document metadata from third party sources.
However, metadata for third party content is often delivered as an electronic data feed. A feed is roughly the equivalent of an electronic collection of library catalog cards. The metadata on a library card or in a feed includes basic bibliographic information (e.g. TITLE, PUBLICATION DATE) plus a pointer (call number or URL) to the item's location — not the document itself. Feeds, usually in XML format, can be imported into a database whose contents can then be formatted into an A - Z index or read by a metadata-aware search engine.
Feed formats range from the very simple with minimal metadata (e.g. see the Montague Institute RSS news feed) to the complex with a great variety of metadata (e.g. see this group of sample records from the National Library of Medicine).
How to capture and store PDF metadata in a repository
PDF metadata from third party electronic feeds is typically imported into an internal metadata repository (database) on a daily or weekly basis. Typically, some type of automated conversion is necessary to map the incoming XML metadata to fields in your metadata repository. Each vendor may require a different conversion program or XML style sheet.
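A conversion of this kind can be quite small. The following sketch uses only the Python standard library. The feed layout, the tag names in `FIELD_MAP`, and the `records` table are invented for illustration, since real vendor feeds and repositories will differ.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical vendor feed; real feeds differ from vendor to vendor.
FEED = """<feed>
  <item>
    <headline>Storage Market Outlook</headline>
    <byline>A. Analyst</byline>
    <pubdate>2006-08-01</pubdate>
    <link>http://example.com/reports/storage.pdf</link>
  </item>
</feed>"""

# Per-vendor mapping: feed tag -> repository column.
FIELD_MAP = {"headline": "title", "byline": "author",
             "pubdate": "pub_date", "link": "url"}


def import_feed(xml_text, conn):
    """Insert one repository row per <item>; return the number processed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(title TEXT, author TEXT, pub_date TEXT, url TEXT PRIMARY KEY)")
    count = 0
    for item in ET.fromstring(xml_text).iter("item"):
        row = {col: item.findtext(tag, default="")
               for tag, col in FIELD_MAP.items()}
        # Keying on the URL means a repeated import updates the row
        # instead of duplicating it.
        conn.execute(
            "INSERT OR REPLACE INTO records (title, author, pub_date, url) "
            "VALUES (:title, :author, :pub_date, :url)", row)
        count += 1
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")
print(import_feed(FEED, conn))
print(conn.execute("SELECT title, url FROM records").fetchall())
```

Keeping the mapping in a table such as `FIELD_MAP` means that supporting another vendor is mostly a matter of adding another mapping, not another program.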
Metadata repository record showing "card catalog" metadata (author, title, URL, etc.), topics (keywords), and a list of products mentioned in the article.
This record is part of the Montague Institute's internal Knowledge Base, which is also used to populate our A - Z index and provide a laboratory work area for our Web courses.
Once imported, the records can be enhanced with local data, either via automated rules or human data entry.
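As a rough illustration of the automated route, the sketch below adds local topics to an imported record whenever a trigger phrase appears in its title, and also accepts topics entered by hand. The rule table, the topic names, and the `enrich` function are hypothetical. Commercial auto-categorization tools are considerably more sophisticated.

```python
# Hypothetical local rules: a topic is added when any of its trigger
# phrases appears in the record's title.
TOPIC_RULES = {
    "Search engines": ["ultraseek", "search engine"],
    "Metadata": ["metadata", "dublin core"],
}


def enrich(record, manual_topics=()):
    """Return a copy of the record with rule-derived and hand-entered topics."""
    text = record.get("title", "").lower()
    topics = {topic for topic, phrases in TOPIC_RULES.items()
              if any(phrase in text for phrase in phrases)}
    topics.update(manual_topics)
    return {**record, "topics": sorted(topics)}


imported = {"title": "Tuning Ultraseek with Dublin Core metadata"}
print(enrich(imported, manual_topics=["Taxonomy"])["topics"])
```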
Bottom line
Metadata practices, not only for PDF but also for other electronic formats, should be tailored to your content and organization.
Responses
We received two responses to this question.
1. [Government agency] We have been using the Adobe Acrobat PDF Document Properties for metadata since 1998. With more recent versions of Acrobat, there are embedded Dublin Core fields, so content creators can be in compliance with our metadata standard by filling out those fields and applying our thesaurus terms, which are based on the Legislative Indexing Vocabulary (LIV).
Our primary motivation for using metadata is to improve the performance of our enterprise search engine (Ultraseek). We can "tune" Ultraseek to give documents with Dublin Core metadata a higher relevancy rank, so they show up higher in the search results. Metadata also ensures that documents are placed into the correct category on our portal Web site.
Acrobat 6.0 and subsequent versions have Dublin Core fields built into the Document Properties. To see them, you can go to Document Properties > Description > Additional Metadata > Advanced. Dublin Core is one of the schemas. There is also a tab for custom fields.
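[Editor's illustration] For readers who want to see what such fields look like outside Acrobat, here is a minimal sketch. It assumes the Dublin Core values are stored as an XMP packet in RDF/XML form and that the packet has already been extracted from the file (extraction is not shown). The sample packet is hand-written, and the `dublin_core` function is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Hand-written sample packet for illustration, not taken from a real file.
XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Annual Report</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>Joe Jones</rdf:li></rdf:Seq></dc:creator>
   <dc:subject><rdf:Bag><rdf:li>Budgets</rdf:li><rdf:li>Taxation</rdf:li></rdf:Bag></dc:subject>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def dublin_core(xmp_text):
    """Collect each Dublin Core element's values as a list of strings."""
    fields = {}
    root = ET.fromstring(xmp_text)
    for description in root.iter(f"{{{RDF}}}Description"):
        for child in description:
            if child.tag.startswith(f"{{{DC}}}"):
                name = child.tag.split("}", 1)[1]
                fields[name] = [li.text for li in child.iter(f"{{{RDF}}}li")]
    return fields


print(dublin_core(XMP))
```

In this sketch the thesaurus terms would be the values of the subject element, one list item per term.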
The only usability test on LIV terms that I'm aware of was done several years ago by a search engine expert who is also a librarian. An excerpt from the report is given below:
The benefits of adding dc.subject tags, utilizing a controlled vocabulary, are clear. From internal studies, the staff have found it takes an average 5-7 minutes per page to add metadata. This time assumes that the person applying metadata has established familiarity with our metatagging software, our version of the LIV, and the controlled vocabulary. The average pay rate of indexers is $17/hour. Assuming a 75% production rate, the cost comes out to between $1.12 and $1.50 per web document.
Not all documents on a Web site need to or should have metadata. The goal of metadata is to get users to the most relevant documents on a specific topic. Often that topic is found on an index.html page, with links to allow further exploration of that topic. In this case, the metadata belongs on the index page.
With periodicals, the goal is to get the user to the front page of the periodical, which may be the latest issue or a page listing previous issues. [NOTE 1] Then, as each new issue is added, the metadata remains constant on the opening page or the template page, and needn't be added to each subsequent issue.
Using these rules of thumb, we have added metadata to approximately the top 50% of the Web pages spidered on our Web search site. This makes the addition of metadata much more cost-effective than if each page had to have metadata.
I use LIV to help with search terms as well as matching rules in the Ultraseek Content Classification Engine. I always encourage agency web page creators to look at their search logs and use both terms from the thesaurus as well as highly-used search phrases that are synonyms. It seems to work pretty well for our dual purposes.
I am also using the Quick Links feature in Ultraseek more for that purpose. For example, the legislature pages are linked to certain phrases, and the hits show up above the search results. For example, if a user types "statutes" into the search box, links to the main Statutes page as well as the Bills page show above other search results.
Ultraseek Quicklinks. Quicklinks are URLs selected by an editor as the most relevant place to start looking for information on a specific keyword.
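[Editor's illustration] The general idea behind editor-selected links can be sketched as a lookup table consulted before the ordinary results are shown. This is not Ultraseek's implementation: the `QUICK_LINKS` table, the URLs, and the `search_with_quick_links` function are hypothetical.

```python
# Hypothetical editor-maintained table: search phrase -> preferred pages.
QUICK_LINKS = {
    "statutes": ["/legislature/statutes/", "/legislature/bills/"],
}


def search_with_quick_links(query, ordinary_search):
    """Place editor-selected links ahead of the ordinary search results."""
    featured = QUICK_LINKS.get(query.strip().lower(), [])
    organic = [url for url in ordinary_search(query) if url not in featured]
    return featured + organic


def fake_search(query):
    """Stand-in for a real search engine call."""
    return ["/news/2006/statutes-update.html", "/legislature/bills/"]


print(search_with_quick_links("Statutes", fake_search))
```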
2. [Software company]
Our internal market research portal contains thousands of PDF documents from vendors such as IDC, Forrester, and Gartner. For the most part, these documents are produced externally and sent to us already in PDF form. We add metadata directly into the PDF (title, author, short summary) using Adobe Acrobat. We can also add metadata via a homegrown Content Management System (CMS).
With both methods we use topics from our extensive taxonomy which currently resides in Ultraseek. Topics cover document type (research report, newsletter, presentation, etc), company business unit, research source, and many content/business areas. Metadata is added by an information professional who selects the documents from research vendors and uploads them each morning to the portal using the CMS. Keywords are not extensively used at this point but could be in the future. Often we use basic metadata provided by the vendor (e.g. author, title) but add topics from our own taxonomy.
Our Web portal uses Ultraseek to leverage both the full-text indexing of the PDFs and the added metadata from the CMS, displaying the key metadata about the documents (title, summary, research source, author, date published and number of pages) in the search results.
Metadata is dynamically pulled from our database of PDF documents using a custom front end to Ultraseek. Users can either browse the research collection by topic or search the full text of each document. The system automatically sends out email alerts informing subscribers of new documents based on the topics associated with each document in the CMS.
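[Editor's illustration] Topic-based alerting of this kind reduces to matching a new document's topics against each subscriber's list. The sketch below shows only that matching step. The `SUBSCRIPTIONS` table, the addresses, and the `recipients_for` function are hypothetical, and sending the email is left out.

```python
# Hypothetical subscription table: subscriber address -> topics followed.
SUBSCRIPTIONS = {
    "ann@example.com": {"Storage", "Security"},
    "bob@example.com": {"Networking"},
}


def recipients_for(doc_topics, subscriptions=SUBSCRIPTIONS):
    """Return subscribers who follow at least one of the document's topics."""
    wanted = set(doc_topics)
    return sorted(address for address, topics in subscriptions.items()
                  if topics & wanted)


print(recipients_for(["Security", "Research report"]))
```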
In addition to using metadata to enhance the search experience, we are also interested in using Adobe Policy Server [NOTE 2] to control access to PDF documents for employees. When employees leave the company, the Policy Server would prevent them from accessing any PDF documents that they take with them.
NOTES:
1. This makes sense for large organizations with many periodicals. For smaller organizations or departments that have only one or two periodicals, it makes more sense to enter metadata for each article, as we do for the Montague Institute Review and the Knowledge Base Editor's Digest.
2. Adobe's LiveCycle Policy Server is a J2EE (Java 2 Enterprise Edition) application that runs on most widely used platforms, including Windows, Linux and the Macintosh. Pricing for the server starts at $50,000 per CPU (see "Adobe Policy Server Sets PDF Access Rights").