
Common eBook Formats

Kenneth M Brooks, Jr


NOTE: This article was written in 2001. Since then there have been a number of further developments in eBook formats:

  • Introduction of the Sony Reader.
  • The disappearance of Gemstar and the purchase of its assets by ETI.
  • The dramatic strides in format consolidation by the IDPF (formerly the OEBF).
  • The introduction of Adobe Digital Editions, replacing the Adobe eBook Reader.
  • The acquisition of Mobipocket by Amazon.com and the popularization of that format.
  • The acquisition of Palm Digital Media by Motricity and its subsequent rebranding as eReader.com.
  • The acquisition of VitalSource Technologies by Ingram Digital.

Publishers currently distributing and selling eBooks have their books available in the following formats:

  • Palm Digital Media (for use on PalmOS and PocketPC handhelds)
  • Adobe eBook Reader (for use on PCs)
  • Microsoft Reader (for use on PCs and PocketPC handhelds)
  • Gemstar (for use on dedicated Gemstar and RCA devices)
  • OeB (for use in Mobipocket reading software on handhelds and the Franklin eBookman)

Palm Reader

Palm Digital Media has reader software, the Palm Reader, that runs on both PalmOS devices and Windows CE / Pocket PC devices. The Reader renders content not only in its own native format, PML (short for Peanut Markup Language), but also in the open Palm Doc format, making it practical as a tool for personal documents as well. Publishers should note, however, that content must be in PML in order to take advantage of Palm's content protection capability. This protection is a simple yet effective method based on people's unwillingness to hand out their credit card information.

Although Palm content is currently distributed only through a single site (www.peanutpress.com), this has not prevented Palm owners from finding and increasingly purchasing titles there. Current speculation in the industry is that sales of titles in Palm Digital Media formats exceed those of all the other formats put together, most likely due to the combined installed base, title availability and the reading experience on a PDA versus a desktop or laptop PC.

PML (Peanut Markup Language) is reminiscent of simple HTML. Both HTML and RTF (Rich Text Format) serve as reasonable inputs to PML conversion.

Adobe eBook Reader (AER)

The Adobe Acrobat eBook Reader (AER), formerly the Glassbook Reader, features page numbers that match the pages of the original book, a linked Table of Contents, and indices that match the original book, and it can be viewed both in single and double page mode. In fact its key advantage is its use of PDF and the advantages of that format: the page stays as the publisher designed it with design, artwork and typography intact.

If the publisher allows it, the AER will also read the book aloud to the consumer in a synthesized voice. It offers standard features such as bookmarks, annotation support, dictionary support, text search capabilities, and sub-pixel rendering to improve resolution on laptop screens.

Virtually every book printed in the last 5-10 years has existed in either PostScript or PDF at some point in its life. Because the AER is based on PDF, it is both easy and convenient for publishers to make content available for sale as an AER edition. Further, Adobe has introduced a low, fixed-price business model for the Adobe content server, which should help the use of this format continue to expand.

Converting a PDF into the AER format is relatively straightforward. The AER requires an attached cover image, pagination that matches the original folios in the book, a linked TOC and a minimal set of title metadata. Other linking and enhancements can be added as required by the publisher.
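
The requirements listed above can be read as a pre-flight checklist. The following Python sketch is only an illustration: the AerEdition record, its field names, the REQUIRED_METADATA tuple and the missing_requirements helper are all hypothetical and do not correspond to any Adobe tool or API. The article does not say which metadata fields are required, so "title" and "author" are assumed here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AerEdition:
    """Hypothetical record describing a PDF being prepared as an AER edition."""
    has_cover_image: bool = False
    folios_match_print: bool = False  # on-screen page numbers match the printed folios
    has_linked_toc: bool = False
    metadata: Dict[str, str] = field(default_factory=dict)


# Assumed minimal metadata set; the article only says "a minimal set of title metadata".
REQUIRED_METADATA = ("title", "author")


def missing_requirements(edition: AerEdition) -> List[str]:
    """Return the requirements the edition does not yet meet, in checklist order."""
    problems = []
    if not edition.has_cover_image:
        problems.append("cover image")
    if not edition.folios_match_print:
        problems.append("pagination matching original folios")
    if not edition.has_linked_toc:
        problems.append("linked table of contents")
    for key in REQUIRED_METADATA:
        if not edition.metadata.get(key):
            problems.append(f"metadata: {key}")
    return problems
```

An empty list from missing_requirements means the file meets the four basic requirements; anything beyond that (extra linking, enhancements) remains at the publisher's discretion.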

Microsoft Reader (MSR)

Microsoft introduced highly innovative and user-friendly software, the Microsoft Reader, based on extensive research into reading habits. Features include a glossary, annotation tools, easy navigation, sub-pixel rendering (ClearType) to improve resolution, and a plug-in for MS Word that allows individuals to publish their own material. The Microsoft Reader is available for PCs and for PDAs (handhelds) running PocketPC 2002. PDAs running earlier versions of the PocketPC OS cannot implement Microsoft's highest level of security, so publishers have prevented most content from being available on those older devices.

Microsoft has demonstrated an aggressive development schedule, both to increase the ability of MS Reader to handle complex layouts and typography and to make it easy for consumers to publish in Reader format on their own.

The file format used by the Microsoft Reader is called a "lit" file, after its file extension (.lit), and is generated from an OeB file.

Gemstar

Gemstar acquired NuvoMedia and SoftBook Press and so inherited the Rocket eBook and the SoftBook Reader, two of the first entries into the current eBook market. Gemstar has since introduced upgrades to both through a licensing arrangement with Thomson (RCA): the ReB 1100 and the ReB 1200, respectively (ReB stands for RCA eBook). These are dedicated reading devices with a strong focus on content security and user friendliness.

Because of the proliferation of devices, publishers must contend with four separate formats if they want their titles accessible to all of the devices. Most publishers, and indeed Gemstar itself, focus on the ReB 1100 and 1200 as the current technology, and it is rumored that Gemstar will be introducing an upgrade to its professional publishing software that will seamlessly generate both of those formats from the same OeB source file.

Rocket eBook and ReB 1100 utilize a subset of HTML 4.0. The SoftBook Press Reader and the ReB 1200 implement many of the features of HTML 4.0 and utilize the Gemstar Professional Publisher Software to take either an HTML or OEB input to generate the output format.

Other key formats

There are three key formats through which content usually moves and from which the end-editions above are generated. For example, the Microsoft Reader format is generated from OeB and the Adobe Acrobat eBook Reader format is generated from PDF. Although not strictly for eBook distribution, the following can be end formats in themselves. These intermediate formats are:

  • HTML
  • PDF
  • OeB
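
The relationships between source formats and end editions described in this article can be summarized as a small lookup table. The Python sketch below is illustrative only: the SOURCES dictionary and the editions_from helper are hypothetical names, and the mapping records only what the article itself states about each edition's inputs.

```python
# Source formats each end edition is generated from, as described in this article.
SOURCES = {
    "Palm Reader (PML)": {"HTML", "RTF"},
    "Adobe eBook Reader": {"PDF"},
    "Microsoft Reader (.lit)": {"OeB"},
    "Gemstar": {"HTML", "OeB"},
    "Mobipocket": {"OeB"},
}


def editions_from(available):
    """Return, in alphabetical order, the end editions producible from the formats on hand."""
    return sorted(name for name, inputs in SOURCES.items() if inputs & set(available))
```

For example, a publisher holding only OeB files could call editions_from({"OeB"}) to list the editions reachable from that archive.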

HTML

Hypertext Markup Language (HTML) is the format upon which most eBook and on-line formats other than PDF are based. It is the language of the World Wide Web and is read by every browser. Further, many tools exist for creating it and many people are familiar with it. It is also the most common format for distribution of extended bibliographic information such as first chapters, marketing blurbs, author biographies, quotes, etc. With the increasing availability of "always-on" Internet access, HTML will continue to be the common language of electronic publishing for many years to come.

HTML can be prepared using Cascading Style Sheets (CSS) that represent styles and typography separately from the HTML coding itself. This offers the ability to switch formats and design of the text simply by changing a style sheet rather than reworking the entire file. It is always a good idea to start the eBook conversion process with clean, CSS-based HTML.
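
As a minimal sketch of this separation, the Python snippet below renders the same content twice, changing only the linked style sheet. The class names, the render helper and the file names handheld.css and desktop.css are hypothetical; the point is simply that the markup refers to styles by name and never embeds typography itself.

```python
from html import escape
from string import Template

# One content template, written once; it refers to styles only by class name.
PAGE = Template("""<html>
<head>
<title>$title</title>
<link rel="stylesheet" type="text/css" href="$stylesheet">
</head>
<body>
<h1 class="chapter-title">$title</h1>
<p class="first-para">$text</p>
</body>
</html>
""")


def render(title, text, stylesheet):
    """Return the page with the given (hypothetical) style sheet linked in."""
    return PAGE.substitute(title=escape(title), text=escape(text), stylesheet=stylesheet)


# Same content, two designs: only the style sheet reference differs.
handheld = render("Chapter 1", "It was a dark and stormy night.", "handheld.css")
desktop = render("Chapter 1", "It was a dark and stormy night.", "desktop.css")
```

Because the two outputs differ only in the href of the link element, retargeting a title for a different device does not require touching the content file.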

PDF [1]

Portable Document Format (PDF) is a standard introduced by Adobe Systems, Inc., first as an on-line viewer for PostScript, a technology introduced to permit portability of pages across multiple printing technologies, digital and otherwise. It has undergone substantial updating over time and is the current lingua franca of the publishing and graphic arts industries. It is the basis not only of many traditional printing workflows, but also of print-on-demand applications and some eBook formats. Since most books exist in PDF at some stage of their lifecycle as they are prepared for print, conversion to PDF-based formats is relatively straightforward.

The key advantage of PDF-based formats is that the page is rendered on a device exactly as the publisher or designer envisioned it. This, while an advantage, has also been a problem, limiting the format's usefulness on devices with screens of different sizes. Adobe has recognized this and with PDF 1.4 introduced both the ability to insert tags into the PDF and to reflow paged text to screens of different sizes and shapes. The introduction of PDF readers for the PalmOS and PocketPC OS is an example of where this drive for cross-platform portability is heading. Adobe has also recognized that it is important to be able to export to HTML and various flavors of XML, and has introduced a number of plug-ins to make this possible.

Beyond the ability of PDF to output to various devices, formats and configurations, it is also becoming increasingly easy to generate well-designed pages from HTML and XML. With the introduction of Adobe InDesign 2.0, import and export of these formats are becoming accessible and painless.

OEB

The Open eBook (OeB) publication structure was originally developed as a joint effort between NuvoMedia, SoftBook Press and Microsoft to establish a single standard format that publishers could use to import content into the various reading devices and applications. It has since evolved into an industry standard, administered by the Open eBook Forum, in which many publishers, retailers and technology companies participate.

The OEB 1.0 and 1.1 publication structure consists of an XML package file (called the OEB Package File, or OPF file) and a series of HTML files that contain the content of the title. While useful in this configuration, the standard allows considerable latitude in what is "valid" OEB and prevents the true cross-platform utilization of the format. This is a problem affecting the validity of archives and future usefulness. Upcoming releases of OEB are likely to address this by increasing the XML "intensity" of the standard, moving away from the current orientation of CSS-based HTML.
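
To make the two-part structure concrete, the Python sketch below builds a simplified package file that lists a set of HTML content files and their reading order. It is an illustration under stated assumptions, not a validated OEB package: the element and attribute names (package, metadata, manifest, item, spine, itemref) and the media type follow the general shape of OEB 1.x package files and should be checked against the specification, the metadata block is reduced to a bare title, and the build_package helper and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_package(title, content_files):
    """Build a simplified, OPF-style package listing the HTML content files."""
    package = ET.Element("package", {"unique-identifier": "bookid"})

    # Simplified metadata: a real package file carries a fuller metadata set.
    metadata = ET.SubElement(package, "metadata")
    ET.SubElement(metadata, "title").text = title

    manifest = ET.SubElement(package, "manifest")  # every file in the publication
    spine = ET.SubElement(package, "spine")        # linear reading order
    for number, href in enumerate(content_files, start=1):
        item_id = f"item{number}"
        ET.SubElement(
            manifest,
            "item",
            # Media type is an assumption; confirm against the OEB specification.
            {"id": item_id, "href": href, "media-type": "text/x-oeb1-document"},
        )
        ET.SubElement(spine, "itemref", {"idref": item_id})

    return ET.tostring(package, encoding="unicode")
```

Calling build_package("Sample Title", ["chapter1.html", "chapter2.html"]) returns the XML as a string; the HTML files themselves hold the content, while the package file only describes them.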

Although established as an interchange format to make content portable across devices, OEB is becoming an output format as well. In addition to the many publishers that are using OEB as one of their archival formats, the Microsoft Reader, Mobipocket (and through Mobipocket, Franklin) and Gemstar pull OEB directly into their packaging and publishing software. There are efforts underway by a number of companies, as well, to create OEB-smart browsers that will directly interpret the OEB into an on-line reading experience.

Conclusion

The current world of eBook, POD, and on-line publishing requires files that are in a variety of formats, which at this stage in the industry's development are all generating revenue. These formats continue to evolve, becoming increasingly useful for both publishers and users. Because it is still too early to tell which formats will succeed and fail, publishers should ensure that their titles are available in all of the popular formats to increase their overall sales. Alert publishers will also maintain archives in HTML and/or OEB (with meaningful tags), PDF, as well as application files (InDesign, Quark) if available.



[1] Note that there are three common variants of PDF (PDF Normal, PDF Image Only and PDF Image + Text), only one of which, PDF Normal, gives file sizes appropriate for download over dial-up connections. PDF Normal either has been or can be produced from most page layout applications (Quark, for example) simply by installing Adobe Acrobat and selecting "Print to PDF".

PDF Image Only consists of scanned images captured within a PDF wrapper. This format is generally used for print-on-demand (POD) applications, but is not terribly appropriate for eBook use because of the large file sizes involved. It also has a limitation in that the text of the book is not searchable—the "text" consists only of digital pictures of book pages. It was to address this limitation that PDF Image + Text was introduced.

PDF Image + Text is created using Adobe Acrobat Capture (available as a free Acrobat plug-in for low-volume applications). It uses OCR to recreate the text contained in the page images in a hidden layer behind the images themselves. This text, although not of high enough quality to display by itself, is good enough to give reasonable search results. The text can also be copied and pasted from the document into various applications for cleanup or other uses. PDF Image + Text is best used where searching is required and file size is not an issue, such as on an intranet or CD-ROM.
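
The distinction between the three variants comes down to two questions: are the pages scanned images, and is there a text layer behind them? The Python sketch below restates that decision; the classify_pdf helper and its two flags are hypothetical and do not inspect an actual PDF file.

```python
def classify_pdf(pages_are_scans, has_text_layer):
    """Name the PDF variant from two observations about its pages (hypothetical helper)."""
    if pages_are_scans and has_text_layer:
        return "PDF Image + Text"  # scanned pages with hidden OCR text behind them
    if pages_are_scans:
        return "PDF Image Only"    # scanned pages only; text is not searchable
    return "PDF Normal"            # produced directly from the page layout application
```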