A Review of Key PDF and Font Concepts
Missing PDF Fonts

This article is the first of two that will look at some of the challenges faced by developers and non-developers who work with PDF technologies and who are curious about what causes fonts in a PDF to render incorrectly or even go missing. Specifically, these documents provide an overview of some of the problems associated with missing font information in PDFs.

Part 1 presents the Portable Document Format as well as industry terms and concepts related to that format. The problem of missing font information will also be introduced. Part 2 will expand on those terms and concepts and explore some of the common scenarios in which PDFs are missing some or all of their font information.

Brief Overview of PDF
The Portable Document Format was originally conceived in 1991 as the Camelot Project, by Adobe's co-founder Dr. John Warnock. Inspired by the device independence of PostScript, Dr. Warnock wanted to develop a technology that could accurately display and print electronic documents across different operating systems, hardware, or applications. His answer was the PDF.

Unlike its predecessor, PostScript, PDF was first and foremost a file format and not a programming language. Although PDF evolved from PostScript, the primary difference is that PostScript is a full programming language and PDF is not. PDF does not contain programming constructs such as loops, conditional branching, or variables. Rather, PDF was envisioned to go further than PostScript by being able to describe how pages behave and what type of information a document could contain. Years later, PDF would encompass complex features such as search capabilities, audio, and even video.

On July 1, 2008, PDF became an open standard published by ISO as ISO 32000-1:2008.

PDF Structure
PDFs are essentially collections of data objects organized in a hierarchical manner that describe how one or more pages in a document must be displayed. These data objects can describe a page, a resource, other objects, a sequence of operating instructions, and so on. Furthermore, a data object can reference other objects and be referenced by other objects (i.e., an object can be a parent object and a child object at the same time).

PDF documents contain four main types of objects that define their structure:

  • The document catalog object
  • Page objects
  • Page content objects
  • Document and page resources

Document Object
The document object typically contains a cross-reference table and page objects. It can also contain elements such as document information, named destinations, thumbnails, and bookmarks.

Page Objects
Page objects can contain one or more content objects as well as several other types of elements such as page cropping information, hyperlinks, article threads, file annotations, form fields, digital signatures, and child pages in the document. Page objects also contain references to all the resources used by a page.

Content Objects
Content objects contain marking operators (i.e., drawings) and use resources such as fonts, images or color spaces that are needed to fully render the page.

Resource Objects
PDF defines a number of resource objects such as fonts, images, color spaces, and patterns. Fonts are needed to render text, color spaces represent colors used in the document, patterns define how backgrounds are painted, etc.

PDF Organization
PDFs are sectioned into four separate areas:

  • Header
  • Body
  • Cross-reference table (xref)
  • Trailer

The Header
The header contains a comment that identifies the file as a PDF document and the version of the specification to which it adheres. For example, the following comment indicates that the document conforms to version 1.7.

%PDF-1.7
%ÿÿÿÿ
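
As a rough illustration, the version can be read back from the header with a few lines of Python. This is a minimal sketch rather than a full parser: the function name parse_pdf_version is made up for this example, and it assumes the header comment sits at the very start of the data.

```python
import re

def parse_pdf_version(data: bytes):
    """Return (major, minor) from a '%PDF-M.m' header comment, or None."""
    match = re.match(rb"%PDF-(\d)\.(\d)", data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# The two header lines shown above, as raw bytes.
sample = b"%PDF-1.7\n%\xff\xff\xff\xff\n"
print(parse_pdf_version(sample))  # (1, 7)
```

The second comment line, made up of bytes with values above 127, is conventionally included so that file transfer tools treat the document as binary rather than text.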

The Body
The body of a PDF is where the content objects in the document are located. These objects include text streams, image data, fonts, annotations, and so on (see Listing 1). The body can also contain numerous types of invisible (non-display) objects that help implement the document's interactivity, security features, or logical structure. Each object has three essential components: a numerical identifier, a fixed position (also known as an offset), and its content.
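
To make the three components concrete, the following Python sketch scans a tiny, hand-written body for objects and records each one's identifier, offset, and content. The helper list_objects and the sample bytes are assumptions made for this illustration. The scan is deliberately naive: a real parser relies on the cross-reference table, because stream data inside an object may itself contain bytes that look like keywords.

```python
import re

# "<number> <generation> obj ... endobj" marks one object in the body.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)

def list_objects(data: bytes):
    """Return identifier, byte offset, and raw content for each object found."""
    return [
        {
            "number": int(m.group(1)),
            "generation": int(m.group(2)),
            "offset": m.start(),          # position of the object in the file
            "content": m.group(3).strip(),
        }
        for m in OBJ_RE.finditer(data)
    ]

sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
)
for obj in list_objects(sample):
    print(obj["number"], obj["offset"], obj["content"])
```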

The Cross-Reference Table
The cross-reference table (see Listing 2) lists the locations of all the objects in a PDF document. The table is divided into subsections, each of which begins with the number of its first object and a count of the entries that follow. With the cross-reference table, a PDF parser can look up the offset of any object and jump directly to it without having to read the entire file.
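
The lookup can be sketched in Python as follows. The sketch assumes a classic, plain-text cross-reference section in which each subsection starts with a first object number and an entry count, and each entry holds a ten-digit offset, a generation number, and an "n" (in use) or "f" (free) flag. The function name parse_xref and the sample offsets are made up for this illustration, and cross-reference streams are not handled.

```python
def parse_xref(text: str):
    """Map object number -> (offset, generation, in_use) for a classic xref section."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("not a cross-reference section")
    table = {}
    i = 1
    while i < len(lines):
        parts = lines[i].split()
        if len(parts) != 2:      # e.g. the 'trailer' keyword ends the section
            break
        first, count = int(parts[0]), int(parts[1])
        i += 1
        for n in range(count):
            offset, generation, flag = lines[i + n].split()
            table[first + n] = (int(offset), int(generation), flag == "n")
        i += count
    return table

sample = """xref
0 3
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
"""
table = parse_xref(sample)
print(table[1])  # (9, 0, True)
```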

The Trailer
Even though the trailer is technically the end of a PDF document, it is the first entry point that applications use to access the essential components of a PDF. The trailer contains pointers that parsers and applications use to locate the cross-reference table and other important objects in a PDF.

Examples of important objects include the root object (that identifies the beginning of a page tree) and info objects (that contain vital metadata) (see Listing 3).
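
Because the trailer is the entry point, a reader typically starts at the end of the file. The Python sketch below looks for the startxref keyword, which in a classic PDF file is followed by the byte offset of the cross-reference table. The function name find_startxref, the 1024-byte search window, and the sample offset are assumptions made for this illustration.

```python
def find_startxref(data: bytes) -> int:
    """Return the cross-reference offset stored after the last 'startxref'."""
    tail = data[-1024:]                      # the keyword sits near the end of the file
    pos = tail.rfind(b"startxref")
    if pos == -1:
        raise ValueError("startxref keyword not found")
    # The first whitespace-separated token after the keyword is the offset.
    return int(tail[pos + len(b"startxref"):].split()[0])

sample = (
    b"trailer\n<< /Size 3 /Root 1 0 R /Info 2 0 R >>\n"
    b"startxref\n116\n%%EOF\n"
)
print(find_startxref(sample))  # 116
```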

Terms and Concepts
Before outlining the challenge of missing fonts in PDFs, it's important to review some of the underlying concepts and technologies that will be used throughout both parts of this series.

Glyphs and Characters
Norman Walsh defines a glyph as: "the actual shape (bit pattern, outline) of a character image." For example, an italic "a" and a roman "a" are two different glyphs representing the same underlying character. In this strict sense, any two images that differ in shape constitute different glyphs. Consequently, glyphs are organized into different types of fonts. By contrast a character is an abstract symbol that is given shape through a glyph's design.

Character Codes
A character code is a number associated with a specific character. For example, a character with the character code "37" displays a different glyph depending on its typeface (e.g., Calibri, Arial, Webdings, etc.).

At the most basic level, an application that renders PDF documents only needs to access the character codes, the font information, and the mapping from the character code to the font information. With this information the rendering application extracts the key graphical data to draw a glyph on an output device such as a screen or printer.
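
The mapping step can be pictured with a small Python sketch. It uses character code 97 rather than 37, because the result is well known: in a Latin text font that code selects the glyph "a", while in the Symbol font it selects a Greek alpha. The table FONT_ENCODINGS and the function glyph_for are made up for this illustration and cover only two codes per font.

```python
# Per-font tables from character code to glyph name (tiny, illustrative subsets).
FONT_ENCODINGS = {
    "Helvetica": {97: "a", 98: "b"},
    "Symbol": {97: "alpha", 98: "beta"},
}

def glyph_for(font_name: str, char_code: int) -> str:
    """Return the glyph name a font assigns to a character code."""
    # '.notdef' is the usual name for the fallback glyph of an unmapped code.
    return FONT_ENCODINGS[font_name].get(char_code, ".notdef")

print(glyph_for("Helvetica", 97))  # a
print(glyph_for("Symbol", 97))     # alpha
```

If the font information is missing, this lookup is the step that fails: the character code is still in the document, but there is no table left to say which shape it stands for.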

Fonts and Typefaces
Although the difference between fonts and typefaces may seem trivial to some, confusion still lingers within some development circles, where the term font families is commonly used when referring to typefaces. This is why it is important to clarify some of the upcoming terms.

A font is a comprehensive group of characters with a specific style of type. It includes the letter and number set, special characters, and diacritical marks (accents). Furthermore, a font specifies the member of a type family such as roman, boldface, or italic type. Within the context of PDF software development, a font is a PDF object commonly referred to as a font object (see Listing 4), font dictionary, or font data file. A font object contains a set of glyphs, characters, or symbols (such as those in Wingdings). The font object also identifies the font program and contains additional information such as its properties.

By contrast, a typeface specifies a consistent visual appearance or style that can be a "family" or a related set of fonts. Arial, Tahoma, or Helvetica are examples of typefaces. A typeface can contain a series of fonts. For example, a typeface such as Helvetica may include roman, bold, and italic fonts.

Font Technologies: Laying the Foundation
From their inception in the mid-1980s, font technologies have helped jump start the desktop publishing revolution and have enabled the written word to cross over to digital typesetting mediums.

Standards expanded, new font technologies emerged, and within a few years, the world of PDF had become more complex. Not only did those who developed PDF viewers and converters have to adapt to the emerging trends within the PDF industry, but they also had to support the rising demands for different languages.

Asian languages presented PDF developers with new challenges as the existing font technologies could no longer sufficiently answer increasing font complexities. These new challenges helped push font technologies and developers forward.

Outline Fonts
Although digital fonts are generally grouped into three format types (bitmap, stroke, and outline, also called vector), this article will focus on outline fonts. Unlike bitmap fonts, which are collections of raster images of glyphs, outline fonts (also known as scalable fonts) are collections of vector images. This means that outline fonts describe glyphs using points that are interpreted as lines and curves.

The advantage of using vector images is that they can be scaled to varying sizes without a visible loss of detail. By contrast, bitmap fonts lose their detailed edges and often appear jagged or choppy when resized (see Figure 1).

Hinting: When Scalability Isn't Enough
Even though outline fonts are scalable, there are many instances in which proper rasterization can be compromised. For example, different applications, output devices, or printers can affect rasterization. To address this problem, hinting technologies were developed. Hinting is additional mathematical information added to a font to ensure it retains its visual integrity when rasterized under various conditions.

Type 1 (PostScript)
Created by Adobe Systems, Type 1 (PostScript) fonts were developed to answer the demands of the laser printing technologies that were emerging at the time. Using a subset of the PostScript language, Type 1 fonts contain an organized collection of procedures to describe glyph forms.

In addition, Type 1 fonts describe glyph outlines using cubic Bézier curves. Type 1 fonts were also the first to include proprietary hinting technology to improve their display capabilities.
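
For readers who have not worked with them, a cubic Bézier curve is defined by four control points, and any point on the curve can be computed from the standard Bernstein form. The following Python sketch evaluates one such curve; the function name cubic_bezier and the control points are chosen only for illustration and do not come from a real font.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    # Bernstein weights for the four control points.
    w0, w1, w2, w3 = u**3, 3 * u**2 * t, 3 * u * t**2, t**3
    x = w0 * p0[0] + w1 * p1[0] + w2 * p2[0] + w3 * p3[0]
    y = w0 * p0[1] + w1 * p1[1] + w2 * p2[1] + w3 * p3[1]
    return (x, y)

# One arch-shaped segment: starts at (0, 0), ends at (1, 0).
points = ((0, 0), (0, 1), (1, 1), (1, 0))
print(cubic_bezier(*points, 0.0))  # (0.0, 0.0)
print(cubic_bezier(*points, 0.5))  # (0.5, 0.75)
print(cubic_bezier(*points, 1.0))  # (1.0, 0.0)
```

Because the curve is defined by coordinates rather than pixels, scaling a glyph only means scaling its control points, which is why outline fonts resize cleanly.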

Type 1 fonts store information in two files. One file contains the character outlines (referred to as printer fonts) and the other contains the character information to display on screen.

Type 3
Type 3 fonts are essentially the same as Type 1 fonts except that they don't include hinting technology. While Type 1 fonts only use a subset of the PostScript language, Type 3 fonts encompass most of the PostScript language. This makes Type 3 fonts capable of displaying more elaborate designs and ligatures than Type 1 fonts. However, the additional PostScript code in Type 3 fonts also makes their files larger, so they take up more memory. Because Type 3 fonts use bit-mapped technology instead of hinting, they often produce poorer display results when they are scaled.

TrueType
Developed by Apple Computer, TrueType fonts are similar to Type 1 fonts but differ in some important ways. Like Type 1, TrueType uses Bézier curves to describe glyph outlines; however, TrueType curves are quadratic rather than cubic. Another difference is that TrueType contains both the screen and printer font data in a single file. In addition, hinting information is stored inside the font file. This additional information makes TrueType fonts larger than their original PostScript rivals. Finally, whereas Type 1 files are composed of a subset of the PostScript language, TrueType font files are composed of structured tables. Each table contains the necessary information that applications or PDF viewers need to use and display a font. Tables also contain information to ensure that glyphs are displayed correctly when there are different types of internal encodings used in a document.
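
The table-based layout is easy to inspect. A TrueType file opens with a short header followed by a directory in which every record gives a four-character table tag, a checksum, an offset, and a length. The Python sketch below reads that directory from a synthetic byte string; the function name read_table_directory is made up for this example, and the offsets and lengths are invented rather than taken from a real font.

```python
import struct

def read_table_directory(font: bytes):
    """Return (version, {tag: (offset, length)}) from a font's table directory."""
    # Header: uint32 version, uint16 table count, then three uint16
    # binary-search helper fields that this sketch skips.
    version, num_tables = struct.unpack(">IH", font[:6])
    tables = {}
    for i in range(num_tables):
        start = 12 + 16 * i  # 12-byte header, 16 bytes per directory record
        tag, _checksum, offset, length = struct.unpack(">4sIII", font[start:start + 16])
        tables[tag.decode("ascii")] = (offset, length)
    return version, tables

# Synthetic directory with two records; offsets and lengths are made up.
header = struct.pack(">IHHHH", 0x00010000, 2, 32, 1, 0)
records = (
    struct.pack(">4sIII", b"cmap", 0, 44, 100)
    + struct.pack(">4sIII", b"glyf", 0, 144, 500)
)
version, tables = read_table_directory(header + records)
print(hex(version), tables)
```

A check of this kind is one way to notice that a font program lacks a table an application expects, such as the character-to-glyph mapping table (cmap).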

OpenType
OpenType fonts bring together some of Type 1 and TrueType technologies into one cross-platform format. OpenType's character encoding is based on Unicode and, as a result, can support up to 65,536 glyphs. OpenType offers more development flexibility especially when working with Asian character sets and more sophisticated Roman glyphs that may use non-lining numerals, small caps, fractions, ligatures, and swashes. Like TrueType, an OpenType font contains all of its outline, metric, and bitmap information in a single file.

Font File Structures
In addition to their technological differences, fonts can also be categorized according to how they are structured as PDF objects. Generally, fonts can be structured as:

  • Simple fonts
  • Composite fonts

PDFs contain font objects that essentially act as wrappers for embedded font programs that contain the actual font data. Font programs can be TrueType, OpenType, Type 1, and so forth. Font objects also contain a number of properties and descriptions of the font data in order to enable PDF applications and viewers to use the font in the document.

Simple Fonts
Simple fonts use a single byte of information to represent a glyph. As a result, a maximum of 256 (2^8) different glyph representations are possible. The Simple Font category includes the original instances of Type 1 and TrueType fonts.

Composite Fonts
Because of their 256-character encoding limit, Simple Fonts could not support Asian writing systems, where a typical Japanese font can have over 7,000 Kanji, Katakana, and Hiragana characters, nor could they support non-horizontal writing.

The solution was the development of Composite Fonts (or CID fonts). Unlike Simple Fonts, Composite Fonts are multi-byte and can thus contain an arbitrary number of glyphs. As a result, Composite Fonts are able to support a wider range of glyphs.
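
The difference between the two categories shows up in how the bytes of a text string are split into character codes. The Python sketch below reads the same four bytes first as single-byte codes and then as two-byte codes. The function names are made up for this illustration, and the fixed two-byte width is an assumption that matches simple cases only; in general the code width used with a Composite Font is defined by its CMap and may vary.

```python
def single_byte_codes(data: bytes):
    """Simple font: every byte is one character code (0-255)."""
    return list(data)

def two_byte_codes(data: bytes):
    """Composite font, assuming fixed two-byte big-endian codes."""
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

raw = b"\x30\x42\x30\x44"
print(single_byte_codes(raw))  # [48, 66, 48, 68]
print(two_byte_codes(raw))     # [12354, 12356]
```

Read one byte at a time, no code can exceed 255; read two bytes at a time, the same data addresses a space of 65,536 codes.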

Composite Font technologies enable developers to use any number of base fonts and create new composite fonts. Composite font technologies also enable developers to include two sets of character spacing details (metrics) in fonts. One metric can be used for horizontal writing mode and another for vertical writing mode.

Aside from their ability to handle complex glyphs, Composite Fonts are also flexible and expandable.

CMap File
A CMap is an ASCII text file that contains the PostScript language instructions required to map character codes to CID codes used by Composite Fonts. For example, after a character code is processed (from a keyboard input), the CMap file maps the character code to a corresponding Character Identifier number (CID). The CID code is then passed on to the Composite Font, which will in turn generate the appropriate glyph. As we shall see in Part 2, CMap files can also be missing and impact proper PDF processing.
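
A CMap typically expresses its mapping as ranges of consecutive character codes, each tied to a starting CID. The Python sketch below mimics that lookup with an in-memory list. The ranges, the CID values, and the function name lookup_cid are hypothetical and are not taken from any published CMap file.

```python
# Each entry: (first code, last code, CID assigned to the first code).
# These ranges are hypothetical and only illustrate the lookup.
CID_RANGES = [
    (0x0020, 0x007E, 1),     # a run of Latin character codes
    (0x3041, 0x3093, 200),   # a run of Hiragana character codes
]

def lookup_cid(code: int, ranges=CID_RANGES) -> int:
    """Map a character code to a CID; return 0 when no range covers it."""
    for first, last, first_cid in ranges:
        if first <= code <= last:
            return first_cid + (code - first)
    return 0  # CID 0 is conventionally the 'not defined' glyph

print(lookup_cid(0x0041))  # 34
print(lookup_cid(0x3042))  # 201
print(lookup_cid(0xFFFF))  # 0
```

When the CMap is missing, the ranges above are simply unavailable, so a viewer has character codes but no way to turn them into CIDs.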

Font Embedding
To display, print, or process a PDF accurately, it must contain the necessary font information. If font information is missing, recipients may not be able to display or edit the document properly or, worse, applications may not be able to process the PDF at all.

Embedding fonts in a PDF ensures that they display and print exactly from one system to another as the author intended. The following sections will look at how fonts are embedded in PDFs and introduce the upcoming subject matter for Part 2.

Full Font Embedding
The first method of embedding fonts is full font embedding. Full font embedding effectively makes the font part of the PDF, thereby preventing font substitution when recipients need to display or print the document. In other words, recipients do not need to have the same fonts installed to view or edit the document. This method is advisable in situations in which modifications to the PDF are expected.

Full font embedding can also potentially help avoid some of the problems associated with missing system fonts and ensure optimal viewing regardless of the system and platform. In an ideal PDF world, fully embedding all fonts would reduce many development woes.

The main drawbacks to full font embedding are file size and licensing issues. Every embedded font makes the document larger, and the increase is especially problematic with Chinese, Japanese, or Korean (CJK) fonts. In fact, CJK fonts are rarely fully embedded due to their large character sets. Also, fully embedded fonts can be extracted and used outside of the PDF file. This extraction opens the door to unlimited font distribution, which can violate the licensing policy of the font manufacturer. The solution then is to partially embed fonts in a document.

Partial Font Embedding (Subsetting Fonts)
Unlike full font embedding, subsetting a font only embeds the glyph definitions for the characters used (i.e., that are displayed in the PDF).
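
Conceptually, subsetting is a filter: collect the characters that actually appear in the document and keep only their glyph definitions. The Python sketch below shows the idea on a toy font held in a dictionary. The names subset_font and toy_font and the placeholder outline strings are invented for this illustration; real subsetting tools also rewrite the font's internal tables, which is not shown here.

```python
def subset_font(glyphs: dict, text: str) -> dict:
    """Keep only the glyph definitions for characters used in text."""
    used = set(text)
    return {ch: outline for ch, outline in glyphs.items() if ch in used}

# Toy "font": every character maps to a placeholder outline description.
toy_font = {ch: f"outline-of-{ch}" for ch in "abcdefghijklmnopqrstuvwxyz"}

subset = subset_font(toy_font, "pdf fonts")
print(sorted(subset))               # ['d', 'f', 'n', 'o', 'p', 's', 't']
print(len(toy_font), len(subset))   # 26 7
```

The subset is enough to display the original text, but any character outside it (for example, a letter typed during editing) has no glyph definition to draw from.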

There are three main reasons you should subset fonts. First, as previously stated, PDFs are primarily for content exchange and viewing. PDF is not an ideal editing format, despite the popularity of PDF editing programs available on the Internet, and it is generally assumed (rightfully or wrongfully) by the PDF's creator that the recipient will not modify the document's contents. As we shall see in the following document, editing a PDF is not always a straightforward affair.

Second, subsetting fonts reduces document size. For example, the size of the font "Arial Unicode MS" is nearly 20MB; however, subsetting this font to show 10 Kanji characters would instead only add approximately 25KB to the PDF. In cases where CJK fonts are used, fully embedding all fonts would result in very large files.

Third, subsetting fonts avoids licensing issues because the font then becomes unusable for purposes other than rendering the document, a use that font licensors often permit. The drawback with partially embedded fonts is that if recipients do not have the fonts on their system, they will not be able to edit the document or will be very limited in their ability to edit text. This is where the problem of missing fonts begins to emerge.

When Fonts Go Missing
Now that some of the key PDF and font concepts have been reviewed, the different problems that can occur when font information is missing can be addressed.

Part 2 will explore how problems associated with missing font information can start right at the source, with the creation of the PDF document itself. These problems include full and partial font embedding, incomplete font information in TrueType fonts, and missing CMap files.


About Franc Gagnon
Franc Gagnon is a senior technical writer at Amyuni Technologies Inc., providing developers with concise documentation on the subject of PDF.
