Monkeybread Software - DynaPDF Manual

DynaPDF Manual - Page 485

Function Reference

Text objects use a separate coordinate system, which is represented by the text transformation matrix
tm. We call this coordinate system text space. All text properties, such as font size and text width,
are calculated in text space. The PDF format also supports several text positioning operators that
reduce the size of a text object. To make the function easier to use, DynaPDF already includes all
text positioning operators in the text transformation matrix tm.

To calculate the text position, the text coordinate system must be transformed to user space by
multiplying the text matrix by the current transformation matrix cm. The combined matrix must be
recalculated each time GetPageText() returns a new text object.
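
The concatenation described above can be sketched as follows. This is a minimal illustration in Python, not DynaPDF code: the matrices are plain tuples (a, b, c, d, x, y), and the helper names concat and transform are hypothetical. It assumes the usual PDF row-vector convention, in which the text matrix is applied first and the current transformation matrix second.

```python
# Hypothetical sketch: matrices are plain tuples (a, b, c, d, x, y),
# not DynaPDF structures.

def concat(tm, cm):
    """Return tm x cm: apply tm first, then cm (PDF row-vector convention)."""
    a1, b1, c1, d1, x1, y1 = tm
    a2, b2, c2, d2, x2, y2 = cm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        x1 * a2 + y1 * c2 + x2,
        x1 * b2 + y1 * d2 + y2,
    )

def transform(m, x, y):
    """Map the point (x, y) through the matrix m."""
    a, b, c, d, tx, ty = m
    return (a * x + c * y + tx, b * x + d * y + ty)

# Example values chosen for illustration only.
tm = (1.0, 0.0, 0.0, 1.0, 10.0, 20.0)   # text matrix: translation by (10, 20)
cm = (2.0, 0.0, 0.0, 2.0, 5.0, 5.0)     # CTM: scale by 2, then translate by (5, 5)

text_to_user = concat(tm, cm)
origin = transform(text_to_user, 0.0, 0.0)  # text-space origin in user space
```

Because the combined matrix depends on both tm and cm, it has to be rebuilt for every text object rather than cached across calls.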

As mentioned earlier, a content stream is not organized into text lines, and the order in which text
objects occur is essentially arbitrary. A text record can occur in two different formats: as an array
or as one coherent text string. The array form enables the definition of kerning between characters
in a compact format, since PDF viewers ignore any kerning information available in a font resource.
The strings in a kerning array always lie on the same text line.

The kerning array is also often used to emulate space characters because word spacing does not
work with CID fonts. Most PDF drivers use the same algorithm to format text of single-byte and
multi-byte fonts; that is why space characters are very often emulated with kerning space.
However, it is quite easy to determine whether a space character is emulated at a given position: if
the displacement is larger than half the space width, we can assume that a space character was
emulated at this position. Half the space width should be used because the fonts of documents that
emulate space characters with kerning space often contain no space character. DynaPDF sets a
default space width in this case, which can be too large if a condensed font is used.
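
The half-space-width rule can be sketched like this. The data layout is an assumption made for the example: each kerning array entry is a pair (gap, text), where gap is the rightward displacement before the string, given as a positive number in the same units as the space width. The function name join_kerning_array is hypothetical and not part of DynaPDF.

```python
# Hypothetical sketch: entries are (gap, text) pairs, where gap is the
# rightward displacement before the string, in the units of space_width.

def join_kerning_array(items, space_width):
    """Rebuild a text line, inserting a space wherever the gap exceeds
    half the space width."""
    threshold = space_width / 2.0
    parts = []
    for gap, text in items:
        # A gap before the very first string is not a word boundary.
        if parts and gap > threshold:
            parts.append(" ")
        parts.append(text)
    return "".join(parts)

# Example values chosen for illustration only.
record = [(0.0, "The"), (2.8, "fox"), (2.9, "ea"), (0.2, "ts")]
line = join_kerning_array(record, space_width=3.0)
```

Small gaps such as the 0.2 between "ea" and "ts" are treated as ordinary kerning, while the larger gaps become word boundaries.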

However, the array form is just one possible way to enable kerning between characters. For
several reasons the array form is sometimes not used. Many PDF drivers update the text position
with text positioning operators instead. This technique not only produces much larger content
streams, it also splits text records into separate ones. This complicates the identification of word
boundaries considerably because each record is returned in a separate GetPageText() call. We now
need the coordinates to determine whether the text must be assigned to the same line. If the text is
not rotated, this is not a big deal, but if the coordinate system is rotated or contains other
transformations, some further math is required to determine whether a text record must be assigned
to the current line.
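
One possible way to do this math is sketched below. This is an assumption, not the method prescribed by DynaPDF: each record is represented by its combined text-to-user-space matrix as a tuple (a, b, c, d, x, y), the baseline direction is taken from (a, b), and a record is treated as part of the current line if the baselines are parallel and the perpendicular distance between them is within a tolerance. The names and the tolerance value are hypothetical.

```python
import math

# Hypothetical sketch: each matrix is the combined text-to-user-space
# matrix of one record, as a tuple (a, b, c, d, x, y).

def baseline_direction(m):
    """Unit vector along the baseline, taken from the matrix x-axis (a, b)."""
    a, b = m[0], m[1]
    length = math.hypot(a, b)
    if length == 0.0:
        raise ValueError("degenerate matrix: baseline has zero length")
    return (a / length, b / length)

def on_same_line(m_prev, m_next, tolerance):
    """True if m_next starts on the baseline defined by m_prev."""
    ux, uy = baseline_direction(m_prev)
    vx, vy = baseline_direction(m_next)
    # The two baselines must point in (almost) the same direction.
    if ux * vx + uy * vy < 0.999:
        return False
    dx = m_next[4] - m_prev[4]
    dy = m_next[5] - m_prev[5]
    # Perpendicular distance of the new origin from the previous baseline.
    distance = abs(dx * uy - dy * ux)
    return distance <= tolerance

# Example values chosen for illustration only: text rotated by 90 degrees.
prev_rec = (0.0, 1.0, -1.0, 0.0, 200.0, 100.0)
next_rec = (0.0, 1.0, -1.0, 0.0, 200.0, 150.0)
same = on_same_line(prev_rec, next_rec, tolerance=2.0)
```

For unrotated text the test reduces to comparing the y coordinates; the perpendicular distance gives the same answer for rotated or scaled coordinate systems. A sensible tolerance depends on the font size and would have to be chosen by the caller.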

We now want to take a look at a PDF content stream to see how an arbitrary text can be stored in a
PDF file. The following text can be stored in many different ways, and it is important to understand
that many variants are possible and exist in real PDF files.

The rendered result of the string "The fox eats the lazy mouse." looks quite normal:

The fox eats the lazy mouse.

However, a PDF driver does not necessarily store this text in one record; there are many possible
variants:

%This is the easiest variant, one record contains the entire text line.