Monkeybread Software - DynaPDF Manual

Function Reference

%It would be returned in one GetPageText() call as one coherent kerning record.

(The fox eats the lazy mouse.)Tj

%This version emulates the spaces with kerning space.

%It would be returned in one GetPageText() call with 6 kerning records.

[(The)-280(fox)-280(eats)-280(the)-280(lazy)-280(mouse.)]TJ

%This version uses PDF positioning operators to emulate spaces.

%It produces 6 separate GetPageText() calls.

(The)Tj

2.8 0 Td

(fox)Tj

2.8 0 Td

(eats)Tj

2.8 0 Td

(the)Tj

2.8 0 Td

(lazy)Tj

2.8 0 Td

(mouse.)Tj
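The second variant above can be turned back into readable text by treating large kerning offsets as word gaps. The following is a minimal sketch of that idea, not DynaPDF code: the function name `join_kerning_records` and the threshold of 200 are assumptions chosen for illustration. It relies on the fact that numbers in a TJ array are expressed in thousandths of a text space unit and are subtracted from the current position, so a negative value widens the gap.

```python
def join_kerning_records(items, space_threshold=200):
    """Rebuild a string from a TJ-style array of strings and kerning numbers.

    items: strings and numbers in the order they appear in the TJ array.
    space_threshold: assumed gap (in thousandths of a text space unit)
    from which a kerning offset is treated as a word space.
    """
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        # A negative TJ number moves the next glyph to the right.
        elif -item >= space_threshold and out and not out[-1].endswith(" "):
            out.append(" ")
    return "".join(out)


# The TJ array from the example above, written as a Python list.
tj = ["The", -280, "fox", -280, "eats", -280, "the", -280, "lazy", -280, "mouse."]
text = join_kerning_records(tj)
print(text)
```

A suitable threshold depends on the font: ordinary pair kerning is usually far smaller than the width of a space character, but no single value is correct for every file.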

In the worst case, each text record consists of only one character. The text can also be stored unsorted, or interleaved with other text that lies at completely different positions on the page. There is not necessarily a logical connection between what you see on screen and what is stored in the PDF file. The order of text records is often especially difficult to follow when a PDF file contains tables.
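Because the storage order is unreliable, extracted records are often re-sorted by their position on the page. The sketch below illustrates one simple way to do this under stated assumptions: `TextRecord` is a hypothetical structure (not a DynaPDF type), the text is horizontal and unrotated, and the y coordinate grows upwards as in default PDF user space. Records whose baselines differ by no more than `line_tolerance` are treated as one line.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    # Hypothetical record: position of the text origin plus the decoded text.
    x: float
    y: float
    text: str


def reading_order(records, line_tolerance=2.0):
    """Group records into lines (top to bottom) and sort each line left to right."""
    lines = []
    for rec in sorted(records, key=lambda r: -r.y):
        # Compare against the first record of the current line.
        if lines and abs(lines[-1][0].y - rec.y) <= line_tolerance:
            lines[-1].append(rec)
        else:
            lines.append([rec])
    return [" ".join(r.text for r in sorted(line, key=lambda r: r.x))
            for line in lines]


# Records in an arbitrary storage order, as they might appear in a file.
records = [
    TextRecord(120, 700, "fox"),
    TextRecord(100, 680, "lazy"),
    TextRecord(100, 700, "The"),
    TextRecord(130, 680, "mouse."),
]
ordered = reading_order(records)
print(ordered)
```

This approach is deliberately naive: multi-column layouts and tables need additional logic, since records of neighbouring columns share the same baseline.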

Possible encoding issues

If text must be extracted, deleted, or replaced, it is very important that the text in the PDF file can be converted to Unicode. This conversion is possible if the font uses a standard encoding such as WinAnsi or MacRoman, if it contains a ToUnicode CMap, if it contains PostScript character names that are listed in the Adobe Glyph List, or if it uses a predefined external CMap that is available in one of the CMap search paths (see SetCMapDir() for further information).
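The conditions listed above can be summarised as a simple check. The following sketch is only an illustration of that decision logic: the dictionary keys (`to_unicode`, `encoding`, `glyph_names_in_agl`, `external_cmap`) are hypothetical and do not correspond to any DynaPDF structure, and the order of the checks is an assumption.

```python
# Names of the two standard encodings mentioned above, as used in PDF files.
STANDARD_ENCODINGS = {"WinAnsiEncoding", "MacRomanEncoding"}


def can_convert_to_unicode(font, available_cmaps):
    """Return True if at least one of the four conditions is met.

    font: hypothetical dict describing a font resource.
    available_cmaps: names of external CMaps found in the CMap search paths.
    """
    if font.get("to_unicode"):               # font contains a ToUnicode CMap
        return True
    if font.get("encoding") in STANDARD_ENCODINGS:
        return True
    if font.get("glyph_names_in_agl"):       # glyph names listed in the AGL
        return True
    cmap = font.get("external_cmap")         # predefined external CMap
    return cmap is not None and cmap in available_cmaps


cmaps_on_disk = {"90ms-RKSJ-H"}
print(can_convert_to_unicode({"encoding": "WinAnsiEncoding"}, cmaps_on_disk))
print(can_convert_to_unicode({"external_cmap": "UniJIS-UCS2-H"}, cmaps_on_disk))
```

The second call shows the case where a font refers to a predefined CMap that is not available in any search path, so the text cannot be converted.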

Processing certain European languages such as Russian, Greek, or Czech is more complicated. A common technique for such languages is to convert the original font to a symbol font. This avoids the use of a CID font (multi-byte font), which would otherwise be needed because the PDF format supports only four predefined 8 bit encodings (WinAnsi, MacRoman, MacExpert, and Symbol). The advantage is that 8 bit strings can be stored in the PDF file, which results in a smaller file size, and the PDF file remains compatible with Acrobat versions prior to 4.0, because CID fonts are supported only since PDF 1.3.

The problem is that if the font resource contains neither a ToUnicode CMap nor PostScript character names, it is no longer possible to convert the text to Unicode. Depending on how a PDF file was created, the encoding is often not known to the PDF driver either, e.g. when converting PCL or AFP files to PDF.
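The effect can be illustrated with a small sketch. It decodes an 8 bit string through a code-to-glyph-name table, as a symbol font with a custom encoding would define it, and looks the names up in a tiny excerpt of the Adobe Glyph List. The function name, the two sample tables, and the glyph names `g1` to `g3` are hypothetical; the four list entries are taken from the Adobe Glyph List.

```python
# Small excerpt of the Adobe Glyph List (glyph name -> Unicode character).
AGL_SUBSET = {
    "Alpha": "\u0391",
    "Beta": "\u0392",
    "Gamma": "\u0393",
    "ccaron": "\u010D",
}


def decode_with_glyph_names(data, code_to_name):
    """Map each byte to Unicode via its glyph name; U+FFFD marks a failure."""
    chars = []
    for code in data:
        name = code_to_name.get(code)
        chars.append(AGL_SUBSET.get(name, "\uFFFD"))
    return "".join(chars)


# Symbol font whose custom encoding uses meaningful PostScript names.
named = {0x41: "Alpha", 0x42: "Beta", 0x47: "Gamma"}
# Same glyphs, but the PDF driver wrote arbitrary names (hypothetical).
anonymous = {0x41: "g1", 0x42: "g2", 0x47: "g3"}

print(decode_with_glyph_names(b"ABG", named))
print(decode_with_glyph_names(b"ABG", anonymous))
```

Both fonts would render identically on screen, but only the first one allows the byte string to be mapped back to Greek letters.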