Monkeybread Software - DynaPDF Manual

Function Reference

fntTranslateRawString2() in C/C++). For this kind of algorithm, the TShowTextArrayW callback function should be used because it provides everything required to develop fast text extraction algorithms. The example projects text_extraction and text_coordinates demonstrate how text extraction algorithms can be developed.

• Text search algorithms could use the TShowTextArrayW callback function too, but the usage is much more complicated if strings of CID fonts must be processed. CID fonts support encodings with variable code lengths of one to four bytes per character. Because the string width cannot be computed from the translated Unicode string, the function must be able to find the corresponding position in the source string. This is not easy, especially if the search text is stored in multiple text records.

To simplify the development of text search algorithms, the content parser provides the TShowTextArrayA callback function, which returns the raw source strings. The conversion to Unicode can be done with TranslateRawCode() (the name is fntTranslateRawCode() in C/C++). The function converts one character code (a sequence of source bytes) to Unicode and calculates the width of that character. The advantage is that the exact position of every character in a string can be easily calculated independent of the current font type. The overhead of calling the function once per character is small because the function is heavily optimized for speed. The example text_search demonstrates how a text search algorithm can be developed.
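The per-character approach described above can be sketched as follows. This is a minimal illustration, not DynaPDF code: translate_code() is a hypothetical stand-in for a per-code translation routine such as fntTranslateRawCode(), whose real signature is not reproduced here. The toy encoding (one or two bytes per code) and the fixed widths are assumptions made only so the example runs on its own. Font size, character spacing, and the text matrix are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-code translation routine (NOT the DynaPDF
   API). Toy encoding: a lead byte below 0x80 is a one-byte code, anything
   else is a two-byte code. A truncated two-byte code at the end of the
   string is treated as a one-byte code. Widths are made-up constants.
   Returns the number of source bytes consumed, or 0 if len is 0. */
static size_t translate_code(const uint8_t *src, size_t len,
                             uint32_t *unicode, double *width)
{
    if (len == 0) return 0;
    if (src[0] < 0x80 || len < 2) {
        *unicode = src[0];
        *width = 0.5;
        return 1;
    }
    *unicode = ((uint32_t)(src[0] & 0x7F) << 8) | src[1];
    *width = 1.0;
    return 2;
}

/* One decoded character with its position in both coordinate systems. */
typedef struct {
    uint32_t unicode;  /* decoded character */
    double   x;        /* offset of the character's left edge */
    size_t   src_pos;  /* byte offset of its code in the raw string */
} CharPos;

/* Decodes a raw string one code at a time and records where every
   character starts. Returns the number of characters written to out. */
static size_t decode_positions(const uint8_t *raw, size_t raw_len,
                               double start_x, CharPos *out, size_t max_out)
{
    assert(raw != NULL && out != NULL);
    size_t n = 0, pos = 0;
    double x = start_x;
    while (pos < raw_len && n < max_out) {
        uint32_t u;
        double w;
        size_t used = translate_code(raw + pos, raw_len - pos, &u, &w);
        if (used == 0) break;
        out[n].unicode = u;
        out[n].x = x;
        out[n].src_pos = pos;
        n++;
        x += w;        /* advance by the width of this character */
        pos += used;   /* advance by the byte length of this code */
    }
    return n;
}

/* Naive search in the decoded characters. Returns the index of the first
   match in chars, or -1 if the search text does not occur. */
static long find_text(const CharPos *chars, size_t n,
                      const uint32_t *needle, size_t m)
{
    if (m == 0 || m > n) return -1;
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m && chars[i + j].unicode == needle[j]) j++;
        if (j == m) return (long)i;
    }
    return -1;
}
```

Because each entry stores both the x offset and the source byte offset, a match found in the Unicode characters maps directly back to a position in the raw string, regardless of how many bytes each code occupies.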

Using the Content Parser

The content parser can be used to extract text, vector graphics, and images from a PDF file. The following sections describe which callback functions must be set, what must be stored in the graphics state, as well as other important aspects.

Note that DynaPDF is delivered with several example projects which demonstrate how the content parser can be used. Before developing your own code, take a look at the examples text_extraction, text_search, or image_extraction.

Text Extraction or Text Search Algorithms

The following callback functions should be set to process PDF text:

TBeginTemplate

TEndTemplate

// Optional

TMulMatrix

TSetCharSpacing

TSetFont

TSetLeading

// Optional

TRestoreGraphicState

TSaveGraphicState

TSetFillColor

// Optional

TSetStrokeColor

// Optional

TSetTextDrawMode
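Most of the callbacks listed above only update values that the application keeps in its own graphics state. A minimal sketch of such a state with a save/restore stack is shown below. All type, field, and function names are hypothetical and chosen for illustration. The actual DynaPDF callback signatures are not reproduced here, and the set of fields an application needs may differ.

```c
#include <assert.h>
#include <stddef.h>

#define GS_STACK_MAX 64

/* Values a text extraction algorithm typically tracks (assumed set). */
typedef struct {
    double      ctm[6];        /* transformation matrix: a b c d x y */
    const void *font;          /* opaque handle of the active font */
    double      font_size;
    double      char_spacing;
    double      leading;
    int         draw_mode;
} GState;

typedef struct {
    GState current;
    GState stack[GS_STACK_MAX];
    size_t depth;
} GStateStack;

/* Resets the state: identity matrix, no font, all other values zero. */
static void gs_init(GStateStack *s)
{
    static const GState initial = { {1.0, 0.0, 0.0, 1.0, 0.0, 0.0},
                                    NULL, 0.0, 0.0, 0.0, 0 };
    s->current = initial;
    s->depth = 0;
}

/* Pushes a copy of the current state. Returns 0 if the stack is full. */
static int gs_save(GStateStack *s)
{
    if (s->depth >= GS_STACK_MAX) return 0;
    s->stack[s->depth++] = s->current;
    return 1;
}

/* Pops the most recently saved state. Returns 0 if nothing was saved. */
static int gs_restore(GStateStack *s)
{
    if (s->depth == 0) return 0;
    s->current = s->stack[--s->depth];
    return 1;
}

/* Concatenates m with the current matrix: ctm = m x ctm. */
static void gs_mul_matrix(GStateStack *s, const double m[6])
{
    const double *c = s->current.ctm;
    double r[6];
    r[0] = m[0] * c[0] + m[1] * c[2];
    r[1] = m[0] * c[1] + m[1] * c[3];
    r[2] = m[2] * c[0] + m[3] * c[2];
    r[3] = m[2] * c[1] + m[3] * c[3];
    r[4] = m[4] * c[0] + m[5] * c[2] + c[4];
    r[5] = m[4] * c[1] + m[5] * c[3] + c[5];
    for (size_t i = 0; i < 6; i++) s->current.ctm[i] = r[i];
}
```

In this sketch, handlers registered for TSaveGraphicState and TRestoreGraphicState would call gs_save() and gs_restore(), a TMulMatrix handler would call gs_mul_matrix(), and the remaining setters would write to the matching field of current.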
