DynaPDF Manual - Page 471
replaced in the middle or at the end of a kerning array. To make text replacement
easier, an arbitrary number of kerning records can be preserved from
deletion. The value of DeleteKerningAt is the first array index
that should be deleted; all kerning records above this index are
deleted too. See the demo examples/edit_text, which is delivered
with DynaPDF, for how this member can be used.
The font flags describe important characteristics of the current font:
• 0x00001 // Fixed pitch font
• 0x00002 // Serif style
• 0x00004 // Symbol font
• 0x00008 // Script style
• 0x00020 // Non-symbolic font
• 0x00040 // Italic style
• 0x40000 // Force Bold (Type1 fonts only)
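The flags form a bit mask, so each characteristic is tested with a bitwise AND. The following is a minimal sketch of that test. It assumes the flag value has already been read into an integer. The bit values are the ones listed above, while the enum constants and the function describe_font_flags() are hypothetical names chosen for illustration and are not part of the DynaPDF API.

```c
#include <stddef.h>
#include <stdio.h>

/* Bit values as listed above. The constant names are hypothetical. */
enum {
    FF_FIXED_PITCH  = 0x00001,
    FF_SERIF        = 0x00002,
    FF_SYMBOLIC     = 0x00004,
    FF_SCRIPT       = 0x00008,
    FF_NON_SYMBOLIC = 0x00020,
    FF_ITALIC       = 0x00040,
    FF_FORCE_BOLD   = 0x40000
};

/* Writes the names of all set flags into buf, separated by single spaces.
   Returns the number of flag names written. Names that no longer fit into
   buf are omitted. Hypothetical helper, not a DynaPDF function. */
int describe_font_flags(unsigned long flags, char *buf, size_t size)
{
    static const struct { unsigned long bit; const char *name; } table[] = {
        { FF_FIXED_PITCH,  "FixedPitch"  },
        { FF_SERIF,        "Serif"       },
        { FF_SYMBOLIC,     "Symbolic"    },
        { FF_SCRIPT,       "Script"      },
        { FF_NON_SYMBOLIC, "NonSymbolic" },
        { FF_ITALIC,       "Italic"      },
        { FF_FORCE_BOLD,   "ForceBold"   }
    };
    size_t used = 0;
    size_t i;
    int count = 0;

    if (size == 0) return 0;
    buf[0] = '\0';
    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        int n;
        if ((flags & table[i].bit) == 0) continue;   /* flag not set */
        n = snprintf(buf + used, size - used, "%s%s",
                     count ? " " : "", table[i].name);
        if (n < 0 || (size_t)n >= size - used) {
            buf[used] = '\0';   /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

With this sketch, a flag value of 0x22 yields the string "Serif NonSymbolic". A single characteristic can be tested directly with an expression such as (flags & FF_ITALIC) != 0.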
A widely used technique to reduce the amount of data that must be stored in a PDF file is the use
of non-embedded CID fonts. CID fonts, whether embedded or not, can depend on external CMaps
which must be available at runtime.
To process strings of such fonts correctly, DynaPDF must be able to load the required CMap files
when needed. DynaPDF is therefore delivered with the most important CMap files, which are provided
by Adobe Systems. These CMaps can be found in the DynaPDF installation directory at
/Resource/CMap/. Applications which extract text from PDF files should include these CMaps so
that they can be loaded at runtime.
The search path to external CMaps must be set with SetCMapDir() before GetPageText() is executed
for the first time. The function creates a CMap cache that is held in memory until the PDF instance
is deleted. The search path(s) to external CMap files should be set only once per PDF instance,
and one PDF instance should be used to process as many PDF files as possible. This can significantly
improve processing speed.
Order of Text records
GetPageText() returns each time a text showing operator is found. The returned text therefore
does not necessarily represent a text line. It can be anything from a single character to a complete
text line, depending on how the text is stored in the PDF file.
The order in which text is returned is essentially arbitrary. It depends on the file creator whether
text is stored in the logical reading order. For example, most PDF drivers convert headers and
footers first, so these strings appear at the beginning of the content stream. The remaining strings
are not necessarily ordered either, and one text line can be stored in several different text objects.
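Because one text line can be spread over several text objects, a search has to match across fragment boundaries. The following is a minimal sketch of that idea and not the algorithm used by DynaPDF. It assumes that the fragments have already been decoded into plain C strings and sorted into reading order by the caller. Spaces that are produced by positioning or kerning rather than by a space character are not handled. The function find_across_fragments() and its parameters are hypothetical.

```c
#include <stddef.h>

/* Searches needle in the virtual concatenation of count text fragments
   without building a joined copy. On success the function returns 1 and
   stores the fragment index and the offset inside that fragment at which
   the match begins. It returns 0 if there is no match or needle is empty.
   Hypothetical helper, not a DynaPDF function. */
int find_across_fragments(const char *const *frags, size_t count,
                          const char *needle,
                          size_t *start_frag, size_t *start_off)
{
    size_t i, o;

    if (needle[0] == '\0') return 0;
    for (i = 0; i < count; i++) {
        for (o = 0; frags[i][o] != '\0'; o++) {
            size_t fi = i, fo = o, k = 0;
            while (needle[k] != '\0') {
                /* Step over exhausted or empty fragments. */
                while (fi < count && frags[fi][fo] == '\0') {
                    fi++;
                    fo = 0;
                }
                if (fi == count || frags[fi][fo] != needle[k]) break;
                fo++;
                k++;
            }
            if (needle[k] == '\0') {   /* the whole needle was matched */
                *start_frag = i;
                *start_off = o;
                return 1;
            }
        }
    }
    return 0;
}
```

For the fragments "Hel", "lo W", "" and "orld", a search for "o Wor" succeeds and reports fragment 1, offset 1 as the start of the match. The naive comparison is quadratic in the worst case, which is usually acceptable for the amount of text on a single page.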
A text search or text replacement algorithm must correctly handle cases in which a word or sentence
is split across different text objects. In the worst case, GetPageText() always returns only a single