Improved PDF import functionality in TX Text Control 15.1

ckrause · Dec 9, 2009

Version 15.0 of TX Text Control introduced the first PDF import filter. The idea was to make it possible to import PDF documents and process them further. That first version could extract the text from PDF files. It even recognized paragraphs, and the text frame mode was able to generate good-looking documents.

If you don't know the PDF format in detail, you might wonder why it is so tricky to load this format. To understand this, think of a PDF document as a compiled executable file, in contrast to its source code.

An RTF document is like the source code. You can open it easily in a text editor and if you understand RTF tags, you can make changes to the document.

A PDF document, on the other hand, can be compared to a true Win32 EXE, which can't be easily decompiled. To understand what the EXE does or how it is implemented, you need to reverse engineer the executable. The same is true for PDF documents: the Adobe PDF format is a low-level output format that was designed to be printed, not to be imported again. In most cases it contains only geometrical information about the individual characters. The importer has to calculate which characters belong to a sentence, a paragraph or a text frame.
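To give a feeling for what "calculating which characters belong together" means, here is a minimal sketch in Python. It is not the algorithm used by the TX Text Control import filter. The `Glyph` type, the `group_into_lines` function and the tolerance values are assumptions made up for this illustration: glyphs with a similar baseline are treated as one line, and a horizontal gap larger than a threshold is treated as a space.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned character on a page."""
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF y grows upward)
    width: float  # advance width, in points


def group_into_lines(glyphs, y_tolerance=2.0, space_gap=1.5):
    """Cluster glyphs by baseline, then rebuild each line left to right.

    y_tolerance and space_gap are illustrative values, not tuned ones.
    """
    lines = []  # each entry: [baseline_y, [glyphs on that line]]
    # Visit glyphs from the top of the page downward.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= y_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    result = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # A visible gap between two glyphs is interpreted as a space.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        result.append(text)
    return result


sample = [
    Glyph("o", 16, 680, 6),
    Glyph("H", 10, 700, 6),
    Glyph("i", 16, 700, 3),
    Glyph("n", 10, 680, 6),
    Glyph("a", 30, 700, 6),
]
print(group_into_lines(sample))
```

A real importer also has to deal with rotated text, multiple columns, kerning and different font sizes, which is where most of the difficulty lies.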

So, what has been improved?

Aside from new image formats that are now supported, the most impressive part is the grouping of created text frames. Have a look at a typical PDF page:

It consists of a heading, an introduction text and a table with 3 columns. The following screenshots show the results from version 15.0 in comparison to version 15.1:

As you can see in the first screenshot, version 15.0 inserts every text area as a separate text frame in order to reproduce the distances between the paragraphs. The problem with this is that you can't extract the text from all of these frames in one go using copy and paste. With version 15.1 (second screenshot), all text frames are grouped into one large frame, and the distances are reproduced using paragraph distances, specific line spacing and indents.
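The idea of replacing separate frames with spacing inside one text flow can be sketched as follows. This is only an illustration under simple assumptions, not the product's implementation: each text block is given as a hypothetical `(top, bottom)` pair in points, with PDF coordinates growing upward, and the vertical gap to the previous block becomes the "distance before" of the paragraph.

```python
def paragraph_spacings(blocks):
    """Turn vertical gaps between text blocks into paragraph distances.

    blocks: list of (top, bottom) y coordinates in points, with top > bottom.
    Returns one "space before" value per block, in top-to-bottom order.
    """
    ordered = sorted(blocks, key=lambda b: -b[0])  # topmost block first
    spacings = [0.0]  # the first block needs no extra distance
    for (_, prev_bottom), (cur_top, _) in zip(ordered, ordered[1:]):
        # Overlapping blocks would give a negative gap; clamp it to zero.
        spacings.append(max(0.0, prev_bottom - cur_top))
    return spacings


# Heading, introduction and table area of a page, as assumed sample data.
page_blocks = [(700.0, 680.0), (660.0, 600.0), (590.0, 500.0)]
print(paragraph_spacings(page_blocks))
```

Because the result is a single flow of paragraphs instead of a set of independent frames, the whole text can be selected and copied at once.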

The tabular data in the PDF is now reproduced using tab positions instead of separate text frames, which makes the text readable and reusable. The following illustration shows the same document in TX Text Control 15.1 with visible control characters, where the line spacing and tab positions are highlighted:
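How tab positions could be derived from column geometry is sketched below, again purely as an illustration. The function names, the `tolerance` value and the sample coordinates are assumptions, and the actual filter may work quite differently. The left edges of the cells in each row are collected, nearly identical positions are merged, and the remaining offsets from the left margin become tab stops. Each row is then a single line of tab-separated text.

```python
def tab_stops(rows, left_margin, tolerance=3.0):
    """Derive tab stop offsets (in points) from the left edges of table cells.

    rows: list of rows, each a list of cell x positions in points.
    """
    stops = []
    for row in rows:
        for x in row:
            offset = x - left_margin
            if offset <= tolerance:
                continue  # first column starts at the margin, no tab needed
            # Merge positions that differ only by small rounding errors.
            if not any(abs(offset - s) <= tolerance for s in stops):
                stops.append(offset)
    return sorted(stops)


def row_to_text(cells):
    """Join the cell texts of one table row with tab characters."""
    return "\t".join(cells)


# A table with 3 columns whose cell edges vary slightly from row to row.
cell_edges = [[72.0, 200.0, 350.0], [72.0, 201.0, 349.0]]
print(tab_stops(cell_edges, left_margin=72.0))
print(repr(row_to_text(["Name", "Quantity", "Price"])))
```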

Find out more about the new features of TX Text Control here: What's New in TX Text Control 15.1

About TX Text Control:

TX Text Control was originally released in 1991. Since then, more than 40,000 copies have been sold. Starting off as a single, small DLL, TX Text Control has made its way through 16-bit DLL and VBX versions to today's Enterprise edition with its .NET and ActiveX components. The most recent addition to the family, TX Text Control .NET Server, offers all of TX Text Control's advanced word processing functionality in an easy-to-use server-side .NET component. Customers benefit from these years of experience and a large user base, and at the same time appreciate developing with a mature, reliable product.

Contact Information:

support@textcontrol.com

North & South America:
Phone: +1 704-370-0110
Phone: +1 877-462-4772 (toll free)

Europe:
Phone: +49 421 335 910

Asia Pacific:
Phone: +886 2-2797-8508
