PDF Text Extraction Based on Columns & Set Header/Footer for Whole PDF

sherazam · Jan 5, 2015

What's new in this release?

The latest version of Aspose.Pdf for .NET (9.9.0) has been released. It provides some great new features and lets developers manipulate PDF documents with more ease. If you have a multi-column PDF document and need to extract the page contents while preserving the layout, Aspose.Pdf for .NET is the right choice for this requirement. One approach is to reduce the font size of the contents inside the PDF document and then perform text extraction.

This release also introduces several improvements in TextAbsorber and in the internal text formatting mechanism. During text extraction in "Pure" mode, you may now specify the ScaleFactor option, which offers another way to extract text from a multi-column PDF document besides the approach stated above. The scale factor adjusts the grid used by the internal text formatting mechanism during text extraction:

- Values between 1 and 0.1 (including 0.1) have the same effect as reducing the font size.
- Values between 0.1 and -0.1 are treated as zero, which makes the algorithm calculate the required scale factor automatically. The calculation is based on the average glyph width of the most popular font on the page, but we cannot guarantee that no string in one column of the extracted text reaches the start of the next column.
- If ScaleFactor is not specified, the default value of 1.0 is used, which means no scaling is carried out.
- If the specified value is more than 10 or less than -0.1, the default value of 1.0 is used.

Please note that there is no direct correspondence between the new ScaleFactor and the old coefficient of manual font reduction. However, by default the algorithm takes into account a font size that has already been reduced for internal reasons.
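
The ScaleFactor ranges described above can be summarized as a small decision function. This is only a sketch in Python of the rules as stated in this announcement, not Aspose.Pdf code: the function name interpret_scale_factor and the returned labels are hypothetical, and values above 1 and up to 10 are marked "unspecified" because the announcement does not say how they are handled.

```python
from typing import Optional, Tuple

DEFAULT_SCALE = 1.0  # default stated in the announcement: no scaling


def interpret_scale_factor(value: Optional[float] = None) -> Tuple[str, Optional[float]]:
    """Classify a requested scale factor by the rules quoted in the text."""
    if value is None:
        # Not specified: the default of 1.0 is used.
        return ("default", DEFAULT_SCALE)
    if value > 10 or value < -0.1:
        # Out of range: falls back to the default of 1.0.
        return ("default", DEFAULT_SCALE)
    if -0.1 <= value < 0.1:
        # Treated as zero: the scale factor is calculated automatically.
        return ("auto", None)
    if 0.1 <= value <= 1:
        # Same effect as reducing the font size.
        return ("scale", value)
    # 1 < value <= 10: behaviour is not described in the announcement.
    return ("unspecified", value)


if __name__ == "__main__":
    for requested in (None, 0.5, 0.1, 0, -0.5, 11):
        print(requested, "->", interpret_scale_factor(requested))
```

In this sketch, 0.5 is passed through as a real scaling value, 0 selects automatic calculation, and both -0.5 and 11 fall back to the default of 1.0.
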
For example, reducing the font size from 10 to 7 has the same effect as setting ScaleFactor to 5/8 (= 0.625).

Different types of compression can be applied to images to reduce their size. The type of compression to apply depends on the ColorSpace of the source image: if the image is Color (RGB), apply JPEG2000 compression, and if it is Black & White, apply JBIG2/JBIG2000 compression. Identifying each image type and using the appropriate compression therefore produces the best optimized output. A code snippet in the documentation shows how to identify whether the images inside a PDF file are Colored or Black & White.

Header and Footer are very important elements of a PDF document. They are used to show important information related to the document, such as the Document Title, company logo, Confidential Notice, page count etc. When creating a PDF document, we can add a Header/Footer element for each page. However, for optimized performance, another approach is to first create the PDF document with all required elements, create a Header/Footer instance, iterate through all PDF pages and add the newly created Header/Footer object to each page of the document. The steps are to create a Table object, add sample information to the table, create a Header/Footer object, add the table to the paragraphs collection of the Footer object, and then set the Footer object as the footer for each page of the document.

The page borders are path drawing operations. Therefore the PDF to HTML processing logic just performs the drawing instructions and places the background behind the text. So, to repeat the logic, you have to process the contents operators manually and draw the graphics yourself. Please also note that code written this way might not work accurately for all PDF files; if you encounter any issue, please feel free to contact us.

The Document class has an OptimizeResources(..) method, which takes a Document.OptimizationOptions object to optimize the size of the input document.
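
The size optimization discussed in this part of the notes comes down to storing identical resources once and sharing them. The sketch below is a library-independent illustration in Python, not Aspose.Pdf code: the function name merge_equal_streams and the use of a SHA-256 digest to detect equal streams are assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Tuple


def merge_equal_streams(streams: List[bytes]) -> Tuple[List[bytes], List[int]]:
    """Keep one copy of each distinct stream.

    Returns the list of unique streams and, for every input stream,
    the index of the shared copy it should refer to.
    """
    unique: List[bytes] = []
    seen: Dict[bytes, int] = {}  # digest -> position in `unique`
    refs: List[int] = []
    for data in streams:
        digest = hashlib.sha256(data).digest()
        if digest not in seen:
            seen[digest] = len(unique)
            unique.append(data)
        refs.append(seen[digest])
    return unique, refs


if __name__ == "__main__":
    page_resources = [b"font-A", b"logo", b"font-A", b"logo", b"font-B"]
    unique, refs = merge_equal_streams(page_resources)
    print(len(page_resources), "streams in,", len(unique), "stored; refs:", refs)
```

With five input streams of which two are repeated, only three are stored and each reference points at the shared copy. The extra hashing pass is also a reminder of why this kind of merging can cost execution time and memory.
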
The Document class also has a property named OptimizeSize, which gets or sets the optimization flag. When pages are added to a document and this flag is set, equal resource streams in the resultant file are merged into one PDF object. This decreases the resultant file size but may cause slower execution and larger memory requirements. The default value is false. When this option is turned on, newly added pages are scanned and, if duplicate resources are found, they are "merged" with existing resources (provided they are the same). However, we recently observed that this works with stream objects only (i.e. the contents of pictures or font files). Therefore we started to investigate the possibility of optimizing dictionary objects, which would allow shared font dictionaries to be used. Some customers have recently reported that they experienced serious size expansion issues. Therefore, in this new release, the optimization is improved to merge both streams and dictionaries of the resources (fonts, images etc). In addition, the OptimizationOptions.AllowPageReuse property is added to enable/disable page merging.

We investigated the enhancement requested earlier to set printer driver settings and, as per our observations, the printer driver settings are very specific to a particular printer. The .NET Framework provides extra printing features in WPF (Windows Presentation Foundation), but Aspose.Pdf.dll cannot use it and I am afraid we do not have any plans to introduce this dependency in the near future.

In addition to the enhancements and features discussed above, there have been numerous fixes related to HTML to PDF conversion, PDF to Excel conversion, XPS to PDF conversion, PDF to TIFF conversion, conversion of PDF to PDF/A compliant documents, text replacement, rendering PDF files to XPS, creating TOCs in PDF files, and printing PDFs with embedded fonts. The important new and improved features are listed below:

- Extract text based on columns
- Identify if image in PDF is Colored or Black & White
- Extract RawFormat from XImages from PDFDocument.Pages.Resources.Images
- Setting Header/Footer once for new PDF document
- Integrate Imaging into Aspose.Pdf
- Extract table borders as image
- Improve on-fly resources optimization
- How to set printer driver setting
- XSL-FO to PDF - Font Name issue
- When trying to concatenate PDF files, the application hangs
- PDF to Image - Arrow of line Annotation is missing in resultant file
- PDF to HTML - Transparent text loses its transparency after conversion
- PDF to HTML: Incorrect output HTML-missing images
- Border appearance changes with zoom factor
- Text replace changes font face to Times New Roman
- TextFragmentAbsorber issue: Font of text is changed in output PDF file
- PDF Table cells rowspan not working, when page breaks
- Incorrect text generated
- Merging the pages of same PDF increases the resultant file size
- Old generator does not throw IOException on a locked file
- TextFragmentAbsorber getting incorrect text position and occurrences
- TextFragmentAbsorber behaving abnormally
- PDF to PNG: output image is too small
- OptimizeResources(..) method is not reducing file size
- Issue while adding Text to an existing PDF document
- PDF to HTML - Hyperlink is removed in resultant file
- Image object is returning incorrect resolution value
- When adding Image to table cell and setting Image height, application hangs
- PDF to HTML conversion throws OutOfMemoryException
- Text is printing with thicker lines
- Tables appearance issue
- Added footer overflowing the document
- PDF to JPEG - page contents are distorted
- TimeZone is removed from ModifyDate field
- OptimizeResources() method makes PDF unreadable
- PDF to TIFF - Euro sign mangled in output file
- PDF to JPEG - Characters are incomplete on resultant image
- Filling form with AutoFiller loses the JavaScript
- Image count issue
- Error when Image field is flattened
- PDF to PDFA: compliance failure
- TOC expanding beyond one page results in incorrect hyperlinks
- Checkbox field types don't always get checked in PDF
- Problem concatenating PDF files using PdfFileEditor
- PDF to JPEG: White rectangle instead of image's part
- Text replace increases space between letters
- TextFragmentAbsorber behaving abnormally
- PDF to XLS: Two columns merged into a single column

Other most recent bug fixes are also included in this release.

Newly added documentation pages and articles

Some new tips and articles have been added to the Aspose.Pdf for .NET documentation to guide you briefly through different tasks with Aspose.Pdf, such as the following.

- Identify if image inside PDF is Colored or Black & White
- Add Image to Existing PDF File
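
As a library-independent illustration of the first topic, the sketch below classifies an image from its pixel values and picks a compression type following the guidance given earlier in this announcement (JPEG2000 for color, JBIG2 for black and white). It is written in Python and is not the Aspose.Pdf snippet from the documentation: the names classify_image and suggest_compression are hypothetical, pixels are assumed to be (R, G, B) tuples, and any image whose three channels are always equal (including grayscale) is treated as black and white for simplicity.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255

# Mapping taken from the guidance in the announcement.
COMPRESSION_BY_KIND = {"color": "JPEG2000", "black_and_white": "JBIG2"}


def classify_image(pixels: Iterable[Pixel]) -> str:
    """Return "color" if any pixel has differing channels, else "black_and_white"."""
    for r, g, b in pixels:
        if not (r == g == b):
            return "color"
    return "black_and_white"


def suggest_compression(kind: str) -> str:
    """Look up the compression type suggested for an image kind."""
    return COMPRESSION_BY_KIND[kind]


if __name__ == "__main__":
    scan = [(0, 0, 0), (255, 255, 255)]
    photo = [(0, 0, 0), (200, 30, 30)]
    for name, pixels in (("scan", scan), ("photo", photo)):
        kind = classify_image(pixels)
        print(name, kind, suggest_compression(kind))
```
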

Overview: Aspose.Pdf for .NET

Aspose.Pdf is a .NET PDF component for the creation and manipulation of PDF documents without using Adobe Acrobat. It can create PDFs through its API, XML templates and XSL-FO files. It supports form field creation, PDF compression options, table creation and manipulation, graph objects, extensive hyperlink functionality, extended security controls, custom font handling, adding or removing bookmarks, TOCs, attachments and annotations, importing or exporting PDF form data, and much more. It can also convert HTML, XSL-FO and MS Word to PDF.

More about Aspose.Pdf for .NET

- Homepage of Aspose.Pdf for .NET C#
- Online Demo for Aspose.Pdf for .NET
- Download Aspose.Pdf for .NET
- Read online documentation of Aspose.Pdf for .NET

Contact Information
Aspose Pty Ltd, Suite 163,
79 Longueville Road
Lane Cove, NSW, 2066
Australia
Aspose - Your File Format Experts
sales@aspose.com
Phone: 888.277.6734
Fax: 866.810.9465
