Print Advertisement Extraction Application
Business Problem - Replace the cumbersome manual effort required to extract data and images from PDF files and create hotspots with a semi-automated procedure supported by this SoftStar-developed application.
Solution – A proof of concept was created in an initial three-week analysis phase. Based on the understanding gained from this effort, a rule-based extraction process was adopted: user-specified rules are used to “understand” the unstructured data in PDF files, then automatically extract and parse it and save the results in separate database fields for subsequent access by multiple applications. The process also extracts images and creates hotspots, reducing the workload of the graphics team. To improve system performance, the client front-end application works with low-resolution PDFs. The original high-quality images are then extracted from the high-resolution PDF files on the server backend, using the image information gathered during the front-end extraction process.
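As an illustration of the rule-based approach described above, the following is a minimal sketch, not the application's actual code (which was written in C#). It assumes that each rule pairs a listing field with a text pattern and an optional formatter; the names `Rule` and `apply_rules`, the sample patterns, and the sample listing text are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical rule: which field to fill, how to find it, how to format it."""
    field: str
    pattern: str                                  # regex with one capture group
    formatter: Callable[[str], str] = str.strip   # formatting applied to the match


def apply_rules(text: str, rules: List[Rule]) -> Dict[str, str]:
    """Run every rule against the raw listing text and collect the parsed fields."""
    listing: Dict[str, str] = {}
    for rule in rules:
        match = re.search(rule.pattern, text, flags=re.IGNORECASE)
        if match:
            listing[rule.field] = rule.formatter(match.group(1))
    return listing


# Example rule set, as might be defined for one retailer (assumed patterns).
RULES = [
    Rule("brand", r"^(\w+)", str.title),
    Rule("size", r"(\d+(?:\.\d+)?\s*(?:oz|lb|ct))", str.lower),
    Rule("price", r"\$\s*(\d+(?:\.\d{2})?)"),
]

sample_text = "ACME Orange Juice 64 OZ  $2.99"
parsed = apply_rules(sample_text, RULES)
print(parsed)
```

Each parsed field would then map to its own database column. Keeping the rules as data rather than code is what lets users define them per retailer or promotion without a new release.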
Features include
- Rule Engine – Users can create rules based on text attributes, keywords and patterns for the different listing fields, and specify how the parsed text should be formatted.
- Text Extraction - Users can apply retailer- or promotion-specific rules to extract, parse and format the text and images for a listing from PDF files.
- Hotspotting – As part of the listing extraction process, users can choose to automatically create hotspots for a listing on a PDF page. These hotspots can be attached to one or more advertisement listings.
- Image Extract – Users can extract listing images from a low-resolution PDF and tie one of the images to a listing. A backend batch process then extracts the corresponding high-resolution image from the high-resolution PDF, based on the image information gathered at the front end. The application also provides a simple workflow that lets graphics users correct the images.
- System Integration - Integrated with the existing systems so that the application could be in place by the 2006 holiday season without requiring major changes to those systems or additional training on them.
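The hotspotting and image extraction features both depend on carrying a region found on the low-resolution page over to the high-resolution one. The sketch below shows one way this could work. It is not the application's actual implementation (which was written in C#), and it assumes that regions are stored as rectangles in page coordinates and that both renditions share the same layout, so a simple proportional scale suffices. `Box` and `scale_box` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Rectangle on a page: top-left corner plus size (assumed representation)."""
    x: float
    y: float
    width: float
    height: float


def scale_box(box: Box, low_size: Tuple[float, float],
              high_size: Tuple[float, float]) -> Box:
    """Map a box from low-resolution page space to high-resolution page space."""
    sx = high_size[0] / low_size[0]   # horizontal scale factor
    sy = high_size[1] / low_size[1]   # vertical scale factor
    return Box(box.x * sx, box.y * sy, box.width * sx, box.height * sy)


# Region recorded at the front end on the low-resolution page (illustrative values).
low_page = (612, 792)
high_page = (2448, 3168)          # four times larger in each dimension
hotspot = Box(100, 150, 200, 120)

high_res_region = scale_box(hotspot, low_page, high_page)
print(high_res_region)
```

In such a design the front end would only need to store the low-resolution rectangle with the listing. The batch process could then compute the high-resolution region on demand.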
Technologies - .NET, C#, SQL Server and a third-party PDF extraction library.
