Extract Text from a Document


The Full Text Search feature works by extracting text (OCR) from documents and then indexing that text. Use the Extract and Index option to extract and index a document manually, for example if you have turned off automatic indexing or want to re-index the document.
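
As a rough illustration of the extract-then-index idea, the sketch below builds a simple word index over text that has already been extracted. It is not Globodox's implementation; the function names `build_index` and `search` and the sample file names and text are hypothetical.

```python
from collections import defaultdict
import re

def build_index(extracted_text_by_doc):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for doc_name, text in extracted_text_by_doc.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_name)
    return index

def search(index, word):
    """Return the sorted names of documents whose extracted text contains word."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical extracted text, standing in for real OCR output.
extracted = {
    "invoice_001.pdf": "Invoice for office supplies",
    "memo.doc": "Office closed on Friday",
}
index = build_index(extracted)
print(search(index, "office"))  # ['invoice_001.pdf', 'memo.doc']
```

Re-indexing a document in this sketch would mean rebuilding its entries from freshly extracted text, which is loosely what the Extract and Index option does on demand.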

 

To extract text (OCR) from a document

1. In the List View pane, select the document from which you want to extract text.
2. On the Home tab, click the More drop-down arrow and select the Extract and Index option.
3. The text from the document is extracted and indexed.
4. To view the extracted text, select the document, click the More drop-down arrow on the Home tab, and select the Show Extracted Text option.
5. You can modify the displayed text. Click the Save button to save your changes to the extracted text.

 

To extract text (OCR) from a document using the Microsoft Office OCR Engine

1. In Globodox, click the Globodox button.
2. Click the Options button. The Options window opens.
3. In the Extract and Index section, select the Use Microsoft Office OCR Engine option to make it your default OCR engine.
4. Click the OK button to apply the changes.
5. In the List View pane, select the document from which you want to extract text.
6. On the Home tab, click the More drop-down arrow and select the Extract and Index option.
7. The text from the document is extracted and indexed.
8. To view the extracted text, select the document, click the More drop-down arrow on the Home tab, and select the Show Extracted Text option.
9. You can modify the displayed text. Click the Save button to save your changes to the extracted text.
 

Note:
To use the Microsoft Office OCR Engine, MS Office Document Imaging must be installed on the system. MS Office Document Imaging was discontinued with the launch of MS Office 2010, so text extraction using the Microsoft Office OCR Engine works only if the version of MS Office installed on your machine is older than MS Office 2010.

 

Notes:

Globodox uses its built-in text extractor for MS Word (DOC, DOCX), MS Excel (XLS, XLSX) and PDF files. For any other file format, an IFilter for that format must be installed on the user's machine before Globodox can extract text from files of that format.

 
IFilters for the following file formats are installed by default on Windows 2000/XP/2003/2008/Vista/7 machines:

PPT (Microsoft PowerPoint presentation)
HTML documents
TXT documents
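
The rule described in the note above can be summarized as a small decision function. This is only a sketch of the logic as stated there, not Globodox's own code; the function name `extraction_route`, its return labels, and the `extra_ifilters` parameter are hypothetical.

```python
from pathlib import Path

# Formats handled by the built-in extractor, per the note above.
BUILT_IN = {"doc", "docx", "xls", "xlsx", "pdf"}
# Formats whose IFilters are installed by default, per the note above.
DEFAULT_IFILTERS = {"ppt", "html", "txt"}

def extraction_route(filename, extra_ifilters=()):
    """Return 'built-in', 'ifilter', or 'needs-ifilter' for a file name."""
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext in BUILT_IN:
        return "built-in"
    # extra_ifilters stands for IFilters the user has installed separately.
    if ext in DEFAULT_IFILTERS or ext in {e.lower() for e in extra_ifilters}:
        return "ifilter"
    return "needs-ifilter"

print(extraction_route("report.DOCX"))                          # built-in
print(extraction_route("slides.ppt"))                           # ifilter
print(extraction_route("drawing.dwg"))                          # needs-ifilter
print(extraction_route("drawing.dwg", extra_ifilters=["dwg"]))  # ifilter
```

A result of `needs-ifilter` corresponds to the case where text cannot be extracted until an IFilter for that format is installed.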
 

Related Topics
Search for text in a document

Document Full Text Search - FAQ
Recognize barcodes on documents

 


Page URL: http://www.itaz.com/globodox/help/index.htm?extract_text_from_document.htm