Document Full Text Search - FAQ


What does the document full text search feature do?

The Document Full Text Search feature lets you search for documents in Globodox based on their content. It works by extracting text from the documents that you add to Globodox and then indexing that text. The text can be extracted automatically in the background when you add or modify a document. Otherwise, text extraction and indexing can be performed manually later.

Because text extraction happens in the background, the process continues even after you close Globodox. To stop text extraction...

Open Control Panel > Administrative Tools > Services. Select ITAZ Globodox Indexing Service under the Name column. Right-click the entry and select the Stop option.

Why is it useful?

Without the full text search feature you can find documents either...

using the indexing information that you have stored along with each document, or...
using the properties of the document (e.g. file name, file size, file type).

Enabling the full text search provides you with a third method for quickly finding documents.

For what file types does the document full text search feature work?

Depending on the file type (i.e. file format), text is extracted from documents using OCR, built-in text extractors, or IFilters installed on the user's machine.

For example, for TIFF, JPG, PNG and other image file types, Globodox uses its built-in OCR engine to extract text. You can configure Globodox to use the faster Microsoft Office OCR engine if it is installed (this is available if you have MS Office Document Imaging installed on the machine). Note: Starting with MS Office 2010, Microsoft no longer ships MS Office Document Imaging with MS Office.

Globodox uses its built-in text extractor for MS Word (DOC, DOCX), MS Excel (XLS, XLSX) and PDF files (PDF files that contain text and not only scanned images).

For other file types, Globodox uses IFilters installed on your machine to extract text.

PDF files are handled a little differently. PDF files created by Globodox contain scanned images. So Globodox extracts text from them using OCR. For all other PDF files, Globodox first uses its built-in text extractor and if that does not return any text, Globodox tries OCR to extract text from the PDF file.
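
The selection rules described above can be summarised in a short Python sketch. It is illustrative only: the function name, the extension sets and the returned labels are assumptions made for this example and are not part of Globodox.

```python
import os

# Illustrative extension sets; the FAQ names these as examples, not a full list.
IMAGE_TYPES = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}
BUILT_IN_TYPES = {".doc", ".docx", ".xls", ".xlsx"}


def extraction_methods(file_name, created_by_globodox=False):
    """Return the extraction methods to try, in order, for a file."""
    ext = os.path.splitext(file_name)[1].lower()
    if ext in IMAGE_TYPES:
        return ["ocr"]
    if ext in BUILT_IN_TYPES:
        return ["built-in extractor"]
    if ext == ".pdf":
        if created_by_globodox:
            # PDFs created by Globodox contain scanned images, so OCR is used.
            return ["ocr"]
        # Other PDFs: built-in extractor first, then OCR if no text comes back.
        return ["built-in extractor", "ocr"]
    # Every other format depends on an IFilter installed on the machine.
    return ["ifilter"]
```

For instance, under these assumptions a PowerPoint file would rely on an IFilter, while a PDF that was not created by Globodox would be tried with the built-in extractor first and OCR second.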

IFilters act as plug-ins and are a part of Microsoft Indexing Service (they are also used by Windows Desktop Search). Using the IFilter mechanism improves the accuracy and performance of text extraction in Globodox.

For Globodox to be able to extract text from a file of a particular format, an IFilter for that file format must be installed on the user's machine.

IFilters for the following file formats are installed by default on Windows 2000/XP/2003/Vista machines...

PPT (Microsoft PowerPoint presentation)
DOC (Microsoft Word document) - By default Globodox does not use this because it uses its built-in extractor for MS Word files.
XLS (Microsoft Excel spreadsheet) - By default Globodox does not use this because it uses its built-in extractor for MS Excel files.
HTML documents
TXT documents

You can also install third-party IFilters to enable Globodox to extract text from other file types.

More information and download links for various IFilters (both free and commercial) are available at...

Why aren’t all IFilters automatically installed along with Globodox?

Although some IFilters are available for free, we cannot ship them with Globodox as they are published by different companies. You will find download links for available IFilters (both free and commercial) at…
http://www.ifilter.org/Links.htm

Is OCR available in Globodox?

Yes, OCR is available in Globodox. You can use the built-in OCR engine to extract text from TIFF, JPG, PNG and other image file types. You can configure Globodox to use the faster Microsoft Office OCR engine if it is installed (this is available if you have MS Office Document Imaging installed on the machine).

What is the 'Use built-in OCR engine' setting?

The Use built-in OCR engine option allows you to use the built-in engine to OCR your documents.

What is the 'Use Microsoft OCR engine' setting?

The Use Microsoft OCR engine option allows you to use the Microsoft Office OCR engine to OCR your documents. You need to have MS Office Document Imaging installed on the system to use the Microsoft Office OCR engine.

How can I stop background text extraction on a machine?

Background text extraction only happens on the machine on which Globodox has been installed in server mode (a single user installation of Globodox is always installed in server mode). On this machine, the extraction of text from newly added documents continues in the background even when Globodox itself is not running. To stop background text extraction...

Open Control Panel > Administrative Tools > Services. Select ITAZ Globodox Indexing Service under the Name column. Right-click the entry and select the Stop option.
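
If you prefer to stop the service from a script, the following Python sketch shows one possible approach using the standard Windows net stop command. It assumes that the service display name on your machine is exactly as shown in the Services list above and that the script is run from an administrator prompt. The function names are made up for this example.

```python
import subprocess

# Display name as shown in the Services list (taken from the steps above).
SERVICE_DISPLAY_NAME = "ITAZ Globodox Indexing Service"


def stop_command(display_name=SERVICE_DISPLAY_NAME):
    """Build the 'net stop' command line for a service display name."""
    return ["net", "stop", display_name]


def stop_indexing_service():
    """Run the command on Windows and return its exit code.

    Requires an elevated (administrator) prompt.
    """
    return subprocess.run(stop_command(), check=False).returncode
```

Calling stop_indexing_service() on the server machine has the same effect as selecting Stop in the Services window.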

Globodox does not extract text from my document. Why?

Globodox uses different methods to extract text from documents, depending on the file type (i.e. file format).

For example, for TIFF, JPG, PNG and other image file types, Globodox uses its built-in OCR engine to extract text. You can configure Globodox to use the faster Microsoft Office OCR engine if it is installed (this is available if you have MS Office Document Imaging installed on the machine).

For file types such as .TXT and .HTM, Globodox uses IFilters installed on your machine to extract text. For MS Word and MS Excel files, Globodox uses its built-in text extractor by default.

PDF files are handled a little differently. PDF files created by Globodox contain scanned images. So Globodox extracts text from them using OCR. For all other PDF files, Globodox first uses its built-in text extractor and if that does not return any text, Globodox tries OCR to extract text from the PDF file.

When I search for some text, documents (which I am sure contain that text) are not listed in the search results. Why?

For the full text search feature to work, the text must first be extracted from the document. Depending on the file type (i.e. file format), text extraction is done using OCR, built-in text extractors, or IFilters installed on the user's machine.

One possible reason is that the IFilter for that particular file format is not installed on the machine. For Globodox to be able to extract text from a file of a particular format, the IFilter for that file format must be installed on the machine.

It could also be that the file for which text extraction is failing is password protected.

Another possible reason is that the document is larger than the size specified in the Maximum size of documents to extract text from option.
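
The three checks above can be expressed as a small Python checklist. This is only a sketch for reasoning about a missing document. The DocumentInfo fields and the function name are assumptions made for this example, and the caller supplies the list of formats that have an IFilter installed and the configured size limit.

```python
from dataclasses import dataclass


@dataclass
class DocumentInfo:
    extension: str                 # e.g. ".ppt"
    size_bytes: int
    password_protected: bool = False
    uses_ifilter: bool = True      # False for formats handled by OCR or built-in extractors


def missing_text_reasons(doc, installed_ifilters, max_size_bytes):
    """List the reasons from this FAQ that could apply to a document."""
    reasons = []
    if doc.uses_ifilter and doc.extension.lower() not in installed_ifilters:
        reasons.append("no IFilter installed for this file format")
    if doc.password_protected:
        reasons.append("file is password protected")
    if doc.size_bytes > max_size_bytes:
        reasons.append("file exceeds the maximum size for text extraction")
    return reasons
```

An empty list means that none of the three reasons applies, in which case it is worth checking whether text extraction has been run for the document at all.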

Will Globodox display a message if it cannot extract text from a particular document?

No. Globodox attempts to find the IFilter for every document and proceeds without displaying any error message (and without extracting text) if the IFilter for a particular file cannot be found on the machine. However, for backward compatibility reasons Globodox does complain if it cannot find the IFilter for PDF files.

What are the options available with the document full text search feature?

The following options are available with the full text search feature...

Automatically index documents on check-in
Limit the size of documents to extract text from to

What is the 'Automatically index documents on check-in' option?

Select the Automatically index documents on check-in option if you want documents to be indexed automatically on check-in. This option can only be selected if you have selected the document check-in/check-out option. Selecting it ensures that documents are indexed as soon as they are added or modified (in other words, as soon as they are checked in). However, enabling this option slows down the addition (check-in) of documents to Globodox because of the processing required to index each document. For more information, see Automatically Extract text from documents.

What is the 'Maximum size of documents to extract text from' option?

In this box, specify the maximum size of files that should be indexed. Please note that this option is only available for MS Access DB. By default, the file size limit is set to 1 MB, which means that files larger than 1 MB will not be indexed. For slower machines, a lower value is recommended. A larger value affects the performance of MS Access DB. This option is useful in a multi-machine scenario, where you can disable extracting and indexing of text for large files on slow machines without disabling full text search.

Can I disable the automatic extraction and indexing of documents on specific machines?

Sometimes for slower machines you may want to turn off the automatic extraction and indexing of documents (even when the feature has been enabled for the Globodox DB). For this, open the Options window (on the slow machine) and turn off the relevant option available for Document Full Text Search.

How do I use the document full text search feature to search for documents?

To search for documents using the document full text search feature...

In Globodox, select Workspace > All Documents in the Navigation pane. The documents will be displayed in the List View pane.
Click the Double Down Arrow button to bring up the Advanced Search pane.
Select the Document Text option from the Field Name drop-down, to search for text in the document.
Select the appropriate comparison operator (e.g. contains, begins with, equal to) from the Comparison drop-down. For example, to search for text beginning with specific letters, use the "begins with" operator in your query condition.
Enter the value which will be used for comparison in the Compare To box.
You can add more criteria to your search by clicking the Add button. To remove a criterion, click the Remove button.
To get a result which matches all the criteria specified by you, select the Match all conditions option from the Conditions drop-down. To get a result which matches any criteria, select the Match any conditions option from the Conditions drop-down.
Click the Search button to begin the search. The search results will be displayed in the List View pane.

If you had chosen "does not contain" from the Comparison drop-down list, the search would have returned all documents that do not contain the text you specified.
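
To illustrate how the comparison operators and the Match all conditions / Match any conditions options combine, here is a small Python model of the search. It is a sketch under stated assumptions and does not show how Globodox implements search internally: it compares against the whole extracted text, ignores case, and uses made-up function names.

```python
def matches(text, operator, value):
    """Apply one comparison to the extracted text (case-insensitive here)."""
    text, value = text.lower(), value.lower()
    if operator == "contains":
        return value in text
    if operator == "does not contain":
        return value not in text
    if operator == "begins with":
        return text.startswith(value)
    if operator == "equal to":
        return text == value
    raise ValueError(f"unknown operator: {operator}")


def search(documents, conditions, match_all=True):
    """Return the names of documents whose text satisfies the conditions.

    documents:  dict mapping document name -> extracted text
    conditions: list of (operator, value) pairs
    match_all:  True for 'Match all conditions', False for 'Match any conditions'
    """
    combine = all if match_all else any
    return [
        name
        for name, text in documents.items()
        if combine(matches(text, op, value) for op, value in conditions)
    ]
```

With two conditions such as ("contains", "invoice") and ("contains", "notes"), match_all=True returns only documents containing both words, while match_all=False returns documents containing either one.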


Related Topics
Extract Text from Document
Search for text in a document
View the Extracted Text of the Document


Page URL: http://www.itaz.com/globodox/help/index.htm?document_full_text_search_faq.htm