What can be indexed and searched?

Sources for the index

What types of content can be indexed using Forward Search, and how?

Forward Search operates on text. It can find texts containing specific “terms”. Terms are most commonly words, but can also be codes, tokens or numbers, as long as they are accessible as written text. Ideally, the text could be stored in any digital form, but in practice there are some limits.

Forward Search can access many different kinds of content storage in different ways. This is described in more detail below, but to understand it better, let's take a very brief look at the Forward Search Indexing Pipeline.

The Forward Search Indexing Pipeline

The pipeline

The Forward Search Indexing Pipeline starts with a trigger, which is typically fired by a scheduled task or by an event-based update (a “publish” in a CMS, for instance).

The first part of the pipeline is a crawler. The crawler can iterate over a website or a file server, collecting all the accessible documents therein. Several security handlers in Forward Search allow access to protected pages and files when that is required.

Next comes filtering, which is a two-step process. First, documents that should not be processed at all are removed from the pipeline. Second, the content of each remaining document is passed through the appropriate filters to select only the relevant text content. For web pages, this typically means that some of the header information and most of the main body content are preserved for indexing. For a video clip file, on the other hand, the filter removes the actual binary film but preserves information such as format, play time, frame rate, photographer, copyright and content summary, so that this information is indexed and thus searchable.
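The two filtering steps can be sketched in a few lines of Python. This is only an illustration of the idea, not the Forward Search filter API: the names RawDocument, keep_document, filter_video and run_filters, the excluded extensions and the field names are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical names throughout; this is not the Forward Search API.
EXCLUDED_EXTENSIONS = {".tmp", ".bak"}
VIDEO_FIELDS = ("format", "play_time", "frame_rate",
                "photographer", "copyright", "summary")


@dataclass
class RawDocument:
    path: str
    properties: dict       # everything the crawler could read as text
    payload: bytes = b""   # the binary content, e.g. the film itself


def keep_document(doc):
    """Step 1: decide whether the document enters the pipeline at all."""
    return not any(doc.path.lower().endswith(ext) for ext in EXCLUDED_EXTENSIONS)


def filter_video(doc):
    """Step 2 for a video file: drop the payload, keep descriptive text fields."""
    return {name: doc.properties[name]
            for name in VIDEO_FIELDS if name in doc.properties}


def run_filters(docs):
    """Apply both steps and return the fields that would go on to be indexed."""
    return [{"path": d.path, **filter_video(d)} for d in docs if keep_document(d)]
```

A real content filter would be chosen per file format; the sketch shows a single video filter only to keep the example short.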

The finalizer receives the content found by the filtering process and is a last-chance modification point. Here, the text elements that were found can be manipulated, and the documents can be enriched with extra data retrieved from various sources, for instance databases, product catalogues, web services and ERP systems.

The processor inserts the finalized documents into the index. Once all documents that were found and filtered are in the index, the index is searchable.

After indexing has completed, the post processor gets a second chance to manipulate the indexed documents. Again, extra data can be added to the index, but the post processor can do much more: as explained on other pages of this site, it has a pluggable architecture. For example, the OCR feature is implemented as such a module.
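The order of the stages described above can be summarised as a small Python sketch. It is a conceptual model only, assuming each stage is a plain function; run_pipeline and the four example stages are hypothetical names, not part of Forward Search.

```python
from typing import Callable, Dict, Iterable, List, Optional

Document = Dict[str, str]


def run_pipeline(
    crawl: Callable[[], Iterable[Document]],
    filter_doc: Callable[[Document], Optional[Document]],
    finalize: Callable[[Document], Document],
    post_process: Callable[[List[Document]], None],
) -> List[Document]:
    """Run the stages in order and return the resulting (toy) index."""
    index: List[Document] = []
    for raw in crawl():                    # crawler: collect accessible documents
        filtered = filter_doc(raw)         # filter: None means "do not index"
        if filtered is None:
            continue
        index.append(finalize(filtered))   # finalizer, then processor inserts
    post_process(index)                    # post processor revisits indexed documents
    return index


# Example stages, all invented for illustration.
def crawl():
    yield {"url": "/products/1", "body": "Red bicycle"}
    yield {"url": "/admin/login", "body": "Sign in"}


def filter_doc(doc):
    return None if doc["url"].startswith("/admin") else doc


def finalize(doc):
    return {**doc, "category": "products"}


def post_process(index):
    for doc in index:
        doc["indexed"] = "yes"


index = run_pipeline(crawl, filter_doc, finalize, post_process)
```

In this sketch the trigger is simply the call to run_pipeline; in practice it would come from a scheduled task or a publish event.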

Supported source types

Forward Search allows crawling of websites, file systems and SharePoint systems. An API allows for the development of source data providers other than these three, if the need arises.

Forward Search can filter more than 25 file formats, including webpages, text files, Microsoft Office and OpenOffice documents and a large number of media files. An API allows for the addition of more filters, so that other document types can be indexed if the need arises.

When indexing webpages, Forward Search supports standard Dublin Core metadata and automatically includes more than 20 different fields. It also allows easy configuration of an unlimited number of custom fields. Similarly, Forward Search includes many standard metadata fields from the many file formats it supports.
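To make the idea of metadata fields concrete, here is a minimal Python sketch that reads Dublin Core style meta elements (names beginning with "DC.") and a configurable set of custom fields from a web page. It uses only the standard library. The class name MetaFieldParser and the "department" field are assumptions for this example, and the sketch does not claim to show which fields Forward Search itself reads or how it is configured.

```python
from html.parser import HTMLParser


class MetaFieldParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs for selected field names."""

    def __init__(self, custom_fields=()):
        super().__init__()
        self.custom_fields = {name.lower() for name in custom_fields}
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if content is None:
            return
        # Keep Dublin Core style names and any explicitly configured custom field.
        if name.startswith("dc.") or name in self.custom_fields:
            self.fields[name] = content


html = """<html><head>
<meta name="DC.title" content="Annual report">
<meta name="DC.creator" content="Jane Doe">
<meta name="department" content="Finance">
<meta name="viewport" content="width=device-width">
</head><body></body></html>"""

parser = MetaFieldParser(custom_fields=["department"])
parser.feed(html)
```

After feeding the sample page, parser.fields holds the two Dublin Core entries and the custom "department" field, while the "viewport" entry is ignored.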

In the finalizer, the API allows processing of all the text fields already found by the filters and manipulation of their content. As stated above, you can also retrieve extra information from various proprietary systems or web services, or compute it purely algorithmically, and add new fields to the document to be indexed. For instance, an “Address” field could be converted to geographic coordinates for use by the Geo Search module. Or stock and quality information, retrieved directly from an ERP system, could be added to an indexed product.
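The address example can be sketched as follows. This is a hedged illustration in Python rather than the finalizer API itself: add_geo_position, geocode, the "Latitude" and "Longitude" field names and the lookup table (with made-up values) are all assumptions, and a real implementation would call a geocoding service instead of a dictionary.

```python
# Hypothetical lookup table standing in for a real geocoding service.
# The address and coordinates are illustrative values only.
KNOWN_ADDRESSES = {
    "1 example street, copenhagen": (55.6761, 12.5683),
}


def geocode(address):
    """Return (latitude, longitude) for a known address, or None."""
    return KNOWN_ADDRESSES.get(address.strip().lower())


def add_geo_position(fields):
    """Return a copy of the document fields, enriched with coordinates if possible."""
    enriched = dict(fields)
    address = fields.get("Address")
    if address:
        position = geocode(address)
        if position is not None:
            latitude, longitude = position
            enriched["Latitude"] = str(latitude)
            enriched["Longitude"] = str(longitude)
    return enriched
```

The function returns a new dictionary and leaves documents without a resolvable address unchanged, so a failed lookup never blocks indexing.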

In the post processor, all documents modified by an indexing process are revisited, and further manipulation is possible, including the retrieval of more data from external sources, just as in the finalizer.

Injection

If the standard pipeline is not capable of creating the desired index, it is possible to write directly to the index by injecting documents that have been prepared for indexing, just as the processor does in the pipeline. The injection interface allows for the creation of search systems that do not rely on crawlable websites or file systems, but are instead based on, for instance, queries against relational databases or the iteration of content nodes in proprietary content systems.
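As a conceptual sketch of injection from a relational database, the Python example below reads rows from an in-memory SQLite table and writes them straight into a toy inverted index. ToyIndex and its inject and search methods are hypothetical stand-ins invented for this example; they are not the Forward Search injection interface, and the table and its rows are sample data.

```python
import sqlite3
from collections import defaultdict


class ToyIndex:
    """A tiny in-memory inverted index; a stand-in, not the Forward Search index."""

    def __init__(self):
        self.documents = {}
        self.postings = defaultdict(set)

    def inject(self, doc_id, fields):
        """Store a prepared document and index every whitespace-separated term."""
        self.documents[doc_id] = fields
        for value in fields.values():
            for term in value.lower().split():
                self.postings[term].add(doc_id)

    def search(self, term):
        """Return the ids of documents containing the term, in sorted order."""
        return sorted(self.postings.get(term.lower(), set()))


# Sample source data: a relational table instead of a crawlable site.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE products (id TEXT PRIMARY KEY, name TEXT, description TEXT)"
)
connection.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("p1", "Red bicycle", "City bike with seven gears"),
        ("p2", "Blue helmet", "Helmet for bike riders"),
    ],
)

index = ToyIndex()
query = "SELECT id, name, description FROM products"
for doc_id, name, description in connection.execute(query):
    index.inject(doc_id, {"name": name, "description": description})
```

No crawler or filter is involved here: each database row is turned into a document and handed directly to the index, which is the essence of the injection approach.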