Fedora Generic Search Service

See a demo at your own site, almost out-of-the-box

Perform these steps:

Create a Fedora 3.5 installation by quick install. The only piece of custom configuration needed is setting the value of the param enabled to true for the Messaging module in fedora.fcfg:

<module role="org.fcrepo.server.messaging.Messaging" class="org.fcrepo.server.messaging.MessagingModule">
    <comment>Fedora's Java Messaging Service (JMS) Module</comment>
    <param name="enabled" value="true"/>

Download fedoragsearch.war from either the DTU prerelease site, or from the official Duraspace site. Alternatively, you may build fedoragsearch.war from source.
Copy fedoragsearch.war into the tomcat webapps directory of your Fedora installation. Tomcat will unpack it, if it is running, or else when you start it.
Create a GSearch administrator in fedora-users.xml
```
<user name="fgsAdmin" password="fgsAdminPassword">
    <attribute name="fedoraRole">
        <value>administrator</value>
    </attribute>
</user>
```
Note that only users named 'fedoraAdmin' or 'fgsTester', or with names starting with 'fgsAdmin', are authorized to perform updateIndex actions.
Create the set of configuration files. All you need to do is edit a few of the property values in the file webapps/fedoragsearch/FgsConfig/fgsconfig-basic.properties, including passwords, and run
```
> ant -f fgsconfig-basic.xml
```
This ant script finishes by writing the fgsconfigFinal files to the classpath location that you have chosen, so you need to run it with permission to write there.
Restart tomcat.
Now this documentation page is visible at your own demo site, along with the admin pages.
There is a customizable end-user search client page. See the section on Customizable end-user search client.
Ingest the Fedora 3.5 demo objects. There are 41 of them; the 20 that are data objects will be indexed. Then view the admin pages.
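
The user-name rule for updateIndex mentioned in the steps above can be sketched as a small check. This is only an illustration of the rule as stated, not GSearch's own code; the function name is hypothetical, and case-sensitive matching is an assumption.

```python
# Sketch of the updateIndex user-name rule described above.
# may_update_index is a hypothetical helper, not part of GSearch.
ALLOWED_NAMES = {"fedoraAdmin", "fgsTester"}
ALLOWED_PREFIX = "fgsAdmin"


def may_update_index(user_name):
    """Return True if the name is one of the fixed names or starts with the prefix."""
    # Assumption: names are compared case-sensitively.
    return user_name in ALLOWED_NAMES or user_name.startswith(ALLOWED_PREFIX)


if __name__ == "__main__":
    for name in ("fgsAdmin", "fgsAdminLibrary", "fedoraAdmin", "fgsTester", "guest"):
        print(name, may_update_index(name))
```

So a user named fgsAdminLibrary would pass the rule, while a user named fedoraAdmin2 would not, since only 'fgsAdmin' acts as a prefix.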

Major features

The service has the following major features:

Indexing of Fedora FOXML records, including the text contents of datastreams and the results of disseminator calls.
Search in the index.
Plugin of selected search engines, so far Lucene, Solr and Zebra.

You are encouraged to share problems and experience with the Fedora community: send mail to fedora-commons-users, to Chris Wilper, or to Gert Schmeltz Pedersen.

The following figure serves to give a first understanding for a developer, who will use GSearch in a Fedora application:

The figure shows:

A REST client, running in a user's browser, which may combine accesses to Fedora and to GSearch.
A SOAP client, running anywhere, may do the same.
The Search Service implements a generic set of operations:
- updateIndex - indexing the contents of the Fedora repository.
- gfindObjects - search similar to Fedora findObjects and to the SRW/SRU operation searchRetrieve.
- browseIndex - browsing terms in a given index, similar to the SRW/SRU operation scan.
- getRepositoryInfo - describing the properties of a repository.
- getIndexInfo - describing the properties of an index.
Engine specific implementations of the operations will receive client requests, communicate with the engine indexer and search server, and return the responses in the appropriate form to the clients.
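
As a rough illustration of how a REST client might address the generic operations, the following sketch builds request URLs. The endpoint host and port are placeholders, the helper name is hypothetical, and the `query` parameter name is an assumption; only the `operation` and `action` parameters appear elsewhere in this document.

```python
from urllib.parse import urlencode

# Placeholder endpoint; replace host and port with those of your own site.
REST_ENDPOINT = "http://localhost:8080/fedoragsearch/rest"


def operation_url(operation, **params):
    """Build a GET URL for one of the generic operations (hypothetical helper)."""
    query = {"operation": operation, **params}
    # urlencode percent-encodes reserved characters such as ':' in values.
    return f"{REST_ENDPOINT}?{urlencode(query)}"


if __name__ == "__main__":
    print(operation_url("getRepositoryInfo"))
    print(operation_url("updateIndex", action="optimize"))
    # 'query' is an assumed parameter name for the search expression.
    print(operation_url("gfindObjects", query="dc.title:image"))
```

Keeping the operation name as an ordinary parameter means one helper covers all five operations; a real client would add whatever further parameters the chosen operation accepts.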

GSearch may run in a separate web server and may index more than one Fedora repository, and it may update more than one index in parallel.

XSLT stylesheets are part of the configuration of GSearch, and XSLT transformations play an essential role in the workflow:

All engine specific operations return an engine specific xml answer, which is transformed by an engine-specific xslt stylesheet into result page xml. For a SOAP request this is the answer. For a REST request this is transformed to an html answer. There may be any number of xslt stylesheets to select from, the default ones are selected in the properties file. Selecting a copy stylesheet will allow the transfer of an answer untransformed. An alternative result page format is OpenSearch, which is an RSS2.0 extension.
Parameters allow clients to select repository, index, and xslt stylesheets by name. In a real application, these values may be determined by the developer in the code, or by the administrator in the properties file.

Search engine plugins

Lucene

The Lucene plugin comes in fedoragsearch.war as the java package dk.defxws.fgslucene together with the Apache Lucene java libraries.

The Lucene plugin is used by configuration as explained below.

Lucene has a very rich functionality, and this plugin exploits a small fraction of it. As a java programmer, you may have ideas for further exploitation, which you may realize by implementing an enhanced version of the plugin. Please, share such ideas and implementations with the Fedora community.

Solr

The Solr server is downloaded, installed and configured as described at the Solr web site.

The Solr plugin comes in fedoragsearch.war as the java package dk.defxws.fgssolr.

The Solr plugin is used by configuration as explained below. It has dependencies on the configuration of the Solr server.

This plugin indexes documents via the HTTP POST interface of Solr. Searches may be performed via native Solr HTTP GET requests to the Solr server, or via gfindObjects, which accesses the Lucene index directly. Solr functionality does not include browsing, but the plugin offers it via the browseIndex operation.
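
To give an idea of what is posted to Solr, the sketch below serializes a few index fields as a Solr XML add message. In GSearch the real message is produced by the indexing stylesheet, so this is only an illustration of the message shape; the helper name and the field names PID and dc.title are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def solr_add_document(fields):
    """Serialize a dict of index fields as a Solr XML <add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        # Each index field becomes <field name="...">value</field>.
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Field names and values are made up for the example.
    print(solr_add_document({"PID": "demo:5", "dc.title": "Coliseum"}))
```

The field names used in such a message must match fields declared in the Solr schema, which is one of the dependencies on the Solr server configuration mentioned above.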

Zebra

The Zebra plugin comes in fedoragsearch.war as the java package dk.defxws.fgszebra.

The Zebra plugin is configured as shown in FgsConfig/FgsConfigIndexTemplate/Zebra/zebraconfig, which includes a README file explaining how to get, install, and configure Zebra.

Install and configure for your application

Perform these steps:

Download fedoragsearch.war as above and copy it to a tomcat or similar web server. It does not need to be the web server running Fedora itself. You may rename the .war file, before you copy it into the webapps directory, in order to give it another webapp name.
Set the value of the param enabled to true for the Messaging module in fedora.fcfg as above.
Now this documentation page is visible at your own site, along with the admin pages.
The SOAP service operations are deployed with the .war file, and the .wsdl file is available here.
The choice of search engine is made with the fgsindex.operationsImpl property in your index.properties file, as set in the file fgsconfig-basic.properties (see below). If you choose Solr or Zebra, you have to install and start the respective server.
Create the configuration files
- If you migrate from GSearch 2.2 or 2.3 to 2.4, you simply reuse the configuration files you have. The only things you must do from 2.2 are rename the root directory of the configuration files from 'config' to 'fgsconfigFinal', and if you use sortType AUTO in index.properties (explicitly or by default), change to STRING. You may want to add the new properties. If you kept the configuration files within tomcat in the default classpath, you may want to move them outside, see below.
- If you start with GSearch 2.4, creating the configuration files is much simpler than before. Here are the two basic parts:
  - Generate indexing stylesheet from example foxml files
    - Copy the directory webapps/fedoragsearch/FgsConfig to a location outside tomcat.
    - Go to this location.
    - Put one or more example foxml files in FgsConfig/indexingXsltGenerator/foxml. Each file must end with a newline. If you want to index managed xml datastreams, insert an example inline; see the example in the test file test_fgs23.xml.
    - At FgsConfig run
```
>ant generateIndexingXslt
```
    - Now you have
```
FgsConfig/FgsConfigIndexTemplate/Lucene/foxmlToLuceneGenerated.xslt
```
      and
```
FgsConfig/FgsConfigIndexTemplate/Solr/foxmlToSolrGenerated.xslt
```
      You may use them as they are, or copy them to another name and edit them; there are probably many index fields that you do not want. You will put the name into the basic property file in order to use that indexing stylesheet at indexing time.
    - There are foxmlToLucene.xslt and foxmlToSolr.xslt files, useful for the Fedora demo objects, that you may use for customizing instead of generating from foxml files.
  - Edit and use the basic property values
    You edit a basic property file and run an ant script with it. This will insert your property values into your copy of a set of template configuration files, providing the final set of configuration files. These may be edited, if you want to select among more than the basic configuration options. Here are the basic steps in more detail:
    - Edit the file FgsConfig/fgsconfig-basic.properties
    - Run with privilege to write to the final config location, that you stated in fgsconfig-basic.properties:
```
> ant -f fgsconfig-basic.xml
```
    - This has used the property values in fgsconfig-basic.properties and inserted them into the copies of the template config files, that now make up the final config files, which have been copied to the final config location.
    - This location of the final config files must be in tomcat classpath, in order that GSearch can find them at startup. By default webapps/fedoragsearch/WEB-INF/classes is in tomcat classpath. Alternatively, you may add another classpath location to tomcat in catalina.properties in the line starting with
```
shared.loader=
```
      and state that location in fgsconfig-basic.properties. Make sure that there is only one 'fgsconfigFinal'-directory and one log4j.xml file in the classpath.
    - You should read through the final config files. You may edit any of their properties. If you do edit them, and they are within tomcat, be sure to keep a copy outside tomcat, because when you put a new fedoragsearch.war into tomcat webapps, tomcat deletes the existing unpacked fedoragsearch directory together with your edited final config files.
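
The requirement that the classpath holds exactly one 'fgsconfigFinal' directory and one log4j.xml file can be checked with a short script before restarting tomcat. This is a sketch under assumptions: the function names are hypothetical, the list of classpath directories has to be supplied by you, and only entries directly inside each directory are considered.

```python
import os
import tempfile


def find_in_classpath(classpath_dirs, name):
    """Return the classpath directories that directly contain `name`."""
    return [d for d in classpath_dirs if os.path.exists(os.path.join(d, name))]


def check_single_config(classpath_dirs):
    """Return a list of problems; an empty list means exactly one of each was found."""
    problems = []
    for name in ("fgsconfigFinal", "log4j.xml"):
        hits = find_in_classpath(classpath_dirs, name)
        if len(hits) != 1:
            problems.append(f"{name}: found in {len(hits)} locations, expected 1")
    return problems


if __name__ == "__main__":
    # Demo with throwaway directories standing in for real classpath locations.
    inside, outside = tempfile.mkdtemp(), tempfile.mkdtemp()
    os.mkdir(os.path.join(outside, "fgsconfigFinal"))
    open(os.path.join(outside, "log4j.xml"), "w").close()
    print(check_single_config([inside, outside]))
```

For a real check you would pass the directories that are actually in the tomcat classpath, for example the webapp's WEB-INF/classes directory plus any location you added yourself.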

The default webapp configuration in .../webapps/fedoragsearch/WEB-INF/web.xml enforces authorization based on fedora-users.xml. With it, only users named 'fedoraAdmin' or 'fgsTester', or with names starting with 'fgsAdmin', are authorized to perform updateIndex actions. If you do not want to enforce authorization, copy the file web_withoutAuthN.xml onto web.xml. In that case even updateIndex actions are unprotected.
Then you may restart fedoragsearch and call http://<HOSTPORT>/fedoragsearch/rest in order to index and search. The name "rest" may be reconfigured in .../webapps/fedoragsearch/WEB-INF/web.xml
Try the command line client. Change directory to
```
.../webapps/fedoragsearch/client/
```
make the file executable, and run
```
sh runRESTClient.sh
```
then you will get the usage instruction.
For your real applications, you may provide alternative stylesheets in webapps/fedoragsearch/WEB-INF/classes/config/rest and set their names in webapps/fedoragsearch/WEB-INF/classes/config/fedoragsearch.properties.
Inspect the Lucene index with Luke. Note that Luke cannot open an empty Lucene index.

Configuring GSearch for automatic updates

As of version 2.1, GSearch has the ability to listen to update messages provided by Fedora. These messages are sent via JMS, so a JMS provider must be available (a JMS provider is included with Fedora 3.0). In order to configure the update listener, open updater.properties and set the following property values. These values will most likely be the same as those specified in your Fedora configuration.

java.naming.factory.initial
- Default: org.apache.activemq.jndi.ActiveMQInitialContextFactory
- Specifies the JNDI initial context which will be used to look up JMS administered objects.
java.naming.provider.url
- Default: tcp://localhost:61616
- Specifies the address at which a connection can be made to the messaging provider.
- The update listener will attempt to connect to the messaging provider at this address on server startup, so make sure that your provider is running and available, otherwise you will see a connection error.
connection.factory.name
- Default: ConnectionFactory
- Specifies the JNDI name of the ConnectionFactory object needed to create a connection to the JMS provider.
topic.fedoraAPIM
- Default: fedora.apim.update
- Specifies the topic on which to listen for updates.
client.id
- Default: fedoragsearch0
- The identifier of the GSearch client. If you have more than one instance of GSearch running they must have different client identifiers.
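
The property list above can be mirrored in a small sketch that overlays the contents of an updater.properties file on the listed defaults and checks that several GSearch instances use distinct client identifiers. The function names are hypothetical, treating the listed defaults as fallbacks for missing keys is an assumption of this sketch, and the parser handles only plain key=value lines and # comments.

```python
# Defaults as listed in this section.
UPDATER_DEFAULTS = {
    "java.naming.factory.initial": "org.apache.activemq.jndi.ActiveMQInitialContextFactory",
    "java.naming.provider.url": "tcp://localhost:61616",
    "connection.factory.name": "ConnectionFactory",
    "topic.fedoraAPIM": "fedora.apim.update",
    "client.id": "fedoragsearch0",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def effective_settings(text):
    """Overlay the values found in the text on the defaults (assumed fallback behaviour)."""
    return {**UPDATER_DEFAULTS, **parse_properties(text)}


def duplicate_client_ids(instances):
    """Return the client ids that are used by more than one instance."""
    seen, dupes = set(), set()
    for settings in instances:
        client_id = settings["client.id"]
        (dupes if client_id in seen else seen).add(client_id)
    return sorted(dupes)
```

For example, two instances that both keep the default client.id would be reported as a conflict, while giving the second one its own identifier resolves it.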

If you decide not to use the automatic updates feature in GSearch, you'll need to open fedoragsearch.properties and remove (or comment out) the line specifying fedoragsearch.updaternames. This will disable the update listener.

Configuring Fedora for automatic updates

Fedora 3.0 added the ability to send a message whenever a change is made to the content of the repository (through API-M). This messaging capability must be enabled and configured to work properly. See the Fedora documentation for instructions on configuring messaging.

As an alternative to updates via messaging, it is possible to configure Fedora to send a signal via REST to GSearch when objects are added, modified, and purged. Using messaging is the preferred method for automatic updates, and this technique, while still available, should be considered deprecated. It is not recommended to use both the update listener and REST-based updates.

To enable REST-based updates, edit your fedora.fcfg file and change the class of the fedora.server.storage.DOManager module to org.fcrepo.server.storage.GSearchDOManager. Then populate the following module parameters as needed:

gSearchRESTURL - The REST endpoint for GSearch, for example, http://localhost:8080/fedoragsearch/rest
gSearchUsername - If GSearch is protected by authentication, this is the username that Fedora should use to authenticate.
gSearchPassword - The password for the above user, if applicable

Full-text and metadata extraction from datastreams using Apache Tika

Parameters for getDatastreamFromTika, getDatastreamTextFromTika, and getDatastreamMetadataFromTika:

indexFieldTagName
- Either "IndexField" (with the Lucene plugin) or "field" (with the Solr plugin).
textIndexField (not used with getDatastreamMetadataFromTika)
- fieldSpec for the text index field; null or empty if it is not to be generated.
indexfieldnamePrefix (not used with getDatastreamTextFromTika)
- Optional or empty; prefixed to the metadata index field names.
selectedFields (not used with getDatastreamTextFromTika)
- Comma-separated list of metadata fieldSpecs; if empty, all fields are included with default params.

A fieldSpec has the form:

metadataFieldName [ '=' indexFieldName] [ '/' [index] [ '/' [store] [ '/' [termVector] [ '/' [boost]]]]]

metadataFieldName
- Must be exactly as extracted by Tika from the document. You may see the available names if you log in debug mode and look for "METADATA name=" under "fullDsId=" in the log, when "getFromTika" was called during updateIndex.
indexFieldName
- Is used as the generated index field name. If not given, GSearch uses metadataFieldName after replacement of the characters ' ', ':', '/', '=', '(', ')' with '_'.

The following parameters are used with Lucene (with Solr these values are specified in schema.xml):

index
- [ 'TOKENIZED' | 'UN_TOKENIZED' ], the first alternative is the default.
store
- [ 'YES' | 'NO' ], the first alternative is the default.
termVector
- [ 'YES' | 'NO' ], the first alternative is the default.
boost
- A decimal number, '1.0' is the default.
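
The fieldSpec form and its defaults can be made concrete with a small parser. This is a sketch of the syntax as described above, not the parser used by GSearch; the class and function names are hypothetical, and the sketch assumes that the metadata field name itself contains neither '=' nor '/'.

```python
from dataclasses import dataclass

# Characters that are replaced with '_' when no indexFieldName is given.
REPLACED_CHARS = " :/=()"


@dataclass
class FieldSpec:
    metadata_field_name: str
    index_field_name: str
    index: str = "TOKENIZED"
    store: str = "YES"
    term_vector: str = "YES"
    boost: float = 1.0


def default_index_field_name(metadata_field_name):
    """Derive the index field name by replacing the listed characters with '_'."""
    return "".join("_" if c in REPLACED_CHARS else c for c in metadata_field_name)


def parse_field_spec(spec):
    """Parse one fieldSpec; assumes no '=' or '/' inside the metadata field name."""
    name_part, *options = spec.split("/")
    if len(options) > 4:
        raise ValueError(f"too many '/' parts in fieldSpec: {spec!r}")
    metadata_name, _, index_name = name_part.partition("=")
    result = FieldSpec(
        metadata_field_name=metadata_name,
        index_field_name=index_name or default_index_field_name(metadata_name),
    )
    for attr, value in zip(("index", "store", "term_vector", "boost"), options):
        if value:  # an empty slot keeps the default
            setattr(result, attr, float(value) if attr == "boost" else value)
    return result


def parse_selected_fields(selected_fields):
    """Parse a comma-separated selectedFields value into a list of FieldSpec objects."""
    return [parse_field_spec(s.strip()) for s in selected_fields.split(",") if s.strip()]
```

With this reading, a spec such as Content-Type=contentType/UN_TOKENIZED/NO overrides index and store and keeps the defaults for termVector and boost, while a spec consisting of a bare metadata field name uses the defaults throughout.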

Search result filtering

Search result filtering shows only those search hits that the user is actually permitted to read. Three solutions have been investigated and demonstrated, and are presented here. The demonstration is also included with the GSearch distribution in .../WEB-INF/classes/configDemoSearchResultFiltering/. In brief, the three solutions are:

Post-search filtering, which requires a request to the XACML mechanism for each hit. The total number of permitted hits is known only at the end, which makes this a costly procedure, especially when few hits are permitted out of a large number.
In-search filtering, which requires additional index fields and query rewriting, that is, a logical partitioning of the index.
Pre-search filtering, which requires a physical partitioning of the index and selection of the pertinent index at query time.

Both in-search and pre-search filtering face the challenge of exact correspondence between the filtering mechanism and the XACML policies.
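
The difference between the three solutions can be sketched in a few lines. Everything here is hypothetical and for illustration only: the function names, the permission callback standing in for the XACML mechanism, the access.role index field, and the group-to-index mapping are assumptions, not part of the GSearch demo implementation.

```python
def post_search_filter(hits, is_permitted):
    """Check every hit individually; the permitted total is known only at the end."""
    permitted = [hit for hit in hits if is_permitted(hit)]
    return permitted, len(permitted)


def in_search_query(user_query, user_roles, access_field="access.role"):
    """Rewrite the query so that an extra index field restricts the hits."""
    clause = " OR ".join(f'{access_field}:"{role}"' for role in user_roles)
    return f"({user_query}) AND ({clause})"


def pre_search_index(indexes_by_group, user_group):
    """Select the physically separate index that matches the user."""
    return indexes_by_group[user_group]


if __name__ == "__main__":
    hits = ["demo:1", "demo:2", "demo:3"]
    # The lambda stands in for one policy request per hit.
    print(post_search_filter(hits, lambda pid: pid != "demo:2"))
    print(in_search_query("dc.title:rome", ["public", "staff"]))
    print(pre_search_index({"public": "PublicIndex", "staff": "StaffIndex"}, "staff"))
```

The sketch also shows where the correspondence problem arises: the in-search and pre-search variants never consult the policies at query time, so the extra index field or the index partitioning has to be kept in step with them.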

For your own purposes, select the preferred searchResultFilteringType in fedoragsearch.properties and set searchResultFilteringModule to a class that you program yourself, either as a subclass of the demo class dk.defxws.fedoragsearch.server.SearchResultFilteringDemoImpl or as an implementation of the interface dk.defxws.fedoragsearch.server.SearchResultFiltering.

Building from source

Get the source from github:

  git clone https://github.com/fcrepo/gsearch.git

To build fedoragsearch.war in FedoraGenericSearch/dist:

  cd FedoraGenericSearch
  ant buildfordownload

To build fedoragsearch.war in FedoraGenericSearch/localdist for local testing:

  cd FedoraGenericSearch
  ant -Dlocal.FEDORA_HOME=... -Dlocal.SOLR_HOME=... -Dlocal.SOLR_SERVER=... buildforlocaltest

To run tests, copy localdist/fedoragsearch.war into tomcat webapps on a Fedora repository with demo objects with MessagingModule enabled.

Create a test user in fedora-users.xml

    <user name="fgsTester" password="fgsTesterPassword">
      <attribute name="fedoraRole">
        <value>tester</value>
      </attribute>
    </user>

Test operations on the lucene plugin:

    ant junit-lucene  
    ant junit-testsonlucene
    ant junit-fgs23  
    ant junit-lucene-fgs24_1010 
    ant junit-lucene-fgs24_1019

Test operations on the solr plugin:

    after startup of the solr server:
    ant junit-solr
    ant junit-solr-fgs24_1010

Test operations on the zebra plugin, after installing, configuring and starting the zebra server:

    see $FEDORA_HOME/tomcat/webapps/fedoragsearch/WEB-INF/classes/configDemoOnZebra/index/DemoOnZebra/zebraconfig/README
    ant junit-zebra

New features in version 2.4

Compatibility with Lucene 3.5 and Solr 3.5 (FCREPO-1005)
Useful end-user search page generation from indexing stylesheet (FCREPO-1006)
- See the section on Customizable end-user search client.
Performance measurements and possibly improvements (FCREPO-1007). Measurements taken using Apache JMeter, on a production quality platform, giving some insight into the performance implications of various choices. Download the report from github. Morten Sørensen, DTU Library, is co-developer and co-author on this.
Filtering of search results by access constraints (FCREPO-1008)
- Based on work presented at OR2008.
- Problem: Search results contain hits that the user does not have the access rights to see
- Solution: Extend access rights to search results by filtering
- Thanks to Swithun Crowe for providing a real life example
- See the section on Search result filtering.
Interaction with the Resource Index (FCREPO-1009)
- See the section on Embedded queries.
Use of Apache Tika for full-text and metadata extraction (FCREPO-1010)
- The Apache Tika™ toolkit extracts text and metadata from documents, if the format is detectable by AutoDetectParser in Tika.
- In addition to the text extraction with PDFBox, GSearch now provides the following text and metadata extraction functions:
  - getDatastreamTextFromTika: retrieves the text only
  - getDatastreamMetadataFromTika: retrieves metadata only, also for non-text datastreams like images
  - getDatastreamFromTika: retrieves both text and metadata
- See the section on Full-text and metadata extraction from datastreams.
- Thanks to Adam Soroka for the suggestion and the review.
Management of GSearch configurations in Fedora objects (FCREPO-1018)
- See the section on Management of GSearch configurations in Fedora objects.
Exploration of complex GSearch use cases (FCREPO-1019)
- Jonathan Green states: "... the index may not always share a 1 to 1 relationship with objects in fedora."
- GSearch now allows more than one index document per Fedora object, their ids are formed as <PID>'$'<suffix>, where the suffix in the test case is a datastream id.
- The opposite, an index document with values from more than one Fedora object, is possible by the use of the document() function of XSLT.
- See the section on Many-to-many relationship between Fedora objects and index documents.
- A typical application using GSearch will index one repository in one index. However, you have the possibility to index many repositories in one or more indexes in parallel, see the section on Many-to-many relationship between repositories and indexes.
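
The index document id scheme mentioned above, <PID>'$'<suffix>, can be illustrated with two small helpers. The helper names are hypothetical; the sketch assumes that a PID never contains '$' and that a document without a suffix simply uses the PID as its id.

```python
def make_document_id(pid, suffix=None):
    """Compose an index document id; without a suffix the PID is used as-is (assumption)."""
    return pid if suffix is None else f"{pid}${suffix}"


def split_document_id(doc_id):
    """Split an index document id into (pid, suffix); suffix is None if absent."""
    pid, _, suffix = doc_id.partition("$")
    return pid, (suffix or None)


def group_by_pid(doc_ids):
    """Collect the index documents that belong to the same Fedora object."""
    grouped = {}
    for doc_id in doc_ids:
        pid, _ = split_document_id(doc_id)
        grouped.setdefault(pid, []).append(doc_id)
    return grouped


if __name__ == "__main__":
    # demo:5 and the datastream ids are made-up example values.
    ids = [make_document_id("demo:5", "DS1"), make_document_id("demo:5", "DS2")]
    print(group_by_pid(ids))
```

Grouping by the part before '$' is one way a client could present several index documents as hits on the same Fedora object.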

You may also see the complete list of issues for GSearch.

New features in version 2.3

Fedora 3.5 compatibility
- Indexing of managed xml datastreams shown with test object
Lucene 3.4 compatibility
Solr 3.4 compatibility
Zebra 2.0 compatibility
PDFBox 1.6 compatibility
Simplified configuration with two main parts:
- Indexing stylesheet generated from example foxml files, requiring less xslt experience
- Basic properties specified in simple property file, instead of in ant script
Selection of xslt processor, xalan or saxon, see fedoragsearch.properties

You may also see the complete list of issues for GSearch 2.3.

New features in version 2.2

Fedora 3.1 compatibility
Lucene 2.4.0 compatibility
Solr 1.3.0 compatibility
For the lucene plugin: Search result filtering by access constraints, as defined by XACML policies, in order to show only those search hits that the user is actually permitted to read. See the section on Search result filtering.

New features in version 2.1.1

Fedora 3.0 compatibility

New features in version 2.1

Fedora 3.0b2 compatibility
Added an update listener which uses the Fedora Messaging Client to listen for updates being performed through API-M. These update messages contain the information needed to perform index updates, thereby keeping GSearch up-to-date with the Fedora repository.
Enhanced the sortFields parameter to gfindObjects for Lucene, sorting search results by a custom Comparator class, see the index.properties file in configTestOnLucene and the test class dk.defxws.fedoragsearch.test.ComparatorSourceTest.
Enhanced the fromFoxmlFiles action of updateIndex for Lucene, so that indexing is attempted for all files even if one or more fail, in which case log messages are given. Previously, a single failure aborted the run.

New features in version 2.0

Added a plugin for the Apache Solr search server.
Added easier configuration, so that you need only edit one file with property values, then run it with ant.
Updated to Lucene version 2.3.0.
Added params to indexing in the format:
```
...&indexDocXslt=[xslt-name][(paramname1=value1[,paramname2=value2[,...]])]
```
Use the parameters at indexing time by putting an xsl:param statement in the indexing xslt stylesheet, like this:
```
<xsl:param name="someparamname" select="defaultvalue"/>
```
Added optimize options for Lucene indexing:
```
fgsindex.mergeFactor and fgsindex.maxBufferedDocs
```
will affect performance, see the index.properties file in configTestOnLucene. Also added
```
...?operation=updateIndex&action=optimize
```
which will perform IndexWriter.optimize() which merges all segments together into a single segment, optimizing an index for search. Removed the optimize() call after each updateIndex.
Added untokenizedFields property to Lucene index.properties files. Adding the property with a list of all untokenized fields will ensure that they all select the appropriate analyzer.
Added a sortFields parameter to gfindObjects for Lucene, sorting search results as specified, see the index.properties file in configTestOnLucene.
Added properties snippetBegin and snippetEnd, making highlight code configurable, see the index.properties file in configTestOnLucene.
Added property for custom URIResolver used by xslt transformers for basic auth and SSL, see the example dk.defxws.fedoragsearch.server.URIResolverImpl class and the index.properties file in configTestOnLucene.
Removed encoding of special characters in indexFields. Snippets now show special characters without modification. Indexes should be reindexed.
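
The indexDocXslt parameter format listed above, a stylesheet name followed by optional parameters in parentheses, can be read as in the following sketch. It is an illustration of the format only, not the GSearch parser; the function names are hypothetical, and the sketch assumes that parameter values contain no commas or parentheses.

```python
def parse_index_doc_xslt(value):
    """Split 'name(p1=v1,p2=v2)' into the stylesheet name and a dict of params."""
    name, paren, rest = value.partition("(")
    params = {}
    if paren:
        if not rest.endswith(")"):
            raise ValueError(f"missing closing parenthesis: {value!r}")
        # Drop the closing parenthesis, then read the comma-separated pairs.
        for pair in filter(None, rest[:-1].split(",")):
            key, _, val = pair.partition("=")
            params[key] = val
    return name, params


def format_index_doc_xslt(name, params):
    """Build the parameter value from a stylesheet name and a dict of params."""
    if not params:
        return name
    pairs = ",".join(f"{key}={val}" for key, val in params.items())
    return f"{name}({pairs})"


if __name__ == "__main__":
    # Stylesheet and parameter names are example values.
    print(parse_index_doc_xslt("foxmlToLucene(someparamname=abc)"))
    print(format_index_doc_xslt("foxmlToLucene", {"someparamname": "abc"}))
```

Each parameter name given this way corresponds to an xsl:param of the same name in the indexing stylesheet.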

Background

The Fedora Generic Search Service, abbreviated GSearch, is part of the Fedora Service Framework.

The primary feature of GSearch is that it makes it easy to make your digital contents in Fedora searchable for yourself and your end-users.

GSearch was developed by Gert Schmeltz Pedersen at the Technical University of Denmark, with feedback and contributions from members of the Fedora community, including Adam Soroka, Alfred Heller, Alistair Young, Beth Kirschner, Bill Brannan, Binaya Poudyal, Blake Anderson, Boon Low, Chris Wilper, Christian Orthmann, Christian Tønsberg, Eric Brown, Eric James, Jonathan Green, Jun Yamog, Junran Lei, Lasse Aagren, Leire Urcelay, Luis Zorita, Matt Zumwalt, Matthias Razum, Michael Appleby, Michael Hoppe, Morten Sørensen, Nikolai Schwertner, Nilani Ganeshwaran, Patrick Monbaron, Pierre-Yves Landron, Ranju Upadhyaya, Robert Sherratt, Roel de Cock, Ryan E. Scherle, Sam Liberman, Scott Hammel, Shunde Zhang, Simon Lamb, Stephen Bayliss, Steve DiDomenico, Stuart Chalk, Swithun Crowe, Thierry Michel, and Xinjian Guo.

The work is funded by DEFF, Denmark's Electronic Research Library.

Fedora Generic Search Service Version 2.4

compatible with Fedora Version 3.5

I. DEMONSTRATION

See a demo at your own site, almost out-of-the-box

II. OVERALL DESCRIPTION

Major features

More on the updateIndex operation

Search engine plugins

Lucene

Solr

Zebra

III. CONFIGURATION

Install and configure for your application

Create the configuration files

Generate indexing stylesheet from example foxml files

Edit and use the basic property values

Configuring GSearch for automatic updates

Configuring Fedora for automatic updates

Multilingual configuration

IV. FURTHER USAGE

Full-text and metadata extraction from datastreams using Apache Tika

Customizable end-user search client

Search result filtering

Management of GSearch configurations in Fedora objects

Many-to-many relationship between Fedora objects and index documents

Many-to-many relationship between repositories and indexes

Embedded queries