This is the one-and-only documentation page for the Fedora Generic Search Service, abbreviated fedoragsearch or GSearch.

You, the reader, are presumably responsible for or involved in making your digital contents in Fedora searchable for your end-users. GSearch makes this task relatively easy.

GSearch comes with three plugins for top-class open-source search engines: Apache Lucene, Apache Solr, and Zebra.

Your choice of search engine plugin depends on circumstances:

The choice is made by configuration.

Table of Contents

I. DEMONSTRATION
      See a demo at your own site, almost out-of-the-box
II. OVERALL DESCRIPTION
      Major features
      More on the updateIndex operation
      Search engine plugins
III. CONFIGURATION
      Install and configure for your application
      Create the configuration files
      Generate indexing stylesheet from example foxml files
      Edit and use the basic property values
      Configuring GSearch for automatic updates
      Configuring Fedora for automatic updates
      Multilingual configuration
IV. FURTHER USAGE
      Full-text and metadata extraction from datastreams using Apache Tika
      Customizable end-user search client
      Search result filtering
      Management of GSearch configurations in Fedora objects
      Many-to-many relationship between Fedora objects and index documents
      Many-to-many relationship between repositories and indexes
      Embedded queries
      Building from source
V. HISTORY
      New features in version 2.4
      New features in version 2.3
      New features in version 2.2
      New features in version 2.1.1
      New features in version 2.1
      New features in version 2.0
      Background

I. DEMONSTRATION

See a demo at your own site, almost out-of-the-box

Perform these steps:

II. OVERALL DESCRIPTION

Major features

The service has the following major features:

You are encouraged to share problems and experience with the Fedora community: send mail to fedora-commons-users, to Chris Wilper, or to Gert Schmeltz Pedersen.

The following figure gives a first overview for a developer who will use GSearch in a Fedora application:

The figure shows:

GSearch may run in a separate web server and may index more than one Fedora repository, and it may update more than one index in parallel.

XSLT stylesheets are part of the configuration of GSearch, and XSLT transformations play an essential role in the workflow:

More on the updateIndex operation

Search engine plugins

Lucene

The Lucene plugin comes in fedoragsearch.war as the Java package dk.defxws.fgslucene together with the Apache Lucene Java libraries.

The Lucene plugin is selected by configuration, as explained below.

Lucene is rich in functionality, and this plugin exploits only a small fraction of it. As a Java programmer, you may have ideas for further exploitation, which you can realize by implementing an enhanced version of the plugin. Please share such ideas and implementations with the Fedora community.

Solr

Download, install, and configure the Solr server as described on the Solr web site.

The Solr plugin comes in fedoragsearch.war as the Java package dk.defxws.fgssolr.

The Solr plugin is selected by configuration, as explained below. It depends on the configuration of the Solr server.

This plugin indexes documents via the HTTP POST interface of Solr. Searches may be performed via the Solr native HTTP GET to the Solr server and via gfindObjects, which accesses the Lucene index directly. Solr functionality does not include browsing; however, the plugin offers this via the browseIndex operation.
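As an illustration of the Solr native HTTP GET search mentioned above, the following Python sketch builds a query URL without contacting any server. The base URL http://localhost:8983/solr, the field name dc.title, and the helper name solr_select_url are assumptions made for the example. The select path with the q and rows parameters is the usual Solr query interface, but your Solr configuration may expose a different path.

```python
from urllib.parse import urlencode


def solr_select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a Solr search URL for an HTTP GET request.

    base_url is hypothetical here; use the address of your own Solr server.
    """
    # urlencode percent-encodes reserved characters such as ':' in the query.
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/select?{params}"


# Hypothetical server address and index field name, for illustration only.
url = solr_select_url("http://localhost:8983/solr", "dc.title:fedora")
print(url)
```

You may paste the printed URL into a browser or pass it to an HTTP client of your choice to see the native Solr response.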

Zebra

The Zebra plugin comes in fedoragsearch.war as the Java package dk.defxws.fgszebra.

The Zebra plugin is selected by configuration, as seen in FgsConfig/FgsConfigIndexTemplate/Zebra/zebraconfig. That directory includes a README file, which explains how to get, install, and configure Zebra.

III. CONFIGURATION

Install and configure for your application

Perform these steps:

Configuring GSearch for automatic updates

As of version 2.1, GSearch has the ability to listen to update messages provided by Fedora. These messages are sent via JMS, so a JMS provider must be available (a JMS provider is included with Fedora 3.0). In order to configure the update listener, open updater.properties and set the following property values. These values will most likely be the same as those specified in your Fedora configuration.

If you decide not to use the automatic updates feature in GSearch, you'll need to open fedoragsearch.properties and remove (or comment out) the line specifying fedoragsearch.updaternames. This will disable the update listener.
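If you prefer to script this change, the following Python sketch comments out the fedoragsearch.updaternames line in the text of a properties file. It is only an illustration: the function name, the other property, and the value BasicUpdaters are assumptions for the example, and you may just as well edit the file by hand.

```python
import re

# Matches an active (uncommented) line that sets fedoragsearch.updaternames.
# The lookahead keeps longer keys that merely start with this name untouched.
UPDATER_LINE = re.compile(r"^\s*fedoragsearch\.updaternames(?![\w.])")


def disable_update_listener(properties_text: str) -> str:
    """Return the properties text with the updaternames line commented out."""
    lines = []
    for line in properties_text.splitlines():
        if UPDATER_LINE.match(line):
            lines.append("# " + line)
        else:
            lines.append(line)
    ending = "\n" if properties_text.endswith("\n") else ""
    return "\n".join(lines) + ending


# Hypothetical file content, for illustration only.
sample = (
    "some.other.property = value\n"
    "fedoragsearch.updaternames = BasicUpdaters\n"
)
print(disable_update_listener(sample))
```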

Configuring Fedora for automatic updates

Fedora 3.0 added the ability to send a message whenever a change is made to the content of the repository (through API-M). This messaging capability must be enabled and configured to work properly. See the Fedora documentation for instructions on configuring messaging.

As an alternative to updates via messaging, it is possible to configure Fedora to send a signal via REST to GSearch when objects are added, modified, or purged. Messaging is the preferred method for automatic updates; the REST-based technique, while still available, should be considered deprecated. It is not recommended to use both the update listener and REST-based updates.

To enable REST-based updates, edit your fedora.fcfg file and change the class of the fedora.server.storage.DOManager module to org.fcrepo.server.storage.GSearchDOManager. Then populate the following module parameters as needed:

Multilingual configuration

Add the attribute

URIEncoding="UTF-8"
to the Connector element in .../tomcat/conf/server.xml and in .../tomcat/conf/server_fedoraTemplate.xml in order to search for special characters such as the Spanish "ñ" and "í" (thanks to Luis Zorita).
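To see why this setting matters, the following Python sketch decodes one percent-encoded search term in two ways. It assumes that the client sends the term UTF-8 encoded and that a server without the URIEncoding attribute would decode it as ISO-8859-1; both are assumptions about your setup, and the term "niño" is only an example.

```python
from urllib.parse import quote, unquote

term = "niño"

# What a client sends in the URL when it encodes the term as UTF-8.
encoded = quote(term, encoding="utf-8")

# Decoding with the matching charset restores the original term.
decoded_utf8 = unquote(encoded, encoding="utf-8")

# Decoding the same bytes as ISO-8859-1 yields a different string,
# which will not match the indexed term.
decoded_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)         # ni%C3%B1o
print(decoded_utf8)    # niño
print(decoded_latin1)  # niÃ±o
```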

IV. FURTHER USAGE

Full-text and metadata extraction from datastreams using Apache Tika

Parameters for getDatastreamFromTika, getDatastreamTextFromTika, and getDatastreamMetadataFromTika:

indexFieldTagName
      either "IndexField" (with the Lucene plugin) or "field" (with the Solr plugin)
textIndexField (not used with getDatastreamMetadataFromTika)
      fieldSpec for the text index field, null or empty if not to be generated
indexfieldnamePrefix (not used with getDatastreamTextFromTika)
      optional or empty, prefixed to the metadata index field names
selectedFields (not used with getDatastreamTextFromTika)
      comma-separated list of metadata fieldSpecs, if empty then all fields are included with default params

fieldSpec
      metadataFieldName [ '=' indexFieldName] [ '/' [index] [ '/' [store] [ '/' [termVector] [ '/' [boost]]]]]
      - metadataFieldName must be exactly as extracted by Tika from the document. You may see the available names, if you log in debug mode and look for "METADATA name=" under "fullDsId=" in the log, when "getFromTika" was called during updateIndex
      - indexFieldName is used as the generated index field name. If not given, GSearch uses metadataFieldName after replacement of the characters ' ', ':', '/', '=', '(', ')' with '_'
      - the following parameters are used with Lucene (with Solr these values are specified in schema.xml)
      - index: [ 'TOKENIZED' | 'UN_TOKENIZED' ]   # first alternative is default
      - store: [ 'YES' | 'NO' ]   # first alternative is default
      - termVector: [ 'YES' | 'NO' ]   # first alternative is default
      - boost: <decimal number>   # '1.0' is default
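
The fieldSpec syntax and its defaults can be sketched in Python as follows. This is not the GSearch implementation, only a reading of the rules above. It makes the simplifying assumption that the metadata field name written in the spec contains no '=' or '/' characters; the class name, the function names, and the sample field names are hypothetical.

```python
from dataclasses import dataclass

# Characters replaced with '_' when the index field name is derived
# from the metadata field name, as listed in the description above.
REPLACED_CHARS = " :/=()"


@dataclass
class FieldSpec:
    metadata_field_name: str
    index_field_name: str
    index: str = "TOKENIZED"   # first alternative is default
    store: str = "YES"         # first alternative is default
    term_vector: str = "YES"   # first alternative is default
    boost: float = 1.0         # '1.0' is default


def default_index_field_name(metadata_field_name: str) -> str:
    """Derive the index field name used when none is given in the spec."""
    return "".join("_" if ch in REPLACED_CHARS else ch
                   for ch in metadata_field_name)


def parse_field_spec(spec: str) -> FieldSpec:
    """Parse name[=indexName][/index[/store[/termVector[/boost]]]].

    Assumption: the metadata field name itself contains no '=' or '/'.
    """
    head, *options = spec.split("/")
    name, _, alias = head.partition("=")
    name, alias = name.strip(), alias.strip()
    # Pad with empty strings so omitted trailing options fall back to defaults.
    options = [opt.strip() for opt in options] + [""] * 4
    index, store, term_vector, boost = options[:4]
    return FieldSpec(
        metadata_field_name=name,
        index_field_name=alias or default_index_field_name(name),
        index=index or "TOKENIZED",
        store=store or "YES",
        term_vector=term_vector or "YES",
        boost=float(boost) if boost else 1.0,
    )


# Hypothetical field names, for illustration only.
print(parse_field_spec("Content-Type=mimetype/UN_TOKENIZED//NO/2.5"))
print(default_index_field_name("dc:title (main)"))
```

In the first example the empty position between the two slashes leaves store at its default, which is how the nested optional brackets in the syntax are read here.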

Customizable end-user search client

The download contains the following files in webapps/fedoragsearch/ that you may customize:

Search result filtering

Search result filtering shows only those search hits that the user is actually permitted to read. Three solutions have been investigated, demonstrated, and presented here. The demonstration is also included with the GSearch distribution in .../WEB-INF/classes/configDemoSearchResultFiltering/. In brief, the three solutions are:

Both in-search and pre-search filtering face the challenge of keeping the filtering mechanism in exact correspondence with the XACML policies.

For your own application, you have to select the preferred searchResultFilteringType in fedoragsearch.properties and set the searchResultFilteringModule to a class that you program yourself, either as a subclass of the demo class dk.defxws.fedoragsearch.server.SearchResultFilteringDemoImpl or as an implementation of the interface dk.defxws.fedoragsearch.server.SearchResultFiltering.

Management of GSearch configurations in Fedora objects

Many-to-many relationship between Fedora objects and index documents

Many-to-many relationship between repositories and indexes

Embedded queries

This is a mechanism that allows you to embed risearch queries in Lucene or Solr queries, and vice versa.

This provides interaction with the Resource Index, both when you index and when you search.

It compensates for the lack of joins in bibliographic query languages such as those of Lucene and Solr, and it compensates for the lack of text search functionality in logic languages such as the risearch query languages.

The full potential of this mechanism still has to be explored and realized.

These preliminary examples show some of the potential:

Building from source

Get the source from GitHub:

  git clone https://github.com/fcrepo/gsearch.git

To build fedoragsearch.war in FedoraGenericSearch/dist:

  cd FedoraGenericSearch
  ant buildfordownload

To build fedoragsearch.war in FedoraGenericSearch/localdist for local testing:

  cd FedoraGenericSearch
  ant -Dlocal.FEDORA_HOME=... -Dlocal.SOLR_HOME=... -Dlocal.SOLR_SERVER=... buildforlocaltest

To run tests, copy localdist/fedoragsearch.war into the Tomcat webapps directory of a Fedora repository that has the demo objects and has the MessagingModule enabled.

Create a test user in fedora-users.xml:

    <user name="fgsTester" password="fgsTesterPassword">
      <attribute name="fedoraRole">
        <value>tester</value>
      </attribute>
    </user>

Test operations on the lucene plugin:

    ant junit-lucene  
    ant junit-testsonlucene
    ant junit-fgs23  
    ant junit-lucene-fgs24_1010 
    ant junit-lucene-fgs24_1019  

Test operations on the solr plugin, after startup of the Solr server:

    ant junit-solr
    ant junit-solr-fgs24_1010

Test operations on the zebra plugin, after you install, configure, and start up the Zebra server (see $FEDORA_HOME/tomcat/webapps/fedoragsearch/WEB-INF/classes/configDemoOnZebra/index/DemoOnZebra/zebraconfig/README):

    ant junit-zebra

V. HISTORY

New features in version 2.4

You may also see the complete list of issues for GSearch.

New features in version 2.3

You may also see the complete list of issues for GSearch 2.3.

New features in version 2.2

New features in version 2.1.1

New features in version 2.1

New features in version 2.0

Background

The Fedora Generic Search Service, abbreviated GSearch, is part of the Fedora Service Framework.

The primary feature of GSearch is that it makes it easy to make your digital contents in Fedora searchable for yourself and your end-users.

GSearch was developed by Gert Schmeltz Pedersen at the Technical University of Denmark, with feedback and contributions from members of the Fedora community, including Adam Soroka, Alfred Heller, Alistair Young, Beth Kirschner, Bill Brannan, Binaya Poudyal, Blake Anderson, Boon Low, Chris Wilper, Christian Orthmann, Christian Tønsberg, Eric Brown, Eric James, Jonathan Green, Jun Yamog, Junran Lei, Lasse Aagren, Leire Urcelay, Luis Zorita, Matt Zumwalt, Matthias Razum, Michael Appleby, Michael Hoppe, Morten Sørensen, Nikolai Schwertner, Nilani Ganeshwaran, Patrick Monbaron, Pierre-Yves Landron, Ranju Upadhyaya, Robert Sherratt, Roel de Cock, Ryan E. Scherle, Sam Liberman, Scott Hammel, Shunde Zhang, Simon Lamb, Stephen Bayliss, Steve DiDomenico, Stuart Chalk, Swithun Crowe, Thierry Michel, and Xinjian Guo.

The work is funded by DEFF, Denmark's Electronic Research Library.