This is the one-and-only documentation page for the Fedora Generic Search Service, abbreviated fedoragsearch or GSearch.
You, the reader, are presumably responsible for or involved in making your digital contents in Fedora searchable for your end-users. GSearch makes this task relatively easy.
GSearch comes with three plugins for top-class open-source search engines, Apache Lucene, Apache Solr, and Zebra.
Your choice of search engine plugin depends on circumstances:
The choice is made through configuration.
Perform these steps:
<module role="org.fcrepo.server.messaging.Messaging" class="org.fcrepo.server.messaging.MessagingModule"> <comment>Fedora's Java Messaging Service (JMS) Module</comment> <param name="enabled" value="true"/>
<user name="fgsAdmin" password="fgsAdminPassword"> <attribute name="fedoraRole"> <value>administrator</value> </attribute> </user>
Note that only users named 'fedoraAdmin' or 'fgsTester', or whose names start with 'fgsAdmin', are authorized to perform updateIndex actions.
> ant -f fgsconfig-basic.xml
This ant script finishes by writing the fgsconfigFinal files to the classpath location that you have chosen, so you need to run it with permission to write there.
The service has the following major features:
You are encouraged to share problems and experience with the Fedora community: send mail to fedora-commons-users, to Chris Wilper, or to Gert Schmeltz Pedersen.
The following figure gives a first overview for a developer who will use GSearch in a Fedora application:
The figure shows:
GSearch may run in a separate web server and may index more than one Fedora repository, and it may update more than one index in parallel.
XSLT stylesheets are part of the configuration of GSearch, and XSLT transformations play an essential role in the workflow:
The Lucene plugin comes in fedoragsearch.war as the java package dk.defxws.fgslucene together with the Apache Lucene java libraries.
The Lucene plugin is used by configuration as explained below.
Lucene is rich in functionality, and this plugin exploits only a small fraction of it. As a Java programmer, you may have ideas for further exploitation, which you can realize by implementing an enhanced version of the plugin. Please share such ideas and implementations with the Fedora community.
The Solr server is downloaded, installed and configured as described at the Solr web site.
The Solr plugin comes in fedoragsearch.war as the java package dk.defxws.fgssolr.
The Solr plugin is used by configuration as explained below. It has dependencies on the configuration of the Solr server.
This plugin indexes documents via the HTTP POST interface of Solr. Searches may be performed via the Solr native HTTP GET to the Solr server and via gfindObjects, which accesses the Lucene index directly. Solr functionality does not include browsing; however, the plugin offers it via the browseIndex operation.
The Zebra plugin comes in fedoragsearch.war as the java package dk.defxws.fgszebra.
The Zebra plugin is configured as shown in FgsConfig/FgsConfigIndexTemplate/Zebra/zebraconfig, which includes a README file explaining how to get, install, and configure Zebra.
Perform these steps:
> ant generateIndexingXslt
FgsConfig/FgsConfigIndexTemplate/Lucene/foxmlToLuceneGenerated.xslt
and
FgsConfig/FgsConfigIndexTemplate/Solr/foxmlToSolrGenerated.xslt
You may use them as they are, or copy them to another name and edit them; there are probably many index fields that you do not want. There are also foxmlToLucene.xslt and foxmlToSolr.xslt files, useful for the Fedora demo objects. Put the name of the stylesheet you choose into the basic property file in order to use it at indexing time.
> ant -f fgsconfig-basic.xml
shared loader=
and state that location in fgsconfig-basic.properties. Make sure that there is only one 'fgsconfigFinal' directory and one log4j.xml file in the classpath.
.../webapps/fedoragsearch/client/
make the file executable, and run
sh runRESTClient.sh
then you will get the usage instructions.
As of version 2.1, GSearch has the ability to listen to update messages provided by Fedora. These messages are sent via JMS, so a JMS provider must be available (a JMS provider is included with Fedora 3.0). In order to configure the update listener, open updater.properties and set the following property values. These values will most likely be the same as those specified in your Fedora configuration.
If you decide not to use the automatic updates feature in GSearch, you'll need to open fedoragsearch.properties and remove (or comment out) the line specifying fedoragsearch.updaternames. This will disable the update listener.
Fedora 3.0 added the ability to send a message whenever a change is made to the content of the repository (through API-M). This messaging capability must be enabled and configured to work properly. See the Fedora documentation for instructions on configuring messaging.
As an alternative to updates via messaging, it is possible to configure Fedora to send a signal via REST to GSearch when objects are added, modified, and purged. Using messaging is the preferred method for automatic updates, and this technique, while still available, should be considered deprecated. It is not recommended to use both the update listener and REST-based updates.
To enable REST-based updates, edit your fedora.fcfg file and change the class of the fedora.server.storage.DOManager module to org.fcrepo.server.storage.GSearchDOManager.
Then populate the following module parameters as needed:
gSearchRESTURL - The REST endpoint for GSearch, for example, http://localhost:8080/fedoragsearch/rest
gSearchUsername - If GSearch is protected by authentication, this is the username that Fedora should use to authenticate.
gSearchPassword - The password for the above user, if applicable.
Add the attribute URIEncoding="UTF-8" to .../tomcat/conf/server.xml and to .../tomcat/conf/server_fedoraTemplate.xml in order to search special characters like the Spanish "ñ", "í" etc. (thanks to Luis Zorita).
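The following minimal Python sketch illustrates why the attribute matters. Python is not part of GSearch and is used here only for illustration. The sketch assumes that the client percent-encodes the query as UTF-8 and that a Tomcat connector without URIEncoding="UTF-8" decodes it as ISO-8859-1, which was the default in older Tomcat versions.

```python
from urllib.parse import quote, unquote

term = "ñ"

# A client that encodes the query as UTF-8 sends two bytes for "ñ".
encoded = quote(term, encoding="utf-8")

# Decoding with the matching charset restores the original character.
decoded_utf8 = unquote(encoded, encoding="utf-8")

# Decoding as ISO-8859-1 (the assumed server behaviour without the
# URIEncoding attribute) turns the two bytes into two wrong characters.
decoded_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)  # %C3%B1
```

With the mismatched charset, the search term that reaches the index lookup is no longer "ñ", so the search finds nothing even though the term was indexed.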
Parameters for getDatastreamFromTika, getDatastreamTextFromTika, and getDatastreamMetadataFromTika | |
---|---|
indexFieldTagName | either "IndexField" (with the Lucene plugin) or "field" (with the Solr plugin) |
textIndexField (not used with getDatastreamMetadataFromTika) | fieldSpec for the text index field, null or empty if not to be generated |
indexfieldnamePrefix (not used with getDatastreamTextFromTika) | optional or empty, prefixed to the metadata index field names |
selectedFields (not used with getDatastreamTextFromTika) | comma-separated list of metadata fieldSpecs, if empty then all fields are included with default params |
fieldSpec | metadataFieldName [ '=' indexFieldName] [ '/' [index] [ '/' [store] [ '/' [termVector] [ '/' [boost]]]]] |
- metadataFieldName | must be exactly as extracted by Tika from the document. You may see the available names, if you log in debug mode and look for "METADATA name=" under "fullDsId=" in the log, when "getFromTika" was called during updateIndex |
- indexFieldName | is used as the generated index field name. If not given, GSearch uses metadataFieldName after replacement of the characters ' ', ':', '/', '=', '(', ')' with '_' |
- the following parameters are used with Lucene (with Solr these values are specified in schema.xml) | |
- index | [ 'TOKENIZED' \| 'UN_TOKENIZED' ] # first alternative is default |
- store | [ 'YES' \| 'NO' ] # first alternative is default |
- termVector | [ 'YES' \| 'NO' ] # first alternative is default |
- boost | <decimal number> # '1.0' is default |
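The fieldSpec syntax in the table can be read as follows. This is a minimal Python sketch for illustration only, not the GSearch implementation (which is written in Java). The names FieldSpec, parse_field_spec, and default_index_field_name are hypothetical. The sketch assumes that a metadataFieldName written in a fieldSpec contains no '/' or '=', since those characters act as separators.

```python
from dataclasses import dataclass

# Characters that are replaced with '_' in a default index field name.
REPLACED_CHARS = " :/=()"

# Allowed values per option; the first alternative is the default.
ALLOWED = {
    "index": ("TOKENIZED", "UN_TOKENIZED"),
    "store": ("YES", "NO"),
    "term_vector": ("YES", "NO"),
}


@dataclass
class FieldSpec:
    metadata_field_name: str
    index_field_name: str
    index: str = "TOKENIZED"
    store: str = "YES"
    term_vector: str = "YES"
    boost: float = 1.0


def default_index_field_name(metadata_field_name: str) -> str:
    """Derive the index field name used when none is given."""
    return "".join("_" if ch in REPLACED_CHARS else ch for ch in metadata_field_name)


def parse_field_spec(spec: str) -> FieldSpec:
    """Parse metadataFieldName[=indexFieldName][/index[/store[/termVector[/boost]]]]."""
    name_part, *options = spec.split("/")
    if len(options) > 4:
        raise ValueError(f"too many '/'-separated options in {spec!r}")
    metadata_name, _, index_name = name_part.partition("=")
    if not metadata_name:
        raise ValueError(f"missing metadataFieldName in {spec!r}")
    result = FieldSpec(metadata_name, index_name or default_index_field_name(metadata_name))
    for key, value in zip(("index", "store", "term_vector", "boost"), options):
        if not value:
            continue  # an empty position keeps the default
        if key == "boost":
            result.boost = float(value)
        elif value in ALLOWED[key]:
            setattr(result, key, value)
        else:
            raise ValueError(f"invalid {key} value {value!r} in {spec!r}")
    return result
```

For example, parse_field_spec("title=dc.title/UN_TOKENIZED//NO/2.5") yields an untokenized field named dc.title that keeps the default store value, has termVector NO, and a boost of 2.5. The field names in that example are made up.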
The download contains the following files in webapps/fedoragsearch/ that you may customize:
The file fieldsUnique.xml is found in FgsConfig/indexingXsltGenerator/generatedFiles. It has one element per index field generated from your example foxml files. You may add, modify, and delete index field elements to suit the needs of your end-user search client.
From the admin pages, this is the end-user search client page.
Search result filtering shows only those search hits that the user is actually permitted to read. Three solutions have been investigated, demonstrated, and presented here. The demonstration is also included with the GSearch distribution in .../WEB-INF/classes/configDemoSearchResultFiltering/. In brief, the three solutions are:
Both in-search and pre-search filtering face the challenge of exact correspondence between the filtering mechanism and the XACML policies.
For your own purposes, in fedoragsearch.properties, select the preferred searchResultFilteringType and set searchResultFilteringModule to a class that you program yourself, either as a subclass of the demo class dk.defxws.fedoragsearch.server.SearchResultFilteringDemoImpl or as an implementation of the interface dk.defxws.fedoragsearch.server.SearchResultFiltering.
http://.../fedoragsearch/rest?operation=configure&configureAction=setFgsConfigObjects
that copies the fgsconfigFinal files into datastreams of a Fedora object, where they can be modified (and even further datastreams be created), and one action
http://.../fedoragsearch/rest?operation=configure&configureAction=getFgsConfigObjects
that copies the datastreams into the fgsconfigFinal files, where the modifications will take effect immediately.
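The two configure URLs can be assembled programmatically, for instance from a maintenance script. The following Python sketch is an illustration only. The base URL and the helper name configure_url are assumptions; substitute the REST endpoint of your own installation. The sketch only builds the URL and does not handle authentication or send the request.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with your own GSearch REST endpoint.
BASE_URL = "http://localhost:8080/fedoragsearch/rest"

# The two configureAction values described above.
CONFIGURE_ACTIONS = ("setFgsConfigObjects", "getFgsConfigObjects")


def configure_url(configure_action: str, base_url: str = BASE_URL) -> str:
    """Build the REST URL for one of the two configure actions."""
    if configure_action not in CONFIGURE_ACTIONS:
        raise ValueError(f"unknown configureAction: {configure_action!r}")
    query = urlencode({"operation": "configure", "configureAction": configure_action})
    return f"{base_url}?{query}"


print(configure_url("setFgsConfigObjects"))
# http://localhost:8080/fedoragsearch/rest?operation=configure&configureAction=setFgsConfigObjects
```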
This is a mechanism that allows you to embed risearch queries in Lucene or Solr queries, and vice versa.
This provides interaction with the Resource Index, both when you index and when you search.
It compensates for the lack of joins in bibliographic query languages like those of Lucene and Solr,
and it compensates for the lack of text search functionality in logic languages like the risearch query languages.
The full potential of this mechanism still has to be explored and realized.
These preliminary examples show some of the potential:
Get the source from GitHub:
git clone https://github.com/fcrepo/gsearch.git
To build fedoragsearch.war in FedoraGenericSearch/dist:
cd FedoraGenericSearch
ant buildfordownload
To build fedoragsearch.war in FedoraGenericSearch/localdist for local testing:
cd FedoraGenericSearch
ant -Dlocal.FEDORA_HOME=... -Dlocal.SOLR_HOME=... -Dlocal.SOLR_SERVER=... buildforlocaltest
To run tests, copy localdist/fedoragsearch.war into tomcat webapps on a Fedora repository with demo objects with MessagingModule enabled.
Create a test user in fedora-users.xml
<user name="fgsTester" password="fgsTesterPassword"> <attribute name="fedoraRole"> <value>tester</value> </attribute> </user>
Test operations on the lucene plugin:
ant junit-lucene
ant junit-testsonlucene
ant junit-fgs23
ant junit-lucene-fgs24_1010
ant junit-lucene-fgs24_1019
Test operations on the solr plugin:
after startup of the solr server:
ant junit-solr
ant junit-solr-fgs24_1010
Test operations on the zebra plugin, after you install, configure, and start up the zebra server:
see $FEDORA_HOME/tomcat/webapps/fedoragsearch/WEB-INF/classes/configDemoOnZebra/index/DemoOnZebra/zebraconfig/README
ant junit-zebra
...&indexDocXslt=[xslt-name][(paramname1=value1[,paramname2=value2[,...]])]
Use the parameters at indexing time by putting an xsl:param statement in the indexing xslt stylesheet, like this:
<xsl:param name="someparamname" select="defaultvalue"/>
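The value of the indexDocXslt parameter can be assembled from a stylesheet name and a set of parameters. This Python sketch is an illustration only. The helper name index_doc_xslt_value and the example stylesheet name are hypothetical. Because the syntax above shows no escaping mechanism, the sketch assumes that names and values must not contain ',', '=', '(' or ')' and rejects them.

```python
from typing import Mapping, Optional

# Characters that delimit the parameter list in the syntax shown above.
RESERVED = ",=()"


def index_doc_xslt_value(xslt_name: str, params: Optional[Mapping[str, str]] = None) -> str:
    """Build xslt-name(paramname1=value1,paramname2=value2,...)."""
    if not params:
        return xslt_name  # the parameter list is optional
    for text in (*params.keys(), *params.values()):
        if any(ch in RESERVED for ch in text):
            raise ValueError(f"reserved character in {text!r}")
    pairs = ",".join(f"{name}={value}" for name, value in params.items())
    return f"{xslt_name}({pairs})"


# Hypothetical stylesheet name; the parameter name matches the xsl:param example.
print(index_doc_xslt_value("foxmlToLucene", {"someparamname": "somevalue"}))
# foxmlToLucene(someparamname=somevalue)
```

If the value is sent in a URL, it should additionally be percent-encoded by the HTTP client in the usual way.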
fgsindex.mergeFactor and fgsindex.maxBufferedDocs
will affect performance, see the index.properties file in configTestOnLucene. Also added
...?operation=updateIndex&action=optimize
which will perform IndexWriter.optimize(), which merges all segments together into a single segment, optimizing an index for search. Removed the optimize() call after each updateIndex.
The Fedora Generic Search Service, abbreviated GSearch, is part of the Fedora Service Framework.
The primary feature of GSearch is that it makes it easy to make your digital contents in Fedora searchable for yourself and your end-users.
GSearch was developed by Gert Schmeltz Pedersen at the Technical University of Denmark, with feedback and contributions from members of the Fedora community, including Adam Soroka, Alfred Heller, Alistair Young, Beth Kirschner, Bill Brannan, Binaya Poudyal, Blake Anderson, Boon Low, Chris Wilper, Christian Orthmann, Christian Tønsberg, Eric Brown, Eric James, Jonathan Green, Jun Yamog, Junran Lei, Lasse Aagren, Leire Urcelay, Luis Zorita, Matt Zumwalt, Matthias Razum, Michael Appleby, Michael Hoppe, Morten Sørensen, Nikolai Schwertner, Nilani Ganeshwaran, Patrick Monbaron, Pierre-Yves Landron, Ranju Upadhyaya, Robert Sherratt, Roel de Cock, Ryan E. Scherle, Sam Liberman, Scott Hammel, Shunde Zhang, Simon Lamb, Stephen Bayliss, Steve DiDomenico, Stuart Chalk, Swithun Crowe, Thierry Michel, and Xinjian Guo.
The work is funded by DEFF, Denmark's Electronic Research Library.