Endeca at the NCSU Libraries
The Technology Behind the Endeca Catalog
Basic information
The new online catalog is an implementation of the Endeca Information Access Platform Guided
Navigation software. The Endeca software provides a "navigation engine"
process that responds to search queries, along with an API for building a web
application that communicates with this back-end server. NCSU created a
servlet-based Java web application that uses URL parameters to construct
the user's query, send that query to the Endeca navigation engine, and
display the results. The application server uses Apache and Tomcat.
Once a properly formatted query is submitted, the Endeca navigation engine
returns a complex object that includes the resulting records and their
properties (title, author, etc.) as well as all available refinements. The
web application parses this object to display the results list page.
JavaBeans are used to parse holdings-level data into easy-to-use objects
for the JSP files that actually produce the resulting web pages.
Nearly every feature on the results list page is enabled using Endeca's
pre-defined URL parameters. This includes sorting results and paging through
result sets.
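As a rough illustration of how sorting and paging can be driven entirely by URL parameters, the sketch below builds a results-page URL. It is written in Python for brevity rather than in the Java used by the actual application. The parameter names (N, Ntt, Ns, No) are modelled on Endeca's naming convention but are assumptions here, as are the "/search" path, the "title" sort key, and the page size; none of them are taken from NCSU's deployment.

```python
from urllib.parse import urlencode


def build_results_url(base, terms, sort_key=None, descending=False,
                      page=1, page_size=30):
    """Return a results-page URL for a search, an optional sort, and a page.

    All parameter names are assumed for illustration:
      N   - selected refinements ("0" meaning none)
      Ntt - the user's search terms
      Ns  - sort key and direction, written as "key|0" or "key|1"
      No  - zero-based offset of the first record on the page
    """
    params = {"N": "0", "Ntt": terms}
    if sort_key:
        params["Ns"] = f"{sort_key}|{1 if descending else 0}"
    offset = (page - 1) * page_size
    if offset:
        params["No"] = str(offset)
    # urlencode escapes spaces and the "|" separator for use in a URL.
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_results_url("/search", "deep learning",
                            sort_key="title", page=3, page_size=10))
```

Because the whole query lives in the URL, a "next page" or "sort by title" link is just the current URL with one parameter changed, which is why such features need little application logic of their own.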
Call numbers
Since Endeca is not a library-specific application, extra logic
was required to enable correct sorting for Library of Congress
call numbers. The first LC call number for a title is identified
as the sorting call number. Then NCSU uses a Perl script to
add padding to this sorting call number. When padded, the call
numbers sort correctly using Endeca's default ASCII
alphabetical sort.
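The padding idea can be sketched as follows. NCSU's script is written in Perl and its exact padding widths are not documented here, so this Python version is only an approximation under assumed widths: class letters padded to three characters, the whole number zero-padded to four digits, and the decimal part zero-padded on the right to six digits. The cutter and anything after it are passed through unchanged, which a production script would likely refine.

```python
import re

# Class letters, whole number, optional decimal part, then the remainder.
_LC_PATTERN = re.compile(r"^\s*([A-Z]{1,3})\s*(\d{1,4})(?:\.(\d+))?\s*(.*)$")


def pad_call_number(call_number):
    """Return a form of an LC call number that sorts correctly as plain text.

    The field widths (3, 4, 6) are assumptions chosen for illustration.
    """
    text = call_number.upper().strip()
    match = _LC_PATTERN.match(text)
    if match is None:
        return text  # not LC-shaped; leave it for ordinary sorting
    letters, whole, fraction, rest = match.groups()
    padded_letters = letters.ljust(3)              # "Q" sorts before "QA"
    padded_whole = f"{int(whole):04d}"             # 9 sorts before 76
    padded_fraction = (fraction or "").ljust(6, "0")  # .7 sorts before .73
    return f"{padded_letters}{padded_whole}.{padded_fraction} {rest}".rstrip()


if __name__ == "__main__":
    shelf = ["QA76.73 .J38", "Q175 .K95", "QA9 .H3", "QA76.7 .A1"]
    for call_number in sorted(shelf, key=pad_call_number):
        print(pad_call_number(call_number), "<-", call_number)
```

Without padding, a plain ASCII sort places "QA76.7" before "QA9" because the character "7" precedes "9"; with padding, the comparison is between "0009" and "0076".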
To make the LC Class dimension work properly, NCSU built a similarly
padded LC hierarchy based on the documentation on the Library of Congress web
site. The hierarchy creates call number ranges into which the padded LC
class numbers fall. Call number sorting uses only the first call number
for a title, but the LC class number is identified for every item
belonging to a title, since it is theoretically possible for a single
title to have different LC class numbers. For the LC class number, NCSU
uses only the first 1-3 letters of the call number and the decimal
number that follows them (the cutter is excluded).
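A minimal sketch of the class lookup, in Python, is shown below. It extracts the class letters and decimal number, pads them, and tests the result against padded ranges. The padding widths are assumptions, and the two-entry hierarchy is a tiny illustrative excerpt rather than the table NCSU built.

```python
import re


def lc_class(call_number):
    """Return the padded LC class (letters plus number, no cutter), or None."""
    match = re.match(r"\s*([A-Z]{1,3})\s*(\d{1,4})(?:\.(\d+))?",
                     call_number.upper())
    if match is None:
        return None
    letters, whole, fraction = match.groups()
    # Assumed widths: 3 letters, 4 integer digits, 6 decimal digits.
    return f"{letters:<3}{int(whole):04d}.{(fraction or '').ljust(6, '0')}"


# Illustrative excerpt only: (padded start, padded end, label),
# listed from the broadest range to the narrowest.
HIERARCHY = [
    (lc_class("QA1"), lc_class("QA939"), "Mathematics"),
    (lc_class("QA75.5"), lc_class("QA76.95"),
     "Electronic computers. Computer science"),
]


def lc_path(call_number):
    """Return the labels of every range that contains the call number."""
    key = lc_class(call_number)
    if key is None:
        return []
    return [label for low, high, label in HIERARCHY if low <= key <= high]


if __name__ == "__main__":
    print(lc_path("QA76.73 .J38"))
    print(lc_path("QA9 .H3"))
```

Because both the ranges and the class numbers are padded the same way, range membership reduces to two string comparisons.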
Data processing
The Endeca portion of the online catalog does not use a live
index of NCSU's MARC data. NCSU started with a full
bibliographic and item record export from SirsiDynix Unicorn.
Each night at 00:30, the system generates a report of records
that have been added to or modified in the MARC database.
Using the SirsiDynix cat_key (a unique identifier for the
record), the Endeca database is updated. The process works
like this:
- System creates list of updated cat_keys
- System extracts MARC records and holdings (stored in MARC 999) for the list of cat_keys
- The MARC records are reformatted into a flat file using MARC4J
- A Perl script compares the modified/added record list to the records already in
the Endeca data, and overwrites or adds records based on a cascading rule of unique
identifiers (unfortunately, SirsiDynix does not have a unique record identifier in the MARC record)
- Once the data is updated, a full re-index of the Endeca data is run (note:
Endeca does offer a partial-indexing module that processes updates only)
- The Endeca navigation engine is restarted
The process outlined above is managed with a shell script that runs every night from cron. The entire process, including re-indexing the entire database, takes approximately 7 hours. Future changes in the architecture of the technical
backend should cut this re-indexing time in half.
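The overwrite-or-add step in the nightly process can be sketched as below. The production script is written in Perl and its cascading rule is not spelled out here, so this Python version makes a simplifying assumption: each record is keyed by the first identifier it carries from a priority list. The field names in ID_FIELDS are hypothetical placeholders, not the identifiers NCSU actually uses.

```python
# Hypothetical identifier fields, tried in priority order.
ID_FIELDS = ("cat_key", "control_number", "isbn")


def record_key(record):
    """Return (field, value) for the first identifier present, else None."""
    for field in ID_FIELDS:
        value = record.get(field)
        if value:
            return (field, value)
    return None


def apply_updates(existing, updates):
    """Overwrite matching records and append new ones.

    Records that carry none of the identifiers are skipped in this sketch.
    """
    merged = {}
    for record in existing:
        key = record_key(record)
        if key is not None:
            merged[key] = record
    for record in updates:
        key = record_key(record)
        if key is not None:
            merged[key] = record  # same key: overwrite; new key: add
    return list(merged.values())


if __name__ == "__main__":
    current = [{"cat_key": "1", "title": "Old title"},
               {"isbn": "x", "title": "Unchanged"}]
    nightly = [{"cat_key": "1", "title": "New title"},
               {"cat_key": "2", "title": "Added"}]
    for record in apply_updates(current, nightly):
        print(record["title"])
```

After a merge like this, the full re-index and engine restart described above would run against the updated flat file.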
New Features
We are continually working to improve the functionality of our Endeca catalog.
The CatalogWS (web services) project is one such effort to make catalog data available via XML through a simple web API. Our goal
is to enable easier access to the catalog data so that other applications can take advantage of it. See CatalogWS Applications for a list of applications that take advantage of these web services.