ZoieSystem  

Featured, Phase-Design
Updated Aug 16, 2009 by john.w...@gmail.com

ZoieSystem Architecture


Architecture Overview

Zoie is a realtime indexing and search system, and as such needs relatively close coupling between the logically distinct Indexing and Searching subsystems: as soon as a document is made available to be indexed, it must be immediately searchable.

The ZoieSystem is the primary component of Zoie. It incorporates both Indexing (by implementing DataConsumer<V>) and Search (by implementing IndexReaderFactory<R extends IndexReader>).

Configuration

ZoieSystem can be configured via Spring:

<!-- An instance of a DataProvider:
     FileDataProvider recurses through a given directory and provides the DataConsumer
     indexing requests built from the gathered files.
     In the example, this provider needs to be started manually, and it is done via jmx.
-->
<bean id="dataprovider" class="proj.zoie.impl.indexing.FileDataProvider">
  <constructor-arg value="file:${source.directory}"/>
  <property name="dataConsumer" ref="indexingSystem" />
</bean>
	

<!-- 
  an instance of an IndexableInterpreter:
  FileIndexableInterpreter converts a text file into a lucene document, for example
  purposes only
-->
<bean id="fileInterpreter" class="proj.zoie.impl.indexing.FileIndexableInterpreter" />

<!-- A decorator for an IndexReader instance:
     The default decorator is just a pass through, the input IndexReader is returned.
-->
<bean id="idxDecorator" class="proj.zoie.impl.indexing.DefaultIndexReaderDecorator" />

<!-- A zoie system declaration, passed as a DataConsumer to the DataProvider declared above -->
<bean id="indexingSystem" class="proj.zoie.impl.indexing.ZoieSystem" init-method="start" destroy-method="shutdown">

  <!-- disk index directory-->
  <constructor-arg index="0" value="file:${index.directory}"/>     

  <!-- sets the interpreter -->
  <constructor-arg index="1" ref="fileInterpreter" />

  <!-- sets the decorator -->
  <constructor-arg index="2">
    <ref bean="idxDecorator"/>
  </constructor-arg>

  <!-- set the Analyzer, if null is passed, Lucene's StandardAnalyzer is used -->
  <constructor-arg index="3">
    <null/>
  </constructor-arg>

  <!-- sets the Similarity, if null is passed, Lucene's DefaultSimilarity is used -->
  <constructor-arg index="4">
    <null/>
  </constructor-arg>

  <!-- the following parameters indicate how often to trigger batched indexing;
       whichever of the following two events happens first will trigger indexing
  -->

  <!-- Batch size: how many items to put on the queue before indexing is triggered -->
  <constructor-arg index="5" value="1000" />

  <!-- Batch delay: how long to wait before indexing is triggered -->
  <constructor-arg index="6" value="300000" />        

  <!-- flag turning on/off real time indexing -->
  <constructor-arg index="7" value="true" />          
</bean>	

<!-- a search service -->
<bean id="mySearchService" class="com.mycompany.search.SearchService">
  <!-- IndexReader factory that produces index readers to build Searchers from -->
  <constructor-arg ref="indexingSystem" />
</bean>

Indexing

Documents get into the ZoieSystem for addition to Lucene indices by way of a decoupled DataProvider abstraction, which indexes via push: ZoieSystem implements the DataConsumer interface, the natural partner to DataProvider. What follows is a brief call-stack walk-through of indexing:

  • DataProvider is running on its own thread/pool/remote machine/etc, and controls the flow of DataEvent<V> by calling...
  • DataConsumer#consume(Collection<DataEvent<V>>), which in this case is the ZoieSystem which delegates to several internal
  • IndexDataLoaders (each of which also implements DataConsumer), which have an internal
  • IndexableInterpreter, whose job is to iterate over DataEvent<V> and produce Indexable objects via Indexable interpret(V data), and these resultant objects provide
  • org.apache.lucene.document.Document objects via buildDocuments(). These are handed off to (either the RamSearchIndex or DiskSearchIndex subclasses of)
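The call chain above can be sketched in a few lines of Java. This is a minimal illustration, not Zoie's actual code: the interfaces are simplified stand-ins for the types named in the walk-through, a String takes the place of a Lucene Document, and CollectingLoader, DelegatingConsumer, and demo() are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Simplified stand-ins for the Zoie types named above; not the real signatures.
final class IndexingFlowSketch {

    /** A versioned unit of data pushed by a provider. */
    static final class DataEvent<V> {
        final long version;
        final V data;
        DataEvent(long version, V data) { this.version = version; this.data = data; }
    }

    /** Receives batches of events; ZoieSystem and its loaders play this role. */
    interface DataConsumer<V> {
        void consume(Collection<DataEvent<V>> events);
    }

    /** Stand-in for Indexable; a String replaces the Lucene Document. */
    interface Indexable {
        List<String> buildDocuments();
    }

    /** Turns raw data of type V into an Indexable. */
    interface IndexableInterpreter<V> {
        Indexable interpret(V data);
    }

    /** Hypothetical loader: interprets each event and collects the built documents. */
    static final class CollectingLoader<V> implements DataConsumer<V> {
        private final IndexableInterpreter<V> interpreter;
        final List<String> indexed = new ArrayList<>();

        CollectingLoader(IndexableInterpreter<V> interpreter) { this.interpreter = interpreter; }

        @Override
        public void consume(Collection<DataEvent<V>> events) {
            for (DataEvent<V> event : events) {
                indexed.addAll(interpreter.interpret(event.data).buildDocuments());
            }
        }
    }

    /** Hypothetical top-level consumer that delegates every batch to its loaders. */
    static final class DelegatingConsumer<V> implements DataConsumer<V> {
        private final List<DataConsumer<V>> loaders;

        DelegatingConsumer(List<DataConsumer<V>> loaders) { this.loaders = loaders; }

        @Override
        public void consume(Collection<DataEvent<V>> events) {
            for (DataConsumer<V> loader : loaders) {
                loader.consume(events);
            }
        }
    }

    /** Pushes two events through the chain and returns what the loader collected. */
    static List<String> demo() {
        IndexableInterpreter<String> interpreter = text -> () -> List.of("doc:" + text);
        CollectingLoader<String> loader = new CollectingLoader<>(interpreter);
        List<DataConsumer<String>> loaders = List.of(loader);
        DataConsumer<String> system = new DelegatingConsumer<>(loaders);

        List<DataEvent<String>> events =
            List.of(new DataEvent<>(1L, "a"), new DataEvent<>(2L, "b"));
        system.consume(events);   // the provider's push
        return loader.indexed;
    }
}
```

The provider only ever sees DataConsumer, which is what lets it run on its own thread, pool, or machine without knowing how the events are interpreted or indexed.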

RAM-to-Disk Index Segment Copy:

Prior to 1.4.0, the RAM index and the disk index each tokenized document data and built inverted indexes separately. Version 1.4.0 eliminated this duplicate work: the disk index now copies index segments from the RAM index instead of going through tokenization and inversion again. This greatly reduces CPU load and disk I/O when documents are flushed to the disk index.

Overview: Real-time searchability in Zoie comes from the fact that ZoieSystem contains three IndexDataLoader objects:

  • a RAMLuceneIndexDataLoader, which is a simple wrapper around a RAMDirectory,
  • a DiskLuceneIndexDataLoader, which can index directly to the FSDirectory (followed by an optimize() call if a specified optimizeDuration has been exceeded) in batches via an intermediary
  • BatchedIndexDataLoader, whose primary job is to queue up and batch DataEvents that need to be flushed to disk

All write requests that come in through the DataProvider are teed off into a "currentWritable" RAMDirectory and into an in-memory queue of DataEvents. The BatchedIndexDataLoader collects this queue until it hits a threshold batch size (and a minimum delay time has passed), after which the batch is written to disk. If the amount of time since the last optimize is greater than the parametrized optimizationDuration, IndexWriter#optimize() is also called.
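The tee-and-batch behaviour can be illustrated with a small sketch. Everything here is hypothetical: plain lists stand in for the RAMDirectory and FSDirectory indexes, and the class and method names are invented for the example. The flush rule used is "batch size reached or delay elapsed, whichever comes first", as described in the configuration comments; treat that as an assumption about the trigger rather than a statement of Zoie's exact logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; lists of strings stand in for Lucene indexes.
final class TeeBatchSketch {
    private final int batchSize;
    private final long maxDelayMillis;
    private final List<String> ramIndex = new ArrayList<>();  // searchable immediately
    private final List<String> pending = new ArrayList<>();   // queued for the disk flush
    private final List<String> diskIndex = new ArrayList<>();
    private long lastFlushMillis;

    TeeBatchSketch(int batchSize, long maxDelayMillis, long nowMillis) {
        this.batchSize = batchSize;
        this.maxDelayMillis = maxDelayMillis;
        this.lastFlushMillis = nowMillis;
    }

    /** Tees one document into the RAM index and the pending queue, then flushes if due. */
    void consume(String doc, long nowMillis) {
        ramIndex.add(doc);
        pending.add(doc);
        boolean batchFull = pending.size() >= batchSize;
        boolean delayElapsed = nowMillis - lastFlushMillis >= maxDelayMillis;
        if (batchFull || delayElapsed) {   // assumed rule: whichever happens first
            flush(nowMillis);
        }
    }

    /** Moves the queued batch to disk; the RAM copy is no longer needed afterwards. */
    private void flush(long nowMillis) {
        diskIndex.addAll(pending);
        pending.clear();
        ramIndex.clear();
        lastFlushMillis = nowMillis;
    }

    /** Everything a search would see: flushed documents plus those still only in RAM. */
    List<String> searchable() {
        List<String> all = new ArrayList<>(diskIndex);
        all.addAll(ramIndex);
        return all;
    }

    int diskSize() {
        return diskIndex.size();
    }
}
```

The point of the tee is that searchable() includes a document as soon as consume() returns, regardless of whether the disk flush has happened yet.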

Searching

Acting as an IndexReaderFactory, ZoieSystem provides an "expert" search API for clients who need access to the IndexReader internals (for faceting, caching, etc.). These IndexReader instances are always subclasses of ReadOnlyIndexReader, so they can be used only for searching the index, not for modifying it. For clients who do not need or want such an expert API, an upcoming Zoie release will add a simpler SearcherFactory interface, which compartmentalizes the IndexReader internals a bit more by wrapping the IndexReaders in a MultiSearcher.

ZoieSystem delegates the getIndexReaders() call to the SearchIndexManager, which returns a triplet: two ZoieIndexReaders backed by RAMDirectory instances (these are transient, one reader per request; the indices are small, so this performs well) and one ZoieIndexReader that has an IndexReaderDispenser with multiple IndexReader views on the same disk-based FSDirectory.
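As a rough illustration of why a search sees three readers, the sketch below models one disk index and two RAM indexes with plain lists. It is not Zoie's implementation: the class and method names are hypothetical, the order of the returned triplet is arbitrary, and the hand-over in endFlush() (the second RAM index taking over as the first once a batch is on disk) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: lists of strings stand in for the disk and RAM indexes.
final class ReaderTripletSketch {
    private final List<String> disk = new ArrayList<>();
    private List<String> ramA = new ArrayList<>();  // receives writes while no flush is running
    private List<String> ramB = new ArrayList<>();  // receives writes during a flush
    private boolean flushing;

    /** Routes a new document to whichever RAM index is currently writable. */
    void add(String doc) {
        (flushing ? ramB : ramA).add(doc);
    }

    /** Marks the start of a batch flush; ramA is frozen from here on. */
    void beginFlush() {
        flushing = true;
    }

    /** Completes the flush: ramA's documents are on disk, ramB takes over as ramA. */
    void endFlush() {
        disk.addAll(ramA);
        ramA = ramB;
        ramB = new ArrayList<>();
        flushing = false;
    }

    /** Returns snapshot views of all three indexes, one "reader" per index. */
    List<List<String>> getIndexReaders() {
        List<List<String>> readers = new ArrayList<>();
        readers.add(new ArrayList<>(disk));
        readers.add(new ArrayList<>(ramA));
        readers.add(new ArrayList<>(ramB));
        return readers;
    }
}
```

A client that searches all three readers never loses sight of a document: it is in one of the RAM indexes until the flush completes and in the disk index afterwards.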

Comment by royde...@gmail.com, Feb 10, 2009

can u pls explain the role of RAM Index B?

Comment by project member john.w...@gmail.com, Feb 12, 2009

Ram A holds transient indexing data while the batch indexer sleeps; Ram B holds transient indexing data while the batch indexer is working on a batch.

Comment by pengxuep...@gmail.com, Dec 28, 2009

the images in the wiki did not display.

