Issue 5: Enhancement: User defined plugins.
DPsearch can search across a very large set of documents (we have tested with over 20M). We can search the entire document space or parts of it, based on the concept of sections and limits (such as meta-tags, last-modified date ...). However, as with most search engines, searches are restricted to information that has been indexed. If we have new information about a document, or existing information that was not used to create special section or limit indexes, it becomes difficult to use without re-indexing the collection. In addition, some restrictions are best handled by other programs that can apply logic that is not necessarily of the "search" type.

A couple of examples:

Let's assume that the indexed documents carry some information about, say, the geography associated with each document. When the collection was originally indexed, geography was not considered important and no geography section was created. It would be nice to be able to search the collection for the search criteria and then filter the results by some geography restriction. Obviously the simplest solution would be to add a definition of a geography section and re-index the collection. However, with very large collections this is very expensive in terms of both time and disk space. In addition, we could end up with literally dozens if not hundreds of sections.

Another situation is where the documents found need to be restricted by some criterion not related to the search (e.g. only show the documents "permitted" to the user making the query). Again, in theory we could index some combination of ownership and other restrictions, but this information is quite dynamic and we would need to re-index all the time.

The proposed solution is to add the notion of "filter plugins" to dpsearch. The dpsearch engine gathers all the search results into an array and then, after removing duplicates, clones etc., retrieves the document information for a pageful. Imagine a small user-provided function that is called after the result list has been built and cleaned but before the document information retrieval step. The filter would get a list of record ids for the documents and return a modified list that may have some records removed (or added?) based on external criteria. This would allow fine-grained local control over the results.

If such a mechanism were available, we could solve the situations above as follows:

Build a new database table (in the same database as used by dpsearch or in a separate one) that tracks the record id along with columns for other metadata. The plugin would then filter the results using this database information. Clearly it will be slower than using the index natively, but for infrequently used but large metadata, or for new metadata that will eventually be indexed, it would work quite well as a transitional mechanism.

Similarly, for the second example the plugin would call an external permissions program that resolves the permission based on other criteria which have nothing to do with the search engine.

Finally, we could add records to the result set if deemed necessary (though I suspect that this is better done outside the search engine when creating a results page).

The changes that I see would be:

Ability to build a plugin - best would be a shared library that can be set up in the config file. If it is defined and present, the search engine would use it; if not, it would not.

The plugin API should be very simple (at least for starters):
- Call to initialize the plugin
- Call to re-initialize the plugin (when the search engine gets a HUP signal)
- Call to terminate the plugin
- Call to process a result list (I suspect it would only take an array of proposed results and a command line, and return the array of results)
- We could add additional APIs available to the plugin to access dpsearch functions for ease of writing the plugin - e.g. functions to print messages into the log ...

Changes to dpsearch:
- Additional parameters to pass information to the plugin, e.g. &pluginparms="parm1, parm2, parm3"
- Configuration file changes to define the plugin
- Code changes to call the plugin

Questions:
- What happens if there are multiple plugins? Particularly with regard to passing command-line parameters over.
- Where is the plugin called: when the results are obtained in cache.c or sql.c, or when they are assembled in search.c? Each has pluses and minuses in terms of access to information (e.g. if there are multiple indexes, calling the plugin from cache.c or sql.c means the plugin can get the correct database information and use it, whereas in search.c the search may be running on a machine with no access to the actual database).
- What languages should be allowed? Clearly the application is in C, so C or C++ is natural; however, it is also easier to write plugins in some scripting language.
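
To make the proposed plugin API more concrete, here is a minimal sketch in C of the four calls (initialize, re-initialize, terminate, process a result list). Everything in it is an assumption for illustration: the names filter_plugin, rec_id_t, demo_plugin and the demo_* functions do not exist in dpsearch, and the hard-coded allow-list merely stands in for the database table lookup or external permissions check described in this request.

```c
/* Hypothetical filter-plugin sketch; none of these names exist in dpsearch. */
#include <assert.h>
#include <stddef.h>

typedef int rec_id_t;   /* assumed type of a document record id */

/* Assumed plugin interface: one function pointer per proposed call. */
typedef struct {
    int    (*init)(const char *config);    /* called once at startup        */
    int    (*reinit)(const char *config);  /* called when HUP is received   */
    void   (*term)(void);                  /* called at shutdown            */
    /* Filters ids in place and returns the number of ids kept. */
    size_t (*filter)(rec_id_t *ids, size_t count, const char *params);
} filter_plugin;

/* Stand-in for an external lookup (metadata table, permissions service). */
static const rec_id_t allowed[] = { 3, 8, 21 };
static int ready = 0;

static int is_allowed(rec_id_t id)
{
    size_t i;
    for (i = 0; i < sizeof allowed / sizeof allowed[0]; i++)
        if (allowed[i] == id)
            return 1;
    return 0;
}

static int demo_init(const char *config)
{
    (void)config;        /* a real plugin would open its database here */
    ready = 1;
    return 0;            /* 0 means success in this sketch */
}

static int demo_reinit(const char *config)
{
    return demo_init(config);
}

static void demo_term(void)
{
    ready = 0;
}

static size_t demo_filter(rec_id_t *ids, size_t count, const char *params)
{
    size_t in, out = 0;
    (void)params;        /* e.g. the value of &pluginparms=... */
    assert(ids != NULL || count == 0);
    if (!ready)
        return count;    /* not initialised: pass results through unchanged */
    for (in = 0; in < count; in++)
        if (is_allowed(ids[in]))
            ids[out++] = ids[in];   /* compact in place, keeping order */
    return out;
}

/* Symbol the engine could look up after loading the shared library. */
const filter_plugin demo_plugin = {
    demo_init, demo_reinit, demo_term, demo_filter
};
```

In this sketch the engine would call demo_plugin.filter on the cleaned result array and treat the return value as the new result count. Filtering in place keeps the interface small and avoids allocation, but it only supports removing records; allowing a plugin to add records would need a different signature (for example, one that returns a newly allocated array).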
May 03, 2009
In the latest snapshot of version 4.53, the Limit command has been extended so that it can accept an SQL query which returns pairs of limit value and url.rec_id. E.g.

Limit prm:strcrc32 "SELECT label, rec_id FROM labels" pgsql://u:p@database.ext/site/

The third parameter (DBAddr) is optional; it specifies a connection to another SQL database where the limit table resides.

prm - the name of the limit, which is also the name of the CGI parameter used for this limit.
strcrc32 - the type of the limit; here the limit value is a string. Instead of strcrc32 it is possible to use any of the following limit types:
hex8str - a hex string or base-26 string similar to those used in categories; a nested limit will be created;
int - an integer value (4-byte integer).

In the search.htm and searchd.conf configuration files it is possible to specify a reduced variant of this Limit command:

Limit prm:strcrc32
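
As an illustration only, the geography example from the original request might be expressed with this extended Limit command roughly as follows. The limit name geo, the table doc_geo, its column region, and the connection string are all hypothetical; only the command syntax is taken from the description above.

```
# Full form, with the SQL query and the optional DBAddr
# (hypothetical table doc_geo with columns region and rec_id)
Limit geo:strcrc32 "SELECT region, rec_id FROM doc_geo" pgsql://user:password@dbhost/metadata/

# Reduced form for search.htm and searchd.conf
Limit geo:strcrc32
```

Under these assumptions, a search request would select the restriction through a CGI parameter named geo, since the limit name doubles as the parameter name.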
Status: Started
Owner: dp.maxime
Labels: -Type-Defect Type-Enhancement