GrassyKnoll: a Search Engine in Python
Updated Mar 10, 2008 by peter.fein
Tutorial

Introduction

This tutorial is intended to give an overview of how to work with GrassyKnoll. To follow along at home, you should already have GrassyKnoll installed. You'll also want to grab the source tarball, as the samples are not installed by default (see issue 74).

The tutorial runs a web server, the RestFrontend, which provides an interface to a Collection. Accessing the server with your standard web browser gives you a basic HTML interface to a subset of the Collection API. Sorry it's so ugly; the last time I made a web page was in 1997. Seriously. The full Collection API is available when talking to the server in a non-HTML format.

Working Environment

If the executables are not on your PATH, you'll need to adjust the commands accordingly.

You'll see lots of examples like:

pfein@brick:~/grassyknoll$ ls -l samples/
total 3944
drwxr-xr-x 5 pfein pfein    8192 2008-03-09 20:04 demo
-rwxr-xr-x 1 pfein pfein     615 2008-03-09 19:00 load_nsf.py
-rw-r--r-- 1 pfein pfein 2170880 2008-03-07 22:25 nsf_ra.tar.gz
-rw-r--r-- 1 pfein pfein 1842253 2008-02-27 19:13 shakespeare.tar.gz

pfein@brick:~/grassyknoll$ is your shell prompt. Everything else is the output of shell commands. If you've unpacked the tarball somewhere other than ~/grassyknoll, you'll want to change to that directory. Don't worry if the contents of your tarball don't exactly match the above.

Sample Data

We'll be using 2003 National Science Foundation Awards abstracts, which cover basic research in a wide variety of academic disciplines. See the data definitions.

Available Demos

Configuration files are provided for several backends.

File              Backend            Notes
lucene_config.py  LuceneBackend      Full-text search queries.
sqlite_config.py  SqliteBackend      Query by choosing among fixed values.
dict_config.py    DictionaryBackend  No queries.
remote_config.py  ClientBackend      Forwards to another (running) server. No queries. Listens on port 8081. Don't start with this!

This tutorial is written using the LuceneBackend. If you use one of the other backends, some of the output will be slightly different. However, the tutorial is identical for all backends (except for queries, where backend-specific details will be provided).

Running the Server

Fire up the server in a separate terminal:

pfein@brick:~/grassyknoll$ grassyknoll_d ~/grassyknoll/samples/demo/lucene_config.py
WARNING:SmartStorage:Created /home/pfein/grassyknoll/samples/demo/lucene_demo

This will create an empty Lucene index in samples/demo/lucene_demo/. The server takes a single argument: the name of a config file.

The server will print out the requests that come in:

localhost - - [10/Mar/2008 01:39:35] "GET / HTTP/1.1" 200 634
localhost - - [10/Mar/2008 01:39:57] "GET /pants HTTP/1.1" 404 238

You can watch this output to get a sense of what's happening behind the scenes. See RestUrls for more information on the URLs.
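If you'd rather tally the request log than read it by eye, the lines above can be picked apart with a regular expression. This is only a sketch: the pattern is inferred from the two sample lines, not from any documented GrassyKnoll log format, and `parse_log_line` is a name made up for this example.

```python
import re

# Pattern inferred from the two sample lines above; this is an assumption,
# not a documented GrassyKnoll log format.
LOG_LINE = re.compile(
    r'(?P<host>\S+) - - \[(?P<when>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log_line(line):
    """Return a dict of fields from one request log line, or None."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None  # e.g. a WARNING or INFO line from the server itself
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    return fields

sample = 'localhost - - [10/Mar/2008 01:39:57] "GET /pants HTTP/1.1" 404 238'
print(parse_log_line(sample))
```

Lines that don't look like requests (such as the SmartStorage warning) come back as None, so you can filter them out.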

Did it work?

Let's see what's in the server. Point your browser to http://localhost:8080/

Screenshot

Stopping the server

To stop the server, just hit ^C. You can shut down the server at any time and restart it by re-running the above command.

Advanced: Using curl

curl is a command line HTTP client. It's useful for debugging and getting a lower-level view of a webserver than is possible with a web browser.

pfein@brick:~/grassyknoll$ curl -H "Accept: application/json" http://localhost:8080/
{"ids": [],
"metadata": {"LuceneCollection_thread": "MainThread",
             "LuceneCollection_pid": 20432,
             "LuceneCollection_host": "brick",
             "LuceneCollection_time": 0.0011680126190185547}}

By adding the Accept: application/json header, we get results in JSON format.

The curl output has been edited for readability on this wiki.
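The same request can be made from Python's standard library. This is a minimal sketch, assuming the server is at the tutorial's address; `json_request` and `fetch_json` are helper names invented here, not part of GrassyKnoll.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/"  # the tutorial's server address

def json_request(url=BASE_URL):
    """Build a GET request that asks for JSON, like curl's -H flag above."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})

def fetch_json(url=BASE_URL):
    """Send the request and decode the body. Needs the server to be running."""
    with urllib.request.urlopen(json_request(url)) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request doesn't touch the network, so this runs anywhere.
request = json_request()
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

With the server up, `fetch_json()` should hand back the same structure curl printed, as a Python dict.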

Loading Data

Demos aren't very interesting without data. Load some:

pfein@brick:~/grassyknoll$ samples/load_nsf.py

Listing Documents

Going to the server root will give you a list of available document ids.

http://localhost:8080/

Screenshot

pfein@brick:~/grassyknoll$ curl -H "Accept: application/json" http://localhost:8080/
{"ids": ["a0300005", "a0300025", "a0300044", "a0300051", "a0300064", "a0300071",
         <...>
         "a0331381", "a0331387", "a0331497"],
"metadata": {"LuceneCollection_thread": "MainThread",
             "LuceneCollection_pid": 20432,
             "LuceneCollection_host": "brick",
             "LuceneCollection_time": 0.064599037170410156}}
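Once you have the JSON, pulling out the ids and the timing is a couple of lines. The sketch below works on a trimmed copy of the response above (the elided ids are simply left out); `summarize_listing` is a made-up helper name.

```python
import json

# Trimmed copy of the response above; the elided ids are left out.
body = '''{"ids": ["a0300005", "a0300025", "a0300044"],
 "metadata": {"LuceneCollection_thread": "MainThread",
              "LuceneCollection_pid": 20432,
              "LuceneCollection_host": "brick",
              "LuceneCollection_time": 0.064599037170410156}}'''

def summarize_listing(text):
    """Return (ids, times) from a JSON listing response."""
    payload = json.loads(text)
    # The metadata key prefix depends on the backend, so match on the suffix.
    times = [v for k, v in payload["metadata"].items() if k.endswith("_time")]
    return payload["ids"], times

ids, times = summarize_listing(body)
print(len(ids), "ids, first:", ids[0])
```

Matching on the `_time` suffix rather than a fixed key means the same helper should cope with the other backends, whose metadata keys carry a different prefix.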

Retrieve a Result

Clicking on an id will retrieve the corresponding result.

http://localhost:8080/a0300005

Screenshot

pfein@brick:~/grassyknoll$ curl -H "Accept: application/json" "http://localhost:8080/a0300005?fields=Title,Date,Award_Instr"
{"__id__": "a0300005",
"__url__": "\/a0300005",
"Award_Instr": "Standard Grant", 
"Date": "2003-03-26",
"Title": "A Model Analysis of Newly Released Galileo Electron Density Data"}
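The `fields` parameter in the curl example is just a comma-separated list in the query string. Here's a sketch of building that URL in Python; `document_url` is a hypothetical helper, and the only thing it assumes about the server is the URL shape shown above.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "http://localhost:8080/"  # the tutorial's server address

def document_url(doc_id, fields=None):
    """URL for one document, optionally limited to the named fields."""
    url = urljoin(BASE_URL, doc_id)
    if fields:
        # safe="," keeps the commas literal, matching the curl example.
        url += "?" + urlencode({"fields": ",".join(fields)}, safe=",")
    return url

print(document_url("a0300005", ["Title", "Date", "Award_Instr"]))
```

Leaving `fields` out gives the plain document URL, which is what clicking an id in the browser requests.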

Deleting Documents

Hitting the Delete button on a result page will delete that document.

http://localhost:8080/a0300005?method=DELETE

Screenshot

Verify that the document is gone by clicking on its id. You should get a 404 Not Found page.

http://localhost:8080/a0300005

pfein@brick:~/grassyknoll$ curl -H "Accept: application/json" -X DELETE http://localhost:8080/a0300005
{"ids": ["a0300005"],
"metadata": {"LuceneCollection_thread": "MainThread",
             "LuceneCollection_pid": 20432,
             "LuceneCollection_host": "brick",
             "LuceneCollection_time": 0.00030088424682617188}}
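For completeness, here's roughly what `curl -X DELETE` looks like in Python. It's a sketch using only the standard library; `delete_request` and `is_gone` are names invented for this example, and `is_gone` assumes the 404 behaviour described above.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8080/"  # the tutorial's server address

def delete_request(doc_id):
    """Build the same DELETE request that curl -X DELETE sends."""
    return urllib.request.Request(
        BASE_URL + doc_id,
        headers={"Accept": "application/json"},
        method="DELETE",
    )

def is_gone(doc_id):
    """True if fetching the document fails with 404. Needs a running server."""
    try:
        urllib.request.urlopen(BASE_URL + doc_id).close()
    except urllib.error.HTTPError as error:
        return error.code == 404
    return False

# Building the request doesn't send anything yet.
request = delete_request("a0300005")
print(request.get_method(), request.full_url)
```

To actually delete, pass the request to `urllib.request.urlopen`, then call `is_gone` with the same id to check.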

Queries

Different backends support different methods of querying the Collection.

LuceneBackend

Go back to the server homepage and enter some search terms. The server supports the Lucene query syntax.

For example, we'll search for abstracts about "biochemistry".

http://localhost:8080/__query__/search?q=biochemistry

Screenshot

pfein@brick:~/grassyknoll$ curl -H "Accept: application/json" "http://localhost:8080/__query__/search?q=biochemistry"
{"results": [{"__id__": "a0307212",
              "__url__": "\/a0307212",
              "__score__": 0.380184739828,
              "Investigator": "Himadri B. Pakrasi pakrasi@biology2.wustl.edu  (Principal Investigator current)\nBijoy K. Ghosh  (Co-Principal Investigator current)\nRalph S. Quatrano  (Co-Principal Investigator current)",
              "Total_Amt": 50000,
            <...>}],
"metadata": {"count": 3,
             "LuceneCollection_thread": "MainThread",
             "LuceneCollection_pid": 20432,
             "LuceneCollection_host": "brick",
             "LuceneCollection_time": 0.01209712028503418}}
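If your search terms contain spaces or Lucene punctuation, it's easier to let Python escape them than to do it by hand. A sketch, assuming only the `/__query__/search?q=` URL shown above; `search_url` and `ranked_ids` are hypothetical helper names, and the sample body is a trimmed copy of the response (first result only).

```python
import json
from urllib.parse import urlencode

SEARCH_URL = "http://localhost:8080/__query__/search"

def search_url(terms):
    """URL for a full-text search; urlencode escapes spaces and punctuation."""
    return SEARCH_URL + "?" + urlencode({"q": terms})

def ranked_ids(text):
    """Return (id, score) pairs from a search response, best score first."""
    results = json.loads(text)["results"]
    pairs = [(r["__id__"], r["__score__"]) for r in results]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Trimmed copy of the response above: only the first result and the count.
body = '''{"results": [{"__id__": "a0307212", "__score__": 0.380184739828}],
           "metadata": {"count": 3}}'''

print(search_url("biochemistry"))
print(ranked_ids(body))
```

The sort is there in case the server doesn't return results in score order; the tutorial doesn't say either way.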

SqliteBackend

Go back to the server homepage and choose an item from the menu.

For example, we'll query for "Cooperative Agreements".

http://localhost:8080/__query__/and?Award_Instr=Cooperative+Agreement

Screenshot

pfein@brick:~/grassyknoll$ curl -H "Accept: application/json" "http://localhost:8080/__query__/and?Award_Instr=Cooperative+Agreement"
{"results": [{"__id__": "a0310163",
              "__url__": "\/a0307300",
              "Expires": "2003-09-30", "Investigator": "Jeffrey T. Kiehl   (Principal Investigator current)",
              "Total_Amt": 25000,  "Award_Instr": "Cooperative Agreement",
            <...>}],
"metadata": {"count": 2,
             "SqliteCollectionReader_pid": 21778,
             "SqliteCollectionReader_time": 0.00316619873046875,
             "SqliteCollectionReader_host": "brick",
             "SqliteCollectionReader_thread": "MainThread"}}
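The `+` in that URL is just an encoded space, which Python's `urlencode` produces for you. A sketch with a made-up helper, `and_query_url`; the name `and` suggests several fields can be combined in one query, but the tutorial only shows one, so treat multi-field use as untested.

```python
from urllib.parse import urlencode

AND_URL = "http://localhost:8080/__query__/and"

def and_query_url(**criteria):
    """URL for a fixed-value query; each keyword is a field name and its value."""
    # Sorting keeps the parameter order stable from run to run.
    return AND_URL + "?" + urlencode(sorted(criteria.items()))

print(and_query_url(Award_Instr="Cooperative Agreement"))
```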

Other URLs

There are several other URLs that the server understands, but you'll need to type them in by hand.

Distributed Server

The remote_config.py file uses a ClientBackend to provide access to another GrassyKnoll server. It's a proof of concept of distributed computing in GrassyKnoll.

While leaving your first server running, open another terminal and run

pfein@brick:~/grassyknoll$ grassyknoll_d ~/grassyknoll/samples/demo/remote_config.py

This server listens on http://localhost:8081/ (note the different port). It forwards all requests to the original server. You can tell you're talking to the proxying server by the extra metadata.

Screenshot

By watching the server output, you can see the requests being forwarded:

pfein@brick:~/grassyknoll$ grassyknoll_d ~/grassyknoll/samples/demo/remote_config.py
localhost - - [10/Mar/2008 03:22:10] "GET / HTTP/1.1" 200 30394
pfein@brick:~/grassyknoll$ grassyknoll_d ~/grassyknoll/samples/demo/lucene_config.py
INFO:SmartStorage:Opened /home/pfein/grassyknoll/samples/demo/lucene_demo
localhost - - [10/Mar/2008 03:22:11] "GET / HTTP/1.1" 200 7904

curl

pfein@brick:~/grassyknoll$ curl -H "Accept: application/json" http://localhost:8081/
{"ids": ["a0300025", "a0300044", "a0300051", "a0300064", "a0300071",
         <...>
         "a0331290", "a0331381", "a0331387", "a0331497"],
"metadata": {"ClientCollection_thread": "MainThread", 
             "ClientCollection_pid": 22071,
             "ClientCollection_host": "brick",
             "ClientCollection_time": 0.11742901802062988,
             "LuceneCollection_pid": 22055, 
             "LuceneCollection_host": "brick",
             "LuceneCollection_time": 0.059844970703125,
             "LuceneCollection_thread": "MainThread"}}
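Since each server adds its own prefixed keys to the metadata, you can split them apart to see who did what. The sketch below groups the metadata from the response above by prefix; `group_by_collection` is a made-up name, and reading the difference in times as proxy overhead assumes the ClientCollection time includes the forwarded request, which the tutorial doesn't state.

```python
from collections import defaultdict

# Metadata copied from the response above.
metadata = {
    "ClientCollection_thread": "MainThread",
    "ClientCollection_pid": 22071,
    "ClientCollection_host": "brick",
    "ClientCollection_time": 0.11742901802062988,
    "LuceneCollection_pid": 22055,
    "LuceneCollection_host": "brick",
    "LuceneCollection_time": 0.059844970703125,
    "LuceneCollection_thread": "MainThread",
}

def group_by_collection(meta):
    """Split 'Name_field' keys into {name: {field: value}}."""
    grouped = defaultdict(dict)
    for key, value in meta.items():
        # Split on the last underscore: "LuceneCollection_pid" -> name, field.
        name, _, field = key.rpartition("_")
        grouped[name][field] = value
    return dict(grouped)

servers = group_by_collection(metadata)
# Assumption: the client's time includes the time spent waiting on the backend.
overhead = servers["ClientCollection"]["time"] - servers["LuceneCollection"]["time"]
print(sorted(servers), "proxy overhead: %.3f s" % overhead)
```

The two different pids are another quick way to confirm that two separate server processes handled the request.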

Where Next?

Thanks for coming this far! You can:


Comment by aaron.s.lav, Mar 16, 2008

Note: in the deletion example, http://localhost:8080/a0300005?method=DELETE won't work from the browser, because the browser sends the request with the GET method, and GrassyKnoll correctly returns a 405 (GET is meant to be a safe method, and deleting a document is not safe).

