due at 23:59 on Sun 6 May

In this assignment, you will gain experience using distributed programming techniques to build a simple search engine. Although this one doesn’t use GeekOS, you can continue to use your Linux virtual machine. (But it should work on any UNIX machine, so you could try it on a Mac if you prefer.)

Update to get the scripts

From your usual cs643 folder, do:

liucs:~/cs643$ git pull

This should download the a7 folder, containing several .py (Python) scripts, and a sub-folder scifi with text files. The scripts we’ll use, in order, are:

invert-map.py separates the words from its input, creating an initial inverted index where each word is mapped to a document ID.
invert-reduce.py merges multiple occurrences of the same word, creating a set of document IDs. This is the complete inverted index.
invert-search.py loads the inverted index and uses it to search for a collection of keywords and display matching document IDs.

Git should make the Python scripts executable, but let’s verify this:

liucs:~/cs643$ cd a7
liucs:~/cs643/a7$ ls -l invert*.py *.sh
-rwxr-xr-x 1 league league 384 2011-04-23 08:54 invert-map.py
-rwxr-xr-x 1 league league 446 2011-04-23 08:54 invert-reduce.py
-rwxr-xr-x 1 league league 631 2011-04-23 08:54 invert-search.py
-rwxr-xr-x 1 league league 199 2011-04-23 08:54 add-doc-id.sh

If you don’t see the x’s in the permissions on the left side of this output, then you must do the following for the rest of these instructions to work correctly.

liucs:~/cs643/a7$ chmod +x invert*.py *.sh

Find some text data

The next step is to locate 5 or more text documents, then download and save them in the a7 folder. These will be the documents we will index. A good source for substantial text documents is Project Gutenberg http://www.gutenberg.org/, but you are welcome to use other sources. For example, you could select several long articles from Wikipedia.

Make sure your documents have distinct names with no spaces, and the extension .txt (or possibly .html, but it’s convenient if they all have the same extension, so we can refer to all of them using *.txt).

Prepare the text documents

My map-reduce scripts are designed to support chunking the data in whatever way is convenient; it’s possible that one server in the distributed system will see parts of several documents, but never an entire one.

So what we need to do is mark each line of text with the identity (filename) of the document that contains it. I wrote a short shell-script to do just that. It will modify your files though, so before you run it, though, make a backup copy of all your documents in a different folder. That way if it screws up it’s easier to start over!

liucs:~/cs643/a7$ ./add-doc-id.sh *.txt

After executing the above command, open up a few of the text files in your editor. You should see that each line starts with the base part of the filename, followed by a colon and the normal content. Here is an example from scifi/pg201.txt:

pg201:If my poor Flatland friend retained the vigour of mind which he enjoyed
pg201:when he began to compose these Memoirs, I should not now need to
pg201:represent him in this preface, in which he desires, firstly, to return

Create the inverted index

Now you’ll create the index. Unless you obtained terabytes of documents (hint: don’t!), you’ll have no problem running this on your own laptop. But it’s nice to know that we’ve structured the computation so that it can run as a streaming map-reduce job on a cluster of machines.

Below is the command. It gathers the contents of all your files, sends them through the mapper and the reducer, then saves the results in index.dat:

liucs:~/cs643/a7$ cat *.txt | ./invert-map.py | ./invert-reduce.py > index.dat

Assuming that all worked, open up index.dat in your text editor. You should see lines like these, where the word appears to the left of a colon, and then the collection of documents to the right:

absorption:pg29882 pg29809 pg29607 pg29390 pg1607 pg29919
absorptive:pg29255
abstain:pg29768
abstention:pg201 pg97

Use the index to search for keywords

Now we can use the index we created to search for keywords in your document collection. That command looks like this:

liucs:~/cs643/a7$ ./invert-search.py < index.dat  term1 term2 term3

Replace the term1 etc. with your search terms. These should be words only (no hyphens, slashes, dots, quotes) and should be in all lowercase (because all words in the index were converted to lowercase).

Here’s a sample run from my data set:

liucs:~/cs643/a7$ ./invert-search.py <index.dat missile target
Loaded index.
Results:
set(['pg29607', 'pg29919', 'pg28617', 'pg30124'])

The returned documents each contain both keywords.

Write up your experience and commit

Create a file readme.txt in your a7 folder. In there, provide details to address the following items:

What was your document source and how many documents did you include? Provide URLs if possible.
Choose one line (one word) from index.dat, and then show the context of where that word appears in each of the documents named on that line.
Find a set of search terms that individually appear in several documents, but together appear in fewer documents. (In other words, the combination of those words reduces the result set.) Show the results of invert-search for each search term individually, and then for all of them together. Explain what invert-search does in the example you provide.
Find a set of search terms that individually appear in at least one document, but together appear in no documents. Show the same evidence and reasoning as in the previous item.

You should add and commit your readme file and your index data file. You may commit some of your source documents if legal to do so (not under copyright, etc.), but for the sake of my server space, please don’t add more than about 4 megabytes of documents. (If your documents are large, just include 1 or 2 of them.)

liucs:~/cs643/a7$ git add readme.txt
liucs:~/cs643/a7$ git add index.dat
liucs:~/cs643/a7$ git add doc1.txt doc2.txt # Or whatever
liucs:~/cs643/a7$ git commit -am 'my assignment 7'

Assignment 7