Assignment 7 (Distributed systems)
due 4 May at noon
In this assignment, you will gain experience using distributed programming techniques to build a simple search engine. Although this one doesn't use GeekOS, you can continue to use your Linux virtual machine. (But it should work on any UNIX machine, so you could try it on a Mac if you prefer.)
1 Update to get the scripts
From your usual cs643 folder, do:

svn update

This should download the a7 folder, containing several .py (Python) scripts, and a sub-folder scifi with text files. The scripts we'll use, in order, are:
- invert-map.py separates the words from its input, creating an initial inverted index where each word is mapped to a document ID.
- invert-reduce.py merges multiple occurrences of the same word, creating a set of document IDs. This is the complete inverted index.
- invert-search.py loads the inverted index and uses it to search for a collection of keywords and display matching document IDs.
Subversion should make the Python scripts executable, but let's verify this:
% cd a7
% ls -l invert*.py *.sh
-rwxr-xr-x 1 league league 384 2011-04-23 08:54 invert-map.py
-rwxr-xr-x 1 league league 446 2011-04-23 08:54 invert-reduce.py
-rwxr-xr-x 1 league league 631 2011-04-23 08:54 invert-search.py
-rwxr-xr-x 1 league league 199 2011-04-23 08:54 add-doc-id.sh
If you don't see the x's in the permissions on the left side of this output, then you must do the following for the rest of these instructions to work correctly:
chmod +x invert*.py *.sh
2 Find some text data
The next step is to locate 5 or more text documents, then download and save them in the a7 folder. These will be the documents we will index. A good source for substantial text documents is Project Gutenberg (http://www.gutenberg.org/), but you are welcome to use other sources. For example, you could select several long articles from Wikipedia.
Make sure your documents have distinct names with no spaces, and the extension .txt (or possibly .html, but it's convenient if they all have the same extension, so we can refer to all of them using *.txt).
3 Prepare the text documents
My map-reduce scripts are designed to support chunking the data in whatever way is convenient; it's possible that one server in the distributed system will see parts of several documents, but never an entire one.
So what we need to do is mark each line of text with the identity (filename) of the document that contains it. I wrote a short shell script to do just that. It will modify your files in place, so before you run it, make a backup copy of all your documents in a different folder. That way if it screws up, it's easier to start over!
./add-doc-id.sh *.txt
After executing the above command, open up a few of the text files in your editor. You should see that each line starts with the base part of the filename, followed by a colon and the normal content. Here is an example from scifi/pg201.txt:
pg201:If my poor Flatland friend retained the vigour of mind which he enjoyed
pg201:when he began to compose these Memoirs, I should not now need to
pg201:represent him in this preface, in which he desires, firstly, to return
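The provided add-doc-id.sh takes care of this transformation. Purely as an illustration of the idea, here is a rough Python sketch of what it accomplishes; this is not the actual script (the real one is a shell script and may handle details differently):

#!/usr/bin/env python
# Hypothetical sketch of the add-doc-id transformation (not the real
# add-doc-id.sh): prefix every line of each named file with the file's
# base name and a colon, rewriting the file in place.
import os
import sys

for path in sys.argv[1:]:
    doc_id = os.path.splitext(os.path.basename(path))[0]  # pg201.txt -> pg201
    with open(path) as f:
        lines = f.readlines()
    with open(path, 'w') as f:
        for line in lines:
            f.write(doc_id + ':' + line)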
4 Create the inverted index
Now you'll create the index. Unless you obtained terabytes of documents (hint: don't!), you'll have no problem running this on your own laptop. But it's nice to know that we've structured the computation so that it can run as a streaming Hadoop job on a cluster of machines.
Below is the command. It gathers the contents of all your files, sends them through the mapper and the reducer, then saves the results in index.dat:
cat *.txt | ./invert-map.py | ./invert-reduce.py > index.dat
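For intuition about the mapper stage, here is a hypothetical sketch of what a script like invert-map.py could look like; it is an illustration only, not the actual script. It reads the "docid:text" lines, lowercases the text, and emits one word-to-document pair per word:

#!/usr/bin/env python
# Hypothetical mapper sketch: each input line looks like "docid:text".
# Emit one "word:docid" line for every word found in the text.
import re
import sys

for line in sys.stdin:
    doc_id, _, text = line.partition(':')
    for word in re.findall(r'[a-z]+', text.lower()):
        print('%s:%s' % (word, doc_id))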
Assuming that all worked, open up index.dat in your text editor. You should see lines like these, where the word appears to the left of a colon, and then the collection of documents to the right:
absorption:pg29882 pg29809 pg29607 pg29390 pg1607 pg29919
absorptive:pg29255
abstain:pg29768
abstention:pg201 pg97
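That merging of duplicate words is the reducer's job: it collects all the document IDs emitted for each word into a single set. Again, this is a hedged illustration rather than the actual invert-reduce.py:

#!/usr/bin/env python
# Hypothetical reducer sketch: read "word:docid" pairs from the mapper,
# collect the document IDs for each word into a set, then print one
# "word:docid docid ..." line per word.
import sys

index = {}
for line in sys.stdin:
    word, _, doc_id = line.strip().partition(':')
    index.setdefault(word, set()).add(doc_id)

for word in sorted(index):
    print('%s:%s' % (word, ' '.join(sorted(index[word]))))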
5 Use the index to search for keywords
Now we can use the index we created to search for keywords in your document collection. That command looks like this:
./invert-search.py < index.dat term1 term2 term3
Replace term1 etc. with your search terms. These should be words only (no hyphens, slashes, dots, quotes) and should be in all lowercase (because all words in the index were converted to lowercase).
Here's a sample run from my data set:
% ./invert-search.py <index.dat missile target
Loaded index.
Results: set(['pg29607', 'pg29919', 'pg28617', 'pg30124'])
The returned documents each contain both keywords.
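Conceptually, the search is just a set intersection over the index. Here is a hypothetical sketch of the idea (not necessarily how invert-search.py is actually written):

#!/usr/bin/env python
# Hypothetical search sketch: load "word:docid docid ..." lines from
# stdin into a dict of sets, then intersect the document sets of the
# terms given on the command line. Assumes at least one term is given.
import sys

index = {}
for line in sys.stdin:
    word, _, docs = line.strip().partition(':')
    index[word] = set(docs.split())
print('Loaded index.')

results = index.get(sys.argv[1], set())
for term in sys.argv[2:]:
    results = results & index.get(term, set())
print('Results: %r' % results)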
6 Write up your experience and commit
Create a file readme.txt in your a7 folder. In there, provide details to address the following items:
- What was your document source and how many documents did you include? Provide URLs if possible.
- Choose one line (one word) from index.dat, and then show the context of where that word appears in each of the documents named on that line.
- Find a set of search terms that individually appear in several documents, but together appear in fewer documents. (In other words, the combination of those words reduces the result set.) Show the results of invert-search for each search term individually, and then for all of them together. Explain what invert-search does in the example you provide.
- Find a set of search terms that individually appear in at least one document, but together appear in no documents. Show the same evidence and reasoning as in the previous item.
You should add and commit your readme file and your index data file. You may commit some of your source documents if legal to do so (not under copyright, etc.), but for the sake of my server space, please don't add more than about 4 megabytes of documents. (If your documents are large, just include 1 or 2 of them.)
svn add readme.txt
svn add index.dat
svn add doc1.txt doc2.txt   # Or whatever they are called; keep under 4M
svn commit -m 'my assignment 7'