Written by Jeff Chiang and Justin Tung in 2001
Storage of Data
To begin considering scalable webcrawlers that handle arbitrarily large amounts of data, we should look at abstract data types with low worst-case running times for operations such as lookup and insert. Among the data structures learned so far, red-black balanced trees provide the best big-O running times for lookup and insert, both O(log n). The advantage of O(log n) running times for these key operations becomes apparent when n reaches the billions, as it does when an indexer is storing and wading through the data on all webpages. New webpages are easily inserted into the existing scheme, and old ones can be found and updated easily. One possible implementation might count the length of the URL and store data based on that number. Another could sort data into subtrees based on other ways of dividing the information on the web (e.g., file extensions of URLs, or general topics). Storage of page data can be done via a record whose fields hold specific information about page title, language, and format (from URLs and potential file links), as well as information about the content on the page. This content, like the index of webpages, should be stored in the same format to provide optimal lookup times for words, phrases, etc.
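The record scheme described above might be sketched as follows. All the names here are illustrative, and Python's `bisect` module is used as a stand-in for a red-black tree: it gives the same O(log n) lookup guarantee on the sorted key list, though a true balanced tree would also be needed to keep inserts at O(log n) rather than O(n).

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class PageRecord:
    """Fields describing one crawled page, as outlined above."""
    url: str
    title: str = ""
    language: str = ""
    fmt: str = ""  # guessed from the URL's file extension
    content: list = field(default_factory=list)  # indexed words on the page

class PageIndex:
    """URL index kept in sorted order; bisect finds a key with
    O(log n) comparisons, mimicking a balanced tree's lookup."""
    def __init__(self):
        self._urls = []    # sorted list of URLs
        self._pages = {}   # URL -> PageRecord

    def insert(self, page):
        """Insert a new page, or update an existing one in place."""
        if page.url not in self._pages:
            bisect.insort(self._urls, page.url)
        self._pages[page.url] = page

    def lookup(self, url):
        """Return the record for a URL, or None if it is not indexed."""
        i = bisect.bisect_left(self._urls, url)
        if i < len(self._urls) and self._urls[i] == url:
            return self._pages[url]
        return None
```

Re-inserting a record under an existing URL simply replaces the old one, which matches the paper's point that old pages "can be found easily and updated."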
Content and page title are probably the most important elements in a web search and should be displayed along with the URL in a results page. However, advanced searches going beyond simple boolean expressions require more complex analysis of the data in the fields. Current search engines support phrase finding, negative queries, language-specific queries, and finding similar pages or pages that link to a certain page. Additional support includes searching certain directories (images, newsgroups, audio/mp3s, downloads, etc.), dated page queries, and even reviewing recent queries (probably to save resources by bringing up information already searched for previously).
How to implement these ‘advanced searches’?
Using the data structure setup discussed above, advanced searches come down to good algorithms that go through the data stored in an index and pull out the important information. Phrase finding will require the search engine to parse text into groups of words and compare them; once a page meets the phrase requirement, its URL and corresponding information are sent back. Negative queries work similarly to 'positive' queries: the implementation is simply the negation of the one used for regular queries. Similar pages can be found by keywords in page titles, text, and other data stored for the page. Finding links to a certain page is straightforward with the tree setup, since a find on all URLs that share a common link can be done and those URLs returned. Subtopics and more specific areas of the web can be handled by dividing data into subtrees during insertion into the web index, as mentioned above; searches can then be limited to those trees when needed, decreasing running times by restricting which topics/data to search. Storing recent or most popular queries can be done by keeping a set of URLs in a database (balanced tree), either for reference or for near-future use. Storing the results of queries helps cut down the number of queries (many of which may be identical) reaching the search engine.
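Phrase finding and negative queries, as described above, might look like this minimal sketch. The function names are assumptions for illustration; a real engine would run these tests against the stored word index rather than raw page text.

```python
def parse_words(text):
    """Parse page text into a list of lowercase words,
    as phrase finding requires."""
    return text.lower().split()

def contains_phrase(page_text, phrase):
    """Phrase finding: the query words must appear consecutively
    in the page, in order."""
    words = parse_words(page_text)
    target = parse_words(phrase)
    n = len(target)
    return any(words[i:i + n] == target
               for i in range(len(words) - n + 1))

def matches_query(page_text, required=(), excluded=()):
    """A regular ('positive') query demands every required word;
    a negative query is just the negation of that membership test."""
    words = set(parse_words(page_text))
    return (all(w in words for w in required)
            and not any(w in words for w in excluded))
```

Note that `contains_phrase("red black tree", "black red")` is false even though both words appear: a phrase query is stricter than a boolean AND.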
Analysis of Running Times and Scalability Considerations for Web Search Algorithms
Since all the webpages are stored in balanced trees, and within those trees the nodes store content data in balanced trees, a search for a specific page (i.e., URL) can be done in O(log n) time. The advantage of balanced trees within balanced trees is that any other kind of search that accesses fields in the nodes also takes at most O(log n) time. Because data on the internet is so large, worst-case running times provide a good estimate, as lower-order terms such as constants (for accessing data) do not really affect running time. Given the terabytes of data to process, the frequency of crawls, and the general expectation of speed on the internet, algorithms with the smallest possible big-O are the most desirable. Fast algorithms will be able to serve the large number of people using the search engine every day in good time.
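A quick worked example makes the O(log n) claim concrete: even with a billion indexed pages, a balanced-tree lookup needs only about thirty comparisons.

```python
import math

# With n = one billion pages, an O(log n) balanced-tree lookup
# costs roughly log2(n) comparisons.
n = 1_000_000_000
comparisons = math.ceil(math.log2(n))
print(comparisons)  # 30
```

Doubling the size of the web adds only one more comparison per lookup, which is why logarithmic operations scale to an index of this size.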
Caching Web Pages
Although caching every examined page may seem like an impossible task, it is indeed what Google does. Google employs a compression technique on each document that trades off speed for file size: its 3:1 compression ratio allows moderate space savings while still retaining adequate retrieval speed. Each document is stored with a docID, length, and URL. In addition to this central repository, a document index, lexicon, hit lists, and hit list indexes are stored as well. Each entry in the document index holds the current document status, a pointer into the repository, a document checksum, and various statistics. A hit list records the occurrences of a particular word in a particular document, including position, font, and capitalization information. All of this information is then used to build the hit list indexes, which record the docIDs of the documents containing the words in the hit list. Even queries themselves are cached, resulting in faster search times should the same query be repeated. The result of these data structures blended together is a fast and accurate return of the pages users request.

Frequency of Web Crawls and Changing Web Pages

In order to keep track of which pages have or have not changed, a record of the time of last access of each page needs to be maintained. Any further attempt to access the same page within the specified period of time is rejected and a “local copy up to date” condition is signaled. If a page has been accessed previously, the HTTP HEAD method is used to determine the last-modified date of the current remote version. If this is unchanged from the last-modified date of the current local copy, then no further network traffic ensues and a “local copy up to date” condition is signaled. If the remote version has changed, then an HTTP GET access fetches the new copy.
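The freshness rules above can be sketched as a small decision function. The refresh period and field names are assumptions; `head_last_modified` stands in for the result of an HTTP HEAD request, which a real crawler would issue at that point.

```python
REFRESH_PERIOD = 24 * 3600  # assumed minimum seconds between rechecks

def check_page(local_copy, now, head_last_modified):
    """Decide whether a cached page needs refetching, following
    the rules above. `local_copy` is a dict with 'last_access'
    (seconds) and 'last_modified' (date string) fields."""
    if now - local_copy["last_access"] < REFRESH_PERIOD:
        # Accessed too recently: reject without any network traffic.
        return "local copy up to date"
    if head_last_modified == local_copy["last_modified"]:
        # HEAD says the remote version is unchanged.
        local_copy["last_access"] = now
        return "local copy up to date"
    # The remote version has changed: fetch it.
    return "GET new copy"
```

Only the third branch generates a full-page transfer, so most rechecks of a stable page cost either nothing or a single HEAD request.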
A search engine needs to maintain information about the last 10 accesses or attempted accesses to a resource. The stored information includes the date of access, the time needed for the transfer, the amount of data transferred, and the HTTP status code. In the internal database the status code is extended with non-standard codes to indicate various types of communication failure. If successive access attempts fail, the page is assumed to be no longer available; it is marked as such and may still be included in the database, but it is presented to the searching user with a warning. Eventually the page is eliminated from the database entirely.
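The per-resource bookkeeping described here might be sketched with a bounded history. The failure threshold of three is an assumption (the paper only says "successive access attempts"), and `collections.deque` with `maxlen=10` keeps exactly the last 10 accesses.

```python
from collections import deque

MAX_FAILURES = 3  # assumed number of consecutive failures before
                  # the page is presumed no longer available

class ResourceHistory:
    """Keeps the last 10 accesses to one resource. Each entry
    records the date, transfer time, bytes transferred, and
    HTTP status code, as described above."""
    def __init__(self):
        self.accesses = deque(maxlen=10)

    def record(self, date, transfer_time, nbytes, status):
        self.accesses.append({"date": date, "time": transfer_time,
                              "bytes": nbytes, "status": status})

    def presumed_gone(self):
        """True if the most recent attempts all failed (status >= 400,
        including any non-standard failure codes)."""
        recent = list(self.accesses)[-MAX_FAILURES:]
        return (len(recent) == MAX_FAILURES
                and all(a["status"] >= 400 for a in recent))
```

A page flagged by `presumed_gone` would be shown with a warning at first, and dropped from the database only after it stays flagged, matching the gradual elimination described above.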