Scalability of Web Crawlers Designed in SML (Standard ML)

Written by Jeff Chiang and Justin Tung in 2001

Storage of Data

To design a scalable web crawler that handles arbitrarily large amounts of data, the abstract data types used should offer low worst-case running times for operations such as lookup and insert. Of the data structures considered so far, red-black balanced trees provide the best big-O running times for lookup and insert, both O(log n). The advantage of O(log n) running times for these key operations becomes apparent when n reaches the billions, as it does when an indexer is storing and wading through the data on all web pages. New pages are easily inserted into the existing scheme, and old ones can be found and updated just as easily. One possible implementation is to key the tree on the length of the URL and store data based on that number; another is to sort data into subtrees based on other ways of dividing the information on the web (e.g. file extensions of URLs, or general topics). The information about each page can be stored in a record whose fields hold the page title, language, and format (inferred from the URL and linked files), as well as information about the content of the page. This content, like the index of web pages itself, should be stored in the same balanced-tree format to provide fast lookup of words, phrases, and so on.
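
As a rough illustration (not part of any actual crawler), the per-page record and the dictionary interface might look like the following in SML; the field names and the DICT signature are made up for this sketch, with a red-black tree expected to sit behind the signature and supply the O(log n) insert and lookup.

    (* A sketch of the per-page record and a dictionary interface.
       Field names and the DICT signature are illustrative; a red-black
       tree implementation would provide O(log n) insert and lookup. *)

    type pageInfo = {
      title    : string,
      language : string,
      format   : string,       (* e.g. "html" or "pdf", inferred from the URL *)
      words    : string list   (* content tokens extracted from the page *)
    }

    signature DICT =
    sig
      type key = string         (* pages keyed by URL *)
      type 'a dict
      val empty  : 'a dict
      val insert : 'a dict * key * 'a -> 'a dict
      val lookup : 'a dict * key -> 'a option
    end

A structure implementing DICT with a red-black tree would then hold one pageInfo record per URL.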

Content and page title are probably the most important elements in a web search and should be displayed along with the URL on a results page. However, advanced searches that go beyond simple boolean expressions require more complex analysis of the data in these fields. Current search engines support phrase finding, negative queries, language-specific queries, and finding similar pages or pages that link to a given page. Additional features include searching specific directories (images, newsgroups, audio/MP3s, downloads, etc.), date-restricted queries, and even caching recent queries (probably to save resources by reusing results that were already computed).

How to implement these ‘advanced searches’?

With the data structure setup discussed above, advanced searches really come down to good algorithms that walk the data stored in the index and pull out the important information. Phrase finding requires the search engine to parse text into groups of words and compare them; once a page meets the phrase requirement, its URL and corresponding information are returned. Negative queries work similarly to ‘positive’ queries and should simply be the negation of the implementation used for regular queries. Similar pages can be found by matching keywords in page titles, text, and other data stored for each page. Finding the pages that link to a certain page is also straightforward with the tree setup, since a search over all URLs that share a common link can be run and the matching URLs returned. Subtopics and more specific areas of the web can be supported by dividing data into subtrees during insertion into the web index, as mentioned above; searches can then be limited to those subtrees when appropriate, decreasing running times by restricting which topics or data are searched. Storing recent or most popular queries can be done by keeping the set of returned URLs in a database (a balanced tree), either for reference or for near-future use. Storing these query results helps cut down the number of queries (many of which may be identical) that reach the search engine.
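
As a small sketch, phrase finding might look like this in SML, assuming the page text has already been tokenized into a list of words; the function names are made up for illustration.

    (* A minimal sketch of phrase finding, assuming the page text has
       already been tokenized into a list of words and the phrase is
       non-empty. *)

    (* Does the phrase appear at the very front of the text? *)
    fun isPrefix ([], _) = true
      | isPrefix (_, []) = false
      | isPrefix (q :: qs, w :: ws) = q = w andalso isPrefix (qs, ws)

    (* Does the phrase appear anywhere in the text? *)
    fun containsPhrase (_, []) = false
      | containsPhrase (phrase, text as _ :: rest) =
          isPrefix (phrase, text) orelse containsPhrase (phrase, rest)

    (* A negative query is simply the negation of the positive test. *)
    fun excludesPhrase (phrase, text) = not (containsPhrase (phrase, text))

For example, containsPhrase (["web", "crawler"], pageWords) is true exactly when the two words occur next to each other, in order, in pageWords.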

Analysis of Running Times and Scalability Considerations for Web Search Algorithms

Since all web pages are stored in balanced trees, and within those trees the nodes store content data in balanced trees of their own, a search for a specific page (i.e. a URL) can be done in O(log n) time. The benefit of balanced trees within balanced trees is that any other type of search that accesses fields in the nodes also takes at most O(log n) time. Because the amount of data on the internet is so large, worst-case running times give a good estimate, since lower-order terms such as constants (the cost of accessing data) have little effect on overall running time. Given the terabytes of data to process, the frequency of crawls, and the general expectation of speed on the internet, algorithms with the smallest possible big-O are the most desirable. Fast algorithms will be able to serve the large number of people using the search engine every day in good time.
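
To make the ‘balanced trees within balanced trees’ idea concrete, here is a small sketch that assumes the red-black map functor from the SML/NJ Library is available; the page type and field names are illustrative.

    (* A sketch of nested balanced trees, using the SML/NJ Library's
       red-black map functor (an assumed dependency).  The outer map is
       keyed by URL; each page carries an inner map from each word to
       its number of occurrences on that page. *)

    structure StrMap = RedBlackMapFn(struct
      type ord_key = string
      val compare = String.compare
    end)

    type page = { title : string, words : int StrMap.map }

    (* Two O(log n) lookups in a row, so the whole check stays logarithmic. *)
    fun pageHasWord (index : page StrMap.map, url, word) =
        case StrMap.find (index, url) of
            NONE   => false
          | SOME p => Option.isSome (StrMap.find (#words p, word))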

Caching Web Pages

Although caching every examined page may seem like an impossible task, it is indeed what Google does. Google employs a compression technique on each document that balances file size against speed; its 3:1 compression ratio gives moderate space savings while still retaining adequate retrieval speed. Each document is stored with a docID, length, and URL. In addition to this central repository, a document index, lexicon, hit lists, and hit list indexes are stored as well. Each entry of the document index stores the current document status, a pointer into the repository, a document checksum, and various statistics. A hit list records the occurrences of a particular word in a particular document, including position, font, and capitalization information. All of this information is then used to create the hit list indexes, which record the docIDs of the documents that contain the words in the hit list. Even queries themselves are cached, resulting in faster search times should the same query be repeated. The result of these data structures blended together is a fast and accurate return of the pages users request.

Frequency of Web Crawls and Changing Web Pages

In order to keep track of which pages have or have not changed, a record of the time of the last access to each page needs to be maintained. Any further attempt to access the same page within a specified period of time is rejected, and a “local copy up to date” condition is signaled. If a page has been accessed previously, the HTTP HEAD method is used to determine the last-modified date of the current remote version. If this is unchanged from the last-modified date of the current local copy, then no further network traffic ensues and a “local copy up to date” condition is signaled. If the remote version has changed, an HTTP GET access fetches the new copy.
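
The re-crawl decision just described can be sketched as two small checks in SML, leaving the actual HTTP HEAD and GET calls abstract; the field names, the two-step split, and the use of epoch seconds are assumptions made for the sketch.

    (* A sketch of the freshness check, with the HTTP HEAD and GET calls
       themselves left abstract.  Times are seconds since the epoch;
       field names are illustrative. *)

    datatype crawlAction = LocalCopyUpToDate | FetchNewCopy

    (* Step 1: refuse to contact the server again inside the minimum interval. *)
    fun shouldContactServer {now : int, lastAccess : int, minInterval : int} =
        now - lastAccess >= minInterval

    (* Step 2: after an HTTP HEAD, compare Last-Modified dates; only a
       changed remote version warrants an HTTP GET for a fresh copy. *)
    fun actionAfterHead {localModified : int, remoteModified : int} =
        if remoteModified = localModified
        then LocalCopyUpToDate
        else FetchNewCopy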

A search engine needs to maintain information about the last 10 accesses or attempted accesses to a resource. The stored information includes the date of access, the time needed for the transfer, the amount of data transferred, and the HTTP status code. In the internal database the status code is extended with non-standard values to indicate various types of communication failure. If successive access attempts fail, the page is assumed to be no longer available. It is marked as such and may still be included in the database, but it is presented to the searching user with a warning. Eventually the page is eliminated from the database entirely.
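
A sketch of the per-page access history in SML; the record fields follow the description above, while treating status codes of 400 and above as failures and keeping exactly ten entries are assumptions for illustration.

    (* A sketch of the per-page access history.  The record fields follow
       the text; the failure test (status >= 400) and the ten-entry cap
       are illustrative assumptions. *)

    type accessRecord = {
      date       : int,   (* date of access, seconds since the epoch *)
      transferMs : int,   (* time needed for the transfer *)
      bytes      : int,   (* amount of data transferred *)
      status     : int    (* HTTP status code, or an internal failure code *)
    }

    (* Keep only the ten most recent accesses to a resource. *)
    fun remember (r : accessRecord, history) =
        List.take (r :: history, Int.min (10, length history + 1))

    (* Presume a page is gone once its last few accesses have all failed. *)
    fun presumedGone (history : accessRecord list, threshold) =
        length history >= threshold andalso
        List.all (fn r => #status r >= 400) (List.take (history, threshold))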
