Acknowledgements: We would like to thank Carol Blumberg and Brant Deppa of the Mathematics and Statistics Department of Winona State University for their assistance in the design of the project and the statistical analysis of the data. We would also like to thank Don Byrd of the University of Massachusetts at Amherst for his advice and suggestions on related literature in the field.
H. Vernon Leighton 6/16/97
Five search engines, Alta Vista, Excite, Hotbot, Infoseek, and Lycos, are compared for precision on the first twenty results returned for fifteen queries. All searching was done from January 31 to March 12, 1997. In the study, steps are taken to ensure that bias has not unduly influenced the evaluation. Friedman's randomized block design is used to perform multiple comparisons for significance. Analysis shows that Alta Vista, Excite and Infoseek are the top three services, with their relative rank changing depending on how one interprets the concept of "relevant." Correspondence analysis shows that Lycos performed better on short, unstructured queries, while Hotbot performed better on structured queries.
In this study, we compare five commercial World Wide Web search services, which are also commonly called "search engines": Alta Vista, Excite, Hotbot, Infoseek, and Lycos. Our test suite is a set of fifteen questions that are submitted to all of the services. The measurement that we use, "first twenty precision," rates the services based on the percentage of results within the first twenty returned that were relevant or useful. We then analyze the statistics and compare the services.
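As a rough illustration of the measure, the following Python sketch computes a simplified first twenty precision for one query on one service. It assumes binary relevance judgments and scores a service on the results it actually returned; the formula actually used in this study is given in part four and may treat relevance categories, duplicates and inactive links differently. The function name and the sample data are hypothetical.

```python
def first_twenty_precision(judgments, cutoff=20):
    """Fraction of the first `cutoff` results judged relevant.

    `judgments` is a list of booleans in ranked order (True = relevant).
    Assumption: binary relevance, and a service returning fewer than
    `cutoff` results is scored on the results it did return.
    """
    top = judgments[:cutoff]
    if not top:
        return 0.0
    return sum(top) / len(top)


# Hypothetical judgments for one query on one service:
# three relevant links among the first twenty.
sample = [True, True, False, True] + [False] * 16
print(first_twenty_precision(sample))
```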
Many studies comparing the relevance of search service results have been conducted. We began this study because previous studies have arrived at conflicting conclusions as to which services deliver superior precision, and because most of those studies have either had small test suites or have not reported how the study was conducted. This study compares all search services that have been recommended recently for their performance at delivering relevant results for the user, and uses a carefully controlled and documented experimental design to ensure valid results. It is our opinion that evaluating pages for relevance without blinding the evaluator to their source is a major problem for search service studies.
The results of the various experiments indicate that there are three services, alphabetically Alta Vista, Excite and Infoseek, that are superior in most respects to the other two studied, Hotbot and Lycos, in terms of first twenty precision. How these top three rank compared to each other depends on how one defines "relevant." The chief problem with Hotbot is the frequent occurrence of duplicate links; the chief problems with Lycos's performance are irrelevant and inactive links. Lycos's irrelevant links are often caused by the lack of a "required" operator (often denoted by a + prefix). Neither Hotbot nor Lycos seems to implement the fuzzy match for unstructured queries, in which lower case search terms match both lower and upper case terms, but upper case search terms match only upper case terms.
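The fuzzy case match described above can be illustrated with a short Python sketch. It is only an interpretation of the rule as stated here, not the implementation used by any of the services; in particular, it assumes that a search term containing any capital letter is treated as case-sensitive. The function name is hypothetical.

```python
def case_fuzzy_match(query_term, document_term):
    """Apply the fuzzy case rule to a single pair of terms.

    Assumption: an all-lower-case query term ignores case in the
    document, while a query term with any capital letter must match
    the document term exactly.
    """
    if query_term.islower():
        return query_term == document_term.lower()
    return query_term == document_term


print(case_fuzzy_match("turkey", "Turkey"))  # lower case term matches either case
print(case_fuzzy_match("Turkey", "turkey"))  # capitalized term requires an exact match
```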
In part two of this report, we define the problem and critique related work in the field. In part three, we explain the methodology used to develop the test suite, to conduct the searches, and to evaluate the results from the search services. In part four, we explain the evaluation criteria and the formula by which these evaluated results are used to calculate first twenty precision. In part five, the experimental design for the primary statistical tests is described. Part six gives results and discusses their meaning. Conclusions and plans for future work are presented in part seven.
II. Problem Statement and Related Work
This work attempts to compare the precision of major Web index services in an objective and fair manner with regard to general subject queries that may be posed in the academic setting. One begins such a study by choosing which services to compare. One must develop an unbiased suite of queries to test those services. Then one must design fair methods for searching the services and evaluating the results of the searches. At many points in the design, it is possible to subtly favor one service over another. Conscious or unconscious bias must be guarded against. Finally, one must arrive at some metric for measuring precision and use that metric to analyze performance.
Here we chose our services--Alta Vista, Excite, Hotbot, Infoseek, and Lycos--because each of them has been recommended for the relevance of its results in recent studies (1996 or later). Other services that could have been chosen in the category of major search service are WebCrawler and Open Text. Time did not allow their inclusion.
There are other types of services that we did not include. Selective Web databases--C|Net, Excite Reviews, Magellan, and Yahoo--are certainly helpful at cutting through the user's information overload, and may even be better at satisfying user information needs than the major search services (Venditto (1997) suggested that the review services were better in late 1996, Tomaiuolo and Packer's study (1996) indicated that they were not in late 1995), but they are much smaller and are to some degree edited. Traditional vendors of databases, such as Dialog (now Knight-Ridder Information Inc.), are also different, in that their databases are often commercially developed and professionally edited records of commercially published texts in academic, scientific and technical areas. While all of these other services may be able to answer many of the same questions, and may therefore be comparable in terms of satisfying some level of information need, they really are different from the major search services in scope and purpose and were not included.
This study has a test suite large enough for valid statistical analyses, though not large enough to split into parts and compare different subsets of the results. The queries have been obtained in such a way that the researcher did not personally pick the queries in the test suite. The searching has been done to minimize the possibility of favoring the service queried first, or the one queried last. The resulting pages have been blinded so that the evaluator would not know from which service they came. These steps have been taken to prevent the various types of bias that can unfairly influence the results.
Most published precision studies have had test suites that were too small for statistical usefulness. Ding and Marchionini (1996), the best modelled study to date, studied first twenty precision, but used only five queries. Westera (1996) only used five queries, all dealing with wine. Leighton's 1995 study (Leighton, 1995) only had eight queries, and there have been a host of minor studies that purport to judge precision based on three, two or even one measly query.
Chu and Rosenthal (1996) tested first ten precision, had enough queries for statistical comparisons, recorded crucial information about how the searching was conducted, and performed some statistical tests. However, they only studied three search services, and they did not subject their mean precision figures to any test for significance. Gauch and Wang (1996) had twelve queries, studied almost all of the major search services (and even the major metasearch services) and reported first twenty precision, but did not test for significance in the differences reported. Tomaiuolo and Packer (1996) studied first ten precision on two hundred queries. They did list the query topics that they searched, but they used structured search expressions (using operators) and did not list the exact expression entered for each service. They reported the mean precision, but again did not test for significance in the differences. They often did not visit the link to see if it was in fact active. Nor did they define the criteria for relevance. Despite these shortcomings, it is an impressive, huge study.
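Several of the studies above report mean precision without testing whether the differences are significant. As a sketch of what such a test involves, the Python code below computes the textbook Friedman chi-square statistic, treating each query as a block and each service as a treatment. It omits the tie correction and the multiple-comparison step, so it should not be read as the exact procedure of this study, which is described in part five. The function names and the sample scores are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square statistic, without tie correction.

    `blocks` has one row per query; each row holds one precision score
    per service, always in the same service order.
    """
    b = len(blocks)      # number of queries (blocks)
    k = len(blocks[0])   # number of services (treatments)
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (b * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * b * (k + 1)


# Hypothetical precision scores: three queries (rows) by three services (columns).
scores = [
    [0.20, 0.45, 0.60],
    [0.10, 0.30, 0.55],
    [0.25, 0.40, 0.70],
]
print(friedman_statistic(scores))
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom; a large value suggests that at least one service ranks consistently differently from the others.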
Studies reported in popular journals were often vague about how many or exactly what queries were searched. Venditto (1996) used first twenty-five precision, but did not report how many queries were used nor what the exact statistics were. Scoville (1996) used first ten precision and gave exact mean scores, but explained neither how many queries were used, nor whether the differences in mean were significant. Munro and Lidsky (1996) also used first ten precision in a hefty fifty query by ten search engine study, but did not list the queries or the statistical analysis. From their description, it is clear that their range of types of queries was much wider than that used in this study. They reported their results as a scale of four stars, indicating that more exact numbers could easily be misleading (probably because of issues with statistical significance).
None of the studies in the related literature indicates that an attempt was made to blind the service of origin of the links evaluated. Unless this step is taken, there must always be the question of bias, conscious or unconscious, on the part of the evaluator.
Development of the Test Suite
The development of the test suite requires two steps: one must first choose the topics to be searched for and then, second, choose exactly what search expression will be submitted to each service. Biases, both conscious and unconscious, can enter the process with either step, as, for example, one may select general subject areas which one knows a given search engine is stronger in than others, or one may choose a form of expression that exploits a feature in one search service that is not available in other services.
Because the type of query is the general subject inquiry in an academic setting, the queries are ones actually asked at a University Library Reference Desk. During February 1997, the test suite developer recorded the topic of every reference question that he was asked in which the patron specifically requested that the Internet be used as a source of information. We stopped after ten queries were obtained. These queries were neither invented nor selected by someone who knew the abilities of the various services. These topics were supplemented by selecting five queries from another study.
The selection of exactly what search expression to enter is perhaps the single weakest point in the design of this study. Other studies have suffered from bias here. First, one must choose how many words to enter in the search, then choose which words. These queries are usually narrowly defined academic topics and use multiple words, as was done in previous studies (Chu and Rosenthal, 1996; Ding and Marchionini, 1996; Tomaiuolo and Packer, 1996).
In addition to choices about what words to use to search, one must decide whether to use an unstructured collection of words (sometimes referred to as natural language) or to use text structured by operators. If the text is structured, one must choose the appropriate operators (proximity and Boolean) and constraints. Our decision as to how many and which queries would be structured or unstructured was our most difficult one. When conducting preliminary queries, we became uneasy about our own ability to know for each of the five services and for each query exactly what expression would be optimal. Furthermore, as Magellan's Voyeur (Magellan, 1997) indicates, most users do not use operators in their searching. Finally, unstructured queries force the search service to do more of the work, ranking results by its own algorithm rather than the constraints specified by the operators. Because of all of these factors, we chose the unstructured--or as we refer to it, simple--text as the preferred expression, and only chose the structured text when, without it, the topic was too easily open to multiple interpretations. Whether we made optimal, or even adequate, choices, is an issue for criticism. To allow for that criticism, we have recorded the exact settings for the forms submitted to the services in Appendix A. We categorized the queries as "structured," "simple" or "personal name."  There are seven simple queries, seven structured queries and one personal name in the final test suite.
The closer in time that a query is searched on the different index services, the better. Ideally, a query would be executed on all search services at the same time. The reason for this goal is that if a relevant page were to be made available between the time one engine was searched and the time a second was searched, the second would have the unfair advantage of being able to index and retrieve the new page. In the Ding and Marchionini study, the search engines were all searched within twenty minutes of each other. In the present study, all of the search engines were searched, for a given query, on the same day. For most queries, the engines were all searched within half an hour of each other. For a few queries, the searching was attempted in the middle of the day, and Excite, the only search engine that had this problem, often took a great deal of time, or simply did not respond. In those cases, Excite was searched in the late afternoon, after the electronic rush-hour.
Another goal of searching is to check the pages that were cited in the results from the search services as quickly as possible after the results are obtained. The longer one waits after the results are obtained, the more likely it is that some pages which were truly active at the time of the search have been removed from the Web and are erroneously judged to be inactive by the researcher. In this study, the pages were all obtained by an automated web browser which used a blinding algorithm so that the evaluator would be unable to know from which search service the citation came (see below). Because of that arrangement, all one hundred results of each query had to be retrieved at the same time and stored in labeled files. This was done either on the same day that the searches were conducted or the day after. The stored pages were then evaluated by the researcher over a period of from a day to a week.
For each query, we wrote down a draft set of criteria for categorizing the links based on the general criteria discussed in Relevancy Categories, below. This was done before evaluating any links. Then, as the links were evaluated, the criteria were adjusted as necessary to take into account the nature of the subject involved.
After conducting several preliminary queries and studying the results, we noticed that we were developing biases and judgments about the various search services. To prevent these biases from clouding our judgment in categorizing individual pages returned by the services, we developed a method of blinding the pages so that for any given page, we would not know ahead of time which service had returned it as a result.
A PERL program was used to strip the URLs of the results and load them into a batch file. This program was run by the research assistant, and the program named the files of retrieved pages with labels so that the evaluator would not know a priori which results came from which search service. The batch files would then be invoked to retrieve the pages cited by the search engine results pages. The batch file used the "get" program from the PERL suite of web utilities called "libwww-perl-0.40." 
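The blinding step can be sketched in Python as follows. This is not the PERL program used in the study; it is a minimal illustration that assumes the result URLs have already been extracted into one ranked list per service. The function name, the label format and the sample URLs are hypothetical.

```python
import random


def blind_results(results_by_service, seed=None):
    """Shuffle all result URLs and give each an opaque label.

    Returns (blinded, key): `blinded` maps label -> URL and is what the
    evaluator sees; `key` maps label -> (service, rank) and is withheld
    until the evaluation of the query is finished.
    """
    entries = [
        (service, rank, url)
        for service, urls in results_by_service.items()
        for rank, url in enumerate(urls, start=1)
    ]
    random.Random(seed).shuffle(entries)  # destroy any service ordering
    blinded, key = {}, {}
    for n, (service, rank, url) in enumerate(entries, start=1):
        label = f"page{n:03d}"
        blinded[label] = url
        key[label] = (service, rank)
    return blinded, key


# Hypothetical results for one query on two services.
results = {
    "ServiceA": ["http://example.org/a1", "http://example.org/a2"],
    "ServiceB": ["http://example.org/b1"],
}
blinded, key = blind_results(results, seed=1997)
for label in sorted(blinded):
    print(label, blinded[label])
```

Keeping the key in a separate file, held by someone other than the evaluator, mirrors the division of labor between research assistant and evaluator described above.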
The evaluator then called up each HTML page using EMACS, inspected it and assigned it to one of these categories: inactive, zero, one, two or three (duplicates were not discovered until the pages could be matched to search engines). Some unique feature (part of the title, etc.) was noted so that if a later page in the evaluation looked similar, a match could be discovered. In this way, even if the evaluation was not evenly or fairly done in other respects, at least the same page would receive the same score throughout the evaluation of a query. For some pages, the evaluation was reassessed when a version of the page could be viewed in which graphics were present.
For each query, after the hundred pages were inspected in this way, the evaluations were then mapped back to the results pages generated by the search engines. The results pages were evaluated to detect duplicates and mirror sites. For sites that gave a 603 (server not responding) or a Forbidden error, those links were checked several more times over a period of several weeks.
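Mapping the evaluations back to each service can be sketched as below. The sketch assumes three hypothetical lookup tables (label to category, label to URL, and label to service and rank) and flags only exact URL repeats within one service's list; mirror sites, which have different URLs, would still need to be found by inspection. All names and sample data are illustrative.

```python
def unblind(scores, blinded, key):
    """Rebuild each service's ranked list and flag repeated URLs.

    scores:  label -> category assigned by the evaluator
    blinded: label -> URL
    key:     label -> (service, rank)
    Returns service -> list of (rank, url, category) in rank order, where
    a URL already seen higher in the same list becomes "duplicate".
    """
    by_service = {}
    for label, (service, rank) in key.items():
        by_service.setdefault(service, []).append((rank, blinded[label], scores[label]))
    for rows in by_service.values():
        rows.sort()  # order by rank within the service
        seen = set()
        for i, (rank, url, category) in enumerate(rows):
            if url in seen:
                rows[i] = (rank, url, "duplicate")
            seen.add(url)
    return by_service


# Hypothetical evaluation of three blinded pages.
scores = {"p1": "three", "p2": "three", "p3": "zero"}
blinded = {"p1": "http://example.org/x", "p2": "http://example.org/x", "p3": "http://example.org/y"}
key = {"p1": ("ServiceA", 2), "p2": ("ServiceA", 1), "p3": ("ServiceB", 1)}
for service, rows in sorted(unblind(scores, blinded, key).items()):
    print(service, rows)
```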
It happened that sometimes the blinding method failed to work correctly, and the results of one of the search engines had to be retrieved later, up to a week later. The results were only used in this case if, when we searched the query a second time in the same search service, all of the links that were inactive in the earlier results were still present in the first twenty hits of the later results. We then evaluated those links knowing the service they came from.
By reexecuting the query and using the original results only if the new results preserved the inactive links, we sought to confirm that the search engine was indeed providing users with citations to inactive links, even though that engine's pages were not retrieved at the same time as those from the other search engines. We were surprised to discover that in all cases when this happened, the first twenty results returned by a search engine for a search expression were the same a week later (with some minor differences in order). This fact is interesting: the results for three out of three queries in these search engines did not change over the course of a week.
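The acceptance rule for a reexecuted query amounts to a subset check, sketched here in Python. The function name and the sample URLs are hypothetical, and URLs are compared as exact strings, which is an assumption.

```python
def can_use_original(original_inactive, rerun_results, cutoff=20):
    """True if every link that was inactive in the original results
    still appears among the first `cutoff` results of the rerun."""
    return set(original_inactive) <= set(rerun_results[:cutoff])


# Hypothetical example: both inactive links are still being returned.
inactive = ["http://example.org/gone1", "http://example.org/gone2"]
rerun = ["http://example.org/live", "http://example.org/gone2", "http://example.org/gone1"]
print(can_use_original(inactive, rerun))
```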
Both in the way the queries were chosen and in the way the resulting pages were evaluated, we attempted to prevent our own natural biases from affecting the study. The blinding process for page evaluation only worked for the initial inspection of the page, since later checking and updating was done with an awareness of the source of the page. Nevertheless, that initial blinding was important to establishing relevancy precedents within a process that by its nature is very subjective and open to subtle bias.
1. From data collected from September 1995 to January 1996, Tomaiuolo and Packer (1996) found Alta Vista had the best precision. From data collected in December of 1995, Ding and Marchionini (1996) studied Infoseek, Lycos and Open Text and found none of them to have a higher mean statistically. They did not specifically recommend any of the three. In the January, 1996, PC World, Scoville found Lycos best, with Excite and Infoseek a close second and third (Scoville, 1996). From data collected in January of 1996, Chu and Rosenthal gave Alta Vista top honors (Chu and Rosenthal, 1996). From data collected in January of 1996, Internet World Magazine found Infoseek to be the most precise search engine, and Alta Vista to have the best recall (Venditto 1996). In November of 1996, Westera ranked her top three as Alta Vista, Excite and Infoseek (Westera, 1996). In December of 1996, PC Magazine gave Excite, Hotbot and Yahoo their Editor's Choice award (Singh and Lidsky, 1996), although Hotbot was given the award for their sophisticated interface as well as the relevance of results, and Yahoo was awarded as the best directory.
2. As an example of subject areas, we noticed that Lycos was particularly weak with pages from commercial firms and on business topics, while Excite and Hotbot were much stronger in that area. As far as search expression, in Infoseek (1997), it appeared to us that in some cases the forms of the queries were picked to exploit search features that only Infoseek had. But the Infoseek study made no claims to being unbiased, and the forms in question could be defended as "fair" in some sense if it could be demonstrated that actual users often perform searches using those forms.
3. The suite collection was done at the Maxwell Library at Winona State University, Winona, MN. From the Tomaiuolo and Packer (1996) study, we selected every twentieth query for the first one hundred queries, obtaining five queries. Tomaiuolo and Packer also reported recording their topics by getting them in part from real reference questions, and we felt that the addition of their queries should help overcome the location bias of Winona State.
4. For example, see the foreword to Leighton's 1995 study (Leighton, 1995).
5. If one observes Magellan's Voyeur (Magellan, 1997), or has access to another database of search expressions actually entered by users, one will notice that the typical popular query is only one or two words long, and often deals with sex.
6. In the case of personal names, we capitalized the name, and in Hotbot, we did indicate that the expression was a person. In our preliminary test queries, we tried personal names in Hotbot with and without this specification, and it performed significantly worse without the personal name specification than with it.
7. In Leighton's 1995 study, he kept track of the response time of the various services (Leighton, 1995). We did not include that dimension in our design, but when we noticed the problem with Excite during the day, we took an informal count. Excite was the only service that had a problem, and it was slow or non-responsive only during the middle of the day when Internet traffic is high.
8. Located at http://www.ics.uci.edu/pub/websoft/libwww-perl/ and maintained by Roy Fielding at the University of California at Irvine.
9. The instances where this reexecution occurred are as follows: the search for query 3, on athletes, was reexecuted on Infoseek; the search for query 5, on cable TV, was reexecuted on Lycos; the search for query 8, on classical architecture, was reexecuted on Alta Vista.
Chu, Heting and Marilyn Rosenthal. (1996). "Search engines for the World Wide Web: A comparative study and evaluation methodology," ASIS 1996 Annual Conference Proceedings, Baltimore, MD, October 19-24, 1996, 127-135. Also available: http://www.asis.org/annual-96/ElectronicProceedings/chu.html [28 January 1997].
Conover, W. J. (1980). Practical Nonparametric Statistics. 2nd Ed. New York: John Wiley and Sons.
Ding, Wei and Gary Marchionini. (1996). "A comparative study of web search service performance," ASIS 1996 Annual Conference Proceedings, Baltimore, MD, October 19-24, 1996, 136-142.
Gauch, Susan and Guijun Wang. (1996). "Information Fusion with ProFusion," Webnet 96 Conference, San Francisco, CA, October 15-19, 1996. [online]. Available: http://www.csbs.utsa.edu:80/info/webnet96/html/155.htm [22 February 1997].
Harman, Donna. (1995). "Overview of the Second Text Retrieval Conference (TREC-2)," Information Processing and Management, v31 n3, 271-289.
Infoseek. (1997). Infoseek: Precision vs. Recall. [online]. Available: http://www.infoseek.com/doc?pg=prec_rec.html [7 February 1997].
Leighton, H. Vernon. (1995). Performance of four World Wide Web (WWW) index services: Infoseek, Lycos, WebCrawler, and WWWWorm. [online]. Available: http://www.winona.edulibrary//webind.html [1 July 1996].
Magellan Internet Guide. (1997). Real-time Magellan Searches. [online]. Available: http://voyeur.mckinley.com/voyeur.cgi [24 January 1997].
Munro, Jay and David Lidsky. (1996). "Web search sites," PC Magazine, v15 n21 (December 3, 1996), 232.
Scoville, Richard. (1996). "Special Report: Find it on the Net!" PC World, v14 n1 (January 1996), p. 125 (6). Also available: http://www.pcworld.com/reprints/lycos.htm [1 February 1997].
Singh, Amarendra and David Lidsky. (1996). "All-Out Search." PC Magazine, v15 n21 (December 3, 1996), p. 213 (17).
Tomaiuolo, Nicholas G. and Joan G. Packer. (1996). "An analysis of Internet search engines: assessment of over 200 search queries." Computers in Libraries. v16 n6 (June 1996), p58 (5). The list of queries used is in: Quantitative Analysis of Five WWW "Search Engines". [online]. Available: http://neal.ctstateu.edu:2001/htdocs/websearch.html [7 February 1997].
Venditto, Gus. (1996). "Search Engine Showdown." Internet World, v7 n5 (May 1, 1996), 78-86.
Venditto, Gus. (1997). "Critic's Choice." Internet World, v8 n1 (January 1, 1997), 83-96.
Westera, Gillian. (25 November 1996). Search Engine Comparison: Testing Retrieval and Accuracy. [online] Available: http://www.curtin.edu.au/curtin/library/staffpages/gwpersonal/senginestudy/results.htm [7 February 1997].