IV. Evaluation Criteria

For a study of relevance to a user, the measure of relevancy must be defined. In this study, general criteria for what counts as "relevant" and "useful" have been devised, and the results are categorized accordingly. Once the data are available by category, a formula combines those categories into the basic measure--first twenty precision--a single metric for each query for each service.

Relevancy Categories

An area of the design that is inherently subjective is judging the relevancy of the results. Other studies have had two different persons judge some of the same results, so that the biases of one person were counterbalanced by those of the other (Chu and Rosenthal (1996); Munro and Lidsky (1996)). In Leighton's previous study (Leighton, 1995), he defined each category or bin into which he would assign a returned link. That way, although the divisions of the categories were subjective, they were nevertheless tied to criteria, and one can hope that the judgments were therefore more consistent than they would have been otherwise. Ding and Marchionini (1996) followed this model, using criteria to establish categories for relevancy. This study also uses relevancy categories. The categories are: duplicate links, inactive links, irrelevant links (0), technically relevant links (1), potentially useful links (2), and most probably useful links (3).

The first non-numeric category is for duplicate links. If the link in question is the same basic URL as a link earlier in the return list, it is put into the duplicate category, regardless of its other qualities (being inactive, valid or relevant). This is done because, regardless of a link's other qualities, the search engine had the opportunity to match URLs and eliminate duplicates prior to calculating word occurrences and frequencies. The duplicate category includes slight but obvious variants. For example, "http://www.here.org/" and "http://www.here.org/index.html" are counted as duplicates. If one of the directory names in the URL is capitalized in one instance but not in another, it is still counted as a duplicate. Mirror sites or aliases, which have different IP addresses or different directory path names (even when the files are obviously the same or slightly different versions of each other), are not counted as duplicates. The only service that has problems with duplication is Hotbot. The Hotbot display can even detect and acknowledge the presence of some duplicates and variants, but does not eliminate them.[10]
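
To make the duplicate rule concrete, here is a minimal sketch of the kind of URL normalization it implies. The function names and the exact normalization steps are our own illustration, not code used in the study.

    def normalize(url):
        """Reduce a URL to a canonical form for duplicate matching."""
        url = url.lower()                     # capitalization differences ignored
        if url.endswith("/index.html"):       # ".../" and ".../index.html" treated alike
            url = url[:-len("index.html")]
        return url

    def mark_duplicates(urls):
        """Return True for each URL that repeats an earlier one in the list."""
        seen, flags = set(), []
        for u in urls:
            key = normalize(u)
            flags.append(key in seen)
            seen.add(key)
        return flags

    print(mark_duplicates(["http://www.here.org/",
                           "http://www.here.org/index.html"]))   # [False, True]

Mirror sites and aliases would pass through such a check untouched, since their hostnames or directory paths differ.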

The second non-numeric category is for inactive links. These include 404 errors, where the server is contacted but the path is not found; errors saying that access to the page is forbidden; pages announcing that the desired page has moved; and 603 errors, where the server is not responding. For forbidden and 603 errors, the links are checked again several times over a several-week period. The inactive link ratio is an indication of how frequently and thoroughly the service checks the links in its database for currency.[11]
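
As a rough illustration of this link checking, the sketch below classifies a URL using only the Python standard library; it is our own reconstruction of the procedure described, not the study's actual tooling, and the timeout value is an assumption.

    import urllib.error
    import urllib.request

    def link_status(url, timeout=30):
        """Classify a link as active or as one kind of inactive link."""
        try:
            urllib.request.urlopen(url, timeout=timeout)
            return "active"
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return "not found"
            if err.code == 403:
                return "forbidden"      # rechecked several times over several weeks
            return "http error %d" % err.code
        except urllib.error.URLError:
            return "no response"        # server not answering; rechecked later

Pages that merely announce that the desired page has moved would still register as active in such a check and would have to be recognized from their content.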

The active, non-duplicate links are then graded into categories from zero to three. Links assigned to the zero category are links that we judged to be fundamentally irrelevant to the query entered: they do not satisfy an important aspect of the search expression. An example is in query #2, "unemployment rate in Illinois," where a page was assigned a zero if either the phrase "unemployment rate" or the word "Illinois" did not appear in it.

Category one means that the page satisfies the query but is not potentially useful. It is assigned in two general cases. In the first case, the search query is technically satisfied, but the page retrieved is not related to the topic indicated by the query. In general, though not always, a page dealing with the correct topic would use the query terms in closer proximity than a category one page does. For instance, in query #2, if the page quotes an unemployment rate for the United States and then later mentions the state of Illinois, but not its unemployment rate, the link is assigned to category one. A ranking algorithm should be able to lower the rank of such a page by taking into account the proximity of the word "Illinois" to the words "unemployment rate." Several pages that were large alphabetic word lists for dictionaries and spell checkers were retrieved; they contained all of the terms specified in the search expression, but were not useful.

The second general case in which a page is assigned category one is when the page is on the correct topic but is so small or uninformative that it can be of little, if any, conceivable use to anyone who has conducted the search. For instance, in query #15, Maria Luisa Bombal, if the page is a syllabus for a course on Latin American literature that mentions her name as one of the authors to be read but gives no other information about her, it is assigned a category one. While there is no way for an automatic indexing and ranking algorithm to judge the usefulness of a page directly, a page that has more than one occurrence of the terms in the expression will probably discuss the topic more substantially than one with only a single occurrence, and an algorithm that ranks pages with multiple occurrences higher in the return list will probably have fewer category one scores in the first twenty hits.[12]

Category two is defined as potentially useful to the person conducting the search. The reason we say potentially is that any given user may have a specific aspect of the topic in mind that was not specified in the search expression. A category two item is one that provides some useful information about the correct topic of the search. For instance, in query #2, if a page returned by a search engine either gives an unemployment rate for Illinois or discusses the unemployment rate of Illinois, a two is given. A particular user may only want the most current rate, so some of the pages in category two are not what that user wants, but they are at least potentially useful--they could be of conceivable use to some searcher.

Also awarded category two are pages with links to pages that are judged to be category three. For instance, in query #6, "MTX International" corporation, a page that provided a link to MTX International's homepage was judged to be a two. After all, it provided useful information--the location of the company's homepage. Likewise, in query #11, a page that had a link to the National Gallery of Prague's homepage was given a two.

Category three is defined as clearly useful to almost anyone who would conduct the search. It is given to pages that fall into two cases. In the first case, the three is awarded to a page that is a clearinghouse for Web links on the topic, a webliography, the Web equivalent of a bibliography. As a central clearinghouse, it points to pages that deal with many aspects of the topic. Because many aspects are covered, the webliography will be of use to almost anyone searching the topic. For example, a preliminary query that was not used in the test suite was "the data mining of association rules." For that query, the Quest Data Mining Homepage at IBM Almaden was given a three, since it provided access to many different pages dealing with various aspects of data mining and the mining of association rules. Likewise, for query #9, "ecotourism," a page that provided links to many different sites dealing with ecotourism would receive a three.

The second case in which a three is given is when the page is judged to be extremely thorough for the topic specified. For example, in query #14, "Prozac, not the rock group," the Prozac page from Fairlite corporation, which is the manufacturer of the product and which went into great detail about its chemical composition, benefits and side effects, was judged to be useful to almost anyone who would conduct the search. The homepages for MTX International, In Focus Systems, and the National Gallery of Prague were all assigned to category three. The lengthy Sports Illustrated article on the problem of domestic violence among athletes was given a three. For query #15, no category three pages for Maria Luisa Bombal were found by any search engine in the first twenty results, and one might not exist.

All pages retrieved by the search services are categorized into one of these bins: duplicate, inactive, zero, one, two or three. For the criteria used on each query for each category, see Appendix A. These categories are then collapsed into a binary score of zero or one for the different experiments run on the data; see Experimental Design below.

The Basic Measurement

Once relevancy judgments have been performed, it remains to establish what measure shall be used to compare the performance of the search services. This study attempts to judge them by their ability to deliver relevant hits to the user. To measure this ability, we have developed a method which, while not demonstrated to correlate with observed user behavior, nevertheless attempts to capture this sense of relevancy or usefulness to the searcher.[13] The results of the above evaluation process are measured for their ability to put relevant pages within the first twenty links returned by a query, what we call "first twenty precision." Five different experiments are conducted for first twenty precision, so that one can see how the search services perform under different definitions of what counts as "relevant" or "useful."

In Leighton's 1995 study, he created a measurement he called "top ten precision," which would more accurately have been called "first ten precision."[14] What was measured was the percentage of relevant links among the active, non-duplicate links within the first ten links returned by the search service. So had WWWWorm returned five relevant links, with the rest of the links active and non-duplicate, then for that query WWWWorm's first ten precision would have been fifty percent. Many other studies have used "first X precision," whether for the first ten, twenty, or twenty-five results.
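
As a toy illustration of that calculation (ours, not the 1995 study's code), consider a return list in which the first ten links are all active and non-duplicate:

    first_ten = ["relevant"] * 5 + ["irrelevant"] * 5    # no inactive or duplicate links
    judged = [r for r in first_ten if r not in ("inactive", "duplicate")]
    precision = sum(r == "relevant" for r in judged) / len(judged)
    print(precision)                                     # 0.5, i.e. fifty percent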

True recall, the ratio of the number of relevant results returned by the search to the total number of relevant elements in the space, cannot be calculated for Web space, because the total number of relevant links changes quickly and is practically unknowable.[15] True precision, the ratio of relevant elements returned to the total number of elements returned, is too arduous to calculate, because it would mean examining all of the links returned by a service, which may number in the thousands or millions. "First X precision" is designed to reflect the quality of service delivered to the user: how good is the relevancy within the first few pages of results?

For this study, we have followed the example of Ding and Marchionini and increased the size of the examined group to twenty. In practice, a persistent user might well examine the first fifty or one hundred returned links. A colleague at Winona State intentionally jumps ahead many pages in the results list in order to avoid clusters of links from the same server. That notwithstanding, we felt that the first twenty links are a reasonable and feasible group to examine. In calculating the precision, we have added a weighting factor to give more value to certain characteristics of the results.

The Formula for Calculating the Metric

Desired Qualities

When devising a measure such as first twenty precision, one has to establish how that number is calculated. The number should behave in certain ways, rising and falling appropriately under various conditions. In this study, several qualities are incorporated into the final metric.

First, for each different analysis, a link either meets the criteria under examination or it does not. If one is examining the precision for pages of a certain quality, it does not make sense to treat the numeric categories from the evaluation as quantities to be averaged. If one is examining the first twenty precision of potentially useful pages (category two or higher), then ten category one pages are not better than four category two pages or three category three pages.[16]

Second, it is better for a good link to appear higher in the ranked list than lower. The user will see and evaluate the links listed first, so pages located there should be given more weight. Third, the statistic should reflect the fact that if the search service returns fewer hits containing the same number of good hits (i.e., if it has better true precision), it is easier for the user to find and use the good links. However, this goal of rewarding true precision should be moderated, so as not to reward a service for being too cautious and returning very few hits or none at all.

Finally, one must decide how to treat inactive and duplicate links. In Leighton's 1995 study and in Chu and Rosenthal's study, duplicates were eliminated from both the numerator and the denominator of the precision ratio; that is, a search service was not penalized for failing to eliminate them. In the first three tests of this study, duplicates are eliminated only from the numerator, so services are penalized; in the final two, they are eliminated from both numerator and denominator in order to see what the precision is under that scenario. In this study, services that return inactive links are penalized by eliminating the inactive links from the numerator only.

The Actual Formula

In order to analyze and compare the first twenty precision of the five search services, we have devised a formula that expresses the performance of a service on a query as a single number between zero and one. The results from each service for each query, up to the first twenty, have already been categorized by their status (active or not, duplicate or not) and degree of usefulness. The categorized results are then used to produce this metric.

The formula for our metric begins by converting the categories into binary values of zero or one. For example, if the test is for pages that satisfy the query minimally, then pages in categories one, two and three are assigned a one in the formula, and all others are given a zero. These values are then multiplied by weights that give more value to links appearing earlier in the list. The first twenty links are divided into three groups: the first three links, the next seven links, and the last ten links. These groups were chosen because the first three usually constitute the user's first screen, the next seven round out the user's first results page, and the last ten make up the second page of results. (The links beyond the twentieth could be said to form a fourth group--one that is given zero weight.) The first group is weighted by twenty, the next group by seventeen, and the final group by ten. These weighted values are then summed to produce the numerator of the metric. For example, if for a given test a service returned five good links, it would score (3 x 20) + (2 x 17) = 60 + 34 = 94 if they were the first five links, but only (5 x 10) = 50 if they all fell between ranks eleven and twenty.
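
The weighting scheme can be written compactly. The sketch below is our own illustration under the definitions just given; it reproduces the 94 and 50 of the example above.

    WEIGHTS = [20] * 3 + [17] * 7 + [10] * 10     # ranks 1-3, 4-10, 11-20

    def numerator(binary_scores):
        """Sum the weights of the links scored 1, over at most the first twenty."""
        return sum(w for w, s in zip(WEIGHTS, binary_scores) if s)

    print(numerator([1, 1, 1, 1, 1]))             # five good links at ranks 1-5: 94
    print(numerator([0] * 10 + [1] * 5))          # five good links at ranks 11-15: 50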

The denominator is based on the number of links returned by the search service, up to twenty. If the service returned twenty or more links, then the sum of all weights through the twentieth link, 279, is used. If fewer than twenty links are returned, the denominator is adjusted in order to give some benefit to true precision; if it were not adjusted, it would always be 279.

One could calculate the denominator simply by summing the weights up to the number of links returned. The problem with that approach is that a service could receive high scores by returning only a very few links, and so benefit greatly from an overly conservative algorithm. Indeed, if no links were returned, the denominator would be zero and the metric undefined. Because of this boundary condition, the denominator for this metric is generated by summing all of the weights to twenty, 279, and then subtracting 10 for each link fewer than twenty returned. For instance, if a service returns fifteen links, the denominator is 279 - (5 x 10) = 229, while if it returns one link, the denominator is 279 - (19 x 10) = 89.

The numerator is divided by the denominator to produce the final metric. For a search service that returns over twenty links, the first fifteen of them good, the metric is 229/279. For a search service that returns only fifteen links, all of them good, the metric is 229/229, or a perfect 1. A search service that returns one hit, and it is good, scores 20/89. A search engine that returns no hits gets 0/79, or zero. A search engine that returns five hits, the first three of which are good, scores 60/(279 - (15 x 10)), or 60/129. This function is a Rube Goldberg machine, but it seems to capture our attempt to credit a search engine for not padding out the rest of the first twenty hits with bad hits, without introducing pathological boundary conditions. Below is the complete formula:

       (Links 1-3 x 20) + (Links 4-10 x 17) + (Links 11-20 x 10)
   -----------------------------------------------------------------
          279 - [(20 - min(No. of links retrieved, 20)) x 10]
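
Put as code, the whole calculation might look like the following sketch. It is our own restatement of the formula, with a hypothetical function name; it reproduces the worked examples above.

    WEIGHTS = [20] * 3 + [17] * 7 + [10] * 10            # the weights sum to 279

    def first_twenty_precision(binary_scores):
        """binary_scores: 0/1 judgments for the links returned, in rank order."""
        scores = binary_scores[:20]                      # only the first twenty count
        numerator = sum(w for w, s in zip(WEIGHTS, scores) if s)
        denominator = sum(WEIGHTS) - 10 * (20 - len(scores))
        return numerator / denominator

    print(first_twenty_precision([1] * 15 + [0] * 5))    # 229/279, about 0.82
    print(first_twenty_precision([1] * 15))              # 229/229 = 1.0
    print(first_twenty_precision([1]))                   # 20/89, about 0.22
    print(first_twenty_precision([]))                    # 0/79 = 0.0
    print(first_twenty_precision([1, 1, 1, 0, 0]))       # 60/129, about 0.47
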
One could argue that the adjustable denominator is arbitrary, rewarding services that return between ten and twenty links and only gradually penalizing services that return fewer than ten. No benefit is given to a service that returned five hundred rather than five hundred thousand links. We admit that it is arbitrary. First twenty precision itself is an arbitrary measurement, as are the weights given to the ranked groups. They are arbitrary, but they are established in an attempt to reflect the quality of the service delivered to users. It is toward that end that this investigation is undertaken at all.

Each experiment, described below, uses a slightly different definition of what counts as a good or bad link. For each experiment, the numerator and denominator are calculated for each query in each search service according to the above formula. Those statistics are then analyzed to see how the services compare with one another. See Appendix C for the raw numerators and denominators used to compute the metrics.

V. Experimental Design

Because there are multiple possible ways of defining what should count as a good or bad link, we have performed several different experiments on our data, comparing the services under a variety of definitions. The first three experiments show how the services rate if they are penalized for both duplicates and inactive links. The last two experiments show how the first two would look if duplicate links were not penalized.

The first experiment assigns a 1 to all links in categories one, two and three. It shows how well each service delivers links that minimally satisfy the search query. All other links are assigned a 0. The second experiment assigns a 1 to all links in categories two and three, and assigns all others a 0. It shows how well each service delivers links that are potentially useful for the topic in question. The third experiment assigns a 1 to only category three links, giving all others a 0. It shows how well each service delivers very useful links in the first twenty links returned.

The fourth and fifth experiments start by eliminating duplicates from the results and treating the remaining links as a list of fewer than twenty results. In other words, if there were three duplicates in the first twenty links, the duplicates would be removed and the denominator calculated as if only seventeen links had been returned. Experiment four then assigns a 1 to all category one, two and three links, while experiment five assigns a 1 to category two and three links only.
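
A sketch of this category-to-score conversion, using our own labels for the bins, might look as follows; the resulting 0/1 list would then be fed to a metric function such as the one sketched earlier.

    GOOD = {1: {"1", "2", "3"},      # experiment 1: minimally satisfies the query
            2: {"2", "3"},           # experiment 2: potentially useful
            3: {"3"},                # experiment 3: very useful
            4: {"1", "2", "3"},      # experiment 4: as experiment 1, duplicates dropped
            5: {"2", "3"}}           # experiment 5: as experiment 2, duplicates dropped

    def binary_scores(categories, experiment):
        """categories: per-link labels such as "dup", "inactive", "0" through "3"."""
        if experiment in (4, 5):                  # duplicates shrink the list itself
            categories = [c for c in categories if c != "dup"]
        return [1 if c in GOOD[experiment] else 0 for c in categories]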

Experiments four and five were designed for critics of our other experiments who hold that duplicates are harmless, since duplicates are usually easy enough to detect from the description and abstract. The issue of duplicates matters because of how mirror pages and multiple pages from the same directory or server are treated.

Mirror pages and pages from the same server have their advantages and disadvantages. The advantage of a search service returning a page that mirrors a previous one is that if one server is down or overloaded, the user may try the mirror. The disadvantage is that valuable space toward the top of the ranked list is occupied by pages with redundant content. Mirror pages are counted as separate and valid links, even when they might actually be the same page with an alternative pathname which the search service possibly could have detected and eliminated.

Likewise, some search services will return pages from the same server in clusters. The advantage of this is that those clustered pages may all contain the most relevant information. The disadvantage is that a site's pages will usually link to one another, and having multiple pages from the same site again takes up valuable space in the ranked list with what may be nearly redundant information. The user is forced to page through the results to find new sites. Yet, in this study, services were never penalized for clustering their pages, unless the pages were all category zero or the server failed to respond.

In part because mirror pages and clusters are not penalized in this study despite their negative aspects, we felt the need to run some of our experiments a second time without penalties for duplicates. We felt this need especially because the developers of Hotbot obviously treat the listing of duplicates as a positive feature, while other services, notably Alta Vista, Excite and Infoseek, the three that typically scored highest, often clustered pages and seemed to treat clusters as a positive feature. Thus we have nicknamed experiments four and five the "Hotbot Specials," because Hotbot was the only service that regularly returned duplicates. Ironically, Hotbot was also the service most prone to deliver mirror pages.

The raw results of the experiments are listed in Appendix B. Three numbers are packed into each "a" table: the first shows how many links within the first three were given a one in the experiment, the second how many within the first ten (including the first three), and the third how many within the first twenty. Table B1a reports data on experiment one, B2a on experiment two, and so on. Also in Appendix B are Tables B1b, B2b, B3b, B4b and B5b, which show the metric computed for each query for each service for experiments one through five, respectively, rounded to two digits (the statistical analyses did not use rounded numbers).

Table 1: Shapiro-Wilk test for Normality of Residuals from an ANOVA model

Experiment           S-W p-value
--------------------------------
    1                   0.7441
    2                   0.2496
    3                   0.0003
    4                   0.9510
    5                   0.5111

The metrics for each experiment were evaluated to see whether the residuals from an ANOVA model were distributed normally. Only the residuals for experiment three showed a significant departure from normality (p = 0.0003). See Table 1 for the normality test results. Because the normality assumption required for the ANOVA model was violated, Friedman's randomized block design was used instead, in which the blocks were the queries and the treatment effects were the search services. The Friedman test estimates population medians rather than means, which is appropriate given the skewness present. In all five experiments, the null hypothesis--that the services' medians were equal--was rejected. See the bottom of Table 2 for the p-values of the Friedman tests. Because the null hypothesis was rejected, pairwise multiple comparisons could be conducted between individual services.[17]
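
For readers who wish to reproduce this kind of analysis, the sketch below shows how the two tests could be run today with SciPy. The study itself predates these tools, the data file name is hypothetical, and the additive model used for the residuals is our assumption about the ANOVA layout (queries by services).

    import numpy as np
    from scipy import stats

    # metrics: rows are queries (blocks), columns are the five search services
    metrics = np.loadtxt("experiment1_metrics.txt")      # hypothetical data file

    # Shapiro-Wilk test on the residuals of a two-way additive model
    residuals = (metrics
                 - metrics.mean(axis=1, keepdims=True)
                 - metrics.mean(axis=0, keepdims=True)
                 + metrics.mean())
    print(stats.shapiro(residuals.ravel()))              # statistic, p-value

    # Friedman test: treatments are the services, blocks are the queries
    print(stats.friedmanchisquare(*metrics.T))           # chi-square statistic, p-value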


ENDNOTES, Part II

10. Because there are plenty of duplicates that are not acknowledged, and there is not a great deal of utility in displaying acknowledged duplicates, we suspect this setup to be a bug dressed up as a feature on the part of the Hotbot developers. Return to Text.

11. In the case of query #8, Alta Vista nearly suffered a complete disaster: eighteen of the first twenty links were from the same server, which returned forbidden errors. Fortunately for Alta Vista, the last time the links were checked, the server at Australian National University let this researcher in. They had apparently been experimenting with limiting access to users who had paid licensing fees, but had then abandoned the plan. An interesting observation is that we had repeated the search on Alta Vista two weeks after first discovering that the site was off limits, and received exactly the same first twenty results, eighteen of which were inactive to the user who had not paid licensing. So in this instance, Alta Vista was directing users to a forbidden server two weeks after it was off limits. See the similar observations under Evaluation Method. Return to Text.

12. However, that is not always true. Some pages repeat a term many times in the META fields, in order to attract the notice of the search services, but then have very little content dealing with the META terms. This problem is part of the general problem on the Web that the pages there were not selected and indexed by a third party, as are books in a library. Return to Text.

13. When we say "usefulness," we mean the usefulness of the contents, rather than the usefulness that, say, an intuitive user interface provides. This study did not evaluate search services on their limiting features, their screen layout or their presentation of results, which are also important. Return to Text.

14. Prof. Carol Blumberg at Winona State University pointed out that the name "top ten" is ambiguous, and may either mean the first ten links, or the ten best links retrieved. Return to Text.

15. TREC-2 (Harman 1995) used a sampling method called the pooling method for calculating approximate recall. In this method, the first 200 results from the different search tools were pooled into a group that was then considered to be the pool of relevant materials on the topic. One could attempt such a method with a World Wide Web study if time permitted. Return to Text.

16. Ding and Marchionini unfortunately did calculate an average based on their categories, producing a number whose meaning is unclear to us. Return to Text.

17. The formula for multiple comparisons was taken from equation 15 on page 300 of Practical Nonparametric Statistics (Conover, 1980). Return to Text.


REFERENCES

Chu, Heting and Marilyn Rosenthal. (1996). "Search engines for the World Wide Web: A comparative study and evaluation methodology," ASIS 1996 Annual Conference Proceedings, Baltimore, MD, October 19-24, 1996, 127-135. Also available: http://www.asis.org/annual-96/ElectronicProceedings/chu.html [28 January 1997].

Conover, W. J. (1980). Practical Nonparametric Statistics. 2nd Ed. New York: John Wiley and Sons.

Ding, Wei and Gary Marchionini. (1996). "A comparative study of web search service performance," ASIS 1996 Annual Conference Proceedings, Baltimore, MD, October 19-24, 1996, 136-142.

Gauch, Susan and Guijun Wang. (1996). "Information Fusion with ProFusion," Webnet 96 Conference, San Francisco, CA, October 15-19, 1996. [online]. Available: http://www.csbs.utsa.edu:80/info/webnet96/html/155.htm [22 February 1997].

Harman, Donna. (1995). "Overview of the Second Text Retrieval Conference (TREC-2)," Information Processing and Management, v31 n3, 271-289.

Infoseek. (1997). Infoseek: Precision vs. Recall. [online]. Available: http://www.infoseek.com/doc?pg=prec_rec.html [7 February 1997].

Leighton, H. Vernon. (1995). Performance of four World Wide Web (WWW) index services: Infoseek, Lycos, WebCrawler, and WWWWorm. [online]. Available: http://course1.winona.edu/vleighton/webind.html [1 July 1996].

Magellan Internet Guide. (1997). Real-time Magellan Searches. [Online]. Available: http://voyeur.mckinley.com/voyeur.cgi [24 January 1997].

Munro, Jay and David Lidsky. (1996). "Web search sites," PC Magazine, v15 n21 (December 3, 1996), 232.

Scoville, Richard. (1996). "Special Report: Find it on the Net!" PC World, v14 n1 (January 1996), p. 125 (6). Also available: http://www.pcworld.com/reprints/lycos.htm [1 February 1997].

Singh, Amarendra and David Lidsky. (1996). "All-Out Search." PC Magazine, v15 n21 (December 3, 1996), p. 213 (17).

Tomaiuolo, Nicholas G. and Joan G. Packer. (1996). "An analysis of Internet search engines: assessment of over 200 search queries." Computers in Libraries, v16 n6 (June 1996), p. 58 (5). The list of queries used is in: Quantitative Analysis of Five WWW "Search Engines". [online]. Available: http://neal.ctstateu.edu:2001/htdocs/websearch.html [7 February 1997].

Venditto, Gus. (1996). "Search Engine Showdown." Internet World, v7 n5 (May 1, 1996), 78-86.

Venditto, Gus. (1997). "Critic's Choice." Internet World, v8 n1 (January 1, 1997), 83-96.

Westera, Gillian. (25 November 1996). Search Engine Comparison: Testing Retrieval and Accuracy. [online]. Available: http://www.curtin.edu.au/curtin/library/staffpages/gwpersonal/senginestudy/results.htm [7 February 1997].


Questions: Please contact Vernon Leighton at vleighton@winona.edu Last updated 8/29/97