Foreword

Written June 25, 1996

I have been surprised by the popularity of this study. Surprised because the actual data had a short half-life, and I am sure the results are thoroughly out of date. Some of the major web index services (Alta Vista, Open Text, etc.) were not public at the time. Nevertheless, because the study still seems to be popular, I feel compelled to clean things up a bit. First, Steve Kirsch, the president of Infoseek, wrote me a pointed criticism, to which I feel that I have adequately responded. I will reprint that correspondence below. Second, Dr. Robert Knodt wrote, questioning how I arrived at my results. Upon examination, I see that I did not give some very important pieces of data, so I will provide them here in the foreword.

First, the Infoseek correspondence:

Response by Steve Kirsch, President, InfoSeek Corporation to "Performance of Four World Wide Web (WWW) Index Services: Infoseek, Lycos, Webcrawler and WWWWorm"

Vernon,

I received a copy of your study. Very interesting.

There are a couple of things which you may wish to add before posting on the net again (and append as notes to your study):

  1. InfoSeek has extensively tested and benchmarked its precision/recall performance against other systems and has consistently emerged better than or equal to the top-rated systems in TREC. This is based upon hundreds of queries and professional relevance judgments.
  2. It's dangerous to reach conclusions based on 8 queries.
  3. In Query #1, Lycos was told to get all three words, while InfoSeek was not given the same instructions. That's not fair!
  4. Except in one case, you failed to take advantage of the incredible power that InfoSeek offers. The biggest precision booster in queries is to use discriminating phrases. Lycos does not support this. InfoSeek does! There is absolutely no doubt in my mind that for any given query, you can formulate it more precisely and get more precise results using InfoSeek than with Lycos, and I'd challenge you to come up with just a single counterexample of this!

       For example,
            +chess tournament       should be       chess-tournament
            law school              should be       law-school
            precision recall        should be       precision-recall
            civil war               should be       Civil War
    
  5. I recently needed to get Jane Anderson's email address. I typed Jane Anderson in InfoSeek and virtually all hits were relevant. The top hit had exactly what I needed. In Lycos, I got one hit in the top 10 that was relevant. Clearly there are lots of practical, real life queries, where Lycos doesn't come close to matching InfoSeek's precision.

If you want to be fair, you may wish to append this response to your study or incorporate it in the conclusions.

Steve Kirsch
President
InfoSeek


Reply by H. Vernon Leighton to Steve Kirsch

I would like to address the points raised by Mr. Kirsch.

  1. First, I reread my announcement, and I am afraid I did not make clear in it that the one conclusion I was confident of was that both Lycos and Infoseek are better than either Webcrawler or WWWWorm. In my study, I did not conclude that either Lycos or Infoseek was better than the other, in part because I felt that my sample was too small to infer any statistical difference. So your claim that Infoseek *is* in fact statistically better does not contradict my basic finding.

  2. I totally agree that eight queries a definitive study does not make. My main reason for not doing more was the tremendous amount of time required to follow the links of each result set in order to establish even an admittedly imperfect count of relevant hits. Of course, with so small a sample I could justifiably be charged with advertising my study as more than it actually is, and for that reason your concern is understandable.

  3. On the point that I failed to use the syntax of Infoseek, well, that may be true. I tried to read the online help in each service to find service-specific optimizations, but if, as you say, I really missed the boat, then I apologize. On the other hand, if I missed the optimizations after having attempted to read the help pages, then many other users have probably missed them, too.

    I tried in general to come up with an expression customized to each service. I often tried two or three candidates and took the best. For instance, I often had to shave words off of my query in WWWWorm to get any hits at all, and I had to rework the query about ATM in Lycos to keep it from generating many false hits. With the query that you questioned, I believe that I tried a few different expressions in Infoseek, especially because I had bad results in what was otherwise a very good service. But even without that question, my statistical results were about the same. I could be unconsciously biased, but I felt that I made a good faith effort to get some relevant results from each index, if possible.

  4. Thank you for the syntax tips.

I was impressed by the extent to which you test and evaluate your own service. It shows in the quality of the results. I have recommended to our collection development librarian that we subscribe to the fee-based Infoseek service for our Reference Desk at the library. We have not done so yet, unfortunately, but I remain hopeful. I congratulate you on offering a free service, and I wish you the best of luck in generating sales; you clearly deserve it.

I would like to explain again why I published my study the way I did. I did not send it to a peer-reviewed journal, not because I was afraid of responsible criticisms such as Mr. Kirsch's (and on Internet, one can publish irresponsibly), but because the results, with their admittedly limited validity, become increasingly inaccurate with each passing day. I felt that announcing on Listservs and publishing on the Web was the most timely way to deliver a finding that had any usefulness at all. Had my other duties allowed, I would have advertised it at the beginning of September.

H. Vernon Leighton, vleighton@winona.edu

Supplementary charts

Written June 25, 1996

Here I have added some charts that will help the researcher reconstruct my statistics. Because I am reconstructing this a year after the fact, I can only find the data for queries one through five. While this data will not reconstruct all of my statistics, it nevertheless shows what numbers I was working with. Because the services change so rapidly anyway, the numbers are less important than an understanding of how they were obtained and whether they were calculated correctly.

In Chart S1, I have listed, for each query and each service, a set of three numbers: first, the total number of results retrieved by the service; second, the number of those sites which, after several attempts, did not respond; and third, the number of items in the retrieved list that were duplicates of other items in the list. By duplicates, I do not mean mirror sites, but true copies of an earlier URL.

In Chart S2, I have listed, for each query and each service, the total number of sites (not just in the top ten) that I judged to be either a "4" or a "5" on my ratings scale.

In Chart S3, I list the bad sites and duplicates within the top ten results from each query and each service.

I apologize to the many readers for whom I did not provide this data before. The data on the total number of hits is necessary for generating precision. The number of bad sites also indicates the quality of the service. As one can see, in June of 1995, Lycos was retrieving a fair number of sites that were not active. To arrive at Chart #4 in the study, the one that says the most to me, one needs to know the total number of sites in the top ten, the number of good sites in the top ten, and the number of duplicates in the top ten. To see the definitions of the regular charts, click here.


Chart #S1: Total Results/Dead Sites/Duplicates

Queries       Infoseek       Lycos          Webcrawler    WWWWorm
      1       10/0/0         44/9/1         12/4/0         2/0/0
      2       10/0/1         50/8/9         50/5/0         50/2/29
      3       10/0/1         12/0/1         15/0/0         5/1/0
      4       10/1/1         18/2/3         10/2/0         2/0/1
      5       10/0/0         48/3/8         2/0/0          50/1/41


Chart #S2: Total Good Sites without Duplicates

Queries       Infoseek       Lycos          Webcrawler    WWWWorm
      1          0             5                0            1
      2          6            25               10           16
      3          6             5               13            2
      4          6             8                0            1
      5          8            27                1            0


Chart #S3: Bad Sites/Duplicates in the Top Ten

Queries       Infoseek       Lycos          Webcrawler    WWWWorm
      1          0/0          1/0              3/0          0/0
      2          0/1          0/0              3/0          1/0
      3          0/1          0/1              0/0          1/0
      4          1/1          0/0              2/0          0/1
      5          0/0          0/0              0/0          0/6


Further thoughts

Written June 25, 1996

Since doing this study, I have used various index services often. I have noticed that the problem of duplicates has virtually gone away. The big index services today (e.g., Infoseek, Lycos, Alta Vista, Open Text, Yahoo) all seem to do fairly well (as opposed to Webcrawler and WWWWorm in this study). A further study would need a large sample size to discern a difference among them. And then there is the issue of what types of questions you are asking and what counts as a good answer. Well, enjoy the study.

Performance of Four World Wide Web (WWW) Index Services: Infoseek, Lycos, Webcrawler and WWWWorm

by H. Vernon Leighton

Because of the concerns that some have raised about my study and how I have publicized it, I am providing access to comments and criticisms. Click here to see the initial announcement relating to the study that I submitted to PACS-L, GOVDOC-L, and IR-L. Click here for a letter sent by Steve Kirsch expressing concern about the study and for my response to his letter.


I did this project for a graduate class in Computer Science. Although I did not get an outstanding grade on the project (see Appendix C for the gist of the professor's comments), I feel that there are real results and that they are important enough to post on the Internet.

This project studies four of the largest web index services available for free on Internet: Infoseek, Lycos, Webcrawler, and WWWWorm. It was done in the late spring of 1995, so the results are already dated. I varied two factors in my design to study server performance: the time of the user request (peak or quiet hours of the day) and the relevance of the results.

In conducting the experiment, I found that those two factors are so unrelated that I now consider the data to represent two experiments, each testing one factor. In my experience, when the indexer handled the query at all during peak times, it produced the same result that it had produced off-peak.


The Relevance Experiment

To study the relevance and precision of query results from the various services, I conducted a two-factor full factorial test with no replication (chapter 21 of the course textbook). The different queries formed the rows and the services formed the columns. In this way, one could analyze the effects of the queries and the services on their respective results.

Query selection

The queries were selected to be a combination of easy and difficult ones in terms of finding Internet resources. Although the various indexers had different query syntax, I attempted to require that each word be in the record. For long queries on services with a "min-words" field, I required one word fewer than the number of words in the search string; for instance, for "best law schools ranking" I required a minimum of three matched words. I tried to choose queries that an indexer could fail on if it were not matching all of the words in a subject-related context.
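
As a concrete illustration of that rule, the small sketch below (a hypothetical helper written for this write-up, not part of any service's interface) shows the min-words value chosen for a given search string:

    # Hypothetical helper illustrating the rule described above: for longer
    # queries on services with a "min-words" field, require one word fewer
    # than the number of words in the search string (with a floor of one).
    def min_words_setting(query: str) -> int:
        words = query.split()
        return max(1, len(words) - 1)

    print(min_words_setting("best law schools ranking"))  # -> 3
    print(min_words_setting("Japanese"))                  # -> 1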

The challenge in creating a query is to formulate one that will actually retrieve the types of documents that you consider relevant. In traditional databases, keywords are selected from a controlled vocabulary, and successful queries can then be built from that vocabulary. In Webspace, no controlled vocabulary exists. Different indexers even index different parts of the resources: some use only title keywords, some also include URL anchor words, and some index the full text of each web page.

Because my queries are open to interpretation, grading the relevance of the resulting document set is subjective, which is not to say that it is not rigorous. Sometimes the only confirmation I have of some of my judgments is that at least one server returned results that agreed with them--at least one indexer seemed to "understand" what I meant.

Often I had to revise my definition of what should count as a correct response. For example, the last set of results for "Japanese" contained pages for the Japanese varieties of some trees. I had been looking for "Japanese" as part of the subject, and not merely as part of a phrase such as "for a Japanese version of this page, click here," but I had been expecting Japanese language, Japanese culture, or Japanese web servers, not Japanese plants.

Queries

What follows is a list of the queries that I used (a chart of the criteria for judging each question is located in Appendix A):

Query #1: ranking best law schools
Query #2: Japanese
Query #3: atm internet protocol
Query #4: precision recall information
Query #5: civil war battlefields Virginia
Query #6: victims child sexual abuse
Query #7: water quality agriculture nitrates
Query #8: chess tournament

The Index Services

Four services were compared in this study. The first three are well financed by their sponsors: Infoseek is a Sun Microsystems company; Lycos is a project developed at Carnegie-Mellon University; and Webcrawler, while it started out as a graduate student project, is now supported by America Online. WWWWorm is, I believe, still a student project running on a 486 using NEXTSTEP software. Not surprisingly, one can tell from the results that Worm is in a class below the others. There are many other web index services on Internet (Jumpstation II, Nikos, Aliweb, etc.). For those, I felt that the poor results on the initial queries warranted non-inclusion. As it turns out, that conclusion should have been reached for WWWWorm as well.

Because each service has a different search technique, the syntax and semantics of the search query are different. As I stated above, I tried to create the equivalent of an "almost all terms required" Boolean AND query. In Infoseek, one may specify phrases, proper names, and required as opposed to desired terms. So, for question #5 (civil war battlefields Virginia), in Infoseek I required all words except Virginia, since a page might not specifically mention the state. In Lycos, one simply specifies how many of the terms are required, not which ones, so I specified three of the terms for question #5. In Worm, the regular query often got zero hits, so I would drop one of the less crucial terms to try to get some results. See Appendix B for a list of the queries as they were issued to the respective services.

What measures I actually used

I was originally going to use "recall" as a measurement in this comparison, but I have decided against it. Recall requires knowing the total number of relevant documents in the collection, and its use as a metric has been challenged for large databases because there seems to be little overlap in the relevant citations retrieved by different queries on the same topic (see Marchionini, 1994, and Belkin, Cool, Croft, and Callan, 1993). The Web is as large and unstructured a search space as one is likely to find, so recall is almost meaningless.

I developed a set of criteria gradually as I studied the results of the various queries. A set of hypertext links is not like the normal results of a bibliographic database. Some qualities of the results sets are: there may be URLs that no longer link; there may be duplicates of previous hits; there may be mirror sites that have the same page at a different location; and there may be pages that point to other promising pages. In my opinion, a good indexer will not have many bad links or duplicate hits; a mirror site is valid; and a page that points to a great page should get some credit.

When counting precision, though, I feel that two precision figures are needed: a precision for all records retrieved and a precision for the first ten URLs retrieved. There are two reasons for this second measure: first, it allows direct comparisons between the precision of the free Infoseek service and the others; second, the quality of hits at the top of the list is important from the point of view of the user, who may not want to go through many links trying to find a site that has what he or she wants (for a similar precision measure see Turtle, 1994). In both precision numbers, I chose to include bad links in the base, since if the service had deleted them from its database, precision would have been better.

The measures I developed:

1. the ratio of bad or duplicate URLs to total hits within the top ten returned URLs,
2. the total number of good hits (rated a four or five), not counting duplicates,
3. the ratio of hits scoring a four or five to the total results set, not counting duplicates (total precision), and
4. the ratio of hits scoring a four or five to the results in the top ten items of the results set, not counting duplicates (top ten precision).
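
To make these definitions concrete, here is a minimal sketch of how the four measures can be computed for a single results set. The counts below are hypothetical (they are not taken from any chart in this study), and the treatment of the bases follows my reading of the definitions above: duplicates are dropped, dead links stay in.

    # Hypothetical counts for one query on one service (illustrative only).
    # 'ratings' holds the 1-5 relevance scores of the non-duplicate URLs in
    # the results set, in ranked order; dead links are kept and rated 1.
    total_retrieved = 18                 # size of the results set as returned
    dup_top10, bad_top10 = 1, 2          # duplicates / dead links among the first ten URLs
    ratings = [5, 4, 1, 4, 2, 5, 1, 3, 4, 2, 1, 4, 5, 2, 1, 3, 2]

    top10_returned = min(total_retrieved, 10)
    top10_base = top10_returned - dup_top10      # first ten URLs minus duplicates

    # Good hits are those rated four or five; good_top10 counts them among
    # the non-duplicate URLs in the first ten returned (here approximated by
    # the first top10_base entries of the de-duplicated list).
    good_total = sum(1 for r in ratings if r >= 4)
    good_top10 = sum(1 for r in ratings[:top10_base] if r >= 4)

    measure_1 = 100.0 * (bad_top10 + dup_top10) / top10_returned   # cf. Chart #1
    measure_2 = good_total                                         # cf. Chart #2
    measure_3 = 100.0 * good_total / len(ratings)                  # cf. Chart #3, total precision
    measure_4 = 100.0 * good_top10 / top10_base                    # cf. Chart #4, top ten precision
    print(measure_1, measure_2, measure_3, measure_4)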

Results

See Charts #1 to #4 for results. In analyzing the results, there is the danger of applying tests that are not appropriate for the data in question. In this case, there is the danger of treating the results of each query as simply a repetition of a generic trial on a particular server. If that were so, one could apply a one-way analysis of variance to establish the source of variation among the services. However, because each column contains the results of the same set of queries, rather than random and independent trials, a two-way analysis of variance must be performed in order to study the effect of the query as a factor.

For measures one through three, I decided that a careful analysis of variance was not really necessary: a visual inspection of the charts was enough to get a general idea of the results, and only for the last measure are the four services truly comparable. I added averages and standard deviations with the understanding that the samples were not random and independent. See Chart #5 for the results of the two-way ANOVA test on measure four, the precision of the top ten hits.

Analysis

Of the four services, it is clear that WWWWorm is not in the same class as the others. It rarely got many results, and often when it did, the percentage that were duplicates was very high. When Worm did find matches, though, it sometimes had more relevant URLs than Webcrawler did. Webcrawler also did poorly, having the lowest alpha of all four in top ten precision (see Chart #4).

Lycos and Infoseek had very comparable results. Except for the first question, on which Infoseek unexpectedly bombed, they had comparable top ten precision. Without the first question, Infoseek's average top ten precision was 71.4%, while for Lycos, it was 80.6%. [Note added on 11/8/95: Due to the small sample size, this difference is so small that I did not even bother to check for statistical significance.] Lycos had more total good hits, but I had set the maximum set at 50 hits for Lycos, and the free service from Infoseek only gives ten hits (one can subscribe for more complete results). With the first query included, Lycos widened the gap with Infoseek. The analysis of variance indicated that the queries did not have a significant impact on performance, but that the choice of server did. I did not generate a confidence interval for the effects due to lack of time.

Chart #1: Bad or Duplicate to Total Results Ratio, in percent [6/25/96: within the Top Ten Results]

() indicates the total size of the results set

Queries       Infoseek       Lycos          Webcrawler     WWWWorm
   1             0            10              30             0 (2)
   2            10             0              30            10
   3            10            10               0            20 (5)
   4            20            10              20            50 (2)
   5             0             0               0 (2)        60
   6             0             0               0            40
   7            10            10              20             0 (2)
   8            20            20               0 (5)        60 (5)
------------------------------------------------------------------
Averages         8.75          7.5            12.5          30
Standard Dev.    8.345         7.07           13.887        25.635

Chart #2: Total relevant hits, minus duplicates

() indicates the total size of the results set minus duplicates
Queries       Infoseek       Lycos          Webcrawler     WWWWorm
    1            0             5                0            1 (2)
    2            6            25               10           16
    3            6             5               13            2 (5)
    4            6             8                0            1 (2)
    5            8            27                1 (2)        0
    6            7            19                3           11
    7            5            20                8            2 (2)
    8            7            14                2 (5)        0 (2)
------------------------------------------------------------------
Averages         5.625        15.375            4.62         4.125
Standard Dev.    2.446         8.733            5.012        5.987


Chart #3: Total Precision without Duplicates (percent)

Queries       Infoseek       Lycos          Webcrawler    WWWWorm
      1          0            11.6              0           50 (2)
      2         66.7          61               20           76.2
      3         66.7          45.5             86.7         40 (5)
      4         66.7          53.3              0           50 (2)
      5         80            67.5             50 (2)        0
      6         70            65.5             33.3         52.4
      7         62.5          60.6             47.1        100 (2)
      8         87.5          66.7             40 (5)        0 (2)
------------------------------------------------------------------
Averages        62.513        53.963           34.638       46.075


Chart #4: Top Ten Precision Ratio without Duplicates (percent): Two-factor Analysis

Queries    Infoseek    Lycos     Webcrawler   WWWWorm  | Q averages  Beta
   1           0        40           0          50 (2) |    22.5     -33.106
   2          66.7     100          10          80     |    64.175     8.569
   3          66.7      50          90          40 (5) |    61.675     6.069
   4          66.7      80           0          50 (2) |    49.175    -6.431
   5          80        80          50 (5)       0     |    52.5      -3.106
   6          70        90          33.3        57.1   |    62.6       6.994
   7          62.5      89          75         100 (2) |    81.6      25.994
   8          87.5      75          40 (5)       0 (2) |    50.625    -4.981
-------------------------------------------------------------------
Averages      62.513    75          37.288      47.138 |    55.60625
Alphas         6.906    20         -18.319      -8.469  


Chart #5: Two-way ANOVA of Top Ten Precision

SOURCE    DF      Mean Squares     F-ratio    F-table 95%
Query      7          1147          1.4819      3.08
Server     3          2267          2.9289      2.49
ERROR     21           774     
TOTAL     31
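
For readers who want to check the arithmetic, Chart #5 (and the alphas and betas in Chart #4) can be recomputed directly from the top ten precision values. The sketch below is an illustrative reconstruction, not the original analysis; because several chart entries are rounded, its output matches the figures above only approximately.

    # Illustrative reconstruction of Chart #5 from the Chart #4 data:
    # a two-factor (query x service) analysis of variance with one
    # observation per cell.  Columns: Infoseek, Lycos, Webcrawler, WWWWorm.
    data = [
        [ 0.0,  40.0,  0.0,  50.0],   # query 1
        [66.7, 100.0, 10.0,  80.0],   # query 2
        [66.7,  50.0, 90.0,  40.0],   # query 3
        [66.7,  80.0,  0.0,  50.0],   # query 4
        [80.0,  80.0, 50.0,   0.0],   # query 5
        [70.0,  90.0, 33.3,  57.1],   # query 6
        [62.5,  89.0, 75.0, 100.0],   # query 7
        [87.5,  75.0, 40.0,   0.0],   # query 8
    ]
    r, c = len(data), len(data[0])                     # 8 queries, 4 services
    grand = sum(sum(row) for row in data) / (r * c)
    row_means = [sum(row) / c for row in data]
    col_means = [sum(row[j] for row in data) / r for j in range(c)]

    # Effects as reported in Chart #4.
    betas  = [m - grand for m in row_means]            # query effects
    alphas = [m - grand for m in col_means]            # service effects

    # Sums of squares, degrees of freedom, mean squares, and F-ratios.
    ss_query  = c * sum((m - grand) ** 2 for m in row_means)
    ss_server = r * sum((m - grand) ** 2 for m in col_means)
    ss_total  = sum((x - grand) ** 2 for row in data for x in row)
    ss_error  = ss_total - ss_query - ss_server

    df_query, df_server = r - 1, c - 1                 # 7 and 3
    df_error = df_query * df_server                    # 21
    ms_query, ms_server = ss_query / df_query, ss_server / df_server
    ms_error = ss_error / df_error

    print(ms_query, ms_server, ms_error)               # ~1147, ~2267, ~774
    print(ms_query / ms_error, ms_server / ms_error)   # ~1.48,  ~2.93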


The Response Experiment

Response is a simple measure that hides a multitude of parameters. My plans for response time were to search each of the index services both when they were idle and when they were busy, in order to compare. However, the difficulty in reaching some of the services during busy times has forced me to scale back and study only the response times during periods when responses on Internet in general were quite good (Saturday and Sunday).

Results

See Chart #6 for the results of the response times. It is quite clear from a casual inspection that Webcrawler and Infoseek have the best response times, and the least variable ones. WWWWorm had a steady response time that never fell below 70 seconds, which I suspect is the true floor for its response.


Chart #6: Response Times for Low Traffic Periods (in seconds)


Queries       Infoseek       Lycos          Webcrawl       WWWWorm
   1             6            23                1            70
   2             7            15                1            89
   3             9            15                2            71
   4             5            30                1            81
   5             4            10                2            78
   6             6             6                1            74
   7             4             4                2            71
   8             7             3                1            71

------------------------------------------------------------------
Average          6            13.25             1.375         75.625
Standard Dev.    1.690         9.528            0.518          6.675
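
The averages and standard deviations in the summary row use the sample (n - 1) form of the standard deviation; a quick check for the Infoseek column:

    # Quick check of the Chart #6 summary row for the Infoseek column.
    from statistics import mean, stdev        # stdev uses the n - 1 denominator

    infoseek = [6, 7, 9, 5, 4, 6, 4, 7]       # seconds, queries 1 through 8
    print(mean(infoseek))                     # 6
    print(f"{stdev(infoseek):.3f}")           # 1.690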


Appendix A

Queries were chosen primarily to have the quality that one or two of the words in isolation could be interpreted in ways other than the one intended. They were also chosen to be queries for which resources were likely to be found on the Internet. Finally, they are common topics that are asked about at a university library reference desk (except for number four, which was my own question).

Query #1: ranking best law schools
1= unrelated
2= a page from a college that has a law school
3= a page from a law school discussing its quality
4= a page that points to a page that contains rankings of law schools
5= a page that lists a ranking of the best law schools (in someone's estimation).

Query #2: Japanese
1= unrelated to the word or concept "Japanese"
2= a page from Japan that has a Japanese version, but which does not discuss anything Japanese as a subject
3= not used here
4= a page that discusses something Japanese: the country, the language, "Japanese" varieties of various plants, etc., or a page that points to a "5".
5= a page with rich resources dealing with some aspect of a Japanese subject.

Query #3: atm internet protocol
1= Unrelated
2= actually mentions asynchronous transfer mode.
3= mentions internet protocol in relation to ATM.
4= mentions implementation of IP over ATM or points to a "5".
5= a page that has discussions/resources dealing with the implementation of IP over ATM.

Query #4: precision recall information
[Note: this killed two birds with one stone; I searched the net for my own topic.]
1= "I recall precisely one time when farmer Jones ..."
2= actually mentions information precision and recall.
3= presents some p/r numbers for a database.
4= the presentation of a full study of p/r in information retrieval.
5= discussion or resources or a study of the theory of precision and recall in information retrieval.

Query #5: civil war battlefields Virginia
1= unrelated
2= deals with U.S. Civil War
3= deals with either Virginia or some battlefield in Civil War
4= deals with a Virginia civil war battlefield or points to a "5".
5= page with rich resources dealing with facts, citations, etc. on the topic of civil war battlefields in Virginia

Query #6: victims child sexual abuse
1= unrelated
2= mentions child abuse
3= deals with child abuse or neglect and mentions sexual abuse.
4= deals with child sexual abuse, or points to a "5", or deals with men who perceive themselves to be the victims of child sexual abuse enforcement laws.
5= deals with the issues around being the victim of child sexual abuse.

Query #7: water quality agriculture nitrates
1= mentions water, nothing else
2= mentions water quality, or irrigation, nothing else
3= deals with water quality, unrelated to agricultural pollution
4= deals with nitrate levels in water, or points to a "5", or deals with water quality and agriculture
5= discusses agricultural use of nitrates and their effect on water quality.

Query #8: chess tournament
1= no mention of tournaments and no pointers.
2= no tournaments and minimal pointers to chess resources, none of which deal with tournaments.
3= rich chess resources and a pointer to a page that mentions tournaments.
4= a page that mentions the results of a tournament or that points to a "5".
5= a page that lists a schedule of chess tournaments.


Appendix B

The following are the actual queries submitted using the syntax of the service. Some queries failed in their original form, and were adjusted, often by dropping one or more words.

Query                                                           Date searched

Query #1:
Infoseek: best law school ranking                               5/28/95
Lycos: best law school ranking (min-words=3)                    5/28/95
Webcrawler: best law school ranking                             5/28/95
Worm: best law school                                           5/28/95

Query #2:
Infoseek: +Japanese                                             5/21/95
Lycos: Japanese (min-word=1)                                    5/21/95
Webcrawler: Japanese                                            5/21/95
Worm: Japanese                                                  6/3/95

Query #3:
Infoseek: +atm +ip                                              5/21/95
Lycos: atm. Internet protocol                                   5/21/95
Webcrawler: atm ip                                              5/21/95
Worm: atm ip                                                    6/3/95

Query #4:
Infoseek: +precision +recall                                    5/21/95
Lycos: precision recall information                             5/21/95
Webcrawler: precision recall information                        5/21/95
Worm: precision recall information                              6/3/95

Query #5:
Infoseek: civil war battlefields Virginia                       5/28/95
Lycos: civil war battlefields Virginia                          6/2/95
Webcrawler: civil war battlefields Virginia                     6/3/95
Worm: civil war battlefields                                    6/3/95

Query #6:
Infoseek: child sexual abuse victims                            6/3/95
Lycos: child sexual abuse victims                               6/3/95
Webcrawler: child sexual abuse victims                          6/3/95
Worm: sexual abuse                                              6/3/95

Query #7:
Infoseek: "water quality" +nitrates agriculture                 6/3/95
Lycos: water quality nitrates agriculture                       6/3/95
Webcrawler: water quality nitrates                              6/4/95
Worm: water quality nitrates                                    6/3/95

Query #8:
Infoseek: +chess tournament                                     6/4/95
Lycos: chess tournament                                         6/4/95
Webcrawler: chess tournaments                                   6/4/95
Worm: chess tournament                                          6/4/95


Appendix C: What my Professor Said

Out of a sense of truth in packaging, I thought I ought to paraphrase the comments of the professor who graded the assignment. The course was entitled Performance Analysis of Computer Systems and was taught by Prof. Lilja at the University of Minnesota. He deducted a good many points because my statistical analysis was not thorough enough. While that criticism is probably justified, I feel the need to post the study anyway, because the basic result--that Lycos and Infoseek are the best index services out there right now--is fairly well established by even the minimal analysis that I did. The F-ratios indicate that the significant difference in the results lies in the services and not in the queries, and the gap between the top two and the bottom two services is wide.


Bibliography

Belkin, N., C. Cool, W. B. Croft, and J. P. Callan. "The effect of multiple query representations on information retrieval system performance," in Proceedings of the Sixteenth Annual International ACM-SIGIR Conference, ed. Robert Korfhage, Edie Rasmussen, and Peter Willett. Pittsburgh, Pennsylvania, June 27-July 1, 1993, pp. 339-346.

Marchionini, Gary and Diane Barlow. A comparison of Boolean-based retrieval to the WAIS system for retrieval of aeronautical information: Final report. NASA Contractor Report 4569. Washington: Government Printing Office, 1994.

Turtle, Howard. "Natural language vs. Boolean query evaluation: a comparison of retrieval performance," in Proceedings of the Seventeenth Annual International ACM-SIGIR Conference, ed. W. Bruce Croft and C. J. van Rijsbergen. Dublin, Ireland, July 3-6, 1994, pp. 212-220.

Copyright H. Vernon Leighton, vleighton@winona.edu