On Sun, 16 Mar 1997, Harmon Seaver wrote:

> Thomas A. Robertson wrote:
>
>> Is this the kind of application Digital's AltaVista could work in. If
>> you are not familiar with it, go to www.altavista.digital.com and
>> check it out. They are making a similar product for use on our PCs and
>> I plan to install it next week.
>
> Altavista's search engine -- as well as every other WWW search
> engine (although I haven't looked at others recently) -- is seriously
> flawed. It's Boolean logic simply isn't. I think what Robert is seeking
> is something more on the order of the search engine used by Dialog, or
> even the Notis library system. In other words, if you enter a search
> with the line "cats AND dogs" you get a return with both, not 10,000
> hits containing one or the other -- something the web search engines
> can't seem to do very well. And then, of course, beyond that, you want
> limiters for all and sundry types -- dates, proximity, wildcards, etc.

This is slightly overstated. Each of the search engines has online help that explains available capabilities for devising a more precise search, and those capabilities change at irregular intervals. (So does the online documentation!) Keeping up with what's possible (in regular and "power search" mode) is an exercise in frustration, but far more features are available than are presented to novice searchers who don't go looking for them.

Ways of constructing phrases (using either quotation marks or parentheses, depending on the search engine) have to be determined by checking the Help or FAQ files, and you must make that effort. Defaulting to OR or to AND for terms typed in also varies; again, see the online help.

Unfortunately, AND is a pretty sloppy operator in full-text searching; unless you are allowed to limit by field in which the terms are to co-occur, you get some (seemingly) pretty bizarre results which are technically correct--both your terms may occur, but in quite unrelated locations.
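To make the sloppiness of a full-text AND concrete, here is a minimal sketch in Python. It is only an illustration: the function names and the two sample texts are made up for this example, and "within 3 words" is taken to mean a difference of at most 3 word positions, which is one reasonable reading and not any particular engine's rule.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_and(text, term_a, term_b):
    """Plain full-text AND: both terms occur somewhere in the text."""
    words = set(tokenize(text))
    return term_a in words and term_b in words

def matches_near(text, term_a, term_b, distance=3):
    """Proximity match: some occurrence of term_a lies within
    `distance` word positions of some occurrence of term_b."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

# Hypothetical sample documents, invented for illustration.
related = "Our shelter houses cats and dogs together."
unrelated = ("Cats were sacred in ancient Egypt. Much later, and in a quite "
             "different chapter, the author turns to sled dogs.")

for doc in (related, unrelated):
    print(matches_and(doc, "cats", "dogs"), matches_near(doc, "cats", "dogs"))
```

Both texts satisfy "cats AND dogs", so the AND is technically correct for each, but only the first passes the proximity test; the second is the kind of hit where the terms occur in quite unrelated locations.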
The problem lies more in searching full text rather than fields than in the presence or absence of boolean operators. Most of the search engines have them; they're just applied in a scattershot fashion. Even DIALOG, which allows limiting by field, delivers a lot of false drops in full-text fields unless you carefully craft your search with proximity operators (e.g. "within 3 words of each other"). And you'll still have some unpleasant surprises that take time to account for. Open Text, for one, has a power search mode that does let you search for a mixture of words or a phrase in a specified field, and further combine those choices with boolean operators.

The problem Harmon mentions, of not excluding items which do not include *all* the terms you specifically "AND-ed" together, comes from "smart" capabilities, which attempt to rank items for you in decreasing order of presumed relevance. Items which contain both/all terms would be listed first, but then items containing fewer of your terms are also listed, in presumed decreasing relevance. You're often not told where the list of items containing all search terms stops and the list of those containing fewer terms begins. Sometimes you're told how many terms are included, but not which terms. Other "weighting" is done by how often a term occurs, or by how close together the terms occur. Since the search engine doesn't know your mind, its algorithms for determining relevance might not produce what you're looking for. If you're lucky, the online help may tell you its weighting criteria. More often you must guess.

Some engines are designed to learn from your past choices, and select items along the lines of your previous indication of "goodness" or "badness." That requires storing that information somewhere, in a user profile, sometimes identified in a "cookie." You have to allow that for such a system to work. Professional searchers pay good money to build such user profiles; others may consider them intrusive and violations of privacy.
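The ranking behaviour described above, where items with all of the terms come first and items with fewer terms trail after them, can be sketched roughly as follows. This is a guess at the general idea, not any engine's actual algorithm: the scoring (number of distinct terms matched, then total occurrences), the function names, and the sample documents are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(documents, terms):
    """Return (name, terms_matched, total_occurrences) tuples, best first.

    Documents matching none of the terms are dropped; documents matching
    only some of them are kept and simply ranked lower.
    """
    results = []
    for name, text in documents.items():
        words = tokenize(text)
        counts = [words.count(t) for t in terms]
        matched = sum(1 for c in counts if c > 0)
        if matched:
            results.append((name, matched, sum(counts)))
    # More distinct terms first, then more occurrences, then name for stability.
    results.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return results

def strict_and(ranked, terms):
    """Keep only the results that contain every search term."""
    return [r for r in ranked if r[1] == len(terms)]

# Hypothetical documents, invented for illustration.
docs = {
    "a": "cats and dogs, dogs and cats",
    "b": "dogs dogs dogs",
    "c": "cats chase mice",
    "d": "birds sing",
    "e": "cats like dogs",
}
terms = ["cats", "dogs"]
ranked = rank(docs, terms)
for name, matched, total in ranked:
    print(name, matched, total)
```

A list that shows only the names gives no hint of where the true AND matches end. Carrying the matched-term count along, as `strict_and` does here, is what lets a searcher cut the list at that point.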
This is one place that shows how good artificial intelligence is getting (or isn't getting). "Smart" search engines try to second-guess you; they fail as often as they succeed. Carefully crafted boolean searches are harder to construct than just slamming in a few terms, but give greater control and precision. But you have to read (and first find!) the online help to know what's possible in each engine.

One further issue--how duplicates are handled--will affect the amount of stuff you have to wade through. Some engines do a better job of eliminating items that are in fact identical (turned up by searching links to links to items); some don't even attempt to thin the mess. Obviously, since the features of one engine often differ from another, "metasearch" engines like Metacrawler simply can't overcome inconsistencies and produce the same level of accuracy in every engine they run in sequence.

It's not going to get much better till we start taming all this stuff, and that will require better classification and indexing. Search engines can only locate and limit according to what's retrievable; if that's a morass, morass is what we'll get. For now, the Web is a morass.

On the other hand, if you test AltaVista or Excite Lite on material on your own PC, results could be much better, depending on the database or files you want to search. More tagging will allow more control. In any case, you can find out how well a demo search engine will work for you. So--go ahead and try!

Dorothy Day

---
Dorothy Day
School of Library and Information Science
Indiana University
day@xxxxxxxx