
Re: OED, CD-ROMS, Hard drives, etc.




On Sun, 16 Mar 1997, Harmon Seaver wrote:

> Thomas A. Robertson wrote:
>
>> Is this the kind of application Digital's AltaVista could work in? If
>> you are not familiar with it, go to www.altavista.digital.com and
>> check it out. They are making a similar product for use on our PCs and
>> I plan to install it next week.
>
>   Altavista's search engine -- as well as every other WWW search
> engine (although I haven't looked at others recently) -- is seriously
> flawed. Its Boolean logic simply isn't. I think what Robert is seeking
> is something more on the order of the search engine used by Dialog, or
> even the Notis library system. In other words, if you enter a search
> with the line "cats AND dogs" you get a return with both, not 10,000
> hits containing one or the other -- something the web search engines
> can't seem to do very well. And then, of course, beyond that, you want
> limiters for all and sundry types -- dates, proximity, wildcards, etc.
>
>

This is slightly overstated. Each of the search engines has online help
that explains available capabilities for devising a more precise search,
and those capabilities change at irregular intervals. (So does the
online documentation!) Keeping up with what's possible (in regular and
"power search" mode) is an exercise in frustration, but far more
features are available than are presented to novice searchers who don't
go looking for them. Ways of constructing phrases (using either
quotation marks or parentheses, depending on the search engine) have to
be determined by checking the Help or FAQ files, and you must make that
effort. Defaulting to OR or to AND for terms typed in also varies;
again, see the online help.

Unfortunately, AND is a pretty sloppy operator in full-text searching;
unless you are allowed to limit by field in which the terms are to
co-occur, you get some (seemingly) pretty bizarre results which are
technically correct--both your terms may occur, but in quite unrelated
locations. The problem lies more in searching full text instead of
fields than in the presence or absence of boolean operators. Most of the
search engines have them; they're just applied in a scattershot fashion.
Even DIALOG, which allows limiting by field, delivers a lot of false
drops in full-text fields unless you carefully craft your search with
proximity operators (e.g. "within 3 words of each other"). And you'll
still have some unpleasant surprises that take time to account for.
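
To make the difference concrete, here is a minimal sketch in Python of
a plain full-text AND next to a "within N words" proximity test. It is
not how DIALOG or any web engine actually implements these operators;
the function names, the sample texts, and the default gap of 3 words
are all my own assumptions for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a list of word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_and(text, term_a, term_b):
    """Plain full-text AND: both terms occur somewhere in the text."""
    words = set(tokenize(text))
    return term_a in words and term_b in words

def matches_within(text, term_a, term_b, max_gap=3):
    """Proximity: the terms occur within max_gap words of each other."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

# Hypothetical sample records: one where the terms belong together,
# one where they merely co-occur in unrelated sentences.
related = "Our cats and dogs share one basket."
unrelated = ("Cats were sacred in ancient Egypt. "
             "The stadium next door sells hot dogs.")

for label, text in (("related", related), ("unrelated", unrelated)):
    print(label,
          matches_and(text, "cats", "dogs"),
          matches_within(text, "cats", "dogs"))
```

Both records satisfy the plain AND, which is the "technically correct"
false drop; only the first satisfies the proximity test.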

Open Text, for one, has a power search mode that does let you search for
a mixture of words or a phrase in a specified field, and further combine
those choices with boolean operators.

The problem Harmon mentions, of getting back items which do not include
*all* the terms you specifically "AND-ed" together, comes from "smart"
capabilities, which attempt to rank items for you in decreasing order of
presumed relevance.

Items which contain both/all terms would be listed first, but then items
containing fewer of your terms are also listed, in presumed decreasing
relevance. You're often not told where the list of items containing all
search terms stops and those containing fewer terms begins. Sometimes
you're told how many terms are included, but not which terms.
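
A rough sketch of that behavior, in Python, may help. It ranks records
by how many of the search terms they contain and, unlike many engines,
reports which terms matched and where the full matches stop. The
function names and the tiny sample collection are hypothetical; real
engines use far more elaborate scoring.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_by_terms_matched(docs, terms):
    """Return (doc_id, matched_terms) pairs, most terms matched first.

    Records matching none of the terms are left out. Terms are
    assumed to be lowercase and distinct.
    """
    results = []
    for doc_id, text in docs.items():
        words = tokenize(text)
        matched = sorted(t for t in terms if t in words)
        if matched:
            results.append((doc_id, matched))
    # More matched terms first; ties broken by doc_id for stability.
    results.sort(key=lambda r: (-len(r[1]), r[0]))
    return results

def count_full_matches(results, terms):
    """Number of leading results that contain every search term."""
    return sum(1 for _, matched in results if len(matched) == len(terms))

# Hypothetical sample collection.
docs = {
    "a": "cats and dogs",
    "b": "only cats here",
    "c": "dogs bark",
    "d": "birds sing",
}
terms = ["cats", "dogs"]
results = rank_by_terms_matched(docs, terms)
cutoff = count_full_matches(results, terms)
for doc_id, matched in results:
    print(doc_id, matched)
print("records containing all terms:", cutoff)
```

Everything after the cutoff is a partial match. An engine that shows
the same ranked list without the cutoff or the matched terms leaves
you guessing, which is the complaint above.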

Other "weighting" is done by how often a term occurs, or by how close
together the terms occur. Since the search engine doesn't know your mind, its
algorithms for determining relevance might not produce what you're
looking for. If you're lucky, the online help may tell you its weighting
criteria. More often you must guess.
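
As a small illustration of why frequency weighting can mislead, the
Python sketch below scores a record by simply counting occurrences of
the search terms. This is an assumed, deliberately naive scheme (the
function name and sample texts are mine), not the documented algorithm
of any particular engine.

```python
import re
from collections import Counter

def term_frequency_score(text, terms):
    """Score a record by the total number of search-term occurrences."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(counts[t] for t in terms)

terms = ["cats", "dogs"]
# Hypothetical records: the first repeats one term, the second
# contains both terms once each.
one_term_repeated = "cats cats cats cats cats"
both_terms_once = "cats and dogs"

print(term_frequency_score(one_term_repeated, terms))
print(term_frequency_score(both_terms_once, terms))
```

Under this scheme the record that never mentions dogs outscores the
one that contains both terms, which is probably not what someone
typing "cats AND dogs" had in mind.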

Some engines are designed to learn from your past choices, and select
items along the lines of your previous indication of "goodness" or
"badness." That requires storing that information somewhere, in a user
profile, sometimes identified in a "cookie." You have to allow that for
such a system to work. Professional searchers pay good money to build
such user profiles; others may consider them intrusive and violations of
privacy.

This is one place that shows how good artificial intelligence is getting
(or isn't getting). "Smart" search engines try to second-guess you; they
fail as often as they succeed. Carefully crafted boolean searches are
harder to construct than just slamming in a few terms, but give greater
control and precision. But you have to read (and first find!) the online
help to know what's possible in each engine.

One further issue--how duplicates are handled--will affect the amount
of stuff you have to wade through. Some engines do a better job of
eliminating items that are in fact identical (turned up by searching
links to links to items); some don't even attempt to thin the mess.
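
One simple way an engine might thin exact duplicates, sketched in
Python: fingerprint each item's text after normalizing whitespace and
case, and keep only the first item per fingerprint. The function names
and the example.com addresses are hypothetical, and this only catches
items that are identical after normalization, not near-duplicates.

```python
import hashlib

def content_key(text):
    """Fingerprint text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def drop_duplicates(hits):
    """Keep the first (url, text) hit for each distinct content key."""
    seen = set()
    unique = []
    for url, text in hits:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append((url, text))
    return unique

# Hypothetical hit list: the same page reached through two addresses.
hits = [
    ("http://example.com/page", "Cats and dogs"),
    ("http://example.com/mirror/page", "cats  and   dogs\n"),
    ("http://example.com/other", "Birds sing"),
]
unique_hits = drop_duplicates(hits)
for url, _ in unique_hits:
    print(url)
```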

Obviously, since the features of one engine often differ from another's,
"metasearch" engines like Metacrawler simply can't overcome
inconsistencies and produce the same level of accuracy in every engine
they run in sequence.

It's not going to get much better till we start taming all this stuff,
and that will require better classification and indexing. Search engines
can only locate and limit according to what's retrievable; if that's a
morass, morass is what we'll get. For now, the Web is a morass.

On the other hand, if you test AltaVista or Excite Lite on material on
your own PC, results could be much better, depending on the database or
files you want to search. More tagging will allow more control. In any
case, you can find out how well a demo search engine will work for you.
So--go ahead and try!

	Dorothy Day

---
Dorothy Day			
School of Library and Information Science
Indiana University
day@xxxxxxxx