Trying to build a better search engine

This column originally ran in ComputorEdge on February 1, 2008
When the World Wide Web was launched onto the Internet in the early 1990s by Timothy Berners-Lee, search engines were already an established type of software tool. When the Internet was still a text-based environment, with telnet and FTP the main vehicles for navigating from server to server and accomplishing online tasks (mostly file retrieval and remotely launching executables), the Archie search engine made it possible to search through files on FTP sites, and Veronica did the same for Gopher menus.

The immediate popularity of the Web led to the explosion in the amount of information stored on publicly accessible servers that we see today. Once the Web took off, making effective use of the massive amounts of information available required search engines. Think about it: How much of the Internet could you or would you use if there were no search engines?

As early as 1993, one of the first search engines dedicated to the Web, AliWeb, was launched. By 1995, there were a handful of popular search engines: WebCrawler, Infoseek, Lycos, Excite, AltaVista and Magellan. Each of them had a slightly different way of finding information on the Web, and a search on one would almost always return different results than a search on any of the others. But what they all shared, along with eventual search engine leaders Google and Yahoo, was Boolean-based syntax for searches.

Quick explanation: If you type Jim Trageser into a search form, you will get all documents that contain the words "Jim" and "Trageser." If you surround "Jim Trageser" with quote marks, then you'll only get those documents that contain the full phrase "Jim Trageser." Boolean searches also let you run negative searches: If you type in "Jim Trageser" -Atari, you'll get any documents that contain the phrase "Jim Trageser" and not the word Atari.
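For readers who like to see the gears turn, here is a minimal Python sketch of the three rules just described: plain words must all appear, quoted text must appear as an exact phrase, and a leading minus sign excludes a term. It is only an illustration under stated assumptions, not how any real search engine is built (those rely on prebuilt indexes rather than scanning every page). The function names and the three sample documents are made up for this example.

```python
import re

def parse_query(query):
    """Split a query into required and excluded terms (hypothetical helper).

    Quoted text stays together as one phrase; a leading '-' marks exclusion.
    """
    required, excluded = [], []
    # Each match: an optional '-', then either a quoted phrase or a bare word.
    for minus, phrase, word in re.findall(r'(-?)(?:"([^"]+)"|(\S+))', query):
        term = (phrase or word).lower()
        (excluded if minus else required).append(term)
    return required, excluded

def tokens(text):
    """Lowercase the text and break it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def contains(doc_tokens, term):
    """True if the term's words appear consecutively in the document."""
    words = tokens(term)
    if not words:
        return False
    n = len(words)
    return any(doc_tokens[i:i + n] == words
               for i in range(len(doc_tokens) - n + 1))

def search(query, docs):
    """Return names of documents with every required term and no excluded one."""
    required, excluded = parse_query(query)
    hits = []
    for name, text in docs.items():
        doc = tokens(text)
        if (all(contains(doc, t) for t in required)
                and not any(contains(doc, t) for t in excluded)):
            hits.append(name)
    return hits

# Made-up sample documents, for illustration only.
docs = {
    "bio": "Jim Trageser writes a column for ComputorEdge.",
    "atari": "Jim Trageser asked a search engine when Atari was founded.",
    "other": "Trageser is a surname, and Jim is a common first name.",
}

print(search('Jim Trageser', docs))           # ['bio', 'atari', 'other']
print(search('"Jim Trageser"', docs))         # ['bio', 'atari']
print(search('"Jim Trageser" -Atari', docs))  # ['bio']
```

Each added criterion can only shrink the list of matches, which is why piling on terms narrows a search.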
Boolean searches are a fairly powerful way of quickly finding a set pattern of data: If you're looking for information for a school assignment on Albert Camus' "The Stranger," you could use any popular search engine to look up "Albert Camus" "The Stranger." If that still returned an inordinate number of results, you could search for "Albert Camus" "The Stranger" novel. By adding search criteria, you can incrementally narrow the list down to the results you're looking for.

Beyond Boolean

While powerful in the hands of someone who understands how Boolean syntax works, Boolean searches aren't exactly warm and fuzzy for non-geeks. Based on the same syntax as algebra and other higher math, Boolean syntax does not lend itself toward what anyone outside an engineering major would call a "natural speech" search.

And that's been the holy grail of search engine programmers the last decade or so: to come up with a form of search engine that will provide relevant results for normal, human queries: "Who was George Washington?" for instance. While that might seem simple enough, the sort of question most American second-graders can answer, coming up with an algorithm to accomplish that has proven remarkably difficult.

In the late 1990s, a Web search site called AskJeeves.com (named after the unflappable butler in P.G. Wodehouse's popular novels) was launched. Users were encouraged to type in their searches in normal, everyday language, yet the results were (disappointingly) the same as those in the existing Boolean search engines: a list of hyperlinked Web sites one could click on if any one looked promising. AskJeeves.com eventually repositioned itself as Ask.com, a fairly standard, apparently Boolean-style search engine.

In the spring of 2004, another stab at getting beyond Boolean syntax was launched. Grokker.com made a big deal out of the fact that its search engine grouped results in geometric patterns rather than a long list of Web sites.
For instance, a Grokker.com search for "Jim Trageser" this week grouped results in circles and squares labeled "North County Times" (my main employer), "The left and abortion" (a series of essays I wrote some years ago), "CD Reviews" and "Computers." Interestingly, while Grokker.com still offers this "map view" of results, its most distinctive feature, the default view is now the standard text view.

The latest attempt

In October, we shared with you the news that a company called Powerset.com was working on natural language searches. Powerset has now allowed a limited number of folks into its search engine in an early beta test (which your loyal ComputorEdge correspondent was among) as the Powerset folks turn their technology loose on Wikipedia.

And to be honest, the early results are pretty impressive. I typed in "When was Atari founded?" and was taken to the page with the text outlining the date of the company's founding highlighted. Now, as I had typed in Atari, Powerset had suggested a series of natural language questions, of which that was one. So I went back to the search and typed in, unprompted, "What was the Atari 1040 ST?" Again, I was taken to a page with a highlighted sentence: "The ST was primarily a competitor to the Apple Macintosh and the Commodore Amiga systems." Which is correct.

So for now, you'd have to give Powerset a B+.