P2P Search Engine

There seems to be increasing interest in the idea of using peer-to-peer (P2P) technology to rejuvenate search in light of Google’s growing inability to link people with knowledge. There have been a couple of early attempts in this area, including Widesource (a P2P system that indexes people’s browser bookmarks) and Grub (similar to SETI@Home, it leverages users’ spare cycles and bandwidth to index the web). The newest entry is WebQuest, a side project of Ian Clarke, of Freenet fame, which allows the user to refine their searches. Most of these ideas parallel my own on how we might improve search engines’ ability to extract context from web pages. But the approach has a few drawbacks that I’d like to throw out there for people to consider and attempt to solve.

Systems such as Google use keyword extraction and link analysis to attempt to extract context. The approach rests on the assumption that users link to other sites based on context, and that it should therefore be possible to infer a page’s context and rank its content. Other sites, such as DMOZ, use human moderators who can understand a page’s context better and categorize it – incurring much manual labour in the process.

But why use such an indirect route? Users know what they’re looking for and, like the US Supreme Court on pornography, they’ll know it when they see it. Why not provide a mechanism for users to “close the loop” and provide direct feedback to the search engine, thus allowing other users to benefit from this extra input into the system?

This has been bugging me for a while, so I decided to throw together the following straw man for a P2P Search Engine that would allow users to leverage how other people “ranked” pages based on their response to search engine results.

This is an updated version of the original post, which pointed to PDF and Word versions of the proposal. As part of a recent web reorganization, I figured it would just be easier to include the proposal text in the post itself.

The Problem

Current popular search engine technology suffers from a number of shortcomings:

  • Timeliness of information: Search engines use a “spider” to traverse the Internet and index its content. Due to the size of the Internet and the finite resources of the search engine, pages are indexed only periodically. Hence, search results are often slightly out of date, reflecting the contents of a web page as of the last time it was indexed.
  • Comprehensiveness of information: Due to both the current size and rate of information expansion on the Internet, it is highly unlikely that current search engines are capable of indexing all publicly available sites. In addition, current search engines rely on links between web pages to help them discover additional resources; however, it is likely that “islands” of information unreferenced by other sites are not being indexed.
  • Capital intensive: Significant computing power, bandwidth, and other capital assets are required to provide satisfactory search response times. Google, for example, employs one of the largest Linux clusters (10,000 machines).
  • Lack of access to “deep web”: Search engines can’t interface with information stored in the databases behind corporate web sites, meaning that some information is invisible to them.
  • Lack of context/natural language comprehension: Search engines tend to be dumb, extracting context only in crude, indirect fashions. Technologies such as Google’s PageRank™ infer a page’s context indirectly, through analysis of keywords in the page and the hyperlinks interconnecting pages.

The only available option to solve these problems is to develop a search technology that comprehends natural language, can extract context, and employs a massively scalable architecture. Given the exponential rate of information growth, developing such a technology is critical to enabling individuals, governments, and corporations to find and process information in order to generate knowledge.

The Proposed Solution

Fortunately, there already exists a powerful natural language and context extraction technology: the search users themselves. Coincidentally, they also are in possession of the resources required to create a massively scalable distributed architecture for coordinating search activities: a vast amount of untapped computational power and bandwidth, albeit spread across millions of individual machines.

What is required is a tool that allows users to:

  1. Index pages and generate meta-data as they surf the web.
  2. Share that meta-data with other search users.
  3. Search other users’ meta-data.
  4. Combine other users’ meta-data for a set of search terms with the user’s own opinion of how well each result matches those terms. This new meta-data is cached locally and shared with the network of users, thus completing the feedback loop (a rough sketch of this follows the list).
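
To make the feedback loop a bit more concrete, here is a minimal sketch (in Python) of what a shared meta-data record and a peer’s local merge step might look like. Everything here – the names SearchMetadata and LocalCache, the 0-to-1 score, the weighted running average – is a hypothetical illustration, not part of the proposal itself.

    from dataclasses import dataclass, field
    import time


    @dataclass
    class SearchMetadata:
        """One peer's opinion about how well a URL matches a set of search terms."""
        terms: frozenset            # normalized search terms, e.g. frozenset({"p2p", "search"})
        url: str
        score: float                # 0.0 (irrelevant) .. 1.0 (excellent match)
        peer_id: str                # the peer that generated this opinion
        timestamp: float = field(default_factory=time.time)


    class LocalCache:
        """A peer's local cache of meta-data, merged from its own feedback and remote peers."""

        def __init__(self):
            self._entries = {}      # (terms, url) -> (running average score, total weight)

        def merge(self, meta: SearchMetadata, weight: float = 1.0):
            """Fold one opinion into the running average for (terms, url)."""
            key = (meta.terms, meta.url)
            avg, total = self._entries.get(key, (0.0, 0.0))
            new_total = total + weight
            self._entries[key] = ((avg * total + meta.score * weight) / new_total, new_total)

        def search(self, terms: frozenset, limit: int = 10):
            """Return the best-scoring URLs for an exact set of terms."""
            hits = [(avg, url) for (t, url), (avg, _) in self._entries.items() if t == terms]
            return sorted(hits, reverse=True)[:limit]

In a scheme like this, a peer’s own feedback would simply be merged with a higher weight than opinions received from other peers, which is what closes the loop in step 4.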

Leveraging Surfing Habits and Search Habits to Extract Search Context

Implementation Hurdles

Insertion of Forged Meta-Data

Though users’ behaviour would “vote out” inappropriate material that got added to the peers’ cache of meta-data, the system would still be prone to attacks designed to boost certain sites’ rankings. A major design challenge would be to enable the system to withstand an attempt at forgery by a rogue peer or a coordinated network of rogue peers.
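
As a purely hypothetical illustration of one mitigation, a peer could cap how much weight any single peer identity contributes to a given (terms, URL) entry before merging it into its cache. This would not, on its own, defeat a coordinated network of rogue peers creating many identities; it only blunts a single noisy forger.

    from collections import defaultdict


    class VoteLimiter:
        """Cap the influence any one peer identity has on a given (terms, url) entry.

        Hypothetical mitigation: a rogue peer can still lie, but it cannot
        out-shout honest peers simply by repeating the same claim thousands of times.
        """

        def __init__(self, max_votes_per_peer: int = 3):
            self.max_votes = max_votes_per_peer
            self._seen = defaultdict(int)   # (peer_id, terms, url) -> opinions already counted

        def accept(self, peer_id: str, terms: frozenset, url: str) -> bool:
            """Return True if this opinion should be merged, False if it should be dropped."""
            key = (peer_id, terms, url)
            if self._seen[key] >= self.max_votes:
                return False
            self._seen[key] += 1
            return True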

Search Responsiveness

As peers on the network will be spread across the Internet, accessing the network at a variety of connection speeds, the responsiveness of the network will be variable. Special consideration must be given to structuring the P2P network – for example, by incorporating supernodes – to offset this characteristic.
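
For illustration only, a supernode policy could be as crude as promoting peers that report good upstream bandwidth and long uptime, letting them route queries and cache popular meta-data on behalf of slower peers. The thresholds below are invented.

    def is_supernode_candidate(uplink_kbps: float, uptime_hours: float,
                               min_uplink_kbps: float = 256.0,
                               min_uptime_hours: float = 24.0) -> bool:
        """Crude, hypothetical test for promoting a peer to supernode status.

        Well-connected, long-lived peers would act as hubs, routing queries and
        caching popular meta-data so that slow or transient peers don't drag
        down response times for everyone else.
        """
        return uplink_kbps >= min_uplink_kbps and uptime_hours >= min_uptime_hours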

Determining Surfer Behaviour

A major question that needs to be answered: how can we determine, through the users’ interaction with search results, their impression of a given search result? If a user goes to a site and leaves immediately, does this necessarily indicate the result was unsuitable and its score should be decremented? Or something else? If a user stays at a web page for a while, does it mean they like it, or that they went for coffee?
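
As one illustration of how ambiguous these signals are, here is a naive dwell-time heuristic a peer might start from. The thresholds are made up, and the “went for coffee” case is exactly the kind of behaviour it gets wrong.

    def implicit_score_from_dwell(seconds_on_page: float,
                                  followed_outbound_link: bool = False) -> float:
        """Guess a relevance score from how long the user stayed on a result.

        Purely illustrative thresholds: an immediate bounce is treated as a poor
        result, a moderate stay as a good one, and a very long stay is ambiguous
        (engrossed reader vs. coffee break), so it only counts weakly in favour.
        """
        if seconds_on_page < 5:
            return 0.1              # bounced almost immediately
        if seconds_on_page < 300:
            return 0.9 if followed_outbound_link else 0.7
        return 0.6                  # long stay: could mean anything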

Generating a Site’s Initial Score

As a user surfs, an initial score must be generated for each site they visit. How will this score be generated? Traditional search engines rely on cross-referencing links between many sites to score a page; that technique is not practical when only a single peer is surfing a site. That leaves more primitive techniques, such as keyword extraction, to generate an initial score. However, we may be able to extract additional information from how the user arrived at the page. For example, if a user surfs from one site to another via a link, it might be possible to use the average score of the originating site as a base for the new site’s initial score.
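
A hypothetical way of combining those signals: blend a crude keyword-overlap score with the average score of the referring page whenever the user arrived via a link. The weights are arbitrary placeholders, not a claim about what would actually work.

    from typing import Optional


    def initial_score(page_text: str, query_terms: set,
                      referrer_avg_score: Optional[float] = None,
                      referrer_weight: float = 0.4) -> float:
        """Generate a first-cut score for a freshly visited page.

        Keyword overlap stands in for the 'more primitive techniques'; if the
        user followed a link to get here, the originating site's average score
        is blended in as a prior. All weights are illustrative assumptions.
        """
        words = set(page_text.lower().split())
        overlap = len(words & {t.lower() for t in query_terms}) / max(len(query_terms), 1)
        if referrer_avg_score is None:
            return overlap
        return (1 - referrer_weight) * overlap + referrer_weight * referrer_avg_score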

Achieving Critical Mass

In the early stages of development, the usefulness of the network for finding information will be directly proportional to the number of peers on the network. It’s a classic chicken-and-egg problem: without any users, no useful meta-data will be generated, and without the meta-data, no users will have an incentive to use the network. A possible solution would be to build a gateway to Google into each peer, to be used as a mechanism for seeding the network in its early development.
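
The gateway could be as simple as falling back to a conventional engine whenever the local cache has too few opinions for a query, and feeding those results in as weak, neutral-scored seed entries. The external_search callable below is a stand-in, not a reference to any particular API, and SearchMetadata and the cache are the hypothetical sketches from earlier.

    def search_with_seed(cache, terms: frozenset, min_local_hits: int = 5,
                         external_search=None):
        """Answer from the P2P cache, seeding from a conventional engine when sparse.

        `external_search` is a placeholder callable (e.g. a thin wrapper around a
        traditional search engine) returning a list of URLs. Seeded results start
        with a neutral score and low weight until real user feedback arrives.
        """
        hits = cache.search(terms)
        if len(hits) >= min_local_hits or external_search is None:
            return hits
        for url in external_search(" ".join(sorted(terms))):
            cache.merge(SearchMetadata(terms=terms, url=url, score=0.5,
                                       peer_id="seed-gateway"), weight=0.1)
        return cache.search(terms)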

Privacy Issues

By tracking users’ surfing patterns, we are essentially watching the user and sharing that information with other users. Will users accept this? How can we protect the privacy of the user while still extracting information that can be shared with other users?

Business Applications

The real question that needs to be answered, long before consideration can be given to the potential technical challenges, is: how can this technology be used to make money? A few proposals:

  • Consumer Search Engine: The technology could be launched as an alternative to traditional search engines, using the technology to deliver well-targeted consumers to advertisers, and thus generate revenue.
  • Internal Corporate Information Retrieval Tool: Large corporations, such as IBM, could use a modified version of the technology to enable them to find and leverage existing internal assets.
  • Others?

Conclusion

Yes, there are numerous holes in the idea (which I’ve highlighted above), but I don’t think any of them are insurmountable. The most important one, in my mind, is: how could you tweak this technology to build a real business? Though it would be possible to try to go the Google route (selling narrowly targeted advertising), I’m not sure that would be very smart, considering not only the size of the existing competitor (Google) but also the number of other companies trying to bring down Google. It might be a good idea, in which case I’ve just shot myself in the foot by not patenting it, or a bad idea, in which case I’ll save myself the trouble of pursuing it. What are people’s thoughts on the merit of the idea (or lack thereof)?

Wandfight at the HP Corral

Vancouver – It was supposed to be a joyous occasion, but the combination of poor crowd control and a small book inventory led to disaster at Chapters on Robson last night just as the latest installment of the popular Harry Potter series went on sale. Though the evening started amiably enough, with little witches and warlocks from the local Hogwarts International School of Witchcraft anxiously awaiting 12:01am, by the end of the evening the event had escalated into a full-scale wizard riot that led to numerous injuries, destruction of property, and holes in the space-time continuum.

The book that started it all...the riot, that is.

Even before the event itself, officials at the Ministry of Magic had expressed concern at the growing intermingling of wizards and muggles. At the event, this concern was confirmed by the presence of a large crowd of Christian fundamentalists clad in Holy Power t-shirts and preparing stakes around a large bonfire in front of the Vancouver Art Gallery steps opposite the book store. Though no action was taken by this group, their presence, coupled with their outspoken desire to “burn witches like it’s 1599” and repetitive “Bringing in the Sheep” sing-alongs, only served to increase tensions at the event.

The final straw came at 12:01am, when Chapters staff revealed that, due to high demand, they had only been able to secure a single pallet of the new book for sale to the public. In an effort to calm the crowd, Gilderoy Lockhart, the former Hogwarts professor and special guest for the evening, attempted to use his powers to create a duplicate pallet of books. But faster than you could say “lacarnum inflamarae”, Lockhart had engulfed the pallet in flames, leaving only a few display copies of the book unscathed.

One of the crazed dark witches makes off with her prize...

And then things got ugly.

A group of dark wizards, who had maintained all evening that they were interested in buying the latest Harry Potter book “just to find the flaws”, decided to take action and obtain the surviving copies of the book. In an attempt to create a diversion, the group enchanted the Science Fiction & Fantasy section, thus releasing a swarm of Orcs, several small hobbits, a confused grey-haired gentleman wearing a wizarding robe that hadn’t been fashionable for several centuries, and a humanoid from a small planet somewhere in the vicinity of Betelgeuse.

Meanwhile, this reporter, who had previously believed the worst part of the evening had passed with his consumption of a vomit-flavoured “Bertie Bott’s Every Flavour Bean”, had located sanctuary behind a pile of unsold Danielle Steel novels.

The riot was eventually quelled when Ministry of Magic officials and Vancouver Police Department riot personnel arrived on the scene and dispersed the crowd. The Ministry of Social Services has since taken custody of the group of fantasy creatures for their own protection, placing the hobbits in foster homes, the older wizard in elderly care, the group of Orcs in anger management, and the humanoid from Betelgeuse in Alcoholics Anonymous. Gilderoy Lockhart has not been seen since the event, and is assumed dead. Good riddance.