There seems to be increasing interest in the idea of using peer-to-peer (P2P) technology to rejuvenate search technology in light of Google’s growing inability to link people with knowledge. There have been a couple of early attempts in this area, including Widesource (a P2P system that indexes people’s browser bookmarks) and Grub (similar to SETI@Home, it leverages users’ spare cycles and bandwidth to index the web). A new project in this area is WebQuest, a side project of Ian Clarke, of Freenet fame. WebQuest allows users to refine their searches. Most of these ideas parallel my own on how we might improve search engines’ capabilities to extract context from web pages. The approach has a few drawbacks, though, that I’d like to throw out there for people to consider and attempt to solve.
Systems such as Google use keyword extraction and link analysis to attempt to extract context. The system is based on the assumption that users link to other sites based on context, and therefore it should be possible to figure out the context and rank a page’s content. Other sites, such as DMOZ, use a system of human moderators who can understand a page’s context better and categorize it, incurring much manual labour in the process.
But why use such an indirect route? Users know what they’re looking for and, like the US Supreme Court on pornography, they’ll know it when they see it. Why not provide a mechanism for users to “close the loop” and provide direct feedback to the search engine, thus allowing other users to benefit from this extra input into the system?
This has been bugging me for a while, so I decided to throw together the following straw man for a P2P Search Engine that would allow users to leverage how other people “ranked” pages based on their response to search engine results.
This is an updated version of the original post, which pointed to PDF and Word versions of the proposal. As part of a recent web reorganization, I figured it would just be easier to include the proposal text in the post itself.
The Problem
Current popular search engine technology suffers from a number of shortcomings:
- Timeliness of information: Search engines index information using a “spider” to traverse the Internet and index its content. Due to the size of the Internet and the finite resources of the search engine, pages are indexed only periodically. Hence, many search engines are slightly out of date, reflecting the contents of a web page the last time the page was indexed.
- Comprehensiveness of information: Due to both the current size and rate of information expansion on the Internet, it is highly unlikely that current search engines are capable of indexing all publicly available sites. In addition, current search engines rely on links between web pages to help them discover additional resources; however, it is likely that “islands” of information unreferenced by other sites are not being indexed.
- Capital intensive: Significant computing power, bandwidth, and other capital assets are required to provide satisfactory search response times. Google, for example, employs one of the largest Linux clusters (10,000 machines).
- Lack of access to “deep web”: Search engines can’t interface with information stored in corporate web sites’ databases, meaning that the search engine can’t “see” some information.
- Lack of context/natural language comprehension: Search engines tend to be dumb. Search technologies, such as Google’s PageRank™, attempt to extract context from web pages only indirectly and crudely, through analysis of keywords in the pages and of hyperlink interconnections between pages.
The only available option to solve these problems is to develop a search technology that comprehends natural language, can extract context, and employs a massively scalable architecture. Given the exponential rate of information growth, developing such a technology is critical to enabling individuals, governments, and corporations to find and process information in order to generate knowledge.
The Proposed Solution
Fortunately, there already exists a powerful natural language and context extraction technology: the search users themselves. Coincidentally, they also are in possession of the resources required to create a massively scalable distributed architecture for coordinating search activities: a vast amount of untapped computational power and bandwidth, albeit spread across millions of individual machines.
What is required is a tool that allows users to:
- Index pages and generate meta-data as they surf the web.
- Share that meta-data with other search users.
- Search other users’ meta-data.
- Combine other users’ meta-data for a set of search terms with the user’s own opinion on how well the result matches the context of the set of search terms. This new meta-data is cached locally and shared with the network of users, thus completing the feedback loop.
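To make the feedback loop a little more concrete, here is a minimal sketch of what a peer’s local cache might look like. Everything in it is an assumption for illustration: the `MetaDataCache` class, the idea of storing votes in the range -1.0 to 1.0, and ranking by a plain average are my own placeholders, not a worked-out design.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MetaDataCache:
    """Hypothetical local cache: maps (search terms, URL) to a list of votes."""
    votes: dict = field(default_factory=lambda: defaultdict(list))

    @staticmethod
    def key(terms, url):
        # Normalise terms so "P2P search" and "search p2p" share one entry.
        return (tuple(sorted(t.lower() for t in terms)), url)

    def record(self, terms, url, vote):
        """Store one opinion, clamped to [-1.0, 1.0], on how well url matches terms."""
        self.votes[self.key(terms, url)].append(max(-1.0, min(1.0, vote)))

    def merge(self, peer_cache):
        """Pull in the meta-data another peer has shared."""
        for k, vs in peer_cache.votes.items():
            self.votes[k].extend(vs)

    def search(self, terms):
        """Return (url, mean vote) pairs for the terms, best match first."""
        wanted = tuple(sorted(t.lower() for t in terms))
        results = [
            (url, sum(vs) / len(vs))
            for (t, url), vs in self.votes.items()
            if t == wanted and vs
        ]
        return sorted(results, key=lambda r: r[1], reverse=True)


# Usage: merge a peer's meta-data, add my own opinion, then search.
mine, peer = MetaDataCache(), MetaDataCache()
peer.record(["p2p", "search"], "http://example.org/a", 1.0)
peer.record(["p2p", "search"], "http://example.org/b", 0.5)
mine.merge(peer)
mine.record(["search", "P2P"], "http://example.org/b", 1.0)
ranked = mine.search(["P2P", "search"])
```

The point of the sketch is only that the user’s own vote lands in the same cache that gets shared, which is what closes the loop. A real peer would need persistence, a wire format, and a smarter ranking function than a simple mean.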
Implementation Hurdles
Insertion of Forged Meta-Data
Though users’ behaviour would “vote out” inappropriate material that got added to the peers’ cache of meta-data, the system would still be prone to attacks designed to boost certain sites’ rankings. A major design challenge would be to enable the system to withstand an attempt at forgery by a rogue peer or a coordinated network of rogue peers.
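One partial defence, offered only as a sketch, is to count each peer at most once per result and to aggregate with a median instead of a mean, so that a minority of rogue peers cannot drag the score far. The `robust_score` function and the peer identifiers below are hypothetical. Note that this does nothing against a rogue who can cheaply create a majority of fake peers, so it is nowhere near a full answer.

```python
from statistics import median


def robust_score(votes_by_peer):
    """Aggregate votes for one (search terms, URL) pair.

    votes_by_peer maps a hypothetical peer id to that peer's vote in
    [-1.0, 1.0]. Keying by peer means each peer counts only once, and
    the median ignores extreme votes unless they form a majority.
    """
    if not votes_by_peer:
        return 0.0
    return median(votes_by_peer.values())


# Usage: three honest peers think the page is mediocre, two rogue peers
# vote it to the top.
votes = {"alice": 0.1, "bob": 0.2, "carol": 0.0, "rogue1": 1.0, "rogue2": 1.0}
score = robust_score(votes)
naive = sum(votes.values()) / len(votes)
```

With these numbers the median stays at an honest peer’s vote, while the plain average is pulled noticeably upward by the two forged votes.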
Search Responsiveness
As peers on the network will be spread across the Internet, accessing the network at a variety of connection speeds, the responsiveness of the network will be variable. Special consideration must be given to how to design the structure of the P2P network to incorporate supernodes to offset this characteristic.
Determining Surfer Behaviour
A major question that needs to be answered: how can we determine, through the users’ interaction with search results, their impression of a given search result? If a user goes to a site and leaves immediately, does this necessarily indicate the result was unsuitable and its score should be decremented? Or something else? If a user stays at a web page for a while, does it mean they like it, or that they went for coffee?
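As a straw man for discussion, one could map the time spent on a page to a vote and simply refuse to guess when the signal is ambiguous. The function name and every threshold below (5 seconds for a bounce, 10 minutes for "went for coffee", 2 minutes for a full positive vote) are assumptions I picked for illustration, not measured values.

```python
def vote_from_dwell(seconds, bounce=5.0, idle=600.0, full=120.0):
    """Turn dwell time on a search result into a vote, or None if unclear.

    All thresholds are illustrative guesses:
      - under `bounce` seconds: treated as a weak negative vote
      - over `idle` seconds: the user may have walked away, so no vote
      - otherwise: a positive vote that grows until `full` seconds
    """
    if seconds < 0:
        raise ValueError("dwell time cannot be negative")
    if seconds < bounce:
        return -0.5
    if seconds > idle:
        return None
    return min(1.0, seconds / full)


# Usage
quick_exit = vote_from_dwell(2)      # weak negative
one_minute = vote_from_dwell(60)     # moderate positive
coffee = vote_from_dwell(3600)       # None: too ambiguous to count
```

Returning no vote at all for long visits sidesteps the coffee problem at the cost of throwing away some genuine signal. The immediate-exit case is deliberately only a weak negative, since leaving quickly might also mean the user found the answer at a glance.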
Generating a Site’s Initial Score
As a user surfs, an initial score must be generated for the sites they surf. How will this score be generated? Traditional search engines utilize reference checking in order to come up with a score for a web site; however, that technique is not practical when it’s only a single peer surfing a site. That leaves only more primitive techniques, such as keyword extraction, or other means to generate an initial score. However, we may be able to extract additional information based on how the user arrived at the web page. For example, if a user surfs from one site to another via a link, it might be possible to use the average score of the originating site as a base for generating the initial score.
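The referrer idea could be sketched roughly as follows: compute a crude keyword score, then blend it with the average score of the page the user came from, if there is one. The function names, the 0-to-1 score scale, and the 50/50 blend weight are all assumptions made for the sake of the example.

```python
import re


def keyword_score(page_text, terms):
    """Fraction of the search terms that appear in the page text (0.0 to 1.0)."""
    words = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    wanted = {t.lower() for t in terms}
    if not wanted:
        return 0.0
    return len(words & wanted) / len(wanted)


def initial_score(page_text, terms, referrer_avg=None, referrer_weight=0.5):
    """Blend the keyword score with the originating site's average score.

    referrer_avg is the (hypothetical) average score of the site the user
    followed a link from, on the same 0.0 to 1.0 scale, or None if the
    user arrived some other way.
    """
    base = keyword_score(page_text, terms)
    if referrer_avg is None:
        return base
    return (1 - referrer_weight) * base + referrer_weight * referrer_avg


# Usage
text = "Peer-to-peer search engines"
alone = initial_score(text, ["search", "google"])
linked = initial_score(text, ["search", "google"], referrer_avg=1.0)
```

Here the page matches only one of the two terms, so on its own it starts at a middling score, and arriving from a highly rated site lifts that starting point. Whether a referrer’s reputation should carry over this strongly is exactly the kind of thing that would need tuning.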
Achieving Critical Mass
In the early stages of development, the usefulness of the network for finding information will be directly proportional to the number of peers on the network. It’s a classic chicken-and-egg problem: without any users, no useful meta-data will be generated, and without the meta-data, no users will have an incentive to use the network. A possible solution to the problem would be to build a gateway to Google into each peer, to be used as a mechanism for seeding the network in its early development.
Privacy Issues
By tracking users’ surfing patterns, we are essentially watching the user and sharing that information with other users. Will users accept this? How can we act to protect the privacy of the user, while still extracting information that can be shared with other users?
Business Applications
The real question that needs to be answered, long before consideration can be given to the potential technical challenges, is: how can this technology be used to make money? A few proposals:
- Consumer Search Engine: The technology could be launched as an alternative to traditional search engines, using the technology to deliver well-targeted consumers to advertisers, and thus generate revenue.
- Internal Corporate Information Retrieval Tool: Large corporations, such as IBM, could use a modified version of the technology to enable them to find and leverage existing internal assets.
- Others?
Conclusion
Yes, there are numerous holes in the idea (which I’ve highlighted in the document), but I don’t think any of them are insurmountable. The most important one, in my mind, is: how could you tweak this technology to build a real business? Though it would be possible to try to go the Google route (selling narrowly targeted advertising), I’m not sure that would be very smart considering not only the size of the existing competitor (Google) but also the number of other companies trying to bring down Google. It might be a good idea, in which case I’ve just shot myself in the foot by not patenting it, or a bad idea, in which case I’ll save myself the trouble of pursuing it. What are people’s thoughts on the merit of the idea (or lack thereof)?