BC Tech For Sale?

In an earlier topic I posed the question “who are the giants in BC?”, hoping to prompt whatever readership I might have to help identify the province’s important figures. As I pointed out in that topic, I believe British Columbians don’t celebrate our leaders and don’t take pride in what we are capable of accomplishing. In a similar vein, I want to consider the recent acquisitions of BC corporations.

Last week, Intel decided to acquire the gaggle of PMC-Sierra emigrants that formed West Bay Semiconductors in 1999. In a similar move, Business Objects splurged and purchased local reporting software success Crystal Decisions. Great, right? Some hometown entrepreneurs strike it big, and somewhere a venture capitalist gets both his wings and a liquidity event. Everybody wins.

But consider a similar story: in mid-2000, Intel acquired local communications software developer Trillium Digital Systems. Trillium, a leader in standards-compliant communications protocol software, developed the software required to implement the hardware backends driving modern telcos. Trillium was especially popular in its industry, due primarily to its support for a wide variety of hardware platforms. After the acquisition, however, Trillium became an Intel-only shop, shedding bales of valuable intellectual property in the process to please its new corporate sugar daddy. And when the hard times hit, Intel sold Trillium to Continuous Computing Corporation for a conveniently undisclosed sum.

What’s sad about the Trillium story is that an otherwise healthy company chose to be acquired, only to be driven into the ground by its foreign parent. On the one hand, it was probably a good strategy for Intel – after all, they managed to eliminate support for their competitors’ products. But on the other hand, it really sucked for the large number of local engineers and software developers who lost their jobs, and for the local companies that benefited indirectly from Trillium’s success.

The question is this: do British Columbian companies look to sell out too fast, rather than try to become the world leader in their industry? Do we talk a good game about building world-class companies, but lose our nerve when presented with a cheque? Will I be here lamenting the decline of West Bay and Crystal Decisions in a year or two?

P2P Search Engine

There seems to be increasing interest in the idea of using peer-to-peer (P2P) technology to rejuvenate search technology, in light of Google’s growing inability to link people with knowledge. There have been a couple of early attempts in this area, including Widesource (a P2P system that indexes people’s browser bookmarks) and Grub (similar to SETI@Home, it leverages users’ spare cycles and bandwidth to index the web). A newer project in this area is WebQuest, a side project of Ian Clarke of Freenet fame. WebQuest allows the user to refine their searches. Most of these ideas parallel my own thinking on how we might improve search engines’ ability to extract context from web pages. But the approach has a few drawbacks that I’d like to throw out there for people to consider and attempt to solve.

Systems such as Google use keyword extraction and link analysis to attempt to extract context. The approach is based on the assumption that users link to other sites based on context, so it should be possible to infer a page’s context and rank its content accordingly. Other sites, such as DMOZ, use a system of human moderators who can understand a page’s context better and categorize it – incurring much manual labour in the process.
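To make that indirect approach concrete, here’s a toy illustration of the link-analysis half of it: a drastically simplified PageRank-style calculation over a made-up three-page link graph. This is just the general idea, not Google’s actual algorithm; the graph and the damping factor are invented for the example.

```python
# Toy illustration of link analysis: a simplified PageRank-style iteration.
# The link graph below is invented for the example; the real algorithm is
# far more elaborate and is combined with keyword analysis.

links = {
    "home.html":     ["news.html", "products.html"],
    "news.html":     ["home.html"],
    "products.html": ["home.html", "news.html"],
}

damping = 0.85          # probability a surfer follows a link vs. jumping randomly
scores = {page: 1.0 / len(links) for page in links}

for _ in range(50):     # iterate until the scores settle
    new_scores = {}
    for page in links:
        # Sum the score passed along by every page that links to this one.
        inbound = sum(
            scores[src] / len(outlinks)
            for src, outlinks in links.items()
            if page in outlinks
        )
        new_scores[page] = (1 - damping) / len(links) + damping * inbound
    scores = new_scores

print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Notice that nothing in the calculation ever asks a human whether the top-ranked page actually answered their question.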

But why use such an indirect route? Users know what they’re looking for and, like the US Supreme Court on pornography, they’ll know it when they see it. Why not provide a mechanism for users to “close the loop” and provide direct feedback to the search engine, thus allowing other users to benefit from this extra input into the system?

This has been bugging me for a while, so I decided to throw together the following straw man for a P2P Search Engine that would allow users to leverage how other people “ranked” pages based on their response to search engine results.

This is an updated version of the original post, which pointed to PDF and Word versions of the proposal. As part of a recent web reorganization, I figured it would just be easier to include the proposal text in the post itself.

The Problem

Current popular search engine technology suffers from a number of shortcomings:

  • Timeliness of information: Search engines use a “spider” to traverse the Internet and index its content. Due to the size of the Internet and the finite resources of the search engine, pages are revisited only periodically. Hence, search results are often slightly out of date, reflecting a page’s contents as of the last time it was indexed.
  • Comprehensiveness of information: Due to both the current size and rate of information expansion on the Internet, it is highly unlikely that current search engines are capable of indexing all publicly available sites. In addition, current search engines rely on links between web pages to help them discover additional resources; however, it is likely that “islands” of information unreferenced by other sites are not being indexed.
  • Capital intensive: Significant computing power, bandwidth, and other capital assets are required to provide satisfactory search response times. Google, for example, employs one of the largest Linux clusters (10,000 machines).
  • Lack of access to “deep web”: Search engines can’t interface with information stored in corporate web sites’ databases, meaning that the search engine can’t “see” some information.
  • Lack of context/natural language comprehension: Search engines tend to be dumb, extracting context in crude, indirect fashions. Technologies such as Google’s PageRank™ infer a page’s context only indirectly, through analysis of the keywords it contains and the hyperlinks interconnecting it with other pages.

The only way to solve these problems is to develop a search technology that comprehends natural language, extracts context, and employs a massively scalable architecture. Given the exponential rate of information growth, developing such a technology is critical to enabling individuals, governments, and corporations to find and process information in order to generate knowledge.

The Proposed Solution

Fortunately, there already exists a powerful natural language and context extraction technology: the search users themselves. Conveniently, they also possess the resources required to create a massively scalable distributed architecture for coordinating search activities: a vast amount of untapped computational power and bandwidth, albeit spread across millions of individual machines.

What is required is a tool that allows users to:

  1. Index pages and generate meta-data as they surf the web.
  2. Share that meta-data with other search users.
  3. Search other users’ meta-data.
  4. Combine other users’ meta-data for a set of search terms with the user’s own opinion of how well each result matches those terms. This new meta-data is cached locally and shared with the network of users, thus completing the feedback loop.
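To make that loop a little more concrete, here’s a rough Python sketch of what a peer’s locally cached meta-data and the four steps above might look like. Every name and number in it is a placeholder (the 0.5 neutral score, the 0.8/0.2 blend), not a worked-out design.

```python
# Rough sketch of a peer's local meta-data cache and the feedback loop.
# Class and field names are placeholders, not a finished design.

from dataclasses import dataclass, field


@dataclass
class PageMetaData:
    url: str
    keywords: list[str]                                      # extracted as the user surfs
    scores: dict[str, float] = field(default_factory=dict)   # search term -> score


class Peer:
    def __init__(self):
        self.cache: dict[str, PageMetaData] = {}

    def index_page(self, url: str, keywords: list[str]) -> None:
        """Step 1: index pages and generate meta-data while surfing."""
        self.cache[url] = PageMetaData(url, keywords)

    def search(self, term: str, other_peers: list["Peer"]) -> list[tuple[str, float]]:
        """Steps 2-3: search our own and other peers' shared meta-data."""
        results: dict[str, float] = {}
        for peer in [self] + other_peers:
            for page in peer.cache.values():
                if term in page.keywords or term in page.scores:
                    score = page.scores.get(term, 0.5)        # neutral default
                    results[page.url] = max(results.get(page.url, 0.0), score)
        return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

    def give_feedback(self, url: str, term: str, liked: bool) -> None:
        """Step 4: fold the user's opinion back into the cached score."""
        page = self.cache.setdefault(url, PageMetaData(url, []))
        old = page.scores.get(term, 0.5)
        target = 1.0 if liked else 0.0
        page.scores[term] = 0.8 * old + 0.2 * target          # arbitrary blend factor
```

The point of the sketch is the shape of the data: each peer carries per-term scores for the pages it has seen, answers other peers’ searches out of that cache, and folds its own user’s reactions back in.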

Leveraging Surfing Habits and Search Habits to Extract Search Context

Implementation Hurdles

Insertion of Forged Meta-Data

Though users’ behaviour would “vote out” inappropriate material that got added to the peers’ cache of meta-data, the system would still be prone to attacks designed to boost certain sites’ rankings. A major design challenge would be to enable the system to withstand an attempt at forgery by a rogue peer or a coordinated network of rogue peers.
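One possible (and entirely unproven) line of defence: weight each peer’s contributed scores by a locally maintained trust value, and shrink that trust whenever a peer’s meta-data keeps disagreeing with the local user’s own feedback. In the sketch below the 0.1 learning rate and the neutral 0.5 starting trust are pulled out of thin air, and a large enough coordinated network of rogue peers could still swamp it.

```python
# Hypothetical defence against forged meta-data: weight each peer's scores
# by a locally tracked trust value, and shrink that trust when a peer's
# scores keep contradicting the local user's own feedback.

class TrustTracker:
    def __init__(self):
        self.trust: dict[str, float] = {}        # peer id -> trust in [0, 1]

    def weight(self, peer_id: str) -> float:
        return self.trust.get(peer_id, 0.5)      # unknown peers start neutral

    def combine(self, votes: dict[str, float]) -> float:
        """Trust-weighted average of the scores reported by other peers."""
        total = sum(self.weight(pid) for pid in votes)
        if total == 0:
            return 0.0
        return sum(self.weight(pid) * score for pid, score in votes.items()) / total

    def update(self, peer_id: str, peer_score: float, local_score: float) -> None:
        """Reward agreement with local feedback, punish disagreement."""
        error = abs(peer_score - local_score)
        current = self.weight(peer_id)
        self.trust[peer_id] = max(0.0, min(1.0, current + 0.1 * (0.5 - error)))
```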

Search Responsiveness

As peers on the network will be spread across the Internet, accessing it at a variety of connection speeds, the responsiveness of the network will vary. Special consideration must be given to the structure of the P2P network – for example, incorporating supernodes – to offset this variability.
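For what it’s worth, I’m imagining something in the spirit of the two-tier networks of the day (Kazaa/FastTrack-style supernodes): well-connected peers volunteer as supernodes, slower peers attach to one, and queries fan out supernode-to-supernode instead of flooding every peer. A back-of-the-envelope sketch, with all the names invented and the meta-data flattened to a simple term-to-URL-score map:

```python
# Back-of-the-envelope sketch of a two-tier (supernode) topology: slow peers
# attach to a supernode; queries fan out between supernodes instead of
# flooding every peer directly.

class LeafPeer:
    """An ordinary peer on a slow link; it only talks to its supernode."""
    def __init__(self, cache: dict[str, dict[str, float]]):
        self.cache = cache                        # term -> {url: score}

    def local_results(self, term: str) -> dict[str, float]:
        return self.cache.get(term, {})


class SuperNode:
    """A well-connected peer that aggregates queries for its leaves."""
    def __init__(self):
        self.leaves: list[LeafPeer] = []
        self.neighbours: list["SuperNode"] = []

    def query(self, term: str, ttl: int = 2) -> dict[str, float]:
        """Gather results from attached leaves, then fan out to neighbouring
        supernodes while the time-to-live lasts, keeping the best score per URL."""
        merged: dict[str, float] = {}
        for leaf in self.leaves:
            for url, score in leaf.local_results(term).items():
                merged[url] = max(merged.get(url, 0.0), score)
        if ttl > 0:
            for node in self.neighbours:
                for url, score in node.query(term, ttl - 1).items():
                    merged[url] = max(merged.get(url, 0.0), score)
        return merged
```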

Determining Surfer Behaviour

A major question that needs to be answered: how can we determine, through the users’ interaction with search results, their impression of a given search result? If a user goes to a site and leaves immediately, does this necessarily indicate the result was unsuitable and its score should be decremented? Or something else? If a user stays at a web page for a while, does it mean they like it, or that they went for coffee?
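Absent a real answer, the crudest heuristic I can think of is dwell time with a cutoff: a quick bounce counts against a result, a reasonable stay counts for it, and anything longer is ignored because the surfer may simply have wandered off. The thresholds below are pure guesses; picking them well is exactly the open problem.

```python
# Purely illustrative heuristic for turning dwell time into implicit feedback.
# The thresholds are guesses; choosing them well is the open problem.

def implicit_feedback(seconds_on_page: float) -> float | None:
    """Return a score adjustment in [-1, 1], or None if we can't tell."""
    if seconds_on_page < 5:
        return -1.0          # bounced straight back: probably a bad result
    if seconds_on_page < 120:
        return +1.0          # read for a bit: probably a useful result
    return None              # too long to interpret; the user may be at coffee
```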

Generating a Site’s Initial Score

As a user surfs, an initial score must be generated for the sites they visit. How will this score be generated? Traditional search engines use reference checking to score a web site; however, that technique is not practical when only a single peer is surfing a site. That leaves more primitive techniques, such as keyword extraction, to generate an initial score. However, we may be able to extract additional information from how the user arrived at the web page. For example, if a user reaches one site from another via a link, the originating site’s score could serve as a base for the new site’s initial score.
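As a sketch of that last idea, the initial score might blend a primitive keyword signal with the referring page’s existing score when the surfer arrived via a link. The 50/50 blend and the keyword scaling factor below are arbitrary placeholders.

```python
# Illustrative initial score for a newly visited page: blend a crude keyword
# signal with the score of the page the surfer came from (if they followed a
# link). The 50/50 blend and the x20 scaling are arbitrary placeholders.

def keyword_score(page_text: str, term: str) -> float:
    """Primitive keyword signal: fraction of words matching the search term."""
    words = page_text.lower().split()
    if not words:
        return 0.0
    return min(1.0, words.count(term.lower()) / len(words) * 20)

def initial_score(page_text: str, term: str, referrer_score: float | None) -> float:
    base = keyword_score(page_text, term)
    if referrer_score is None:
        return base                      # user typed the URL or used a bookmark
    return 0.5 * base + 0.5 * referrer_score
```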

Achieving Critical Mass

In the early stages of development, the usefulness of the network for finding information will be directly proportional to the number of peers on the network. It’s a classic chicken-and-egg problem: without any users, no useful meta-data will be generated, and without the meta-data, no users will have an incentive to use the network. A possible solution would be to build a gateway to Google into each peer, to be used as a mechanism for seeding the network in its early days.
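By “gateway” I don’t mean anything more sophisticated than falling back to a conventional engine when the peer network comes up empty and seeding the local cache with whatever comes back. Building on the Peer sketch from earlier, and with fetch_from_google() as a purely hypothetical stand-in for whatever API or scraping mechanism would actually be used:

```python
# Sketch of seeding the network from an existing engine while the peer
# network is small. fetch_from_google() is a hypothetical stand-in for
# whatever API or scraping mechanism would actually be used.

def fetch_from_google(term: str) -> list[str]:
    raise NotImplementedError("placeholder for the real gateway")

def search_with_seeding(peer, term: str, other_peers) -> list[tuple[str, float]]:
    results = peer.search(term, other_peers)
    if len(results) < 10:                      # peer network too sparse
        for url in fetch_from_google(term):
            # Seed the local cache; the page starts with a neutral score and
            # real user feedback adjusts it from here.
            peer.index_page(url, [term])
    return peer.search(term, other_peers)
```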

Privacy Issues

By tracking users’ surfing patterns, we are essentially watching the user and sharing that information with other users. Will users accept this? How can we protect the privacy of the user while still extracting information that can be shared with other users?

Business Applications

The real question that needs to be answered, long before consideration can be given to the potential technical challenges, is: how can this technology be used to make money? A few proposals:

  • Consumer Search Engine: The technology could be launched as an alternative to traditional search engines, using the technology to deliver well-targeted consumers to advertisers, and thus generate revenue.
  • Internal Corporate Information Retrieval Tool: Large corporations, such as IBM, could use a modified version of the technology to enable them to find and leverage existing internal assets.
  • Others?

Conclusion

Yes, there are numerous holes in the idea (which I’ve highlighted above), but I don’t think any of them are insurmountable. The most important one, in my mind, is: how could you tweak this technology to build a real business? Though it would be possible to try to go the Google route (selling narrowly targeted advertising), I’m not sure that would be very smart, considering not only the size of the incumbent but also the number of other companies already trying to bring Google down. It might be a good idea, in which case I’ve just shot myself in the foot by not patenting it, or a bad idea, in which case I’ll save myself the trouble of pursuing it. What are people’s thoughts on the merit of the idea (or lack thereof)?