The Future of the Deep Web
This is an exercise in the literal.
Librarians have long warned that you can't find everything that is on the Web on search engines. The part of the Web that is not searchable through these engines is known as the deep Web.
By the way, I refuse to call this the Invisible Web. I think this is a ridiculous term for any librarian to use. Just because something is in one type of database and not in another doesn't make it invisible. By this logic, our catalogs, databases and online journals should be called the Invisible Library. I've explained this more fully in a page I maintain on The Deep Web.
I was updating that page this morning because I've been thinking about the future of the deep Web in relation to the infamous goal of Google "to organize the world's information and make it universally accessible and useful." Google is making a good show of meaning this literally, so for the sake of argument, I'll take it literally.
Based on the evidence so far, "accessible" doesn't necessarily mean full text accessible. But let's put that aside for the moment.
I've done search engine training for years, and my mantra has always been to select the search tool based on your query. This is a fundamental guideline to encourage students away from a Google-centric search strategy. (When I first started out, it was a Yahoo!-centric strategy. Does anyone remember the 2001 American Libraries article "Yahoo! and the Abdication of Judgment"? That was mine, and it occurs to me that it would have made for a great blog rant if there were blogs back then.) So I've been amusing myself with the thought that, perhaps in my professional lifetime, a handful of search portals will indeed contain literally everything to be searched and I can finally stop pushing this.
The move is on to open up search. Google Book Search, Google Scholar, Windows Live Academic, Amazon Search Inside the Book and various emerging services offer a great deal of content on the deep Web, especially scholarly content. While this content is often not free, due to copyright and publishers' restrictions, the search itself is. In other words, free search of previously inaccessible material is at the heart of this development. Note that I said "search" and not "search and access to" this material.
What I'm trying to guess is what will happen to content in the independent, usually free databases that abound on the Web. This is a significant part of the deep Web - in fact, of the Web itself - and there are untold thousands of these databases. Consider, for example, the USDA PLANTS Database, Library of Congress THOMAS, and the Kelley Blue Book. Will Google and others be such magnets that everything from governments to corporations to universities to research institutes will open up their data to them? Google has already made inroads with addresses and phone numbers, flight information, stock quotes, definitions, movies, tracking numbers, patents, Open WorldCat, etc. But this is only a fraction of the world's data.
I believe that this is where Google and others will stumble in their efforts to make everything searchable. If they consider part of the world's information to be contained in these databases (and why wouldn't they?), what is their plan? They'll have a tough challenge getting their hands on all this data. Even if they could do it technically, would the sources of the data allow it?
Maybe there will come a time when not being represented in major centralized portals will be so undesirable as to be avoided. The USDA would no longer be satisfied with their site coming up in a portal search. They'd want their internal data to appear, too.
Federated search vs. independence. This is a trend to watch.
End of exercise.
I'll be in London next week to speak to Ingenta's publisher customers about The Ideal 2.0 Scholarly Portal. Maybe I'll run some of these ideas past them for their comments.
