CS73N Notes 16: for the CS73N class of 1 June 2006
Started by Gio Wiederhold, 2 June 2006.
Topics covered: The deep web
The deep web and its use of databases
The Deep Web consists of all the material that is stored on computers on the Internet but is not accessible as harvestable web pages. It includes:
- Structured databases that support the websites of businesses and contain records of the items they sell. These records are accessed by Dynamic HTML and Java programs when a customer visits the site and searches for specific items. Once the resulting pages are retrieved, users could store them on their own web sites, and the pages would thus enter the open web. But it is unlikely that all the information from the store would become available that way. Using databases makes it harder for search engines to find the information, but it also prevents easy, wholesale release of information to competitors.
- Structured databases are also used by free services, such as the Internet Movie Database (IMDb). Such services can display the information in a wide variety of ways that would otherwise require many more open web pages.
- Collections of books are now coming into the open web. They are indexed, but only by search engines that have access to those libraries. The books' contents are kept in libraries that are protected, mainly because of copyright concerns. Because of their size, standard database technology is rarely used. Also, such libraries have little or no requirement for regular updating, which is a strong feature of database technology.
- Image libraries are another portion of the deep web. The images may be available through indexed searches based on meta-information, such as captions that people add, but the images themselves are not directly searchable. Those images represent a much larger volume of information than the available metadata. Some of these libraries are extremely large, such as the collections of satellite earth and space observations that NASA and other space agencies keep. To retrieve those images, one has to specify the satellite, time, and observing instrument. Some earth observations have geo-coordinates added to their metadata.
- Video collections have even larger ratios of image material to metadata.
- Movies are typically kept in well-protected libraries, because of value and rights considerations.
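The metadata-driven retrieval described for image libraries above, where one specifies the satellite, time, and observing instrument, can be sketched as a query against a small structured database. This is only an illustration under stated assumptions: the table layout, the satellite names, and the file paths are all made up, and SQLite stands in for whatever storage a real archive uses. Note that the query never looks at the image contents, only at the metadata.

```python
import sqlite3

def build_catalog():
    """Create an in-memory catalog of image metadata (all rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE images ("
        " satellite TEXT, instrument TEXT, taken_at TEXT,"
        " lat REAL, lon REAL, path TEXT)"
    )
    rows = [
        ("SAT-A", "infrared", "2006-05-30T10:00:00", 37.4, -122.2, "a/ir/0001.img"),
        ("SAT-A", "visible", "2006-05-30T10:05:00", 37.4, -122.2, "a/vis/0001.img"),
        ("SAT-B", "infrared", "2006-05-31T23:15:00", 48.9, 2.3, "b/ir/0001.img"),
        ("SAT-A", "infrared", "2006-06-01T09:30:00", 37.5, -122.1, "a/ir/0002.img"),
    ]
    conn.executemany("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn

def find_images(conn, satellite, instrument, start, end):
    """Return paths of images whose metadata matches; pixels are never inspected."""
    cursor = conn.execute(
        "SELECT path FROM images"
        " WHERE satellite = ? AND instrument = ?"
        " AND taken_at BETWEEN ? AND ?"   # ISO timestamps sort correctly as text
        " ORDER BY taken_at",
        (satellite, instrument, start, end),
    )
    return [row[0] for row in cursor]

catalog = build_catalog()
print(find_images(catalog, "SAT-A", "infrared",
                  "2006-05-30T00:00:00", "2006-06-01T23:59:59"))
```

A search engine that crawls only static pages never sees these rows; they become visible only when someone submits the right query parameters.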
The size of the deep web has been estimated to be about 100 times the size of the open web, but the information that might be of interest to general surfers is a small fraction of that.
Adding metadata
Ancillary metadata is crucial for mapping textual queries to images, videos, and movie clips.
Image processing research can extract parameters, such as `blob descriptors' <add reference> or `wavelet parameters' <see gio's home page>, that can then be compared with parameters extracted from sample images. Such work has not yet seen much practical application, although certain kinds of images, such as pornographic images, can be recognized with high precision, allowing identification of suspect web sites (see the WIPE project on Gio's home pages).
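The idea of comparing parameters extracted from an image with parameters extracted from sample images can be sketched as follows. This is not the blob-descriptor or wavelet method mentioned above; it swaps in a much cruder feature (the mean brightness of each block in a coarse grid) purely to show the extract-then-compare pattern. The function names and the tiny test images are hypothetical, and the sketch assumes grayscale images whose dimensions divide evenly by the grid size.

```python
import math

def signature(image, grid=2):
    """Reduce a grayscale image (list of rows of 0-255 ints) to grid*grid block means."""
    height, width = len(image), len(image[0])
    block_h, block_w = height // grid, width // grid
    features = []
    for gy in range(grid):
        for gx in range(grid):
            block = [image[y][x]
                     for y in range(gy * block_h, (gy + 1) * block_h)
                     for x in range(gx * block_w, (gx + 1) * block_w)]
            features.append(sum(block) / len(block))
    return features

def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query, samples):
    """samples maps a name to an image; return the names ordered closest first."""
    q = signature(query)
    return sorted(samples, key=lambda name: distance(q, signature(samples[name])))

# Invented 4x4 sample images: one dark on top, one bright on top.
samples = {
    "dark_top": [[10] * 4, [10] * 4, [200] * 4, [200] * 4],
    "bright_top": [[200] * 4, [200] * 4, [10] * 4, [10] * 4],
}
query = [[20] * 4, [20] * 4, [190] * 4, [190] * 4]
print(rank_by_similarity(query, samples))
```

Here the query is dark on top, so the "dark_top" sample ranks first. A practical system would replace `signature` with far richer parameters while keeping the same compare-and-rank structure.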
GPS information is increasingly providing another automatic source of metadata.
Videos are important to organizations such as newscasters, and interesting work has gone on to increase the amount of metadata. The processes involve segmentation, and then use of ancillary data, such as the date, time, and place of recording (again with GPS coordinates if available), the caption tracks that broadcasts carry for hearing-impaired viewers, and matching to voiceprint libraries.
A segment break is identified when essentially all pixels of the image change at the same time, that is, from one frame to the next. Such an indication is already available as a byproduct of the video compression schemes used.
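The segment-break rule can be sketched on decoded frames as follows. This is a simplified illustration, not the compression-byproduct approach the notes refer to: it compares consecutive grayscale frames directly and reports a break when nearly all pixels change. The function names and both thresholds are assumptions chosen for the example.

```python
def changed_fraction(prev, curr, pixel_threshold=30):
    """Fraction of pixels whose brightness differs by more than pixel_threshold."""
    total = 0
    changed = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += 1
            if abs(p - c) > pixel_threshold:
                changed += 1
    return changed / total

def find_segment_breaks(frames, pixel_threshold=30, frame_threshold=0.9):
    """Return indices i where frame i starts a new segment."""
    breaks = []
    for i in range(1, len(frames)):
        fraction = changed_fraction(frames[i - 1], frames[i], pixel_threshold)
        if fraction >= frame_threshold:
            breaks.append(i)
    return breaks

# Invented 2x2 frames: two frames of a dark scene, then two of a bright scene.
frames = [
    [[10, 12], [11, 10]],
    [[11, 12], [10, 10]],
    [[200, 210], [190, 205]],
    [[201, 209], [191, 204]],
]
print(find_segment_breaks(frames))
```

The small pixel-to-pixel noise within a scene stays below the per-pixel threshold, so only the change of scene at index 2 is reported.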
Voiceprints are very effective, because searches tend to focus on well-known people, such as politicians, other public figures, entertainers, and the broadcasters themselves. A voiceprint library of several thousand entries already gives a great deal of coverage when its entries are matched to the video segments kept in the libraries.
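Matching a video segment against a voiceprint library can be sketched as a nearest-neighbor search over feature vectors. This is a minimal illustration under strong assumptions: how the vectors are extracted from audio is outside the sketch, the three-number voiceprints are invented, and cosine similarity with a fixed cutoff is one plausible choice rather than the method used in the work described.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def identify_speaker(segment_vector, library, min_similarity=0.9):
    """Return the best-matching name, or None if no entry is similar enough."""
    best_name, best_score = None, min_similarity
    for name, voiceprint in library.items():
        score = cosine_similarity(segment_vector, voiceprint)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical library; a real one would hold several thousand entries.
library = {
    "anchor": [0.9, 0.1, 0.3],
    "senator": [0.2, 0.8, 0.5],
}
print(identify_speaker([0.85, 0.15, 0.35], library))
print(identify_speaker([0.1, 0.1, 0.9], library))
```

The first segment is close to the "anchor" entry and is labeled accordingly; the second matches nobody well enough, so it stays unlabeled. The matched name then becomes metadata attached to the video segment.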
Fin