CS73N Note on Browser technologies

Created on 20 May 2006 by Gio Wiederhold from earlier CS99 documents and Wiederhold, Gio: Trends for the Information Technology Industry; report prepared for MITI under sponsorship of the Japan External Trade Organization (JETRO), San Francisco CA 94104, April 1999.  That document contains cited references.

Complementary to [Notes13].

 Getting the Right Information

Getting the right and, by implication, complete information is a question of breadth. In traditional measures, completeness of coverage is termed 'recall'. To achieve high recall rapidly, all possibly relevant sources have to be accessed. Since complete access for every information request is not feasible, information systems depend on indexes. Having an index means that an actual information request can start from a manageable list, with pointers to the locations and pages containing the actual information.
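The role of an index can be illustrated with a minimal sketch of an inverted index in Python. The page addresses, the page texts, and the function names build_index and lookup are invented for illustration; real indexing engines are far more elaborate.

```python
from collections import defaultdict

# Hypothetical pages; addresses and text are invented for illustration.
PAGES = {
    "http://example.org/a": "nails and brads for carpenters",
    "http://example.org/b": "garage space for a vehicle",
    "http://example.org/c": "vehicle rules at intersections",
}


def build_index(pages):
    """Map each lowercase term to the set of page addresses containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def lookup(index, query):
    """Return the pages that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(PAGES)
vehicle_pages = lookup(index, "vehicle")
garage_pages = lookup(index, "vehicle garage")
```

A request now starts from the short list held under each term rather than from a scan of every page.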

The effort to index all publicly available information is immense. Comprehensive indexing is limited by the size of the web itself, the rate at which information on the web changes, and the variety of media used for representing information [PonceleonSAPD:98]. Automatic indexing systems focus on the ASCII text presented on web pages, primarily in HTML format. Documents stored in proprietary formats, such as Microsoft Word, PowerPoint, WordPerfect, PostScript, and Portable Document Format (PDF) [Adobe:99], are ignored. Valuable information is often presented in tabular form, where relationships are represented by relative position. Such representations are hard for search engines to parse.

Also generally inaccessible for search are images, including icons, corporate logos, and diagrams [Stix:97]. Some of these images contain crucial embedded text that is not easy to extract [WangWL:98]. Only specialized vendors provide image libraries, and the quality of their retrieval depends much on ancillary descriptive information, perhaps augmented with some selection on content parameters such as color or texture [Amico:98]. There are also valuable terms for selection in speech, both standalone and as part of video representations. Some of the problems can be, and are being, addressed by brute force, using heavyweight indexing engines and smart indexing engines. For instance, sites that have been determined to change frequently will be visited more often by the 'worms' that collect data from the sources, so that the average information is as little out of date as feasible [Lynch:97].

1. Indexing 

Input for indexes can be produced by the information supplier, but such indexes are likely to be limited. The consumer of information will typically find it too costly to produce indexes for their own use only. Schemes requiring cooperation of the sources have been proposed [GravanoGT:94]. Since producing an index is a value-added service, it is best handled by independent companies, who can distinguish themselves by comprehensiveness versus specialization, currency, convenience of use, and cost. Those companies can also use tools that break through access barriers in order to better serve their population. There is also a role for professional societies [ACM:99]. We will review current technologies for such enterprises in Section 3.2.7.

2. Semantic Inconsistency

Accuracy and coverage (recall) are also limited by semantic problems. The basic issue is the impossibility of having wide agreement on the meaning of terms among organizations that are independent of each other. We denote the set of terms and their relationships, following current usage in Artificial Intelligence, as an ontology [WG:97]. Many ontologies have existed for a long time without having used the name. Schemas, as used in databases, are simple, consistent ontologies. Foreign keys relating table headings in database schemas imply structural relationships. Included in ontologies are the values that variables can assume; of particular significance are codes for enumerated values used in data-processing [McEwen:74]. Names of states, counties, etc. are routinely encoded. When such terms are used in a database the values in a schema column are constrained, providing another example of a structural relationship. There are thousands of such lists, often maintained by domain specialists. Other ontologies are being created now within DTD definitions for the eXtensible Markup Language (XML) [Connolly:97].

A major effort, sponsored by the National Library of Medicine (NLM), has integrated diverse ontologies used in healthcare into the Unified Medical Language System (UMLS) [HumphreysL:93]. In large ontologies collected from diverse sources or constructed by multiple individuals over a long time, some inconsistencies are bound to remain. Large ontologies have also been collected with the objective of assisting in common-sense reasoning, as in Cyc [LenatG:90]. Cyc provides the concept of microtheories to circumscribe contexts within its ontology. Cyc has been used to articulate relevant information from distinct sources without constraints imposed by microtheories [ColletHS:91]. That approach provides valuable matches, but not complete precision. Most ontologies have associated textual definitions, but those are rarely sufficiently precise to allow a formal understanding without human interpretation.

Inconsistency of semantics among sources is due to their autonomy. Each source develops in its own context, and uses terms and classifications that are natural to its creators and owners. The problem with articulation by matching terms from diverse sources is not just that of synonyms, two words for the same object, or of homonyms, one word for completely different objects, such as miter in carpentry and in religion. The inconsistencies are much more complex, and include overlapping classes, subsets, partial supersets, and the like. Examples of problems abound. The term vehicle is used differently in the transportation code than in the building code, although over 90% of the instances are the same.
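The notion of overlapping classes can be made concrete with a small sketch that treats each source's use of a term as a set of instances. The two instance lists below are invented for illustration and are not taken from any actual transportation or building code; the function name compare_terms is likewise hypothetical.

```python
# Hypothetical instances covered by the term "vehicle" in two sources.
transport_code = {"car", "truck", "bus", "motorcycle", "bicycle"}
building_code = {"car", "truck", "bus", "motorcycle", "boat trailer"}


def compare_terms(first, second):
    """Describe how the instance sets behind one term relate in two sources."""
    shared = first & second
    return {
        "shared": shared,
        "only_first": first - second,
        "only_second": second - first,
        # Fraction of all instances on which both sources agree.
        "overlap": len(shared) / len(first | second),
    }


report = compare_terms(transport_code, building_code)
```

Neither set contains the other, so matching the two sources on the word alone would silently include or drop the instances listed under only_first and only_second.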

The need for consistent terms is recursive. Terms do not only refer to real-world objects, but also to abstract groupings. The term 'vehicle' is different for architects, when designing garage space, from that of traffic regulators, dealing with right-of-way rules at intersections. A vendor site oriented towards carpenters will use very specific terms, say sinkers and brads, to denote certain types of nails, that will not be familiar to the general population. A site oriented to homeowners will just use the general category of nails, and may then describe the diameter, length, type of head, and material.

Inconsistent use of terms makes sharing of information from multiple sources incomplete and imprecise. Forcing every category of customers to use the same terminology is inefficient. The homeowner cannot afford to learn the thousands of specialized terms needed to maintain one's house, and the carpenter cannot afford wasting time by circumscribing each nail, screw, and tool with precise attributes. Mismatches are rife when dealing with geographic information, although localities are a prime criterion for articulation [MarkMM:99]. Many ontologies have textual definitions for their terms, just as found in printed glossaries. These definitions will help readers, but cannot guarantee precise automatic matching, because the terms used in the definitions also come from their own source domains. The problems due to inconsistency are even more of a hindrance to business than to individuals, who deal more often with single instances, as discussed in Section 4.1. Research tasks to deal with semantic inconsistency are indicated in Section 8.3.

3. Isolation

Information stored in an Intranet, behind a firewall, is not accessible to the public search engines, nor are sites that explicitly forbid access in their headers. Systems that extract information dynamically out of databases or other sources also create barriers, unwittingly or intentionally, that make the actual data inaccessible for indexing.
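One common way for a site to forbid access by indexing programs is a robots.txt exclusion file. This is offered as a related illustration, since the text above speaks of headers rather than of robots.txt specifically. The sketch uses Python's standard urllib.robotparser module; the rules, the crawler name ExampleCrawler, and the addresses are invented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion rules, as a site might publish them in robots.txt.
RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(RULES)  # parse the rules directly instead of fetching a URL

# A well-behaved crawler asks before visiting each address.
may_index_public = parser.can_fetch(
    "ExampleCrawler", "http://example.org/public/index.html")
may_index_private = parser.can_fetch(
    "ExampleCrawler", "http://example.org/private/data.html")
```

Compliance is voluntary on the part of the crawler, so such rules express the intent of the site rather than enforce it.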

Those sources are referred to as the [Note Invisible Web] and are considered an order of magnitude larger than the accessible web. The content of the invisible web is likely not much more important to searchers, since the size estimates include huge image libraries collected by NASA of space and earth observations.

Where limited access is intentional the requester cannot argue, but much valuable material is not accessed because its interface, its representation or its access paths do not allow indexing. For instance, the entire content of the Library of Congress is hidden behind a web page that presents a query engine. A customer who knows to search there will be served, but none of the material will appear in the information returned by one of the web-based search engines, which provide the primary access path for most consumers.

4. Suitability

The suitability of the information for use once it is obtained also needs assessment. Medical findings of interest to a pathologist will be confusing to patients, and advice for patients about a disease should be redundant to the medical specialist. Some partitioning for roles exists now; for instance, Medline has multiple access points [Cimino:96]. But smart selection schemes might well locate information via all paths, and most information that is publicly available is not labeled with respect to consumer roles; it may even be presumptuous to do so.

There is hence a role for mediating modules to interpret meta-information associated with a site and use that information to filter or rank the data obtained from that site [Langer:98]. Doing so requires understanding the background and typical intent of the customer. Note that the same individual can have multiple customer roles, as a private person or as a professional.

5. Quality-based Ranking

Assessing the quality of information and the underlying merchandise and services is an important service, as discussed in Section 2.2.3, and should be integrated into mediating services. Here three parties are involved in the module:
  1. sources of the data, which should be up-to-date and highly available;
  2. customers, to whom information is to be delivered;
  3. assessors, who apply expertise to the mediation of data into information.

The latter must understand the sources as well as the categories of customers, and also be able to respond to feedback from the customers [NaumannLF:99]. Tools to help rank the quality of data by a wide variety of source and customer attributes should be easy to insert.

6. Determining Unusual Features Important for the Purchaser

Unusual features are, by their own definition, varied, span a wide range, and are often omitted from the primary information. Examples may be the shade of a color wanted to match a piece of apparel, secondary measurements such as the size of a piece of furniture wanted for a specific odd location, the weight of an object to assess its portability, or its consumption of electricity or batteries. The lack of such information in on-line catalogs and from call centers is astounding. Even for such obvious cases as laptop computers, weight and actual battery life are hard to ascertain, and similar factors for desktop computers are impossible to find. Providing generous return policies, at high cost to the vendors, is one way of overcoming the lack of confidence generated by missing information.

There is an obvious tension in providing more specifications. Organizing the information to make it suitable for the consumer requires insight and care, often lacking in the engineers who design the goods and in their marketers. Many of the parameters are hard to specify, especially factors describing quality. If much detail, irrelevant to many, is given, then the consumer who is not interested will be overloaded, and may give up on the purchase altogether.

7. Tools for Selection and Search

The need for assistance in selecting relevant information from the world-wide web was recognized early in the web's existence [BowmanEa:94]. This field has seen rapid advances, and yet the users remain dissatisfied with the results. Complaints about 'information overload' abound. Web searches retrieve an excess of references, and getting a relevant result, as needed to solve some problem, requires much subsequent analysis. And yet, in all that volume, there is no guarantee that the result is precise and complete.

Searches through specific databases can be made to be complete and precise, since the content of a database, say the list of students at a University, and their searchable attributes, as maintained by the registrar, can be expected to be complete. Not obtaining, say, all the Physics students from a request is seen as an error in recall, and receiving the name of any non-Physics student is an error in precision.
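In the usual information-retrieval vocabulary, a missed Physics student lowers recall and a wrongly included name lowers precision. The following sketch computes both measures for the registrar example; the student names and the function name precision_recall are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against the true set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of what was returned that is actually relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of what is relevant that was actually returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical registrar data: the true Physics students and a query answer.
physics_students = {"Ann", "Bob", "Cho", "Dev"}
answer = {"Ann", "Bob", "Eve"}  # misses Cho and Dev, wrongly includes Eve

precision, recall = precision_recall(answer, physics_students)
```

For a closed database both measures can be computed exactly because the true set is known; on the web the set of relevant pages is unknown, which is why recall can only be estimated there.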

Effect of Sponsors

Most of the search services are provided by companies that obtain their support by also displaying advertising, which means that the focus is initially on breadth (attracting many viewers) rather than on depth, providing high-value information for specialized audiences. Many advertising sponsors prefer having their advertisements seen by a more specific audience, and that is accommodated by having such advertisements presented at later stages in the search, when the customer has narrowed the search to some specific topic. This approach is likely to cause more effort to be expended on paths where advertisements are easier to sell.

Search Techniques

There is a wide variety of search techniques available. They are rarely clearly explained to the customers, perhaps because a better understanding might cause customers to move to other search services. Since the techniques differ, results will differ as well, but comparisons are typically based on recall rather than on precision. Retrieving more references tends to improve recall, but assessing precision formally requires an analysis of relevance, and knowing what has been missed, which is an impossible task given the size and dynamics of the web.

Potentially more relevant results can be obtained by intersecting the results from a variety of search techniques, although recall is then likely to suffer further.

We briefly describe below the principal techniques used by some well-known search engines; they can be experienced by invoking www.name.com. This summary can provide hints for further improvements in the tools.

Yahoo catalogues useful web sites and organizes them as a hierarchical list of web addresses. By searching down the hierarchy the field is narrowed, although at each bottom leaf many entries remain, which can then be further narrowed by using keywords. Yahoo now employs a staff of about 200 people, each focusing on some area, who filter web pages that are submitted for review or located directly, and categorize those pages into the existing classification. Some of the categories are dynamic, such as recent events and entertainment, and aggregate information when a search is requested.

Alta Vista automates the process by surfing the web, creating indexes for terms extracted from the pages, and then using high-powered computers to report matches to the users. Except for limits due to access barriers, the volume of possibly relevant references is impressive. However, the result is typically quite poor in precision. Since the entire web is too large to be scanned frequently, references might be out of date, and when content has changed slightly, redundant references are presented. Context is ignored, so that when seeking, say, a song title incorporating the name of a town, information about the town is returned as well.

Excite combines some of the features, and also keeps track of queries. If prior queries exist, those results are given priority. Searches are also broadened by using the ontology service of WordNet [Miller:93]. The underlying notion is that customers can be classified, and that customers in the same class will share interests. However, asking similar queries and relating them to individual users is a limited notion, and leads only sometimes to significantly better results. Collecting personal information raises questions of privacy protection.

Firefly provides customers control over their profiles. Individuals submit information that will encourage businesses to provide them with information they want [Maes:94]. However, that information is aggregated to create clusters of similar consumers, protecting individual privacy. Businesses can use the system to forward information and advertisements that are appropriate to that cluster. Matching a person to a single customer role is a simplification. Many persons have multiple roles. At times they may be a professional customer, seeking business information, at other times they may pursue their sports hobby, and subsequently they may plan a vacation for their family. Unless these customer roles can be distinguished, the clustering of individuals is greatly weakened.

Alexa collects not only references, but also the webpages themselves. This allows Alexa to present information that has been deleted from the source files. Ancillary information about web pages is also provided, such as the author organization, the extent of use, the 'freshness' of updates, the number of pages at a site, the performance, and the number of links referring to this page. Such information helps the customer judge the quality of information on the page. Presenting web pages that have been deleted provides an archival service, although the content may be invalid. The creators of such webpages can request Alexa to stop showing them, for instance if the page contained serious errors or was libelous. Since the inverted links are made available one can also go to referencing sites.

Google ranks the importance of a web page according to the total importance of the web pages that refer to it. This definition is circular, and Google performs the required iterative computation to estimate the scaled rank of all pages relative to each other. The effect is that often highly relevant information is returned first. It also requires matches to all query terms, which reduces the volume greatly, but may miss relevant pages [PageB:98].
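The circular definition can be resolved by repeated substitution until the values settle. The following is a minimal sketch in Python, not Google's actual implementation: the link graph, the function name rank_pages, the damping value of 0.85, and the fixed iteration count are all assumptions made for illustration.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Estimate relative page importance by repeated redistribution of rank.

    links maps each page to the list of pages it refers to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal importance
    for _ in range(iterations):
        # Every page keeps a small base share regardless of incoming links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it refers to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A refers to B and C, B to C, C to A, D to C.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = rank_pages(LINKS)
top_page = max(ranks, key=ranks.get)
```

In this graph page C collects references from three pages and ends up ranked highest, while D, to which nothing refers, retains only the base share.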

Junglee provides integration over diverse sources. By inspecting sources, their formats are discerned, and the information is placed into tables that then can be very effectively indexed. This technology is suitable for fields where there is sufficient demand, so that the customer needs can be understood and served, such as advertisements for jobs and searches for merchandise. Accessing and parsing multiple sources allows, for instance, price comparisons to be produced. Vendors who wish to differentiate themselves based on the quality of their products (see Section 2.2.3) may dislike such comparisons.

Cookies are not an independent search engine, but a device used by many engines and applications to track users' activities between sessions. Cookies are left on the user's computer by some applications and read at a later time by the same or a related application. For instance, a search for some movie, recorded in a cookie, can trigger an advertisement for a similar movie later. The use of cookies moves the storage of user-specific information to the user's computer. It hence also changes the flavor of privacy concerns. Browsers allow rejecting cookies and applications that generate cookies.
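The round trip of a cookie can be sketched with Python's standard http.cookies module. The cookie name last_search, the movie title, and the thirty-day lifetime are invented for illustration, and the browser's part is simulated by a plain string.

```python
from http.cookies import SimpleCookie

# First session: the application records the user's search in a cookie.
outgoing = SimpleCookie()
outgoing["last_search"] = "casablanca"
outgoing["last_search"]["max-age"] = 30 * 24 * 3600  # keep for thirty days
# Text the application would send in a Set-Cookie header.
set_cookie_value = outgoing["last_search"].OutputString()

# The browser stores the cookie on the user's computer and, in a later
# session, sends the name and value back in a Cookie header (simulated here).
returned_by_browser = "last_search=casablanca"

# Later session: the application reads what it left behind.
incoming = SimpleCookie()
incoming.load(returned_by_browser)
remembered_search = incoming["last_search"].value
```

The application itself stores nothing between the two sessions; the remembered search lives entirely on the user's computer, which is what shifts the privacy question.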

This list of techniques can be arbitrarily extended. New ideas in improving the relevance and precision of searches are still developing [Hearst:97]. There are, however, limits to general tools. Three important additional factors conspire against generality, and will require a new level of processing if searching tools are to become effective.

8. Factors Reducing the Effectiveness of Search Engines

The three principal factors hindering the effectiveness of search engines are: unsuitable source representations, inconsistent semantics (as discussed in Section 3.2.2), and inadequate modeling of the customers' requirements. Effectiveness must be increased if web-based information is to be routinely used in business settings. Overcoming these three limitations requires in each case combining automation with manual, value-added inputs, as discussed in Sections 8.3 and 8.1.

Representation of data in sources uses text, icons, images, etc. in a variety of formats. Text-based search engines are limited to textual representation of data [Nelson:97]. This means that information made available in proprietary formats, such as Microsoft Word and PowerPoint, PostScript, and Adobe PDF, or embedded into images, is not easily captured. The W3QL language permits the specification of web queries using forms, but unless the allowable query terms can be enumerated, most information hidden behind these forms remains inaccessible [KonopnickiS:98]. The search engines will fail to find much scientific information, for which web standards do not provide adequate formatting. As more information moves to visual representations there is a further lack of search capability. If the objective of the producer of the web page is to be found by the search engine, they will use simple ASCII in HTML and XML texts. Tricks used by aggressive sites to increase the chances of being rated high include adding and repeating terms in portions of the web pages that are not displayed.

Modeling the customer's requirements effectively requires more than tracking recent web requests. First of all a customer in a given role has to be disassociated from all the other activities that an individual may participate in. We distinguish here customers, performing a specific role, and individuals, who will play several different roles at differing times. In a given role, complex tasks can be modeled using a hierarchical decomposition, with a structure that supports the divide-and-conquer paradigm that is basic to all problem-solving tasks [W97:M].

9. Feature Overload

Without clean models we encourage the addition of more and more features to our systems. Each feature is the result of some bright idea or engineering solution, but the resulting systems are confusing and unclear for the customer. Having models can help bridge the gap between the engineers, who are feature-oriented, and customers, who experience overload not only of content but also of the means for dealing with the content. Feature growth in customer interfaces of information systems applications is similar to that in general software (Section 6.3), but less constrained by interface standards.

The End

Return to class [Notes13] or CS73N Home


Last Modified 2006-05-20