CS73N Class Notes 13 for 19 May 2006
Started by Gio Wiederhold 19 May 2006, lost and reentered the same day as Note13. Edited by SG Tessler 19 May 2006.
Topics covered
Presentations
The last 4 presentations were given:
- Nina Liu: Advice on how to take care of your skin (your body's largest organ [Gio]). Questions included how to motivate dermatologists to work with you.
- Doug Tarlow: Unofficial Stanford Technology Guide, focused on incoming students. Might have sections on academic computing needs, personal communication, and entertainment equipment, such as for music and video.
- Angela Tenney: Unofficial guide for Stanford students on dealing with the draw, meal plans, and course selection, such as the Freshman seminars and IHUM programs.
- Yuriko Tamara: On-line Fantasy baseball, why and how.
These projects focused on public service so they did not have very strong business models.
Also, make sure that you are clearly identified as the authoritative source, and that the date of creation is clear.
Assignment
The last writing assignment is:
Tell the reader why they should use your web site by preparing an analysis of competing and complementary web pages.
We envisage something like three web pages, one covering the general benefits of your presentation for the specific audience, often Stanford students, and two linked reference pages describing competitors and complementary material.
The reference pages would include brief informative descriptions, listing the focus, benefits, and liabilities of those pages, as well as the actual pointers. Identifying the owners or the sponsors will be very useful for the readers. Those descriptions should enable the readers to avoid wasting time by accessing pages that are of little use to them.
You may wonder about helping competitors by including pointers to their pages, making them more accessible and even increasing their page rank (see the Google description in this note). But now you are in control, Google will let readers find those pages anyhow, and it is academically the responsible thing to do. [From Shirley: "control" is the operative word here. Think about Amazon's service that points customers to "other sources of new and used books". What is their reasoning? What is their business model?]
Make sure that the external web references don't take readers out of your site, but only create new windows. This is achieved by adding the directive target="_blank" to the HREF reference in the HTML, an option in many web-page generating tools, including this wiki product, Editme. (Choose the target option "open in new window _blank" when doing an "insert/edit link".)
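For pages generated by a script rather than by a wiki tool, the same directive can be written out directly. The following is a minimal sketch; the helper name external_link and the example URL are made up for illustration.

```python
from html import escape


def external_link(url: str, label: str) -> str:
    """Return an HTML anchor that opens `url` in a new window or tab."""
    # escape() protects against characters such as & or " in the URL or label.
    return f'<a href="{escape(url, quote=True)}" target="_blank">{escape(label)}</a>'


print(external_link("https://www.example.com/", "Example site"))
```

The printed anchor can be pasted into a page; the reader's own page stays open while the reference appears in a new window.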
Browsers
The major problem facing individual consumers is the ubiquity and diversity of information. Just as the daily newspaper presents an overload of choices for the consumer in its advertising section, the World-Wide Web contains more alternatives than can be investigated in depth. When leafing through advertisements the selection is based on the prominence of the advertisement, the convenience of getting to the advertised merchandise in one's neighborhood, the reputation of quality, personal or created by marketing, of the vendor, and unusual features, suitability for a specific need, and price. The dominating factor differs based on the merchandise. Similar factors apply to online purchasing of merchandise and services. Lacking the convenience of leafing through the newspaper, greater dependence for selection is based on selection tools. [From Wiederhold: Trends For the Information Technology Industry; Stanford University, April 1999, section 3.2].
A brief historical sequence showing the development of those selection tools, known as browsers:
When the web started to have interesting content for the public, in the late 1990s, a program called Harvester collected all websites it could find, to provide a repository for search. Web sites are found by starting to crawl from several known pages and following pointers (identified by the text following `<A href=' on those pages), as well as by finding pages just listed in directories on these sites. Crawling now takes much effort, and is carried out perhaps weekly by large search engine companies, and much less frequently by local efforts such as the WebBase Project at Stanford. To remain as up-to-date as feasible, modern search engines keep lists of pages that change frequently and crawl those more often.
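The crawling step can be sketched in a few lines. This is an illustration only, not the code of Harvester or of any real search engine: the class LinkExtractor, the function crawl, and the three-page TOY_WEB are all assumptions, and the pages are held in memory so that no network access is needed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches too.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch):
    """Breadth-first crawl from the seed URLs; `fetch` returns a page's HTML or None."""
    repository = {}
    queue = deque(seeds)
    seen = set(seeds)
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # dead link: nothing to store or follow
            continue
        repository[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            target = urljoin(url, link)  # resolve relative pointers
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return repository


# Hypothetical three-page web, standing in for real HTTP fetches.
TOY_WEB = {
    "http://example.org/": '<A HREF="a.html">A</A> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": '<a href="missing.html">dead link</a>',
}

pages = crawl(["http://example.org/"], TOY_WEB.get)
print(sorted(pages))
```

A production crawler would also need politeness delays, revisit schedules for frequently changing pages, and error handling, all of which are left out here.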
Since the repositories created by crawling grew rapidly, indexes were added to locate relevant pages. An index lists all the words appearing on all the pages, with the URLs that point to the documents where they appear. Later indexes added a position indicator as well, showing where in the document that word appeared. An index is alphabetically arranged, allowing rapid access to the word and then to the reference pages.
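A small sketch of such an index with position indicators follows. The function name build_index and the two sample pages are assumptions for illustration; positions are counted in words from the start of the page.

```python
import re


def build_index(pages):
    """Map each word to {url: [positions]}, with words in alphabetical order."""
    index = {}
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index.setdefault(word, {}).setdefault(url, []).append(position)
    return dict(sorted(index.items()))


sample_pages = {
    "u1": "The laptop is a system",
    "u2": "A laptop and a laptop bag",
}
index = build_index(sample_pages)
print(index["laptop"])
```

In Python the dictionary lookup is already fast without sorting; the alphabetical order is kept here only to mirror the description above.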
As the indexes grew, the number of references for each word became excessive. First, very frequent words were eliminated. Examples of such stopwords are `the', `and', etc., as well as all single-letter words. Eventually those stopwords had to be made language specific.
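The need for language-specific stopword lists shows up even in a tiny example. The lists below are short hypothetical samples, not the lists used by any actual search engine, and index_terms is an illustrative name.

```python
# Hypothetical, deliberately tiny stopword lists keyed by language code.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in"},
    "de": {"der", "die", "das", "und", "in"},
}


def index_terms(words, language):
    """Drop the stopwords of the given language and all single-letter words."""
    stop = STOPWORDS.get(language, set())
    return [w for w in words if len(w) > 1 and w not in stop]


print(index_terms("the cat and a dog".split(), "en"))
print(index_terms("die cast".split(), "en"))
```

The word `die' is a stopword in German but a meaningful word in English, so a single shared list would drop it wrongly in one of the two languages.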
Then documents were ranked, so that documents with many instances of the word were deemed to be more relevant and would be placed at the head of the list of URLs. A further improvement was making the count relative to the size of the document, so that large documents would not always be preferred. Recording the document size became another requirement for the crawlers.
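The effect of making the count relative to document size can be seen in a short sketch. The function rank and the sample pages are assumptions; the score here is simply occurrences divided by the number of words on the page.

```python
def rank(word, pages):
    """Return URLs ordered by occurrences of `word` divided by page length in words."""
    scores = {}
    for url, text in pages.items():
        words = text.lower().split()
        count = words.count(word)
        if count:  # pages without the word are not listed at all
            scores[url] = count / len(words)
    return sorted(scores, key=scores.get, reverse=True)


sample_pages = {
    "short": "laptop review laptop",
    "long": "laptop " + "filler " * 20 + "laptop laptop",
    "none": "no match here",
}
print(rank("laptop", sample_pages))
```

By raw count the long page would win with three occurrences against two; relative to size the short page scores 2/3 against 3/23 and is listed first.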
Vector ranking methods are an older technology, adapted for some browsers. Here each word is assigned a value relative to its occurrence in the language overall. In practice the language base consists of all crawled documents. Stopwords have a very low value, but even a general word such as `system' has a lower value than a more specific word such as `laptop'. Words that occur only once (or very rarely) in the whole base are likely misspellings, and in any case are unlikely to appear in a query, so even though they have a very high vector value, they are not very useful.
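One common way to assign such values is the inverse document frequency, the logarithm of the number of documents divided by the number of documents containing the word. The notes do not say which weighting was used, so this formula, the function word_values, and the four sample pages are assumptions for illustration.

```python
import math


def word_values(pages):
    """Weight each word by log(N / number of pages that contain it)."""
    n = len(pages)
    doc_freq = {}
    for text in pages.values():
        for word in set(text.lower().split()):  # count each page once per word
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n / df) for word, df in doc_freq.items()}


sample_pages = {
    "d1": "the system runs the laptop",
    "d2": "the system is down",
    "d3": "the system and the network",
    "d4": "the laptpo is new",  # deliberate misspelling of laptop
}
values = word_values(sample_pages)
for word in ("the", "system", "laptop", "laptpo"):
    print(word, round(values[word], 3))
```

The stopword gets the value zero, `system' a low value, and `laptop' a high one; the misspelling scores just as high as `laptop', which illustrates why very rare words are of little practical use.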
The next innovation was the PageRank algorithm, developed by Larry Page and Sergey Brin at Stanford.
" PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."
Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search. Of course, important pages mean nothing to you if they don't match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query. " [From Google Technology, obtained 20 May 2006]
The PageRank algorithm essentially extracts public knowledge from the web, contributed by all those users who place links to other pages on their websites, such as the participants in the CS73N class. Determining the PageRank theoretically requires iterating without end, since every time around the quality ratings will change, affecting other quality ratings. In practice there is a threshold to stop the iterations when changes become minor.
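The iteration with a stopping threshold can be sketched as follows. This is a textbook-style simplification, not Google's implementation: the damping value 0.85, the threshold, the treatment of pages without outgoing links, and the four-page link graph are all assumptions.

```python
def pagerank(links, damping=0.85, threshold=1e-6, max_rounds=1000):
    """Iterate rank values until the total change per round falls below `threshold`.

    `links` maps each page to the list of pages it points to.
    Returns the rank of each page and the number of rounds used.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each link is a vote, weighted by the rank of the voting page.
                new_rank[target] += damping * rank[page] / len(targets)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < threshold:  # changes have become minor: stop
            break
    return rank, rounds


toy_links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks, rounds_used = pagerank(toy_links)
print(max(ranks, key=ranks.get), rounds_used)
```

Page C receives three votes and ends up with the highest rank; page A receives only one vote, but since it comes from the important page C it still ranks well above B and D, which receive none.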
Before Google, Yahoo used people to assign useful web pages into a hierarchy for convenient access by searchers. Hundreds of ontologists inspected web pages and classified them. Today the web is too large to classify more than a minuscule fraction manually, but the hierarchical search is still attractive to readers who benefit from the explicit guidance.
Of course, there are no perfect hierarchies to classify all the world's knowledge, so searchers who have a different model of how the world should be organized can find a hierarchy produced by others frustrating, often considering the regularity it imposes to be bureaucratic.
Over time, all competing browsers have adopted technologies from each other. Some of the effective technologies require more work from the users in phrasing the query. But most users appear to be very disinclined to use advanced tools, and would rather miss useful references or wade through irrelevant material. For an older, more specific description of browser technologies, see Note Browsers. An understanding of how browsers work can help searchers use the available search engines more effectively.
Two related issues came up in the class discussion:
How can Google protect its work?
There are three fundamental methods to protect Intellectual Property (IP), (to be) detailed in [Note Protection]: patents, copyright, and trade secrets.
In this case Google licensed the patent that defines the embodiment of the PageRank algorithm from Stanford, where its founders developed it. But once the general principle is known, competitors can develop similar algorithms. And it is hard for the patent owner to check whether a patent is violated without inspecting the actual code of a competitor.
Keeping the method as a trade secret may be more effective, but potential competitors can experiment with a public service and eventually deduce how things work. Also, investors don't like to risk their investment in such a formally unprotected effort.
Use of copyright is even less useful, since an interesting, non-trivial algorithm can be coded (embodied) in many ways. And again, it is hard to check whether a competitor is copying your code if the code itself resides on their servers and is never made public.
Use of copyrighted material.
You may want to place copies of someone's copyrighted material on your web page. There are rules that permit such use for academic and review purposes. They are not very well defined, but would certainly apply to modest use in the CS73N class.
You should always identify the source used so that web browsers and people that use paper libraries can locate the material. When you use material from the web always identify the date it was copied, since the web is dynamic, and the sources may change afterwards. It is also wise to copy the entire source web page to your files, in case questions of content arise later.
Copying webpages you reference externally is also worthwhile, again to deal with the dynamic nature of the web. If there is much concern you can even make the reference point to your copy, identifying it as `Cached as of ..'.
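Keeping such dated copies is easy to automate. The sketch below assumes the page content has already been obtained; the function cache_page, the naming of files by a hash of the URL, and the example URL are all illustrative choices.

```python
import datetime
import hashlib
import tempfile
from pathlib import Path


def cache_page(url, html, cache_dir, copied_on):
    """Save a copy of a referenced page and return its path and a citation note."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # A hash of the URL gives a file name that is safe on any file system.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = cache_dir / name
    path.write_text(html, encoding="utf-8")
    note = f"{url} (cached as of {copied_on.isoformat()})"
    return path, note


# Demonstration in a temporary folder; a real site would use a permanent one.
with tempfile.TemporaryDirectory() as folder:
    saved_path, note = cache_page(
        "http://example.org/page.html",
        "<html>...</html>",
        folder,
        datetime.date(2006, 5, 20),
    )
    print(note)
```

The returned note can be placed next to the reference so that readers see both the original address and the date of the copy.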