Tuesday, November 29

A Risky Gamble With Google

By SIVA VAIDHYANATHAN

Wouldn't it be cool if we didn't have to tell students that a Web search is insufficient for serious scholarly research? Wouldn't it be great if we could use a single, simple portal to find the most-significant Web pages, images, scholarly articles, and books dealing with a particular subject or keyword? Wouldn't it be wonderful if we could do full-text searches of millions of books?

The dream of a perfect research machine seems almost within our reach. Google, the Mountain View, Calif., company flying high off a huge initial public offering of stock and astounding quarterly revenues, announced late last year that it would digitize millions of bound books from five major English-language libraries. It plans to make available online the full text of public-domain books (generally those published before 1923, plus government works and others never under copyright) and excerpts from works still in copyright.

Harvard University will allow Google to scan 40,000 books during the pilot phase of the project, and the number may grow. The library has more than 15 million volumes. The University of Michigan at Ann Arbor has agreed to let Google scan its entire collection — some 7.8 million works — and Stanford University says it is keeping open the possibility of including "potentially millions" of its more than eight million volumes. The Bodleian Library at the University of Oxford will allow Google to scan public-domain books, which it says are principally those published before 1920. The main library alone holds 6.5 million books in its collection. And the New York Public Library will contribute from 10,000 to 100,000 public-domain volumes. It holds 20 million volumes. Even if the project included only Michigan's collection, it would be astounding.

Google is doing all the scanning and optical-character recognition with a secret proprietary machine and promises not to damage the pages or bindings. According to Google's contract with Michigan (the only contract released to the public), the university will be offered a digital copy as well.

I have to confess, I am thrilled and dazzled by the potential of such a machine and the research and distribution opportunities it presents. I sincerely wish every Internet user had access to a full-text search of every book in the Google libraries.

But, as we all know, we should be careful what we wish for. This particular project, I fear, opens up more problems than it solves. It will certainly fail to live up to its utopian promise. And it dangerously elevates Google's role and responsibility as the steward — with no accountability — of our information ecosystem. That's why I, an avowed open-source, open-access advocate, have serious reservations about it.

It pains me to declare this: Google's Library Project is a risky deal for libraries, researchers, academics, and the public in general. However, it's actually not a bad deal for publishers and authors, despite their protestations.

On one side, we have the opponents. In recent lawsuits, the Authors Guild and the Association of American Publishers, which is representing five of its members, have charged Google with copyright infringement for scanning works still under copyright. Less well publicized are the questions that some librarians have about whether the five participating libraries are acting in the best interest of libraries and users in general.

On the other side, the "copyfight" community (or the "Free Culture Movement," as it is increasingly known) has generally cheered on Google, as it seems to be the champion of more flexible and open copyright principles and is aggressively confronting big media companies like Viacom, AOL Time Warner, Disney, and the News Corporation, all of which own major publishing houses. For the copyfighters, this is David versus Goliath.

As it became clear that Google was not going to placate its critics easily, the company announced last summer that it would cease scanning works under copyright until at least November 1. It has now begun digitizing them again. I can't help but envision this fight as one between Godzilla and Megalon. Whoever the bad guy turns out to be, a whole lot of good things are going to get crushed in the melee.

The fear that we are traveling down the road to Googlizing just about everything lurks behind this controversy. Google plays a peculiar and powerful role in our information ecosystem. It is a ubiquitous brand, used as a noun and a verb everywhere from adolescent conversations to scripts for Sex and the City. Its initial public offering in 2004 generated $1.67-billion in cash. Its stock price has soared since, and its revenue has more than doubled, to $3-billion per year.

Yet the core service of Google.com — its search engine — handles less than 50 percent of the Web-search business in the United States. While Google is clearly the leader, Yahoo also handles a significant chunk, and Microsoft's MSN a smaller but still sizable portion. Microsoft — whose search engine is the default built into the default Web browser available right out of the computer box — is gaining on both of them all over the world.

Microsoft already controls most of the desktops in the world. It also controls an increasing number of operating systems for mobile data devices. Thus many of the world's files are potentially indexable and searchable by Microsoft itself. And the company has ways of locking other firms out of essential services in the desktop environment. In addition, Google's chief advantage in the Web-search area, its PageRank algorithm, no longer sets it apart: Google's is no longer the only effective search engine on the market. Sex and the City's Carrie Bradshaw might use Google now, but there is no reason to believe she would next year.

To preserve its status as the elite, venerated, and fast-moving technology company that does good as well as does well, Google must therefore do two things. It must continue to convince the world that it is the anti-Microsoft. And it must find more things to index and expose to the world.

So far Google has protected its brand as the good guy on the block. The damage Google has done to the world is minimal. It seems to provide users a service at no cost (beyond buying a computer and paying for Internet use) and with little annoyance. Getting big by keeping advertisements small, it has carefully avoided pinching our marketing-saturated nervous systems and offered illusions of objectivity, precision, comprehensiveness, and democracy. We are led to believe that Google search results are determined by peer review — by creating our own links on the Web and having Google count them — not by an editorial team of geeks or, worse, marketing consultants. So far, that strategy has worked well: Google has the balance sheets to prove it.

To maintain them, however, Google must get bigger. It must go new places and send its spiders crawling through unindexed corners of human knowledge. Google's corporate mission statement includes the rather optimistic and humanistic phrase, "to organize the world's information and make it universally accessible and useful." When the company's executives present their products to the public, they often start with that statement, expecting everyone to be grateful that someone is organizing the torrent of sports statistics, UFO stories, poetry, and pornography. But Google co-founder Sergey Brin once offered a more ominous indication of what the enterprise might become: "The perfect search engine would be like the mind of God."

Both quotations should worry us. Is it really proper for one company — no matter how egalitarian it claims to be — to organize all the world's information? Who asked it to? Isn't that the job of universities, libraries, academics, and librarians? Have those institutions and people failed in their mission? Must they outsource everything? Is anyone even watching to see if Google does the job properly?

Let's examine the Google Library Project in depth. First, we must deal with some confusing branding. The undertaking is really a subset of another program, Google Print, which, according to the company, aims to put book content "where you can find it most easily — right in your Google search results." Google Print also has another part, the Publisher Program, which has been online for more than a year now, presenting publisher-authorized full-text searches of many thousands of books. Google spent months negotiating the deals with publishers, allaying fears of hacking and piracy. It demonstrated that users could see only a few pages at a time and could not print them. And it showed publishers that links to online book sites would spur sales with little risk or expense to them.

Google Library, on the other hand, looks and feels much different. For works in the public domain, it will allow users to search and read the entire text. For works still under copyright, it will provide only short snippets of text with search words in context. No one will see even one entire page of a copyrighted book. And, as with Google Print, no one will be able to print or read the entire book — at least via Google. It's unclear whether libraries will allow greater access to their own copies. To make allowances for skittish publishers who take umbrage at the presumptuousness of the wholesale scanning of copyrighted books, Google will also permit publishers to "opt out" if they notify the company which titles they would like withheld.

There's a lot right with this plan. Google may be solving a major research problem. At present researchers rely on keyword searches within text documents. The services that provide such access include some of the most expensive databases around: most notably, LexisNexis and ProQuest. Those index only periodicals, government and legal documents, dissertations, and other specialized materials, not the full texts of books, which are searchable only by author, subject, title, or keywords selected by human catalogers. The current research menu is biased toward those resources that are fully searchable, like academic journals and periodicals. Book authors lose out. It's hard to find a complete list of books that discuss, for instance, the White House adviser Karl Rove. You can find a small pile of books about Rove and another pile about his chief patron, President Bush. But how would you find that Rove played a role in the stem-cell-research controversy or that he is mentioned in a biography of anti-Semitic chess master Bobby Fischer as "the Bobby Fischer of American politics," unless you already knew enough to put such names or issues into one of the current search engines — or could search the actual texts of millions of books? Such associations can spark imaginative scholarship, or at least generate rather eccentric trivia.

Further, most of the best search methods are exclusive. If you are not part of a university community, you can't do much good research. Most public-library catalogs are still rather pedestrian. And few public libraries have licenses to vast scholarly electronic resources. So Google would provide nonuniversity users a glimpse of what might be available at the major research libraries in their area. It would provide broad — although far from universal — access to information.

The third issue Google Library purports to resolve is not really a problem. There is a general perception that students won't use a book when they can find a Web site that appears to do the same work. I'm not convinced that is universally true. And I am a firm believer in convincing students of the value of edited and reviewed information over hastily posted text. I grade accordingly. Still, there may be something to the argument. Google can be addictive, no matter one's age. If students are going to use Google first, then it makes sense to let Google offer better information.

But all is not well in this plan. We could solve each of the problems listed above without Google, although it would take a deep commitment from the public and its institutions to make good information more accessible. This hardly seems like the right time or country to call for a massive public commitment of resources to benefit the public good, however. Google is the first to step up to the plate — although Yahoo has now followed with its own plan to scan public-domain books from a consortium of libraries, and Microsoft has just promised to join with a $5-million investment.

Google's is still the most ambitious plan, however, and its much bolder venture into the world of print offers us at least three reasons to worry: privacy, privatization, and property.

Privacy has been a problem for Google (or, more precisely, for Google users) for some time. Scores of newspaper and magazine articles have considered the complications of finding one's personal history or long-lost sappy poems accessible via Google. With the launch of Google's Web-based e-mail service, Gmail, it became clear that the company was reading user mail for hints about how it might target ads. In addition, Google has potential access to all our search histories.

With Google Library, we have a whole new set of privacy concerns. Can we trust Google not to turn over individual reading records of patrons to the FBI or local law-enforcement officials? There is nothing in Google's privacy policy that promises it will resist such abusive practices. In fact, its policy declares that it will give law-enforcement investigators information to "satisfy any applicable law, regulation, legal process or enforceable governmental request." Even a stronger privacy pledge from the company would be hard to take seriously. Plenty of other companies, like airlines, have violated their own privacy policies.

More important, nothing in the recently revealed contract between the University of Michigan and Google seems to bind the company to respect patron confidentiality. Michigan abdicated its responsibility on that issue. It should have demanded a stronger pledge of user confidentiality in return for access to its rich archive of knowledge. I know many librarians who would rather go to jail than reveal my borrowing habits to suspicious snoops. I doubt I can count on Google's employees to be as committed to user confidentiality.

It's important to remember that Google serves its own masters: its stockholders and its partners. It does not serve the people of the state of Michigan or the students and faculty members of Harvard University.

So the five libraries in Google's project are essentially outsourcing the risk and responsibility for digitizing texts and making them searchable online. The process of privatization is particularly troubling. Of course we should not pretend that libraries operate outside market forces or do not depend on outsourcing many of their functions. But we must recognize that some of the thorniest problems facing libraries today — paying for and maintaining commercial electronic databases and cataloging services — are a direct result of rapid privatization and onerous contract terms. There are too many devils in too many details.

The long-term risk of privatization is simple: Companies change and fail. Libraries and universities last. Should we entrust our heritage and collective knowledge to a business that has been around for less time than Brad Pitt and Jennifer Aniston were together? A hundred years from now, Google may well not exist. Much to the dismay of Ohio State University's football fans, the University of Michigan will. For that reason alone, it's imperative that stable public institutions take the lead in such an ambitious project.

Already this month, the book wars are heating up. Amazon.com and Random House, the world's largest publisher of trade books, have announced plans to allow people to read books online (although not to download or copy them) — for a fee. What happens if the competition ratchets up the price? What if stockholders decide that Google Library is a money loser or too much of a copyright-infringement liability? What if they decide that the infrastructure costs of keeping all those files on all those servers do not justify the expense? What then? Will the proprietary formats through which Google displays its files survive the company's demise? Will they survive other technological changes?

Like all companies, Google protects its proprietary formats and technologies. All software companies use a matrix of nondisclosure contracts, trade secrets, copyrights, and patents to restrict competition and limit public oversight. Like Google's search processes and algorithms, the content of the Print Project is protected by digital-rights-management technologies and not open to public scrutiny. So Google is heavily invested in a strong — perhaps too strong — regime of property rights.

Yet the company has also set itself up as the champion of the public interest in matters of intellectual property and Internet freedom. On its corporate blog, the company announced the hiring of a new Washington, D.C., lobbyist by declaring that part of his portfolio is to defend the public-interest notions of "Net neutrality" (keeping the Internet open and competitive), "copyrights and fair use" (protecting both the principles of exclusive rights and the limits on them like fair use), and limited "intermediary liability" for tech companies (protecting them from suits over what users might do with their technology).

Beware any corporation that pretends to speak for the public interest. That's usually a contingent pledge based on convenience and temporary market conditions. Microsoft seemed to be on the public-interest side of copyright battles in the 1990s, when it fought Apple's attempts to monopolize the graphical user interface. Many times we have learned the hard way that companies shift their public-policy orientations as their market status changes.

The most serious problem Google Library creates concerns one aspect of intellectual property — copyright. This plan injects more uncertainty and panic into a system that is already out of equilibrium. Since the rise of digital media and networks, content owners have been scrambling to install radical legal and technological controls over content and the machines that play and distribute that content. Meanwhile, users have been inventing and discovering powerful new ways to create, revise, and distribute content — often other people's content. As a result, we now have an absurd copyright system. Almost nothing stops bad actors (like DVD pirates) from infringing copyrights for profit. Yet the system plays havoc with innocent copyright users like librarians, researchers, students, and computer programmers. Much about copyright in the digital age is up for grabs. After a series of high-profile cases, it's still not clear whether the big copyright owners (who tend to win such cases) will triumph in the long run and quash the efforts of millions of people to use their own culture as they see fit. Google's plan to copy books further destabilizes the system, poking at the very core of copyright.

It also invites a collision of norms and rules unlike any we have seen in almost a century. It is certainly the most interesting — and possibly the most disruptive — copyright conflict since battles over copying film and recorded music.

Some months ago, when the major copyright debate was over peer-to-peer file sharing (a simple problem, in comparison), I wrote a supporting brief in a case before the U.S. Supreme Court, Metro-Goldwyn-Mayer Studios Inc. v. Grokster Ltd., on behalf of media-studies scholars. We argued that the court should avoid the temptation to overregulate technology. I drew parallels between Grokster, a company that produced search engines through which users found copyrighted music files, and Google, a search engine through which users find copyrighted text files.

Of course there is one big difference: Grokster did not actually do any copying. Google does. For years it has been making copies of the copyrighted Web pages it indexes to store in its own cache. It does so because the law grants it some immunity to make cache copies under a provision designed to protect Internet service providers.

It also does so because that's the norm in the Web world: If you don't want your Web page copied and indexed by search engines, you can opt out. You can post a small signal in your page instructing Google's machines not to include your content in its index. But just about everyone wants to be indexed by Google and other search engines; only companies with strong audiences in the analog world like to opt out of search-engine indexes. So the opt-out system has worked well on the Web.
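To make that opt-out signal concrete: the convention at issue is the long-standing robots-exclusion mechanism. (The lines below are a generic illustration of that convention, not language taken from Google's own documentation.) A site owner who wants to keep crawlers away from an entire site can post a two-line robots.txt file at the site's root:

    User-agent: *
    Disallow: /

Or a page author can add a single tag to a page's HTML <head> to keep just that page out of search indexes:

    <meta name="robots" content="noindex, nofollow">

Compliant crawlers, Google's included, read those signals and skip the marked content; absent such a signal, a page is treated as fair game for indexing.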

In the real world outside cyberspace, copyright is opt in. Among the various rights that copyright law grants copyright holders, the exclusive right to copy is at the core. It's why copyright is called copyright. Again, that system has worked fairly well in the real world. Courts, Congress, and practice have generated some important exceptions to the exclusive right to copy. The clearest example is fair use for scholarship, education, commentary, and news. Copyright law also provides for some other specific exemptions, like showing a film in class or making a backup copy of library materials.

So the copyright-infringement suits brought by authors and by publishers against Google Library will present the courts with an interesting dilemma: Should they favor the norms of the Web (opt out) over the norms of the real world (opt in)? Google is clearly asserting the right to do what it has always done in the Web world: copy without permission for the purpose of providing an important commercial service that rides free on others' copyrighted work. The courts will have to decide if that is a bit too presumptuous and disruptive in the real world.

Google is confident that legal rulings about fair use on the Web will support its claim. The strongest case in Google's favor comes, like Google, from the West Coast: Kelly v. Arriba Soft Corporation (decided by the U.S. Court of Appeals for the Ninth Circuit in San Francisco in 2003). In that case, a photographer named Leslie A. Kelly sued a company that had produced a Web index of "thumbnails" (small, distilled versions of larger digital images) that linked to Kelly's copyrighted photographs. The circuit court ruled that the thumbnails were "transformative" (that they changed material into something new), that the index did not harm Kelly's market, and that the benefits of the service outweighed any fundamental right to exclude the copyrighted photographs.

That was a landmark case based on common sense: The photographer put the photos on the Web knowing that the Web is the sort of place where search engines make cache copies and thumbnails when providing search results. And the index is an essential service. It makes the Web work. Google has based its entire business on the confidence the ruling gives it. Without Kelly, search engines might have to seek a license from every Web-page producer on the Internet. The cost of doing that would be high. We would have no search engines.

Meanwhile, over on the East Coast, things have been different. In the heart of some of the staunchest defenders of copyright, the New York publishing industry, the recording industry won an important suit in U.S. District Court for the Southern District of New York in 2000. The case, UMG Recordings, Inc., et al. v. MP3.com, Inc., involved MP3.com, a startup that had offered a "locker" service. Its customers could log into MyMP3.com, insert a compact disc, and the service would read it. Tapping into the company's vast library of CDs, it would place digital copies into users' "lockers," so that they could listen to their collection on any computer connected to the Internet. Users had to verify that they owned a copy of the CD, but what they actually listened to were copies made from the company's discs, not their own. MP3.com thought it had a good case. It had lawfully purchased thousands of CDs. Its customers had lawfully purchased thousands of CDs. No one was "stealing" music. In fact, two copies of each disc had been sold: one to the company and one to the customer. The company was merely offering a virtual service based on a clearly legal activity: making digital backups of a private CD collection.

The recording industry disagreed, and so did the court, ruling that MP3.com had infringed on the exclusive right to copy. Because it was a commercial service, not a private person or educational institution, and because it did not "transform" the music into anything new, it could not claim fair use as a defense. The fact that the MyMP3.com service did not harm the market for CDs did not seem to matter to the court. Its decision was copyright fundamentalism, the norms of the real world trumping the norms of the Web. Significantly for Google, the case involved a company making harmless copies of works that the copyright holders had not intended for the Internet, much as publishers never intended their books for it.

Other important copyright cases in the U.S. Court of Appeals for the Second Circuit in New York have also recently gone in favor of copyright fundamentalism, most notably New York Times Co., Inc., et al. v. Jonathan Tasini et al. In that case, the newspaper had argued that its resale of the work of freelancers (without permission or compensation) to electronic databases contributed to user-friendly full-text services. The circuit court, in 1999, and ultimately the Supreme Court, in 2001, disagreed. It's no coincidence that the authors and publishers filed their suits against Google in the Southern District of New York, knowing that there is a general suspicion of newfangled views of copyright on the East Coast.

So this is my concern: Google has not only, in the words of a University of Pittsburgh law professor, Michael J. Madison, "bet the company" on its Print Library. It has "bet the Internet." If a New York court rules clumsily, indelicately, or too broadly in an attempt to express umbrage at Google's audacity, then the principles of Kelly are in danger. So are future similar initiatives, whether they come from libraries or the private sector.

And because libraries and universities are partners in the effort, a fundamentalist ruling could frighten university counsels when they give advice to faculty members and librarians about what we may all do under the fair use of copyrighted material. Frankly, university counsels are already skittish enough. They often advise against doing things that are clearly legal out of fear and misunderstanding. A bad loss in the Google case could blow a massive chilling effect across all sorts of good ideas.

I share another of my concerns with those librarians who haven't been as supportive of Google as the five repositories that joined its project. It goes beyond the intricacies of copyright and fair use to the fear that Google's power to link files to people will displace the library from our lives. Wayne A. Wiegand, a professor of library studies at Florida State University, uses a phrase to describe his scholarly mission, studying "the library in the life of the user." That means getting beyond the functional ways people employ library services and collections. It means making sense of what a library signifies to a community and the individuals in that community. Libraries are more than resources. They are both places and functions. They are people and institutions, budgets and books, conversations and collections. They are greater than the sum of their books.

The presumption that Google's powers of indexing and access come close to working as a library ignores all that libraries mean to the lives of their users. All the proprietary algorithms in the world are not going to replace them. There was a reason why Franklin, Jefferson, Madison, and others of their generation believed the republic could not survive without libraries. They are embodiments of republican ideals. They pump the blood of a democratic culture: information.

So I worry. We need services like that provided by Google Library. But they should be "Library Library" projects. Libraries should not be relinquishing their core duties to private corporations for the sake of expediency. Whichever side wins in court, we as a culture have lost sight of the ways that human beings, archives, indexes, and institutions interact to generate, preserve, revise, and distribute knowledge. We have become obsessed with seeing everything in the universe as "information" to be linked and ranked. We have focused on quantity and convenience at the expense of the richness and serendipity of the full library experience. We are making a tremendous mistake.

Siva Vaidhyanathan is an assistant professor of culture and communication at New York University. He is the author of Copyrights and Copywrongs: The Rise of Intellectual Property and How It Threatens Creativity (New York University Press, 2001) and The Anarchist in the Library: How the Clash Between Freedom and Control Is Hacking the Real World and Crashing the System (Basic Books, 2004).

Section: The Chronicle Review
Volume 52, Issue 15, Page B7