Harvard University Library Notes / January 2007 / No. 1335
Librarians' Assembly: Clifford Lynch
Clifford Lynch is executive director of the Coalition for Networked Information (CNI), which includes more than 200 member organizations concerned with the use of information technology and networked information to enhance scholarship and intellectual productivity. He holds a PhD in computer science from the University of California at Berkeley and is an adjunct professor at Berkeley's School of Information. Lynch addressed the Librarians' Assembly on Thursday, December 14, 2006.
The following excerpts are drawn, in slightly condensed form, from a transcript of Lynch's wide-ranging Librarians' Assembly talk. Later this winter, the full recording of Lynch's talk will be on deposit in the Harvard University Archives.
We are at a very pivotal time for research libraries, where different universities are going to make different choices about what they want their research libraries to be and, indeed, about how higher education is going to operate.
It is routine now to talk about the massive changes that have taken place in scholarly communication. We've all watched the great migration of scientific journals from print on paper to electrons being used to emulate print on paper: files that you print on demand, rather than paper that you file away in your office. And we've seen the not-insubstantial ramifications that even that simple change has had on everything from libraries' business arrangements, through a shift from purchased subscriptions to license agreements, to the implications for preservation in a world where libraries no longer hold the material whose long-term preservation they're concerned with.
We've seen these developments play out—and talked endlessly about them!—but I would suggest that these are just the early and modest symptoms of a much greater change that is now upon us. What you really need to be watching is what's happening to the practice of scholarship.
The sciences have been largely transformed over the past 20 years or so by a steady infusion of information technology, high-performance computing, advanced networking, visualization, and a reliance on data sets, both as evidence and as communication: Evidence in the sense of observations of the natural world, experimental readouts. Communication in the sense that so much of science now is expressed in very complex computer models, simulations, and visualizations.
These data sets—vast in number, and in some cases absolutely massive in size—are part and parcel of science. Data has become the coin of the realm. The way scientists work together is changing. Significant scientific developments now involve multinational teams scattered across the world, working together often with very little interest in or sensitivity to the boundaries of the institutions that host them—wanting to do annoying things like share data sets across a scientific team which is hosted at 10 different institutions.
In astronomy, they've started building national virtual observatories and then networking them together to make international virtual observatories. Essentially, they're just collecting and organizing the observations out of various telescopes at various wavelengths.
One of the other neat things about astronomy data is that it comes with its own metadata. If you design the piece of equipment right, it pretty much tells you everything you need to know. It tells you what time it is, where you're looking in the sky, what wavelengths you're looking at. What more do you need?
The astronomers are building these enormous aggregate databases of sky surveys. Historically, these databases were the byproducts of specific observations. Now they're building these wholesale surveyors that just sit there moving across the sky, feeding data into databases. Ongoing data collection is becoming an infrastructure activity rather than a byproduct of observing activity.
Most of the data that's moving into these sky surveys is moving in there unscreened by human eyes. Basically, algorithms look at the data, saying, "Is there something interesting in here that we've been programmed to look for?" Maybe moving things or things with variable brightness, or something that wasn't there the last time we looked at this piece of sky. But one of these algorithms has to raise its hand and say, "I think some human had better look at this."
There is so much data coming in that we're relying on computational tools to reduce it, to recognize anomalous cases, and to bring them to our attention.
You could build disciplinary repositories for it on a national or international level. This has already happened in some areas of molecular biology. And in the UK, they've invested substantially in data archives by discipline.
In the US this is harder to do, because, outside of the biomedical area, where we have the National Library of Medicine, and the agricultural area, where we have the National Agriculture Library, there's not a natural government home for this. There's really nobody tasked with this.
The NSF manages the distribution of money to support research, but they seldom do operational things like maintaining data over the long term. In fact, if you look at the history of the NSF, they have a very strong bias against ongoing funding of infrastructure.
The take-away here is that we're going to see this responsibility fall largely to our great research libraries.
How are we going to manage this flood of data at our institutions? Are our libraries going to do this? Or are we going to leave it to scientists working in academic departments and labs? Are we going to set up entirely new support organizations or collaborations within the university?
There are issues here that implicate libraries, but that go well beyond them. Scholarly practice today increasingly integrates high demands for data management and for organizational and curational skills. We don't, as yet, have an educational system in place to produce a pipeline of sufficient numbers of people with those skills—both working scholars themselves and support staff. It's worth reflecting on how basic information technology—which at some level is much less demanding than informatics and data stewardship—has been supported over the past couple of decades in terms of IT support organizations, professional training for scholars, and general knowledge and expertise diffused throughout the population.
We don't have the right organizational structures in place to deliver the needed support to our scholars. Indeed, we don't even have a clear understanding of what level of informatics competence one should expect from a scholar entering his or her chosen discipline today. We do now expect scholars to be able to type, use telephones, do basic sorts of computer things. Some of this is going to become part of the baseline for the next generation of scholars. Some of it is going to come out of
If you've not read it, I would invite you to have a look at the report that was issued in final form over the summer from the American Council of Learned Societies' Commission on Cyberinfrastructure in the Humanities and Social Sciences [http://www.acls.org/cyberinfrastructure—Ed.], which makes the observation that the same kinds of trends that are happening in sciences are playing out in humanities, with a couple of important differences.
First, there's much less uniformity in the humanities and social sciences. Every lab today relies on computers to manage certain kinds of processes. But humanists may only rely on computers for searching the web and word processing. There is still a lot of controversy in certain fields in the humanities about the extent to which digital work represents legitimate and praiseworthy scholarly practice, and a lot of variation in the level of adoption of IT from scholar to scholar.
Second, the evidence base that the humanities deal in is quite different. In the sciences, the evidence base is mostly observational, supplemented by mostly fairly recent papers (and yes, I know there are exceptions like mathematics and biodiversity). But the evidence base that humanists want includes the entire human record. It includes everything published and unpublished in all of our archives, in all of our university libraries, all of our museums. In order to truly build a cyberinfrastructure to support the humanities fully, we'll need to deal with the challenge of producing digital surrogates of pretty much the entire human cultural and intellectual record, a rather sizable undertaking toward which Harvard, among others, has already taken some very important steps with some of the mass digitization programs that it has brought into operation. But I think we need to recognize that in terms of changing scholarly practice the humanities are going to be affected every bit as much as the sciences.
The material that is in museums and archives, as well as libraries, needs to be brought together in a much more coherent way than it has been in the past to support scholarship. You can see the vector here if you look at the growing role of images in all kinds of scholarly discourse in the last couple of decades, as it's become easier to find them, easier to use them, easier to interchange them in scholarly work.
We're going to see things like reconstructions of classical buildings, of religious sites, of ancient cities, perhaps accompanied by simulations of their economies and agricultural systems and the movement of their people within them. I don't know quite how we're going to store these or where we're going to put them. It may be that something like a successor to Second Life will become a giant virtual shelf in which you can park these simulations and let people explore them on an ongoing basis.
In scholarly publishing, even among the commercial players, there are shared and common values about things like the importance of preservation and the maintaining of an archival record.
But there are serious questions about how we're going to open up all the back files of scholarly communication that the academy has given away all the rights to—though I think you can point to progress here that also underscores a respect for shared and common values. But leave that aside.
I want to contrast this relatively good commonality of values within scholarly publishing with a very different mindset that is characteristic of the broader sphere of consumer culture. In this sphere, it's seen as very unfortunate that the Constitution establishes time limits on copyright—meaning that I can only have it until one day before hell freezes over, which is basically the legislative implementation some of the consumer-market commercial interests have pursued. Never mind the notion that copyright is a bargain with society.
The idea is to control the availability of copyrighted works, to withdraw them from the market, to build up demand and then sell them again. Does this sound like the kind of thing that is done, for example, with famous Disney films, where basically there is no commitment to film as part of a shared cultural record? Indeed some material is pulled off the market permanently because it portrays minorities, for example, in ways that are now embarrassing.
This is the challenge that we face in dealing with all of the post-1923 stuff, and particularly with the materials that really weren't designed with a scholarly audience in mind, but are important subjects for scholarly inquiry. These include film, mass-market print materials, television broadcasts, all of this kind of thing. It seems to me that, at root, this is a public policy problem and that part of dealing with the problem is making the case to the public that there is a shared cultural record that we all have a stake in.
For all we talk, in libraries and in higher education, about the risks of collaboration, it is a relatively low-risk operation. I'm collaborating with him, and if the collaboration doesn't work out, we'll move on.
I think that research libraries are moving toward greater interdependency, and interdependency is quite different from collaboration. The stakes are much higher.
Interdependency is when, if the people I'm working with screw up, I'm in trouble too. It's when you make decisions to split the load with another institution that you don't control, and if they mess up, you're the person responsible for explaining to the Harvard community why you just oversaw the decline and fall of one of the world's great collections of English literature by choosing a bad partner to get interdependent with.
In our collections, the degree of replication that characterized the print world is not going to be sustainable and something else is going to emerge. We're going to see collections that are much more unique because of the move towards digital data that supports scholarship.
So the collection of digital data which is held at Harvard—coming largely from the Harvard faculties' scholarly activities—is going to look very different from the set held down the road at MIT. You will have scholars who will need material from MIT. MIT will have scholars who need material from you.
But the collections are going to be much more unique, and that's going to underscore that world of interdependence, rather than simple collaboration.