SciTechSociety: cataloging

Tuesday, June 27, 2017

Forward to the Past

What will academic libraries look like in 2050?

In the early days of the web, librarians had to fight back against the notion that libraries would soon be obsolete. They had solid arguments. Information literacy would become more important. Archiving and managing information would become more difficult. In fact, academic libraries saw an opportunity to increase their role on campus. This opportunity did not materialize. Libraries remain stuck in a horseless-carriage era. They added an IT department. They made digital copies of existing paper services. They continued their existing business relationships with publishers and various intermediaries. They ignored the lessons of the web-connected knowledge economy. Thriving organizations create virtuous cycles of abundance by solving hard problems: better solutions, more users, more revenue, more content, more expertise, and better solutions.

Academic libraries seem incapable of escaping commodity-service purgatory, even when tackling their most ambitious projects. They are eager to manage data archives, but the paper-archive model produces an undifferentiated commodity preservation service. A more appropriate model would be the US National Virtual Astronomical Observatory, where preservation is a happy side effect of extracting maximum research out of existing data. Data archives should be centers of excellence. They focus on a specific field. They are operated by researchers who keep abreast of the latest developments, who adapt data sets to evolving best practices, who make data sets interoperable, who search for inconsistencies between different studies, who detect, flag, and correct errors, and who develop increasingly sophisticated services.

No university can take a center-of-excellence approach to data archiving for every field in which it is active. No archive serving just one university can grow to a sufficiently large scale for excellence. Each field has different needs. How many centers does the field need? How should centers divide the work? What are their long-term missions? Who should manage them? Where are the sustainable sources for funding? Libraries cannot answer these questions. Only researchers have the required expertise and the appropriate academic, professional, and governmental organizations for the decision-making process.

Looking back over the past twenty years, all development of digital library services has been limited by the institutional nature of academic libraries, which receive limited funding to provide limited information and limited services to a limited community. As a consequence, every major component of the digital library is flawed, and none has the foundation to rise to excellence.

General-purpose institutional repositories did not live up to their promise. [Let IR RIP] The center-of-excellence approach of disciplinary repositories, like arXiv or PubMed, performed better in spite of less stable funding. Geographical distance between repository managers and scholars did not matter. Disciplinary proximity did.

Once upon a time, the catalog was the search engine. Today, it tells whether a printed item is checked out and/or where it is shelved. It is useless for digital information. It is often not even a good option to find information about print material. The catalog, bloated into an integrated library system, wastes resources that should be redirected towards innovation.

Libraries provide access to their site licenses through journal databases, OpenURL servers, and proxy servers. They pay for this expensive system so publishers can perpetuate a business model that eliminates competition, is rife with conflict of interest, and can impose almost unlimited price increases. Scholars should be able to subscribe to personal libraries as they do for their infotainment. [Hitler, Mother Teresa, and Coke] [Where the Puck won't be] [Annealing the Library] [What if Libraries were the Problem?]

In the paper era, the interlibrary-loan department was the gateway to the world's information. Today, it is mostly a buying agent for costly pay-per-view access to papers not covered by site licenses. Personal libraries would eliminate these requests. Digitization and open access can eliminate requests for out-of-copyright material.

Why is there no scholarly app store, where students and faculty can build their own libraries? By replacing site licenses with app-store subsidies, universities would create a competitive marketplace for subscription journals, open-access journals, experimental publishing platforms, and other scholarly services. A library making an institutional decision must be responsible and safe. One scholar deciding where to publish a paper, whether to cancel a journal, or which citation database to use can take a risk with minimal consequence. This new dynamic would kickstart innovation. [Creative Destruction by Social Network]

Libraries seem safe from disruption for now. There are no senior academics sufficiently masochistic to advocate this kind of change. There are none who are powerful enough to implement it. However, libraries that have become middlemen for outsourced mediocre information services are losing advocates within the upper echelons of academic administrations every day. The costs of site licenses, author page charges, and obsolete services are effectively cutting the innovation budget. Unable to attract or retain innovators, stagnating libraries will just muddle through while digital services bleed out. When some services fall apart, others become collateral damage. The print collection will shrink until it is a paper archive of rare and special items locked in a vault.

Postscript: I intended to write about transforming libraries into centers of excellence. This fell apart in the writing. I hesitated. I rewrote. I reconsidered. I started over again.
If I am right, libraries are on the wrong track, and there is no better track. Libraries cannot possibly remain relevant by replicating the same digital services on every campus. There is a legitimate need for advanced information services supported by centers of excellence. However, it is easier to build new centers from scratch than to transform libraries tied up in institutional straitjackets.
Perhaps paper-era managers moved too slowly and missed the opportunity that seemed so obvious twenty years ago. Perhaps that opportunity was just a mirage. Whatever the reason, rank-and-file library staff will be the unwitting victims.
Perhaps I am wrong. Perhaps academic libraries will carve out a meaningful digital future. If they do, it will be by taking big risks. The conventional options have been exhausted.

Wednesday, October 1, 2014

The Metadata Bubble

In an ideal world, scholars deposit their papers in an Open Access repository, because they know it will advance their research, support their students, and promote a knowledge-based society. A few disciplinary repositories, like arXiv, have shown that it is possible to close the virtuous cycle where scholars reinforce each other's Open Access habits. In these communities, no authority is needed to compel participation.

Institutional repositories have yet to build similar broad-based enthusiastic constituencies. Yet, many Open Access advocates believe that the decentralized approach of institutional repositories creates a more scalable system with a higher probability of long-term survival. The campaign to enact institutional deposit mandates hopes to jump-start an Open Access virtuous cycle for all scholarly disciplines and all institutions. The risk of such a campaign is that it may backfire if scholars experience Open Access as an obligation with few benefits. For long-term success, most scholars must perceive their compelled participation in Open Access as a positive experience.

It is, therefore, crucial that repositories become essential scholarly resources, not dark archives to be opened only in case of emergency. The Open Archives Initiative (OAI) repository design provided what was thought to be the necessary architecture. Unfortunately, we are far from realizing its anticipated potential. The Protocol for Metadata Harvesting (OAI-PMH) allows service providers to harvest any metadata in any format, but most repositories provide only minimal Dublin Core metadata, a format in which most fields are optional and several are ambiguous. Extremely few repositories enable Object Reuse and Exchange (OAI-ORE), which allows for complex inter-repository services through the exchange of multimedia objects, not just metadata about them. As a result, OAI-enabled services are largely limited to the most elementary kind of searches, and even these often deliver unsatisfactory results, like metadata-only placeholder records for works restricted by copyright or other considerations.
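To make the limitation concrete, here is a minimal sketch of what a harvester typically gets to work with. It builds a standard OAI-PMH ListRecords request for the oai_dc format and parses a response with the Python standard library. The base URL, the sample record, and the helper names (list_records_url, parse_records, missing_fields) are hypothetical and only for illustration; no real repository is queried.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# The fifteen elements of simple Dublin Core. All are optional and repeatable.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Hypothetical response, shaped like a typical sparse oai_dc record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.edu:1234</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Hypothetical Preprint</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:date>2014</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (base_url is an assumed endpoint)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Return one dict per harvested record: its identifier and DC fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        fields = {}
        for name in DC_ELEMENTS:
            values = [
                el.text.strip()
                for el in rec.iterfind(f".//dc:{name}", NS)
                if el.text and el.text.strip()
            ]
            if values:
                fields[name] = values
        records.append({"identifier": identifier, "fields": fields})
    return records


def missing_fields(record):
    """List the Dublin Core elements a record does not provide."""
    return [name for name in DC_ELEMENTS if name not in record["fields"]]


if __name__ == "__main__":
    print(list_records_url("https://repository.example.edu/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record["identifier"], "lacks:", ", ".join(missing_fields(record)))
```

A record like this one is valid, yet it carries no subject, description, rights statement, or link to the full text. A service provider can do little with it beyond a title or author search.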

In a few years, we will entrust our life and limb to self-driving cars. Their programs have just milliseconds to compute critical decisions based on information that is imprecise, approximate, incomplete, and inconsistent: all maps are outdated by the time they are produced, GPS signals may disappear, radar and/or lidar signatures are ambiguous, and video or images provide obstructed views in constantly changing environments. When we can extract so much actionable information from such "dirty" data, it seems quaint to obsess about metadata.

Databases automatically record user interactions. Users fill out forms and effectively crowdsource metadata. Expert systems can extract, from any document in any format and in any language, author information, citations, keywords, DNA sequences, chemical formulas, mathematical equations, etc. Other expert systems have growing capabilities to analyze sound, image, and video. Technology is evaporating the pool of problems that require human intervention at the transaction level. The opportunities for human metadata experts to add value are disappearing fast.

The metadata approach is obsolete for an even more fundamental reason. Metadata are the digital extension of a catalog-centered paper-based information system. In this kind of system, today's experts organize today's information so tomorrow's users may solve tomorrow's problems efficiently. This worked well when technology changed slowly, when experts could predict who the future users would be, what kind of problems they would like to solve, and what kind of tools they would have at their disposal. These conditions no longer apply.

When digital storage is cheap, why implement expensive selection processes for an archive? When search technology does not care whether information is excruciatingly organized or piled in a heap, why spend countless hours organizing and curating content? Why agonize over potential future problems with unreadable file formats? Preserve all the information about current software and standards, and start developing the expert systems to unscramble any historical format. Think of any information-management task. How reasonable is the proposition that this task will require direct human intervention in two years? In five years? In ten years?

For content, more is more. We must acquire as much content as possible, and store it safely.

For content administration, less is more. Expert systems give us the freedom to do the bare minimum and to make a mess of it. While we must make content useful and enable as many services as possible, it is no longer feasible to accomplish that by designing systems for an anticipated future. Instead, we must create the conditions that attract developers of expert systems. This is remarkably simple: Make the full text and all data available with no strings attached.

Real Open Access.