Preliminary Inventory of Digital Collections

Incomplete thoughts on digital libraries.

Library Catalog Pages Ranking in Search Engines

Recently there was a thread on the code4lib list about local catalog records showing up in the results of search engines like Google, Bing, and Yahoo!. The anecdotal evidence is that Google is actively crawling and indexing library catalogs like Johns Hopkins’.

Some of the discussion has revolved around how useful this local catalog data is to folks coming from search engines. How many of these users are satisfied with coming to a local library catalog? I think many people will be unsatisfied because they have found something interesting that they cannot access. Much of what academic libraries have is only available to their own students, faculty, and staff or to other institutions through inter-library loan. This situation may be improving for users.

Digital Collections, Crawling, and Aggregating Content

Code4Lib 2012 Lightning Talk That Wasn’t

Lightning talks filled up fast this year at Code4Lib before I had a chance to sign up, which is probably for the best since I had already had the opportunity to give a full length talk. Here is the lightning talk that I had prepared with each slide being followed by my draft speaker notes.

Hi.

Digital Libraries have aspired to create the one big pot of digital library stuff to hold everything. For the most part we’ve used niche protocols, dumbed-down metadata, cumbersome workflows, and lots of time massaging metadata in an effort to achieve these big aggregations.

I want to talk about what I think is a better way to do aggregations. There’s a lot more you can build with what I’m talking about, but I want to set aggregations in my sights.

Common Crawl, Web Data Commons, and Microdata

The other day I discovered the Web Data Commons, which is building on top of the Common Crawl to extract Microformat, Microdata, and RDFa data and make it available for free download. This means that there is starting to be free structured data from a big portion of the Web available for anyone to play with at very low cost. Common Crawl takes care of the crawling and then Web Data Commons does the data extraction. This opens up new possibilities for services, specialized search, and aggregations of content. Big web data is being opened up for small startups and individuals.
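
To give a rough feel for what extracting Microdata from a crawled page involves, here is a minimal sketch in Ruby. It is not how Web Data Commons does its extraction; the method name microdata_item_types and the sample HTML are made up for illustration, and a regular expression stands in for a real HTML parser only to keep the example dependency-free.

```ruby
# Rough sketch: list the Microdata item types declared on a page.
# Assumption: a regex is good enough for illustration; a real extractor
# would use an HTML parser and follow the Microdata specification.
def microdata_item_types(html)
  # Find every opening tag that carries an itemscope attribute.
  html.scan(/<[^>]*\bitemscope\b[^>]*>/i).flat_map do |tag|
    # Pull out the itemtype value, which may hold several space-separated URLs.
    match = tag.match(/\bitemtype\s*=\s*["']([^"']+)["']/i)
    match ? match[1].split : []
  end.uniq
end

# Hypothetical page fragment describing a book.
sample_html = <<~HTML
  <div itemscope itemtype="http://schema.org/Book">
    <span itemprop="name">An Example Title</span>
  </div>
HTML

puts microdata_item_types(sample_html)
```

An aggregation could use a list like this to decide which crawled pages describe items worth keeping.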

Listing Published Octopress Posts

In converting my blog from Wordpress to Octopress, I had a lot of old posts I was leaving unpublished. I wanted to keep them around but don’t see the need to republish them right now. I also want to be able to create a lot of drafts of ideas and leave them unpublished. Then whenever I’m ready to work on a post, they’re all right there in my repository already.

Problem is that I find it hard to read through the filenames of posts and try to remember which have been published and which have not. So in order to see the publication status of all my posts, I created this rake task. I just dropped this at the end of my Rakefile, and now I run rake listpub.
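
A minimal sketch of how such a listing could work, not necessarily the task as written. It assumes posts live in source/_posts, begin with YAML front matter between "---" lines, and count as unpublished only when the front matter sets published: false. The method names are hypothetical.

```ruby
# Hypothetical helper: returns [filename, published?] pairs for every post.
# Assumption: a post is unpublished only if its front matter contains a
# line reading "published: false"; everything else counts as published.
def post_publication_status(posts_dir = "source/_posts")
  pattern = File.join(posts_dir, "*.{markdown,md,textile,html}")
  Dir.glob(pattern).sort.map do |path|
    # Grab the text between the opening and closing "---" lines, if any.
    front_matter = File.read(path)[/\A---\s*\n(.*?)\n---\s*$/m, 1].to_s
    published = front_matter !~ /^published:\s*false\s*$/
    [File.basename(path), published]
  end
end

# Prints one line per post with its status in front of the filename.
def print_publication_status(posts_dir = "source/_posts")
  post_publication_status(posts_dir).each do |name, published|
    puts "#{published ? 'published  ' : 'unpublished'}  #{name}"
  end
end

# In a Rakefile this could be wired up as:
#
#   task :listpub do
#     print_publication_status
#   end
```

Keeping the file scanning in a plain method means it can be tried from irb before it goes anywhere near the Rakefile.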

Solving the Item-Level Problem on the Web

Digital Collections Services Through Using Web Crawls

Digital libraries have attempted to provide various aggregations of their content. Usually the participants in the aggregation already make that content accessible on the open web. The approaches to aggregating content that have been taken in the past have relied on hosting institutions to provide their metadata in new ways and support additional infrastructure and workflows. An alternative approach to creating aggregations is to perform targeted crawls and reuse the content on the pages. The problem with the crawler approach is identifying items in the collection as opposed to other pages. This document presents a few possibilities for how to identify items.

DPLA Strawman Technical Proposal

Collection Achievements and Profiles System and DPLA Crawler Services

This is a quick strawman proposal for what the Digital Public Library of America should build as the first parts of a generative platform. This document is not in a finished state, but just as the DPLA has been good at opening up its process with the Beta Sprint, I wanted to release this document early even in this unfinished state.

I attended the December DPLA Technical Workshop in Cambridge and was inspired by the discussion there. I hope that this document clarifies some of the approaches that I and others at that meeting were advocating. I shared this with the DPLA Interim Development Team a couple of weeks ago, and now that development has started I thought I would share it here as well.

While the first iteration of the DPLA platform may be set and on its way, I still wanted to share one vision of what a generative platform for aggregations might involve. The main point is to get the DPLA to the aggregations they likely need to present at some point. This document leaves aside the question of whether creating aggregations is a good idea. The desire to create aggregations is a big, often unquestioned, assumption of big digital library projects. I think what is set out below is one simple architecture for accomplishing aggregations in a very Web-centered way while potentially having more reuse outside of just aggregations.

Ruby and Rails Using RVM on a Fresh and Updated Ubuntu 11.10 Install

Here are the steps I took to install Ruby and Rails on a fresh and updated Ubuntu 11.10 install. The two places where there were hiccups involved having to install openssl through rvm and updating to a more recent version of rubygems. Some steps are thrown in there just to show how rvm and gem provide some information. I used a virtualbox image to allow me to have a clean install.