The Internet Archive Keeps Book-Scanning Free

By Dave Bullock 03.19.08 12:00 AM

SAN FRANCISCO -- While Google has made headlines over the last two years for scanning thousands of copyrighted works for its Book Search project, the Internet Archive is quietly digitizing around 1,000 public domain titles every day.

Photo: Dave Bullock/Wired.com

The book to be scanned sits in front of a technician underneath a V-shaped glass platter. Two opposing cameras angled at each page take photos of the book. On screen is the multipage view that the operator uses to verify the quality of the scans and the book's pagination.

For those picturing an efficient, automated process involving robotic arms and high-tech scanners, the scanning at the University of California's Northern Regional Library Facility is relatively primitive. With monastic diligence, workers sit in book-scanning stations and manually turn pages all day long.

The process is labor-intensive, but surprisingly efficient: The text collection on archive.org is the world's largest online collection of free books, with nearly 350,000 titles and growing. And though there are high-end automatic book scanners on the market, even a giant like Google is reportedly using a similar manual process because old books vary so widely in size and are so delicate. It's still unclear whether the courts will allow copyrighted books scanned by Google to stay online, but the titles scanned at the Internet Archive will always be free and available. You can even order copies to be printed on demand and shipped to your home, paying only for production costs. Take the Wired.com tour of this grass-roots effort to liberate books from the confines of scarcity.

Scanning books at the Internet Archive's custom-built Scribe Station is a manual process. Although automated page-turning machines exist, the Internet Archive has chosen to go the manual route because of the large number of extremely delicate, rare and valuable manuscripts it scans.

The book scanner uses off-the-shelf Canon hardware, including the EOS-1Ds Mark II and the EF 100mm f/2.8 macro lens. The newer systems use the EOS 5D instead of the 1Ds, which saves money in the short term. But, according to Internet Archive staff, the 5D fails much more frequently, resulting in increased maintenance costs.

At the start of every shift, the operator calibrates the color levels using a pair of color-calibration cards. When the scanning project first started, the Internet Archive attempted to color-correct the scanned pages to white, but later decided to capture and store them as they are, in their various aged shades of yellow. Preserving the oxidized tints makes the virtual viewing of old books more lifelike.

Soon, you'll be able to print books found at the Internet Archive with this self-contained, fully automated book machine [http://www.ondemandbooks.com/]. Send it a PDF and it will print and bind it into a complete book. The process takes about 10 minutes, depending on the size of the book, and costs $10 plus a penny per page. If this service gains popularity, it could throw a wrench into the assumption that digitizing information will be the death of physical books. Suddenly, thousands of out-of-print titles could be back in publication.
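
As a rough worked example of that pricing: assuming the quoted figures ($10 flat plus one cent per page) apply uniformly, and ignoring shipping or tax, which are not described here, a 300-page book would come to $13. A minimal Python sketch of the arithmetic follows; the function and constant names are illustrative only and do not belong to any real ordering service.

```python
# Hypothetical sketch of the quoted print-on-demand pricing.
# Amounts are kept in whole cents to avoid floating-point rounding.
BASE_FEE_CENTS = 1000  # the flat $10 charge quoted above
PER_PAGE_CENTS = 1     # "a penny per page"


def print_cost_cents(pages: int) -> int:
    """Return the estimated production cost, in cents, for a book of `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return BASE_FEE_CENTS + PER_PAGE_CENTS * pages


def format_dollars(cents: int) -> str:
    """Format a non-negative amount in cents as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"


if __name__ == "__main__":
    for pages in (100, 300, 550):
        print(pages, "pages:", format_dollars(print_cost_cents(pages)))
```

Under these assumptions the flat fee dominates: even a 550-page volume costs only $5.50 more than the base charge.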

Inside the book machine, the laser-printed pages are trimmed (top left), then slathered with adhesive (top right) on what will become the book's spine. The cover is then wrapped around the book (bottom left). After another trim, out pops a custom-printed book ready for reading (bottom right).

Instead of sitting in stacks, these archival volumes are now contained in racks of 160-terabyte boxes. Multiple redundant copies of the archive's data are spread across servers all over the world.

At the turn of the last century, foldout illustrations (top right) were all the rage. These foldouts are cool to look at, but their size presents a problem for scanning. When an operator comes across a foldout in a book, they scan the closed version and note the foldout in the Scribe software. Later, the foldout is captured on a separate scanner consisting of a camera mounted on a copy stand (top left).

Before entering the world of public-domain-promoting nonprofits, Robert Miller spent the last few decades at the top levels of various brick-and-mortar tech corporations. He is currently the director of books at the Internet Archive, and it's his vision that drives the archive's quest to digitize all public-domain knowledge and publish it online.