
How analogue film will be the future of digital history

The pandemic meant GitHub had to wait until July to store a 21TB snapshot of its code repositories, on special film that can last a thousand years

This article can also be found in the Premium Editorial Download: Computer Weekly: Freezing digital history in the Arctic Circle

Earlier in July, a new initiative to preserve historical open source code began, with snapshots of the code that underpins Facebook and Netflix, among others, archived for posterity. The open source code of these and other GitHub repositories was successfully deposited in the GitHub Arctic Code Vault. These snapshots aim to preserve the code for future generations, historians and scientists.

GitHub is entrusting this valuable archive to good old-fashioned film, which is not dissimilar to the reels that people used to put into cameras before digital camera manufacturers came along claiming SD cards were better.

The GitHub Arctic Code Vault is a data repository preserved in the Arctic World Archive (AWA). This data repository is located in a decommissioned coal mine in the Svalbard archipelago, closer to the North Pole than the Arctic Circle. The archive is stored 250 meters deep in the permafrost of an Arctic mountain. GitHub originally captured a snapshot of every active public repository on 2 February 2020.

The archive holds 6,000 of GitHub's most significant repositories in perpetuity, capturing the evolution of technology and software. This collection includes the source code for the Linux and Android operating systems; the programming languages Python, Ruby, and Rust; web platforms Node, V8, React, and Angular; cryptocurrencies Bitcoin and Ethereum; AI tools TensorFlow and FastAI; and many more.

Describing why it is important to maintain such a code archive, Thomas Dohmke, vice-president of special projects at GitHub, says: “Over the past 20 years, open source software has dramatically changed our lives.” For instance, the German coronavirus track and trace app and apps for finding the status of a flight or booking a car all rely on open source code.

“Moving forward, there will be no major invention that doesn’t rely on open source software,” he says. For instance, the code that Katie Bouman and the team behind the Event Horizon Telescope used to capture the first ever picture of a black hole is based on open source software. “Some 90% of all software is dependent on open source software,” says Dohmke. “No one wants to reinvent the wheel. Developers pull in libraries from GitHub.”

From a purely practical perspective, the dependency on open source code in modern software development means developers may find that a code repository their application depends on has been removed by its maintainer. “Stuff gets lost because hard disk drives fail, or the inventor intentionally deletes the repository when it became a burden.” He says this happened recently when the inventor of a JavaScript library decided to delete it. Its removal broke software that depended on it.


“We know knowledge gets lost,” says Dohmke. “For instance, you can’t find a recipe for Roman concrete or how they built them. The original plans for the Saturn V rocket were lost.” Today, this is happening as developers strive to invent new things, which means early versions of products are not only superseded, but also forgotten. “We didn’t care about the early Amazon pages or the first blogs. Their creators have moved on.”

From a historical perspective, he adds: “The way we do software development may become irrelevant.” Without an archive, the understanding of how software development was done in the early 21st century may be lost forever.

Dohmke says the team at GitHub has put together a manual describing software development practices and how developers collaborate. Such a manual may become more important as coding becomes more automated with the advent of AI algorithms such as GPT-3, which shows that an AI can be taught how to write software.

Due to the global pandemic, the original snapshot of GitHub could not be flown to the Arctic World Archive. Instead, GitHub worked with Piql to write 21TB of repository data to 186 reels of piqlFilm.

According to Piql, film is a photosensitive, chemically stable and secure medium with proven longevity of hundreds of years. Film cannot be altered, and once the data is written, it cannot be edited. The data is stored offline and will not be affected in case of an electricity shortage or if exposed to electromagnetic pulses.

The code was successfully deposited in the Arctic Code Vault on 8 July 2020.

Storing for a thousand years

GitHub has worked with Piql, which has developed a way to archive data built on principles of open source and future access. The medium, called piqlFilm (digital photosensitive archival film), provides authenticity measures and supplier independence, and does not require data migration. Data stored on piqlFilm can be read back both by machines and by the human eye. The manufacturer estimates that archived data will remain stable for a thousand years.

GitHub is also working with Microsoft’s Project Silica, which builds on research from the University of Southampton’s Optoelectronics Research Centre. This makes use of recent discoveries in ultrafast laser optics to store data in quartz glass by using femtosecond lasers.
