A Curious Case of Disregarded Robots.txt

The Internet Archive recently announced an apparent change of policy concerning the collection of web sites for their long-term preservation effort.

Before this announcement, it was commonly believed that you could ask the Internet Archive not to make copies of a site by adding a statement to the site’s robots.txt file, and that this request would be honored.

The announcement, posted April 17, 2017, reads in part:

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

However, as I had already noted on Stack Exchange in March 2017, robots.txt had not been fully honored for at least 10 years:

I just did a quick test, commenting out the ia_archiver Disallow entry for a site that had it for at least the past 10 years. Then I looked the site up on archive.org/web, and it showed grabs it had collected in 2007, 2008, 2009, 2011, 2012, 2013, 2014, 2015, 2016 and 2017! This means that Archive.org never strictly honored what others thought to be a “do not archive” statement during these years; it was merely not exposing the archived copies.

The web site for which I sacrificed the robots.txt continuity was 0xAA.org (an event we used to organize), and here is a screenshot taken during the test:

Archive.org's overview of 0xAA.org saves on March 29, 2017

The original robots.txt file was reinstated after the test, again leading to the familiar “Page cannot be displayed due to robots.txt.” message. The overview also shows captures from before 2007, and the copies saved by the Internet Archive show that the exclusion rule was already present in 2003. However, I did not have the backups at hand to double-check the change history of that older period.

When I originally found this out, I thought that it was interesting to know, but not necessarily newsworthy. However, the blog post that followed seemed to express an official change in direction, which gave me second thoughts. While robots.txt is not an official, compulsory declaration of permission or non-permission, neither is the email takedown request mechanism which is proposed as an alternative. I believe in the importance of asking before doing, and it felt like robots.txt was a good element to maintain a balance of acceptance. If the Internet Archive is already in a legal gray area with its saving and making available of copyrighted works, could things get worse once it stops respecting robots.txt?

 

I am a big admirer of the Internet Archive. I can’t imagine the loss of the early web of the 1990s, which they have worked so carefully to preserve. I have visited them several times, I have friends working there, and I would entrust them with some of our own works if something happened to us.

Having said that, does the public not deserve full disclosure and transparency, rather than what may be seen as a careful exercise in obfuscation? “A few months” (limited to “U.S. government and military web sites”) and “10 years” (on all sites?) are not the same thing. We are reassured that this new data harvesting “has not caused problems”. But what about the “right to be forgotten”?

Individual users can already collect screen grabs (like I did), save pages, or print them. But we have learned that our traditional rules don’t always scale to what becomes possible in a massively automated new world order.

These web grabs, like truth itself, may be helpful, or haunting. What if a version containing a tragic error were to be preserved against the will of the publisher? What about our juvenile mistakes? How long until somebody requests these “few months” (10+ years) for court use with a simple subpoena?

I do not oppose preserving the public web for posterity, even against the will of the original content publishers. I cited some more difficult test cases before, in what I find a fascinating voyage between “free will” and “encrypted mind”. However, I am concerned about making that material available earlier, in a way that goes against the free choice of individuals who may have something to say about that content, and about this not being disclosed with the transparency and debate it probably deserves.

Some thoughts on how this could be improved:

  • Any organization like the Internet Archive should enjoy the privileges and responsibilities that come with Library status, including special powers to archive works while they are still protected by copyright, as well as being protected under laws which would otherwise prohibit the circumvention of access-control measures. This could include not just web content, but also software (including copy-protected games) and other digital content. I am aware of precedents, e.g. National Libraries in some states of the former Yugoslavia, where it became necessary for each to individually preserve the works of a fragmenting country.
  • Robots.txt and/or the <meta> tag could be extended to separately express consent to long-term preservation and consent to dissemination of cached versions during the copyright term (or another shorter period, which could be specified). Adhering to this might not be a universal requirement, but at least the original intention could be taken into account later.
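
To make the second suggestion above more concrete, here is a minimal sketch of how such declarations could be expressed and read. The directive names (Archive-preserve, Archive-display-after) and the date format are invented purely for illustration; they are not part of the robots.txt convention or of any standard, and the parsing is deliberately simplistic.

```python
# Hypothetical robots.txt extension expressing archival intent.
# The directive names are invented for illustration only.

SAMPLE_ROBOTS_TXT = """
User-agent: ia_archiver
Archive-preserve: yes              # consent to long-term preservation
Archive-display-after: 2045-01-01  # no public display of cached copies before this date
"""

def parse_archive_intent(robots_txt: str) -> dict:
    """Collect the hypothetical archival directives into a dictionary."""
    intent = {}
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and surrounding whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() in ("archive-preserve", "archive-display-after"):
            intent[key.lower()] = value
    return intent

print(parse_archive_intent(SAMPLE_ROBOTS_TXT))
# {'archive-preserve': 'yes', 'archive-display-after': '2045-01-01'}
```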

Optimal Conversion of DSC-F1 PMP to JPEG

If, like myself, you have several thousand pictures taken with a Sony DSC-F1 camera (a 1997 model), you are probably looking for a good way to preserve them for the future with no loss of information. The DSC-F1 stores its pictures in Sony’s proprietary PMP format, which is essentially JPEG with a custom 124-byte header. The header contains information such as the date taken, picture orientation, shutter and aperture details, and so on. Nowadays such camera-specific fields are stored as EXIF/DCF metadata embedded directly in the JPEG file.
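
As a starting point, the sketch below recovers the plain JPEG stream from a PMP file, relying only on the description above (a fixed 124-byte header in front of ordinary JPEG data). It does not attempt to decode any header fields, since their exact layout is not reproduced on this page, and the file names in the example are placeholders.

```python
# Minimal sketch: extract the JPEG portion of a Sony PMP file, assuming the
# format is a fixed 124-byte header followed by standard JPEG data.

PMP_HEADER_SIZE = 124
JPEG_SOI = b"\xff\xd8"  # JPEG "start of image" marker

def extract_jpeg(pmp_path: str, jpeg_path: str) -> bytes:
    """Write the JPEG portion of a PMP file to jpeg_path; return the raw header bytes."""
    with open(pmp_path, "rb") as f:
        header = f.read(PMP_HEADER_SIZE)
        jpeg_data = f.read()
    if not jpeg_data.startswith(JPEG_SOI):
        raise ValueError(f"{pmp_path}: no JPEG marker found after the 124-byte header")
    with open(jpeg_path, "wb") as f:
        f.write(jpeg_data)
    return header  # kept for a later header-to-EXIF conversion step

# Example with placeholder file names:
# header = extract_jpeg("PSN00001.PMP", "PSN00001.JPG")
```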

I would like to convert all my PMP files to JPEG files (with EXIF metadata), because that’s the format currently accepted universally, both by operating systems and by album applications. Whatever new format emerges later (e.g. JPEG 2000 with EXIF-equivalent metadata), I am quite confident that a similar conversion from JPEG will be supported.

I couldn’t find a piece of software able to do all of the following (a rough sketch of some of these steps follows the list):

  • Conversion of .pmp to .jpeg file(s)
  • Ability to convert selection of files or entire directories
  • Lossless “conversion” of JPEG portion
  • Conversion of all PMP header data to EXIF metadata
  • Option to delete original file(s) after successful conversion
  • Option to perform lossless rotation of JPEG image to reflect orientation indicated in PMP header (resulting JPEG oriented as shown by Sony Digital Still Camera Album Utility)
  • Option to apply original PMP file date to JPEG file
  • Ideally, ability to perform reverse conversion (from JPEG with EXIF to PMP), which would simplify comparisons and integrity checks
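
As a rough sketch of how some of the points above could fit together (conversion of an entire directory, applying the original PMP file date, optional deletion of the source files), here is what a batch pass might look like. The PMP-header-to-EXIF step is left as a stub, since the header layout is not documented on this page, and the directory path in the example is purely illustrative.

```python
# Sketch of a batch PMP-to-JPEG pass over one directory: strip the 124-byte
# header, carry over the original file timestamps, optionally delete sources.

import os

PMP_HEADER_SIZE = 124

def convert_directory(src_dir: str, delete_originals: bool = False) -> None:
    for name in os.listdir(src_dir):
        if not name.lower().endswith(".pmp"):
            continue
        pmp_path = os.path.join(src_dir, name)
        jpeg_path = os.path.splitext(pmp_path)[0] + ".jpg"

        with open(pmp_path, "rb") as f:
            f.seek(PMP_HEADER_SIZE)      # skip the proprietary header
            jpeg_data = f.read()
        with open(jpeg_path, "wb") as f:
            f.write(jpeg_data)

        # Apply the original PMP file date to the new JPEG file.
        stat = os.stat(pmp_path)
        os.utime(jpeg_path, (stat.st_atime, stat.st_mtime))

        # TODO: decode the PMP header fields and write them as EXIF metadata.

        if delete_originals:
            os.remove(pmp_path)

# convert_directory("D:/DSC-F1", delete_originals=False)  # illustrative path
```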

Tempest Solutions offers a free tool, Pump, which can perform a lossless conversion, but it does not support writing the PMP header metadata as EXIF. (Special thanks to Chris Klingebiel of Tempest Solutions for sharing information about PMP file format details.)

I already use ACDSee 7 to automatically perform lossless rotations of JPEG images, based on existing EXIF rotation attributes, so the requirement to perform the rotation in the conversion tool is not really important. However, this adds to the importance of properly converting PMP rotation attributes to EXIF rotation information.
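
For the EXIF side, one possible approach (my illustration here, not something Pump or ACDSee does) is a small library such as piexif, which can add an Orientation tag and an original date to an existing JPEG. The sketch below assumes the orientation code and date have already been decoded from the PMP header; the values shown are placeholders.

```python
# Sketch: write an EXIF Orientation tag and original date into a JPEG using
# the third-party piexif library. The values stand in for data that would be
# decoded from the PMP header.

import piexif

def write_basic_exif(jpeg_path: str, orientation: int, date_taken: str) -> None:
    """orientation: EXIF code (1 = normal, 6 = rotate 90° clockwise to view);
    date_taken: "YYYY:MM:DD HH:MM:SS", the format EXIF expects."""
    exif_dict = {
        "0th": {piexif.ImageIFD.Orientation: orientation},
        "Exif": {piexif.ExifIFD.DateTimeOriginal: date_taken},
        "GPS": {}, "1st": {}, "thumbnail": None,
    }
    piexif.insert(piexif.dump(exif_dict), jpeg_path)  # rewrites the file in place

# write_basic_exif("PSN00001.JPG", orientation=6, date_taken="1997:06:01 12:00:00")
```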

The purpose of this page is to request feedback from other DSC-F1 camera users, and especially to see who may be interested in a tool fitting the above description. If there is sufficient interest, it will raise the priority for me to invest in the creation of such a conversion tool, and I will let you know when the tool is ready.

Other sites of interest include