net – mike.pub

A Curious Case of Disregarded Robots.txt

The Internet Archive recently announced an apparent change of policy concerning the collection of web sites for their long-term preservation effort.

Before this announcement, it was commonly believed that you could ask the Internet Archive not to make copies of a site by adding a statement to the site’s robots.txt file, which would be honored.

The announcement, posted April 17, 2017, reads in part:

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

However, as I had already noted on Stack Exchange in March 2017, robots.txt had not been fully honored for at least 10 years:

I just did a quick test, commenting out the ia_archiver Disallow entry for a site that had it for at least the past 10 years. Then I looked the site up on archive.org/web, and it showed up grabs it had collected in 2007, 2008, 2009, 2011, 2012, 2013, 2014, 2015, 2016 and 2017! This means that Archive.org never strictly honored what others thought to be a “do not archive” statement during these years, it was merely not exposing the archived copies.

The web site for which I sacrificed the robots.txt continuity was 0xAA.org (an event we used to organize), and here is a screenshot taken during the test:

The original robots.txt file was reinstated after the test, again leading to the familiar “Page cannot be displayed due to robots.txt.” message. The overview also shows captures from before 2007, and the copies saved by the Internet Archive show that the exclusion rule was already present in 2003. However, I did not have the backups at hand to double-check the change history of that older period.
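To make the test more concrete, here is a minimal sketch of how the exclusion rule looks to a robot that honors it. It uses Python’s standard urllib.robotparser module. The two robots.txt bodies and the example.org address are assumptions made up for illustration, not the actual 0xAA.org file, and nothing here describes how the Internet Archive’s own crawler works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, not the actual 0xAA.org file.
WITH_RULE = """\
User-agent: ia_archiver
Disallow: /
"""

COMMENTED_OUT = """\
# User-agent: ia_archiver
# Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant robot named user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.org/index.html"
    print(may_fetch(WITH_RULE, "ia_archiver", page))
    print(may_fetch(COMMENTED_OUT, "ia_archiver", page))
```

A compliant robot only answers the question “may I fetch this?”. Whether copies collected earlier are kept, hidden or shown is a separate decision that robots.txt does not express, which is exactly the gap the test exposed.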

When I originally found this out, I thought that this was interesting to know, but not necessarily newsworthy. However, the blog post that followed seemed to express an official change in direction that gave me second thoughts. While robots.txt is not an official, compulsory declaration of permission or non-permission, neither is the email takedown request mechanism which is proposed as an alternative. I believe in the importance of asking before doing, and it felt like robots.txt was a good element to maintain a balance of acceptance. If the Internet Archive was in a legal gray area with its saving and making available of copyrighted works, could things get worse once it stops respecting robots.txt?

I am a big admirer of the Internet Archive. I can’t imagine the loss of the early web of the 1990s, which they so preciously worked to preserve. I visited them several times, I have friends working there, and I would entrust them with some of our own works, if something happened to us.

Having said that, does the public not deserve full disclosure and transparency, rather than what may be seen as a careful exercise in obfuscation? “A few months” (limited to “U.S. government and military web sites”) and “10 years” (on all sites?) are not the same thing. We are reassured that this new data harvesting “has not caused problems”. But what about the “right to be forgotten”?

Individual users can already now collect screen grabs (like I did), or save pages, or print them. But we have learned that our traditional rules don’t always scale to what becomes possible in a massively automated new world order.

These web grabs, like truth itself, may be helpful, or haunting. What if a version containing a tragic error were to be preserved against the will of the publisher? What about our juvenile mistakes? How long until somebody requests these “few months” (10+ years) for court use with a simple subpoena?

I do not oppose preserving the public web for posterity, even against the will of the original content publishers. I cited some more difficult test cases before, in what I find a fascinating voyage between “free will” and “encrypted mind”. However, I am concerned about making that material available earlier, in a way that goes against the free choice of individuals who may have something to say about that content, and about this not being disclosed with the transparency and debate it probably deserves.

Some thoughts on how this could be improved:

Any organization like the Internet Archive should enjoy the privileges and responsibilities that come with Library status, including special powers to archive works while they are still protected by copyright, as well as being protected under laws which would otherwise prohibit the circumvention of access-control measures. This could include not just web content, but also software (including copy-protected games) and other digital content. I am aware of precedents, e.g. National Libraries in some states of the former Yugoslavia, where it became necessary for each to individually preserve the works of a fragmenting country.
Robots.txt and/or the <meta> tag could be extended to separately express consent to long-term preservation and consent to dissemination of cached versions during the copyright term (or another shorter period, which could be specified). Adhering to this might not be a universal requirement, but at least the original intention could be taken into account later.

Proposal for Setting Canonical Host via Robots.txt

This is a proposal to indicate a preferred host name (e.g. domain with or without “www”) for search engine robots by adding a “Canonical-host” entry to the robots.txt file.

Valid host values are as per RFC 2396 and RFC 2732, i.e. “hostname | IPv4address | [IPv6address]”.

For example:

User-agent: *
Canonical-host: www.example.com

User-agent: *
Canonical-host: example.com

User-agent: *
Canonical-host: host2.example.com

User-agent: *
Canonical-host: 10.20.30.40

User-agent: *
Canonical-host: [FC00:AA10:BB20:CC30:DD40:EE50:FF70:CA40]
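As an illustration of how a robot might read such an entry, here is a minimal Python sketch. “Canonical-host” is the token proposed here and is not part of any standard robots.txt parser, so the function name canonical_host and its record-matching rules (a record naming the robot wins over the “*” record) are assumptions made for this example only.

```python
from typing import Optional


def canonical_host(robots_txt: str, user_agent: str = "*") -> Optional[str]:
    """Return the Canonical-host value that applies to user_agent, if any.

    Sketch only: a new record starts at the first User-agent line that
    follows a rule line; a record naming user_agent wins over "*".
    """
    wanted = user_agent.lower()
    agents = []          # agents named by the current record
    in_rules = False     # True once the record has a non-agent line
    specific = None
    fallback = None

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        # Split at the first colon only, so IPv6 literals stay intact.
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new record begins
                agents, in_rules = [], False
            agents.append(value.lower())
            continue
        in_rules = True
        if field == "canonical-host" and value:
            if wanted in agents and specific is None:
                specific = value
            elif "*" in agents and fallback is None:
                fallback = value
    return specific if specific is not None else fallback


if __name__ == "__main__":
    print(canonical_host("User-agent: *\nCanonical-host: www.example.com\n"))
```

The sketch does not validate the value against the host grammar of RFC 2396 and RFC 2732; a real robot would want to reject anything that is not a hostname, an IPv4 address or a bracketed IPv6 address.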

Rationale

It has always been common practice to make a web site accessible both with and without a “www” host name. This remains the way sites are almost always configured by default by an ISP under managed hosting plans. While potentially interesting from a usability standpoint (both www.example.com and its shorter form, example.com, will work when typed in a browser’s address field), this results in several problems as soon as the different URLs pointing to the same host are published on the web in spite of the site maintainer’s preference for one specific form.

Known issues include:

Discrepancy between search engine result URLs and URL preference of site maintainers
Inconsistencies within search engine results (some pages on a site listed with “www”, others without)
Same site with and without “www” is listed as having a different “page rank”
Failed matches between search engine results and categorization schemes
Difficulty, for a search engine, to accurately determine whether different URLs pointing to the same IP address (e.g. HTTP 1.1 virtual hosts) are actually meant to point to the same web site, or not (after all, the content itself can change between crawls)
No clarity about the number of actual sites indexed by search engines (are different non-canonical URLs pointing to the same web site counted multiple times?)

Solutions to this are limited in part because:

Inconsistent incoming links (e.g. with and without “www”) are not under the control of a site’s maintainer
While HTTP redirects could be used to express a preference (e.g. by permanently redirecting accesses to example.com to www.example.com, or vice versa), not all managed hosting providers give the customer access to such configuration options

Resorting to robots.txt to solve this problem comes naturally for several reasons:

Robots.txt provides a method “for encoding instructions to visiting robots”
Robots.txt is popular among robots
Robots.txt is always accessible by a site’s maintainer
Robots.txt is already site-centered (one robots.txt per site)
Martijn Koster’s “A Method for Web Robots Control” RFC allows for extensions to the robots.txt format (“extension = token : *space value [comment] CRLF”)

While this discussion centers on the presence or lack of the “www” host name, which is a very practical and frequent issue, the aim is to propose a flexible solution that can be applied to other situations as well.

Conclusion

In consideration of the above, the proposal is made to define an extension token named “Canonical-host”, allowing the maintainer of a web site to indicate a preferred host name value to be used by robots to access and index the site.

More specifically:

Robots should interpret and follow this preference in the same way as they would process a permanent HTTP redirect (status 301)
Search engines and web categorization systems (“directories”) should consider the preference as a request to update their host name records, if required
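The “treat it like a 301” behavior can be sketched as a simple URL rewrite. This is only an illustration under assumptions: the function name canonicalize is made up, the preferred host is taken as already extracted from robots.txt, and any user name or password embedded in the URL is dropped.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str, canonical_host: str) -> str:
    """Rewrite url so that it points at canonical_host, keeping the rest.

    Sketch only: userinfo (user:password@), if present, is dropped.
    """
    parts = urlsplit(url)
    netloc = canonical_host
    if parts.port is not None:          # keep an explicit port, if any
        netloc = f"{canonical_host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(canonicalize("http://example.com/a/b?x=1#top", "www.example.com"))
```

A robot or directory applying this to every URL it has on record for the site would end up with a single host name per site, which is the effect a permanent redirect would have had.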

Post Scriptum: Robots.txt vs. rel="canonical"

In 2009 the major search engines announced support for the rel="canonical" attribute:

Handling legitimate cross-domain content duplication

Although from a per-page rather than per-site perspective, the new implementation addresses many of the needs covered by this proposal. At the same time, though, it requires adding a tag on each page, and it cannot be applied to scenarios where the content administrator has no control over the HTML headers, e.g. with many content management systems, or with web services. Not to mention non-HTML content (audio, video, images, etc.).
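To show what “per-page” means in practice, here is a minimal sketch of how a robot could look for the canonical link in a single HTML document, using Python’s standard html.parser module. The class and function names are made up for this example, and real search engines certainly apply more rules than this.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared by an HTML page, or None."""
    finder = CanonicalLinkFinder()
    finder.feed(html)
    return finder.canonical


if __name__ == "__main__":
    page = ('<html><head><link rel="canonical" '
            'href="http://www.example.com/page"></head><body></body></html>')
    print(find_canonical(page))
```

The lookup has to be repeated for every page, and it has nothing to work with when the resource is an image or an audio file, whereas a single robots.txt entry would cover the whole host.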

As of 2012, both Yandex and Google are supporting a “Host” directive that is substantially the same as the one I was proposing under “Canonical-host”.

Thoughts on Unsolicited Email Advertising

Last revision April 6, 2003. Original text published January 24, 1998. © 1998-2003 Mike Battilana. SPAM is either a registered trademark or a trademark of Hormel Foods Corporation in the United States and/or other countries.

Overview

This document contains the following main parts:

Introduction
Original Text (thoughts about “spam” and “antispam”)
2000 Update (includes lawsuit)
2002 Update (comments about emerging trends)

Introduction

I originally wrote this text after my outgoing emails were not being delivered any more as a consequence of an episode of unsolicited “bulk mail” advertising. You can still find the full story below. While most of this content is country-neutral, some episodes are linked to my work experience with US and Italian companies.

Over the years, mostly thanks to this page, I came in contact with a variety of other users (both frustrated by some of the unsolicited email they receive, and also those who prefer to consider this “free speech”) as well as organizations trying to deal with unsolicited email advertising in one way or another, and even a few people who said they found this text useful in studying possible regulations for their respective countries. The phenomenon itself evolved, and so did these notes.

If you are not familiar with the type of email I am talking about, have a look at what people like myself are getting every day as of early August 2001:

Each frame of the above animation shows a set of titles, dates, and the first few lines of the messages which every day I have to receive and sort. Somebody else could be doing this, but it wouldn’t be email any more: real time, fast, private, or working day and night if necessary. I removed a few duplicates and some numbers and character codes which sometimes appeared in the message title for what I believe were tracking purposes. I obviously also receive “normal” email, which is not shown here.

Needless to say, I don’t know the individuals or organizations sending out these emails to millions of internet users like myself. Even if some of these mails appear to imply that the recipient subscribed or otherwise expressed interest in the topic (e.g. “Below is the result of your feedback form” or “This E-mail is never sent unsolicited”), that does not apply to me, neither for the “wonder drugs”, nor for the religious texts, nor for the money-related and all other “services” offered in these emails. If you think that the above collection is a wasteful celebration of stupidity more than an expression of free speech, be reassured that it’s all real. I did not pick the funniest messages, or the ones with more spelling errors, or the ones trying to target the people who are most desperate and in need of help.

I have talked to some legislators who were not at all familiar with the issue, either because they did not use email, or because they did not access it themselves, or because they never had their address appearing on a web site or newsgroup discussion. This has been one reason for me to add this updated introduction.

I know, I know, I shouldn’t put my email address in public places, is that what you are thinking? Or maybe I should just give up, and use several different addresses, perhaps making people sign contracts to not insert the “good ones” in files where a human could understand them, but a good address extraction robot couldn’t? Well… I hereby would like to affirm the right (the freedom, if you prefer) and pleasure to be reachable by interesting people like yourself, and not to have to hide from email robots programmed to harvest addresses and then bomb me with “AS SEEN ON NATIONAL TV!”, “YOUR A WINNER!!!”, “Claim Your Complementary Digital Camera” and “Do Not Repay Your Credit Card Debt” messages. Also, I like the idea of being able to use just one email address, and to allow real people, not advertisers, to be able to use it for a very, very long time. Email addresses are one of the few things in technology which have the nice potential to offer some long-term stability. No, I am not actually going around posting my address on web sites, but, just to mention an example, my email address happens to be included in documents which in part have to be made available online. Also, I would like friends to be able to find me, if they lost or never had my address, and I would like to be able to use internet newsgroups the way they were meant to be used, not in a constant challenge in which addresses have to be artificially modified with “nospam” prefixes, suffixes and other formulas which sometimes robots can sort out, but users who are not technically familiar with email can’t.

If this is still new to you, try and imagine yourself receiving these numerous emails, sent to your address, calling you “DEAR FRIEND AND FUTURE MILLIONAIR”, and having to process them one by one, at the risk of missing something important in this distracting and, frankly, often irritating process. Now, imagine yourself receiving these emails on your GPRS or 3G phone, which is nice and small, and gives you constant access to your email inbox, but on which you still pay for every byte, both directly and to support the infrastructure.

Recent technology trends have not only made it possible to read email in real time on inexpensive phones with “always on” internet connections, or on satellite phones with expensive connection fees, but they also helped the senders of unsolicited email in more than one way. For example, thanks to new and increasingly popular software every personal computer can easily be set up to be its own SMTP server (i.e. to send emails without using your ISP’s SMTP server), thereby making it relatively useless to block the address of a specific SMTP server, since the same IP address could, the following day, be used by another user who needs to send “legitimate” email using some similar SMTP server software installed on a notebook (having your own SMTP software on a portable computer is quite useful, as you don’t have to reconfigure your mail programs for every different internet connection you use as you travel).

While sending mass emails is getting technically easier yet more powerful, encouraging “hit and run” behavior and enhancing aspects which existing laws did not completely cover, one of the most prominent businesses promoting themselves via unsolicited commercial email now consists of… the business itself. Preconfigured CDs containing millions of email addresses and SMTP server and other software to be used to prepare mass mailings now cost less than $100, and can be used on any PC. I must confess that reading some of these emails, and considering the similarity in style with other emails which tend to play with emotions, illusions, hopes and “immense potential” more than with solid facts, I was wondering whether, after all, advertising via unsolicited email does produce any positive results at all, other than for those offering these CDs and mailing services. It certainly looks to me like there is a new emerging trend in which the phenomenon is increasingly self-promoting itself.

Emerging technologies, such as ENUM (Telephone Number Mapping), which can be used to map telephone numbers to email addresses and other information and systems on the internet, also have the potential for new types of exploitation and abuse. Just imagine robots which query ENUM servers to harvest valid email addresses based on random phone numbers, and then use this data to automatically send unsolicited email.

Unsolicited commercial email is exploiting that fascinating part of technology in which the cost to reproduce something in unlimited quantities tends to zero, and in which everything which is technically possible and legally accepted happens not only in theory, but also in practice. However this only applies to the senders, whereas on the recipient side different rules apply, and the amount of energy, time, money, patience and legal resources tends to be in proportion to every single message or sender. So what starts with one “Congratulations!!! You have been selected as a finalist in the Getaway Travel Sweepstakes!” email per week can easily evolve to a point where one thousand or one billion different companies or individuals start sending similar emails, every day, to users like you and me. The more messages you will receive the more it will cost you to do something about it. And, of course, these messages will all say “This is a one time mailing – no need to unsubscribe”.

It All Began with “Junk Mail”…

It all started in 1995, when my private CompuServe account began to receive some unsolicited advertisements. These first, few, emails often included lists of other recipients’ email addresses (which all other receivers could read), which were tens of times longer than the message itself. In most cases, no effort was made to hide the sender’s email and server data. A few of these advertisements offered for sale lists of email addresses. This is probably how the chain reaction was ignited. Technical considerations apart, I was quite annoyed: access to my CompuServe account was through an expensive toll call (there was no local CompuServe access number), and in at least one case it happened that these dozens of unsolicited mails filled the hard disk partition where my mail was stored, so that I could not receive the messages I really wanted to receive. I contacted CompuServe security, but, apparently, there was nothing they could do, or wanted to do. I can’t avoid considering that I was paying money for each minute online, even while I was dealing with undesired mail. For some of the organizations involved, this meant additional profits. The more incoming junk mail there was in the mailbox, the higher telephone and online connection fees I and other users like myself had to pay.

At that time (between the last months of 1995 and the first months of 1996), most of these unsolicited emails originated from a few Internet domains, which were used for different types of messages, and most of these offered the possibility to be removed from their “service”. Not that I felt right about users having to waste time doing this each time they log on, but I asked to be removed from all of these lists. I did this more or less in a single day. The effect was in part unexpected: the domains from which, until then, most of the junk mail was flowing in, suddenly did not appear any more in the email headers. However, I was receiving about the same amount of junk mail, only coming from apparently random domains. What a coincidence, I thought…

By mid-1996, it sometimes happened that I had to download 100 Kbytes or more of junk mail at a time from my CompuServe account. Sometimes junk mail caused the account to overflow, bouncing back “legitimate” messages, as well as more junk mail. This is because, like many service providers, CompuServe has a storage limit on incoming mail, and at the time that was 100 messages. Everything in excess of that would bounce back to the sender, until I downloaded my mail, making room for more. Needless to say, more than 90% of messages were junk mail. Like many others, I lost potentially important professional opportunities because mail from people I had given my address to did not get through.

Obviously, I stopped using the CompuServe email account. At the time, most Internet Service Providers (ISPs) and legislators were unsure about what action to take, if any. It seems that, as I am writing this, the situation is more or less unchanged, only that more junk mail than ever is floating around, contributing to the overload of the internet, and to user frustration.

How can this be? When I access my email using a cellphone, each piece of junk mail costs me a lot of money and time. Cellphone or not, junk mail can, as I described above, prevent me from receiving important email, for example by filling up my hard disk or my email account. Users with a single phone line have to keep their phone line unnecessarily busy to download junk mail they do not want. In the meantime, their (voice) phone line is busy, and others cannot reach them for matters that may be very important. When I access mail on my ISP’s account, I have to pay telephone connect time and monthly subscription, as well as ISP connect time and monthly subscription. I think people have a right to choose what to pay their telephone and ISP bills for. I do not pay any of these to receive postal advertisements, and postal advertisements do not prevent me from receiving other mail! How can some people selling software designed to send junk mail claim that unsolicited email cannot be compared to fax advertising (which is illegal in most countries, because it keeps the recipient line busy, uses the recipient’s paper, etc.)? In my opinion, unsolicited email advertising is even worse than unsolicited fax advertising, because with email the recipient pays most of the costs, whereas with fax (and postal) advertisements it is the sender who has to pay for most of the delivery. The recipient’s time and frustration, busy telephone lines, telephone fees, ISP fees, disk space, lost email, does all of this have no value?

I forgot to mention this: my CompuServe account is a German one, and I collect my email from different countries, which have different laws about unsolicited email advertising. In Germany, this practice is prohibited, more or less like unsolicited fax advertising. Yet, this account keeps receiving junk mail from the US every day. Don’t the senders realize that they may be in violation of international laws? Don’t these organizations know that CompuServe addresses beginning with “1” are assigned to non-US residents? Don’t the senders check the US InterNIC database to see whether a domain (.com, .net, .org etc.) is registered to a US organization or not? Not that I know, according to the email I see floating around. But even that wouldn’t be enough, because, even in the case of a domain registered to an organization residing in a country in which “spamming” is not against the law, the individual recipient of the email may reside in a territory in which the same is illegal. In theory, these doubts could be enough to stop more than 99% of today’s junk mail, if it originated from responsible and careful senders. Instead, they go on, hiding behind fake addresses and headers (as if this alone wasn’t a sufficient sign that there is something fundamentally wrong with this).

As long as there is even only one jurisdiction in the world in which “spamming” is illegal, “spammers” should in my opinion actively check that their unsolicited mail is not being delivered or read in that jurisdiction. Of course, this is neither practical nor possible, especially considering mobile users. This would lead to the conclusion that “spamming” is potentially illegal in all cases, and senders should accept the related possible consequences. As far as the collection, storage, distribution, sale and use of lists of email addresses and other personal data from newsgroups and web pages is concerned, it should also be noted that in many countries this is subject to separate data protection and privacy regulations.

Theoretically Traceable, Practically Anonymous

SMTP (Simple Mail Transfer Protocol) is the internet protocol used by mail servers (SMTP servers) to process email requests. Senders of unsolicited commercial email very frequently rely on the unauthorized use of third-party SMTP servers. This can be advantageous for several possible reasons:

the use of a third-party SMTP server introduces a theoretically thin, but practically quite effective layer of perceived anonymity;
letting somebody else’s SMTP server do the work reduces the transmission time, the bandwidth requirements and costs, and the overhead of having to deal with transmission and address errors;
the use of different SMTP servers makes it more difficult for “anti-spam” systems to detect and block the flood of mails based on the SMTP server address;
the more SMTP servers are used, the more the abuse is fragmented into smaller chunks, and the less likely it is that each victim of such unauthorized use takes action (“you won’t sue me for just having sent a few kilobytes through your server, will you?”);
using somebody else’s SMTP server can be a way to bypass an ISP’s requirement that clients not engage in “spamming” through their own SMTP servers.

A computer can send a single request to an SMTP server consisting of an email message body and hundreds of addresses, and the SMTP server will then in turn do most of the work and send those hundreds of emails as requested. In the original SMTP implementations it was not considered necessary to require any type of explicit authentication of the requesting system. Even without a username and password, however, SMTP operations are not anonymous, because SMTP is built on top of TCP, which consists of data packets contained in lower-level IP (Internet Protocol) packets. Every IP packet sent from A to B has to contain the “internet address” of both A and B. A mail server can be configured to only accept requests from addresses residing on the local network, or otherwise associated with specific systems.

The sender address contained in an IP packet can be forged (one just has to send a packet with a fictitious A address), but then B would not be able to send data back to A. In a procedure known as “IP spoofing”, a malicious sender A sends an IP packet to B, providing B with an incorrect reply address for A. But an IP packet alone is not very useful. TCP/IP, which is the combination of TCP packets inside IP packets, and which is used for SMTP requests, additionally makes use of TCP “sequence numbers”. For successful “TCP spoofing”, a malicious sender A has to not only forge its address both at the IP level and in the TCP packet, but it also has to predict the correct sequence number that B will tell it to use in the first reply.

TCP sequence number prediction attacks are neither easy to implement nor guaranteed to succeed, but they are technically possible. Additionally, SMTP introduces a further level of interaction between A and B, so that a malicious sender A would have to correctly predict the answers of the SMTP server, and respond accordingly. Unless errors or other unusual circumstances occur, it is however generally possible to estimate what an SMTP server will answer based on the requests of A.
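The address-based restriction mentioned above (a mail server that only accepts requests from the local network) boils down to a membership test on the connecting address. Here is a minimal sketch using Python’s standard ipaddress module. The network ranges and the function name accepts_relay are assumptions chosen for illustration; real mail servers express this in their own configuration language.

```python
import ipaddress

# Hypothetical local address ranges; a real server would use its own.
LOCAL_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("10.0.0.0/8"),
]


def accepts_relay(client_ip: str, networks=LOCAL_NETWORKS) -> bool:
    """Return True if the connecting address belongs to a local network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    print(accepts_relay("10.1.2.3"))       # a local client
    print(accepts_relay("203.0.113.9"))    # an unknown outside system
```

An “open” server is simply one where this check is missing or always answers yes, which is what makes it attractive for unauthorized use.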

If the SMTP server is configured to only accept requests from a known local address range, then a simple firewall system can be put in place to filter all packets sent from the internet to the SMTP server which have a “spoofed” address, i.e. if the packet comes from “outside”, and it has a forged “inside” sender address, the packet is blocked, and alarm bells ring. However, especially in large and complex networks, this may also unnecessarily restrict legitimate use of the mail server. This can in turn be solved by additional protection systems, the complexity of which is however the main reason why many SMTP servers are “unprotected” and accept requests from any client. This however does not mean that requests are anonymous: the IP address (real or fictitious) is logged as part of each SMTP transaction, and becomes part of the email header.
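Since each server that handles a message records the address it received it from, the trail can be read back from the headers. Here is a minimal sketch using Python’s standard email module. The sample message, its host names and its addresses are all invented for illustration (they use reserved documentation ranges), and the function name relay_addresses is an assumption.

```python
import re
from email import message_from_string

# Invented message; all hosts and addresses are documentation examples.
RAW_MESSAGE = """\
Received: from mail.example.net (mail.example.net [198.51.100.7])
\tby mx.example.org with SMTP; Mon, 6 Aug 2001 10:00:02 +0200
Received: from unknown (HELO friendly-name) (203.0.113.45)
\tby mail.example.net with SMTP; Mon, 6 Aug 2001 10:00:01 +0200
From: "Winner Dept." <prizes@example.com>
Subject: Congratulations!!!

You have been selected.
"""

IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def relay_addresses(raw_message: str) -> list:
    """Return the first IPv4 address of each Received header, oldest first."""
    message = message_from_string(raw_message)
    hops = message.get_all("Received") or []
    found = []
    for header in reversed(hops):   # servers prepend, so reverse the order
        match = IPV4.search(header)
        if match:
            found.append(match.group(0))
    return found


if __name__ == "__main__":
    print(relay_addresses(RAW_MESSAGE))
```

Only the lines added by servers you trust are reliable: a sender can insert additional fake Received lines below them, but cannot remove the one written by the server that accepted the connection.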

In brief, in order to remain anonymous and also have its SMTP requests be accepted, a malicious sender A would have to:

Forge its sender address A, making sure it is an address accepted by B to process SMTP requests
Correctly predict the TCP sequence number that will be requested by B (the request is lost, as the A address is forged)
Further interact with B making assumptions about the contents and timing of requests from B to A (again, all messages from B to A are lost)
Make sure that no real A is online and replies, or else attack such an A system while communicating with B, so that the real A has no time to report an error to B
Hope that no firewall is installed between B and the internet which blocks internet packets which carry fake sender addresses appearing to originate from the local network, if that is what A did
Hope that no intermediate routers or other systems keep a record of its activity
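The firewall rule referred to in the list above is simple to state: a packet arriving from the internet must not claim a sender address that belongs to the local network. A minimal sketch of that decision follows, with a made-up network range and a hypothetical function name should_drop; a real firewall applies this per interface, in its own rule language.

```python
import ipaddress

INTERNAL_NETWORK = ipaddress.ip_network("192.0.2.0/24")  # hypothetical LAN


def should_drop(source_ip: str, arrived_from_internet: bool,
                internal=INTERNAL_NETWORK) -> bool:
    """Drop packets from outside that claim an inside source address."""
    claims_inside = ipaddress.ip_address(source_ip) in internal
    return arrived_from_internet and claims_inside


if __name__ == "__main__":
    print(should_drop("192.0.2.10", arrived_from_internet=True))
    print(should_drop("192.0.2.10", arrived_from_internet=False))
    print(should_drop("198.51.100.7", arrived_from_internet=True))
```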

In practice, making even one connection in which all of the above conditions are met is very, very difficult and time-consuming. Sending millions of email messages in this manner would be both highly impractical and beyond the technical skills of the average senders of unsolicited commercial email. This means that in practice:

The identity of sender A is not forged
Sender A can only use third-party SMTP servers which accept requests from unknown systems (no password protection or other address-based restriction)

Unfortunately, the damage/cost ratio in cases of SMTP abuse, which is covered by existing computer crime laws much more than “spam” itself, works to the advantage of the “spammer”. Even if the yearly total worldwide damage is high, no single party usually sustains a cost high enough to motivate expensive technical and legal work. Furthermore, the same logic also applies at the following level, i.e. at the investigative and judiciary phase (low damage, high cost, case archived). So, it seems that SMTP is not anonymous enough to make a sender untraceable, but it is anonymous enough to work well for “spam”.

SMTP now does support authentication (e.g. RFC 2476, RFC 2487, RFC 2554, RFC 2645, RFC 3207), i.e. it is in theory much easier to trace back a message to the real ISP and sender. The Authenticated Mail Transfer Protocol (AMTP) specification further builds on this, aiming to replace SMTP with a more secure derivative. Authenticated SMTP, or AMTP, however, is not yet a requirement on the side of ISPs and network operators. Maybe the increasing cost of “spam” will accelerate this requirement.
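From the point of view of a mail program, authenticated submission looks roughly like the following minimal sketch, which uses Python’s standard smtplib and email modules. The host, account and addresses are placeholders to be supplied by the caller, port 587 is the message submission port, and the function names are made up for this example. It illustrates authenticated SMTP only, not AMTP.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender: str, recipient: str,
                  subject: str, body: str) -> EmailMessage:
    """Assemble a plain-text message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_authenticated(host: str, user: str, password: str,
                       message: EmailMessage, port: int = 587) -> None:
    """Submit message over an encrypted, authenticated SMTP session."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()              # upgrade to TLS (RFC 3207)
        server.login(user, password)   # SMTP AUTH (RFC 2554)
        server.send_message(message)


if __name__ == "__main__":
    demo = build_message("me@example.org", "you@example.net",
                         "Hello", "Just a test.")
    print(demo["Subject"])
    # send_authenticated("mail.example.org", "me", "secret", demo)
```

Because the server knows which account logged in, every message it relays can be tied to a customer, which is what makes tracing “in theory much easier”.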

To Regulate, or not to Regulate?

My opinion is that junk mail should be regulated by law, just like advertising over fax, automatic voice calling systems, pagers, SMS short messages and other electronic media has already been regulated in many countries. The simple underlying logic would be that receiving a piece of email costs money and resources just like receiving a fax. I don’t like the idea in general, but I see no alternative, especially where existing regulations do not cover things like the access of a third-party’s SMTP server, or forged or missing sender identity information.

Whatever regulation is considered, I would encourage it to be carefully hardened against the emerging trend of “one time” mailings. If you allow these, even with an “opt out” option, you are de facto allowing unlimited one-time mailings to paralyze the system and annoy users as if there were no regulation at all. Similarly, sender email addresses which may be “valid”, but in which the user or domain part is used only once per mailing, or possibly even once for each destination address, make the requirement to use a “valid” address irrelevant for the purpose of automatic filtering, a possibility that is sometimes mentioned in the same context.

The best solution that comes to my mind is a combination of legal regulation (or extension/interpretation of existing laws, e.g. as applied to fax advertising, computer crimes, theft, privacy, etc.) plus a technical approach (e.g. authenticated SMTP, or AMTP), because obviously the internet as a whole has proven to be vulnerable to the abusive weight of unsolicited advertising.

In spite of what I wrote so far, I also think that junk mail should be allowed at a separate layer, and that there should be a way to flag, transport and deliver unsolicited messages, if the end user so chooses. But this should be implemented in a way that the proper mechanisms for authentication and distribution of costs and resources are applied. It is too late to filter one million messages after they already traveled twice around the planet and were stored by the final ISP pending verification of a user option. The concept of allowing for the optional acceptance of unsolicited mail is important to me, because it means, after all, free speech, whereas blocking “spam” can become a form of censorship. I think that you should, at any time, be able to open the window and see what it is like outside, even if this means receiving a million messages a day.

I can in part understand the fear that a law may have negative side effects. If the regulations forbidding junk faxes can be considered a good precedent to compare with, can somebody perhaps tell me any negative side effects they had? I think that here we are talking about laws which give people more freedom than they take away. This is the freedom not to have to answer a phone only to hear a robot playing a tape, the freedom not to have your fax paper or hard disk space consumed by junk faxes and email, the freedom not to have thousands of robots try to access expensive 24×7 human customer service resources. The freedom not to have to waste money and time on something you have decided you don’t want to receive, and the freedom to still receive all of this, if you so wish.

… Then Came “Anti-Spamming”

The most disruptive episode I experienced in relation to “spamming” felt worse than the effects of three years of junk mail, combined, and was what actually led me to write this.

I was using a very large, professional and reliable ISP, the best in the country, in my opinion. One day a CompuServe user sent some junk mail using my ISP’s SMTP server. The sender probably had no contract or other right to use this SMTP server to flood the net with this “Earn $280 – $500+ weekly” message, but this nevertheless occurred. The sender may even have had, in theory, the malicious intention of blocking that ISP’s mail service, rather than disseminating junk mail. Whatever the case, both results were achieved. Within a short time, the IP address of my ISP’s SMTP server was put on a sort of “black list”, which, I discovered, was checked by a large number of organizations and ISPs to determine whose mail to accept or to refuse, probably on the (incorrect) assumption that all mail coming from a mail server whose IP address is in the “black list” is junk mail. Indeed, I was later told, we were honored by none other than the “mother of all ‘black lists’”.
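For readers curious about how such a check can work, here is a minimal sketch in Python. It assumes the list is published through DNS, as many such lists are: the receiving server reverses the octets of the connecting server’s IP address, appends the list’s zone, and treats any answer as “listed”. The zone name `blacklist.example` and the function names are invented for illustration, and I am not claiming this is exactly how the list in my case was consulted.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets plus the list zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the list publishes an address record for this IP.

    resolve is a hypothetical hook so the logic can be tried offline.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No record: the connecting server is not on the list.
        return False
    return True
```

Note that the only input is the IP address of the sending mail server. Nothing about the individual message or its author is examined, which is why one abusive message relayed through a server can get every other user of that server refused.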

Instantly, thousands of “innocent” users like myself had their email blocked, at least with respect to recipients whose systems were checking this “black list” (I was surprised to see how many did). Hidden in the error messages which kept bouncing back to me from a dozen sites, I found a hint about the system which was at the origin of our troubles. I use the word “hidden”, because experience tells me that the average user does not look into these “Returned mail: Service unavailable” reports, scanning the technical contents. Anyway, I went to the web site in which I was hoping to find an explanation of the “problem”, but the system, belonging to the maintainers of this “black list”, could give only one answer: “ERROR… The remote site or server may be down”. So I sent an email to the company behind these not so efficient technologies, telling them that their site was down, that their service was blocking my email, and asking what I could do about it. Very efficiently, within seconds, I received a reply: “Access denied due to spam and mail abuse”. It should be noted that I also tried, with the same results, the address that ISPs were supposed to use to try and get help when their own mail server is trapped in the “list”.

In the meantime, I realized how many messages, all sent to different people in different countries, were bouncing, all in connection with this “service”. My work was being interrupted – not by junk mail, but by somebody’s intention to stop junk mail. Somebody had decided that, based on one piece of junk mail, all users of a very large and professional ISP had to have their email blocked.

I called my ISP, and they told me that they had already contacted the maintainers of the “black list” three days earlier, and that they were taking the problem very seriously, even considering legal action. Just to be sure, I tracked down the fax number of this “black list” company, and explained the situation to them again, in my own words.

The following day, my email was still bouncing, but at least the web site with the information about the “black list” was up again. This web site basically explained two things with which I do not agree. First, it contained repeated “reassuring” claims like “we have not singled you out” (no, they just “singled out everybody”, I thought, thinking of episodes in which an entire village was “punished” for something done by an individual), “we mean you no harm” (why do facts often end up so far away from the “good intentions” behind them, and how can somebody who is an active part of a system that punishes innocent people claim to “mean no harm”?), “we will help… usually within minutes” (our ISP has been waiting for four days now, and we are all still counting), “we are not the network’s police force” (from what I saw, it would indeed seem more like police, judge and jury at the same time) and “Loss of connectivity hurts us all. Spam hurts us all even more” (one opinion thousands of users like myself do not share – wait until “loss of connectivity” hits you, and see what feels worse, and whether it is right to fight a problem by creating another problem, even if possibly smaller, and with the best of intentions).

Does the goal justify the means? Can we hurt innocent people in relation to something committed by others? Does this solve the problem? I don’t think so. What I see here is a confusing and dangerous mixture of personal opinions and objective facts, good intentions and disruptive results. A good example of the need for a clear and official position, with real police, real judges, and real juries, rather than a multicolored variety of home-made imitations. Until we have this, both the senders of junk mail and the supporters of “black lists” will have their very own reasons to justify their respective positions.

Additionally, this site explained that what ISPs had to do, in their opinion, was to program their SMTP servers so that they would not receive and forward mail from users who were not logged on to that ISP (i.e. they should prevent “third-party mail relay”). But, I wonder, isn’t this a “solution” that, for many, is “proposed” too late, since it comes after people’s accounts have been blocked? Additionally, such a solution, even if applicable, could not, alone, prevent an ISP’s authorized customer from sending out one or more pieces of junk mail using that same ISP’s SMTP server. Even if sending junk mail were against the ISP’s policies, there are enough ISPs to allow for “hit and run” tactics, where the sender uses an ISP once to send thousands of messages, and then changes ISP, and so on.
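The relay restriction described above can be sketched as a simple decision rule. The following Python fragment is only an illustration under my own assumptions: the function name, the parameters, the example domains and the address ranges are all hypothetical, and real mail servers express this in their own configuration languages.

```python
import ipaddress


def may_relay(client_ip, authenticated, recipient_domain,
              local_domains, customer_networks):
    """Decide whether the server should accept a message for this recipient."""
    # Mail addressed to the ISP's own domains is normal inbound delivery.
    if recipient_domain.lower() in local_domains:
        return True
    # A client that has logged in may send anywhere, from any network.
    if authenticated:
        return True
    # Otherwise relay only for clients connected from the ISP's own networks.
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(net)
               for net in customer_networks)
```

With `local_domains={"isp.example"}` and `customer_networks=["198.51.100.0/24"]`, an unknown system at 203.0.113.7 is refused when it tries to send to a third-party domain, which closes the open relay. A customer connected at 198.51.100.20 is accepted for any destination, though, so the rule by itself does nothing against junk mail sent by a paying customer.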

In my original article I had also written:

Also, I and many people like myself frequently use one ISP to log on to the Internet, and another organization’s SMTP server (to which of course we have to be granted access). In my case, this happens because I move around a lot with a portable computer, and I access the internet using different accounts (preferably with a provider who has local access, or using somebody’s LAN), yet I prefer not to reprogram my mail software to use a different SMTP server each time. Other people do it for testing, or for other reasons which may not be as frequent as junk mail, but which should not, in my opinion, be penalized by junk mail. Nobody in the internet user community, I think, should be made to pay any price for this abuse of the email system. The fact that something good is misused by some for illicit purposes should not, in my opinion, result in this possibility being banned for all other uses as well. If this were always the case, it would be a very high price to pay for peace and justice, especially when the “solution” does not solve the “problem”. Last but not least, let’s not forget that using a third-party’s SMTP server is often an ISP’s only hope to get its email out in order to ask for its IP addresses to be removed from somebody’s “black list”. In my case, using a different SMTP server was the only way to escape the effects of the “black list”. What an irony, isn’t it?

I now think that the above “Keep SMTP Free” defense is not applicable, for two reasons:

authenticated SMTP and AMTP, which were just beginning to emerge in 1998, now allow people to use any SMTP or AMTP server regardless of their current ISP or network configuration, thereby solving the legitimate needs outlined above;
the abuse of un-authenticated SMTP by both “spammers” and virus software has proven to be a technical weakness which is damaging the internet as a whole, so in spite of it having possible legitimate uses (now rendered obsolete by authenticated SMTP and AMTP), I think it should be treated like a security hole, and banned as we do with other defective software.

Other solutions have been proposed to try to stop, or, at least, reduce, the flood of junk mail. For example, many mail clients now have an option to automatically filter out mail coming from certain senders. Very quickly, junk mail senders have “adapted” to this situation, putting random letters and digits in their addresses, so that no two pieces of mail have the same sender address and none can be detected on this basis. (This use of random parts does not mean that the address itself does not work as a return address, i.e. it can be a “legitimate” and working address even if it has a random part.) Also, even in the best case, this mail is filtered only after the user has paid for the cost of receiving all messages, and after the other issues already mentioned, so much of the damage is already done.
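A small sketch shows why sender-based filtering is so easy to defeat. The blocked address, the domain `bulk.example` and both function names are made up for illustration; the filter is a plain exact-match list of the kind a mail client might offer.

```python
import random
import string


def is_blocked(sender, blocked_senders):
    """Exact-match sender filter, ignoring letter case."""
    return sender.lower() in blocked_senders


def randomized_sender(domain, rng, length=8):
    """Generate a sender address with a random local part, as junk mailers do."""
    alphabet = string.ascii_lowercase + string.digits
    local_part = "".join(rng.choice(alphabet) for _ in range(length))
    return local_part + "@" + domain


blocked_senders = {"offers@bulk.example"}
rng = random.Random(1998)  # fixed seed so the illustration is repeatable
new_senders = [randomized_sender("bulk.example", rng) for _ in range(5)]
```

The address that was added to the list is caught, but none of the freshly generated ones are, even though they all come from the same source. Filtering on the domain part helps only until the domain is rotated as well.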

Other systems try to detect junk mail at the server level, based on the assumption that if an identical message is received by several recipients at the same time, it is considered “suspicious” (if not automatically deleted). In the best case, this requires manual inspection of the mail by an administrator; in the worst case this is an intrusion into privacy, and causes delay in the delivery of requested or subscribed information (say, 1000 users of an ISP have deliberately subscribed to a newsletter), if not the deletion of some mail which has nothing to do with junk mail. Also, small sites do not have enough users to allow for this type of detection, and detection in big sites can easily be circumvented by sending junk mail messages at random times for each recipient, and with random changes in the contents (changing spacing, punctuation, minor errors, proper nouns, user name in text body, etc.).
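Both weaknesses of this approach can be seen in a few lines of Python. This is a simplified sketch under my own assumptions: it fingerprints message bodies with a hash and flags any fingerprint seen at least `threshold` times. The function names and the threshold are illustrative, and real systems are more elaborate.

```python
import hashlib
from collections import Counter


def fingerprint(body):
    """Hash of the exact message body; any change yields a different value."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def suspicious_fingerprints(bodies, threshold):
    """Return fingerprints of bodies received at least `threshold` times."""
    counts = Counter(fingerprint(body) for body in bodies)
    return {fp for fp, seen in counts.items() if seen >= threshold}


# A newsletter that three users deliberately subscribed to.
newsletter = ["Weekly newsletter, issue 12"] * 3
# Junk mail personalized with the recipient's name.
junk = ["Dear %s, earn money weekly!" % name
        for name in ("jody", "cliff", "sam")]

flagged = suspicious_fingerprints(newsletter + junk, threshold=3)
```

The legitimate newsletter is flagged because its copies are identical, while the junk mail passes because each copy differs by one word.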

Should I even mention this? From time to time I tried to add my email address to serious-sounding lists and web sites that collect addresses of people who do not wish to receive junk mail. Each time, I also added a unique test address that allowed me to see if the data ended up in “spam” lists, just in case. I do not know if, as a consequence of this, I was receiving more junk mail or less junk mail, but I honestly couldn’t tell the difference. However, Sally Hambridge and Albert Lunde, authors of RFC 2635 (published more than a year after my tests), reported that “Careful tests have been done with sending remove requests for ‘virgin’ email accounts (that have never been used anywhere else). In over 80% of the cases, this resulted in a deluge of unsolicited email, although usually from other sources than the one the remove was sent to.”

I believe that the fragmented attempts for self-regulation, do-it-yourself “justice”, and other forms of private and market control of unsolicited email advertising failed, in some cases even adding chaos to chaos, and damage to damage. For this reason, I believe that some type of higher-level regulation is necessary, to control the issue as a whole, ideally with a higher priority where this phenomenon is more intense, such as in the US.

Conclusion in Sight?

After five days, the account with my preferred provider was still blacklisted. So I decided to write again to the company maintaining the “black list”. As in the other parts of this text, I’ve decided not to disclose names and other contact data, other than those of “institutions”. I already mentioned CompuServe. The other service provider I mentioned before is Italian Telecom’s Interbusiness, which offers Internet connectivity to Italian ISPs and corporations, using anything from ATM down to ISDN, and is Italy’s most important backbone and connection to the world.

The text of my letter follows.

Dear […],

Following my two letters of January 24, I wanted to update you on the case concerning the inclusion of IP address Y.Y.Y.Y in your “blackhole list”.

I understand that the IP address is still “blocked”, in spite of repeated and numerous attempts on behalf of Interbusiness, as well as by ourselves, and presumably others, to have the address removed from your list. The inclusion of this IP address in the list is currently causing a lot of problems, and I would ask you once again, very kindly, to please remove it, perhaps in consideration of the following explanation, which, in my opinion, should convince you of the anti-spamming policies not only regarding this case, or Interbusiness, but of many civilized nations like Italy and Germany, which have successfully banned spamming (without blocking any user’s email accounts).

I heard that one of the issues here is that Interbusiness does not have an “abuse(at)interbusiness.it” email address, nor a “policy” against spamming on its web site. About the email address, I spoke with them today, and made sure that they learned about your request, in case they didn’t already know. For your information, however, all Italian .it domains already have to have, by written contract, a fully operational “postmaster” account. This is regularly tested by the Italian NIC (the equivalent of the US InterNIC, which, as far as I am aware, does not enforce such a regular testing). Don’t you think that this means of contacting a domain administrator may suffice, at least for now?

As for the policy, in case you do not know this already, all Italian domain operators have signed an anti-spamming policy with the Italian NIC. Otherwise, they could not have an .it domain. Beyond this, since 1993 in Italy we have a set of laws specifically dealing with computer crimes, which make it a most serious offense to access somebody else’s server, forge email headers, disturb network operations doing things such as mass spamming, creating software and installing systems for doing all of this, etc. Under these laws, the people behind the piece of junk mail which caused you to take this action risk up to five years in prison, or perhaps more, and very high fines. In addition to that, we have a good set of court precedents dating back to the days of fax advertising.

Don’t you think, in consideration of this, that perhaps Italian ISPs should not be required by you to have a “policy” about spamming, since the contracts and the laws we already have are much more than a “policy”? Proof for this is, if there need to be any, that in Italy spamming was virtually eliminated years ago.

I do understand that the last remaining issue here is that Interbusiness’ SMTP server was misused by, probably, a US organization. I was told by Interbusiness – and I would like to share this information with you, in case you did not already know about it – that they are now evaluating different technical possibilities to prevent this from happening in the future. But please understand, they are a very large organization, with hundreds of servers, and such a decision and implementation is not as quick to put in practice as for a one-man ISP. I cannot speak for them, of course, but this is my opinion.

Beyond the technical aspects of the server’s misuse, it must be considered that several Italian laws have been violated. This US organization made a very big mistake in using a server in Italy, one of the countries where accessing and using somebody else’s computers is punishable under some of the toughest and most up-to-date computer crime laws anywhere. This could, maybe, even create a precedent for your cause, regarding US spamming (also see, perhaps, my notes on this issue on [edited]).

I can assure you that legal action is being considered by several parties, and more than one investigation is already in progress. The company named in the advertisement has been contacted, and seems to be very supportive about this case. They claim that a competitor of them has sent this message. Comparing with past advertisements of this company, which have already been listed, it should not be difficult for anybody to determine if the message was meant to discredit it. As far as I know, CompuServe security is now being contacted to see who was logged on and sent the mail to the Interbusiness server. Information about the two email accounts mentioned in the mail is also being collected. If you, with your technical experience, can find more information inside this piece of mail, or have any suggestions to identify the senders, please let us all know.

However, I must also remind you, if necessary, that, by deliberately continuing to directly or indirectly block email of Italian users, with destinations which potentially include all countries of the world, you are also exposing yourself to a variety of jurisdictions and laws. The companies involved here are big enough to set precedents in more than one direction.

To conclude, I can only say that it is my personal opinion that this case is being dealt with not only with due diligence, but probably with more efforts than any of the cases in which you “help within minutes”, as explained on your web site. For this reason, it is very difficult for me to understand why, after so many days, you are still blocking the email of thousands of Interbusiness users like ourselves.

Thank you again for what you can do.

Aftermath

The day after the letter was sent, the maintainers of the list replied, and “unblocked” the IP address. Little more than a month afterwards (in March 1998), Washington became the first US state to pass a law that makes it unlawful to send email advertising with forged or hidden header and other sender data, or containing misleading information in the subject line. Apparently, though, it did not help much…

2000 Update

More than two years after writing the above, I again had a chance to be surprised by the “creativity” of an organization presumably specializing in unsolicited commercial email. Actually, this episode made me dig a bit deeper into how “spammers” work.

While I was working at Cloanto’s Italian office in May 2000, somebody in the US (as confirmed by their ISP) began mailing unsolicited commercial email using a forged “@cloanto.it” address, using SMTP servers of what appeared to be unrelated third parties, located in different countries, including the US and Italy. The items being promoted in this episode were not the usual pills or get-rich-quick schemes, but the essence of “spam” itself: millions of email addresses, complete with bulk mailing software.

Within seconds, Cloanto’s mail servers started receiving the first bounces and complaints from some of the intended recipients. Within minutes, the activity was traced back to the presumed sources.

A first lawsuit was filed:

Anonymized version

Because the matter is still in the hands of police and judicial authorities in several countries, I am unable to provide full details. To make things more interesting, though, I can mention that the name of one of the individuals identified as part of my personal research into this specific episode is also mentioned in “Behind Enemy Lines – A Spammer’s Luck Runs Out when She Forges the Wrong Domain”, by “Man in the Wilderness”, which is an interesting account of a very similar episode, occurring at about the same time, of which however I was not aware at the time. If you want to learn more about how these organizations appear to operate, I recommend that you read Behind Enemy Lines.

The first lawsuit was based both on Italian civil laws and on more than 10 different articles of the Italian criminal code (Codice Penale), and was filed both on behalf of Cloanto IT srl and on behalf of myself as an individual. Since the episode involved, among other things, the alleged unauthorized access to third-party servers, unsolicited mailings to private citizens, use of server and connectivity resources, and the offering of commercial products, the laws invoked ranged from various computer crimes, to privacy, theft, unauthorized use of registered trademarks, unfair competition, and several others.

In theory, in consideration of the criminal charges which have been filed, it is my understanding that the individuals involved in this episode risk deportation to Italy and imprisonment. Maybe they will be extradited, or maybe they will be held at some airport upon entering the European Union, if they ever do. Or maybe nothing at all will happen. Italian justice is known to be a bit slow, and there are certainly more important matters than an episode of “spam” which requires relatively complex international procedures in a country which is not known to excel at speaking English. On one hand this is not a case of homicide, but on the other hand the cost of unsolicited commercial advertising is being estimated at billions of dollars (2002 data by Ferris Research and Gartner) per year. And if “spammers” started thinking twice before using email addresses, domain names and SMTP servers which could be covered by laws in any part of the world, maybe this will help.

2002 Update

Just a quick update about some trends I noticed during 2002:

The word “spam” has become so popular that I am going to begin using it without quotes, in spite of it originally meaning something else.
The use of addresses which never appeared on web sites and newsgroup posts suggests that a large amount of email data has been extracted from personal email address books or network communications via tools like email-collecting ActiveX controls, “virus” programs, and/or TCP/IP packet sniffers. In plain text: you can work as hard as you want to keep your addresses private, but if a friend of yours installs some malicious software by mistake, your address will ultimately be collected, one way or another.
Fake sender addresses in general are increasingly being used, with little or no respect for the abuse of real domains and addresses, and with intense rotation of multiple From addresses for the same mailing. I see fewer From addresses which, even if they traced back to over-quota mailboxes, at least appeared to be real. Changing the From address for each message or small group of messages is both a primitive attempt to avoid address-based filtering, and it more evenly distributes costs. For such an illegal practice to stay alive, finding out who “spammed” you or used your data in a fake From address has to remain considerably more expensive than the damage you sustained, which is an important part of the synergy of factors which is making the whole phenomenon possible in the first place (other important aspects include lack of prosecution, relative simplicity and anonymity compared for example to X.400 mail, and the cost of sending unlimited emails tending to zero).
Fake sender addresses from the same group of people as the recipient(s) appearing in the To field (by domain and as originally collected, e.g. via software running on user’s computer) are increasingly being used, the logic being that if a mail appears to be from a friend or co-worker it may more easily appear to be legitimate and pass through a filter.
Fake sender addresses are increasingly being used in email messages appearing to be sent by “anti-spam” activists and organizations. The messages are subtle enough to appear to be from the senders they claim to be, but irritating enough in the way they address the issue to be perceived as a nuisance.
Fake sender addresses are also increasingly misused by virus programs, suggesting that a single approach against easily forged sender addresses could help better control both problems.
Random destination addresses are increasingly being used (e.g. not only “sales” and “info” at any domain are being targeted, but also “jody”, “cliff”, etc.), further indicating how low the cost is to send out such emails and not having to care about consequences or proper data maintenance.
I am receiving spam in English (more than 90%), Spanish, Portuguese, Chinese, French, German, Italian, etc. Does this mean I know all of these languages? No. It is just another example of how inexpensive it is to randomly spam people.
Subject line and message body obfuscation (e.g. “V1AGRA” with the digit “1” instead of the letter “I”) and randomization techniques (one or more random words, or entire random paragraphs from long texts, e.g. books or web sites) are increasingly being used to make each message different from other messages. This is just an example of the dynamics of adaptation of one front as the other front introduces new tools to defend itself. As long as it is legal, I expect this adaptation process to continue. It only takes one smart programmer to empower a million technically unskilled “spam kiddies”.
Spam is increasingly used to advertise itself as a product or service.
As network connectivity is increasingly being offered in hotels and through public wireless access points, unsolicited commercial email is being increasingly sent via notebooks from hotel rooms and through wireless networks (intentionally public, or unsecured enough to be accessible by anybody driving by with a notebook and an antenna).
“NetBIOS Spam” or “Messenger Spam”: the Windows Messenger service is being used to convey pop-up window advertising messages through ports like 135, 137 and 139, which are used for other purposes as well, forcing network administrators to shut down yet another useful service. Under the protection of firewalls I was not personally aware of the magnitude of the phenomenon, until I saw the messages pop up every few minutes on a system which had just been placed online. Seeing is believing, they say.
Spam filtering products and services are getting increasingly popular, but not one of them can guarantee that non-spam messages (written by humans or machine-generated) are not filtered by mistake. Once again, the innocent senders and recipients of legitimate and possibly important emails pay the price of spam, and the careful and time consuming manual sorting of incoming mail remains the only reliable way to process unsolicited mail.
Some spam filtering products appear to generate more annoying mail than they eliminate. I keep receiving Turing test requests (to prove that I am human, i.e. not a spam sender) to confirm that a message “I sent” is not spam. Whoever developed such software apparently never considered that the majority of spam now uses forged sender addresses, so the message text is incorrect (it more or less explicitly accuses people of having sent something they never sent) and the message itself is annoying (as it “spams” people who have nothing to do with the spam mail). Once again, innocent people pay the price of spam…
Spam is making people angry, affecting their health and how they react to other people. Imagine going to work on a Monday morning, switching your computer on, finding 419 junk mail messages, and then taking an important call…
The fact that users have to increasingly hide and/or frequently change their email addresses, and/or resort to spam filtering products and services to defend themselves from spam, rather than being protected by laws, is increasingly supporting and legitimating spam itself. I am noticing several emerging parallels between this situation and that of “virus” and “antivirus” software.
Spam is increasing in volume and in variety, especially when it comes to more aggressively cultivating illegality and ignorance (e.g. alleged depictions of rape, modern-day snake oil formulas, get-rich-quick schemes, exam cheats, pirate software, cable and satellite TV cracks, etc.), which seems to tell a story about the perpetrators, their consumers, and the society which nourishes this.
The maintainers of The Spamhaus Project, who also operate the Register of Known Spam Operations (ROKSO), estimate that about 90% of all spam received by Internet users in North America and Europe is sent by a hard-core group of about 100-200 individuals (data released during the year ranged from “100+” to “180+”). They note on their site that “These known, professional, chronic spammers, many with criminal records for theft and fraud, are loosely grouped into gangs (‘spam gangs’) and move from network to network seeking out Internet Service Providers (‘ISPs’) with poor spam control and taking advantage of the slowness of some service providers to terminate them.”
Innocent people and organizations are increasingly paying the cost of unsolicited commercial email. I would never have imagined, back in 1998, that in 2002 I would not be able to send out an email containing the word “free” in the subject line, as it would be automatically classified as junk mail by some filters, unless I paid for a service which embeds copyrighted poetry in the email headers.
People are tired of spam. Really tired. “As seen on NBC, CBS, CNN, and even Oprah.”