near impossibility of archiving web experiences

9/1/2012

Leafing through the October 1954 copy of National Geographic, I noticed that I was paying far less attention to the article content (though the title of the article, “Spearing Lions with Africa’s Masai”, was a clear indication that times have changed) than to the peripheral things: the design, the typography, and above all, the advertisements. While the articles themselves may shine a light on the prevailing thinking in 1954, these other details say a lot more about the society of the time. This is probably universal: historians learn just as much from carelessly discarded ancient artifacts as they do from organized text.

Extrapolating in the other direction, future generations may also learn about us not just from our writings and conscious creations, but by reading between the lines and looking at the overall media experience we go through. Undoubtedly, the medium that defines our time is the Web. So it is natural to assume that future generations would like to know not only the content of our websites, but how exactly we experienced and interacted with them. For example, they may be curious about the phenomenon called Facebook, and would like to know what it was and what we did with it.

Throughout our history, especially in recent times, we have, in most cases, been able to save and archive our intellectual creations. Books were stored in libraries, music was passed on or recorded, paintings were saved or reproduced, and in the recent era even plays were recorded on film and stored. Today, an unprecedented number of human beings are involved in the creation of web content, so we must ask the question, are we archiving our work?

My claim is that we are not, and as things stand today, we are incapable of doing so. Because of some peculiarities of the technologies we use to build these websites, it will be practically impossible for anyone, even a decade from now, to see how today’s websites looked and behaved. We will have a few screenshots, we will have anecdotal descriptions, and we may have the actual textual and media content, but we will not be able to experience the working site.

In this article I’ll try to explain why that is the case, why it matters, and whether there is anything we can do to change it.

Web technology in a nutshell

Let me briefly describe what goes on behind the scenes when we point our browser to a web page. As soon as we express the desire to visit a web page, a request is sent to the particular computer that is hosting it. That computer, the server, returns the content of the web page in a special document format called HTML. The browser knows how to interpret this HTML document and render a visual representation of the page on your computer’s screen. In order to do that, the browser may request other chunks of information from other servers (e.g. a picture or a video). It also relies on other pieces of software to complete this translation. There are many such components, but the two most common ones are the browser’s JavaScript engine, which runs the scripts embedded in the page, and its CSS engine, which applies the page’s styling rules. These programs, which reside inside your computer, must correctly interpret the instructions on the page to display it correctly. Most modern web pages also request further pieces of information from the server based on how the user interacts with the page; these are often referred to as dynamic web pages.

Therefore, in order to display a page correctly, the user must have a modern browser with the correct versions of its JavaScript and CSS engines, and whenever the page makes a dynamic request, the server must send back the specific pieces of data the page expects. If any of these pieces is not fully compatible with what the page was designed for, the page will not render correctly.
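To make the additional requests described above a little more concrete, here is a minimal Python sketch, not a real browser. It reads an HTML document and lists the extra files (style sheet, script, picture) that a browser would have to request before it could render the page. The sample page, the class name SubresourceLister, and the function list_subresources are all made up for illustration, and the tag-to-attribute table is a deliberate simplification of what real browsers do.

```python
from html.parser import HTMLParser


class SubresourceLister(HTMLParser):
    """Collects the URLs a browser would request in addition to the HTML itself."""

    # Simplified, illustrative mapping: tag -> attribute that holds the URL.
    # Real browsers consider many more tags and attributes than this.
    URL_ATTRS = {"img": "src", "script": "src", "video": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append((tag, value))


def list_subresources(html_text):
    """Return (tag, url) pairs in the order they appear in the document."""
    parser = SubresourceLister()
    parser.feed(html_text)
    return parser.resources


# Hypothetical page, standing in for what a server might send back.
SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
</head><body>
<img src="photo.jpg">
<p>Hello</p>
</body></html>
"""

if __name__ == "__main__":
    for tag, url in list_subresources(SAMPLE_PAGE):
        print(tag, url)
```

Every file in that list has to be available, and has to be interpreted the way its author intended, for the page to look right. That is already three dependencies for a page that only says hello.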

The problem is that all of these underlying technologies are changing constantly, and unless the browser’s technology matches what the website expects, the site will not render correctly. Browsers try their best to remain backward compatible and to display older websites as well as possible, but they are fighting a losing battle. A browser today has a much better chance of displaying a website that was cutting edge ten years ago than one that used the latest technology available just three years ago. This problem is likely to become more acute as the rate of technological change accelerates.

Let’s archive

As we all know, all websites change over time, as does any other media outlet: magazines, television, radio. However, there is a subtle difference between the web and any other medium. If you want to see an old copy of a magazine, you just have to pull out the old copy, and there it is, advertisements and all. If you want to see an old TV program, you just have to record it, store it somewhere, and then play it back. For a website, however, the state of an older web page is not saved anywhere. If you want to see your own Facebook page from last year, there is no easy way to do so. The website’s design and content are changing constantly, and like a flowing river, what happened before has flowed by. In fact, if you want to see the page you saw a moment ago on The New York Times, you cannot really do that: if nothing else, you are likely to see the same news story with different advertisements, and maybe with a few new user comments added to it.

Though it is not as easy as recording a TV program, it is possible to “record” a web page. That is, there are technologies available that can store the HTML code that the web server sent out. This code can be played back later, and a suitable browser can then render the page as it appeared the first time.
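The idea of recording and playing back can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fake_live_server is a made-up stand-in for a real web server that serves the same story with a different advertisement on every visit, and PageArchive is a hypothetical name, not an existing archiving tool.

```python
import itertools

# Counter used only to make the pretend server return a new ad on each visit.
_ad_counter = itertools.count(1)


def fake_live_server(url):
    """Hypothetical stand-in for a live site: same story, new ad every time."""
    ad_number = next(_ad_counter)
    return (
        f"<html><body><h1>Story at {url}</h1>"
        f"<div>Ad #{ad_number}</div></body></html>"
    )


class PageArchive:
    """Stores the HTML exactly as the server sent it, keyed by URL and label."""

    def __init__(self):
        self._snapshots = {}

    def record(self, url, label, fetch):
        """Fetch the page once and keep the response unchanged."""
        html = fetch(url)
        self._snapshots[(url, label)] = html
        return html

    def replay(self, url, label):
        """Return the stored copy without contacting the server again."""
        return self._snapshots[(url, label)]


if __name__ == "__main__":
    archive = PageArchive()
    recorded = archive.record("news/story", "2012-09-01", fake_live_server)
    live_now = fake_live_server("news/story")
    print(archive.replay("news/story", "2012-09-01") == recorded)
    print(live_now == recorded)
```

The stored copy stays fixed while the live page keeps moving. Note that the sketch keeps only the HTML text; whether a future browser can still render that text faithfully is a separate question.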

First problem: dynamic content

Now, if the page includes any dynamic component, as most modern websites do, then every time the user interacts with the page, the web server must give exactly the same responses as it did the first time. This may happen if the recorded page code is played back almost immediately, or even within a few days, but it is highly unlikely that the website design will remain static for long. Eventually, the new website will no longer respond correctly when these dynamic requests are made. It is theoretically conceivable to save some of these dynamic responses as well, but in practice that would be very hard to do, since many of them depend on how the user interacts with the page, and how can we predict and save all possible user actions?
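A small Python sketch shows where saved dynamic responses run out. It assumes, purely for illustration, that each dynamic request can be identified by a (method, path) pair; DynamicReplayer and the sample requests are hypothetical and do not describe how any existing archive works.

```python
class DynamicReplayer:
    """Replays only those dynamic responses that were captured during recording."""

    def __init__(self):
        self._recorded = {}

    def record(self, request, response):
        """Remember the server's answer to one specific request."""
        self._recorded[request] = response

    def replay(self, request):
        """Return the saved answer, or fail if this request was never captured."""
        if request not in self._recorded:
            raise LookupError(f"no recorded response for {request!r}")
        return self._recorded[request]


if __name__ == "__main__":
    replayer = DynamicReplayer()
    # During recording, the user happened to open only the first page of comments.
    replayer.record(("GET", "/comments?page=1"), '{"comments": ["first"]}')

    # Playing back the same interaction works.
    print(replayer.replay(("GET", "/comments?page=1")))

    # A later visitor who clicks somewhere else hits a gap in the recording.
    try:
        replayer.replay(("GET", "/comments?page=2"))
    except LookupError as error:
        print("cannot replay:", error)
```

The recording covers only the paths someone actually took through the page. Every click, scroll, or search that was not captured is a request the archive cannot answer.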

Second problem: changing web technologies

Let’s say we managed to store the original page code, and also stored all the possible dynamic responses that the server made. With all of that at our disposal, we should be able to replicate the website’s behavior exactly at a later date. The problem is that we still need a browser that interprets this code exactly as browsers did when the website was originally designed. This is an extremely difficult task. As I mentioned before, browser technology, JavaScript, CSS, etc. are changing constantly, so configuring a computer in the same state is difficult. Even if we save copies of these programs from the time when we saved the web page content, they are not likely to run on a newer operating system. So, in order to play back the saved page, we have to save copies of the operating system, the browser, and all the other support software that was available when the website was created.

Third problem: changing hardware

Let’s say we saved copies of all the necessary software along with the saved copies of the web page. Even after all that, it is unlikely that they will run on future computer hardware. So the only viable solution is to actually keep a computer from the same era, unchanged, and then load the saved web page code on that machine. This is a sound solution in principle, but for all practical purposes it is almost impossible to carry out. We would have to keep a computer from every period in history and use the right machine to play back websites from that era.

Even if we are willing to do that, what is the likelihood that a physical computer that we archive today will actually work when plugged in 50 years from now? Do we then also keep a storehouse full of spare parts from each period?

Too expensive

The above discussion makes it clear that even if we can solve all the technical hurdles, doing so will have an enormous cost. Since the historical value of this can only be appreciated by our future generations, it is unlikely anyone today will be able to justify the enormous cost of creating such an extensive hardware and software archive. The value to us, today, of being able to see a Facebook page from two years ago is too small to justify any major effort. That probably explains why we have not done much towards solving this problem.

Possible solutions

There are a number of web archives, some operating since the early days of the web (e.g. http://archive.org/web/web.php). They try to store copies of the HTML page and the associated media files. While this is of enormous value, they have not been able to solve the problems discussed above. As a result, they are more successful at rendering older web pages than more recent ones.

One possible solution is to store functioning web page experiences as video streams. That is, to create an automated system that exercises a website by interacting with it the way a human would, and records the screens as a video stream. There are two major technical hurdles here: (a) it is a non-trivial problem to create an artificial intelligence program that can interact with the page in a sensible and human manner, and (b) with the new range of human-machine interaction through touch gestures, physical movement, and voice, it is not at all easy to simulate a human user’s actions.

Therefore, even if we can build such a system, it will only be a rough approximation of what the users actually experienced. Moreover, the viewer of this video stream will get only a passive view of what the website looked like, and will not be able to experience the interaction that the original users had.

Never before

What is truly remarkable is that this is the first time in modern history that we, as a race, are spending so much time on an activity (both the creation and the use of websites), and yet we have no practical means of archiving it for posterity, nor much desire to do so. That is how we have always dealt with our real-life experiences: we could never archive them, and they existed only in the present and in our memories. Now our creations are joining that rank. Maybe it is time to recognize that we are going through a fundamental shift, that we are moving into a world where everything that matters is just in the present. Maybe we are transitioning from a society that valued the continuity of history to a society that connects only to the present time, a Twitterized world where nothing needs to be permanent, just a continuous flow of information. We are standing beside a river and watching it flow by, enjoying what is in front of our eyes, with no need to run along the banks to see what floated by before.

Older Comments (2)
1. Michael Riedijk said on 11/16/12 - 10:53PM
Kunal, I read your blog post with great interest. I agree with the challenges you are bringing forward here. The reality is that technology is moving very fast. My development team at PageFreezer.com has actually already overcome the 3 hurdles you’re describing here. PageFreezer is able to capture and live replay complex interactive websites with JavaScript, Ajax, CSS3, HTML5 etc. This works without any manual configuration or customization. PageFreezer is working with leading Fortune 500 companies and government agencies exactly for that: to accurately preserve their marketing web communications. Contact me if you want to know more: [email protected] Regards, Michael
2. Kunal Sen said on 11/19/12 - 10:51AM
Michael, thank you for your comment. I am an outsider when it comes to Web archiving, so I am honored that you took the time to read it. I am also excited to know that PageFreezer can handle some of the technical problems, such as JavaScript, Ajax, CSS etc. However, when the OS, browsers, and other support technologies have changed, and most probably even the hardware platforms, how will anyone play these sites back? If the answer is some type of special player that you have developed, then my question would be: why should I assume that your company will be around 25 years from now? Most of the computer companies that existed 25 years ago are gone now; only a tiny fraction survived. Therefore, while software like yours may be very useful for short-term legal use, I am very skeptical about how it can help in historical preservation, where we are talking about time scales of decades and centuries.
