
University of Amsterdam

Thesis

MA Archival and Information Studies

Reference rot in governmental information provision: extent of the problem and potential solutions

Author

Lotte Wijsman

Supervisor

Dr. A. Dekker

January 31, 2021 Word count: 22,972


Contents

Introduction
1 Theoretical Framework
  1.1 Reference Rot
    1.1.1 Link Rot
    1.1.2 Content Drift
  1.2 Solutions
    1.2.1 Web Archiving
    1.2.2 Persistent Identifiers
2 Data and Methodology
  2.1 Estimating Reference Rot
    2.1.1 Link Rot
    2.1.2 Content Drift
    2.1.3 Association Link Rot and Content Drift
  2.2 Solutions
3 Results
  3.1 Estimated Reference Rot
    3.1.1 Estimated Link Rot
    3.1.2 Estimated Content Drift
    3.1.3 Association Link Rot and Content Drift
  3.2 Solutions
    3.2.1 Characteristics of the Solutions
    3.2.2 Proportionality
Discussion
  Conclusion
  Limitations


Introduction

It is said that history needs to be studied in order for it to stop repeating itself. To progress as a society, it is important to learn from the mistakes of the past.1 Over the course of time, disaster has struck several times, causing information objects to be lost for good.2 The most famous example of this is the library of Alexandria, which burned down, losing countless manuscripts in the process. Later, printed books decayed as their paper wilted away. These losses are still dreaded today, as much information has been lost that may have brought us great works of ancient, and less ancient, times. Today, we might be on the brink of another great loss of information: the World Wide Web.

The contents hosted on the web are quite volatile in nature. Every day, new web pages appear, while others disappear. Some estimate the average life span of a web page to be only 44 days.3 Moreover, the content of some web pages changes continuously. The web is therefore an entity that is constantly shifting and changing. However, it is sometimes of crucial importance to retain the current version of information, and a conscious effort must be made to preserve parts of its contents. Over the years, researchers focussed on the retention of referenced resources in academic literature have scrutinised the severity of the problem of web pages disappearing and contents changing.4 In 2014, a research project named the Hiberlink project used the term reference rot to describe these issues.5 This term encompasses both of the aforementioned issues of web pages becoming unavailable and their content changing. The first component of reference rot is link rot, which causes links to become inaccessible and users to encounter the infamous '404 Page not Found' error code. The second component of reference rot is content drift, which is far harder to establish because of the variety of components (textual and visual) that can be involved. Estimating

1History Repeating, College of Liberal Arts and Human Sciences, Virginia Tech, accessed January 2nd,

2021, https://liberalarts.vt.edu/magazine/2017/history-repeating.html.

2An information object is a Data Object together with its Representation Information. From: "Reference Model For An Open Archival Information System (OAIS)," accessed September 20th, 2020, https://public.ccsds.org/Pubs/650x0m2.pdf, p. 12.

3B. Kahle, Preserving the Internet, accessed November 29th, 2020,

web.archive.org/web/19970504212157/http://www.sciam.com/0397issue/0397kahle.html.

4M. Klein et al., Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot, PLoS ONE 9, no. 12 (2014): p. 2. https://doi.org/10.1371/journal.pone.0115253.


the severity of content drift is also more dependent on the designated community, since its members can decide whether the change of a single word in a sentence is vital to the essence of that web page or not.6

Assessing the literature surrounding reference rot within this thesis will be done from the perspective of the field of library science. The Hiberlink project, and other research on this subject from the field of library science, oftentimes consider references in academic publications. Their focus lies on providing access to journal articles while also keeping these accessible themselves by checking whether the resources referenced in those works are still available.7 Sometimes, without access to these references, the article itself

cannot be understood. However, the issues of link rot and content drift, in combination with the information provision of semi-governmental and governmental organisations, are still rather underexposed.8 Together, these organisations produce enormous amounts of information each year. This information comes in the form of web pages, but also in PDFs that are embedded within them.

Why should information from the semi-government and government be preserved? One of the primary reasons is to comply with the Public Records Act of 1995. Article 3 states that governmental bodies must deliver and keep the records held by them in good, orderly, and accessible condition.9 Moreover, semi-governmental and governmental institutions are increasingly sharing information exclusively online. This makes certain web pages the one and only available capture of certain information. Losing them could mean a direct loss of information. In 2015, more than 68% of Dutch individuals made use of (semi-)governmental websites. Of these individuals, the vast majority expressed mainly using the websites to search for information.10 Combining these statistics with the fact that no more than 2% of Dutch governmental organisations archive their web

6The designated community consists of potential users who should be able to understand the preserved

information. From: OAIS, p. 21.

7About, Hiberlink, accessed September 3rd, 2020, http://hiberlink.org/about.html.

8To the best of the knowledge of the author at the time of writing, no academic literature on the issue of

reference rot for (semi-)governmental organisations is available.

9Archiefwet 1995, Overheid.nl, accessed November 11th, 2020,

https://wetten.overheid.nl/BWBR0007376/2020-01-01.

10ICT, kennis en economie 2016, Statistics Netherlands, accessed December 3rd, 2020,


pages creates the potential for the loss of a large amount of information.11 A recent example of problems with the information provision by Dutch (semi-)governmental organisations causing difficulties for users was the month-long period in 2018 when the systems of Dienst Uitvoering Onderwijs (DUO) were offline. Students were unable to view the status of their student loans or their personal information within the system, or to make administrative changes to the study they were enrolled in.12 On the day the system became available again, more than 11,000 changes to personal information fields were requested by students.13 This large increase in traffic even caused certain parts of the website to be temporarily unavailable. This eagerness to make changes as soon as possible illustrates the hardship experienced by students when they are unable to access their information from (semi-)governmental organisations. By focussing on preserving the diverse offer of information that the (semi-)government provides, this thesis aims to provide readers with a new angle on the problem of reference rot and its potential solutions. Specifically, this thesis will not follow the line of previous research by considering the problem of reference rot in the academic context. Rather, it will consider reference rot in the information provision of (semi-)governmental organisations. In particular, the problems and solutions identified by the Hiberlink project in the context of academic literature will be applied to information provided by (semi-)governmental organisations. The fact that 68% of all Dutch citizens make use of digital information and the recent hardship caused by the DUO system being offline for an extended period of time illustrate the importance of also paying attention to information provided by (semi-)governmental organisations.

Specifically, this research will attempt to quantify the degree to which reference rot is present in the provision of digital information by semi-governmental and governmental organisations, in order to give an indication of the magnitude of the problem. The magnitude of the problem could influence what rigour of solution is proportionate to alleviate

11Webarchivering bij de centrale overheid, Information and Heritage Inspectorate, accessed November

26th, 2020, https://www.inspectie-oe.nl/binaries/inspectie-oe/documenten/rapport/2016/12/8/rapport-webarchivering/Rapport+Erfgoedinspectie+webarchivering+bij+de+centrale+overheid.pdf, p. 7.

12Moet DUO echt een maand offline vanwege een nieuwe site?, RTV Noord, accessed January 22nd, 2021, https://www.rtvnoord.nl/nieuws/190856/Moet-DUO-echt-een-maand-offline-vanwege-een-nieuwe-site.

13MijnDUO site extremely busy following major overhaul, Erasmus Magazine, accessed January

22nd, 2021, https://www.erasmusmagazine.nl/en/2018/05/02/topdrukte-bij-mijnduo-na-grootscheepse-vernieuwing/.


these concerns. Given that different (semi-)governmental organisations have different financial means and technical resources, it is important to obtain an estimate of the extent of the problems that can be caused by the loss of information associated with reference rot. The severity of these issues will help determine how rigorous a solution is proportionate. The researchers behind the Hiberlink project provided their readers with two particular solutions to the problem of reference rot: web archiving and persistent identifiers. Web archiving, as defined in this thesis, is a way to create snapshots of web pages at different points in time. Web archiving consists of four steps: selection, harvest, preservation, and access of web pages.14 Web pages deemed important are selected and then harvested with the assistance of a web crawler.15 After this, a check can be done to ensure the harvest has been performed properly. Subsequently, the web page can be made available by means of a viewer. By making the snapshots available with a viewer, users can travel back in time to these different snapshots to view the web page. The other solution is to make use of persistent identifiers. These come in different forms that can be applied in different ways. Perhaps the most widely known persistent identifier scheme is the Digital Object Identifier (DOI). This can be applied to an information object such as a PDF.16 However, it is also possible to use PURLs to create a URL that is less susceptible to link rot.17
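To make the idea of a persistent identifier more concrete, the short sketch below (a minimal illustration, not part of the methodology of this thesis) resolves a DOI through the public doi.org resolver and prints the URL the identifier currently points to; if the publisher later moves the article, the DOI stays the same and only the registered redirect changes. The DOI used is that of the Hiberlink article cited above.

    # Minimal sketch: resolving a DOI through the public doi.org resolver.
    # Standard library only; the DOI below is the Hiberlink article cited in this thesis.
    from urllib.request import urlopen

    def resolve_doi(doi: str) -> str:
        """Follow the doi.org redirect chain and return the URL the DOI currently points to."""
        with urlopen(f"https://doi.org/{doi}") as response:
            return response.geturl()

    if __name__ == "__main__":
        print(resolve_doi("10.1371/journal.pone.0115253"))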

Within the two solutions of web archiving and persistent identifiers, there are many choices to be made to fit specific situations. This leads us to the research question of this thesis: How prevalent is reference rot within websites of semi-governmental and governmental organisations, and what potential solutions are proportional to the importance of overcoming this potential loss of information?

The term reference rot has its limitations for this research, because it generally refers to references present in academic articles. This research will not focus on academic references. Nonetheless, the choice to still make use of the term reference rot is made because it encompasses both link rot and content drift. Moreover, the terminology is in line with

14Further elaboration on these phases will be provided in Chapter 1.

15Although it would also be possible to harvest web pages with methods other than web crawling, such as screen-casting and screen-recording, this thesis will focus only on this method.

16H. Hilse and J. Kothe, Implementing persistent identifiers: overview of concepts, guidelines and recommendations (London: Consortium of European Research Libraries; Amsterdam: European Commission on Preservation and Access, 2006), p. 24.


the Hiberlink project.18 Further definitions are often taken from the Reference Model for an Open Archival Information System (OAIS). This model contains the recommended practice on what is required for 'archives to provide permanent, or indefinite Long Term, preservation of digital information.'19

This research will start with an extensive literature review on the several elements present in the research question. It starts with an extension of the problem statement by presenting relevant considerations and reviewing the literature on reference rot: what estimates have previously been made concerning link rot and content drift, and which methods have been used? After this has been reviewed, the two solutions to reference rot, web archiving and persistent identifiers, will also be analysed by reviewing the literature related to the workings of these concepts.20 Important here is also how both solutions could be applied. Web archiving is a process of four steps, within which several decisions can be made. This multitude of options also applies to the solution of persistent identifiers. These can be applied to single documents, but also to entire web pages.

In order to establish the prevalence and severity of reference rot within websites of semi-governmental and governmental organisations, and to analyse which solutions are proportional to the importance of overcoming this potential loss of information, multiple methods are employed. The first part relates to the first part of the research question: the estimation process. Estimating link rot had to be done rather differently from the previously mentioned literature, which focussed mostly on references to web resources in academic articles. By using a web crawler, 404 errors were sought. Moreover, the research will analyse a website error log of the National Archives of the Netherlands from the period of September 17th until September 25th, 2020. The other problem, that of content drift, is a rather difficult and intangible concept, since it can be based on textual differences, but also on differences in visual elements.21 Due to technical limitations, it was decided that following the Hiberlink project was the best practice in this case. Although this thesis recognises the value of

18The Hiberlink project conducts similar research but focusses on the field of academia rather than information provided by (semi-)governmental organisations.

19OAIS, p. ii, 3.

20Further elaboration on the methods of this thesis will be provided in Chapter 2.

21S. Jones et al., Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content,


also considering visual elements, this was not feasible within the scope of this study. This means that an exclusive focus will be laid on textual differences, instead of incorporating visual differences as well. This will be done using a tool from the Internet Archive that is currently in the beta stage of development, called 'Changes'.22 This tool gives the percentage of textual change between two snapshots of a web page. By creating a dataset in which these percentages for different semi-governmental and governmental websites are collated, an estimation can be made. After estimations have been made of both link rot and content drift, an attempt will be made to find out whether there is an association between the two. Are link rot and content drift positively or negatively correlated? Understanding this could provide further insights into why link rot and content drift may occur.
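To illustrate the two measurements in miniature, the sketch below is an illustration only; it is not the crawler, error-log analysis, or Internet Archive 'Changes' pipeline used in this research. It checks whether a URL still resolves or returns an error such as 404, and computes a rough percentage of textual change between two captures of a page. The URL and snapshot texts are placeholders.

    # Illustrative sketch only: a status check for link rot and a rough textual
    # change percentage for content drift. Not the pipeline used in this thesis.
    import difflib
    import urllib.error
    import urllib.request
    from typing import Optional

    def http_status(url: str) -> Optional[int]:
        """Return the HTTP status code, or None if the host cannot be reached at all."""
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.status
        except urllib.error.HTTPError as error:
            return error.code              # e.g. 404 Page Not Found, 403 Forbidden
        except urllib.error.URLError:
            return None                    # DNS failure, timeout, connection refused

    def text_change_percentage(old_text: str, new_text: str) -> float:
        """Rough percentage of textual change between two snapshots of a page."""
        similarity = difflib.SequenceMatcher(None, old_text, new_text).ratio()
        return round((1 - similarity) * 100, 1)

    if __name__ == "__main__":
        # Placeholder URL and snapshot texts, purely for illustration.
        print(http_status("https://www.example.org/"))
        print(text_change_percentage("The reading room opens at 9:00.",
                                     "The reading room opens at 10:00."))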

The second part of the method consists of assessing which solutions are proportional to the importance of overcoming the problem of reference rot. After analysing the literature on the two solutions, it is also important to assess the characteristics of the solutions. This is important to ensure that the proposed solutions are in fact feasible options for the (semi-)governmental organisations. Simply minimising the extent of reference rot may not be the only consideration, as this may come with extraordinary financial costs. Therefore, attention will not only be paid to the effectiveness of the solutions, but also to the direct costs and technical requirements associated with implementing them. This should ensure that no solutions are proposed that are not proportional to the financial means of the organisations or the degree to which reference rot actually poses a problem to them. This is of particular importance given the moderate amount of reference rot that was found, as will be discussed further below.

The combination of qualitative and quantitative analysis results in several conclusions and recommendations. Proof of reference rot was found and, based on this, the solutions could be compared. Given the complicated nature of the issue of reference rot, there is no one-size-fits-all solution available. Different organisations and different web pages need solutions tailored to their needs. Web archiving is a solution that requires a large investment to create and further develop. This also requires an investment

22Wayback Machine adds `changes' feature, Search Engine Land, accessed November 2nd, 2020,

https://searchengineland.com/wayback-machine-adds-changes-feature-318922. Further information on the `Changes' feature will also be provided in Chapter 2.


of labour hours to make decisions and select the right settings in order to appraise what needs to be archived. This also requires the right technical knowledge to be present in the organisation. While this can possibly be achieved by large organisations, it will arguably not be the case for small organisations that lack the knowledge and human capital. An option is to make use of already established web archives, such as the Internet Archive, but this also carries its own disadvantages.23 The solution of persistent identifiers might be a better fit for these small organisations. Another thing to keep in mind with these solutions is that they need to fit the needs of the designated community. While the community might have sustainable access to the information objects, the objects might still not be usable because of a lack of knowledge, and they may not be found easily.

This paper will proceed by first providing the reader with a theoretical framework. Subsequently, methods are introduced in order to generate the results of this thesis. Using these methods to estimate the issue of reference rot within websites of semi-governmental and governmental organisations then makes way for a focus on potential solutions: which potential solutions fit best for which situations and which organisations? Together, these elements lead to a further discussion of the research, in which the conclusions are shared together with the limitations of this research and a final recommendation.


1 Theoretical Framework

1.1 Reference Rot

The term reference rot originates from the field of academia. Scholars often reference sources in their work. These references used to refer to physical books and journals, but this has changed over the past couple of years. Paper-based academic literature has shifted towards web-based literature. Safeguarding access to academic literature is essential, as the references allow others to check where scholars got their information from and what they have based their own conclusions on. By providing access to these references, one of the main principles of science will remain: to support the reproducibility of results.24

Reference rot is the overarching term for the issues of link rot and content drift. These issues deal with the change and disappearance of web pages. The term reference rot was first introduced by the Hiberlink project. This project was a two-year study that ran from 2013 to 2015, which aimed to estimate the degree of reference rot in online scientific and other academic articles.25

The definitions created and used by the Hiberlink project are the following:

• Link rot: resources identified by a Uniform Resource Identifier (URI) could become obsolete, and therefore a URI reference to these information objects may no longer grant access to the content that is referenced.

• Content drift: a resource identified by a URI could change as time progresses; therefore, the content at the destination of the URI could change, potentially to an extent where it is no longer representative of the originally referenced content.26

Other papers concerning link rot use other definitions. David C. Tyler and Beth McNeil talk about 'dead' URLs that return the '404 Page Not Found' or '403 Forbidden' error, or similar error messages.27 Overall, however, it is agreed upon that link rot provides the

24Klein et al., Scholarly Context Not Found, p. 1. 25Hiberlink, About.

26Klein et al., Scholarly Context Not Found, p. 2.

27D. Tyler and B. McNeil, Librarians and Link Rot: A Comparative Analysis with Some Methodological


user with the '404 Page not Found' error code.28 Content drift is more difficult to define, since it does not give a clear error code. Papers often write about web change, but do not specify what this change entails.29 Mostly, they indicate textual change, but do not consider subtleties such as contextual changes.30

Before more can be explained about reference rot, it is important to understand what the Uniform Resource Identifier (URI) exactly is, as it is distinct from the more familiar Uniform Resource Locator (URL). The URI concept was first published by Tim Berners-Lee in 1994 under the name Universal Resource Identifier.31 The URI could be classified in three ways: a name, a locator, or both name and locator. Since 1994, these classifications of name and locator have moved in their own direction and formed the Uniform Resource Name (URN) and the URL.32 The URN is a location-independent identifier with a defined name. A well-known example of this is the International Standard Book Number (ISBN). The URL, on the other hand, refers to a location, a network-based resource. These are often web pages accessed with HTTP. It is said that a URL is always a URI, but not every URI is a URL. This is because the URI can also take the form of a URN if the resource is not network-based. This thesis will refer to URIs only if they are explicitly mentioned in the literature. The term URL is more fitting, since this thesis focusses exclusively on network-located resources, which is what URLs identify.33
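As a brief illustration of this distinction (the identifiers below are generic examples and not taken from the datasets of this thesis), Python's standard URL parser shows that a URL carries a network location while a URN does not:

    # Small illustration of the URI / URL / URN distinction, using generic examples.
    from urllib.parse import urlparse

    examples = [
        "https://www.example.org/page",  # a URL: a locator with a network location
        "urn:isbn:9789012345678",        # a URN: a name, no network location
    ]

    for uri in examples:
        parsed = urlparse(uri)
        kind = "URL (locator)" if parsed.netloc else "URN (name)"
        print(f"{uri} -> scheme={parsed.scheme!r}, {kind}")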

The Hiberlink project did not come out of thin air. It was preceded by a pilot study from the Los Alamos National Laboratory in 2011. This study, which focussed on the persistence of academically referenced web resources, found significant amounts of references being either

28Link Rot: The Web is Decaying, The Arweave Project, accessed January 14th, 2021, https://arweave.medium.com/link-rot-the-web-is-decaying-cc7d1c5ad48b; Link Rot, Techopedia, accessed January 14th, 2021, https://www.techopedia.com/definition/20414/link-rot.

29J. Cho and H. Garcia-Molina, Estimating Frequency of Change, ACM Transactions on Internet Technology 3, no. 3 (2003): p. 256-290. http://dx.doi.org/10.1145/857166.857170; E. Adar, J. Teevan, S. Dumais, and J. Elsas, The Web Changes Everything: Understanding the Dynamics of Web Content, in Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM'09 (2009): p. 282-291. http://dx.doi.org/10.1145/1498759.1498837.

30More on this will be discussed later on in this paragraph. 31Hilse and Kothe, Implementing Persistent Identifiers, p. 4. 32Ibidem.

33Networking Basics: What's the Difference Between URI, URL, and URN?, CBT Nuggets, accessed October 14th, 2020, https://www.cbtnuggets.com/blog/technology/networking/networking-basics-whats-the-difference-between-uri-url-and-urn.


no longer available or not being properly archived.34 The large degree of link rot found in the study conducted by the Los Alamos National Laboratory inspired the setup of the Hiberlink project to further examine the extent of reference rot.

1.1.1 Link Rot

A question that naturally arises is why links rot. As aforementioned, URLs are used to access network-based resources. There are several reasons why links rot. Content could, for instance, be renamed, moved, or even deleted. Another reason may be that the entire website no longer exists. Link rot may even occur through the change or addition of a single letter: references might point towards an 'http:' address, while the correct address needs to be 'https:'. Moreover, the content, design, and infrastructure of websites are constantly changing, making them difficult to manage and increasing the likelihood of link rot. The further expansion of the web has also made it increasingly difficult to keep up with.35 Moreover, it could be the case that links to external websites no longer work. The upkeep of these web pages is outside the control of the custodian managing the website from which the reference originates. This is what happens with academic papers: while the paper itself is still available at the original location, the resources it references have been lost.

The Hiberlink project sought a way to estimate how serious the problem of link rot was. They conceded that part of the problem was already solved, since journal articles were already making use of Digital Object Identifiers (DOI), a persistent identifier scheme. Therefore, their focus lay mainly on the so-called 'web at large resources'. These are distinct from journal articles and can point to a wide range of web content. These types of resources are also present in this thesis and can potentially render some information useless in the future, since the sources can no longer be verified. Without access to these resources, an

34The study found that for the repositories from arXiv and the University of Texas, 30% and 28% respectively of all URLs had rotted. A further significant share was found to be still available, but at risk due to not being properly archived. From: R. Sanderson, M. Philips, and H. Van de Sompel, Analyzing the Persistence of Referenced Web Resources with Memento, arXiv preprint (2011): p. 1. arXiv:1105.3459.

35The growing problem of Internet link rot and best practices for media and online publishers, Journalist's Resource, accessed September 20th, 2020, https://journalistsresource.org/studies/society/internet/website-linking-best-practices-media-online-publishers/.


entire publication can be rendered unusable, since it can no longer be fully understood.36

The Hiberlink project considered three extensive corpora.37 Together, these repositories contain over 3.5 million scholarly articles. The study further refines the sample of scholarly articles by setting several filters. First of all, the sample was limited to articles that were published recently enough to be likely to include references to web at large resources, whilst also being published sufficiently long ago that link rot had been given time to manifest itself.38 It seems advisable to exclude recent publications from the sample, as including them would likely lead to an underestimate of the degree of link rot.39 Secondly, articles for which technical issues arose were also excluded.40 Eventually, this left the researchers with 392,939 articles that use URI references to web at large resources. In these articles, 1,059,742 URI references to web at large resources were found.41

Table 1: Percentage of link rot found per corpus. From: Klein et al., Scholarly Context Not Found, p. 14-15.

Year   ArXiv   Elsevier   PMC
1997   34%     66%        80%
2005   18%     41%        36%
2012   13%     22%        14%

The table clearly shows that when a greater amount of time has passed, link rot becomes more prevalent.42 Moreover, the Hiberlink project found that articles increasingly use

36Klein et al., Scholarly Context Not Found, p. 3.

37Specifically, the repositories from arXiv, Elsevier, and PubMed Central (PMC).

38Specifically, articles within the three repositories that were published between January 1997 and December 2012 were used in the sample created in 2014. From: Klein et al., Scholarly Context Not Found, p. 7-8.

39Ibidem.

40Certain downloads failed because the collaborating institutions of the research project, such as the Los

Alamos National Lab and the University of Edinburgh, did not have a proper subscription to all journals. Another problem was related to the software used to download the XML-formatted articles. From: Idem, p. 8-9.

41Idem, p. 12-13. 42Idem, p. 14-15.


references to web at large resources.43 It can therefore be deduced that the problem of link rot

in academic articles will likely progress over time.

Other previous papers on link rot have made various estimates of the degree of link rot over time. Some of the researchers found moderate percentages of URLs that had rotted. Matthew Falagas et al. studied URLs referenced in two leading medical journals. At the time of their research in November 2006, they found that respectively 10.1% and 11.4% of the URLs in publications from the period of October 2005 to March 2006 had rotted. However, when they looked at a less recent period, November 2003 to January 2004, these figures increased to 27.2% and 30.6%. Not only did they conclude that the degree of link rot will increase over time, they also stated that, since they only reviewed primary medical journals, other journals, for example those covering interdisciplinary subjects, could encounter higher degrees of link rot.44 However, their conclusion on the propagation of link rot as time passes was based on different sets of URLs. Perhaps more suited is the research done by Mary Casserly and James Bird. They examined the persistence of URLs from library and information science journals from 1999-2000 twice: in 2002 and between August 2005 and June 2006. By using the same URLs, a more accurate progression of link rot could be given here. In the three years that had passed after the original study, the availability of the referenced URLs had dropped by 17.4%, leading to 61% of links that had rotted.45 These estimates are in line with other research that considered the degree of link rot several years after the links were referenced. Specifically, research by Daniela Dimitrova and Michael Bugeja, and by Cassie Wagner et al., found estimates ranging from 48.7% to 49.3%.46 However, many of these studies focussed on references from journals in a specific field of academia. The Hiberlink project went beyond this by considering a variety of repositories, as discussed previously.

43Idem, p. 26.

44M. Falagas, E. Karveli, and V. Tritsaroli, The risk of using the Internet as reference resource: A

comparative study, International Journal of Medical Informatics 77 (2008): p. 281.

45M. Casserly and J. Bird, Web Citation Availability: A Follow-up Study, Library Resources and Technical Services 52, no. 1 (2008): p. 50-51.

46D. Dimitrova and M. Bugeja, The Half-Life of Internet References Cited in Communication Journals,

New Media Society 9, nr. 5 (2007): p. 816-817; C. Wagner et al., Disappearing Act: Decay of Uniform Resource Locators in Health Care Management Journals, Journal of the Medical Library Association 97, nr. 2 (2009): p. 123-124.


As discussed above, there is robust academic literature on the degree of link rot in academic papers. Although the durations over which the links could potentially become obsolete differ between papers, many of them find significant decreases in available links over time. There may, however, be differences in the degree of link rot between different fields of academia. This heterogeneity in degrees of link rot for different topics further underlines the importance of considering the degree of link rot specifically for (semi-)governmental organisations.

The Hiberlink project also went further than other papers on link rot, by not only checking if references were still active, but also scrutinising if the content on the web page was still the same.

1.1.2 Content Drift

Following their research on link rot, the Hiberlink project used the same dataset to investigate the concept of content drift. It should be noted that assessing content drift is more challenging than assessing link rot. There is no clear HTTP status code for content drift, such as the 404 error for link rot. Moreover, when does something constitute content drift? Has it occurred after the change of a single letter, or should the change be more material in nature? While there are many different approaches to take with this abstract concept, the Hiberlink project focussed solely on textual differences in these web at large resources.47

The Hiberlink project adopted a two-step approach to assess the degree of content drift. The first step concerned locating snapshots of the referenced web at large resources.48 These snapshots mostly contained text/html content, which led to the choice to focus specifically on textual differences. Only snapshots containing the three most frequently used content types (text/html, application/pdf, and text/plain) were selected to be retained for the analysis. It should be noted that this limited the Hiberlink study to these specific types of

47Jones et al., Scholarly Context Adrift, p. 1.

48For every reference within the three aforementioned repositories, the Hiberlink project attempted to collect three snapshots. These snapshots were collected from 19 different web archives. Specifically, two snapshots close before and after the publication date were obtained, as well as one snapshot taken after considerable time had passed.


content. In more recent times, with the onset of Web 2.0, more dynamic types of content have grown in relative importance.49

The Hiberlink study classifies content drift as having occurred for every information object where a textual change has occurred, regardless of the extent of the change. Specifically, the Hiberlink project classifies content drift as having taken place whenever their computed aggregate similarity indicator is below 100. Their aggregate similarity indicator is constructed as the average of four other similarity measures, each normalised to 100. This implies that if only one of the four measures finds some textual change while the others find that the text is identical, the aggregate measure will still be below 100. The Hiberlink project may thereby overestimate content drift.
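The aggregation logic described above can be sketched as follows. The two measures used here (a character-level ratio and a word-set Jaccard index) are illustrative stand-ins and are not the four measures computed by the Hiberlink project; the point is only that averaging several measures normalised to 100 flags drift as soon as any one of them detects a change.

    # Hedged sketch of an aggregate similarity indicator on a 0-100 scale.
    # The two measures below are stand-ins, not the Hiberlink project's four measures.
    import difflib

    def char_similarity(a: str, b: str) -> float:
        """Character-level similarity, normalised to 0-100."""
        return difflib.SequenceMatcher(None, a, b).ratio() * 100

    def jaccard_similarity(a: str, b: str) -> float:
        """Word-set overlap (Jaccard index), normalised to 0-100."""
        words_a, words_b = set(a.split()), set(b.split())
        if not words_a and not words_b:
            return 100.0
        return len(words_a & words_b) / len(words_a | words_b) * 100

    def aggregate_similarity(old_text: str, new_text: str) -> float:
        """Average of the individual measures; 100 means no measure saw any change."""
        measures = (char_similarity, jaccard_similarity)
        return sum(measure(old_text, new_text) for measure in measures) / len(measures)

    if __name__ == "__main__":
        old = "Opening hours: 9:00 to 17:00, Monday to Friday."
        new = "Opening hours: 10:00 to 17:00, Monday to Friday."
        score = aggregate_similarity(old, new)
        print(f"aggregate similarity = {score:.1f}")
        print("content drift detected" if score < 100 else "no textual change detected")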

A limitation of the Hiberlink research must be noted here. While sometimes the percentage of change is small and the two pages seem similar enough, a small change can have a great impact, for example when 'do' changes to 'don't'. At the same time, the change of a few letters can have no bearing on the contents whatsoever, for example when 'can't' is changed to 'cannot'. This implies that future research should also focus on user experiences, in order to not only quantify the level of content drift, but also qualify it. Andrew Jackson, the Web Archiving Technical Lead at the British Library, also researched content drift while focussing on textual differences. His research showed that after only two years, 60% of the URLs had either rotted or had changed in a manner that rendered them unrecognisable.50 However, the same limitation as present in the Hiberlink project also applies here: small changes that could carry great importance are not taken into account. In attempts to quantify the degree of content drift by making use of tools that apply technical analyses to the information objects, it is extremely difficult to capture subtleties which may be vital to the interpretation.

Wallace Koehler recognised this difficulty and further stated that assessing the influence a change has is subjective. While some changes might be seen as important from

49Web 2.0, Techopedia, accessed November 20th, 2020, https://www.techopedia.com/definition/4922/web-20.

50A. Jackson, Ten years of the UK web archive: what have we saved?, British Library, accessed

January 15th, 2021, https://blogs.bl.uk/webarchive/2015/09/ten-years-of-the-uk-web-archive-what-have-we-saved.html.


a design perspective, their importance can be marginal from an informational perspective.51 In addition, the importance assigned to a change occurring in an information object depends on the purpose and function of the web page, but also on how it is used.52 Different users will have different purposes for using the web page, which will alter their perception of the importance of its different elements.

For the Hiberlink project, a reduction of the URI references was necessary due to some URI references not having live web content, implying that link rot had already occurred. This left the researchers with a final total of 241,091 URI references.53 After performing the second step, which was comparing the snapshots from the web archives with the live web page of the URI reference, the researchers found that 76.4% (184,065) of the URI references were subject to content drift.54 Again, this only indicates that something has changed, but not how drastic the individual changes might be for the interpretation of the information object. Over the years, content drift increases, as is seen in Table 2.

Table 2: This table shows the amount of content drift across URI references from the three corpora. From: Jones et al., Scholarly Context Adrift, p. 23.

Stable URI references
Year   ArXiv   Elsevier   PMC
1997   ±10%    ±10%       0%
2007   ±60%    ±40%       ±40%
2012   ±70%    ±60%       ±60%

Research concerning content drift is often aimed at estimating how volatile web pages are. These studies found evidence of content drift on web pages. For example, Eytan Adar et al. found that only 11% of the web pages in their sample had remained constant after one year.55 Their study shows that content drift increases over time,

51W. Koehler, Web Page Change and Persistence - A Four-Year Longitudinal Study, Journal of the American Society for Information Science and Technology 53, no. 2 (2002): p. 168-169.

52Idem, p. 170.

53Jones et al., Scholarly Context Adrift, p. 11-14. 54Idem, p. 20.

55E. Adar, M. Dontcheva, J. Fogarty, D. Weld, Zoetrope: Interacting with the Ephemeral Web, in


as 95% of the web pages had remained identical after one week.56 Jackson found that 40% of all web pages in his sample had experienced content drift after two years.57 Jonathan Zittrain et al. considered references in United States Supreme Court opinions published since 1996 and found that 50% of the references to web at large resources had changed by 2014.58

Although there is robust evidence of the presence of content drift, estimates do differ. An important driver of this could be the type of information object in which the references are provided. Junghoo Cho and Hector Garcia-Molina found that there are significant differences between web domains in their frequency of change. They found that 40% of all websites in the .com domain change on a daily basis. Web pages within the .edu and .gov domains, which are used for educational and governmental websites in the United States, were found to be more static.59 This could imply that the web pages of (semi-)governmental organisations in the Netherlands may also be relatively static compared to the web in general.

As argued previously, the issue of content drift may increase over time. Moreover, the focus on merely textual differences may not be sufficient anymore. The web has moved from Web 1.0, which was mostly textual, to Web 2.0, which is interactive and extremely dynamic. Dynamic content, such as video and audio, has grown in importance and can even form the focus of a web page.60

The results of the Hiberlink project clearly indicate that the issues of link rot and content drift worsen over time. While their research specifically covers referenced resources in academic papers, the problem is also present in other areas. This thesis will focus on reference rot of (semi-)governmental websites in the Netherlands. More and more (semi-)governmental information is being provided online. In 2017, three-quarters of Dutch citizens used websites of the (semi-)government to find information.61 Case law shows that citizens and companies

56Ibidem.

57Jackson, UK web archive: what have we saved.

58J. Zittrain, K. Albert, and L. Lessig, Perma: Scoping and Addressing the Problem of Link and Reference

Rot in Legal Citations, Legal Information Management 14, nr. 2 (2014): p. 89.

59J. Cho and H. Garcia-Molina, The Evolution of the Web and Implications for an Incremental Crawler,

in Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00 (2000): p. 200-209.

60Techopedia, Web 2.0.

61Toename online gebruik van overheidswebsites, Statistics Netherlands, accessed November 16th, 2020,


can derive rights from the information provided on these websites. Because of this, it is necessary to be able to establish which information was on a website at a particular time.62 Moreover, these organisations are obligated to comply with the Public Records Act of 1995. Article 3 of this law states that governmental bodies must deliver and keep the records held by them in good, orderly, and accessible condition.63

1.2 Solutions

Based on Chapter 1.1, it is fair to say that web pages have a volatile nature. Estimates of the average life span of a web page vary: specifically, estimates of 44 days,64 75 days,65 and 100 days were found.66 The previously discussed Hiberlink project states that in order to fight reference rot, a pro-active approach needs to be adopted.67 It further gives two possible solutions for the problems of link rot and content drift: web archiving and persistent identifiers.68 These two are often cited as potential solutions to link rot and content drift.

The applications of web archiving and persistent identifiers are not homogeneous. Both have a variety of settings and tools that can be used, therefore leading to different results. This section will provide an overview of these applications and options.

1.2.1 Web Archiving

Web archiving is a set of processes that is used to capture copies or snapshots of web resources for permanent preservation.69 Organisations may decide to archive the web for legal reasons, such as the previously mentioned Public Records Act of 1995 or legal disputes. However, another

62Information and Heritage Inspectorate, Webarchivering bij de centrale overheid, p. 4. 63Overheid.nl, Archiefwet 1995.

64Kahle, Preserving the Internet.

65S. Lawrence et al., Persistence of Web references in scientific research, Computer 34, no. 2 (2001): p. 30.

66M. Ashenfelder, The Average Lifespan of a Webpage, accessed September 14th, 2020,

https://blogs.loc.gov/thesignal/2011/11/the-average-lifespan-of-a-webpage/.

67Klein et al., Scholarly Context Not Found, p. 27. 68Idem, p. 28-35.

69A. Brown, Practical Digital Preservation: A how-to guide for organizations of any size (London: Facet


reason can be cultural-historical. Web content is a carrier of creativity, communication, entertainment, and technological development.70

Web archiving consists of four distinct phases: appraisal, harvest, quality assessment, and providing access.71 The first phase of web archiving is the appraisal of the web pages. Appraisal refers to the process in which information objects are evaluated for their value. Moreover, this establishes whether the information objects, in this case web pages, need to be preserved and, if so, for how long.72 It is important for organisations to assess what their preservation intent is. Preservation intent describes why and how we should try to preserve certain digital objects. Trevor Owens states that clarifying preservation intent is crucial before taking action.73 By stating their preservation intent, organisations can determine which types of information objects they deem important to preserve. Although this is important for any information object, it can be viewed as particularly important for the web, due to its volatile nature. As the web is ever expanding and it is difficult to have a complete overview of it, it is crucial to decide which part of the web is of particular relevance for web archiving.

It is difficult to get a grasp of the web, let alone archive it. According to Niels Brügger, the web in its original form is not suited for archiving.74 Therefore, the individual archiving the web must, as part of the appraisal process, determine which specific parts of the web should be preserved. These parts should subsequently be harvested. Following the harvest, the archived web pages are shaped into a presentable form and made available as information objects to users.75 An example of this shaping process is the addition of a header that contains metadata. The resulting information object is therefore also dependent on the choices made by the individual archiving it. This could form a problem if the personal preferences of an individual are too strongly reflected in the web archive. Theo Thomassen

70Webinar webarchivering - Netwerk Digitaal Erfgoed, YouTube, accessed June 19th, 2020, https://youtu.be/h6_WAEgVv4M.

71Ibidem.

72J. Niu, An Overview of Web Archiving, D-Lib Magazine 18, no. 3/4 (2012).

http://www.dlib.org/dlib/march12/niu/03niu1.html.

73T. Owens, The Theory and Craft of Digital Preservation (Baltimore: Johns Hopkins University Press,

2018), p. 82.

74N. Brügger, Archiving Websites: General Considerations and Strategies (Århus: The Centre for Internet

Research, 2005): p.24.


argues that an archive must be open to interpretation by several groups of users, instead of only catering to one specific type and taking away all room for interpretation.76 An example of this may be the harvest frequency selected by the archivist.77 Only harvesting a website once a year may cater to the specific needs of (semi-)governmental organisations, but not to the needs of a historian wishing to see more frequent captures.

Generally, web archives use the following set of criteria: top-level domain (e.g. .nl), topic or event, media type, and genre. The library of the NASA Goddard Space Flight Center (GSFC), for instance, only archives web pages within the Goddard website. This library further chooses to avoid crawling media types such as large video files and software products.78 Other libraries and archives decide to focus on genre. These genres can be blogs, newspapers, video games, etc. The National Library of France has created a web archive focussed on a collection of e-diaries, and the Internet Archive has a collection based on the genres of software and videogames.79

Hand-picking every website that fits the criteria set by the organisations seems like an impossible task, because of the multitude of websites created each day, not to mention every single web page that exists within those websites. Some organisations try to automate the selection process based on the criteria previously presented: domain, topic or event, media type, and genre. On a technical level, it is possible to create software that determines which domain is used or which media types are present. Furthermore, software could be able to differentiate whether a web page is an online journal or a blog. The National Library of the Czech Republic has tried to automate its appraisal process by making use of a WebAnalyzer. Its primary focus is to archive websites that fall inside the Czech domain (.cz). However, it realised that websites outside of this domain could still be relevant to archive. The WebAnalyzer is integrated within a web crawler and, while crawling the

76T. Thomassen, De veelvormigheid van de archiefontsluiting en de illusie van de toegankelijkheid, in

Ontwikkeling in de ontsluiting van archieven. Jaarboek 2001 Stichting Archiefpublicaties, ed. T. Thomassen, B. Looper, and J. Kloosterman (The Hague: Stichting Archiefpublicaties, 2001), p. 13.

77An archivist is a professional that has expertise in how to manage information objects to ensure sustainable access. From: Archivist, Dictionary of Archives Terminology, accessed December 15th, 2020, https://dictionary.archivists.org/entry/archivist.html; Brügger, Archiving Websites, p. 10.

78Niu, Web Archiving.

79Niu, Web Archiving.; Software Library, Internet Archive, accessed November 18th, 2020,


web, it searches for pre-defined properties that characterize the Czech web. Each time the WebAnalyzer encounters one of the pre-defined properties, a number of points is allocated to the URL. After a certain number of points has been reached, the web page is considered to be relevant and will be archived.80
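The point-allocation mechanism can be sketched roughly as follows. The properties, weights, and threshold are invented for illustration and are not the actual rules used by the Czech WebAnalyzer; they merely show how a crawler-integrated scorer can flag out-of-domain pages that still look relevant.

    # Illustrative sketch of point-based appraisal during a crawl, loosely modelled
    # on the idea behind the Czech WebAnalyzer. The properties, weights, and
    # threshold below are invented for this example, not the library's actual rules.
    from urllib.parse import urlparse

    RULES = [
        # (description, predicate over (url, page_text), points awarded)
        ("registered under the .cz domain",
         lambda url, text: urlparse(url).netloc.endswith(".cz"), 5),
        ("text contains Czech diacritics",
         lambda url, text: any(ch in text.lower() for ch in "ěščřžýáíé"), 2),
        ("text mentions the country name",
         lambda url, text: "česk" in text.lower(), 2),
    ]
    THRESHOLD = 3  # archive once a page has collected at least this many points

    def score_page(url: str, page_text: str) -> int:
        """Allocate points for every pre-defined property the page exhibits."""
        return sum(points for _, predicate, points in RULES if predicate(url, page_text))

    def should_archive(url: str, page_text: str) -> bool:
        return score_page(url, page_text) >= THRESHOLD

    if __name__ == "__main__":
        print(should_archive("https://example.com/o-nas",
                             "Vítejte, jsme česká firma se sídlem v Praze."))  # True
        print(should_archive("https://example.com/about",
                             "Welcome to our company."))                       # False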

Choosing which websites to harvest comes back to creating a selection of the web, as was previously stated by Brügger. With the national turn in web archiving, which was described in the study by Esther Weltevrede, the web is often no longer seen as cyberspace.81 Content is now being selected by the nationality or language of its users.82 Where boundaries were first non-existent, they are now being created, turning the web into an almost physical space. Surya Bowyer underlines this by noting the movement through space that is present in the terminology: harvesting web pages is done by a crawler, which crawls from one place to the next; moreover, several web pages under a domain are called a 'site' and users 'surf' these web pages.83 By setting the boundaries mentioned by Weltevrede, the archiving organisation is effectively limiting how far the crawler is allowed to physically crawl. This underlines the fact that archiving organisations have increasingly had to impose subjective limits on the parts of the web that they are aiming to preserve. This is inherent to the volatile and ever-growing nature of the web, where it is simply not possible to retain everything.

Selection can furthermore be based on legal considerations. The National Library of the Netherlands has based its selection policy upon this reasoning. Information objects on web pages can be protected by copyright held by journalists, publishers, and software manufacturers. Moreover, intellectual property rights can be present in web pages. Harvesting a website is considered to be copying, which makes it an act that is governed by copyright legislation. This means permission is necessary from the owner(s). Copyright law poses a barrier to archiving large numbers of websites, since permission is needed in order to archive the selected web pages. However, this is where a choice must be made. On the one hand, copyright law must be followed. On the other hand, the National Library of the Netherlands argues that

80Niu, Web Archiving.

81Cyberspace is a space disconnected and distinct from reality. It is an information space that is identifiable but cannot be located. From: E. Weltevrede, Thinking Nationally with the Web: A Medium-Specific Approach to the National Turn in Web Archiving, MA thesis, University of Amsterdam (2009): p. 17-18.

82Idem, p. 84.

83S. Bowyer, The Wayback Machine: notes on a re-enchantment, Archival Science (2020): p. 5-6.


web archiving is of crucial importance. Preserving our digital cultural heritage is in the interest of the general public, as well as of scientific research. While waiting for permission, a lot can happen, as was seen earlier in this chapter. Not only can the links rot, but the contents may also have changed radically. This would mean the content would already be lost while permission is awaited. The National Library of the Netherlands has therefore chosen an opt-out method to solve this problem. This method assumes implicit permission given by the custodians of websites. They are sent a message stating that the library intends to harvest, archive, and make the website accessible for heritage purposes. In addition to this message, a deadline is provided within which the custodian can refuse to give permission. If there is no response from the custodian, the library assumes implicit or tacit consent.84

The provided examples show the different ways in which archiving organisations can create a relevant selection of web pages that they desire to retain. They show that there are several distinct considerations to be made when deciding which contents to preserve in a web archiving collection. Although the appraisal question is relevant for all archiving, it is of particular importance for web archiving due to the ever-expanding and volatile nature of the web. There is, inherently, a degree of subjectivity imposed on web archives by these necessary selection processes. As discussed by Thomassen, this role played by the archivist may have negative effects. It is therefore crucial to be aware of this influence.

The second phase is to harvest the previously appraised web pages. There are multiple ways to acquire the contents of a website. It is possible to obtain the web content through direct, bilateral contact between the archivist and the custodian of the website. This may be appropriate in situations in which a limited amount of content needs to be obtained, or when there is a single website of relevance.85 However, if larger numbers of websites need to be acquired, this may not be an efficient way. In that case, it is oftentimes more time- and cost-efficient to access the websites directly and obtain them without coordinating with the custodian of the website. Naturally, the aforementioned concerns of potential copyright violations are relevant in this context. It is, however, possible to only access the web pages

84Legal issues, National Library of the Netherlands, accessed October 23rd, 2020,

https://www.kb.nl/en/organisation/research-expertise/long-term-usability-of-digital-resources/web-archiving/legal-issues.


that the custodian themselves has selected to be harvested.

Harvesting web pages is done by using crawlers. The web crawler visits a given set of web pages and subsequently harvests these pages. In the case of web archiving, crawlers make use of a seed list that directs them to content that needs to be preserved. Moreover, the crawler follows hyperlinks to find additional content to preserve. The previous appraisal process is tightly connected to the crawling process. The crawler can be set to exclude certain types of content or types of domains (e.g. .edu, .gov), as was previously mentioned. By providing carefully set filters, the crawler is prevented from harvesting the entire web. Furthermore, the selection of which types of content need to be harvested impacts the choice of crawler. Some crawlers are unable to harvest certain types of content, such as GIS files, or more dynamic content such as video.86 It is also possible to limit the scope of the web crawler to include only web pages for which the custodian has not indicated that permission has been denied.87 By adding a robots.txt file, the custodian indicates that certain pages or files should not be requested by search engines or crawled and harvested by web crawlers.88 Although it is possible to ignore such robots.txt instructions, this could lead to a potential copyright infringement.
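A minimal sketch of this robots.txt check, using Python's standard library, is given below; the seed URL and the crawler's user-agent name are placeholders, and a production harvester would be considerably more involved than this fetch-only example.

    # Minimal sketch: consult a site's robots.txt before harvesting a page.
    # The seed URL and user-agent name are placeholders.
    from typing import Optional
    from urllib import robotparser
    from urllib.parse import urlparse
    from urllib.request import Request, urlopen

    USER_AGENT = "ExampleArchiveBot"  # hypothetical crawler name

    def allowed_by_robots(url: str) -> bool:
        """Return True if the site's robots.txt permits USER_AGENT to fetch this URL."""
        parts = urlparse(url)
        parser = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        parser.read()
        return parser.can_fetch(USER_AGENT, url)

    def harvest(url: str) -> Optional[bytes]:
        """Fetch the page only when the custodian has not disallowed it."""
        if not allowed_by_robots(url):
            return None
        request = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(request, timeout=10) as response:
            return response.read()

    if __name__ == "__main__":
        page = harvest("https://www.example.org/")
        print("harvested" if page is not None else "skipped (disallowed by robots.txt)")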

Another type of setting is how often the website needs to be crawled. This can be automated to be done daily, weekly, monthly, or even yearly. It can also be done manually, by leaving the timing of the crawl to the discretion of the individual. When the crawler is set to crawl a website often, it could harvest several duplicates when the content has not changed. Spending time, effort, and financial means to capture duplicates is a waste. By using checksums, this can be prevented.89 Primarily, this unique signature is used to check the object's integrity over time; however, it can double as a check to see whether web archiving the object would lead to a duplicate. This process is also called incremental crawling, where only changed content is crawled.90
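The checksum-based de-duplication described above can be sketched briefly: keep the hash of the last stored capture per URL and skip any new capture whose hash is identical. This is a simplification of the incremental-crawling idea, and the in-memory dictionary below stands in for a real archive store.

    # Minimal sketch of checksum-based de-duplication before storing a new capture.
    # A simplification of the incremental-crawling idea; the dictionary below stands
    # in for a real archive store.
    import hashlib

    previous_checksums = {}  # url -> SHA-256 digest of the last stored capture

    def checksum(content: bytes) -> str:
        """Unique signature ('digital fingerprint') of a harvested capture."""
        return hashlib.sha256(content).hexdigest()

    def store_if_changed(url: str, content: bytes) -> bool:
        """Store the capture only when its checksum differs from the previous one."""
        digest = checksum(content)
        if previous_checksums.get(url) == digest:
            return False  # identical to the last capture: skip the duplicate
        previous_checksums[url] = digest
        # ... write the capture to the archive here ...
        return True

    if __name__ == "__main__":
        print(store_if_changed("https://www.example.org/", b"<html>version 1</html>"))  # True
        print(store_if_changed("https://www.example.org/", b"<html>version 1</html>"))  # False
        print(store_if_changed("https://www.example.org/", b"<html>version 2</html>"))  # True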

86Ibidem. 87Ibidem.

88Introduction to robots.txt, Google Developers, accessed October 29th, 2020, https://developers.google.com/search/docs/advanced/robots/intro.

89A checksum is a string of characters that relate to a digital object, and which act as the object's unique signature or digital fingerprint. From: Glossary, The National Archives, accessed November 13th, 2020, https://nationalarchives.gov.uk/archives-sector/projects-and-programmes/plugged-in-powered-up/digital-preservation-workflows/glossary/; Niu, Web Archiving.


Another issue related to time is the timestamp given to the archived website. When an extensive website that contains a multitude of dynamic content is harvested, the harvesting process may take a substantial amount of time, possibly several days or longer. In this time, the website is prone to content drift. Jinfang Niu gives the example of two web pages (p1 and p2) being harvested. P1 is harvested directly (at t1), but by the time the crawler reaches p2, both web pages have been changed. This means that p2, which existed at the same time as p1, is never archived; p2-a, the updated version of the web page, is archived instead. While the original website consisted of pages p1 and p2, and the updated website consisted of p1-a and p2-a, the archived version consists of p1 and p2-a. The crawler has thus archived a website that never existed.91 This problem is known as temporal incoherence. Temporal coherence is a property of a set of archived web pages indicating that there was a point in time at which all the archived pages were live simultaneously on the web;92 incoherence means that no such point in time exists. This also has an effect on the authenticity of the archived website. Authenticity means that the information object is what it purports to be and is free of corruption and tampering.93 However, if the website purports to have been captured at a particular time, while the harvest took an extensive amount of time and the web pages had already been subject to content drift, one could argue that the authenticity of the information object is dubious. An example of this was shown by Scott Ainsworth et al. in their paper on temporal coherence, where an archived weather website showed a large radar image with clear and sunny weather while the daily image on that same web page showed cloudy weather with a chance of rain. These types of discrepancies are indicative of temporal incoherence.94 It also exemplifies that the information object is no longer authentic, since it does not show that day's weather correctly. For users, this could form a serious issue and could even render the page unusable. This forms a limitation of web crawling as a means of collecting websites.
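The definition of temporal coherence can be expressed as a simple interval check: a set of captures is coherent only if the validity intervals of the captured page versions overlap at some instant. The sketch below illustrates Niu's example; the dates are invented purely for illustration.

from datetime import datetime

# For each captured page version: the period during which it was live on the web.
captures = {
    "p1":   (datetime(2021, 1, 1), datetime(2021, 1, 3)),  # original p1, replaced by p1-a on January 3rd
    "p2-a": (datetime(2021, 1, 4), datetime(2021, 1, 9)),  # updated p2, live only from January 4th
}

def temporally_coherent(captures):
    # Coherent if there is at least one instant covered by every validity interval.
    latest_start = max(start for start, _ in captures.values())
    earliest_end = min(end for _, end in captures.values())
    return latest_start <= earliest_end

print(temporally_coherent(captures))  # False: p1 and p2-a were never live simultaneously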

90https://skemman.is/bitstream/1946/6074/1/kristinn-sigurdsson-iwaw06.pdf.

91Niu, Web Archiving.

92A. Ball, Web Archiving, accessed November 13th, 2020, http://www.dcc.ac.uk/sites/default/files/documents/reports/sarwa-v1.1.pdf.

93Authenticity, Terminology, InterPARES Trust, accessed December 1st, 2020, https://interparestrust.org/terminology/term/authenticity.

94S. Ainsworth, M. Nelson, and H. Van de Sompel, Only One Out of Five Archived Web Pages Existed as Presented, in Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT'15 (2015): p. 257. https://doi.org/10.1145/2700171.2791044.


The third phase is to conduct a quality assessment of the harvested web pages. This is not the only step taken following the harvest: the harvested information objects need to be checked for their quality, they need to be organised, and metadata needs to be added. As mentioned previously, web crawlers can struggle with certain types of content, mainly dynamic content such as video and audio. This, in combination with the fact that there is always a degree of automation, could mean that some content might not be harvested. How quality is checked, however, can vary between organisations. If the scope of the harvest is considerable, checking everything manually is not feasible. There is, however, a basic level of quality assessment. Part of this is done before the harvest is enacted. Two approaches are included in this pre-harvest process: resource analysis and collection testing. The resource analysis involves assessing the content of the selected website. This includes whether the website is static or dynamic and whether it contains content that could form a problem later. An example of this is the end-of-life of Adobe Flash Player.95 This discontinuation would render a lot of classic games unusable.96 If a web archive has previously selected to archive web pages that contain games relying on the Adobe Flash Player, those web pages will lose their purpose. The second approach is to test the collection. This is not necessary if the selected web pages are only to be harvested once, but it can be helpful when the crawler will harvest on a monthly or yearly basis. After the test harvest is completed, the employed method can be evaluated and potential improvements can be made.97

The other part of the quality control takes place after the harvest. Adrian Brown names a total of nine types of tests that can be conducted: availability, navigation, date and time, frames, text, images, multimedia content, downloadable content, and search. The first, and most basic, test is availability, which simply checks whether the website has been archived and can be accessed. Several of the other tests are visual checks, such as date and time, frames, text, images, and multimedia content. The remaining types, such as navigation, downloadable content, and search, check whether the functionality of the live website is preserved in the archived one.98
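The availability test in particular lends itself to automation. The sketch below requests each captured URL from a hypothetical replay service and records the HTTP status code; the replay URL pattern and the list of captures are assumptions for illustration, and the visual and functional tests named above would still require human inspection.

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

REPLAY_PREFIX = "https://webarchive.example.org/replay/"  # hypothetical replay service
captured_urls = ["https://www.example.org/", "https://www.example.org/contact"]

def availability_report(urls):
    report = {}
    for url in urls:
        try:
            with urlopen(REPLAY_PREFIX + url) as response:
                report[url] = response.status  # 200 means the capture can be accessed
        except HTTPError as error:
            report[url] = error.code  # archived, but replay fails (e.g. 404, 500)
        except URLError:
            report[url] = None  # replay service unreachable
    return report

print(availability_report(captured_urls))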

95This software is no longer supported by Adobe since December 31st, 2020.

96The Ragtag Squad That Saved 38,000 Flash Games, Wired, accessed December 10th, 2020, https://www.wired.com/story/flash-games-digital-preservation-flashpoint/; Na 25 jaar stopt Flash Player en daarmee tienduizenden online games, NOS, accessed January 1st, 2021, https://nos.nl/artikel/2362722-na-25-jaar-stopt-flash-player-en-daarmee-tienduizenden-online-games.html.



After selecting, harvesting, and organising the web pages, it is necessary to grant access to potential users. Although a web archive could choose to make its contents available to the public, many web archives in fact limit access. Archives that limit access usually do so for reasons related to copyright legislation.99 It is also possible to provide access to the web archive only at the physical location of the organisation that maintains it. Moreover, it is possible for an archive not to provide immediate access to the contents of the archived websites. Introducing a delay before the archived version of a website can be accessed ensures that the web archive does not compete with the website itself for users. These delays vary across archives and can be as long as several months.100 The Internet Archive takes 3-10 hours between harvesting the web pages and adding them to the Wayback Machine that grants access.101 Brügger recognised the delays introduced by archiving organisations having to treat the archived web pages first as one of the reasons why users cannot always access the information they seek in web archives.102
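Whether a given page is already accessible through the Wayback Machine can be checked programmatically via the Internet Archive's availability API, as sketched below; the example URL is arbitrary, and the exact response layout is an assumption based on the API's public documentation.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

def closest_snapshot(url, timestamp=None):
    # Ask the Wayback Machine availability API for the capture closest to the
    # optional timestamp (e.g. "20210101"); returns None when nothing is accessible yet.
    query = {"url": url}
    if timestamp:
        query["timestamp"] = timestamp
    with urlopen("https://archive.org/wayback/available?" + urlencode(query)) as response:
        data = json.load(response)
    return data.get("archived_snapshots", {}).get("closest")

snapshot = closest_snapshot("https://www.example.org/")
if snapshot:
    print("accessible capture:", snapshot["url"], snapshot["timestamp"])
else:
    print("no capture accessible yet")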

In creating access to a web archive, as mentioned previously, the archived web pages are shaped into a presentable form for users. According to Brügger and Weltevrede, how access is provided can be a product of its time.103 An example of this is the Internet Archive, which grants access by making use of the `surfer's web' of the 1990s. Users search for the website, after which they can surf between the web pages themselves. Different options are the `searcher's web' of the 2000s, or the `scroller's web' of the 2010s, which makes use of the smartphone.104

In conclusion, every step of the web archiving process involves decisions that need to be taken. With each decision taken, the web archive is shaped subjectively.

98Idem, p. 72-73.

99Brügger, Archiving Websites, p.10. 100Niu, Web Archiving.

101Using The Wayback Machine, Internet Archive, accessed January 20th, 2021, https://help.archive.org/hc/en-us/articles/360004651732-Using-The-Wayback-Machine.

102Brügger, Archiving Websites, p.10.

103N. Brügger, Understanding the Archived Web as a Historical Source, in The SAGE Handbook of Web History, ed. N. Brügger and I. Milligan (London: SAGE Publications Ltd., 2018), p. 21-22; Weltevrede, "Thinking Nationally," p. 49.

104R. Rogers, Doing Web history with the Internet Archive: screencast documentaries, Internet Histories


For (semi-)governmental organisations, it is important to be aware of this. Moreover, while it is necessary to have a clear statement of preservation intent, the web archive should not be created with a focus on one specific type of user. As websites of (semi-)governmental organisations can serve various purposes, such as legal or historical ones, it is important that the archiving organisation does not ex ante impose its own interpretation on the information objects.

1.2.2 Persistent Identifiers

Persistent identifiers are often associated with the field of academia, which uses the Digital Object Identifier (DOI) system to create sustainable access to academic papers.105 However, persistent identifiers come in various forms and can therefore be used in different contexts. This paragraph provides an introduction to persistent identifiers and also looks at how they are used.

Persistent identifiers are `long-lasting references to a digital resource.'106 Usually, this identifier consists of two components: a unique identifier and an accompanying service that can locate the resource for a long period of time, even when its location has changed. The identifier makes sure that the resource is what it purports to be, ensuring authenticity.107

An important element of any persistent identifier scheme is the resolution system. A resolution system allows one to navigate from a persistent identifier to the location of the information object. As the name implies, the identifier remains persistent, while the location of the information object may change. The resolution system allows one to input the persistent identifier and receive as an output the current location of the information object.108
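As an illustration of what a resolution system does, the sketch below hands a DOI to the public doi.org resolver and reads back the location to which the resolver redirects. The example DOI is the one quoted later in this section, and the redirect target it returns may of course change over time; the same principle applies to the other identifier schemes discussed below, only the resolver endpoint differs.

from urllib.request import Request, urlopen

def resolve_doi(doi):
    # The resolver answers a request for the persistent identifier with a
    # redirect to the current location of the resource.
    request = Request(f"https://doi.org/{doi}", method="HEAD")
    with urlopen(request) as response:
        return response.geturl()  # the URL the resolver redirected to

print(resolve_doi("10.1002/joc.1130"))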

As with the solution of web archiving, complete automation is not attainable and human labour hours have to be invested. Not only must the service that provides the resolution be maintained, but individuals must also select a type of persistent identifier scheme.

105Klein et al., Scholarly Context Not Found, p. 3.

106Persistent Identifiers, Digital Preservation Coalition, accessed September 29th, 2020, https://www.dpconline.org/handbook/technical-solutions-and-tools/persistent-identifiers.

107Ibidem.


According to the Digital Preservation Handbook of the Digital Preservation Coalition, there are five common types of persistent identifier schemes: Handles, Digital Object Identifier (DOI), Archival Resource Key (ARK), Persistent Uniform Resource Locator (PURL), and Uniform Resource Name (URN). There are, however, other types of persistent identifiers as well. The five aforementioned schemes are the ones most directed towards the purposes relevant to this thesis, namely digital objects and digital preservation.109

Handles are commonly used for objects of museums and archives.110 The aim of the Handle System is to support digital libraries by providing a framework. A Handle works in two ways: it identifies the specific object, and it also identifies the organisation that created the object or is currently maintaining it. The Handle System provides general-purpose software to resolve identifiers.111 Handles take the form <prefix/suffix>. The prefix denotes the organisation, while the suffix identifies the information object. An example of a Handle is 145.76/jan2005-rk324942199.112
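The <prefix/suffix> structure can be seen in the example Handle above; a minimal sketch of splitting it into its two components and constructing a URL for the public hdl.handle.net proxy follows. Whether this particular example Handle actually resolves is not guaranteed.

def split_handle(handle):
    # A Handle consists of a naming-authority prefix and an object suffix,
    # separated by the first slash.
    prefix, suffix = handle.split("/", 1)
    return prefix, suffix

handle = "145.76/jan2005-rk324942199"
prefix, suffix = split_handle(handle)
print("organisation (prefix):", prefix)        # 145.76
print("information object (suffix):", suffix)  # jan2005-rk324942199
print("proxy URL:", "https://hdl.handle.net/" + handle)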

The DOI is a type of Handle, and its technical infrastructure is dependent on the Handle System. The Digital Object Identifier is commonly used for scientific publications and books.113 It was originally meant to link customers with publishers, facilitate electronic commerce, and enable copyright management systems.114 Despite making use of the technical infrastructure of the Handle System, DOIs are distinct from Handles due to the inclusion of a metadata model. The persistent identifier used as the DOI typically includes metadata such as the publisher, the journal or book, and potentially the year or serial number of the publication.115 DOIs have been standardised since 2012, when they were published under ISO standard 26324.116 An example of a DOI is 10.1002/joc.1130. DOIs use a similar structure to Handles, with the exception of the DOI signature (`10'). The 10 is the DOI identifier within the Handle System.

109Digital Preservation Coalition, Persistent Identifiers.

110PID 2: Kies de beste Persistent Identifier, YouTube, accessed October 21st, 2020, https://www.youtube.com/watch?v=rIvManSuguw&feature=youtu.be.

111Factsheet, DOI, accessed December 14th, 2020, https://www.doi.org/factsheets/DOIHandle.html. 112J. Hakala, Persistent identifiers: an overview, accessed December 10th, 2020, http://www.persid.org/downloads/PI-intro-2010-09-22.pdf.

113YouTube, PID 2: Kies de beste Persistent Identifier.

114DOI, accessed November 4th, 2020, http://web.archive.org/web/19980204120927/http://www.doi.org:80/. 115DOI Handbook, DOI, accessed November 4th, 2020, https://www.doi.org/hb.html.

116ISO 26324:2012, International Organization for Standardization, accessed November 25th, 2020,
