
6 Data: metadata collection to link logs

6.2 Search, annotation methods and some results

The concrete steps I took are the following. As stated, I started with a list of religions which have or could have a foothold in the Netherlands (a list which I expanded freely as the search was ongoing). In the case of the more obscure religious movements, I checked whether the movement had an entry on the Dutch Wikipedia. If there was a practicing group in the Netherlands, the entry more often than not linked to the website of the Dutch branch. If using Google, I took care to use logged-out browsers cleaned of cookies to ensure Google’s tendency to personalize search results based on past history did not interfere.

Usually, the name of the movement in Dutch and ‘Netherlands’ / ‘Nederland’ was used as search keywords. In the case of looking for specific kinds of organisations, I added the relevant keyword (like ‘Hindoe webshop’). A further method I used in searching for migrant churches was translating a search phrase (such as ‘Thai temple Netherlands’) into the language of the target diaspora, and entering that in the search bar. This way, I found multiple sites that (almost) did not use Dutch or English. Finally, I used Google Maps as a spatio-visual extension to the textual search engine.

139 Kemman, Kleppe, and Scagliola, ‘Just Google It: Digital Research Practices of Humanities Scholars,’ 4.

140 Franco Moretti, ‘The Slaughterhouse of Literature’, MLQ: Modern Language Quarterly 61, no. 1 (March 2000): 207–27.

141 Kevin Kee and Stephen Ramsay, eds., ‘The Hermeneutics of Screwing Around; or What You Do with a Million Books’, in Pastplay: Teaching and Learning History with Technology (University of Michigan Press, 2010).

142 Ramsay, 9.

In addition to using direct search, I also used the so-called ‘snowball method’, which meant searching found sites for further relevant links. Sites that hosted a dedicated page with links to other sites relevant to their own topic or business (i.e., sites within their web sphere), a practice found mostly on older websites, were especially useful: they at least somewhat mitigated the influence of the Google algorithm and surfaced sites that would not feature highly in its search results but do feature in a site owner’s personal network.

I put a stop to my search when I had gathered 1048 links, more than double the number in the previous web collection. The overwhelming majority of these sites (829) were hosted on the Dutch top-level domain .nl. Small minorities were .com (92), .org (78), .net (18), .nu (8) and .eu (8), with .tv, .frl, .blog, .amsterdam and a number of other European domains as outliers. Of these 1048 sites, I ultimately selected 656 candidates for actual preservation in the KB archive. It is also for these sites that I gathered detailed metadata, which forms the basis of this thesis’ dataset. In deciding on a metadata scheme, I tried to take into consideration as much as possible the use that future users of the archive would derive from it: what information could I record with relative ease in order to make the archive searchable purely by metadata? In the end, I decided on the following full scheme (certain columns were used for internal descriptive purposes; these have been removed from the public version, as well as from the file accompanying this thesis):

1. Archived in WCT

Whether a website has already been archived by the KB at an earlier date.

2. Name

The self-declared name of a website, as displayed at the top of the homepage or in a logo. Subtitles are given in parentheses: ( ) and alternative titles in square brackets: [ ]. Names in multiple languages are separated by “=”.

3. URL

The link of a homepage, or in rare cases a sub-division of a site.

4. Online

Whether a site was still available as intended (not offline or put up for sale) at the time of inspection.

5. Access date

The date of site inspection.

6. Language(s)

The languages a site is offered in, unless the site concerns a Dutch subsite of a multilingual international site. The count of the various languages used is as follows:

Dutch: 611 English: 119 German: 19 Arabic: 12 French: 10 Spanish: 6 Turkish: 6

Farsi: 5 Hebrew: 5 Romanian: 4 Serbian: 4 Amharic: 3 Chinese: 3 Korean: 3

Russian: 3 Syrian: 3 Armenian: 2 Greek: 2 Hindi: 2 Hungarian: 2 Italian: 2

Japanese: 2

Kurdish: 2 Polish: 2 Tamil: 2 Thai: 2 Tigrinya: 2

Ambonese: 1 Danish: 1 Finnish: 1 Frisian: 1 Indonesian: 1 Croatian: 1

Norwegian: 1 Urdu: 1 Vietnamese: 1 Swedish: 1

7. Slogan

The slogan displayed on a website. Can also be a citation from a holy text.

8. Description

Short description of the content of a site, community/company and other notable features.

9. Location

The self-reported location of a site. In the case of a community this may be a place of worship or an administrative headquarters. It can also be the owner’s hometown.

10. Province

Province that contains the location.

11. Email address

The most visible email address on the website.

12. Include in collection

Whether the site is deemed noteworthy enough for inclusion in the webarchive and extended annotation.

13. Reason for selection

Reason for selection.

14. C1: Metatype

15. C2: Movement

16. C3: Representatives or owners

17. C4: Goals

18. C5: Functionality

19. Social media

Which social media channels a site maintains. Since social media is not/cannot be archived by the KB, this gives some indication of the agent’s presence on contemporary platforms. I looked only at logos or links visible on the site; whether they are active has not been tested (except in the case of some WordPress sites, which can automatically place unlinked social media logos on a site).

Columns 14 to 18 contain a detailed, layered categorization of the various characteristics of a site. Columns 15 to 18 have a syntax of tiered categories: there are broad top-level categories, which can be further clarified by the selection of subcategories, up to a maximum of 4 layers.

An example would be: World Religions > Christianity > Protestantism > Evangelism. See appendix 7.1 for the full list in Dutch.
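The tiered syntax can be illustrated with a minimal Python sketch. It assumes that the layers of a category are written as a single string separated by ‘>’, as in the example above; the function names `parse_category` and `ancestors` are hypothetical and are not taken from the tool described later in this section.

```python
MAX_TIERS = 4      # the scheme allows at most four layers
SEPARATOR = ">"    # separator as written in the example above (assumed storage format)


def parse_category(path):
    """Split a tiered category string into its layers, broadest first."""
    tiers = [part.strip() for part in path.split(SEPARATOR)]
    if any(not tier for tier in tiers):
        raise ValueError(f"empty tier in {path!r}")
    if len(tiers) > MAX_TIERS:
        raise ValueError(f"more than {MAX_TIERS} tiers in {path!r}")
    return tiers


def ancestors(path):
    """Return every prefix of the path, from broadest to most specific."""
    tiers = parse_category(path)
    return [" > ".join(tiers[:depth]) for depth in range(1, len(tiers) + 1)]


example = "World Religions > Christianity > Protestantism > Evangelism"
print(parse_category(example))
print(ancestors(example))
```

Validating the number of layers at parse time is one way to catch the kind of syntax errors that would otherwise only surface when the metadata is processed by machine.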

Category 1: Metatype has been described before; it is the broad-stroke division of the sites into three rough categories. The count of these is as follows:

Religious: 562 Spiritual: 152 Secular: 55

Category 2: Movement tries to pin down the religious identity (or identities) of the website. As mentioned above, this is not so easy on some sites, especially those ‘spiritual’ sites which do not neatly fit in a single subcategory, and instead are served better by labelling the various spiritual subjects and practices (astrology, paragnosis, chakras and energies, reiki, tarot, etc.). Certain ‘unofficial’ broad terms have been used to group particular movements, such as ‘reconstructionist movements’ for neo-pagan groups. ‘New age’ has specifically been used to denote those ‘spiritual’ websites that explicitly refer to the coming of a new age and the need for human ascension, not spirituality in general. The final count of the various movements is as follows (in Dutch), in order from most frequent to least frequent:

Wereldreligies: 497 Christendom: 348 Protestantisme: 143 Spiritualiteit: 106 Islam: 70

Rooms-katholicisme: 56 Evangelisten: 55

Reconstructionistische bewegingen: 46 Neo-paganisme: 45

Nieuwe of afgeleide bewegingen: 40 Jodendom: 40

Boeddhisme: 40 Hindoeïsme: 33 Paranormaliteit: 29

Chakra’s, aura’s en energieën: 24 Oosters-Orthodox: 22

Mystiek en pantheïsme: 22 Mediums & paragnosten: 20 Wicca en hekserij: 17 Gereformeerde kerken: 16 Reiki & handoplegging: 15 Soefisme: 15

Heidendom: 15 Pinksterbeweging: 14 Tarot: 14

Non-theïsme: 14 Soenni: 13 New Age: 13 Engelen: 13

Secundaire traditionele stromingen: 12 Inheemse ‘volksreligies’: 11

Gidsen en lichtwezens: 10 Astrologie: 10

Sji’a: 9

Oriëntaals-orthodoxe kerken: 9 Zen: 8

Geesten en verschijningen: 7

Daoïsme: 7

Modern sjamanisme: 7 Vajrayana: 7

Reïncarnatie: 7 Hervormde kerken: 7 Kabbalah: 6 Druïden: 6

Messiasbelijdende joden: 6 Joods-christenen: 6

Vrijzinnigheid: 6

Liberaal of progressief jodendom: 5 Antroposofie: 5

Theravada: 5 Jezuïsme: 5 Russisch-Orthodox: 5 Osho: 5

Atheïsme: 4 Baháʼí: 4

Orthodox jodendom: 4 Vrijmetselarij: 4 Grieks-Byzantijnse: 4 Servisch-Orthodox: 4 Oud-katholicisme: 4 Zevendedagsadventisten: 3 Ahmadiyya: 3

Baptisten: 3

Traditioneel sjamanisme: 3 Roemeens-Orthodox: 3 Christen-Anarchisten: 3 Mormonen: 3

Parodistische stromingen: 3 Pantheïsme: 3

Jehova’s getuigen: 3 Theosofie: 3 Methodisten: 3 Rozenkruizers: 3 Anglicanisme: 2

Apostolisch Genootschap: 2 Armeens-Apostolische Kerk: 2 Brahma Kumaris: 2 Alawieten: 2

Pastafarianisme: 2

Doopsgezinden Mennonieten: 2 Eritrees-Orthodoxe Kerk: 2 Hare-Krishna: 2

Pure Land: 2 Satanisme: 2

Koptisch-Orthodoxe Kerk: 2 Mahayana: 2

Oosters-Katholiek: 2

Syrisch-Orthodoxe Kerk van Antiochië: 2 Masorti-jodendom: 2

Remonstranten (Arminianen): 2 Scientology: 2

Shinto: 2 Sikhisme: 2 Otherkin: 2 Adidam: 1

Unitaristen (Broeders in Christus): 1 Bulgaars-Orthodox: 1

Ethiopisch-Orthodoxe Kerk: 1 Evangelische Broedergemeente (Moraven): 1

Falun Gong: 1 Georgisch-Orthodox: 1 Hellenisme: 1

Reconstructionistisch jodendom: 1 Thelema: 1

Quakers: 1 Rastafari: 1 Jainisme: 1 Animisme: 1

Syrisch-Katholieke Kerk: 1 Vrij-katholicisme: 1
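A frequency list of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the procedure actually used: it assumes that a cell may hold several tiered paths separated by ‘;’ (a hypothetical delimiter), that tiers are separated by ‘>’, and that a site counts once for every tier it is labelled with, so that a broad category is at least as frequent as its subcategories. The sample rows are invented for illustration.

```python
from collections import Counter


def count_movements(cells, cell_separator=";", tier_separator=">"):
    """Count, per category name, the number of sites labelled with it."""
    counts = Counter()
    for cell in cells:
        seen = set()  # each category counts at most once per site
        for path in cell.split(cell_separator):
            for tier in path.split(tier_separator):
                tier = tier.strip()
                if tier:
                    seen.add(tier)
        counts.update(seen)
    return counts


# Invented sample rows, one string per site (not taken from the dataset).
sample = [
    "Wereldreligies > Christendom > Protestantisme",
    "Wereldreligies > Christendom > Rooms-katholicisme",
    "Wereldreligies > Islam > Soefisme; Spiritualiteit > Mystiek en pantheïsme",
]

counts = count_movements(sample)
for name, number in counts.most_common():
    print(f"{name}: {number}")
```

Because a site can carry several labels, totals computed this way can exceed the number of sites in the selection.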

Category 3: Representatives or owners tries to note the types of communities, organisations and individual agents that are connected with and/or responsible for the websites. Category 4: Goals looks at what these agents would likely desire to achieve with their website: community maintenance and communication, commerce, recruitment, (self-)publishing, etc. This list often tries to stay in generalities, so as not to become mired in, for instance, all the various services that churches offer their flock (baptisms, confession, marriages, etc.). Finally, Category 5: Functionality looks closely at the internet features employed by a website, such as offering downloads, streaming video or audio, an event calendar or a newsletter. It also takes inventory of whether a community takes advantage of the internet to carry out their day-to-day tasks, such as fundraising and looking for volunteers.

I quickly discovered that filling in all this metadata by hand, especially the tiered categorization, was both extremely time-consuming and prone to spelling and syntax errors (which would compromise the machine readability of the metadata). I therefore decided to create a graphical user interface (GUI) program in Python to ease the workflow and minimize the risk of errors. Figure 1 shows the result of this project, and Appendix 1.1 shows my Python code. It should be noted that the tool was designed with the specific metadata scheme of my project in mind, and all the fields and checkboxes thus correspond to specific columns. In other words, the tool is not completely modular. At the same time, the fields on the left-hand side can be considered general and may be used by any special collection, while the categories on the right-hand side are generated from external .txt files, which indicate the tier of each category with a specific symbol. With some work, the tool can thus be adapted for new projects.
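How a category file with tier symbols might be read can be sketched as follows. The actual symbol and file layout are documented in the appendix rather than here, so this example assumes a hypothetical format in which each line holds one category and the number of leading ‘-’ characters gives its depth; `load_categories` is an illustrative name, not a function from the tool.

```python
def load_categories(lines, marker="-"):
    """Turn marker-prefixed lines into full tier paths, broadest tier first.

    `marker` is a single character; its repetition count at the start of a
    line is taken as the depth of that category (assumed format).
    """
    paths = []
    stack = []  # names of the tiers that are currently 'open'
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        name = line.lstrip(marker)
        depth = len(line) - len(name)  # number of leading markers
        name = name.strip()
        if depth > len(stack):
            raise ValueError(f"tier skipped before {name!r}")
        stack = stack[:depth] + [name]
        paths.append(" > ".join(stack))
    return paths


# Invented file content; a real file object can be passed in the same way.
category_lines = [
    "Wereldreligies",
    "-Christendom",
    "--Protestantisme",
    "-Islam",
    "Spiritualiteit",
]

for path in load_categories(category_lines):
    print(path)
```

Keeping the categories in plain text files means that a different collection could swap in its own taxonomy without changing the program code.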

The functionality is as follows. First, the metadata spreadsheet is loaded with ‘Load dataset’. Optionally, the KB’s internal record of archived sites can be loaded next. If this is done, the program will cross-reference the files and check whether the current URL has already been archived by the KB. The ‘Load row’ button is used to load the row indicated in the field next to it: all values in that row will then be displayed in the program. The ‘Previous’ and ‘Next’ buttons are used to ‘walk’ through the file. ‘Start driver’ and ‘Auto-open URL’ are tools for speeding up the inspection process: an automated browser window is opened using the Selenium package for Python, and each time a row is loaded its URL is opened automatically, removing the need to do this manually. The rest of the fields correspond to the values in the scheme. When done editing, the ‘Export CSV’ button will export a .csv file with the added or edited values.
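The non-graphical core of this workflow can be sketched in a few lines of Python. This is an illustration under assumptions and not the code from the appendix: the column names ‘URL’ and ‘Archived in WCT’ follow the scheme above, but the ‘yes’/‘no’ values, the URL normalisation and all function names are hypothetical. The browser step only relies on the `get` method of a Selenium WebDriver object, which would be created separately (for instance with `webdriver.Firefox()`).

```python
import csv
import io
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to host + path so that trivial variants compare equal."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def mark_archived(rows, archived_urls, url_column="URL",
                  flag_column="Archived in WCT"):
    """Flag each row whose URL already occurs in the archive record."""
    known = {normalise(url) for url in archived_urls}
    for row in rows:
        row[flag_column] = "yes" if normalise(row[url_column]) in known else "no"
    return rows


def open_row_url(driver, row, url_column="URL"):
    """Open the row's URL in an already started Selenium browser window."""
    driver.get(row[url_column])


def export_csv(rows, handle):
    """Write all rows, including added columns, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Invented two-row dataset standing in for the metadata spreadsheet.
dataset = io.StringIO(
    "Name,URL\n"
    "Voorbeeld,https://www.example.nl/\n"
    "Ander,http://example.org/pagina\n"
)
rows = list(csv.DictReader(dataset))
mark_archived(rows, archived_urls=["example.nl"])

output = io.StringIO()
export_csv(rows, output)
print(output.getvalue())
```

Normalising URLs before comparison is a design choice of this sketch: without it, the same site recorded with and without ‘www.’ or a trailing slash would be reported as not yet archived.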


Figure 1. The Webcollector tool