6th RESAW Conference

June 5 – 6, 2025

Program

Overview of the full program from June 4-6.
Abstracts are included in the day-by-day program sections below.

RESAW 2025: THE DATAFIED WEB
PROGRAM


Wednesday, June 4th


13:00-15:00 Session 1A: Pre-Conference 1
Demonstration of BelgicaWeb: Sustaining Access to Belgium’s Born-Digital Heritage (abstract)
13:00-15:00 Session 1B: Pre-Conference 2
Towards an “Algorithmic Archive”: Developing Collaborative Approaches to Persistent Social and Algorithmic Data Services for Researchers (abstract)
15:00-15:30 Coffee Break
15:30-17:30 Session 2A: Pre-Conference 3
Mentorship for Early Career Scholars in Web Archive Studies (abstract)
15:30-17:30 Session 2B: Pre-Conference 4
Empowering Data-Driven Research Through Digital Archives with Internet Archive’s ARCH (abstract)
15:30-17:30 Session 2C: Pre-Conference 5
Qualitative Digital Methods Workshop: Mapping the Evolution of (AI) Content Generation Infrastructures (abstract)
Thursday, June 5th


10:45-11:00 Coffee Break
11:00-12:30 Session 5A: Platforms
11:00
Metrics on the Inside: How Platform Employees Understand Platform Health (abstract)
11:20
The platformization of the follower factory: para-platforms, automation, and labor in the market for social media engagements (abstract)
11:40
Super-App Histories: Tracing Alipay, Meituan, and WeChat through App Repositories (abstract)
11:00-12:30 Session 5C: Technologies and Datafication
11:00
CD-ROMs versus Online in the 90s: Hybrid Paths to Datafication (abstract)
11:20
«Finally the big internet connected to tonet»: infrastructure, websites, users practices and imaginaries as the components of the «-net» (abstract)
11:40
A Hidden Track? The start of implementation and the current use of trac(k)ing methods concerning the Internet of Things basic technology Bluetooth. (abstract)
12:30-14:00 Lunch Break
14:00-15:30 Session 6A: Gender and Intimacy
14:00
“The flames are 50/50 right now”: content moderation practices at the onset of the HIV/AIDS epidemic in the United States (1982–1990) (abstract)
14:20
A Marriage of Convenience: Transgender Websites within LGBT+ Hyperlink Networks, 2009-2022 (abstract)
14:40
Flashing Intimate Things in People’s Faces. Intimate Computing and the Datafied Web. (abstract)
14:00-15:30 Session 6B: Web Archives Practices
14:00
Bulk access to web-archived data using APIs (abstract)
14:20
Navigating the Datafied Web: User requirements and literacy with web archives (abstract)
14:40
Lessons learnt from preparing collections as data: the UK Web Archive experience (abstract)
15:30-16:00 Coffee Break
16:00-17:30 Session 7A: Web Archives as Data
16:00
Mining Digital Terror: A Case Study in Using September 11 Web Archives as Data (abstract)
16:20
Establishing which websites constituted a national web in the 1990s (abstract)
16:40
Datafication of Web Archives and the Periodization of Website History: A Case Study of the National Museum of Australia (abstract)
16:00-17:30 Session 7B: RSNs
16:00
To Monetize or not to Monetize: doubts, resistance and U-turns in early YouTubers communities (abstract)
16:20
Tumblr Purge: A Story Told Through Data (abstract)
16:40
The business of datafied identity: LiveRamp’s evolution in the audience economy (abstract)
16:00-17:30 Session 7C: Social Media and APIs
16:00
Robots.txt and A History of Consent for Web Data Capture (abstract)
16:20
On Reciprocity – Algorithmic Interweavings between PageRank and Social Media (abstract)
16:40
APIs. How their role in the history of computing and their software engineering principles shape the modern datafied web. (abstract)
17:30-18:00 Coffee Break
20:00-23:00 Dinner Reception
Friday, June 6th


10:30-11:00 Coffee Break
11:00-12:30 Session 10B: Platform Histories Roundtable
11:00
Platform Histories Roundtable (with Miglė Bareikytė, Marcus Burkhardt, Devika Naraya, Anne Helmond, Fernando van der Vlist) (abstract)
11:00-12:30 Session 10C: Past Metrics
11:00
Translating Web Data into Media History: A Methodological Reflection of Archiving and Analyzing the XS4ALL Homepage Collection. (abstract)
11:20
The early datafied web: Visitor counters on the Danish web in the 1990s (abstract)
11:40
From Hit Counters to the Professionalisation of Web Metrics in Luxembourg (1990s-Mid-2000s) (abstract)
12:30-14:00 Lunch Break
14:00-15:30 Session 11A: Data Regimes
14:00
Historicizing Environmental Data on the Web: Surfrider.org, 1997-2024 (abstract)
14:20
The un/expected work of open data policies (abstract)
14:40
Investigative turn in the Baltics in times of war in Europe (abstract)
14:00-15:30 Session 11B: Web Archives Practices
14:00
Temporally Extending Existing Web Archive Collections for Longitudinal Analysis (abstract)
14:20
Engaging audiences with the UK Web Archive: Strategies for general readers, data users, and the digitally curious (abstract)
14:40
Seed lists on themes and events on Arquivo.pt: a curious starting point for discovering a web archive (abstract)
14:00-15:30 Session 11C: Methods
14:00
Critical AI technography: Researching the material political economy and power of AI platforms (abstract)
14:20
AI: A Lever for ‘Decolonizing’ Archives? Web Archives as a Datafield for Critical and Inclusive Uses of AI in History (abstract)
14:40
Echoes of Dolly: isolating long-term political schemata by abstracting web archives as Zotero collections (abstract)
15:30-15:45 Coffee Break
15:45-16:30 Session 12: My PhD in 5 Minutes
15:45
Before WEB 2.0: A Cultural History of Early Web Practices in the Netherlands from 1994 until 2004 (abstract)
15:50
Manifesting The Web: Network Imaginaries in Manifesto Writing Between the 1980s and the 2020s (abstract)
15:55
Battlefield of Truth(s) on Investigative Frontlines: From Data Activism to OSINT Professionalism (abstract)
16:30-16:45 Coffee Break
 
 
Program for Wednesday, June 4th
 
 

13:00-15:00 Session 1A: Pre-Conference 1
Demonstration of BelgicaWeb: Sustaining Access to Belgium’s Born-Digital Heritage

ABSTRACT. BelgicaWeb is an innovative web archiving project to preserve and provide sustainable access to Belgium’s born-digital heritage, including websites and social media content. BelgicaWeb is a BRAIN-be 2.0 project funded by BELSPO (the Belgian Science Policy Office). The project brings together partners with different areas of expertise: KBR (Royal Library of Belgium) is the project coordinator; CRIDS at the University of Namur provides expertise on the relevant legal frameworks; and IDLab, GhentCDH, and imec-mict-ugent of Ghent University work on data enrichment, user engagement and evaluation, and outreach to the research community, respectively.

This demo will showcase a user-friendly interface and an API for KBR’s web-archived content, both being developed within the project and optimised for archived websites and social media, enabling researchers and the public to explore these collections in novel ways. By enriching metadata through techniques like Natural Language Processing and Linked Open Data, the project provides advanced search and data interaction capabilities.

The BelgicaWeb platform addresses the challenges of ephemeral born-digital content by creating new collections of archived web content, aggregating existing (meta)data, and ensuring that these collections are Findable, Accessible, Interoperable, and Reusable (FAIR). During this demonstration, we will highlight key features, including full-text search, multilingual functionalities, and data-level access through a robust API designed for big data analyses and digital humanities research.

This demonstration aims to engage both technical and non-technical audiences, providing insights into the development of the access platform and API. The possibility to exchange best practices with researchers working with archived web material during the demo can provide additional useful insights for the BelgicaWeb project and is therefore an added value.
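
To make the kind of access described above concrete, here is a minimal sketch of how a researcher might query a full-text search API of this kind from Python. The BelgicaWeb interface and API are still in development, so the base URL, endpoint, parameter names, and response fields below are illustrative assumptions, not the project’s documented interface.

```python
import requests

# Hypothetical illustration only: the BelgicaWeb API is still in development,
# so the base URL, endpoint, parameter names, and response fields below are
# assumptions, not the project's documented interface.
BASE = "https://api.belgicaweb.be"  # assumed base URL

def search_archive(query, lang="nl", size=20):
    """Sketch of a full-text search call against a FAIR web-archive API."""
    resp = requests.get(
        f"{BASE}/search",
        params={"q": query, "lang": lang, "size": size},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

for item in search_archive("verkiezingen").get("items", []):
    print(item.get("url"), item.get("timestamp"))
```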

13:00-15:00 Session 1B: Pre-Conference 2
Towards an “Algorithmic Archive”: Developing Collaborative Approaches to Persistent Social and Algorithmic Data Services for Researchers

ABSTRACT. Proposed Duration: 120 minutes (the duration of the workshop could be adjusted to accommodate available time and conference organisation needs)

Social media platforms have become a fundamental means of communication, shaping the contemporary understanding of human behaviours, health and political crises as well as documenting historical events (Simon, 2012; van Dijck, 2011). However, social platforms are private organisations that impose strict limitations on access to and preservation of data (Bruns, 2019; Thomson, 2016). Despite their importance for research purposes and as cultural heritage material, only a few memory institutions consistently archive this important source of information, posing a significant risk to its long-term availability. The Algorithmic Archive project is part of the Bodleian Libraries’ broader strategy to further unlock the potential of its existing born-digital collections. The project seeks to develop a sustainable strategy to create a persistent social and algorithmic data archive which can support research efforts in a wide range of disciplines. Members of the project will moderate the session.

The workshop will be divided into two corresponding sessions, each beginning with a brief introduction (5 minutes) outlining the session’s aims and the tasks for the audience, as follows:

1. Use Case Presentations and Breakout Discussions (60 minutes)

– Participants will be invited to share their experiences using social media and algorithmic data in their projects, highlighting research questions addressed, methodologies employed, and challenges encountered.

– Participants will then break into small groups to discuss specific themes, such as data access, tool reliability, data and metadata structures, and interdisciplinary approaches, fostering a collaborative environment for knowledge exchange. A set of questions and topics for discussions will be provided.

2. Building A Sustainable Infrastructure for A Persistent Social and Algorithmic Data Service (60 minutes)

– A guided session to gather participants’ insights on key aspects of developing social data archive services, including issues and expectations surrounding short- and long-term access. This session will also offer the opportunity to identify potential partnerships for the development of standards to preserve and access social data.

The workshop will conclude with a brief (5 minutes) session to summarise key insights, outlining action points, and discussing how memory institutions can support researchers’ needs as well as identifying potential partners for the development of shared standards for the collection of social and algorithmic data.

This workshop welcomes insights and perspectives from researchers, data scientists, archivists, librarians, and anyone interested in the research implications of social media and algorithmic data. By bringing together these diverse perspectives, the workshop aims to foster discussions and partnerships to develop sustainable strategies to collect social media platforms, and ultimately benefit both scholarship and society.

15:00-15:30 Coffee Break
15:30-17:30 Session 2A: Pre-Conference 3
Mentorship for Early Career Scholars in Web Archive Studies

ABSTRACT. This session aims to create a space for open discussion and networking for early career scholars (PhD students and postdoctoral researchers). Organized by five advanced scholars with strong expertise in web archives, this two-hour session will focus first on the role and place of web archives in research, including how case studies, close and distant reading, and different tools and methods may be used, refined, and presented in research. Ethical and legal issues will also be addressed, based on concrete needs (copyright, anonymization of research results, FAIR data, and so on). The session may also move on to more general questions related to academic careers, keeping strongly in mind the research areas of participating scholars in web studies. The topics to be considered may include: opportunities for funding; relevant journals and strategies for publication; avenues for promoting and disseminating research to both academic and public audiences, including via social networks, the main conferences related to web archives, and other forums; and issues of inclusivity and diversity facing researchers in the field.

Collectively, the organizers of this session combine a wide range of disciplinary and professional perspectives, knowledge of different national and international contexts, and diverse skillsets and expertise in web archives. They have also had varying career trajectories, and in particular have come to working with web archives via different routes and through different experiences, partnerships and topics.

The session will focus strongly on the needs of early-career scholars and adapt to their requests, in order to align closely with the challenges they may face in our area. With this second mentorship session (the first was organized in Marseille for RESAW23), we hope to establish a regular session at the RESAW conferences. It may also facilitate the development of a peer network among the attendees, who may develop their research and build their professional networks alongside each other.

15:30-17:30 Session 2B: Pre-Conference 4
Empowering Data-Driven Research Through Digital Archives with Internet Archive’s ARCH

ABSTRACT. In this comprehensive 2-hour session, we will explore and discuss the latest advancements and innovations of the Internet Archive’s ARCH platform.

ARCH (Archives Research Compute Hub) is a cutting-edge platform engineered to facilitate the building of research collections, enable computational analysis, and support the generation of datasets from terabytes and even petabytes of data. ARCH supports the open publication and preservation of user-generated datasets created from thousands of libraries, archives, and memory organizations worldwide, empowering researchers, students, and information professionals to study, analyze, and interpret digital collections in unprecedented ways.

Designed with a focus on curating research collections using primary digital sources such as web pages, texts, and images, ARCH enables users to effortlessly create over a dozen distinct datasets from these sources with a simple click. These datasets can be directly downloaded either through an in-browser interface or via an API, enhancing accessibility and user experience.
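
As an illustration of the download path described above, the following sketch streams a derived dataset to disk. This is a sketch only: the ARCH URL pattern, authentication header, and identifiers below are assumptions for illustration, not ARCH’s documented interface.

```python
import requests

# Sketch only: ARCH dataset downloads require an account, and the URL
# pattern, auth header, and identifiers below are illustrative assumptions,
# not ARCH's documented interface.
ARCH_BASE = "https://arch.archive-it.org"  # assumed
API_KEY = "YOUR-API-KEY"                   # assumed auth scheme

def download_dataset(collection_id, dataset_name, out_path):
    """Stream a derived dataset (e.g. a domain-frequency table) to disk."""
    url = f"{ARCH_BASE}/api/datasets/{collection_id}/{dataset_name}"
    with requests.get(
        url, headers={"X-API-Key": API_KEY}, stream=True, timeout=60
    ) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

download_dataset("12345", "domain-frequency", "domain-frequency.csv")
```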

Moreover, ARCH facilitates the efficient utilization of these research-ready datasets by offering in-browser data previews and visualizations. More interactive analysis is encouraged and supported by enabling the integration of computational tools such as Jupyter Notebooks, Google CoLab, Gephi, and Voyant into the research process.

A significant feature of ARCH is its one-click publication mechanism on archive.org, allowing datasets to be easily accessed, shared, and preserved indefinitely. This feature not only promotes open access to information but also ensures the long-term preservation of valuable data.

To support and enhance user experience, ARCH provides comprehensive technical support, online training, and extensive help center documentation. These resources are designed to optimize the effective use of the platform, making sophisticated research processes more accessible to users who may not have advanced coding or scripting skills.

ARCH benefits from the robust, non-profit infrastructure of the Internet Archive and utilizes open-source tools to streamline the computational handling of digital collections. This enables librarians, collection managers, and educators to offer sophisticated research tools to their communities, thereby democratizing access to advanced research methodologies.

Recently, ARCH has integrated AI-powered tools that enhance the platform’s capabilities. These tools are readily accessible on our dedicated computing cluster, equipped with GPU support, making advanced computational tasks more feasible for our users.

ARCH is available for both institutional and individual use, offering flexible access options for a diverse range of professionals including researchers, librarians, archivists, museum staff, journalists, and more.

This format provides a comprehensive overview of ARCH’s features, but we will also delve deeper into the technical details and underlying technologies. It will feature a combination of presentations, brief demonstrations, and interactive live sessions. Participants will have the opportunity to engage with the tools interactively, ask questions, and view actual datasets, making this an informative experience that offers participants a clear view into how the ARCH platform can enhance their research capabilities.

15:30-17:30 Session 2C: Pre-Conference 5
Qualitative Digital Methods Workshop: Mapping the Evolution of (AI) Content Generation Infrastructures

ABSTRACT. tba

 
 
Program for Thursday, June 5th
 
 

10:45-11:00 Coffee Break
11:00-12:30 Session 5A: Platforms
11:00
Metrics on the Inside: How Platform Employees Understand Platform Health

ABSTRACT. Where have employees at social media platforms looked to get a ‘status update’ about the health of the platforms they work for? This paper draws on interviews conducted with 53 former employees of platforms that have shuttered over time, including GeoCities, Friendster, MySpace, and Vine, to understand the ways that employees used analytics, broadly construed, to make sense of the success and decline of a platform. Originating in a broader project about platform closure, this presentation adds to RESAW 2025’s focus on histories of the Datafied Web by describing the ways that platform employees used both traditional notions of quantitative analytics to understand their platform’s success and decline, as well as less-discussed, qualitative markers of health. This paper thus complicates an understanding of analytics as purely quantitative, instead showing the ways that employees and the organizations they worked for integrated information from quantitative analytics programs with sometimes surprising metrics like press coverage, whether they needed additional computing infrastructure, visits from public figures to organization offices, and public reception of company merchandise.

From GeoCities to Vine, the most tangible evidence of decline for these organizations was through quantitative user metrics. Digital metrics (alternatively, digital analytics) included information like page views, amount of viewing time, number of comments, posts, or likes, and paths through numerous pages, measurements describing user behavior as users interact with a site (Tandoc, 2014). These metrics were in turn used to locate value and attach meaning to user behaviors on a site (Beer, 2017). The importance of digital metrics in media production cultures has been discussed most in the context of online journalism, wherein figures around audience engagement have been shown to shape the editorial process at numerous stages (Beer, 2017; Christin, 2020).

Research on metrics and social media platforms shows how analytics displayed to users–like the number of likes on Instagram or the number of retweets on Twitter–act as markers of social distinction (Paßmann & Schubert, 2020), while it is also known that user metrics influence the design of algorithms on platforms, which in turn shape user experience on those sites (Couldry & Powell, 2014). Despite these studies, there is limited empirical evidence showing how social media platform employees have made sense of user metrics within organizations. Interviews with platform employees show the significance of digital metrics for internal comprehension of a platform’s value and its competitiveness with like entities, especially when this value appears to be shifting from success to decline.

Yet an understanding of how employees digested this information into conclusions about a platform’s overall health, this paper argues, must also consider the qualitative and affective information that platform employees were interpreting. Primary themes across the interviews included the importance of: (1) media and press, especially in magazines and newspapers with high cultural capital; (2) technical capacity, especially the counterintuitive notion that if the site kept crashing it was because it was growing at an unprecedented pace; and (3) public reception of company merchandise, for instance, how strangers would respond to an employee wearing a company-branded t-shirt. By surfacing the varied means through which employees and organizations understand organizational health through both quantitative and qualitative analytics, a more complete story of platforms, and their place in web history, can be told.

11:20
The platformization of the follower factory: para-platforms, automation, and labor in the market for social media engagements

ABSTRACT. This paper examines the evolution of an illicit, sprawling, yet obfuscated global market for artificial social media engagements, which inflates follower counts and engagement metrics on social media profiles and posts. The organization of this market has previously been characterized using industrial metaphors such as ‘click farms,’ ‘follower factories,’ and ‘digital sweatshops’ primarily based in the Global South. These descriptions emphasize exploitative and often informal labor conditions. Using a mixed-methods approach that integrates ethnography with digital methods, this research delineates the platformization of the follower factory, focusing on the shift towards automation rather than manual interaction, which has facilitated the rapid expansion of this multi-sided market. As resellers have scaled up, a more complex labor organization has become necessary, involving marketing, customer service, and administrative work, which has shaped cottage industries across regions such as Indonesia, India, and Nigeria.

A key element of this research is its use of Internet history methods, including historical reverse IP lookups and the Internet Archive’s Wayback Machine, to trace the evolution of this market over time. By employing historical reverse IP lookups, we were able to map the growth of panel websites (websites used for reselling engagement services) and their global distribution, highlighting how the market has expanded since 2016. This process revealed the central role of platform providers such as Perfect Panel, which offers pre-built platforms for reselling social media engagements. This technical infrastructure has allowed even users with limited technical knowledge to set up and scale engagement reselling businesses, contributing to the market’s rapid proliferation.

Moreover, the Wayback Machine allowed us to track the evolution of engagement services offered by platforms such as Just Another Panel (JAP) over time. By capturing historical snapshots of the services offered on JAP, we observed the diversification of engagement types and their associated pricing, revealing both the volatility of the market and the persistence of certain services. Through this historical lens, we explore how the para-platform ecosystem has developed alongside and remains reliant on corporate social media platforms.
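
The snapshot-tracking workflow described above can be reproduced with the Wayback Machine’s public CDX API. A minimal sketch follows; the endpoint and parameters are part of the documented CDX interface, and the target domain is the panel site named in the abstract.

```python
import requests

# Query the Wayback Machine CDX API for monthly captures of a panel site,
# mirroring the longitudinal tracking described in the abstract.
CDX = "http://web.archive.org/cdx/search/cdx"
params = {
    "url": "justanotherpanel.com",
    "output": "json",
    "from": "2016",
    "to": "2024",
    "filter": "statuscode:200",
    "collapse": "timestamp:6",  # keep at most one capture per month
}
rows = requests.get(CDX, params=params, timeout=60).json()
if rows:
    header, captures = rows[0], rows[1:]
    ts_idx = header.index("timestamp")
    for row in captures:
        # Each replay URL can then be fetched and parsed for service listings.
        print(f"https://web.archive.org/web/{row[ts_idx]}/https://justanotherpanel.com/")
```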

At the core of this market lies what we term a para-platform ecosystem, which, while operating outside the immediate control of corporate social media platforms, still depends on their infrastructure for the delivery of engagement services. This ecosystem exists in a conflictual and asymmetrical yet productive relationship with social media platforms. On one hand, it disrupts the platform’s organization of users and their activities; on the other, it generates activity metrics that align with platforms’ economic models by boosting user engagement.

By examining platformization and platform ecosystems ‘from below,’ this paper challenges dominant platform theory, which typically focuses on corporate platforms. It argues that the para-platform ecosystem complicates conventional narratives by demonstrating how platforms are not only centers of economic power and governance but also spaces where informal and illicit economies thrive. By exposing this illicit backend of the datafied web, the study provides critical insights into the hidden infrastructures and practices of online engagement markets at the intersections between formal platform economies and their shadow counterparts.

11:40
Super-App Histories: Tracing Alipay, Meituan, and WeChat through App Repositories

ABSTRACT. This paper offers a historical exploration of the phenomenon of ‘super-appification’ in China, focused on a comparative analysis of Alipay, Meituan, and WeChat. While discourses around ‘super-apps’ are often accompanied by promotional narratives and hype, recent research in digital media studies has suggested the term nevertheless reflects an increasing concentration of corporate media power within the global platform and app economy – a process characterized by dual tendencies of platformization and appification leading to the emergence of integrated service ecosystems encompassing communication, financial transactions, transportation and delivery, and more (Pitre, 2022; van der Vlist, 2024). Notably, apps like WeChat have been frequently cited as ‘the poster-child’ of this trend (Chan, 2022), resonating with critical examinations of such platforms that have broadly considered their governance structures and infrastructural implications (Plantin & de Seta, 2018; de Kloet, et al. 2019), including their pervasive integration into everyday life (Harwit, 2017; Chen, et al. 2018). With a focus on these new Asian megacorps (Steinberg, et al. 2022), we thus aim to contribute to the late history of ‘the datafied web,’ a period when platforms like Meituan and Alipay evolved from websites into app-based ecosystems, and WeChat introduced the potential for internal mini-programs. While still relying on HTTP and HTTPS protocols, their growth marks a shift to mobile-first ecosystems that pursue datafication through proprietary protocols, custom software development kits (SDKs), and closed infrastructures. The development of ‘super-apps’ like Alipay, Meituan, and WeChat, accordingly, highlights the continued entanglement of the web with platformization and the balkanization of the internet, signaling the emergence of new digital fiefdoms.

Methodologically, we contribute to platform and app historiography by expanding multi-situated app studies (Dieter, et al. 2019) to new modes of diachronic analysis. Taking inspiration from biographical studies of websites (Rogers, 2017) and platforms (Burgess & Baym, 2020; Helmond & van der Vlist, 2019), we consider how apps such as Alipay, Meituan, and WeChat have played an active role in ‘authoring’ their own historical trajectories through being situated within digital infrastructures (Helmond & van der Vlist, 2021). To operationalize this perspective, we leverage traces of software versioning sourced from industry data, web archives, and app repositories in conjunction with digital tools like scrapers, decompilers, and code inspectors. A key resource in our work is AndroZoo, a large-scale app repository hosted by the University of Luxembourg, which contains over 24 million Android application packages and their metadata collected from various marketplaces. While AndroZoo has mainly supported research on app descriptions, malware detection, app permissions, and GDPR compliance (Alecci, et al. 2024), its potential for interdisciplinary studies of media concentration and the phenomenon of super-appification remains underexplored. We will present several initial findings from this exploratory research, including the large expansion of device permissions by ‘super-apps’ to facilitate datafication across an increasing range of services; their deep integration with dominant smartphone manufacturers; the parallel platformization strategies taken up to expand beyond mainland China; and the patterning of infrastructural traces with corporate acquisitions. In addition to documenting these specificities of Chinese ‘super-app’ development, our inquiry reflexively considers the challenges associated with utilizing such complex archives, including their technical limitations, and the need for diverse methodological considerations to adequately ground research findings.
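
As a sketch of the repository workflow described above: AndroZoo distributes APKs through a keyed download API and publishes index dumps of package metadata. The download endpoint below is AndroZoo’s documented one; the index filename and column names are assumptions based on its public CSV dumps, and the API key requires registration with the project.

```python
import csv
import gzip

import requests

# Pull historical APK releases of one app from AndroZoo for diachronic
# analysis. The index filename ("latest.csv.gz") and column names
# ("pkg_name", "vercode", "sha256") are assumptions based on AndroZoo's
# public index dumps.
APIKEY = "YOUR-ANDROZOO-KEY"  # assumed placeholder; requires registration

def iter_versions(index_path, package_name):
    """Yield (sha256, version_code) for every indexed release of one package."""
    with gzip.open(index_path, "rt", newline="") as f:
        for row in csv.DictReader(f):
            if row.get("pkg_name") == package_name:
                yield row["sha256"], row.get("vercode")

def download_apk(sha256, out_path):
    """Stream one APK from the AndroZoo download API to disk."""
    with requests.get(
        "https://androzoo.uni.lu/api/download",
        params={"apikey": APIKEY, "sha256": sha256},
        stream=True,
        timeout=120,
    ) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

# Example: trace WeChat's Android release history.
for sha256, vercode in iter_versions("latest.csv.gz", "com.tencent.mm"):
    download_apk(sha256, f"wechat-{vercode}.apk")
```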

11:00-12:30 Session 5B: Panel: The Challenge of Archival Practices’ Context for a Better Understanding of Data Web Archives at Aix-Marseille University
11:00
The Challenge of Archival Practices’ Context for a Better Understanding of Data Web Archives at Aix-Marseille University

ABSTRACT. Web archives have become a key data source within universities—produced and reused like any other type of data. One particular aspect of this issue is that the teams working on it are diverse and multidisciplinary, and in the case of this session, all of them focus on the Mediterranean region. Chaired by Sophie Gebeil (UMR TELEMMe), this panel session addresses the conference theme ‘Web Archiving Data Practices and Challenges’ by discussing the differentiated practices and challenges of archived web data for Mediterranean studies. It will introduce three speakers from different professions at Aix-Marseille University (AMU), each with distinct expertise in web data. They incorporate web archives into their work in various ways, such as tracking the progress of research programs, supporting PhD students with their theses, or archiving researchers’ outputs. Christine Mussard, historian and deputy director of the IREMAM research laboratory, demonstrates how investigative practices are transformed through contact with web archives. Véronique Ginouvès, head of the MMSH archives, shares the challenges related to the preservation of online databases managed by the MMSH. Finally, JC Peyssard, head of the MMSH media library, shows how his own expert practices intersect with the support requests from novice researchers who are unaware that understanding the datafied web also requires a hermeneutic reflection. At the MMSH, web archiving practices span from considering web archives as historical material (Brügger, 2012) and as a critical method (Weber, 2020), to viewing web archives as a domain of expertise in data analysis within the humanities and social sciences, in order to address the challenges faced by a research community dealing with impeded fieldwork. The presentation aims to highlight the necessity of working together in order to understand and use data web archives over the long term, while documenting who is involved and how they do it. It demonstrates the necessity of an inter- and transdisciplinary approach that combines the specialized expertise of academics and non-academics to transform practices and address the challenges posed by the datafied web in the study of Mediterranean societies.

The web archive as a resource in an impeded field: from a substitute source to a major trace in the development of colonial history. Christine Mussard (MCF HDR – INSPÉ – UMR IREMAM). The practice of social science research presupposes a regular relationship with the field of study, which is seen as a place where the researcher becomes immersed in the subject matter and where a wide range of data can be collected. In recent years, the Covid epidemic has hampered access to these experimental spaces, forcing researchers to invent other ways of reaching them. For researchers involved in projects in the Arab world, geopolitical tensions have exacerbated these obstacles even further. The Institute of Research and Study on the Arab and Islamic Worlds (IREMAM) conducts research on the entire Middle East and North Africa region in all the social sciences and humanities. Access to the field is regularly restricted or even forbidden as conflicts erupt between Middle Eastern states. The use of web archives and, more generally, digital data has become more widespread among researchers, who have had to develop new skills to make use of them. In this presentation, I propose to show, through the prism of a research experience in the history of Algeria under French domination, the evolving approach I have taken to the web archive, initially considered as a substitute material pending access to the source in situ, then envisaged as a central piece of my documentation. This reflection therefore looks at the way in which a constrained context affects the historian’s relationship with their sources, including testimonies, generating an unexpected renewal of their uses and a revision of the way in which they are related and ranked. However, in addition to the odd photo and memory I took from these websites, I was also able to see the different ways of presenting the school memories which occupied a large part of these community sharing platforms. They revealed how the contributors were attempting to reunite classes in today’s very different French Algeria, providing insights into the social connections of the past and the form they take today. The aim of this presentation is, therefore, to investigate the different ways of using these websites which tell memory-packed stories, sorting the real from the fake in terms of the sources and understanding the practice of memory expression as a research topic.

Neglect, Stammering, Focus: Processes of an Archival Experience of Archive Collections and Audiovisual Projects at AMU Posted on the Web Over the Past 30 Years. Véronique Ginouvès (IRHC CNRS, UAR3125). The presentation aims to offer both a reflective history of practices and to highlight the challenges faced by a Mediterranean sound archives and research center at Aix-Marseille University as the world of the web and its uses unfolds. The first email written from the UMR TELEMMe, my laboratory created in 1994 at the Université de Provence, was sent from the Sound Archives Department in 1995, yet it was never archived. The first online sound archives database was created using MySQL in 1997; it left no trace. The first relational database software that facilitated the online publication and documentation of the sound archives dates back to the early 2000s, but it was only in 2005 that we thought to archive parts of this documentation on the Wayback Machine. In 2020, the database software the Sound Archives Center had been using was acquired and a retroconversion of all metadata was required. We did it in EAD format and made an export in DC format from the OAI-PMH repository; we hope the metadata of the new platform will soon be archived at CINES, the platform of the Ministry of Higher Education and Research. The editorial projects of the “Pôle image-sons, pratiques du numérique”, in which I have been involved since 1998, could now be described using terms like “new media” or “alternative narratives”. These projects were online and operated with Adobe Flash. Fortunately, this time, we archived them using Conifer and the Wayback Machine before Flash’s discontinuation in 2019. In recent months, we contacted INA for legal web deposit, and we are awaiting confirmation on whether the database will be included. The blog for the “Pôle images-sons, pratiques du numérique” project, which captured content from the older site (the form is available on the Wayback Machine), is saved on the CINES servers by the platform Hypotheses itself. The web archiving process for projects from an archival center requires careful foresight. However, the first challenge is simply remembering to think about it. Using a database daily fosters a form of negligence: tomorrow, I will still have access to the site, and the day after that as well. Yet, there comes a day when everything stops. As I write this summary, the Wayback Machine has been forced to suspend operations due to a cyberattack—a sobering reminder of the fragility of our web archiving efforts and the limited tools available to us. This presentation also aims to highlight the issue of web archive hosting, considering the various archiving spaces while recognizing the risks inherent in large, centralized spaces that host data and which are often seen as high-value targets for hackers.

Web Archives as a Substitute for Fieldwork: Lessons from a Decade of Research in the MENA Region. Jean-Christophe Peyssard (IR CNRS, UAR3125). For nearly a decade, numerous researchers and students have turned to web sources for their research. The centrality of the web and social media in global culture, coupled with increasing difficulties in accessing field sites—or complete inaccessibility in certain areas—has significantly contributed to this trend. Impeded fieldwork is now often replaced by digital fieldwork across all disciplines of the humanities and social sciences. Archaeologists have become engrossed in digital archives, while anthropologists, political scientists, and sociologists have spread across social networks and online video platforms, collecting disordered and context-lacking data on their hard drives during their explorations. Simultaneously, research libraries are beginning to recognize the crucial issues related to the consultation, collection, and preservation of web-based corpora (Neal, 2014), and are witnessing an increasing number of users confronted with these new field materials. The field of web archives has become highly structured since the creation of the IIPC in 2003. The tools, methods (Brügger, 2018), and research collectives are now well-established and have proven their relevance. However, vernacular and improvised uses of web sources remain largely the norm within the community, as observed from the research library of the MMSH (AMU). After a decade of experience in training, supporting, and conducting projects using web archives focused on the Middle East and North Africa, this presentation aims to provide an assessment and offer perspectives on the difficulties and challenges encountered. Can digital fieldwork substitute for impeded fieldwork? Under what conditions can web archives provide genuinely useful knowledge in the study of a society and its social and cultural realities? What are the necessary skills and knowledge prerequisites for an ethical and effective use of web archives in the context of impeded fieldwork?

References

Brügger, Niels. 2012. “Web History and the Web as a Historical Source.” Zeithistorische Forschungen/Studies in Contemporary History 9(2). https://doi.org/10.14765/zzf.dok-1588.

Brügger, Niels. 2018. The Archived Web: Doing History in the Digital Age. Cambridge, MA: The MIT Press.

Gebeil, Sophie. 2021. Website Story: Histoire, Mémoires et Archives du Web. Bry-sur-Marne: INA.

Gebeil, Sophie, and Jean-Christophe Peyssard, eds. 2023. Exploring the Archived Web during a Highly Transformative Age: Proceedings of the 5th International RESAW Conference, Marseille, June 2023. FUP. https://doi.org/10.36253/979-12-215-0413-2.

Mussard, Christine. 2024. “Websites as Historical Sources? The Benefits and Limitations of Using the Websites of Former Repatriates for the History of Schooling in Colonial Algeria.” In Exploring the Archived Web during a Highly Transformative Age, edited by Sophie Gebeil and Jean-Christophe Peyssard. https://doi.org/10.36253/979-12-215-0413-2.27.

Neal, James G. 2014. “The Integrity of Research Is at Risk: Capturing and Preserving Web Sites and Web Documents and the Implications for Resource Sharing.” Lyon, France. http://library.ifla.org/id/eprint/907.

Weber, Matthew S. 2020. Web Archives: A Critical Method for the Future of Digital Research. Aarhus: WARCnet.

11:00-12:30 Session 5C: Technologies and Datafication
11:00
CD-ROMs versus Online in the 90s: Hybrid Paths to Datafication

ABSTRACT. This proposal aims to retrace part of the history of data sharing, storage, and information management by focusing on the 1990s and on the hybridization of, and transition from, CD-ROMs to online databases. Our starting point is anchored in the debates within the library community, where librarians, as early adopters, engaged critically with these emerging technologies. Drawing on these secondary sources, as well as archives from the EU Publications Office and insights from a dozen oral interviews, we aim to analyze the role of CD-ROMs as a “transient technology”, then the hybrid data management practices, based on the case study of the EU Publications Office, and finally the broader impact of CD-ROMs on the process of datafication.

The first part focuses on the discussion within the library community about CD-ROMs as a “transient technology”. The adoption of CD-ROMs was influenced by the need to store, manage and disseminate large amounts of data efficiently. However, as online databases also emerged, the role of CD-ROMs was increasingly questioned. Scholars like Stratton (1994) and Bevan (1994) engaged in discussions based on an original article published in 1990 by McSean and Law entitled “Is CD-ROM a Transient Technology?”. These discussions reflected a broader uncertainty about the longevity of CD-ROMs and the evolving landscape of data management.

The second part examines the hybridization of data management practices within a specific context, using the EU Publications Office as a detailed (and in-progress) case study. This institution exemplifies the complex interplay between traditional print, CD-ROMs, and online platforms during the 1990s (Schafer, 2020). The Publications Office, tasked with disseminating large amounts of data, notably the daily publication of the Official Journal as well as public tenders, faced the challenge of managing multiple formats simultaneously. The Office itself, as a user of these technologies, had to navigate the challenges of integrating different formats into a cohesive information management strategy. At the same time, the end-users of the Office’s data were also adapting to the new formats, illustrating the dual-user dynamic in this transitional period. The case of the EU Publications Office thus provides a concrete example of how institutions managed the shift from analog to digital data and the co-existence of print, CD-ROM-based, and online information.

Finally, we will conclude with the broader role of CD-ROMs in the process of datafication. CD-ROMs were instrumental in the conversion, dissemination, and transmedia movement of data, which set the stage for the web. In this way, CD-ROMs served as a catalyst for the broader process of datafication.

References

Bevan, N. (1994). Transient Technology? The Future of CD-ROMs in Libraries. Program 28, no. 1, 1-14. https://doi.org/10.1108/eb047155.

McSean, T., and Law, D. (1990). Is CD-ROM a Transient Technology? Library Association Record 92, no. 11, 837-841.

Schafer, V. (2020). From Print to Digital, from Document to Data: Digitalisation at the Publications Office of the European Union. Open Information Science, (4), 204-217. https://doi.org/10.1515/opis-2020-0015.

Stratton, B. (1994). The Transiency of CD-ROM? A Reappraisal for the 1990s. Journal of Librarianship and Information Science, vol. 26, no. 3, 157-164.

11:20
«Finally the big internet connected to tonet»: infrastructure, websites, users practices and imaginaries as the components of the «-net»

ABSTRACT. This paper explores a specific period in the history of the internet in the Russian city of Tomsk. The city-wide internet, known as «tonet», existed from 1998 to 2008/2010. Established through a peering agreement among ISPs in 1998, tonet enabled affordable, high-speed access to local websites — and slower and much more expensive access to all other websites. After the introduction of home networks in 2001, the user base expanded rapidly. In 2008, unlimited data plans were introduced, which made the previous advantages of speed and price differences obsolete. Over the next two years, unlimited plans became widespread, and tonet became history. I focus on the crucial period from 2008 to 2010, when these changes in ISPs’ policies fundamentally altered user experiences. I have conducted 29 interviews on this topic, in addition to the 48 conducted in previous years, and analyzed local press publications, advertisements, and website archives to reconstruct users’ reactions to the changes in the city’s internet infrastructure. This paper is a follow-up to my presentation at RESAW 2019 and to an article by my colleagues Polina Kolozaridi and Dmitry Muravyov (Internet Histories, Volume 4, 2020/1).

This study contributes to both Internet Histories and Infrastructure Studies. I introduce the concept of «-nets» to describe networks like tonet, which sheds light on the interconnection of physical infrastructure (wires, routers) with digital infrastructure (websites, forums) and a discourse (imaginaries, self-descriptions, community narratives). I am aware of Kevin Driscoll and Camille Paloque-Bergès’s work on «The Net», which I translated into Russian; nonetheless, I propose a more infrastructural notion of this concept. «-nets» can be a productive way of distinguishing the internet segments corresponding to those three parameters as a special type of object — having natural boundaries, self-descriptions, and, importantly, a scale smaller than a country. The vast majority of work in this area focuses on the scale of countries, and the «-net» may be the missing element in a more granular understanding of the scale of networks.

Furthermore, I address the gap between infrastructure design and users’ practices. Thomas Hughes’s work shows that early infrastructure studies focused on the builders and overlooked users (Hughes 1983, 1986, Joerges, 1999:18). However, later research demonstrated the productivity of addressing the user. It shifted the focus from the inventors to the practices of using infrastructures, affirming the relational rather than essentialist nature of infrastructures: «there are only observed infrastructural relationships» (Slota, Bowker, 2017:531). This second conceptualization is not particularly interested in how infrastructures were built. Instead, it focuses on the impact of infrastructures on user practices and vice versa (a good example is Shah and Sandvig, 2008). In my paper, I draw on the resources of both conceptualizations to show tonet as a set of interconnected infrastructures. I will demonstrate how the material network becomes the infrastructure for urban websites and forums, which in turn become the infrastructure for user practices and generate discourse about tonet. In addition, I will outline the theoretical work that has been done to connect two conceptualizations in a consistent way.

11:40
A Hidden Track? The start of implementation and the current use of trac(k)ing methods concerning the Internet of Things basic technology Bluetooth.

ABSTRACT. Bluetooth technology now forms the cornerstone of numerous applications used by digital societies in their environments. Sensor-based interactions (coupling) between different mobile devices with integrated Bluetooth chips give rise to mobile communication networks, or co-operations between various participating human and technical actors. From its original claim as a “[u]niversal radio interface for ad hoc, wireless connectivity” (Haartsen 1998) – aimed at replacing cables and fostering wireless connections – the endeavor and possibilities for further adaptations and applications of this technology have significantly expanded over the past three decades. Examples include Internet of Things (IoT) applications through fitness trackers (AirTags) or smart watches (wearables). A variety of specialized applications have emerged from this universal technology. For the WPANs (Wireless Personal Area Networks) or piconets generated by Bluetooth devices, the simple claim “to ensure the best use of a shared medium” was pursued when they were introduced in 2000 (Braley et al. 2000: p. 26). However, due to increasingly digitalized living environments, the focus of interoperability has shifted to adaptations for purposes of the IoT. It is now centered on the “development of open consensus standards addressing wireless networking for the emerging Internet of Things (IoT), allowing these devices to communicate and interoperate with one another, mobile devices, wearables; Optical Wireless Communications (OWC), Autonomous Vehicles, etc.” (IEEE 802.15 Working Group on Wireless Specialty Networks). The pairing of different devices and the sharing of information are media practices facilitated by Bluetooth that organize the environments of digital societies. Media practices associated with Bluetooth and their interaction with existing and emerging environments (Sprenger 2019; Sprenger/Engemann 2015), i.e., the practice of environing – an active modification of the environments via the involved actors, whether humans or technologies (cf. Cubasch et al. 2021) – need to be considered when learning how digital infrastructures and data capturing are developing. Our environments have been translated into data since the advent of Computer Supported Cooperative Work (CSCW), and the ongoing data streams surrounding Bluetooth-enabled devices with integrated chips (i.e., sensor-based media), which bring about the computerization of physical environments, continue to evolve.

When contact-trac(k)ing methods were introduced very quickly via apps during the Covid pandemic (e.g. Germany’s Corona-Warning-App), users tended to be cautious about their use, especially regarding privacy concerns. Meanwhile, the technology has become ‘always-on’ in smart devices, and the ongoing tracking functions no longer seem to be an issue. But surveillance studies cannot be excluded from the discussion of Bluetooth technology, given its foundational role in tracking and tracing apps and various IoT applications often used for advertising and marketing purposes (e.g., Bluetooth Beacons) and corresponding practices such as the possible capturing of data (Agre 1994).

But since when has the possibility of tracing or tracking been part of Bluetooth technology? And how secure are the permanent ‘visibility’ and long-term activation of Bluetooth devices with regard to private, personal data?

Through methods such as oral history interviews and archival research (e.g., company archives), the historical reconstruction undertaken as part of the project focuses on (1) the development of Bluetooth technology and (2) the period when trac(k)ing capabilities became part of the technical specifications and the business model of the Bluetooth SIG. The project adopts a praxeological approach to basic research in media studies, focusing on digital infrastructures and levels of co-operation through Boundary Objects (Star/Griesemer 1989) in the origins of the technologies that we integrate almost seamlessly into our everyday digital lives.

The contribution aims to raise critical questions about this ubiquitous technology, its use practices and the options for digital contact trac(k)ing.

References:

Agre, Philip E. (1994): “Surveillance and Capture: Two Models of Privacy”, in: The Information Society 10(2), 101–127.

Braley, Richard C./Gifford, Ian C./Heile, Robert F. (2000): “Wireless Personal Area Networks: An Overview of the IEEE P802.15 Working Group”, in: Mobile Computing Communications Review 4(1), DOI: 10.1145/360449.360465, 20–27.

Cubasch, Alvin J./Engelmann, Vanessa/Kassung, Christian (2021): “Theorie des Filterns. Zur Programmatik eines Experimentalsystems”, Zenodo (Preprint 04/2021), DOI: 10.5281.

Haartsen, Jaap (1998): “Bluetooth – The Universal Radio Interface for ad hoc, Wireless Connectivity”, in: Ericsson Review, The Telecommunications Technology Journal, 1998(3), 110–117.

Star, Susan L./Griesemer, James R. (1989): “Institutional Ecology, ‘Translations’ and Boundary Objects: Amateurs and Professionals in Berkeley’s Museum of Vertebrate Zoology, 1907–39”, in: Social Studies of Science 19(3), 387–420.

Sprenger, Florian (2019): Epistemologien des Umgebens: Zur Geschichte, Ökologie und Biopolitik künstlicher environments, 1. Aufl., Bd. 65, Bielefeld: transcript.

Sprenger, Florian/Engemann, Christoph (2015): Internet der Dinge: über smarte Objekte, intelligente Umgebungen und die technische Durchdringung der Welt. Digitale Gesellschaft, Bielefeld: transcript.

12:30-14:00 Lunch Break
14:00-15:30 Session 6A: Gender and Intimacy
14:00
“The flames are 50/50 right now”: content moderation practices at the onset of the HIV/AIDS epidemic in the United States (1982–1990)

ABSTRACT. This paper received The Journal of Internet Histories Early Career Researcher Award for 2024.

The timeline for the onset of the HIV/AIDS epidemic in the United States occurred parallel to the domestic shift of computing and the advent of DIY computer networking efforts. During this critical time, many activists and community organisers within lesbian, gay, bisexual, transgender, and queer (LGBTQ+) spaces utilised computer networking, such as bulletin board systems (BBSs) and Usenet boards to facilitate information exchange within their affected communities. Due to the sensitive nature of the epidemic and often-vital need for up-to-date information, content moderation became an increasingly important issue on these boards. This paper uses varying archival methods to explore the development of content moderation practices, and the influence of HIV/AIDS culture, on bulletin board systems and Usenet boards, with a special focus on boards dedicated to LGBTQ+ content and HIV/AIDS information exchange.

14:20
A Marriage of Convenience: Transgender Websites within LGBT+ Hyperlink Networks, 2009-2022

ABSTRACT. The acronym LGBT+ suggests that a natural alliance exists between lesbians, gays, bisexuals, and transgenders. Indeed, relatively recently, “[m]any formerly LGB organizations began to ‘add the T’”—highlighting that these marginalized groups fight a common cause (Stone, 2009, p. 336). At the same time, however, transgenders and the other ‘letters’—gays and lesbians in particular—are “odd bedfellows” (Ros & Motmans, 2015). Transgender people “destabilize the otherwise easy division of men and women into the categories of straight and gay because they are both and/or neither” (Devor & Matte, 2004, p. 179). This has resulted in “a contradictory environment simultaneously welcoming and hostile:” “Transgender relations to gay and lesbian community formations necessarily became strategic—sometimes oppositional, sometimes aligned” (Stryker, 2008, pp. 146 and 149).

My talk sheds light on this uneasy alliance by means of hyperlink analyses. I analyzed the special LGBT+ web collection of the Dutch National Library. Perhaps nowhere else is the online LGBT landscape as rich and diverse as in the Netherlands. This collection (2009-present) contains over 200 websites of Dutch LGBT+ organizations (each harvested once annually); some catering to specific ‘letters,’ others to the entire LGBT+ community. It is unique in size and richness, but has not yet been researched.

Addressing web archiving data practices and challenges head-on, I will discuss how I analyzed the queer network that these websites formed and how this network evolved over time (2009-2022). For each year, I extracted and scrutinized all hyperlinks of these websites, for hyperlinks yield insights into “hyperlinked identities” (Szulc, 2015, p. 121). I concentrated on the thousands of links that directed to websites catering to LGBT+ people. Gephi was then used to visualize and analyze – i.e., distant-read – the resulting queer network.
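
A minimal sketch of this pipeline in Python, assuming the archived pages have been exported to local HTML files (the file name and seed URL below are placeholders, not part of the collection): outlinks are extracted, aggregated into a weighted host-to-host digraph, and written to GEXF for distant reading in Gephi.

```python
from urllib.parse import urlparse

import networkx as nx
from bs4 import BeautifulSoup

def host(url):
    """Reduce a URL to its bare hostname."""
    return urlparse(url).netloc.lower().removeprefix("www.")

# Map of locally saved archived pages to their original URLs
# (placeholder values; in practice these come from the harvested collection).
pages = {"transvisie.html": "https://www.transvisie.nl/"}

G = nx.DiGraph()
for path, source_url in pages.items():
    with open(path, encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f, "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("http"):
            src, dst = host(source_url), host(a["href"])
            if src and dst and src != dst:
                # Aggregate repeated links into edge weights.
                weight = G.get_edge_data(src, dst, default={}).get("weight", 0)
                G.add_edge(src, dst, weight=weight + 1)

# Export for visualization and distant reading in Gephi.
nx.write_gexf(G, "lgbt_network.gexf")
```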

My findings will highlight that transgender websites formed a clear cluster within the network. Transgender organizations indeed were a community within a community. No other cluster was tied as strongly together—underscoring the aforementioned tension within the LGBT+ community. Capitalizing on Gephi’s functionalities, I will interpret this main finding, e.g. show which websites had a high centrality and which bridged transgender with other queer websites, and will discuss changes over time.

14:40
Flashing Intimate Things in People’s Faces. Intimate Computing and the Datafied Web.

ABSTRACT. Computers have been envisioned early on as devices that would be able to aid people in their lives and augment their cognitive capacities. Sociotechnical imaginaries of human-machine interaction often refer to the intimate or speak of an intimate computer as a desirable mode of interaction. In contrast, Lauren Berlant traces the advent of an “intimate public sphere” with the rise of the Reaganite right and its mass-media rhetoric of sentimentality and a traumatized national identity: “Now everywhere in the United States intimate things flash in people’s faces” (Berlant, 1997, p.1). Writing in the late 1990s, Berlant did not yet consider the development of a digital public sphere of the datafied web. What Berlant observed will be continued and complicated here, in a technically realized public space. Early on, digital artists experimented with new forms of intimacy in the public sphere, and software companies designed their products to simulate spaces of sharing. Using the once ubiquitous meta-platform for web-based applications Macromedia/Adobe Flash as a case study, my talk will explore how spaces of intimate privacy were opened up on the early internet. I will offer insights into both the aesthetics and the technical structures of this development. I will first introduce the artist duo Auriea Harvey and Michaël Samyn, known as “Entropy8Zuper!”, and explore their Flash-based cyberperformance of online intimacy, “Wirefire”, as an experiment with new forms of relating to one another. I will then trace the formalization and commodification of forms of interaction involving the sharing of digital objects in the development of the Flash Communication Server (FlashComm), culminating in the misuse of Flash’s “Local Shared Object”, the so-called Flash cookie, in the context of behavioral advertising. I thereby propose an alternative history of intimate computing that draws a parallel between practices of navigating technological and interpersonal vulnerabilities. “Wirefire” and Flash developers were searching, simultaneously but from very different starting points, for ways of digital communication that could support the imaginary of a shared, intimate space. In this sense, Flash offers a valuable case study of what Ara Wilson called an “infrastructure of intimacy”, tracing how “infrastructure offers a useful category for illuminating how intimate relations are shaped by, and shape, materializations of power” (Wilson, 2016, p.263). Flash offered tools to program intimacy in a simulated public sphere of sharing digital data and introduced new technical vulnerabilities into the protocols of the internet. Following Michel Serres, a relation always involves a third party, a para-site. My example of Flash shows how the navigation of vulnerabilities can be both the entry point of exploitation and extraction in what Berlant called a heteronormative metaculture, and the condition and foundation of trusting, resilient relationships.

14:00-15:30 Session 6B: Web archives practices
14:00
Bulk access to web-archived data using APIs

ABSTRACT. In the context of archiving large datasets, ensuring that the data is both accessible and searchable is paramount for facilitating research and discovery. In recognition of this need, we implemented Application Programming Interface (API) access to Arquivo.pt in 2018. This initiative not only improved accessibility but also enabled the development of microservices that operate on our platform. As a result, nearly half of the web traffic to Arquivo.pt is now generated through API requests. Each year, a diverse array of projects spanning disciplines such as economic analysis, artistic endeavors, and computer science emerges, all leveraging the capabilities of our APIs. Currently, we offer four distinct APIs, each designed to meet the varying needs of our community. In recent years, there has been a growing demand from the research and education sectors for the ability to perform bulk downloads of web-archived data and index files. This demand arises from a range of applications, including the training of artificial intelligence models, optimizing the routing of web archive requests, and retrieving information from specific websites, such as news outlets. In response to these requests, Arquivo.pt has made all its index files publicly available in real time, significantly facilitating the bulk download of web-archived data. This decision has opened new avenues for researchers, evidenced by a more than sixtyfold increase in our network bandwidth since the introduction of bulk download access. This enhancement not only streamlines data retrieval but has also made it possible for researchers to utilize our data in the development of Large Language Models (LLMs). One notable outcome of this effort is GlórIA, an LLM designed for processing European Portuguese, trained on an impressive 35 billion tokens. The integration of our archived data into AI research exemplifies how access to comprehensive datasets can drive innovation and advancement in various fields. Through this paper, we aim to underscore the critical importance of providing users with broad access to archived data and to detail how our APIs and services are actively utilized within the research community. By sharing our experiences and insights, we hope to demonstrate the transformative impact of accessible data on research and development.
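To illustrate the kind of programmatic access described, a minimal sketch of a request to Arquivo.pt's full-text search API. This is not the Arquivo.pt team's code; the endpoint and JSON field names follow the public API documentation but should be treated as assumptions and checked against the current docs.

import requests

# Query Arquivo.pt's full-text search API for archived pages matching a term.
resp = requests.get(
    "https://arquivo.pt/textsearch",
    params={"q": "web archiving", "maxItems": 10},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json().get("response_items", []):
    # Each item describes one archived capture: crawl timestamp and original URL.
    print(item["tstamp"], item["originalURL"])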

14:20
Navigating the Datafied Web: User requirements and literacy with web archives

ABSTRACT. Introduction

As more and more information is created and shared online, the importance of web archives has grown. After all, online information on the web does not last forever. Contrary to popular belief, web pages have an average lifespan of only around 1,132 days (Agata et al., 2014, p. 464). Here, web archives, as a new form of archival material curated through a process of selecting and preserving websites (Cui et al., 2023), come to the rescue. Web archives are digital collections of web pages and other online content that have been preserved over time. They provide a snapshot of the web at a specific moment in time and can be a valuable source of information for different user groups, especially researchers. However, navigating and using web archives can be challenging, as they are often different from other types of online resources. Web archived content may be incomplete, may not function as it did originally, and may be difficult to locate. In addition, web archives often contain a mix of primary and secondary sources, and it can be difficult to determine the reliability and credibility of the information they contain. Researchers need specific skills and knowledge to effectively use archived web content as research data. As more researchers begin working with this content, understanding their ability to navigate and utilize web archives is becoming increasingly important. This study seeks to explore both user requirements and user literacy with web archives.

Methodology

This study uses both qualitative and quantitative methods. We held two workshops with researchers, librarians, archivists, and other users who interact with these digital resources to map out their requirements. Next, an online survey was distributed across these different user groups to validate these requirements and to assess their literacy. For assessing literacy with web archives, this survey includes 36 statements grouped into five key categories. The first category gauges users’ familiarity with and understanding of the basic concepts of web archives, such as the distinction between original and archived web content and awareness of the diverse range of materials in web archives. A second group of statements explores users’ ability to recognize the research potential of web archives and the impact of archiving practices on the nature and availability of archived content. Another category focuses on skills to search and navigate web archives. The last category evaluates the users’ ability to critically assess the limitations and biases in archived content and the curatorial practices that shape web archives.

Conclusion

Our findings offer guidance on improving the accessibility of web archives by highlighting the diverse requirements of users and suggesting ways to create tailored experiences based on literacy levels. The results indicate that adaptive interfaces and personalized user paths can be developed to make web archives more useful for everyone, from beginners to experienced researchers.

References

Agata, T., Miyata, Y., Ishita, E., Ikeuchi, A., & Ueda, S. (2014, September). Life span of Web pages: A survey of 10 million pages collected in 2001. In IEEE/ACM Joint Conference on Digital Libraries (pp. 463-464).

Cui, C., Pinfield, S., Cox, A., & Hopfgartner, F. (2023, March). Participatory Web Archiving: Multifaceted Challenges. In Information for a Better World: Normality, Virtuality, Physicality, Inclusivity: 18th International Conference, iConference 2023, Virtual Event, March 13–17, 2023, Proceedings, Part I (pp. 79-87). Cham: Springer Nature Switzerland.

14:40
Lessons learnt from preparing collections as data: the UK Web Archive experience

ABSTRACT. This paper proposes an examination of the UK Web Archive’s Datasheets for Datasets project as a pioneering initiative that integrates the critical role of librarians and archivists in the evolving landscape of the datafied web. By adopting a method from the machine-learning field to improve the description and documentation of web archive datasets, this project not only enhances access to and understanding of digital collections but also provides a framework for addressing the broader implications of datafication in library and archival practices.

The UK Web Archive collects and preserves websites published in the UK, encompassing a broad spectrum of topics. The entire collection amounts to approximately 1.5 petabytes (PB) of data, which necessitates the use of machine learning approaches to explore the collection effectively, in addition to the detailed examination of individual websites that the UK Web Archive also facilitates. Moreover, the archive includes curated or thematic collections that cover a diverse array of subjects and events, ranging from UK General Elections, blogs, and the UEFA Women’s Euros 2022, to Live Art, the History of the Book, and the French community in London.

Since its inception, the UK Web Archive has collected websites using a number of different methods, with an evolving technological structure and under different legal regulations. As a result, what can be discovered and accessed is complicated and, therefore, not always easy to explain and understand. To try to ensure wider access to our collection, we plan to publish the metadata we have created to describe archived websites as data.

Collections as data has become a very popular term within the GLAM sector in recent years. However, publishing collections as data is still not an easy task, and there is little in the way of guidelines on how to do this, as one solution does not fit all. In this presentation, we will reflect on some of the challenges of preparing UK Web Archive collection metadata for publication, how we published these collections, and what additional material was required to ensure reuse of this data. These published datasets will be of use to researchers who want to use them with new data mining tools.

The UK Web Archive Datasheets for Datasets project embodies a forward-thinking approach to the challenges and opportunities datafication presents to libraries and archives. By fostering transparency, education, and ethical engagement, this initiative not only enhances the utility of web archive datasets but also exemplifies the crucial role librarians and archivists play in shaping our collective understanding and use of the datafied web. This paper aims to inspire further dialogue and exploration within the RESAW community and beyond, encouraging a thoughtful and human-centred approach to the evolving relationship between libraries, archives, and large datasets.

14:00-15:30 Session 6C: Panel: Data Loss in Archival Regimes: The Politics of Data Twinning, Preservation and Conversion
14:00
Data Loss in Archival Regimes: The Politics of Data Twinning, Preservation and Conversion

ABSTRACT. Chair: Nanna Bonde Thylstrup (University of Copenhagen)

Panel Description Web and data archives play an increasingly central role in societies both as crucial sources of 21st century history and as suppliers of datasets for machine learning systems. Yet, little is known about the decision-making processes that intentionally exclude or accidentally fail to capture data, partly due to complex layers of technicity in the data archiving process (Bingham and Byrne, 2021) and partly because of the relative novelty of the field of digital archiving (Dowling, 2019). This panel offers empirical and critical substantiation of how ‘archivers’ (Ketelaar 2023) in formalized and non-formalized data repositories and archives detect, measure, conceptualize, experience and counteract data exclusion and disappearance across processes. Methods and theories are drawn from STS, information studies and media studies and cases include explorations of web archiving processes in the German National Library, GitHub dataset preservation in an Arctic Vault in Norway, and the storage of administrative digital data at the Danish National Archives.

Research presented in this panel is part of the Data Loss project, where our analysis of data politics shifts focus from accumulation and aggregation to disappearance, destruction and dispossession. Rather than conceptualize data loss as the inverse of accumulation, our research approaches loss as an integral part of data collection, storage, and preservation. What is at stake in researching data loss, then, is not only a matter of quantitatively measuring and mapping what is lost in digital information ecologies; it is also to qualitatively understand how data loss is discursively and materially co-constituted within knowledge infrastructures.

Paper 1 Arctic Archives: Making platformed datasets cold-storage-ready

This paper argues that GitHub’s 2020 Arctic Vault project exemplifies the politics of data loss in deep time archival processes, revealing how platform centralization and control reshape the future of open-source software preservation. On February 2, 2020, GitHub curated an algorithmically selected “greatest hits” collection of 17,000 data repositories, migrating them from GitHub servers to a deep time archival format known as Piql Film. The archives were then distributed across four locations: the Bodleian Library in Oxford, the Bibliotheca Alexandrina in Egypt, Stanford Libraries in California, and the Arctic World Archive in Svalbard, Norway. By making these repositories ‘cold-storage-ready,’ GitHub enforced platform-centric rules that excluded datasets with external dependencies, thus sacrificing the diversity and interconnectedness of the open-source ecosystem in favor of centralization.

GitHub has been described as a “software intermediary” (Bounegru, 2023) that operates simultaneously as a developer platform, community hub, and storage container. Mackenzie (2018) has argued that GitHub is distinct from other code sharing platforms in that it configures coding as a ‘social networked practice’, where code repositories are adorned with social media-style apparatus of following, watching, liking, and tagging. GitHub has attributed its success as a platform to these affordances, often praising its dedicated user base of developers who continuously monitor, update, and maintain the code. Leading up to the Archive Project deadline in 2020, users were informed that in order for their datasets to be included, they had to meet all of the necessary conditions: any dependencies hosted elsewhere, in other open-source repositories, had to be hosted or mirrored in the default branch on GitHub, or else the software in the archive would not be usable. Through these conditions, GitHub’s platform-oriented forms of centralisation and control (Fuller et al., 2017) are extended into deep time through the preservation process of making GitHub ‘cold-storage-ready,’ while also creating conditions in which repositories with external dependencies are lost.

Further, to create the conditions in which the data might be preserved for up to 1,000 years, the open-source software code from GitHub underwent a series of transformations. The repositories were algorithmically selected by internal popularity scores, then automatically crawled and scraped before being sent to Piql. There, 21TB of data was copied onto 186 reels of 35mm polyester film coated with a gelatin emulsion containing microscopically small light-sensitive silver halide crystals. This is commonly known as silver halide film, a popular photography and microfilm medium, but is here referred to as piqlFilm. Piql uses this material to transfer binary data using photons in frames along the film using their piqlWriter. A photochemical processor, known as the piqlProcessor, is a machine where the information written on the film is chemically developed and fixed to ensure image presence and permanence. Data is thereafter only read through a piqlReader, which reads frames from the film and converts them back into sampled images, which are then decoded back into digital data. The cold-storage transformation of GitHub-hosted repositories for the Arctic Vault both challenges and builds on archival processes: it renders datasets fixed in space and time, makes preservation contingent on Piql products, and inscribes these machines and formats into the future.

Finally, while the parent project to GitHub’s Arctic Vault, the Arctic World Archive (AWA), provides an interface through which their registered partners can make modifications or pull requests to their data deposits, the Vault provides no access at all to its users, meaning data cannot be removed or destroyed once included. In communications, GitHub states that it will revisit and evaluate the project every 5 years. This is common with cold-storage archives (Radin and Kowal, 2017), where “cryo-objects are thus available and unavailable at the same time” (Braun, 2024). Through its empirical examination of the GitHub Arctic Vault project and its theoretical engagement with platform studies and archival technologies, this paper expands on emerging work in STS, media studies, and critical data studies that attends to the politics of deep time data preservation. It contributes to these fields by foregrounding how platform-driven archival practices extend centralization into the future, creating conditions in which the preservation of data paradoxically entails its stasis and potential obsolescence.

Paper 2 Web Archives as the Digital Twin: Navigating Tensions Between Preservation and AI

Recent news that the Internet Archive has “backed up the entire cultural heritage” of Aruba in response to climate change exemplifies how discourses on web archives are shaped by imaginaries of completeness and sustainability—suggesting that, despite the ephemerality of the web, everything can be stored and made accessible forever. These visions have gained further momentum with the advent of generative AI, where web archives are seen as valuable datasets for training models and developing new digital methods (see Ogden, Summers, and Walker 2023; Acker and Chaiet 2020).

Drawing on ethnographic research with the German National Library, the Internet Archive, and web archiving experts, I discuss how the focus on completeness and AI-driven use cases creates a persistent sense of falling behind and inadequacy for traditional memory institutions. These institutions face challenges in balancing an emphasis on scale with selective curation, outdated technology, and limited resources. The German web archive is a compelling case, as the .de domain is too large to crawl completely (a common strategy of other national libraries), leaving the library to decide what defines ‘the German internet’ and which parts should be preserved. Their strategy of selective web archiving—focused on topic-based and event-specific crawls—demonstrates that data loss is not a failure but an inherent aspect of web archiving.

Questioning the digital fantasy of completeness and recognizing the conditions and challenges of web archiving highlights the need for a nuanced understanding of the political implications of both the practices and realities of selection and loss, as well as of connected generative endeavors. In this context, this contribution has two main objectives: first, to empirically examine the German National Library’s selective web archiving strategy, where data loss and curation must be discussed rather than hidden under the overpromise of supposed technical possibilities; second, to critically explore tensions between different archival regimes and the competing strategies of preservation versus generative knowledge production.

To discuss these tensions, I introduce the concept of the digital twin, which represents a shift from traditional cultural heritage preservation to a generative, data-driven model. In this view, web archives are imagined as dynamic, continuously updated knowledge systems. These archival systems are expected to serve dual roles: mirroring the entire online world while simultaneously training and optimizing AI models that inform future scenarios, which then feed back into and refine the archive itself. Visions of digital twinning contrast with the conditions of national libraries, where the archive is governed by a legal mandate to preserve a representative selection, often restricted to on-site access due to regulations like copyright. Since both web archives and digital twins are infrastructures of knowledge production and are deeply political, a more nuanced understanding of their “infra-politics” is crucial—one that acknowledges what is selected, saved, or lost in these processes (see Thylstrup 2018; Thylstrup et al. 2021).

Paper 3 Ontological overflows of data friction: Investigating mundane enactments of data absence at the Danish National Archives

In recent decades, several national archival institutions have been turning to the long-term preservation of administrative digital data of various kinds, prominently including the US and UK National Archives. A particularly pertinent case of this is constituted by the Danish National Archives, an institution which regards itself as “Denmark’s Digital Memory” (The Danish National Archives, n.d.) and has been charged with digital preservation since the 1970s (Rostgaard, 2023). As such, the Danish National Archives reflect national archives’ broader turn to digital data preservation while also exhibiting particularly significant institutional expertise in the project of keeping digital data present. In this paper, we take the case of the Danish National Archives to approach the long-term preservation of digital data as a process that is not exclusively or primarily productive of digital data presence, but one that is also shaped by the making of data absence. We ask: How are the practices of digital data preservation at the Danish National Archives always shaped by mundane makings of data absence?

To approach this question, we draw from recent theorizing in science and technology studies (STS) that highlights the value of exploring how actors engender “the absent”. Inspired by a longer lineage of work within STS that attends to excluded and neglected people and things in technoscience (e.g., Star, 1990; Latour, 1992; Mol, 1999; Bowker & Star, 2000; Barad, 2007; Puig de la Bellacasa, 2011), Lee (2023: p. 2) develops the concept of “ontological overflows” to stay with actors but to look “the other way – toward practices of excluding, cutting, removing – the practices of making absences”. We transpose this analytical orientation to our own work, investigating how data disappears in digital knowledge regimes at the Danish National Archives. Additionally, we mobilize the long-standing STS concept of data friction, which describes the “costs in time, energy, and attention required to simply collect, check, store, move, receive, and access data” (Edwards, 2010, p. 84). This conceptual combination is meaningful for studying archival institutions as it highlights the ontological stakes in moments of data friction and helps reinvigorate an STS of archives (Bowker, 2005; Waterton, 2010).

Mobilizing a range of materials related to the Danish National Archives’ data preservation practices, including expert interviews, institutional strategies, and press articles, our methodology draws from both infrastructural (Bowker & Star, 2000) and temporal (Velkova, 2024) inversion. The analysis highlights three ontological overflows and how they enact data absences: first, data format conversions; second, determinations of which data are ‘worthy of preservation’; and third, processes of digital data decay. These three overflows – each in their own way – highlight how the making of data absence is ingrained in the mundane project of keeping data present. Hence, the central contribution of this paper is to both conceptualize and empirically show how digital data preservation at archival institutions is shaped by mundane makings of digital data absence.

References Acker, A., & Chaiet, M. (2020). “The Weaponization of Web Archives: Data Craft and COVID-19 Publics.” Good Systems-Published Research. Barad, K. (2007). Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning. Durham, NC: Duke University Press. Bingham, N. J., & Byrne, H. (2021). Archival strategies for contemporary collecting in a world of big data: Challenges and opportunities with curating the UK web archive. Big Data & Society, 8(1). Bounegru, L. (2023). The platformisation of software development: Connective coding and platform vernaculars on GitHub. Convergence, 0(0). Bowker, G. (2005). Memory Practices in the Sciences. Cambridge, MA: The MIT Press. Bowker, G., & Star, S. L. (2000). Sorting Things Out: Classification and Its Consequences. Cambridge, MA: The MIT Press. Braun, V. (2024). The stuff of memories: Planning hindsight in animal cryobanks. Social Studies of Science, 0(0). Dowling, S. (2019). Why there’s so little left of the early internet. BBC. https://www.bbc.com/future/article/20190401-why-theres-so-little-left-of-the-early-internet Edwards, P. N. (2010). A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. Cambridge, MA: The MIT Press. Fuller, M., Goffey, A., Mackenzie, A., et al. (2017). Big diff, granularity, incoherence, and production in the GitHub software repository. In How to Be a Geek: Essays on the Culture of Software. Cambridge, UK: Polity Press. Ketelaar, E. (2023). The Agency of Archivers. Oxford Twenty-First Century Approaches To Literature, 287. Latour, B. (1992). Where are the missing masses? The sociology of a few mundane artifacts. In W. Bijker & J. Law (Eds.), Shaping Technology/Building Society (pp. 225-258). Cambridge, MA: The MIT Press. Lee, F. (2023). Ontological overflows and the politics of absence: Zika, disease surveillance, and mosquitos. Science as Culture, 33(3), 417-442. Mackenzie, A. (2018). 48 million configurations and counting: platform numbers and their capitalization. Journal of Cultural Economy, 11(1), 36–53. Mol, A. (1999). Ontological politics. A word and some questions. The Sociological Review, 47(1_suppl), 74-89. Ogden, J., Summers, E., & Walker, S. (2023). “Know(ing) Infrastructure: The Wayback Machine as Object and Instrument of Digital Research.” Convergence. Puig de la Bellacasa, M. (2011). Matters of care in technoscience: Assembling neglected things. Social Studies of Science, 41(1), 85-106. Radin, J., & Kowal, E. (2017). Cryopolitics: Frozen Life in a Melting World. The MIT Press. Rostgaard, M. (2023). Archival paradigms: The past, present, and digitised future of Danish archiving. In G. Bak & M. Rostgaard (Eds.), The Nordic Model of Digital Archiving (pp. 23-41). Routledge. Star, S. L. (1990). Power, technology and the phenomenology of conventions: on being allergic to onions. The Sociological Review, 38(1_suppl), 26-56. The Danish National Archives (n.d.). Strategy 2025: The digital memory of Denmark. The Danish National Archives. Available at: https://en.rigsarkivet.dk/wp-content/uploads/2024/02/The-Danish-National-Archive-Strategy-2025.pdf [Accessed: 01/07/2024] Thylstrup, N. B. (2018). The Politics of Mass Digitization. Cambridge, MA: MIT Press. Thylstrup, N. B., Agostinho, D., Ring, A., D’Ignazio, C., & Veel, K. (2021). “Big Data as Uncertain Archives.” In Uncertain Archives: Critical Keywords for Big Data (pp. 1–27). Cambridge, MA: MIT Press. VanDerHorn, E., & Mahadevan, S. (2021). Digital Twin: Generalization, characterization and implementation. Decision Support Systems, 145. Velkova, J. (2024). Data Infrastructures and their Temporalities. In T. Venturini, A. Acker & J. Plantin (Eds.), Sage Handbook of Data and Society. Thousand Oaks, CA: SAGE Publications. Waterton, C. (2010). Experimenting with the Archive: STS-ers As Analysts and Co-constructors of Databases and Other Archival Forms. Science, Technology, & Human Values, 35(5), 645-676.

15:30-16:00Coffee Break
16:00-17:30 Session 7A: Web Archives as Data
16:00
Mining Digital Terror: A Case Study in Using September 11 Web Archives as Data

ABSTRACT. The September 11th attacks and their aftermath (the “9/11 attacks”) are some of the most-documented events in human history. Throughout September 2001 and beyond, the Internet Archive, the Library of Congress, and researchers from various fields came together to capture the unfolding events, creating digital repositories at (then) unprecedented scale. These repositories are held by those collecting institutions, as well as by George Mason University’s September 11 Digital Archive. Yet their potential has remained largely unrealized in histories of 9/11, due to the scale of data, fragmented formats, and inconsistent metadata in these early web archives.

I am currently writing a monograph on a digital history of 9/11, which involves marrying several different types of born-digital historical content: web archives found in the Internet Archive’s Wayback Machine; 3,349 blogs that were manually uploaded to the September 11 Digital Archive; thousands of e-mails sent on list-servs (preserved via web archives or manually donated to sites); and pager messages and other digital media obtained through freedom of information requests. This project involves transforming fragmented, inconsistent archives into cohesive datasets for computational analysis.

To write this history, I am transforming this information – pager messages, e-mails, list-serv posts, blogs, and websites – into large databases of CSV files for future analysis. This process demands careful integration of machine-generated and user-generated metadata across formats that vary in structure, timezone, and language. For example, one list-serv was donated as a very long single text file, while another was manually scraped from Wayback Machine snapshots of a Yahoo! message group, yet I need to integrate the two.
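A minimal sketch of this kind of normalization, not the author's code: each source-specific record is mapped onto one shared CSV schema with a common UTC timestamp. Field names, date formats, and the sample message are hypothetical.

import csv
from datetime import datetime, timezone

FIELDS = ["source", "author", "timestamp_utc", "text"]

def normalize(record, source, date_format):
    """Map a source-specific record onto the shared schema, in UTC."""
    ts = datetime.strptime(record["date"], date_format).replace(tzinfo=timezone.utc)
    return {
        "source": source,
        "author": record.get("from", "unknown"),
        "timestamp_utc": ts.isoformat(),
        "text": record["body"],
    }

with open("corpus.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    # One hypothetical list-serv message, parsed with its own date format:
    msg = {"date": "12 Sep 2001 14:03:00", "from": "jdoe", "body": "Checking in."}
    writer.writerow(normalize(msg, "listserv-dump", "%d %b %Y %H:%M:%S"))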

The scale of these messages and sites is beyond human-readable comprehension. This is where computational techniques like topic modeling, metadata analysis, and keyword searches become essential tools to surface patterns in the data while maintaining a focus on the underlying human stories. For much of my data, my goal is to create human-readable PDFs, contextualized into the larger corpus via distant reading.
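For instance, a compact topic-modeling pass over such a corpus might look like the following sketch (scikit-learn LDA). The corpus file name continues the hypothetical schema above; none of this is the author's code.

import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load message texts from the normalized corpus built earlier.
with open("corpus.csv", newline="") as f:
    docs = [row["text"] for row in csv.DictReader(f)]

vec = CountVectorizer(max_df=0.95, min_df=5, stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=20, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-8:][::-1]]  # eight strongest terms
    print(f"Topic {i}: {', '.join(top)}")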

In my talk, I will advance three arguments. First, I will provide preliminary examples from my work, illustrating how a computationally driven history of 9/11 can enrich the historiography. Second, I will show how much work I must do today as a historian to standardize metadata, which underscores the need for robust metadata frameworks that adapt to the evolving nature of digital content. For web archivists, this highlights the importance of establishing clear metadata standards at the outset of any event-based collecting initiative to enhance future accessibility and research potential. Finally, my presentation will explore the ethical considerations that arise when reprocessing and republishing publicly-available digital materials. For example, I am combining several discrete born-digital collections into one comprehensive dataset for my own analysis — should I share such information? This dilemma reflects broader concerns about data ownership, privacy, and the ethics of reprocessing web archives as data.

16:20
Establishing which websites constituted a national web in the 1990s

ABSTRACT. When doing historical studies of the web, one of the first and most fundamental tasks is to determine which web entities are within the scope of the study, be they web elements, web pages, websites, or web spheres. Depending on the concrete study, the result is a list that identifies the web entities to be included, and with this list at hand researchers can try to retrieve the relevant web entities, e.g. in a web archive, and use them in their analyses. However, establishing such a list is not easy, because useful comprehensive sources and overviews are very often lacking.

This presentation investigates how a researcher can establish which websites constituted a national web in the 1990s, based on the ongoing research project ‘Histories of the Danish web in the 1990s’. The aim is to develop and test a method to identify Danish websites of the 1990s. The method uses two overall approaches, each of which comes with different sub-approaches.

(1) Finding old ccTLD domain name lists. Obviously, the list of registered domain names of the ccTLD .dk is a strong candidate when trying to establish a list of website domains of the past. The following sub-approaches were used: (a) Contacting the existing ccTLD administrator: Punktum.dk, today’s administrator, has not preserved old ccTLD lists; they were discarded in 2018 due to GDPR rules. (b) Contacting previous ccTLD administrators: The .dk ccTLD had several administrators in the 1990s, and just identifying these, let alone finding relevant staff to contact today, is a challenge; this step has not yet been fully explored. (c) Searching the websites of existing and previous ccTLD administrators in the Internet Archive: After spending a lot of time (and with a great deal of luck), and with access to the archived Danish web from the 1990s through a SolrWayback interface including full-text search, this approach was a huge success, and complete ccTLD lists from October 1996 and January 1997 were found.

(2) Reconstructing website domain names based on other sources. ccTLD domain names are important, but they by no means constitute a complete list of existing domain names. In the Danish case, only companies, organisations, and the like were entitled to buy a domain name until early 1997, and therefore all other web actors had their websites hosted on web hotels, with web addresses like ‘inet.tele.dk/name’. This is one of the reasons why reconstructing website domain names is relevant, and here the following sub-approaches were used: (a) .dk websites in existing web archives: Based on the above-mentioned SolrWayback access, a list of the .dk domain names that were present in the web archive was created, including sub-domains of web hotels. (b) Outgoing links from .dk websites to .dk websites: A list of all outgoing links from all websites identified in step (2a) was created and filtered to keep the website addresses that had not been archived (called ‘known unknowns’), which resulted in a list of .dk websites that were linked to (but not archived) and that may have existed in the past. (c) Directories/lists/web hotels: Various web directories, lists, and web hotels were identified and their listings of websites were used (due to scripting in the code, they were not in all cases identified in the two previous steps). (d) Other sources: Other sources were consulted, including print media (books, magazines), digital copies of newspapers, and Usenet groups, and a number of websites were identified, in particular for the period before October 1996, when the Internet Archive started. Based on these different approaches, annual lists of Danish websites in the 1990s were established, as comprehensively as possible.
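A minimal sketch of the filtering logic in step (2b), under assumed data structures; the domain sets shown are hypothetical, not the project's data.

from urllib.parse import urlparse

# Hypothetical inputs: domains present in the archive (2a) and all outgoing links.
archived = {"www.dr.dk", "inet.tele.dk"}
outlinks = ["http://www.example.dk/index.html", "http://www.dr.dk/nyheder"]

# Keep .dk link targets that were never archived: the "known unknowns".
known_unknowns = set()
for link in outlinks:
    host = urlparse(link).netloc.lower()
    if host.endswith(".dk") and host not in archived:
        known_unknowns.add(host)
print(sorted(known_unknowns))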

As this brief overview indicates, establishing a complete picture of which websites constituted a national web in the past is not straightforward; hence, claims to comprehensiveness of the analyses based on this material may be weakened.

In the presentation, all of the points above will be explained and evaluated in detail, and their potential application in other cases will be debated.

16:40
Datafication of Web Archives and the Periodization of Website History: A Case Study of the National Museum of Australia

ABSTRACT. Studying the history of museums on the web faces multiple challenges, including those related to the specificity of the website as an object of study (Brügger, 2009). The problem of the ephemeral character of the website is quite familiar to researchers of the live web and becomes even more complicated in relation to the archived web (van den Heuvel, 2010). Periodisation of websites’ development and reconstruction of versions of websites for research are under ongoing discussion by scholars. There are several approaches to the periodisation of website evolution: 1) reference to the technological changes in website construction and design (Allen, 2013; Helmond, 2013); 2) shifts in the content published on the websites (Chakraborty & Nanni, 2017); 3) generalisation of web development (Ben-David, 2019). The versioning of websites, selecting portions of information that should be taken into account while researching, and decomposing preserved data into fragments also refer to periodisation. A version can be considered a composition of the snapshots from a certain period. A website is often reconstructed per year, or from a selection of separate years with some gap in between, for tracing the changes (Svarre & Skov, 2024). Of course, the approach depends on the research purposes and may vary. The proposed paper suggests considering periodisation based on an assessment of the resources preserved in web archives and available for research. Web archives have undergone significant changes from the early years of the Internet to today. It is not surprising that the quality and detail of web preservation have changed over time, directly affecting the amount and quality of data we have today for scholarly consideration. Identification of these periods is essential to reconstructing the versions of the website, revealing the shift in content within the period and then addressing the changes, keeping in mind the volume of data. The website of the National Museum of Australia (NMA) has been selected as a case study. This example is interesting from a comparative perspective on web archives. The NMA website has been preserved by both initiatives – the Internet Archive and the Web Archive of Australia – since 1996. The Jupyter Notebooks from the GLAM Workbench have been utilised to obtain the data and test the hypothesis. The code from the Notebooks (Web Archives, 2024) has been modified to obtain and visualise data for research purposes. The diagrams reveal the distribution of the snapshots preserved in each of the considered web archives and identify the specific periods of the website’s preservation. The gaps in data have also been mapped. This approach supported the argument for selecting particular periods of time for studying the website’s history and showed differences and similarities between the datasets in the two web archives. The rationalities, processes, and results of the research will be outlined at the conference.
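By way of illustration, a sketch in the spirit of the GLAM Workbench notebooks (not the modified code used in the paper): the Internet Archive's CDX API can be queried directly to count NMA snapshots per year; an analogous query could be run against the Australian Web Archive.

import collections
import requests

# Fetch all Wayback Machine capture timestamps for the NMA homepage.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={"url": "nma.gov.au", "output": "json", "fl": "timestamp"},
    timeout=60,
)
rows = resp.json()[1:]  # skip the header row
per_year = collections.Counter(row[0][:4] for row in rows)
for year in sorted(per_year):
    print(year, per_year[year])  # snapshot counts per year expose preservation periods and gaps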

References: 1. Allen, M. (2013). What was Web 2.0? Versions as the dominant mode of Internet history. New Media & Society, 15(2), 260–275. 2. van den Heuvel, C. (2010). Web Archiving in Research and Historical Global Collaboratories. In N. Brügger (Ed.), Web History (in book series Digital Formations, ed. S. Jones) (pp. 279-303). Peter Lang International Academic Publishers. 3. Ben-David, A. (2019). National web histories at the fringe of the Web: Palestine, Kosovo, and the quest for online self-determination. In N. Brügger & D. Laursen (Eds.), The historical web and Digital Humanities: The case of national web domains (pp. 89—109). Abingdon: Routledge. 4. Brügger, N. (2009). Website history and the website as an object of study. New Media & Society, 11(1-2), 115-132. 5. Chakraborty, A., & Nanni, F. (2017). The changing digital faces of science museums: diachronic analysis of museum websites. In N. Brügger (Ed.), Web 25: Histories from the first 25 Years of the World Wide Web. New York, NY: Peter Lang. 6. Helmond, A. (2013). The algorithmization of the hyperlink. Computational Culture, 3(3). 7. Svarre, T., & Skov, M. (2024). The online presence of the Danish public sector from 2010 to 2022: Generating an archived web corpus. In S. Gebeil & J-C. Peyssard (Eds.), Exploring the Archived Web during a Highly Transformative Age: Proceedings of the 5th International RESAW Conference, Marseille, June 2023. Firenze University Press. 8. Web Archives. GLAM Workbench, accessed on 19.09.2024, https://glam-workbench.net/web-archives

16:00-17:30 Session 7B: RSNs
16:00
To Monetize or not to Monetize: doubts, resistance and U-turns in early YouTubers communities

ABSTRACT. In this presentation, the aim is to understand the impact of monetization and what it meant for the ideals of participatory amateur culture in the early 21st century. The focus will be on the first ten years of YouTube (2005-2015), which started as a perfectly suitable infrastructure for amateurs who loved the opportunity to distribute self-made videos. Initially, YouTube was firmly rooted in a peer-driven community that believed in a truly democratic opportunity for even and fair distribution of user-generated content. Especially in its early stage, YouTube ‘thrived on enthusiasm of users as they ran and operated their new virtual spaces, which were often regarded as experiments in online citizenship and a reinvention of the rules for democratic governance’ (Burgess and Green 2009, 15). Scholars like Benkler (2006) believed a networked public sphere, characterised by non-market peer production, was possible. The early days of the web saw a level of enthusiasm for the promise of a so-called ‘pro-am’ revolution, famously summarised in the concept of ‘participatory culture’ by Henry Jenkins (2006). However, this ideal of participation and democratic communities was questioned by many users after Google acquired YouTube in 2006. Although Google had initially promised to keep the community-based identity of the platform intact, the gradual introduction of regulations (e.g. the partnership programme in 2008) that enabled some form of monetisation ultimately changed the character of YouTube. YouTube developed into a platform with a far more complex interwovenness of not just what users do or hope for, or what technologies enable, but also the particular business models and specific rules of governance that underpin this whole system (cf. Van Dijck et al., 2018). Because of this, YouTube became the perfect example of a website that deploys automated technologies and business models to organise data streams as part of measuring economic interactions, while providing social exchanges between users of the Internet. While the platform quickly commercialized, a discursive struggle ensued regarding what makes YouTube a place for amateurs. As one of the amateurs voiced the discontent around 2015: ‘YouTube died when it stopped being a hobby and started becoming about the money.’ Some users pleaded for the return of the ‘old YouTube’: the once active community of alternative, unruly users generating a specific cultural form. This discussion was repeated in many other conversations elsewhere on YouTube, and it shows users’ passionate, often quite divergent, positions on the identity of the ‘real’ YouTuber. With YouTube’s 20th anniversary, it is an excellent moment to revisit historical controversies and debates about the impact of the datafication of this part of the web.

16:20
Tumblr Purge: A Story Told Through Data

ABSTRACT. In November 2018, after being suspended from Apple’s App Store for hosting child pornography, Tumblr announced its decision to ban *all* NSFW (not safe/suitable for work) content with the aid of machine-learning classification. The decision to opt for strict terms of use governing nudity and sexual depiction was as fast as it was drastic, leading to the quick erasure of subcultural networks developed over a decade. My contribution maps out platform critiques of and on Tumblr through a combination of visual and digital methods. By analyzing 15,158 posts made between November 2018 (when Tumblr announced its new content policy) and August 2019 (when Verizon sold Tumblr to Automattic), it explores the key stakes and forms of user resistance to Tumblr’s “porn ban”. The presentation reflects on the circulation of user-generated content in response to platform-driven censorship, with particular attention to practices of screenshotting and memeification. It further explores the changing relations of relevance in the controversy surrounding the deplatforming of Tumblr cultures.

16:40
The business of datafied identity: LiveRamp’s evolution in the audience economy

ABSTRACT. Data brokers are companies that collect, aggregate, and sell access to personal information about individuals and organisations within the “audience economy” (Helmond and Van der Vlist, 2023). These companies have become central actors in the digital economy by providing businesses with detailed consumer profiles that can be used for marketing, credit scoring, and other (often automated) decision-making processes. Notable data broker companies include Acxiom, Experian, Equifax, and LiveRamp, whose revenues reflect the growing importance of data as a commodity.

This paper contributes to the growing body of research on the role of data brokers in the digital data economy (Christl and Spiekermann, 2016; Crain, 2018, 2021; Elmer, 2004; McGowan et al., 2024; McGuigan, 2023; Reviglio, 2022; Van der Vlist and Helmond, 2021; Zook and Spangler, 2023) by offering a historical perspective on one of its central players: LiveRamp, which serves as a microcosm for understanding political economy and power in the broader data brokerage landscape (Van der Vlist and Helmond, 2021). Applying a methodological framework that utilises the Internet Archive Wayback Machine for examining platform evolution (Helmond et al., 2019; Helmond and Van der Vlist, 2019), we trace LiveRamp’s platform evolution across three dimensions: its discursive positioning, data and product offerings, and partner ecosystem. Through this analysis, we reveal how LiveRamp has navigated and influenced the shifting dynamics of the digital data economy in its favour.

First, we examine LiveRamp’s discursive positioning by analysing changes in its taglines, “about” pages, and other self-representative content. This analysis reveals how the company has rebranded itself over time—from an “onboarder” of data to an “identity provider.” This shift shows its ambition to build critical “identity infrastructure” within the data industry. Furthermore, LiveRamp increasingly markets itself as an “open” and “interoperable platform for data collaboration,” positioning itself as a neutral or independent actor in contrast to the closed data silos of major social media platforms.

Second, we explore the evolution of LiveRamp’s data and product offerings. Over time, the company has expanded its services from basic data “onboarding” to facilitating complex data “connectivity” and “collaboration” across platforms. This includes the aggregation and marketisation of first-party data, an area of growing significance as privacy regulations like the GDPR in Europe and the California Consumer Privacy Act (CCPA) in the United States impose stricter rules on how companies collect and share personal data. The emergence of data “marketplaces”, “clean rooms” and the increasing role of “first-party” and “synthetic” data are key developments we observe in LiveRamp’s trajectory, reflecting a broader industry shift towards privacy-preserving data practices.

Third, we examine LiveRamp’s partner ecosystem, which is critical to its expansion and integration within the broader data economy. By forming strategic partnerships with major technology platforms, service, media, and data providers, LiveRamp has established itself as a central connector and infrastructural gateway within the audience economy. These partnerships also reveal how the larger data economy has evolved in response to technological innovations, shifting market conditions, and regulatory frameworks.

While many of their operations remain opaque, examining the evolution of a data broker on these three levels helps us to better understand: (1) how discursive self-positioning reveals the power dynamics and ideological underpinnings in LiveRamp’s portrayal of its role within the data economy; (2) how the company’s corporate messaging and product offerings co-evolve in response to regulatory pressures, positioning it as an ethical “leader” or “innovator” in the field; and (3) the growing scale and complexity of the data economy, as evidenced by the company’s partnerships and integrations with other players in the industry. These elements bring together LiveRamp’s evolution as a data broker and highlight the key role of partnerships in creating and expanding its infrastructural (platform) power within the data economy. Furthermore, this analysis underscores the company’s evolving influence in the advertising industry, particularly as privacy regulations shifted in favour of LiveRamp’s position.

16:00-17:30 Session 7C: Social Media and APIs
16:00
Robots.txt and A History of Consent for Web Data Capture

ABSTRACT. Web archives are increasingly positioned as ideal data repositories for building generative artificial intelligence (AI) and training datasets for Machine Learning (ML) (van Strien, 2023; Alam, 2023; Cargnelutti et al. 2024; Deckers & Potthast, 2022). In recent years, the alignments and overlaps between web archives and training data have become more widely discussed and addressed, as web archives like Common Crawl have been used as the basis for model training data like the Colossal Clean Crawled Corpus (Dodge et al., 2021; Baack, 2024). At the same time, pushback against the development of generative AI by users and companies alike has taken the form of restricting access to open data on the web. A recent study by the Data Provenance Initiative, an MIT-led research group, discovered an “emerging crisis in consent,” affecting the free collection of AI training data sets from the open web (Roose, 2024).

In this paper we take a historical, socio-technical, and critical approach to understanding the ongoing relationships between data, web archives, and AI. We frame our discussion by examining the 30-year-old Robots Exclusion Protocol (REP, or robots.txt) as a means for controlling crawler behavior, and its ongoing role as an infrastructural component of the web that inscribes concepts of consent. Past work on the protocol, which was developed by Koster (1994), highlights its role in censorship and its impact on the exclusion of materials from web archives (Elmer, 2009; Ogden, 2020).
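As a concrete illustration of the protocol's mechanics (a minimal sketch, not from the paper): Python's standard-library robotparser implements the REP check that a "polite" crawler performs before fetching a page. The site, user agent, and URL are hypothetical.

from urllib import robotparser

# A polite crawler consults robots.txt before capturing a page.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()  # fetch and parse the robots.txt file
if rp.can_fetch("ExampleArchiveBot/1.0", "https://example.org/some/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by the Robots Exclusion Protocol")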

In our approach, we trace the use of REP in the parallel genealogies of web archiving and the development of AI and machine learning technologies. Using a historical analysis of early web mailing lists (Hocquet & Wieber, 2018), we demonstrate how both histories exhibit three key moves that allow robots.txt to be reinterpreted and repurposed over time to justify collecting decisions and represent ethical data decision-making. In this paper, we argue that 1) robots.txt’s use for indexing and retrieval has been extended to technologies for capture and extraction; 2) the definitions of bot behavior and ‘politeness’ have been deployed as a de facto ethical framework for all web data collection; and 3) prompted by the recognition of data’s value and ownership, these ethical rules have been extended to determine the legal implications of robots.txt.

We focus our discussion on one aspect of data collection: how automated tools conceptualize and operationalize consent. As the generative turn in AI makes clear, web archives should not only be understood as historic collections, but also as sites of future-oriented knowledge regimes. As such, understanding web archives’ orientation to consent from data creators or data subjects has far-reaching effects on archived web data’s future uses. We conclude by considering how critical data studies and critical archival studies can contribute perspectives beyond the technical solutions of REP and address the context-dependent nature of consent and mediating access to information. We reflect upon the impacts for large-scale data collection and analysis and the future of both web archives and AI.

References Alam, S. (2023). IACopilot [Python]. Internet Archive. https://github.com/internetarchive/iacopilot Baack, S. (2024). Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI. https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/ Cargnelutti, M., Mukk, K., & Stanton, C. (2024, February 12). WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI. Library Innovation Lab. https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/ Deckers, N., & Potthast, M. (2022). WARC-DL: Scalable Web Archive Processing for Deep Learning. https://arxiv.org/abs/2209.12299v1 Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., & Gardner, M. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (arXiv:2104.08758). arXiv. https://doi.org/10.48550/arXiv.2104.08758 Elmer, G. (2009). Robots.txt: The Politics of Search Engine Exclusion. In J. Parikka & T. D. Sampson, The Spam Book: On viruses, porn, and other anomalies from the dark side of digital culture (pp. 217–227). Hampton Press. Koster, M. (1994). A Standard for Robot Exclusion. The Web Robots Pages. http://www.robotstxt.org/orig.html Ogden, J. R. (2020). Saving the Web: Facets of Web Archiving in Everyday Practice [Phd, University of Southampton]. https://eprints.soton.ac.uk/447624/ Roose, K. (2024, July 19). The Data That Powers A.I. Is Disappearing Fast. https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html van Strien, D. (2023, May 10). Getting Started with Machine Learning and GLAM (Galleries, Libraries, Archives, Museums) Collections | Internet Archive Blogs. https://blog.archive.org/2023/05/10/getting-started-with-machine-learning-and-glam-galleries-libraries-archives-museums-collections/ Hocquet, A., & Wieber, F. (2018). Mailing list archives as useful primary sources for historians: Looking for flame wars. Internet Histories, 2(1–2), 38–54. https://doi.org/10.1080/24701475.2018.1456741

16:20
On Reciprocity – Algorithmic Interweavings between PageRank and Social Media

ABSTRACT. This paper argues that, from a historical point of view, Google’s PageRank algorithm plays a crucial role in the datafied infrastructures of contemporary social media. In the first part of the talk I will show that the social principle of reciprocity, which is always to be regarded as two-sided and is essential for the production of sociality, also plays a central role at the technological implementation level of PageRank. To demonstrate this, Moreno’s sociometry (Moreno 1934), which is a central reference point in the patents of PageRank (Page 2001, Page 2004) and essential for its functioning, is brought into dialogue with the gift theory of Marcel Mauss (1990). Both approaches, Moreno’s and Mauss’, assume that the smallest social unit is the dyad and that society only comes into being through a third element (e.g. a gift). It will be argued that PageRank, based on the web architecture with hyperlinks (as third elements), makes use of precisely this central social principle of two-way reciprocity and institutionalizes it technically. This marks at the same time an expansion of ‘intersubjective spacetime’ (Munn 1986), wherein something like reputation can arise in the first place. And it is precisely this principle of ‘networked prestige’ (Halavais 2008) that underlies both Google’s PageRank and the datafied infrastructures of today’s social media platforms. Against this background, the second part of the talk focuses on the growing blogosphere at the beginning of the 2000s and its interweaving with PageRank. Of particular importance here are the trackback and pingback functions within blog journals, which are based on the principle of two-sided reciprocity, as they automate (social) linking practices of bloggers among themselves. The blog search engine Technorati has been taking advantage of these practices since 2002, similar to the principle of PageRank, and rates blogs based on the reputation of the links for each blog (using trackbacks and pingbacks), thus assigning them a ‘networked prestige’. In other words, it can be observed in concrete terms how the dyadic principle (as the smallest social unit) is technically institutionalized and an algorithmically organized hierarchy emerges, which also becomes the basis of the feeds of social media platforms. Finally, it will be shown on a theoretical level that the introduction of PageRank marks a crucial distinction between a direct and a generalised form of reciprocity (Stegbauer 2011) that is inscribed in social media platforms. This manifests itself in the principles of ‘befriending’ (as a direct form of reciprocity) and ‘following’ (as a generalised form of reciprocity) in the user interfaces, as well as in hybrid forms of both reciprocities, which arise for instance from privacy settings (a private Instagram or Twitter/X account). In the context of the datafied web, reciprocity is a precondition for the platformization (Helmond 2015) of social media.
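A toy illustration, not part of the paper: a minimal Python power-iteration sketch of PageRank over a hypothetical three-page graph in which pages 0 and 1 link reciprocally, showing how mutual linking concentrates "networked prestige".

import numpy as np

links = {0: [1], 1: [0, 2], 2: [1]}  # hypothetical graph: 0 and 1 link reciprocally
n, d = 3, 0.85
M = np.zeros((n, n))
for src, targets in links.items():
    for t in targets:
        M[t, src] = 1 / len(targets)  # column-stochastic: each page splits its vote

r = np.full(n, 1 / n)
for _ in range(50):
    r = (1 - d) / n + d * M @ r  # damped PageRank power iteration
print(dict(enumerate(r.round(3))))  # the reciprocally linked hub ranks highest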

References

Helmond, A. (2015). The Platformization of the Web: Making Web Data Platform Ready. Social Media + Society, 1(2).
Mauss, M. (1990). The Gift: The Form and Reason for Exchange in Archaic Societies. London: Routledge.
Moreno, J. L. (1934). Who Shall Survive? A New Approach to the Problem of Human Interrelations. Washington: Nervous and Mental Disease Publishing.
Munn, N. (1986). The Fame of Gawa: A Symbolic Study of Value Transformation in a Massim (Papua New Guinea) Society. Cambridge: Cambridge University Press.
Page, L. (2001). Method for Node Ranking in a Linked Database. US Patent 6285999B1.
Page, L. (2004). Method for Scoring Documents in a Linked Database. US Patent 6799176B1.
Stegbauer, C. (2011). Reziprozität: Einführung in soziale Formen der Gegenseitigkeit. Wiesbaden: Springer.
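
To make the mechanics behind this argument concrete, here is a minimal sketch of PageRank’s power iteration over a toy graph of three hypothetical blogs (all names and parameter values are invented for illustration); pages that receive links, especially within the reciprocal dyad, visibly accumulate more ‘networked prestige’.

```python
# Minimal PageRank power iteration over a toy hyperlink graph.
# All blog names and values are invented for illustration.
links = {
    "blogA": ["blogB"],            # A links to B ...
    "blogB": ["blogA", "blogC"],   # ... and B links back: a reciprocal dyad
    "blogC": ["blogA"],
}
damping = 0.85
n = len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(50):  # iterate until the ranks stabilise
    rank = {
        page: (1 - damping) / n + damping * sum(
            rank[src] / len(outs)          # each outlink shares its rank,
            for src, outs in links.items() # like a gift passed along
            if page in outs
        )
        for page in links
    }

# Pages embedded in reciprocal link relations rank highest.
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```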

16:40
APIs. How their role in the history of computing and their software engineering principles shape the modern datafied web.

ABSTRACT. This paper takes a media and software studies approach to the discussion of APIs (Application Programming Interfaces) and shows how their genealogy and their main software engineering principles (e.g. separation of concerns, information hiding, reuse) resurface as wide-ranging social and political implications in modern-day web APIs, leading to discussions about data-sharing and data-hiding, access, power, platformization, innovation and collaboration, as well as interdependence, the commodification of the web, and legal considerations.

But let’s start at the beginning. APIs are as old as the history of digital computing itself. Well before modern web APIs became one of the defining components of the datafied web, certain computer programming routines (which weren’t known under the term API back then) powered the rise of computing and software design from as early as the 1940s. At a time when computers consisted mainly of big, wired hardware and machine code, Herman Goldstine and John von Neumann, in their 1949 paper “Planning and Coding of Problems for an Electronic Computing Instrument”, already saw the need for shared computing components, that is, for library subroutines. These were meant to be libraries for tasks that had to be computed all the time, like mathematical equations or input-output communication.

But it wasn’t until the end of the 1960s that the term Application Programming Interface was coined, designating a well-defined interface that allows one software component to programmatically access another component. This definition can still be applied to contemporary web APIs. Web APIs started with e-commerce sites like eBay and Amazon, were followed by social media platforms like Facebook and X, and were built into mobile applications like Google Maps and Instagram. Countless smaller APIs complete the current web landscape.

Within the data-driven web, the core software design principles of APIs become not only a technical necessity but also actors with an extensive social and political scope. For example, accessing one software component through an API means the following: there is a well-defined access point through which you can interact with certain components, while all other components are hidden from outside access. This is called information hiding, and it is one of the core software design principles of APIs. This principle is beneficial for reducing complexity, for decreasing dependency between programs, and for protecting components from misuse. But at the same time, it can hinder access and entrench power asymmetries between the creators and the users of an API.
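
As a minimal illustration of this principle (a sketch only; the class and method names are invented and do not refer to any real platform API), consider a component that exposes a single well-defined access point while keeping its storage and validation logic hidden from callers:

```python
# Sketch of "information hiding": callers use the published methods;
# the storage layout and validation rules stay internal.
class EngagementAPI:
    def __init__(self):
        self._store = {}            # hidden: internal data structure

    def _validate(self, user_id):   # hidden: internal rule
        return isinstance(user_id, str) and user_id != ""

    def record_engagement(self, user_id):
        """One well-defined access point for writing."""
        if not self._validate(user_id):
            raise ValueError("invalid user")
        self._store[user_id] = self._store.get(user_id, 0) + 1

    def get_engagements(self, user_id):
        """One well-defined access point for reading."""
        if not self._validate(user_id):
            raise ValueError("invalid user")
        return self._store.get(user_id, 0)

api = EngagementAPI()
api.record_engagement("alice")
print(api.get_engagements("alice"))  # 1 -- the caller never touches _store
```

A caller can read and record engagements but cannot reach into the internal store, which is exactly the trade-off the abstract describes: reduced coupling and protection from misuse, at the price of gatekeeping.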

APIs are full of inherent ambiguities, and it is exactly at this intersection that this paper is placed: showing their historicity and their powerful innovative implications on the one hand, and their immanently unstable power negotiations on the other. It is like an ever-changing dance around openness and closedness, around creation and hindrance, for everything and everybody interacting with an API – be it users, developers, algorithmic agents, or hardware components.

17:30-18:00Coffee Break
18:00-19:30 Session 8: KEYNOTE by Nanna Bonde Thylstrup (chair: Sebastian Gießmann)
KEYNOTE: Vanishing points: technographies of data loss

ABSTRACT. What happens to data when it vanishes? How do digital remains persist even as information seemingly disappears? What can the politics of disappearance tell us about power in datafied worlds?

Disappearance has become a crucial yet understudied force shaping digital experiences and infrastructures. This keynote develops a technographic approach to examine how data loss and digital remains create complex patterns of presence and absence that defy simple narratives of erasure. Through cases ranging from platform architectures to digital archives, it traces how power operates through sophisticated mechanisms of appearing and vanishing, leaving traces that persist in unexpected ways.

By mapping these dynamics of disappearance, the exploration uncovers how our data landscapes are shaped not by mere accumulation, but through intricate processes of loss, persistence, and transformation. The keynote explores how digital societies negotiate memory and forgetting, positioning disappearance itself as a crucial digital experience.

The talk develops theoretical tools for analyzing these dynamics while remaining grounded in concrete technological practices and their political implications. Through this, it compels us to rethink fundamental assumptions about presence, absence, and the complex temporalities and materialities of digital culture.

20:00-23:00Dinner Reception
 
 
Friday, June 6th

09:00-10:30 Session 9: KEYNOTE by Jonathan Gray: Public data cultures (chair: Tatjana Seitz)
KEYNOTE: Public data cultures

ABSTRACT. This talk explores how data is made public on the Internet amidst the rise of social media, platforms and AI. Retracing the emergence of legal and technical conventions of open data, it looks towards a more expansive understanding of the public data cultures which shape how we know and live together. Through a series of empirical vignettes, the talk reconsiders data as cultural material, a medium of participation and a site of transnational coordination. It then turns to two forms of intervention: making data that is considered missing, and entry points for critical data practice. As well as situating public data cultures in relation to the datafication and platformisation of the web, the talk will highlight the role of web archives in studying these developments.

10:30-11:00Coffee Break
11:00-12:30 Session 10A: Panel: The Skybox research programme
11:00
More than data: the Skybox research programme

ABSTRACT. Skyblog (2002 – ∞) was a pioneering and emblematic social networking platform of the French web in the 2000s. By 2011, it hosted up to 33.5 million blogs, 90% of them created by teenagers. Skyblog offered users a free and customizable digital space where they could easily create blogs, share content such as text, images, videos, and music, personalize their pages, and connect with others through virtual friendships. The platform left a significant mark on web culture and the history of the French web. In 2023, Skyrock, the platform’s editor, announced the closure of Skyblog, sparking collaboration with digital heritage institutions such as the National Library of France (BnF) and the National Audiovisual Institute (Ina) to guarantee its long-term preservation. Through this cooperation, the BnF optimized the crawl processes and collected original datasets, resulting in a collection of up to 12.6 million blogs and 40 terabytes of data. The aim of this panel is to present the challenges of managing a vast digital archive, with a particular focus on the inherent difficulties faced by web archivists and research teams involved in the Skybox research programme. We will begin by reviewing the technical aspects and methodological goals of the Skybox research programme, scheduled to run from 2024 to 2027. The project’s objective is to develop an epistemology of the web archive based notably on quantitative methods and computational approaches, using the Skyblog collection as a field of study. One of the researchers involved in the Skybox project is Quentin Lobbé, whose work focuses on the analysis of digital migrations. He studies how skybloggers moved from and within the platform. Emmanuelle Bermès employs a methodology that combines link mapping and data visualization with personal narratives, mapping connections between blogs while addressing sensitive content, particularly that involving minors, in order to preserve the emotional depth of the datafied web.

Presentation 1: The Skyblog web archive behind the scenes – Alexandre Faye, Sara Aubry and Marina Hervieu (BnF). The Skyblog collection is without doubt one of the most complex and comprehensive preservation projects ever undertaken by the BnF web archiving team within the context of its legal deposit mission [1]. The team has previously engaged in the preservation of French blogging platforms, yet none on the scale of Skyblog. The preliminary estimates, based on data from the pilot collection, indicated a capture period of more than two years and a data volume of 80 terabytes. However, these estimates proved to be technically unfeasible. How to capture all available blog contents (mainly texts, images, photos), but also the entire social network dimension of the platform (comments, followers, favorites, avatars, rewards)? This challenge was overcome through the implementation of a methodical data collection preparation process and the establishment of a collaborative relationship with the technical team at Skyrock [2]. The optimisation process entailed modifications to the blogs’ codebase and the platform’s back-end infrastructure. For instance, the source code was changed to display 24 posts per page instead of 8, allowing the crawlers to archive more information more quickly. Another aspect of the project involved the identification of datasets managed by Skyblog, including usage stats, user profile data, editorial information, moderation terms, and music files, and their subsequent integration into the collection. This collection gives rise to questions pertaining to the professional practices of archivists. How can this mass of data be rendered accessible? What are the legal, ethical and technical issues involved in using the data? What would be the best tools, existing or to be developed, to make them searchable and usable? Given the heterogeneity of the large dataset, the age of the bloggers and the diversity of their practices on the web (often unconventional) [3], it is evident that the majority of the data is not given but capta [4]. For example, the datasets recovered are of different types: technical tracking data that can be quantified (creation date, number of posts, number of comments, number of friends) and user-generated data (username, place of residence, body size and weight, astrological sign, etc.). The challenge is to facilitate informed exploration of this diversity of datasets, to take into account the actual needs, achievements and ideas of researchers, and to create a synergy around this collection. This is the whole purpose of the collective project Skybox, which aims to establish a unified interface for consulting datasets, use cases and methodological recommendations. The project transforms the web archive into a datafied object of study. Concurrently, it also contributes to the process of making Skyblog part of our heritage. In this presentation, we will look back at the methodological and technical challenges of the preservation process, as well as the preliminary thoughts and work on the Skybox research programme.

References:
1. BnF. 2024. Le web français collecté par la BnF pour le patrimoine et la recherche. Paris. URL: https://www.bnf.fr/fr/depot-legal-du-web
2. Faye, Alexandre. 04/09/2024. Aujourd’hui les skyblogs entrent dans l’Histoire. Web Corpora. URL: https://doi.org/10.58079/128yr
3. Deseilligny, Oriane. 2009. Pratiques d’écriture adolescentes: l’exemple des Skyblogs. Le Journal des psychologues, (9), Paris, France, 30-35.
4. Drucker, Johanna. 2020. Visualisation. L’interprétation modélisante. Rennes, France: B42 Press.

Presentation 2: “I’m shutting down my blog, follow me!” Digital migration: a mirror for identity formation in adolescence – Quentin Lobbé (EHESS). “Digital migration” refers to the way all or part of an online community can move from one territory of the Web to another, whether or not this move is coordinated – a recent example being users massively migrating towards the federated network Mastodon after Elon Musk bought Twitter. Digital migrations are extremely interesting to study from a historical point of view, since they can be the reflection of: major evolutions within the web itself (web 1.0 to web 2.0, launch of social media platforms, popularisation of the mobile web, etc.) [1,2]; a frustration, weariness or disappointment regarding a given platform [3]; or a reaction to a socio-political context outside the web (repressive legislation, surveillance, censorship, war, etc.) [4]. In this presentation, I aim to analyse the migration trajectories of users of Skyblog – a French blogging platform – based on the National Library of France’s web archive containing nearly 13 million blogs. As studying migration movements means dealing with issues of discontinuity inherent to any body of web archives, I will first focus on the technical difficulties such a project implies. How to follow a collective trajectory, let alone an individual one, through archives that are potentially incomplete, inconsistent or redundant? I will detail how I was able, on the one hand, to automate the detection of potential traces of digital migration in the Skyblog archives and, on the other, to reconstruct migration trajectories spanning several years (a minimal sketch of the detection step follows the references below). I will then proceed to explain how the Skyblog platform is, in my view, a special case in the history of the Web. My first hypothesis was indeed that Skyblog users had gradually abandoned their blogs to move to more modern platforms such as Facebook or Twitter, but my first results show that over 80% of detected digital migrations away from a Skyblog are actually migrations within the platform itself, i.e. Skyblog users create a first blog, then shut it down to open a second, then a third one, etc. Although these are individual migrations, they very often take place as part of a collective movement, since the motives for most detected migrations are advertised in a text posted on the closing blog so that online friends can read it. For many young French speakers of the late 2000s, Skyblog was indeed the ideal place to discover a new way of socialising online [5,6]. My research also shows that the main reason for shutting down a Skyblog is a discrepancy between a past identity and a new one, even one still under construction. Digital migrations on the Skyblog platform therefore appear as a mirror for identity formation in adolescence. With this contribution I aim to enrich the scientific literature of the 2010s on ‘the online Self’.

References:
1. Weltevrede, Esther & Helmond, Anne. 2012. Where do bloggers blog? Platform transitions within the historical Dutch blogosphere. First Monday.
2. Lobbé, Quentin. 2018. Where the dead blogs are: a disaggregated exploration of web archives to reveal extinct online collectives. In: International Conference on Asian Digital Libraries, Springer, pp. 112–123.
3. Horbinski, Andrea. 2018. Talking by letter: the hidden history of female media fans on the 1990s internet. Internet Histories, 2(3-4), 247-263.
4. Ermoshina, Ksenia & Musiani, Francesca. 2021. The Telegram ban: How censorship “made in Russia” faces a global Internet. First Monday, 26(5).
5. Cardon, Dominique & Delaunay-Téterel, Hélène. 2006. La production de soi comme technique relationnelle: un essai de typologie des blogs par leurs publics. Réseaux, (4), 15-71.
6. Fluckiger, Cédric. 2006. La sociabilité juvénile instrumentée: l’appropriation des blogs dans un groupe de collégiens. Réseaux, (4), 109-138.
7. Stora, Michaël. 2009. «Ça ne regarde que les autres!» ou le blog à l’épreuve de l’adolescence. Empan, (4), 066-071.
8. Deseilligny, Oriane. 2009. Pratiques d’écriture adolescentes: l’exemple des Skyblogs. Le Journal des psychologues, (9), Paris, France, 30-35.
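
The detection step described above can be imagined along the following lines. This is a hedged sketch only: the cue phrases and the URL pattern are invented placeholders, not the presentation’s actual heuristics.

```python
# Sketch: flag candidate "I'm moving" announcements in archived blog posts
# and extract advertised destination blogs. Cues and URL pattern are
# illustrative placeholders.
import re

MIGRATION_CUES = re.compile(
    r"(je\s+ferme\s+mon\s+blog|nouveau\s+blog|follow\s+me|venez\s+sur)",
    re.IGNORECASE,
)
SKYBLOG_URL = re.compile(r"https?://([a-z0-9-]+)\.skyblog\.com", re.IGNORECASE)

def find_migration_traces(post_text):
    """Return destination blogs advertised in a closing post, if any."""
    if not MIGRATION_CUES.search(post_text):
        return []
    return SKYBLOG_URL.findall(post_text)

print(find_migration_traces(
    "Je ferme mon blog, venez sur http://moi2.skyblog.com !"
))  # ['moi2'] -> an intra-platform migration, the majority case reported
```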

Presentation 3: From data to emotions – Emmanuelle Bermès (ENC). With 12.6 million blogs and 40 terabytes, the 2023 Skyblog archive is probably the largest web corpus within the BnF collections [4]. When entering such an impressive amount of content, the researcher can only be abashed and disoriented. Using quantitative methods and distant reading may be the first idea that comes to mind. However, there is a long way from lists, counts, statistics and metrics to the individual stories hidden in the archive: stories of men, women and teenagers who often unveil themselves in an intimate and intense way. By treating these persons as if they were only data, there is a high risk of betraying their legacy. Ethical concerns should be at the forefront of our preoccupations when studying vernacular content from the early social networks, especially where minors were involved [3]. In order to enter the corpus in a way that allows us to connect with the emotions conveyed by this very special corpus, we have designed a research method that is very similar to the crawling process used by archiving bots. We start with seeds – individual websites that have been identified and selected by librarians and their partners in the course of the creation of the web archive collections at the BnF since the mid-2000s. Using the web crawler Hyphe, developed by the médialab at Sciences Po and tailored to recrawl the BnF web archive during the ResPaDon project [1], we explore the Skyblog corpus by conducting what we have called a ‘fractal exploration’ (sketched after the references below). Following the numerous links between blogs on the Skyblog platform, we discover hundreds of new blogs, and we can then leverage data visualization in order to identify clusters. In this presentation, we will show how this method leads to the identification of communities and provides a way for the researcher to progress towards close reading and the discovery of significant stories. Instead of trying to get an overview of the corpus using numbers, we use the links that originate from skybloggers connecting with one another, just like breadcrumbs showing a way through this mass of content. We will discuss the benefits of this approach, but also question its limitations in the context of the crawl realized by the BnF in 2023, more than 10 years after the platform’s peak of popularity, considering that only a third of the blogs remained online at that time. Finally, we will discuss how the combination of distant reading and close reading can help to get a global sense of the corpus without relinquishing the emotions, which are constitutive of any type of cultural heritage [2].

References:
1. Aubry, Sara, Audrey Baneyx, Emmanuelle Bermès, Laurence Favier, Alexandre Faye, Marie-Madeleine Géroudet, and Benjamin Ooghe-Tabanou. 2024. “A network to develop the use of web archives: Three outcomes of the ResPaDon project”. In Exploring the Archived Web During a Highly Transformative Age, edited by Sophie Gebeil and Jean-Christophe Peyssard. Florence, Italy: Firenze University Press.
2. Bermès, Emmanuelle. 2024. De l’écran à l’émotion: quand le numérique devient patrimoine. Paris, France: École nationale des chartes-PSL.
3. Milligan, Ian. 2019. “Learning to See the Past at Scale: Exploring Web Archives through Hundreds of Thousands of Images”. In Seeing the Past with Computers: Experiments with Augmented Reality and Computer Vision for History, edited by Kevin Kee and Timothy Compeau, 116-36. University of Michigan Press. URL: https://www.jstor.org/stable/j.ctvnjbdr0.10
4. Tybin, Vladimir. 2024. “Les skyblogs au service de la science”. Chroniques, March 2024. URL: https://www.bnf.fr/fr/les-skyblogs-au-service-de-la-science
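
The ‘fractal exploration’ described in the third presentation can be pictured as a breadth-first traversal of blog-to-blog links. This is an illustrative outline only: in the project this role is played by Hyphe over the BnF archive, and get_outlinks is a hypothetical placeholder for archived-page link extraction.

```python
# Sketch: starting from librarian-selected seeds, follow blog-to-blog links
# layer by layer to discover new blogs for visualization and clustering.
from collections import deque

def get_outlinks(blog_url):
    # Placeholder: in practice, extract hyperlinks from the archived page.
    return []

def fractal_explore(seeds, max_depth=2):
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        url, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for linked in get_outlinks(url):
            if linked not in seen:          # a newly discovered blog
                seen.add(linked)
                frontier.append((linked, depth + 1))
    return seen

community = fractal_explore(["http://example.skyblog.com"])  # invented seed
```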

11:00-12:30 Session 10B: Platform Histories Roundtable
11:00
Platform Histories Roundtable (with Miglė Bareikytė, Marcus Burkhardt, Devika Narayan, Anne Helmond, Fernando van der Vlist)

ABSTRACT. Platforms have multiple histories. The global histories of the political economy of platform capitalism can be traced from the racial capitalism of so-called “platform or racial fixes” during the global financial crisis of 2008 back to the history of flexibilization in just-in-time production in 1960s Japan. In terms of media history, platform histories tie in with the modularization and outsourcing of software development, the archeology of algorithmic techniques, the privatization and monopolization of infrastructural services, and capitalist data capture. Platform histories scale from the development of singular modules and platform ecosystems to the global political economies of platforms. Despite these many historical perspectives on platforms, platform historiography remains largely a desideratum of platform studies and lacks systematic theoretical and methodological approaches. The proposed roundtable aims to provide the first collection dedicated to drawing together and synthesizing the existing multiplicity of platform-centric research as well as cross-platform histories, while focusing on exploring and developing multifaceted platform histories. Platform giants are internationally operating organizations embedded in complex technologies and infrastructures. For instance, social media platforms rely on exploitative, labor-intensive content moderation, while platform labor is organized within and through meticulously designed interfaces, apps, and their infrastructures. We aim to bring together perspectives from platform labor research and platform-centric research. How can critical platform history be written amidst the tensions and forces of infrastructural power, data-intensive economies, and geographic specificities? This roundtable responds to this challenge theoretically and methodologically – multi-layered, multi-sided and globally entangled. Within the roundtable, we want to discuss these historiographical approaches to the most central infrastructures of the datafied web – platforms – with researchers from various fields. The roundtable serves as preparation for a special issue, which will be the first to systematically deal with platform histories.

Organised and moderated by: Sebastian Randerath and Tatjana Seitz. With the participation of: Miglė Bareikytė, Marcus Burkhardt, Devika Narayan, Anne Helmond, Fernando van der Vlist.

11:00-12:30 Session 10C: Past Metrics
11:00
Translating Web Data into Media History: A Methodological Reflection of Archiving and Analyzing the XS4ALL Homepage Collection.

ABSTRACT. Web archives have become an invaluable resource for contemporary historical research, providing new primary sources and unique opportunities to investigate online cultures (Milligan, 2019). The increasing reliance on born-digital materials, such as websites, has led to the adoption of digital humanities methods in historical research, notably through the use of a “web-minded approach” (Brügger, 2018). This approach stresses the need to consider the specific characteristics of archived web pages, to be mindful of the processes behind their archiving, and to apply methods appropriate for working with such material. While historians have traditionally depended on source criticism, engaging with web archives requires additional skills and insights to interpret these digital artifacts and translate them into meaningful historical analysis. This paper examines the steps involved in this process, fostering dialogue between a web archivist and a media history scholar. It offers a methodological reflection on the types of data that are significant within web archives, why these are crucial for historians, and how they can be effectively incorporated into historical research.

Key aspects of archived web collections that matter to both the archivist and the historian will be discussed, such as collection formation, metadata selection, and sample preparation for tools like SolrWayback. Furthermore, the paper reflects on how the various types of data included in a collection can be appropriated for DH methods like multi-modal content analysis, link analysis, or topic modelling. Preliminary phases should be taken into account as well; hence curatorial decisions and related technical considerations, like harvest dates and crawl depth, will be examined too. All of these factors are to be considered by the web archivist, as they subsequently affect the content of a collection as well as the material’s periodisation and authenticity, and thus the notions scholars can construct using them.

Historical research using the XS4ALL homepage collection archived at the Dutch Royal Library will form the exemplary base for this paper. This collection includes a variety of URLs of websites created by XS4ALL subscribers, who could design their own homepage (de Bode & Teszelszky, 2021). The collection presents notable cases to be considered by both archivists and historians. For example, it was harvested from a curated list of URLs rather than being indexed by search engines, due to the historical significance of XS4ALL. Furthermore, this period of the early web offers interesting obsolete technologies to be studied (e.g. Flash) as well as data challenges, like independent websites that are not linked from any other URL or lack inbound links altogether. Another technical aspect is that XS4ALL websites underwent a domain name change. The leading question is what aspects of web data historians should know in order to properly use archived web collections.

This paper seeks to investigate the translation of web data into historical narratives by examining the XS4ALL homepage collection through both archival and historical lenses, employing a web-minded approach. This process is shaped by the interplay of curatorial, archival, and technical decisions that affect how born-digital materials – a source that will continue to gain prominence in contemporary history research – should be interpreted and understood by scholars.
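
As an illustration of the sample preparation discussed above, the following hedged sketch runs a free-text query against a SolrWayback-style Solr index through Solr’s standard select handler; the core name and the field names are assumptions that vary per installation.

```python
# Sketch: query a Solr index of archived web pages (SolrWayback-style).
# Endpoint, core name, and field names are assumed for illustration.
import requests

SOLR = "http://localhost:8983/solr/netarchivebuilder/select"  # assumed core

params = {
    "q": 'content_text:"XS4ALL"',        # assumed full-text field
    "fq": "crawl_year:[1996 TO 2001]",   # assumed year field
    "rows": 10,
    "wt": "json",
}
hits = requests.get(SOLR, params=params).json()["response"]
print(hits["numFound"])
for doc in hits["docs"]:
    print(doc.get("url"), doc.get("crawl_date"))
```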

11:20
The early datafied web: Visitor counters on the Danish web in the 1990s

ABSTRACT. One of the earliest ways of datafying the web was web counters calculating the number of visitors to a website. Visitor counters were the only way that a website holder could get automatic feedback information from and about visitors. Although the information about the number of visitors was very limited and not very detailed, it gave the website owner a sense of how popular the website was, while at the same time flagging this information for the users of the website.

This paper discusses the emergence, spread and development of early visitor counters on the Danish web in the 1990s. The paper takes its point of departure in the research project ‘Histories of the Danish web in the 1990s’ and is guided by the following research question: Which role(s) did visitor counters play as one of the early web’s fundamental infrastructure elements?

Based on this research question, the paper presents an initial mapping by investigating the following aspects of the development of early visitor counters: (1) Visitor counters, producers, companies, economy, market: an analysis of the main actors who produced visitor counters, including business models and the market, from early manual hand-coding, via peer-to-peer distribution of the relevant HTML code, to professional international web companies like Digits (digits.com) or Internet Audit Bureau (internet-audit.com), as well as Danish counter providers like chart.dk and Danmarks Top100 (danmarks-top100.dk), and web hotels like Cybernet (cybernet.dk). (2) Technology: an analysis of how visitor counters were constructed and how they worked on a website, including how they collected clicks and communicated with the visitor counter companies, with a view to the website owner becoming part of the top hit charts on the producer’s website (a minimal sketch of this mechanism follows below). (3) Statistics: a mapping of how many visitor counters existed on the Danish web, in total and relative to the total number of websites. (4) Network: an analysis of the hyperlink network between websites using a visitor counter and the providers of counters. (5) Website owners, use forms, and aesthetics: an analysis of which types of websites visitor counters were used on, and of how they were communicatively and aesthetically framed by the website owner, including wording, icons, placement on the web page, etc.
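
For aspect (2), the server side of such a counter can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not any provider’s actual code: commercial services like digits.com typically returned the tally as a rendered GIF that owners embedded via a shared HTML snippet, whereas this sketch returns plain text.

```python
#!/usr/bin/env python3
# Sketch of a 1990s-style visitor counter as a CGI script: each page hit
# increments a per-site tally kept in a file, and the new total is sent
# back for the page to display ("You are visitor number N").
import os

COUNTER_FILE = "hits.txt"  # one tally per website in the simplest setups

def bump_counter():
    count = 0
    if os.path.exists(COUNTER_FILE):
        with open(COUNTER_FILE) as f:
            count = int(f.read() or 0)
    count += 1
    with open(COUNTER_FILE, "w") as f:
        f.write(str(count))
    return count

# CGI response: header, blank line, then the body the embedding page shows.
print("Content-Type: text/plain\n")
print(bump_counter())
```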

Sources: websites from the Internet Archive, extracted from the national Danish web archive Netarkivet and accessed through a SolrWayback interface, which allows for free-text search and extraction of all elements of the web pages; internal documents from visitor counter companies, insofar as these have been provided; and research interviews with a limited number of website holders from the 1990s.

The presentation will outline the results within each of the focus areas above, including how they interrelate.

11:40
From Hit Counters to the Professionalisation of Web Metrics in Luxembourg (1990s-Mid-2000s)

ABSTRACT. The objective of this presentation is twofold: firstly, to identify the top-ranking websites in Luxembourg during the late 1990s; and secondly, to trace the evolution of the professionalization of measurement metrics in the country. This study focuses on how various stakeholders organised themselves to provide standardised data, thereby fostering the development of the nascent online advertising industry.

Furthermore, this presentation seeks to elucidate the methodological challenges inherent in the analysis of website metrics, particularly those associated with the use of web archives. These challenges include the limitations of web archives in capturing the user perspective (Meyer, Thomas & Schroeder, 2011) and the inherent issues of web archives themselves, such as incompleteness and temporal and spatial inconsistencies between archived fragments (Brügger, 2018). To illustrate, an analysis of the website cim.be – a Belgian company with which many Luxembourg editors were affiliated in 2001 to ensure certified Internet audience measurement and data veracity – revealed only fragmented data, which makes it difficult to draw comparisons over time. This data was supplemented with information from newspapers, magazines, and company websites to gain insight into the audiences of the 1990s and 2000s (Arend, 2006).

In the 1990s, website traffic was measured by analysing server logs, which delivered information through thousands of lines of log entries. However, the market soon evolved towards more user-friendly web analytics solutions, such as web counters. Each company had its own system, which often lacked reliability (Webster, Phalen & Lichty, 2013; Shiu, n.d.). It can be argued that one of the driving forces behind Luxembourg’s development of standardised metrics, and the proposal for a neutral institution to oversee them, was the burgeoning online advertising industry, which led to the first conference on online advertising as early as 1999.

In addition, we provide two lists of the most visited websites: one covering December 1997 to August 1998, drawn from the first Internet directories and web portals for Luxembourg websites, and one for 2004, drawn from CIM.be, in order to include users in the website mapping of Luxembourg.

References:

Brügger, N. (2008). The archived website and website philology: A new type of historical document? Nordicom Review, 29, 155–175.
Meyer, E., Thomas, A., & Schroeder, R. (2011, June 30). Web Archives: The Future(s). SSRN Scholarly Paper. Rochester, NY. https://doi.org/10.2139/ssrn.1830025
Arend, Olivia. (2006, March). 2001-2005: Splendeur et misères du Web Luxembourgeois. Paperjam, 118–121.
Webster, J. G., Phalen, P. F., & Lichty, L. W. (2013). The Audience Measurement Business. In Ratings Analysis (4th ed.). Routledge.
Shiu, Alicia. (n.d.). The Early Days of Web Analytics. Amplitude. Retrieved October 15, 2024, from https://amplitude.com/blog/the-early-days-of-web-analytics

12:30-14:00Lunch Break
14:00-15:30 Session 11A: Data Regimes
14:00
Historicizing Environmental Data on the Web: Surfrider.org, 1997-2024

ABSTRACT. Web-based environmental data dashboards provide critical points of access for users hoping to gain knowledge about their surroundings, yet their historical development has not been explicitly tracked in the existing literature. This paper historicizes environmental data on the web by examining preserved copies of the US-based nonprofit Surfrider Foundation’s coastal water quality monitoring data, dating back to its earliest crawl via the Wayback Machine in October 1997.

The Surfrider Foundation was founded in 1984 to pursue coastal environmentalism. Successful early initiatives included various sewage runoff and industrial waste management infrastructure improvements across the US east and west coasts. Since at least its earliest web crawl, the organization has provided information concerning Southern California’s coastal water quality to users via its website at surfrider.org. Today, the organization’s Blue Water Task Force tracks water quality data through a network of volunteers collecting, processing, and logging water samples in dozens of locations globally. While data were published as text on a static HTML web page in the late 1990s, today they are presented in downloadable JSON and CSV files, which are in turn contextualized within a dynamic JavaScript-based map. This study examines the earliest iterations of Surfrider’s water quality data publication efforts on the web and compares them to its most recent.

Analysis comprises two phases. First, I examine the nature of the water quality data and data structures presented on the Surfrider website, noting data categories and formats. Next, I examine the Surfrider website’s source code to identify the web design techniques used to publicize environmental data. In each phase of analysis, findings from the 1997 Wayback Machine crawl are compared against findings from the website in its current form to better understand the historical development of public-facing environmental data sharing practices on the web over time.
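
For the first phase, the raw 1997 capture can be retrieved from the Wayback Machine roughly as follows (a sketch, not the paper’s actual tooling; the requested timestamp resolves to the nearest capture, and the id_ flag asks for the original bytes without the replay banner):

```python
# Sketch: fetch the unmodified archived bytes of an early surfrider.org
# capture from the Wayback Machine for source-code inspection.
import requests

snapshot = requests.get(
    "https://web.archive.org/web/19971010000000id_/http://www.surfrider.org/"
)
html = snapshot.text  # late-1990s water quality data as static HTML
print(html[:200])     # inspect the markup of the earliest capture
```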

Conceptually speaking, this study builds on recent scholarship concerning the role of data dashboards in the sociotechnical construction of coastal water quality knowledge (Hodges 2024), and contributes to the literature by introducing a historical perspective. While previous research has shown contemporary coastal water quality data initiatives to emphasize bacteria quantities above all other water quality metrics, the present study shows that during the Surfrider Foundation’s earlier phases, water quality data assumed the form of richer, “thicker” descriptions more akin to ethnographic field notes than discrete bacteria counts. Each approach in turn performs a different form of “synchronization” between data and reality, thus facilitating different ideological and material activities (Bowker 2005, p. 48). In conclusion, I argue that Surfrider’s current emphasis on discrete, tabular bacteria data synchronizes their initiatives with an emphasis on the potential for acute bacterial illnesses, rather than the long-term illnesses caused by other forms of pollution or the embodied risk-management practices outlined in their 1990s-era descriptions.

References:

Bowker, G. C. (2005). Memory Practices in the Sciences. MIT Press.

Hodges, J.A. (16 July, 2024) “Comparing Ocean Epistemologies: Reverse-Engineering Los Angeles’ Data Dashboards.” Society for the Social Studies of Science/European Association for the Study of Science and Technology (4S/EASST), quadrennial joint meeting. Amsterdam, NL.

14:20
The un/expected work of open data policies

ABSTRACT. This paper examines the history of the datafied web from a literal perspective: the use of the web to make scientific research data accessible. Here we focus on how policies, laws, and guidelines have constructed the web as a site for sharing and accessing datasets, with a focus on open science data. The history of “open science” policies is distinct from that of “open data”; both are important to accounts of data access on the web. While the US government has long been involved in producing scientific data (e.g. Aronova et al., 2010; Edwards, 2010) and collecting data about its citizens (Bouk, 2017; Igo, 2018) that is useful to scientists and social scientists, the precursors to open science policies that ensure that citizens have access to science research have been traced to the start of the National Science Foundation after World War II (Pasek, 2017). Meanwhile, open data in US policy is rooted in traditions of transparency of the US government and online digital access initiatives (Schrock, 2016). The newest iteration of open data laws and policies emphasizes that data needs to be “machine readable” or “machine actionable” (Rep. Ryan, 2022; Wilkinson et al., 2016). Scholars have examined how open technology activism reproduces neoliberal ideologies (e.g. Hester, 2016; Kelty, 2008), and in the realm of open science data, private corporations are often best positioned to make these resources serve their own profit-seeking ends (Leonelli, 2013; Mirowski, 2018).

The discourses around open data imagine that data is something that can be plucked from its context via the open web and used elsewhere, but the worlds in which many of these policies went into effect have shifted rapidly due to AI firms accessing data from the open web. Data misuse occurs when data’s original context, intended use, or “originary domain” limits are ignored (Acker, 2018). Furthermore, as many indigenous and Black feminist scholars have shown, data access is typically envisioned for those who will use the data, and not always for those who may be most impacted by data’s reuse and deployment (e.g. Carroll et al., 2022; Sutherland, 2024). The consequences of the context of data reuse shift further with the new emphasis on machine actionability, because AIs can become the new context of data reuse. By incorporating open data into AI via web infrastructures, new ethical, material, labor, intellectual property and fairness dimensions of open access come into focus. AI shifts the stakes of unbridled access to open science data and prompts us to revisit the policies governing web infrastructure.

14:40
Investigative turn in the Baltics in times of war in Europe

ABSTRACT. Datafication during polycrises (Henig & Knight, 2023; Norman, Ford & Cold-Ravnkilde, 2024) has contributed to the “investigative turn” – an intensification of (digital) investigative practices in working with digital media and data to resist emergent digital injustices. Professional journalists have been joined by think tanks, government institutions and individuals in using different types of data to analyze and critique the growing phenomena of the dark side of digitalisation and infrastructural globalization, including disinformation, corruption networks and polarisation. These actors use different (digital platform) data as evidence in producing new narratives about ongoing controversies, conflicts and wars (Bedenko & Bellish, 2024; Pastor-Galindo et al., 2020). Iconic examples of such contemporary data-based investigations include Bellingcat’s investigation into the downing of MH17, which geolocated the origin of the Buk missile using geographical landmarks and intercepts from Russian security services, and Mnemonic’s documentation of digital evidence of the Syrian revolution in building the Syrian Archive. Nevertheless, on the one hand, systematic efforts to gather and utilize open source information can be traced back to the mid-19th century in the United States and the early 20th century in Europe (Block, 2023); on the other hand, the geographical diversity of investigative actors goes beyond those located in the western parts of Europe. Since the illegal occupation of Crimea and the ongoing Russian war against Ukraine, a complex landscape of investigative actors has also emerged in Central and Eastern Europe, with a diverse focus in terms of topics, strategies and methods of cooperation. Within these frameworks, investigative practices aim to: counter disinformation (Denisenko, 2023); expose corruption and sanctions-evasion networks; preserve the digital memory of the ongoing war (Nazaruk, 2022; Bareikyte & Skop, 2022); develop new narratives, methodologies and sustainable digital infrastructures to research the war (Bareikyte et al., 2024) and its aftermath in the future; create new cultures of evidence-based research; and securitise societies and environments in Central and Eastern Europe. Within CEE, the Baltic states (Estonia, Latvia and Lithuania) represent an interesting but also complicated case in the context of investigative practices. While these countries are currently not under direct attack from Russia, as Ukraine is, their well-developed digital infrastructures have experienced attacks at both the narrative and infrastructure levels, including disinformation, cyber-attacks and GPS jamming (Braw, 2024; LETA/TBT, 2024). While investigative journalism experienced a massive decline a decade ago (Houston, 2010), investigative media and data practices have been increasingly used in Estonia, Latvia and Lithuania in recent years as a response to Russia’s information war (Denisenko, 2023). The analytical, critical, and educational role of investigative journalists, citizen activists, think tanks, and scholars in countering informational attacks using investigative practices and digital data is crucial to the formation and development of cooperative action and meaning-making practices in complicated times for this region (Chakars & Ekmanis, 2022).
This diverse range of actors working on different “fronts” and in different parts of society illustrates the emergent premonition of war that shapes contemporary cultures of preparedness in the Baltics. In our talk, we focus on investigative practices in the Baltics, including investigative journalism, fact-checking, OSINT and experimental-educational work, which we explore through semi-structured interviews and fieldwork in 2024-2025. Interviewees comprise representatives of non-profit organisations, public broadcasters, private media companies, academics and freelance journalists. We map and present the actors, focusing on their audience strategies, the role of platform and other types of data in their work, and their cooperation practices, outlining the meaning of investigative practices in contemporary datafied and securitised cultures in Central and Eastern Europe.

14:00-15:30 Session 11B: Web archives Practices
14:00
Temporally Extending Existing Web Archive Collections for Longitudinal Analysis

ABSTRACT. The Environmental Data and Governance Initiative (EDGI) regularly crawled US federal environmental websites between 2016 and 2020 to capture changes between two presidential administrations. However, because it does not include the previous administration, which ended in 2008, the collection is unsuitable for answering our research question: “Were the website terms deleted by the Trump administration added by the Obama administration?” Thus, like many researchers using the Wayback Machine’s holdings for historical analysis, we do not have access to a complete collection suiting our needs. To answer our research question, we must extend the EDGI collection back to January 2008. This includes discovering relevant pages that were not included in the EDGI collection but persisted through 2020, not just going further back in time with the existing pages. We pieced together artifacts collected by various organizations for their own purposes through many means (Save Page Now, Archive-It, and more) in order to curate a dataset sufficient for our intentions.

In this paper, we contribute a methodology for temporally extending existing web archive collections to enable longitudinal analysis, including a dataset extended with this methodology. We identified the reasons URL candidates could be missing from the initial EDGI dataset, and crawled the past web of 2008 in order to identify these missing pages. We also identified small domains that were vulnerable to being missed by our past-web crawl, and found that these domains benefited from a complete web archive index lookup instead (a hedged sketch of such a lookup appears below). We probed another large collection, the End of Term 2008 dataset, for additional longitudinal candidates, but found that crawler traps were inflating the size of the dataset, leading to only a small number of additional URLs. By analyzing the provenance of the final collection, we determined that this new longitudinal dataset covering three US presidential administrations only exists because of the aggregation of artifacts collected by many organizations. We also found that automated brute-force methods alone were not sufficient to create this collection, and that iterative manual analysis of automated results produced more candidate seeds. Our new dataset includes 1,220 archived triplets (2008, 2016, and 2020) of US federal environmental webpages.
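
The index lookup mentioned above can be illustrated with the Internet Archive’s public CDX API. This is a hedged sketch rather than the paper’s actual pipeline, and the example URL is invented.

```python
# Sketch: ask the Wayback Machine's CDX index which 2008 captures exist
# for a candidate page, instead of crawling for it.
import requests

def snapshots_in_2008(url):
    r = requests.get("https://web.archive.org/cdx/search/cdx", params={
        "url": url,
        "from": "20080101",
        "to": "20081231",
        "output": "json",
        "fl": "timestamp,original,statuscode",
    })
    rows = r.json()
    return rows[1:] if rows else []   # first row is the field header

print(snapshots_in_2008("epa.gov/climatechange"))  # invented example URL
```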

We use our new dataset to analyze our question: “Were the website terms deleted by the Trump administration added by the Obama administration?” We find that 81 percent of the pages in the dataset changed between 2008 and 2020, and that 87 percent of the pages with terms deleted by the Trump administration had those terms added during the Obama administration. We probed for change trends: cases where agencies had the same terms repeatedly removed across their websites. We found that certain agencies experienced a large number of change trends, including OSHA, NIH, and NOAA, while 17 of the 30 agencies, including NASA and the Department of Energy, experienced no change trends. Finally, we analyzed the 56 deleted terms and phrases tracked by EDGI, found that the terms fell into two categories, climate and regulation, and identified more change trends in regulation term deletions than in climate term deletions.

14:20
Engaging audiences with the UK Web Archive: Strategies for general readers, data users, and the digitally curious

ABSTRACT. This paper explores approaches to engaging three distinct audiences – general readers, data users, and the digitally curious – with the UK Web Archive. Building on collaborative work with The National Archives UK, and drawing on experiences from Cambridge University Libraries and the National Library of Scotland, we present practical recommendations and demonstrate best practices for designing web archives that meet diverse user needs while ensuring broad and equitable access to digital resources. To enhance the experience of general readers, we have introduced exploratory, user-friendly, and gamified interfaces that encourage interactive exploration of web archive collections. Additionally, public engagement is a key focus, with outreach events such as exhibitions designed to raise awareness of these valuable digital resources among library users. By creating engaging experiences that invite discovery, we aim to bridge the gap between casual web users and the rich historical material contained within web archives.

For data users, we prioritize curating detailed metadata and implementing Datasheets for Datasets to support the quantitative analysis of web archive collections. Outreach initiatives for this community include hands-on workshops and data visualization calls, which invite users to interpret and represent the collections through visual mediums. The visual outputs from these calls often enrich future public-facing resources, further enhancing the archives’ accessibility for general readers. Through these efforts, we aim to foster a collaborative ecosystem that encourages innovative research and deeper exploration of the collections.

A major focus of our work is addressing the digital skills gap, particularly for the digitally curious—those who recognize the potential of web archives but lack the technical skills to fully engage with them. To support this group, we are developing in-library workshops tailored to building foundational digital literacy and data analysis skills. These workshops are designed not only to upskill participants but also to inspire them to explore web archives more confidently. By equipping users with the tools to navigate and analyze collections, we hope to empower a broader demographic to engage with these resources.

In summary, in this paper, we present a strategy to improve the usability of the UK Web Archive across varied institutions. Through a combination of material development (datasets, interfaces) and diverse outreach events (exhibitions, data visualization calls, workshops), we aim to meet the needs of general readers, data users, and the digitally curious. By tailoring our approach to these distinct groups, we strive to create an inclusive, dynamic web archive experience that invites exploration, research, and digital empowerment.

14:40
Seed lists on themes and events on Arquivo.pt: a curious starting point for discovering a web archive

ABSTRACT. Every year, Arquivo.pt makes special collections dedicated either to a particular topic or to events. To do this, it starts by producing a list of seeds (the addresses of selected web pages), which it then records. The recorded content becomes accessible after a one-year embargo, along with additional information such as the seed lists, contextual information and, in some cases, logs and CDX indexes.

This presentation briefly explains 1) the mission of Arquivo.pt to support research; 2) the criteria for creating a thematic collection or a collection about an event; 3) the method for obtaining lists of seeds; 4) the results obtained; 5) issues relating to the recording of seeds, namely the tools used and limitations; 6) the use case of the special collection on Portuguese artists; 7) Finally, we mention the lessons learnt and the challenges that have come from researchers.

Arquivo.pt (the Portuguese Web Archive) is a public service that anyone can use to find old web pages. However, its primary mission is to serve scientific research. Organically, it belongs to the government research support organisation Fundação para a Ciência e a Tecnologia. Arquivo.pt makes all the data available using various strategies and services: search interface, API for automatic processing, open datasets and seed lists.
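
By way of illustration, the API mentioned above can be queried roughly as follows; the endpoint and response fields follow Arquivo.pt’s publicly documented TextSearch API, but they should be treated here as assumptions to verify.

```python
# Sketch: full-text query against Arquivo.pt's TextSearch API.
# Endpoint and field names are assumed from public documentation.
import requests

r = requests.get("https://arquivo.pt/textsearch", params={
    "q": "street art",   # a theme like the special collections described
    "maxItems": 5,
})
for item in r.json().get("response_items", []):
    print(item.get("tstamp"), item.get("originalURL"))
```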

Arquivo.pt has made special collections on the occasion of events such as political elections, the Olympic Games and the start of the pandemic. Others have focused on specific topics, such as museums, the press, street art, artists, etc. The thematic collections are intended to arouse the curiosity of the community, feeding the Arquivo.pt collection with content from their fields of study.

The selection of seeds was partly manual and, for some collections, the community participated. However, Arquivo.pt also uses an automatic methodology to obtain a large number of seeds, drawing on services such as the Bing Search API.

In October 2024, the Archive published the 51st set of open data, more than half of which were seed lists on specific events and topics (on Dados.gov – open data portal; arquivo.pt/datasets).

The seed list was just a starting point for recording. It can be a starting point for research, raising various questions. Is the content recorded representative of a particular theme? How many of these seeds were not successfully recorded? In which cases was it due to technologies that blocked access?

To illustrate the ideas in this presentation, a special collection on Portuguese visual artists is mentioned. This collection emerged from a collaboration with the artists’ community, and a PhD researcher included this material in her research project.

Among the lessons learnt, we would highlight the following: the seed lists are useful for a first approach by researchers; it is useful to gather even more information about the selection and recording process. In this sense, the seed lists on themes and events on Arquivo.pt are a curious starting point for discovering a web archive. The challenge now remains for researchers to test their methodologies on these datasets.

14:00-15:30 Session 11C: Methods
14:00
Critical AI technography: Researching the material political economy and power of AI platforms

ABSTRACT. This paper proposes technography as a valuable methodology for conducting critical empirical and historical studies of the material political economy and power of artificial intelligence (AI) platforms. We argue that technography – a descriptive and interpretive approach to analysing the structural and operational aspects of technical systems (Bucher, 2016; Helmond and Van der Vlist, 2019) – can be applied to critically examine AI platforms like Azure OpenAI, Amazon SageMaker, Google’s Vertex AI, and NVIDIA AI. This methodology is crucial for scrutinising how major technology companies, or “Big AI”, are driving AI’s “industrialisation” across various sectors and in everyday digital life.

Our adaptation of AI technography draws from existing research to investigate the material, evolutionary, and discursive components of AI systems (Van der Vlist et al., 2024; Luitse, 2024) and their broader platform infrastructures (Burkhardt, 2020; Helmond et al., 2019). It employs sources like technical platform documentation, corporate blogs, financial reports, and archived product pages from the Internet Archive to provide a historically grounded understanding of the workings and power structures underlying AI platforms (Helmond and Van der Vlist, 2019). These sources enable a critical evaluation of the objectives, functions, and claims made by these companies and reveal their evolving influence on AI development and deployment.

Additionally, this methodology allows researchers to examine the specific strategies employed by major technology companies to consolidate and exert economic, infrastructural, and symbolic power. For instance, Amazon’s AWS and Google Cloud have become dominant by providing essential cloud infrastructure services that have formed the backbone of the “datafied web” since the early 2010s, coinciding with the rise of data-driven “surveillance advertising” as its dominant business model (Crain, 2021; Van der Vlist and Helmond, 2021). In this context, our method offers a critical, empirical framework for analysing three key dimensions of “Big AI” and its political economy within the broader history of the datafied web.

First, it addresses AI’s deep industrialisation, where major technology companies drive economic and technological expansion across various sectors, consolidating market power and monopolisation dynamics (Van der Vlist et al., 2024). This reinforces existing power structures, with Big Tech leveraging control over AI infrastructure to limit competition, particularly concerning cloud-reliant large language models (LLMs) (Kak and Myers West, 2023; Luitse and Denkena, 2021; Narayan, 2022).

Second, it addresses the evolving infrastructural power of AI platforms. These platforms have evolved alongside large-scale cloud infrastructure dependencies, as major technology companies set new standards and shape the conditions for AI production and implementation across domains such as cultural production, healthcare, and security (Jacobides et al., 2021; Van der Vlist et al., 2024). This includes strategies of vertical integration, complementary innovation, and abstraction to obscure the complex operations and governance of AI platforms (Luitse, 2024).

Third, it addresses the evolving symbolic power of AI platforms. Companies use discursive strategies to influence dominant ideas about desirable AI types and promote notions of “openness” and “democratisation” (Burkhardt, 2020; Widder et al., 2023), or AI ethics (Aradau and Blanke, 2022).

Taken together, critical AI technography is oriented towards how companies like Microsoft, Google, Amazon, and NVIDIA shape contemporary AI trajectories within broader web history, through their converging economic, infrastructural, and symbolic power. As AI increasingly permeates economic sectors and digital life, it is essential for critical scholars, journalists, activists, policymakers, and regulators to trace and critique the forces driving AI’s evolution.

14:20
AI: A Lever for ‘Decolonizing’ Archives? Web Archives as a Datafield for Critical and Inclusive Uses of AI in History

ABSTRACT. Concluded in 2024, the European program Polyvocal Interpretation Of Contested Colonial Heritage (PICCH) aimed to explore how archival documents created from a colonial perspective could be reappropriated and reinterpreted to become an effective source for constructing an inclusive future society. In France, the term ‘decolonization’ has been heavily instrumentalized, losing the profound meaning attributed to it by thinkers like Achille Mbembe. In this project, decolonizing French television and web archives means making these materials from former colonial powers more inclusive of and respectful towards populations still facing discrimination today, a challenge that has occupied archivists worldwide for years (Ghaddar & Caswell, 2019). One of the project’s objectives was to refine the metadata of television archives as well as web data concerning narratives of events related to the colonial past or to post-colonial issues. We scrutinized the media coverage of the 1983 March for Equality and Against Racism from a transmedia perspective, based on web video corpora and archived web pages from the INA. One of the goals was to examine the visibility the media accorded to the marchers themselves: in 1983, they were young people born to immigrant parents in French urban suburbs, perceived as Maghrebi or Black, which led to an essentialization of the discourse on this event in the media. The marchers were relegated to the periphery of the journalistic narrative from the 1980s until more recent commemorations, and they have since used the web to reclaim the narrative of this event. Given the volume of data (archived web pages, voice-over text from videos, video metadata), we employed AI programs to automate the identification of the marchers, whether through text (names, nicknames) or through their faces in the videos.

Based on this case study, this paper will eschew the interpretation of the online media coverage of the march to concentrate on the methodological and hermeneutical questions raised by cultural biases when employing deep learning AI programs to analyze web data. It seeks to investigate under what conditions the application of AI programs to analyze archived web data can enhance the consideration of marginalized historical actors in the analysis of contemporary transmedia narratives.

Firstly, we will present the corpus and methodology used to study the media treatment of the marchers in the television and web archives of the Institut national de l’audiovisuel. Secondly, we will review the application of AI to these corpora, focusing on the significance of cultural biases in data processing through two examples: the thematization of text from HTML pages archived by the INA in 2013 and automated visual recognition in videos. Finally, we will consider the lessons learned from this experience and propose hermeneutic and ethical reflections for web historians confronted with hegemonic biases in the processing of web data.
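To make the text-based identification step concrete: a named-entity recogniser can surface candidate person names in archived page text, which are then matched against a list of known marchers. The abstract does not specify the project’s tooling, so the sketch below is a hypothetical illustration in Python using spaCy’s French pipeline (fr_core_news_md is an assumption).

    import spacy

    # Hypothetical illustration: the PICCH project's actual tools are not named.
    # Extract person entities from archived page text with spaCy's French model.
    nlp = spacy.load("fr_core_news_md")

    page_text = (
        "En 1983, la Marche pour l'égalité et contre le racisme part de Marseille. "
        "Parmi les marcheurs figurent Toumi Djaïdja et Christian Delorme."
    )

    doc = nlp(page_text)
    persons = [ent.text for ent in doc.ents if ent.label_ == "PER"]
    print(persons)  # candidate names to check against the list of marchers

It is precisely at this step that the cultural biases discussed in the paper can surface: models trained largely on majority-French text may misrecognise or split names, so automated output requires manual verification.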

14:40
Echoes of Dolly: isolating long-term political schemata by abstracting web archives as Zotero collections

ABSTRACT. Abstracting web archives as data at the document level and as metadata at the corpus level facilitates the systematic exploration of niche topics in massive collections. This paper shows how adapting scientometric software to web archives enabled both the abstraction of corpora of web-archived pages as Zotero collections and a deeper document-level exploration that helps unravel past and present controversies.

The live birth of Dolly, the first mammal cloned from an adult cell, was a striking biotechnological performance which quickly became a commonplace of public life. What made it different from previous comparable events is that it happened in the age of the early internet, a social era that we can now partially access through web archives. This makes it a rare opportunity to evaluate the trajectory of very specific political schemata, namely those concerning what is politically at stake in developments in biotechnology.

Most of these schemata were defined and refined throughout the 20th century. Is progress in genome engineering the key to eternal life? An impending revival of Third Reich ideology? The sign that nothing would ever be sacred anymore? Every time a new technical milestone is reached, the sociopolitical imaginaries around eugenics, heredity, and what would constitute “fitness” in a person or a population are reactivated. Yet the debates over Dolly are peculiar because they were the first of that nature in the internet age. For the first time, netizens had the opportunity to discuss the announcements of scientists and provide their own perspectives on what the existence of that sheep would mean for the present and for the future.

This study focuses on mentions of Dolly in French political debate during the 2002 presidential election. It was conducted by deploying a new methodology that abstracts extractions from a full-text-indexed web archive as Zotero collections of documents. Using PANDORÆ, software originally designed to perform scientometric analysis, the web archives are queried by text content, abstracted as Zotero-compatible data to enable curation, and then re-imported and explored at both the corpus and document levels through ad hoc data visualization algorithms. The exhaustive study of the relevant archived web pages shows that in the French political web of 2002, Dolly is constructed both as a creature symbolizing the hubris of humankind and as a symptom of new policy problems that policymakers are ill-equipped to handle.
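To give a sense of what abstracting a web-archive search hit as a Zotero-compatible record can look like in practice: the abstract does not detail PANDORÆ’s internals, so the snippet below is a generic Python sketch using the pyzotero client, with hypothetical library credentials and a hypothetical search hit.

    from pyzotero import zotero

    # Generic sketch only: PANDORÆ's actual pipeline is not described here.
    # Library ID, API key, and the sample hit are hypothetical placeholders.
    zot = zotero.Zotero(library_id="1234567", library_type="user", api_key="API_KEY")

    hit = {  # one full-text search result from an indexed web archive
        "title": "Clonage : après Dolly, quelles limites ?",
        "url": "http://example.fr/2002/dolly-debat.html",
        "capture_date": "2002-04-15",
    }

    item = zot.item_template("webpage")  # fetch Zotero's webpage item schema
    item["title"] = hit["title"]
    item["url"] = hit["url"]
    item["date"] = hit["capture_date"]
    item["extra"] = "Archived web page from a full-text indexed web archive"

    zot.create_items([item])  # the curated collection can later be re-imported

Once the hits live in a Zotero library, its standard curation features (collections, tags, notes) become available before the corpus is re-imported for visualization.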

15:30-15:45Coffee Break
15:45-16:30 Session 12: My PhD in 5 Minutes
15:45
Before WEB 2.0: A Cultural History of Early Web Practices in the Netherlands from 1994 until 2004

ABSTRACT. My PhD research aims to construct a cultural history of web practices in the Netherlands from 1994 to 2004, a pivotal period that predates the rise of Web 2.0. This project focuses on the transformative 1990s and early 2000s, an era of rapid internet culture evolution that remains underexplored in media historical scholarship (Verhoef, 2023, p. 1). Often overshadowed by the swift development of platforms and social media, this era constitutes a significant historiographical gap that the research seeks to fill. By moving beyond conventional American-centric narratives in Internet Studies (Abbate, 2017), the project examines diverse contexts, particularly the role of amateur contributions, reinforcing the notion that “everyday people made the internet social” (Driscoll, 2022, p. 194) and highlighting the concept of the early “vernacular web” (cf. Howard, 2008).

Utilising an interdisciplinary methodology that integrates media history, digital humanities, web anthropology, and archaeology, the research unfolds in two phases. The first phase investigates the Netherlands’ interpretation of the web, identifying dominant socio-technical imaginaries that illuminate both technological and societal developments. This analysis also explores how various actors leveraged the web to achieve broader social, economic, and political objectives. Complementing this historical narrative, a study of intellectual imaginaries draws from influential academic publications in Internet Studies, which have shaped local web initiatives.

In the second phase, the focus shifts to a bottom-up approach, utilising archived web collections to examine the practices of early adopters, particularly the XS4ALL homepages. Additionally, I aim to move beyond prominent initiatives, such as XS4ALL and DDS, by exploring other web localities in the northern Netherlands through oral history. By merging these two stages, the research addresses themes like small-scale web entrepreneurship and the creative practices of amateur users, while also diving into critical archival studies by examining digital heritage, canonisation, source criticism, and the ethics of working with personal archival materials.

15:50
Manifesting The Web: Network Imaginaries in Manifesto Writing Between the 1980s and the 2020s

ABSTRACT. “We are the mice living in the foundations of the Internet. If it needs doing, we do it ourselves. We voluntarily restrict our use of CPU, memory, disk space, and bandwidth. We prefer simple protocols like Gopher. We prefer simple formats like plain text.” (Small Internet Manifesto, 2019)

― this is a quote from one of the dozens of manifestos published online by tech movements in response to the increasing datafication and platformisation of the web. The manifesto has been an important genre in the history of computer networks for the last four decades. Despite the rhetorical inflation of internet myths, manifesto writing has been consistently present among tech activists and social movements. My dissertation focuses on the literary and epistemic history of this genre: manifestos written about the internet and published online. The main material is a corpus of web archives of 125 manifesto webpages.

The manifesto is one of the literary and rhetorical forms that have historically contributed to shaping what Paolo Bory (2020) calls “network imaginaries”. Literary forms that shape network imaginaries have received considerable attention; they include metaphors (Wyatt, 2021; Markham, 2020), maps (Bory and Rikitianskaia, 2020), anecdotes (Natale, 2016), myths (Katz-Kimchi, 2015), and narratives of internet pioneers (Bory, Benecchi and Balbi, 2016). In contrast to these, internet manifestos are published on the web and make use of the different formatting possibilities of webpages. This creates an interesting self-referentiality: the web materially influences the production of network imaginaries.

Drawing from the cultural history of the internet, (German) media studies, and electronic literature, I analyse changes in manifesto writing throughout internet history. The conference presentation will focus on manifestos that appear as a direct reaction to the datafication of the web. In this, the talk will correspond to the conference’s interest in “histories of practices of resistance to datafication and platform economies”.

15:55
Battlefield of Truth(s) on Investigative Frontlines: From Data Activism to OSINT Professionalism

ABSTRACT. Open Source Intelligence (OSINT), viewed through the lens of civic tech, covers various aspects of citizen engagement. OSINT practices interrogate datafication and democratic participation by instrumentalising data for good to create positive social impact (Daly, Devitt & Mann, 2019; Gutierrez, 2022; Milan & van der Velden, 2016). I am particularly interested in the social impact and diffusion of intelligence practices (cf. Rosenberg, 2013; Belghith, Venkatagiri & Luther, 2022) during Russia’s war against Ukraine (Brantly, 2024), rather than in the moment of invention or introduction of new tools and techniques, thereby arguing that human action shapes technology (Pinch & Bijker, 1984).

By integrating established academic research methods and the inventive strategies employed by OSINT collectives (Kazansky et al., 2019), my PhD project examines what role civilian OSINT practitioners play in producing OSINT-derived evidence through collaboration in a participative work environment. My research explores how OSINT hobbyists strengthen their analytical skills through training and build an OSINT profession, but also how a loosely organised, EU-based collective, my case study, evolves into a professionalised organisation with established hierarchies, a commitment to high standards of objectivity, accuracy and transparency as an oath of professionalisation, and stakeholder outreach strategies to produce greater volumes of actionable intelligence more quickly.

Using collaborative digital team ethnography (Beneito-Montagut, Begueria & Cassián, 2017) in contemporary organisational settings (Akemu & Abdelnour, 2018), I explore the collection practices, organisational hierarchies and community strategies of the above-mentioned collective. My main research interest is how the collective negotiates hierarchies and competition, but I also address the inequalities that may result: neglecting potential impacts on communities and individuals, or disregarding the individual attribution and authorship of OSINT operatives by selling ready-made actionable intelligence reports to external stakeholders and by exploiting and amplifying the output of local actors and data producers to shape particular narratives about Russia’s war against Ukraine (Müller and Wiik, 2023: 206).

16:30-16:45Coffee Break

Program Highlights

Keynote by Nanna Bonde Thylstrup

Session 8:

Chair: Sebastian Gießmann

Thursday, June 5, 2025

18:00-19:30

open to the public

What happens to data when it vanishes? How do digital remains persist even as information seemingly disappears? What can the politics of disappearance tell us about power in datafied worlds?

Disappearance has become a crucial yet understudied force shaping digital experiences and infrastructures. This keynote develops a technographic approach to examine how data loss and digital remains create complex patterns of presence and absence that defy simple narratives of erasure. Through cases ranging from platform architectures to digital archives, it traces how power operates through sophisticated mechanisms of appearing and vanishing, leaving traces that persist in unexpected ways.

By mapping these dynamics of disappearance, the exploration uncovers how our data landscapes are shaped not by mere accumulation, but through intricate processes of loss, persistence, and transformation. The keynote explores how digital societies negotiate memory and forgetting, positioning disappearance itself as a crucial digital experience.

The talk develops theoretical tools for analyzing these dynamics while remaining grounded in concrete technological practices and their political implications. Through this, it compels us to rethink fundamental assumptions about presence, absence, and the complex temporalities and materialities of digital culture.

About Nanna Bonde Thylstrup

Nanna Bonde Thylstrup is Associate Professor in Modern and Digital Culture, and Principal Investigator of the ERC-funded project Data Loss: The Politics of Disappearance, Destruction and Dispossession in Digital Societies (DALOSS). Her research examines how digitization and algorithmic processes transform knowledge infrastructures, with particular focus on the politics and ethics of data, machine learning, and digital infrastructures. She co-leads the Digital Culture Research Cluster at UCPH and has held visiting fellowships at Duke, Cornell, and Columbia Universities. Thylstrup is author of The Politics of Mass Digitization (MIT Press, 2019) and co-editor of Uncertain Archives: Critical Keywords for the Age of Big Data (MIT Press, 2021) and (W)ARCHIVES: Archival Imaginaries, War, and Contemporary Art (Sternberg Press, 2021). Her work appears in journals including Big Data & Society, Journal of Cultural Economy, Information, Communication & Society, First Monday, and Media, Culture & Society. She serves on the editorial board of Cambridge University Press's series on AI in Culture and Society and regularly consults for cultural heritage organizations and governments on digitization and emerging technologies.

Keynote by Jonathan Gray

Session 9:

Chair: Tatjana Seitz

Friday, June 6, 2025

09:00-10:30

open to the public

This talk explores how data is made public on the Internet amidst the rise of social media, platforms and AI. Retracing the emergence of legal and technical conventions of open data, it looks towards a more expansive understanding of public data cultures which shape how we know and live together. Through a series of empirical vignettes, the talk reconsiders data as cultural material, a medium of participation and a site of transnational coordination. It then turns to two forms of intervention: making data that is considered missing, and entry points for critical data practice. As well as situating public data cultures in relation to the datafication and platformisation of the web, the talk will highlight the role of web archives in studying these developments.