ESRA 2025 Preliminary Program
All time references are in CEST
Bridging Methodology and Computational Social Science 2

Session Organisers: Mr Joshua Claassen (DZHW, Leibniz University Hannover), Dr Oriol Bosch (Oxford University), Professor Jan Karem Höhne (DZHW, Leibniz University Hannover)

Time: Thursday 17 July, 14:00 - 15:00
Room: Ruppert 002
In today’s world, daily activities, work, and communication are continuously tracked via digital devices, generating highly granular data, including digital traces (e.g., app usage and browsing) and sensor data (e.g., geolocation). Researchers from various disciplines are increasingly utilizing these data sources, though often with different research objectives. Methodologists tend to focus on evaluating the quality and errors of digital data, while Computational Social Scientists (CSS) often leverage these data to answer more substantive research questions. However, there is a lack of collaboration between the two worlds, resulting in a disciplinary divide.
For example, CSS researchers have embraced data donations, yet methodologists have not provided sufficient empirical evidence on the quality of such data. Moreover, web tracking data is rapidly being adopted in CSS, but methodological guidelines on how to gather the substantive content of website visits and apps (e.g., through HTML scraping) are lacking. Methodological error frameworks covering both measurement and representation do exist, but these frameworks have yet to be (fully) leveraged.
This session invites contributions that bridge the gap between methodology and CSS, fostering collaboration across disciplines. We particularly welcome CSS work that incorporates a strong methodological foundation, as well as methodological research with clear relevance to substantive CSS inquiries. Topics may include, but are not limited to:
• Substantive research showcasing best practices when using digital data
• Assessments of digital data in terms of quality and errors
• Approaches to reducing representation, sampling, and measurement errors in digital data
• Studies substituting more traditional data collections (e.g., web surveys) with digital data (e.g., measuring opinions with digital traces)
• Studies that go beyond the pure tracking (or donating) of app, search term, and URL data, including data integration and enrichment strategies
Keywords: Digital trace data, Computational social science, Survey methodology, Web tracking, Data donation
Papers
Capturing Public Opinion by Automatically Summarizing Social Media Posts
Dr Frederick Conrad (University of Michigan) - Presenting Author
For information seekers who wish to quickly gain a qualitative sense of the opinions in a large group of people, social media posts may be a promising resource. The massive volume of social media is one of its greatest virtues as a data source, but its volume also creates a great challenge. To usefully capture the gist of millions of posts one must distill vast quantities of content. The approach on which we focus is “abstractive summaries” generated by Large Language Models (LLMs): paragraph-long synopses of posts. We report a study that evaluates the fitness for use of LLM-generated summaries for developing qualitative insights about the discourse among social media users. Judges in the study evaluated eight summaries from two corpora of posts, rating (1) the extent to which each summary captured the gist of posts sampled from a corpus; (2) the extent to which each sentence in each summary captured the gist of the posts on which (according to the LLM) it was based; (3) whether any of the sentences were hallucinated by the LLM; (4) whether important information was excluded from the summary; and (5) whether the summaries reflected the breadth of topics within the two corpora. In general, the judges rated the quality of the summaries positively, both overall and at the sentence level; evidence of outright hallucination was rarely, if ever, observed; and omitted posts were usually judged to be “not important,” although there was some indication that relevant information was lost. The take-away is that the summaries were of sufficiently high quality to enable an information seeker to formulate accurate impressions of what social media users are saying and to explore the actual posts in a deliberate and informed way.
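Because millions of posts cannot fit in a single model prompt, distilling content at this scale typically requires batching posts before summarization. The sketch below illustrates that general idea only; the batching budget, prompt wording, and function names are illustrative assumptions, not the study's actual pipeline.

```python
# Illustrative sketch: group posts under a length budget and assemble
# summarization prompts for an LLM. The budget and prompt wording are
# hypothetical, not taken from the study described above.

def batch_posts(posts, max_chars=2000):
    """Group posts into batches whose combined length stays under a budget."""
    batches, current, size = [], [], 0
    for post in posts:
        if current and size + len(post) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(post)
        size += len(post)
    if current:
        batches.append(current)
    return batches

def build_summary_prompt(batch):
    """Assemble a one-paragraph abstractive-summary request (hypothetical wording)."""
    joined = "\n".join(f"- {p}" for p in batch)
    return ("Summarize the gist of the following social media posts "
            "in one paragraph:\n" + joined)

posts = [f"post number {i} about the economy" for i in range(100)]
batches = batch_posts(posts, max_chars=500)
prompts = [build_summary_prompt(b) for b in batches]
```

Each prompt would then be sent to an LLM; per-batch summaries could themselves be summarized again to yield a single corpus-level synopsis.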
Who posted that? Automatically inferring characteristics of social media users
Dr Mao Li (University of Michigan) - Presenting Author
Social media continues to be a promising source of data for social research, but a significant limitation is that posted content often does not include information about who created each post. If this type of information were available, it could allow researchers to generalize beyond the users of the social media platform as well as to conduct analyses of user subgroups. In this talk, we explore a method for inferring users’ characteristics by (1) constructing a data set of social media user handles, their self-reported characteristics, and their posts, (2) training models to learn the relationship between the characteristics of each user and their posts, and (3) testing the models’ predictions in a social media corpus for which no information about the characteristics of who created the posts is provided to the models but to which we have access, allowing us to assess the models’ performance. To evaluate this approach, we recruited active social media users from a US-based commercial probability panel, collected their Twitter and/or Reddit user handles (n=1850), and obtained their self-reported characteristics (Age, Education, Gender, Urbanicity, Partisanship, Political ideology) from the panel vendor. We train Large Language Models on 80% of the data (users’ posts and characteristics) to predict each user’s characteristics from the content of their posts and test model performance (accuracy) on the remaining 20%. We further evaluate the models’ performance by comparing predicted distributions of user characteristics to distributions of these characteristics estimated from survey data. So far, the models are performing reasonably well, with those for gender and age performing particularly well. The overall goal is to create general-purpose predictive models that can be applied to various social media platforms and, in conjunction with other NLP tools, help make Big Data more like Designed Data.
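The 80/20 train-and-evaluate protocol described above can be sketched in miniature. A simple word-count classifier stands in for the fine-tuned LLM, and the posts and labels below are synthetic placeholders; nothing here reflects the study's actual model or data.

```python
# Toy sketch of the train/test protocol: learn a mapping from post text to
# a self-reported characteristic, then score accuracy on a held-out 20%.
# A word-count classifier is an assumed stand-in for LLM fine-tuning, and
# the texts/labels are synthetic placeholders.
from collections import Counter
import random

def train(examples):
    """Count word frequencies per label (a crude stand-in for model training)."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def predict(counts, text):
    """Pick the label whose training vocabulary best overlaps the text."""
    words = text.lower().split()
    return max(counts, key=lambda lab: sum(counts[lab][w] for w in words))

random.seed(0)
data = ([(f"vote policy debate sample {i}", "liberal") for i in range(50)] +
        [(f"market tax freedom sample {i}", "conservative") for i in range(50)])
random.shuffle(data)
split = int(0.8 * len(data))                      # 80% train, 20% held out
model = train(data[:split])
held_out = data[split:]
accuracy = sum(predict(model, t) == y for t, y in held_out) / len(held_out)
```

Beyond per-user accuracy, the distributional check the abstract mentions would compare the frequencies of predicted labels on the held-out set against benchmark survey estimates.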