Opportunities and Limitations of Web Scraping in Social Research |
|
Session Organisers | Mr Aleksei Rotmistrov (National Research University Higher School of Economics); Miss Svetlana Zhuchkova (National Research University Higher School of Economics) |
Time | Tuesday 16th July, 14:00 - 15:30 |
Room | D22 |
Because of its development and deep penetration into many areas of life, the Internet has become a source of a massive amount of information, including social information. Users leave “footprints” of their communication and other activity on the pages of social networks, online communities, and various thematic sites. Web scraping, a method actively used in computer science to collect such data automatically, is gradually making its way into the social sciences. The use of web-scraped data opens up new opportunities for social research because of the large volume and non-reactive nature of the information obtained. However, such data have some severe limitations: they are heterogeneous and unstructured, most of the extracted features are categorical variables (which limits the variety of methods available for analysis), the proportion of missing data among the studied objects increases, and so on. Besides, the use of data from web pages changes a study’s design in an unusual way: the data-driven paradigm, in which data become the basis of future theory and which is not typical of the social sciences, comes to the fore. The natural question is: what is the real potential of this data collection method in social research? Can web scraping be used as a qualitative analogue of, or a substitute for, standard survey data? If so, how can the identified constraints be overcome? If not, what are the limits of this approach? The session’s participants are invited to respond to these questions in their reports and to share their own experience of using web scraping and web data.
Keywords: web scraping, social science, non-reactive data, data-driven paradigm
Ms Marya Vorobyova (National Research University Higher School of Economics) - Presenting Author
Web scraping can be one of the methods for collecting sociological information. However, this method has not yet been widely used in sociology, and it clearly has some important limitations.
According to previous research on web scraping, the limitations are the following: a lack of suitable software and a lack of practical skills among sociologists for working with large amounts of data. Research ethics is also considered an important limitation. From a methodological point of view, the difficulty of forming a representative sample restricts the use of web scraping in sociology. Moreover, Internet data are subject to social desirability effects.
However, web scraping has unique capabilities that can solve modern problems of collecting sociological information.
The purpose of this paper is to define more broadly the possibilities and limitations of web scraping as a method for collecting data in sociology. As an empirical example, we searched for links between the socially determined characteristics of a movie as a commodity and its popularity, based on data collected from the imdb.com site (N = 36679).
Examining the use of web scraping on an empirical example helped to identify two more significant limitations: the problem of missing values and the limited ability to operationalize theoretical research concepts.
The possibility of collecting almost the entire general population of a study is considered a unique capability of web scraping. It is also worth noting the quality of mathematical models built on data collected using web scraping: the regression model built in this study has R² = 0.749. Another advantage of web scraping, as of other non-reactive data collection methods, is that information is collected in the form in which it was produced by the studied subjects.
Mr Anton Boichenko (HSE University) - Presenting Author
The research presents an analysis of a corpus of Russian hip-hop/rap songs using topic modeling. The aim of the work is to show how patterns of hegemonic masculinity can be traced in the lyrics of these songs, as rap has been gaining massive popularity since the 1990s and has a great influence on its audience. Studies of Western rap songs show that patterns of traditional hegemonic masculinity are transmitted through the texts of hip-hop artists. Considering this feature of the texts, we base the research on the concept of hegemonic masculinity in its traditional understanding, which includes aggression, homophobia, aggressive and/or risk-taking behavior, and the demonstration of domination over femininity and over other masculinities. Using web scraping, 10,196 texts of Russian hip-hop songs were gathered from the open source https://рэп-текст.рф (translation: https://rap-text.rf). Using this source helps to identify a single text as belonging to Russian rap: the “borders” between genres are vague, and the identification of a text as rap by users helps to overcome this problem. The research is now at the stage of collecting and processing data. The gathered data will be tokenized, lemmatized, and analyzed using the BigARTM library, which is based on the principle of additive regularization and has proved effective at deriving topics from different corpora. The expected result of the research is the derivation of the topics that are present in the corpus of Russian hip-hop texts. In particular, we expect to find topics that can be associated with patterns of hegemonic masculinity. As the final result, we plan to demonstrate the presence of these patterns and their representations in the chosen set of texts.
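The tokenization step mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: lemmatization and the BigARTM model itself are omitted, and the names `tokenize`, `bag_of_words`, and `document_frequency`, as well as the regular expression, are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: a token is any run of Unicode letters (Cyrillic or Latin);
# digits, punctuation, and underscores act as separators.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lower-case the text and split it into letter-only tokens."""
    return TOKEN_RE.findall(text.lower())


def bag_of_words(texts):
    """Turn each song text into a token -> count mapping."""
    return [Counter(tokenize(text)) for text in texts]


def document_frequency(bags):
    """Count in how many texts each token occurs at least once."""
    df = Counter()
    for bag in bags:
        df.update(bag.keys())
    return df
```

Per-text counts of this kind are the usual input to a topic model; in a real study the tokens would first be lemmatized, so that inflected forms of one Russian word are counted together.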
Mr Petr Makeev (National Research University Higher School of Economics) - Presenting Author
Private tutoring is widespread in Russia. Private tutors mostly work in an informal context, which makes estimating and analyzing their activity a complex and nontrivial task. However, the activity of private tutors is not entirely hidden in the informal sphere and has some formal features. For instance, there are dedicated websites that serve as openly accessible databases of private tutors. Using parsing (web scraping), data from the six biggest Russian websites were collected, processed, and analyzed. About one hundred and fifty thousand (150,000) observations containing information about gender, price, experience, and other variables were downloaded and processed. Descriptive statistics that draw a socio-demographic portrait of the typical Russian private tutor are presented, showing how the informal market of private tutoring can be studied with the help of web scraping.
Miss Maria Rodionova (National Research University Higher School of Economics) - Presenting Author
Sociological techniques that belong to qualitative methodology are often used to study hard-to-reach groups. Samples in such studies are often small, since finding participants and then obtaining their consent to take part in the research is difficult. Qualitative results can hardly describe the relationships between various characteristics of community members or claim external validity.
One community that may be considered hard to reach is people involved in BDSM practices. The community itself is quite heterogeneous: it includes people who are only interested in BDSM as well as those who actively practice it, and the latter are divided in turn into those who practice it privately and those who perform in public scenes (an individual can combine both). Web scraping of an online BDSM dating service allows the researcher to obtain a sample that is commensurate in size with the population of those registered with the service and to perform further analysis.
In this case, descriptive statistics that accurately reflect the main trends can be presented, and relationships can be sought between features indicated in the profile, for example, gender and BDSM role, or age and willingness to provide material support to a future partner. More complex methods can also be applied, for example, social network analysis (SNA) of BDSM interests and tracking their relationship with other profile characteristics.
This study focuses on the possibilities of working with data collected through web scraping. A set of analysis methods that can be applied to data of this kind within sociology is considered. A number of limitations of the method are also noted. They mainly consist in working not with data on individuals as such, but with their virtual representation in a dating service profile, which is obviously not identical to the person.
Mr Aleksei Rotmistrov (National Research University Higher School of Economics) - Presenting Author
Miss Svetlana Zhuchkova (National Research University Higher School of Economics)
There is a special segment of social actors who try to remain partly hidden from strangers’ eyes. One of the reasons for hiding their activity is that such actors try to avoid persecution by competing actors. For example, radical oppositional groups often try to avoid persecution by the authorities. It is hard for scientists to study hidden actors because these actors tend to deny strangers any access to their lives, activities, and even thoughts.
On the other hand, hidden social actors usually need to communicate with each other and to recruit new supporters. For this purpose, the modern world provides an effective tool: the Internet. When hidden actors use the Internet, they leave footprints that may be explored regardless of the actors’ wishes.
This paper addresses the advantages and limitations of web scraping of the Russian social network VKontakte. The scraping was done through the network's API. The technical limitations included the need for authorization, a limit on the number of objects that can be downloaded, errors that occur while the server's response to a query is pending, and a limit on the number of queries per second. As for methodological limitations, a multi-step algorithm had to be developed to select only the relevant pages.
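Three of the technical limitations listed above (a cap on queries per second, errors while a response is pending, and a cap on objects per download) are commonly handled with throttling, retries, and paging. The sketch below illustrates that general pattern under stated assumptions; it is not the authors' code and does not call the VKontakte API. The class `RateLimitedClient`, the function `fetch_all`, the `offset`/`count` parameters, the default limits, and the use of `TimeoutError` to signal a pending response are all hypothetical.

```python
import time


class RateLimitedClient:
    """Wrap a query function with a per-second limit and retries.

    `fetch` is any callable that performs one query; it is a placeholder
    for a real API call, which is not shown here.
    """

    def __init__(self, fetch, max_per_second=3, max_retries=3,
                 backoff=0.5, sleep=time.sleep, clock=time.monotonic):
        self.fetch = fetch
        self.min_interval = 1.0 / max_per_second
        self.max_retries = max_retries
        self.backoff = backoff
        self.sleep = sleep
        self.clock = clock
        self._next_allowed = 0.0

    def _wait_for_slot(self):
        # Delay the query until the minimum interval has passed.
        now = self.clock()
        wait = self._next_allowed - now
        if wait > 0:
            self.sleep(wait)
            now = self.clock()
        self._next_allowed = max(now, self._next_allowed) + self.min_interval

    def call(self, **params):
        # Retry timed-out queries with exponentially growing pauses.
        for attempt in range(self.max_retries + 1):
            self._wait_for_slot()
            try:
                return self.fetch(**params)
            except TimeoutError:
                if attempt == self.max_retries:
                    raise
                self.sleep(self.backoff * 2 ** attempt)


def fetch_all(client, page_size=100):
    """Download a long list page by page until an empty page is returned."""
    items = []
    offset = 0
    while True:
        page = client.call(offset=offset, count=page_size)
        if not page:
            break
        items.extend(page)
        offset += len(page)
    return items
```

Passing `sleep` and `clock` as parameters keeps the throttling logic testable without real delays. Authorization is left out because it depends on the specific service.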
The main advantage was the exhaustive coverage of the online groups of interest: even groups that were closed to non-members were scraped. After the full list of relevant online groups was obtained, the number of members in each group was determined and a number of their most recent posts were scraped. A topic modeling algorithm was then applied to the scraped posts, which helped to detect some stable topic areas produced by the explored online groups. Because the procedure is automated, the research can be repeated serially.