∗ Corresponding author
† University of Exeter, Exeter, United Kingdom
⋆ Oxford Internet Institute, University of Oxford, Oxford, UK
‡ Ewha Womans University, Seoul, South Korea
§ Alan Turing Institute, London, UK

Divided by discipline? A systematic literature review on the quantification of online sexism and misogyny using a semi-automated approach

Aditi Dutta †,∗, Susan Banducci, Chico Q. Camargo †,§,‡,⋆
Abstract

In recent years, several computational tools have been developed to detect and identify sexism, misogyny, and gender-based hate speech, especially on online platforms. Though these tools intend to draw on knowledge from both social science and computer science, little is known about the current state of research in quantifying online sexism or misogyny. Given the growing concern over discrimination against women in online spaces and the rise in interdisciplinary research on capturing the online manifestation of sexism and misogyny, a systematic literature review of research practices and their measures is timely. We make three main contributions: (i) we present a semi-automated way to narrow down the search results in the different phases of the selection stage of the PRISMA flowchart; (ii) we perform a systematic literature review of research papers that focus on the quantification and measurement of online gender-based hate speech, examining literature from computer science and the social sciences from 2012 to 2022; and (iii) we identify the opportunities and challenges for measuring gender-based online hate speech. Our findings from topic analysis suggest a disciplinary divide between the themes of research on sexism/misogyny. Through an evidence-based review, we summarise the approaches used by studies that have explored interdisciplinary methods to bridge the knowledge gap. Drawing on the existing literature on both social science theories and computational modeling, we provide an analysis of the benefits and shortcomings of the methodologies used. Lastly, we discuss the challenges and opportunities for future research dedicated to measuring online sexism and misogyny.

Keywords— systematic literature review, online sexism and misogyny, semi-automated publication analysis, applied natural language processing, scientometrics

1 Introduction and background

The growth of the Internet has been accompanied by an increase in online abuse of marginalized groups. Girls and women in particular have been the target of such hostile environments in online spaces and platforms (Jurasz and Barker, 2019). In the past few years, the disproportionate impact of online hate speech on girls and women has given rise to active interest among the research community in countering online sexism and misogyny, and to an increase in research on quantifying them using machine learning approaches. Yet these approaches differ fundamentally in what counts as sexism or misogyny in their work, and consequently in their measurement and operationalization of the construct, owing to its multiple underlying concepts. Given the differences in the study of sexism and misogyny within and beyond the disciplines, Computational Social Science (CSS) has emerged as a new interdisciplinary field which uses computational approaches to study social constructs. While these approaches show impressive performance, they fail to identify and capture all forms of sexism or misogyny, and are often prone to erroneous classifications. This calls for an investigation of the current state of research on online sexism or misogyny, and for identifying the current challenges arising from the disciplinary and methodological divide.

Online sexism or misogyny

Definition

Manne (2017) describes misogyny as “upholding the social norms of patriarchies by policing and patrolling them”, while sexism serves to “justify these norms, largely via an ideology of supposedly ‘natural’ differences between men and women concerning their talents, interests, proclivities, and appetites”. Manne (2017) further elaborates on misogyny as a property of the social system that evolved from a system of patriarchal oppression: “Misogyny is what misogyny does to some such, often so as to preempt or control the behavior of others.” In other words, misogyny targets women belonging to a specific social class, specified or unspecified (based on race, class, age, sexuality, cis/trans, etc.), and threatens hostile consequences when they violate or challenge the norms or expectations attached to membership of that class. Sexism and misogyny are central concepts for understanding the status of women, yet there is no consensus across disciplines on their definition. Wrisley (2023) argues that a working definition of misogyny is difficult not only because it is a complex concept, but also because its use has been extended beyond the original meaning.

Causes

In 2013, the World Health Organization (WHO, 2013) described violence against women as “a global health problem of epidemic proportion”, referring primarily to offline violence but warning about its impending spread on social media. Indeed, the Internet, and social media in particular, has become integral to the perpetration of sexism and misogyny, as women face various forms of violence there (Jurasz and Barker, 2019). Moreover, previous research has found that specific categories and linguistic forms (such as the generic masculine [footnote 1: A gender-biased form used to indicate those also of feminine gender, in accordance with a hierarchy favorable to men.]) play a role in promoting and reinforcing prejudices, sexist attitudes and gender stereotypes (Sensales and Areni, 2017). Such manifestations can take different forms, yet are united by a common goal of discrediting women’s participation in public life and their political voices (Jurasz and Barker, 2019). In the last few years, systemic gender inequality has manifested in cyberspace through the proliferation of abusive content that is even more aggressive, prompting more research on the characterization of this new form of online misogyny (Fontanella et al., 2024). Online platforms have thus furthered the blurring of the lines between online and offline lives (Megarry, 2014).

Impact

Even self-identifying as a woman online can increase the risk of internet harassment. When gender identity is known, gender stereotyping and gender-based discrimination from the “real world” are seen to operate freely, eventually causing “gender asymmetry” in the dynamics of online harassment (Herring, 1999). Something as minor as the perceived tone of one’s posts in a digital space can be enough to “trigger” misogynistic mockery. Speaking up against it can in turn trigger a misogynistic and sexist retort against the speaker (from both men and women), further encouraging sexism. People who purposefully cause aggressive derailing in online feminist spaces intend to disrupt the free speech of the given group (Bartow, 2009). Megarry (2014) situates the abuse women experience online in a discursive context and concludes that it aims to diminish their voices on digital platforms and to police their behavior in the public sphere. The sheer volume of gendered online abuse raises a significant social concern. While some victims were applauded when they exposed the perpetrators via ‘feminist digilantism’, this practice also risks reinscribing the view that such problems are to be solved privately by individuals rather than in the public domain (Jane, 2016). The impact of misogyny goes beyond the psychological and personal: it also has material dimensions, especially concerning the distribution of resources in society. Therefore, misogyny and gender-based violence require further contextualization of their complex relationship with online culture and technology, to shape the digital gender politics of the future (Ging and Siapera, 2018, 2019).

Given its impact, online misogyny (and sexism) can be seen as “seeking to prevent women from participating in building the forthcoming technological future” (Ging and Siapera, 2018). It is therefore necessary to curb its proliferation in online spaces in order to promote gender equality and raise awareness, and to detect it as early as possible using computational tools.

Detecting online sexism and misogyny

Research on sexism has largely been qualitative in nature, with a small number of studies employing quantitative methods (Yasseri et al., 2016), and even fewer using computational approaches to analyze the immense amount of available online data on sexism. Thus, the need arises to apply natural language processing (NLP) approaches to such data, to advance both the sociological understanding of the kinds of sexism existing in online spaces and the methodological understanding of using and improving computational models to capture them through detection and identification tasks. Language is a form of social behavior in itself, as it expresses identities and social categories (Dinan et al., 2020), which is why text analysis has become one of the established methods for mapping and analyzing hostility in online discourses, particularly online gendered hate speech (Jane, 2016). Yet most earlier works have neglected or retrofitted the link between the data and sexism as a theoretical construct (Samory et al., 2021). Sexism and misogyny have primarily been researched as part of the broader hate-speech literature, disregarding forms of sexism ‘not involving hate’ (Parikh et al., 2021) and other non-hostile forms that are subtle and often deceptive (Jha and Mamidi, 2017). While online communities now emphasize the detection of sexism [or misogyny] (and other hate speech) more than ever before, automatic detection remains challenging, as most research focuses on using textual features to solve the issue (Das et al., 2023). Given the need to investigate the multidisciplinary approaches developed over the years, Fontanella et al. (2024) perform a systematic literature review on the study of misogyny using computational methods, where they find a “limited connection between the areas of knowledge that are necessary to fully grasp this complex phenomenon”.
Through our research, we extend the review to articles beyond misogyny, with an extensive discussion of the identified practices and of their challenges and limitations in implementation, backed by social science literature.

Current study

The aim of this paper is to examine the academic literature quantifying sexism and misogyny, covering the different aspects of this work across two broad categories of research fields: Social Science (SS) and Computer Science (CS). In this Systematic Literature Review (SLR), we use a semi-automated approach to perform data screening and quality assessment, and then use the final selection of research studies to report the computational tools that have been developed or used to quantify sexism and misogyny on social media platforms or in other online spaces from 2012 to 2022. We use ‘quantification’ to refer to methods for the identification, classification, or detection of sexism and misogyny. The intention is to identify the challenges and limitations of current practices, discuss the disciplinary divide in the research, and indicate directions for future research on this topic. Most of the research on sexism and misogyny uses the terms interchangeably, since the theories they stem from treat misogyny as an extreme form of sexist ideology.

2 Research questions

The challenge of studying sexism and misogyny from a quantitative perspective, given their highly subjective nature, motivates our research questions:

  RQ1: What are the main topics in the studies identified, and how do they differ by discipline and over time?

  RQ2: How has the existing literature operationalised sexism and misogyny?

  RQ3: What are the main challenges and opportunities of computational approaches to the study of sexism and misogyny? Which of the challenges do they address?

The main objective of this paper is to provide a comprehensive systematic literature review, drawn from the research landscape of sexism and misogyny over the years 2012-2022. The aim is not to focus on specifics from any individual paper but to provide a general overview of the existing literature and draw conclusions from the study designs and research outputs. These observations are intended to inform researchers about best working practices and approaches, while also contributing to future research objectives.

Our systematic literature review is divided into two stages: (a) identifying the relevant studies through a multi-step, semi-automated selection process, illustrated in the PRISMA flowchart (Figure 3.1) in Section 3; and (b) conducting an in-depth analysis of the selected studies in Section 4. Stage (a) is expected to answer the first research question, while stage (b) answers the second and third research questions.

3 Identifying relevant studies

3.1 Search strategy

We searched six databases (Google Scholar, ArXiv, Elsevier, Scopus, Semantic Scholar, and Web of Science) using a closely related set of keywords that operationalized our review criterion of ‘quantifying’ sexism and misogyny. This returned enough results to support the quantitative analysis. The search was restricted to works published between 2012 and 2022. All articles had to be in English and to include their full title and abstract. The reporting strategy follows PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), which uses a checklist approach to systematic literature reviews, and is presented in Figure 3.1 [footnote 2: “The flow diagram depicts the flow of information through the different phases of a systematic review. It maps out the number of records identified, included, and excluded, and the reasons for exclusions. Different templates are available depending on the type of review (new or updated) and sources used to identify studies.” (Takkouche and Norman, 2011)].

Figure 3.1: PRISMA flowchart diagram (Takkouche and Norman, 2011) for this research. Each step shows the number of studies included and eliminated at that point of the research.

This research was conducted to review papers with three main characteristics, namely:

  1. The papers study sexism and/or misogyny.

  2. The papers ideally study the propagation in social media platforms or other broadcast (preferably text-based) media.

  3. The papers use various methodologies for measuring or quantifying sexism and misogyny (e.g., scales, models, etc.).

ArXiv and Web of Science were chosen to collect studies from the fields of CS and SS, using the search criteria as shown in Table B.

This yielded a total of 1511 results in Web of Science for CS and SS, and 234 results in ArXiv. We also included 71 records from external sources in the first stage. The data collection method is discussed in detail in the next section.
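As a quick check of the identification stage, the counts reported above can be tallied as follows. This is only a sketch: the dictionary keys are illustrative labels, and the total assumes the three figures are separate tallies taken before any de-duplication.

```python
# Records identified per source, as reported in the text.
# The key names are illustrative labels, not identifiers from the study.
identified = {
    "web_of_science": 1511,    # CS and SS combined
    "arxiv": 234,
    "external_sources": 71,
}

# Records entering the first stage, assuming no duplicates have been removed yet.
total_identified = sum(identified.values())
print(total_identified)
```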

3.2 Data search and collection

In this section, we elaborate on the experiments conducted with each of the citation databases, and the advantages and disadvantages encountered during the study. Four fields of each search result were integral to this research: the title, the abstract, the year of publication, and the discipline of research. To perform the automated step of narrowing down our search results, some measures were taken to check the consistency and reliability of the data, which are shown in Table B.

ArXiv is a platform that allows researchers to e-publish a draft version of their work before formal peer review and publication in a peer-reviewed scholarly or scientific journal; such drafts are also referred to as ‘pre-prints’ [footnote 3: A preprint is a full draft of a research paper that is shared publicly before it has been peer-reviewed. Most preprints are given a digital object identifier (DOI) so they can be cited in other research papers.] (Mudrak, 2018). Due to the popularity of ArXiv among CS researchers, its API was used with the expectation of returning unpublished or pre-published works for all disciplines. However, coverage was found to vary widely even across areas of CS: “In theoretical computer science and machine learning, over 60% of published papers are on ArXiv, while other areas are essentially zero” (Sutton and Gong, 2017). We opted to use advanced search queries to narrow down the results, as simpler queries were expected to return more irrelevant results that would have to be removed before analysis. Though the API returned only a limited number of papers, most of them were found to be relevant. Hence, we included them in the analysis but did not use ArXiv as our only source, owing to its skewed disciplinary coverage.

Both Web of Science and Scopus retrieved studies from CS. Zhang (2014) found that Scopus retrieved “significantly” more studies in CS than Web of Science across all document types (conference proceedings, journal articles, reviews, and editorials); for our search, however, more relevant works were found in Web of Science. As Fiala and Tutoky (2017) note, CS relies more heavily on conference proceedings than other disciplines. To some extent, these conference papers are also indexed in Web of Science, in the Conference Proceedings Citation Index, which makes it possible to carry out scientometric studies of CS based on data from Web of Science (Fiala and Tutoky, 2017).

For Google Scholar, we tested two external tools to collect the search results: SerpAPI (SerpAPI, 2019) for scraping the data, and the software ‘Publish or Perish’ (Harzing, 2007). Both methods were rejected because of their disadvantages. Publish or Perish could only extract 1000 results at a time for each search query. While this drawback was overcome by searching over shorter ranges of years to stay within the limit, the output lacked some of the fields needed for this study, namely the abstract and the discipline. SerpAPI, in turn, works like a web scraping tool and can only scrape the results as the search engine displays them, i.e., it only scrapes what Google shows on its Google Scholar pages, nothing more. Even though the fields obtained through this API were relevant, they did not contain the full information we needed for the analysis. For example, the full text of the title and abstract was missing, with the truncation indicated by ellipses at the beginning and end of the text. For the remaining citation databases tested, Elsevier and Semantic Scholar, the possible search queries were either too simple (consequently returning many irrelevant studies), did not return enough studies on our topic, or lacked some of the essential fields (e.g., the abstract) that were integral to this study, especially for the automated search strategy used to eliminate non-relevant studies.
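The workaround for the 1000-result cap, querying shorter ranges of years, can be sketched as a recursive split of the 2012-2022 window. This is an illustrative sketch only: `split_year_range`, `count_results`, and the per-year hit counts are hypothetical and invented for this example; they are not part of Publish or Perish or of the study's tooling.

```python
from typing import Callable, List, Tuple


def split_year_range(
    start: int,
    end: int,
    count_results: Callable[[int, int], int],
    limit: int = 1000,
) -> List[Tuple[int, int]]:
    """Split [start, end] into year windows whose result counts fit the cap."""
    if start > end:
        return []
    # Accept a window if it fits under the cap, or if it is a single year
    # and therefore cannot be split any further.
    if count_results(start, end) <= limit or start == end:
        return [(start, end)]
    mid = (start + end) // 2
    return (
        split_year_range(start, mid, count_results, limit)
        + split_year_range(mid + 1, end, count_results, limit)
    )


# Hypothetical per-year hit counts, invented purely for illustration.
hits_per_year = {year: 150 + 40 * (year - 2012) for year in range(2012, 2023)}


def count_results(start: int, end: int) -> int:
    """Stand-in for asking the search tool how many hits a year window returns."""
    return sum(hits_per_year[y] for y in range(start, end + 1))


windows = split_year_range(2012, 2022, count_results)
for first, last in windows:
    print(first, last, count_results(first, last))
```

Each window can then be queried separately and the results concatenated; a single year that still exceeds the cap would need a narrower query rather than a narrower date range.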

Our experiments therefore indicated that the results obtained from ArXiv and Web of Science were the most suitable for our work. Alongside the search queries, we augmented the dataset with manually added papers that satisfied the selection criteria (A.2). This data from external sources included studies shared on the social platforms Twitter (or X) and LinkedIn, recommendations from other researchers in the field, and works found by following the references of the reviewed papers (i.e., citation tracking).

3.2.1 Final methodology selection criteria

After weighing the pros and cons of all the citation databases, we decided to use the Web of Science API as the primary data source, collecting data separately for each discipline (SS and CS). Since many of the relevant computational papers were published on ArXiv within the given period, those papers were also included in the data collection. This was done to ensure full coverage of both published and unpublished works (pre-prints) relevant to the study of sexism and misogyny during the 11 years. As discussed before, we also included publications identified through external sources. While Web of Science was taken as the main source for published works, ArXiv was taken as a source for unpublished works. We then combined the selected search results for the analyses in Section 3.3, before removing the duplicates.

3.3 Data extraction and synthesis

In this section, we first provide an overview of the collected data from the previous Section 3.2.1, and then use automated approaches for the data extraction stage. The analyses are performed before the application of the selection criteria A.2. For each of the following subsections, the fields considered were:

  • Title of the paper

  • Abstract (Multiple abstracts of the same paper were replaced with the first abstract)

  • Year of publication (or pre-printing)

  • Language of the paper

Figure 3.2: Number of publications per year. The blue bars reflect the research articles on Computer Science, while the yellow bars reflect the research articles on Social Science, between the years of 2012-2022.

Figure 3.2 shows a steep rise in the publication of research on sexism and misogyny in both CS and SS. While SS studies consistently dominated research on the topic, CS output also grew markedly, with a large share of its papers published in 2022 alone.

As discussed in Section 3.2, there has been a rising trend of pre-prints in CS (G.1), many of which were later published and indexed in citation databases. Studies researching social media platforms like Facebook, Twitter, and Instagram were limited, with fewer than 100 works dedicated to sexism and misogyny on these online platforms. Almost all of the returned works were published in English; among the other languages, Spanish and Portuguese followed, though by a wide margin.

The text was pre-processed to drop duplicates and to remove characters that could hinder the automated selection of studies based on their titles and abstracts. Studies with no abstract at this stage were removed, as they could not be passed through the automated selection. Since there were only 13 such papers, their abstracts were looked up in Google Scholar and checked manually against the selection criteria for this research.
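A minimal sketch of this pre-processing step is given below. The record layout, the helper names, and the sample titles are all assumptions made for illustration; in particular, matching duplicates on a normalised title and keeping the first non-empty abstract is one plausible reading of the procedure, not a description of the exact implementation used.

```python
import re
from typing import Dict, List, Tuple

Record = Dict[str, str]


def normalise_title(title: str) -> str:
    """Lower-case a title and collapse punctuation and spacing for matching."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def clean_text(text: str) -> str:
    """Replace control characters with spaces and collapse runs of whitespace."""
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (records ready for automated screening, records lacking an abstract)."""
    seen: Dict[str, Record] = {}
    for rec in records:
        key = normalise_title(rec["title"])
        abstract = clean_text(rec.get("abstract", ""))
        if key not in seen:
            seen[key] = {"title": clean_text(rec["title"]), "abstract": abstract}
        elif not seen[key]["abstract"]:
            # Assumption: keep the first non-empty abstract seen for a paper.
            seen[key]["abstract"] = abstract
    with_abstract = [r for r in seen.values() if r["abstract"]]
    without_abstract = [r for r in seen.values() if not r["abstract"]]
    return with_abstract, without_abstract


# Invented sample records, for illustration only.
records = [
    {"title": "Detecting Misogyny on Twitter", "abstract": "We train a classifier."},
    {"title": "Detecting misogyny on  Twitter.", "abstract": "A later abstract."},
    {"title": "Sexism in Online Forums", "abstract": ""},
]
screened, missing = preprocess(records)
print(len(screened), len(missing))
```

Records returned in the second list correspond to the papers whose abstracts would be looked up and checked by hand.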

For the automated extraction stage, we perform two steps in sequence to narrow down our search: topic modeling, followed by a keyword co-occurrence network analysis.

3.3.1 Topic modeling

Topic modeling [footnote 4: “Topic modeling is a machine learning technique that automatically analyzes text data to determine cluster words for a set of documents. It leverages ‘unsupervised’ machine learning to analyze and identify clusters or groups of similar words within a body of text” (Pykes, 2023).] was used with the pre-processed data containing the abstracts and titles from both disciplines, to generate clusters of topics based on the documents (i.e., the collection of studies containing abstracts and topics). Among all the topic modeling techniques experimented with, BERTopic (Grootendorst, 2022) proved to be the best choice for the task. This is because BERTopic “leverages transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions” (Grootendorst, 2022), hence enhancing the topic recognition ability of the model. [footnote 5: More information on the techniques used in this methodology is given in the supplementary section J.1.]
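To illustrate the c-TF-IDF idea mentioned in the quotation, the sketch below scores terms per cluster using the class-based weighting described by Grootendorst (2022): term frequency within a cluster multiplied by log(1 + A / f_t), where A is the average number of words per cluster and f_t is the frequency of the term across all clusters. It is a stand-alone approximation, not BERTopic's own code: the clusters are hand-made rather than produced by HDBSCAN, tokenisation is a plain whitespace split, and the example texts are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(clusters: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score each term per cluster as tf(t, c) * log(1 + A / f(t))."""
    # Treat all documents of a cluster as one long document.
    term_counts = {
        name: Counter(" ".join(docs).lower().split())
        for name, docs in clusters.items()
    }
    # f(t): frequency of term t across all clusters.
    total: Counter = Counter()
    for counts in term_counts.values():
        total.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(term_counts)
    return {
        name: {t: tf * math.log(1 + avg_words / total[t]) for t, tf in counts.items()}
        for name, counts in term_counts.items()
    }


def top_terms(scores: Dict[str, float], n: int = 3) -> List[str]:
    """Return the n highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


# Hand-made toy clusters, invented for illustration.
clusters = {
    "detection": ["hate speech detection model", "misogyny detection on twitter"],
    "attitudes": ["ambivalent sexism survey", "sexism attitudes survey of students"],
}
scores = c_tf_idf(clusters)
for name in clusters:
    print(name, top_terms(scores[name], 2))
```

Terms that are frequent inside one cluster but rare elsewhere receive the highest scores, which is what makes the resulting topic descriptions readable.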

We applied the BERTopic algorithm to the collections of CS and SS papers separately, to capture the topics of research in each discipline and to check for differences in the themes of sexism between them. After experimenting with different parameter ranges to obtain the most representative models, we further fine-tuned the model using multiple topic representations, ranging from keywords and keyphrases to summaries and custom labels. Figures 3.3 and 3.4 show the topics recognized by the model. BERTopic groups documents into topic clusters, identified by their keywords and keyphrases. Because it uses clustering to define topics, it does not assign more than one topic to each document. In the figures, each point corresponds to one document in the respective discipline. BERTopic uses HDBSCAN by default for clustering, which does not force every data point into one of the recognized clusters or topics, and UMAP for dimensionality reduction. We further customized UMAP by setting the parameter ‘n_components’ to 2, to ‘pre-reduce’ the embeddings for visually depicting our model results in the two figures. Documents that do not belong to any cluster (also termed “outliers”) are marked in grey. The colored points indicate topics: each color represents a unique topic for the set of documents, and each topic is labeled in a box of the same color. The algorithm exhibits strong local clustering, grouping similar topic categories together; we also tuned the balance between local and global structure to distinguish efficiently between topics.

A lighter hue of the same color encircles each topic to indicate the cluster belonging to it.

While some points in the same cluster may appear further apart than points from another cluster, this is an effect of the projection into two-dimensional space, which we used for better visualization; the points within the same cluster are closer together in the multi-dimensional space.
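The effect of projection on apparent distances can be seen in a toy example. The sketch below is only an analogy: it uses a plain coordinate drop from 3D to 2D as a stand-in for UMAP, which is a non-linear method, and the three points are invented.

```python
import math
from typing import Sequence, Tuple


def dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def project_2d(p: Sequence[float]) -> Tuple[float, ...]:
    """Keep only the first two coordinates (a crude linear stand-in for UMAP)."""
    return tuple(p[:2])


# Invented points: a and b share a cluster, c belongs to another cluster.
a = (0.0, 0.0, 0.0)
b = (1.0, 0.0, 0.0)
c = (0.5, 0.0, 5.0)

# In the original space, a is much closer to b than to c.
print(dist(a, b), dist(a, c))

# After projection, c appears closer to a than b does.
print(dist(project_2d(a), project_2d(b)), dist(project_2d(a), project_2d(c)))
```

Cluster membership is therefore better read from the colors in the figures than from the apparent distances between points.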

Figure 3.3: UMAP scatterplot in which each point represents one document. Each color represents a different topic in computer science research on sexism and misogyny between 2012 and 2022. Through topic modeling, each document is usually assigned a set of keywords as the themes of the paper; documents sharing similar sets of keywords are grouped into a topic and shown in one color. Each topic is labeled with its name in the same color. The grey points represent outliers (documents that were not assigned any topic). The highlighted topic name indicates greater relevance to our research objectives.
Figure 3.4: As in Figure 3.3, a UMAP scatterplot in which each color represents a different topic in social science research on sexism and misogyny between 2012 and 2022. The highlighted topic indicates greater relevance to our research objectives.

In the figures, the highlighted topics depict the most relevant results for our study. We see that while SS studies show more diversity in their topical approach than CS studies, they still lack research on the automated detection or identification of online sexism and misogyny (which is the focus of this research). Since online sexism and misogyny form part of online hate speech, we use the CS topic that specifically addresses its quantification, rather than the topics analyzing gender bias in its various forms. To quantify hate speech, we require research works focusing on detection tasks. Hence, at this stage, the topic we chose out of the two disciplines and the various categories is ‘Hate Speech Detection in Social Media’. In the following step, we further demonstrate the relevance of this topic by performing a keyword co-occurrence network analysis.

3.3.2 Keyword co-occurrence network

To validate whether the topics selected automatically from each discipline in the previous section were representative of the corresponding documents, we examined the disciplines and each topic alongside their respective keywords. To obtain the most frequent keywords in the set of documents, we used KeyBERT [footnote 6: KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and key phrases that are most similar to a document.] (Grootendorst, 2020), which extracts embeddings with a BERT model to obtain a document-level representation of our abstracts and titles. From each document, we used KeyBERT to identify keyphrases, which provide a more accurate summary of the documents than single keywords. KeyBERT works by creating an embedding of the document text, and then BERT embeddings of candidate keyphrases within a pre-defined word n-gram range of 1-2 words [footnote 7: The word n-gram range lets users decide the length of the sequences of consecutive words that should be extracted from a given text.]. Cosine similarities between the document embedding and each keyphrase embedding are then calculated to extract the top 10 keyphrases that best describe that document. These selected keyphrases per document are then compared against the whole set of documents. We chose to examine the 100 most common keywords at both the discipline level (CS and SS) and the topic level (for each topic generated in Section 3.3.1). This was done to check the relevance of the keywords, and consequently of the set of documents that would best represent our research objective of performing a literature review on the quantification of online sexism and misogyny.
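The keyphrase ranking described above can be sketched without any external library. Note the loud assumption: the sketch replaces BERT embeddings with simple bag-of-words count vectors, so it reproduces only the mechanics (candidate 1-2 word n-grams, cosine similarity to the document, top-N selection) and not KeyBERT's semantic behaviour. All function names and the sample text are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokens(text: str) -> List[str]:
    """Lower-case a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidates(text: str, ngram_range: Tuple[int, int] = (1, 2)) -> List[str]:
    """List the unique word n-grams of the text, in order of first appearance."""
    words = tokens(text)
    seen: List[str] = []
    for n in range(ngram_range[0], ngram_range[1] + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase not in seen:
                seen.append(phrase)
    return seen


def embed(text: str) -> Counter:
    """Bag-of-words count vector (an assumed stand-in for a BERT embedding)."""
    return Counter(tokens(text))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def top_keyphrases(doc: str, top_n: int = 10) -> List[str]:
    """Rank candidate keyphrases by similarity to the whole document."""
    doc_vec = embed(doc)
    ranked = sorted(candidates(doc), key=lambda c: (-cosine(doc_vec, embed(c)), c))
    return ranked[:top_n]


# Invented title, for illustration only.
doc = "Misogyny detection on Twitter: a deep learning approach to misogyny detection"
print(top_keyphrases(doc, top_n=3))
```

With real sentence embeddings, stop words and function-word bigrams would rank much lower than they do under this count-based stand-in.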

Figure 3.5: Most frequent keywords gathered from the abstracts and titles of Computer Science studies in the topic of ‘Hate Speech Detection using Deep Learning models’

On comparing the keywords present in Computer Science (see Figure I.1) and Social Science (see Figure I.2), we found that the contents of the papers (based on their titles and abstracts) focused on different kinds of sexism and misogyny, indicating similar topics but contributions in different capacities. Social science focuses on multiple aspects of sexism and misogyny, studied using both qualitative and quantitative methodologies, while computer science mostly focuses on task types (such as prediction, detection, and identification) and on setting analytical goals. On further analysis, we selected the most relevant topics among all of the highlighted topics from both disciplines and performed a keyword search on each of them. Figure 3.5 corresponds to the topic ‘Hate Speech Detection in Social Media’, which showed the most promising result in containing the keywords needed for this study.

The figure consists of the 100 most frequent keyphrases in the topic. The size of each circle (node) indicates the weight of that particular keyphrase in the set of documents. The color of each node can be any color recognized by Matplotlib in the specified colormap and may be randomly generated. The width of the edge joining two nodes indicates the number of associations between them: the more connections, the thicker the edge. The co-occurrence networks for all the highlighted topics from the previous step showed us the most used keywords pertaining to those topics. By studying the generated keywords, we could capture the general objectives of the papers under the same topic, as stated in their abstracts. We then selected the most relevant topic for our work, which focuses on computational approaches to detecting online sexism and misogyny (a part of the online hate speech discourse). Consequently, we decided to perform the full-text literature review on the articles that fell under the topic of ‘Hate Speech Detection in Social Media’.
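
As a rough illustration of how the node and edge weights of such a co-occurrence network can be derived, the sketch below counts keyphrases and keyphrase pairs across documents. It assumes that a node's weight is the number of documents containing the keyphrase and that an edge's weight is the number of documents containing both keyphrases; the paper does not spell out its exact weighting, and the keyphrase lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(keyphrases_per_doc):
    """Return (node_weights, edge_weights) for a keyphrase co-occurrence network.

    node_weights[k]      = number of documents in which keyphrase k appears
    edge_weights[(a, b)] = number of documents in which a and b appear together
    """
    node_weights = Counter()
    edge_weights = Counter()
    for phrases in keyphrases_per_doc:
        unique = sorted(set(phrases))                 # count each phrase once per document
        node_weights.update(unique)
        edge_weights.update(combinations(unique, 2))  # sorted pairs give stable edge keys
    return node_weights, edge_weights


# Hypothetical keyphrases for three documents (illustrative only).
docs = [
    ["hate speech", "misogyny", "twitter"],
    ["hate speech", "deep learning"],
    ["hate speech", "misogyny"],
]
nodes, edges = cooccurrence_network(docs)
```

The two counters can then be passed to a plotting library, with node size driven by `nodes` and edge width by `edges`.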

On completing this step, we included a few more of the studies that had been identified through citation tracking and did a full-text screening of all the collected texts.

Note: As the plots show, all of the analyses up to this point relied largely on the information provided in the titles or abstracts of the papers. This limits our ability to provide a precise assessment of the exact models, methodologies, or datasets used by the corresponding papers, for which a full-text assessment is needed.

3.4 Final data selection

3.4.1 Screening

Following the selection of the citation database, the automatic filtering of papers using BERTopic, and the validation via topic keywords, we identified the necessary characteristics (3.1) of the data to emphasize the findings. This eventually led us to select the research articles through manual screening, based on the selection criteria (A.2). Eligible articles were divided into two categories at this stage: the first comprises the records acquired from the automated stage, and the second comprises the records identified through citation tracking.

Finally, we thoroughly reviewed all publications related to the quantification of sexism and misogyny in online social platforms to determine their focus and methodologies, by reading their full text.

3.4.2 Qualitative assessment of the selected studies

A total of 96 full-text articles were analyzed qualitatively, as shown in Figure 3.1. We assessed them based on four criteria, namely:

  (i) Irrelevant study focus - Whether they are focused on studying the propagation of sexism and misogyny (irrespective of whether they indicated such in the abstract). Many of the works focused on hate speech in general, but because we wanted to review only sexism and misogyny, they were eliminated.

  (ii) Irrelevant study designs - Whether the intended outcome of the research was not about performing a computational analysis on the detection or identification of online sexism and misogyny.

  (iii) Studies not quantifying sexism/misogyny - Whether the paper focused on a review of studies in the relevant topic, or contained a summary of the author’s thoughts from multiple papers, such as opinion pieces around the same topic.

  (iv) Sources could not be traced back - Whether the paper’s paraphrased contents with citations were not reflective of the summary the original authors indicated in their study.

Of the 96 research articles, 45 passed these exclusion criteria and were therefore included in the final qualitative analysis.

The full text of all the papers was reviewed qualitatively and information about each was added to a summary table covering the following points:

  • Forms of hate speech studied, since hate speech can encompass many things, including sexism and misogyny.

  • Definitions of sexism or misogyny (or both) used in the study.

  • Language(/s) of the data used for the study.

  • Data selection criteria. This could depend on the original data collection method, such as - using keywords, hashtags, public profiles by monitoring user’s online activity, users identified as sexist/misogynist, tags of sexism, specific phrases; or even based on a particular timeline of interest.

  • Datasets used and their types (external, API-generated, etc.)

  • Dataset modifications, if done. This could be in the form of data augmentation, counterfactual examples, document expansion by adding semantically similar words, transliterating multilingual dataset to uniform language, and many more.

  • Broadcast media or social media platform which is of interest for the study.

  • Annotators used in the study, and their tasks. If each or group of annotators had different tasks, that was also recorded.

  • Related to the previous point, the Kappa values (statistical measures of inter-rater reliability) are also noted.

  • Research bias addressed or acknowledged in the study. If acknowledged, it is posted as a limitation in the paper.

  • Pre-processing or post-processing done on the data.

  • Performance metric used.

  • Embedding type used, since this could range from word-level to node-based.

  • Classification or clustering type, and the respective models.

  • Syntactic, linguistic, and semantic/lexical features.

  • Prompt topics and intersectionality (if present).
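
For readers less familiar with the Kappa values recorded in the summary table, the sketch below computes Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. Cohen's kappa is chosen here as an assumption; the reviewed studies may report other variants (for example, for more than two annotators), and the labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of ten posts by two annotators.
annotator_1 = ["sexist", "sexist", "not", "not", "sexist", "not", "not", "not", "sexist", "not"]
annotator_2 = ["sexist", "not", "not", "not", "sexist", "not", "sexist", "not", "sexist", "not"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 8 of 10 items (p_o = 0.8) while chance agreement is p_e = 0.52, so kappa is 0.28 / 0.48, roughly 0.58.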

Recently, several SLR tools have incorporated semi-automation using Artificial Intelligence techniques to support the screening and extraction (pre-screening) phases (Bolanos et al., 2024), as we did in our research. Of these tools, only a few use topic modeling. For example, RobotAnalyst (https://www.nactem.ac.uk/robotanalyst/) and SWIFT-Review (https://www.sciome.com/swift-review/) use Latent Dirichlet Allocation (LDA), a generative probabilistic model that assigns a topic to a paper based on the most recurrent terms it shares with other papers, while Iris.ai (https://iris.ai/) clusters the papers according to a two-level taxonomy of global topics and specific topics (Bolanos et al., 2024). The former two tools depend on term frequencies, while the latter performs Named Entity Recognition (NER) and allows users to customize entity extraction by defining their own set of categories beforehand. Despite its superior language capabilities, which make it one of the most advanced techniques in topic modeling today (Briggs, 2023), BERTopic has remained unexplored for the same task. In our work, we use that potential, alongside the promising information extraction capabilities of Large Language Models (LLMs), to cluster our topics before validating the results with network analysis and selecting the topic(s) best suited to our work. This proved particularly useful in the screening and qualitative assessment phases, as empirical analysis of the generated topics against their corresponding papers showed that the approach accurately clustered similar papers together.

4 Results of the Systematic Literature Review

4.1 Data statistics

After data screening and qualitative assessment, we narrowed the number of manuscripts down to the 45 that satisfied the scope of our meta-analysis. In the first subsection, we provide a brief overview of the key statistics of the 45 papers. In the following subsections, we provide an overview of the existing computational approaches dedicated to quantifying sexism and misogyny. Beyond that, we discuss the challenges and limitations faced by these approaches in the existing literature.

4.1.1 Author collaboration network

We provide an author collaboration network in Figure 4.1, where the names of the researchers are nodes, with their size and color indicating the number of relevant manuscripts they authored or co-authored. The connections between the authors indicate co-authorship on manuscripts, and the edge weights indicate the frequency of co-authorship.

Figure 4.1: Network connection of all the author collaborations between the 45 research articles screened for our literature review.

4.1.2 Characteristics of study or research designs

Table 4.1 gives a summary of all the general characteristics we found in the full-text reviews of the selected studies. It lists the most used categories of each field (design/methodology) in the documents of our literature review. The categories not featured in the table were mostly used by only one document. Each document could have multiple categories under the same field of design or methodology. For example, one document could study datasets in multiple languages on different platforms of interest and use multiple models, with different levels of classification at different stages. Fields whose categories are unique to a document are marked with an asterisk (*), while fields whose entire list of categories appears in the table are marked with an obelisk (†). Here, the asterisk (*) indicates not only that a document can fall under only one category of the field, but also that the counts across the categories of that field should add up to the total number of documents listed.

Characteristics of Study or Research Designs

| Characteristics | Count | Characteristics | Count |
|---|---|---|---|
| Benchmark Datasets used | | Languages of datasets | |
| Waseem and Hovy (2016); Waseem (2016) | 14 | English | 36 |
| Fersini et al. (2018); Basile et al. (2020) | 13 | Spanish | 7 |
| Basile et al. (2019) | 5 | Italian | 6 |
| Bates (2015) | 6 | South Asian languages (e.g., Bangla, Hindi) | 6 |
| Rodríguez-Sánchez et al. (2021) | 4 | Other European languages | 2 |
| Fersini et al. (2018) | 3 | Paradigms (see note below) | |
| Machine Learning models | | Perspective | 37 |
| Support Vector Machine (SVM) | 16 | Descriptive | 5 |
| Bidirectional Encoder Representations from Transformers (BERT) | 15 | Unsupervised (as per model features) | 3 |
| Long Short-Term Memory (LSTM) | 15 | Evaluation type | |
| Logistic Regression (LR) | 10 | Binary Classification | 32 |
| Convolutional Neural Networks (CNN) | 9 | Multi-class Classification | 22 |
| Naive Bayes (NB) | 6 | Multi-label Classification | 4 |
| Random Forest (RF) | 6 | Cluster Analysis | 1 |
| Decision Tree (DT) | 4 | Results per class | 1 |
| Multilayer Perceptron (MLP) | 4 | Annotator types | |
| XGBoost | 3 | External dataset | 24 |
| fastText | 2 | Experts | 5 |
| Platform of interest | | Authors | 4 |
| Twitter (or X) | 28 | Amateurs/Crowdsourced external annotators | 3 |
| Sexism reported online (from Everyday Sexism Project) (Bates, 2015) | 3 | Students of linguistic, communication and gender | 3 |
| Facebook | 2 | Machine learning models | 3 |
| Reddit | 2 | Social Scientists | 2 |
| Gab | 2 | Annotator character not stated | 2 |

Table 4.1: Summary table of some of the topmost categories of designs/methodologies in all the observed characteristics across the selected studies. * indicates unique features of a document, i.e., a document can only have either of the categories in the field. † indicates that the categories listed below are a part of the exhaustive list for that particular field. Note on Paradigms: more on the paradigms (Röttger et al., 2022) in the Supplementary Section C.1, in the table of terminologies.

4.1.3 Overview of the general methodologies

Table 4.1 shows the frequency of each source of online data and of the machine learning models used in the 45 manuscripts we reviewed. What the table does not show is the type of sexism and misogyny classification used in these studies, or how it links the sources of online data to the computational methodologies. This is an important source of information for indicating the multi-level connections between the variables, and consequently their impact on the quantification of sexism and misogyny. Figure 4.2 shows the nodes at each level of information (source of online data, classification type, and model used), while the links connect the levels, with their weights indicating the frequency of connections between nodes at different levels. This is a many-to-many mapping between the three levels and shows the flow of information between them. The colors of the nodes at different levels represent the unique relation between the linked nodes at each level (e.g., Twitter [level 1] => Misogyny (5 categories) [level 2] is different from Twitter [level 1] => Misogyny (binary) [level 2], and Reddit [level 1] => Misogyny (binary) [level 2] is different from Reddit [level 1] => Abuse/Aggression [level 2]). The abbreviations for the models in level 3 are found in Supplementary Section H. Overall, Figure 4.2 gives clear evidence that Twitter was the most explored online data source for investigating different forms of sexism and misogyny.

Figure 4.2: Sankey diagram of the links between the categories of online data, classification type and computational models. A Sankey diagram visualizes the flow between various points in a system. In our system, we show the flow (i.e., count of associations) between [Sources of online data] => [Sexism/misogyny classification type] => [Model used].
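
The link weights of such a three-level diagram are simple association counts. The sketch below shows one way to tally them from per-study records; the record structure, field names, and example studies are hypothetical and are not the coding scheme used for the reviewed papers. Keeping the data source in the level 2 to level 3 key mirrors the idea that the same classification type reached from different sources is treated as a distinct flow.

```python
from collections import Counter

# Hypothetical study records (illustrative only, not the reviewed papers).
studies = [
    {"source": "Twitter", "task": "Misogyny (binary)", "models": ["SVM", "BERT"]},
    {"source": "Twitter", "task": "Misogyny (5 categories)", "models": ["SVM"]},
    {"source": "Reddit", "task": "Misogyny (binary)", "models": ["BERT"]},
]


def sankey_flows(records):
    """Count links for source => task (levels 1-2) and (source, task) => model (levels 2-3)."""
    source_to_task = Counter()
    task_to_model = Counter()
    for rec in records:
        source_to_task[(rec["source"], rec["task"])] += 1
        for model in rec["models"]:
            # Keep the source in the key so that flows from different sources stay distinct.
            task_to_model[(rec["source"], rec["task"], model)] += 1
    return source_to_task, task_to_model


level_1_2, level_2_3 = sankey_flows(studies)
```

Because a study can use several models, the level 2 to level 3 counts can exceed the number of studies, which is what makes the mapping many-to-many.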

4.2 Overview of the existing computational approaches

4.2.1 General definitions and strategies used

Most computational works on sexism and misogyny treated the automated identification problem as a binary classification task, i.e., deciding whether the text in question is sexist/misogynistic or not. For defining the terms, researchers used different (non-standardized) forms because of computational benefits, such as model performance. Though these proved effective in some instances, they also came with limitations of their own. For instance, Grosz and Conde-Cespedes (2020) define sexism as the prejudicial and discriminatory nature of sexist behavior pervading the social context, especially towards women. Using theoretical concepts, along with the typology of abuse presented in earlier research, Guest et al. (2021) define misogyny as content “directed abuse at women or a closely-related gendered group (e.g., feminists).” In contrast, Lynn et al. (2019) define misogyny as a hate crime, a result of a “cultural attitude of hatred for females because they are female”, and present two bidirectional DL models (Bi-LSTM and Bi-GRU) with dropout layers, which performed well in sensitivity and accuracy even with a slightly imbalanced dataset. A multi-agent classification approach can further enhance a model’s performance when built using sentence embedding techniques and TF-IDF, enriched with misogyny lexicons (Attanasio and Pastor, 2020). Plaza-del Arco et al. (2021) proposed a Multi-Task Learning (MTL) system with a hard parameter sharing approach (sharing the hidden layers between all tasks, while keeping several task-specific output layers) using BERT-based models. Utilizing the knowledge transferred from multiple other tasks related to sexism identification, such as polarity classification, emotion classification, and offensive language detection, helped in the identification, both in binary and multiple categories. Though emotion was not helpful in categorizing, MTL shows promising generalization to the original task. Frenda et al. (2019) exploit stylistic, semantic and topic information about misogynistic speech to identify misogyny and classify it into different categories. For gathering linguistic features, they propose an approach based on stylistic features captured by means of character n-grams, on sentiment information, and on a set of lexicons built by examining the misogynistic tweets from the training data provided by the organizers. Each text was represented by a vector composed of all specific topic features (set of lexicons), weighted with Information Gain, and character n-grams, weighted with the TF-IDF measure. This set of features was tested with a Support Vector Machine (SVM) algorithm and an ensemble technique, reaching promising results. Canós (2018) worked on the same data and task, experimenting with SVM alongside TF-IDF with one-vs-one and one-vs-rest classifier approaches, where the latter proved better for English, presumably because of the larger vocabulary. On the other hand, Nozza et al. (2019) defined several templates to create a balanced synthetic dataset for their proposed DL model, Universal Sentence Encoder (USE), which further debiased their model features to be less sensitive to identity terms and yet obtain a better categorization.
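
To make the notion of character n-grams weighted with TF-IDF concrete, the sketch below builds such feature vectors from scratch. It covers only the feature extraction step, not the SVM or the lexicon features, and it assumes one common TF-IDF variant (relative term frequency multiplied by log(N / df)); the cited studies may use a different weighting, and the example texts are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(texts, n=3):
    """One {n-gram: weight} dict per text, with weight = tf * log(N / df)."""
    counts = [Counter(char_ngrams(t, n)) for t in texts]
    doc_freq = Counter(g for c in counts for g in c)  # number of texts containing each n-gram
    total = len(texts)
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1                 # guard against texts shorter than n
        vectors.append({g: (k / length) * math.log(total / doc_freq[g]) for g, k in c.items()})
    return vectors


# Hypothetical short texts, for illustration only.
texts = ["women drivers", "women leaders", "good drivers"]
vectors = tfidf_vectors(texts)
```

An n-gram that occurs in every text (here "ers") receives weight zero, while one that is specific to a single text (such as "lea") receives the highest weight, which is the property that makes these features discriminative for a downstream classifier.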

4.2.2 Overview of performance evaluation

When it comes to evaluating performance, classification models with traditional computational approaches seem to fare similarly or, in some cases, better at automated sexism/misogyny identification tasks. The work of Indurthi et al. (2019) shows how different sets of pretrained embeddings, trained with different state-of-the-art architectures and methods, perform very well in binary classification tasks when used with simple machine learning (ML) classifiers like SVM and XGBoost. Kohli et al. (2021) used two kinds of methods: first, an ensemble approach comprising XGBoost, LightGBM and Naïve Bayes; and second, a BERT-based architecture. Both models performed well on the binary identification task, but differently on different languages and on the aggression label analysis, one of which was gendered, due to the overlapping context in all. Overall, SVM is seen to be the best-performing conventional classifier (hence taken as a baseline in some of the works), and many papers have used it either as a standalone classifier or within an ensemble voting classifier (Frenda et al., 2018; Nascimento et al., 2022a), alongside other classifiers like Gradient Boosting and Random Forest (RF). Regression models like Logistic Regression have also been used by many studies, especially for binary tasks. While Decision Tree (Plaza-Del-Arco et al., 2020) and RF (Singh et al., 2021) have also been used, they do not show much success among the conventional models, whereas most DL models use Fully Connected (FC) layers for classification (Bashar et al., 2019).
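
The idea of an ensemble voting classifier can be sketched as hard (majority) voting over the predictions of several already-trained models. This is a generic illustration under that assumption, not the specific ensemble configuration of any cited study, and the prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Hard voting: for each item, return the label predicted by most models.

    Ties are broken in favour of the label seen first (Counter.most_common order).
    """
    ensemble = []
    for votes in zip(*predictions_per_model):
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble


# Hypothetical predictions from three classifiers on four texts (1 = misogynistic, 0 = not).
svm_preds = [1, 0, 1, 0]
gb_preds = [1, 1, 0, 0]
rf_preds = [0, 1, 1, 0]
combined = majority_vote([svm_preds, gb_preds, rf_preds])
```

With an odd number of binary classifiers no ties occur, so each ensemble label is simply the one at least two of the three models agree on.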

4.2.3 Is classification the only way?

Though almost all the computational methods employ classification techniques, classification is not the only way, although it is favored for good reasons. Clustering techniques are mostly useful for content analysis and for studying discourse: they help identify implicit themes/topics in the data that may be (unintentionally) omitted during manual inspection and reassign them into overarching categories for better interpretation, even though they may sometimes provide superficial results (Siddiqi et al., 2018). Karami et al. (2019) employ unsupervised text-mining approaches like LDA topic modeling. Utilizing the themes they found, they performed a qualitative thematic analysis before finally moving to a theoretical thematic analysis to group the previously identified topics into four categories of sexism. Melville et al. (2019) also use LDA for grouping 7 topics, alongside clustering based on the Louvain algorithm (Blondel et al., 2008) for grouping 20 topics. They define sexism based on themes and sites associated with the experience of sexism from Everyday Sexism (Bates, 2015) and journalism. From these studies, it is evident that clustering is more useful for content analysis than for detection/identification tasks.

4.3 Challenges

In this section, we outline the challenges for interdisciplinary approaches that are likely due to the disciplinary divide, and ways of addressing them. We identify two broad reasons for these challenges: (i) the use of different computational strategies; and (ii) the linking of social science theories to the tested computational strategies. The first part discusses the different strategies used, compares them on different parameters in each subsection, and weighs the advantages and limitations of each approach. The second part focuses on how the existing literature has defined sexism and misogyny. By analyzing how the same terms are defined in social science theories, we argue that existing computational research could benefit from a more fine-grained categorization of sexism (or misogyny) to improve automated identification.

4.3.1 Use of different computational strategies

This subsection is intended to shed light on the different computational strategies that have been used to quantify sexism and misogyny, while also segregating the strategies based on some differentiators like the small dataset size and the dataset languages used.

Not a binary classification problem, yet a challenging task

Relying on binary classification models for reliable outcomes can be problematic when their predictions are not explained in terms of the theoretical diversity of the phenomenon. Many computational methods allow only an at-scale understanding of the properties of sexist/misogynistic language. Studies have noted that capturing the relative nuances of the terms has proven difficult in previous works (Samory et al., 2021), yet evaluating the intensity of misogynistic behavior and the type of behavior evidenced in the examined context is necessary (Lynn et al., 2019). Many of the tasks, e.g., hate speech detection and harassment detection, are subjective in the sense that there is no “single objective truth” for defining any of them. While some beliefs are more widely accepted as the norm, they do not define these terms in their entirety. It could be argued that, because most annotation processes are not actively managed by the dataset creators, the resulting datasets are partly subjective and fail to clearly serve a downstream use (Röttger et al., 2022). Barak (2005) describes four forms of gender harassment in cyberspace: active verbal, passive verbal, active graphic, and passive graphic sexual harassment. Depending on two major factors, one objective and one subjective, namely “(a) the nature of the verbal or graphic stimulus in terms of explicitness, blatancy, or clamorousness, in addition to its continuity and repetition and (b) the personal attitudes, sensitivities, and preferences of the recipient”, the degree of the four possibilities of sexual harassment is observed to differ with the personal, subjective level of the offense. This can hold for sexism and misogyny as well. Butt et al. (2021) start by treating sexism as a binary task, defining it following the Oxford English Dictionary, i.e., “prejudice, stereotyping or discrimination, typically against women, on the basis of sex”. Simultaneously, they also experiment on the categorization of the EXIST dataset, adding another category to it, “Misogyny and non-sexual violence”: sexist text when describing a sexist situation or criticizing a sexist behavior. Nevertheless, not much research has considered the differences between various forms of sexism and the overlapping ways women face them in any environment, online or offline (Melville et al., 2019).

Effect of small dataset size on tasks and performance

Even when models perform well, the dataset size can sometimes limit their credibility and the possibility of exploring more complex and novel deep learning (DL) architectures with sophisticated attention mechanisms that require more data, as in the work of Grosz and Conde-Cespedes (2020). Guest et al. (2021) build upon a hierarchical taxonomy with three levels: a binary distinction of misogyny; four non-mutually exclusive categories of misogynistic pejoratives, treatment, derogation, and gendered personal attacks; and third-level sub-categories for some of them, separated by thematic group. Though logistic regression and BERT (both weighted and unweighted) gave good performance scores, the relatively small size of their dataset and an even smaller proportion of misogynistic content (8.1%) were a hindrance, since they performed classification only on the TRUE labels for misogyny. Schütz et al. (2022) approach the challenge of small datasets with different transfer learning strategies: applying two pre-trained multilingual transformers for modeling textual content, and performing data augmentation by extending the data with similar content from external datasets to adapt the model for sexism identification and categorization. Their experiments reveal that fine-tuning the whole model on domain-specific data improves both. However, they observed pre-training to be more advantageous than fine-tuning, since the latter showed signs of over-fitting and did not improve the results when used on external datasets. DL models do not always perform well, especially with insufficient data, which could be one of the main reasons why traditional models perform better in some cases, particularly when rich linguistic features are generated (Plaza-Del-Arco et al., 2020).
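
A small worked example helps show why a low proportion of misogynistic content complicates evaluation. The sketch below uses a hypothetical corpus of 1000 posts with the same 8.1% positive rate mentioned above and a degenerate classifier that never flags anything; the corpus and classifier are illustrative assumptions, not results from any reviewed study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical corpus of 1000 posts in which 81 (8.1%) are misogynistic.
y_true = [1] * 81 + [0] * 919
always_negative = [0] * 1000          # a degenerate classifier that never flags anything
scores = binary_metrics(y_true, always_negative)
```

The degenerate classifier reaches an accuracy of 0.919 while its recall and F1 on the misogynistic class are 0, which is why per-class metrics are usually more informative than accuracy on imbalanced data.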

Dependence on external benchmark datasets

As some of the previous works have recognized, the problem in quantifying sexism lies in the lack of high-quality datasets for training the models and enabling efficient and scalable automated detection systems (Guest et al., 2021). Most of the CS studies have experimented with external datasets. Their heavy dependence on labeled benchmark datasets could also affect the quality and reliability of the representation and diversity of the data. As Zeinert et al. (2021) observe, “When abusive language is annotated, classes are often created based on each unique dataset (a purely inductive approach), rather than taking advantage of general, established terminology from, for instance, social science or psychology (a deductive approach, building on existing research).” While these datasets form a part of the shared tasks in automated misogyny identification (or similar tasks) with labeled data, researchers have observed them to be misrepresentative and mislabeled. (“Shared tasks are collaborative efforts in which researchers and practitioners come together to solve a common problem using shared data and evaluation measures. They promote competition, collaboration, and progress in research, and have become an important part of many academic and industrial communities.” (SIGEDU, 2024)) For example, the IberEval2018 dataset had few instances of some behavior categories (< 2% for the misogyny category ‘derailing’), an over-representation of some of its target categories (> 85% for ‘active’ tweets), and differences in the presence of certain categories across the two languages (Canós, 2018). In the dataset of the TRAC2020 shared task (Bhattacharya et al., 2020a), many of the texts contained multiple languages but were included under only one of them, making it difficult for non-speakers of the other languages to form a reasonable analysis of the classification performance, as this requires knowledge of social structure and culture. The proportion of texts containing hate speech discourse also differed between languages (Gordeev and Lykova, 2020). Given these differences, a dedicated effort in data collection and annotation is needed, utilizing social science theories.

Hence, both of these previous sections imply the need to collect enough ‘reliable’ data before performing any experimentation.

Dataset Languages Explored

The majority of the studies in online hate speech detection have been done in English, closely followed by Spanish and Italian through the shared tasks. A few recent works have been done in Hindi, Bangla, and Arabic. Even within the same language, there are substantive differences in lexical choice and morphological structure arising from regional colloquial usage (Bhattacharya et al., 2020b). Models like fine-tuned cross-lingual multitask BERT show promising performance even on non-English languages, performing better on Bangla when tested alongside English, presumably owing to the peculiarity of the dataset or specific features of the language itself (Gordeev and Lykova, 2020). But when it comes to representation, there is a huge gap between the use of Indo-European languages (especially English) and other languages. When working with multiple languages, “back-translation” has been used to augment the data and translate all of it into a single uniform language, which could be one of the source languages or a different one.

Butt et al. (2021) applied this technique to Spanish and English (source) languages to convert the data into English, with German as the second language, using the deep-translator Python library. Their results on all of the ML algorithms they tried show an improvement with the augmentation, even indicating that with proper pre-processing, these could give competitive results in comparison to deep learning models. Zeinert et al. (2021) also experimented with translating misogynistic posts provided by Anzovino et al. (2018) into Danish using translation services in an attempt to augment the minority class data. This did not prove as useful as a sampling alternative, from which they infer that language-specific investigation is important for cultural discovery in automatic detection systems.
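
The mechanics of back-translation as a data augmentation step can be sketched as follows. The sketch does not call any real translation service or library: the two toy word-level "translators" are hypothetical stand-ins, and in practice the `to_pivot` and `from_pivot` callables would wrap whichever translation tool a study uses.

```python
def back_translate(text, to_pivot, from_pivot):
    """Translate text into a pivot language and back, returning the paraphrase."""
    return from_pivot(to_pivot(text))


def augment(texts, to_pivot, from_pivot):
    """Return the original texts plus any back-translations that differ from them."""
    augmented = list(texts)
    for text in texts:
        paraphrase = back_translate(text, to_pivot, from_pivot)
        if paraphrase != text:
            augmented.append(paraphrase)
    return augmented


# Toy word-level dictionaries (hypothetical stand-ins for a real translation service).
EN_TO_DE = {"women": "frauen", "cannot": "kann nicht", "drive": "fahren"}
DE_TO_EN = {"frauen": "women", "kann nicht": "can not", "fahren": "drive"}


def toy_en_to_de(text):
    """Word-by-word lookup; unknown words are passed through unchanged."""
    return " ".join(EN_TO_DE.get(word, word) for word in text.split())


def toy_de_to_en(text):
    """Phrase-by-phrase replacement using the reverse dictionary."""
    for german, english in DE_TO_EN.items():
        text = text.replace(german, english)
    return text


samples = ["women cannot drive"]
result = augment(samples, toy_en_to_de, toy_de_to_en)
```

The round trip yields a slightly different surface form ("can not" instead of "cannot") with the same label, which is the kind of variation the augmentation relies on.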

In fact, Waseem (2016) advises against boosting the minority class in the interest of mimicking reality in the datasets, even if it causes larger misclassification for the class. Rahali et al. (2021) use gender swap data augmentation and data consolidation with feature ablation, which is seen to improve the learning of the model, especially when used with the same language. But using multi-language datasets does not help much, since English does not consolidate well with other languages (e.g., Arabic and French) that have limited samples compared to English, inevitably giving rise to data imbalance, a form of data bias. So, there is a need to look beyond English.
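
Gender swap augmentation can be illustrated with a simple word-substitution sketch. The swap list below is a small hypothetical example and the function is a simplification under that assumption; it is not the procedure of any cited study, and it deliberately ignores ambiguous forms such as "her".

```python
import re

# Small illustrative swap list (an assumption); real lexicons are larger and must
# handle ambiguous forms such as "her" (his/him), which this sketch ignores.
SWAPS = {"he": "she", "she": "he", "man": "woman", "woman": "man",
         "men": "women", "women": "men", "boy": "girl", "girl": "boy"}

# Whole-word, case-insensitive match on any word in the swap list.
PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)


def gender_swap(text):
    """Replace each gendered word with its counterpart, preserving initial capitalisation."""
    def replace(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(replace, text)


original = "Women should not lead because she is a girl"
swapped = gender_swap(original)
```

Each original sentence and its swapped counterpart can then be added to the training data with the appropriate label, so that the model sees both gendered variants of the same construction.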

Singh et al. (2021) converted the whole dataset from multiple languages into a uniform English dataset by transliterating the sentences belonging to other languages using IndicTransliterator. But that required transferring words from the alphabet of one language to another, which could give faulty outcomes. Given the linguistic variety and the limitations that could be faced when delving into other languages, closing the existing gaps in sexism identification tasks in English should be the primary focus.

Biases

Bias is a broad term that can be defined in several ways depending on the field of research, the task, and other factors, so the bias discussed in one paper may differ from that in another. In this section, we therefore report on (a) how the studies in our dataset explain or quantify biases, (b) whether they acknowledge any biases present in their research, and (c) if they do, what measures they take to counter them. Quantitative social science usually provides a background on bias, whereas in NLP the definition of bias can depend fundamentally on analytical goals, giving rise to NLP-specific situations such as biases in word embeddings, annotator labels, or predicting over-amplified demographics (Hovy and Prabhumoye, 2021). In this work, we follow the meaning of ‘bias’ as defined by Shah et al. (2020), which focuses on “the mismatch of ideal and actual distributions of labels and user attributes in the training and application of a system.” Furthermore, the rapid growth of the NLP field could partially contribute to an inability to adapt to newer circumstances (Hovy and Prabhumoye, 2021).
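To make the notion of a “mismatch of ideal and actual distributions of labels” concrete, the sketch below compares an observed label distribution with a chosen ideal one. The use of total variation distance is our own assumption for illustration; Shah et al. (2020) do not prescribe this particular measure, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def label_distribution(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each label in a collection of annotations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p: Dict[Hashable, float], q: Dict[Hashable, float]) -> float:
    """Total variation distance between two label distributions.

    0 means the distributions are identical; 1 means they do not overlap.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Illustrative data: one sexist example for every three non-sexist ones,
# compared against a balanced ideal distribution chosen for this example.
actual = label_distribution(["sexist", "not", "not", "not"])
ideal = {"sexist": 0.5, "not": 0.5}
mismatch = total_variation(actual, ideal)
```

A larger value of `mismatch` indicates that the training labels deviate more from the distribution the researcher considers ideal; what counts as ideal remains a design decision.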

Gender bias:
Eagly and Mladinic (1989) explain the relationship between attitudes toward women and men using attitude theory, and the stereotypes attached to these groups on that basis. “The cognitive aspect of an attitude- i.e., the person’s thoughts about an attitude object- can be defined as the attribute he/she ascribes to the attitude object.” If the attitude object is taken to be a social group, the attributes ascribed to that group express positive or negative evaluation, and people’s beliefs or cognitions about the group carry evaluative meaning, which eventually develops into stereotypical interpretations of the group. Their study of shared stereotypes shows that, even with positive evaluation, women were perceived as inferior to men in agentic or instrumental (masculine-positive) qualities, but superior in communal or expressive (female-positive) qualities. Schmid Mast (2004) provides empirical evidence for the existence of an implicit gender hierarchy stereotype, showing that men were associated with hierarchical structures and women with egalitarian structures more than vice versa. Given that the magnitude of this stereotype was not small, it accounts for an inherent societal bias that still exists, and social media has the power to intensify this common bias through its influence. Consequently, when such informative but biased social information is fed to machines, it can lead to gender bias, with models (unintentionally) learning negative associations about the stereotypes of different communities or groups from the training data and propagating them. Dinan et al. (2020) offer one solution: they use a semantic and pragmatic framework to measure bias along three dimensions derived from “the knowledge of the conversational and performative aspects of gender.” By independently investigating the contribution of author gender to the data, they aimed to understand gender bias better.

Annotator bias:
Having many categories of misogyny may also affect annotator agreement, both in terms of depth (subcategories) and breadth (different types) of the categories, owing to differences in the experience and values of the annotators. Their inherent social biases may also affect their choices, especially when working with context (Guest et al., 2021). Differing levels of understanding of the language in question, personal prejudices, and differing individual world-views are seen as the primary sources of inter-annotator disagreement. Bhattacharya et al. (2020b) addressed this through several rounds of discussion and sensitization towards gender issues among annotators, providing counterexamples and examining annotator votes, alongside an ‘unclear’ tag for cases of disagreement. Including adversarial examples in the training data (even just 25%) is sometimes seen to help the robustness and performance of the models. In fact, exposing the models to different aspects of sexism and challenging them with different examples has been shown to be effective for generalizability (Samory et al., 2021).
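The idea of examining annotator votes and falling back to an ‘unclear’ tag can be illustrated with a small aggregation routine. This is a hedged sketch, not the procedure of Bhattacharya et al. (2020b): the function name, the strict-majority rule, and the `min_share` threshold are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def aggregate_votes(votes: List[str], min_share: float = 0.5) -> str:
    """Return the label chosen by a strict majority of annotators.

    If no label is chosen by more than `min_share` of the annotators
    (or there are no votes at all), the item is tagged 'unclear' so that
    it can be raised in a later discussion round.
    """
    if not votes:
        return "unclear"
    label, top_count = Counter(votes).most_common(1)[0]
    if top_count / len(votes) > min_share:
        return label
    return "unclear"
```

Raising `min_share` makes the scheme stricter, sending more items back to discussion at the cost of a smaller immediately usable dataset.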

Some studies apply specific criteria when selecting annotators, based on both similarities and differences among them. Selection may be based on region, demographics, education, and ethnicity (Guest et al., 2021), on being native speakers of the language (Chiril et al., 2021), or on being feminists (Jha and Mamidi, 2017), but most often on having studied gender (Lynn et al., 2019) or linguistics (Nozza et al., 2019). Since most studies use either amateur (crowd-sourced) annotators or expert annotators (those with both theoretical and applied knowledge of hate speech), Waseem (2016) compares the two, reporting the contrasts observed in their annotations and the resulting model performance, which did not substantially improve on the previous model (Waseem and Hovy, 2016). The emphasis on the most significant features shifts from extra-linguistic features for majority-voted amateurs to the content of the tweets for experts, and among the features they experimented with, the ones having highest performances (high F1 score) were not necessarily the features with the best performances. Singh et al. (2021) consider misogyny/sexism a subset of hate speech and use data manually annotated by multiple annotators following the ‘Discursive Methods of Annotation’, which they saw as a pragmatic way to capture the socio-pragmatic phenomenon using social studies, as a function of both the contextual factors and the discursive experience of the speaker. Zeinert et al. (2021) follow an iterative process of raising cases for revision in discussion rounds, formulating the issue, and providing documentation for annotation, and they recruit annotators who are diverse in age, occupation/background, and region (spoken dialects). Annotation biases can lead to other kinds of bias, such as racial bias due to a lack of knowledge of different dialects, which could potentially amplify harm against people from minority communities (Sap et al., 2019).

Any other causes of bias?
Hirsch (1992) documents the oppression of women through language, discussing male-specific words that are positively portrayed in English and that in turn reflect the “consensus reality” of patriarchal society. Theorizing the connection between language and gender on the basis of one of the reviewed books (with many of the examples drawn from political discourse), the review discusses how language is used as a tool to further perpetuate patriarchy. The same is seen for computational models based on English. The datasets chosen for studies can also add to the bias, owing to the considerations made about the data source, and hence fail to represent real-world diversity. Examples include sourcing data from domains where misogyny is assumed to be most likely, such as women’s fashion blogs and fitness tips videos (Bhattacharya et al., 2020b), or treating sexism as one of the sentiment labels, with data collected around specific cases, instances, or event networks such as #Coronavirus, #ClimateChange, #Immigrants, and #MeToo (Katsarou et al., 2021). Another form of such bias can arise from subjectivity in mislabeled data. Samory et al. (2021) re-annotated the external datasets used in their study, following a sexism annotation codebook they devised themselves. Relying on two baselines, Gender-Word (Zhao et al., 2018) and Jigsaw’s Perspective API (Hosseini et al., 2017), they found that a large share of the tweets labeled sexist were non-sexist, with only ~60% of the sexist labels adhering to their ground truth. They found that stratifying misclassification rates helped in giving a more accurate result. Both of these points could hinder model performance.

Yet, for the biases listed, the question always remains whether they were a cause of systematic errors (conscious or unconscious) or the result of a narrowed preference in a particular direction in favor of the said bias. In other words, the use of ‘bias’ to refer to systematic error is problematic. According to Hammersley and Gomm (1997), it depends on ‘truth’ and ‘objectivity’, whose justification and role have been questioned. Given the ambiguity of the term itself, we might ask whether the forms of bias explored result from methodological adaptability, from conscious limitations due to the scope of the research (such as research design), or from the models themselves. Either way, they may not indicate that the research is “being biased”. Of the five most common sources of bias in NLP tasks identified by Hovy and Prabhumoye (2021), we have reviewed almost all in this section. This indicates that these biases are well known across the CSS literature and can be explored further in order to mitigate them at every source, using algorithmic and methodological approaches.

Linguistic representations of online misogyny/sexism

Lexical dependency on theorizing sexism and misogyny.
To characterize misogyny and sexism linguistically, many studies have drawn on different theoretical concepts. Farrell et al. (2019) built a list of key lexicons for categorizing misogyny using the Encyclopaedia of Feminist Theories (Code, 2002), other pre-existing hate-speech lexicons, and studies of the specific rhetoric of the manosphere, taken from different corpora. In their observatory work, they study the evolution of communities whose users share in-group characteristics. Although corroborating theories and existing ideas helped in providing lexicons, they acknowledged the limitations of this approach due to its lack of completeness (shortcomings in capturing all the words that might be relevant). Other studies use words ‘typically associated’ with misogynistic content, compiled by domain experts (Lynn et al., 2019), which serve as neologisms for identifying emerging or cloaked misogyny.

Could lexical dependency cause overfitting?
NLP models tend to overfit when certain identity terms and lexical dependencies have too much influence, which eventually results in false positives, severe unintended bias, and lower performance. Bashar et al. (2019) acknowledge that misogynistic abusive tweets might contain certain keywords, but would not necessarily always contain such slurs. To work around this, they show that classifiers can be trained on small labeled datasets, provided that the word vectors are pre-trained on the context domain of the problem and paired with careful customization and regularization of the model. This suggests that a large labeled dataset is not always required for training. On the other hand, Plaza-del Arco et al. (2021) generate linguistic resources using a set of word embeddings, with the initial seed lexicon gradually being populated with words and n-grams more attuned to the domain because of linguistic similarities. A voting scheme combining logistic regression and multinomial Naïve Bayes with the lexicon-based system and combinations of unigrams and bigrams gave good results on the Spanish dataset. Their observations show that some expressions of hate change their sense entirely when combined with other terms, and hence that better supervised learning begins with larger data.

Possible solutions to counter bias by lexical dependencies.
For larger datasets, the issue is aggravated by the imbalanced nature of the datasets and their disproportionate dependence on these determinate terms, which correlate highly with the minority class (Nascimento et al., 2022b). Using such identity terms, or samples from target domains, during the training phase requires a-priori knowledge and can often introduce further bias. Adding a regularization approach such as Entropy-based Attention Regularization (EAR) to bring some degree of contextualization to the models could mitigate the problem to some extent, as such models show competitive performance along with an improvement in the bias metrics (Attanasio et al., 2022). Classifiers that decompose gender bias within full sentences into semantic dimensions can also be used, since gender bias can be contextually determined (rather than being explicitly gendered); this has been shown to give better performance in controlling gender differences (Dinan et al., 2020). Ou and Li (2020) find limitations in using only the pooler output of DL multilanguage models like XLM-RoBERTa, and obtain deeper and more abundant semantic features by extracting from the hidden layer states, which gives better performance. Nascimento et al. (2022b) propose a data correction strategy focused on gender bias, consisting of two modules, bias detection and replacement of the detected bias-sensitive words (BSWs), which is seen to reduce the differentiation of similar gender-related terms and, in turn, to mitigate unintended bias. Since the frequency of female identity terms is high in datasets related to sexism and misogyny (even when representing similar groups/classes or other social identities), they replaced these potential bias terms with an <identity> tag without compromising model accuracy. Their proposed multi-view stacked classifier is seen to outperform other state-of-the-art models and to diminish gender bias.
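The replacement step of such a data correction strategy can be sketched in a few lines. This is only an illustration of the general idea of masking bias-sensitive words with an identity tag: the term list `IDENTITY_TERMS` is a hypothetical placeholder, and the bias detection module that would normally produce such a list is not modeled here.

```python
import re
from typing import Iterable

# Hypothetical list of identity terms; in practice this would come from
# a bias detection step rather than being written by hand.
IDENTITY_TERMS = ["woman", "women", "girl", "girls", "she", "her"]


def mask_identity_terms(text: str,
                        terms: Iterable[str] = IDENTITY_TERMS,
                        tag: str = "<identity>") -> str:
    """Replace whole-word occurrences of identity terms with a neutral tag.

    Matching is case-insensitive and restricted to word boundaries, so that
    a term such as 'her' is not replaced inside a word such as 'there'.
    """
    # Longer terms first, so the alternation prefers the longest match.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(tag, text)
```

Masking in this way removes the surface cue a classifier could latch onto, while leaving the rest of the sentence available as context.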

4.3.2 Linking social science theories to computer science research

Following on from the previous subsection, where we introduced our argument that identifying sexism/misogyny is not a binary task, this section expands on that point by presenting social science theories and scales that explain why classification should not be computationally limited to a binary output. Alongside the theories and scales, we also analyze how some studies have drawn on them in any capacity (i.e., the extent of adaptation, using one or more categories of the scale) and implemented them at any stage of their research. We divide each subdivision into two parts, the concept and the applications, to differentiate between the concepts themselves and how they are implemented in studies.

Sexism is not always hostile

Concept.
Grosz and Conde-Cespedes (2020) state that models can perform detection tasks more easily on datasets containing large amounts of “hostile” sexism, since it hinges on certain words regardless of their context. But that does not reflect a real-world scenario. In general, sexism is said to have two components, hostility towards women and endorsement of traditional gender roles, and most measures of sexist attitudes so far have stemmed from these. But it is not always so. Through their anthropological research on sexism, Glick and Fiske (1997) call sexism “fundamentally ambivalent”, adding a subjectively benevolent side to what was previously perceived as singularly hostile. They argue that the “simultaneous existence of male structural power and female dyadic power” creates an ambivalent ideology. While the hostile ideology seeks to justify the male position through derogatory characterizations of women (hostile sexism, HS), the benevolent ideology relies on kinder and gentler justifications, which may appear subjectively positive to the sexist because they encompass feelings of protectiveness and affection towards women (benevolent sexism, BS). Drawing parallels with paternalism, which also has two ideologies, dominative and protective, they demonstrate that protectiveness is particularly strong when women (e.g., wives, mothers, daughters) are dyadically dependent on men, as a feeling akin to a sense of “ownership”. The hierarchical stereotype ideology explained earlier constitutes the belief contributing to gender differentiation. Like paternalism, it has both a hostile and a benevolent side. Competitive gender differentiation, the hostile kind, dwells on negative stereotypes of women, implying that men are the better gender; complementary gender differentiation, the benevolent kind, stems from traditional stereotypes of women through assigned gender roles and men’s dyadic dependence on women, albeit in an extremely positive light (Eagly and Mladinic, 1994).
The same holds for heterosexuality, which has a hostile side that views women as mere sexual objects who use sexual attraction to gain power over men, and an intimate or benevolent side that romanticizes the former belief, viewing women as necessary for men to feel “complete”.

Applications.
Sexism in ambivalent sexism theory (Glick and Fiske, 1996) is thus hypothesized to encompass these three sources of male ambivalence. Jha and Mamidi (2017) used it to computationally identify benevolent sexism and to classify sexist content based on the two components. They confirm the hypothesis that HS is evidently negative and easily identifiable, while BS is retweeted much more and is camouflaged, seemingly harmless or noble, and hence harder to detect. While SVM showed high precision for both, its recall was quite low for HS; their Seq2Seq model (an LSTM-based bi-directional RNN) showed higher recall for both, although its precision was not as high, presumably because it takes in the structure of the tweet. FastText, owing to its bag-of-n-grams features (and fewer parameters to tune), outperformed both of the former classifiers. On the other hand, Singh et al. (2021) used the hostile side of the three sources of male ambivalence to define sexism as a binary category and to annotate dialogues in popular sitcoms. Using these concepts, they manually annotated the external datasets (source domain) and used a semi-supervised domain-adaptive learning approach to generate classes in the model for the unannotated data (target domain), thus further augmenting the training data and improving the final classification performance. However, error analysis showed certain false positives, such as classifying as sexist aggressive negative statements aimed at a particular woman, or content with explicit sexual terms and mentions of marriages or weddings. This could be an underlying drawback of not using a diverse dataset, since the authors had included dialogues containing derogatory terms and dialogues justifying stereotypes against women or gender roles.
But Mishra et al. (2019) use the concepts from previous research rather differently, taking inspiration from studies that use randomly initialized user embeddings to improve performance, and inter- and intra-user representations based on tweets. Instead of the semi-supervised approach above, they apply a graph convolutional network (GCN) to a heterogeneous graph with two types of nodes, authors and their tweets, to generate richer author profiles. The intention was to use this heterogeneous representation to enable the model to learn both the community structure and the linguistic behavior of authors in such communities. Even with this improvement, several abusive tweets were misclassified, primarily due to abusive content being present in the URL (not in the tweet itself) and to deliberate obfuscation of words and phrases by authors to evade detection.

Subtle forms of sexism/misogyny

Concept.
Since most sexism measurement scales focus on hostile sentiments, they fail to capture contemporary forms of subtle sexism, which are often cloaked in the guise of egalitarian views while harboring (more) traditional beliefs. With increased social awareness of sex discrimination, the more blatant forms of sexism have declined and been replaced by subtle forms expressed through indirect indices. The lack of a conceptual framework, coupled with methodological problems, was indicated in the simulation study conducted by Beattie and Diehl (1979), who observe the use of indirect means to interpret gender and hence influence the evaluation criteria. This gave suggestive evidence of a new form of sexism called “neosexism”, first introduced by Tougas et al. (1995) and defined as “a manifestation of a conflict between egalitarian values and residual negative feelings towards women”. They used a predictive model of ‘attitude to affirmative actions’ to test the discriminatory bias and evaluated the practical implications of neosexism through their Neosexism Scale (NS). The study indicated that “neosexist beliefs were linked with opposition to programs designed to facilitate integration of both women and minorities”, which further demonstrates the importance of understanding existing prejudicial beliefs about women in order to understand the different forms of sexism.

Applications.
An analysis by Archer and Kam (2020) of cross-sectional data from the 2016 US presidential election and the #MeToo movement shows a significant correlation with neosexism and varying degrees of dismissal of existing gender discrimination among respondents, indicating its existence on online platforms. Zeinert et al. (2021) used the NS in their work on Danish tweets to add neosexism to their taxonomy alongside the previously categorized forms of sexism. Interestingly, during annotation they found that neosexism was the most common form of misogyny and accounted for most of the annotation challenges arising from disagreement, primarily due to the difficulty of understanding the author’s intentions, judging the degree of abuse (since misrepresentation could harm the subject or the fact), and a lack of world knowledge. This further added to the class imbalance in the last stage of sexism labeling in their dataset, which affected the reliability of the performance, even though they started with a 1:1 class balance at the initial stage (labeling abusive or not) of their iterative labeling scheme based on the MALER framework proposed by Finlayson and Erjavec (2017). To prevent such bias caused by an imbalanced dataset, Indurthi et al. (2019) process the training dataset using SMOTE (Chawla et al., 2002), which synthetically oversamples the data and ensures that all classes have an equal number of instances. While the existence of subtle forms of sexism and misogyny is undeniable, having unbiased data that is representative of them is essential for a better computational outcome.
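The core of SMOTE-style oversampling is interpolation between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is a simplified, standard-library illustration of that idea under our own assumptions (feature vectors as lists of floats, Euclidean distance, a hypothetical function name); it is not the implementation used by Indurthi et al. (2019), and a maintained library would be preferable in practice.

```python
import math
import random
from typing import List

Vector = List[float]


def smote_like(minority: List[Vector], n_new: int,
               k: int = 2, seed: int = 0) -> List[Vector]:
    """Create `n_new` synthetic minority samples by interpolation.

    For each new sample, a random minority point is chosen, one of its
    k nearest minority neighbours is picked at random, and a point on the
    line segment between the two is returned.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

To balance a dataset, `n_new` would be set to the difference between the majority and minority class sizes. Because the synthetic points lie between existing ones, they add no genuinely new linguistic variety, which is one reason oversampling cannot substitute for representative data.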

4.3.3 Need for fine-grained categorization

Use of different linguistic features to capture nuances

For a more fine-grained approach that brings in the context and nuances of misogyny, natural language processing (NLP) is essential. NLP applications like sentiment analysis are crucial for analyzing and detecting online sexism/misogyny. Incorporating polarity and emotion information is seen to benefit the task, since such content usually expresses negative emotion and polarity towards the recipient (Plaza-Del-Arco et al., 2020; Plaza-del Arco et al., 2021). Feature representations have further helped in training models by representing the text in terms of various lexical, syntactic, and morphological features. While the most common features are bag-of-words representations of the text and/or embeddings, adding further features also improves performance, and many papers have done so. The idea is to comprehensively map out the various aspects of sexism as seen in everyday social constructs for the benefit of identification tasks (Samory et al., 2021).

Error analysis and its findings

One of the key drawbacks of sexism and/or misogyny identification is that models are not able to pick up slight or subtle implications of sexism in the text, which mostly depend on context (Guest et al., 2021). Conversely, text containing many sexual terms tends to be marked as sexist/misogynistic. As Singh et al. (2021) observe in their error analysis, many of the confounding factors were specific terms that were either extremely sexual or part of aggressively negative statements. Chiril et al. (2020) further characterized the binary sexist classification by distinguishing cases where the addressee is directly addressed from those where she is not. The three categories are: (i) directed assertions, sexist tweets directly addressed to a woman or a group of women; (ii) descriptive assertions, sexist tweets not directed to an addressee; and (iii) reported assertions, tweets containing a report of an experience or a denunciation of sexist behavior. Through manual error analysis of the per-class results of their best-performing model, BERT, they identified the absence of context for the utterance, humor and satire, and the use of stereotypes or metaphors as the causes of misclassification, and they discussed the need for reasoning. Frenda et al. (2019) likewise state that one of their principal problems is the use of linguistic devices like irony and sarcasm. Inspired by social science work, Sharifirad et al. (2018) categorize sexism into four complementary types: information threat, and indirect, sexual, and physical harassment. To improve coverage of the classes in the data and reduce data scarcity, they use ConceptNet to generate texts through some of its relations, such as IsA and RelatedTo, with three different replacement approaches: all words (which performed better than the other two), nouns, and verbs.
Adding more information by enriching the text semantically and augmenting the data using general-purpose knowledge graphs and Wikidata concepts was seen as effective, but not as effective as text generation alone, presumably because of the lack of mapping between the two augmentation techniques. Whether used separately or together, the techniques are seen to have drastically improved the classification results.

Can psychometric scales be useful for capturing social constructs?

Different psychometric scales can also be used to map out various aspects of sexism/misogyny as a social construct, in order to comprehensively detect the different categorizations. King and King (1997) reaffirm the previously stated theory on modern sexists, and describe them as “people who while rejecting old-fashioned discrimination and stereotypes, may believe that discrimination against women is a thing of the past, feel antagonistic towards women who are making political and economic demands, and feel resentment about special favors for women, such as policies designed to help women in academics and work.” In other words, the distinction between old-fashioned and modern sexism is that the former displays obviously unequal treatment of women while questioning their intelligence, whereas the latter is less sympathetic to women’s issues (if it perceives them to be issues at all), since it presumes greater equality in the workforce than actually exists. The Modern Sexism (MS) scale provided by this study aims to be a good indicator of modern sexism, which can be both overt and covert in nature. People endorsing MS beliefs are hence less likely to detect the occurrence of normative sexist behavior (Swim et al., 2004). In their review of the MS scale, Swim and Cohen (1997) indicate the same, observing that it measures the subtle forms of sexism that are built upon cultural and societal norms. They also review another general measure of sexism, the Attitude Toward Women Scale (AWS), which measures overt or blatant sexism. Their analysis indicates that, even with these distinctive differences, the two scales share related constructs. These social constructs are often perpetuated as discriminatory attitudes towards a feminine gender role, with roles traditionally allocated and differentiated by sex.
García-Cueto et al. (2015) propose a scale to assess gender role attitudes, showing how sexist attitudes can be modified using the theoretical perspectives of gender equality.

Multi-class vs multi-label classification of sexism

Parikh et al. (2019) show that some categories of sexism can co-occur in multi-class classification. They provide a multi-label categorization (with twenty-three categories, as directed by a social scientist) of first-person accounts of sexism from survivors reporting any type of sexism, following gender-related discourses and campaigns, which could even have an impact on public policy. Their annotation followed a three-phase process to ensure that the final dataset had been reviewed by at least two annotators, each studying or having studied topics related to gender and/or sexuality. Annotators followed detailed guidelines and were pre-trained in a pilot round in the first stage, with further quality checks in the next two stages and a reduction of the annotation categories to fourteen for the final classification task. They tailored a BERT model to the domain of instances of sexism using unlabeled data in the training set, via masked language modeling and the next sentence prediction task, and then flexibly combined its sentence representation with distributional word embeddings and a linguistic feature vector. Their linguistic feature representation comprised a variety of features: features from biased language detection work, PERMA (Positive Emotion, Engagement, Relationships, Meaning, Accomplishments) features for polarities, associations with eight basic emotions, and lexicons for sentiment, affect, and scores, to meaningfully distinguish among the categories. Their proposed multi-label multi-class LSTM-based neural framework outperformed many of the baseline traditional ML and DL models. Building on the same dataset, Abburi et al. (2021) took a more fine-grained approach, enhancing the categorization schema to capture all twenty-three categories. They employ self-training-based semi-supervised learning for classifying the accounts of sexism, augmenting the labeled data so that the categories can co-occur.
They devise mechanisms to enhance the textual diversity of the expanded labeled set, alleviating the skew in the original class distribution and favoring samples that are hard to classify, using score computation and intersection. Adding to the previously used combination of domain-tailored BERT with an attention mechanism and biLSTM, they propose a loss function that makes use of the label confidence scores associated with each of the pseudo-labeled samples in the augmented data. This multi-level training method, which uses the category hierarchy for multi-label classification, trains the model sequentially at different levels and was observed to outperform numerous baselines across several metrics. But both of these studies use data collected from online sexism reports, so they consider only sexist examples and perform a detection task over them, hence not performing the identification of sexism itself.

However, Swim et al. (2004) state that, along with the use of sexist language, this too is a limitation, since such an approach is unable to generalize to other types of sexist behavior that the studies leave unidentified. These studies nevertheless assert the advantages of their multi-label classification, which can inspire future research on sexism identification tasks. Talavera et al. (2021) also use multi-label classification for sexism, but with a much less fine-grained categorization of only five categories. Like the previous studies, they conclude that exploiting the transfer learning capabilities of pretrained language models, with optimized fine-tuning to the target domain, is a desirable approach for achieving competitive performance, especially for tasks where training data is scarce.
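The idea of a loss that uses label confidence scores for pseudo-labeled samples, as described for Abburi et al. (2021), can be illustrated with a confidence-weighted multi-label binary cross-entropy. The exact form of their loss is not given in this review, so the weighting below is an assumption for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def weighted_bce(probs: Sequence[float], targets: Sequence[int],
                 confidence: float = 1.0, eps: float = 1e-12) -> float:
    """Multi-label binary cross-entropy for one sample, scaled by confidence.

    `probs` holds one predicted probability per category and `targets` the
    matching 0/1 labels, so several categories can be active at once.
    Human-labeled samples use confidence 1.0; pseudo-labeled samples use
    the (lower) confidence of the model that produced their labels.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return confidence * total / len(probs)


def batch_loss(batch: List[Tuple[Sequence[float], Sequence[int], float]]) -> float:
    """Average the confidence-weighted loss over (probs, targets, confidence) triples."""
    return sum(weighted_bce(p, t, c) for p, t, c in batch) / len(batch)
```

Under this weighting, a pseudo-labeled sample with confidence 0.5 contributes half as much to the loss as a human-labeled sample with the same predictions, which limits the influence of uncertain labels on training.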

While the studies from the previous sections contributed immensely to separating the fine-grained features of sexism, we urge more researchers to build on social theories to support sexism identification tasks. Through this review, we provide a comprehensive report on the research practices used in the study of sexism, which often take an interdisciplinary approach but fail to narrow the disciplinary divide.

5 Summary of general strategies used and existing challenges

Our summary of the key research findings identified through the literature review reflects the current drawbacks in the study of sexism and misogyny identification tasks. Irrespective of the different measures taken by the reviewed works, some limitations remain consistent, which further hinder obtaining a robust model capable of quantifying sexism or misogyny. As Vidgen and Derczynski, (2020) suggest, “More standardization is an important aspiration as research continues to mature, although it must be balanced with enabling research innovation and freedom.” Therefore, we summarise the research findings in the following points:

  1. A good performance score alone is not a sufficient measure of a good model.

  2. Sexism or misogyny should not be limited to a binary classification task; they should further be divided into multiple categories and classified accordingly.

  3. Datasets used for the tasks should be representative of the diversity in the propagation of online sexism or misogyny, to limit biases in the model.

  4. It is important to computationally capture sexism and misogyny in their subtle forms, hence focusing on their covert and indirect propagation is necessary.

  5. The forms of bias that arise while performing CSS tasks can be mitigated using algorithmic and methodological approaches.

  6. The difficulty of balancing the requirements of the two disciplines, social science and computer science, has left a limited number of interdisciplinary research outputs. Research on online sexism and misogyny would benefit from the expertise of both disciplines, hence interdisciplinary work should be promoted more.
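The first point can be illustrated with a small sketch. On an imbalanced dataset, a classifier that never predicts the minority class can still obtain a high accuracy, while a class-balanced metric such as macro-averaged F1 exposes the failure. The data below are invented for illustration (5 sexist posts out of 100) and do not come from any of the reviewed studies.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical imbalanced data: 1 = sexist, 0 = not sexist.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # a degenerate model that always predicts "not sexist"

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Here the accuracy is high even though not a single sexist post is found, whereas the macro F1 stays below 0.5 because the F1 for the sexist class is zero.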

6 Conclusion

This paper has provided a meta-analysis, shedding light on how sexism and misogyny have been studied between 2012 and 2022. This analysis shows how research, even on the same topic of sexism and misogyny, can cover a diverse range of sub-topics, as identified through the topic modeling in Section 3.3.1. Through the keyword search in Section 3.3.2, we further see how the keywords used in the abstracts and titles play a pivotal role in identifying each topic. In this paper, we conduct an in-depth examination of 45 works on the quantification of sexism and misogyny, providing critical insight into the existing literature: its use of data, methods, and techniques, and its formulation of tasks. Through our identification of the research studies and the multiple screening processes, we check the eligibility of the works by evaluating their relevance to the research question. Based on an evidence-based review, we provide an extensive analysis of all the final selected works, investigating the challenges and opportunities for future work. By listing the approaches used by the studies, we form a comprehensive summary of the directions taken at different stages of research, from the conceptualization of sexism and misogyny to its measurement using different techniques. While the trends in methodology have shown improvement in capturing the dynamics of sexism and misogyny by accounting for the nuances of sexist/misogynist language, a large part of it is still left unexplored, mostly due to the computational inability to capture the complexity of the construct. It is seen as difficult to manage the nuances of categorization in sexism and misogyny while also attempting to improve modeling performance. To balance both, studies typically opt for bias mitigation techniques rather than focusing on categorization, since the latter comes with the added responsibility of providing a representative dataset. 
Through the steps of our systematic literature review, we meet the research aims outlined at the beginning of the paper. Identifying or detecting such a social inequality is a challenge, more so because of its subjective nature. To improve on these tasks, the challenges and limitations must be addressed, to mitigate biases as much as possible and to bridge the gaps in the existing literature. With the summary we have provided, we hope to contribute to further development on this topic, ensuring updated resources and encouraging investigation into the changing dynamics of online sexism and misogyny.

Declarations

Material and code availability

No data were generated during this research; the data were acquired from online websites or through API access to the stated citation databases.

All the shareable data collected and used in the research, along with its analysis, are made available on the GitHub page: https://github.com/booktrackerGirl/Sys-lit-review-Sexism (this repository will be made public upon acceptance for publication). We also include a permissive license to allow users to use, modify and distribute the materials.

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Funding

We thank the University of Exeter for funding the cost to access the Web of Science Expanded API and SerpAPI. A.D.’s time on the research was funded by the SSIS Global Excellence PhD Studentship from the University of Exeter. S.B.’s time on the research was funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 101019284). C.Q.C. thanks the Ewha Frontier 10-10 project and the DSO National Laboratories Singapore for funding this research.

References

  • Abburi et al., (2021) Abburi, H., Parikh, P., Chhaya, N., and Varma, V. (2021). Fine-grained multi-label sexism classification using a semi-supervised multi-level neural approach. Data Science and Engineering, 6(4):359–379.
  • Anzovino et al., (2018) Anzovino, M. E., Fersini, E., and Rosso, P. (2018). Automatic identification and classification of misogynistic language on twitter. In International Conference on Applications of Natural Language to Data Bases.
  • Archer and Kam, (2020) Archer, A. M. and Kam, C. D. (2020). Modern sexism in modern times: Public opinion in the #metoo era. Public Opinion Quarterly, 84(4):813–837.
  • Attanasio et al., (2022) Attanasio, G., Nozza, D., Hovy, D., and Baralis, E. (2022). Entropy-based attention regularization frees unintended bias mitigation from lists.
  • Attanasio and Pastor, (2020) Attanasio, G. and Pastor, E. (2020). Politeam @ ami: Improving sentence embedding similarity with misogyny lexicons for automatic misogyny identification in italian tweets. EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020.
  • Barak, (2005) Barak, A. (2005). Sexual harassment on the internet. Social Science Computer Review, 23(1):77–92.
  • Bartow, (2009) Bartow, A. (2009). Internet defamation as profit center: The monetization of online harassment. Faculty Publications.
  • Bashar et al., (2019) Bashar, M. A., Nayak, R., Suzor, N., and Weir, B. (2019). Misogynistic tweet detection: Modelling cnn with small datasets. In Data Mining: 16th Australasian Conference, AusDM 2018, Bathurst, NSW, Australia, November 28–30, 2018, Revised Selected Papers 16, pages 3–16. Springer.
  • Basile et al., (2019) Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F. M., Rosso, P., and Sanguinetti, M. (2019). SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In May, J., Shutova, E., Herbelot, A., Zhu, X., Apidianaki, M., and Mohammad, S. M., editors, Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
  • Basile et al., (2020) Basile, V., Croce, D., Maro, M. D., and Passaro, L. C. (2020). EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for italian. In Basile, V., Croce, D., Maro, M. D., and Passaro, L. C., editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online event, December 17th, 2020, volume 2765 of CEUR Workshop Proceedings. CEUR-WS.org.
  • Bates, (2015) Bates, L. (2015). Everyday sexism. Schuster UK.
  • Beattie and Diehl, (1979) Beattie, M. Y. and Diehl, L. A. (1979). Effects of social conditions on the expression of sex-role stereotypes. Psychology of Women Quarterly, 4(2):241–255.
  • (13) Bhattacharya, S., Singh, S., Kumar, R., Bansal, A., Bhagat, A., Dawer, Y., Lahiri, B., and Ojha, A. K. (2020a). Developing a multilingual annotated corpus of misogyny and aggression. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, pages 158–168, Marseille, France. European Language Resources Association (ELRA).
  • (14) Bhattacharya, S., Singh, S., Kumar, R., Bansal, A., Bhagat, A., Dawer, Y., Lahiri, B., and Ojha, A. K. (2020b). Developing a multilingual annotated corpus of misogyny and aggression. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, pages 158–168, Marseille, France. European Language Resources Association (ELRA).
  • Bhattacherjee, (2019) Bhattacherjee, A. (2019). Social science research: principles, methods and practices (revised edition). University of South Florida.
  • Blondel et al., (2008) Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.
  • Bolanos et al., (2024) Bolanos, F., Salatino, A., Osborne, F., and Motta, E. (2024). Artificial intelligence for literature reviews: Opportunities and challenges.
  • Briggs, (2023) Briggs, J. (2023). Advanced topic modeling with bertopic.
  • Butt et al., (2021) Butt, S., Ashraf, N., Sidorov, G., and Gelbukh, A. F. (2021). Sexism identification using bert and data augmentation - exist2021. In IberLEF@SEPLN.
  • Canós, (2018) Canós, J. S. (2018). Misogyny identification through svm at ibereval 2018. In IberEval@SEPLN.
  • Chawla et al., (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357.
  • Chiril et al., (2021) Chiril, P., Benamara, F., and Moriceau, V. (2021). “be nice to your wife! the restaurants are closed”: Can gender stereotype detection improve sexism classification? In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2833–2844, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Chiril et al., (2020) Chiril, P., Moriceau, V., Benamara, F., Mari, A., Origgi, G., and Coulomb-Gully, M. (2020). An annotated corpus for sexism detection in french tweets. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1397–1403.
  • Code, (2002) Code, L. (2002). Encyclopedia of feminist theories. Routledge.
  • Daniels and Leaper, (2011) Daniels, E. and Leaper, C. (2011). Gender issues. In Brown, B. B. and Prinstein, M. J., editors, Encyclopedia of Adolescence, pages 151–159. Academic Press, San Diego.
  • Das et al., (2023) Das, A., Rahgouy, M., Zhang, Z., Bhattacharya, T., Dozier, G., and Seals, C. D. (2023). Online sexism detection and classification by injecting user gender information. In 2023 IEEE International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings), pages 1–5.
  • Devlin et al., (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dinan et al., (2020) Dinan, E., Fan, A., Wu, L., Weston, J., Kiela, D., and Williams, A. (2020). Multi-Dimensional Gender Bias Classification. arXiv e-prints, page arXiv:2005.00614.
  • Eagly and Mladinic, (1989) Eagly, A. H. and Mladinic, A. (1989). Gender stereotypes and attitudes toward women and men. Personality and Social Psychology Bulletin, 15(4):543–558.
  • Eagly and Mladinic, (1994) Eagly, A. H. and Mladinic, A. (1994). Are people prejudiced against women? some answers from research on attitudes, gender stereotypes, and judgments of competence. European Review of Social Psychology, 5(1):1–35.
  • Farrell et al., (2019) Farrell, T., Fernandez, M., Novotny, J., and Alani, H. (2019). Exploring misogyny across the manosphere in reddit. In Proceedings of the 10th ACM Conference on Web Science, WebSci ’19, page 87–96, New York, NY, USA. Association for Computing Machinery.
  • Fersini et al., (2018) Fersini, E., Nozza, D., and Rosso, P. (2018). Overview of the evalita 2018 task on automatic misogyny identification (ami). In EVALITA@CLiC-it.
  • Fiala and Tutoky, (2017) Fiala, D. and Tutoky, G. (2017). Computer science papers in web of science: A bibliometric analysis. Publications, 5(4).
  • Finlayson and Erjavec, (2017) Finlayson, M. A. and Erjavec, T. (2017). Overview of annotation creation: Processes and tools. Handbook of Linguistic Annotation, pages 167–191.
  • Fontanella et al., (2024) Fontanella, L., Chulvi, B., Ignazzi, E., Sarra, A., and Tontodimamma, A. (2024). How do we study misogyny in the digital age? a systematic literature review using a computational linguistic approach. Humanities and Social Sciences Communications, 11(1):1–15.
  • Frenda et al., (2019) Frenda, S., Ghanem, B., Montes-y Gómez, M., and Rosso, P. (2019). Online hate speech against women: Automatic identification of misogyny and sexism on twitter. Journal of intelligent & fuzzy systems, 36(5):4743–4752.
  • Frenda et al., (2018) Frenda, S., Ghanem, B., and y Gómez, M. M. (2018). Exploration of misogyny in spanish and english tweets. In IberEval@SEPLN.
  • García-Cueto et al., (2015) García-Cueto, E., Rodríguez-Díaz, F. J., Bringas-Molleda, C., López-Cepero, J., Paíno-Quesada, S., and Rodríguez-Franco, L. (2015). Development of the gender role attitudes scale (gras) amongst young spanish people. International journal of clinical and health psychology, 15(1):61–68.
  • Ging and Siapera, (2018) Ging, D. and Siapera, E. (2018). Special issue on online misogyny. Feminist Media Studies, 18(4):515–524.
  • Ging and Siapera, (2019) Ging, D. and Siapera, E. (2019). Gender Hate Online: Understanding the New Anti-Feminism. Springer.
  • Glick and Fiske, (1996) Glick, P. and Fiske, S. (1996). The ambivalent sexism inventory: Differentiating hostile and benevolent sexism. Journal of Personality and Social Psychology, 70:491–512.
  • Glick and Fiske, (1997) Glick, P. and Fiske, S. (1997). Hostile and benevolent sexism: Measuring ambivalent sexist attitudes toward women. Psychology of Women Quarterly, 21(1):119–135.
  • Gordeev and Lykova, (2020) Gordeev, D. and Lykova, O. (2020). BERT of all trades, master of some. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, pages 93–98, Marseille, France. European Language Resources Association (ELRA).
  • Grootendorst, (2020) Grootendorst, M. (2020). Keybert: Minimal keyword extraction with bert.
  • Grootendorst, (2022) Grootendorst, M. (2022). Bertopic: Neural topic modeling with a class-based tf-idf procedure.
  • Grosz and Conde-Cespedes, (2020) Grosz, D. and Conde-Cespedes, P. (2020). Automatic detection of sexist statements commonly used at the workplace.
  • Guest et al., (2021) Guest, E., Vidgen, B., Mittos, A., Sastry, N., Tyson, G., and Margetts, H. (2021). An expert annotated dataset for the detection of online misogyny. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1336–1350, Online. Association for Computational Linguistics.
  • Hammersley and Gomm, (1997) Hammersley, M. and Gomm, R. (1997). Bias in social research. Sociological Research Online, 2(1):7–19.
  • Harzing, (2007) Harzing, A. (2007). Publish or perish.
  • Herring, (1999) Herring, S. C. (1999). The rhetorical dynamics of gender harassment on-line. The Information Society, 15(3):151–167.
  • Hirsch, (1992) Hirsch, S. F. (1992). Julia penelope, speaking freely: Unlearning the lies of the fathers’ tongues. new york: Pergamon, 1990. pp. xxxvii 281. - deborah cameron (ed.), the feminist critique of language: A reader. london and new york: Routledge, 1990. pp. xi 258. Language in Society, 21(1):136–142.
  • Hosseini et al., (2017) Hosseini, H., Kannan, S., Zhang, B., and Poovendran, R. (2017). Deceiving google’s perspective api built for detecting toxic comments.
  • Hovy and Prabhumoye, (2021) Hovy, D. and Prabhumoye, S. (2021). Five sources of bias in natural language processing. Language and Linguistics Compass, 15(8):e12432.
  • Indurthi et al., (2019) Indurthi, V., Syed, B., Shrivastava, M., Chakravartula, N., Gupta, M., and Varma, V. (2019). FERMI at SemEval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 70–74, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
  • Jane, (2016) Jane, E. A. (2016). Online misogyny and feminist digilantism. Continuum, 30(3):284–297.
  • Jha and Mamidi, (2017) Jha, A. and Mamidi, R. (2017). When does a compliment become sexist? analysis and classification of ambivalent sexism using twitter data. In Proceedings of the Second Workshop on NLP and Computational Social Science, pages 7–16, Vancouver, Canada. Association for Computational Linguistics.
  • Jiang et al., (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Jurasz and Barker, (2019) Jurasz, O. and Barker, K. (2019). Online misogyny: A challenge for digital feminism? Journal of International Affairs, 72(2):95–114.
  • Karami et al., (2019) Karami, A., Swan, S., White, C. N., and Ford, K. (2019). Hidden in plain sight for too long: Using text mining techniques to shine a light on workplace sexism and sexual harassment. Psychology of Violence.
  • Katsarou et al., (2021) Katsarou, K., Sunder, S., Woloszyn, V., and Semertzidis, K. (2021). Sentiment polarization in online social networks: The flow of hate speech. In 2021 Eighth International Conference on Social Network Analysis, Management and Security (SNAMS), pages 01–08.
  • King and King, (1997) King, L. A. and King, D. W. (1997). Sex-role egalitarianism scale: Development, psychometric properties, and recommendations for future research. Psychology of Women Quarterly, 21(1):71–87.
  • Kohli et al., (2021) Kohli, G., Kaur, P., and Bedi, J. (2021). ARGUABLY at ComMA@ICON: Detection of multilingual aggressive, gender biased, and communally charged tweets using ensemble and fine-tuned IndicBERT. In Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification, pages 46–52, NIT Silchar. NLP Association of India (NLPAI).
  • Lazer et al., (2020) Lazer, D. M., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., Freelon, D., Gonzalez-Bailon, S., King, G., Margetts, H., et al. (2020). Computational social science: Obstacles and opportunities. Science, 369(6507):1060–1062.
  • Lynn et al., (2019) Lynn, T., Endo, P. T., Rosati, P., Silva, I., Santos, G. L., and Ging, D. (2019). A comparison of machine learning approaches for detecting misogynistic speech in urban dictionary. In 2019 International Conference on Cyber Situational Awareness, Data Analytics And Assessment (Cyber SA), pages 1–8.
  • Manne, (2017) Manne, K. (2017). Down Girl: The Logic of Misogyny. Oxford University Press.
  • Megarry, (2014) Megarry, J. (2014). Online incivility or sexual harassment? conceptualising women’s experiences in the digital age. Women’s Studies International Forum, 47:46–55.
  • Melville et al., (2019) Melville, S., Eccles, K., and Yasseri, T. (2019). Topic modeling of everyday sexism project entries. Frontiers in Digital Humanities, 5.
  • Mishra et al., (2019) Mishra, P., Tredici, M. D., Yannakoudakis, H., and Shutova, E. (2019). Abusive language detection with graph convolutional networks.
  • Mudrak, (2018) Mudrak, B. (2018). What are preprints, and how do they benefit authors?
  • (70) Nascimento, F. R., Cavalcanti, G. D., and Da Costa-Abreu, M. (2022a). Unintended bias evaluation: An analysis of hate speech detection and gender bias mitigation on social media using ensemble learning. Expert Systems with Applications, 201:117032.
  • (71) Nascimento, F. R., Cavalcanti, G. D., and Da Costa-Abreu, M. (2022b). Unintended bias evaluation: An analysis of hate speech detection and gender bias mitigation on social media using ensemble learning. Expert Systems with Applications, 201:117032.
  • Nozza et al., (2019) Nozza, D., Volpetti, C., and Fersini, E. (2019). Unintended bias in misogyny detection. In IEEE/WIC/ACM International Conference on Web Intelligence, WI ’19, page 149–155, New York, NY, USA. Association for Computing Machinery.
  • Ou and Li, (2020) Ou, X. and Li, H. (2020). Ynu_oxz@ haspeede 2 and ami: Xlm-roberta with ordered neurons lstm for classification task at evalita 2020. EVALITA Evaluation of NLP and Speech Tools for Italian, 2765:102–109.
  • Parikh et al., (2019) Parikh, P., Abburi, H., Badjatiya, P., Krishnan, R., Chhaya, N., Gupta, M., and Varma, V. (2019). Multi-label categorization of accounts of sexism using a neural framework.
  • Parikh et al., (2021) Parikh, P., Abburi, H., Chhaya, N., Gupta, M., and Varma, V. (2021). Categorizing sexism and misogyny through neural approaches. ACM Transactions on the Web (TWEB), 15(4):1–31.
  • Plaza-del Arco et al., (2021) Plaza-del Arco, F. M., Molina-González, M. D., López, L., and Martín-Valdivia, M. (2021). Sexism identification in social networks using a multi-task learning system. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII International Conference of the Spanish Society for Natural Language Processing., Málaga, Spain, volume 2943, pages 491–499.
  • Plaza-Del-Arco et al., (2020) Plaza-Del-Arco, F.-M., Molina-González, M. D., Ureña López, L. A., and Martín-Valdivia, M. T. (2020). Detecting misogyny and xenophobia in spanish tweets using language technologies. ACM Trans. Internet Technol., 20(2).
  • Pykes, (2023) Pykes, K. (2023). What is topic modeling? an introduction with examples.
  • Rahali et al., (2021) Rahali, A., Akhloufi, M. A., Therien-Daniel, A.-M., and Brassard-Gourdeau, E. (2021). Automatic misogyny detection in social media platforms using attention-based bidirectional-lstm. In 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2706–2711.
  • Rodríguez-Sánchez et al., (2021) Rodríguez-Sánchez, F., Carrillo-de Albornoz, J., Plaza, L., Gonzalo, J., Rosso, P., Comet, M., and Donoso, T. (2021). Overview of exist 2021: sexism identification in social networks. Procesamiento del Lenguaje Natural, 67:195–207.
  • Röttger et al., (2022) Röttger, P., Vidgen, B., Hovy, D., and Pierrehumbert, J. B. (2022). Two contrasting data annotation paradigms for subjective nlp tasks.
  • Samory et al., (2021) Samory, M., Sen, I., Kohne, J., Flöck, F., and Wagner, C. (2021). “call me sexist, but…” : Revisiting sexism detection using psychological scales and adversarial samples. Proceedings of the International AAAI Conference on Web and Social Media, 15(1):573–584.
  • Sap et al., (2019) Sap, M., Card, D., Gabriel, S., Choi, Y., and Smith, N. A. (2019). The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.
  • Schmid Mast, (2004) Schmid Mast, M. (2004). Men are hierarchical, women are egalitarian: An implicit gender stereotype. Swiss Journal of Psychology, 63(2):107–111.
  • Schütz et al., (2022) Schütz, M., Boeck, J., Liakhovets, D., Slijepčević, D., Kirchknopf, A., Hecht, M., Bogensperger, J., Schlarb, S., Schindler, A., and Zeppelzauer, M. (2022). Automatic sexism detection with multilingual transformer models.
  • Sensales and Areni, (2017) Sensales, G. and Areni, A. (2017). Gender biases and linguistic sexism in political communication: A comparison of press news about men and women italian ministers. Journal of Social and Political Psychology, 5(2).
  • SerpAPI, (2019) SerpAPI (2019).
  • Shah et al., (2020) Shah, D. S., Schwartz, H. A., and Hovy, D. (2020). Predictive biases in natural language processing models: A conceptual framework and overview. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J., editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264, Online. Association for Computational Linguistics.
  • Sharifirad et al., (2018) Sharifirad, S., Jafarpour, B., and Matwin, S. (2018). Boosting text classification performance on sexist tweets by text augmentation and text generation using a combination of knowledge graphs. In Proceedings of the 2nd workshop on abusive language online (ALW2), pages 107–114.
  • Siddiqi et al., (2018) Siddiqi, N., Bains, A., Aleem, S., and Aleem, S. (2018). Analysing threads of sexism in new age humour: A content analysis of internet memes. Indian journal of social research, 59:356.
  • SIGEDU, (2024) SIGEDU (2024).
  • Singh et al., (2021) Singh, S., Anand, T., Ghosh Chowdhury, A., and Waseem, Z. (2021). “hold on honey, men at work”: A semi-supervised approach to detecting sexism in sitcoms. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 180–185, Online. Association for Computational Linguistics.
  • Sutton and Gong, (2017) Sutton, C. and Gong, L. (2017). Popularity of arxiv.org within computer science. arXiv preprint arXiv:1710.05225.
  • Swim et al., (2004) Swim, J., Mallett, R., and Stangor, C. (2004). Understanding subtle sexism: Detection and use of sexist language. Sex Roles, 51:117–128.
  • Swim and Cohen, (1997) Swim, J. K. and Cohen, L. L. (1997). Overt, covert, and subtle sexism: A comparison between the attitudes toward women and modern sexism scales. Psychology of Women Quarterly, 21(1):103–118.
  • Takkouche and Norman, (2011) Takkouche, B. and Norman, G. (2011). Prisma statement. Epidemiology, 22(1):128.
  • Talavera et al., (2021) Talavera, I., Fidalgo, D. C., and Vila-Suero, D. (2021). System description for exist shared task at iberlef 2021: Automatic misogyny identification using pretrained transformers. In IberLEF@ SEPLN, pages 484–490.
  • Tougas et al., (1995) Tougas, F., Brown, R., Beaton, A. M., and Joly, S. (1995). Neosexism: Plus ça change, plus c’est pareil. Personality and social psychology bulletin, 21(8):842–849.
  • Vidgen and Derczynski, (2020) Vidgen, B. and Derczynski, L. (2020). Directions in abusive language training data, a systematic review: Garbage in, garbage out. Plos one, 15(12):e0243300.
  • Waseem, (2016) Waseem, Z. (2016). Are you a racist or am I seeing things? annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142, Austin, Texas. Association for Computational Linguistics.
  • Waseem and Hovy, (2016) Waseem, Z. and Hovy, D. (2016). Hateful symbols or hateful people? predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California. Association for Computational Linguistics.
  • WHO, (2013) WHO (2013). Violence against women.
  • Wrisley, (2023) Wrisley, S. P. (2023). Feminist theory and the problem of misogyny. Feminist Theory, 24(2):188–207.
  • Yasseri et al., (2016) Yasseri, T., Eccles, K., and Melville, S. (2016). Sexism typology: Literature review.
  • Zeinert et al., (2021) Zeinert, P., Inie, N., and Derczynski, L. (2021). Annotating online misogyny. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3181–3197, Online. Association for Computational Linguistics.
  • Zhang, (2014) Zhang, L. (2014). The impact of data source on the ranking of computer scientists based on citation indicators: a comparison of web of science and scopus. Issues in Science and Technology Librarianship.
  • Zhao et al., (2018) Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. (2018). Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20, New Orleans, Louisiana. Association for Computational Linguistics.

Appendix A Systematic Literature Review strategy

A.1 Draft search string

Draft string length: 256 character limit

  1. (misogyny OR sexism)

  2. (hate OR toxic OR abusive OR offensive)

  3. (detection OR identification OR prediction OR classification)

  4. ("natural language processing" OR NLP OR "deep learning" OR "machine learning" OR ML OR "artificial intelligence" OR AI)

  5. 1/ AND 2/ AND 3/ AND 4/

  6. Limit 5 to (english language and yr="2012 -Current")
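As an illustration of how concept groups 1 to 4 combine in step 5, the following sketch joins the terms within each group with OR, joins the groups with AND, and checks the result against the 256-character limit stated above. The helper names are hypothetical, and the language and year limits of step 6 are deliberately left out because their syntax depends on the database.

```python
# Concept groups taken from steps 1-4 of the draft search string.
concept_groups = [
    ["misogyny", "sexism"],
    ["hate", "toxic", "abusive", "offensive"],
    ["detection", "identification", "prediction", "classification"],
    ["natural language processing", "NLP", "deep learning",
     "machine learning", "ML", "artificial intelligence", "AI"],
]

MAX_LENGTH = 256  # draft string length limit

def or_group(terms):
    """Join terms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

def build_query(groups):
    """Combine the OR-groups with AND (step 5)."""
    return " AND ".join(or_group(g) for g in groups)

query = build_query(concept_groups)
within_limit = len(query) <= MAX_LENGTH
```

Building the string programmatically makes it easy to add or drop a term and immediately re-check the length constraint.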

A.2 Inclusion and exclusion criteria

  1. Remove posts from online publishing platforms, online research platforms or similar (e.g. blogs)

  2. Remove papers outside the year range (2012-2022)

  3. Remove papers not written in English

  4. Remove dissertations, theses, books, and whole conference proceedings; but include pre-prints within the period

  5. Remove symposium submissions

  6. Limit by date of external events (2000-current)

  7. Limit by the platform used for the study; comparative studies across platforms may be included

  8. Remove studies not looking at text data (i.e. images, video, etc.)

  9. Remove studies that look into offline instances of sexism and misogyny

  10. Remove studies that do not look into online social platforms (like Meta, Twitter, Reddit, etc.)

  11. Remove studies that focus on the mental and physical impact of online hate speech from the aforementioned platforms.

  12. Only keep papers that measure misogyny and/or sexism.

    (a) This means removing studies with no quantitative methods, papers proposing guidelines, policy recommendations, discussions, tutorials, dataset descriptions, research briefs, working papers, purely theoretical approaches, opinion pieces, position papers, case studies, etc.

    (b) Removing studies where frameworks are only stated without any measurements/results following them.

    (c) It will include papers that measure misogyny/sexism with other forms of online hate, such as toxicity, hate-speech, aggression, etc.

    (d) It can include gender-bias classification studies that fall close to the definition of sexism/misogyny as generic terms, depending on the context in which they are used.
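Some of these criteria (year range, language, document type) can be checked mechanically from bibliographic metadata, while others require reading the paper. The sketch below shows what such a metadata-only screen might look like; it is an illustrative assumption and not the screening code used in this review. The `Record` fields, the document-type labels and the example records are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical labels for document types excluded by criteria 1, 4 and 5.
EXCLUDED_TYPES = {"blog", "dissertation", "thesis", "book",
                  "proceedings", "symposium"}

@dataclass
class Record:
    title: str
    year: int
    language: str
    doc_type: str

def passes_metadata_screen(record, first_year=2012, last_year=2022):
    """Apply the criteria that can be decided from metadata alone."""
    if not (first_year <= record.year <= last_year):   # criterion 2
        return False
    if record.language.lower() != "english":           # criterion 3
        return False
    if record.doc_type.lower() in EXCLUDED_TYPES:      # criteria 1, 4, 5
        return False
    return True  # remaining criteria need manual or content-based review

records = [
    Record("Paper A", 2019, "English", "article"),
    Record("Paper B", 2010, "English", "article"),
    Record("Paper C", 2021, "Spanish", "article"),
    Record("Paper D", 2020, "English", "thesis"),
    Record("Paper E", 2022, "English", "preprint"),
]
kept = [r.title for r in records if passes_metadata_screen(r)]
```

Records that pass this screen would still need to be assessed against the content-based criteria (for example, criteria 8 to 12).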

Appendix B Citation Database queries

Citation database | Search query
Google Scholar | ((misogyny OR sexism) AND (hate OR toxic OR abusive OR offensive) AND (detection OR identification OR prediction OR classification) AND ("natural language processing" OR NLP OR "deep learning" OR "machine learning" OR ML OR "artificial intelligence" OR AI) AND (language="English" AND yr="2012 -2022"))
ArXiv | (all:sexism+OR+all:sexist+OR+all:misogyny+OR+all:misogynist+OR+all:%22gender+discrimination%22+OR+all:%22gender+violence%22+OR+all:%22gender+stereotype%22)
Elsevier | ('misogyny detection OR misogyny identification OR misogyny prediction OR misogyny classification OR sexism detection OR sexism identification OR sexism prediction OR sexism classification')
Scopus | TITLE-ABS-KEY (( misogyny OR sexism OR gender AND violence OR gender AND discrimination ) AND ( detection OR identification OR prediction OR classification ) AND PUBYEAR > 2011 AND PUBYEAR < 2023 AND ( LIMIT-TO ( SUBJAREA , "SOCI" ) OR LIMIT-TO ( SUBJAREA , "COMP" ) OR LIMIT-TO ( SUBJAREA , "PSYC" ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ))
Semantic Scholar | ('online sexism misogyny')
Web of Science (Social Science) | TS=((misogyn* OR sexis* OR (gender NEAR/10 discrim*) OR (gender NEAR/10 stereoty*) OR (gender NEAR/10 violence) OR (gender NEAR/10 based)) NEAR/200 (detect* OR identif* OR predict* OR classif*)) AND WC=(("History" OR "Political Science" OR "Women's Studies" OR "Social Sciences" OR "International Relations" OR "History %26 Philosophy Of Science" OR "Linguistics" OR "Anthropology" OR "Sociology" OR "Social Work" OR "Language %26 Linguistics" OR "Information Science" OR "Psychology" OR "Social" OR "Ethnic Studies" OR "Philosophy" OR "Psychiatry") NOT ("Computer Science" OR "Artificial Intelligence" OR "Theory %26 Methods" OR "Engineering" OR "Software Engineering" OR "Scientific Disciplines" OR "Automation %26 Control Systems" OR "Mathematical" OR "Mathematics" OR "Mathematical Methods")) AND PY=2012-2022
Web of Science (Computer Science) | TS=((misogyn* OR sexis* OR (gender NEAR/10 discrim*) OR (gender NEAR/10 stereoty*) OR (gender NEAR/10 violence) OR (gender NEAR/10 based)) NEAR/200 (detect* OR identif* OR predict* OR classif*)) AND WC=(("Computer Science" OR "Artificial Intelligence" OR "Theory %26 Methods" OR "Engineering" OR "Software Engineering" OR "Scientific Disciplines" OR "Automation %26 Control Systems" OR "Mathematical" OR "Mathematics" OR "Mathematical Methods") NOT ("History" OR "Political Science" OR "Women's Studies" OR "Social Sciences" OR "International Relations" OR "History %26 Philosophy Of Science" OR "Linguistics" OR "Anthropology" OR "Sociology" OR "Social Work" OR "Language %26 Linguistics" OR "Information Science" OR "Psychology" OR "Social" OR "Ethnic Studies" OR "Philosophy" OR "Psychiatry")) AND PY=2012-2022
Table B.1: Citation databases and their respective queries

Appendix C Terminologies and their meaning

Term | Abbreviation | Meaning
Construct | - | “A construct is an abstract concept that is specifically chosen (or ‘created’) to explain a given phenomenon. Constructs used for scientific research must have precise and clear definitions that others can use to understand exactly what it means and what it does not mean.” (Bhattacherjee, 2019)
Computational Social Science | CSS | Computational social science is an interdisciplinary academic sub-field concerned with computational approaches to the social sciences. It leverages the capacity to collect and analyze data with an unprecedented breadth, depth and scale (Lazer et al., 2020).
Hostile Sexism | HS | Hostile sexism refers to negative views toward individuals who violate traditional gender roles. For example, some people disparage girls who enter traditionally masculine domains such as science or sports (Daniels and Leaper, 2011). Part of ambivalent sexism (Glick and Fiske, 1996).
Neo-sexism Scale | NS | A scale designed to tap into a new type of gender prejudice, called neo-sexist beliefs (Tougas et al., 1995).
Benevolent Sexism | BS | Benevolent sexism includes valuing feminine-stereotyped attributes in females (e.g., nurturance) and a belief that traditional gender roles are necessary to complement one another. Benevolent sexism also includes the view known as paternalism that females need to be protected by males. Benevolent sexism contributes to gender inequality by limiting women’s roles (Daniels and Leaper, 2011). Part of ambivalent sexism (Glick and Fiske, 1996).
Bidirectional Encoder Representations from Transformers | BERT | BERT is a language representation model, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers (Devlin et al., 2018).
Language Models (or Large Language Models) | LM (or LLM) | A large language model is a computational model capable of language generation or other natural language processing tasks. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.
Bag-of-words | BoW | “A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: 1. A vocabulary of known words. 2. A measure of the presence of known words.” (https://tinyurl.com/5n6d9knt)
Descriptive Paradigm | - | “The descriptive paradigm encourages annotator subjectivity to create datasets as granular surveys of individual beliefs. Descriptive data annotation thus allows for the capturing and modeling of different beliefs.” (Röttger et al., 2022)
Prescriptive Paradigm | - | “The prescriptive paradigm, on the other hand, discourages annotator subjectivity and instead tasks annotators with encoding one specific belief, formulated in the annotation guidelines. Prescriptive data annotation thus enables the training of models that seek to consistently apply one belief.” (Röttger et al., 2022)
Table C.1: Terminologies

Appendix D Experimentation results from other citation databases

For Google Scholar, we tried both external APIs like SerpAPI (SerpAPI, 2019) for scraping the data and a software tool named 'Publish or Perish' (Harzing, 2007) to collect the search results. Both methods were rejected because of their disadvantages. For instance, Publish or Perish could only extract 1000 results at a time for each search query. While this drawback was overcome by searching over shorter ranges of years to stay within the limit, the output lacked some of the fields needed for this study, namely abstract and discipline. SerpAPI, in turn, worked like a web scraping tool and could only scrape the results as the search engine displays them, i.e., only what Google shows on its Google Scholar pages and nothing more. Even though the fields we got through this API were relevant, they did not contain the full information we needed for the analysis. For example, the full text of the title and abstract was missing and was instead truncated with ellipses at the beginning and end of the text.

Appendix E Web of Science strategy

We performed automated elimination (or pre-processing) techniques based on the following criteria to narrow down our search results for both areas of study (note 14: more details can be found at https://images.webofknowledge.com/images/help/WOS/hp_advanced_search.html):

  • Remove studies that are not published in English.

  • Remove studies that do not contain any abstracts.

  • Keep only the first abstract in studies that contain more than one abstract.

  • Remove certain publication types, such as review articles and editorials.
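The elimination criteria above could be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the record fields (`language`, `abstracts`, `doc_type`) and the labels in `EXCLUDED_TYPES` are hypothetical names chosen for the example.

```python
# Hypothetical record layout; all field names are assumptions for illustration.
EXCLUDED_TYPES = {"review", "editorial"}  # assumed labels for excluded publication types


def preprocess(records):
    """Apply the four elimination criteria to a list of record dicts."""
    kept = []
    for rec in records:
        if rec.get("language", "").lower() != "english":
            continue  # not published in English
        abstracts = [a for a in rec.get("abstracts", []) if a and a.strip()]
        if not abstracts:
            continue  # no abstract available
        if rec.get("doc_type", "").lower() in EXCLUDED_TYPES:
            continue  # review articles, editorials and similar
        # keep only the first abstract when several are present
        kept.append({**rec, "abstract": abstracts[0]})
    return kept


sample = [
    {"title": "A", "language": "English", "abstracts": ["first", "second"], "doc_type": "Article"},
    {"title": "B", "language": "Spanish", "abstracts": ["x"], "doc_type": "Article"},
    {"title": "C", "language": "English", "abstracts": [], "doc_type": "Article"},
    {"title": "D", "language": "English", "abstracts": ["y"], "doc_type": "Review"},
]
filtered = preprocess(sample)
print([r["title"] for r in filtered])
```

Only record A survives in this toy input, and its retained abstract is the first of the two.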

With the Web of Science API, separate search queries were used for the two broad disciplines (or research areas): CS and Social Science. The categories of the research areas taken for each of them are as follows:

Computer Science: Computer Science; Artificial Intelligence; Theory and Methods; Engineering; Software Engineering; Scientific Disciplines; Automation and Control Systems; Mathematical; Mathematics; Mathematical Methods
Social Science: History; Political Science; Women's Studies; Social Sciences; International Relations; History and Philosophy of Science; Linguistics; Anthropology; Sociology; Social Work; Language and Linguistics; Information Science; Psychology; Social; Ethnic Studies; Philosophy; Psychiatry
Table E.1: Categories for each area of research

These disciplines were taken from the Web of Science category list, which branches from five major research areas, out of which we took the two categories Social Sciences and Technology. The published works present in the Web of Science Core Collection are assigned to at least one Web of Science category. Each of the said Web of Science categories (as listed in Table E.1) is mapped to one research area found in the classification of research areas (note 15, source: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html).

Appendix F ArXiv strategy

The ArXiv API was used following the query search strategy (note 16: more details of the search strategy can be found at https://info.arxiv.org/help/api/user-manual.html#query_details).
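As an illustration of this query strategy, the sketch below assembles the ArXiv search string listed in Table B.1 from its individual terms. The helper names, the `TERMS` list layout and the paging values are assumptions made for the example; only the `search_query`, `start` and `max_results` parameters come from the ArXiv API user manual. No request is sent.

```python
from urllib.parse import quote_plus

ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"

# Search terms taken from the ArXiv row of Table B.1.
TERMS = [
    "sexism", "sexist", "misogyny", "misogynist",
    "gender discrimination", "gender violence", "gender stereotype",
]


def build_search_query(terms):
    """Join terms with OR, searching all fields; multi-word terms become quoted phrases."""
    parts = []
    for term in terms:
        # quote_plus turns '"' into %22 and spaces into '+'
        encoded = quote_plus(f'"{term}"') if " " in term else quote_plus(term)
        parts.append(f"all:{encoded}")
    return "+OR+".join(parts)


def build_url(terms, start=0, max_results=100):
    """Full request URL; the paging defaults are illustrative, not the study's settings."""
    query = build_search_query(terms)
    return f"{ARXIV_ENDPOINT}?search_query={query}&start={start}&max_results={max_results}"


print(build_url(TERMS))
```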

We performed automated elimination (or pre-processing) techniques based on the following criteria to narrow down our search results for both areas of study:

  • Remove studies that are not between 2012 and 2022.

  • Remove studies that do not contain any abstracts.

While combining the search results from Web of Science (Appendix E) and ArXiv (Appendix F), care was taken to remove duplicate studies based on the title and abstract, keeping the record from the former database. This ensures consistency across the data, since the published and updated years (i.e., when the pre-prints were submitted to ArXiv) could differ, and it prevents published works from being mislabeled as pre-prints.
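The de-duplication step can be sketched as below. This is a hedged illustration only: the text normalisation, the record fields and the `source` labels are assumptions, and the example treats Web of Science as the preferred source and ArXiv as the secondary one.

```python
import re


def _key(record):
    """Normalise title and abstract so trivial formatting differences do not block a match."""
    text = f"{record['title']} {record['abstract']}"
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def merge_without_duplicates(primary, secondary):
    """Concatenate two result lists, keeping the `primary` record when both contain a study."""
    seen = set()
    merged = []
    for rec in list(primary) + list(secondary):
        k = _key(rec)
        if k in seen:
            continue  # duplicate of a record already kept
        seen.add(k)
        merged.append(rec)
    return merged


# Toy records; the field names and values are hypothetical.
wos = [{"title": "Detecting Misogyny", "abstract": "We study X.", "source": "wos", "year": 2020}]
arxiv = [
    {"title": "Detecting misogyny", "abstract": "We study X.", "source": "arxiv", "year": 2019},
    {"title": "Another paper", "abstract": "We study Y.", "source": "arxiv", "year": 2021},
]
merged = merge_without_duplicates(wos, arxiv)
print([(r["title"], r["source"]) for r in merged])
```

Because the preferred list is iterated first, the published record (with its publication year) is the one retained when a pre-print of the same study also appears.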

Appendix G Further analysis of the initial search results

Documents by disciplines

Figure G.1: (a) Type of publications in Computer Science. (b) Type of publications in Social Science.

Figure G.1 shows the frequency of publications per year from 2012 to 2022, by discipline and publication type. As discussed in Section 3.3, there is a large disparity between the disciplines in the number of publications that focus on sexism and/or misogyny. This appears to affect the diversity of the concepts explored by the disciplines, with SS exploring a broader range of themes than CS. The type of publication also differs considerably, as CS tends to release a handful of its studies on this topic as pre-prints.

Documents focused on social media platforms

Figure G.2: Publications mentioning social media platforms in titles and/or abstracts in: (a) Computer Science. (b) Social Science.

The share of documents focusing on different social media platforms, shown in Figure G.2, reveals that X (formerly Twitter) was the dominant platform for most research in CS, while Facebook (or Meta) was more dominant in SS until 2022. The ease of access to Twitter data during the period could have been a contributing factor, as it allowed the application of automated approaches in CS. Facebook's larger number of active users, in turn, could have contributed to more SS research on it than on any other platform (including Twitter).

General topics centering around sexism or misogyny

Figure G.3: General topics centred around sexism/misogyny over the years in: (a) Computer Science. (b) Social Science.

Figure G.3 shows the different thematic (or topic) representations across the disciplines over the period 2012-2022. Not only do we see a wider range of themes in SS, spread over a larger number of studies (as also observed in the previous subsection), but also a steady rise in most of the topics over time. In particular, the themes ‘Feminism with misogyny/sexism’ and ‘Hostile sexism’ seem to be of particular interest for SS research, given the proliferation of sexism and misogyny beyond offline spaces. The theme ‘Linguistics in sexism’ is instrumental in capturing the subtle forms of sexism, and is therefore seen to gain traction over the years. The themes of CS research on sexism and misogyny seem to fluctuate in the given period with no consistent rise, except for ‘Gender-based violence’.

Appendix H Abbreviation of models

The abbreviations used in Figure 4.2 are a collection of the following models as shown in Table H.1.

Abbreviation Full name of the model(/s)
LR Logistic Regression
RF Random Forest
SVM Support Vector Machine
BERT BERT, RoBERTa, mtBERT, FlauBERT, XLMRoBERTa, BERT-base, among other BERT based models
CNN Convolutional neural network
NB Naïve-Bayes, MultinomialNB
LSTM LSTM, Bi-LSTM
W2V Word2Vec, GloVe
LDA Latent Dirichlet Allocation
GB Gradient Boosting, CatBoost
DT Other Decision Tree models
GCN Graph Convolutional Network
RNN Recurrent Neural Network
DNN Deep neural network (unspecified)
XGB XGBoost
kNN k-Nearest Neighbours
BoW Bag-of-Words
RC Ridge Classifier
n-grams unigrams, bi-grams and other types of n-grams
IG Information Gain
MLP Multi-layer Perceptron
Embeddings FastText, InferSent, Universal Sentence Encoder, and other types of embeddings
OVR One-vs-Rest
GRU Gated Recurrent Units
Table H.1: Model names and their abbreviation

Appendix I Most frequent keywords

In this section, we show the top 100 keywords in the co-occurrence network, as in Section 3.3.2, but based on all the manuscripts of each individual field.
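For readers unfamiliar with keyword co-occurrence networks, the sketch below shows one common way to derive the nodes and weighted edges of such a network. It is an assumption-laden illustration, not the procedure used to produce the figures: it counts document frequency per keyword, keeps the `top_n` most frequent keywords, and weights an edge by the number of manuscripts in which both keywords appear.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(keyword_lists, top_n=100):
    """Return (keyword document frequencies, edge weights) for the top_n keywords."""
    # Document frequency: each keyword is counted once per manuscript.
    freq = Counter(k for kws in keyword_lists for k in set(kws))
    top = {k for k, _ in freq.most_common(top_n)}
    edges = Counter()
    for kws in keyword_lists:
        present = sorted(set(kws) & top)  # sorted so each pair has one canonical order
        edges.update(combinations(present, 2))
    return freq, edges


# Toy keyword lists, one per manuscript (hypothetical data).
manuscripts = [
    ["sexism", "hostile sexism"],
    ["sexism", "gender stereotypes"],
    ["sexism", "hostile sexism", "gender stereotypes"],
]
freq, edges = cooccurrence_edges(manuscripts)
print(freq.most_common(3))
print(edges.most_common(3))
```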

I.1 Most frequent keywords in Computer Science

Figure I.1: Network diagram of most frequent keywords in Computer Science. Among all the top 100 frequent and relevant keywords, the 6 most common ones (in descending order) are highlighted in the figure: 1. gender stereotypes 2. ambivalent sexism 3. hostile sexism 4. sexism 5. benevolent sexism 6. sexist attitudes

In Figure I.1, the top 5 most common keywords are labelled in red boxes.

I.2 Most frequent keywords in Social Science

Figure I.2: Network connection of most frequent keywords in Social Science

Appendix J Expanding on the automated selection techniques used

J.1 Topic Modeling approach

The model starts by transforming the input documents (abstracts and titles) into numerical representations with the help of an embedding, which in this case is a sentence embedding. Sentence embedding with transformer models maps a text of variable length to a fixed-size embedding that should be representative of the meaning of the input text. For our research, we used the sentence transformer ‘bge-small-en-v1.5’ (note 17: the Hugging Face page of the model is https://huggingface.co/BAAI/bge-small-en-v1.5), which maps each paragraph of our documents to a 384-dimensional dense vector space that was then used to cluster topics of similar semantic structure. In topic modeling, good-quality topic representations are key to interpreting the overall topic and understanding patterns in the documents, for which we used bag-of-words (BoW) with n-grams of medium length (1 to 3). To further enhance the representativeness of the topics from BoW, the Term Frequency-Inverse Document Frequency (TF-IDF) weights of our documents, which work at the document level, were adjusted to c-TF-IDF, which works at the cluster (or topic) level. c-TF-IDF considers the differences between documents from different clusters, and for a term x within class c it can be calculated as:

$$W_{x,c} = \lVert tf_{x,c} \rVert \times \log\left(1 + \frac{A}{f_{x}}\right) \qquad (1)$$

where $tf_{x,c}$ = frequency of word x in class c, $f_{x}$ = frequency of word x across all classes, and $A$ = average number of words per class.

Though both of these approaches did a good job of acquiring the topic representations, we used representation models to fine-tune the topics and refine their representations. For that, we used a combination of three models: a fast keyword extraction model called KeyBERTInspired, a PartOfSpeech model, and a MaximalMarginalRelevance model. The KeyBERTInspired model increases the coherence and reduces stopwords.
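To make the c-TF-IDF weighting in Equation (1) concrete, the following sketch computes it for two toy classes. It is an illustration under stated assumptions rather than the implementation used in the study: tokens are split on whitespace, and the norm on $tf_{x,c}$ is taken to be an L1 normalisation (term count divided by the total number of words in the class).

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """class_tokens maps a class label to the tokens of all its documents, concatenated."""
    counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    words_per_class = {c: sum(cnt.values()) for c, cnt in counts.items()}
    avg_words = sum(words_per_class.values()) / len(counts)  # A in Equation (1)
    freq_all = Counter()
    for cnt in counts.values():
        freq_all.update(cnt)  # f_x: frequency of each word across all classes
    weights = {}
    for c, cnt in counts.items():
        weights[c] = {
            # L1-normalised tf (an assumption) times log(1 + A / f_x)
            x: (tf / words_per_class[c]) * math.log(1 + avg_words / freq_all[x])
            for x, tf in cnt.items()
        }
    return weights


# Toy classes with hypothetical content.
classes = {
    "cs": "model detection model classifier".split(),
    "ss": "attitudes survey model gender".split(),
}
w = c_tf_idf(classes)
for label, scores in w.items():
    print(label, sorted(scores, key=scores.get, reverse=True))
```

In this toy input, "model" appears in both classes, so within the "ss" class it receives a lower weight than class-specific words such as "survey".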

Alongside this approach, we tried to further refine our topic representation by fine-tuning with a Large Language Model (LLM) named ‘Mistral 7B v0.1’, a 7-billion-parameter language model that has been shown to outperform other state-of-the-art language models like Llama 13B across all evaluated benchmarks (Jiang et al., 2023).

from bertopic import BERTopic
from bertopic.representation import (KeyBERTInspired, LlamaCPP,
                                     MaximalMarginalRelevance, PartOfSpeech)
from llama_cpp import Llama

# The main representation of a topic
main_representation = KeyBERTInspired()

# Additional ways of representing a topic
pos_patterns = [
    [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
    [{'POS': 'NOUN'}], [{'POS': 'ADJ'}]
]
aspect_model1 = PartOfSpeech("en_core_web_sm",
                             pos_patterns=pos_patterns)
# A list chains the models: KeyBERTInspired first, then MMR for diversity
aspect_model2 = [KeyBERTInspired(top_n_words=30, random_state=1234),
                 MaximalMarginalRelevance(diversity=.5)]

# LLM model
llm = Llama(model_path="../openhermes-2.5-mistral-7b.Q3_K_M.gguf",
            n_gpu_layers=-1, n_ctx=4096, stop=["Q:", "\n"])
prompt = """ Q:
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the above information, can you give a short label
of the topic of at most 5 words?
A:
"""
aspect_model3 = LlamaCPP(llm, prompt=prompt)

# Add all models together to be run in a single `fit`
representation_model = {
    "Main": main_representation,
    "Aspect1": aspect_model1,
    "Aspect2": aspect_model2,
    "Aspect3": aspect_model3
}
# The documents to train on (`docs`) are the titles and abstracts of the studies
topic_model = BERTopic(representation_model=representation_model).fit(docs)

To assess the model performance, perplexity and coherence scores were calculated as well. Perplexity is a predictive-likelihood measure that captures the probability of new data occurring given what the model has already learned. In other words, perplexity characterizes how surprised a model is by new, unseen data. Coherence is typically used to analyze the relationship between two sets of data or the similarity between data sets. In topic modeling, topic coherence measures the quality of the topics by comparing the semantic similarity between the most frequent words in a topic. We used this to maximize intra-topic and minimize inter-topic similarity. We attained a perplexity score of 1.23 and a coherence score of 0.35 from our topic model.
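The two metrics can be illustrated with the generic definitions below. This is a sketch under explicit assumptions and not the computation behind the reported scores: perplexity is taken as the exponential of the negative mean log-probability of held-out tokens, and coherence is computed as the mean normalised pointwise mutual information (NPMI) over pairs of a topic's top words, using document-level co-occurrence. The variant of coherence used in the study is not specified here.

```python
import math
from itertools import combinations


def perplexity(token_probs):
    """exp of the negative mean log-probability the model assigns to held-out tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of top words (needs at least two words)."""
    doc_sets = [set(d) for d in documents]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n

    scores = []
    for a, b in combinations(top_words, 2):
        p_ab = prob(a, b)
        if p_ab == 0:
            scores.append(-1.0)  # words never co-occur: minimum NPMI
        elif p_ab == 1:
            scores.append(1.0)   # words always co-occur: maximum NPMI
        else:
            pmi = math.log(p_ab / (prob(a) * prob(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)


# Toy tokenised documents (hypothetical data).
docs_tokens = [
    ["hostile", "sexism"],
    ["hostile", "sexism"],
    ["benevolent", "sexism"],
    ["survey"],
]
ppl = perplexity([0.5, 0.5, 0.5, 0.5])
coh = npmi_coherence(["hostile", "sexism"], docs_tokens)
print(ppl, coh)
```

A lower perplexity means the model is less surprised by unseen text, while NPMI ranges from -1 (words never co-occur) to 1 (words always co-occur).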

Appendix K Analysis of the Computer Science studies - the final selection

In this section, we explore the data statistics for the CS manuscripts which were finally selected before the full-text screening process. The visual analysis is purely based on the text contained in abstracts and titles of the selected studies.

K.1 Documents by models

Figure K.1: Models gathered from the abstracts and titles of Computer Science studies

K.2 Task types and the social platforms they are experimented on

Figure K.2: (a) Types of task present in CS for the quantification of sexism and misogyny. The task types in the figure represent the tasks that we ideally expect a paper to have when quantifying the said terms. Note that all three task types are relevant for our work, and there are 110 in total. (b) The aforementioned tasks and their application on different social media platforms. Twitter, Reddit and Facebook are the most researched platforms, while the other platforms are studied less often. Regardless, a good number of studies on those platforms use the specified tasks.