An excerpt from a pre-publication version of:

Belskie, M., Zhang, H., & Hemminger, B. M. (2023). Measuring Toxicity Toward Women in Game-Based Communities. Journal of Electronic Gaming and Esports, 1(1).



Literature Review

Defining toxicity and taxonomies of toxicity

Although there is some disagreement on what might constitute toxicity, we framed part of our own usage to be consistent with Beres et al. (2021), who define it as “various types of negative behaviors involving abusive communications directed towards other players and disruptive gameplay that violates the rules and social norms of the game.” Other definitions have included “negative behaviors exhibited by players in online environments” (Türkay et al., 2020) or a hodgepodge of terms including “incivility, griefing, and degrading comments” (Shen et al., 2020). Although the Beres et al. (2021) definition lacks some specificity, it is still a comprehensive definition: it includes the actors (both those who are toxic and those who are recipients of toxicity), it makes clear that toxicity comprises both actions and comments, it acknowledges the role of cultural norms, and it is framed within the context of esports.

The taxonomy presented by Beres et al. (2021) was partially useful in describing the toxicity that we saw in posts. Their framework includes flaming, trolling, griefing, and spamming as types of toxic language or behavior evident in games. Another taxonomy, presented by Mall et al. (2020), focused instead on patterns of toxic behavior by users across time and comprised fickle-minded, steady, radicalized, and pacified users, which was less useful for our research objectives. Xia et al. (2020) are less concerned with a strict definition of toxicity and instead focus on a two-part dynamic: things that are toxic and things that cause toxicity. The Perspective API (Perspective Developers: Attributes & Languages, 2022) taxonomy includes attributes for Toxicity, Severe_Toxicity, Identity_Attack, Insult, Profanity, and Threat. The taxonomy of Risch and Krestel (2020), containing classifications for profanity, insults, threats, identity hate, and “otherwise toxic,” was deemed most useful to our research for a few reasons: it helped us address whether comments toxic towards women are more frequently of a specific type, it permitted us to analyze how accurate automated tools are at identifying different types of toxicity, and it mapped neatly onto the Perspective API analysis that we used.
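For reference, the rough correspondence we have in mind between the Risch and Krestel (2020) classes and the Perspective API attributes can be sketched as below; the exact pairing shown here is our own illustrative assumption, not a crosswalk published by either source.

```python
# Illustrative mapping between Risch and Krestel (2020) classes and
# Perspective API attribute names; the pairing is an assumption for
# illustration, not an official crosswalk.
RISCH_KRESTEL_TO_PERSPECTIVE = {
    "profanity": "PROFANITY",
    "insults": "INSULT",
    "threats": "THREAT",
    "identity hate": "IDENTITY_ATTACK",
    "otherwise toxic": "TOXICITY",  # catch-all; SEVERE_TOXICITY is a stricter variant
}
```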

Toxicity toward women in games and game communities

Toxic language and behavior, particularly towards women, are prevalent in games and game communities. Even when content is scrubbed, much of the damage has already been done, as was seen in the #GamerGate and Fappening controversies on Reddit. In these cases, the management (or lack thereof) of content in forums related to these topics actively contributes to increased toxicity, especially toxicity towards women (Farrell et al., 2019; Massanari, 2017). In a chapter examining successful examples of mixed-gender sports, equestrian sports and riflery, Andrews and Crawford (2021) noted that safe and non-toxic environments were minimum requirements for successful participation by women.

Darvin et al.’s (2021) contributions have been significant: a strong qualitative study of ten women who are professional gamers and executives in the esports industry. Their key findings include that much of the evident toxicity is both top-down and bottom-up. The companies that make the games which create these communities are themselves subject to toxicity allegations, and these same patterns of toxicity are evident at every level of the esports vertical, from upper management down to casual communities. They also found that perceptions that women just “aren’t interested” in esports are false and mirror the perceptions applied to women in traditional sports prior to the creation and implementation of Title IX. A similar study by Hayday and Collison (2019) conducted focus groups with 65 participants, followed by interviews with 16 of them, and likewise found that the effects of toxic masculinity in the space greatly contributed to the negative experiences reported by women in esports. Together, these studies illustrate the ways that toxicity in esports and esports communities is experienced by women as actively harmful, not just theoretically harmful.

Much of the toxicity towards women in games, much of it borne out in comments in game communities, takes the form of expectations that women fulfill only certain roles in games (Ruotsalainen & Friman, 2018). What is damning for women is that they are subjected to verbal abuse and other toxicity for selecting these “pre-approved” roles, specifically Support roles or female characters, and are also subjected to verbal abuse and other toxicity for not “knowing their place” when they depart from the “permitted” roles. These attitudes are further compounded by the fact that when women do experience in-game success with these characters, they are branded as “one-tricks,” a derogatory insult meant to imply that the player is unskilled and that their success is ill-gotten. This precarious condition for inclusion in esports and esports communities makes these spaces especially toxic to women and functions to actively exclude them.

While literature exploring toxicity towards women in games is abundant, the existing literature is incomplete. There is a lack of taxonomic understanding of toxicity directed toward women, which contributes to the lack of understanding of why current approaches to automated toxicity detection are insufficient for quantifying toxicity towards women, and thus inadequate for ameliorating the issue.

Measuring and detecting toxicity

Methods for measuring or detecting toxicity can be categorized as manual or automatic. Manual processes include in-game reporting or human assessment of toxicity in games or game communities. Automated processes typically involve either semantic analysis with natural language processing or applying corpora of terms to game or forum logs to count specific words or phrases (Brassard-Gourdeau & Khoury, 2018; Jurgens, Chandrasekharan, & Hemphill, 2019; Noever, 2018). These automated processes, such as the Perspective API, produce a likelihood of toxicity, leaving human intervention to decide the threshold at which a comment is positively identified as toxic.
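As a concrete illustration, a minimal sketch of this two-step pattern, assuming the publicly documented Perspective API request format, might look like the following. The API key, example comment, and the 0.8 cutoff are placeholders chosen for illustration, not values used in our study.

```python
import requests

# Hypothetical API key; a real key comes from a Google Cloud project.
API_KEY = "YOUR_PERSPECTIVE_API_KEY"
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_likelihood(comment_text):
    """Request a TOXICITY score (0 to 1) for one comment from Perspective."""
    payload = {
        "comment": {"text": comment_text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=10)
    response.raise_for_status()
    scores = response.json()
    return scores["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# The API only returns a likelihood; a human still has to choose the cutoff.
THRESHOLD = 0.8  # illustrative value, not a recommendation

comment = "example forum comment"
score = toxicity_likelihood(comment)
print(f"score={score:.2f}, flagged={score >= THRESHOLD}")
```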

There are numerous examples of studies using Perspective or other tools to automatically assess toxicity in forums. A common goal some of these studies share is to create better tools for detecting toxic comments in order to automoderate communities more effectively. While some of the selected literature does not explicitly focus on detecting toxicity in games or game communities, their methods and conclusions prove useful to our own understanding and underline the gaps that still exist in this space.

In one such paper, which looks at toxicity detection in a general context, Noever (2018) points to a few extant flaws in the Perspective API, the biggest of which seems to be a lack of transparency about how it works. His work explores alternative strategies for automatically detecting toxicity, and two promising directions he points to are ensemble strategies, which use many weaker, cheaper, or faster platforms to form an aggregate judgment, and tree-based algorithms that apply faceted analyses to content.

One idea presented in the literature is detecting toxicity before it occurs by modeling its likelihood. Where most measurements are a posteriori, a few papers look to better explore and explain the antecedents or triggers of toxicity and toxic comments (Almerekhi et al., 2020; Almerekhi, Jansen, & Kwak, 2020; Jurgens, Chandrasekharan, & Hemphill, 2019; Shen et al., 2020; Xia et al., 2020). These triggers can frequently be benign and non-toxic themselves but still act as tent-poles for toxic comments and reactions (Almerekhi et al., 2020). A common shortcoming of these studies is their reliance solely on automated detection of toxicity in their analyses. Where the automated tools are inaccurate in their toxicity detection, and we posit that in places they are, conclusions can be misleading or inaccurate.

Human measurement of toxicity is critical to improving the datasets that inform machine learning approaches. A particularly good example of a method is found in a paper by Carton, Mei, and Resnick (2020) that uses consensus coding for comments. Where this approach is lacking, and the authors acknowledge this limitation, is that no information is provided indicating how competent the human coders are at detecting toxicity. Optimal coding would utilize either an ensemble process with many human coders or a smaller set of expert coders. Because toxic language and behaviors have become normalized (Beres et al., 2021), non-expert coders may fail to recognize toxicity, which limits the accuracy of human consensus coding. Moreover, adding explanatory functions to automated detection of toxicity in order to aid human detection can in fact confound human scoring of that toxicity (Carton, Mei, & Resnick, 2020).
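As a minimal sketch of what consensus coding can look like, assuming binary toxic/non-toxic labels and an odd number of coders so that a simple majority vote suffices, consider the following; real studies would also report an inter-rater reliability statistic such as Cohen’s or Fleiss’ kappa.

```python
from collections import Counter

def consensus_label(labels):
    """Majority vote over toxic/non-toxic labels from several coders.

    `labels` is a list such as ["toxic", "non-toxic", "toxic"]; an odd
    number of coders avoids ties.
    """
    counts = Counter(labels)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical codings of one comment by three coders.
print(consensus_label(["toxic", "non-toxic", "toxic"]))  # -> "toxic"
```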

Detection of toxicity can diverge across different detection systems. Although the research by Venkit and Wilson (2021) focused on people with disabilities, a secondary discovery they made was that different systems (DistilBERT, Google’s Perspective API, TextBlob, and VADER) produced different sentiment scores for identical statements. Their paper provides some support for others’ proposals to build consensus systems for automated detection of toxicity, but it also highlights a limitation shared across all those systems: a tendency to prioritize and overestimate the toxicity of certain terms without sensitivity to context.

Sentiment analysis is one technique used in attempts to improve the results of automated toxicity detection. The rationale for this approach is that toxic comments can be coded (e.g., using “leet speak” or other methods) in ways that make it harder for automated tools to compare a word against a corpus of known toxic terms, but sentiment is more difficult to disguise. Brassard-Gourdeau and Khoury (2020) found that sentiment detection could refine and improve classification of comments as toxic, but it should be noted that those results were compared to a dataset that was itself scored for toxicity by an automated system rather than by expert human coding, which leaves their conclusions open to the same limitations as the automated scoring process.
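To make the intuition concrete, a toy sketch of sentiment-augmented detection is shown below. It pairs a naive corpus lookup, which a leet-speak spelling can evade, with a VADER sentiment score as a second signal; the word list and the feature combination are illustrative assumptions, not the method of Brassard-Gourdeau and Khoury (2020).

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Tiny illustrative corpus of known toxic terms; a real corpus is much larger.
TOXIC_CORPUS = {"idiot", "trash"}

analyzer = SentimentIntensityAnalyzer()

def features(comment):
    """Two signals: exact corpus match (easily evaded by leet speak)
    and overall sentiment (harder to disguise)."""
    tokens = comment.lower().split()
    corpus_hit = any(token in TOXIC_CORPUS for token in tokens)
    sentiment = analyzer.polarity_scores(comment)["compound"]  # -1 (negative) to +1
    return {"corpus_hit": corpus_hit, "sentiment": sentiment}

# "1d1ot" dodges the corpus lookup, but the surrounding wording still reads as negative.
print(features("you are a worthless 1d1ot"))
```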

The primary gap we see in the scoring and toxicity measures described in the literature to date is a focus on agreement as a percentage rather than on accuracy. Jurgens, Chandrasekharan, and Hemphill (2019) call out this state of research, which seems more focused on simple measurements than on critical analysis of what those measurements mean or how they should be used. If a human and Perspective each code a set of one hundred comments, both could find forty toxic comments, and in that case both agree that 40% of the comments are toxic. The problem is that those could be non-overlapping sets: the human and the AI may not agree on which forty comments are toxic. That automated tools are in fact frequently incorrect when detecting certain types of toxicity is not unknown (Almerekhi et al., 2020), but it is also not explored at sufficient depth. Methodologically, a common problem appears to be that the human coding utilized by other projects is calibrated using the Perspective API literature, and we believe this contributes to artificially inflated agreement. Our research goes a step further by describing which taxonomic categories of toxicity have higher and lower rates of agreement, and it relies on expert human coding.
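A small worked example, using hypothetical codings scaled down to ten comments, makes the distinction concrete: the human and the tool flag the same proportion of comments, yet they overlap on only two of the flagged items, and a chance-corrected measure such as Cohen’s kappa exposes the weakness that a headline percentage hides.

```python
# Hypothetical binary codings (1 = toxic) of ten comments by a human coder
# and an automated tool. Both flag 4 of 10 comments, so the headline rates match.
human = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
tool  = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

marginal_human = sum(human) / len(human)
marginal_tool = sum(tool) / len(tool)
item_agreement = sum(h == t for h, t in zip(human, tool)) / len(human)

print(f"human rate: {marginal_human:.0%}, tool rate: {marginal_tool:.0%}")  # 40% vs 40%
print(f"per-item agreement: {item_agreement:.0%}")  # only 60% coded the same way

# Chance-corrected agreement (Cohen's kappa), computed by hand.
p_o = item_agreement
p_e = (marginal_human * marginal_tool
       + (1 - marginal_human) * (1 - marginal_tool))
kappa = (p_o - p_e) / (1 - p_e)
print(f"Cohen's kappa: {kappa:.2f}")  # about 0.17, far weaker than "40% vs 40%" suggests
```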

Normalization of toxicity

Culture begets culture. Bad actors working in concert can have outsized effects because of Reddit’s recommender systems, which contributes to the establishment of toxic norms in Reddit communities (Massanari, 2017). These norms in turn can alter the behavior of posters, such that a single user will display differing rates of toxicity depending on the forum they are posting in (Almerekhi, Kwak, & Jansen, 2020).

Frequently these established norms align the community against people belonging to marginalized groups (Cullen, 2022). And where communities are especially vocal or abusive, this can act as a gatekeeping mechanism that forces people from marginalized groups to behave or speak only in ways consistent with what the community is willing to tolerate (Beres et al., 2021; Cullen, 2022).

Normalization of toxicity leads people to believe that toxicity is subjective; yet whether a comment elicits perceptions of toxicity is independent of whether the comment is toxic. As Jurgens, Chandrasekharan, and Hemphill (2019) conclude, though, we cannot improve toxic online environments as long as we have different, competing standards for defining toxic language and behavior.

Mitigation of toxicity

The standard for mitigating toxicity in forums is effective moderation. Good moderation can warn users ahead of time, in a stickied thread, that toxic comments will not be tolerated; remove toxic comments directly; and ban frequent offenders. A study by Srinivasan et al. (2019) showed that effective moderation, although not a significant contributor to reformed future behavior, did significantly reduce future rates of toxic comments.


This first prompt sets expectations for the generative AI. It stipulates what you will be providing (which can work as error correction if you then provide something other than what you stated), and it tells the model to simply hold onto that text until you request something. This is useful when the amount of text you are including exceeds the AI’s character limit; in those cases you can either state that you are giving it the document in parts or, a better option now widely available, put the document into a PDF and upload it.

You
I am going to begin by giving you text from an article I wrote. Don’t do anything with it yet.

With the text fed into the working memory space, you can now make specific requests. You can see that this first request is very general and could work with any text, whether to develop questions suitable for formative assessment or for a student to check their own comprehension.

You
from this text I need to generate a set of multiple choice questions to assess student comprehension of the material for a class of college students. please use the text and begin by generating 20 multiple choice questions along with the correct response.

Here that prompt is refined to focus narrowly on a specific subject within the text. The prompt could just as easily be phrased in a way that lets you have a conversation with the document, for example: “How does the normalization of toxicity contribute to toxicity towards women in online gaming?”

You
I would like 5 questions that focus on the effects of normalization of toxicity and how that contributes toward toxicity towards women in online gaming. Please generate 5 open ended questions, and then generate 5 multiple choice questions with the correct response listed.