A Review of
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
Algorithmic Bias in Facial Recognition Technology on the Basis of Gender and Skin Tone
Researchers identify disparities in how facial recognition technology classifies gender across skin tones, indicating algorithmic bias.
Introduction
Artificial intelligence has permeated decision-making in hiring, loan applications, and even the length of an individual’s prison sentence. Despite its many advantages, errors in facial recognition algorithms that rely on artificial intelligence and machine learning can have dangerous consequences, such as wrongfully accusing an individual of a crime through misidentification.
This study by Buolamwini and Gebru examined three commercial Application Programming Interface (API)-based classifiers that predict gender from facial images and found that their recognition capabilities are not balanced across genders and skin tones. The researchers found clear disparities in classification performance: darker-skinned women had the highest error rates, while lighter-skinned men were classified most accurately.
Joy Buolamwini is a computer scientist and digital activist at the MIT Media Lab, where she works to address algorithmic bias and promote ethical, inclusive technology. Timnit Gebru is a research scientist on Google’s Ethical AI team and completed a postdoc with Microsoft’s Fairness, Accountability, Transparency, and Ethics in AI group, where she examined algorithmic bias and the ethical implications of data projects.
Methods and Findings
Assessment of gender classification remains limited to binary labels, since these classification systems construct gender as two defined classes. Because the researchers were interested in conducting an intersectional analysis, they added skin type annotations for the unique subjects in two existing datasets and built a new facial image dataset balanced by gender and skin type. The new dataset, called the Pilot Parliaments Benchmark (PPB), included 1,270 parliamentarians from three African countries (Rwanda, Senegal, South Africa) and three European countries (Iceland, Finland, Sweden).
Buolamwini and Gebru chose not to use race labels, since phenotypic features vary significantly across individuals within a racial or ethnic category, and racial and ethnic categories are themselves unstable, varying across geographies and time. Labeling faces by skin type instead allowed the researchers to examine the importance of phenotypic attributes. Analysis on the benchmark revealed a bias favoring lighter-skinned males and disadvantaging darker-skinned individuals, especially darker-skinned females. The classifiers also performed better on male faces than on female faces. The researchers cautioned that darker skin may not itself be the factor responsible for misclassification; it may instead be highly correlated with other attributes, such as facial geometry or gender presentation norms, that drive the errors.
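To illustrate the kind of intersectional error analysis described above, the sketch below (Python with pandas; the toy predictions and column names such as predicted_gender are hypothetical, not the authors’ data or code) groups classifier predictions by skin type and gender and reports an error rate for each subgroup.

```python
import pandas as pd

# Hypothetical evaluation table: one row per face image, with ground-truth
# labels and a commercial classifier's predicted gender.
results = pd.DataFrame({
    "gender":           ["female", "female", "male",   "male",    "female", "male"],
    "skin_type":        ["darker", "lighter", "darker", "lighter", "darker", "darker"],
    "predicted_gender": ["male",   "female",  "male",   "male",    "female", "male"],
})

# Flag misclassifications, then compute the error rate for each
# intersectional subgroup (skin type x gender).
results["error"] = results["gender"] != results["predicted_gender"]
error_rates = (
    results.groupby(["skin_type", "gender"])["error"]
    .mean()
    .rename("error_rate")
)
print(error_rates)
```

Comparing subgroup error rates in this way, rather than reporting a single overall accuracy number, is what surfaces disparities like the ones the study describes.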
Conclusions
The researchers recommend closing the error gaps between male and female classifications, and between lighter- and darker-skinned classifications, in artificial intelligence systems. Because default camera settings are often optimized to expose lighter skin better than darker skin, under- and overexposed images lose crucial information, undermining the reliability of classification within artificial intelligence systems. Lack of representation of specific demographic groups in benchmark datasets can also lead to frequent targeting of, and suspicion toward, people who are already marginalized. Inaccurate facial recognition systems often misidentify people of color, women, and young people, with potentially perilous and even life-threatening consequences. The authors highlight a critical need to ensure the phenotypic and demographic accuracy of these systems in order to protect the general public and keep these technologies accountable and transparent.
This research represents a significant development in gender classification benchmarking by introducing the first intersectional demographic and phenotypic assessment of facial gender classification accuracy. Additional research may investigate gender classification on an inclusive benchmark of unconstrained images, as well as apply intersectional error analysis more broadly to facial recognition technology to ensure algorithmic fairness, transparency, and accountability.