Structural sexism is also ever-present in science and publishing. Female authors are massively underrepresented, especially in technical and engineering disciplines. When writing a scientific paper, project report or homework assignment, it is not easy to cite an equal number of male and female authors. To help authors analyze their own bibliography with regard to gender equality, a program was written that assigns each first name in a bibliography to its probable gender, counts the names and outputs the results. Besides being an interesting data analysis and visualization of gender inequity, such a program could help scientists and students to close the gender gap in their bibliographies, or at least to counteract it.

Goals

The objective of the project is to create a program that can automatically read and evaluate a .bib-file. For this purpose, the program classifies the authors on the basis of their first names on a spectrum from probably female to probably male, counts them and outputs the results in an easily understandable way.

Ideally, the program can be developed in a user-friendly form with a graphical user interface and made easily available to the public through the university library or by other means.

Research

I am certainly not the first person to use a computer to infer people's gender from their first names. As it turns out, there are several ways to do this. In 2018, a comparative study[1] was published that examines and compares the five most relevant tools for converting first names to genders. The open source Python package gender-guesser[2] stands out for its high precision, reliability, and free availability. Although it compares names against a database from 2008 that has not been updated since, this package should serve its purpose for now.

While I am aware that a binary gender definition does not adequately represent non-binary authors, to my knowledge there is no better method for estimating the gender of such a large group of people.

I am also not the first person to examine gender relations in authorship. In 2021, a paper[3] appeared on gender differences in publication submissions and peer reviews during the first wave of the COVID-19 pandemic. Its methodology for identifying genders also uses the gender-guesser package in a first stage, but all unrecognized names are then identified using gender-api,[4] which is proprietary and costly. While a similar two-stage method would increase the quality and reliability of the program created in this project, the program could then no longer be made available free of charge and open source.

A 2013 paper on the role of gender in scholarly authorship[5] and the online interactive results[6] of their investigation of the JSTOR corpus were extremely instructive and inspiring.

Counting Program

The program extracts the first names of all authors from a .bib-file and uses the gender-guesser package, which is released under the GNU General Public License, to classify them into the categories female, mostly_female, androgynous, mostly_male, male and unknown. This is followed by a count of the authors in the six categories and further analysis. Finally, the results are displayed in the console.

Basics

When preparing a homework assignment, a report, a publication or similar, the literature used and cited is usually collected and managed in a reference management program (Citavi, Zotero, JabRef or similar). When using the typesetting system LaTeX,[7] which is common in technical and engineering disciplines, a .bib-file is exported from this software, on which all citations and the creation of the bibliography are based.[8] All information about the references used is contained in this file, including a list of authors. Each publication stored in the .bib-file is assigned a unique citation key. The corresponding authors are stored in an ordered list in a line 'author = {},' with the individual authors separated by 'and'. This uniform structure allows my simple program to extract the names for each citation key from the .bib-file and analyze them.
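The structure described above can be read with a few lines of Python. The following is only a minimal sketch, not the code of the actual program: the function name read_authors, the regular expressions and the sample entries are assumptions made for illustration, and the sketch assumes that each author field fits on a single line.

```python
import re

# Matches the start of an entry such as "@article{key," and captures the key.
ENTRY_START = re.compile(r"@\w+\s*\{\s*([^,\s{}]+)\s*,")
# Matches a single-line author field such as "author = {A and B},".
AUTHOR_LINE = re.compile(r"^\s*author\s*=\s*\{(.*)\}\s*,?\s*$",
                         re.IGNORECASE | re.MULTILINE)


def read_authors(bib_text):
    """Return a dict mapping each citation key to its list of author strings."""
    authors_by_key = {}
    starts = list(ENTRY_START.finditer(bib_text))
    for i, start in enumerate(starts):
        # An entry's body runs until the next entry starts (or the text ends).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bib_text)
        body = bib_text[start.end():end]
        field = AUTHOR_LINE.search(body)
        raw = field.group(1) if field else ""
        # Authors are separated by the word "and"; drop empty elements.
        names = [n.strip() for n in re.split(r"\s+and\s+", raw)]
        authors_by_key[start.group(1)] = [n for n in names if n]
    return authors_by_key


# Hypothetical sample entries, not taken from any real bibliography.
sample = """
@article{doe2020,
  author = {Doe, Jane and John Smith},
  title = {A hypothetical entry},
}
@book{anon2021,
  title = {An entry without authors},
}
"""
print(read_authors(sample))
```

Entries without an author line are kept with an empty list, so that every citation key still appears in the result.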

How does it work?

The program is fairly straightforward and works through the following steps:

  • Import of all necessary Python packages
  • Import of the .bib-file from a file path
  • Creation of a vector with all citation keys
  • Creation of a dict-array in which each citation key is assigned a vector with the corresponding authors
  • Filtering & preparation of the names
    • Removal of empty elements
    • Recognition of the name format (surname, first name or first name surname) by the presence of a comma
    • If there is a space, everything after it will be ignored.
    • If a hyphen is present, everything after the hyphen is ignored.
    • For author names in curly brackets, the brackets are removed
    • In case of unbalanced curly braces, the superfluous braces are removed
  • Categorization of the prepared first names using gender-guesser
    • Creation of the empty vectors females, mostly_females, andys (androgynous given names), mostly_males, males and unknowns for the storage of the names to be assigned
    • Filling of the vectors with the first names assigned by gender-guesser
  • Filtering/preparation of the unassignable names from the unknowns vector
    • Filtering of abbreviated first names and storage in the shorts vector
    • Filtering of first names containing special characters and storage in the formats vector
  • Calculation of key figures
    • Length (number of entries) of the vectors females, mostly_females, andys, mostly_males, males, formats, shorts and unknowns
    • count_all = total number of all authors
    • femratio = percentage of recognized female first names out of all recognized first names
  • Output results in the console
    • Total number of all authors count_all
    • Length of the females, mostly_females, andys, mostly_males and males vectors (= count) with the respective percentages of the total of all names
    • Percentage of recognized female first names femratio
    • Number, proportion and breakdown of names that cannot be assigned (abbreviations, incorrect formatting, unrecognized)
    • List of androgynous first names "could not decide:"
    • List of unknown given names unknowns "could not guess:"
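
The filtering and categorization steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the program itself: the helper names first_name and sort_names, the tests used for abbreviations and special characters, and the sample names are all hypothetical. The classifier is passed in as a function; a small stand-in dictionary is used here so that the sketch runs without any package installed.

```python
CATEGORIES = ("female", "mostly_female", "andy", "mostly_male", "male", "unknown")


def first_name(author):
    """Reduce one author string to the first name that gets classified."""
    # Simplification: drop every curly brace, balanced or not.
    name = author.replace("{", "").replace("}", "").strip()
    if "," in name:
        # "surname, first name": the first name follows the comma.
        name = name.split(",", 1)[1].strip()
    # Ignore everything after a space, then everything after a hyphen.
    return name.split(" ", 1)[0].split("-", 1)[0]


def sort_names(authors, get_gender):
    """Sort first names into the six categories plus shorts and formats."""
    buckets = {category: [] for category in CATEGORIES}
    shorts, formats = [], []
    for name in map(first_name, authors):
        if not name:
            continue  # removal of empty elements
        label = get_gender(name)
        if label in buckets and label != "unknown":
            buckets[label].append(name)
        elif name.endswith(".") or len(name) == 1:
            shorts.append(name)  # assumed test for abbreviated first names
        elif not name.isalpha():
            formats.append(name)  # assumed test for special characters
        else:
            buckets["unknown"].append(name)
    return buckets, shorts, formats


# Hypothetical stand-in for the classifier so the sketch runs on its own.
STUB = {"Jane": "female", "Anna": "female", "John": "male", "Kim": "andy"}


def stub_get_gender(name):
    return STUB.get(name, "unknown")


authors = ["Doe, Jane", "John Smith", "{Kim} Lee", "Mueller, Anna-Lena",
           "J. Public", "O'Neil, D'Arcy", "Roe, Qwert", ""]
buckets, shorts, formats = sort_names(authors, stub_get_gender)
print(buckets, shorts, formats)
```

With the gender-guesser package installed, the get_gender method of gender_guesser.detector.Detector() can be passed instead of the stand-in; it returns the labels female, mostly_female, andy, mostly_male, male and unknown, which is why the androgynous category is called andy here.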

Results

Comparison between Jonathan's manual count and the automatic count output by the program

To make the underrepresentation of female scientists visible, Jonathan Muth counted the male-read and female-read first names in the bibliography of his bachelor's thesis. Although Jonathan preferred to cite female scientists whenever possible, he ended up with 180 female and 381 male first names. He could not assign a further 140 first names to a gender, giving 701 names in total. To validate the counting program, the automatic count of the .bib file is compared with Jonathan's manually counted results.

The console outputs the following after the automatic evaluation by the program:

total: 743

females: 136 (18.3%)

mostly_females: 4 (0.5%)

androgynous: 58 (7.8%)

mostly_males: 23 (3.1%)

males: 306 (41.2%)

Percentage of recognized female authors: 26.6%

Not assignable: 216 (29.1%), of which: (Unknown% / Total%)

79 Abbreviations (36.6% / 10.6%)

18 Incorrectly formatted (8.3% / 2.4%)

119 Unrecognized (55.1% / 16.0%)

could not decide: [...]

could not guess: [...]

The automatic evaluation yields a total of 743 first names, 42 more than Jonathan's count. Of these, 136 are female, 4 are probably female, 23 are probably male, 306 are male, 58 are androgynous and 216 are unknown or unassignable.
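
The percentages in the console output can be re-derived from the raw counts. The short sketch below does this with the numbers reported above; the variable names and the helper pct are chosen for illustration only. It assumes that femratio is the share of female plus mostly_female names among all recognized names, which is the reading that reproduces the reported 26.6%.

```python
# Counts taken from the console output above.
counts = {"females": 136, "mostly_females": 4, "androgynous": 58,
          "mostly_males": 23, "males": 306}
not_assignable = {"abbreviations": 79, "incorrectly_formatted": 18,
                  "unrecognized": 119}

recognized = sum(counts.values())             # names assigned to a category
unknown_total = sum(not_assignable.values())  # names that are not assignable
count_all = recognized + unknown_total        # total number of all authors


def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Assumed definition: (females + mostly_females) / all recognized names.
femratio = pct(counts["females"] + counts["mostly_females"], recognized)

# Manual count from the bachelor's thesis: female, male, not assigned.
manual_total = 180 + 381 + 140

print("total:", count_all)
for label, n in counts.items():
    print(f"{label}: {n} ({pct(n, count_all)}%)")
print(f"Percentage of recognized female authors: {femratio}%")
print(f"Not assignable: {unknown_total} ({pct(unknown_total, count_all)}%)")
for label, n in not_assignable.items():
    print(f"{n} {label} ({pct(n, unknown_total)}% / {pct(n, count_all)}%)")
print("difference to manual count:", count_all - manual_total)
```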

How can I use the program?

Download-link ?

You can contact me via timohuber(at) .

List of publications featuring this tool

I am incredibly proud that this simple tool is having an impact on science and research. The following publications use the tool to critically discuss the citations in their work:

  • [9] Muth J, Klunker A and Völlmecke C (2023) Putting 3D printing to good use—Additive Manufacturing and the Sustainable Development Goals. Front. Sustain. 4:1196228. doi: 10.3389/frsus.2023.1196228
  • Lauenstein, F (2024) Durchführung einer Pinch-Analyse als Baustein der Dekarbonisierung einer Papierfabrik - Identifikation und Bewertung von Maßnahmen zur Energieeinsparung in der Bestandsanlage. Bachelor's thesis. Institut für Energietechnik, TU Berlin

References

  1. Santamaría L, Mihaljević H. 2018. Comparison and benchmark of name-to-gender inference services. PeerJ Computer Science 4:e156
  2. Documentation of Python package gender-guesser
  3. Squazzoni F, Bravo G, Grimaldo F, García-Costa D, Farjam M, Mehmani B (2021) Gender gap in journal submissions and peer review during the first wave of the COVID-19 pandemic. A study on 2329 Elsevier journals. PLoS ONE 16(10): e0257919.
  4. Gender-API
  5. West JD, Jacquet J, King MM, Correll SJ, Bergstrom CT (2013) The Role of Gender in Scholarly Authorship. PLoS ONE 8(7): e66212.
  6. Interactive Gender composition of scholarly publications (1665 - 2011).
  7. LaTeX – A document preparation system
  8. BibTeX Format Description
  9. Muth J, Klunker A and Völlmecke C (2023) Putting 3D printing to good use—Additive Manufacturing and the Sustainable Development Goals. Front. Sustain. 4:1196228. doi: 10.3389/frsus.2023.1196228
Part of Engineering for Equity Think Tank
Keywords citations, gender equality, science, references, automatic, program, python, scholar
Authors Timo Huber
License CC-BY-SA-4.0
Language English (en)
Created August 4, 2022 by Timo
Modified February 18, 2024 by Timo