Structural sexism is also ever-present in the world of science and publications. Especially in technical and engineering disciplines, female authors are massively underrepresented. When writing a scientific paper, project report or homework assignment, it is not easy to cite an equal number of male and female authors. To assist in analyzing one's own bibliography with regard to gender equality, a program was written that assigns the first names in a bibliography to their probable gender, counts them and outputs the results. Not only is this an interesting data analysis and visualization of gender inequity; such a program could also help scientists and students to close the gender gap in their bibliographies, or at least to counteract it.

Goals

The objective of the project is to create a program that can automatically read and evaluate a .bib-file. For this purpose, the program classifies the authors on the basis of their first names on a spectrum from probably female to probably male, counts them and outputs the results in an easily understandable way.

Ideally, the program can be developed in a user-friendly form with a graphical user interface and made easily available to the public through the university library or other channels.

Research

I am certainly not the first person to use a computer to infer a person's gender from their first name. As it turns out, there are several ways to do this. In 2018, a comparative study[1] was published that examines and compares the five most relevant tools for converting first names to genders. The open source Python package gender-guesser[2] stands out for its high precision, reliability, and free availability. Although the names are compared with a database from 2008 that has not been updated since, this package should serve its purpose for now.

While I am aware that a binary gender definition does not adequately represent non-binary authors, to my knowledge there is no better method for estimating the gender of such a large group of people.

Also, I am not the first person to examine gender relations in authorship. In 2021, a paper[3] appeared on gender differences in publication submissions and peer reviews during the first wave of the COVID-19 pandemic. Its methodology for identifying genders also uses the gender-guesser package in a first stage, but all unrecognized names are then identified using gender-api[4], which is proprietary and not free of charge. While using a similar two-stage method would increase the quality and reliability of the program created in this project, the program could then no longer be made available free of charge and as open source.

A 2013 paper on the role of gender in scholarly authorship[5] and the online interactive results[6] of their investigation of the JSTOR corpus were extremely instructive and inspiring.

Counting Program

The program extracts the first names of all authors from a .bib-file and classifies them with the gender-guesser package, which is released under the GNU General Public License, into the categories female, mostly_female, androgynous, mostly_male, male and unknown. The authors in the six categories are then counted and analyzed further. Finally, the results are displayed in the console.

Basics

When preparing a homework assignment, a report, a publication or similar, the literature used and cited is usually collected and managed in a literature management program (Citavi, Zotero, JabRef or similar). When using the typesetting system LaTeX[7], which is common in technical and engineering disciplines, a .bib-file is exported from this software, on which all citations and the creation of the bibliography are based[8]. All information about the references used is contained in this file, including a list of authors. Each publication stored in the .bib-file is assigned a unique citation key. The corresponding authors are stored in an ordered list in a line of the form 'author = {},', with the individual authors separated by 'and'. This uniform structure allows my simple program to extract the names for each citation key from the .bib-file and analyze them.
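To illustrate the structure described above, here is a minimal sketch of how the authors could be pulled out of a .bib-file per citation key. It is not the project's actual code: the function name `authors_by_key`, the two regular expressions and the sample entries are assumptions made for this example, and it only handles author fields written on a single line.

```python
import re

# Start of an entry such as "@article{doe2020," with the citation key captured.
ENTRY_START = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
# A one-line author field such as "author = {A and B},".
AUTHOR_LINE = re.compile(r"^\s*author\s*=\s*[{\"](.*?)[}\"]\s*,?\s*$", re.IGNORECASE)


def authors_by_key(bib_text):
    """Map each citation key to the ordered list of its author strings."""
    result = {}
    current_key = None
    for line in bib_text.splitlines():
        start = ENTRY_START.match(line.strip())
        if start:
            current_key = start.group(1)
            result[current_key] = []
            continue
        author = AUTHOR_LINE.match(line)
        if author and current_key is not None:
            # Authors are separated by the word "and" surrounded by whitespace.
            names = re.split(r"\s+and\s+", author.group(1))
            result[current_key] = [n.strip() for n in names if n.strip()]
    return result


# Made-up sample entries, for illustration only.
SAMPLE = """
@article{doe2020,
  author = {Doe, Jane and John Smith},
  title = {An example title},
}
@book{roe2019,
  author = {Anne Roe},
}
"""

if __name__ == "__main__":
    for key, authors in authors_by_key(SAMPLE).items():
        print(key, authors)
```

A dedicated BibTeX parser would be more robust for real files, for example when an author field spans several lines.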

How does it work?

The program is pretty straightforward and works through the following steps:

  • Import of all necessary Python packages
  • Import of the .bib-file from a file path
  • Creation of a vector with all citation keys
  • Creation of a dict-array in which each citation key is assigned a vector with the corresponding authors
  • Filtering & preparation of the names
    • Removal of empty elements
    • Recognition of the name format (surname, first name or first name surname) by the presence of a comma
    • If the first name contains a space, everything after the space is ignored
    • If a hyphen is present, everything after the hyphen is ignored.
    • For author names in curly brackets, the brackets are removed
    • In case of unbalanced curly braces, the superfluous braces are removed
  • Categorization of the prepared first names using gender-guesser
    • Creation of the empty vectors females, mostly_females, andys (androgynous given names), mostly_males, males and unknowns for the storage of the names to be assigned
    • Filling the vectors with the first names assigned by the gender-guesser
  • Filtering/preparation of the unassignable names from the unknowns vector
    • Filtering of abbreviated first names and storage in the shorts vector
    • Filtering of first names containing special characters and storage in the formats vector
  • Calculation of the key figures
    • Length (i.e. number of entries) of females, mostly_females, andys, mostly_males, males, formats, shorts and unknowns
    • count_all = total number of all authors
    • femratio = percentage of recognized female first names out of all recognized first names
  • Output results in the console
    • Number of all authors count_all
    • Length of the females, mostly_females, andys, mostly_males and males vectors (= count) with the respective percentages of the total of all names
    • Percentage of recognized female first names femratio
    • Number, proportion and breakdown of names that cannot be assigned (abbreviations, incorrect formatting, unrecognized)
    • List of androgynous first names "could not decide:"
    • List of unknown given names unknowns "could not guess:"
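The filtering and categorization steps above can be sketched roughly as follows. This is an illustration under assumptions, not the original program: the helper names `first_name` and `sort_names`, the rule for what counts as an abbreviation, and the rule for "special characters" (anything that is not purely alphabetic) are my own choices. The classifier is passed in as a function; with gender-guesser installed, `gender_guesser.detector.Detector().get_gender` could be passed, which returns `andy` for androgynous names. The small lookup table below is only a stand-in so that the example runs without the package.

```python
import re

CATEGORIES = ("female", "mostly_female", "andy", "mostly_male", "male", "unknown")
# Assumed rule: a single letter, or letters that are each followed by a dot.
ABBREVIATION = re.compile(r"[^\W\d_]|(?:[^\W\d_]\.)+")


def first_name(author):
    """Reduce one BibTeX author string to a bare first name ('' if none)."""
    # Remove curly braces, whether they are balanced or not.
    name = author.replace("{", "").replace("}", "").strip()
    if "," in name:
        # "Surname, First Middle": keep the part after the comma.
        name = name.split(",", 1)[1].strip()
    # Ignore everything after the first space, then after the first hyphen.
    name = name.split(" ", 1)[0]
    return name.split("-", 1)[0]


def sort_names(authors, guess):
    """Sort prepared first names into category lists using guess(name)."""
    buckets = {category: [] for category in CATEGORIES}
    buckets["shorts"] = []   # abbreviated first names
    buckets["formats"] = []  # first names containing special characters
    for author in authors:
        name = first_name(author)
        if not name:
            continue  # removal of empty elements
        category = guess(name)
        if category != "unknown":
            buckets[category].append(name)
        elif ABBREVIATION.fullmatch(name):
            buckets["shorts"].append(name)
        elif not name.isalpha():
            buckets["formats"].append(name)
        else:
            buckets["unknown"].append(name)
    return buckets


# Stand-in lookup table for demonstration only; not the gender-guesser database.
DEMO_TABLE = {"Jane": "female", "Anne": "female", "John": "male"}


def demo_guess(name):
    return DEMO_TABLE.get(name, "unknown")


if __name__ == "__main__":
    demo = ["Doe, Jane", "John Smith", "{Miller, J.}", "", "Anne-Marie Roe", "Xq7 Foo"]
    for category, names in sort_names(demo, demo_guess).items():
        print(category, names)
```

Passing the classifier as a parameter keeps the name preparation testable on its own and makes it easy to swap in a different name-to-gender service later.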

Results

Comparison between Jonathan's manual count and the automatic count output by the program

To make the underrepresentation of female scientists visible, Jonathan Muth counted the male- and female-read first names of his bibliography in his bachelor thesis. Although Jonathan preferred to cite female scientists whenever possible, he ended up with 180 female and 381 male first names in his bibliography. He could not assign 140 first names to a gender. To validate the counting program, the automatic count of the .bib file is compared with Jonathan's manually counted results from his bachelor thesis.

The console outputs the following after the automatic evaluation by the program:

total: 743

females: 136 (18.3%)

mostly_females: 4 (0.5%)

androgynous: 58 (7.8%)

mostly_males: 23 (3.1%)

males: 306 (41.2%)

Percentage of recognized female authors: 26.6%

Not assignable: 216 (29.1%), of which: (Unknown% / Total%)

79 Abbreviations (36.6% / 10.6%)

18 Incorrectly formatted (8.3% / 2.4%)

119 Unrecognized (55.1% / 16.0%)

could not decide: [...]

could not guess: [...]

The automatic evaluation yields a total of 743 first names, 42 more than Jonathan's manual count of 701. Of these, 136 are female, 4 are probably female, 23 are probably male, 306 are male, 58 are androgynous and 216 are unknown or unassignable.
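The percentages in the console output can be re-derived from the counts. The sketch below uses only the numbers reported above; the variable names are chosen for this example. Note one inference: the reported 26.6% is reproduced only if femratio counts females plus mostly_females and divides by all recognized names including the androgynous ones, which is therefore assumed here.

```python
# Counts taken from the console output above.
counts = {
    "females": 136,
    "mostly_females": 4,
    "androgynous": 58,
    "mostly_males": 23,
    "males": 306,
}
not_assignable = {
    "abbreviations": 79,
    "incorrectly_formatted": 18,
    "unrecognized": 119,
}

recognized = sum(counts.values())          # 527 recognized first names
unassigned = sum(not_assignable.values())  # 216 names that cannot be assigned
count_all = recognized + unassigned        # 743 names in total

# Assumed definition, inferred from the reported value of 26.6%.
femratio = 100 * (counts["females"] + counts["mostly_females"]) / recognized

# Manual count from the bachelor thesis: female + male + unassigned.
manual_total = 180 + 381 + 140

print(f"total: {count_all}")
for label, value in counts.items():
    print(f"{label}: {value} ({100 * value / count_all:.1f}%)")
print(f"Percentage of recognized female authors: {femratio:.1f}%")
print(f"Not assignable: {unassigned} ({100 * unassigned / count_all:.1f}%)")
for label, value in not_assignable.items():
    share_unassigned = 100 * value / unassigned
    share_total = 100 * value / count_all
    print(f"{value} {label} ({share_unassigned:.1f}% / {share_total:.1f}%)")
print(f"difference to manual count: {count_all - manual_total}")
```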

Author position evaluation

Hardly any publication is written by only one or two authors, and the order of the author list is far from random. There are several conventions for ordering authors. In general, the person(s) who made the largest contribution to the work described in the publication or conducted the largest proportion of the underlying research is/are listed as first author(s). The remaining authors are listed in descending order of their respective contributions. In some academic fields the last author position is prestigious, since that person is significantly responsible for the funding and/or public perception of the publication. However, this cannot be generalized to all research areas and is not always common practice, which is why an evaluation of last authors would not really be useful in my opinion.

Nevertheless, a function has been implemented that searches a bibliography for works with more than 3 authors and evaluates the probable gender of the authors together with their positions in the author list.
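How such a position evaluation might look is sketched below. The source does not describe the function in detail, so this is only one possible reading: the name `position_counts`, the restriction to first and last position, and the input format (citation key mapped to a list of already prepared first names) are assumptions. The classifier is again passed in as a function, and the lookup table is a stand-in for gender-guesser.

```python
from collections import Counter


def position_counts(first_names_by_key, guess, min_authors=4):
    """Tally the guessed gender of first and last authors.

    Only works with at least min_authors authors are considered;
    the default of 4 corresponds to "more than 3 authors".
    """
    first, last = Counter(), Counter()
    for names in first_names_by_key.values():
        if len(names) < min_authors:
            continue
        first[guess(names[0])] += 1
        last[guess(names[-1])] += 1
    return first, last


# Stand-in lookup table for demonstration only; not the gender-guesser database.
DEMO_TABLE = {"Jane": "female", "John": "male"}


def demo_guess(name):
    return DEMO_TABLE.get(name, "unknown")


if __name__ == "__main__":
    demo = {
        "key_a": ["Jane", "John", "John", "John"],
        "key_b": ["John", "Jane"],  # skipped: fewer than 4 authors
        "key_c": ["John", "Jane", "Jane", "Jane", "Jane"],
    }
    first, last = position_counts(demo, demo_guess)
    print("first authors:", dict(first))
    print("last authors:", dict(last))
```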

References

  1. Santamaría L, Mihaljević H. 2018. Comparison and benchmark of name-to-gender inference services. PeerJ Computer Science 4:e156
  2. Documentation of Python package gender-guesser
  3. Squazzoni F, Bravo G, Grimaldo F, García-Costa D, Farjam M, Mehmani B (2021) Gender gap in journal submissions and peer review during the first wave of the COVID-19 pandemic. A study on 2329 Elsevier journals. PLoS ONE 16(10): e0257919.
  4. Gender-API
  5. West JD, Jacquet J, King MM, Correll SJ, Bergstrom CT (2013) The Role of Gender in Scholarly Authorship. PLoS ONE 8(7): e66212.
  6. Interactive Gender composition of scholarly publications (1665 - 2011).
  7. LaTeX – A document preparation system
  8. BibTeX Format Description
Page data
Authors Timo
Published 2022
License CC-BY-SA-4.0