gender_analysis.analysis package

base analyzers

dependency parsing module

gender_analysis.analysis.dependency_parsing.generate_dependency_tree(document, genders=None, pickle_filepath=None)

Returns the dependency tree for a given document. Optionally, parsing can be restricted to sentences that involve the subject/object pronouns of the specified genders.

Parameters:
  • document – Document we are interested in
  • genders – a collection of genders that will be used to filter out sentences that do not involve the provided genders. If set to None, all sentences are parsed (default).
  • pickle_filepath – filepath at which to store the pickled dependency tree; no file is written if None
Returns:

dependency tree, represented as a nested list

gender_analysis.analysis.dependency_parsing.get_descriptive_adjectives(tree, gender)

Returns a list of adjectives describing pronouns for the given gender in the given dependency tree.

Parameters:
  • tree – dependency tree for a document, output of generate_dependency_tree
  • gender – Gender to search for usages of
Returns:

List of adjectives as strings

gender_analysis.analysis.dependency_parsing.get_descriptive_verbs(tree, gender)

Returns a list of verbs describing pronouns of the given gender in the given dependency tree.

Parameters:
  • tree – dependency tree for a document, output of generate_dependency_tree
  • gender – Gender to search for usages of
Returns:

List of verbs as strings

gender_analysis.analysis.dependency_parsing.get_pronoun_usages(tree, gender)

Returns a dictionary relating the occurrences of a given gender’s pronouns as the subject and object of a sentence.

Parameters:
  • tree – dependency tree for a document, output of generate_dependency_tree
  • gender – Gender to check
Returns:

Dictionary counting the times the given gender's pronouns are used as the subject and object, formatted as {'subject': <int>, 'object': <int>}
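The counting described for get_pronoun_usages can be sketched on a toy tree. The tree layout used here (a list of sentences, each a list of (head, relation, dependent) triples), the relation labels nsubj and dobj, and the helper name count_pronoun_usages are all assumptions made for illustration; the actual output of generate_dependency_tree may be structured differently.

```python
# Hypothetical tree layout: a list of sentences, each a list of
# (head_word, relation, dependent_word) triples. This is an assumption,
# not the library's documented representation.
SUBJECT_RELATIONS = {"nsubj"}  # assumed label for subjects
OBJECT_RELATIONS = {"dobj"}    # assumed label for direct objects


def count_pronoun_usages(tree, subject_pronouns, object_pronouns):
    """Count how often the given pronouns appear as subject and as object."""
    usages = {"subject": 0, "object": 0}
    for sentence in tree:
        for _head, relation, dependent in sentence:
            word = dependent.lower()
            if relation in SUBJECT_RELATIONS and word in subject_pronouns:
                usages["subject"] += 1
            elif relation in OBJECT_RELATIONS and word in object_pronouns:
                usages["object"] += 1
    return usages


toy_tree = [
    [("saw", "nsubj", "She"), ("saw", "dobj", "him")],
    [("thanked", "nsubj", "he"), ("thanked", "dobj", "her")],
    [("laughed", "nsubj", "she")],
]
female_usages = count_pronoun_usages(toy_tree, {"she"}, {"her"})
print(female_usages)  # {'subject': 2, 'object': 1}
```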

dunning module

gender_analysis.analysis.dunning.compare_word_association_between_corpus_dunning(word, corpus1, corpus2, word_window=None, to_pickle=False, pickle_filename='dunning_associated_words.pgz')

Finds words associated with the given word in each of the two corpora and compares them. The function can search the documents automatically, or a word window can be passed in to refine the results.

Parameters:
  • word – Word to compare between the two corpora
  • corpus1 – Corpus object
  • corpus2 – Corpus object
  • word_window – If passed in as int, trims results to only show associated words within that range.
  • to_pickle – boolean determining if results should be pickled.
  • pickle_filename – str or Path object, location of existing pickle or save location for new pickle
Returns:

Dictionary
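The idea of a word window can be illustrated on a plain token list. This is a minimal sketch, assuming a symmetric window and whitespace tokenization; the helper name words_in_window is hypothetical, and the library's own tokenization and windowing may differ.

```python
from collections import Counter


def words_in_window(tokens, target, window):
    """Count tokens within `window` positions of each occurrence of `target`."""
    associated = Counter()
    for index, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, index - window)
        # Tokens before and after the target, excluding the target itself.
        neighbours = tokens[start:index] + tokens[index + 1:index + 1 + window]
        associated.update(neighbours)
    return associated


tokens = "she was sad and he was glad but she was calm".split()
nearby = words_in_window(tokens, "she", 2)
print(nearby["was"], nearby["sad"], nearby["he"])  # 2 1 0
```

Counters built this way for each corpus could then be compared word by word with a Dunning test.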

gender_analysis.analysis.dunning.compare_word_association_in_corpus_dunning(word1, word2, corpus, to_pickle=False, pickle_filename='dunning_vs_associated_words.pgz')

Uses Dunning analysis to compare the words associated with word1 vs those associated with word2 in the given corpus.

Parameters:
  • word1 – str
  • word2 – str
  • corpus – Corpus object
  • to_pickle – boolean; True if you wish to save the results as a Pickle file
  • pickle_filename – str or Path object; Only used if the pickle already exists or you wish to write a new pickle file
Returns:

Dictionary mapping words to dunning scores

gender_analysis.analysis.dunning.dunn_individual_word(total_words_in_corpus_1, total_words_in_corpus_2, count_of_word_in_corpus_1, count_of_word_in_corpus_2)

Applies Dunning log likelihood to compare the use of an individual word in two corpora, given the word's count in each corpus and each corpus's total word count.

Parameters:
  • total_words_in_corpus_1 – int, total wordcount in corpus 1
  • total_words_in_corpus_2 – int, total wordcount in corpus 2
  • count_of_word_in_corpus_1 – int, wordcount of one word in corpus 1
  • count_of_word_in_corpus_2 – int, wordcount of one word in corpus 2
Returns:

Float representing the Dunning log likelihood of the given inputs

>>> from gender_analysis.analysis.dunning import dunn_individual_word
>>> total_words_m_corpus = 8648489
>>> total_words_f_corpus = 8700765
>>> wordcount_female = 1000
>>> wordcount_male = 50
>>> dunn_individual_word(total_words_m_corpus,
...                      total_words_f_corpus,
...                      wordcount_male,
...                      wordcount_female)
-1047.8610274053995
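For readers who want to see the arithmetic, here is a minimal sketch of a signed two-term Dunning log likelihood. It is written from the standard G2 formula, not copied from the library; the function name dunning_log_likelihood is hypothetical, and the sign convention (negative when the word is under-represented in corpus 1) is an assumption chosen to agree with the example above.

```python
import math


def dunning_log_likelihood(total_1, total_2, count_1, count_2):
    """Signed Dunning log likelihood (G2) for one word in two corpora.

    Assumes both counts are positive, since log(0) is undefined.
    """
    # Expected counts if the word were equally frequent in both corpora.
    combined_rate = (count_1 + count_2) / (total_1 + total_2)
    expected_1 = total_1 * combined_rate
    expected_2 = total_2 * combined_rate
    score = 2 * (count_1 * math.log(count_1 / expected_1)
                 + count_2 * math.log(count_2 / expected_2))
    # Negative when the word is under-represented in corpus 1.
    return -score if count_1 < expected_1 else score


score = dunning_log_likelihood(8648489, 8700765, 50, 1000)
print(round(score, 2))  # -1047.86
```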
gender_analysis.analysis.dunning.dunn_individual_word_by_corpus(corpus1, corpus2, target_word)

Applies Dunning log likelihood to compare an individual word in two corpora. Negative values mark words distinctive of corpus2; positive values mark words distinctive of corpus1. The larger the magnitude of the number, the more distinctive the word is of its respective corpus.

Parameters:
  • target_word – desired word to compare
  • corpus1 – Corpus object
  • corpus2 – Corpus object
Returns:

log likelihoods and p value

>>> from gender_analysis import Corpus
>>> from gender_analysis.analysis.dunning import dunn_individual_word_by_corpus
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as FILEPATH2,
...     SMALL_TEST_CORPUS_CSV as PATH_TO_CSV
... )
>>> filepath1 = TEST_DATA_DIR / 'document_test_files'
>>> test_corpus1 = Corpus(filepath1)
>>> test_corpus2 = Corpus(FILEPATH2, csv_path = PATH_TO_CSV, ignore_warnings = True)
>>> dunn_individual_word_by_corpus(test_corpus1, test_corpus2, 'sad')
-332112.16673673474
gender_analysis.analysis.dunning.dunning_result_displayer(dunning_result, number_of_terms_to_display=10, corpus1_display_name=None, corpus2_display_name=None, part_of_speech_to_include=None, save_to_filename=None)

Convenience function to display dunning results as tables.

part_of_speech_to_include can either be a list of POS tags or one of ‘adjectives’, ‘adverbs’, ‘verbs’, or ‘pronouns’. If it is None, all terms are included.

Optionally saves the output to a text file.

Parameters:
  • dunning_result – Dunning result dict to display
  • number_of_terms_to_display – Number of terms for each corpus to display
  • corpus1_display_name – Name of corpus 1 (e.g. “Female Authors”)
  • corpus2_display_name – Name of corpus 2 (e.g. “Male Authors”)
  • part_of_speech_to_include – e.g. ‘adjectives’, or ‘verbs’
  • save_to_filename – Filename to save output
Returns:

gender_analysis.analysis.dunning.dunning_result_to_dict(dunning_result, number_of_terms_to_display=10, part_of_speech_to_include=None)

Receives a dictionary of results and returns a dictionary of the top number_of_terms_to_display most distinctive results for each corpus that have a part of speech matching part_of_speech_to_include

Parameters:
  • dunning_result – Dunning result dict that will be sorted through
  • number_of_terms_to_display – Number of terms for each corpus to display
  • part_of_speech_to_include – ‘adjectives’, ‘adverbs’, ‘verbs’, or ‘pronouns’
Returns:

dict
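Selecting the most distinctive terms for each corpus amounts to sorting on the Dunning score. The sketch below assumes a result dict of the form {'word': {'dunning': float}} and skips part-of-speech filtering entirely; the helper name top_distinctive_terms and the 'corpus1'/'corpus2' output keys are illustrative, not the library's.

```python
def top_distinctive_terms(dunning_result, number_of_terms=10):
    """Return the terms with the highest and the lowest Dunning scores."""
    ranked = sorted(dunning_result.items(),
                    key=lambda item: item[1]["dunning"],
                    reverse=True)
    # Positive scores lean towards corpus 1, negative ones towards corpus 2.
    corpus_1 = [term for term, data in ranked[:number_of_terms]
                if data["dunning"] > 0]
    corpus_2 = [term for term, data in reversed(ranked[-number_of_terms:])
                if data["dunning"] < 0]
    return {"corpus1": corpus_1, "corpus2": corpus_2}


toy_result = {
    "she": {"dunning": 8.5},
    "he": {"dunning": -8.5},
    "and": {"dunning": 0.0},
    "mrs": {"dunning": -3.2},
}
top = top_distinctive_terms(toy_result, number_of_terms=2)
print(top)  # {'corpus1': ['she'], 'corpus2': ['he', 'mrs']}
```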

gender_analysis.analysis.dunning.dunning_total(counter1, counter2, pickle_filepath=None)

Runs dunn_individual_word on the words shared by both Counter objects. Negative scores mark words distinctive of counter2; positive scores mark words distinctive of counter1. The larger the magnitude of the score, the more distinctive the word is of its respective Counter.

Use pickle_filepath to store the result so that it only has to be calculated once and can be reused for multiple analyses.

Parameters:
  • counter1 – Python Counter object
  • counter2 – Python Counter object
  • pickle_filepath – Filepath to store pickled results; will not save output if None
Returns:

Dictionary

>>> from collections import Counter
>>> from gender_analysis.analysis.dunning import dunning_total
>>> female_counter = Counter({'he': 1,  'she': 10, 'and': 10})
>>> male_counter =   Counter({'he': 10, 'she': 1,  'and': 10})
>>> results = dunning_total(female_counter, male_counter)

Results is a dict that maps from terms to results. Each result dict contains the dunning score…

>>> results['he']['dunning']
-8.547243830635558

… counts for corpora 1 and 2 as well as total count

>>> results['he']['count_total'], results['he']['count_corp1'], results['he']['count_corp2']
(11, 1, 10)

… and the same for frequencies

>>> results['he']['freq_total'], results['he']['freq_corp1'], results['he']['freq_corp2']
(0.2619047619047619, 0.047619047619047616, 0.47619047619047616)
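The result layout shown above can be reproduced with a short standalone sketch. This is not the library's code: the function name dunning_for_shared_words is hypothetical, the scoring uses the standard two-term G2 formula with an assumed sign convention, and every shared word is assumed to have a positive count in both Counters.

```python
import math
from collections import Counter


def dunning_for_shared_words(counter1, counter2):
    """Score every word present in both Counters (sketch, not library code)."""
    total1, total2 = sum(counter1.values()), sum(counter2.values())
    results = {}
    for word in counter1.keys() & counter2.keys():
        count1, count2 = counter1[word], counter2[word]
        expected1 = total1 * (count1 + count2) / (total1 + total2)
        expected2 = total2 * (count1 + count2) / (total1 + total2)
        score = 2 * (count1 * math.log(count1 / expected1)
                     + count2 * math.log(count2 / expected2))
        if count1 < expected1:  # the word leans towards counter2
            score = -score
        results[word] = {
            "dunning": score,
            "count_total": count1 + count2,
            "count_corp1": count1,
            "count_corp2": count2,
            "freq_total": (count1 + count2) / (total1 + total2),
            "freq_corp1": count1 / total1,
            "freq_corp2": count2 / total2,
        }
    return results


female_counter = Counter({"he": 1, "she": 10, "and": 10})
male_counter = Counter({"he": 10, "she": 1, "and": 10})
sketch_results = dunning_for_shared_words(female_counter, male_counter)
print(sketch_results["he"]["count_total"])  # 11
```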

gender_analysis.analysis.dunning.dunning_total_by_corpus(m_corpus, f_corpus)

Goes through two corpora, e.g. a corpus of male authors and a corpus of female authors, and runs dunn_individual_word on all words that are in BOTH corpora. Returns a sorted list of words and their Dunning scores, and shows the top 10 and lowest 10 words.

Parameters:
  • m_corpus – Corpus object
  • f_corpus – Corpus object
Returns:

list of tuples (common word, (dunning value, m_corpus_count, f_corpus_count))

>>> from gender_analysis.analysis.dunning import dunning_total_by_corpus
>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import TEST_CORPUS_PATH, SMALL_TEST_CORPUS_CSV
>>> c = Corpus(TEST_CORPUS_PATH, csv_path=SMALL_TEST_CORPUS_CSV, ignore_warnings = True)
>>> test_m_corpus = c.filter_by_gender('male')
>>> test_f_corpus = c.filter_by_gender('female')
>>> result = dunning_total_by_corpus(test_m_corpus, test_f_corpus)
>>> print(result[0])
('mrs', (-675.5338738828469, 1, 2031))

gender_analysis.analysis.dunning.dunning_words_by_author_gender(corpus, display_results=False, to_pickle=False, pickle_filename='dunning_male_vs_female_authors.pgz')

Tests distinctiveness of shared words between male and female authors using dunning analysis.

If called with display_results=True, prints out the most distinctive terms overall as well as grouped by verbs, adjectives etc. Returns a dict of all terms in the corpus mapped to the dunning data for each term

Parameters:
  • corpus – Corpus object
  • display_results – Boolean; reports a visualization of the results if True
  • to_pickle – Boolean; Will save the results to a pickle file if True
  • pickle_filename – Path to pickle object; will try to search for results in this location or write pickle file to path if to_pickle is true.
Returns:

dict
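The load-or-compute behaviour that to_pickle and pickle_filename describe can be sketched as follows. The helper name load_or_compute is hypothetical, and treating a .pgz file as a gzip-compressed pickle is an assumption based only on the file extension.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def load_or_compute(pickle_path, compute, to_pickle=False):
    """Return cached results if the file exists, else compute (and maybe save)."""
    pickle_path = Path(pickle_path)
    if pickle_path.exists():
        with gzip.open(pickle_path, "rb") as handle:
            return pickle.load(handle)
    results = compute()
    if to_pickle:
        with gzip.open(pickle_path, "wb") as handle:
            pickle.dump(results, handle)
    return results


calls = []


def expensive_analysis():
    calls.append(1)  # record that the computation actually ran
    return {"she": {"dunning": 8.5}}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "dunning_example.pgz"
    first = load_or_compute(path, expensive_analysis, to_pickle=True)
    second = load_or_compute(path, expensive_analysis, to_pickle=True)

print(first == second, len(calls))  # True 1
```

Only load pickle files you created yourself or otherwise trust, since unpickling can execute arbitrary code.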

gender_analysis.analysis.dunning.female_characters_author_gender_differences(corpus, to_pickle=False, pickle_filename='dunning_female_chars_author_gender.pgz')

Between male-author and female-author subcorpora, tests distinctiveness of words associated with female characters.

Prints out the most distinctive terms overall as well as grouped by verbs, adjectives etc.

Parameters:
  • corpus – Corpus object
  • to_pickle – boolean, False by default. Set to True in order to pickle results
  • pickle_filename – filename of results to be pickled
Returns:

dict

gender_analysis.analysis.dunning.freq_plot_to_show(results)

displays bar plot of relative frequency of all words in results

Parameters:
  • results – dict of results from dunning_total or similar, i.e. in the form {'word': {'freq_corp1': float, 'freq_corp2': float, 'freq_total': float}}
Returns:

None; displays a bar plot of the relative frequency of all words in results

gender_analysis.analysis.dunning.male_characters_author_gender_differences(corpus, to_pickle=False, pickle_filename='dunning_male_chars_auth_gender.pgz')

Between male-author and female-author subcorpora, tests distinctiveness of words associated with male characters

Prints out the most distinctive terms overall as well as grouped by verbs, adjectives etc.

Parameters:
  • corpus – Corpus object
  • to_pickle – boolean, False by default. Set to True in order to pickle results
  • pickle_filename – filename of results to be pickled
Returns:

dict

gender_analysis.analysis.dunning.masc_fem_associations_dunning(corpus, to_pickle=False, pickle_filename='dunning_he_vs_she_associated_words.pgz')

Uses Dunning analysis to compare words associated with HE_SERIES vs. words associated with SHE_SERIES in a given Corpus.

Parameters:
  • corpus – Corpus object
  • to_pickle – Boolean; saves results to a pickle file if True
  • pickle_filename – Filepath to save pickle file if to_pickle is True
Returns:

Dictionary

gender_analysis.analysis.dunning.score_plot_to_show(results)

displays bar plot of dunning scores for all words in results

Parameters:
  • results – dict of results from dunning_total or similar, i.e. in the form {'word': {'dunning': float}}
Returns:

None; displays a bar plot of the dunning scores for all words in results

frequency module

class gender_analysis.analysis.frequency.GenderFrequencyAnalyzer(genders: Optional[Sequence[gender_analysis.gender.gender.Gender]] = None, **kwargs)

Bases: gender_analysis.analysis.base_analyzers.CorpusAnalyzer

The GenderFrequencyAnalyzer instance accepts a series of texts and a series of Gender instances and finds occurrences of each of the Gender instances’ identifiers (currently pronouns). Helper methods are provided to organize and analyze those occurrences according to relevant criteria.

Instance methods:
by_date() by_document() by_gender() by_identifier() by_metadata()
by_date(time_frame: Tuple[int, int], bin_size: int, format_by: str = 'count', group_by: str = 'identifier')

Return analysis organized by date (as determined by Document metadata).

Parameters:
  • time_frame – a tuple of the format (start_date, end_date).
  • bin_size – int for the number of years represented in each list of frequencies
  • format_by – ‘count’ (default) returns raw counts; ‘frequency’ and ‘relative’ return counts as a frequency of all words in the texts or relative to one another, respectively.
  • group_by – ‘identifier’ (default) keeps one count per identifier; ‘label’ and ‘aggregate’ return counts grouped by pronoun category (‘subject’ and ‘object’) or summed, respectively.
Returns:

a dictionary of gender-pronoun pairs with top-level keys corresponding to the values in the input Documents’ ‘date’ metadata key.

>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> analyzer.by_date((2000, 2008), 2).keys()
dict_keys([2000, 2002, 2004, 2006])
>>> actual_analysis = analyzer.by_date((2000, 2010), 2).get(2002).get('Female')
>>> expected_analysis = {'she': 0, 'her': 7, 'herself': 0, 'hers': 0}
>>> actual_analysis == expected_analysis
True
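The binning implied by time_frame and bin_size can be sketched with plain Counters. This is an illustration only: the helper name bin_counts_by_date is hypothetical, the input is a list of made-up (date, counts) pairs in place of Document objects, and treating end_date as exclusive is an assumption that merely agrees with the keys shown in the example above.

```python
from collections import Counter


def bin_counts_by_date(dated_counts, time_frame, bin_size):
    """Sum per-document Counters into bins keyed by each bin's first year."""
    start, end = time_frame
    bins = {year: Counter() for year in range(start, end, bin_size)}
    for date, counts in dated_counts:
        if start <= date < end:
            bin_start = start + ((date - start) // bin_size) * bin_size
            bins[bin_start].update(counts)
    return bins


# Hypothetical (date, pronoun counts) pairs standing in for Documents.
documents = [
    (2002, Counter({"her": 4})),
    (2003, Counter({"her": 3, "she": 1})),
    (2007, Counter({"she": 2})),
    (2010, Counter({"she": 9})),  # outside the time frame, so ignored
]
binned = bin_counts_by_date(documents, (2000, 2008), 2)
print(list(binned))  # [2000, 2002, 2004, 2006]
print(binned[2002])  # Counter({'her': 7, 'she': 1})
```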
by_document(format_by: str = 'count', group_by: str = 'identifier')

Return analysis organized by Document.label.

Parameters:
  • format_by – ‘count’ (default) returns raw counts; ‘frequency’ and ‘relative’ return counts as a frequency of all words in the texts or relative to one another, respectively.
  • group_by – ‘identifier’ (default) keeps one count per identifier; ‘label’ and ‘aggregate’ return counts grouped by pronoun category (‘subject’ and ‘object’) or summed, respectively.
Returns:

a dictionary of gender-pronoun pairs with top-level keys corresponding to the labels of the input Documents.

>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> doc = analyzer.corpus.documents[7]
>>> analyzer_document_labels = list(analyzer.by_document().keys())
>>> document_labels = list(map(lambda d: d.label, analyzer.corpus.documents))
>>> analyzer_document_labels == document_labels
True
>>> expected_result = {'hers': 0, 'herself': 0, 'she': 0, 'her': 6}
>>> actual_result = analyzer.by_document().get(doc.label).get('Female')
>>> expected_result == actual_result
True
by_gender(format_by: str = 'count', group_by: str = 'identifier')

Return analysis organized by Gender.label.

Parameters:
  • format_by – ‘count’ (default) returns raw counts; ‘frequency’ and ‘relative’ return counts as a frequency of all words in the texts or relative to one another, respectively.
  • group_by – ‘identifier’ (default) keeps one count per identifier; ‘label’ and ‘aggregate’ return counts grouped by pronoun category (‘subject’ and ‘object’) or summed, respectively.
Returns:

a dictionary of gender-pronoun pairs with top-level keys corresponding to the labels of the input Genders.

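>>> from collections import Counter
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV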
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> expected_female_results = Counter({'her': 14, 'she': 8, 'herself': 1, 'hers': 0})
>>> expected_male_results = Counter({'his': 11, 'he': 11, 'him': 4, 'himself': 2})
>>> expected_results = {'Male': expected_male_results, 'Female': expected_female_results}
>>> actual_results = analyzer.by_gender()
>>> actual_results == expected_results
True
by_identifier(format_by: str = 'count', group_by: str = 'identifier') → Union[int, collections.Counter, Dict[str, float]]

Return analysis organized by Gender identifiers.

Parameters:
  • format_by – ‘count’ (default) returns raw counts; ‘frequency’ and ‘relative’ return counts as a frequency of all words in the texts or relative to one another, respectively.
  • group_by – ‘identifier’ (default) keeps one count per identifier; ‘label’ and ‘aggregate’ return counts grouped by pronoun category (‘subject’ and ‘object’) or summed, respectively.
Returns:

a dictionary of gender-pronoun pairs.

>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> actual_results = analyzer.by_identifier()
>>> expected_results = {'she': 8, 'herself': 1, 'her': 14, 'hers': 0, 'him': 4, 'he': 11,
...                     'himself': 2, 'his': 11}
>>> actual_results == expected_results
True
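One possible reading of the ‘relative’ and ‘frequency’ formats is sketched below, using the counts from the example above. The helper names as_relative and as_frequency are hypothetical, the total word count of 1000 is made up, and the library may normalize differently (for example within each gender).

```python
from collections import Counter


def as_relative(counts):
    """Each count as a share of all counted identifiers."""
    total = sum(counts.values())
    return {word: (count / total if total else 0.0)
            for word, count in counts.items()}


def as_frequency(counts, total_words):
    """Each count as a share of all words in the texts."""
    return {word: count / total_words for word, count in counts.items()}


counts = Counter({"she": 8, "herself": 1, "her": 14, "hers": 0,
                  "him": 4, "he": 11, "himself": 2, "his": 11})
relative = as_relative(counts)
frequency = as_frequency(counts, total_words=1000)  # 1000 is a made-up total
print(round(relative["her"], 3))  # 0.275
print(frequency["he"])            # 0.011
```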
by_metadata(metadata_key: str, format_by: str = 'count', group_by: str = 'identifier') → Dict[Union[str, int, float], Union[collections.Counter, Dict[str, float]]]

Return analysis organized by input metadata_key (as determined by Document metadata).

Parameters:
  • metadata_key – a string corresponding to one of the columns in the input metadata csv.
  • format_by – ‘count’ (default) returns raw counts; ‘frequency’ and ‘relative’ return counts as a frequency of all words in the texts or relative to one another, respectively.
  • group_by – ‘identifier’ (default) keeps one count per identifier; ‘label’ and ‘aggregate’ return counts grouped by pronoun category (‘subject’ and ‘object’) or summed, respectively.
Returns:

a dictionary of gender-pronoun pairs with top-level keys corresponding to the input metadata_key.

>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> analyzer.by_metadata('author_gender').keys()
dict_keys(['male', 'female'])
>>> actual_results = analyzer.by_metadata('author_gender').get('male').get('Male')
>>> expected_results = {'himself': 0, 'his': 4, 'him': 1, 'he': 7}
>>> actual_results == expected_results
True

instance distance module

metadata_visualizations module

proximity module