gender_analysis.analysis package¶

base analyzers¶

dependency parsing module¶

gender_analysis.analysis.dependency_parsing.generate_dependency_tree(document, genders=None, pickle_filepath=None)¶

Returns the dependency tree for a given document. Optionally, the analysis can be restricted to sentences that involve the subject/object pronouns of the specified genders.

Parameters:

document – Document we are interested in
genders – a collection of genders used to filter out sentences that do not involve the provided genders. If set to None (default), all sentences are parsed.
pickle_filepath – filepath to store the pickled dependency tree; no file is written if None

Returns:

dependency tree, represented as a nested list

gender_analysis.analysis.dependency_parsing.get_descriptive_adjectives(tree, gender)¶

Returns a list of adjectives describing pronouns for the given gender in the given dependency tree.

Parameters:

tree – dependency tree for a document, output of generate_dependency_tree
gender – Gender to search for usages of

Returns:

List of adjectives as strings

gender_analysis.analysis.dependency_parsing.get_descriptive_verbs(tree, gender)¶

Returns a list of verbs describing pronouns of the given gender in the given dependency tree.

Parameters:

tree – dependency tree for a document, output of generate_dependency_tree
gender – Gender to search for usages of

Returns:

List of verbs as strings

gender_analysis.analysis.dependency_parsing.get_pronoun_usages(tree, gender)¶

Returns a dictionary counting how often a given gender’s pronouns occur as the subject and as the object of a sentence.

Parameters:

tree – dependency tree for a document, output of generate_dependency_tree
gender – Gender to check

Returns:

Dictionary counting the times the given gender’s pronouns are used as the subject and object, formatted as {‘subject’: <int>, ‘object’: <int>}
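
As a rough illustration of the return shape only, the sketch below counts subject and object usages from a flat list of (word, dependency label) pairs. The pair format, the labels 'nsubj' and 'dobj', the pronoun sets, and the helper count_pronoun_usages are all assumptions made for this example; the nested-list tree produced by generate_dependency_tree may be structured differently.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical pronoun sets for one gender (assumption for this sketch).
SUBJECT_PRONOUNS = {"she"}
OBJECT_PRONOUNS = {"her"}


def count_pronoun_usages(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count pronouns used as a subject ('nsubj') or as an object ('dobj')."""
    usages = {"subject": 0, "object": 0}
    for word, relation in pairs:
        token = word.lower()
        if relation == "nsubj" and token in SUBJECT_PRONOUNS:
            usages["subject"] += 1
        elif relation == "dobj" and token in OBJECT_PRONOUNS:
            usages["object"] += 1
    return usages


# Toy input: (word, dependency label) pairs, not the library's real tree format.
toy_pairs = [("She", "nsubj"), ("saw", "root"), ("her", "dobj"), ("she", "nsubj")]
print(count_pronoun_usages(toy_pairs))  # {'subject': 2, 'object': 1}
```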

dunning module¶

gender_analysis.analysis.dunning.compare_word_association_between_corpus_dunning(word, corpus1, corpus2, word_window=None, to_pickle=False, pickle_filename='dunning_associated_words.pgz')¶

Finds words associated with the given word and compares them between the two corpora. By default the function searches the documents automatically; passing in a word window refines the results.

Parameters:

word – Word to compare between the two corpora
corpus1 – Corpus object
corpus2 – Corpus object
word_window – If passed in as an int, trims results to only show associated words within that range
to_pickle – boolean determining if results should be pickled
pickle_filename – str or Path object, location of existing pickle or save location for new pickle

Returns:

Dictionary
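
One way to picture the word_window option is as a limit on how far from the target word a neighbouring word may sit. The sketch below is an assumption about that behaviour rather than the library's code; words_near is a hypothetical helper that works on a plain list of tokens.

```python
from collections import Counter
from typing import List


def words_near(tokens: List[str], target: str, window: int) -> Counter:
    """Count tokens found at most `window` positions away from `target`."""
    neighbours = Counter()
    for index, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, index - window)
        stop = min(len(tokens), index + window + 1)
        for position in range(start, stop):
            if position != index:  # skip the target occurrence itself
                neighbours[tokens[position]] += 1
    return neighbours


tokens = "she was very happy and she was kind".split()
print(words_near(tokens, "she", window=2)["was"])  # 2
```

The resulting counters for each corpus could then be compared with a Dunning test such as dunning_total.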

gender_analysis.analysis.dunning.compare_word_association_in_corpus_dunning(word1, word2, corpus, to_pickle=False, pickle_filename='dunning_vs_associated_words.pgz')¶

Uses Dunning analysis to compare the words associated with word1 vs those associated with word2 in the given corpus.

Parameters:

word1 – str
word2 – str
corpus – Corpus object
to_pickle – boolean; True if you wish to save the results as a pickle file
pickle_filename – str or Path object; only used if the pickle already exists or you wish to write a new pickle file

Returns:

Dictionary mapping words to Dunning scores

gender_analysis.analysis.dunning.dunn_individual_word(total_words_in_corpus_1, total_words_in_corpus_2, count_of_word_in_corpus_1, count_of_word_in_corpus_2)¶

Applies Dunning log likelihood to compare the counts of an individual word in two corpora.

Parameters:

total_words_in_corpus_1 – int, total wordcount in corpus 1
total_words_in_corpus_2 – int, total wordcount in corpus 2
count_of_word_in_corpus_1 – int, wordcount of one word in corpus 1
count_of_word_in_corpus_2 – int, wordcount of one word in corpus 2

Returns:

Float representing the Dunning log likelihood of the given inputs

>>> from gender_analysis.analysis.dunning import dunn_individual_word
>>> total_words_m_corpus = 8648489
>>> total_words_f_corpus = 8700765
>>> wordcount_female = 1000
>>> wordcount_male = 50
>>> dunn_individual_word(total_words_m_corpus,
...                      total_words_f_corpus,
...                      wordcount_male,
...                      wordcount_female)
-1047.8610274053995
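
For readers who want to see the arithmetic, here is a minimal sketch of a signed, two-term Dunning log-likelihood score. It is written from the general definition of the statistic and happens to reproduce the value in the example above; the function name dunning_score and the sign rule are assumptions, and the library's own implementation may differ in detail.

```python
import math


def dunning_score(total_1: int, total_2: int, count_1: int, count_2: int) -> float:
    """Signed two-term Dunning log likelihood for one word in two corpora.

    Assumes both counts are greater than zero.
    """
    # Rate of the word in the two corpora combined.
    combined_rate = (count_1 + count_2) / (total_1 + total_2)
    # Counts we would expect if the word were equally common in both corpora.
    expected_1 = total_1 * combined_rate
    expected_2 = total_2 * combined_rate
    term_1 = count_1 * math.log(count_1 / expected_1)
    term_2 = count_2 * math.log(count_2 / expected_2)
    score = 2 * (term_1 + term_2)
    # Assumed sign rule: negative when the word is under-represented in corpus 1.
    return -score if term_1 < 0 else score


print(round(dunning_score(8648489, 8700765, 50, 1000), 2))  # -1047.86
```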

gender_analysis.analysis.dunning.dunn_individual_word_by_corpus(corpus1, corpus2, target_word)¶

Applies Dunning log likelihood to compare an individual word in two corpora. Negative scores mark words that are distinctive of corpus2; positive scores mark words that are distinctive of corpus1. The larger the magnitude of the number, the more distinctive the word is of its respective corpus.

Parameters:

target_word – desired word to compare
corpus1 – Corpus object
corpus2 – Corpus object

Returns:

log likelihoods and p value

>>> from gender_analysis import Corpus
>>> from gender_analysis.analysis.dunning import dunn_individual_word_by_corpus
>>> from gender_analysis.testing.common import TEST_DATA_DIR

>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as FILEPATH2,
...     SMALL_TEST_CORPUS_CSV as PATH_TO_CSV
... )
>>> filepath1 = TEST_DATA_DIR / 'document_test_files'
>>> test_corpus1 = Corpus(filepath1)
>>> test_corpus2 = Corpus(FILEPATH2, csv_path = PATH_TO_CSV, ignore_warnings = True)
>>> dunn_individual_word_by_corpus(test_corpus1, test_corpus2, 'sad')
-332112.16673673474

gender_analysis.analysis.dunning.dunning_result_displayer(dunning_result, number_of_terms_to_display=10, corpus1_display_name=None, corpus2_display_name=None, part_of_speech_to_include=None, save_to_filename=None)¶

Convenience function to display dunning results as tables.

part_of_speech_to_include can be either a list of POS tags or one of ‘adjectives’, ‘adverbs’, ‘verbs’, or ‘pronouns’. If it is None, all terms are included.

Optionally saves the output to a text file.

Parameters:

dunning_result – Dunning result dict to display
number_of_terms_to_display – Number of terms for each corpus to display
corpus1_display_name – Name of corpus 1 (e.g. “Female Authors”)
corpus2_display_name – Name of corpus 2 (e.g. “Male Authors”)
part_of_speech_to_include – e.g. ‘adjectives’, or ‘verbs’
save_to_filename – Filename to save output

Returns:

gender_analysis.analysis.dunning.dunning_result_to_dict(dunning_result, number_of_terms_to_display=10, part_of_speech_to_include=None)¶

Receives a dictionary of results and returns a dictionary of the top number_of_terms_to_display most distinctive results for each corpus that have a part of speech matching part_of_speech_to_include.

Parameters:

dunning_result – Dunning result dict that will be sorted through
number_of_terms_to_display – Number of terms for each corpus to display
part_of_speech_to_include – ‘adjectives’, ‘adverbs’, ‘verbs’, or ‘pronouns’

Returns:

dict
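
The selection step can be sketched without the part-of-speech filter. The example below assumes a plain mapping from term to signed Dunning score, with negative scores belonging to corpus 2 and positive scores to corpus 1; the helper top_terms, the output keys, and the toy scores are hypothetical and do not reflect the library's actual return structure.

```python
from typing import Dict, List


def top_terms(scores: Dict[str, float], n: int) -> Dict[str, List[str]]:
    """Return up to n of the most distinctive terms for each corpus."""
    ascending = sorted(scores, key=scores.get)
    descending = sorted(scores, key=scores.get, reverse=True)
    return {
        # Largest positive scores: distinctive of corpus 1.
        "corpus1": [term for term in descending[:n] if scores[term] > 0],
        # Most negative scores: distinctive of corpus 2.
        "corpus2": [term for term in ascending[:n] if scores[term] < 0],
    }


# Made-up scores for illustration only.
toy_scores = {"mrs": -675.5, "she": -120.0, "and": 0.3, "he": 98.2, "sir": 410.7}
print(top_terms(toy_scores, 2))
# {'corpus1': ['sir', 'he'], 'corpus2': ['mrs', 'she']}
```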

gender_analysis.analysis.dunning.dunning_total(counter1, counter2, pickle_filepath=None)¶

Runs dunning_individual on words shared by both counter objects. Negative scores mark words that are distinctive of counter_2; positive scores mark words that are distinctive of counter_1. The larger the magnitude of the number, the more distinctive that word is in its respective counter object.

Use pickle_filepath to store the result so it only has to be calculated once and can be used for multiple analyses.

Parameters:

counter1 – Python Counter object
counter2 – Python Counter object
pickle_filepath – Filepath to store pickled results; will not save output if None

Returns:

Dictionary

>>> from collections import Counter
>>> from gender_analysis.analysis.dunning import dunning_total
>>> female_counter = Counter({'he': 1,  'she': 10, 'and': 10})
>>> male_counter =   Counter({'he': 10, 'she': 1,  'and': 10})
>>> results = dunning_total(female_counter, male_counter)

results is a dict that maps terms to result dicts. Each result dict contains the Dunning score...

>>> results['he']['dunning']
-8.547243830635558

... the counts for corpora 1 and 2 as well as the total count...

>>> results['he']['count_total'], results['he']['count_corp1'], results['he']['count_corp2']
(11, 1, 10)

... and the same for frequencies.

>>> results['he']['freq_total'], results['he']['freq_corp1'], results['he']['freq_corp2']
(0.2619047619047619, 0.047619047619047616, 0.47619047619047616)
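
The structure of the result can be approximated in a few lines. The sketch below rebuilds the seven fields shown above for every word that both counters share; compare_counters and signed_score are hypothetical names, and the scoring formula is an assumption that reproduces the numbers in this example rather than a copy of the library's code.

```python
import math
from collections import Counter
from typing import Dict


def signed_score(total_1: int, total_2: int, count_1: int, count_2: int) -> float:
    """Signed two-term Dunning log likelihood (assumed formula)."""
    rate = (count_1 + count_2) / (total_1 + total_2)
    term_1 = count_1 * math.log(count_1 / (total_1 * rate))
    term_2 = count_2 * math.log(count_2 / (total_2 * rate))
    score = 2 * (term_1 + term_2)
    return -score if term_1 < 0 else score


def compare_counters(counter1: Counter, counter2: Counter) -> Dict[str, dict]:
    """Build one result dict per word that occurs in both counters."""
    total_1, total_2 = sum(counter1.values()), sum(counter2.values())
    results = {}
    for word in counter1.keys() & counter2.keys():
        count_1, count_2 = counter1[word], counter2[word]
        if count_1 <= 0 or count_2 <= 0:
            continue  # the logarithm is undefined for zero counts
        results[word] = {
            "dunning": signed_score(total_1, total_2, count_1, count_2),
            "count_total": count_1 + count_2,
            "count_corp1": count_1,
            "count_corp2": count_2,
            "freq_total": (count_1 + count_2) / (total_1 + total_2),
            "freq_corp1": count_1 / total_1,
            "freq_corp2": count_2 / total_2,
        }
    return results


female_counter = Counter({"he": 1, "she": 10, "and": 10})
male_counter = Counter({"he": 10, "she": 1, "and": 10})
sketch_results = compare_counters(female_counter, male_counter)
print(sketch_results["he"]["count_total"])  # 11
```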

gender_analysis.analysis.dunning.dunning_total_by_corpus(m_corpus, f_corpus)¶

Goes through two corpora (e.g. a corpus of male authors and a corpus of female authors) and runs dunning_individual on all words that are in BOTH corpora. Returns the words and their Dunning scores in sorted order, and shows the top 10 and lowest 10 words.

Parameters:

m_corpus – Corpus object
f_corpus – Corpus object

Returns:

list of tuples (common word, (dunning value, m_corpus_count, f_corpus_count))

>>> from gender_analysis.analysis.dunning import dunning_total_by_corpus
>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import TEST_CORPUS_PATH, SMALL_TEST_CORPUS_CSV
>>> c = Corpus(TEST_CORPUS_PATH, csv_path=SMALL_TEST_CORPUS_CSV, ignore_warnings = True)
>>> test_m_corpus = c.filter_by_gender('male')
>>> test_f_corpus = c.filter_by_gender('female')
>>> result = dunning_total_by_corpus(test_m_corpus, test_f_corpus)
>>> print(result[0])
('mrs', (-675.5338738828469, 1, 2031))

gender_analysis.analysis.dunning.dunning_words_by_author_gender(corpus, display_results=False, to_pickle=False, pickle_filename='dunning_male_vs_female_authors.pgz')¶

Tests distinctiveness of shared words between male and female authors using dunning analysis.

If called with display_results=True, prints out the most distinctive terms overall as well as grouped by verbs, adjectives etc. Returns a dict of all terms in the corpus mapped to the Dunning data for each term.

Parameters:

corpus – Corpus object
display_results – Boolean; reports a visualization of the results if True
to_pickle – Boolean; will save the results to a pickle file if True
pickle_filename – Path to pickle object; will try to search for results in this location or write pickle file to path if to_pickle is true

Returns:

dict
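
The to_pickle / pickle_filename pair describes a load-or-compute pattern: reuse stored results when the file exists, otherwise compute them and optionally save. A minimal sketch of that pattern follows. It assumes that a '.pgz' file is a gzip-compressed pickle, which the extension suggests but this page does not confirm; load_or_compute is a hypothetical helper, not part of the library.

```python
import gzip
import pickle
import tempfile
from pathlib import Path
from typing import Callable


def load_or_compute(path: Path, compute: Callable[[], dict], to_pickle: bool = False) -> dict:
    """Return cached results from `path` if present, otherwise compute them."""
    if path.exists():
        with gzip.open(path, "rb") as handle:
            return pickle.load(handle)
    results = compute()
    if to_pickle:
        with gzip.open(path, "wb") as handle:
            pickle.dump(results, handle)
    return results


with tempfile.TemporaryDirectory() as folder:
    cache = Path(folder) / "example_results.pgz"
    first = load_or_compute(cache, lambda: {"he": -8.5}, to_pickle=True)
    # The second call finds the file and never runs its compute function.
    second = load_or_compute(cache, lambda: {"unused": 0.0})
    print(first == second)  # True
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.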

gender_analysis.analysis.dunning.female_characters_author_gender_differences(corpus, to_pickle=False, pickle_filename='dunning_female_chars_author_gender.pgz')¶

Between male-author and female-author subcorpora, tests distinctiveness of words associated with female characters.

Prints out the most distinctive terms overall as well as grouped by verbs, adjectives etc.

Parameters:

corpus – Corpus object
to_pickle – boolean, False by default. Set to True in order to pickle results
pickle_filename – filename of results to be pickled

Returns:

dict

gender_analysis.analysis.dunning.freq_plot_to_show(results)¶

Displays a bar plot of the relative frequency of all words in results.

Parameters:	results – dict of results from dunning_total or similar, i.e. in the form {‘word’: {‘freq_corp1’: float, ‘freq_corp2’: float, ‘freq_total’: float}}
Returns:	None, displays bar plot of relative frequency of all words in results

gender_analysis.analysis.dunning.male_characters_author_gender_differences(corpus, to_pickle=False, pickle_filename='dunning_male_chars_auth_gender.pgz')¶

Between male-author and female-author subcorpora, tests distinctiveness of words associated with male characters

Prints out the most distinctive terms overall as well as grouped by verbs, adjectives etc.

Parameters:

corpus – Corpus object
to_pickle – boolean, False by default. Set to True in order to pickle results
pickle_filename – filename of results to be pickled

Returns:

dict

gender_analysis.analysis.dunning.masc_fem_associations_dunning(corpus, to_pickle=False, pickle_filename='dunning_he_vs_she_associated_words.pgz')¶

Uses Dunning analysis to compare words associated with HE_SERIES vs. words associated with SHE_SERIES in a given Corpus.

Parameters:

corpus – Corpus object
to_pickle – Boolean; saves results to a pickle file if True
pickle_filename – Filepath to save pickle file if to_pickle is True

Returns:

Dictionary

gender_analysis.analysis.dunning.score_plot_to_show(results)¶

Displays a bar plot of the Dunning scores for all words in results.

Parameters:	results – dict of results from dunning_total or similar, i.e. in the form {‘word’: { ‘dunning’: float}}
Returns:	None, displays bar plot of dunning scores for all words in results

frequency module¶

class gender_analysis.analysis.frequency.GenderFrequencyAnalyzer(genders: Optional[Sequence[gender_analysis.gender.gender.Gender]] = None, **kwargs)¶

Bases: gender_analysis.analysis.base_analyzers.CorpusAnalyzer

The GenderFrequencyAnalyzer instance accepts a series of texts and a series of Gender instances and finds occurrences of each of the Gender instances’ identifiers (currently pronouns). Helper methods are provided to organize and analyze those occurrences according to relevant criteria.

Instance methods:

by_date()
by_document()
by_gender()
by_identifier()
by_metadata()

by_date(time_frame: Tuple[int, int], bin_size: int, format_by: str = 'count', group_by: str = 'identifier')¶

Return analysis organized by date (as determined by Document metadata).

Parameters:

time_frame – a tuple of the format (start_date, end_date).
bin_size – int for the number of years represented in each list of frequencies
format_by – ‘frequency’ returns counts as a frequency of all words in the texts; ‘relative’ returns counts relative to one another. Defaults to ‘count’.
group_by – ‘label’ groups counts by pronoun category (‘subject’ and ‘object’); ‘aggregate’ sums them. Defaults to ‘identifier’.

Returns:

a dictionary of gender-pronoun pairs with top-level keys corresponding to the values in the input Documents’ ‘date’ metadata key.

>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> analyzer.by_date((2000, 2008), 2).keys()
dict_keys([2000, 2002, 2004, 2006])
>>> actual_analysis = analyzer.by_date((2000, 2010), 2).get(2002).get('Female')
>>> expected_analysis = {'she': 0, 'her': 7, 'herself': 0, 'hers': 0}
>>> actual_analysis == expected_analysis
True
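
The keys in the example above are consistent with bins that start at start_date and advance by bin_size, stopping before end_date. The sketch below illustrates that reading with plain integers. Treating the end of the time frame as exclusive is an assumption drawn from the example output, and bin_years is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def bin_years(years: List[int], time_frame: Tuple[int, int], bin_size: int) -> Dict[int, List[int]]:
    """Group years into bins keyed by the first year of each bin."""
    start, end = time_frame
    bins: Dict[int, List[int]] = {first: [] for first in range(start, end, bin_size)}
    for year in years:
        if start <= year < end:  # assumed: the end of the time frame is exclusive
            first = start + ((year - start) // bin_size) * bin_size
            bins[first].append(year)
    return bins


binned = bin_years([2000, 2001, 2003, 2007, 2012], (2000, 2008), 2)
print(list(binned))  # [2000, 2002, 2004, 2006]
```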

by_document(format_by: str = 'count', group_by: str = 'identifier')¶

Return analysis organized by Document.label.

Parameters:

format_by – ‘frequency’ returns counts as a frequency of all words in the texts; ‘relative’ returns counts relative to one another. Defaults to ‘count’.
group_by – ‘label’ groups counts by pronoun category (‘subject’ and ‘object’); ‘aggregate’ sums them. Defaults to ‘identifier’.

Returns:

a dictionary of gender-pronoun pairs with top-level keys corresponding to the labels of the input Documents.

>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> doc = analyzer.corpus.documents[7]
>>> analyzer_document_labels = list(analyzer.by_document().keys())
>>> document_labels = list(map(lambda d: d.label, analyzer.corpus.documents))
>>> analyzer_document_labels == document_labels
True
>>> expected_result = {'hers': 0, 'herself': 0, 'she': 0, 'her': 6}
>>> actual_result = analyzer.by_document().get(doc.label).get('Female')
>>> expected_result == actual_result
True

by_gender(format_by: str = 'count', group_by: str = 'identifier')¶

Return analysis organized by Gender.label.

Parameters:

format_by – ‘frequency’ returns counts as a frequency of all words in the texts; ‘relative’ returns counts relative to one another. Defaults to ‘count’.
group_by – ‘label’ groups counts by pronoun category (‘subject’ and ‘object’); ‘aggregate’ sums them. Defaults to ‘identifier’.

Returns:

a dictionary of gender-pronoun pairs with top-level keys corresponding to the labels of the input Genders.

>>> from collections import Counter
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> actual_female_results = Counter({'her': 14, 'she': 8, 'herself': 1, 'hers': 0})
>>> actual_male_results = Counter({'his': 11, 'he': 11, 'him': 4, 'himself': 2})
>>> expected_results = {'Male': actual_male_results, 'Female': actual_female_results}
>>> actual_results = analyzer.by_gender()
>>> actual_results == expected_results
True
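
The format_by and group_by options describe simple transformations of raw counts like the ones above. The sketch below shows one plausible reading of each; the helpers as_relative and as_frequency are hypothetical, the word total of 1000 is made up, and the library's exact output format may differ.

```python
from collections import Counter
from typing import Dict


def as_relative(counts: Counter) -> Dict[str, float]:
    """Scale counts so they sum to 1 (counts relative to one another)."""
    total = sum(counts.values())
    return {word: (count / total if total else 0.0) for word, count in counts.items()}


def as_frequency(counts: Counter, total_words: int) -> Dict[str, float]:
    """Express counts as a share of all words in the texts."""
    return {word: count / total_words for word, count in counts.items()}


female_counts = Counter({"her": 14, "she": 8, "herself": 1, "hers": 0})
relative = as_relative(female_counts)             # 'her' becomes 14 / 23
frequency = as_frequency(female_counts, 1000)     # 'her' becomes 14 / 1000
aggregate = sum(female_counts.values())           # one summed count per gender
print(aggregate)  # 23
```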

by_identifier(format_by: str = 'count', group_by: str = 'identifier') → Union[int, collections.Counter, Dict[str, float]]¶

Return analysis organized by Gender identifiers.

Parameters:

format_by – ‘frequency’ returns counts as a frequency of all words in the texts; ‘relative’ returns counts relative to one another. Defaults to ‘count’.
group_by – ‘label’ groups counts by pronoun category (‘subject’ and ‘object’); ‘aggregate’ sums them. Defaults to ‘identifier’.

Returns:

a dictionary of gender-pronoun pairs.

>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> actual_results = analyzer.by_identifier()
>>> expected_results = {'she': 8, 'herself': 1, 'her': 14, 'hers': 0,
...                     'him': 4, 'he': 11, 'himself': 2, 'his': 11}
>>> actual_results == expected_results
True

by_metadata(metadata_key: str, format_by: str = 'count', group_by: str = 'identifier') → Dict[Union[str, int, float], Union[collections.Counter, Dict[str, float]]]¶

Return analysis organized by input metadata_key (as determined by Document metadata).

Parameters:

metadata_key – a string corresponding to one of the columns in the input metadata csv.
format_by – ‘frequency’ returns counts as a frequency of all words in the texts; ‘relative’ returns counts relative to one another. Defaults to ‘count’.
group_by – ‘label’ groups counts by pronoun category (‘subject’ and ‘object’); ‘aggregate’ sums them. Defaults to ‘identifier’.

Returns:

a dictionary of gender-pronoun pairs with top-level keys corresponding to the input metadata_key.

>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> analyzer.by_metadata('author_gender').keys()
dict_keys(['male', 'female'])
>>> actual_results = analyzer.by_metadata('author_gender').get('male').get('Male')
>>> expected_results = {'himself': 0, 'his': 4, 'him': 1, 'he': 7}
>>> actual_results == expected_results
True
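
Grouping by a metadata column amounts to summing per-document counts under each distinct value of that column. The sketch below shows the idea with toy data. It leaves out the per-gender level of nesting that the real result has, and group_counts and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def group_counts(documents: List[Tuple[dict, Counter]], metadata_key: str) -> Dict[str, Counter]:
    """Sum pronoun counts over all documents sharing a metadata value."""
    grouped = defaultdict(Counter)
    for metadata, counts in documents:
        grouped[metadata[metadata_key]].update(counts)  # update() adds counts
    return dict(grouped)


# Each toy document is a (metadata, pronoun counts) pair.
toy_documents = [
    ({"author_gender": "male"}, Counter({"he": 4, "his": 1})),
    ({"author_gender": "female"}, Counter({"he": 2, "she": 5})),
    ({"author_gender": "male"}, Counter({"he": 3, "him": 1})),
]
grouped = group_counts(toy_documents, "author_gender")
print(grouped["male"])  # Counter({'he': 7, 'his': 1, 'him': 1})
```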
