gender_analysis.analysis package¶
base analyzers¶
dependency parsing module¶
-
gender_analysis.analysis.dependency_parsing.generate_dependency_tree(document, genders=None, pickle_filepath=None)¶ This function returns the dependency tree for a given document. This can optionally be reduced such that it will only analyze sentences that involve specified genders’ subject/object pronouns.
Parameters: - document – Document we are interested in
- genders – a collection of genders that will be used to filter out sentences that do not involve the provided genders. If set to None, all sentences are parsed (default).
- pickle_filepath – filepath to store pickled dependency tree, will not write a file if None
Returns: dependency tree, represented as a nested list
-
gender_analysis.analysis.dependency_parsing.get_descriptive_adjectives(tree, gender)¶ Returns a list of adjectives describing pronouns for the given gender in the given dependency tree.
Parameters: - tree – dependency tree for a document, output of generate_dependency_tree
- gender – Gender to search for usages of
Returns: List of adjectives as strings
-
gender_analysis.analysis.dependency_parsing.get_descriptive_verbs(tree, gender)¶ Returns a list of verbs describing pronouns of the given gender in the given dependency tree.
Parameters: - tree – dependency tree for a document, output of generate_dependency_tree
- gender – Gender to search for usages of
Returns: List of verbs as strings
-
gender_analysis.analysis.dependency_parsing.get_pronoun_usages(tree, gender)¶ Returns a dictionary relating the occurrences of a given gender’s pronouns as the subject and object of a sentence.
Parameters: - tree – dependency tree for a document, output of generate_dependency_tree
- gender – Gender to check
Returns: Dictionary counting the times the given gender’s pronouns are used as the subject and object, formatted as {'subject': <int>, 'object': <int>}
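To make the shape of that return value concrete, here is a minimal sketch of counting subject and object pronouns. It does not use the package's dependency tree format, which is not specified here. It assumes a hypothetical flat list of (governor, relation, dependent) triples with Stanford/UD-style relation labels. The function name, the relation sets and the pronoun sets are all illustrative assumptions.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical triple format for this sketch: (governor, relation, dependent).
Triple = Tuple[str, str, str]

# Relation labels assumed here (Stanford / Universal Dependencies style).
SUBJECT_RELATIONS = {"nsubj", "nsubjpass"}
OBJECT_RELATIONS = {"dobj", "obj", "iobj"}


def count_pronoun_usages(triples: Iterable[Triple],
                         subject_pronouns: Set[str],
                         object_pronouns: Set[str]) -> Dict[str, int]:
    """Count how often the given pronouns appear as a subject or an object."""
    usages = {"subject": 0, "object": 0}
    for _governor, relation, dependent in triples:
        word = dependent.lower()
        if relation in SUBJECT_RELATIONS and word in subject_pronouns:
            usages["subject"] += 1
        elif relation in OBJECT_RELATIONS and word in object_pronouns:
            usages["object"] += 1
    return usages


# Hand-written triples for: "She saw him. He thanked her."
example_triples = [
    ("saw", "nsubj", "She"),
    ("saw", "dobj", "him"),
    ("thanked", "nsubj", "He"),
    ("thanked", "dobj", "her"),
]
female_usages = count_pronoun_usages(example_triples, {"she"}, {"her"})
print(female_usages)
```

The result has the same {'subject': <int>, 'object': <int>} shape described above; how the real function walks its nested-list tree may differ.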
dunning module¶
-
gender_analysis.analysis.dunning.compare_word_association_between_corpus_dunning(word, corpus1, corpus2, word_window=None, to_pickle=False, pickle_filename='dunning_associated_words.pgz')¶ Finds words associated with the given word between the two corpora. The function can search the document automatically, or passing in a word window can refine results.
Parameters: - word – Word to compare between the two corpora
- corpus1 – Corpus object
- corpus2 – Corpus object
- word_window – If passed in as int, trims results to only show associated words within that range.
- to_pickle – boolean determining if results should be pickled.
- pickle_filename – str or Path object, location of existing pickle or save location for new pickle
Returns: Dictionary
-
gender_analysis.analysis.dunning.compare_word_association_in_corpus_dunning(word1, word2, corpus, to_pickle=False, pickle_filename='dunning_vs_associated_words.pgz')¶ Uses Dunning analysis to compare the words associated with word1 vs those associated with word2 in the given corpus.
Parameters: - word1 – str
- word2 – str
- corpus – Corpus object
- to_pickle – boolean; True if you wish to save the results as a Pickle file
- pickle_filename – str or Path object; Only used if the pickle already exists or you wish to write a new pickle file
Returns: Dictionary mapping words to dunning scores
-
gender_analysis.analysis.dunning.dunn_individual_word(total_words_in_corpus_1, total_words_in_corpus_2, count_of_word_in_corpus_1, count_of_word_in_corpus_2)¶ Applies Dunning log likelihood to compare the counts of an individual word in two corpora.
Parameters: - total_words_in_corpus_1 – int, total wordcount in corpus 1
- total_words_in_corpus_2 – int, total wordcount in corpus 2
- count_of_word_in_corpus_1 – int, wordcount of one word in corpus 1
- count_of_word_in_corpus_2 – int, wordcount of one word in corpus 2
Returns: Float representing the Dunning log likelihood of the given inputs
>>> total_words_m_corpus = 8648489
>>> total_words_f_corpus = 8700765
>>> wordcount_female = 1000
>>> wordcount_male = 50
>>> dunn_individual_word(total_words_m_corpus,
...                      total_words_f_corpus,
...                      wordcount_male,
...                      wordcount_female)
-1047.8610274053995
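For readers who want to see the arithmetic, the following is a minimal sketch of one common two-term form of the Dunning log-likelihood (G²) statistic. It is not taken from the package source: the function name and the sign rule (negative when the word is under-represented in corpus 1) are assumptions chosen to match the (-)/(+) convention described in this module. With the inputs from the example above it gives a value close to the documented one.

```python
import math


def dunning_sketch(total_1: int, total_2: int, count_1: int, count_2: int) -> float:
    """Signed two-term log-likelihood (G^2) for one word across two corpora."""
    combined = count_1 + count_2
    # Expected counts if the word were equally frequent in both corpora.
    expected_1 = total_1 * combined / (total_1 + total_2)
    expected_2 = total_2 * combined / (total_1 + total_2)

    # A zero count contributes nothing (limit of x * log(x) as x -> 0).
    term_1 = count_1 * math.log(count_1 / expected_1) if count_1 else 0.0
    term_2 = count_2 * math.log(count_2 / expected_2) if count_2 else 0.0
    score = 2 * (term_1 + term_2)

    # Assumed sign convention: negative when the word is under-represented
    # in corpus 1, i.e. distinctive of corpus 2.
    return -score if count_1 < expected_1 else score


# Same inputs as the documented example: (male total, female total, male count, female count).
score = dunning_sketch(8648489, 8700765, 50, 1000)
print(score)
```

The magnitude grows as the observed counts move away from what equal usage would predict, and a word used at the same rate in both corpora scores zero.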
-
gender_analysis.analysis.dunning.dunn_individual_word_by_corpus(corpus1, corpus2, target_word)¶ Applies Dunning log likelihood to compare an individual word in two corpora. The (-) end of the spectrum is words for corpus2; the (+) end of the spectrum is words for corpus1. The larger the magnitude of the number, the more distinctive that word is in its respective corpus.
Parameters: - target_word – desired word to compare
- corpus1 – Corpus object
- corpus2 – Corpus object
Returns: log likelihoods and p value
>>> from gender_analysis import Corpus
>>> from gender_analysis.analysis.dunning import dunn_individual_word_by_corpus
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as FILEPATH2,
...     SMALL_TEST_CORPUS_CSV as PATH_TO_CSV
... )
>>> filepath1 = TEST_DATA_DIR / 'document_test_files'
>>> test_corpus1 = Corpus(filepath1)
>>> test_corpus2 = Corpus(FILEPATH2, csv_path=PATH_TO_CSV, ignore_warnings=True)
>>> dunn_individual_word_by_corpus(test_corpus1, test_corpus2, 'sad')
-332112.16673673474
-
gender_analysis.analysis.dunning.dunning_result_displayer(dunning_result, number_of_terms_to_display=10, corpus1_display_name=None, corpus2_display_name=None, part_of_speech_to_include=None, save_to_filename=None)¶ Convenience function to display dunning results as tables.
part_of_speech_to_include can be either a list of POS tags or one of ‘adjectives’, ‘adverbs’, ‘verbs’, or ‘pronouns’. If it is None, all terms are included.
Optionally saves the output to a text file.
Parameters: - dunning_result – Dunning result dict to display
- number_of_terms_to_display – Number of terms for each corpus to display
- corpus1_display_name – Name of corpus 1 (e.g. “Female Authors”)
- corpus2_display_name – Name of corpus 2 (e.g. “Male Authors”)
- part_of_speech_to_include – e.g. ‘adjectives’, or ‘verbs’
- save_to_filename – Filename to save output
Returns:
-
gender_analysis.analysis.dunning.dunning_result_to_dict(dunning_result, number_of_terms_to_display=10, part_of_speech_to_include=None)¶ Receives a dictionary of results and returns a dictionary of the top number_of_terms_to_display most distinctive results for each corpus that have a part of speech matching part_of_speech_to_include
Parameters: - dunning_result – Dunning result dict that will be sorted through
- number_of_terms_to_display – Number of terms for each corpus to display
- part_of_speech_to_include – ‘adjectives’, ‘adverbs’, ‘verbs’, or ‘pronouns’
Returns: dict
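The selection step can be sketched as follows. This is a simplified illustration, not the package's code: it assumes each entry of the result dict carries a 'dunning' score, it omits the part-of-speech filter, and the helper name and the 'corpus1'/'corpus2' keys of the returned dict are hypothetical.

```python
from typing import Dict, List, Tuple


def top_distinctive_terms(dunning_result: Dict[str, Dict[str, float]],
                          number_of_terms: int = 10
                          ) -> Dict[str, List[Tuple[str, float]]]:
    """Pick the most distinctive terms at each end of the score spectrum."""
    scored = [(word, data["dunning"]) for word, data in dunning_result.items()]
    # Highest scores first: the (+) end is assumed distinctive of corpus 1.
    by_score = sorted(scored, key=lambda item: item[1], reverse=True)
    corpus1_terms = [item for item in by_score if item[1] > 0][:number_of_terms]
    # Lowest scores first: the (-) end is assumed distinctive of corpus 2.
    corpus2_terms = [item for item in reversed(by_score) if item[1] < 0][:number_of_terms]
    return {"corpus1": corpus1_terms, "corpus2": corpus2_terms}


# Toy scores, invented for illustration only.
toy_result = {
    "she": {"dunning": 8.5},
    "he": {"dunning": -8.5},
    "and": {"dunning": 0.0},
    "her": {"dunning": 3.2},
}
top_terms = top_distinctive_terms(toy_result, number_of_terms=1)
print(top_terms)
```

Terms with a score of zero are left out of both lists because they distinguish neither corpus.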
-
gender_analysis.analysis.dunning.dunning_total(counter1, counter2, pickle_filepath=None)¶ Runs dunn_individual_word on words shared by both counter objects. The (-) end of the spectrum is words for counter_2; the (+) end of the spectrum is words for counter_1. The larger the magnitude of the number, the more distinctive that word is in its respective counter object.
Use pickle_filepath to store the result so it only has to be calculated once and can be used for multiple analyses.
Parameters: - counter1 – Python Counter object
- counter2 – Python Counter object
- pickle_filepath – Filepath to store pickled results; will not save output if None
Returns: Dictionary
>>> from collections import Counter
>>> from gender_analysis.analysis.dunning import dunning_total
>>> female_counter = Counter({'he': 1, 'she': 10, 'and': 10})
>>> male_counter = Counter({'he': 10, 'she': 1, 'and': 10})
>>> results = dunning_total(female_counter, male_counter)

Results is a dict that maps from terms to results. Each result dict contains the dunning score…

>>> results['he']['dunning']
-8.547243830635558

… counts for corpora 1 and 2 as well as total count

>>> results['he']['count_total'], results['he']['count_corp1'], results['he']['count_corp2']
(11, 1, 10)

… and the same for frequencies

>>> results['he']['freq_total'], results['he']['freq_corp1'], results['he']['freq_corp2']
(0.2619047619047619, 0.047619047619047616, 0.47619047619047616)
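The structure of that result dict can be sketched in plain Python. The code below is an illustration under assumptions, not the package's implementation: the helper names are hypothetical, the score uses a common two-term G² form with an assumed sign rule, and corpus totals are taken as the sum of each counter. Pickling is left out.

```python
import math
from collections import Counter
from typing import Dict


def signed_log_likelihood(total_1: int, total_2: int, count_1: int, count_2: int) -> float:
    """Two-term G^2, negative when the word is under-represented in corpus 1."""
    combined = count_1 + count_2
    expected_1 = total_1 * combined / (total_1 + total_2)
    expected_2 = total_2 * combined / (total_1 + total_2)
    term_1 = count_1 * math.log(count_1 / expected_1) if count_1 else 0.0
    term_2 = count_2 * math.log(count_2 / expected_2) if count_2 else 0.0
    score = 2 * (term_1 + term_2)
    return -score if count_1 < expected_1 else score


def dunning_total_sketch(counter1: Counter, counter2: Counter) -> Dict[str, Dict[str, float]]:
    """Score every word that appears in both counters."""
    total_1 = sum(counter1.values())
    total_2 = sum(counter2.values())
    results = {}
    # Only words present in both counters are scored.
    for word in counter1.keys() & counter2.keys():
        count_1, count_2 = counter1[word], counter2[word]
        results[word] = {
            "dunning": signed_log_likelihood(total_1, total_2, count_1, count_2),
            "count_total": count_1 + count_2,
            "count_corp1": count_1,
            "count_corp2": count_2,
            "freq_total": (count_1 + count_2) / (total_1 + total_2),
            "freq_corp1": count_1 / total_1,
            "freq_corp2": count_2 / total_2,
        }
    return results


female_counter = Counter({'he': 1, 'she': 10, 'and': 10})
male_counter = Counter({'he': 10, 'she': 1, 'and': 10})
sketch_results = dunning_total_sketch(female_counter, male_counter)
print(sketch_results['he'])
```

On the two small counters from the example, this sketch yields the same counts and frequencies for 'he' and a score matching the documented value to within floating-point tolerance.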
-
gender_analysis.analysis.dunning.dunning_total_by_corpus(m_corpus, f_corpus)¶ Goes through two corpora (e.g. a corpus of male authors and a corpus of female authors) and runs dunn_individual_word on all words that are in BOTH corpora. Returns the words and their dunning scores in sorted order and shows the top 10 and lowest 10 words.
Parameters: - m_corpus – Corpus object
- f_corpus – Corpus object
Returns: list of tuples (common word, (dunning value, m_corpus_count, f_corpus_count))
>>> from gender_analysis.analysis.dunning import dunning_total_by_corpus
>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import TEST_CORPUS_PATH, SMALL_TEST_CORPUS_CSV
>>> c = Corpus(TEST_CORPUS_PATH, csv_path=SMALL_TEST_CORPUS_CSV, ignore_warnings=True)
>>> test_m_corpus = c.filter_by_gender('male')
>>> test_f_corpus = c.filter_by_gender('female')
>>> result = dunning_total_by_corpus(test_m_corpus, test_f_corpus)
>>> print(result[0])
('mrs', (-675.5338738828469, 1, 2031))
Tests distinctiveness of shared words between male and female authors using dunning analysis.
If called with display_results=True, prints out the most distinctive terms overall as well as grouped by verbs, adjectives etc. Returns a dict of all terms in the corpus mapped to the dunning data for each term
Parameters: - corpus – Corpus object
- display_results – Boolean; reports a visualization of the results if True
- to_pickle – Boolean; Will save the results to a pickle file if True
- pickle_filename – Path to pickle object; will try to search for results in this location or write pickle file to path if to_pickle is true.
Returns: dict
Between male-author and female-author subcorpora, tests distinctiveness of words associated with male characters
Prints out the most distinctive terms overall as well as grouped by verbs, adjectives etc.
Parameters: - corpus – Corpus object
- to_pickle – boolean, False by default. Set to True in order to pickle results
- pickle_filename – filename of results to be pickled
Returns: dict
-
gender_analysis.analysis.dunning.freq_plot_to_show(results)¶ displays bar plot of relative frequency of all words in results
Parameters: results – dict of results from dunning_total or similar, i.e. in the form {'word': {'freq_corp1': float, 'freq_corp2': float, 'freq_total': float}}
Returns: None; displays a bar plot of the relative frequency of all words in results
Between male-author and female-author subcorpora, tests distinctiveness of words associated with male characters
Prints out the most distinctive terms overall as well as grouped by verbs, adjectives etc.
Parameters: - corpus – Corpus object
- to_pickle – boolean, False by default. Set to True in order to pickle results
- pickle_filename – filename of results to be pickled
Returns: dict
-
gender_analysis.analysis.dunning.masc_fem_associations_dunning(corpus, to_pickle=False, pickle_filename='dunning_he_vs_she_associated_words.pgz')¶ Uses Dunning analysis to compare words associated with HE_SERIES vs. words associated with SHE_SERIES in a given Corpus.
Parameters: - corpus – Corpus object
- to_pickle – Boolean; saves results to a pickle file if True
- pickle_filename – Filepath to save pickle file if to_pickle is True
Returns: Dictionary
-
gender_analysis.analysis.dunning.score_plot_to_show(results)¶ displays bar plot of dunning scores for all words in results
Parameters: results – dict of results from dunning_total or similar, i.e. in the form {'word': {'dunning': float}}
Returns: None; displays a bar plot of dunning scores for all words in results
frequency module¶
-
class
gender_analysis.analysis.frequency.GenderFrequencyAnalyzer(genders: Optional[Sequence[gender_analysis.gender.gender.Gender]] = None, **kwargs)¶ Bases: gender_analysis.analysis.base_analyzers.CorpusAnalyzer
The GenderFrequencyAnalyzer instance accepts a series of texts and a series of Gender instances and finds occurrences of each of the Gender instances’ identifiers (currently pronouns). Helper methods are provided to organize and analyze those occurrences according to relevant criteria.
- Instance methods:
- by_date() by_document() by_gender() by_identifier() by_metadata()
-
by_date(time_frame: Tuple[int, int], bin_size: int, format_by: str = 'count', group_by: str = 'identifier')¶ Return analysis organized by date (as determined by Document metadata).
Parameters: - time_frame – a tuple of the format (start_date, end_date).
- bin_size – int for the number of years represented in each list of frequencies
- format_by – defaults to ‘count’; also accepts ‘frequency’ and ‘relative’, which return the analysis with counts as a frequency of all words in the texts or relative to one another, respectively.
- group_by – defaults to ‘identifier’; also accepts ‘label’ and ‘aggregate’, which return the analysis with counts grouped by pronoun category (‘subject’ and ‘object’) or summed, respectively.
Returns: a dictionary of gender-pronoun pairs with top-level keys corresponding to the values in the input Documents’ ‘date’ metadata key.
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> from gender_analysis.text.corpus import Corpus
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> analyzer.by_date((2000, 2008), 2).keys()
dict_keys([2000, 2002, 2004, 2006])
>>> actual_analysis = analyzer.by_date((2000, 2010), 2).get(2002).get('Female')
>>> expected_analysis = {'she': 0, 'her': 7, 'herself': 0, 'hers': 0}
>>> actual_analysis == expected_analysis
True
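The binning implied by time_frame and bin_size can be sketched independently of the package. This is an assumption-laden illustration: the helper name is hypothetical, each document is reduced to a (date, Counter) pair, and the end of the time frame is treated as exclusive, which is consistent with the keys 2000 to 2006 shown for (2000, 2008) but is not confirmed beyond that example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def bin_counts_by_date(dated_counts: Iterable[Tuple[int, Counter]],
                       time_frame: Tuple[int, int],
                       bin_size: int) -> Dict[int, Counter]:
    """Sum per-document counts into bins keyed by the first year of each bin."""
    start, end = time_frame
    # One bin every bin_size years; the end year is assumed to be exclusive.
    bins = {year: Counter() for year in range(start, end, bin_size)}
    for date, counts in dated_counts:
        if not start <= date < end:
            continue  # skip documents outside the requested time frame
        bin_start = start + ((date - start) // bin_size) * bin_size
        bins[bin_start].update(counts)
    return bins


# Invented per-document counts, for illustration only.
dated = [
    (2001, Counter({'her': 2})),
    (2002, Counter({'she': 1})),
    (2003, Counter({'her': 7})),
    (2010, Counter({'he': 5})),  # outside the time frame, ignored
]
binned = bin_counts_by_date(dated, (2000, 2008), 2)
print(list(binned))
```

Empty bins are kept so that every interval in the time frame appears in the output, even when no document falls into it.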
-
by_document(format_by: str = 'count', group_by: str = 'identifier')¶ Return analysis organized by Document.label.
Parameters: - format_by – defaults to ‘count’; also accepts ‘frequency’ and ‘relative’, which return the analysis with counts as a frequency of all words in the texts or relative to one another, respectively.
- group_by – defaults to ‘identifier’; also accepts ‘label’ and ‘aggregate’, which return the analysis with counts grouped by pronoun category (‘subject’ and ‘object’) or summed, respectively.
Returns: a dictionary of gender-pronoun pairs with top-level keys corresponding to the labels of the input Documents.
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> doc = analyzer.corpus.documents[7]
>>> analyzer_document_labels = list(analyzer.by_document().keys())
>>> document_labels = list(map(lambda d: d.label, analyzer.corpus.documents))
>>> analyzer_document_labels == document_labels
True
>>> expected_result = {'hers': 0, 'herself': 0, 'she': 0, 'her': 6}
>>> actual_result = analyzer.by_document().get(doc.label).get('Female')
>>> expected_result == actual_result
True
-
by_gender(format_by: str = 'count', group_by: str = 'identifier')¶ Return analysis organized by Gender.label.
Parameters: - format_by – defaults to ‘count’; also accepts ‘frequency’ and ‘relative’, which return the analysis with counts as a frequency of all words in the texts or relative to one another, respectively.
- group_by – defaults to ‘identifier’; also accepts ‘label’ and ‘aggregate’, which return the analysis with counts grouped by pronoun category (‘subject’ and ‘object’) or summed, respectively.
Returns: a dictionary of gender-pronoun pairs with top-level keys corresponding to the labels of the input Genders.
>>> from collections import Counter
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> actual_female_results = Counter({'her': 14, 'she': 8, 'herself': 1, 'hers': 0})
>>> actual_male_results = Counter({'his': 11, 'he': 11, 'him': 4, 'himself': 2})
>>> expected_results = {'Male': actual_male_results, 'Female': actual_female_results}
>>> actual_results = analyzer.by_gender()
>>> actual_results == expected_results
True
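The difference between the ‘frequency’ and ‘relative’ formats can be illustrated with two small helpers. These are hypothetical and reflect one plausible reading of the parameter description: ‘frequency’ divides each count by the total number of words in the texts, and ‘relative’ divides each count by the sum of the counts themselves. The total word count used below is invented for the example.

```python
from collections import Counter
from typing import Dict


def as_frequency(counts: Counter, total_words: int) -> Dict[str, float]:
    """Express each count as a share of all words in the texts."""
    return {identifier: count / total_words for identifier, count in counts.items()}


def as_relative(counts: Counter) -> Dict[str, float]:
    """Express each count as a share of the counted identifiers only."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero when no identifier occurs at all.
        return {identifier: 0.0 for identifier in counts}
    return {identifier: count / total for identifier, count in counts.items()}


# Counts taken from the 'Female' entry of the by_gender example.
female_counts = Counter({'her': 14, 'she': 8, 'herself': 1, 'hers': 0})
relative = as_relative(female_counts)
# total_words is a made-up figure, used only to show the calculation.
frequency = as_frequency(female_counts, total_words=1000)
print(relative)
print(frequency)
```

Under this reading, relative values always sum to 1 within a group, while frequency values depend on the length of the underlying texts.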
-
by_identifier(format_by: str = 'count', group_by: str = 'identifier') → Union[int, collections.Counter, Dict[str, float]]¶ Return analysis organized by Gender identifiers.
Parameters: - format_by – defaults to ‘count’; also accepts ‘frequency’ and ‘relative’, which return the analysis with counts as a frequency of all words in the texts or relative to one another, respectively.
- group_by – defaults to ‘identifier’; also accepts ‘label’ and ‘aggregate’, which return the analysis with counts grouped by pronoun category (‘subject’ and ‘object’) or summed, respectively.
Returns: a dictionary of gender-pronoun pairs.
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> actual_results = analyzer.by_identifier()
>>> expected_results = {'she': 8, 'herself': 1, 'her': 14, 'hers': 0, 'him': 4,
...                     'he': 11, 'himself': 2, 'his': 11}
>>> actual_results == expected_results
True
-
by_metadata(metadata_key: str, format_by: str = 'count', group_by: str = 'identifier') → Dict[Union[str, int, float], Union[collections.Counter, Dict[str, float]]]¶ Return analysis organized by input metadata_key (as determined by Document metadata).
Parameters: - metadata_key – a string corresponding to one of the columns in the input metadata csv.
- format_by – defaults to ‘count’; also accepts ‘frequency’ and ‘relative’, which return the analysis with counts as a frequency of all words in the texts or relative to one another, respectively.
- group_by – defaults to ‘identifier’; also accepts ‘label’ and ‘aggregate’, which return the analysis with counts grouped by pronoun category (‘subject’ and ‘object’) or summed, respectively.
Returns: a dictionary of gender-pronoun pairs with top-level keys corresponding to the input metadata_key.
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> from gender_analysis.testing.common import DOCUMENT_TEST_PATH, DOCUMENT_TEST_CSV
>>> analyzer = GenderFrequencyAnalyzer(file_path=DOCUMENT_TEST_PATH,
...                                    csv_path=DOCUMENT_TEST_CSV)
>>> analyzer.by_metadata('author_gender').keys()
dict_keys(['male', 'female'])
>>> actual_results = analyzer.by_metadata('author_gender').get('male').get('Male')
>>> expected_results = {'himself': 0, 'his': 4, 'him': 1, 'he': 7}
>>> actual_results == expected_results
True
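Grouping by a metadata key amounts to summing per-document counts under each distinct metadata value. The sketch below shows that idea with plain dicts and Counters. The helper name and the (metadata, counts) pair used to stand in for a Document are assumptions, and the sample data is invented.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Tuple


def group_counts_by_metadata(documents: Iterable[Tuple[Dict[str, Any], Counter]],
                             metadata_key: str) -> Dict[Any, Counter]:
    """Sum identifier counts across documents that share a metadata value."""
    grouped = defaultdict(Counter)
    for metadata, counts in documents:
        grouped[metadata[metadata_key]].update(counts)
    return dict(grouped)


# Each document is represented as (metadata dict, pronoun counts); values are made up.
sample_documents = [
    ({'author_gender': 'male', 'date': 2001}, Counter({'he': 4, 'his': 1})),
    ({'author_gender': 'female', 'date': 2003}, Counter({'she': 2, 'her': 5})),
    ({'author_gender': 'male', 'date': 2005}, Counter({'he': 3, 'him': 1})),
]
by_author_gender = group_counts_by_metadata(sample_documents, 'author_gender')
print(by_author_gender)
```

The same helper works for any column of the metadata, for example 'date', since it only relies on the value being usable as a dictionary key.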