gender_analysis.text package¶

common module¶

exception gender_analysis.text.common.MissingMetadataError(metadata_fields, message='')¶

Bases: Exception

Raised when a function that assumes certain metadata is called on a corpus without that metadata

gender_analysis.text.common.convert_text_file_to_new_encoding(source_path, target_path, target_encoding)¶

Converts the text file at source_path to the encoding given in target_encoding and writes the result to target_path

Note: Currently only supports encodings utf-8, ascii and iso-8859-1

Parameters:	source_path – str or Path
	target_path – str or Path
	target_encoding – str
Returns:	None

>>> from pathlib import Path
>>> from gender_analysis.common import BASE_PATH
>>> sample_text = ' ¶¶¶¶ here is a test file'
>>> source_path = Path(BASE_PATH, 'source_file.txt')
>>> target_path = Path(BASE_PATH, 'target_file.txt')
>>> with open(source_path, 'w', encoding='iso-8859-1') as source:
...     _ = source.write(sample_text)
>>> get_text_file_encoding(source_path)
'ISO-8859-1'
>>> convert_text_file_to_new_encoding(source_path, target_path, target_encoding='utf-8')
>>> get_text_file_encoding(target_path)
'utf-8'
>>> import os
>>> os.remove(source_path)
>>> os.remove(target_path)
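
The conversion shown in the doctest can be approximated with the standard library alone. This is a minimal sketch, not the library's implementation: `convert_encoding` is a hypothetical helper, and it assumes the source encoding is passed in explicitly rather than detected.

```python
import tempfile
from pathlib import Path


def convert_encoding(source_path, target_path, source_encoding, target_encoding):
    """Re-encode a text file (hypothetical helper, not the library's function)."""
    # Decode with the known source encoding, then write with the target one.
    text = Path(source_path).read_text(encoding=source_encoding)
    Path(target_path).write_text(text, encoding=target_encoding)


with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir, 'source_file.txt')
    target = Path(tmp_dir, 'target_file.txt')
    source.write_text(' ¶¶¶¶ here is a test file', encoding='iso-8859-1')
    convert_encoding(source, target, 'iso-8859-1', 'utf-8')
    source_bytes = source.read_bytes()
    target_bytes = target.read_bytes()
    round_trip = target.read_text(encoding='utf-8')
```

The text survives the round trip unchanged, while the bytes on disk differ because each '¶' takes one byte in iso-8859-1 and two in utf-8.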

gender_analysis.text.common.create_path_object_and_directories(output_dir, filename=None)¶

Creates a path object for the file with the absolute output_dir and with the given filename (if provided). It will create the path to the output_dir if it is non-existent
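
The behaviour described here maps closely onto `pathlib`. A minimal sketch under that assumption; `make_output_path` is an illustrative name and may differ from what the library does internally.

```python
import tempfile
from pathlib import Path


def make_output_path(output_dir, filename=None):
    """Build an absolute path, creating output_dir if needed (illustrative only)."""
    directory = Path(output_dir).resolve()
    # parents=True creates intermediate directories; exist_ok=True avoids
    # an error when the directory is already there.
    directory.mkdir(parents=True, exist_ok=True)
    return directory / filename if filename else directory


with tempfile.TemporaryDirectory() as tmp_dir:
    result = make_output_path(Path(tmp_dir, 'a', 'b'), 'plot.png')
    directory_created = result.parent.is_dir()
    file_created = result.exists()
```

Only the directories are created in this sketch; the file itself does not exist until something writes to the returned path.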

gender_analysis.text.common.download_nltk_package_if_not_present(package_name)¶

Checks to see whether the user already has a given nltk package, and if not, prompts the user whether to download it.

We download all necessary packages at install time, but this is just in case the user has deleted them.

Parameters:	package_name – name of the nltk package
Returns:

gender_analysis.text.common.get_text_file_encoding(filepath)¶

Returns the text encoding as a string for a txt file at the given filepath.

Parameters:	filepath – str or Path object
Returns:	Name of encoding scheme as a string

>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> from pathlib import Path
>>> import os
>>> path=Path(TEST_DATA_DIR, 'sample_novels', 'texts', 'hawthorne_scarlet.txt')
>>> get_text_file_encoding(path)
'ascii'

Note: For files containing only ascii characters, this function will return ‘ascii’ even if the file was encoded with utf-8

>>> import os
>>> from pathlib import Path
>>> from gender_analysis.common import BASE_PATH
>>> text = 'here is an ascii text'
>>> file_path = Path(BASE_PATH, 'example_file.txt')
>>> with open(file_path, 'w', encoding='utf-8') as source:
...     _ = source.write(text)
>>> get_text_file_encoding(file_path)
'ascii'
>>> os.remove(file_path)
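
The note above follows from ASCII being a subset of UTF-8: ASCII-only text produces the same bytes in both encodings, so no detector can tell them apart. A short standard-library illustration (the sample strings are made up):

```python
ascii_only = 'here is an ascii text'
accented = 'café'

# Identical bytes: nothing in the file distinguishes the two encodings.
same_bytes = ascii_only.encode('utf-8') == ascii_only.encode('ascii')

# A non-ASCII character changes the byte sequence, so detection becomes possible.
utf8_bytes = accented.encode('utf-8')
latin1_bytes = accented.encode('iso-8859-1')
```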

gender_analysis.text.common.load_csv_to_list(file_path)¶

Loads a csv file from the given filepath and returns its contents as a list of strings.

Parameters:	file_path – str or Path object
Returns:	a list of strings

>>> from pathlib import Path
>>> from gender_analysis.testing.common import LARGE_TEST_CORPUS_CSV
>>> corpus_metadata_path = LARGE_TEST_CORPUS_CSV
>>> corpus_metadata = load_csv_to_list(corpus_metadata_path)
>>> type(corpus_metadata)
<class 'list'>

gender_analysis.text.common.load_graph_settings(show_grid_lines=True)¶

Sets the seaborn graph settings to the defaults for the project. Defaults to displaying gridlines. To remove gridlines, call with False.

Parameters:	show_grid_lines – Boolean; Determines whether to show gridlines in graphs.
Returns:	None

gender_analysis.text.common.load_pickle(filepath)¶

Loads the pickle stored at a given filepath, and returns the Python object that was stored.

Parameters:	filepath – str or Path object
Returns:	Previously-pickled object

>>> from gender_analysis.common import BASE_PATH
>>> from pathlib import Path
>>> from gender_analysis.text.common import load_pickle
>>> pickle_filepath = Path(BASE_PATH, 'testing', 'test_data','test_pickle.pgz')
>>> loaded_object = load_pickle(pickle_filepath)
>>> loaded_object
{'a': 4, 'b': 5, 'c': [1, 2, 3]}

gender_analysis.text.common.load_txt_to_string(file_path)¶

Loads a txt file and returns a str representation of it.

Parameters:	file_path – str or Path object
Returns:	The file’s text as a string

>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> novel_path = Path(TEST_DATA_DIR, 'sample_novels', 'texts', 'austen_persuasion.txt')
>>> novel_text = load_txt_to_string(novel_path)
>>> type(novel_text), len(novel_text)
(<class 'str'>, 466887)

gender_analysis.text.common.store_pickle(obj, filepath)¶

Store a compressed “pickle” of the object in the “pickle_data” directory and return the full path to it.

Parameters:	obj – Any Python object that can be pickled
	filepath – str or Path object
Returns:	Path object

Example in lieu of Doctest to avoid writing out a file.

my_object = {'a': 4, 'b': 5, 'c': [1, 2, 3]}
gender_analysis.common.store_pickle(my_object, 'path_to_pickle/example_pickle.pgz')
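
A compressed pickle round trip can be sketched with `gzip` and `pickle` from the standard library. This assumes the `.pgz` files are gzip-compressed pickles, which the source does not state explicitly; `store_compressed` and `load_compressed` are hypothetical names.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def store_compressed(obj, filepath):
    """Write obj as a gzip-compressed pickle and return the path (sketch)."""
    filepath = Path(filepath)
    with gzip.open(filepath, 'wb') as handle:
        pickle.dump(obj, handle)
    return filepath


def load_compressed(filepath):
    """Read back an object written by store_compressed (sketch)."""
    with gzip.open(filepath, 'rb') as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    my_object = {'a': 4, 'b': 5, 'c': [1, 2, 3]}
    path = store_compressed(my_object, Path(tmp_dir, 'example_pickle.pgz'))
    restored = load_compressed(path)
```

Writing into a temporary directory keeps the example from leaving files behind.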

corpus module¶

class gender_analysis.text.corpus.Corpus(path_to_files, name=None, csv_path=None, pickle_on_load=None, ignore_warnings=False)¶

Bases: object

The corpus class is used to load the metadata and full texts of all documents in a corpus

Once loaded, each corpus contains a list of Document objects

Parameters:	path_to_files – Must be either the path to a directory of txt files or an already-pickled corpus
	name – Optional name of the corpus, for ease of use and readability
	csv_path – Optional path to a csv metadata file
	pickle_on_load – Filepath to save a pickled copy of the corpus

>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> path = TEST_DATA_DIR / 'sample_novels' / 'texts'
>>> c = Corpus(path)
>>> type(c.documents), len(c)
(<class 'list'>, 99)

clone()¶

Return a copy of the Corpus object

Returns:	Corpus object

>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import TEST_CORPUS_PATH
>>> path = TEST_CORPUS_PATH
>>> sample_corpus = Corpus(path)
>>> corpus_copy = sample_corpus.clone()
>>> len(corpus_copy) == len(sample_corpus)
True

count_authors_by_gender(gender)¶

This function returns the number of authors in the corpus with the specified gender.

NOTE: there must be an ‘author_gender’ field in the metadata of all documents.

Parameters:	gender – gender identifier to search for in the metadata (e.g. ‘female’, ‘male’, etc.)
Returns:	Number of authors of the given gender

>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as path,
...     SMALL_TEST_CORPUS_CSV as path_to_csv
... )
>>> c = Corpus(path, csv_path=path_to_csv, ignore_warnings = True)
>>> c.count_authors_by_gender('female')
7

filter_by_gender(gender)¶

Return a new Corpus object that contains documents only with authors whose gender matches the given parameter.

Parameters:	gender – gender identifier (e.g. ‘male’, ‘female’, ‘unknown’, etc.)
Returns:	Corpus object

>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as path,
...     LARGE_TEST_CORPUS_CSV as path_to_csv
... )
>>> c = Corpus(path, csv_path=path_to_csv)
>>> female_corpus = c.filter_by_gender('female')
>>> len(female_corpus)
39
>>> female_corpus.documents[0].title
'The Indiscreet Letter'

>>> male_corpus = c.filter_by_gender('male')
>>> len(male_corpus)
59

>>> male_corpus.documents[0].title
'Lisbeth Longfrock'

get_document(metadata_field, field_val)¶

Returns a specific Document object from self.documents that has metadata matching field_val for metadata_field.

This function will only return the first document in self.documents. It should only be used if you’re certain there is only one match in the Corpus or if you’re not picky about which Document you get. If you want more selectivity use get_document_multiple_fields, or if you want multiple documents, use subcorpus.

Parameters:	metadata_field – metadata field to search
	field_val – search term
Returns:	Document Object

>>> from gender_analysis import Corpus
>>> from gender_analysis.text.common import MissingMetadataError
>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as path,
...     LARGE_TEST_CORPUS_CSV as path_to_csv
... )

>>> c = Corpus(path, csv_path=path_to_csv)
>>> c.get_document("author", "Dickens, Charles")
<Document (dickens_twocities)>
>>> c.get_document("date", '1857')
<Document (bronte_professor)>
>>> try:
...     c.get_document("meme_quality", "over 9000")
... except MissingMetadataError as exception:
...     print(exception)
This Corpus is missing the following metadata field:
    meme_quality
In order to run this function, you must create a new metadata csv
with this field and run Corpus.update_metadata().

get_document_multiple_fields(metadata_dict)¶

Returns a specific Document object from the corpus that has metadata matching a given metadata dict.

This method will only return the first document in the corpus. It should only be used if you’re certain there is only one match in the Corpus or if you’re not picky about which Document you get.

If you want multiple documents, use subcorpus.

Parameters:	metadata_dict – Dictionary with metadata fields as keys and search terms as values
Returns:	Document object

>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as path,
...     LARGE_TEST_CORPUS_CSV as path_to_csv
... )
>>> c = Corpus(path, csv_path=path_to_csv)
>>> c.get_document_multiple_fields({"author": "Dickens, Charles", "author_gender": "male"})
<Document (dickens_twocities)>
>>> c.get_document_multiple_fields({"author": "Chopin, Kate", "title": "The Awakening"})
<Document (chopin_awakening)>

get_field_vals(field)¶

This function returns a sorted list of the values present in the corpus for a given metadata field.

Parameters:	field – field to search for (e.g. ‘location’, ‘author_gender’, etc.)
Returns:	list of strings

>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as path,
...     LARGE_TEST_CORPUS_CSV as path_to_csv
... )
>>> c = Corpus(path, name='sample_novels', csv_path=path_to_csv)
>>> c.get_field_vals('author_gender')
['both', 'female', 'male']

get_sample_text_passages(expression, no_passages)¶

Returns a specified number of example passages that include a certain expression.

The number of passages requested is a maximum; this function may return fewer if the expression occurs fewer times in the corpus.

Parameters:	expression – expression to search for
	no_passages – number of passages to return
Returns:	List of passages as strings

>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as path,
...     LARGE_TEST_CORPUS_CSV as path_to_csv
... )
>>> corpus = Corpus(path, csv_path=path_to_csv, ignore_warnings=True)
>>> results = corpus.get_sample_text_passages('he cried', 2)
>>> 'he cried' in results[0][1]
True
>>> 'he cried' in results[1][1]
True

multi_filter(characteristic_dict)¶

Returns a copy of the corpus, but with only the documents that fulfill the metadata parameters passed in by characteristic_dict. Multiple metadata keys can be searched at one time, provided that the metadata is available for the documents in the corpus.

Parameters:	characteristic_dict – dict with metadata fields as keys and search terms as values
Returns:	Corpus object

>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as path,
...     LARGE_TEST_CORPUS_CSV as path_to_csv
... )
>>> c = Corpus(path, csv_path=path_to_csv)
>>> corpus_filter = {'author_gender': 'male'}
>>> len(c.multi_filter(corpus_filter))
59

>>> corpus_filter['filename'] = 'aanrud_longfrock.txt'
>>> len(c.multi_filter(corpus_filter))
1
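
The matching rule can be illustrated on plain dictionaries: a record is kept only if it matches every key/value pair in the filter. This is a sketch on made-up records, not the library's code; in particular it treats a missing field as a non-match, whereas the library may raise an error instead.

```python
def filter_records(records, characteristic_dict):
    """Keep records whose metadata matches every key/value pair (sketch)."""
    return [
        record for record in records
        if all(record.get(field) == value
               for field, value in characteristic_dict.items())
    ]


# Hypothetical metadata records standing in for Document objects.
records = [
    {'filename': 'aanrud_longfrock.txt', 'author_gender': 'male'},
    {'filename': 'austen_emma.txt', 'author_gender': 'female'},
    {'filename': 'dickens_twocities.txt', 'author_gender': 'male'},
]

corpus_filter = {'author_gender': 'male'}
male_records = filter_records(records, corpus_filter)

# Adding a second key narrows the result further.
corpus_filter['filename'] = 'aanrud_longfrock.txt'
single_record = filter_records(records, corpus_filter)
```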

subcorpus(metadata_field, field_value)¶

Returns a new Corpus object that contains only documents with a given field_value for metadata_field

Parameters:	metadata_field – metadata field to search
	field_value – search term
Returns:	Corpus object

>>> from gender_analysis import Corpus
>>> from gender_analysis.testing.common import (
...     TEST_CORPUS_PATH as path,
...     LARGE_TEST_CORPUS_CSV as path_to_csv
... )
>>> corp = Corpus(path, csv_path=path_to_csv)
>>> female_corpus = corp.subcorpus('author_gender','female')
>>> len(female_corpus)
39
>>> female_corpus.documents[0].title
'The Indiscreet Letter'

>>> male_corpus = corp.subcorpus('author_gender','male')
>>> len(male_corpus)
59
>>> male_corpus.documents[0].title
'Lisbeth Longfrock'

>>> eighteen_fifty_corpus = corp.subcorpus('date','1850')
>>> len(eighteen_fifty_corpus)
1
>>> eighteen_fifty_corpus.documents[0].title
'The Scarlet Letter'

>>> jane_austen_corpus = corp.subcorpus('author','Austen, Jane')
>>> len(jane_austen_corpus)
2
>>> jane_austen_corpus.documents[0].title
'Emma'

>>> england_corpus = corp.subcorpus('country_publication','England')
>>> len(england_corpus)
51
>>> england_corpus.documents[0].title
'Flatland'

update_metadata(new_metadata_path)¶

Takes a filepath to a csv with new metadata and updates the metadata in the corpus’ documents accordingly. The new file does not need to contain every metadata field in the documents - only the fields that you wish to update.

NOTE: The csv file must include at least a filename for the documents that will be altered.

Parameters:	new_metadata_path – Path to new metadata csv file
Returns:	None

document module¶

class gender_analysis.text.document.Document(metadata_dict)¶

Bases: object

The Document class loads and holds the full text and metadata (author, title, publication date, etc.) of a document

Parameters:	metadata_dict – Dictionary with metadata fields as keys and data as values

>>> from gender_analysis import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_metadata = {'author': 'Austen, Jane', 'title': 'Persuasion', 'date': '1818',
...                      'filename': 'austen_persuasion.txt',
...                      'filepath': Path(TEST_DATA_DIR,
...                                       'sample_novels', 'texts', 'austen_persuasion.txt')}
>>> austen = Document(document_metadata)
>>> type(austen.text)
<class 'str'>
>>> len(austen.text)
466887

find_quoted_text()¶

Finds all of the quoted statements in the document text.

Returns:	List of strings enclosed in double-quotations

>>> from gender_analysis import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_metadata = {'author': 'Austen, Jane', 'title': 'Persuasion',
...                      'date': '1818', 'filename': 'test_text_0.txt',
...                      'filepath': Path(TEST_DATA_DIR,
...                                       'document_test_files', 'test_text_0.txt')}
>>> document_novel = Document(document_metadata)
>>> document_novel.find_quoted_text()
['"This is a quote"', '"This is my quote"']
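
One common way to extract double-quoted statements is a regular expression. The sketch below is an assumption about the approach, not the method's actual implementation, and it only handles straight double quotes; `find_double_quoted` and the sample sentence are illustrative.

```python
import re


def find_double_quoted(text):
    """Return every double-quoted span, quotation marks included (sketch)."""
    # The pattern matches an opening quote, one or more non-quote
    # characters, and the closing quote.
    return re.findall(r'"[^"]+"', text)


sample = 'She said "This is a quote" and he replied "This is my quote".'
quotes = find_double_quoted(sample)
```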

get_count_of_word(word)¶

Returns the number of instances of a word in the text. Not case-sensitive.

If this is your first time running this method, this can be slow.

Parameters:	word – word to be counted in text
Returns:	Number of occurrences of the word, as an int

>>> from gender_analysis import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_metadata = {'author': 'Hawthorne, Nathaniel', 'title': 'Scarlet Letter',
...                      'date': '2018', 'filename': 'test_text_2.txt',
...                      'filepath': Path(TEST_DATA_DIR,
...                                       'document_test_files', 'test_text_2.txt')}
>>> scarlett = Document(document_metadata)
>>> scarlett.get_count_of_word("sad")
4
>>> scarlett.get_count_of_word('ThisWordIsNotInTheWordCounts')
0

get_count_of_words(words)¶

A helper method for retrieving the number of occurrences of a given set of words within a Document.

Parameters:	words – a list of strings.
Returns:	a Counter with each word in words keyed to its number of occurrences.

>>> from gender_analysis.text.document import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_filepath = Path(TEST_DATA_DIR, 'document_test_files', 'test_text_9.txt')
>>> document_metadata = {'filename': 'test_text_2.txt', 'filepath': document_filepath}
>>> test_document = Document(document_metadata)
>>> test_document.get_count_of_words(['sad', 'was', 'sadness', 'very'])
Counter({'was': 5, 'sad': 1, 'very': 1, 'sadness': 0})
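
Note that words absent from the text still appear in the result with a count of 0. A sketch of how such a Counter can be built, using a made-up sentence rather than the test file:

```python
from collections import Counter

tokens = 'she was sad and he was very tired'.split()
token_counts = Counter(tokens)
words = ['sad', 'was', 'sadness', 'very']

# Build the result explicitly so absent words are stored with a count of 0.
# (Looking up a missing key on a Counter returns 0 without storing the key.)
counts = Counter({word: token_counts[word] for word in words})
```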

get_part_of_speech_tags()¶

Returns the part of speech tags as a list of tuples. The first part of each tuple is the term, the second one the part of speech tag.

Note: the same word can have different part of speech tags. In the example below, see “refuse” and “permit”.

Returns:	List of tuples (term, speech_tag)

>>> from gender_analysis import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_metadata = {'author': 'Hawthorne, Nathaniel', 'title': 'Scarlet Letter',
...                      'date': '1900', 'filename': 'test_text_13.txt',
...                      'filepath': Path(TEST_DATA_DIR,
...                                       'document_test_files', 'test_text_13.txt')}
>>> document = Document(document_metadata)
>>> document.get_part_of_speech_tags()[:4]
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB')]
>>> document.get_part_of_speech_tags()[-4:]
[('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN'), ('.', '.')]

get_part_of_speech_words(words, remove_swords=True)¶

A helper method for retrieving the number of occurrences of input words keyed to their NLTK tag values (e.g., ‘NN’ for noun).

Parameters:	words – a list of strings.
	remove_swords – optional boolean, remove stop words from return.
Returns:	a dictionary keying NLTK tag strings to Counter instances.

>>> from gender_analysis.text.document import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_filepath = Path(TEST_DATA_DIR, 'document_test_files', 'test_text_9.txt')
>>> document_metadata = {'filename': 'test_text_2.txt', 'filepath': document_filepath}
>>> test_document = Document(document_metadata)
>>> test_document.get_part_of_speech_words(['peace', 'died', 'beautiful', 'foobar'])
{'JJ': Counter({'beautiful': 3}), 'VBD': Counter({'died': 1}), 'NN': Counter({'peace': 1})}

get_tokenized_text()¶

Tokenizes the text and returns it as a list of tokens, while removing all punctuation.

Note: This does not currently properly handle dashes or contractions.

Returns:	List of each word in the Document

>>> from gender_analysis import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_metadata = {'author': 'Austen, Jane', 'title': 'Persuasion', 'date': '1818',
...                      'filename': 'test_text_1.txt',
...                      'filepath': Path(TEST_DATA_DIR,
...                                       'document_test_files', 'test_text_1.txt')}
>>> austin = Document(document_metadata)
>>> tokenized_text = austin.get_tokenized_text()
>>> tokenized_text
['allkinds', 'of', 'punctuation', 'and', 'special', 'chars']
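
The note about dashes and contractions can be reproduced with a naive tokenizer that lowercases, deletes punctuation, and splits on whitespace. This is an assumed approximation of the behaviour, not the library's tokenizer; `tokenize` and the sample strings are hypothetical.

```python
import string


def tokenize(text):
    """Lowercase, delete punctuation, and split on whitespace (sketch)."""
    stripped = text.lower().translate(str.maketrans('', '', string.punctuation))
    return stripped.split()


tokens = tokenize('All-kinds of punctuation, and special chars!')
contraction = tokenize("Don't stop")
```

Deleting the hyphen fuses "All-kinds" into a single token, and deleting the apostrophe turns "Don't" into "dont", which is the limitation the note describes.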

get_word_freq(word)¶

Returns the frequency of appearance of a word in the document

Parameters:	word – str to search for in document
Returns:	float representing the portion of words in the text that are the parameter word

>>> from gender_analysis import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_metadata = {'author': 'Hawthorne, Nathaniel', 'title': 'Scarlet Letter',
...                      'date': '1900', 'filename': 'test_text_2.txt',
...                      'filepath': Path(TEST_DATA_DIR,
...                                       'document_test_files', 'test_text_2.txt')}
>>> scarlett = Document(document_metadata)
>>> frequency = scarlett.get_word_freq('sad')
>>> frequency
0.13333333333333333
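
The frequency is the word's count divided by the total number of words. A sketch of that arithmetic on a made-up token list (`word_frequency` is an illustrative helper, and case-insensitive matching is an assumption):

```python
from collections import Counter


def word_frequency(tokens, word):
    """Share of tokens equal to word, compared case-insensitively (sketch)."""
    if not tokens:
        return 0.0
    counts = Counter(token.lower() for token in tokens)
    return counts[word.lower()] / len(tokens)


tokens = 'Hester was sad and the town was sad too'.split()
frequency = word_frequency(tokens, 'sad')  # 2 occurrences out of 9 tokens
```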

get_word_frequencies(words)¶

A helper method for retrieving the frequencies of a given set of words within a Document.

Parameters:	words – a list of strings.
Returns:	a dictionary of words keyed to float frequencies.

>>> from gender_analysis.text.document import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_filepath = Path(TEST_DATA_DIR, 'document_test_files', 'test_text_9.txt')
>>> document_metadata = {'filename': 'test_text_2.txt', 'filepath': document_filepath}
>>> test_document = Document(document_metadata)
>>> test_document.get_word_frequencies(['peace', 'died', 'foobar'])
{'peace': 0.02702702702702703, 'died': 0.02702702702702703, 'foobar': 0.0}

get_word_windows(search_terms, window_size=2)¶

Finds all instances of the search terms and returns a counter of the words around them. window_size is the number of words before and after each match to include, so the total window is 2*window_size + 1.

This is not case sensitive.

Parameters:	search_terms – String or list of strings to search for
	window_size – integer representing number of words to search for in either direction
Returns:	Python Counter object

>>> from gender_analysis import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_metadata = {'author': 'Hawthorne, Nathaniel', 'title': 'Scarlet Letter',
...                      'date': '2018', 'filename': 'test_text_12.txt',
...                      'filepath': Path(TEST_DATA_DIR,
...                                       'document_test_files', 'test_text_12.txt')}
>>> scarlett = Document(document_metadata)

search_terms can be either a string…

>>> scarlett.get_word_windows("his", window_size=2)
Counter({'he': 1, 'lit': 1, 'cigarette': 1, 'and': 1, 'then': 1, 'began': 1, 'speech': 1, 'which': 1})

… or a list of strings.

>>> scarlett.get_word_windows(['purse', 'tears'])
Counter({'her': 2, 'of': 1, 'and': 1, 'handed': 1, 'proposal': 1, 'drowned': 1, 'the': 1})
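
The windowing logic can be sketched on a plain token list. This is an illustration under stated assumptions, not the library's code: `word_windows` is a hypothetical function, the sentence is made up, and search terms that fall inside a window are left out of the counts.

```python
from collections import Counter


def word_windows(tokens, search_terms, window_size=2):
    """Count words within window_size tokens of any search term (sketch)."""
    if isinstance(search_terms, str):
        search_terms = [search_terms]
    targets = {term.lower() for term in search_terms}
    tokens = [token.lower() for token in tokens]
    counter = Counter()
    for index, token in enumerate(tokens):
        if token not in targets:
            continue
        # Clamp the left edge so a match near the start does not wrap around.
        start = max(0, index - window_size)
        window = tokens[start:index] + tokens[index + 1:index + 1 + window_size]
        # Assumption: search terms themselves are left out of the counts.
        counter.update(word for word in window if word not in targets)
    return counter


tokens = 'He lit his cigarette and then began his speech which lasted hours'.split()
windows = word_windows(tokens, 'his')
```

Each of the two matches contributes the two words before it and the two words after it, so eight distinct words are counted once each.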

get_wordcount_counter()¶

Returns a counter object of all of the words in the text.

If this is your first time running this method, this can be slow.

Returns:	Python Counter object

>>> from gender_analysis import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_metadata = {'author': 'Hawthorne, Nathaniel', 'title': 'Scarlet Letter',
...                      'date': '2018', 'filename': 'test_text_10.txt',
...                      'filepath': Path(TEST_DATA_DIR,
...                                       'document_test_files', 'test_text_10.txt')}
>>> scarlett = Document(document_metadata)
>>> scarlett.get_wordcount_counter()
Counter({'was': 2, 'convicted': 2, 'hester': 1, 'of': 1, 'adultery': 1})

update_metadata(new_metadata)¶

Updates the metadata of the document without requiring a complete reloading of the text and other properties.

‘filename’ cannot be updated with this method.

Parameters:	new_metadata – dict of new metadata to apply to the document
Returns:	None

This can be used to correct mistakes in the metadata:

>>> from gender_analysis import Document
>>> from gender_analysis.testing.common import TEST_CORPUS_PATH
>>> from pathlib import Path
>>> metadata = {'filename': 'aanrud_longfrock.txt',
...             'filepath': Path(TEST_CORPUS_PATH, 'aanrud_longfrock.txt'),
...             'date': '2098'}
>>> d = Document(metadata)
>>> new_metadata = {'date': '1903'}
>>> d.update_metadata(new_metadata)
>>> d.date
1903

Or it can be used to add completely new attributes:

>>> new_attribute = {'cookies': 'chocolate chip'}
>>> d.update_metadata(new_attribute)
>>> d.cookies
'chocolate chip'

word_count¶

Lazy-loading for the Document.word_count attribute. Returns the number of words in the document. The word_count attribute is useful for the get_word_freq function. However, it is expensive to compute, so it is only loaded when it is actually required.

Returns:	Number of words in the document’s text as an int

>>> from gender_analysis import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_metadata = {'author': 'Austen, Jane', 'title': 'Persuasion', 'date': '1818',
...                      'filename': 'austen_persuasion.txt',
...                      'filepath': Path(TEST_DATA_DIR, 'sample_novels',
...                                       'texts', 'austen_persuasion.txt')}
>>> austen = Document(document_metadata)
>>> austen.word_count
83285
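
Lazy loading of this kind can be expressed with `functools.cached_property` (Python 3.8+). The sketch below uses a toy class, `LazyDocument`, to show the pattern; it is not how the Document class is necessarily implemented.

```python
from functools import cached_property


class LazyDocument:
    """Toy stand-in for a document with a lazily computed word count."""

    def __init__(self, text):
        self.text = text
        self.computations = 0  # tracks how often the count is computed

    @cached_property
    def word_count(self):
        # Runs on first access only; the result is then stored on the instance.
        self.computations += 1
        return len(self.text.split())


doc = LazyDocument('Hester was convicted of adultery')
first = doc.word_count
second = doc.word_count
```

The second access returns the stored value, so the count is computed only once per instance.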

words_associated(target_word)¶

Returns a Counter of the words found after a given word.

In the case of double/repeated words, the counter would include the word itself and the next new word.

Note: words always return lowercase.

Parameters:	target_word – Single word to search for in the document’s text
Returns:	a Python Counter() object with {associated_word: occurrences}

>>> from gender_analysis import Document
>>> from pathlib import Path
>>> from gender_analysis.testing.common import TEST_DATA_DIR
>>> document_metadata = {'author': 'Hawthorne, Nathaniel', 'title': 'Scarlet Letter',
...                      'date': '2018', 'filename': 'test_text_11.txt',
...                      'filepath': Path(TEST_DATA_DIR,
...                                       'document_test_files', 'test_text_11.txt')}
>>> scarlett = Document(document_metadata)
>>> scarlett.words_associated("his")
Counter({'cigarette': 1, 'speech': 1})
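
The repeated-word case described above can be seen in a small sketch that counts the token following each occurrence of the target. `words_after` and the sentence are illustrative, not taken from the library or its test files.

```python
from collections import Counter


def words_after(tokens, target_word):
    """Count the token that follows each occurrence of target_word (sketch)."""
    tokens = [token.lower() for token in tokens]
    target = target_word.lower()
    counter = Counter()
    # Pair every token with its successor; the last token has none.
    for current, following in zip(tokens, tokens[1:]):
        if current == target:
            counter[following] += 1
    return counter


tokens = 'He lit his cigarette and began his his speech'.split()
associated = words_after(tokens, 'his')
```

Because "his his speech" contains the target twice in a row, the result includes both 'his' itself and 'speech', alongside 'cigarette'.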

character module¶

class gender_analysis.text.character.Character(name, gender=None, mentions=None)¶

Bases: object

Defines a character that will be operated on in analysis functions

get_char_gender()¶

Get the gender for the character based on:

1. If user entry exists, fetch entered gender
2. If not, infer Character’s gender based on coreference resolution and pronouns

Currently, this function only retrieves user entered gender for the character objects

Returns:	a gender object

>>> from gender_analysis.text.character import Character
>>> from gender_analysis.gender.common import FEMALE
>>> emma_name = 'Emma'
>>> emma_gender = FEMALE
>>> emma_mentions = ["Emma Woodhouse", "Emma", "Miss Woodhouse"]
>>> emma = Character(emma_name, emma_gender, emma_mentions)
>>> emma.get_char_gender()
<Female>
