Murphy’s API¶
Data Loader¶
A data loader that converts json.bz2 files into a functional Dask DataFrame.
While this module offers a lot of functionality, most of it is exposed through its constructor (__init__).
class murphy.data_loader.DataLoader(file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')[source]¶
Bases: object
This is where you specify how to configure the Twitter dataset before loading it. Its functionality includes:
removing emojis
removing retweets symbols
lemmatizing the text
filtering by language
and more!
- Parameters
file_find_expression – Unix-style glob path used to list all of the files to load
remove_emoji – flag for removing emojis from all tweet text
remove_retweets_symbols – flag for removing retweet markers (RT @<retweet_username>:) from all tweet text
remove_truncated_tweets – flag for removing all truncated tweets, since not all information can be recovered from them
add_usernames – flag for adding the tweeting user's name as a separate column instead of parsing it from the user column
tokenize – tokenize tweets to make them easier to process
filter_stopwords – remove stopwords from the tweets to make them easier to process
lemmatize – lemmatize text to make it easier to process
language – the language to work with
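The remove_retweets_symbols flag strips the leading RT @<retweet_username>: marker from tweet text. A minimal per-string sketch of that transformation (the regex and function name here are illustrative, not murphy's actual implementation):

```python
import re

# Matches a leading retweet marker such as "RT @someuser: " (illustrative pattern)
RT_PATTERN = re.compile(r"^RT @\w+:\s*")

def remove_retweet_symbol(text: str) -> str:
    """Strip the leading retweet marker from a tweet, if present."""
    return RT_PATTERN.sub("", text)

remove_retweet_symbol("RT @someuser: hello world")  # "hello world"
remove_retweet_symbol("no retweet here")            # unchanged
```

In the actual DataLoader this kind of transformation is applied across a whole Dask DataFrame column rather than string by string.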
static get_files_list(pathname: Union[str, pathlib.Path], recursive: bool = False, suffix: str = '*.json*') → List[str][source]¶
Function to get files from the given pathname. If pathname leads to a directory, searches inside that directory, with the option of supplying a custom suffix.
- Parameters
pathname – pathname from which to collect the files
recursive – flag for searching recursively
suffix – suffix to search for when pathname leads to a directory
- Raises
ValueError – when no files are found for the given pathname
- Returns
a list of matching file paths
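The documented behaviour of get_files_list can be sketched with the standard library's glob module. This mirrors the description above (directory lookup with a suffix, ValueError on no matches) but is not murphy's actual source:

```python
import glob
import pathlib
from typing import List, Union

def get_files_list(pathname: Union[str, pathlib.Path],
                   recursive: bool = False,
                   suffix: str = "*.json*") -> List[str]:
    """Return the files matching pathname; when pathname is a directory,
    search inside it for files matching suffix."""
    path = pathlib.Path(pathname)
    # Directories are searched with the suffix; other pathnames are used as-is
    pattern = str(path / suffix) if path.is_dir() else str(path)
    files = glob.glob(pattern, recursive=recursive)
    if not files:
        raise ValueError(f"No files found for pathname {pathname!r}")
    return files
```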
Filters¶
author: v2thegreat (v2thegreat@gmail.com)
Package to filter out rows that are not wanted for processing
- TODO:
This package was written to better understand what problems would be encountered when processing such a dataset; it is hence written with the understanding that this and other scripts will be refactored.
- Add tests
class murphy.filters.Filter(remove_emoji: bool = True, remove_retweets: bool = False, remove_truncated_tweets: bool = False)[source]¶
Bases: object
static filter_emoji(twitter_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
static filter_retweet_text(twitter_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
static mark_truncated_tweets(tweet_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
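These methods operate column-wise over a Dask or pandas DataFrame. The per-string core of something like filter_emoji might look like the following sketch; the unicode ranges are illustrative and incomplete, and this is not murphy's actual implementation (a real filter might use a dedicated emoji library):

```python
import re

# A few common emoji blocks (illustrative subset only)
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "]+"
)

def remove_emoji(text: str) -> str:
    """Strip emoji characters from a single string."""
    return EMOJI_PATTERN.sub("", text)
```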
NLP Tools¶
class murphy.nlp_tools.NLPTools(tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')[source]¶
Bases: object
filter_stopwords(tweet_dataframe: dask.dataframe.core.DataFrame) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
lemmatize_tweets(tweet_dataframe: dask.dataframe.core.DataFrame) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
nlp = <spacy.lang.en.English object>¶
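Conceptually, stopword filtering drops common function words from tokenized tweets. A minimal stand-in for what filter_stopwords does per tweet; murphy applies this over a Dask DataFrame column, with the stopword list drawn from the configured language, whereas this sketch hard-codes a tiny English subset for illustration:

```python
from typing import List

# Illustrative subset; real stopword lists (e.g. NLTK's or spaCy's) are far larger
STOPWORDS = {"the", "a", "an", "is", "and", "to", "of"}

def filter_stopwords(tokens: List[str]) -> List[str]:
    """Drop stopword tokens from a tokenized tweet."""
    return [t for t in tokens if t.lower() not in STOPWORDS]
```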
Batch Processing¶
Note
This module was written to better understand what problems would be encountered when processing such a dataset, and is hence written in its current flexible manner.
Module to process tweets from data_loading in batches, reducing the workload on the scheduler by applying functions batch-wise.
class murphy.batch_processing.Batches[source]¶
Bases: object
static process_in_batches(file_paths: Iterable[str], read_func: Callable[[str], Any], func_to_apply: Callable[[Any], Any], verbose: bool = True) → Dict[str, Any][source]¶
Function to process data in batches to circumvent the Dask scheduler's limitations (for example, a maximum of 100k tasks).
- Parameters
file_paths – paths of the files to be individually processed
read_func – function to read a file; must return an object (for example a Dask Bag, Dask Array, or str)
func_to_apply – function to apply to the object returned by read_func
verbose – whether to show a progress bar
- Returns
a dictionary with the schema {file_name: func_to_apply's return value}
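The documented contract (one read-then-apply step per file, results keyed by file name) can be sketched as follows; the real implementation also drives a progress bar and typically receives Dask-specific readers:

```python
from typing import Any, Callable, Dict, Iterable

def process_in_batches(file_paths: Iterable[str],
                       read_func: Callable[[str], Any],
                       func_to_apply: Callable[[Any], Any],
                       verbose: bool = True) -> Dict[str, Any]:
    """Read and process each file as its own small task,
    collecting results keyed by file name."""
    results: Dict[str, Any] = {}
    for path in file_paths:
        results[path] = func_to_apply(read_func(path))
    return results
```

Processing files one at a time keeps the number of in-flight scheduler tasks small, at the cost of less cross-file parallelism.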
static process_in_batches_generator(file_iterator: Iterable[str], read_func: Callable[[str], Any], func_to_apply: Callable[[Any], Any]) → Iterable[Any][source]¶
Function to process data in batches to circumvent the Dask scheduler's limitation of 100k tasks.
- Parameters
file_iterator – iterator of file names
read_func – function to read a file; must return an object (for example a Dask Bag, Dask Array, or str)
func_to_apply – function to apply to the object returned by read_func
- Returns
an iterable yielding func_to_apply's return value for each file
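The generator variant yields each result lazily instead of collecting a dictionary, which avoids holding every result in memory. A sketch under the same assumptions as above:

```python
from typing import Any, Callable, Iterable

def process_in_batches_generator(file_iterator: Iterable[str],
                                 read_func: Callable[[str], Any],
                                 func_to_apply: Callable[[Any], Any]) -> Iterable[Any]:
    """Lazily yield func_to_apply's result for each file in turn."""
    for path in file_iterator:
        yield func_to_apply(read_func(path))
```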
Module contents¶
class murphy.Batches[source]¶
Bases: object
static process_in_batches(file_paths: Iterable[str], read_func: Callable[[str], Any], func_to_apply: Callable[[Any], Any], verbose: bool = True) → Dict[str, Any][source]¶
Function to process data in batches to circumvent the Dask scheduler's limitations (for example, a maximum of 100k tasks).
- Parameters
file_paths – paths of the files to be individually processed
read_func – function to read a file; must return an object (for example a Dask Bag, Dask Array, or str)
func_to_apply – function to apply to the object returned by read_func
verbose – whether to show a progress bar
- Returns
a dictionary with the schema {file_name: func_to_apply's return value}
static process_in_batches_generator(file_iterator: Iterable[str], read_func: Callable[[str], Any], func_to_apply: Callable[[Any], Any]) → Iterable[Any][source]¶
Function to process data in batches to circumvent the Dask scheduler's limitation of 100k tasks.
- Parameters
file_iterator – iterator of file names
read_func – function to read a file; must return an object (for example a Dask Bag, Dask Array, or str)
func_to_apply – function to apply to the object returned by read_func
- Returns
an iterable yielding func_to_apply's return value for each file
class murphy.DataLoader(file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')[source]¶
Bases: object
This is where you specify how to configure the Twitter dataset before loading it. Its functionality includes:
removing emojis
removing retweets symbols
lemmatizing the text
filtering by language
and more!
- Parameters
file_find_expression – Unix-style glob path used to list all of the files to load
remove_emoji – flag for removing emojis from all tweet text
remove_retweets_symbols – flag for removing retweet markers (RT @<retweet_username>:) from all tweet text
remove_truncated_tweets – flag for removing all truncated tweets, since not all information can be recovered from them
add_usernames – flag for adding the tweeting user's name as a separate column instead of parsing it from the user column
tokenize – tokenize tweets to make them easier to process
filter_stopwords – remove stopwords from the tweets to make them easier to process
lemmatize – lemmatize text to make it easier to process
language – the language to work with
static get_files_list(pathname: Union[str, pathlib.Path], recursive: bool = False, suffix: str = '*.json*') → List[str][source]¶
Function to get files from the given pathname. If pathname leads to a directory, searches inside that directory, with the option of supplying a custom suffix.
- Parameters
pathname – pathname from which to collect the files
recursive – flag for searching recursively
suffix – suffix to search for when pathname leads to a directory
- Raises
ValueError – when no files are found for the given pathname
- Returns
a list of matching file paths
class murphy.Filter(remove_emoji: bool = True, remove_retweets: bool = False, remove_truncated_tweets: bool = False)[source]¶
Bases: object
static filter_emoji(twitter_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
static filter_retweet_text(twitter_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
static mark_truncated_tweets(tweet_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶