Murphy¶
Murphy is a text processing library for working with Twitter data, built on top of Dask.
Murphy is broken down into several components and subcomponents:
Data Preprocessing: Scalable tools for cleaning and preparing raw tweet data
NLP Tools: Cleaning tools for tokenization and lemmatization
Filters: Removing retweet strings, emojis, and other annoyances
Batch Processing: Applying batch-based workloads to make data processing easier
Applying AI Models: AI models built by us for various purposes
Sentiment Analysis: Predicting sentiments in tweets!
Emoji Prediction: Predicting which emojis would work best for a tweet (coming soon!)
And more! We’re still developing, so ideas and contributions are much appreciated!

Built on top of Dask¶
By building on top of Dask, we get parallel, larger-than-memory execution that scales from a single laptop to a full cluster.
You also have access to individual Dask objects like Dask DataFrames and Dask Bags directly, so you can continue to experiment on your own after using our tools.
Install Murphy¶
Installing with Pip¶
Installing with pip is straightforward. Just run:
pip install smpa-murphy
Installing from Source¶
Installing from source is fairly easy:
Pull the repo:
git clone https://github.com/Social-Media-Public-Analysis/murphy.git
Move over to the main directory:
cd murphy
Install all dependencies:
pip install -r requirements.txt
Install with setup.py:
python setup.py install
Quick Start Guide¶
Installation¶
To install murphy on your machine, just install via pip:
pip install smpa-murphy
For more information on installation, check out our install guide.
Starting up Dask (optional)¶
Using Dask is optional, and all of our code is backwards compatible with Pandas. You may still want to use Dask, though, to launch your own Dask cluster, to get access to the Dask dashboard, or for any of its other use cases.
To use Dask, simply import its Client class and initialize it with your configuration:
from dask.distributed import Client
client = Client(<your configs here>)
client
You can find more information on Dask Client here
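If you don't have a cluster handy, Client() with no arguments starts a local one; the sketch below shows an explicitly configured local client (the worker count, thread count, and memory limit are illustrative, not recommendations):
from dask.distributed import Client

# Illustrative local-cluster configuration; tune these values for your machine
client = Client(n_workers=4, threads_per_worker=2, memory_limit='4GB')
print(client.dashboard_link)  # URL of the Dask dashboard for this client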
Loading the Data¶
You can load datasets by pointing to where they’ve been saved from Dozent.
The syntax for doing so is as follows:
from murphy import data_loader # -> Importing murphy
data = data_loader.DataLoader(file_find_expression = 'data/test_data/*.json.bz2') # -> You can point to another location here
twitter_dataframe = data.twitter_dataframe # -> this returns a dask dataframe that is lazily computed
twitter_dataframe
This is what your output should look like (in a Jupyter notebook).
You might be thinking: so my data is just going to be loaded from the file? That's it?
Nope! Take a look at this snippet from the data_loader.DataLoader documentation:
>>> help(data_loader.DataLoader)
class DataLoader(builtins.object)
| DataLoader(file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')
|
| Methods defined here:
|
| __init__(self, file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')
| This is where you can specify how you want to configure the twitter dataset before you start processing it.
|
| :param file_find_expression: unix-like path that is used for listing out all of the files that we need
|
| :param remove_emoji: flag for removing emojis from all of the twitter text
|
| :param remove_retweets_symbols: flag for removing retweet strings from all of the twitter text (`RT @<retweet_username>:`)
|
| :param remove_truncated_tweets: flag for removing all tweets that are truncated, as not all information can be
| found in them
|
| :param add_usernames: flag for adding in the user names from who tweeted as a separate column instead of parsing
| it from the `user` column
|
| :param tokenize: tokenize tweets to make them easier to process
|
| :param filter_stopwords: remove stopwords from the tweets to make them easier to process
|
| :param lemmatize: lemmatize text to make it easier to process
|
| :param language: select the language that you want to work with
Here, we can see that the DataLoader class has tons of configurable parameters that make development easier, including built-in tokenization, lemmatization, and more!
These are automatically run when you compute your twitter_dataframe, meaning that these functions are applied and parallelized for you, right out of the box!
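Because the dataframe is lazy, none of that preprocessing actually runs until you ask Dask for a result. A quick, illustrative way to sanity-check the output is to materialize a few rows:
# Pull a handful of processed rows (triggers computation for just those rows)
twitter_dataframe.head()

# Or materialize everything as a pandas DataFrame (can be large!)
# result = twitter_dataframe.compute()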
Now what?¶
Now, you can explore the data to your heart’s content! We suggest looking over this Dask Tutorial if you’re not familiar with Dask already, as it’ll make exploring the dataset easier
Murphy Use Case¶
First and foremost, Murphy is designed to be scalable.
Second, Murphy is designed with functionality in mind, and we hope it becomes the first tool you reach for to explore, understand, and visualize your data.
Finally, you still have the full flexibility of Dask DataFrames after Murphy is done processing, so you can do whatever you want with the results, including switching over to Spark.
Work with Data from Dozent: the best Twitter scraper¶
The Twitter data you can get from Dozent is extremely large, estimated at 52.56 TB per year. We currently support data from 2017 to 2020 and intend to support more later on. In comparison, the GDELT Project works with only about 2.5 TB of data yearly (but they do some amazing work! Seriously, check them out!)
An excerpt from Dozent's README:
Dozent
Dozent is a powerful downloader that is used to collect large amounts of Twitter data from the
internet archive.
It is built on top of PySmartDL and multithreading, similar to how traditional download accelerators
like axel, aria2c and aws s3 work, ensuring that the biggest bottlenecks are your network and your
hardware.
The data that is downloaded is already heavily compressed to reduce download times and save local
storage. When uncompressed, the data can easily add up to several terabytes depending on the
timeframe of data being collected.
Built-in tools, made to scale¶
Complex Algorithms¶
Murphy comes prepackaged with scalable and efficient implementations of algorithms you already use for NLP-type tasks, such as tokenization, lemmatization, and functionality to remove emojis and other redundant or irrelevant information.
Machine Learning Models¶
Murphy implements simple ML models, such as sentiment classification, in several variants so you can pick the one that best suits your needs. While the selection is quite limited right now, we are actively working on deploying more ML models that can provide deeper insight into the dataset.
| ML Model | Category | Function | Availability |
|---|---|---|---|
| NLTK | Classification | Sentiment Prediction, built using NLTK | ✔️ |
| TextBlob | Classification | Sentiment Prediction, built using TextBlob | ✔️ |
| Emoji Predictor | Classification | Predicting the best emoji for a sentence | ⌚ |
Murphy’s API¶
Data Loader¶
A data loader that converts json.bz2 files into a functional Dask DataFrame.
While this module has a ton of functionality, most of it has been abstracted into its constructor (__init__).
class murphy.data_loader.DataLoader(file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')[source]¶
Bases: object
This is where you can specify how you want to configure the Twitter dataset before you load it. Its functionality includes:
removing emojis
removing retweet symbols
lemmatizing the text
filtering by language
and more!
- Parameters
file_find_expression – unix-like path that is used for listing out all of the files that we need
remove_emoji – flag for removing emojis from all of the twitter text
remove_retweets_symbols – flag for removing retweet strings from all of the twitter text (RT @<retweet_username>:)
remove_truncated_tweets – flag for removing all tweets that are truncated, as not all information can be found in them
add_usernames – flag for adding the username of the person who tweeted as a separate column instead of parsing it from the user column
tokenize – tokenize tweets to make them easier to process
filter_stopwords – remove stopwords from the tweets to make them easier to process
lemmatize – lemmatize text to make it easier to process
language – select the language that you want to work with
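As a rough sketch (the path is the test-data location from the Quick Start, and the flag values are purely illustrative), configuring the loader might look like this:
from murphy import data_loader

data = data_loader.DataLoader(
    file_find_expression='data/test_data/*.json.bz2',  # point this at your own data
    remove_emoji=False,              # keep emojis in the text
    remove_retweets_symbols=True,    # strip the `RT @<retweet_username>:` prefix
    tokenize=True,
    lemmatize=True,
    language='english',
)
twitter_dataframe = data.twitter_dataframe  # lazily computed Dask DataFrame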
static get_files_list(pathname: Union[str, pathlib.Path], recursive: bool = False, suffix: str = '*.json*') → List[str][source]¶
Function to get files from the given pathname. If the pathname leads to a directory, it searches within that directory, with the option to add a custom suffix.
- Parameters
pathname – pathname from where we can get the files
recursive – Flag for searching recursively
suffix – suffix to search for when the pathname leads to a directory
- Raises
ValueError – When no files are found based on the pathname
- Returns
list of paths to the files that were found
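For example, to list the compressed tweet files under a directory (the directory name below is illustrative):
from murphy.data_loader import DataLoader

# Search the directory (and its subdirectories) for files matching the suffix;
# raises ValueError if nothing is found
files = DataLoader.get_files_list('data/test_data', recursive=True, suffix='*.json.bz2')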
Filters¶
author: v2thegreat (v2thegreat@gmail.com)
Package to filter out irrelevant rows that might not be wanted for processing
- TODO:
This package was written in the hope of better understanding what problems would be encountered when processing such a dataset; it was hence written with the understanding that this and other scripts will be refactored.
Add tests
class murphy.filters.Filter(remove_emoji: bool = True, remove_retweets: bool = False, remove_truncated_tweets: bool = False)[source]¶
Bases: object
static filter_emoji(twitter_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
static filter_retweet_text(twitter_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
static mark_truncated_tweets(tweet_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
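Since these methods are static and accept either a Dask or a pandas DataFrame, a rough usage sketch, assuming twitter_dataframe is the DataFrame produced by the DataLoader above, looks like this:
from murphy.filters import Filter

# Each call returns a new DataFrame of the same flavour (Dask or pandas)
no_emoji = Filter.filter_emoji(twitter_dataframe)
no_retweet_symbols = Filter.filter_retweet_text(no_emoji)
marked = Filter.mark_truncated_tweets(no_retweet_symbols)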
NLP Tools¶
class murphy.nlp_tools.NLPTools(tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')[source]¶
Bases: object
filter_stopwords(tweet_dataframe: dask.dataframe.core.DataFrame) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
lemmatize_tweets(tweet_dataframe: dask.dataframe.core.DataFrame) → Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]¶
nlp = <spacy.lang.en.English object>¶
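A minimal sketch of calling these methods directly, again assuming twitter_dataframe is a Dask DataFrame of tweets from the DataLoader:
from murphy.nlp_tools import NLPTools

tools = NLPTools(tokenize=True, filter_stopwords=True, lemmatize=True, language='english')
without_stopwords = tools.filter_stopwords(twitter_dataframe)
lemmatized = tools.lemmatize_tweets(without_stopwords)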
Batch Processing¶
Note
This module was written to better understand what problems would be encountered when processing such a dataset, and is hence written in its current flexible manner.
Module to process tweets from the data loader in batches, reducing the workload on the scheduler by applying functions in batches.
class murphy.batch_processing.Batches[source]¶
Bases: object
static process_in_batches(file_paths: Iterable[str], read_func: Callable[[str], Any], func_to_apply: Callable[[Any], Any], verbose: bool = True) → Dict[str, Any][source]¶
Function to process data in batches to circumvent the Dask scheduler's limitations (a maximum of 100k tasks, for example).
- Parameters
file_paths – path of files that need to be individually processed
read_func – function to read the file. This must return an object (for example: Dask Bag, Dask Array, str)
func_to_apply – function to apply on the object that's returned by read_func
verbose – show progress bar?
- Returns
a dictionary that has the schema: {file_name: func_to_apply’s return value}
static process_in_batches_generator(file_iterator: Iterable[str], read_func: Callable[[str], Any], func_to_apply: Callable[[Any], Any]) → Iterable[Any][source]¶
Function to process data in batches to circumvent the Dask scheduler's limitation of 100k tasks.
- Parameters
file_iterator – iterator that contains file names
read_func – function to read the file. This must return an object (for example: Dask Bag, Dask Array, str)
func_to_apply – function to apply on the object that's returned by read_func
- Returns
a dictionary that has the schema: {file_name: func_to_apply’s return value}
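As a sketch of how these pieces fit together (the file path, reader, and counting function below are illustrative, not part of Murphy's API):
import dask.bag as db
from murphy.batch_processing import Batches

file_paths = ['data/test_data/example.json.bz2']  # illustrative paths

def count_lines(bag):
    # any function applied to whatever read_func returned
    return bag.count().compute()

# read_func is called on each path; here db.read_text returns a Dask Bag of lines
results = Batches.process_in_batches(file_paths, db.read_text, count_lines, verbose=True)
# results maps each file name to count_lines' return value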
Module contents¶
The top-level murphy package re-exports the Batches, DataLoader, and Filter classes documented above.
Murphy Sentiment Classification¶
class murphy.classification.sentiments.Sentiments[source]¶
Bases: object
classmethod multiple_sentiment_analysis(text: str) → Dict[str, float][source]¶
Returns the sentiment using all implemented models as a dictionary
- Parameters
text – text to run sentiment analysis on
- Returns
key-value pairs mapping each sentiment function's name to its estimate
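A quick sketch of calling it directly (the example text is made up):
from murphy.classification.sentiments import Sentiments

# Returns a dict mapping each implemented model's name to its sentiment estimate
scores = Sentiments.multiple_sentiment_analysis("I love how easy this was to set up!")
print(scores)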