Murphy

Murphy is a text processing library for working with Twitter data, built on top of Dask.

Murphy is broken down into several components and subcomponents:

  • Data Preprocessing: Scalable tools that help researchers clean and prepare Twitter data

    • NLP Tools: NLP cleaning tools for tokenization and lemmatization

    • Filters: Removing retweet strings, emojis, and other annoyances

    • Batch Processing: Applying batch-based workloads to make data processing easier

  • Applying AI Models: AI models we build ourselves for various purposes

    • Sentiment Analysis: Predicting sentiments in tweets!

    • Emoji Prediction: Predicting which emojis would work best for a tweet (coming soon!)

    • And more! We’re still developing, so ideas and contributions are much appreciated!

[Overview diagram: _images/overview-diagram.png]

Built on top of Dask

By building on top of Dask, Murphy can parallelize work across cores and scale out from a single laptop to a cluster.

You also have access to individual Dask objects like Dask DataFrames and Dask Bags directly, so you can continue to experiment on your own after using our tools.

Install Murphy

Installing with Pip

Installing with pip is straightforward. Just run:

pip install smpa-murphy

Installing from Source

Installing from source is fairly easy:

  1. Pull the repo:

    git clone https://github.com/Social-Media-Public-Analysis/murphy.git

  2. Move over to the main directory:

    cd murphy

  3. Install all dependencies:

    pip install -r requirements.txt

  4. Install with setup.py:

    python setup.py install
    

Test

To test that everything is working well:

pytest tests/

Quick Start Guide

Installation

To install Murphy on your machine, just install it via pip:

pip install smpa-murphy

For more information on installation, check out our install guide.

Starting up Dask (optional)

Using Dask directly is optional, and all of our code is backwards compatible with Pandas. That said, launching your own Dask cluster gives you access to the Dask dashboard, lets you scale across multiple workers, and opens up Dask's other use cases.

To use Dask, simply import its Client class and initialize it with your configuration:

from dask.distributed import Client

client = Client(<your configs here>)
client

You can find more information on the Dask Client in the Dask distributed documentation.
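If you just want to run locally, here is a minimal sketch (the worker counts are assumptions; tune them for your machine):

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=2)  # local cluster with 4 worker processes
client = Client(cluster)
print(client.dashboard_link)  # URL of the Dask dashboard for this cluster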

Loading the Data

You can load datasets by pointing to where they’ve been saved from Dozent.

The syntax for doing so is as follows:

from murphy import data_loader # -> Importing murphy

data = data_loader.DataLoader(file_find_expression = 'data/test_data/*.json.bz2') # -> You can point to another location here

twitter_dataframe = data.twitter_dataframe # -> this returns a Dask DataFrame that is lazily computed

twitter_dataframe

In a Jupyter notebook, the output is a preview of the lazily computed Dask DataFrame: its column structure is shown, but no data has been loaded yet.

You might be thinking: So, my data is going to just be loaded from the file? That’s it?

Nope! Take a look at this snippet from the data_loader.DataLoader documentation:

>>> help(data_loader.DataLoader)

class DataLoader(builtins.object)
 |  DataLoader(file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')
 |
 |  Methods defined here:
 |
 |  __init__(self, file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')
 |      This is where you can specify how you want to configure the twitter dataset before you start processing it.
 |
 |      :param file_find_expression: unix-like path that is used for listing out all of the files that we need
 |
 |      :param remove_emoji: flag for removing emojis from all of the twitter text
 |
 |      :param remove_retweets_symbols: flag for removing retweet strings from all of the twitter text (`RT @<retweet_username>:`)
 |
 |      :param remove_truncated_tweets: flag for removing all tweets that are truncated, as not all information can be
 |                                      found in them
 |
 |      :param add_usernames: flag for adding in the user names from who tweeted as a separate column instead of parsing
 |                            it from the `user` column
 |
 |      :param tokenize: tokenize tweets to make them easier to process
 |
 |      :param filter_stopwords: remove stopwords from the tweets to make them easier to process
 |
 |      :param lemmatize: lemmatize text to make it easier to process
 |
 |      :param language: select the language that you want to work with

Here, we can see that the DataLoader class has tons of configurable parameters that make development easier, including built-in tokenization, lemmatization, and more!

These are automatically run when you compute your twitter_dataframe, meaning these functions are applied and parallelized for you, right out of the box!
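For example, a minimal sketch of triggering the lazy computation, assuming the twitter_dataframe from above:

first_rows = twitter_dataframe.head()  # computes only the first partition and returns a small pandas DataFrame
full_df = twitter_dataframe.compute()  # runs every configured cleaning step in parallel and returns a pandas DataFrame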

Now what?

Now, you can explore the data to your heart’s content! We suggest looking over this Dask Tutorial if you’re not familiar with Dask already, as it’ll make exploring the dataset easier.
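A couple of exploratory one-liners as a starting point (this sketch assumes the loaded data keeps a 'text' column, as in the raw tweet JSON):

print(twitter_dataframe.columns)         # see which columns are available
print(twitter_dataframe['text'].head())  # peek at the cleaned tweet text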

Murphy Use Case

  1. First and foremost, Murphy is designed to be scalable.

  2. Second, Murphy is designed with functionality in mind, and we hope it becomes the first tool you reach for to play with, understand, and visualize your data.

  3. Finally, you keep the full flexibility of Dask DataFrames after Murphy is done, so you can do whatever you want afterwards, including switching over to Spark.

Work with Data from Dozent: the best twitter scraper

The Twitter data you can get from Dozent is extremely large, estimated at 52.56 TB per year. We currently support data from 2017 to 2020 and intend to support more later on. For comparison, the GDELT Project works with only about 2.5 TB of data per year (but they do some amazing work! Seriously, check them out!).

An excerpt from Dozent’s README:

Dozent

Dozent is a powerful downloader that is used to collect large amounts of Twitter data from the
internet archive.

It is built on top of PySmartDL and multithreading, similar to how traditional download accelerators
like axel, aria2c and aws s3 work, ensuring that the biggest bottlenecks are your network and your
hardware.

The data that is downloaded is already heavily compressed to reduce download times and save local
storage. When uncompressed, the data can easily add up to several terabytes depending on the
timeframe of data being collected.

Built-in tools, made to scale

Complex Algorithms

Murphy comes prepackaged with scalable and efficient implementations of the algorithms you already use for NLP tasks: tokenization, lemmatization, removing emojis, stripping redundant and irrelevant information, and more!
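All of these are toggled through DataLoader's flags. A hedged sketch (the file path is the same placeholder used in the quick start):

from murphy import data_loader

data = data_loader.DataLoader(
    file_find_expression='data/test_data/*.json.bz2',  # point this at your own data
    remove_emoji=True,             # strip emojis from the tweet text
    remove_retweets_symbols=True,  # drop the leading `RT @<retweet_username>:` markers
    tokenize=True,                 # tokenize tweets
    filter_stopwords=True,         # remove stopwords
    lemmatize=True,                # lemmatize the tokens
    language='english',
)
twitter_dataframe = data.twitter_dataframe  # all of the above run lazily when this is computed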

Machine Learning Models

Murphy implements simple ML models such as sentiment classification, in several versions so you can pick the one that best suits your needs (see the sketch after the table below). While the selection is quite limited right now, we are actively working on deploying more ML models that can provide deeper insight into this dataset.

Machine Learning Models

ML Model         Category        Function                                    Availability
NLTK             Classification  Sentiment Prediction, built using NLTK      ✔️
TextBlob         Classification  Sentiment Prediction, built using TextBlob  ✔️
Emoji Predictor  Classification  Predicting the best emoji for a sentence    Coming soon
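A minimal sketch of calling the available sentiment models (the example text is illustrative):

from murphy.classification.sentiments import Sentiments

scores = Sentiments.multiple_sentiment_analysis("I love this library!")
print(scores)  # a dict mapping each sentiment function's name to its estimate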

Murphy’s API

Data Loader

A data loader that converts json.bz2 files into a functional Dask Dataframe.

While this module has a ton of functionality, most of it has been abstracted into its constructor (__init__).

class murphy.data_loader.DataLoader(file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')[source]

Bases: object

This is where you can specify how you want to configure the twitter dataset before you load it. Its functionality includes:

  • removing emojis

  • removing retweets symbols

  • lemmatizing the text

  • filtering by language

  • and more!

Parameters
  • file_find_expression – unix-like path that is used for listing out all of the files that we need

  • remove_emoji – flag for removing emojis from all of the twitter text

  • remove_retweets_symbols – flag for removing retweet strings from all of the twitter text (RT @<retweet_username>:)

  • remove_truncated_tweets – flag for removing all tweets that are truncated, as not all information can be found in them

  • add_usernames – flag for adding in the user names from who tweeted as a separate column instead of parsing it from the user column

  • tokenize – tokenize tweets to make them easier to process

  • filter_stopwords – remove stopwords from the tweets to make them easier to process

  • lemmatize – lemmatize text to make it easier to process

  • language – select the language that you want to work with

static get_files_list(pathname: Union[str, pathlib.Path], recursive: bool = False, suffix: str = '*.json*') -> List[str][source]

Function to get files from the given pathname.

If the pathname leads to a directory, searches in that directory, with the option of adding a custom suffix.

Parameters
  • pathname – pathname from where we can get the files

  • recursive – Flag for searching recursively

  • suffix – suffix to search for files when a pathname leads to a directory is given

Raises

ValueError – When no files are found based on the pathname

Returns

a list of paths to the files that were found
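A small usage sketch (the directory path here is hypothetical):

from murphy.data_loader import DataLoader

files = DataLoader.get_files_list('data/test_data', recursive=True, suffix='*.json.bz2')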

Filters

author: v2thegreat (v2thegreat@gmail.com)

Package to filter out irrelevant rows that might not be wanted for processing

TODO:
  • This package is written with the hope of better understanding what problems will be encountered when processing such a dataset; it is written with the understanding that this and other scripts will be refactored

  • Add tests

class murphy.filters.Filter(remove_emoji: bool = True, remove_retweets: bool = False, remove_truncated_tweets: bool = False)[source]

Bases: object

static filter_emoji(twitter_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) -> Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]
static filter_retweet_text(twitter_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) -> Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]
static mark_truncated_tweets(tweet_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) -> Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]
static remove_truncated_tweets(tweet_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) -> Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]
run_filters(twitter_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) -> Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]
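A hedged sketch of running the filters on an already-loaded tweet DataFrame (Dask or pandas):

from murphy.filters import Filter

tweet_filter = Filter(remove_emoji=True, remove_retweets=True, remove_truncated_tweets=True)
filtered_df = tweet_filter.run_filters(twitter_dataframe)  # applies every enabled filter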

NLP Tools

class murphy.nlp_tools.NLPTools(tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')[source]

Bases: object

filter_stopwords(tweet_dataframe: dask.dataframe.core.DataFrame) -> Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]
lemmatize_tweets(tweet_dataframe: dask.dataframe.core.DataFrame) -> Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]
nlp = <spacy.lang.en.English object>
run_tools(tweet_dataframe: dask.dataframe.core.DataFrame) -> Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]
tokenize_tweets(tweet_dataframe: Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame]) -> Union[dask.dataframe.core.DataFrame, pandas.core.frame.DataFrame][source]
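A hedged sketch of applying the NLP tools directly to a tweet DataFrame:

from murphy.nlp_tools import NLPTools

tools = NLPTools(tokenize=True, filter_stopwords=True, lemmatize=True, language='english')
processed_df = tools.run_tools(tweet_dataframe)  # tweet_dataframe is a Dask DataFrame of tweets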

Batch Processing

Note

This module is written with the hope of better understanding what problems will be encountered when processing such a dataset, and is hence written in its current, flexible manner.

Module to process tweets from data_loading in batches, applying functions batch by batch to reduce the workload on the scheduler.

class murphy.batch_processing.Batches[source]

Bases: object

static process_in_batches(file_paths: Iterable[str], read_func: Callable[[str], Any], func_to_apply: Callable[[Any], Any], verbose: bool = True) -> Dict[str, Any][source]

Function to process data in batches to circumvent Dask Scheduler’s limitations (max of 100k tasks for example)

Parameters
  • file_paths – path of files that need to be individually processed

  • read_func – function to read the file. This must return an object (for example: Dask Bag, Dask Array, str)

  • func_to_apply – function to apply on the object that’s returned on the read_func

  • verbose – show progress bar?

Returns

a dictionary that has the schema: {file_name: func_to_apply’s return value}

static process_in_batches_generator(file_iterator: Iterable[str], read_func: Callable[[str], Any], func_to_apply: Callable[[Any], Any]) -> Iterable[Any][source]

Function to process data in batches to circumvent Dask Scheduler’s limitations (max of 100k tasks for example)

Parameters
  • file_iterator – iterator that contains the file names

  • read_func – function to read the file. This must return an object (for example: Dask Bag, Dask Array, str)

  • func_to_apply – function to apply on the object that’s returned on the read_func

Returns

an iterable that yields func_to_apply’s return value for each file
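A hedged sketch of batching work over many compressed tweet files; the reader and the applied function here are illustrative stand-ins:

import dask.bag as db
from murphy.batch_processing import Batches

def read_file(path: str):
    # read one compressed JSON-lines file as a Dask Bag
    return db.read_text(path)

def count_lines(bag):
    # example workload applied to each file's bag
    return bag.count().compute()

results = Batches.process_in_batches(
    file_paths=['data/test_data/example.json.bz2'],  # hypothetical file list
    read_func=read_file,
    func_to_apply=count_lines,
)
# results maps each file name to count_lines' return value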


Murphy Sentiment Classification

class murphy.classification.sentiments.Sentiments[source]

Bases: object

classmethod multiple_sentiment_analysis(text: str) -> Dict[str, float][source]

Returns the sentiment using all implemented models as a dictionary

Parameters

text – text to run sentiment analysis on

Returns

key-value pairs of each sentiment function’s name and its estimate

static sentiment_analysis_nltk(text: str) -> float[source]

Run sentiment analysis using the NLTK library. Runs the default sentiment analyzer with the VADER lexicon; works based on a bag of words and positive/negative word lookups.

Parameters

text – text to be analyzed

Returns

compound sentiment score for the given text

static sentiment_analysis_textblob(text: str) -> float[source]

Run sentiment analysis using the TextBlob library. Returns the default sentiment; works similarly to NLTK’s sentiment analysis, but includes subjectivity analysis.

Parameters

text – text to be analyzed

Returns

sentiment for the given text
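The individual model functions can also be called directly; a minimal sketch with illustrative text:

from murphy.classification.sentiments import Sentiments

nltk_score = Sentiments.sentiment_analysis_nltk("What a great day!")          # VADER compound score
textblob_score = Sentiments.sentiment_analysis_textblob("What a great day!")  # TextBlob sentiment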