Deep Text Corrector

An eye for grammar

With writing becoming digital like never before, the need for a second pair of eyes to catch grammatical slips has grown with it. Automating even a fraction of the myriad rules of grammar will help users draft e-mails, documentation, content, and social media posts, weeding out common grammatical flaws and improving the overall quality of the write-up.

Deep Learning to the rescue

We will use current deep learning techniques for natural language processing (NLP), namely encoder-decoder architectures with attention mechanisms, for sequence-to-sequence (seq2seq) learning.

Data

Data was collected from three sources:

  1. A large corpus of around 305,000 movie dialogue lines from the Cornell Movie-Dialogs Corpus: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
  2. Two digital novels sourced from Project Gutenberg: https://www.gutenberg.org/

i. BOLO The Cave Boy

ii. The Shoemaker

Preprocessing

We loaded and concatenated the two novels.

First, let us find all the special characters in the novel text and assess their relevance to our problem.

Based on an analysis of these special characters, the following cleanup was performed.
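As a rough illustration, a minimal cleanup along these lines could look like the sketch below. The exact character set and substitution rules here are assumptions, not the original implementation.

```python
import re

def clean_text(text):
    """Minimal cleanup sketch: normalize quotes and drop stray symbols (illustrative rules)."""
    text = text.lower()
    text = re.sub(r"[\u2018\u2019]", "'", text)          # curly single quotes -> apostrophe
    text = re.sub(r"[\u201c\u201d]", '"', text)          # curly double quotes -> straight quotes
    text = re.sub(r"[-_*#@/\\()\[\]{}<>]", " ", text)    # drop structural symbols
    text = re.sub(r"\s+", " ", text)                      # collapse repeated whitespace
    return text.strip()
```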

Now, let us load and process the movie dialogues.

Data Preparation

We split the preprocessed text from the movie corpus into sentences.

Word Count Thresholds

The median word count is around 6 words per sentence, with a considerable number of outliers beyond 20 words. The distribution peaks at around 7 words per sentence.

To keep sentences short and meaningful, we retain only sentences with 3 to 20 words.
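A minimal sketch of the splitting and filtering step, assuming NLTK's sentence tokenizer (the original may use a different splitter):

```python
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

text = "I saw him yesterday. He was fine. Ok."   # placeholder for the movie-corpus text
sentences = sent_tokenize(text)

# Keep only sentences with 3 to 20 words, per the thresholds above
filtered = [s for s in sentences if 3 <= len(s.split()) <= 20]
print(filtered)  # ['I saw him yesterday.', 'He was fine.']
```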

Introduce Random Perturbations

We introduce the following grammatical errors into the grammatically correct conversational English sentences (a sketch of how such substitutions can be injected follows the lists below).

Phonetically similar words:

  1. Replace ‘there’ with ‘their’
  2. Replace ‘then’ with ‘than’
  3. Replace ‘their’ with ‘there’
  4. Replace ‘than’ with ‘then’

Indefinite Articles

  1. Remove ‘a’
  2. With 50% probability, replace ‘a’ with ‘an’.
  3. Remove ‘an’
  4. With 50% probability, replace ‘an’ with ‘a’.

Verb Contractions

  1. Remove: ‘ve
  2. Remove: ‘re
  3. Remove: ‘ll
  4. Remove: ‘s

Definite Article

  1. Remove ‘the’

Auxiliary Verbs

  1. Replace ‘is’ with ‘are’
  2. Replace ‘were’ with ‘was’
  3. Replace ‘are’ with ‘is’

Common Grammatical Mistakes

  1. Replace ‘didn’t have’ with ‘didn’t had’
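As mentioned above, a hedged sketch of how such perturbations can be injected is shown here. The rule subset and probabilities are illustrative only, not the full original rule set.

```python
import random

def perturb(sentence):
    """Introduce grammatical errors into a correct sentence (illustrative subset of the rules)."""
    out = []
    for w in sentence.split():
        if w == "there":
            out.append("their")            # phonetically similar word swap
        elif w == "then":
            out.append("than")
        elif w == "a":
            if random.random() < 0.5:      # drop the indefinite article half the time
                continue
            out.append("an")               # otherwise swap it for the wrong article
        elif w == "is":
            out.append("are")              # auxiliary verb confusion
        else:
            out.append(w)
    return " ".join(out)

print(perturb("there is a dog in the park"))  # e.g. "their are an dog in the park"
```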

These ‘perturbed’ sentences will be used to train the model to predict their corresponding ‘correct’ sentences.

Below is a sample of the perturbations introduced.

We added <start> and <end> tags to signal the start and end of sequences.

Tokenization

We fit tokenizers on the perturbed and correct inputs.
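A sketch of this step with the Keras Tokenizer; the filter settings, sample pairs, and maximum length here are assumptions:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Example perturbed/correct pairs, already wrapped with <start>/<end> tags
perturbed = ["<start> their is an dog <end>"]
correct = ["<start> there is a dog <end>"]

# filters='' so the '<' and '>' in the tags are not stripped
tok_in = Tokenizer(filters='', oov_token='<unk>')
tok_out = Tokenizer(filters='', oov_token='<unk>')
tok_in.fit_on_texts(perturbed)
tok_out.fit_on_texts(correct)

# Pad to a fixed length (20 words plus the two tags, as an assumption)
enc_in = pad_sequences(tok_in.texts_to_sequences(perturbed), maxlen=22, padding='post')
dec_in = pad_sequences(tok_out.texts_to_sequences(correct), maxlen=22, padding='post')
```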

Embedding Matrix Creation using GloVe-300
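A common way to build such a matrix is sketched below. The GloVe file name and the toy word index are assumptions; words missing from GloVe are left as zero vectors.

```python
import numpy as np

embedding_dim = 300
word_index = {"<start>": 1, "<end>": 2, "there": 3, "is": 4, "a": 5, "dog": 6}  # from the fitted tokenizer

# Load GloVe vectors (assumes the standard glove.6B.300d.txt file is available locally)
glove = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, idx in word_index.items():
    vec = glove.get(word)
    if vec is not None:
        embedding_matrix[idx] = vec  # words missing from GloVe stay as zero vectors
```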

Data Pipeline Creation

We created a data generator (data loader) with Keras to feed batches to the model during training.
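A minimal sketch of such a generator, assuming the keras.utils.Sequence interface (the class and variable names are illustrative):

```python
import numpy as np
from tensorflow.keras.utils import Sequence

class CorrectorDataGenerator(Sequence):
    """Sketch of a batch generator for (perturbed, correct) sequence pairs."""
    def __init__(self, encoder_input, decoder_input, decoder_output, batch_size=64):
        self.encoder_input = encoder_input
        self.decoder_input = decoder_input
        self.decoder_output = decoder_output
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.encoder_input) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # Teacher forcing: decoder input is the correct sentence, target is the same shifted by one step
        return [self.encoder_input[sl], self.decoder_input[sl]], self.decoder_output[sl]
```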

Modelling

Choice of Metrics

We will use the BLEU score to evaluate and compare our models. BLEU ranges from 0 to 1, where 0 is the worst and 1 the best case.

BLEU Score = Brevity Penalty × Geometric Average of n-gram Precision Scores

The Brevity Penalty penalizes candidate sentences that are too short. The score is averaged over a set of sentences; it correlates with human evaluation and is easy to interpret.
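For reference, the NLTK implementation (listed in the references) computes a sentence-level BLEU score like this; the smoothing choice here is an assumption:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["there", "is", "a", "dog", "in", "the", "park"]]   # the correct sentence
candidate = ["there", "is", "a", "dog", "in", "park"]            # the model's correction

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))  # closer to 1 means a better correction
```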

1. Encoder-Decoder

Architecture

The input sequence is summarized in a context vector (the encoder LSTM's final hidden and cell states); the per-step encoder outputs are ignored. The decoder LSTM's initial states are set to the encoder's final states, and the decoder then generates the output sequence, predicting tokens until it emits the END token. At each step, the LSTM cell is fed the hidden state from the previous step and in turn produces its own hidden state along with an output. The final word probabilities are produced by a softmax layer.

We wrote a custom implementation of the encoder-decoder layers, based on the architecture variation shown below.
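The sketch below shows this wiring in Keras. The layer sizes, vocabulary sizes, and optimizer are illustrative assumptions, not the original configuration.

```python
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model

latent_dim, vocab_in, vocab_out, emb_dim = 256, 10000, 10000, 300  # illustrative sizes

# Encoder: only the final hidden/cell states are kept as the context vector
enc_inputs = Input(shape=(None,))
enc_emb = Embedding(vocab_in, emb_dim)(enc_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: initialized with the encoder states, predicts the next token at each step
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(vocab_out, emb_dim)(dec_inputs)
dec_out, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
outputs = Dense(vocab_out, activation="softmax")(dec_out)

model = Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```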

Image courtesy: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

After fitting the model on the training data, below are a few sample predictions from the encoder-decoder model. The predictions were made with a greedy search: at each step we pick the word with the highest probability from the softmax output.
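In code, the greedy step itself is just an argmax over the softmax distribution at each decoder step; the toy vocabulary and probabilities below are illustrative:

```python
import numpy as np

def greedy_decode_step(softmax_probs, index_to_word):
    """Greedy search: at each decoder step, pick the word with the highest softmax probability."""
    return [index_to_word[int(np.argmax(step))] for step in softmax_probs]

# Toy example: two decoder steps over a four-word vocabulary
index_to_word = {0: "<pad>", 1: "there", 2: "their", 3: "<end>"}
probs = np.array([[0.10, 0.70, 0.10, 0.10],
                  [0.05, 0.05, 0.10, 0.80]])
print(greedy_decode_step(probs, index_to_word))  # ['there', '<end>']
```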

We see that while the model did well with smaller and simpler sentences, it did not do a good job with longer and more complicated ones.

This is a known limitation of the plain encoder-decoder model: for long sentences, the fixed-size context vector fails to capture the full content of the input, leading to poor translation, or in our case, correction.

To address this, we will try an encoder-decoder model with an attention mechanism.

2. Encoder-Decoder with Attention

In the vanilla seq2seq model we saw above, translation accuracy degrades as the input sentences get longer.

To overcome this, we use the attention mechanism, which mimics how humans translate: attending to one part of the input at a time.

Bahdanau Attention Architecture

With attention, the decoder receives information from all of the encoder's hidden states (a weighted average of them), not only the last state as in the vanilla seq2seq model.

The encoder is a bidirectional LSTM, since a word in the output may depend on words both before and after it in the input sequence.

The decoder layer is a unidirectional LSTM.

The input to each decoder step is a weighted sum of the encoder outputs (the context vector).

Tx is a hyperparameter that decides how many input words are attended to at a time, rather than the whole input sequence.

α (alpha) controls how much attention is given to each encoder output; the alpha values sum to 1.

where ‘a’ is an alignment function, learned during training as a simple feed-forward neural network.
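For reference, these relations in their standard form, as given in Bahdanau et al. (reference 5), are:

```latex
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
\qquad
e_{ij} = a(s_{i-1}, h_j)
```

Here c_i is the context vector at decoder step i, h_j are the encoder hidden states, s_{i-1} is the previous decoder state, and a is the learned feed-forward alignment function.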

Below is the control flow of an encoder-decoder model with attention:

We implemented this as a custom Keras Layer class.

Global Attention

The Bahdanau attention was implemented using the Global attention mechanism from Luong et al. 2015.

Global attentional model — at each time step

We used the ‘dot’ scoring function to measure the similarity between the decoder hidden state at the current time step and the encoder outputs; this determines which part of the input sequence receives attention.
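A minimal sketch of dot-score global attention as a custom Keras layer (class and variable names are illustrative, and the decoder and encoder states are assumed to have the same dimensionality, as the dot score requires):

```python
import tensorflow as tf

class DotAttention(tf.keras.layers.Layer):
    """Sketch of Luong-style 'dot' global attention as a custom Keras layer."""
    def call(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch, units); encoder_outputs: (batch, Tx, units)
        scores = tf.matmul(encoder_outputs, tf.expand_dims(decoder_hidden, -1))   # (batch, Tx, 1)
        alphas = tf.nn.softmax(scores, axis=1)              # attention weights, sum to 1 over Tx
        context = tf.reduce_sum(alphas * encoder_outputs, axis=1)  # weighted sum: (batch, units)
        return context, alphas
```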

Scoring function options

We defined a custom categorical cross-entropy loss function that does not count the loss for padded zeros.
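A sketch of such a masked loss, assuming sparse integer targets where index 0 is the padding token:

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)

def masked_loss(y_true, y_pred):
    """Cross-entropy that ignores time steps where the target is a padded zero."""
    loss = loss_object(y_true, y_pred)                      # per-token loss: (batch, T)
    mask = tf.cast(tf.not_equal(y_true, 0), loss.dtype)     # 0 = padding token
    loss = loss * mask
    return tf.reduce_sum(loss) / tf.maximum(tf.reduce_sum(mask), 1.0)
```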

We fit the model on the training data. The validation loss settled at around 0.15 by the end of 40 epochs.

Attention Plot

We visualized the alpha values for a few sentence corrections to see which input words most strongly influenced each output word.
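Such a plot can be produced along these lines (an illustrative matplotlib sketch, not the original plotting code):

```python
import matplotlib.pyplot as plt

def plot_attention(alphas, input_words, output_words):
    """Heat map of attention weights: rows = predicted words, columns = input words."""
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.matshow(alphas, cmap="viridis")
    ax.set_xticks(range(len(input_words)))
    ax.set_xticklabels(input_words, rotation=90)
    ax.set_yticks(range(len(output_words)))
    ax.set_yticklabels(output_words)
    plt.show()
```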

Below is a sample output:

Let us view some sample predictions:

And below are some of the fails:

In general, the errors were due to special or rare words not available in the embedding, and to confusion between ‘a’ and ‘the’, which is quite expected in the current scope without broader sentence context.

Summary

Below is a snapshot of the BLEU score performance of the vanilla seq2seq encoder-decoder model and the attention-based model, evaluated on the same random sample of 1,000 input sequences.

Validation BLEU Scores

As evident from the predicted samples and the BLEU scores, the model with the attention mechanism has far outperformed the vanilla model.

We will use that for deployment.

Deployment

The Attention based model was deployed using Streamlit. Below is a short demo of the deployed model (no audio).

The complete code can be found on GitHub:

References

  1. https://www.appliedaicourse.com/
  2. http://atpaino.com/2017/01/03/deep-text-correcter.html
  3. https://arxiv.org/pdf/1409.3215.pdf
  4. https://arxiv.org/pdf/1406.1078.pdf
  5. https://arxiv.org/pdf/1409.0473.pdf
  6. https://arxiv.org/pdf/1508.04025.pdf
  7. https://www.nltk.org/_modules/nltk/translate/bleu_score.html
  8. https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
  9. https://www.englishclub.com/grammar/rules.htm

Connect: LinkedIn

Title Image Credit: https://unsplash.com/@stbuccia
