Gmail Smart Compose

Harshavardhan Nomula
5 min read · Aug 25, 2021
Source: androidpolice.com

Table of Contents

  • Business Problem
  • Dataset Overview
  • Mapping the real-world problem to an ML problem
  • Performance Metrics
  • Data Preprocessing
  • Retrieve Structured Data
  • Modeling
  • Results
  • Deployment
  • Future work
  • References

1. Business Problem

Billions of emails are written back and forth every month, and every one of them starts with an empty compose box. Drafting an email takes time, and this is where Gmail Smart Compose comes into the picture: it uses a machine learning model to help users type faster and save time.

This case study builds a model that provides interactive, real-time suggestions to users, which in turn reduces repetitive typing.

2. Dataset Overview

We have been provided an emails.csv file, which consists of user-composed emails. Below are the details of its attributes.

  • file: a unique ID for each email
  • message: the raw text of the email, including the subject, the body, the date with timestamp, the sender and receiver email addresses, cc and bcc.

Shape of the dataset: (517401, 2)

The data comes from the Enron Email Dataset, which contains over 500,000 emails.

3. Mapping the real-world problem to an ML problem

Here we have to predict the next sequence of words instantly, based on the prior sequence of words entered by the user. This makes it a sequence-to-sequence machine learning problem.

Since real-time suggestions need to appear as soon as the user starts typing, the major constraint is latency, which should be as low as possible.

4. Performance Metrics

Let’s use log perplexity as the metric for this sequence-to-sequence modeling problem, since it measures how well the model predicts a sample. The lower the log perplexity, the higher the probability of predicting the correct sequence of words.
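As a quick illustration (a sketch, not code from the project itself), log perplexity is simply the average negative log-likelihood the model assigns to the correct next tokens; exponentiating it gives the usual perplexity. The probabilities below are made-up values.

import numpy as np

def log_perplexity(token_probs):
    """Average negative log-likelihood of the true next tokens."""
    return -np.mean(np.log(token_probs))

# Hypothetical probabilities the model assigned to the correct next word
# at each time step of one sentence.
probs = [0.40, 0.25, 0.60, 0.10]

print(log_perplexity(probs))           # lower is better
print(np.exp(log_perplexity(probs)))   # corresponding perplexity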

5. Data Preprocessing

Data preprocessing is an important step that transforms the data into a clean, understandable format, free of errors and duplicates.

  • Separated the body text from the rest of the email and tokenized it into words, removing punctuation such as ‘?’ and ‘!’.
  • Removed special characters such as new lines, HTML tags and extra white spaces.
  • Removed quotes and converted all characters to lower case.
  • Kept only sentences with length less than 20.

Let’s see the code for the data preprocessing:
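A minimal sketch of these steps, assuming the raw messages live in the message column of emails.csv; the helper names and exact regular expressions are illustrative, not the original gist.

import re
import pandas as pd
from email.parser import Parser

def extract_body(raw_message):
    """Pull the plain-text body out of a raw Enron email string."""
    return Parser().parsestr(raw_message).get_payload()

def clean_text(text):
    """Lowercase, strip HTML tags, quotes, punctuation and extra whitespace."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)        # html tags
    text = re.sub(r'[\r\n]+', ' ', text)        # new lines
    text = re.sub(r'["\']', '', text)           # quotes
    text = re.sub(r'[?!.,;:()]', ' ', text)     # punctuation
    return re.sub(r'\s+', ' ', text).strip()    # extra white spaces

df = pd.read_csv('emails.csv')
df['body'] = df['message'].apply(extract_body).apply(clean_text)
df = df[df['body'].str.split().str.len() < 20]  # keep only short sentences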

6. Retrieve Structured Data

We train the Seq2Seq model to generate the remainder of a sentence by splitting each input sentence into multiple (input, target) pairs. For example, consider the sentence below and its split-ups.

Sentence: here is our Forecast

To structure the sentence as above, we add unique separators such as <start> and <end> to differentiate the input pairs. Please go through the below code for your reference:
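A sketch of how such (input, target) pairs can be generated; exactly where the separators go may differ slightly from the original gist.

def make_pairs(sentence):
    """Split a sentence into (input prefix, target suffix) pairs and
    wrap each target with the <start>/<end> separators."""
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        encoder_input = ' '.join(words[:i])
        decoder_target = '<start> ' + ' '.join(words[i:]) + ' <end>'
        pairs.append((encoder_input, decoder_target))
    return pairs

for inp, out in make_pairs('here is our forecast'):
    print(inp, '->', out)
# here -> <start> is our forecast <end>
# here is -> <start> our forecast <end>
# here is our -> <start> forecast <end>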

Tokenization and padding are performed with the Keras tokenizer to transform the sentence pairs above into integer sequences of the same length. The sentences are also shuffled to avoid a contiguous order of sentences.
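A sketch of this step with the Keras Tokenizer; the texts below are the toy pairs from the previous snippet, and filters='' keeps the <start>/<end> tokens intact.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.utils import shuffle

encoder_texts = ['here', 'here is', 'here is our']
decoder_texts = ['<start> is our forecast <end>',
                 '<start> our forecast <end>',
                 '<start> forecast <end>']

tokenizer = Tokenizer(filters='')                 # keep <start> and <end> intact
tokenizer.fit_on_texts(encoder_texts + decoder_texts)

encoder_seqs = pad_sequences(tokenizer.texts_to_sequences(encoder_texts), padding='post')
decoder_seqs = pad_sequences(tokenizer.texts_to_sequences(decoder_texts), padding='post')

# shuffle to break the contiguous order of sentences
encoder_seqs, decoder_seqs = shuffle(encoder_seqs, decoder_seqs, random_state=42)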

7. Modeling

Let’s have a look at the architecture of the sequence-to-sequence model used for training.

Here is the code for the architecture:
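Below is a sketch of such an encoder-decoder model in Keras; the embedding size, number of GRU units and dropout rate are my own placeholder choices, not necessarily those used in the project.

from tensorflow.keras import layers, Model

VOCAB_SIZE = len(tokenizer.word_index) + 1   # vocabulary from the tokenizer above
EMB_DIM = 128                                # assumed embedding size
UNITS = 256                                  # assumed GRU units

# Encoder: embedding + bidirectional GRU; forward and backward hidden
# states are concatenated to form the encoded hidden state.
enc_inputs = layers.Input(shape=(None,), name='encoder_inputs')
enc_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(enc_inputs)
_, fwd_state, bwd_state = layers.Bidirectional(layers.GRU(UNITS, return_state=True))(enc_emb)
enc_state = layers.Concatenate()([fwd_state, bwd_state])

# Decoder: embedding + GRU initialised with the encoded hidden state.
dec_inputs = layers.Input(shape=(None,), name='decoder_inputs')
dec_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(dec_inputs)
dec_out, _ = layers.GRU(2 * UNITS, return_sequences=True, return_state=True)(
    dec_emb, initial_state=enc_state)
dec_out = layers.Dropout(0.3)(dec_out)                         # dropout against overfitting
outputs = layers.Dense(VOCAB_SIZE, activation='softmax')(dec_out)

model = Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')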

This is an encoder-decoder model that internally uses GRUs to encode the input sentence.

The encoder consists of bidirectional gated recurrent units (Bi-GRUs). GRUs are chosen because they encode sentences better than simple RNNs while having a less complex mechanism than LSTMs: a GRU keeps only one state (the hidden state), whereas an LSTM has both a hidden state and a cell state. Since we use Bi-GRUs, the forward and backward hidden states are concatenated to produce the encoded hidden state.

During training, we pass the <start> token as input, and the generated output at each time step would normally be fed as input to the next time step, and so on. But there is a high chance that the model’s output differs from the actual output. To prevent this, while training we pass the actual (ground-truth) output and the encoded hidden state as inputs to the decoder, as shown below; this is known as teacher forcing.
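In code, teacher forcing amounts to feeding the target sequence shifted by one position: the decoder input is <start> w1 w2 ... and the target it must predict is w1 w2 ... <end>. Continuing the sketches above (batch size, epoch count and validation split are placeholders):

# Teacher forcing: the decoder always sees the ground-truth previous word
# and is trained to predict the word that follows it.
decoder_input_data = decoder_seqs[:, :-1]    # <start> w1 w2 ...
decoder_target_data = decoder_seqs[:, 1:]    # w1 w2 ... <end>

model.fit([encoder_seqs, decoder_input_data],
          decoder_target_data,
          batch_size=64, epochs=10, validation_split=0.1)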

Dropout is added to avoid overfitting, and the final Dense layer predicts the integer index of the next word, which is mapped back through the vocabulary to produce the output word. After training for many epochs, the model converged with a training log perplexity of 1.6229 and a validation log perplexity of 1.5287.

At inference time, the model uses the same encoder to encode the sentence, but the decoder no longer receives the actual output as input.

Encoder architecture at inference:

Inference Encoder

Decoder architecture at inference: this decoder uses the encoded hidden state and the word predicted at the previous time step as inputs to the current time step.

Inference Decoder
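Putting the two together, a greedy decoding loop might look like the sketch below; encoder_model and decoder_model stand for the separate inference-time encoder and decoder built from the trained layers, whose exact construction is not reproduced here.

import numpy as np

def greedy_decode(sentence, max_len=20):
    """Encode the typed prefix once, then repeatedly feed the previously
    predicted word (starting from <start>) back into the decoder."""
    seq = pad_sequences(tokenizer.texts_to_sequences([sentence]), padding='post')
    state = encoder_model.predict(seq)                    # encoded hidden state

    target = np.array([[tokenizer.word_index['<start>']]])
    decoded = []
    for _ in range(max_len):
        probs, state = decoder_model.predict([target, state])
        word_id = int(np.argmax(probs[0, -1, :]))
        word = tokenizer.index_word.get(word_id, '')
        if word == '<end>':
            break
        decoded.append(word)
        target = np.array([[word_id]])                    # feed the prediction back in
    return ' '.join(decoded)

# print(greedy_decode('here is'))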

8. Results

These are a few predicted sentences for the input sentences listed below.

Input sentences list:

  • ‘here is’
  • ‘please find my’
  • ‘let me know’
  • ‘please find my resume’
  • ‘can we’
  • ‘As discussed can we connect this’
  • ‘I will finalize and submit the project in’

Corresponding predicted/output sentences list:

  • ‘the latest investor list.’
  • ‘weekly report for ets’
  • ‘if you have any questions. thanks’
  • ‘for the attached ngi chicago employees.’
  • ‘send me the latest gtc’s for the game. thanks russell’
  • ‘week.’
  • ‘the interstate pipelines?’

9. Deployment

The model is deployed on my local system using a Flask API: it takes the first part of a sentence (a few words) as input and predicts the subsequent words of the sentence as output.
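A minimal sketch of such a Flask endpoint; the route name and JSON payload are illustrative, and greedy_decode is the inference function sketched in the previous section.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    """Take the first few words of a sentence and return the predicted completion."""
    prefix = request.get_json().get('text', '')
    suggestion = greedy_decode(prefix)
    return jsonify({'input': prefix, 'suggestion': suggestion})

if __name__ == '__main__':
    app.run(debug=True)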

10. Future work

  • Extending the existing model to a Transformer-based model might help predict longer sentences.

11. References
