Natural Language Processing (NLP) refers to tools and methods for exploring text data, identifying patterns, and making predictions.

This article is based on Google’s view of how NLP has evolved.

Types of NLP models

| NLP model type | Example use case |
| --- | --- |
| Sequence-to-one | Text summarization or classification |
| One-to-sequence | Text generation |
| Sequence-to-sequence | Language translation |

Word2vec starts the game for NLP neural networks

Word2vec was developed by Google in 2013. Its goal is to learn an embedding matrix of size [words] x [dimensions]. Put simply, this means creating a numerical representation for each word.

Two common approaches to produce the vectors:

| Word2vec approach | Explanation |
| --- | --- |
| Continuous Bag of Words (CBOW) | Predict the center word given the surrounding words |
| Skip-gram | Predict the surrounding words given the center word |
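To make the difference between the two approaches concrete, here is a minimal sketch of how training examples could be generated from a sentence. The function names and the window size of 1 are assumptions chosen for illustration, and no actual embedding training is shown.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """Return (context_words, center) pairs: the neighbors predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


sentence = ["the", "cat", "sat"]
print(skipgram_pairs(sentence))
print(cbow_pairs(sentence))
```

Skip-gram yields one training example per (center, neighbor) combination, while CBOW yields one example per center word with all of its neighbors as input.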

Encoder-Decoder in NLP

Language translation is a well-known NLP application for Encoder-Decoder architecture. This approach tries to

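In the last network layer, a softmax function computes a probability for every word in the vocabulary. For large vocabularies, sampled softmax reduces the training cost by evaluating only a sample of the words.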

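Beam search provides an efficient way to choose the output words of an Encoder-Decoder model. Instead of committing to the single most probable next word, it keeps a fixed number of the most probable partial sentences at each step.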
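The idea behind beam search can be sketched with a toy example. The `TOY_MODEL` table below is a hypothetical stand-in for a real decoder: it maps the last word to next-word probabilities, whereas a real model would condition on the whole prefix and the encoder output.

```python
import math

# Hypothetical toy "decoder": next-word probabilities given the last word.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.4, "</s>": 0.1},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(model, beam_width=2, max_len=5):
    # Each beam entry is (log-probability, list of words so far).
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            if words[-1] == "</s>":
                candidates.append((score, words))  # finished, carry over unchanged
                continue
            for word, prob in model[words[-1]].items():
                candidates.append((score + math.log(prob), words + [word]))
        # Keep only the best `beam_width` partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(words[-1] == "</s>" for _, words in beams):
            break
    return beams


best_score, best_words = beam_search(TOY_MODEL, beam_width=2)[0]
print(best_words, math.exp(best_score))
```

With a beam width of 1 the search is greedy and starts with "the", the most probable first word. With a beam width of 2 it also keeps "a" alive and ends up with a more probable complete sentence.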

Attention mechanism

The attention mechanism was introduced as an extension to Recurrent Neural Network (RNN) based Encoder-Decoder models. It improves language translation by, among other benefits, removing the need to squeeze the whole input sentence into a single fixed-length vector.

As the name suggests, more attention can be paid to relevant words thanks to context vectors.

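The main differences from a traditional Encoder-Decoder:

  • The Encoder passes more data to the Decoder, as each input word has its own hidden state
  • The Decoder takes an extra step before producing each output: it weights the Encoder hidden states to build a context vector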
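The extra step can be sketched as follows. This is a minimal illustration that assumes dot-product scoring between the decoder state and each encoder hidden state; other scoring functions exist, and the vectors here are made-up toy values.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    # Score each encoder hidden state against the decoder state (dot product).
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # Turn the scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one hidden state per input word
decoder_state = [2.0, 0.0]
weights, context = attention_context(decoder_state, encoder_states)
print(weights, context)
```

The first and third input words receive the largest weights because their hidden states align best with the decoder state, so they contribute most to the context vector.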

Transformers in NLP

The Transformer is a Sequence-to-Sequence architecture. Unlike RNNs, it can process the whole input at once.


BERT stands for Bidirectional Encoder Representations from Transformers. It was developed by Google in 2018 and powers Google Search. It is trained on Wikipedia and BookCorpus.

BERT can handle long input texts and works for both sentence-level and token-level tasks. It takes the order of the words into account through position embeddings.

BERT can be fine-tuned from a pre-trained model. Some use cases are:

  • Single sentence classification
  • Sentence pair classification
  • Question answering
  • Single sentence tagging

Prerequisites for using pre-trained BERT:

  • The input sentence (or sentence pair) must be converted to a single packed token sequence
  • A new output layer must be added
  • Optionally fine-tune the model parameters

Multi-task learning means training a model to solve several kinds of problems. BERT is pre-trained on two tasks. The first is masked language modeling, where 15% of the words are masked and the model has to predict them. The second is next sentence prediction, where the model decides whether one sentence follows another.
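The masking step can be sketched as below. This is a simplified illustration: the function name `mask_tokens` is hypothetical, and every selected word is replaced with `[MASK]`, whereas the original BERT recipe replaces only part of the selected words with `[MASK]` and swaps or keeps the rest.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide roughly `mask_rate` of the tokens and remember the originals."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # the model must predict the original token
        masked[pos] = "[MASK]"
    return masked, labels


tokens = [f"w{i}" for i in range(20)]  # placeholder words
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

During pre-training, the loss is computed only at the masked positions, using the stored original words as targets.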

Two versions of BERT:

| BERT version | Transformer layers | Hidden units | Parameters |
| --- | --- | --- | --- |
| BERT Base | 12 | 768 | 110M |
| BERT Large | 24 | 1024 | 340M |

BERT can process sequences of at most 512 tokens.

BERT input embeddings:

| BERT embedding | Representation |
| --- | --- |
| Token embeddings | Word |
| Segment embeddings | Sentence |
| Position embeddings | Word order |
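In BERT, the three embeddings are added together element-wise to form the input for each token. The sketch below illustrates this with random lookup tables and a toy hidden size of 4. The vocabulary, table names and values are assumptions for illustration; in the real model the tables are learned.

```python
import random

rng = random.Random(0)
HIDDEN = 4  # toy hidden size; BERT Base uses 768


def random_table(rows):
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]


vocab = {"[CLS]": 0, "[SEP]": 1, "hello": 2, "world": 3}
token_table = random_table(len(vocab))  # one row per vocabulary entry
segment_table = random_table(2)         # sentence A or sentence B
position_table = random_table(512)      # one row per position, up to 512


def embed(tokens, segment_ids):
    vectors = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        t, s, p = token_table[vocab[tok]], segment_table[seg], position_table[pos]
        # Input embedding = token + segment + position, element by element.
        vectors.append([a + b + c for a, b, c in zip(t, s, p)])
    return vectors


tokens = ["[CLS]", "hello", "[SEP]", "world", "[SEP]"]
segment_ids = [0, 0, 0, 1, 1]
embedded = embed(tokens, segment_ids)
print(len(embedded), len(embedded[0]))
```

The two `[SEP]` tokens share the same token embedding but end up with different input vectors, because their segment and position embeddings differ.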

Large language models

Large language models are considered to be so expensive to train that they are out of reach for individuals and even many companies.

Here are some examples:

  • BERT
  • T5
  • GPT-3
  • PaLM

NLP and time series problem similarities

Surprisingly, NLP has many similarities to time series problems. The most obvious one is that both text and time series data are ordered. Machine sensors are a common source of time series data.

Long Short-Term Memory (LSTM) has proven its capabilities in NLP. Personally, I have only seen LSTM applied to time series data.

LSTM is an RNN that can remember information across the whole time series thanks to its “long-term” memory. An LSTM cell has three gates: forget, input (remember) and output.

An LSTM cell input:

  • Current input values (x_t)
  • Hidden state from the previous time step (h_{t-1})
  • Cell state from the previous time step (c_{t-1})
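How the three inputs and the gates interact can be sketched with a single LSTM step. This is a toy version with a scalar input and a scalar state, using the standard LSTM equations. The `params` dictionary and its weight values are made up for illustration; a real layer uses learned weight matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input and scalar state (toy size)."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    i = gate("input", sigmoid)        # how much new information to store
    g = gate("candidate", math.tanh)  # candidate values to store
    o = gate("output", sigmoid)       # how much of the cell state to expose

    c = f * c_prev + i * g            # updated long-term memory (cell state)
    h = o * math.tanh(c)              # updated hidden state
    return h, c


# Made-up weights: (input weight, hidden weight, bias) for each gate.
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The cell state `c` is carried from step to step and is only changed through the forget and input gates, which is what lets the cell keep information over many time steps.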