Natural Language Processing (NLP) refers to tools and methods for exploring text data, identifying patterns, and making predictions.
This article is based on Google’s view of the evolution of NLP.
Types of NLP models
| NLP model type | Example use case |
| --- | --- |
|  | Text summarization or classification |
Word2vec starts the game for NLP neural networks
Word2Vec was developed by Google in 2013. Its goal is to learn an embedding matrix of size
[words] x [dimensions]. To put it simply, this means creating a numerical representation for each word.
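As a rough illustration of what such an embedding matrix looks like before training, here is a minimal sketch (the vocabulary, dimensionality, and random initialisation are illustrative assumptions; real training, e.g. Word2Vec, would adjust the values so that similar words end up with similar vectors):

```python
import random

def build_embeddings(vocab, dims=4, seed=0):
    # Randomly initialised [words] x [dimensions] embedding matrix,
    # stored as a dict from word to vector. Training would update these.
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dims)] for w in vocab}

emb = build_embeddings(["cat", "dog", "car"], dims=4)
```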
Two common approaches to produce the vectors:
| Approach | Training objective |
| --- | --- |
| Continuous Bag of Words (CBOW) | Predict the center word given the surrounding words |
| Skip-gram | Predict the surrounding words given the center word |
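The second objective (predicting surrounding words from the center word) starts from (center, context) training pairs. A minimal sketch of how those pairs are generated from a tokenized sentence, assuming a simple symmetric window:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for the Skip-gram objective."""
    pairs = []
    for i, center in enumerate(tokens):
        # every word within `window` positions of the center is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
```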
Encoder-Decoder in NLP
Language translation is a well-known NLP application of the Encoder-Decoder architecture. The encoder compresses the input sentence into a vector representation, and the decoder generates the translated sentence from it. In the last network layer, the
sampled softmax function computes probabilities efficiently even for large word vocabularies.
Beam search provides an efficient way to find a likely output sequence from an Encoder-Decoder model: at each decoding step it keeps only the top-scoring candidate sequences instead of exploring all of them.
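Beam search can be sketched in a few lines. Here `next_probs` is a hypothetical stand-in for the decoder’s next-word distribution; in a real system it would come from the trained model:

```python
import math

def beam_search(next_probs, start, beam_width=2, steps=3):
    """Keep the beam_width most probable partial sequences at each step.

    next_probs(word) returns a dict {next_word: probability}; this toy
    interface stands in for the decoder of an Encoder-Decoder model.
    """
    beams = [([start], 0.0)]  # (sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for word, p in next_probs(seq[-1]).items():
                candidates.append((seq + [word], score + math.log(p)))
        # prune: keep only the best beam_width candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```

With `beam_width=1` this degenerates to greedy decoding; wider beams trade compute for a better chance of finding the globally most probable sequence.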
The attention mechanism is an extension of the Recurrent Neural Network (RNN) based Encoder-Decoder architecture. Among other benefits, it improves Encoder-Decoder models for language translation by solving the problem of squeezing the whole input into a single fixed-length vector.
As the name suggests, more attention can be paid to relevant words thanks to context vectors.
The main differences from a traditional Encoder-Decoder:
- The encoder passes more data to the decoder: a hidden state for every input word, not just the final one
- An extra step before each output word, where the hidden states are scored and combined into a context vector
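That extra step can be sketched with dot-product scoring (a simplification; the original attention papers use learned alignment functions, and the vectors below are toy values):

```python
import math

def attention_context(decoder_state, encoder_states):
    """Context vector = attention-weighted sum of encoder hidden states."""
    # score each encoder hidden state against the decoder state (dot product)
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # softmax over the scores gives the attention weights
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # weighted sum of the encoder states is the context vector
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights
```

Encoder states that align well with the current decoder state get larger weights, which is how "more attention is paid" to the relevant input words.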
Transformers in NLP
The Transformer is a sequence-to-sequence architecture that can process the whole input at once, unlike RNNs, which read tokens one at a time.
BERT (Bidirectional Encoder Representations from Transformers) was developed by Google in 2018 and powers Google Search. It was trained on Wikipedia and BookCorpus.
BERT can handle long input texts and works for both sentence-level and token-level tasks. Unlike context-free embeddings such as Word2vec, BERT takes the order of the words into account.
BERT can be fine-tuned from a pre-trained model. Some use cases are:
- Single sentence classification
- Sentence pair classification
- Question answering
- Single sentence tagging
Prerequisites for using pre-trained BERT:
- Input sentence must be converted to a single packed vector
- A new output layer must be added
- Optionally fine-tune the model parameters
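The first prerequisite, packing the input into a single vector, follows BERT’s input convention: a `[CLS]` token, the first sentence, a `[SEP]` token, and (for sentence pairs) the second sentence and a final `[SEP]`, with segment ids marking which sentence each token belongs to. A simplified sketch (real BERT also uses WordPiece subword tokenization, which is omitted here):

```python
def pack_pair(tokens_a, tokens_b):
    """Pack two tokenized sentences into one BERT-style input sequence."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment id 0 for the first sentence (plus [CLS] and first [SEP]),
    # segment id 1 for the second sentence (plus its [SEP])
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids
```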
Multi-task learning means training a model to solve several kinds of problems. BERT uses two tasks for multi-task training: a masked language model, which masks 15% of the input words and predicts them, and next sentence prediction, which predicts whether one sentence follows another.
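The masking step of the masked language model can be sketched as follows (simplified: real BERT sometimes keeps the chosen token or substitutes a random one instead of always using `[MASK]`):

```python
import random

def mask_for_mlm(tokens, mask_rate=0.15, seed=0):
    """Replace roughly mask_rate of the tokens with [MASK] and record
    the original tokens as the prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok  # the model must predict this token
        else:
            masked.append(tok)
    return masked, targets
```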
Two versions of BERT: BERT-Base (12 layers) and BERT-Large (24 layers).
BERT can process sequences of at most 512 tokens.
BERT input embeddings are the sum of three parts: token embeddings, segment embeddings, and position embeddings.
Large language models
Large language models are considered to be so expensive to train that they are out of reach for individuals and even many companies.
Here are some examples:
NLP and time series problem similarities
Surprisingly, NLP has many similarities to time series problems. The most obvious is that both text and time series data are ordered. Machine sensors are a common source of time series data.
Long Short-Term Memory (LSTM) has proven its capabilities in NLP. Personally, I have only seen LSTM applied to time series data.
LSTM is an RNN with the ability to remember across the whole sequence thanks to its “long-term” memory. An LSTM cell has three gates: forget, remember (input), and output.
An LSTM cell’s inputs:
- Current input values (xt)
- Previous hidden state (ht-1)
- Previous cell state (ct-1)
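The three gates and the inputs above can be sketched for a single-unit cell (scalar weights for readability; real LSTMs use weight matrices over vector inputs, and the weight values below are illustrative assumptions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x, h_prev, c_prev, W):
    """One step of a single-unit LSTM cell.

    W maps gate name -> (input weight, hidden weight, bias).
    """
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # remember (input) gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate state
    c = f * c_prev + i * g       # new cell state: keep part of the old, add new
    h = o * math.tanh(c)         # new hidden state, gated output
    return h, c
```

The forget gate decides how much of `c_prev` survives, which is what lets the cell carry information across the whole series.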