How do you transform text into data that machine learning algorithms can consume? Here are a few ideas and techniques.

Counting words (CountVectorizer)

Given a piece of text, collect all the distinct words and assign a different index to each. This gives a vocabulary (dictionary) through which you can map any new text to a vector of word counts.
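
A minimal sketch using scikit-learn's CountVectorizer (the two-sentence corpus is invented for illustration; older scikit-learn versions use get_feature_names() instead of get_feature_names_out()):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["the cat sat on the mat", "the mat sat on the cat"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)        # sparse matrix of word counts
    print(vectorizer.get_feature_names_out())   # the learned vocabulary
    print(X.toarray())                          # both rows are identical: word order is lost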

Consequences:

  • any permutation of words in a given sentence leads to the same vector
  • the number of times the same word appears increases the weight of the corresponding entry
  • a word not in the vocabulary does not contribute

TfIdf

TfIdf stands for term frequency–inverse document frequency. A term gets a weight proportional to how often it occurs in a document, but this weight is diminished if the term appears in many documents. For example, the string “the cupidity of man” contains the common words “the” and “of”, so their importance is in general less than that of the infrequent word “cupidity”. The raw word counts are thus rescaled according to global properties of the whole corpus.
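
A minimal sketch using scikit-learn's TfidfVectorizer on an invented three-document corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the cupidity of man",
        "the man and the dog",
        "the house of the dog",
    ]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    # words occurring in many documents ("the", "of") get lower weights
    # than a word occurring in only one document ("cupidity")
    print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0])))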

Consequences:

  • just like the count vectorizer, permutations do not affect the vector
  • the weight of ‘important’ words is taken into account
  • a word not in the vocabulary does not contribute

Hashing

With the previous approaches, the more words you consider, the bigger the vector becomes, and the vectors also become increasingly sparse. The hashing approach fixes the vector length up front and uses a hash function to map every word into that same fixed-length vector. In that it produces fixed-length vectors it is reminiscent of the Word2Vec approach, but it is much less sophisticated.
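
A minimal sketch using scikit-learn's HashingVectorizer; the output dimensionality is fixed by n_features (here 2 ** 18, the value mentioned in the cons below), independent of how many distinct words the corpus contains:

    from sklearn.feature_extraction.text import HashingVectorizer

    corpus = ["the cupidity of man", "an entirely different sentence"]
    vectorizer = HashingVectorizer(n_features=2 ** 18)
    X = vectorizer.fit_transform(corpus)   # no vocabulary is stored anywhere
    print(X.shape)                         # (2, 262144), regardless of vocabulary size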

Advantages:

  • it is very low-memory and scalable to large datasets, as there is no need to store a vocabulary dictionary in memory
  • it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
  • it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

Cons:

  • there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
  • no IDF weighting as this would render the transformer stateful.

Inside Keras there is also a hashing utility which works well.
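
As a sketch, assuming your Keras/TensorFlow version still ships the keras.preprocessing.text module (newer releases expose a Hashing layer instead), its hashing_trick utility maps each word to an integer index in a fixed range:

    from tensorflow.keras.preprocessing.text import hashing_trick

    # each word is hashed to an integer in [1, 50); collisions are possible here too
    indices = hashing_trick("the cupidity of man", 50)
    print(indices)   # a list of four integers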

Word2Vec

This method uses a shallow neural network to learn word vectors that are positioned in the vector space such that words sharing common contexts in the corpus lie close to one another. Of the techniques listed here, it is the most subtle one.
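
A minimal sketch using gensim (assuming gensim 4.x, where the dimensionality parameter is called vector_size; the toy corpus is far too small for meaningful embeddings):

    from gensim.models import Word2Vec

    # each sentence is a list of tokens; a real corpus would contain millions of words
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "log"],
        ["cats", "and", "dogs", "are", "animals"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
    print(model.wv["cat"])                # the 50-dimensional vector for "cat"
    print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space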

Pro:

  • perhaps the first scalable model that generated word embeddings for large corpora (millions of unique words): feed the model raw text and it outputs word vectors
  • it takes context into account

Cons:

  • word senses are not captured separately: a word like “cell”, which could mean “prison cell”, “biological cell” or “phone”, is represented by a single vector

One hot

If you use Keras to train your networks, you can use some of its text utilities as well. The one-hot encoder, for example, is sometimes useful to encode text as a vector.
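
A sketch using the one_hot utility from keras.preprocessing.text (despite its name it returns hashed integer word indices rather than a full one-hot matrix, and its availability depends on your Keras/TensorFlow version):

    from tensorflow.keras.preprocessing.text import one_hot

    # each word becomes an integer index in [1, 50)
    encoded = one_hot("the cupidity of man", 50)
    print(encoded)   # a list of four integers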

Keras text to matrix

Finally, if you want the shortest path between textual data and network input, you can use the text-to-matrix utility in Keras to quickly get vectors.
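
A minimal sketch using the Tokenizer class from keras.preprocessing.text (Keras 2 / older tf.keras) and its texts_to_matrix method; the mode argument also accepts "count", "freq" and "tfidf":

    from tensorflow.keras.preprocessing.text import Tokenizer

    corpus = ["the cupidity of man", "the man and the dog"]
    tokenizer = Tokenizer(num_words=100)
    tokenizer.fit_on_texts(corpus)
    matrix = tokenizer.texts_to_matrix(corpus, mode="binary")
    print(matrix.shape)   # (2, 100): one fixed-length row per document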