This is a note I took while working at Kata.ai as a research scientist. I think it can be useful, so I want to share it here as a blog post.
Feature engineering is an integral part of NLP: we extract a numerical representation of text, which we call a feature, according to a set of (possibly linguistic) rules. These representations are fed into a learning algorithm or model whose task is to solve an NLP problem such as POS tagging or NER. The popularity of feature engineering, especially in academia, might be declining due to the emergence of deep learning, which promises to automatically extract a suitable representation for an NLP task. Even so, we should not disregard feature engineering, for two reasons:
- Feature engineering provides a way to incorporate linguistic structure into the model.
- A deep learning model requires a lot of training data, so feature engineering might work better in scenarios where the amount of available (annotated) data is limited.
Yoav Goldberg, one of the prominent researchers in NLP, makes an interesting remark in his book Neural Network Methods for Natural Language Processing regarding the dichotomy between feature engineering (which he refers to as manually designed linguistic properties) and deep learning (neural networks), which I quote below:
Some proponents of deep learning argue that such inferred, manually designed, linguistic properties are not needed, and that the neural network will learn these intermediate representations (or equivalent, or better ones) on its own. The jury is still out on this. My current personal belief is that many of these linguistic concepts can indeed be inferred by the network on its own if given enough data and perhaps a push in the right direction. However, for many other cases we do not have enough training data available for the task we care about, and in these cases providing the network with the more explicit general concepts can be very valuable. Even if we do have enough data, we may want to focus the network on certain aspects of the text and hint to it that it should ignore others, by providing the generalised concepts in addition to, or even instead of, the surface form of the words. Finally, even if we do not use these linguistic properties as input features, we may want to help guide the network by using them as additional supervision in a multi-task learning setup or by designing network architecture or training paradigms that are more suitable for learning certain linguistic phenomena. Overall, we see enough evidence that the use of linguistic concepts help improve language understanding and production systems.
Given that statement, I believe feature engineering can still be used to build representations for an NLP model, especially when training data is scarce.
In the construction of features, we can consider the following categorisation to guide us:
- Type of value
  - Indicator / binary → 1 if the feature exists in the text, 0 if not
    - 1 if the word hello appears at least once in the text, otherwise 0
  - Numeric → a count or a real-valued score
    - The count of the word hello in the text
    - The length of the sentence
    - The probability that the target word is a name
    - TF-IDF score
  - Categorical → the feature takes one value from a fixed set
    - The word in the vocabulary
    - The POS tag of a word
    - The category is often represented as an integer index, which under the hood is translated into a one-hot encoded vector: a vector with 1 at that index and 0 everywhere else. For example, given a vocabulary of [i, like, pizza, love, donut], the word pizza has the one-hot vector [0 0 1 0 0]
  - Distributional representation
    - Cluster-based → assign similar words the same cluster id
    - Embedding-based → a real-valued vector of length d that encodes semantic relations between words; similar words have similar vectors
- Level → how broad is the scope of the text you extract the feature from
  - Corpus level: words with frequency < 5 in the corpus are considered rare/UNK words
  - Document level: word relations from the dependency tree of a sentence
  - Token level: does the word contain digits?
- Source of the property
  - Directly observable properties
    - Properties of the word itself, such as containing digits, mixed case, etc.
  - Inferred linguistic properties
    - POS tag, semantic role, dependency tree
  - Named-list lookup
    - Does the word exist in a list of common English names?
    - Is the word a city name?
- Context → which tokens around the target are considered
  - Single token → only consider the target token/word
  - Window → consider the m previous and n next words
  - Absolute position of the word
- Standalone feature vs combination of features
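To make the value types above concrete, here is a minimal sketch in Python; the vocabulary and sentence are made-up examples:

```python
from collections import Counter

vocab = ["i", "like", "pizza", "love", "donut"]
tokens = "i love pizza and i like pizza".split()

# Indicator / binary: 1 if "pizza" appears at least once, 0 otherwise
has_pizza = int("pizza" in tokens)

# Numeric: the count of "pizza" and the sentence length
counts = Counter(tokens)
pizza_count = counts["pizza"]
sentence_length = len(tokens)

# Categorical: represent a word as an integer index, then as a one-hot vector
def one_hot(word, vocab):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(has_pizza, pizza_count, sentence_length)  # 1 2 7
print(one_hot("pizza", vocab))                  # [0, 0, 1, 0, 0]
```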
List of Features
This section lists features that I found in several works of literature, listed below. Most of these papers describe features used for NER, but I believe these features are useful for many kinds of tasks.
 O. Bender, F. J. Och, and H. Ney, “Maximum Entropy Models for Named Entity Recognition,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, Stroudsburg, PA, USA, 2003, pp. 148–151.
 H. L. Chieu and H. T. Ng, “Named Entity Recognition with a Maximum Entropy Approach,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, Stroudsburg, PA, USA, 2003, pp. 160–163.
 Y. Goldberg, Neural network methods for natural language processing. San Rafael: Morgan & Claypool Publishers, 2017. Chapter 6.2-7.
 D. Jurafsky and J. H. Martin, Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd ed. Upper Saddle River, N.J: Pearson Prentice Hall, 2009. Chapter 22.
 D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.
Table 1. List of features

| Feature | Description | Example | Note |
|---|---|---|---|
| Word | The target word | Does the word hello occur in the text? | Windowing is commonly used |
| n-gram | Whether an n-gram is found in the document | Does the bigram very good occur in the text? | |
| Stemmed/lemmatised word | Reduce the word to its stem or lemma | Walk, walked, and walking are all represented by walk | |
| Orthography | Combination of upper-cased and lower-cased letters | Is the word capitalised, upper-cased, or mixed-case (e.g. iPhone)? | |
| Punctuation | The use of punctuation | Ends with a period; has a period in between; has an apostrophe, ampersand, or hyphen | |
| Digits | Whether the word contains digits | Has digits; mixes digits and characters; digits with a date pattern; contains Roman numerals | |
| POS tag | The part-of-speech tag of the word | NOUN, PROPN, VERB, ADJ, etc. | Some literature uses a window |
| Length | Length of lexical items | Length of the word, sentence, etc. | |
| Position | The position of the word | Is the word in position 1 of the sentence? Is the word at the start, middle, or end of the sentence? | |
| List-of-entity lookup | Matches the target word against a list of entities of a single type, such as a list of names | Does the word exist in a list of common English names? Is the word a city name? | Common in NER; often called gazetteers or a dictionary. We might want to consider fuzzy search, counting a match when two items have an edit distance below a certain threshold |
| Dictionary lookup | Matches the target word against a lexicon such as a dictionary | Does the word exist in a medical dictionary? | Again, we might want to do a fuzzy search to account for variation |
| Prefix | Detects prefixes attached to the word | Does the word start with the prefix de-? | |
| Suffix | Detects suffixes attached to the word | Does the word end with the suffix -ed? | |
| Word frequency | Count the occurrences of a word in each document | | Sometimes referred to as bag-of-words |
| N-gram frequency | Count the occurrences of an n-gram in each document | | Sometimes called bag-of-n-grams |
| Normalised word / n-gram frequency | The count is normalised or weighted using some method | Term Frequency-Inverse Document Frequency (TF-IDF) | |
| Word shape | Replace upper-cased characters with X, lower-cased characters with x, and digits with 0 | EMNLP2017 becomes XXXXX0000; Bagas becomes Xxxxx | You can be creative with this rule; for example, you can replace years with a special shape Y |
| Summarised word shape | Like word shape, but collapses consecutive occurrences of the same character into one | EMNLP2017 becomes X0; Bagas becomes Xx | Again, you as the modeller should design this according to the problem at hand |
| Predictive token | Presence of predictive words in the surrounding text | Is the word followed by inc.? Is the word preceded by dr.? | |
| Frequent word list | Whether the target word is in a list of words that occur more than N times in the corpus | | |
| Rare words | Words not in the frequent word list are considered rare/unknown (UNK) words | | |
| Useful unigram | For each name class, words that precede the name class are ranked and the top N are compiled into a list | The words to and from often precede a location, so they are useful unigrams for the location entity | Specific to NER, but can be useful for other tasks; also known as a clue list. Clues are words that serve as guidelines for detecting named entities. While in this example the list is derived from the data, it can also be constructed manually from linguistic knowledge |
| Useful bigram | Like useful unigram, but for bigrams | | |
| Regex | Use a regular expression for pattern matching | | |
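Several of the token-level features in Table 1 (word shape, digits, affixes, windowing) are easy to implement directly. Below is a minimal sketch; the function and feature names are illustrative, not from any particular paper:

```python
import re

def word_shape(word):
    # Replace upper-case letters with X, lower-case with x, digits with 0
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "0", shape)

def summarized_shape(word):
    # Like word_shape, but collapse runs of the same character into one
    return re.sub(r"(.)\1+", r"\1", word_shape(word))

def token_features(tokens, i):
    """Extract features for the token at position i (names are illustrative)."""
    w = tokens[i]
    feats = {
        "word": w.lower(),
        "shape": word_shape(w),                          # e.g. Xxxxx
        "short_shape": summarized_shape(w),              # e.g. Xx
        "has_digit": any(c.isdigit() for c in w),
        "is_capitalized": w[:1].isupper(),
        "prefix3": w[:3].lower(),
        "suffix3": w[-3:].lower(),
        "position": i,
        # Window: one previous and one next word, with boundary markers
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    return feats

print(word_shape("EMNLP2017"))        # XXXXX0000
print(summarized_shape("EMNLP2017"))  # X0
```

A feature dictionary like this can then be turned into a sparse vector (for example with a vectorizer that maps each distinct key-value pair to an index) before being fed to a linear model.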
Characteristics of the Task and the Data
When we design features, we must consider the type of task, because some features might be relevant for one task but not so much for another. The TF-IDF feature is particularly useful for topic classification, because TF-IDF values reveal the words that are indicative of each topic. The POS tag, orthography, and entity-list features are beneficial for NER, because a named entity is often a noun and is usually capitalised.
Even so, we must also consider the nature of the data we have. The capitalisation feature is indeed useful for NER when the corpus comes from news documents with a proper editorial process. But if you build chatbots, you are dealing with ungrammatical text, where capitalisation is often ignored and hence might not be very indicative. In the end, it is up to us, the feature designers, to come up with a relevant set of features. The best way to do that is through trial and error: enumerate lots of possible features and check the evaluation criteria.
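As a refresher on why TF-IDF surfaces topical words, here is a toy computation over a made-up three-document corpus, using a common smoothed-IDF variant (real libraries differ in details such as normalisation):

```python
import math

docs = [
    "the pizza was great".split(),
    "the service was slow".split(),
    "the pizza was very good".split(),
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)          # term frequency within this document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / df) + 1.0     # +1 so terms in every document keep some weight
    return tf * idf

# "pizza" (topical, in 2 of 3 docs) outscores "the" (in every doc) for the first document
print(tf_idf("pizza", docs[0], docs))
print(tf_idf("the", docs[0], docs))
```

Even though both words occur once in the first document, the rarer, more topical word gets the higher weight.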
Standalone Feature vs Combination of Features
We can directly pick several features from Table 1 and use them in a model. In that case, each feature works independently, i.e. standalone. In many situations, however, we might want features to work in conjunction with one another. In the case of NER, the feature 'the word is assigned a noun POS tag and is capitalised' is a more powerful indicator than each of those features working independently.
Feature engineering often works in combination with a linear or log-linear model. These models cannot model combinations of features: each feature is expected to influence the model independently. So, to enable interaction between features, we must encode the combination as a separate feature.
As an illustration, suppose we want to model the interaction between the POS tag and the word itself. We then add a new feature with the template 'word X with POS tag Y'. We can see the problem that might arise here: if the vocabulary contains 1,000 words and we have a tagset of 10 tags, then there are 10,000 possible combinations for that feature alone. For that reason, we might want to carefully craft a limited set of combinations by imposing a specific rule; for example, we only consider words from the frequent word list.
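A sketch of such a restricted combination template, with a made-up frequent-word list (the feature naming scheme is illustrative):

```python
# Only words in this list get the combined word-and-POS feature,
# which limits the combinatorial blow-up described above.
frequent_words = {"london", "john", "the"}

def features(word, pos):
    w = word.lower()
    feats = {f"word={w}": 1, f"pos={pos}": 1}  # standalone features
    if w in frequent_words:
        # Combination feature: fires only when both parts match together
        feats[f"word={w}&pos={pos}"] = 1
    return feats

print(features("London", "PROPN"))
# {'word=london': 1, 'pos=PROPN': 1, 'word=london&pos=PROPN': 1}
```

A rare word such as "Xanthippe" would get only the two standalone features, keeping the feature space bounded by the frequent-word list size times the tagset size.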