Text Classification of Natural Language

(Declaration of incompleteness: this document is neither a complete survey of text classification methods nor a scientific work.) This article is a brief introduction to my research seminar at HTW Dresden. It covers the simplest algorithms used for text classification: Edit Distance, Normalized Compression Distance, and a modified Edit Distance method called Substitution Distance (modified and tested by me). None of these algorithms is trained (no machine learning methods are used). Instead, they rely on a set of labeled data: a set of phrases for which the correct class of each phrase is known.

Classification of Text

In general, text classification is a part of a research field called Natural Language Processing (NLP) and is used for a wide range of tasks:

  • Categorization of posts or articles
  • Human-Machine interaction
  • Email routing
  • Spam detection
  • Readability assessment
  • Language detection

To classify documents (= a bunch of words, also called ‘phrases’) you need the corresponding class. All methods introduced in this article use a set of documents with labeled classes to find the best-matching (= best-fitting) class. Some possible classes are shown below:

  • Politics
  • Religion
  • Baseball
  • Basketball
  • Sport (in general)
  • Space
  • Medicine
  • Et cetera

Specifications of the most popular datasets in text classification are easy to find online. The classes above work well for most document classification tasks. My research seminar deals with human-machine interaction, so we need classes that describe communication between humans, or between a human and a robot. Some consideration of possible classes for this task leads to the following result:

  • Smalltalk
  • Question
  • Accept
  • Deny
  • Answer
  • Goodbye
  • Greeting

Certainly, there are a few more classes, but these are enough to give an impression of the task.


style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">

Simple Methods for Text Classification

As already mentioned above, this article covers three methods:

  • Edit Distance
  • Normalized Compression Distance
  • Substitution Distance

Edit Distance

This method uses a token-based algorithm to compare two phrases. A simple example (with single characters as tokens) could be:

Calculate the distance between the words INTENTION and EXECUTION. We need a cost function with all necessary operations for this task:

  1. insert (i) = 1
  2. delete (d) = 1
  3. substitution (s) = 1

With this cost function we are able to calculate a score that indicates the edit effort. After that, we apply the cost function to the example:

Under this cost function, each operation costs one score point. Transforming INTENTION into EXECUTION takes five operations (one deletion, three substitutions, and one insertion), so the distance between the two words is 5. For a document with thousands of words this algorithm performs poorly, but for simple tasks (= short spoken phrases) the Edit Distance method works fine.
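
As a minimal sketch (not the seminar's actual implementation), the distance can be computed with the classic dynamic-programming recurrence in Python:

    def edit_distance(a, b):
        """Levenshtein distance with unit costs for insert, delete and substitute."""
        m, n = len(a), len(b)
        # dp[i][j] = cost of turning a[:i] into b[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i  # delete all i tokens of a
        for j in range(n + 1):
            dp[0][j] = j  # insert all j tokens of b
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                               dp[i][j - 1] + 1,        # insert
                               dp[i - 1][j - 1] + sub)  # substitute (or match)
        return dp[m][n]

    print(edit_distance("INTENTION", "EXECUTION"))  # -> 5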

Normalized Compression Distance

The NCD algorithm is normally used to measure the quality of compression algorithms, which requires some theoretical groundwork, including Kolmogorov complexity. Text classification using NCD is less complex, but still not feasible without a few definitions:

  • Z(x), Z(y)… Compression function (in theoretical terms: a function that builds a shorter representation of a string or token stream); returns the length of the compressed representation
  • min(x,y)… Minimum function (returns the smaller of two values): x when x < y and y when y <= x
  • max(x,y)… Maximum function (returns the larger of two values): x when x > y and y when y >= x
  • NCD(x,y)… Normalized Compression Distance (calculated with the formula below)
  • x, y, xy… Variables and concatenation (x and y are variables; xy is the concatenation of x and y, i.e. the two token streams combined)
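
Put together, these pieces yield the standard NCD formula:

    NCD(x,y) = (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y))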

With these definitions and the formula above you can calculate the NCD of your input string (test string) against all of your known phrases. The comparison with the lowest NCD value is most likely to point to the correct class.
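
A minimal sketch in Python, using zlib as the compression function Z (the concrete compressor is an assumption on my part; any off-the-shelf compressor works):

    import zlib

    def Z(s: bytes) -> int:
        """Length of the compressed representation of s."""
        return len(zlib.compress(s))

    def ncd(x: str, y: str) -> float:
        bx, by = x.encode(), y.encode()
        zx, zy, zxy = Z(bx), Z(by), Z(bx + by)
        return (zxy - min(zx, zy)) / max(zx, zy)

    # Classification: the labeled phrase with the lowest NCD wins.
    labeled = [("good morning my name is alex", "greeting"),
               ("what time is it", "question")]
    test = "good morning i am alex"
    phrase, cls = min(labeled, key=lambda p: ncd(test, p[0]))
    print(cls)  # most likely class of the test phrase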

Substitution Distance

The third and last method is called Substitution Distance (or simply SD). It is based on my own considerations about possible improvements to the Edit Distance approach introduced earlier. One of the main problems of Edit Distance is illustrated by the following two phrases:

  1. good morning my name is alex
  2. i am alex good morning

Phrases 1 and 2 correspond to the same class (for instance, greeting) and, as we can see, they are very similar! The problem is that ‘good morning’ at the beginning (1.) and at the end (2.) produces a significantly high edit effort score. This shows that the Edit Distance method behaves very unnaturally here. For a human it is a cinch to classify these two phrases, but not for the machine. Nonetheless, the machine can use a simple but strong indicator to perform well: the sub-phrase ‘good morning’, used in both phrases, can be seen as an indicator of the same class. Based on the simple Edit Distance approach, the following preprocessing step was added to the algorithm (a code sketch follows the list):

  1. Set score = 0
  2. Find similar sub phrases in both phrases
  3. Determine the length of all similar sub phrases
  4. Set score = length_of_similar_subphrases * (-1)
  5. Delete similar sub phrases from both phrases
  6. Apply Edit Distance to the rest of the phrases
  7. score = score + ED_Score
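
A minimal sketch of these steps in Python, reusing edit_distance from the Edit Distance sketch above. Two details are assumptions on my part: ‘similar sub phrases’ is simplified to the longest common contiguous token runs, removed greedily, and the length in step 3 is measured in characters:

    def longest_common_run(ta, tb):
        """Longest common contiguous token run; returns (i, j, length)."""
        best = (0, 0, 0)
        # dp[i][j] = length of the common run ending at ta[i-1] and tb[j-1]
        dp = [[0] * (len(tb) + 1) for _ in range(len(ta) + 1)]
        for i in range(1, len(ta) + 1):
            for j in range(1, len(tb) + 1):
                if ta[i - 1] == tb[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                    if dp[i][j] > best[2]:
                        best = (i - dp[i][j], j - dp[i][j], dp[i][j])
        return best

    def substitution_distance(a: str, b: str) -> int:
        ta, tb = a.split(), b.split()
        score = 0                                    # step 1
        while True:
            i, j, n = longest_common_run(ta, tb)     # step 2
            if n == 0:
                break
            score -= sum(len(w) for w in ta[i:i+n])  # steps 3 + 4
            del ta[i:i+n]                            # step 5
            del tb[j:j+n]
        return score + edit_distance(" ".join(ta), " ".join(tb))  # steps 6 + 7

    print(substitution_distance("good morning my name is alex",
                                "i am alex good morning"))

On the greeting example above, ‘good morning’ and ‘alex’ are removed and rewarded with a negative score, so the two phrases end up much closer than plain Edit Distance would rate them.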





To make a long experiment short, this algorithm performs very well on the test sets used for my seminar. Some results of the experiments are listed below.

Additional preprocessing

Besides the baseline methods introduced above, pre- and/or post-processing steps are useful. I used two main procedures to refine the data; both gave interesting insights into the anatomy of natural language.

Stemming

The stemming approach trims words down to their stems (stemming -> stem). This is a very useful preprocessing procedure to overcome errors from typing and speech recognition, as well as to avoid high scores for long words with the same meaning. From a practical point of view, the Lucene GermanStemmer (used in the Lucene search engine) can be used.
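
The seminar's implementation used Lucene's GermanStemmer (Java). As an illustrative substitute in Python, NLTK's German Snowball stemmer serves the same purpose:

    from nltk.stem.snowball import SnowballStemmer  # pip install nltk

    stemmer = SnowballStemmer("german")
    for word in ["gesprochen", "sprechen", "spricht"]:
        # Prints each word together with its trimmed stem.
        print(word, "->", stemmer.stem(word))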

Stop Word Reduction

Another preprocessing procedure is to remove all unnecessary words from the phrase. The main problem is determining which words are unnecessary and which are important (and in which context they are important). The solution is to inspect the dataset thoroughly and find words to withdraw. In the following experiments, stop word reduction worked very poorly. The likely reason: short phrases lose words and with them their meaning, and the classification goes wrong.
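
A sketch of the idea; the word list here is a made-up example, since in practice it has to be derived from the dataset itself (the seminar's ‘stopword-small’ list differed):

    # Hypothetical stop word list for illustration only.
    STOPWORDS = {"der", "die", "das", "und", "ist", "ein", "eine"}

    def remove_stopwords(phrase: str) -> str:
        return " ".join(w for w in phrase.split() if w.lower() not in STOPWORDS)

    print(remove_stopwords("das ist eine Frage"))  # -> "Frage"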

Experiments

A short excerpt from my experiments is shown below. Our dataset contains 1530 phrases in 10 classes, recorded in real-world scenarios and labeled by employees of the robotics lab. The test method is 10-fold cross-validation. The number in front of each method name is its accuracy (simply read: the percentage of the 1530 test phrases classified correctly):

  • 92.69% Substitution Distance with Lucene GermanStemmer without stopword reduction
  • 90.55% Normalized Compression Distance (without preprocessing)
  • 89.97% Edit Distance with Lucene GermanStemmer
  • 89.96% Substitution Distance with Lucene GermanStemmer and stopword-small
  • 89.32% Normalized Compression Distance with Lucene GermanStemmer
  • 88.03% Edit Distance with Lucene GermanStemmer and stopword-small
  • 87.35% Substitution Distance without Stemming with stopword-small
  • 87.01% Edit Distance without preprocessing
  • 85.27% Substitution Distance without preprocessing
  • 85.12% Normalized Compression Distance with Lucene GermanStemmer and stopword-small

The list of experimental results shows the highest accuracy at 92.69% (Substitution Distance using the Lucene GermanStemmer). It's no secret that the accuracy of the estimation depends on the classes and their confusability, as well as the similarity between two or more classes. Well-chosen classes are a key element of good estimation. All in all, Substitution Distance and Normalized Compression Distance are the two of the three approaches that can be used for robust text classification of natural language phrases.

Epilogue

Part 2 of the NLP article series will deal with more complex operations in Natural Language Processing, as well as Sentiment Analysis. The goal of this part (Part 1) was to convey a basic understanding of text classification and to introduce the baseline approaches (ED, NCD, SD). I hope you enjoyed this article! If you have any feedback, do not hesitate to contact me or comment below.
