Text Classification for PHP

The task of text classification is one of the oldest in Natural Language Processing and was well considered in the past. There are some common methods for classification using Artificial Neural Networks, Support Vector Machines and other machine learning models. The problem with these approaches is the amount of training data that has to be found, normalized as well as labeled for a specific system. Another problem comes with the selection of features, which – in most cases – wastes a lot of time. Because of that, a simpler algorithm must be found to allow classification of text for people who are not interested in machine learning. Some approaches that reach these requirements already exist and were introduced here. This article provides a short introduction to one of these algorithms (NCD) as well as an implementation into PHP scripting language. This code can be simply used for an own PHP project.

Introduction

Why should we choose compression algorithms for classification? The answer is easy. A compression algorithm tries to find a way to minimize data. For that, one way is to find similarities and reduce them. To tackle this task without a computer – just with paper and pen – how would you compress the following examples:

(I) aaaaaabbbbbbcccccc

and

(II) ccccccbbbbbbffffff

I decided to use a simple compression rule (just for this example)
A possible result for the first example: 6a6b6c
And for the second example: 6c6b6f
If we concatenate both examples together: 6a6b12c6b6f
The length of these two examples (I) and (II) are in both cases 18. The compressed version of these examples uses 6 tokens for each string. We compressed a 18 token string to a 6 token string. After this concatenate operation the length of the compressed string also could be 6+6 = 12 tokens, but the compression (6a6b12c6f) uses just 11 tokens. Our simple compression algorithm has found a similarity (the b sequence at the end of example 1 and at the start of example 2). For more complex texts a good compression algorithm finds similarities in words, structures and so one. To conclude, the more the length of the concatenated string differs from the length of the concatenated string that underwent the compression, the more are these two strings similar to each other. To express it with more colloquial words, the length of the compressed string depends on the similarity of the two input strings.

Normalized Compression Distance

The NCD algorithm is normally used to measure the quality of compression algorithms. For that, some theoretical considerations including the Kolmogorov complexity are necessary. Text classification using NCD is a bit more uncomplex, but not feasible without some definitions:

  • Z(x), Z(y)… Compression function (in theoretical considerations: a function that is able to build a shorter representation of a string or token stream), returns the length of the compressed representation
  • min(x,y)… Minimum function (calculates the more minimal value of two variables) returns x when x < y and y when y <= x
  • max(x,y)… Maximal function (calculates the more maximal value of two variables) returns x when x > y and y when y >= x
  • NCD(x,y)… Normalized Compression Distance (calculates the compression distance with the formula below)
  • x, y, xy… Variables and Concatenation (x and y are variables, xy is the concatenation of x and y, which means the two token streams are combined together)

With these definitions and the related formula you can calculate the NCD for your input string (test string) against all of your known phrases. The comparison with the lowest NCD value is most likely to be the correct class.

PHP Implementation

I’ve implemented the NCD approach in PHP and the class is very easy to use. You can use this classifier to detect spam or classify other input. The example below shows the usage of the code:

require 'ncd.class.php';

$c = new NCD();

$c->add(array(
	"hi my name is alex, what about you?" => 'ok',
	"are you hungy? what about some fries?" => 'ok',
	"how are you?" => 'ok',
	"buy viagra or dope" => 'spam',
	"viagra spam drugs sex" => 'spam',
	"buy drugs and have fun" => 'spam'
	));

print_r($c->compare('hi guy, how are you?'));
print_r($c->compare('buy viagra m0therfuck5r'));

The add() method requires an array that contains the phrases for the comparison and the corresponding classes. You can call the add() method as much as you want to add more examples. It is important to provide good classification examples to the classifier. Not well chosen examples can affect the classification performance.

The print_r() methods used in the fragment above output the following lines:

Array
(
    [class] => ok
    [score] => 0.39285714285714
)
Array
(
    [class] => spam
    [score] => 0.48387096774194
)

As can be seen above, “hi guy, how are you?” was successfully classified to ‘ok’ and “buy viagra m0therfuck5r” was correct classified to ‘spam’. The score value can give an insight into the relation between the input string and string with the minimal distance to the input string. If you’re testing your classifier and this value is to high for a clear example, then you should add another comparison example using the add() method for this class.

Get the code here.

I hope you enjoyed this article. Have fun with the code and don’t hesitate to contact me if you have any suggestions or ideas.

11 thoughts on “Text Classification for PHP

  1. Hi Alex,
    I was a bit skeptical after reading your solution for NCD but it works fine with our classifier and I can never thank you enough. However, it gets more discordance than naive bayesian classifier. We get more wrong classification with NCD than with Naive Bayesian.
    Have you try other compression method than gzcompress ? If not, why have you chosen gzcompress ?

    Thank you,
    Stephane

  2. Thanks alot! It is very useful class, it is very easy to use, can you please create a new class using “naive bayesian” algorithm in similar way so that we can test and compare that which algorithm is more accurate.

  3. The paradigm of indoor horticulture is changing rapidly as more indoor gardeners begin to take advantage with the recent and profound advances in LED grow light technology.

    LED aquarium lights make aquarium looks beautiful and also
    have many advantages over other lights also. Best led plant grow lights light
    will there be, on the center of your respective heart, knowning that there is a great light in the.

    Like Italy’s La Befana, the story is the fact that Babouschka failed to offer food and shelter on the three wise men throughout their
    journey to check out the Christ Child. The using these bulbs will lay
    an excellent growth foundation for that plants and therefore are highly recommended
    to get used throughout the early stages from the plant growth.

  4. For this you will need to use floral wire to secure your
    stems together. Once again, I was confused, but we were using a giggle, so I peered one of the petals.
    With a bit attention and care you’ll be able to enjoy them for the long time.

    The large, trumpet-shaped flowers will blossom from December all of the
    way through spring. Keep each row right near the one before it because
    you go around.

  5. People with severe epilepsy need constant monitoring should they
    possess a severe fit. Just because the summer months are over
    plus the holidays have passed away, there isn’t a excuse to avoid protecting you, your loved ones and your house.
    Cctv woman eyes Tourist places, especially, have to be protected when they
    generate revenue with the city.

    The planning behind APT attacks today is insight therefore cyber gangs will focus solely on mobile devices to generate the future attacks successful.
    An auto-iris lens could have a small cable that
    connects for the camera for iris control in several lighting conditions.

  6. It is definitely a fantastic idea to accomplish some
    research when you zero in in your purchase.
    Renee says just the tone of your voice can change everything.
    Sex toy vagina He liked the colour plus the idea that there wasn’t much there.

    There are lots of cognitive activities that parents are able to use to encourage learning skills,
    helping toddlers to meet up with standard milestones on time.
    Lay out your foam rubber and trim it such a way how the foam was equal or half inch
    more than the can.

  7. Thanx a lot! I was looking for a text- classification code , and I found it here.. very good.
    Are there any other ways to make a more precise code to classify the texts by using php? can you please write down a code for classification in other languages, such as python?

Leave a Reply

Your email address will not be published. Required fields are marked *