Gopher Neural – Simple But Effective Machine Learning in Golang


Simple and efficient machine learning in Golang has never been this easy. I was looking for a solution that lets me easily build classifiers and regressors using Golang, so I developed gopher-neural around a very simple, already existing backpropagation implementation.

Quickstart

Preface

This code was partly taken from github.com/NOX73/go-neural. For the implementation of the core algorithm all credits belong to NOX73. The fork to gopher-neural was made to pursue the following goals:

  • Build a training / testing framework around this algorithm
  • Build rich measurement mechanisms to control the training
  • Improved I/O functionality for training
  • Provide examples for the usage of the library

Done so far

  • Changed I/O handling for JSON models
  • Added Sample and Set structure for handling of data sets
  • Implement rich measurements for the evaluation of the classifier
  • Simple data I/O for training / testing in libSVM and CSV format
  • Added labels to output neurons in network and persist
  • Just output label of neuron with most confidence
  • Establish a learning framework as engine package (using epochs, decays, interraters)
  • Provide another repository using example projects including data
  • Confusion matrix handling
  • Implement rich measurements for the evaluation of regressors

Roadmap

  • Improve the split data set handling by classes (for classification and regression)
  • Pipelined learning in channels to find the optimum
  • Online learning with online evaluation
  • Feature normalizer (auto encoder also for alphanumerical features)

Install

  go get github.com/flezzfx/gopher-neural
  go get github.com/flezzfx/gopher-neural/persist
  go get github.com/flezzfx/gopher-neural/learn
  go get github.com/flezzfx/gopher-neural/engine
  go get github.com/flezzfx/gopher-neural/evaluation

The gopher-neural engine

The engine trains the network in nested loops:

  • tries – the number of independent training runs
  • within each try, the learning rate starts at learningRate and is reduced by decay as long as it stays above 0
  • for each learning-rate value, the network sees the training set for the given number of epochs
  • one epoch = one forward pass and one backward pass of all the training examples

In other words: the network is trained for epochs epochs, then learningRate is reduced by decay, and this repeats until the learning rate reaches 0; the whole procedure is run tries times.
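
As a rough sketch of this schedule (illustrative only – trainOneEpoch and evaluateAndKeepBest are hypothetical placeholders, not functions of the engine package):

package main

import "fmt"

// Hypothetical stand-ins for the engine internals – for illustration only.
func trainOneEpoch(lr float64) { /* one forward + one backward pass over all training samples */ }
func evaluateAndKeepBest()     { /* keep the best network according to the chosen criterion */ }

func main() {
	const (
		tries        = 1
		epochs       = 100
		learningRate = 0.6
		decay        = 0.005
	)
	for try := 0; try < tries; try++ {
		for lr := learningRate; lr > 0; lr -= decay {
			for epoch := 0; epoch < epochs; epoch++ {
				trainOneEpoch(lr) // one epoch over the whole training set
			}
			evaluateAndKeepBest()
		}
		fmt.Printf("try %d finished\n", try+1)
	}
}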

Modes

Gopher-neural can be used to perform classification and regression. This section helps to set up both modes. In general, you have to pay attention to the differences between the two modes in these parts: reading training data from file, starting the engine, using the evaluation modes and performing in production.

Classification

Read training data from file

data := learn.NewSet(neural.Classification)
ok, err := data.LoadFromCSV(dataFile)

Start engine

e := engine.NewEngine(neural.Classification, []int{hiddenNeurons}, data)
e.SetVerbose(true)
e.Start(engine.CriterionDistance, tries, epochs, trainingSplit, learningRate, decay)

Use evaluation mode

evaluation.GetSummary("name of class1")
evaluation.GetSummary("name of class2")
evaluation.PrintConfusionMatrix()

Perform in production

x := net.CalculateWinnerLabel(vector)

Regression

Important note: use regression only with target values between 0 and 1.

Read training data from file

data := learn.NewSet(neural.Regression)
ok, err := data.LoadFromCSV(dataFile)

Start engine

e := engine.NewEngine(neural.Regression, []int{hiddenNeurons}, data)
e.SetVerbose(true)
e.Start(engine.CriterionDistance, tries, epochs, trainingSplit, learningRate, decay)

Use evaluation mode

evaluation.GetRegressionSummary()

Perform in production

x := net.Calculate(vector)

Criterions

To let the engine decide on the best model, a few criteria were implemented. They are listed below together with a short note regarding their application:

  • CriterionAccuracy – uses simple accuracy calculation to decide the best model. Not suitable with unbalanced data sets.
  • CriterionBalancedAccuracy – uses balanced accuracy. Suitable for unbalanced data sets.
  • CriterionFMeasure – uses F1 score. Suitable for unbalanced data sets.
  • CriterionSimple – uses the number of correctly classified samples divided by all classified samples. Suitable for regression with thresholding.
  • CriterionDistance – uses distance between ideal output and current output. Suitable for regression.
...
e := engine.NewEngine(neural.Classification, []int{100}, data)
e.Start(engine.CriterionDistance, tries, epochs, trainingSplit, learningRate, decay)
...

Some more basics

Train a network using engine

import (
	"fmt"

	"github.com/flezzfx/gopher-neural"
	"github.com/flezzfx/gopher-neural/engine"
	"github.com/flezzfx/gopher-neural/learn"
	"github.com/flezzfx/gopher-neural/persist"
)

const (
	dataFile      = "data.csv"
	networkFile   = "network.json"
	tries         = 1
	epochs        = 100 //100
	trainingSplit = 0.7
	learningRate  = 0.6
	decay         = 0.005
  hiddenNeurons = 20
)

func main() {
	data := learn.NewSet(neural.Classification)
	ok, err := data.LoadFromCSV(dataFile)
	if !ok || nil != err {
		fmt.Printf("something went wrong -> %v", err)
	}
	e := engine.NewEngine(neural.Classification, []int{hiddenNeurons}, data)
	e.SetVerbose(true)
	e.Start(engine.CriterionDistance, tries, epochs, trainingSplit, learningRate, decay)
	network, evaluation := e.GetWinner()

	evaluation.GetSummary("name of class1")
	evaluation.GetSummary("name of class2")

	err = persist.ToFile(networkFile, network)
	if err != nil {
		fmt.Printf("error while saving network: %v\n", err)
	}
	network2, err := persist.FromFile(networkFile)
	if err != nil {
		fmt.Printf("error while loading network: %v\n", err)
	}
  // check the network with the first sample
	w := network2.CalculateWinnerLabel(data.Samples[0].Vector)
	fmt.Printf("%v -> %v\n", data.Samples[0].Label, w)

  fmt.Println(" * Confusion Matrix *")
	evaluation.PrintConfusionMatrix()
}

Create simple network for classification

  import "github.com/flezzfx/gopher-neural"
  // The network has 9 inputs and 3 layers
  // (9 neurons, 9 neurons and 2 neurons).
  // The last layer is the network output (2 neurons).
  // For these last neurons we need labels (like: spam, nospam, positive, negative)
  labels := make(map[int]string)
  labels[0] = "positive"
  labels[1] = "negative"
  n := neural.NewNetwork(9, []int{9, 9, 2}, labels)
  // Randomize synapse weights
  n.RandomizeSynapses()

  // now you can calculate on this network (of course it is not trained yet)
  // (for the training you can use the engine)
  result := n.Calculate([]float64{0,1,0,1,1,1,0,1,0})

Further ideas

Rename and batching in learning

  • Use term batch size = the number of training examples in one forward/backward pass.
  • Use term iterations = number of passes, each pass using [batch size] number of examples.
  • Random application of samples

Use it

Phonetizz – Overstrain Levenshtein for Phonetic Applications

Levenshtein based and phrase accepting phonetic algorithm written in Golang.

Need

Most phonetic algorithms use a code or symbols that are not easily comparable. This motivated the development of a phonetic algorithm that uses the simplest forms of text comparison. Phonetizz utilizes the Levenshtein distance metric to perform such calculations as a phonetic measure. Another important point is that all known algorithms work only on single words. With Phonetizz it is possible to also work on phrases instead of just one word.

The algorithm

The idea behind Phonetizz is to use simple text metrics plus some adjustments. Below I've listed the hypotheses that support the use of Phonetizz:

  • At least for spoken languages (a token-based language like Chinese counts as a literary language here), the vowels are more important than the rest of the tokens, because they have the most influence on the phonetics.
  • The vowel and consonant signatures of a word should therefore be considered separately.

Based on these hypotheses, the algorithm first separates the vowels from the consonants. After that, the Levenshtein distance is calculated once for just the vowels and once for just the consonants. These two distances are then combined with different weights, where the vowel distance is weighted higher.
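
To make this concrete, here is a minimal Go sketch of the vowel/consonant split and the weighted combination. It is not the actual phonetizz implementation; the vowel set, weights and helper names are illustrative assumptions.

package main

import (
	"fmt"
	"strings"
)

// levenshtein computes the classic edit distance between two strings.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = minInt(prev[j]+1, minInt(curr[j-1]+1, prev[j-1]+cost))
		}
		prev = curr
	}
	return prev[len(rb)]
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// split separates a phrase into its vowel and consonant signatures.
func split(s string) (vowels, consonants string) {
	for _, r := range strings.ToLower(s) {
		switch {
		case strings.ContainsRune("aeiouäöü", r):
			vowels += string(r)
		case r != ' ':
			consonants += string(r)
		}
	}
	return
}

// phoneticDistance combines both distances; the vowel distance gets the higher weight.
func phoneticDistance(a, b string, vWeight, cWeight float64) float64 {
	va, ca := split(a)
	vb, cb := split(b)
	return vWeight*float64(levenshtein(va, vb)) + cWeight*float64(levenshtein(ca, cb))
}

func main() {
	fmt.Println(phoneticDistance("An und Pfirsich John", "An und für sich schon", 0.7, 0.3))
	fmt.Println(phoneticDistance("An und Pfirsich John", "Wer macht denn sowas", 0.7, 0.3))
}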

Example

word1 := "An und Pfirsich John"
word2 := "Wer macht denn sowas"
word3 := "An und für sich schon"

score := phonetizz.Phonetizz(word1, word2, phonetizz.DefaultVWeight, phonetizz.DefaultCWeight)
// score equals 11.2
score = phonetizz.Phonetizz(word1, word3, phonetizz.DefaultVWeight, phonetizz.DefaultCWeight)
// score equals 2.5

Experimentation

At the moment, no representative evaluation of the algorithm exists. Let me know if you try it or want to do such an evaluation in collaboration with me.

Use Case: Rhymzz

Rhymzz is a modified version of Phonetizz that applies another layer of weights on top of Phonetizz. If you are looking for words that complete your rhyme, Rhymzz helps you with that. Using Phonetizz (which can also be used for this case) makes it possible to get phonetic similarities of (so-called) punch lines for a battle rap. Maybe future battle rap competitions will be ruled by the guys with the fastest computers.

In contrast to Phonetizz, the experiments here (also not representative) were performed on single words only.

"mähen" -> winner word score: gehen
"probieren" -> winner word: verlieren
"Studenten" -> winner word: Enten

DreamCity – Build your own German city


Bring your fictive German city name and get its latitude and longitude.

Intention

With this project I had two intentions:

  • Find relations between the names of German locations, their founding dates and the cultures / nations there. For instance, I tried to find differences in the naming of locations in the south of the German republic, where the three former states Großherzogtum Baden, Königreich Württemberg and Königreich Bayern were located. After various clusterings of different parts of the data and predictions, I definitely found clusters there.
  • Back in the day, I was also interested in building a small web site where you can create your village / city / town at a place given by DreamCity. Something like a social network with virtual locations and space for your self-realization.

Anyway, now I don’t have the time to a) complete the research here and b) deploy the small funny game. So, the time was right to publish some parts of this work.

How it works

Data from the 12,240 largest German cities / towns / locations was used. These locations were annotated with latitude and longitude. Below I list the steps from data to model:

  1. The location name was divided into n-grams (where n = 3) [1]
  2. Each distinct n-gram (e.g. ‘Ber’ in Berlin) occupies a dimension in the feature vector (dictionary approach)
  3. The feature vector has a size of 5810 dimensions (distinct 3-grams of the 12k locations)
  4. Each location is then represented by a 5810-dimensional vector containing {0,1}s like (0,0,0,0,0,1,….,0,1,0,0) – steps 1–4 are sketched in the code after this list
  5. The regressor learns to map this 5810-dimensional vector to the corresponding (lat,lng) pair of the location
  6. The regressor was configured like this: 100 hidden neurons, adam solver [2], adaptive learning rate, 900k iterations with logistic activation function [3]
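
To illustrate steps 1–4, here is a minimal Go sketch of the trigram dictionary and the binary feature vector. The original project is written in Python, so this is only an illustration of the feature construction, not the actual code.

package main

import (
	"fmt"
	"strings"
)

// trigrams splits a location name into character 3-grams (step 1).
func trigrams(name string) []string {
	r := []rune(strings.ToLower(name))
	var grams []string
	for i := 0; i+3 <= len(r); i++ {
		grams = append(grams, string(r[i:i+3]))
	}
	return grams
}

func main() {
	locations := []string{"Berlin", "Dresden", "Bernau"}

	// Step 2: every distinct 3-gram occupies one dimension (dictionary approach).
	index := map[string]int{}
	for _, loc := range locations {
		for _, g := range trigrams(loc) {
			if _, ok := index[g]; !ok {
				index[g] = len(index)
			}
		}
	}

	// Steps 3 + 4: each location becomes a binary vector over the dictionary;
	// on the full data set this vector has 5810 dimensions.
	for _, loc := range locations {
		vec := make([]int, len(index))
		for _, g := range trigrams(loc) {
			vec[index[g]] = 1
		}
		fmt.Println(loc, vec)
	}
}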

After the training the results are:

With a total drift of 67145.162 kilometers (deviation from the correct (lat, lng) pair) over 12,237 samples, the regressor achieves a mean error of 5.49 kilometers per location.

Play around

If you want to play around with this code, the following notes could be interesting for you.

Execute dream city regressor

Just run the following script, followed by the city name you want the regressor to place.

> python start_regressor.py Müllerstadt
Müllerstadt
lat: 50.4615841198 lng: 8.53205433763
>_

Evaluate the regressor on your own

For this, the script below can be used; it evaluates the current regressor using the current n-gram model as feature transformer.

> python eval_regressor.py
sum of all drifts = 67145.1620699
number of samples = 12237
mean drift kilometers = 5.48706072321 km
>_

Train a new regressor

This is also possible. The data set for the training can be found in data/locations.csv. If you want to retrain just with other configurations, change them in train_regressor.py and just re-run it. You can also exchange the data set with your locations and start the training again.

> python train_regressor.py
>_

What else

If you have any ideas or improvements let me know. For further questions on this publication just drop me a message on twitter @AlexanderBresk.

References

[1] Neuron food on n-grams (https://en.wikipedia.org/wiki/N-gram)

[2] Diederik P. Kingma, Jimmy Ba. Adam: A Method for Stochastic Optimization (https://arxiv.org/abs/1412.6980)

[3] What is the role of activation function in neural nets? (https://www.quora.com/What-is-the-role-of-the-activation-function-in-a-neural-network)

Answer Type Detection for German Language

The second post of the NLPG project of cip-labs deals with data. Here is the data you need to train a classifier with the widely used taxonomy of Li and Roth [1] for the German language. This data was published in the context of the master thesis of Alexander Bresk [2]. This thesis also introduces a novel classifier that uses this data to train and evaluate a model for answer type detection. The taxonomy of Li and Roth, the provided data and the trained classifiers are presented in this article.

The taxonomy

Li and Roth introduced a taxonomy that has been the standard in answer type detection for a long time. The taxonomy covers 6 master classes and 50 sub classes. The 6 master classes are:

  • Abbreviation (2)
  • Description (4)
  • Entity (22)
  • Human (4)
  • Location (5) and
  • Numeric (13).

The numbers in brackets depict the number of sub classes for each master class. According to Li and Roth, these classes cover most cases of answer types.

The data

The presented data consists of 1548 questions in the training set and 735 questions in the test/evaluation set. The number of samples per master class is distributed as follows:

  • Master classes (Train #1548 / Test #735)
  • Abbreviation (Train #72 / Test #30)
  • Description (Train #141 / Test #80)
  • Entity (Train #678 / Test #313)
  • Human (Train #119 / Test #60)
  • Location (Train #150 / Test #75)
  • Numeric (Train #388 / Test #177)



style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">


This data was labeled the following way in the master classes file:

Was ist die Hauptstadt von Deutschland?#LOCATION

In the sub classes files the same question is labeled as follows:

Was ist die Hauptstadt von Deutschland?#LOC_CITY

The annotation for the sub classes consists of the first three letters of the master class, followed by an underscore and the name of the sub class.
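
As an illustration, a minimal Go sketch for reading this format could look like the following; the file name is a placeholder and not part of the published data.

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// "questions_master.txt" is a placeholder name for the master classes file.
	f, err := os.Open("questions_master.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line looks like "Was ist die Hauptstadt von Deutschland?#LOCATION";
		// in the sub class files the label looks like "LOC_CITY".
		parts := strings.SplitN(scanner.Text(), "#", 2)
		if len(parts) != 2 {
			continue
		}
		question, label := parts[0], parts[1]
		fmt.Printf("%s -> %s\n", question, label)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}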

Current State of the Art

Currently, the state of the art in answer type detection uses models that try to find correlations in semantic networks to determine the answer type. Research activities from 2015 [3] tackle this approach. The model of Li and Roth is, of course, still used by a lot of researchers. Newer systems rely solely on semantic approaches.

Download

Here you can download the data to train your own answer type detector for the German language.



style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">

Sources

[1] Li, Xin, and Dan Roth. “Learning question classifiers: the role of semantic information.” Natural Language Engineering 12.03 (2006): 229-249.

[2] Alexander Bresk. “Question Answering using Unstructured Data”. HTW Dresden. 2015

[3] SUN, H., MA, H., YI, W., TSAI, C., LIU, J. and CHANG, M. Open domain question answering via semantic enrichment. In Proceedings of the companion publication of the 24th international conference on World Wide Web. ACM – Association for Computing Machinery, May 2015. URL http://research.microsoft.com/apps/pubs/default.aspx?id=241399.

SemMap – The semantic map for German language

During my long absence from code is poetry labs, I've worked on my master thesis. My thesis was on the topic Question Answering using Unstructured Data and was created to solve the QA task for the German language. The thesis produced a lot of byproducts and ideas to continue the work on this topic. All of these byproducts will be released over the next months on this blog as well as software projects on the cip-labs GitHub account.

What is SemMap

This post deals with the first semantic database (SEMantic MAP) that allows you to calculate on semantic frames for the German language. A semantic frame is a cluster of words that have the same semantic meaning. The representation of the semantic map is stored in JSON. An example:

"Commerce_collect": [
   "kassieren",
   "bitten",
   "verlangen",
   "berechnen",
   "nehmen",
   "essen",
   "futtern",
   "einkassieren"
 ],

SemMap also contains precalculated correlations that allow you to run calculations on semantics. The representation can be seen as a matrix in which the correlation of each frame to every other frame is stored:

"Commerce": {
  "Memory": 0.00024714220074908,
  "Choosing": 0.0048732264936439,
  "Kinship": 0.016520237813453,
  "Assessing": 0.00043510950836106,
  "Deserving": 0.00019492905974576,
  "Destroying": 0.00018796730761198,
  "Part_ordered_segments": 0.016509795185252,
  "Part_orientational": 0.008117402987984,
  "Attention_getting": 0.0039403517077178,
  ...
  "Direction": 0.018431238774175
},



style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">


With these correlations you have a cluster representation of each frame. Before SemMap, each found semantic frame was just a number that could be accumulated, nothing more. Now these frames have a vector representation with links to all other frames.

Data sources

The data sources of SemMap are all legacy projects from a few universities. These projects can be seen as the German equivalent of FrameNet. The frames of these projects were extracted, cleaned, and some of them were used in SemMap. Besides this, SemMap was extended with a lot of open source dictionaries.

Correlation of frames

The frames and the words they contain are just the beginning of SemMap. The heart of this project is the correlation calculation. Each word (and therefore each frame) has correlations to other semantic frames. To calculate these correlations, SemMap was trained on a narrowed document set from the German Wikipedia (1.3 million documents).

On these documents two correlations are calculated (see the sketch after this list):

  • Weak Correlation: With the pivot on word index x, another frame occurs in the range of x-20 up to x+20
  • Strong Correlation: With the pivot on word index x, another frame occurs in the range of x-5 up to x+5
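
As a rough sketch of the counting behind these correlations (illustrative only, not the actual training code; normalization into probabilities is omitted):

package main

import "fmt"

// countCooccurrences sketches the window-based counting: frames holds the frame
// annotation of every word position ("" = no frame found there); window is 20
// for weak and 5 for strong correlations.
func countCooccurrences(frames []string, window int) map[string]map[string]int {
	counts := map[string]map[string]int{}
	for i, pivot := range frames {
		if pivot == "" {
			continue
		}
		if counts[pivot] == nil {
			counts[pivot] = map[string]int{}
		}
		lo, hi := i-window, i+window
		if lo < 0 {
			lo = 0
		}
		if hi > len(frames)-1 {
			hi = len(frames) - 1
		}
		for j := lo; j <= hi; j++ {
			if j != i && frames[j] != "" {
				counts[pivot][frames[j]]++
			}
		}
	}
	return counts
}

func main() {
	// Toy document: each position holds the frame found at that word, if any.
	frames := []string{"Commerce", "", "Building", "", "", "Commerce_collect", "Building"}
	fmt.Println(countCooccurrences(frames, 5))
	// Normalizing every row of these counts yields the correlation probabilities
	// stored in the matrix shown above.
}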

The representation of these calculations can be seen as a matrix with the correlation probability from frame A to frame B, i.e. the probability that frame B co-occurs if frame A has already occurred. This representation solves the following problems:

  • The occurrence of a frame used to be nothing more than an accumulated number (now it is a vector representation in a semantic space)
  • You could not use mathematical calculations on sentences and words (now you can: you can add, subtract, multiply and average textual components)

Possible use cases

The possible use cases for SemMap are very wide. The obvious use cases are listed below:

  • The SemMap can be used to find semantic frames of words in the text.
  • Correlation maps can be used to build a vector representation and therefore a feature space for text classification
  • Self trained correlation maps can be used to cluster your input before running a classification on it
  • Correlation maps are useful to tackle intention analysis and deep semantic understanding

The presented use cases are just the obvious ones.

How to use it

This section explains how to use the map and all corresponding files.

Use SemMap and correlation files

# get content from semmap file into $content
$semmap = json_decode($content, true); # decode as an associative array
print_r($semmap[<frame>]); # prints frame members

Working with correlation files:

# get content from correlation file into $content
$correlations = json_decode($content, true); # decode as an associative array

# print all correlations for frame Commerce_collect
print_r($correlations['Commerce_collect']);

# print the correl. that frame 'Building' occurs near to 'Commerce'
echo $correlations['Commerce']['Building'];



style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">

Train new correlation files

First you need to adjust the constants in the head of the PHP file train.php. The constants are named very intuitively. You also need training documents to work on. Let's assume that your training documents are in the directory train_docs/. Then you run the command:

php train.php train_docs/

Download & Get Involved

You can download the current project from Github. If you want to contribute code, ideas and content to the project please contact us. Visit the NLPG project page on code is poetry laboratories.

Answer Type Detection with German Text

This short note, which was originally released on my private blog, is part of the preparations for my master thesis. The thesis proposes a Question Answering system that pursues two goals:

  • Replace a static FAQ section with an input field to search in unstructured data
  • Act as a part of a dialog system for a service robot.

One step of this system (maybe the first in the pipeline) deals with detecting the correct answer type. So this note covers my ideas and remarks regarding this topic (Answer Type Detection / ATD).

WHAT AND WHY

Answer Type Detection is about finding the type of an answer. Type is a very squishy wording. Why? Because every QA system has a different task to solve. If you are searching for factoid answers, then the „type“ often means a data type (date, string, number, etc.). For a more complex system, the step of Answer Type Detection has to be executed very accurately. Imagine a set of documents in which you are searching for the answer. Then ATD (assuming your results are trustworthy) helps to:

  • Locate the answer in your documents
  • Choose the search & evaluation strategy

RELATED STUDIES

There are a few other studies dealing with these topics. Some of them are introduced below; other works are Zhang and Lee, 2003; Huang et al., 2008; Silva et al., 2011; Loni et al., 2011; Blunsom et al., 2006; Li and Roth, 2004; as well as Merkel and Klakow, 2007.


style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">


DAVIDESCU ET AL

Davidescu et al. [1] evaluate some interesting features for Answer Type Detection. The team worked with a German corpus (data was crawled from web sources) and 7 answer types (Definition, Location, Person, Yes/No, Explanation, Date, Number/Count). The approach is very similar to my work on Answer Type Detection, but I used a larger training/test set than they did. The paper discusses the following features:

  • Collocations
  • Trigger words
  • Punctuation marks (a very weak feature, note AB)
  • Question length (also a very weak feature, note AB)
  • Named entities (Lingpipe)
  • Lemmas (Tree Tagger)
  • POS-tags (Tree Tagger)
  • Bag of Words
  • Sentence patterns

They used three well-known classifiers for the tests (Naïve Bayes, kNN and Decision Trees). My learnings from this paper were:

  • NER Features don’t improve the system significantly
  • Bag of Words improves the result badly (see paper for more info)
  • Statistical Collocation also improved the results (see paper for more info)

The results fluctuate between 54% and 65% accuracy. The best result was reached with the Baseline+StatisticalColloc+BagOfWords feature set using the Naïve Bayes classifier.

KIRK ROBERTS AND ANDREW HICKL

The approach of Kirk Roberts and Andrew Hickl [3] shows a hierarchical classifier for scenarios with up to 200 classes (including sub classes). In „Scaling Answer Type Detection to Large Hierarchies“ they defined Coarse Answer Types and Fine Answer Types.

  • ABBREVIATION
    ABBREVIATION, EXPANDED ABBREVIATION
  • ENTITY
    ANIMAL, BODY, COLOR, CREATIVE, CURRENCY, DISEASE/MEDICINE, EVENT, FOOD, INSTRUMENT, LANGUAGE, LETTER, OTHER, PLANT, PRODUCT, RELIGION, SPORT, SUBSTANCE, SYMBOL, TECHNIQUE, TERM, VEHICLE, WORD
  • DESCRIPTION
    DEFINITION, DESCRIPTION, MANNER, REASON
  • HUMAN
    GROUP, INDIVIDUAL, TITLE, DESCRIPTION
  • LOCATION
    CITY, COUNTRY, MOUNTAIN, OTHER, STATE
  • NUMERIC
    CODE, COUNT, DATE, DISTANCE, MONEY, ORDER, OTHER, PERIOD, PERCENT, SPEED, TEMPERATURE, SIZE, WEIGHT

This approach (which is also explained in other papers by the same as well as other authors, see the secondary papers and publications below) outperforms some approaches that use flat classifiers in the same scenario (by up to 7%). I had also gained the insight to use hierarchical (cascaded) classifiers from a previous text classification task. Some interesting numbers of the approach:

  • Reach precision of 85% with 200 classes
  • Test/training set containing 20,000 rows

BABAK LONI

Babak Loni's master thesis [6] deals with the enhancement of feature sets and especially with combining features to improve performance. The approach uses neural networks and SVMs. My approach also uses neural networks and SVMs with a relatively similar understanding of features: lexical, syntactic and semantic.
Lexical features are unigram, bigram, trigram, wh-word, word-shapes and question-length. My approach also uses all of these features. His syntactic feature space consists of tagged unigrams, POS tags, headword and head rule (my approach also uses all of these features with different implementations). Finally, the semantic features are headword-hypernyms, related words (using WordNet), question-category (also using WordNet) and query expansion (also using WordNet). I am using or plan to use similar features with the German equivalents GermaNet and SALSA (which can be seen as the German FrameNet).

The results vary, but the best recorded result, 93.8% on the coarse classes, was reached with the back propagation neural network and the „WH+WS+LB+H+R“ feature set. The data set consists of 6 coarse classes and 50 fine classes (TREC set). The key takeaway of this work (for me) is the reduction of feature dimensions from up to 14,694 down to 400 dimensions. (Note: this chapter needs further investigation.)

ANATOMY OF A QUESTION

To treat this section exhaustively we have to consider the big picture. Besides the topic of QA (Question Answering) there exists another topic – UQA (Universal Question Answering). The raison d'être of UQA comes from the view that QA is just a factoid task, but in most cases questions don't expect a fact. For sure, many ordinary questions ask for a person, a location, a date and so on. Junta Mizuno and Tomoyosi Akiba [2] from Toyohashi University investigated answer types that are more than just factoid-based (and they called it UQA). The considered classes are Fact, Method, Opinion, Experience, Reason, Definition, Possibility, Illegal, Factoid, Multiple Types (complex) & Other – sorted in descending order by frequency of appearance. For their studies they collected a set of 1,187,873 questions.

Another consideration is the subjectivity of a question. What should a system answer when the questioner asks „What is the prettiest place in Germany?“ or, to use an example from my target language, „Was ist der schönste Platz in Deutschland?“? For sure, the system could develop a taste of its own, but this requires an internal rater that uses an algorithm to rate the content. This isn't part of ATD, so we disregard this topic at this point. My goal was just to raise awareness that a question can ask for an objective answer or a subjective impression.

Another problem is the „Explain vs. Boolean“ competition. During the creation of my answer type model I was thinking about yes/no questions. Every yes/no question expects a boolean value (e.g. „Kann man von Dresden nach Tübingen fahren, ohne umzusteigen?“ – "Can you travel from Dresden to Tübingen without changing trains?"), but some of these yes/no questions also need an explanation or an extended answer. This is the „explain“ part of the answer. It is hard to find out which question just needs a yes/no and which question wants an explanation. My current solution is that the fallback should be an explanation (when no hard yes/no decision was made) and the cream topping is a yes/no decision.

CURRENT ANSWER TYPE CLASSES & DATASET

The dataset I'm currently working with contains 8904 questions (recorded from frag.wikia.de, labeled manually by me). I've split this data into 80% for training and 20% for testing with cross-validation.

For my current approach I'm using 7 classes: DATETIME, ENTITY, LOCATION, PERSON, NUMBER, EXPLAIN and NAME. These classes are not disjoint subsets; this comes with the ambiguity of understanding a question. Below I sketch the classes from the most specific class up to the fallback class.

LOCATION → ENTITY → NAME → EXPLAIN

PERSON → ENTITY → NAME → EXPLAIN

NUMBER → EXPLAIN

DATETIME → EXPLAIN

These classes are sufficient to solve the tasks of my thesis. The fallback (e.g. no location found, but an entity found) allows the AT classifier to increase the search range and still find the right answer. Another reason to implement this fallback is the subsequent use of the system in a dialog system. To guide through a dialog the system should also (in case of uncertainty) return sentences that contain the right answer (or a hint to it).

THE V-PERCEPTION APPROACH

With the setup explained above I've developed the V-perception approach, where the features are classified on various perception levels. The idea for the name came from reading some books about neuroscience and the visual cortex. Features were set to {-1,1} for SVMs and {0,1} for NNs.



style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">


WORD LEVEL V1

The first level is the basis of feature generation. Here I've used Known Entities (e.g. „Stadt“ → location) and so on. In contrast to some approaches, I've limited the dimension of this KE feature set: for each class I've reserved just 3 dimensions. When 3 Known Entities of a specific class occur in the question, all of these 3 dimensions are activated (if just 1 KE of class x is found, then just one dimension is activated). The reason for this decision is the otherwise wasted number of dimensions and therewith the duration of the calculation.
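
A minimal Go sketch of this capped Known-Entity encoding (the entity lists and class names are illustrative assumptions, not my actual feature tables):

package main

import (
	"fmt"
	"strings"
)

// knownEntities maps each class to a few illustrative trigger words.
var knownEntities = map[string][]string{
	"LOCATION": {"stadt", "land", "ort"},
	"PERSON":   {"person", "autor", "präsident"},
}

// keFeatures reserves 3 dimensions per class and activates as many of them as
// Known Entities of that class were found in the question (capped at 3).
func keFeatures(question string, classes []string) []float64 {
	q := strings.ToLower(question)
	vec := make([]float64, 0, 3*len(classes))
	for _, class := range classes {
		found := 0
		for _, ke := range knownEntities[class] {
			if strings.Contains(q, ke) && found < 3 {
				found++
			}
		}
		for i := 0; i < 3; i++ {
			if i < found {
				vec = append(vec, 1) // activated dimension ({0,1} encoding for the NN)
			} else {
				vec = append(vec, 0)
			}
		}
	}
	return vec
}

func main() {
	fmt.Println(keFeatures("In welcher Stadt wurde der Autor geboren?", []string{"LOCATION", "PERSON"}))
}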

In previous work I found out that it is a good strategy to tackle text classification tasks with „starts with word“, „contains word“ and „ends with word“ features. These features are also part of my V1 feature set. This includes the search for question words and words that accompany question words.

GRAMMAR/QUESTION LEVEL V2

The grammar level tries to find possible grammar patterns that the various types of questions have. The current approach uses the first 6 words of the question to find patterns. For annotation I’m using the Stanford POS Tagger which uses the STTS-tag set (Stuttgart Tübingen Tag Set). In addition to that I’m searching for sentence patterns in the question.

SEMANTIC LEVEL V3

Sources [4] and [5] show the role of semantic features for ATD tasks and for QA tasks in general. I'm currently evaluating GermaNet and SALSA (the German FrameNet equivalent). SALSA will then be used to annotate semantic frames to extract the intention of a specific question. GermaNet can be used to extend the Known Entities approach with the synonyms of the nouns and verbs used in the question.

RESULTS

With the features and the dataset introduced above, I've used SVMs and NNs to test my feature sets. A short, non-exhaustive look at the results:

SVM-7,0-Poly-2,5-5,0-8,0: 96.856% acc.
NN-800-1-80: 94.554% acc.
Naïve Bayes-7: 87.76% acc.
SVM-7,0-RBF-0,05: 72.993% acc.

It seems that the feature set used for this approach outperforms other state-of-the-art approaches (especially using SVM). My assumption is that the features can be powerful, but the system needs an excellent preprocessing of all extracted features to perform well.

FURTHER WORK & TODO

Some further work and things to do are listed below:

  • Test word-shape features from [6]
  • Test freebase as source of possible features
  • Use maps application as source for location based questions
  • GermaNet extension with labeling of classes (test with NN, NE, VV)
  • SALSA annotation of question (semantic role labeling)



style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">

SOURCES

[1] Davidescu et al.; Classifying German Questions According to Ontology-Based Answer Types; Spoken Language Systems, Saarland University

[2] Junta Mizuno and Tomoyosi Akiba; Non-factoid Question Answering Experiments at NTCIR-6: Towards Answer Type Detection for Real World Questions; Toyohashi University of Technology

[3] Kirk Roberts and Andrew Hickl; Scaling Answer Type Detection to Large Hierarchies; Language Computer Corporation

[4] Xin Li and Dan Roth; Learning Question Classifiers: The Role of Semantic Information; University of Illinois at Urbana-Champaign

[5] Dan Shen, Mirella Lapata; Using Semantic Roles to Improve Question Answering; Spoken Language Systems Saarland University, School of Informatics University of Edinburgh

[6] Babak Loni; Enhanced Question Classification with Optimal Combination of Features; Department of Media and Knowledge Engineering, Delft University of Technology; 2011

SECONDARY PAPERS & PUBLICATIONS

E. Brill, S. Dumais, M. Banko. 2002. An analysis of the askMSR question-answering system. In Proceedings of the EMNLP, 257–264, Philadelphia, PA.
X. Carreras, L. Màrquez, eds. 2005. Proceedings of the CoNLL shared task: Semantic role labelling, 2005.
T. Cormen, C. Leiserson, R. Rivest. 1990. Introduction to Algorithms. MIT Press.
H. Cui, R. X. Sun, K. Y. Li, M. Y. Kan, T. S. Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of the ACM SIGIR, 400–407. ACM Press.
T. Eiter, H. Mannila. 1997. Distance measures for point sets and their computation. Acta Informatica, 34(2):109–133.
K. Erk, S. Padó. 2006. Shalmaneser – a flexible toolbox for semantic role assignment. In Proceedings of the LREC, 527–532, Genoa, Italy.
C. Fellbaum, ed. 1998. WordNet. An Electronic Lexical Database. MIT Press, Cambridge/Mass.
C. J. Fillmore, C. R. Johnson, M. R. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16:235–250.
D. Gildea, D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.
T. Grenager, C. D. Manning. 2006. Unsupervised discovery of a statistical verb lexicon. In Proceedings of the EMNLP, 1–8, Sydney, Australia.
R. Jonker, A. Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325–340.
M. Kaisser. 2006. Web question answering by exploiting wide-coverage lexical resources. In Proceedings of the 11th ESSLLI Student Session, 203–213.
J. Leidner, J. Bos, T. Dalmas, J. Curran, S. Clark, C. Bannard, B. Webber, M. Steedman. 2004. The qed open-domain answer retrieval system for TREC 2003. In Proceedings of the TREC, 595–599.
C. Leslie, E. Eskin, W. S. Noble. 2002. The spectrum kernel: a string kernel for SVM protein classification. In Proceedings of the Pacific Biocomputing Symposium, 564–575.
B. Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago.
X. Li, D. Roth. 2002. Learning question classifiers. In Proceedings of the 19th COLING, 556–562, Taipei, Taiwan.
D. K. Lin. 1994. PRINCIPAR – an efficient, broad-coverage, principle-based parser. In Proceedings of the 15th COLING, 482–488.
D. Moldovan, C. Clark, S. Harabagiu, S. Maiorano. 2003. COGEX: A logic prover for question answering. In Proceedings of the HLT/NAACL, 87–93, Edmonton, Canada.
S. Narayanan, S. Harabagiu. 2004. Question answering based on semantic structures. In Proceedings of the 19th COLING, 184–191.
S. Padó, M. Lapata. 2006. Optimal constituent alignment with edge covers for semantic projection. In Proceedings of the COLING/ACL, 1161–1168.
M. Palmer, D. Gildea, P. Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.
D. Paranjpe, G. Ramakrishnan, S. Srinivasa. 2003. Passage scoring for question answering via bayesian inference on lexical relations. In Proceedings of the TREC, 305–210.
S. Pradhan, W. Ward, K. Hacioglu, J. Martin, D. Jurafsky. 2004. Shallow semantic parsing using support vector machines. In Proceedings of the HLT/NAACL, 141–144, Boston, MA.
D. Shen, D. Klakow. 2006. Exploring correlation of dependency relation paths for answer extraction. In Proceedings of the COLING/ACL, 889–896.
R. X. Sun, J. J. Jiang, Y. F. Tan, H. Cui, T. S. Chua, M. Y. Kan. 2005. Using syntactic and semantic relation analysis in question answering. In Proceedings of the TREC.
B. Taskar, S. Lacoste-Julien, D. Klein. 2005. A discriminative matching approach to word alignment. In Proceedings of the HLT/EMNLP, 73–80, Vancouver, BC.
M. Wu, M. Y. Duan, S. Shaikh, S. Small, T. Strzalkowski. 2005. University at albany’s ilqua in trec 2005. In Proceedings of the TREC, 77–83.
T. Akiba, A. Fujii, and K. Itou. Question answering using “common sense” and utility maximization principle. In Proceedings of The Fourth NTCIR Workshop, 2004.
J. Fukumoto, T. Kato, and F. Masui. Question answering challenge (QAC-1) question answering evaluation at NTCIR workshop 3. In Proceedings of The third NTCIR Workshop, 2003.
J. Fukumoto, T. Kato, and F. Masui. Question answering challenge for five ranked answers and list answers – overview of NTCIR 4 QAC2 subtask 1 and 2 –. In Proceedings of The Fourth NTCIR Workshop, 2004.
GETA. Generic Engine for Transposable Association (GETA). http://geta.ex.nii.ac.jp.
T. Kato, J. Fukumoto, and F. Masui. An overview of NTCIR-5 QAC3. In Proceedings of The Fifth NTCIR Workshop, 2005. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/QAC/NTCIR5-OV-QAC-KatoT.pdf.
A. Tamura, H. Takamura, and M. Okumura. Classification of multiple-sentence questions. In Proceedings of the 2nd International Joint Conference on Natural Language Processing, 2005.
WAHLSTER, Wolfgang (2004): SmartWeb: Mobile applications of the Semantic Web. In: Adam, Peter / Reichert, Manfred, editors. INFORMATIK 2004 – Informatik verbindet, Band 1. Beiträge der 24. Jahrestagung der Gesellschaft für Informatik e.V. (GI), Ulm. (Web: www.smartweb-project.de)
CRAMER, Irene / LEIDNER, Jochen L. / KLAKOW, Dietrich (2006, to appear): Building an Evaluation Corpus for German Question Answering by Harvesting Wikipedia. Proceedings of The 5th International Conference on Language Resources and Evaluation, Genoa, Italy.
LI, Xin / ROTH, Dan (2002): Learning Question Classifiers. COLING'02.
LI, Xin / HUANG, Xuan-Jing / WU, Li-de (2005): Question Classification using Multiple Classifiers. Proceedings of the 5th Workshop on Asian Language Resources and First Symposium on Asian Language Resources Network.
ZHANG, Dell / LEE, Wee Sun (2003): Question classification using support vector machines. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, Toronto, Canada.
DAY, Min-Yuh / LEE, Cheng-Wei / WU, Shih-Hung / ONG, Chorng-Shyong / HSU, Wen-Lian (2005): An Integrated Knowledge-based and Machine Learning Approach for Chinese Question Classification. IEEE International Conference on Natural Language Processing and Knowledge Engineering.
SONNTAG, Daniel / ROMANELLI, Massimo (2006, to appear): A Multimodal Result Ontology for Integrated Semantic Web Dialogue Applications. Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy.
VOORHEES, Ellen (2001): Overview of the TREC 2001 Question Answering Track. Proceedings of the 10th Text Retrieval Conference, NIST, Gaithersburg, USA.
WITTEN, Ian H. / FRANK, Eibe (2000): Data Mining – Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers. (Web: http://www.cs.waikato.ac.nz/ml/weka)
DUDA, Richard O. / HART, Peter E. / STORK, G. (2000): Pattern Classification. New York: John Wiley and Sons.
MITCHELL, Tom M. (1997): Machine Learning. Boston, MA: McGraw Hill.
Abney, S. P. 1991. Parsing by chunks. In S. P. Abney R. C. Berwick and C. Tenny, editors, Principle-based parsing: Computation and Psycholinguistics. Kluwer, Dordrecht, pages 257–278.
Carlson, A., C. Cumby, J. Rosen, and D. Roth. 1999. The SNoW learning architecture. Technical Report UIUCDCS-R-99-2101, UIUC Computer Science Department, May.
Even-Zohar, Y. and D. Roth. 2001. A sequential model for multi-class classification. In Proceedings of EMNLP-2001, the SIGDAT Conference on Empirical Methods in Natural Language Processing, pages 10–19.
Fellbaum, C., editor. 1998. WordNet: An Electronic Lexical Database. The MIT Press, May.
Hacioglu, K. and W. Ward. 2003. Question classification with support vector machines and error correcting codes. In Proceedings of HLT-NAACL.
Harabagiu, S., D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, and P. Morarescu. 2000. Falcon: Boosting knowledge for answer engines. In E. Voorhees, editor, Proceedings of the 9th Text Retrieval Conference, NIST, pages 479–488.
Hermjakob, U. 2001. Parsing and question classification for question answering. In ACL-2001 Workshop on Open-Domain Question Answering.
Hirschman, L., M. Light, E. Breck, and J. Burger. 1999. Deep read: A reading comprehension system. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 325–332.
Hovy, E., L. Gerber, U. Hermjakob, C. Lin, and D. Ravichandran. 2001. Toward semantics-based answer pinpointing. In Proceedings of the DARPA HLT conference.
Lee, L. 1999. Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 25–32.
Lehnert, W. G. 1986. A conceptual theory of question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, CA, pages 651–657.
Li, X. and D. Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), pages 556–562.
Li, X., K. Small, and D. Roth. 2004. The role of semantic information in learning question classifiers. In Proceedings of the First Joint International Conference on Natural Language Processing.
Littlestone, N. 1989. Mistake bounds and logarithmic linear-threshold learning algorithms. Ph.D. thesis, U. C. Santa Cruz, March.
Moldovan, D., M. Pasca, S. Harabagiu, and M. Surdeanu. 2002. Performance issues and error analysis in an open-domain question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 33–40.
Pantel, P. and D. Lin. 2002. Discovering word senses from text. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Pinto, D., M. Branstein, R. Coleman, M. King, W. Li, X. Wei, and W.B. Croft. 2002. Quasm: A system for question answering using semi-structured data. In Proceedings of the Joint Conference on Digital Libraries.
Punyakanok, V. and D. Roth. 2001. The use of classifiers in sequential inference. In Proceedings of the 13th Conference on Advances in Neural Information Processing Systems, pages 995–1001. MIT Press.
Radev, D. R., W. Fan, H. Qi, H. Wu, and A. Grewal. 2002. Probabilistic question answering from the web. In Proceedings of WWW-02, 11th International Conference on the World Wide Web.
Roth, D. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proceedings of 15th National Conference on Artificial Intelligence (AAAI).
Roth, D., C. Cumby, X. Li, P. Morie, R. Nagarajan, N. Rizzolo, K. Small, and W. Yih. 2002. Question answering via enhanced understanding of questions. In Proceedings of the 11th Text Retrieval Conference, NIST, pages 592–601.
Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.
Singhal, A., S. Abney, M. Bacchiani, M. Collins, D. Hindle, and F. Pereira. 2000. AT&T at TREC-8. In E. Voorhees, editor, Proceedings of the 8th Text Retrieval Conference,NIST.
Voorhees, E. 2002. Overview of the TREC-2002 question answering track. In Proceedings of the 11th Text Retrieval Conference, NIST, pages 115–123.
Zhang, D. and W. Lee. 2003. Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR conference, pages 26–32.
S. Harabagiu, D. Moldovan, M. Pasca, M. Surdeanu, R. Mihalcea, R. Girju, V. Rus, F. Lacatusu, P. Morarescu, and R. Bunescu. 2001. Answering Complex, List and Context Questions with LCC’s Question-Answering Server. In Proceedings of the Tenth Text REtrieval Conference.
S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, A. Hickl, and P. Wang. 2005. Employing Two Question Answering Systems in TREC 2005. In Proceedings of the Fourteenth Text REtrieval Conference.
Andrew Hickl, Patrick Wang, John Lehmann, and Sanda Harabagiu. 2006a. Ferret: Interactive Question-Answering for Real-World Research Environments. In Proceedings of the 2006 COLING-ACL Interactive Presentations Session.
Andrew Hickl, John Williams, Jeremy Bensley, Kirk Roberts, Ying Shi, and Bryan Rink. 2006b. Question Answering with LCC’s Chaucer at TREC 2006. In Proceedings of the Fifteenth Text REtrieval Conference.
Andrew Hickl, Kirk Roberts, Bryan Rink, Jeremy Bensely, Tobias Jungen, Ying Shi, and John Williams. 2007. Question Answering with LCC's Chaucer-2 at TREC 2007. In Proceedings of the Sixteenth Text REtrieval Conference.
V. Krishnan, S. Das, and S. Chakrabarti. 2005. Enhanced answer type inference from questions using sequential models. In Proceedings of EMNLP.
X. Li and D. Roth. 2002. Learning question classifiers. In Proc. the International Conference on Computational Linguistics (COLING).
George A. Miller. 1995. WordNet: a lexical database for English. Communications of the Association for Computing Machinery, 38(11):39–41.
Christopher Pinchak and Dekang Lin. 2006. A Probabilistic Answer Type Model. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 393–400.
Sanda Harabagiu, Dan Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu, Razvan Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu. 2000. FALCON: Boosting knowledge for answer engines. In Proceedings of the 9th Text REtrieval Conference.

Edit Distance revisited

Hi folks. After introducing some Natural Language Processing stuff on my blog, this article points to a distance metric that is commonly used to correct misspelled words. Some mobile phones use this algorithm to correct the input of SMS messages and so on. I'm talking about Edit Distance [1]. This metric is well proven by a lot of scientists and applications that use it. Recently, however, Edit Distance is often misused for a lot of applications, for instance text classification. For sure, this procedure can be used to classify text, but there are a lot of sticking points.

After some text classification (more than 5-6 words per input) experiments with Edit Distance, I’ve noticed:

  • Edit Distance works reasonably well (tolerable results, but not good enough), but with some text constellations it tends towards odd results.
  • The procedure aligns the two strings by their most similar parts (according to their positions). This can be a good fit or a bad one.
  • In most cases the deleted, inserted or substituted token does not correspond with the “correct” counterpart.

Substitution Distance

To overcome these problems I've extended the original Edit Distance with a preprocessing step. To calculate the distance between two strings of more than 5-6 words, a purely tokenwise procedure is a conceivable but not the best choice. Therefore I propose to delete all words that the two compared strings have in common. The total length of these common words then becomes the (negative) starting point of the Edit Distance calculation. Below you can see the pseudo-code for the described algorithm.

1. Set score = 0
2. Find similar sub phrases in both phrases
3. Determine the length of all similar sub phrases
4. Set score = length_of_similar_subphrases * (-1)
5. Delete similar sub phrases from both phrases
6. Apply Edit Distance to the rest of the phrases
7. Set ED_Score = value of Edit Distance
8. score = score + ED_Score

After this, score contains the Substitution Distance between the two strings. The lower this value is, the more the two strings have in common (small distance).
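
Here is a minimal Go sketch of the procedure; "similar sub phrases" are simplified to whole-word matches, so treat it as an illustration rather than the original implementation.

package main

import (
	"fmt"
	"strings"
)

// levenshtein is the classic edit distance on runes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev = curr
	}
	return prev[len(rb)]
}

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// substitutionDistance follows the pseudo-code above, with whole words standing
// in for "similar sub phrases".
func substitutionDistance(a, b string) int {
	wordsA, wordsB := strings.Fields(a), strings.Fields(b)
	common := map[string]bool{}
	for _, wa := range wordsA {
		for _, wb := range wordsB {
			if wa == wb {
				common[wa] = true
			}
		}
	}
	score := 0
	var restA, restB []string
	for _, w := range wordsA {
		if common[w] {
			score -= len(w) // steps 3 + 4: negative length of the common parts
		} else {
			restA = append(restA, w)
		}
	}
	for _, w := range wordsB {
		if !common[w] {
			restB = append(restB, w)
		}
	}
	// steps 6 - 8: Edit Distance on the remainder, added to the (negative) score
	return score + levenshtein(strings.Join(restA, " "), strings.Join(restB, " "))
}

func main() {
	fmt.Println(substitutionDistance("das ist ein kleiner Test", "das war ein kleiner Versuch"))
}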


style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">

Experiments

In an experiment with a proprietary corpus containing ~1530 phrases, the Substitution Distance outperforms the Edit Distance in accuracy. The input was preprocessed with the Lucene GermanStemmer and 10-fold cross-validated for testing. Substitution Distance reached 92.69% and Edit Distance 89.97% accuracy.

Possible Enhancement

To extend the Substitution Distance method, a trick can be used. Consider two words that are linguistically related to each other (example below). These words are not exactly the same, but they are related. The method introduced above wouldn't match them.

Example:

 extension
 extensible

The example above shows two words that have 7 consecutive tokens in common. By introducing a threshold value that marks related words as 'the same' and deletes them (step 5), the accuracy of Substitution Distance could be improved. A possible threshold is 5 (or 6, 7 or 8). With this enhancement a stemming-like functionality can be simulated.
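
A small Go sketch of this related-word check, under the assumption that "consecutive tokens in common" means a shared prefix as in the example above; the threshold value is illustrative.

package main

import "fmt"

// related reports whether two words share at least threshold consecutive
// characters at the start – a stemming-like approximation of "the same word".
func related(a, b string, threshold int) bool {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < len(ra) && n < len(rb) && ra[n] == rb[n] {
		n++
	}
	return n >= threshold
}

func main() {
	fmt.Println(related("extension", "extensible", 5)) // true: "extensi" is shared
	fmt.Println(related("extension", "editing", 5))    // false
}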

Conclusion

Certainly, the Substitution Distance can’t fix all weaknesses of Edit Distance, but the results are significantly more accurate using SD. The following problems aren’t solved:

* After deleting the sub phrases (step 5), Edit Distance still deletes, substitutes and inserts tokens that have no relevance to their counterparts (in the comparison string).

* The residuals of the phrases (also after step 5) mostly carry more semantic than syntactic meaning.

Indeed, the performance of Substitution Distance is improved compared to Edit Distance, but distance metrics have one limitation: the comparison of phrase syntax is very restricted. A real text classification task needs semantic constraints, which a distance metric like ED or SD can't provide.


style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">

Sources

[1] Navarro, Gonzalo; „A guided tour to approximate string matching“, ACM Computing Surveys (CSUR), March 2001

Basketball: Making the game a bit more predictable!

I'm a real baller, so I tried to understand the game that I play on the court and watch on television. This was the motivation for this merry and fancy experiment. Sure, the game isn't that unpredictable, but I was amazed how easy it is to obtain good results with the simplest AI models.


style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">

Introduction

Result prediction is a very interesting field in every sport. Based on statistics and hidden dependencies, an artificial intelligence can learn to predict a final score. Current news shows the relevance of this topic: Warren Buffett will pay a billion dollars if someone provides a complete and correct bracket for the upcoming March Madness event in US college basketball.
A look through the sports betting landscape reveals some other platforms that offer score prediction, like vitibet.com and scorepredictor.net. These interesting tools predict the winner of a game based on some statistics. Unfortunately, I had no accuracy statistics for their algorithms and no time to play around with these tools to get a qualified answer. So I started with some feature ideas and preprocessing of them. The first results are listed below together with the next steps and future plans for this project.

Dataset

I retrieved data on NBBL games (German youth league) from nbbl-basketball.de. For each team, I collected the data manually. It was simple and meditative work. After a rave of statistics and 2 hours of time, the data was collected and – after some preprocessing steps (relations between statistics) – ready for some tests.

First Results

The first task was to estimate the winner and loser of a basketball game with given result and statistics using a neural network. The configuration of the neural net was optimized as long as better results occurred. Some feature improvements using PCA (Principal Component Analysis) and GLVQ (Generalized Learning Vector Quantization) brought more accurate results. I divided the data into a training set and a test set with an 80/20 ratio. The results below can be read as: 91.62% of all games were predicted correctly.

  • 88.94% … Without Preprocessing
  • 90.02% … With Preprocessing
  • 91.62% … Baseline Feature Set

After this experiment, a simple predictor was trained to estimate basketball games with a good baseline accuracy (91.62%). The second experiment was a bit more complex: the goal was to find the winner of a match between teams A and B that have never played each other in the past, also taking current shape and tendency into account. In more colloquial words: who wins a matchup that has never happened before?

Text Classification for PHP Applications

The task of text classification is one of the oldest in Natural Language Processing and was well studied in the past. There are some common methods for classification using Artificial Neural Networks, Support Vector Machines and other machine learning models. The problem with these approaches is the amount of training data that has to be found, normalized and labeled for a specific system. Another problem comes with the selection of features, which – in most cases – wastes a lot of time. Because of that, a simpler algorithm is needed that allows people who are not interested in machine learning to classify text. Some approaches that meet these requirements already exist. This article provides a short introduction to one of these algorithms (NCD) as well as an implementation in the PHP scripting language. The code can simply be used in your own PHP project.


style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">

Introduction

Why should we choose compression algorithms for classification? The answer is easy: a compression algorithm tries to find a way to minimize data, and one way to do that is to find similarities and reduce them. To tackle this task without a computer – just with paper and pen – how would you compress the following examples:

(I) aaaaaabbbbbbcccccc

and

(II) ccccccbbbbbbffffff

I decided to use a simple compression rule (just for this example).
A possible result for the first example: 6a6b6c
And for the second example: 6c6b6f
If we concatenate both examples and compress the result: 6a6b12c6b6f
The length of examples (I) and (II) is 18 tokens each. The compressed version of each example uses 6 tokens, so we compressed an 18 token string into a 6 token string. After concatenation one could expect the compressed string to use 6 + 6 = 12 tokens, but the compressed concatenation (6a6b12c6b6f) uses just 11 tokens. Our simple compression algorithm has found a similarity (the c sequence at the end of example (I) and at the start of example (II)). For more complex texts a good compression algorithm finds similarities in words, structures and so on. To conclude: the more the compressed length of the concatenation falls below the sum of the two individually compressed lengths, the more similar the two strings are. To express it in more colloquial words, the length of the compressed string depends on the similarity of the two input strings.

Normalized Compression Distance

The NCD algorithm is normally used to measure the quality of compression algorithms. For that, some theoretical considerations including the Kolmogorov complexity are necessary. Text classification using NCD is a bit less complex, but not feasible without some definitions:

  • Z(x), Z(y)… Compression function (in theoretical considerations: a function that is able to build a shorter representation of a string or token stream), returns the length of the compressed representation
  • min(x,y)… Minimum function, returns x when x < y and y when y <= x
  • max(x,y)… Maximum function, returns x when x > y and y when y >= x
  • NCD(x,y)… Normalized Compression Distance (calculates the compression distance with the formula below)
  • x, y, xy… Variables and Concatenation (x and y are variables, xy is the concatenation of x and y, which means the two token streams are combined together)




style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">


With these definitions and the related formula you can calculate the NCD for your input string (test string) against all of your known phrases. The comparison with the lowest NCD value is most likely to be the correct class.
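
For illustration, here is a minimal Go sketch that computes the NCD with gzip as the compression function Z; it is independent of the PHP class used below.

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// z returns the length of the gzip-compressed representation of s (the Z function).
func z(s string) float64 {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write([]byte(s))
	w.Close()
	return float64(buf.Len())
}

// ncd implements NCD(x, y) = (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y)).
func ncd(x, y string) float64 {
	zx, zy, zxy := z(x), z(y), z(x+y)
	lo, hi := zx, zy
	if zy < zx {
		lo, hi = zy, zx
	}
	return (zxy - lo) / hi
}

func main() {
	fmt.Println(ncd("hi my name is alex, what about you?", "hi guy, how are you?"))
	fmt.Println(ncd("buy viagra or dope", "hi guy, how are you?"))
}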

PHP Implementation

I’ve implemented the NCD approach in PHP and the class is very easy to use. You can use this classifier to detect spam or classify other input. The example below shows the usage of the code:

require 'ncd.class.php';

$c = new NCD();

$c->add(array(
	"hi my name is alex, what about you?" => 'ok',
	"are you hungy? what about some fries?" => 'ok',
	"how are you?" => 'ok',
	"buy viagra or dope" => 'spam',
	"viagra spam drugs sex" => 'spam',
	"buy drugs and have fun" => 'spam'
	));

print_r($c->compare('hi guy, how are you?'));
print_r($c->compare('buy viagra m0therfuck5r'));

The add() method requires an array that contains the phrases for the comparison and the corresponding classes. You can call the add() method as often as you want to add more examples. It is important to provide good classification examples to the classifier; poorly chosen examples can hurt the classification performance.

The print_r() methods used in the fragment above output the following lines:

Array
(
    [class] => ok
    [score] => 0.39285714285714
)
Array
(
    [class] => spam
    [score] => 0.48387096774194
)

As can be seen above, “hi guy, how are you?” was successfully classified as 'ok' and “buy viagra m0therfuck5r” was correctly classified as 'spam'. The score value gives an insight into the relation between the input string and the known string with the minimal distance to it. If you're testing your classifier and this value is too high for a clear example, then you should add another comparison example for this class using the add() method.

Get the code here.

I hope you enjoyed this article. Have fun with the code and don’t hesitate to contact me if you have any suggestions or ideas.

Emmy – The Autonomous RC Car

Hello World, I would like to introduce a new interesting project to cip-labs. This project deals with a 1:10 RC car named Emmy. The goal of the project is to build a car that is able to drive autonomously on the street without any human intervention. Furthermore, Emmy should be able to follow recognized objects and explore the environment.

Motivation

It is not difficult to motivate this project. The latest developments in artificial intelligence allow some interesting use cases for the automotive industry, for instance the detection of moving objects and a lot of other relevant things. An ordinary car has a lot of controllers, sensors and so on. With Emmy, the challenge is the limited hardware resources. Building a real-time system that is able to control the behavior of a driving car (40 km/h – 80 km/h) is a very demanding task and needs a lot of optimization.


style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">

Technical stuff

The car comes from Amazon. After building up Emmy, we need a Raspberry Pi (700 MHz, 512 MB RAM) including a camera module to capture pictures and process them. The Raspberry Pi will be linked to the steering and the engine of the car using a breakout board. To avoid unwanted collisions and to track the road we are using a picture stream of 640×480 pixels at up to 90 frames per second, which should allow us to drive 50 km/h or more. To process this data in real time we plan to use an external computing source connected via a Wi-Fi stick. Below you can see a list of all necessary components:

Intelligent behavior

Building an RC car from scratch is not a challenge at all. The main goal is to bring some intelligent algorithms to Emmy so that the car is able to drive smart. Some key features are:

  • Drive and remain on the street (straight, curve, crossing)
  • Collision control (using infra-red or grayscale)
  • Detect objects and follow them
  • Explore the environment without crashing
  • Map the environment to a digital map to revisit places



style="display:block"
data-ad-client="ca-pub-2250494829484781"
data-ad-slot="4643133154"
data-ad-format="auto">


For some of these tasks, a few algorithms exist. The main purpose is to find an optimal algorithm that runs robustly on the limited hardware.

Team

I’m not alone. For the first time on cip-labs, this project runs with a cool project team. Gentlemen, a warm welcome to you!

Alexander Bresk – Initiator/Coding Artificial Intelligence

Sebastian Kuhz – Car Engineering

Felix Wolff – Electrical Engineering

Paul Wersch – Case Modding/Design of Emmy

To cut my huge enthusiasm about this project in a short sentence: I’m really excited and hope that we can achieve all goals with Emmy!