Archive for the 'Projects' Category

Dec
14

Exactly one year ago, I came out with an idea of a neuron based data structure. In the first article about this idea, I tried to give an overview about, how this data model should work. In the article below, I introduce you to the implementation of this model. It’s still in development, but all functions which were needed are already implemented.

What is NBDS?

The neuron based data structure (called NBDS) follows the idea, to keep an information as an atomic part. The model contains three parts. The first part is the Neuron, which acts like a container for data. The second part is the Axon. This axon connects two neurons together and it can still contain information (data about the connection or relation). The last part is the Space. In a Space you put neurons and axons together and run some operations on it. You can imagine the space as a component, that brings the order into the set of neurons and axons.

To keep it short and simple: NBDS represents a directed graph. The direction of the axons (edges) can be one-directional or bi-directional. It has something in common with the network data model, but NBDS has more flexibility. It has also something in common with the object-oriented data model, but it can have no, one or more connections to other objects (and the connection itself can also contain information)

Neuron

Biological motivation

Neurons are nerve cells of the brain. This nerve cells are connected to each other with dendrites and axons. When an electrical impulse comes into (from another neuron over dendrites and axons) the neuron have an electrical potential. When this potential reaches a special threshold value, the neuron fires also an electrical impulse out to other neurons. Thats is, how your brain works. In NBDS, the neuron haven’t a threshold value, but it has a counter, that counts, how often the neuron was reached during the operations on the Space. With this counter, you can see, how often your information was affected.

Functionality

A neuron reflects a data container. It contains the ID, name, value, description, generic attributes (key,value) and a counter (how often was the neuron affected in operations on the space). It can have connections to other neurons about axons.

Axon

Biological motivation

The axon (and also a dendrite) connects the neurons with each other. It forwards the electrical signal from the source neuron (or the retina) to the destination neuron (or more than one).

Functionality

In this data model, the axon does some more than just forward an impulse or signal. It has also an ID, name, value, description, generic attributes (key,value), a used counter, the bi-directional option and the id’s of the involved neurons.

Space

Functionality

A space, in this case, is a container that holds neurons, axons and the connections between them together. In the NBDS you can perform some operations on the Space to get out the information from the neurons and axons. Furthermore it contains some lists, to perform operations on the Space faster.

Universe

The universe array is a structure, that represents the connections between neurons. It was designed as a fast lookup table for operations on the space. Nearly all methods, that need to found out a fast connection between to neurons, using this array for a lookup.

Mnemonics

ln the real world, mnemonics are things that you know from the past. In this data model its similar. The mnemonic array hold route operations (from the past) from neuronX to neuronY. It contains all successful routes, that the route() method detects.

Operations

Serialization

The NBDS allows to serialize and unserialize a set of data. It offers two methods. The first method is to save the data as a XML file. This method is really simple and easy to import to other systems. But the XML serialization has one problem. It is not able to store objects. This affects just nested objects, that you hold in an axon or neuron.

The second option for serialization is to save it with the serialize() function of PHP. This method fix the problem wit nested objects and can store the data binary or plain in an JSON like format.

Route

The route operation finds a connection between two neurons (over axon(s)). (assuming, that this connection already exists)

Select

With this operation you can request several parts of the data structure. You can, for instance, search for all NBDS_NEURON objects which have a name NBDS_LIKE ’special_neuro%’.

Selecti

The selecti operation realizes more than one select() operations in one step. You can define $operation_arrays to declare more than one criteria for the request.

Connection

Connection() returns a component of a connection. It uses the route() operation internally to do a lookup (from_neuron – to_neuron). After that, it returns NBDS_NEURON or NBDS_AXON (or both -> NBDS_ALL) as objects.

Neighbor

This operation calculates the neighbors of a neuron. Optionally, you can set an argument ($steps), which returns the neighbors with a given distance (so called degree).

Translate

Translate performs a lookup on the _axons or _neurons array, to find out, which neuron or axon has the id, that was given as an argument. It returns the objects, that belongs to the given ID’s.

Get

A simple get() request returns some parts of the space, for instance NBDS_NEURON, NBDS_AXON, NBDS_UNIVERSE or NBDS_MNEMONICS.

Use cases

Below, I collected some ideas for an application of the NBDS:

Data model

The concept of NBDS deals with the representation of data. A bunch of data has a connection to another bunch of data. The bunch of data is very flexible an generic. Furthermore, the connection between the bunches of data has also data, that describes it.

Ontology

An ontology is a relation between words (synonym, etc.). With the NBDS you can represent this ontology and you can request synonyms of a word. You can also classify the axons as a special relation between words.

Intelligent graphs

You can use NBDS as an intelligent graph. A graph, which you can use for routing operations. After that, you can find the fastest path and the information behind it.

Data relationships

Organizing data (atomic data) with relations is one of the main advantages of the NBDS. You are able to organize nested informations with the neurons. For instance, you can put a Space.class.php object into a neuron. This means, that you can use recursive structures to store your data.

System requirements

The system requirements are just PHP 5.x or higher.


goto: download (version 1.0.0)

goto: project page

Nov
21

Some time ago, I thought about that it would nice to have a smart piece of code, that calculates the result of a given formula. The next step, I thought about, was to integrate customized functions into the code. So, I wrote a parser, that is able to do all this things. This article deals with this parser and how you can work with it. The code is really easy to understand and the parser is easy to use.

First considerations

At first you have to consider some classes, that are able to abstract all the structures you need.

class Morphem

A morphem is the atomic part of every language. In our case, a morphem represents the tokens of a formal language, for instance:

(, ), +, -, *, /, sin, cos, tan, sqrt, exp, ln, log10, -912, 12

The morphem class represents a morphem, after the lexer has detected it.

class Lexer

The lexer has the job to detect morphems in the formula. With some simple rules, the lexer detects every morphem from the left to the right:

  • FVAL – function value like (sin, cos, log)
  • DVAL – double value like (1.4, 3.1, 5, 7, 11.11)
  • CVAL – character value like (+,(,),-)
  • NOVAL – no value, tokens that are not defined
  • FINISHED – the lexer finished with ‘\0′ in the string

This class returns the current morphem to the parser. The parser uses this morphem, to calculate the result of your formula with the rules of the grammar.

class Parser

An important part of a good working parser is a good grammar. In this case, I choose a case-sensitive grammar. A grammar is a formal construct, to define, how a language is structured. In this case, the language represents all valid formulas. The following grammar was used in the code:

E -> T | T + E | T - E
T -> F | F * T | F / T
F -> (E) | N | -N | sqrt(E) | sin(E) | cos(E) | tan(E) | exp(E) | ln(E) | log10(E)
N -> I | I .D
I  -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 1I | 2I | 3I | 4I | 5I | 6I | 7I | 8I | 9I
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0D | 1D | 2D | 3D | 4D | 5D | 6D | 7D | 8D | 9D

This grammar is realized in the parser class. The customized functions, that you can integrate into the parser comes in at F. Here are laying your functions.

HowTo work with

Here you can see a simple example, that should match all cases of application.

First example

require 'Parser.class.php';
$parser = new Parser('sin(4)+cos(4)');
$result = $parser->run();
echo 'result is: ' , $result , PHP_EOL;
//outputs
result is: -1.4104461161715

As you can see in the example above, it is really easy to use the parser. You just have to type in the formula that you want to parse and nothing more.

The parser knows some functions by default:

  • sqrt(x) square root of x
  • exp(x) – e^x
  • sin(x) – sinus of x
  • tan(x) – tangens of x
  • cos(x) – cosinus of x
  • log10(x) – logarithm base 10 of x
  • log(x) – logarithm naturalis of x

So, it isn’t a problem for the parser to calculate:

(3*sin(5*cos(3))+1+ln(sin(3)))/sqrt(2)

source: wolframalpha.com

which is (output from the parser): 1.3842259196509

Extending functionality

Sure, I built in some functions, that are important for mathematical calculations, but maybe you want some more functions, to serve your needs. That is no problem:

require_once 'Parser.class.php';

function divide2($x){
return $x = $x / 2;
}

function plusRandom($x){
return $x + rand(0,10);
}

$parser = new Parser('div2(sin(pr(1)))');
$parser->addFunction('div2', 'divide2');
$parser->addFunction('pr', 'plusRandom');
echo $parser->run();
//outputs the sinus from a number from a random number in the scope of 1-11 divided by 2

Note: Every function must have one argument (not more and not less).

Symbol table

The symbol table includes all functions that you had added or where built-in. I kept this structure static, so that a new parser object has access to it. Furthermore the parser needs this structure static, because in some cases we are starting a new parser recursively. The table is an array of the lexer class and you can access it with:

Lexer::$_userFunctions[$name_in_formula] = $real_function_name;

Using the calculator

May you want to calculate a series of results for one formula, to draw a graph in a coordinate system or to simply calculate a series of results from a formula with a variable. Then you should use the Calculator.

$calc = new Calculator();
$calc->options('{x}', 0, 10, 0.5);
$calc->addFunction('div2','divide2');
print_r($calc->calculate('sin({x})+cos({x})');

The calculator runs the formula with the variable {x} starts with 0 and runs up to 10 with the step rate of 0.5. That means, the calculator runs for {x} = 0, {x} = 0.5, {x} = 1.0, {x} = 1.5 …. {x} = 10.0. The keys of the result array are the current step and the values are the result of the calculation.

Wishlist

On my wishlist for this further development of this project:

  • n ^ m
  • n!
  • n % m
  • functions with more then one parameter

Get it!

You can download the files on the project page. There I will create a little documentation of the code for some hacking fun. Maybe you found it interesting to work with this little tool. So, please let me know and write a comment or a mail. I would really like to get some feedback and wishes, that I can add to the code.

goto: download

goto: project page

Nov
01

Some month ago, I came out with an article about Benfords Law. This article deals with the summation and product formula based implementation in C. The algorithm is indeed cool for small statistical games, but not nice to use, when you have a set of numbers.

So, I developed a small class, that serves this need. Furthermore, this class is able to calculate the geometrical, harmonical and quadratical averages.

After including (require, include) this class, you can create a new object with

$benford = new Benford(array(1341,82,141,215), 10);

The first parameter is the array with the values to calculate. The second parameter is the basis of the number system. This parameter is 10 by default. (10 means a normal number system from 0..9). You can add more values with:

$benford->addValue(array(9,128,412));

or simply:

$benford->addValue(3222);

After that, you have two options:

You can perform a run through the given set of numbers and get back the values of the Benford calculation. The second option is to use an average function. When calling an average function, the run()-function will be also called internally.

Get back an array with the Benford calculation:

$benford = new Benford(array(232, 211, 410, 301, 508, 192));

$result = $benford->run();

This returns an array with all the Benford results from the given data. Below you can see the code to perform the average functions.

$benford = new Benford(array(232, 211, 410, 301, 508, 192)); 

$result_harm = $benford->getHarmonicalAverage();

$result_geom = $benford->getGeometricalAverage();

$result_quad = $benford->getQuadrateAverage();

The script is very easy to use. Have fun with it!

Download the source file.

Oct
19

The most optimization tricks for MySQL focus on query performance or server tuning. But the optimization starts with the design of the database schema. When you forget to optimize the base of your database (the structure), then you will pay the price of your laxity from the beginning of your work with the database. Sure, every storage engine have his own advantages and disadvantages. But regardless of the engine you choose, you should consider the following chapters in your database schema. In this article, I will write about some architectural considerations, that you should keep in mind, when you design a database, that will run under heavy load. This article is a part of the web-application performance series.

Read-optimized tables

In current web applications, the frequency of reading from a database is much likely, then writing data to it. For this reason, the database server have to scan a lot tables, to deliver the data. The most tables, that are including the (so called) master data are very huge. This tables have a big row size and they are very costly to scan. For this reasons it is a well-known method to split up this table. You can hold the data, that is most likely involved in scanning processes, in a seperate table, to avoid a scan about the whole data. A problem occures, when you want do query the full data set. Here you have two options:

  1. execute a join between the read-optimized table and the other table (using a foreign key in the read-optimized table)
  2. denormalize the tables with an attribute, that both tables have (and that is indexed two times)

The first alternative is not oriented on performance aspects, but it fullfills all rules of normalization. The second alternative is a denormalized table, that includes some redundancy in data. Most web applications run with little redundancies in there databases to use the performance improvement of this read-optimized tables. See also the chapter: “denormalize tables”.

Denormalize tables

As I said in the chapter above, you should denormalize tables, when your joins between two and more tables are to slow. To denormalize should be the last step, after all of your architectural repertoire is exhausted. Because denormalized tables mean to have some data not only in one column. Of course, some inconsistencies can occure and the database is stronger to handle. The administrative outlay will increase, if you denormalize to much of your data.

Denormalization is useful, if you split one table into two tables and you don’t want to join this tables together in a query. In most cases, the database would execute a join with over 2 tables to get the information. This is really costly for current web applications! But be aware, think about, which denormalization is necessary and which is not.

Artificial key columns

If you have a fast natural key (that results from the data analysis) don’t add a artifical key to it. Remember that all keys are need disk or/and memory space. But you have to weigh up whether your natural key is fast enough or not. Maybe your key is alphanumeric and is to large to be efficient under high load. Then you should use a numeric and artificial primary key.

Bigint vs. unsigned

It is important for the performance to choose the right data type for the primary key. Consider, if you really need a BIGINT or INT. Maybe the MEDIUMINT datatype can fix your needs. The difference between INT and MEDIUMINT is, that you save 25% of space with using MEDIUMINT. You should use the smallest data type for 2 reasons. The first reason is, that the row will be smaller and the second reason is, that the index of the primary key will me smaller and faster. Another idea is, to use the UNSIGNED part of a data type. In most cases, the range of UNSIGNED INT will suffice. Of course, you can apply this consideration for all numerical fields (and not just for key columns).

ranges of data types

Type Storage Minimum Value Maximum Value
(Bytes) (Signed/Unsigned) Signed/Unsigned)
TINYINT 1 -128 127
0 255
SMALLINT 2 -32768 32767
0 65535
MEDIUMINT 3 -8388608 8388607
0 16777215
INT 4 -2147483648 2147483647
0 4294967295
BIGINT 8 -9223372036854775808 9223372036854775807
0 18446744073709551615

source: mysql.com

Text vs. varchar

This is a one of the most rookie mistakes with MySQL. Some people think, that it is better to use TEXT instead of VARCHAR. One point is that VARCHAR has a defined length and TEXT hasn’t. But that is not the last one. Furthermore TEXT needs more memory for sorting. However, you should use TEXT just as a “storage” data type and VARCHAR as small field of textual information (or as a key).

NOT NULL

If it is possible, declare all columns as NOT NULL. Every column needs a bit more and this is not necessary. When you really need this column, you should use it. But when you can avoid this, you should avoid it. MySQL does more working steps, if you set NULL to a column.

Other considerations

  • Do not index columns that you not need in a select
  • Use clever refactoring to admit changes to your schema
  • Choose the minimal character set, that fits your needs
  • Use triggers just, when you really need it
Aug
08

This article deals with the history of encryption and is the first part of two articles about this topic. All techniques were described in a chronological order. In this part of the article I deal with:

  • Introduction
  • Scytale and transposition cipher
  • ROT 13 and Caesar cipher
  • Maria Stuart and Babington Plot
  • Le Chiffre indéchiffrable (Vigenére square)
  • One-Time-Pad

Introduction

With the use of public transmission paths for discrete information, it was neccessary to transmit the information safe and secret. A Information represents (at least for sender and receiver) a sequence of tokens that makes sense to them. Without any sense, it would be just a message. We can define the word information as a message with sense and meaning. Just the sender and the receiver know the logic of the message (to create an information). Well, this is the ideal case, but it is not the reality. Normally, a few more people as the sender and the receiver know the logic and the meaning of a sequence of tokens. The sender and, it’s counter part, the receiver have to consider a technique, to keep the message secure. In the history of humantiy, we found a few examples for encryption methods with political and economical background. But the encryption of private messages was also used, to keep a affair secret. A lot of women used cryptography to send intimate messages to her lovers. In this article, I will try to cover the important examples, to show how necessary encryption is.

The franziscian-monk Roger Bacon was the first, who published a book about encryption methods “Abhandlung über die geheimen Künste und die Nichtigkeit der Magie” (Treatise on the secret arts and nullity of magic). This was happened in the 13th century.  Bacon mentioned a few algorithms to encrypt a text with the normal alphabet. For instance “Atbash” (Atbasch) is a simply monoalphabetic substitution with the following rules.

source A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
target Z Y X W V U T S R Q P O N M L K J I H G F E D C B A
source HELLO WORLD
target SVOOL DLIOW

As you can see in the table, the source alphabet was mirrored by the target alphabet. There is nothing special! But how to crack this cipher? To reveal the solution, you have to find out the target language (english, german, e.g.) and after that, you have a table with a letter from the alphabet and the occurring probability of the letter in the specific language (frequency table). With this table, you can find out, how the letters are exchanged. This substituation of two letters is called a bigram (n-gram). In addition to that, you can simply exchange the letters, if you know the applied algorithm and all neccessary parameters. But, this is not the normal case.

Example of a frequency table

letter    frequency in %
e 12,702
t 9,056
a 8,167
o 7,507
i 6,966
n 6,749
s 6,327
h 6,094
r 5,987
d 4,253
l 4,025
c 2,782
u 2,758
m 2,406
w 2,360
f 2,228
g 2,015
y 1,974
p 1,929
b 1,492
v 0,978
k 0,772
j 0,153
x 0,150
q 0,095
z 0,074
ï 0,01

source: wikipedia.org

This table shows the frequency of occurence from all possible letters in the english language.

Nearly all encryption methods, that were invented in this era, works with monoalphabetic substitution. After that, the encryption algorithms were oriented on encoding words (not just simple letters). The mix of encoding words and encrypting letters is called nomenclature codes. This codes works with a substitution table and can also be cracked with a frequency analysis and the semantic knowledge of syllables and words. It would be a great help, to have the knowledge about the sense and the motivation of the message. Often, you can interpret half decrypted parts of a message and try to test any words related to the motivation of the text. To know the syntactic and semantic structure of the encrypted text, is the key of decrypting such nomenclature codes. In the subsequent chapters, I will talk about encryption methods in detail.

Scytale and transposition cipher

To encrypt a plain text with a transposition method means to change the position of a letter in a text. In contrast to the substitution, the transposition method performs no exchange of letters (monoalphabetic substitution), just a change of their positions. This method is immune against frequency analysis. The reason is, that the frequency analysis (at least in this case) can approximatly decide, which language was used to encrypt the message. But why “approximatly”? The frequency analysis can compare the frequency of occurence of every letter, with a frequency table of every language. The frequency table of the language, which matches best with the frequency table of the encrypted message is probably the language of the encrypted message (implied that the cipher works just with transposition). A mixture of transposition and substitution prevents the message from finding out the natural language.

scytale image

Scytale | source: wikipedia.org

A Scytale is a stick (cylinder) with a strip of paper (or earlier parchment) wound around it. The number of corners is the key to decrypt the text. The sender writes the message on the parchment of his stick. After that, he or she sends a courier to the receiver. The receiver has a stick with the same number of corners and can wound the piece of parchment around his stick. When the receiver has the right stick, he or she can read the decrypted text. This method was used by the Spartans and the ancient greeks to encrypt military messages. The security of a Scytale was really strong. With the idea, to develop a lot of sticks with a different number of corners, the Scytale was assailable. After the invention of mechanically en- and decryption the Scytale as tool was obsolete, but the concept was still used.

ROT 13 and Caesar Cipher

Julius Caesar (dictator of the Roman Republic), known for his assertiveness and wisdom, used a simple (but in his era really strong) method to encrypt his messages to the generals and centurions. He used a technique similar to the Atbasch example above. Of course, Caesar lifes a few decades earlier, but he also knows the idea of monoalphabetic substitution. His cipher can be named ROTx (Rotate x), in which the x is a number, that describes the number of positions to rotate.  ROT13 is a special case of the Caesar cipher and the name was created by Usenet users in 1980. This cipher is a Caesar cipher with the shifting factor 13 (x equals 13). Example:

source A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
target N O P Q R S T U V W X Y Z A B C D E F G H I J K L M
source HELLO WORLD
target URYYB JBEYQ

In Caesars era, that cipher was strong, but later, arab scholars invented the frequency analysis method to break such ciphers.

Maria Stuart  and Babington Plot

Maria Stuart was born on 8th december of 1542. In 1543 she was crowned as Queen of the Scots. After any affairs, marriages and conspiracies Maria Stuart was deposed and escaped to england. There she was arrested. After 18 years of imprisonment, she has a ray of hope. She receives letters from Anthony Babington. Babington encrypted his letters with nomenclature codes and Maria Stuart knows to decrypt the text. At first she received a paper of all nomencalature codes. After that, Babington sends all encrypted messages. The spy’s of Walsingham, the security minister of queen Elizabeth, analyze the messages that Maria Stuart received. Thomas Phelippes was one of the spy’s. After a couple of message and doing frequency analysis with this message, Phelippes cracks the cipher and decrypt all messages. He found out, that Maria Stuart and Babington planning a assassination against the queen Elizabeth. Then, Walsingham forces Phelippes (who was a genius in faking fonts) to write an encrypted letter to Maria Stuart to find out, which people are the insurgents and the accomplices. After this attack from Walsingham (today, it would be called “Man-in-the-middle-attack”), Maria Stuart was executed in England.

This example makes clear, how risky encryption and the decryption of it can be. This nomenclature code can be decrypted by application of the frequency analysis. But one condition is, that the cracker must know, whether all symbols are substituated, words are encoded or both methods are used. With this knowledge, you can crack this nomenclature codes with a simple frequency analysis.

Le Chiffre indéchiffrable (Vigenére square)

After the invention of the frequency analysis from the arab scholars, the security of the monoalphabetic substitution was broken. Blaise de Vigenére, a french diplomat, used an idea from the italian polymath Leon Battista Alberti to invent a method called polyalphabetic cipher. This cipher works with 26 (because of 26 letters in the latin alphabet) alphabets side by side. Every alphabet starts with another letter. Have a look on the image below:

source: pad2.whstatic.com

The Vigenére cipher works with a key phrase. Consider the following example:

plain text  HELLO WORLD
key phrase  KEYKE YKEYK
encrypted result  RIJVS UYVJN

But how does it work? The decryption of a message, that is encrypted with a Vigenére square was, in the era of invention, really strong. You must have the right Vigenére square to decrypt the text. Another renewal was the encryption key. The key was used to encrypt the message. So, the receiver needs the right Vigenére square and the correct key phrase. Now, here comes the explaination of the example above.

The text to encrypt is “HELLO WORLD”. We use the key phrase “KEY” for the encryption. If the key phrase is shorter than the message, then the key phrase will be repeated so long as the length of the plain text will be reached. So, our key phrase is “KEYKEYKEYK”. With our correct Vigenére square, we can start to encrypt the message. The square has a extra alphabet on the upper side and also on the left side. For the first letter from the plain text we have a look on the alphabet in the header row. We call this position plain1(H). After that, we have a look on the left side and we keep our focus on the first letter of the key phrase. This position has the name keyphrase1(K). Now, we have the position of the letter of the plain text and the position of the letter in the key phrase. The cell, were plain1(H) and keyphrase1(K) join together, is the letter of our encrypted text. For that example cell(plain1(H), keyphrase1(K)) = R. For all other letters, the algorithm is the same.

For the era of invention, the polyalphabetic substitution was very strong. The cryptoanalysts countered with a few procedures and algorithms to crack the cipher. One algorithm was used to detect the key length. Some methods to detect the key length with redundancy of natural languages are the Kasiski-test (method of Charles Babbage) and the Friedman-test. In case of a short text, a good method is to find out which words can be placed at the beginning of the text. For instance, the first word “HELLO” is more likely then the word “HFAKF” or other senseless combinations. So the range of possible n-gram’s falls rapidly. So you can appreciate a possible key with a brute force attack. In the best case, you can try to find the key. After you found out any possible parts of the key, you have a simple Caesar cipher, which you crack with the frequency table (precondition: knowledge about the used source language).

One-Time-Pad

The One-Time-Pad (OTP) method is a encryption method that is, when it is used correctly, safe and theoretically not breakable. The concept behind this encryption is, that the key (One-Pad) is used just one time. The second constraint is, that the key phrase is as long as the plain text. Here are any other constraints. The key…

  • must be secret
  • must be unpredictable and random
  • may only be used one time

The recipient receives a bundle of x One-Time-Pad keys. With the keys, the receiver can respond x/2 messages and can answer x/2 messages. There are a lot of methods to en- and decrypt messages. Just one example is the Vigenére square.

As I mentioned a few sentences before, this procedure is absolutly safe (because of the key length). That is a big advantage. On the other hand, it is a high expenditure to bring new keys to the receiver/sender. Furthermore, this method needs a central position, that generates the keys. And last but not least, a strong disadvantage is the way of exchange the keys, which can be attacked. This can be called a Man-in-the-Middle-attack.  As you can see, the OTP method is safe, but not practicable with a high number of messages.

This was the first part of my “History of Encryption” series. The second part will deal with:

  • Enigma machine
  • One-time-pad
  • Public Keys
  • Pretty Good Privacy
  • Quantum cryptography
Jul
27

A few weeks ago, I wrote an article about an implementation of a DFA (deterministic finite automaton) in PHP. Today, I present a implementation for an automaton, that is not deterministic. It means, that the automaton have no, one or more than one edges (for an element from the input alphabet) to an other state. Before I start to give you a launch into how the NFA in PHP works, I give you a short introduction into the theory of nondeterministic finite automata.

To demonstrate an easy example, I use a NFA, that accepts a word, which includes the substring ‘abc’. I used this example also in a previous article (about DFAs). You can compare the following automaton with the DFA that accepts the same words and you would come to the conclusion, that the NFA is easier to build. Thats the truth, but on the other hand, in most (not all) cases the NFA needs more time to execute all simultanous calculations for a token from the alphabet.

An automaton needs (for formal description) an input alphabet. For that example, the alphabet is specified as:

for this example:
Σ = {a,b,c}

A few sentences ago i mentioned, that the NFA needs more time to execute all simultanous calculations. But why? Well, let us choose the input word ‘ccababca’. (For further demonstrations I will use s1, s2, s3 and s4 as names for the states [blue circle] from the left to the right) We start the simulation with the first two tokens ‘cc’. The NFA remains in s1 (state 1). After that the NFA simulates for the token ‘a’. In s1 with input ‘a’ the NFA can go to s2 and can also stay in s1 (self-loop). Here we have the main problem! The NFA simulates all possible calculations and that is the problem, that the NFA has. After token ‘a’ comes token ‘b’ and for that token, the NFA simulates s1 and s3.

The PHP implementation of the NFA works with the transition table of every state. Outward from the starting state(s), the algorithm is looking for a transition from the current state to another – the successor state. After the input word was read, the NFA checks whether it is currently in a final state or not. The formalization of a final state is defined with a thicker border line (as you can see in the sketch above the text). In the PHP implementation it is just a boolean value. You can also choose more than one starting states with:

$nfa = new NFA('test_nfa');
//adding new states
//declare starting states
$nfa->setStartingStates(array('s0','s1'));

The function, explained below, is very important to build NFA’s – but not neccessary. In the code example below, we don’t need this function, therefore it has an own example above.

The code example shows the code fragment, that has the equal functionality as the automaton in the sketch.

require 'NFA.class.php';

$nfa = new NFA('test_nfa');

$s1 = new NFA_State('s1');
$s1->setTransition('b', 's1');
$s1->setTransition('c', 's1');
$s1->setTransition('a', 's2');
$s1->setTransition('a', 's1');
$nfa->setState($s1);

$s2 = new NFA_State('s2');
$s2->setTransition('b', 's3');
$nfa->setState($s2);

$s3 = new NFA_State('s3');
$s3->setTransition('c', 's4');
$nfa->setState($s3); 

$s4 = new NFA_State('s4');
$s4->setTransition('c', 's4');
$s4->setTransition('b', 's4');
$s4->setTransition('a', 's4');
$s4->setFinal(true);
$nfa->setState($s4);

$nfa->setStartingState('s1');

if($nfa->run('accbbababcababa'))
  echo "NFA accepts the input word!";
else
  echo "NFA doesn't accept the input word!";

As you can see in this example, the NFA equals a pattern matcher. With a regular expression parser you are able to build small NFA’s for every part of this expressions. After that you can concatenate this automatons to an automaton, that accepts your complete expression.

You can download the source files here. (NFA.class.php, NFA_State.class.php and test.php)

May
04

An important area of the theoretical computer science deals with finite automata and languages. We use a finite automaton to detect words of regular languages. The types of grammars that create a language are defined in the CHOMSKY-hierachy (Noam Chomsky). The table below shows the CHOMSKY-hierachy:

no. title
0 recursively enumerable
1 context-sensitive
2 context-free
3 regular

To represent a regular grammar (type 3) you can use for instance a finite automaton. Here we are using a DFA (deterministic finite automaton) to display a regular grammar. A DFA consists of states and the automaton can reach a new state with a edge to another state (states are a kind of nodes or vertexes). Every state has a successor state for all tokens of the input alphabet. This input alphabet is specified:

for this example:

Σ = {a,b,c}

The DFA in the example below accepts all words that contain the sub string ‘abc’:

The circle represents a state and the edge represents the input of a token from the input word. The state, which is marked bold, is called the final state. If all tokens of the input word were inputted into the automaton (no token was left) and the automaton is currently in a final state, then the word was accepted from the DFA.

A formal definition of a DFA:

M = (Z, Σ, δ, s0, E)

  • Z… a finite set of states
  • Σ… the input symbols (alphabet)
  • δ… transition function (δ: Q x Σ → Q)
  • s0… start state (s0 ∈ Q)
  • E… a finite set of final states (E ⊆ Q)

The implementation in PHP based on the transition function / transition table of the DFA. The run()-method uses this table to calculate the successor state of every token in every state.
After all tokens are inputted, the run()-method checks, whether the current state is a final state or not. Running run() with true as the second argument, then the function will return the path through the automaton. Using this method with false as second parameter, the return value is boolean (true for success | false for “word isn’t in that language”).

Below you can see the DFA, that accepts only words with an ‘abc’ sub string (from the example), as PHP code:

$dfa = new DFA('test_dfa');

$z0 = new DFA_State('z0');
$z0->setTransition('b', 'z0');
$z0->setTransition('c', 'z0');
$z0->setTransition('a', 'z1');
$dfa->setState($z0);

$z1 = new DFA_State('z1');
$z1->setTransition('b', 'z2');
$z1->setTransition('a', 'z1');
$z1->setTransition('c', 'z0');
$dfa->setState($z1);

$z2 = new DFA_State('z2');
$z2->setTransition('c', 'zE');
$z2->setTransition('b', 'z0');
$z2->setTransition('a', 'z1');
$dfa->setState($z2); 

$zE = new DFA_State('zE');
$zE->setTransition('c', 'zE');
$zE->setTransition('b', 'zE');
$zE->setTransition('a', 'zE');

//to set zE as a final state
$zE->setFinal(true);
$dfa->setState($zE);

$dfa->setStartingState('z0');

$result = $dfa->run('accbbababcababa', true);

goto: download (DFA.class.php, DFA_State.php & test.php)

Apr
20

MySQL is a very powerful and popular database server for web applications. It’s a open source project with a big community. A lot of people deal with the performance of this server to make their applications faster on small machines. This article deals with the server tuning, especially the tuning that you can do in the my.cnf configuration file of MySQL. Be aware, that the performance of a database server depends on a few more factors than server tuning via the configuration file. This article is a part of the web-application performance series.

Query cache

The query cache is a cache that caches the execution plan and the result of the queries. When a query comes in, the query-parser parses the query, after that the query-optimizer creates an execution plan (based on heuristics) and executes the query. This cache is a very powerful tool, that brings an enormous speedup. Be aware that the query cache acts case-sensitive, which means that it differ between “SELECT” and “select”. A query, that will return the same result set, but in file A it starts with “SELECT….” and in file B it starts with “Select” have two different entries in the query cache. This is not good, because the unnecessary & occupied space can’t be used for another important query. Below i listed some important variables to control the behavior of the cache:

Query cache status

To have an imagination of how the cache works you can type:

SHOW STATUS LIKE "qc%";

into the query window or the shell to get status information from the cache. Here, i listed the Qcache status information of my database server.

Variable_name Value
Qcache_free_blocks 493
Qcache_free_memory 20752864
Qcache_hits 130628
Qcache_inserts 246940
Qcache_lowmem_prunes 0
Qcache_not_cached 203444
Qcache_queries_in_cache 1160
Qcache_total_blocks 2885
  • Qcache_free_blocks: number of free memory blocks in cache
  • Qcache_free_memory: amount of free memory space
  • Qcache_hits: sucessful reads from the cache
  • Qcache_inserts: number of all inserted queries
  • Qcache_lowmem_prunes: number of queries that where deleted from the cache because of reaching query_cache_size
  • Qcache_not_cached: number of non cached queries
  • Qcache_queries_in_cache: number of queries that are currently in the cache
  • Qcache_total_blocks: number of total blocks in the query cache

query_cache_type

This variable accepts three values.

  • 0 – The cache is turned off. Don’t cache anything.
  • 1 – The cache is turned on. Cache anything, but not such queries that contain time functions and other special aggregate functions
  • 2 – The cache is turned on demand. It will cache all queries that are marked with SQL_CACHE, for instance (SELECT SQL_CACHE id, name, telnr FROM employee;)

Value 2 is just useful if you performing a lot of queries on your server with large result sets and they are just needed to display statistics in the backend. You should cache just queries that you are need on the production system.

query_cache_size

The cache size points out the amount of memory that the Qcache can allocate. To set the value of this variable to an optimal size, you have to study the behavior of your server. For some cases it can be useful to set 16M (which stands for 16 Megabyte) but some applications need more than 16 M. To optimize this, you can test the status of the cache, as described earlier, and keep your focus on Qcache_lowmem_prunes, Qcache_free_blocks and Qcache_free_memory.

query_cache_limit

This variable means the maximum size of a single cache entry. It depends – again – on the size of your result sets. If you have very long tables, then it would be useful to increase the value of this variable. There is no general solution, but a good value for small tables is 1M or 2M.

Key buffer

The key buffer represents the number of space that MySQL can use to build indexes and keys. This buffer is very powerful and the value should be very high (depends on your application). Indexes of large tables can be large too. Study the server status output for more information. Important variables are Key_reads (number of key-read operations that result in a disk access) and Key_writes (number of key-write operations that result in a disk access). Both values should be as small as possible.

key_buffer_size = 32M
#for large tables
key_buffer_size = 512M

Table cache

The table cache holds open tables in the memory to allow a quick access. The improvement of avoiding a disk access and a table scan is very huge. Keep your focus on Open_tables and Opened_tables. Open_tables represents the number of open tables in the table cache. These tables are currently in the memory. The opponent Opened_tables represents the number of tables that are not in the memory (the query results in a table scan). So it is your task to increase the table cache, that you have no Opened_tables. But this is the ideal case. In real-life applications you’ll never have 0 Opened_tables.

#for small appliactions
table_cache = 128
#for huge applications
table_cache 256

Some database administrators think, that FLUSH TABLES would be a good alternative to increasing the table_cache. Well, this is a common mistake. FLUSH TABLES only flushes the table cache. To do so is good, to force a quick reindexing of the cache, but it isn’t an alternative to increase the table cache.

Sort buffer

The sort buffer is used to do sort operations like ORDER BY or GROUP BY in the memory. If this buffer would be to less, MySQL continues the sort operation on the hard drive. This is a very slow process and costs a lot of time. The indicator of the right sort_buffer value is to check the Sort_merge_passes variable. This variable shows how many merge operations MySQL need and it should be 0 for optimal performance.  Is Sort_merge_passes unequal 0 then you have to increase the value of sort_buffer. Be experimental and check the status of your server for sort buffer optimization.

Read rnd buffer

The read_rnd_buffer is used after the sort operation to read the sorted rows. Sometimes the standard value, which is 256K is not enough. There is an rule which tells us to use 1M for every Gigabyte of RAM that our server has on board. If you have not to large tables 1M is a good value.

read_rnd_buffer_size = 1M

Tmp tables

Temporary tables are important for getting a fast result set. If your tables become to large, MySQL will create temporary tables on the disk. This is really slow and not practicable for high performance applications. The value of tmp_table_size must be a generous value. Created_tmp_disk_tables is the opponent of tmp_table_size and represents the number of tables that MySQL creates on the disk. It must be your goal to decrease this value to the minimum.  After watching the current value of Created_tmp_disk_tables you can decide to increase the tmp_table_size or not.

#default value
tmp_table_size = 32MB
#for smaller tables, this value would be better
tmp_table_size = 64 MB

Thread cache

Every connection to the MySQL server ends up in a thread. The thread cache starts after the connection to the client is finished. It holds the thread in the cache to avoid the overhead of instantiate a new thread. An old thread would be used to handle incoming connections. You can check the performance of your thread cache with have a look on Threads_cached and Threads_created. Threads_cached is the number of threads in the cache and Threads_created represents the number of created threads. Increase your thread_cache_size to improve the performance. Note: The thread cache was build to handle high thread concurrency in a fast way.

Join buffer

If you do a join in your query, the MySQL server creates a result table. Especially a full join needs a large join buffer. So, increase the size of your join buffer, that the server must not create a joined table on the disk. This could end in a very high loss of performance. Set join_buffer_size to a generous value. Joins are often a case for the slow-query-log. Try to avoid large queries and check the slow-query-log to detect slow queries.

Max connections

The max_connections value represents the number of connections that the server accepts (simultaneous connections). If you have a high concurrency on your database server, increase this value to handle all connections at the same time. But be aware that the number of file descriptors increases with the number of simultaneous connections. To avoid incorrect handled connections, be careful with this variable. The value depends on the hardware in your box.

#for small applications
max_connections = 500
#for larger applications
max_connections = 5000
Apr
20

The scope of MySQL optimization is very wide. One aspect of this complex field is the query performance and this article deals with it. MySQL provides powerful tools, that allow you to have a look inside the query parser & optimizer. To understand this article some MySQL server & SQL knowledge is required. Please be aware, that this article is just an overview about query performance. It can’t cover all aspects of it. This article is a part of the web-application performance series.

A simple calculation

On MySQL.com the company behind MySQL (MySQL AB) published a simple calculation to find out the estimated query performance:

log(row_count)/log(index_block_length/3*2/(index_length+data_pointer_length))+1
  • row_count: number of rows in your table
  • index_block_length: normally 1024 bytes
  • index_length: key value length (usually 3 bytes)
  • data_pointer_length: data pointer (usually 4 bytes)

The result of this formula represents the number of disk seeks for a special query. This formula is not valid for all queries.

Using EXPLAIN

EXPLAIN is a powerful tool to analyze the query performance. You can use EXPLAIN for every query.

EXPLAIN SELECT * FROM employee WHERE id = 18273;

After sending this query to the server, the result set contains information of the MySQL Optimizer. The optimizer analyzes the query and decides which execution plan would be the fastest and/or cost-effective. Check approximatly all queries to make sure, that you don’t overlook a unindexed or slow query.

The output of the optimizer contains the following columns:

  • table: involved tables (if you do a join, then this table contains more than one entry)
  • type: type of query (e.g. SYSTEM | CONST | EQ_REF| REF | RANGE | INDEX | ALL, ordered by the fastest in a descending order)
  • possible_keys: all possible keys for this query
  • key: finally used key
  • key_len: the length of the key (the length correlates with the performance)
  • ref: shows which column or constant would used to compare the value of key
  • rows: estimated number of rows to check
  • Extra: additional information about how MySQL executes the query

If you want to use an index that MySQL don’t use in the execution plan, then you can force MySQL to use this index with:

SELCT lname, fname, birthday FROM employee FORCE INDEX(lname)...

LIMIT

You should use LIMIT, if you have a lot of rows in your result set. It saves a lot of time, if you’re using LIMIT. So, you can build your query, after your pagination variable is evaluated.

SELECT lname,fname FROM employee LIMIT 0, 30

or with the application of a pagination:

SELECT lname, fname FROM employee LIMIT ($page * $rows_per_page), $rows_per_page

Caching queries effectively

MySQL implements a query cache. This cache saves the execution plan for your queries and the results. Until the table is not changed, the query cache holds the queries in the cache (depends on the possible memory).  The most common fault with the query cache is the innocence of how the query cache works. It works case-sensitive! This can be tricky if you use the same query more than one times. Then, you have to establish query coding-standards in your application, to keep the cache smart, efficently and economical.

SELECT * FROM employee WHERE id > 10 AND id < 10000

is not the same like:

select * from employee where id > 10 AND id < 10000

Split up complex queries

At a certain level, some queries can perform really slow. This will happen especially by joins, correlated queries, large tables and such complex constructions. Then you achieve a better performance with splitting this one query into more queries. Test this queries against the EXPLAIN command to check which query performs well and which not. But, be aware that before splitting a query up , the first option should be to check the indexes and the query cache.

SQL_CACHE vs SQL_NO_CACHE

If you are performing statistical queries to update user statistics and so one you should use the SQL_NO_CACHE directive. This directive is a part of the query cache. If you set the value of query_cache to  2, then you can use this directive. Use SQL_NO_CACHE for all queries, that you don’t need in your application. Because the space of your query cache is limited and shouln’t be wasted.

LIKE with %

‘%’ is known as the wildcard operator in MySQL. If you have an index on a column it can perform fast.

SELECT * FROM employee WHERE lname LIKE('New%');

This works fast, if you have an index on lname. The example below don’t works fast, because the index couldn’t be used:

SELECT * FROM employee WHERE lname LIKE('%man');

So, try to avoid the wildcard at the beginning of the like-clause. An considerable option can be to reverse the values of this column.

Other considerations

  • avoid calculated comparisons between values
  • don’t use DISTINCT and GROUP BY together
  • avoid correlated subqueries
  • use the FULLTEXT search and avoid LIKE
Apr
17

This short article i have focused on a special field of PHP optimization. It deals with the php.ini.

The php.ini file is the configuration file of PHP. It contains directives and settings, that affect the execution of the PHP script. The possible modifcations, that you can do to increase the speed of execution, are very special. You have to decide which of the settings, that are listed below, seems wise to you and your application. This article is a part of the web-application performance series.

realpath_cache_size & realpath_cache_ttl

PHP uses a realpath-cache to cache the path of include/required-files. This cache was introduced in PHP 5.10 and brought a high improvement of speed into PHP. The default value should be 16K, which is a really good ratio.

register_globals

The default value since PHP 4.2.0 is set to Off, nevertheless you should convince yourself, that this value isn’t  set to On. On the one hand, register_globals are well known as a big security issue and on the other hand PHP creates all variables in $_GET, $_POST, $_COOKIE, $_REQUEST, $_SERVER and $_SESSION in register_globals variables, too. This costs a unnecessary amount of time!

always_populate_raw_post_data

Is always_populate_raw_post_data set to On, PHP fills $HTTP_RAW_POST_DATA with raw data, that comes in with POST. Turn this directive to Off,  if you’re not planning something special with it. It saves time and memory.

register_long_arrays

This directive copies all data into the outdated $HTTP_*_VARS. You can access this data for instance with $_GET. You should turn register_long_arrays to Off to save time and memory.

expose_php

To tell the webserver which version of PHP is installed, PHP sends an information string to the webserver. Set this directive to Off, to avoid an unnecessary string copy. In addition to that, it is safer to tell just the necessary information and not more.

register_argc_argv

If you are using your PHP scripts with a webserver, then you don’t need argc (number of command line arguments) and argv (array that contains the command line arguments). Is register_argc_argv set to On, then PHP tries to parse and copy this variables into the symbol table of the script. This takes unnecessary time. Turn it to Off to save time.

asp_tags

The asp_tags directive points out, that the PHP detects also ASP tags like <% and %>. It’s not best practice to use ASP tags in PHP scripts. If you don’t use these tags, you should turn Off the asp_tags directive to save time.