cip-labs - PHP, MySQL and CMS Systems http://www.cip-labs.net code is poetry laboratories - PHP, MySQL, Lighttpd, Web-performance Mon, 12 Mar 2012 18:59:24 +0000 http://wordpress.org/?v=2.9.2 en hourly 1 Theta8 – Next step website analysis http://www.cip-labs.net/2012/03/08/theta8-next-step-website-analysis/ http://www.cip-labs.net/2012/03/08/theta8-next-step-website-analysis/#comments Thu, 08 Mar 2012 13:29:05 +0000 flezzfx http://www.cip-labs.net/?p=669

I’m proudly present a new release of a project, that I’ve developed at the cip-labs. It’s named Theta8. This smart web application analyzes the response of a given URL and collects some data about it. It gives advices, tricks and tipps for the frontend optimization. Below, I will explain some functionality in more details:

How was it built?

Theta8 includes some PHP code and also some lines of C code. The core was programmed in PHP, but some functionalities (e.g. network operations, calculations on data) were outsourced to faster C. This has more advantages, but the main reason was, to speed it up. The programming paradigm is simple OOP. At the moment, I’m testing a lot, to bring more speed into Theta8. Consider, that requesting links and resources are a lot of load on the network site. This is the bottleneck in this case.

Features

Below, I listed the features, that are currently implemented:

  • analyze metatags and make suggestions
  • detect CSS classes
  • request links and resources (e.g. headers, mime-types, dead link detection)
  • analyze caching, minification, merging and compression
  • tips for spreading hosts

Future

At the moment, I deal with some features, that I want to develop for this project. Here are some key points:

  • Speeding up the analysis
  • Improve the accuracy of information
  • Better CSS class feature
  • Usability testing tool
  • Online compression checker

Motivation

The theta8 project was inspired by the knowledge and passion that I put in the topic of web application performance. This topic is really interesting and I did some research during the last 4 years. I worked for a few companies to improve the performance of their applications. Originally, one of these companies want to buy the project (Theta8), to monetize it. I rejected this offer and decided to continue the work on this project.

Goto: Theta8

]]>
http://www.cip-labs.net/2012/03/08/theta8-next-step-website-analysis/feed/ 0
Neuron based data structure – An implementation http://www.cip-labs.net/2011/12/14/neuron-based-data-structure-an-implementation/ http://www.cip-labs.net/2011/12/14/neuron-based-data-structure-an-implementation/#comments Wed, 14 Dec 2011 20:54:08 +0000 flezzfx http://www.cip-labs.net/?p=653 Exactly one year ago, I came out with an idea of a neuron based data structure. In the first article about this idea, I tried to give an overview about, how this data model should work. In the article below, I introduce you to the implementation of this model. It’s still in development, but all functions which were needed are already implemented.

What is NBDS?

The neuron based data structure (called NBDS) follows the idea, to keep an information as an atomic part. The model contains three parts. The first part is the Neuron, which acts like a container for data. The second part is the Axon. This axon connects two neurons together and it can still contain information (data about the connection or relation). The last part is the Space. In a Space you put neurons and axons together and run some operations on it. You can imagine the space as a component, that brings the order into the set of neurons and axons.

To keep it short and simple: NBDS represents a directed graph. The direction of the axons (edges) can be one-directional or bi-directional. It has something in common with the network data model, but NBDS has more flexibility. It has also something in common with the object-oriented data model, but it can have no, one or more connections to other objects (and the connection itself can also contain information)

Neuron

Biological motivation

Neurons are nerve cells of the brain. This nerve cells are connected to each other with dendrites and axons. When an electrical impulse comes into (from another neuron over dendrites and axons) the neuron have an electrical potential. When this potential reaches a special threshold value, the neuron fires also an electrical impulse out to other neurons. Thats is, how your brain works. In NBDS, the neuron haven’t a threshold value, but it has a counter, that counts, how often the neuron was reached during the operations on the Space. With this counter, you can see, how often your information was affected.

Functionality

A neuron reflects a data container. It contains the ID, name, value, description, generic attributes (key,value) and a counter (how often was the neuron affected in operations on the space). It can have connections to other neurons about axons.

Axon

Biological motivation

The axon (and also a dendrite) connects the neurons with each other. It forwards the electrical signal from the source neuron (or the retina) to the destination neuron (or more than one).

Functionality

In this data model, the axon does some more than just forward an impulse or signal. It has also an ID, name, value, description, generic attributes (key,value), a used counter, the bi-directional option and the id’s of the involved neurons.

Space

Functionality

A space, in this case, is a container that holds neurons, axons and the connections between them together. In the NBDS you can perform some operations on the Space to get out the information from the neurons and axons. Furthermore it contains some lists, to perform operations on the Space faster.

Universe

The universe array is a structure, that represents the connections between neurons. It was designed as a fast lookup table for operations on the space. Nearly all methods, that need to found out a fast connection between to neurons, using this array for a lookup.

Mnemonics

ln the real world, mnemonics are things that you know from the past. In this data model its similar. The mnemonic array hold route operations (from the past) from neuronX to neuronY. It contains all successful routes, that the route() method detects.

Operations

Serialization

The NBDS allows to serialize and unserialize a set of data. It offers two methods. The first method is to save the data as a XML file. This method is really simple and easy to import to other systems. But the XML serialization has one problem. It is not able to store objects. This affects just nested objects, that you hold in an axon or neuron.

The second option for serialization is to save it with the serialize() function of PHP. This method fix the problem wit nested objects and can store the data binary or plain in an JSON like format.

Route

The route operation finds a connection between two neurons (over axon(s)). (assuming, that this connection already exists)

Select

With this operation you can request several parts of the data structure. You can, for instance, search for all NBDS_NEURON objects which have a name NBDS_LIKE ’special_neuro%’.

Selecti

The selecti operation realizes more than one select() operations in one step. You can define $operation_arrays to declare more than one criteria for the request.

Connection

Connection() returns a component of a connection. It uses the route() operation internally to do a lookup (from_neuron – to_neuron). After that, it returns NBDS_NEURON or NBDS_AXON (or both -> NBDS_ALL) as objects.

Neighbor

This operation calculates the neighbors of a neuron. Optionally, you can set an argument ($steps), which returns the neighbors with a given distance (so called degree).

Translate

Translate performs a lookup on the _axons or _neurons array, to find out, which neuron or axon has the id, that was given as an argument. It returns the objects, that belongs to the given ID’s.

Get

A simple get() request returns some parts of the space, for instance NBDS_NEURON, NBDS_AXON, NBDS_UNIVERSE or NBDS_MNEMONICS.

Use cases

Below, I collected some ideas for an application of the NBDS:

Data model

The concept of NBDS deals with the representation of data. A bunch of data has a connection to another bunch of data. The bunch of data is very flexible an generic. Furthermore, the connection between the bunches of data has also data, that describes it.

Ontology

An ontology is a relation between words (synonym, etc.). With the NBDS you can represent this ontology and you can request synonyms of a word. You can also classify the axons as a special relation between words.

Intelligent graphs

You can use NBDS as an intelligent graph. A graph, which you can use for routing operations. After that, you can find the fastest path and the information behind it.

Data relationships

Organizing data (atomic data) with relations is one of the main advantages of the NBDS. You are able to organize nested informations with the neurons. For instance, you can put a Space.class.php object into a neuron. This means, that you can use recursive structures to store your data.

Semantic networks

You can use it also as a semantic network. With adding neurons with names and relationships to each other, you can represent a semantic network of words.

System requirements

The system requirements are just PHP 5.x or higher.


goto: download (version 1.0.0)

goto: project page

]]>
http://www.cip-labs.net/2011/12/14/neuron-based-data-structure-an-implementation/feed/ 0
Ipsum – The PHP formula parser http://www.cip-labs.net/2011/11/21/ipsum-the-php-formula-parser/ http://www.cip-labs.net/2011/11/21/ipsum-the-php-formula-parser/#comments Mon, 21 Nov 2011 21:43:59 +0000 flezzfx http://www.cip-labs.net/?p=626 Some time ago, I thought about that it would nice to have a smart piece of code, that calculates the result of a given formula. The next step, I thought about, was to integrate customized functions into the code. So, I wrote a parser, that is able to do all this things. This article deals with this parser and how you can work with it. The code is really easy to understand and the parser is easy to use.

First considerations

At first you have to consider some classes, that are able to abstract all the structures you need.

class Morphem

A morphem is the atomic part of every language. In our case, a morphem represents the tokens of a formal language, for instance:

(, ), +, -, *, /, sin, cos, tan, sqrt, exp, ln, log10, -912, 12

The morphem class represents a morphem, after the lexer has detected it.

class Lexer

The lexer has the job to detect morphems in the formula. With some simple rules, the lexer detects every morphem from the left to the right:

  • FVAL – function value like (sin, cos, log)
  • DVAL – double value like (1.4, 3.1, 5, 7, 11.11)
  • CVAL – character value like (+,(,),-)
  • NOVAL – no value, tokens that are not defined
  • FINISHED – the lexer finished with ‘\0′ in the string

This class returns the current morphem to the parser. The parser uses this morphem, to calculate the result of your formula with the rules of the grammar.

class Parser

An important part of a good working parser is a good grammar. In this case, I choose a case-sensitive grammar. A grammar is a formal construct, to define, how a language is structured. In this case, the language represents all valid formulas. The following grammar was used in the code:

E -> T | T + E | T - E
T -> F | F * T | F / T
F -> (E) | N | -N | sqrt(E) | sin(E) | cos(E) | tan(E) | exp(E) | ln(E) | log10(E)
N -> I | I .D
I  -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 1I | 2I | 3I | 4I | 5I | 6I | 7I | 8I | 9I
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0D | 1D | 2D | 3D | 4D | 5D | 6D | 7D | 8D | 9D

This grammar is realized in the parser class. The customized functions, that you can integrate into the parser comes in at F. Here are laying your functions.

HowTo work with

Here you can see a simple example, that should match all cases of application.

First example

require 'Parser.class.php';
$parser = new Parser('sin(4)+cos(4)');
$result = $parser->run();
echo 'result is: ' , $result , PHP_EOL;
//outputs
result is: -1.4104461161715

As you can see in the example above, it is really easy to use the parser. You just have to type in the formula that you want to parse and nothing more.

The parser knows some functions by default:

  • sqrt(x) square root of x
  • exp(x) – e^x
  • sin(x) – sinus of x
  • tan(x) – tangens of x
  • cos(x) – cosinus of x
  • log10(x) – logarithm base 10 of x
  • log(x) – logarithm naturalis of x

So, it isn’t a problem for the parser to calculate:

(3*sin(5*cos(3))+1+ln(sin(3)))/sqrt(2)

source: wolframalpha.com

which is (output from the parser): 1.3842259196509

Extending functionality

Sure, I built in some functions, that are important for mathematical calculations, but maybe you want some more functions, to serve your needs. That is no problem:

require_once 'Parser.class.php';

function divide2($x){
return $x = $x / 2;
}

function plusRandom($x){
return $x + rand(0,10);
}

$parser = new Parser('div2(sin(pr(1)))');
$parser->addFunction('div2', 'divide2');
$parser->addFunction('pr', 'plusRandom');
echo $parser->run();
//outputs the sinus from a number from a random number in the scope of 1-11 divided by 2

Note: Every function must have one argument (not more and not less).

Symbol table

The symbol table includes all functions that you had added or where built-in. I kept this structure static, so that a new parser object has access to it. Furthermore the parser needs this structure static, because in some cases we are starting a new parser recursively. The table is an array of the lexer class and you can access it with:

Lexer::$_userFunctions[$name_in_formula] = $real_function_name;

Using the calculator

May you want to calculate a series of results for one formula, to draw a graph in a coordinate system or to simply calculate a series of results from a formula with a variable. Then you should use the Calculator.

$calc = new Calculator();
$calc->options('{x}', 0, 10, 0.5);
$calc->addFunction('div2','divide2');
print_r($calc->calculate('sin({x})+cos({x})');

The calculator runs the formula with the variable {x} starts with 0 and runs up to 10 with the step rate of 0.5. That means, the calculator runs for {x} = 0, {x} = 0.5, {x} = 1.0, {x} = 1.5 …. {x} = 10.0. The keys of the result array are the current step and the values are the result of the calculation.

Wishlist

On my wishlist for this further development of this project:

  • n ^ m
  • n!
  • n % m
  • functions with more then one parameter

Get it!

You can download the files on the project page. There I will create a little documentation of the code for some hacking fun. Maybe you found it interesting to work with this little tool. So, please let me know and write a comment or a mail. I would really like to get some feedback and wishes, that I can add to the code.

goto: download

goto: project page

]]>
http://www.cip-labs.net/2011/11/21/ipsum-the-php-formula-parser/feed/ 0
Benford Calculation in PHP http://www.cip-labs.net/2011/11/01/benford-calculation-in-php/ http://www.cip-labs.net/2011/11/01/benford-calculation-in-php/#comments Tue, 01 Nov 2011 07:20:50 +0000 flezzfx http://www.cip-labs.net/?p=612 Some month ago, I came out with an article about Benfords Law. This article deals with the summation and product formula based implementation in C. The algorithm is indeed cool for small statistical games, but not nice to use, when you have a set of numbers.

So, I developed a small class, that serves this need. Furthermore, this class is able to calculate the geometrical, harmonical and quadratical averages.

After including (require, include) this class, you can create a new object with

$benford = new Benford(array(1341,82,141,215), 10);

The first parameter is the array with the values to calculate. The second parameter is the basis of the number system. This parameter is 10 by default. (10 means a normal number system from 0..9). You can add more values with:

$benford->addValue(array(9,128,412));

or simply:

$benford->addValue(3222);

After that, you have two options:

You can perform a run through the given set of numbers and get back the values of the Benford calculation. The second option is to use an average function. When calling an average function, the run()-function will be also called internally.

Get back an array with the Benford calculation:

$benford = new Benford(array(232, 211, 410, 301, 508, 192));

$result = $benford->run();

This returns an array with all the Benford results from the given data. Below you can see the code to perform the average functions.

$benford = new Benford(array(232, 211, 410, 301, 508, 192)); 

$result_harm = $benford->getHarmonicalAverage();

$result_geom = $benford->getGeometricalAverage();

$result_quad = $benford->getQuadrateAverage();

The script is very easy to use. Have fun with it!

Download the source file.

]]>
http://www.cip-labs.net/2011/11/01/benford-calculation-in-php/feed/ 0
MySQL Schema Performance Considerations http://www.cip-labs.net/2011/10/19/mysql-schema-performance-considerations/ http://www.cip-labs.net/2011/10/19/mysql-schema-performance-considerations/#comments Wed, 19 Oct 2011 09:59:34 +0000 flezzfx http://www.cip-labs.net/?p=607 The most optimization tricks for MySQL focus on query performance or server tuning. But the optimization starts with the design of the database schema. When you forget to optimize the base of your database (the structure), then you will pay the price of your laxity from the beginning of your work with the database. Sure, every storage engine have his own advantages and disadvantages. But regardless of the engine you choose, you should consider the following chapters in your database schema. In this article, I will write about some architectural considerations, that you should keep in mind, when you design a database, that will run under heavy load. This article is a part of the web-application performance series.

Read-optimized tables

In current web applications, the frequency of reading from a database is much likely, then writing data to it. For this reason, the database server have to scan a lot tables, to deliver the data. The most tables, that are including the (so called) master data are very huge. This tables have a big row size and they are very costly to scan. For this reasons it is a well-known method to split up this table. You can hold the data, that is most likely involved in scanning processes, in a seperate table, to avoid a scan about the whole data. A problem occures, when you want do query the full data set. Here you have two options:

  1. execute a join between the read-optimized table and the other table (using a foreign key in the read-optimized table)
  2. denormalize the tables with an attribute, that both tables have (and that is indexed two times)

The first alternative is not oriented on performance aspects, but it fullfills all rules of normalization. The second alternative is a denormalized table, that includes some redundancy in data. Most web applications run with little redundancies in there databases to use the performance improvement of this read-optimized tables. See also the chapter: “denormalize tables”.

Denormalize tables

As I said in the chapter above, you should denormalize tables, when your joins between two and more tables are to slow. To denormalize should be the last step, after all of your architectural repertoire is exhausted. Because denormalized tables mean to have some data not only in one column. Of course, some inconsistencies can occure and the database is stronger to handle. The administrative outlay will increase, if you denormalize to much of your data.

Denormalization is useful, if you split one table into two tables and you don’t want to join this tables together in a query. In most cases, the database would execute a join with over 2 tables to get the information. This is really costly for current web applications! But be aware, think about, which denormalization is necessary and which is not.

Artificial key columns

If you have a fast natural key (that results from the data analysis) don’t add a artifical key to it. Remember that all keys are need disk or/and memory space. But you have to weigh up whether your natural key is fast enough or not. Maybe your key is alphanumeric and is to large to be efficient under high load. Then you should use a numeric and artificial primary key.

Bigint vs. unsigned

It is important for the performance to choose the right data type for the primary key. Consider, if you really need a BIGINT or INT. Maybe the MEDIUMINT datatype can fix your needs. The difference between INT and MEDIUMINT is, that you save 25% of space with using MEDIUMINT. You should use the smallest data type for 2 reasons. The first reason is, that the row will be smaller and the second reason is, that the index of the primary key will me smaller and faster. Another idea is, to use the UNSIGNED part of a data type. In most cases, the range of UNSIGNED INT will suffice. Of course, you can apply this consideration for all numerical fields (and not just for key columns).

ranges of data types

Type Storage Minimum Value Maximum Value
(Bytes) (Signed/Unsigned) Signed/Unsigned)
TINYINT 1 -128 127
0 255
SMALLINT 2 -32768 32767
0 65535
MEDIUMINT 3 -8388608 8388607
0 16777215
INT 4 -2147483648 2147483647
0 4294967295
BIGINT 8 -9223372036854775808 9223372036854775807
0 18446744073709551615

source: mysql.com

Text vs. varchar

This is a one of the most rookie mistakes with MySQL. Some people think, that it is better to use TEXT instead of VARCHAR. One point is that VARCHAR has a defined length and TEXT hasn’t. But that is not the last one. Furthermore TEXT needs more memory for sorting. However, you should use TEXT just as a “storage” data type and VARCHAR as small field of textual information (or as a key).

NOT NULL

If it is possible, declare all columns as NOT NULL. Every column needs a bit more and this is not necessary. When you really need this column, you should use it. But when you can avoid this, you should avoid it. MySQL does more working steps, if you set NULL to a column.

Other considerations

  • Do not index columns that you not need in a select
  • Use clever refactoring to admit changes to your schema
  • Choose the minimal character set, that fits your needs
  • Use triggers just, when you really need it
]]>
http://www.cip-labs.net/2011/10/19/mysql-schema-performance-considerations/feed/ 0
Ideas to bypass the frequency analysis http://www.cip-labs.net/2011/09/30/ideas-to-bypass-the-frequency-analysis/ http://www.cip-labs.net/2011/09/30/ideas-to-bypass-the-frequency-analysis/#comments Fri, 30 Sep 2011 16:37:26 +0000 flezzfx http://www.cip-labs.net/?p=571 In the history of encryption article, I even dealt with mono alphabetic substitution. This means, that every letter from the source alphabet will be substituted with exactly one letter from the target alphabet. It is simple to understand as it is easy to crack. The mono alphabetic substitution can be cracked with the frequency analysis. This old method was invented by arab scholars. They built a table with the frequency of occurrence for every letter of the source language. After that, they compared this table with the text. With this method, they were able to crack this simple cipher.

In this article, I will present some little tricks to avoid the cracking of this cipher with the frequency analysis. I must admit, that the list of the following methods is not complete. All ideas are just an approach, but I think that they can be very useful for private usage. (In all examples, I will use ROT13 to symbolize an encryption with a mono alphabetic method. Furthermore, the short sentences in this examples are nearly uncrackable with a frequency table, because the deliver not enough input.)

I considered the following techniques:

  • Vocal reduction
  • Extended target alphabet
  • Sectional target alphabets

Vocal reduction

The vocal reduction method can be very strong and hard to crack. But on the other hand, this method isn’t very practicable. The main problem is, that the receiver of the message has to reconstruct the message with a high effort. Vocal reduction means, that the sender deletes all vocals from the source text, before he encrypts the message. The analysis fails, because the frequency of the letters isn’t correct. For english texts, the target alphabet has 5 letters less than the source alphabet. The receiver knows the substitution and can decrypt the text. But, the effort to reconstruct the text with the correct vocals is very high. The only helpers, that the receiver has, are the logical semantics of the source language. With the estimated sense of the message (that makes it an information) the receiver is able to decrypt the text faster.

process

  • plain source text
  • deleted vocals source text
  • encryption
  • sending
  • receiving
  • guess the plain source text

example

plain source text: Hello World – How are you – And what about your sister.

deleted vocals source text: Hll Wrld – Hw r y – nd wht bt yr sstr

encrypted with ROT13: Uyy Jeyq – Uj e l – aq jug og le ffge.

Note that the main disadvantage of this method is the lost of information during the encryption process!

Extended target alphabet

A target alphabet is used to reflect an encrypted source letter. That’s the sore point of the mono alphabetic substitution. One letter of the source alphabet is reflected by one letter of the target alphabet. And that is the problem. The frequency analysis analyses the occurrence of all letters. Then you can try some combinations and you will crack the cipher easily. The extended target alphabet (includes for instance 0,..,9,$,#,+,*, etc.) have more options to reflect a source letter. For instance you can translate a “a” with “t” or “$”. You can’t do this for every letter, but you can do it for a few of the most used letters. So the frequency analysis goes wrong! Indeed, this method isn’t very elegant, but it fulfills it’s purpose.

example

plain source text: Hello World – How are you – And what about your sister.

substitution rules: ROT13 and A <-> N,#  and E <-> R,$

encrypted text: Uryyb Jbeyq – Ubj ner$lbh – #aq jung nobhg lbhe fvfg$e.

Sectional target alphabets

Every mono alphabetic substitution has one target alphabet, that represents the encrypted text. Another option is to use more target alphabets. This method is very variable. You can use a new alphabet for every new sentence or a new section in the text. The agreement of which alphabet is used when, must be set before the first text was encrypted. The frequency analysis fails, because the frequency of the occurrence is more unclear then with a simple mono alphabetic substitution. Which letter has the highest occurrence and which one has the lowest is random.

process

  • plain source text
  • divide text in sections (words, sentences, paragraphs etc.)
  • encrypt with various rules for every section
  • encrypted text

example

  • The key is 1938271
  • This means, that the first sentence will encrypted with ROT 1, the second with ROT 9, ROT3 etc.
  • The receiver knows the key

Please note, that this method is still a mono alphabetic method for every section. For all sections, it looks like a poly alphabetic substitution method – but that’s wrong.

Conclusion

As you can see in the example above, you can do much more with a mono alphabetic encryption method as just translating a source text into a target text with a simple alphabet. But in most cases, you have to provide extra information about the “How to decrypt?” aspect. To bypass the extra information, you can start your text with a couple of letters or numbers, that represents an encrypted key. After decrypting this key with a standard procedure, you have the right key, to fulfill the real decrypt method for the encrypted text. The ideas, introduced above, are really simple and easy to implement. I think that you can do much more tricky things with the mono alphabetic encryption method, but in my mind this ends up in more “security over obscurity” and that misses the aim.

]]>
http://www.cip-labs.net/2011/09/30/ideas-to-bypass-the-frequency-analysis/feed/ 1
The history of encryption part 1 http://www.cip-labs.net/2011/08/08/the-history-of-encryption-part-1/ http://www.cip-labs.net/2011/08/08/the-history-of-encryption-part-1/#comments Mon, 08 Aug 2011 09:03:38 +0000 flezzfx http://www.cip-labs.net/?p=521 This article deals with the history of encryption and is the first part of two articles about this topic. All techniques were described in a chronological order. In this part of the article I deal with:

  • Introduction
  • Scytale and transposition cipher
  • ROT 13 and Caesar cipher
  • Maria Stuart and Babington Plot
  • Le Chiffre indéchiffrable (Vigenére square)
  • One-Time-Pad

Introduction

With the use of public transmission paths for discrete information, it was neccessary to transmit the information safe and secret. A Information represents (at least for sender and receiver) a sequence of tokens that makes sense to them. Without any sense, it would be just a message. We can define the word information as a message with sense and meaning. Just the sender and the receiver know the logic of the message (to create an information). Well, this is the ideal case, but it is not the reality. Normally, a few more people as the sender and the receiver know the logic and the meaning of a sequence of tokens. The sender and, it’s counter part, the receiver have to consider a technique, to keep the message secure. In the history of humantiy, we found a few examples for encryption methods with political and economical background. But the encryption of private messages was also used, to keep a affair secret. A lot of women used cryptography to send intimate messages to her lovers. In this article, I will try to cover the important examples, to show how necessary encryption is.

The franziscian-monk Roger Bacon was the first, who published a book about encryption methods “Abhandlung über die geheimen Künste und die Nichtigkeit der Magie” (Treatise on the secret arts and nullity of magic). This was happened in the 13th century.  Bacon mentioned a few algorithms to encrypt a text with the normal alphabet. For instance “Atbash” (Atbasch) is a simply monoalphabetic substitution with the following rules.

source A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
target Z Y X W V U T S R Q P O N M L K J I H G F E D C B A
source HELLO WORLD
target SVOOL DLIOW

As you can see in the table, the source alphabet was mirrored by the target alphabet. There is nothing special! But how to crack this cipher? To reveal the solution, you have to find out the target language (english, german, e.g.) and after that, you have a table with a letter from the alphabet and the occurring probability of the letter in the specific language (frequency table). With this table, you can find out, how the letters are exchanged. This substituation of two letters is called a bigram (n-gram). In addition to that, you can simply exchange the letters, if you know the applied algorithm and all neccessary parameters. But, this is not the normal case.

Example of a frequency table

letter    frequency in %
e 12,702
t 9,056
a 8,167
o 7,507
i 6,966
n 6,749
s 6,327
h 6,094
r 5,987
d 4,253
l 4,025
c 2,782
u 2,758
m 2,406
w 2,360
f 2,228
g 2,015
y 1,974
p 1,929
b 1,492
v 0,978
k 0,772
j 0,153
x 0,150
q 0,095
z 0,074
ï 0,01

source: wikipedia.org

This table shows the frequency of occurence from all possible letters in the english language.

Nearly all encryption methods, that were invented in this era, works with monoalphabetic substitution. After that, the encryption algorithms were oriented on encoding words (not just simple letters). The mix of encoding words and encrypting letters is called nomenclature codes. This codes works with a substitution table and can also be cracked with a frequency analysis and the semantic knowledge of syllables and words. It would be a great help, to have the knowledge about the sense and the motivation of the message. Often, you can interpret half decrypted parts of a message and try to test any words related to the motivation of the text. To know the syntactic and semantic structure of the encrypted text, is the key of decrypting such nomenclature codes. In the subsequent chapters, I will talk about encryption methods in detail.

Scytale and transposition cipher

To encrypt a plain text with a transposition method means to change the position of a letter in a text. In contrast to the substitution, the transposition method performs no exchange of letters (monoalphabetic substitution), just a change of their positions. This method is immune against frequency analysis. The reason is, that the frequency analysis (at least in this case) can approximatly decide, which language was used to encrypt the message. But why “approximatly”? The frequency analysis can compare the frequency of occurence of every letter, with a frequency table of every language. The frequency table of the language, which matches best with the frequency table of the encrypted message is probably the language of the encrypted message (implied that the cipher works just with transposition). A mixture of transposition and substitution prevents the message from finding out the natural language.

scytale image

Scytale | source: wikipedia.org

A Scytale is a stick (cylinder) with a strip of paper (or earlier parchment) wound around it. The number of corners is the key to decrypt the text. The sender writes the message on the parchment of his stick. After that, he or she sends a courier to the receiver. The receiver has a stick with the same number of corners and can wound the piece of parchment around his stick. When the receiver has the right stick, he or she can read the decrypted text. This method was used by the Spartans and the ancient greeks to encrypt military messages. The security of a Scytale was really strong. With the idea, to develop a lot of sticks with a different number of corners, the Scytale was assailable. After the invention of mechanically en- and decryption the Scytale as tool was obsolete, but the concept was still used.

ROT 13 and Caesar Cipher

Julius Caesar (dictator of the Roman Republic), known for his assertiveness and wisdom, used a simple (but in his era really strong) method to encrypt his messages to the generals and centurions. He used a technique similar to the Atbasch example above. Of course, Caesar lifes a few decades earlier, but he also knows the idea of monoalphabetic substitution. His cipher can be named ROTx (Rotate x), in which the x is a number, that describes the number of positions to rotate.  ROT13 is a special case of the Caesar cipher and the name was created by Usenet users in 1980. This cipher is a Caesar cipher with the shifting factor 13 (x equals 13). Example:

source A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
target N O P Q R S T U V W X Y Z A B C D E F G H I J K L M
source HELLO WORLD
target URYYB JBEYQ

In Caesars era, that cipher was strong, but later, arab scholars invented the frequency analysis method to break such ciphers.

Maria Stuart  and Babington Plot

Maria Stuart was born on 8th december of 1542. In 1543 she was crowned as Queen of the Scots. After any affairs, marriages and conspiracies Maria Stuart was deposed and escaped to england. There she was arrested. After 18 years of imprisonment, she has a ray of hope. She receives letters from Anthony Babington. Babington encrypted his letters with nomenclature codes and Maria Stuart knows to decrypt the text. At first she received a paper of all nomencalature codes. After that, Babington sends all encrypted messages. The spy’s of Walsingham, the security minister of queen Elizabeth, analyze the messages that Maria Stuart received. Thomas Phelippes was one of the spy’s. After a couple of message and doing frequency analysis with this message, Phelippes cracks the cipher and decrypt all messages. He found out, that Maria Stuart and Babington planning a assassination against the queen Elizabeth. Then, Walsingham forces Phelippes (who was a genius in faking fonts) to write an encrypted letter to Maria Stuart to find out, which people are the insurgents and the accomplices. After this attack from Walsingham (today, it would be called “Man-in-the-middle-attack”), Maria Stuart was executed in England.

This example makes clear, how risky encryption and the decryption of it can be. This nomenclature code can be decrypted by application of the frequency analysis. But one condition is, that the cracker must know, whether all symbols are substituated, words are encoded or both methods are used. With this knowledge, you can crack this nomenclature codes with a simple frequency analysis.

Le Chiffre indéchiffrable (Vigenére square)

After the invention of the frequency analysis from the arab scholars, the security of the monoalphabetic substitution was broken. Blaise de Vigenére, a french diplomat, used an idea from the italian polymath Leon Battista Alberti to invent a method called polyalphabetic cipher. This cipher works with 26 (because of 26 letters in the latin alphabet) alphabets side by side. Every alphabet starts with another letter. Have a look on the image below:

source: pad2.whstatic.com

The Vigenére cipher works with a key phrase. Consider the following example:

plain text  HELLO WORLD
key phrase  KEYKE YKEYK
encrypted result  RIJVS UYVJN

But how does it work? The decryption of a message, that is encrypted with a Vigenére square was, in the era of invention, really strong. You must have the right Vigenére square to decrypt the text. Another renewal was the encryption key. The key was used to encrypt the message. So, the receiver needs the right Vigenére square and the correct key phrase. Now, here comes the explaination of the example above.

The text to encrypt is “HELLO WORLD”. We use the key phrase “KEY” for the encryption. If the key phrase is shorter than the message, then the key phrase will be repeated so long as the length of the plain text will be reached. So, our key phrase is “KEYKEYKEYK”. With our correct Vigenére square, we can start to encrypt the message. The square has a extra alphabet on the upper side and also on the left side. For the first letter from the plain text we have a look on the alphabet in the header row. We call this position plain1(H). After that, we have a look on the left side and we keep our focus on the first letter of the key phrase. This position has the name keyphrase1(K). Now, we have the position of the letter of the plain text and the position of the letter in the key phrase. The cell, were plain1(H) and keyphrase1(K) join together, is the letter of our encrypted text. For that example cell(plain1(H), keyphrase1(K)) = R. For all other letters, the algorithm is the same.

For the era of invention, the polyalphabetic substitution was very strong. The cryptoanalysts countered with a few procedures and algorithms to crack the cipher. One algorithm was used to detect the key length. Some methods to detect the key length with redundancy of natural languages are the Kasiski-test (method of Charles Babbage) and the Friedman-test. In case of a short text, a good method is to find out which words can be placed at the beginning of the text. For instance, the first word “HELLO” is more likely then the word “HFAKF” or other senseless combinations. So the range of possible n-gram’s falls rapidly. So you can appreciate a possible key with a brute force attack. In the best case, you can try to find the key. After you found out any possible parts of the key, you have a simple Caesar cipher, which you crack with the frequency table (precondition: knowledge about the used source language).

One-Time-Pad

The One-Time-Pad (OTP) method is a encryption method that is, when it is used correctly, safe and theoretically not breakable. The concept behind this encryption is, that the key (One-Pad) is used just one time. The second constraint is, that the key phrase is as long as the plain text. Here are any other constraints. The key…

  • must be secret
  • must be unpredictable and random
  • may only be used one time

The recipient receives a bundle of x One-Time-Pad keys. With the keys, the receiver can respond x/2 messages and can answer x/2 messages. There are a lot of methods to en- and decrypt messages. Just one example is the Vigenére square.

As I mentioned a few sentences before, this procedure is absolutly safe (because of the key length). That is a big advantage. On the other hand, it is a high expenditure to bring new keys to the receiver/sender. Furthermore, this method needs a central position, that generates the keys. And last but not least, a strong disadvantage is the way of exchange the keys, which can be attacked. This can be called a Man-in-the-Middle-attack.  As you can see, the OTP method is safe, but not practicable with a high number of messages.

This was the first part of my “History of Encryption” series. The second part will deal with:

  • Enigma machine
  • One-time-pad
  • Public Keys
  • Pretty Good Privacy
  • Quantum cryptography
]]>
http://www.cip-labs.net/2011/08/08/the-history-of-encryption-part-1/feed/ 1
Nondeterministic Finite Automata in PHP http://www.cip-labs.net/2011/07/27/nondeterministic-finite-automata-in-php/ http://www.cip-labs.net/2011/07/27/nondeterministic-finite-automata-in-php/#comments Wed, 27 Jul 2011 21:17:11 +0000 flezzfx http://www.cip-labs.net/?p=502 A few weeks ago, I wrote an article about an implementation of a DFA (deterministic finite automaton) in PHP. Today, I present a implementation for an automaton, that is not deterministic. It means, that the automaton have no, one or more than one edges (for an element from the input alphabet) to an other state. Before I start to give you a launch into how the NFA in PHP works, I give you a short introduction into the theory of nondeterministic finite automata.

To demonstrate an easy example, I use a NFA, that accepts a word, which includes the substring ‘abc’. I used this example also in a previous article (about DFAs). You can compare the following automaton with the DFA that accepts the same words and you would come to the conclusion, that the NFA is easier to build. Thats the truth, but on the other hand, in most (not all) cases the NFA needs more time to execute all simultanous calculations for a token from the alphabet.

An automaton needs (for formal description) an input alphabet. For that example, the alphabet is specified as:

for this example:
Σ = {a,b,c}

A few sentences ago i mentioned, that the NFA needs more time to execute all simultanous calculations. But why? Well, let us choose the input word ‘ccababca’. (For further demonstrations I will use s1, s2, s3 and s4 as names for the states [blue circle] from the left to the right) We start the simulation with the first two tokens ‘cc’. The NFA remains in s1 (state 1). After that the NFA simulates for the token ‘a’. In s1 with input ‘a’ the NFA can go to s2 and can also stay in s1 (self-loop). Here we have the main problem! The NFA simulates all possible calculations and that is the problem, that the NFA has. After token ‘a’ comes token ‘b’ and for that token, the NFA simulates s1 and s3.

The PHP implementation of the NFA works with the transition table of every state. Outward from the starting state(s), the algorithm is looking for a transition from the current state to another – the successor state. After the input word was read, the NFA checks whether it is currently in a final state or not. The formalization of a final state is defined with a thicker border line (as you can see in the sketch above the text). In the PHP implementation it is just a boolean value. You can also choose more than one starting states with:

$nfa = new NFA('test_nfa');
//adding new states
//declare starting states
$nfa->setStartingStates(array('s0','s1'));

The function, explained below, is very important to build NFA’s – but not neccessary. In the code example below, we don’t need this function, therefore it has an own example above.

The code example shows the code fragment, that has the equal functionality as the automaton in the sketch.

require 'NFA.class.php';

$nfa = new NFA('test_nfa');

$s1 = new NFA_State('s1');
$s1->setTransition('b', 's1');
$s1->setTransition('c', 's1');
$s1->setTransition('a', 's2');
$s1->setTransition('a', 's1');
$nfa->setState($s1);

$s2 = new NFA_State('s2');
$s2->setTransition('b', 's3');
$nfa->setState($s2);

$s3 = new NFA_State('s3');
$s3->setTransition('c', 's4');
$nfa->setState($s3); 

$s4 = new NFA_State('s4');
$s4->setTransition('c', 's4');
$s4->setTransition('b', 's4');
$s4->setTransition('a', 's4');
$s4->setFinal(true);
$nfa->setState($s4);

$nfa->setStartingState('s1');

if($nfa->run('accbbababcababa'))
  echo "NFA accepts the input word!";
else
  echo "NFA doesn't accept the input word!";

As you can see in this example, the NFA equals a pattern matcher. With a regular expression parser you are able to build small NFA’s for every part of this expressions. After that you can concatenate this automatons to an automaton, that accepts your complete expression.

You can download the source files here. (NFA.class.php, NFA_State.class.php and test.php)

]]>
http://www.cip-labs.net/2011/07/27/nondeterministic-finite-automata-in-php/feed/ 0
Deterministic Finite Automaton in PHP http://www.cip-labs.net/2011/05/04/deterministic-finite-automaton-in-php/ http://www.cip-labs.net/2011/05/04/deterministic-finite-automaton-in-php/#comments Wed, 04 May 2011 11:09:37 +0000 flezzfx http://www.cip-labs.net/?p=487 An important area of the theoretical computer science deals with finite automata and languages. We use a finite automaton to detect words of regular languages. The types of grammars that create a language are defined in the CHOMSKY-hierachy (Noam Chomsky). The table below shows the CHOMSKY-hierachy:

no. title
0 recursively enumerable
1 context-sensitive
2 context-free
3 regular

To represent a regular grammar (type 3) you can use for instance a finite automaton. Here we are using a DFA (deterministic finite automaton) to display a regular grammar. A DFA consists of states and the automaton can reach a new state with a edge to another state (states are a kind of nodes or vertexes). Every state has a successor state for all tokens of the input alphabet. This input alphabet is specified:

for this example:

Σ = {a,b,c}

The DFA in the example below accepts all words that contain the sub string ‘abc’:

The circle represents a state and the edge represents the input of a token from the input word. The state, which is marked bold, is called the final state. If all tokens of the input word were inputted into the automaton (no token was left) and the automaton is currently in a final state, then the word was accepted from the DFA.

A formal definition of a DFA:

M = (Z, Σ, δ, s0, E)

  • Z… a finite set of states
  • Σ… the input symbols (alphabet)
  • δ… transition function (δ: Q x Σ → Q)
  • s0… start state (s0 ∈ Q)
  • E… a finite set of final states (E ⊆ Q)

The implementation in PHP based on the transition function / transition table of the DFA. The run()-method uses this table to calculate the successor state of every token in every state.
After all tokens are inputted, the run()-method checks, whether the current state is a final state or not. Running run() with true as the second argument, then the function will return the path through the automaton. Using this method with false as second parameter, the return value is boolean (true for success | false for “word isn’t in that language”).

Below you can see the DFA, that accepts only words with an ‘abc’ sub string (from the example), as PHP code:

$dfa = new DFA('test_dfa');

$z0 = new DFA_State('z0');
$z0->setTransition('b', 'z0');
$z0->setTransition('c', 'z0');
$z0->setTransition('a', 'z1');
$dfa->setState($z0);

$z1 = new DFA_State('z1');
$z1->setTransition('b', 'z2');
$z1->setTransition('a', 'z1');
$z1->setTransition('c', 'z0');
$dfa->setState($z1);

$z2 = new DFA_State('z2');
$z2->setTransition('c', 'zE');
$z2->setTransition('b', 'z0');
$z2->setTransition('a', 'z1');
$dfa->setState($z2); 

$zE = new DFA_State('zE');
$zE->setTransition('c', 'zE');
$zE->setTransition('b', 'zE');
$zE->setTransition('a', 'zE');

//to set zE as a final state
$zE->setFinal(true);
$dfa->setState($zE);

$dfa->setStartingState('z0');

$result = $dfa->run('accbbababcababa', true);

goto: download (DFA.class.php, DFA_State.php & test.php)

]]>
http://www.cip-labs.net/2011/05/04/deterministic-finite-automaton-in-php/feed/ 4
MySQL Server Optimization http://www.cip-labs.net/2011/04/20/mysql-server-optimization/ http://www.cip-labs.net/2011/04/20/mysql-server-optimization/#comments Wed, 20 Apr 2011 08:44:55 +0000 flezzfx http://www.cip-labs.net/?p=484 MySQL is a very powerful and popular database server for web applications. It’s a open source project with a big community. A lot of people deal with the performance of this server to make their applications faster on small machines. This article deals with the server tuning, especially the tuning that you can do in the my.cnf configuration file of MySQL. Be aware, that the performance of a database server depends on a few more factors than server tuning via the configuration file. This article is a part of the web-application performance series.

Query cache

The query cache is a cache that caches the execution plan and the result of the queries. When a query comes in, the query-parser parses the query, after that the query-optimizer creates an execution plan (based on heuristics) and executes the query. This cache is a very powerful tool, that brings an enormous speedup. Be aware that the query cache acts case-sensitive, which means that it differ between “SELECT” and “select”. A query, that will return the same result set, but in file A it starts with “SELECT….” and in file B it starts with “Select” have two different entries in the query cache. This is not good, because the unnecessary & occupied space can’t be used for another important query. Below i listed some important variables to control the behavior of the cache:

Query cache status

To have an imagination of how the cache works you can type:

SHOW STATUS LIKE "qc%";

into the query window or the shell to get status information from the cache. Here, i listed the Qcache status information of my database server.

Variable_name Value
Qcache_free_blocks 493
Qcache_free_memory 20752864
Qcache_hits 130628
Qcache_inserts 246940
Qcache_lowmem_prunes 0
Qcache_not_cached 203444
Qcache_queries_in_cache 1160
Qcache_total_blocks 2885
  • Qcache_free_blocks: number of free memory blocks in cache
  • Qcache_free_memory: amount of free memory space
  • Qcache_hits: sucessful reads from the cache
  • Qcache_inserts: number of all inserted queries
  • Qcache_lowmem_prunes: number of queries that where deleted from the cache because of reaching query_cache_size
  • Qcache_not_cached: number of non cached queries
  • Qcache_queries_in_cache: number of queries that are currently in the cache
  • Qcache_total_blocks: number of total blocks in the query cache

query_cache_type

This variable accepts three values.

  • 0 – The cache is turned off. Don’t cache anything.
  • 1 – The cache is turned on. Cache anything, but not such queries that contain time functions and other special aggregate functions
  • 2 – The cache is turned on demand. It will cache all queries that are marked with SQL_CACHE, for instance (SELECT SQL_CACHE id, name, telnr FROM employee;)

Value 2 is just useful if you performing a lot of queries on your server with large result sets and they are just needed to display statistics in the backend. You should cache just queries that you are need on the production system.

query_cache_size

The cache size points out the amount of memory that the Qcache can allocate. To set the value of this variable to an optimal size, you have to study the behavior of your server. For some cases it can be useful to set 16M (which stands for 16 Megabyte) but some applications need more than 16 M. To optimize this, you can test the status of the cache, as described earlier, and keep your focus on Qcache_lowmem_prunes, Qcache_free_blocks and Qcache_free_memory.

query_cache_limit

This variable means the maximum size of a single cache entry. It depends – again – on the size of your result sets. If you have very long tables, then it would be useful to increase the value of this variable. There is no general solution, but a good value for small tables is 1M or 2M.

Key buffer

The key buffer represents the number of space that MySQL can use to build indexes and keys. This buffer is very powerful and the value should be very high (depends on your application). Indexes of large tables can be large too. Study the server status output for more information. Important variables are Key_reads (number of key-read operations that result in a disk access) and Key_writes (number of key-write operations that result in a disk access). Both values should be as small as possible.

key_buffer_size = 32M
#for large tables
key_buffer_size = 512M

Table cache

The table cache holds open tables in the memory to allow a quick access. The improvement of avoiding a disk access and a table scan is very huge. Keep your focus on Open_tables and Opened_tables. Open_tables represents the number of open tables in the table cache. These tables are currently in the memory. The opponent Opened_tables represents the number of tables that are not in the memory (the query results in a table scan). So it is your task to increase the table cache, that you have no Opened_tables. But this is the ideal case. In real-life applications you’ll never have 0 Opened_tables.

#for small appliactions
table_cache = 128
#for huge applications
table_cache 256

Some database administrators think, that FLUSH TABLES would be a good alternative to increasing the table_cache. Well, this is a common mistake. FLUSH TABLES only flushes the table cache. To do so is good, to force a quick reindexing of the cache, but it isn’t an alternative to increase the table cache.

Sort buffer

The sort buffer is used to do sort operations like ORDER BY or GROUP BY in the memory. If this buffer would be to less, MySQL continues the sort operation on the hard drive. This is a very slow process and costs a lot of time. The indicator of the right sort_buffer value is to check the Sort_merge_passes variable. This variable shows how many merge operations MySQL need and it should be 0 for optimal performance.  Is Sort_merge_passes unequal 0 then you have to increase the value of sort_buffer. Be experimental and check the status of your server for sort buffer optimization.

Read rnd buffer

The read_rnd_buffer is used after the sort operation to read the sorted rows. Sometimes the standard value, which is 256K is not enough. There is an rule which tells us to use 1M for every Gigabyte of RAM that our server has on board. If you have not to large tables 1M is a good value.

read_rnd_buffer_size = 1M

Tmp tables

Temporary tables are important for getting a fast result set. If your tables become to large, MySQL will create temporary tables on the disk. This is really slow and not practicable for high performance applications. The value of tmp_table_size must be a generous value. Created_tmp_disk_tables is the opponent of tmp_table_size and represents the number of tables that MySQL creates on the disk. It must be your goal to decrease this value to the minimum.  After watching the current value of Created_tmp_disk_tables you can decide to increase the tmp_table_size or not.

#default value
tmp_table_size = 32MB
#for smaller tables, this value would be better
tmp_table_size = 64 MB

Thread cache

Every connection to the MySQL server ends up in a thread. The thread cache starts after the connection to the client is finished. It holds the thread in the cache to avoid the overhead of instantiate a new thread. An old thread would be used to handle incoming connections. You can check the performance of your thread cache with have a look on Threads_cached and Threads_created. Threads_cached is the number of threads in the cache and Threads_created represents the number of created threads. Increase your thread_cache_size to improve the performance. Note: The thread cache was build to handle high thread concurrency in a fast way.

Join buffer

If you do a join in your query, the MySQL server creates a result table. Especially a full join needs a large join buffer. So, increase the size of your join buffer, that the server must not create a joined table on the disk. This could end in a very high loss of performance. Set join_buffer_size to a generous value. Joins are often a case for the slow-query-log. Try to avoid large queries and check the slow-query-log to detect slow queries.

Max connections

The max_connections value represents the number of connections that the server accepts (simultaneous connections). If you have a high concurrency on your database server, increase this value to handle all connections at the same time. But be aware that the number of file descriptors increases with the number of simultaneous connections. To avoid incorrect handled connections, be careful with this variable. The value depends on the hardware in your box.

#for small applications
max_connections = 500
#for larger applications
max_connections = 5000
]]>
http://www.cip-labs.net/2011/04/20/mysql-server-optimization/feed/ 0