cip-labs - PHP, MySQL and CMS Systems http://www.cip-labs.net code is poetry laboratories - PHP, MySQL, Lighttpd, Web-performance Mon, 11 Mar 2013 09:22:08 +0000 http://wordpress.org/?v=2.9.2 en hourly 1 Finding Multiplier nodes without graph analysis http://www.cip-labs.net/2013/03/11/finding-multiplier-nodes-without-graph-analysis/ http://www.cip-labs.net/2013/03/11/finding-multiplier-nodes-without-graph-analysis/#comments Mon, 11 Mar 2013 09:22:08 +0000 flezzfx http://www.cip-labs.net/?p=785 This article provides an overview of statistical indicators for finding users who have a significantly high impact on other users. In the past, this was mostly done by graph analysis. The approach presented here uses indicators that need no graph analysis.

Introduction

Multiplier nodes are users in social networks who have a large number of friends and produce a high amount of network traffic. They communicate with many other users, or their posts are seen by a lot of people. Finding these users can be a profitable task for marketing researchers. The main idea is to supply such a Multiplier user with coupons and concessions. The user then responds on the social network and writes about their experience with the product. The number of users who read those responses should be as large as possible. Some graph analysis approaches exist for this procedure. This article shows a procedure that works without graph analysis. The considerations are limited to a theoretical level.

Indicators

The following concepts act as indicators for a specific campaign. In this document, a campaign denotes the description of a marketing campaign, including its related:

  • Keywords
  • Phrases
  • Emotions

In contrast to graph analysis, where the Multiplier is found using a graph optimization algorithm, this analysis uses indicators to find the Multiplier. No single indicator is sufficient on its own to estimate the Multiplier reliably. Therefore, several indicators are needed so that they can verify each other.

User Network Value (UNV)

The UNV is an index that describes how well connected a user is, from direct friends up to third-degree friends (friends-of-friends-of-friends). The calculation uses a weight η_j for each degree j. The number of friends of user i at degree j is denoted by f_ij. The equation below represents the UNV.

UNV_i = Σ_{j=1..3} η_j · f_ij

where i ∈ {allUsers} and j denotes the degree of friendship. The weights should sum to 1. A possible allocation is shown below:

η1 = 0.5, η2 = 0.35, η3 = 0.15
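
A minimal Python sketch of the UNV, assuming it is the weighted sum of the friend counts over the three degrees of friendship. The function name and the sample friend counts are hypothetical and only serve as illustration.

```python
def user_network_value(friend_counts, weights=(0.5, 0.35, 0.15)):
    """Weighted sum of friend counts, one count per degree of friendship.

    friend_counts[0] holds the direct friends, friend_counts[1] the
    second-degree friends and friend_counts[2] the third-degree friends.
    The default weights are the allocation suggested in the text.
    """
    if len(friend_counts) != len(weights):
        raise ValueError("one friend count per degree is required")
    return sum(w * f for w, f in zip(weights, friend_counts))


# Hypothetical user: 120 direct, 2400 second-degree, 30000 third-degree friends.
unv = user_network_value([120, 2400, 30000])
print(unv)
```

Because the raw counts grow quickly with the degree, a real implementation would probably normalize them first. That step is left out here because the text does not specify it.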

User Activity Index (UAI)

To find out which user is the best Multiplier node, another indicator describes the past activity of each user. To compute the UAI, the activities in past time windows must be grouped into time clusters ci. For each system component si (message, comment and others), counted as a number of activities, there is a weight wi that represents the importance of the component. The idea behind the UAI is that the length of a time cluster increases with the age of the cluster. Furthermore, the weight decreases with the age.

length(ci) > length(cj)

where i > j and the higher index represents an older time cluster. To define the borders of a time cluster ci, two variables are defined:

h_i^L < h_i^H

which denote the lower and higher bound of the time cluster and can be given as the number of hours (or minutes) counted back from the current time.
The UAI is interpreted as follows.

The UAI of user i is calculated from the number s of system components that user i produced, together with the weight w of system component k. This is calculated for each time cluster j.
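
The description above can be sketched in Python as follows. The exact formula is not given in the text, so this sketch assumes one possible reading: the counts per component type are multiplied by the component weight and by a per-cluster weight that decreases with age, and then summed. All names, weights and counts are hypothetical.

```python
def user_activity_index(counts, component_weights, cluster_weights):
    """Sum weighted activity counts over all time clusters.

    counts[j] maps a component type to the number of components the user
    produced in time cluster j (j = 0 is the most recent cluster).
    cluster_weights[j] is an assumed age weight for cluster j.
    """
    total = 0.0
    for j, per_component in enumerate(counts):
        for kind, n in per_component.items():
            total += cluster_weights[j] * component_weights[kind] * n
    return total


# Hypothetical clusters of growing length: 0-24 h, 24-96 h, 96-336 h ago.
counts = [
    {"message": 4, "comment": 10},
    {"message": 6, "comment": 8},
    {"message": 20, "comment": 40},
]
component_weights = {"message": 1.0, "comment": 0.5}
cluster_weights = [1.0, 0.5, 0.25]  # older clusters count less

uai = user_activity_index(counts, component_weights, cluster_weights)
print(uai)
```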

Neighborhood Trusting Profile and Polarization (NTPP)

The friends of the user are the key to this calculation. A Multiplier user should have a healthy circle of friends and should not be known as an annoying individual. This calculation tries to estimate this state, but it is not able to find 'spammers' reliably.
The system components are denoted by si and the corresponding weights by wi. Responses to these system components are denoted by ri. The number of responses, including the case that a system component has no responses at all, can be used to estimate the acceptance within the community. A simple proposal to calculate this value is shown below.

which defines the NTPP of user i. Additional investigations are necessary and can be covered by text analysis of the system components and responses.

System Component Text Matching (SCTM)

Each marketing campaign can be described by a set of keywords K = {relevantWords}. We define a similarity function sim(x,K) which returns a similarity value between a user's text xij (i-th user and j-th text) and the elements of K. This method is able to find hidden potential. The calculation must be done for each user over all written texts. The SCTM should use a weight w for each individual text type. The sim() function as well as the calculation for each user can be defined in various ways. Therefore the SCTM has no explicit calculation rule. All in all, a similarity index can be a powerful measure to find the right person and classify them as a Multiplier.
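
Since the SCTM has no fixed calculation rule, the following Python sketch shows only one possible choice. It assumes that sim() is the fraction of campaign keywords that occur in a text, and that the per-user score is the mean of the weighted similarities. Both choices, and all names, are assumptions made for illustration.

```python
import re


def sim(text, keywords):
    """Fraction of the campaign keywords that occur in the text (assumed measure)."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    keywords = {k.lower() for k in keywords}
    if not keywords:
        return 0.0
    return len(tokens & keywords) / len(keywords)


def sctm(texts, keywords, type_weights):
    """Mean weighted similarity over all texts of one user.

    texts is a list of (text_type, text) pairs; type_weights maps a text
    type to its weight w. Unknown text types get the weight 1.0.
    """
    if not texts:
        return 0.0
    weighted = [type_weights.get(kind, 1.0) * sim(text, keywords) for kind, text in texts]
    return sum(weighted) / len(texts)


K = {"coffee", "espresso"}
user_texts = [("message", "I love espresso!"), ("comment", "Nice weather today")]
print(sctm(user_texts, K, {"message": 1.0, "comment": 0.5}))
```

A plain keyword overlap ignores synonyms and word forms. A stemmer or an embedding-based similarity could be substituted for sim() without changing sctm().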

Text Length Classification (TLC)

This measurement provides an array of information about the text lengths of all system components as well as some special information. It contains the longest and shortest text length as well as the mean length for each system component. The next step is the classification into length clusters (for example 0-5, 6-10, …, 151-200, …). This indicator is able to find spammers and prevents them from being selected as Multiplier users.
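
A short Python sketch of such a length profile for one system component type. It assumes that the length of a text is measured in words and that the cluster borders are inclusive upper bounds. The text states neither, so both are assumptions, as are the function name and the sample texts.

```python
from bisect import bisect_left
from collections import Counter
from statistics import mean


def text_length_profile(texts, upper_bounds=(5, 10, 50, 150, 200)):
    """Shortest, longest and mean length plus counts per length cluster.

    texts must be a non-empty list of strings.
    """
    lengths = [len(t.split()) for t in texts]
    lowers = (0,) + tuple(b + 1 for b in upper_bounds)
    labels = [f"{lo}-{hi}" for lo, hi in zip(lowers, upper_bounds)]
    labels.append(f"{upper_bounds[-1] + 1}+")  # open-ended last cluster
    clusters = Counter(labels[bisect_left(upper_bounds, n)] for n in lengths)
    return {
        "shortest": min(lengths),
        "longest": max(lengths),
        "mean": mean(lengths),
        "clusters": dict(clusters),
    }


profile = text_length_profile([
    "buy now",
    "buy now",
    "great coffee this morning with friends downtown",
])
print(profile["clusters"])
```

A user whose texts fall almost entirely into the shortest cluster, as in the hypothetical sample, would be a candidate for closer inspection as a possible spammer.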

Emotional Context Analysis (ECA)

The ECA indicator uses the emotional value of a user's texts to select users for Multiplier calculations. Emotions are denoted by ei with values from 1 to 10. The basic emotions are rage, sorrow, fun, surprise, disgust, fear and love. W denotes a given set of words that are annotated with the corresponding emotions. The idea is to estimate, from the texts Xi of a user, the resulting emotional context Zi. With Zi, the emotions of the user can be compared with those of the campaign. This indicator should be used to refine the selection of potential Multipliers. Used on its own, it is not expressive enough.

Conclusion

All introduced indicators are described on a theoretical level. Some theoretical experiments with self-constructed social network users indicate that this approach is able to find the desired node. Nonetheless, practical experiments with the users of a real social network would give more insight into the utility of this approach.
All in all, the indicators can be used to support traditional graph analysis and to refine the selection of possible Multipliers. Further work on this approach includes the search for new indicators and significance tests for the existing ones.

]]>
http://www.cip-labs.net/2013/03/11/finding-multiplier-nodes-without-graph-analysis/feed/ 0
Introduction to Random Forests http://www.cip-labs.net/2013/01/17/introduction-to-random-forests/ http://www.cip-labs.net/2013/01/17/introduction-to-random-forests/#comments Thu, 17 Jan 2013 13:22:36 +0000 flezzfx http://www.cip-labs.net/?p=743 This text is a short extract of my research activities and can be seen as a brief introduction to the topic. If you have any questions or suggestions don’t hesitate to contact me.

Random Forests are an ensemble of separately trained binary decision trees. These decision trees are trained to solve a problem together. For that, the predictions of all trees (called votes) are combined into a final vote. The goals of the training are to maximize the information gain and to minimize the information entropy, so that the trees optimally separate the data points or predict a continuous variable.

The decision tree concept was described by Leo Breiman et al. [1] in 1984. After that, more and more applications used an ensemble of randomly trained decision trees for machine learning. One of the first researchers was Schapire [2] in 1990. He used an ensemble to solve machine learning tasks with the Boosting algorithm. His work showed that an ensemble of (weak) learners achieves significantly better generalization. T. K. Ho was the first to use decision forests to implement a handwriting recognition system [3] in 1995.

More complex applications were implemented by Amit and Geman in 1997 with a shape recognition system [4]. Since then, Random Forests have been used especially for computer vision tasks, most of all in the medical imaging field. See [5] and [6] for prime examples of Random Forests in medical imaging applications.

Below, a list of possible problems that Random Forests can solve is shown:

  • Classification: Prediction of class for specific data
  • Regression: Predicting a continuous variable
  • Density: Learn a probability density function
  • Manifolds: Learn a set of manifolds with corresponding variables

Furthermore, a Random Forest can learn from a minimal number of manual annotations, which leads to active learning. Semi-supervised learning can also be realized with this concept. This document covers the classification and regression aspects of Random Forests. For literature on density estimation see [7] and [8]. Information on manifold learning with Random Forests can be found in [9] and [10].

Besides the areas of application, every Random Forest can be described by the following key parameters:

  • Forest size T (number of trees)
  • Tree depth D
  • Choice of weak learner model and corresponding training objective function (energy function)
  • Amount of injected randomness, controlled by p

To understand the procedure of Random Forests it is necessary to be familiar with the following concepts:

Weak learner model

The Random Forest uses weak classifiers to solve its tasks. A weak classifier is specialized on a sub-problem and significantly faster than a more complex strong classifier. It is often used together with other weak learners to tackle a complex problem. The main advantage of this type of learner model in an ensemble is the significantly better generalization performance: the ensemble performs significantly better on unseen data than a single stronger learner. The Boosting algorithm is used to generate a strong learner out of an ensemble of weak learners.

Training objective function/Energy function

This function is optimized for each split node (not a leaf) and its specific parameters during the training process. This means a minimization of the information entropy H(S) and a maximization of the information gain I. The maximization of the information gain can be seen as the search for the best separation of two datasets. The (weak) learner improves its parameters to separate the data more efficiently.

Ensemble learning

This concept of learning combines an ensemble of separately trained learners. During the training step, each learner is trained separately on a random set of training examples. This leads to significantly better generalization. The prediction of the ensemble can be calculated in several ways. The common ways are to apply a mean function over the predictions of all learners, or a mean function with a separate weight (confidence) for each learner. The Random Forest concept uses ensemble learning to compute predictions.

Decision Tree

A decision tree is a hierarchical data structure that is often used to split a complex problem into sub-problems. The root node represents the entry point of the tree. It has successor nodes (also called child nodes). These children can also have successors. A node with no successor is called a leaf. In nearly all applications of Random Forests, binary decision trees are used. This means that each node (except the leaves) has exactly two successors.

  • Root node: Entry point of the tree that has two successors. It acts like a regular node.
  • Regular node: Has one previous node and two successors. Each regular node applies a split function (described later) with a binary return value to the data. The result of the split function decides whether the data is forwarded to the left or the right child. These nodes are also called split nodes and implement the weak learner model.
  • Leaf: After passing a few regular nodes, the data is forwarded to a leaf. A leaf has no successors and provides a confidence of class affiliation or a predicted function value. Leaves implement the predictor model.

The main effort in setting up a well-working decision tree is to determine the split function for each node wisely. In some applications, for instance in computer vision, the split function is always the same, and the parameters as well as the tree structure are learned from training data. Note that n-ary decision trees also exist, but this type of tree is more complex and not suitable for this problem.

Information Gain

To reach an exact separation of the classes of data points, the information gain I must be maximized. This can be achieved by minimizing the Shannon entropy H(S), which is defined as:

H(S) = − Σ_{c ∈ C} p(c) log p(c)    (#1)

where S denotes an example dataset, c is a class with c ∈ C and p(c) is the probability mass function of the classes, that is, the fraction of data points in S that belong to class c.

H(S) = ½ log((2πe)^d |Λ(S)|)    (#2), with Λ(S) the d × d covariance matrix of the data points in S

Shannon entropy [11] describes the uncertainty of an information source in combination with a random variable. In other words, it can be seen as the unpredictability of an information source. The two equations #1 and #2 show possible calculations of the information entropy. The entropy of discrete and categorical distributions can be calculated with #1. For a continuous distribution #2 is used, which gives the differential entropy of a d-variate Gaussian distribution. Equation #1 is used in equation #3 to calculate the information gain.

I = H(S) − Σ_{i ∈ {1,2}} (|Si| / |S|) · H(Si)    (#3)

where H(S) is the Shannon entropy of all data points reaching a specific node and H(Si) is the Shannon entropy of the data points forwarded to the i-th of the two children, i ∈ {1, 2}. Additionally, |S| is the number of training data points at this node and |Si| the number of data points that are forwarded to child i. The goal of the training is to maximize the information gain.

During the training, each node in the decision tree tries to find an optimal separation of the incoming data points. The indicators of the training progress are the information gain and the entropy. Minimizing the information entropy H(S) (unpredictability) and maximizing the information gain I (confidence) are the main aims of each node.
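
The discrete entropy and the information gain of a binary split can be sketched in Python as follows. The sketch assumes a logarithm to base 2, which the text does not specify, and the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of both children."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in (left, right))


parent = ["a", "a", "b", "b"]
print(information_gain(parent, ["a", "a"], ["b", "b"]))  # perfect separation
print(information_gain(parent, ["a", "b"], ["a", "b"]))  # split that separates nothing
```

A split that sends each class to its own child removes all uncertainty, so the gain equals the entropy of the parent. A split that leaves both children as mixed as the parent has a gain of zero.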

Weak Learner Model

A weak learner model is often used in ensemble learning models like the Random Forest. Boosting (and algorithms similar to Boosting) builds a strong classifier out of a set of weak classifiers. For more information about Boosting and the existence of weak learners see Mannor et al. [12].

The weak learner model in Random Forests with binary decision trees is defined by a binary test function that returns “true” or “false”, which corresponds to forwarding the test data to the “right” or “left” child. The weak learner function is selected during the design stage, before the training starts. In most Random Forests, the decision trees are binary trees. This leads to a binary test function, because each split node needs a binary decision. In decision trees with n children, the split function must return an n-ary output. First, some of the relevant symbols are explained:

  • θ … Parameters of the weak learner function (binary test) that decides whether the data is forwarded to the left or the right child.
  • φ … Selection of features from the feature vector v with φ : R^d → R^d′ and d ≫ d′. This means that φ uses only a subset of all features for the computation. The selection of features can vary from node to node.
  • ψ … Defines the geometric primitive that is used to separate the data. Possible primitives are an axis-aligned hyperplane or an oblique hyperplane.
  • τ … Contains the thresholds for the binary test. These thresholds decide whether the test returns “true” or “false”.

The binary test function is defined as:

h(v, θj) ∈ {0,1}    (#4)

where θ is defined as θ = (φ, ψ, τ).

For linear data separation the following test function is defined:

h(v, θj) = [τ1 > φ(v) • ψ > τ2]    (#5)

In #5 the square brackets [ ] denote the indicator function, which returns “true” or “false” depending on its argument, and • is the dot product. The test projects the selected features φ(v) of data point v onto the hyperplane ψ and compares the result with the two thresholds. For a simple one-sided linear test, one of the thresholds is disabled by setting τ1 = ∞ or τ2 = –∞. The thresholds can be adapted during the training phase.
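
The linear test in #5 can be sketched in Python as follows. The sketch assumes that φ is given as a tuple of feature indices and ψ as a plain list of coefficients. Following the bracket notation, τ1 is read as the upper and τ2 as the lower bound, so the defaults ∞ and –∞ leave both bounds inactive. All names and values are illustrative.

```python
def split(v, feature_indices, psi, tau1=float("inf"), tau2=float("-inf")):
    """Binary test h(v, theta) = [tau1 > phi(v) . psi > tau2]."""
    phi_v = [v[i] for i in feature_indices]            # phi: select d' of the d features
    response = sum(a * b for a, b in zip(phi_v, psi))  # dot product with the primitive psi
    return tau1 > response > tau2


v = [0.2, 3.0, -1.0, 5.0]

# Axis-aligned hyperplane: psi picks out the first of the two selected features.
print(split(v, (1, 3), (1.0, 0.0), tau2=2.5))   # response is 3.0
# Oblique hyperplane: both selected features contribute to the response.
print(split(v, (1, 3), (0.5, -0.5), tau2=0.0))  # response is -1.0
```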
For non-linear data separation, a more complex weak learner can be used, which replaces the hyperplanes with surfaces that have more degrees of freedom:

h(v, θj) = [τ1 > φ^T(v) ψ φ(v) > τ2]    (#6)

The goal of the training process for node j is to find the parameters θj of the split function that optimize the chosen objective function Ij, that is, that maximize the information gain. The training objective function is defined by:

θj* = argmax_{θj ∈ Υ} Ij    (#7)

using: Ij = I(Sj, SjR, SjL, θj).   (#8)

Equation #7 represents the goal of the training, where Υ is the set of all possible parameter settings θ. The optimal parameters θj* are computed during the training. Sj, SjR and SjL represent the training points before the split and the points that are forwarded to the right and to the left child.

The search for the maximum at each node can be done with an exhaustive search. The optimal value for τ can be found efficiently by means of integral histograms. In most Random Forests, one split function (weak learner function) is used in all nodes of all trees, but in some cases it can be necessary to use more than one type of split function for a special problem.
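
A minimal Python sketch of such an exhaustive search for one feature. It assumes a one-sided test of the form "value ≤ τ goes left" and tries the midpoints between neighboring feature values as candidate thresholds. The integral histogram speed-up is not shown, and the names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def best_threshold(values, labels):
    """Return (tau, gain) for the threshold with the highest information gain."""
    n = len(labels)
    parent = entropy(labels)
    candidates = sorted(set(values))
    best_tau, best_gain = None, 0.0
    for lo, hi in zip(candidates, candidates[1:]):
        tau = (lo + hi) / 2  # midpoint between two neighboring values
        left = [c for x, c in zip(values, labels) if x <= tau]
        right = [c for x, c in zip(values, labels) if x > tau]
        gain = parent - (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if gain > best_gain:
            best_tau, best_gain = tau, gain
    return best_tau, best_gain


print(best_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))
```

Each candidate requires a full pass over the data, so the search is quadratic in the number of points at the node. This cost is the motivation for faster schemes such as the histogram-based one mentioned above.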

Leaf and Prediction

The leaf predicts the result for a given test data point. Therefore it stores some important information:

  • Classification: The leaf stores an empirical distribution over the classes associated with the training data that reached it. The predictor p(c|v) with c ∈ {ck} represents the probability that data point v is an element of class c.
  • Regression: The leaf returns a continuous variable or a point estimate, e.g. c* = argmax_c pt(c|v).

Training

Random Forests are often trained in an offline training phase, but there are some approaches which use Random Forests in combination with online training [17]. To reach a significantly high generalization quality, randomness must be injected during the training process. It can be injected in a wide variety of ways, but the common way is to provide a random set of training data to each tree. The randomness is only injected during the training phase and not during the test phase.
Sj denotes the set of training data that reaches node j. SjR and SjL are defined as the sets of data that reach the child R or L of node j. Equation #9 describes the relationship.

Sj = SjL ∪ SjR,   SjL ∩ SjR = ∅    (#9)

Furthermore, S0 is the training set of (randomly chosen) data points {v} injected at the root node 0 of each tree. This training data and the associated ground truth are forwarded at each split node with the goal of minimizing the energy function. At each split node, the training data is separated so that the information gain is maximized (which equals the search for the best separation and its corresponding parameters). For that, the weak learner θ in each split node is used with the binary test function h(v, θj) to obtain θj* from equations #7 and #8. After that, the stopping criteria are checked and the creation of a new split node or leaf is started. A split node is created when no stopping criterion is met. Otherwise, a leaf is created as mentioned in section “Leaf and Prediction”.

The training stops when one or more predefined stopping criteria are met. For stopping the training of the forest, and in particular of each tree, the following criteria exist:

  • The maximum tree depth D is reached
  • Fewer than a defined minimum number of training examples reach the node

The choice of suitable stopping criteria can help to increase the generalization performance. Other optimizations like tree pruning exist [13].
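
The training of a single tree with these stopping criteria can be sketched in Python as follows. The sketch assumes axis-aligned, one-sided splits, a nested dictionary as tree representation and a third stopping rule for pure nodes. These choices and all names are assumptions for illustration and not part of the method described above.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(k / n * log2(k / n) for k in Counter(labels).values()) if n else 0.0


def grow(points, labels, depth=0, max_depth=3, min_samples=2):
    """Grow one binary tree; leaves store the empirical class distribution."""
    n = len(labels)
    leaf = {"leaf": {c: k / n for c, k in Counter(labels).items()}}
    # Stopping criteria: depth limit D, too few examples, or a pure node.
    if depth >= max_depth or n < min_samples or len(set(labels)) == 1:
        return leaf

    best = (0.0, None, None)  # (gain, feature index, threshold)
    for f in range(len(points[0])):
        values = sorted({p[f] for p in points})
        for lo, hi in zip(values, values[1:]):
            tau = (lo + hi) / 2
            left = [c for p, c in zip(points, labels) if p[f] <= tau]
            right = [c for p, c in zip(points, labels) if p[f] > tau]
            gain = entropy(labels) - (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if gain > best[0]:
                best = (gain, f, tau)

    _, f, tau = best
    if f is None:  # no split improves the objective
        return leaf
    left = [(p, c) for p, c in zip(points, labels) if p[f] <= tau]
    right = [(p, c) for p, c in zip(points, labels) if p[f] > tau]
    return {
        "feature": f,
        "tau": tau,
        "left": grow([p for p, _ in left], [c for _, c in left], depth + 1, max_depth, min_samples),
        "right": grow([p for p, _ in right], [c for _, c in right], depth + 1, max_depth, min_samples),
    }


tree = grow([[1.0], [2.0], [3.0], [4.0]], ["a", "a", "b", "b"])
print(tree)
```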

Testing

During the test, each split node applies its specific split function to the test data that reaches it. This split function can be the same function for all nodes, with a variation in the parameters. Another way to provide a split function to a node is to choose a function that solves a sub-problem (e.g. categorizing a photo as outside or inside). After the split function of each node on the path has been applied, the data point arrives at a leaf. The leaf acts as the predictor. It contains a probability of class affiliation (classification) or an estimated function value (regression). This procedure works for one decision tree. In a Random Forest, many decision trees execute this procedure on the same input data. This is the benefit of a Random Forest: more than one predictor estimates the result. The estimates of all decision trees are then combined into one estimate, which is the final result.

Decision Forest Model

The model of a Random Forest combines a set of decision trees (in this document binary decision trees) into an ensemble of weak learners. The number of trees is static and is defined during the design stage of the development. The optimal number of trees depends on the specific problem. It is important to inject the right amount of randomness and to get an expressive prediction out of the forest. The sections below provide considerations about these two topics.

Randomness

Randomness is one of the important components of Random Forests. The randomness offers robustness against noisy data and an improvement of generalization. Two options to inject randomness during the training into the forest are random training dataset sampling [14] and randomized node optimization [15].

As already mentioned in previous sections, the randomness is injected using p. This parameter has the following influence on the trees:

  • Large values of p correspond to little randomness and a large tree correlation. This is not the optimal case, because the forest acts more like a single decision tree, e.g. p = |Υ| where Υ represents the complete set of all possible combinations of parameters θ.
  • Small values of p correspond to high randomness and the trees act differently, e.g. p = 1.

For the injection of randomness at each node, equation #7 comes into play. This equation needs a modification so that it uses a subset of Υ.

θj* = argmax_{θj ∈ Υj} Ij    (#9)

using Υj ⊂ Υ.
The ratio of randomness is controlled by |Υj| / |Υ|.
For |Υ| ≠ ∞ the parameter p = |Υj| is introduced. As mentioned above, this parameter can take the values p = 1,…,|Υ|. For p = 1 the trees have the highest randomness in the system. In contrast, p = |Υ| provides the lowest degree of randomness.
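
Randomized node optimization with the parameter p can be sketched in Python as follows. Each node draws p candidate parameter settings from the full set Υ and keeps the best one. The candidate representation (feature index, threshold) and the objective function are hypothetical stand-ins for the information gain.

```python
import random


def sample_candidates(all_params, p, rng):
    """Draw the node's subset of p candidate settings from the full set."""
    if not 1 <= p <= len(all_params):
        raise ValueError("p must lie between 1 and the size of the full set")
    return rng.sample(all_params, p)


def optimize_node(all_params, p, objective, rng):
    """Return the sampled candidate with the highest objective value."""
    return max(sample_candidates(all_params, p, rng), key=objective)


# Hypothetical full candidate set: (feature index, threshold) pairs.
params = [(0, 1.5), (0, 2.5), (0, 3.5), (1, 0.5)]


def objective(theta):
    """Hypothetical stand-in for the information gain of a candidate."""
    return -abs(theta[1] - 2.5)


rng = random.Random(0)
print(optimize_node(params, len(params), objective, rng))  # p = full set: no randomness
print(optimize_node(params, 1, objective, rng))            # p = 1: one random candidate
```

With p equal to the size of the full set every node sees all candidates, so all trees trained on the same data pick the same split. With p = 1 the choice is purely random and the trees differ the most.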

The Ensemble Model

To get a prediction for a test data point v from a forest of T trees, with t ∈ {1,..,T}, the predictions of all trees are combined using the equation:

p(c|v) = (1/T) Σ_{t=1..T} pt(c|v)    (#10)

For classification, the result p(c|v) gives the probability that data point v is an element of class c. For regression, on the other hand, each tree posterior comes with a confidence value for its estimate. To get the final result, an average over all tree posteriors with their corresponding confidence values is calculated. Tree testing can be done in parallel using modern multi-core CPU or GPU computation [16].
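
For classification, the averaging of the tree posteriors can be sketched in Python as follows. The sketch assumes that every tree returns its leaf distribution as a dictionary from class to probability. The function name and the sample values are illustrative.

```python
def forest_posterior(tree_posteriors):
    """Average the per-tree class distributions p_t(c|v) into p(c|v)."""
    classes = set().union(*tree_posteriors)
    n_trees = len(tree_posteriors)
    return {
        c: sum(p.get(c, 0.0) for p in tree_posteriors) / n_trees
        for c in classes
    }


# Hypothetical leaf distributions of three trees for the same test point v.
votes = [{"a": 1.0}, {"a": 0.5, "b": 0.5}, {"b": 1.0}]
posterior = forest_posterior(votes)
print(posterior)
print(max(posterior, key=posterior.get))  # predicted class
```

A class that is missing from the leaf of a tree counts as probability 0 for that tree, so the averaged values still sum to 1.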

Sources

[1] L. Breiman, J. Friedman, C. J. Stone and R. A. Olshen, Classification and Regression Trees, Chapman and Hall/CRC, 1984.
[2] R. E. Schapire, The strength of weak learnability, Machine Learning, 5(2):197 – 227, 1990.
[3] T. K. Ho, Random decision forests, Intl Conf. on Document Analysis and Recognition, pages 278 – 282, 1995.
[4] Y. Amit and D. Geman, Shape quantization and recognition with randomized trees, Neural Computation, 9:1545 – 1588, 1997.
[5] A. Criminisi, J. Shotton, D. Robertson and E. Konukoglu, Regression forests for efficient anatomy detection and localization in CT studies, MICCAI workshop on Medical Computer Vision: Recognition Techniques and Applications in Medical Imaging, Beijing, 2010, Springer.
[6] A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas and A. Criminisi, Entangled decision forests and their application for semantic segmentation of CT images, Information Processing in Medical Imaging (IPMI), 2011.
[7] B. W. Silverman, Density Estimation, Chapman and Hall, London, 1986.
[8] J. Shotton, M. Johnson and R. Cipolla, Semantic texton forests for image categorization and segmentation, IEEE CVPR, 2008.
[9] N. Duchateau, M. De Craene, G. Piella and A. F. Frangi, Characterizing pathological deviations from normality using constrained manifold learning, MICCAI, 2011.
[10] Q. Zhang, R. Souvenir and R. Pless, On manifold structure of cardiac MRI data: Application to segmentation, In IEEE Computer Vision and Pattern Recognition, Los Alamitos, CA, USA, 2006.
[11] C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, 1948.
[12] S. Mannor, R. Meir, Y. Bengio and D. Schuurmans, On the Existence of Linear Weak Learners and Applications to Boosting, 2002.
[13] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, 2001.
[14] L. Breiman, Random forests, Machine Learning, 45(1):5 – 32, 2001.
[15] T. K. Ho, The random subspace method for constructing decision forests, PAMI, 20(8):832 – 844, 1998.
[16] T. Sharp, Implementing decision trees and forests on a GPU, In ECCV, 2008.
[17] A. R. Saffari, C. Leistner and H. Bischof, Online Random Forests, In Proc. IEEE Online Learning for Computer Vision Workshop, 2009.

]]>
http://www.cip-labs.net/2013/01/17/introduction-to-random-forests/feed/ 2
How NBDS learns to learn http://www.cip-labs.net/2012/07/27/how-nbds-learns-to-learn/ http://www.cip-labs.net/2012/07/27/how-nbds-learns-to-learn/#comments Fri, 27 Jul 2012 11:37:20 +0000 flezzfx http://www.cip-labs.net/?p=697 After a long period of absence, I’m back to announce version 0.2.5 of the neuron-based data structure called NBDS. This project was introduced in December 2010 as a concept. After that, the development of NBDS continued, so I wrote about my first implementation of the NBDS concept in PHP. This post is about the improvements of NBDS for PHP. I will write about new features, ideas and some things that can be easily solved with NBDS.

New features

After the last release of version 0.1.3 on github, version 0.2.5 will be the next official release. You can consult a few documents (in the docs folder or on the project page) to understand how NBDS works. Below you can find a list of new features and functionalities that were added to the NBDS:

Lookup handling

The lookup/reverse lookup table was implemented in version 0.1.1 and works pretty well when it is set to ‘on’. After I noticed that, I made this feature mandatory and removed all on/off switches from the code. This makes the handling easier and furthermore improves the speed of NBDS.

Weight of axons

Since version 0.2.2, NBDS implements a new attribute for the Axon object. It is inspired by nature, where a synapse between neurons has a weight that expresses the importance of the synapse. To make NBDS able to learn, you need an element that learns. With this idea in mind, the following versions of NBDS will be able to learn, and you can train them with common training algorithms for neural networks.

Aggregate functions

In version 0.2.3 the aggregate() function was added to the Space object. It allows you to define a set of object IDs and run some operations on it, for instance <, <=, >=, ==, sum, mean, median and more. You get back all elements that pass your aggregate test. This function is very convenient for fast checks.

Callback function

The callback() function is the latest feature. Introduced in version 0.2.3, it runs a user-written PHP function on a set of objects that you have selected before. This is a fast way to do complex operations on many neurons or axons.

Ideas for the future

With the continued development of NBDS come a few new possibilities and ideas. Two of these ideas are listed and explained below.

Learn to learn

NBDS was built as a neural data structure. According to the current state of research, no one knows, with the exception of the grandmother neuron phenomenon, where concrete information is stored in the brain. NBDS should provide a simple approach to that: a computer model that stores flexible objects of information in a graph structure. With the learning feature you can use NBDS as a neural network that can be trained and still be used as a data structure, so NBDS can serve a few more areas of application.

Implementation in C++/Java

At first I implemented NBDS in PHP. This was a good decision because PHP is a very flexible language (it allows associative arrays and simple object handling, for instance in serialization). After a solid implementation in PHP with all ideas, upcoming features and documentation, the focus of NBDS development will move to a compiled language. At the moment of writing it is unclear whether I will use C++ or Java for this task. Recommendations with good arguments are very welcome.

Current project using NBDS (NBDS-SM)

At the moment I’m still working on a natural language processor (NLP) at cip-labs. This NLP operates with some rules for sentences. After parsing the input and finding the types of sentences and expressions, the information is looked up in a semantic network. The semantic network is realized with an NBDS system. It works very well because NBDS provides a lot of functionality for this problem. Moreover, NBDS has some more advantages:

  • A semantic network is a graph, and so is NBDS
  • Flexible add and delete operations for edges (Axon objects) and vertices (Neuron objects)
  • Flexible attributes (key-value storage)
  • Comfortable use of operations (select, selecti, route, neighbor, and others)

Conclusion

All in all, I’m really looking forward to continuing the development and realizing all ideas and concepts to improve the NBDS system. The next steps are the implementation of the ideas mentioned above and some more cool stuff that is already in the concept stage. I would like to get some feedback from you, so don’t hesitate to contact me.

Goto: project page

]]>
http://www.cip-labs.net/2012/07/27/how-nbds-learns-to-learn/feed/ 0
Theta8 – Next step website analysis http://www.cip-labs.net/2012/03/08/theta8-next-step-website-analysis/ http://www.cip-labs.net/2012/03/08/theta8-next-step-website-analysis/#comments Thu, 08 Mar 2012 13:29:05 +0000 flezzfx http://www.cip-labs.net/?p=669

I proudly present a new release of a project that I’ve developed at the cip-labs. It’s named Theta8. This smart web application analyzes the response of a given URL and collects some data about it. It gives advice, tricks and tips for frontend optimization. Below, I explain some of its functionality in more detail:

How was it built?

Theta8 consists of PHP code and some lines of C code. The core was programmed in PHP, but some functionality (e.g. network operations, calculations on data) was moved to faster C code. This has several advantages, but the main reason was speed. The programming paradigm is simple OOP. At the moment, I’m testing a lot to make Theta8 faster. Keep in mind that requesting links and resources puts a lot of load on the network side. This is the bottleneck in this case.

Features

Below, I list the features that are currently implemented:

  • analyze metatags and make suggestions
  • detect CSS classes
  • request links and resources (e.g. headers, mime-types, dead link detection)
  • analyze caching, minification, merging and compression
  • tips for spreading hosts

Future

At the moment, I am working on some features that I want to develop for this project. Here are some key points:

  • Speeding up the analysis
  • Improve the accuracy of information
  • Better CSS class feature
  • Usability testing tool
  • Online compression checker

Motivation

The Theta8 project was inspired by the knowledge and passion that I have put into the topic of web application performance. This topic is really interesting and I have done some research on it during the last 4 years. I worked for a few companies to improve the performance of their applications. Originally, one of these companies wanted to buy the project (Theta8) to monetize it. I rejected this offer and decided to continue the work on this project.

Goto: Theta8

]]>
http://www.cip-labs.net/2012/03/08/theta8-next-step-website-analysis/feed/ 0
Neuron based data structure – An implementation http://www.cip-labs.net/2011/12/14/neuron-based-data-structure-an-implementation/ http://www.cip-labs.net/2011/12/14/neuron-based-data-structure-an-implementation/#comments Wed, 14 Dec 2011 20:54:08 +0000 flezzfx http://www.cip-labs.net/?p=653 Exactly one year ago, I came up with the idea of a neuron based data structure. In the first article about this idea, I tried to give an overview of how this data model should work. In the article below, I introduce you to the implementation of this model. It’s still in development, but all the functions that were needed are already implemented.

What is NBDS?

The neuron based data structure (called NBDS) follows the idea of keeping each piece of information as an atomic part. The model contains three parts. The first part is the Neuron, which acts as a container for data. The second part is the Axon. An axon connects two neurons and can itself contain information (data about the connection or relation). The last part is the Space. In a Space you put neurons and axons together and run operations on them. You can imagine the Space as the component that brings order into the set of neurons and axons.

To keep it short and simple: NBDS represents a directed graph. The axons (edges) can be one-directional or bi-directional. NBDS has something in common with the network data model, but it is more flexible. It also has something in common with the object-oriented data model, but an object can have no, one or more connections to other objects (and the connection itself can also contain information).

Neuron

Biological motivation

Neurons are the nerve cells of the brain. These nerve cells are connected to each other by dendrites and axons. When an electrical impulse comes in (from another neuron via axons and dendrites), the neuron builds up an electrical potential. When this potential reaches a certain threshold value, the neuron itself fires an electrical impulse out to other neurons. That is how your brain works. In NBDS, the neuron doesn’t have a threshold value, but it has a counter that counts how often the neuron was reached during the operations on the Space. With this counter, you can see how often your information was affected.

Functionality

A neuron is a data container. It contains the ID, name, value, description, generic attributes (key, value) and a counter (how often the neuron was affected by operations on the space). It can have connections to other neurons via axons.

Axon

Biological motivation

The axon (and also the dendrite) connects neurons with each other. It forwards the electrical signal from the source neuron (or the retina) to one or more destination neurons.

Functionality

In this data model, the axon does more than just forward an impulse or signal. It also has an ID, name, value, description, generic attributes (key, value), a usage counter, the bi-directional option and the IDs of the involved neurons.

Space

Functionality

A space, in this case, is a container that holds neurons, axons and the connections between them. In the NBDS you can perform operations on the Space to get the information out of the neurons and axons. Furthermore, it contains some lists that make operations on the Space faster.

Universe

The universe array is a structure that represents the connections between neurons. It was designed as a fast lookup table for operations on the space. Nearly all methods that need to find a connection between two neurons quickly use this array for a lookup.

Mnemonics

In the real world, mnemonics are things that you know from the past. In this data model it’s similar. The mnemonic array holds route operations (from the past) from neuronX to neuronY. It contains all successful routes that the route() method has detected.

Operations

Serialization

The NBDS can serialize and unserialize a set of data. It offers two methods. The first method is to save the data as an XML file. This method is really simple and the result is easy to import into other systems. But the XML serialization has one problem: it is not able to store objects. This only affects nested objects that you hold in an axon or neuron.

The second option for serialization is to save the data with the serialize() function of PHP. This method fixes the problem with nested objects and can store the data in binary form or as plain text in a JSON-like format.

Route

The route operation finds a connection between two neurons (over one or more axons), assuming that such a connection exists.
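
How such a route lookup could work can be sketched with a breadth-first search over an adjacency table. This is only an illustration under assumptions, not the NBDS implementation: the functions buildUniverse() and findRoute() and the array layout are hypothetical, and the sketch needs PHP 7.1 or newer.

```php
<?php
// Hypothetical stand-in for a universe-like lookup table:
// neuron ID => list of directly reachable neuron IDs.
// A bi-directional axon is entered in both directions.
function buildUniverse(array $axons): array
{
    $universe = [];
    foreach ($axons as [$from, $to, $bidirectional]) {
        $universe[$from][] = $to;
        if ($bidirectional) {
            $universe[$to][] = $from;
        }
    }
    return $universe;
}

// Breadth-first search. Returns the neuron IDs along a shortest route,
// or null when no route exists.
function findRoute(array $universe, int $start, int $goal): ?array
{
    $previous = [$start => null]; // visited neurons and their predecessor
    $queue = [$start];
    while ($queue !== []) {
        $current = array_shift($queue);
        if ($current === $goal) {
            // Walk back from the goal to the start to rebuild the route.
            $route = [];
            for ($id = $goal; $id !== null; $id = $previous[$id]) {
                array_unshift($route, $id);
            }
            return $route;
        }
        foreach ($universe[$current] ?? [] as $next) {
            if (!array_key_exists($next, $previous)) {
                $previous[$next] = $current;
                $queue[] = $next;
            }
        }
    }
    return null;
}

$universe = buildUniverse([
    [1, 2, false], // 1 -> 2
    [2, 3, true],  // 2 <-> 3
    [4, 3, false], // 4 -> 3
]);
print_r(findRoute($universe, 1, 3));  // the route 1, 2, 3
var_dump(findRoute($universe, 1, 4)); // NULL, because 4 -> 3 is one-directional
```

Successful routes returned by such a function are the kind of result that a mnemonic-style cache could store for later lookups.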

Select

With this operation you can query parts of the data structure. You can, for instance, search for all NBDS_NEURON objects that have a name NBDS_LIKE ’special_neuro%’.

Selecti

The selecti operation performs several select() operations in one step. You can define $operation_arrays to declare more than one criterion for the request.

Connection

Connection() returns the components of a connection. It uses the route() operation internally to do a lookup (from_neuron – to_neuron). After that, it returns the NBDS_NEURON objects, the NBDS_AXON objects, or both (NBDS_ALL).

Neighbor

This operation calculates the neighbors of a neuron. Optionally, you can set an argument ($steps) to get the neighbors at a given distance (the so-called degree).

Translate

Translate performs a lookup on the _axons or _neurons array to find out which neuron or axon has the ID that was given as an argument. It returns the objects that belong to the given IDs.

Get

A simple get() request returns some parts of the space, for instance NBDS_NEURON, NBDS_AXON, NBDS_UNIVERSE or NBDS_MNEMONICS.

Use cases

Below, I have collected some ideas for applications of the NBDS:

Data model

The concept of NBDS deals with the representation of data. A bunch of data has a connection to another bunch of data. Each bunch of data is very flexible and generic. Furthermore, the connection between the bunches of data also carries data that describes it.

Ontology

An ontology describes relations between words (synonyms, etc.). With the NBDS you can represent such an ontology and request the synonyms of a word. You can also classify the axons as special relations between words.

Intelligent graphs

You can use NBDS as an intelligent graph, that is, a graph that you can use for routing operations. After that, you can find the fastest path and the information behind it.

Data relationships

Organizing data (atomic data) with relations is one of the main advantages of the NBDS. You are able to organize nested information with the neurons. For instance, you can put a Space.class.php object into a neuron. This means that you can use recursive structures to store your data.

Semantic networks

You can also use it as a semantic network. By adding neurons with names and relationships to each other, you can represent a semantic network of words.

System requirements

The system requirements are just PHP 5.x or higher.


goto: download (version 1.0.0)

goto: project page

]]>
http://www.cip-labs.net/2011/12/14/neuron-based-data-structure-an-implementation/feed/ 1
Ipsum – The PHP formula parser http://www.cip-labs.net/2011/11/21/ipsum-the-php-formula-parser/ http://www.cip-labs.net/2011/11/21/ipsum-the-php-formula-parser/#comments Mon, 21 Nov 2011 21:43:59 +0000 flezzfx http://www.cip-labs.net/?p=626 Some time ago, I thought that it would be nice to have a smart piece of code that calculates the result of a given formula. The next step I thought about was to integrate customized functions into the code. So I wrote a parser that is able to do all these things. This article deals with this parser and how you can work with it. The code is really easy to understand and the parser is easy to use.

First considerations

First, you have to consider some classes that abstract all the structures you need.

class Morphem

A morphem is the atomic part of a language. In our case, a morphem represents a token of a formal language, for instance:

(, ), +, -, *, /, sin, cos, tan, sqrt, exp, ln, log10, -912, 12

The Morphem class represents a morphem after the lexer has detected it.

class Lexer

The lexer has the job of detecting morphems in the formula. With some simple rules, the lexer detects every morphem from left to right:

  • FVAL – function value, like sin, cos, log
  • DVAL – double value, like 1.4, 3.1, 5, 7, 11.11
  • CVAL – character value, like +, (, ), -
  • NOVAL – no value, tokens that are not defined
  • FINISHED – the lexer reached '\0' at the end of the string

This class returns the current morphem to the parser. The parser uses this morphem to calculate the result of your formula according to the rules of the grammar.

class Parser

An important part of a well-working parser is a good grammar. In this case, I chose a case-sensitive grammar. A grammar is a formal construct that defines how a language is structured. In this case, the language consists of all valid formulas. The following grammar was used in the code:

E -> T | T + E | T - E
T -> F | F * T | F / T
F -> (E) | N | -N | sqrt(E) | sin(E) | cos(E) | tan(E) | exp(E) | ln(E) | log10(E)
N -> I | I.D
I -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 1I | 2I | 3I | 4I | 5I | 6I | 7I | 8I | 9I
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0D | 1D | 2D | 3D | 4D | 5D | 6D | 7D | 8D | 9D

This grammar is implemented in the Parser class. The customized functions that you can integrate into the parser come in at F. This is where your functions live.
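
To show how the E, T and F rules map to code, here is a minimal recursive-descent evaluator. It is a sketch under assumptions and not the Ipsum source: the class name TinyFormula and all method names are hypothetical, custom functions are left out, and it needs PHP 7.4 or newer. One deliberate difference: the rules above are right-recursive, which would group 8-3-2 as 8-(3-2) if evaluated literally, so the sketch uses loops to evaluate operators of the same level from left to right.

```php
<?php
// Illustrative evaluator for the grammar above. All names are hypothetical.
final class TinyFormula
{
    private const FUNCTIONS = [
        'sqrt' => 'sqrt', 'sin' => 'sin', 'cos' => 'cos', 'tan' => 'tan',
        'exp' => 'exp', 'ln' => 'log', 'log10' => 'log10',
    ];

    private array $tokens;
    private int $pos = 0;

    public function __construct(string $formula)
    {
        // Numbers, function names and single-character operators become tokens.
        // Whitespace and any other characters are skipped silently.
        preg_match_all('/\d+(?:\.\d+)?|[a-z][a-z0-9]*|[-+*\/()]/', $formula, $matches);
        $this->tokens = $matches[0];
    }

    public function run(): float
    {
        $value = $this->expression();
        if ($this->peek() !== null) {
            throw new RuntimeException('Unexpected token: ' . $this->peek());
        }
        return $value;
    }

    private function peek(): ?string
    {
        return $this->tokens[$this->pos] ?? null;
    }

    private function expect(string $token): void
    {
        if ($this->peek() !== $token) {
            throw new RuntimeException("Expected '$token'");
        }
        $this->pos++;
    }

    // E: terms joined by + or -
    private function expression(): float
    {
        $value = $this->term();
        while (in_array($this->peek(), ['+', '-'], true)) {
            $operator = $this->tokens[$this->pos++];
            $right = $this->term();
            $value = $operator === '+' ? $value + $right : $value - $right;
        }
        return $value;
    }

    // T: factors joined by * or /
    private function term(): float
    {
        $value = $this->factor();
        while (in_array($this->peek(), ['*', '/'], true)) {
            $operator = $this->tokens[$this->pos++];
            $right = $this->factor();
            $value = $operator === '*' ? $value * $right : $value / $right;
        }
        return $value;
    }

    // F: (E), a number, a negated factor, or a function call
    private function factor(): float
    {
        $token = $this->peek();
        if ($token === null) {
            throw new RuntimeException('Unexpected end of formula');
        }
        $this->pos++;
        if ($token === '(') {
            $value = $this->expression();
            $this->expect(')');
            return $value;
        }
        if ($token === '-') {
            return -$this->factor();
        }
        if (is_numeric($token)) {
            return (float) $token;
        }
        if (isset(self::FUNCTIONS[$token])) {
            $this->expect('(');
            $value = $this->expression();
            $this->expect(')');
            return (self::FUNCTIONS[$token])($value);
        }
        throw new RuntimeException("Unknown token: $token");
    }
}

echo (new TinyFormula('2+3*4'))->run(), PHP_EOL; // 14
```

Each grammar level gets its own method, so operator precedence follows directly from the call order: expression() calls term(), and term() calls factor().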

How to work with it

Here you can see a simple example that should cover the typical use cases.

First example

require 'Parser.class.php';
$parser = new Parser('sin(4)+cos(4)');
$result = $parser->run();
echo 'result is: ' , $result , PHP_EOL;
// outputs: result is: -1.4104461161715

As you can see in the example above, it is really easy to use the parser. You just have to pass in the formula that you want to parse, nothing more.

The parser knows some functions by default:

  • sqrt(x) – square root of x
  • exp(x) – e^x
  • sin(x) – sine of x
  • tan(x) – tangent of x
  • cos(x) – cosine of x
  • log10(x) – logarithm base 10 of x
  • ln(x) – natural logarithm of x

So, it isn’t a problem for the parser to calculate:

(3*sin(5*cos(3))+1+ln(sin(3)))/sqrt(2)

source: wolframalpha.com

which is (output from the parser): 1.3842259196509

Extending functionality

Sure, I built in some functions that are important for mathematical calculations, but maybe you want some more functions to serve your needs. That is no problem:

require_once 'Parser.class.php';

// Custom functions take exactly one argument and return the result.
function divide2($x){
    return $x / 2;
}

function plusRandom($x){
    return $x + rand(0,10);
}

$parser = new Parser('div2(sin(pr(1)))');
// Map the name used in the formula to the name of the PHP function.
$parser->addFunction('div2', 'divide2');
$parser->addFunction('pr', 'plusRandom');
echo $parser->run();
// outputs sin(1 + r) / 2, where r is a random integer from 0 to 10

Note: Every function must have one argument (not more and not less).

Symbol table

The symbol table includes all functions that you have added or that were built in. I kept this structure static so that a new parser object has access to it. Furthermore, the parser needs this structure to be static because in some cases a new parser is started recursively. The table is an array of the Lexer class and you can access it with:

Lexer::$_userFunctions[$name_in_formula] = $real_function_name;

Using the calculator

Maybe you want to calculate a series of results for one formula with a variable, for example to draw a graph in a coordinate system. Then you should use the Calculator.

$calc = new Calculator();
$calc->options('{x}', 0, 10, 0.5);
$calc->addFunction('div2','divide2');
print_r($calc->calculate('sin({x})+cos({x})'));

The calculator evaluates the formula for the variable {x}, starting at 0 and running up to 10 with a step size of 0.5. That means the calculator runs for {x} = 0, {x} = 0.5, {x} = 1.0, {x} = 1.5, …, {x} = 10.0. The keys of the result array are the current steps and the values are the results of the calculation.
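
The stepping itself can be sketched without the parser. The function sampleFormula() below is a hypothetical helper, not part of the Calculator class, and it takes a plain PHP callable instead of a formula string. It shows two details that matter for any such loop in PHP: computing each step from an integer counter avoids accumulated rounding errors, and the steps are stored as string keys because PHP truncates float array keys to integers.

```php
<?php
// Hypothetical helper: evaluates $f at $start, $start + $step, ... up to $end.
function sampleFormula(callable $f, float $start, float $end, float $step): array
{
    // Number of steps, with a small tolerance for floating-point division.
    $count = (int) floor(($end - $start) / $step + 1e-9);
    $results = [];
    for ($i = 0; $i <= $count; $i++) {
        $x = $start + $i * $step;         // no accumulated rounding error
        $results[(string) $x] = $f($x);   // string key, e.g. "0.5"
    }
    return $results;
}

$series = sampleFormula(fn($x) => sin($x) + cos($x), 0, 10, 0.5);
echo count($series), PHP_EOL; // 21 values for x = 0, 0.5, ..., 10
```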

Wishlist

On my wishlist for the further development of this project:

  • n ^ m
  • n!
  • n % m
  • functions with more than one parameter

Get it!

You can download the files on the project page. There I will create a little documentation of the code for some hacking fun. Maybe you find it interesting to work with this little tool. If so, please let me know and write a comment or a mail. I would really like to get some feedback and wishes that I can add to the code.

goto: download

goto: project page

]]>
http://www.cip-labs.net/2011/11/21/ipsum-the-php-formula-parser/feed/ 0
Benford Calculation in PHP http://www.cip-labs.net/2011/11/01/benford-calculation-in-php/ http://www.cip-labs.net/2011/11/01/benford-calculation-in-php/#comments Tue, 01 Nov 2011 07:20:50 +0000 flezzfx http://www.cip-labs.net/?p=612 Some months ago, I published an article about Benford’s Law. That article deals with an implementation in C based on the summation and product formulas. The algorithm is indeed cool for small statistical games, but not nice to use when you have a set of numbers.

So I developed a small class that serves this need. Furthermore, this class is able to calculate the geometric, harmonic and quadratic averages.

After including (require, include) this class, you can create a new object with

$benford = new Benford(array(1341,82,141,215), 10);

The first parameter is the array with the values to calculate. The second parameter is the base of the number system. This parameter is 10 by default (10 means the usual decimal system with the digits 0..9). You can add more values with:

$benford->addValue(array(9,128,412));

or simply:

$benford->addValue(3222);

After that, you have two options:

You can perform a run through the given set of numbers and get back the values of the Benford calculation. The second option is to use an average function. When you call an average function, the run() function is also called internally.
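
What such a run could compute is sketched below. This is an assumption about the idea, not the code of the class: Benford’s Law expects the leading digit d in base b to occur with probability log_b(1 + 1/d), and the sketch compares this with the observed shares of leading digits. The function names are hypothetical, and the arrow function needs PHP 7.4 or newer.

```php
<?php
// Expected share of the leading digit $digit according to Benford's Law.
function benfordExpected(int $digit, int $base = 10): float
{
    return log(1 + 1 / $digit, $base);
}

// Observed share of every possible leading digit in $values.
function leadingDigitShares(array $values, int $base = 10): array
{
    $counts = array_fill(1, $base - 1, 0);
    $total = 0;
    foreach ($values as $value) {
        $value = abs((int) $value);
        if ($value === 0) {
            continue; // zero has no leading digit
        }
        $digits = base_convert((string) $value, 10, $base);
        $leading = (int) base_convert($digits[0], $base, 10);
        $counts[$leading]++;
        $total++;
    }
    return array_map(fn($count) => $total > 0 ? $count / $total : 0, $counts);
}

$shares = leadingDigitShares([1341, 82, 141, 215]);
foreach ($shares as $digit => $share) {
    printf("%d: observed %.3f, expected %.3f\n", $digit, $share, benfordExpected($digit));
}
```

With only four values the observed shares are far from the expected ones; the comparison only becomes meaningful for larger data sets.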

Get back an array with the Benford calculation:

$benford = new Benford(array(232, 211, 410, 301, 508, 192));

$result = $benford->run();

This returns an array with all the Benford results for the given data. Below you can see the code to call the average functions.

$benford = new Benford(array(232, 211, 410, 301, 508, 192)); 

$result_harm = $benford->getHarmonicalAverage();

$result_geom = $benford->getGeometricalAverage();

$result_quad = $benford->getQuadrateAverage();
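
For reference, the three averages can be written down in a few lines. The functions below are hypothetical stand-ins that use the textbook definitions; whether the class applies them to the raw values or to the Benford results is not stated here, so treat this only as a sketch of the formulas. All values must be positive, and the arrow functions need PHP 7.4 or newer.

```php
<?php
// Geometric mean: n-th root of the product, computed via logarithms.
function geometricMean(array $values): float
{
    return exp(array_sum(array_map('log', $values)) / count($values));
}

// Harmonic mean: n divided by the sum of the reciprocals.
function harmonicMean(array $values): float
{
    return count($values) / array_sum(array_map(fn($x) => 1 / $x, $values));
}

// Quadratic mean: square root of the mean of the squares.
function quadraticMean(array $values): float
{
    return sqrt(array_sum(array_map(fn($x) => $x * $x, $values)) / count($values));
}

$values = [2, 8];
echo geometricMean($values), PHP_EOL; // approximately 4
echo harmonicMean($values), PHP_EOL;  // 3.2
echo quadraticMean($values), PHP_EOL; // square root of 34
```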

The script is very easy to use. Have fun with it!

Download the source file.

]]>
http://www.cip-labs.net/2011/11/01/benford-calculation-in-php/feed/ 0
MySQL Schema Performance Considerations http://www.cip-labs.net/2011/10/19/mysql-schema-performance-considerations/ http://www.cip-labs.net/2011/10/19/mysql-schema-performance-considerations/#comments Wed, 19 Oct 2011 09:59:34 +0000 flezzfx http://www.cip-labs.net/?p=607 Most optimization tricks for MySQL focus on query performance or server tuning. But optimization starts with the design of the database schema. If you forget to optimize the base of your database (the structure), you will pay the price for your laxity from the very beginning of your work with the database. Sure, every storage engine has its own advantages and disadvantages. But regardless of the engine you choose, you should consider the following chapters in your database schema. In this article, I write about some architectural considerations that you should keep in mind when you design a database that will run under heavy load. This article is a part of the web-application performance series.

Read-optimized tables

In current web applications, reading from a database is much more frequent than writing data to it. For this reason, the database server has to scan a lot of tables to deliver the data. Most tables that contain the (so-called) master data are very large. These tables have a big row size and are very costly to scan. For these reasons, it is a well-known method to split up such a table. You can hold the data that is most likely involved in scans in a separate table to avoid a scan over the whole data. A problem occurs when you want to query the full data set. Here you have two options:

  1. execute a join between the read-optimized table and the other table (using a foreign key in the read-optimized table)
  2. denormalize the tables with an attribute that both tables have (and that is indexed twice)

The first alternative is not geared towards performance, but it fulfills all rules of normalization. The second alternative is a denormalized table that includes some redundancy in the data. Most web applications run with small redundancies in their databases to benefit from the performance improvement of these read-optimized tables. See also the chapter “Denormalize tables”.

Denormalize tables

As I said in the chapter above, you should denormalize tables when your joins between two or more tables are too slow. Denormalization should be the last step, after all of your architectural repertoire is exhausted, because denormalized tables mean storing some data in more than one place. Of course, some inconsistencies can occur and the database is harder to handle. The administrative overhead will increase if you denormalize too much of your data.

Denormalization is useful if you split one table into two tables and you don’t want to join these tables together in a query. In most cases, the database would execute a join over more than 2 tables to get the information. This is really costly for current web applications! But be aware: think about which denormalization is necessary and which is not.

Artificial key columns

If you have a fast natural key (that results from the data analysis), don’t add an artificial key to it. Remember that all keys need disk and/or memory space. But you have to weigh up whether your natural key is fast enough or not. Maybe your key is alphanumeric and too large to be efficient under high load. Then you should use a numeric, artificial primary key.

Bigint vs. unsigned

It is important for performance to choose the right data type for the primary key. Consider whether you really need a BIGINT or INT. Maybe the MEDIUMINT data type fits your needs. Compared to INT (4 bytes), MEDIUMINT (3 bytes) saves 25% of the space. You should use the smallest data type for 2 reasons. The first reason is that the row will be smaller, and the second reason is that the index of the primary key will be smaller and faster. Another idea is to use the UNSIGNED variant of a data type. In most cases, the range of UNSIGNED INT will suffice. Of course, you can apply this consideration to all numerical fields (and not just to key columns).

ranges of data types

Type       Storage (bytes)  Signed minimum        Signed maximum       Unsigned minimum  Unsigned maximum
TINYINT    1                -128                  127                  0                 255
SMALLINT   2                -32768                32767                0                 65535
MEDIUMINT  3                -8388608              8388607              0                 16777215
INT        4                -2147483648           2147483647           0                 4294967295
BIGINT     8                -9223372036854775808  9223372036854775807  0                 18446744073709551615

source: mysql.com
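
The ranges follow directly from the storage size: a type with n bytes covers 0 to 2^(8n) - 1 when unsigned and -2^(8n-1) to 2^(8n-1) - 1 when signed. The following sketch uses this to pick the smallest type for an expected value range. The function smallestIntType() is a hypothetical helper for illustration; it assumes a 64-bit PHP build and does not cover BIGINT UNSIGNED, because that maximum does not fit into a PHP integer.

```php
<?php
// Storage size in bytes of the MySQL integer types smaller than BIGINT.
const MYSQL_INT_BYTES = ['TINYINT' => 1, 'SMALLINT' => 2, 'MEDIUMINT' => 3, 'INT' => 4];

// Returns the smallest MySQL integer type that can hold $min..$max.
function smallestIntType(int $min, int $max): string
{
    foreach (MYSQL_INT_BYTES as $type => $bytes) {
        $bits = 8 * $bytes;
        if ($min >= 0 && $max <= 2 ** $bits - 1) {
            return "$type UNSIGNED";
        }
        if ($min >= -(2 ** ($bits - 1)) && $max <= 2 ** ($bits - 1) - 1) {
            return $type;
        }
    }
    return 'BIGINT'; // every 64-bit PHP integer fits into a signed BIGINT
}

echo smallestIntType(0, 200), PHP_EOL;     // TINYINT UNSIGNED
echo smallestIntType(0, 70000), PHP_EOL;   // MEDIUMINT UNSIGNED
echo smallestIntType(-1, 100), PHP_EOL;    // TINYINT
```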

Text vs. varchar

This is one of the most common rookie mistakes with MySQL. Some people think that it is better to use TEXT instead of VARCHAR. One point is that VARCHAR has a defined maximum length and TEXT doesn’t. But that is not the only one. Furthermore, TEXT needs more memory for sorting. So you should use TEXT just as a “storage” data type and VARCHAR for small fields of textual information (or as a key).

NOT NULL

If possible, declare all columns as NOT NULL. Every nullable column needs a little extra storage, and this is not necessary. When you really need NULL values in a column, you should allow them. But when you can avoid this, you should avoid it. MySQL has to do more work if a column allows NULL.

Other considerations

  • Do not index columns that you do not need in a select
  • Use clever refactoring to allow changes to your schema
  • Choose the minimal character set that fits your needs
  • Use triggers only when you really need them
]]>
http://www.cip-labs.net/2011/10/19/mysql-schema-performance-considerations/feed/ 0
Ideas to bypass the frequency analysis http://www.cip-labs.net/2011/09/30/ideas-to-bypass-the-frequency-analysis/ http://www.cip-labs.net/2011/09/30/ideas-to-bypass-the-frequency-analysis/#comments Fri, 30 Sep 2011 16:37:26 +0000 flezzfx http://www.cip-labs.net/?p=571 In the history of encryption article, I already dealt with mono alphabetic substitution. This means that every letter of the source alphabet is substituted with exactly one letter of the target alphabet. It is as simple to understand as it is easy to crack. The mono alphabetic substitution can be cracked with frequency analysis. This old method was invented by Arab scholars. They built a table with the frequency of occurrence of every letter of the source language. After that, they compared this table with the letter frequencies of the encrypted text. With this method, they were able to crack this simple cipher.

In this article, I will present some little tricks to avoid the cracking of this cipher with frequency analysis. I must admit that the list of the following methods is not complete. All ideas are just an approach, but I think that they can be very useful for private usage. (In all examples, I will use ROT13 to symbolize an encryption with a mono alphabetic method. Furthermore, the short sentences in these examples are nearly uncrackable with a frequency table, because they don’t deliver enough input.)

I considered the following techniques:

  • Vocal reduction
  • Extended target alphabet
  • Sectional target alphabets

Vocal reduction

The vocal reduction method can be very strong and hard to crack. On the other hand, this method isn’t very practicable. The main problem is that the receiver has to reconstruct the message with a lot of effort. Vocal reduction means that the sender deletes all vowels from the source text before he encrypts the message. The analysis fails because the frequency of the letters no longer matches the language. For English texts, the target alphabet has 5 letters fewer than the source alphabet. The receiver knows the substitution and can decrypt the text. But the effort to reconstruct the text with the correct vowels is very high. The only helper that the receiver has is the semantics of the source language. With the estimated sense of the message (which makes it an information), the receiver is able to reconstruct the text faster.

process

  • plain source text
  • source text with vowels deleted
  • encryption
  • sending
  • receiving
  • guess the plain source text

example

plain source text: Hello World – How are you – And what about your sister.

source text with vowels deleted: Hll Wrld – Hw r y – nd wht bt yr sstr.

encrypted with ROT13: Uyy Jeyq – Uj e l – aq jug og le ffge.

Note that the main disadvantage of this method is the loss of information during the encryption process!
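
The sender’s side of this process fits into a few lines of PHP. The function name vowelReducedRot13() is made up for this sketch; str_rot13() and preg_replace() are standard PHP functions. There is no matching decrypt function, because the deleted vowels cannot be recovered automatically.

```php
<?php
// Deletes all vowels (a, e, i, o, u) and encrypts the rest with ROT13.
function vowelReducedRot13(string $plain): string
{
    $withoutVowels = preg_replace('/[aeiou]/i', '', $plain);
    return str_rot13($withoutVowels);
}

echo vowelReducedRot13('Hello World - How are you'), PHP_EOL; // Uyy Jeyq - Uj e l
```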

Extended target alphabet

A target alphabet is used to represent the encrypted source letters. That’s the sore point of the mono alphabetic substitution: one letter of the source alphabet is represented by exactly one letter of the target alphabet. Frequency analysis counts the occurrence of all letters. Then you can try some combinations and you will crack the cipher easily. An extended target alphabet (which includes, for instance, 0,..,9,$,#,+,*, etc.) has more options to represent a source letter. For instance, you can translate an “a” with “t” or “$”. You can’t do this for every letter, but you can do it for a few of the most used letters. So the frequency analysis goes wrong! Indeed, this method isn’t very elegant, but it fulfills its purpose.

example

plain source text: Hello World – How are you – And what about your sister.

substitution rules: ROT13 and A <-> N,#  and E <-> R,$

encrypted text: Uryyb Jbeyq – Ubj ner lbh – #aq jung nobhg lbhe fvfg$e.
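
The substitution rules of this example can be sketched as follows. The function names are hypothetical, and the sketch makes two simplifying assumptions: it works on lowercase text only (the symbols # and $ cannot carry upper or lower case), and the plain text must not contain # or $ itself. For every “a” and “e” one of the two target symbols is picked at random, so the same plain text produces different encrypted texts.

```php
<?php
// ROT13 with two alternatives each for the frequent letters a and e.
function encryptExtended(string $plain): string
{
    $alternatives = ['a' => ['n', '#'], 'e' => ['r', '$']];
    $cipher = '';
    foreach (str_split(strtolower($plain)) as $char) {
        if (isset($alternatives[$char])) {
            $choices = $alternatives[$char];
            $cipher .= $choices[random_int(0, count($choices) - 1)];
        } else {
            $cipher .= str_rot13($char);
        }
    }
    return $cipher;
}

// Maps the extra symbols back to their ROT13 letters, then applies ROT13.
function decryptExtended(string $cipher): string
{
    return str_rot13(strtr($cipher, ['#' => 'n', '$' => 'r']));
}

$cipher = encryptExtended('and what about your sister');
echo $cipher, PHP_EOL;                  // varies from run to run
echo decryptExtended($cipher), PHP_EOL; // and what about your sister
```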

Sectional target alphabets

Every mono alphabetic substitution has one target alphabet that represents the encrypted text. Another option is to use several target alphabets. This method is very flexible. You can use a new alphabet for every new sentence or for every new section in the text. The agreement on which alphabet is used when must be made before the first text is encrypted. The frequency analysis fails because the frequencies of occurrence are less clear than with a simple mono alphabetic substitution. Which letter has the highest occurrence and which one has the lowest looks random.

process

  • plain source text
  • divide the text into sections (words, sentences, paragraphs etc.)
  • encrypt every section with its own rule
  • encrypted text

example

  • The key is 1938271
  • This means that the first sentence is encrypted with ROT1, the second with ROT9, the third with ROT3, etc.
  • The receiver knows the key

Please note that this method is still a mono alphabetic method within every section. Across all sections, it looks like a poly alphabetic substitution method, but it isn’t one.
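
A sketch of this scheme with the key from the example: every sentence is rotated by the key digit at its position. The function names rot() and cryptSections() are hypothetical, and the assumption that the key starts over when there are more sentences than digits is mine, not part of the description above.

```php
<?php
// Rotates the letters a-z and A-Z by $shift positions (negative shifts rotate back).
function rot(string $text, int $shift): string
{
    $shift = (($shift % 26) + 26) % 26;
    $lower = 'abcdefghijklmnopqrstuvwxyz';
    $upper = strtoupper($lower);
    $target = substr($lower, $shift) . substr($lower, 0, $shift)
            . substr($upper, $shift) . substr($upper, 0, $shift);
    return strtr($text, $lower . $upper, $target);
}

// Encrypts (or decrypts) every sentence with the key digit at its position.
function cryptSections(array $sentences, string $key, bool $decrypt = false): array
{
    $result = [];
    foreach (array_values($sentences) as $i => $sentence) {
        $shift = (int) $key[$i % strlen($key)]; // assumption: the key repeats
        $result[] = rot($sentence, $decrypt ? -$shift : $shift);
    }
    return $result;
}

$encrypted = cryptSections(['Hello.', 'World.'], '1938271');
print_r($encrypted);                                  // Ifmmp. and Fxaum.
print_r(cryptSections($encrypted, '1938271', true));  // Hello. and World.
```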

Conclusion

As you can see in the examples above, you can do much more with a mono alphabetic encryption method than just translating a source text into a target text with a simple alphabet. But in most cases, you have to provide extra information about the “How to decrypt?” aspect. To avoid sending the extra information separately, you can start your text with a couple of letters or numbers that represent an encrypted key. After decrypting this key with a standard procedure, you have the right key to perform the real decryption of the encrypted text. The ideas introduced above are really simple and easy to implement. I think that you can do many more tricky things with the mono alphabetic encryption method, but in my mind this ends up in more “security through obscurity”, and that misses the aim.

]]>
http://www.cip-labs.net/2011/09/30/ideas-to-bypass-the-frequency-analysis/feed/ 1
The history of encryption part 1 http://www.cip-labs.net/2011/08/08/the-history-of-encryption-part-1/ http://www.cip-labs.net/2011/08/08/the-history-of-encryption-part-1/#comments Mon, 08 Aug 2011 09:03:38 +0000 flezzfx http://www.cip-labs.net/?p=521 This article deals with the history of encryption and is the first of two articles about this topic. All techniques are described in chronological order. In this part of the article I deal with:

  • Introduction
  • Scytale and transposition cipher
  • ROT 13 and Caesar cipher
  • Maria Stuart and Babington Plot
  • Le Chiffre indéchiffrable (Vigenère square)
  • One-Time-Pad

Introduction

With the use of public transmission paths for discrete information, it became necessary to transmit the information safely and secretly. An information represents (at least for sender and receiver) a sequence of tokens that makes sense to them. Without any sense, it would be just a message. We can define the word information as a message with sense and meaning. Ideally, only the sender and the receiver know the logic of the message (which turns it into an information). But this is the ideal case, not the reality. Normally, a few more people than the sender and the receiver know the logic and the meaning of a sequence of tokens. The sender and his counterpart, the receiver, have to agree on a technique to keep the message secure. In the history of humanity, we find quite a few examples of encryption methods with a political or economic background. But encryption was also used for private messages, for example to keep an affair secret. A lot of women used cryptography to send intimate messages to their lovers. In this article, I will try to cover the important examples to show how necessary encryption is.

The Franciscan monk Roger Bacon was the first to publish a book about encryption methods: “Abhandlung über die geheimen Künste und die Nichtigkeit der Magie” (Treatise on the secret arts and nullity of magic). This happened in the 13th century. Bacon mentioned a few algorithms to encrypt a text with the normal alphabet. For instance, “Atbash” (Atbasch) is a simple monoalphabetic substitution with the following rules.

source A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
target Z Y X W V U T S R Q P O N M L K J I H G F E D C B A
source HELLO WORLD
target SVOOL DLIOW

As you can see in the table, the target alphabet mirrors the source alphabet. There is nothing special about it! But how do you crack this cipher? To reveal the solution, you first have to find out the language of the plain text (English, German, etc.). For that language, you need a table that lists every letter of the alphabet with its probability of occurrence (a frequency table). With this table, you can find out how the letters were exchanged. The same kind of table can be built for pairs of adjacent letters, which are called bigrams (in general, n-grams). In addition, you can simply exchange the letters back if you know the applied algorithm and all necessary parameters. But this is not the normal case.
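
The mirrored alphabet from the table takes only one call to the standard PHP function strtr(). The function name atbash() is chosen for this sketch. Because the alphabet is mirrored, the same function encrypts and decrypts.

```php
<?php
// Atbash: A <-> Z, B <-> Y, C <-> X, and so on. The text is converted to upper case.
function atbash(string $text): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    return strtr(strtoupper($text), $alphabet, strrev($alphabet));
}

echo atbash('HELLO WORLD'), PHP_EOL; // SVOOL DLIOW
echo atbash('SVOOL DLIOW'), PHP_EOL; // HELLO WORLD
```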

Example of a frequency table

letter    frequency in %
e 12.702
t 9.056
a 8.167
o 7.507
i 6.966
n 6.749
s 6.327
h 6.094
r 5.987
d 4.253
l 4.025
c 2.782
u 2.758
m 2.406
w 2.360
f 2.228
g 2.015
y 1.974
p 1.929
b 1.492
v 0.978
k 0.772
j 0.153
x 0.150
q 0.095
z 0.074
ï 0.01

source: wikipedia.org

This table shows the frequency of occurrence of all letters in the English language.

Nearly all encryption methods that were invented in this era work with monoalphabetic substitution. After that, encryption algorithms were geared towards encoding words (not just single letters). The mix of encoding words and encrypting letters is called nomenclature codes. These codes work with a substitution table and can also be cracked with frequency analysis and semantic knowledge of syllables and words. It is a great help to know the sense and the motivation of the message. Often, you can interpret half-decrypted parts of a message and try out words related to the motivation of the text. Knowing the syntactic and semantic structure of the encrypted text is the key to decrypting such nomenclature codes. In the subsequent chapters, I will talk about encryption methods in detail.
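
A table like this can be produced for any text with the standard PHP function count_chars(). The function name letterFrequencies() is made up for this sketch, which only counts the letters a to z and ignores everything else.

```php
<?php
// Returns letter => frequency in percent, sorted from most to least frequent.
function letterFrequencies(string $text): array
{
    $letters = preg_replace('/[^a-z]/', '', strtolower($text));
    $total = strlen($letters);
    $frequencies = [];
    // count_chars() in mode 1 returns byte value => count for all used bytes.
    foreach (count_chars($letters, 1) as $byte => $count) {
        $frequencies[chr($byte)] = round(100 * $count / $total, 3);
    }
    arsort($frequencies);
    return $frequencies;
}

print_r(letterFrequencies('Hello World')); // l comes first with 30 percent
```

Running this over an encrypted text and comparing the result with the table of the assumed language is the core of the frequency analysis.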

Scytale and transposition cipher

To encrypt a plain text with a transposition method means to change the positions of the letters in the text. In contrast to substitution, the transposition method performs no exchange of letters (monoalphabetic substitution), just a change of their positions. This method is immune against frequency analysis. The reason is that frequency analysis can, in this case, only decide approximately which language the message was written in. But why “approximately”? Frequency analysis can compare the frequency of occurrence of every letter with the frequency table of every language. The language whose frequency table matches the frequency table of the encrypted message best is probably the language of the message (provided that the cipher works only with transposition). A mixture of transposition and substitution prevents an attacker from finding out the natural language this way.

Scytale | source: wikipedia.org

A Scytale is a stick (cylinder) with a strip of paper (or, earlier, parchment) wound around it. The number of corners is the key to decrypt the text. The sender writes the message on the parchment wound around his stick. After that, he or she sends a courier to the receiver. The receiver has a stick with the same number of corners and can wind the piece of parchment around it. When the receiver has the right stick, he or she can read the decrypted text. This method was used by the Spartans and the ancient Greeks to encrypt military messages. For its time, the security of a Scytale was really strong. It became assailable with the idea of making a lot of sticks with different numbers of corners and trying each of them. After the invention of mechanical encryption and decryption, the Scytale as a tool was obsolete, but the concept was still used.
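The Scytale can be modelled as a simple grid transposition. The following Python sketch is only one possible model, not a historical reconstruction: it assumes that the message is written along the stick in as many rows as the stick has corners, and that the unwound strip is read column by column. The function names and the padding character "_" are assumptions made for this example.

```python
def scytale_encrypt(plain: str, corners: int) -> str:
    """Write the text in `corners` rows along the stick, read the strip turn by turn."""
    # Assumption: pad with "_" so that every row has the same length.
    text = plain + "_" * (-len(plain) % corners)
    cols = len(text) // corners  # number of turns of the strip
    return "".join(text[r * cols + t] for t in range(cols) for r in range(corners))


def scytale_decrypt(cipher: str, corners: int) -> str:
    """Wind the strip around a stick with the same number of corners and read the rows."""
    cols = len(cipher) // corners
    return "".join(cipher[t * corners + r] for r in range(corners) for t in range(cols))


if __name__ == "__main__":
    secret = scytale_encrypt("HELPMEIAMUNDERATTACK", 4)
    print(secret)                      # HENTEIDTLAEAPMRCMUAK
    print(scytale_decrypt(secret, 4))  # HELPMEIAMUNDERATTACK
    # Trying every plausible number of corners is the attack described above.
    for corners in range(2, 6):
        print(corners, scytale_decrypt(secret, corners))
```

Note that the letters of the message are unchanged and only reordered, which is why counting letters does not help to read the text.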

ROT 13 and Caesar Cipher

Julius Caesar (dictator of the Roman Republic), known for his assertiveness and wisdom, used a simple (but for his era really strong) method to encrypt his messages to his generals and centurions. He used a technique similar to the Atbash example above. Of course, Caesar lived a few decades earlier, but he also knew the idea of monoalphabetic substitution. His cipher can be named ROTx (rotate x), where x is the number of positions by which the alphabet is rotated. ROT13 is a special case of the Caesar cipher; the name was coined by Usenet users in the early 1980s. It is a Caesar cipher with the shift 13 (x equals 13). Example:

source A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
target N O P Q R S T U V W X Y Z A B C D E F G H I J K L M
source HELLO WORLD
target URYYB JBEYQ

In Caesar's era that cipher was strong, but later Arab scholars invented frequency analysis to break such ciphers.
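The ROTx idea can be written down in a few lines of Python. This is a minimal sketch; the function name rot is chosen for this example, and it assumes that only the 26 letters A to Z are shifted while everything else (spaces, digits) stays untouched.

```python
import string


def rot(text: str, x: int) -> str:
    """Rotate every letter A-Z / a-z by x positions (Caesar cipher, ROTx)."""
    upper = string.ascii_uppercase
    shift = x % 26
    shifted = upper[shift:] + upper[:shift]
    # Build one translation table for upper and lower case letters.
    table = str.maketrans(upper + upper.lower(), shifted + shifted.lower())
    return text.translate(table)


if __name__ == "__main__":
    print(rot("HELLO WORLD", 13))  # URYYB JBEYQ
    print(rot("URYYB JBEYQ", 13))  # HELLO WORLD
```

Because 13 is half of 26, applying ROT13 twice gives back the original text. For any other x, the message is decrypted with rot(text, -x).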

Maria Stuart and the Babington Plot

Maria Stuart (Mary, Queen of Scots) was born on 8 December 1542. In 1543 she was crowned Queen of the Scots. After several affairs, marriages and conspiracies, Maria Stuart was deposed and escaped to England. There she was arrested. After 18 years of imprisonment, she had a ray of hope: she received letters from Anthony Babington. Babington encrypted his letters with nomenclature codes, and Maria Stuart knew how to decrypt the text. At first she received a paper listing all nomenclature codes. After that, Babington sent all his messages encrypted. The spies of Walsingham, the security minister of Queen Elizabeth, analyzed the messages that Maria Stuart received. Thomas Phelippes was one of these spies. After collecting a couple of messages and running a frequency analysis on them, Phelippes cracked the cipher and decrypted all messages. He found out that Maria Stuart and Babington were planning the assassination of Queen Elizabeth. Then Walsingham had Phelippes (who was a genius at forging handwriting) add a forged encrypted postscript to one of Maria Stuart's letters to Babington, to find out which people were the insurgents and the accomplices. After this attack by Walsingham (today it would be called a "man-in-the-middle attack"), Maria Stuart was executed in England.

This example makes clear how risky encryption, and the breaking of it, can be. This nomenclature code could be decrypted by applying frequency analysis. One condition is that the cracker must know whether all symbols are substituted, words are encoded, or both methods are used. With this knowledge, you can crack such nomenclature codes with a simple frequency analysis.

Le Chiffre indéchiffrable (Vigenère square)

After Arab scholars had invented frequency analysis, the security of monoalphabetic substitution was broken. Blaise de Vigenère, a French diplomat, used an idea from the Italian polymath Leon Battista Alberti to invent a method called the polyalphabetic cipher. This cipher works with 26 alphabets (one for each of the 26 letters of the Latin alphabet) arranged in a square. Every alphabet starts with a different letter. Have a look at the image below:

source: pad2.whstatic.com

The Vigenère cipher works with a key phrase. Consider the following example:

plain text  HELLO WORLD
key phrase  KEYKE YKEYK
encrypted result  RIJVS UYVJN

But how does it work? A message encrypted with a Vigenère square was, in the era of its invention, really hard to decrypt. You must have the right Vigenère square to decrypt the text. Another innovation was the encryption key, which is used to encrypt the message. So the receiver needs the right Vigenère square and the correct key phrase. Now, here comes the explanation of the example above.

The text to encrypt is “HELLO WORLD”. We use the key phrase “KEY” for the encryption. If the key phrase is shorter than the message, the key phrase is repeated until it reaches the length of the plain text. So our key phrase is “KEYKEYKEYK”. With our correct Vigenère square, we can start to encrypt the message. The square has an extra alphabet on the upper side and also on the left side. For the first letter of the plain text, we look it up in the alphabet in the header row. We call this position plain1(H). After that, we look at the left side and find the first letter of the key phrase. This position has the name keyphrase1(K). Now we have the position of the letter of the plain text and the position of the letter of the key phrase. The cell where plain1(H) and keyphrase1(K) meet is the letter of our encrypted text. For this example, cell(plain1(H), keyphrase1(K)) = R. For all other letters, the algorithm is the same.
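The lookup in the square is the same as adding the alphabet position of the key letter to the position of the plain-text letter, modulo 26. The following Python sketch uses this shortcut instead of a real table. The function name vigenere and its decrypt parameter are made up for this example, and it assumes a non-empty key phrase that consists of the letters A to Z only. As in the example above, spaces are kept and do not consume a key letter.

```python
def vigenere(text: str, key: str, decrypt: bool = False) -> str:
    """Encrypt (or decrypt) text with a repeating key phrase."""
    # Assumption: the key phrase is non-empty and contains only letters A-Z.
    shifts = [ord(k) - ord("A") for k in key.upper()]
    result = []
    used = 0  # number of key letters consumed so far
    for ch in text.upper():
        if "A" <= ch <= "Z":
            shift = shifts[used % len(shifts)]
            if decrypt:
                shift = -shift
            # Same result as cell(plain, key) in the Vigenere square.
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
            used += 1
        else:
            result.append(ch)  # spaces and punctuation are copied unchanged
    return "".join(result)


if __name__ == "__main__":
    print(vigenere("HELLO WORLD", "KEY"))                # RIJVS UYVJN
    print(vigenere("RIJVS UYVJN", "KEY", decrypt=True))  # HELLO WORLD
```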

For the era of its invention, polyalphabetic substitution was very strong. The cryptanalysts countered with a few procedures and algorithms to crack the cipher. One kind of algorithm is used to detect the key length. Methods that detect the key length by exploiting the redundancy of natural languages are the Kasiski test (a method first discovered by Charles Babbage) and the Friedman test. In the case of a short text, a good method is to find out which words could stand at the beginning of the text. For instance, the first word “HELLO” is more likely than the word “HFAKF” or other senseless combinations, so the range of possible n-grams shrinks rapidly. In this way you can estimate a possible key with a brute-force attack and, in the best case, find the key. Once you have found possible parts of the key, what remains for each key position is a simple Caesar cipher, which you can crack with the frequency table (precondition: knowledge of the source language used).
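The idea behind the Kasiski test can be sketched as follows: if the same plain-text fragment meets the same part of the key twice, the cipher text repeats, and the distance between the repeats is a multiple of the key length. The Python code below is a simplified illustration and not the complete test. All function names are assumptions for this example, and taking the greatest common divisor of all distances is a shortcut that fails as soon as a repeat occurs by pure chance; in practice one looks at the most common factors of the distances.

```python
from collections import defaultdict
from functools import reduce
from math import gcd


def only_letters(cipher: str) -> str:
    """Keep the letters A-Z only, in upper case."""
    return "".join(c for c in cipher.upper() if "A" <= c <= "Z")


def repeat_distances(cipher: str, n: int = 3) -> list[int]:
    """Distances between consecutive occurrences of every repeated n-gram."""
    s = only_letters(cipher)
    positions = defaultdict(list)
    for i in range(len(s) - n + 1):
        positions[s[i:i + n]].append(i)
    distances = []
    for found in positions.values():
        distances.extend(b - a for a, b in zip(found, found[1:]))
    return distances


def guess_key_length(cipher: str, n: int = 3):
    """Simplified guess: the greatest common divisor of all distances."""
    distances = repeat_distances(cipher, n)
    return reduce(gcd, distances) if distances else None


def split_columns(cipher: str, key_length: int) -> list[str]:
    """Group the letters that were encrypted with the same key letter."""
    s = only_letters(cipher)
    return [s[i::key_length] for i in range(key_length)]
```

Each string returned by split_columns was encrypted with a single key letter, so it is a plain Caesar cipher and can be attacked with the frequency table on its own.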

One-Time-Pad

The One-Time-Pad (OTP) method is an encryption method that is, when used correctly, safe and theoretically unbreakable. The concept behind this encryption is that the key (the pad) is used just one time. The second constraint is that the key phrase is as long as the plain text. Here are the other constraints. The key…

  • must be secret
  • must be unpredictable and random
  • may only be used one time

The recipient receives a bundle of x One-Time-Pad keys. With these keys, the receiver can decrypt x/2 incoming messages and send x/2 replies. There are a lot of methods to encrypt and decrypt the messages with such a key. Just one example is the Vigenère square.
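Used with a Vigenère-style addition, a One-Time-Pad could look like the following Python sketch. The function names are made up for this example, and it assumes messages that consist of the upper-case letters A to Z only. Note one important assumption: the pad is drawn from Python's secrets module, which is a cryptographically strong pseudo-random source. A real One-Time-Pad requires truly random keys, so this only illustrates the mechanics.

```python
import secrets
import string


def generate_pad(length: int) -> str:
    """Create a random key with the same length as the message."""
    # Assumption: secrets.choice stands in for a truly random source.
    return "".join(secrets.choice(string.ascii_uppercase) for _ in range(length))


def otp_encrypt(plain: str, pad: str) -> str:
    """Add each pad letter to the matching plain-text letter (mod 26)."""
    if len(pad) < len(plain):
        raise ValueError("the pad must be at least as long as the message")
    return "".join(
        chr((ord(p) - 65 + ord(k) - 65) % 26 + 65) for p, k in zip(plain, pad)
    )


def otp_decrypt(cipher: str, pad: str) -> str:
    """Subtract each pad letter again to recover the plain text."""
    if len(pad) < len(cipher):
        raise ValueError("the pad must be at least as long as the message")
    return "".join(
        chr((ord(c) - ord(k)) % 26 + 65) for c, k in zip(cipher, pad)
    )


if __name__ == "__main__":
    message = "HELLOWORLD"
    pad = generate_pad(len(message))  # use this pad once, then destroy it
    cipher = otp_encrypt(message, pad)
    print(cipher, otp_decrypt(cipher, pad))
```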

As I mentioned a few sentences earlier, this procedure is absolutely safe (because the key is as long as the message). That is a big advantage. On the other hand, it takes a lot of effort to deliver new keys to the receiver and the sender. Furthermore, this method needs a central authority that generates the keys. And last but not least, a serious disadvantage is that the exchange of the keys can itself be attacked. This can be called a man-in-the-middle attack. As you can see, the OTP method is safe, but not practicable for a high number of messages.

This was the first part of my “History of Encryption” series. The second part will deal with:

  • Enigma machine
  • One-time-pad
  • Public Keys
  • Pretty Good Privacy
  • Quantum cryptography
]]>
http://www.cip-labs.net/2011/08/08/the-history-of-encryption-part-1/feed/ 1