Oct
19

The most optimization tricks for MySQL focus on query performance or server tuning. But the optimization starts with the design of the database schema. When you forget to optimize the base of your database (the structure), then you will pay the price of your laxity from the beginning of your work with the database. Sure, every storage engine have his own advantages and disadvantages. But regardless of the engine you choose, you should consider the following chapters in your database schema. In this article, I will write about some architectural considerations, that you should keep in mind, when you design a database, that will run under heavy load. This article is a part of the web-application performance series.

Read-optimized tables

In current web applications, the frequency of reading from a database is much likely, then writing data to it. For this reason, the database server have to scan a lot tables, to deliver the data. The most tables, that are including the (so called) master data are very huge. This tables have a big row size and they are very costly to scan. For this reasons it is a well-known method to split up this table. You can hold the data, that is most likely involved in scanning processes, in a seperate table, to avoid a scan about the whole data. A problem occures, when you want do query the full data set. Here you have two options:

  1. execute a join between the read-optimized table and the other table (using a foreign key in the read-optimized table)
  2. denormalize the tables with an attribute, that both tables have (and that is indexed two times)

The first alternative is not oriented on performance aspects, but it fullfills all rules of normalization. The second alternative is a denormalized table, that includes some redundancy in data. Most web applications run with little redundancies in there databases to use the performance improvement of this read-optimized tables. See also the chapter: “denormalize tables”.

Denormalize tables

As I said in the chapter above, you should denormalize tables, when your joins between two and more tables are to slow. To denormalize should be the last step, after all of your architectural repertoire is exhausted. Because denormalized tables mean to have some data not only in one column. Of course, some inconsistencies can occure and the database is stronger to handle. The administrative outlay will increase, if you denormalize to much of your data.

Denormalization is useful, if you split one table into two tables and you don’t want to join this tables together in a query. In most cases, the database would execute a join with over 2 tables to get the information. This is really costly for current web applications! But be aware, think about, which denormalization is necessary and which is not.

Artificial key columns

If you have a fast natural key (that results from the data analysis) don’t add a artifical key to it. Remember that all keys are need disk or/and memory space. But you have to weigh up whether your natural key is fast enough or not. Maybe your key is alphanumeric and is to large to be efficient under high load. Then you should use a numeric and artificial primary key.

Bigint vs. unsigned

It is important for the performance to choose the right data type for the primary key. Consider, if you really need a BIGINT or INT. Maybe the MEDIUMINT datatype can fix your needs. The difference between INT and MEDIUMINT is, that you save 25% of space with using MEDIUMINT. You should use the smallest data type for 2 reasons. The first reason is, that the row will be smaller and the second reason is, that the index of the primary key will me smaller and faster. Another idea is, to use the UNSIGNED part of a data type. In most cases, the range of UNSIGNED INT will suffice. Of course, you can apply this consideration for all numerical fields (and not just for key columns).

ranges of data types

Type Storage Minimum Value Maximum Value
(Bytes) (Signed/Unsigned) Signed/Unsigned)
TINYINT 1 -128 127
0 255
SMALLINT 2 -32768 32767
0 65535
MEDIUMINT 3 -8388608 8388607
0 16777215
INT 4 -2147483648 2147483647
0 4294967295
BIGINT 8 -9223372036854775808 9223372036854775807
0 18446744073709551615

source: mysql.com

Text vs. varchar

This is a one of the most rookie mistakes with MySQL. Some people think, that it is better to use TEXT instead of VARCHAR. One point is that VARCHAR has a defined length and TEXT hasn’t. But that is not the last one. Furthermore TEXT needs more memory for sorting. However, you should use TEXT just as a “storage” data type and VARCHAR as small field of textual information (or as a key).

NOT NULL

If it is possible, declare all columns as NOT NULL. Every column needs a bit more and this is not necessary. When you really need this column, you should use it. But when you can avoid this, you should avoid it. MySQL does more working steps, if you set NULL to a column.

Other considerations

  • Do not index columns that you not need in a select
  • Use clever refactoring to admit changes to your schema
  • Choose the minimal character set, that fits your needs
  • Use triggers just, when you really need it
Apr
20

MySQL is a very powerful and popular database server for web applications. It’s a open source project with a big community. A lot of people deal with the performance of this server to make their applications faster on small machines. This article deals with the server tuning, especially the tuning that you can do in the my.cnf configuration file of MySQL. Be aware, that the performance of a database server depends on a few more factors than server tuning via the configuration file. This article is a part of the web-application performance series.

Query cache

The query cache is a cache that caches the execution plan and the result of the queries. When a query comes in, the query-parser parses the query, after that the query-optimizer creates an execution plan (based on heuristics) and executes the query. This cache is a very powerful tool, that brings an enormous speedup. Be aware that the query cache acts case-sensitive, which means that it differ between “SELECT” and “select”. A query, that will return the same result set, but in file A it starts with “SELECT….” and in file B it starts with “Select” have two different entries in the query cache. This is not good, because the unnecessary & occupied space can’t be used for another important query. Below i listed some important variables to control the behavior of the cache:

Query cache status

To have an imagination of how the cache works you can type:

SHOW STATUS LIKE "qc%";

into the query window or the shell to get status information from the cache. Here, i listed the Qcache status information of my database server.

Variable_name Value
Qcache_free_blocks 493
Qcache_free_memory 20752864
Qcache_hits 130628
Qcache_inserts 246940
Qcache_lowmem_prunes 0
Qcache_not_cached 203444
Qcache_queries_in_cache 1160
Qcache_total_blocks 2885
  • Qcache_free_blocks: number of free memory blocks in cache
  • Qcache_free_memory: amount of free memory space
  • Qcache_hits: sucessful reads from the cache
  • Qcache_inserts: number of all inserted queries
  • Qcache_lowmem_prunes: number of queries that where deleted from the cache because of reaching query_cache_size
  • Qcache_not_cached: number of non cached queries
  • Qcache_queries_in_cache: number of queries that are currently in the cache
  • Qcache_total_blocks: number of total blocks in the query cache

query_cache_type

This variable accepts three values.

  • 0 – The cache is turned off. Don’t cache anything.
  • 1 – The cache is turned on. Cache anything, but not such queries that contain time functions and other special aggregate functions
  • 2 – The cache is turned on demand. It will cache all queries that are marked with SQL_CACHE, for instance (SELECT SQL_CACHE id, name, telnr FROM employee;)

Value 2 is just useful if you performing a lot of queries on your server with large result sets and they are just needed to display statistics in the backend. You should cache just queries that you are need on the production system.

query_cache_size

The cache size points out the amount of memory that the Qcache can allocate. To set the value of this variable to an optimal size, you have to study the behavior of your server. For some cases it can be useful to set 16M (which stands for 16 Megabyte) but some applications need more than 16 M. To optimize this, you can test the status of the cache, as described earlier, and keep your focus on Qcache_lowmem_prunes, Qcache_free_blocks and Qcache_free_memory.

query_cache_limit

This variable means the maximum size of a single cache entry. It depends – again – on the size of your result sets. If you have very long tables, then it would be useful to increase the value of this variable. There is no general solution, but a good value for small tables is 1M or 2M.

Key buffer

The key buffer represents the number of space that MySQL can use to build indexes and keys. This buffer is very powerful and the value should be very high (depends on your application). Indexes of large tables can be large too. Study the server status output for more information. Important variables are Key_reads (number of key-read operations that result in a disk access) and Key_writes (number of key-write operations that result in a disk access). Both values should be as small as possible.

key_buffer_size = 32M
#for large tables
key_buffer_size = 512M

Table cache

The table cache holds open tables in the memory to allow a quick access. The improvement of avoiding a disk access and a table scan is very huge. Keep your focus on Open_tables and Opened_tables. Open_tables represents the number of open tables in the table cache. These tables are currently in the memory. The opponent Opened_tables represents the number of tables that are not in the memory (the query results in a table scan). So it is your task to increase the table cache, that you have no Opened_tables. But this is the ideal case. In real-life applications you’ll never have 0 Opened_tables.

#for small appliactions
table_cache = 128
#for huge applications
table_cache 256

Some database administrators think, that FLUSH TABLES would be a good alternative to increasing the table_cache. Well, this is a common mistake. FLUSH TABLES only flushes the table cache. To do so is good, to force a quick reindexing of the cache, but it isn’t an alternative to increase the table cache.

Sort buffer

The sort buffer is used to do sort operations like ORDER BY or GROUP BY in the memory. If this buffer would be to less, MySQL continues the sort operation on the hard drive. This is a very slow process and costs a lot of time. The indicator of the right sort_buffer value is to check the Sort_merge_passes variable. This variable shows how many merge operations MySQL need and it should be 0 for optimal performance.  Is Sort_merge_passes unequal 0 then you have to increase the value of sort_buffer. Be experimental and check the status of your server for sort buffer optimization.

Read rnd buffer

The read_rnd_buffer is used after the sort operation to read the sorted rows. Sometimes the standard value, which is 256K is not enough. There is an rule which tells us to use 1M for every Gigabyte of RAM that our server has on board. If you have not to large tables 1M is a good value.

read_rnd_buffer_size = 1M

Tmp tables

Temporary tables are important for getting a fast result set. If your tables become to large, MySQL will create temporary tables on the disk. This is really slow and not practicable for high performance applications. The value of tmp_table_size must be a generous value. Created_tmp_disk_tables is the opponent of tmp_table_size and represents the number of tables that MySQL creates on the disk. It must be your goal to decrease this value to the minimum.  After watching the current value of Created_tmp_disk_tables you can decide to increase the tmp_table_size or not.

#default value
tmp_table_size = 32MB
#for smaller tables, this value would be better
tmp_table_size = 64 MB

Thread cache

Every connection to the MySQL server ends up in a thread. The thread cache starts after the connection to the client is finished. It holds the thread in the cache to avoid the overhead of instantiate a new thread. An old thread would be used to handle incoming connections. You can check the performance of your thread cache with have a look on Threads_cached and Threads_created. Threads_cached is the number of threads in the cache and Threads_created represents the number of created threads. Increase your thread_cache_size to improve the performance. Note: The thread cache was build to handle high thread concurrency in a fast way.

Join buffer

If you do a join in your query, the MySQL server creates a result table. Especially a full join needs a large join buffer. So, increase the size of your join buffer, that the server must not create a joined table on the disk. This could end in a very high loss of performance. Set join_buffer_size to a generous value. Joins are often a case for the slow-query-log. Try to avoid large queries and check the slow-query-log to detect slow queries.

Max connections

The max_connections value represents the number of connections that the server accepts (simultaneous connections). If you have a high concurrency on your database server, increase this value to handle all connections at the same time. But be aware that the number of file descriptors increases with the number of simultaneous connections. To avoid incorrect handled connections, be careful with this variable. The value depends on the hardware in your box.

#for small applications
max_connections = 500
#for larger applications
max_connections = 5000
Apr
20

The scope of MySQL optimization is very wide. One aspect of this complex field is the query performance and this article deals with it. MySQL provides powerful tools, that allow you to have a look inside the query parser & optimizer. To understand this article some MySQL server & SQL knowledge is required. Please be aware, that this article is just an overview about query performance. It can’t cover all aspects of it. This article is a part of the web-application performance series.

A simple calculation

On MySQL.com the company behind MySQL (MySQL AB) published a simple calculation to find out the estimated query performance:

log(row_count)/log(index_block_length/3*2/(index_length+data_pointer_length))+1
  • row_count: number of rows in your table
  • index_block_length: normally 1024 bytes
  • index_length: key value length (usually 3 bytes)
  • data_pointer_length: data pointer (usually 4 bytes)

The result of this formula represents the number of disk seeks for a special query. This formula is not valid for all queries.

Using EXPLAIN

EXPLAIN is a powerful tool to analyze the query performance. You can use EXPLAIN for every query.

EXPLAIN SELECT * FROM employee WHERE id = 18273;

After sending this query to the server, the result set contains information of the MySQL Optimizer. The optimizer analyzes the query and decides which execution plan would be the fastest and/or cost-effective. Check approximatly all queries to make sure, that you don’t overlook a unindexed or slow query.

The output of the optimizer contains the following columns:

  • table: involved tables (if you do a join, then this table contains more than one entry)
  • type: type of query (e.g. SYSTEM | CONST | EQ_REF| REF | RANGE | INDEX | ALL, ordered by the fastest in a descending order)
  • possible_keys: all possible keys for this query
  • key: finally used key
  • key_len: the length of the key (the length correlates with the performance)
  • ref: shows which column or constant would used to compare the value of key
  • rows: estimated number of rows to check
  • Extra: additional information about how MySQL executes the query

If you want to use an index that MySQL don’t use in the execution plan, then you can force MySQL to use this index with:

SELCT lname, fname, birthday FROM employee FORCE INDEX(lname)...

LIMIT

You should use LIMIT, if you have a lot of rows in your result set. It saves a lot of time, if you’re using LIMIT. So, you can build your query, after your pagination variable is evaluated.

SELECT lname,fname FROM employee LIMIT 0, 30

or with the application of a pagination:

SELECT lname, fname FROM employee LIMIT ($page * $rows_per_page), $rows_per_page

Caching queries effectively

MySQL implements a query cache. This cache saves the execution plan for your queries and the results. Until the table is not changed, the query cache holds the queries in the cache (depends on the possible memory).  The most common fault with the query cache is the innocence of how the query cache works. It works case-sensitive! This can be tricky if you use the same query more than one times. Then, you have to establish query coding-standards in your application, to keep the cache smart, efficently and economical.

SELECT * FROM employee WHERE id > 10 AND id < 10000

is not the same like:

select * from employee where id > 10 AND id < 10000

Split up complex queries

At a certain level, some queries can perform really slow. This will happen especially by joins, correlated queries, large tables and such complex constructions. Then you achieve a better performance with splitting this one query into more queries. Test this queries against the EXPLAIN command to check which query performs well and which not. But, be aware that before splitting a query up , the first option should be to check the indexes and the query cache.

SQL_CACHE vs SQL_NO_CACHE

If you are performing statistical queries to update user statistics and so one you should use the SQL_NO_CACHE directive. This directive is a part of the query cache. If you set the value of query_cache to  2, then you can use this directive. Use SQL_NO_CACHE for all queries, that you don’t need in your application. Because the space of your query cache is limited and shouln’t be wasted.

LIKE with %

‘%’ is known as the wildcard operator in MySQL. If you have an index on a column it can perform fast.

SELECT * FROM employee WHERE lname LIKE('New%');

This works fast, if you have an index on lname. The example below don’t works fast, because the index couldn’t be used:

SELECT * FROM employee WHERE lname LIKE('%man');

So, try to avoid the wildcard at the beginning of the like-clause. An considerable option can be to reverse the values of this column.

Other considerations

  • avoid calculated comparisons between values
  • don’t use DISTINCT and GROUP BY together
  • avoid correlated subqueries
  • use the FULLTEXT search and avoid LIKE