Measuring and improving MySQL performance over a (local) network

mrfelton

MySQL queries in my production environment are taking much longer than I would expect them to. The site in question is a fairly large Drupal site with many modules installed (about 140). The web server (Nginx) and database server (MySQL) are hosted on separate machines, connected by a 100mbps LAN connection (hosted by Rackspace). Both machines are dual-core AMD Opteron 2216 HE 2.4GHz boxes with 4GB RAM, running Red Hat Enterprise Linux.

I have the exact same site running on my laptop for development. Obviously, on my laptop, the webserver and database server are on the same box.

Here are the results of my database query times as reported with the devel module:

Production:
Executed 291 queries in 320.33 milliseconds. (homepage)
Executed 517 queries in 999.81 milliseconds. (content page)

Development:
Executed 316 queries in 46.28 milliseconds. (homepage)
Executed 586 queries in 79.09 milliseconds. (content page)

As can clearly be seen from these results, the time spent querying the MySQL database is much shorter on my laptop, where the MySQL server runs on the same machine as the web server.

Why is this?!

One factor must be network latency. On average, a round trip from the web server to the database server takes 0.16ms (as shown by ping), and that must be added to every single MySQL query. Taking the content page example above, with 517 queries executed, network latency alone adds around 83ms to the total query time. However, that doesn't account for the difference I am seeing (79ms on my laptop vs 999ms on the production boxes).

What other factors should I be looking at? I had thought about upgrading the NIC to a gigabit connection, but clearly there is something else involved.

I have run the MySQL performance tuning script from http://www.day32.com/MySQL/ and it tells me that my database server is configured well (better than my laptop, apparently). The only problem reported is "Of 4394 temp tables, 48% were created on disk". This is true in both environments, and on production I have even tried increasing max_heap_table_size and tmp_table_size to 1GB with no change (I suspect this is because I have some BLOB and TEXT columns, which always force temporary tables to disk).
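For anyone chasing the same symptom, the on-disk temp table ratio can be read straight from the server's standard status counters; a minimal check, assuming nothing beyond a connected client:

-- how many implicit temporary tables spilled to disk vs. total created
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';
SHOW GLOBAL STATUS LIKE 'Created_tmp_tables';
-- note: a query whose result carries BLOB/TEXT columns can never use an
-- in-memory (MEMORY engine) temp table, so raising tmp_table_size and
-- max_heap_table_size does nothing for those queries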

Production my.cnf:

skip-locking
skip-name-resolve
key_buffer = 512M
max_allowed_packet = 16M
table_cache = 2560
sort_buffer_size = 3M
read_buffer_size = 2M
read_rnd_buffer_size = 8M
join_buffer_size=2M
myisam_sort_buffer_size = 128M
thread_cache_size = 16
thread_concurrency = 8
query_cache_size = 128M
tmp_table_size=1024M
max_heap_table_size=1024M
open-files = 20000

interactive_timeout = 600
wait_timeout = 600

log-slow-queries=/var/lib/mysqllogs/slow-log
long_query_time=1

Development my.cnf:

skip-external-locking
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 128K
thread_cache_size = 8
query_cache_limit  = 1M
query_cache_size = 16M

Comments

Advice on identifying performance

ronaldbradford

The first part of your problem is to identify your physical resource bottleneck more clearly.

created_tmp_disk_tables is disk I/O. You are right: given your excessive tmp/max values, it's the TEXT/BLOB columns. You should look at trying to resolve these; add it to your todo list.

You would expect greater disk I/O on your production server; vmstat (the wio/wa column) or iostat will help you see whether it's a bottleneck.

Network latency is an overhead for every single SQL statement, and most people don't realize its true impact on SQL. Depending on your MySQL version, SHOW PROFILE is great for showing you per-step microsecond times, including the network overhead for local and remote connections. You should also make sure you're not getting any network saturation, i.e. look at what is happening across the network. In MySQL, the Bytes_received/Bytes_sent status variables can give you a value from the MySQL server itself, without needing OS-level monitoring.
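A minimal sketch of both checks (SHOW PROFILE needs MySQL 5.0.37 or later; run it from the web server so the network leg is included; the cache_views query is just an example statement taken from this thread):

SET profiling = 1;
SELECT data, created, headers, expire, serialized FROM cache_views WHERE cid = 'views_default_views:ie';
SHOW PROFILES;              -- total duration per profiled statement
SHOW PROFILE FOR QUERY 1;   -- per-step breakdown (assuming the SELECT was the first profiled statement)
-- server-side traffic counters, cumulative since server start
SHOW GLOBAL STATUS LIKE 'Bytes_received';
SHOW GLOBAL STATUS LIKE 'Bytes_sent';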

Another clear problem is your excessive per-session memory buffers (read/read_rnd/sort/join) on production, which you don't have on your dev box. Are you hitting a memory resource bottleneck with these values? At least you don't have max_connections set high.

There is plenty of evidence that making these buffers too large slows down work, e.g. http://www.mysqlperformanceblog.com/2007/09/17/mysql-what-read_buffer_si...

The problem is that MySQL just doesn't provide the instrumentation to identify more optimal values here; smaller is generally better. The only buffer you can get instrumentation on is the sort buffer, via the Sort_merge_passes status variable.
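Reading that one piece of instrumentation is a one-liner; if Sort_merge_passes keeps climbing, sort_buffer_size is too small, and if it sits at zero there is no case for a large value:

SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';
-- the other sort counters give context for how much sorting happens at all
SHOW GLOBAL STATUS LIKE 'Sort_rows';
SHOW GLOBAL STATUS LIKE 'Sort_scan';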

My advice:

  1. Confirm whether CPU/disk/memory/network is the bottleneck; this will identify the best path to work on.
  2. Decrease your per-session memory buffers. Make changes, but monitor closely. Most of these are dynamic, so you can change them without restarting MySQL, but make sure you change my.cnf as well (see the sketch after this list).
  3. Review the network latency overhead from a SQL perspective.
  4. I'd also say that 1024M for tmp_table_size/max_heap_table_size is excessive if you don't use MEMORY tables. It will probably not affect performance, but reducing it helps the most valuable resource: available memory.
  5. The query cache: is it effective in your environment? Large values on older MySQL versions can have a thrashing effect. Perhaps turn it off, which can also be done dynamically (see the sketch below), or decrease its size.
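As a sketch of points 2 and 5, all of the variables below are dynamic. The values shown are illustrative starting points, not recommendations; change one at a time and watch the effect:

-- per-session buffers: new connections pick the new values up immediately
SET GLOBAL sort_buffer_size = 256 * 1024;
SET GLOBAL read_buffer_size = 128 * 1024;
SET GLOBAL read_rnd_buffer_size = 256 * 1024;
SET GLOBAL join_buffer_size = 128 * 1024;
-- the query cache can be switched off on the fly to test point 5
SET GLOBAL query_cache_size = 0;
-- mirror whatever works into my.cnf so it survives a restart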

H/W Fault

ronaldbradford

I should add one more consideration. I've seen a similar situation with a client, where copying the database and running it on a completely untuned MySQL instance on my laptop gave significantly better query times.

My advice to them was to swap out their H/W for a new server at the hosting provider. There was no clear evidence of a H/W fault (I wish I had more ways to prove one), but doing so resolved the performance problem.

If you have slow query log

Jamie Holly

If you have the slow query log enabled, I would check that for any signs. You could also use the devel module with database logging enabled to see if any particular queries are taking too much time.

While it may run fine on your devel box, there is always a chance you have some nasty queries that resort to filesorts and temporary tables, plus scan far too many rows. You won't notice that on your development box, since you are the only one accessing the database, but on a live production site you've got multiple queries coming in with every request, all competing for the limited resources available.
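EXPLAIN flags exactly those symptoms; a minimal check against one of the queries quoted later in this thread (any suspect query works the same way):

-- "Using temporary" or "Using filesort" in the Extra column, or a large rows
-- estimate, are the warning signs described above
EXPLAIN SELECT name, filename, throttle FROM system WHERE type = 'module' AND status = 1 AND bootstrap = 1 ORDER BY weight ASC, filename ASC;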


HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.

There is nothing showing up

mrfelton

There is nothing showing up in my slow query log :) I sorted that one out a while back. Here are some examples of the difference in time between development and production for some individual queries (development first, production second; times in milliseconds, as reported by devel):

0.32   module_list   SELECT name, filename, throttle FROM system WHERE type = 'module' AND status = 1 AND bootstrap = 1 ORDER BY weight ASC, filename ASC
0.51   module_list   SELECT name, filename, throttle FROM system WHERE type = 'module' AND status = 1 AND bootstrap = 1 ORDER BY weight ASC, filename ASC

0.28    language_list  SELECT * FROM languages ORDER BY weight ASC, name ASC
0.55    language_list  SELECT * FROM languages ORDER BY weight ASC, name ASC

0.16    masquerade_init  SELECT uid_from FROM masquerade WHERE sid = 'rjrg2mburbf46ceb3buhincnk4' AND uid_as = 598
17.41  masquerade_init  SELECT uid_from FROM masquerade WHERE sid = '8a31930b5313c46127eba07649c70b35' AND uid_as = 598

0.33    sess_count SELECT COUNT(sid) AS count FROM sessions WHERE timestamp >= 1253359955 AND uid = 0
6.66    sess_count SELECT COUNT(sid) AS count FROM sessions WHERE timestamp >= 1253358402 AND uid = 0

0.38    cache_get  SELECT data, created, headers, expire, serialized FROM cache WHERE cid = 'locale:ie'
11.54  cache_get  SELECT data, created, headers, expire, serialized FROM cache WHERE cid = 'locale:ie'

0.44    cache_get  SELECT data, created, headers, expire, serialized FROM cache_views WHERE cid = 'views_default_views:ie'
19.99  cache_get  SELECT data, created, headers, expire, serialized FROM cache_views WHERE cid = 'views_default_views:ie'

Now, whilst devel reports slower queries across the board, for some queries the difference is much more substantial. It really sticks out to me that the biggest differences in query time seem to relate to queries involving the cache or session tables. On production, almost every instance of cache_get is highlighted red, whilst on development none of them are. As you can see above, the difference in query time for these queries can be absolutely huge. Clearly, in production, the cache and session tables will be much larger. What is it about these queries that makes them so much slower on production?
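One quick way to see just how much bigger those tables are in production (standard commands; the table names are Drupal's defaults, as used in the queries above):

-- row counts, data length and Data_free (overhead) for the suspect tables
SHOW TABLE STATUS LIKE 'cache';
SHOW TABLE STATUS LIKE 'cache_views';
SHOW TABLE STATUS LIKE 'sessions';
-- and the size of the largest cached blobs, since each cache_get ships one over the wire
SELECT cid, LENGTH(data) AS bytes FROM cache_views ORDER BY bytes DESC LIMIT 10;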

I decided to look at one of these queries in more detail. I ran this query on my development machine (laptop), on the web server (which connects to a remote, but local database server), and on the production database server itself. Here are the results.

Development:
+----------+------------+---------------------------------------------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                                                                   |
+----------+------------+---------------------------------------------------------------------------------------------------------+
|        1 | 0.00051900 | SELECT data, created, headers, expire, serialized FROM cache_views WHERE cid = 'views_default_views:ie' |
+----------+------------+---------------------------------------------------------------------------------------------------------+

+--------------------------------+----------+
| Status                         | Duration |
+--------------------------------+----------+
| starting                       | 0.000040 |
| checking query cache for query | 0.000016 |
| checking privileges on cached  | 0.000009 |
| sending cached result to clien | 0.000439 |
| logging slow query             | 0.000009 |
| cleaning up                    | 0.000006 |
+--------------------------------+----------+

Production - database server:
+----------+------------+---------------------------------------------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                                                                   |
+----------+------------+---------------------------------------------------------------------------------------------------------+
|        1 | 0.00062100 | SELECT data, created, headers, expire, serialized FROM cache_views WHERE cid = 'views_default_views:ie' |
+----------+------------+---------------------------------------------------------------------------------------------------------+

+--------------------------------+----------+
| Status                         | Duration |
+--------------------------------+----------+
| (initialization)               | 0.000007 |
| checking query cache for query | 0.000016 |
| checking privileges on cached  | 0.000011 |
| sending cached result to clien | 0.000578 |
| logging slow query             | 0.000009 |
+--------------------------------+----------+

Production - web server:
+----------+------------+---------------------------------------------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                                                                   |
+----------+------------+---------------------------------------------------------------------------------------------------------+
|        1 | 0.00823100 | SELECT data, created, headers, expire, serialized FROM cache_views WHERE cid = 'views_default_views:ie' |
+----------+------------+---------------------------------------------------------------------------------------------------------+

+--------------------------------+----------+
| Status                         | Duration |
+--------------------------------+----------+
| (initialization)               | 0.000007 |
| checking query cache for query | 0.000015 |
| checking privileges on cached  | 0.000009 |
| checking permissions           | 0.000009 |
| sending cached result to clien | 0.008182 |
| logging slow query             | 0.000009 |
+--------------------------------+----------+

As you can see, the database server actually ran the query faster than my laptop, but the query took much longer when run from the web server. Actually, it didn't really take longer to run, just longer to send the result to the client. I think this must point to a network issue. But other than latency, which I have already identified as being about 0.16ms, I can't think what else it could be. I have cacti monitoring the machines in question, and it doesn't seem to indicate that the network is saturated: the database server averages 790kbytes/sec outbound (over a 100mbps LAN). Or does the slow send indicate that the server has to read the data from disk for some reason, and the disk is slow?

--
Tom
www.systemseed.com - drupal development. drupal training. drupal support.

There is a chance that

Jamie Holly

There is a chance that something is screwy on the network. One simple test would be to put a decent-sized file on the database server (a mysqldump will do if the file is big enough), then open a terminal on the web server, sftp to the database server and do a get of that file. That way you can see the actual download speed.

Something else that could be contributing is table overhead. The cache and session tables get a lot of writes/deletes/updates and contain text fields, and altering the data in tables with variable-length fields always creates overhead. If you have phpMyAdmin, look at the "Overhead" values for those tables in the table list. If not, you can do a:

SHOW TABLE STATUS FROM {db_name} WHERE Data_free>0;

Data_free is the table's overhead in bytes. If that is rather large, then run an OPTIMIZE on the offending tables.
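For example (the table names here are just the likely suspects from earlier in the thread; OPTIMIZE TABLE locks the table while it runs, so do it at a quiet time):

-- rebuilds the table and index data, reclaiming the Data_free overhead
OPTIMIZE TABLE cache, cache_views, sessions;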


HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.

Seems I did have quite a few

mrfelton

It seems I did have quite a few tables that needed optimising. However, even after doing so there has been no noticeable change. I have also tried enabling MySQL client compression as discussed at http://groups.drupal.org/node/17913 and http://2bits.com/articles/slow-mysql-queries-multi-server-setup-use-comp..., but that didn't seem to make any difference either.

Using scp to pull a large file from the database server to the web server reports that the file downloaded consistently at 10.1MB/s. That sounds about right for a 100mbps connection, whose theoretical maximum is around 12.5MB/s.

--
Tom
www.systemseed.com - drupal development. drupal training. drupal support.

Sent data

Jax

On a multilingual D5 site, the cached translation data was in excess of 2MB. Pulling this across a 100Mbit network on every request was not doing performance any favours. So you might want to check the amount of data that is transferred for one request; if it's a lot, a possible solution is to cache it on the web server with memcached.
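For the translation cache specifically, the blob size is easy to check; the cid pattern below matches the 'locale:ie' entry seen in the query list above:

-- size of the cached translation data that gets pulled on each request
SELECT cid, LENGTH(data) AS bytes
FROM cache
WHERE cid LIKE 'locale:%';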

After installing cacherouter

mrfelton

After installing cacherouter and memcached, query times on the production box have dropped significantly.

Before:

homepage: Executed 291 queries in 320.33 milliseconds.
content page:  Executed 273 queries in 533.75 milliseconds.

After:

homepage: Executed 247 queries in 161.19 milliseconds.
content page:  Executed 241 queries in 185.39 milliseconds.

Since I had already identified (above) that it was mainly queries against the cache tables that were taking the time, installing memcached and cacherouter effectively removed those queries from the equation and has had a massive effect on query times. I'm still convinced that there is a network issue at play here, and will investigate a 1000mbps LAN connection for further improvements.

--
Tom
www.systemseed.com - drupal development. drupal training. drupal support.

My bet is link speed too

timdeeson

I've not analysed all your stats, but on the surface of it I'd say your 1000mbps hunch is right. We had a very similar scenario, and the issue was the transfer time of the cache blobs going back and forth.

When a page cache entry can be 8MB, you've got a few of them going back and forth between the web heads and the DB server, and it's all over a 100mbps link, the maths starts to indicate that the link alone can become the throttle: 8MB is 64 megabits, so a single transfer takes well over half a second of line time by itself.

In our case, we moved to a 1000mbps link between the web heads and the DB server, and the problem disappeared.

Tim Deeson - http://www.deeson.co.uk/emedia

Our network link is due to be

mrfelton

Our network link is due to be upgraded at the weekend. I'll report back on the result next week.

--
Tom
www.systemseed.com - drupal development. drupal training. drupal support.

Just curious about the final

daring

Just curious about the final result of this. Did you have any luck?
