Hi all.
We have a setup that requires master-master replication for both HA and load balancing. We have an identical stack in two cities, separated by a 1 Gbit WAN link: web server (WS), database (DB), and search (Solr) servers in one city, and the same at the other.
Currently both web servers are pointing to the same database in City 1, because we cannot get a stable master-master replication setup working.
I have excluded all cache tables from replication, although only cache_form is actually used because we run memcached. I also exclude the sessions table, because nobody will ever jump from web server to web server, and finally the semaphore table. Semaphore is also set up as an in-memory table and, as recommended, doesn't need to be shared because its contents are host-specific. This table initially caught me out because it has no auto-incrementing numeric key, so it broke replication almost immediately. After excluding it from replication, the site was up and stable for around 4 hours.
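For reference, the exclusions described above can be expressed in my.cnf on each server's replica side (a sketch; the `drupal` schema name is an assumption to adapt):

```ini
[mysqld]
# Don't replicate host-specific tables (schema name "drupal" assumed)
replicate-ignore-table      = drupal.sessions
replicate-ignore-table      = drupal.semaphore
# All cache tables: cache, cache_form, cache_menu, ...
replicate-wild-ignore-table = drupal.cache%
```

In a master-master pair each server is also a slave of the other, so the same rules go in both servers' configurations.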
The problem I have now is that after running fine for around the 4-hour mark, under decent load (the site handles around 4k logged-in users daily), replication was broken by a key clash on the variable table. Because its keys are also non-numeric, the standard MySQL way of handling this by offsetting the auto-increment doesn't apply.
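For anyone following along, the standard offset trick referred to here looks like this in my.cnf (a sketch), and it only helps tables keyed by an AUTO_INCREMENT column. The variable table is keyed by its varchar name column, so it gets no benefit:

```ini
# Master 1
[mysqld]
auto_increment_increment = 2   # step by the number of masters
auto_increment_offset    = 1   # this master generates 1, 3, 5, ...

# Master 2
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 2   # this master generates 2, 4, 6, ...
```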
I've searched all over Google trying to find some best practices on how to deal with this, but I'm at a loss. The table is a non-numeric key-value store, so there's no way to guarantee both web servers won't try to write the same thing.
The only thing I can think of is to use something like Redis, since it provides distribution and keeps a disk backup across a restart, unlike memcached, which would lose everything from RAM should the machine shut down. But that means hacking core, which seems very messy. I see references to master-master setups, but nobody has ever mentioned or had any issue with the variable table, which I find bizarre if it behaves the same as other tables such as semaphore.
Some background on versions: Drupal 7.32 (it's an intranet, so security isn't essential, but we try to keep up to date), CentOS 6.4 64-bit, PHP 5.3.7, MySQL 5.1.67.
Any help or advice to get me started would be really appreciated.

Comments
It seems like your issue is
It seems like your issue is the variable table. I set up a similar architecture, but used MariaDB and Galera.
I had two webservers, and three masters using Galera with MariaDB.
These masters were then replicated to a slave located elsewhere, with two additional webservers. If I remember properly, I set the slave to ignore the variable table along with semaphore and cache, like you have already.
What I did a little differently, though, was use mysql-proxy on the masters. That way the settings.php file doesn't have to be changed and can point at the proxy IP; if one of the masters goes down, the proxy automatically picks one of the ones that is up. On the masters themselves, I didn't exempt any tables.
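A sketch of what that looks like in a D7 settings.php; the host IP and credentials are hypothetical, and 4040 is mysql-proxy's default proxy port:

```php
<?php
// settings.php: point Drupal at mysql-proxy rather than a specific master.
$databases['default']['default'] = array(
  'driver'   => 'mysql',
  'database' => 'drupal',      // assumed schema name
  'username' => 'drupal',      // placeholder credentials
  'password' => 'secret',
  'host'     => '10.0.0.10',   // mysql-proxy address (hypothetical)
  'port'     => '4040',        // mysql-proxy's default listening port
);
```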
Essentially, the local internal webserver and database servers replicated to public slave servers.
Hi BHosmer. Thanks for the
Hi BHosmer.
Thanks for the reply. So does using Galera eliminate the issue with the variable, semaphore and cache tables?
Yes, but I would do some
Yes, but I would do some testing first just to make sure. During my setup and evaluation, I realized that you can think of the Galera cluster as one database: you don't need to worry about exempting tables, because Galera takes care of synchronizing the writes for you.
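For anyone evaluating this, a minimal Galera node configuration looks roughly like the following (a sketch; the library path, cluster name, and addresses are assumptions to adapt per node):

```ini
[mysqld]
binlog_format            = ROW       # Galera requires row-based replication
default_storage_engine   = InnoDB    # Galera replicates InnoDB tables only
innodb_autoinc_lock_mode = 2

wsrep_provider        = /usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name    = "drupal_cluster"
wsrep_cluster_address = "gcomm://10.0.0.1,10.0.0.2,10.0.0.3"
wsrep_node_address    = 10.0.0.1     # this node's own address
wsrep_sst_method      = rsync        # state transfer for joining nodes
```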
My architecture was a little different from yours though. The galera cluster was used for internal content updates to account for disaster and failover. The only users that actually accessed this database were authenticated staff members. This site wasn't accessible to public users. We then used a simple slave to replicate data to the public site that never wrote to the database, with the exception of the cache and semaphore tables. I don't recall whether I had to exempt the variable table or not.
I found it quite helpful to set up a three node galera cluster and a slave locally using Vagrant for testing and evaluation.
Percona XtraDB Cluster
Have you tried using Percona XtraDB Cluster? I know that it uses Galera. We are using it with Drupal on a production instance and it works beautifully. Here is a blog post on how to install it on Ubuntu: http://redcrackle.com/blog/how-set-percona-xtradb-cluster-ubuntu.
DB lag on a different server.
When I ran some tests, each query made to a database on a different server (even one on the same network) was about 4 times slower than a query made to a local DB. Since many queries are made to construct a single page, that makes a separate DB server unusable.
As far as I know, using a Galera cluster means the queries are made to a separate server. What was your experience with this?
From my experience Local
From my experience, a local database is faster than a remote one.
Even if you have a 100 Mbit direct connection.
Drupal cache queries (especially those related to fields) and Views queries use a lot of MySQL bandwidth.