Peter Zaitsev: Multi-Valued INSERTs, AUTO_INCREMENT & Percona XtraDB Cluster (9.9.2014, 14:38 UTC)

A common migration path from standalone MySQL/Percona Server to a Percona XtraDB Cluster (PXC) environment involves some measure of time where one node in the new cluster has been configured as a slave of the production master that the cluster is slated to replace. In this way, the new cluster acts as a slave of the production environment – traditional replication takes care of getting the data into the cluster, and then Galera replication handles the intra-cluster traffic. This often works without issue, although there is one case that I’ve encountered recently where special care must be taken to properly configure the stream to ensure that replication does not break. If you use multi-valued inserts with auto-increment columns, then this post is for you.

For purposes of our discussion, assume that we have a basic 3-node PXC cluster that we’ve set up using the PXC Reference Architecture document, and that we’re replicating from an asynchronous master (call it “server A”) into one of the PXC nodes. Without loss of generality, we’ll pick PXC03. Also, for purposes of our discussion, we’ll be working with the following table definition:

serverA> SHOW CREATE TABLE auto_inc_test;
CREATE TABLE `auto_inc_test` (
  `i` int(11) NOT NULL AUTO_INCREMENT,
  `stuff` varchar(20) DEFAULT NULL,
  PRIMARY KEY (`i`)
)
1 row in set (0.00 sec)
serverA> SELECT * FROM auto_inc_test;
Empty set (0.00 sec)

If we insert rows into this table one at a time, we have no issues.

serverA> INSERT INTO auto_inc_test(stuff) VALUES ('first row');
serverA> INSERT INTO auto_inc_test(stuff) VALUES ('second row');
serverA> INSERT INTO auto_inc_test(stuff) VALUES ('third row');
serverA> SELECT * FROM auto_inc_test;
+---+------------+
| i | stuff      |
+---+------------+
| 1 | first row  |
| 2 | second row |
| 3 | third row  |
+---+------------+
PXC03> SELECT * FROM auto_inc_test;
+---+------------+
| i | stuff      |
+---+------------+
| 1 | first row  |
| 2 | second row |
| 3 | third row  |
+---+------------+

But if we start doing multi-valued inserts, we can run into a problem.

serverA> INSERT INTO auto_inc_test(stuff) VALUES('first row'),('second row'),('third row');
serverA> INSERT INTO auto_inc_test(stuff) VALUES('fourth row'),('fifth row');
serverA> SELECT * FROM auto_inc_test;
+---+------------+
| i | stuff      |
+---+------------+
| 1 | first row  |
| 2 | second row |
| 3 | third row  |
| 4 | fourth row |
| 5 | fifth row  |
+---+------------+
PXC03> SELECT * FROM auto_inc_test;
+---+------------+
| i | stuff      |
+---+------------+
| 1 | first row  |
| 2 | second row |
| 5 | third row  |
+---+------------+
... output elided ...
Last_SQL_Errno: 1062
Last_SQL_Error: Error 'Duplicate entry '5' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'INSERT INTO auto_inc_test (stuff) VALUES ('fourth row'),('fifth row')'
... output elided ...

Uh oh. Replication is broken and our data is now inconsistent. So why does this happen and how can we prevent it?


The astute observer will note that I have not yet said anything about the binary log format on the master. If the binary log format on the master is already set to ROW, then the above error will not occur. RBR will properly replicate multi-valued INSERTs to the PXC cluster without issue, and the data will be consistent. Problem solved. However, there may be reasons that the master is not using or cannot use RBR, such as disk space or IOPS limitations, and thus it’s running in MIXED or STATEMENT mode. In that case, we need to look elsewhere….
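If switching the master to row-based logging is an option, the change itself is tiny. A minimal sketch of the master-side setting (assuming you can absorb the larger binlog volume that ROW format may produce):

```ini
# /etc/my.cnf on the async master (server A) -- hypothetical paths/values.
# ROW format replicates the actual rows (including the generated
# auto-increment values), so the PXC node applies identical data.
[mysqld]
binlog_format = ROW
```

The same change can be made at runtime with SET GLOBAL binlog_format='ROW', though it only affects new connections and should be made permanent in the configuration file.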


The culprit is wsrep_auto_increment_control. When set to ON (the default), this variable automatically adjusts auto_increment_increment and auto_increment_offset based on the cluster size. The idea behind it is to help prevent auto-increment value conflicts when writing to multiple nodes. However, it also means that in a multi-node cluster, the auto-increment values generated by any given node will never be consecutive, and the “next” auto-increment value on the slave cluster node will always be higher than what the master believes it should be. For example:

serverA> INSERT INTO auto_inc_test (stuff) VALUES ('first row'),('second row'),('third row');
serverA> SHOW CREATE TABLE auto_inc_test;
CREATE TABLE `auto_inc_test` (
  `stuff` varchar(20) DEFAULT NULL,
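The rest of the original post is cut off below, but a common way to neutralize this behavior on the node that replicates from the async master is to pin the increment settings yourself. A sketch, with values to adapt to your setup (note this trades away the cluster's automatic protection against multi-node write conflicts, so it suits a cluster that only takes writes through the replication stream):

```ini
# /etc/my.cnf on PXC03 (the node acting as async slave) -- hypothetical.
# Keeps auto_increment_increment/offset at the values the master expects,
# so statement-based multi-valued INSERTs generate the same ids.
[mysqld]
wsrep_auto_increment_control = OFF
auto_increment_increment     = 1
auto_increment_offset        = 1
```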

Truncated by Planet PHP, read more at the original (another 3010 bytes)

Jean-Jerome Schmidt: Announcing Severalnines Package Repository (8.9.2014, 16:45 UTC)
September 9, 2014
By Severalnines

We are excited to announce the availability of YUM/APT repositories for ClusterControl, making new releases of ClusterControl easily accessible using the YUM or APT package managers. The repo, with setup instructions, is linked from the landing page. Our Cluster Configurators will be using these repositories; as a result, users upgrading with s9s_upgrade_cmon from version v1.2.8 onwards will be configured with the package repository.


ClusterControl requires extra post-installation setup steps, such as generating an API token, configuring the cmon/dcps database schema, granting privileges on the cmon schema, setting up SSL and so on. We provide a post-installation script for this purpose at [Apache document root]/clustercontrol/app/tools/ If you are installing for the first time, you are required to run this script to ensure ClusterControl is properly set up.


In this blog post, we will show you how to use the Severalnines repository to install and manage ClusterControl packages in your infrastructure. 


New Package Naming


For the purpose of streamlining installations based on repository, we have changed the package names as follows:


Old Package Name

New Package Name

ClusterControl CMON Controller


Truncated by Planet PHP, read more at the original (another 4475 bytes)

Peter Zaitsev: How to calculate the correct size of Percona XtraDB Cluster’s gcache (8.9.2014, 14:12 UTC)

When a write query is sent to Percona XtraDB Cluster, all the nodes store the writeset in a file called gcache. By default that file is named galera.cache and it is stored in the MySQL datadir. This is a very important file, and as usual with the most important settings in MySQL, the default value is not good for highly loaded servers. Let’s see why it’s important and how we can calculate a correct value for the workload of our cluster.

What’s the gcache?
When a node goes out of the cluster (crash or maintenance) it obviously stops receiving changes. When you try to reconnect the node to the cluster, its data will be outdated. The joiner node needs to ask a donor to send the changes that happened during the downtime.

The donor will first try an incremental state transfer (IST), that is, sending the writesets the cluster received while the node was down. The donor checks the last writeset received by the joiner and then checks its local gcache file. If all the needed writesets are in that cache, the donor sends them to the joiner. The joiner applies them and that’s it: it is up to date and ready to join the cluster. Therefore, IST is only possible if all the changes missed by the node that went away are still in the donor’s gcache file.

On the other hand, if the writesets are not there, a full state snapshot transfer (SST) would be needed, using one of the supported methods: XtraBackup, rsync or mysqldump.

In summary, the difference between an IST and an SST is the time a node needs to join the cluster. The difference can range from seconds to hours, and in the case of WAN connections and large datasets, maybe days.

That’s why a correctly sized gcache is important. It works as a circular log, so when it is full it starts overwriting the oldest writesets. With a larger gcache, a node can be out of the cluster for longer without requiring an SST. My colleague Jay Janssen explains in more detail how IST works and how to find the right server to use as a donor.

Calculating the correct size
The trick is pretty similar to the one used to calculate the correct InnoDB log file size: we need to check how many bytes are written every minute. The variables to check are:

wsrep_replicated_bytes: Total size (in bytes) of writesets sent to other nodes.

wsrep_received_bytes: Total size (in bytes) of writesets received from other nodes.

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_received_bytes';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| wsrep_received_bytes | 83976571 |
+----------------------+----------+
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_replicated_bytes';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| wsrep_replicated_bytes | 0     |
+------------------------+-------+
mysql> SELECT SLEEP(60);
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_received_bytes';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| wsrep_received_bytes | 90576957 |
+----------------------+----------+
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_replicated_bytes';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| wsrep_replicated_bytes | 800   |
+------------------------+-------+


Bytes per minute:

(second wsrep_received_bytes – first wsrep_received_bytes) + (second wsrep_replicated_bytes – first wsrep_replicated_bytes)

(90576957 – 83976571) + (800 – 0) = 6601186 bytes or 6 MB per minute.

Bytes per hour:

6MB * 60 minutes = 360 MB per hour of writesets received by the cluster.
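The same arithmetic can be wrapped in a small helper. This is a hypothetical sketch for illustration, not part of any Percona tool; the counter values are the ones sampled above:

```python
# Estimate the gcache size needed to survive a given downtime, from two
# samples of wsrep_received_bytes / wsrep_replicated_bytes taken
# `sample_seconds` apart.

def gcache_size_bytes(recv_first, recv_second, repl_first, repl_second,
                      sample_seconds=60, downtime_seconds=3600):
    # Writesets the node saw during the sample window, in bytes.
    delta = (recv_second - recv_first) + (repl_second - repl_first)
    # Scale the observed rate up to the downtime we want to cover
    # (integer arithmetic, so the result is exact).
    return delta * downtime_seconds // sample_seconds

# Values from the post: 6601186 bytes/minute, one hour of headroom.
size = gcache_size_bytes(83976571, 90576957, 0, 800)
print(size // (1024 * 1024), "MB")  # prints 377 MB
```

The exact figure comes out slightly above the post's 360 MB because the post rounds the rate down to 6 MB per minute first; either way, the result is only as good as how representative the sample window is of peak write load.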

If you want to allow one hour of maintenance (or downtime) of a node, you need to increase the gcache to that size. If you want more time, just make it bigger.
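The gcache size is set through wsrep_provider_options. A sketch for the 360 MB-per-hour example above (the value is an assumption to adapt to your own workload):

```ini
# /etc/my.cnf on each PXC node -- one hour of headroom for this workload.
[mysqld]
wsrep_provider_options = "gcache.size=360M"
```

Note that changing gcache.size requires restarting the node for the new cache file to be created.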

The post How to calculate the correct size of Percona XtraDB Cluster’s gcache appeared first on MySQL Performance Blog.

Cédric PEINTRE: Time to forget show processlist for monitoring? (7.9.2014, 21:45 UTC)

Disclaimer: I’m not specially an expert of benchmarking, this post is to compare different options. All comments and advices are welcome.

I’m not telling you anything new, the show processlist command is a fantastic command line tool for instant check.
But what about monitoring your databases with this command embedded in a tool?

Just have a look at this graph:


With 5K queries per second, how much will be hidden by a show processlist executed every second? Probably a lot.
So I wanted to test which alternatives could efficiently retrieve all the queries during a given time window.

Test procedure and configuration

I used sysbench 0.5 (with oltp.lua) to make my tests on different configurations and tools.
My test server is a 20-core (hyper-threaded) server with 128 GB of RAM and a RAID 10 disk setup.
These tests are based on Percona Server 5.5.31.
I tested the amount of TPS from 1 to 64 concurrent threads.

Here is the standard benchmark graph (TPS):


Slow query log set to 0 (file output)


Estimated size of the file: 52 GB
We reach a stable peak of 1000 transactions per second.

Slow query log set to 0 (table output)


Estimated size of the table: 8.9 GB
Table contention is probably the cause of a stabilization around 600 transactions per second.

General log in a file


Estimated size of the file: 11 GB
About 2000 transactions per second can be achieved here.

General log in a table


Estimated size of the table: 9.9 GB
Again, the results are lower than the previous file-based results.

Infinite show processlist in a file


Estimated size of the file: 1.1 GB
The results don’t look so dramatic, but it’s not certain that all the queries were captured in the file.

MariaDB Audit Plugin in a file


Estimated size of the file: 14 GB
server_audit_events             = QUERY
server_audit_file_path          = /var

Truncated by Planet PHP, read more at the original (another 4662 bytes)

Peter Zaitsev: Percona XtraDB Cluster 5.5.39-25.11 is now available (5.9.2014, 14:49 UTC)

Percona is glad to announce the new release of Percona XtraDB Cluster 5.5 on September 5th 2014. Binaries are available from the downloads area or from our software repositories.

Based on Percona Server 5.5.39-36.0 including all the bug fixes in it, Galera Replicator 2.11, and on Codership wsrep API 25.11, Percona XtraDB Cluster 5.5.39-25.11 is now the current 5.5 General Availability release. All of Percona‘s software is open-source and free, and all the details of the release can be found in the 5.5.39-25.11 milestone at Launchpad.

New Features:

  • New session variable wsrep_sync_wait has been implemented to control causality check. The old session variable wsrep_causal_reads is deprecated but is kept for backward compatibility (#1277053).
  • systemd integration with RHEL/CentOS 7 is now available for Percona XtraDB Cluster from our testing repository (#1342223).

Bugs Fixed:

  • Percona XtraDB Cluster has implemented threadpool scheduling fixes. Bug fixed #1333348.
  • When gmcast.listen_addr was configured to a certain address, the local connection point for outgoing connections was not bound to the listen address. If the OS has multiple interfaces with IP addresses in the same subnet, it could pick the wrong IP for the local connection point, and other nodes would see connections originating from an IP address that was not listened on. Bug fixed #1240964.
  • Client connections were closed unconditionally before generating SST request. Fixed by avoiding closing connections when wsrep is initialized before storage engines. Bug fixed #1258658.
  • Issue with re-setting galera provider (in wsrep_provider_options) has been fixed. Bug fixed #1260283.
  • Variable wsrep_provider_options couldn’t be set in runtime if no provider was loaded. Bug fixed #1260290.
  • Node consistency issues with foreign keys have been fixed. This fix introduces two new variables: wsrep_slave_FK_checks and wsrep_slave_UK_checks. These variables are set to TRUE and FALSE respectively by default. They control whether Foreign Key and Unique Key checking is done for applier threads. Bug fixed #1260713.

Truncated by Planet PHP, read more at the original (another 6083 bytes)

Peter Zaitsev: Using MySQL 5.6 Global Transaction IDs (GTIDs) in production: Q&A (5.9.2014, 10:00 UTC)

Thank you to all of you who attended my webinar last week about Global Transaction IDs (GTIDs), which were introduced in MySQL 5.6 to make the reconfiguration of replication straightforward. If you missed my webinar, you can still listen to the recording and download the slides (free). We had a lot of questions during the webinar, so let me try to answer them here. Please let me know in the comments if additional clarification is needed.

Q: Does GTID provide any benefit to master-master replication? If yes, how?

A: GTIDs don’t change the nature of MySQL replication: for instance it is still asynchronous and if you write on both masters in a master-master setup, there is still no write conflict detection mechanism. The main benefit of GTIDs is that any change of the replication topology is easy because you no longer need to run complex calculations to find the right binlog positions when connecting a slave to a new master.
So while master-master replication can be configured with GTIDs, it does not provide many benefits compared to position-based replication, as you will never change the replication topology.
And having a setup where both masters receive writes is still not recommended with GTIDs.

Q: Will GTIDs work well with master:standby master? How quick would the failover be?
A: Yes, GTIDs works well with this kind of setup (which is one master and one slave). GTIDs do not provide failover, you will have to use an external tool. Speed of failover then depends on the tool you will use.

Q: For already set up MASTER-MASTER/MASTER-SLAVE Replication, after getting GTID set up, we need to rebuild replication again using AUTO POS=1, correct?
A: Yes, using MASTER_AUTO_POSITION=1 is necessary to indicate that you will use GTID replication. So you will have to run: STOP SLAVE; CHANGE MASTER TO … MASTER_AUTO_POSITION = 1; START SLAVE;
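Spelled out, the repointing looks something like this (host and user names are hypothetical placeholders; the binlog coordinates are resolved automatically, which is the whole point of GTIDs):

```sql
STOP SLAVE;
CHANGE MASTER TO
    MASTER_HOST = 'new-master.example.com',  -- hypothetical new master
    MASTER_USER = 'repl',                    -- hypothetical replication user
    MASTER_AUTO_POSITION = 1;                -- let GTIDs find the position
START SLAVE;
```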

Q: For an application having tables in different engines (InnoDB and MyISAM), how will that be handled with GTIDs?
A: Transactions using both MyISAM and InnoDB tables are not allowed, please refer to the documentation

Q: In a master-slave replication topology (with GTID enabled), how does slave get data from the master if the master’s binary logs are purged given that AUTO_POSITION=1 is used as part of the change master command?
A: This will break replication with error 1236.

Q: What’s the value in SHOW SLAVE STATUS that determines if there is a lag on the slave?
A: This is Seconds_Behind_Master. It’s not always reliable though. For instance, if you have a replication setup like A -> B -> C, Seconds_Behind_Master on C will show the lag relative to B, not A.

Q: What is the value of saving the history of previous master’s GTIDs executed in the show slave status -> Executed_Gtid_Set?
A: The new replication protocol makes sure that when the slave connects to its master, it sends the range of GTIDs it has already executed. Then the master sends back all other transactions. That’s why Executed_Gtid_Set contains the history of all executed transactions.

Q: We use DB Master and Slave VIPs on our servers, can the mysqlfailover tool also switch the VIP to the new master? Is it scriptable on the event of a failover?
A: Yes, you can use extension points to add your own custom scripts with mysqlfailover. See the documentation for --exec-before and --exec-after.

Q: How does mysqlfailover handle brief network instability between the Master and Slaves?
A: mysqlfailover only triggers failover when it suspects the master is no longer alive. So network instability between the master and its slaves won’t affect it for master crash detection. However

Truncated by Planet PHP, read more at the original (another 5654 bytes)

Open Query: Tracing down a problem, finding sloppy code (5.9.2014, 02:33 UTC)

Daniel was tracking down what appeared to be a networking problem….

  • server reported 113 (No route to host)
  • However, an strace did not reveal the networking stack ever returning that.
  • On the other side, IP packets were actually received.
  • When confronted with mysteries like this, I get suspicious – mainly of (fellow) programmers.
  • I suggested a grep through the source code, which revealed return -EHOSTUNREACH;
  • Mystery solved, which allowed us to find what was actually going on.


Lessons learned:

  1. Don’t just believe or presume the supposed origin of an error.
  2. Programmers often take shortcuts that cause grief later. I fully appreciate how the above code came about, but I still think it was wrong. Mapping a “similar” situation onto an existing error code is convenient. But when an error occurs, the most important thing is for people to be able to track down what the root cause is. Reporting this error outside of its original context (error code reported by network stack) is clearly unhelpful, it actually misdirects and requires people to essentially waste time to track it down (as above).
  3. Hooray once again for Open Source, which makes it so much easier to figure these things out. While possibly briefly embarrassing for the programmer, more eyes allow code to improve better and faster – and, perhaps, also entice better coding practices from the outset (I can hope!).

What do you think?

Peter Zaitsev: MySQL 101: Monitor Disk I/O with pt-diskstats (4.9.2014, 07:00 UTC)

Here on the Percona Support team we often ask customers to retrieve disk stats to monitor disk IO and to measure block device IOPS and latency. There are a number of tools available to monitor IO on Linux. iostat is one of the popular tools, and Percona Toolkit, which is free, contains the pt-diskstats tool for this purpose. The pt-diskstats tool is similar to iostat but more interactive, with extended information. pt-diskstats reports current disk activity, showing statistics over the last interval (1 second by default), and continues until interrupted. The pt-diskstats tool collects samples of /proc/diskstats.

In this post, I will share some examples about how to monitor and check to see if the IO subsystem is performing properly or if any disks are a limiting factor – all this by using the pt-diskstats tool.

pt-diskstats output consists of a number of columns, and in order to interpret it we need to know what each column represents.

  • rd_s tells about number of reads per second while wr_s represents number of writes per second.
  • rd_rt and wr_rt shows average response time in milliseconds for reads & writes respectively, which is similar to iostat tool output await column but pt-diskstats shows individual response time for reads and writes at disk level. Just a note, modern iostat splits read and write latency out, but most distros don’t have the latest iostat in their systat (or equivalent) package.
  • rd_mrg and wr_mrg are two other important columns in pt-diskstats output. *_mrg tells us how many of the original operations the IO elevator (disk scheduler) was able to merge to reduce IOPS, so *_mrg tells us something quite important by letting us know whether the IO scheduler was able to consolidate many or few operations. If rd_mrg/wr_mrg is a high percentage, then the IO workload is sequential; on the other hand, if rd_mrg/wr_mrg is a low percentage, then the IO workload is all random. Binary logs, redo logs (aka ib_logfile*), the undo log and the doublewrite buffer all need sequential writes.
  • qtime and stime are the last two columns in pt-diskstats output. qtime reflects time spent in the disk scheduler queue, i.e. the average queue time before a request is sent to the physical device, while stime is the average service time, i.e. the time accumulated to process the physical device request. Note that qtime does not discriminate between reads and writes; if response time is dominated by qtime, that points towards the disk scheduler. Also note that service time (the stime field and svctm field in pt-diskstats and iostat output respectively) is not reliable on Linux; if you read the iostat manual you will see it is deprecated.

Along with that, there are many other parameters for pt-diskstats – you can find the full documentation here. Below is an example of pt-diskstats in action. I used the --devices-regex option, which prints only information for devices matching this Perl regex.

$ pt-diskstats --devices-regex=sd --interval 5
#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
1.1 sda   21.6   22.8    0.5      45%    1.2  29.4 275.5   4.0    1.1      0%   40.0  145.1   65%   158 297.1 155.0   2.1
1.1 sdb   15.0   21.0    0.3      33%    0.1   5.2   0.0   0.0    0.0      0%    0.0    0.0   11%     1  15.0   0.5   4.7
1.1 sdc    5.6   10.0    0.1       0%    0.0   5.2   1.9   6.0    0.0     33%    0.0    2.0    3%     0   7.5   0.4   3.6
1.1 sdd    0.0    0.0    0.0       0%    0.0   0.0   0.0   0.0    0.0      0%    0.0    0.0    0%     0   0.0   0.0   0.0
5.0 sda   17.0   14

Truncated by Planet PHP, read more at the original (another 12219 bytes)

Peter Zaitsev: Migrating to Percona XtraDB Cluster 2014 edition: Sept. 10 MySQL webinar (3.9.2014, 19:31 UTC)

Join me online next week (September 10 at 10 a.m. PDT) for my live webinar on Migrating to Percona XtraDB Cluster. This was a popular webinar that I gave a few years ago, so I’m doing it again with updates for Percona XtraDB Cluster 5.6 (PXC) and all the latest in the Galera world.

This webinar will be really good for people interested in getting an overview of what PXC/Galera is, what it would take to adopt it for your application, and some of the differences and challenges it brings compared with a conventional MySQL Master/slave setup.  I’d highly suggest attending if you are considering Galera in your environment and want to get a better understanding of its uses and antipatterns.

Additionally, I’ll cover such questions as:

  • What are the requirements for running Percona XtraDB Cluster?
  • Will I have to reload all my tables?
  • How does configuration for the cluster differ from configuring a stand-alone InnoDB server?
  • How should my application interact with the Cluster?
  • Can I use Percona XtraDB Cluster if I only have two MySQL servers currently?
  • How can I move to the Cluster and keep downtime to a minimum?
  • How can I migrate to Percona XtraDB Cluster gradually?

I hope to see you next Wednesday. And please feel free to ask questions in advance in the comments section below. Next week’s live event, like all of our MySQL webinars, is free. Register here!

The post Migrating to Percona XtraDB Cluster 2014 edition: Sept. 10 MySQL webinar appeared first on MySQL Performance Blog.

Jean-Jerome Schmidt: How to change AWS instance sizes for your Galera Cluster and optimize performance (3.9.2014, 13:32 UTC)
September 3, 2014
By Severalnines

Running your database cluster on AWS is a great way to adapt to changing workloads by adding/removing instances, or by scaling up/down each instance. At Severalnines, we talk much more about scale-out than scale up, but there are cases where you might want to scale up an instance instead of scaling out. 

In this post, we’ll show you how to change instance sizes with respect to RAM, CPU and IOPS, and how to tune your Galera nodes accordingly. Moreover, this post assumes that instances are launched using Amazon VPC.


When do we need to upgrade an instance?


You typically need to upgrade an instance when you run out of server resources. This includes CPU, RAM, storage capacity, disk throughput and bandwidth. You must allow enough headroom for your database to operate and grow. Performance tuning will allow you to get the most out of your servers, but in some cases, this might not be enough.


read more
