Planet MariaDB

March 26, 2019

Peter Zaitsev

Upcoming Webinar Wed 3/27: Monitoring PostgreSQL with Percona Monitoring and Management (PMM)

Monitoring PostgreSQL with Percona Monitoring and Management (PMM)

Monitoring PostgreSQL with Percona Monitoring and Management (PMM)Please join Percona’s Product Manager, Michael Coburn, as he presents his talk Monitoring PostgreSQL with Percona Monitoring and Management (PMM) on March 27th, 2019 at 11:00 AM PDT (UTC-7) / 2:00 PM EDT (UTC-4).

Register Now

In this webinar, learn how to monitor PostgreSQL using Percona Monitoring and Management (PMM) so that you can:

Gain greater visibility of performance and bottlenecks for PostgreSQL
Consolidate your PostgreSQL servers into the same monitoring platform you already use for MySQL and MongoDB
Respond more quickly and efficiently in Severity 1 issues

We’ll also show how using PMM’s External Exporters can help you integrate PostgreSQL in only minutes!

In order to learn more, register for this webinar on how to monitor PostgreSQL with PMM.

by Michael Coburn at March 26, 2019 03:58 PM

March 25, 2019

Peter Zaitsev

Percona Server for MongoDB Operator 0.3.0 Early Access Release Is Now Available

Percona Server for MongoDB

Percona Server for MongoDB OperatorPercona announces the availability of the Percona Server for MongoDB Operator 0.3.0 early access release.

The Percona Server for MongoDB Operator simplifies the deployment and management of Percona Server for MongoDB in a Kubernetes or OpenShift environment. It extends the Kubernetes API with a new custom resource for deploying, configuring and managing the application through the whole life cycle.

You can install the Percona Server for MongoDB Operator on Kubernetes or OpenShift. While the operator does not support all the Percona Server for MongoDB features in this early access release, instructions on how to install and configure it are already available along with the operator source code in our Github repository.

The Percona Server for MongoDB Operator is an early access release. Percona doesn’t recommend it for production environments.

New Features


Fixed Bugs

  • CLOUD-141: Operator failed to rescale cluster after self-healing.
  • CLOUD-151: Dashboard upgrade in Percona Monitoring and Management caused loop due to no write access.
  • CLOUD-152: Percona Server for MongoDB crash took place in case of no backup section in the Operator configuration file.
  • CLOUD-91: The Operator was throwing error messages with Arbiters disabled in the deploy/cr.yaml configuration file.

Percona Server for MongoDB is an enhanced, open source and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB Community Edition. It supports MongoDB® protocols and drivers. Percona Server for MongoDB extends MongoDB Community Edition functionality by including the Percona Memory Engine, as well as several enterprise-grade features. It requires no changes to MongoDB applications or code.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system.

by Dmitriy Kostiuk at March 25, 2019 02:13 PM

How to Perform Compatible Schema Changes in Percona XtraDB Cluster (Advanced Alternative)?

PXC schema changes options

PXC schema changes optionsIf you are using Galera replication, you know that schema changes may be a serious problem. With its current implementation, there is no way even a simple ALTER will be unobtrusive for live production traffic. It is a fact that with the default TOI alter method, Percona XtraDB Cluster (PXC) cluster suspends writes in order to execute the ALTER in the same order on all nodes.

For factual data structure changes, we have to adapt to the limitations, and either plan for a maintenance window, or use pt-online-schema-change, where interruptions should be very short. I suggest you be extra careful here, as normally you cannot kill an ongoing ALTER query in Galera cluster.

For schema compatible changes, that is, ones that cannot break ROW replication when the writer node and applier nodes have different metadata, we can consider using the Rolling Schema Update (RSU) method. An example of 100% replication-safe DDL is OPTIMIZE TABLE (aka noop-ALTER). However, the following are safe to consider too:

  • adding and removing secondary index,
  • renaming an index,
  • changing the ROW_FORMAT (for example enabling/disabling table compression),
  • changing the KEY_BLOCK_SIZE(compression property).

However, a lesser known fact is that even using the RSU method or pt-online-schema-change for the above may not save us from some unwanted disruptions.

RSU and Concurrent Queries

Let’s take a closer look at a very simple scenario with noop ALTER. We will set wsrep_OSU_method to RSU to avoid a cluster-wide stall. In fact, this mode turns off replication for the following DDL (and only for DDL), so you have to remember to repeat the same ALTER on every cluster member later.

For simplicity, let’s assume there is only one node used for writes. In the first client session, we change the method accordingly to prepare for DDL:

node1 > set wsrep_OSU_method=RSU;
Query OK, 0 rows affected (0.00 sec)
node1 > select @@wsrep_OSU_method,@@wsrep_on,@@wsrep_desync;
| @@wsrep_OSU_method | @@wsrep_on | @@wsrep_desync |
| RSU                |          1 |              0 |
1 row in set (0.00 sec)

(By the way, as seen above, the desync mode is not enabled yet, as it will be automatically enabled around the DDL query only, and disabled right after it finishes).

In a second client session, we start a long enough SELECT query:

node1 > select count(*) from db1.sbtest1 a join db1.sbtest1 b where<10000;

And while it’s ongoing, let’s rebuild the table:

node1 > alter table db1.sbtest1 engine=innodb;
Query OK, 0 rows affected (0.98 sec)
Records: 0 Duplicates: 0 Warnings: 0

Surprisingly, immediately the client in the second session receives its SELECT failure:

ERROR 1213 (40001): WSREP detected deadlock/conflict and aborted the transaction. Try restarting the transaction

So, even a simple SELECT is aborted if it conflicts with the local, concurrent ALTER (RSU)… We can see more details in the error log:

2018-12-04T21:39:17.285108Z 0 [Note] WSREP: Member 0.0 (node1) desyncs itself from group
2018-12-04T21:39:17.285124Z 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 471796)
2018-12-04T21:39:17.305018Z 12 [Note] WSREP: Provider paused at 7bf59bb4-996d-11e8-b3b6-8ed02cd38513:471796 (30)
2018-12-04T21:39:17.324509Z 12 [Note] WSREP: --------- CONFLICT DETECTED --------
2018-12-04T21:39:17.324532Z 12 [Note] WSREP: cluster conflict due to high priority abort for threads:
2018-12-04T21:39:17.324535Z 12 [Note] WSREP: Winning thread:
THD: 12, mode: total order, state: executing, conflict: no conflict, seqno: -1
SQL: alter table db1.sbtest1 engine=innodb
2018-12-04T21:39:17.324537Z 12 [Note] WSREP: Victim thread:
THD: 11, mode: local, state: executing, conflict: no conflict, seqno: -1
SQL: select count(*) from db1.sbtest1 a join db1.sbtest1 b where<10000
2018-12-04T21:39:17.324542Z 12 [Note] WSREP: MDL conflict db=db1 table=sbtest1 ticket=MDL_SHARED_READ solved by abort
2018-12-04T21:39:17.324544Z 12 [Note] WSREP: --------- CONFLICT DETECTED --------
2018-12-04T21:39:17.324545Z 12 [Note] WSREP: cluster conflict due to high priority abort for threads:
2018-12-04T21:39:17.324547Z 12 [Note] WSREP: Winning thread:
THD: 12, mode: total order, state: executing, conflict: no conflict, seqno: -1
SQL: alter table db1.sbtest1 engine=innodb
2018-12-04T21:39:17.324548Z 12 [Note] WSREP: Victim thread:
THD: 11, mode: local, state: executing, conflict: must abort, seqno: -1
SQL: select count(*) from db1.sbtest1 a join db1.sbtest1 b where<10000
2018-12-04T21:39:18.517457Z 12 [Note] WSREP: resuming provider at 30
2018-12-04T21:39:18.517482Z 12 [Note] WSREP: Provider resumed.
2018-12-04T21:39:18.518310Z 0 [Note] WSREP: Member 0.0 (node1) resyncs itself to group
2018-12-04T21:39:18.518342Z 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 471796)
2018-12-04T21:39:18.519077Z 0 [Note] WSREP: Member 0.0 (node1) synced with group.
2018-12-04T21:39:18.519099Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 471796)
2018-12-04T21:39:18.519119Z 2 [Note] WSREP: Synchronized with group, ready for connections
2018-12-04T21:39:18.519126Z 2 [Note] WSREP: Setting wsrep_ready to true

Another example – a simple sysbench test, during which I did noop ALTER in RSU mode:

# sysbench /usr/share/sysbench/oltp_read_only.lua --table-size=1000 --tables=8 --mysql-db=db1 --mysql-user=root --threads=8 --time=200 --report-interval=1 --events=0 --db-driver=mysql run
sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)
Running the test with following options:
Number of threads: 8
Report intermediate results every 1 second(s)
Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 1s ] thds: 8 tps: 558.37 qps: 9004.30 (r/w/o: 7880.62/0.00/1123.68) lat (ms,95%): 18.28 err/s: 0.00 reconn/s: 0.00
[ 2s ] thds: 8 tps: 579.01 qps: 9290.22 (r/w/o: 8130.20/0.00/1160.02) lat (ms,95%): 17.01 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 8 tps: 597.36 qps: 9528.89 (r/w/o: 8335.17/0.00/1193.72) lat (ms,95%): 15.83 err/s: 0.00 reconn/s: 0.00
FATAL: mysql_stmt_store_result() returned error 1317 (Query execution was interrupted)
FATAL: `thread_run' function failed: /usr/share/sysbench/oltp_common.lua:432: SQL error, errno = 1317, state = '70100': Query execution was interrupted

So, SELECT queries are aborted to resolve MDL lock request that a DDL in RSU needs immediately. This of course applies to INSERT, UPDATE and DELETE as well. That’s quite an intrusive way to accomplish the goal…

“Manual RSU”

Let’s try a “manual RSU” workaround instead. In fact, we can achieve the same isolated DDL execution as in RSU, by putting a node in desync mode (to avoid flow control) and disabling replication for our session. That way, the ALTER will only be executed in that particular node.

Session 1:

node1 > set wsrep_OSU_method=TOI; set global wsrep_desync=1; set wsrep_on=0;
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
node1 > select @@wsrep_OSU_method,@@wsrep_on,@@wsrep_desync;
| @@wsrep_OSU_method | @@wsrep_on | @@wsrep_desync |
| TOI                |          0 |              1 |
1 row in set (0.00 sec)

Session 2:

node1 > select count(*) from db1.sbtest1 a join db1.sbtest1 b where<10000;
| count(*)  |
| 423680000 |
1 row in set (14.07 sec)

Session 1:

node1 > alter table db1.sbtest1 engine=innodb;
Query OK, 0 rows affected (13.52 sec)
Records: 0 Duplicates: 0 Warnings: 0

Session 3:

node1 > select id,command,time,state,info from information_schema.processlist where user="root";
| id | command | time | state                           | info |
| 11 | Query   | 9    | Sending data                    | select count(*) from db1.sbtest1 a join db1.sbtest1 b where<10000 |
| 12 | Query   | 7    | Waiting for table metadata lock | alter table db1.sbtest1 engine=innodb |
| 17 | Query   | 0    | executing                       | select id,command,time,state,info from information_schema.processlist where user="root" |
3 rows in set (0.00 sec)
node1 > select id,command,time,state,info from information_schema.processlist where user="root";
| id | command | time | state          | info |
| 11 | Sleep   | 14   |                | NULL |
| 12 | Query   | 13   | altering table | alter table db1.sbtest1 engine=innodb |
| 17 | Query   | 0    | executing      | select id,command,time,state,info from information_schema.processlist where user="root" |
3 rows in set (0.00 sec)

In this case, there was no interruption, the ALTER waited for it’s MDL lock request to succeed gracefully, and did it’s job when it became possible.

Remember, you have to execute the same commands on the rest of the nodes to make them consistent – even for noop-alter, it’s important to make the nodes consistent in terms of table size on disk.

Kill Problem

Another fact is that you cannot cancel or kill a DDL query executed in RSU or in TOI method:

node1 > kill query 12;
ERROR 1095 (HY000): You are not owner of thread 12

This may be an annoying problem when you need to unblock a node urgently. Fortunately, the workaround with wsrep_on=0 also allows to kill an ALTER without that restriction:

Session 1:

node1 > kill query 22;
Query OK, 0 rows affected (0.00 sec)

Session 2:

node1 > alter table db1.sbtest1 engine=innodb;
ERROR 1317 (70100): Query execution was interrupted


The RSU method may be more intrusive then you’d expect. For schema compatible changes, it is worth considering “manual RSU” with

set global wsrep_desync=1; set wsrep_on=0;

When using it though, please remember that wsrep_on applies to all types of writes, both DDL and DML, so be extra careful to set it back to 1 after the ALTER is done. So the procedure will look like this:

SET GLOBAL wsrep_desync=1;
SET wsrep_on=0;
ALTER ...  /* compatible schema change only! */
SET wsrep_on=1;
SET GLOBAL wsrep_desync=0;

Incidentally, as in my opinion the current RSU behavior is unnecessarily intrusive, I have filed this change suggestion:

Photo by Pierre Bamin on Unsplash

by Przemysław Malkowski at March 25, 2019 12:37 PM

March 20, 2019

Peter Zaitsev

MongoDB on ARM Processors

reads updates transactions per hour per $

ARM processors have been around for a while. In mid-2015/2016 there were a couple of attempts by the community to port MongoDB to work with this architecture. At the time, the main storage engine was MMAP and most of the available ARM boards were 32-bits. Overall, the port worked, but the fact is having MongoDB running on a Raspberry Pi was more a hack than a setup. The public cloud providers didn’t yet offer machines running with these processors.

The ARM processors are power-efficient and, for this reason, they are used in smartphones, smart devices and, now, even laptops. It was just a matter of time to have them available in the cloud as well. Now that AWS is offering ARM-based instances you might be thinking: “Hmmm, these instances include the same amount of cores and memory compared to the traditional x86-based offers, but cost a fraction of the price!”.

But do they perform alike?

In this blog, we selected three different AWS instances to compare: one powered by  an ARM processor, the second one backed by a traditional x86_64 Intel processor with the same number of cores and memory as the ARM instance, and finally another Intel-backed instance that costs roughly the same as the ARM instance but carries half as many cores. We acknowledge these processors are not supposed to be “equivalent”, and we do not intend to go deeper in CPU architecture in this blog. Our goal is purely to check how the ARM-backed instance fares in comparison to the Intel-based ones.

These are the instances we will consider in this blog post.


We will use the Yahoo Cloud Serving Benchmark (YCSB, running on a dedicated instance (c5d.4xlarge) to simulate load in three distinct tests:

  1. a load of 1 billion documents in one collection having only the primary key (which we’ll call Inserts).
  2. a workload comprised of exclusively reads (Reads)
  3. a workload comprised of a mix of 75% reads with 5% scans plus 25% updates (Reads/Updates)

We will run each test with a varying number of concurrent threads (32, 64, and 128), repeating each set three times and keeping only the second-best result.

All instances will run the same MongoDB version (4.0.3, installed from a tarball and running with default settings) and operating system, Ubuntu 16.04. We chose this setup because MongoDB offer includes an ARM version for Ubuntu-based machines.

All the instances will be configured with:

  • 100 GB EBS with 5000 PIOPS and 20 GB EBS boot device
  • Data volume formatted with XFS, 4k blocks
  • Default swappiness and disk scheduler
  • Default kernel parameters
  • Enhanced cloud watch configured
  • Free monitoring tier enabled

Preparing the environment

We start with the setup of the benchmark software we will use for the test, YCSB. The first task was to spin up a powerful machine (c5d.4xlarge) to run the software and then prepare the environment:

The YCSB program requires Java, Maven, Python, and pymongo which doesn’t come by default in our Linux version – Ubuntu server x86. Here are the steps we used to configure our environment:

Installing Java

sudo apt-get install java-devel

Installing Maven

sudo tar xzf apache-maven-*-bin.tar.gz -C /usr/local
cd /usr/local
sudo ln -s apache-maven-* maven
sudo vi /etc/profile.d/

Add the following to

export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}

Installing Python 2.7

sudo apt-get install python2.7

Installing pip to resolve the pymongo dependency

sudo apt-get install python-pip

Installing pymongo (driver)

sudo pip install pymongo

Installing YCSB

curl -O --location
tar xfvz ycsb-0.5.0.tar.gz
cd ycsb-0.5.0

YCSB comes with different workloads, and also allows for the customization of a workload to match our own requirements. If you want to learn more about the workloads have a look at

First, we will edit the workloads/workloada file to perform 1 billion inserts (for our first test) while also preparing it to later perform only reads (for our second test):


We will then change the workloads/workloadb file so as to provide a mixed workload for our third test.  We also set it to perform 1 billion reads, but we break it down into 70% of read queries and 30% of updates with a scan ratio of 5%, while also placing a cap on the maximum number of scanned documents (2000) in an effort to emulate real traffic – workloads are not perfect, right?


With that, we have the environment configured for testing.

Running the tests

With all instances configured and ready, we run the stress test against our MongoDB servers using the following command :

./bin/ycsb [load/run] mongodb -s -P workloads/workload[ab] -threads [32/64/128] \
 -p mongodb.url=mongodb://[0-9] \

The parameters between brackets varied according to the instance and operation being executed:

  • [load/run] load means insert data while run means perform action (update/read)
  • workload[a/b] reference the different workloads we’ve used
  • [32/64/128] indicate the number of concurrent threads being used for the test
  • ycsb0000[0-9] is the database name we’ve used for the tests (for reference only)


Without further ado, the table below summarizes the results for our tests:




Performance cost

Considering throughput alone – and in the context of those tests, particularly the last one – you may get more performance for the same cost. That’s certainly not always the case, which our results above also demonstrate. And, as usual, it depends on “how much performance do you need” – a matter that is even more pertinent in the cloud. With that in mind, we had another look at our data under the “performance cost” lens.

As we saw above, the c5.4xlarge instance performed better than the other two instances for a little over 50% more (in terms of cost). Did it deliver 50% more (performance) as well? Well, sometimes it did even more than that, but not always. We used the following formula to extrapolate the OPS (Operations Per Second) data we’ve got from our tests into OPH (Operations Per Hour), so we could them calculate how much bang (operations) for the buck (US$1) each instance was able to provide:

transactions/hour/US$1 = (OPS * 3600) / instance cost per hour

This is, of course, an artificial metric that aims to correlate performance and cost. For this reason, instead of plotting the raw values, we have normalized the results using the best performer instance as baseline(100%):



The intent behind these was only to demonstrate another way to evaluate how much we’re getting for what we’re paying. Of course, you need to have a clear understanding of your own requirements in order to make a balanced decision.

Parting thoughts

We hope this post awakens your curiosity not only about how MongoDB may perform on ARM-based servers, but also by demonstrating another way you can perform your own tests with the YCSB benchmark. Feel free to reach out to us through the comments section below if you have any suggestions, questions, or other observations to make about the work we presented here.

by Adamo Tonete at March 20, 2019 05:31 PM

Jean-Jerome Schmidt

How to Run and Configure ProxySQL 2.0 for MySQL Galera Cluster on Docker

ProxySQL is an intelligent and high-performance SQL proxy which supports MySQL, MariaDB and ClickHouse. Recently, ProxySQL 2.0 has become GA and it comes with new exciting features such as GTID consistent reads, frontend SSL, Galera and MySQL Group Replication native support.

It is relatively easy to run ProxySQL as Docker container. We have previously written about how to run ProxySQL on Kubernetes as a helper container or as a Kubernetes service, which is based on ProxySQL 1.x. In this blog post, we are going to use the new version ProxySQL 2.x which uses a different approach for Galera Cluster configuration.

ProxySQL 2.x Docker Image

We have released a new ProxySQL 2.0 Docker image container and it's available in Docker Hub. The README provides a number of configuration examples particularly for Galera and MySQL Replication, pre and post v2.x. The configuration lines can be defined in a text file and mapped into the container's path at /etc/proxysql.cnf to be loaded into ProxySQL service.

The image "latest" tag still points to 1.x until ProxySQL 2.0 officially becomes GA (we haven't seen any official release blog/article from ProxySQL team yet). Which means, whenever you install ProxySQL image using latest tag from Severalnines, you will still get version 1.x with it. Take note the new example configurations also enable ProxySQL web stats (introduced in 1.4.4 but still in beta) - a simple dashboard that summarizes the overall configuration and status of ProxySQL itself.

ProxySQL 2.x Support for Galera Cluster

Let's talk about Galera Cluster native support in greater detail. The new mysql_galera_hostgroups table consists of the following fields:

  • writer_hostgroup: ID of the hostgroup that will contain all the members that are writers (read_only=0).
  • backup_writer_hostgroup: If the cluster is running in multi-writer mode (i.e. there are multiple nodes with read_only=0) and max_writers is set to a smaller number than the total number of nodes, the additional nodes are moved to this backup writer hostgroup.
  • reader_hostgroup: ID of the hostgroup that will contain all the members that are readers (i.e. nodes that have read_only=1)
  • offline_hostgroup: When ProxySQL monitoring determines a host to be OFFLINE, the host will be moved to the offline_hostgroup.
  • active: a boolean value (0 or 1) to activate a hostgroup
  • max_writers: Controls the maximum number of allowable nodes in the writer hostgroup, as mentioned previously, additional nodes will be moved to the backup_writer_hostgroup.
  • writer_is_also_reader: When 1, a node in the writer_hostgroup will also be placed in the reader_hostgroup so that it will be used for reads. When set to 2, the nodes from backup_writer_hostgroup will be placed in the reader_hostgroup, instead of the node(s) in the writer_hostgroup.
  • max_transactions_behind: determines the maximum number of writesets a node in the cluster can have queued before the node is SHUNNED to prevent stale reads (this is determined by querying the wsrep_local_recv_queue Galera variable).
  • comment: Text field that can be used for any purposes defined by the user

Here is an example configuration for mysql_galera_hostgroups in table format:

Admin> select * from mysql_galera_hostgroups\G
*************************** 1. row ***************************
       writer_hostgroup: 10
backup_writer_hostgroup: 20
       reader_hostgroup: 30
      offline_hostgroup: 9999
                 active: 1
            max_writers: 1
  writer_is_also_reader: 2
max_transactions_behind: 20

ProxySQL performs Galera health checks by monitoring the following MySQL status/variable:

  • read_only - If ON, then ProxySQL will group the defined host into reader_hostgroup unless writer_is_also_reader is 1.
  • wsrep_desync - If ON, ProxySQL will mark the node as unavailable, moving it to offline_hostgroup.
  • wsrep_reject_queries - If this variable is ON, ProxySQL will mark the node as unavailable, moving it to the offline_hostgroup (useful in certain maintenance situations).
  • wsrep_sst_donor_rejects_queries - If this variable is ON, ProxySQL will mark the node as unavailable while the Galera node is serving as an SST donor, moving it to the offline_hostgroup.
  • wsrep_local_state - If this status returns other than 4 (4 means Synced), ProxySQL will mark the node as unavailable and move it into offline_hostgroup.
  • wsrep_local_recv_queue - If this status is higher than max_transactions_behind, the node will be shunned.
  • wsrep_cluster_status - If this status returns other than Primary, ProxySQL will mark the node as unavailable and move it into offline_hostgroup.

Having said that, by combining these new parameters in mysql_galera_hostgroups together with mysql_query_rules, ProxySQL 2.x has the flexibility to fit into much more Galera use cases. For example, one can have a single-writer, multi-writer and multi-reader hostgroups defined as the destination hostgroup of a query rule, with the ability to limit the number of writers and finer control on the stale reads behaviour.

Contrast this to ProxySQL 1.x, where the user had to explicitly define a scheduler to call an external script to perform the backend health checks and update the database servers state. This requires some customization to the script (user has to update the ProxySQL admin user/password/port) plus it depended on an additional tool (MySQL client) to connect to ProxySQL admin interface.

Here is an example configuration of Galera health check script scheduler in table format for ProxySQL 1.x:

Admin> select * from scheduler\G
*************************** 1. row ***************************
         id: 1
     active: 1
interval_ms: 2000
   filename: /usr/share/proxysql/tools/
       arg1: 10
       arg2: 20
       arg3: 1
       arg4: 1
       arg5: /var/lib/proxysql/proxysql_galera_checker.log

Besides, since ProxySQL scheduler thread executes any script independently, there are many versions of health check scripts available out there. All ProxySQL instances deployed by ClusterControl uses the default script provided by the ProxySQL installer package.

In ProxySQL 2.x, max_writers and writer_is_also_reader variables can determine how ProxySQL dynamically groups the backend MySQL servers and will directly affect the connection distribution and query routing. For example, consider the following MySQL backend servers:

Admin> select hostgroup_id, hostname, status, weight from mysql_servers;
| hostgroup_id | hostname     | status | weight |
| 10           | DB1          | ONLINE | 1      |
| 10           | DB2          | ONLINE | 1      |
| 10           | DB3          | ONLINE | 1      |

Together with the following Galera hostgroups definition:

Admin> select * from mysql_galera_hostgroups\G
*************************** 1. row ***************************
       writer_hostgroup: 10
backup_writer_hostgroup: 20
       reader_hostgroup: 30
      offline_hostgroup: 9999
                 active: 1
            max_writers: 1
  writer_is_also_reader: 2
max_transactions_behind: 20

Considering all hosts are up and running, ProxySQL will most likely group the hosts as below:

Let's look at them one by one:

Configuration Description
  • Groups the hosts into 2 hostgroups (writer and backup_writer).
  • Writer is part of the backup_writer.
  • Since the writer is not a reader, nothing in hostgroup 30 (reader) because none of the hosts are set with read_only=1. It is not a common practice in Galera to enable the read-only flag.
  • Groups the hosts into 3 hostgroups (writer, backup_writer and reader).
  • Variable read_only=0 in Galera has no affect thus writer is also in hostgroup 30 (reader)
  • Writer is not part of backup_writer.
  • Similar with writer_is_also_reader=1 however, writer is part of backup_writer.

With this configuration, one can have various choices for hostgroup destination to cater for specific workloads. "Hotspot" writes can be configured to go to only one server to reduce multi-master conflicts, non-conflicting writes can be distributed equally on the other masters, most reads can be distributed evenly on all MySQL servers or non-writers, critical reads can be forwarded to the most up-to-date servers and analytical reads can be forwarded to a slave replica.

ProxySQL Deployment for Galera Cluster

In this example, suppose we already have a three-node Galera Cluster deployed by ClusterControl as shown in the following diagram:

Our Wordpress applications are running on Docker while the Wordpress database is hosted on our Galera Cluster running on bare-metal servers. We decided to run a ProxySQL container alongside our Wordpress containers to have a better control on Wordpress database query routing and fully utilize our database cluster infrastructure. Since the read-write ratio is around 80%-20%, we want to configure ProxySQL to:

  • Forward all writes to one Galera node (less conflict, focus on write)
  • Balance all reads to the other two Galera nodes (better distribution for the majority of the workload)

Firstly, create a ProxySQL configuration file inside the Docker host so we can map it into our container:

$ mkdir /root/proxysql-docker
$ vim /root/proxysql-docker/proxysql.cnf

Then, copy the following lines (we will explain the configuration lines further down):




mysql_galera_hostgroups =

mysql_servers =
    { address="db1.cluster.local" , port=3306 , hostgroup=10, max_connections=100 },
    { address="db2.cluster.local" , port=3306 , hostgroup=10, max_connections=100 },
    { address="db3.cluster.local" , port=3306 , hostgroup=10, max_connections=100 }

mysql_query_rules =
        match_pattern="^SELECT .* FOR UPDATE"
        match_pattern="^SELECT .*"

mysql_users =
    { username = "wordpress", password = "passw0rd", default_hostgroup = 10, transaction_persistent = 0, active = 1 },
    { username = "sbtest", password = "passw0rd", default_hostgroup = 10, transaction_persistent = 0, active = 1 }

Now, let's pay a visit to some of the most configuration sections. Firstly, we define the Galera hostgroups configuration as below:

mysql_galera_hostgroups =

Hostgroup 10 will be the writer_hostgroup, hostgroup 20 for backup_writer and hostgroup 30 for reader. We set max_writers to 1 so we can have a single-writer hostgroup for hostgroup 10 where all writes should be sent to. Then, we define writer_is_also_reader to 1 which will make all Galera nodes as reader as well, suitable for queries that can be equally distributed to all nodes. Hostgroup 9999 is reserved for offline_hostgroup if ProxySQL detects unoperational Galera nodes.

Then, we configure our MySQL servers with default to hostgroup 10:

mysql_servers =
    { address="db1.cluster.local" , port=3306 , hostgroup=10, max_connections=100 },
    { address="db2.cluster.local" , port=3306 , hostgroup=10, max_connections=100 },
    { address="db3.cluster.local" , port=3306 , hostgroup=10, max_connections=100 }

With the above configurations, ProxySQL will "see" our hostgroups as below:

Then, we define the query routing through query rules. Based on our requirement, all reads should be sent to all Galera nodes except the writer (hostgroup 20) and everything else is forwarded to hostgroup 10 for single writer:

mysql_query_rules =
        match_pattern="^SELECT .* FOR UPDATE"
        match_pattern="^SELECT .*"

Finally, we define the MySQL users that will be passed through ProxySQL:

mysql_users =
    { username = "wordpress", password = "passw0rd", default_hostgroup = 10, transaction_persistent = 0, active = 1 },
    { username = "sbtest", password = "passw0rd", default_hostgroup = 10, transaction_persistent = 0, active = 1 }

We set transaction_persistent to 0 so all connections coming from these users will respect the query rules for reads and writes routing. Otherwise, the connections would end up hitting one hostgroup which defeats the purpose of load balancing. Do not forget to create those users first on all MySQL servers. For ClusterControl user, you may use Manage -> Schemas and Users feature to create those users.

We are now ready to start our container. We are going to map the ProxySQL configuration file as bind mount when starting up the ProxySQL container. Thus, the run command will be:

$ docker run -d \
--name proxysql2 \
--hostname proxysql2 \
--publish 6033:6033 \
--publish 6032:6032 \
--publish 6080:6080 \
--restart=unless-stopped \
-v /root/proxysql/proxysql.cnf:/etc/proxysql.cnf \

Finally, change the Wordpress database pointing to ProxySQL container port 6033, for instance:

$ docker run -d \
--name wordpress \
--publish 80:80 \
--restart=unless-stopped \
-e WORDPRESS_DB_HOST=proxysql2:6033 \
-e WORDPRESS_DB_USER=wordpress \
-e WORDPRESS_DB_HOST=passw0rd \

At this point, our architecture is looking something like this:

If you want ProxySQL container to be persistent, map /var/lib/proxysql/ to a Docker volume or bind mount, for example:

$ docker run -d \
--name proxysql2 \
--hostname proxysql2 \
--publish 6033:6033 \
--publish 6032:6032 \
--publish 6080:6080 \
--restart=unless-stopped \
-v /root/proxysql/proxysql.cnf:/etc/proxysql.cnf \
-v proxysql-volume:/var/lib/proxysql \

Keep in mind that running with persistent storage like the above will make our /root/proxysql/proxysql.cnf obsolete on the second restart. This is due to ProxySQL multi-layer configuration whereby if /var/lib/proxysql/proxysql.db exists, ProxySQL will skip loading options from configuration file and load whatever is in the SQLite database instead (unless you start proxysql service with --initial flag). Having said that, the next ProxySQL configuration management has to be performed via ProxySQL admin console on port 6032, instead of using configuration file.


ProxySQL process log by default logging to syslog and you can view them by using standard docker command:

$ docker ps
$ docker logs proxysql2

To verify the current hostgroup, query the runtime_mysql_servers table:

$ docker exec -it proxysql2 mysql -uadmin -padmin -h127.0.0.1 -P6032 --prompt='Admin> '
Admin> select hostgroup_id,hostname,status from runtime_mysql_servers;
| hostgroup_id | hostname     | status |
| 10           | | ONLINE |
| 30           | | ONLINE |
| 30           | | ONLINE |
| 30           | | ONLINE |
| 20           | | ONLINE |
| 20           | | ONLINE |

If the selected writer goes down, it will be transferred to the offline_hostgroup (HID 9999):

Admin> select hostgroup_id,hostname,status from runtime_mysql_servers;
| hostgroup_id | hostname     | status |
| 10           | | ONLINE |
| 9999         | | ONLINE |
| 30           | | ONLINE |
| 30           | | ONLINE |
| 20           | | ONLINE |

The above topology changes can be illustrated in the following diagram:

We have also enabled the web stats UI with admin-web_enabled=true.To access the web UI, simply go to the Docker host in port 6080, for example: and you will be prompted with username/password pop up. Enter the credentials as defined under admin-stats_credentials and you should see the following page:

By monitoring MySQL connection pool table, we can get connection distribution overview for all hostgroups:

Admin> select hostgroup, srv_host, status, ConnUsed, MaxConnUsed, Queries from stats.stats_mysql_connection_pool order by srv_host;
| hostgroup | srv_host     | status | ConnUsed | MaxConnUsed | Queries |
| 20        | | ONLINE | 5        | 24          | 11458   |
| 30        | | ONLINE | 0        | 0           | 0       |
| 20        | | ONLINE | 2        | 24          | 11485   |
| 30        | | ONLINE | 0        | 0           | 0       |
| 10        | | ONLINE | 32       | 32          | 9746    |
| 30        | | ONLINE | 0        | 0           | 0       |

The output above shows that hostgroup 30 does not process anything because our query rules do not have this hostgroup configured as destination hostgroup.

The statistics related to the Galera nodes can be viewed in the mysql_server_galera_log table:

Admin>  select * from mysql_server_galera_log order by time_start_us desc limit 3\G
*************************** 1. row ***************************
                           port: 3306
                  time_start_us: 1552992553332489
                success_time_us: 2045
              primary_partition: YES
                      read_only: NO
         wsrep_local_recv_queue: 0
              wsrep_local_state: 4
                   wsrep_desync: NO
           wsrep_reject_queries: NO
wsrep_sst_donor_rejects_queries: NO
                          error: NULL
*************************** 2. row ***************************
                           port: 3306
                  time_start_us: 1552992553329653
                success_time_us: 2799
              primary_partition: YES
                      read_only: NO
         wsrep_local_recv_queue: 0
              wsrep_local_state: 4
                   wsrep_desync: NO
           wsrep_reject_queries: NO
wsrep_sst_donor_rejects_queries: NO
                          error: NULL
*************************** 3. row ***************************
                           port: 3306
                  time_start_us: 1552992553329013
                success_time_us: 2715
              primary_partition: YES
                      read_only: NO
         wsrep_local_recv_queue: 0
              wsrep_local_state: 4
                   wsrep_desync: NO
           wsrep_reject_queries: NO
wsrep_sst_donor_rejects_queries: NO
                          error: NULL

The resultset returns the related MySQL variable/status state for every Galera node for a particular timestamp. In this configuration, we configured the Galera health check to run every 2 seconds (monitor_galera_healthcheck_interval=2000). Hence, the maximum failover time would be around 2 seconds if a topology change happens to the cluster.


by ashraf at March 20, 2019 01:03 PM

March 19, 2019

Peter Zaitsev

How To Test and Deploy Kubernetes Operator for MySQL(PXC) in OSX/macOS?

kubernetes on mac osx

kubernetes on mac osxIn this blog post, I’m going to show you how to test Kubernetes locally on OSX/macOS. Testing Kubernetes without having access to a cloud operator in a local lab is not as easy as it sounds. I’d like to share some of my experiences in this adventure. For those who have already experienced in Virtualbox & Vagrant combination, I can tell you that it doesn’t work. Since Kubernetes will require virtualization, setting another virtual environment within another VirtualBox has several issues. After trying to bring up a cluster for a day or two, I gave up my traditional lab and figured out that Kubernetes has an alternate solution called minikube.


If your OSX/macOS doesn’t have brew I strongly recommend installing it. My OSX/macOS version at the time of this post was macOS 10.14.3 (18D109).

$ brew update && brew install kubectl && brew cask install docker minikube virtualbox

Once minikube is installed, we’ll need to start the virtual environment that is required to run our operator.

I’m starting my minikube environment with 4Gb memory since our Percona Xtradb(PXC) Cluster will have 3 MySQL nodes + 1 ProxySQL pod.

$ minikube start --memory 4096
😄  minikube v0.35.0 on darwin (amd64)
🔥  Creating virtualbox VM (CPUs=2, Memory=4096MB, Disk=20000MB) ...
📶  "minikube" IP address is
🐳  Configuring Docker as the container runtime ...
✨  Preparing Kubernetes environment ...
🚜  Pulling images required by Kubernetes v1.13.4 ...
🚀  Launching Kubernetes v1.13.4 using kubeadm ...
⌛  Waiting for pods: apiserver proxy etcd scheduler controller addon-manager dns
🔑  Configuring cluster permissions ...
🤔  Verifying component health .....
💗  kubectl is now configured to use "minikube"
🏄  Done! Thank you for using minikube!

We’re now ready to install Install Percona XtraDB Cluster on Kubernetes.


Clone and download Kubernetes Operator for MySQL.

$ git clone -b release-0.2.0
Cloning into 'percona-xtradb-cluster-operator'...
remote: Enumerating objects: 191, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (114/114), done.
remote: Total 10321 (delta 73), reused 138 (delta 67), pack-reused 10130
Receiving objects: 100% (10321/10321), 17.04 MiB | 3.03 MiB/s, done.
Resolving deltas: 100% (3526/3526), done.
Checking out files: 100% (5159/5159), done.
$ cd percona-xtradb-cluster-operator

Here we have to make the following modifications for this operator to work on OSX/macOS.

  1. Reduce memory allocation for each pod.
  2. Reduce CPU usage for each pod.
  3. Change the topology type (because we want to run all PXC instances on one node).

$ sed -i.bak 's/1G/500m/g' deploy/cr.yaml
$ grep "memory" deploy/cr.yaml
        memory: 500m
      #   memory: 500m
        memory: 500m
      #   memory: 500m
$ sed -i.bak 's/600m/200m/g' deploy/cr.yaml
$ grep "cpu" deploy/cr.yaml
        cpu: 200m
      #   cpu: "1"
        cpu: 200m
      #   cpu: 700m
$ grep "topology" deploy/cr.yaml
      topologyKey: ""
    #   topologyKey: ""
$ sed -i.bak 's/kubernetes\.io\/hostname/none/g' deploy/cr.yaml
$ grep "topology" deploy/cr.yaml
      topologyKey: "none"
    #   topologyKey: ""

We’re now ready to deploy our PXC via the operator.

$ kubectl apply -f deploy/crd.yaml created created
$ kubectl create namespace pxc
namespace/pxc created
$ kubectl config set-context $(kubectl config current-context) --namespace=pxc
Context "minikube" modified.
$ kubectl apply -f deploy/rbac.yaml created created
$ kubectl apply -f deploy/operator.yaml
deployment.apps/percona-xtradb-cluster-operator created
$ kubectl apply -f deploy/secrets.yaml
secret/my-cluster-secrets created
$ kubectl apply -f deploy/configmap.yaml
configmap/pxc created
$ kubectl apply -f deploy/cr.yaml created

Here we’re ready to monitor the progress of our deployment.

$ kubectl get pods
NAME                                               READY   STATUS              RESTARTS   AGE
cluster1-pxc-node-0                                0/1     ContainerCreating   0          86s
cluster1-pxc-proxysql-0                            1/1     Running             0          86s
percona-xtradb-cluster-operator-5857dfcb6c-g7bbg   1/1     Running             0          109s

If any of the nodes is having difficulty passing any STATUS to Running state

$ kubectl describe pod cluster1-pxc-node-0
Name:               cluster1-pxc-node-0
Namespace:          pxc
Priority:           0
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  3m47s (x14 over 3m51s)  default-scheduler  pod has unbound immediate PersistentVolumeClaims
  Normal   Scheduled         3m47s                   default-scheduler  Successfully assigned pxc/cluster1-pxc-node-0 to minikube
  Normal   Pulling           3m45s                   kubelet, minikube  pulling image "perconalab/pxc-openshift:0.2.0"
  Normal   Pulled            118s                    kubelet, minikube  Successfully pulled image "perconalab/pxc-openshift:0.2.0"
  Normal   Created           117s                    kubelet, minikube  Created container
  Normal   Started           117s                    kubelet, minikube  Started container
  Warning  Unhealthy         89s                     kubelet, minikube  Readiness probe failed:
At this stage we’re ready to verify our cluster as soon as we see following output (READY 1/1):
$ kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-pxc-node-0                                1/1     Running   0          7m38s
cluster1-pxc-node-1                                1/1     Running   0          4m46s
cluster1-pxc-node-2                                1/1     Running   0          2m25s
cluster1-pxc-proxysql-0                            1/1     Running   0          7m38s
percona-xtradb-cluster-operator-5857dfcb6c-g7bbg   1/1     Running   0          8m1s

In order to connect to this cluster, we’ll need to deploy a client shell access.

$ kubectl run -i --rm --tty percona-client --image=percona:5.7 --restart=Never -- bash -il
If you don't see a command prompt, try pressing enter.
bash-4.2$ mysql -h cluster1-pxc-proxysql -uroot -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3617
Server version: 5.5.30 (ProxySQL)
Copyright (c) 2009-2019 Percona LLC and/or its affiliates
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> \s
mysql  Ver 14.14 Distrib 5.7.25-28, for Linux (x86_64) using  6.2
Connection id:		3617
Current database:	information_schema
Current user:		root@cluster1-pxc-proxysql-0.cluster1-pxc-proxysql.pxc.svc.cluste
SSL:			Not in use
Current pager:		stdout
Using outfile:		''
Using delimiter:	;
Server version:		5.5.30 (ProxySQL)
Protocol version:	10
Connection:		cluster1-pxc-proxysql via TCP/IP
Server characterset:	latin1
Db     characterset:	utf8
Client characterset:	latin1
Conn.  characterset:	latin1
TCP port:		3306
Uptime:			14 min 1 sec
Threads: 1  Questions: 3  Slow queries: 0

A few things to remember:

  • Secrets for this setup are under deploy/secrets.yaml, you can decode via

$ echo -n '{secret}' |base64 -D

  • To reconnect shell

$ kubectl run -i --tty percona-client --image=percona:5.7 -- sh

  • To redeploy the pod delete first and repeat above steps without configuration changes

$ kubectl delete -f deploy/cr.yaml

  • To stop and delete  minikube virtual environment

$ minikube stop

$ minikube delete



Photo by frank mckenna on Unsplash

by Alkin Tezuysal at March 19, 2019 05:12 PM

Upcoming Webinar Thurs 3/21: MySQL Performance Schema in 1 hour

MySQL Performance Schema in 1 hour

MySQL Performance Schema in 1 hourPlease join Percona’s Principal Support Engineer, Sveta Smirnova, as she presents MySQL Performance Schema in 1 hour on Thursday, March 21st, 2019, at 10:00 am PDT (UTC-7) / 1:00 pm EDT (UTC-4).

Register Now

MySQL 8.0 Performance Schema is a mature tool, used by humans and monitoring products. It was born in 2010 as “a feature for monitoring server execution at a low level.” The tool has grown over the years with performance fixes and DBA-faced features. In this webinar, I will give an overview of Performance Schema, focusing on its tuning, performance, and usability.

Performance Schema helps to troubleshoot query performance, complicated locking issues and memory leaks. It can also troubleshoot resource usage, problematic behavior caused by inappropriate settings and much more. Additionally, it comes with hundreds of options which allow for greater precision tuning.

Performance Schema is a potent and very complicated tool. What’s more, it does not affect performance in most cases. However, it collects a lot of data and sometimes this data is hard to read.

In this webinar, I will guide you through the main Performance Schema features, design, and configuration. You will learn how to get the best of it. I will cover its companion sys schema and graphical monitoring tools.

In order to learn more, register for MySQL Performance Schema in 1 hour today.

by Sveta Smirnova at March 19, 2019 04:19 PM

March 18, 2019

Peter Zaitsev

PostgreSQL Upgrade Using pg_dumpall

migrating PostgreSQL using pg_dumpall

PostgreSQL logoThere are several approaches to assess when you need to upgrade PostgreSQL. In this blog post, we look at the option for upgrading a postgres database using pg_dumpall. As this tool can also be used to back up PostgreSQL clusters, then it is a valid option for upgrading a cluster too. We consider the advantages and disadvantages of this approach, and show you the steps needed to achieve the upgrade.

This is the first of our Upgrading or Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions series where we’ll be exploring different paths to accomplish postgres upgrade or migration. The series will culminate with a practical webinar to be aired April 17th (you can register here).

We begin this journey by providing you the most straightforward way to carry on with a PostgreSQL upgrade or migration: by rebuilding the entire database from a logical backup.

Defining the scope

Let’s define what we mean by upgrading or migrating PostgreSQL using pg_dumpall.

If you need to perform a PostgreSQL upgrade within the same database server, we’d call that an in-place upgrade or just an upgrade. Whereas a procedure that involves migrating your PostgreSQL server from one server to another server, combined with an upgrade from an older version (let’s say 9.3) to a newer version PostgreSQL (say PG 11.2), can be considered a migration.

There are two ways to achieve this requirement using logical backups :

  1. Using pg_dumpall
  2. Using pg_dumpall + pg_dump + pg_restore

We’ll be discussing the first option (pg_dumpall) here, and will leave the discussion of the second option for our next post.


pg_dumpall can be used to obtain a text-format dump of the whole database cluster, and which includes all databases in the cluster. This is the only method that can be used to backup globals such as users and roles in PostgreSQL.

There are, of course, advantages and disadvantages in employing this approach to upgrading PostgreSQL by rebuilding the database cluster using pg_dumpall.

Advantages of using pg_dumpall for upgrading a PostgreSQL server :

  1. Works well for a tiny database cluster.
  2. Upgrade can be completed using just a few commands.
  3. Removes bloat from all the tables and shrinks the tables to their absolute sizes.

Disadvantages of using pg_dumpall for upgrading a PostgreSQL server :

  1. Not the best option for databases that are huge in size as it might involve more downtime. (Several GB’s or TB’s).
  2. Cannot use parallel mode. Backup/restore can use just one process.
  3. Requires double the space on disk as it involves temporarily creating a copy of the database cluster for an in-place upgrade.

Let’s look at the steps involved in performing an upgrade using pg_dumpall:

  1. Install new PostgreSQL binaries in the target server (which could be the same one as the source database server if it is an in-place upgrade).

    -- For a RedHat family OS
    # yum install postgresql11*
    -- In an Ubuntu/Debian OS
    # apt install postgresql11
  2. Shutdown all the writes to the database server to avoid data loss/mismatch between the old and new version after upgrade.
  3. If you are doing an upgrade within the same server, create a cluster using the new binaries on a new data directory and start it using a port other than the source. For example, if the older version PostgreSQL is running on port 5432, start the new cluster on port 5433. If you are upgrading and migrating the database to a different server, create a new cluster using new binaries on the target server – the cluster may not need to run on a different port other than the default, unless that’s your preference.

    $ /usr/pgsql-11/bin/initdb -D new_data_directory
    $ cd new_data_directory
    $ echo “port = 5433” >>
    $ /usr/pgsql-11/bin/pg_ctl -D new_data_directory start
  4. You might have a few extensions installed in the old version PostgreSQL cluster. Get the list of all the extensions created in the source database server and install them for the new versions. You can exclude those you get with the contrib module by default. To see the list of extensions created and installed in your database server, you can run the following command.

    $ psql -d dbname -c "\dx"

    Please make sure to check all the databases in the cluster as the extensions you see in one database may not match the list of those created in another database.
  5. Prepare a postgresql.conf file for the new cluster. Carefully prepare this by looking at the existing configuration file of the older version postgres server.
  6. Use pg_dumpall to take a cluster backup and restore it to the new cluster.

    -- Command to dump the whole cluster to a file.
    $ /usr/pgsql-11/bin/pg_dumpall > /tmp/dumpall.sql
    -- Command to restore the dump file to the new cluster (assuming it is running on port 5433 of the same server).
    $ /usr/pgsql-11/bin/psql -p 5433 -f /tmp/dumpall.sql

    Note that i have used the new pg_dumpall from the new binaries to take a backup.
    Another, easier, way is to use PIPE to avoid the time involved in creating a dump file. Just add a hostname if you are performing an upgrade and migration.

    $ pg_dumpall -p 5432 | psql -p 5433
    $ pg_dumpall -p 5432 -h source_server | psql -p 5433 -h target_server
  7. Run ANALYZE to update statistics of each database on the new server.
  8. Restart the database server using the same port as the source.

Our next post in this series provides a similar way of upgrading your PostgreSQL server while at the same time providing some flexibility to carry on with changes like the ones described above. Stay tuned!

Image based on photo by Sergio Ortega on Unsplash

by Avinash Vallarapu at March 18, 2019 02:59 PM

March 15, 2019

Peter Zaitsev

Percona Server for MySQL 8.0.15-5 Is Now Available

Percona Server for MySQL 8.0

Percona Server for MySQL 5.6

Percona announces the release of Percona Server for MySQL 8.0.15-5 on March 15, 2019 (downloads are available here and from the Percona Software Repositories).

This release includes fixes to bugs found in previous releases of Percona Server for MySQL 8.0.

Incompatible changes

In previous releases, the audit log used to produce time stamps inconsistent with the ISO 8601 standard. Release 8.0.15-5 of Percona Server for MySQL solves this problem. This change, however, may break programs that rely on the old time stamp format.

Starting from release 8.0.15-5, Percona Server for MySQL uses the upstream implementation of binary log encryption. The variable encrypt_binlog is removed and the related command line option --encrypt_binlog is not supported. It is important that you remove the encrypt_binlog variable from your configuration file before you attempt to upgrade either from another release in the Percona Server for MySQL 8.0 series or from Percona Server for MySQL 5.7. Otherwise, a server boot error will be produced reporting an unknown variable. The implemented binary log encryption is compatible with the old format: the binary log encrypted in a previous version of MySQL 8.0 series or Percona Server for MySQL are supported.

See MySQL documentation for more information: Encrypting Binary Log Files and Relay Log Files and binlog_encryption variable.

This release is based on MySQL 8.0.14 and MySQL 8.0.15. It includes all bug fixes in these releases. Percona Server for MySQL 8.0.14 was skipped.

Percona Server for MySQL 8.0.15-5 is now the current GA release in the 8.0 series. All of Percona’s software is open-source and free.

Percona Server for MySQL 8.0 includes all the features available in MySQL 8.0 Community Edition in addition to enterprise-grade features developed by Percona. For a list of highlighted features from both MySQL 8.0 and Percona Server for MySQL 8.0, please see the GA release announcement.


If you are upgrading from 5.7 to 8.0, please ensure that you read the upgrade guide and the document Changed in Percona Server for MySQL 8.0.

Bugs Fixed

  • The audit log produced time stamps inconsistent with the ISO 8601 standard. Bug fixed PS-226.
  • FLUSH commands written to the binary log could cause errors in case of replication. Bug fixed PS-1827 (upstream #88720).
  • When audit_plugin was enabled, the server could use a lot of memory when handling large queries. Bug fixed PS-5395.
  • The page cleaner could sleep for long time when the system clock was adjusted to an earlier point in time. Bug fixed PS-5221 (upstream #93708).
  • In some cases, the MyRocks storage engine could crash without triggering the crash recovery. Bug fixed PS-5366.
  • In some cases, when it failed to read from a file, InnoDB did not inform the name of the file in the related error message. Bug fixed PS-2455 (upstream #76020).
  • The ACCESS_DENIED field of the information_schema.user_statistics table was not updated correctly. Bugs fixed PS-3956 and PS-4996.
  • MyRocks could crash while running START TRANSACTION WITH CONSISTENT SNAPSHOT if other transactions were in specific states. Bug fixed PS-4705.
  • In some cases, the server using the the MyRocks storage engine could crash when TTL (Time to Live) was defined on a table. Bug fixed PS-4911.
  • MyRocks incorrectly processed transactions in which multiple statements had to be rolled back. Bug fixed PS-5219.
  • A stack buffer overrun could happen if the redo log encryption with key rotation was enabled. Bug fixed PS-5305.
  • The TokuDB storage engine would assert on load when used with jemalloc 5.x. Bug fixed PS-5406.

Other bugs fixed: PS-4106PS-4107PS-4108PS-4121PS-4474PS-4640PS-5055PS-5218PS-5263PS-5328PS-5369.

Find the release notes for Percona Server for MySQL 8.0.15-5 in our online documentation. Report bugs in the Jira bug tracker.

by Borys Belinsky at March 15, 2019 06:31 PM

Percona Server for MongoDB 3.6.11-3.1 Is Now Available

Percona Server for MongoDB

Percona Server for MongoDB

Percona announces the release of Percona Server for MongoDB 3.6.11-3.1 on March 15, 2019. Download the latest version from the Percona website or the Percona software repositories.

Percona Server for MongoDB is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 3.6 Community Edition. It supports MongoDB 3.6 protocols and drivers.

Percona Server for MongoDB extends Community Edition functionality by including the Percona Memory Engine storage engine, as well as several enterprise-grade features. Also, it includes MongoRocks storage engine, which is now deprecated. Percona Server for MongoDB requires no changes to MongoDB applications or code.

Release 3.6.11-3.1 extends the buildInfo command with the psmdbVersion key to report the version of Percona Server for MongoDB. If this key exists then Percona Server for MongoDB is installed on the server. This key not available from MongoDB.


  • PSMDB-216: The database command buildInfo provides the psmdbVersion key to report the version of Percona Server for MongoDB. If this key exists then Percona Server for MongoDB is installed on the server. This key is not available from MongoDB.

The Percona Server for MongoDB 3.6.11-3.1 release notes are available in the official documentation.

by Borys Belinsky at March 15, 2019 05:43 PM

Oli Sennhauser

Uptime of a MariaDB Galera Cluster

A while ago somebody on Google Groups asked for the Uptime of a Galera Cluster. The answer is easy... Wait, no! Not so easy... The uptime of a Galera Node is easy (or not?). But Uptime of the whole Galera Cluster?

My answer then was: "Grep the error log." My answer now is still: "Grep the error log." But slightly different:

$ grep 'view(view_id' *
2019-03-07 16:10:26 [Note] WSREP: view(view_id(PRIM,0e0a2851,1) memb {
2019-03-07 16:14:37 [Note] WSREP: view(view_id(PRIM,0e0a2851,2) memb {
2019-03-07 16:16:23 [Note] WSREP: view(view_id(PRIM,0e0a2851,3) memb {
2019-03-07 16:55:56 [Note] WSREP: view(view_id(NON_PRIM,0e0a2851,3) memb {
2019-03-07 16:56:04 [Note] WSREP: view(view_id(PRIM,6d80bb1a,5) memb {
2019-03-07 17:00:28 [Note] WSREP: view(view_id(NON_PRIM,6d80bb1a,5) memb {
2019-03-07 17:01:11 [Note] WSREP: view(view_id(PRIM,24f67954,7) memb {
2019-03-07 17:18:58 [Note] WSREP: view(view_id(NON_PRIM,24f67954,7) memb {
2019-03-07 17:19:31 [Note] WSREP: view(view_id(PRIM,a380c8cb,9) memb {
2019-03-07 17:20:27 [Note] WSREP: view(view_id(PRIM,a380c8cb,11) memb {
2019-03-08  7:58:38 [Note] WSREP: view(view_id(PRIM,753a350f,15) memb {
2019-03-08 11:31:38 [Note] WSREP: view(view_id(NON_PRIM,753a350f,15) memb {
2019-03-08 11:31:43 [Note] WSREP: view(view_id(PRIM,489e3c67,17) memb {
2019-03-08 11:31:58 [Note] WSREP: view(view_id(PRIM,489e3c67,18) memb {
2019-03-22  7:05:53 [Note] WSREP: view(view_id(NON_PRIM,49dc20da,49) memb {
2019-03-22  7:05:53 [Note] WSREP: view(view_id(PRIM,49dc20da,50) memb {
2019-03-26 12:14:05 [Note] WSREP: view(view_id(NON_PRIM,49dc20da,50) memb {
2019-03-27  7:33:25 [Note] WSREP: view(view_id(NON_PRIM,22ae25aa,1) memb {

So this Cluster had an Uptime of about 18 days and 20 hours. Why can I seed this? Simple: In the brackets there is a number at the very right. This number seems to be the same as wsrep_cluster_conf_id which is reset by a full Galera Cluster shutdown.

So far so good. But, wait, what is the definition of Uptime? Hmmm, not so helpful, how should I interpret this for a 3-Node Galera Cluster?

I would say a good definition for Uptime of a Galera Cluster would be: "At least one Galera Node must be available for the application for reading and writing." That means PRIM in the output above. And we still cannot say from the output above if there was at least on Galera Node available (reading and writing) at any time. For this we have to compare ALL 3 MariaDB Error Logs... So it does not help, we need a good Monitoring solution to answer this question...

PS: Who has found the little fake in this blog?

Taxonomy upgrade extras: 

by Shinguz at March 15, 2019 04:58 PM

Linux system calls of MySQL process

We had the problem today that a MySQL Galera Cluster node with the multi-tenancy pattern caused a lot of system time (sy 75%, load average about 30 (you really must read this article by Brendan Gregg, it is worth it!)) so we wanted to find what system calls are being used to see what could cause this issue (to verify if it is a TOC or a TDC problem:

$ sudo strace -c -p $(pidof -s mysqld) -f -e trace=all
Process 5171 attached with 41 threads
Process 16697 attached
Process 5171 detached
Process 5333 detached
Process 16697 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 66.85    1.349700         746      1810           io_getevents
 25.91    0.523055        1298       403       197 futex
  4.45    0.089773        1069        84        22 read
  2.58    0.052000       13000         4         3 restart_syscall
  0.19    0.003802        1901         2           select
  0.01    0.000235           3        69         1 setsockopt
  0.01    0.000210          18        12           getdents
  0.00    0.000078           2        32           write
  0.00    0.000056           1        49           fcntl
  0.00    0.000026           4         6           openat
  0.00    0.000012           2         6           close
  0.00    0.000000           0         2         2 open
  0.00    0.000000           0        22           stat
  0.00    0.000000           0         2           mmap
  0.00    0.000000           0         7           mprotect
  0.00    0.000000           0        16           pread
  0.00    0.000000           0         1           access
  0.00    0.000000           0         1           sched_yield
  0.00    0.000000           0         5           madvise
  0.00    0.000000           0         1           accept
  0.00    0.000000           0         1           getsockname
  0.00    0.000000           0         1           clone
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    2.018947                  2537       225 total

$ man io_getevents

See also: Configuration of MySQL for Shared Hosting.

by Shinguz at March 15, 2019 04:06 PM

Peter Zaitsev

MySQL Ripple: The First Impression of a MySQL Binlog Server

MySQL Ripple

MySQL RippleJust about a month ago, Pavel Ivanov released Ripple under the Apache-2.0 license. Ripple is a MySQL binlog server: software which receives binary logs from MySQL or MariaDB servers and delivers them to another MySQL or MariaDB server. Practically ,this is an intermediary master which does not store any data, except the binary logs themselves, and does not apply events. This solution allows saving of a lot of resources on the server, which acts only as a middle-man between the master and its actual slave(s).

The intermediary server, keeping binary logs only and not doing any other job, is a prevalent use case which allows us to remove IO (binlog read) and network (binlog retrieval via network) load from the actual master and free its resources for updates. The intermediary master, which does not do any work, distributes binary logs to slaves connected to it. This way you can have an increased number of slaves, attached to such a server, without affecting the application, running updates.

Currently, users exploit the Blackhole storage engine to emulate similar behavior. But Blackhole is just a workaround: it still executes all the events in the binary logs, requires valid MySQL installation, and has a lot of issues. Such a pain!

Therefore a new product which can do the same job and is released with an open source license is something worth trying.

A simple test

For this blog, I did a simple test. First, I installed it as described in the README file. Instructions are pretty straightforward, and I successfully built the server on my Ubuntu 18.04.2 LTS laptop. Guidelines suggest to install

, and I replaced
which I had already on my machine. Probably this was not needed, but since the tool claims to support both MySQL and MariaDB binary log formats, I preferred to install the MariaDB client.

There is no manual of usage instructions. However, the tool supports

  command, and it is, again, straightforward.

The server can be started with options:

$./bazel-bin/rippled -ripple_datadir=./data -ripple_master_address= -ripple_master_port=13001 -ripple_master_user=root -ripple_server_ports=15000


  • -ripple-datadir
     : datadir where Ripple stores binary logs
  • -ripple_master_address
     : master host
  • -ripple_master_port
     : master port
  • -ripple_master_user
     : replication user
  • -ripple_server_ports
     : comma-separated ports which Ripple will listen

I did not find an option for securing binary log retrieval. The slave can connect to the Ripple server with any credentials. Have this in mind when deploying Ripple in production.

Now, let’s run a simple test. I have two servers. Both running on localhost, one with port 13001 (master) and another one on port 13002 (slave). The command line which I used to start

 , points to the master. Binary logs are stored in the data directory:

$ ls -l data/
total 14920
-rw-rw-r-- 1 sveta sveta 15251024 Mar 6 01:43 binlog.000000
-rw-rw-r-- 1 sveta sveta 71 Mar 6 00:50 binlog.index

I pointed the slave to the Ripple server with the command

mysql> change master to master_host='',master_port=15000, master_user='ripple';
Query OK, 0 rows affected, 1 warning (0.02 sec)

Then started the slave.

On the master, I created the database

  and ran sysbench
test for a single table. After some time, I stopped the load and checked the content of the table on master and slave:

master> select count(*) from sbtest1;
| count(*) |
| 10000 |
1 row in set (0.08 sec)
master> checksum table sbtest1;
| Table | Checksum |
| sbtest.sbtest1 | 4162333567 |
1 row in set (0.11 sec)
slave> select count(*) from sbtest1;
| count(*) |
| 10000 |
1 row in set (0.40 sec)
slave> checksum table sbtest1;
| Table | Checksum |
| sbtest.sbtest1 | 1797645970 |
1 row in set (0.13 sec)
slave> checksum table sbtest1;
| Table | Checksum |
| sbtest.sbtest1 | 4162333567 |
1 row in set (0.10 sec)

It took some time for the slave to catch up, but everything was applied successfully.

Ripple has nice verbose logging:

$ ./bazel-bin/rippled -ripple_datadir=./data -ripple_master_address= -ripple_master_port=13001 -ripple_master_user=root -ripple_server_ports=15000
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0306 15:57:13.641451 27908] InitPlugins
I0306 15:57:13.642007 27908] Setup
I0306 15:57:13.642937 27908] Starting binlog recovery
I0306 15:57:13.644090 27908] Scanning binlog file: binlog.000000
I0306 15:57:13.872016 27908] Binlog recovery complete
binlog file: binlog.000000, offset: 15251088, gtid: 6ddac507-3f90-11e9-8ee9-00163e000000:0-0-7192
I0306 15:57:13.872050 27908] Recovered binlog
I0306 15:57:13.873811 27908] Listen on host: localhost, port: 15000
I0306 15:57:13.874282 27908] Start
I0306 15:57:13.874511 27910] Master session starting
I0306 15:57:13.882601 27910] connected to host:, port: 13001
I0306 15:57:13.895349 27910] Connected to host:, port: 13001, server_id: 1, server_name:
W0306 15:57:13.898556 27910] master does not support semi sync
I0306 15:57:13.898583 27910] start replicating from '6ddac507-3f90-11e9-8ee9-00163e000000:0-0-7192'
I0306 15:57:13.899031 27910] Master session entering main loop
I0306 15:57:13.899550 27910] Update binlog position to end_pos: binlog.000000:15251152, gtid: 0-0-7192
I0306 15:57:13.899572 27910] Skip writing event [ Previous_gtids len = 67 ]
I0306 15:57:13.899585 27910] Update binlog position to end_pos: binlog.000000:15251152, gtid: 0-0-7192


it may be good to run more tests before using Ripple in production, and to explore its other options, but from a first view it seems to be a very nice and useful product.

Photo by Kishor on Unsplash

by Sveta Smirnova at March 15, 2019 01:16 PM

March 14, 2019

Oli Sennhauser

MariaDB and MySQL Database Consolidation

We see at various customers the request for consolidating their MariaDB and MySQL infrastructure. The advantage of such a measure is clear in the first step: Saving costs! And this requests comes typically from managers. But what we unfortunately see rarely is to question this request from the IT engineering perspective. Because it comes, as anything in life, with some "costs". So, saving costs with consolidation on one side comes with "costs" for operation complexity on the other side.

To give you some arguments for arguing with managers we collected some topics to consider before consolidating:

  • Bigger Database Instances are more demanding in handling than smaller ones:
    • Backup and Restore time takes longer. Copying files around takes longer, etc.
    • Possibly your logical backup with mysqldump does not restore any longer in a reasonable amount of time (Mean Time to Repair/Recover (MTTR) is not met any more). You have to think about some physical backup methods including MariaDB or MySQL Enterprise Backup solutions.
    • Consolidated database instances typically contain many different schemas of various different applications. In case of problems you typically want to restore and possibly recover only one single schema and not all schemas. And this becomes much more complicated (depending on you backup strategy). MariaDB/MySQL tooling is not yet (fully) prepared for this situation (#17365). Possibly your old backup strategy is not adequate any more?
    • When you restore a schema you do not want the application interfering with your restore. How can you properly exclude the one application from your database instance while you are restoring? Locking accounts (possible only with MariaDB 10.4 and MySQL 5.7 and newer). Tricks like --skip-networking, adding Firewall rules, --read-only, database port change (--port=3307), do not work any more (as easy)!
    • In short the costs are: Restore/Recovery Operations becomes more demanding!
  • Do NOT mix schemas of different criticalities into the same database instance! The worst cases we have seen were some development schemas which were on the same high-availability Cluster like highly critical transactional systems. The developers did some nasty things on their development systems (which IMHO is OK for them on a development system). What nobody considered in this case was that the troubles from the development schema brought down the whole production schema which was located on the same machine... Cost: Risk of failure of your important services caused by some non-important services AND planing becomes more expensive and you need to know more about all instances and other instances.
  • This phenomena is also called Noisy Neighbor effect. Noisy Neighbors become a bigger issues with consolidated systems. You have to know much more in detail what you and everybody else is doing on the system! Do you...? Costs are: More know-how is required, better education and training of people, more clever people, better planning, better monitoring, etc.
  • When you consolidate different applications into one system it becomes more critical than the previous ones on their own. So you have to think about High-Availability solutions. Costs are: 1 to 4 new instances (for HA), more complexity, more know-how, more technologies... Do you plan to buy an Enterprise Support subscription?
  • Do NOT mix different maintenances windows (Asia vs. Europe vs. America) or daily online-business and nightly job processing. You get shorter maintenance windows. Costs are: Better planning is needed, costly night and weekend maintenance time, etc...

    China19:00(7 hours ahead of us)
    US east07:00(5 hours behind us)
    US west04:00(8 hours behind us)
  • Resource Fencing becomes more tricky. Within the same instance resource fencing becomes more tricky and is not really doable atm. MySQL 8.0 shows some firsts steps with the Resource Groups but this is pretty complicated and is by far not complete and usable yet. A better way would be to install several instances on the same machine an fence them with some O/S means like Control Groups. This comes at the costs of know-how, complexity and more complicated set-ups.
  • Naming conflicts can happen: Application a) is called `wiki` and application b) is called `wiki` as well and for some reasons you cannot rename them (any more).
  • Monitoring becomes much more demanding and needs to be done more fine grained. You want to know exactly what is going on your system because it can easily have some side effects on many different schemas/applications. Example of today: We were running out of kernel file descriptors (file-max) and we did not recognize it in the beginning.
  • Consolidated things are a much a higher Bulk Risk (this is true also for SAN or Virtualisation Clusters). When you have an outage not only one application is down but the whole company is down. We have seen this already for SAN and Virtualisation Clusters and we expect to see that soon also on highly consolidated Database Clusters. Costs: Damage on the company is bigger for one incident.
  • Different applications have different configuration requirements which possibly conflict with other requirements from other applications (Jira from Atlassian is a good example for this).
    Server variables cannot be adjusted any more according to somebody’s individual wishes...
    • sql_mode: Some old legacy applications still require ONLY_FULL_GROUP_BY) :-(
    • The requirements are conflicting: Performance/fast vs. Safe/durability: innodb_flush_log_at_trx_commit, sync_binlog, crash-safe binary logging, etc.
    • Transaction isolation: transaction_isolation = READ-COMMITTED (old: tx_isolation, Jira again as an example) vs. REPEATABLE-READ (default). Other applications which do not assume, that transaction isolation behaviour changes. And cannot cope with it. Have you ever asked your developers if their application can cope with a different transaction isolation levels? :-) Do they know what you are talking about?
    • Character set (utf8_bin for Jira as example again), which can be changed globally or on a schema level, but it has to be done correctly for all participants.
  • Some applications require MariaDB some application require MySQL. They are not the same databases any more nowadays (8.0 vs. 10.3/10.4). So you cannot consolidate them (easily).
  • You possibly get a mixture of persistent connections (typically Java with connection pooling) and non-persistent connections (typically PHP and other languages). Which causes different database behaviour, which has an impact on how you configure the database instance. Which is more demanding and needs more knowledge of the database AND the application or you solve it with more RAM.
  • You need to know much more about you application to understand what it does and how could it interfere with others...
  • When you consolidate more and more schemas into your consolidated database server you have to adjust your database setting as well from time to time (innodb_buffer_pool_size, table_open_cache, table_definition_cache, O/S File descriptors, etc). And possibly add more RAM, CPU and stronger I/O. When is your network saturated? Have you thought about this already?

This leads us to the result that consolidation let us save some costs on infrastructure but adds additional costs on complexity, skills etc. Theses costs will grow exponentially and thus at some point it is not worth the effort any more. This will end up in not only one big consolidated instance but possibly in a hand full of them.

Where this point is for you you have to find yourself...

Alternatives to consolidating everything into one instance

  • 1 Machine can contain 1 to many Database Instances can contain 1 to many Schemas. Instead of putting all schemas into one machine, think about installing several instances on one machine. This comes at the cost of more complexity. MyEnv will help you to manage this additional complexity.
  • 1 Machine can contain 1 to many Virtual Machines (VMs, kvm, XEN, VMWare, etc.) can contain 1 to many Instance(s) can contain 1 to many Schemas. This comes at the cost of even more complexity and pretty complex technology (Virtualization).

Taxonomy upgrade extras: 

by Shinguz at March 14, 2019 10:05 PM

Jean-Jerome Schmidt

An Introduction to Database High Availability for MySQL & MariaDB

The following is an excerpt from our whitepaper “How to Design Highly Available Open Source Database Environments” which can be downloaded for free.

A Couple of Words on “High Availability”

These days high availability is a must for any serious deployment. Long gone are days when you could schedule a downtime of your database for several hours to perform a maintenance. If your services are not available, you are losing customers and money. Therefore making a database environment highly available has typically one of the highest priorities.

This poses a significant challenge to database administrators. First of all, how do you tell if your environment is highly available or not? How would you measure it? What are the steps you need to take in order to improve availability? How to design your setup to make it highly available from the beginning?

There are many many HA solutions available in the MySQL (and MariaDB) ecosystem, but how do we know which ones we can trust? Some solutions might work under certain specific conditions, but might cause more trouble when applied outside of these conditions. Even a basic functionality like MySQL replication, which can be configured in many ways, can cause significant harm - for instance, circular replication with multiple writeable masters. Although it is easy to set up a ‘multi-master setup’ using replication, it can very easily break and leave us with diverging datasets on different servers. For a database, which is often considered the single source of truth, compromised data integrity can have catastrophic consequences.

In the following chapters, we’ll discuss the requirements for high availability in database
setups, and how to design the system from the ground up.

Measuring High Availability

What is high availability? To be able to decide if a given environment is highly available or not, one has to have some metrics for that. There are numerous ways you can measure high availability, we’ll focus on some of the most basic stuff.

First, though, let’s think what this whole high availability is all about? What is its purpose? It is about making sure your environment serves its purpose. Purpose can be defined in many ways but, typically, it will be about delivering some service. In the database world, typically it’s somewhat related to data. It could be serving data to your internal application. It can be to store data and make it queryable by analytical processes. It can be to store some data for your users, and provide it when requested on demand. Once we are clear about the purpose, we can establish the success factors involved. This will help us define what high availability means in our specific case.


Service Level Agreement (SLA). It is also quite common to define SLA’s for internal services. What is an SLA? It is a definition of the service level you plan to provide to your customers. This is for them to better understand what level of stability you plan for a service they bought or are planning to buy. There are numerous methods you can leverage to prepare a SLA but typical ones are:

  • Availability of the service (percent)
  • Responsiveness of the service - latency (average, max, 95 percentile, 99 percentile)
  • Packet loss over the network (percent)
  • Throughput (average, minimum, 95 percentile, 99 percentile)

It can get more complex than that, though. In a sharded, multi-user environment you can define, let’s say, your SLA as: “Service will be available 99,99% of the time, downtime is declared when more than 2% of the users is affected. No incident can take more than 15 minutes to be resolved”. Such SLA can also be extended to incorporate query response time: “downtime is called if 99 percentile of latency for queries excede 200 milliseconds”.


Availability is typically measured in “nines”, let us look into what exactly a given amount of “nines” guarantees. The table below is taken from Wikipedia:

Availability % Downtime per year Downtime per month Downtime per week Downtime per day
("one nine")
36.5 days 72 hours 16.8 hours 2.4 hours
("one and a half nines")
18.25 days 36 hours 8.4 hours 1.2 hours
97% 10.96 days 21.6 hours 5.04 hours 43.2 min
98% 7.30 days 14.4 hours 3.36 hours 28.8 min
("two nines")
3.65 days 7.20 hours 1.68 hours 14.4 min
("two and a half nines")
1.83 days 3.60 hours 50.4 min 7.2 min
99.8% 17.52 hours 86.23 min 20.16 min 2.88 min
("three nines")
8.76 hours 43.8 min 10.1 min 1.44 min
("three and a half nines")
4.38 hours 21.56 min 5.04 min 43.2 s
("four nines")
52.56 min 4.38 min 1.01 min 8.64 s
("four and a half nines")
26.28 min 2.16 min 30.24 s 4.32 s
("five nines")
5.26 min 25.9 s 6.05 s 864.3 ms
("six nines")
31.5 s 2.59 s 604.8 ms 86.4 ms
("seven nines")
3.15 s 262.97 ms 60.48 ms 8.64 ms
("eight nines")
315.569 ms 26.297 ms 6.048 ms 0.864 ms
("nine nines")
31.5569 ms 2.6297 ms 0.6048 ms 0.0864 ms

As we can see, it escalates quickly. Five nines (99,999% availability) is equivalent to 5.26 minutes of downtime over the course of a year. Availability can also be calculated in different, smaller ranges: per month, per week, per day. Keep in mind those numbers, as they will be useful when we start to discuss the costs associated with maintaining different levels of availability.

Measuring Availability

To tell if there is a downtime or not, one has to have insight into the environment. You need to track the metrics which define the availability of your systems. It is important to keep in mind that you should measure it from a customer’s point of view, taking the broader picture under consideration. It doesn’t matter if your databases are up if, let’s say, due to a network issue, no application cannot reach them. Every single building block of your setup has its impact on availability.

One of the good places where to look for availability data is web server logs. All requests which ended up with errors mean something has happened. It could be HTTP error 500 returned by the application, because the database connection failed. Those could be programmatic errors pointing to some database issues, and which ended up in Apache’s error log. You can also use simple metric as uptime of database servers, although, with more complex SLA’s it might be tricky to determine how the unavailability of one database impacted your user base. No matter what you do, you should use more than one metric - this is needed to capture issues which might have happened on different layers of your environment.

Magic Number: “Three”

Even though high availability is also about redundancy, in case of database clusters, three is a magic number. It is not enough to have two nodes for redundancy - such setup does not provide any built-in high availability. Sure, it might be better than just a single node, but human intervention is required to recover services. Let’s see why it is so.

Let’s assume we have two nodes, A and B. There’s a network link between them. Let us assume that both A and B serves writes and the application randomly picks where to connect (which means that part of the application will connect to node A and the other part will connect to node B). Now, let’s imagine we have a network issue which results in lost network connectivity between A and B.

What now? Neither A nor B can know the state of the other node. There are two actions which can be taken by both nodes:

  1. They can continue accepting traffic
  2. They can cease to operate and refuse to serve any traffic

Let’s think about the first option. As long as the other node is indeed down, this is the preferred action to take - we want our database to continue serving traffic. This is the main idea behind high availability after all. What would happen, though, if both nodes would continue to accept traffic while being disconnected from each other? New data will be added on both sides, and the datasets will get out of sync. When the network issue will be resolved, it will be a daunting task to merge those two datasets. Therefore, it is not acceptable to keep both nodes up and running. The problem is - how can node A tell if node B is alive or not (and vice versa)? The answer is - it cannot. If all connectivity is down, there is no way to distinguish a failed node from a failed network. As a result, the only safe action is for both nodes to cease all operations and refuse to
serve traffic.

Let’s think now how a third node can help us in such a situation.

So we now have three nodes: A, B and C. All are interconnected, all are handling reads and writes.

Again, as in the previous example, node B has been cut off from the rest of the cluster due to network issues. What can happen next? Well, the situation is fairly similar to what we discussed earlier. Two options - node B can either be down (and the rest of the cluster should continue) or it can be up, in which case it shouldn’t be allowed to handle any traffic. Can we now tell what’s the state of the cluster? Actually, yes. We can see that nodes A and C can talk to each other and, as a result, they can agree that node B is not available. They won’t be able to tell why it happened, but what they know is that out of three nodes in the cluster two still have connectivity between each other. Given that those two nodes form a majority of the cluster, it makes possible to continue handling traffic. At the same time node B can also deduct that the problem is on its side. It cannot access neither node A nor node C, making node B separated from the rest of the cluster. As it is isolated and is not part of a majority (1 of 3), the only safe action it can take is to stop serving traffic and refuse to accept any queries, ensuring that data drift won’t happen.

Of course, it doesn’t mean you can have only three nodes in the cluster. If you want better failure tolerance, you may want to add more. Keep in mind, though, it should be an odd number if you want to improve high availability. Also, we were talking about “nodes” in the examples above. Please keep in mind that this is also true for datacenters, availability zones etc. If you have two datacenters, each having the same number of nodes (let’s say three nodes each), and you lose connectivity between those two DC’s, same principles apply here - you cannot tell which half of the cluster should start handling traffic. To be able to tell that, you have to have an observer in a third datacenter. It can be yet another set of nodes, or just a single host, with the task
to observe the state of remaining dataceters and take part in making decisions (an example here would be the Galera arbitrator).

Single Points of Failure

High availability is all about removing single points of failure (SPOF) and not introducing new ones in the process. What are the SPOFs? Any part of your infrastructure which, when failed, brings downtime as defined in SLA, is called a SPOF. Infrastructure design requires a holistic approach, the different components cannot be designed independently of each other. Most likely, you are not responsible for the whole design -
database administrators tend to focus on databases and not, for example, the network layer. Still, you have to keep the other parts in mind and work with the teams which are responsible for them, to make sure that not only the part you are responsible for is designed correctly but also that the remaining bits of the infrastructure were designed using the same principles. On top of that, such knowledge of how the whole
infrastructure is designed, helps you to design the database stack too. Knowing what issues may happen helps to build some mechanisms to prevent them from impacting the availability of the database.

by krzysztof at March 14, 2019 02:42 PM

Peter Zaitsev

Percona’s Open Source Data Management Software Survey


Click Here to Complete our New Survey!

Last year we informally surveyed the open source community and our conference attendees.
The results revealed that:

  • 48% of those in the cloud choose to self-manage their databases, but 52% were comfortable relying on the DBaaS offering of their cloud vendor.
  • 49% of people said “performance issues” when asked, “what keeps you up at night?”
  • The major decision influence for buying services was price, with 42% of respondents keen to make the most of their money.

We found this information so interesting that we wanted to find out more! As a result, we are pleased to announce the launch of our first annual Open Source Data Management Software Survey.

The final results will be 100% anonymous, and will be made freely available on Creative Commons.

How Will This Survey Help The Community?

Unlimited access to accurate market data is important. Millions of open source projects are in play, and most are dependent on databases. Accurate market data helps you track the popularity of different databases, as well as seeing how and where these databases are run. This helps us all build better software and take advantage of shifting trends.

Thousands of vendors are focused on helping SysAdmins, DBAs, and Developers get the most out of their database infrastructure. Insightful market data enables them to create better tools that meet current demands and grow the open source database market.

We want to assist companies who are still deciding what, how, and where to run their systems. This information will help them understand the industry direction and allow them to make an informed decision on the software and services they choose.

How Can You Help Make This Survey A Success?

Firstly, please share your insight into current trends and new developments in open source data management software.

Secondly, please share this survey with other people who work in the industry, and encourage them to contribute.

The more responses we receive, the more useful this will be to the whole open source community. If we missed anything, or you would like to ask other questions in future, let us know!

So tell us; who are the big fish, and which minnows are nibbling at their tails?! Is the cloud giving you altitude sickness, or are you flying high? What is the next big thing and is everyone on board, or is your company lagging behind?

Preliminary results will be presented at our annual Percona Live Conference in Austin, Texas (May 28-30, 2019) by our CEO, Peter Zaitsev and released to the open source community when finalized.

Click Here to Have Your Say!

by Rachel Pescador at March 14, 2019 11:08 AM

March 13, 2019

Oli Sennhauser

FromDual Performance Monitor for MariaDB and MySQL 1.0.2 has been released

FromDual has the pleasure to announce the release of the new version 1.0.2 of its popular Database Performance Monitor for MariaDB, MySQL, Galera Cluster and Percona Server fpmmm.

The new FromDual Performance Monitor for MariaDB and MySQL (fpmmm) can be downloaded from here. How to install and use fpmmm is documented in the fpmmm Installation Guide.

In the inconceivable case that you find a bug in the FromDual Performance Manager for MariaDB and MySQL please report it the FromDual Bugtracker or just send us an email.

Any feedback, statements and testimonials are welcome as well! Please send them to

Monitoring as a Service (MaaS)

You do not want to set-up your Database monitoring yourself? No problem: Choose our MariaDB and MySQL Monitoring as a Service (Maas) program to safe costs!

Upgrade from 1.0.x to 1.0.2

shell> cd /opt
shell> tar xf /download/fpmmm-1.0.2.tar.gz
shell> rm -f fpmmm
shell> ln -s fpmmm-1.0.2 fpmmm

Changes in FromDual Performance Monitor for MariaDB and MySQL 1.0.2

This release contains various bug fixes.

You can verify your current FromDual Performance Monitor for MariaDB and MySQL version with the following command:

shell> fpmmm --version

fpmmm agent

  • Server entropy probe added.
  • Processlist empty state is covered.
  • Processlist statements made more robust.
  • Error caught properly after query.
  • Branch for Ubuntu is different, fixed.
  • PHP Variable variables_order is included into program.
  • Fixed the documentation URL in file INSTALL.
  • Connection was not set to utf8. This is fixed now.
  • fprint error fixed.
  • Library updated from MyEnv project.

fpmmm Templates

  • Backup template added.
  • SQL thread and IO thread error more verbose and running again triggers implemented. Typo in slave template fixed.
  • Forks graph fixed, y axis starts from 0.

fpmmm agent installer

  • Error messages made more flexible.

For subscriptions of commercial use of fpmmm please get in contact with us.

by Shinguz at March 13, 2019 07:58 PM

Peter Zaitsev

Super Saver Discount Ends 17 March for Percona Live 2019


percona-live-2019-austin-tutorials-talksTutorials and initial sessions are set for the Percona Live Open Source Database Conference 2019, to be held May 28-30 at the Hyatt Regency in Austin, Texas! Percona Live 2019 is the premier open source database conference event for users of MySQL®, MariaDB®, MongoDB®, and PostgreSQL. It will feature 13 tracks presented over two days, plus a day of hands-on tutorials. Register now to enjoy our best Super Saver Registration rates which end March 17, 2019 at 11:30 p.m. PST.

Sample Sessions

Here is one item from each of our 13 tracks, samples from our full conference schedule.  Note too that many more great talks will be announced soon!

  1. MySQL®: The MySQL Query Optimizer Explained Through Optimizer Trace by Øystein Grøvlen of Alibaba Cloud.
  2. MariaDB®:  MariaDB Security Features and Best Practices by Robert Bindar of MariaDB Foundation.
  3. MongoDB®: MongoDB: Where Are We Going From Here? presented by David Murphy, Huawei
  4. PostgreSQL: A Deep Dive into PostgreSQL Indexing by Ibrar Ahmed, Percona
  5. Other Open Source Databases: ClickHouse Data Warehouse 101: The First Billion Rows by Alexander Zaitsev and Robert Hodges, Altinity
  6. Observability & Monitoring: Automated Database Monitoring at Uber with M3 and Prometheus by Rob Skillington and Richard Artoul, Uber
  7. Kubernetes: Complex Stateful Applications Made Easier with Kubernetes by Patrick Galbraith of Oracle MySQL
  8. Automation & AI: Databases at Scale, at Square by Emily Slocombe, Square
  9. Java Development for Open Source Databases: Introducing Java Profiling via Flame Graphs by Agustín Gallego, Percona
  10. Migration to Open Source Databases: Migrating between Proprietary and Open Source Database Technologies – Calculating your ROI by John Schultz, The Pythian Group
  11. Polyglot Persistence: A Tale of 8T (Transportable Tablespaces Vs Mysqldump) by Kristofer Grahn, Verisure AB
  12. Database Security & Compliance: MySQL Security and Standardization at PayPal by Stacy Yuan and Yashada Jadhav, Paypal Holdings Inc
  13. Business and Enterprise: MailChimp – Scale A MySQL Perspective by John Scott, MailChimp


Percona Live 2019 will be held at the downtown Hyatt Regency Austin Texas.  Located on the shores of Lady Bird Lake, it’s near water sports like kayaking, canoeing, stand-up paddling, and rowing. There are many food and historical sites nearby, such as the Texas Capitol, the LBJ Library, and Barton Springs Pool.  Book here for Percona’s conference room rate.


Sponsors of Percona Live 2019 can interact with DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solution vendors, and entrepreneurs who typically attend Percona Live. Download our prospectus for more information.

by Bronwyn Campbell at March 13, 2019 04:18 PM

Live MySQL Slave Rebuild with Percona Toolkit

MySQL slave data out of sync

MySQL slave data out of syncRecently, we had an edge case where a MySQL slave went out-of-sync but it couldn’t be rebuilt from scratch. The slave was acting as a master server to some applications and it had data was being written to it. It was a design error, and this is not recommended, but it happened. So how do you synchronize the data in this circumstance? This blog post describes the steps taken to recover from this situation. The tools used to recover the slave were pt-slave-restartpt-table-checksum, pt-table-sync and mysqldiff.


To illustrate this situation, it was built a master x slave configuration with sysbench running on the master server to simulate a general application workload. The environment was set with a Percona Server 5.7.24-26 and sysbench 1.0.16.

Below are the sysbench commands to prepare and simulate the workload:

# Create Data
sysbench --db-driver=mysql --mysql-user=root --mysql-password=msandbox \
  --mysql-socket=/tmp/mysql_sandbox45008.sock --mysql-db=test --range_size=100 \
  --table_size=5000 --tables=100 --threads=1 --events=0 --time=60 \
  --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua prepare
# Simulate Workload
sysbench --db-driver=mysql --mysql-user=root --mysql-password=msandbox \
  --mysql-socket=/tmp/mysql_sandbox45008.sock --mysql-db=test --range_size=100 \
  --table_size=5000 --tables=100 --threads=10 --events=0 --time=6000 \
  --rand-type=uniform /usr/share/sysbench/oltp_read_write.lua --report-interval=1 run

With the environment set, the slave server was stopped, and some operations to desynchronize the slave were performed to reproduce the problem.

Fixing the issue

With the slave desynchronized, a restart on the replication was executed. Immediately, the error below appeared:

Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'

To recover the slave from this error, we had to point the slave to an existing binary log with a valid binary log position. To get a valid binary log position, the command shown below had to be executed on the master:

master [localhost] {msandbox} ((none)) > show master status\G
*************************** 1. row ***************************
File: mysql-bin.000007
Position: 218443612
1 row in set (0.01 sec)

Then, a CHANGE MASTER command was run on the slave:

slave1 [localhost] {msandbox} (test) > change master to master_log_file='mysql-bin.000007', MASTER_LOG_POS=218443612;
Query OK, 0 rows affected (0.00 sec)
slave1 [localhost] {msandbox} (test) > start slave;
Query OK, 0 rows affected (0.00 sec)

Now the slave had a valid binary log file to read, but since it was inconsistent, it hit another error:

Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Delete_rows event on table test.sbtest8; Can't find record in 'sbtest8', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000005, end_log_pos 326822861

Working past the errors

Before fixing the inconsistencies, it was necessary to keep the replication running and to skip the errors. For this, the pt-slave-restart tool will be used. The tool needs to be run on the slave server:

pt-slave-restart --user root --socket=/tmp/mysql_sandbox45008.sock --ask-pass

The tool skips errors and starts the replication threads. Below is an example of the output of the pt-slave-restart:

$ pt-slave-restart --user root --socket=/tmp/mysql_sandbox45009.sock --ask-pass
Enter password:
2019-02-22T14:18:01 S=/tmp/mysql_sandbox45009.sock,p=...,u=root mysql-relay.000007        1996 1146
2019-02-22T14:18:02 S=/tmp/mysql_sandbox45009.sock,p=...,u=root mysql-relay.000007        8698 1146
2019-02-22T14:18:02 S=/tmp/mysql_sandbox45009.sock,p=...,u=root mysql-relay.000007       38861 1146

Finding the inconsistencies

With the tool running on one terminal, the phase to check the inconsistencies began. First things first, an object definition check was performed using mysqldiff utility. The mysqldiff tool is part of MySQL utilities. To execute the tool:

$ mysqldiff --server1=root:msandbox@localhost:48008 --server2=root:msandbox@localhost:48009 test:test --difftype=sql --changes-for=server2

And below are the differences found between the master and the slave:

1-) A table that doesn’t exist

# WARNING: Objects in server1.test but not in server2.test:
# TABLE: joinit

2-) A wrong table structure

# Comparing `test`.`sbtest98` to `test`.`sbtest98` [FAIL]
# Transformation for --changes-for=server2:
ALTER TABLE `test`.`sbtest98`
ADD INDEX k_98 (k);

Performing the recommendations on the slave (creating the table and the table modification) the object’s definition was now equal. The next step was to check data consistency. For this, the pt-table-checksum was used to identify which tables are out-of-sync. This command was run on the master server.

$ pt-table-checksum -uroot -pmsandbox --socket=/tmp/mysql_sandbox48008.sock --replicate=percona.checksums --create-replicate-table --empty-replicate-table --no-check-binlog-format --recursion-method=hosts

And an output example:

01 master]$ pt-table-checksum --recursion-method dsn=D=percona,t=dsns --no-check-binlog-format --nocheck-replication-filter --host --user root --port 48008 --password=msandbox
Checking if all tables can be checksummed ...
Starting checksum ...
  at /usr/bin/pt-table-checksum line 332.
Replica lag is 66 seconds on  Waiting.
Replica lag is 46 seconds on  Waiting.
Replica lag is 33 seconds on  Waiting.
02-26T16:27:59      0      0     5000          0       1       0   0.037 test.sbtest1
02-26T16:27:59      0      0     5000          0       1       0   0.039 test.sbtest10
02-26T16:27:59      0      1     5000          0       1       0   0.033 test.sbtest100
02-26T16:27:59      0      1     5000          0       1       0   0.034 test.sbtest11
02-26T16:27:59      0      1     5000          0       1       0   0.040 test.sbtest12
02-26T16:27:59      0      1     5000          0       1       0   0.034 test.sbtest13

Fixing the data inconsistencies

Analyzing the DIFFS column it is possible to identify which tables were compromised. With this information, the pt-table-sync tool was used to fix these inconsistencies. The tool synchronizes MySQL table data efficiently, performing non-op changes on the master so they can be replicated and applied on the slave. The tools need to be run on the slave server. Below is an example of the tool running:

$ pt-table-sync --execute --sync-to-master h=localhost,u=root,p=msandbox,D=test,t=sbtest100,S=/tmp/mysql_sandbox48009.sock

It is possible to perform a dry-run of the tool before executing the changes to check what changes the tool will apply:

$ pt-table-sync --print --sync-to-master h=localhost,u=root,p=msandbox,D=test,t=sbtest100,S=/tmp/mysql_sandbox48009.sock
REPLACE INTO `test`.`sbtest100`(`id`, `k`, `c`, `pad`) VALUES ('1', '1654', '97484653464-60074971666-42998564849-40530823048-27591234964-93988623123-02188386693-94155746040-59705759910-14095637891', '15000678573-85832916990-95201670192-53956490549-57402857633') /*percona-toolkit src_db:test src_tbl:sbtest100 src_dsn:D=test,P=48008,S=/tmp/mysql_sandbox48009.sock,h=,p=...,t=sbtest100,u=root dst_db:test dst_tbl:sbtest100 dst_dsn:D=test,S=/tmp/mysql_sandbox48009.sock,h=localhost,p=...,t=sbtest100,u=root lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:17806 user:vinicius.grippa*/;
REPLACE INTO `test`.`sbtest100`(`id`, `k`, `c`, `pad`) VALUES ('2', '3007', '31679133794-00154186785-50053859647-19493043469-34585653717-64321870163-33743380797-12939513287-31354198555-82828841987', '30122503210-11153873086-87146161761-60299188705-59630949292') /*percona-toolkit src_db:test src_tbl:sbtest100 src_dsn:D=test,P=48008,S=/tmp/mysql_sandbox48009.sock,h=,p=...,t=sbtest100,u=root dst_db:test dst_tbl:sbtest100 dst_dsn:D=test,S=/tmp/mysql_sandbox48009.sock,h=localhost,p=...,t=sbtest100,u=root lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:17806 user:vinicius.grippa*/;

After executing the pt-table-sync, we recommend that you run the pt-table-checksum again and check if the DIFFS column shows the value of 0.


This blog post was intended to cover all possible issues that could happen on a slave when it goes out-of-sync such as DDL operations, binary log purge and DML operations. This process involves many steps and it could take a long time to finish, especially in large databases. Note that this process might take longer than the backup/restore process. However, in situations like the one mentioned above, it might be the only solution to recover a slave.

Image based on Photo by Randy Fath on Unsplash


by Vinicius Grippa at March 13, 2019 11:58 AM

March 12, 2019

Peter Zaitsev

Upcoming Webinar Thurs 3/14: Web Application Security – Why You Should Review Yours

Please join Percona’s Information Security Architect, David Bubsy, as he presents his talk Web Application Security – Why You Should Review Yours on March 14th, 2019 at 6:00 AM PDT (UTC-7) / 9:00 AM EDT (UTC-4).

Register Now

In this talk, we take a look at the whole stack and I don’t just mean LAMP.

We’ll cover what an attack surface is and some areas you may look to in order to ensure that you can reduce it.

For instance, what’s an attack surface?

Acronym Hell, what do they mean?

Vulnerability Naming, is this media naming stupidity or driving the message home?

Detection, Prevention and avoiding the boy who cried wolf are some further examples.

Additionally, we’ll cover emerging technologies to keep an eye on or even implement yourself to help improve your security posture.

There will also be a live compromise demo (or backup video if something fails) that covers compromising a PCI compliant network structure to reach the database system. Through this compromise you can ultimately exploit multiple failures to gain bash shell access over the MySQL protocol.

by David Busby at March 12, 2019 08:59 PM

PMM’s Custom Queries in Action: Adding a Graph for InnoDB mutex waits

PMM mutex wait graph

One of the great things about Percona Monitoring and Management (PMM) is its flexibility. An example of that is how one can go beyond the exporters to collect data. One approach to achieve that is using textfile collectors, as explained in  Extended Metrics for Percona Monitoring and Management without modifying the Code. Another method, which is the subject matter of this post, is to use custom queries.

While working on a customer’s contention issue I wanted to check the behaviour of InnoDB Mutexes over time. Naturally, I went straight to PMM and didn’t find a graph suitable for my needs. No graph, no problem! Luckily anyone can enhance PMM. So here’s how I made the graph I needed.

The final result will looks like this:

Custom Queries

What is it?

Starting from the version 1.15.0, PMM provides user the ability to take a SQL SELECT statement and turn the resultset into a metric series in PMM. That is custom queries.

How do I enable that feature?

This feature is ON by default. You only need to edit the configuration file using YAML syntax

Where is the configuration file located?

Config file location is /usr/local/percona/pmm-client/queries-mysqld.yml by default. You can change it when adding mysql metrics via pmm-admin:

pmm-admin add mysql:metrics ... -- --queries-file-name=/usr/local/percona/pmm-client/query.yml

How often is data being collected?

The queries are executed at the LOW RESOLUTION level, which by default is every 60 seconds.

InnoDB Mutex monitoring

The method used to gather Mutex status is querying the PERFORMANCE SCHEMA, as explained here: but intentionally removed the SUM_TIMER_WAIT > 0 condition, so the query used looks like this:

FROM performance_schema.events_waits_summary_global_by_event_name
WHERE EVENT_NAME LIKE 'wait/synch/mutex/innodb/%'

For this query to return data, some requirements need to be met:

  • The most important one: Performance Schema needs to be enabled
  • Consumers for “event_waits” enabled
  • Instruments for ‘wait/synch/mutex/innodb’ enabled.

If performance schema is enabled, the other two requirements are met by running these two queries:

update performance_schema.setup_instruments set enabled='YES' where name like 'wait/synch/mutex/innodb%';
update performance_schema.setup_consumers set enabled='YES' where name like 'events_waits%';

YAML Configuration File

This is where the magic happens. Explanation of the YAML syntax is covered in deep on the documentation:

The one used for this issue is:

    query: "SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT FROM performance_schema.events_waits_summary_global_by_event_name WHERE EVENT_NAME LIKE 'wait/synch/mutex/innodb/%'"
      - EVENT_NAME:
          usage: "LABEL"
          description: "Name of the mutex"
      - COUNT_STAR:
          usage: "COUNTER"
          description: "Number of calls"
          usage: "GAUGE"
          description: "Duration"

The key info is:

  • The metric name is mysql_global_status_innodb_mutex
  • Since EVENT_NAME is used as a label, it will be possible to have values per event

Remember that this should be in the queries-mysql.yml file. Full path /usr/local/percona/pmm-client/queries-mysqld.yml  inside the db node.

Once that is done, you will start to have those metrics available in Prometheus. Now, we have a graph to do!

Creating the graph in Grafana

Before jumping to grafana to add the graph, we need a proper Prometheus Query (A.K.A: PromQL). I came up with these two (one for the count_star, one for the sum_timer_wait):

topk(5, label_replace(rate(mysql_global_status_innodb_mutex_COUNT_STAR{instance="$host"}[$interval]), "mutex", "$2", "EVENT_NAME", "(.*)/(.*)" ) or label_replace(irate(mysql_global_status_innodb_mutex_COUNT_STAR{instance="$host"}[5m]), "mutex", "$2", "EVENT_NAME", "(.*)/(.*)" ))


topk(5, label_replace(rate(mysql_global_status_innodb_mutex_SUM_TIMER_WAIT{instance="$host"}[$interval]), "mutex", "$2", "EVENT_NAME", "(.*)/(.*)" ) or label_replace(irate(mysql_global_status_innodb_mutex_SUM_TIMER_WAIT{instance="$host"}[5m]), "mutex", "$2", "EVENT_NAME", "(.*)/(.*)" ))

These queries are basically: Return the rate values of each mutex event for a specific host. And make some regex to return only the name of the event, and discard whatever is before the last slash character.

Once we are good with our PromQL queries, we can go and add the graph.

Finally, I got the graph that I needed with a very small effort.

The dashboard is also published on the Grafana Labs Community dashboards site.


PMM’s collection of graphs and dashboard is quite complete, but it is also natural that there are specific metrics that might not be there. For those cases, you can count on the flexibility and ease usage of PMM to collect metrics and create custom graphs. So go ahead, embrace PMM, customize it, make it yours!

The JSON for this graph, so it can be imported easily, is:

  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "datasource": "Prometheus",
  "fill": 0,
  "gridPos": {
    "h": 18,
    "w": 24,
    "x": 0,
    "y": 72
  "id": null,
  "legend": {
    "alignAsTable": true,
    "avg": true,
    "current": false,
    "max": true,
    "min": true,
    "rightSide": false,
    "show": true,
    "sideWidth": 0,
    "sort": "avg",
    "sortDesc": true,
    "total": false,
    "values": true
  "lines": true,
  "linewidth": 2,
  "links": [],
  "nullPointMode": "null",
  "percentage": false,
  "pointradius": 0.5,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [
      "alias": "/Timer Wait/i",
      "yaxis": 2
  "spaceLength": 10,
  "stack": false,
  "steppedLine": false,
  "targets": [
      "expr": "topk(5, label_replace(rate(mysql_global_status_innodb_mutex_COUNT_STAR{instance=\"$host\"}[$interval]), \"mutex\", \"$2\", \"EVENT_NAME\", \"(.*)/(.*)\" )) or topk(5,label_replace(irate(mysql_global_status_innodb_mutex_COUNT_STAR{instance=\"$host\"}[5m]), \"mutex\", \"$2\", \"EVENT_NAME\", \"(.*)/(.*)\" ))",
      "format": "time_series",
      "interval": "$interval",
      "intervalFactor": 1,
      "legendFormat": "{{ mutex }} calls",
      "refId": "A",
      "hide": false
      "expr": "topk(5, label_replace(rate(mysql_global_status_innodb_mutex_SUM_TIMER_WAIT{instance=\"$host\"}[$interval]), \"mutex\", \"$2\", \"EVENT_NAME\", \"(.*)/(.*)\" )) or topk(5, label_replace(irate(mysql_global_status_innodb_mutex_SUM_TIMER_WAIT{instance=\"$host\"}[5m]), \"mutex\", \"$2\", \"EVENT_NAME\", \"(.*)/(.*)\" ))",
      "format": "time_series",
      "interval": "$interval",
      "intervalFactor": 1,
      "legendFormat": "{{ mutex }} timer wait",
      "refId": "B",
      "hide": false
  "thresholds": [],
  "timeFrom": null,
  "timeShift": null,
  "title": "InnoDB Mutex",
  "tooltip": {
    "shared": true,
    "sort": 2,
    "value_type": "individual"
  "transparent": false,
  "type": "graph",
  "xaxis": {
    "buckets": null,
    "mode": "time",
    "name": null,
    "show": true,
    "values": []
  "yaxes": [
      "format": "short",
      "label": "",
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
      "decimals": null,
      "format": "ns",
      "label": "",
      "logBase": 1,
      "max": null,
      "min": "0",
      "show": true
  "yaxis": {
    "align": false,
    "alignLevel": null

by Daniel Guzmán Burgos at March 12, 2019 06:31 PM

Jean-Jerome Schmidt

HA for MySQL and MariaDB - Comparing Master-Master Replication to Galera Cluster

Galera replication is relatively new if compared to MySQL replication, which is natively supported since MySQL v3.23. Although MySQL replication is designed for master-slave unidirectional replication, it can be configured as an active master-master setup with bidirectional replication. While it is easy to set up, and some use cases might benefit from this “hack”, there are a number of caveats. On the other hand, Galera cluster is a different type of technology to learn and manage. Is it worth it?

In this blog post, we are going to compare master-master replication to Galera cluster.

Replication Concepts

Before we jump into the comparison, let’s explain the basic concepts behind these two replication mechanisms.

Generally, any modification to the MySQL database generates an event in binary format. This event is transported to the other nodes depending on the replication method chosen - MySQL replication (native) or Galera replication (patched with wsrep API).

MySQL Replication

The following diagrams illustrates the data flow of a successful transaction from one node to another when using MySQL replication:

The binary event is written into the master's binary log. The slave(s) via slave_IO_thread will pull the binary events from master's binary log and replicate them into its relay log. The slave_SQL_thread will then apply the event from the relay log asynchronously. Due to the asynchronous nature of replication, the slave server is not guaranteed to have the data when the master performs the change.

Ideally, MySQL replication will have the slave to be configured as a read-only server by setting read_only=ON or super_read_only=ON. This is a precaution to protect the slave from accidental writes which can lead to data inconsistency or failure during master failover (e.g., errant transactions). However, in a master-master active-active replication setup, read-only has to be disabled on the other master to allow writes to be processed simultaneously. The primary master must be configured to replicate from the secondary master by using the CHANGE MASTER statement to enable circular replication.

Galera Replication

The following diagrams illustrates the data replication flow of a successful transaction from one node to another for Galera Cluster:

The event is encapsulated in a writeset and broadcasted from the originator node to the other nodes in the cluster by using Galera replication. The writeset undergoes certification on every Galera node and if it passes, the applier threads will apply the writeset asynchronously. This means that the slave server will eventually become consistent, after agreement of all participating nodes in global total ordering. It is logically synchronous, but the actual writing and committing to the tablespace happens independently, and thus asynchronously on each node with a guarantee for the change to propagate on all nodes.

Avoiding Primary Key Collision

In order to deploy MySQL replication in master-master setup, one has to adjust the auto increment value to avoid primary key collision for INSERT between two or more replicating masters. This allows the primary key value on masters to interleave each other and prevent the same auto increment number being used twice on either of the node. This behaviour must be configured manually, depending on the number of masters in the replication setup. The value of auto_increment_increment equals to the number of replicating masters and the auto_increment_offset must be unique between them. For example, the following lines should exist inside the corresponding my.cnf:





Likewise, Galera Cluster uses this same trick to avoid primary key collisions by controlling the auto increment value and offset automatically with wsrep_auto_increment_control variable. If set to 1 (the default), will automatically adjust the auto_increment_increment and auto_increment_offset variables according to the size of the cluster, and when the cluster size changes. This avoids replication conflicts due to auto_increment. In a master-slave environment, this variable can be set to OFF.

The consequence of this configuration is the auto increment value will not be in sequential order, as shown in the following table of a three-node Galera Cluster:

Node auto_increment_increment auto_increment_offset Auto increment value
Node 1 3 1 1, 4, 7, 10, 13, 16...
Node 2 3 2 2, 5, 8, 11, 14, 17...
Node 3 3 3 3, 6, 9, 12, 15, 18...

If an application performs insert operations on the following nodes in the following order:

  • Node1, Node3, Node2, Node3, Node3, Node1, Node3 ..

Then the primary key value that will be stored in the table will be:

  • 1, 6, 8, 9, 12, 13, 15 ..

Simply said, when using master-master replication (MySQL replication or Galera), your application must be able to tolerate non-sequential auto-increment values in its dataset.

For ClusterControl users, take note that it supports deployment of MySQL master-master replication with a limit of two masters per replication cluster, only for active-passive setup. Therefore, ClusterControl does not deliberately configure the masters with auto_increment_increment and auto_increment_offset variables.

Data Consistency

Galera Cluster comes with its flow-control mechanism, where each node in the cluster must keep up when replicating, or otherwise all other nodes will have to slow down to allow the slowest node to catch up. This basically minimizes the probability of slave lag, although it might still happen but not as significant as in MySQL replication. By default, Galera allows nodes to be at least 16 transactions behind in applying through variable gcs.fc_limit. If you want to do critical reads (a SELECT that must return most up to date information), you probably want to use session variable, wsrep_sync_wait.

Galera Cluster on the other hand comes with a safeguard to data inconsistency whereby a node will get evicted from the cluster if it fails to apply any writeset for whatever reasons. For example, when a Galera node fails to apply writeset due to internal error by the underlying storage engine (MySQL/MariaDB), the node will pull itself out from the cluster with the following error:

150305 16:13:14 [ERROR] WSREP: Failed to apply trx 1 4 times
150305 16:13:14 [ERROR] WSREP: Node consistency compromized, aborting..

To fix the data consistency, the offending node has to be re-synced before it is allowed to join the cluster. This can be done manually or by wiping out the data directory to trigger snapshot state transfer (full syncing from a donor).

MySQL master-master replication does not enforce data consistency protection and a slave is allowed to diverge e.g, replicate a subset of data or lag behind, which makes the slave inconsistent with the master. It is designed to replicate data in one flow - from master down to the slaves. Data consistency checks have to be performed manually or via external tools like Percona Toolkit pt-table-checksum or mysql-replication-check.

Conflict Resolution

Generally, master-master (or multi-master, or bi-directional) replication allows more than one member in the cluster to process writes. With MySQL replication, in case of replication conflict, the slave's SQL thread simply stops applying the next query until the conflict is resolved, either by manually skipping the replication event, fixing the offending rows or resyncing the slave. Simply said, there is no automatic conflict resolution support for MySQL replication.

Galera Cluster provides a better alternative by retrying the offending transaction during replication. By using wsrep_retry_autocommit variable, one can instruct Galera to automatically retry a failed transaction due to cluster-wide conflicts, before returning an error to the client. If set to 0, no retries will be attempted, while a value of 1 (the default) or more specifies the number of retries attempted. This can be useful to assist applications using autocommit to avoid deadlocks.

Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

Node Consensus and Failover

Galera uses Group Communication System (GCS) to check node consensus and availability between cluster members. If a node is unhealthy, it will be automatically evicted from the cluster after gmcast.peer_timeout value, default to 3 seconds. A healthy Galera node in "Synced" state is deemed as a reliable node to serve reads and writes, while others are not. This design greatly simplifies health check procedures from the upper tiers (load balancer or application).

In MySQL replication, a master does not care about its slave(s), while a slave only has consensus with its sole master via the slave_IO_thread process when replicating the binary events from master's binary log. If a master goes down, this will break the replication and an attempt to re-establish the link will be made every slave_net_timeout (default to 60 seconds). From the application or load balancer perspective, the health check procedures for replication slave must at least involve checking the following state:

  • Seconds_Behind_Master
  • Slave_IO_Running
  • Slave_SQL_Running
  • read_only variable
  • super_read_only variable (MySQL 5.7.8 and later)

In terms of failover, generally, master-master replication and Galera nodes are equal. They hold the same data set (albeit you can replicate a subset of data in MySQL replication, but that's uncommon for master-master) and share the same role as masters, capable of handling reads and writes simultaneously. Therefore, there is actually no failover from the database point-of-view due to this equilibrium. Only from the application side that would require failover to skip the unoperational nodes. Keep in mind that because MySQL replication is asynchronous, it is possible that not all of the changes done on the master will have propagated to the other master.

Node Provisioning

The process of bringing a node into sync with the cluster before replication starts, is known as provisioning. In MySQL replication, provisioning a new node is a manual process. One has to take a backup of the master and restore it over to the new node before setting up the replication link. For an existing replication node, if the master's binary logs have been rotated (based on expire_logs_days, default to 0 means no automatic removal), you may have to re-provision the node using this procedure. There are also external tools like Percona Toolkit pt-table-sync and ClusterControl to help you out on this. ClusterControl supports resyncing a slave with just two clicks. You have options to resync by taking a backup from the active master or an existing backup.

In Galera, there are two ways of doing this - incremental state transfer (IST) or state snapshot transfer (SST). IST process is the preferred method where only the missing transactions transfer from a donor's cache. SST process is similar to taking a full backup from the donor, it is usually pretty resource intensive. Galera will automatically determine which syncing process to trigger based on the joiner's state. In most cases, if a node fails to join a cluster, simply wipe out the MySQL datadir of the problematic node and start the MySQL service. Galera provisioning process is much simpler, it comes very handy when scaling out your cluster or re-introducing a problematic node back into the cluster.

Loosely Coupled vs Tightly Coupled

MySQL replication works very well even across slower connections, and with connections that are not continuous. It can also be used across different hardware, environment and operating systems. Most storage engines support it, including MyISAM, Aria, MEMORY and ARCHIVE. This loosely coupled setup allows MySQL master-master replication to work well in a mixed environment with less restriction.

Galera nodes are tightly-coupled, where the replication performance is as fast as the slowest node. Galera uses a flow control mechanism to control replication flow among members and eliminate any slave lag. The replication can be all fast or all slow on every node and is adjusted automatically by Galera. Thus, it's recommended to use uniform hardware specs for all Galera nodes, especially with respect to CPU, RAM, disk subsystem, network interface card and network latency between nodes in the cluster.


In summary, Galera Cluster is superior if compared to MySQL master-master replication due to its synchronous replication support with strong consistency, plus more advanced features like automatic membership control, automatic node provisioning and multi-threaded slaves. Ultimately, this depends on how the application interacts with the database server. Some legacy applications built for a standalone database server may not work well on a clustered setup.

To simplify our points above, the following reasons justify when to use MySQL master-master replication:

  • Things that are not supported by Galera:
    • Replication for non-InnoDB/XtraDB tables like MyISAM, Aria, MEMORY or ARCHIVE.
    • XA transactions.
    • Statement-based replication between masters (e.g, when bandwidth is very expensive).
    • Relying on explicit locking like LOCK TABLES statement.
    • The general query log and the slow query log must be directed to a table, instead of a file.
  • Loosely coupled setup where the hardware specs, software version and connection speed are significantly different on every master.
  • When you already have a MySQL replication chain and you want to add another active/backup master for redundancy to speed up failover and recovery time in case if one of the master is unavailable.
  • If your application can't be modified to work around Galera Cluster limitations and having a MySQL-aware load balancer like ProxySQL or MaxScale is not an option.

Reasons to pick Galera Cluster over MySQL master-master replication:

  • Ability to safely write to multiple masters.
  • Data consistency automatically managed (and guaranteed) across databases.
  • New database nodes easily introduced and synced.
  • Failures or inconsistencies automatically detected.
  • In general, more advanced and robust high availability features.

by ashraf at March 12, 2019 09:03 AM

March 11, 2019

Peter Zaitsev

Switch your PostgreSQL Primary for a Read Replica, Without Downtime

postgres read replica from primary

PostgreSQL logoIn my ongoing research to identify solutions and similarities between MySQL – PostgreSQL, I recently faced a simple issue. I needed to perform a slave shift from one IP to another and I did not want to have to restart the slave that is serving the reads. In MySQL, I can repoint the replication online with the command Change Master TO, so I was looking for similar solution in postgres. In my case, I could also afford some stale reads, so a few seconds delay would have been OK, but I couldn’t take down the server.

After brief research, I noticed that there is not a solution that allow you to do that without restarting the PostgreSQL server instance.
I was a bit disappointed, because I was just trying to move the whole traffic from one subnet to another, so not really changing the Master, but just the pointer.

At this point I raised my question to my colleagues who are experts in PG. Initially they confirmed to me that there is no real dynamic solution/command for that. However, while discussing this, one of them (Jobin Augustine) suggested a not “officially supported” way, that might work.

In brief, given that the WAL Receiver uses its own process, killing it would trigger an internal refresh operation, and that could result in having the replication restart from the new desired configuration.

This was an intriguing suggestion, but I wondered if it might have some negative side effects. In any case, I decided to try it and see what would happen.

This article describe the process I followed to test the approach. To be clear:  this is not an “Official” solution, and is not recommended as best practice.

From now on in this article I will drop the standard MySQL terms and instead use Primary for Master and Replica for Slave.


I carried out two main tests:

  1. No load in writing
  2. Writing happening

for each of these I took these steps:

a) move Replica to same Primary (different ip)
b) move Replica to different Primary/Replica, creating a chain, so from:

                          | Primary|
                +--------+     |    +--------+
                +--------+          +--------+



The other thing was to try to be as non-invasive as possible. Given that, I used KILL SIGQUIT(3) instead of the more brutal SIGKILL.

SIGQUIT “The SIGQUIT signal is sent to a process by its controlling terminal when the user requests that the process quit and perform a core dump.

To note that I did try this with SIGTERM (15) which is the nicest approach, but it didn’t in fact force the process to perform the shift as desired.

In general in all the following tests what I execute is:

ps aux|grep 'wal receiver'
kill -3 <pid>

These are the current IPs for node:

Node1 (Primary):

NIC1 =
NIC2 =
NIC3 =

Node2 (replica1):

NIC1 =
NIC2 =
NIC3 =

Node1 (replica2):

NIC1 =
NIC2 =
NIC3 =

The starting position is:

select pid,usesysid,usename,application_name,client_addr,client_port,backend_start,state,sent_lsn,write_lsn,flush_lsn,sync_state from pg_stat_replication;
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 22495 |    24601 | replica | node2            | |       49518 | 2019-02-06 11:07:46.507511-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

And now let’s roll the ball and see what happen.

Experiment 1 – moving to same Primary no load

I will move Node2 to point to

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host= port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

change to:

primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host= port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 8343 0.0 0.0 667164 2180 ? Ss Feb06 16:27 postgres: wal receiver process streaming 10/FD6C60E8

Checking the replication status:

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres  8343  0.0  0.0 667164  2180 ?        Ss   Feb06  16:27 postgres: wal receiver process   streaming 10/FD6C60E8
                                                                  Tue 19 Feb 2019 12:10:22 PM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 23748 |    24601 | replica | node2            | |       49522 | 2019-02-19 12:09:31.054915-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)
                                                                  Tue 19 Feb 2019 12:10:23 PM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(1 row)
                                                                  Tue 19 Feb 2019 12:10:26 PM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 23756 |    24601 | replica | node2            | |       37866 | 2019-02-19 12:10:26.904766-05 | catchup   | 10/FD460000 | 10/FD3A0000 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)
                                                                  Tue 19 Feb 2019 12:10:28 PM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 23756 |    24601 | replica | node2            | |       37866 | 2019-02-19 12:10:26.904766-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)

It takes six seconds to kill the process, shift to a new IP, and perform the catch up.

Experiment 2 – moving to Different Primary (as a chain of replicas) No load

I will move Node2 to point to

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host= port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host= port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 25859 0.0 0.0 667164 3484 ? Ss Feb19 1:53 postgres: wal receiver process

On Node1

Thu 21 Feb 2019 04:23:26 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
 31241 |    24601 | replica | node2            | |       38232 | 2019-02-21 04:17:24.535662-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
(2 rows)
                                                                  Thu 21 Feb 2019 04:23:27 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

On Node3

pid | usesysid | usename | application_name | client_addr | client_port | backend_start | state | sent_lsn | write_lsn | flush_lsn | sync_state
(0 rows)
                                                                  Thu 21 Feb 2019 04:23:30 AM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 1435 |    24601 | replica | node2            | |       58116 | 2019-02-21 04:23:29.846798-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

In this case, shifting to a new primary took four seconds.

Now all this is great, but I was working with NO load, what would happen if we have read/write taking place?

Experiment 3 – moving to same Primary WITH Load

I will move Node2 to point to

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host= port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host= port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 20765 0.2 0.0 667196 3712 ? Ss 06:23 0:00 postgres: wal receiver process streaming 11/E33F760

Thu 21 Feb 2019 06:23:03 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
 31649 |    24601 | replica | node2            | |       38236 | 2019-02-21 06:21:23.539493-05 | streaming | 11/8FEC000 | 11/8FEC000 | 11/8FEC000 | async
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/8FEC000 | 11/8FEC000 | 11/8FEC000 | async
                                                                 Thu 21 Feb 2019 06:23:04 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/904DCC0 | 11/904C000 | 11/904C000 | async
                                                                 Thu 21 Feb 2019 06:23:08 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
 31778 |    24601 | replica | node2            | |       49896 | 2019-02-21 06:23:08.978179-05 | catchup   | 11/9020000 |            |            | async
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/9178000 | 11/9178000 | 11/9178000 | async
                                                                 Thu 21 Feb 2019 06:23:09 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
 31778 |    24601 | replica | node2            | |       49896 | 2019-02-21 06:23:08.978179-05 | streaming | 11/91F7860 | 11/91F7860 | 11/91F7860 | async
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/91F7860 | 11/91F7860 | 11/91F7860 | async

In this case shifting to a new primary takes six seconds.

Experiment 4 – moving to Different Primary (as a chain of replicas) No load

I move Node2 to point to
In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host= port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host= port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

[root@pg1h3p82 data]# ps aux|grep 'wal receiver'
postgres 21158 6.3 0.0 667196 3704 ? Ds 06:30 0:09 postgres: wal receiver process streaming 11/4F000000


Thu 21 Feb 2019 06:30:56 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 31778 |    24601 | replica | node2            | |       49896 | 2019-02-21 06:23:08.978179-05 | streaming | 11/177F8000 | 11/177F8000 | 11/177F8000 | async
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/177F8000 | 11/177F8000 | 11/177F8000 | async
(2 rows)
                                                                  Thu 21 Feb 2019 06:30:57 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/17DAA000 | 11/17DAA000 | 11/17DAA000 | async
(1 row)


Thu 21 Feb 2019 06:31:01 AM EST (every 1s)
 pid | usesysid | usename | application_name | client_addr | client_port | backend_start | state | sent_lsn | write_lsn | flush_lsn | sync_state
(0 rows)
                                                                 Thu 21 Feb 2019 06:31:02 AM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |  state  |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 1568 |    24601 | replica | node2            | |       58122 | 2019-02-21 06:31:01.937957-05 | catchup | 11/17960000 | 11/17800000 | 11/177F8CC0 | async
(1 row)
                                                                  Thu 21 Feb 2019 06:31:03 AM EST (every 1s)
 pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
 1568 |    24601 | replica | node2            | |       58122 | 2019-02-21 06:31:01.937957-05 | streaming | 11/1A1D3D08 | 11/1A1D3D08 | 11/1A1D3D08 | async
(1 row)

In this case shifting to a new primary took seven seconds.

Finally, I did another test. I was wondering, can I move the server Node2 back under the main Primary Node1 while writes are happening?

Well, here’s what happened:

In my recovery.conf
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host= port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
change to:
primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host= port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

After I kill the process as I did in the previous examples, Node2 rejoined the Primary Node1, but …

Thu 21 Feb 2019 06:33:58 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
  1901 |    24601 | replica | node2            | |       49900 | 2019-02-21 06:33:57.81308-05  | catchup   | 11/52E40000 | 11/52C00000 | 11/52BDFFE8 | async
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/5D3F9EC8 | 11/5D3F9EC8 | 11/5D3F9EC8 | async

… Node2 was not really able to catch up quickly, or at least not able to do that until the load was on the primary and high. As soon I reduced the application pressure:

Thu 21 Feb 2019 06:35:29 AM EST (every 1s)
  pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
  1901 |    24601 | replica | node2            | |       49900 | 2019-02-21 06:33:57.81308-05  | streaming | 11/70AE8000 | 11/70000000 | 11/70000000 | async
 22449 |    24601 | replica | node3            | |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/70AE8000 | 11/70AE8000 | 11/70AE8000 | async

Node2 was able to catch up and align itself.


In all tests , the Replica was able to rejoin the Primary or the new primary, with obvious different times.

From the tests I carried out so far, it seems that modifying the replication source, and then killing the “WAL receiver” thread, is a procedure that allows us to shift the replication source without the need for a service restart.

This is even more efficient compared to the MySQL solution, given the time taken for the recovery and the flexibility.

What I am still wondering is IF this might cause some data inconsistency issues or not. I asked some of the PG experts inside the company, and it seems that the process should be relatively safe, but I would appreciate any feedback/comment in case you know this may not be a safe operation.

Good PostgreSQL to everybody!

Photo by from Pexels

by Marco Tusa at March 11, 2019 12:59 PM

Jean-Jerome Schmidt

How to Manage MySQL - for Oracle DBAs

Open source databases are quickly becoming mainstream, so migration from proprietary engines into open source engines is a kind of an industry trend now. It also means that we DBA’s often end up having multiple database backends to manage.

In the past few blog posts, my colleague Paul Namuag and I covered several aspects of migration from Oracle to Percona, MariaDB, and MySQL. The obvious goal for the migration is to get your application up and running more efficiently in the new database environment, however it’s crucial to assure that staff is ready to support it.

This blog covers the basic operations of MySQL with reference to similar tasks that you would perform daily in your Oracle environment. It provides you with a deep dive on different topics to save you time as you can relate to Oracle knowledge that you’ve already built over the years.

We will also talk about external command line tools that are missing in the default MySQL installation but are needed to perform daily operations efficiently. The open source version doesn’t come with the equivalent of Oracle Cloud Control for instance, so do checkout ClusterControl if you are looking for something similar.

In this blog, we are assuming you have a better knowledge of Oracle than MySQL and hence would like to know the correlation between the two. The examples are based on Linux platform however you can find many similarities in managing MySQL on Windows.

How do I connect to MySQL?

Let’s start our journey with a very (seemingly) basic task. Actually, this is a kind of task which can cause some confusion due to different login concepts in Oracle and MySQL.

The equivalent of sqlplus / as sysdba connection is “mysql” terminal command with a flag -uroot. In the MySQL world, the superuser is called root. MySQL database users (including root) are defined by the name and host from where it can connect.

The information about user and hosts from where it can connect is stored in mysql.user table. With the connection attempt, MySQL checks if the client host, username and password match the row in the metadata table.

This is a bit of a different approach than in Oracle where we have a user name and password only, but those who are familiar with Oracle Connection Manager might find some similarities.

You will not find predefined TNS entries like in Oracle. Usually, for an admin connection, we need user, password and -h host flag. The default port is 3306 (like 1521 in Oracle) but this may vary on different setups.

By default, many installations will have root access connection from any machine (root@’%’) blocked, so you have to log in to the server hosting MySQL, typically via ssh.

Type the following:

mysql -u root

When the root password is not set this is enough. If the password is required then you should add the flag -p.

mysql -u root -p

You are now logged in to the mysql client (the equivalent of sqlplus) and will see a prompt, typically 'mysql>'.

Is MySQL up and running?

You can use the mysql service startup script or mysqladmin command to find out if it is running. Then you can use the ps command to see if mysql processes are up and running. Another alternative can be mysqladmin, which is a utility that is used for performing administrative operations.

mysqladmin -u root -p status

On Debian:

/etc/init.d/mysql status

If you are using RedHat or Fedora then you can use the following script:

service mysqld status


/etc/init.d/mysqld status


systemctl status mysql.service

On MariaDB instances, you should look for the MariaDB service name.

systemctl status mariadb

What’s in this database?

Like in Oracle, you can querythe metadata objects to get information about database objects.

It’s common to use some shortcuts here, commands that help you to list objects or get DDL of the objects.

show databases;
use database_name;
show tables;
show table status;
show index from table_name;
show create table table_name;

Similar to Oracle you can describe the table:

desc table_name;

Where is my data stored?

There is no dedicated internal storage like ASM in MySQL. All data files are placed in the regular OS mount points. With a default installation, you can find your data in:


The location is based on the variable datadir.

root@mysql-3:~# cat /etc/mysql/my.cnf | grep datadir

You will see there a directory for each database.

Depending on the version and storage engine (yes there are a few here), the database’s directory may contain files of the format *.frm, which define the structure of each table within the database. For MyISAM tables, the data (*.MYD) and indexes (*.MYI) are stored within this directory also.

InnoDB tables are stored in InnoDB tablespaces. Each of which consists of one or more files, which are similar to Oracle tablespaces. In a default installation, all InnoDB data and indexes for all databases on a MySQL server are held in one tablespace, consisting of one file: /var/lib/mysql/ibdata1. In most setups, you don’t manage tablespaces like in Oracle. The best practice is to keep them with autoextend on and max size unlimited.

root@mysql-3:~# cat /etc/mysql/my.cnf | grep innodb-data-file-path
innodb-data-file-path = ibdata1:100M:autoextend

InnoDB has log files, which are the equivalent of Oracle redo logs, allowing automatic crash recovery. By default there are two log files: /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1. Undo data is held within the tablespace file.

root@galera-3:/var/lib/mysql# ls -rtla | grep logfile
-rw-rw----  1 mysql mysql  268435456 Dec 15 00:59 ib_logfile1
-rw-rw----  1 mysql mysql  268435456 Mar  6 11:45 ib_logfile0

Where is the metadata information?

There are no dba_*, user_*, all_* type of views but MySQL has internal metadata views.

Information_schema is defined in the SQL 2003 standard and is implemented by other major databases, e.g. SQL Server, PostgreSQL.

Since MySQL 5.0, the information_schema database has been available, containing data dictionary information. The information was actually stored in the external FRM files. Finally, after many years .frm files are gone in version 8.0. The metadata is still visible in the information_schema database but uses the InnoDB storage engine.

To see all actual views contained in the data dictionary within the mysql client, switch to information_schema database:

use information_schema;
show tables;

You can find additional information in the MySQL database,which contains information about db, event (MySQL jobs), plugins, replication, database, users etc.

The number of views depends on the version and vendor.

Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

Select * from v$session

Oracle’s select * from v$session is represented here with the command SHOW PROCESSLIST which shows the list of threads.

| Id      | User             | Host             | db                 | Command | Time   | State              | Info             | Rows_sent | Rows_examined |
|       1 | system user      |                  | NULL               | Sleep   | 469264 | wsrep aborter idle | NULL             |         0 |             0 |
|       2 | system user      |                  | NULL               | Sleep   | 469264 | NULL               | NULL             |         0 |             0 |
|       3 | system user      |                  | NULL               | Sleep   | 469257 | NULL               | NULL             |         0 |             0 |
|       4 | system user      |                  | NULL               | Sleep   | 469257 | NULL               | NULL             |         0 |             0 |
|       6 | system user      |                  | NULL               | Sleep   | 469257 | NULL               | NULL             |         0 |             0 |
|      16 | maxscale         |  | NULL               | Sleep   |      5 |                    | NULL             |         4 |             4 |
|      59 | proxysql-monitor |  | NULL               | Sleep   |      7 |                    | NULL             |         0 |             0 |
|      81 | proxysql-monitor |  | NULL               | Sleep   |      6 |                    | NULL             |         0 |             0 |
|    1564 | proxysql-monitor |  | NULL               | Sleep   |      3 |                    | NULL             |         0 |             0 |
| 1822418 | cmon             | | information_schema | Sleep   |      0 |                    | NULL             |         0 |             8 |
| 1822631 | cmon             | | information_schema | Sleep   |      4 |                    | NULL             |         1 |             1 |
| 1822646 | cmon             | | information_schema | Sleep   |      0 |                    | NULL             |       464 |           464 |
| 2773260 | backupuser       | localhost        | mysql              | Query   |      0 | init               | SHOW PROCESSLIST |         0 |             0 |

13 rows in set (0.00 sec)

It is based on information stored in the information_schema.processlist view. The view requires to have the PROCESS privilege. It can also help you to check if you are running out of the maximum number of processes.

Where is an alert log?

The error log can be found in my.cnf or via show variables command.

mysql> show variables like 'log_error';
| Variable_name | Value                    |
| log_error     | /var/lib/mysql/error.log |
1 row in set (0.00 sec)

Where is the list of the users and their permissions?

The information about users is stored in the mysql.user table, while the grants are stored in several places including the mysql.user, mysql.tables_priv,

MySQL user access is defined in:

mysql.columns_priv, mysql.tables_priv, mysql.db,mysql.user

The preferable way to list grants is to use pt-grants, the tool from Percona toolkit (a must-have for every MySQL DBA).

pt-show-grants --host localhost --user root --ask-pass

Alternatively, you can use the following query (created by Calvaldo)

    CONCAT("`",gcl.Db,"`") AS 'Database(s) Affected',
    CONCAT("`",gcl.Table_name,"`") AS 'Table(s) Affected',
    gcl.User AS 'User-Account(s) Affected',
    IF(gcl.Host='%','ALL',gcl.Host) AS 'Remote-IP(s) Affected',
    CONCAT("GRANT ",UPPER(gcl.Column_priv)," (",GROUP_CONCAT(gcl.Column_name),") ",
                 "ON `",gcl.Db,"`.`",gcl.Table_name,"` ",
                 "TO '",gcl.User,"'@'",gcl.Host,"';") AS 'GRANT Statement (Reconstructed)'
FROM mysql.columns_priv gcl
GROUP BY CONCAT(gcl.Db,gcl.Table_name,gcl.User,gcl.Host)
/* SELECT * FROM mysql.columns_priv */


/* [Database.Table]-Specific Grants */
    CONCAT("`",gtb.Db,"`") AS 'Database(s) Affected',
    CONCAT("`",gtb.Table_name,"`") AS 'Table(s) Affected',
    gtb.User AS 'User-Account(s) Affected',
    IF(gtb.Host='%','ALL',gtb.Host) AS 'Remote-IP(s) Affected',
        "GRANT ",UPPER(gtb.Table_priv)," ",
        "ON `",gtb.Db,"`.`",gtb.Table_name,"` ",
        "TO '",gtb.User,"'@'",gtb.Host,"';"
    ) AS 'GRANT Statement (Reconstructed)'
FROM mysql.tables_priv gtb
WHERE gtb.Table_priv!=''
/* SELECT * FROM mysql.tables_priv */


/* Database-Specific Grants */
    CONCAT("`",gdb.Db,"`") AS 'Database(s) Affected',
    "ALL" AS 'Table(s) Affected',
    gdb.User AS 'User-Account(s) Affected',
    IF(gdb.Host='%','ALL',gdb.Host) AS 'Remote-IP(s) Affected',
        'GRANT ',
            IF(gdb.Create_tmp_table_priv='Y','CREATE TEMPORARY TABLES',NULL),
            IF(gdb.Lock_tables_priv='Y','LOCK TABLES',NULL),
            IF(gdb.Create_view_priv='Y','CREATE VIEW',NULL),
            IF(gdb.Show_view_priv='Y','SHOW VIEW',NULL),
            IF(gdb.Create_routine_priv='Y','CREATE ROUTINE',NULL),
            IF(gdb.Alter_routine_priv='Y','ALTER ROUTINE',NULL),
        " ON `",gdb.Db,"`.* TO '",gdb.User,"'@'",gdb.Host,"';"
    ) AS 'GRANT Statement (Reconstructed)'
FROM mysql.db gdb
WHERE gdb.Db != ''
/* SELECT * FROM mysql.db */


/* User-Specific Grants */
    "ALL" AS 'Database(s) Affected',
    "ALL" AS 'Table(s) Affected',
    gus.User AS 'User-Account(s) Affected',
    IF(gus.Host='%','ALL',gus.Host) AS 'Remote-IP(s) Affected',
        "GRANT ",
                "ALL PRIVILEGES",
                    IF(gus.Show_db_priv='Y','SHOW DATABASES',NULL),
                    IF(gus.Create_tmp_table_priv='Y','CREATE TEMPORARY TABLES',NULL),
                    IF(gus.Lock_tables_priv='Y','LOCK TABLES',NULL),
                    IF(gus.Repl_slave_priv='Y','REPLICATION SLAVE',NULL),
                    IF(gus.Repl_client_priv='Y','REPLICATION CLIENT',NULL),
                    IF(gus.Create_view_priv='Y','CREATE VIEW',NULL),
                    IF(gus.Show_view_priv='Y','SHOW VIEW',NULL),
                    IF(gus.Create_routine_priv='Y','CREATE ROUTINE',NULL),
                    IF(gus.Alter_routine_priv='Y','ALTER ROUTINE',NULL),
                    IF(gus.Create_user_priv='Y','CREATE USER',NULL),
                    IF(gus.Create_tablespace_priv='Y','CREATE TABLESPACE',NULL)
        " ON *.* TO '",gus.User,"'@'",gus.Host,"' REQUIRE ",
        CASE gus.ssl_type
            WHEN 'ANY' THEN
                "SSL "
            WHEN 'X509' THEN
                "X509 "
                CONCAT_WS("AND ",
                    IF((LENGTH(gus.ssl_cipher)>0),CONCAT("CIPHER '",CONVERT(gus.ssl_cipher USING utf8),"' "),NULL),
                    IF((LENGTH(gus.x509_issuer)>0),CONCAT("ISSUER '",CONVERT(gus.ssl_cipher USING utf8),"' "),NULL),
                    IF((LENGTH(gus.x509_subject)>0),CONCAT("SUBJECT '",CONVERT(gus.ssl_cipher USING utf8),"' "),NULL)
            ELSE "NONE "
        "WITH ",
        IF(gus.Grant_priv='Y',"GRANT OPTION ",""),
        "MAX_QUERIES_PER_HOUR ",gus.max_questions," ",
        "MAX_CONNECTIONS_PER_HOUR ",gus.max_connections," ",
        "MAX_UPDATES_PER_HOUR ",gus.max_updates," ",
        "MAX_USER_CONNECTIONS ",gus.max_user_connections,
    ) AS 'GRANT Statement (Reconstructed)'
FROM mysql.user gus;

How to create a mysql user

The ‘create user’ procedure is similar to Oracle. The simplest example could be:

CREATE user 'username'@'hostname' identified by 'password';
GRANT privilege_name on *.* TO 'username'@'hostname';

The option to grant and create in one line with:

GRANT privilege_name  ON *.* TO 'username'@'hostname' identified by 'password';

has been removed in MySQL 8.0.

How do I start and stop MySQL?

You can stop and start MySQL with the service.

The actual command depends on the Linux distribution and the service name.

Below you can find an example with the service name mysqld.


/etc/init.d/mysqld start 
/etc/init.d/mysqld stop 
/etc/init.d/mysqld restart


service mysqld start 
service mysqld stop 
service mysqld restart
systemctl start mysqld.service
systemctl stop mysqld.service
systemctl restart mysqld.service

Where is the MySQL Server Configuration data?

The configuration is stored in the my.cnf file.

Until version 8.0, any dynamic setting change that should remain after a restart required a manual update of the my.cnf file. Similar to Oracle’s scope=both, you can change values using the persistent option.

mysql> SET PERSIST max_connections = 1000;
mysql> SET @@PERSIST.max_connections = 1000;

For older versions use:

mysql> SET GLOBAL max_connections = 1000;
$ vi /etc/mysql/my.cnf
SET GLOBAL max_connections = 1000;

How do I backup MySQL?

There are two ways to execute a mysql backup.

For smaller databases or smaller selective backups, you can use the mysqldump command.

Database backup with mysqldump (logical backup):

mysqldump -uuser -p --databases db_name --routines --events --single-transaction | gzip > db_name_backup.sql.gz

xtrabackup, mariabackup (hot binary backup)

The preferable method is to use xtrabackup or mariabackup, external tools to run hot binary backups.

Oracle offers hot binary backup in the paid version called MySQL Enterprise Edition.

mariabackup --user=root --password=PASSWORD --backup --target-dir=/u01/backups/

Stream backup to other server

Start a listener on the external server on the preferable port (in this example 1984)

nc -l 1984 | pigz -cd - | pv | xbstream -x -C /u01/backups

Run backup and transfer to external host

innobackupex --user=root --password=PASSWORD --stream=xbstream /var/tmp | pigz  | pv | nc 1984

Copy user permission

It’s often needed to copy user permission and transfer them to the other servers.

The recommended way to do this is to use pt-show-grants.

pt-show-grants > /u01/backups

How do I restore MySQL?

Logical backup restore

MySQLdump creates the SQL file, which can be executed with the source command.

To keep the log file of the execution, use the tee command.

mysql> tee dump.log
mysql> source mysqldump.sql

Binary backup restore (xtrabackup/mariabackup)

To restore of MySQL from the binary backup you need to first restore the files and then apply the log files.

You can compare this process to restore and recover in Oracle.

xtrabackup --copy-back --target-dir=/var/lib/data
innobackupex --apply-log --use-memory=[values in MB or GB] /var/lib/data

Hopefully, these tips give a good overview of how to perform basic administrative tasks.

by Bart Oles at March 11, 2019 12:57 PM

March 09, 2019

Valeriy Kravchuk

Fun with Bugs #81 - On MySQL Bug Reports I am Subscribed to, Part XVII

Two weeks passed since my previous review of public MySQL bug reports I consider interesting enough to subscribe to them. Over this period I picked up a dozen or so new public bug reports that I'd like to briefly review today.

Here is my recent subscriptions list, starting from the oldest bug reports:
  • Bug #94431 - "Can't upgrade from 5.7 to 8.0 if any database have a hyphen in their name". It seems one actually needs a database like that created in MySQL 5.6 with at least one InnoDB table having FULLTEXT index to hit the problem. Great finding by Phil Murray. Note that after several unsuccessful attempts by others the bug was eventually reproduced and verified by Jesper Wisborg Krogh. Let's hope we'll see it fixed in MySQL 8.0.16.
  • Bug #94435 - "mysql command hangs up and cosume CPU almost 100%". It was reported by Masaaki HIROSE, whose previous related/similar Bug #94219 - "libmysqlclient enters and infinite loop and consume CPU usage 100%" ended up as "Not a bug" (wrongly, IMHO, as nobody cared enough to reproduce the steps instead of commenting on their correctness and checking something else). Bug reporter had not only insisted and provided all the details, but also tried to analyze the reasons of the bug and provided links to other potentially related bug reports (Bug #88428 - "mysql_real_query hangs with EINTR errno (using YASSL)" and Bug #92394 - "libmysqlclient enters infinite loop after signal (race condition)"). Great job and nice to see the bug "Verified" eventually.
  • Bug #94441 - "empty ibuf aio reads in innodb status". This regression vs MySQL 5.6 was noted by Nikolai Ikhalainen from Percona. MariaDB 10.3.7 is also affected, unfortunately:
    I/O thread 9 state: native aio handle (write thread)
    Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
     ibuf aio reads:, log i/o's:, sync i/o's:Pending flushes (fsync) log: 0; buffer pool: 0
    1344 OS file reads, 133 OS file writes, 2 OS fsyncs
  • Bug #94448 - "Rewrite LOG_BLOCK_FIRST_REC_GROUP during recovery may be dangerous.". Yet another MySQL 8 regression (not marked with "regression" tag) was found by Kang Wang.
  • Bug #94476 - "mysql semisync replication stuck with master in Waiting to finalize termination". It has "Need feedback" status at the moment. I've subscribed to this report from Shirish Keshava Murthy mostly to find out how a report that may look like a free support request will be processed by Oracle engineers. Pure curiosity, for now.
  • Bug #94504 - "AIO::s_log seems useless". This problem was reported by Yuhui Wang. It's a regression in a sense that part of the code is no longer needed (and seems not to be used) in MySQL 8, but still remains.
  • Bug #94541 - "Assertion on import via Transportable Tablespace". This bug reported by  Daniël van Eeden was verified based on code review and some internal discussion. We do not know if any other version besides 5.7.25 is affected, though. The assertion itself:
    InnoDB: Failing assertion: btr_page_get_prev(next_page, mtr) == btr_pcur_get_block(cursor)->
    does not seem to be unique. We can find it in MDEV-18455 also (in other context).
  • Bug #94543 - "MySQL does not compile with protobuf 3.7.0". I care about build/compiling bugs historically, as I mostly use MySQL binaries that I built myself from GitHub source. So, I've immediately subscribed to this bug report from Laurynas Biveinis.
  • Bug #94548 - "Optimizer error evaluating JSON_Extract". This bug was reported by Dave Pullin. From my quick test it seems MariaDB 10.3.7 is also affected. Error message is different in the failing case, but the point is the same - the function is not evaluated if the column from derived table that is built using the function is not referenced in the SELECT list. This optimization is questionable and may lead to hidden "bombs" in the application code.
  • Bug #94550 - "generated columns referring to current_timestamp fail". I tried to check simple test case in this bug report by Mario Beck on MariaDB 10.3.7, but it does not seem to accept NOT NULL constraint for generated stored columns at all:
    MariaDB [test]> CREATE TABLE `t2` (
    -> `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    -> `content` varchar(42) DEFAULT NULL,
    -> `bucket` tinyint(4) GENERATED ALWAYS AS ((floor((to_seconds(`created_at
    `) / 10)) % 3)) STORED NOT NULL);
    ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
    corresponds to your MariaDB server version for the right syntax to use near 'NOT
    NULL)' at line 4
    I do not see this option in formal syntax described here as well. But in case of MariaDB we can actually make sure the generated column is never NULL by adding CHECK constraint like this:
    MariaDB [test]> CREATE TABLE `t2` (    ->   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
        ->   `content` varchar(42) DEFAULT NULL,
        ->   `bucket` tinyint(4) GENERATED ALWAYS AS ((floor((to_seconds(`created_at`) / 10)) % 3)) STORED);
    Query OK, 0 rows affected (0.434 sec)

    MariaDB [test]> INSERT INTO t2 (content) VALUES ("taraaaa");
    Query OK, 1 row affected (0.070 sec)

    MariaDB [test]> alter table t2 add constraint cnn CHECK (`bucket` is NOT NULL);
    Query OK, 1 row affected (1.159 sec)
    Records: 1  Duplicates: 0  Warnings: 0

    MariaDB [test]> INSERT INTO t2 (content) VALUES ("tarabbb");
    Query OK, 1 row affected (0.029 sec)

    MariaDB [test]> INSERT INTO t2 (content) VALUES ("");
    Query OK, 1 row affected (0.043 sec)

    MariaDB [test]> select * from t2;
    | created_at          | content | bucket |
    | 2019-03-09 17:28:03 | taraaaa |      0 |
    | 2019-03-09 17:29:43 | tarabbb |      1 |
    | 2019-03-09 17:29:50 |         |      2 |
    3 rows in set (0.002 sec)

    MariaDB [test]> show create table t2\G*************************** 1. row ***************************
           Table: t2
    Create Table: CREATE TABLE `t2` (
      `created_at` timestamp NOT NULL DEFAULT current_timestamp(),
      `content` varchar(42) DEFAULT NULL,
      `bucket` tinyint(4) GENERATED ALWAYS AS (floor(to_seconds(`created_at`) / 10)
    MOD 3) STORED,
      CONSTRAINT `cnn` CHECK (`bucket` is not null)

    1 row in set (0.011 sec)
    So, maybe after all we can state that MariaDB is NOT affected.
  • Bug #94552 - "innodb.virtual_basic fails when valgrind is enabled". I still wonder if anyone in Oracle runs MTR test suite on Valgrind-enabled (-DWITH_VALGRIND=1 cmake option) at least in the process of official release (and if they check the failures). It seems not to be the case based on this bug report from Manuel Ung.
  • Bug #94553 - "Crash in trx_undo_rec_copy". Bernardo Perez noted that as a side effect of still "Verified" Bug #82734 - "trx_undo_rec_copy needlessly relies on buffer pool page alignment" (that affects both MySQL 5.7 and 8.0) we may get crashes while working with generated columns. I hope to see them both fixed soon, but for now Bug #94553 has status "Need Feedback", probably in a hope to get a repeatable test case. I'll watch it carefully.
  • Bug #94560 - "record comparison in spatial index non-leaf rtree node seems incorrect". I doubt spatial indexes of InnoDB are widely used, and I have no doubts there are many bugs waiting to be discovered in this area. This specific bug was reported by Jie Zhou who had also suggested a fix.
  • Bug #94610 - "Server stalls because ALTER TABLE on partitioned table holds dict mutex". My former colleague Justin Swanhart reported this bug just yesterday, so no wonder it is not verified yet. It refers to a well known verified old Bug #83435 - "ALTER TABLE is very slow when using PARTITIONED table"  (that I've also subscribed to immediately) from Roel Van de Paar, affecting both MySQL 5.6 and 5.7. I hope to see this bug verified and fixed soon, as recently I see this kind of state for main thread:
    Main thread process no. 3185, id 140434206619392, state: enforcing dict cache limit
    too often in INNODB STATUS outputs to my liking...
As you could note, I still try to check (at least in some cases) if MariaDB is also affected by the same problem. I think it's a useful check both for me (as I work mostly with MariaDB as a support engineer) and for the reader (to know if switching to MariaDB may help in any way or if there are any chances for MariaDB engineers to contribute anything useful, like a fix).

"Hove, actually". For years residents of Hove used this humorous reply when they live in Brighton... "Regression, actually" is what I want to say (seriously) about every other MySQL bug report I subscribe to... So, you see Hove and many regression bugs above!
To summarize:
  1. Sometimes Oracle engineers demonstrate proper collective effort to understand and carefully verify public bug reports. Good to know they are not ready to give up fast!
  2. I have to copy-paste this item from my previous post. As the list above proves, Oracle engineers still do not use "regression" tag when setting "Verified" status for obviously regression bugs. I think bug reporters should care then to always set it when they report regression of any kind.
  3. It seems there no regular MTR test runs for Valgrind builds performed by Oracle engineers, or maybe they just ignore failures.

by Valerii Kravchuk ( at March 09, 2019 09:17 PM

March 08, 2019

Jean-Jerome Schmidt

High Availability on a Shoestring Budget - Deploying a Minimal Two Node MySQL Galera Cluster

We regularly get questions about how to set up a Galera cluster with just 2 nodes.

The documentation clearly states you should have at least 3 Galera nodes to avoid network partitioning. But there are some valid reasons for considering a 2 node deployment, e.g., if you want to achieve database high availability but have a limited budget to spend on a third database node. Or perhaps you are running Galera in a development/sandbox environment and prefer a minimal setup.

Galera implements a quorum-based algorithm to select a primary component through which it enforces consistency. The primary component needs to have a majority of votes, so in a 2 node system, there would be no majority resulting in split brain. Fortunately, it is possible to add a garbd (Galera Arbitrator Daemon), which is a lightweight stateless daemon that can act as the odd node. Arbitrator failure does not affect the cluster operations and a new instance can be reattached to the cluster at any time. There can be several arbitrators in the cluster.

ClusterControl has support for deploying garbd on non-database hosts.

Normally a Galera cluster needs at least three hosts to be fully functional, however, at deploy time, two nodes would suffice to create a primary component. Here are the steps:

  1. Deploy a Galera cluster of two nodes,
  2. After the cluster has been deployed by ClusterControl, add garbd on the ClusterControl node.

You should end up with the below setup:

Deploy the Galera Cluster

Go to the ClusterControl Deploy section to deploy the cluster.

After selecting the technology that we want to deploy, we must specify User, Key or Password and port to connect by SSH to our hosts. We also need the name for our new cluster and if we want ClusterControl to install the corresponding software and configurations for us.

After setting up the SSH access information, we must select vendor/version and we must define the database admin password, datadir and port. We can also specify which repository to use.

Even though ClusterControl warns you that a Galera cluster needs an odd number of nodes, only add two nodes to the cluster.

Deploying a Galera cluster will trigger a ClusterControl job which can be monitored at the Jobs page.

Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

Install Garbd

Once deployment is complete, install garbd on the ClusterControl host. We have the option to deploy garbd from ClusterControl, but this option won’t work if we want to deploy it in the same ClusterControl server. This is to avoid some issue related to the database versions and package dependencies.

So, we must install it manually, and then import garbd to ClusterControl.

Let’s see the manual installation of Percona Garbd on CentOS 7.

Create the Percona repository file:

$ vi /etc/yum.repos.d/percona.repo
name = Percona-Release YUM repository - $basearch
baseurl =$releasever/RPMS/$basearch
enabled = 1
gpgcheck = 0
name = Percona-Release YUM repository - noarch
baseurl =$releasever/RPMS/noarch
enabled = 1
gpgcheck = 0
name = Percona-Release YUM repository - Source packages
baseurl =$releasever/SRPMS
enabled = 0
gpgcheck = 0

Then, install the Percona XtraDB Cluster garbd package:

$ yum install Percona-XtraDB-Cluster-garbd-57

Now, we need to configure garbd. For this, we need to edit the /etc/sysconfig/garb file:

$ vi /etc/sysconfig/garb
# Copyright (C) 2012 Codership Oy
# This config file is to be sourced by garb service script.
# A comma-separated list of node addresses (address[:port]) in the cluster
# Galera cluster name, should be the same as on the rest of the nodes.
# Optional Galera internal options string (e.g. SSL settings)
# see
# Log file for garbd. Optional, by default logs to syslog
# Deprecated for CentOS7, use journalctl to query the log for garbd

Change the GALERA_NODES and GALERA_GROUP parameter according to the Galera nodes configuration. We also need to remove the line # REMOVE THIS AFTER CONFIGURATION before starting the service.

And now, we can start the garb service:

$ service garb start
Redirecting to /bin/systemctl start garb.service

Now, we can import the new garbd into ClusterControl.

Go to ClusterControl -> Select Cluster -> Add Load Balancer.

Then, select Garbd and Import Garbd section.

Here we only need to specify the hostname or IP Address and the port of the new Garbd.

Importing garbd will trigger a ClusterControl job which can be monitored at the Jobs page. Once completed, you can verify garbd is running with a green tick icon at the top bar:

That’s it!

Our minimal two-node Galera cluster is now ready!

by Sebastian Insausti at March 08, 2019 03:23 PM

March 07, 2019

Peter Zaitsev

Reducing High CPU on MySQL: a Case Study

CPU Usage after query tuning

In this blog post, I want to share a case we worked on a few days ago. I’ll show you how we approached the resolution of a MySQL performance issue and used Percona Monitoring and Management PMM to support troubleshooting. The customer had noticed a linear high CPU usage in one of their MySQL instances and was not able to figure out why as there was no much traffic hitting the app. We needed to reduce the high CPU usage on MySQL. The server is a small instance:

Models | 6xIntel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz

This symptom can be caused by various different reasons. Let’s see how PMM can be used to troubleshoot the issue.


The original issue - CPU usage at almost 100% during application use

It’s important to understand where the CPU time is being consumed: user space, system space, iowait, and so on. Here we can see that CPU usage was hitting almost 100% and the majority of the time was being spent on user space. In other words, the time the CPU was executing user code, such as MySQL. Once we determined that the time was being spent on user space, we could discard other possible issues. For example, we could eliminate the possibility that a high amount of threads were competing for CPU resources, since that would cause an increase in context switches, which in turn would be taken care of by the kernel – system space.

With that we decided to look into MySQL metrics.


Thread activity graph in PMM for MySQL

Queries per second

As expected, there weren’t a lot of threads running—10 on average—and MySQL wasn’t being hammered with questions/transactions. It was running from 500 to 800 QPS (queries per second). Next step was to check the type of workload that was running on the instance:

All the commands are of a SELECT type, in red in this graph

In red we can see that almost all commands are SELECTS. With that in mind, we checked the handlers using 

 to verify if those selects were doing an index scan, a full table scan or what.

Showing that the query was a full table scan

Blue in this graph represents

 , which is the counter MySQL increments every time it reads a row when it’s doing a full table scan. Bingo!!! Around 350 selects were reading 2.5 million rows. But wait—why was this causing CPU issues rather than IO issues? If you refer to the first graph (CPU graph) we cannot see iowait.

That is because the data was stored in the InnoDB Buffer Pool, so instead of having to read those 2.5M rows per second from disk, it was fetching them from memory. The stress had moved from disk to CPU. Now that we identified that the issue had been caused by some queries or query, we went to QAN to verify the queries and check their status:

identifying the long running query in QAN

First query, a

  on table 
 was responsible for 98% of the load and was executing in 20+ seconds.

The initial query load

EXPLAIN confirmed our suspicions. The query was accessing the table using type ALL, which is the last type we want as it means “Full Table Scan”. Taking a look into the fingerprint of the query, we identified that it was a simple query:

Fingerprint of query
Indexes on table did not include a key column

The query was filtering clients based on the status field

SELECT * FROM store.clients WHERE status = ?
 As shown in the indexes, that column was not indexed. Talking with the customer, this turned out to be a query that was introduced as part of a new software release.

From that point, we were confident that we had identified the problem. There could be more, but this particular query was definitely hurting the performance of the server. We decided to add an index and also sent an annotation to PMM, so we could refer back to the graphs to check when the index has been added, check if CPU usage had dropped, and also check Handler_read_rnd_next.

To run the alter we decided to use pt-online-schema-change as it was a busy table, and the tool has safeguards to prevent the situation from becoming even worse. For example, we wanted to pause or even abort the alter in the case of the number of Threads_Running exceeding a certain threshold. The threshold is controlled by

  (25 by default) and
  (50 by default):

pmm-admin annotate "Started ALTER store.clients ADD KEY (status)" && \
pt-online-schema-change --alter "ADD KEY (status)" --execute u=root,D=store,t=clients && \
pmm-admin annotate "Finished ALTER store.clients ADD KEY (status)"
Your annotation was successfully posted.
No slaves found. See --recursion-method if host localhost.localdomain has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
analyze_table, 10, 1
copy_rows, 10, 0.25
create_triggers, 10, 1
drop_triggers, 10, 1
swap_tables, 10, 1
update_foreign_keys, 10, 1
Altering `store`.`clients`...
Creating new table...
Created new table store._clients_new OK.
Altering new table...
Altered `store`.`_clients_new` OK.
2019-02-22T18:26:25 Creating triggers...
2019-02-22T18:27:14 Created triggers OK.
2019-02-22T18:27:14 Copying approximately 4924071 rows...
Copying `store`.`clients`: 7% 05:46 remain
Copying `store`.`clients`: 14% 05:47 remain
Copying `store`.`clients`: 22% 05:07 remain
Copying `store`.`clients`: 30% 04:29 remain
Copying `store`.`clients`: 38% 03:59 remain
Copying `store`.`clients`: 45% 03:33 remain
Copying `store`.`clients`: 52% 03:06 remain
Copying `store`.`clients`: 59% 02:44 remain
Copying `store`.`clients`: 66% 02:17 remain
Copying `store`.`clients`: 73% 01:50 remain
Copying `store`.`clients`: 79% 01:23 remain
Copying `store`.`clients`: 87% 00:53 remain
Copying `store`.`clients`: 94% 00:24 remain
2019-02-22T18:34:15 Copied rows OK.
2019-02-22T18:34:15 Analyzing new table...
2019-02-22T18:34:15 Swapping tables...
2019-02-22T18:34:27 Swapped original and new tables OK.
2019-02-22T18:34:27 Dropping old table...
2019-02-22T18:34:32 Dropped old table `store`.`_clients_old` OK.
2019-02-22T18:34:32 Dropping triggers...
2019-02-22T18:34:32 Dropped triggers OK.
Successfully altered `store`.`clients`.
Your annotation was successfully posted.


MySQL Handlers after query tuning MySQL query throughput after query tuning
Query analysis by EXPLAIN in PMM after tuning

As we can see, above, CPU usage dropped to less than 25%, which is 1/4 of the previous usage level. Handler_read_rnd_next dropped and we can’t even see it once pt-osc has finished. We had a small increase on Handler_read_next as expected because now MySQL is using the index to resolve the WHERE clause. One interesting outcome is that the instance was able to increase it’s QPS by 2x after the index was added as CPU/Full Table Scan was no longer limiting performance. On average, query time has dropped from 20s to only 661ms.


  1. Applying the correct troubleshooting steps to your problems is crucial:
    a) Understand what resources have been saturated.
    b) Understand what if anything is causing an error.
    c) From there you can divert into the areas that are related to that resource and start to narrow down the issue.
    d) Tackle the problems bit by bit.
  2. Having the right tools for the job key for success. PMM is a great example of a tool that can help you quickly identify, drill in, and fix bottlenecks.
  3. Have realistic load tests. In this case, they had tested the new release on a concurrency level that was not like their production
  4. By identifying the culprit query we were able to:
    a.) Drop average query time from 20s to 661ms
    b.) Increase QPS by 2x
    c.) Reduce the usage of CPU to 1/4 of its level prior to our intervention

Disclosure: For security reasons, sensitive information, such as database, table, column names have been modified and graphs recreated to simulate a similar problem.

by Marcelo Altmann at March 07, 2019 03:17 PM

March 06, 2019

Peter Zaitsev

Settling the Myth of Transparent HugePages for Databases

The concept of Linux HugePages has existed for quite a while: for more than 10 years, introduced to Debian in 2007 with kernel version 2.6.23. Whilst a smaller page size is useful for general use, some memory intensive applications may gain performance by using bigger memory pages. By having bigger memory chunks available to them, they can reduce lookup time as well as improve the performance of read/write operations. To be able to make use of HugePages, applications need to carry the specific code directive, and changing applications across the board is not necessarily a simple task. So enter Transparent HugePages (THP).

By reputation, THPs are said to have a negative impact on performance. For this post, I set out to either prove or debunk the case for the use of THPs for database applications.

The Linux context

On Linux – and for that matter all operating systems that I know of – memory is divided into small chunks called pages. A typical memory page size is set to 4k. You can obtain the value of page size on Linux using getconf.

# getconf PAGE_SIZE

Generally, the latest processors support multiple page sizes. However, Linux defaults to a minimal 4k page size. For a system with 64GB physical memory, this memory will be divided into more than 16 million pages. Linking between these pages and physical memory (which is called page table walking) is undertaken by the CPU’s memory management unit (MMU). To optimize page lookup, CPU maintains a cache of recently used pages called the Table Lookaside Buffer (TLB). The higher the number of pages, the lower the percentage of pages that are maintained in TLB. This translates to a higher cache miss ratio. With every cache miss, a more expensive search must be done via page table walking. In effect, that leads to a degradation in performance.

So what if we could increase the page size? We could then reduce the number of pages accessed, and reduce the cost of page walking. Cache hit ratio might then improve because more relevant data now fits in one page rather than multiple pages.

The Linux kernel will always try to allocate a HugePage (if enabled) and will fall back to the default 4K if a contiguous chunk of the required memory size is not available in the required memory space.

The implication for applications

As mentioned, for an application to make use of HugePages it has to contain an explicit instruction to do so. It’s not always practical to change applications in this way so there’s another option.

Transparent HugePages provides a layer within the Linux kernel – probably since version 2.6.38 – which if enabled can potentially allocate HugePages for applications without them actually “knowing” it; hence the transparency. The expectation is that this will improve application performance.

In this blog, I’ll attempt to find the reasons why THP might help improve database performance. There’s a lot of discussion amongst database experts that classic HugePages give a performance gain, but you’ll see a performance hit with Transparent HugePages. I decided to take up the challenge and perform various benchmarks, with different settings, and with different workloads.

So do Transparent HugePages (THP) improve application performance? More specifically, do they improve performance for database workloads? Most industry standard databases recommend disabling THP and enabling HugePages alone.

So is this a myth or does THP degrade performance for databases? Time to break this myth.

Enabling THP

The current setting can be seen using the command line

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Temporary Change

It can be enabled or disabled using the command line.

# echo never > /sys/kernel/mm/transparent_hugepage/enabled

Permanent Change via grub

Or by setting grub parameter  in 


You can choose one of the three configurations for THP; enable, disable, or “madvise”. Whilst enable and disable options are self-explanatory, madvise allows applications that are optimized for HugePages to use THP.  Applications can use Transparent HugePages by making the madvise system call.

Why was the madvise option added? We will discuss that in a later section.

Transparent HugePages problems

The khugepaged CPU usage

The allocation of a HugePage can be tricky. Whilst traditional HugePages are reserved in virtual memory, THPs are not. In the background, the kernel attempts to allocate a THP, and if it fails, will default to the standard 4k page. This all happens transparently to the user.

The allocation process can potentially involve a number of kernel processes which may include kswapd, defrag, and kcompactd. All of these are responsible for making space in the virtual memory for a future THP. When required, the allocation is made by another kernel process; khugepaged. This process manages Transparent HugePages.


It depends on how khugepaged is configured, but since no memory is reserved beforehand, there is potential for performance degradation. With every attempt to allocate a HugePage, potentially a number of kernel processes are invoked. These carry out certain actions to make enough room in the virtual memory for a THP allocation. Although no notifications are provided to the application, precious resources are spent, and this can lead to spikes in performance with any dips indicating an attempt to allocate THP.

Memory Bloating

HugePages are for not for every application. For example, an application that wants to allocate only one byte of data would be better off using a 4k page rather than a huge one. That way, memory is more efficiently used. To prevent this, one option is to configure THP to “madvise”. By doing this, HugePages are disabled system-wide but are available to applications that make a madvise call to allocate THP in the madvise memory region.


Linux kernel keeps track of memory pages and differentiates between pages are that are actively being used and the ones that are not immediately required. It may load or unload a page from active memory to disk if that page is no longer required or vice versa.

When page size is 4k, these memory operations are understandably fast. However, consider a 1GB page size: there will a significant performance hit when such a page is swapped out. When a THP is swapped out, it is split in standard page sizes. Unlike conventional HugePages which are reserved in RAM and are never swapped, THPs are swappable pages. They could, therefore, potentially be swapped causing a dip in performance. Although in recent years, there have been loads of performance improvements around swapping out the THPs process, it still does impact performance negatively.


I decided to benchmark with and without Transparent HugePages enabled. Initially, I used pgbench – a PostgreSQL benchmarking tool based on TPCB – for a duration of ten minutes. The benchmark used a mixed mode of READ/WRITE. The results with and without the Transparent HugePages show no degradation or improvement in the benchmark. To be sure, I repeated the same benchmark for 60 minutes and got almost the same results.  I performed another benchmark with a TPCC workload using the sysbench benchmarking tool. The results are almost the same.

Benchmark Machine

  • Supermicro server:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
    • Memory: 256GB of RAM
    • Storage: SAMSUNG  SM863 1.9TB Enterprise SSD
    • Filesystem: ext4/xfs
  • OS: Linux smblade01 4.15.0-42-generic #45~16.04.1-Ubuntu
  • PostgreSQL: version 11

Benchmark TPCB (pgbench) – 10 Minute duration

The following graphs show results for two different database sizes; 48GB and 112GB with 64, 128 and 256 clients each. All other settings were kept unchanged for these benchmarks to ensure that our results are comparable. It is evident that both lines — representing execution with or without THP — are almost overlapping one another. This suggests no performance gains.

Figure 1.1 PostgreSQL' s Benchmark, 10 minutes execution time where database workload(48GB) < shared_buffer (64GB)

Figure 1.1 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload(48GB) < shared_buffer (64GB)


Figure 1.2 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (48GB) > shared_buffer (64GB)

Figure 1.2 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (48GB) > shared_buffer (64GB)


Figure 1.3 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 1.3 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB) -dTLB-misses


Figure 1.4 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (112GB) > shared_buffer (64GB)

Figure 1.4 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (112GB) > shared_buffer (64GB)-dTLB-misses


Benchmark TPCB (pgbench) – 60 Minute duration

Figure 2.1 PostgreSQL' s Benchmark, 60 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 2.1 PostgreSQL’ s Benchmark, 60 minutes execution time where database workload (48GB) < shared_buffer (64GB)


Figure 2.2 PostgreSQL' s Benchmark, 60 minutes execution time where database workload (112GB) &gt; shared_buffer (64GB)

Figure 2.2 PostgreSQL’ s Benchmark, 60 minutes execution time where database workload (112GB) > shared_buffer (64GB)


Figure 2.3 PostgreSQL' s Benchmark, 60 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 2.3 PostgreSQL’ s Benchmark, 60 minutes execution time where database workload (48GB) < shared_buffer (64GB) -dTLB-misses


Figure 2.4 PostgreSQL' s Benchmark, 60 minutes execution time where database workload (112GB) > shared_buffer (64GB)

Figure 2.4 PostgreSQL’ s Benchmark, 60 minutes execution time where database workload (112GB) > shared_buffer (64GB) -dTLB-misses


Benchmark TPCC (sysbecnch) – 10 Minute duration

Figure 3.1 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (48GB) &lt; shared_buffer (64GB)

Figure 3.1 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 3.2 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (112GB) &gt; shared_buffer (64GB)

Figure 3.2 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (112GB) > shared_buffer (64GB)


Figure 3.3 PostgreSQL' s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 3.3 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB) -dTLB-misses


Figure 3.4 PostgreSQL' s Benchmark, 10 minutes execution time where database workload 112GB) > shared_buffer (64GB)

Figure 3.4 PostgreSQL’ s Benchmark, 10 minutes execution time where database workload 112GB) > shared_buffer (64GB) -dTLB-misses



I attained these results by running different benchmarking tools and evaluating different OLTP benchmarking standards. The results clearly indicate that for these workloads, THP has a negative impact on the overall database performance. Although the performance degradation is negligible, it is, however, clear that there is no performance gain as one might expect. This is very much in line with all the different databases’ recommendation which suggests disabling the THP.

THP may be beneficial for various applications, but it certainly doesn’t give any performance gains when handling an OLTP workload.

We can safely say that the “myth” is derived from experience and that the rumors are true.


  • The complete benchmark data is available at GitHub[1]
  • The complete “nmon” reports, which include CPU, memory etc usage can be found at GitHub[2]
  • This whole benchmark is based around OLTP. Watch out for the OLAP benchmark. Maybe THP will have more effect on this type of workload.

[1] –

[2] –



by Ibrar Ahmed at March 06, 2019 01:07 PM

Jean-Jerome Schmidt

Dealing with Unreliable Networks When Crafting an HA Solution for MySQL or MariaDB

Long gone are the days when a database was deployed as a single node or instance - a powerful, standalone server which was tasked to handle all the requests to the database. Vertical scaling was the way to go - replace the server with another, even more powerful one. During these times, one didn’t really have to be bothered by network performance. As long as the requests were coming in, all was good.

But nowadays, databases are built as clusters with nodes interconnected over a network. It is not always a fast, local network. With businesses reaching global scale, database infrastructure has also to span across the globe, to stay close to customers and to reduce latency. It comes with additional challenges that we have to face when designing a highly available database environment. In this blog post, we will look into the network issues that you may face and provide some suggestions on how to deal with them.

Two Main Options for MySQL or MariaDB HA

We covered this particular topic quite extensively in one of the whitepapers, but let’s look at the two main ways of building high availability for MySQL and MariaDB.

Galera Cluster

Galera Cluster is shared-nothing, virtually synchronous cluster technology for MySQL. It allows to build multi-writer setups that can span across the globe. Galera thrives in low-latency environments but it can also be configured to work with long WAN connections. Galera has a built-in quorum mechanism which ensures that data will not be compromised in case of the network partitioning of some of the nodes.

MySQL Replication

MySQL Replication can be either asynchronous or semi-synchronous. Both are designed to build large scale replication clusters. Like in any other master-slave or primary-secondary replication setup, there can be only one writer, the master. Other nodes, slaves, are used for failover purposes as they contain the copy of the data set from the maser. Slaves can also be used for reading the data and offloading some of the workload from the master.

Both solutions have their own limits and features, both suffer from different problems. Both can be affected by unstable network connections. Let’s take a look at those limitations and how we can design the environment to minimize the impact of an unstable network infrastructure.

Galera Cluster - Network Problems

First, let’s take a look at Galera Cluster. As we discussed, it works best in a low-latency environment. One of the main latency-related problems in Galera is the way how Galera handles the writes. We will not go into all the details in this blog, but further reading in our Galera Cluster for MySQL tutorial. The bottom line is that, due to the certification process for writes, where all nodes in the cluster have to agree on whether the write can be applied or not, your write performance for single row is strictly limited by the network roundtrip time between the writer node and the most far away node. As long as the latency is acceptable and as long as you do not have too many hot spots in your data, WAN setups may work just fine. The problem starts when the network latency spikes from time to time. Writes will then take 3 or 4 times longer than usual and, as a result, databases may start to be overloaded with long-running writes.

One of great features of Galera Cluster is its ability to detect the cluster state and react upon network partitioning. If a node of the cluster cannot be reached, it will be evicted from the cluster and it will not be able to perform any writes. This is crucial in maintaining the integrity of the data during the time when the cluster is split - only the majority of the cluster will accept writes. Minority will complain. To handle this, Galera introduces a vast array of checks and configurable timeouts to avoid false alerts on very transient network issues. Unfortunately, if the network is unreliable, Galera Cluster will not be able to work correctly - nodes will start to leave the cluster, join it later. It will be especially problematic when we have Galera Cluster spanning across WAN - separated pieces of the cluster may disappear randomly if the interconnecting network will not work properly.

How to Design Galera Cluster for an Unstable Network?

First things first, if you have network problems within the single datacenter, there is not much you can do unless you will be able to solve those issues somehow. Unreliable local network is a no go for Galera Cluster, you have to reconsider using some other solution (even though, to be honest, unreliable network will always be a problematic). On the other hand, if the problems are related to WAN connections only (and this is one of the most typical cases), it may be possible to replace WAN Galera links with regular asynchronous replication (if the Galera WAN tuning did not help).

There are several inherent limitations in this setup - the main issue is that the writes used to happen locally. Now, all the writes will have to head to the “master” datacenter (DC A in our case). This is not as bad as it sounds. Please keep in mind that in an all-Galera environment, writes will be slowed down by the latency between nodes located in different datacenters. Even local writes will be affected. It will be more or less the same slowdown as with asynchronous setup in which you would send the writes across WAN to the “master” datacenter.

Using asynchronous replication comes with all of the problems typical for the asynchronous replication. Replication lag may become a problem - not that Galera would be more performant, it’s just that Galera would slow down the traffic via flow control while replication does not have any mechanism to throttle the traffic on the master.

Another problem is the failover: if the “master” Galera node (the one which acts as the master to the slaves in other datacenters) would fail, some mechanism has to be created to repoint slaves to another, working master node. It might be some sort of a script, it is also possible to try something with VIP where the “slave” Galera cluster slaves off Virtual IP which is always assigned to the alive Galera node in the “master” cluster.

The main advantage of such setup is that we do remove the WAN Galera link which means that our “master” cluster will not be slowed down by the fact that some of the nodes are separated geographically. As we mentioned, we lose the ability to write in all of the data-centers but latency-wise writing across the WAN is the same as writing locally to the Galera cluster which spans across WAN. As a result the overall latency should improve. Asynchronous replication is also less vulnerable to the unstable networks. Worst case scenario, the replication link will break and it will be recreated when the networks converge.

Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

How to Design MySQL Replication for an Unstable Network?

In the previous section, we covered Galera cluster and one solution was to use asynchronous replication. How does it look like in a plain asynchronous replication setup? Let’s look at how an unstable network can cause the biggest disruptions in the replication setup.

First of all, latency - one of the main pain points for Galera Cluster. In case of replication, it is almost a non-issue. Unless you use semi-synchronous replication that is - in such case, increased latency will slow down writes. In asynchronous replication, latency has no impact on the write performance. It may, though, have some impact on the replication lag. It is not anything as significant as it was for Galera but you may expect more lag spikes and overall less stable replication performance if the network between nodes suffers from high latency. This is mostly due to the fact that the master may as well serve several writes before data transfer to the slave can be initiated on high latency network.

The network instability may definitely impact replication links but it is, again, not that critical. MySQL slaves will attempt to reconnect to their masters and replication will commence.

The main issue with MySQL replication is actually something that Galera Cluster solves internally - network partitioning. We are talking about the network partitioning as the condition in which segments of the network are separated from each other. MySQL replication utilizes one single writer node - master. No matter how you design your environment, you have to send your writes to the master. If the master is not available (for whatever reasons), application cannot do its job unless it runs in some sort of read-only mode. Therefore there is a need to pick the new master as soon as possible. This is where the issues show up.

First, how to tell which host is a master and which one is not. One of the usual ways is to use the “read_only” variable to distinguish slaves from the master. If node has read_only enabled (set read_only=1), it is a slave (as slaves should not handle any direct writes). If the node has read_only disabled (set read_only=0), it is a master. To make things safer, a common approach is to set read_only=1 in MySQL configuration - in case of a restart, it is safer if the node shows up as a slave. Such “language” can be understood by proxies like ProxySQL or MaxScale.

Let’s take a look at an example.

We have application hosts which connect to the proxy layer. Proxies perform the read/write split sending SELECTs to slaves and writes to master. If master is down, failover is performed, new master is promoted, proxy layer detects that and start sending writes to another node.

If node1 restarts, it will come up with read_only=1 and it will be detected as a slave. It is not ideal as it is not replicating but it is acceptable. Ideally, the old master should not show up at all until it is rebuilt and slaved off the new master.

Way more problematic situation is if we have to deal with network partitioning. Let’s consider the same setup: application tier, proxy tier and databases.

When the network makes the master not reachable, the application is not usable as no writes make it to their destination. New master is promoted, writes are redirected to it. What will happen then if the network issues cease and the old master becomes reachable? It has not been stopped, therefore it is still using read_only=0:

You’ve now ended up in a split brain, when writes were directed to two nodes. This situation is pretty bad as to merge diverged datasets may take a while and it is quite a complex process.

What can be done to avoid this problem? There is no silver bullet but some actions can be taken to minimize the probability of a split brain to happen.

First of all, you can be smarter in detecting the state of the master. How do the slaves see it? Can they replicate from it? Maybe some of the slaves still can connect to the master, meaning that the master is up and running or, at least, making it possible to stop it should that be necessary. What about the proxy layer? Do all of the proxy nodes see the master as unavailable? If some can still connect, than you can try to utilize those nodes to ssh into the master and stop it before the failover?

The failover management software can also be smarter in detecting the state of the network. Maybe it utilizes RAFT or some other clustering protocol to build a quorum-aware cluster. If a failover management software can detect the split brain, it can also take some actions based on this like, for example, setting all nodes in the partitioned segment to read_only ensuring that the old master will not show up as writable when the networks converge.

You can also include tools like Consul or Etcd to store the state of the cluster. The proxy layer can be configured to use data from Consul, not the state of the read_only variable. It will be then up to the failover management software to make necessary changes in Consul so that all proxies will send the traffic to a correct, new master.

Some of those hints can even be combined together to make the failure detection even more reliable. All in all, it is possible to minimize the chances that the replication cluster will suffer from unreliable networks.

As you can see, no matter if we are talking about Galera or MySQL Replication, unstable networks may become a serious problem. On the other hand, if you design the environment correctly, you can still make it work. We hope this blog post will help you to create environments which will work stable even if the networks are not.

by krzysztof at March 06, 2019 09:26 AM

March 05, 2019

Peter Zaitsev

Upcoming Webinar Thurs 3/7: Enhancing MySQL Security

Enhancing MySQL Security Webinar

Enhancing MySQL Security WebinarJoin Percona Support Engineer, Vinicius Grippa, as he presents his talk Enhancing MySQL Security on Thursday, March 7th, 2019 at 7:00 AM PST (UTC-8) / 10:00 AM EST (UTC-5).

Register Now

Security is always a challenge when it comes to data. What’s more, regulations like GDPR add a whole new layer on top of it, with rules more and more restrictive to access and manipulate data. Join us in this presentation to check security best practices, as well as traditional and new features available for MySQL including features coming with the new MySQL 8.

In this talk, DBA’s and sysadmins will walk through the security features available on the OS and MySQL. For instance, these features include:

– SO security
– Audit Plugin
– MySQL 8 features (undo, redo and binlog encryption)
– New caching_sha2_password
– Roles
– Password Management
– FIPS mode

In order to learn more register for this webinar on Enhancing MySQL Security.

by Vinicius Grippa at March 05, 2019 09:57 PM

How to Upgrade Amazon Aurora MySQL from 5.6 to 5.7

Over time, software evolves and it is important to stay up to date if you want to benefit from new features and performance improvements.  Database engines follow the exact same logic and providers are always careful to provide an easy upgrade path. With MySQL, the mysql_upgrade tool serves that purpose.

A database upgrade process becomes more challenging in a managed environment like AWS RDS where you don’t have shell access to the database host and don’t have access to the SUPER MySQL privilege. This post is a collaboration between Fattmerchant and Percona following an engagement focused on the upgrade of the Fattmerchant database from Amazon Aurora MySQL 5.6 to Amazon Aurora MySQL 5.7. Jacques Fu, the CTO of Fattmerchant, is the co-author of this post.  Our initial plan was to follow a path laid out previously by others but we had difficulties finding any complete and detailed procedure outlining the steps. At least, with this post, there is now one.

Issues with the regular upgrade procedure

How do we normally upgrade a busy production server with minimal downtime?  The simplest solution is to use a slave server with the newer version. Such a procedure has the side benefit of providing a “staging” database server which can be used to test the application with the new version. Basically we need to follow these steps:

  1. Enable replication on the old server
  2. Make a consistent backup
  3. Restore the backup on a second server with the newer database version – it can be a temporary server
  4. Run mysql_upgrade if needed
  5. Configure replication with the old server
  6. Test the application against the new version. If the tests includes conflicting writes, you may have to jump back to step 3
  7. If tests are OK and the new server is in sync, replication wise, with the old server, stop the application (only for a short while)
  8. Repoint the application to the new server
  9. Reset the slave
  10. Start the application

If the new server was temporary, you’ll need to repeat most of the steps the other way around, this time starting from the new server and ending on the old one.

What we thought would be a simple task turned out to be much more complicated. We were preparing to upgrade our database from Amazon Aurora MySQL 5.6 to 5.7 when we discovered that there was no option for an in-place upgrade. Unlike a standard AWS RDS MySQL (RDS MySQL upgrade 5.6 to 5.7) at the time of this article you cannot perform an in-place upgrade or even restore a backup across the major versions of Amazon Aurora MySQL.

We initially chose Amazon Aurora for the benefits of the tuning work that AWS provided out of the box, but we realized with any set of pros there comes a list of cons. In this case, the limitations meant that something that should have been straightforward took us off the documented path.

Our original high-level plan

Since we couldn’t use an RDS snapshot to provision a new Amazon Aurora MySQL 5.7 instance, we had to fallback to the use of a logical backup. The intended steps were:

  1. Backup the Amazon Aurora MySQL 5.6 write node with mysqldump
  2. Spin up an empty Amazon Aurora MySQL 5.7 cluster
  3. Restore the backup
  4. Make the Amazon Aurora MySQL 5.7 write node a slave of the Amazon Aurora MySQL 5.6 write node
  5. Once in sync, transfer the application to the Amazon Aurora MySQL 5.7 cluster

Even those simple steps proved to be challenging.

Backup of the Amazon Aurora MySQL 5.6 cluster

First, the Amazon Aurora MySQL 5.6 write node must generate binary log files. The default cluster parameter group that is generated when creating an Amazon Aurora instance does not enable these settings. Our 5.6 write node was not generating binary log files, so we copied the default cluster parameter group to a new “replication” parameter group and changed the “binlog_format” variable to MIXED.  The parameter is only effective after a reboot, so overnight we rebooted the node. That was a first short downtime.

At that point, we were able to confirm, using “show master status;” that the write node was indeed generating binlog files.  Since our procedure involves a logical backup and restore, we had to make sure the binary log files are kept for a long enough time. With a regular MySQL server the variable “expire_logs_days” controls the binary log files retention time. With RDS, you have to use the mysql.rds_set_configuration. We set the retention time to two weeks:

CALL mysql.rds_set_configuration('binlog retention hours', 336);

You can confirm the new setting is used with:

CALL mysql.rds_show_configuration;

For the following step, we needed a mysqldump backup along with its consistent replication coordinates. The option

   of mysqldump implies “Flush table with read lock;” while the replication coordinates are read from the server.  A “Flush table” requires the SUPER privilege and this privilege is not available in RDS.

Since we wanted to avoid downtime, it is out of question to pause the application for the time it would take to backup 100GB of data. The solution was to take a snapshot and use it to provision a temporary Amazon Aurora MySQL 5.6 cluster of one node. As part of the creation process, the events tab of the AWS console will show the binary log file and position consistent with the snapshot, it looks like this:

Consistent snapshot replication coordinates

Consistent snapshot replication coordinates

From there, the temporary cluster is idle so it is easy to back it up with mysqldump. Since our dataset is large we considered the use of MyDumper but the added complexity was not worthwhile for a one time operation. The dump of a large database can take many hours. Essentially we performed:

mysqldump -h entrypoint-temporary-cluster -u awsrootuser -pxxxx \
 --no-data --single-transaction -R -E -B db1 db2 db3 > schema.sql
mysqldump -h entrypoint-temporary-cluster -nt --single-transaction \
 -u awsrootuser -pxxxx -B db1 db2 db3 | gzip -1 > dump.sql.gz
pt-show-grants -h entrypoint-temporary-cluster -u awsrootuser -pxxxx > grants.sql

The schema consist of three databases: db1, db2 and db3.  We have not included the mysql schema because it will cause issues with the new 5.7 instance. You’ll see why we dumped the schema and the data separately in the next section.

Restore to an empty Amazon Aurora MySQL 5.7 cluster

With our backup done, we are ready to spin up a brand new Amazon Aurora MySQL 5.7 cluster and restore the backup. Make sure the new Amazon Aurora MySQL 5.7 cluster is in a subnet with access to the Amazon Aurora MySQL 5.6 production cluster. In our schema, there a few very large tables with a significant number of secondary keys. To speed up the restore, we removed the secondary indexes of these tables from the schema.sql file and created a restore-indexes.sql file with the list of alter table statements needed to recreate them. Then we restored the data using these steps:

cat grants.sql | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx
cat schema-modified.sql | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx
zcat dump.sql.gz | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx
cat restore-indexes.sql | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx

Configure replication

At this point, we have a new Amazon Aurora MySQL 5.7 cluster provisioned with a dataset at a known replication coordinates from the Amazon Aurora MySQL 5.6 production cluster.  It is now very easy to setup replication. First we need to create a replication user in the Amazon Aurora MySQL 5.6 production cluster:

GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'repl_user'@'%' identified by 'agoodpassword';

Then, in the new Amazon Aurora MySQL 5.7 cluster, you configure replication and start it by:

CALL mysql.rds_set_external_master ('', 3306,
  'repl_user', 'agoodpassword', 'mysql-bin-changelog.000018', 65932380, 0);
CALL mysql.rds_start_replication;

The endpoint points to the Amazon Aurora MySQL 5.6 production cluster.

Now, if everything went well, the new Amazon Aurora MySQL 5.7 cluster will be actively syncing with its master, the current Amazon Aurora MySQL 5.6 production cluster. This process can take a significant amount of time depending on the write load and the type of instance used for the new cluster. You can monitor the progress with the show slave status\G command, the Seconds_Behind_Master will tell you how far behind in seconds the new cluster is compared to the old one.  It is not a measurement of how long it will take to resync.

You can also monitor throughput using the AWS console. In this screenshot you can see the replication speeding up over time before it peaks when it is completed.

Replication speed

Test with Amazon Aurora MySQL 5.7

At this point, we have an Amazon Aurora MySQL 5.7 cluster in sync with the production Amazon Aurora MySQL 5.6 cluster. Before transferring the production load to the new cluster, you need to test your application with MySQL 5.7. The easiest way is to snapshot the new Amazon Aurora MySQL 5.7 cluster and, using the snapshot, provision a staging Amazon Aurora MySQL 5.7 cluster. Test your application against the staging cluster and, once tested, destroy the staging cluster and any unneeded snapshots.

Switch production to the Amazon Aurora MySQL 5.7 cluster

Now that you have tested your application with the staging cluster and are satisfied how it behaves with Amazon Aurora MySQL 5.7, the very last step is to migrate the production load. Here are the last steps you need to follow:

  1. Make sure the Amazon Aurora MySQL 5.7 cluster is still in sync with the Amazon Aurora MySQL 5.6 cluster
  2. Stop the application
  3. Validate the Show master status; of the 5.6 cluster is no longer moving
  4. Validate from the Show slave status\G in the 5.7 cluster the Master_Log_File and Exec_Master_Log_Pos match the output of the “Show master status;” from the 5.6 cluster
  5. Stop the slave in the 5.7 cluster with CALL mysql.rds_stop_replication;
  6. Reset the slave in the 5.7 cluster with CALL mysql.rds_reset_external_master;
  7. Reconfigure the application to use the 5.7 cluster endpoint
  8. Start the application

The application is down from steps 2 to 8.  Although that might appear to be a long time, these steps can easily be executed within a few minutes.


So, in summary, although RDS Aurora doesn’t support an in place upgrade between Amazon Aurora MySQL 5.6 and 5.7, there is a possible migration path, minimizing downtime.  In our case, we were able to limit the downtime to only a few minutes.

Co-Author: Jacques Fu, Fattmerchant


Jacques is CTO and co-founder at the fintech startup Fattmerchant, author of Time Hacks, and co-founder of the Orlando Devs, the largest developer meetup in Orlando. He has a passion for building products, bringing them to market, and scaling them.

by Yves Trudeau at March 05, 2019 05:31 PM

Shlomi Noach

Un-split brain MySQL via gh-mysql-rewind

We are pleased to release gh-mysql-rewind, a tool that allows us to move MySQL back in time, automatically identify and rewind split brain changes, restoring a split brain server into a healthy replication chain.

I recently had the pleasure of presenting gh-mysql-rewind at FOSDEM. Video and slides are available. Consider following along with the video.


Consider a split brain scenario: a "standard" MySQL replication topology suffered network isolation, and one of the replicas was promoted as new master. Meanwhile, the old master was still receiving writes from co-located apps.

Once the network isolation is over, we have a new master and an old master, and a split-brain situation: some writes only took place on one master; others only took place on the other. What if we wanted to converge the two? What paths do we have to, say, restore the old, demoted master, as a replica of the newly promoted master?

The old master is unlikely to agree to replicate from the new master. Changes have been made. AUTO_INCREMENT values have been taken. UNIQUE constraints will fail.

A few months ago, we at GitHub had exactly this scenario. An entire data center went network isolated. Automation failed over to a 2nd DC. Masters in the isolated DC meanwhile kept receiving writes. At the end of the failover we ended up with a split brain scenario - which we expected. However, an additional, unexpected constraint forced us to fail back to the original DC.

We had to make a choice: we've already operated for a long time in the 2nd DC and took many writes, that we were unwilling to lose. We were OK to lose (after auditing) the few seconds of writes on the isolated DC. But, how do we converge the data?

Backups are the trivial way out, but they incur long recovery time. Shipping backup data over the network for dozens of servers takes time. Restore time, catching up with changes since backup took place, warming up the servers so that they can handle production traffic, all take time.

Could we have reduces the time for recovery?

There are multiple ways to do that: local backups, local delayed replicas, snapshots... We have embarked on several. In this post I wish to outline gh-mysql-rewind, which programmatically identifies the rogue (aka "bad") transactions on the network isolated master, rewinds/reverts them, applies some bookkeeping and restores the demoted master as a healthy replica under the newly promoted master, thereby prepared to be promoted if needed.

General overview

gh-mysql-rewind is a shell script. It utilizes multiple technologies, some of which do not speak to each other, to be able to do the magic. It assumes and utilizes the following:

Some breakdown follows.


MySQL GTIDs keep track of all transactions executed on a given server. GTIDs indicate which server (UUID) originated a write, and ranges of transaction sequences. In a clean state, only one writer will generate GTIDs, and on all the replicas we would see the same GTID set, originated with the writer's UUID.

In a split brain scenario, we would see divergence. It is possible to use GTID_SUBTRACT(old_master-GTIDs, new-master-GTIDs) to identify the exact set of transactions executed on the old, demoted master, right after the failover. This is the essence of the split brain.

For example, assume that just before the network partition, GTID on the master was 00020192-1111-1111-1111-111111111111:1-5000. Assume after the network partition the new master has UUID of 00020193-2222-2222-2222-222222222222. It began to take writes, and after some time its GTID set showed 00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200.

On the demoted master, other writes took place, leading to the GTID set 00020192-1111-1111-1111-111111111111:1-5042.

We will run...


> '00020192-1111-1111-1111-111111111111:5001-5042' identify the exact set of "bad transactions" on the demoted master.

Row Based Replication

With row based replication, and with FULL image format, each DML (INSERT, UPDATE, DELETE) writes to the binary log the complete row data before and after the operation. This means the binary log has enough information for us to revert the operation.


Developed by Alibaba, flashback has been incorporated in MariaDB. MariaDB's mysqlbinlog utility supports a --flashback flag, which interprets the binary log in a special way. Instead of printing out the events in the binary log in order, it prints the inverted operations in reverse order.

To illustrate, let's assume this pseudo-code sequence of events in the binary log:

insert(1, 'a')
insert(2, 'b')
insert(3, 'c')
update(2, 'b')->(2, 'second')
update(3, 'c')->(3, 'third')
insert(4, 'd')
delete(1, 'a')

A --flashback of this binary log would produce:

insert(1, 'a')
delete(4, 'd')
update(3, 'third')->(3, 'c')
update(2, 'second')->(2, 'b')
delete(3, 'c')
delete(2, 'b')
delete(1, 'a')

Alas, MariaDB and flashback do not speak MySQL GTID language. GTIDs are one of the major points where MySQL and MariaDB have diverged beyond compatibility.

The output of MariaDB's mysqlbinlog --flashback has neither any mention of GTIDs, nor does the tool take notice of GTIDs in the binary logs in the first place.


This is where we step in. GTIDs provide the information about what went wrong. flashback has the mechanism to generate the reverse sequence of statements. gh-mysql-rewind:

  • uses GTIDs to detect what went wrong
  • correlates those GTID entries with binary log files: identifies which binary logs actually contain those GTID events
  • invokes MariaDB's mysqlbinlog --flashback to generate the reverse of those binary logs
  • injects (dummy) GTID information into the output
  • computes ETA

This last part is worth elaborating. We have created a time machine. We have the mechanics to make it work. But as any Sci-Fi fan knows, one of the most important parts of time travel is knowing ahead where (when) you are going to land. Are you back in the Renaissance? Or are you suddenly to appear on board the French Revolution? Better dress accordingly.

In our scenario it is not enough to move MySQL back in time to some consistent state. We want to know at what time we landed, so that we can instruct the rewinded server to join the replication chain as a healthy replica. In MySQL terms, we need to make MySQL "forget" everything that ever happened after the split brain: not only in terms of data (which we already did), but in terms of GTID history.

gh-mysql-rewind will do the math to project, ahead of time, at what "time" (i.e. GTID set) our time machine arrived. It will issue a `RESET MASTER; SET GLOBAL gtid_purged='gtid-of-the-landing-time'" to make our re-winded MySQL consistent not only with some past dataset, but also with its own perception of the point in time where that dataset existed.


Some limitations are due to MariaDB's incompatibility with MySQL, some are due to MySQL DDL nature, some due to the fact gh-mysql-rewind is a shell script.

  • Cannot rewind DDL. DDLs are silently ignored, and will impose a problem when trying to re-apply them.
  • JSON, POINT data types are not supported.
  • The logic rewinds the MySQL server farther into the past than strictly required. This simplifies the code considerably, but imposed superfluous time to rewind+reapply, i.e. time to recover.
  • Currently, this only works one server at a time. If a group of 10 servers were network isolated together, the operation would need to run on each of these 10 servers.
  • Runs locally on each server. Requires both MySQL's mysqlbinlog as well as MariaDB's mysqlbinlog.


There's lot of moving parts to this mechanism. A mixture of technologies that don't normally speak to each other, injection of data, prediction of ETA... How reliable is all this?

We run continuous gh-mysql-rewind testing in production to consistently prove that it works as expected. Our testing uses a non-production, dedicated, functional replica. It contaminates the data on the replica. It lets gh-mysql-rewind automatically move it back in time, it joins the replica back into the healthy chain.

That's not enough. We actually create a scenario where we can predict, ahead of testing, what the time-of-arrival will be. We checksum the data on that replica at that time. After contaminating and effectively breaking replication, we expect gh-mysql-rewind to revert the changes back to our predicted point in time. We checksum the data again. We expect 100% match.

See the video or slides for more detail on our testing setup.


At this time the tool in one of several solutions we hope to never need to employ. It is stable and tested. We are looking forward to a promising MySQL development that will provide GTID-revert capabilities using standard commands, such as SELECT undo_transaction('00020192-1111-1111-1111-111111111111:5042').

We have released gh-mysql-rewind as open source, under the MIT license. The public release is a stripped down version of our own script, which has some GitHub-specific integration. We have general ideas in incorporating this functionality into higher level tools.

gh-mysql-rewind is developed by the database-infrastructure team at GitHub.

by shlomi at March 05, 2019 01:51 PM

March 04, 2019

Peter Zaitsev

Percona XtraBackup 8.0.5 Is Now Available

Percona XtraBackup 8.0

Percona XtraBackup 8.0

Percona is glad to announce the release of Percona XtraBackup 8.0.5 on March 4, 2019. Downloads are available from our download site and from apt and yum repositories.

Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, it drives down backup costs while providing unique features for MySQL backups.

Percona XtraBackup 8.0.5 introduces the support of undo tablespaces created using the new syntax (CREATE UNDO TABLESPACEavailable since MySQL 8.0.14. Percona XtraBackup also supports the binary log encryption introduced in MySQL 8.0.14.

Two new options were added to xbstream. Use the --decompress option with xbstream to decompress individual qpress files. With the --decompress-threads option, specify the number of threads to apply when decompressing. Thanks to Rauli Ikonen for this contribution.

This release of Percona XtraBackup is a General Availability release ready for use in a production environment.

All Percona software is open-source and free.

Please note the following about this release:

  • The deprecated innobackupex has been removed. Use the xtrabackup command to back up your instances: $ xtrabackup --backup --target-dir=/data/backup
  • When migrating from earlier database server versions, backup and restore and using Percona XtraBackup 2.4 and then use mysql_upgrade from MySQL 8.0.x
  • If using yum or apt repositories to install Percona XtraBackup 8.0.5, ensure that you have enabled the new tools repository. You can do this with the percona-release enable tools release command and then install the percona-xtrabackup-80 package.

New Features

  • PXB-1548: Percona XtraBackup enables updating the ib_buffer_pool file with the latest pages present in the buffer pool using the --dump-innodb-buffer-pool option. Thanks to Marcelo Altmann for contribution.
  • PXB-1768: Added support for undo tablespaces created with the new MySQL 8.0.14 syntax.
  • PXB-1781: Added support for binary log encryption introduced in MySQL 8.0.14.
  • PXB-1797: For xbstream, two new options were added. The --decompress option enables xbstream to decompress individual qpress files. The --decompress-threads option controls the number of threads to apply when decompressing. Thanks to Rauli Ikonen for this contribution.

Bugs Fixed

  • Using --lock-ddl-per-table caused the server to scan all records of partitioned tables which could lead to the “out of memory” error. Bugs fixed PXB-1691 and PXB-1698.
  • When Percona XtraBackup was started run with the --slave-info, incorrect coordinates were written to the xtrabackup_slave_info file. Bug fixed PXB-1737
  • Percona XtraBackup could crash at the prepare stage when making an incremental backup if the variable innodb-rollback-segments was changed after starting the MySQL Server. Bug fixed PXB-1785.
  • The full backup could fail when Percona Server was started with the --innodb-encrypt-tables parameter. Bug fixed PXB-1793.

Other bugs fixed: PXB-1632PXB-1715PXB-1770PXB-1771PXB-1773.

by Borys Belinsky at March 04, 2019 07:16 PM

Upcoming Webinar Wed 3/6: High Availability and Disaster Recovery in Amazon RDS

High Availability and Disaster Recovery in Amazon RDS Webinar

MySQL High Availability and Disaster Recovery WebinarJoin Percona CEO Peter Zaitsev as he presents High Availability and Disaster Recovery in Amazon RDS on Wednesday, March 6th, 2019, at 11:00 AM PST (UTC-8) / 2:00 PM EST (UTC-5).

Register Now

In this hour-long webinar, Peter describes the differences between high availability (HA) and disaster recovery (DR). Afterward, Peter will go through scenarios detailing how each is handled manually and in Amazon RDS.

He will review the pros and cons of managing HA and DR in the traditional database environment as well in the cloud. Having full control of these areas is daunting. However, Amazon RDS makes meeting these needs easier and more efficient.

Regardless of which path you choose, monitoring your environment is vital. Peter’s talk will make that message clear. A discussion of metrics you should regularly review to keep your environment working correctly and performing optimally concludes the webinar.

In order to learn more register for Peter’s webinar on High Availability and Disaster Recovery in Amazon RDS.

by Peter Zaitsev at March 04, 2019 04:14 PM

PostgreSQL Webinar Wed April 17 – Upgrading or Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions

upgrade postgresql webinar series

PostgreSQL logoA date for your diary. On Wednesday, April 17 at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4) Percona’s PostgreSQL Support Technical Lead, Avinash Vallarapu and Senior Support Engineers, Fernando Laudares, Jobin Augustine and Nickolay Ihalainen, will demonstrate the upgrade of a legacy version of PostgreSQL to a newer version, using built-in as well as open source tools. In the lead up to the live webinar, we’ll be publishing a series of five blog posts that will help you to understand the solutions available to perform a PostgreSQL upgrade.

Register Now


Are you stuck with an application that is using an older version PostgreSQL which is no longer supported? Are you looking for the methods available to upgrade a legacy version PostgreSQL cluster (< PostgreSQL 9.3)? Are you searching for solutions that could upgrade your PostgreSQL with a minimalistic downtime? Are you afraid that your application may not work with latest PostgreSQL versions as it was built on a legacy version, a few years ago? Do you want to confirm if you are doing your PostgreSQL upgrades the right way ? Do you think that you need to buy an enterprise license to minimize the downtime involved in upgrades?

Then we suggest you to subscribe to our webinar, that should answer most of your questions around PostgreSQL upgrades.

This webinar starts with a list of solutions that are built-in to PostgreSQL to help us upgrade a legacy version of PostgreSQL with minimal downtime. The advantages of choosing such methods will also be discussed. You’ll notice a list of prerequisites for each solution, reducing the scope of possible mistakes. It’s important to minimize downtime when upgrading from an older version of PostgreSQL server. Therefore, we will present three open source solutions that will help us either to minimize or to completely avoid downtime.

Our presentation will show the full process of upgrading a set of PostgreSQL servers to the latest available version. Furthermore, we’ll show the pros and cons for each of the methods we employed.

The webinar programme

Topics covered in this webinar will include:

  1. PostgreSQL upgrade using pg_dump/pg_restore (with downtime)
  2. PostgreSQL upgrade using pg_dumpall (with downtime)
  3. Continuous replication from a legacy PostgreSQL version to a newer version using Slony.
  4. Replication between major PostgreSQL versions using Logical Replication.
  5. Fast upgrade of legacy PostgreSQL with minimum downtime.

In the 45 minute session, we’ll walk you through the methods and demonstrate some of the methods you may find useful in your database environment. We’ll see how simple and quick it is to perform the upgrade using our approach.

Register Now

Image adapted from Photo by Magda Ehlers from Pexels

by Avinash Vallarapu at March 04, 2019 02:34 PM

March 01, 2019

Oli Sennhauser

MariaDB and MySQL consulting by plane

Since January 2019 FromDual tries to contribute actively a little bit against global warming too.

The best for the climate would be to NOT travel to the customer at all! For this cases we have our FromDual remote-DBA services for MariaDB and MySQL.

But sometimes customer wants or needs us on-site for our FromDual in-house trainings or our FromDual on-site consulting engagements. In these cases we try to travel by train. Travelling by train is after walking or travelling by bicycle the most climate friendly way to travel:

But some customers are located more than 7 to 8 hours far away by train. For these customers we have to take the plan which is not good for the climate at all. But at least we will compensate for our CO2 emission via


by Shinguz at March 01, 2019 02:27 PM

Jean-Jerome Schmidt

A Review of the New Analytic Window Functions in MySQL 8.0

Data is captured and stored for a variety of reasons. Hours beyond count (and even more budget) invested in collecting, ingesting, structuring, validating, and ultimately storing of data; to say that it is a valuable asset is to drive home a moot point. This day in age it may, in fact, be our most precious commodity.

Some data is used strictly as an archive. Perhaps to record or track events that happened in the past. But the other side of that coin is that historical data has value in basing decisions for the future and future endeavors.

  • What day to have our sale on? (Planning for future sales based on how we did in the past.)
  • Which salesperson performed the best in quarter one? (Looking back, who can we reward for their efforts.)
  • Which restaurant is frequented the most in the middle of July? (The travel season is upon us... Who can we sell our foodstuffs and goods to?)

You get the picture. Using data on hand is integral for any organization.

Many companies build, base, and provide services with data. They depend on it.

Several months back, depending on when you are reading this, I began walking for exercise, in earnest, to lose weight, get a handle on my health, and to seek a daily bit of solitude from this busy world we live in.

I used a mobile pedometer app to track my hikes, even considering which shoes I wore, as I have a tendency to be ultra-picky when it comes to footwear.

While this data is not nearly as important as that mentioned in those scenarios above, for me, a key element in learning anything, is using something I am interested in, can relate to, and understand.

Window Functions have been on my radar to explore for a long while now. So, I thought to try my hand at a couple of them in this post. Having recently been supported in MySQL 8 (Visit this Severalnines blog I wrote about MySQL 8 upgrades and new additions where I mention them briefly) that ecosystem is the one I will use here. Be forewarned, I am not a window analytical function guru.

What is a MySQL Window Function?

The MySQL documentation defines them as such: "A window function performs an aggregate-like operation on a set of query rows. However, whereas an aggregate operation groups query rows into a single result row, a window function produces a result for each query row:"

Data Set and Setup for This Post

I store the captured data from my walks in this table:

mysql> DESC hiking_stats;
| Field           | Type         | Null | Key | Default | Extra |
| day_walked      | date         | YES  |     | NULL    |       |
| burned_calories | decimal(4,1) | YES  |     | NULL    |       |
| distance_walked | decimal(4,2) | YES  |     | NULL    |       |
| time_walking    | time         | YES  |     | NULL    |       |
| pace            | decimal(2,1) | YES  |     | NULL    |       |
| shoes_worn      | text         | YES  |     | NULL    |       |
| trail_hiked     | text         | YES  |     | NULL    |       |
7 rows in set (0.01 sec)

There is close to 90 days worth of data here:

mysql> SELECT COUNT(*) FROM hiking_stats;
| COUNT(*) |
|       84 |
1 row in set (0.00 sec)

I'll admit, I am finicky about my footwear so let's determine which pair of shoes I favored most:

mysql> SELECT DISTINCT shoes_worn, COUNT(*)
    -> FROM hiking_stats
    -> GROUP BY shoes_worn;
| shoes_worn                            | COUNT(*) |
| New Balance Trail Runners-All Terrain |       30 |
| Oboz Sawtooth Low                     |       47 |
| Keen Koven WP(keen-dry)               |        6 |
| New Balance 510v2                     |        1 |
4 rows in set (0.00 sec)

In order to provide a better, manageable on-screen demonstration, I will limit the remaining portion of query results to just those of the favorite shoes I wore 47 times.

I also have a trail_hiked column and since I was in 'ultra exercise mode' during this almost 3 month period, I even counted calories while push mowing the yard:

mysql> SELECT DISTINCT trail_hiked, COUNT(*)
    -> FROM hiking_stats
    -> GROUP BY trail_hiked;
| trail_hiked            | COUNT(*) |
| Yard Mowing            |       14 |
| Sandy Trail-Drive      |       20 |
| West Boundary          |       29 |
| House-Power Line Route |       10 |
| Tree Trail-extended    |       11 |
5 rows in set (0.01 sec)

Yet, to even further limit the data set, I will filter out those rows as well:

mysql> SELECT COUNT(*)
    -> FROM hiking_stats
    -> WHERE shoes_worn = 'Oboz Sawtooth Low'
    -> AND
    -> trail_hiked <> 'Yard Mowing';
| COUNT(*) |
|       40 |
1 row in set (0.01 sec)

For the sake of simplicity and ease of use, I will create a VIEW of columns to work with:

mysql> CREATE VIEW vw_fav_shoe_stats AS
    -> (SELECT day_walked, burned_calories, distance_walked, time_walking, pace, trail_hiked
    -> FROM hiking_stats
    -> WHERE shoes_worn = 'Oboz Sawtooth Low'
    -> AND trail_hiked <> 'Yard Mowing');
Query OK, 0 rows affected (0.19 sec)

Leaving me with this set of data:

mysql> SELECT * FROM vw_fav_shoe_stats;
| day_walked | burned_calories | distance_walked | time_walking | pace | trail_hiked            |
| 2018-06-03 |           389.6 |            4.11 | 01:13:19     |  3.4 | Sandy Trail-Drive      |
| 2018-06-04 |           394.6 |            4.26 | 01:14:15     |  3.4 | Sandy Trail-Drive      |
| 2018-06-06 |           384.6 |            4.10 | 01:13:14     |  3.4 | Sandy Trail-Drive      |
| 2018-06-07 |           382.7 |            4.12 | 01:12:52     |  3.4 | Sandy Trail-Drive      |
| 2018-06-17 |           296.3 |            2.82 | 00:55:45     |  3.0 | West Boundary          |
| 2018-06-18 |           314.7 |            3.08 | 00:59:13     |  3.1 | West Boundary          |
| 2018-06-20 |           338.5 |            3.27 | 01:03:42     |  3.1 | West Boundary          |
| 2018-06-21 |           339.5 |            3.40 | 01:03:54     |  3.2 | West Boundary          |
| 2018-06-24 |           392.4 |            3.76 | 01:13:51     |  3.1 | House-Power Line Route |
| 2018-06-25 |           362.1 |            3.72 | 01:08:09     |  3.3 | West Boundary          |
| 2018-06-26 |           380.5 |            3.94 | 01:11:36     |  3.3 | West Boundary          |
| 2018-07-03 |           323.7 |            3.29 | 01:00:55     |  3.2 | West Boundary          |
| 2018-07-04 |           342.8 |            3.47 | 01:04:31     |  3.2 | West Boundary          |
| 2018-07-06 |           375.7 |            3.80 | 01:10:42     |  3.2 | West Boundary          |
| 2018-07-07 |           347.6 |            3.40 | 01:05:25     |  3.1 | Sandy Trail-Drive      |
| 2018-07-08 |           351.6 |            3.58 | 01:06:09     |  3.2 | West Boundary          |
| 2018-07-09 |           336.0 |            3.28 | 01:03:13     |  3.1 | West Boundary          |
| 2018-07-11 |           375.2 |            3.81 | 01:10:37     |  3.2 | West Boundary          |
| 2018-07-12 |           325.9 |            3.28 | 01:01:20     |  3.2 | West Boundary          |
| 2018-07-15 |           382.9 |            3.91 | 01:12:03     |  3.3 | House-Power Line Route |
| 2018-07-16 |           368.6 |            3.72 | 01:09:22     |  3.2 | West Boundary          |
| 2018-07-17 |           339.4 |            3.46 | 01:03:52     |  3.3 | West Boundary          |
| 2018-07-18 |           368.1 |            3.72 | 01:08:28     |  3.3 | West Boundary          |
| 2018-07-19 |           339.2 |            3.44 | 01:03:06     |  3.3 | West Boundary          |
| 2018-07-22 |           378.3 |            3.76 | 01:10:22     |  3.2 | West Boundary          |
| 2018-07-23 |           322.9 |            3.28 | 01:00:03     |  3.3 | West Boundary          |
| 2018-07-24 |           386.4 |            3.81 | 01:11:53     |  3.2 | West Boundary          |
| 2018-07-25 |           379.9 |            3.83 | 01:10:39     |  3.3 | West Boundary          |
| 2018-07-27 |           378.3 |            3.73 | 01:10:21     |  3.2 | West Boundary          |
| 2018-07-28 |           337.4 |            3.39 | 01:02:45     |  3.2 | Sandy Trail-Drive      |
| 2018-07-29 |           348.7 |            3.50 | 01:04:52     |  3.2 | West Boundary          |
| 2018-07-30 |           361.6 |            3.69 | 01:07:15     |  3.3 | West Boundary          |
| 2018-07-31 |           359.9 |            3.66 | 01:06:57     |  3.3 | West Boundary          |
| 2018-08-01 |           336.1 |            3.37 | 01:01:48     |  3.3 | West Boundary          |
| 2018-08-03 |           259.9 |            2.57 | 00:47:47     |  3.2 | West Boundary          |
| 2018-08-05 |           341.2 |            3.37 | 01:02:44     |  3.2 | West Boundary          |
| 2018-08-06 |           357.7 |            3.64 | 01:05:46     |  3.3 | West Boundary          |
| 2018-08-17 |           184.2 |            1.89 | 00:39:00     |  2.9 | Tree Trail-extended    |
| 2018-08-18 |           242.9 |            2.53 | 00:51:25     |  3.0 | Tree Trail-extended    |
| 2018-08-30 |           204.4 |            1.95 | 00:37:35     |  3.1 | House-Power Line Route |
40 rows in set (0.00 sec)

The first window function I will look at is ROW_NUMBER().

Suppose I want a result set ordered by the burned_calories column for the month of 'July'.

Of course, I can retrieve that data with this query:

mysql> SELECT day_walked, burned_calories, trail_hiked
    -> FROM vw_fav_shoe_stats
    -> WHERE MONTHNAME(day_walked) = 'July'
    -> ORDER BY burned_calories DESC;
| day_walked | burned_calories | trail_hiked            |
| 2018-07-24 |           386.4 | West Boundary          |
| 2018-07-15 |           382.9 | House-Power Line Route |
| 2018-07-25 |           379.9 | West Boundary          |
| 2018-07-22 |           378.3 | West Boundary          |
| 2018-07-27 |           378.3 | West Boundary          |
| 2018-07-06 |           375.7 | West Boundary          |
| 2018-07-11 |           375.2 | West Boundary          |
| 2018-07-16 |           368.6 | West Boundary          |
| 2018-07-18 |           368.1 | West Boundary          |
| 2018-07-30 |           361.6 | West Boundary          |
| 2018-07-31 |           359.9 | West Boundary          |
| 2018-07-08 |           351.6 | West Boundary          |
| 2018-07-29 |           348.7 | West Boundary          |
| 2018-07-07 |           347.6 | Sandy Trail-Drive      |
| 2018-07-04 |           342.8 | West Boundary          |
| 2018-07-17 |           339.4 | West Boundary          |
| 2018-07-19 |           339.2 | West Boundary          |
| 2018-07-28 |           337.4 | Sandy Trail-Drive      |
| 2018-07-09 |           336.0 | West Boundary          |
| 2018-07-12 |           325.9 | West Boundary          |
| 2018-07-03 |           323.7 | West Boundary          |
| 2018-07-23 |           322.9 | West Boundary          |
22 rows in set (0.01 sec)

Yet, for whatever reason (maybe personal satisfaction), I want to award a ranking among the returned rows beginning with 1 indicative of the highest burned_calories count, all the way to (n) rows in the result set.

ROW_NUMBER(), can handle this no problem at all:

mysql> SELECT day_walked, burned_calories,
    -> ROW_NUMBER() OVER(ORDER BY burned_calories DESC)
    -> AS position, trail_hiked
    -> FROM vw_fav_shoe_stats
    -> WHERE MONTHNAME(day_walked) = 'July';
| day_walked | burned_calories | position | trail_hiked            |
| 2018-07-24 |           386.4 |        1 | West Boundary          |
| 2018-07-15 |           382.9 |        2 | House-Power Line Route |
| 2018-07-25 |           379.9 |        3 | West Boundary          |
| 2018-07-22 |           378.3 |        4 | West Boundary          |
| 2018-07-27 |           378.3 |        5 | West Boundary          |
| 2018-07-06 |           375.7 |        6 | West Boundary          |
| 2018-07-11 |           375.2 |        7 | West Boundary          |
| 2018-07-16 |           368.6 |        8 | West Boundary          |
| 2018-07-18 |           368.1 |        9 | West Boundary          |
| 2018-07-30 |           361.6 |       10 | West Boundary          |
| 2018-07-31 |           359.9 |       11 | West Boundary          |
| 2018-07-08 |           351.6 |       12 | West Boundary          |
| 2018-07-29 |           348.7 |       13 | West Boundary          |
| 2018-07-07 |           347.6 |       14 | Sandy Trail-Drive      |
| 2018-07-04 |           342.8 |       15 | West Boundary          |
| 2018-07-17 |           339.4 |       16 | West Boundary          |
| 2018-07-19 |           339.2 |       17 | West Boundary          |
| 2018-07-28 |           337.4 |       18 | Sandy Trail-Drive      |
| 2018-07-09 |           336.0 |       19 | West Boundary          |
| 2018-07-12 |           325.9 |       20 | West Boundary          |
| 2018-07-03 |           323.7 |       21 | West Boundary          |
| 2018-07-23 |           322.9 |       22 | West Boundary          |
22 rows in set (0.00 sec)

You can see the row with burned_calories amount of 386.4 has position 1, while the row with value 322.9 has 22, which is the least (or lowest) amount among the returned rows set.

I'll use ROW_NUMBER() for something a bit more interesting as we progress. Only when I learned about it used in that context, did I truly realize some of its real power.

Up next, let's visit the RANK() window function to provide a different sort of 'ranking' among the rows. We will still target the burned_calories column value. And, while RANK() is similar to ROW_NUMBER() in that they somewhat rank rows, it does introduce a subtle difference in certain circumstances.

I will even further limit the number of rows as a whole by filtering any records not in the month of 'July' but targeting a specific trail:

mysql> SELECT day_walked, burned_calories,
    -> RANK() OVER(ORDER BY burned_calories DESC) AS position,
    -> trail_hiked
    -> FROM vw_fav_shoe_stats
    -> WHERE MONTHNAME(day_walked) = 'July'
    -> AND trail_hiked = 'West Boundary';
| day_walked | burned_calories | position | trail_hiked   |
| 2018-07-24 |           386.4 |        1 | West Boundary |
| 2018-07-25 |           379.9 |        2 | West Boundary |
| 2018-07-22 |           378.3 |        3 | West Boundary |
| 2018-07-27 |           378.3 |        3 | West Boundary |
| 2018-07-06 |           375.7 |        5 | West Boundary |
| 2018-07-11 |           375.2 |        6 | West Boundary |
| 2018-07-16 |           368.6 |        7 | West Boundary |
| 2018-07-18 |           368.1 |        8 | West Boundary |
| 2018-07-30 |           361.6 |        9 | West Boundary |
| 2018-07-31 |           359.9 |       10 | West Boundary |
| 2018-07-08 |           351.6 |       11 | West Boundary |
| 2018-07-29 |           348.7 |       12 | West Boundary |
| 2018-07-04 |           342.8 |       13 | West Boundary |
| 2018-07-17 |           339.4 |       14 | West Boundary |
| 2018-07-19 |           339.2 |       15 | West Boundary |
| 2018-07-09 |           336.0 |       16 | West Boundary |
| 2018-07-12 |           325.9 |       17 | West Boundary |
| 2018-07-03 |           323.7 |       18 | West Boundary |
| 2018-07-23 |           322.9 |       19 | West Boundary |
19 rows in set (0.01 sec)

Notice anything odd here? Different from ROW_NUMBER()?

Check out the position value for those rows of '2018-07-22' and '2018-07-27'. They are in a tie at 3rd.

With good reason since the burned_calorie value of 378.3 is present in both rows.

How would ROW_NUMBER() rank them?

Let's find out:

mysql> SELECT day_walked, burned_calories,
    -> ROW_NUMBER() OVER(ORDER BY burned_calories DESC) AS position,
    -> trail_hiked
    -> FROM vw_fav_shoe_stats
    -> WHERE MONTHNAME(day_walked) = 'July'
    -> AND trail_hiked = 'West Boundary';
| day_walked | burned_calories | position | trail_hiked   |
| 2018-07-24 |           386.4 |        1 | West Boundary |
| 2018-07-25 |           379.9 |        2 | West Boundary |
| 2018-07-22 |           378.3 |        3 | West Boundary |
| 2018-07-27 |           378.3 |        4 | West Boundary |
| 2018-07-06 |           375.7 |        5 | West Boundary |
| 2018-07-11 |           375.2 |        6 | West Boundary |
| 2018-07-16 |           368.6 |        7 | West Boundary |
| 2018-07-18 |           368.1 |        8 | West Boundary |
| 2018-07-30 |           361.6 |        9 | West Boundary |
| 2018-07-31 |           359.9 |       10 | West Boundary |
| 2018-07-08 |           351.6 |       11 | West Boundary |
| 2018-07-29 |           348.7 |       12 | West Boundary |
| 2018-07-04 |           342.8 |       13 | West Boundary |
| 2018-07-17 |           339.4 |       14 | West Boundary |
| 2018-07-19 |           339.2 |       15 | West Boundary |
| 2018-07-09 |           336.0 |       16 | West Boundary |
| 2018-07-12 |           325.9 |       17 | West Boundary |
| 2018-07-03 |           323.7 |       18 | West Boundary |
| 2018-07-23 |           322.9 |       19 | West Boundary |
19 rows in set (0.06 sec)


No ties in the position column numbering this time.

But, who gets precedence?

To my knowledge, for a predictable ordering, you will likely have to determine it by some other additional means within the query (e.g. the time_walking column in this case?).

But we are not done yet with ranking options. Here is DENSE_RANK():

mysql> SELECT day_walked, burned_calories,
    -> DENSE_RANK() OVER(ORDER BY burned_calories DESC) AS position,
    -> trail_hiked
    -> FROM vw_fav_shoe_stats
    -> WHERE MONTHNAME(day_walked) = 'July'
    -> AND trail_hiked = 'West Boundary';
| day_walked | burned_calories | position | trail_hiked   |
| 2018-07-24 |           386.4 |        1 | West Boundary |
| 2018-07-25 |           379.9 |        2 | West Boundary |
| 2018-07-22 |           378.3 |        3 | West Boundary |
| 2018-07-27 |           378.3 |        3 | West Boundary |
| 2018-07-06 |           375.7 |        4 | West Boundary |
| 2018-07-11 |           375.2 |        5 | West Boundary |
| 2018-07-16 |           368.6 |        6 | West Boundary |
| 2018-07-18 |           368.1 |        7 | West Boundary |
| 2018-07-30 |           361.6 |        8 | West Boundary |
| 2018-07-31 |           359.9 |        9 | West Boundary |
| 2018-07-08 |           351.6 |       10 | West Boundary |
| 2018-07-29 |           348.7 |       11 | West Boundary |
| 2018-07-04 |           342.8 |       12 | West Boundary |
| 2018-07-17 |           339.4 |       13 | West Boundary |
| 2018-07-19 |           339.2 |       14 | West Boundary |
| 2018-07-09 |           336.0 |       15 | West Boundary |
| 2018-07-12 |           325.9 |       16 | West Boundary |
| 2018-07-03 |           323.7 |       17 | West Boundary |
| 2018-07-23 |           322.9 |       18 | West Boundary |
19 rows in set (0.00 sec)

The tie remains, however, the numbering is different in where rows are counted, continuing through the remaining results.

Where RANK() began the count with 5 after the ties, DENSE_RANK() picks up at the next number, which is 4 in this instance, since the tie happened at row 3.

I'll be the first to admit, these various row ranking patterns are quite interesting, but, how can you use them for a meaningful result set?

Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

A Bonus Thought

I have to give credit where credit is due. I learned so much about window functions from a wonderful series on YouTube and one video, in particular, inspired me for this next example. Please keep in mind although the examples in that series are demonstrated with a non-open-source database system (Don't toss the digital rotten fruits and veggies at me), there is a ton to learn from the videos overall.

I see a pattern in most of the query results so far I want to explore. I will not filter by any month nor trail.

What I want to know, are the consecutive days that I burned more than 350 calories. Better yet, groups of those days.

Here is the base query I will start with and build off from:

mysql> SELECT day_walked, burned_calories, 
    -> ROW_NUMBER() OVER(ORDER BY day_walked ASC) AS positional_bound, 
    -> trail_hiked 
    -> FROM vw_fav_shoe_stats 
    -> WHERE burned_calories > 350;
| day_walked | burned_calories | positional_bound | trail_hiked            |
| 2018-06-03 |           389.6 |                1 | Sandy Trail-Drive      |
| 2018-06-04 |           394.6 |                2 | Sandy Trail-Drive      |
| 2018-06-06 |           384.6 |                3 | Sandy Trail-Drive      |
| 2018-06-07 |           382.7 |                4 | Sandy Trail-Drive      |
| 2018-06-24 |           392.4 |                5 | House-Power Line Route |
| 2018-06-25 |           362.1 |                6 | West Boundary          |
| 2018-06-26 |           380.5 |                7 | West Boundary          |
| 2018-07-06 |           375.7 |                8 | West Boundary          |
| 2018-07-08 |           351.6 |                9 | West Boundary          |
| 2018-07-11 |           375.2 |               10 | West Boundary          |
| 2018-07-15 |           382.9 |               11 | House-Power Line Route |
| 2018-07-16 |           368.6 |               12 | West Boundary          |
| 2018-07-18 |           368.1 |               13 | West Boundary          |
| 2018-07-22 |           378.3 |               14 | West Boundary          |
| 2018-07-24 |           386.4 |               15 | West Boundary          |
| 2018-07-25 |           379.9 |               16 | West Boundary          |
| 2018-07-27 |           378.3 |               17 | West Boundary          |
| 2018-07-30 |           361.6 |               18 | West Boundary          |
| 2018-07-31 |           359.9 |               19 | West Boundary          |
| 2018-08-06 |           357.7 |               20 | West Boundary          |
20 rows in set (0.00 sec)

We've seen ROW_NUMBER() already, however now it really comes into play.

To make this work (in MySQL at least) I had to use the DATE_SUB() function since essentially, with this technique we are subtracting a number - the value provided by ROW_NUMBER() from the day_walked date column of the same row, which in turn, provides a date itself via the calculation:

mysql> SELECT day_walked AS day_of_walk,
    -> DATE_SUB(day_walked, INTERVAL ROW_NUMBER() OVER(ORDER BY day_walked ASC) DAY) AS positional_bound,
    -> burned_calories,
    -> trail_hiked
    -> FROM vw_fav_shoe_stats
    -> WHERE burned_calories > 350;
| day_of_walk | positional_bound | burned_calories | trail_hiked            |
| 2018-06-03  | 2018-06-02       |           389.6 | Sandy Trail-Drive      |
| 2018-06-04  | 2018-06-02       |           394.6 | Sandy Trail-Drive      |
| 2018-06-06  | 2018-06-03       |           384.6 | Sandy Trail-Drive      |
| 2018-06-07  | 2018-06-03       |           382.7 | Sandy Trail-Drive      |
| 2018-06-24  | 2018-06-19       |           392.4 | House-Power Line Route |
| 2018-06-25  | 2018-06-19       |           362.1 | West Boundary          |
| 2018-06-26  | 2018-06-19       |           380.5 | West Boundary          |
| 2018-07-06  | 2018-06-28       |           375.7 | West Boundary          |
| 2018-07-08  | 2018-06-29       |           351.6 | West Boundary          |
| 2018-07-11  | 2018-07-01       |           375.2 | West Boundary          |
| 2018-07-15  | 2018-07-04       |           382.9 | House-Power Line Route |
| 2018-07-16  | 2018-07-04       |           368.6 | West Boundary          |
| 2018-07-18  | 2018-07-05       |           368.1 | West Boundary          |
| 2018-07-22  | 2018-07-08       |           378.3 | West Boundary          |
| 2018-07-24  | 2018-07-09       |           386.4 | West Boundary          |
| 2018-07-25  | 2018-07-09       |           379.9 | West Boundary          |
| 2018-07-27  | 2018-07-10       |           378.3 | West Boundary          |
| 2018-07-30  | 2018-07-12       |           361.6 | West Boundary          |
| 2018-07-31  | 2018-07-12       |           359.9 | West Boundary          |
| 2018-08-06  | 2018-07-17       |           357.7 | West Boundary          |
20 rows in set (0.00 sec)

However, without DATE_SUB(), you wind up with this (or at least I did):

mysql> SELECT day_walked AS day_of_walk,
    -> day_walked - ROW_NUMBER() OVER(ORDER BY day_walked ASC) AS positional_bound,
    -> burned_calories,
    -> trail_hiked
    -> FROM vw_fav_shoe_stats
    -> WHERE burned_calories > 350;
| day_of_walk | positional_bound | burned_calories | trail_hiked            |
| 2018-06-03  |         20180602 |           389.6 | Sandy Trail-Drive      |
| 2018-06-04  |         20180602 |           394.6 | Sandy Trail-Drive      |
| 2018-06-06  |         20180603 |           384.6 | Sandy Trail-Drive      |
| 2018-06-07  |         20180603 |           382.7 | Sandy Trail-Drive      |
| 2018-06-24  |         20180619 |           392.4 | House-Power Line Route |
| 2018-06-25  |         20180619 |           362.1 | West Boundary          |
| 2018-06-26  |         20180619 |           380.5 | West Boundary          |
| 2018-07-06  |         20180698 |           375.7 | West Boundary          |
| 2018-07-08  |         20180699 |           351.6 | West Boundary          |
| 2018-07-11  |         20180701 |           375.2 | West Boundary          |
| 2018-07-15  |         20180704 |           382.9 | House-Power Line Route |
| 2018-07-16  |         20180704 |           368.6 | West Boundary          |
| 2018-07-18  |         20180705 |           368.1 | West Boundary          |
| 2018-07-22  |         20180708 |           378.3 | West Boundary          |
| 2018-07-24  |         20180709 |           386.4 | West Boundary          |
| 2018-07-25  |         20180709 |           379.9 | West Boundary          |
| 2018-07-27  |         20180710 |           378.3 | West Boundary          |
| 2018-07-30  |         20180712 |           361.6 | West Boundary          |
| 2018-07-31  |         20180712 |           359.9 | West Boundary          |
| 2018-08-06  |         20180786 |           357.7 | West Boundary          |
20 rows in set (0.04 sec)

Hey, that doesn't look so bad really.

What gives?

Eh, the row with a positional_bound value of '20180698'...

Wait a minute, this is supposed to calculate a date value by subtracting the number ROW_NUMBER() provides from the day_of_walk column.


I don't know about you, but I am not aware of a month with 98 days!

But, if there is one, bring on the extra paychecks!

All fun aside, this obviously was incorrect and prompted me to (eventually) use DATE_SUB(), which provides a correct, results set then allowing me to run this query:

mysql> SELECT MIN(t.day_of_walk), 
    -> MAX(t.day_of_walk),
    -> COUNT(*) AS num_of_hikes
    -> FROM (SELECT day_walked AS day_of_walk,
    -> DATE_SUB(day_walked, INTERVAL ROW_NUMBER() OVER(ORDER BY day_walked ASC) DAY) AS positional_bound
    -> FROM vw_fav_shoe_stats
    -> WHERE burned_calories > 350) AS t
    -> GROUP BY t.positional_bound
    -> ORDER BY 1;
| MIN(t.day_of_walk) | MAX(t.day_of_walk) | num_of_hikes |
| 2018-06-03         | 2018-06-04         |            2 |
| 2018-06-06         | 2018-06-07         |            2 |
| 2018-06-24         | 2018-06-26         |            3 |
| 2018-07-06         | 2018-07-06         |            1 |
| 2018-07-08         | 2018-07-08         |            1 |
| 2018-07-11         | 2018-07-11         |            1 |
| 2018-07-15         | 2018-07-16         |            2 |
| 2018-07-18         | 2018-07-18         |            1 |
| 2018-07-22         | 2018-07-22         |            1 |
| 2018-07-24         | 2018-07-25         |            2 |
| 2018-07-27         | 2018-07-27         |            1 |
| 2018-07-30         | 2018-07-31         |            2 |
| 2018-08-06         | 2018-08-06         |            1 |
13 rows in set (0.12 sec)

Basically, I have wrapped the results set provided from that analytical query, in the form of a Derived Table, and queried it for: a start and end date, a count of what I have labeled num_of_hikes, then grouped on the positional_bound column, ultimately providing sets of groups of consecutive days where I burned more than 350 calories.

You can see in the date range of 2018-06-24 to 2018-06-26, resulted in 3 consecutive days meeting the calorie burned criteria of 350 in the WHERE clause.

Not too bad if I don't say so myself, but definitely a record I want to try and best!


Window functions are in a world and league of their own. I have not even scratched the surface of them, having only covered 3 of them in a 'high-level' introductory and perhaps, trivial sense. However, hopefully, through this post, you find that you can query for quite interesting and potentially insightful data with a 'bare minimal' use of them.

Thank you for reading.

by Joshua Otwell at March 01, 2019 10:48 AM

February 28, 2019

Peter Zaitsev

Percona XtraDB Cluster 5.6.43-28.32 Is Now Available

Percona XtraDB Cluster 5.7

Percona XtraDB Cluster 5.7

Percona is glad to announce the release of Percona XtraDB Cluster 5.6.43-28.32 on February 28, 2019. Binaries are available from the downloads section or from our software repositories.

This release of Percona XtraDB Cluster includes the support of Ubuntu 18.10 (Cosmic Cuttlefish). Percona XtraDB Cluster 5.6.43-28.32 is now the current release, based on the following:

All Percona software is open-source and free.

Bugs Fixed

  • PXC-2388: In some cases, DROP FUNCTION function_name was not replicated.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

by Borys Belinsky at February 28, 2019 09:24 PM

Percona XtraDB Cluster 5.7.25-31.35 Is Now Available

Percona XtraDB Cluster 5.7

Percona XtraDB Cluster 5.7Percona is glad to announce the release of Percona XtraDB Cluster 5.7.25-31.35 on February 28, 2018. Binaries are available from the downloads section or from our software repositories.

This release of Percona XtraDB Cluster includes the support of Ubuntu 18.10 (Cosmic Cuttlefish). Percona XtraDB Cluster 5.7.25-31.35 is now the current release, based on the following:

All Percona software is open-source and free.

Bugs Fixed

  • PXC-2346mysqld could crash when executing mysqldump --single-transaction while the binary log is disabled. This problem was also reported in PXC-1711PXC-2371PXC-2419.
  • PXC-2388: In some cases, DROP FUNCTION function_name was not replicated.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

by Borys Belinsky at February 28, 2019 08:56 PM

Percona Server for MongoDB 4.0.6-3 Is Now Available

Percona Server for MongoDB

Percona Server for MongoDB

Percona announces the release of Percona Server for MongoDB 4.0.6-3 on February 28, 2019. Download the latest version from the Percona website or the Percona software repositories.

Percona Server for MongoDB is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 4.0 Community Edition. It supports MongoDB 4.0 protocols and drivers.

Percona Server for MongoDB extends the functionality of the MongoDB 4.0 Community Edition by including the Percona Memory Engine storage engine, encrypted WiredTiger storage engineaudit loggingSASL authenticationhot backups, and enhanced query profilingPercona Server for MongoDB requires no changes to MongoDB applications or code.

Release 4.0.6-3 extends the buildInfo command with the psmdbVersion key to report the version of Percona Server for MongoDB. If this key exists then Percona Server for MongoDB is installed on the server. This key not available from MongoDB.

This release includes all features of MongoDB 4.0 Community Edition 4.0. Most notable among these are:

Note that the MMAPv1 storage engine is deprecated in MongoDB 4.0 Community Edition 4.0.


  • PSMDB-216: The database command buildInfo provides the psmdbVersion key to report the version of Percona Server for MongoDB. If this key exists then Percona Server for MongoDB is installed on the server. This key is not available from MongoDB.

The Percona Server for MongoDB 4.0.6-3 release notes are available in the official documentation.

by Borys Belinsky at February 28, 2019 05:08 PM

MySQL 8.0 Bug 94394, Fixed!

MySQL optimizer bugs

MySQL optimizer bugs

Last week I came across a bug in MySQL 8.0, which meant that the absence of mysql.user leads to auto-apply of –skip-grant-tables (#94394) would leave MySQL running in an undesirable state. My colleague Sveta Smirnova blogged about the issue and it also caught the interest of Valeriy Kravchuk in Fun with Bugs #80 – On MySQL Bug Reports I am Subscribed to, Part XVI. Thanks for the extra visibility!

Credit is now due to Oracle for the quick response, as it was fixed in less than one week (including a weekend):

Fixed in 8.0.16.

Previously, if the grant tables were corrupted, the MySQL server
wrote a message to the error log but continued as if the
–skip-grant-tables option had been specified. This resulted in the
server operating in an unexpected state unless –skip-grant-tables
had in fact been specified. Now, the server stops after writing a
message to the error log unless started with –skip-grant-tables.
(Starting the server with that option enables you to connect to
perform diagnostic operations.)

I think that this particular bug reflects some of the nice things about the MySQL community (and Open Source in general); anyone can find and report a bug, or make a feature request, to one of the software vendors (MySQL, Percona, or MariaDB) and try to improve the software. Sometimes bugs hang around for a while, either because they are hard to fix, viewed as lower in priority (despite the reporter’s opinion), or perhaps the bug does not have enough public visibility. Then a member of the community notices the bug and takes an interest and soon there is more interest. If you are lucky the bug gets fixed quickly! You can of course also provide a fix for the bug yourself, which may speed up the process with a little luck.

If you have not yet reported a bug, or want to find if you are reporting them in the right sort of way then you can take a look at How to create a useful MySQL bug report…and make sure it’s properly processed by Valeriy from FOSDEM 2019.

🐛 🐛 🐛 You can help to find more!

by Ceri Williams at February 28, 2019 03:05 PM

Jean-Jerome Schmidt

MongoDB vs MySQL NoSQL - Why Mongo is Better

There are so many database management systems (DBMS) to choose from ranging from relational to non-relational DBMS. In the past years, the Relational DBMS where more dominant but with recent data structure trends the non-relational DBMS are becoming more popular. The choices for relational DBMS are quite obvious: MySQL, PostgreSQL and MS SQL. On the other hand, MongoDB a non-relational DBM has come to rise basically due to its ability to handle a large set of data. Every selection has got its pros and cons but your choice will mainly be determined by your application needs since both serve in different niches. However, in this article, we are going to discuss the pros of using MongoDB over MySQL.

Pros of Using MongoDB Over MySQL

  1. Speed and performance
  2. High Availability and Cloud Computing
  3. Schema Flexibility
  4. Need to grow bigger
  5. Embedding feature
  6. Security Model
  7. Location-based data
  8. Rich query language support

Speed and Performance

This is one of the major benefits of using MongoDB over MySQL especially when a large set of unstructured data is involved. MongoDB by default encourages high insert rate over transaction safety. This feature is not available in MySQL hence for instance if you are to save a lot of data to your DBM at once, in the case of MySQL you will have to do it one by one. But in the case of MongoDB, with the availability of insertMany() function, you can safely do the multiple inserts. Observing some of the querying behaviours of the two, we can summarize the different operation requests for 1 million documents in the illustration below.

In the case of updating which is a write operation, MongoDB takes 0.002 seconds to update all student emails whereas MySQL takes 0.2491s to execute the same task.

From the illustration, we can conclude that MongoDB takes way lesser time than MySQL for the same operations. MongoDB is mainly structured such that documents are the basis of storage which promotes huge query and data storage. This implies that the performance is dependent on two key values that are the design and scale out. On the other hand, MySQL has data stored in an individual table hence at some point one has to lookup on the entire table before doing a write operation.

High Availability and Cloud Computing

For unstable environments, MongoDB provides a better handling technique than MySQL. This is because it takes very less time for the active secondary nodes to elect a new primary node thus easy administration at the point of failure. Besides, due to comprehensive secondary indexes and native replication, creating a backup for a MongoDB database is quite easy as compared to MySQL since the latter has integrated replication support.

In a nutshell, setting a set of servers that can act as Master-Slaves is easy and fast in MongoDB than MySQL. Besides, recovery from a cluster failure is instant, automatic and safe. For MySQL, there is no clear official solution for providing failover between master and slave in the event of a failure.

Cloud-based storage solutions require data to be smoothly spread across various server to scale up. MongoDB can load a high volume of data as compared to MySQL and with built-in sharding, it is easy to partition and spread out data across multiple servers as a way of utilizing the cost-saving solution as per the cloud-based storage merits.

Schema Flexibility

MongoDB is schemaless such that different documents in the same collection may have the same or different fields from each other. This means there is no restriction on document structure for every insert or update hence changes to the data model won’t have much impact. Of course, there are scenarios that can opt one to use undefined schema for example if you are de-normalizing a database schema or when your database is growing but your schema is unstable. MongoDB therefore allows one to add various types of data as per needs change.

On the other hand, MySQL is table oriented whereby each row must have the same columns as the other rows. Adding a new column would require one to run an ALTER operation which is quite expensive in terms of performance as it will have to lock up the entire database. This is especially the case when the table grows over 10GB, MongoDB does not have this issue.

With a flexible schema it is easy to develop and maintain a cleaner code. Besides, MongoDB provides the option of using a JSON validator in case you want to ensure some data integrity and consistency for your collection hence you can do some validation before insert or update of a document.

The Need to Grow Bigger

Databases scaling is not an easy undertaking especially with MySQL it may result in degraded performance when the 5-10GB per table memory is surpassed. With MongoDB, this is not an issue since one can partition and shard the database with the In-built sharding feature. Once a shard key is specified and sharding is enabled, data is partitioned evenly according to the shard key. If a new shard is added, there is automatic rebalancing. Sharding basically allows horizontal scaling which is difficult to implement in MySQL. Besides, MongoDB has got built-in replication whereby replica sets create multiple copies of the data. Each member of this set has a role either as primary or secondary at any point in the process.

Reads and writes are done on the primary and then replicated to the secondaries. With this merit in place, in case of data inconsistency or instance failure, a new member may be voted in to act as primary.

Embedding Feature

Unlike MySQL where you cannot embed data to a field, MongoDB offers a better embedding technique for related data. As much as you can do a JOIN for tables in MySQL, you may end up having so many tables with some being unnecessary especially if they don’t involve so many fields. In the case of MongoDB you can decide to embed data into a field for related data or reference from another collection if you expect the document grow in future beyond the JSON document size.

For example if we have data for users who we want to capture their addresses and some other information, in the case of MongoDB we can easily have a simple structure like

    name:'George Bush',
    gender: 'Male',
        City: 'New York',
        Street: 'Florida',
        Zip_code: 1342243

But in the case of MySQL we will have to make 2 tables with an id referencing in this case. I.e

Users details table

id name gender age
1 George Bush Male 45

User address table

id City Street Zip_code
1 George Bush Male 134224

In MySQL you will have so many tables which could be so hectic to deal with especially when scaling is involved. As much as one can also do a table join in a single query when fetching this data in MySQL, the latency is quite larger as compared to MongoDB and this is one of the reasons that makes the performance of MongoDB outdo the performance of MySQL.

Become a MongoDB DBA - Bringing MongoDB to Production
Learn about what you need to know to deploy, monitor, manage and scale MongoDB

Security Model

Database administration (DBA) is quite essential in MySQL but not necessary in the case of MongoDB. This means you need to have the DBA to modify a schema in the case of MySQL when an application changes. On the other hand, one can do schema modification without DBA in MongoDB since it is great for class persistence and a class can equally be serialized to JSON and stored. However, this is the best practice if you don’t expect the data to grow big otherwise you will need to follow some best practices to avoid pitfalls.

Location Based Data

In order to improve on throughput operations especially read operations, MongoDB provides built-in special functions that enhance finding relevant data from specific locations which are accurate hence fastening the process. This is not possible in the case of MySQL.

Rich Query Language Support

On a personal interest as a MongoDB enthusiast, I got my attraction with flexibility on querying feature of MongoDB. Regarding the aggregation framework in the later versions and MapReduce feature, one can optimize the result data to suit own specifications. As much as MySQL also offers operations such as grouping, sorting and many more, MongoDB is quite extensive especially with embedded data structures. Further as mentioned early, queries are returned with lesser latency in the aggregation framework than when a JOIN was to be done in the case of MySQL. For instance, MongoDB offers an easy way of modifying a schema using the $set and $unset operations for the embedded schema. But, in the case of MySQL, one has to do the ALTER command for the only table within which the field exists and this is quite expensive in terms of performance.


Regarding the merits discussed above, as much as database selection absolutely depends on application design MongoDB offers a lot of flexibility along different lines. If you are looking for something that will cater for better performance, dealing with complex data hence no need restrictions on schema design, future expectations on database growth and rich query language technique, I would recommend you to go for MongoDB.

by Onyancha Brian Henry at February 28, 2019 10:48 AM

February 27, 2019

Peter Zaitsev

Charset and Collation Settings Impact on MySQL Performance

MySQL 8.0 utf8mb4

Following my post MySQL 8 is not always faster than MySQL 5.7, this time I decided to test very simple read-only CPU intensive workloads, when all data fits memory. In this workload there is NO IO operations, only memory and CPU operations.

My Testing Setup

Environment specification

  • Release | Ubuntu 18.04 LTS (bionic)
  • Kernel | 4.15.0-20-generic
  • Processors | physical = 2, cores = 28, virtual = 56, hyperthreading = yes
  • Models | 56xIntel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz<
  • Memory Total | 376.6G
  • Provider | x2.xlarge.x86 instance

I will test two workloads, sysbench oltp_read_only and oltp_point_select varying amount of threads

sysbench oltp_read_only --mysql-ssl=off --report-interval=1 --time=300 --threads=$i --tables=10 --table-size=10000000 --mysql-user=root run

sysbench oltp_point_select --mysql-ssl=off --report-interval=1 --time=300 --threads=$i --tables=10 --table-size=10000000 --mysql-user=root run

The results for OLTP read-only (latin1 character set):

MySQL 5.7.25 MySQL 8.0.15
threads throughput throughput throughput ratio
1 1241.18 1114.4 1.11
4 4578.18 4106.69 1.11
16 15763.64 14303.54 1.10
24 21384.57 19472.89 1.10
32 25081.17 22897.04 1.10
48 32363.27 29600.26 1.09
64 39629.09 35585.88 1.11
128 38448.23 34718.42 1.11
256 36306.44 32798.12 1.11

The results for point_select (latin1 character set):

point select MySQL 5.7.25 MySQL 8.0.15
threads throughput throughput throughput ratio
1 31672.52 28344.25 1.12
4 110650.7 98296.46 1.13
16 390165.41 347026.49 1.12
24 534454.55 474024.56 1.13
32 620402.74 554524.73 1.12
48 806367.3 718350.87 1.12
64 1120586.03 972366.59 1.15
128 1108638.47 960015.17 1.15
256 1038166.63 891470.11 1.16

We can see that in the OLTP read-only workload, MySQL 8.0.15 is slower by 10%, and for the point_select workload MySQL 8.0.15 is slower by 12-16%.

Although the difference is not necessarily significant, this is enough to reveal that MySQL 8.0.15 does not perform as well as MySQL 5.7.25 in the variety of workloads that I am testing.

However, it appears that the dynamic of the results will change if we use the utf8mb4 character set instead of latin1.

Let’s compare MySQL 5.7.25 latin1 vs utf8mb4, as utf8mb4 is now default CHARSET in MySQL 8.0

But before we do that let’s take look also at COLLATION.

MySQL 5.7.25 uses a default collation utf8mb4_general_ci, However, I read that to use proper sorting and comparison for Eastern European languages, you may want to use the utf8mb4_unicode_ci collation. For MySQL 8.0.5 the default collation is

So let’s compare each version latin1 vs utf8mb4 (with default collation). First 5.7:

Threads utf8mb4_general_ci latin1 latin1 ratio
4 2957.99 4578.18 1.55
24 13792.55 21384.57 1.55
64 24516.99 39629.09 1.62
128 23977.07 38448.23 1.60

So here we can see that utf8mb4 in MySQL 5.7 is really much slower than latin1 (by 55-60%)

And the same for MySQL 8.0.15

MySQL 8.0 defaultcollations

Threads utf8mb4_0900_ai_ci (default) latin1 latin1 ratio
4 3968.88 4106.69 1.03
24 18446.19 19472.89 1.06
64 32776.35 35585.88 1.09
128 31301.75 34718.42 1.11

For MySQL 8.0 the hit from utf8mb4 is much lower (up to 11%)

Now let’s compare all collations for utf8mb4

For MySQL 5.7

MySQL 5.7 utf8mb4

utf8mb4_general_ci (default) utf8mb4_bin utf8mb4_unicode_ci utf8mb4_unicode_520_ci
4 2957.99 3328.8 2157.61 1942.78
24 13792.55 15857.29 9989.96 9095.17
64 24516.99 28125.16 16207.26 14768.64
128 23977.07 27410.94 15970.6 14560.6

If you plan to use utf8mb4_unicode_ci, you will get an even further performance hit (comparing to utf8mb4_general_ci )

And for MySQL 8.0.15

MySQL 8.0 utf8mb4

utf8mb4_general_ci utf8mb4_bin utf8mb4_unicode_ci utf8mb4_0900_ai_ci (default)
4 3461.8 3628.01 3363.7 3968.88
24 16327.45 17136.16 15740.83 18446.19
64 28960.62 30390.29 27242.72 32776.35
128 27967.25 29256.89 26489.83 31301.75

So now let’s compare MySQL 8.0 vs MySQL 5.7 in utf8mb4 with default collations:

mysql 8 and 5.7 default collation

MySQL 8.0 utf8mb4_0900_ai_ci MySQL 5.7 utf8mb4_general_ci MySQL 8.0 ratio
4 3968.88 2957.99 1.34
24 18446.19 13792.55 1.34
64 32776.35 24516.99 1.34
128 31301.75 23977.07 1.31

So there we are. In this case, MySQL 8.0 is actually better than MySQL 5.7 by 34%


There are several observations to make:

  • MySQL 5.7 outperforms MySQL 8.0 in latin1 charset
  • MySQL 8.0 outperforms MySQL 5.7 by a wide margin if we use utf8mb4 charset
  • Be aware that utf8mb4  is now default MySQL 8.0, while MySQL 5.7 has latin1 by default
  • When running comparison between MySQL 8.0 vs MySQL 5.7 be aware what charset you are using, as it may affect the comparison a lot.

by Vadim Tkachenko at February 27, 2019 11:11 PM

February 26, 2019

Peter Zaitsev

Percona XtraBackup Now Supports Dump of InnoDB Buffer Pool

percona-xtra-backup buffer pool restore

percona-xtra-backup buffer pool restoreInnoDB keeps hot data in memory on its buffer named InnoDB Buffer Pool. For a long time, when a MySQL instance needed to bounce, this hot cached data was lost and the instance required a warm-up period to perform as well as it did before the service restart.

That is not the case anymore. Newer versions of MySQL/MariaDB allow users to save the state of this buffer by dumping tablespace ID’s and page ID’s to a file on disk that will be loaded automatically on startup, making the newly started server buffer pool as it was prior the restart.

Details about the MySQL implementation can be found at

With that in mind, Percona XtraBackup versions 2.4.13 can now instruct MySQL to dump the content of buffer pool while taking a backup. This means you can restore the backup on a new server and make MySQL perform just like the other instance in terms of InnoDB Buffer Pool data.

How it works

The buffer pool dump happens at the beginning of backup if --dump-innodb-buffer-pool is set.

The user can choose to change the default innodb_buffer_pool_dump_pct. If --dump-innodb-buffer-pool-pct is set, it stores the current MySQL innodb_buffer_pool_dump_pct value, then it changes it to the desired percentage. After the end of the backup, original values is restored back.

The actual file copy happens at the end of the backup.

Percona XtraDB Cluster

A very good use case is PXC/Galera. When a node initiates SST, we would like the joiner to have a copy of InnoDB Buffer Pool from the donor. We can configure PXC nodes to do that:


Here is an example of a PXC node that just received SST:

Before PXB-1548:

[root@marcelo-altmann-pxb-pxc-3 ~]# systemctl stop mysql && rm -rf /var/lib/mysql/* && systemctl start mysql && mysql -psekret -e "SHOW ENGINE INNODB STATUS\G" | grep 'Database pages'
mysql: [Warning] Using a password on the command line interface can be insecure.
Database pages 311

Joiner started with a cold buffer pool.

After adding dump-innodb-buffer-pool and dump-innodb-buffer-pool-pct=100 to my.cnf :

[root@marcelo-altmann-pxb-pxc-3 ~]# systemctl stop mysql && rm -rf /var/lib/mysql/* && systemctl start mysql && mysql -psekret -e "SHOW ENGINE INNODB STATUS\G" | grep 'Database pages'
mysql: [Warning] Using a password on the command line interface can be insecure.
Database pages 30970

Joiner started with a copy of the buffer pool from the donor, which will reduce the joiner warm-up period.


The new version of Percona XtraBackup can help to minimize the time a newly restored backup will take to perform like source server

Photo by Jametlene Reskp on Unsplash

by Marcelo Altmann at February 26, 2019 10:38 AM

February 25, 2019

MariaDB Foundation

MariaDB 10.4.3 now available

The MariaDB Foundation is pleased to announce the availability of MariaDB 10.4.3, the first release candidate in the MariaDB 10.4 series. See the release notes and changelogs for details. Download MariaDB 10.4.3 Release Notes Changelog What is MariaDB 10.4? MariaDB APT and YUM Repository Configuration Generator Contributors to MariaDB 10.4.3 Aleksey Midenkov (Tempesta) Alexander Barkov […]

The post MariaDB 10.4.3 now available appeared first on

by Ian Gilfillan at February 25, 2019 07:07 PM

Peter Zaitsev

MySQL Challenge: 100k Connections

thread pools MySQL 100k connections

In this post, I want to explore a way to establish 100,000 connections to MySQL. Not just idle connections, but executing queries.

100,000 connections. Is that really needed for MySQL, you may ask? Although it may seem excessive, I have seen a lot of different setups in customer deployments. Some deploy an application connection pool, with 100 application servers and 1,000 connections in each pool. Some applications use a “re-connect and repeat if the query is too slow” technique, which is a terrible practice. It can lead to a snowball effect, and could establish thousands of connections to MySQL in a matter of seconds.

So now I want to set an overachieving goal and see if we can achieve it.


For this I will use the following hardware:

Bare metal server provided by, instance size: c2.medium.x86
Physical Cores @ 2.2 GHz
(1 X AMD EPYC 7401P)
Memory: 64 GB of ECC RAM
Storage : INTEL® SSD DC S4500, 480GB

This is a server grade SATA SSD.

I will use five of these boxes, for the reason explained below. One box for the MySQL server and four boxes for client connections.

For the server I will use Percona  Server for MySQL 8.0.13-4 with the thread pool plugin. The plugin will be required to support the thousands of connections.

Initial server setup

Network settings (Ansible format):

- { name: 'net.core.somaxconn', value: 32768 }
- { name: 'net.core.rmem_max', value: 134217728 }
- { name: 'net.core.wmem_max', value: 134217728 }
- { name: 'net.ipv4.tcp_rmem', value: '4096 87380 134217728' }
- { name: 'net.ipv4.tcp_wmem', value: '4096 87380 134217728' }
- { name: 'net.core.netdev_max_backlog', value: 300000 }
- { name: 'net.ipv4.tcp_moderate_rcvbuf', value: 1 }
- { name: 'net.ipv4.tcp_no_metrics_save', value: 1 }
- { name: 'net.ipv4.tcp_congestion_control', value: 'htcp' }
- { name: 'net.ipv4.tcp_mtu_probing', value: 1 }
- { name: 'net.ipv4.tcp_timestamps', value: 0 }
- { name: 'net.ipv4.tcp_sack', value: 0 }
- { name: 'net.ipv4.tcp_syncookies', value: 1 }
- { name: 'net.ipv4.tcp_max_syn_backlog', value: 4096 }
- { name: 'net.ipv4.tcp_mem', value: '50576   64768 98152' }
- { name: 'net.ipv4.ip_local_port_range', value: '4000 65000' }
- { name: 'net.ipv4.netdev_max_backlog', value: 2500 }
- { name: 'net.ipv4.tcp_tw_reuse', value: 1 }
- { name: 'net.ipv4.tcp_fin_timeout', value: 5 }

These are the typical settings recommended for 10Gb networks and high concurrent workloads.

Limits settings for systemd:


And the relevant setting for MySQL in my.cnf:


For the client I will use sysbench version 0.5 and not 1.0.x, for the reasons explained below.

The workload is

sysbench --test=sysbench/tests/db/select.lua --mysql-host= --mysql-user=sbtest --mysql-password=sbtest --oltp-tables-count=10 --report-interval=1 --num-threads=10000 --max-time=300 --max-requests=0 --oltp-table-size=10000000 --rand-type=uniform --rand-init=on run

Step 1. 10,000 connections

This one is very easy, as there is not much to do to achieve this. We can do this with only one client. But you may face the following error on the client side:

FATAL: error 2004: Can't create TCP/IP socket (24)

This is caused by the open file limit, which is also a limit of TCP/IP sockets. This can be fixed by setting  

ulimit -n 100000
  on the client.

The performance we observe:

[  26s] threads: 10000, tps: 0.00, reads: 33367.48, writes: 0.00, response time: 3681.42ms (95%), errors: 0.00, reconnects:  0.00
[  27s] threads: 10000, tps: 0.00, reads: 33289.74, writes: 0.00, response time: 3690.25ms (95%), errors: 0.00, reconnects:  0.00

Step 2. 25,000 connections

With 25,000 connections, we hit an error on MySQL side:

Can't create a new thread (errno 11); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug

If you try to lookup information on this error you might find the following article:

But it does not help in our case, as we have all limits set high enough:

cat /proc/`pidof mysqld`/limits
Limit                     Soft Limit Hard Limit           Units
Max cpu time              unlimited  unlimited            seconds
Max file size             unlimited  unlimited            bytes
Max data size             unlimited  unlimited            bytes
Max stack size            8388608    unlimited            bytes
Max core file size        0          unlimited            bytes
Max resident set          unlimited  unlimited            bytes
Max processes             500000     500000               processes
Max open files            1000000    1000000              files
Max locked memory         16777216   16777216             bytes
Max address space         unlimited  unlimited            bytes
Max file locks            unlimited  unlimited            locks
Max pending signals       255051     255051               signals
Max msgqueue size         819200     819200               bytes
Max nice priority         0          0
Max realtime priority     0          0
Max realtime timeout      unlimited unlimited            us

This is where we start using the thread pool feature:



to the my.cnf and restart Percona Server

The results:

[   7s] threads: 25000, tps: 0.00, reads: 33332.57, writes: 0.00, response time: 974.56ms (95%), errors: 0.00, reconnects:  0.00
[   8s] threads: 25000, tps: 0.00, reads: 33187.01, writes: 0.00, response time: 979.24ms (95%), errors: 0.00, reconnects:  0.00

We have the same throughput, but actually the 95% response time has improved (thanks to the thread pool) from 3690 ms to 979 ms.

Step 3. 50,000 connections

This is where we encountered the biggest challenge. At first, trying to get 50,000 connections in sysbench we hit the following error:

FATAL: error 2003: Can't connect to MySQL server on '' (99)

Error (99) is cryptic and it means: Cannot assign requested address.

It comes from the limit of ports an application can open. By default on my system it is

cat /proc/sys/net/ipv4/ip_local_port_range : 32768   60999

This says there are only 28,231 available ports — 60999 minus 32768 — or the limit of TCP connections you can establish from or to the given IP address.

You can extend this using a wider range, on both the client and the server:

echo 4000 65000 > /proc/sys/net/ipv4/ip_local_port_range

This will give us 61,000 connections, but this is very close to the limit for one IP address (maximal port is 65535). The key takeaway from here is that if we want more connections we need to allocate more IP addresses for MySQL server. In order to achieve 100,000 connections, I will use two IP addresses on the server running MySQL.

After sorting out the port ranges, we hit the following problem with sysbench:

sysbench 0.5:  multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 50000
FATAL: pthread_create() for thread #32352 failed. errno = 12 (Cannot allocate memory)

In this case, it’s a problem with sysbench memory allocation (namely lua subsystem). Sysbench can allocate memory for only 32,351 connections. This is a problem which is even more severe in sysbench 1.0.x.

Sysbench 1.0.x limitation

Sysbench 1.0.x uses a different Lua JIT, which hits memory problems even with 4000 connections, so it is impossible to go over 4000 connection in sysbench 1.0.x

So it seems we hit a limit with sysbench sooner than with Percona Server. In order to use more connections, we need to use multiple sysbench clients, and if 32,351 connections is the limit for sysbench, we have to use at least four sysbench clients to get up to 100,000 connections.

For 50,000 connections I will use 2 servers (each running separate sysbench), each running 25,000 threads from sysbench.

The results for each sysbench looks like:

[  29s] threads: 25000, tps: 0.00, reads: 16794.09, writes: 0.00, response time: 1799.63ms (95%), errors: 0.00, reconnects:  0.00
[  30s] threads: 25000, tps: 0.00, reads: 16491.03, writes: 0.00, response time: 1800.70ms (95%), errors: 0.00, reconnects:  0.00

So we have about the same throughput (16794*2 = 33588 tps in total), however the 95% response time doubled. This is to be expected as we are using twice as many connections compared to the 25,000 connections benchmark.

Step 3. 75,000 connections

To achieve 75,000 connections we will use three servers with sysbench, each running 25,000 threads.

The results for each sysbench:

[ 157s] threads: 25000, tps: 0.00, reads: 11633.87, writes: 0.00, response time: 2651.76ms (95%), errors: 0.00, reconnects:  0.00
[ 158s] threads: 25000, tps: 0.00, reads: 10783.09, writes: 0.00, response time: 2601.44ms (95%), errors: 0.00, reconnects:  0.00

Step 4. 100,000 connections

There is nothing eventful to achieve75k and 100k connections. We just spin up an additional server and start sysbench. For 100,000 connections we need four servers for sysbench, each shows:

[ 101s] threads: 25000, tps: 0.00, reads: 8033.83, writes: 0.00, response time: 3320.21ms (95%), errors: 0.00, reconnects:  0.00
[ 102s] threads: 25000, tps: 0.00, reads: 8065.02, writes: 0.00, response time: 3405.77ms (95%), errors: 0.00, reconnects:  0.00

So we have the same throughput (8065*4=32260 tps in total) with 3405ms 95% response time.

A very important takeaway from this: with 100k connections and using a thread pool, the 95% response time is even better than for 10k connections without a thread pool. The thread pool allows Percona Server to manage resources more efficiently and provides better response times.


100k connections is quite achievable for MySQL, and I am sure we could go even further. There are three components to achieve this:

  • Thread pool in Percona Server
  • Proper tuning of network limits
  • Using multiple IP addresses on the server box (one IP address per approximately 60k connections)

Appendix: full my.cnf

datadir {{ mysqldir }}
# Disabling symbolic-links is recommended to prevent assorted security risks
# general
table_open_cache = 200000
# files
# buffers
innodb_buffer_pool_size= 40G
# tune
innodb_doublewrite= 1
innodb_flush_log_at_trx_commit= 0
innodb_stats_persistent = 1
innodb_adaptive_flushing = 1
innodb_flush_neighbors = 0
innodb_read_io_threads = 16
innodb_write_io_threads = 16
innodb_monitor_enable = '%'
performance_schema = ON

by Vadim Tkachenko at February 25, 2019 03:00 PM

February 23, 2019

Valeriy Kravchuk

Fun with Bugs #80 - On MySQL Bug Reports I am Subscribed to, Part XVI

Today I'd like to continue my review of public MySQL bug reports with a list of some bugs I've subscribed to over last 3 weeks. It's already long enough and includes nice cases to check and share. Note that I usually subscribe to a bug either because it directly affects me or customers I work with, or I consider it technically interesting (so I mostly care about InnoDB, replication, partitioning and optimizer bugs), or it's a "metabug" - a problem in the way public bug report is handled by Oracle engineers. These are my interests related to MySQL bugs.

As usual, I start with the oldest bugs and try to mention bug reporters by name with links to their other reports whenever this may give something useful to a reader. I try to check if MariaDB is also affected in some cases. Check also my summary comments at the end of this blog post.
  • Bug #94148 - "Unnecessary Shared lock on parent table During UPDATE on a child table". In this bug report Uday Varagani reasonably pointed out that formally there is no need to lock parent row when column NOT included in the foreign key gets updated. This happens though when this column is included into the index used to support foreign key constraint. IMHO it's a reasonable feature request and both Trey Raymond and Sveta Smirnova tried their best to  highlight this, but this report now has a "Need Feedback" status with a request to explain new algorithm suggested. It's simple - "Stop it", check that column changed is NOT the one foreign key is defined on, even if it's in the same index...I see no reason NOT to verify this as a reasonable feature request. Is it a new policy that every feature request should come with details on how to implement it? I truly doubt.
  • Bug #94224 - "[5.6] Optimizer reconsiders index based on index definition order, not value". Domas Mituzas found yet another case (see also Bug #36817 - "Non optimal index choice, depending on index creation order" from Jocelyn Fournier, the bug I verified more than 10 years ago) when in MySQL order of index definition matters more for optimizer than anything else.  My quick check shows that MariaDB 10.3.7 is not affected:
    MariaDB [test]> explain select distinct b from t1 where c not in (0) and d > 0;+------+-------------+-------+-------+---------------+--------------------+---------+------+------+-------------+| id   | select_type | table | type  | possible_keys | key            | key_len
    | ref  | rows | Extra                    |
    |    1 | SIMPLE      | t1    | index | NULL          | non_covering_index | 9    | NULL |    1 | Using where |
    1 row in set (0.002 sec)

    MariaDB [test]> alter table t1 add index covering_index (b, c, d);
    Query OK, 0 rows affected (0.149 sec)
    Records: 0  Duplicates: 0  Warnings: 0

    MariaDB [test]> explain select distinct b from t1 where c not in (0) and d > 0;
    | id   | select_type | table | type  | possible_keys | key            | key_len
    | ref  | rows | Extra                    |
    |    1 | SIMPLE      | t1    | index | NULL          | covering_index | 14
    | NULL |    1 | Using where; Using index |
    1 row in set (0.025 sec)
    Fortunately MySQL 8 is no longer affected. Unfortunately we do not see a public comment showing the results of testing on MySQL 5.7 (or any version, for that matter), from engineer who verified the bug. I already pointed out that this "metabug" becomes popular in my previous blog post.
  • Bug #94243 - "WL#9508 introduced non-idiomatic potentially-broken C macros". Laurynas Biveinis from Percona found new code that in ideal world wound not pass any serious code review.
  • Bug #94251 - "Aggregate function result is dependent by window is defined directly or as named". This bug was reported by Владислав Сокол. From what I see:
    MariaDB [test]> WITH RECURSIVE cte AS (
        -> SELECT 1 num
        -> UNION ALL
        -> SELECT num+1 FROM cte WHERE num < 5
        -> )
        -> SELECT num, COUNT(*) OVER (frame) cnt_named, COUNT(*) OVER (ORDER BY num
    DESC) cnt_direct
        -> FROM cte
        -> WINDOW frame AS (ORDER BY num DESC);
    | num  | cnt_named | cnt_direct |
    |    1 |         5 |          5 |
    |    2 |         4 |          4 |
    |    3 |         3 |          3 |
    |    4 |         2 |          2 |
    |    5 |         1 |          1 |
    5 rows in set (0.117 sec)

    MariaDB [test]> WITH RECURSIVE cte AS (
        -> SELECT 1 num
        -> UNION ALL
        -> SELECT num+1 FROM cte WHERE num < 5
        -> )
        -> SELECT num, COUNT(*) OVER (frame) cnt_named, COUNT(*) OVER (ORDER BY num
    DESC) cnt_direct
        -> FROM cte
        -> WINDOW frame AS (ORDER BY num DESC)
        -> ORDER BY num desc;
    | num  | cnt_named | cnt_direct |
    |    5 |         1 |          1 |
    |    4 |         2 |          2 |
    |    3 |         3 |          3 |
    |    2 |         4 |          4 |
    |    1 |         5 |          5 |
    5 rows in set (0.003 sec)
    MariaDB 10.3.7 is NOT affected.
  • Bug #94283 - "MySQL 8.0.15 is slower than MySQL 5.7.25". Percona's CTO Vadim Tkachenko reported that MySQL 8.0.15 is notably slower than 5.7.25 on a simple oltp_read_write sysbench test. He had recently written a separate blog post about this, with more details.There is one detail to clarify based on today's comment from Peter Zaitsev (was the same default character set used), but as my dear friend Sinisa Milivojevic verified the bug without any questions, requests or his own test outputs shared, we can assume that Oracle officially accepted this performance regression (even though "regression" tag was not set).

    Check also later Bug #94387 - "MySQL 8.0.15 is slower than MySQL 5.7.25 in read only workloads", yet another performance regression report from Vadim, where he found that on read only (sysbench oltp_point_select) all in memory workloads MySQL 8.0.15 may also be slower than MySQL 5.7.25.
  • Bug #94302 - "reset master could not break dump thread in some cases". This bug was reported by Ashe Sun. This is definitely a corner case, as it happens only master is still writing to the very first binary log. We can not find out from public comments in the bug report if any other versions besides 5.7.x are affected. This is yet another "metabug" - during my days in Oracle's MySQL bugs verification team we had to check on all versions still supported and present the results explicitly.
  • Bug #94319 - "Format_description_log_event::write can cause segfaults". Nice bug report by Manuel Ung from Facebook.
  • Bug #94330 - "Test for possible compressed failures before upgrade?". Change of zlib version starting from MySQL 5.7.24 means that some operations for InnoDB tables with ROW_FORMAT=COMPRESSED that previously worked may start to fail. In this report Monty Solomon asks for some way to determine if there will be a problem with existing compressed tables before upgrading to 5.7.24. The bug is still "Open".
  • Bug #94338 - "Dirty read-like behavior in READ COMMITTED transaction". Bug reporter, Masaki Oguro, stated that MySQL 8 is not affected (only 5.6 and 5.7) and the bug is verified on these versions, so we should assume it's really the case. But I miss public comment showing the result of testing on recent MySQL 8.0.15.
  • Bug #94340 - "backwards incompatible changes in 8.0: Error number: 3747". Simon Mudd complains about incompatible change in 8.0.13 that does not allow slave to easily switch from SBR to RBR without restart (and was not clearly documented as a change in behavior). Make sure to read all comments.
  • Bug #94370 - "Performance regression of btr_cur_prefetch_siblings". Nice bug report with a patch from Zhai Weixiang.
  • Bug #94383 - "simple ALTER cause unnecessary InnoDB index rebuilds, 5.7.23 or later 5.7 rlses". In this bug report Mikhail Izioumtchenko presented the detailed analysis and suggested diagnostics patches to show what really happens and why. This bug is also a regression of a kind, so while testing results are presented, I still think that it could be processed better according to the good old rules I have in mind.
  • Bug #94394 - "Absence of mysql.user leads to auto-apply of --skip-grant-tables". Great finding by Ceri Williams from Percona. Sveta Smirnova provided a separate MTR test case and clarified the impact of the bug. Surely this is also a regression comparing to MySQL 5.7, as there you can not start MySQL if mysql.user table is missing. I leave it to a reader to decide if there is any security-related impact of this bug...
  • Bug #94396 - "Error message too broad: The used command is not allowed with this MySQL version". This bug was reported by my former colleague in Percona Support, famous Bill Karwin. Informative error messages matter for good user experience.
We rely on MySQL in a same way as that guys on top of dolphins pyramid on this strange monument in some court somewhere at the Lanes. Reliable foundation matters, so regressions should better be avoided.
To summarize:
  1. Looks like it's time for Oracle to spend some efforts to make MySQL 8 great again, by fixing some of the bugs mentioned above, especially performance regressions vs MySQL 5.7 found recently by Vadim Tkachenko from Percona.
  2. Oracle continues to introduce backward-incompatible changes in behavior in minor MySQL 8.0.x releases at GA stage. This is not really good for any production environment.
  3. Asking bug reporters to provide "the basics of such a new algorithm" when they complain that current one is wrong or not optimal is a new word in bugs processing!
  4. When I joined MySQL bugs verification team in 2005 we've set up a culture of bugs processing that included, among other things, presenting in a public comment any successful or unsuccessful attempt to verify the bug, by copy-pasting all commands and statements used along with the outputs, whenever possible and with enough context to show what was really checked. I've studied this approach from Oracle's Tom Kyte over the previous 10 years when I followed him closely. I used to think it's standard for more than a decade already, a kind of my (and not only my) "heritage". It's sad to see this approach is no longer followed by many Oracle engineers who process bugs, in too many cases.
  5. Oracle engineers still do not use "regression" tag when setting "Verified" status for obviously regression bugs. I think bug reporters should care then to always set it when they report regression of any kind.

by Valerii Kravchuk ( at February 23, 2019 06:10 PM

February 22, 2019

Peter Zaitsev

Percona Live 2019 First Sneak Peek!

Percona Live 2019We know you’ve been really looking forward to a glimpse of what to expect at Percona Live Austin, so here is the first sneak peek of the agenda!

Our conference committee has been reviewing hundreds of talks over the last few weeks and is delighted to present some initial talks.

  • New features in MySQL 8.0 Replication by Luís Soares, Oracle OSS
  • Shaping the Future of Privacy & Data Protection by Cristina DeLisle, XWiki SAS
  • Galera Cluster New Features by Seppo Jaakola, Codership
  • MySQL Security and Standardization at PayPal by Stacy Yuan &  Yashada Jadha, PayPal
  • Mailchimp Scale: a MySQL Perspective by John Scott, Mailchimp
  • The State of Databases in 2019 by Dinesh Joshi, Apache Cassandra

PingCAP will be sponsoring the TiDB track and have a day of really exciting content to share! Liu Tang, Chief Engineer at PingCAP, will be presenting: Using Chaos Engineering to Build a Reliable TiDB. Keep your eye out for more coming soon!

We could not put on this conference without the support of our sponsors. By being a sponsor at Percona Live it gives companies the opportunity to showcase their products and services, interact with the community for invaluable face time, meet with users or customers and showcase their recruitment opportunities.

It’s with great pleasure to announce the first round of sponsors for Percona Live!

Diamond Sponsors





Silver Sponsors


If you’d like to find out more about being a sponsor, download the prospectus here
Stay tuned for more updates on the conference agenda! 

by Bronwyn Campbell at February 22, 2019 05:31 PM

Oli Sennhauser

FromDual Backup and Recovery Manager for MariaDB and MySQL 2.1.0 has been released

FromDual has the pleasure to announce the release of the new version 2.1.0 of its popular Backup and Recovery Manager for MariaDB and MySQL (brman).

The new FromDual Backup and Recovery Manager can be downloaded from here. How to install and use the Backup and Recovery Manager is describe in FromDual Backup and Recovery Manager (brman) installation guide.

In the inconceivable case that you find a bug in the FromDual Backup and Recovery Manager please report it to the FromDual Bugtracker or just send us an email.

Any feedback, statements and testimonials are welcome as well! Please send them to

Upgrade from 1.2.x to 2.1.0

brman 2.1.0 requires a new PHP package for ssh connections.

shell> sudo apt-get install php-ssh2

shell> cd ${HOME}/product
shell> tar xf /download/brman-2.1.0.tar.gz
shell> rm -f brman
shell> ln -s brman-2.1.0 brman

Changes in FromDual Backup and Recovery Manager 2.1.0

This release is a new major release series. It contains a lot of new features. We have tried to maintain backward-compatibility with the 1.2 and 2.0 release series. But you should test the new release seriously!

You can verify your current FromDual Backup Manager version with the following command:

shell> fromdual_bman --version
shell> bman --version

FromDual Backup Manager

  • Usage (--help) updated.
  • Some WARN severities downgraded to INFO to keep mail output clean.
  • Error messages made more flexible and fixed PHP library advice.
  • Split some redundant code from bman library into brman library.
  • Security fix: Password from config file is hidden now.
  • Bug on simulation of physical backup fixed (xtrabackup_binlog_info not found).
  • Options --backup-name and --backup-overwrite introduced for restore automation.
  • Minor typo bugs fixed.
  • Option --options remove.
  • Sort order for schema backup changed to ORDER BY ASC.
  • 2 PHP errors fixed for simulation.
  • Maskerade API added.
  • Physical backup sftp archiving with special characters (+foodmarat) in archive directory name fixed.

FromDual Recovery Manager

  • Rman has progress report.
  • Full logical restore is implemented.
  • Schema logical restore is implemented.
  • Physical restore is implemented.
  • Physical restore of compressed backups is implemented.
  • Option --cleanup-first was implemented for physical backup as well.
  • Option: --stop-instance implemented.

FromDual Backup Manager Catalog

  • No changes.

Subscriptions for commercial use of FromDual Backup and Recovery Manager you can get from from us.

by Shinguz at February 22, 2019 04:14 PM

MariaDB Foundation

“Account Locking and Password Expiration Overview” – MariaDB Unconference Presentations

Security is one of the hottest topics in Computer Software today, everybody handles highly valuable data. From private personal data, medical records for clinics to customers credit card information for  online bussinesses, malicious data breaches are always part of the  worst case scenario. Robert Bindar ( is going to present a session at the 2019 MariaDB Unconference, New York about […]

The post “Account Locking and Password Expiration Overview” – MariaDB Unconference Presentations appeared first on

by Anna Widenius at February 22, 2019 02:45 PM

Peter Zaitsev

PostgreSQL fsync Failure Fixed – Minor Versions Released Feb 14, 2019

fsync postgresql upgrade

PostgreSQL logoIn case you didn’t already see this news, PostgreSQL has got its first minor version released for 2019. This includes minor version updates for all supported PostgreSQL versions. We have indicated in our previous blog post that PostgreSQL 9.3 had gone EOL, and it would not support any more updates. This release includes the following PostgreSQL major versions:

What’s new in this release?

One of the common fixes applied to all the supported PostgreSQL versions is on – panic instead of retrying after fsync () failure. This fsync failure has been in discussion for a year or two now, so let’s take a look at the implications.

A fix to the Linux fsync issue for PostgreSQL Buffered IO in all supported versions

PostgreSQL performs two types of IO. Direct IO – though almost never – and the much more commonly performed Buffered IO.

PostgreSQL uses O_DIRECT when it is writing to WALs (Write-Ahead Logs aka Transaction Logs) only when

 is set to :
 or to 
 with no archiving or streaming enabled. The default 
 may be
 that does not use O_DIRECT. This means, almost all the time in your production database server, you’ll see PostgreSQL using O_SYNC / O_DSYNC while writing to WAL’s. Whereas, writing the modified/dirty buffers to datafiles from shared buffers is always through Buffered IO.  Let’s understand this further.

Upon checkpoint, dirty buffers in shared buffers are written to the page cache managed by kernel. Through an fsync(), these modified blocks are applied to disk. If an fsync() call is successful, all dirty pages from the corresponding file are guaranteed to be persisted on the disk. When there is an fsync to flush the pages to disk, PostgreSQL cannot guarantee a copy of a modified/dirty page. The reason is that writes to storage from the page cache are completely managed by the kernel, and not by PostgreSQL.

This could still be fine if the next fsync retries flushing of the dirty page. But, in reality, the data is discarded from the page cache upon an error with fsync. And the next fsync would obviously succeed ignoring the previous errors, because it now includes the next set of dirty buffers that need to be written to disk and not the ones that failed earlier.

To understand it better, consider an example of Linux trying to write dirty pages from page cache to a USB stick that was removed during an fsync. Neither the ext4 file system nor the btrfs nor an xfs tries to retry the failed writes. A silently failing fsync may result in data loss, block corruption, table or index out of sync, foreign key or other data integrity issues… and deleted records may reappear.

Until a while ago, when we used local storage or storage using RAID Controllers with write cache, it might not have been a big problem. This issue goes back to the time when PostgreSQL was designed for buffered IO but not Direct IO. Should this now be considered an issue with PostgreSQL and the way it’s designed? Well, not exactly.

All this started with the error handling during a writeback in Linux. A writeback asynchronously performs dirty page writes from page cache to filesystem. In ext4 like filesystems, upon a writeback error, the page is marked clean and up to date, and the user space is unaware of the problem.

fsync errors are now detected

Starting from kernel 4.13, we can now reliably detect such errors during fsync. So, any open file descriptor to a file includes a pointer to the address_space structure, and a new 32-bit value (errseq_t) has been added that is visible to all the processes accessing that file. With the new minor version for all supported PostgreSQL versions, a PANIC is triggered upon such error. This performs a database crash and initiates recovery from the last CHECKPOINT. There is a patch expected to be released in PostgreSQL 12 that works for newer kernel versions and modifies the way PostgreSQL handles the file descriptors. A long term solution to this issue may be Direct IO, but you might see a different approach to this in PG 12.

A good amount of work on this issue was done by Jeff Layton on reporting writeback errors, and Matthew Wilcox. What this patch means is that a writeback error gets reported during an fsync, which can be seen by another process that opens that file. A new 32-bit value that stores an error code and a sequence number are added to a new

typedef: errseq_t
 . So, these errors are now in the
 . But, if the struct inode is gone due to a memory pressure, this patch has no value.

Can i enable or disable the PANIC on fsync failure in PostgreSQL newer releases ?

Yes. You can set this parameter :

 to false (default), where a PANIC-level error is raised to recover from WAL through a database crash. You must be sure to have a proper high-availability mechanism so that the impact is minimal for your application. You could let your application failover to a slave, which could minimize the impact.

You can always set

 to true, if you are sure about how your OS behaves during write-back failures. By setting this to true, PostgreSQL will just report an error and continue to run.

Some of the other possible issues now fixed and common to these minor releases

  1. A lot of features and fixes related to PARTITIONING have been applied in this minor release. (PostgreSQL 10 and 11 only).
  2. Autovacuum has been made more aggressive about removing leftover temporary tables.
  3. Deadlock when acquiring multiple buffer locks.
  4. Crashes in logical replication.
  5. Incorrect planning of queries in which a lateral reference must be evaluated at a foreign table scan.
  6. Fixed some issues reported with ANALYZE and TRUNCATE operations.
  7. Fix to contrib/hstore to calculate correct hash values for empty hstore values that were created in version 8.4 or before.
  8. A fix to pg_dump’s handling of materialized views with indirect dependencies on primary keys.

We always recommend that you keep your PostgreSQL databases updated to the latest minor versions. Applying a minor release might need a restart after updating the new binaries.

Here is the sequence of steps you should follow to upgrade to the latest minor versions after thorough testing :

  1. Shutdown the PostgreSQL database server
  2. Install the updated binaries
  3. Restart your PostgreSQL database server

Most of the time, you can choose to update the minor versions in a rolling fashion in a master-slave (replication) setup because it avoids downtime for both reads and writes simultaneously. For a rolling style update, you could perform the update on one server after another… but not all at once. However, the best method that we’d almost always recommend is – shutdown, update and restart all instances at once.

If you are currently running your databases on PostgreSQL 9.3.x or earlier, we recommend that you to prepare a plan to upgrade your PostgreSQL databases to the supported versions ASAP. Please subscribe to our blog posts so that you can hear about the various options for upgrading your PostgreSQL databases to a supported major version.

Photo by Andrew Rice on Unsplash

by Avinash Vallarapu at February 22, 2019 01:47 PM

MariaDB Foundation

“How to write your first patch ? ” – MariaDB Unconference Presentations

 Have you ever wondered how to get started with contributions to the world’s most popular open source database? Did you have a problems with building and configuring from source code, writing the contribution patch and testing the server with  use of mysql-test-run (mtr) framework  afterwards? How to make your patch visible to other developers? In […]

The post “How to write your first patch ? ” – MariaDB Unconference Presentations appeared first on

by Anna Widenius at February 22, 2019 12:51 PM

Peter Zaitsev

Measuring Percona Server for MySQL On-Disk Decryption Overhead

benchmark heavy IO percona server for mysql 8 encryption

Percona Server for MySQL 8.0 comes with enterprise grade total data encryption features. However, there is always the question of how much overhead – or performance penalty – comes with the data decryption. As we saw in my networking performance post, SSL under high concurrency might be problematic. Is this the case for data decryption?

To measure any overhead, I will start with a simplified read-only workload, where data gets decrypted during read IO.

MySQL decryption schematic

During query execution, the data in memory is already decrypted so there is no additional processing time. The decryption happens only for blocks that require a read from storage.

For the benchmark I will use the following workload:

sysbench oltp_read_only --mysql-ssl=off --tables=20 --table-size=10000000 --threads=$i --time=300 --report-interval=1 --rand-type=uniform run

The datasize for this workload is about 50GB, so I will use

innodb_buffer_pool_size = 5GB
  to emulate a heavy disk read IO during the benchmark. In the second run, I will use
innodb_buffer_pool_size = 60GB
  so all data is kept in memory and there are NO disk read IO operations.

I will only use table-level encryption at this time (ie: no encryption for binary log, system tablespace, redo-  and undo- logs).

The server I am using has AES hardware CPU acceleration. Read more at

Benchmark N1, heavy read IO

benchmark heavy IO percona server for mysql 8 encryption

Threads encrypted storage no encryption encryption overhead
1 389.11 423.47 1.09
4 1531.48 1673.2 1.09
16 5583.04 6055 1.08
32 8250.61 8479.61 1.03
64 8558.6 8574.43 1.00
96 8571.55 8577.9 1.00
128 8570.5 8580.68 1.00
256 8576.34 8585 1.00
512 8573.15 8573.73 1.00
1024 8570.2 8562.82 1.00
2048 8422.24 8286.65 0.98

Benchmark N2, data in memory, no read IO

benchmark data in memory percona server for mysql 8 encryption

Threads Encryption No encryption
1 578.91 567.65
4 2289.13 2275.12
16 8304.1 8584.06
32 13324.02 13513.39
64 20007.22 19821.48
96 19613.82 19587.56
128 19254.68 19307.82
256 18694.05 18693.93
512 18431.97 18372.13
1024 18571.43 18453.69
2048 18509.73 18332.59


For a high number of threads, there is no measurable difference between encrypted and unencrypted storage. This is because a lot of CPU resources are spent in contention and waits, so the relative time spend in decryption is negligible.

However, we can see some performance penalty for a low number of threads: up to 9% penalty for hardware decryption. When data fully fits into memory, there is no measurable difference between encrypted and unencrypted storage.

So if you have hardware support then you should see little impact when using storage encryption with MySQL. The easiest way to check if you have support for this is to look at CPU flags and search for ‘aes’ string:

> lscpu | grep aes Flags: ... tsc_deadline_timer aes xsave avx f16c ...

by Vadim Tkachenko at February 22, 2019 12:38 PM

Chris Calender

MariaDB MaxScale Masking Basics and Examples

I wanted to take a moment to write up a post on MariaDB MaxScale’s masking basics and include some real-world examples.

We have nice documentation on the subject, and Dipti wrote a nice blog post on it as well. I just wanted to provide my take on it, and hopefully build upon what is already there and offer some additional insights.

To provide a 50-foot overview, the masking filter makes it possible to obfuscate the returned value of a particular column.

3 quite common columns where this would be very beneficial: Social Security Number (“SSN”), Date of Birth (“DOB”), and Credit Card Number (“CCNUM”).

To use masking, it assumes you already have a MaxScale service up and running. For instance, the readwrite splitter.

In this case, you would already have a configuration file similar to this (3 backend servers, 1 master (server1), 2 slaves (server2 & server3), with readwritesplit (Read-Write-Service) and its listener (Read-Write-Listener) set up:

log_debug=1     # debug only







In the examples from the aforementioned manual and blog post, you will see something like this for your “configuration” addition:



MyMasking is the name you will choose for your masking filter.

MyService is a service you already have defined and running. In this example, it is [Read-Write-Service].

Thus, I simply add the following line to [Read-Write-Service]:


If you already have a filter defined for this service, say NamedServerFilter, then you can add a second filter like this (i.e., each filter is separated by a “|”):

filters=NamedServerFilter | MyMasking

And then add your [MyMasking] section/configuration:


In the above, the type is “filter”, and the module is “masking”. Both of those are self-explanatory.

The “warn_type_mismatch” instructs MaxScale to log a warning if a masking rule matches a column that is not of one of the allowed types. Possible values are “never” and “always” (with “never” being the default). However, a limitation of masking is that can only be used for masking columns of the following types: BINARY, VARBINARY, CHAR, VARCHAR, BLOB, TINYBLOB, MEDIUMBLOB, LONGBLOB, TEXT, TINYTEXT, MEDIUMTEXT, LONGTEXT, ENUM and SET. If the type of the column is something else (INTs, DATEs, etc.), then no masking will be performed. So you might want to be “warned” if this happens, hence why I chose “always”.

The “large_payload” specifies how the masking filter should treat payloads larger than 16MB. Possible values are “ignore” and “abort” (with “abort” being the default). If you choose ignore, then if the result set is > 16MB, then no masking will be performed, and the result set will be returned to the client. If abort, then the client conneciton is closed.

And the “rules” defines the path and name to the masking_rules.json file which you must use to define your rules, what you want filtered, which columns, from which tables, schemas, or database-wide, and options on how to handle the display, and so forth. It is very flexible, suffice to say.

Thus my updated config file becomes:

log_debug=1     # debug only








In MaxScale 2.3, there is also a “prevent_function_usage” option, which can be set to “true” or “false”. If true, then all statements that contain functions referring to masked columns will be rejected. Otherwise, not. True is the default, thus I’ll omit this part, so that this config can be used for all 2.x MaxScale setups.

Now we need to create masking_rules.json (in /etc/maxscale.modules.d/), and we should be all set to start masking.

chris@chris-linux-laptop-64:/etc/maxscale.modules.d$ cat masking_rules.json
	"rules": [
			"replace": {
				"column": "SSN"
			"with": {
				"fill": "*"

This is the most basic. In this rule, *any* column named “SSN” in *any* schema will be replaced with all “*”s.

So, once you’ve made your config change, and created masking_rules.json, it’s time to restart MaxScale so that it reads/loads your new masking filter:

sudo service maxscale restart

Now for some testing:

CREATE SCHEMA employees;

USE employees;

CREATE TABLE employees (name char(10), location char(10), SSN char(11), DOB char(10), CCNUM char(16)); 

INSERT INTO employees VALUES ('chris', 'hanger18', '123-45-6789', '07/07/1947', '6011123456789012');

Note that I made DOB a CHAR column so that masking would be applicable as it is not for a DATE column.

Thus with no masking, we see everything:

SELECT * FROM employees.employees;
| name  | location | SSN         | DOB        | CCNUM            |
| chris | hanger18 | 123-45-6789 | 07/07/1947 | 6011123456789012 |

Now, connect to the service listener, in this case [Read-Write-Listener] running on port 4006:

mysql -uroot -pxxx -P4006 --protocol=tcp

SELECT * FROM employees.employees;
| name  | location | SSN         | DOB        | CCNUM            |
| chris | hanger18 | *********** | 07/07/1947 | 6011123456789012 |

So we successfully ***’ed out SSN. Now, to also handle DOB and CCNUM. So edit the masking_rules.json file to:

	"rules": [
			"replace": {
				"column": "SSN"
			"with": {
				"fill": "*"
			"replace": {
				"column": "DOB"
			"with": {
				"fill": "*"
			"replace": {
				"column": "CCNUM"
			"with": {
				"fill": "*"

If for the time being, you can still use MaxAdmin to reload the file without having to restart MaxScale (though do note maxadmin is deprecated in 2.3, and will be removed soon, though I suspect all functionality it provided will be available via maxctrl soon, if not already.):

sudo maxadmin
MaxScale> call command masking reload MyMasking

Assuming the last command completed without errors, then can now simply re-query (via port 4006). However, first exit the connection to port 4006 and then re-connect:

select * from employees.employees;
| name  | location | SSN         | DOB        | CCNUM            |
| chris | hanger18 | *********** | ********** | **************** |

Note: The column names are case-sensitive, so if you have columns like “SSN” and “ssn”, then you will need to add 2 entries to masking_rules.json.

Here is a table that uses “ssn” instead of “SSN” (everything else is the same):

CREATE TABLE employees2 (name char(10), location char(10), ssn char(11), DOB char(10), CCNUM char(16));

INSERT INTO employees2 VALUES ('chris', 'hanger18', '123-45-6789', '07/07/1947', '6011123456789012');

SELECT * FROM employees.employees2;
| name  | location | ssn         | DOB        | CCNUM            |
| chris | hanger18 | 123-45-6789 | ********** | **************** |

As you can see, the “ssn” is not masked, but DOB and CCNUm still are. So let’s add a sction for “ssn” in masking_rules.json:

	"rules": [
			"replace": {
				"column": "SSN"
			"with": {
				"fill": "*"
			"replace": {
				"column": "ssn"
			"with": {
				"fill": "*"
			"replace": {
				"column": "DOB"
			"with": {
				"fill": "*"
			"replace": {
				"column": "CCNUM"
			"with": {
				"fill": "*"

Then reload the file:

sudo maxadmin
MaxScale> call command masking reload MyMasking

And then exit port 4006 and re-connect, and re-issue the query:

SELECT * FROM employees.employees2;
| name  | location | ssn         | DOB        | CCNUM            |
| chris | hanger18 | *********** | ********** | **************** |

There we have it.

And again, you have many more options when it comes to your string replacements, matching, fill, values, obfuscation, pcre2 regex, and so forth. I’ll leave you to the manual page to investigate those options if you wish.

All in all, I hope this is helpful for anyone wanting to get started using MaxScale’s masking filter.

by chris at February 22, 2019 10:48 AM

MariaDB Foundation

MariaDB 10.3.13 and MariaDB Connector/C 3.0.9 now available

The MariaDB Foundation is pleased to announce the availability of MariaDB 10.3.13, the latest stable release in the MariaDB 10.3 series, as well as MariaDB Connector/C 3.0.9, the latest stable release in the MariaDB Connector/ODBC series. See the release notes and changelogs for details. Download MariaDB 10.3.13 Release Notes Changelog What is MariaDB 10.3? MariaDB […]

The post MariaDB 10.3.13 and MariaDB Connector/C 3.0.9 now available appeared first on

by Ian Gilfillan at February 22, 2019 02:21 AM

February 21, 2019

Peter Zaitsev

Percona Server for MongoDB Operator 0.2.1 Early Access Release Is Now Available

Percona Server for MongoDB

Percona Server for MongoDB OperatorPercona announces the availability of the Percona Server for MongoDB Operator 0.2.1 early access release.

The Percona Server for MongoDB Operator simplifies the deployment and management of Percona Server for MongoDB in a Kubernetes or OpenShift environment. It extends the Kubernetes API with a new custom resource for deploying, configuring and managing the application through the whole life cycle.

Note: PerconaLabs is one of the open source GitHub repositories for unofficial scripts and tools created by Percona staff. These handy utilities can help save your time and effort.

Percona software builds located in the Percona repository are not officially released software, and also aren’t covered by Percona support or services agreements.

You can install the Percona Server for MongoDB Operator on Kubernetes or OpenShift. While the operator does not support all the Percona Server for MongoDB features in this early access release, instructions on how to install and configure it are already available along with the operator source code in our Github repository.

The Percona Server for MongoDB Operator on Percona-Lab is an early access release. Percona doesn’t recommend it for production environments.


  • Backups to S3 compatible storages
  • CLOUD-117: An error proof functionality was included into this release. It doesn’t allow unsafe configurations by default, preventing user from configuring a cluster with more than one Arbiter node or a Replica Set with less than three nodes.
    • For those who still need such configurations, this protection can be disabled by setting allowUnsafeConfigurations=true in the deploy/cr.yaml file.

Fixed Bugs

  • CLOUD-105: The Service-per-Pod feature used with the LoadBalancer didn’t work with cluster sizes not equal to 1.
  • CLOUD-137: PVC assigned to the Arbiter Pod had the same size as PVC of the regular Percona Server for MongoDB Pods, despite the fact that Arbiter doesn’t store data.

Percona Server for MongoDB is an enhanced, open source and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB Community Edition. It supports MongoDB protocols and drivers. Percona Server for MongoDB extends MongoDB Community Edition functionality by including the Percona Memory Engine, as well as several enterprise-grade features. It requires no changes to MongoDB applications or code.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system.

by Dmitriy Kostiuk at February 21, 2019 09:48 PM

MySQL 8 is not always faster than MySQL 5.7

mysql 8 slower than mysql 5.7 sysbench

MySQL 8.0.15 performs worse in sysbench oltp_read_write than MySQL 5.7.25

Initially I was testing group replication performance and was puzzled why MySQL 8.0.15 performs consistently worse than MySQL 5.7.25.

It appears that a single server instance is affected by a performance degradation.

My testing setup

mysql 8 slower than mysql 5.7 sysbenchHardware details:
Bare metal server provided by, instance size: c2.medium.x86
24 Physical Cores @ 2.2 GHz
(1 X AMD EPYC 7401P)
Memory: 64 GB of ECC RAM

Storage : INTEL® SSD DC S4500, 480GB

This is a server grade SATA SSD.


sysbench oltp_read_write --report-interval=1 --time=1800 --threads=24 --tables=10 --table-size=10000000 --mysql-user=root --mysql-socket=/tmp/mysql.sock run

In the following summary I used these combinations:

  • innodb_flush_log_at_trx_commit=0 or 1
  • Binlog: off or on
  • sync_binlog=1000 or sync_binlog=1

The summary table, the number are transactions per second (tps – the more the better)

| case                                      | MySQL 5.7.25 | MySQL 8.0.15 | ratio |
| trx_commit=0, binlog=off                  | 11402 tps    | 9840(*)      | 1.16  |
| trx_commit=1, binlog=off                  | 8375         | 7974         | 1.05  |
| trx_commit=0, binlog=on, sync_binlog=1000 | 10862        | 8871         | 1.22  |
| trx_commit=0, binlog=on, sync_binlog=1    | 7238         | 6459         | 1.12  |
| trx_commit=1, binlog=on, sync_binlog=1    | 5970         | 5043         | 1.18  |

Summary: MySQL 8.0.15 is persistently worse than MySQL 5.7.25.

In the worst case with

 , it is worse by 22%, which is huge.

I was looking to use these settings for group replication testing, but these settings, when used with MySQL 8.0.15, provide much worse results than I had with MySQL 5.7.25

(*)  in the case of trx_commit=0, binlog=off, MySQL 5.7.25 performance is very stable, and practically stays at the 11400 tps level. MySQL 8.0.15 varies a lot from 8758 tps to 10299 tps in 1 second resolution measurements


To clarify some comments, I’ve used latin1 CHARSET in this benchmark for both MySQL 5.7 and MySQL 8.0


datadir= /mnt/data/mysql
log_bin = binlog
binlog_format = ROW
# Disabling symbolic-links is recommended to prevent assorted security risks
# Recommended in standard MySQL setup
# general
 table_open_cache = 200000
# files
# buffers
 innodb_buffer_pool_size= 40G
# tune
 innodb_doublewrite= 1
 innodb_flush_log_at_trx_commit= 0
 innodb_stats_persistent = 1
# perf special
 innodb_adaptive_flushing = 1
 innodb_flush_neighbors = 0
 innodb_read_io_threads = 16
 innodb_write_io_threads = 16

Photo by Suzy Hazelwood from Pexels


by Vadim Tkachenko at February 21, 2019 06:10 PM

Parallel queries in PostgreSQL

parallel queries in postgresql

PostgreSQL logoModern CPU models have a huge number of cores. For many years, applications have been sending queries in parallel to databases. Where there are reporting queries that deal with many table rows, the ability for a query to use multiple CPUs helps us with a faster execution. Parallel queries in PostgreSQL allow us to utilize many CPUs to finish report queries faster. The parallel queries feature was implemented in 9.6 and helps. Starting from PostgreSQL 9.6 a report query is able to use many CPUs and finish faster.

The initial implementation of the parallel queries execution took three years. Parallel support requires code changes in many query execution stages. PostgreSQL 9.6 created an infrastructure for further code improvements. Later versions extended parallel execution support for other query types.


  • Do not enable parallel executions if all CPU cores are already saturated. Parallel execution steals CPU time from other queries, and increases response time.
  • Most importantly, parallel processing significantly increases memory usage with high WORK_MEM values, as each hash join or sort operation takes a work_mem amount of memory.
  • Next, low latency OLTP queries can’t be made any faster with parallel execution. In particular, queries that returns a single row can perform badly when parallel execution is enabled.
  • The Pierian spring for developers is a TPC-H benchmark. Check if you have similar queries for the best parallel execution.
  • Parallel execution supports only SELECT queries without lock predicates.
  • Proper indexing might be a better alternative to a parallel sequential table scan.
  • There is no support for cursors or suspended queries.
  • Windowed functions and ordered-set aggregate functions are non-parallel.
  • There is no benefit for an IO-bound workload.
  • There are no parallel sort algorithms. However, queries with sorts still can be parallel in some aspects.
  • Replace CTE (WITH …) with a sub-select to support parallel execution.
  • Foreign data wrappers do not currently support parallel execution (but they could!)
  • There is no support for FULL OUTER JOIN.
  • Clients setting max_rows disable parallel execution.
  • If a query uses a function that is not marked as PARALLEL SAFE, it will be single-threaded.
  • SERIALIZABLE transaction isolation level disables parallel execution.

Test environment

The PostgreSQL development team have tried to improve TPC-H benchmark queries’ response time. You can download the benchmark and adapt it to PostgreSQL by using these instructions. It’s not an official way to use the TPC-H benchmark, so you shouldn’t use it to compare different databases or hardware.

  1. Download (or newer version) from official TPC site.
  2. Rename makefile.suite to Makefile and modify it as requested at . Compile the code with make command
  3. Generate data: ./dbgen -s 10 generates 23GB database which is enough to see the difference in performance for parallel and non-parallel queries.
  4. Convert tbl files to csv with for + sed
  5. Clone pg_tpch repository and copy csv files to pg_tpch/dss/data
  6. Generate queries with qgen command
  7. Load data to the database with ./ command.

Parallel sequential scan

This might be faster not because of parallel reads, but due to scattering of data across many CPU cores. Modern OS provides good caching for PostgreSQL data files. Read-ahead allows getting a block from storage more than just the block requested by PG daemon. As a result, query performance is not limited due to disk IO. It consumes CPU cycles for:

  • reading rows one by one from table data pages
  • comparing row values and WHERE conditions

Let’s try to execute simple select query:

tpch=# explain analyze select l_quantity as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day;
Seq Scan on lineitem (cost=0.00..1964772.00 rows=58856235 width=5) (actual time=0.014..16951.669 rows=58839715 loops=1)
Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 1146337
Planning Time: 0.203 ms
Execution Time: 19035.100 ms

A sequential scan produces too many rows without aggregation. So, the query is executed by a single CPU core.

After adding SUM(), it’s clear to see that two workers will help us to make the query faster:

explain analyze select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day;
Finalize Aggregate (cost=1589702.14..1589702.15 rows=1 width=32) (actual time=8553.365..8553.365 rows=1 loops=1)
-> Gather (cost=1589701.91..1589702.12 rows=2 width=32) (actual time=8553.241..8555.067 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=1588701.91..1588701.92 rows=1 width=32) (actual time=8547.546..8547.546 rows=1 loops=3)
-> Parallel Seq Scan on lineitem (cost=0.00..1527393.33 rows=24523431 width=5) (actual time=0.038..5998.417 rows=19613238 loops=3)
Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 382112
Planning Time: 0.241 ms
Execution Time: 8555.131 ms

The more complex query is 2.2X faster compared to the plain, single-threaded select.

Parallel Aggregation

A “Parallel Seq Scan” node produces rows for partial aggregation. A “Partial Aggregate” node reduces these rows with SUM(). At the end, the SUM counter from each worker collected by “Gather” node.

The final result is calculated by the “Finalize Aggregate” node. If you have your own aggregation functions, do not forget to mark them as “parallel safe”.

Number of workers

We can increase the number of workers without server restart:

alter system set max_parallel_workers_per_gather=4;
select * from pg_reload_conf();
Now, there are 4 workers in explain output:
tpch=# explain analyze select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day;
Finalize Aggregate (cost=1440213.58..1440213.59 rows=1 width=32) (actual time=5152.072..5152.072 rows=1 loops=1)
-> Gather (cost=1440213.15..1440213.56 rows=4 width=32) (actual time=5151.807..5153.900 rows=5 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Partial Aggregate (cost=1439213.15..1439213.16 rows=1 width=32) (actual time=5147.238..5147.239 rows=1 loops=5)
-> Parallel Seq Scan on lineitem (cost=0.00..1402428.00 rows=14714059 width=5) (actual time=0.037..3601.882 rows=11767943 loops=5)
Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 229267
Planning Time: 0.218 ms
Execution Time: 5153.967 ms

What’s happening here? We have changed the number of workers from 2 to 4, but the query became only 1.6599 times faster. Actually, scaling is amazing. We had two workers plus one leader. After a configuration change, it becomes 4+1.

The biggest improvement from parallel execution that we can achieve is: 5/3 = 1.66(6)X faster.

How does it work?


Query execution always starts in the “leader” process. A leader executes all non-parallel activity and its own contribution to parallel processing. Other processes executing the same queries are called “worker” processes. Parallel execution utilizes the Dynamic Background Workers infrastructure (added in 9.4). As other parts of PostgreSQL uses processes, but not threads, the query creating three worker processes could be 4X faster than the traditional execution.


Workers communicate with the leader using a message queue (based on shared memory). Each process has two queues: one for errors and the second one for tuples.

How many workers to use?

Firstly, the max_parallel_workers_per_gather parameter is the smallest limit on the number of workers. Secondly, the query executor takes workers from the pool limited by max_parallel_workers size. Finally, the top-level limit is max_worker_processes: the total number of background processes.

Failed worker allocation leads to single-process execution.

The query planner could consider decreasing the number of workers based on a table or index size. min_parallel_table_scan_size and min_parallel_index_scan_size control this behavior.

set min_parallel_table_scan_size='8MB'
8MB table => 1 worker
24MB table => 2 workers
72MB table => 3 workers
x => log(x / min_parallel_table_scan_size) / log(3) + 1 worker

Each time the table is 3X bigger than min_parallel_(index|table)_scan_size, postgres adds a worker. The number of workers is not cost-based! A circular dependency makes a complex implementation hard. Instead, the planner uses simple rules.

In practice, these rules are not always acceptable in production and you can override the number of workers for the specific table with ALTER TABLE … SET (parallel_workers = N).

Why parallel execution is not used?

Besides to the long list of parallel execution limitations, PostgreSQL checks costs:

parallel_setup_cost to avoid parallel execution for short queries. It models the time spent for memory setup, process start, and initial communication

parallel_tuple_cost : The communication between leader and workers could take a long time. The time is proportional to the number of tuples sent by workers. The parameter models the communication cost.

Nested loop joins

PostgreSQL 9.6+ could execute a “Nested loop” in parallel due to the simplicity of the operation.

explain (costs off) select c_custkey, count(o_orderkey)
                from    customer left outer join orders on
                                c_custkey = o_custkey and o_comment not like '%special%deposits%'
                group by c_custkey;
                                      QUERY PLAN
 Finalize GroupAggregate
   Group Key: customer.c_custkey
   ->  Gather Merge
         Workers Planned: 4
         ->  Partial GroupAggregate
               Group Key: customer.c_custkey
               ->  Nested Loop Left Join
                     ->  Parallel Index Only Scan using customer_pkey on customer
                     ->  Index Scan using idx_orders_custkey on orders
                           Index Cond: (customer.c_custkey = o_custkey)
                           Filter: ((o_comment)::text !~~ '%special%deposits%'::text)

Gather happens in the last stage, so “Nested Loop Left Join” is a parallel operation. “Parallel Index Only Scan” is available from version 10. It acts in a similar way to a parallel sequential scan. The

c_custkey = o_custkey
condition reads a single order for each customer row. Thus it’s not parallel.

Hash Join

Each worker builds its own hash table until PostgreSQL 11. As a result, 4+ workers weren’t able to improve performance. The new implementation uses a shared hash table. Each worker can utilize WORK_MEM to build the hash table.

                when o_orderpriority = '1-URGENT'
                        or o_orderpriority = '2-HIGH'
                        then 1
                else 0
        end) as high_line_count,
                when o_orderpriority <> '1-URGENT'
                        and o_orderpriority <> '2-HIGH'
                        then 1
                else 0
        end) as low_line_count
        o_orderkey = l_orderkey
        and l_shipmode in ('MAIL', 'AIR')
        and l_commitdate < l_receiptdate
        and l_shipdate < l_commitdate
        and l_receiptdate >= date '1996-01-01'
        and l_receiptdate < date '1996-01-01' + interval '1' year
group by
order by
                                                                                                                                    QUERY PLAN
 Limit  (cost=1964755.66..1964961.44 rows=1 width=27) (actual time=7579.592..7922.997 rows=1 loops=1)
   ->  Finalize GroupAggregate  (cost=1964755.66..1966196.11 rows=7 width=27) (actual time=7579.590..7579.591 rows=1 loops=1)
         Group Key: lineitem.l_shipmode
         ->  Gather Merge  (cost=1964755.66..1966195.83 rows=28 width=27) (actual time=7559.593..7922.319 rows=6 loops=1)
               Workers Planned: 4
               Workers Launched: 4
               ->  Partial GroupAggregate  (cost=1963755.61..1965192.44 rows=7 width=27) (actual time=7548.103..7564.592 rows=2 loops=5)
                     Group Key: lineitem.l_shipmode
                     ->  Sort  (cost=1963755.61..1963935.20 rows=71838 width=27) (actual time=7530.280..7539.688 rows=62519 loops=5)
                           Sort Key: lineitem.l_shipmode
                           Sort Method: external merge  Disk: 2304kB
                           Worker 0:  Sort Method: external merge  Disk: 2064kB
                           Worker 1:  Sort Method: external merge  Disk: 2384kB
                           Worker 2:  Sort Method: external merge  Disk: 2264kB
                           Worker 3:  Sort Method: external merge  Disk: 2336kB
                           ->  Parallel Hash Join  (cost=382571.01..1957960.99 rows=71838 width=27) (actual time=7036.917..7499.692 rows=62519 loops=5)
                                 Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
                                 ->  Parallel Seq Scan on lineitem  (cost=0.00..1552386.40 rows=71838 width=19) (actual time=0.583..4901.063 rows=62519 loops=5)
                                       Filter: ((l_shipmode = ANY ('{MAIL,AIR}'::bpchar[])) AND (l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate) AND (l_receiptdate >= '1996-01-01'::date) AND (l_receiptdate < '1997-01-01 00:00:00'::timestamp without time zone))
                                       Rows Removed by Filter: 11934691
                                 ->  Parallel Hash  (cost=313722.45..313722.45 rows=3750045 width=20) (actual time=2011.518..2011.518 rows=3000000 loops=5)
                                       Buckets: 65536  Batches: 256  Memory Usage: 3840kB
                                       ->  Parallel Seq Scan on orders  (cost=0.00..313722.45 rows=3750045 width=20) (actual time=0.029..995.948 rows=3000000 loops=5)
 Planning Time: 0.977 ms
 Execution Time: 7923.770 ms

Query 12 from TPC-H is a good illustration for a parallel hash join. Each worker helps to build a shared hash table.

Merge Join

Due to the nature of merge join it’s not possible to make it parallel. Don’t worry if it’s the last stage of the query execution—you can still can see parallel execution for queries with a merge join.

-- Query 2 from TPC-H
explain (costs off) select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from    part, supplier, partsupp, nation, region
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 36
        and p_type like '%BRASS'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'AMERICA'
        and ps_supplycost = (
                from    partsupp, supplier, nation, region
                        p_partkey = ps_partkey
                        and s_suppkey = ps_suppkey
                        and s_nationkey = n_nationkey
                        and n_regionkey = r_regionkey
                        and r_name = 'AMERICA'
order by s_acctbal desc, n_name, s_name, p_partkey
LIMIT 100;
                                                QUERY PLAN
   ->  Sort
         Sort Key: supplier.s_acctbal DESC, nation.n_name, supplier.s_name, part.p_partkey
         ->  Merge Join
               Merge Cond: (part.p_partkey = partsupp.ps_partkey)
               Join Filter: (partsupp.ps_supplycost = (SubPlan 1))
               ->  Gather Merge
                     Workers Planned: 4
                     ->  Parallel Index Scan using part_pkey on part
                           Filter: (((p_type)::text ~~ '%BRASS'::text) AND (p_size = 36))
               ->  Materialize
                     ->  Sort
                           Sort Key: partsupp.ps_partkey
                           ->  Nested Loop
                                 ->  Nested Loop
                                       Join Filter: (nation.n_regionkey = region.r_regionkey)
                                       ->  Seq Scan on region
                                             Filter: (r_name = 'AMERICA'::bpchar)
                                       ->  Hash Join
                                             Hash Cond: (supplier.s_nationkey = nation.n_nationkey)
                                             ->  Seq Scan on supplier
                                             ->  Hash
                                                   ->  Seq Scan on nation
                                 ->  Index Scan using idx_partsupp_suppkey on partsupp
                                       Index Cond: (ps_suppkey = supplier.s_suppkey)
               SubPlan 1
                 ->  Aggregate
                       ->  Nested Loop
                             Join Filter: (nation_1.n_regionkey = region_1.r_regionkey)
                             ->  Seq Scan on region region_1
                                   Filter: (r_name = 'AMERICA'::bpchar)
                             ->  Nested Loop
                                   ->  Nested Loop
                                         ->  Index Scan using idx_partsupp_partkey on partsupp partsupp_1
                                               Index Cond: (part.p_partkey = ps_partkey)
                                         ->  Index Scan using supplier_pkey on supplier supplier_1
                                               Index Cond: (s_suppkey = partsupp_1.ps_suppkey)
                                   ->  Index Scan using nation_pkey on nation nation_1
                                         Index Cond: (n_nationkey = supplier_1.s_nationkey)

The “Merge Join” node is above “Gather Merge”. Thus merge is not using parallel execution. But the “Parallel Index Scan” node still helps with the part_pkey segment.

Partition-wise join

PostgreSQL 11 disables the partition-wise join feature by default. Partition-wise join has a high planning cost. Joins for similarly partitioned tables could be done partition-by-partition. This allows postgres to use smaller hash tables. Each per-partition join operation could be executed in parallel.

tpch=# set enable_partitionwise_join=t;
tpch=# explain (costs off) select * from prt1 t1, prt2 t2
where t1.a = t2.b and t1.b = 0 and t2.b between 0 and 10000;
                    QUERY PLAN
   ->  Hash Join
         Hash Cond: (t2.b = t1.a)
         ->  Seq Scan on prt2_p1 t2
               Filter: ((b >= 0) AND (b <= 10000))
         ->  Hash
               ->  Seq Scan on prt1_p1 t1
                     Filter: (b = 0)
   ->  Hash Join
         Hash Cond: (t2_1.b = t1_1.a)
         ->  Seq Scan on prt2_p2 t2_1
               Filter: ((b >= 0) AND (b <= 10000))
         ->  Hash
               ->  Seq Scan on prt1_p2 t1_1
                     Filter: (b = 0)
tpch=# set parallel_setup_cost = 1;
tpch=# set parallel_tuple_cost = 0.01;
tpch=# explain (costs off) select * from prt1 t1, prt2 t2
where t1.a = t2.b and t1.b = 0 and t2.b between 0 and 10000;
                        QUERY PLAN
   Workers Planned: 4
   ->  Parallel Append
         ->  Parallel Hash Join
               Hash Cond: (t2_1.b = t1_1.a)
               ->  Parallel Seq Scan on prt2_p2 t2_1
                     Filter: ((b >= 0) AND (b <= 10000))
               ->  Parallel Hash
                     ->  Parallel Seq Scan on prt1_p2 t1_1
                           Filter: (b = 0)
         ->  Parallel Hash Join
               Hash Cond: (t2.b = t1.a)
               ->  Parallel Seq Scan on prt2_p1 t2
                     Filter: ((b >= 0) AND (b <= 10000))
               ->  Parallel Hash
                     ->  Parallel Seq Scan on prt1_p1 t1
                           Filter: (b = 0)

Above all, a partition-wise join can use parallel execution only if partitions are big enough.

Parallel Append

Parallel Append partitions work instead of using different blocks in different workers. Usually, you can see this with UNION ALL queries. The drawback – less parallelism, because every worker could ultimately work for a single query.

There are just two workers launched even with four workers enabled.

tpch=# explain (costs off) select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day union all select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '2000-12-01' - interval '105' day;
                                           QUERY PLAN
   Workers Planned: 2
   ->  Parallel Append
         ->  Aggregate
               ->  Seq Scan on lineitem
                     Filter: (l_shipdate <= '2000-08-18 00:00:00'::timestamp without time zone)
         ->  Aggregate
               ->  Seq Scan on lineitem lineitem_1
                     Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)

Most important variables

  • WORK_MEM limits the memory usage of each process! Not just for queries: work_mem * processes * joins => could lead to significant memory usage.
  • max_parallel_workers_per_gather  – how many workers an executor will use for the parallel execution of a planner node
  • max_worker_processes – adapt the total number of workers to the number of CPU cores installed on a server
  • max_parallel_workers – same for the number of parallel workers


Starting from 9.6 parallel queries execution could significantly improve performance for complex queries scanning many rows or index records. In PostgreSQL 10, parallel execution was enabled by default. Do not forget to disable parallel execution on servers with a heavy OLTP workload. Sequential scans or index scans still consume a significant amount of resources. If you are not running a report against the whole dataset, you may improve query performance just by adding missing indexes or by using proper partitioning.


Image compiled from photos by Nathan Gonthier and Pavel Nekoranec on Unsplash

by Nickolay Ihalainen at February 21, 2019 02:05 PM

Jean-Jerome Schmidt

Hybrid OLTP/Analytics Database Workloads in Galera Cluster Using Asynchronous Slaves

Using Galera cluster is a great way of building a highly available environment for MySQL or MariaDB. It is a shared-nothing cluster environment which can be scaled even beyond 12-15 nodes. Galera has some limitations, though. It shines in low-latency environments and even though it can be used across WAN, the performance is limited by network latency. Galera performance can also be impacted if one of the nodes starts to behave incorrectly. For example, excessive load on one of the nodes may slow it down, resulting in slower handling of the writes and that will impact all of the other nodes in the cluster. On the other hand, it is quite impossible to run a business without analyzing your data. Such analysis, typically, requires running heavy queries, which is quite different from an OLTP workload. In this blog post, we will discuss an easy way of running analytical queries for data stored in Galera Cluster for MySQL or MariaDB, in a way that it does not impact the performance of the core cluster.

How to run analytical queries on Galera Cluster?

As we stated, running long running queries directly on a Galera cluster is doable, but perhaps not so good idea. Hardware-dependant, this can be acceptable solution (if you use strong hardware and you will not run a multi-threaded analytical workload) but even if CPU utilization will not be a problem, the fact that one of the nodes will have mixed workload (OLTP and OLAP) will alone pose some performance challenges. OLAP queries will evict data required for your OLTP workload from the buffer pool, and this will slow down your OLTP queries. Luckily, there is a simple yet efficient way of separating analytical workload from regular queries - an asynchronous replication slave.

Replication slave is a very simple solution - all you need is just another host which can be provisioned and asynchronous replication has to be configured from Galera Cluster to that node. With asynchronous replication, the slave will not impact the rest of the cluster in any way. No matter if it is heavily loaded, uses different (less powerful) hardware, it will just continue replicating from the core cluster. The worst case scenario is that the replication slave will start lagging behind but then it is up to you to implement multi-threaded replication or, eventually to scale up the replication slave.

Once the replication slave is up and running, you should run the heavier queries on it and offload the Galera cluster. This can be done in multiple ways, depending on your setup and environment. If you use ProxySQL, you can easily direct queries to the analytical slave based on the source host, user, schema or even the query itself. Otherwise it will be up to your application to send analytical queries to the correct host.

Setting up a replication slave is not very complex but it still can be tricky if you are not proficient with MySQL and tools like xtrabackup. The whole process would consist of setting up the repository on a new server and installing the MySQL database. Then you will have to provision that host using data from Galera cluster. You can use xtrabackup for that but other tools like mydumper/myloader or even mysqldump will work as well (as long as you execute them correctly). Once the data is there, you will have to setup the replication between a master Galera node and the replication slave. Finally, you would have to reconfigure your proxy layer to include the new slave and route the traffic towards it or make tweaks in how your application connects to the database in order to redirect some of the load to the replication slave.

What is important to keep in mind, this setup is not resilient. If the “master” Galera node would go down, the replication link will be broken and it will take a manual action to slave the replica off another master node in the Galera cluster.

This is not a big deal, especially if you use replication with GTID (Global Transaction ID) but you have to identify that the replication is broken and then take the manual action.

How to set up the asynchronous slave to Galera Cluster using ClusterControl?

Luckily, if you use ClusterControl, the whole process can be automated and it requires just a handful of clicks. The initial state has already been set up using ClusterControl - a 3 node Galera cluster with 2 ProxySQL nodes and 2 Keepalived nodes for high availability of both database and proxy layer.

Adding the replication slave is just a click away:

Replication, obviously, requires binary logs to be enabled. If you do not have binlogs enabled on your Galera nodes, you can do it also from the ClusterControl. Please keep in mind that enabling binary logs will require a node restart to apply the configuration changes.

Even if one node in the cluster has binary logs enabled (marked as “Master” on the screenshot above), it’s still good to enable binary log on at least one more node. ClusterControl can automatically failover the replication slave after it detects that the master Galera node crashed, but for that, another master node with binary logs enabled is required or it won’t have anything to fail over to.

As we stated, enabling binary logs requires restart. You can either perform it straight away, or just make the configuration changes and perform the restart at some other time.

After binlogs have been enabled on some of the Galera nodes, you can proceed with adding the replication slave. In the dialog you have to pick the master host, pass the hostname or IP address of the slave. If you have recent backups at hand (which you should do), you can use one to provision the slave. Otherwise ClusterControl will provision it using xtrabackup - all the recent master data will be streamed to the slave and then the replication will be configured.

After the job completed, a replication slave has been added to the cluster. As stated earlier, should the die, another host in the Galera cluster will be picked as the master and ClusterControl will automatically slave off another node.

As we use ProxySQL, we need to configure it. We’ll add a new server into ProxySQL.

We created another hostgroup (30) where we put our asynchronous slave. We also increased “Max Replication Lag” to 50 seconds from default 10. It is up to your business requirements how badly analytics slave can be lagging before it becomes a problem.

After that we have to configure a query rule that will match our OLAP traffic and route it to the OLAP hostgroup (30). On the screenshot above we filled several fields - this is not mandatory. Typically you will need to use one, two of them at most. Above screenshot serves as an example so we can easily see that you can match queries using schema (if you have a separate schema with analytical data), hostname/IP (if OLAP queries are executed from some particular host), user (if application uses particular user for analytical queries. You can also match queries directly by either passing a full query or by marking them with SQL comments and let ProxySQL route all queries with a “OLAP_QUERY” string to our analytical hostgroup.

As you can see, thanks to ClusterControl we were able to deploy a replication slave to Galera Cluster in just a couple of clicks. Some may argue that MySQL is not the most suitable database for analytical workload and we tend to agree. You can easily extend this setup using ClickHouse and by setting up a replication from asynchronous slave to ClickHouse columnar datastore for much better performance of analytical queries. We described this setup in one of the earlier blog posts.

by krzysztof at February 21, 2019 11:04 AM

February 20, 2019

Henrik Ingo

20 years later, what's left of the CAP theorem?

The CAP theorem was published in (party like it's...) 1999: Fox Armando, Brewer Eric A: Harvest, Yield, and Scalable Tolerant Systems.

Since its publication it has provided a beacon and rallying cry around which web scale distributed databases could be built and debated. It(s interpretation) has also evolved. Quite quickly the original 1999 formulation was abandoned, and from there it has further eroded as real world database implementations have provided ever more finer grained trade offs for navigating the space that - after all - was correctly mapped out by the CAP theorem.

Pick ANY two? Really?

read more

by hingo at February 20, 2019 09:20 PM

Peter Zaitsev

Percona Monitoring and Management (PMM) 1.17.1 Is Now Available

Percona Monitoring and Management 1.17.0

Percona Monitoring and Management

Percona Monitoring and Management (PMM) is a free and open-source platform for managing and monitoring MySQL®, MongoDB®, and PostgreSQL performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL®, MongoDB®, and PostgreSQL® servers to ensure that your data works as efficiently as possible.

In this release, we are introducing support for detection of our upcoming PMM 2.0 release in order to avoid potential version conflicts in the future, as PMM 1.x will not be compatible with PMM 2.x.

Another improvement in this release is we have updated the Tooltips for Dashboard MySQL Query Response Time by providing a description of what the graphs display, along with links to related documentation resources. An example of Tooltips in action:

PMM 1.17.1 release provides fixes for CVE-2018-16492 and CVE-2018-16487 vulnerabilities, related to Node.js modules. The authentication system used in PMM is not susceptible to the attacks described in these CVE reports. PMM does not use client-side data objects to control user-access.

In release 1.17.1 we have included two improvements and fixed nine bugs.


  • PMM-1339: Improve tooltips for MySQL Query Response Time dashboard
  • PMM-3477: Add Ubuntu 18.10 support

Fixed Bugs

  • PMM-3471: Fix global status metric names in mysqld_exporter for MySQL 8.0 compatibility
  • PMM-3400: Duplicate column in the Query Analytics dashboard Explain section
  • PMM-3353: postgres_exporter does not work with PostgreSQL 11
  • PMM-3188: Duplicate data on Amazon RDS / Aurora MySQL Metrics dashboard
  • PMM-2615: Fix wrong formatting in log which appears if pmm-qan-agent process fails to start
  • PMM-2592: MySQL Replication Dashboard shows error with multi-source replication
  • PMM-2327: Member State Uptime and Max Member Ping time charts on the MongoDB ReplSet dashboard return an error
  • PMM-955: Fix format of User Time and CPU Time Graphs on MySQL User Statistics dashboard
  • PMM-3522: CVE-2018-16492 and CVE-2018-16487

Help us improve our software quality by reporting any Percona Monitoring and Management bugs you encounter using our bug tracking system.

by Dmitriy Kostiuk at February 20, 2019 03:11 PM

ProxySQL Native Support for Percona XtraDB Cluster (PXC)

galera proxy content image

ProxySQL in its versions up to 1.x did not natively support Percona XtraDB Cluster (PXC). Instead, it relied on the flexibility offered by the scheduler. This approach allowed users to implement their own preferred way to manage the ProxySQL behaviour in relation to the Galera events.

From version 2.0 we can use native ProxySQL support for PXC.. The mechanism to activate native support is very similar to the one already in place for group replication.

In brief it is based on the table [runtime_]mysql_galera_hostgroups and the information needed is mostly the same:

  • writer_hostgroup: the hostgroup ID that refers to the WRITER
  • backup_writer_hostgroup: the hostgoup ID referring to the Hostgorup that will contain the candidate servers
  • reader_hostgroup: The reader Hostgroup ID, containing the list of servers that need to be taken in consideration
  • offline_hostgroup: The Hostgroup ID that will eventually contain the writer that will be put OFFLINE
  • active: True[1]/False[0] if this configuration needs to be used or not
  • max_writers: This will contain the MAX number of writers you want to have at the same time. In a sane setup this should be always 1, but if you want to have multiple writers, you can define it up to the number of nodes.
  • writer_is_also_reader: If true [1] the Writer will NOT be removed from the reader HG
  • max_transactions_behind: The number of wsrep_local_recv_queue after which the node will be set OFFLINE. This must be carefully set, observing the node behaviour.
  • comment: I suggest to put some meaningful notes to identify what is what.

Given the above let us see what we need to do in order to have a working galera native solution.
I will have three Servers: (Node1)  (Node2) (node3)

As set of Hostgroup, I will have:

Writer  HG-> 100
Reader  HG-> 101
BackupW HG-> 102
offHG   HG-> 9101

To set it up

Servers first:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('',101,3306,1000);

Then the galera settings:

insert into mysql_galera_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) values (100,102,101,9101,0,1,1,16);

As usual if we want to have R/W split we need to define the rules for it:

insert into mysql_query_rules (rule_id,proxy_port,schemaname,username,destination_hostgroup,active,retries,match_digest,apply) values(1040,6033,'windmills','app_test',100,1,3,'^SELECT.*FOR UPDATE',1);
insert into mysql_query_rules (rule_id,proxy_port,schemaname,username,destination_hostgroup,active,retries,match_digest,apply) values(1041,6033,'windmills','app_test',101,1,3,'^SELECT.*@@',1);
save mysql query rules to disk;
load mysql query rules to run;

Then another important variable… the server version, please do yourself a good service ad NEVER use the default.

update global_variables set variable_value='5.7.0' where variable_name='mysql-server_version';

Finally activate the whole thing:

save mysql servers to disk;
load mysql servers to runtime;

One thing to note before we go ahead. In the list of servers I had:

  1. Filled only the READER HG
  2. Used the same weight

This because of the election mechanism ProxySQL will use to identify the writer, and the (many) problems that may be attached to it.

For now let us go ahead and see what happens when I load this information to runtime.

Before running the above commands:

| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |


| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
| 1000   | 100       | | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 501        |
| 1000   | 101       | | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 501        |
| 1000   | 101       |  | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 546        |
| 1000   | 101       | | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 467        |
| 1000   | 102       |  | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 546        |
| 1000   | 102       | | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 467        |
mysql> select * from runtime_mysql_galera_hostgroups \G
*************************** 1. row ***************************
       writer_hostgroup: 100
backup_writer_hostgroup: 102
       reader_hostgroup: 101
      offline_hostgroup: 9101
                active: 0  <----------- note this
            max_writers: 1
  writer_is_also_reader: 1
max_transactions_behind: 16
                comment: NULL
1 row in set (0.01 sec)

As we can see, ProxySQL had taken the nodes from my READER group and distribute them adding node 1 in the writer and node 2 as backup_writer.

But – there is a but – wasn’t my rule set with Active=0? Indeed it was, and I assume this is a bug (#Issue  1902).

The other thing we should note is that ProxySQL had elected as writer node 3 (
As I said before what should we do IF we want to have a specific node as preferred writer?

We need to modify its weight. So say we want to have node 1 ( as writer we will need something like this:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('',101,3306,10000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('',101,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('',101,3306,100);

Doing that will give us :

| weight | hostgroup | srv_host      | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
| 10000  | 100       | | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 2209       |
| 100    | 101       | | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 546        |
| 100    | 101       |  | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 508        |
| 10000  | 101       | | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 2209       |
| 100    | 102       | | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 546        |
| 100    | 102       |  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 508        |

If you noticed, given we had set the WEIGHT in node 1 higher, this node will become also the most utilized for reads.
We probably do not want that, so let us modify the reader weight.

update mysql_servers set weight=10 where hostgroup_id=101 and hostname='';

At this point if we trigger the failover, with set global wsrep_reject_queries=all; on node 1.
ProxySQL will take action and will elect another node as writer:

| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
| 100    | 100       | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 562        |
| 100    | 101       | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 562        |
| 100    | 101       |  | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 588        |
| 100    | 102       |  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 588        |
| 10000  | 9101      | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 468        |

Node 3 ( is the new writer and node 1 is in the special group for OFFLINE.
Let see now what will happen IF we put back node 1.

| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
| 10000  | 100       | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 449        |
| 100    | 101       | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 532        |
| 100    | 101       |  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 569        |
| 10000  | 101       | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 449        |
| 100    | 102       | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 532        |
| 100    | 102       |  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 569        |

Ooops the READER has come back with the HIGHEST value and as such it will be the most used node, once more. To fix it, we need to re-run the update as before.

But there is a way to avoid this? In short the answer is NO!
This, in my opinion, is BAD and is worth a feature request, because this can really put a node on the knees.

Now this is not the only problem. There is another point that is probably worth discussion, which is the fact ProxySQL is currently doing FAILOVER/FAILBACK.

Failover, is obviously something we want to have, but failback is another discussion. The point is, once the failover is complete and the cluster has redistributed the incoming requests, doing a failback is an impacting operation that can be a disruptive one too.

If all nodes are treated as equal, there is no real way to prevent it, while if YOU set a node to be the main writer, something can be done, let us see what and how.
Say we have:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('',101,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('',101,3306,100);
| weight | hostgroup | srv_host      | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
| 1000   | 100       | | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 470        |
| 100    | 101       | | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 558        |
| 100    | 101       |  | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 613        |
| 10     | 101       | | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 470        |
| 100    | 102       | | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 558        |
| 100    | 102       |  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 613        |

Let us put the node down
set global wsrep_reject_queries=all;

And check:

| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
| 100    | 100       | | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 519        |
| 100    | 101       | | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 519        |
| 100    | 101       |  | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 506        |
| 100    | 102       |  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 506        |
| 1000   | 9101      | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 527        |

We can now manipulate the weight in the special OFFLINE group and see what happen:

update mysql_servers set weight=10 where hostgroup_id=9101 and hostname=''

Then I put the node up again:
set global wsrep_reject_queries=none;

| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
| 100    | 100       | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 537        |
| 100    | 101       | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 537        |
| 100    | 101       |  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 573        |
| 10     | 101       | | 3306     | ONLINE  | 0        | 0        | 0      | 0	   | 0           | 0	   | 0                 | 0               | 0               | 458	|
| 100    | 102       |  | 3306     | ONLINE  | 0        | 0        | 0      | 0	   | 0           | 0	   | 0                 | 0               | 0               | 573	|
| 10     | 102       | | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 458        |

That’s it, the node is back but with no service interruption.

At this point we can decide if make this node reader like the others, or wait and plan a proper time of the day when we can put it back as writer, while, in the meanwhile it has a bit of load to warm its bufferpool.

The other point – and important information – is what is ProxySQL is currently checking on Galera? From reading the code Proxy will trap the following:

  • read_only
  • wsrep_local_recv_queue
  • wsrep_desync
  • wsrep_reject_queries
  • wsrep_sst_donor_rejects_queries
  • primary_partition

Plus the standard sanity checks on the node.

Finally to monitor the whole situation we can use this:

mysql> select * from mysql_server_galera_log order by time_start_us desc limit 10;
| hostname      | port | time_start_us    | success_time_us | primary_partition | read_only | wsrep_local_recv_queue | wsrep_local_state | wsrep_desync | wsrep_reject_queries | wsrep_sst_donor_rejects_queries | error |
| | 3306 | 1549982591661779 | 2884            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
|  | 3306 | 1549982591659644 | 2778            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| | 3306 | 1549982591658728 | 2794            | YES               | NO        | 0                      | 4                 | NO           | YES                  | NO                              | NULL  |
| | 3306 | 1549982586669233 | 2827            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
|  | 3306 | 1549982586663458 | 5100            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| | 3306 | 1549982586658973 | 4132            | YES               | NO        | 0                      | 4                 | NO           | YES                  | NO                              | NULL  |
| | 3306 | 1549982581665317 | 3084            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
|  | 3306 | 1549982581661261 | 3129            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| | 3306 | 1549982581658242 | 2786            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| | 3306 | 1549982576661349 | 2982            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |

As you can see above the log table keeps track of what is changed. In this case, it reports that node 1 has wsrep_reject_queries activated, and it will continue to log this until we set wsrep_reject_queries=none.


ProxySQL galera native integration is a useful feature to manage any Galera implementation, no matter whether it’s Percona PXC, MariaDB cluster or MySQL/Galera.

The generic approach is obviously a good thing, still it may miss some specific extension like we have in PXC with the performance_schema pxc_cluster_view table.

I’ve already objected about the failover/failback, and I am here again to remind you: whenever you do a controlled failover REMEMBER to change the weight to prevent an immediate failback.

This is obviously not possible in the case of a real failover, and, for instance, a simple temporary eviction will cause two downtimes instead only one. Some environments are fine with that others not so.

Personally I think there should be a FLAG in the configuration, such that we can decide if failback should be executed or not.


by Marco Tusa at February 20, 2019 02:11 PM

February 19, 2019

Oli Sennhauser

MySQL Enterprise Backup Support Matrix

MySQL Enterprise Backup (MEB) is a bit limited related to support of older MySQL versions. So you should consider the following release matrix:

MEB/MySQLSupported 5.5  5.6  5.7  8.0 

* MySQL Enterprise Backup 8.0.15 only supports MySQL 8.0.15. For earlier versions of MySQL 8.0, use the MySQL Enterprise Backup version with the same version number as the server.

MySQL Enterprise Backup is available for download from the My Oracle Support (MOS) website. This release will be available on Oracle eDelivery (OSDC) after the next upload cycle. MySQL Enterprise Backup is a commercial extension to the MySQL family of products.

As an Open Source alternative Percona XtraBackup for MySQL databases is available.

Compatibility with MySQL Versions: 3.11, 3.12, 4.0, 4.1, 8.0.

MySQL Enterprise Backup User's Guide: 3.11, 3.12, 4.0, 4.1, 8.0.

by Shinguz at February 19, 2019 06:13 PM

Peter Zaitsev

Percona Server for MongoDB 3.4.19-2.17 Is Now Available

Percona Server for MongoDB

Percona Server for MongoDB

Percona announces the release of Percona Server for MongoDB 3.4.19-2.17 on February 19, 2019. Download the latest version from the Percona website or the Percona Software Repositories.

Percona Server for MongoDB 3.4 is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 3.4 Community Edition. It supports MongoDB 3.4 protocols and drivers.

Percona Server for MongoDB extends MongoDB Community Edition functionality by including the Percona Memory Engine and MongoRocks storage engines, as well as several enterprise-grade features:

Percona Server for MongoDB requires no changes to MongoDB applications or code. This release is based on MongoDB 3.4.19.

In this release, Percona Server for MongoDB supports the ngram full-text search engine. Thanks to Sunguck Lee (@SunguckLee) for this contribution. To enable the ngram full-text search engine, create an index passing ngram to the default_language parameter:

mongo > db.collection.createIndex({name:"text"}, {default_language: "ngram"})

New Features

  • PSMDB-250: The ngram full-text search engine has been added to Percona Server for MongoDB.Thanks to Sunguck Lee (@SunguckLee) for this contribution.

Bugs Fixed

  • PSMDB-272mongos could crash when running the createBackup command.

Other bugs fixed: PSMDB-247

The Percona Server for MongoDB 3.4.19-2.17 release notes are available in the official documentation.

by Borys Belinsky at February 19, 2019 01:43 PM

How Network Bandwidth Affects MySQL Performance

10gb network and 10gb with SSL

Network is a major part of a database infrastructure. However, often performance benchmarks are done on a local machine, where a client and a server are collocated – I am guilty myself. This is done to simplify the setup and to exclude one more variable (the networking part), but with this we also miss looking at how network affects performance.

The network is even more important for clustering products like Percona XtraDB Cluster and MySQL Group Replication. Also, we are working on our Percona XtraDB Cluster Operator for Kubernetes and OpenShift, where network performance is critical for overall performance.

In this post, I will look into networking setups. These are simple and trivial, but are a building block towards understanding networking effects for more complex setups.


I will use two bare-metal servers, connected via a dedicated 10Gb network. I will emulate a 1Gb network by changing the network interface speed with

ethtool -s eth1 speed 1000 duplex full autoneg off

network test topology

I will run a simple benchmark:

sysbench oltp_read_only --mysql-ssl=on --mysql-host= --tables=20 --table-size=10000000 --mysql-user=sbtest --mysql-password=sbtest --threads=$i --time=300 --report-interval=1 --rand-type=pareto

This is run with the number of threads varied from 1 to 2048. All data fits into memory – innodb_buffer_pool_size is big enough – so the workload is CPU-intensive in memory: there is no IO overhead.

Operating System: Ubuntu 16.04

Benchmark N1. Network bandwidth

In the first experiment I will compare 1Gb network vs 10Gb network.

1gb vs 10gb network

threads/throughput 1Gb network 10Gb network
1 326.13 394.4
4 1143.36 1544.73
16 2400.19 5647.73
32 2665.61 10256.11
64 2838.47 15762.59
96 2865.22 17626.77
128 2867.46 18525.91
256 2867.47 18529.4
512 2867.27 17901.67
1024 2865.4 16953.76
2048 2761.78 16393.84


Obviously the 1Gb network performance is a bottleneck here, and we can improve our results significantly if we move to the 10Gb network.

To see that 1Gb network is bottleneck we can check the network traffic chart in PMM:

network traffic in PMM

We can see we achieved 116MiB/sec (or 928Mb/sec)  in throughput, which is very close to the network bandwidth.

But what we can do if the our network infrastructure is limited to 1Gb?

Benchmark N2. Protocol compression

There is a feature in MySQL protocol whereby you can see the compression for the network exchange between client and server:

  for sysbench.

Let’s see how it will affect our results.

1gb network with compression protocol

threads/throughput 1Gb network 1Gb with compression protocol
1 326.13 198.33
4 1143.36 771.59
16 2400.19 2714
32 2665.61 3939.73
64 2838.47 4454.87
96 2865.22 4770.83
128 2867.46 5030.78
256 2867.47 5134.57
512 2867.27 5133.94
1024 2865.4 5129.24
2048 2761.78 5100.46


Here is an interesting result. When we use all available network bandwidth, the protocol compression actually helps to improve the result.10g network with compression protocol

threads/throughput 10Gb 10Gb with compression
1 394.4 216.25
4 1544.73 857.93
16 5647.73 3202.2
32 10256.11 5855.03
64 15762.59 8973.23
96 17626.77 9682.44
128 18525.91 10006.91
256 18529.4 9899.97
512 17901.67 9612.34
1024 16953.76 9270.27
2048 16393.84 9123.84


But this is not the case with the 10Gb network. The CPU resources needed for compression/decompression are a limiting factor, and with compression the throughput actually only reach half of what we have without compression.

Now let’s talk about protocol encryption, and how using SSL affects our results.

Benchmark N3. Network encryption

1gb network and 1gb with SSL

threads/throughput 1Gb network 1Gb SSL
1 326.13 295.19
4 1143.36 1070
16 2400.19 2351.81
32 2665.61 2630.53
64 2838.47 2822.34
96 2865.22 2837.04
128 2867.46 2837.21
256 2867.47 2837.12
512 2867.27 2836.28
1024 2865.4 1830.11
2048 2761.78 1019.23

10gb network and 10gb with SSL

threads/throughput 10Gb 10Gb SSL
1 394.4 359.8
4 1544.73 1417.93
16 5647.73 5235.1
32 10256.11 9131.34
64 15762.59 8248.6
96 17626.77 7801.6
128 18525.91 7107.31
256 18529.4 4726.5
512 17901.67 3067.55
1024 16953.76 1812.83
2048 16393.84 1013.22


For the 1Gb network, SSL encryption shows some penalty – about 10% for the single thread – but otherwise we hit the bandwidth limit again. We also see some scalability hit on a high amount of threads, which is more visible in the 10Gb network case.

With 10Gb, the SSL protocol does not scale after 32 threads. Actually, it appears to be a scalability problem in OpenSSL 1.0, which MySQL currently uses.

In our experiments, we saw that OpenSSL 1.1.1 provides much better scalability, but you need to have a special build of MySQL from source code linked to OpenSSL 1.1.1 to achieve this. I don’t show them here, as we do not have production binaries.


  1. Network performance and utilization will affect the general application throughput.
  2. Check if you are hitting network bandwidth limits
  3. Protocol compression can improve the results if you are limited by network bandwidth, but also can make things worse if you are not
  4. SSL encryption has some penalty (~10%) with a low amount of threads, but it does not scale for high concurrency workloads.

by Vadim Tkachenko at February 19, 2019 11:52 AM

February 18, 2019

Peter Zaitsev

Percona Server for MySQL 5.7.25-28 Is Now Available

Percona Server for MySQL 8.0

Percona Server for MySQL 5.6Percona is glad to announce the release of Percona Server 5.7.25-28 on February 18, 2019. Downloads are available here and from the Percona Software Repositories.

This release is based on MySQL 5.7.25 and includes all the bug fixes in it. Percona Server 5.7.25-28 is now the current GA (Generally Available) release in the 5.7 series.

All software developed by Percona is open-source and free.

In this release, Percona Server introduces the variable binlog_skip_flush_commands. This variable controls whether or not FLUSH commands are written to the binary log. Setting this variable to ON can help avoid problems in replication. For more information, refer to our documentation.


If you’re currently using Percona Server 5.7, Percona recommends upgrading to this version of 5.7 prior to upgrading to Percona Server 8.0.

Bugs fixed

  • FLUSH commands written to the binary log could cause errors in case of replication. Bug fixed #1827: (upstream #88720).
  • Running LOCK TABLES FOR BACKUP followed by STOP SLAVE SQL_THREAD could block replication preventing it from being restarted normally. Bug fixed #4758.
  • The ACCESS_DENIED field of the information_schema.user_statistics table was not updated correctly. Bug fixed #3956.
  • MySQL could report that the maximum number of connections was exceeded with too many connections being in the CLOSE_WAIT state. Bug fixed #4716 (upstream #92108)
  • Wrong query results could be received in semi-join sub queries with materialization-scan that allowed inner tables of different semi-join nests to interleave. Bug fixed #4907 (upstream bug #92809).
  • In some cases, the server using the the MyRocks storage engine could crash when TTL (Time to Live) was defined on a table. Bug fixed #4911
  • Running the SELECT statement with the ORDER BY and LIMIT clauses could result in a less than optimal performance. Bug fixed #4949 (upstream #92850)
  • There was a typo in mysqld_safe.shtrottling was replaced with throttling. Bug fixed #240. Thanks to Michael Coburn for the patch.
  • MyRocks could crash while running START TRANSACTION WITH CONSISTENT SNAPSHOT if other transactions were in specific states. Bug fixed #4705,
  • In some cases, mysqld could crash when inserting data into a database the name of which contained special characters (CVE-2018-20324). Bug fixed #5158.
  • MyRocks incorrectly processed transactions in which multiple statements had to be rolled back. Bug fixed #5219.
  • In some cases, the MyRocks storage engine could crash without triggering the crash recovery. Bug fixed #5366.
  • When bootstrapped with undo or redo log encryption enabled on a very fast storage, the server could fail to start. Bug fixed #4958.

Other bugs fixed: #2455#4791#4855#4996#5268.

This release also contains fixes for the following CVE issues: CVE-2019-2534, CVE-2019-2529, CVE-2019-2482, CVE-2019-2434.

Find the release notes for Percona Server for MySQL 5.7.25-28 in our online documentation. Report bugs in the Jira bug tracker.


by Borys Belinsky at February 18, 2019 04:38 PM

Percona Server for MongoDB 4.0.5-2 Is Now Available

Percona Server for MongoDB

Percona Server for MongoDB

Percona announces the release of Percona Server for MongoDB 4.0.5-2 on February 18, 2019. Download the latest version from the Percona website or the Percona Software Repositories.

Percona Server for MongoDB is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 4.0 Community Edition. It supports MongoDB 4.0 protocols and drivers.

Percona Server for MongoDB extends Community Edition functionality by including the Percona Memory Engine storage engine, as well as several enterprise-grade features. It also includes MongoRocks storage engine (which is now deprecated). Percona Server for MongoDB requires no changes to MongoDB applications or code.

This release includes all features of MongoDB 4.0 Community Edition. Most notable among these are:

Note that the MMAPv1 storage engine is deprecated in MongoDB 4.0 Community Edition.

In Percona Server for MongoDB 4.0.5-2, data at rest encryption becomes GA. The data at rest encryption feature now covers the temporary files used for external sorting and the rollback files. You can decrypt and examine the contents of the rollback files using the new perconadecrypt command line tool.

In this release, Percona Server for MongoDB supports the ngram full-text search engine. Thanks to Sunguck Lee (@SunguckLee) for this contribution. To enable the ngram full-text search engine, create an index passing ngram to the default_language parameter:

mongo > db.collection.createIndex({name:"text"}, {default_language: "ngram"})

New Features

  • PSMDB-276perconadecrypt tool is now available for decrypting the encrypted rollback files.
  • PSMDB-250: The Ngram full-text search engine has been added to Percona Server for MongoDB.Thanks to Sunguck Lee (@SunguckLee) for this contribution.

Bugs Fixed

  • PSMDB-234: It was possible to use a key file for encryption the owner of which was not the owner of the mongod process.
  • PSMDB-273: When using data at rest encryption, temporary files for external sorting and rollback files were not encrypted
  • PSMDB-257: MongoDB could not be started with a group-readable key file owned by root.
  • PSMDB-272mongos could crash when running the createBackup command.

Other bugs fixed: PSMDB-247

The Percona Server for MongoDB 4.0.5-2 release notes are available in the official documentation.

by Borys Belinsky at February 18, 2019 04:13 PM

Jean-Jerome Schmidt

How to Migrate from Oracle to MySQL / Percona Server

Migrating from Oracle to MySQL/Percona Server is not a trivial task. Although it is getting easier, especially with the arrival of MySQL 8.0 and Percona announced Percona Server for MySQL 8.0 GA. Aside from planning for your migration from Oracle to Percona Server, you must ensure that you understand the purpose and functionality for why it has to be Percona Server.

This blog will focus on Migrating from Oracle to Percona Server as its specific target database of choice. There's a page in the Oracle website about SQL Developer Supplementary Information for MySQL Migrations which can be used as a reference for the planned migration. This blog will not cover the overall process of migration, as it is a long process. But it will hopefully provide enough background information to serve as a guide for your migration process.

Since Percona Server is a fork of MySQL, almost all features that come along in MySQL are present in Percona Server. So any reference of MySQL here is applicable as well to Percona Server. We previously blogged about migrating Oracle Database to PostgreSQL. I’ll reiterate again the reasons why one would consider migrating from Oracle to an open-source RDBMS such as PostgreSQL or Percona Server/MySQL/MariaDB.

  1. Cost: As you may know Oracle licence cost is very expensive and there is additional cost for some features like partitioning and high availability. So overall it's very expensive.
  2. Flexible open source licensing and easy availability from public cloud providers like AWS.
  3. Benefit from open source add-ons to improve performance.

Planning and Development Strategy

Migration from Oracle going to Percona Server 8.0 can be a pain since there's a lot of key factors that needs to be considered and addressed. For example, Oracle can run on a Windows Server machine but Percona Server does not support Windows. Although you can compile it for Windows, Percona itself does not offer any support for Windows. You must also identify your database architecture requirements, since Percona Server is not designed for OLAP (Online Analytical Processing) or data-warehousing applications. Percona Server/MySQL RDBMS are perfect fit for OLTP (Online Transaction Processing).

Identifying the key aspect of your database architecture, for example if your current Oracle architecture implements MAA (Maximum Available Architecture) with Data Guard ++ Oracle RAC (Real Application Cluster), you should determine its equivalence in Percona Server. There's no straight answer for this within MySQL/Percona Server. However, you can choose from a synchronous replication, an asynchronous replication (Percona XtraDB Cluster is still currently on version 5.7.x), or with Group Replication. Then, there's multiple alternatives that you can implement for your own high-availability solution. For example, (to name a few) using Corosync/Pacemaker/DRBD/Linux stack, or using MHA (MySQL High Availability), or using Keepalived/HaProxy/ProxySQL stack, or plainly rely on ClusterControl which supports Keepalived, HaProxy, ProxySQL, Garbd, and Maxscale for your high-availability solutions.

On the other side, the question you have also to consider as part of the plan is "How will Percona will provide support and who will help us when Percona Server itself encounters a bug or how high is the urgency when we need help?". One thing to consider as well is budget, if the purpose of migration from enterprise database to an open-source RDBMS is because of cost-cutting.

There are different options from migration planning to the things you need to do as part of your development strategy. Such options include engaging with experts in the MySQL/Percona Server field and that includes us here at Severalnines. There are lots of MySQL consulting firms that can help you through this since migration from Oracle to MySQL requires a lot of expertise and know-how in the MySQL Server area. This should not be limited to the database but it should cover expertise in scalability, redundancy, backups, high-availability, security, monitoring/observability, recovery and engaging on mission critical systems. Overall, it should have an understanding of your architectural insight without exposing confidentiality of your data.

Assessment or Preliminary Check

Backing up your data including configurations or setup files, kernel tunings, automation scripts shall not be left into oblivion. It's an obvious task, but before you migrate, always secure everything first , especially when moving to a different platform.

You must assess as well that your applications are following the up-to-date software engineering conventions and ensure that they are platform agnostic. These practices can be to your benefit especially when moving to a different database platform, such as Percona Server for MySQL.

Take note that the operating system that Percona Server requires can be a show-stopper if your application and database run on a Windows Server and the application is Windows dependent; then this could be a lot of work! Always remember that Percona Server is on a different platform: perfection might not be guaranteed but can be achieved close enough.

Lastly, make sure that the targeted hardware is designed to work feasibly with Percona's server requirements or that it is bug-free at least (see here). You may consider stress testing first with Percona Server before reliably moving off your Oracle Database.

What You Should Know

It is worth noting that in Percona Server / MySQL, you can create multiple databases whereas Oracle does not come with that same functionality as MySQL.

In MySQL, physically, a schema is synonymous with a database. You can substitute the keyword SCHEMA instead of DATABASE in MySQL SQL syntax, for example using CREATE SCHEMA instead of CREATE DATABASE; whilst Oracle has a distinction of this. A schema represents only a part of a database: the tables and other objects owned by a single user. Normally, there is a one-to-one relationship between the instance and the database.

For example, in a replication setup equivalent in Oracle (e.g. Real Application Clusters or RAC), you have your multiple instances accessing a single database. This lets you start Oracle on multiple servers, but all accessing the same data. However, in MySQL, you can allow access to multiple databases from your multiple instances and can even filter out which databases/schema you can replicate to a MySQL node.

Referencing from one of our previous blog, the same principle applies when speaking of converting your database with available tools found on the internet.

There is no such tool that can 100% convert Oracle database into Percona Server / MySQL; some of it will be manual work.

Checkout the following sections for things that you must be aware of when it comes to migration and verifying the logical SQL result.

Data Type Mapping

MySQL / Percona Server have a number of data-types that is almost the same as Oracle but not as rich as compared to Oracle. But since the arrival of the 5.7.8 version of MySQL, is supports for a native JSON data type.

Below is its data-type equivalent representation (tabular representation is taken from here):

  Oracle MySQL
1 BFILE Pointer to binary file, ⇐ 4G VARCHAR(255)
2 BINARY_FLOAT 32-bit floating-point number FLOAT
3 BINARY_DOUBLE 64-bit floating-point number DOUBLE
4 BLOB Binary large object, ⇐ 4G LONGBLOB
5 CHAR(n), CHARACTER(n) Fixed-length string, 1 ⇐ n ⇐ 255 CHAR(n), CHARACTER(n)
6 CHAR(n), CHARACTER(n) Fixed-length string, 256 ⇐ n ⇐ 2000 VARCHAR(n)
7 CLOB Character large object, ⇐ 4G LONGTEXT
8 DATE Date and time DATETIME
9 DECIMAL(p,s), DEC(p,s) Fixed-point number DECIMAL(p,s), DEC(p,s)
11 FLOAT(p) Floating-point number DOUBLE
12 INTEGER, INT 38 digits integer INT DECIMAL(38)
13 INTERVAL YEAR(p) TO MONTH Date interval VARCHAR(30)
14 INTERVAL DAY(p) TO SECOND(s) Day and time interval VARCHAR(30)
15 LONG Character data, ⇐ 2G LONGTEXT
16 LONG RAW Binary data, ⇐ 2G LONGBLOB
17 NCHAR(n) Fixed-length UTF-8 string, 1 ⇐ n ⇐ 255 NCHAR(n)
18 NCHAR(n) Fixed-length UTF-8 string, 256 ⇐ n ⇐ 2000 NVARCHAR(n)
19 NCHAR VARYING(n) Varying-length UTF-8 string, 1 ⇐ n ⇐ 4000 NCHAR VARYING(n)
20 NCLOB Variable-length Unicode string, ⇐ 4G NVARCHAR(max)
21 NUMBER(p,0), NUMBER(p) 8-bit integer, 1 <= p < 3 TINYINT (0 to 255)
16-bit integer, 3 <= p < 5 SMALLINT
32-bit integer, 5 <= p < 9 INT
64-bit integer, 9 <= p < 19 BIGINT
Fixed-point number, 19 <= p <= 38 DECIMAL(p)
22 NUMBER(p,s) Fixed-point number, s > 0 DECIMAL(p,s)
23 NUMBER, NUMBER(*) Floating-point number DOUBLE
24 NUMERIC(p,s) Fixed-point number NUMERIC(p,s)
25 NVARCHAR2(n) Variable-length UTF-8 string, 1 ⇐ n ⇐ 4000 NVARCHAR(n)
26 RAW(n) Variable-length binary string, 1 ⇐ n ⇐ 255 BINARY(n)
27 RAW(n) Variable-length binary string, 256 ⇐ n ⇐ 2000 VARBINARY(n)
28 REAL Floating-point number DOUBLE
29 ROWID Physical row address CHAR(10)
30 SMALLINT 38 digits integer DECIMAL(38)
31 TIMESTAMP(p) Date and time with fraction DATETIME(p)
32 TIMESTAMP(p) WITH TIME ZONE Date and time with fraction and time zone DATETIME(p)
33 UROWID(n) Logical row addresses, 1 ⇐ n ⇐ 4000 VARCHAR(n)
34 VARCHAR(n) Variable-length string, 1 ⇐ n ⇐ 4000 VARCHAR(n)
35 VARCHAR2(n) Variable-length string, 1 ⇐ n ⇐ 4000 VARCHAR(n)

Data type attributes and options:

Oracle MySQL
BYTE and CHAR column size semantics Size is always in characters


Percona Server uses XtraDB (an enhanced version of InnoDB) as its primary storage engine for handling transactional data; although various storage engines can be an alternative choice for handling transactions such as TokuDB (deprecated) and MyRocks storage engines.

Whilst there are advantages and benefits to using or exploring MyRocks with XtraDB, the latter is more robust and de facto storage engine that Percona Server is using and its enabled by default, so we'll use this storage engine as the basis for migration with regards to transactions.

By default, Percona Server / MySQL has autocommit variable set to ON which means that you have to explicitly handle transactional statements to take advantage of ROLLBACK for ignoring changes or taking advantage of using SAVEPOINT.

It's basically the same concept that Oracle uses in terms of commit, rollbacks and savepoints.

For explicit transactions, this means that you have to use the START TRANSACTION/BEGIN; <SQL STATEMENTS>; COMMIT; syntax.

Otherwise, if you have to disable autocommit, you have to explicitly COMMIT all the time for your statements that requires changes to your data.

Dual Table

MySQL has the dual compatibility with Oracle which is meant for compatibility of databases using a dummy table, namely DUAL.

This suits Oracle's usage of DUAL so any existing statements in your application that use DUAL might require no changes upon migration to Percona Server.

The Oracle FROM clause is mandatory for every SELECT statement, so Oracle database uses DUAL table for SELECT statement where a table name is not required.

In MySQL, the FROM clause is not mandatory so DUAL table is not necessary. However, the DUAL table does not work exactly the same as it does for Oracle, but for simple SELECT's in Percona Server, this is fine.

See the following example below:

In Oracle,

 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 DUMMY                                              VARCHAR2(1)

16-FEB-19 AM +08:00

But in MySQL:

mysql> DESC DUAL;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'DUAL' at line 1
| 2019-02-15 20:20:28 |
1 row in set (0.00 sec)

Note: the DESC DUAL syntax does not work in MySQL and the results as well differ as CURRENT_TIMESTAMP (uses TIMESTAMP data type) in MySQL does not include the timezone.


Oracle's SYSDATE function is almost the same in MySQL.

MySQL returns date and time and is a function that requires () (close and open parenthesis with no arguments required. To demonstrate this below, here's Oracle and MySQL on using SYSDATE.

In Oracle, using plain SYSDATE just returns the date of the day without the time. But to get the time and date, use TO_CHAR to convert the date time into its desired format; whereas in MySQL, you might not need it to get the date and the time as it returns both.

See example below.

In Oracle:

02-16-2019 04:39:00



But in MySQL:

| SYSDATE()           |
| 2019-02-15 20:37:36 |
1 row in set (0.00 sec)

If you want to format the date, MySQL has a DATE_FORMAT() function.

You can check the MySQL Date and Time documentation for more info.

Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl


Oracle's TO_DATE equivalent in MySQL is the STR_TO_DATE() function.

It’s almost identical to the one in Oracle: it returns the DATE data type, while in MySQL it returns the DATETIME data type.


SQL> SELECT TO_DATE ('20190218121212','yyyymmddhh24miss') as "NOW" FROM DUAL; 


mysql> SELECT STR_TO_DATE('2019-02-18 12:12:12','%Y-%m-%d %H:%i:%s') as "NOW" FROM DUAL;
| NOW                 |
| 2019-02-18 12:12:12 |
1 row in set (0.00 sec)


In MySQL, there's no such support nor any equivalence for SYNONYM in Oracle.

A possible alternative can be possible with MySQL is using VIEW.

Although SYNONYM can be used to create an alias of a remote table,



In MySQL, you can take advantage of using FEDERATED storage engine.


CREATE TABLE hr_employees (
    name   VARCHAR(32) NOT NULL DEFAULT '',
    other  INT(20) NOT NULL DEFAULT '0',
    PRIMARY KEY  (id),
    INDEX name (name),
    INDEX other_key (other)

Or you can simplify the process with CREATE SERVER syntax, so that when creating a table acting as your SYNONYM for accessing a remote table, it will be easier. See the documentation for more info on this.

Behaviour of Empty String and NULL

Take note that in Percona Server / MySQL, empty string is not NULL whereas Oracle treats empty string as null values.

In Oracle:



mysql> SELECT CASE WHEN '' IS NULL THEN 'Yes' ELSE 'No' END AS "Null Eval" FROM dual;
| Null Eval |
| No        |
1 row in set (0.00 sec)


In MySQL, there's no exact same approach to what Oracle does for SEQUENCE.

Although there are some posts that are simulating the functionality of this approach, you might be able to try to get the next key using LAST_INSERT_ID() as long as your table's clustered index, PRIMARY KEY, is defined with << is there something missing? >>

Character String Functions

Unlike Oracle, MySQL / Percona Server has a handful of string functions but not as many helpful functions built-in to the database.

It would be too long to discuss it here one-by-one, but you can check the documentation from MySQL and compare this against Oracle's string functions.

DML Statements

Insert/Update/Delete statements from Oracle are congruous in MySQL.

Oracle's INSERT ALL/INSERT FIRST is not supported in MySQL.

Otherwise, you’d need to state your MySQL queries one-by-one.


In Oracle:

  INTO CUSTOMERS (customer_id, customer_name, city) VALUES (1000, 'Jase Alagaban', 'Davao City')
  INTO CUSTOMERS (customer_id, customer_name, city) VALUES (2000, 'Maximus Aleksandre Namuag', 'Davao City')
2 rows created.

2 rows created.

But in MySQL, you have to run the insert one at a time:

mysql> INSERT INTO CUSTOMERS (customer_id, customer_name, city) VALUES (1000, 'Jase Alagaban', 'Davao City');
Query OK, 1 row affected (0.02 sec)
mysql> INSERT INTO CUSTOMERS (customer_id, customer_name, city) VALUES (2000, 'Maximus Aleksandre Namuag', 'Davao City');
Query OK, 1 row affected (0.00 sec)

The INSERT ALL/INSERT FIRST doesn’t compare to how it is used in Oracle, where you can take advantage of conditions by adding a WHEN keyword in your syntax; there's no equivalent option in MySQL / Percona Server in this case.

Hence, your alternative solution on this is to use procedures.

Outer Joins "+" Symbol

In Oracle, using + operator for left and right joins is not supported at present in MySQL as + operator is only used for arithmetic decisions.

Hence, if you have + operator in your existing Oracle SQL statements, you need to replace this with LEFT JOIN or RIGHT JOIN.

You might want to check the official documentation for "Outer Join Simplification" of MySQL.


Oracle uses START WITH..CONNECT BY for hierarchical queries.

Starting with MySQL / Percona 8.0, there is support for generating hierarchical data results which uses models such as adjacency list or nested set models. This is called Common Table Expressions (CTE) in MySQL.

Similar to PostgreSQL, MySQL uses WITH RECURSIVE syntax for hierarchical queries so translate CONNECT BY statement into WITH RECURSIVE statement.

Check down below on how it differs from ORACLE and in MySQL / Percona Server.

In Oracle:

SELECT, cp.title, CONCAT(c2.title, ' > ' || cp.title) as path
FROM category cp INNER JOIN category c2
  ON cp.parent_id =
WHERE cp.parent_id IS NOT NULL

And in MySQL:

WITH RECURSIVE category_path (id, title, path) AS
  SELECT id, title, title as path
    FROM category
    WHERE parent_id IS NULL
  SELECT, c.title, CONCAT(cp.path, ' > ', c.title)
    FROM category_path AS cp JOIN category AS c
      ON = c.parent_id
SELECT * FROM category_path
ORDER BY path;

PL/SQL in MySQL / Percona?

MySQL / Percona RDBMS has a different approach than Oracle's PL/SQL.

MySQL uses stored procedures or stored functions, which is similar to PL/SQL and syntax using BEGIN..END syntax.

Oracle's PL/SQL is compiled before execution when it is loaded into the server, while MySQL is compiled and stored in the cache when it's invoked.

You may want to checkout this documentation as a reference guide on converting your PL/SQL to MySQL.

Migration Tools

I did some research for any tools that could be a de facto standard for migration but I couldn’t find a good answer.

Though, I did find sqlines and it looks simple but promising.

While I didn’t deep-dive into it, the website offers a handful of insights, which could help you on migrating from Oracle to MySQL/Percona Server. There are also paid tools such as this and this.

I've also searched through github but found nothing much more appealing as a resolution to the problem. Hence, if you're aiming to migrate from Oracle and to Amazon, they have AWS Schema Conversion Tool for which migrating from Oracle to MySQL is supported.

Overall, the reason why migration is not an easy thing to do is mainly because Oracle RDBMS is such a beast with lots of features that Percona Server / MySQL or MariaDB RDBMS still do not have.

Anyhow, if you find or know of any tools that you find helpful and beneficial for migrating from Oracle to MySQL / Percona Server, please leave a comment on this blog!


As part of your migration plan, testing is a vital task that plays a very important role and affects your decision with regards to migration.

The tool dbdeployer (a port of MySQL Sandbox) is a very helpful tool that you can take advantage of. This is pretty easy for you to try and test different approaches and saves you time, rather than setting up the whole stack if your purpose is to try and test the RDBMS platform first.

For testing your SQL stored routines (functions or procedures), triggers, events, I suggest you use these tools mytap or the Google's Unit Testing Framework.

Percona as well offers a number of tools that are available for download on their website. Checkout Percona Toolkit here. You can cherry-pick the tools according to your needs especially for testing and production-usage tasks.

Overall, things that you need to keep-in-mind as your guidelines when doing a test for your MySQL Server are:

  • After your installation, you need to consider doing some tuning. Checkout this Percona blog for help.
  • Do some benchmarks and stress-load testing for your configuration setup on your current node. Checkout mysqlslap and sysbench which can help you with this. Also check out our blog "How to Benchmark Performance of MySQL & MariaDB using SysBench".
  • Check your DDL's if they are correctly defined such as data-types, constraints, clustered and secondary indexes, or partitions, if you have any.
  • Check your DML especially if syntax are correct and are saving the data correctly as expected.
  • Check out your stored routines, events, trigger to ensure they run/return the expected results.
  • Verify that your queries running are performant. I suggest you take advantage of open-source tools or try our ClusterControl product. It offers monitoring/observability especially of your MySQL / Percona Server. You can use ClusterControl here to monitor your queries and its query plan to make sure they are performant.

by Paul Namuag at February 18, 2019 03:17 PM

Peter Zaitsev

Deprecation of TLSv1.0 2019-02-28

end of Percona support for TLS1.0

end of Percona support for TLS1.0Ahead of the PCI move to deprecate the use of ‘early TLS’, we’ve previously taken steps to disable TLSv1.0.

Unfortunately at that time we encountered some issues which led us to rollback these changes. This was to allow users of operating systems that did not – yet – support TLSv1.1 or higher to download Percona packages over TLSv1.0.

Since then, we have been tracking our usage statistics for older operating systems that don’t support TLSv1.1 or higher at We now receive very few legitimate requests for these downloads.

Consequently,  we are ending support for TLSv1.0 on all Percona web properties.

While the packages will still be available for download from, we are unlikely to update them as the OS’s are end-of-life (e.g. RHEL5). Also, in future you will need to download these packages from a client that supports TLSv1.1 or greater.

For example EL5 will not receive an updated version of OpenSSL to support versions greater than TLSv1.1. PCI has called for the deprecation of ‘early TLS’ standards. Therefore you should upgrade any EL5 installations to EL6 or greater as soon as possible. As noted in this support policy update by Red Hat, EL5 stopped receiving support under extended user support (EUS) in March 2015.

To continue to receive updates for your OS and for any Percona products that you use, you need to update to more recent versions of CentOS, Scientific Linux, and RedHat Enterprise Linux.

Photo by Kevin Noble on Unsplash

by David Busby at February 18, 2019 12:53 PM

February 16, 2019

Valeriy Kravchuk

Fun with Bugs #79 - On MySQL Bug Reports I am Subscribed to, Part XV

More than 3 weeks passed since my previous review of public MySQL bug reports I am subscribed to, so it's time to present some of the bugs I've considered interesting in January, 2019.

As usual, I'll review them starting from the oldest and try to summarize my feelings about these bugs at the end of this post. Here they are:
  • Bug #93806 - "Document error about ON DUPLICATE KEY UPDATE". Years pass, but fine MySQL manual still does not explain some cases of InnoDB locking properly. Xiaobin Lin found yet another case that it does not explain properly. Or, maybe, the manual is correct and the problem in the implementation? MariaDB 10.3.7 shows the same behavior.
  • Bug #93827 - "dict_index_has_desc() is not efficient". Yet another bug report from Zhai Weixiang. I see 50 still active bug reports from him! Maybe Oracle should send some nice T-shirts to top N most productive bug reporters?
  • Bug #93845 - "Optimizer choose wrong index, sorting index instead of filtering index". yet another bug report of a known class, this time from Daniele Renda. It's good example of optimizer trace usage to make a point. Note also that using ANALYZE ... UPDATE HISTOGRAMS does not help. As a side note, implementation of optimizer trace for MariaDB is finally in progress and should be done for upcoming 10.4. See MDEV-6111 for the details if you care.
  • Bug #93875 - "mysqldump per-table dump is slow since 5.7 on instances with many tables". This performance regression bug (that was "verified" without adding the regression tag) was reported by Nikolai Ikhalainen from Percona. This bug report is a nice example of using Docker to create easily repeatable test cases for bug reports.
  • Bug #93878 - "innodb_status_output fails to restore to old value". This great bug report from Yuhui Wang  not only describes 3 cases when InnoDB status is printed to the error log automatically, but also shows that in one of these cases, when we can not found free block in the buffer pool in 20 loops, this printing is not stopped after the problem is resolved, and provides a patch that resolves the problem. See also his nice Bug #94065 - "MySQL fails to startup when setting persist variable" with detailed analysis of the problem.
  • Bug #93917 - "Wrong binlog entry for BLOB on a blackhole intermediary master". Nice corner case was found by Sveta Smirnova from Percona. With her 52 "Verified" bug reports at the moment she also deserves a T-shirt from Oracle as one of top bug reporters!
  • Bug #93922 - "UNION ALL very slow with SUM(0)". This weird bug was found and reported by Sergio Paternoster. He had to spend notable efforts to see this bug "Verified"...
  • Bug #93948 - "XID inconsistency on master-slave with CTAS". Krunal Bauskar from Percona noted this inconsistency in XID generation on slave vs master. Let's wait and check if it ends up as "Not a bug".
  • Bug #93957 - "slave_compressed_protocol doesn't work with semi-sync replication in MySQL-5.7". This bug report from Pavel Katiushyn also looks like a regression, as similar bug was fixed in older 5.7.x release. But I do not see any public comment with verification attempt neither in recent 5.7, nor in recent 8.0 (where older bug also had to be fixed). So, the bug is "verified", but the real impact and versions affected are not clear.
  • Bug #93963 - "Slow query log doesn't log a slow CREATE INDEX with admin statements enabled". This clear and properly tagged regression vs MySQL 5.7 was reported by Jeremy Smyth.
  • Bug #93986 - "Transactions in serializable mode are not actually serializable". I've subscribed to this bug report mostly for (expected) fun of reading further comments. It's still "Need feedback", but single comment so far is worth reading.
  • Bug #94121 - "Enable hardware CRC32 under Valgrind". Laurynas Biveinis from Percona also provided a patch for this 8 years old problem.
  • Bug #94130 - "XA COMMIT may lead replication broken". Yet another proof that XA transactions implementation is broken in MySQL. This time from Phoenix Zhang and in semi-sync replication case.
This photo reminds me current state of MySQL bugs processing in Oracle - it seems there is no clear and straightforward way to follow. Everything is fuzzy these days...

There are few more bugs reported in January, 2019 that I am watching, but their status is not yet clearly defined, so I decided to skip them in this review.

To summarize:
  1.  Oracle engineers who process bugs still do not add regression tag to many regression bugs. This is a shame, really. If I were their boss I'd make this a policy and one of important KPI values to monitor.
  2. In some cases bugs get verified immediately without any demonstrated attempt to show how the check was performed, while in other cases poor bug reporters have to fight hard to re-make their point and get a real check done. It seems these days good old approaches to bugs verification are not followed strictly by some Oracle engineers.

by Valerii Kravchuk ( at February 16, 2019 08:24 PM

February 15, 2019

Peter Zaitsev

ClickHouse Performance Uint32 vs Uint64 vs Float32 vs Float64

Q1 least compression

While implementing ClickHouse for query executions statistics storage in Percona Monitoring and Management (PMM),  we were faced with a question of choosing the data type for metrics we store. It came down to this question: what is the difference in performance and space usage between Uint32, Uint64, Float32, and Float64 column types?

To test this, I created a test table with an abbreviated and simplified version of the main table in our ClickHouse Schema.

The “number of queries” is stored four times in four different columns to be able to benchmark queries referencing different columns.  We can do this with ClickHouse because it is a column store and it works only with columns referenced by the query. This method would not be appropriate for testing on MySQL, for example.

    digest String,
    db_server String,
    db_schema String,
    db_username String,
    client_host String,
    period_start DateTime,
    nq_UInt32 UInt32,
    nq_UInt64 UInt64,
    nq_Float32 Float32,
    nq_Float64 Float64
ENGINE = MergeTree
PARTITION BY toYYYYMM(period_start)
ORDER BY (digest, db_server, db_username, db_schema, client_host, period_start)
SETTINGS index_granularity = 8192

When testing ClickHouse performance you need to consider compression. Highly compressible data (for example just a bunch of zeroes) will compress very well and may be processed a lot faster than incompressible data. To take this into account we will do a test with three different data sets:

  • Very Compressible when “number of queries” is mostly 1
  • Somewhat Compressible when we use a range from 1 to 1000 and
  • Poorly Compressible when we use range from 1 to 1000000.

Since it’s unlikely that an application will use the full 32 bit range, we haven’t used it for this test.

Another factor which can impact ClickHouse performance is the number of “parts” the table has. After loading the data we ran OPTIMIZE TABLE FINAL to ensure only one part is there on the disk. Note: ClickHouse will gradually delete old files after the optimize command has completed. To avoid these operations interfering with benchmarks, I waited for about 15 minutes to ensure all unused data was removed from the disk.

The amount of memory on the system was enough to cache whole columns in all tests, so this is an in-memory test.

Here is how the table with only one part looks on disk:

root@d01e692c291f:/var/lib/clickhouse/data/pmm/test_lc# ls -la
total 28
drwxr-xr-x 4 clickhouse clickhouse 12288 Feb 10 20:39 .
drwxr-xr-x 8 clickhouse clickhouse 4096 Feb 10 22:38 ..
drwxr-xr-x 2 clickhouse clickhouse 4096 Feb 10 20:30 201902_1_372_4
drwxr-xr-x 2 clickhouse clickhouse 4096 Feb 10 19:38 detached
-rw-r--r-- 1 clickhouse clickhouse 1 Feb 10 19:38 format_version.txt

When you have only one part it makes it very easy to see the space different columns take:

root@d01e692c291f:/var/lib/clickhouse/data/pmm/test_lc/201902_1_372_4# ls -la
total 7950468
drwxr-xr-x 2 clickhouse clickhouse 4096 Feb 10 20:30 .
drwxr-xr-x 4 clickhouse clickhouse 12288 Feb 10 20:39 ..
-rw-r--r-- 1 clickhouse clickhouse 971 Feb 10 20:30 checksums.txt
-rw-r--r-- 1 clickhouse clickhouse 663703499 Feb 10 20:30 client_host.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 client_host.mrk
-rw-r--r-- 1 clickhouse clickhouse 238 Feb 10 20:30 columns.txt
-rw-r--r-- 1 clickhouse clickhouse 9 Feb 10 20:30 count.txt
-rw-r--r-- 1 clickhouse clickhouse 228415690 Feb 10 20:30 db_schema.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 db_schema.mrk
-rw-r--r-- 1 clickhouse clickhouse 6985801 Feb 10 20:30 db_server.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 db_server.mrk
-rw-r--r-- 1 clickhouse clickhouse 19020651 Feb 10 20:30 db_username.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 db_username.mrk
-rw-r--r-- 1 clickhouse clickhouse 28227119 Feb 10 20:30 digest.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 digest.mrk
-rw-r--r-- 1 clickhouse clickhouse 8 Feb 10 20:30 minmax_period_start.idx
-rw-r--r-- 1 clickhouse clickhouse 1552547644 Feb 10 20:30 nq_Float32.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 nq_Float32.mrk
-rw-r--r-- 1 clickhouse clickhouse 1893758221 Feb 10 20:30 nq_Float64.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 nq_Float64.mrk
-rw-r--r-- 1 clickhouse clickhouse 1552524811 Feb 10 20:30 nq_UInt32.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 nq_UInt32.mrk
-rw-r--r-- 1 clickhouse clickhouse 1784991726 Feb 10 20:30 nq_UInt64.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 nq_UInt64.mrk
-rw-r--r-- 1 clickhouse clickhouse 4 Feb 10 20:30 partition.dat
-rw-r--r-- 1 clickhouse clickhouse 400961033 Feb 10 20:30 period_start.bin
-rw-r--r-- 1 clickhouse clickhouse 754848 Feb 10 20:30 period_start.mrk
-rw-r--r-- 1 clickhouse clickhouse 2486243 Feb 10 20:30 primary.idx

We can see there are two files for every column (plus some extras), and so, for example, the Float32 based “number of queries” metric store takes around 1.5GB.

You can also use the SQL queries to get this data from the ClickHouse system tables instead:

FROM system.columns
WHERE (database = 'pmm') AND (table = 'test') AND (name = 'nq_UInt32')
Row 1:
database: pmm
table: test
name: nq_UInt32
type: UInt32
data_compressed_bytes: 7250570
data_uncompressed_bytes: 1545913232
marks_bytes: 754848
is_in_partition_key: 0
is_in_sorting_key: 0
is_in_primary_key: 0
is_in_sampling_key: 0
1 rows in set. Elapsed: 0.002 sec.
WHERE (database = 'pmm') AND (table = 'test')
Row 1:
partition: 201902
name: 201902_1_372_4
active: 1
marks: 47178
rows: 386478308
bytes_on_disk: 1401028031
data_compressed_bytes: 1390993287
data_uncompressed_bytes: 29642900064
marks_bytes: 7548480
modification_time: 2019-02-10 23:26:20
remove_time: 0000-00-00 00:00:00
refcount: 1
min_date: 0000-00-00
max_date: 0000-00-00
min_time: 2019-02-08 14:50:32
max_time: 2019-02-08 15:58:30
partition_id: 201902
min_block_number: 1
max_block_number: 372
level: 4
data_version: 1
primary_key_bytes_in_memory: 4373363
primary_key_bytes_in_memory_allocated: 6291456
database: pmm
table: test
engine: MergeTree
path: /var/lib/clickhouse/data/pmm/test/201902_1_372_4/
1 rows in set. Elapsed: 0.003 sec.

Now let’s look at the queries

We tested with two queries.  One of them – we’ll call it Q1 – is a very trivial query, simply taking the sum across all column values. This query needs only to access one column to return results so it is likely to be the most impacted by a change of data type:

SELECT sum(nq_UInt32)
FROM test

The second query – which we’ll call Q2 – is a typical ranking query which computes the number of queries per period and then shows periods with the highest amount of queries in them:

    sum(nq_UInt32) AS cnt,
FROM test
GROUP BY period_start

This query needs to access two columns and do more complicated processing so we expect it to be less impacted by the change of data type.

Before we get to results I think it is worth drawing attention to the raw performance we’re getting.  I did these tests on DigitalOcean Droplet with just six virtual CPU cores, yet still I see numbers like these:

SELECT sum(nq_UInt32)
FROM test
┌─sum(nq_UInt32) ──┐
│     386638984    │
1 rows in set. Elapsed: 0.205 sec. Processed 386.48 million rows, 1.55 GB (1.88 billion rows/s., 7.52 GB/s.)

Processing more than 300M rows/sec per core and more than 1GB/sec per core is very cool!

Query Performance

Results between different compression levels show similar differences between column types, so let’s focus on those with the least compression:

Q1 least compression

Q2 least compression

As you can see, the width of the data type (32 bit vs 64 bit) matters a lot more than the type (float vs integer). In some cases float may even perform faster than integer. This was the most unexpected result for me.

Another metric ClickHouse reports is the processing speed in GB/sec. We see a different picture here:

Q1 GB per second

64 bit data types have a higher processing speed than their 32 bit counter parts, but queries run slower as there is more raw data to process.


Let’s now take a closer look at compression.  For this test we use default LZ4 compression. ClickHouse has powerful support for Per Column Compression Codecs but testing them is outside of scope for this post.

So let’s look at size on disk for UInt32 Column:

On disk data size for UINT32

What you can see from these results is that when data is very compressible ClickHouse can compress it to almost nothing.  The compression ratio for our very compressible data set is about 200x (or 99.5% size reduction if you prefer this metric).

Somewhat compressible data compression rate is 1.4x.  That’s not bad but considering we are only storing 1-1000 range in this column – which requires 10 bits out of 32 – I would hope for better compression. I guess LZ4 is not compressing such data very well.

Now let’s look at compression for a 64 bit integer column:

On disk data size for UINT64

We can see that while the size almost doubled for very compressible data, increases for our somewhat compressible data and poorly compressible data are not that large.  Somewhat compressible data now compresses 2.5x.

Now let’s take a look at Performance depending on data compressibility:

Q1 time for UINT32

Poorly compressible data which takes a larger space on disk is processed faster than somewhat compressible data? This did not make sense. I repeated the run a few times to make sure that the results were correct. When I looked at the compression ratio, though, it suddenly made sense to me.

Poorly compressible data for the UInt32 data type was not compressible by LZ4 so it seems the original data was stored, significantly speeding up “decompression” process.   With somewhat compressible data, compression worked and so real decompression needed to take place too. This makes things slower.

This is why we can only observe these results with UInt32 and Float32 data types.  UInt64 and Float64 show the more expected results:

Q1 time for UINT64


Here are my conclusions:

  • Even with “slower” data types, ClickHouse is very fast
  • Data type choice matters – but less than I expected
  • Width (32bit vs 64bit) impacts performance more than integer vs float data types
  • Storing a small range of values in a wider column type is likely to yield better compression, though with default compression it is not as good as theoretically possible
  • Compression is interesting. We get the best performance when data can be well compressed. Second best is when we do not have to spend a lot of time decompressing it, as long as it is fits in memory.

by Peter Zaitsev at February 15, 2019 01:23 PM

Jean-Jerome Schmidt

How to Migrate MySQL from Amazon EC2 to your On-Prem Data Center Without Downtime

Since the concept of cloud was born, there has been strong growth in the number of migrations to this environment. However, not everything that shines is gold.

As the demand grows, so does the costs. We can find ourselves in a situation where our monthly cloud expenses are very high and, in this case, it may make sense to migrate back to an on-prem environment.

The costs may not be the only reason. There might be security or compliance requirements, or we may need to have more control of our systems. Knowing what happens at a lower level can help us better optimize things.

AWS not only give us the environment, it also provides us with monitoring and management tools to run our system in the cloud. So, it can be really hard to migrate to an on-prem environment and recreate all these tools to manage our systems in the same way.

In this blog, we will see how we can migrate our systems from AWS to an on-prem datacenter, and how ClusterControl can help us in the process.


First of all, let’s see some basic concepts about Amazon Cloud.


Amazon Web Services (AWS) is an Infrastructure as a Service platform, comprising a large number of independent and semi-independent services. The purpose of Infrastructure as a Service platform is to offer, on a commodity basis, services that previously required the purchase of capital-intensive infrastructure components such as high-end servers, network routers and switches, and for larger enterprises, even their own datacenters.


Amazon Relational Database Service (RDS) makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while automating time-consuming administration tasks such as hardware provisioning, database setup, patching and backups.

Amazon RDS is available on several database instance types and provides you with six familiar database engines to choose from, including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server.


Amazon Elastic Compute Cloud (EC2) is a service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.

Amazon EC2’s simple web interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment.


ClusterControl is a comprehensive management system for open source databases that automates deployment and management functions, as well as health and performance monitoring. There are two versions: Community Edition or Enterprise Edition. ClusterControl supports deployment, management, monitoring and scaling for different database technologies on any environment.

Why Migrate?

As we mentioned at the beginning, the most common reasons to migration from AWS to an on-premise environment would be costs, security, compliance, or even working with running local applications. In AWS, we don’t know what is happening under the hood of the infrastructure. We only know that all is working. In cases where you experience poor performance or other anomalies, the only solution is to get in contact with Amazon support.

Example Migration Scenario

In AWS we have two different products related to this blog: EC2 and RDS.

The main difference between them is that in EC2 you have SSH access to the server and have to manage the database yourself. RDS is a hosted database service, and you only have access to the database instance.

In RDS, as you don’t have SSH access, you need to create a dump and import it into the new server, or you can configure replication and promote the slave to the new master. For both options, the process is manual. Also, you can add some load balancer to improve this process. We covered this task in these blogs: Part 1 and Part 2.

So, let focus on the migration from EC2.

In our example, let’s see how to migrate MySQL from AWS EC2 to an on-prem datacenter. We will use a MySQL Replication environment, but these steps should work for other technologies like PostgreSQL.

We will assume that you have your main MySQL database running on EC2 instance. In the on-prem datacenter, we assume you have ClusterControl installed, as well as a fresh database server to migrate to.

In the AWS console, you should have something like this in the EC2 instances section:

AWS EC2 Section
AWS EC2 Section

First, we’ll import our current master running on EC2 to ClusterControl. For this import process, you must open the port 3306 by editing the Security Group associated with the EC2 instance.

AWS Security Group
AWS Security Group

After this, within ClusterControl, go to the Import section.

ClusterControl Import Section 1
ClusterControl Import Section 1

There, you can choose the technology, in our example MySQL Replication, and we must specify User, Key or Password and port to connect by SSH to our server. We also need the name for our new replication ‘cluster’.

ClusterControl Import Section 2
ClusterControl Import Section 2

After setting up the SSH access information, we must define some database information like the database user, version and basedir. Also, we can enable the ClusterControl Node AutoRecovery and Cluster AutoRecovery features for the new cluster.

Then, we need to add our server by using the IP address or hostname and press Import.

ClusterControl Import Section 3
ClusterControl Import Section 3
Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

We can monitor the status of the import of our setup from the ClusterControl activity monitor.

Once the task is finished, we can see our master in the main ClusterControl screen.

Make sure that you have enabled the binlog generation in your current master database. If not, you can enable it from the Node Action section in ClusterControl.

Now, we can add our future new master as a new replica from our current master database. For this, go to ClusterControl -> Select Cluster -> Cluster Actions -> Add Replication Slave.

ClusterControl Add Replication Slave
ClusterControl Add Replication Slave

Here, we need to add the hostname or IP address of the new slave server, and if we want ClusterControl to install the software for us.

Make sure that you have connectivity from AWS to the port 3306 and 9999 in the on-prem server.

The way ClusterControl stages the slave with data is to take an hotbackup of the master, stream it to the slave and restore it there. Once restored, the slave is connected to the master so it can catch up on events and get in sync. Note that, for large databases running with some load, you might want to avoid the extra load of this operation on the master. In that case, it is possible to build the slave first from an existing backup, and then connect the slave so it catches up with the master.

After this task, we should have something like this:

You can also verify the topology on the ClusterControl Topology section.

ClusterControl Topology View 1
ClusterControl Topology View 1

Then, we need to promote the slave to master (ClusterControl -> Select Cluster -> Node Actions -> Promote slave) and change the endpoint in your application.

To improve this topology, you can add a load balancer to manage the traffic from the application server to the database. Using a load balancer, during the migration, you don’t need to change the endpoint from your application. The load balancer will change the master in a transparent way for your application.

ClusterControl Topology View 2
ClusterControl Topology View 2

There are many ways to perform this task and, probably you should be able to adapt this strategy or similar, to your environment, depending on your infrastructure, security, etc.

For security reasons, you should consider using a VPN between the AWS and the on-premise environment.

In the case of a multi-master topology like Galera Cluster, you only need to add the nodes that you want on-premise, but be careful with the latency. You can use for example different Galera segments to decrease network usage.


Some considerations to take into account when we want to leave AWS and start to use our own environment could be:

  • Monitoring: Don’t forget to use some monitoring system. You need to know what is happening in your system.
  • Disaster Recovery Strategy: You should consider some disaster recovery strategy. In general, you should have the information in three different places, for example, Master, Slave, and backup, each in different physical places.
  • High Availability: Nowadays, HA is a must in most production environments, so we need to think about the best HA solution depending on our infrastructure.
  • Scaling: We should be able to scale if it’s needed in the future or for some specific event.
  • Rollback: If you want to migrate from AWS to an on-premise environment, keep in mind that something could go wrong (as in any type of migration), so you should have some rollback plan.
  • If you are after some kind of hybrid environment, with instances running on AWS and on-prem, then ClusterControl can be a good fit for monitoring, managing availability, backups and scaling.
ClusterControl Overview
ClusterControl Overview

by Sebastian Insausti at February 15, 2019 07:26 AM

February 14, 2019

Peter Zaitsev

FOSDEM 2019 – Percona Presentations

FOSDEM Paintings

For those not familiar with it, FOSDEM is an amazing, free entry, full on celebration of open source that takes place in Brussels, Belgium every year. This year the event was held over the first weekend of February. Fringe events, such as the Pre-FOSDEM MySQL day hosted by Oracle MySQL, and the community dinner that follows, provide an opportunity to network.

In case you didn’t make it to FOSDEM this year, here are links to Percona’s presentations from the event. Organizers video and share online every talk from every dev room, a phenomenal achievement in itself. All credit to the volunteers who run this show.

Database Dev Room: Hugepages and databases presented by Fernando Laudares Camargos


MySQL, MariaDB and Friends Dev Room: MySQL Replication – Advanced Features presented by Peter Zaitsev


MySQL, MariaDB and Friends Dev Room: MySQL Performance Schema in 20 Minutes presented by Sveta Smirnova


Monitoring and Observability Dev Room: Using eBPF for Linux Performance Analyses by Peter Zaitsev

Percona enjoyed plenty of attention in the booth area where shared information about our open source, free-as-in-beer projects. We were in Brussels after all!

Evgeniy Patlan, Slava Sarzhan and Alexey Palazhchenko enjoying booth duty

Evgeniy Patlan, Slava Sarzhan and Alexey Palazhchenko enjoying booth duty

Percona Booth FOSDEM 2019

Alexey making sure everything is in order at the Percona booth

Sandra Dannenberg art

Passing artist and open source enthusiast Sandra Dannenberg took a liking to our Percona logos and painted her own versions. They’re great aren’t they? FOSDEM is that kind of event… we’re looking forward already to 2020!

by Lorraine Pocklington, Community Manager at February 14, 2019 11:33 AM

Jean-Jerome Schmidt

Monitoring Your Databases with MySQL Enterprise Monitor

How to Monitor MySQL Databases?

Operational visibility is a must in any production environment. It is crucial to be able to identify any issues as soon as possible, otherwise you may end up in serious troubles as an undetected issue can cause serious service disruption or downtime. MySQL Enterprise Monitor is one of the oldest monitoring products for MySQL on the market, and is available as part of an commercial enterprise subscription agreement from Oracle.In this blog post we will take a look at MySQL Enterprise Monitor and the kind of insight it provides into MySQL.


First of all, MySQL Enterprise Monitor is part of MySQL Enterprise Edition, a commercial offering from Oracle. It comes in multiple versions of packages, for different operating systems. The installation on Windows 10 (the system we tested on) is pretty much straightforward. MySQL Enterprise Monitor is configured and some bundled services will be installed (MySQL, Tomcat). The tool can be accessed via the browser.

Initial Configuration

First of all, you have to add hosts you would like to monitor.

You can either add single hosts or a batch of them. The dialog window looks the same except that when adding in bulk, you can pass a comma-separated list of servers.

We won’t go into details, but in short you have to define from which host the MySQL instances should be monitored - typically it will be the host on which you installed MySQL Enterprise Monitor. You can also setup agents on your MySQL instances, in that case they will be able to collect data for the host as well, not only MySQL metrics. Then you need to define how to reach the monitored instance (IP address/hostname, user and password). MySQL Enterprise Monitor will then create additional users for tasks like monitoring, which does not require superuser privileges. If you want, you can also configure SSL communication if that’s what the MySQL instance uses, you can also define some timeouts and if a replication topology should be auto-detected or not.

What is also important to keep in mind is that MySQL Enterprise Monitor relies heavily on Performance Schema - make sure your databases have PS enabled, otherwise you will not benefit from a significant part of the features of MySQL Enterprise Monitor.


Once the monitored MySQL instances are configured, you can start to look at the collected data. The Overview section gives you a short summary of some of the most important metrics in MySQL. Data is aggregated and it makes it easier to find any unexpected patterns and then dig further into what happened.

Events tab gives an overview of different issues or events reported by the MySQL Enterprise Monitor and its advisors. You can click at any of the events and read what it is all about, as well as any recommended steps to take:

In this particular case it seems like some queries are doing full table scans and it is recommended to investigate it further to pinpoint such queries and see if they can be optimized.

Another example, here we see that table cache is not configured in an optimal way. You can see the explanation of the problem, advice and recommended actions to take based on this alert.


In this tab we can see data for multiple MySQL metrics that are helpful to understand the state of the system.

Timeseries Graphs

Screenshots above are just an example, there are many more graphs to look at.

It is possible to apply filtering: you can define which graphs you would like to see, you can also define what time range should be shown. On top of that, you can just mark a part of the graph and either zoom into it or open the Query Analyzer with data from that particular time:

We will go through this functionality later but in short, it allows you to analyze queries, how their performance changed in time and some example queries.

Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

Table Statistics

This tab gives us insight into table statistics: how the traffic looked like (rows fetched, inserted, updated, deleted) and how the latency looked like for all the row operations.

User Statistics

In this tab MySQL Enterprise Monitor presents data about users - statements executed, latency, table scans, I/O latency, connections, memory utilization. This data should give quite a good insight into which user is responsible for the load on the database. It might be very useful especially in the multi-user environments, where there is no one main source of the traffic.

Database File I/O

Database File I/O explains how the I/O load is distributed across the files in the database. Total number of I/O operations, latency, how many reads and writes were performed on a given file.

Memory Usage

Memory usage shows memory structures in MySQL, which help to build the better picture of the memory utilization in the database. This data can come handy in case of issues with memory - it is easy to track where the growth is the biggest and, if needed, reduce relevant settings. It can also help significantly in diagnosing potential memory leaks.

InnoDB Buffer Pool

This tab in MySQL Enterprise Monitor gives the user insight into the structure of the buffer pool utilization. Which tables are cached, how many dirty pages are there to flush?


It is extremely important for any MySQL user to understand the load that queries create. Which queries are the most problematic? How they behave in time? Performance can be measured in multiple ways but it is quite common that it is the predictable, stable performance is more important than the top performance. As long as the response time is acceptable, users will like the predictable results better than somewhat faster response (low latency), which can sometimes slow the server down significantly. That’s why it is very valuable to see how a query behaves in time and pinpoint those, which behavior is not consistent.

MySQL Enterprise Monitor definitely delivers such data. On the list of the queries, you can easily see how the latency changed in time. Flat line is good, spikes - not so much. This means such query may have to be investigated further. When you click on it, MySQL Enterprise Monitor will give you more data about it.

As you can see, there are some statistical data about the particular query type, you can also see how the latency changed in time. At the bottom you can see some example statements in time and you can compare their execution time.

When you click on one of them, you will see a full query that was executed at that moment. It can be useful in case of queries where the performance differs depending on what arguments were used in WHERE case (for example, WHERE some_column = ‘some value’ and values in that column are not distributed evenly across the rows).


In a MySQL replication environment, lag is something you have to learn to deal with. What is important is to keep the track of it - how badly are slaves lagging? How often does it happen? With this information it is possible to try and pinpoint the issue and understand better what queries are causing it. Then you can try to implement some improvements like, for example, multi-threaded replication and track if the changes improved the replication performance and reduced the lag to an acceptable level.

How is MySQL Enterprise Monitor Different from ClusterControl

As we stated, MySQL Enterprise Monitor is a part of the paid MySQL Enterprise Edition. For all users of the MySQL Community, MariaDB or Percona Server, MySQL Enterprise Edition is not available. ClusterControl provides access to monitoring of MySQL in its free Community version. In terms of server and query monitoring, there are many similarities.

ClusterControl gives you access to MySQL metrics collected and stored in the Prometheus time-series database. You can easily keep track of numerous metrics made available in ClusterControl.

ClusterControl also comes with a list of advisors, which can be used to keep track of the health and performance of the database. You can also easily create new advisors using the Developer Studio:

If you are interested in query performance, ClusterControl provides a Query Monitor for you - executed queries are collected and their performance is compared making it easy for the user to pinpoint which queries use the most of the CPU on the database.

You can see statistic data on the queries - executions, rows sent and examined, execution time. You can also check the explain plan for a particular query type.

Monitoring Polyglot Persistence

One big difference is the ability to monitor all the main variants of the MySQL ecosystem (Oracle MySQL, MariaDB and Percona Server), different clustering technologies (NDB Cluster, Group Replication, asynchronous replication and Galera Cluster), load balancers/proxies (HAProxy, Keepalived, Maxscale, ProxySQL) as well as other open source databases (PostgreSQL and MongoDB).

Automation and Management

ClusterControl also provides functionality to deploy single instances or clusters on-prem or in the cloud (AWS, GCE and Azure), as well as features like backup management, automatic failover and recovery/repair, rolling upgrades, cluster management for replication or cluster setups, scaling, etc.

That’s all for today folks. If you have worked with MySQL Enterprise Monitor and would like to add something, please do so in the comments section.

by krzysztof at February 14, 2019 10:48 AM

February 13, 2019

Peter Zaitsev

plprofiler – Getting a Handy Tool for Profiling Your PL/pgSQL Code

plprofiler postgres performance tool

PostgreSQL is emerging as the standard destination for database migrations from proprietary databases. As a consequence, there is an increase in demand for database side code migration and associated performance troubleshooting. One might be able to trace the latency to a plsql function, but explaining what happens within a function could be a difficult question. Things get messier when you know the function call is taking time, but within that function there are calls to other functions as part of its body. It is a very challenging question to identify which line inside a function—or block of code—is causing the slowness. In order to answer such questions, we need to know how much time an execution spends on each line or block of code. The plprofiler project provides great tooling and extensions to address such questions.

Demonstration of plprofiler using an example

The plprofiler source contains a sample for testing plprofiler. This sample serves two purposes. It can be used for testing the configuration of plprofiler, and it is great place to see how to do the profiling of a nested function call. Files related to this can be located inside the “examples” directory. Don’t worry—I’ll be running through the installation of plprofiler later in this article.

$ cd examples/

The example expects you to create a database with name “pgbench_plprofiler”

postgres=# CREATE DATABASE pgbench_plprofiler;

The project provides a shell script along with a source tree to test plprofiler functionality. So testing is just a matter of running the shell script.

$ ./
dropping old tables...

Running session level profiling

This profiling uses session level local-data. By default the plprofiler extension collects runtime data in per-backend hashtables (in-memory). This data is only accessible in the current session, and is lost when the session ends or the hash tables are explicitly reset. plprofiler’s run command will execute the plsql code and capture the profile information.

This is illustrated by below example,

$ plprofiler run --command "SELECT tpcb(1, 2, 3, -42)" -d pgbench_plprofiler --output tpcb-test1.html
SELECT tpcb(1, 2, 3, -42)
-- row1:
tpcb: -42
(1 rows)
SELECT 1 (0.073 seconds)

What happens during above plprofiler command run can be summarised in 3 steps:

  1. A function call with four parameters “SELECT tpcb(1, 2, 3, -42)” is presented to the plprofiler tool for execution.
  2. plprofiler establishes a connection to PostgreSQL and executes the function
  3. The tool collects the profile information captured in the local-data hash tables and generates an HTML report “tpcb-test1.html”

Global profiling

As mentioned previously, this method is useful if we want to profile the function executions in other sessions or on the entire database. During global profiling, data is captured into a shared-data hash table which is accessible for all sessions in the database. The plprofiler extension periodically copies the local-data from the individual sessions into shared hash tables, to make the statistics available to other sessions. See the

plprofiler monitor
  command, below, for details. This data still relies on the local database system catalog to resolve Oid values into object definitions.

In this example, the plprofiler tool will be running in monitor mode for a duration of 60 seconds. Every 10 seconds, the tool copies data from local-data to shared-data.

$ plprofiler monitor --interval=10 --duration=60 -d pgbench_plprofiler
monitoring for 60 seconds ...

For testing purposes you can start executing a few functions at the same time.

Once the data is captured into shared-data, we can generate a report. For example:

$ plprofiler report --from-shared --title=MultipgMax --output=MultipgMax.html -d pgbench_plprofiler

The data in shared-data will be retained until it’s explicitly cleared using the

plprofiler reset

$ plprofiler reset

If there is no profile data present in the shared hash tables, execution of the report will result in error message.

$ plprofiler report --from-shared --title=MultipgMax --output=MultipgMax.html
Traceback (most recent call last):
File "/usr/bin/plprofiler", line 11, in <module>
load_entry_point('plprofiler==4.dev0', 'console_scripts', 'plprofiler')()
File "/usr/lib/python2.7/site-packages/plprofiler-4.dev0-py2.7.egg/plprofiler/", line 67, in main
return report_command(sys.argv[2:])
File "/usr/lib/python2.7/site-packages/plprofiler-4.dev0-py2.7.egg/plprofiler/", line 493, in report_command
report_data = plp.get_shared_report_data(opt_name, opt_top, args)
File "/usr/lib/python2.7/site-packages/plprofiler-4.dev0-py2.7.egg/plprofiler/", line 555, in get_shared_report_data
raise Exception("No profiling data found")
Exception: No profiling data found

Report on profile information

The HTML report generated by plprofiler is a self-contained HTML document and it gives detailed information about the PL/pgSQL function execution. There will be a clickable FlameGraph at the top of the report with details about functions in the profile. The plprofiler FlameGraph is based on the actual Wall-Clock time spent in the PL/pgSQL functions. By default, plprofiler provides details on the top ten functions, based on their self_time (total_time – children_time).

This section of the report is followed by tabular representation of function calls. For example:

This gives a lot of detailed information such as execution counts and time spend against each line of code.

Binary Packages

Binary distributions of plprofiler are not common. However the BigSQL project provides plprofiler packages as an easy to use bundle. Such ready-to-use packages are one of the reasons for BigSQL to remain as one of the most developer friendly PostgreSQL distributions. The first screen of Package manager installation of BigSQL provided me with the information I am looking for:

Appears that there was a recent release of BigSQL packages and plprofiler is an updated package within that.

Installation and configuration is made simple:

$ ./pgc install plprofiler-pg11
File is already downloaded.
Unpacking plprofiler-pg11-3.3-1-linux64.tar.bz2
Updating postgresql.conf file:
old: #shared_preload_libraries = '' # (change requires restart)
new: shared_preload_libraries = 'plprofiler'

As we can see, even PostgreSQL parameters are updated to have plprofiler as a

 .  If need to use plprofiler for investigating code, these binary packages from the BigSQL project are my first preference because everything is ready to use. Definitely, this is developer-friendly.

Creation of extension and configuring the plprofiler tool

At the database level, we should create the plprofiler extension to profile the function execution. This step needs to be performed in both cases, whether we want global profiling where share_preload_libraries are set, or at session level where that is not required

postgres=# create extension plprofiler;

plprofiler is not just an extension, but comes with tooling to invoke profiling or to generate reports. These scripts are primarily coded in Python and uses psycopg2 to connect to PostgreSQL. The python code is located inside the “python-plprofiler” directory of the source tree. There are a few perl dependencies too which will be resolved as part of installation

sudo yum install python-setuptools.noarch
sudo yum install python-psycopg2
cd python-plprofiler/
sudo python ./ install

Building from source

If you already have a PostgreSQL instance running using binaries from PGDG repository OR you want to wet your hands by building everything from source, then installation needs a different approach. I have PostgreSQL 11 already running on the system. The first step is to get the corresponding development packages which have all the header files and libraries to support a build from source. Obviously this is the thorough way of getting plprofiler working.

$ sudo yum install postgresql11-devel

We need to have build tools, and since the core of plprofiler is C code, we have to install a C compiler and make utility.

$ sudo yum install gcc make

Preferably, we should build plprofiler using the same OS user that runs PostgreSQL server, which is “postgres” in most environments. Please make sure that all PostgreSQL binaries are available in the path and that you are able to execute the pg_config, which lists out build related information:

$ pg_config
BINDIR = /usr/pgsql-11/bin
INCLUDEDIR = /usr/pgsql-11/include
PKGINCLUDEDIR = /usr/pgsql-11/include
INCLUDEDIR-SERVER = /usr/pgsql-11/include/server
LIBDIR = /usr/pgsql-11/lib
PKGLIBDIR = /usr/pgsql-11/lib
LOCALEDIR = /usr/pgsql-11/share/locale
MANDIR = /usr/pgsql-11/share/man
SHAREDIR = /usr/pgsql-11/share
SYSCONFDIR = /etc/sysconfig/pgsql
PGXS = /usr/pgsql-11/lib/pgxs/src/makefiles/
CONFIGURE = '--enable-rpath' '--prefix=/usr/pgsql-11' '--includedir=/usr/pgsql-11/include' '--mandir=/usr/pgsql-11/share/man' '--datadir=/usr/pgsql-11/share' '--with-icu' 'CLANG=/opt/rh/llvm-toolset-7/root/usr/bin/clang' 'LLVM_CONFIG=/usr/lib64/llvm5.0/bin/llvm-config' '--with-llvm' '--with-perl' '--with-python' '--with-tcl' '--with-tclconfig=/usr/lib64' '--with-openssl' '--with-pam' '--with-gssapi' '--with-includes=/usr/include' '--with-libraries=/usr/lib64' '--enable-nls' '--enable-dtrace' '--with-uuid=e2fs' '--with-libxml' '--with-libxslt' '--with-ldap' '--with-selinux' '--with-systemd' '--with-system-tzdata=/usr/share/zoneinfo' '--sysconfdir=/etc/sysconfig/pgsql' '--docdir=/usr/pgsql-11/doc' '--htmldir=/usr/pgsql-11/doc/html' 'CFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' 'LDFLAGS=-Wl,--as-needed' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'
CC = gcc
VERSION = PostgreSQL 11.1

Now we’re ready to get the source code and build it. You should be able to checkout the git repository for plprofiler.

$ git clone
Cloning into 'plprofiler'...

Building against PostgreSQL 11 binaries from PGDG can be a bit complicated because of th JIT feature. Configuration flag

  will be enabled. So we may have to have LLVM present in the system as detailed in my previous blog about JIT in PostgreSQL11

Once we’re ready, we can move to the plprofiler directory and build it:

$ cd plprofiler
$ make USE_PGXS=1
--- Output ----
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fPIC -I. -I./ -I/usr/pgsql-11/include/server -I/usr/pgsql-11/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include -c -o plprofiler.o plprofiler.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fPIC -shared -o plprofiler.o -L/usr/pgsql-11/lib -Wl,--as-needed -L/usr/lib64/llvm5.0/lib -L/usr/lib64 -Wl,--as-needed -Wl,-rpath,'/usr/pgsql-11/lib',--enable-new-dtags
/opt/rh/llvm-toolset-7/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/usr/pgsql-11/include/server -I/usr/pgsql-11/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include -flto=thin -emit-llvm -c -o plprofiler.bc plprofiler.c

Now we should be able to install this extension:

$ sudo make USE_PGXS=1 install
--- Output ----
/usr/bin/mkdir -p '/usr/pgsql-11/lib'
/usr/bin/mkdir -p '/usr/pgsql-11/share/extension'
/usr/bin/mkdir -p '/usr/pgsql-11/share/extension'
/usr/bin/install -c -m 755 '/usr/pgsql-11/lib/'
/usr/bin/install -c -m 644 .//plprofiler.control '/usr/pgsql-11/share/extension/'
/usr/bin/install -c -m 644 .//plprofiler--1.0--2.0.sql .//plprofiler--2.0--3.0.sql .//plprofiler--3.0.sql '/usr/pgsql-11/share/extension/'
/usr/bin/mkdir -p '/usr/pgsql-11/lib/bitcode/plprofiler'
/usr/bin/mkdir -p '/usr/pgsql-11/lib/bitcode'/plprofiler/
/usr/bin/install -c -m 644 plprofiler.bc '/usr/pgsql-11/lib/bitcode'/plprofiler/./

The above command expects all build tools to be in the proper path even with sudo.

Profiling external sessions

To profile a function executed by another session, or all other sessions, we should load the libraries at global level. In production environments, that will be the case. This can be done by adding the extension library to the

  specification. You won’t need this if you only want to profile functions executed within your session. Session level profiling is generally possible only in Dev/Test environments.

To enable global profiling, verify the current value of

  and add plprofiler to the list.

postgres=# show shared_preload_libraries ;
(1 row)
postgres=# alter system set shared_preload_libraries = 'plprofiler';

This change requires us to restart the PostgreSQL server

$ sudo systemctl restart postgresql-11

After the restart, it’s a good idea to verify the parameter change

postgres=# show shared_preload_libraries ;
(1 row)

From this point onwards, the steps are same as those for the binary package setup discussed above.


plprofiler is a wonderful tool for developers. I keep seeing many users who are in real need of it. Hopefully this blog post will help those who never tried it.

by Jobin Augustine at February 13, 2019 06:20 PM

Jean-Jerome Schmidt

Basic Administration Comparison Between Oracle, MSSQL, MySQL, PostgreSQL

The introduction of DevOps in organizations has changed the development process and also introduced some new challenges. In addition, developers and DevOps teams, along with their own chosen programming languages, also have their favorite database systems.

The product life cycle is getting shorter each year so developers want to be able to develop fast, using technologies they know best.

Having multiple RDBMS database backends means your organization will become more agile on the development side, but it also imposes additional knowledge on the operation teams.

Extending your infrastructure from one to many databases implies you have to also monitor, manage and scale them.

As every storage backend excels at different use cases, this also means you have to reinvent the wheel for every one of them.

Knowing the similarities and key differences will help you to immerse into different flavors of RDBMS.

In this article we will go through the following points:

  • A brief introduction to the platform
    • Oracle, MSSQL, MySQL , PostgreSQL
  • Platform support
  • Installation process
  • Database access
  • Backup process
  • Controlling query execution
  • Security
  • Replication options
  • Community support

A brief introduction to the platform

PostgreSQL is for many recognized as the world's most advanced open source database. It is a fully open source database system released under its own license, the PostgreSQL License, comparable to the MIT or BSD licenses. The PostgreSQL community is active and continuously improving existing and new features. As per the DB-engine popularity rank, PostgreSQL was the DBMS of the year 2017 and 2018. The DB-Engines popularity shows that the trend didn’t change over the years.

An interesting fact is that PostgreSQL didn’t support SQL until 1994. The QUEL language was used to query data from it. SQL support was added later on.

PostgreSQL has many advanced features that other enterprise database management systems offer, such as such as views, stored procedures, indexes, and triggers in addition to the primary key, foreign key and atomicity features.

PostgreSQL can be extended by users by modifying existing features, adding new features and distributed freely as it is open-source. It runs on major platforms such as UNIX, MacOS, Windows, and Linux etc. It supports video, text, audio, images, programming interfaces for different languages. The list of supported languages includes C/C++, Java, Python, Perl etc.

Oracle is one of the largest vendors of RDBMS (relational database management system) in the IT world. It is known as an Oracle database, Oracle DB or Oracle marketed by Oracle.

Oracle Database is being used by many companies in the IT industry for transaction processing, business analytics, business intelligence application purpose, etc..

Oracle has a long and very interesting history:

On 16th June 1977 Software Development Laboratories (SDL) was created in Santa Clara, California by Larry Ellison, Bob Miner, and Ed Oates. In 1977 Oracle took its name from the CIA project codename and the irst commercialized Oracle RDBMS is shown to the world in 1979.

Oracle database is available in different editions such as Enterprise edition Standard edition, Express edition, and Oracle Lite. The biggest competitor for Oracle database is the Microsoft SQL server.

Microsoft SQL Server is a very popular RDBMS with restrictive licensing and modest cost of ownership if the database is of significant size, or is used by a significant number of clients.

It's one of the three market-leading database technologies, along with Oracle Database and IBM's DB2.

It provides a very user-friendly interface and easy to learn, which has resulted in a large installed user base.

Like other RDBMS software, Microsoft SQL Server is built on top of SQL, a standardized programming language that database administrators (DBAs) and other IT professionals use to manage databases and query the data they contain. SQL Server is tied to Transact-SQL (T-SQL), an implementation of SQL from Microsoft that adds a set of proprietary programming extensions to the standard language.


MySQL is an Oracle-backed open source relational database management system based on SQL.

Originally conceived by the Swedish company MySQL AB, MySQL was acquired by Sun Microsystems in 2008 and then by Oracle when it bought Sun in 2010.

Developers can use MySQL under the GNU General Public License (GPL). The Enterprise version comes with support and additional features for security and high availability.

It's the second most popular database in the world according to db-engines ranking and probably the most present database backend on the planet as it runs most of the internet services around the globe. MySQL runs on virtually all platforms, including Linux, UNIX, and Windows.

MySQL is an important component of an open source enterprise stack called LAMP.

LAMP is a web development platform that uses Linux as the operating system, Apache as the web server, MySQL as the relational database management system and PHP as the object-oriented scripting language.

Platform support


The most popular version of Oracle DB, Oracle 12c is a truly enterprise RDBMS system which is supported on a variety of operating systems and platforms. Oracle dominates the database world in part because it runs on dozens of platforms, everything from a Mainframe, Sparc, Mac to Intel. The list includes following OS and architecture combinations: Linux on x86-64 (only Red Hat Enterprise Linux, Oracle Linux, and SUSE distributions are supported) Microsoft Windows on x86-64. Oracle Solaris on SPARC and x86-64. IBM AIX on POWER Systems. Linux on IBM zEnterprise Systems HP-UX on Itanium.


Being a Microsoft product, SQL was designed to be very much compatible with Windows OS. On November 16, 2016, Microsoft announced the beginning of a new story: SQL Server is now supported on Linux and Docker. Hell freezes over!


MYSQL carries out smoother execution on all platforms like Microsoft, UNIX, Linux, Mac etc.


In general, PostgreSQL can be expected to work on various (even exotic) CPU architectures and operating systems.

It includes CPU architectures like x86, x86_64, IA64, PowerPC, PowerPC 64, S/390, S/390x, Sparc, Sparc 64, Alpha, ARM, MIPS, MIPSEL, M68K, and PA-RISC. It is often possible to build on an unsupported CPU type by configuring with --disable-spinlocks, but performance will be poor.

PostgreSQL can be expected to work on the following operating systems: Linux (all recent distributions), Windows (Win2000 SP4 and later), FreeBSD, OpenBSD, NetBSD, Mac OS X, AIX, HP/UX, IRIX, Solaris, Tru64 Unix, and UnixWare.

Installation Process


From all four presented databases systems, Oracle has the most complex system requirements which comes with a complex installation process. On both Windows and Linux based platforms Oracle uses a dedicated Oracle Universal Installer (OUI) tool as the main installation process. The OUI is used to install the Oracle Database software. OUI is a graphical user interface utility that enables you to:

  • View the Oracle software that is installed on your machine
  • Install new Oracle Database software
  • Delete Oracle software that is no longer required.

During the installation process, OUI will start the Oracle Database Configuration Assistant (DBCA) which can install a pre-created default database that contains example schemas or can guide you through the process of creating and configuring a customized database.

Oracle OUI - installation interface
Oracle OUI - installation interface

If you do not create a database during installation, you can invoke DBCA after you have installed the software, to create one or more databases.


Beginning with SQL Server 2016 (13.x), SQL Server is only available as a 64-bit application.

Installation happens via the Installation Wizard, a command prompt, or through sysprep tool.

The Installation Wizard runs the SQL Server Installation Center. To create a new installation of SQL Server, select the Installation option on the left side, and then click New SQL Server stand-alone installation or add features to an existing installation.

The Linux based installation is very similar to the open source database installation method. It supports packaging for Debian and RedHat based systems. The steps consist of repository configuration, package installation and post-installation configuration, quite similar to MySQL. The whole process is greatly described in the following article.

MSSQL Installation Wizard
MSSQL Installation Wizard


Oracle provides a set of binary distributions of MySQL. These include generic binary distributions in the form of compressed tar files (files with a .tar.gz extension) for a number of platforms, and binaries in platform-specific packages. On the Windows platform, the installation process is triggered by the standard installation wizard via GUI.


PostgreSQL is available in a majority of Linux distributions so it’s very likely you can install it through a simple yum or apt-get command. For the HA configuration, you can use the ClusterControl s9s tool or GUI. S9S tools can help you to create a PostgreSQL cluster with just one single line command:

$ s9s cluster \
--create \
--cluster-type=postgresql \
--nodes=";;" \
--provider-version='11' \
--db-admin='postgres' \
--db-admin-passwd='s3cr3tP455' \
--os-user=root \
--os-key-file=/root/.ssh/id_rsa \
--cluster-name='PostgreSQL 11 Streaming Replication' \
Creating PostgreSQL Cluster
\ Job 259 RUNNING    [█▋        ]  15% Installing helper packages

For more information, check this blog.

Access to the database and DB creation


Oracle separates the process of the binary and database creation. Unlike other popular database systems, database creation involves much more steps.

The Database Configuration Assistant (DBCA) is the preferred way to create a database because it can do it in a much more automated approach. DBCA can be launched by the Oracle Universal Installer (OUI), depending on the type of install that you select. You can also launch DBCA as a standalone tool at any time after Oracle Database Installation.

You can run DBCA in interactive mode or non-interactive/silent mode. Interactive mode provides a graphical interface and guided workflow for creating and configuring a database. Non-interactive/silent mode enables you to script the database creation. You can run DBCA in non-interactive/silent mode by specifying command-line arguments, a response file or both.

Oracle DBCA - database creation
Oracle DBCA - database creation

When a database is created you can access it with a dedicated client called sqlplus. SQL*Plus is a terminal client program with which you can access Oracle Database.


SQL Server Management Studio (SSMS) is the main tool for administering the Database Engine and writing Transact-SQL code. SSMS is available as a free download from the Microsoft Download Center. The latest version can be used with older versions of the Database Engine.

Management Studio is a preferred method to create a new database. To create a database in Microsoft SQL Server, connect to the computer where Microsoft SQL Server is installed using an administrator account.
Start Microsoft SQL Server Management Studio and choose to create a database option. The wizard process will walk you through the process. If you prefer command line this can be done with CREATE DATABASE syntax.


In order to access your MySQL database use mysql client. The database creation is as simple as CREATE DATABASE <name>.


PostgreSQL database has the option for multiple ‘schemas’ which operate similarly to databases in MySQL.

Schemas contain the tables, indexes, etc, and can be accessed simultaneously by the same connection to the database that houses them. Access methods for PostgreSQL are defined in a file: pg_hba.conf. It can be located in various places. On Ubuntu 14.04 it is located in /etc/postgresql/9.3/main/pg_hba.conf, on Centos 7 on the other hand it’s located by default in /var/lib/pgsql/data/pg_hba.conf.

Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

Backup process


Oracle has the most complex, dedicated built-in backup tool of all four servers described here; it’s called Recovery Manager (RMAN).

RMAN allows you to run sophisticated backup policies and selective restores. The same operations usually require a lot of manual steps in other RDBMS.

We can take backups in two ways:

  • disabling the database and copying physical files (so-called cold backup)
  • using RMAN and make a backup without disabling the database (hot backup)

To make a hot backup, set the base in ARCHIVELOG mode. This will tell Oracle to not keep the copy of redo log files as an archivelogs.


In the MS SQL world, you can use the built-in T-SQL commands to backup and restore databases. There is no need to use tools like mysqlhotcopy and mysqldump.

MS SQL Server offers three different online backup strategies:

  • Simple Recovery Model (ALTER DATABASE dbname SET RECOVERY SIMPLE)
  • Full Recovery Model (ALTER DATABASE dbname SET RECOVERY FULL)
  • Bulk-Logged Recovery Model (ALTER DATABASE dbname SET RECOVERY BULK_LOGGED)

The recommended model is the full recovery if no data loss is acceptable. This mode is similar to the MySQL feature when the binary log is enabled. You can recover the database to any point of time, but you should regularly back up the transaction log as well as the database.

The bulk-logged model can be used for large bulk operations such as importing data or creating indexes on big tables. It’s rather less common method to run a database, especially production. It does not support point-in-time recovery so it is generally used as a temporary solution.

The Simple model is useful when the database is rarely updated or for testing and development purposes. In SIMPLE mode, the transaction log of the database is cut each time after the transaction is completed. In the other modes, the log is truncated via CHECKPOINT statement or after the transaction backup file. In case the database is damaged, only the most recent backup can be recovered and all changes since this backup are lost.


Two most popular backup utilities are available for MySQL and MariaDB, namely mysqldump logical backup and binary backup Percona XtraBackup and MariaBackup (a fork of Percona XtraBackup). MySQL Enterprise version offers also mysqlbackup which is similar to XtraBackup and MariaBackup hot backup tools.


Most DBMS's provide some built-in backup tools. PostgreSQL has pg_dump and pg_dumpall out of the box. However, you may want to use some other tools for your production databases. More information can be found in the top backup tools for PostgreSQL article.

Controlling Query execution and concurrency support


In Oracle, all the database objects are grouped by schemas. Schemas are collection of database objects and all the database objects are shared among all schemas and users. It can be translated to MySQL databases. Even though it is all shared, each user can be limited to certain schemas and tables via roles and permissions. This concept is quite similar to MySQL databases.Hi


MS SQL Server organizes all objects, such as tables, views, and procedures, by database names. Users are assigned to a log in, which is granted access to the specific database and its objects. Also, in SQL Server each database has a private, unshared disk file on the server.


MySQL only has MVCC support in InnoDB. It is a storage engine and by default is available in MySQL. It also provides ACID-complaint features like foreign key support and transaction handling. By default, each query is treated as a separate transaction, which is a different approach than in Oracle DB.


Postgres engine performs concurrency control by using a method called MVCC (Multiversion Concurrency Control). For every user connected to the database, the Postgres database gives a snapshot of the database at a particular instance. When the database must to update an item, it will add the newer version and point the old version as obsolete. It allows the database to save overhead but requires a regulated sweep to delete the old, outdated data.



Security features are great, the system provides multi-layered security including controls to evaluate risks, prevent unauthorized data disclosure, detect and report on database activities and enforce data access controls.


Security features are modest, the RDBMS offers fewer features than Oracle but still much more than Open Source database systems.


MySQL implements security based on Access Control Lists (ACLs) for all connections, queries, and other operations that a user may attempt to perform. There is also some support for SSL-encrypted connections between MySQL clients and servers.


PostgreSQL has ROLES and inherited roles to set and maintain permissions. PostgreSQL has native SSL support for connections to encrypt client/server communications. It also has Row Level Security.
In addition to this, PostgreSQL comes with a built-in enhancement called SE-PostgreSQL which provides additional access controls based on SELinux security policy. More details here.

Community Support


Oracle database, similarly to MySQL, has a large community, mostly organized around and passionate groups in any locations around the world like for example The paid support gives you access to the support group previously known as metalink, not


Compared to other database systems, MSSQL probably has the least organized community groups but still very active. Microsoft does a great job in promoting its products in the universities. This gives young developers, devops and DBAs easy access to the technology (free licenses) and any necessary materials.


MySQL has a large community of contributors who, particularly following the acquisition by Oracle, focus mainly on maintaining existing features with some new features emerging occasionally. The advantage over other open source databases is a very strong external vendor eco-system. Companies like MariaDB and Percona not only offer great support but also contribute by adding enterprise features into their open source versions.


PostgreSQL has a very strong and active community. Its community improves existing features while its innovative committers strive to ensure it remains the most advanced database with new features and security, limiting the distance between Oracle and MSSQL databases. PostgreSQL is known for having more features than other RDBMS on the market.

Replication options


Oracle offers logical and physical replication through a built-in Oracle Data Guard. It is an enterprise feature.
Data Guard is a Ship Redo / Apply Redo technology, "redo" is the information needed to recover transactions.

A production database referred to as a primary database broadcasts redo to one or more replicas referred to as standby databases. When an insert or update is made to a table, this change is captured by the log writer into an archive log, and replicated to the standby system.

Standby databases are in a continuous phase of recovery, verifying and applying redo to maintain synchronization with the primary database. A standby database will also automatically re-synchronize if it becomes temporarily disconnected to the primary database due to power outages, network problems, etc.

For more flexible replication options like multisource, selective replication you should consider an extra paid tool, Oracle Golden Gate.


Microsoft SQL Server provides the following types of replication for use in distributed applications:

  • Transactional replication
  • Merge replication
  • Snapshot replication

It can be greatly extended with Microsoft Integration Services, giving you an option to customize the replication flow out of the box.


PostgreSQL has several options available, each with its own pros and cons, depending on what is needed through replication. The build options are based on Write Ahead Log. Files are shipped to a standby server where they are read and replayed, or Streaming Replication, where a read-only standby server fetches transaction logs over a database connection to replay them. In the case of a more sophisticated replication architecture, you would probably like to check Slony (master to multiple slaves) or Bucardo (multimaster).


MySQL Replication is probably the most popular high availability solution for MySQL,
and widely used by top web services.

It is easy to set up but ongoing maintenance like software upgrades, schema changes, topology changes, failover and recovery have always been tricky.

MySQL replication does not require any third party tools, both master-slave and multimaster can be done out of the box.

The recent versions of MySQL added multi source replication and Global transaction id which make it even more reliable and easier to maintain.


Priority databases like Oracle and MSSQL offer robust management systems and fine support. Among the long list of supported features, users can get the reassuring feeling of access to enterprise support and paid knowledge systems.

On the other side, the cost of the license, not that big of a feature gap and enterprise plugins, will make you eager to shift to the open source decision easier than ever.

Using predefined processes and automation can not only save you time but also protect you from common mistakes.

A management platform that systematically addresses all the different aspects of the database lifecycle will be more robust than patching together a number of point solutions.

by Bart Oles at February 13, 2019 10:48 AM

February 12, 2019

Peter Zaitsev

Debugging MariaDB Galera Cluster SST Problems – A Tale of a Funny Experience

MariaDB galera cluster starting time

MariaDB galera cluster starting timeRecently, I had to work on an emergency for a customer who was having a problem restarting a MariaDB Galera Cluster. After a failure in the cluster they decided to restart the cluster entirely following the right path: bootstrapping the first node, and then adding the rest of the members, one by one. Everything went fine until one of the nodes rejected the request to join the cluster.

Given this problem, the customer decided to ask us to help with the problematic node because none of the tests they did worked, and the same symptom repeated over and over: SST started, copied few gigs of data and then it just hanged (apparently) while the node remained out of the cluster.

Identifying the issue…

Once onboard with the issue, initially I just checked that the cluster was trying a SST: given the whole dataset was about 31GB I decided to go directly to a healthy solution: to clean up the whole datadir and start afresh. No luck at all, the symptom was exactly the same no matter what I tried:

After reviewing the logs I noticed few strange things. In the joiner:

2019-01-29 16:14:41 139996474869504 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (18153472-f958-11e8-ba63-fae8ac6c22f8): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2019-01-29 16:14:41 139996262553344 [Note] WSREP: Member 3.0 (node1) requested state transfer from '*any*'. Selected 0.0 (node3)(SYNCED) as donor.
2019-01-29 16:14:41 139996262553344 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 4902465)
2019-01-29 16:14:41 139996474869504 [Note] WSREP: Requesting state transfer: success, donor: 0
2019-01-29 16:14:41 139996474869504 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 18153472-f958-11e8-ba63-fae8ac6c22f8:4902465
2019-01-29 16:14:42 139996270946048 [Note] WSREP: (9864c6ca, 'tcp://') connection to peer 9864c6ca with addr tcp:// timed out, no messages seen in PT3S
2019-01-29 16:14:42 139996270946048 [Note] WSREP: (9864c6ca, 'tcp://') turning message relay requesting off
2019-01-29 16:16:08 139996254160640 [ERROR] WSREP: Process was aborted.
2019-01-29 16:16:08 139996254160640 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '' --datadir '/var/lib/mysql/' --parent '8725' --binlog '/var/log/mysql/mariadb-bin' --binlog-index '/var/log/mysql/mariadb-bin.index': 2 (No such file or directory)

In the donor (output has been a obfuscated to avoid sharing private info and the times are not matching deliberately):

Jan 29 18:08:22 node3 -innobackupex-backup: 190129 18:08:22 >> log scanned up to (203524317205)
Jan 29 18:08:23 node3 -innobackupex-backup: 190129 18:08:23 >> log scanned up to (203524318337)
Jan 29 18:08:24 node3 -innobackupex-backup: 190129 18:08:24 >> log scanned up to (203524320436)
Jan 29 18:08:25 node3 -innobackupex-backup: 190129 18:08:25 >> log scanned up to (203524322720)
Jan 29 18:08:25 node3 nrpe[25546]: Error: Request packet type/version was invalid!
Jan 29 18:08:25 node3 nrpe[25546]: Client request was invalid, bailing out...
Jan 29 18:08:26 node3 -innobackupex-backup: 190129 18:08:26 >> log scanned up to (203524322720)
Jan 29 18:08:27 node3 -innobackupex-backup: 190129 18:08:27 >> log scanned up to (203524323538)
Jan 29 18:08:28 node3 -innobackupex-backup: 190129 18:08:28 >> log scanned up to (203524324667)
Jan 29 18:08:29 node3 -innobackupex-backup: 190129 18:08:29 >> log scanned up to (203524325358)
Jan 29 18:08:30 node3 -wsrep-sst-donor: 2019/01/29 18:08:30 socat[22843] E write(6, 0x1579220, 8192): Broken pipe
Jan 29 18:08:30 node3 -innobackupex-backup: innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
Jan 29 18:08:30 node3 -innobackupex-backup: xb_stream_write_data() failed.
Jan 29 18:08:30 node3 -innobackupex-backup: innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
Jan 29 18:08:30 node3 -innobackupex-backup: [01] xtrabackup: Error: xtrabackup_copy_datafile() failed.
Jan 29 18:08:30 node3 -innobackupex-backup: [01] xtrabackup: Error: failed to copy datafile.
Jan 29 18:08:30 node3 mysqld[27345]: 2019-01-29 18:08:30 140562136139520 [Warning] Aborted connection 422963 to db: 'unconnected' user: 'sstuser' host: 'localhost' (Got an error reading communication packets)
Jan 29 18:08:30 node3 -wsrep-sst-donor: innobackupex finished with error: 1. Check /var/lib/mysql//innobackup.backup.log
Jan 29 18:08:30 node3 -wsrep-sst-donor: Cleanup after exit with status:22

So SST starts correctly and then failed. I tried forcing different donors, checked firewall rules, etc. Nothing.

Additionally I noticed that the process was starting over and over, while monitoring,  .ssh folder was growing up to certain size (something around 7GB) and then would start over. The logs kept showing the same messages, the init script failed with an error but the process kept running either until I executed service mysql stop or kill -9 to all processes. It was getting stranger every minute.

At this point I was totally lost, scratching my head looking for solutions. More strange still was that trying a manual SST using netcat worked perfectly! So it was definitely a problem with the init script. Systemd journal was not providing any insight…

And then…

MariaDB Cluster dies in the SST process after 90 seconds

Suddenly I noticed that the failure was happening roughly 90 seconds after the start. A short googling later—doing more specific search—I found this page: which explained precisely my problem.

The MariaDB init script has changed its timeout from 900 seconds to 90 while MySQL Community and Percona Server has this value set to 15 mins. Also it seems that this change has caused some major issues with nodes crashing as documented in MDEV-15607 — the bug is reported as fixed but we still can see timeout problems.

I observed that in case of failure, systemd was killing the mysqld process but not stopping the service. This results in an infinite SST loop that only stops when the service is killed or stopped via systemd command.

The fix was super easy, I just needed to create a file to be used by systemd that sets the timeout to a more useful value as follows:

sudo tee /etc/systemd/system/mariadb.service.d/timeoutstartsec.conf <<EOF
sudo systemctl daemon-reload

As you may notice I set the timeout to 15 minutes but I could set it to any time. That was it, the next SST will have plenty of time to finish. Reference to this change is very well documented here

On reflection…

One could argue about this change, and I’m still having some internal discussions about it. In my opinion, a 90 seconds timeout is too short for a Galera cluster. It is very likely that almost any cluster will reach that timeout during SST. Even a regular MySQL server that suffers a crash with a high proportion of dirty pages or many operations to rollback, 90 seconds doesn’t seem to be an feasible time for crash recovery. Why the developers changed it to such a short timeout I have no idea. Luckily, it is very easy to fix now I know the reason.

Photo by Tim Gouw on Unsplash

by Francisco Bordenave at February 12, 2019 01:25 PM

February 11, 2019

MariaDB Foundation

MariaDB 10.2.22 now available

The MariaDB Foundation is pleased to announce the availability of MariaDB 10.2.22, the latest stable release in the MariaDB 10.2 series. See the release notes and changelogs for details. Download MariaDB 10.2.22 Release Notes Changelog What is MariaDB 10.2? MariaDB APT and YUM Repository Configuration Generator Contributors to MariaDB 10.2.22 Alexander Barkov (MariaDB Corporation) Alexander […]

The post MariaDB 10.2.22 now available appeared first on

by Ian Gilfillan at February 11, 2019 04:52 PM

Peter Zaitsev

Compression Options in MySQL (Part 2)

Swiss cheese File system

In one of my previous posts, I started a series on data compression options with MySQL. The first post focused on the more traditional compression options like InnoDB Barracuda page compression and MyISAM packing. With this second part, I’ll discuss a newer compression option, InnoDB transparent page compression with punch holes available since 5.7. First, I’ll describe the transparent page compression method and how it works. Then I’ll present similar results as in the first post.

InnoDB transparent page compression

Before we can discuss transparent page compression, we must understand how InnoDB accesses its data pages. To access an InnoDB page, you need to know the tablespace (the file) and the offset of the page within the tablespace file. The offset is the tough part with data compression. If you just compress pages and concatenate them one after the other, the offsets will no longer be at known intervals. InnoDB Barracuda page compression solves the problem by asking the DBA to guess the compression ratio of the pages with the compressed block size setting. For example, you have to tell InnoDB to use a disk block size of 8KB if you think the compression ratio will be around 2. Transparent page compression uses another approach, sparse files.

Sparse files 101

A sparse file is a file with holes in it. Even though a sparse file may be very large, if there are a lot of holes in it, it may end up using a small amount of storage. On almost every Linux system, the /var/log/lastlog file is sparse:

yves@ThinkPad-P51:/var/log$ ls -lah lastlog
-rw-rw-r-- 1 root utmp 18M jan 5 16:09 lastlog
yves@ThinkPad-P51:/var/log$ du -hs lastlog
56K lastlog

While the ls command reports an apparent size of 18MB, the du command tells us the file actually uses only 56KB. Most of the space in the file is actually unallocated. When you access a sparse file, the filesystem has to map the actual physical offsets in the file with the logical offsets seen by the application. A logical offset is no longer directly the number of bytes since the beginning of the file.

Now that we understand a bit what sparse files are, let’s talk about the punch hole aspect. When you write something to disk, you can use the fallocate call to free up, punch, part of it. The freed/punched portion is thus a hole in the file, and the filesystem can later reuse the hole to store something else. Let’s follow a simplified view of the steps required to write a transparently compressed InnoDB page.

InnoDB using sparse files

Figure 1: InnoDB Transparent page compression

In figure 1, an in memory 16KB InnoDB page with 14KB of data is going to be written to disk. As part of the write process, the data is compressed to 6KB and the page is written to the disk. Once written, InnoDB uses the fallocate call to release the 10KB of unused space. Since only full blocks are release,  only 8KB is really freed. The remaining space unreleased space (2KB) is just zeroed. The freed space will be reused, either for the same file or by another one. For simplicity, let’s assume the space is reused by the same InnoDB file.

Figure 2: File system layout

If there is no immediate reuse, a portion of the InnoDB file will look like the top file layout of figure 2. The pages (numbers) are still sequentially laid out but there are holes in between. As the file system gets full, it will start to reuse the freed space so eventually, the file layout will look like the bottom one. If you notice, in the bottom layout, the pages are no longer in sequential order. There are consequences to that: the notion of disk sequential access is gone. The most stunning example is a simple file copy on a spinning device. While copying a 1GB regular file may take only 30 seconds, the copy of a 1GB sparse file can take much longer, up to 30 minutes in the worst cases. The impact on physical backup tools, like Percona Xtrabackup, are thus important. Normally physical backups are much faster than logical ones (ex: mysqldump), but with sparse files, it may no longer be true.

MySQL impacts

There are also consequences of the use of sparse files on the design of a MySQL database server. The added random operations increase the importance of using SSD/Flash based storage. Also some settings must be considered with a different perspective:

  • innodb_flush_neighbors should be 0 since 1 is a cheat geared toward sequential operations
  • innodb_read_ahead_threshold, normally set to 56, this means when 56 pages of an extent have been scanned, the next extent is read sequentially ahead of time. To be really useful, the next extent should be read before the remaining 8 pages of the current extent are read. Since sequential operations are slower, maybe this value should be lowered a little. The drawback is an increased possibility of a read ahead without use.
  • innodb_random_read_ahead is a wilder setting, it would be a good idea to experiment with this for your workload

There are likely to be other affected settings.

Review of the test procedure

Just to refresh memories, I am using two datasets for the basic benchmarks. The first, Wikipedia, consists in about 1B rows of Wikipedia access logs. It is moderately compressible. The second dataset, o1543, is from the defunct Percona cloud tool project. It has only 77M rows but they are much wider with 134 columns. The o1543 dataset is highly compressible.

On these two datasets, the following steps were executed:

  1. insert the rows: record time, final size and amount of data written
  2. large range select, record the time
  3. 20k updates, record the time to and total bytes written


Final sizes

Figure 3, Innodb transparent page compression final sizes

One of the most critical metrics with compression is the final dataset size, as shown in figure 3. The possibility to use larger InnoDB pages is a big thing with transparent page compression. Larger pages allow for more repetitive patterns to be present within a page, and that improves the compression ratio. Results using page sizes of 16KB, 32KB and 64KB are shown. The uncompressed results are used as references, transparent compression (TC) using Lz4 and Zlib are the actual compressed datasets. First, we see that larger page sizes barely affect the size of the uncompressed dataset (I16, I32 and I64). Since the datasets were inserted in primary key order, the only possible impact is the filling factor of the pages. When InnoDB fills a page in PK order, even when the innodb_fill_factor is set to 100, it always leaves 1KB free per 16KB. With an amount of free space that scales with the page size, the final size doesn’t change much.

The impacts of larger page sizes on the compression ratio are important. The most drastic example is with the o1543 dataset and Zlib compression. While with a 16KB page, the compression ratio was already decent, at 3.65, it grows to an amazing 8.7 (I16/I64TCZlib) with pages of 64KB. Larger page sizes have also a positive impact on the compression ratio of the Wikipedia dataset. The original compression ratio with Zlib and 16KB pages is 2.4 and it grows to 3.4 with 64KB pages. Datasets compressed with Lz4 behave similarly to the Zlib ones but the compression ratio are slightly lower.

Overall, the I64TCZlib results for the Wikipedia dataset is the most compressed form we have so far. For the o1543 dataset, the MyISAMPacked compressed size is still slightly smaller but is read-only.

Insertion time

Figure 4, InnoDB transparent page compression insert times

We normally expect compression to add an overhead but here, the insertion speed improves with larger page sizes (figure 4). The reason is likely to be because we are using spinning disks. Spinning disks have a high latency so doing larger IO operations helps. The time overhead of compression with transparent page compression hovers between 10 and 17%. That’s much less than 60% overhead we observed for the Barracuda table compression in the previous post for the Wikipedia dataset (InnoDBCmp8k/InnoDB). We can conclude the insert rates, when inserts are in PK order, are not much affected by transparent page compression. If you are mostly inserting data, it is a nice win.

Data written by inserts

Figure 5, total amount of data written during the inserts

The amount of data written is not much affected by the transparent compression and the larger page sizes (figure 5) . That’s reasonable as many of the writes are not compressed, only the final write to the tablespace is. Neither the writes to the double write buffer, or to the InnoDB log files, or for the tablespace pre-allocation, are compressed. The differences we see are essentially the same as the ones for the final sizes. Only the uncompressed results do not fit that view but these are rather small deviations.

Range selects

Figure 6, time to complete a long range scan

The range select benchmarks are really a means of testing the decompression overhead. As you notice in figure 6, the time variations are not large. For the Wikipedia dataset, the faster range select is I64TCLz4, and it completed in 788 seconds. That’s almost two minutes slower than the faster results using InnoDB Barracuda compression (block_size=4KB). How can we explain such results? If the freed space is reused, transparent compression causes sequential operations to become random ones. The time should increase.  Without space reuse, the storage layer will merge many small reads into a sequential one, and then discard the holes.  Effectively, the disk will read the same amount of data, compressed or not. The only difference will come from decompression.  Lz4 is extremely fast while Zlib is slower.

Going back to the Wikipedia dataset, it took the exact same time, 830s, for I16, I16TCLz4  and I32TCLz4. That seems to indicate there was no space reuse.  With the xfs xfs_bmap tool on a TC compressed file, I listed the blocks used. Here is the command I used and the first lines of the output (with blocks of 512 bytes):

root@LabPS57kvm_1:/tmp# xfs_bmap /var/lib/mysql/test/query_class_metrics.ibd | more
0: [0..31]: 1013168..1013199
1: [32..39]: 1014544..1014551
2: [40..63]: hole
3: [64..71]: 1016976..1016983
4: [72..95]: hole
5: [96..103]: 1017008..1017015
6: [104..127]: hole
7: [128..135]: 1016880..1016887
8: [136..159]: hole
9: [160..167]: 1016912..1016919
10: [168..191]: hole

We have the list:

  • 0..31: 16 KB tablespace header, apparently not compressed
  • 32..39: 4KB TC compressed page, 8 sectors of compressed data
  • 40..63: 12KB hole (24 sectors)
  • …and so on

So the layout actually looks indeed like the filesystem with no reuse case (top layout) of figure 2. When InnoDB extends the tablespace, it of course proceeds by entire pages. The filesystem will try, as much as possible, to allocate continuous blocks. Initially, the tablespace increases one page at a time but rapidly it grows by extent of 64 pages. The space reuse will start only when there are no more continuous areas large enough to satisfy the allocation requests. Until then, the filesystem still performs mostly  sequential operations. The performance characteristics will thus change once the freed blocks start to be reused. On a smaller server, I continued to insert data well after the filesystem would have been full without the holes. The insertion rate fell by about half but the read performance appeared unchanged.

The times of the range selects for the o1543 dataset are more predictable. In all cases, larger pages increase performance. That kind of makes sense – InnoDB needs less IOPS. With Lz4, InnoDB spends less time to decompress the pages than it would need to read the complete uncompressed pages. The opposite is also true for Zlib. The Lz4 results are the fastest, Zlib the slowest, and in between we have the uncompressed results.

20k updates time

Figure 7, time needed to perform 20k updates

Intuitively, I was expecting the larger pages to slow down the updates. Similarly, I was also expecting Lz4 compressed pages to be slower than uncompressed pages, but faster than the ones compressed with Zlib. The above figure shows the times to perform approximately 20k single row updates for both datasets. We performed the updates to the Wikipedia dataset in small separate transactions, while we used a single large update statement for the o1543 dataset.

While the compression algorithm assumption appears to hold true, the one for the page sizes is plainly wrong. Of course, the storage consists of spinning disks so the latency of random IO dominates. The important factor becomes the number of levels in the b-tree of the table. In the root node of the b-tree and all intermediate nodes, bigger pages mean more pointers to the next level. More pointers causes a bigger fan-out  –ratio of nodes between levels – and fewer levels. Bigger pages also cause fewer leaf level pages which in turn require less upper level node pages.

Let’s dive a bit more into this topic. The Wikipedia dataset table has an int unsigned primary key. Considering InnoDB always leaves 1KB free in a page and, along with the primary key, each entry in a node (non-leaf) has an extra 9 bytes for the pointer to the next level page. Let’s do some math:

  • Total number of pages with 16KB pages = 112.6GB / (15KB) = 7871311 pages
  • Max number of rows in the non-leaf pages for 16KB pages and an int PK = (16 * 1024)/(4 (int PK) + 9 (ptr)) = 1260 rows/pages
  • Minimum number of pages in the first level above the leaf = 7871311 / 1260 = 6247 pages
  • Minimum number of pages at the next level = 6247 / 1260 = 5 pages
  • Root page = 1

Of course, our calculations are an approximation. With a 16KB page size, there are three levels above the leaves for a total of 6253 pages and a size of 98MB. It thus requires 6253 IOPS to warm up the buffer pool with the all nodes. A SATA 7200 rpm disk delivers at best 120 IOPS (one per rotation) so that’s about 51 second. Now, let’s redo the same calculations but with a page size of 32KB:

  • Total number of pages with 32KB pages = 110.7GB / (31KB) = 3744431 pages
  • Max number of rows in the non-leaf pages for 32KB pages and an int PK = (32 * 1024)/(4 (int PK) + 9 (ptr)) = 2520 rows/pages
  • Minimum number of pages in the first level above the leaf = 3744431 / 2520 = 1486 pages
  • Root page = 1

Using 32KB pages, we have one level less and only 1487 node pages for a combined size of 47MB. To warm up, the buffer pool we have to load at least the node pages, an operation requiring only a quarter of the IOPS compared to when 16KB pages were used. That’s where most of the performance gains come from. The reduced number of IOPS more than compensates for the longer time to read a large page.  Again, in this setup, we used spinning disks.

Bytes written per update

Figure 8, average bytes written per update

Now, the last set of results concerns the number of bytes written per update statement (figure 8). There is a big price to pay when you want to use larger InnoDB pages, the write amplification is huge. The number of bytes approximately scales roughly with the page size. The worse case is the I64 result, about 192KB written for a single row update of an integer field (Wikipedia). If your database workload includes a large number of small single row updates, you should avoid expensive flash devices with 64KB InnoDB pages as you’ll burn your devices rapidly.

Operational considerations for larger InnoDB pages and TC

When is it good idea to use transparent compression? When should you use a larger InnoDB page size? One valid use case is a database storing large quantities of operational metrics, like the o1543 dataset.  The compression ratio will be fantastic and the performance penalty limited, at least until the filesystem starts reusing the holes.

If you collect data from a large number of devices and you are likely struggling with TBs of highly compressible data, transparent compression might be an interesting option. The only issue I see, but it is a major one, is how to backup large sparse files. InnoDB transparent page compression with punch holes is an interesting solution but, unless I am missing something, it has a somewhat limited scope. There are other compression options with similar compression ratios and less drawbacks.

In this post we explored a feature available since MySQL 5.7, InnoDB transparent compression with punch holes. Performance-wise, we have an interesting solution which offers excellent compression ratio, especially when larger page sizes are used. The transparent compression with punch holes technique suffers from its foundations, sparse files. Backing up very large sparse files is a slow and IO intensive process. Instead of performing large sequential IO operations, the backup process will require millions of small random IO operations.

So far we have discussed the traditional approaches to compression in MySQL (previous post) and Innodb transparent page compression. The next post of the series on data compression with MySQL will introduce the ZFS filesystem. ZFS externalizes the compression to the filesystem in a way that is pretty similar to InnoDB transparent page compression, but the ZFS b-tree file structure removes the inconvenience of sparse files.

Stay tuned, more results are coming.

by Yves Trudeau at February 11, 2019 04:39 PM

Valeriy Kravchuk

On my Favorite FOSDEM 2019 MySQL, MariaDB and Friends Devroom Talks

This year I had not only spoken about MySQL bugs reporting at FOSDEM, but spent almost the entire day listening at MySQL, MariaDB and Friends Devroom. I missed only one talk, on ProxySQL, (to get some water, drink a bottle of famous Belgian beer and chat with my former colleague in MySQL support team, Geert, whom I had not seen for a decade). So, for the first time out of my 4 FOSDEM visits I've got a first hand impression about the entire set of talks in the devroom that I want to share today, while I still remember my feelings.

Most of the talks have both slides and videos already uploaded on site, so you can check them and make your own conclusions, but my top 5 favorite talks (that have both videos and slides already available to community) were the following:

  • "Un-split brain (aka Move Back in Time) MySQL", by Shlomi Noach. You can find slides at SlideShare.

    This was a replacement talk that was really interesting and had proper style for FOSDEM. It was mostly a nice background story of creation of the gh-mysql-rewind tool, a shell script that uses MariaDB's mysqlbinlog --flashback option and MySQL GTIDs and allows to "rewind" row-based binary log to roll back transactions to some previous point in time. The tool should become available to community soon, maybe as a part of orchestrator. I was impressed how one can successfully use 49 slides for 20 minutes talk. That's far beyond my current presentation skills...
  • "Test complex database systems in a laptop with dbdeployer", by Giuseppe Maxia. You can find slides at SlideShare.

    I've already built and used dbdeployer, as described in my blog post, so I was really interested in the talk. Giuseppe was able not only to show 45 slides over 20 minutes and explain all the reasons behind re-implementing MySQL-Sandbox in Go, but also run a live demo where dozens of sandbox instances were created and used. Very impressive!
  • "MySQL and the CAP theorem: relevance & misconceptions", second great talk and show by Shlomi Noach. You can find slides at SlideShare.

    The "CAP theorem" says is a concept that a distributed database system (like any kind of MySQL replication setup) can only have 2 of the 3 features: (atomic) Consistency, (high) Availability and Partition Tolerance. This can be proved mathematically, but Shlomi had not only defined terms and conditions to present the formal proof, but also explained that they are far from real production objectives of any engineer or DBA (like 99.95% of Availability). He had shown typical MySQL setups (from simple async master-slave replication to Galera, group replication and even Vitess) and proved that formally they all are neither consistent nor available from that formal CAP theorem point of view, while, as we all know, they are practically useful and work (and with some efforts, proxies on top etc can be made both highly available and highly consistent for practical purposes). So, CAP theorem is neither representing real production systems, nor meeting their real requirements. We've also got some kind of explanation of why async master-master or circular replication are still popular... All that in 48 slides, with links, and presented in 20 minutes! Greatest short MySQL-related talk I've ever attended.
  • "TiDB: Distributed, horizontally scalable, MySQL compatible", by Morgan Tocker. You can find slides at SlideShare.

    It was probably the first time when I listened to Morgan, even though we worked together for a long time. I liked his way of explaining the architecture of this yet another database system speaking MySQL protocol and reasons to create it. If you are interested in performance of this system, check this blog post.
  • "MySQL 8.0 Document Store: How to Mix NoSQL & SQL in MySQL 8.0", by Frédéric Descamps. You can find slides (70!) at SlideShare.

    LeFred managed to get me somewhat interested in MySQL Shell and new JSON functions in MySQL, way more than ever before. It's even more surprising that hist talk was the last one and we already spent 8+ hours listening before he started. Simple step by step explanation of how one may get the best of both SQL, ACID and NoSQL (JSON, "MongoDB") worlds, if needed, in a single database management syste, was impressive. Also this talk probably caused the longest discussion and the largest number of questions from those remaining attendees.

    He was also one of two "hosts" and "managers" of the devroom, so I am really thankful him for hist efforts year after year to make MySQL devroom at FOSDEM great!
There were more good talks, but I had to pick up few that already have slides shared and those of a kind that I personally prefer to listen to at FOSDEM. This year I also missed few people whom I like to see and talk to at FOSDEM, namely Mark Callaghan and Jean-François Gagné.

The only photo I made with my Nokia dumb phone this year in Brussels, on my way to FOSDEM on February 2. We've got snow and rain that morning, nice for anyone who had to walk 5 kilometers to the ULB campus.
Overall, based on my experience this year, it still makes a lot of sense to visit FOSDEM for anyone interested in MySQL. You can hardly find so many good, different MySQL-related talks per just one single day on any other conference.

by Valerii Kravchuk ( at February 11, 2019 04:26 PM