Planet MariaDB

February 25, 2017

Valeriy Kravchuk

MySQL Support Engineer's Chronicles, Issue #5

A lot of time has passed since my previous post in this series. I was busy with work, participating in FOSDEM, blogging about profilers and sharing various lists of MySQL bugs. But I do not plan to stop writing about my usual weeks of doing a support engineer's job. So, it's time for the next post in this series, based on random notes taken here and there during the week.

This week started for me with checking recent MySQL bug reports (actually, I do that every day). I noted a recent report by Roel, Bug #85065. Unfortunately, it was quickly marked as "Won't fix", and I tend to agree with Laurynas Biveinis that this was probably a wrong decision. See the discussion here. Out-of-memory conditions happen in production, and it would be great for MySQL to process them properly, not just crash randomly. Roel does all of us a favor by checking debug builds with additional, unusual failure conditions introduced, and this work should not be dismissed based on some formal approach.

It's common knowledge already that I try to avoid not only all kinds of clusters, but all kinds of connection pools as well, by all means. Sometimes I still fail, and when I find myself in the unlucky situation of dealing with a connection pool in Apache Tomcat, I consider this reference useful.

This week I had a useful discussion on the need for xtrabackup (version 2.4.6 was released this week) on Windows (customers ask about it once in a while, and some even try to use older binaries of version 1.6.x or so from Percona) and any alternatives. I was pointed to this blog post by Anders Karlsson. I remember reading something about using the Volume Snapshot Service on Windows to back up MySQL back at Oracle in 2012, and really should just try how it works based on the blog above and this reference. But I still think that, at least without a command-line wrapper like mylvmbackup, this approach is hardly easy to follow for the average Windows user (like me) and is not on par with xtrabackup for ease of use etc.

I spent some time building new releases of Percona Server, MariaDB and MySQL 5.6 from Facebook on my Fedora 25 box (this is also one of the first things I do every morning, for whatever software got updates on GitHub). So, just to remind myself, here is my usual cmake command line for 5.7-based builds:

cmake . -DCMAKE_INSTALL_PREFIX=/home/openxs/dbs/5.7 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF -DDOWNLOAD_BOOST=1 -DWITH_BOOST=/home/openxs/boost
I do not have it automated, so sometimes the exact command line gets lost in the bash history... That never happens with MySQL from Facebook or MariaDB (as we see new commits almost every day), but it easily happens with MySQL 5.7 (because they push commits only after official releases) or Percona Server (as I have not cared that much about it for a year already...). This time I noted that one actually needs the numactl-devel package to build Percona Server 5.7. That was not the case last time I tried.
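Even a tiny wrapper script would solve the lost-history problem. Here is a hypothetical sketch (the paths are the ones from the command line above; adjust PREFIX and BOOST_DIR for your own box) that records each invocation in a log before building:

```shell
#!/bin/sh
# Hypothetical build wrapper: log the exact cmake invocation to build.log so
# it never gets lost in bash history. Paths below are assumptions from the
# author's setup; adjust PREFIX and BOOST_DIR for your box.
PREFIX="$HOME/dbs/5.7"
BOOST_DIR="$HOME/boost"
CMAKE_CMD="cmake . -DCMAKE_INSTALL_PREFIX=$PREFIX -DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DBUILD_CONFIG=mysql_release -DFEATURE_SET=community \
-DWITH_EMBEDDED_SERVER=OFF -DDOWNLOAD_BOOST=1 -DWITH_BOOST=$BOOST_DIR"

# append a dated record of the command first...
echo "$(date +%F) $CMAKE_CMD" >> build.log
# ...then build only when invoked with "run", so a plain call is a dry run
[ "$1" = "run" ] && $CMAKE_CMD && make -j"$(nproc)" || echo "dry run: command recorded in build.log"
```

Calling it without arguments just records the command; calling it with `run` records and builds.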

I also paid attention to blog posts on two hot topics this week. The first one is comparing ProxySQL (here and there) to the recently released MaxScale 2.1. I probably have to build and test both myself, but ProxySQL still seems to rock, and I have no reason NOT to trust my former colleague René Cannaò, who has provided a lot of details about his tests.

Another topic that is hot in February 2017 is group replication vs. Galera. Przemek's recent post on this topic caused quite a hot discussion here. I still consider the post a good, useful and sincere attempt to compare these technologies and highlight the problems a typical experienced Galera user may expect, suspect or note in group replication in MySQL 5.7.17. Make sure you read the comments to the blog post, as they help to clarify the implementation details, differences and strong features of group replication. IMHO, this post, along with the great series by LeFred and comments from the community, really helps to set proper expectations and build a proper understanding of both technologies.

Last but not least, this week Mark Callaghan confirmed that InnoDB is still faster than MyRocks on all kinds of read-only, all-in-memory workloads (he used sysbench 1.0 tests) on "small server"/older hardware. I had noted this for a specific use case in Bug #68079 almost a month ago...

This week I worked on several interesting customer issues (involving Galera, the CONNECT engine, security concerns, MySQL 5.1, the limit of 64 secondary indexes per InnoDB table, etc.) that are not yet completely resolved, so I expect a lot of fun and useful links to note next week. Stay tuned!

by Valeriy Kravchuk (noreply@blogger.com) at February 25, 2017 04:37 PM

February 24, 2017

Peter Zaitsev

Installing Percona Monitoring and Management (PMM) for the First Time


This post is another in the series on Percona’s MongoDB 3.4 bundle release. This post is meant to walk a prospective user through the benefits of Percona Monitoring and Management (PMM), how it’s architected and the simple install process. By the end of this post, you should have a good idea of what PMM is, where it can add value in your environment and how you can get PMM going quickly.

Percona Monitoring and Management (PMM) is Percona’s open-source tool for monitoring and alerting on database performance and the components that contribute to it. PMM monitors MySQL (Percona Server and MySQL CE), Amazon RDS/Aurora, MongoDB (Percona Server and MongoDB CE), Percona XtraDB/Galera Cluster, ProxySQL, and Linux.

What is it?

Percona Monitoring and Management is an amalgamation of exciting, best-in-class open-source tools and Percona “engineering wizardry,” designed to make it easier to monitor and manage your environment. The real value to our users is the amount of time we’ve spent integrating the tools, plus the pre-built dashboards we’ve constructed that leverage our ten years of performance optimization experience. What you get is a tool that is ready to go out of the box and installs in minutes. And if you’re still not convinced: like ALL Percona software, it’s completely FREE!

Sound good? I can hear you nodding your head. Let’s take a quick look at the architecture.

What’s it made of?

PMM, at a high level, is made up of two basic components: the client and the server. The PMM client is installed on the database servers themselves and is used to collect metrics. The client contains technology-specific exporters (which collect and export data) and an “admin interface” (which makes the management of the PMM platform very simple). The PMM server is a “pre-integrated unit” (Docker, VM or AWS AMI) that contains four components that gather the metrics from the exporters on the PMM client(s). The PMM server contains Consul, Grafana, Prometheus and a Query Analytics engine that Percona has developed. There is a graphic in the architecture section of our documentation. In order to keep this post to a manageable length, please refer to that page if you’d like a more in-depth explanation.

How do I use it?

PMM is very easy to access once it has been installed (more on the install process below). Simply open the web browser of your choice and connect to the PMM landing page by typing http://<ip_address_of_PMM_server>. That takes you to the PMM landing page, where you can access all of PMM’s tools. If you’d like to get a look into the user experience, we’ve set up a great demo site so you can easily test it out.

Where should I use it?

There’s a good chance that you already have a monitoring/alerting platform for your production workloads. If not, you should set one up immediately and start analyzing trends in your environment. If you’re confident in your production monitoring solution, there is still a use for PMM in an often overlooked area: development and testing.

When speaking with users, we often hear that their development and test environments run their most demanding workloads. This is often due to stress testing and benchmarking. The goal of these workloads is usually to break something. This allows you to set expectations for normal, and thus abnormal, behavior in your production environment. Once you have a good idea of what’s “normal” and the critical factors involved, you can alert around those parameters to identify “abnormal” patterns before they cause user issues in production. The reason that monitoring is critical in your dev/test environment(s) is that you want to easily spot inflection points in your workload, which signal impending disaster. Dashboards are the easiest way for humans to consume and analyze this data.

Are you sold? Let’s get to the easiest part: installation.

How do you install it?

PMM is very easy to install and configure for two main reasons. The first is that the components (mentioned above) take some time to install individually, so we spent the time to integrate everything and ship it as a unit: one server install and one client install per host. The second is that we’re targeting customers looking to monitor MySQL and MongoDB installations for high availability and performance. The fact that it’s a targeted solution makes pre-configuring it to monitor for best practices much easier. I believe we’ve all seen the kind of solution that tries to do a little of everything, and thus does no particular thing well. That is the type of tool that we DO NOT want PMM to be. Now, onto the installation procedure.

There are four basic steps to get PMM monitoring your infrastructure. I do not want to recreate the Deployment Guide in order to maintain the future relevancy of this post. However, I’ll link to the relevant sections of the documentation so you can cut to the chase. Also, underneath each step, I’ll list some key takeaways that will save you time now and in the future.

  1. Install the integrated PMM server in the flavor of your choice (Docker, VM or AWS AMI)
    1. Percona recommends Docker to deploy PMM server as of v1.1
      1. As of right now, using Docker will make the PMM server upgrade experience seamless.
      2. Using the default version of Docker from your package manager may cause unexpected behavior. We recommend using the latest stable version from Docker’s repositories (instructions from Docker).
    2. PMM server AMI and VM are “experimental” in PMM v1.1
    3. When you open the “Metrics Monitor” for the first time, it will ask for credentials (user: admin pwd: admin).
  2. Install the PMM client on every database instance that you want to monitor.
    1. Install with your package manager for easier upgrades when a new version of PMM is released.
  3. Connect the PMM client to the PMM Server.
    1. Think of this step as sending configuration information from the client to the server. This means you are telling the client the address of the PMM server, not the other way around.
  4. Start data collection services on the PMM client.
    1. Collection services are enabled per database technology (MySQL, MongoDB, ProxySQL, etc.) on each database host.
    2. Make sure to set permissions for the PMM client to monitor the database, or you will see an error like: Cannot connect to MySQL: Error 1045: Access denied for user ‘jon’@’localhost’ (using password: NO)
      1. Setting proper credentials uses this syntax: sudo pmm-admin add <service_type> --user xxxx --password xxxx
    3. There’s good information about PMM client options in the “Managing PMM Client” section of the documentation for advanced configurations/troubleshooting.

What’s next?

That’s really up to you, and what makes sense for your needs. However, here are a few suggestions to get the most out of PMM.

  1. Set up alerting in Grafana on the PMM server. This is still an experimental function in Grafana, but it works. I’d start with Barrett Chambers’ post on setting up email alerting, and refine it with Peter Zaitsev’s post.
  2. Set up more hosts to test the full functionality of PMM. We have completely free, high-performance versions of MySQL, MongoDB, Percona XtraDB Cluster (PXC) and ProxySQL (for MySQL proxy/load balancing).
  3. Start load testing the database with benchmarking tools to build your troubleshooting skills. Try to break something to learn what troubling trends look like. When you find them, set up alerts to give you enough time to fix them.

by Jon Tobin at February 24, 2017 09:32 PM

Battle for Synchronous Replication in MySQL: Galera vs. Group Replication


In today’s blog post, I will briefly compare two major synchronous replication technologies available today for MySQL.

Synchronous Replication

Thanks to the Galera plugin, developed by the Codership team, we have had the choice between asynchronous and synchronous replication in the MySQL ecosystem for quite a few years already. Moreover, we can choose between at least three software providers: Codership, MariaDB and Percona, each with its own Galera implementation.

The situation recently became much more interesting when MySQL Group Replication went into GA (stable) stage in December 2016.

Oracle, the upstream MySQL provider, finally introduced its own synchronous replication implementation. Unlike the others mentioned above, it isn’t based on Galera; Group Replication was built from the ground up as a new solution. Still, MySQL Group Replication shares many concepts with Galera. This post doesn’t cover MySQL Cluster, another synchronous solution way older than Galera; it is a much different solution for different use cases.

In this post, I will point out a couple of interesting differences between Group Replication and Galera, which hopefully will be helpful to those considering switching from one to another (or just planning to test them).

This is certainly not a full list of all the differences, but rather things I found interesting during my explorations.

It is also important to know that Group Replication evolved a lot before it went GA (its whole cluster layer was replaced). I won't mention how things looked before the GA stage, and will just concentrate on the latest available version, 5.7.17. I will also not spend too much time on how Galera implementations looked in the past, and will use Percona XtraDB Cluster 5.7 as a reference.

Multi-Master vs. Master-Slave

Galera has always been multi-master by default, so it does not matter to which node you write. Many users use a single writer due to workload specifics and multi-master limitations, but Galera has no single master mode per se.

Group Replication, on the other hand, promotes just one member as primary (master) by default, and other members are put into read-only mode automatically. This is what happens if we try to change data on a non-master node:

mysql> truncate test.t1;
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement

To change from single-primary mode to multi-primary (multi-master), you have to start group replication with the group_replication_single_primary_mode variable disabled.
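In 5.7.17 this is not a setting you can flip at runtime: it has to be in the configuration of every member before Group Replication starts. A minimal my.cnf fragment might look like this (the companion enforce_update_everywhere_checks setting is required for multi-primary per the documentation):

```ini
[mysqld]
# disable single-primary mode => multi-primary (multi-master) group
loose-group_replication_single_primary_mode              = OFF
# required companion consistency checks when running multi-primary
loose-group_replication_enforce_update_everywhere_checks = ON
```

The loose- prefix just keeps mysqld from failing to start when the plugin is not loaded yet.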
Another interesting fact is that you have no influence over which cluster member becomes the master in single-primary mode: the cluster auto-elects it. You can only check it with a query:

mysql> SELECT * FROM performance_schema.global_status WHERE VARIABLE_NAME like 'group_replication%';
+----------------------------------+--------------------------------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       |
+----------------------------------+--------------------------------------+
| group_replication_primary_member | 329333cd-d6d9-11e6-bdd2-0242ac130002 |
+----------------------------------+--------------------------------------+
1 row in set (0.00 sec)

Or just:

mysql> show status like 'group%';
+----------------------------------+--------------------------------------+
| Variable_name                    | Value                                |
+----------------------------------+--------------------------------------+
| group_replication_primary_member | 329333cd-d6d9-11e6-bdd2-0242ac130002 |
+----------------------------------+--------------------------------------+
1 row in set (0.01 sec)

To show the hostname instead of the UUID:

mysql> select member_host as "primary master" from performance_schema.global_status join performance_schema.replication_group_members where variable_name='group_replication_primary_member' and member_id=variable_value;
+----------------+
| primary master |
+----------------+
| f18ff539956d   |
+----------------+
1 row in set (0.00 sec)

Replication: Majority vs. All

Galera delivers write transactions synchronously to ALL nodes in the cluster. (Later, applying happens asynchronously in both technologies.) Group Replication, however, needs just a majority of the nodes to confirm the transaction. This means a transaction commit on the writer succeeds and returns to the client even if a minority of nodes has not yet received it.

In the example of a three-node cluster, if one node crashes or loses the network connection, the two others continue to accept writes (or just the primary node in Single-Primary mode) even before a faulty node is removed from the cluster.

If the separated node is the primary one, it denies writes due to the lack of a quorum (it will report the error ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.). If one of the remaining nodes has a quorum, it will be elected primary after the faulty node is removed from the cluster, and will then accept writes.

That said, the “majority” rule in Group Replication means there is no guarantee that you won’t lose any data if the majority of nodes is lost. There is a chance that the majority could have applied some transactions that were not yet delivered to the minority at the moment of the crash.

In Galera, a single node’s network interruption makes the others wait for it, and pending writes can be committed once either the connection is restored or the faulty node is removed from the cluster after a timeout. So the chance of losing data in a similar scenario is lower, as transactions always reach all nodes. Data can be lost in Percona XtraDB Cluster only in a really bad-luck scenario: a network split happens, the remaining majority of nodes forms a quorum, the cluster reconfigures and allows new writes, and then shortly afterwards the majority part is damaged.

Schema Requirements

For both technologies, one of the requirements is that all tables must be InnoDB and have a primary key. This requirement is now enforced by default in both Group Replication and Percona XtraDB Cluster 5.7. Let’s look at the differences.

Percona XtraDB Cluster:

mysql> create table nopk (a char(10));
Query OK, 0 rows affected (0.08 sec)
mysql> insert into nopk values ("aaa");
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.nopk) without an explicit primary key with pxc_strict_mode = ENFORCING or MASTER
mysql> create table m1 (id int primary key) engine=myisam;
Query OK, 0 rows affected (0.02 sec)
mysql> insert into m1 values(1);
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.m1) that resides in non-transactional storage engine with pxc_strict_mode = ENFORCING or MASTER
mysql> set global pxc_strict_mode=0;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into nopk values ("aaa");
Query OK, 1 row affected (0.00 sec)
mysql> insert into m1 values(1);
Query OK, 1 row affected (0.00 sec)

Before Percona XtraDB Cluster 5.7 (or in other Galera implementations), there were no such enforced restrictions. Users unaware of these requirements often ended up with problems.

Group Replication:

mysql> create table nopk (a char(10));
Query OK, 0 rows affected (0.04 sec)
mysql> insert into nopk values ("aaa");
ERROR 3098 (HY000): The table does not comply with the requirements by an external plugin.
2017-01-15T22:48:25.241119Z 139 [ERROR] Plugin group_replication reported: 'Table nopk does not have any PRIMARY KEY. This is not compatible with Group Replication'
mysql> create table m1 (id int primary key) engine=myisam;
ERROR 3161 (HY000): Storage engine MyISAM is disabled (Table creation is disallowed).

I am not aware of any way to disable these restrictions in Group Replication.

GTID

Galera has it’s own Global Transaction ID, which has existed since MySQL 5.5, and is independent from MySQL’s GTID feature introduced in MySQL 5.6. If MySQL’s GTID is enabled on a Galera-based cluster, both numerations exist with their own sequences and UUIDs.

Group Replication is based on the native MySQL GTID feature, and relies on it. Interestingly, a separate sequence block range (initially 1M) is pre-assigned for each cluster member.

WAN Support

The MySQL Group Replication documentation isn’t very optimistic on WAN support, claiming that both “Low latency, high bandwidth network connections are a requirement” and “Group Replication is designed to be deployed in a cluster environment where server instances are very close to each other, and is impacted by both network latency as well as network bandwidth.” These statements can be found here and here. However, there is a network traffic optimization available: message compression.

I don’t see group communication level tunings available yet, as we find in the Galera evs.* series of

wsrep_provider_options
.

The Galera founders actually encourage trying it in geo-distributed environments, and some WAN-dedicated settings are available (the most important being WAN segments).

But both technologies need a reliable network for good performance.

State Transfers

Galera has two types of state transfers that allow syncing data to nodes when needed: incremental (IST) and full (SST). Incremental is used when a node has been out of the cluster for some time and, once it rejoins, the other nodes still have the missing write sets in the Galera cache. Full SST is helpful if incremental is not possible, especially when a new node is added to the cluster. SST automatically provisions the node with fresh data taken as a snapshot from one of the running nodes (the donor). The most common SST method is Percona XtraBackup, which takes a fast and non-blocking binary data snapshot (hot backup).
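For reference, picking XtraBackup as the SST method in Percona XtraDB Cluster is plain configuration; a typical my.cnf fragment (the SST user name and password are placeholders for an account you must create yourself) looks like:

```ini
[mysqld]
# use Percona XtraBackup for non-blocking SST snapshots on the donor
wsrep_sst_method = xtrabackup-v2
# MySQL account the donor's xtrabackup runs under (create it on all nodes)
wsrep_sst_auth   = sstuser:s3cretPass
```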

In Group Replication, state transfers are fully based on binary logs with GTID positions. If there is no donor with all of the required binary logs (including the ones a new node needs), a DBA has to provision the new node with an initial data snapshot first. Otherwise, the joiner will fail with a very familiar error:

2017-01-16T23:01:40.517372Z 50 [ERROR] Slave I/O for channel 'group_replication_recovery': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236

The official documentation mentions that provisioning the node before adding it to the cluster may speed up joining (the recovery stage). Another difference is the behavior on state transfer failure: a Galera joiner will abort after the first try and shut down its mysqld instance, while the Group Replication joiner falls back to another donor in an attempt to succeed. Here I found something slightly annoying: if no donor can satisfy the joiner’s demands, it will still keep trying the same donors over and over, for a fixed number of attempts:

[root@cd81c1dadb18 /]# grep 'Attempt' /var/log/mysqld.log |tail
2017-01-16T22:57:38.329541Z 12 [Note] Plugin group_replication reported: 'Establishing group recovery connection with a possible donor. Attempt 1/10'
2017-01-16T22:57:38.539984Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 2/10'
2017-01-16T22:57:38.806862Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 3/10'
2017-01-16T22:58:39.024568Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 4/10'
2017-01-16T22:58:39.249039Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 5/10'
2017-01-16T22:59:39.503086Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 6/10'
2017-01-16T22:59:39.736605Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 7/10'
2017-01-16T23:00:39.981073Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 8/10'
2017-01-16T23:00:40.176729Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 9/10'
2017-01-16T23:01:40.404785Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 10/10'

After the last try, even though it fails, mysqld keeps running and allows client connections…

Auto Increment Settings

Galera adjusts the auto_increment_increment and auto_increment_offset values according to the number of members in a cluster. So, for a 3-node cluster, auto_increment_increment will be “3” and auto_increment_offset from “1” to “3” (depending on the node). If the number of nodes changes later, these are updated immediately. This feature can be disabled using the wsrep_auto_increment_control setting. If needed, these settings can also be set manually.

Interestingly, in Group Replication the auto_increment_increment seems to be fixed at 7, and only auto_increment_offset is set differently on each node. This is the case even in the default single-primary mode! This seems like a waste of available IDs, so make sure you adjust the group_replication_auto_increment_increment setting to a saner number before you start using Group Replication in production.

Multi-Threaded Slave Side Applying

Galera developed its own multi-threaded slave feature as early as the 5.5 versions, and it works even for workloads with tables in the same database. It is controlled with the wsrep_slave_threads variable. Group Replication uses the standard feature introduced in MySQL 5.7, where the number of applier threads is controlled with slave_parallel_workers. Galera parallelizes applying based on potential conflicts of changed/locked rows; in MySQL 5.7, it is based on group commit or schema.
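As a hedged my.cnf sketch, the relevant knobs on each side would be (the thread counts are arbitrary examples, not recommendations):

```ini
# Galera / Percona XtraDB Cluster node: Galera's own parallel applier
wsrep_slave_threads = 8

# Group Replication node: the standard MySQL 5.7 multi-threaded slave
slave_parallel_type         = LOGICAL_CLOCK
slave_parallel_workers      = 8
slave_preserve_commit_order = ON
```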

Flow Control

Both technologies use a technique to throttle writes when nodes are slow in applying them. Interestingly, the default size of the allowed applier queue is quite different in the two. Moreover, Group Replication provides a separate certifier queue size, also eligible as a Flow Control trigger: group_replication_flow_control_certifier_threshold. One thing I found difficult is checking the actual applier queue size, as the only one exposed via performance_schema.replication_group_member_stats is Count_Transactions_in_queue (which only shows the certifier queue).
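For the record, this is the query I mean; the column names below are as exposed in 5.7.17:

```sql
-- only the certifier queue is exposed; there is no applier-queue counter here
SELECT MEMBER_ID, COUNT_TRANSACTIONS_IN_QUEUE
  FROM performance_schema.replication_group_member_stats;
```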

Network Hiccup/Partition Handling

In Galera, when the network connection between nodes is lost, the nodes that still have a quorum form a new cluster view. Those that lost the quorum keep trying to re-connect to the primary component. Once the connection is restored, the separated nodes sync back using IST and rejoin the cluster automatically.

This doesn’t seem to be the case for Group Replication. Separated nodes that lose the quorum will be expelled from the cluster, and won’t join back automatically once the network connection is restored. In its error log we can see:

2017-01-17T11:12:18.562305Z 0 [ERROR] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2017-01-17T11:12:18.631225Z 0 [Note] Plugin group_replication reported: 'getstart group_id ce427319'
2017-01-17T11:12:21.735374Z 0 [Note] Plugin group_replication reported: 'state 4330 action xa_terminate'
2017-01-17T11:12:21.735519Z 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-01-17T11:12:21.735527Z 0 [Note] Plugin group_replication reported: 'state 4257 action xa_exit'
2017-01-17T11:12:21.735553Z 0 [Note] Plugin group_replication reported: 'Exiting xcom thread'
2017-01-17T11:12:21.735558Z 0 [Note] Plugin group_replication reported: 'new state x_start'

Its status changes to:

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 329333cd-d6d9-11e6-bdd2-0242ac130002 | f18ff539956d | 3306        | ERROR        |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
1 row in set (0.00 sec)

It seems the only way to bring it back into the cluster is to manually restart Group Replication:

mysql> START GROUP_REPLICATION;
ERROR 3093 (HY000): The START GROUP_REPLICATION command failed since the group is already running.
mysql> STOP GROUP_REPLICATION;
Query OK, 0 rows affected (5.00 sec)
mysql> START GROUP_REPLICATION;
Query OK, 0 rows affected (1.96 sec)
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 24d6ef6f-dc3f-11e6-abfa-0242ac130004 | cd81c1dadb18 | 3306        | ONLINE       |
| group_replication_applier | 329333cd-d6d9-11e6-bdd2-0242ac130002 | f18ff539956d | 3306        | ONLINE       |
| group_replication_applier | ae148d90-d6da-11e6-897e-0242ac130003 | 0af7a73f4d6b | 3306        | ONLINE       |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
3 rows in set (0.00 sec)

Note that in the above output, after the network failure, Group Replication did not stop; it waits in an error state. Moreover, in Group Replication a partitioned node keeps serving dirty reads as if nothing happened (for non-super users):

cd81c1dadb18 {test} ((none)) > SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 24d6ef6f-dc3f-11e6-abfa-0242ac130004 | cd81c1dadb18 | 3306        | ERROR        |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
1 row in set (0.00 sec)
cd81c1dadb18 {test} ((none)) > select * from test1.t1;
+----+-------+
| id | a     |
+----+-------+
|  1 | dasda |
|  3 | dasda |
+----+-------+
2 rows in set (0.00 sec)
cd81c1dadb18 {test} ((none)) > show grants;
+-------------------------------------------------------------------------------+
| Grants for test@%                                                             |
+-------------------------------------------------------------------------------+
| GRANT SELECT, INSERT, UPDATE, DELETE, REPLICATION CLIENT ON *.* TO 'test'@'%' |
+-------------------------------------------------------------------------------+
1 row in set (0.00 sec)

A privileged user can disable super_read_only, but even then it won’t be able to write:

cd81c1dadb18 {root} ((none)) > insert into test1.t1 set a="split brain";
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.
cd81c1dadb18 {root} ((none)) > select * from test1.t1;
+----+-------+
| id | a |
+----+-------+
| 1 | dasda |
| 3 | dasda |
+----+-------+
2 rows in set (0.00 sec)

I found an interesting thing here, which I consider to be a bug. In this case, a partitioned node can actually perform DDL, despite the error:

cd81c1dadb18 {root} ((none)) > show tables in test1;
+-----------------+
| Tables_in_test1 |
+-----------------+
| nopk |
| t1 |
+-----------------+
2 rows in set (0.01 sec)
cd81c1dadb18 {root} ((none)) > create table test1.split_brain (id int primary key);
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.
cd81c1dadb18 {root} ((none)) > show tables in test1;
+-----------------+
| Tables_in_test1 |
+-----------------+
| nopk |
| split_brain |
| t1 |
+-----------------+
3 rows in set (0.00 sec)

In a Galera-based cluster, you are automatically protected from that: a partitioned node refuses to allow both reads and writes, throwing an error:

ERROR 1047 (08S01): WSREP has not yet prepared node for application use

You can force dirty reads using the wsrep_dirty_reads variable.

There are many more subtle (and less subtle) differences between these technologies, but this blog post is long enough already. Maybe next time 🙂

Article with Similar Subject

http://lefred.be/content/group-replication-vs-galera/

by Przemysław Malkowski at February 24, 2017 08:46 PM

February 23, 2017

Peter Zaitsev

Percona MongoDB 3.4 Bundle Release: Percona Server for MongoDB 3.4 Features Explored

Percona Server for MongoDB

This blog post continues the series on the Percona MongoDB 3.4 bundle release. This release includes Percona Server for MongoDB, Percona Monitoring and Management, and Percona Toolkit. In this post, we’ll look at the features included in Percona Server for MongoDB.

I apologize for the long blog post, but there is a good deal of important information to cover: not just what the new features are, but also why they are so important. I have tried to break this down into clear areas and cover as much as possible, while also linking to further reading on these topics.

The first and biggest new feature for many people is the addition of collation in MongoDB. Wikipedia says about collation:

Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or alphabetical order, or extensions and combinations thereof. Collation is a fundamental element of most office filing systems, library catalogs, and reference books.

What this is saying is that a collation is an ordering of characters for a given character set. Different languages order the alphabet differently, or even use entirely different base characters (as in many Asian, Middle Eastern and other non-Latin scripts). Collations are critical for multi-language support and for sorting non-English words in index ordering.

Sharding

General

All members of a cluster are aware of sharding (all members, sharding set name, etc.). Due to this, sharding.clusterRole must be defined on all shard nodes, a new requirement.

Mongos processes MUST connect to 3.4 mongod instances (shard and config nodes); connecting to 3.2 and lower is not possible.

Config Servers

Balancer on Config Server PRIMARY

In MongoDB 3.4, the cluster balancer is moved from the mongos processes (any) to the config server PRIMARY member.

Moving to a config-server-based balancer has the following benefits:

Predictability: the balancer process is always the config server PRIMARY. Before 3.4, any mongos processes could become the balancer, often chosen at random. This made troubleshooting difficult.

Lighter “mongos” process: the mongos/shard router benefits from being as light and thin as possible. This removes some code and potential for breakage from “mongos.”

Efficiency: config servers have dedicated nodes with very low resource utilization and no direct client traffic, for the most part. Moving the balancer to the config server set moves usage away from critical “router” processes.

Reliability: balancing relies on fewer components. Now the balancer can operate on the “config” database metadata locally, without the chance of network interruptions breaking balancing.

Config servers are a more permanent member of a cluster, unlikely to scale up/down or often change, unlike “mongos” processes that may be located on app hosts, etc.

Config Server Replica Set Required

In MongoDB 3.4, the former “mirror” config server strategy (SCCC) is no longer supported. This means all sharded clusters must use a replica-set-based set of config servers.

Using a replica-set based config server set has the following benefits:

Adding and removing config servers is greatly simplified.

Config servers have oplogs (useful for investigations).

Simplicity/Consistency: removing mirrored/SCCC config servers simplifies the high-level and code-level architecture.

Chunk Migration / Balancing Example

(from docs.mongodb.com)

Parallel Migrations

Previous to MongoDB 3.4, the balancer could only perform a single chunk migration at any given time. When a chunk migrates, a “source” shard and a “destination” shard are chosen. The balancer coordinates moving the chunks from the source to the target. In a large cluster with many shards, this is inefficient because a migration only involves two shards and a cluster may contain 10s or 100s of shards.

In MongoDB 3.4, the balancer can now perform many chunk migrations at the same time, in parallel, as long as they do not involve the same source and destination shards. This means that in clusters with more than two shards, many chunk migrations can now occur simultaneously when they’re mutually exclusive to one another. The effective outcome is a maximum of (Number of Shards / 2) - 1 parallel migrations, and an increase in the speed of the migration process.

For example, if you have ten shards, then 10/2 = 5 and  5-1 = 4. So you can have four concurrent moveChunks or balancing actions.
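The post's formula can be sketched in a few lines of plain JavaScript (illustrative only, not MongoDB code; the function name is ours):

```javascript
// Max parallel chunk migrations per the formula above:
// (number of shards / 2) - 1, since each migration ties up a
// distinct source and destination shard.
function maxParallelMigrations(numShards) {
  return Math.floor(numShards / 2) - 1;
}

console.log(maxParallelMigrations(10)); // 4
console.log(maxParallelMigrations(4));  // 1
```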

Tags and Zone

Sharding Zones supersede tag-aware sharding. There are mostly no changes to the functionality: this is mostly a naming change plus some new helper functions.

New commands/shell-methods added:

addShardToZone / sh.addShardToZone().

removeShardFromZone / sh.removeShardFromZone().

updateZoneKeyRange / sh.updateZoneKeyRange() + sh.removeRangeFromZone().

You might recall that MongoDB has long supported the idea of shard and replication tags. They break into two main areas: hardware-aware tags and access-pattern tags. The idea behind hardware-aware tags was that you could have one shard with slow disks, and as data ages, a process moves documents to a collection that lives on that shard (or you tell specific ranges to live on that shard). Your other shards could then be faster (and there could be multiples of them) to better handle the high-speed processing of current data.

The other case is based more in replication, where you want to allow BI and other reporting systems access to your data without harming your primary customer interactions. To do this, you could tag a node in a replica set as {reporting: true}, and all reporting queries would use this tag to avoid affecting the nodes the user-generated work lives on. Zones are the same idea, simplified into a better-understood term. For now, there is no major difference between these areas, but it could be something to look at more in the 3.6 and 3.8 MongoDB versions.

Replication

New “linearizable” Read Concern: reflects all successful writes issued with a “majority” and acknowledged before the start of the read operation.

Adjustable Catchup for Newly Elected Primary: the time limit for a newly elected primary to catch up with the other replica set members that might have more recent writes.

Write Concern Majority Journal Default replset-config option: determines the behavior of the { w: "majority" } write concern if the write concern does not explicitly specify the journal option j.

Initial-sync improvements:

Now the initial sync builds the indexes as the documents are copied.

Improvements to the retry logic make it more resilient to intermittent failures on the network.

Data Types

MongoDB 3.4 adds support for the decimal128 format with the new decimal data type. The decimal128 format supports numbers with up to 34 decimal digits (i.e., significant digits) and an exponent range of −6143 to +6144.

When performing comparisons among different numerical types, MongoDB conducts a comparison of the exact stored numerical values without first converting values to a common type.

Unlike the double data type, which only stores an approximation of the decimal values, the decimal data type stores the exact value. For example, a decimal NumberDecimal("9.99") has a precise value of 9.99, whereas a double 9.99 would have an approximate value of 9.9900000000000002131628….
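That double approximation is easy to see in any IEEE-754 environment. This Node.js snippet (illustrative only, not MongoDB shell code) prints the value actually stored for the double 9.99:

```javascript
// A binary64 double cannot represent 9.99 exactly; asking for more
// significant digits exposes the stored approximation.
console.log((9.99).toPrecision(20)); // 9.9900000000000002132
```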

To test for the decimal type, use the $type operator with the literal "decimal" or 19:

db.inventory.find( { price: { $type: "decimal" } } )

To insert a document using the new number wrapper object type:

db.inventory.insert( { _id: 1, item: "The Scream", price: NumberDecimal("9.99"), quantity: 4 } )

To use the new decimal data type with a MongoDB driver, an upgrade to a driver version that supports the feature is necessary.

Aggregation Changes

Stages

Recursive Search

MongoDB 3.4 introduces a stage to the aggregation pipeline that allows for recursive searches.

Stage
Description
$graphLookup   Performs a recursive search on a collection. To each output document, adds a new array field that contains the traversal results of the recursive search for that document.
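To make the recursive-search semantics concrete, here is a plain-JavaScript sketch of what $graphLookup does: starting from a value, repeatedly follow connectFromField → connectToField links and collect every document reached. The employees collection and field names below are hypothetical, and this runs in Node rather than against MongoDB:

```javascript
// A tiny reporting hierarchy: each doc points at its manager by name.
const employees = [
  { _id: 1, name: "Dev",   reportsTo: null },
  { _id: 2, name: "Eliot", reportsTo: "Dev" },
  { _id: 3, name: "Ron",   reportsTo: "Eliot" },
];

// Breadth-first traversal mimicking $graphLookup's startWith /
// connectFromField / connectToField options.
function graphLookup(docs, startWith, connectFrom, connectTo) {
  const found = [];
  let frontier = [startWith];
  while (frontier.length > 0) {
    const next = [];
    for (const doc of docs) {
      if (frontier.includes(doc[connectTo]) && !found.includes(doc)) {
        found.push(doc);
        next.push(doc[connectFrom]); // follow the link for the next hop
      }
    }
    frontier = next;
  }
  return found;
}

// Ron's whole reporting chain, starting from his manager's name.
const chain = graphLookup(employees, "Eliot", "reportsTo", "name");
console.log(chain.map(d => d.name)); // [ 'Eliot', 'Dev' ]
```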

Faceted Search

Faceted search allows for the categorization of documents into classifications. For example, given a collection of inventory documents, you might want to classify items by a single category (such as by the price range), or by multiple groups (such as by price range as well as separately by the departments).

3.4 introduces stages to the aggregation pipeline that allow for faceted search.

Stage
Description
$bucket Categorizes or groups incoming documents into buckets that represent a range of values for a specified expression.
$bucketAuto Categorizes or groups incoming documents into a specified number of buckets that constitute a range of values for a specified expression. MongoDB automatically determines the bucket boundaries.
$facet Processes multiple pipelines on the input documents and outputs a document that contains the results of these pipelines. By specifying facet-related stages ($bucket, $bucketAuto, and $sortByCount) in these pipelines, $facet allows for multi-faceted search.
$sortByCount   Categorizes or groups incoming documents by a specified expression to compute the count for each group. Output documents are sorted in descending order by the count.
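The $bucket behavior described above can be sketched in plain JavaScript: documents fall into buckets whose boundaries are [lower, upper) pairs, and each bucket reports a count. The inventory data and function name here are hypothetical, and this is a Node illustration rather than an aggregation pipeline:

```javascript
const inventory = [
  { item: "print A", price: 5 },
  { item: "print B", price: 12 },
  { item: "poster",  price: 45 },
  { item: "canvas",  price: 80 },
];

// boundaries [0, 50, 100] yields two buckets: [0,50) and [50,100).
function bucket(docs, field, boundaries) {
  const out = boundaries.slice(0, -1).map(b => ({ _id: b, count: 0 }));
  for (const doc of docs) {
    for (let i = 0; i < boundaries.length - 1; i++) {
      if (doc[field] >= boundaries[i] && doc[field] < boundaries[i + 1]) {
        out[i].count++;
      }
    }
  }
  return out;
}

console.log(bucket(inventory, "price", [0, 50, 100]));
// [ { _id: 0, count: 3 }, { _id: 50, count: 1 } ]
```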

 

Reshaping Documents

MongoDB 3.4 introduces stages to the aggregation pipeline that facilitate replacing documents as well as adding new fields.

Stage
Description
$addFields Adds new fields to documents. The stage outputs documents that contain all existing fields from the input documents as well as the newly added fields.
$replaceRoot   Replaces a document with the specified document. You can specify a document embedded in the input document to promote the embedded document to the top level.
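The two reshaping stages are easy to picture in plain JavaScript: $addFields keeps every existing field and appends new ones, while $replaceRoot promotes an embedded document to the top level. The document and field names below are hypothetical:

```javascript
const doc = { _id: 1, quizzes: [8, 9], extra: { grade: "A", passed: true } };

// $addFields: { totalQuiz: { $sum: "$quizzes" } } — existing fields survive.
const withTotal = { ...doc, totalQuiz: doc.quizzes.reduce((a, b) => a + b, 0) };
console.log(withTotal.totalQuiz); // 17

// $replaceRoot: { newRoot: "$extra" } — the embedded doc becomes the document.
const promoted = doc.extra;
console.log(promoted); // { grade: 'A', passed: true }
```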

Count

MongoDB 3.4 introduces a new stage to the aggregation pipeline that facilitates counting document.

Stage
Description
$count   Returns a document that contains a count of the number of documents input to the stage.

Operators

Array Operators

Operator
Description
$in Returns a boolean that indicates if a specified value is in an array.
$indexOfArray    Searches an array for an occurrence of a specified value and returns the array index (zero-based) of the first occurrence.
$range Returns an array whose elements are a generated sequence of numbers.
$reverseArray Returns an output array whose elements are those of the input array but in reverse order.
$reduce Takes an array as input and applies an expression to each item in the array to return the final result of the expression.
$zip Returns an output array where each element is itself an array, consisting of elements of the corresponding array index position from the input arrays.
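A few of these array operators map directly onto familiar JavaScript idioms; this plain-JS sketch (sample inputs are hypothetical) shows the shape of each result:

```javascript
// $reduce: apply an expression across the array to produce one value.
const total = [1, 2, 3, 4].reduce((acc, x) => acc + x, 0);
console.log(total); // 10

// $zip: pair up elements by index position across the input arrays.
function zip(a, b) {
  return a.map((x, i) => [x, b[i]]);
}
console.log(zip(["a", "b"], [1, 2])); // [ [ 'a', 1 ], [ 'b', 2 ] ]

// $reverseArray: same elements, reversed order (copy first, like the stage).
console.log([1, 2, 3].slice().reverse()); // [ 3, 2, 1 ]
```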

Date Operators

Operator
Description
$isoDayOfWeek   Returns the ISO 8601 weekday number, ranging from 1 (for Monday) to 7 (for Sunday).
$isoWeek Returns the ISO 8601 week number, which can range from 1 to 53. Week numbers start at 1 with the week (Monday through Sunday) that contains the year’s first Thursday.
$isoWeekYear Returns the ISO 8601 year number, where the year starts on the Monday of week 1 (ISO 8601) and ends with the Sunday of the last week (ISO 8601).
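The ISO numbering convention behind $isoDayOfWeek differs from JavaScript's own: Date.getUTCDay() numbers Sunday as 0, while ISO 8601 runs Monday = 1 through Sunday = 7. A plain-JS sketch of the conversion (the helper name is ours):

```javascript
// Map JS's Sun=0..Sat=6 numbering onto ISO's Mon=1..Sun=7.
function isoDayOfWeek(date) {
  return ((date.getUTCDay() + 6) % 7) + 1;
}

console.log(isoDayOfWeek(new Date("2017-02-23T00:00:00Z"))); // 4 (Thursday)
console.log(isoDayOfWeek(new Date("2017-02-26T00:00:00Z"))); // 7 (Sunday)
```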

String Operators

Operator
Description
$indexOfBytes   Searches a string for an occurrence of a substring and returns the UTF-8 byte index (zero-based) of the first occurrence.
$indexOfCP Searches a string for an occurrence of a substring and returns the UTF-8 code point index (zero-based) of the first occurrence.
$split Splits a string by a specified delimiter into string components and returns an array of the string components.
$strLenBytes Returns the number of UTF-8 bytes for a string.
$strLenCP Returns the number of UTF-8 code points for a string.
$substrBytes Returns the substring of a string. The substring starts with the character at the specified UTF-8 byte index (zero-based) in the string for the length specified.
$substrCP Returns the substring of a string. The substring starts with the character at the specified UTF-8 code point index (zero-based) in the string for the length specified.
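The bytes-versus-code-points distinction that separates the *Bytes and *CP operators is easy to demonstrate in Node (illustrative only; "é" is one code point but two UTF-8 bytes):

```javascript
const s = "café";

const byteLen = Buffer.byteLength(s, "utf8"); // $strLenBytes-style count
const cpLen = [...s].length;                  // $strLenCP-style count

console.log(byteLen); // 5
console.log(cpLen);   // 4
```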

Others/Misc

Other new operators:

$switch: Evaluates, in sequential order, the case expressions of the specified branches to enter the first branch for which the case expression evaluates to “true”.

$collStats: Returns statistics regarding a collection or view.

$type: Returns a string which specifies the BSON type of the argument.

$project: Adds support for field exclusion in the output document. Previously, you could only exclude the _id field in the stage.
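The first-true-branch-wins behavior of $switch can be sketched in plain JavaScript (the helper and sample branches are hypothetical, not an actual pipeline):

```javascript
// Evaluate branch cases in order; return the first matching branch's
// value, falling back to a default, just as $switch does.
function switchExpr(branches, dflt, doc) {
  for (const { caseFn, thenVal } of branches) {
    if (caseFn(doc)) return thenVal;
  }
  return dflt;
}

const grade = switchExpr(
  [
    { caseFn: d => d.score >= 90, thenVal: "A" },
    { caseFn: d => d.score >= 80, thenVal: "B" },
  ],
  "F",
  { score: 85 }
);
console.log(grade); // B
```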

Views

MongoDB 3.4 adds support for creating read-only views from existing collections or other views. To specify or define a view, MongoDB 3.4 introduces:

    • the viewOn and pipeline options to the existing create command:
      • db.runCommand( { create: <view>, viewOn: <source>, pipeline: <pipeline> } )
    • or if specifying a default collation for the view:
      • db.runCommand( { create: <view>, viewOn: <source>, pipeline: <pipeline>, collation: <collation> } )
    • and a corresponding mongo shell helper db.createView():
      • db.createView(<view>, <source>, <pipeline>, <collation>)

For more information on creating views, see Views.

by David Murphy at February 23, 2017 09:36 PM

February 22, 2017

MariaDB AB

How MariaDB ColumnStore Handles Big Data Workloads – Data Loading and Manipulation


MariaDB ColumnStore is a massively parallel, scale-out columnar database. Data loading and modification behave somewhat differently than in a row-based engine. This article outlines the options available and how they affect performance.

Data Loading and Manipulation Options

MariaDB ColumnStore provides several options for writing data:

  1. DML operations: insert, update, delete
  2. Bulk DML: INSERT INTO … SELECT
  3. MariaDB Server bulk file load: LOAD DATA INFILE
  4. ColumnStore bulk data load: cpimport
  5. ColumnStore bulk delete: ColumnStore partition drop.

 

DML Operations

ColumnStore supports transactionally consistent insert, update, and delete statements using standard syntax. Performance of individual statements will be significantly slower than you’d expect with row-based engines such as InnoDB. This is due to the system being optimized for block writes, and to the fact that a column-based change must touch multiple underlying files. Updates in general will be faster, since they are performed inline and only on the updated columns.

 

Bulk DML

INSERT INTO … SELECT statements where the destination table is ColumnStore are optimized by default to internally convert to the cpimport utility executing in mode 1, which offers greater performance than raw DML operations. This can be a useful capability for migrating a non-ColumnStore table. For further details please refer to the following knowledge base article: https://mariadb.com/kb/en/mariadb/columnstore-batch-insert-mode/

 

LOAD DATA INFILE

The LOAD DATA INFILE command can also be used; it is similarly optimized by default to utilize cpimport mode 1 rather than DML operations for better performance, which can be useful for compatibility purposes. However, even greater performance (approximately 2x) and flexibility are provided by utilizing cpimport directly. For further details please refer to the following knowledge base article: https://mariadb.com/kb/en/mariadb/columnstore-load-data-infile/

 

cpimport

The cpimport utility is the fastest and most flexible data loading utility for ColumnStore. It works directly with the PM WriteEngine processes, eliminating many of the overheads of the prior options. cpimport is designed to work with either delimited text files or delimited data provided via stdin. The latter option provides for some simple integration capabilities, such as streaming a query result from another database directly into cpimport. Multiple tables can be loaded in parallel, and a separate utility, colxml, is provided to help automate this. For further details please refer to the following knowledge base article: https://mariadb.com/kb/en/mariadb/columnstore-bulk-data-loading/

 

The utility can operate in different modes as designated by the -m flag (default 1):

 

Mode 1 - Centralized Trigger, Distributed Loading

The data to be loaded is provided as one input file on a single server. The data is then divided and distributed evenly to each of the PM nodes for loading. The extent map is referenced to aim for even data distribution. In addition, the -P argument can be utilized to send the data to specific PM nodes, which allows for the finer-grained control of modes 2 and 3 while preserving centralized loading.

Mode 2 - Centralized Trigger, Local Loading

In this mode, the data must be pre-divided and pushed to each PM server. The load on each server is triggered from a central location, which runs a local cpimport on each PM server.

Mode 3 - Local Trigger, Local Loading

This mode allows for loading data individually per PM node across some or all of the nodes. The load is triggered from the PM server and runs cpimport locally on that PM node only.

Modes 2 and 3 allow for more direct control of where data is loaded, and in what order within the corresponding extents; however, care needs to be taken to maintain even distribution across nodes. This direct control does allow for explicit partitioning by PM node: for example, with 3 nodes you could have one node with only Americas data, one with EMEA, and one with APAC data. Local query can be enabled to allow querying a PM directly, limiting results to that region’s data, while still allowing queries over all data from the UM level.

 

Partition Drop

ColumnStore provides a mechanism to support bulk delete by extents. An extent can be dropped by partition id or by using a value range corresponding to the minimum and maximum values for the extents to be dropped. This allows for an effective and fast purging mechanism: if the data has an increasing date-based column, the minimum and maximum values for that column’s extents form a (potentially overlapping) range-based partitioning scheme, and data can be dropped by specifying the range of values to be removed. This can form a very effective information lifecycle management strategy, removing old data by partition range. For further details please refer to the following knowledge base article: https://mariadb.com/kb/en/mariadb/columnstore-partition-management/

 

Transactional Consistency

MariaDB ColumnStore provides read committed transaction isolation. Changes to the data whether performed through DML or bulk import are always applied such that reads are not blocked and other transactions maintain a consistent view of the prior data until the data is successfully committed.

The cpimport utility interacts with a high water mark value for each column. All queries will only read below the high water mark and cpimport will insert new rows above the high water mark. When the load is completed the high water mark is updated atomically.


SQL DML operations utilize a block based MVCC architecture to provide for transactional consistency. Other transactions will read blocks at a particular version while the uncommitted version is maintained in a version buffer.



by david_thompson_g at February 22, 2017 11:34 PM

Peter Zaitsev

Webinar Thursday, February 23, 2017: Troubleshooting MySQL Access Privileges Issues


Please join Sveta Smirnova, Percona’s Principal Technical Services Engineer, as she presents Troubleshooting MySQL Access Privileges Issues on February 23, 2017 at 11:00 am PST / 2:00 pm EST (UTC-8).

Do you have registered users who can’t connect to the MySQL server? Strangers modifying data to which they shouldn’t have access?

MySQL supports a rich set of user privilege options and allows you to fine tune access to every object in the server. The latest versions support authentication plugins that help to create more access patterns.

However, finding errors in such a big set of options can be problematic. This is especially true for environments with hundreds of users, all with different privileges on multiple objects. In this webinar, I will show you how to decipher error messages and unravel the complicated setups that can lead to access errors. We will also cover network errors that mimic access privileges errors.

In this webinar, we will discuss:

  • Which privileges MySQL supports
  • What GRANT statements are
  • How privileges are stored
  • How to find out why a privilege does not work properly
  • How authentication plugins make a difference
  • What the best access control practices are

To register for this webinar please click here.

Sveta Smirnova, Principal Technical Services Engineer

Sveta joined Percona in 2015. Her main professional interests are problem solving, working with tricky issues and bugs, finding patterns that can solve typical issues quicker, and teaching others how to deal with MySQL issues, bugs and gotchas effectively. Before joining Percona, Sveta worked as a Support Engineer in the MySQL Bugs Analysis Support Group at MySQL AB-Sun-Oracle. She is the author of the book “MySQL Troubleshooting” and of the JSON UDF functions for MySQL.

by Dave Avery at February 22, 2017 08:50 PM

Percona Monitoring and Management (PMM) Graphs Explained: MongoDB with RocksDB


This post is part of the series of Percona’s MongoDB 3.4 bundle release blogs. In mid-2016, Percona Monitoring and Management (PMM) added support for RocksDB with MongoDB, also known as “MongoRocks.” In this blog, we will go over the Percona Monitoring and Management (PMM) 1.1.0 version of the MongoDB RocksDB dashboard, how PMM is useful in the day-to-day monitoring of MongoDB, and what we plan to add and extend.

Percona Monitoring and Management (PMM)

Percona Monitoring and Management (PMM) is an open-source platform for managing and monitoring MySQL and MongoDB, developed by Percona on top of open-source technology. Behind the scenes, the graphing features this article covers use Prometheus (a popular time-series data store), Grafana (a popular visualization tool), mongodb_exporter (our MongoDB database metric exporter) plus other technologies to provide database and operating system metric graphs for your database instances.

The mongodb_exporter tool, which provides our monitoring platform with MongoDB metrics, uses RocksDB status output and optional counters to provide detailed insight into RocksDB performance. Percona’s MongoDB 3.4 release enables RocksDB’s optional counters by default. On 3.2, however, you must set the following in /etc/mongod.conf to enable this:

storage.rocksdb.counters: true

This article shows a live demo of our MongoDB RocksDB graphs: https://pmmdemo.percona.com/graph/dashboard/db/mongodb-rocksdb.

RocksDB/MongoRocks

RocksDB is a storage engine available since version 3.2 in Percona’s fork of MongoDB: Percona Server for MongoDB.

The first thing to know about monitoring RocksDB is compaction. RocksDB stores its data on disk using several tiered levels of immutable files. Changes written to disk go to the first RocksDB level (Level0). When Level0 fills, internal compactions merge the changes down to the next RocksDB level. Each level before the last essentially holds deltas to the resting data set that soon merge down to the bottom.

We can see the effect of the tiered levels in our “RocksDB Compaction Level Size” graph, which reflects the size of each level in RocksDB on-disk:

RocksDB

Note that most of the database data is in the final level “L6” (Level 6). Levels L0, L4 and L5 hold relatively smaller amounts of data changes. These get merged down to L6 via compaction.

More about this design is explained in detail by the developers of MongoRocks, here: https://www.percona.com/live/plam16/sessions/everything-you-wanted-know-about-mongorocks.

RocksDB Compaction

Most importantly, RocksDB compactions try to happen in the background. They generally do not “block” the database. However, the additional resource usage of compactions can potentially cause some spikes in latency, making compaction important to watch. When compactions occur, between levels L4 and L5 for example, L4 and L5 are read and merged with the result being written out as a new L5.

The memtable in MongoRocks is a 64MB in-memory table. Changes initially get written to the memtable. Reads check the memtable to see if there are unwritten changes to consider. When the memtable is 100% full, RocksDB performs a compaction of the memtable data to Level0, the first on-disk level in RocksDB.

In PMM we have added a single-stat panel for the percentage of the memtable usage. This is very useful in indicating when you can expect a memtable-to-level0 compaction to occur:

Above we can see the memtable is 125% used, which means RocksDB is late to finish (or start) a compaction due to high activity. Shortly after taking this screenshot above, however, our test system began a compaction of the memtable and this can be seen at the drop in active memtable entries below:

RocksDB

Following this compaction further through PMM’s graphs, we can see from the (very useful) “RocksDB Compaction Time” graph that this compaction took 5 seconds.

In the graph above, I have singled-out “L0” to show Level0’s compaction time. However, any level can be selected either per-graph (by clicking on the legend-item) or dashboard-wide (by using the RocksDB Level drop-down at the top of the page).

In terms of throughput, we can see from our “RocksDB Write Activity” graph (Read Activity is also graphed) that this compaction required about 33MBps of disk write activity:

On top of additional resource consumption such as the write activity above, compactions cause caches to get cleared. One example is the OS cache due to new level files being written. These factors can cause some increases to read latencies, demonstrated in this example below by the bump in L4 read latency (top graph) caused by the L4 compaction (bottom graph):

This pattern above is one area to check if you see latency spikes in RocksDB.

RocksDB Stalls

When RocksDB is unable to perform compaction promptly, it uses a feature called “stalls” to try and slow down the amount of data coming into the engine. In my experience, stalls almost always mean something below RocksDB is not up to the task (likely the storage system).

Here is the “RocksDB Stall Time” graph of a host experiencing frequent stalls:

PMM can graph the different types of RocksDB stalls in the “RocksDB Stalls” graph. In our case here, we have 0.3-0.5 stalls per second due to “level0_slowdown” and “level0_slowdown_with_compaction.” This happens when Level0 stalls the engine due to slow compaction performance below its level.

Another metric reflecting the poor compaction performance is the pending compactions in “RocksDB Pending Operations”:

As I mentioned earlier, this almost always means something below RocksDB itself cannot keep up. In the top-right of PMM, we have OS-level metrics in a drop-down; I recommend you look at “Disk Performance” in these scenarios:

On the “Disk Performance” dashboard you can see the “sda” disk has an average write time of 212ms, and a max of 1100ms (1.1 seconds). This is fairly slow.

Further, on the same dashboard I can see the CPU is waiting on disk I/O 98.70% of the time on average. This explains why RocksDB needs to stall to hold back some of the load!

The disks seem too busy to keep up! The “Mongod – Document Activity” graph explains the cause of the high disk usage: 10,000-60,000 inserts per second:

Here we can draw the conclusion that this volume of inserts on this system configuration causes some stalling in RocksDB.

RocksDB Block Cache

The RocksDB Block Cache is the in-heap cache RocksDB uses to cache uncompressed pages. Generally, deployments benefit from dedicating most of their memory to the Linux file system cache vs. the RocksDB Block Cache. We recommend using only 20-30% of the host RAM for block cache.

PMM can take away some of the guesswork with the “RocksDB Block Cache Hit Ratio” graph, showing the efficiency of the block cache:

It is difficult to define a “good” and “bad” number for this metric, as the number varies for every deployment. However, one important thing to look for is significant changes in this graph. In this example, the Block Cache has a page in cache 3000 times for every 1 time it does not.

If you wanted to test increasing your block cache, this graph becomes very useful. If you increase your block cache and do not see an improvement in the hit ratio after a lengthy period of testing, this usually means more block cache memory is not necessary.

RocksDB Read Latency Graphs

PMM graphs Read Latency metrics for RocksDB in several different graphs, one dedicated to Level0:

And three other graphs display Average, 99th Percentile and Maximum latencies for each RocksDB level. Here is an example from the 99th Percentile latency metrics:

Coming Soon

Percona Monitoring and Management needs to add some more metrics that explain the performance of the engine. The rate of deletes/tombstones in the system affects RocksDB’s performance. Currently, this metric is not something our system can easily gather like other engine metrics. Percona Monitoring and Management can’t easily graph the efficiency of the Bloom filter yet, either. These are currently open feature requests to the MongoRocks (and likely RocksDB) team(s) to add in future versions.

Percona’s release of Percona Server for MongoDB 3.4 includes a new, improved version of MongoRocks and RocksDB. More is available in the release notes!

by Tim Vaillancourt at February 22, 2017 08:36 PM

Percona XtraBackup 2.4.6 is Now Available

CVE-2016-6225

Percona announces the GA release of Percona XtraBackup 2.4.6 on February 22, 2017. You can download it from our download site and from our apt and yum repositories.

Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, Percona XtraBackup drives down backup costs while providing unique features for MySQL backups.

New features:
  • Percona XtraBackup implemented a new --remove-original option that can be used to remove the encrypted and compressed files once they’ve been decrypted/decompressed.
Bugs Fixed:
  • XtraBackup was using the username set for the server in a configuration file, even if a different user was defined in the user’s configuration file. Bug fixed #1551706.
  • Incremental backups did not include xtrabackup_binlog_info and xtrabackup_galera_info files. Bug fixed #1643803.
  • In case a warning was written to stdout instead of stderr during a streaming backup, it could cause an assertion failure in xbstream. Bug fixed #1647340.
  • xtrabackup --move-back option did not always restore out-of-datadir tablespaces to their original directories. Bug fixed #1648322.
  • innobackupex and xtrabackup scripts were showing the password in the ps output when it was passed as a command line argument. Bug fixed #907280.
  • Incremental backup would fail with a path like ~/backup/inc_1 because xtrabackup didn’t properly expand the tilde. Bug fixed #1642826.
  • Fixed missing dependency check for Perl Digest::MD5 in rpm packages. This will now require perl-MD5 package to be installed from EPEL repositories on CentOS 5 and CentOS 6 (along with libev). Bug fixed #1644018.
  • Percona XtraBackup now supports -H, -h, -u and -p shortcuts for --hostname, --datadir, --user and --password respectively. Bugs fixed #1655438 and #1652044.

Release notes with all the bugfixes for Percona XtraBackup 2.4.6 are available in our online documentation. Please report any bugs to the launchpad bug tracker.

by Hrvoje Matijakovic at February 22, 2017 06:49 PM

Percona XtraBackup 2.3.7 is Now Available


Percona announces the release of Percona XtraBackup 2.3.7 on February 22, 2017. Downloads are available from our download site or from the Percona Software Repositories.

Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, Percona XtraBackup drives down backup costs while providing unique features for MySQL backups.

This release is the current GA (Generally Available) stable release in the 2.3 series.

New Features
  • Percona XtraBackup has implemented a new --remove-original option that can be used to remove the encrypted and compressed files once they’ve been decrypted/decompressed.
Bugs Fixed:
  • XtraBackup was using the username set for the server in a configuration file, even if a different user was defined in the user’s configuration file. Bug fixed #1551706.
  • Incremental backups did not include xtrabackup_binlog_info and xtrabackup_galera_info files. Bug fixed #1643803.
  • Percona XtraBackup would fail to compile with -DWITH_DEBUG and -DWITH_SSL=system options. Bug fixed #1647551.
  • xtrabackup --move-back option did not always restore out-of-datadir tablespaces to their original directories. Bug fixed #1648322.
  • innobackupex and xtrabackup scripts were showing the password in the ps output when it was passed as a command line argument. Bug fixed #907280.
  • Incremental backup would fail with a path like ~/backup/inc_1 because xtrabackup didn’t properly expand the tilde. Bug fixed #1642826.
  • Fixed missing dependency check for Perl Digest::MD5 in rpm packages. This will now require perl-MD5 package to be installed from EPEL repositories on CentOS 5 and CentOS 6 (along with libev). Bug fixed #1644018.
  • Percona XtraBackup now supports -H, -h, -u and -p shortcuts for --hostname, --datadir, --user and --password respectively. Bugs fixed #1655438 and #1652044.

Other bugs fixed: #1655278.

Release notes with all the bugfixes for Percona XtraBackup 2.3.7 are available in our online documentation. Bugs can be reported on the launchpad bug tracker.

by Hrvoje Matijakovic at February 22, 2017 06:48 PM

February 21, 2017

Peter Zaitsev

Percona Monitoring and Management (PMM) Upgrade Guide

Percona Monitoring and Management

This post is part of a series of Percona’s MongoDB 3.4 bundle release blogs. The purpose of this blog post is to demonstrate current best practices for an in-place Percona Monitoring and Management (PMM) upgrade. Following this method allows you to retain data previously collected by PMM in your MySQL or MongoDB environment, while upgrading to the latest version.

Step 1: Housekeeping

Before beginning this process, I recommend that you use a package manager that installs directly from Percona’s official software repository. The install instructions vary by distro, but for Ubuntu users the commands are:

wget https://repo.percona.com/apt/percona-release_0.1-4.$(lsb_release -sc)_all.deb

sudo dpkg -i percona-release_0.1-4.$(lsb_release -sc)_all.deb

Step 2: PMM Server Upgrade

Now that we have ensured we’re using Percona’s official software repository, we can continue with the upgrade. To check which version of PMM server is running, execute the following command on your PMM server host:

docker ps

This command shows a list of all running Docker containers. The version of PMM Server you are running appears as the tag of the image in the IMAGE column.
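The image name reported by docker ps has the form repository:tag, so the version is simply the tag (`docker ps --format '{{.Image}}'` prints just that column). A small sketch of pulling the tag out of an example image string, which is illustrative rather than what your host will necessarily report:

```shell
# Example entry from the IMAGE column of docker ps.
image="percona/pmm-server:1.0.6"
# Strip everything up to the last ':' to leave just the version tag.
echo "${image##*:}"   # → 1.0.6
```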


Once you’ve verified you are on an older version, it’s time to upgrade!

The first step is to stop and remove your docker pmm-server container with the following command:

docker stop pmm-server && docker rm pmm-server

Please note that this command may take several seconds to complete.

The next step is to create and run the image with the new version tag. In this case, we are installing version 1.1.0. Please make sure to verify the correct image name in the install instructions.

Run the command below to create and run the new image.

docker run -d \
  -p 80:80 \
  --volumes-from pmm-data \
  --name pmm-server \
  --restart always \
  percona/pmm-server:1.1.0


We can confirm our new image is running with the following command:

docker ps


As you can see, the latest version of PMM server is installed. The final step in the process is to update the PMM client on each host to be monitored.

Step 3: PMM Client Upgrade

The GA version of Percona Monitoring and Management supports in-place upgrades, and instructions can be found in our documentation. On the client side, update the local apt cache and upgrade to the new version of pmm-client by running the following commands:

apt-get update


apt-get install pmm-client

Congrats! We’ve successfully upgraded to the latest PMM version. There is a slight gap in our polling data due to the downtime needed for the upgrade, but we have verified that the data that existed prior to the upgrade is still available and that new data is being gathered.


Conclusion

I hope this blog post has given you the confidence to do an in-place Percona Monitoring and Management upgrade. As always, please submit your feedback on our forums with regards to any PMM-related suggestions or questions. Our goal is to make PMM the best-available open-source MySQL and MongoDB monitoring tool.

by Barrett Chambers at February 21, 2017 10:53 PM

Webinar Wednesday February 22, 2017: Percona Server for MongoDB 3.4 Product Bundle Release

Percona Server for MongoDB

Join Percona’s MongoDB Practice Manager David Murphy on Wednesday, February 22, 2017 at 10:00 am PST / 1:00 pm EST (UTC-8) as he reviews and discusses the Percona Server for MongoDB, Percona Monitoring and Management (PMM) and Percona Toolkit product bundle release.

The webinar covers how this new bundled release ensures a robust, secure database that can be adapted to changing business requirements. It demonstrates how MongoDB, PMM and Percona Toolkit are used together so that organizations benefit from the cost savings and agility provided by free and proven open source software.

Percona Server for MongoDB 3.4 delivers all the latest MongoDB 3.4 Community Edition features, additional Enterprise features and a greater choice of storage engines.

Along with improved insight into the database environment, the solution provides enhanced control options for optimizing a wider range of database workloads with greater reliability and security.

Some of the features that will be discussed are:

  • Percona Server for MongoDB 3.4
    • All the features of MongoDB Community Edition 3.4, which provides an open source, fully compatible, drop-in replacement:
      • Integrated, pluggable authentication with LDAP to provide a centralized enterprise authentication service
      • Open-source auditing for visibility into user and process actions in the database, with the ability to redact sensitive information (such as user names and IP addresses) from log files
      • Hot backups for the WiredTiger engine protect against data loss in the case of a crash or disaster, without impacting performance
      • Two storage engine options not supported by MongoDB Community Edition 3.4:
        • MongoRocks, the RocksDB-powered storage engine, designed for demanding, high-volume data workloads such as in IoT applications, on-premises or in the cloud.
        • Percona Memory Engine is ideal for in-memory computing and other applications demanding very low latency workloads.
  • Percona Monitoring and Management 1.1
    • Support for MongoDB and Percona Server for MongoDB
    • Graphical dashboard information for WiredTiger, MongoRocks and Percona Memory Engine
  • Percona Toolkit 3.0
    • Two new tools for MongoDB:
      • pt-mongodb-summary (the equivalent of pt-mysql-summary) provides a quick, at-a-glance overview of a MongoDB and Percona Server for MongoDB instance.
      • pt-mongodb-query-digest (the equivalent of pt-query-digest for MySQL) offers a query review for troubleshooting.

You can register for the webinar here.

David Murphy, MongoDB Practice Manager

David joined Percona in October 2015 as Practice Manager for MongoDB. Prior to that, he joined the ObjectRocket by Rackspace team as Lead DBA in September 2013. With the growth typical of a recently acquired startup, David’s role covered a wide range of duties: evangelism, research, run book development, knowledge base design, consulting, technical account management, mentoring and much more.

Prior to the world of MongoDB, David was a MySQL and NoSQL architect at Electronic Arts, where he was responsible for tuning, design and technology choices for some of the largest titles in the world, such as FIFA, SimCity and Battlefield. David maintains an active interest in speaking about databases and exploring new technologies.

by Dave Avery at February 21, 2017 09:06 PM

Installing Percona Monitoring and Management (PMM) for the First Time

Percona Monitoring and Management


This post is part of a series of Percona’s MongoDB 3.4 bundle release blogs. In this blog, we’ll look at the process for installing Percona Monitoring and Management (PMM) for the first time.

Installing Percona Monitoring and Management

Percona Monitoring and Management (PMM) is Percona’s open source tool for monitoring databases. You can use it with either MongoDB or MySQL databases.

PMM requires the installation of a server and client component on each database server to be monitored. You can install the server component on a local or remote server, and monitor any MySQL or MongoDB instance (including Amazon RDS environments).

What is it?

PMM provides a graphical view of the status of monitored databases. You can use it to perform query analytics and metrics review. The graphical component relies on Grafana and uses Prometheus for information processing. It includes a Query Analytics module that allows you to analyze queries over a period of time, and it uses Orchestrator for replication management. Since the integration of these pieces is the most difficult part, the server is distributed as a preconfigured Docker image.

PMM works with any variety of MongoDB or MySQL.

How do you install it?

As mentioned, there is a server component to PMM. You can install it on any server in your database environment, but be aware that if the server on which it is installed goes down or runs out of space, monitoring also fails. The server should have at least 5G of available disk space for each monitored client.
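A back-of-the-envelope check of that sizing guideline (the client count below is hypothetical):

```shell
# At 5G of disk per monitored client, four clients need at least:
clients=4
echo "$(( clients * 5 ))G"   # → 20G
```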

You must install the client component on each monitored database server. It is available as a package for a variety of Linux distributions.

To install the server

  • Create the Docker container for PMM. This container is the storage location for all PMM data and should not be altered or removed.
  • Create and run the PMM server container
  • Verify the installation

Next, install the client on each server to be monitored. It is installed based on the Linux distribution of the server.

Last, you connect the client(s) to the PMM server and monitoring begins.

The Query Analytics tool monitors and reviews queries run in the environment. It displays the current top 10 most time intensive queries. You can click on a query to view a detailed analysis of the query.

The Metrics Monitor gives you a historical view of queries. PMM separates time-based graphs by theme for additional clarity.

PMM includes Orchestrator for replication management and visualization. You must enable it separately, then you can view the metrics on the Discover page in Orchestrator.

Ongoing administration and maintenance

PMM includes an administration tool that adds, removes or monitors services. The pmm-admin tool requires a user with root or sudo access.

You can enable HTTP password protection to add authentication when accessing the PMM Server web interface or use SSL encryption to secure traffic between PMM Client and PMM Server.

You can edit the config files located in <your Docker container>/etc/grafana to set up alerting. Log files are stored in <your Docker container>/var/log.

by Rick Golba at February 21, 2017 07:22 PM

February 20, 2017

Peter Zaitsev

MongoDB 3.4 Bundle Release: Percona Server for MongoDB 3.4, Percona Monitoring and Management 1.1, Percona Toolkit 3.0 with MongoDB

Percona Server for MongoDB

This blog post is the first in a series on Percona’s MongoDB 3.4 bundle release. This release includes Percona Server for MongoDB, Percona Monitoring and Management, and Percona Toolkit. In this post, we’ll look at the features included in the release.

We have a lot of great MongoDB content coming your way in the next few weeks. However, I first wanted to give you a quick list of the major things to be on the lookout for.

This new bundled release ensures a robust, secure database that you can adapt to changing business requirements. It helps demonstrate how organizations can use MongoDB (and Percona Server for MongoDB), PMM and Percona Toolkit together to benefit from the cost savings and agility provided by free and proven open source software.

Percona Server for MongoDB 3.4 delivers all the latest MongoDB 3.4 Community Edition features, additional Enterprise features and a greater choice of storage engines.

Some of these new features include:

  • Shard member types. All nodes now need to know what they do – this helps with reporting and architecture planning more than the underlying code, but it’s an important first step.
  • Sharding balancer moved to config server primary
  • Configuration servers must now be a replica set
  • Faster balancing (shard count/2) – concurrent balancing actions can now happen at the same time!
  • Sharding and replication tags renamed to “zones” – again, an important first step
  • Default write behavior moved to majority – this could majorly impact many workloads, but moving to a default safe write mode is important
  • New decimal data type
  • Graph aggregation functions – we will talk about these more in a later blog, but for now note that graph and faceted searches are added.
  • Collations added to most access patterns for collections and databases
  • . . .and much more
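The "faster balancing" bullet above caps concurrent balancing actions at shard count/2; a quick illustrative computation (the shard count is hypothetical):

```shell
# With 7 shards, the balancer can run floor(7/2) = 3 migrations at once.
shards=7
echo $(( shards / 2 ))   # → 3
```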

Percona Server for MongoDB includes all the features of MongoDB Community Edition 3.4, providing an open source, fully-compatible, drop-in replacement with many improvements, such as:

  • Integrated, pluggable authentication with LDAP that provides a centralized enterprise authentication service
  • Open-source auditing for visibility into user and process actions in the database, with the ability to redact sensitive information (such as user names and IP addresses) from log files
  • Hot backups for the WiredTiger engine to protect against data loss in the case of a crash or disaster, without impacting performance
  • Two storage engine options not supported by MongoDB Community Edition 3.4 (doubling the total engine count choices):
    • MongoRocks, the RocksDB-powered storage engine, designed for demanding, high-volume data workloads such as in IoT applications, on-premises or in the cloud.
    • Percona Memory Engine is ideal for in-memory computing and other applications demanding very low latency workloads.

Percona Monitoring and Management 1.1

  • Support for MongoDB and Percona Server for MongoDB
  • Graphical dashboard information for WiredTiger, MongoRocks and Percona Memory Engine
  • Cluster and replica set wide views
  • Many more graphable metrics available for both for the OS and the database layer than currently provided by other tools in the ecosystem

Percona Toolkit 3.0

  • Two new tools for MongoDB are now in Percona’s Toolkit:
    • pt-mongodb-summary (the equivalent of pt-mysql-summary) provides a quick, at-a-glance overview of a MongoDB and Percona Server for MongoDB instance
      • This is useful for any DBA who wants a general idea of what’s happening in the system, what the state of their cluster/replica set is, and more.
    • pt-mongodb-query-digest (the equivalent of pt-query-digest for MySQL) offers a query review for troubleshooting
      • Query digest is one of the most-used Toolkit features ever, and MongoDB is no different. Typically you might only look at your best and worst query times and document scans; this tool also shows 90th percentiles and whether your top 10 queries take seconds versus minutes.

For all of these topics, you will see more blogs in the next few weeks that cover them in detail. Some people have asked what Percona’s MongoDB commitment looks like. Hopefully, this series of blogs help show how improving open source databases is central to the Percona vision. We are here to make the world better for developers, DBAs and other MongoDB users.

by David Murphy at February 20, 2017 09:51 PM

Percona Toolkit 3.0.1 is now available


Percona announces the availability of Percona Toolkit 3.0.1 on February 20, 2017. This is the first general availability (GA) release in the 3.0 series, with a focus on adding MongoDB tools.

Downloads are available from the Percona Software Repositories.

NOTE: If you are upgrading using Percona’s yum repositories, make sure that you enable the basearch repo, because Percona Toolkit 3.0 is not available in the noarch repo.

Percona Toolkit is a collection of advanced command-line tools that perform a variety of MySQL and MongoDB server and system tasks too difficult or complex for DBAs to perform manually. Percona Toolkit, like all Percona software, is free and open source.

This release includes changes from the previous 3.0.0 RC and the following additional changes:

  • Added requirement to run pt-mongodb-summary as a user with the clusterAdmin or root built-in roles.

You can find release details in the release notes. Bugs can be reported on Toolkit’s launchpad bug tracker.

by Alexey Zhebel at February 20, 2017 09:50 PM

Percona Monitoring and Management 1.1.1 is now available

Percona Monitoring and Management

Percona announces the release of Percona Monitoring and Management 1.1.1 on February 20, 2017. This is the first general availability (GA) release in the PMM 1.1 series, with a focus on providing alternative deployment options for PMM Server.

NOTE: The AMI and VirtualBox images are still experimental. For production, it is recommended to run the Docker images.

The instructions for installing Percona Monitoring and Management 1.1.1 are available in the documentation. Detailed release notes are available here.

There are no changes compared to previous 1.1.0 Beta release, except small fixes for MongoDB metrics dashboards.

A live demo of PMM is available at pmmdemo.percona.com.

We welcome your feedback and questions on our PMM forum.

About Percona Monitoring and Management
Percona Monitoring and Management is an open-source platform for managing and monitoring MySQL and MongoDB performance. Percona developed it in collaboration with experts in the field of managed database services, support and consulting.

PMM is a free and open-source solution that you can run in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

by Alexey Zhebel at February 20, 2017 09:49 PM

Percona Server for MongoDB 3.4.2-1.2 is now available

Percona Server for MongoDB

Percona announces the release of Percona Server for MongoDB 3.4.2-1.2 on February 20, 2017. It is the first general availability (GA) release in the 3.4 series. Download the latest version from the Percona web site or the Percona Software Repositories.

Percona Server for MongoDB is an enhanced, open source, fully compatible, highly-scalable, zero-maintenance downtime database supporting the MongoDB v3.4 protocol and drivers. It extends MongoDB with the Percona Memory Engine and MongoRocks storage engine, as well as several enterprise-grade features.

Percona Server for MongoDB requires no changes to MongoDB applications or code.

This release is based on MongoDB 3.4.2 and includes changes from PSMDB 3.4.0 Beta and 3.4.1 RC, plus the following additional changes:

  • Fixed the audit log message format to comply with upstream MongoDB:
    • Changed params document to param
    • Added roles document
    • Fixed date and time format
    • Changed host field to ip in the local and remote documents

Percona Server for MongoDB 3.4.2-1.2 release notes are available in the official documentation.

by Alexey Zhebel at February 20, 2017 09:49 PM

February 17, 2017

MariaDB Foundation

MariaDB 10.2.4 RC now available

The MariaDB project is pleased to announce the immediate availability of MariaDB 10.2.4 Release Candidate (RC). See the release notes and changelog for details. Download MariaDB 10.2.4 Release Notes Changelog What is MariaDB 10.2? MariaDB APT and YUM Repository Configuration Generator Thanks, and enjoy MariaDB!

The post MariaDB 10.2.4 RC now available appeared first on MariaDB.org.

by Daniel Bartholomew at February 17, 2017 09:29 PM

February 16, 2017

Peter Zaitsev

MySQL Bug 72804 Workaround: “BINLOG statement can no longer be used to apply query events”

MySQL Bug 72804

In this blog post, we’ll look at a workaround for MySQL bug 72804.

Recently I worked on a ticket where a customer performed a point-in-time recovery (PITR) using a large set of binary logs. Normally we handle this by applying the last backup, then re-applying all binary logs created since that backup. In the middle of the procedure, their new server crashed. We identified the binary log position and tried to restart the PITR from there using the --start-position option, but the restore failed with the error “The BINLOG statement of type Table_map was not preceded by a format description BINLOG statement.” This is a known issue, reported as MySQL Bug #72804: “BINLOG statement can no longer be used to apply Query events.”

I created a small test to demonstrate a workaround that we implemented (and worked).

First, I ran a large import process that created several binary logs. I used a small value for max_binlog_size and tested using the database “employees” (a standard database used for testing). Then I dropped the database.

mysql> set sql_log_bin=0;
Query OK, 0 rows affected (0.33 sec)
mysql> drop database employees;
Query OK, 8 rows affected (1.25 sec)

To demonstrate the recovery process, I joined all the binary log files into one SQL file and started an import.

sveta@Thinkie:~/build/ps-5.7/mysql-test$ ../bin/mysqlbinlog var/mysqld.1/data/master.000001 var/mysqld.1/data/master.000002 var/mysqld.1/data/master.000003 var/mysqld.1/data/master.000004 var/mysqld.1/data/master.000005 > binlogs.sql
sveta@Thinkie:~/build/ps-5.7/mysql-test$ GENERATE_ERROR.sh binlogs.sql
sveta@Thinkie:~/build/ps-5.7/mysql-test$ mysql < binlogs.sql
ERROR 1064 (42000) at line 9020: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'inserting error

I intentionally generated a syntax error in the resulting file with the help of the GENERATE_ERROR.sh script (which just inserts a bogus SQL statement in a random row). The error message clearly showed where the import stopped: line 9020. I then created a file that cropped out the part that had already been imported (lines 1-9020), and tried to import this new file.

sveta@Thinkie:~/build/ps-5.7/mysql-test$ tail -n +9021 binlogs.sql >binlogs_rest.sql
sveta@Thinkie:~/build/ps-5.7/mysql-test$ mysql < binlogs_rest.sql
ERROR 1609 (HY000) at line 134: The BINLOG statement of type `Table_map` was not preceded by a format description BINLOG statement.

Again, the import failed with exactly the same error as the customer’s. The reason for this error is that the BINLOG statement – which applies changes from the binary log – expects the format description event to be run in the same session as the binary log import, and before it. The format description event was present at the start of the original import, which failed at line 9020; the later import (from line 9021 on) doesn’t contain it.
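As a quick sanity check of the tail -n +N arithmetic used above: output starts at line N itself, so +9021 drops exactly lines 1-9020.

```shell
# tail -n +3 keeps the input from line 3 onward.
printf '1\n2\n3\n4\n5\n' | tail -n +3   # → prints 3, 4, 5
```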

Fortunately, the format description event is identical for a given server version. We can simply take it from the beginning of the SQL log file (or the original binary log file) and put it into the file created after the crash, the one without lines 1-9020.

With MySQL versions 5.6 and 5.7, this event is located in the first 11 rows:

sveta@Thinkie:~/build/ps-5.7/mysql-test$ head -n 11 binlogs.sql | cat -n
     1	/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
     2	/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
     3	DELIMITER /*!*/;
     4	# at 4
     5	#170128 17:58:11 server id 1  end_log_pos 123 CRC32 0xccda074a 	Start: binlog v 4, server v 5.7.16-9-debug-log created 170128 17:58:11 at startup
     6	ROLLBACK/*!*/;
     7	BINLOG '
     8	g7GMWA8BAAAAdwAAAHsAAAAAAAQANS43LjE2LTktZGVidWctbG9nAAAAAAAAAAAAAAAAAAAAAAAA
     9	AAAAAAAAAAAAAAAAAACDsYxYEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA
    10	AUoH2sw=
    11	'/*!*/;

Rows 1-6 are meta information (comments and a ROLLBACK), and rows 7-11 are the format description event itself. The only thing we need to prepend to our resulting file is these 11 lines:

sveta@Thinkie:~/build/ps-5.7/mysql-test$ head -n 11 binlogs.sql > binlogs_rest_with_format.sql
sveta@Thinkie:~/build/ps-5.7/mysql-test$ cat binlogs_rest.sql >> binlogs_rest_with_format.sql
sveta@Thinkie:~/build/ps-5.7/mysql-test$ mysql < binlogs_rest_with_format.sql
sveta@Thinkie:~/build/ps-5.7/mysql-test$

After this, the import succeeded!

by Sveta Smirnova at February 16, 2017 11:39 PM

Percona Blog Poll Results: What Programming Languages Are You Using for Backend Development?

Programming Languages

In this blog we’ll look at the results from Percona’s blog poll on what programming languages you’re using for backend development.

Late last year we started a poll on what backend programming languages are being used by the open source community. The three components of the backend – server, application, and database – are what make a website or application work. Below are the results of Percona’s poll on backend programming languages in use by the community:


One of the best-known and earliest web service stacks is the LAMP stack, which stands for Linux, Apache, MySQL and PHP/Perl/Python. We can see that this early model is still popular when it comes to the backend.

PHP still remains a very common choice for a backend programming language, with Python moving up the list as well. Perl seems to be fading in popularity, despite being used a lot in the MySQL world.

Java is also showing signs of strength, demonstrating the strides MySQL is making in enterprise applications. We can also see JavaScript is increasingly getting used not only as a front-end programming language, but also as back-end language with the Node.JS framework.

Finally, Go is a language to look out for. Go is an open source programming language created by Google. It first appeared in 2009, and is already more popular than Perl or Ruby according to this poll.

Thanks to the community for participating in our poll. You can take our latest poll, on which database engine you are using to store time series data, here.

by Peter Zaitsev at February 16, 2017 09:53 PM

MariaDB at Percona Live Open Source Database Conference 2017

MariaDB at Percona Live

In this blog, we’ll look at how we plan to represent MariaDB at Percona Live.

The MariaDB Corporation is organizing a conference called M17 on the East Coast in April. Some Perconians (Peter Zaitsev, Vadim Tkachenko, Sveta Smirnova, Alex Rubin, Colin Charles) decided to submit some interesting talks for that conference. Percona also offered to sponsor the conference.

As of this post, the talks haven’t been accepted, and we were politely told that we couldn’t sponsor.

Some of the proposed talks were:

  • MariaDB Backup with Percona XtraBackup (Vadim Tkachenko)
  • Managing MariaDB Server operations with Percona Toolkit (Colin Charles)
  • MariaDB Server Monitoring with Percona Monitoring and Management (Peter Zaitsev)
  • Securing your MariaDB Server/MySQL data (Colin Charles, Ronald Bradford)
  • Data Analytics with MySQL, Apache Spark and Apache Drill (Alexander Rubin)
  • Performance Schema for MySQL and MariaDB Troubleshooting (Sveta Smirnova)

At Percona, we think MariaDB Server is an important part of the MySQL ecosystem. This is why the Percona Live Open Source Database Conference 2017 in Santa Clara has a MariaDB mini-track, consisting of talks from various Percona and MariaDB experts.

If any of these topics look enticing, come to the conference. We have MariaDB at Percona Live.

To make your decision easier, we’ve created a special promo code that gets you $75 off a full conference pass! Just use MariaDB@PL17 at checkout.

In the meantime, we will continue to write and discuss MariaDB, and any other open source database technologies. The power of the open source community is the free exchange of ideas, healthy competition and open dialog within the community.

Here are some more past presentations that are also relevant.

by Dave Avery at February 16, 2017 06:08 PM

MariaDB AB

Making MaxScale Dynamic: New in MariaDB MaxScale 2.1

markusmakela, Thu, 02/16/2017 - 12:43


The beta version of the upcoming MaxScale 2.1 has been released. This new release takes MaxScale to the next level by introducing functionality that makes it possible to change parts of MaxScale’s configuration at runtime.

How Has MaxScale Changed?

One of the big new features in 2.1 is the ability to add, remove and alter server definitions at runtime. By making the server list dynamic and decoupling the servers from the services that use them, MaxScale can be defined as a more abstract service instead of a concrete service comprised of a pre-defined set of servers. This opens up new use cases for MaxScale, such as template configurations, that make the use and maintenance of MaxScale a lot easier.


The logical extension to being able to add new servers is the ability to add new monitors. After all, the monitors are just an abstraction of a cluster of servers. What this gives us is the capability to radically change the cluster layout by bringing in a whole new cluster of servers. With this, it is possible to do a runtime migration from one cluster to another by simply adding a new monitor, then adding the servers that it monitors and finally swapping the old cluster to the new one.
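A hedged sketch of such a migration, using the maxadmin commands shown later in this post; the server and monitor names here are illustrative, not taken from an actual deployment:

```shell
# Bring in the new cluster: create and configure a monitor for it
maxadmin create monitor new-cluster-monitor mysqlmon
maxadmin alter monitor new-cluster-monitor user=maxuser password=maxpwd monitor_interval=1000
maxadmin restart monitor new-cluster-monitor

# Define the new servers and link them to the monitor and the service
maxadmin create server db-new-1 172.0.20.2 3306
maxadmin add server db-new-1 new-cluster-monitor rwsplit-service

# Once the new cluster is in sync, unlink the old servers from the service
maxadmin remove server db-old-1 rwsplit-service
```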


While services are somewhat of a more static concept, the network ports that they listen on aren’t. For this reason, we added the ability to add new listeners for services. What this means is that you can extend the set of ports that a service listens on at runtime. One example of how this could be used is to connect a new appliance which uses a different port to MaxScale. This removes some of the burden from the developer of the appliance and offloads it to MaxScale.


How to Use It?


A practical use case for the new features in 2.1 is demonstrated by one of the tests in our testing framework. It starts MaxScale with a minimal configuration similar to the following.

[maxscale]
threads=4


[rwsplit-service]
type=service
router=readwritesplit
user=maxskysql
passwd=skysql


[CLI]
type=service
router=cli


[CLI Listener]
type=listener
service=CLI
protocol=maxscaled
socket=default


Next it defines a monitor.

maxadmin create monitor cluster-monitor mysqlmon
maxadmin alter monitor cluster-monitor user=maxuser password=maxpwd monitor_interval=1000
maxadmin restart monitor cluster-monitor


Since the monitors require some configuration before they can be used, they are created in a stopped state. This allows the user to configure the credentials that the monitor will use when it connects to the servers in the cluster. Once those are set, the monitor is started and then the test proceeds to create a listener for the rwsplit-service on port 4006.

maxadmin create listener rwsplit-service rwsplit-listener 0.0.0.0 4006

The listeners require no extra configuration and are started immediately. The next step is the definition of the two servers that the service will use.

maxadmin create server db-serv-west 172.0.10.2 3306
maxadmin create server db-serv-east 172.0.10.4 3306

Once the servers are defined, they need to be linked to the monitor so that it will start monitoring the status of the servers and also to the service that will use them.

maxadmin add server db-serv-west cluster-monitor rwsplit-service
maxadmin add server db-serv-east cluster-monitor rwsplit-service

After the servers have been added to the service and the monitor, the system is ready for use. At this point, our test performs various health checks and then proceeds to scale the service down.

Dynamic Scaling

When the demand on the database cluster lowers, servers can be shut down as they are no longer needed. When demand grows, new servers can be started to handle the increased load. Combined with on-demand, cloud-based database infrastructure, MaxScale can handle changing traffic loads.

Scaling up on demand is simply done by creating new servers and adding them to the monitors and services that use them. Once the demand for the service lowers, the servers can be scaled down by removing them from the service.
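Assuming the configuration built up in the test above, a scale-up followed by a scale-down might look like this (the server name is hypothetical):

```shell
# Scale up: define a new server and link it to the monitor and the service
maxadmin create server db-serv-north 172.0.10.6 3306
maxadmin add server db-serv-north cluster-monitor rwsplit-service

# Scale down: unlink the server before decommissioning the database
maxadmin remove server db-serv-north rwsplit-service
maxadmin remove server db-serv-north cluster-monitor
```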

All of the changes done at runtime are persisted in a way that will survive a restart of MaxScale. This makes it a reliable way to adapt MaxScale to its environment.

Summary

The new features in 2.1 make MaxScale a dynamic part of the database infrastructure. The support for live changes in clusters makes MaxScale the ideal companion for your application, whether you’re using an on-premise, cloud-based or hybrid database setup.
The configuration changes in MaxScale 2.1 are backwards compatible with earlier versions of MaxScale, which makes taking it into use easy. Going forward, our goal is to make MaxScale easier for DBAs to use by providing new ways to manage it, while at the same time taking the modular ideology of MaxScale to its limits by introducing new and interesting modules.

There's More


The 2.1 release of MaxScale is packed with new features designed to supercharge your cluster. One example is the newly added concept of module commands, which allow the modules in MaxScale to expand their capabilities beyond those of the module type. Keep reading the MariaDB blog: a follow-up post will explore the new module command capabilities and take a practical look at how the dbfwfilter firewall filter module implements them.


by markusmakela at February 16, 2017 05:43 PM

MariaDB MaxScale 2.1: Power to the Modules


The 2.1 version of MaxScale introduced a new concept of module commands. This blog post takes a look at what they are, why they were implemented and where you see them when you use MaxScale.

The Idea Behind Module Commands

The core concept of module commands was driven by the need to do things with modules that couldn’t be done with all modules of that type. One example of this need is the dbfwfilter firewall filter module. The module needs to provide a mechanism for testing the validity of the rules as well as updating them on the fly. We could have introduced a reload function into the filter API, but only a few filters would ever implement it. And what if a filter needs to do even more specific tasks which do not apply to other modules of the same type?

As the type and variety of modules in MaxScale grows, so do the requirements that those modules place on the MaxScale core. At this point, we realized that we needed a way to allow modules to expose module-specific functionality to end users regardless of the module API they use. And so the idea of a module command was born.

How Module Commands Were Implemented

The module command implementation is relatively simple. It allows modules to register commands that are exposed to the user via the administrative interface. All the module needs to do is to declare the parameters the command expects, document the parameters and give a function which is called when the module command is executed. Everything else is handled by the MaxScale core.

Calling Module Commands

The user can then call these commands via the administrative interface. The use of module commands is intended to be very similar to the use of the built-in maxadmin commands, and it should feel familiar to users of MaxScale.

Two new commands were added to MaxAdmin itself: list commands and call command. The former lists all registered module commands and the latter calls a module command. The execution of call command always takes at least two arguments, the module and command names.

Example Use of Module Commands

The best way to show how module commands work is to show them in action. One of the MaxScale modules that implements module commands is the dbfwfilter database firewall.

Micro-Tutorial: Dbfwfilter Module Commands

A practical example of using module commands is the reloading of the firewall rules in the dbfwfilter module when changes occur in the database.
The module registers two commands. Here’s the output of maxadmin list commands when the dbfwfilter module has been loaded.

Command: dbfwfilter rules
Parameters: FILTER


    FILTER - Filter to inspect


Command: dbfwfilter rules/reload
Parameters: FILTER [STRING]


    FILTER - Filter to reload
    [STRING] - Path to rule file

The rules command shows diagnostic information about the active rules. The rules/reload command reloads the rules from a file and optionally changes the location of the rule file. The following filter configuration will use the rules file to detect which types of queries should be allowed. The action=allow parameter tells the filter to let through all queries that match a rule.

[Database-Firewall]
type=filter
module=dbfwfilter
action=allow
rules=/home/markusjm/rules

Using the following rule file for the dbfwfilter will only allow statements that target the name or address columns.

rule allowed_columns allow columns name address
users %@% match any rules allowed_columns

But what if we need to alter the client application to allow statements to the email column? In older versions this would require a restart of MaxScale so that the rules could be reloaded. With the 2.1 version of MaxScale, you can alter the rules and reload them at runtime. To do this, we add the email column to the rule file.

rule allowed_columns allow columns name address email
users %@% match any rules allowed_columns

Then we reload the rules by calling the rules/reload command of the dbfwfilter. The rules/reload command is invoked via MaxAdmin's call command, giving it the filter name as a parameter.

maxadmin call command dbfwfilter rules/reload Database-Firewall

Once the command is complete, the dbfwfilter has reloaded and taken the new rules into use. With this new functionality in the dbfwfilter, online modifications to both the client applications and backend databases can be made without interruptions to the service.

Summary

Module commands make MaxScale's modules easier to use, as they allow modules to perform custom actions. The example shows a practical use case that makes the use of MaxScale more convenient by leveraging the module command system.

For more information on the implementation of the module commands, read the MaxScale documentation on module commands. You can find a list of modules that implement module commands in the MariaDB MaxScale 2.1.0 Beta Release Notes.

Please try out the new module commands by downloading the MariaDB MaxScale 2.1.0 beta and give us your feedback on our mailing list maxscale@googlegroups.com or on #maxscale@FreeNode.


by markusmakela at February 16, 2017 05:41 PM

Improving the Performance of MariaDB MaxScale


Performance has been a central goal of MariaDB MaxScale’s architecture from the start. Core elements in the drive for performance are the use of asynchronous I/O and Linux’s epoll. With a fixed set of worker threads, whose number is chosen to correspond to the number of available cores, each thread should either be working or be idle, but never waiting for I/O. This should provide good performance that scales well with the number of available CPUs. However, benchmarking revealed that was not entirely the case. The performance of MaxScale did not continuously increase as more worker threads were added; after a certain point it would actually start to decrease.

When we started working on MariaDB MaxScale 2.1 we decided to investigate what the actual cause for that behaviour was and to make modifications in order to improve the situation. As will be shown below, the modifications that were made have indeed produced good results.

Benchmark

The results below were produced with sysbench OLTP in read-only mode. The variant used performs 1000 point selects per transaction and the focus is on measuring the throughput that the proxy can deliver.
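For reference, an invocation along these lines could reproduce the workload; this is a sketch using sysbench 0.5-style options, and the host, port, credentials and thread count are assumptions, not details from the post:

```shell
sysbench --test=oltp.lua \
  --mysql-host=maxscale-host --mysql-port=4006 \
  --mysql-user=maxuser --mysql-password=maxpwd \
  --oltp-read-only=on --oltp-point-selects=1000 \
  --num-threads=64 --max-time=60 \
  run
```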

In the benchmark two computers were used, both of which have 16 cores / 32 hyperthreads, 128GB RAM and an SSD drive. The computers are connected with a GBE LAN. On one of the computers there were 4 MariaDB-10.1.13 backends and on the other MariaDB MaxScale and sysbench.

MariaDB MaxScale 2.0

The following figure shows the result for MariaDB MaxScale 2.0. As the basic architecture of MaxScale has been the same from 1.0, the result is applicable to all versions up to 2.0.

2.0-1000-selects.png

In the figure above, the vertical axis shows the number of queries performed per second and the horizontal axis the number of threads used by sysbench. That number corresponds to the number of clients MariaDB MaxScale will see. In each cluster of bars, the blue bar shows the result when the backends are connected to directly, and the red, yellow and green bars when the connection goes via MaxScale running with 4, 8 and 16 threads respectively.

As long as the number of clients is small, the performance of MariaDB MaxScale is roughly the same irrespective of the number of worker threads. However, already at 32 clients it can be seen that 16 worker threads perform worse than 8 and at 64 clients that is evident. The overall performance of MaxScale is clearly below that of a direct connection.

MariaDB MaxScale 2.1

For MariaDB MaxScale 2.1 we did some rather significant changes regarding how the worker threads are used. Up until version 2.0 all threads were used in a completely symmetric way with respect to all connections, both from client to MaxScale and from MaxScale to the backends, which implied a fair amount of locking inside MaxScale. In version 2.1 a session and all its related connections are pinned to a particular worker thread. That means that there is a need for significantly less locking and that the possibility for races has been reduced.

The following figure shows the result of the same benchmark when run using MariaDB MaxScale 2.1.

2.1.0-1000-selects.png

Compared with the result of MaxScale 2.0, the performance follows and slightly exceeds the baseline of a direct connection. At 64 clients, 16 worker threads still provide the best performance but at 128 clients 8 worker threads overtake. However, as these figures were obtained with a version very close to the 2.1.0 Beta we are fairly confident that by the time 2.1 GA is released we will be able to address that as well. Based on this benchmark, the performance of 2.1 is roughly 45% better than that of 2.0.

MariaDB MaxScale and ProxySQL

ProxySQL is a MySQL proxy that in some respects is similar to MariaDB MaxScale. Out of curiosity we ran the same benchmark with ProxySQL 1.3.0, using exactly the same environment as when MaxScale was benchmarked. The following figures show how MaxScale 2.0, MaxScale 2.1 and ProxySQL 1.3.0 compare when 4, 8 and 16 threads are used.

2.0-2.1-ProxySQL-1000-selects-4thr.png

With 4 threads, ProxySQL 1.3.0 performs slightly better than MaxScale 2.0, and MaxScale 2.1 performs significantly better than both. The performance of both MaxScale 2.1 and ProxySQL 1.3.0 starts to drop at 256 client connections.

2.0-2.1-ProxySQL-1000-selects-8thr.png

With 8 threads, up until 128 client connections the order remains the same, with MaxScale 2.1 at the top. When the number of client connections grows larger than that, MaxScale 2.1 and ProxySQL 1.3.0 are roughly on par.

2.0-2.1-ProxySQL-1000-selects-16thr.png

With 16 threads, up until 64 client connections the order is still the same. When the number of client connections grows larger than that, MaxScale 2.1 and ProxySQL 1.3.0 are roughly on par. The performance of both drops slightly compared with the 8-thread situation.

Far-reaching conclusions should not be drawn from these results alone; it was just one benchmark, MaxScale 2.1 will still evolve before it is released as GA, and the version of ProxySQL was not the very latest one.

Summary

In MariaDB MaxScale 2.1 we have made significant changes to the internal architecture and benchmarking shows that the changes were beneficial for the overall performance. MaxScale 2.1 consistently performs better than MaxScale 2.0 and with 8 threads, which in this benchmark is the sweetspot for MaxScale 2.0, the performance has improved by 45%.

If you want to give MaxScale 2.1.0 Beta a go, it can be downloaded at https://downloads.mariadb.com/MaxScale/.


by Johan at February 16, 2017 05:34 PM

Introducing MariaDB MaxScale 2.1.0 Beta


We are happy to announce that MariaDB MaxScale 2.1.0 beta was released today. MariaDB MaxScale is the next-generation database proxy for MariaDB. Beta is an important time in our release cycle, and we encourage you to download this release today!

MariaDB MaxScale 2.1 introduces the following key new capabilities:

Dynamic Configuration

  • Server, Monitor and Listeners: MaxScale 2.1 supports dynamic configuration of servers, monitors and listeners. Servers, monitors and listeners can be added, modified or removed during runtime. A set of new commands were added to maxadmin.

  • Database firewall filter:  Rules can now be modified during runtime using the new module commands introduced in this release.

  • Persistent configuration changes: The runtime configuration changes are immediately applied to the running MaxScale as well as persisted using the new hierarchical configuration architecture.

Security

  • Secure Binlog Server: The binlog cache files on MaxScale can now be encrypted. MaxScale binlog server also uses SSL in communication with Master and Slave.

  • Secured single sign-on: MariaDB MaxScale now supports LDAP/GSSAPI authentication.

  • Selective Data Masking: Meet your HIPAA and PCI compliance needs by obfuscating sensitive data using the new masking filter.

  • Result set limiting: Prevent access to large sets of data with a single query by using the maxrows filter.

  • Prepared Statement filtering by database firewall: The database firewall filter now applies the filtering rules to prepared statements as well.

  • Function filtering by database firewall: The database firewall filter can now whitelist or blacklist a query based on the presence of a function.
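As an illustration of the result set limiting feature above, a service could be protected with a maxrows filter definition along these lines (the row limit and section names are illustrative; check the MaxScale 2.1 documentation for the exact parameters):

```ini
[MaxRows]
type=filter
module=maxrows
max_resultset_rows=10000

[rwsplit-service]
type=service
router=readwritesplit
filters=MaxRows
```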

Scalability

  • Aurora Cluster Support: MariaDB MaxScale can now be used as a proxy for Amazon Aurora Cluster. A newly added monitor detects read replicas and the write node in an Aurora Cluster, and supports launchable scripts on monitored events like other monitors.

  • Multi-master for MySQL monitor: MariaDB MaxScale can now detect complex multi-master replication topologies in MariaDB and MySQL environments.

  • Failover mode for MySQL Monitor: For a two-node master-slave cluster, MariaDB MaxScale now allows the slave to act as a master in case the original master fails.

  • Read-Write Splitting with Master Pinning: MariaDB MaxScale 2.1 introduces a new “Consistent Critical Read Filter”. This filter detects a statement that would modify the database and routes all subsequent statements to the master server, where data is guaranteed to be in an up-to-date state.

Query Performance

  • Query Cache Filter: MariaDB MaxScale 2.1 now allows caching of query results in MaxScale for a configurable timeout. If a query is in the cache, MaxScale will return results from the cache instead of going to the server to fetch them.

  • Streaming Insert Plugin: A new plugin in MariaDB MaxScale 2.1 converts all INSERT statements done inside an explicit transaction into LOAD DATA LOCAL INFILE.


Please post your question in our Knowledge Base or email me at dipti.joshi@mariadb.com


by Dipti Joshi at February 16, 2017 05:32 PM

Valeriy Kravchuk

Fun with Bugs #48 - Group Replication Bugs and Missing Features as of MySQL 5.7.17

It seems a recent post on Group Replication by Vadim caused an interesting discussion on Facebook. I am NOT going to continue it here, but decided to present some facts: specifically, a list of public bug reports and feature requests for Group Replication (mostly "Verified", that is, accepted by Oracle engineers as valid) as of MySQL 5.7.17 (where the feature is promoted as GA), with just a few comments on some of the bugs.

The goal is to double-check this post when the next Oracle MySQL 5.7.x release appears, to find out how much Oracle cares to fix the problems already identified by the MySQL Community.

So, here are the bugs and feature requests in MySQL Server: Group Replication category (I've included only those reported for 5.7.17 into this post), as of today:
  • Bug #84315 - "add mysql group replication bootstrap option to init script". Fair enough, we have a way to bootstrap a cluster from a node in all implementations of Galera cluster. This feature request is still "Open".
  • Bug #84329 - "Some Class B private address is not permitted automatically". Manual seems to say it should be.
  • Bug #84337 - "Configure mgr with out root user encounter an error". Seems to be a duplicate of Bug #82687, that is, known for quite some time before GA.
  • Bug #84367 - "START GROUP_REPLICATION should be possible without error". You get error if it was already started. Not a big deal, but some optional clause like IF NOT STARTED may help (in a similar way to DROP TABLE ... IF EXISTS) to code scripts. Still "Open".
  • Bug #84710 - "group replication does not respect localhost IP". May be a problem for some MySQL Sandbox setups hoping to test group replication.
  • Bug #84727 - "GR: Partitioned Node Should Get Updated Status and not accept writes". Writes on partitioned node are accepted and hang (forever?). As a side note, I think Kenny Gryp deserves a special recognition as an early adopter of Group Replication feature who cared to report many problems noted in the process.
  • Bug #84728 - "Not Possible To Avoid MySQL From Starting When GR Cannot Start". We need this fixed to avoid split brain situations by default.
  • Bug #84729 - "Group Replication: Block Reads On Partitioned Nodes". In Galera reads are blocked by default when node is not considered a member of cluster.
  • Bug #84730 - "Feature: Troubleshoot Group Replication Transaction Rollbacks". You can get a lot of information about conflicts in case of Galera. Sorry that I have to compare, but when discussing the ways of dealing with common problems in cluster environments related to MySQL one can not ignore existing solutions (NDB clusters and Galera clusters), so I just picked up a (somewhat similar) technology that I know and used (a bit). I think readers will do the same, try to base their conclusions on known examples.
  • Bug #84731 - "Group Replication: mysql client connections hang during group replication start". There is no reason to hang more than needed. We should just make sure reads and writes are NOT accepted until the node is a member of cluster and in sync.
  • Bug #84733 - "Cannot Start GR with super_read_only=1". But this setting may be needed to make sure that there is only one master node in the cluster, no matter what happens to them...
  • Bug #84773 - "Flow control slows down throughput while a node joins the cluster". Let me quote: "This is done this way by design, that is, request group to slow down while the member is fetching and applying data."
  • Bug #84774 - "Performance drop every 60 seconds". Now this sounds like a performance problem to work on, maybe by adding some tuning options.
  • Bug #84784 - "Group Replication nodes do not rejoin cluster after network connectivity issues". It would be really nice for nodes to try to re-join the cluster in case of short term connectivity issues. Galera nodes do not give up that fast. The bug is still not verified.
  • Bug #84785  - "Prevent Large Transactions in Group Replication". Galera somehow allows to limit transaction size. Not that there were no related bugs, but still options exist.
  • Bug #84792 - "Idle GR Cluster: Member CPU hog". Not yet verified, but it seems in some cases node can use a notable share of CPU time for no clear/good reason.
  • Bug #84794 - "Cannot kill query inside GR". Well, you can do STOP GROUP_REPLICATION, but then it can be dangerous, because...
  • Bug #84795 - "STOP GROUP_REPLICATION sets super_read_only=off" - the node with stopped replication may allow to change the data...
  • Bug #84798 - "Group Replication can use some verbosity in the error log". Galera cluster nodes are too verbose, one gets kilometers of log records about everything, anything and nothing. Still, better to skip some usual outputs in the logs than get no evidence at all on what was going on...
  • Bug #84900 - "Getting inconsistent result on different nodes". Now, this is really unfortunate (for a "cluster"), and a somewhat similar problem was reported by Vadim before, see Bug #82481, which was supposed to be fixed. Anyway, the inconsistency is repeatable and looks somewhat scary.
  • Bug #84901 - "Can't execute transaction on second node". Best practice is to write on one and only one node (see Bug #83218). An attempt to write on the other node may fail...
Now, make your own conclusions about the maturity of Group Replication in MySQL 5.7.17. I have managed to avoid it so far, as I try to keep myself as far from any kind of "cluster" as possible... That had not worked well with Galera, unfortunately - I have to read its verbose logs on a daily basis.

by Valeriy Kravchuk (noreply@blogger.com) at February 16, 2017 01:34 PM

Jean-Jerome Schmidt

Video Interview with ProxySQL Creator René Cannaò

In anticipation of this month’s webinar, MySQL & MariaDB Load Balancing with ProxySQL & ClusterControl, happening on February 28th, Severalnines sat down with ProxySQL founder and creator René Cannaò to discuss his revolutionary product, how it’s used, and what he plans to cover in the webinar. Watch the video or read the transcript of the interview below.


Transcript of Interview

Hi I’m Forrest and I’m from the Severalnines marketing team and I’m here interviewing René Cannaò from ProxySQL. René thanks for joining me. Let’s start by introducing yourself, where did you come from?

Thank you, Forrest. Without going into too many details, I came from a system administrator background, and as a system administrator I got fascinated by databases, so I then became a DBA. In my past experience I worked as a Support Engineer for MySQL/Sun/Oracle, where I got a lot of experience with MySQL... then remote DBA for PalominoDB, and after that working as a MySQL SRE for Dropbox, and I founded ProxySQL.

So What is ProxySQL?

ProxySQL is a lightweight yet complex protocol-aware proxy that sits between the MySQL clients and servers. I like to describe it as a gate; in fact the Stargate is the logo. Basically it separates clients from databases, serving as an entry point to access all the database servers.

Why did you create ProxySQL?

That’s a very interesting question, and very important. As a DBA, it was always extremely difficult to control the traffic sent to the database. This was the main reason to create ProxySQL: basically it’s a layer that separates the database from the application (de facto splitting them into two different layers), and because it’s sitting in the middle it’s able to control and manage all the traffic between the two, and also transparently managing failures.

So there are several database load balancers in the market, what differentiates ProxySQL from others?

First, most of the load balancers do not understand MySQL protocol: ProxySQL understands it, and this allows the implementation of features otherwise impossible to implement. Among the few proxies that are able to understand the MySQL protocol, ProxySQL is the only one designed from DBAs for DBAs, therefore it is designed to solve real issues and challenges as a DBA. For example, ProxySQL is the only proxy supporting connections multiplexing and query caching.

I noticed on your website that you say that ProxySQL isn’t battle-tested, its WAR-tested. What have you done to put ProxySQL through its paces?

The point is that from the very beginning ProxySQL was architected and designed to behave correctly in extremely demanding and very complex setups with millions of client connections and thousands of database servers. Other proxies won’t be able to handle this. So, a lot of effort was invested in making sure ProxySQL is resilient in such complex setups. And, of course, no matter how resilient the setup is, it should not sacrifice performance.

On the 28th of February you will be co-hosting a webinar with Severalnines, together with Krzysztof, one of our Support Engineers. What are some of the topics you are going to cover at that event?

ProxySQL is built upon new technology not present in other load balancers, and its features and concepts are not always intuitive. Some concepts are extremely original in ProxySQL. For this reason the topics I plan to cover at the event are hostgroups, query rules, connection multiplexing, failure handling, and configuration management. Again, those are all features and concepts that are only present in ProxySQL.

Excellent, well thank you for joining me, I’m really looking forward to this webinar on the 28th.

Thank you, Forrest.

by Severalnines at February 16, 2017 12:04 PM

Peter Zaitsev

Group Replication: Shipped Too Early

This blog post is my overview of Group Replication technology.

With Oracle clearly entering the “open source high availability solutions” arena with the release of their brand new Group Replication solution, I believe it is time to review the quality of the first GA (production ready) release.

TL;DR: Having examined the technology, it is my conclusion that Oracle seems to have released the GA version of Group Replication too early. While the product is definitely “working prototype” quality, the release seems rushed and unfinished. I found a significant number of issues, and I would personally not recommend it for production use.

It is obvious that Oracle is trying hard to ship technology to compete with Percona XtraDB Cluster, which is probably why they rushed to claim Group Replication GA quality.

If you’re all set to follow along and test Group Replication yourself, simplify the initial setup by using this Docker image. We can review some of the issues you might face together.

For the record, I tested the version based on MySQL 5.7.17 release.

No automatic provisioning

The first thing you’ll find is that there is NO way to automatically set up a new node.

If you need to set up a new node, or recover an existing node from a fatal failure, you’ll need to manually provision the slave.

Of course, you can clone a slave using Percona XtraBackup or LVM by employing some self-developed scripts. But given the high availability nature of the product, one would expect Group Replication to automatically re-provision any failed node.

Bug: stale reads on nodes

Please see this bug:

One-line summary: while any secondary nodes are “catching up” to whatever happened on the first node (it takes time to apply changes on secondary nodes), reads on a secondary node can return stale data (as shown in the bug report).

This brings us back to traditional asynchronous replication slave behavior (i.e., Group Replication’s predecessor).

It also contradicts the Group Replication documentation, which states: “There is a built-in group membership service that keeps the view of the group consistent and available for all servers at any given point in time.” (See https://dev.mysql.com/doc/refman/5.7/en/group-replication.html.)

I might also mention here that Percona XtraDB Cluster prevents stale reads (see https://www.percona.com/doc/percona-xtradb-cluster/5.7/wsrep-system-index.html#wsrep_sync_wait).
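As an illustration of how that prevention works in practice, here is a minimal sketch (the table and values are hypothetical; the setting is the wsrep_sync_wait variable documented at the link above):

```sql
-- Value 1 makes this session wait until all received write-sets are
-- applied before executing SELECT statements, preventing stale reads.
SET SESSION wsrep_sync_wait = 1;

-- Hypothetical query: the result now reflects all writes the cluster
-- had committed at the time the SELECT was issued.
SELECT balance FROM accounts WHERE id = 42;
```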

Bug: nodes become unusable after a big transaction, refusing to execute further transactions

There are two related bugs:

One-line summary: after running a big transaction, any secondary nodes become unusable and refuse to perform any further transactions.

Obscure error messages

It is not uncommon to see cryptic error messages while testing Group Replication. For example:

mysql> commit;
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.

This is fairly useless and provides little help until I check the mysqld error log. The log provides a little bit more information:

2017-02-09T02:05:36.996776Z 18 [ERROR] Plugin group_replication reported: '[GCS] Gcs_packet's payload is too big. Only the packets smaller than 2113929216 bytes can be compressed.'
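Incidentally, the limit in that message is not arbitrary; a quick check (my own arithmetic, not taken from the log) shows it is exactly 2016 MiB, just under the signed 32-bit boundary:

```python
# The GCS compression limit reported in the error log, in bytes.
GCS_MAX_COMPRESSIBLE = 2113929216

MIB = 1024 ** 2
print(GCS_MAX_COMPRESSIBLE // MIB)     # -> 2016 (exactly 2016 MiB)
print(GCS_MAX_COMPRESSIBLE < 2 ** 31)  # -> True (fits in a signed 32-bit int)
```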

Discussion:

The items highlighted above might not seem too bad at first, and you could assume that your workload won’t be affected. However, stale reads and node dysfunctions basically prevent me from running a more comprehensive evaluation.

My recommendation:

If you care about your data, then I recommend not using Group Replication in production. Currently, it looks like it might cause plenty of headaches, and it is easy to get inconsistent results.

For the moment, Group Replication looks like an advanced – but broken – form of traditional MySQL asynchronous replication.

I understand Oracle’s dilemma. Usually people are hesitant to test a product that is not GA. So in order to get feedback from users, Oracle needs to push the product to GA. Oracle must absolutely solve the issues above during future QA cycles.

by Vadim Tkachenko at February 16, 2017 12:02 AM

February 15, 2017

MariaDB AB

Kerberos for SQLyog by MariaDB Connector/C


MariaDB is an open source enterprise database with one of the most active and fastest-growing communities in the world. MariaDB Enterprise delivers the security, high availability and scalability required for mission-critical applications, and the management and monitoring tools today’s enterprises rely on.

SQLyog is included in MariaDB Enterprise and it helps DBAs, developers and database architects save time writing queries visualized with syntax checking, designing visually complex queries, and many other powerful features for visualization, synchronization and management. This 12.4 release introduces ‘read-only’ connections as well as support for the MariaDB auth_gssapi (Kerberos) plugin.

Kerberos is an authentication protocol that works on the basis of 'tickets' to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner. Typically, Kerberos is used within corporate/internal environments for two purposes: security and authentication – replacing local computer passwords with tickets on a distributed network system, thereby eliminating the risk of passwords transmitted over the network being intercepted.

The MariaDB auth_gssapi (Kerberos) plugin was available with MariaDB Connector/C 3.0 (currently beta), but based on high demand we backported it to MariaDB Connector/C 2.3, which is used in SQLyog.

More information on Connector/C here.

Download SQLyog 12.4 from our Customer Portal here.

Want to learn more? Please contact us here.

This blog post was inspired by this page.



by julienfritsch at February 15, 2017 08:06 PM

Peter Zaitsev

Docker Images for Percona Server for MySQL Group Replication

Group Replication

In this blog post, we’ll point to a new Docker image for Percona Server for MySQL Group Replication.

Our most recent release of Percona Server for MySQL (Percona Server for MySQL 5.7.17) comes with Group Replication plugins. Unfortunately, since this technology is very new, it requires some fairly complicated steps to set up and get running. To help with that process, I’ve prepared Docker images that simplify the setup procedure.

You can find the image here: https://hub.docker.com/r/perconalab/pgr-57/.

To start the first node (bootstrap the group):

docker run -d -p 3306 --net=clusternet -e MYSQL_ROOT_PASSWORD=passw0rd -e CLUSTER_NAME=cluster1 perconalab/pgr-57

To add nodes into the group after:

docker run -d -p 3306 --net=clusternet -e MYSQL_ROOT_PASSWORD=passw0rd -e CLUSTER_NAME=cluster1 -e CLUSTER_JOIN=CONTAINER_ID_FROM_THE_FIRST_STEP perconalab/pgr-57

You can also get a full script that starts “N” number of nodes, here: https://github.com/Percona-Lab/percona-docker/blob/master/pgr-57/start_node.sh

 

by Vadim Tkachenko at February 15, 2017 04:27 PM

Jean-Jerome Schmidt

MySQL on Docker: Composing the Stack

Docker 1.13 introduces a long-awaited feature called compose-file support, which allows us to define our containers in a nice simple config file instead of a single long command. If you have a look at our previous “MySQL on Docker” blog posts, we used multiple long command lines to run containers and services. By using compose-file, containers are easily specified for deployment. This reduces the risk of human error, as you do not have to remember long commands with multiple parameters.

In this blog post, we’ll show you how to use compose-file by using simple examples around MySQL deployments. We assume you have Docker Engine 1.13 installed on 3 physical hosts and Swarm mode is configured on all hosts.

Introduction to Compose-file

In the compose-file, you specify everything in YAML format as opposed to trying to remember all the arguments you have to pass to Docker commands. You can define services, networks and volumes here. The definitions are picked up by Docker, very much like passing command-line parameters to the “docker run|network|volume” commands.

As an introduction, we are going to deploy a simple standalone MySQL container. Before you start writing a Compose file, you first need to know the run command. Taken from our first MySQL on Docker blog series, let’s compose the following “docker run” command:

$ docker run --detach \
--name=test-mysql \
--publish 6603:3306 \
--env="MYSQL_ROOT_PASSWORD=mypassword" \
-v /storage/docker/mysql-datadir:/var/lib/mysql \
mysql

The docker-compose command will look for a default file called “docker-compose.yml” in the current directory. So, let’s first create the required directories beforehand:

$ mkdir -p ~/compose-files/mysql/single
$ mkdir -p /storage/docker/mysql-datadir
$ cd ~/compose-files/mysql/single

In YAML, here is what should be written:

version: '2'

services:
  mysql:
    image: mysql
    container_name: test-mysql
    ports:
      - 6603:3306
    environment:
      MYSQL_ROOT_PASSWORD: "mypassword"
    volumes:
      - /storage/docker/mysql-datadir:/var/lib/mysql

Save the above content into “~/compose-files/mysql/single/docker-compose.yml”. Ensure you are in the current directory ~/compose-files/mysql/single, then fire it up by running the following command:

$ docker-compose up -d
WARNING: The Docker Engine you're using is running in swarm mode.

Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.

To deploy your application across the swarm, use `docker stack deploy`.

Creating test-mysql

Verify if the container is running in detached mode:

[root@docker1 single]# docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
379d5c15ef44        mysql               "docker-entrypoint..."   8 minutes ago       Up 8 minutes        0.0.0.0:6603->3306/tcp   test-mysql

Congratulations! We have now got a MySQL container running with just a single command.

Deploying a Stack

Compose-file simplifies things; it provides a clearer view of how the infrastructure should look. Let’s create a container stack that consists of a website running on Drupal, using a MySQL instance under a dedicated network, and link them together.

Similar to above, let’s take a look at the command line version in the correct order to build this stack:

$ docker volume create mysql_data
$ docker network create drupal_mysql_net --driver=bridge
$ docker run -d --name=mysql-drupal --restart=always -v mysql_data:/var/lib/mysql --net=drupal_mysql_net -e MYSQL_ROOT_PASSWORD="mypassword" -e MYSQL_DATABASE="drupal" mysql
$ docker run -d --name=drupal -p 8080:80 --restart=always -v /var/www/html/modules -v /var/www/html/profiles -v /var/www/html/themes -v /var/www/html/sites --link mysql-drupal:mysql --net=drupal_mysql_net drupal

To start composing, let’s first create a directory for our new stack:

$ mkdir -p ~/compose-files/drupal-mysql
$ cd ~/compose-files/drupal-mysql

Then, create docker-compose.yml with the following content:

version: '2'

services:
  mysql:
    image: mysql
    container_name: mysql-drupal
    environment:
      MYSQL_ROOT_PASSWORD: "mypassword"
      MYSQL_DATABASE: "drupal"
    volumes:
      - mysql_data:/var/lib/mysql
    restart: always
    networks:
      - drupal_mysql_net

  drupal:
    depends_on:
      - mysql
    image: drupal
    container_name: drupal
    ports:
      - 8080:80
    volumes:
      - /var/www/html/modules
      - /var/www/html/profiles
      - /var/www/html/themes
      - /var/www/html/sites
    links:
      - mysql:mysql
    restart: always
    networks:
      - drupal_mysql_net

volumes:
  mysql_data:

networks:
  drupal_mysql_net:
    driver: bridge

Fire them up:

$ docker-compose up -d
..
Creating network "drupalmysql_drupal_mysql_net" with driver "bridge"
Creating volume "drupalmysql_mysql_data" with default driver
Pulling drupal (drupal:latest)...
..
Creating mysql-drupal
Creating drupal

Docker will perform the deployment as follows:

  1. Create network
  2. Create volume
  3. Pull images
  4. Create mysql-drupal (since container “drupal” is dependent on it)
  5. Create the drupal container

At this point, our architecture can be illustrated as follows:

We can then specify ‘mysql’ as the MySQL host in the installation wizard page since both containers are linked together. That’s it. To tear them down, simply run the following command under the same directory:

$ docker-compose down

The corresponding containers will be terminated and removed accordingly. Take note that the docker-compose command is bound to the individual physical host running Docker. In order to run on multiple physical hosts across Swarm, it needs to be treated differently by utilizing “docker stack” command. We’ll explain this in the next section.

Composing a Stack Across Swarm

Firstly, make sure the Docker engine is running on v1.13 and Swarm mode is enabled and in ready state:

$ docker node ls
ID                           HOSTNAME       STATUS  AVAILABILITY  MANAGER STATUS
8n8t3r4fvm8u01yhli9522xi9 *  docker1.local  Ready   Active        Reachable
o1dfbbnmhn1qayjry32bpl2by    docker2.local  Ready   Active        Reachable
tng5r9ax0ve855pih1110amv8    docker3.local  Ready   Active        Leader

In order to use the stack feature for Docker Swarm mode, we have to use the Docker Compose version 3 format. We are going to deploy a setup similar to the above, except with a three-node Galera setup as the MySQL backend, which we already explained in detail in this blog post.

Firstly, create a directory for our new stack:

$ mkdir -p ~/compose-files/drupal-galera
$ cd ~/compose-files/drupal-galera

Then add the following lines into “docker-compose.yml”:

version: '3'

services:

  galera:
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure
        delay: 30s
        max_attempts: 3
        window: 60s
      update_config:
        parallelism: 1
        delay: 10s
        max_failure_ratio: 0.3
    image: severalnines/pxc56
    environment:
      MYSQL_ROOT_PASSWORD: "mypassword"
      CLUSTER_NAME: "my_galera"
      XTRABACKUP_PASSWORD: "mypassword"
      DISCOVERY_SERVICE: '192.168.55.111:2379,192.168.55.112:2379,192.168.55.207:2379'
      MYSQL_DATABASE: 'drupal'
    networks:
      - galera_net

  drupal:
    depends_on:
      - galera
    deploy:
      replicas: 1
    image: drupal
    ports:
      - 8080:80
    volumes:
      - drupal_modules:/var/www/html/modules
      - drupal_profile:/var/www/html/profiles
      - drupal_theme:/var/www/html/themes
      - drupal_sites:/var/www/html/sites
    networks:
      - galera_net

volumes:
  drupal_modules:
  drupal_profile:
  drupal_theme:
  drupal_sites:

networks:
  galera_net:
    driver: overlay

Note that the Galera image that we used (severalnines/pxc56) requires a running etcd cluster installed on each of the Docker physical host. Please refer to this blog post on the prerequisite steps.

One of the important parts in our compose-file is the max_attempts parameter under restart_policy section. We have to specify a hard limit on the number of restarts in case of failure. This will make the deployment process safer because, by default, the Swarm scheduler will never give up in attempting to restart containers. If this happens, the process loop will fill up the physical host’s disk space with unusable containers when the scheduler cannot bring the containers up to the desired state. This is a common approach when handling stateful services like MySQL. It’s better to bring them down altogether rather than make them run in an inconsistent state.

To start them all, just execute the following command in the same directory where docker-compose.yml resides:

$ docker stack deploy --compose-file=docker-compose.yml my_drupal

Verify the stack is created with 2 services (drupal and galera):

$ docker stack ls
NAME       SERVICES
my_drupal  2

We can also list the current tasks in the created stack. The result is a combined version of “docker service ps my_drupal_galera” and “docker service ps my_drupal_drupal” commands:

$ docker stack ps my_drupal
ID            NAME                IMAGE                      NODE           DESIRED STATE  CURRENT STATE           ERROR  PORTS
609jj9ji6rxt  my_drupal_galera.1  severalnines/pxc56:latest  docker3.local  Running        Running 7 minutes ago
z8mcqzf29lbq  my_drupal_drupal.1  drupal:latest              docker1.local  Running        Running 24 minutes ago
skblp9mfbbzi  my_drupal_galera.2  severalnines/pxc56:latest  docker1.local  Running        Running 10 minutes ago
cidn9kb0d62u  my_drupal_galera.3  severalnines/pxc56:latest  docker2.local  Running        Running 7 minutes ago

Once we get the CURRENT STATE as RUNNING, we can start the Drupal installation by connecting to any of the Docker hosts’ IP address or hostname on port 8080; in this case we used docker3 (even though the drupal container is deployed on docker1): http://192.168.55.113:8080/. Proceed with the installation and specify ‘galera’ as the MySQL host and ‘drupal’ as the database name (as defined in the compose-file under the MYSQL_DATABASE environment variable):

That’s it. The stack deployment was simplified by using Compose-file. At this point, our architecture is looking something like this:

Lastly, to remove the stack, just run the following command:

$ docker stack rm my_drupal
Removing service my_drupal_galera
Removing service my_drupal_drupal
Removing network my_drupal_galera_net

Conclusion

Using compose-file can save you time and reduce the risk of human error, as compared to working with long command lines. It is a perfect tool to master before working with multi-container Docker applications, dealing with multiple deployment environments (e.g. dev, test, staging, pre-prod, prod), and handling much more complex services, just like MySQL Galera Cluster. Happy containerizing!

by ashraf at February 15, 2017 08:43 AM

February 14, 2017

MariaDB AB

Releasing BSL 1.1


MariaDB announced BSL 1.0 together with MariaDB MaxScale 2.0 in August 2016. After having the license “in the wild” for a few months we’ve been reaching out to Open Source advocates to get feedback on the BSL. We’ve gotten some great responses, including from Bruce Perens and Heather Meeker of O’Melveny & Myers LLP*.

Our next MariaDB MaxScale release** will use BSL 1.1. We feel that the updates make it easier for other vendors to adopt the BSL, providing a better monetization opportunity for software vendors than many alternate approaches, especially proprietary or open core software. To read more about BSL 1.1, see our updated MaxScale-specific and vendor-oriented BSL FAQs.

The changes in the new release aim to better enable the BSL as a viable license for other vendors by setting consistent expectations.

First, we set a cap of four years for the duration of the time window prior to code becoming FOSS. That said, we encourage a shorter window (and for MariaDB MaxScale, the window is between two and three years). Second, we now require the Change License to be GPL compatible with either GPLv2, GPLv3, or any other license that can be combined properly with GPL software. Lastly, with the Use Limitation, we failed at identifying a reasonable least common denominator, as software comes in so many different shapes. Even “non-commercial use” is hard to define. As an example, we want to enable App developers to distribute their free versions under the BSL, where a payment trigger would be getting rid of ads. And what exactly is commercial use? Internal use, where you save a lot of cost, in producing value for your own customers? If the users earn money – directly or indirectly – the developer should be fed.

We encourage you to consider the BSL and explore MaxScale and our vendor-oriented BSL FAQs for more information.

 

 

 

*Heather is author of Open (Source) For Business: A Practical Guide to Open Source Licensing, and has been a pro-bono counsel to the Mozilla, GNOME, and Python foundations, so who better to ask than her.

** To be specific:

  • MariaDB MaxScale 1.4.5 (released 8 Feb 2017) is GPL and the following maintenance release MariaDB MaxScale 1.4.6 (whenever that may be) will remain GPL.

  • MariaDB MaxScale 2.0.4 (also released 8 Feb 2017) is BSL 1.0 but the following maintenance release MariaDB MaxScale 2.0.5 (whenever that may be) will move to BSL 1.1.

  • MariaDB MaxScale 2.1.0 will be released under BSL 1.1 from the start. 



by kajarno at February 14, 2017 10:29 PM

Peter Zaitsev

Percona Live Open Source Database Conference 2017 Crash Courses: MySQL and MongoDB!

MariaDB at Percona Live

The Percona Live Open Source Database Conference 2017 will once again host crash courses on MySQL and MongoDB. Read below to get an outstanding discount on either the MySQL or MongoDB crash course (or both).

The database community constantly tells us how hard it is to find someone with MySQL and MongoDB DBA skills who can help with the day-to-day management of their databases. This is especially difficult when companies don’t have a full-time requirement for a DBA. Developers, system administrators and IT staff spend too much time trying to solve basic database problems that keep them from doing their other job duties. Eventually, the little problems or performance inefficiencies that start to pile up lead to big problems.

In answer to this growing need, Percona Live is once again hosting crash courses for developers, systems administrators and other technical resources. A crash course is a one-day training session on either MySQL 101 or MongoDB 101.

Don’t let the name fool you: these courses are led by Percona database experts who will show you the fundamentals of MySQL or MongoDB tools and techniques.

And it’s not just for DBAs: developers are encouraged to attend to hone their database skills.

Below is a list of the topics covered in each course this year:

MySQL 101 Topics MongoDB 101 Topics

Attendees will return ready to quickly and correctly take care of the day-to-day and week-to-week management of your MySQL or MongoDB environment.

The schedule and cost for the 101 courses (without a full-conference pass) are:

  • MySQL 101: Tuesday, April 25 ($400)
  • MongoDB 101: Wednesday, April 26 ($400)
  • Both MySQL and MongoDB 101 sessions ($700)

(Tickets to the 101 sessions do not grant access to the main Percona Live breakout sessions. Full Percona Live conference passes grant admission to the 101 sessions. 101 Crash Course attendees will have full access to the Percona Live keynote speakers, the exhibit hall and receptions.)

As a special promo, the first 101 people to purchase the single 101 talks receive a $299.00 discount off the ticket price! Each session only costs $101! Get both sessions for a mere $202 and save $498.00! Register now using the following codes for your discount:

  • 101: $299 off of either the MySQL or MongoDB tickets
  • 202: $498 off of the combined MySQL/MongoDB ticket
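As a quick sanity check on the promo arithmetic (list prices and discount codes as stated above):

```python
# List prices from the post.
mysql_101 = 400      # MySQL 101, Tuesday, April 25
mongodb_101 = 400    # MongoDB 101, Wednesday, April 26
combined = 700       # both sessions

# Promo codes from the post.
code_101 = 299       # off a single-session ticket
code_202 = 498       # off the combined ticket

print(mysql_101 - code_101)   # -> 101 (each single session costs $101)
print(combined - code_202)    # -> 202 (both sessions for $202)
```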

Click here to register.

Register for Percona Live 2017 now! Advanced Registration lasts until March 5, 2017. Percona Live is a very popular conference: this year’s Percona Live Europe sold out, and we’re looking to do the same for Percona Live 2017. Don’t miss your chance to get your ticket at its most affordable price. Click here to register.

Percona Live 2017 sponsorship opportunities are available now. Click here to find out how to sponsor.

by Dave Avery at February 14, 2017 08:23 PM

Percona Server for MongoDB 3.4 Product Bundle Release is a Valentine to the Open Source Community

Percona Server for MongoDB 3.4

Percona today announced a Percona Server for MongoDB 3.4 solution bundle of updated products. This release enables any organization to create a robust, secure database environment that can be adapted to changing business requirements.

Percona Server for MongoDB 3.4, Percona Monitoring and Management 1.1, and Percona Toolkit 3.0 offer more features and benefits, with enhancements for both MongoDB® and MySQL® database environments. When these Percona products are used together, organizations gain all the cost and agility benefits provided by free, proven open source software that delivers all the latest MongoDB Community Edition 3.4 features, additional Enterprise features, and a greater choice of storage engines. Along with improved insight into the database environment, the solution provides enhanced control options for optimizing a wider range of database workloads with greater reliability and security.

The solution will be generally available the week of Feb. 20.

New Features and Benefits Summary

Percona Server for MongoDB 3.4

  • All the features of MongoDB Community Edition 3.4, which provides an open source, fully compatible, drop-in replacement
  • Integrated, pluggable authentication with LDAP to provide a centralized enterprise authentication service
  • Open-source auditing for visibility into user and process actions in the database, with the ability to redact sensitive information (such as usernames and IP addresses) from log files
  • Hot backups for the WiredTiger engine protect against data loss in the case of a crash or disaster, without impacting performance
  • Two storage engine options not supported by MongoDB Community Edition 3.4:

    • MongoRocks, the RocksDB-powered storage engine, designed for demanding, high-volume data workloads such as in IoT applications, on-premises or in the cloud
    • Percona Memory Engine is ideal for in-memory computing and other applications demanding very low latency workloads

Percona Monitoring and Management 1.1

  • Support for MongoDB and Percona Server for MongoDB
  • Graphical dashboard information for WiredTiger, MongoRocks and Percona Memory Engine

Percona Toolkit 3.0

  • Two new tools for MongoDB:
    • pt-mongodb-summary (the equivalent of pt-mysql-summary) provides a quick, at-a-glance overview of a MongoDB and Percona Server for MongoDB instance.
    • pt-mongodb-query-digest (the equivalent of pt-query-digest for MySQL) offers a query review for troubleshooting.

For more information, see Percona’s press release.

by Dave Avery at February 14, 2017 03:46 PM

Jean-Jerome Schmidt

MySQL & MariaDB load balancing with ProxySQL & ClusterControl: introduction webinar

Proxies are building blocks of high availability setups for MySQL and MariaDB. They can detect failed nodes and route queries to hosts which are still available. If your master failed and you had to promote one of your slaves, proxies will detect such topology changes and route your traffic accordingly. More advanced proxies can do much more: route traffic based on precise query rules, cache queries or mirror them. They can even be used to implement different types of sharding.

Introducing ProxySQL!

Join us for this live joint webinar with ProxySQL’s creator, René Cannaò, who will tell us more about this new proxy and its features. We will also show you how you can deploy ProxySQL using ClusterControl. And we will give you an early walk-through of some of the exciting ClusterControl features for ProxySQL that we have planned.

Date, Time & Registration

Europe/MEA/APAC

Tuesday, February 28th at 09:00 GMT (UK) / 10:00 CET (Germany, France, Sweden)

Register Now

North America/LatAm

Tuesday, February 28th at 9:00 Pacific Time (US) / 12:00 Eastern Time (US)

Register Now

Agenda

  1. Introduction
  2. ProxySQL concepts (René Cannaò)
    • Hostgroups
    • Query rules
    • Connection multiplexing
    • Configuration management
  3. Demo of ProxySQL setup in ClusterControl (Krzysztof Książek)
  4. Upcoming ClusterControl features for ProxySQL

Speakers

René Cannaò, Creator & Founder, ProxySQL. René has 10 years of working experience as a System, Network and Database Administrator, mainly on Linux/Unix platforms. In the last 4-5 years his experience was focused mainly on MySQL, working as Senior MySQL Support Engineer at Sun/Oracle and then as Senior Operational DBA at Blackbird (formerly PalominoDB). In this period he built an analytic and problem-solving mindset, and he is always eager to take on new challenges, especially if they are related to high performance. And then he created ProxySQL …

Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard.

We look forward to “seeing” you there and to insightful discussions!

If you have any questions or would like a personalised live demo, please do contact us.

by Severalnines at February 14, 2017 10:15 AM

February 13, 2017

Peter Zaitsev

ClickHouse: New Open Source Columnar Database

Clickhouse

For this blog post, I’ve decided to try ClickHouse: an open source column-oriented database management system developed by Yandex (it currently powers Yandex.Metrica, the world’s second-largest web analytics platform).

In my previous set of posts, I tested Apache Spark for big data analysis and used Wikipedia page statistics as a data source. I’ve used the same data as in the Apache Spark blog post: Wikipedia Page Counts. This allows me to compare ClickHouse’s performance to Spark’s.

ClickHouse

I’ve spent some time testing ClickHouse for relatively large volumes of data (1.2Tb uncompressed). Here is a list of ClickHouse advantages and disadvantages that I saw:

ClickHouse advantages

  • Parallel processing for single query (utilizing multiple cores)
  • Distributed processing on multiple servers
  • Very fast scans (see benchmarks below) that can be used for real-time queries
  • Column storage is great for working with “wide” / “denormalized” tables (many columns)
  • Good compression
  • SQL support (with limitations)
  • Good set of functions, including support for approximated calculations
  • Different storage engines (disk storage format)
  • Great for structured log/event data as well as time series data (the MergeTree engine requires a date field)
  • Index support (primary key only, not all storage engines)
  • Nice command line interface with user-friendly progress bar and formatting

Here is a full list of ClickHouse features

ClickHouse disadvantages

  • No real delete/update support, and no transactions (same as Spark and most of the big data systems)
  • No secondary keys (same as Spark and most of the big data systems)
  • Own protocol (no MySQL protocol support)
  • Limited SQL support, and the joins implementation is different. If you are migrating from MySQL or Spark, you will probably have to re-write all queries with joins.
  • No window functions

Full list of ClickHouse limitations

Group by: in-memory vs. on-disk

Running out of memory is one of the potential problems you may encounter when working with large datasets in ClickHouse:

SELECT
    min(toMonth(date)),
    max(toMonth(date)),
    path,
    count(*),
    sum(hits),
    sum(hits) / count(*) AS hit_ratio
FROM wikistat
WHERE (project = 'en')
GROUP BY path
ORDER BY hit_ratio DESC
LIMIT 10
↖ Progress: 1.83 billion rows, 85.31 GB (68.80 million rows/s., 3.21 GB/s.) ██████████▋ 6%
Received exception from server:
Code: 241. DB::Exception: Received from localhost:9000, 127.0.0.1.
DB::Exception: Memory limit (for query) exceeded: would use 9.31 GiB (attempt to allocate chunk of 1048576 bytes), maximum: 9.31 GiB:
(while reading column hits):

By default, ClickHouse limits the amount of memory for group by (it uses a hash table for group by). This is easily fixed – if you have free memory, increase this parameter:

SET max_memory_usage = 128000000000; #128G

If you don’t have that much memory available, ClickHouse can “spill” data to disk by setting this:

set max_bytes_before_external_group_by=20000000000; #20G
set max_memory_usage=40000000000; #40G

According to the documentation, if you need to use max_bytes_before_external_group_by, it is recommended to set max_memory_usage to be ~2x the size of max_bytes_before_external_group_by.

(The reason for this is that the aggregation is performed in two phases: (1) reading the data and building intermediate aggregation state, and (2) merging the intermediate data. The spill to disk can only happen during the first phase. If no spill occurs, ClickHouse might need the same amount of RAM for stages 1 and 2.)
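The 20G/40G pair used above follows that ~2x recommendation. A quick sketch of the sizing rule (my own arithmetic, not from the ClickHouse docs):

```python
# Sizing rule sketch: max_memory_usage ~ 2x max_bytes_before_external_group_by,
# since stage 1 (build, spillable) and stage 2 (merge) may each need that much RAM.
max_bytes_before_external_group_by = 20_000_000_000  # 20 GB, as set above
max_memory_usage = 2 * max_bytes_before_external_group_by
print(max_memory_usage)  # 40000000000 -> the 40 GB used in the example
```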

Benchmarks: ClickHouse vs. Spark

Both ClickHouse and Spark can be distributed. However, for the purpose of this test I’ve run a single node for both ClickHouse and Spark. The results are quite impressive.

Benchmark summary

 Size / compression           Spark v. 2.0.2                ClickHouse
 Data storage format          Parquet, compressed: snappy   Internal storage, compressed
 Size (uncompressed: 1.2TB)   395G                          212G


 Test                          Spark v. 2.0.2            ClickHouse   Diff
 Query 1: count (warm)         7.37 sec (no disk IO)     6.61 sec     ~same
 Query 2: simple group (warm)  792.55 sec (no disk IO)   37.45 sec    21x better
 Query 3: complex group by     2522.9 sec                398.55 sec   6.3x better

 

ClickHouse vs. MySQL

I wanted to see how ClickHouse compared to MySQL. Obviously, we can’t compare some workloads. For example:

  • Storing terabytes of data and querying (“crunching” would be a better word here) data without an index. It would take weeks (or even months) to load data and build the indexes. That is a much more suitable workload for ClickHouse or Spark.
  • Real-time updates / OLTP. ClickHouse does not support real-time updates / deletes.

Usually big data systems do not provide us with real-time queries. Systems based on map/reduce (i.e., Hive on top of HDFS) are just too slow for real-time queries, as it takes a long time to initialize the map/reduce job and send the code to all nodes.

Potentially, you can use ClickHouse for real-time queries. It does not support secondary indexes, however. This means it will probably scan lots of rows, but it can do it very quickly.

To do this test, I’m using the data from the Percona Monitoring and Management system. The table I’m using has 150 columns, so it is good for column storage. The size in MySQL is ~250G:

mysql> show table status like 'query_class_metrics'\G
*************************** 1. row ***************************
           Name: query_class_metrics
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 364184844
 Avg_row_length: 599
    Data_length: 218191888384
Max_data_length: 0
   Index_length: 18590056448
      Data_free: 6291456
 Auto_increment: 416994305

Scanning the whole table is significantly faster in ClickHouse. Retrieving just ten rows by key is faster in MySQL (especially from memory).

But what if we only need to scan a limited number of rows and do a group by? In this case, ClickHouse may be faster. Here is an example (a real query used to create sparklines):

MySQL

SELECT
	(1480888800 - UNIX_TIMESTAMP(start_ts)) / 11520 as point,
	FROM_UNIXTIME(1480888800 - (SELECT point) * 11520) AS ts,
	COALESCE(SUM(query_count), 0) / 11520 AS query_count_per_sec,
	COALESCE(SUM(Query_time_sum), 0) / 11520 AS query_time_sum_per_sec,
	COALESCE(SUM(Lock_time_sum), 0) / 11520 AS lock_time_sum_per_sec,
	COALESCE(SUM(Rows_sent_sum), 0) / 11520 AS rows_sent_sum_per_sec,
	COALESCE(SUM(Rows_examined_sum), 0) / 11520 AS rows_examined_sum_per_sec
FROM  query_class_metrics
WHERE  query_class_id = 7 AND instance_id = 1259 AND (start_ts >= '2014-11-27 00:00:00'
AND start_ts < '2014-12-05 00:00:00')
GROUP BY point;
...
61 rows in set (0.10 sec)
# Query_time: 0.101203  Lock_time: 0.000407  Rows_sent: 61  Rows_examined: 11639  Rows_affected: 0
explain SELECT ...
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: query_class_metrics
   partitions: NULL
         type: range
possible_keys: agent_class_ts,agent_ts
          key: agent_class_ts
      key_len: 12
          ref: NULL
         rows: 21686
     filtered: 100.00
        Extra: Using index condition; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: NULL
   partitions: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
     filtered: NULL
        Extra: No tables used
2 rows in set, 2 warnings (0.00 sec)

It is relatively fast.
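For context, the magic constant 11520 is the sparkline bucket width in seconds: the eight-day window in the WHERE clause divides into exactly 60 buckets, which is consistent with the 61 boundary points (rows) returned. A quick check of that arithmetic (my own, not from the post):

```python
from datetime import datetime

# Window from the WHERE clause above
start = datetime(2014, 11, 27)
end = datetime(2014, 12, 5)

window_seconds = int((end - start).total_seconds())
buckets = window_seconds // 11520  # 11520 s = 3.2 hours per sparkline point
print(window_seconds, buckets)  # 691200 60
```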

ClickHouse (some functions are different, so we will have to rewrite the query):

SELECT
    intDiv(1480888800 - toRelativeSecondNum(start_ts), 11520) AS point,
    toDateTime(1480888800 - (point * 11520)) AS ts,
    SUM(query_count) / 11520 AS query_count_per_sec,
    SUM(Query_time_sum) / 11520 AS query_time_sum_per_sec,
    SUM(Lock_time_sum) / 11520 AS lock_time_sum_per_sec,
    SUM(Rows_sent_sum) / 11520 AS rows_sent_sum_per_sec,
    SUM(Rows_examined_sum) / 11520 AS rows_examined_sum_per_sec,
    SUM(Rows_affected_sum) / 11520 AS rows_affected_sum_per_sec
FROM query_class_metrics
WHERE (query_class_id = 7) AND (instance_id = 1259) AND ((start_ts >= '2014-11-27 00:00:00')
AND (start_ts < '2014-12-05 00:00:00'))
GROUP BY point;
61 rows in set. Elapsed: 0.017 sec. Processed 270.34 thousand rows, 14.06 MB (15.73 million rows/s., 817.98 MB/s.)

As we can see, even though ClickHouse scans more rows (270K vs. 11K – over 20x more), the ClickHouse query executes faster (0.10 seconds in MySQL compared to 0.017 seconds in ClickHouse). The column store format helps a lot here, as MySQL has to read all 150 columns (stored inside InnoDB pages) while ClickHouse only needs to read seven columns.
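The 150-vs-7 column difference can be put in rough numbers. A back-of-the-envelope sketch (my own simplification, ignoring compression and per-column width differences):

```python
total_columns = 150   # columns in query_class_metrics
needed_columns = 7    # columns this query actually touches

# Rough fraction of each row's data a column store has to read for this query
read_fraction_pct = round(needed_columns / total_columns * 100, 1)
print(read_fraction_pct)  # 4.7 -> under 5% of the row data
```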

Wikipedia trending article of the month

Inspired by the article about finding trending topics using Google Books n-grams data, I decided to implement the same algorithm on top of the Wikipedia page visit statistics data. My goal here is to find the “article trending this month,” which has significantly more visits this month compared to the previous month. As I was implementing the algorithm, I came across another ClickHouse limitation: join syntax is limited. In ClickHouse, you can only do join with the “using” keyword. This means that the fields you’re joining need to have the same name. If the field name is different, we have to use a subquery.

Below is an example.

First, create a temporary table to aggregate the visits per month per page:

CREATE TABLE wikistat_by_month ENGINE = Memory AS
SELECT
    path,
    mon,
    sum(hits) / total_hits AS ratio
FROM
(
    SELECT
        path,
        hits,
        toMonth(date) AS mon
    FROM wikistat
    WHERE (project = 'en') AND (lower(path) NOT LIKE '%special%') AND (lower(path) NOT LIKE '%page%') AND (lower(path) NOT LIKE '%test%') AND (lower(path) NOT LIKE '%wiki%') AND (lower(path) NOT LIKE '%index.html%')
) AS a
ANY INNER JOIN
(
    SELECT
        toMonth(date) AS mon,
        sum(hits) AS total_hits
    FROM wikistat
    WHERE (project = 'en') AND (lower(path) NOT LIKE '%special%') AND (lower(path) NOT LIKE '%page%') AND (lower(path) NOT LIKE '%test%') AND (lower(path) NOT LIKE '%wiki%') AND (lower(path) NOT LIKE '%index.html%')
    GROUP BY toMonth(date)
) AS b USING (mon)
GROUP BY
    path,
    mon,
    total_hits
ORDER BY ratio DESC
Ok.
0 rows in set. Elapsed: 543.607 sec. Processed 53.77 billion rows, 2.57 TB (98.91 million rows/s., 4.73 GB/s.)

Second, calculate the actual list:

SELECT
    path,
    mon + 1,
    a_ratio AS ratio,
    a_ratio / b_ratio AS increase
FROM
(
    SELECT
        path,
        mon,
        ratio AS a_ratio
    FROM wikistat_by_month
    WHERE ratio > 0.0001
) AS a
ALL INNER JOIN
(
    SELECT
        path,
        CAST((mon - 1) AS UInt8) AS mon,
        ratio AS b_ratio
    FROM wikistat_by_month
    WHERE ratio > 0.0001
) AS b USING (path, mon)
WHERE (mon > 0) AND (increase > 2)
ORDER BY
    mon ASC,
    increase DESC
LIMIT 100
┌─path───────────────────────────────────────────────┬─plus(mon, 1)─┬──────────────────ratio─┬───────────increase─┐
│ Heath_Ledger                                       │            2 │  0.0008467223172121601 │  6.853825241458039 │
│ Cloverfield                                        │            2 │  0.0009372609760313347 │  3.758937474560766 │
│ The_Dark_Knight_(film)                             │            2 │  0.0003508532447770276 │ 2.8858100355450484 │
│ Scientology                                        │            2 │  0.0003300109101992719 │   2.52497180013816 │
│ Barack_Obama                                       │            3 │  0.0005786473399980557 │  2.323409928527576 │
│ Canine_reproduction                                │            3 │  0.0004836300843539438 │ 2.0058985801174662 │
│ Iron_Man                                           │            6 │    0.00036261003907049 │ 3.5301196568303888 │
│ Iron_Man_(film)                                    │            6 │ 0.00035634745198422497 │ 3.3815325090507193 │
│ Grand_Theft_Auto_IV                                │            6 │  0.0004036713142943461 │ 3.2112732008504885 │
│ Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull │            6 │  0.0002856570195547951 │  2.683443198030021 │
│ Tha_Carter_III                                     │            7 │ 0.00033954377342889735 │  2.820114216429247 │
│ EBay                                               │            7 │  0.0006575000133427979 │ 2.5483158977946787 │
│ Bebo                                               │            7 │  0.0003958340022793501 │ 2.3260912792668162 │
│ Facebook                                           │            7 │   0.001683658379576915 │   2.16460972864883 │
│ Yahoo!_Mail                                        │            7 │  0.0002190640575012259 │ 2.1075879062784737 │
│ MySpace                                            │            7 │   0.001395608643577507 │  2.103263660621813 │
│ Gmail                                              │            7 │  0.0005449834079575953 │ 2.0675919337716757 │
│ Hotmail                                            │            7 │  0.0009126863121737026 │  2.052471735190232 │
│ Google                                             │            7 │   0.000601645849087389 │ 2.0155448612416644 │
│ Barack_Obama                                       │            7 │ 0.00027336526076130943 │ 2.0031305241832302 │
│ Facebook                                           │            8 │  0.0007778115183044431 │  2.543477658022576 │
│ MySpace                                            │            8 │   0.000663544314346641 │  2.534512981232934 │
│ Two-Face                                           │            8 │ 0.00026975137404447024 │ 2.4171743959768803 │
│ YouTube                                            │            8 │   0.001482456447101451 │ 2.3884527929836152 │
│ Hotmail                                            │            8 │ 0.00044467667764940547 │ 2.2265750216262954 │
│ The_Dark_Knight_(film)                             │            8 │  0.0010482536106662156 │  2.190078096294301 │
│ Google                                             │            8 │  0.0002985028319919154 │ 2.0028812075734637 │
│ Joe_Biden                                          │            9 │ 0.00045067411455437264 │  2.692262662620829 │
│ The_Dark_Knight_(film)                             │            9 │ 0.00047863754833213585 │  2.420864550676665 │
│ Sarah_Palin                                        │           10 │  0.0012459220318907518 │  2.607063205782761 │
│ Barack_Obama                                       │           12 │  0.0034487235202817087 │ 15.615409029600414 │
│ George_W._Bush                                     │           12 │ 0.00042708730873936023 │ 3.6303098900144937 │
│ Fallout_3                                          │           12 │  0.0003568429236849597 │ 2.6193094036745155 │
└────────────────────────────────────────────────────┴──────────────┴────────────────────────┴────────────────────┘
34 rows in set. Elapsed: 1.062 sec. Processed 1.22 billion rows, 49.03 GB (1.14 billion rows/s., 46.16 GB/s.)

The response time is really good, considering the amount of data the queries needed to scan (the first query alone scanned 2.57 TB).

Conclusion

The ClickHouse column-oriented database looks promising for data analytics, as well as for storing and processing structured event data and time series data. ClickHouse can be ~10x faster than Spark for some workloads.

Appendix: Benchmark details

Hardware

  • CPU: 24xIntel(R) Xeon(R) CPU L5639 @ 2.13GHz (physical = 2, cores = 12, virtual = 24, hyperthreading = yes)
  • Disk: 2 consumer grade SSD in software RAID 0 (mdraid)

Query 1

select count(*) from wikistat

ClickHouse:

:)  select count(*) from wikistat;
SELECT count(*)
FROM wikistat
┌─────count()─┐
│ 26935251789 │
└─────────────┘
1 rows in set. Elapsed: 6.610 sec. Processed 26.88 billion rows, 53.77 GB (4.07 billion rows/s., 8.13 GB/s.)

Spark:

spark-sql> select count(*) from wikistat;
26935251789
Time taken: 7.369 seconds, Fetched 1 row(s)

Query 2

select count(*), month(dt) as mon
from wikistat where year(dt)=2008
and month(dt) between 1 and 10
group by month(dt)
order by month(dt);

ClickHouse:

:) select count(*), toMonth(date) as mon from wikistat
where toYear(date)=2008 and toMonth(date) between 1 and 10 group by mon;
SELECT
    count(*),
    toMonth(date) AS mon
FROM wikistat
WHERE (toYear(date) = 2008) AND ((toMonth(date) >= 1) AND (toMonth(date) <= 10))
GROUP BY mon
┌────count()─┬─mon─┐
│ 2100162604 │   1 │
│ 1969757069 │   2 │
│ 2081371530 │   3 │
│ 2156878512 │   4 │
│ 2476890621 │   5 │
│ 2526662896 │   6 │
│ 2489723244 │   7 │
│ 2480356358 │   8 │
│ 2522746544 │   9 │
│ 2614372352 │  10 │
└────────────┴─────┘
10 rows in set. Elapsed: 37.450 sec. Processed 23.37 billion rows, 46.74 GB (623.97 million rows/s., 1.25 GB/s.)

Spark:

spark-sql> select count(*), month(dt) as mon from wikistat where year(dt)=2008 and month(dt) between 1 and 10 group by month(dt) order by month(dt);
2100162604      1
1969757069      2
2081371530      3
2156878512      4
2476890621      5
2526662896      6
2489723244      7
2480356358      8
2522746544      9
2614372352      10
Time taken: 792.552 seconds, Fetched 10 row(s)

Query 3

SELECT
path,
count(*),
sum(hits) AS sum_hits,
round(sum(hits) / count(*), 2) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 100;

ClickHouse:

:) SELECT
:-]     path,
:-]     count(*),
:-]     sum(hits) AS sum_hits,
:-]     round(sum(hits) / count(*), 2) AS hit_ratio
:-] FROM wikistat
:-] WHERE (project = 'en')
:-] GROUP BY path
:-] ORDER BY sum_hits DESC
:-] LIMIT 100;
SELECT
    path,
    count(*),
    sum(hits) AS sum_hits,
    round(sum(hits) / count(*), 2) AS hit_ratio
FROM wikistat
WHERE project = 'en'
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 100
┌─path────────────────────────────────────────────────┬─count()─┬───sum_hits─┬─hit_ratio─┐
│ Special:Search                                      │   44795 │ 4544605711 │ 101453.41 │
│ Main_Page                                           │   31930 │ 2115896977 │  66266.74 │
│ Special:Random                                      │   30159 │  533830534 │  17700.54 │
│ Wiki                                                │   10237 │   40488416 │   3955.11 │
│ Special:Watchlist                                   │   38206 │   37200069 │    973.67 │
│ YouTube                                             │    9960 │   34349804 │   3448.78 │
│ Special:Randompage                                  │    8085 │   28959624 │    3581.9 │
│ Special:AutoLogin                                   │   34413 │   24436845 │    710.11 │
│ Facebook                                            │    7153 │   18263353 │   2553.24 │
│ Wikipedia                                           │   23732 │   17848385 │    752.08 │
│ Barack_Obama                                        │   13832 │   16965775 │   1226.56 │
│ index.html                                          │    6658 │   16921583 │   2541.54 │
…
100 rows in set. Elapsed: 398.550 sec. Processed 26.88 billion rows, 1.24 TB (67.45 million rows/s., 3.10 GB/s.)

Spark:

spark-sql> SELECT
         >     path,
         >     count(*),
         >     sum(hits) AS sum_hits,
         >     round(sum(hits) / count(*), 2) AS hit_ratio
         > FROM wikistat
         > WHERE (project = 'en')
         > GROUP BY path
         > ORDER BY sum_hits DESC
         > LIMIT 100;
...
Time taken: 2522.903 seconds, Fetched 100 row(s)

 

by Alexander Rubin at February 13, 2017 11:24 PM

Oli Sennhauser

MySQL and MariaDB authentication against pam_unix

The PAM authentication plug-in is an extension included in MySQL Enterprise Edition (since 5.5) and in MariaDB (since 5.2).

MySQL authentication against pam_unix


Check if plug-in is available:

# ll lib/plugin/auth*so
-rwxr-xr-x 1 mysql mysql 42937 Sep 18  2015 lib/plugin/authentication_pam.so
-rwxr-xr-x 1 mysql mysql 25643 Sep 18  2015 lib/plugin/auth.so
-rwxr-xr-x 1 mysql mysql 12388 Sep 18  2015 lib/plugin/auth_socket.so
-rwxr-xr-x 1 mysql mysql 25112 Sep 18  2015 lib/plugin/auth_test_plugin.so

Install PAM plug-in:

mysql> INSTALL PLUGIN authentication_pam SONAME 'authentication_pam.so';

Check plug-in information:

mysql> SELECT * FROM information_schema.plugins WHERE plugin_name = 'authentication_pam'\G
*************************** 1. row ***************************
           PLUGIN_NAME: authentication_pam
        PLUGIN_VERSION: 1.1
         PLUGIN_STATUS: ACTIVE
           PLUGIN_TYPE: AUTHENTICATION
   PLUGIN_TYPE_VERSION: 1.1
        PLUGIN_LIBRARY: authentication_pam.so
PLUGIN_LIBRARY_VERSION: 1.7
         PLUGIN_AUTHOR: Georgi Kodinov
    PLUGIN_DESCRIPTION: PAM authentication plugin
        PLUGIN_LICENSE: PROPRIETARY
           LOAD_OPTION: ON

This set-up is persisted and survives a database restart because the plug-in is registered in the mysql schema:

mysql> SELECT * FROM mysql.plugin;
+--------------------+-----------------------+
| name               | dl                    |
+--------------------+-----------------------+
| authentication_pam | authentication_pam.so |
+--------------------+-----------------------+

Configuring PAM on Ubuntu/Debian:

#%PAM-1.0
#
# /etc/pam.d/mysql
#
@include common-auth
@include common-account
@include common-session-noninteractive

Create the database user matching the O/S user:

mysql> CREATE USER 'oli'@'localhost'
IDENTIFIED WITH authentication_pam AS 'mysql'
;
mysql> GRANT ALL PRIVILEGES ON test.* TO 'oli'@'localhost';

Verifying user in the database:

mysql> SELECT user, host, authentication_string FROM mysql.user WHERE user = 'oli';
+-----------+-----------+-------------------------------------------+
| user      | host      | authentication_string                     |
+-----------+-----------+-------------------------------------------+
| oli       | localhost | mysql                                     |
+-----------+-----------+-------------------------------------------+

mysql> SHOW CREATE USER 'oli'@'localhost';
+-----------------------------------------------------------------------------------------------------------------------------------+
| CREATE USER for oli@localhost                                                                                                     |
+-----------------------------------------------------------------------------------------------------------------------------------+
| CREATE USER 'oli'@'localhost' IDENTIFIED WITH 'authentication_pam' AS 'mysql' REQUIRE NONE PASSWORD EXPIRE DEFAULT ACCOUNT UNLOCK |
+-----------------------------------------------------------------------------------------------------------------------------------+

Connection tests:

# mysql --user=oli --host=localhost
ERROR 2059 (HY000): Authentication plugin 'mysql_clear_password' cannot be loaded: plugin not enabled

# mysql --user=oli --host=localhost --enable-cleartext-plugin --password=wrong
ERROR 1045 (28000): Access denied for user 'oli'@'localhost' (using password: YES)

# tail /var/log/auth.log
Feb 13 15:15:14 chef unix_chkpwd[31600]: check pass; user unknown
Feb 13 15:15:14 chef unix_chkpwd[31600]: password check failed for user (oli)

# mysql --user=oli --host=localhost --enable-cleartext-plugin --password=right
ERROR 1045 (28000): Access denied for user 'oli'@'localhost' (using password: YES)

# tail /var/log/auth.log
Feb 13 15:15:40 chef unix_chkpwd[31968]: check pass; user unknown
Feb 13 15:15:40 chef unix_chkpwd[31968]: password check failed for user (oli)

Some research led to the following result: the non-privileged mysql user is not allowed to access the file /etc/shadow, thus it should be added to the shadow group to make this work:

# ll /sbin/unix_chkpwd
-rwxr-sr-x 1 root shadow 35536 Mar 16  2016 /sbin/unix_chkpwd

# usermod -a -G shadow mysql


Connection tests:

# mysql --user=oli --host=localhost --enable-cleartext-plugin --password=right

mysql> SELECT USER(), CURRENT_USER(), @@proxy_user;
+---------------+----------------+--------------+
| USER()        | CURRENT_USER() | @@proxy_user |
+---------------+----------------+--------------+
| oli@localhost | oli@localhost  | NULL         |
+---------------+----------------+--------------+

MariaDB authentication against pam_unix


Check if plug-in is available:

# ll lib/plugin/auth*so
-rwxr-xr-x 1 mysql mysql 12462 Nov  4 14:37 lib/plugin/auth_0x0100.so
-rwxr-xr-x 1 mysql mysql 33039 Nov  4 14:37 lib/plugin/auth_gssapi_client.so
-rwxr-xr-x 1 mysql mysql 80814 Nov  4 14:37 lib/plugin/auth_gssapi.so
-rwxr-xr-x 1 mysql mysql 19015 Nov  4 14:37 lib/plugin/auth_pam.so
-rwxr-xr-x 1 mysql mysql 13028 Nov  4 14:37 lib/plugin/auth_socket.so
-rwxr-xr-x 1 mysql mysql 23521 Nov  4 14:37 lib/plugin/auth_test_plugin.so

Install PAM plug-in:

mysql> INSTALL SONAME 'auth_pam';

Check plug-in information:

mysql> SELECT * FROM information_schema.plugins WHERE plugin_name = 'pam'\G
*************************** 1. row ***************************
           PLUGIN_NAME: pam
        PLUGIN_VERSION: 1.0
         PLUGIN_STATUS: ACTIVE
           PLUGIN_TYPE: AUTHENTICATION
   PLUGIN_TYPE_VERSION: 2.0
        PLUGIN_LIBRARY: auth_pam.so
PLUGIN_LIBRARY_VERSION: 1.11
         PLUGIN_AUTHOR: Sergei Golubchik
    PLUGIN_DESCRIPTION: PAM based authentication
        PLUGIN_LICENSE: GPL
           LOAD_OPTION: ON
       PLUGIN_MATURITY: Stable
   PLUGIN_AUTH_VERSION: 1.0

Configuring PAM on Ubuntu/Debian:

#%PAM-1.0
#
# /etc/pam.d/mysql
#
@include common-auth
@include common-account
@include common-session-noninteractive

Create the database user matching the O/S user:

mysql> CREATE USER 'oli'@'localhost'
IDENTIFIED VIA pam USING 'mariadb'
;
mysql> GRANT ALL PRIVILEGES ON test.* TO 'oli'@'localhost';

Verifying user in the database:

mysql> SELECT user, host, authentication_string FROM mysql.user WHERE user = 'oli';
+------+-----------+-----------------------+
| user | host      | authentication_string |
+------+-----------+-----------------------+
| oli  | localhost | mariadb               |
+------+-----------+-----------------------+

Connection tests:

# mysql --user=oli --host=localhost --password=wrong
ERROR 2059 (HY000): Authentication plugin 'dialog' cannot be loaded: /usr/local/mysql/lib/plugin/dialog.so: cannot open shared object file: No such file or directory

# tail /var/log/auth.log
Feb 13 17:11:16 chef mysqld: pam_unix(mariadb:auth): unexpected response from failed conversation function
Feb 13 17:11:16 chef mysqld: pam_unix(mariadb:auth): conversation failed
Feb 13 17:11:16 chef mysqld: pam_unix(mariadb:auth): auth could not identify password for [oli]
Feb 13 17:11:16 chef mysqld: pam_winbind(mariadb:auth): getting password (0x00000388)
Feb 13 17:11:16 chef mysqld: pam_winbind(mariadb:auth): Could not retrieve user's password

# mysql --user=oli --host=localhost --password=wrong --plugin-dir=$PWD/lib/plugin
ERROR 1045 (28000): Access denied for user 'oli'@'localhost' (using password: NO)
Feb 13 17:11:30 chef mysqld: pam_unix(mariadb:auth): authentication failure; logname= uid=1001 euid=1001 tty= ruser= rhost=  user=oli
Feb 13 17:11:30 chef mysqld: pam_winbind(mariadb:auth): getting password (0x00000388)
Feb 13 17:11:30 chef mysqld: pam_winbind(mariadb:auth): pam_get_item returned a password
Feb 13 17:11:30 chef mysqld: pam_winbind(mariadb:auth): request wbcLogonUser failed: WBC_ERR_AUTH_ERROR, PAM error: PAM_USER_UNKNOWN (10), NTSTATUS: NT_STATUS_NO_SUCH_USER, Error message was: No such user

Add mysql user to the shadow group:

# ll /sbin/unix_chkpwd
-rwxr-sr-x 1 root shadow 35536 Mar 16  2016 /sbin/unix_chkpwd

# usermod -a -G shadow mysql

Connection tests:

# mysql --user=oli --host=localhost --password=right --plugin-dir=$PWD/lib/plugin

mysql> SELECT USER(), CURRENT_USER(), @@proxy_user;
+---------------+----------------+--------------+
| USER()        | CURRENT_USER() | @@proxy_user |
+---------------+----------------+--------------+
| oli@localhost | oli@localhost  | NULL         |
+---------------+----------------+--------------+


by Shinguz at February 13, 2017 05:02 PM

Jean-Jerome Schmidt

MySQL in the Cloud - Pros and Cons of Amazon RDS

Moving your data into a public cloud service is a big decision. All the major cloud vendors offer cloud database services, with Amazon RDS for MySQL being probably the most popular.

In this blog, we’ll have a close look at what it is, how it works, and compare its pros and cons.

RDS (Relational Database Service) is an Amazon Web Services offering. In short, it is a Database as a Service, where Amazon deploys and operates your database. It takes care of tasks like backup and patching the database software, as well as high availability. RDS supports several databases; here we are mainly interested in MySQL - Amazon supports both MySQL and MariaDB. There is also Aurora, which is Amazon's clone of MySQL, improved especially in the areas of replication and high availability.

Deploying MySQL via RDS

Let’s take a look at the deployment of MySQL via RDS. We pick MySQL and are then presented with a couple of deployment patterns to choose from.

The main choice is - do we want to have high availability or not? Aurora is also promoted.

The next dialog box gives us some options to customize. You can pick one of many MySQL versions - several 5.5, 5.6 and 5.7 versions are available. For the database instance, you can choose from the typical instance sizes available in a given region.

The next option is a pretty important choice - do you want to use multi-AZ deployment or not? This is all about high availability. If you don’t want to use multi-AZ deployment, a single instance will be installed. In case of a failure, a new one will be spun up and the data volume will be remounted to it. This process takes some time, during which your database will not be available. Of course, you can minimize this impact by using slaves and promoting one of them, but it’s not an automated process. If you want automated high availability, you should use multi-AZ deployment. What will happen is that two database instances will be created. One is visible to you. The second instance, in a separate availability zone, is not visible to the user: it acts as a shadow copy, ready to take over the traffic once the active node fails. It is still not a perfect solution, as traffic has to be switched from the failed instance to the shadow one. In our tests, it took ~45s to perform a failover but, obviously, this may depend on instance size, I/O performance etc. But it’s much better than non-automated failover where only slaves are involved.

Finally, we have storage settings - type, size, PIOPS (where applicable) and database settings - identifier, user and password.

In the next step, a few more options are waiting for user input.

We can choose where the instance should be created: VPC, subnet, should it be publicly available or not (as in - should a public IP be assigned to the RDS instance), availability zone and VPC Security Group. Then, we have database options: first schema to be created, port, parameter and option groups, whether metadata tags should be included in snapshots or not, encryption settings.

Next, backup options: how long do you want to keep your backups, and when would you like them taken? A similar setup applies to maintenance - sometimes Amazon administrators have to perform maintenance on your RDS instance, and it will happen within a predefined window which you can set here. Please note that the maintenance window cannot be shorter than 30 minutes, which is why having a multi-AZ instance in production is really important. Maintenance may result in a node restart or lack of availability for some time. Without multi-AZ, you need to accept that downtime. With multi-AZ deployment, a failover happens instead.

Finally, we have settings related to additional monitoring - do we want to have it enabled or not?

Managing RDS

In this chapter we will take a closer look at how to manage MySQL RDS. We will not go through every option available out there, but we’d like to highlight some of the features Amazon made available.

Snapshots

MySQL RDS uses EBS volumes as storage, so it can use EBS snapshots for different purposes. Backups, slaves - all based on snapshots. You can create snapshots manually or they can be taken automatically, when such need arises. It is important to keep in mind that EBS snapshots in general (not only on RDS instances) add some overhead to I/O operations. If you take a snapshot, expect your I/O performance to drop. Unless you use multi-AZ deployment, that is. In that case, the “shadow” instance is used as the source of snapshots and no impact is visible on the production instance.

Backups

Backups are based on snapshots. As mentioned above, you can define your backup schedule and retention when you create a new instance. Of course, you can edit those settings afterwards, through the “modify instance” option.

At any time you can restore a snapshot - you need to go to the snapshot section, pick the snapshot you want to restore, and you will be presented with a dialog similar to the one you saw when you created a new instance. This is not a surprise, as you can only restore a snapshot into a new instance - there is no way to restore it onto one of the existing RDS instances. It may come as a surprise, but even in a cloud environment it may make sense to reuse hardware (and the instances you already have). In a shared environment, the performance of a single virtual instance may differ - you may prefer to stick to the performance profile that you are already familiar with. Unfortunately, that is not possible in RDS.

Another option in RDS is point-in-time recovery - a very important feature, and a requirement for anyone who needs to take good care of their data. Here things are more complex and less bright. For starters, it’s important to keep in mind that MySQL RDS hides the binary logs from the user. You can change a couple of settings and list the created binlogs, but you don’t have direct access to them - to perform any operation, including using them for recovery, you can only use the UI or CLI. This limits your options to what Amazon allows you to do: it lets you restore your backup up to the latest “restorable time”, which happens to be calculated at a 5-minute interval. So, if your data was removed at 9:33am, you can restore it only up to the state at 9:30am. Point-in-time recovery works the same way as restoring snapshots - a new instance is created.
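The 5-minute granularity means the latest restorable time is effectively the incident time floored to the previous 5-minute boundary. A small sketch of that rounding (my own illustration, not RDS code):

```python
from datetime import datetime, timedelta

def latest_restorable(t: datetime, interval_min: int = 5) -> datetime:
    # Floor the timestamp to the previous interval boundary
    return t - timedelta(minutes=t.minute % interval_min,
                         seconds=t.second, microseconds=t.microsecond)

# Data removed at 9:33 -> restorable only up to 9:30
print(latest_restorable(datetime(2017, 2, 13, 9, 33)))  # 2017-02-13 09:30:00
```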

Scale-out, replication

MySQL RDS allows scale-out through adding new slaves. When a slave is created, a snapshot of the master is taken and used to create the new host. This part works pretty well. Unfortunately, you cannot create any more complex replication topology, like one involving intermediate masters. You are not able to create a master-master setup, which leaves any HA in the hands of Amazon (and multi-AZ deployments). From what we can tell, there is no way to enable GTID (not that you could benefit from it, as you don’t have any control over the replication - there is no CHANGE MASTER in RDS), only regular, old-fashioned binlog positions.

Lack of GTID makes multithreaded replication unfeasible - while it is possible to set a number of workers using RDS parameter groups, without GTID this is unusable. The main issue is that there is no single binary log position to recover from in case of a crash - some workers could have been behind, some could be more advanced. If you use the latest applied event, you’ll lose the data not yet applied by the “lagging” workers. If you use the oldest event, you’ll most likely end up with “duplicate key” errors caused by events already applied by the more advanced workers. Of course, there is a way to solve this problem, but it is not trivial and it is time-consuming - definitely not something you could easily automate.

Users created on MySQL RDS don’t have the SUPER privilege, so operations that are simple on standalone MySQL are not trivial in RDS. Amazon decided to use stored procedures to empower the user to perform some of those operations. From what we can tell, a number of potential issues are covered, although it hasn’t always been the case - we remember when you couldn’t rotate to the next binary log on the master, so a master crash plus binlog corruption could render all slaves broken. Now there is a procedure for that: rds_next_master_log.

A slave can be manually promoted to a master. This would allow you to create some sort of HA on top of the multi-AZ mechanism (or bypassing it), but it is made pointless by the fact that you cannot reslave any of the existing slaves to the new master. Remember, you don’t have any control over the replication. This makes the whole exercise futile - unless your master can accommodate all of your traffic. After promoting a new master, you are not able to fail over to it because it does not have any slaves to handle your load. Spinning up new slaves will take time, as EBS snapshots have to be created first, and this may take hours. Then, you need to warm up the infrastructure before you can put load on it.

Lack of SUPER privilege

As we stated earlier, RDS does not grant users the SUPER privilege, and this becomes annoying for someone who is used to having it on MySQL. In the first weeks you will learn how often it is required for things you do rather frequently - such as killing queries or operating the performance schema. In RDS, you have to stick to a predefined list of stored procedures and use them instead of doing those things directly. You can list all of them using the following query:

SELECT specific_name FROM information_schema.routines;

As with replication, a number of tasks are covered, but if you end up in a situation which is not yet covered, then you’re out of luck.
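
For example, killing a runaway connection - normally a one-liner with KILL, but restricted without SUPER when the thread belongs to another user - goes through one of those procedures instead (the thread id 1234 below is just a placeholder taken from SHOW PROCESSLIST):

CALL mysql.rds_kill(1234);       -- terminate the whole connection
CALL mysql.rds_kill_query(1234); -- terminate only the running statement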

Interoperability and Hybrid Cloud Setups

This is another area where RDS is lacking flexibility. Let’s say you want to build a mixed cloud/on-premises setup - you have an RDS infrastructure and you’d like to create a couple of slaves on premises. The main problem you’ll be facing is that there is no way to move data out of RDS except by taking a logical dump. You can take snapshots of RDS data, but you don’t have access to them and you cannot move them away from AWS. You also don’t have physical access to the instance to use xtrabackup, rsync or even cp. The only option for you is to use mysqldump, mydumper or similar tools. This adds complexity (character set and collation settings have the potential to cause problems) and is time-consuming (it takes a long time to dump and load data using logical backup tools).

It is possible to set up replication between RDS and an external instance (in both directions, so migrating data into RDS is also possible), but it can be a very time-consuming process.

On the other hand, if you want to stay within the RDS environment and span your infrastructure across the Atlantic, or from the east to the west coast of the US, RDS allows you to do that - you can easily pick a region when you create a new slave.

Unfortunately, if you’d like to move your master from one region to another, this is virtually impossible without downtime - unless your single node can handle all of your traffic.

Security

While MySQL RDS is a managed service, not every aspect of security is taken care of by Amazon’s engineers. Amazon calls it a “Shared Responsibility Model”. In short, Amazon takes care of the security of the network and storage layer (so that data is transferred in a secure way) and of the operating system (patches, security fixes). On the other hand, the user has to take care of the rest of the security model: make sure traffic to and from the RDS instance is limited within the VPC, ensure that database-level authentication is done right (no password-less MySQL user accounts), and verify that API security is ensured (IAM policies are set correctly and with the minimal required privileges). The user should also take care of firewall settings (security groups) to minimize exposure of RDS, and of the VPC it’s in, to external networks. It’s also the user’s responsibility to implement data-at-rest encryption - either on the application level or on the database level, by creating an encrypted RDS instance in the first place.

Database-level encryption can be enabled only at instance creation; you cannot encrypt an existing, already running database.

RDS limitations

If you plan to use RDS or if you are already using it, you need to be aware of limitations that come with MySQL RDS.

Lack of the SUPER privilege can be, as we mentioned, very annoying. While stored procedures take care of a number of operations, there is a learning curve, as you need to learn to do things in a different way. Lack of SUPER can also create problems with external monitoring and trending tools - there are still some tools which may require this privilege for part of their functionality.

Lack of direct access to the MySQL data directory and logs makes it harder to perform actions which involve them. Every now and then a DBA needs to parse binary logs or tail the error, slow query or general log. While it is possible to access those logs on RDS, it is more cumbersome than doing whatever you need by logging into a shell on the MySQL host. Downloading them locally also takes some time and adds additional latency to whatever you do.
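
One partial workaround - assuming the slow query log is enabled and the parameter group sets log_output=TABLE - is to read the logs over SQL instead of downloading files:

SELECT start_time, user_host, query_time, sql_text
FROM mysql.slow_log
ORDER BY start_time DESC
LIMIT 10;

The general log can be queried the same way through the mysql.general_log table.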

Lack of control over the replication topology, with high availability only in multi-AZ deployments. Given that you don’t have control over the replication, you cannot implement any kind of high availability mechanism in your database layer. It doesn’t matter that you have several slaves; you cannot use them as master candidates because, even if you promote a slave to a master, there is no way to reslave the remaining slaves off this new master. This forces users to use multi-AZ deployments and increases costs (the “shadow” instance doesn’t come free; the user has to pay for it).

Reduced availability through planned downtime. When deploying an RDS instance, you are forced to pick a weekly time window of 30 minutes during which maintenance operations may be executed on your RDS instance. On the one hand, this is understandable, as RDS is a Database as a Service, so hardware and software upgrades of your RDS instances are managed by AWS engineers. On the other hand, this reduces your availability, because you cannot prevent your master database from going down for the duration of the maintenance period. Again, in this case a multi-AZ setup increases availability, as changes happen first on the shadow instance and then a failover is executed. The failover itself, though, is not transparent, so one way or the other you lose some uptime. This forces you to design your application with unexpected MySQL master failures in mind. Not that it’s a bad design pattern - databases can crash at any time and your application should be built so that it can withstand even the most dire scenario. It’s just that with RDS, you have limited options for high availability.

Reduced options for high availability implementation. Given the lack of flexibility in replication topology management, the only feasible high availability method is multi-AZ deployment. This method is good, but there are tools for MySQL replication which would minimize the downtime even further. For example, MHA or ClusterControl, when used together with ProxySQL, can deliver (under some conditions, like the absence of long-running transactions) a failover process that is transparent to the application. On RDS, you won’t be able to use this method.

Reduced insight into the performance of your database. While you can get metrics from MySQL itself, sometimes it’s just not enough to get a full 10,000-foot view of the situation. At some point, the majority of users will have to deal with really weird issues caused by faulty hardware or faulty infrastructure - lost network packets, abruptly terminated connections or unexpectedly high CPU utilization. When you have access to your MySQL host, you can leverage lots of tools that help you diagnose the state of a Linux server. When using RDS, you are limited to the metrics available in CloudWatch, Amazon’s monitoring and trending tool. Any more detailed diagnosis requires contacting support and asking them to check and fix the problem. This may be quick, but it can also be a very long process with a lot of back-and-forth email communication.

Vendor lock-in, caused by the complex and time-consuming process of getting data out of MySQL RDS. RDS doesn’t grant access to the MySQL data directory, so there is no way to use industry-standard tools like xtrabackup to move data in a binary way. Also, since RDS under the hood is a MySQL maintained by Amazon, it is hard to tell whether it is 100% compatible with upstream. RDS is only available on AWS, so you would not be able to do a hybrid setup.

Summary

MySQL RDS has both strengths and weaknesses. It is a very good tool for those who’d like to focus on the application without having to worry about operating the database. You deploy a database and start issuing queries. No need to build backup scripts or set up a monitoring solution, because it’s already done by AWS engineers - all you need to do is use it.

There is also a dark side to MySQL RDS: a lack of options for building more complex setups and for scaling beyond just adding more slaves; no support for better high availability than what’s offered with multi-AZ deployments; cumbersome access to MySQL logs; and a lack of direct access to the MySQL data directory and of support for physical backups, which makes it hard to move data out of an RDS instance.

To sum it up, RDS may work fine for you if you value ease of use over detailed control of the database. You do need to keep in mind that, at some point in the future, you may outgrow MySQL RDS. We are not necessarily talking about performance only; it’s more about your organization’s need for a more complex replication topology, or the need for better insight into database operations to deal quickly with the different issues that arise from time to time. In that case, if your dataset has already grown in size, you may find it tricky to move out of RDS. Before making any decision to move your data into RDS, you must consider your organization’s requirements and constraints in these specific areas.

In the next couple of blog posts we will show you how to take your data out of RDS into a separate location. We will discuss both migration to EC2 and to on-premises infrastructure.

by krzysztof at February 13, 2017 02:55 PM

February 10, 2017

Chris Calender

An Introduction to MariaDB’s Data at Rest Encryption (DARE) – Part 2

Okay, so you’ve read the first post on enabling MariaDB’s data at rest encryption, and now you are ready to create an encrypted table.

And just to get it out of the way for those interested, you can always check your encrypted (and non-encrypted) table stats via:

SELECT * FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION;

ENCRYPTION_SCHEME=1 means the table is encrypted and ENCRYPTION_SCHEME=0 means it is not.
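
So, for instance, a quick way to list any tables that are still unencrypted is to filter on that column:

SELECT NAME
FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION
WHERE ENCRYPTION_SCHEME = 0;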

But let’s get into some specific examples.

I find the following 4 statements interesting, as the first 3 essentially all create the same table, and the 4th shows how to create a non-encrypted table once you have encryption enabled.

CREATE TABLE t10 (id int) ENGINE=INNODB;
CREATE TABLE t11 (id int) ENGINE=INNODB ENCRYPTED=YES;
CREATE TABLE t12 (id int) ENGINE=INNODB ENCRYPTED=YES ENCRYPTION_KEY_ID=18;
CREATE TABLE t13 (id int) ENGINE=INNODB ENCRYPTED=NO;
MariaDB> SELECT * FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION WHERE NAME LIKE 'dare/t1%';
+-------+----------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+
| SPACE | NAME     | ENCRYPTION_SCHEME | KEYSERVER_REQUESTS | MIN_KEY_VERSION | CURRENT_KEY_VERSION | KEY_ROTATION_PAGE_NUMBER | KEY_ROTATION_MAX_PAGE_NUMBER | CURRENT_KEY_ID |
+-------+----------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+
|    48 | dare/t10 |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |
|    49 | dare/t11 |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |
|    50 | dare/t13 |                 0 |                  0 |               0 |                   1 |                     NULL |                         NULL |              1 |
|    51 | dare/t12 |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |             18 |
+-------+----------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+

So, when configured as above, we see that t10, t11, and t12 were all encrypted, no matter whether we specified ENCRYPTED=YES in the CREATE TABLE. We also see that t10 and t11 used KEY_ID #1, since we did not specify ENCRYPTION_KEY_ID in the CREATE TABLE, whereas t12 used KEY_ID #18, since we specified ENCRYPTION_KEY_ID=18 in the CREATE TABLE.

We see that t13 was not encrypted, since we specified ENCRYPTED=NO in the CREATE TABLE. This is only possible when you use innodb_file_per_table (the default).
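
You can verify that setting with:

SHOW GLOBAL VARIABLES LIKE 'innodb_file_per_table';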

If we attempt to create a table specifying an ENCRYPTION_KEY_ID value that is not in the keys.txt file, you’ll see an error like the following:

MariaDB> CREATE TABLE t15 (id int) ENGINE=INNODB ENCRYPTED=YES ENCRYPTION_KEY_ID=17;
ERROR 1005 (HY000): Can't create table `dare`.`t15` (errno: 140 "Wrong create options")

You will see the same if you try to ALTER an existing table from one ENCRYPTION_KEY_ID to another that does not exist.

Examples with above settings:

Let’s encrypt a non-encrypted table (t13) to be encrypted:

MariaDB> SELECT NAME, ENCRYPTION_SCHEME FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION WHERE NAME LIKE 'dare/t13';
+----------+-------------------+
| NAME     | ENCRYPTION_SCHEME |
+----------+-------------------+
| dare/t13 |                 0 |
+----------+-------------------+

MariaDB> ALTER TABLE t13 ENCRYPTED=YES;
Query OK, 0 rows affected (0.16 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB> SELECT NAME, ENCRYPTION_SCHEME FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION WHERE NAME LIKE 'dare/t13';
+----------+-------------------+
| NAME     | ENCRYPTION_SCHEME |
+----------+-------------------+
| dare/t13 |                 1 |
+----------+-------------------+

Now let’s perform the reverse, and un-encrypt an encrypted table:

MariaDB> SELECT NAME, ENCRYPTION_SCHEME FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION WHERE NAME LIKE 'dare/t13';
+----------+-------------------+
| NAME     | ENCRYPTION_SCHEME |
+----------+-------------------+
| dare/t13 |                 1 |
+----------+-------------------+

MariaDB> ALTER TABLE t13 ENCRYPTED=NO;
Query OK, 0 rows affected (0.19 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB> SELECT NAME, ENCRYPTION_SCHEME FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION WHERE NAME LIKE 'dare/t13';
+----------+-------------------+
| NAME     | ENCRYPTION_SCHEME |
+----------+-------------------+
| dare/t13 |                 0 |
+----------+-------------------+

Let’s change a table’s ENCRYPTION_KEY_ID:

MariaDB> SELECT NAME, CURRENT_KEY_ID FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION WHERE NAME LIKE 'dare/t11';
+----------+----------------+
| NAME     | CURRENT_KEY_ID |
+----------+----------------+
| dare/t11 |              1 |
+----------+----------------+

MariaDB [dare]> ALTER TABLE t11 ENCRYPTION_KEY_ID=18;
Query OK, 0 rows affected (0.13 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB> SELECT NAME, CURRENT_KEY_ID FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION WHERE NAME LIKE 'dare/t11';
+----------+----------------+
| NAME     | CURRENT_KEY_ID |
+----------+----------------+
| dare/t11 |             18 |
+----------+----------------+

Now set innodb-encrypt-tables=FORCE:

With this setting, if you attempt to create a table with ENCRYPTED=NO, then the command will fail:

MariaDB [dare]> CREATE TABLE t23 (id int) ENGINE=INNODB ENCRYPTED=NO;
ERROR 1005 (HY000): Can't create table `dare`.`t23` (errno: 140 "Wrong create options")

Similarly, if you attempt to change an encrypted table to a non-encrypted table with FORCE in effect, you’ll see an error like the below:

MariaDB> CREATE TABLE t20 (id int) ENGINE=INNODB;
Query OK, 0 rows affected (0.04 sec)

MariaDB> ALTER TABLE t20 ENCRYPTED=NO;
ERROR 1005 (HY000): Can't create table `dare`.`#sql-29b4_2` (errno: 140 "Wrong create options")

Now revert innodb-encrypt-tables from FORCE back to 1, and drop innodb-encryption-threads from 4 to 0 - the latter if you need to avoid the high CPU usage issue:

MariaDB> CREATE TABLE t30 (id int) ENGINE=INNODB;
Query OK, 0 rows affected (0.14 sec)

MariaDB> SELECT * FROM INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION WHERE NAME LIKE 'dare/t3%';
+-------+----------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+
| SPACE | NAME     | ENCRYPTION_SCHEME | KEYSERVER_REQUESTS | MIN_KEY_VERSION | CURRENT_KEY_VERSION | KEY_ROTATION_PAGE_NUMBER | KEY_ROTATION_MAX_PAGE_NUMBER | CURRENT_KEY_ID |
+-------+----------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+
|    63 | dare/t30 |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |
+-------+----------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+

I ran this test since it could be necessary to set innodb-encryption-threads=0 if the high CPU utilization issue requires you to do so, and you do not need the background threads to perform key rotation (discussed more on the manual page listed above).

Also, I did this because I’ve heard that if you set innodb-encryption-threads=0, then newly created tables where you do not explicitly set ENCRYPTED=YES will not be encrypted. I do not see that behavior - as you can see above, the table is encrypted - so I do want to dispel this notion whilst I’m at it.

Misc:

I did notice that once you set an ENCRYPTION_KEY_ID for a table, it remains in the SHOW CREATE TABLE output, even if you un-encrypt the table. Even a null ALTER TABLE does not remove it. It is harmless, but something to be aware of:

MariaDB> ALTER TABLE t3 ENCRYPTED=NO;
Query OK, 0 rows affected, 1 warning (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 1

MariaDB> SHOW WARNINGS;
+---------+------+------------------------------------------------------------------+
| Level   | Code | Message                                                          |
+---------+------+------------------------------------------------------------------+
| Warning |  140 | InnoDB: Ignored ENCRYPTION_KEY_ID 18 when encryption is disabled |
+---------+------+------------------------------------------------------------------+

MariaDB> ALTER TABLE t3 ENGINE=INNODB;
Query OK, 0 rows affected, 1 warning (0.30 sec)
Records: 0  Duplicates: 0  Warnings: 1

MariaDB> SHOW WARNINGS;
+---------+------+------------------------------------------------------------------+
| Level   | Code | Message                                                          |
+---------+------+------------------------------------------------------------------+
| Warning |  140 | InnoDB: Ignored ENCRYPTION_KEY_ID 18 when encryption is disabled |
+---------+------+------------------------------------------------------------------+

MariaDB> SHOW CREATE TABLE t3\G
*************************** 1. row ***************************
       Table: t3
Create Table: CREATE TABLE `t3` (
  `id` int(11) DEFAULT NULL,
  `value` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 `encryption_key_id`=18 `encrypted`=NO

Lastly, I should mention that if you want to export a table to another instance and that table is encrypted, the export will not work as-is. You will need to un-encrypt the table first, then perform the discard/import tablespace on the destination server, and then re-encrypt the table on the source server.
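
A sketch of that flow (t1 is a placeholder table; this is the standard transportable-tablespace procedure with the encryption toggles added around it):

-- on the source server: remove encryption first
ALTER TABLE t1 ENCRYPTED=NO;
FLUSH TABLES t1 FOR EXPORT;
-- copy t1.ibd (and t1.cfg, if present) from the source datadir
-- to the destination datadir, then:
UNLOCK TABLES;
ALTER TABLE t1 ENCRYPTED=YES;  -- re-encrypt on the source

-- on the destination server (the same CREATE TABLE must exist there)
ALTER TABLE t1 DISCARD TABLESPACE;
-- put the copied files in place, then:
ALTER TABLE t1 IMPORT TABLESPACE;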

I hope this has helped cover the basics and a number of use cases to help you get your data at rest encrypted.

by chris at February 10, 2017 06:57 PM

An Introduction to MariaDB’s Data at Rest Encryption (DARE) – Part 1

Encryption is becoming more and more prevalent and increasingly necessary in today’s world, so I wanted to provide a good overall “getting started” article on using MariaDB’s data at rest encryption (DARE) for anyone out there interested in setting this up in their environment.

MariaDB’s data encryption at rest manual page covers a lot of the specifics, but I wanted to create a quick start guide and also note a few items that might not be immediately obvious.

And due to the number of my examples, I’m splitting this into two posts. The first will focus solely on setting up encryption so you can use it. The second will focus on using it with a number of examples and common use cases.

Also, I feel that I should mention from the outset that, currently, this data at rest encryption only applies to InnoDB/XtraDB tables and Aria tables (that are created with ROW_FORMAT=PAGE, the default), and other limitations are listed on the above page.

1. First off, you will need to create a keys.txt file. The example on the above page works fine for this, but you may want to eventually create your own:

1;770A8A65DA156D24EE2A093277530142
18;F5502320F8429037B8DAEF761B189D12F5502320F8429037B8DAEF761B189D12

Save that as keys.txt (or whatever you want) and I’ll place it in the datadir for the sake of this example.

The first number is the encryption key identifier (a 32-bit number), and the second is the actual hex-encoded encryption key (which can be 128, 192, or 256 bits - you can generate one with, for example, openssl rand -hex 32 for a 256-bit key). The two values are separated by a semicolon. You only need the first key (i.e., #1), but you can add as many keys as you like, and use different keys for different tables if you desire. Key number 1 is also the default.

2. Add the following lines to your configuration file (these are the basics from the manual page above):

plugin-load-add=file_key_management.dll
file-key-management
file-key-management-filename = "C:/Program Files/MySQL/mariadb-10.1.21/data/keys.txt"
innodb-encrypt-tables
innodb-encrypt-log
innodb-encryption-threads=4

Of course if on Linux, change the “.dll” to “.so”. Or you can always load it dynamically, but when you also need to set related variables, I prefer this way.

  • The first three lines load the key_management plugin, enable file-key-management, and point to the file-key file you created.
  • The fourth is what enables encryption. You can set it to 0 (off), 1 (on), or FORCE (to always force every single table to be encrypted).
  • The fifth tells the server to enable encryption for the log files. From what I understand, this is for the InnoDB log files. There is a separate option to encrypt binary logs (--encrypt-binlog). The manual recommends enabling this if you are encrypting tables, since the data in the logs would otherwise not be encrypted, and thus viewable.
  • The sixth specifies how many background encryption threads to start up and use. These threads perform background key rotation and scrubbing. Note that this will add some overhead, and these background threads can cause higher CPU utilization since they are running and checking tables all of the time. More on this topic can be found here: https://jira.mariadb.org/browse/MDEV-11581

3. Restart your instance, and check that all looks normal.

Verify that the plugin loaded properly:

MariaDB> SHOW PLUGINS;
+-------------------------------+----------+--------------------+-------------------------+---------+
| Name                          | Status   | Type               | Library                 | License |
+-------------------------------+----------+--------------------+-------------------------+---------+
...
| file_key_management           | ACTIVE   | ENCRYPTION         | file_key_management.dll | GPL     |
+-------------------------------+----------+--------------------+-------------------------+---------+

And if using the above, you will see the 4 encryption threads start up in the error log:

2017-02-08 16:41:44 55840 [Note] InnoDB: Creating #1 thread id 33908 total threads 4.
2017-02-08 16:41:44 55840 [Note] InnoDB: Creating #2 thread id 33924 total threads 4.
2017-02-08 16:41:44 55840 [Note] InnoDB: Creating #3 thread id 38532 total threads 4.
2017-02-08 16:41:44 55840 [Note] InnoDB: Creating #4 thread id 29060 total threads 4.

4. Now you are ready to create an encrypted table. More on this in part #2 of this blog post.

Click here to read An Introduction to MariaDB’s Data at Rest Encryption (DARE) – Part 2.

Hope this helps.

by chris at February 10, 2017 06:56 PM

MariaDB AB

How MariaDB ColumnStore Handles Big Data Workloads – Query Engine


MariaDB ColumnStore is a massively parallel scale out columnar database. Query execution behaves quite differently from how a row-based engine works. This article outlines how queries are executed, optimized, and how performance can be influenced.

Query Flow

Applications interact using SQL with MariaDB ColumnStore over the standard MariaDB connectors to the MariaDB server. For ColumnStore tables the query is routed through the ColumnStore storage engine.

The ColumnStore storage engine converts the parsed query into a ColumnStore specific format which is then passed to the ExeMgr process for actual execution. In return the ExeMgr retrieves the query results back.

The ExeMgr process converts the query request into a batch of primitive data jobs to perform the actual query. This is where ColumnStore performs query optimization to perform the query as efficiently as possible. The MariaDB server optimizer is mostly bypassed since its goals are driven around optimizing for row-based storage.

The primitive job steps are sent down to the PrimProc process running on each PM server. Filtering, joins, aggregates, and group bys are pushed down to enable scale out performance across multiple nodes.

Each primitive data job is generally a small, discrete task that runs in a fraction of a second. For filtering or projection jobs, the system will read data in parallel and use other cores to process blocks in parallel to produce the result.

Results are returned from the PrimProc to the ExeMgr process. Once results are returned from each PM, the final stage of aggregation is performed to produce the results. Window functions are applied at the ExeMgr level due to the need for sorted results which may ultimately come from different PM nodes.

In the MariaDB server, any ORDER BY clauses and select-list function results are applied before returning results to the client application.

Query Optimization

The ColumnStore optimizer makes use of table statistics including table size and extent map information to optimize query execution.

If a query involves joins, table statistics are used to predict which table will have the largest result set, and that one is made the driving table. Both table size and the extent min/max values are used in this calculation. The other tables will be queried by the ExeMgr and the results passed down to the PM nodes for hash joining with the driving table.

Where clause filters are examined in conjunction with the extent map minimum and maximum values to determine which extents for the column even need to be scanned, drastically reducing the number of extents that must be read. This tends to work particularly well for naturally sorted data, such as the date on which an event happened.

Table projection involves first executing any column filters and then projecting the minimum set of other columns necessary to satisfy the query. Column projection is an efficient operation due to the use of fixed length datatypes.

Query Scale Out

A ColumnStore deployment with multiple PM nodes will provide scale out query performance. ColumnStore will automatically distribute data evenly across the nodes. This ensures that each PM node is responsible for a smaller portion of the total data set. Having multiple PM nodes allows for applying more CPU, network and disk I/O to provide for:

  • Reduction in query time, e.g. going from 2 nodes to 4 nodes should result in query results coming back in roughly half the time.
  • Maintaining query response time as your data set grows.

If a time dimension or column is utilized and the data loaded in order or near order, extent map elimination allows for dramatic reduction in disk I/O when data is filtered by that column.
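
For example, with a fact table loaded in date order (the table and column names here are hypothetical), a range filter like the following lets ColumnStore skip every extent whose stored min/max for event_date falls outside the range, so only a small slice of the table is read from disk:

SELECT COUNT(*)
FROM events
WHERE event_date BETWEEN '2017-01-01' AND '2017-01-31';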

Example Query

To better illustrate query execution within MariaDB ColumnStore, consider the following query which produces a report of total order sum by market segment for nation 24 (US) in calendar Q4 of 2016:

select c.c_mktsegment cust_mkt_segment,
       sum(o.o_totalprice) total_order_amount
from orders o join customer c on o.o_custkey = c.c_custkey
where c.c_nationkey = 24
and o.o_orderDATE >= '2016-10-01'
and o.o_orderDATE < '2017-01-01'
group by cust_mkt_segment
order by total_order_amount;

+------------------+--------------------+
| cust_mkt_segment | total_order_amount |
+------------------+--------------------+
| BUILDING         |       338742302.27 |
| HOUSEHOLD        |       339851076.28 |
| FURNITURE        |       342992395.48 |
| AUTOMOBILE       |       352911555.24 |
| MACHINERY        |       353259581.80 |
+------------------+--------------------+
5 rows in set, 1 warning (0.19 sec)

This query will be executed via the following steps:

  1. The customer table is identified, using table statistics, as the smaller / dimension table and is filtered first. Scan the customer table across PM nodes on the c_nationkey, c_mktsegment and c_custkey columns. Filter the c_nationkey column to rows matching the value 24 (US companies) and project c_custkey (the customer identifier) and c_mktsegment (the customer market segment).
  2. Scan the larger / fact orders table across PM nodes on the o_custkey, o_orderDATE, and o_totalprice columns. Filter the o_orderDATE column to rows within the Q4 2016 range. If there are more than 8M rows in the orders table, the system will utilize extent min and max values to avoid reading those extents that are outside of the range.
  3. A hash of c_custkey is built from the customer table results and then a distributed hash join is executed against the order table results in each PM server to produce the set of joined columns between customer and order.
  4. A distributed grouping and aggregation is then executed in each PM server to produce the group by results. The results are accumulated in the UM server which also combines any overlapping results coming from different PM servers.
  5. Finally the results are sorted by the total_order_amountcolumn in the UM server to produce the final result.
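The join-then-combine flow of steps 3 through 5 can be sketched in miniature (hypothetical in-memory rows standing in for the per-PM column scans; the real engine works on column blocks, not Python lists):

```python
from collections import defaultdict

# Step 3: the filtered customer rows become a hash on c_custkey.
customers = {1: "AUTOMOBILE", 2: "BUILDING"}  # c_custkey -> c_mktsegment

# Order rows (o_custkey, o_totalprice) as they might land on two PM nodes.
pm_orders = [
    [(1, 100.0), (2, 50.0)],
    [(1, 25.0), (2, 75.0), (1, 10.0)],
]

# Steps 3-4: each PM probes the hash and produces a partial aggregation.
partials = []
for orders in pm_orders:
    agg = defaultdict(float)
    for custkey, total in orders:
        if custkey in customers:          # hash join probe
            agg[customers[custkey]] += total
    partials.append(agg)

# Step 4 (UM side): combine overlapping partial results from the PM nodes.
final = defaultdict(float)
for agg in partials:
    for segment, total in agg.items():
        final[segment] += total

# Step 5: sort by the aggregate to produce the final result.
print(sorted(final.items(), key=lambda kv: kv[1]))
```

The key point the sketch illustrates is that both the join and the grouping run in parallel on the PM nodes, and the UM only merges and sorts already-aggregated partial results.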

Utility tools are provided within MariaDB ColumnStore to help understand the query optimization process. For further details see the MariaDB ColumnStore knowledge base section on performance : https://mariadb.com/kb/en/mariadb/columnstore-performance-tuning/.

MariaDB ColumnStore is a massively parallel scale out columnar database. Query execution behaves quite differently from how a row-based engine works. This article outlines how queries are executed, optimized, and how performance can be influenced.

by david_thompson_g at February 10, 2017 06:34 PM

Chris Calender

Treating NULLs as not less than zero in ORDER BY Revisited

In my post yesterday, I shared a little-known trick for sorting NULLs last when using ORDER BY ASC.

To summarize briefly: NULLs are treated as less than 0 when used in ORDER BY. However, sometimes you do not want that behavior, and you need the NULLs listed last, even though you want your numbers in ascending order.

So a query like the following returns the NULLs first (expected behavior):

SELECT * FROM t1 ORDER BY col1 ASC, col2 ASC;
+--------+------+
| col1   | col2 |
+--------+------+
| apple  | NULL |
| apple  |    5 |
| apple  |   10 |
| banana | NULL |
| banana |    5 |
| banana |   10 |
+--------+------+

The trick I mentioned in my post is to rewrite the query like:

SELECT * FROM t1 ORDER BY col1 ASC, -col2 DESC;

The difference is that we added a minus sign (-) in front of the column we want sorted and changed ASC to DESC.

Now this query returns what we’re looking for:

+--------+------+
| col1   | col2 |
+--------+------+
| apple  |    5 |
| apple  |   10 |
| apple  | NULL |
| banana |    5 |
| banana |   10 |
| banana | NULL |
+--------+------+

I could not really find this behavior documented at the time, and thus did some more digging to find out if this is intended behavior and if it should continue to work in the future (i.e., can we rely on it).

The answer is yes to both, it is intended, and it will continue to work.

Now, why does this work this way?

  • It is known that when sorting in ascending (ASC) order, NULLs are listed first, and when descending (DESC), they are listed last.
  • It is known that a minus sign (-) preceding the column, combined with DESC, is the same as ASC (essentially the inverse). This is because if a > b, then -a < -b.
  • And the last bit of the puzzle is that NULL == -NULL.

So since -NULL == NULL, and we are now using DESC, the NULLs will be last. The remaining INT values still come back in ASC order, since negating them and switching to DESC effectively puts those INTs back in ascending order.
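The behavior is easy to reproduce outside MariaDB as well; SQLite happens to follow the same conventions (NULLs first with ASC, last with DESC, and -NULL evaluates to NULL), so a quick Python sketch demonstrates the trick end to end:

```python
import sqlite3

# In-memory table matching the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [("apple", None), ("apple", 5), ("apple", 10),
     ("banana", None), ("banana", 5), ("banana", 10)],
)

# The minus-sign trick: negate the column and flip ASC to DESC.
rows = conn.execute(
    "SELECT col1, col2 FROM t1 ORDER BY col1 ASC, -col2 DESC"
).fetchall()
for row in rows:
    print(row)  # NULLs come last within each col1 group
```

The integers still come back ascending within each group, with NULLs at the end, exactly as in the MariaDB output.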

Hope this helps clarify for anyone out there interested.

by chris at February 10, 2017 05:09 PM

Peter Zaitsev

Percona Blog Poll: What Database Engine Are You Using to Store Time Series Data?

Take Percona’s blog poll on what database engine you are using to store time series data.

Time series data is some of the most actionable data available when it comes to analyzing trends and making predictions. Simply put, time series data is data that is indexed not just by value, but by time as well – allowing you to view value changes over time as they occur. Obvious uses include the stock market, web traffic, user behavior, etc.

With the increasing number of smart devices in the Internet of Things (IoT), being able to track data over time is more and more important. With time series data, you can measure and make predictions on things like energy consumption, pH values, water consumption, data from environment-aware machines like smart cars, etc. The sensors used in IoT devices and systems generate huge amounts of time-series data.

How is all of this data collected, segmented and stored? We’d like to hear from you: what database engine are you using to store time series data? Please take a few seconds to answer the following poll, and help the community learn which database engines solve critical database issues. Please select from one to three database engines, as they apply to your environment. Feel free to add comments below if your engine isn’t listed.

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

by Peter Zaitsev at February 10, 2017 03:19 PM

February 09, 2017

Peter Zaitsev

Using NVMe Command Line Tools to Check NVMe Flash Health

In this blog post, I’ll look at the types of NVMe flash health information you can get from using the NVMe command line tools.

Checking SATA-based drive health is easy. Whether it’s an SSD or older spinning drive, you can use the smartctl command to get a wealth of information about the device’s performance and health. As an example:

root@blinky:/var/lib/mysql# smartctl -A /dev/sda
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-62-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
 5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       41
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   059   000    Old_age   Always       -       35 (Min/Max 21/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Unknown_SSD_Attribute   0x0030   100   100   001    Old_age   Offline      -       0
206 Unknown_SSD_Attribute   0x000e   100   100   000    Old_age   Always       -       0
246 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       145599393
247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       4550280
248 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       582524
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       1260
210 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0

While smartctl might not know all vendor-specific SMART values, typically you can Google the drive model along with “smart attributes” and find documents like this to get more details.

If you move to newer generation NVMe-based flash storage, smartctl won’t work anymore – at least it doesn’t work with the packages available for Ubuntu 16.04 (what I’m running). It looks like support for NVMe in Smartmontools is coming, and it would be great to have a single tool that supports both SATA and NVMe flash storage.

In the meantime, you can use the nvme tool available from the nvme-cli package. It provides some basic information for NVMe devices.

To get information about the NVMe devices installed:

root@alex:~# nvme list
Node             SN                   Model                                    Version  Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- -------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S3EVNCAHB01861F      Samsung SSD 960 PRO 1TB                  1.2      1         689.63  GB /   1.02  TB    512   B +  0 B   1B6QCXP7

To get SMART information:

root@alex:~# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 34 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 3,465,389
data_units_written                  : 9,014,689
host_read_commands                  : 89,719,366
host_write_commands                 : 134,671,295
controller_busy_time                : 310
power_cycles                        : 11
power_on_hours                      : 21
unsafe_shutdowns                    : 8
media_errors                        : 0
num_err_log_entries                 : 1
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 34 C
Temperature Sensor 2                : 47 C
Temperature Sensor 3                : 0 C
Temperature Sensor 4                : 0 C
Temperature Sensor 5                : 0 C
Temperature Sensor 6                : 0 C

To get additional SMART information (not all devices support it):

root@ts140i:/home/pz/workloads/1m# nvme smart-log-add /dev/nvme0
Additional Smart Log for NVME device:nvme0 namespace-id:ffffffff
key                               normalized raw
program_fail_count              : 100%       0
erase_fail_count                : 100%       0
wear_leveling                   :  62%       min: 1114, max: 1161, avg: 1134
end_to_end_error_detection_count: 100%       0
crc_error_count                 : 100%       0
timed_workload_media_wear       : 100%       37.941%
timed_workload_host_reads       : 100%       51%
timed_workload_timer            : 100%       446008 min
thermal_throttle_status         : 100%       0%, cnt: 0
retry_buffer_overflow_count     : 100%       0
pll_lock_loss_count             : 100%       0
nand_bytes_written              : 100%       sectors: 16185227
host_bytes_written              : 100%       sectors: 6405605

Some of this information is self-explanatory, and some of it isn’t. After looking at the NVMe specification document, here is my read on some of the data:

Available Spare. Contains a normalized percentage (0 to 100%) of the remaining spare capacity that is available.

Available Spare Threshold. When the Available Spare capacity falls below the threshold indicated in this field, an asynchronous event completion can occur. The value is indicated as a normalized percentage (0 to 100%).

(Note: I’m not quite sure what the practical meaning of “asynchronous event completion” is, but it looks like something to avoid!)

Percentage Used. Contains a vendor specific estimate of the percentage of the NVM subsystem life used, based on actual usage and the manufacturer’s prediction of NVM life.

(Note: the number can be more than 100% if you’re using storage for longer than its planned life.)

Data Units Read/Data Units Written. The number of 512-byte data units read or written, measured in an unusual way: one reported unit corresponds to 1,000 of the 512-byte units, so you can multiply this value by 512,000 to get the value in bytes. It does not include metadata accesses.

Host Read/Write Commands. The number of commands of the appropriate type issued. Using this value together with the data units above, you can compute the average IO size for “physical” reads and writes.

Controller Busy Time. Time in minutes that the controller was busy servicing commands. This can be used to gauge long-term storage load trends.

Unsafe Shutdowns. The number of times a power loss happened without a shutdown notification being sent. Depending on the NVMe device you’re using, an unsafe shutdown might corrupt user data.

Warning Temperature Time/Critical Temperature Time. The time in minutes a device operated above the warning or critical temperature. Both should be zero.

Wear_Leveling. This shows how much of the rated cell life has been used, as well as the min/max/avg write count for different cells. In this case, it looks like the cells are rated for 1800 writes and about 1100 were used on average.

Timed Workload Media Wear. The media wear by the current “workload.” This device allows you to measure some statistics from the time you reset them (called the “workload”) in addition to showing the device lifetime values.

Timed Workload Host Reads. The percentage of IO operations that were reads (since the workload timer was reset).

Thermal Throttle Status. This shows if the device is throttled due to overheating, and when there were throttling events in the past.

Nand Bytes Written. The bytes written to NAND cells. For this device, the measured unit seems to be in 32MB values. It might be different for other devices.

Host Bytes Written. The bytes written to the NVMe storage from the system. This unit also is in 32MB values. The absolute scale of these values is not very important; they are most helpful for finding the write amplification of your workload, computed as the ratio of writes to NAND over writes from the host. For this example, the Write Amplification Factor (WAF) is 16185227 / 6405605 = 2.53.
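The unit conversions and the write-amplification arithmetic above can be double-checked with a few lines (values taken from the smart-log output earlier in this post):

```python
# Values from the nvme smart-log / smart-log-add output above.
data_units_read = 3_465_389        # each unit = 1000 * 512 bytes
host_read_commands = 89_719_366
nand_sectors_written = 16_185_227  # 32MB units for this device
host_sectors_written = 6_405_605

bytes_read = data_units_read * 512_000           # total bytes read from the device
avg_read_size = bytes_read / host_read_commands  # average "physical" read, in bytes

# Write amplification: NAND writes over host writes; the 32MB unit cancels out.
waf = nand_sectors_written / host_sectors_written
print(f"read {bytes_read / 1e12:.2f} TB, avg IO {avg_read_size / 1024:.1f} KiB, WAF {waf:.2f}")
```

For this device the math works out to roughly 1.77 TB read, an average read of about 19 KiB, and a WAF of 2.53.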

As you can see, the NVMe command line tools provide a lot of good information for understanding the health and performance of NVMe devices, without needing vendor-specific tools (like isdct).

by Peter Zaitsev at February 09, 2017 07:50 PM

Chris Calender

Treating NULLs as not less than zero in ORDER BY

I was working on a seemingly basic query the other day where the user needed to have an INT column listed in ascending order (i.e., 1, 2, 3, …).

However, the tricky part came in because the column allowed NULLs and the user needed the NULLs to be listed last, not first, which is the default behavior in both MariaDB and MySQL.

We first devised a somewhat convoluted solution where we used ISNULL() first in the ORDER BY, and then the column, but that wasn’t ideal since it added an additional check for each row in the ORDER BY, which we wanted to avoid in a query returning ~5M rows.

To illustrate, a normal query just sorting in ASC order returned:

MariaDB> SELECT * FROM t1 ORDER BY col1 ASC, col3 ASC;
+--------+--------+------+
| col1   | col2   | col3 |
+--------+--------+------+
| apple  | yellow | NULL |
| apple  | red    |    5 |
| apple  | green  |   10 |
| banana | brown  | NULL |
| banana | green  |    5 |
| banana | yellow |   10 |
+--------+--------+------+

But of course we want the NULLs last in each respective group, so we first added an ISNULL() check, which does return what we need:

MariaDB> SELECT * FROM t1 ORDER BY col1 ASC, ISNULL(col3) ASC, col3 ASC;
+--------+--------+------+
| col1   | col2   | col3 |
+--------+--------+------+
| apple  | red    |    5 |
| apple  | green  |   10 |
| apple  | yellow | NULL |
| banana | green  |    5 |
| banana | yellow |   10 |
| banana | brown  | NULL |
+--------+--------+------+

However, ideally, we wanted to eliminate the ISNULL() call.

The solution was to use a little-known trick (at least to me; I also don’t see it documented) where you add a minus sign (“-“) in front of the column you want to sort and change ASC to DESC. The minus sign essentially inverts the values, hence the change from ASC to DESC, but it has the added benefit (in this case) of listing NULLs last.

MariaDB> SELECT * FROM t1 ORDER BY col1 ASC, -col3 DESC;
+--------+--------+------+
| col1   | col2   | col3 |
+--------+--------+------+
| apple  | red    |    5 |
| apple  | green  |   10 |
| apple  | yellow | NULL |
| banana | green  |    5 |
| banana | yellow |   10 |
| banana | brown  | NULL |
+--------+--------+------+

Hope this helps.

by chris at February 09, 2017 06:20 PM

Jean-Jerome Schmidt

How to deploy and manage MySQL multi-master replication setups with ClusterControl 1.4

MySQL replication setups can take different shapes. The main topology is probably a simple master-slave setup. But it is also possible to construct more elaborate setups with multiple masters and chained topologies. ClusterControl 1.4 takes advantage of this flexibility and gives you the possibility to deploy multi-master setups. In this blog post, we will look at a couple of different setups and how they would be used in real-life situations.

New Deployment Wizard

First of all, let’s take a look at the new deployment wizard in ClusterControl 1.4. It starts with SSH configuration: user, path to ssh key and whether you use sudo or not.

Next, we pick a vendor and version, data directory, port, configuration template, password for root user and, finally, from which repository ClusterControl should install the software.

The third and final step is to define the topology.

Let’s go through some of these topologies in more detail.

Master - slave topology

This is the most basic setup you can create with MySQL replication - one master and one or more slaves.

Such a configuration gives you scale-out for reads, as you can utilize your slaves to handle read-only queries and transactions. It also adds some degree of high availability to your setup - one of the slaves can be promoted to master in case the current master becomes unavailable. We introduced an automatic failover mechanism in ClusterControl 1.4.

The master - slave topology is widely used to reduce load on the master by moving reads to slaves. Slaves can also be used to handle specific types of heavy traffic - for instance, backups or analytics/reporting servers. This topology can also be used to distribute data across different datacenters.

When it comes to multiple datacenters, this might be useful if users are spread across different regions. By moving data closer to the users, you will reduce network latency.

Master - master, active - standby

This is another common deployment pattern - two MySQL instances replicating to each other. One of them is taking writes, the second one is in standby mode. This setup can be used for scale-out, where we use the standby node for reads. But this is not where its strength lies. The most common use case of this setup is to deliver high availability. When the active master dies, the standby master takes over its role and starts accepting writes. When deploying this setup, you have to keep in mind that two nodes may not be enough to avoid split brain. Ideally you’d use a third node, for example a ClusterControl host, to detect the state of the cluster. A proxy, colocated with ClusterControl, should be used to direct traffic. Colocation ensures that both ClusterControl (which performs the failover) and the proxy (which routes traffic) see the topology in the same way.

You may ask - what is the difference between this setup and master-slave? One way or the other, a failover has to be performed when the active master is down. There is one important difference - replication goes both ways. This can be used to self-heal the old master after failover. As soon as you determine that the old master is safe to take a “standby” role, you can just start it and, when using GTID, all missing transactions should be replicated to it without need for any action from user.

This feature is commonly used to simplify site switchover. Let’s say that you have two site locations - active and standby/disaster recovery (DR). The DR site is designed to take over the workload when something is not right with the main location. Imagine that some issue hit your main datacenter, something not necessarily related to the database - for instance, a problem with block storage on your web tier. As long as your backup site is not affected, you can easily (or not - it depends on how your app works) switch your traffic to the backup site. From the database perspective, this is a fairly simple process. Especially if you use proxies like ProxySQL, which can perform a failover that is transparent to the application. After such a failover, your writes hit the old standby master, which now acts as active. Everything is replicated back to the primary datacenter. So when the problem is solved, you can switch the traffic back without much trouble. The data in both datacenters is up-to-date.

It is worth noting that ClusterControl also supports active - active type of master - master setups. It does not deploy such topology as we strongly discourage users from writing simultaneously on multiple masters. It does not help you to scale writes, and is potentially very tricky to maintain. Still, as long as you know what you are doing, ClusterControl will detect that both masters have read_only=off and will treat the topology accordingly.

Master - master with slaves

This is an extended version of the previous topology; it combines the scale-out of a master - slave(s) setup with the ease of failover of a master - master setup. Such complex setups are commonly used across datacenters, either forming a backup environment or being actively used for scale-out, keeping data close to the rest of the application.

Topology changes

Replication topologies are not static; they evolve over time. A slave can be promoted to master, different slaves can replicate from different masters or intermediate masters, and new slaves can be added. As you can see, deploying a replication topology is one thing; maintaining it is something different. In ClusterControl 1.4, we added the ability to modify your topology.

On the above screenshot, you can see how ClusterControl presents a master - master topology with a few slaves. On the left panel, you can see the list of nodes and their roles. We can see two multi-master nodes, of which one is writable (our active master). We can also see the list of slaves (read copies). On the main panel, you can see a summary for the highlighted host: its IP, the IP of its master and the IPs of its slaves.

As we mentioned in our previous blog post, ClusterControl handles failover for you - it checks for errant transactions and lets slaves catch up if needed. We still need a way to move our slaves around; you can find those options in the node’s drop-down list of actions:

What we are looking for is “Promote Slave”, which does what it says - the chosen slave will become a master (as long as there is nothing to prevent it from happening) and the remaining hosts will slave off it. More commonly used is “Change Replication Master”, which gives you a way to slave the chosen node off another MySQL master. Once you pick this job and “Execute” it, you’ll be presented with the following dialog box:

Here you need to pick a new master host for your node. Once that’s done, click “Proceed”. In our case, we picked the IP of one of the slaves which will end up as an intermediate master. Below you can see the status of our replication setup after reslaving finished. Please note that node 172.30.4.119 is marked as “Intermediate”. It’s worth noting that ClusterControl performs sanity checks when reslaving happens - it checks for errant transactions and ensures that the master switch won’t impact replication. You can read more about those safety measures in our blog post which covers failover and switchover process.

As you can see, deploying and managing replication setups is easy with ClusterControl 1.4. We encourage you to give it a try and see how efficiently you can handle your setups. If you have any feedback on it, let us know as we’d love to hear from you.

ClusterControl
Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

by krzysztof at February 09, 2017 11:45 AM

February 08, 2017

MariaDB AB

MariaDB is Coming to a City Near You!

Join us for one of our full-day roadshows coming to a city near you! MariaDB is hosting educational roadshows across North America to showcase exciting new products and solutions, and share best practices for enabling critical enterprise features such as high availability and security. MariaDB experts will lead content-rich sessions on open source, product features, architecture and security. If you’re a DBA, architect, or developer, you won’t want to miss this event! Check out the locations and dates here.

MariaDB is offering complimentary, daylong roadshows across North America to help developers, DBAs and architects connect with MariaDB experts and learn best practices for high availability, managing scale-out environments, database security threats, big data analytics, and using MariaDB with containers like Docker. Reserve your spot today!

Attendees will:

  • See the latest MariaDB product features and a live demo, including MariaDB Server 10.2, MariaDB MaxScale and ColumnStore

  • Learn how to quickly get started with MariaDB

  • Interact and network with MariaDB experts and users

Dates and locations:

The MariaDB North American Road Show will also be held in New York City, Boston, Washington DC, Seattle, Chicago, Phoenix, Atlanta, Austin, and Toronto. Dates will be announced soon.

Sample Agenda:

 8:30 a.m.  Registration + Networking

 9:00 a.m.  Welcome and Introduction

 9:15 a.m.  Keynote When Open Source Meets the Enterprise

 9:45 a.m.  Best Practices for Achieving High Availability with MariaDB

10:45 a.m.  Break

11:00 a.m.  How to Manage Scale-out Environments with MariaDB MaxScale

12:00 p.m.  Lunch

 1:00 p.m.  Get Started with MariaDB on Docker

 2:00 p.m.  Big Data Analytics with MariaDB ColumnStore

 3:00 p.m.  Break

 3:15 p.m.  Database Security Threats and MariaDB Security Best Practices

 4:15 p.m.  Wrap up

Speakers:

MariaDB speakers will include Roger Bodamer, Chief Product Officer; Gerry Narvaja, Solutions Engineering Global Manager; Jon Day, Solutions Architect; James McLaurin, Solutions Architect; and Alvin Richards, Field CTO.

To learn more about upcoming MariaDB events or to register for a roadshow, visit mariadb.com/resources/events.


Questions? Email Amy Krishnamohan at amy.krishnamohan@mariadb.com.


by Public Relations at February 08, 2017 07:32 PM

Peter Zaitsev

Percona Server 5.6.35-80.0 is Now Available

Percona announces the release of Percona Server 5.6.35-80.0 on February 8, 2017. Download the latest version from the Percona web site or the Percona Software Repositories.

Based on MySQL 5.6.35, and including all the bug fixes in it, Percona Server 5.6.35-80.0 is the current GA release in the Percona Server 5.6 series. Percona Server is open-source and free – this is the latest release of our enhanced, drop-in replacement for MySQL. Complete details of this release are available in the 5.6.35-80.0 milestone on Launchpad.

New Features:
  • The Kill Idle Transactions feature has been re-implemented by setting a connection socket read timeout value instead of periodically scanning the internal InnoDB transaction list. This makes the feature applicable to any transactional storage engine, such as TokuDB and, in the future, MyRocks. The re-implementation also addresses some existing bugs, including server crashes: #1166744, #1179136, #907719, and #1369373.
Bugs Fixed:
  • Logical row counts for TokuDB tables could get inaccurate over time. Bug fixed #1651844 (#732).
  • Repeated execution of SET STATEMENT ... FOR SELECT FROM view could lead to a server crash. Bug fixed #1392375.
  • CREATE TEMPORARY TABLE would create a transaction in binary log on a read-only server. Bug fixed #1539504 (upstream #83003).
  • If temporary tables from CREATE TABLE ... AS SELECT contained compressed attributes, it could lead to a server crash. Bug fixed #1633957.
  • Using the per-query variable statement with subquery temporary tables could cause a memory leak. Bug fixed #1635927.
  • Fixed new compilation warnings with GCC 6. Bugs fixed #1641612 and #1644183.
  • A server could crash if a bitmap write I/O error happens in the background log tracking thread while a FLUSH CHANGED_PAGE_BITMAPS is executing concurrently. Bug fixed #1651656.
  • TokuDB was using the wrong function to calculate free space in data files. Bug fixed #1656022 (#1033).
  • CONCURRENT_CONNECTIONS column in the USER_STATISTICS table was showing incorrect values. Bug fixed #728082.
  • InnoDB index dives did not detect some of the concurrent tree changes, which could return bogus estimates. Bug fixed #1625151 (upstream #84366).
  • INFORMATION_SCHEMA.INNODB_CHANGED_PAGES queries would needlessly read potentially incomplete bitmap data past the needed LSN range. Bug fixed #1625466.
  • Percona Server cmake compiler would always attempt to build RocksDB even if -DWITHOUT_ROCKSDB=1 argument was specified. Bug fixed #1638455.
  • Adding COMPRESSED attributes to InnoDB special tables fields (like mysql.innodb_index_stats and mysql.innodb_table_stats) could lead to server crashes. Bug fixed #1640810.
  • Lack of free pages in the buffer pool was not diagnosed with innodb_empty_free_list_algorithm set to backoff (which is the default). Bug fixed #1657026.
  • mysqld_safe now limits the use of rm and chown to avoid privilege escalation. chown can now be used only for /var/log directory. Bug fixed #1660265. Thanks to Dawid Golunski (https://legalhackers.com).
  • Renaming a TokuDB table to a non-existent database with tokudb_dir_per_db enabled would lead to a server crash. Bug fixed #1030.
  • Read Free Replication optimization could not be used for TokuDB partition tables. Bug fixed #1012.

Other bugs fixed: #1486747 (upstream #76872), #1633988, #1638198 (upstream #82823), #1638897, #1646384, #1647530, #1647741, #1651121, #1156772, #1644569, #1644583, #1648389, #1648737, #1650247, #1650256, #1650324, #1650450, #1655587, and #1647723.

Release notes for Percona Server 5.6.35-80.0 are available in the online documentation. Please report any bugs on the launchpad bug tracker.

by Hrvoje Matijakovic at February 08, 2017 06:09 PM

Valeriy Kravchuk

More on MyRocks Performance for Bug #68079 Case

My previous post on MyRocks was intended to provide some background details for a couple of slides for my FOSDEM talk on profiling MySQL. The results and performance demonstrated by MyRocks vs InnoDB from MySQL 5.7.17 were not really important for showing how perf helps to understand where the time is spent while executing one specific query vs. the other (with the same results, but a different plan), but they still caused a lot of comments from people who care, so I decided to double-check and clarify a few more details.

First of all, it was noted that one may get the same execution plan for both queries (even in INNER JOIN case we may end up with filesort and access only via PRIMARY keys), like this:
# bin/mysql test -e 'explain select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category\G'
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: task
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 96
ref: NULL
rows: 8092
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: incident
type: eq_ref
possible_keys: PRIMARY,incident_category
key: PRIMARY
key_len: 96
ref: test.task.sys_id
rows: 1
Extra: NULL
Indeed, I sometimes ended up with this plan easily, after executing ANALYZE for the tables involved. But this was not always the case (here shown for InnoDB), and the next ANALYZE or OPTIMIZE may change the plan:

mysql> analyze table task;
+-----------+---------+----------+----------+
| Table     | Op      | Msg_type | Msg_text |
+-----------+---------+----------+----------+
| test.task | analyze | status   | OK       |
+-----------+---------+----------+----------+
1 row in set (0.06 sec)

mysql> analyze table incident;
+---------------+---------+----------+----------+
| Table         | Op      | Msg_type | Msg_text |
+---------------+---------+----------+----------+
| test.incident | analyze | status   | OK       |
+---------------+---------+----------+----------+
1 row in set (0.15 sec)

mysql> show table status like 'task'\G
*************************** 1. row ***************************
           Name: task
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 8452
 Avg_row_length: 60
    Data_length: 507904
Max_data_length: 0
   Index_length: 0
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2017-01-31 12:00:38
    Update_time: NULL
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

mysql> show table status like 'incident'\G
*************************** 1. row ***************************
           Name: incident
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 7786
 Avg_row_length: 204
    Data_length: 1589248
Max_data_length: 0
   Index_length: 507904
      Data_free: 4194304
 Auto_increment: NULL
    Create_time: 2017-01-31 12:00:35
    Update_time: NULL
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

mysql> explain select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category;
+----+-------------+----------+------------+--------+---------------------------+-------------------+---------+----------------------+------+----------+-------------+
| id | select_type | table    | partitions | type   | possible_keys             | key               | key_len | ref                  | rows | filtered | Extra       |
+----+-------------+----------+------------+--------+---------------------------+-------------------+---------+----------------------+------+----------+-------------+
|  1 | SIMPLE      | incident | NULL       | index  | PRIMARY,incident_category | incident_category | 123     | NULL                 | 7786 |   100.00 | Using index |
|  1 | SIMPLE      | task     | NULL       | eq_ref | PRIMARY                   | PRIMARY           | 96      | test.incident.sys_id |    1 |   100.00 | Using index |
+----+-------------+----------+------------+--------+---------------------------+-------------------+---------+----------------------+------+----------+-------------+
2 rows in set, 1 warning (0.02 sec)
Surely I can always force the plan I need, one way or another:

mysql> explain select count(*), category from task inner join incident force index(primary) on task.sys_id=incident.sys_id group by incident.category;
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
| id | select_type | table    | partitions | type   | possible_keys             | key     | key_len | ref              | rows | filtered | Extra                                        |
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
|  1 | SIMPLE      | task     | NULL       | index  | PRIMARY                   | PRIMARY | 96      | NULL             | 8452 |   100.00 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | incident | NULL       | eq_ref | PRIMARY,incident_category | PRIMARY | 96      | test.task.sys_id |    1 |   100.00 | NULL                                         |
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
mysql> explain select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category;
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
| id | select_type | table    | partitions | type   | possible_keys             | key     | key_len | ref              | rows | filtered | Extra                                        |
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
|  1 | SIMPLE      | task     | NULL       | index  | PRIMARY                   | PRIMARY | 96      | NULL             | 8452 |   100.00 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | incident | NULL       | eq_ref | PRIMARY,incident_category | PRIMARY | 96      | test.task.sys_id |    1 |   100.00 | NULL                                         |
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
I was more interested in comparing the scalability and performance of these different ways to execute the same query, because it is usually considered good to use a covering secondary index instead of creating a temporary table and doing a filesort for this kind of query... Reality shows this is not always the case.

I got a suggestion to maybe tune something for the MyRocks case, or to use the jemalloc library. I am open to any tuning suggestions from real experts; note that these are the default settings in my build:

[openxs@fc23 fb56]$ bin/mysql -uroot -e"show variables like 'rocks%'"
+-------------------------------------------------+----------------------+
| Variable_name                                   | Value                |
+-------------------------------------------------+----------------------+
| rocksdb_access_hint_on_compaction_start         | 1                    |
| rocksdb_advise_random_on_open                   | ON                   |
| rocksdb_allow_concurrent_memtable_write         | OFF                  |
| rocksdb_allow_mmap_reads                        | OFF                  |
| rocksdb_allow_mmap_writes                       | OFF                  |
| rocksdb_background_sync                         | OFF                  |
| rocksdb_base_background_compactions             | 1                    |
| rocksdb_block_cache_size                        | 536870912            |
| rocksdb_block_restart_interval                  | 16                   |
| rocksdb_block_size                              | 4096                 |
| rocksdb_block_size_deviation                    | 10                   |
| rocksdb_bulk_load                               | OFF                  |
| rocksdb_bulk_load_size                          | 1000                 |
| rocksdb_bytes_per_sync                          | 0                    |
| rocksdb_cache_index_and_filter_blocks           | ON                   |
| rocksdb_checksums_pct                           | 100                  |
| rocksdb_collect_sst_properties                  | ON                   |
| rocksdb_commit_in_the_middle                    | OFF                  |
| rocksdb_compact_cf                              |                      |
| rocksdb_compaction_readahead_size               | 0                    |
| rocksdb_compaction_sequential_deletes           | 0                    |
| rocksdb_compaction_sequential_deletes_count_sd  | OFF                  |
| rocksdb_compaction_sequential_deletes_file_size | 0                    |
| rocksdb_compaction_sequential_deletes_window    | 0                    |
| rocksdb_create_checkpoint                       |                      |
| rocksdb_create_if_missing                       | ON                   |
| rocksdb_create_missing_column_families          | OFF                  |
| rocksdb_datadir                                 | ./.rocksdb           |
| rocksdb_db_write_buffer_size                    | 0                    |
| rocksdb_deadlock_detect                         | OFF                  |
| rocksdb_debug_optimizer_no_zero_cardinality     | ON                   |
| rocksdb_default_cf_options                      |                      |
| rocksdb_delete_obsolete_files_period_micros     | 21600000000          |
| rocksdb_disabledatasync                         | OFF                  |
| rocksdb_enable_2pc                              | ON                   |
| rocksdb_enable_bulk_load_api                    | ON                   |
| rocksdb_enable_thread_tracking                  | OFF                  |
| rocksdb_enable_write_thread_adaptive_yield      | OFF                  |
| rocksdb_error_if_exists                         | OFF                  |
| rocksdb_flush_memtable_on_analyze               | ON                   |
| rocksdb_force_flush_memtable_now                | OFF                  |
| rocksdb_force_index_records_in_range            | 0                    |
| rocksdb_hash_index_allow_collision              | ON                   |
| rocksdb_index_type                              | kBinarySearch        |
| rocksdb_info_log_level                          | error_level          |
| rocksdb_is_fd_close_on_exec                     | ON                   |
| rocksdb_keep_log_file_num                       | 1000                 |
| rocksdb_lock_scanned_rows                       | OFF                  |
| rocksdb_lock_wait_timeout                       | 1                    |
| rocksdb_log_file_time_to_roll                   | 0                    |
| rocksdb_manifest_preallocation_size             | 4194304              |
| rocksdb_max_background_compactions              | 1                    |
| rocksdb_max_background_flushes                  | 1                    |
| rocksdb_max_log_file_size                       | 0                    |
| rocksdb_max_manifest_file_size                  | 18446744073709551615 |
| rocksdb_max_open_files                          | -1                   |
| rocksdb_max_row_locks                           | 1073741824           |
| rocksdb_max_subcompactions                      | 1                    |
| rocksdb_max_total_wal_size                      | 0                    |
| rocksdb_merge_buf_size                          | 67108864             |
| rocksdb_merge_combine_read_size                 | 1073741824           |
| rocksdb_new_table_reader_for_compaction_inputs  | OFF                  |
| rocksdb_no_block_cache                          | OFF                  |
| rocksdb_override_cf_options                     |                      |
| rocksdb_paranoid_checks                         | ON                   |
| rocksdb_pause_background_work                   | OFF                  |
| rocksdb_perf_context_level                      | 0                    |
| rocksdb_persistent_cache_path                   |                      |
| rocksdb_persistent_cache_size                   | 0                    |
| rocksdb_pin_l0_filter_and_index_blocks_in_cache | ON                   |
| rocksdb_print_snapshot_conflict_queries         | OFF                  |
| rocksdb_rate_limiter_bytes_per_sec              | 0                    |
| rocksdb_read_free_rpl_tables                    |                      |
| rocksdb_records_in_range                        | 0                    |
| rocksdb_seconds_between_stat_computes           | 3600                 |
| rocksdb_signal_drop_index_thread                | OFF                  |
| rocksdb_skip_bloom_filter_on_read               | OFF                  |
| rocksdb_skip_fill_cache                         | OFF                  |
| rocksdb_skip_unique_check_tables                | .*                   |
| rocksdb_stats_dump_period_sec                   | 600                  |
| rocksdb_store_row_debug_checksums               | OFF                  |
| rocksdb_strict_collation_check                  | ON                   |
| rocksdb_strict_collation_exceptions             |                      |
| rocksdb_table_cache_numshardbits                | 6                    |
| rocksdb_table_stats_sampling_pct                | 10                   |
| rocksdb_tmpdir                                  |                      |
| rocksdb_trace_sst_api                           | OFF                  |
| rocksdb_unsafe_for_binlog                       | OFF                  |
| rocksdb_use_adaptive_mutex                      | OFF                  |
| rocksdb_use_direct_reads                        | OFF                  |
| rocksdb_use_direct_writes                       | OFF                  |
| rocksdb_use_fsync                               | OFF                  |
| rocksdb_validate_tables                         | 1                    |
| rocksdb_verify_row_debug_checksums              | OFF                  |
| rocksdb_wal_bytes_per_sync                      | 0                    |
| rocksdb_wal_dir                                 |                      |
| rocksdb_wal_recovery_mode                       | 2                    |
| rocksdb_wal_size_limit_mb                       | 0                    |
| rocksdb_wal_ttl_seconds                         | 0                    |
| rocksdb_whole_key_filtering                     | ON                   |
| rocksdb_write_disable_wal                       | OFF                  |
| rocksdb_write_ignore_missing_column_families    | OFF                  |
| rocksdb_write_sync                              | OFF                  |
+-------------------------------------------------+----------------------+
and the hardware details of my QuadCore box with HDD were shared previously. I tend to think no tuning is needed for such small tables and this corner read-only use case (except maybe the READ COMMITTED isolation level, which I plan to study separately, as initial attempts had not shown much difference).

Just for completeness, this is the minimal configuration file I've used and the commit I've built from (after creating a new clone yesterday, this time the default branch was fb-mysql-5.6.35):
[openxs@fc23 fb56]$ cat ~/fb56.cnf
[mysqld]
rocksdb
default-storage-engine=rocksdb
skip-innodb
default-tmp-storage-engine=MyISAM

log-bin
binlog-format=ROW

#transaction-isolation=READ-COMMITTED
#rocksdb_max_open_files=-1
#rocksdb_block_cache_size=128M

[openxs@fc23 mysql-5.6]$ git log -1
commit 95dd650dcfb07b91971b20cbac82d61f244a05a9
Author: Gunnar Kudrjavets <gunnarku@fb.com>
Date:   Fri Feb 3 10:07:25 2017 -0800
...
You can guess that I tried to add some more options along the lines of the fine manual, but they did not change things much for the better, so I've commented them out.

So, with this fresh build based on MySQL 5.6.35 I've decided to study the influence of jemalloc on both queries with 4, 8 and 16 threads (the most interesting settings for the QuadCore box). I used the --malloc-lib option of mysqld_safe to force the use of jemalloc, and checked pmap output for the mysqld process for jemalloc entries to confirm that it was really preloaded:



[openxs@fc23 fb56]$ ps aux | grep mysqld
openxs   20444  0.0  0.0 119164  3264 pts/0    S    11:49   0:00 /bin/sh bin/mysqld_safe --defaults-file=/home/openxs/fb56.cnf --malloc-lib=/usr/lib64/libjemalloc.so
openxs   20630  0.8  0.2 170704 24284 pts/0    Sl   11:49   0:00 /home/openxs/dbs/fb56/bin/mysqld --defaults-file=/home/openxs/fb56.cnf --basedir=/home/openxs/dbs/fb56 --datadir=/home/openxs/dbs/fb56/data --plugin-dir=/home/openxs/dbs/fb56/lib/plugin --log-error=/home/openxs/dbs/fb56/data/fc23.err --pid-file=/home/openxs/dbs/fb56/data/fc23.pid
openxs   20750  0.0  0.0 118520   904 pts/0    S+   11:49   0:00 grep --color=auto mysqld

[openxs@fc23 fb56]$ sudo pmap 20630 | grep jemalloc
00007fb8637db000    268K r-x-- libjemalloc.so.2
00007fb86381e000   2044K ----- libjemalloc.so.2
00007fb863a1d000     12K r---- libjemalloc.so.2
00007fb863a20000      4K rw--- libjemalloc.so.2
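The same pmap check can be scripted by reading /proc/&lt;pid&gt;/maps directly (a sketch for Linux; the lib_loaded helper and its name are my own, not part of any MySQL tooling):

```python
import os

def lib_loaded(pid, libname):
    """Return True if any mapped region in /proc/<pid>/maps mentions libname."""
    try:
        with open(f"/proc/{pid}/maps") as maps:
            return any(libname in line for line in maps)
    except FileNotFoundError:        # no such pid (or /proc not mounted)
        return False

# The current interpreter normally maps glibc on Linux:
print(lib_loaded(os.getpid(), "libc"))
```

Running it against the mysqld pid with "jemalloc" as the library name would confirm the preload just like the pmap | grep pipeline above.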
So, here are the raw results of mysqlslap runs with jemalloc preloaded, after I made sure the queries use different plans (STRAIGHT_JOIN uses PRIMARY keys and filesort, while INNER JOIN uses the covering secondary index and avoids filesort):

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 19.695 seconds
        Minimum number of seconds to run all queries: 19.627 seconds
        Maximum number of seconds to run all queries: 19.769 seconds
        Number of clients running queries: 4
        Average number of queries per client: 250

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=8 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 19.332 seconds
        Minimum number of seconds to run all queries: 19.306 seconds
        Maximum number of seconds to run all queries: 19.349 seconds
        Number of clients running queries: 8
        Average number of queries per client: 125

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=16 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 19.276 seconds
        Minimum number of seconds to run all queries: 19.225 seconds
        Maximum number of seconds to run all queries: 19.309 seconds
        Number of clients running queries: 16
        Average number of queries per client: 62

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 18.553 seconds
        Minimum number of seconds to run all queries: 18.336 seconds
        Maximum number of seconds to run all queries: 18.748 seconds
        Number of clients running queries: 4
        Average number of queries per client: 250

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=8 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 18.227 seconds
        Minimum number of seconds to run all queries: 18.198 seconds
        Maximum number of seconds to run all queries: 18.269 seconds
        Number of clients running queries: 8
        Average number of queries per client: 125

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=16 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 17.941 seconds
        Minimum number of seconds to run all queries: 17.886 seconds
        Maximum number of seconds to run all queries: 18.000 seconds
        Number of clients running queries: 16
        Average number of queries per client: 62


Now, the same but with default memory allocator picked up (no jemalloc):

[openxs@fc23 fb56]$ ps aux | grep mysqld
openxs   27696  0.0  0.0 119164  3372 pts/0    S    13:50   0:00 /bin/sh bin/mysqld_safe --defaults-file=/home/openxs/fb56.cnf
openxs   27834  0.3  0.2 607012 23096 pts/0    Sl   13:50   0:00 ./bin/mysqld --defaults-file=/home/openxs/fb56.cnf --basedir=. --datadir=./data --plugin-dir=./lib/plugin --log-error=./data/fc23.err --pid-file=./data/fc23.pid
openxs   28029  0.0  0.0 118520   880 pts/0    S+   13:50   0:00 grep --color=auto mysqld
[openxs@fc23 fb56]$ pmap 27834 | grep jemalloc
[openxs@fc23 fb56]$

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 20.207 seconds
        Minimum number of seconds to run all queries: 19.833 seconds
        Maximum number of seconds to run all queries: 20.543 seconds
        Number of clients running queries: 4
        Average number of queries per client: 250

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=8 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 19.746 seconds
        Minimum number of seconds to run all queries: 19.500 seconds
        Maximum number of seconds to run all queries: 20.402 seconds
        Number of clients running queries: 8
        Average number of queries per client: 125

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=16 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 19.150 seconds
        Minimum number of seconds to run all queries: 19.084 seconds
        Maximum number of seconds to run all queries: 19.292 seconds
        Number of clients running queries: 16
        Average number of queries per client: 62

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 18.692 seconds
        Minimum number of seconds to run all queries: 18.516 seconds
        Maximum number of seconds to run all queries: 18.889 seconds
        Number of clients running queries: 4
        Average number of queries per client: 250

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=8 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 18.290 seconds
        Minimum number of seconds to run all queries: 18.250 seconds
        Maximum number of seconds to run all queries: 18.342 seconds
        Number of clients running queries: 8
        Average number of queries per client: 125

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=16 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 18.155 seconds
        Minimum number of seconds to run all queries: 18.066 seconds
        Maximum number of seconds to run all queries: 18.397 seconds
        Number of clients running queries: 16
        Average number of queries per client: 62


Finally, this is how it looks on a chart:
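The chart aside, the relative differences can be computed directly from the mysqlslap averages quoted above (a quick sketch; the numbers are copied verbatim from the output):

```python
# Average seconds to run 1000 queries, keyed by (query, allocator, threads).
avg_secs = {
    ("inner", "jemalloc", 4): 19.695, ("inner", "jemalloc", 8): 19.332,
    ("inner", "jemalloc", 16): 19.276,
    ("straight", "jemalloc", 4): 18.553, ("straight", "jemalloc", 8): 18.227,
    ("straight", "jemalloc", 16): 17.941,
    ("inner", "default", 4): 20.207, ("inner", "default", 8): 19.746,
    ("inner", "default", 16): 19.150,
    ("straight", "default", 4): 18.692, ("straight", "default", 8): 18.290,
    ("straight", "default", 16): 18.155,
}

for query in ("inner", "straight"):
    for threads in (4, 8, 16):
        default = avg_secs[(query, "default", threads)]
        jemalloc = avg_secs[(query, "jemalloc", threads)]
        gain = (default - jemalloc) / default * 100   # positive: jemalloc faster
        print(f"{query:8s} join, {threads:2d} threads: jemalloc gain {gain:+.1f}%")
```

The gains stay in the 0.5-2.5% range, and the 16-thread INNER case even comes out slightly negative, which matches the "fluctuation" note below.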


I can make the following conclusions:
  1. Using jemalloc helps to get a bit better throughput for both queries (I tend to consider the INNER case with 16 threads a fluctuation for now), but not much.
  2. The new build based on the fb-mysql-5.6.35 branch as of yesterday morning demonstrates worse performance for the STRAIGHT_JOIN case (access via PRIMARY keys + filesort) compared to the previous build from January 30, 2017 based on the webscalesql-5.6.27.75 branch. I have yet to find out why this is so, as the INNER JOIN case performs a bit better.
  3. I am open to any MyRocks tuning suggestions for my old QuadCore box, as for now MySQL 5.7.17 performs notably better on this same hardware and data in the tables, with --no-defaults. No wonder: a lot of effort was spent by Oracle on Bug #68079, while all I've got so far for MyRocks is a public "Can't repeat" of sorts, plus simple tuning advice with limited impact for my case.
I am sure I am doing something wrong with MyRocks, just tell me what exactly...

by Valeriy Kravchuk (noreply@blogger.com) at February 08, 2017 06:01 PM

Peter Zaitsev

MySQL super_read_only Bugs

In this blog post, we describe an issue with MySQL 5.7's super_read_only feature when used alongside GTID in chained slave instances.

Background

MySQL 5.7.5 and onward introduced the gtid_executed table in the mysql database to store every GTID. This allows slave instances to use the GTID feature regardless of whether the binlog option is set or not. Here is an example of the rows in the gtid_executed table:

mysql> SELECT * FROM mysql.gtid_executed;
+--------------------------------------+----------------+--------------+
| source_uuid                          | interval_start | interval_end |
+--------------------------------------+----------------+--------------+
| 00005730-1111-1111-1111-111111111111 |              1 |            1 |
| 00005730-1111-1111-1111-111111111111 |              2 |            2 |
| 00005730-1111-1111-1111-111111111111 |              3 |            3 |
| 00005730-1111-1111-1111-111111111111 |              4 |            4 |
| 00005730-1111-1111-1111-111111111111 |              5 |            5 |
| 00005730-1111-1111-1111-111111111111 |              6 |            6 |
| 00005730-1111-1111-1111-111111111111 |              7 |            7 |
| 00005730-1111-1111-1111-111111111111 |              8 |            8 |
| 00005730-1111-1111-1111-111111111111 |              9 |            9 |
| 00005730-1111-1111-1111-111111111111 |             10 |           10 |
...

To save space, this table needs to be compressed periodically by replacing GTID rows with a single row that represents the whole interval of identifiers. For example, the above GTIDs can be represented with the following row:

mysql> SELECT * FROM mysql.gtid_executed;
+--------------------------------------+----------------+--------------+
| source_uuid                          | interval_start | interval_end |
+--------------------------------------+----------------+--------------+
| 00005730-1111-1111-1111-111111111111 |              1 |           10 |
...
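The compression idea can be illustrated with a tiny sketch that merges consecutive (interval_start, interval_end) rows for one source_uuid into a single covering interval (my own illustration of the interval-merging concept, not the server's actual implementation):

```python
def compress(intervals):
    """Merge a sorted list of (start, end) GTID intervals where contiguous."""
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1] + 1:   # contiguous or overlapping
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

rows = [(n, n) for n in range(1, 11)]   # the ten single-GTID rows above
print(compress(rows))                   # [(1, 10)]
```

A gap in the sequence (e.g. a skipped transaction) would keep the intervals apart, which is why the table can still hold many rows when compression cannot keep up.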

On the other hand, we have the super_read_only feature: if this option is set to ON, MySQL won't allow any updates – even from users that have SUPER privileges. It was first implemented in WebScaleSQL and later ported to Percona Server 5.6. Mainstream MySQL implemented a similar feature in version 5.7.8.

The Issue [1]

MySQL’s super_read_only feature won’t allow the compression of the mysql.gtid_executed table. If a high number of transactions run on the master instance, it causes the gtid_executed table to grow to a considerable size. Let’s see an example.

I’m going to use the MySQL Sandbox to quickly set up a Master/Slave configuration, and sysbench to simulate a high number of transactions on the master instance.

First, set up replication using GTID:

make_replication_sandbox --sandbox_base_port=5730 /opt/mysql/5.7.17 --how_many_nodes=1 --gtid

Next, set up the variables for a chained slave instance:

echo "super_read_only=ON" >> node1/my.sandbox.cnf
echo "log_slave_updates=ON" >> node1/my.sandbox.cnf
node1/restart

Now, generate a high number of transactions:

sysbench --test=oltp.lua --mysql-socket=/tmp/mysql_sandbox5730.sock --report-interval=1 --oltp-tables-count=100000 --oltp-table-size=100 --max-time=1800 --oltp-read-only=off --max-requests=0 --num-threads=8 --rand-type=uniform --db-driver=mysql --mysql-user=msandbox --mysql-password=msandbox --mysql-db=test prepare

After running sysbench for a while, we can check that the number of rows in the gtid_executed table is growing quickly:

slave1 [localhost] {msandbox} ((none)) > select count(*) from mysql.gtid_executed ;
+----------+
| count(*) |
+----------+
|   300038 |
+----------+
1 row in set (0.00 sec)

By reviewing SHOW ENGINE INNODB STATUS, we can find a compression thread running and trying to compress the gtid_executed table.

---TRANSACTION 4192571, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
9 lock struct(s), heap size 1136, 1533 row lock(s), undo log entries 1525
MySQL thread id 4, OS thread handle 139671027824384, query id 0 Compressing gtid_executed table

This thread runs and takes ages to complete (or may never complete). It has been reported as Bug #84332.

The Issue [2]

What happens if you have to stop MySQL while the thread compressing the gtid_executed table is running? In this special case, if you run the flush-logs command before or at the same time as mysqladmin shutdown, MySQL will actually stop accepting connections (all new connections hang waiting for the server) and will start to wait for the thread compressing the gtid_executed table to complete its work. Below is an example.

First, execute the flush logs command and obtain ERROR 1290:

$ mysql -h 127.0.0.1 -P 5731 -u msandbox -pmsandbox -e "flush logs ;"
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement

Next, we try to shut down the instance, but it hangs:

$ mysqladmin -h 127.0.0.1 -P 5731 -u msandbox -pmsandbox shutdown
^CWarning;  Aborted waiting on pid file: 'mysql_sandbox5731.pid' after 175 seconds

This bug has been reported and verified as Bug #84597.

The Workaround

If you already have an established connection to your database with SUPER privileges, you can disable the super_read_only feature dynamically. Once that is done, the pending thread compressing the gtid_executed table completes its work and the shutdown finishes successfully. Below is an example.

We check rows in the gtid_executed table:

$ mysql -h 127.0.0.1 -P 5731 -u msandbox -pmsandbox -e "select count(*) from mysql.gtid_executed ;"
+----------+
| count(*) |
+----------+
|   300038 |
+----------+

We disable the super_read_only feature on an already established connection:

mysql> SET GLOBAL super_read_only = OFF;

We check the rows in the gtid_executed table again, verifying that the compression thread ran successfully:

$ mysql -h 127.0.0.1 -P 5731 -u msandbox -pmsandbox -e "select count(*) from mysql.gtid_executed ;"
+----------+
| count(*) |
+----------+
|        1 |
+----------+

Now we can shutdown the instance without issues:

$ mysqladmin -h 127.0.0.1 -P 5731 -u msandbox -pmsandbox shutdown

You can disable the super_read_only feature before you shut down the instance so that the gtid_executed table gets compressed. If you run into the bug above and don’t have any established connections to your database, the only way to shut down the server is by issuing kill -9 on the mysqld process.

Summary

As shown in this blog post, some of the mechanics of MySQL 5.7’s super_read_only feature are not working as expected. This can prevent some administrative operations, like shutdown, from completing.

If you are using the super_read_only feature on MySQL 5.7.17 or older, or on Percona Server 5.7.16 or older (which ports the mainstream implementation – unlike Percona Server 5.6, which ported WebScaleSQL’s super_read_only implementation), don’t use FLUSH LOGS.

by Juan Arruti at February 08, 2017 03:23 PM

Jean-Jerome Schmidt

Demonstration Videos: Top Four Feature Sets of ClusterControl for MySQL, MongoDB & PostgreSQL

The videos below demonstrate the top features and functions included in ClusterControl.  

Deploy

Deploy the best open source database for the job at hand using repeatable deployments with best practice configurations for MySQL, MySQL Cluster, Galera Cluster, Percona, PostgreSQL or MongoDB databases. Reduce time spent on manual provisioning and free up more time for experimentation and innovation.

Management

Easily handle and automate your day-to-day tasks uniformly and transparently across a mixed database infrastructure. Automate backups, health checks, database repair/recovery, security and upgrades using battle-tested best practices.

Monitoring

Unified and comprehensive real-time monitoring of your entire database and server infrastructure. Gain access to 100+ key database and host metrics that matter to your operational performance. Visualize performance in custom dashboards to establish operational baselines and support capacity planning.

Scaling

Handle unplanned workload changes by dynamically scaling out with more nodes. Optimize resource usage by scaling back nodes.

ClusterControl
Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

by Severalnines at February 08, 2017 10:04 AM

February 07, 2017

Peter Zaitsev

Percona Monitoring and Management 1.1.0 Beta is Now Available


Percona announces the release of Percona Monitoring and Management 1.1.0 Beta on February 7, 2017. This is the first beta in the PMM 1.1 series with a focus on providing alternative deployment options for PMM Server:

The instructions for installing Percona Monitoring and Management 1.1.0 Beta are available in the documentation. Detailed release notes are available here.

New in PMM Server:

  • Grafana 4.1.1
  • Prometheus 1.5.0
  • Consul 0.7.3
  • Updated the MongoDB ReplSet dashboard to show the storage engine used by the instance
  • PMM-551: Fixed QAN changing query format when a time-based filter was applied to the digest

New in PMM Client:

  • PMM-530: Fixed pmm-admin to support special characters in passwords
  • Added displaying of original error message in pmm-admin config output

Known Issues:

  • Several of the MongoDB RocksDB metrics do not display correctly. This issue will be resolved in the production release.

A live demo of PMM is available at pmmdemo.percona.com.

We welcome your feedback and questions on our PMM forum.

About Percona Monitoring and Management
Percona Monitoring and Management is an open-source platform for managing and monitoring MySQL and MongoDB performance. Percona developed it in collaboration with experts in the field of managed database services, support and consulting.

PMM is a free and open-source solution that you can run in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

by Alexey Zhebel at February 07, 2017 10:31 PM

Overview of Different MySQL Replication Solutions

MySQL Replication

In this blog post, I will review some of the MySQL replication concepts that are part of the MySQL environment (and Percona Server for MySQL specifically). I will also try to clarify some of the misconceptions people have about replication.

Since I’ve been working on the Solution Engineering team, I’ve noticed that – although information is plentiful – replication is often misunderstood or incompletely understood.

So What is Replication?

Replication ensures that information is copied and deliberately populated into another environment, rather than being stored in only one location, based on the transactions of the source environment.

The idea is to use secondary servers in your infrastructure either for reads or for other administrative tasks. The diagram below shows an example of a MySQL replication environment.

MySQL Replication

 

Fine, But What Choices Do I Have in MySQL?

You actually have several different choices:

Standard asynchronous replication

Asynchronous replication means that the transaction is completed entirely on the local environment, and is not influenced by the replication slaves themselves.

After completing its changes, the master writes either the data modification or the actual statement to the binary log (the difference between row-based and statement-based replication – more on this later). The master’s dump thread reads the binary log and sends it to the slave’s IO thread, which places it in the slave’s own preprocessing queue, called a relay log.

The slave then applies each change to its own database using the SQL thread.

MySQL Replication

 

Semi-synchronous replication

Semi-synchronous replication means that the slave and the master communicate with each other to guarantee the correct transfer of the transaction. The master only populates the binlog and continues its session if one of the slaves provides confirmation that the transaction was properly placed in that slave’s relay log.

Semi-synchronous replication guarantees that a transaction is correctly copied, but it does not guarantee that the commit on the slave actually takes place.

It is important to note that semi-sync replication makes sure the master waits to continue processing transactions in a specific session until at least one of the slaves has ACKed reception of the transaction (or a timeout is reached). This differs from asynchronous replication, as semi-sync provides additional data integrity.

Keep in mind that semi-synchronous replication impacts performance because it needs to wait for the round trip of the actual ACK from the slave.
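The waiting behavior described above can be sketched with a toy model (the names here are illustrative; the real logic lives in the semisync plugin): the committing session blocks on an ACK from a slave, and falls back to asynchronous behavior if the timeout fires first.

```python
import queue
import threading

def master_commit(ack_queue, timeout=1.0):
    """Toy model of a semi-sync commit: block until one slave ACKs
    receipt of the transaction, or fall back to async on timeout."""
    try:
        ack_queue.get(timeout=timeout)
        return "acked"     # transaction safely in a slave's relay log
    except queue.Empty:
        return "fallback"  # behaves like plain asynchronous replication

acks = queue.Queue()
# Simulate a slave ACKing 50 ms after the commit started.
threading.Timer(0.05, lambda: acks.put("slave-1")).start()
print(master_commit(acks))                 # acked
print(master_commit(queue.Queue(), 0.05))  # fallback
```

This also makes the performance cost visible: the commit latency includes the round trip to the fastest slave.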

Group Replication

This is a new concept introduced in MySQL Community Edition 5.7, and it went GA in MySQL 5.7.17. It’s a rather new plugin built for virtually synchronous replication.

Whenever a transaction is executed on a node, the plugin tries to get consensus with the other nodes before reporting it back to the client as completed. Although the solution is a completely different concept compared to standard MySQL replication, it is based on the generation and handling of log events using the binlog.

Below is an example architecture for Group Replication.

MySQL Replication

If Group Replication interests you, read the following blog posts:

There will be a tutorial at the Percona Live Open Source Database Conference in Santa Clara in April 2017.

Percona XtraDB Cluster / Galera Cluster

Another solution that allows you to replicate information to other nodes is Percona XtraDB Cluster. This solution focuses on delivering consistency, and also uses a certification process to guarantee that transactions avoid conflicts and are performed correctly.

In this case, we are talking about a clustered solution. Each environment is subject to the same data, and there is communication in-between nodes to guarantee consistency.

Percona XtraDB Cluster has multiple components:

  • Percona Server for MySQL
  • Percona XtraBackup for performing snapshots of the running cluster (if recovering or adding a node).
  • wsrep patches / Galera Library

This solution is virtually synchronous, which is comparable to Group Replication. However, it also has the capability to use multi-master replication. Solutions like Percona XtraDB Cluster are a component for improving the availability of your database infrastructure.

MySQL Replication

A tutorial on Percona XtraDB Cluster will be given at the Percona Live Open Source Database Conference in Santa Clara in April 2017.

Row-Based Replication Vs. Statement-Based Replication

With statement-based replication, the SQL statement itself is written to the binary log, and the slave executes the exact same INSERT/UPDATE/DELETE statements.

There are many advantages and disadvantages to this system:

  • Auditing the database is much easier as the actual statements are logged in the binary log
  • Less data is transferred over the wire
  • Non-deterministic queries can create actual havoc in the slave environment
  • Some queries (e.g., an INSERT based on a SELECT) can suffer a performance disadvantage with statement-based replication
  • Statement-based replication is slower due to SQL optimizing and execution

Row-based replication is the default choice since MySQL 5.7.7, and it has many advantages. The row changes are logged in the binary log, and it does not require context information. This removes the impact of non-deterministic queries.

Some additional advantages are:

  • Performance improvements with high concurrency queries containing few row changes
  • Significant data-consistency improvement

And, of course, some disadvantages:

  • Network traffic can be significantly larger if you have queries that modify a large number of rows
  • It’s more difficult to audit the changes on the database
  • Row-based replication can be slower than statement-based replication in some cases
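The non-determinism risk mentioned above can be made concrete with a toy model (all names hypothetical): re-executing a statement that depends on server-local state produces different data on the slave, while shipping the row image does not.

```python
def execute_statement(server):
    """Stand-in for a statement using a non-deterministic function,
    e.g. INSERT INTO t VALUES (UUID()): the result depends on the server."""
    return f"row-{server['uuid']}"

master = {"uuid": "aaaa"}
slave = {"uuid": "bbbb"}

master_row = execute_statement(master)

sbr_row = execute_statement(slave)  # statement-based: re-executed on the slave
rbr_row = master_row                # row-based: the row image itself is shipped

print(rbr_row == master_row)  # True  - slave data matches the master
print(sbr_row == master_row)  # False - the slave has silently diverged
```

This is why MySQL flags statements using functions like UUID() as unsafe for statement-based logging.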

Some Misconceptions About Replication

Replication is a cluster.

Standard asynchronous replication is not a synchronous cluster. Keep in mind that standard and semi-synchronous replication do not guarantee that the environments are serving the same dataset. This is different when using Percona XtraDB Cluster, where every server actually needs to process each change. If it can’t, the affected node is removed from the cluster. Asynchronous replication does not have this fail-safe. It still accepts reads while in an inconsistent state.

Replication sounds perfect, I can use this as a manual failover solution.

Theoretically, the environments should be comparable. However, there are many parameters influencing the efficiency and consistency of the data transfer. As long as you use asynchronous replication, there is no guarantee that the transaction correctly took place. You can circumvent this by enhancing the durability of the configuration, but this comes at a performance cost. You can verify the consistency of your master and slaves using the pt-table-checksum tool.

I have replication, so I actually don’t need backups.

Replication is a great solution for having an accessible copy of the dataset (e.g., reporting issues, read queries, generating backups). This is not a backup solution, however. Having an offsite backup provides you with the certainty that you can rebuild your environment in the case of any major disasters, user error or other reasons (remember the Bobby Tables comic). Some people use delayed slaves. However, even delayed slaves are not a replacement for proper disaster recovery procedures.

I have replication, so the environment will now load balance the transactions.

Although you’ve potentially improved the availability of your environment by having a secondary instance running with the same dataset, you still might need to point the read queries towards the slaves and the write queries to the master. You can use proxy tools, or define this functionality in your own application.

Replication will slow down my master significantly.

Replication has only minor performance impacts on your master. Peter Zaitsev has an interesting post on this here, which discusses the potential impact of slaves on the master. Keep in mind that writing to the binary log can potentially impact performance, especially if you have a lot of small transactions that are then dumped and received by multiple slaves.

There are, of course, many other parameters that might impact the performance of the actual master and slave setup.

by Dimitri Vanoverbeke at February 07, 2017 03:04 PM

Jean-Jerome Schmidt

What’s New in ClusterControl 1.4 - Backup Management

ClusterControl 1.4 introduces some major improvements in the area of backup management, with a revamped interface and simplified options to create backups. In this blog post, we’ll have a look at the new backup features available in this release.

Upgrading to 1.4

If you upgrade ClusterControl from version 1.3.x to version 1.4, the CMON process will internally migrate all backup related data/schedules to the new interface. The migration will happen during the first startup after you have upgraded (you are required to restart the CMON process after a package upgrade). To upgrade, please refer to the documentation.

Redesigned User Interface

In the user interface, we have now consolidated related functionality onto a single interface. This includes Backup Settings, which were previously found under ClusterControl -> Settings -> Backups. It is now accessible under the same backup management tab:

The interface is now responsive to any action taken and requires no manual refresh. When a backup is created, you will see it in the backup list with a spinning arrows icon:

It is also possible now to schedule a backup every minute (the lowest interval) or year (the highest interval):

The backup options when scheduling or creating a backup now appear on the right side:

This allows you to quickly configure the backup, rather than having to scroll down the page.

Backup Report

Here is how it used to look pre v1.4:

After upgrading to ClusterControl v1.4, the report will look like this:

All incremental backups are automatically grouped together under the last full backup and expandable with a drop down. This makes the backups more organized per backup set. Each created backup will have “Restore” and “Log” buttons. The “Time” column also now contains timezone information, useful if you are dealing with geographically distributed infrastructure.

Restore to an Incremental Backup Point

You are now able to restore up to a certain incremental backup. Previously, ClusterControl supported restoration per backup set. All incremental backups under a single backup set would be restored and there was no way, for instance, to skip some of the incremental backups.

Consider the below example:

A full backup happens every day around 5:15 AM, while incremental backups are scheduled every 15 minutes. If something happened around 5:50 AM and you would like to restore up to the backup taken just before that, you can skip the 6 AM backup by just clicking on the “Restore” link of the 5:45 AM incremental backup. You should then see the following Restore wizard and a couple of post-restoration options:

ClusterControl will then prepare the backup up until the selected point and the rest will be skipped. It also highlights “Warning” and “Notes” so you are aware of what will happen with the cluster during the restoration process. Note that mysqldump restoration can be performed online, while Xtrabackup requires the cluster/database instance to be stopped.

Operational Report

You might have multiple database systems running, and perhaps in different datacenters. Would it not be nice to get a consolidated report of the systems, when they were last backed up, and if there were any failed backups? This is available in 1.4. Note that you have other types of ops reports available in ClusterControl.

The report contains two sections and gives you a short summary of when the last backup was created and whether it completed successfully or failed. You can also check the list of backups executed on the cluster, with their state, type and size. This is as close as you can get to checking that backups work correctly without running a full recovery test. However, we definitely recommend that such tests are performed regularly.

The operational report can be scheduled and emailed to a set of recipients under Settings -> Operational Reports section, as shown in the following screenshot:

Access via ClusterControl RPC interface

The new backup features are now exposed via the ClusterControl RPC interface, which means you can interact with them through API calls using the correct RPC token. For example, to list the backups created on cluster ID 2, the following call should be enough:

$ curl -XPOST -d '{"operation": "listbackups", "token": "RB81tydD0exsWsaM"}' http://localhost:9500/2/backup
{
  "cc_timestamp": 1477063671,
  "data": [ 
  {
      "backup": [ 
      {
          "db": "mysql",
          "files": [ 
          {
              "class_name": "CmonBackupFile",
              "created": "2016-10-21T15:26:40.000Z",
              "hash": "md5:c7f4b2b80ea439ae5aaa28a0f3c213cb",
              "path": "mysqldump_2016-10-21_172640_mysqldb.sql.gz",
              "size": 161305,
              "type": "data,schema"
          } ],
          "start_time": "2016-10-21T15:26:41.000Z"
      } ],
      "backup_host": "192.168.33.125",
      "cid": 101,
      "class_name": "CmonBackupRecord",
      "config": 
      {
          "backupDir": "/tmp",
          "backupHost": "192.168.33.125",
          "backupMethod": "mysqldump",
          "backupToIndividualFiles": false,
          "backup_failover": false,
          "backup_failover_host": "",
          "ccStorage": false,
          "checkHost": false,
          "compression": true,
          "includeDatabases": "",
          "netcat_port": 9999,
          "origBackupDir": "/tmp",
          "port": 3306,
          "set_gtid_purged_off": true,
          "throttle_rate_iops": 0,
          "throttle_rate_netbw": 0,
          "usePigz": false,
          "wsrep_desync": false,
          "xtrabackupParallellism": 1,
          "xtrabackup_locks": false
      },
      "created": "2016-10-21T15:26:40.000Z",
      "created_by": "",
      "description": "",
      "finished": "2016-10-21T15:26:41.000Z",
      "id": 5,
      "job_id": 2952,
      "log_file": "",
      "lsn": 140128879096992,
      "method": "mysqldump",
      "parent_id": 0,
      "root_dir": "/tmp/BACKUP-5",
      "status": "Completed",
      "storage_host": "192.168.33.125"
  }, 
  {
      "backup": [ 
      {
          "db": "",
          "files": [ 
          {
              "class_name": "CmonBackupFile",
              "created": "2016-10-21T15:21:50.000Z",
              "hash": "md5:538196a9d645c34b63cec51d3e18cb47",
              "path": "backup-full-2016-10-21_172148.xbstream.gz",
              "size": 296000,
              "type": "full"
          } ],
          "start_time": "2016-10-21T15:21:50.000Z"
      } ],
      "backup_host": "192.168.33.125",
      "cid": 101,
      "class_name": "CmonBackupRecord",
      "config": 
      {
          "backupDir": "/tmp",
          "backupHost": "192.168.33.125",
          "backupMethod": "xtrabackupfull",
          "backupToIndividualFiles": false,
          "backup_failover": false,
          "backup_failover_host": "",
          "ccStorage": false,
          "checkHost": false,
          "compression": true,
          "includeDatabases": "",
          "netcat_port": 9999,
          "origBackupDir": "/tmp",
          "port": 3306,
          "set_gtid_purged_off": true,
          "throttle_rate_iops": 0,
          "throttle_rate_netbw": 0,
          "usePigz": false,
          "wsrep_desync": false,
          "xtrabackupParallellism": 1,
          "xtrabackup_locks": true
      },
      "created": "2016-10-21T15:21:47.000Z",
      "created_by": "",
      "description": "",
      "finished": "2016-10-21T15:21:50.000Z",
      "id": 4,
      "job_id": 2951,
      "log_file": "",
      "lsn": 1627039,
      "method": "xtrabackupfull",
      "parent_id": 0,
      "root_dir": "/tmp/BACKUP-4",
      "status": "Completed",
      "storage_host": "192.168.33.125"
  } ],
  "requestStatus": "ok",
  "total": 2
}

Other supported operations are:

  • deletebackup
  • listschedules
  • schedule
  • deleteschedule
  • updateschedule

With those operations exposed via ClusterControl RPC, you can automate backup management and list backup schedules from a script or application call. Creating a backup, however, is handled differently via a job call (operation: createJob), since some backups may take hours or days to complete. To create a backup on cluster ID 9, one would do:

$ curl -XPOST -d '{"token": "c8gY3Eq5iFE3DC4i", "username":"admin@domain.com","operation":"createJob","job":{"command":"backup", "job_data": {"backup_method":"xtrabackupfull", "hostname": "192.168.33.121", "port":3306, "backupdir": "/tmp/backups/" }}}' http://localhost:9500/9/job

Where:

  • The URL format is: http://[ClusterControl_host]/clusterid/job
  • Backup method: Xtrabackup (full)
  • RPC token: c8gY3Eq5iFE3DC4i (retrievable from cmon_X.cnf)
  • Backup host: 192.168.33.121, port 3306
  • Backup destination: /tmp/backups on the backup host
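For scripting, the two call shapes shown above (a plain RPC operation and a createJob job call) can be wrapped in a small client. A sketch using only Python’s standard library; the host, port and helper names are assumptions, not part of the ClusterControl API:

```python
import json
import urllib.request

CC_HOST = "http://localhost:9500"  # assumption: default RPC port, as above

def rpc_payload(operation, token, **extra):
    """Build the JSON body for a ClusterControl RPC call."""
    body = {"operation": operation, "token": token}
    body.update(extra)               # e.g. username, job for createJob calls
    return json.dumps(body).encode()

def rpc_call(cluster_id, endpoint, payload):
    """POST a payload to http://[ClusterControl_host]/clusterid/endpoint."""
    req = urllib.request.Request(
        f"{CC_HOST}/{cluster_id}/{endpoint}",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Equivalent of the curl 'listbackups' example earlier in this post:
payload = rpc_payload("listbackups", "RB81tydD0exsWsaM")
# rpc_call(2, "backup", payload)  # requires a running ClusterControl server
```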

For example, it’s a good idea to create a backup before testing DDL statements like TRUNCATE or DROP, because those are not transactional and are impossible to roll back. We are going to cover this in detail in an upcoming blog post.

With a Bash script and the correct API calls, it is now possible to have an automated script like the following:

$ test_disasterous_query.sh --host 192.168.33.121 --query 'TRUNCATE mydb.processes' --backup-first 1

There are many other reasons to upgrade to the latest ClusterControl version; the backup functionality is just one of many exciting new features introduced in ClusterControl v1.4. Do upgrade (or install ClusterControl if you haven’t used it yet), give it a try and let us know your thoughts. New installations come with a 30-day trial.

by ashraf at February 07, 2017 09:47 AM

February 06, 2017

Peter Zaitsev

Percona Server for MongoDB 3.4.1-1.1 Release Candidate is Now Available


Percona announces the release of Percona Server for MongoDB 3.4.1-1.1rc on February 6, 2017. It is the first release candidate in the 3.4 series. Download the latest version from the Percona web site or the Percona Software Repositories.

NOTE: Release candidate packages are available from the testing repository.

Percona Server for MongoDB is an enhanced, open source, fully compatible, highly scalable, zero-maintenance downtime database supporting the MongoDB v3.4 protocol and drivers. It extends MongoDB with Percona Memory Engine and MongoRocks storage engine, as well as several enterprise-grade features:

Percona Server for MongoDB requires no changes to MongoDB applications or code.


This release candidate is based on MongoDB 3.4.1, and includes the following additional changes:

Percona Server for MongoDB 3.4.1-1.1rc release notes are available in the official documentation.

by Alexey Zhebel at February 06, 2017 07:41 PM

Percona Toolkit 3.0.0 Release Candidate is Now Available


Percona announces the availability of Percona Toolkit 3.0.0rc-2 with new MongoDB tools on February 6, 2017. This is a release candidate.

Percona Toolkit is a collection of advanced command-line tools to perform a variety of MySQL and MongoDB server and system tasks that are too difficult or complex for DBAs to perform manually. Percona Toolkit, like all Percona software, is free and open source.

This is the first release candidate in the 3.0 series. It includes new features and bug fixes. Downloads are available from the Percona Software Testing Repositories.

New features:

Bug fixes:

  • 1402776: Updated MySQLProtocolParser to fix error when parsing tcpdump capture with pt-query-digest
  • 1632522: Fixed failure of pt-online-schema-change when altering a table with a self-referencing foreign key (Thanks, Amiel Marqeta)
  • 1654668: Fixed failure of pt-summary on Red Hat and derivatives (Thanks, Marcelo Altmann)

You can find release details in the release notes. Bugs can be reported on the Percona Toolkit launchpad bug tracker.

by Alexey Zhebel at February 06, 2017 07:37 PM

February 03, 2017

Peter Zaitsev

Percona Server for MySQL 5.7.17-11 is now available


Percona announces the GA release of Percona Server for MySQL 5.7.17-11 on February 3, 2017. Download the latest version from the Percona web site or the Percona Software Repositories.

Based on MySQL 5.7.17, including all the bug fixes in it, Percona Server for MySQL 5.7.17-11 is the current GA release in the Percona Server for MySQL 5.7 series. Percona provides completely open-source and free software. Find release details in the 5.7.17-11 milestone at Launchpad.

New Features:
  • Percona Server for MySQL has implemented support for per-column VARCHAR/BLOB compression for the XtraDB storage engine. This also features compression dictionary support, to improve compression ratio for relatively short individual rows, such as JSON data.
  • The Kill Idle Transactions feature has been re-implemented by setting a connection socket read timeout value instead of periodically scanning the internal InnoDB transaction list. This makes the feature applicable to any transactional storage engine, such as TokuDB and, in the future, MyRocks. This re-implementation also addresses some existing bugs, including server crashes: #1166744, #1179136, #907719, and #1369373.
Bugs Fixed:
  • Logical row counts for TokuDB tables could get inaccurate over time. Bug fixed #1651844 (#732).
  • Repeated execution of SET STATEMENT ... FOR SELECT FROM view could lead to a server crash. Bug fixed #1392375.
  • CREATE TEMPORARY TABLE would create a transaction in binary log on a read-only server. Bug fixed #1539504 (upstream #83003).
  • Using per-query variable statement with subquery temporary tables could cause a memory leak. Bug fixed #1635927.
  • Fixed new compilation warnings with GCC 6. Bugs fixed #1641612 and #1644183.
  • A server could crash if a bitmap write I/O error happens in the background log tracking thread while a FLUSH CHANGED_PAGE_BITMAPS is executing concurrently. Bug fixed #1651656.
  • TokuDB was using wrong function to calculate free space in data files. Bug fixed #1656022 (#1033).
  • CONCURRENT_CONNECTIONS column in the USER_STATISTICS table was showing incorrect values. Bug fixed #728082.
  • Audit Log Plugin when set to JSON format was not escaping characters properly. Bug fixed #1548745.
  • InnoDB index dives did not detect some of the concurrent tree changes, which could return bogus estimates. Bug fixed #1625151 (upstream #84366).
  • INFORMATION_SCHEMA.INNODB_CHANGED_PAGES queries would needlessly read potentially incomplete bitmap data past the needed LSN range. Bug fixed #1625466.
  • Percona Server cmake compiler would always attempt to build RocksDB even if -DWITHOUT_ROCKSDB=1 argument was specified. Bug fixed #1638455.
  • A lack of free pages in the buffer pool was not diagnosed with innodb_empty_free_list_algorithm set to backoff (which is the default). Bug fixed #1657026.
  • mysqld_safe now limits the use of rm and chown to avoid privilege escalation. chown can now be used only for /var/log directory. Bug fixed #1660265. Thanks to Dawid Golunski (https://legalhackers.com).
  • Renaming a TokuDB table to a non-existent database with tokudb_dir_per_db enabled would lead to a server crash. Bug fixed #1030.
  • Read Free Replication optimization could not be used for TokuDB partition tables. Bug fixed #1012.

Other bugs fixed: #1486747, #1617715, #1633988, #1638198 (upstream #82823), #1642230, #1646384, #1640810, #1647530, #1651121, #1658843, #1156772, #1644583, #1648389, #1648737, #1650256, and #1647723.

The release notes for Percona Server for MySQL 5.7.17-11 are available in the online documentation. Please report any bugs on the launchpad bug tracker.

by Hrvoje Matijakovic at February 03, 2017 08:32 PM

Percona Live Featured Tutorial with Derek Downey, David Turner and René Cannaò — ProxySQL Tutorial

Percona Live Featured Tutorial

Welcome to another post in the series of Percona Live featured tutorial speakers blogs! In these blogs, we’ll highlight some of the tutorial speakers that will be at this year’s Percona Live conference. We’ll also discuss how these tutorials can help you improve your database environment. Make sure to read to the end to get a special Percona Live 2017 registration bonus!

In this Percona Live featured tutorial, we’ll meet Derek Downey (OSDB Practice Advocate, Pythian), David Turner (Storage SRE, Uber) and René Cannaò (MySQL SRE, Dropbox / ProxySQL). Their session is ProxySQL Tutorial. There is a stigma attached to database proxies when it comes to MySQL. This tutorial hopes to blow away that stigma by showing you what can be done with a proxy designed from the ground up to perform. I had a chance to speak with Derek, David and René and learn a bit more about ProxySQL:

Percona: How did you get into database technology? What do you love about it?

Percona Live Featured Tutorial
Derek Downey

Derek: I took a relational database course in college based on Oracle. Set theory and the relational model made a lot of sense to me. After a few years as a web developer at a small company, I transitioned to a hybrid SysAdmin/DBA role and got my first taste of the potential of “the cloud” (and some of the drawbacks).

I really came to understand that data is the lifeblood of any organization, and making sure it is always available through any disaster – from human error to hurricanes – is a unique and exciting challenge.

You should never notice the DBA if they’re doing their job right. There’s not much praise for a DBA on a job well done. But it’s a vital position to keep a company running. And that suits me just fine.

David: I started working for the Advanced Projects Group at the University of Missouri, now known as MOREnet. They were largely responsible for connecting all of the libraries and schools in the state to the Internet. I was initially helping them with their Internet presence as a webmaster. Later they needed help with their databases. I got very excited about working with Oracle at the time, and decided to join that team.

My relationship with MySQL started primarily because the cost of sharding Oracle was so high. Additionally, MySQL’s replication allowed us to use slaves. Oracle’s Dataguard/standby options wouldn’t allow reads from the slaves at that time. Lastly, MySQL was sort of “wild west” fun, since it lacked so many other features that Oracle had long ago. You had to get creative. It has been humbling to see how much innovation has come from the community and how far MySQL has come. And this is only the beginning!

Percona Live Featured Tutorial
René Cannaò

René: My career followed the classic path of a system administrator who ends up becoming a DBA. I used to work for a few companies as a webmaster, and finally as a SysAdmin for a web hosting company. I always saw a similar pattern: “the bottleneck is in the database.” Nobody ever knew why the database was the bottleneck. I volunteered to improve the performance of this “unknown system.” Learning was a fun experience, and the result was extremely rewarding. I love understanding how databases operate and their internals. This is the only way to get the maximum performance: “scientia potentia est”!

Percona: Your tutorial is called “ProxySQL Tutorial.” What exactly is ProxySQL, and what does it do?

Derek: I’ll leave it to René, the creator of ProxySQL, to give more detail on exactly what it is. But for a DBA trying to ensure their data is always available, it is a new and exciting tool in our toolbox.

René: ProxySQL is the MySQL data gateway. It’s the Stargate that can inspect, control, transform, manage, and route all traffic between clients and database servers. It builds reliable and fault-tolerant networks. It is a software bridge that empowers MySQL DBAs, built by DBAs for DBAs, allowing them to control all MySQL traffic where previously such traffic could not be controlled either on the client side (normally the developers’ realm) or server side (where there are not enough tools).

Percona Live Featured Tutorial
David Turner

David: Architecturally, ProxySQL is a separate process between the client and the database. Because traffic passes through it, ProxySQL can become many things (three of which got my attention). It can be a multiplexer, a filter, and a replicator.

Multiplexers reduce many signals down to a few. Web servers often open many static connections to MySQL. Since MySQL can only support a limited number of connections before performance suffers, ProxySQL’s ability to transparently manage tens of thousands of connections while only opening a few to the database is a great feature.

Administrators can update ProxySQL to filter and even rewrite queries based on patterns they decide on. As someone who has worked in operations and seen how long it can take to disable misbehaving applications, this is a very compelling feature. With ProxySQL in place, I can completely block a query from the database in no time.

ProxySQL’s replication or mirroring capability means that all of the queries sent to one database can now be sent to N databases. As someone who has to roll out new versions of MySQL, test index changes, and benchmark hardware, this is also a compelling feature.

Percona: What are the advantages of using ProxySQL in a database environment?

René: ProxySQL is the bridge between the clients and the servers. It creates two layers, and controls all the communication between the two. Sitting in the middle, ProxySQL provides a lot of advantages normally impossible to achieve in a standard database environment, such as throttling or blocking queries, rewriting queries, implementing sharding, read/write splitting, caching, duplicating traffic, handling backend failures, failovers, integration with HA solutions, generating real-time statistics, etc. All this, without any application change, and completely transparent to the application.

Derek: For me, ProxySQL decouples the application from the data layer. This gives the DBA more control over the backend database environment with regard to queries, maintenance and failovers, without impacting the application.

David: In addition to the roles noted above, ProxySQL can be part of a failover solution, routing queries to a new master when the current master fails. Other advantages are splitting queries over multiple databases to distribute the load, provide additional metrics, etc.

Percona: What do you want attendees to take away from your tutorial session? Why should they attend?

Derek: This tutorial highlights what ProxySQL is trying to achieve, and discusses how to add ProxySQL to common architectures and environments. Attendees will get hands-on experience with the technology, and learn how to install and configure ProxySQL to achieve query rewriting, seamless failover, and query mirroring.

David: Not only hands-on experience with ProxySQL, but an understanding of how much they can leverage with it. The more I use ProxySQL, the more advantages I see to it. For example, I did not realize that by clustering ProxySQL processes I can distribute the query matching and rewrites over many hosts, as well as use them as a caching layer.

René: ProxySQL is built upon very innovative technologies. Certain architectural concepts like hostgroups, chaining of query rules, granular routing and sharding, query retries and the very powerful Admin interface are concepts not always intuitive for DBAs with experience using other proxies. This tutorial helps attendees understand these concepts, and gives them hands-on experience in the configuration of ProxySQL in various scenarios.

Percona: What are you most looking forward to at Percona Live?

David: First, the people. Next, everything everyone is working on. We’re really lucky to work in such an innovative and collaborative industry. As databases evolve, we are on the ground floor of their evolution. What an exciting place to be.

Derek: I am mostly looking forward to reconnecting with my peers in the MySQL community, both those I’ve formerly worked with and those I’ve previously met at Percona Live, as well as meeting new open source database professionals and hearing how they are providing solutions for their companies.

René: I am looking forward to attending sessions regarding new features in MySQL 8 and other new technologies. But moreover, I am excited to interact with MySQL users and get more input on how to improve ProxySQL so that it can become an indispensable tool in any MySQL environment.

Register for Percona Live Data Performance Conference 2017, and see Derek, David and René present their ProxySQL Tutorial. Use the code FeaturedTutorial and receive $30 off the current registration price!

Percona Live Data Performance Conference 2017 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 24-27, 2017 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

by Dave Avery at February 03, 2017 05:22 PM

February 02, 2017

Peter Zaitsev

PMM Alerting with Grafana: Working with Templated Dashboards

PMM Alerting

In this blog post, we will look into more intricate details of PMM alerting. More specifically, we’ll look at how to set up alerting based on templated dashboards.

Percona Monitoring and Management (PMM) 1.0.7 includes Grafana 4.0, which comes with the Alerting feature. Barrett Chambers shared how to enable alerting in general. Out of the box, however, Grafana 4.0 does not support alerting on templated dashboards.

PMM Alerting 1

This means if I try to set up an alert on the number of MySQL threads running, I get the error “Template variables are not supported in alert queries.”

What is the solution?

Until Grafana provides a better option, you need to do alerting based on graphs (which don’t use templating). This is how to do it.

Click on “Create New” in the Dashboards list to create a basic dashboard for your alerts:

PMM Alerting 2

Click on “Add Panel” and select “Graph”:

PMM Alerting 3

Click on the panel title of the newly added panel, and in the menu that appears click on “Panel JSON”.

PMM Alerting 4

This shows you the JSON of the panel, which will look something like this:

PMM Alerting 5

Now go back to the other browser window, to the dashboard with the graph you want to alert on, and show the panel JSON for it. In our case, we go to “MySQL Overview” and show the JSON for the “MySQL Active Threads” panel.

Copy the JSON from the “MySQL Active Threads” panel and paste it into the new panel in the dashboard created for alerting.

Once we have done the copy/paste, click on the green Update button, and we’ll see the broken panel:

PMM Alerting 6

It’s broken because the dashboard expressions use templating variables, none of which are set up in this dashboard, so the expressions can’t work. We must replace the template variables in the formulas with the actual hosts, instances, mount points, etc., that we want to alert on:

PMM Alerting 7

We need to change $host to the name of the host we want to alert on, and $interval should align with the data capture interval (here we’ll set it to 5 seconds):

PMM Alerting 8

If correctly set up, you should see the graph showing the data.
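To make the substitution concrete, here is a hypothetical before/after for a single expression (the metric name and the instance address are illustrative examples, not values taken from this dashboard):

```
# Templated expression, as copied from the original dashboard:
rate(mysql_global_status_questions{instance="$host"}[$interval])

# After replacement: a concrete host, and $interval set to 5s:
rate(mysql_global_status_questions{instance="db01:42002"}[5s])
```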

Finally, we can edit the graph. Click on the “Alert” tab, and then on “Create Alert”.

PMM Alerting 9

Specify “Evaluate Every” to create the alert. This sets up the evaluation interval for the alert rule. Obviously, the more often the alert evaluates the condition, the more quickly you get alerted if something goes wrong (and the more quickly you learn when the condition clears).

In our case, we want to get an alert if the number of running threads is sustained at a high level. To do this, we check that the minimum number of threads over the last minute is above 30:

PMM Alerting 10

Note that our query has two parameters: “A” is the number of threads connected, and “B” is the number of threads running. We’re choosing to Alert on “B”. 
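For reference, these settings end up inside the panel JSON in a structure roughly like the following (a sketch based on Grafana 4.0’s alert rule format; treat the exact field names and values as approximate):

```
"alert": {
  "name": "MySQL Threads Running alert",
  "frequency": "60s",
  "conditions": [
    {
      "query":     { "params": ["B", "1m", "now"] },
      "reducer":   { "type": "min" },
      "evaluator": { "type": "gt", "params": [30] }
    }
  ]
}
```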

The beautiful thing Grafana does is show the alert threshold clearly on the graph, and allows you to edit the alert just by moving this alert line with a mouse:

PMM Alerting 11

You may want to click on the floppy drive icon at the top to save the dashboard (giving it whatever identifying name you want).

PMM Alerting 12

At this point, you should see the alert working. A little heart sign appears by the graph title, colored green (indicating it is not active) or red (indicating it is active). Additionally, you will see red and green vertical lines in the alert history. These show when this alert was triggered and when the system went back to normal.

PMM Alerting 13

You probably want to set up notifications as well as see alerts on the graphs. 

To set up notifications, go to the Grafana Configuration menu and configure Alerting. Grafana supports Email, Slack, PagerDuty and generic webhook notification options (with more on the way, I’m sure).

The same way you added the “Graph” panel to set up an alert, you can add the “Alert List” panel to see all the alerts you have set up (and their status):

PMM Alerting 14

Summary

As you can see, it is possible to set up alerts in PMM using the new Grafana 4.0 alerting feature, though it is not yet very convenient or easy to do. This is the first release of alerting support for Grafana and PMM, so I’m sure it will become much easier and more convenient over time.

by Peter Zaitsev at February 02, 2017 11:54 PM

Vote Percona in LinuxQuestions.org Members Choice Awards 2016

Percona is calling on you! Vote for Percona for Database of the Year in the LinuxQuestions.org Members Choice Awards 2016. Help Percona get recognized as one of the best database options for data performance. Percona provides free, fully compatible, enhanced, open source drop-in replacement database software with superior performance, scalability and instrumentation.

LinuxQuestions.org, or LQ for short, is a community-driven, self-help website for Linux users. Each year, LinuxQuestions.org holds an annual competition to recognize the year’s best-in-breed technologies. The online Linux community determines the winners of each category!

You can vote now for your favorite database of 2016 (Percona, of course!). This is your chance to be heard!

Voting ends on February 7, 2017. You must be a registered member of LinuxQuestions.org with at least one post on their forums to vote.

by Dave Avery at February 02, 2017 08:40 PM

Jean-Jerome Schmidt

ClusterControl Tips & Tricks: How to Manage Configuration Templates for your databases

ClusterControl makes it easy to deploy a database setup - just fill in some values (database vendor, database data directory, password and hostnames) in the deployment wizard and you’re good to go. The rest of the configuration options will be automatically determined (and calculated) based on the host specifications (CPU cores, memory, IP address etc) and applied to the template file that comes with ClusterControl. In this blog post, we are going to look into how ClusterControl uses default template files and how users can customize them to their needs.

Base Template Files

All services configured by ClusterControl use a base configuration template available under /usr/share/cmon/templates on the ClusterControl node. The following are template files provided by ClusterControl v1.4.0:

Filename Description
config.ini.mc MySQL Cluster configuration file.
haproxy.cfg HAProxy configuration template for Galera Cluster.
haproxy_rw_split.cfg HAProxy configuration template for read-write splitting.
garbd.cnf Galera arbitrator daemon (garbd) configuration file.
keepalived-1.2.7.conf Legacy keepalived configuration file (pre 1.2.7). This is deprecated.
keepalived.conf Keepalived configuration file.
keepalived.init Keepalived init script.
MaxScale_template.cnf MaxScale configuration template.
mongodb-2.6.conf.org MongoDB 2.x configuration template.
mongodb.conf.org MongoDB 3.x configuration template.
mongodb.conf.percona MongoDB 3.x configuration template for Percona Server for MongoDB.
mongos.conf.org Mongo router (mongos) configuration template.
my.cnf.galera MySQL configuration template for Galera Cluster.
my57.cnf.galera MySQL configuration template for Galera Cluster on MySQL 5.7.
my.cnf.grouprepl MySQL configuration template for MySQL Group Replication.
my.cnf.gtid_replication MySQL configuration template for MySQL Replication with GTID.
my.cnf.mysqlcluster MySQL configuration template for MySQL Cluster.
my.cnf.pxc55 MySQL configuration template for Percona XtraDB Cluster v5.5.
my.cnf.repl57 MySQL configuration template for MySQL Replication v5.7.
my.cnf.replication MySQL configuration template for MySQL/MariaDB without MySQL’s GTID.
mysqlchk.galera MySQL health check script template for Galera Cluster.
mysqlchk.mysql MySQL health check script template for MySQL Replication.
mysqlchk_xinetd Xinetd configuration template for MySQL health check.
mysqld.service.override Systemd unit file template for MySQL service.
proxysql_template.cnf ProxySQL configuration template.

The above list depends upon the feature set provided by the installed ClusterControl release. In an older version, you might not find some of them. You can modify these template files directly, although we do not recommend it as explained in the next sections.

Configuration Manager

Depending on the cluster type, ClusterControl imports the necessary base template file into the CMON database, where it becomes accessible via Manage -> Configurations -> Templates once deployment succeeds. For example, consider the following configuration template for a MariaDB Galera Cluster:

ClusterControl loads the base content of the Galera configuration template from /usr/share/cmon/templates/my.cnf.galera into the CMON database (inside the cluster_configuration_templates table) after deployment succeeds. You can then customize your own configuration file directly in the ClusterControl UI. Whenever you hit the Save button, the new version of the configuration template is stored inside the CMON database, without overwriting the base template file.

Once the cluster is deployed and running, the template in the UI takes precedence. The base template file is only used during the initial cluster deployment via ClusterControl -> Deploy -> Deploy Database Cluster. During the deployment stage, ClusterControl will use a temporary directory located at /var/tmp/ to prepare the content, for example:

/var/tmp/cmon-003862-6a7775ca76c62486.tmp

Dynamic Variables

There are a number of configuration variables which are configurable dynamically by ClusterControl. These variables are represented by capital letters enclosed in ‘@’ signs, for example @DATADIR@. For full details on supported variables, please refer to this page. Dynamic variables are automatically configured based on the input specified during cluster deployment, or ClusterControl performs automatic detection based on hostname, IP address, available RAM, number of CPU cores and so on. This simplifies deployment, as you only need to specify minimal options during the cluster deployment stage.

If a dynamic variable has already been replaced with a literal value (or is undefined), ClusterControl skips it and uses the configured value instead. This is handy for advanced users, who usually have their own set of configuration options tailored to specific database workloads.
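As a small illustration, a template fragment might mix dynamic variables with hard-coded values (@DATADIR@ and @MAX_CONNECTIONS@ appear in the stock templates; the hard-coded buffer pool line is a hypothetical example of a value ClusterControl would leave untouched):

```
[mysqld]
datadir=@DATADIR@
max_connections=@MAX_CONNECTIONS@
# Already replaced with a literal value: ClusterControl skips it and
# deploys this line as-is.
innodb_buffer_pool_size=2G
```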

Pre-deployment Configuration Template Example

Instead of relying on ClusterControl’s dynamic variable for max_connections on our database nodes, we can change the following line inside /usr/share/cmon/templates/my57.cnf.galera, from:

max_connections=@MAX_CONNECTIONS@

To:

max_connections=50

Save the file and, in the Deploy Database Cluster dialog, ensure ClusterControl uses the correct base template file:

Click on the Deploy button to start the database cluster deployment.

Post-deployment Configuration Template Example

After the database cluster deployment completes, you might have done some fine-tuning on the running servers before deciding to scale up. When scaling up, ClusterControl will use the configuration template inside the CMON database (the one populated under ClusterControl -> Configurations -> Templates) to deploy the new nodes. Hence, remember to apply any modifications you made on the database servers to the template file as well.

Before adding a new node, it’s a good practice to review the configuration template to ensure that the new node gets what we expected. Then, go to ClusterControl -> Add Node and ensure the correct MySQL template file is selected:

Then, click on the “Add Node” button to start the deployment.

That’s it. Even though ClusterControl does various automation jobs when it comes to deployment, it still provides freedom for users to customize the deployment accordingly. Happy clustering!

by ashraf at February 02, 2017 12:45 PM

February 01, 2017

Peter Zaitsev

WAN Synchronous Clusters: Dealing with Latency Using Concurrency

WAN Latency

In this blog, we’ll discuss how to use concurrency to help with WAN latency when using synchronous clusters.

WAN Latency Problem

Our customers often ask us for help or advice with WAN clustering problems. Historically, the usual solution for MySQL WAN deployments is having the primary site in one data center, and a standby backup site in another data center (replicating from the primary asynchronously). These days, however, there is a huge desire to employ available synchronous replication solutions for MySQL. These solutions include things like Galera (i.e., Percona XtraDB Cluster) or the recently released MySQL Group Replication. This trend is attributable to the fact that these solutions are less problematic and provide more automated failover and failback procedures. But it’s also because businesses want to write in both data centers simultaneously.

Unfortunately, WAN link reliability and latency make synchronous replication a big challenge. In many cases, these challenges force geographically separate data centers to still replicate asynchronously.

From a requirements point of view, the Galera founders’ official documentation has WAN-related recommendations and some dedicated options (like segments), as described in Jay’s blog post. So WAN deployments are absolutely possible, and even an advertised option, in Galera. The MySQL Group Replication team, however, seems to discourage such use cases, as we can read:

Group Replication is designed to be deployed in a cluster environment where server instances are very close to each other, and is impacted by both network latency as well as network bandwidth.

Remedy

While perhaps obvious to some, I would like to point out a simple dependency that might be a viable solution in some deployments facing significant network latency. That solution is concurrency! When you face limited write throughput due to transaction commit latency, you can employ more writer threads. By using separate connections to MySQL, you can commit more transactions at the same time overall.
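The dependency is easy to see with a back-of-the-envelope model (a sketch that counts only the per-commit WAN round trip and ignores all server-side work): each synchronous commit costs at least one round trip, and N independent connections overlap their round trips.

```python
def wall_clock_seconds(transactions, threads, rtt_seconds):
    """Lower-bound wall-clock time to commit `transactions` using
    `threads` parallel connections, counting only commit latency."""
    return (transactions / threads) * rtt_seconds

# 500 single-row insert transactions over a 100 ms link:
for threads in (1, 4, 8, 16):
    print(threads, wall_clock_seconds(500, threads, 0.100))
```

With one thread the model predicts 50 seconds, and with four threads 12.5 seconds, reasonably close to the 51.671 and 13.936 seconds measured in the tests below.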

Let me demonstrate with example results based on a very simple test case. I tested both Percona XtraDB Cluster (with Galera replication) and MySQL Group Replication. I configured a minimal cluster of three nodes in each case, running as Docker containers on the same host (simulating a WAN network). For this setup, latency is around 0.05ms on average. Then, I introduced an artificial network latency of 50ms and 100ms on one of the nodes’ network interfaces. I later repeated the same tests using VirtualBox VM instances running on a completely different server; the results were very similar. The command to simulate additional network latency is:

# tc qdisc add dev eth0 root netem delay 50ms

The added delay is then visible in the ping to the other nodes in the cluster:

# ping -c 2 172.17.0.2
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=50.0 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=64 time=50.1 ms

The test is very simple: execute 500 small insert transactions, each inserting just a single row (but that is less relevant now).

For testing, I used a simple mysqlslap command:

mysqlslap --password=*** --host=$IP --user=root --delimiter=";" --number-of-queries=500 --create-schema=test --concurrency=$i --query="insert into t1 set a='fooBa'"

and a simple single table:

CREATE TABLE `t1` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`a` char(5) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB

Interestingly, without increased latency, the same test takes much longer against the Group Replication cluster, even though by default Group Replication works with group_replication_single_primary_mode enabled and group_replication_enforce_update_everywhere_checks disabled. Theoretically, it should be a lighter operation from a “data consistency checks” point of view. Also, with WAN-type latencies, Percona XtraDB Cluster seems to be slightly faster in this particular test. Here are the test results for the three different network latencies:

XtraDB Cluster latency/seconds

Threads   100ms    50ms     0.05ms
1         51.671   25.583   0.268
4         13.936   8.359    0.187
8         7.84     4.18     0.146
16        4.641    2.353    0.13
32        2.33     1.16     0.122
64        1.808    0.925    0.098

Group Replication (GR) latency/seconds

Threads   100ms    50ms     0.05ms
1         55.513   29.339   5.059
4         14.889   7.916    2.184
8         7.673    4.195    1.294
16        4.52     2.507    0.767
32        3.417    1.479    0.473
64        2.099    0.809    0.267

WAN latency

I used the same InnoDB settings for both clusters, each node under a separate Docker container or VirtualBox VM. Similar tests could yield quite different results on real production systems, where more CPU cores provide better multi-concurrency conditions.

My idea wasn’t to benchmark Galera versus Group Replication, but rather to show that the same concurrency-to-write-throughput dependency applies to both technologies. I might be missing some tuning on the Group Replication side, so I don’t claim any verified winner here.

Just to provide some more details, I was using Percona XtraDB Cluster 5.7.16 and MySQL with Group Replication 5.7.17.

One important note: when you expect higher concurrency to provide better throughput, you must make sure the concurrency is not limited by server settings. For example, you must look at innodb_thread_concurrency (I used 0, so unlimited), slave_parallel_workers for GR and wsrep_slave_threads for Galera (among others related to IO operations, etc.).

Apart from “concurrency tuning,” which could involve application changes if not an architectural re-design, there are of course more possible optimizations for WAN environments. For example:

https://www.percona.com/blog/2016/03/14/percona-xtradb-cluster-in-a-high-latency-network-environment/ (to deal with latency)

and:

https://www.percona.com/blog/2013/12/19/automatic-replication-relaying-galera-3/

for saving/minimizing network utilization using binlog_row_image=minimal and other variables.

But these are out of the scope of this post. I hope this simple post helps you deal with the speed of light better!  😉

by Przemysław Malkowski at February 01, 2017 09:58 PM

Percona Server for MySQL 5.5.54-38.6 is now available

Percona Server for MySQL 5.5.54-38.6

Percona announces the release of Percona Server for MySQL 5.5.54-38.6 on February 1, 2017. Based on MySQL 5.5.54, including all the bug fixes in it, Percona Server for MySQL 5.5.54-38.6 is now the current stable release in the 5.5 series.

Percona Server for MySQL is open-source and free. You can find release details in the 5.5.54-38.6 milestone on Launchpad. Downloads are available here and from the Percona Software Repositories.

Bugs Fixed:
  • Fixed new compilation warnings with GCC 6. Bugs fixed #1641612 and #1644183.
  • CONCURRENT_CONNECTIONS column in the USER_STATISTICS table was showing incorrect values. Bug fixed #728082.
  • Audit Log Plugin when set to JSON format was not escaping characters properly. Bug fixed #1548745.
  • mysqld_safe now limits the use of rm and chown to avoid privilege escalation. chown can now be used only for the /var/log directory. Bug fixed #1660265.

Other bugs fixed: #1638897, #1644174, #1644547, and #1644558.

[UPDATE 2017-02-02]: New packages have been pushed to the repositories with an incremented package version to address bug #1661123.

Find the release notes for Percona Server for MySQL 5.5.54-38.6 in our online documentation. Report bugs on the launchpad bug tracker.

by Hrvoje Matijakovic at February 01, 2017 08:46 PM

Shlomi Noach

orchestrator Puppet module now available

We have just open sourced and published an orchestrator puppet module. This module is authored by Tom Krouper of GitHub's database infrastructure team, and is what we use internally at GitHub for deploying orchestrator.

The module manages the orchestrator service, the config file (inherit to override values), etc (pun intended). Check it out!

 

 

by shlomi at February 01, 2017 11:19 AM

Jean-Jerome Schmidt

Video: An Overview of the Features & Functions of ClusterControl

The video below demonstrates the top features and functions included in ClusterControl.  

ClusterControl is an all-inclusive database management system that lets you easily deploy, monitor, manage and scale highly available open source databases on-premise or in the cloud.

Included in this presentation are…

  • Deploying MySQL, MongoDB & PostgreSQL nodes and clusters
  • Overview of the monitoring dashboard
  • Individual node or cluster monitoring
  • Query monitor system
  • Creating and restoring immediate and scheduled backups
  • Configuration management
  • Developer Studio introduction
  • Reviewing log files
  • Scaling database clusters

by Severalnines at February 01, 2017 08:32 AM

January 31, 2017

Valeriy Kravchuk

Profiling MyRocks with perf: Good Old Bug #68079 Use Case

Almost a year ago I got really interested in MyRocks and built Facebook’s MySQL, which provides it, from source. Since that time I have built it from fresh sources a few times per week (as I’ve described in that post) and once in a while I try to work with it and study some details or use cases. Today I’d like to discuss one of them, which I’ve recently studied with the perf profiler.

This is not only because I am going to talk about applying profilers to all kinds and forks of MySQL at the FOSDEM 2017 MySQL & Friends Devroom this week. It seems profilers are the only way to study or troubleshoot MyRocks performance problems (if any) at the moment (when the provided status variables, SHOW ENGINE ROCKSDB STATUS output and so on are not enough), as Facebook’s MySQL does NOT have Performance Schema compiled in by default:

mysql> show engines;
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
| Engine     | Support | Comment                                                        | Transactions | XA   | Savepoints |
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
| ROCKSDB    | DEFAULT | RocksDB storage engine                                         | YES          | YES  | YES        |
| MRG_MYISAM | YES     | Collection of identical MyISAM tables                          | NO           | NO   | NO         |
| MyISAM     | YES     | MyISAM storage engine                                          | NO           | NO   | NO         |
| BLACKHOLE  | YES     | /dev/null storage engine (anything you write to it disappears) | NO           | NO   | NO         |
| CSV        | YES     | CSV storage engine                                             | NO           | NO   | NO         |
| MEMORY     | YES     | Hash based, stored in memory, useful for temporary tables      | NO           | NO   | NO         |
| ARCHIVE    | YES     | Archive storage engine                                         | NO           | NO   | NO         |
| FEDERATED  | NO      | Federated MySQL storage engine                                 | NULL         | NULL | NULL       |
| InnoDB     | NO      | Supports transactions, row-level locking, and foreign keys     | NULL         | NULL | NULL       |
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
9 rows in set (0.00 sec)

This may change when MyRocks builds from Percona or MariaDB appear ready for common use, but for now perf is the way to go, and I like it!

I do not remember exactly how I came up with the idea to compare MyRocks scalability to InnoDB scalability on my old QuadXeon box (now running Fedora 25):
[openxs@fc23 mysql-server]$ pt-summary
# Percona Toolkit System Summary Report ######################
        Date | 2017-01-30 11:32:53 UTC (local TZ: EET +0200)
    Hostname | fc23
      Uptime |  4:02,  3 users,  load average: 3.47, 2.12, 1.27
    Platform | Linux
     Release | Fedora release 25 (Twenty Five)
      Kernel | 4.9.5-200.fc25.x86_64

Architecture | CPU = 64-bit, OS = 64-bit
   Threading | NPTL 2.24
     SELinux | Enforcing
 Virtualized | No virtualization detected
# Processor ##################################################
  Processors | physical = 1, cores = 4, virtual = 4, hyperthreading = no
      Speeds | 4x2499.000
      Models | 4xIntel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
      Caches | 4x2048 KB

# Memory #####################################################
       Total | 7.8G
        Free | 2.9G
        Used | physical = 1.0G, swap allocated = 0.0, swap used = 0.0, virtual = 1.0G
      Shared | 176.0M
     Buffers | 3.9G
      Caches | 6.2G
       Dirty | 208 kB
     UsedRSS | 2.5G
  Swappiness | 60
 DirtyPolicy | 20, 10
 DirtyStatus | 0, 0
...
for the famous use case presented in Bug #68079. The scalability problems identified there caused a lot of work by Oracle engineers, and a lot of changes in MySQL 5.6.x and 5.7.x (so notable that MariaDB decided to use only those from 5.7 in the upcoming MariaDB 10.2, see MDEV-10476). I've mentioned this bug in probably every public talk I've given about MySQL since 2013, so one day I decided to check how the fix helped recent InnoDB, and then compare to MyRocks performance with all default settings on the same hardware for different numbers of concurrent threads.

So, one day in April 2016 I took a dump of the tables involved from the bug, replaced InnoDB with ROCKSDB there, added explicit CHARSET=utf8 COLLATE=utf8_bin clauses, and loaded the dump into Facebook's MySQL 5.6 built from source. I ended up with the following:
mysql> show create table task\G
*************************** 1. row ***************************
       Table: task
Create Table: CREATE TABLE `task` (
  `sys_id` char(32) COLLATE utf8_bin NOT NULL DEFAULT '',
  `u_root_cause` char(32) COLLATE utf8_bin DEFAULT NULL,
  `u_business_impact_description` mediumtext COLLATE utf8_bin,
  `u_business_impact_category` mediumtext COLLATE utf8_bin,
  PRIMARY KEY (`sys_id`)
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)

mysql> show create table incident\G
*************************** 1. row ***************************
       Table: incident
Create Table: CREATE TABLE `incident` (
  `sys_id` char(32) COLLATE utf8_bin NOT NULL DEFAULT '',
  `category` varchar(40) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`sys_id`),
  KEY `incident_category` (`category`)
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)

mysql> show table status like 'task'\G
*************************** 1. row ***************************
           Name: task
         Engine: ROCKSDB
        Version: 10
     Row_format: Dynamic
           Rows: 8292
 Avg_row_length: 69
    Data_length: 573991
...
1 row in set (0.00 sec)

mysql> show table status like 'incident'\G
*************************** 1. row ***************************
           Name: incident
         Engine: ROCKSDB
        Version: 10
     Row_format: Dynamic
           Rows: 8192
 Avg_row_length: 87
    Data_length: 719091
Max_data_length: 0
   Index_length: 956624
...
1 row in set (0.00 sec)
The tables are simple and small. Then I tried to study how well this query scales as the number of threads running it increases:
mysql> explain select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category;
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
| id | select_type | table    | type   | possible_keys             | key               | key_len | ref                  | rows | Extra       |
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
|  1 | SIMPLE      | incident | index  | PRIMARY,incident_category | incident_category | 123     | NULL                 | 8192 | Using index |
|  1 | SIMPLE      | task     | eq_ref | PRIMARY                   | PRIMARY           | 96      | test.incident.sys_id |    1 | Using index |
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
2 rows in set (0.00 sec)
For this I've used mysqlslap and commands like the following:
[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=1 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
with different --concurrency values: 1, 2, 4, 8, 10, 16, 32 (on a box with 4 cores). Before the fix, InnoDB had scalability problems even at a concurrency equal to the number of cores. After the fix it scaled way better in this test.
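mysqlslap prints the average number of seconds to run all --number-of-queries statements per iteration; turning that into a queries-per-second figure is a one-line calculation. A minimal sketch (the sample timings below are invented for illustration, not the actual results):

```python
# Convert mysqlslap timing output into a throughput figure. mysqlslap's
# "Average number of seconds to run all queries" is the time for one
# iteration of --number-of-queries statements, so:
def qps(number_of_queries, avg_seconds):
    """Queries per second for one mysqlslap iteration."""
    return number_of_queries / avg_seconds

# (concurrency, average seconds per 1000-query iteration) -- invented numbers
runs = [(1, 9.8), (2, 5.1), (4, 2.9), (8, 3.0), (16, 3.2), (32, 3.5)]
for concurrency, avg_seconds in runs:
    print(concurrency, round(qps(1000, avg_seconds), 1))
```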

You can find the summary of results in this Google spreadsheet, but here is a chart from it:


The X axis has a log scale and shows the number of concurrent threads. The Y axis is the average number of the problematic queries executed per second. "inner" and "straight" refer to the INNER JOIN vs STRAIGHT_JOIN variants of the query presented above.

"MyRocks" on this chart means ROCKSDB in MySQL from Facebook built from source corresponding to the following commit:
[openxs@fc23 mysql-5.6]$ git log -1
commit 6b4ba98698182868800f31e0d303f1e16026efdf
...
Date:   Sat Jan 28 13:51:47 2017 -0800
while "InnoDB 5.7" means InnoDB from Oracle MySQL 5.7.17, the following commit specifically:
[openxs@fc23 mysql-server]$ git log -1
commit 23032807537d8dd8ee4ec1c4d40f0633cd4e12f9
...
Date:   Mon Nov 28 16:48:20 2016 +0530
For this post, I wonder why the difference in favor of the "straight" query is so big starting from 4 concurrent threads in the case of MyRocks. I've profiled both queries running via mysqlslap as follows:
[openxs@fc23 ~]$ perf --version
perf version 4.9.6.200.fc25.x86_64.g51a0
[openxs@fc23 ~]$ sudo perf record -a
[sudo] password for openxs:
^C[ perf record: Woken up 335 times to write data ]
[ perf record: Captured and wrote 84.913 MB perf.data (1825548 samples) ]
The profiler was started, then the mysqlslap command was executed, and Ctrl-C was pressed when it produced the results.

The result for STRAIGHT_JOIN queries was the following (everything down to 1% of "overhead"):
[openxs@fc23 ~]$ sudo perf report --stdio | more
...
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:p'
# Event count (approx.): 62748127726390
#
# Overhead  Command          Shared Object                     Symbol...
     5.32%  my-oneconnectio  libc-2.24.so                      [.] __memcmp_sse4_1
     4.68%  my-oneconnectio  libc-2.24.so                      [.] __memcpy_ssse3
     4.04%  my-oneconnectio  mysqld                            [.] rocksdb::BlockIter::Seek
     3.33%  my-oneconnectio  mysqld                            [.] my_strnxfrm_unicode
     3.11%  my-oneconnectio  mysqld                            [.] rocksdb::BlockBasedTable::Get
     2.96%  my-oneconnectio  mysqld                            [.] myrocks::Rdb_pk_comparator::Compare
     2.54%  my-oneconnectio  mysqld                            [.] rocksdb::BlockIter::BinarySeek
     2.14%  my-oneconnectio  mysqld                            [.] rocksdb::InternalKeyComparator::Compare
     2.00%  my-oneconnectio  mysqld                            [.] rocksdb::StatisticsImpl::recordTick
     1.99%  my-oneconnectio  mysqld                            [.] rocksdb::MergingIterator::Next
     1.95%  my-oneconnectio  mysqld                            [.] rocksdb::HistogramStat::Add
     1.51%  my-oneconnectio  mysqld                            [.] join_read_key
     1.51%  my-oneconnectio  mysqld                            [.] myrocks::rdb_unpack_utf8_str
     1.41%  my-oneconnectio  mysqld                            [.] myrocks::Rdb_key_def::unpack_record
     1.36%  my-oneconnectio  mysqld                            [.] sub_select
     1.32%  my-oneconnectio  mysqld                            [.] my_uni_utf8
     1.29%  my-oneconnectio  mysqld                            [.] rocksdb::Version::Get
     1.22%  my-oneconnectio  libc-2.24.so                      [.] _int_malloc
     1.22%  my-oneconnectio  mysqld                            [.] rocksdb::TableCache::Get
     1.22%  my-oneconnectio  mysqld                            [.] rocksdb::BlockIter::Next
     1.21%  my-oneconnectio  mysqld                            [.] rocksdb::ThreadLocalPtr::Get
     1.18%  my-oneconnectio  mysqld                            [.] rocksdb::DBIter::FindNextUserEntryInternal
     1.18%  my-oneconnectio  mysqld                            [.] rocksdb::(anonymous namespace)::FilePicker::GetNextFile
     1.12%  my-oneconnectio  libc-2.24.so                      [.] malloc
     1.07%  my-oneconnectio  mysqld                            [.] evaluate_join_record
     1.07%  my-oneconnectio  mysqld                            [.] myrocks::Rdb_key_def::pack_index_tuple
     1.01%  my-oneconnectio  mysqld                            [.] key_restore
The result for INNER JOIN queries was the following (everything down to 1% of "overhead"):
[openxs@fc23 ~]$ sudo perf report --stdio | more
...
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:p'
# Event count (approx.): 162704428520300
#
# Overhead  Command          Shared Object                  Symbol...
#
     5.90%  my-oneconnectio  libc-2.24.so                   [.] __memcpy_ssse3
     4.38%  my-oneconnectio  libpthread-2.24.so             [.] pthread_mutex_lock
     3.69%  my-oneconnectio  libc-2.24.so                   [.] __memcmp_sse4_1
     3.58%  my-oneconnectio  mysqld                         [.] rocksdb::BlockIter::Seek
     2.53%  my-oneconnectio  libpthread-2.24.so             [.] pthread_mutex_unlock
     2.47%  my-oneconnectio  mysqld                         [.] rocksdb::StatisticsImpl::recordTick
     2.35%  my-oneconnectio  mysqld                         [.] rocksdb::BlockBasedTable::Get
     2.28%  my-oneconnectio  mysqld                         [.] my_strnxfrm_unicode
     2.08%  my-oneconnectio  mysqld                         [.] rocksdb::BlockIter::BinarySeek
     1.84%  my-oneconnectio  mysqld                         [.] rocksdb::HistogramStat::Add
     1.84%  my-oneconnectio  mysqld                         [.] rocksdb::Version::Get
     1.71%  my-oneconnectio  mysqld                         [.] myrocks::rdb_unpack_binary_or_utf8_varchar_space_pad
     1.69%  my-oneconnectio  mysqld                         [.] rocksdb::LRUCacheShard::Lookup
     1.67%  my-oneconnectio  mysqld                         [.] my_uni_utf8
     1.61%  my-oneconnectio  mysqld                         [.] rocksdb::InternalKeyComparator::Compare
     1.52%  my-oneconnectio  mysqld                         [.] myrocks::Rdb_pk_comparator::Compare
     1.47%  my-oneconnectio  mysqld                         [.] rocksdb::(anonymous namespace)::FilePicker::GetNextFile
     1.23%  my-oneconnectio  mysqld                         [.] rocksdb::BlockIter::Next
     1.15%  my-oneconnectio  mysqld                         [.] rocksdb::MergingIterator::Next
     1.10%  my-oneconnectio  mysqld                         [.] join_read_key
     1.09%  my-oneconnectio  mysqld                         [.] rocksdb::Block::NewIterator
     1.04%  my-oneconnectio  mysqld                         [.] rocksdb::TableCache::Get
I am not an expert in MyRocks at all, so I will not even try to interpret these results (maybe in some later post, one day). I've just highlighted what I consider a real difference in the second (INNER JOIN) case. Note the pthread_mutex_lock/pthread_mutex_unlock calls, for example, that are NOT present in the "more than 1%" output for the STRAIGHT_JOIN case.
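One way to make such comparisons systematic is to diff the two flat profiles by symbol. A rough sketch (the sample lines are abbreviated versions of the outputs above):

```python
# Compare two "perf report --stdio" flat profiles and list symbols that
# show up above the threshold in one profile but not in the other.
def top_symbols(report, threshold=1.0):
    """Map symbol -> overhead % for lines at or above the threshold."""
    symbols = {}
    for line in report.strip().splitlines():
        parts = line.split()
        if not parts or not parts[0].endswith('%'):
            continue  # skip headers and comments
        overhead = float(parts[0].rstrip('%'))
        if overhead >= threshold:
            symbols[parts[-1]] = overhead
    return symbols

straight = """\
 5.32%  my-oneconnectio  libc-2.24.so  [.] __memcmp_sse4_1
 4.04%  my-oneconnectio  mysqld        [.] rocksdb::BlockIter::Seek
"""
inner = """\
 5.90%  my-oneconnectio  libc-2.24.so        [.] __memcpy_ssse3
 4.38%  my-oneconnectio  libpthread-2.24.so  [.] pthread_mutex_lock
 3.58%  my-oneconnectio  mysqld              [.] rocksdb::BlockIter::Seek
"""
only_inner = set(top_symbols(inner)) - set(top_symbols(straight))
print(sorted(only_inner))  # pthread_mutex_lock stands out
```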

I'd be happy to get any comments on the above. I also plan to share more raw data from perf some day, when I decide where to put it for public access (the text outputs are megabytes in size).

I've captured profiles with callgraphs as well:
[openxs@fc23 ~]$ sudo perf record -ag
^C[ perf record: Woken up 565 times to write data ]
[ perf record: Captured and wrote 142.410 MB perf.data (644176 samples) ]
Again, the full trees are huge and require a real expert to analyze properly.

The new perf report option (-g flat, available since kernel 4.4) gave nice backtraces like this:
           40.20%
                handle_one_connection
                do_handle_one_connection
                dispatch_command
                mysql_parse
                mysql_execute_command
                execute_sqlcom_select
                handle_select
                mysql_select
                JOIN::exec
                sub_select
                evaluate_join_record
                sub_select
                join_read_key
                myrocks::ha_rocksdb::index_read_map_impl
                myrocks::ha_rocksdb::get_row_by_rowid
                myrocks::Rdb_transaction_impl::get
                rocksdb::TransactionBaseImpl::Get
                rocksdb::WriteBatchWithIndex::GetFromBatchAndDB
                rocksdb::DBImpl::Get
                rocksdb::DBImpl::GetImpl
                rocksdb::Version::Get
that show how data is read in MyRocks. You can get similar backtraces with pt-pmp as well, but you'll see only how many times a specific backtrace was captured, not the related overhead. Also, the performance impact of pt-pmp would be very noticeable in this case.
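For contrast, the aggregation pt-pmp performs boils down to counting identical backtraces. A toy illustration of that idea (the stacks are invented for the example):

```python
from collections import Counter

# pt-pmp-style aggregation: collapse identical backtraces and count how many
# times each one was sampled. Note there is no notion of CPU overhead here,
# only occurrence counts.
stacks = [
    ("handle_one_connection", "sub_select", "join_read_key"),
    ("handle_one_connection", "sub_select", "join_read_key"),
    ("handle_one_connection", "mysql_parse"),
]
counts = Counter(stacks)
for stack, count in counts.most_common():
    print(count, ",".join(stack))
```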

Time to stop, it seems; the post is already long. One day I'll continue, as there are many phenomena highlighted by this very simple, but famous, use case. Stay tuned!

by Valeriy Kravchuk (noreply@blogger.com) at January 31, 2017 07:03 PM

Peter Zaitsev

Docker Security Vulnerability CVE-2016-9962


Docker 1.12.6 was released to address CVE-2016-9962, a serious vulnerability in RunC.

Quoting the coreos page (linked above):

“RunC allowed additional container processes via runc exec to be ptraced by the pid 1 of the container. This allows the main processes of the container, if running as root, to gain access to file-descriptors of these new processes during the initialization and can lead to container escapes or modification of runC state before the process is fully placed inside the container.”

In short, IF processes run as root inside a container they could potentially break out of the container and gain access to the host.

My recommendation at this time is to apply the same basic security tenets for containers as you would (I hope) for VM and bare-metal installs. In other words, ensure you are adhering to the Principle of Least Privilege as a best practice, and not running as root for convenience's sake.

We made changes to PMM prior to version 1.0.4 to reduce the number of processes within the container that run as root. As such, only the processes that require it run as root. All other processes run as a lower-privileged user.

Check here for documentation on PMM, and use the JIRA project to raise bugs (JIRA requires registration).

To comment on running a database within Docker, I've reviewed the following images:

  • percona-server image: I have verified it does not run as root, and runs as a mysql user (for 5.7.16 at least)
  • percona-server-mongodb: I have worked with our teams internally and can confirm that the latest image no longer runs as root (you will need to run the latest image, however, to see this change via docker pull)

Please comment below with any questions.

by David Busby at January 31, 2017 05:39 PM

MariaDB Foundation

2017 Developers Conference, New York

The 2017 MariaDB Developers Conference is crossing the ocean this year, and will be taking place in New York, from 9 to 10 April. The meetup will last for two days and you can join for the whole time, or as little time as you wish. The schedule of this unconference will be drafted in […]

The post 2017 Developers Conference, New York appeared first on MariaDB.org.

by Ian Gilfillan at January 31, 2017 11:46 AM

Jean-Jerome Schmidt

How to automate & manage MySQL (Replication) & MongoDB with ClusterControl - live webinar

Join us next Tuesday, February 7th 2017, as Johan Andersson, CTO at Severalnines, unveils the new ClusterControl 1.4 in a live demo webinar.

ClusterControl reduces complexity of managing your database infrastructure while adding support for new technologies; enabling you to truly automate multiple environments for next-level applications. This latest release further builds out the functionality of ClusterControl to allow you to manage and secure your 24/7, mission critical infrastructures.

In this live webinar, Johan will demonstrate how ClusterControl increases your efficiency by giving you a single interface to deploy and operate your databases, instead of searching for and cobbling together a combination of open source tools, utilities and scripts that need constant updates and maintenance. Watch as ClusterControl demystifies the complexity associated with database high availability, load balancing, recovery and your other everyday struggles.

To put it simply: learn how to be a database hero with ClusterControl!

Date, Time & Registration

Europe/MEA/APAC

Tuesday, February 7th at 09:00 GMT (UK) / 10:00 CET (Germany, France, Sweden)

Register Now

North America/LatAm

Tuesday, February 7th at 9:00 Pacific Time (US) / 12:00 Eastern Time (US)

Register Now

Agenda

  • ClusterControl (1.4) Overview
  • ‘Always on Databases’ with enhanced MySQL Replication functions
  • ‘Safer NoSQL’ with MongoDB and larger sharded cluster deployments
  • ‘Enabling the DBA’ with ProxySQL, HAProxy and MaxScale
  • Backing up your open source databases
  • Live Demo
  • Q&A

Speaker

Johan Andersson, CTO, Severalnines

Johan's technical background and interest are in high performance computing as demonstrated by the work he did on main-memory clustered databases at Ericsson as well as his research on parallel Java Virtual Machines at Trinity College Dublin in Ireland. Prior to co-founding Severalnines, Johan was Principal Consultant and lead of the MySQL Clustering & High Availability consulting group at MySQL / Sun Microsystems / Oracle, where he designed and implemented large-scale MySQL systems for key customers. Johan is a regular speaker at MySQL User Conferences as well as other high profile community gatherings with popular talks and tutorials around architecting and tuning MySQL Clusters.

We look forward to “seeing” you there and to insightful discussions!

If you have any questions or would like a personalised live demo, please do contact us.

by Severalnines at January 31, 2017 08:40 AM

January 30, 2017

Peter Zaitsev

MySQL Sharding Models for SaaS Applications


In this blog post, I'll discuss MySQL sharding models, and how they apply to SaaS application environments.

MySQL is one of the most popular database technologies used to build many modern SaaS applications, ranging from simple productivity tools to business-critical applications for the financial and healthcare industries.

Pretty much any large scale SaaS application powered by MySQL uses sharding to scale. In this blog post, we will discuss sharding choices as they apply to these kinds of applications.

In MySQL, unlike in some more modern technologies such as MongoDB, there is no standard sharding implementation that the vast majority of applications use. In fact, if anything “no standard” is the standard. The common practice is to roll your own sharding framework, as famous MySQL deployments such as Facebook and Twitter have done. MySQL Cluster – the MySQL software that has built-in Automatic Sharding functionality – is rarely deployed (for a variety of reasons). MySQL Fabric, which has been the official sharding framework, has no traction either.

When sharding today, you have a choice of rolling your own system from scratch, using a comprehensive sharding platform such as Vitess, or using a proxy solution to assist you with sharding. For proxy solutions, MySQL Router is the official option. But in reality, third-party solutions such as the open source ProxySQL, the commercial ScaleArc and the semi-commercial (BSL) MariaDB MaxScale are widely used. Keep in mind, however, that traffic routing is only one of the problems that exist in large-scale sharding implementations.

Beneath all these "front end" choices for sharding on the application database connection framework or database proxy, there are some lower-level decisions that you've got to make, namely around how your data is going to be laid out and organized on the MySQL nodes.

When it comes to SaaS applications, at least one answer is simple. It typically makes sense to shard your data by "customer" or "organization" using some sort of mapping tables. In the vast majority of cases, a single node (or a replicated cluster) should be powerful enough to handle all the data and load coming from each customer.
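A minimal sketch of such a mapping-table lookup (all ids, shard names and DSNs here are hypothetical):

```python
# "Shard by customer" routing via a mapping table. In production the mapping
# would live in a small, heavily cached MySQL table, e.g.
# customer_shard(customer_id, shard_id); a plain dict stands in for it here.
shard_map = {101: "shard-us-1", 102: "shard-us-1", 203: "shard-eu-1"}
shard_dsn = {
    "shard-us-1": "mysql://app@us1.db.example.com/saas",
    "shard-eu-1": "mysql://app@eu1.db.example.com/saas",
}

def route(customer_id):
    """Return the DSN of the MySQL node holding this customer's data."""
    return shard_dsn[shard_map[customer_id]]

print(route(203))
```

Since the mapping is explicit (rather than a hash function), individual customers can later be moved to dedicated shards by updating one row.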

What Should I Ask Myself Now?

The next set of questions you should ask yourself concern your SaaS application:

  • How much revenue per customer are you generating?
  • Do your customers (or regulations) require data segregation?
  • Are all the customers about the same, or are there outliers?
  • Are all your customers running the same database schema?

I address the answers in the sections below.

How Much Revenue?

How much revenue per customer you're generating is an important number. It defines how much infrastructure cost per customer you can afford. In the case of "freemium" models, with customers generating less than $1 a month on average, you might need to ensure low overhead per customer (even if you have to compromise on customer isolation).

Typically with low revenue customers, you have to co-locate the data inside the same MySQL instance (potentially even same tables). In the case of high revenue customers, isolation in separate MySQL instances (or even containers or virtualized OS instances) might be possible.

Data Segregation?

Isolation is another important area of consideration. Some enterprise customers might require that their data is physically separate from others. There could also be government regulations in play that require customer data to be stored in a specific physical location. If this is the case, you’re looking at completely dedicated customer environments. Or at the very least, separate database instances (which come with additional costs).

Customer Types?

Customer size and requirements are also important. A system designed to handle all customers of approximately the same scale (for example, personal accounting) is going to be different than if you are in the business of blog hosting. Some blogs might be 10,000 times more popular than the average.

Same Database Schema?

Finally, there is the big question of whether all your customers are running the same database schema and the same software version. If you want to support different software versions (if your customers require a negotiated maintenance window for software upgrades, for example) or different database schemas (if the schema depends on the custom functionality and modules customers use, for example), keeping such customers in different MySQL schemas makes sense.

Sharding Models

This gets us to the following sharding isolation models, ranging from lowest to highest:

  • Customers Share Schemas. This is the best choice when you have very large numbers of low-revenue customers. In this case, you would map multiple customers to the same set of tables, and include something like a customer_id field in them to filter customer data. This approach minimizes customer overhead, but reduces customer isolation. It's harder to backup/restore data for individual customers, and it is easier to introduce coding mistakes that can access other customers' data. This method does not mean there is only one schema, but that there is a one-to-many relationship between schemas and customers. For example, you might have 100 schemas per MySQL instance, each handling 1,000 to 10,000 customers (depending on the application). Note that with a well-designed sharding implementation, you should be able to map customers individually to schemas. This allows you to have key customer data stored in dedicated schemas, or even on dedicated nodes.
  • Schema per Customer. This is probably the most common sharding approach in MySQL-powered SaaS applications, especially ones with substantial revenue ($10+ per month per customer). In this model, each customer's data is stored in its own schema (database). This makes it very easy to backup/restore individual customers. It allows customers to have different schemas (i.e., add custom tables). It also allows them to run different versions of the application if desired. This approach allows the application server to use different MySQL users connecting on behalf of different customers, which adds an extra level of protection from accidental (or intentional) access of data that belongs to different customers. The schema per customer approach also makes it easier to move the shards around, and limits maintenance impact. The downside of this approach is higher overhead. It also results in a large number of tables per instance, and potentially larger numbers of files (which can be hard to manage).
  • Database Instance per Customer. You achieve even better isolation by having a MySQL instance per customer. This approach, however, increases overhead even further. The recent rise of light virtualization technologies and containers has reduced its usage.
  • OS Instance/Container per Customer. This approach allows you to improve isolation even further. It can be used for any customer, but can also be applied to selected customers in a model that uses the Schema per Customer model for the majority of them. A dedicated OS instance, with improved isolation and better performance SLAs, might be a feature of some premium customer tiers. This method not only allows better isolation, but it also lets you handle outliers better. You might choose to run a majority of your customers on the hardware (or cloud instance) with the best price/performance numbers, and also place some of the larger customers on the highest performance nodes.
  • Environment per Customer. Finally, if you take this all the way, you can build completely separate environments for customers. This includes databases, application servers and other required components. This is especially useful if you need to deploy the application close to the customer – which includes the appliance model, or deployment in the customer's data center or cloud provider. It also allows you to accommodate customers whose data must be stored in a specific location, often due to government regulations. It is worth noting that many SaaS applications, even if they do not quite have one environment per customer, have multiple independent environments. These are often hosted in different locations or availability zones. Such setups allow you to reduce the impact of large-scale failures to only a portion of your customers. This avoids overloading your customer service group and allows the operational organization to focus on repairing smaller environments.

The farther you go down this route – from shared schema to an environment per customer – the more important it is to have a high level of automation. With a shared schema, you can often get by with little automation (with some environments set up manually and all the schemas pre-created). If customer sign-up requires setting up a dedicated database instance or a whole environment, a manual implementation doesn't scale. For this type of setup, you need state-of-the-art automation and orchestration.
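As a tiny illustration of what such automation has to produce for the "schema per customer" model, here is a sketch that generates the provisioning statements for a new sign-up (the naming conventions and GRANT set are invented; a real system would also create the tables and register the schema in the shard mapping):

```python
# Generate provisioning statements for a new customer in the
# "schema per customer" model. Names below are hypothetical conventions.
def provision_statements(customer_id):
    schema = "customer_%d" % customer_id
    user = "app_%d" % customer_id
    return [
        "CREATE DATABASE `%s`" % schema,
        "CREATE USER '%s'@'%%' IDENTIFIED BY '<password>'" % user,
        "GRANT ALL ON `%s`.* TO '%s'@'%%'" % (schema, user),
    ]

for stmt in provision_statements(42):
    print(stmt)
```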

Conclusion

I hope this helps you understand your options for MySQL sharding models. Each of the different sharding models for SaaS applications powered by MySQL has benefits and drawbacks. As you can see, many of these approaches require you to work with a large number of tables in MySQL – this will be the topic of one of my next posts!

by Peter Zaitsev at January 30, 2017 11:16 PM

MariaDB ColumnStore


Last month, MariaDB officially released MariaDB ColumnStore, their column store engine for MySQL. This post discusses what it is (and isn't), why it matters and how you can approach a test of it.

What is ColumnStore?

ColumnStore is a storage engine that turns traditional MySQL storage concepts on their head. Instead of storing the data by row, a column store stores the data by column (obviously). This provides advantages for certain types of data, and certain types of queries run against that data. See my previous post for more details on column-based storage systems.

ColumnStore is a fork of InfiniDB and carries forward many of the concepts behind that product. InfiniDB ceased operations in 2014. With the front end managed through MariaDB, you get access to all of the expected security and audit options of MariaDB. MariaDB designed ColumnStore as a massively parallel database, working best in an environment with multiple servers. This is somewhat different than a traditional row store database.

ColumnStore stores columnar data in a concept called an “extent.” An extent contains a range of values for a single column. Each extent contains no more than 8 million values. It stores additional values in a new extent. The extents for a single column get distributed across the database nodes, known as “Performance Modules” in ColumnStore. It stores each unique extent on more than one node, thus providing data redundancy and removing the need for replication. If a node is down, and it contains an extent needed for a query, that same extent is found on another node in the environment. This data redundancy also provides a high availability environment.

The query engine determines which extents process query requests. Since the data in an extent is often preordered (time series data, for example), many queries can ignore individual extents since they cannot contain any data needed for the query. If we are only looking for data from February 2017, for example, extents containing data outside of that range get ignored. However, if a query requires data from many or all extents on a single column, the query takes much longer to complete.
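The extent-elimination logic described above can be sketched in a few lines (the per-extent min/max boundaries are invented for illustration):

```python
from datetime import date

# Each extent keeps min/max metadata for its column values; a range query
# only has to touch extents whose [min, max] overlaps the requested range.
extents = [
    (date(2016, 11, 1), date(2016, 12, 31)),
    (date(2017, 1, 1), date(2017, 1, 31)),
    (date(2017, 2, 1), date(2017, 3, 15)),
]
lo, hi = date(2017, 2, 1), date(2017, 2, 28)  # "February 2017" query
touched = [i for i, (mn, mx) in enumerate(extents) if mn <= hi and mx >= lo]
print(touched)  # only the last extent can contain matching rows
```

This also shows why the technique pays off mainly for pre-ordered data: if every extent's [min, max] range spans the whole column, nothing can be skipped.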

Unlike some traditional column store vendors, that take an all or nothing approach to storage, MariaDB decided to go with a mixed concept. In a MariaDB MySQL database, you can mix traditional InnoDB storage with the new ColumnStore storage, just like you used to mix InnoDB and MyISAM. This presents some nice options, not the least of which is that it provides a way to “dip your toe” into the world of column stores. On the other hand, it could lead to people using the ColumnStore engine in non-optimal ways. Also, the differences in what is considered optimal architecture between these two storage options make it hard to see how this plays out in the real world.

Data Definition

As discussed in the earlier post, column storage works great for specific types of data and queries. It is important that your data definitions are as tight as possible, and that your queries are appropriate for column-based data.

Many people set their field definition as VARCHAR(256) when setting up a new database. They might not know what type of data gets stored in the new field. This broad definition allows you to store whatever you throw at the database. The negative effect for a row store is that it can cause over-allocation of storage, but it has only a minimal effect on queries.

In a column store, the field definition can drive decisions about the compression methods for storing the data, along with sorting implications. Columnar data can use storage more efficiently than a row store, since the data for a single column is well-defined. This leads to selecting the best compression algorithm for the data. If that data is poorly defined, the selected compression algorithm might not be the best for the data.

Sorting is also a problem in a column store when the data types are not well-defined. We’ve all seen integer or date data that is sorted alphabetically. While it can be annoying, we can still adjust to that sorting method to find what we need. Since a column store is often used to perform analytical queries over a range of data, this poorly-sorted data can present a bigger problem. If you specify a column to be VARCHAR and only include date information, that data is sorted alphabetically. The same column defined as DATE causes the data to be sorted by date. This chart shows the difference (date format is mm/dd/yy):

Alphabetic Sort   Date Sort
01/01/17          01/01/17
01/02/17          01/02/17
01/11/17          01/04/17
01/21/17          01/11/17
01/4/17           01/21/17
02/01/17          02/01/17
11/01/17          11/01/17
11/02/17          11/02/17

Imagine running a query over a range of dates (requesting all activity in the months of January and February 2017, for example). In the alphabetic sort, string comparison puts values in the wrong order: the unpadded 01/4/17 lands after 01/21/17, and as soon as the data spans more than one year, rows from different years interleave by month. The query cannot trust that it has seen all matching rows without working through the whole file. In the date sort, the query reads only until the end of February; we know there can be no more matching data after that. The alphabetic sort leads to more I/O, more query time and less happiness on the part of the user.
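The difference is easy to reproduce in a few lines. This is a quick illustration using Python's standard library, not ColumnStore code; note how string comparison misplaces the unpadded 01/4/17:

```python
# The same mm/dd/yy values sorted as strings (what a VARCHAR column gives
# you) versus as real dates.
from datetime import datetime

values = ["01/01/17", "01/11/17", "01/02/17", "01/21/17",
          "01/4/17", "11/01/17", "11/02/17", "02/01/17"]

alphabetic = sorted(values)                                    # string comparison
by_date = sorted(values, key=lambda v: datetime.strptime(v, "%m/%d/%y"))

print(alphabetic)  # '01/4/17' sorts after '01/21/17'
print(by_date)     # chronological order
```

With zero-padded values in a single year, the two orderings would happen to coincide; unpadded values or multiple years break the string ordering.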

Why Should You Care About ColumnStore?

The first reason is that it allows you to try out column storage without doing a massive shift in technology and with minimal effort. By setting up some tables in a MariaDB database to use the ColumnStore engine, you can benefit from the storage efficiencies and faster query capabilities, provided that the data you’ve selected for this purpose is sound. This means that the data definitions should be tight (always a good plan anyway), and the queries should be more analytical than transactional. For a purely transactional workflow, a row store is the logical choice. For a purely analytical workflow, a column store is the logical choice. ColumnStore allows you to easily mix the two storage options so that you can have the best match possible. It is still important to know what type of workflow you’re dealing with, and match the storage engine to that need.

Another solid reason is that it is a great fit if you are already doing analysis over massive amounts of data. Any column store shines when asked to look at relatively few columns of data (ideally the column or two that are being aggregated and other columns to locate and group the data). If you are already running these types of queries in MySQL, ColumnStore would likely be a good fit.

But There Be Dragons!

As with any new technology, ColumnStore might not be a good fit for everyone. Given that you can mix and match your storage engines, with ColumnStore for some tables and InnoDB for others, it can be tempting to just go ahead with a ColumnStore test doing things the same way you always did in the past. While this still yields results, those results might not be a true test of the technology. It’s like trying to drive your minivan the same way you used to drive your sports car. “Hey, my Alfa Romeo never flipped over taking these turns at high speed!”

To effectively use ColumnStore, it’s important to match it to a proper analytical workload. This means that you will likely do more bulk loading into these tables, since there is additional overhead in writing the data out into the column files. The overall workflow should be more read-intensive. The queries should only look for data from a small set of fields, but can span massive amounts of data within a single column. In my earlier post, there’s also a discussion about normalization of data and how denormalizing data is more common in columnar databases.

You should address these issues in your testing for a valid conclusion.

The minimum specifications for ColumnStore also point to a need for a more advanced infrastructure than is often seen for transactional data. This is to support batch loading, read intensive workloads and possibly different ETL processes for each type of data. In fact, MariaDB states in the installation documentation for ColumnStore that it must be completed as a new installation of MariaDB. You must remove any existing installations of MariaDB or MySQL before installing the ColumnStore-enabled RPM on a system.

Is It Right for Me?

ColumnStore might fit well into your organization. But like haggis, it’s not for everyone. If you need analytical queries, it is a great option. If your workload is more read-intensive, it could still work for you. As we move to a more Internet of Things (IoT) world, we’re likely to see a need for more of this type of query work. In order to accurately present each user with the best possible Internet experience, we might want to analyze their activities over spans of time and come up with the best match for future needs.

Seriously consider if making the move to ColumnStore is right for you. It is newer software (version 1.0.6, the first GA version, was released on December 14, 2016, and 1.0.7 was released on January 23, 2017), so it might go through changes as it matures. Though a new product, it is based on InfiniDB concepts (which are somewhat dated). MariaDB has lots of plans for additional integrations and support for ancillary products that are absent in the current release.

MariaDB took a huge step forward with ColumnStore. But do yourself a favor and consider whether it is right for you before testing it out. Also, make sure that you are not trying to force your current workflow into the column store box. Kids know that we cannot put a square peg in a round hole, but we adults often try to do just that. Finding the right peg saves you lots of time, hassle and annoyance.

by Rick Golba at January 30, 2017 05:29 PM

Jean-Jerome Schmidt

Automatic failover of MySQL Replication - New in ClusterControl 1.4

MySQL replication setups are inevitably related to failovers. Unlike multi-master clusters like Galera, there is one single writer in the whole setup: the master. If the master fails, one of the slaves has to take its role through the process of failover. Such a process is tricky and may potentially cause data loss. This may happen, for example, if a slave is not up to date when it is promoted. The master may also die before it is able to transfer all binlog events to at least one of its slaves.

Different people have different takes on how to perform failover. It depends on personal preferences but also on requirements of the business. There are two main options - automated failover or manual failover.

Automated failover comes in very handy if you want your environment to run 24x7, and to recover quickly from any failures. Unfortunately, this may come at a cost - in more complex failure scenarios, automated failover may not work correctly or, even worse, it may result in your data being messed up and partially missing (although one might argue that a human can also make disastrous mistakes leading to similar consequences). Those who prefer to keep close control over their database may choose to skip automated failover and use a manual process instead. Such a process takes more time, but it allows an experienced DBA to assess the state of a system and take corrective actions based on what happened.

ClusterControl already supports automated failover for master-master clusters like Galera and NDB Cluster. Now with 1.4, it also does this for MySQL replication. In this blog post, we’ll take you through the failover process, discussing how ClusterControl does it, and what can be configured by the user.

Configuring Automatic Failover

Failover in ClusterControl can be configured to be automatic or not. If you prefer to take care of failover manually, you can disable automated cluster recovery. By default, cluster recovery is enabled and automated failover is used. Once you make changes in the UI, make sure you also make them in the cmon configuration and set enable_cluster_autorecovery to ‘0’. Otherwise your settings will be overwritten when the cmon process is restarted.

Failover is initiated by ClusterControl when it detects that there is no host with the read_only flag disabled. This can happen because the master (which has read_only set to 0) is not available, or it can be triggered by a user or some external software that changed this flag on the master. If you make manual changes to the database nodes or have software that may fiddle with the read_only settings, then you should disable automatic failover.

Also, note that failover is attempted only once. Should a failover attempt fail, then no more attempts will be made until the controller is restarted.

At the beginning of the failover procedure, ClusterControl builds a list of slaves which can be promoted to master. Most of the time, it will contain all slaves in the topology but the user has some additional control over it. There are two variables you can set in the cmon configuration:

replication_failover_whitelist

and

replication_failover_blacklist

The first of them, when used, contains a list of IPs or hostnames of slaves which should be used as potential master candidates. If this variable is set, only those hosts will be considered.

The second variable may contain a list of hosts which will never be considered master candidates. You can use it to list slaves that are used for backups or analytical queries. If the hardware varies between slaves, you may want to put here the slaves which use slower hardware.

Replication_failover_whitelist takes precedence, meaning the replication_failover_blacklist is ignored if replication_failover_whitelist is set.
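The selection rule can be sketched as follows. This is a simplified model of the documented behavior with hypothetical names, not ClusterControl's actual code:

```python
# Candidate selection as described above: if a whitelist is set, only
# whitelisted hosts are considered and the blacklist is ignored; otherwise
# all slaves minus blacklisted ones are candidates.

def master_candidates(slaves, whitelist=None, blacklist=None):
    if whitelist:
        return [s for s in slaves if s in whitelist]
    blacklist = blacklist or []
    return [s for s in slaves if s not in blacklist]

slaves = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]

# A backup slave is blacklisted and excluded:
print(master_candidates(slaves, blacklist=["10.0.0.4"]))
# The whitelist takes precedence, so the blacklist entry is ignored:
print(master_candidates(slaves, whitelist=["10.0.0.3"], blacklist=["10.0.0.3"]))
```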

Once the list of slaves which may be promoted to master is ready, ClusterControl starts to compare their state, looking for the most up to date slave. Here, the handling of MariaDB and MySQL-based setups differs. For MariaDB setups, ClusterControl picks a slave which has the lowest replication lag of all slaves available. For MySQL setups, ClusterControl picks such a slave as well but then it checks for additional, missing transactions which could have been executed on some of the remaining slaves. If such a transaction is found, ClusterControl slaves the master candidate off that host in order to retrieve all missing transactions.

In case you’d like to skip this process and just use the most advanced slave, you can set the following setting in the cmon configuration:

replication_skip_apply_missing_txs=1

Such a process may result in a serious problem though: if an errant transaction is found, replication may break. What is an errant transaction? In short, it is a transaction that has been executed on a slave but that did not come from the master. It could have been, for example, executed locally. The problem is caused by the fact that, when using GTID, if a host which has such an errant transaction becomes a master, all slaves will ask for this missing transaction in order to be in sync with their new master. If the errant transaction happened way in the past, it may no longer be available in the binary logs. In that case, replication will break because the slaves won’t be able to retrieve the missing data. If you would like to learn more about errant transactions, we have a blog post covering this topic.

Of course, we don’t want to see replication breaking, therefore ClusterControl, by default, checks for any errant transactions before it promotes a master candidate to become a master. If such a problem is detected, the master switch is aborted and ClusterControl lets the user fix the problem manually. The blog post mentioned above explains how you can manually fix issues with errant transactions.
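The errant-transaction check can be sketched with simplified GTID sets. Real GTID sets use interval notation (uuid:1-100); plain per-UUID sets of sequence numbers keep the sketch short, and the function name is made up:

```python
# A simplified errant-transaction check: per server UUID, the set of executed
# sequence numbers. Anything the slave has executed that the master has not
# is errant (including the slave's own locally-executed transactions).

def errant_transactions(slave_executed, master_executed):
    """Return {uuid: extra_seqnos} present on the slave but not the master."""
    errant = {}
    for uuid, seqs in slave_executed.items():
        extra = seqs - master_executed.get(uuid, set())
        if extra:
            errant[uuid] = extra
    return errant

master = {"master-uuid": set(range(1, 101))}          # transactions 1..100
slave = {"master-uuid": set(range(1, 99)),            # replicated 1..98 so far
         "slave-uuid": {1}}                           # one local (errant) write

print(errant_transactions(slave, master))  # {'slave-uuid': {1}}
```

Lagging behind the master (here, transactions 99 and 100 not yet applied) is not errant; only the locally-originated transaction is flagged.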

If you want to be 100% certain that ClusterControl will promote a new master even if some issues are detected, you can do that using the replication_stop_on_error=0 setting in cmon configuration. Of course, as we discussed, it may lead to problems with replication - slaves may start asking for a binary log event which is not available anymore. To handle such cases we added experimental support for slave rebuilding. If you set replication_auto_rebuild_slave=1 in the cmon configuration and if your slave is marked as down with the following error in MySQL:

Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'

ClusterControl will attempt to rebuild the slave using data from the master. Such a setting may not always be appropriate as the rebuilding process will induce an increased load on the master. It may also be that your dataset is very large and a regular rebuild is not an option - that’s why this behavior is disabled by default.

Once we ensure that no errant transactions exist and we are good to go, there is still one more issue we need to handle somehow - it may happen that all slaves are lagging behind the master.

As you probably know, replication in MySQL works in a rather simple way. The master stores writes in its binary logs. The slave’s I/O thread connects to the master and pulls any binary log events it is missing. It then stores them in the form of relay logs. The SQL thread parses them and applies the events. Slave lag is a condition in which the SQL thread (or threads) cannot cope with the number of events and is unable to apply them as soon as they are pulled from the master by the I/O thread. Such a situation may happen no matter what type of replication you are using. Even if you use semi-sync replication, it can only guarantee that all events from the master are stored in the relay log on one of the slaves. It says nothing about applying those events on a slave.

The problem here is that, if a slave is promoted to master, relay logs will be wiped out. If a slave is lagging and hasn’t applied all transactions, it will lose data - events that are not yet applied from relay logs will be lost forever.

There is no one-size-fits-all way of solving this situation. ClusterControl gives users control over how it should be done, maintaining safe defaults. It is done in cmon configuration using the following setting:

replication_failover_wait_to_apply_timeout

By default it takes a value of ‘-1’, which means that failover won’t happen if a master candidate is lagging; ClusterControl will wait indefinitely for it to apply all missing transactions from its relay logs. This is safe, but if for some reason the most up-to-date slave is lagging badly, failover may take hours to complete. On the other side of the spectrum is setting it to ‘0’, which means that failover happens immediately, no matter whether the master candidate is lagging or not. You can also take the middle way and set it to some positive value: a time in seconds during which ClusterControl will wait for a master candidate to apply missing transactions from its relay logs. Failover happens after the defined time or when the master candidate catches up on replication, whichever happens first. This may be a good choice if your application has specific requirements regarding downtime and you have to elect a new master within a short time window.
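The three modes can be sketched like this. Names are hypothetical and this is a simplified model of the documented semantics, not ClusterControl's implementation:

```python
# replication_failover_wait_to_apply_timeout semantics: -1 waits indefinitely
# for the candidate to apply its relay logs, 0 fails over immediately,
# N > 0 waits up to N seconds before forcing the failover.
import time

def wait_for_catchup(candidate_lag_fn, timeout):
    """Return True once the candidate applied its relay logs, False if we
    gave up and will fail over with the candidate still lagging.
    candidate_lag_fn() returns the number of unapplied transactions."""
    start = time.monotonic()
    while candidate_lag_fn() > 0:
        if timeout == 0:
            return False                      # fail over immediately
        if timeout > 0 and time.monotonic() - start >= timeout:
            return False                      # window expired, force failover
        time.sleep(0.01)                      # timeout == -1: wait forever
    return True

backlog = iter([3, 2, 1, 0])                  # candidate drains its relay logs
print(wait_for_catchup(lambda: next(backlog), timeout=5))   # True
print(wait_for_catchup(lambda: 10, timeout=0))              # still lagging: False
```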

When using MySQL replication along with proxies like ProxySQL, ClusterControl can help you build an environment in which the failover process is barely noticeable by the application. Below we’ll show what the failover process may look like in a typical replication setup: one master with two slaves. We will use ProxySQL to detect topology changes and route traffic to the correct hosts.

First, we’ll start our “application” - sysbench:

root@ip-172-30-4-48:~# while true ; do sysbench --test=/root/sysbench/sysbench/tests/db/oltp.lua --num-threads=2 --max-requests=0 --max-time=0 --mysql-host=172.30.4.48 --mysql-user=sbtest --mysql-password=sbtest --mysql-port=6033 --oltp-tables-count=32 --report-interval=1 --oltp-skip-trx=on --oltp-table-size=100000 run ; done

It will connect to ProxySQL (port 6033) and use it to distribute traffic between master and slaves. We simulate default behavior of autocommit=1 in MySQL by disabling transactions for Sysbench.

Once we induce some lag, we kill our master:

root@ip-172-30-4-112:~# killall -9 mysqld mysqld_safe

ClusterControl starts failover.

ID:79574 [13:18:34]: Failover to a new Master.

At first, it verifies the state of replication on all nodes in the cluster. Among other things, ClusterControl looks for the most up to date slave in the topology and picks it as the master candidate.

ID:79575 [13:18:34]: Checking 172.30.4.99:3306
ID:79576 [13:18:34]: ioerrno=2003 io running 0 on 172.30.4.99:3306
ID:79577 [13:18:34]: Checking 172.30.4.4:3306
ID:79578 [13:18:34]: ioerrno=2003 io running 0 on 172.30.4.4:3306
ID:79579 [13:18:34]: Checking 172.30.4.112:3306
ID:79580 [13:18:34]: 172.30.4.112:3306: is not connected. Checking if this is the failed master.
ID:79581 [13:18:34]: 172.30.4.99:3306: Checking if slave can be used as a candidate.
ID:79582 [13:18:34]: Adding 172.30.4.99:3306 to slave list
ID:79583 [13:18:34]: 172.30.4.4:3306: Checking if slave can be used as a candidate.
ID:79584 [13:18:34]: Adding 172.30.4.4:3306 to slave list
ID:79585 [13:18:34]: 172.30.4.4:3306: Slave lag is 4 seconds.
ID:79586 [13:18:34]: 172.30.4.99:3306: Slave lag is 20 seconds >= 4 seconds, not a possible candidate.
ID:79587 [13:18:34]: 172.30.4.4:3306 elected as the new Master.

As a next step, required grants are added.

ID:79588 [13:18:34]: 172.30.4.4:3306: Creating user 'rpl_user'@'172.30.4.99.
ID:79589 [13:18:34]: 172.30.4.4:3306: Granting REPLICATION SLAVE 'rpl_user'@'172.30.4.99'.
ID:79590 [13:18:34]: 172.30.4.99:3306: Creating user 'rpl_user'@'172.30.4.4.
ID:79591 [13:18:34]: 172.30.4.99:3306: Granting REPLICATION SLAVE 'rpl_user'@'172.30.4.4'.
ID:79592 [13:18:34]: 172.30.4.99:3306: Setting read_only=ON
ID:79593 [13:18:34]: 172.30.4.4:3306: Setting read_only=ON

Then, it’s time to ensure no errant transactions are found, which could prevent the whole failover process from happening.

ID:79594 [13:18:34]: Checking for errant transactions.
ID:79595 [13:18:34]: 172.30.4.99:3306: Skipping, same as slave 172.30.4.99:3306
ID:79596 [13:18:34]: 172.30.4.99:3306: Comparing to 172.30.4.4:3306, master_uuid = 'e4864640-baff-11e6-8eae-1212bbde1380'
ID:79597 [13:18:34]: 172.30.4.4:3306: Checking for errant transactions.
ID:79598 [13:18:34]: 172.30.4.112:3306: Skipping, same as master 172.30.4.112:3306
ID:79599 [13:18:35]: 172.30.4.4:3306: Comparing to 172.30.4.99:3306, master_uuid = 'e4864640-baff-11e6-8eae-1212bbde1380'
ID:79600 [13:18:35]: 172.30.4.99:3306: Checking for errant transactions.
ID:79601 [13:18:35]: 172.30.4.4:3306: Skipping, same as slave 172.30.4.4:3306
ID:79602 [13:18:35]: 172.30.4.112:3306: Skipping, same as master 172.30.4.112:3306
ID:79603 [13:18:35]: No errant transactions found.

During the last preparation step, missing transactions are applied on the master candidate: we want it to fully catch up on replication before we proceed with the failover. In our case, to ensure that failover will happen even if the slave is badly lagging, we enforced a 600 second limit. The slave will try to replay any missing transactions from its relay logs, but if that takes more than 600 seconds, we will force a failover.

ID:79604 [13:18:35]: 172.30.4.4:3306: preparing candidate.
ID:79605 [13:18:35]: 172.30.4.4:3306: Checking if there the candidate has relay log to apply.
ID:79606 [13:18:35]: 172.30.4.4:3306: waiting up to 600 seconds before timing out.
ID:79608 [13:18:37]: 172.30.4.4:3306: Applied 391 transactions
ID:79609 [13:18:37]: 172.30.4.4:3306: Executing 'SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS('e4864640-baff-11e6-8eae-1212bbde1380:16340-23420', 5)' (waited 5 out of maximally 600 seconds).
ID:79610 [13:18:37]: 172.30.4.4:3306: Applied 0 transactions
ID:79611 [13:18:37]: 172.30.4.99:3306: No missing transactions found.
ID:79612 [13:18:37]: 172.30.4.4:3306: Up to date with temporary master 172.30.4.99:3306
ID:79613 [13:18:37]: 172.30.4.4:3306: Completed preparations of candidate.

Finally, failover happens. From the application’s standpoint, the impact was minimal - the process took less than 5 seconds, during which the application had to wait for queries to execute. Of course, it depends on multiple factors - the main one is replication lag as the failover process, by default, requires the slave to be up-to-date. Catching up can take quite some time if the slave is lagging behind heavily.

At the end, we have a new replication topology. A new master has been elected and the second slave has been reslaved. The old master, on the other hand, is stopped. This is intentional as we want the user to be able to investigate the state of the old master before performing any further changes (e.g., slaving it off a new master or rebuilding it).

We hope this mechanism will be useful in maintaining high availability of replication setups. If you have any feedback on it, let us know as we’d love to hear from you.

ClusterControl
Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

by krzysztof at January 30, 2017 10:51 AM

Shlomi Noach

Some observations on MySQL to sqlite migration & compatibility

I'm experimenting with sqlite as a backend database for orchestrator. While orchestrator manages MySQL replication topologies, it also uses MySQL as a backend. For some deployments, and I'm looking into one such deployment, having MySQL as the backend is a considerable overhead.

This sent me down the route of looking into a self-contained orchestrator binary + backend DB. I would have orchestrator spawn its own backend database instead of connecting to an external one.

Why even relational?

Can't orchestrator just use a key-value backend?

Maybe it could. But frankly I enjoy the power of relational databases, and the versatility they offer has proven itself multiple times with orchestrator, being able to answer interesting, new, complex questions about one's topology by crafting SQL queries.

Moreover, orchestrator is already heavily invested in the relational model. At this time, replacing all SQL queries with key-value reads seems to me as a significant investment in time and risk. So I was looking for a relational, SQL accessible embeddable database for orchestrator.

Why sqlite?

I am in particular looking at two options: sqlite (via the go-sqlite3 binding) and TiDB. sqlite does not need much introduction, and I'll just say it's embeddable within the golang-built binary.

TiDB is a pure-Go, MySQL dialect compatible server which provides relational model on top of key-value store. The store could be local or distributed, and the TiDB project cleverly separates involved layers.

Of the two, sqlite is mature, alas uses a different SQL dialect and has different behavior. TiDB's compatibility with MySQL is an impressive feat, but still ongoing.

Both require adaptations of SQL code. Here are some observations on adaptations required when moving an existing app from MySQL backend to sqlite backend.

Differences

Just to answer an obvious question: can't everything be abstracted away by an ORM that speaks both dialects?

I don't think so. I always exploit SQL beyond the standard insert/delete/update/select, exploits that ORMs just don't support.

Here's an incomplete list of differences I found. Some purely syntactical, some semantical, some behavioral, and some operational.

  • Data types: no CHARACTER SET clause
  • Data types: you can't associate UNSIGNED to any int type
  • Data types: no enum. However there's an alternative in the form of:
    race text check (race in ('human', 'dog', 'alien'))
  • There is no auto_increment; an INTEGER PRIMARY KEY column auto-assigns rowid values instead (sqlite also has an AUTOINCREMENT keyword with slightly different semantics)
  • Data types: timestamps are not a thing. There's no timezone info.
  • TIMESTAMP has no ON UPDATE clause.
  • No after clause for adding columns
  • Indexes are not part of table creation. Only PRIMARY KEY is. The rest of indexes are created via CREATE INDEX statement
  • Indexes have unique names across the schema. This is unfortunate, since it forces me to use longer names for indexes so as to differentiate them. For example, in MySQL I can have an index named name_idx in two different tables; in sqlite I append table name for "namespacing"
  • Temporal values and functions: poor support for time arithmetic.
    • Getting the diff between two datetimes is non-trivial (what's the diff in months for a leap year?)
    • INTERVAL keyword not respected. Appending/subtracting dates can be done via:
      datetime('now', '-3 hour')
      Can you see the problem in the above? What if the number 3 is a positional argument? In MySQL I would use NOW() - INTERVAL ? HOUR. To make positional arguments work in sqlite, the above becomes datetime('now', printf('-%d hour', ?)). How would you even translate NOW() - INTERVAL (? * 2) SECOND?
    • UNIX_TIMESTAMP is not respected. Instead one uses strftime('%s', 'now'); I dislike the use of string functions to generate times.
  • insert ignore turns to insert or ignore
  • replace into remains replace into. Yay!
  • insert on duplicate key update has no equivalent. It's worthwhile noting that a replace into does not replace (pun intended) an insert ... on duplicate key, as the latter can choose to only update a subset of columns in the event of a constraint violation. It's really very powerful.
  • IS TRUE and IS FALSE are not respected.
  • ALTER TABLE:
    • When adding a not null column one must specify the default value (e.g. 0 for an int)
    • You cannot add a timestamp column that defaults to CURRENT_TIMESTAMP. You can have such column in your CREATE TABLE definition, but you cannot add such a column. The reason being that CURRENT_TIMESTAMP is not a constant value. When adding a column to a table, sqlite does not actually apply the change to all existing rows, but only to newly created ones. It therefore does not know what value to provide to those older rows.
    • You cannot DROP COLUMN. I'll say it again. You cannot DROP COLUMN
    • You cannot modify a PRIMARY KEY
    • You cannot rename a column or change its datatype.
    • In fact, the only supported ALTER TABLE statements are ADD COLUMN and RENAME (renaming the table itself)
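
The datetime-modifier workaround from the list above is easy to verify with Python's built-in sqlite3 binding. A fixed timestamp is used instead of 'now' so the result is deterministic:

```python
# INTERVAL does not exist in sqlite; date arithmetic uses datetime() modifiers,
# and a positional argument must be spliced into the modifier via printf().
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# MySQL (conceptually): SELECT '2017-01-30 12:00:00' - INTERVAL 3 HOUR
(fixed,) = cur.execute(
    "SELECT datetime('2017-01-30 12:00:00', '-3 hour')").fetchone()

# The same with the hour count as a positional argument:
(param,) = cur.execute(
    "SELECT datetime('2017-01-30 12:00:00', printf('-%d hour', ?))", (3,)
).fetchone()

print(fixed)            # 2017-01-30 09:00:00
print(param == fixed)   # True

# The UNIX_TIMESTAMP substitute, via a string function as noted above:
print(cur.execute(
    "SELECT strftime('%s', '2017-01-30 12:00:00')").fetchone()[0])
```

Note that strftime('%s', ...) returns TEXT, not an integer, which is part of the objection above.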

Regarding the ALTER TABLE limitations, the advice for dropping/changing/renaming columns or changing the primary key is to "create a new table with the new format, copy the rows, drop the old table, rename". This is certainly feasible, but requires a substantial overhead from the user. And it's non atomic. It requires a change in the state of mind but also a change in state of operations, the latter being non-trivial when moving from one DBMS to another, or when wishing to support both.
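The copy-and-rename workaround can be sketched with Python's built-in sqlite3 binding. Table and column names here are made up for illustration:

```python
# "Dropping" a column in sqlite: create a replacement table, copy the rows,
# drop the old table, rename. Not atomic as a single DDL statement, but
# sqlite DDL is transactional, so a transaction prevents partial application.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT, legacy_col TEXT)")
cur.executemany("INSERT INTO host (name, legacy_col) VALUES (?, ?)",
                [("db1", "x"), ("db2", "y")])

with con:  # commit all four steps together, or none of them
    cur.execute("CREATE TABLE host_new (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("INSERT INTO host_new (id, name) SELECT id, name FROM host")
    cur.execute("DROP TABLE host")
    cur.execute("ALTER TABLE host_new RENAME TO host")

print([r[1] for r in cur.execute("SELECT id, name FROM host ORDER BY id")])
```

In a real migration you would also have to recreate any indexes, triggers and views that referenced the old table, which is part of the operational overhead described above.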

I'm still looking into this, and trying to work my way around differences with cheap regular expressions for as much as possible. I'm mainly interested right now in finding all semantic/logic differences that would require application changes. So far the TIMESTAMP behavior is such, and so is the INSERT ... ON DUPLICATE KEY statement.

by shlomi at January 30, 2017 09:14 AM

January 29, 2017

Valeriy Kravchuk

perf Basics for MySQL Profiling

Oprofile was widely used for MySQL profiling on Linux in the past. But since 2010 and the 2.6.31 Linux kernel, another profiler, perf, has been gaining popularity. It uses the performance counters subsystem in Linux (CPU hardware registers that count hardware events such as instructions executed). perf is capable of lightweight profiling. It is included in the Linux kernel, under tools/perf (so the features available depend on the kernel version), and is frequently updated and enhanced.

So, perf is probably the future of profiling on Linux, and it makes sense to discuss its basic usage for profiling MySQL servers. For detailed discussions of the features provided, numerous examples (not related to MySQL) and links, I suggest reading the great posts by Brendan Gregg, starting from this one.

I am going to use Ubuntu 14.04 for this basic Howto, with the following perf version provided by the linux-tools-common package:

openxs@ao756:~/dbs/maria10.1$ perf --version
perf version 3.13.11-ckt39
openxs@ao756:~/dbs/maria10.1$ dpkg -S `which perf`
linux-tools-common: /usr/bin/perf
On RPM-based systems the package name is just perf. I plan to discuss and show details of (other versions of) perf usage on CentOS 6.8 and Fedora 25 later.

I'll work with a somewhat outdated MariaDB 10.1.x (built from source) that I have up and running at hand. If you use standard vendor packages, make sure that you have debug symbols installed (perf will inform you if some are missing anyway) to get useful results:
openxs@ao756:~$ dpkg -l | grep -i percona | grep dbg
ii  percona-server-5.7-dbg                                5.7.16-10-1.trusty                                  amd64        Debugging package for Percona Server
Also, ideally the binaries that you are profiling should be built with the -fno-omit-frame-pointer option.

As in the previous post about basic oprofile usage, I'll try to start with a primitive test case based on Bug #39630, for simplicity and to establish basic trust in the tool's outputs. But this time I also want to try profiling some concurrent load closer to real life, along the lines of Bug #83912.

The basic two(!) steps for using perf to study some problematic period of a MySQL server are the following:
  1. Assuming that we want to get a profile of the mysqld process during some problematic period, let's say in some script, we start recording samples for, say, 30 seconds, as the root user or via sudo:
    openxs@ao756:~/dbs/maria10.1$ sudo perf record -p `pidof mysqld` sleep 30
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.052 MB perf.data (~2257 samples) ]
  2. Samples are collected in the perf.data file in the current directory by default:
    openxs@ao756:~/dbs/maria10.1$ ls -l perf.data
    -rw------- 1 root root 56768 січ 29 13:53 perf.data
    When collection completes we can get profiling information using the perf report command. If you want to save the result into some file or just review it, add the --stdio option; otherwise you end up in the text-based "interactive" user interface of perf that, while it can be useful, may be misleading and weird for beginners. For the 'select benchmark(500000000, 2*2)' case I've got:
    openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio
    # ========
    # captured on: Sun Jan 29 14:01:24 2017
    # hostname : ao756
    # os release : 3.13.0-100-generic
    # perf version : 3.13.11-ckt39
    # arch : x86_64
    # nrcpus online : 2
    # nrcpus avail : 2

    # cpudesc : Intel(R) Pentium(R) CPU 987 @ 1.50GHz
    # cpuid : GenuineIntel,6,42,7
    # total memory : 3861452 kB
    # cmdline : /usr/lib/linux-tools-3.13.0-100/perf record -p 4270 sleep 30
    # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, e
    # HEADER_CPU_TOPOLOGY info available, use -I to display
    # HEADER_NUMA_TOPOLOGY info available, use -I to display
    # pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore
    # ========
    #
    # Samples: 248  of event 'cycles'
    # Event count (approx.): 24811349
    #
    # Overhead  Command       Shared Object
    # ........  .......  ..................  .......................................
    #
        11.60%   mysqld  [kernel.kallsyms]   [k] update_blocked_averages
         8.12%   mysqld  [kernel.kallsyms]   [k] update_cfs_rq_blocked_load
         6.98%   mysqld  [kernel.kallsyms]   [k] find_busiest_group
         5.76%   mysqld  [kernel.kallsyms]   [k] sys_io_getevents
         4.40%   mysqld  libc-2.19.so        [.] memset
         3.21%   mysqld  [kernel.kallsyms]   [k] update_curr
         3.03%   mysqld  [kernel.kallsyms]   [k] __calc_delta
         2.96%   mysqld  [kernel.kallsyms]   [k] put_prev_task_fair
         2.46%   mysqld  [kernel.kallsyms]   [k] timerqueue_add
         2.12%   mysqld  [kernel.kallsyms]   [k] cpumask_next_and
         2.11%   mysqld  libpthread-2.19.so  [.] __pthread_mutex_cond_lock
         2.10%   mysqld  mysqld              [.] os_aio_linux_handle(unsigned long,
    ...
         1.09%   mysqld  mysqld              [.] buf_flush_lru_manager_thread
    ...
Now, the result above looks similar to, and as suspicious as, the result I've got with operf (which is also based on perf_events internally) on the same system and MariaDB version in the previous post. Everyone wonders where the time is spent on multiplication etc. in the above. For operf (see my comment) I had found out that system-wide profiling gives expected results. It seems perf record -p pid also collects samples for a single CPU (the one it runs on, or the one running that process actively, or just some one), and this is a problem for multi-threaded programs like mysqld when they are executed on multi-core hardware (almost all hardware is like that these days, maybe some virtual machines aside). The solution is simple: add the -a option for perf record:
       -a, --all-cpus
           System-wide collection from all CPUs.
and do not refer to mysqld explicitly; you can filter out the related results later (for example, with sudo perf report --stdio | grep mysqld | head -10, or interactively when the --stdio option is not used). Like this:
openxs@ao756:~/dbs/maria10.1$ sudo perf record -a sleep 30
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 4.411 MB perf.data (~192717 samples) ]
openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio
# ========
# captured on: Sun Jan 29 14:27:31 2017
# hostname : ao756
# os release : 3.13.0-100-generic
# perf version : 3.13.11-ckt39
# arch : x86_64
# nrcpus online : 2
...
# Samples: 83K of event 'cycles'
# Event count (approx.): 30621323147
#
# Overhead          Command                  Shared Object
# ........  ...............  .............................  ............................................................................................................
#
    31.22%           mysqld  mysqld                         [.] Item_func_mul::int_op()
    16.57%           mysqld  mysqld                         [.] Item_func_hybrid_field_type::val_int()
    16.51%           mysqld  mysqld                         [.] Item_hybrid_func::result_type() const
    16.11%           mysqld  mysqld                         [.] Item_func_benchmark::val_int()
     9.92%           mysqld  mysqld                         [.] Item_int::val_int()
     3.27%           mysqld  mysqld                         [.] Type_handler_int_result::cmp_type() const
     3.25%           mysqld  mysqld                         [.] Type_handler_int_result::result_type() const
     0.27%      kworker/0:2  [kernel.kallsyms]              [k] aes_encrypt
     0.18%           compiz  i965_dri.so                    [.] 0x00000000000def10
     0.14%          swapper  [kernel.kallsyms]              [k] intel_idle
...
So, the system-wide profile provides information we can trust, even for multi-threaded programs on multi-core systems. If you know some other, less intrusive way to get profiles that do not miss the details from threads running on other cores, please tell me. I was NOT able to find it quickly.
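For repeated use, the record-then-filter recipe above can be wrapped into small helper functions. This is just my sketch; the function names are mine, not part of perf:

```shell
# Build the perf command lines used above as reusable strings.
build_record_cmd() {
    # $1 = seconds to sample; -a is the key option: all CPUs, system-wide
    printf 'perf record -a sleep %s\n' "$1"
}
build_report_cmd() {
    # $1 = how many mysqld-related lines to keep from the text report
    printf 'perf report --stdio | grep mysqld | head -%s\n' "$1"
}

build_record_cmd 30
build_report_cmd 10
```

On a real system the generated commands would be run under sudo, as in the transcripts above.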

By the way, the nice interactive perf top command shows expected results as well while this primitive test runs (it is still a system-wide profile):

Samples: 17K of event 'cycles', Event count (approx.): 5061557268
 28,29%  mysqld                         [.] Item_func_mul::int_op()
 15,05%  mysqld                         [.] Item_func_benchmark::val_int()
 15,01%  mysqld                         [.] Item_hybrid_func::result_type() cons
 15,01%  mysqld                         [.] Item_func_hybrid_field_type::val_int
  9,12%  mysqld                         [.] Item_int::val_int()
  3,34%  perf                           [.] 0x00000000000564db
  3,18%  mysqld                         [.] Type_handler_int_result::cmp_type()
  2,93%  mysqld                         [.] Type_handler_int_result::result_type
  0,96%  libc-2.19.so                   [.] __strstr_sse2
  0,61%  libc-2.19.so                   [.] __GI___strcmp_ssse3
  0,24%  [kernel]                       [k] kallsyms_expand_symbol.constprop.1
  0,21%  [kernel]                       [k] module_get_kallsym
  0,20%  libc-2.19.so                   [.] _int_malloc
  0,20%  libharfbuzz.so.0.927.0         [.] 0x0000000000028051
  0,19%  i965_dri.so                    [.] 0x000000000010a9a0
  0,18%  [kernel]                       [k] format_decode
  0,16%  [kernel]                       [k] vsnprintf
  0,15%  libc-2.19.so                   [.] memchr
  0,15%  libc-2.19.so                   [.] __strchr_sse2
  0,14%  [kernel]                       [k] number.isra.1
  0,13%  [kernel]                       [k] memcpy
  0,09%  libc-2.19.so                   [.] strlen
Press '?' for help on key bindings
Press 'q' to get out of that screen.

perf record may also be used the same way as operf: run in the foreground and interrupted with Ctrl-C (the examples are from my last night's session on CentOS):
[root@centos ~]# perf record -ag
^C[ perf record: Woken up 63 times to write data ]
[ perf record: Captured and wrote 15.702 MB perf.data (226643 samples) ]
or it can be started in the background and then interrupted with a SIGINT signal sent to the process:
[root@centos ~]# perf record -ag &
[1] 2353
[root@centos ~]# mysql -uroot -e'select benchmark(500000000,2*2);' test
+--------------------------+
| benchmark(500000000,2*2) |
+--------------------------+
|                        0 |
+--------------------------+
[root@centos ~]# kill -sigint 2353
[ perf record: Woken up 90 times to write data ]
[root@centos ~]# [ perf record: Captured and wrote 22.366 MB perf.data (322362 samples) ]

[1]+  Interrupt               perf record -ag
[root@centos ~]# ps aux | grep perf
root      2358  0.0  0.0 103312   864 pts/0    S+   22:16   0:00 grep perf
You've probably noted the additional -g option in the examples above (explanation pasted from perf help record/man perf-record output):
       -g
           Enables call-graph (stack chain/backtrace) recording.
Another option often used is -F (like -F99 or -F1000):
       -F, --freq=
           Profile at this frequency.
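As a back-of-the-envelope check on the -F option, the number of samples one can expect from a fixed-frequency, system-wide run is roughly the frequency times the duration times the number of online CPUs. This is my rough assumption for a busy system; actual counts vary with idle time and kernel throttling:

```shell
# Rough expected sample count for 'perf record -a -Ffreq sleep seconds'
# (assumption: with -a, every online CPU is sampled at the requested rate).
expected_samples() {
    # $1 = frequency (-F), $2 = seconds, $3 = number of online CPUs
    echo $(( $1 * $2 * $3 ))
}
expected_samples 99 30 2    # -> 5940
```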
For the select benchmark(500000000, 2*2) case -g does not show anything really unexpected or enlightening:

openxs@ao756:~/dbs/maria10.1$ sudo perf record -ag sleep 30
[sudo] password for openxs:
[ perf record: Woken up 64 times to write data ]
[ perf record: Captured and wrote 16.496 MB perf.data (~720730 samples) ]
openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio
...

# Overhead          Command                  Shared Object
# ........  ...............  .............................  ............................................................................................................
#
    31.02%           mysqld  mysqld                         [.] Item_func_mul::int_op()
                     |
                     --- Item_func_mul::int_op()
                        |
                        |--94.56%-- Item_func_hybrid_field_type::val_int()
                        |          Item_func_benchmark::val_int()
                        |          Item::send(Protocol*, String*)
                        |          Protocol::send_result_set_row(List<Item>*)
                        |          select_send::send_data(List<Item>&)
                        |          JOIN::exec_inner()
                        |          JOIN::exec()
                        |          mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsi
                        |          handle_select(THD*, LEX*, select_result*, unsigned long)
                        |          execute_sqlcom_select(THD*, TABLE_LIST*)
                        |          mysql_execute_command(THD*)
                        |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                        |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                        |          do_command(THD*)
                        |          do_handle_one_connection(THD*)
                        |          handle_one_connection
                        |          start_thread
                        |
                         --5.44%-- Item_func_benchmark::val_int()
                                   Item::send(Protocol*, String*)
                                   Protocol::send_result_set_row(List<Item>*)
                                   select_send::send_data(List<Item>&)
                                   JOIN::exec_inner()
                                   JOIN::exec()
                                   mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsi
                                   handle_select(THD*, LEX*, select_result*, unsigned long)
                                   execute_sqlcom_select(THD*, TABLE_LIST*)
                                   mysql_execute_command(THD*)
                                   mysql_parse(THD*, char*, unsigned int, Parser_state*)
                                   dispatch_command(enum_server_command, THD*, char*, unsigned int)
                                   do_command(THD*)
                                   do_handle_one_connection(THD*)
                                   handle_one_connection
                                   start_thread
...


But we clearly see backtraces/callgraphs.

Note that this tree starts with the on-CPU functions and works back through the ancestry. This approach is called a "callee based call graph". This can be flipped by using -G for an "inverted call graph", or by using the caller option to -g/--call-graph, instead of the callee default. The hottest (most frequent) stack trace in this perf call graph occurred in 29.33% of samples, which is the product of the overhead percentage and the top stack leaf (31.02%*94.56%, which are relative rates). perf report can also be run with "-g graph" to show absolute overhead rates.
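The 29.33% figure can be reproduced directly: the absolute rate of a stack is the product of the relative rates perf prints along the branch.

```shell
# Absolute rate of the hottest stack = leaf overhead * branch share:
awk 'BEGIN { printf "%.2f%%\n", 31.02 * 94.56 / 100 }'   # -> 29.33%
```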

I'd like to conclude this blog post with profiling a system running the test load presented in Bug #83912 (which, in turn, is just one of the observations based on a very complex real-life performance problem I am still working on). There we run N+2 long-running UPDATEs on N+2 different tables on a system with N cores. While the UPDATEs run, a SELECT asking for a single record by PRIMARY KEY from a different table may sometimes run slow, taking from fractions of a second to many seconds.

So, with the table to select from, t0, and 4 other (N+2, N=2 on my netbook used for testing) tables to update, t1 ... t4, created:
openxs@ao756:~/dbs/maria10.1$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 21
Server version: 10.1.18-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [test]> select count(*) from t0;
+----------+
| count(*) |
+----------+
|  2097152 |
+----------+
1 row in set (9.33 sec)

MariaDB [test]> select * from t0 where id = 13;
+----+------+----------------------+
| id | c1   | c2                   |
+----+------+----------------------+
| 13 |  269 | 0.008545639932663777 |
+----+------+----------------------+
1 row in set (0.00 sec)

MariaDB [test]> show table status like 't1'\G
*************************** 1. row ***************************
           Name: t1
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 2079632
 Avg_row_length: 139
    Data_length: 289177600
Max_data_length: 0
   Index_length: 0
      Data_free: 6291456
 Auto_increment: 2424797
    Create_time: 2017-01-29 12:31:57
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: stats_persistent=1
        Comment:
1 row in set (0.11 sec)

MariaDB [test]> explain select * from t0 where id = 13;
+------+-------------+-------+-------+---------------+---------+---------+-------+------+-------+
| id   | select_type | table | type  | possible_keys | key     | key_len | ref   | rows | Extra |
+------+-------------+-------+-------+---------------+---------+---------+-------+------+-------+
|    1 | SIMPLE      | t0    | const | PRIMARY       | PRIMARY | 4       | const |    1 |       |
+------+-------------+-------+-------+---------------+---------+---------+-------+------+-------+
1 row in set (0.00 sec)

MariaDB [test]> explain update t1 set c2 = rand() where id between 2000 and 300000;
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------------+
| id   | select_type | table | type  | possible_keys | key     | key_len | ref  | rows   | Extra       |
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------------+
|    1 | SIMPLE      | t1    | range | PRIMARY       | PRIMARY | 4       | NULL | 339246 | Using where |
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------------+
1 row in set (0.00 sec)
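The CREATE TABLE statements for t0 ... t4 are not shown above; here is a hypothetical reconstruction based only on the columns visible in the SELECT and SHOW TABLE STATUS output (id, c1, c2; InnoDB), not the exact schema from the bug report:

```shell
# Generate CREATE TABLE statements for the five test tables (hypothetical).
gen_schema() {
    for i in 0 1 2 3 4; do
        printf 'CREATE TABLE t%s (id INT AUTO_INCREMENT PRIMARY KEY, c1 INT, c2 DOUBLE) ENGINE=InnoDB;\n' "$i"
    done
}
gen_schema
# On a real system the output could be piped into the server:
#   gen_schema | bin/mysql -uroot test
```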
I've started 4 (N+2) threads, each updating a large part of one of the tables:
openxs@ao756:~/dbs/maria10.1$ for i in `seq 1 4`; do bin/mysqlslap -uroot --no-drop --create-schema=test --number-of-queries=1000 --iterations=10 --concurrency=1 --query="update t$i set c2 = rand() where id between 2000 and 300000;" & done
[1] 15513
[2] 15514
[3] 15515
[4] 15516

In the SHOW PROCESSLIST output we see:
...
| 26 | root | localhost | test | Query   |   10 | updating | update t2 set c2 = rand() where id between 2000 and 300000 |    0.000 |
| 27 | root | localhost | test | Query   |   32 | updating | update t1 set c2 = rand() where id between 2000 and 300000 |    0.000 |
| 28 | root | localhost | test | Query   |    5 | updating | update t4 set c2 = rand() where id between 2000 and 300000 |    0.000 |
| 29 | root | localhost | test | Query   |   11 | updating | update t3 set c2 = rand() where id between 2000 and 300000 |    0.000 |
...

Let's use perf to see what these threads are doing while they stay in the "updating" status for tens of seconds. First without -g:
openxs@ao756:~/dbs/maria10.1$ sudo perf record -a sleep 30
[sudo] password for openxs:
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 4.260 MB perf.data (~186128 samples) ]
openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio
...
# Overhead          Command                  Shared Object
# ........  ...............  .............................  ............................................................................................................
#
    12.50%      kworker/0:1  [kernel.kallsyms]              [k] aes_encrypt
    10.54%      kworker/1:1  [kernel.kallsyms]              [k] aes_encrypt
     9.77%      kworker/1:2  [kernel.kallsyms]              [k] aes_encrypt
     4.81%      kworker/0:0  [kernel.kallsyms]              [k] aes_encrypt
     4.52%           mysqld  mysqld                         [.] buf_calc_page_new_checksum(unsigned char const*)
     3.64%           mysqld  mysqld                         [.] log_block_calc_checksum_innodb(unsigned char const*)
     2.22%           mysqld  mysqld                         [.] dtoa.constprop.6
     2.15%           mysqld  mysqld                         [.] trx_undo_report_row_operation(unsigned long, unsigned long, que_thr_t*, dict_index_t*, dtuple_t const*,
     2.10%           mysqld  mysqld                         [.] quorem
     1.78%           mysqld  mysqld                         [.] mtr_commit(mtr_t*)
     1.41%           mysqld  mysqld                         [.] ha_innobase::update_row(unsigned char const*, unsigned char*)
     1.15%           mysqld  mysqld                         [.] buf_page_optimistic_get(unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)

     1.15%           mysqld  mysqld                         [.] buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long
     1.11%           mysqld  mysqld                         [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
     1.10%           mysqld  mysqld                         [.] multadd
...
I am not surprised to see aes_encrypt in the kernel on top of this system-wide profile, as an encrypted HDD is used on this netbook (one of Percona's security requirements for any hardware ever used for work, even personal), and the updates are surely I/O bound with mysqld --no-defaults and 128M of buffer pool.

We may want to study some of the mysqld-related entries further with call graphs generated (-g):

openxs@ao756:~/dbs/maria10.1$ sudo perf record -ag sleep 20
[ perf record: Woken up 31 times to write data ]
[ perf record: Captured and wrote 8.313 MB perf.data (~363186 samples) ]
openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio

...
     4.29%           mysqld  mysqld                         [.] buf_calc_page_new_checksum(unsigned char const*)
                     |
                     --- buf_calc_page_new_checksum(unsigned char const*)
                        |
                        |--67.15%-- buf_flush_init_for_writing(unsigned char*, void*, unsigned long)
                        |          buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool)
                        |          buf_flush_try_neighbors(unsigned long, unsigned long, buf_flush_t, unsigned long, unsigned long)
                        |          |
                        |          |--52.15%-- buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long)
                        |          |          buf_flush_list(unsigned long, unsigned long, unsigned long*)
                        |          |          buf_flush_page_cleaner_thread
                        |          |          start_thread
                        |          |
                        |           --47.85%-- buf_flush_LRU_list_batch(buf_pool_t*, unsigned long, bool, flush_counters_t*)
                        |                     buf_flush_batch(buf_pool_t*, buf_flush_t, unsigned long, unsigned long, bool, flush_counters_t*)
                        |                     buf_flush_LRU_tail()
                        |                     buf_flush_lru_manager_thread
                        |                     start_thread
                        |
                         --32.85%-- buf_page_is_corrupted(bool, unsigned char const*, unsigned long)
                                   buf_page_io_complete(buf_page_t*)
                                   |
                                   |--82.17%-- fil_aio_wait(unsigned long)
                                   |          io_handler_thread
                                   |          start_thread
                                   |
                                    --17.83%-- buf_read_page(unsigned long, unsigned long, unsigned long, trx_t*, buf_page_t**)
                                              buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, uns
                                              |
                                              |--92.31%-- btr_pcur_move_to_next_page(btr_pcur_t*, mtr_t*)
                                              |          btr_pcur_move_to_next(btr_pcur_t*, mtr_t*)
                                              |          row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
                                              |          ha_innobase::index_next(unsigned char*)
                                              |          handler::ha_index_next(unsigned char*)
                                              |          handler::read_range_next()
                                              |          handler::multi_range_read_next(void**)
                                              |          Mrr_simple_index_reader::get_next(void**)
                                              |          DsMrr_impl::dsmrr_next(void**)
                                              |          QUICK_RANGE_SELECT::get_next()
                                              |          rr_quick(READ_RECORD*)
                                              |          mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, e
                                              |          mysql_execute_command(THD*)
                                              |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                                              |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                                              |          do_command(THD*)
                                              |          do_handle_one_connection(THD*)
                                              |          handle_one_connection
                                              |          start_thread
                                              |
                                              |--5.37%-- btr_estimate_n_rows_in_range(dict_index_t*, dtuple_t const*, unsigned long, dtuple_t const*, unsigned long, trx
                                              |          ha_innobase::records_in_range(unsigned int, st_key_range*, st_key_range*)
                                              |          handler::multi_range_read_info_const(unsigned int, st_range_seq_if*, void*, unsigned int, unsigned int*, unsign
                                              |          DsMrr_impl::dsmrr_info_const(unsigned int, st_range_seq_if*, void*, unsigned int, unsigned int*, unsigned int*,
                                              |          check_quick_select(PARAM*, unsigned int, bool, SEL_ARG*, bool, unsigned int*, unsigned int*, Cost_estimate*)
                                              |          get_key_scans_params(PARAM*, SEL_TREE*, bool, bool, double)
                                              |          SQL_SELECT::test_quick_select(THD*, Bitmap<64u>, unsigned long long, unsigned long long, bool, bool, bool)
                                              |          mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, e
                                              |          mysql_execute_command(THD*)
                                              |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                                              |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                                              |          do_command(THD*)
                                              |          do_handle_one_connection(THD*)
                                              |          handle_one_connection
                                              |          start_thread
...

     1.28%           mysqld  mysqld                         [.] ha_innobase::update_row(unsigned char const*, unsigned char*)
                     |
                     --- ha_innobase::update_row(unsigned char const*, unsigned char*)
                        |
                        |--97.71%-- handler::ha_update_row(unsigned char const*, unsigned char*)
                        |          mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool,
                        |          mysql_execute_command(THD*)
                        |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                        |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                        |          do_command(THD*)
                        |          do_handle_one_connection(THD*)
                        |          handle_one_connection
                        |          start_thread
                        |
                         --2.29%-- mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool,
                                   mysql_execute_command(THD*)
                                   mysql_parse(THD*, char*, unsigned int, Parser_state*)
                                   dispatch_command(enum_server_command, THD*, char*, unsigned int)
                                   do_command(THD*)
                                   do_handle_one_connection(THD*)
                                   handle_one_connection
                                   start_thread
...


Now to the interesting part (slow UPDATEs are expected when we are I/O bound). Usually (but not always, as in the original real-life case!) the simple SELECT runs fast, but if we limit innodb_thread_concurrency to, say, 4, it may start to run slow:
MariaDB [test]> select * from t0 where id = 13;
+----+------+----------------------+
| id | c1   | c2                   |
+----+------+----------------------+
| 13 |  269 | 0.008545639932663777 |
+----+------+----------------------+
1 row in set (0.00 sec)

MariaDB [test]> select @@innodb_thread_concurrency;
+-----------------------------+
| @@innodb_thread_concurrency |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (0.05 sec)

MariaDB [test]> set global innodb_thread_concurrency=4;
Query OK, 0 rows affected (0.25 sec)

MariaDB [test]> select * from t0 where id = 13;
+----+------+----------------------+
| id | c1   | c2                   |
+----+------+----------------------+
| 13 |  269 | 0.008545639932663777 |
+----+------+----------------------+
1 row in set (2.92 sec)

MariaDB [test]> select * from t0 where id = 13;
+----+------+----------------------+
| id | c1   | c2                   |
+----+------+----------------------+
| 13 |  269 | 0.008545639932663777 |
+----+------+----------------------+
1 row in set (12.16 sec)

A profile with call graphs may help to find out why this happens:
openxs@ao756:~/dbs/maria10.1$ sudo perf record -ag sleep 20
[ perf record: Woken up 32 times to write data ]
[ perf record: Captured and wrote 8.503 MB perf.data (~371488 samples) ]
openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio
...
                        |--1.40%-- sys_select
                        |          system_call_fastpath
                        |          __select
                        |          |
                        |          |--71.70%-- srv_conc_enter_innodb(trx_t*)
                        |          |          ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
                        |          |          handler::index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function)
                        |          |          handler::ha_index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function)
                        |          |          join_read_const(st_join_table*)
                        |          |          join_read_const_table(THD*, st_join_table*, st_position*) [clone .isra.301]
                        |          |          make_join_statistics(JOIN*, List<TABLE_LIST>&, st_dynamic_array*)
                        |          |          JOIN::optimize_inner()
                        |          |          JOIN::optimize()
                        |          |          mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_o
                        |          |          handle_select(THD*, LEX*, select_result*, unsigned long)
                        |          |          execute_sqlcom_select(THD*, TABLE_LIST*)
                        |          |          mysql_execute_command(THD*)
                        |          |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                        |          |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                        |          |          do_command(THD*)
                        |          |          do_handle_one_connection(THD*)
                        |          |          handle_one_connection
                        |          |          start_thread
...
It spends 71.7% of the time waiting to enter the InnoDB queue (in this limited InnoDB concurrency case) while diving into the index during query optimization... This time is spent in the "statistics" stage, for a single-row SELECT by PRIMARY KEY. Now, go conclude this without a backtrace of calls or some profiler.

by Valeriy Kravchuk (noreply@blogger.com) at January 29, 2017 02:47 PM

January 27, 2017

Peter Zaitsev

Percona Software News and Roadmap Update with CEO Peter Zaitsev: Q1 2017


This blog post is a summary of the webinar Percona Software News and Roadmap Update – Q1 2017 given by Peter Zaitsev on January 12, 2017.

Over the last few months, I’ve had the opportunity to meet and talk with many of Percona’s customers. I love these meetings, and I always get a bunch of questions about what we’re doing, what our plans are and what releases are coming.

I’m pleased to say there is a great deal going on at Percona, and I thought giving a quick talk about our current software and services, along with our plans, would provide a simple reference for many of these questions.

A full recording of this webinar, along with the presentation slide deck, can be found here.

Percona Solutions and Services

Let me start by laying out Percona’s company purpose:

To champion unbiased open source database solutions.

What does this mean? It means that we write software to offer you better solutions, and we use the best of what software and technology exist in the open source community.

Percona stands by a set of principles that we feel define us as a company, and are a promise to our customers:

  • 100% free and open source software
  • Focused on finding solutions to maximize your success
  • Open source database strategy consulting and implementation
  • No vendor lock-in required

We offer trusted and unbiased expert solutions, support and resources in a broad software ecosystem.

We also have specialization options for PaaS, IaaS, and SaaS solutions like Amazon Web Services, OpenStack, Google Cloud Platform, OpenShift, Ceph, Docker and Kubernetes.

Percona’s immediate business focus includes building long-term partnership relationships through support and managed services.

The next few sections detail our current service offerings, with some outlook on our plans.

98% Customer Satisfaction Rating

Over the last six months, Percona has consistently maintained a 98% Customer Satisfaction Rating!

Customer Success Team

Our expanded Customer Success Team is here to ensure you’re getting the most out of your Percona Services Subscription.

Managed Services Best Practices

  • Unification based on best practices
  • Organization changes to offer more personal service
  • Increased automation

Ongoing Services


Consulting and Training. Our consulting and training services are available to assist you with whatever project or staff needs you have.

  • Onsite and remote
  • 4 hours to full time (weeks or months)
  • Project and staff augmentation

Advanced HA Included with Enterprise and Premier Support. Starting this past Spring, we included advanced high availability (HA) support as part of our Enterprise and Premier support tiers. This advanced support includes coverage for:

  • Percona XtraDB Cluster
  • MariaDB Galera Cluster
  • Galera Cluster for MySQL
  • Upcoming MySQL group replication
  • Upcoming MySQL InnoDB Cluster

Enterprise Wide Support Agreements. Our new Enterprise Wide Support option allows you to buy per-environment support coverage that covers all of the servers in your environment, rather than on a per-server basis. This method of support can save you money, because it:

  • Covers both “MySQL” and “MongoDB”
  • Means you don’t have to count servers
  • Provides highly customized coverage


Simplified Support Pricing. Get easy-to-understand support pricing quickly.

To discuss how Percona Support can help your business, please call us at +1-888-316-9775 (USA),
+44 203 608 6727 (Europe), or have us contact you.

New Percona Online Store – Easy to Buy, Pay Monthly

Percona Software

Below are the latest and upcoming features in Percona’s software. All of Percona’s software adheres to the following principles:

  • 100% free and open source
  • No restricted “Enterprise” version
  • No “open core”
  • No BS-licensing (BSL)

Percona Server for MySQL 5.7

Overview

  • 100% Compatible with MySQL 5.7 Community Edition
  • 100% Free and Open Source
  • Includes Alternatives to Many MySQL Enterprise Features
  • Includes TokuDB Storage Engine
  • Focus on Performance and Operational Visibility

Latest Improvements

Features about to be Released 

  • Integration of TokuDB and Performance Schema
  • MyRocks integration in Percona Server
  • MySQL Group Replication
  • Starting to look towards MySQL 8

Percona XtraBackup 2.4

Overview

  • #1 open source binary hot backup solution for MySQL
  • Alternative to MySQL Enterprise backup
  • Parallel backups, incremental backups, streaming, encryption
  • Supports MySQL, MariaDB, Percona Server

New Features

  • Support SHA256 passwords and secure connection to server
  • Improved Security (CVE-2016-6225)
  • Wrong passphrase detection

Percona Toolkit

Overview

  • “Swiss Army Knife” of tools
  • Helps DBAs be more efficient
  • Helps DBAs make fewer mistakes
  • Supports MySQL, MariaDB, Percona Server, Amazon RDS MySQL

New Features

  • Improved fingerprinting in pt-query-digest
  • Pause file for pt-online-schema-change
  • Provide information about transparent huge pages

Coming Soon

  • Working towards Percona Toolkit 3.0 release
  • Comprehensive support for MongoDB
  • New tools are now implemented in Go

Percona Server for MongoDB 3.2

Overview

  • 100% compatible with MongoDB 3.2 Community Edition
  • 100% open source
  • Alternatives for many MongoDB Enterprise features
  • MongoRocks (RocksDB) storage engine
  • Percona Memory Engine

New

  • Percona Server for MongoDB 3.2 – GA
  • Support for MongoRocks storage engine
  • PerconaFT storage engine deprecated
  • Implemented Percona Memory Engine

Coming Soon

  • Percona Server for MongoDB 3.4
  • Fully compatible with MongoDB 3.4 Community Edition
  • Updated RocksDB storage engine
  • Universal hot backup for WiredTiger and MongoRocks
  • Profiling rate limiting (query sampling)

Percona Memory Engine for MongoDB

Benchmarks

Percona Memory Engine for MongoDB® is a 100 percent open source in-memory storage engine for Percona Server for MongoDB.

Based on the in-memory storage engine used in MongoDB Enterprise Edition, WiredTiger, Percona Memory Engine for MongoDB delivers extremely high performance and reduced costs for a variety of use cases, including application cache, sophisticated data manipulation, session management and more.

Below are some benchmarks that we ran to demonstrate Percona Memory Engine’s performance.


WiredTiger vs MongoRocks – write intensive

Percona XtraDB Cluster 5.7

Overview

  • Based on Percona Server 5.7
  • Easiest way to bring HA to your MySQL environment
  • Designed to work well in the cloud
  • Multi-master replication with no conflicts
  • Automatic node provisioning for auto-scaling and self-healing

Goals

  • Brought PXC development in-house to serve our customers better
  • Provide a complete clustering solution, not a set of LEGO pieces
  • Improve usability and ease of use
  • Focus on quality

Highlights

  • Integrated cluster-aware load balancer with ProxySQL
  • Instrumentation with Performance Schema
  • Support for data at rest encryption (InnoDB tablespace encryption)
  • Your data is safe by default with “strict mode” – prevents using features that do not work correctly
  • Integration with Percona Monitoring and Management (PMM)

New in Percona XtraDB Cluster 5.7

  • One option to secure all network communication: pxc-encrypt-cluster-traffic
  • Zero downtime maintenance with ProxySQL and Maintenance Mode
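For illustration, a minimal my.cnf sketch of the single-option encryption setup might look like the following. Apart from pxc-encrypt-cluster-traffic itself, the SSL option names are standard MySQL settings and the certificate paths are placeholders you would need to adapt to your environment:

```ini
# Hypothetical my.cnf fragment -- certificate paths are placeholders.
[mysqld]
# One switch to secure SST, IST, and Galera replication traffic:
pxc-encrypt-cluster-traffic = ON
# Standard MySQL SSL settings the cluster nodes can share:
ssl-ca   = /etc/mysql/certs/ca.pem
ssl-cert = /etc/mysql/certs/server-cert.pem
ssl-key  = /etc/mysql/certs/server-key.pem
```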

Percona Monitoring and Management

Overview

  • Comprehensive database-focused monitoring
  • 100% open source, roll-your-own solution
  • Easy to install and use
  • Supports MySQL and MongoDB
  • Version 1.0 focuses on trending and query analytics
  • Management features to come

Examples of PMM Screens

What queries are causing the load?


Why are they causing this load?


How to fix them:


System information:


What happens on OS and hardware level:


As well as the database level:


New in Percona Monitoring and Management

  • Continuing to improve and expand dashboards with every release
  • Includes Grafana 4.0 (with basic Alerting)
  • SSL support for server-agent communications
  • Supports authentication for server-agent communication
  • Added support for Amazon RDS
  • Reduced space consumption by using advanced compression

Coming Soon 

  • PMM server available as AMI and Virtual Appliance image
  • Better MongoDB dashboards
  • Continuing work on dashboard improvements
  • Query analytics application refinements
  • Short term continuing focus on monitoring functionality

Check out the Demo

Percona Live Open Source Database Conference 2017 is right around the corner!

The Percona Live Open Source Database Conference is the premier event for the diverse and active open source database community, as well as businesses that develop and use open source database software. The conference has a technical focus with an emphasis on the core topics of MySQL, MongoDB, PostgreSQL and other open source databases. Tackling subjects such as analytics, architecture and design, security, operations, scalability and performance, Percona Live provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs. This conference is an opportunity to network with peers and technology professionals by bringing together accomplished DBAs, system architects and developers from around the world to share their knowledge and experience – all to help you learn how to tackle your open source database challenges in a whole new way.

This conference has something for everyone!

Register now and get the early bird rate, but hurry: prices go up Jan 31st.

Sponsorship opportunities are available as well. Become a Percona Live Sponsor, download the prospectus here.


by Peter Zaitsev at January 27, 2017 10:06 PM

When MySQL Lies: Wrong seconds_behind_master with slave_parallel_workers > 0


In today’s blog, I will show an issue with seconds_behind_master that one of our clients faced when running slave_parallel_workers > 0. We found out that the reported seconds_behind_master from SHOW SLAVE STATUS was lying. To be more specific, I’m talking about bugs #84415 and #1654091.

The Issue

MySQL will not report the correct slave lag if you have slave_parallel_workers > 0. Let’s show it in practice.

I’ll use MySQL Sandbox to spin up one master and two slaves on MySQL version 5.7.17, and sysbench to populate the database:

# Create sandboxes
make_replication_sandbox /path/to/mysql/5.7.17
# Create table with 1.5M rows on it
sysbench --test=/usr/share/sysbench/tests/db/oltp.lua --mysql-host=localhost --mysql-user=msandbox --mysql-password=msandbox --mysql-socket=/tmp/mysql_sandbox20192.sock --mysql-db=test --oltp-table-size=1500000 prepare
# Add slave_parallel_workers=5 and slave_pending_jobs_size_max=1G on node1
echo "slave_parallel_workers=5" >> node1/my.sandbox.cnf
echo "slave_pending_jobs_size_max=1G" >> node1/my.sandbox.cnf
node1/restart

Monitor Replication lag via SHOW SLAVE STATUS:

for i in {1..1000};
do
    (
        node1/use -e "SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master" | awk '{print "Node1: " $2}' &
        sleep 0.1 ;
        node2/use -e "SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master" | awk '{print "Node2: " $2}' &
    );
    sleep 1;
done

On a separate terminal, DELETE some rows in the test.sbtest1 table on the master, and monitor the above output once the master completes the delete command:

DELETE FROM test.sbtest1 WHERE id > 100;

Here is a sample output:

master [localhost] {msandbox} (test) > DELETE FROM test.sbtest1 WHERE id > 100;
Query OK, 1499900 rows affected (46.42 sec)
. . .
Node1: 0
Node2: 0
Node1: 0
Node2: 48
Node1: 0
Node2: 48
Node1: 0
Node2: 49
Node1: 0
Node2: 50
. . .
Node1: 0
Node2: 90
Node1: 0
Node2: 91
Node1: 0
Node2: 0
Node1: 0
Node2: 0

As you can see, node1 (which is running with slave_parallel_workers = 5) doesn’t report any lag.

The Workaround

We can workaround this issue by querying performance_schema.threads:

SELECT PROCESSLIST_TIME FROM performance_schema.threads WHERE NAME = 'thread/sql/slave_worker' AND (PROCESSLIST_STATE IS NULL  or PROCESSLIST_STATE != 'Waiting for an event from Coordinator') ORDER BY PROCESSLIST_TIME DESC LIMIT 1;

Let’s modify our monitoring script, and use the above query to monitor the lag on node1:

for i in {1..1000};
do
    (
        node1/use -BNe "SELECT PROCESSLIST_TIME FROM performance_schema.threads WHERE NAME = 'thread/sql/slave_worker' AND (PROCESSLIST_STATE IS NULL  or PROCESSLIST_STATE != 'Waiting for an event from Coordinator') ORDER BY PROCESSLIST_TIME DESC LIMIT 1 INTO @delay; SELECT IFNULL(@delay, 0) AS 'lag';" | awk '{print "Node1: " $1}' &
        sleep 0.1 ;
        node2/use -e "SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master" | awk '{print "Node2: " $2}' &
    );
    sleep 1;
done

master [localhost] {msandbox} (test) > DELETE FROM test.sbtest1 WHERE id > 100;
Query OK, 1499900 rows affected (45.21 sec)
Node1: 0
Node2: 0
Node1: 0
Node2: 0
Node1: 45
Node2: 45
Node1: 46
Node2: 46
. . .
Node1: 77
Node2: 77
Node1: 78
Node2: 79
Node1: 0
Node2: 80
Node1: 0
Node2: 81
Node1: 0
Node2: 0
Node1: 0
Node2: 0

Please note that in our query to performance_schema.threads, we are filtering PROCESSLIST_STATE “NULL” and “!= Waiting for an event from Coordinator”. The correct state is “Executing Event”, but it seems like it doesn’t correctly report that state (#84655).
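The filter in the workaround query can also be expressed outside SQL. Below is a minimal Python sketch (not from the original post; the worker_lag helper and the row-tuple layout are invented for illustration) applying the same logic to rows fetched from performance_schema.threads:

```python
# Each row mimics (NAME, PROCESSLIST_STATE, PROCESSLIST_TIME) from
# performance_schema.threads.
IDLE_STATE = "Waiting for an event from Coordinator"

def worker_lag(rows):
    """Largest PROCESSLIST_TIME among busy slave_worker threads,
    or 0 when every worker is idle (the IFNULL(@delay, 0) fallback)."""
    times = [
        t for name, state, t in rows
        if name == "thread/sql/slave_worker"
        and (state is None or state != IDLE_STATE)   # NULL or not idle
    ]
    return max(times, default=0)
```

As in the SQL version, a worker whose state is NULL still counts as busy, which is exactly why the state filter has to allow NULL.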

Summary

MySQL parallel replication is a nice feature, but we still need to make sure we are aware of any potential issues it might bring. Most monitoring systems use the output of SHOW SLAVE STATUS to verify whether or not the slave is lagging behind the master. As shown above, it has its caveats.

As always, we should test, test and test again before implementing any change like this in production!

by Marcelo Altmann at January 27, 2017 07:03 PM

Oli Sennhauser

Is your MySQL software Cluster ready?

When we do Galera Cluster consulting we always discuss with the customer if his software is Galera Cluster ready. This basically means: Can the software cope with the Galera Cluster specifics?

If it is a software product developed outside of the company, we recommend asking the software vendor whether the software supports Galera Cluster or not.

We typically see 3 different answers:

  • We do not know. Then they are at least honest.
  • Yes we do support Galera Cluster. Then they hopefully know what they are talking about but you cannot be sure and should test carefully.
  • No we do not. Then they most probably know what they are talking about.

If the software is developed in-house it becomes a bit more tricky because the responsibility for this statement has to be taken by you or some of your colleagues.

Thus it is good to know the characteristics and limitations of a Cluster like Galera Cluster for MySQL.

Most of the Galera restrictions and limitations can be found here.

DDL statements cause TOI operations

DDL and DCL statements (like CREATE, ALTER, TRUNCATE, OPTIMIZE, DROP, GRANT, REVOKE, etc.) are executed by default in Total Order Isolation (TOI) by the Online Schema Upgrade (OSU) method. To achieve this schema upgrade consistently Galera does a global Cluster lock.

It is obvious that those DDL operations should be short and infrequent so they do not constantly block your Galera Cluster. Changing your table structure must therefore be planned and done carefully to avoid impacting your daily business operations.

But there are also some not so obvious DDL statements causing TOI operations (and Cluster locks).

  • TRUNCATE TABLE ... This operation is NOT a DML statement (like DELETE) but a DDL statement, and thus causes a TOI operation with a Cluster lock.
  • CREATE TABLE IF NOT EXISTS ... This operation is clearly a DDL statement, but one might think it does NOT cause a TOI operation if the table already exists. This is wrong: the statement always causes a TOI operation, whether the table exists or not. If you run this statement very frequently, it can potentially cause trouble for your Galera Cluster.
  • CREATE TABLE yournameit_tmp ... The intention is clear: the developer wants a temporary table. But this is NOT a temporary table, just a normal table whose name ends in _tmp. So it causes a TOI operation as well. What you should do instead is create a real temporary table: CREATE TEMPORARY TABLE yournameit_tmp ... This DDL statement is executed only locally and does not cause a TOI operation.
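The pitfalls above boil down to a simple rule of thumb. As a rough illustration, here is a naive Python sketch (the causes_toi helper is invented, and crude prefix checks stand in for real SQL parsing) that flags the statements discussed above:

```python
# Naive classifier: does this statement trigger a TOI operation
# (and hence a Galera Cluster lock)? Illustration only -- not a parser.
def causes_toi(stmt):
    s = stmt.strip().upper()
    if s.startswith("TRUNCATE TABLE"):
        return True        # DDL, unlike DELETE
    if s.startswith("CREATE TEMPORARY TABLE"):
        return False       # executed locally only, no TOI
    if s.startswith("CREATE TABLE"):
        return True        # includes IF NOT EXISTS and *_tmp names
    return False
```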

How to check?

You can check the impact of this problem with the following sequence of statements:

mysql> SHOW GLOBAL STATUS LIKE 'Com_create_table%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Com_create_table | 4     |
+------------------+-------+

mysql> CREATE TABLE t1_tmp (id INT);

mysql> SHOW GLOBAL STATUS LIKE 'Com_create_table%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Com_create_table | 5     | --> Also changes on the Slave nodes!
+------------------+-------+

mysql> CREATE TEMPORARY TABLE t2_tmp (id INT);

mysql> SHOW GLOBAL STATUS LIKE 'Com_create_table%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Com_create_table | 6     | --> Does NOT change on the Slave nodes!
+------------------+-------+

mysql> CREATE TABLE IF NOT EXISTS t1_tmp (id INT);

mysql> SHOW GLOBAL STATUS LIKE 'Com_create_table%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Com_create_table | 7     | --> Also changes on the Slave nodes!
+------------------+-------+

Find out in advance

If you want to find out before migrating to Galera Cluster if you are hit by this problem or not you can either run:

mysql> SHOW GLOBAL STATUS
WHERE variable_name LIKE 'Com_create%'
   OR variable_name LIKE 'Com_alter%'
   OR variable_name LIKE 'Com_drop%'
   OR variable_name LIKE 'Com_truncate%'
   OR variable_name LIKE 'Com_grant%'
   OR variable_name LIKE 'Com_revoke%'
   OR variable_name LIKE 'Com_optimize%'
   OR variable_name LIKE 'Com_rename%'
   OR variable_name LIKE 'Uptime'
;
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Com_create_db        | 2     |
| Com_create_table     | 6     |
| Com_optimize         | 1     |
| Uptime               | 6060  |
+----------------------+-------+

Or if you want to know exactly who was running the query from the PERFORMANCE_SCHEMA:

SELECT user, host, SUBSTR(event_name, 15) AS event_name, count_star
  FROM performance_schema.events_statements_summary_by_account_by_event_name
 WHERE count_star > 0 AND
     ( event_name LIKE 'statement/sql/create%'
    OR event_name LIKE 'statement/sql/alter%'
    OR event_name LIKE 'statement/sql/drop%'
    OR event_name LIKE 'statement/sql/rename%'
    OR event_name LIKE 'statement/sql/grant%'
    OR event_name LIKE 'statement/sql/revoke%'
    OR event_name LIKE 'statement/sql/optimize%'
    OR event_name LIKE 'statement/sql/truncate%'
    OR event_name LIKE 'statement/sql/repair%'
    OR event_name LIKE 'statement/sql/check%'
     )
;
+------+-----------+--------------+------------+
| user | host      | event_name   | count_star |
+------+-----------+--------------+------------+
| root | localhost | create_table |          4 |
| root | localhost | create_db    |          2 |
| root | localhost | optimize     |          1 |
+------+-----------+--------------+------------+

If you need help to make your application Galera Cluster ready we will be glad to assist you.

by Shinguz at January 27, 2017 05:19 PM

January 26, 2017

Peter Zaitsev

Percona Live Featured Tutorial with Frédéric Descamps — MySQL InnoDB Cluster & Group Replication in a Nutshell: Hands-On Tutorial

Percona Live Featured Tutorial

Percona Live Featured TutorialWelcome to another post in the series of Percona Live featured tutorial speakers blogs! In these blogs, we’ll highlight some of the tutorial speakers that will be at this year’s Percona Live conference. We’ll also discuss how these tutorials can help you improve your database environment. Make sure to read to the end to get a special Percona Live 2017 registration bonus!

In this Percona Live featured tutorial, we’ll meet Frédéric Descamps, MySQL Community Manager at Oracle. Frédéric is probably better known in the community as “LeFred” (Twitter: @lefred)! His tutorial is MySQL InnoDB Cluster and Group Replication in a Nutshell: Hands-On Tutorial (along with  Part 2). Frédéric is delivering this talk with Percona’s MySQL Practice Manager Kenny Gryp. Attendees will get their hands on virtual machines and migrate standard Master/Slave architecture to the new MySQL InnoDB Cluster (native Group Replication) with minimal downtime. I had a chance to speak with Frédéric and learn a bit more about InnoDB Cluster and Group Replication:

Percona: How did you get into database technology? What do you love about it?

Frédéric: I started with SQL on VMS and DBASE during my IT courses. I wrote my thesis on SQLServer replication. To be honest, the implementation at the time wasn’t particularly good. At the same time, I enjoyed hacking a Perl module that allowed you to run SQL queries against plain text files. Then I worked as a Progress developer for MFG/Pro (a big customized ERP). When I decided to work exclusively for an open source company, I was managing more and more customers’ databases in MySQL (3.23 and 4.x), so my colleagues and I took MySQL training. It was still MySQL AB at that time. I passed my MySQL 4.1 Core and Professional Certification, and I ended up delivering MySQL Training for MySQL AB too. Some might also remember that I worked for a company called Percona for a while, before moving on to Oracle!

Percona: Your and Kenny’s tutorial is called “MySQL InnoDB Cluster & Group Replication in a nutshell: hands-on tutorial.” Why would somebody want to migrate to this technology?

Frédéric: The main reason to migrate to MySQL InnoDB Cluster is to achieve high availability (HA) easily for your MySQL database, while still maintaining performance.

Thanks to MySQL Shell’s adminAPI, you can create a cluster in less than five minutes. All the management can be done remotely, with just some shell commands. Additionally, for Group Replication our engineers have leveraged existing standard MySQL infrastructures (GTIDs, Multi-Source Replication, Multi-threaded Slave Applier, Binary Logs, etc.), which are already well known by the majority of MySQL DBAs.

Percona: How can moving to a high availability environment help businesses?

Frédéric: Time is money! Many businesses can’t afford downtime anymore. Having a reliable HA solution is crucial for businesses on the Internet. Expectations have changed: users want this to be natively supported inside the database technology versus externally. For example, when I started to work in IT, it was very common to have several maintenance windows during the night. But currently, and almost universally, the customer base is spread worldwide. Any time is during business hours somewhere in the world!

Percona: What do you want attendees to take away from your tutorial session? Why should they attend?

Frédéric: I think that current HA solutions are too complex to setup and manage. They use external tools. In our tutorial, Kenny and I will demonstrate how MySQL InnoDB Cluster, being an internal implementation, is extremely easy to use. We will also cover some scenarios where things go wrong, and how to deal with them. Performance and ease-of-use were two key considerations in the design of InnoDB Cluster.

Percona: What are you most looking forward to at Percona Live?

Frédéric: Like every time I’ve attended, my main goal is to bring to the audience all the new and amazing stuff we are implementing in MySQL. MySQL has evolved quickly these last few years, and we don’t really plan to slow down. I also really enjoy the feedback from users and other MySQL professionals. This helps focus us on what really matters for our users. And finally, it’s a great opportunity to re-connect with ex-colleagues and friends.

You can find out more about Frédéric Descamps and his work with InnoDB Cluster at his Twitter handle @lefred. Register for Percona Live Data Performance Conference 2017, and see Frédéric and Kenny present their MySQL InnoDB Cluster and Group Replication in a Nutshell: Hands-On Tutorial talk. Use the code FeaturedTutorial and receive $30 off the current registration price!

Percona Live Data Performance Conference 2017 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 24-27, 2017 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

by Dave Avery at January 26, 2017 11:58 PM

Performance Schema Benchmarks: OLTP RW

Performance Schema Benchmarks

In this blog post, we’ll look at Performance Schema benchmarks for OLTP Read/Write workloads.

I am in love with Performance Schema and talk a lot about it. Performance Schema is a revolutionary MySQL troubleshooting instrument, but earlier versions had performance issues. Many of these issues are fixed now, and the default options work quickly and reliably. However, there is no free lunch! It is expected that the more instruments you turn ON, the more overhead you’ll have.

The advice I give my customers is that when in doubt, only turn ON the instruments that are required to troubleshoot your issue. Many of them ask: what exactly are the overhead costs for one instrumentation or another? I believe the best answer is “test on your system!” No generic benchmark can exactly repeat a workload on your site. But while I was working on the “OpenSource Databases on Big Machines” project, I decided to test the performance of Performance Schema as well.

I only tested a Read/Write workload. I used the same fast machine (144 CPU cores), the same MySQL options and the same SysBench commands that I described in this post. The option innodb_flush_method was changed to O_DIRECT, because it’s more reasonable for real-life workloads. I also upgraded the MySQL Server version to Oracle’s MySQL 5.7.17. The reason for the upgrade was to test if the issue described in this post is repeatable with the latest Oracle MySQL server version. But since I tested Performance Schema, the effect on Percona Server for MySQL should be the same.

I tested nine different scenarios:

  1. “All disabled”: Performance Schema is ON, but all instruments and consumers are disabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
  2. “All enabled”: Performance Schema is ON, and all instruments and consumers are enabled.
    update setup_instruments set enabled='yes';
    update setup_consumers set enabled='yes';
  3. “Default”: Performance Schema is ON, and only default instruments and consumers are enabled.
  4. “MDL only”: only Metadata Lock instrumentation is enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name='wait/lock/metadata/sql/mdl';
  5. “Memory only”: only Memory instrumentation enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name like 'memory%';
  6. “Stages and Statements”: only Stages and Statements instrumentation enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name like 'statement%';
    update setup_consumers set enabled='yes' where name like 'events_statements%';
    update setup_instruments set enabled='yes' where name like 'stage%';
    update setup_consumers set enabled='yes' where name like 'events_stages%';
  7. “Stages only”: only Stages instrumentation enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name like 'stage%';
    update setup_consumers set enabled='yes' where name like 'events_stages%';
  8. “Statements only”: only Statements instrumentation enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name like 'statement%';
    update setup_consumers set enabled='yes' where name like 'events_statements%';
  9. “Waits only”: only Waits instrumentation enabled.
    update setup_consumers set enabled='no';
    update setup_instruments set enabled='no';
    update setup_consumers set enabled='yes' where name= 'global_instrumentation';
    update setup_consumers set enabled='yes' where name= 'thread_instrumentation';
    update setup_instruments set enabled='yes' where name like 'wait%' ;
    update setup_consumers set enabled='yes' where name like 'events_waits%';
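All nine scenarios follow the same pattern: disable everything, re-enable global and thread instrumentation, then enable the instruments and consumers of interest. As an illustration, a small Python helper (scenario_sql is invented here; it covers only the pattern-based scenarios, not exact-name ones like "MDL only") could generate these setup statements:

```python
# Generate the setup SQL for an "enable only X" Performance Schema
# scenario from instrument/consumer name patterns.
def scenario_sql(instrument_likes=(), consumer_likes=()):
    stmts = [
        "update setup_consumers set enabled='no';",
        "update setup_instruments set enabled='no';",
        "update setup_consumers set enabled='yes' where name= 'global_instrumentation';",
        "update setup_consumers set enabled='yes' where name= 'thread_instrumentation';",
    ]
    for p in instrument_likes:
        stmts.append(f"update setup_instruments set enabled='yes' where name like '{p}';")
    for p in consumer_likes:
        stmts.append(f"update setup_consumers set enabled='yes' where name like '{p}';")
    return stmts

# Scenario 9 ("Waits only") reproduced from its two patterns:
waits_only = scenario_sql(["wait%"], ["events_waits%"])
```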

Here are the overall results.

As you can see, some instrumentation only slightly affects performance, while others affect it a lot. I created separate graphs to make the picture clearer.

As expected, enabling all instrumentation makes performance lower:

Does this mean to use Performance Schema, you need to start the server with it ON and then disable all instruments? No! The default options have very little effect on performance:

The same is true for Metadata Locks, Memory and Statements instrumentation:

Regarding statements, I should note that I used prepared statements (which are instrumented in version 5.7). But it makes sense to repeat the tests without prepared statements.

The Stages instrumentation starts affecting performance:

However, the slowdown is reasonable and it happens only after we reach 32 concurrent threads. It still provides great insights on what is happening during query execution.

The real performance killer is Waits instrumentation:

It affects performance close to the same way as all instruments ON.

Conclusion

Using Performance Schema with the default options, Memory, Metadata Locks and Statements instrumentation doesn’t have a great impact on read-write workload performance. You might notice slowdowns with Stages instrumentation after reaching 32 actively running parallel connections. The real performance killer is Waits instrumentation. And even with it on, you will start to notice a performance drop only after 10,000 transactions per second.


by Sveta Smirnova at January 26, 2017 07:19 PM

Jean-Jerome Schmidt

Using ClusterControl to Deploy and Configure ProxySQL on top of MySQL Replication

A proxy layer is an important building block for any highly available database environment. HAProxy has been available for a long time, as a generic TCP load balancer. We later added support for MaxScale, an SQL-aware load balancer. We’re today happy to announce support for ProxySQL, which also belongs to the family of SQL-aware load balancers. We blogged about ProxySQL sometime back, in case you would like to read more about it.

As from ClusterControl 1.4, we now support deployment and configuration of ProxySQL. It is done from a simple wizard:

You need to pick a host on which ProxySQL will be installed; you can choose one of the existing hosts in your cluster, or you can type in a new host.

In the next step, you need to pick usernames and passwords for ProxySQL’s administration and monitoring users.

You also need to define application users. ProxySQL works in the middle, between application and backend MySQL servers, so the database users need to be able to connect from the ProxySQL IP address. The proxy supports connection multiplexing, so not all connections opened by applications onto ProxySQL will result in connections opened to the backend databases. This can seriously increase performance of MySQL - it’s well known that MySQL scales up to some number of concurrent connections (exact value depends mostly on the MySQL version). Now, using ProxySQL, you can push thousands of connections from the application. ProxySQL opens a much lower number of connections to MySQL. It takes advantage of the fact that, for most of the time, connections wait for the application and do not process any query: they can be reused. This feature requires that the application authenticates against ProxySQL, so you need to add all users that you use in your application to ProxySQL.
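To illustrate why multiplexing lets a few backend connections serve many application connections, here is a toy Python sketch (the BackendPool class is invented and vastly simplified compared to ProxySQL itself): a backend connection is borrowed only for the duration of a query and returned immediately, so mostly-idle application sessions can share a very small pool:

```python
# Toy model of connection multiplexing: many application "sessions"
# share a small pool of backend connections, borrowed per query.
class BackendPool:
    def __init__(self, size):
        self.free = [f"backend-{i}" for i in range(size)]
        self.in_use = 0
        self.peak_in_use = 0

    def run_query(self, query):
        conn = self.free.pop()        # borrow only while the query runs
        self.in_use += 1
        self.peak_in_use = max(self.peak_in_use, self.in_use)
        result = f"{conn}: {query}"   # stand-in for real query execution
        self.in_use -= 1
        self.free.append(conn)        # returned immediately: reusable
        return result

pool = BackendPool(size=2)
# 1000 application connections each issue one query; because they are
# idle between queries, two backend connections are more than enough.
for i in range(1000):
    pool.run_query(f"SELECT {i}")
```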

You can either add existing database users, or you can create a new one from the UI - fill in username and password, pick what grants you want to assign to that user, and you are all set - ClusterControl will create this user for you in both MySQL and ProxySQL.

Finally, you need to pick which hosts you want to include in the ProxySQL configuration, and set some configuration variables like maximum allowed replication lag, maximum number of connections or weight.
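The wizard generates this configuration for you; an equivalent manual setup through the admin interface might look like the sketch below (hostnames and hostgroup numbers are made up; hostgroup 0 is assumed to be the writer, hostgroup 1 the readers):

```sql
-- weight influences load distribution; max_replication_lag (seconds)
-- shuns a slave that falls too far behind; max_connections caps
-- backend connections per server.
INSERT INTO mysql_servers
  (hostgroup_id, hostname, port, weight, max_connections, max_replication_lag)
VALUES
  (0, '10.0.0.11', 3306, 1000, 500, 0),
  (1, '10.0.0.12', 3306,  900, 500, 10),
  (1, '10.0.0.13', 3306,  900, 500, 10);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```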

At the end of the form you have to answer an important question: do you use implicit transactions or not? This question is mandatory, and based on your answer ClusterControl will configure ProxySQL. If you don't use implicit transactions (by that we mean that you don't run with SET autocommit=0, where every statement implicitly starts a transaction), your read-only traffic will be sent to slaves while your writes will be sent to a writable master. If you do use implicit transactions, the only safe way to configure ProxySQL is to send all traffic to the master. There will be no scale-out, only high availability. Of course, you can always redirect some particular queries to the slaves, but you will have to do it on your own - ClusterControl cannot do it for you. This will change in the future when ProxySQL adds support for handling this scenario, but for now we have to stick with what we have.
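For the read/write split case, a common way to express the routing in ProxySQL is with query rules along these lines (a simplified sketch, again assuming hostgroup 0 for the writer and hostgroup 1 for the readers):

```sql
-- Rule 100 keeps SELECT ... FOR UPDATE on the writer (it takes locks);
-- rule 200 sends all other SELECTs to the readers. Anything that matches
-- no rule goes to the user's default_hostgroup (the writer).
INSERT INTO mysql_query_rules
  (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES
  (100, 1, '^SELECT .* FOR UPDATE', 0, 1),
  (200, 1, '^SELECT ',              1, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
```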

ProxySQL will also work together with the new automatic failover mechanism (for master-slave replication setups) added in ClusterControl 1.4.0 - once failover happens, ProxySQL will detect the new writable master and route writes to it. It all happens automatically, without any need for the user to take action.
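The mechanism behind this is ProxySQL's replication hostgroup monitoring: once writer and reader hostgroups are registered (the numbers below are assumed), ProxySQL's monitor checks the read_only flag on each backend and moves the newly writable master into the writer hostgroup after a failover:

```sql
-- Tie hostgroups 0 (writer) and 1 (readers) together so ProxySQL
-- reassigns servers between them based on each backend's read_only flag.
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
VALUES (0, 1, 'cluster1');
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```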

Once deployed, you can use ClusterControl to monitor your ProxySQL installation. You can check the status of hosts in all defined hostgroups. You can also check some metrics related to hostgroups - used connections, free connections, errors, number of queries executed, amount of data sent and received, latency. Below you can find graphs related to some of ProxySQL metrics - active transactions, data sent and received, memory utilization, number of connections and some others. This gives you nice insight into how ProxySQL operates and helps to catch any potential issues with your proxy layer.
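Most of these metrics come from ProxySQL's own stats tables, which you can also query directly from the admin interface, for example:

```sql
-- Per-backend connection pool status and traffic counters
SELECT hostgroup, srv_host, status, ConnUsed, ConnFree, ConnERR,
       Queries, Bytes_data_sent, Bytes_data_recv
FROM stats.stats_mysql_connection_pool;
```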

So, give it a try and tell us what you think. We plan on adding more management features in the next release, and would love to hear what you’d like to see.


by krzysztof at January 26, 2017 11:14 AM