Planet MariaDB

January 19, 2017

Peter Zaitsev

Open Source Databases on Big Machines: Does READ COMMITTED Scale on 144 Cores?


In the second post in my series on open source databases on big machines, we’ll look at whether READ COMMITTED scales with multiple cores.

The default transaction level for InnoDB is REPEATABLE READ. A more permissive level is READ COMMITTED, and is known to work well. While the REPEATABLE READ level maintains the transaction history up to the start of the transaction, READ COMMITTED maintains the transaction history up to the start of the current statement. Peter Zaitsev described the differences between how these modes are handled in this blog post. Both can theoretically cause performance slowdowns, but READ COMMITTED is usually seen as fast-working – at least on a typical MySQL machine (owned by Percona):


The default transaction isolation mode for PostgreSQL is also READ COMMITTED. Originally I wanted to use this mode for MySQL tests as well. But when I tested on a machine with 144 cores, I found that after 36 threads REPEATABLE READ continued to scale while READ COMMITTED slowed down. It then got stuck at about 3x slower results for standard OLTP RW tests.
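To make the difference between the two levels concrete, here is a toy Python model of where each level takes its snapshot. It is not InnoDB code: the `Store` and `Transaction` classes and the logical commit counter are assumptions made purely for illustration, and the model follows the simplified description above, where REPEATABLE READ fixes its snapshot at the start of the transaction.

```python
class Store:
    """Keeps every committed version of a key, tagged with a logical commit number."""

    def __init__(self):
        self.clock = 0        # logical commit counter (illustrative, not a real LSN)
        self.versions = {}    # key -> list of (commit_ts, value), oldest first

    def commit(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def read_as_of(self, key, ts):
        # Return the newest version committed at or before ts.
        visible = [value for commit_ts, value in self.versions.get(key, []) if commit_ts <= ts]
        return visible[-1] if visible else None


class Transaction:
    def __init__(self, store, isolation):
        self.store = store
        self.isolation = isolation
        self.start_ts = store.clock   # snapshot point for REPEATABLE READ

    def select(self, key):
        if self.isolation == "REPEATABLE READ":
            snapshot_ts = self.start_ts       # same snapshot for every statement
        else:
            snapshot_ts = self.store.clock    # READ COMMITTED: new snapshot per statement
        return self.store.read_as_of(key, snapshot_ts)


store = Store()
store.commit("balance", 100)
rr = Transaction(store, "REPEATABLE READ")
rc = Transaction(store, "READ COMMITTED")
store.commit("balance", 200)          # another session commits a change

rr_value = rr.select("balance")       # still sees the value from transaction start
rc_value = rc.select("balance")       # sees the latest committed value
```

In this model a READ COMMITTED transaction picks a new snapshot for every statement, while REPEATABLE READ does it once per transaction. That extra per-statement work is one plausible place for contention to show up at high thread counts, although the toy model says nothing about where the real server spends its time.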


Dimitri Kravtchuk wrote about performance issues with READ COMMITTED in 2015, but he tested with 40 cores at that time. My tests show that there is a huge difference after 40 cores.

I tested this originally with Percona Server 5.7.15 and recently re-tested with Oracle’s MySQL versions 5.6.35 and 5.7.17. I confirmed that the bug exists in these versions as well, and reported it. I hope the Oracle MySQL Team fixes it. The good news is that while 5.6 stopped scaling after 16 threads, 5.7 improves this to 36 threads.

Results for 5.6.35:


Results for 5.7.17:

Machine details:

PostgreSQL Professional and Freematiq machine (tests for MySQL 5.6.35, Percona 5.7.15 and MySQL 5.7.17 servers):

Processors: physical = 4, cores = 72, virtual = 144, hyperthreading = yes
Memory: 3.0T
Disk speed: about 3K IOPS
OS: CentOS 7.1.1503
File system: XFS

Percona machine (test for Percona 5.7.15 server):

Processors: physical = 2, cores = 12, virtual = 24, hyperthreading = yes
Memory: 251.9G
Disk speed: about 33K IOPS
OS: Ubuntu 14.04.5 LTS
File system: EXT4

Test: SysBench OLTP RW test, converted to use prepared statements, as described in this post.

MySQL Options: described in this post.











by Sveta Smirnova at January 19, 2017 09:12 PM

January 18, 2017

MariaDB AB

Security Vulnerability CVE-2016-6664 / CVE-2016-5617

rasmusjohansson Wed, 01/18/2017 - 13:23

During the fall there were a couple of vulnerabilities found that could be used for privilege escalations in conjunction with race conditions. These were:

  • CVE-2016-6662 MySQL Remote Root Code Execution / Privilege Escalation 0day

  • CVE-2016-6663 Privilege Escalation / Race Condition (also referred to as CVE-2016-5616)

  • CVE-2016-6664 Root Privilege Escalation (also referred to as CVE-2016-5617)

I’ve published two blog posts about these vulnerabilities before:

CVE-2016-6662 and CVE-2016-6663 were fixed during the fall, and versions of MariaDB containing the fixes have been released. As stated in the latter blog post, the root privilege escalation vulnerability CVE-2016-6664 was not exploitable by itself: an attacker would first need to obtain shell access through some other vulnerability. But a final fix was still needed to completely shut the door on this last related vulnerability.

The CVE-2016-6664 vulnerability makes use of a weak point in the way the mysqld_safe script handled the creation of the error log file, through which root privileges could be obtained.
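To illustrate the general class of problem, here is a small Python sketch. It rests on an assumption that goes beyond the text above: that the weak point is of the kind where a privileged process opens a log path that a less privileged user has replaced with a symbolic link. The function `open_log_safely` and the file names are hypothetical, and this is not the fix shipped in MariaDB Server or MySQL; it only shows one defensive pattern, refusing to follow a symlink when opening the log. It relies on `os.O_NOFOLLOW`, which is available on Unix-like systems.

```python
import os
import tempfile


def open_log_safely(path):
    """Open a log file for appending, refusing to follow a symlink at `path`."""
    flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_NOFOLLOW
    fd = os.open(path, flags, 0o640)   # raises OSError if `path` is a symlink
    return os.fdopen(fd, "a")


def demo():
    """Plant a symlink where the log should be and report what happens."""
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "sensitive_file")
        open(target, "w").close()
        log_path = os.path.join(workdir, "error.log")
        os.symlink(target, log_path)   # what an attacker with shell access might do
        try:
            open_log_safely(log_path).close()
        except OSError:
            return "refused"
        return "followed"


result = demo()
```

A plain `open(log_path, "a")` in the same situation would quietly write to the symlink's target, which is why a privileged script has to be careful about paths that other users can modify.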

Oracle made an attempt to fix this already in November, but the fix was unfortunately half-baked and made the vulnerability slightly less exploitable, but didn’t completely get rid of it. This and other issues in the mysqld_safe script were pointed out by Red Hat’s Security Team. Oracle has since then opened CVE-2017-3312 for the missing pieces of CVE-2016-6664 and fixed them.

In MariaDB Server, we’ve now implemented our own fix for the vulnerability, which we believe completely removes the possibility to make use of it.

CVE-2016-6664 is fixed as of the following versions of MariaDB Server:

Please upgrade to these versions (or newer) to be protected against CVE-2016-6664. The latest versions can be downloaded here.

- - -

In addition to CVE-2016-6664, fixes for the following CVEs affecting MySQL, mentioned in Oracle’s Critical Patch Update Advisory - January 2017 are included in the versions 5.5.54, 10.0.29 and 10.1.21 of MariaDB:

Rasmus Johansson provides an update on CVE-2016-6662 and CVE-2016-6663 vulnerabilities which were both fixed in the fall.  


by rasmusjohansson at January 18, 2017 06:23 PM

Peter Zaitsev

Elasticsearch Ransomware: Open Source Database Security Part 2

In this blog post, we’ll look at a new Elasticsearch ransomware outbreak and what you can do to prevent it happening to you.

Mere weeks after reports of MongoDB servers getting hacked and infected with ransomware, Elasticsearch clusters are experiencing the same difficulties. David Murphy’s blog discussed the situation and the solution for MongoDB servers. In this blog post, we look at how you can prevent ransomware attacks on your Elasticsearch clusters.

First off, what is Elasticsearch? Elasticsearch is an open source distributed index based on Apache Lucene. It provides a full-text search with an HTTP API, using schemaless JSON documents. By its nature, it is also distributed and redundant. Companies use Elasticsearch with logging via the ELK stack and data-gathering software, to assist with data analytics and visualizations. It is also used to back search functionality in a number of popular apps and web services.

In this new scenario, the ransomware completely wiped away the cluster data, and replaced it with the following warning index:


As with the MongoDB situation, this isn’t a flaw in the Elasticsearch software. This vulnerability stems from not correctly using the security settings provided by Elasticsearch. As the PCWorld article sums up:

According to experts, there is no reason to expose Elasticsearch clusters to the internet. In response to these recent attacks, search technologies and distributed systems architect Itamar Syn-Hershko has published a blog post with recommendations for securing Elasticsearch deployments.

The blog post they reference has excellent advice and examples of how to protect your Elasticsearch installations from exploitation. To summarize its advice (from the post itself):

Whatever you do, never expose your cluster nodes to the web directly.

So how do you prevent hackers from getting into your Elasticsearch cluster? Using the advice from Syn-Hershko’s blog, here are some bullet points for shoring up your Elasticsearch security:

  • HTTP-enabled nodes should only listen to private IPs. You can configure what IPs Elasticsearch listens to: localhost, private IPs, public IPs or several of these options.
     control the IP types (manual). Never set Elasticsearch to listen to a public IP or a publicly accessible DNS name.
  • Use proxies to communicate with clients. You should pass any application queries to Elasticsearch through some sort of software that can filter requests, perform audit-logging and password-protect the data. Your client-side javascript shouldn’t talk to Elastic directly, and should only communicate with your server-side software. That software can translate all client-side requests to Elasticsearch DSL, execute the query, and then send the response in a format the clients expect.
  • Don’t use default ports. Once again for clarity: DON’T USE DEFAULT PORTS. You can easily change Elasticsearch’s default ports by modifying the .YML file. The relevant parameters are
  • Disable HTTP if you don’t need it. Only Elasticsearch client nodes should enable HTTP, and your private network applications should be the only ones with access to them. You can completely disable the HTTP module by setting
  • Secure publicly available client nodes. You should protect your Elasticsearch client and any UI it communicates with (such as Kibana and Kopf) behind a VPN. If you choose to allow some nodes access to the public network, use HTTPS and don’t transmit data and credentials as plain-text. You can use plugins like Elastic’s Shield or SearchGuard to secure your cluster.
  • Disable scripting (pre-5.x). Malicious scripts can hack clusters via the Search API. Earlier versions of Elasticsearch allowed unsecured scripts to access the software. If you are using an older version (pre-5.x), upgrade to a newer version or disable dynamic scripting completely.
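
The first bullet lends itself to a quick sanity check. The sketch below uses Python's standard `ipaddress` module to decide whether an address you plan to bind to is loopback or private. The function name `is_safe_bind_address` is hypothetical, and the value passed in is whatever address you configured; nothing here reads or parses an actual Elasticsearch configuration.

```python
import ipaddress


def is_safe_bind_address(value):
    """Return True only for loopback or private IP literals."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        # Not an IP literal (for example a DNS name): we cannot vouch for it.
        return False
    if addr.is_unspecified:
        # 0.0.0.0 or :: means "all interfaces", which includes public ones.
        return False
    return addr.is_loopback or addr.is_private


checks = {
    candidate: is_safe_bind_address(candidate)
    for candidate in ["127.0.0.1", "10.0.0.5", "0.0.0.0", "8.8.8.8", "search.example.com"]
}
```

The wildcard address is rejected explicitly because binding to it exposes the node on every interface the machine has, public ones included.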

Go to Syn-Hershko’s blog for more details.

This should get you started on correctly protecting yourself against Elasticsearch ransomware (and other security threats). If you want to have someone review your security, please contact us.

by Dave Avery at January 18, 2017 06:15 PM

MariaDB Foundation

MariaDB 10.1.21 and other releases now available

The MariaDB project is pleased to announce the immediate availability of MariaDB 10.1.21, MariaDB 10.0.29, MariaDB Galera Cluster 10.0.29, MariaDB Connector/J 1.5.7, MariaDB Connector/C 2.3.2, and MariaDB Connector/C 3.0.1 Beta. Apart from the Connector/C 3.0.1 Beta these are all stable (GA) releases. See the release notes and changelogs for details. Download MariaDB 10.1.21 Release Notes […]

The post MariaDB 10.1.21 and other releases now available appeared first on

by Daniel Bartholomew at January 18, 2017 03:14 PM

Jean-Jerome Schmidt

Announcing ClusterControl 1.4 - the MySQL Replication & MongoDB Edition

Today we are pleased to announce the 1.4 release of ClusterControl - the all-inclusive database management system that lets you easily deploy, monitor, manage and scale highly available open source databases in any environment; on-premise or in the cloud.

This release contains key new features for MongoDB and MySQL Replication in particular, along with performance improvements and bug fixes.

Release Highlights


MySQL Replication

  • Enhanced multi-master deployment
  • Flexible topology management & error handling
  • Automated failover

MySQL Replication & Load Balancers

  • Deploy ProxySQL on MySQL Replication setups and monitor performance
  • HAProxy Read-Write split configuration support for MySQL Replication setups

Experimental support for Oracle MySQL Group Replication

  • Deploy Group Replication Clusters

And support for Percona XtraDB Cluster 5.7

Download ClusterControl

For MongoDB

MongoDB & sharded clusters

  • Convert a ReplicaSet to a sharded cluster
  • Add or remove shards
  • Add Mongos/Routers

More MongoDB features

  • Step down or freeze a node
  • New Severalnines database advisors for MongoDB


View release details and resources

Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

New MySQL Replication Features

ClusterControl 1.4 brings a number of new features to better support replication users. You are now able to deploy a multi-master replication setup in active - standby mode. One master will actively take writes, while the other one is ready to take over writes should the active master fail. From the UI, you can also easily add slaves under each master and reconfigure the topology by promoting new masters and failing over slaves.

Topology reconfigurations and master failovers are not usually possible in case of replication problems, for instance errant transactions. ClusterControl will check for issues before any failover or switchover happens. The admin can define whitelists and blacklists of which slaves to promote to master (and vice versa). This makes it easier for admins to manage their replication setups and make topology changes when needed. 

Deploy ProxySQL on MySQL Replication clusters and monitor performance

Load balancers are an essential component in database high availability. With this new release, we have extended ClusterControl with the addition of ProxySQL, created for DBAs by René Cannaò, himself a DBA trying to solve issues when working with complex replication topologies. Users can now deploy ProxySQL on MySQL Replication clusters with ClusterControl and monitor its performance.

By default, ClusterControl deploys ProxySQL in read/write split mode - your read-only traffic will be sent to slaves while your writes will be sent to a writable master. ProxySQL will also work together with the new automatic failover mechanism. Once failover happens, ProxySQL will detect the new writable master and route writes to it. It all happens automatically, without any need for the user to take action.
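
As a rough illustration of what read/write splitting means, here is a minimal Python sketch. It is not how ProxySQL or ClusterControl implement it (ProxySQL works from configurable query rules); the `ReadWriteSplitter` class, the host names and the simple "starts with SELECT" test are all assumptions made for the example.

```python
import itertools
import re


class ReadWriteSplitter:
    """Send plain SELECTs to slaves in round-robin order, everything else to the master."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves) if slaves else None

    def route(self, sql):
        statement = sql.lstrip().upper()
        # SELECT ... FOR UPDATE takes locks, so treat it as a write.
        is_read = statement.startswith("SELECT") and not re.search(r"\bFOR\s+UPDATE\b", statement)
        if is_read and self._slaves is not None:
            return next(self._slaves)
        return self.master


router = ReadWriteSplitter("db1", ["db2", "db3"])
routes = [
    router.route("SELECT * FROM orders"),
    router.route("select count(*) from orders"),
    router.route("UPDATE orders SET paid = 1 WHERE id = 7"),
    router.route("SELECT * FROM orders WHERE id = 7 FOR UPDATE"),
]
```

After a failover, the only thing that has to change in a model like this is which host is considered the writable master; the applications keep talking to the proxy.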

MongoDB & sharded clusters

MongoDB is the rising star of the Open Source databases, and extending our support for this database has brought sharded clusters in addition to replica sets. This meant we had to retrieve more metrics for our monitoring, add advisors and provide consistent backups for sharding. With this latest release, you can now convert a ReplicaSet cluster to a sharded cluster, add or remove shards from a sharded cluster as well as add Mongos/routers to a sharded cluster.

New Severalnines database advisors for MongoDB

Advisors are mini programs that provide advice on specific database issues and we’ve added three new advisors for MongoDB in this ClusterControl release. The first one calculates the replication window, the second watches over the replication window, and the third checks for un-sharded databases/collections. In addition to this we also added a generic disk advisor. The advisor verifies if any optimizations can be done, like noatime and noop I/O scheduling, on the data disk that is being used for storage.

There are a number of other features and improvements that we have not mentioned here. You can find all details in the ChangeLog.

We encourage you to test this latest release and provide us with your feedback. If you’d like a demo, feel free to request one.

Thank you for your ongoing support, and happy clustering!

PS.: For additional tips & tricks, follow our blog:

by Severalnines at January 18, 2017 02:44 PM

January 17, 2017

Peter Zaitsev

Webinar Wednesday January 18, 2017: Lessons from Database Failures

Join Percona’s Chief Evangelist Colin Charles on Wednesday, January 18, 2017, at 7:00 am (PST) / 10:00 am (EST) (UTC-8) as he presents “Lessons from Database Failures.”

MySQL failures at scale can teach a great deal. MySQL failures can lead to a discussion about such topics as high availability (HA), geographical redundancy and automatic failover. In this webinar, Colin will present case study material (how automatic failover caused Github to go offline, why Facebook uses assisted failover rather than fully automated failover, and other scenarios) to look at how the MySQL world is making things better. One way, for example, is using semi-synchronous replication to run fully scalable services.

The webinar will begin with an obvious example of how a business died due to incorrect MySQL backup procedures. The agenda includes backups (and verification), replication (and failover) and security (and encryption).

The webinar will cover a mix of big “fail whale” problems from the field, and how you should avoid them by properly architecting solutions.

Register for the webinar here.

Colin Charles is the Chief Evangelist at Percona. Previously, he was part of the founding team of MariaDB Server in 2009. Before that, he had worked at MySQL since 2005. Colin has been a MySQL user since 2000. Before joining MySQL, he worked actively on the Fedora and projects. He’s well known within the APAC open source communities and has spoken at many conferences.

by Dave Avery at January 17, 2017 06:09 PM

January 16, 2017

Peter Zaitsev

Percona Live Featured Tutorial with Morgan Tocker — MySQL 8.0 Optimizer Guide

Welcome to another post in the series of Percona Live featured tutorial speakers blogs! In these blogs, we’ll highlight some of the tutorial speakers that will be at this year’s Percona Live conference. We’ll also discuss how these tutorials can help you improve your database environment. Make sure to read to the end to get a special Percona Live 2017 registration bonus!

In this Percona Live featured tutorial, we’ll meet Morgan Tocker, MySQL Product Manager at Oracle. His tutorial is a MySQL 8.0 Optimizer Guide. Many users who follow MySQL development are aware that recent versions introduced a number of improvements to query execution (via the addition of new execution strategies and an improved cost model). But what we don’t talk enough about is that the diagnostic features are also significantly better. I had a chance to speak with Morgan and learn a bit more about the MySQL Optimizer:

Percona: How did you get into database technology? What do you love about it?

Morgan: I started my career as a web developer, mainly focusing on the front end area. As the team I worked on grew and required different skills, I tried my hand at the back end programming. This led me to databases.

I think what I enjoyed about databases at the time was that front end design was a little bit too subjective for my tastes. With databases, you could prove what was “correct” by writing a simple micro-benchmark.  I joined the MySQL team in January 2006, and rejoined it again in 2013 after a five-year hiatus.

I don’t quite subscribe to this same view on micro benchmarks today, since it is very easy to accidentally (or intentionally) write a naïve benchmark. But I am still enjoying myself.

Percona: Your tutorial is called “MySQL 8.0 Optimizer Guide.” What exactly is the MySQL optimizer, and what new things have been added in MySQL 8.0?

Morgan: Because SQL is declarative (i.e., you state “what you want” rather than “do this then that”), there is a process that has to happen internally to prepare a query for execution. I like to describe it as similar to what happens when you enter an address in a GPS navigator. Some software then spits out the best steps on how to get there. In a database server, the Optimizer is that software code area.

There are a number of new optimizer features in MySQL 8.0, both in terms of new syntax supported and performance improvements to existing queries. These will be covered in some talks at the main conference (and also my colleague Øystein’s tutorial). The goal of my tutorial is to focus more on diagnostics than the enhancements themselves.

Percona: How can you use diagnostics to improve queries?

Morgan: To put it in numbers: it is not uncommon to see a user obsess over a configuration change that may yield a 2x improvement, and not spot the 100x improvement available by adding an index!

I like to say that users do not get the performance that they are entitled to if and when they lack the visibility and diagnostics available to them:

  • In MySQL 5.6, an optimizer trace diagnostic was added so that you can now see not only why the optimizer arrived at a particular execution plan, but why other options were avoided.
  • In MySQL 5.7, the EXPLAIN FORMAT=JSON command now includes the cost information (the internal formula used for calculations). My experience has been that sharing this detail actually makes the optimizer a lot easier to teach.

Good diagnostics by themselves do not make the improvements, but they bring required changes to the surface. On most systems, there are opportunities for improvements (indexes, hints, slight changes to queries, etc.).
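
As a small illustration of the cost information mentioned above, the Python sketch below pulls the estimated query cost and the access type out of an EXPLAIN FORMAT=JSON document. The sample plan is made up for this example (table `t1` and all numbers are hypothetical, not output from a real server), and `summarize_plan` is an illustrative helper that only handles a single-table plan.

```python
import json

# Hypothetical single-table plan in the shape of MySQL 5.7 EXPLAIN FORMAT=JSON output.
SAMPLE_PLAN = """
{
  "query_block": {
    "select_id": 1,
    "cost_info": {"query_cost": "1.20"},
    "table": {
      "table_name": "t1",
      "access_type": "ALL",
      "rows_examined_per_scan": 1,
      "cost_info": {"read_cost": "1.00", "eval_cost": "0.20", "prefix_cost": "1.20"}
    }
  }
}
"""


def summarize_plan(plan_json):
    """Extract the total estimated cost and flag full table scans."""
    block = json.loads(plan_json)["query_block"]
    table = block.get("table", {})
    return {
        "query_cost": float(block["cost_info"]["query_cost"]),  # costs are JSON strings
        "table": table.get("table_name"),
        "full_scan": table.get("access_type") == "ALL",         # "ALL" means no index was used
    }


summary = summarize_plan(SAMPLE_PLAN)
```

Comparing `query_cost` before and after adding an index is one simple way to see the kind of difference Morgan describes, without having to run the query itself.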

Percona: What do you want attendees to take away from your tutorial session? Why should they attend?

Morgan: Visibility into running systems has been a huge priority for the MySQL Engineering team over the last few releases. I think in many cases, force-of-habit leaves users using an older generation of diagnostics (EXPLAIN versus EXPLAIN FORMAT=JSON). My goal is to show them the light using the state-of-the-art stack. This is why I called my talk an 8.0 guide, even though much of it is still relevant for 5.7 and 5.6.

I also have a companion website to my tutorial at

Percona: What are you most looking forward to at Percona Live?

Morgan: I’m excited to talk to users about MySQL 8.0, and not just in an optimizer sense. The MySQL engineering team has made a large investment in improving the reliability of MySQL with the introduction of a native data dictionary. I expect it will be the subject of many discussions, and a great opportunity for feedback.

There is also the social aspect for me, too. It will be 11 years since I first attended the predecessor to Percona Live. I have a lot of fond memories, and enjoy catching up with new friends and old over a beer!

You can find out more about Morgan Tocker and his work with the Optimizer at his tutorial website. Want to find out more about Morgan and MySQL query optimization? Register for Percona Live Data Performance Conference 2017, and see his MySQL 8.0 Optimizer Guide tutorial. Use the code FeaturedTalk and receive $30 off the current registration price!

Percona Live Data Performance Conference 2017 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 24-27, 2017 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

by Dave Avery at January 16, 2017 11:06 PM

Ad-hoc Data Visualization and Machine Learning with mysqlshell


In this blog post, I am going to show how we can use mysqlshell to run ad-hoc data visualizations and use machine learning to predict new outcomes from the data.

Some time ago Oracle released MySQL Shell, a command line client to connect to MySQL using the X protocol. It allows us to use Python or JavaScript scripting capabilities. This unties us from the limitations of SQL, and the possibilities are infinite. It means that MySQL can not only read data from the tables, but also learn from it and predict new values from features never seen before.

Some disclaimers:

  • This is not a post about how to install mysqlshell or enable the X plugin. It should already be installed. Follow the first link if instructions are needed.
  • The idea is to show some of the things that can be done from the shell. Don’t expect the best visualizations or a perfectly tuned Supervised Learning algorithm.

It is possible to start mysqlshell with JavaScript or Python interpreter. Since we are going to use Pandas, NumPy and Scikit, Python will be our choice. There is an incompatibility between mysqlshell and Python > 2.7.10 that gives an error when loading some external libraries, so make sure you use 2.7.10.

We’ll work with the “employees” database that can be downloaded here. In order to make everything easier and avoid several lines of data parsing, I have created a new table that summarizes the data we are going to work with, generated using the following structure and query:

mysql> show create table data\G
*************************** 1. row ***************************
Create Table: CREATE TABLE `data` (
  `emp_no` int(11) NOT NULL,
  `age` int(11) DEFAULT NULL,
  `hired` int(11) DEFAULT NULL,
  `gender` int(11) DEFAULT NULL,
  `salary` int(11) DEFAULT NULL,
  `department` int(11) DEFAULT NULL,
  PRIMARY KEY (`emp_no`)
)

mysql> INSERT INTO data SELECT employees.emp_no, YEAR(now()) - YEAR(birth_date) as age, YEAR(now()) - YEAR(hire_date) as hired, IF(gender='M',0,1) as gender, max(salary) as salary, RIGHT(dept_no,1) as department from employees, salaries, dept_emp
WHERE employees.emp_no = salaries.emp_no and employees.emp_no = dept_emp.emp_no and dept_emp.to_date="9999-01-01"
GROUP BY emp_no, dept_emp.dept_no;

mysql> select * from data limit 5;
| emp_no | age  | hired | gender | salary | department |
|  10001 |   64 |    31 |      0 |  88958 |          5 |
|  10002 |   53 |    32 |      1 |  72527 |          7 |
|  10003 |   58 |    31 |      0 |  43699 |          4 |
|  10004 |   63 |    31 |      0 |  74057 |          4 |
|  10005 |   62 |    28 |      0 |  94692 |          3 |

The data is:

  • Age: the age of the employee
  • Hired: the number of years working in the company
  • Gender: 0 Male, 1 Female
  • Salary: the salary 🙂

It only includes people currently working at the company.

Now that the data is ready, let’s start with mysqlshell. Everything that follows was done directly from the shell itself.

Starting the Shell and Loading the Libraries

mysqlsh -uroot -p -h127.0.0.1 --py

Once the login is validated, we will see the following prompt:

mysql-py>


That means we are using the shell in Python mode. We can start loading our libraries:

mysql-py> import pandas as pd
mysql-py> import numpy as np
mysql-py> import seaborn
mysql-py> import matplotlib.pyplot as plt
mysql-py> from sklearn import tree

Now, we read each column from the table and store it in its own variable:

mysql-py> use employees
mysql-py> def column_to_list(column_name):
    temp_var = db.data.select([column_name]).execute().fetch_all()
    return [val for sublist in temp_var for val in sublist]
mysql-py> gender = column_to_list("gender")
mysql-py> salary = column_to_list("salary")
mysql-py> age = column_to_list("age")
mysql-py> hired = column_to_list("hired")
mysql-py> department = column_to_list("department")

And create a Pandas dataframe used to generate the visualizations:

mysql-py> df = pd.DataFrame({'Gender': gender,
                   'Salary': salary,
                   'Age': age,
                   'Hired': hired,
                   'Department': department})

Data Analysis

Now, let’s investigate the data. Some basic statistics to get age, hired and salary overview:

mysql-py> print df[["Salary","Age","Hired",]].describe(percentiles=(.75,.90,.99))
              Salary            Age          Hired
count  240124.000000  240124.000000  240124.000000
mean    72041.332178      58.918226      27.413782
std     17305.819632       3.750406       3.525041
min     40000.000000      52.000000      17.000000
50%     69827.000000      59.000000      28.000000
75%     82570.000000      62.000000      30.000000
90%     96125.000000      64.000000      32.000000
99%    119229.390000      65.000000      32.000000
max    158220.000000      65.000000      32.000000

Those statistics already give us good information. The employees’ ages range from 52 to 65, with an average of 59. They have been working at the company for 27 years on average, with an average salary of 72041.

But let’s forget about numbers. The human brain works much better and faster interpreting graphs than reading a table full of numbers. Let’s create some graphs and see if we can find any relationship.

Data Visualization

Relation between Gender and Salary:

mysql-py> df.groupby(['Gender']).mean()['Salary'].plot(kind='bar')


Relation between Age and Salary:

mysql-py> df.groupby(['Age']).mean()['Salary'].plot(kind='bar')


Relation between Department and Salary:

mysql-py> df.groupby(['Department']).mean()['Salary'].plot(kind='bar')


Relation between Hired and Salary:

mysql-py> df.groupby(['Hired']).mean()['Salary'].plot(kind='bar')


Now everything is clearer. There is no real relationship between gender and salary (yay!) or between age and salary. It seems that the average salary is related to the years that an employee has been working at the company. It also shows some differences depending on the department he/she belongs to.

Making Predictions: Machine Learning

Up to this point we have been using matplotlib, Pandas and NumPy to investigate and create graphs from the data stored in MySQL. Everything is from the shell itself. Amazing, eh? 🙂 Now let’s take a step forward. We are going to use machine learning so our MySQL client is not only able to read the data already stored, but also predict a salary.

Decision Tree Regression from SciKit Learn is the supervised learning algorithm we’ll use. Remember, everything is still from the shell!

Let’s separate the data into features and labels. From Wikipedia:

“Feature is an individual measurable property of a phenomenon being observed.”

Taking into account the graphs we saw before, “hired” and “department” are good features that could be used to predict the correct label (salary). In other words, we will train our Decision Tree by giving it “hired” and “department” data, along with their labels “salary”. The idea is that after the learning phase, we can ask it to predict a salary based on new “hired” and “department” data we provide. Let’s do it:

Separate the data in features and labels:

mysql-py> features = np.column_stack((hired, department))
mysql-py> labels = np.array(salary)

Train our decision tree:

mysql-py> clf = tree.DecisionTreeRegressor()
mysql-py> clf = clf.fit(features, labels)

Now, MySQL, tell me:

What do you think the salary of a person that has been working 25 years at the company, currently in department number 2, should be?

mysql-py> clf.predict([[25, 2]])
array([ 75204.21140143])

It predicts that the employee should have a salary of 75204. A person working there for 25 years, but in department number 7, should have a greater salary (based on the averages we saw before). What does our Decision Tree say?

mysql-py> clf.predict([[25, 7]])
array([ 85293.80606296])
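
It helps to see why the tree returns these particular numbers. With only two integer features and no depth limit (the default settings used above), a regression tree cannot separate rows that share the same (hired, department) pair, so for a pair it saw during training its prediction is in effect the average salary of those rows. The sketch below reproduces that idea in plain Python on made-up data; `fit_group_means` is a hypothetical helper, not part of scikit-learn, and the salaries are invented.

```python
from collections import defaultdict


def fit_group_means(features, labels):
    """Average the labels of all rows that share the same feature tuple."""
    sums = defaultdict(lambda: [0.0, 0])   # feature tuple -> [running total, row count]
    for row, label in zip(features, labels):
        bucket = sums[tuple(row)]
        bucket[0] += label
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Invented (hired, department) rows and salaries, not taken from the employees database.
toy_features = [(25, 2), (25, 2), (25, 7), (30, 7)]
toy_labels = [70000, 80000, 85000, 95000]

model = fit_group_means(toy_features, toy_labels)
prediction_dept_2 = model[(25, 2)]
prediction_dept_7 = model[(25, 7)]
```

Seen this way, the decision tree here is mostly a lookup of group averages, which is also why its predictions agree with the bar charts from the previous section.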


Now MySQL can both read data we already know, and it can also predict it! 🙂 mysqlshell is a very powerful tool that can be used to help us in our data analysis tasks. We can calculate statistics, visualize graphs, use machine learning, etc. There are many things you might want to do with your data without leaving the MySQL Shell.

by Miguel Angel Nieto at January 16, 2017 10:00 PM

January 13, 2017

Peter Zaitsev

MongoDB 3.4 Views

This blog post covers MongoDB 3.4 views, one of the more recent MongoDB features.

Views are often used in relational databases to achieve both data security and a high level of abstraction, making it easier to retrieve data. Unlike regular tables, views neither have a physical schema nor use disk space. They execute a pre-specified query. There are exceptions (such as materialized views and pre-executed views), but as a default the engine actually executes a query and then sends the result set as a single table when a view is used.

In MySQL, a simple view can be defined as:

create database percona;
use percona;
create view user_hosts as select user, host from mysql.user;
select * from user_hosts
| user             | host      |
| myuser           | %         |

The query above shows only the users and host field, rather than all the table fields. Anyone who queries this view sees a table that only has the user and host fields.

This feature was not available in previous MongoDB versions. All we could do was either deny reads in a collection (which would make it useless to the user) or allow reads to the entire collection (which  was pretty unsafe).

The views feature request was open for a while, and as we can see there was a considerable number of votes to make this feature available:

MongoDB 3.4 views are non-materialized views, and behind the scenes the engine runs an aggregation. Creating a view requires that we specify a collection or a previous existing view. When a view is the source collection from another view, it allows us to execute a chained aggregation.

To create a view, we should use the db.createView command, specifying the view name, the view source collection and the aggregation pipeline. This aggregation pipeline, as well as the other parameters, is saved in the system.views collection. This is the only space that the view will use in the system. A new document is saved in the system.views collection for each view created.

Although views seem very easy to create, there are a few pitfalls when using them.

Since views always run an aggregation, an index that covers the $match stage of the aggregation pipeline is desirable. Without one, expect slow responses caused by full collection scans.

Cascading aggregations (creating views of views) can be slow, as the view does not have any data and therefore cannot be indexed. MongoDB neither checks the collection fields nor the collection existence before creating the view. If there is no collection, the view returns an empty cursor.
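The chaining behavior described above can be illustrated with a small toy model. This is a Python sketch, not MongoDB's implementation: the catalog dictionary stands in for system.views, and the names resolve_view and views are hypothetical. It only shows that a view stores a source plus a pipeline, and that a view on a view resolves to one longer pipeline over the base collection.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for system.views: view name -> (source, pipeline stages).
ViewCatalog = Dict[str, Tuple[str, List[dict]]]


def resolve_view(name: str, views: ViewCatalog) -> Tuple[str, List[dict]]:
    """Follow view-on-view links down to the base collection.

    Returns the base collection name and the flattened pipeline, with the
    innermost view's stages first.
    """
    pipeline: List[dict] = []
    seen = set()
    while name in views:
        if name in seen:
            raise ValueError(f"view cycle detected at {name!r}")
        seen.add(name)
        source, stages = views[name]
        pipeline = list(stages) + pipeline  # inner stages run before outer ones
        name = source
    # 'name' is now something that is not a view; nothing checks that it exists.
    return name, pipeline


views = {
    "employee_names": ("employee", [{"$project": {"_id": 0, "fullname": 1}}]),
    "test_names": ("employee_names", [{"$match": {"fullname": "John Test"}}]),
}
base, pipeline = resolve_view("test_names", views)
print(base, pipeline)
```

Because nothing is stored for the intermediate view, every stage of the flattened pipeline runs against the base collection on each read, which is why only indexes on the base collection can help.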

Views appear as collections when we list them. The show collections command lists each view as a collection, but such collections are read-only. To drop a view, we simply execute db.collection.drop() on it. The view is removed from system.views, but the data remains untouched because dropping a view only removes the code that generates the view result.

How to create views:

In this step-by-step, we will create a view and restrict the user userReadViews to read privileges on views only:

1. Populate collection:

$ mongo --authenticationDatabase admin -u foo -p
> use financial
switched to db financial
> db.employee.insert({FirstName : 'John', LastName:  'Test', position : 'CFO', wage : 180000.00 })
WriteResult({ "nInserted" : 1 })
> db.employee.insert({FirstName : 'John', LastName:  'Another Test', position : 'CTO', wage : 210000.00 })
WriteResult({ "nInserted" : 1 })
> db.employee.insert({FirstName : 'Johnny', LastName:  'Test', position : 'COO', wage : 180000.00 })
WriteResult({ "nInserted" : 1 })

2. Create view that only shows full names:

> use financial
> db.createView('employee_names','employee', [{ $project : { _id : 0, "fullname" : {$concat : ["$FirstName", " ", "$LastName"]}}}])
{ "ok" : 1 }
> show collections
> db.employee_names.find()
{ "fullname" : "John Test" }
{ "fullname" : "John Another Test" }
{ "fullname" : "Johnny Test" }

3. Create a user-defined role that only gives access to the views:

Create a file “createviewOnlyRole.js” with the following JavaScript, or copy and paste the following code:

use financial
db_name = db.toString()
priv = []
// Build one "find" privilege per view registered in system.views
db.system.views.find({},{"_id" : 1, "viewOn" : 1}).forEach(function (view) {
    // _id has the form "<database>.<view name>"
    database_collection = view['_id'].split('.')
    database = database_collection[0]
    // Everything after the database name is the view name
    coll = database_collection.slice(1).join('.')
    priv.push({"resource" : { "db" : database, "collection" : coll}, "actions" : ["find"]})
})
// Create the role on the first run, update it on later runs
var viewrole = db.getRole(db_name + '_readAnyView')
if (viewrole == null) {
    db.runCommand({ createRole: db_name + "_readAnyView",
        "privileges": priv,
        roles : []
    })
} else {
    db.runCommand({ updateRole: db_name + "_readAnyView",
        "privileges": priv,
        roles : []
    })
}
print('access granted to:')
printjson(priv)

Then authenticate and use the desired database to create this role. In our case:

use financial
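The part of the script that is easiest to get wrong is turning a system.views _id (which has the form database.viewname) into a privilege document. The following Python sketch mirrors only that step so it can be checked without a running MongoDB. The function name build_find_privileges is hypothetical, and the privilege layout simply copies the one used in the script above.

```python
from typing import Dict, List


def build_find_privileges(view_ids: List[str]) -> List[Dict]:
    """Turn system.views _id values into one 'find' privilege per view."""
    privileges = []
    for view_id in view_ids:
        # Split at the first dot only: view names may contain dots themselves.
        database, _, view_name = view_id.partition(".")
        privileges.append(
            {
                "resource": {"db": database, "collection": view_name},
                "actions": ["find"],
            }
        )
    return privileges


print(build_find_privileges(["financial.employee_names"]))
```

Splitting at the first dot keeps the database name out of the collection field, so the role grants find on employee_names rather than on a collection named after the full namespace.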

4. Create a new user assigned to the readAnyView role. This new user is only able to query against views, and they must know the view name because no other privileges are granted:

use financial
db_name = db.toString()
db.createUser(
   {
     user: "userReadViews",
     pwd: "123",
     roles: [ db_name + "_readAnyView"]
   }
)

Notes: If you receive an error when trying to execute the .js file, please create a new role that grants find in the system.views collection:

use admin
db.runCommand({ createRole: "readViewCollection",
  privileges: [
    { resource: { db: "", collection: "system.views" }, actions: [ "find"] }],
  roles : []
})

For more information about user-defined roles, please check the user-defined roles documentation.

This should help explain MongoDB 3.4 views. Please feel free to contact me @AdamoTonete or @percona for any questions and suggestions.

by Adamo Tonete at January 13, 2017 11:07 PM

The Impact of Swapping on MySQL Performance

In this blog, I’ll look at the impact of swapping on MySQL performance. 

It’s common sense that when you’re running MySQL (or really any other DBMS) you don’t want to see any I/O in your swap space. Scaling the cache size (using innodb_buffer_pool_size in MySQL’s case) is standard practice to make sure there is enough free memory so swapping isn’t needed.

But what if you make some mistake or miscalculation, and swapping happens? How much does it really impact performance? This is exactly what I set out to investigate.

My test system has the following:

  • 32GB of physical memory
  • OS (and swap space) on a (pretty old) Intel 520 SSD device
  • Database stored on Intel 750 NVMe storage

To simulate a worst case scenario, I’m using Uniform Sysbench Workload:

sysbench --test=/usr/share/doc/sysbench/tests/db/select.lua   --report-interval=1 --oltp-table-size=700000000 --max-time=0 --oltp-read-only=off --max-requests=0 --num-threads=64 --rand-type=uniform --db-driver=mysql --mysql-password=password --mysql-db=test_innodb  run

To better visualize the performance of the metrics that matter for this test, I have created the following custom graph in our Percona Monitoring and Management (PMM) tool. It shows performance, disk IO and swapping activity on the same graph.

Here are the baseline results for innodb_buffer_pool_size=24GB. The results are a reasonable ballpark number for a system with 32GB of memory.

Impact of Swapping on MySQL PMM 1

As you can see in the baseline scenario, there is almost no swapping, with around 600MB/sec read from the disk. This gives us about 44K QPS. The 95% query response time (reported by sysbench) is about 3.5ms.

Next, I changed the configuration to innodb_buffer_pool_size=32GB, which is the total amount of memory available. As memory is required for other purposes, it caused swapping activity:

Impact of Swapping on MySQL PMM 2

We can see that performance stabilizes after a bit at around 20K QPS, with some 380MB/sec disk IO and 125MB/sec swap IO. The 95% query response time has grown to around 9ms.

Now let’s look at an even worse case. This time, we’ll set our configuration to innodb_buffer_pool_size=48GB (on a 32GB system).

Impact of Swapping on MySQL PMM 3

Now we have around 6K QPS. Disk IO has dropped to 250MB/sec, and swap IO is up to 190MB/sec. The 95% query response time is around 35ms. As the graph shows, the performance becomes more variable, confirming the common assumption that intense swapping affects system stability.
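To put the three runs side by side, the figures quoted above can be reduced to ratios against the baseline. The short Python sketch below only does that arithmetic. The run labels are illustrative, and the inputs are the approximate values read off the graphs, so the ratios are rough as well.

```python
# (label, queries per second, 95% response time in ms), as quoted in the text.
runs = [
    ("baseline", 44_000, 3.5),
    ("moderate swapping", 20_000, 9.0),
    ("heavy swapping", 6_000, 35.0),
]


def compare_to_baseline(results):
    """Return (label, throughput as % of baseline, latency multiple of baseline)."""
    _, base_qps, base_p95 = results[0]
    return [
        (label, round(100 * qps / base_qps, 1), round(p95 / base_p95, 1))
        for label, qps, p95 in results
    ]


for label, qps_pct, latency_x in compare_to_baseline(runs):
    print(f"{label:18s} {qps_pct:5.1f}% of baseline QPS, {latency_x:4.1f}x baseline p95")
```

In other words, moderate swapping leaves a bit under half of the baseline throughput, while heavy swapping leaves roughly one seventh of it and multiplies the 95% response time by about ten.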

Finally, let’s remember MySQL 5.7 has the Online Buffer Pool Resize feature, which was created to solve exactly this problem (among other reasons). It lets you change the buffer pool size if you accidentally set it too large. As we have tested innodb_buffer_pool_size=24GB, and demonstrated it worked well, let’s scale it back to that value:

mysql> set global innodb_buffer_pool_size=24*1024*1024*1024;
Query OK, 0 rows affected (0.00 sec)
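As a side note on the value passed above: 24*1024*1024*1024 is 24GiB expressed in bytes. The MySQL 5.7 documentation states that the buffer pool size is adjusted to a multiple of innodb_buffer_pool_chunk_size times innodb_buffer_pool_instances. The Python sketch below reproduces that arithmetic under assumed settings (a 128MiB chunk size and 8 instances, which may differ from the test system); the function name is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def adjusted_buffer_pool_size(requested_bytes, chunk_bytes=128 * MIB, instances=8):
    """Round a requested size up to a multiple of chunk size * instances."""
    unit = chunk_bytes * instances
    return -(-requested_bytes // unit) * unit  # ceiling division, then scale back


requested = 24 * GIB
print(requested)                             # bytes in 24GiB
print(adjusted_buffer_pool_size(requested))  # unchanged: already a multiple of 1GiB
```

With these assumed settings the unit is 1GiB, so 24GiB is accepted as is. A request such as 9GiB with 16 instances would be rounded up to 10GiB.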

Impact of Swapping on MySQL PMM 4

Now the graph shows both good and bad news. The good news is that the feature works as intended, and after the resize completes we get close to the same results as before our swapping experiment. The bad news is everything pretty much grinds to a halt for 15 minutes or so while resizing occurs. There is almost no IO activity or intensive swapping while the buffer pool resize is in progress.

I also performed other sysbench runs for selects using the Pareto random type rather than the Uniform type, creating more realistic (skewed) data access patterns. I further performed update key benchmarks using both Uniform and Pareto access distributions.

You can see the results below:

Impact of Swapping on MySQL Pareto 1

Impact of Swapping on MySQL Pareto 2

As you can see, the results for selects are as expected. Accesses with Pareto distributions are better and are affected less – especially by minor swapping.  

If you look at the update key results, though, you find that minor swapping causes performance to improve for Pareto distribution. The results at 48GB of memory are pretty much the same.

Before you say that this is impossible, let me provide an explanation: I limited innodb_max_purge_lag on this system to avoid unbounded InnoDB history length growth. These workloads tend to be bound by InnoDB purge performance. It looks like swapping has impacted the user threads more than it did the purge threads, causing such an unusual performance profile. This is something that might not be repeatable between systems.


When I started, I expected a severe performance drop even with very minor swapping. I surprised myself by getting swap activity to more than 100MB/sec, with performance “only” halved.

While you should continue to plan your capacity so that there is no constant swapping on the database system, these results show that a few MB/sec of swapping activity is not going to have a catastrophic impact.

This assumes your swap space is on an SSD, of course! SSDs handle random IO (which is what paging activity usually is) much better than HDDs.

by Peter Zaitsev at January 13, 2017 05:27 PM

Jean-Jerome Schmidt

Online schema change with gh-ost - throttling and changing configuration at runtime

(this post was edited on 13/01/2017 after comments from Shlomi N.)

In previous posts, we gave an overview of gh-ost and showed you how to test your schema changes before executing them. One important feature of all schema change tools is their ability to throttle themselves. Online schema change requires copying data from the old table to a new one and, no matter what you do in addition to that, it is an expensive process which may impact database performance.

Throttling in gh-ost

Throttling is crucial to ensure that normal operations continue to perform smoothly. As we discussed in a previous blog post, gh-ost allows you to stop all of its activity, which makes things so much less intrusive. Let’s see how it works and to what extent it is configurable.

 - Disclaimer - this section is related to gh-ost in versions older than 1.0.34 -

The main problem is that gh-ost uses multiple methods of lag calculation, which makes things unclear. The documentation also does not explain clearly how things work internally. Let’s take a look at how gh-ost operates right now.

As we mentioned, there are multiple methods used to calculate lag. First of all, gh-ost generates an internal heartbeat in its _ghc table.

mysql> SELECT * FROM sbtest1._sbtest1_ghc LIMIT 1\G
*************************** 1. row ***************************
         id: 1
last_update: 2016-12-27 13:36:37
       hint: heartbeat
      value: 2016-12-27T13:36:37.139851335Z
1 row in set (0.00 sec)

It is used to calculate lag on the slave/replica on which gh-ost operates and from which it reads binary logs. Then there are the replicas listed in --throttle-control-replicas. Those, by default, have their lag tracked using SHOW SLAVE STATUS and Seconds_Behind_Master. This data has a granularity of one second.

The problem is that sometimes one second of lag is too much for the application to handle, so one of the very important features of gh-ost is the ability to detect sub-second lag. On the replica where gh-ost operates, gh-ost’s heartbeat supports sub-second granularity through the heartbeat-interval-millis variable. The remaining replicas, though, are not supported this way - there is an option to take advantage of an external heartbeat solution like, for example, pt-heartbeat, and calculate slave lag using --replication-lag-query.

Unfortunately, when we put it all together, it didn’t work as expected - sub-second lag was not calculated correctly by gh-ost. We decided to contact Shlomi Noach, who’s leading the gh-ost project, to get some more insight into how gh-ost handles sub-second lag detection. What you will read below is a result of this conversation, showing how it is done starting from version 1.0.34, which incorporates changes in lag calculation and does it in the “right” way.

Gh-ost, at this moment, inserts heartbeat data in its _*_ghc table. This makes any external heartbeat generator redundant and, as a result, it makes --replication-lag-query deprecated and soon to be removed. Gh-ost’s internal heartbeat is used across the whole replication topology.

If you want to check for lag with sub-second granularity, you need to configure --heartbeat-interval-millis and --max-lag-millis correctly, ensuring that heartbeat-interval-millis is set to a lower value than max-lag-millis - that’s all. You can, for example, tell gh-ost to insert a heartbeat every 100 milliseconds (heartbeat-interval-millis) and then test if lag is less than, let’s say, 500 milliseconds (max-lag-millis). Of course, lag will be checked on all replicas defined in --throttle-control-replicas. The updated gh-ost documentation describes the lag checking process.
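The idea behind heartbeat-based lag checking can be sketched in a few lines of Python. This is not gh-ost’s code, only an illustration under assumptions: the function names are hypothetical, the heartbeat value has the format shown in the _ghc row earlier, and lag is taken as the difference between the current time and the last heartbeat seen on a replica.

```python
from datetime import datetime, timedelta, timezone


def parse_heartbeat(value: str) -> datetime:
    """Parse a heartbeat such as 2016-12-27T13:36:37.139851335Z (nanoseconds)."""
    stamp, _, fraction = value.rstrip("Z").partition(".")
    micros = int((fraction + "000000")[:6])  # datetime keeps microseconds only
    base = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def lag_millis(heartbeat_value: str, now: datetime) -> float:
    """Age of the heartbeat as seen on a replica, in milliseconds."""
    return (now - parse_heartbeat(heartbeat_value)) / timedelta(milliseconds=1)


def check_settings(heartbeat_interval_millis: int, max_lag_millis: int) -> None:
    """The heartbeat must tick more often than the lag threshold it is judged by."""
    if heartbeat_interval_millis >= max_lag_millis:
        raise ValueError("heartbeat-interval-millis must be lower than max-lag-millis")


def should_throttle(heartbeat_value: str, now: datetime, max_lag_millis: int) -> bool:
    return lag_millis(heartbeat_value, now) > max_lag_millis


check_settings(heartbeat_interval_millis=100, max_lag_millis=500)
now = datetime(2016, 12, 27, 13, 36, 37, 439851, tzinfo=timezone.utc)
print(should_throttle("2016-12-27T13:36:37.139851335Z", now, max_lag_millis=500))
```

The settings check shows why the interval has to be below the threshold: if heartbeats arrived only every 500 milliseconds, a healthy replica could appear to be up to 500 milliseconds behind simply because no newer heartbeat had been written yet.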

Again, please keep in mind that this is how gh-ost operates when you use it in version v1.0.34 or later.

We need to mention, for the sake of completeness, one more setting - nice-ratio. It is used to define how aggressive gh-ost should be in copying the data. It basically tells gh-ost how long it should pause after each row copy operation. If you set it to 0, no pause will be added. If you set it to 0.5, the whole process of copying rows will take 150% of the original time. If you set it to 1, it will take twice as long (200%). It works, but it is also pretty hard to adjust the ratio so that the original workload is not affected. As long as you can use sub-second lag throttling, this is the way to go.
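The nice-ratio arithmetic is simple enough to write down. The Python sketch below assumes, as the description above implies, that the pause added after a row copy is the copy time multiplied by the ratio; the function names are hypothetical.

```python
def pause_after_copy(copy_seconds: float, nice_ratio: float) -> float:
    """Sleep time added after a row-copy operation that took copy_seconds."""
    if nice_ratio < 0:
        raise ValueError("nice-ratio must not be negative")
    return copy_seconds * nice_ratio


def total_copy_time(base_seconds: float, nice_ratio: float) -> float:
    """Overall row-copy time once the pauses are included."""
    return base_seconds + pause_after_copy(base_seconds, nice_ratio)


for ratio in (0, 0.5, 1, 2):
    print(ratio, total_copy_time(100.0, ratio))
```

A copy that would take 100 seconds unthrottled takes 150 seconds at a ratio of 0.5 and 200 seconds at a ratio of 1, matching the percentages given above.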

Runtime configuration changes in gh-ost

Another very useful feature of gh-ost is its ability to handle runtime configuration changes. When it starts, it listens on a unix socket, which you can choose through --serve-socket-file. By default the socket is created in the /tmp directory and its name is determined by gh-ost. It seems to depend on the schema and table which gh-ost works upon. An example would be: /tmp/gh-ost.sbtest1.sbtest1.sock

Gh-ost can also listen on a TCP port, but for that you need to pass --serve-tcp-port.

Knowing this, we can manipulate some of the settings. The best way to learn what we can change would be to ask gh-ost about it. When we send the ‘help’ string to the socket, we’ll get a list of available commands:

root@ip-172-30-4-235:~# echo help | nc -U /tmp/gh-ost.sbtest1.sbtest1.sock
available commands:
status                               # Print a detailed status message
sup                                  # Print a short status message
chunk-size=<newsize>                 # Set a new chunk-size
nice-ratio=<ratio>                   # Set a new nice-ratio, immediate sleep after each row-copy operation, float (examples: 0 is agrressive, 0.7 adds 70% runtime, 1.0 doubles runtime, 2.0 triples runtime, ...)
critical-load=<load>                 # Set a new set of max-load thresholds
max-lag-millis=<max-lag>             # Set a new replication lag threshold
replication-lag-query=<query>        # Set a new query that determines replication lag (no quotes)
max-load=<load>                      # Set a new set of max-load thresholds
throttle-query=<query>               # Set a new throttle-query (no quotes)
throttle-control-replicas=<replicas> # Set a new comma delimited list of throttle control replicas
throttle                             # Force throttling
no-throttle                          # End forced throttling (other throttling may still apply)
unpostpone                           # Bail out a cut-over postpone; proceed to cut-over
panic                                # panic and quit without cleanup
help                                 # This message

As you can see, there are a number of settings that can be changed at runtime - we can change the chunk size, and we can change the critical load settings (thresholds that, when crossed, cause gh-ost to start throttling). You can also change settings related to throttling: nice-ratio, max-lag-millis, replication-lag-query, throttle-control-replicas. You can also force throttling by sending the ‘throttle’ string to gh-ost or immediately stop the migration by sending ‘panic’.
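If you would rather script these commands than pipe them through nc, the same exchange can be done from Python’s standard library. This is a minimal sketch under assumptions: the helper name send_gh_ost_command is hypothetical, and it relies on gh-ost answering one command per connection and then closing it, which is the behavior the nc example above suggests.

```python
import socket


def send_gh_ost_command(socket_path: str, command: str, timeout: float = 5.0) -> str:
    """Send one command line to a gh-ost unix socket and return the reply text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        sock.sendall(command.strip().encode() + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # peer closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode()


if __name__ == "__main__":
    # Socket path taken from the example above; adjust to your schema and table.
    print(send_gh_ost_command("/tmp/gh-ost.sbtest1.sbtest1.sock", "status"))
```

The same helper could send, for example, max-lag-millis=500 or throttle, since every command in the list above is a single line of text.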

Another setting which is worth mentioning is unpostpone. Gh-ost allows you to postpone the cut-over process. As you know, gh-ost creates a temporary table using the new schema, and then fills it with data from the old table. Once all data has been copied, it performs a cut-over and replaces the old table with the new one. You may want to be there to monitor things when gh-ost performs this step, in case something goes wrong. In that case, you can use --postpone-cut-over-flag-file to define a file which, if it exists, will postpone the cut-over process. Then you can create that file and be sure that gh-ost won’t swap tables unless you let it by removing the file. Still, if you’d like to go ahead and force the cut-over without having to find and remove the postpone file, you can send the ‘unpostpone’ string to gh-ost and it will immediately perform a cut-over.

We are coming to the end of this post. Throttling is a critical part of any online schema change process (or any database-heavy process, for that matter) and it is important to understand how to do it right. Yet, even with throttling, some additional load is unavoidable. That’s why, in our next blog post, we will try to assess the impact of running gh-ost on the system.

by krzysztof at January 13, 2017 10:08 AM

January 12, 2017

Peter Zaitsev

CVE-2016-6225: Percona Xtrabackup Encryption IV Not Being Set Properly


If you are using Percona XtraBackup with --encrypt to create encrypted backups, and are using versions older than 2.3.6 or 2.4.5, we advise that you upgrade Percona XtraBackup.

Note: this does not affect encryption of encrypted InnoDB tables.


Percona XtraBackup versions older than 2.3.6 or 2.4.5 suffered an issue of not properly setting the Initialization Vector (IV) for encryption. This could allow someone to carry out a Chosen-Plaintext Attack, which could recover decrypted content from the encrypted backup files without the need for a password.


Percona XtraBackup carries backward compatibility to allow for the decryption of older backup files. However, encrypted backup files produced by the versions that have the fix will not be compatible with older versions of Percona XtraBackup.


Access to the encrypted files must already be present for exploitation to occur. So long as you adequately protect the encrypted files, we don’t expect this issue to adversely affect users.


Percona would like to thank and give credit to Ken Takara for discovering this issue and working it through to PoC exploitation.

More Information

Release Notes

by David Busby at January 12, 2017 09:34 PM

The Percona Online Store: Get Database Help Now with Support and Health Audit Services

We are proud to announce the new Percona online store!

Keeping your database environment tuned, optimized and high-performance is key to achieving business goals. If your database goes down, so does your business. Percona experts have a long history of helping enterprises ensure their databases are running smoothly. With Percona, you can meet today’s workloads, and prepare for future workloads before they impact performance.

Now we’ve made it even easier to get Percona database services: visit Percona’s new online store! The webstore is perfect for ordering a health audit and immediate, smaller-scale database support. Simply select your service type, answer a few questions about your environment, and then submit. A Percona expert will be in touch.

The webstore makes it fast and easy to purchase Percona Services, with recurring monthly credit card payments. Shop now for Percona’s highly responsive, effective and affordable support and service options, including MySQL Standard Support, MongoDB Standard Support and a MySQL Health Audit.

Percona has some of the best reviews and one of the highest renewal rates in the industry. We can help you increase your uptime, be more productive, reduce your support budget and implement fixes for performance issues faster.

Check out the new Percona online store here!

by Dave Avery at January 12, 2017 09:10 PM

Oli Sennhauser

MySQL replication with filtering is dangerous

From time to time we see in customer engagements that MySQL Master/Slave replication is set up with schema or table level replication filtering. This can be done either on the Master or on the Slave. If filtering is done on the Master (by the binlog_{do|ignore}_db settings), the binary log becomes incomplete and cannot be used for a proper Point-in-Time-Recovery. Therefore FromDual recommends AGAINST this approach.

The replication filtering rules vary depending on the binary log format (ROW and STATEMENT). See also: How Servers Evaluate Replication Filtering Rules.

For reasons of data consistency between Master and Slave, FromDual recommends using only the binary log format ROW. This is also stated in the MySQL documentation: All changes can be replicated. This is the safest form of replication. Binary log filtering with the binary log format MIXED is especially dangerous, and FromDual strongly discourages the use of this binary log format.

The binary log format ROW affects only DML statements (UPDATE, INSERT, DELETE, etc.) but NOT DDL statements (CREATE, ALTER, DROP, etc.) and NOT DCL statements (CREATE, GRANT, REVOKE, DROP, etc.). So how are those statements replicated? They are replicated in STATEMENT binary log format even though binlog_format is set to ROW. The consequence is that the binary log filtering rules of STATEMENT based replication, and not the ones of ROW based replication, apply when running one of those DDL or DCL statements.

This can easily cause problems. If you are lucky, they will cause the replication to break sooner or later, which you can detect and fix - but they may also cause inconsistencies between Master and Slave which may remain undetected for a long time.
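The rule just described can be captured in a deliberately simplified model. The Python sketch below covers only replicate_ignore_db with binlog_format ROW or STATEMENT, and its function names are hypothetical. It follows the documented behavior: statement events are filtered by the default database (the one selected with USE), while row events are filtered by the database of the table being changed.

```python
DML = {"INSERT", "UPDATE", "DELETE"}


def logged_as(statement_kind: str, binlog_format: str = "ROW") -> str:
    """DDL and DCL are logged as statements even when binlog_format is ROW."""
    if binlog_format not in ("ROW", "STATEMENT"):
        raise ValueError("only ROW and STATEMENT are modelled here")
    if binlog_format == "ROW" and statement_kind.upper() in DML:
        return "ROW"
    return "STATEMENT"


def slave_applies(statement_kind, default_db, table_db, ignore_dbs, binlog_format="ROW"):
    """Would a slave with replicate_ignore_db = ignore_dbs apply this change?"""
    if logged_as(statement_kind, binlog_format) == "ROW":
        return table_db not in ignore_dbs  # row rules: look at the changed table
    return default_db not in ignore_dbs    # statement rules: look at USE


ignore = {"mysql"}
# CREATE USER always changes tables in the mysql schema, whatever USE says.
print(slave_applies("CREATE USER", "mysql", "mysql", ignore))  # filtered out
print(slave_applies("CREATE USER", "test", "mysql", ignore))   # replicated
print(slave_applies("CREATE USER", None, "mysql", ignore))     # replicated
```

The model makes the trap visible: for DDL and DCL the outcome depends on which database the session happened to select, not on which schema the statement actually touches.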

Let us show what happens in 2 similar scenarios:

Scenario A: Filtering on mysql schema

On Slave we set the binary log filter as follows:

replicate_ignore_db = mysql

and verify it:

          Replicate_Ignore_DB: mysql

The intention of this filter setting is to not replicate user creations or modifications from Master to the Slave.

We verify on the Master that binlog_format is set to the desired value:

mysql> SHOW GLOBAL VARIABLES LIKE 'binlog_format';
| Variable_name | Value |
| binlog_format | ROW   |

Now we do the following on the Master:

mysql> use mysql
mysql> CREATE USER 'inmysql'@'%';
mysql> use test
mysql> CREATE USER 'intest'@'%';

and verify the result on the Master:

mysql> SELECT user, host FROM mysql.user;
| user        | host      |
| inmysql     | %         |
| intest      | %         |
| mysql.sys   | localhost |
| root        | localhost |

and on the Slave:

mysql> SELECT user, host FROM mysql.user;
| user        | host      |
| intest      | %         |
| mysql.sys   | localhost |
| root        | localhost |

We see that the user intest was replicated and the user inmysql was not. We clearly have an unwanted data inconsistency between Master and Slave.

If we want to drop the inmysql user some time later on the Master:

mysql> use myapp;
mysql> DROP USER 'inmysql'@'%';

we get the following error message on the Slave and wonder why this user or the query appears on the Slave at all:

               Last_SQL_Errno: 1396
               Last_SQL_Error: Error 'Operation DROP USER failed for 'inmysql'@'%'' on query. Default database: 'test'. Query: 'DROP USER 'inmysql'@'%''

A similar problem happens when we connect to NO database on the Master as follows and change the user's password:

shell> mysql -uroot
mysql> SELECT DATABASE();
| database() |
| NULL       |
mysql> ALTER USER 'innone'@'%' IDENTIFIED BY 'secret';

This works perfectly on the Master. But what happens on the Slave:

               Last_SQL_Errno: 1396
               Last_SQL_Error: Error 'Operation ALTER USER failed for 'innone'@'%'' on query. Default database: ''. Query: 'ALTER USER 'innone'@'%' IDENTIFIED WITH 'mysql_native_password' AS '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7''

The Slave wants to tell us in a complicated way, that the user innone does not exist on the Slave...

Scenario B: Filtering on tmp or similar schema

Another scenario we have seen recently is that the customer is filtering out tables with temporary data located in the tmp schema. Similar scenarios are cache, session or log tables. He did it as follows on the Master:

mysql> use tmp;
mysql> TRUNCATE TABLE tmp.test;

As he had learned in FromDual trainings, he emptied the table with the TRUNCATE TABLE command instead of a DELETE FROM tmp.test command, which is much less efficient. What he did not consider is that TRUNCATE TABLE is a DDL command and not a DML command, and thus the STATEMENT based replication filtering rules apply. His filtering rules on the Slave were as follows:

          Replicate_Ignore_DB: tmp

When we do the check on the Master we get an empty set as expected:

mysql> SELECT * FROM tmp.test;
Empty set (0.00 sec)

When we add new data on the Master:

mysql> INSERT INTO tmp.test VALUES (NULL, 'new data', CURRENT_TIMESTAMP());
mysql> SELECT * FROM tmp.test;
| id | data      | ts                  |
|  1 | new data  | 2017-01-11 18:00:11 |

we get a different result set on the Slave:

mysql> SELECT * FROM tmp.test;
| id | data      | ts                  |
|  1 | old data  | 2017-01-11 17:58:55 |

and in addition the replication stops working with the following error:

                   Last_Errno: 1062
                   Last_Error: Could not execute Write_rows event on table tmp.test; Duplicate entry '1' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log laptop4_qa57master_binlog.000042, end_log_pos 1572

See also our earlier bug report of a similar topic: Option "replicate_do_db" does not cause "create table" to replicate ('row' log)


Binary log filtering is extremely dangerous when you care about data consistency, and thus FromDual recommends avoiding binary log filtering by all means. If you really have to do binary log filtering, you should know exactly what you are doing, carefully test your set-up, check your application and your maintenance jobs and also review your future code changes regularly. Otherwise you risk data inconsistencies in your MySQL Master/Slave replication.

by Shinguz at January 12, 2017 03:47 PM

January 11, 2017

Jean-Jerome Schmidt

How to use the ClusterControl Query Monitor for MySQL, MariaDB and Percona Server

The MySQL database workload is determined by the number of queries that it processes. MySQL slowness can originate in several situations. The first possibility is that there are queries that are not using proper indexing. When a query cannot make use of an index, the MySQL server has to use more resources and time to process that query. By monitoring queries, you have the ability to pinpoint SQL code that is the root cause of a slowdown.

By default, MySQL provides several built-in tools to monitor queries, namely:

  • Slow Query Log - Captures queries that exceed a defined threshold, or queries that do not use indexes.
  • General Query Log - Captures all queries that happen in a MySQL server.
  • SHOW FULL PROCESSLIST statement (or through mysqladmin command) - Monitors live queries currently being processed by MySQL server.
  • PERFORMANCE_SCHEMA - Monitors MySQL Server execution at a low level.

There are also open-source tools out there that can achieve similar results, like mtop and Percona’s pt-query-digest.

How ClusterControl monitors queries

ClusterControl does not only monitor your hosts and database instances, it also monitors your database queries. It gets the information in two different ways:

  • Queries are retrieved from PERFORMANCE_SCHEMA
  • If PERFORMANCE_SCHEMA is disabled or unavailable, ClusterControl will parse the content of the Slow Query Log

ClusterControl starts reading from the PERFORMANCE_SCHEMA tables immediately when the query monitor is enabled, and the following tables are used by ClusterControl to sample the queries:

  • performance_schema.events_statements_summary_by_digest
  • performance_schema.events_statements_current
  • performance_schema.threads

In older versions of MySQL (5.5), having PERFORMANCE_SCHEMA (P_S) enabled might not be an option since it can cause significant performance degradation. With MySQL 5.6 the overhead is reduced, and even more so in 5.7. P_S offers great introspection of the server at an overhead of a few percent (1-3%). If the overhead is a concern, then ClusterControl can parse the Slow Query log remotely to sample queries. Note that no agents are required on your database servers. It uses the following flow:

  1. Start slow log (during MySQL runtime).
  2. Run it for a short period of time (a second or couple of seconds).
  3. Stop log.
  4. Parse log.
  5. Truncate log (ClusterControl creates new log file).
  6. Go to 1.

As you can see, ClusterControl does the above trick when pulling and parsing the Slow Query log to overcome the problems with offsets. The drawback of this method is that the continuous sampling might miss some queries during steps 3 to 5. Hence, if continuous query sampling is vital for you and part of your monitoring policy, the best way is to use P_S. If enabled, ClusterControl will automatically use it.

The collected queries are hashed, calculated and digested (normalize, average, count, sort) and then stored in ClusterControl.
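To make the digest step more concrete, here is a minimal Python sketch of the general technique (normalize, hash, count, average, sort). It is not ClusterControl’s implementation: the function names are hypothetical and the normalization rules are deliberately crude, replacing only quoted strings and numeric literals with placeholders.

```python
import hashlib
import re
from collections import defaultdict


def normalize(query: str) -> str:
    """Replace literals with '?' so that similar queries share one digest text."""
    text = re.sub(r"'(?:[^'\\]|\\.)*'", "?", query)   # quoted strings
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "?", text)    # numeric literals
    return re.sub(r"\s+", " ", text).strip().lower()  # whitespace and case


def digest(samples):
    """Group (query, seconds) samples by digest; most frequent digest first."""
    stats = defaultdict(lambda: {"count": 0, "total": 0.0})
    for query, seconds in samples:
        entry = stats[normalize(query)]
        entry["count"] += 1
        entry["total"] += seconds
    rows = [
        {
            "id": hashlib.sha1(text.encode()).hexdigest()[:16],  # short hash key
            "digest": text,
            "count": s["count"],
            "avg_seconds": s["total"] / s["count"],
        }
        for text, s in stats.items()
    ]
    rows.sort(key=lambda row: row["count"], reverse=True)
    return rows


samples = [
    ("SELECT * FROM t WHERE id = 1", 0.25),
    ("SELECT * FROM t WHERE id = 2", 0.75),
    ("SELECT name FROM u WHERE name = 'bob'", 0.5),
]
for row in digest(samples):
    print(row["count"], row["avg_seconds"], row["digest"])
```

Sorting by count corresponds to an “occurrence” ordering; sorting by avg_seconds instead would give an “execution time” ordering.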

Enabling Query Monitoring

As mentioned earlier, ClusterControl monitors MySQL queries in two ways:

  • Fetch the queries from PERFORMANCE_SCHEMA
  • Parse the content of MySQL Slow Query

Performance Schema (Recommended)

First of all, if you would like to use Performance Schema, turn it on on all MySQL servers (MySQL/MariaDB v5.5.3 and later). Enabling this requires a MySQL restart. Add the following line to your MySQL configuration file:

performance_schema = ON

Then, restart the MySQL server. For ClusterControl users, you can use the configuration management feature at Manage -> Configurations -> Change Parameter and perform a rolling restart at Manage -> Upgrades -> Rolling Restart.

Once enabled, ensure at least events_statements_current is enabled:

mysql> SELECT * FROM performance_schema.setup_consumers WHERE NAME LIKE 'events_statements%';
| NAME                           | ENABLED |
| events_statements_current      | YES     |
| events_statements_history      | NO      |
| events_statements_history_long | NO      |

Otherwise, run the following statement to enable it:

UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME = 'events_statements_current';

MySQL Slow Query

If Performance Schema is disabled, ClusterControl will then default to the Slow Query log. Hence, you don’t have to do anything since it can be turned on and off dynamically during runtime via SET statement.

The Query Monitoring function must be toggled to on under ClusterControl -> Query Monitor -> Top Queries. ClusterControl will monitor queries on all database nodes under this cluster:

Click on the “Settings” and configure “Long Query Time” and toggle “Log queries not using indexes” to On. If you have defined two parameters (long_query_time and log_queries_not_using_indexes) inside my.cnf and you would like to use those values instead, toggle “MySQL Local Query Override” to On. Otherwise, ClusterControl will obey the former.

Once enabled, you just need to wait a couple of minutes before you can see data under Top Queries and Query Histogram.

How ClusterControl visualizes the queries

Under the Query Monitor tab, you should see the following three items:

  • Top Queries

  • Running Queries

  • Query Histogram

We’ll have a quick look at these here, but remember that you can always find more details in the ClusterControl documentation.

Top Queries

Top Queries is an aggregated list of all your top queries running on all the nodes of your cluster. The list can be ordered by “Occurrence” or “Execution Time”, to show the most common or slowest queries respectively. You don’t have to log in to each of the servers to see the top queries. The UI provides an option to filter based on MySQL server.

If you are using the Slow Query log, only queries that exceed the “Long Query Time” will be listed here. If the data is not populated correctly and you believe that there should be something in there, it could be:

  • ClusterControl did not collect enough queries to summarize and populate data. Try to lower the “Long Query Time”.
  • You have configured Slow Query Log options in the my.cnf of the MySQL server, and “Override Local Query” is turned off. If you really want to use the values you defined inside my.cnf, you probably have to lower the long_query_time value so ClusterControl can calculate a more accurate result.
  • You have another ClusterControl node pulling the Slow Query log as well (in case you have a standby ClusterControl server). Only allow one ClusterControl server to do this job.

The “Long Query Time” value can be specified to a resolution of microseconds, for example 0.000001 (1 x 10^-6). The following shows a screenshot of what’s under Top Queries:

Clicking on each query will show the query plan executed, similar to EXPLAIN command output:

Running Queries

Running Queries provides an aggregated view of currently running queries across all nodes in the cluster, similar to the SHOW FULL PROCESSLIST command in MySQL. You can stop a running query by killing the connection that started it. The process list can be filtered by host.

Use this feature to monitor live queries currently running on MySQL servers. By clicking on each row that contains “Info”, you can see the extended information containing the full query statement and the query plan:

Query Histogram

The Query Histogram shows you queries that are outliers. An outlier is a query that takes longer than a normal query of that type. Use this feature to filter out the outliers for a certain time period. This feature depends on the Top Queries feature above: if Query Monitoring is enabled and Top Queries are captured and populated, the Query Histogram will summarize these and provide a filter based on timestamp.
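
To make the idea of an outlier concrete, here is a minimal Python sketch. It is not how ClusterControl computes its histogram (the post does not say). It simply assumes that “normal” means the median execution time for a query type, and that anything slower than three times the median counts as an outlier. The function name, the factor and the sample timings are all made up for illustration.

```python
from statistics import median

def find_outliers(samples, factor=3.0):
    """Return execution times that exceed `factor` times the median.

    Assumption for illustration only: the median stands in for the
    "normal" execution time of this query type.
    """
    typical = median(samples)
    return [t for t in samples if t > factor * typical]

# Hypothetical execution times (in seconds) for one query type.
timings = [0.10, 0.12, 0.11, 0.95, 0.09]
outliers = find_outliers(timings)
print(outliers)
```

With these sample timings only the 0.95 second execution is flagged. A different definition of “normal” (a mean, a percentile) would move the threshold.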

That’s all folks! Monitoring queries is as important as monitoring your hosts or MySQL instances, to make sure your database is performing well.

by ashraf at January 11, 2017 07:12 PM

Peter Zaitsev

How to Replace MySQL with Percona Server on a CPanel, WHM VPS or Dedicated Server

In this blog post, we’ll look at how to replace MySQL with Percona Server for MySQL on a CPanel, WHM VPS or dedicated server.

In general, CPanel and WHM have been leaning towards support of MariaDB over other flavors. This is partly due to the upstream repos replacing the MySQL package with MariaDB (for example, on CentOS).

MySQL 5.6 is still supported though, which means they are keeping support for core MySQL products. But if you want to get some extra performance enhancements or enterprise features for free, without getting too many bells and whistles, you might want to install Percona Server.

I’ve done this work on a new dedicated server with the latest WHM and CPanel on CentOS 7, with MySQL 5.6 installed. Besides the backup, this is a fairly quick process.

It’s pretty simple. From the Percona Server for MySQL 5.7 installation doc, we can get the YUM repo. (Run commands as root if you can, otherwise with sudo.)

yum install

Now that we have the repo, let’s install Percona XtraBackup in case we need to roll this back at any point:

yum install percona-xtrabackup

This server had a drive mounted at /backup, so I created the backup with the following commands:

xtrabackup --target-dir=/backup/xtrabackup --backup
xtrabackup --target-dir=/backup/xtrabackup --prepare

Now that we have a good backup, let’s remove MySQL:

service mysql stop
yum remove 'MySQL*' 'mysql*'

Depending on your dependency chain, this could remove Percona XtraBackup, but that can be fixed. Let’s accept this uninstall.

Let’s install Percona Server for MySQL 5.7 and Percona Toolkit:

yum install Percona-Server-server-57 percona-toolkit percona-xtrabackup

Now that it’s installed, ensure the mysql service is running. If it isn’t, start it. Now let’s upgrade:


NOTE. This works if you can log in as root without a password; if you can’t, you will need to specify the


Once you run the upgrade, restart the mysql service:

service mysql restart

And there you go, you are now running on Percona Server for MySQL 5.7. If your managed providers tell you it’s not supported, don’t worry! It works as long as CPanel supports MySQL 5.6.

If you have any issues, just restore the backup.

NOTE: One thing to keep in mind is that 5.7 breaks CPanel’s ability to create users in MySQL. I believe this is due to the changes to the mysql.user table. If this is an issue for you, you can always use Percona Server for MySQL 5.6.

by Manjot Singh at January 11, 2017 06:41 PM

Reinstall MySQL and Preserve All MySQL Grants and Users

In this blog post, we’ll look at how to preserve all MySQL grants and users after reinstalling MySQL.

Every so often, I need to reinstall a MySQL version from scratch and preserve all the user accounts and their permissions (or move the same users and privileges to another server).

As of MySQL 5.7, MySQL does not make this easy! MySQL SHOW GRANTS only shows permissions for one user, and the method suggested on StackExchange – dumping tables containing grants information directly – is not robust (as Rick James mentions in the comments). It also doesn’t work between different MySQL versions.

This problem is easily solved, however, with the pt-show-grants tool from Percona Toolkit (which serves pretty much as a mysqldump for user privileges).  

All you need to do is:

  1. On the source, or to back up MySQL privileges, run:

pt-show-grants > grants.sql

  2. On the target, or to restore MySQL privileges, run:

mysql < grants.sql

  3. If you would like to clean up the old privileges from MySQL before loading new ones, use:

pt-show-grants --drop --ignore root@localhost | grep "^DROP USER " | mysql

This removes all the users (except the root user, which you will need to connect back and load new privileges).
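
The grep step in that last pipeline keeps only the DROP USER statements and discards everything else in the dump. A rough Python equivalent of that filter is sketched below. The function name is hypothetical, and the sample text is an invented stand-in shaped like a grants dump, not captured pt-show-grants output.

```python
def drop_user_statements(dump_text):
    """Keep only lines that start with 'DROP USER ', like grep "^DROP USER "."""
    return [line for line in dump_text.splitlines() if line.startswith("DROP USER ")]

# Hypothetical sample; real pt-show-grants output may look different.
sample = """-- Grants for 'app'@'%'
DROP USER 'app'@'%';
GRANT SELECT ON `shop`.* TO 'app'@'%';
-- Grants for 'report'@'10.0.0.5'
DROP USER 'report'@'10.0.0.5';
GRANT USAGE ON *.* TO 'report'@'10.0.0.5';
"""

statements = drop_user_statements(sample)
for statement in statements:
    print(statement)
```

Only the two DROP USER lines survive the filter, which is why the pipeline removes accounts without replaying any GRANT statements.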

With Percona Toolkit, preserving your grants and user privileges is easy!

by Peter Zaitsev at January 11, 2017 04:35 PM

January 10, 2017

Peter Zaitsev

Webinar Thursday, January 12: Percona Software News and Roadmap Update for Q1 2017

Please join Percona CEO Peter Zaitsev for a webinar on Thursday, January 12, 2017 at 11 am PST / 2 pm EST (UTC-8) for a discussion on the Percona Software News and Roadmap Update for Q1 2017.

In this webinar, Peter will discuss what’s new in Percona open source software. This will include Percona Server for MySQL and MongoDB, Percona XtraBackup, Percona Toolkit, Percona XtraDB Cluster and Percona Monitoring and Management.

During this webinar Peter will talk about newly released features in Percona software, show a few quick demos and share with you highlights from the Percona open source software roadmap.

Peter will also talk about new developments in Percona commercial services and finish with a Q&A.

Register for the Percona Software News and Roadmap Update webinar here.

Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With over 150 professionals in 20 plus countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of internet giants, large enterprises and many exciting startups.

Peter was an early employee at MySQL AB, eventually leading the company’s High Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Data Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his recent ebook Practical MySQL Performance Optimization is one of Percona’s most popular downloads.


by Dave Avery at January 10, 2017 07:44 PM

How to Move a MySQL Partition from One Table to Another

In this blog post we’ll look at how to move a MySQL partition from one table to another, for MySQL versions before 5.7.

Up to version 5.7, MySQL had a limitation that made it impossible to directly exchange partitions between partitioned tables. Now and then, we get questions about how to import an .ibd for use as a partition in a table, as well as how to exchange partitions with another partitioned table. Below are step-by-step instructions on how to move a partition from one table to another.

In this example, one of our customers had two tables with the following structures:

CREATE TABLE live_tbl (
some_id bigint(20) NOT NULL DEFAULT '0',
summary_date date NOT NULL,
PRIMARY KEY (some_id,summary_date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
/*!50500 PARTITION BY RANGE COLUMNS(summary_date)
(PARTITION p201203 VALUES LESS THAN ('2012-04-01') ENGINE = InnoDB,
PARTITION p201204 VALUES LESS THAN ('2012-05-01') ENGINE = InnoDB,
PARTITION p201205 VALUES LESS THAN ('2012-06-01') ENGINE = InnoDB,
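PARTITION p201206 VALUES LESS THAN ('2012-07-01') ENGINE = InnoDB,
PARTITION pMAX VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) */; -- pMAX is an assumed partition name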

CREATE TABLE archive_tbl (
some_id bigint(20) NOT NULL DEFAULT '0',
summary_date date NOT NULL,
PRIMARY KEY (some_id,summary_date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
/*!50500 PARTITION BY RANGE COLUMNS(summary_date)
(PARTITION p201109 VALUES LESS THAN ('2011-10-01') ENGINE = InnoDB,
PARTITION p201110 VALUES LESS THAN ('2011-11-01') ENGINE = InnoDB,
PARTITION p201111 VALUES LESS THAN ('2011-12-01') ENGINE = InnoDB,
PARTITION p201112 VALUES LESS THAN ('2012-01-01') ENGINE = InnoDB,
PARTITION p201201 VALUES LESS THAN ('2012-02-01') ENGINE = InnoDB,
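PARTITION p201202 VALUES LESS THAN ('2012-03-01') ENGINE = InnoDB,
PARTITION pMAX VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) */; -- pMAX is an assumed partition name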

Their goal was to move (not copy) the oldest partition from live_tbl to archive_tbl. To achieve this, we came up with the following procedure:

For the following, we assume:

  • The datadir is “/var/lib/mysql/”
  • MySQL Server is run by the “mysql” Linux user
  • “p201203” is the partition name you want to move
  • “live_tbl” is the source table from where you want to move the partition
  • “archive_tbl” is the destination table to where you want to move the partition
  • “dest_tbl_tmp” is the temporary table we will create, using the same CREATE TABLE criteria as in the live_tbl
  • “thedb” is the database name

1. Copy the .ibd data file from that particular partition

First, make sure you flush any pending changes to disk and that the table is locked, so that binary table copies can be made while the server is running. Keep in mind that the table will be locked while you copy the .ibd file. All reads/writes during that time will be blocked.

Important: Don’t close this session or the lock will be released.

mysql> USE thedb
mysql> FLUSH TABLE live_tbl FOR EXPORT;

Open another session, and copy the .ibd file to a temporary folder.

shell> cp /var/lib/mysql/thedb/live_tbl#P#p201203.ibd /tmp/dest_tbl_tmp.ibd

After you copy the .ibd file to the temporary folder, go back to the MySQL session and unlock the table so that all reads and writes to that particular table are allowed again.

mysql> UNLOCK TABLES;

2. Prepare a temporary table to import the tablespace

Create a temporary table exactly like the one into which you want to import the partition. Remove the partitioning on it and discard the tablespace so that it is ready for the .ibd import.

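mysql> CREATE TABLE dest_tbl_tmp LIKE archive_tbl;
mysql> ALTER TABLE dest_tbl_tmp REMOVE PARTITIONING;
mysql> ALTER TABLE dest_tbl_tmp DISCARD TABLESPACE;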

3. Import the tablespace to the temporary table

Place the .ibd file in the appropriate folder, set the correct permissions and ownership and then import the tablespace to the temporary table.

shell> cp /tmp/dest_tbl_tmp.ibd /var/lib/mysql/thedb/
shell> chmod 660 /var/lib/mysql/thedb/dest_tbl_tmp.ibd
shell> chown mysql:mysql /var/lib/mysql/thedb/dest_tbl_tmp.ibd
mysql> ALTER TABLE dest_tbl_tmp IMPORT TABLESPACE;

4. Swap the tablespace with the destination table’s partition tablespace

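Partition according to your own schema. (This is just an example using date values. In our case, we have to REORGANIZE PARTITION to accommodate a new LESS THAN range before the MAXVALUE. The name pMAX below is an assumption for that MAXVALUE partition; use the name from your own table.)

mysql> ALTER TABLE archive_tbl REORGANIZE PARTITION pMAX INTO (
PARTITION p201203 VALUES LESS THAN ('2012-04-01'),
PARTITION pMAX VALUES LESS THAN (MAXVALUE));
mysql> ALTER TABLE archive_tbl EXCHANGE PARTITION p201203 WITH TABLE dest_tbl_tmp;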

5. Check that the partitions are correctly exchanged before dropping the one from the source table

SELECT * FROM archive_tbl;
SELECT * FROM dest_tbl_tmp;
SELECT * FROM live_tbl;

For more information on why these steps are needed, please check the following documentation link for ALTER TABLE … EXCHANGE PARTITION:

In MySQL version 5.7, it is possible to exchange partitions without the unpartitioned table step, as described in the following link:

There are bugs related to the steps in this guide that might be useful to take into consideration:

by Pablo Padua at January 10, 2017 06:00 PM

MongoDB PIT Backups: Part 2

This blog post is the second in a series covering MongoDB PIT backups. You can find the first part here.

Sharding Makes Everything Fun(ner)

The first blog post in this series looked at MongoDB backups in a simple single-replica set environment. In this post, we’ll look at the scale-out use case. When sharding, we have exactly the same problem as we do on a single replica set. However, now the problem is multiplied by the number of replica sets in the cluster. Additionally, we have a bonus problem: each replica set has unique data. That means to get a truly consistent snapshot of the cluster, we need to orchestrate our backups to capture a single consistent point in time. Just so we’re on the same page, that means that every replica set needs to stop their backups at, or near, the same time that the slowest replica set stops. Are you sufficiently confused now? Let me get to a basic concept that I forgot to cover in the first post, and then I’ll give you a simple description of the problem.

Are you Write Concerned?

So far, I’ve neglected to talk about the very important role of “write concern” when taking consistent backups. In MongoDB, the database is not durable by default. By “durable,” I mean “on disk” when the database acknowledges receipt of an operation from your application. There are most likely several reasons for this. The biggest one originally was probably throughput, given a lack of concurrency.

However, the side effect is possible data loss due to loss of operations applied only in memory. Changing the write concern to “journaled” (j : true) will change this behavior so that MongoDB journals changes before acknowledging them (you also need to be running with journal enabled).

TIP: For true durability in a replica set, you should use a write concern of “majority” for operations and set writeConcernMajorityJournalDefault : true on all replica set members (new in v3.4). This has the added benefit of greatly decreasing the chance of rollback after an election.

Wow, you’re inconsistent

At the risk of being repetitive, the crux of this issue is that we need to run a backup on every shard (replica set). This is necessary because every shard has a different piece of the data set. Each piece of that data set is necessary to get an entire picture of the data set for the cluster (and thus, your application). Since we’re using mongodump, we’ll only have a consistent snapshot at the point in time when the backup completes. This means we must end each shard’s backup at a consistent point in time. We cannot expect that the backup will complete in exactly the same amount of time on every shard, which is what we’ll need for a consistent point in time across the cluster. This means that Shard1 might have a backup that is consistent to 12:05 PM, while another shard has one that is consistent to 12:06 PM. In a high traffic environment (the kind that’s likely to need horizontal scale), this could mean thousands of lost documents. Here’s a diagram:

[Figure: MongoDB PIT Backups]


Here’s the math to illustrate the problem:

  • Shard1’s backup will contain 30,000 documents (100 docs/sec * 60 secs * 5 mins)
  • Shard2’s backup will contain 36,000 documents (100 docs/sec * 60 secs * 6 mins)

In this example, to get a consistent point in time you’d need to remove all insert, update and delete operations that happened on Shard 2 from the time that Shard 1’s backup completed (6,000 documents). This means examining the timestamp of every operation in the oplog and reversing its operation. That’s a very intensive process, and will be unique for every mongodump that’s executed. Furthermore, this is a pretty tricky thing to do. The repeatable and much more efficient method is to have backups that finish in a consistent state, ready to restore when needed.
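
The arithmetic in this example is easy to check with a few lines of Python. This is only a sketch of the numbers used here: the rate of 100 documents per second is the figure from the example, and the function and variable names are invented for illustration.

```python
# Write rate from the example: 100 documents per second on each shard.
DOCS_PER_SECOND = 100

def docs_in_backup(minutes, docs_per_second=DOCS_PER_SECOND):
    """Documents captured by a dump that finishes after `minutes` minutes."""
    return docs_per_second * 60 * minutes

shard1_docs = docs_in_backup(5)   # dump consistent to 12:05 PM
shard2_docs = docs_in_backup(6)   # dump consistent to 12:06 PM
skew = shard2_docs - shard1_docs  # documents Shard 2 holds beyond Shard 1's point in time

print(shard1_docs, shard2_docs, skew)  # prints: 30000 36000 6000
```

The skew grows linearly with both the write rate and the gap between dump completion times, so a busier cluster or a slower shard makes the inconsistency larger.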

Luckily, Percona has you covered!

You’re getting more consistent

Having data is important, but knowing what data you have is even more important. Here’s how you can be sure you know what you have in your MongoDB backups:

David Murphy has released his MongoDB Consistent Backup Tool in the Percona Labs github account, and has written a very informative blog post about it. My goal with these blog posts is to make it even easier to understand the problem and how to solve it. We’ve already had an exhaustive discussion about the problem on both small and large scales. How about the solution?

It’s actually pretty simple. The solution, at a basic level, is to use a simple algorithm to decide when a cluster-wide consistent point-in-time can be reached. In the MongoDB Consistent Backup tool, this is done by the backup host kicking off backups on a “known good member” of each shard (that’s a pretty cool feature by itself) and then tracking the progress of each dump. At the same time the backup is kicked off, the backup host kicks off a separate thread that tails the oplog on each “known good member” until the mongodump on the slowest shard completes. By using this method, we have a very simple way of deciding when we can get a cluster-wide consistent snapshot: when the slowest member completes its piece of the workload. Here’s the same workload from Figure 4, but with the MongoDB Consistent Backup Tool methodology:

[Figure: MongoDB PIT Backups]
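
The decision rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the MongoDB Consistent Backup Tool: the ShardDump class, the function names and the finish times (in seconds since the backup started) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ShardDump:
    name: str
    dump_finished_at: int  # seconds since the backup started (hypothetical unit)

def consistent_point(dumps):
    """The cluster-wide consistent point is when the slowest dump finishes."""
    return max(d.dump_finished_at for d in dumps)

def oplog_windows(dumps):
    """For each shard, the (start, end) oplog range to apply on top of its dump."""
    end = consistent_point(dumps)
    return {d.name: (d.dump_finished_at, end) for d in dumps}

# Shard 1 finishes after 5 minutes, Shard 2 after 6 minutes.
dumps = [ShardDump("shard1", 300), ShardDump("shard2", 360)]
windows = oplog_windows(dumps)
print(windows)
```

In this sketch the faster shard needs 60 seconds of tailed oplog to catch up, while the slowest shard needs none. Every shard ends up consistent to the same moment.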


TIP: The amount of time that it takes to perform these backups is often decided by two factors:

  1. How evenly distributed the data is across the shards (balanced)
  2. How much data each shard contains (whether or not it’s balanced).

The takeaway here is that you may need to shard so that each shard has a manageable volume of data. This allows you to hit your backup/restore windows more easily.

…The Proverbial “Monkey Wrench”

There’s always a “gotcha” just when you think you’ve got your mind around any difficult concept. Of course, this is no different.

There is one very critical concept in sharding that we didn’t cover: tracking what data lies on which shard. This is important for routing the workload to the right place, and balancing the data across the shards. In MongoDB, this is handled by the config servers. If you cannot reach (or recover) your config servers, your entire cluster is lost! For obvious reasons, you need to back them up as well. With the Percona Labs MongoDB Consistent Backup Tool, there are actually two modes used to back up config servers: v3.2 and greater, and legacy. The reason is that in v3.2, config servers went from mirrors to a standard replica set. In v3.2 mode, we just treat the config servers like another replica set. They have their own mongodump and oplog tail thread. They get a backup that is consistent to the same point in time as all other shards in the cluster. If you’re on a version of MongoDB prior to v3.2, and you’re interested in an explanation of legacy mode, please refer back to David’s blog post.

The Wrap Up

We’ve examined the problems with getting consistent backups in a running MongoDB environment in this and the previous blog posts. Whether you have a single replica set or a sharded cluster, you should have a much better understanding of what the problems are and how Percona has you covered. If you’re still confused, or you’d just like to ask some additional questions, drop a comment in the section below. Or shoot me a tweet @jontobs, and I’ll make sure to get back to you.

by Jon Tobin at January 10, 2017 12:14 AM

January 09, 2017

Peter Zaitsev

MySQL 8.0.1: The Next Development Milestone

This post discusses the next MySQL development milestone: MySQL 8.0.1.

From the outset, MySQL 8.0 has received plenty of attention. Both this blog (see the MySQL 8.0 search) and other sites around the Internet have covered it. Early reviews seem positive (including my own MySQL 8.0 early bugs review). There is plenty of excitement about the new features.

As for early feedback on MySQL 8.0, Peter Zaitsev (Percona CEO) listed a set of recommendations for benchmarking MySQL 8.0. I hope these get reviewed and implemented.

MySQL achieved the current development milestone (available for download) on September 12, 2016. Its release immediately came with a detailed review by Geir Hoydalsvik from MySQL. If you haven’t had the opportunity to do so yet, you can also review the MySQL 8.0 release notes.

It now looks like we’re nearing 8.0.1, the next development milestone. I don’t have insider information, but it’s quite clear when navigating that:

Regarding timing, it’s interesting to note that the “What Is New in MySQL 8.0” page was updated on the 6th of January.

It looks like the release might come soon. So, restrain your excitement for a few days (or weeks?) more. Maybe you’ll be able to check out the all-new MySQL 8.0.1!

PS: If MySQL quality interests you, have a look at this recent – and very interesting – change made to the MTR (MySQL Test Run, the MySQL test suite) program. I believe it improves quality for everyone who runs MySQL (including its forks). The tests (which are run worldwide, often for each code change made) will now test the product with its own defaults.

by Roel Van de Paar at January 09, 2017 04:49 PM

January 06, 2017

Peter Zaitsev

Archiving MySQL and MongoDB Data

This post discusses archiving MySQL and MongoDB data, and determining what, when and how to archive data.

Many people store infrequently used data. This data is taking up storage space and might make your database slower than it could be. Archiving data can be a huge benefit, both regarding the performance impact and storage savings.

Why archive?

One of the reasons for archiving data is freeing up space on your database volumes. You can store archived data on slower, less expensive storage devices, and current data on the faster database drives. Archiving old data makes backups and restores run faster since they need to process less data. Last, but by no means least, archiving data has the benefit of making your queries perform more efficiently since they do not need to process through old data.

What do you archive?

That is the big question. Archiving too much is just as detrimental as not archiving enough (or at all). As you’ll see, finding this balance requires foresight and planning. Fortunately, you can tweak your archiving scheme to make it work better as time goes by.

Some people feel that keeping all the data in their database, even if they don’t access that data frequently, is the best way to go. If you are lucky enough to have vast quantities of storage, and a database that is performing well, keeping all of the data in your database might be a good idea. Even with lots of storage, archiving off some data that you don’t use regularly might have advantages. We all know someone whose desk is piled with stacks of paper. When they need something, they tell us that they know where everything is. Even if they can find the requested item, they need to work through the piles of stuff to locate it. They also have to decide where to put new items so that they can be easily found. In your database, this equates to slower queries and potentially slower writes. Clearing out some of the less frequently accessed data will have a beneficial effect overall.

At the other end of the spectrum are the people who want to archive in a manner that is too aggressive. This means that any requests for data must access the archive location. This might be slower and more burdensome, causing the queries to run slowly. In addition, new data written into the database will have to go through an archive process fairly quickly, which might slow down the database. This is the person who puts each and every item they own into storage. It makes for a clean home, but it’s tough to find many of the items that you own. In our database, this means that most queries are run against archived data, and archive processes are frequently running. This too can slow down performance overall.

The best archiving solution is one that meets both the needs of efficient use of storage and efficiency of queries and inserts. You want to be able to write new data quickly, access frequently used data promptly, and still be able to get the information that might not often be used. There is no simple answer here: each company will have different needs and requirements. For some companies, regulations might govern how long data must be stored. With these sorts of requirements in place, you should look to place data that isn’t accessed often on a storage medium that is lower in cost (and often slower in performance). It is still there, but it is not crowding out the more commonly used data. Other companies might query or manipulate data shortly after it is loaded into the database, and they might be able to archive more often.

When do you archive?

This is another big consideration. Do you archive data daily, weekly, monthly, annually or on some other schedule? The basic answer is that it doesn’t matter what the schedule is. It matters that there is some sort of schedule, and that archiving is happening as expected. Keeping to a schedule allows everyone to know that the data is being archived as expected, and will avoid any “gee, we completely forgot about doing that” issues from arising.

Frequent archiving (daily or weekly) is good when you have high data volumes and normally need to access only the latest data in your queries. Think of stock data. Queries to pull trade volumes and pricing over a short time period are more frequent than queries that would analyze a stock’s performance over time. Therefore, archiving old data can be helpful since it keeps the frequently accessed table’s data easily accessible, but still accommodates the need to get at data for longer time spans. With high data volume, you might need to archive often so that one archive process can complete before another is started.

Less frequent archiving might be used when you have longer term projects or if you find your current database is performing reasonably well. In these cases, archiving monthly, quarterly, or annually might make sense. This is like cleaning out your garage or attic. You might do it, but you probably don’t do it every week. The amount of stuff being stored, along with the space to store it in, might determine how often you do this type of cleanup.

How do you go about archiving MySQL and MongoDB data?

There are lots of possibilities here as well. If well planned, it can be an easy implementation. But like many things, figuring this out is usually done once things have gotten a little out of control.

You can archive data using a standard backup, moving it to another table in the database, exporting to a flat file, or moving it to another database altogether. The end goal is a cleaner production environment that still allows access to the archived data if it is needed. The method for performing the archive determines the method used to bring that data back to a state in which it can be queried. One of the considerations must be how much time you are willing and able to invest in making that data available again.

  1. You can use your standard backup method to create and manage your archive, but this is a solution that is cumbersome and prone to error. You can perform a backup and then delete the unwanted data from your table(s). Now, the deleted data is only stored in your backup and must be restored in order to be queried. You should restore to another database for this purpose so that you keep your production environment clean. With this option, you also have to consider the methods for recovering space used by deleted files. This opens up the possibility of someone restoring to the original database, which can cause a host of problems. With MongoDB, mongodump has an optional --archive option that writes the dump to a single archive file that you specify. MongoDB version 3.2 added this option.
  2. Another possibility is to move the data to another MySQL table or MongoDB collection in the existing database (i.e., moving from the transactions table to transactions_archived). This is a fast and efficient way to archive the data, and it allows for easy querying since the data still resides in the database. Of course, this assumes that you have enough storage space to accommodate the active and the archive tables/collections.
  3. You can also export the data to be archived to a flat file and then delete it from the original table or collection. This is workable if the data needs to be kept available but is unlikely to be regularly needed. (It will need to be imported in order to query it.) This method also comes with all the caveats about needing to delete and recover the space of the archived records, issues with importing into the original database (and ruining all the good archiving work you’ve done), and the possibility of deleting the flat file.
  4. Alternatively, you can move the data to another database. This too can be an effective method for archiving data, and can also allow that data to be made available to others for query analysis. Once again, all warnings about recovering the space apply, but the good thing here is that the data does not need to be restored to be queried. It is simply queried through the other database.


Another option for archiving MySQL data is a tool like pt-archiver. pt-archiver is a component of the Percona Toolkit (so it is available as an open source download) that nibbles old data from a source table and moves it to a target table. The target can be in the current or an archive database. It is designed for use with an up-and-running database, and has minimal impact on the overall performance of the database. Because it moves the data slowly and can be left running, it archives data regularly and cleanly. One warning is that it does delete the data from the source table, so you should test it before running it in production. pt-archiver works with MySQL data only. It is also important to note that removing large quantities of data might cause InnoDB fragmentation. Running OPTIMIZE TABLE to recover the space resolves this. As of version 5.7.4, this is no longer a locking action.
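
To show what “nibbling” means in practice, here is a small Python sketch that moves old rows from a source table to an archive table in small batches, committing after each batch. It is not pt-archiver’s implementation. It uses Python’s built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the column layout, the cutoff date and the batch size are invented for illustration (the table names follow the transactions and transactions_archived example above).

```python
import sqlite3

def archive_in_batches(conn, cutoff, batch_size=2):
    """Move rows older than `cutoff` into transactions_archived, a batch at a time."""
    moved = 0
    while True:
        rows = conn.execute(
            "SELECT id, created, amount FROM transactions "
            "WHERE created < ? ORDER BY id LIMIT ?",
            (cutoff, batch_size),
        ).fetchall()
        if not rows:
            return moved
        with conn:  # one short transaction per batch: copy first, then delete
            conn.executemany(
                "INSERT INTO transactions_archived (id, created, amount) "
                "VALUES (?, ?, ?)",
                rows,
            )
            conn.executemany(
                "DELETE FROM transactions WHERE id = ?",
                [(row[0],) for row in rows],
            )
        moved += len(rows)

# In-memory stand-in database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, created TEXT, amount REAL)")
conn.execute("CREATE TABLE transactions_archived (id INTEGER PRIMARY KEY, created TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "2015-03-01", 10.0), (2, "2015-09-15", 20.0), (3, "2016-02-01", 5.0),
     (4, "2016-12-20", 7.5), (5, "2017-01-05", 12.0)],
)
conn.commit()

moved = archive_in_batches(conn, cutoff="2016-06-01")
print(moved)
```

Keeping each batch small means each transaction is short, which is the point of nibbling: the live table is never locked for long. As with the real tool, the rows are deleted from the source, so try this kind of process on a copy first.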

So now what?

Unless you are in the enviable position where archiving MySQL and MongoDB data isn’t an issue, the first step is to come up with an archiving scheme. This will likely involve many different people since there can be an impact across the entire organization. Determine what can and should be archived, and then determine how best to archive the data. Document the process and test it before putting it into production. In the end, your database and your users will thank you.

by Rick Golba at January 06, 2017 09:07 PM

Millions of Queries per Second: PostgreSQL and MySQL’s Peaceful Battle at Today’s Demanding Workloads

This blog compares how PostgreSQL and MySQL handle millions of queries per second.

Anastasia: Can open source databases cope with millions of queries per second? Many open source advocates would answer “yes.” However, assertions aren’t enough for well-grounded proof. That’s why in this blog post, we share the benchmark testing results from Alexander Korotkov (CEO of Development, Postgres Professional) and Sveta Smirnova (Principal Technical Services Engineer, Percona). The comparative research of PostgreSQL 9.6 and MySQL 5.7 performance will be especially valuable for environments with multiple databases.

The idea behind this research is to provide an honest comparison for the two popular RDBMSs. Sveta and Alexander wanted to test the most recent versions of both MySQL and PostgreSQL with the same tool, under the same challenging workloads and using the same configuration parameters (where possible). However, because both PostgreSQL and MySQL ecosystems evolved independently, with standard testing tools (pgbench and SysBench) used for each database, it wasn’t an easy journey.

The task fell to database experts with years of hands-on experience. Sveta has worked as a Senior Principal Technical Support Engineer in the Bugs Verification Group of the MySQL Support Group at Oracle for more than eight years, and since 2015 has worked as a Principal Technical Services Engineer at Percona. Alexander Korotkov is a PostgreSQL major contributor, and the developer of a number of PostgreSQL features, including the CREATE ACCESS METHOD command, generic WAL interface, lockfree Pin/UnpinBuffer, index-based search for regular expressions and much more. So we have a pretty decent cast for this particular play!

Sveta: Dimitri Kravtchuk regularly publishes detailed benchmarks for MySQL, so my main task wasn’t confirming that MySQL can do millions of queries per second. As our graphs will show, we’ve passed that mark already. As a Support Engineer, I often work with customers who have heterogeneous database environments in their shops, and want to know about the impact of migrating jobs from one database to another. So instead, I found the chance to work with the Postgres Professional company and identify both the strong and weak points of the two databases an excellent opportunity.

We wanted to test both databases on the same hardware, using the same tools and tests. We expected to test base functionality, and then work on more detailed comparisons. That way we could compare different real-world use case scenarios and popular options.

Spoiler: We are far from the final results. This is the start of a blog series.

OpenSource Databases on Big Machines, Series 1: “That Was Close…”

PostgreSQL Professional together with Freematiq provided two modern, powerful machines for tests.

Hardware configuration:

Processors: physical = 4, cores = 72, virtual = 144, hyperthreading = yes
Memory: 3.0T
Disk speed: about 3K IOPS
OS: CentOS 7.1.1503
File system: XFS

I also used a smaller Percona machine.

Hardware configuration:

Processors: physical = 2, cores = 12, virtual = 24, hyperthreading = yes
Memory: 251.9G
Disk speed: about 33K IOPS
OS: Ubuntu 14.04.5 LTS
File system: EXT4

Note that machines with smaller numbers of CPU cores and faster disks are more common for MySQL installations than machines with larger numbers of cores.

The first thing we needed to agree on is which tool to use. A fair comparison only makes sense if the workloads are as close as possible.

The standard PostgreSQL tool for performance tests is pgbench, while for MySQL it’s SysBench. SysBench supports multiple database drivers and scriptable tests in the Lua programming language, so we decided to use this tool for both databases.

The initial plan was to convert pgbench tests into SysBench Lua syntax, and then run standard tests on both databases. After initial results, we modified our tests to better examine specific MySQL and PostgreSQL features.

I converted pgbench tests into SysBench syntax, and put the tests into an open-database-bench GitHub repository.

And then we both faced difficulties.

As I wrote already, I also ran the tests on a Percona machine. For this converted test, the results were almost identical:

Percona machine:

OLTP test statistics:
       transactions:                        1000000 (28727.81 per sec.)
       read/write requests:                 5000000 (143639.05 per sec.)
       other operations:                    2000000 (57455.62 per sec.)

Freematiq machine:

OLTP test statistics:
       transactions:                        1000000 (29784.74 per sec.)
       read/write requests:                 5000000 (148923.71 per sec.)
       other operations:                    2000000 (59569.49 per sec.)

I started investigating. The only place where the Percona machine was better than Freematiq’s was disk speed. So I started running the pgbench read-only test, which was identical to SysBench’s point select test with full dataset in memory. But this time SysBench used 50% of the available CPU resources:

4585 smirnova  20   0  0,157t 0,041t   9596 S  7226  1,4  12:27.16 mysqld
8745 smirnova  20   0 1266212 629148   1824 S  7126  0,0   9:22.78 sysbench

Alexander, in turn, had issues with SysBench, which could not create a high load on PostgreSQL when prepared statements were used:

93087 korotkov  20   0 9289440 3,718g   2964 S 242,6  0,1   0:32.82 sysbench
93161 korotkov  20   0 32,904g  81612  80208 S   4,0  0,0   0:00.47 postgres
93116 korotkov  20   0 32,904g  80828  79424 S   3,6  0,0   0:00.46 postgres
93118 korotkov  20   0 32,904g  80424  79020 S   3,6  0,0   0:00.47 postgres
93121 korotkov  20   0 32,904g  80720  79312 S   3,6  0,0   0:00.47 postgres
93128 korotkov  20   0 32,904g  77936  76536 S   3,6  0,0   0:00.46 postgres
93130 korotkov  20   0 32,904g  81604  80204 S   3,6  0,0   0:00.47 postgres
93146 korotkov  20   0 32,904g  81112  79704 S   3,6  0,0   0:00.46 postgres

We contacted SysBench author Alexey Kopytov, and he fixed the MySQL issue. The solution is:

  • Use SysBench with the options --percentile=0 --max-requests=0 (reasonable CPU usage)
  • Use the concurrency_kit branch (better concurrency and Lua processing)
  • Rewrite Lua scripts to support prepared statements (pull request:
  • Start both SysBench and mysqld with the jemalloc or tcmalloc library pre-loaded

A fix for PostgreSQL is on the way. For now, Alexander converted a standard SysBench test into pgbench format and we stuck with it. Not much new for MySQL, but at least we had a baseline for comparison.

The next difficulty I faced was the default operating system parameters. To make a long story short, I changed them to the recommended ones (listed below):

cpupower frequency-set --governor performance
kernel.sched_migration_cost_ns= 5000000
IO scheduler [deadline]

The same parameters were better for PostgreSQL performance as well. Alexander set his machine similarly.

After solving these issues we learned and implemented the following:

  • We cannot use a single tool (for now)
  • Alexander wrote a test for pgbench, imitating the standard SysBench tests
  • We are still not able to write custom tests because we use different tools

But we could use these tests as a baseline. After work done by Alexander, we stuck with the standard SysBench tests. I converted them to use prepared statements, and Alexander converted them into pgbench format.

I should mention that I was not able to get the same results as Dimitri for the Read Only and Point Select tests. They are close, but slightly slower. We need to investigate if this is the result of different hardware, or my lack of performance testing abilities. The results from the Read-Write tests are similar.

Another difference was between the PostgreSQL and MySQL tests. MySQL users normally have many connections. Setting the value of the variable

, and limiting the total number of parallel connections to thousands is not rare nowadays. While not recommended, people use this option even without the thread pool plugin. In real life, most of these connections are sleeping. But there is always a chance they all will get used in cases of increased website activity.

For MySQL I tested up to 1024 connections. I used powers of two and multiples of the number of cores: 1, 2, 4, 8, 16, 32, 36, 64, 72, 128, 144, 256, 512 and 1024 threads.
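
As a small illustration, the thread counts listed above can be generated rather than typed by hand. This Python sketch is an assumption about how one might script it, not the script used for the tests. The function name and the core_steps parameter (36, 72 and 144, matching the core counts of the large machine) are hypothetical.

```python
def thread_counts(max_threads=1024, core_steps=(36, 72, 144)):
    """Return powers of two up to max_threads, merged with core-based steps."""
    powers = []
    n = 1
    while n <= max_threads:
        powers.append(n)
        n *= 2
    # Merge both series and sort so the benchmark runs in increasing order.
    return sorted(set(powers) | set(core_steps))


if __name__ == "__main__":
    print(thread_counts())
```

Each value would then be passed to SysBench as --num-threads for one run.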

For Alexander, it was more important to test in smaller steps. He started from one thread and increased by 10 threads, until 250 parallel threads were reached. So you will see a more detailed graph for PostgreSQL, but no results after 250 threads.

Here are our comparison results.


PostgreSQL and MySQL

  • pgsql-9.6 is standard PostgreSQL
  • pgsql-9.6 + pgxact-align is PostgreSQL with this patch (more details can be found in this blog post)
  • MySQL-5.7 Dimitri is Oracle’s MySQL Server
  • MySQL-5.7 Sveta is Percona Server 5.7.15


PostgreSQL and MySQL


PostgreSQL and MySQL

Sync commit in PostgreSQL is a feature, similar to

 in InnoDB, and async commit is similar to

You see that the results are very similar: both databases are developing very fast and work with modern hardware well.

MySQL results showing up to 1024 threads, for reference:


PostgreSQL and MySQL

OLTP RW with innodb_flush_log_at_trx_commit set to 1 and 2

PostgreSQL and MySQL

After receiving these results, we did a few feature-specific tests that will be covered in separate blog posts.

More Information

MySQL Options for OLTP RO and Point SELECT tests:

# general
table_open_cache = 8000
# files
# Monitoring
innodb_monitor_enable = '%'
performance_schema=OFF #cpu-bound, matters for performance
#Percona Server specific
# buffers
innodb_buffer_pool_instances=128 #to avoid wait on InnoDB Buffer Pool mutex
# InnoDB-specific
innodb_checksums=1 #Default is CRC32 in 5.7, very fast
innodb_doublewrite= 1 #
innodb_stats_persistent = 1
innodb_support_xa=0 #(We are read-only, but this option is deprecated)
innodb_spin_wait_delay=6 #(Processor and OS-dependent)
# perf special
innodb_adaptive_flushing = 1
innodb_flush_neighbors = 0
innodb_read_io_threads = 4
innodb_write_io_threads = 4
innodb_adaptive_hash_index=0 #(depends on workload, always check)

MySQL Options for OLTP RW:

#Open files
table_open_cache = 8000
table_open_cache_instances = 16
query_cache_type = 0
#Percona Server specific
#InnoDB General
innodb_spin_wait_delay=12 #Good value for RO is 6, for RW and RC is 192
innodb_buffer_pool_instances=128 #to avoid wait on InnoDB Buffer Pool mutex
innodb_flush_neighbors = 0
innodb_change_buffering=none #can be inserts, workload-specific
optimizer_switch="index_condition_pushdown=off" #workload-specific

MySQL SysBench parameters:

 [ --test=/data/sveta/sysbench/sysbench/tests/db/oltp_prepared.lua | --test=/data/sveta/sysbench/sysbench/tests/db/oltp_simple_prepared.lua ]
 --db-driver=mysql --oltp-tables-count=8 --oltp-table-size=10000000
--mysql-table-engine=innodb --mysql-user=msandbox --mysql-password=msandbox
--num-threads=$i --max-requests=0 --max-time=300
--percentile=0 [--oltp-read-only=on --oltp-skip-trx=on]

PostgreSQL pgbench parameters:

$ git clone
$ cd pg_oltp_bench
$ make USE_PGXS=1
$ sudo make USE_PGXS=1 install
$ psql DB -f oltp_init.sql
$ psql DB -c "CREATE EXTENSION pg_oltp_bench;"
$ pgbench -c 100 -j 100 -M prepared -f oltp_ro.sql -T 300 -P 1 DB
$ pgbench -c 100 -j 100 -M prepared -f oltp_rw.sql -T 300 -P 1 DB

Features in MySQL 5.7 that significantly improved performance:

  • InnoDB: transaction list optimization
  • InnoDB: Reduce lock_sys_t::mutex contention
  • InnoDB: fix index->lock contention
  • InnoDB: faster and parallel flushing
    • Multiple page cleaner threads: WL #6642
    • Reduced number of pages which need to be flushed: WL #7047
    • Improved adaptive flushing: WL #7868
  • MDL (Meta-Data Lock) scalability
    • Remove THR_LOCK::mutex for InnoDB: WL #6671
    • Partitioned LOCK_grant
    • Number of partitions is constant
    • Thread ID used to assign partition
    • Lock-free MDL lock acquisition for DML

Anastasia: The initial findings of this research were announced at Percona Live Amsterdam 2016. More findings were added to the second version of the same talk given at Moscow HighLoad++ 2016. Hopefully the third iteration of this talk will be available at Percona Live Open Source Database Conference 2017 in Santa Clara. Stay tuned: the Percona Live Committee is working on the program!

by Anastasia Raspopina at January 06, 2017 04:33 PM

January 05, 2017

Peter Zaitsev

MongoDB Ransomware: Not Likely, But How Do You Know?

In this blog post, we’ll look at some of the concerns recently seen around MongoDB ransomware and security issues.

Security blogs and magazines have recently been aflutter with the news that a hacker is stealing data from MongoDB installations and demanding bitcoins to get the data back. This sounds pretty bad at first glance, but let’s examine the facts.

The hacker needs a few things to pull this off:

  1. MongoDB is running on default ports
  2. MongoDB is not using authentication
  3. MongoDB is accessible on the Internet with no security groups or firewalls

If this sounds familiar, you might remember a similar flurry occurred last year when people counted the number of open MongoDB installs on the web. That required these same conditions to all be true. This also means the solution is the same: you simply need to make sure you follow the normal security practices of locking down ports and using authentication. Not so scary after all, right?

What does this hack look like?

Finding out if this happened is simple: your data is removed and gone! In its place, you will find a “WARNING” database, which holds a “WARNING” collection. This collection has a document that looks like:

     "_id" : ObjectId("5859a0370b8e49f123fcc7da"),
     "mail" : "",

To fix this, hopefully, you have backups. If you don’t, you might want to look at how to get consistent backups. Without a backup, you will need to send the hackers the 0.2 bitcoins (~200 USD) to get your data back.

So, back up!

But this brings us to the real question: can you be hijacked? It’s pretty easy to check:

  1. Do you have authentication on? Try running this command:

rs1:PRIMARY> var sec = db.adminCommand('getCmdLineOpts').parsed.security; if (sec === undefined || sec.authorization === undefined || sec.authorization == "disabled"){ print("Auth not enabled!")}else{print("You're safe!")}
Auth not enabled!

  2. Are you running on a non-default port? Simply run this command (if you’re using 27017 or 29017, you’re using a default port):

rs1:PRIMARY> db.adminCommand('getCmdLineOpts')

The last part is a bit harder if the other two are both false. You will need to spin up a server outside of your environment and test the connection. I suggest an Amazon EC2 Micro instance (it’s very inexpensive – free if you use a new account). It’s simple to install a MongoDB client on. Check your setup:

  1. Login to Amazon and launch an EC2 node.
  2. Open a shell to this node (this can be done via their website).
  3. Get MongoDB’s binaries:

wget -q --show-progress
gzip -d mongodb-linux-x86_64-amazon-3.4.1.tgz
mkdir -p 3.4
tar xf mongodb-linux-x86_64-amazon-3.4.1.tar -C 3.4 --strip-components=1

  4. Try to connect to your MongoDB server:

./3.4/bin/mongo --host <your_host_name> --port <your_mongod_port>

If this connects, and you can run “db.serverStatus()”, you are at risk and should enable authentication ASAP!
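
If installing the MongoDB client on the outside host is not convenient, a plain TCP check already shows whether the port is exposed. The Python sketch below is a hypothetical helper, not part of MongoDB's tooling. It only proves that something accepts connections on that port; it does not tell you whether authentication is enabled, so the mongo shell test above is still the definitive one.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Replace the placeholders with your own host name and mongod port.
    print(port_is_reachable("your_host_name", 27017))
```

Run it from outside your environment: a True result from the public Internet means the firewall or security group is not blocking the port.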

We will have a blog out shortly on the particulars of creating a user. To enable authentication, you simply need to add “--auth” to your startup, or set security.authorization to “enabled” in your YAML config file.


This should get you started on correctly protecting yourself against MongoDB ransomware (and other security threats). If you want to have someone review your security, or even help you use LDAP to tie into your main authentication systems, please contact us.

by David Murphy at January 05, 2017 08:07 PM

Jean-Jerome Schmidt

Tips and Tricks - How to shard MySQL with ProxySQL in ClusterControl

Having too large a (write) workload on a master is dangerous. If the master collapses and a failover happens to one of its slave nodes, the slave node could collapse under the write pressure as well. To mitigate this problem you can shard horizontally across more nodes.

Sharding increases the complexity of data storage though, and very often, it requires an overhaul of the application. In some cases, it may be impossible to make changes to an application. Luckily there is a simpler solution: functional sharding. With functional sharding you move a schema or table to another master, thus relieving the original master of the workload of these schemas or tables.

In this Tips & Tricks post, we will explain how you can functionally shard your existing master and offload some of its workload to another master. We will use ClusterControl, MySQL replication and ProxySQL to make this happen, and the total time taken should not be longer than 15 minutes. Mission impossible? :-)

The example database

In our example we have a serious issue with the workload on our simple order database, accessed by the so_user. The majority of the writes are happening on two tables: orders and order_status_log. Every change to an order will write to both the orders table and the status log table.

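CREATE TABLE `orders` (
  `id` int(11) NOT NULL AUTO_INCREMENT, -- assumed definition: this column is missing from the listing
  `customer_id` int(11) NOT NULL,
  `status` varchar(14) DEFAULT 'created',
  `total_vat` decimal(15,2) DEFAULT '0.00',
  `total` decimal(15,2) DEFAULT '0.00',
  PRIMARY KEY (`id`)
);
CREATE TABLE `order_status_log` (
  `orderId` int(11) NOT NULL,
  `status` varchar(14) DEFAULT 'created',
  `changeTime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, -- assumed definition: this column is missing from the listing
  `logline` text,
  PRIMARY KEY (`orderId`, `status`, `changeTime` )
);
CREATE TABLE `customers` (
  `id` int(11) NOT NULL AUTO_INCREMENT, -- assumed definition: this column is missing from the listing
  `firstname` varchar(15) NOT NULL,
  `surname` varchar(80) NOT NULL,
  `address` varchar(255) NOT NULL,
  `postalcode` varchar(6) NOT NULL,
  `city` varchar(50) NOT NULL,
  `state` varchar(50) NOT NULL,
  `country` varchar(50) NOT NULL,
  PRIMARY KEY (`id`)
);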

What we will do is to move the order_status_log table to another master.

As you might have noticed, there is no foreign key defined on the order_status_log table. This simply would not work across functional shards. Joining the order_status_log table with any other table would no longer work either, as it will be physically on a different server than the other tables. And if you write transactional data to multiple tables, the rollback will only work for one of these masters. If you wish to retain these things, you should consider using homogenous sharding instead, where you keep related data grouped together in the same shard.

Installing the Replication setups

First, we will install a replication setup in ClusterControl. The topology in our example is really basic: we deploy one master and one replica:

But you could import your own existing replication topology into ClusterControl as well.

After the setup has been deployed, deploy the second setup:

While waiting for the second setup to be deployed, we will add ProxySQL to the first replication setup:

Adding the second setup to ProxySQL

After ProxySQL has been deployed we can connect to it via the command line, and see its currently configured servers and settings:

MySQL [(none)]> select hostgroup_id, hostname, port, status, comment from mysql_servers;
| hostgroup_id | hostname    | port | status | comment               |
| 20           | | 3306 | ONLINE | read server           |
| 20           | | 3306 | ONLINE | read server           |
| 10           | | 3306 | ONLINE | read and write server |
MySQL [(none)]> select rule_id, active, username, schemaname, match_pattern, destination_hostgroup from mysql_query_rules;
| rule_id | active | username | schemaname | match_pattern                                           | destination_hostgroup |
| 100     | 1      | NULL     | NULL       | ^SELECT .* FOR UPDATE                                   | 10                    |
| 200     | 1      | NULL     | NULL       | ^SELECT .*                                              | 20                    |
| 300     | 1      | NULL     | NULL       | .*                                                      | 10                    |

As you can see, ProxySQL has been configured with the ClusterControl default read/write splitter for our first cluster. Any basic select query will be routed to hostgroup 20 (read pool) while all other queries will be routed to hostgroup 10 (master). What is missing here is the information about the second cluster, so we will add the hosts of the second cluster first:

MySQL [(none)]> INSERT INTO mysql_servers VALUES (30, '', 3306, 'ONLINE', 1, 0, 100, 10, 0, 0, 'Second repl setup read server'), (30, '', 3306, 'ONLINE', 1, 0, 100, 10, 0, 0, 'Second repl setup read server');
Query OK, 2 rows affected (0.00 sec) 
MySQL [(none)]> INSERT INTO mysql_servers VALUES (40, '', 3306, 'ONLINE', 1, 0, 100, 10, 0, 0, 'Second repl setup read and write server');
Query OK, 1 row affected (0.00 sec)

After this we need to load the servers to ProxySQL runtime tables and store the configuration to disk:

MySQL [(none)]> LOAD MYSQL SERVERS TO RUNTIME;
Query OK, 0 rows affected (0.00 sec)
MySQL [(none)]> SAVE MYSQL SERVERS TO DISK;
Query OK, 0 rows affected (0.01 sec)

As ProxySQL is doing the authentication for the clients as well, we need to add the so_user user to ProxySQL to allow the application to connect through ProxySQL:

MySQL [(none)]> INSERT INTO mysql_users (username, password, active, default_hostgroup, default_schema) VALUES ('so_user', 'so_pass', 1, 10, 'simple_orders');
Query OK, 1 row affected (0.00 sec)
MySQL [(none)]> LOAD MYSQL USERS TO RUNTIME;
Query OK, 0 rows affected (0.00 sec)
MySQL [(none)]> SAVE MYSQL USERS TO DISK;
Query OK, 0 rows affected (0.00 sec)

Now we have added the second cluster and user to ProxySQL. Keep in mind that normally in ClusterControl the two clusters are considered two separate entities. ProxySQL will remain part of the first cluster. Even though it is now configured for the second cluster, it will only be displayed under the first cluster.
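
To make the default read/write split from this section easier to reason about, here is a small Python model of the three ClusterControl query rules listed earlier (rule IDs 100, 200 and 300). It is a simplified sketch, not ProxySQL's implementation: it assumes rules are evaluated in ascending rule_id order, that the first matching rule decides the hostgroup, and that matching is case-insensitive. The route function is hypothetical.

```python
import re

# (rule_id, match_pattern, destination_hostgroup), copied from the rules table.
RULES = [
    (100, r"^SELECT .* FOR UPDATE", 10),
    (200, r"^SELECT .*", 20),
    (300, r".*", 10),
]


def route(query, rules=RULES):
    """Return the hostgroup of the first rule (lowest rule_id) matching the query."""
    for _rule_id, pattern, hostgroup in sorted(rules):
        if re.search(pattern, query, re.IGNORECASE):
            return hostgroup
    return None


if __name__ == "__main__":
    print(route("SELECT * FROM orders WHERE id = 1"))             # read pool
    print(route("SELECT * FROM orders WHERE id = 1 FOR UPDATE"))  # master
    print(route("UPDATE orders SET status = 'paid' WHERE id = 1"))  # master
```

The ordering matters: the FOR UPDATE rule must have a lower rule_id than the generic SELECT rule, otherwise locking reads would land on the read pool.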

Mirroring the data

Keep in mind that mirroring queries in ProxySQL is still a beta feature, and it doesn’t guarantee the mirrored queries will actually be executed. We have found it working fine within the boundaries of this use case. Also there are (better) alternatives to our example here, where you would make use of a restored backup on the new cluster and replicate from the master until you make the switch. We will describe this scenario in a follow up Tips & Tricks blog post.

Now that we have added the second cluster, we need to create the simple_orders database, the order_status_log table and the appropriate users on the master of the second cluster:

mysql> create database simple_orders;
Query OK, 1 row affected (0.01 sec)
mysql> use simple_orders;
Database changed
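mysql> CREATE TABLE `order_status_log` (
  `orderId` int(11) NOT NULL,
  `status` varchar(14) DEFAULT 'created',
  `changeTime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, -- assumed definition: this column is missing from the listing
  `logline` text,
  PRIMARY KEY (`orderId`, `status`, `changeTime` )
);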
Query OK, 0 rows affected (0.00 sec)
mysql> create user 'so_user'@'' identified by 'so_pass';
Query OK, 0 rows affected (0.00 sec)
mysql> grant select, update, delete, insert on simple_orders.* to 'so_user'@'';
Query OK, 0 rows affected (0.00 sec)

This enables us to start mirroring the queries executed against the first cluster onto the second cluster. This requires an additional query rule to be defined in ProxySQL:

MySQL [(none)]> INSERT INTO mysql_query_rules (rule_id, active, username, schemaname, match_pattern, destination_hostgroup, mirror_hostgroup, apply) VALUES (50, 1, 'so_user', 'simple_orders', '(^INSERT INTO|^REPLACE INTO|^UPDATE|INTO TABLE) order_status_log', 10, 40, 1);
Query OK, 1 row affected (0.00 sec)
Query OK, 1 row affected (0.00 sec)

In this rule ProxySQL will match everything that is writing to the order_status_log table, and also send it to hostgroup 40 (the write server of the second cluster).

Now that we have started mirroring the queries, the backfill of the data from the first cluster can take place. You can use the timestamp from the first entry in the new order_status_log table to determine the time we started to mirror.
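
The backfill step can be sketched as follows: take the earliest changeTime found in the new table as the cutoff, then copy every older row from the first cluster. This Python example is only an illustration under stated assumptions. It uses two in-memory SQLite databases in place of the two MySQL masters, the backfill function is hypothetical, and rows written in the same second as the cutoff would need extra care in a real migration.

```python
import sqlite3

DDL = """CREATE TABLE order_status_log (
    orderId INTEGER NOT NULL, status TEXT, changeTime TEXT NOT NULL, logline TEXT,
    PRIMARY KEY (orderId, status, changeTime))"""


def backfill(old, new):
    """Copy rows older than the first mirrored row from the old to the new table."""
    cutoff = new.execute("SELECT MIN(changeTime) FROM order_status_log").fetchone()[0]
    if cutoff is None:
        raise RuntimeError("no mirrored rows yet, cannot determine the cutoff")
    rows = old.execute(
        "SELECT orderId, status, changeTime, logline FROM order_status_log "
        "WHERE changeTime < ?",
        (cutoff,),
    ).fetchall()
    with new:  # commit the copied rows in one transaction
        new.executemany("INSERT INTO order_status_log VALUES (?, ?, ?, ?)", rows)
    return cutoff, len(rows)


def demo():
    old, new = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    for conn in (old, new):
        conn.execute(DDL)
    history = [
        (1, "created", "2017-01-05 09:00:00", "order created"),
        (1, "paid", "2017-01-05 09:05:00", "payment received"),
        (2, "created", "2017-01-05 09:10:00", "order created"),
        (2, "paid", "2017-01-05 09:15:00", "payment received"),
    ]
    old.executemany("INSERT INTO order_status_log VALUES (?, ?, ?, ?)", history)
    # Mirroring started with the third row, so the new table only has the tail.
    new.executemany("INSERT INTO order_status_log VALUES (?, ?, ?, ?)", history[2:])
    old.commit()
    new.commit()
    cutoff, copied = backfill(old, new)
    total = new.execute("SELECT COUNT(*) FROM order_status_log").fetchone()[0]
    return cutoff, copied, total
```

Since mirrored queries are not guaranteed to execute, comparing row counts between both tables after the backfill is a sensible sanity check.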

Once the data has been backfilled we can reconfigure ProxySQL to perform all actions on the order_status_log table on the second cluster. This will be a two-step approach: first, add new rules that move the read queries to the second cluster’s read servers, except for the SELECT … FOR UPDATE queries, which go to its write server. Then, modify our mirroring rule to stop mirroring and only write to the second cluster.

MySQL [(none)]> INSERT INTO mysql_query_rules (rule_id, active, username, schemaname, match_pattern, destination_hostgroup, apply) VALUES (70, 1, 'so_user', 'simple_orders', '^SELECT .* FROM order_status_log', 30, 1), (60, 1, 'so_user', 'simple_orders', '^SELECT .* FROM order_status_log .* FOR UPDATE', 40, 1);
Query OK, 2 rows affected (0.00 sec)
MySQL [(none)]> UPDATE mysql_query_rules SET destination_hostgroup=40, mirror_hostgroup=NULL WHERE rule_id=50;
Query OK, 1 row affected (0.00 sec)

And don’t forget to activate and persist the new query rules:

MySQL [(none)]> LOAD MYSQL QUERY RULES TO RUNTIME;
Query OK, 1 row affected (0.00 sec)
MySQL [(none)]> SAVE MYSQL QUERY RULES TO DISK;
Query OK, 0 rows affected (0.05 sec)

After this final step we should see the workload drop on the first cluster, and increase on the second cluster. Mission possible and accomplished. Happy clustering!

by Art at January 05, 2017 09:43 AM

January 04, 2017

Peter Zaitsev

MongoDB 3.4: Sharding Improvements

In this blog post, we will discuss some of the Sharding improvements in the recent MongoDB 3.4 GA release.


Let’s go over what MongoDB Sharding “is” at a simplified, high level.

The concept of “sharding” exists to allow MongoDB to scale to very large data sets that may exceed the available resources of a single node or replica set. When a MongoDB collection is sharding-enabled, its data is broken into ranges called “chunks.” These are intended to be evenly distributed across many nodes or replica sets (called “shards”). MongoDB computes the ranges of a given chunk based on a mandatory document-key called a “shard key.” The shard key is used in all read and write queries to route a database request to the right shard.
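
As a rough mental model of range-based routing, the Python sketch below maps a shard key value to the shard owning the chunk that contains it. It is a simplification and not MongoDB's code: the chunk boundaries and shard names are made up, each chunk is described only by its exclusive upper bound, and the last chunk is unbounded.

```python
import bisect

# (exclusive upper bound of the chunk, shard holding it); hypothetical values.
CHUNKS = [
    (100, "shardA"),           # keys below 100
    (200, "shardB"),           # keys from 100 up to, but not including, 200
    (float("inf"), "shardC"),  # keys from 200 upwards
]


def shard_for(key, chunks=CHUNKS):
    """Return the shard whose chunk range contains the given shard key value."""
    bounds = [upper for upper, _ in chunks]
    # bisect_right skips bounds that are <= key, since upper bounds are exclusive.
    idx = bisect.bisect_right(bounds, key)
    return chunks[idx][1]


if __name__ == "__main__":
    for key in (42, 100, 250):
        print(key, shard_for(key))
```

A query that does not include the shard key cannot be resolved this way and has to be sent to every shard.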

The MongoDB ecosystem introduced additional architectural components so that this could happen:

  1. A shard. A single MongoDB node or replica set used for storing the cluster data. There are usually many shards in a cluster and more shards can be added/removed to scale.
  2. “mongos” router. A sharding-aware router for handling client traffic. There can be one or more mongos instances in a cluster.
  3. The “config servers”. Used for storing the cluster metadata. Config servers are essentially regular MongoDB servers dedicated to storing the cluster metadata within the “config” database. Database traffic does not access these servers, only the mongos.

Under sharding, all client database traffic is directed to one or more of the mongos router process(es), which use the cluster metadata, such as the chunk ranges, the members of the cluster, etc., to service requests while keeping the details of sharding transparent to the client driver. Without the cluster metadata, the routing process(es) do not know where the data is, making the config servers a critical component of sharding. Due to this, at least three config servers are required for full fault tolerance.

Sharding: Chunk Balancer

To ensure chunks are always balanced among the cluster shards, a process named the “chunk balancer” (or simply “balancer”) runs periodically, moving chunks from shard to shard to ensure data is evenly distributed. When a chunk is balanced, the balancer doesn’t actually move any data; it merely coordinates the transfer between the source and destination shard and updates the cluster metadata when chunks have moved.

Before MongoDB 3.4, the chunk balancer would run on whichever mongos process could acquire a cluster-wide balancer lock first. From my perspective this was a poor architectural decision for these reasons:

  1. Predictability. Due to the first-to-lock nature, the mongos process running the balancer is essentially chosen at random. This can complicate troubleshooting as you try to chase down which mongos process is the active balancer to see what it is doing, its logs, etc. As a further example: it is common in some deployments for the mongos process to run locally on application servers and in large organizations it is common for a DBA to not have access to application hosts, something I’ve run into many times myself.
  2. Efficiency. mongos was supposed to be a stateless router, not a critical administrative process! As all client traffic passes in-line through the mongos process, it is important for it to be as simple, reliable and efficient as possible.
  3. Reliability. In order to operate, the mongos process must read and write cluster metadata hosted within the config servers. As mongos is almost always running on a physically separate host from the config servers, any disruption (network, hardware, etc.) in between the balancer and config server nodes will break balancing!

Luckily, MongoDB 3.4 has come to check this problem (and many others) off of my holiday wish list!

MongoDB 3.4: Chunk Balancer Moved to Config Servers

In MongoDB 3.4, the chunk balancer was moved to the Primary config server, bringing these solutions to my concerns about the chunk balancer:

  1. Predictability. The balancer is always running in a single, predictable place: the primary config server.
  2. Efficiency. Removing the balancer from “mongos” allows it to worry about routing only. Also, as config servers are generally dedicated nodes that are never directly hit by client database traffic, in my opinion this is a more efficient place for the balancer to run.
  3. Reliability. Perhaps the biggest win I see with this change is the balancer can no longer lose connectivity with the cluster metadata that is stored on separate hosts. The balancer now runs inside the same node as the metadata!
  4. Centralized. As a freebie, now all the background/administrative components of Sharding are in one place!

Note: although we expect the overhead of the balancer to be negligible, keep in mind that a minor overhead is added to the config server Primary-node due to this change.

See more about this change here:

MongoDB 3.4: Required Config Server Replica Set

In MongoDB releases before 3.2, the set of cluster config servers received updates using a mode called Sync Cluster Connection Config (SCCC) to ensure all nodes received the same change. This essentially meant that any update to cluster metadata would be sent N times (once per config server) from the mongos to the config servers in a fan-out pattern. This is another legacy design choice that always confused me, considering MongoDB already has facilities for reliably replicating data: MongoDB Replication. Plus without transactions in MongoDB, there are some areas where SCCC can fail.

Luckily MongoDB 3.2 introduced Replica-Set based config servers as an optional feature. This moved us away from the SCCC fan-out mode to traditional replication and write concerns for consistency. This brought many benefits: rebuilding a config server node became simpler, backups became more straightforward and flexible and the move towards a consistent method of achieving consistent updates simplified the architecture.

MongoDB 3.4 requires Replica-Set based config servers, and removed the SCCC mode entirely. This might require some changes for some, but I think the benefits outweigh the cost. For more details on how to upgrade from SCCC to Replica-Set based config servers, see this article.

Note: the balancer in MongoDB 3.4 always runs on the config server that is the ‘PRIMARY’ of the replica set.

MongoDB 3.4: Parallel Balancing

As intended, MongoDB Sharded Clusters can get very big, with 10s, 100s or even 1000s of shards. Historically, MongoDB’s balancer worked serially, meaning it could only coordinate one chunk balancing round at any given time within the cluster. On very large clusters, this severely limits balancing throughput: all chunk moves have to wait in a serial queue.

In MongoDB 3.4, the chunk balancer can now perform several chunk moves in parallel, provided each move is between a distinct pair of source and destination shards. Given shards A, B, C and D, this means that a migration from A -> B can now happen at the same time as a migration from C -> D, as the two moves have no shard in common. Of course, you need four or more shards to really see the benefit of this change.
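
To make the rule concrete, here is a minimal Python sketch that picks, from a list of candidate chunk moves, a set that could run at the same time because no shard appears in more than one move. This is only an illustration of the constraint described above, not MongoDB's actual balancer code; the function name pick_parallel_migrations and the (source, destination) tuple format are assumptions made for this example.

```python
def pick_parallel_migrations(candidates):
    """Greedily select migrations so that each shard takes part in at most one.

    `candidates` is a list of (source, destination) shard-name pairs.
    Illustrative sketch only; this is not MongoDB's balancer algorithm.
    """
    busy = set()      # shards already involved in a selected migration
    selected = []
    for source, destination in candidates:
        if source == destination:
            continue  # a chunk cannot move to the shard it already lives on
        if source in busy or destination in busy:
            continue  # one of the shards is already migrating; retry next round
        selected.append((source, destination))
        busy.update((source, destination))
    return selected


if __name__ == "__main__":
    moves = [("A", "B"), ("A", "C"), ("C", "D"), ("B", "D")]
    print(pick_parallel_migrations(moves))
```

With four shards, at most two moves can be selected in one round, which is why smaller clusters see little benefit.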

On large clusters, this change could significantly increase network bandwidth usage, because several balancing operations can now occur at once. Be sure to test your network capacity with this change.


Of course, there were many other improvements to sharding and other areas in 3.4. We hope to cover more in the future. These are just some of my personal highlights.

For more information about what has changed in the new GA release, see: MongoDB 3.4 Release Notes. Also, please check out our beta release of Percona Server for MongoDB 3.4. This includes all the improvements in MongoDB 3.4 plus additional storage engines and features.


by Tim Vaillancourt at January 04, 2017 09:37 PM

January 03, 2017

Peter Zaitsev

Enabling and Disabling Jemalloc on Percona Server

This post discusses enabling and disabling jemalloc on Percona Server for MySQL.

The benefits of jemalloc versus glibc for use with MySQL have been widely discussed. With jemalloc (along with Transparent Huge Pages disabled) you have less memory fragmentation, and thus more efficient resource management of the available server memory.

For standard installations of Percona Server 5.6+ (releases starting with 5.6.19-67.0), the only thing needed to use jemalloc as the memory library for mysqld is for it to be installed on the server.

Enabling Jemalloc on Percona Server

First things first: install jemalloc.

The library is available from the Percona repository, which supports both apt and yum package management.

Once you have the repo, just run the install command (according to your OS) to install it:

yum install jemalloc / apt-get install libjemalloc1

Now that you have the jemalloc package installed, all it takes to start using it is:

  • Restart the server.

That’s it! No modifications needed on the my.cnf file or anywhere else. Plain and simple!

Disabling Jemalloc on Percona Server

If for any reason you need to disable jemalloc and go back to the default library, you have two options: remove the jemalloc package (not too practical), or add the following line to the [mysqld_safe] section of the my.cnf file:

malloc-lib =

In other words, an empty path. That will do the trick. Note that commenting out or removing the “malloc-lib” parameter in the cnf file won’t work.

How to Know if Jemalloc is Being Used?

There are a couple of ways you can verify this, but the least invasive way is to use the pt-mysql-summary tool (version 2.2.20 and higher) from the Percona Toolkit:

root@reports:~# pt-mysql-summary | grep -A5 -i "memory management"
# Memory management library ##################################
jemalloc enabled in MySQL config for process with ID 5122
Using jemalloc from /usr/lib/x86_64-linux-gnu/
# The End ####################################################
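
Another way to check on Linux, sketched here as an assumption rather than something taken from the Percona tooling, is to look at the libraries mapped into the running mysqld process via /proc/<pid>/maps. The helper names below are hypothetical; the only thing relied on is that a preloaded jemalloc library shows up as a mapped file whose path contains "jemalloc".

```python
def uses_jemalloc(maps_text):
    """Return True if any mapped file path in /proc/<pid>/maps mentions jemalloc."""
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and "jemalloc" in fields[5]:
            return True
    return False


def mysqld_uses_jemalloc(pid):
    """Check a live process; usually needs root or the same OS user as mysqld."""
    with open(f"/proc/{pid}/maps") as maps_file:
        return uses_jemalloc(maps_file.read())
```

Calling mysqld_uses_jemalloc with the mysqld process ID should agree with what pt-mysql-summary reports.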

by Daniel Guzmán Burgos at January 03, 2017 08:29 PM

Jean-Jerome Schmidt

How to perform online schema changes on MySQL using gh-ost

In the previous blog post, we discussed how gh-ost works internally and how it compares to Percona’s pt-online-schema-change. Today we’d like to focus on operations - how can we test a schema change with gh-ost to verify it can be executed without any issues? And how do we go ahead and perform the actual schema change?

Testing migration

Ensuring that a migration will go smoothly is one of the most important steps in the whole schema change process. If you value your data, then you definitely want to avoid any risk of data corruption or partial data transformation. Let’s see how gh-ost allows you to test your migration.

gh-ost gives you numerous ways to test. First of all, you can execute a no-op migration by skipping the --execute flag. Let’s look at an example - we want to add a column to a table.

root@ip-172-30-4-235:~# ./gh-ost --host= --user=sbtest --password=sbtest --database=sbtest1 --table=sbtest1 --alter="ADD COLUMN x INT NOT NULL DEFAULT '0'" --chunk-size=2000 --max-load=Threads_connected=20

Here we pass access details like user, password, database and table to alter. We also define what change needs to be applied. Finally, we define the chunk size for the background copy process and what we consider the maximum load. Here we can pass different MySQL status counters (not all of them make sense) - we used threads_connected but we could use, for example, ‘threads_running’. Once this threshold is crossed, gh-ost starts to throttle writes.

# Migrating `sbtest1`.`sbtest1`; Ghost table is `sbtest1`.`_sbtest1_gho`
# Migrating ip-172-30-4-4:3306; inspecting ip-172-30-4-235:3306; executing on ip-172-30-4-235
# Migration started at Tue Dec 20 14:00:45 +0000 2016
# chunk-size: 2000; max-lag-millis: 1500ms; max-load: Threads_connected=20; critical-load: ; nice-ratio: 0.000000

Next, we see information about the migration: which table we alter and which table is used as the ghost (temporary) table. Gh-ost creates two tables. The one with the _gho suffix is a temporary table with the new schema, and it’s the target of the data copying process. The second table, with the _ghc suffix, stores migration logs and status. We can also see a couple of other defaults. The maximum acceptable lag is 1500 milliseconds (1.5 seconds); gh-ost can work with an external script to provide up to millisecond granularity for lag control. If you don’t set the --replication-lag-query flag, seconds_behind_master from SHOW SLAVE STATUS will be used, which has a granularity of one second.

# throttle-additional-flag-file: /tmp/gh-ost.throttle
# Serving on unix socket: /tmp/gh-ost.sbtest1.sbtest1.sock

Here we have information about the throttle flag file - creating it will automatically trigger throttling in gh-ost. We also have a unix socket file, which can be used to control gh-ost’s configuration at runtime.

Copy: 0/0 100.0%; Applied: 0; Backlog: 0/100; Time: 1s(total), 0s(copy); streamer: binlog.000042:102283; State: migrating; ETA: due
CREATE TABLE `_sbtest1_gho` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  `x` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)

Finally, we have information about progress - nothing interesting here as we ran a no-op change. We also have information about the schema of the target table.
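
The --max-load check used in the command above boils down to comparing status counters against thresholds. The following Python sketch shows that idea only; it is not gh-ost's implementation, and the function names parse_max_load and exceeded_thresholds are made up for this example. The status dictionary stands in for the output of SHOW GLOBAL STATUS.

```python
def parse_max_load(spec):
    """Parse a string such as 'Threads_connected=20,Threads_running=10'."""
    thresholds = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        thresholds[name.strip().lower()] = int(value)
    return thresholds


def exceeded_thresholds(status, thresholds):
    """Return the names of status counters whose value is above its threshold."""
    lowered = {name.lower(): int(value) for name, value in status.items()}
    return sorted(
        name for name, limit in thresholds.items()
        if lowered.get(name, 0) > limit
    )


if __name__ == "__main__":
    limits = parse_max_load("Threads_connected=20")
    # A non-empty result means writes should be throttled.
    print(exceeded_thresholds({"Threads_connected": "25"}, limits))
```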

Now that we have tested a no-op change, it’s time for some more real-life tests. Again, gh-ost gives you an option to verify that everything goes as planned. What we can do is use one of our database replicas to run the change on, and verify that it went fine. Gh-ost will stop the replication for us as soon as the change completes, to ensure that we can compare data from the old and new tables. It’s not so easy to compare tables with different schemas, so we may want to start with a change which doesn’t do anything, for example an ALTER that only sets ENGINE=InnoDB, as in the command below.


Let’s run this migration to verify that gh-ost actually does its job correctly:

root@ip-172-30-4-235:~# ./gh-ost --host= --user=sbtest --password=sbtest --database=sbtest1 --table=sbtest1 --alter="ENGINE=InnoDB" --chunk-size=2000 --max-load=Threads_connected=20 --test-on-replica --execute

Once it’s done, you will see your slave in the following state.

mysql> \P grep Running
PAGER set to 'grep Running'
             Slave_IO_Running: No
            Slave_SQL_Running: No
1 row in set (0.00 sec)

Replication has been stopped so no new changes are being added.

mysql> SHOW TABLES FROM sbtest1;
| Tables_in_sbtest1 |
| _sbtest1_gho      |
| sbtest1           |
2 rows in set (0.00 sec)

The gh-ost table has been left for you to look into. Now, since we ran a no-op alter, we can compare both tables to verify that the whole process worked flawlessly. There are a couple of methods to do that. You can, for example, dump the table contents via SELECT … INTO OUTFILE and then compare the md5 of both dump files. You can also use the CHECKSUM TABLE command in MySQL:

mysql> CHECKSUM TABLE sbtest1.sbtest1, sbtest1._sbtest1_gho EXTENDED;
| Table                | Checksum  |
| sbtest1.sbtest1      | 851491558 |
| sbtest1._sbtest1_gho | 851491558 |
2 rows in set (9.27 sec)

As long as checksums are identical (no matter how you calculated them), you should be safe to assume that both tables are identical and the migration process went fine.
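
For the dump-based approach, the comparison itself can be sketched in a few lines of Python. The file paths are whatever you passed to SELECT … INTO OUTFILE; the helper names file_md5 and dumps_match are hypothetical. Note that both dumps need to be written in the same row order (for example, ordered by primary key), otherwise identical tables can still produce different checksums.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the md5 of a file without loading it fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def dumps_match(original_dump, ghost_dump):
    """True if both dump files have identical content."""
    return file_md5(original_dump) == file_md5(ghost_dump)
```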

Performing an actual migration

Once we have verified that gh-ost can execute our schema change correctly, it’s time to actually execute it. Keep in mind that you may need to manually drop old tables that were created by gh-ost during the process of testing the migration. You can also use the --initially-drop-ghost-table and --initially-drop-old-table flags to ask gh-ost to do it for you. The final command to execute is exactly the same as the one we used to test our change; we just add --execute to it.

./gh-ost --host= --user=sbtest --password=sbtest --database=sbtest1 --table=sbtest1 --alter="ADD COLUMN x INT NOT NULL DEFAULT '0'" --chunk-size=2000 --max-load=Threads_connected=20 --execute

Once started, we’ll see a summary of the job. The main change is that the “migrating” host points to our master, and we use one of the slaves to look for binary logs.

# Migrating `sbtest1`.`sbtest1`; Ghost table is `sbtest1`.`_sbtest1_gho`
# Migrating ip-172-30-4-4:3306; inspecting ip-172-30-4-235:3306; executing on ip-172-30-4-235
# Migration started at Fri Dec 23 19:18:00 +0000 2016
# chunk-size: 2000; max-lag-millis: 1500ms; max-load: Threads_connected=20; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# Serving on unix socket: /tmp/gh-ost.sbtest1.sbtest1.sock

We can also see progress messages printed by gh-ost:

Copy: 0/9982267 0.0%; Applied: 0; Backlog: 7/100; Time: 4s(total), 0s(copy); streamer: binlog.000074:808522953; State: migrating; ETA: N/A
Copy: 0/9982267 0.0%; Applied: 538; Backlog: 100/100; Time: 5s(total), 1s(copy); streamer: binlog.000074:808789786; State: migrating; ETA: N/A
Copy: 0/9982267 0.0%; Applied: 1079; Backlog: 100/100; Time: 6s(total), 2s(copy); streamer: binlog.000074:809092031; State: migrating; ETA: N/A
Copy: 0/9982267 0.0%; Applied: 1580; Backlog: 100/100; Time: 7s(total), 3s(copy); streamer: binlog.000074:809382067; State: migrating; ETA: N/A
Copy: 0/9982267 0.0%; Applied: 2171; Backlog: 84/100; Time: 8s(total), 4s(copy); streamer: binlog.000074:809718243; State: migrating; ETA: N/A
Copy: 4000/9982267 0.0%; Applied: 2590; Backlog: 33/100; Time: 9s(total), 5s(copy); streamer: binlog.000074:810697550; State: migrating; ETA: N/A
Copy: 12000/9982267 0.1%; Applied: 3006; Backlog: 5/100; Time: 10s(total), 6s(copy); streamer: binlog.000074:812459945; State: migrating; ETA: N/A
Copy: 28000/9982267 0.3%; Applied: 3348; Backlog: 12/100; Time: 11s(total), 7s(copy); streamer: binlog.000074:815749963; State: migrating; ETA: N/A
Copy: 46000/9982267 0.5%; Applied: 3736; Backlog: 0/100; Time: 12s(total), 8s(copy); streamer: binlog.000074:819054426; State: migrating; ETA: N/A
Copy: 60000/9982267 0.6%; Applied: 4032; Backlog: 4/100; Time: 13s(total), 9s(copy); streamer: binlog.000074:822321562; State: migrating; ETA: N/A
Copy: 78000/9982267 0.8%; Applied: 4340; Backlog: 12/100; Time: 14s(total), 10s(copy); streamer: binlog.000074:825982397; State: migrating; ETA: N/A
Copy: 94000/9982267 0.9%; Applied: 4715; Backlog: 0/100; Time: 15s(total), 11s(copy); streamer: binlog.000074:829283130; State: migrating; ETA: N/A
Copy: 114000/9982267 1.1%; Applied: 5060; Backlog: 24/100; Time: 16s(total), 12s(copy); streamer: binlog.000074:833357982; State: migrating; ETA: 17m19s
Copy: 130000/9982267 1.3%; Applied: 5423; Backlog: 16/100; Time: 17s(total), 13s(copy); streamer: binlog.000074:836654200; State: migrating; ETA: 16m25s

From those we can see how many rows were copied, how many events have been applied from binary logs, if there is a backlog of binlog events to apply, how long the whole process and copying of data took, binlog coordinates where gh-ost is looking for new events, state of the job (migrating, throttled, etc) and estimated time to complete the process.
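
If you want to feed these progress lines into monitoring, they are easy to parse. The sketch below is based only on the line format shown in the output above (it is not an official gh-ost interface, and the format may differ between versions); the names STATUS_RE and parse_status are made up for this example.

```python
import re

# Pattern derived from the status lines printed above.
STATUS_RE = re.compile(
    r"Copy: (?P<copied>\d+)/(?P<total>\d+) (?P<percent>[\d.]+)%; "
    r"Applied: (?P<applied>\d+); "
    r"Backlog: (?P<backlog>\d+)/(?P<backlog_max>\d+); "
    r"Time: (?P<total_time>\S+)\(total\), (?P<copy_time>\S+)\(copy\); "
    r"streamer: (?P<binlog_file>[^:]+):(?P<binlog_pos>\d+); "
    r"State: (?P<state>.+); "
    r"ETA: (?P<eta>.+)$"
)


def parse_status(line):
    """Return a dict of fields from one progress line, or None if it does not match."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("copied", "total", "applied", "backlog", "backlog_max", "binlog_pos"):
        fields[key] = int(fields[key])
    fields["percent"] = float(fields["percent"])
    return fields
```

The state field keeps any extra detail, so a throttled line yields something like "throttled, lag=33.166494s".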

It is important to remember that the number of rows to copy is just an estimate based on the EXPLAIN output for:

SELECT * FROM yourschema.yourtable;

You can see it below in ‘rows’ column and on gh-ost status output:

mysql> EXPLAIN SELECT * FROM sbtest1.sbtest1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: sbtest1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9182788
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)

Copy: 0/9182788 0.0%; Applied: 0; Backlog: 0/100; Time: 1m15s(total), 0s(copy); streamer: binlog.000111:374831609; State: migrating; ETA: N/A
Copy: 0/9182788 0.0%; Applied: 0; Backlog: 100/100; Time: 1m20s(total), 5s(copy); streamer: binlog.000111:374945268; State: throttled, lag=33.166494s; ETA: N/A
Copy: 0/9182788 0.0%; Applied: 0; Backlog: 100/100; Time: 1m25s(total), 10s(copy); streamer: binlog.000111:374945268; State: throttled, lag=2.766375s; ETA: N/A
Copy: 0/9182788 0.0%; Applied: 1907; Backlog: 100/100; Time: 1m30s(total), 15s(copy); streamer: binlog.000111:375777140; State: migrating; ETA: N/A
Copy: 0/9182788 0.0%; Applied: 4543; Backlog: 100/100; Time: 1m35s(total), 20s(copy); streamer: binlog.000111:376924495; State: migrating; ETA: N/A

If you are interested in having precise numbers, you can use --exact-rowcount flag in gh-ost. If you use it, gh-ost will execute SELECT COUNT(*) FROM yourtable;, making sure that the number of rows has been calculated precisely.

After some time, gh-ost should complete the change, leaving the old table with _del suffix (_yourtable_del). In case something went wrong, you still can recover old data and then, using binary logs, replay any events which are missing. Obviously, it’s not the cleanest or fastest way to recover but it has been made possible - we’d surely take it over data loss.

What we described above is the default way in which gh-ost performs migration - read binary log from a slave, analyze table on a slave and execute changes on the master. This way we minimize any extra load which is put on the master. If you’d like to execute all your changes on the master, it is possible, as long as your master uses RBR format.

To execute our change on the master, we need to execute gh-ost in a way like below. We use our master’s IP in --host flag. We also use --allow-on-master flag to tell gh-ost that we are going to run the whole process on the master only.

./gh-ost --host= --user=sbtest --password=sbtest --database=sbtest1 --table=sbtest1 --alter="ADD COLUMN x INT NOT NULL DEFAULT '0'" --chunk-size=2000 --max-load=Threads_connected=20 --allow-on-master --execute

As you can clearly see, gh-ost gives you numerous ways in which you can ensure the schema change will be performed smoothly and in a safe manner. We cannot stress enough how important it is for a DBA to have a way to test every operation. Flexibility is also very welcome - default behavior of reducing load on the master makes perfect sense, but it is good that gh-ost still allows you to execute everything on the master only.

In the next blog post, we are going to discuss some safety measures that come with gh-ost. Namely, we will talk about its throttling mechanism and ways to perform runtime configuration changes.

by krzysztof at January 03, 2017 03:02 PM

January 01, 2017

Daniël van Eeden

The mysql client, and some improvements

The mysql client is a tool which I use every day as a DBA. I think it's a great tool. When I used a client of several other SQL and NoSQL databases I was quickly reminded of all the features of the mysql client. Note that psql (PostgreSQL client) is also very nice.

Some other interesting things about the mysql client:

  • It is built from the same mysql-server repository as MySQL Server. The source is in client/
  • In addition to the server version, it also reports 14.14 as its version. The previous version (14.13) was around the time of MySQL 5.1, so this version is mostly meaningless.
  • If you start it, it identifies itself as "MySQL monitor", not to be confused with MySQL Enterprise Monitor.
  • The version of the client is not tightly coupled with the server; in most situations a 5.6 client works fine with a 5.7 server and vice versa. Note that there might be some minor annoyances if you use an older client with a newer server. For example: the 5.6 client doesn't know about the new hint syntax, and considers the hint to be just a comment. And comments are stripped by default, which results in the situation that the hint is not sent to the server.

But there are some situations where the MySQL client has some limitations.

The first one is that the 'pager' option doesn't work on Windows. The pager command is very useful (e.g. less, grep, etc). And cmd.exe isn't the best terminal emulator ever. Using a third party terminal emulator or PowerShell fixes that somewhat. And with PowerShell there are some other issues you might run into: MySQL uses UTF-8, and PowerShell uses UTF-16. While both can do charset conversions, this often makes things more difficult (for example: Bug #74817).

And if you're working with spatial data, images or stored procedures then the mysql client is often not very helpful. The graphical client, MySQL Workbench, is often much better suited in these cases. It has syntax highlighting, a spatial viewer and an image viewer. It allows you to edit a SQL script and then execute it and edit it again and run it again. If you try to do this with the history of the mysql client then the formatting gets lost. For working with SQL procedures, triggers, events, etc the solution is to edit it with your favourite editor and then source it. But for images and spatial data you often really have to use Workbench or something like QGIS.

Besides the CLI vs GUI difference there are some more differences in how most people use both tools. Workbench is installed on the client workstation and then uses a remote connection to the server. Workbench supports the native SSL/TLS protocol and can also tunnel through SSH.
The mysql client supports SSL/TLS, but doesn't support SSH tunnelling. Which is ok, because you can just run it on the server.
This also has implications for configuration: The mysql client only needs to know how to connect to the local server. Workbench needs configuration for every server. This makes the mysql client more useful if you are managing a large set of machines.

One of the more annoying situations with the mysql client is when you quickly want to select a row from a table or run that select query which was reported as being slow. So you ssh to the server and run the query... and then you suddenly get a lot of 'weird' characters on your screen. This happens if you have binary columns (BLOB, varbinary, geometry) to store IP addresses, locations, binary UUIDs, photos, etc.
I made a patch to fix that. With the patch, binary data is printed as hex literals (e.g. 0x08080404 for the binary version of 8.8.4.4). So this doesn't break your terminal anymore and also allows you to copy the value to the subsequent query.

mysql> select * from t1;
| id | ip |
| 1 | 0x00000000000000000000000000000001 |
| 2 | 0x7F000001 |
| 3 | 0x08080808 |
| 4 | 0x08080404 |
4 rows in set (0.00 sec)

mysql> show create table t1\G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`ip` varbinary(16) DEFAULT NULL,
1 row in set (0.00 sec)
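
The formatting itself is simple. As an illustration only (the patch is for the C++ mysql client; this is a Python sketch with a made-up function name), this is how packed IP addresses map to the hex literals shown above:

```python
import ipaddress


def hex_literal(value):
    """Format raw bytes as a MySQL-style hex literal, e.g. b'\\x7f\\x00\\x00\\x01' -> 0x7F000001."""
    return "0x" + value.hex().upper()


if __name__ == "__main__":
    for text in ("::1", "127.0.0.1", "8.8.8.8", "8.8.4.4"):
        packed = ipaddress.ip_address(text).packed  # 4 bytes for IPv4, 16 for IPv6
        print(text, hex_literal(packed))
```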

This might raise the question: why not display them as an IP address instead? I did make a patch to do that. The patch triggers this display if the column is varbinary with a length which matches an IPv4 or IPv6 address. But we might store IP addresses in columns with other names and we might store values which are not an IP, but have the same length. This would require a lot of configuration and configuration options. And this would need more work for geometry types, binary UUID's etc. So for now I decided not to take that route.
It would be nice if the server would allow you to define an 'ip6' datatype which is just an alias for varbinary(16), but would be sent to the client. This could also be done with something like "SELECT c1::ip6" in the query. Or the server really has to define UUID, and IP types. Or user defined types. Or both.

mysql> select id,hex(ip),ip from t1\G
*************************** 1. row ***************************
id: 1
hex(ip): 00000000000000000000000000000001
ip: INET6_ATON('::1')
*************************** 2. row ***************************
id: 2
hex(ip): 7F000001
ip: INET6_ATON('')
2 rows in set (0.00 sec)

Also somewhat belonging in this list: I made a patch in 2015 which replaces the drawing characters (+ for corners, - for horizontal lines, | for vertical lines) with unicode drawing characters.

mysql> DESC mysql.func;
│ Field │ Type │ Null │ Key │ Default │ Extra │
│ name │ char(64) │ NO │ PRI │ │ │
│ ret │ tinyint(1) │ NO │ │ 0 │ │
│ dl │ char(128) │ NO │ │ │ │
│ type │ enum('function','aggregate') │ NO │ │ NULL │ │
4 rows in set (0.00 sec)

I also made a patch to report the runtime with more detail (e.g 0.004 instead of 0.00).

mysql> select sleep(0.123);
| sleep(0.123) |
| 0 |
1 row in set (0.123 sec)

I also once made a patch to set the terminal title.

And what about the future? I don't know, the mysql client might be replaced with MySQL Shell (mysqlsh), but for that to happen mysqlsh needs many improvements. MySQL Workbench could replace some of it if it gets the capability to easily connect to many similar servers without much configuration. But should it? iTerm2 (macOS) now allows you to display images in the terminal, so if more terminal emulators would get this feature then it might make sense to get an image and geometry viewer in the client.

Please leave a comment with your experience with the mysql client and which features you would like to see.

by Daniël van Eeden at January 01, 2017 09:18 PM

December 31, 2016

Valeriy Kravchuk

Fun with Bugs #46 - On Some Bugs I've Reported During the Year of 2016

It's time to summarize the year of 2016. As a kind of a weird summary, in this post I'd like to share a list of MySQL bug reports I've created in 2016 that are still remaining "Verified" today:

  • Bug #79831 - "Unexpected error message on crash-safe slave with max_relay_log_size set". According to Umesh this is not repeatable with 5.7. The fact that I've reported the bug on January 4 probably means I was working at that time. I should not repeat this mistake again next year.
  • Bug #80067 - "Index on BIT column is NOT used when column name only is used in WHERE clause". People say the same problem happens with INT and, what may be even less expected, BOOLEAN columns.
  • Bug #80424 - "EXPLAIN output depends on binlog_format setting". Who could expect that?
  • Bug #80619 - "Allow slave to filter replication events based on GTID". In this feature request I've suggested to implement filtering by GTID pattern, so that we can skip all events originating from specific master on some slave in a complex replication chain.
  • Bug #82127 - "Deadlock with 3 concurrent DELETEs by UNIQUE key". It's clear that the manual is not even close to explaining how the locks are really set "by design" in this weird case. See comments in MDEV-10962 for some explanations. Nobody from Oracle even tried to really explain how things are designed to work.
  • Bug #82212 - "mysqlbinlog can produce events larger than max_allowed_packet for mysql". This happens for encoded row-based events. There should be some way to take this overhead into account while creating binary log, IMHO.
  • Bug #83024 - "Internals manual does not explain COM_SLEEP in details". One day you'll see Sleep for some 17 seconds logged into the slow query log, and may start to wonder why...
  • Bug #83248 - "Partition pruning is not working with LEFT JOIN". You may find some interesting related ideas in MDEV-10946.
  • Bug #83640 - "Locks set by DELETE statement on already deleted record". This case shows that design of locking in InnoDB does produce really weird outcomes sometimes. This is not about "missing manual", this is about extra lock set that is absolutely NOT needed (a gap X lock on a record in the secondary unique index is set when the same transaction already has the next key lock on it). As a side note, I keep finding, explaining and reporting weird or undocumented details in InnoDB locking for years, still my talk about InnoDB locks was not accepted by Oracle once again for OOW in 2016. What do I know about the subject and who even cares about those locks...
  • Bug #83708 - "uint expression is used for the value that is passed as my_off_t for DDL log". I was really shocked by this finding. I assumed that all uint vs unsigned long long improper casts are already found. It seems I was mistaken.
  • Bug #83912 - "Time spent sleeping before entering InnoDB is not measured/reported separately". The use case that led me to reporting this bug is way more interesting than the fact that some wait is not instrumented in performance_schema. You may see more related bug reports from me next year.
  • Bug #83950 - "LOAD DATA INFILE fails with an escape character followed by a multi-byte one". This single bug (and related bugs and stories) was the original topic for issue #46 of my "Fun With Bugs" series. I was not able to write everything I wanted properly over the last 3 weeks, but trust me: it's a great story, of "Let's Make America Great Again" style. With the goal for LOAD DATA to behave exactly as INSERT when wrong utf8 data are inserted, Oracle changed the way LOAD DATA works back and forth, with the last change (back) happening in 5.7.17:
     "Incompatible Change: A change made in MySQL 5.7.8 for handling of multibyte character sets by LOAD DATA was reverted due to the replication incompatibility (Bug #24487120, Bug #82641)"
    I just can not keep up with all the related fun people have in replication environments thanks to these ongoing changes... It's incredible.
  • Bug #84004 - "Manual misses details on MDL locks set and released for online ALTER TABLE". Nothing new: locks in MySQL are not properly/completely documented, metadata locks included. Yes, they are documented better now, after 11+ years of my continuous efforts (of a kind), but we are "not there yet". I am still waiting for a job offer to join MySQL Documentation Team, by the way :)
  • Bug #84173 - "mysqld_safe --no-defaults & silently does NOT work any more". Recent MySQL 5.7.17 release had not only given us new Group Replication plugin and introduced incompatible changes. In a hope to fix security issues it comes with pure regression - for the first time in last 11 years mysqld_safe --no-defaults stopped working for me! By the way, mysqld_safe is still NOT safe in a sense that 5.7.17 tried to enforce, and one day (really soon) you will find out why.
  • Bug #84185 - "Not all "Statements writing to a table with an auto-increment..." are unsafe". If you do something like DELETE FROM `table` WHERE some_col IN (SELECT some_id FROM `other_table`) where `table` has auto_increment column, why should anyone care about it? We do not generate the value, we delete rows...
    This bug report was actually created by Hartmut Holzgraefe and test case comes from Elena Stepanova (see MDEV-10170). I want to take this opportunity to thank them and other colleagues from MariaDB for their hard work and cooperation during the year of 2016. Thanks to Umesh (who processed most of my bug reports),  Sinisa Milivojevic and Miguel Solorzano for their verifications of my bug reports this year.

In conclusion I should say that, no matter how pointless you may consider this activity, I still suggest that you report each and every problem that you have with MySQL and cannot understand after reading the manual as a public MySQL bug. Now, re-read my 4 years old post on this topic and have a Happy and Fruitful New Year 2017!

by Valeriy Kravchuk at December 31, 2016 05:16 PM

December 29, 2016

Peter Zaitsev

Query Language Type Overview

This blog provides a query language type overview.

The idea for this blog originated from some customers asking me questions. When working in a particular field, you often use a dedicated vocabulary that makes sense to your peers. It often includes phrases and abbreviations because it’s efficient. It’s no different in the database world. Much of this language might make sense to DBAs, but it might sound like “voodoo” to people not used to it. The overview below covers the basic types of query languages inside SQL. I hope it clarifies what they mean, how they’re used and how you should interpret them.

DDL (Data Definition Language)

A database schema is a visualization of information. It contains the data structure separated by tables structures, views and anything that contains structure for your data. It defines how you want to store and visualize the information.

It’s like a skeleton, defining how data is organized. Any action that creates/updates/changes this skeleton is DDL.

Do you remember spreadsheets? A table definition describes something like:

Account number Account name Account owner Creation date Amount
Sorted ascending Unique, indexed Date, indexed Number, linked with transactions

Whenever you want to create a table like this, you must use a DDL query. For example:

CREATE TABLE accounts (
Account_number Bigint(16),
Account_name varchar(255),
Account_owner varchar(255),
Creation_date date,
Amount Bigint(16),
PRIMARY KEY (Account_number),
FOREIGN KEY (Amount) REFERENCES transactions(Balancevalue)
);

CREATE, ALTER, DROP, etc.: all of these types of structure modification queries are DDL queries!

Defining the structure of the tables is important as this defines how you would potentially access the information stored in the database while also defining how you might visualize it.

Why should you care that much?

DDL queries define the structure on which you develop your application. Your structure will also define how the database server searches for information in a table, and how it is linked to other tables (using foreign keys, for example).

You must design your MySQL schema before adding information to it (unlike NoSQL solutions such as MongoDB). MySQL might be more rigid in this manner, but it often makes sense to design the pattern for how you want to store your information and query it properly.

Due to the rigidity of an RDBMS system, changing the data structure (or table schema) requires the system to rebuild the actual table in most cases. This is potentially problematic for performance or table availability (locking). Often this is a “hot” procedure (since MySQL 5.6), requiring no downtime for active operations. Additionally, tools like pt-osc or other open source solutions can be used for migrating the data structure to a new format without requiring downtime.

An example:

ALTER TABLE accounts ADD COLUMN wienietwegisisgezien varchar(20);
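
Whether a given change runs online depends on the operation and the server version. As a minimal sketch, assuming MySQL 5.6 or later and the accounts table from above (the column name note is hypothetical), you can state the requirement explicitly. The statement then fails with an error instead of silently falling back to a blocking table copy:

```sql
-- Ask for an in-place change that keeps the table readable and writable.
-- If the server cannot satisfy ALGORITHM=INPLACE with LOCK=NONE for this
-- operation, it returns an error rather than locking the table.
ALTER TABLE accounts
  ADD COLUMN note VARCHAR(20),
  ALGORITHM=INPLACE, LOCK=NONE;
```

If the statement is rejected, that is a signal to schedule the change for a quiet period or to use an external online schema change tool.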

DML (Data Manipulation Language)

Data manipulation is what it sounds like: working with information inside a structure. Inserting information and deleting information (adding rows, deleting rows) are examples of data manipulation.

An example:

INSERT into resto_visitor values(5,'Julian','highway 5',12);
UPDATE resto_visitor set name='Evelyn',age=17 where id=103;

Sure, but why should I use it?

Having a database environment makes no sense unless you insert and fetch information out of it. Remember that databases are plentiful in the world: whenever you click on a link on your favorite blog website, it probably means you are fetching information out of a database (and that data was at one time inserted or modified).

Interacting with a database requires that you write DML queries.

DCL (Data Control Language)

Data control language is anything that is used for administrating access to the database content. For example, GRANT queries:

GRANT ALL PRIVILEGES ON database.table to 'jeffbridges'@'ourserver';

Well that’s all fine, but why another subset “language” in SQL?

As a user of database environments, at some point you’ll get access permission from someone performing a DCL query. Data control language is used to define authorization rules for accessing the data structures (tables, views, variables, etc.) inside MySQL.
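
In practice you rarely hand out ALL PRIVILEGES. The following is a sketch of a narrower setup, assuming MySQL 5.7.6 or later (for CREATE USER IF NOT EXISTS) and an existing table bank.accounts. The schema name bank and the password are placeholders, not taken from the text above:

```sql
-- Hypothetical account and schema names; replace the password placeholder.
CREATE USER IF NOT EXISTS 'jeffbridges'@'ourserver' IDENTIFIED BY 'change-me';

-- Allow reading and inserting rows in one table only.
GRANT SELECT, INSERT ON bank.accounts TO 'jeffbridges'@'ourserver';

-- Review what the account is allowed to do.
SHOW GRANTS FOR 'jeffbridges'@'ourserver';

-- Take the insert privilege away again.
REVOKE INSERT ON bank.accounts FROM 'jeffbridges'@'ourserver';
```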

TCL (Transaction Control Language) Queries

Transaction control language queries are used to control transactional processing in a database. What do we mean by transactional processes? Transactional processes are typically bundled DML queries.

This gives you the ability to perform or rollback a complete action. Only storage engines offering transaction support (like InnoDB) can work with TCL.

Yet another term, but why?

Ever wanted to combine information and perform it as one transaction? In some circumstances, for example, it makes sense to make sure you perform an insert first and then perform an update. If you don’t use transactions, the insert might fail and the associated update might be an invalid entry. Transactions make sure that either the complete transaction (a group of DML queries) takes place, or it’s completely rolled back (this is also referred to as atomicity).
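
The insert-then-update case can be sketched as follows. This assumes resto_visitor is an InnoDB table whose first column is id, as the DML example earlier suggests. The row values are made up for illustration:

```sql
START TRANSACTION;

-- Both statements belong to the same unit of work.
INSERT INTO resto_visitor VALUES (6, 'Marie', 'main street 1', 30);
UPDATE resto_visitor SET age = 31 WHERE id = 6;

-- Make both changes permanent together...
COMMIT;
-- ...or, if something went wrong before COMMIT, undo both instead:
-- ROLLBACK;
```

Until COMMIT runs, other sessions do not see the new row (under the default isolation level), and a ROLLBACK leaves the table exactly as it was.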


Hopefully this blog post helps you understand some of the “insider” database speech. Post comments below.

by Dimitri Vanoverbeke at December 29, 2016 10:46 PM

Percona Live Featured Tutorial with Øystein Grøvlen — How to Analyze and Tune MySQL Queries for Better Performance

Welcome to another post in the series of Percona Live featured tutorial speakers blogs! In these blogs, we’ll highlight some of the tutorial speakers that will be at this year’s Percona Live conference. We’ll also discuss how these tutorials can help you improve your database environment. Make sure to read to the end to get a special Percona Live 2017 registration bonus!

In this Percona Live featured tutorial, we’ll meet Øystein Grøvlen, Senior Principal Software Engineer at Oracle. His tutorial is on How to Analyze and Tune MySQL Queries for Better Performance. SQL query performance plays a big role in application performance. If some queries execute slowly, these queries or the database schema may need tuning. I had a chance to speak with Øystein and learn a bit more about MySQL query tuning:

Percona: How did you get into database technology? What do you love about it?

Øystein: I got into database technology during my Ph.D. studies. I got in touch with a research group in Trondheim, Norway, that did research on highly available distributed database systems. I ended up writing a thesis on query processing in such database systems.

What I love most about my job on the MySQL Optimizer Team is that it involves a lot of problem-solving. Why is a query so slow? What can we do to improve it? I have always been very interested in sports results and statistics. Working with query execution times gives me much of the same feeling. Searching for information is another interest of mine, and that is really what query execution is about.

Percona: What impacts database performance the most?

Øystein: From my point of view – mainly concerned with the performance of read-only queries – the most important performance metric is how much data needs to be accessed in order to answer a query. For update-intensive workloads, it is often about concurrency issues. For SELECT statements, the main thing is to not access more data than necessary.

Users should make sure to design their database schema so that the database system can efficiently access the needed data. This includes creating the right indexes. As MySQL developers, we need to develop the right algorithms to support efficient retrieval. We also need to provide a query optimizer that can pick the best query execution plan.

Of course, there are other performance aspects that are important. Especially if your data cannot fit in your database buffer pool. In that case, the order in which you access the data becomes more important. The best query plan when your data is disk-bound is not necessarily the same as when all data is in memory.

Percona: Your tutorial is called “How to Analyze and Tune MySQL Queries for Better Performance.” What are the most recent MySQL updates that help with tuning queries?

Øystein: I think the biggest improvements came in MySQL 5.6, with increased observability through performance schema and new variants of EXPLAIN (Structured EXPLAIN (JSON format) and visual EXPLAIN in MySQL Workbench). We also added Optimizer Trace, which gives insight into how the optimizer arrived at a certain query plan. All this made it easier to identify queries that need tuning, understand how a query is executed and what might be done to improve it.

In MySQL 5.7, we added a new syntax for optimizer hints, and provided a lot of new hints that can be used to influence the optimizer to change a non-optimal query plan. We also provided a query rewrite plugin that makes it possible to tune queries even when it is not possible to change the application.

MySQL 5.7 also came with improvements to EXPLAIN. It is now possible to get the query plan for a running query, and Structured EXPLAIN shows both estimated total query cost and the cost per table. A more experimental feature allows you to provide your own cost constants to the optimizer. This way, you can configure the optimizer to better suit your particular system.

For MySQL 8.0 we are continuing to improve tunability by adding more optimizer hints. At the same time, we are working hard on features that will reduce the need for tuning. Histograms and awareness of whether data is in memory or on disk make the optimizer able to pick better query plans.
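
For readers who have not used these features, here is a brief sketch of what they look like in practice. The table customers, its index idx_country and the connection id 42 are hypothetical and not part of the interview:

```sql
-- Structured EXPLAIN (MySQL 5.6+): the plan as JSON; 5.7 adds cost estimates.
EXPLAIN FORMAT=JSON
SELECT name FROM customers WHERE country = 'NO';

-- MySQL 5.7: show the plan of a statement running in another connection.
-- 42 is a placeholder connection id taken from SHOW PROCESSLIST.
EXPLAIN FOR CONNECTION 42;

-- MySQL 5.7 optimizer hint syntax: ask the optimizer not to use
-- range access on one index of this table.
SELECT /*+ NO_RANGE_OPTIMIZATION(customers idx_country) */ name
FROM customers WHERE country = 'NO';
```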

Percona: What do you want attendees to take away from your tutorial session? Why should they attend?

Øystein: While the query optimizer in most cases will come up with a good query plan, there are some cases where it won’t generate the optimal query plan. This tutorial will show how you can identify which queries need tuning, how you can further investigate the issues and what types of tuning options you have for different types of queries. By attending this tutorial, you will learn how to improve the performance of applications through query tuning.

Percona: What are you most looking forward to at Percona Live?

Øystein: I am looking forward to interacting with MySQL users, discussing the query performance issues they might have, and learning how I can help with their issues.

You can find out more about Øystein Grøvlen and his work with databases at his blog, or follow him on Twitter: @ogrovlen. Want to find out more about Øystein and MySQL query optimization? Register for Percona Live Data Performance Conference 2017, and see his tutorial How to Analyze and Tune MySQL Queries for Better Performance. Use the code FeaturedTalk and receive $30 off the current registration price!

Percona Live Data Performance Conference 2017 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 24-27, 2017 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

by Dave Avery at December 29, 2016 04:52 PM

December 28, 2016

Peter Zaitsev

Quickly Troubleshoot Metadata Locks in MySQL 5.7

In a previous article, Ovais demonstrated how a DDL can render a table blocked from new queries. In another article, Valerii introduced performance_schema.metadata_locks, which is available in MySQL 5.7 and exposes metadata lock details. Given this information, here’s a quick way to troubleshoot metadata locks by creating a stored procedure that can:

  • Find out which thread(s) have the metadata lock
  • Determine which thread has been waiting for it the longest
  • Find other threads waiting for the metadata lock

Setting up instrumentation

First, you need to enable instrumentation for metadata locks:

UPDATE performance_schema.setup_instruments SET ENABLED = 'YES' WHERE NAME = 'wait/lock/metadata/sql/mdl';
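
To confirm the change took effect, you can read the row back. Keep in mind that changes made to setup_instruments with UPDATE apply to the running server only and are lost on restart:

```sql
-- ENABLED should now report YES for the metadata lock instrument.
SELECT NAME, ENABLED, TIMED
FROM performance_schema.setup_instruments
WHERE NAME = 'wait/lock/metadata/sql/mdl';
```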

Second, you need to add this stored procedure:

USE test;
DROP PROCEDURE IF EXISTS procShowMetadataLockSummary;
delimiter //
CREATE PROCEDURE procShowMetadataLockSummary()
BEGIN
    -- Variables must be declared before the cursor.
    DECLARE table_schema VARCHAR(64);
    DECLARE table_name VARCHAR(64);
    DECLARE id bigint;
    DECLARE time bigint;
    DECLARE info longtext;
    DECLARE curMdlCount INT DEFAULT 0;
    DECLARE curMdlCtr INT DEFAULT 0;
    DECLARE curMdl CURSOR FOR SELECT * FROM tmp_blocked_metadata;
    -- One row per table with a pending exclusive metadata lock request.
    DROP TEMPORARY TABLE IF EXISTS tmp_blocked_metadata;
    CREATE TEMPORARY TABLE tmp_blocked_metadata (
       table_schema varchar(64),
       table_name varchar(64),
       id bigint,
       time bigint,
       info longtext,
       PRIMARY KEY(table_schema, table_name)
    );
    REPLACE tmp_blocked_metadata(table_schema,table_name,id,time,info) SELECT mdl.OBJECT_SCHEMA, mdl.OBJECT_NAME, t.PROCESSLIST_ID, t.PROCESSLIST_TIME, t.PROCESSLIST_INFO FROM performance_schema.metadata_locks mdl JOIN performance_schema.threads t ON mdl.OWNER_THREAD_ID = t.THREAD_ID WHERE mdl.LOCK_STATUS='PENDING' and mdl.LOCK_TYPE='EXCLUSIVE' ORDER BY mdl.OBJECT_SCHEMA,mdl.OBJECT_NAME,t.PROCESSLIST_TIME ASC;
    OPEN curMdl;
    -- Row count used to bound the loop below.
    SET curMdlCount = (SELECT FOUND_ROWS());
    WHILE (curMdlCtr < curMdlCount)
    DO
      FETCH curMdl INTO table_schema, table_name, id, time, info;
      SELECT CONCAT_WS(' ','PID',t.PROCESSLIST_ID,'has metadata lock on', CONCAT(mdl.OBJECT_SCHEMA,'.',mdl.OBJECT_NAME), 'with current state', CONCAT_WS('','[',t.PROCESSLIST_STATE,']'), 'for', t.PROCESSLIST_TIME, 'seconds and is currently running', CONCAT_WS('',"[",t.PROCESSLIST_INFO,"]")) AS 'Process(es) that have the metadata lock' FROM performance_schema.metadata_locks mdl JOIN performance_schema.threads t ON t.THREAD_ID = mdl.OWNER_THREAD_ID WHERE mdl.LOCK_STATUS='GRANTED' AND mdl.OBJECT_SCHEMA = table_schema and mdl.OBJECT_NAME = table_name AND mdl.OWNER_THREAD_ID NOT IN(SELECT mdl2.OWNER_THREAD_ID FROM performance_schema.metadata_locks mdl2 WHERE mdl2.LOCK_STATUS='PENDING' AND mdl.OBJECT_SCHEMA = mdl2.OBJECT_SCHEMA and mdl.OBJECT_NAME = mdl2.OBJECT_NAME);
      SELECT CONCAT_WS(' ','PID', id, 'has been waiting for metadata lock on',CONCAT(table_schema,'.', table_name),'for', time, 'seconds to execute', CONCAT_WS('','[',info,']')) AS 'Oldest process waiting for metadata lock';
      SET curMdlCtr = curMdlCtr + 1;
      SELECT CONCAT_WS(' ','PID', t.PROCESSLIST_ID, 'has been waiting for metadata lock on',CONCAT(table_schema,'.', table_name),'for', t.PROCESSLIST_TIME, 'seconds to execute', CONCAT_WS('','[',t.PROCESSLIST_INFO,']')) AS 'Other queries waiting for metadata lock' FROM performance_schema.metadata_locks mdl JOIN performance_schema.threads t ON t.THREAD_ID = mdl.OWNER_THREAD_ID WHERE mdl.LOCK_STATUS='PENDING' AND mdl.OBJECT_SCHEMA = table_schema and mdl.OBJECT_NAME = table_name AND mdl.OWNER_THREAD_ID AND t.PROCESSLIST_ID <> id ;
    END WHILE;
    CLOSE curMdl;
END//
delimiter ;


Now, let’s call the procedure to see if there are threads waiting for metadata locks:

mysql> CALL test.procShowMetadataLockSummary();
| Process(es) that have the metadata lock                                                                        |
| PID 10 has metadata lock on sbtest.sbtest with current state [] since 274 seconds and is currently running []  |
| PID 403 has metadata lock on sbtest.sbtest with current state [] since 291 seconds and is currently running [] |
2 rows in set (0.00 sec)
| Oldest process waiting for metadata lock                                                                               |
| PID 1264 has been waiting for metadata lock on sbtest.sbtest for 264 seconds to execute [truncate table sbtest.sbtest] |
1 row in set (0.00 sec)
| Other queries waiting for metadata lock                                                                                   |
| PID 1269 has been waiting for metadata lock on sbtest.sbtest for 264 seconds to execute [SELECT c from sbtest where id=?] |
| PID 1270 has been waiting for metadata lock on sbtest.sbtest for 264 seconds to execute [SELECT c from sbtest where id=?] |
| PID 1271 has been waiting for metadata lock on sbtest.sbtest for 264 seconds to execute [SELECT c from sbtest where id=?] |
| PID 1272 has been waiting for metadata lock on sbtest.sbtest for 264 seconds to execute [SELECT c from sbtest where id=?] |
| PID 1273 has been waiting for metadata lock on sbtest.sbtest for 264 seconds to execute [SELECT c from sbtest where id=?] |
5 rows in set (0.00 sec)

So, as you can see above, you have several choices. You could (a) do nothing and wait for threads 10 and 403 to complete and then thread 1264 can get the lock.

If you can’t wait, you can (b) kill the threads that have the metadata lock so that the TRUNCATE TABLE in thread 1264 can get the lock. Although, before you decide to kill threads 10 and 403, you should check

 to see if the undo log entries for those threads are high. If they are, rolling back these transactions might take a long time.

Lastly, you can instead (c) kill the DDL thread 1264 to free up other queries. You should then reschedule the DDL to run during off-peak hours.
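
One way to weigh options (b) and (c), not necessarily the check referred to above, is to look at how many rows each lock-holding transaction has modified. This sketch assumes the thread ids from the sample output and uses information_schema.INNODB_TRX, where trx_mysql_thread_id matches the PID printed by the procedure:

```sql
-- How much work would have to be rolled back if threads 10 and 403 were killed?
SELECT trx_mysql_thread_id, trx_state, trx_started, trx_rows_modified
FROM information_schema.INNODB_TRX
WHERE trx_mysql_thread_id IN (10, 403)
ORDER BY trx_rows_modified DESC;

-- Option (b): kill a lock holder.
-- KILL 10;
-- Option (c): kill the waiting DDL instead.
-- KILL 1264;
```

A large trx_rows_modified value suggests a long rollback, which makes option (c) the safer choice.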

Happy metadata lock hunting!

by Jaime Sicam at December 28, 2016 07:52 PM

Using Percona XtraBackup on a MySQL Instance with a Large Number of Tables

In this blog post, we’ll find out how to use Percona XtraBackup on a MySQL instance with a large number of tables.

As of Percona XtraBackup 2.4.5, you are required to have enough open files to open every single InnoDB tablespace in the instance you’re trying to back up. So if you’re running innodb_file_per_table=1, and have a large number of tables, you’re very likely to see Percona XtraBackup fail with the following error message:

InnoDB: Operating system error number 24 in a file operation.
InnoDB: Error number 24 means 'Too many open files'
InnoDB: Some operating system error numbers are described at
InnoDB: File ./sbtest/sbtest132841.ibd: 'open' returned OS error 124. Cannot continue operation
InnoDB: Cannot continue operation.

If you run into this issue, here is what you need to do:

  1. Find out how many files you need:

root@ts140i:~# find /var/lib/mysql/ -name "*.ibd" | wc -l

I would add at least another 1000 to this number for system tablespace and other miscellaneous open file needs. You might want to go even higher to accommodate for a growing number of tables.
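
If you prefer to ask the server rather than the file system, a rough equivalent can be sketched in SQL. This assumes MySQL 5.6 or 5.7, where the view is named INNODB_SYS_TABLESPACES (later versions rename it), and an account with the PROCESS privilege. The column aliases are made up for illustration:

```sql
-- Count the InnoDB tablespaces the server knows about and add the
-- 1000-file headroom suggested above.
SELECT COUNT(*)        AS tablespaces,
       COUNT(*) + 1000 AS suggested_open_files
FROM information_schema.INNODB_SYS_TABLESPACES;
```

The result may differ slightly from the find count, so treat it as an estimate.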

  2. Check the maximum number of files you can keep open in the system. If this number is too small, Percona XtraBackup might monopolize the open files in the system, causing other processes to fail when they try to open files. This can even cause MySQL Server to crash.

root@ts140i:/mnt/data/backup# cat /proc/sys/fs/file-max

If you need to, here is how to increase the number:

sysctl -w fs.file-max=5000000
echo "fs.file-max=5000000" >> /etc/sysctl.conf

  3. Increase the limit on the number of files the Percona XtraBackup process can open:

The best way to do this is using

 option. For example, you can specify the following in your my.cnf:


Alternatively, you can pass it as a command-line option, or run ulimit -n 2000000 before running the backup command.

You need to be sure your user account has permission to set the open files limit this high. If you are doing backups under the “root” user, it shouldn’t be a problem. Otherwise, you might need to adjust the limits in /etc/security/limits.conf:

mysql hard nofile 2000000
mysql soft nofile 2000000

Specifying a “soft” limit in this file eliminates the need to run ulimit before Percona XtraBackup, or to specify the limit in the configuration.

  4. There is one more possible limit to overcome. Even running as a root user, you might get the following error message:

root@ts140i:/mnt/data/backup# ulimit -n 2000000
-su: ulimit: open files: cannot modify limit: Operation not permitted

If this happens, you might need to increase the kernel limit on the number of files any single process can have open:

pz@ts140i:~$ cat /proc/sys/fs/nr_open

The limit I have on this system is slightly above 1 million. You can increase it using the following:

sysctl -w fs.nr_open=2000000
echo "fs.nr_open=2000000" >> /etc/sysctl.conf

With these configuration adjustments, you should be able to use Percona XtraBackup to back up MySQL instances containing millions of tables without problems.

What if you can’t allow Percona XtraBackup to open that many files? Then there is the --close-files option, which normally doesn’t require increasing the limit on the number of open files. Using this option, however, might corrupt the backup if you’re doing DDL operations during the backup.

Where does this strange limitation, requiring you to keep all tablespaces open, come from? It comes from this issue. In some cases, DDL operations such as RENAME TABLE might cause the wrong file to be copied, and this can’t be corrected by replaying the InnoDB redo logs. Keeping the file open clearly shows which file corresponds to a given tablespace at the start of the backup process, so it gets handled correctly.

This problem is not unique to Percona XtraBackup. If anything, Percona XtraBackup goes the extra mile to ensure database backups are safe. For comparison, MySQL Enterprise Backup 4.0 simply states:

“Do not run the DDL operations ALTER TABLE, TRUNCATE TABLE, OPTIMIZE TABLE, REPAIR TABLE, RESTORE TABLE or CREATE INDEX while a backup operation is going on. The resulting backup might become corrupted.”

by Peter Zaitsev at December 28, 2016 04:51 PM

December 27, 2016

Peter Zaitsev

Webinar Thursday December 29: JSON in MySQL 5.7


Please join Percona’s Consultant David Ducos on Thursday, December 29, 2016 at 10:00 am PST / 1:00 pm EST (UTC-8) as he presents JSON in MySQL 5.7.

Since it was implemented in MySQL 5.7, we can use JSON as a data type. In this webinar, we will review some of the useful functions that have been added to work with JSON.

We will examine and analyze how JSON works internally, and take into account some of the costs related to employing this new technology. 

At the end of the webinar, you will know the answers to the following questions: 

  • What is JSON?
  • Why don’t we keep using VARCHAR?
  • How does it work? 
  • What are the costs?
  • What limitations should we take into account?
  • What are the benefits of using MySQL JSON support?

Register for the webinar here.

David Ducos, Percona Consultant

David studied Computer Science at the National University of La Plata, and has worked as a Database Consultant since 2008. He worked for three years in a worldwide platform of free classifieds, until starting work for Percona in November 2014 as part of the Consulting team.

by Dave Avery at December 27, 2016 10:44 PM

Don’t Let a Leap Second Leap on Your Database!

This blog discusses how to prepare your database for the new leap second coming in the new year.

At the end of this year, on December 31, 2016, a new leap second gets added. Many of us remember the huge problems this caused back in 2012. Some of our customers asked how they should prepare for this year’s event to avoid any unexpected problems.

It’s a little late, but I thought discussing the issue might still be useful.

The first thing is to make sure your systems avoid the issue with abnormally high CPU usage. This was a problem in 2012 due to a Linux kernel bug. After the leap second was added, CPU utilization sky-rocketed on many systems, taking down many popular sites. This issue was addressed back in 2012, and similar global problems did not occur in 2015 thanks to those fixes. So it is important to make sure you have an up-to-date Linux kernel version.

It’s worth knowing that in the case of any unpredicted system misbehavior from the leap second problem, the quick remedy for the CPU overheating was restarting services or rebooting servers (in the worst case).

(Please do not reboot the server without being absolutely sure that your serious problems started exactly when the leap second was added.)

The following are examples of bug records:

The second thing is to add proper support for the upcoming event. Leap second additions are announced some time before they are implemented, as it isn’t known exactly when the next one will occur for sure.

Therefore, you should upgrade your OS tzdata package to prepare your system for the upcoming leap second. This document shows how to check if your OS is already “leap second aware”:

zdump -v right/America/Los_Angeles | grep Sat.Dec.31.*2016

A non-updated system returns an empty output. On an updated OS, you should receive something like this:

right/America/Los_Angeles  Sat Dec 31 23:59:60 2016 UTC = Sat Dec 31 15:59:60 2016 PST isdst=0 gmtoff=-28800
right/America/Los_Angeles  Sun Jan  1 00:00:00 2017 UTC = Sat Dec 31 16:00:00 2016 PST isdst=0 gmtoff=-28800

If your systems use the NTP service, though, the above is not necessary. Still, you should make sure that the NTP services you use are also up-to-date.

With regards to leap second support in MySQL, there is nothing to do, regardless of the version. MySQL doesn’t allow a value of 60 in the seconds part of the timestamp datatype, so you should expect rows with 59 instead of 60 seconds when the additional second is added, as described here:

Similarly, MongoDB expects no serious problems either.

Let’s “smear” the second

Many big Internet properties, however, introduced a technique to adapt to the leap second change more gracefully and smoothly, called Leap Smear or Slew. Instead of introducing the additional leap second immediately, the clock slows down a bit, allowing it to gradually get in sync with the new time. This way there is no issue with extra abnormal second notation, etc.

This solution is used by Google, Amazon, Microsoft, and others. You can find a comprehensive document about Google’s use here:

You can easily introduce this technique with the ntpd -x or chronyd slew options, which are nicely explained in this document:


Make sure you have your kernel up-to-date, NTP service properly configured and consider using the Slew/Smear technique to make the change easier. After the kernel patches in 2012, no major problems happened in 2015. We expect none this year either (especially if you take time to properly prepare).

by Przemysław Malkowski at December 27, 2016 09:00 PM

Jean-Jerome Schmidt

How to Perform Efficient Backup for MySQL and MariaDB

All backup methods have their pros and cons. They also affect database workloads differently. Your backup strategy will depend upon the business requirements, the environment you operate in and resources at your disposal. Backups are usually planned according to your restoration requirement. Data loss can be full or partial, and you do not always need to recover the whole dataset. In some cases, you might just want to do a partial recovery by restoring missing tables or rows. In this case, you will need a combination of Percona XtraBackup, mysqldump and binary logs to cover the different cases.

Performing a backup on MySQL or MariaDB is not that hard, but to be efficient, we do need to understand the effects of each and every procedure. It also depends on a number of factors like storage engine, recovery objective, dataset and delta size, storage capability and capacity, security as well as high availability design and architecture.

One of the most important things in performing a backup is to make sure you get a consistent backup. Backing up non-transactional tables like MyISAM and MEMORY requires the tables to be locked to guarantee consistency. This can be done using the global lock (FLUSH TABLES WITH READ LOCK). Consequently, the global lock will temporarily make the server read-only. For InnoDB, locking is unnecessary and other DML operations are allowed to execute while the backup is running.
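
The two approaches can be sketched at the SQL level as follows. This is a simplified illustration of the locking behavior, not a complete backup procedure. Real tools wrap these statements in their own logic:

```sql
-- Non-transactional tables (MyISAM, MEMORY): block writes server-wide
-- while the copy runs. The lock lives as long as this session holds it.
FLUSH TABLES WITH READ LOCK;
-- ... copy the data from another shell while this session stays open ...
UNLOCK TABLES;

-- InnoDB only: a consistent view without blocking writers.
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION WITH CONSISTENT SNAPSHOT;
-- ... read the tables to be backed up ...
COMMIT;
```

The second form is essentially what mysqldump does when run with --single-transaction.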

In terms of backup size, if you have limited storage space backed by an outdated disk subsystem, compression is your friend. Performing compression is a CPU intensive process and can directly impact the performance of your MySQL server. However, if it can be scheduled during periods of low traffic, compression can save you a lot of space. It is a tradeoff between processing power and storage space, and reduces the risk of server crash caused by a full disk.

If your database workload is write-intensive, you might find the difference in size (delta) between the two latest full backups to be fairly big, for example 1GB for a 10GB dataset per day. Performing regular full backups on databases with this kind of workload will likely introduce performance degradation, and it might be more efficient to perform incremental backups. Ultimately, this kind of workload will bring the database to a state where the backup size is rapidly growing and physical backup might be the only way to go.

When creating an encrypted backup, one thing to keep in mind is that it usually takes more time to recover. The backup has to be decrypted prior to any recovery activities. With a large dataset, this could introduce some delays to the RTO. On the other hand, if you are using a private key for encryption, make sure to store the key in a safe place. If the private key is missing, the backup will be useless and unrecoverable. If the key is stolen, all created backups that use the same key would be compromised as they are no longer secured.

It is common nowadays to have a high availability setup using either MySQL Replication or MySQL/MariaDB Galera Cluster. It is not necessary to back up all members in the replication chain or cluster. Since all nodes are expected to hold the same data (unless the dataset is sharded across different nodes), it is recommended to perform the backup on only one node (or one per shard).

Taking a MySQL backup on a dedicated backup server will simplify your backup plans. A dedicated backup server is usually an isolated slave connected to the production servers via asynchronous replication. A good backup server has plenty of disk space for backup storage, with the ability to do storage snapshots. Since it uses loosely-coupled asynchronous replication, it is unlikely to cause additional overhead to the production database. However, this server might become a single point of failure, with the risk of inconsistent backups if the backup server regularly lags behind.

As we have seen, there are quite a few things to consider in order to make efficient backups of MySQL and MariaDB. Each of the mentioned points is discussed in depth, together with example use-cases and best practices, in our latest whitepaper - The DevOps Guide to Database Backups for MySQL and MariaDB.

by ashraf at December 27, 2016 09:07 AM

December 24, 2016

MariaDB Foundation

MariaDB 10.2.3 and 5.5.54 now available

The MariaDB project is pleased to announce the immediate availability of MariaDB 10.2.3 beta and MariaDB 5.5.54 stable (GA). See the release notes and changelogs for details.

by Daniel Bartholomew at December 24, 2016 03:05 PM

December 23, 2016

Peter Zaitsev

Percona Server for MongoDB 3.4 Beta is now available

Percona Server for MongoDB

Percona is pleased to announce the release of Percona Server for MongoDB 3.4.0-1.0beta on December 23, 2016. Download the latest version from the Percona web site or the Percona Software Repositories.

NOTE: Beta packages are available from testing repository.

Percona Server for MongoDB is an enhanced, open source, fully compatible, highly scalable, zero-maintenance downtime database supporting the MongoDB v3.4 protocol and drivers. It extends MongoDB with Percona Memory Engine and MongoRocks storage engine, as well as adding features like external authentication, audit logging, and profiling rate limiting. Percona Server for MongoDB requires no changes to MongoDB applications or code.

This beta release is based on MongoDB 3.4.0 and includes the following additional changes:

  • Red Hat Enterprise Linux 5 and derivatives (including CentOS 5) are no longer supported.
  • MongoRocks is now based on RocksDB 4.11.
  • PerconaFT and TokuBackup were removed.
    As alternatives, we recommend using MongoRocks for write-heavy workloads and Hot Backup for physical data backups on a running server.

Percona Server for MongoDB 3.4.0-1.0beta release notes are available in the official documentation.


by Alexey Zhebel at December 23, 2016 02:43 PM

December 22, 2016

Jean-Jerome Schmidt

Planets9s - Online schema change for MySQL & MariaDB, MySQL storage engine & backups … and more

Welcome to this week’s Planets9s, covering all the latest resources and technologies we create around automation and management of open source database infrastructures.

Online schema change for MySQL & MariaDB: GitHub’s gh-ost & pt-online-schema-change

Online schema changes are unavoidable, as any DBA will know. While there are tools such as Percona’s pt-online-schema-change to assist, they do not come without drawbacks. However, there is a new kid on the block: GitHub released an online schema change tool called gh-ost. This post by Krzysztof Ksiazek, Senior Support Engineer at Severalnines, looks at how gh-ost compares to pt-online-schema-change, and how it can be used to address some limitations.

Read the blog

The choice of MySQL storage engine and its impact on backup procedures

As you will know, MySQL offers multiple storage engines to store its data, with InnoDB and MyISAM being the most popular ones. And this has an impact on how you design and run your backup procedures. Since data is stored inside the storage engine, we need to understand how the storage engines work to determine the best backup tool. This post by Ashraf Sharif, System Support Engineer at Severalnines, provides the necessary insight into these topics and recommendations on how best to proceed.

Read the blog

Want an easy way to deploy & monitor Galera Cluster in the cloud?

If you haven’t seen it yet, we’ve recently launched a new tool that allows you to easily deploy and monitor Galera Clusters on the AWS and Digital Ocean clouds. NinesControl allows quick, easy, point-and-click deployment and monitoring of a standalone or a clustered SQL and NoSQL database. Provisioning of each database is automatic, repeatable and completes in minutes. It also provides real-time monitoring, self-healing and automatic recovery features. Find out more and get started via the link below.

Check out NinesControl

That’s it for this week! Feel free to share these resources with your colleagues and follow us in our social media channels.

Have a good end of the week,

Jean-Jérôme Schmidt
Planets9s Editor
Severalnines AB

by Severalnines at December 22, 2016 11:41 AM

December 21, 2016

MariaDB AB

Of Temporal Datatypes, Electricity and Cows

anderskarlsson4 Wed, 12/21/2016 - 14:10


In an earlier blog post I discussed some aspects of temporal datatypes and how they apply to databases, in particular MariaDB. In this follow-up blog post I will get into some not-so-pleasant aspects of temporal datatypes, for example how one of them was caused by a shortage of electricity, and then get into a subject where cows and developers share an opinion (which is not to say that there might not be more such subjects).

How an attempt to save electricity gets time zones all wrong

Let’s again go back a few years, back to the early years of World War 1 when the government in Germany determined that they needed a lot of energy to fight the war and they wanted to conserve power. They first figured that they would do this by telling people to switch off their iPhones when they weren’t using them, but then they realized that wouldn’t work as the iPhone hadn’t been invented yet. Instead they forced people (people here not including the government officials themselves) to get out of bed earlier (yes, these folks were bad, real bad) and they did this by changing the time in the spring/summer so that people could work in the daylight longer and not have to use artificial lighting. People were outraged by this, obviously, as no one likes to get up early, so in the fall, some 6 months after implementing this, they changed the time back again, as they didn’t want a revolution. Some 6’ish months later they forgot all about that and changed the time again to force people to get up earlier, and then back again in the fall. And so it continued.

Then the world followed that scheme, or rather, some did and some didn’t. They also decided to give this constant flipping of the time a name, and they decided on Daylight Saving Time (DST) to make it sound like something sane, pleasant and modern, as “Get Out Of Bed One Hour Earlier For Half The Year” was considered too long a name for this atrocity.

Not everybody liked DST though. Cows for example did not. And I know what you are asking now, as you are modern technical IT guys reading this, you ask “what is a cow”. When I was a kid, my mum told me that cows were mild mannered animals that you got milk from, and that was a lie of course, as everyone knows that you get milk from the grocery store, not from some animal, mild mannered or not. But cows sound funny at least. In addition to cows, software developers also hate DST, for a number of reasons, but for a different reason than cows (apparently cows like to be “milked”, whatever that is, the same time every day which is why they dislike DST so much. And how they know that DST means that the time is different I do not know).

The MariaDB TIMESTAMP datatype in practice

In the previous post on this subject we determined that the MariaDB TIMESTAMP datatype supports time zones, but what does this mean? Well, this is what it means, in short:
•    MariaDB has a default TIME ZONE setting that is, unless you change it, set to the time zone of the operating system that the MariaDB Server runs on.
•    All TIMESTAMP values are converted to the UTC (Universal Time Coordinated) time zone before being stored. More on this later.
•    Each session connecting to MariaDB has a time zone (unless specified it uses that defined in the server) and conversion to and from UTC is automatic.
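The three points above can be modelled outside the database. What follows is a minimal Python sketch, not MariaDB's actual implementation: the function names `store_timestamp` and `fetch_timestamp` are made up for illustration, and fixed UTC offsets stand in for real session time zones.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets stand in for session time zones (an assumption of this sketch;
# real zones also carry DST rules).
UTC = timezone.utc
PACIFIC_STANDARD = timezone(timedelta(hours=-8))
EASTERN_STANDARD = timezone(timedelta(hours=-5))

FORMAT = "%Y-%m-%d %H:%M:%S"


def store_timestamp(wall_clock: str, session_tz: timezone) -> int:
    """Interpret wall-clock text in the session zone and return UTC epoch seconds."""
    local = datetime.strptime(wall_clock, FORMAT).replace(tzinfo=session_tz)
    return int(local.timestamp())


def fetch_timestamp(epoch_seconds: int, session_tz: timezone) -> str:
    """Convert stored UTC epoch seconds back to wall-clock text for a session zone."""
    return datetime.fromtimestamp(epoch_seconds, tz=session_tz).strftime(FORMAT)


# One session writes in Pacific standard time, two others read the same value.
stored = store_timestamp("2016-12-21 10:00:00", PACIFIC_STANDARD)
print(fetch_timestamp(stored, UTC))               # 2016-12-21 18:00:00
print(fetch_timestamp(stored, EASTERN_STANDARD))  # 2016-12-21 13:00:00
```

The stored value never changes; only its rendering depends on the session doing the reading.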

So what is this UTC time zone then? Well, it is actually not a TIME ZONE per se, rather it is the standard time that all other time settings reference, but as such it also assumes a time zone for technical reasons, although there is no physical place on earth with the UTC time zone. 

One thing with UTC is that it doesn’t have anything like DST, which makes a lot of sense. But this also means that we are going to convert to and from UTC any time we run with a non-UTC time zone, which most people do (I can argue that maybe you should, but that depends on your application). The issue is that as we are not using UTC on our clients we have to convert to and from UTC all the time, and the non-UTC time zones we typically use do have DST, which makes the conversion difficult. Let me show you what this means.
For the PST time zone, on Nov 6 2016 we changed from DST back to normal time, meaning that we set our clocks back. When the time was 02:00 the clocks were reset to 01:00. Let’s start with inserting some data using UTC, or “Cow-time”:

MariaDB> SET time_zone = 'UTC';
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
MariaDB> INSERT INTO timetable(ts1) VALUES('2016-11-06 07:30:00');
Query OK, 1 row affected (0.00 sec)
MariaDB> INSERT INTO timetable(ts1) VALUES('2016-11-06 08:00:00');
Query OK, 1 row affected (0.00 sec)
MariaDB> INSERT INTO timetable(ts1) VALUES('2016-11-06 08:30:00');
Query OK, 1 row affected (0.00 sec)
MariaDB> INSERT INTO timetable(ts1) VALUES('2016-11-06 09:00:00');
Query OK, 1 row affected (0.00 sec)
MariaDB> SELECT * FROM  timetable ORDER BY ts1;
| id | ts1                 |
|  1 | 2016-11-06 07:30:00 |
|  2 | 2016-11-06 08:00:00 |
|  3 | 2016-11-06 08:30:00 |
|  4 | 2016-11-06 09:00:00 |
4 rows in set (0.00 sec)

That seems fair, right? Now, the UTC times we inserted fall right around the moment when PST stops DST: 09:00 in UTC is 02:00 in PST with DST, which is when the clocks go back to 01:00. Let’s look at what the result of that last SELECT looks like in the PST timezone. Note that this is exactly the same data, table and SELECT statement, the latter including an explicit ORDER BY:

MariaDB> SET time_zone = 'America/Los_Angeles';
Query OK, 0 rows affected (0.00 sec)
MariaDB> SELECT * FROM  timetable ORDER BY ts1;
| id | ts1                 |
|  1 | 2016-11-06 00:30:00 |
|  2 | 2016-11-06 01:00:00 |
|  3 | 2016-11-06 01:30:00 |
|  4 | 2016-11-06 01:00:00 |
4 rows in set (0.00 sec)

What! That is not ordered by the ts1 column? Well, it is, but the fact is that 01:00 happens twice that night! Let’s try something else in the PST time zone:

MariaDB> INSERT INTO timetable(ts1) VALUES('2017-03-12 02:30:00');
Query OK, 1 row affected, 1 warning (0.00 sec)
MariaDB> SHOW WARNINGS;
| Level   | Code | Message                                          |
| Warning | 1299 | Invalid TIMESTAMP value in column 'ts1' at row 1 |
1 row in set (0.00 sec)

The time we want to set isn’t valid in PST that day, as when the clock turns 02:00, the time is reset to 03:00. The data is “truncated” to a valid PST timestamp:

MariaDB> SELECT * FROM timetable;
| id | ts1                 |
|  7 | 2016-11-06 00:30:00 |
|  8 | 2016-11-06 01:00:00 |
|  9 | 2016-11-06 01:30:00 |
| 10 | 2016-11-06 01:00:00 |
| 22 | 2017-03-12 03:00:00 |
5 rows in set (0.00 sec)
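Both effects can be reproduced outside the database. The following Python sketch uses the standard `zoneinfo` module and assumes the system time zone database is installed; the helper names `local_wall_clock` and `exists_in_zone` are invented for this illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")
FORMAT = "%Y-%m-%d %H:%M:%S"


def local_wall_clock(utc_text: str) -> str:
    """Render UTC wall-clock text as Pacific wall-clock text."""
    utc = datetime.strptime(utc_text, FORMAT).replace(tzinfo=timezone.utc)
    return utc.astimezone(PACIFIC).strftime(FORMAT)


def exists_in_zone(wall_clock: str, tz: ZoneInfo) -> bool:
    """True if the wall-clock time survives a round trip through UTC unchanged."""
    naive = datetime.strptime(wall_clock, FORMAT)
    round_trip = naive.replace(tzinfo=tz).astimezone(timezone.utc).astimezone(tz)
    return round_trip.replace(tzinfo=None) == naive


# The four UTC values inserted earlier: 01:00 shows up twice in Pacific time.
utc_rows = ["2016-11-06 07:30:00", "2016-11-06 08:00:00",
            "2016-11-06 08:30:00", "2016-11-06 09:00:00"]
for row in utc_rows:
    print(row, "->", local_wall_clock(row))

# 02:30 on the spring-forward day is skipped entirely; 03:30 is fine.
print(exists_in_zone("2017-03-12 02:30:00", PACIFIC))  # False
print(exists_in_zone("2017-03-12 03:30:00", PACIFIC))  # True
```

Python adjusts the nonexistent value differently from the 03:00:00 shown in the MariaDB output, so only the validity check carries over, not the adjusted value.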


So, how do we solve this? Is there a best practice? Well, if we want to follow the DST changes and also support timestamped data in different time zones, we at least have to learn to live with it. One way is to stick to UTC across the range and let the application handle this. Or have each MariaDB connection set the time zone to something appropriate? Or have all clients run with the time zone of the server? Really, this is difficult, but it’s not so much an IT problem as a problem with Cows, Electricity and Trains.
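As one possible reading of the "stick to UTC and let the application handle it" option, here is a hedged Python sketch. It assumes the application keeps every instant as a timezone-aware UTC value and only converts for display; the `display` helper is hypothetical, and the system time zone database is assumed to be available to `zoneinfo`.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def display(utc_instant: datetime, zone_name: str) -> str:
    """Format a UTC instant for a viewer, keeping the UTC offset visible."""
    local = utc_instant.astimezone(ZoneInfo(zone_name))
    return local.strftime("%Y-%m-%d %H:%M:%S %z")


# Instants as the application would keep them: always UTC, always timezone-aware.
instants = [
    datetime(2016, 11, 6, 9, 0, tzinfo=timezone.utc),
    datetime(2016, 11, 6, 8, 0, tzinfo=timezone.utc),
]

# Sorting happens on the UTC instant, so the repeated 01:00 hour cannot reorder rows.
for instant in sorted(instants):
    print(display(instant, "America/Los_Angeles"))
```

Printing the offset next to the local time is what keeps the two 01:00 rows distinguishable for the person reading them.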

Happy SQL’ing

In this second blog post on temporal datatypes Anders Karlsson explores the effects of Daylight Saving Time and how TIMESTAMP datatypes are affected by DST.

by anderskarlsson4 at December 21, 2016 07:10 PM

Peter Zaitsev

Percona Blog Poll: What Programming Languages are You Using for Backend Development?

Take Percona’s blog poll on what programming languages you’re using for backend development.

While customers and users focus on and interact with applications and websites, these are really just the tip of the iceberg for the whole end-to-end system that allows applications to run. The backend is what makes a website or application work. The backend has three parts to it: server, application, and database. A backend operation can be a web application communicating with the server to make a change in a database stored on a server. Technologies like PHP, Ruby, Python, and others are the ones backend programmers use to make this communication work smoothly, allowing the customer to purchase his or her ticket with ease.

Backend programmers might not get a lot of credit, but they are the ones that design, maintain and repair the machinery that powers a system.

Please take a few seconds and answer the following poll on backend programming languages. Which are you using? Help the community learn what languages help solve critical database issues. Please select from one to six languages as they apply to your environment.

If you’re using other languages, or have specific issues, feel free to comment below. We’ll post a follow-up blog with the results!

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

by Dave Avery at December 21, 2016 06:53 PM

Percona Poll Results: What Database Technologies Are You Using?

This blog shows the results from Percona’s poll on what database technologies our readers use in their environment.

We design different databases for different scenarios. Using one database technology for every situation doesn’t make sense, and can lead to non-optimal solutions for common issues. Big data and IoT applications, high availability, secure backups, security, cloud vs. on-premises deployment: each has a set of requirements that might need a special technology. Relational, document-based, key-value, graph, column family – there are many options for many problems. More and more, database environments combine more than one solution to address the various needs of an enterprise or application (known as polyglot persistence).

The following are the results of our poll on database technologies:

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

We’ve concluded our database technology poll that looks at the technologies our readers are running in 2016. Thank you to the more than 1500 people who responded! Let’s look at what the poll results tell us, and how they compare to the similar poll we did in 2013.

Since the wording of the two poll questions is slightly different, the results won’t be directly comparable.  

First, let’s set the record straight: this poll does not try to be an unbiased, open source database technology poll. We understand our audience likely has many more MySQL and MongoDB users than other technologies. So we should look at the poll results as “how MySQL and MongoDB users look at open source database technology.”

It’s interesting to examine which technologies we chose to include in our 2016 poll, compared to the 2013 poll. The most drastic change can be seen in the full-text search technologies. This time, we decided not to include Lucene and Sphinx. ElasticSearch, which wasn’t included back in 2013, is now the leading full-text search technology. This corresponds to what we see among our customers.

The change between Redis versus Memcached is also interesting. Back in 2013, Memcached was the clear supporting technology winner. In 2016, Redis is well ahead.

We didn’t ask about PostgreSQL back in 2013 (few people probably ran PostgreSQL alongside MySQL then). Today our poll demonstrates its very strong showing.

We are also excited to see MongoDB’s strong ranking in the poll, which we interpret both as a result of the huge popularity of this technology and as recognition of our success as a MongoDB support and services provider. We’ve been in the MongoDB solutions business for less than two years, and already seem to have a significant audience among MongoDB users.

In looking at other technologies mentioned, it is interesting to see that Couchbase and Riak were mentioned by fewer people than in 2013, while Cassandra came in about the same. I don’t necessarily see it as diminishing popularity for these technologies, but as potentially separate communities forming that don’t extensively cross-pollinate.

Kafka also deserves special recognition: with the initial release in January 2011, it gets a mention back in our 2013 poll. Our current poll shows it at 7%. This is a much larger number than might be expected, as Kafka is typically used in complicated, large-scale applications.

Thank you for participating!

by Peter Zaitsev at December 21, 2016 06:46 PM

Installing Percona Monitoring and Management on Google Container Engine (Kubernetes)


This blog discusses installing Percona Monitoring and Management on Google Container Engine.

I am working with a client that is on Google Cloud Services (GCS) and wants to use Percona Monitoring and Management (PMM). They liked the idea of using Google Container Engine (GKE) to manage the docker container that pmm-server uses.

The regular install instructions are here:

Since Google Container Engine runs on Kubernetes, we had to make some interesting changes to the server install instructions.

First, you will want to get the gcloud shell. This is done by clicking the gcloud shell button at the top right of your screen when logged into your GCS project.

Installing Percona Monitoring and Management

Once you are in the shell, you just need to run some commands to get up and running.

Let’s set our availability zone and region:

manjot_singh@googleproject:~$ gcloud config set compute/zone asia-east1-c
Updated property [compute/zone].

Then let’s set up our auth:

manjot_singh@googleproject:~$ gcloud auth application-default login
These credentials will be used by any library that requests
Application Default Credentials.

Now we are ready to go.

Normally, we create a persistent container called pmm-data to hold the data the server collects and survive container deletions and upgrades. For GCS, we will create persistent disks, and use the minimum (Google) recommended size for each.

manjot_singh@googleproject:~$ gcloud compute disks create --size=200GB --zone=asia-east1-c pmm-prom-data-pv
Created [].
NAME              ZONE          SIZE_GB  TYPE         STATUS
pmm-prom-data-pv  asia-east1-c  200      pd-standard  READY
manjot_singh@googleproject:~$ gcloud compute disks create --size=200GB --zone=asia-east1-c pmm-consul-data-pv
Created [].
NAME                ZONE          SIZE_GB  TYPE         STATUS
pmm-consul-data-pv  asia-east1-c  200      pd-standard  READY
manjot_singh@googleproject:~$ gcloud compute disks create --size=200GB --zone=asia-east1-c pmm-mysql-data-pv
Created [].
NAME               ZONE          SIZE_GB  TYPE         STATUS
pmm-mysql-data-pv  asia-east1-c  200      pd-standard  READY
manjot_singh@googleproject:~$ gcloud compute disks create --size=200GB --zone=asia-east1-c pmm-grafana-data-pv
Created [].
NAME                 ZONE          SIZE_GB  TYPE         STATUS
pmm-grafana-data-pv  asia-east1-c  200      pd-standard  READY

Ignoring messages about disk formatting, we are ready to create our Kubernetes cluster:

manjot_singh@googleproject:~$ gcloud container clusters create pmm-server --num-nodes 1 --machine-type n1-standard-2
Creating cluster pmm-server...done.
Created [].
kubeconfig entry generated for pmm-server.
pmm-server  asia-east1-c  1.4.6           999.911.999.91  n1-standard-2  1.4.6         1          RUNNING

You should now see something like:

manjot_singh@googleproject:~$ gcloud compute instances list
NAME                                       ZONE          MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP      STATUS
gke-pmm-server-default-pool-73b3f656-20t0  asia-east1-c  n1-standard-2       911.119.999.11  RUNNING

Now that our container manager is up, we need to create 2 configs for the “pod” we are creating to run our container. One will be used only to initialize the server and move the container drives to the persistent disks and the second one will be the actual running server.

manjot_singh@googleproject:~$ vi pmm-server-init.json
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
      "name": "pmm-server",
      "labels": {
          "name": "pmm-server"
      }
  },
  "spec": {
    "containers": [{
        "name": "pmm-server",
        "image": "percona/pmm-server:1.0.6",
        "env": [],
        "ports": [{
            "containerPort": 80
        }],
        "volumeMounts": [{
          "mountPath": "/opt/prometheus/d",
          "name": "pmm-prom-data"
        },{
          "mountPath": "/opt/c",
          "name": "pmm-consul-data"
        },{
          "mountPath": "/var/lib/m",
          "name": "pmm-mysql-data"
        },{
          "mountPath": "/var/lib/g",
          "name": "pmm-grafana-data"
        }]
    }],
    "restartPolicy": "Always",
    "volumes": [{
      "name": "pmm-prom-data",
      "gcePersistentDisk": {
          "pdName": "pmm-prom-data-pv",
          "fsType": "ext4"
      }
    },{
      "name": "pmm-consul-data",
      "gcePersistentDisk": {
          "pdName": "pmm-consul-data-pv",
          "fsType": "ext4"
      }
    },{
      "name": "pmm-mysql-data",
      "gcePersistentDisk": {
          "pdName": "pmm-mysql-data-pv",
          "fsType": "ext4"
      }
    },{
      "name": "pmm-grafana-data",
      "gcePersistentDisk": {
          "pdName": "pmm-grafana-data-pv",
          "fsType": "ext4"
      }
    }]
  }
}

manjot_singh@googleproject:~$ vi pmm-server.json
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
      "name": "pmm-server",
      "labels": {
          "name": "pmm-server"
      }
  },
  "spec": {
    "containers": [{
        "name": "pmm-server",
        "image": "percona/pmm-server:1.0.6",
        "env": [],
        "ports": [{
            "containerPort": 80
        }],
        "volumeMounts": [{
          "mountPath": "/opt/prometheus/data",
          "name": "pmm-prom-data"
        },{
          "mountPath": "/opt/consul-data",
          "name": "pmm-consul-data"
        },{
          "mountPath": "/var/lib/mysql",
          "name": "pmm-mysql-data"
        },{
          "mountPath": "/var/lib/grafana",
          "name": "pmm-grafana-data"
        }]
    }],
    "restartPolicy": "Always",
    "volumes": [{
      "name": "pmm-prom-data",
      "gcePersistentDisk": {
          "pdName": "pmm-prom-data-pv",
          "fsType": "ext4"
      }
    },{
      "name": "pmm-consul-data",
      "gcePersistentDisk": {
          "pdName": "pmm-consul-data-pv",
          "fsType": "ext4"
      }
    },{
      "name": "pmm-mysql-data",
      "gcePersistentDisk": {
          "pdName": "pmm-mysql-data-pv",
          "fsType": "ext4"
      }
    },{
      "name": "pmm-grafana-data",
      "gcePersistentDisk": {
          "pdName": "pmm-grafana-data-pv",
          "fsType": "ext4"
      }
    }]
  }
}
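The two pod definitions differ only in their mount paths, so they could also be generated from a single table. The following is a sketch in Python under a few assumptions: the helper name `pod_manifest` and the `MOUNTS` table are invented for illustration, the `env` entries are left out because they are not shown above, and each volume is assumed to carry a `name` matching its `volumeMounts` entry.

```python
import json

# volume name -> (temporary mount path for the init pod, final mount path)
MOUNTS = {
    "pmm-prom-data": ("/opt/prometheus/d", "/opt/prometheus/data"),
    "pmm-consul-data": ("/opt/c", "/opt/consul-data"),
    "pmm-mysql-data": ("/var/lib/m", "/var/lib/mysql"),
    "pmm-grafana-data": ("/var/lib/g", "/var/lib/grafana"),
}


def pod_manifest(init: bool) -> dict:
    """Build the pmm-server pod definition; init=True uses the temporary mount paths."""
    index = 0 if init else 1
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "pmm-server", "labels": {"name": "pmm-server"}},
        "spec": {
            "containers": [{
                "name": "pmm-server",
                "image": "percona/pmm-server:1.0.6",
                "ports": [{"containerPort": 80}],
                "volumeMounts": [
                    {"mountPath": paths[index], "name": name}
                    for name, paths in MOUNTS.items()
                ],
            }],
            "restartPolicy": "Always",
            # Each persistent disk is named after its volume with a "-pv" suffix.
            "volumes": [
                {"name": name,
                 "gcePersistentDisk": {"pdName": f"{name}-pv", "fsType": "ext4"}}
                for name in MOUNTS
            ],
        },
    }


# Print the init variant; redirect to a file to use it with kubectl create -f.
print(json.dumps(pod_manifest(init=True), indent=2))
```

Keeping the volume names in one place avoids a mismatch between `volumeMounts` and `volumes` when a path changes.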

Then create it:

manjot_singh@googleproject:~$ kubectl create -f pmm-server-init.json
pod "pmm-server" created

Now we need to move data to persistent disks:

manjot_singh@googleproject:~$ kubectl exec -it pmm-server bash
root@pmm-server:/opt# supervisorctl stop grafana
grafana: stopped
root@pmm-server:/opt# supervisorctl stop prometheus
prometheus: stopped
root@pmm-server:/opt# supervisorctl stop consul
consul: stopped
root@pmm-server:/opt# supervisorctl stop mysql
mysql: stopped
root@pmm-server:/opt# mv consul-data/* c/
root@pmm-server:/opt# chown pmm.pmm c
root@pmm-server:/opt# cd prometheus/
root@pmm-server:/opt/prometheus# mv data/* d/
root@pmm-server:/opt/prometheus# chown pmm.pmm d
root@pmm-server:/opt/prometheus# cd /var/lib
root@pmm-server:/var/lib# mv mysql/* m/
root@pmm-server:/var/lib# chown mysql.mysql m
root@pmm-server:/var/lib# mv grafana/* g/
root@pmm-server:/var/lib# chown grafana.grafana g
root@pmm-server:/var/lib# exit
manjot_singh@googleproject:~$ kubectl delete pods pmm-server
pod "pmm-server" deleted

Now recreate the pmm-server container with the actual configuration:

manjot_singh@googleproject:~$ kubectl create -f pmm-server.json
pod "pmm-server" created

It’s up!

Now let’s get access to it by exposing it to the internet:

manjot_singh@googleproject:~$ kubectl expose deployment pmm-server --type=LoadBalancer
service "pmm-server" exposed

You can get more information on this by running:

manjot_singh@googleproject:~$ kubectl describe services pmm-server
Name:                   pmm-server
Namespace:              default
Labels:                 run=pmm-server
Selector:               run=pmm-server
Type:                   LoadBalancer
Port:                   <unset> 80/TCP
NodePort:               <unset> 31757/TCP
Session Affinity:       None
  FirstSeen     LastSeen        Count   From                    SubobjectPath   Type            Reason                  Message
  ---------     --------        -----   ----                    -------------   --------        ------                  -------
  22s           22s             1       {service-controller }                   Normal          CreatingLoadBalancer    Creating load balancer

To find the public IP of your PMM server, look under “EXTERNAL-IP”

manjot_singh@googleproject:~$ kubectl get services
kubernetes     <none>           443/TCP   7m
pmm-server    999.911.991.91   80/TCP    1m

That’s it, just visit the external IP in your browser and you should see the PMM landing page!

One of the things we didn’t resolve was being able to access the pmm-server container within the VPC. The client had to go through the open internet and hit PMM via the public IP. I hope to work on this some more and resolve this in the future.

I have also talked to our team about making mounts for persistent disks easier so that we can use fewer mounts and make the configuration and setup easier.



by Manjot Singh at December 21, 2016 06:19 PM

MariaDB AB

On Databases, Temporal Datatypes and Trains

anderskarlsson1 Wed, 12/21/2016 - 13:18


The data type aspect of databases is a key feature, as it is when it comes to programming languages. I would guess that all programming languages, with the possible exception of assembly, provide a set of predefined “built in” datatypes. Some programming languages are limited in this respect, like Forth, where others have a larger set of types. A data type determines what data can be stored, what operations are allowed and their semantics.

One family of data types that is present in more or less all relational databases (I don’t say all here as I know someone will tell me about an arcane relational database system developed in Burundi where this is not true) is the temporal types, i.e. datatypes that hold a time value. This is in contrast to most programming languages, where the native datatypes are numbers and strings, and all other types are extensions using some kind of structure style.

So, databases have temporal datatypes, programming languages do not (OK, that is a generalization). In this blog post I will look at some aspects of temporal datatypes, and in a later blog post I will dig even deeper into this outrageously interesting subject.


Temporal Datatypes in Databases

Before we get into the aspect of trains, let’s spend some time with the temporal datatypes themselves. As already stated, the type of an item, among other things, determines its semantics. Let’s start with a look at the temporal data types in MariaDB: they are DATETIME, TIMESTAMP, DATE, TIME and YEAR. If we for a second assume that you don’t know anything about how MariaDB looks at these, you might ask yourself what the difference is between TIMESTAMP and DATETIME, so let’s start there.

Both DATETIME and TIMESTAMP store a date and a time, but there the similarities end. And by the way, I’m not saying that we should change the behavior of these datatypes, just that they are sometimes a bit odd.

The DATETIME datatype is more recent and is more in line with other relational databases, but on the other hand it takes up more space on disk. Both of these types also have microseconds support (since MariaDB 5.3). To enable this you add a precision, such as DATETIME(6). The reason a TIMESTAMP is more compact is that it can only hold a limited range of dates, from Jan 1 1970 up to 2038. Well, that should be enough for most purposes right? Yes, just as representing a year with just 2 digits and allowing for 640 K of RAM was “enough” a few years back! Those kinds of assumptions really made the world a better place for us all.
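The 1970 to 2038 range is what you get from counting seconds in a signed 32-bit integer. A short Python check, assuming that this is how the compact value is represented:

```python
from datetime import datetime, timezone

# Largest value a signed 32-bit seconds counter can hold.
MAX_SIGNED_32BIT = 2**31 - 1

first_moment = datetime.fromtimestamp(0, tz=timezone.utc)
last_moment = datetime.fromtimestamp(MAX_SIGNED_32BIT, tz=timezone.utc)

print(first_moment)  # 1970-01-01 00:00:00+00:00
print(last_moment)   # 2038-01-19 03:14:07+00:00
```

One second after `last_moment`, such a counter has nowhere left to go.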

As for the DATE, TIME and YEAR datatypes, I will skip them for now and focus on DATETIME and TIMESTAMP.


One thing you do NOT want to use temporal types for

I had a customer, many years ago, possibly way back during the Reagan administration or so, that had an issue. They were using TIMESTAMP, with microsecond precision, as a PRIMARY KEY for data in their OLTP systems. They had determined that there would not be more than 1 transaction per microsecond, and that this would work. It didn’t, for the simple reasons that:

  • On average, there was a lot less than 1 transaction per microsecond, but during high load times, it could well be more than this.
  • Computers tend to get faster over time and the load of popular services also increases, which means this scheme was bound to break faster the better it was.
  • This was a stupid assumption.

Their solution was to have the transaction retry when they had a PRIMARY KEY violation, which was neither effective, performant nor practical. Don’t do something even remotely similar to this!
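To make the failure mode concrete, here is a small Python sketch with a made-up burst of arrivals. The surrogate sequence shown as the alternative is a common pattern and an assumption of this example, not something the customer used.

```python
import itertools
from collections import Counter
from datetime import datetime, timedelta

# A hypothetical burst: three of four transactions land in the same microsecond.
base = datetime(2016, 12, 21, 13, 18, 0, 250)
arrivals = [base, base, base + timedelta(microseconds=1), base]

# Using the timestamp itself as the key: every duplicate is a key violation.
per_key = Counter(arrivals)
violations = sum(count - 1 for count in per_key.values())
print("primary key violations with timestamp keys:", violations)  # 2

# Using a surrogate sequence as the key: the timestamp becomes ordinary data.
sequence = itertools.count(1)
rows = {next(sequence): ts for ts in arrivals}
print("rows stored with surrogate keys:", len(rows))  # 4
```

With the sequence as the key, no retry logic is needed and the timestamp can still be indexed for range queries.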


Other relational databases support for temporal data types

Other relational databases also support temporal datatypes, and you might have sensed that I feel that the MariaDB temporal datatypes are a bit awkward. This is not so though, as all relational database temporal data types have quirks, to say the least, so it is appropriate to have a look at this too.

Let’s begin with Oracle, where there is support for DATE and TIMESTAMP. Oracle also supports INTERVAL types. DATE is the oldest Oracle temporal datatype, and it is rather odd in a few ways. One such oddity is that although the type is called DATE, and querying it by default gives you a proper date back, it actually stores the time also, down to seconds. This means that two fields that look like they have the same value in the default format might still fail a comparison. Odd, to say the least.

As for SQL Server, things are messier still. Here a TIMESTAMP is actually a table attribute that works much like the way the first TIMESTAMP column in a MariaDB table works in that it keeps track of the last insert / update to the row. Then SQL Server has both a DATETIME and a DATETIME2 datatype, where the former has a limited date range. SQL Server also has a DATETIMEOFFSET which is pretty odd. I will not get into NULL handling with temporal data types in SQL Server and I will avoid giving you a headache. Also in SQL Server are DATE and TIME datatypes as well as a SMALLDATETIME, where the latter is a more compact DATETIME with no fractional seconds and again a limited range.

I have not gotten into the issue of how relational databases treat temporal datatypes with incorrect data and NULL values. Note that “incorrect” data when it comes to temporal data is a fuzzy subject. Handling leap years is no big deal, but maybe you haven’t heard about leap seconds? They are added now and then to compensate for the earth slowing down its rotation, and we have one coming up at the end of this year: the last minute of 2016 will have 61 seconds, so 2016-12-31 23:59:60 is actually a valid time. This is not recognized by MariaDB, Oracle, SQL Server or for that matter Linux (at least where I tested it), which all report it as an invalid time specification. If someone asks you “how many seconds are there in a day” your answer should be, in true engineering fashion, “it depends”, and if you write code assuming there are always 86400 (24 * 60 * 60) seconds in a day, you might be making a mistake, depending on the situation.
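Python's standard `datetime` behaves the same way, which makes for a quick experiment. The sketch below also measures local days around a DST change as a second way the 86400 assumption can fail; that part is an addition of this example rather than something discussed above, the helper names are invented, and the system time zone database is assumed to be available to `zoneinfo`.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def accepts_leap_second() -> bool:
    """Check whether datetime can represent 2016-12-31 23:59:60."""
    try:
        datetime(2016, 12, 31, 23, 59, 60)
    except ValueError:
        return False
    return True


def seconds_in_local_day(year: int, month: int, day: int, zone_name: str) -> int:
    """Elapsed seconds between local midnight and the next local midnight."""
    midnight = datetime(year, month, day, tzinfo=ZoneInfo(zone_name))
    # Adding a day is wall-clock arithmetic, so this is the next local midnight.
    next_midnight = midnight + timedelta(days=1)
    return int(next_midnight.timestamp() - midnight.timestamp())


print(accepts_leap_second())                                     # False
print(seconds_in_local_day(2016, 12, 21, "America/Los_Angeles")) # 86400
print(seconds_in_local_day(2016, 11, 6, "America/Los_Angeles"))  # 90000
print(seconds_in_local_day(2016, 3, 13, "America/Los_Angeles"))  # 82800
```

The elapsed time is computed from epoch seconds on purpose: subtracting two aware datetimes that share a `tzinfo` compares wall-clock values and would report 86400 every time.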

Another situation is with Financial Services where domain-specific calendars are used in some cases, like a 360-day calendar where each year is considered to have 12 months of 30 days each. This is used for example with interest rate calculations, which is why every monthly mortgage payment on your house is the same amount, despite the fact that some months are shorter and others longer in your calendar (but not in the calendar used by your bank).
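Several variants of the 360-day calendar exist. As an illustration, this Python sketch implements the 30E/360 variant, in which any day-of-month above 30 is treated as 30; which variant a given bank uses is not stated here, and the function name is made up.

```python
from datetime import date


def days_30e_360(start: date, end: date) -> int:
    """Day count under the 30E/360 convention: every month is treated as 30 days."""
    d1 = min(start.day, 30)
    d2 = min(end.day, 30)
    return 360 * (end.year - start.year) + 30 * (end.month - start.month) + (d2 - d1)


print(days_30e_360(date(2016, 2, 1), date(2016, 3, 1)))  # 30, although February 2016 had 29 days
print(days_30e_360(date(2016, 7, 1), date(2016, 8, 1)))  # 30, although July has 31 days
print(days_30e_360(date(2016, 1, 1), date(2017, 1, 1)))  # 360
```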


SQL Standard Temporal Types

In the SQL Standard, let’s assume SQL-99, there are three temporal data types: TIMESTAMP, DATE and TIME. The Oracle TIMESTAMP type that was added in Oracle 9 is reasonably well in line with Standard SQL-99.

The SQL Standard TIME and TIMESTAMP datatypes have a number of attributes, namely a precision, in terms of fractional seconds, and whether time zones are used or not. Which brings us to the issue of time zones, and an issue they are, but it would probably be even worse without them.


Time Zones and Trains

I guess you are wondering what trains have to do with all this and with relational databases in particular? Well, trains are nice and fun and the same goes for relational databases, right? Or maybe not. No, there is an aspect of temporal data that is related to trains. Let’s go back a few years in time to when trains were all new and hot, which is around year 1996. No wait, that was Netscape. What we are talking about is the mid 1800s, when trains caused people to travel a lot more and to travel much longer distances. There were no time zones though, so noon was different for every station on the railroad, as it was when the sun was in zenith at that particular station. This meant that if you traveled for 1 hour you might end up 53 minutes away from your origin.

In 1884, it was determined that we should all have one meridian (the one in Greenwich) and that we should have 24 time zones spread around the globe. And this would be a world standard and as such enforced, and you better follow this, or else…

As we all know, standards are universally accepted more or less always and they are also backward compatible. And this is why the SQL standard works so that all SQL databases can talk to each other, all SCSI connectors (remember SCSI? I do, but then I’m an old fart) fit any other SCSI connector and, for a more modern example of a truly successful standard, all HDMI cables and connectors work with all other HDMI cables, connectors, screens, players and what have you not.

This explains why the good intentions of having 24 time zones 1 hour apart around the globe isn’t really followed. In particular India and Australia screw things up with 15, 30 and 45 minute time zones.
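The offsets are easy to inspect. A Python sketch, assuming the system time zone database is available to `zoneinfo`; the three zone names are picked for this example and `utc_offset_minutes` is a hypothetical helper.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A fixed instant, so the result does not depend on when the script runs.
MOMENT = datetime(2016, 12, 21, 12, 0, tzinfo=timezone.utc)


def utc_offset_minutes(zone_name: str) -> int:
    """UTC offset of a zone at MOMENT, in whole minutes."""
    offset = MOMENT.astimezone(ZoneInfo(zone_name)).utcoffset()
    return int(offset.total_seconds() // 60)


for zone in ("Asia/Kolkata", "Australia/Adelaide", "Australia/Eucla"):
    minutes = utc_offset_minutes(zone)
    print(f"{zone}: UTC+{minutes // 60:02d}:{minutes % 60:02d}")
```

Code that stores offsets as a whole number of hours cannot represent any of these three zones.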


MariaDB Temporal Datatypes and Time Zones In Practice

And what has this got to do with relational databases, you ask? Well, there is one difference between TIMESTAMP and DATETIME in MariaDB that we haven’t mentioned so far, and that is time zone support. The MariaDB TIMESTAMP datatype supports time zones, which the DATETIME datatype does not. What does this mean then? Well, it means that a TIMESTAMP value is tied to the time zone in effect when it was written, and is converted when it is read in another time zone. To explain what this means, let’s look at an example. First we need a table to insert some data into for our tests:

MariaDB> CREATE TABLE temporaltest (timestamp_t timestamp,
  datetime_t datetime);


Before moving on from this, let’s look at how MariaDB interprets this table definition:

MariaDB> SHOW CREATE TABLE temporaltest\G
*************************** 1. row ***************************
       Table: temporaltest
Create Table: CREATE TABLE `temporaltest` (
  `datetime_t` datetime DEFAULT NULL



As you can see, MariaDB adds a few things to the simple TIMESTAMP column, like a NOT NULL clause that we didn’t specify and a default value that we didn’t ask for. This is for backward compatibility to align more recent MariaDB TIMESTAMP column semantics with how a TIMESTAMP used to work way back during the Harding administration. Before we move on, let’s set the timezone of your server. The default for MariaDB is to use the timezone as defined by the host operating system, but in many cases that is not a good idea in production use. To be able to set the timezone with MariaDB, we first have to import timezone information into MariaDB, and that is done by running this from the command line:

$ mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql -p


This assumes that the zoneinfo file is at the default location, if you have installed MariaDB from a tarball it will be somewhere in that path. With that in place let’s set the timezone:

MariaDB> SHOW VARIABLES LIKE 'time_zone';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| time_zone     | SYSTEM |
+---------------+--------+
1 row in set (0.00 sec)

MariaDB> SET time_zone = 'PST8PDT';

Query OK, 0 rows affected (0.00 sec)


So now we are running with the PST timezone instead of the one defined by the operating system. So far so good. Let’s then insert some data into the table we created above:

MariaDB> INSERT INTO temporaltest VALUES('2015-03-08 02:05:00', '2015-03-08 02:05:00');

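MariaDB> SELECT * FROM temporaltest;
+---------------------+---------------------+
| timestamp_t         | datetime_t          |
+---------------------+---------------------+
| 2015-03-08 02:05:00 | 2015-03-08 02:05:00 |
+---------------------+---------------------+
1 row in set (0.00 sec)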


OK, that looks fine, right? Now, let’s assume that we move this server to the east coast and set the timezone as appropriate for that and select the above data again:

MariaDB> SET time_zone = 'EST';

Query OK, 0 rows affected (0.00 sec)

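MariaDB> SELECT * FROM temporaltest;
+---------------------+---------------------+
| timestamp_t         | datetime_t          |
+---------------------+---------------------+
| 2015-03-07 21:05:00 | 2015-03-08 02:05:00 |
+---------------------+---------------------+
1 row in set (0.00 sec)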


As you can see, with a different timezone the TIMESTAMP column comes back adjusted to the new timezone, but the DATETIME column does not. That does make a difference, right? If you run with MariaDB clients in different time zones, all those clients may well insert data using different time zones! If the server and the client are in different time zones, the data is converted to the timezone of the server and when retrieving data it is converted back to that of the client. Seems fair, right?
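The difference between the two column types can be sketched in a few lines of Python. This is only a conceptual model, not MariaDB code: the helper names are made up for illustration, the two zones are fixed offsets (UTC-8 and UTC-5) so that daylight saving time stays out of the picture, and the sample date is an assumption chosen for the same reason.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-ins for the two session time zones (no DST handling).
PST = timezone(timedelta(hours=-8), "PST")
EST = timezone(timedelta(hours=-5), "EST")
FMT = "%Y-%m-%d %H:%M:%S"


def store_timestamp(local_text, session_tz):
    """Model a TIMESTAMP write: read the text in the session zone, keep UTC."""
    naive = datetime.strptime(local_text, FMT)
    return naive.replace(tzinfo=session_tz).astimezone(timezone.utc)


def read_timestamp(stored_utc, session_tz):
    """Model a TIMESTAMP read: convert the stored UTC value to the session zone."""
    return stored_utc.astimezone(session_tz).strftime(FMT)


def store_datetime(local_text):
    """Model a DATETIME write: the text is kept as is, with no zone attached."""
    return local_text


ts = store_timestamp("2015-01-15 12:00:00", PST)
dt = store_datetime("2015-01-15 12:00:00")
print(read_timestamp(ts, PST), dt)  # same wall-clock time in the writing zone
print(read_timestamp(ts, EST), dt)  # TIMESTAMP shifts by three hours, DATETIME does not
```

The TIMESTAMP value names one instant and is rendered differently per session, while the DATETIME value is just a wall-clock reading that never moves.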

Maybe this thing with time zones wasn’t such a bad and difficult thing after all? Yes, it can be handled, but that was before Germany was about to run out of electricity and in an attempt to fix that caused years of suffering to cows and programmers across the globe. That story will be told in the next part of this series of blogs though, so don’t touch that dial, I’ll be right back.

Happy SQLing


In this first blog in a series, Anders Karlsson will look at aspects of temporal datatypes. Surprisingly, he also talks a bit about trains and how they relate.

Art van Scheppingen


Wed, 12/21/2016 - 16:13

Timezone mess

Thanks for the explanation of these data types. There is one thing you may have missed here: the timezone the server is configured in is only initialized on startup of mysql/mariadb, so changing that will not affect it immediately. We found this out when we had a problem that got dumped on our plate at my previous employer. Some sysop switched the timezones on all servers from Europe/Amsterdam to UTC, and as there was no restart necessary nothing happened. That meant that the database was in a different timezone than the server. So when we did a failover for maintenance (on a freshly rebooted server), this changed the behavior of the database to read already stored timestamp values differently (as they are stored in UTC) than the datetime values (as they are stored "as is"). This made the data in the database unreliable. We would have been able to alter the data with the failover switch as the cut-off point, however the schema and application itself were a timezone mess as well: timestamps and datetime used alongside each other in the same tables or joined tables, and then used for calculations. For input/output the application relied on the php date and gmdate functions (e.g. whenever a developer had a clue he/she used gmdate, otherwise just date) and the application did not have any timezone selection for the input. Since the timezone change was applied to all servers, php had also landed in a different timezone. This meant that date/time calculations were unreliable and fixing the data was next to impossible, as we could not determine what the date/time entered was supposed to be in UTC.
What did we do in the end? I tried to explain to the developer that we should change all timestamp columns to datetime, and start using gmdate in the whole application. The developer agreed, but the only change he made was switching the runtime timezone of php from UTC to Europe/Amsterdam, because he believed that would solve all his problems.

Holger Thiel


Thu, 12/22/2016 - 11:12

PostgreSQL is ideal

The timestamps in MariaDB/MySQL are not very handy.

Oracle is not better (you must fight with conversions between datatypes and different Oracle standards). PostgreSQL is not mentioned in this article.

PostgreSQL has a logical and handy way of implementation in relation to the others. The timestamps are readable on first sight and the timezones are far better solved than in MariaDB or Oracle.

I don't think MariaDB implementation is all that wrong

I haven't looked at Postgres, but maybe I should. That said, I don't think either MariaDB or Oracle gets it that wrong, rather my view is that dealing with timezones is reasonable once you understand it (disregarding the 30 and 45 minute time zones). But I'll have a look at Postgres and see if it adds something that I miss with the MariaDB implementation, something that we might create a MariaDB feature request from. I have a follow up blog post coming up soon that gets into the issue of Daylight Savings Time, where MariaDB also gets it right I think, but it needs to be understood and it's not MariaDB that I have an issue with but DST.


by anderskarlsson1 at December 21, 2016 06:18 PM

Jean-Jerome Schmidt

Online schema change for MySQL & MariaDB - comparing GitHub’s gh-ost vs pt-online-schema-change

Database schema change is one of the most common activities that a MySQL DBA has to tackle. No matter if you use MySQL Replication or Galera Cluster, direct DDL’s are troublesome and, sometimes, not feasible to execute. Add the requirement to perform the change while all databases are online, and it can get pretty daunting.

Thankfully, online schema tools are there to help DBAs deal with this problem. Arguably, the most popular of them is Percona’s pt-online-schema-change, which is part of Percona Toolkit.

It has been used by MySQL DBAs for years and is proven as a flexible and reliable tool. Unfortunately, not without drawbacks.

To understand these, we need to understand how it works internally.

How does pt-online-schema-change work?

Pt-online-schema-change works in a very simple way. It creates a temporary table with the desired new schema - for instance, if we added an index, or removed a column from a table. Then, it creates triggers on the old table - those triggers are there to mirror changes that happen on the original table to the new table. Changes are mirrored during the schema change process. If a row is added to the original table, it is also added to the new one. Likewise, if a row is modified or deleted on the old table, the change is also applied on the new table. Then, a background process of copying data (using INSERT LOW_PRIORITY) between old and new table begins. Once data has been copied, RENAME TABLE is executed to rename “yourtable” into “yourtable_old” and “yourtable_new” into “yourtable”. This is an atomic operation and in case something goes wrong, it is possible to recover the old table.
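The sequence above (shadow table, mirroring triggers, chunked background copy, table swap) can be tried out on a small scale. The sketch below uses Python's built-in sqlite3 module rather than MySQL, so it only illustrates the idea and is not how pt-online-schema-change is implemented: the trigger names, the added `note` column and the `after_chunk` hook (a stand-in for concurrent application writes) are all assumptions, and SQLite's two ALTER TABLE statements are not the atomic RENAME TABLE that MySQL offers.

```python
import sqlite3


def online_schema_change(conn, chunk_size=2, after_chunk=None):
    """Add a `note` column to `yourtable` via shadow table, triggers and chunked copy."""
    cur = conn.cursor()
    # 1. Shadow table with the desired new schema.
    cur.execute("CREATE TABLE yourtable_new (id INTEGER PRIMARY KEY, val TEXT, note TEXT)")
    # 2. Triggers mirror every write on the old table into the new one.
    cur.execute("CREATE TRIGGER osc_ins AFTER INSERT ON yourtable BEGIN "
                "REPLACE INTO yourtable_new (id, val) VALUES (NEW.id, NEW.val); END")
    cur.execute("CREATE TRIGGER osc_upd AFTER UPDATE ON yourtable BEGIN "
                "REPLACE INTO yourtable_new (id, val) VALUES (NEW.id, NEW.val); END")
    cur.execute("CREATE TRIGGER osc_del AFTER DELETE ON yourtable BEGIN "
                "DELETE FROM yourtable_new WHERE id = OLD.id; END")
    # 3. Background copy in primary-key chunks. Rows the triggers already
    #    wrote are skipped, so newer data is never overwritten by the copy.
    last_id = 0
    while True:
        ids = cur.execute("SELECT id FROM yourtable WHERE id > ? ORDER BY id LIMIT ?",
                          (last_id, chunk_size)).fetchall()
        if not ids:
            break
        upper = ids[-1][0]
        cur.execute("INSERT OR IGNORE INTO yourtable_new (id, val) "
                    "SELECT id, val FROM yourtable WHERE id > ? AND id <= ?",
                    (last_id, upper))
        last_id = upper
        if after_chunk:
            after_chunk(conn, last_id)  # hypothetical hook: application writes mid-copy
    # 4. Drop the triggers and swap the tables.
    for name in ("osc_ins", "osc_upd", "osc_del"):
        cur.execute(f"DROP TRIGGER {name}")
    cur.execute("ALTER TABLE yourtable RENAME TO yourtable_old")
    cur.execute("ALTER TABLE yourtable_new RENAME TO yourtable")
    conn.commit()
```

Even in this toy version the cost discussed in the article is visible: every write to the old table does extra work in a trigger for as long as the change is running, whether or not the copy loop is making progress.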

The process described above has some limitations. For starters, it is not possible to reduce the overhead of the tool to 0. Pt-online-schema-change gives you an option to define the maximum allowed replication lag and, if that threshold is crossed, it stops copying data between the old and new table. It is also possible to pause the background process entirely. The problem is that we are talking only about the background process of running INSERTs. It is not possible to reduce the overhead caused by the fact that every operation in “yourtable” is duplicated in “yourtable_new” through triggers. If you remove the triggers, the old and new table would go out of sync without any means to sync them again. Therefore, when you run pt-online-schema-change on your system, it always adds some overhead, even if it is paused or throttled. How big the overhead is depends on how many writes hit the table which is undergoing a schema change.

Another issue is caused again by triggers - precisely by the fact that, to create triggers, one has to acquire a lock on MySQL’s metadata. This can become a serious problem if you have highly concurrent traffic or if you use longer transactions. Under such load, it may be virtually impossible (and we’ve seen such databases) to use pt-online-schema-change due to the fact that it is not able to acquire the metadata lock to create the required triggers. Additionally, the process of acquiring the metadata lock can also block further transactions, basically grinding all database operations to a halt.

Yet another problem is foreign keys - unfortunately, there is no simple way of handling them. Pt-online-schema-change gives you two methods to approach this issue. Neither of those is really good. The main issue here is that a foreign key of a given name can only refer to a single table and it sticks to it - even if you rename the table referred to, the foreign key will follow this change. This leads to the problem: after RENAME TABLE, the foreign key will point to ‘yourtable_old’, not ‘yourtable’.

One workaround is to not use:

RENAME TABLE `yourtable` TO `yourtable_old`, `yourtable_new` TO `yourtable`;

Instead, use a two step approach:

DROP TABLE `yourtable`; RENAME TABLE `yourtable_new` TO `yourtable`;

This poses a serious problem. If for some reason, RENAME TABLE won’t work, there’s no going back as the original table has been already dropped.

Another approach would be to create a second foreign key, under a different name, which refers to ‘yourtable_new’. After RENAME TABLE, it will point to ‘yourtable’, which is exactly what we want. Thing is, you need to execute a direct ALTER to create such foreign key - which kind of invalidates the point of using online schema change - to avoid direct alters. If the altered table is large, such operation is not feasible to execute on Galera Cluster (cluster-wide stall caused by TOI) and MySQL replication cluster (slave lag induced by serialized ALTER).

As you can see, while being a useful tool, pt-online-schema-change has serious limitations which you need to be aware of before you use it. If you use MySQL at scale, limitations may become a serious motivation to do something about it.

Introducing GitHub’s gh-ost

Motivation alone is not enough - you also need resources to create a new solution. GitHub recently released gh-ost, their take on online schema change. Let’s take a look at how it compares to Percona’s pt-online-schema-change and how it can be used to avoid some of its limitations.

To understand better what is the difference between both tools, let’s take a look at how gh-ost works.

Gh-ost creates a temporary table with the altered schema, just like pt-online-schema-change does - it uses “_yourtable_gho” pattern. It executes INSERT queries which use the following pattern to copy data from old to new table:

insert /* gh-ost `sbtest1`.`sbtest1` */ ignore into `sbtest1`.`_sbtest1_gho` (`id`, `k`, `c`, `pad`)
      (select `id`, `k`, `c`, `pad` from `sbtest1`.`sbtest1` force index (`PRIMARY`)
        where (((`id` > ?)) and ((`id` < ?) or ((`id` = ?)))) lock in share mode
      )

As you can see, it is a variation of INSERT INTO new_table SELECT * FROM old_table. It uses the primary key to split data into chunks and then works on them one at a time.

In pt-online-schema-change, the current traffic was handled using triggers. Gh-ost uses a triggerless approach - it uses binary logs to track and apply changes which happened since gh-ost started to copy data. It connects to one of the hosts (by default, one of the slaves), pretends to be a slave itself and asks for binary logs.

This behavior has a couple of repercussions. First of all, network traffic is increased compared to pt-online-schema-change - not only does gh-ost have to copy data, it also has to pull binary logs.

It also requires binary logs in row-based format for full data consistency - if you use statement or mixed replication, gh-ost won’t work in your setup. As a workaround, you can create a new slave, enable log_slave_updates and set it to store events in row format. Reading data from a slave is, actually, the default way in which gh-ost operates - it makes perfect sense as pulling binary logs adds some overhead and if you can avoid additional overhead on the master, you most likely want to do it. Of course, if your master uses row-based replication format, you can force gh-ost to connect to it and get binary logs.

What is good about this design is that you don’t have to create triggers, which, as we discussed, could become a serious problem or even a blocker. What is also great is that you can always stop parsing binary logs - it’s as if you’d just run STOP SLAVE. You have the binlog coordinates so you can easily start in the same position later on. This makes it possible to stop practically all operations executed by gh-ost: not only the background process of copying data from old to new table, but also any load related to keeping the new table in sync with the old one. This is a great feature in a production environment - pt-online-schema-change requires constant monitoring as you can only estimate the additional load on the system. Even if you pause it, it will still add some overhead and, under heavy load, this overhead may result in an unstable database. On the other hand, with gh-ost, you can just pause the whole process and the workload pattern will go back to what you are used to seeing - no additional load whatsoever related to the schema change. This is really great - it means you can start the migration at 9am, when you start your day, and stop it at 5pm when you are leaving your office. You can be sure that you won’t get paged late at night because the paused schema change process is not actually 100% paused, and is causing problems to your production systems.
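Why a log-based approach can be paused completely, while a trigger-based one cannot, is easier to see in a toy model. The Python sketch below is not gh-ost code: the class name, the `(op, key, row)` event shape and the list index standing in for binlog coordinates are all assumptions made for illustration.

```python
class ChangeLogApplier:
    """Replays row changes from an append-only log onto a shadow table."""

    def __init__(self, change_log, shadow_table):
        self.change_log = change_log      # list of (op, key, row) events
        self.shadow_table = shadow_table  # dict standing in for the new table
        self.position = 0                 # index of the next event to apply
        self.paused = False

    def apply_pending(self):
        """Apply all events from the saved position unless paused; return the count."""
        if self.paused:
            return 0
        applied = 0
        while self.position < len(self.change_log):
            op, key, row = self.change_log[self.position]
            if op == "delete":
                self.shadow_table.pop(key, None)
            else:  # "insert" or "update"
                self.shadow_table[key] = row
            self.position += 1
            applied += 1
        return applied
```

While `paused` is set, application writes keep appending to the log but nothing touches the shadow table, so the migration adds no write load of its own. Clearing the flag resumes from the saved position and catches up.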

Unfortunately, gh-ost is not without drawbacks. For starters, foreign keys. Pt-online-schema-change does not provide any good way of altering tables which contain foreign keys. It is still way better than gh-ost as gh-ost does not support foreign keys at all. At the moment of writing, that is - it may change in the future. Triggers - gh-ost, at the moment of writing, does not support triggers at all. The same is true for pt-online-schema-change - it was a limitation of pre-5.7 MySQL where you couldn’t have more than one trigger of a given type defined in a table (and pt-online-schema-change had to create them for its own purposes). Even if the limitation is removed in MySQL 5.7, pt-online-schema-change still does not support tables with triggers.

One of the main limitations of gh-ost is, definitely, the fact that it does not support Galera Cluster. It is because of how gh-ost performs a table switch - it uses LOCK TABLES, which does not work well with Galera. As of now there is no known fix or workaround for this issue, and this leaves pt-online-schema-change as the only option for Galera Cluster.

These are probably the most important limitations of gh-ost, but there are more of them. Minimal row image is not supported (which makes your binlogs grow larger), and JSON and generated columns in 5.7 are not supported. The migration key must not contain NULL values, and there are limitations when it comes to mixed cases in table names. You can find more details on all requirements and limitations of gh-ost in its documentation.

In our next blog post we will take a look at how gh-ost operates, how you can test your changes and how to perform it. We will also discuss throttling of gh-ost.

by krzysztof at December 21, 2016 11:39 AM

December 20, 2016

MariaDB AB

Why Marko Mäkelä, Lead Developer InnoDB, Recently Joined MariaDB Corporation

Marko Mäkelä Tue, 12/20/2016 - 08:02

I recently joined MariaDB Corporation. You might not recognize my name, but you may have used some InnoDB features that I have worked on since I joined Innobase Oy as the first full-time employee in 2003.

My first task was to reduce the overhead of the InnoDB table storage. I introduced ROW_FORMAT=COMPACT (and named the old format ROW_FORMAT=REDUNDANT) in MySQL 5.0.3. ROW_FORMAT=COMPRESSED and ROW_FORMAT=DYNAMIC were released as part of the InnoDB Plugin for MySQL 5.1.

In the InnoDB Plugin, I also completed the ‘fast index creation’ feature. That along with the ROW_FORMAT changes and some BLOB bug fixes were among the major improvements that the InnoDB Plugin offered over the built-in InnoDB in MySQL 5.1. The InnoDB Plugin became the built-in InnoDB in MySQL 5.5.

In MySQL 5.5, I transformed the InnoDB insert buffer into a change buffer (delete-mark and purge buffering) and introduced the first regression tests based on fault injection.

In MySQL 5.6, I designed and implemented the InnoDB part of ALTER TABLE…ALGORITHM=INPLACE and LOCK=NONE, also known as the ‘online ALTER TABLE’. I also removed the famous limitation that the InnoDB redo log file size could not be changed.

In MySQL 5.7 one of my most visible contributions probably is the InnoDB redo log format tagging, to prevent old servers from failing with obscure errors when starting up with new data files.

Why did I join MariaDB? The short answer is that Monty called me and asked. The long answer is that having an academic background, I value open collaboration and the open exchange of ideas. At MariaDB it feels like being at home again, working with many of the colleagues from the 2003‒2005 era when both Innobase Oy and MySQL AB were small companies where each employee was able or forced to work on a wide range of  tasks.

The acquisitions of the companies introduced policies and processes that restrict communication, gradually transforming the ‘bazaar’ into a ‘cathedral’. While ‘open source’ seems to have largely won over ‘closed source’ when it comes to information technology and communication infrastructure, I think that The Cathedral and the Bazaar are still with us. The new ‘cathedral’ is ‘closed development’ where source code is only released in snapshots synchronized with product releases. Significant parts of the development history are often hidden by squashing merges of long-lived development branches to a gigantic commit.

MariaDB is the ‘bazaar’, encouraging feedback from end users in all stages of development.

While a for-profit business cannot provide unpaid support to users, it sometimes makes sense to work with users to obtain test cases. For instance, MDEV-11233 (a server crash in CREATE FULLTEXT INDEX) had been reported by several users, but crucial details were missing. Finally, the same assertion failure message was posted on the #maria channel in the FreeNode IRC network. After some discussion, Waqar Khan was busy executing a binary search, running a series of SQL statements to reduce the problematic table from one million rows to a single record that triggered the problem. An hour or two later we had a minimal 3-statement test case to reproduce the problem.

Another example is MDEV-6076 Persistent AUTO_INCREMENT for InnoDB tables, which I hope to be included in the upcoming MariaDB 10.2 release. Zhang Yuan at Alibaba looked at my code changes and pointed out a mistake before our internal code review had been completed.

I am looking forward to interesting times with MariaDB.

Marko Mäkelä, the first full-time employee at Innobase Oy in 2003 explains why he recently joined MariaDB Corporation.


by Marko Mäkelä at December 20, 2016 01:02 PM

Jean-Jerome Schmidt

The choice of MySQL storage engine and its impact on backup procedures

MySQL offers multiple storage engines to store its data, with InnoDB and MyISAM being the most popular ones. Each storage engine implements a specific set of features required for a type of workload and, as a result, works differently from other engines. Since data is stored inside the storage engine, we need to understand how the storage engines work to determine the best backup tool. In general, MySQL backup tools perform a special operation in order to retrieve consistent data - either lock the tables or establish a transaction isolation level that guarantees the data read is unchanged.


MyISAM was the default storage engine for MySQL versions prior to 5.5.5. It is based on the older ISAM code but has many useful extensions. The major deficiency of MyISAM is the absence of transactions support. Aria is another storage engine with MyISAM heritage and is a MyISAM replacement in all MariaDB distributions. The main difference is that Aria is crash safe, whereas MyISAM is not. Being crash safe means that an Aria table can recover from unexpected failures in a much better way than a MyISAM table can. In most circumstances, backup operations for MyISAM and Aria are almost identical.

MyISAM uses table-level locking. It stores indexes in one file and data in another. MyISAM tables are generally more compact in size on disk when compared to InnoDB tables. Given table-level locking and no transaction support, the recommended way to back up MyISAM tables is to acquire the global read lock by using FLUSH TABLES WITH READ LOCK (FTWRL) to make MySQL read-only temporarily, or to use the LOCK TABLES statement explicitly. Without that, MyISAM backups will be inconsistent.


InnoDB is the default storage engine for MySQL and MariaDB. It provides the standard ACID-compliant transaction features, along with foreign key support and row-level locking.

Percona’s XtraDB is an enhanced version of the InnoDB storage engine for MySQL and MariaDB. It features some improvements that make it perform better in certain situations. It is backwards compatible with InnoDB, so it can be used as a drop-in replacement.

There are a number of key components in InnoDB that directly influence the behaviour of backup and restore operations:

  • Transactions
  • Crash recovery
  • Multiversion concurrency control (MVCC)


InnoDB does transactions. A transaction will never be completed unless each individual operation within the group is successful (COMMIT). If any operation within the transaction fails, the entire transaction will fail and any changes will be undone (ROLLBACK).

The following example shows a transaction in MySQL (assuming autocommit is off):

BEGIN;
UPDATE account.saving SET balance = (balance - 10) WHERE id = 2;
UPDATE account.current SET balance = (balance + 10) WHERE id = 2;
COMMIT;

A transaction starts with a BEGIN and ends with a COMMIT or ROLLBACK. In the above example, if the MySQL server crashes after the first UPDATE statement completed (line 2), that update would be rolled back and the balance value won’t change for this transaction. The ability to rollback is vital when performing crash recovery, as explained in the next section.
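The all-or-nothing behaviour can be reproduced without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in for InnoDB, so it shows the transaction semantics only, not InnoDB internals. The table names `saving_account` and `current_account` and the `fail_midway` switch are assumptions made for the example.

```python
import sqlite3


def open_accounts():
    """Create two hypothetical account tables with a balance of 100 each."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE saving_account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("CREATE TABLE current_account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO saving_account VALUES (2, 100)")
    conn.execute("INSERT INTO current_account VALUES (2, 100)")
    conn.commit()
    return conn


def transfer(conn, amount, fail_midway=False):
    """Move `amount` from saving to current; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE saving_account SET balance = balance - ? WHERE id = 2",
                         (amount,))
            if fail_midway:
                raise RuntimeError("simulated failure after the first UPDATE")
            conn.execute("UPDATE current_account SET balance = balance + ? WHERE id = 2",
                         (amount,))
        return True
    except RuntimeError:
        return False


def balances(conn):
    saving = conn.execute("SELECT balance FROM saving_account WHERE id = 2").fetchone()[0]
    current = conn.execute("SELECT balance FROM current_account WHERE id = 2").fetchone()[0]
    return saving, current
```

Calling `transfer(conn, 10, fail_midway=True)` leaves both balances untouched, because the first UPDATE is rolled back together with the rest of the transaction.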

Crash Recovery

InnoDB maintains a transaction log, also called redo log. The redo log is physically represented as a set of files, typically named ib_logfile0 and ib_logfile1. The log contains a record of every change to InnoDB data. When InnoDB starts, it inspects the data files and the transaction log, and performs two steps:

  1. Applies committed transaction log entries to the data files.
  2. Performs an undo operation (rollback) on any transactions that modified data but did not commit.

The rollback is performed by a background thread, executed in parallel with transactions from new connections. Until the rollback operation is completed, new connections may encounter locking conflicts with recovered transactions. In most situations, even if the MySQL server was killed unexpectedly in the middle of heavy activity, the recovery process happens automatically. No action is needed from the DBA.
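The two recovery steps can be illustrated with a deliberately simplified model. In the Python sketch below, a single in-memory list stands in for both the redo and the undo information, and each write entry carries the old and the new value. None of this reflects InnoDB's real on-disk formats, and the entry layout is an assumption made for the example.

```python
def recover(data_files, log):
    """Return the recovered state from possibly stale data files and a change log.

    Log entries are ("write", txid, key, old_value, new_value) or ("commit", txid).
    """
    committed = {entry[1] for entry in log if entry[0] == "commit"}
    recovered = dict(data_files)
    # Step 1 (redo): re-apply the changes of committed transactions in log order.
    for entry in log:
        if entry[0] == "write" and entry[1] in committed:
            _, _, key, _old, new = entry
            recovered[key] = new
    # Step 2 (undo): walk backwards and restore the old value for every change
    # made by a transaction that never committed.
    for entry in reversed(log):
        if entry[0] == "write" and entry[1] not in committed:
            _, _, key, old, _new = entry
            recovered[key] = old
    return recovered


# At crash time the data files miss tx1's committed change to "a" but already
# contain tx2's uncommitted change to "b".
files = {"a": 1, "b": 20}
log = [("write", "tx1", "a", 1, 10), ("write", "tx2", "b", 2, 20), ("commit", "tx1")]
print(recover(files, log))
```

After recovery, the committed change is present and the uncommitted one is gone, whichever of them had reached the data files before the crash.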

Percona Xtrabackup utilizes InnoDB crash recovery functionality to prepare the internally inconsistent backup (the binary copy of MySQL data directory) into a consistent and usable database again.


InnoDB is a multiversion concurrency control (MVCC) storage engine which means many versions of a single row can exist at the same time. Due to this nature, unlike MyISAM, InnoDB does not require a global read lock to get a consistent read. It utilizes its ACID-compliant transaction component called isolation. Isolation is the “i” in the acronym ACID - the isolation level determines the capabilities of a transaction to read/write data that is accessed by other transactions.

In order to get a consistent snapshot of InnoDB tables, one could simply start a transaction with REPEATABLE READ isolation level. In REPEATABLE READ, a read view is created at the start of the transaction, and this read view is held open for the duration of the transaction. For example, if you execute a SELECT statement at 6 AM, and come back in an open transaction at 6 PM, when you run the same SELECT, then you will see the exact same result set that you saw at 6 AM. This is part of MVCC capability and it is accomplished using row versioning and UNDO information.

A logical backup tool like mysqldump uses this approach (via its --single-transaction option) to generate a consistent backup for InnoDB without an explicit table lock, which would make the MySQL server read-only.
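How a read view yields a stable snapshot while writes continue can be shown with a toy versioned store. This Python sketch is a conceptual model only: the class, its method names and the single global commit counter are assumptions, and InnoDB's actual row versioning through undo records is considerably more involved.

```python
class VersionedStore:
    """Keeps every committed version of a value, tagged with a commit sequence."""

    def __init__(self):
        self.versions = {}   # key -> list of (commit_seq, value), oldest first
        self.commit_seq = 0

    def write(self, key, value):
        """Commit a new version of `key`; older versions stay readable."""
        self.commit_seq += 1
        self.versions.setdefault(key, []).append((self.commit_seq, value))

    def open_read_view(self):
        """Start a snapshot: remember the latest commit visible right now."""
        return self.commit_seq

    def read(self, key, read_view):
        """Return the newest version committed at or before the read view."""
        visible = [value for seq, value in self.versions.get(key, []) if seq <= read_view]
        return visible[-1] if visible else None


store = VersionedStore()
store.write("balance", 100)
morning_view = store.open_read_view()      # the 6 AM transaction
store.write("balance", 50)                 # another session commits later
print(store.read("balance", morning_view))             # still the 6 AM value
print(store.read("balance", store.open_read_view()))   # a new view sees the update
```

A backup that reads everything through one such view gets a consistent picture of the data without blocking writers.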


The MEMORY storage engine (formerly known as HEAP) creates special-purpose tables with contents that are stored in memory. Because the data is vulnerable to crashes, hardware issues, or power outages, only use these tables as temporary work areas or read-only caches for data pulled from other tables.

Due to the transient nature of data from MEMORY tables (data is not persisted to disk), only logical backup is capable of backing up these tables. Backup in physical format is not possible.

That’s it for today, but you can read more about backups in our whitepaper - The DevOps Guide to Database Backups for MySQL and MariaDB.

by ashraf at December 20, 2016 10:15 AM

December 19, 2016

Peter Zaitsev

Securing MongoDB Instances


In this blog post we’ll look at how to go about securing MongoDB instances.

Authentication is one of the most important features of a database, and MongoDB supports it in different ways. Although it allows you to work without any authentication, the best practice is to enable authentication and give users only the permissions they need.

Instances without authentication allow all users to perform all the operations they feel like, and that is not safe at all.

Native Authentication

The MongoDB community version features more than one authentication method: SCRAM-SHA-1, MONGODB-CR, and x.509 Certificate Authentication. The current default method is SCRAM-SHA-1. Versions prior to 3.0 used MONGODB-CR as the default method.

Percona Server for MongoDB also offers LDAP authentication free of charge, whereas in MongoDB this feature is only available in the enterprise version.

SCRAM-SHA-1 and MONGODB-CR check whether the user/password exists against a specific database and use challenge response authentication to verify user’s authenticity.

The x.509 authentication is based on certificates. It does not run challenge response algorithms. This method instead validates a certificate to prove client’s authenticity. It depends on a certificate authority and each client must have a valid certificate.

LDAP Authentication

LDAP authentication uses an external LDAP server to authenticate the user, by way of saslauthd on Linux. LDAP is commonly used to manage users in a network. There are advantages and disadvantages when using LDAP. One advantage is that it centralizes users. However, it depends on network connectivity to check user credentials; saslauthd tries to help with caching, but it does have limitations. Please see further details here.

There are two different internal authentication methods for replica sets and sharded clusters, where instances need to prove that they are expected members of the deployment. The first method uses a shared keyfile for all instances, and the second uses a different x.509 certificate for each instance. It is important to know that x.509 forces proper SSL coverage of replication while a keyfile does not, but we will cover this topic in a different blog post.

Authorization and Roles

Once authenticated, users must be allowed to perform commands against the instance/replica-set/sharding. There are a few built-in roles that are able to cover almost all the use cases, and creating a user-defined role is also possible.
The current built-in roles are:

read readWrite dbAdmin dbOwner
userAdmin clusterAdmin clusterManager clusterMonitor
hostManager backup/restore readAnyDatabase readWriteAnyDatabase
userAdminAnyDatabase dbAdminAnyDatabase root and many more…

There is also the __system role, which is solely used for internal purposes.

Custom user and role by example

This shows how to both enable MongoDB authentication and create a user-defined role, where the user will only be able to read a specific collection. We are using tarballs for testing only. To perform a production installation please follow our docs.

  1. Download Percona Server MongoDB:
    >tar -xvzf percona-server-mongodb-3.2.10-3.0-trusty-x86_64.tar.gz
    >mv percona-server-mongodb-3.2.10-3.0/ perc-mongodb
    >cd perc-mongodb/
    >mkdir bin/data
  2. Start the service with authentication:
    cd bin
    ./mongod --dbpath data --smallfiles --logpath data/mongod.log --fork --auth
  3. Create root/admin user:

    We are able to create the first user without authentication. The next users must be created by an authenticated user.

    > use admin
    > db.createUser({user : 'administrator', pwd : '123', roles : ['root'] })
    > Successfully added user: { "user" : "administrator", "roles" : [ "root" ] }
  4. Login with the just created credentials:
    >mongo --authenticationDatabase admin -u administrator -p
  5. Create database and collection:
    > use percona
    > db.simple_collection.insert({ random_number : Math.random()})
    > db.secure_collection.insert({ name : 'John', annual_wage : NumberLong(130000.00), target_bonus : NumberLong(15000.00)})
  6. Create a user that can read all the collections in the percona database:
    db.createUser( {
        user: "app_read",
        pwd: "123456",
        roles: [ { role: "read", db: "percona" }]})
    // testing
    ./mongo --authenticationDatabase admin -u app_read -p
    MongoDB shell version: 3.2.10-3.0
    Enter password:
    connecting to: test
    > use percona
    switched to db percona
    > show collections
    > db.employee_dependents.find()
        { "_id" : ObjectId("583c5afe38c4be98b24e86e6"), "emp_id" : DBRef("employees", "583c5bea38c4be98b24e86e8") }
    > db.employees.find()
        { "_id" : ObjectId("583c5bea38c4be98b24e86e8"), "name" : "John", "annual_wage" : NumberLong(130000), "target_bonus" : NumberLong(15000) }
  7. Now we see that this user can read not only the
    but also the
    . We don’t want users to read the
    , so we are going to create a user-defined role.
    > db.createRole( {
        role : 'readOnly_nonPrivilegedCollections',
        roles : [],
        privileges: [
            { resource: {
                db: "percona",
                collection: "foo" },
              actions: [ "find"] }
        ] } )
  8. Assign created role to the user:
    db.createUser( {
         user: "app_non_secure",
         pwd: "123456",
         roles: [ { role: "readOnly_nonPrivilegedCollections", db: 'admin' }]
    })
  9. Test access:
    ./mongo --authenticationDatabase admin -u app_non_secure -p
    > use percona
    > db.simple_collection.find()
        { "_id" : ObjectId("583c5afe38c4be98b24e86e6"), "random_number" : 0.2878080930921183 }
    > db.secure_collection.find()
    Error: error: {
        "ok" : 0,
        "errmsg" : "not authorized on percona to execute command { find: \"secure_collection\", filter: {} }",
        "code" : 13
    }

Please feel free to ping us on Twitter @percona with any questions and suggestions for securing MongoDB instances.

by Adamo Tonete at December 19, 2016 05:53 PM

December 16, 2016

Peter Zaitsev

MongoDB PIT Backups In Depth

This blog post is an in-depth discussion of MongoDB PIT backups.

Note: INTIMIDATION FREE ZONE!! This post is meant to give the reader most of the knowledge that is needed to understand the material. This includes basic MongoDB knowledge and other core concepts. I have tried to include links for further research where I can. Please use them where you need background and ask questions. We’re here to help!


In this two-part series, we’re going to fill in some of the gaps you might have to help you get awesome Point-in-Time (PIT) backups in MongoDB. These procedures will work for both MongoDB and Percona Server for MongoDB. This is meant to be a prequel to David Murphy’s MongoDB PIT Backup blog post. In that blog, David shows you how to take a dump and restore it up until a problem operation happened. If you haven’t read that post yet, hold off until you read this one. This foundation will help you better understand the how and why of the necessary steps. Let’s move on to some core concepts in MongoDB, but first, let me tell you what to expect in each part.

Blog 1 (this one): Core concepts – replica set backups, problem statement and solution

Blog 2: Getting Shardy – why backup consistency is tough and how to solve it

Core Concepts

Replica Set (process name: mongod) – MongoDB uses replica sets to distribute data for DR purposes. Replica sets are made up of primaries, secondaries and/or arbiters. These are much like master/slave pairs in MySQL, but with one big caveat. There’s an election protocol included that also handles failover! That’s right, high availability (HA) too! So, in general, there is a “rule of three” when thinking about the minimum number of servers to put in your replica sets. This is necessary to avoid a split-brain scenario.
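
To make the “rule of three” concrete, here is a minimal sketch of the majority arithmetic behind elections. The function names are made up for illustration; they are not part of MongoDB.

```javascript
// Votes needed to elect a primary when `members` members can vote.
function majority(members) {
  return Math.floor(members / 2) + 1;
}

// How many members can be lost while a primary can still be elected.
function faultTolerance(members) {
  return members - majority(members);
}

for (const n of [1, 2, 3, 5]) {
  console.log(`${n} members: majority=${majority(n)}, can lose=${faultTolerance(n)}`);
}
```

With two members, losing either one leaves no majority, so a second data node alone buys no failover. Three is the smallest set that survives the loss of one member.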

Oplog (collection name: oplog.rs) – The oplog is the log that records changes to the data on the MongoDB primary (secondaries also have an oplog). Much like MySQL, the non-primary members (secondaries) pull operations from the oplog and apply them to their own collections. Secondaries can pull from any member of the replica set that is ahead of them. The operations in the oplog are idempotent, meaning they always result in the same change to the database no matter how many times they’re performed.
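
A small sketch can show what idempotent means here. It assumes a simplified, made-up entry format (not the real oplog document layout) and compares an entry that records the resulting value with a hypothetical entry that records the increment itself.

```javascript
// Apply one simplified log entry to an in-memory collection (a Map keyed by _id).
function applyOp(collection, op) {
  const doc = collection.get(op._id);
  if (!doc) return;
  if (op.type === "set") {
    Object.assign(doc, op.fields);   // writes the final value: safe to repeat
  } else if (op.type === "inc") {
    doc[op.field] += op.by;          // adds a delta: NOT safe to repeat
  }
}

// Replay the same list of entries `times` times on a fresh copy of the data.
function replay(initialDocs, ops, times) {
  const collection = new Map(initialDocs.map((d) => [d._id, { ...d }]));
  for (let i = 0; i < times; i++) {
    for (const op of ops) applyOp(collection, op);
  }
  return collection;
}

const start = [{ _id: 1, visits: 10 }];
const asSet = [{ type: "set", _id: 1, fields: { visits: 11 } }];
const asInc = [{ type: "inc", _id: 1, field: "visits", by: 1 }];

console.log(replay(start, asSet, 1).get(1).visits, replay(start, asSet, 3).get(1).visits);
console.log(replay(start, asInc, 1).get(1).visits, replay(start, asInc, 3).get(1).visits);
```

The "set" form gives the same document whether it is applied once or three times, which is the property a restore relies on when it replays entries that the dump may already contain.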

Sharding (process name: mongos) – MongoDB also has built-in horizontal scalability. This is implemented in a “shared nothing” architecture. A sharded cluster is made up of several replica sets. Each replica set contains a unique range of data. The data is distributed amongst replica sets based on a sharding key. There is also a sharding router (mongos) that runs as a routing umbrella over the cluster. In a sharded setup the application solely interfaces with the sharding router (never the replica sets themselves). This is the main mechanism for scaling reads or writes in MongoDB. Scaling both takes very thoughtful design, and may not always be possible.
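
As a rough illustration of what the router does, the sketch below maps a shard key to the replica set that owns its range. The range boundaries, the replica set names and the userId shard key are all hypothetical, and this is not how mongos is actually implemented.

```javascript
// Hypothetical range map: each range [min, max) of the shard key belongs to one replica set.
const ranges = [
  { min: -Infinity, max: 1000, shard: "rs0" },
  { min: 1000, max: 2000, shard: "rs1" },
  { min: 2000, max: Infinity, shard: "rs2" },
];

// Find the replica set that owns a given shard key value.
function routeByShardKey(key) {
  return ranges.find((r) => key >= r.min && key < r.max).shard;
}

// A query that includes the shard key targets one replica set;
// a query without it has to be sent to all of them.
function targetsFor(query) {
  return "userId" in query ? [routeByShardKey(query.userId)] : ranges.map((r) => r.shard);
}

console.log(targetsFor({ userId: 1500 }));
console.log(targetsFor({ name: "John" }));
```

This is also why scaling takes thoughtful design: only queries that carry the shard key can be answered by a single replica set.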

Mongodump (utility name: mongodump) – MongoDB has a built-in database dump utility that can interface with mongod or mongos. Mongodump can also use the oplog of the server that it is run on to create a consistent point-in-time backup by using a “roll forward” strategy.

Mongorestore (utility name: mongorestore) – MongoDB has a built-in database restore utility. Mongorestore is a rather simple utility that will replay binary dumps created by mongodump. When --oplogReplay is used to restore a dump made with mongodump’s --oplog switch, it can make for a very functional backup facility.

Tip: make sure that user permissions are properly defined when using --oplogReplay – besides restore, anyAction and anyResource need to be granted.

OK, So What?

We’re going to first need to understand how backups work in a simple environment (a single replica set). Things are going to get much more interesting when we look at sharded clusters in the next post.

Backing Up

In a single replica set, things are pretty simple. There is no sharding router to deal with. You can get an entire data set by interacting with one server. The only problem that you need to deal with is the changes that are being made to the database while your mongodump is in progress. If the concept of missed operations is a little fuzzy to you, just consider this simple use case:

We’re going to run a mongodump, and we have a collection with four documents.

We start mongodump on this collection. We’re also running our application at the same time, because we can’t take down production. Mongodump scans from first to last in the collection (like a range scan based on ID). In this case mongodump has backed up all documents from id:1 through id:4.

At this same moment in time, our application inserts id:3 into the collection.

Is the document with id:3 going to be included in the mongodump? The answer is: most likely not. The problem is that you would expect it to be in the completed dump. However, if you need to restore this backup, you’re going to lose id:3. Now, this is perfectly normal in Disaster Recovery scenarios. Knowing that this is happening is the key part. Your backups will have the consistency of Swiss cheese if you don’t have a way to track changes being made while the backup is running. Unknown data loss is one of the worst situations one can be in. What we need is a process to capture changes while the backup is running.

Here’s where the inclusion of the --oplog flag is very important. The --oplog flag will allow mongodump to capture changes that are being made to the database while the backup is running. Since the oplog is idempotent, there is no chance that replaying it will change the data incorrectly during a restore. This gives the mongodump a consistent snapshot as of when the dump completes, like most “clone” type operations.


When running mongorestore, you can use the --oplogReplay option. Using the oplog recovers to the point in time when the dump completed. Back to the use case: we may not capture id:3 on the first pass, but as long as we’ve captured the oplog up until the backup completes, we’ll have id:3 available. When replaying the oplog during mongorestore, we will basically re-run the insert operation, completing the dataset. The oplog BSON timestamps all entries, so we know for sure until what point in time we’ve captured.
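
The use case can be played through with a toy model. Everything here is a hypothetical in-memory stand-in for mongodump and mongorestore, and it assumes a collection holding ids 1, 2, 4 and 5 with id 3 inserted after the scan has passed that position. The entry fields op and o are only loosely modeled on oplog entries.

```javascript
// Scan the collection in _id order, like a range scan; `onScanned` lets the
// "application" write to the collection while the dump is still running.
function dump(collection, oplog, { captureOplog, onScanned }) {
  const oplogStart = oplog.length;                 // oplog position when the dump starts
  const docs = [];
  let lastId = -Infinity;
  for (;;) {
    const next = [...collection.keys()].filter((id) => id > lastId).sort((a, b) => a - b)[0];
    if (next === undefined) break;
    docs.push({ ...collection.get(next) });
    lastId = next;
    onScanned(next);
  }
  return { docs, oplog: captureOplog ? oplog.slice(oplogStart) : [] };
}

// Load the dumped documents, then replay any captured inserts on top.
function restore(backup) {
  const restored = new Map(backup.docs.map((d) => [d._id, d]));
  for (const entry of backup.oplog) {
    if (entry.op === "i") restored.set(entry.o._id, entry.o);
  }
  return [...restored.keys()].sort((a, b) => a - b);
}

function runScenario(captureOplog) {
  const collection = new Map([1, 2, 4, 5].map((id) => [id, { _id: id }]));
  const oplog = [];
  const backup = dump(collection, oplog, {
    captureOplog,
    onScanned: (id) => {
      if (id === 4) {                              // the scan is already past id 3
        const doc = { _id: 3 };
        collection.set(3, doc);
        oplog.push({ op: "i", o: doc });
      }
    },
  });
  return restore(backup);
}

console.log("without oplog:", runScenario(false));
console.log("with oplog:   ", runScenario(true));
```

Without the captured entries the restored set is missing id 3. With them, replaying the insert completes the data set as of the moment the dump finished.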

TIP: If you need to convert the timestamp to something human-readable, here’s something helpful
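
One way to do such a conversion is sketched below. A BSON timestamp holds seconds since the Unix epoch plus an ordinal that orders operations within the same second. The helper name and the plain { t, i } object shape are assumptions for illustration, so adapt them to whatever your driver or shell returns.

```javascript
// Hypothetical helper: `ts` is a plain object { t: <seconds since epoch>, i: <ordinal> }.
function oplogTimestampToISO(ts) {
  // JavaScript dates count milliseconds, so scale the seconds part by 1000.
  return new Date(ts.t * 1000).toISOString();
}

console.log(oplogTimestampToISO({ t: 1481914440, i: 1 }));
```

The ordinal is ignored here because it only disambiguates entries within one second and carries no wall-clock meaning.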

The Wrap Up

Now we have a firm understanding of the problem. Once we understand the problem, we can easily design a solution to ensure our backups have the integrity that our environment demands. In the next post, we’re going to step up the complexity by examining backups in conjunction with the most complex feature MongoDB has: sharding. Until then, post your feedback and questions in the comments section below.

by Jon Tobin at December 16, 2016 06:54 PM

Percona Live 2017 Sneak Peek Schedule Up Now! See the Available Sessions!

We are excited to announce that the sneak peek schedule for the Percona Live 2017 Open Source Database Conference is up! The Percona Live Open Source Database Conference 2017 is April 24th – 27th, at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

The Percona Live Open Source Database Conference 2017 is the premier event for the rich and diverse MySQL, MongoDB and open source database ecosystems. This conference provides an opportunity to network with peers and technology professionals by bringing together accomplished DBAs, system architects and developers from around the world to share their knowledge and experience.

Below are some of our top picks for MySQL, MongoDB and open source database sessions:


MySQL 101 Tracks

MongoDB 101 Tracks

Breakout Talks

Register for the Percona Live Open Source Database Conference here.

Early Bird Discounts

Just a reminder to everyone out there: our Early Bird discount rate for the Percona Live Open Source Database Conference 2017 is only available ‘til January 8, 2017, 11:30 pm PST! This rate gets you all the excellent and amazing opportunities that Percona Live offers, at a very reasonable price!

Sponsor Percona Live

Become a conference sponsor! We have sponsorship opportunities available for this annual MySQL, MongoDB and open source database event. Sponsors become a part of a dynamic and growing ecosystem and interact with hundreds of DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solutions vendors, and entrepreneurs who attend the event.

by Dave Avery at December 16, 2016 03:05 PM

Jean-Jerome Schmidt

Planets9s - Download our new DevOps Guide to Database Backups for MariaDB & MySQL

Welcome to this week’s Planets9s, covering all the latest resources and technologies we create around automation and management of open source database infrastructures.

Download our new DevOps Guide to Database Backups for MariaDB & MySQL

Check out our free whitepaper on database backups, which discusses in detail the two most popular backup utilities available for MySQL and MariaDB, namely mysqldump and Percona XtraBackup. If you’re looking for insight into how to perform database backups efficiently or the impact of Storage Engine on MySQL or MariaDB backup procedures, need some tips & tricks on MySQL / MariaDB backup management … our new DevOps Guide has you covered.

Download the whitepaper

Tips and Tricks: Receive email notifications from ClusterControl

Did you know that apart from receiving notifications when things go wrong, you can also receive digest emails for less critical notifications from ClusterControl? As SysAdmins and DBAs, we need to be notified whenever something critical happens to our database. But would it not be nicer if we were informed upfront, and still had time to perform pre-emptive maintenance and retain high availability?  With this new blog post, find out how to enable and set up your email notifications in ClusterControl according to your needs.

Read the blog

Getting social with Severalnines

As we begin to wrap up 2016 and plan all the exciting things for next year, we wanted to take a moment to encourage you to follow and engage with us on our social channels. We produce plenty of content and have a lot more planned for 2017. To ensure that you don’t miss out on any of it, we’d love it if you would follow us so we can better keep you up to date and interact more directly with you.

Get social

That’s it for this week! Feel free to share these resources with your colleagues and follow us in our social media channels.

Have a good end of the week,

Jean-Jérôme Schmidt
Planets9s Editor
Severalnines AB

by Severalnines at December 16, 2016 12:49 PM

December 15, 2016

Peter Zaitsev

Percona XtraDB Cluster 5.7.16-27.19 is now available

Percona announces the release of Percona XtraDB Cluster 5.7.16-27.19 on December 15, 2016. Binaries are available from the downloads section or our software repositories.

Percona XtraDB Cluster 5.7.16-27.19 is now the current release, based on the following:

All Percona software is open-source and free.


  • The following encryption modes are now deprecated:
    • encrypt=1
    • encrypt=2
    • encrypt=3

The default is encrypt=0 with encryption disabled. The recommended mode now is the new encrypt=4, which uses SSL files generated by MySQL.

For more information, see Encrypting PXC Traffic.

New Features

  • Added encrypt=4 mode for SST encryption that uses SSL files generated by MySQL. Modes 1, 2, and 3 are now deprecated.
  • ProxySQL assisted maintenance mode that enables you to take a node down without adjusting ProxySQL manually. The mode is controlled using the pxc_maint_mode variable, which can be set to one of the following values:
    • DISABLED: This is the default state that tells ProxySQL to route traffic to the node as usual.
    • SHUTDOWN: This state is set automatically when you initiate node shutdown.
    • MAINTENANCE: You can change to this state if you need to perform maintenance on a node without shutting it down.

For more information, see Assisted Maintenance Mode.

  • Simplified SSL configuration for Galera/SST traffic with the pxc-encrypt-cluster-traffic option, which auto-configures SSL encryption.

For more information, see SSL Automatic Configuration.

  • Added the wsrep_flow_control_interval status variable that displays the lower and upper limits of the flow control system used for the Galera receive queue.

Fixed Bugs

  • Optimized IST donor selection logic to avoid SST. Child processes are now cleaned up and node state is resumed if SST fails.
  • Added init.ok to the list of files that do not get removed during SST.
  • Fixed error with ASIO library not acknowledging an EPOLLIN event when building Galera.
  • Fixed stalling of DML workload on slave node caused by FLUSH TABLE executed on the master.
    For more information, see 1629296.
  • Fixed super_read_only to not apply to Galera replication applier.
    For more information, see 1634295.
  • Redirected netcat output to stdout to avoid it in the log.
    For more information, see 1625968.
  • Enabled replication of ALTER USER statements.
    For more information, see 1376269.
  • Changed the wsrep_max_ws_rows variable to ignore non-replicated write-sets generated by DML action on temporary tables (explicit or implicit).
    For more information, see 1638138.
  • Fixed SST to fail with an error if SSL is not supported by socat, instead of switching to unencrypted mode.
  • Fixed SST with SSL to auto-generate a 2048-bit dhparams file for versions of socat before 1.7.3. These older versions use a 512-bit dhparams file by default, which newer clients reject with a dh key too small error.
  • PXC-731: Changed the wsrep_cluster_name variable to read-only, because changing it dynamically leads to high overhead.
    For more information, see 1620439.
  • PXC-732: Improved error message when any of the SSL files required for SST are missing.
  • PXC-735: Fixed SST to fail with an error when netcat is used (transferfmt=nc) with SSL encryption (encrypt set to 2, 3 or 4), instead of silently switching to unencrypted mode.
  • Fixed faulty switch case that caused cluster to stall when the repl.commit_order variable was set to 2 (LOCAL_OOOC mode that should allow out-of-order committing for local transactions).

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

by Alexey Zhebel at December 15, 2016 07:21 PM

Percona Live Featured Tutorial with Giuseppe Maxia — MySQL Document Store: SQL and NoSQL United

Welcome to a new series of blogs: Percona Live featured tutorial speakers! In these blogs, we’ll highlight some of the tutorial speakers that will be at this year’s Percona Live conference. We’ll also discuss how these tutorials can help you improve your database environment. Make sure to read to the end to get a special Percona Live 2017 registration bonus!

In this Percona Live featured tutorial, we’ll meet Giuseppe Maxia, Quality Assurance Architect at VMware. His tutorial is on MySQL Document Store: SQL and NoSQL United. MySQL 5.7 introduced document store, which allows asynchronous operations and native document store handling. Since the changes are huge, both developers and DBAs are uncertain about what is possible and how to do it.  I had a chance to speak with Giuseppe and learn a bit more about the MySQL document store feature:

Percona: How did you get into database technology? What do you love about it?

Giuseppe: I am fascinated by the ability to organize and dynamically retrieve data. I got my first experiences with databases several decades ago. Relational databases were not as dominant then as they are now. In an early job, I actually wrote a relational database interface in C, with the purpose of getting more flexibility than what I could get from commercial products. This happened several years before the creation of MySQL. This experience with the internals of a DBMS gave me some insight on how raw data, when appropriately processed, becomes useful information. All the jobs that I had since then related to database usage or development. With each job I learned something, and most of what I accumulated over the years is still relevant today when I use top-notch databases.

What I love today about databases is the same thing that made me start working with them: I see them as powerful tools to help people make order out of the chaos of their data.

Percona: Your tutorial is “MySQL Document Store: SQL and NoSQL United.” What exactly is MySQL document store, and why is it such an exciting new feature?

Giuseppe: The “Document Store” is a feature introduced as a plugin in MySQL 5.7. It is different from most anything that MySQL has done before, for two reasons:

  1. It is a feature added to a server that is already GA – not directly as a change in the server code, but as an addition that users need to enable. Document store is the first of several additions that will come using the same paradigm. It allows the MySQL team to add functionalities without waiting for the natural cycle of development, which usually takes a few years.
  2. It allows users to treat some of the data stored in MySQL as schema-less documents, i.e. data that does not have to be restricted by the stiff paradigm of rows and columns that are the foundation of relational databases. In a nutshell, by using this plugin we can write collections of heterogeneous documents instead of tables and relations. Moreover, we can handle the data using non-SQL languages, such as JavaScript and Python, with a syntax that is more natural to developers that are not very familiar with relational theory.

Why is this such an exciting feature? I think it’s an attempt by Oracle to lure no-SQL users into the MySQL arena. By offering the ability to combine structured and unstructured data into the same entity with a proven record of safety and stability, Oracle may have created the perfect match between relational educated DBAs and developers who usually think in terms of hierarchical or nested data structures.

Percona: How can the document store make DBAs’ lives easier? How can it make them more complicated?

Giuseppe: This depends on the organizational needs that the DBA has to address. There is a simplification, if the organization needs to deal with both structured and unstructured data. Instead of installing and maintaining two databases (e.g., MySQL and MongoDB) they can use just one.

What can go wrong? The plugin isn’t GA software (“Using MySQL as a document store is currently a preproduction feature”) and therefore DBAs should be ready to apply patches and take extra steps to keep the data safe, should a defect arise.

Percona: What benefits does document store hold for a business’ database environment?

Giuseppe: As mentioned before, it could be a simplification of overall operations. It exposes data as collections containing unstructured documents. This matches closely the kind of information that we deal with in many modern environments. Consider, for instance, current operations with cloud computing appliances: we mostly encode the data sent and received in such an environment as JSON or XML (which in turn can be easily converted into JSON.) Storing the documents retrieved from such operations directly as they are produced is a great advantage. A further benefit is the ability to index the data without converting it into structured tables, and retrieving information quickly and dynamically.

Percona: What do you want attendees to take away from your tutorial session? Why should they attend?

Giuseppe: The document store comes with a gargantuan amount of documentation. Kudos to the MySQL team for providing such detail on a new feature. However, the sheer size of the data might intimidate casual users who want to take advantage of the new feature. They might also fail to grasp the starting points. This tutorial’s main purpose is explaining the document store in simple terms, how to get started, and the common pitfalls.

Everyone who wants to deal with unstructured documents without maintaining two DBMS should attend. Developers will probably have more interest than DBAs, but there is food for everyone’s taste with the live demos.

On the practical side, the tutorial will show how data can get created in MySQL and consumed in MongoDB, and the other way around.

Percona: What are you most looking forward to at Percona Live?

Giuseppe: The MySQL world has been boiling over with new or enhanced features lately. I look forward to seeing the latest news about MySQL and related technologies. Percona Live is the place where MySQL professionals meet and exchange ideas. In addition to exposing myself to new things, though, I also enjoy seeing my friends in the MySQL world, and meeting new ones.

Want to find out more about Giuseppe and MySQL document store? Register for Percona Live Data Performance Conference 2017, and see his talk MySQL Document Store: SQL and NoSQL United. Use the code FeaturedTalk and receive $30 off the current registration price!

Percona Live Data Performance Conference 2017 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 24-27, 2017 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

by Dave Avery at December 15, 2016 06:38 PM

MariaDB Foundation

MariaDB 10.1.20 now available

The MariaDB project is pleased to announce the immediate availability of MariaDB 10.1.20. This is a Stable (GA) release. See the release notes and changelog for details.

  • Download MariaDB 10.1.20
  • Release Notes
  • Changelog
  • What is MariaDB 10.1?
  • MariaDB APT and YUM Repository Configuration Generator

Thanks, and enjoy MariaDB!

by Daniel Bartholomew at December 15, 2016 06:08 PM

Jean-Jerome Schmidt

New whitepaper - the DevOps Guide to database backups for MySQL and MariaDB

This week we’re happy to announce that our new DevOps Guide to Database Backups for MySQL & MariaDB is now available for download (free)!

This guide discusses in detail the two most popular backup utilities available for MySQL and MariaDB, namely mysqldump and Percona XtraBackup.

It covers topics such as how database features like binary logging and replication can be leveraged in backup strategies, and it provides best practices that can be applied to high availability topologies in order to make database backups reliable, secure and consistent.

Ensuring that backups are performed, so that a database can be restored if disaster strikes, is a key operational aspect of database management. The DBA or System Administrator is usually the responsible party to ensure that the data is protected, consistent and reliable. Ever more crucially, backups are an important part of any disaster recovery strategy for businesses.

So if you’re looking for insight into how to perform database backups efficiently or the impact of Storage Engine on MySQL or MariaDB backup procedures, need some tips & tricks on MySQL / MariaDB backup management … our new DevOps Guide has you covered.

by Severalnines at December 15, 2016 04:41 PM

Peter Zaitsev

Row Store and Column Store Databases

In this blog post, we’ll discuss the differences between row store and column store databases.

Clients often ask us if they should or could be using columnar databases. For some applications, a columnar database is a great choice; for others, you should stick with the tried and true row-based option.

At a basic level, row stores are great for transaction processing. Column stores are great for highly analytical query models. Row stores have the ability to write data very quickly, whereas a column store is awesome at aggregating large volumes of data for a subset of columns.

One of the benefits of a columnar database is its crazy fast query speeds. In some cases, queries that took minutes or hours are completed in seconds. This makes columnar databases a good choice in a query-heavy environment. But you must make sure that the queries you run are really suited to a columnar database.

Data Storage

Let’s think about a basic database, like a stockbroker’s transaction records. In a row store, each client would have a record with their basic information – name, address, phone number, etc. – in a single table. It’s likely that each record would have a unique identifier. In our case, it would probably be an account number.

There is another table that stores stock transactions. Again, each transaction is uniquely identified by something like a transaction ID. Each transaction is associated with one account, but each account is associated with multiple transactions. This provides us with a one-to-many relationship, and is a classic example of a transactional database.

We store all these tables on a disk and, when we run a query, the system might access lots of data before it determines what information is relevant to the specific query. If we want to retrieve a handful of fields from both tables for a given time period, the system needs to access all of the information for the two tables, including fields that may not be relevant to the query. It then performs a join to relate the two tables’ data, and then it can return the information. This can be inefficient at scale, and this is just one example of a query that would probably run faster on a columnar database.

With a columnar database, each field from each table is stored in its own file or set of files. In our example database, all of the data for one field is stored in one file, all of the data for another field is stored in another file, and so on. This provides some efficiencies when running queries against wide tables, since it is unlikely that a query needs to return all of the fields in a single table. In the query example above, we’d only need to access the files that contained data from the requested fields. You can ignore all other fields that exist in the table. This ability to minimize I/O is one of the key reasons columnar databases can perform much faster.

Normalization Versus Denormalization

Additionally, many columnar databases prefer a denormalized data structure. In the example above, we have two separate tables: one for account information and one for transaction information. In many columnar databases, a single table could represent this information. With this denormalized design, when a query like the one presented is run, no joins would need to be processed in the columnar database, so the query will likely run much faster.

The reason for normalizing data is that it allows data to be written to the database in a highly efficient manner. In our row store example, we need to record just the relevant transaction details whenever an existing customer makes a transaction. The account information does not need to be written along with the transaction data. Instead, we reference the account number to gain access to all of the fields in the accounts table.

The place where a columnar database really shines is when we want to run a query that would, for example, determine the average price for a specific stock over a range of time. In the case of the columnar database, we only need a few fields (the stock, the transaction date, and the price) in order to complete the query. With a row store, we would gather additional data that was not needed for the query but was still part of the table structure.
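
A toy comparison can make the difference visible. The sketch below lays the same transactions out both ways and counts how many stored values each layout touches to compute an average price. The field names and figures are invented for illustration, and real engines read pages and files rather than individual values.

```javascript
// Row layout: one record per transaction, all fields stored together.
const rows = [
  { id: 1, accountId: 7, symbol: "ABC", price: 10, quantity: 5 },
  { id: 2, accountId: 8, symbol: "XYZ", price: 50, quantity: 1 },
  { id: 3, accountId: 7, symbol: "ABC", price: 12, quantity: 2 },
  { id: 4, accountId: 9, symbol: "ABC", price: 14, quantity: 4 },
];

// Column layout: one array ("file") per field.
function toColumns(records) {
  const cols = {};
  for (const record of records) {
    for (const [field, value] of Object.entries(record)) {
      if (!cols[field]) cols[field] = [];
      cols[field].push(value);
    }
  }
  return cols;
}

function avgPriceRowStore(records, symbol) {
  let sum = 0, count = 0, valuesRead = 0;
  for (const record of records) {
    valuesRead += Object.keys(record).length;   // the whole row is fetched, needed or not
    if (record.symbol === symbol) { sum += record.price; count++; }
  }
  return { avg: sum / count, valuesRead };
}

function avgPriceColumnStore(cols, symbol) {
  let sum = 0, count = 0;
  for (let i = 0; i < cols.symbol.length; i++) {
    if (cols.symbol[i] === symbol) { sum += cols.price[i]; count++; }
  }
  // Only the symbol and price columns were touched.
  return { avg: sum / count, valuesRead: cols.symbol.length + cols.price.length };
}

const columns = toColumns(rows);
console.log(avgPriceRowStore(rows, "ABC"));
console.log(avgPriceColumnStore(columns, "ABC"));
```

Both layouts return the same average, but the column layout reads only the two fields the query asks for, and the gap grows with the width of the table.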

Normalization of data also makes updates to some information much more efficient in a row store. If you change an account holder’s address, you simply update the one record in the accounts table. The updated information is available to all transactions completed by that account owner. In the columnar database, since we might store the account information with the transactions of that user, many records might need updating in order to update the available address information.
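
The update cost can be sketched the same way. The tables and field names below are hypothetical; the point is only the number of records that have to be rewritten for one change of address.

```javascript
// Set a new address on every record that belongs to the account and
// report how many records had to be written.
function updateAddress(records, accountId, address) {
  let written = 0;
  for (const record of records) {
    if (record.accountId === accountId) {
      record.address = address;
      written++;
    }
  }
  return written;
}

// Normalized: the address lives once, in the accounts table.
const accountsTable = [
  { accountId: 7, address: "1 Old Road" },
  { accountId: 8, address: "2 Elm St" },
];

// Denormalized: every transaction carries its own copy of the address.
const wideTransactions = [
  { txId: 1, accountId: 7, address: "1 Old Road", price: 10 },
  { txId: 2, accountId: 8, address: "2 Elm St", price: 50 },
  { txId: 3, accountId: 7, address: "1 Old Road", price: 12 },
  { txId: 4, accountId: 7, address: "1 Old Road", price: 14 },
];

console.log("normalized writes:  ", updateAddress(accountsTable, 7, "9 New Ave"));
console.log("denormalized writes:", updateAddress(wideTransactions, 7, "9 New Ave"));
```

One write in the normalized design becomes one write per transaction in the wide table, which is the trade-off for avoiding joins at query time.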


So, which one is right for you? As with so many things, it depends. You can still perform data analysis with a row-based database, but the queries may run slower than they would on a column store. You can record transactions in a column-based model, but the writes may take longer to complete. In an ideal world, you would have both options available to you, and this is what many companies are doing.

In most cases, the initial write is to a row-based system. We know them, we love them, we’ve worked with them forever. They’re kind of like that odd relative who has some real quirks. We’ve learned the best ways to deal with them.

Then, we write the data (or the relevant parts of the data) to a column based database to allow for fast analytic queries.

Both databases incur write transactions, and both also likely incur read transactions. Because a column-based database has each column’s data in a separate file, it is less than ideal for a “SELECT * FROM…” query, since the request must access numerous files to process the request. Similarly, any query that selects a single row or a small subset of rows will probably perform better in a row store. The column store is awesome for performing aggregation over large volumes of data, or when you have queries that only need a few fields from a wide table.

It can be tough to decide between the two if you only have one database. But it is more the norm that companies support multiple database platforms for multiple uses. Also, your needs might change over time. The sports car you had when you were single is less than optimal for your current family of five. But, if you could, wouldn’t you want both the sports car and the minivan? This is why we often see both database models in use within a single company.

by Rick Golba at December 15, 2016 12:35 AM

December 14, 2016

MariaDB AB

IHME Believes Open Source MariaDB ColumnStore Is The Future of Data Warehousing

Posted by guest on Wed, 12/14/2016 - 18:32

Note: This is a guest post by Andrew Ernst, Assistant Director, Infrastructure at the Institute for Health Metrics and Evaluation. 

In the early 1990s, the World Bank commissioned an in-depth study to measure disability and death from a multitude of causes worldwide. Over the past few decades, this study has grown into an international consortium of more than 1,800 researchers from more than 120 countries, and its estimates are being updated annually.

Today, the Global Burden of Disease report, managed by the Institute for Health Metrics and Evaluation (IHME), serves as the most comprehensive effort to systematically measure the world’s health problems. In fact, the tools can be used at the global, national, and local levels to understand health trends over time, just like gross domestic product data are used to monitor a country’s economic activity.

The data is growing each year. In 2015, the Global Burden of Disease results were three times larger than any other year. As the Global Burden of Disease report continues to grow in size, focusing on more granular geographies, the data requirements also continue to scale exponentially.

Solving Volume and Scale Challenges

Over several decades, the size and scope of the Global Burden of Disease results have grown - today reaching multi-billion row tables. As it has grown, we’ve tried several storage engines which have failed miserably.

The Global Burden results are developed through many internal processes and pipelines that rely on a MySQL-compliant infrastructure. The choice to adopt MySQL was made at a time when the scope of IHME’s work was much smaller, and today’s scale would have been incomprehensible. The ambitions of our researchers and abundance of available input data have pushed the boundaries of research and traditional database engines.

IHME has leveraged the advanced Percona XtraDB Barracuda (InnoDB-compliant) storage engine for every database environment, and supports more than 90 database instances within its infrastructure. The environments supporting those critical scientific computing pipelines run on high-end hardware and are optimized for low latency and for servicing extreme concurrency from the High Performance Compute cluster.

Knowing that the existing solutions would not scale with the Institute’s growth, we have been evaluating platforms that:

  • offer a MySQL-compatible interface

  • are cost-effective

  • don’t require a vast amount of research application code to be rewritten

  • ideally have an open source development effort with community-driven input and contributions.

MemSQL was one of the top contenders and performed extremely well at scale, but it had a number of non-standard constructs for database design and implementation, and its non-commercial product lacked definable security mechanisms for authentication and authorization.

MySQL 5.7, while offering higher benchmark speeds for ingest and query optimization, didn’t offer enough of a paradigm shift to make a huge impact on our workload. We knew that any platform selected would need to combine multi-host sharding with multi-threaded software. Our database team is small, and while building a sharding infrastructure is reasonably straightforward, we realistically cannot ask the development staff to make the applications shard-aware.

Then IHME evaluated MariaDB ColumnStore, which combines the power of big data analytics with the ease of use of SQL. Leveraging MariaDB’s open source model, MariaDB offers high performance search queries on massive billion row data tables.

With MariaDB ColumnStore, we were able to improve the performance of our multi-billion row tables. IHME found several benefits to using ColumnStore including:

  • Higher performance: Compared to row-based storage, MariaDB ColumnStore column storage reduces disk I/O, making it much faster for read-intensive analytic workloads on large datasets.  

  • Better security: ColumnStore accesses all the same security capabilities delivered in MariaDB Server including encryption for data in motion, role-based access and audit features.

  • Benefits of leveraging SQL: ColumnStore brings transactional and analytic workloads into a single enterprise grade system. It simplifies enterprise administration and execution with a standard SQL front end for OLTP and analytics.

We also found MariaDB engineers to be incredibly responsive to IHME - we trust we can work with them for a very long time.

Moving Forward

Looking into the future, IHME has to design for growing data volumes and for regular updates as new data and epidemiological studies become available.

MariaDB's ColumnStore storage engine solved both a volume and a scale problem within our environment, which allows us to seamlessly handle both current and planned increases in workload.

When IHME released our results in 2010, there were approximately 2 billion data points, and with the 2015 effort, that number has grown to just shy of 100 billion. The 2016 results are already suggesting we will far exceed 10 billion results per table in the next six months (each result set is about 9-10 tables of roughly the same size). Looking further into the future, IHME will be focusing on smaller geographical areas across the globe, and will need to support analytical workloads that include geospatial calculations.

High performance, flexible data analytics using MariaDB ColumnStore doesn’t just make my day to day job easier. It will have a profound impact on how the global community can assess disease around the world.

The pioneering effort of the IHME continues to be hailed as a major landmark in public health and an important foundation for policy formulation and priority setting.

by guest at December 14, 2016 11:32 PM

Peter Zaitsev

Percona XtraDB Cluster 5.6.34-26.19 is now available

Percona announces the release of Percona XtraDB Cluster 5.6.34-26.19 on December 14, 2016. Binaries are available from the downloads section or our software repositories.

Percona XtraDB Cluster 5.6.34-26.19 is now the current release, based on the following:

All Percona software is open-source and free. Details of this release can be found in the 5.6.34-26.19 milestone on Launchpad.


  • The following encryption modes are now deprecated:
    • encrypt=1
    • encrypt=2
    • encrypt=3

The default is encrypt=0 with encryption disabled. The recommended mode now is the new encrypt=4, which uses SSL files generated by MySQL.

For more information, see Encrypting PXC Traffic.

New Features

  • Added encrypt=4 mode for SST encryption that uses SSL files generated by MySQL. Modes 1, 2, and 3 are now deprecated.

Fixed Bugs

  • Optimized IST donor selection logic to avoid SST. Child processes are now cleaned-up and node state is resumed if SST fails.
  • Added init.ok to the list of files that do not get removed during SST.
  • Fixed error with ASIO library not acknowledging an EPOLLIN event when building Galera.
  • Fixed stalling of DML workload on slave node caused by FLUSH TABLE executed on the master.
    For more information, see 1629296.
  • Fixed super_read_only to not apply to Galera replication applier.
    For more information, see 1634295.
  • Redirected netcat output to stdout to avoid it in the log.
    For more information, see 1625968.
  • Enabled replication of ALTER USER statements.
    For more information, see 1376269.
  • Changed the wsrep_max_ws_rows variable to ignore non-replicated write-sets generated by DML action on temporary tables (explicit or implicit).
    For more information, see 1638138.
  • Fixed SST to fail with an error if SSL is not supported by socat, instead of switching to unencrypted mode.
  • Fixed SST with SSL to auto-generate a 2048-bit dhparams file for versions of socat before 1.7.3. These older versions use 512-bit dhparams file by default that gets rejected by newer clients with dh key too small error.
  • PXC-731: Changed the wsrep_cluster_name variable to read-only, because changing it dynamically leads to high overhead.
    For more information, see 1620439.
  • PXC-732: Improved error message when any of the SSL files required for SST are missing.
  • PXC-735: Fixed SST to fail with an error when netcat is used (transferfmt=nc) with SSL encryption (encrypt set to 2, 3 or 4), instead of silently switching to unencrypted mode.
  • Fixed faulty switch case that caused the cluster to stall when the repl.commit_order variable was set to 2 (LOCAL_OOOC mode that should allow out-of-order committing for local transactions).

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

by Alexey Zhebel at December 14, 2016 06:53 PM

MariaDB AB

General Availability of MariaDB ColumnStore 1.0

I am happy to announce the general availability of MariaDB ColumnStore 1.0!

MariaDB ColumnStore is a powerful open source columnar storage engine that unites transactional and analytic processing with a single ANSI SQL front end to deliver a solution that simplifies high-performance, big data analytics.

By leveraging MariaDB’s extensible architecture, we radically simplified the entry to analytics. While users continue to access the single SQL interface and familiar MariaDB API, data is analyzed with our highly optimized columnar storage engine and parallel query processing.

ColumnStore 1.0 key features include:

  • Better Price Performance – ColumnStore brings the power of SQL and freedom of open source to big data analytics with 90% less cost per TB per year compared to proprietary data warehouses

  • Easy Enterprise Analytics – Single SQL interface for OLTP and analytics including complex aggregation, joins and windowing functions at the data storage level

  • Faster, More Efficient Queries – Parallel query processing and data ingestion for real-time big data analytics on distributed environments

This milestone was achieved by great teamwork from MariaDB’s engineering team and community support. Get started with ColumnStore today and enjoy the power of open source innovation for big data analytics.


Roger Bodamer

Chief Product Officer

by roger_bodamer_g at December 14, 2016 08:35 AM

A Look Inside MariaDB ColumnStore 1.0.6 GA

Today, MariaDB ColumnStore has reached a major milestone – MariaDB ColumnStore 1.0 is now GA with the release of MariaDB ColumnStore 1.0.6 GA. The journey of MariaDB ColumnStore began in January 2016 when our team started building ColumnStore. The support from our early alpha and beta adopters and community users has helped us take MariaDB ColumnStore from the first alpha release to the GA today.


MariaDB ColumnStore is a massively parallel, high-performance, distributed columnar storage engine built on MariaDB Server. It is the first columnar storage engine for big data analytics in the MariaDB ecosystem. It can be deployed in the cloud (optimized for Amazon Web Services) or on a local cluster of Linux servers using either local or networked storage.

A Look Inside

In MariaDB ColumnStore’s architecture, three components – a MariaDB SQL front end called User Module (UM), a distributed query engine called Performance Module (PM) and distributed data storage – work together to deliver high-performance, big data analytics.


  • User Module (UM):
    The UM is made up of the front end MariaDB Server instance and a number of processes specific to MariaDB ColumnStore that handle concurrency scaling. The storage engine plugin for MariaDB ColumnStore hands over the query to one of these processes, which then further breaks down SQL requests, distributing the various parts to one or more Performance Modules to process the query. Finally, the UM assembles all the query results from the various participating Performance Modules to form the complete query result set that is returned to the user.

  • Performance Module (PM):
    The PM is responsible for storing, retrieving and managing data, processing block requests for query operations, and passing the results back to the User Module(s) to finalize the query requests. The PM selects data from disk and caches it in a shared-nothing data cache that is part of the server on which the PM resides. MPP is accomplished by allowing the user to configure as many Performance Modules as they would like; each additional PM adds more cache to the overall database as well as more processing power.

  • Distributed Data Storage:
    MariaDB ColumnStore is extremely flexible with respect to the storage system. When running on premise, it can use either local storage or shared storage (e.g., SAN) to store data. In the Amazon EC2 environment, it can use ephemeral or Elastic Block Store (EBS) volumes.
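
The division of labor between the UM and the PMs can be pictured as a scatter-gather aggregation. The following is a minimal sketch in plain JavaScript, not ColumnStore code: the function names `performanceModuleScan` and `userModuleQuery` and the sample data are hypothetical, and a real deployment distributes this work across processes and servers.

```javascript
// Hypothetical illustration of UM/PM scatter-gather; not ColumnStore code.
// Each "PM" scans only its own slice of a column and returns a partial aggregate.
function performanceModuleScan(values, predicate) {
  let sum = 0;
  let count = 0;
  for (const v of values) {
    if (predicate(v)) {
      sum += v;
      count += 1;
    }
  }
  return { sum, count };
}

// The "UM" fans the request out to every PM slice and merges the partial results.
function userModuleQuery(pmSlices, predicate) {
  const partials = pmSlices.map((slice) => performanceModuleScan(slice, predicate));
  const total = partials.reduce(
    (acc, p) => ({ sum: acc.sum + p.sum, count: acc.count + p.count }),
    { sum: 0, count: 0 }
  );
  return { ...total, avg: total.count === 0 ? null : total.sum / total.count };
}

// Three PMs, each owning part of an "amount" column (made-up data).
const pmSlices = [[10, 200, 35], [400, 5], [120, 80, 300]];
const umResult = userModuleQuery(pmSlices, (amount) => amount >= 100);
console.log(umResult);
```

Because each PM only returns a small partial aggregate, adding PMs spreads the scan work without increasing what the UM has to merge by much.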

MariaDB ColumnStore 1.0 Features

  • Scale

    • Massively parallel architecture designed for big data scaling

      • Linear scalability as new nodes are added

    • Easy horizontal scaling

      • Add new data nodes as your data grows

      • Continue read queries when adding new nodes

    • Compression

      • Data compression designed to accelerate decompression rate, reducing disk I/O

  • Performance

    • High-performance, real-time and ad-hoc analytics

      • Columnar optimized, massively parallel, distributed query processing on commodity servers

    • High-speed data load and extract

      • Load data while continuing analytics queries

      • Fully parallel high-speed data load and extract

  • Enterprise-Grade Analytics

    • Analytics

      • In-database distributed analytics with complex join, aggregation, window functions

      • Extensible UDF for custom analytics

    • Cross-engine access

      • Use a single SQL interface for analytics and OLTP

      • Cross join tables between MariaDB and ColumnStore for full insight

    • Security

      • MariaDB security features – SSL, role-based access and auditability

      • Out-of-the-box BI tool connectivity using ODBC/JDBC or standard MariaDB connectors

  • Management and Availability  

    • Easy to install, manage, maintain and use

      • Automatic horizontal partitioning

      • No index, views or manual partition tuning needed for performance

      • Online schema changes while read queries continue

    • Deploy anywhere

      • On premise or on AWS

      • On premise using commodity servers

    • High Availability

      • Automatic UM failover

      • Multi-PM distributed data attachment across all PMs in SAN and EBS environment for automatic PM failover


The release notes for MariaDB ColumnStore 1.0.6, along with a list of bugs fixed, can be found here. Documentation is available in our Knowledge Base. Binaries for MariaDB ColumnStore 1.0.6 are available for download here. For developers wanting to do a quick install, Docker and Vagrant options are available. You can also find MariaDB-ColumnStore-1.0.6 AMI in the AWS marketplace.


Reaching the GA could not have been possible without the valuable feedback we have received from the community and our beta customers. Thanks to everyone who contributed. Special acknowledgment also goes to the outstanding work by MariaDB ColumnStore Engineering team whose hard work and dedication has made this GA possible.

The journey does not stop here. As the new year unfolds we will start looking at the content and begin planning for MariaDB ColumnStore 1.1. Based on what we have already learned from our beta users, we will be adding streaming and more manageability features in 1.1. If you have any ideas or suggestions that you would like to see in the next release, please create a request in JIRA. For questions or comments, you can reach me at or tweet me @dipti_smg


by Dipti Joshi at December 14, 2016 08:31 AM

December 13, 2016

Peter Zaitsev

Webinar Wednesday 12/14: MongoDB System Tuning Best Practices

Please join Percona Senior Technical Operations Architect Tim Vaillancourt on Wednesday, December 14, at 10:00 am PST / 1:00 pm EST (UTC-8) as he presents MongoDB System Tuning Best Practices.

People give much love to optimizing document design, provisioning, and even selecting an engine in MongoDB. They give little attention to tuning Linux to handle databases efficiently. In this session we will talk about which schedulers, network settings, memory and cache settings, and file systems you should use, whether you should use NUMA and Huge Pages, and more.

This will be a data-packed webinar for the advanced user, but still accessible by the budding systems admin type that wants to learn more about system internals.

Register for this webinar here.

Tim joined Percona in 2016 as Sr. Technical Operations Architect for MongoDB, with a goal to make the operations of MongoDB as smooth as possible. With experience operating infrastructures in industries such as government, online marketing/publishing, SaaS and gaming – combined with experience tuning systems from the hard disk all the way up to the end-user – Tim has spent time in nearly every area of the modern IT stack with many lessons learned.

Tim lives in Amsterdam, NL and enjoys traveling, coding and music. Prior to Percona Tim was the Lead MySQL DBA of Electronic Arts’ DICE studios, helping some of the largest games in the world (“Battlefield” series, “Mirrors Edge” series, “Star Wars: Battlefront”) launch and operate smoothly while also leading the automation of MongoDB deployments for EA systems. Before the role of DBA at EA’s DICE studio, Tim served as a subject matter expert in NoSQL databases, queues and search on the Online Operations team at EA SPORTS. Prior to moving to the gaming industry, Tim served as a Database/Systems Admin operating a large MySQL-based SaaS infrastructure at AbeBooks/Amazon Inc.

by Dave Avery at December 13, 2016 05:57 PM

MongoDB 3.4: Facet Aggregation Features and SERVER-27395 Mongod Crash

This blog discusses MongoDB 3.4 GA facet aggregation features and the SERVER-27395 mongod crash bug.

As you may have heard, in late November MongoDB 3.4 GA was released. One feature that stuck out for me, a Lucene enthusiast, was the addition of powerful grouping and faceted search features in MongoDB 3.4.

Faceted Search

For those unfamiliar with the term faceted search, this is a way of grouping data using one or many different grouping criteria over a large result. It’s a tough idea to define specifically, but the aim of a faceted search is generally to show the most relevant information possible to the user and allow them to further filter what is usually a very large result of a given search criteria.

The most common day-to-day example of a faceted search is performing a search for a product on an e-commerce website such as eBay, Amazon, etc. As e-commerce sites commonly have the challenge of supplying a massive range of items to users that often provide limited search criteria, it is rare to see an online store today that does not have many “filters” on the right side of their website to further narrow down a given product search.

Here is an example of me searching the term “mongodb” on a popular auction site:

While this may seem like a specific search to some, at large volume this search term might not immediately show something relevant to some users. What if the user only wants a “used” copy of a MongoDB book from a specific year? What if the user was looking for a MongoDB sticker and not a book at all? This is why you’ll often see filters alongside search results (which we can call “facets”) showing item groupings such as different store departments, different item conditions (such as used/new), publication years, price ranges, review ratings, etc.

In some traditional databases, to get this kind of result we might need to issue many different expensive “GROUP BY” queries that could be painful for a database to process. Each of these queries would independently scan data, even if all queries are summarizing the same “result set.” This is very inefficient. A faceted search offers powerful groupings using a single operation on result data.

When I made my search for “mongodb”, under a faceted search model the page of items (in this case MongoDB books) and all the different groupings of departments, condition, price, etc., are performed as a single grouping operation in one “pass” of the data. The result from a faceted search contains items matching the search criteria AND the grouping results of the matched items as a single response.
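
To make the "one pass" idea concrete, here is a minimal sketch in plain JavaScript. It is not MongoDB or Lucene code: the `facetedSearch` function, the sample `listings` and the field names are all hypothetical, chosen only to mirror the auction-site example.

```javascript
// Hypothetical single-pass faceting; sample data and names are made up.
function facetedSearch(docs, matches, facetFields) {
  const items = [];
  const facets = {};
  for (const field of facetFields) facets[field] = {};

  // One pass: collect matching items and update every facet count together.
  for (const doc of docs) {
    if (!matches(doc)) continue;
    items.push(doc);
    for (const field of facetFields) {
      const key = String(doc[field]);
      facets[field][key] = (facets[field][key] || 0) + 1;
    }
  }
  return { items, facets };
}

const listings = [
  { title: "MongoDB book", condition: "used", department: "books" },
  { title: "MongoDB book, 2nd ed.", condition: "new", department: "books" },
  { title: "MongoDB sticker", condition: "new", department: "stationery" },
  { title: "Redis mug", condition: "new", department: "kitchen" },
];

const searchResult = facetedSearch(
  listings,
  (doc) => doc.title.toLowerCase().includes("mongodb"),
  ["condition", "department"]
);
console.log(searchResult.facets);
```

The equivalent with separate GROUP BY queries would scan the matching rows once per facet; here the data is read once regardless of how many facets are requested.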

Traditionally faceted searches were mostly limited to Lucene-based search engines such as Apache Solr, Elasticsearch and various closed-source solutions. With the release of MongoDB 3.4, this has changed!

The new Aggregation Pipeline features named $bucket and $bucketAuto provide functionality for processing groupings of result data in a single aggregation stage, and $facet allows the processing of many aggregation pipelines on the same result for even more complex cases.

New Faceting Features

MongoDB 3.4 introduces these new Aggregation Pipeline operators, allowing some advanced grouping and faceted-search-like features:

  1. $facet – Processes multiple aggregation pipelines within a single stage on the same set of input documents. Each sub-pipeline has its own field in the output document where its results are stored as an array of documents.
  2. $bucket – Categorizes incoming documents into groups, called buckets, based on a specified expression and bucket boundaries.
  3. $bucketAuto – Similar to $bucket, however bucket boundaries are automatically determined in an attempt to evenly distribute the documents into the specified number of buckets.

As a very basic example, let’s consider this collection of store items:

> db.items.find()
{ "_id" : ObjectId("58502ade9a49537a011226fb"), "name" : "scotch", "price_usd" : 90, "department" : "food and drinks" }
{ "_id" : ObjectId("58502ade9a49537a011226fc"), "name" : "wallet", "price_usd" : 95, "department" : "clothing" }
{ "_id" : ObjectId("58502ade9a49537a011226fd"), "name" : "watch", "price_usd" : 900, "department" : "clothing" }
{ "_id" : ObjectId("58502ade9a49537a011226fe"), "name" : "flashlight", "price_usd" : 9, "department" : "hardware" }

From this example data, I’d like to gather a count of items in buckets by price (field ‘price_usd’):

  1. $0.99 to $9.99
  2. $9.99 to $99.99
  3. $99.99 to $999.99

For each price-bucket, I would also like a list of unique “department” names for the matches. Here is how I would do this with $bucket (and the result):

> db.items.aggregate([
...   { $bucket: {
...     groupBy: "$price_usd",
...     boundaries: [ 0.99, 9.99, 99.99, 999.99 ],
...     output: {
...       count: { $sum: 1 },
...       departments: { $addToSet: "$department" }
...     }
...   } }
... ])
{ "_id" : 0.99, "count" : 1, "departments" : [ "hardware" ] }
{ "_id" : 9.99, "count" : 2, "departments" : [ "clothing", "food and drinks" ] }
{ "_id" : 99.99, "count" : 1, "departments" : [ "clothing" ] }
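
The bucket assignment rule can be mimicked in plain JavaScript: each boundary is an inclusive lower bound and the next boundary is the exclusive upper bound, and only non-empty buckets appear in the output. This is a sketch of the semantics only, not MongoDB's implementation, and `bucketize` is a hypothetical helper. The department lists are sorted here for readability; $addToSet itself does not guarantee an order.

```javascript
// Sketch of $bucket semantics in plain JavaScript (not MongoDB's implementation).
// A value v lands in bucket i when boundaries[i] <= v < boundaries[i + 1].
function bucketize(docs, field, boundaries) {
  const buckets = new Map();
  for (const doc of docs) {
    const value = doc[field];
    let lower = null;
    for (let i = 0; i < boundaries.length - 1; i++) {
      if (value >= boundaries[i] && value < boundaries[i + 1]) {
        lower = boundaries[i];
        break;
      }
    }
    // $bucket without a `default` rejects out-of-range values; mirror that here.
    if (lower === null) throw new Error(`no bucket for ${field}=${value}`);
    if (!buckets.has(lower)) {
      buckets.set(lower, { _id: lower, count: 0, departments: new Set() });
    }
    const bucket = buckets.get(lower);
    bucket.count += 1;                      // like { $sum: 1 }
    bucket.departments.add(doc.department); // like { $addToSet: "$department" }
  }
  return [...buckets.values()]
    .sort((a, b) => a._id - b._id)
    .map((b) => ({ ...b, departments: [...b.departments].sort() }));
}

const storeItems = [
  { name: "scotch", price_usd: 90, department: "food and drinks" },
  { name: "wallet", price_usd: 95, department: "clothing" },
  { name: "watch", price_usd: 900, department: "clothing" },
  { name: "flashlight", price_usd: 9, department: "hardware" },
];

const priceBuckets = bucketize(storeItems, "price_usd", [0.99, 9.99, 99.99, 999.99]);
console.log(JSON.stringify(priceBuckets));
```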

If you wanted to do something more complex, you have the flexibility of either making the $bucket stage more complex or you can even chain multiple stages together with $facet!
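
As a sketch of that chaining, the pipeline below runs two sub-pipelines over the same items in one $facet stage: the price buckets from the example above and a per-department count using $sortByCount (also introduced in 3.4). The output field names `byPrice` and `byDepartment` are arbitrary choices of mine, and the guard around `db` exists only so the snippet also loads outside the mongo shell.

```javascript
// Two facets computed over the same input documents in a single stage.
// "byPrice" and "byDepartment" are arbitrary output field names.
const facetPipeline = [
  { $facet: {
    byPrice: [
      { $bucket: {
        groupBy: "$price_usd",
        boundaries: [ 0.99, 9.99, 99.99, 999.99 ],
        output: { count: { $sum: 1 } }
      } }
    ],
    byDepartment: [
      { $sortByCount: "$department" }
    ]
  } }
];

// In the mongo shell, `db` is defined and the pipeline can be run directly.
if (typeof db !== "undefined") {
  printjson(db.items.aggregate(facetPipeline).toArray());
}
```

The result is a single document with one array per facet, which is the "items plus groupings in one response" shape described earlier.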

Mongod Crash: SERVER-27395

As I mentioned in my explanation of faceted search, it is a very complex/advanced feature that – due to the implementation challenges – is bound to have some bugs and inefficiencies.

During the evaluation of these new features, I noticed a very serious issue: I was able to crash the entire MongoDB 3.4.0 database instance using the $bucketAuto feature in combination with an $addToSet accumulator in the output definition. This is very serious!

This is the example output from my issue reproduction script, responsible for sending the $bucketAuto query to the mongo instance and then checking if it crashed:

$ bash -x ./
+ js='db.tweets.aggregate([
  { $bucketAuto: {
    groupBy: "$user.location",
    buckets: 1,
    output: {
      count: { $sum: 1 },
      location: { $addToSet: "$user.location" }
  } }
+ echo '### Running crashing $bucketAuto .aggregate() query'
### Running crashing $bucketAuto .aggregate() query
+ /opt/mongodb-linux-x86_64-3.4.0/bin/mongo --port=27017 '--eval=db.tweets.aggregate([
  { $bucketAuto: {
    groupBy: "$user.location",
    buckets: 1,
    output: {
      count: { $sum: 1 },
      location: { $addToSet: "$user.location" }
  } }
])' test
MongoDB shell version v3.4.0
connecting to: mongodb://
MongoDB server version: 3.4.0
2016-12-13T12:59:10.066+0100 E QUERY    [main] Error: error doing query: failed: network error while attempting to run command 'aggregate' on host ''  :
@(shell eval):1:1
+ sleep 1
++ tail -1 mongod.log
+ '[' '-----  END BACKTRACE  -----' = '-----  END BACKTRACE  -----' ']'
+ echo '###  Crashed mongod 3.4.0!'
###  Crashed mongod 3.4.0!

As you can see above, a full server crash occurred in my test when using $bucketAuto with $addToSet accumulators. The “network error” is caused by the MongoDB shell losing connection to the now-crashed server.

The mongod log file reports the following lines before the crash (and backtrace):

2016-12-13T12:59:10.048+0100 F -        [conn2] Invalid operation at address: 0x7f1d43ba990a
2016-12-13T12:59:10.061+0100 F -        [conn2] Got signal: 8 (Floating point exception).
 0x7f1d443e0f91 0x7f1d443e0089 0x7f1d443e06f6 0x7f1d42153100 0x7f1d43ba990a 0x7f1d43ba91df 0x7f1d43bc8d2e 0x7f1d43bcae3a 0x7f1d43bce255 0x7f1d43ca4492 0x7f1d43a3b0a5 0x7f1d43a3b29c 0x7f1d43a3b893 0x7f1d43d3c31a 0x7f1d43d3cc3b 0x7f1d4398447b 0x7f1d439859a9 0x7f1d438feb2b 0x7f1d438ffd70 0x7f1d43f12afd 0x7f1d43b1c54d 0x7f1d4371082d 0x7f1d4371116d 0x7f1d4435ec22 0x7f1d4214bdc5 0x7f1d41e78ced

This has been reported as the ticket SERVER-27395, and exists in MongoDB 3.4.0. Please see the ticket for more details, updates and a full issue reproduction. If this issue is important to you, please vote for it on the ticket.

This highlights the importance of testing new features with your exact application usage pattern, especially during a major version release such as MongoDB 3.4.0. With all the new exciting ways one can aggregate data in MongoDB 3.4.0, and the infinite ways to stitch those features together in a pipeline, there are bound to be some cases where the code needs improvement.

Nonetheless, I am very excited to see the addition of these powerful new features and I look forward to them maturing.



by Tim Vaillancourt at December 13, 2016 05:41 PM

December 12, 2016

MariaDB AB

How MariaDB ColumnStore Handles Big Data Workloads – Storage Architecture

Storage Overview


In this blog post, I will outline MariaDB ColumnStore's architecture, which has the capacity to handle large datasets and scale out across multiple nodes as your data grows.



  • Columns are the unit of storage, which is a key differentiator from a row-based storage engine such as InnoDB. A columnar system stores data per column rather than per row.

  • Partitions are used to store the data for a Column. Within MariaDB ColumnStore, a Partition is a logical concept for a grouping of Segments (default 4 per Partition).

  • A Segment is a storage file belonging to a Partition and containing a number of Extents (default 2). The system creates Segment files as needed.

  • An Extent is a collection of 8 million values for a given Column stored within a Segment. An Extent is made up of many Blocks.

  • A Block stores 8K worth of data and is the unit of disk I/O.
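
These sizes imply some simple arithmetic for fixed-width columns of 1 to 8 bytes. A rough sketch, assuming "8 million" means 8 × 1024 × 1024 values and "8K" means 8,192 bytes (both are my assumptions), and ignoring compression:

```javascript
// Back-of-the-envelope extent sizing; the constants are assumptions (see above).
const VALUES_PER_EXTENT = 8 * 1024 * 1024; // "8 million values"
const BLOCK_BYTES = 8 * 1024;              // "8K" block

function extentFootprint(columnWidthBytes) {
  const bytes = VALUES_PER_EXTENT * columnWidthBytes;
  return {
    mebibytes: bytes / (1024 * 1024),
    blocks: bytes / BLOCK_BYTES,
    valuesPerBlock: BLOCK_BYTES / columnWidthBytes,
  };
}

for (const width of [1, 2, 4, 8]) {
  console.log(width, extentFootprint(width));
}
```

Under these assumptions an uncompressed extent of an 8-byte column occupies 64 MiB across 8,192 blocks, while a 1-byte column needs only 8 MiB.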


How MariaDB Utilizes Storage

Within each Extent and Block, MariaDB ColumnStore stores column values sequentially using a fixed length datatype between 1 and 8 bytes long. For string types longer than this, a separate Dictionary extent is created to store unique string values. The column extent stores pointers to the string within the Dictionary extent.

Because the system utilizes fixed length datatypes (within the primary extent), it is possible to map directly between columns belonging to the same row. For example, if we have row 234 in an extent for column ‘Name’, the query engine can easily read the value for row 234 in column ‘Amount’. This allows for efficient recreation of the required columns to form a query result row.
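
A minimal sketch of both ideas, fixed-width positional addressing and dictionary pointers, is shown below in Node.js. The class names and the in-memory layout are hypothetical and do not reflect ColumnStore's on-disk format.

```javascript
// Sketch of fixed-width columns plus a dictionary for long strings.
// Layout and names are hypothetical, not ColumnStore's on-disk format.
class FixedWidthColumn {
  constructor(widthBytes, rowCount) {
    // Note: Buffer's readUIntLE/writeUIntLE handle widths of 1 to 6 bytes.
    this.width = widthBytes;
    this.buffer = Buffer.alloc(widthBytes * rowCount);
  }
  // Row N always lives at byte offset N * width, so no per-row index is needed.
  write(row, value) {
    this.buffer.writeUIntLE(value, row * this.width, this.width);
  }
  read(row) {
    return this.buffer.readUIntLE(row * this.width, this.width);
  }
}

class DictionaryColumn {
  constructor(rowCount) {
    this.pointers = new FixedWidthColumn(4, rowCount); // fixed-width pointers
    this.strings = [];       // stand-in for the Dictionary extent
    this.lookup = new Map(); // string -> position, so each value is stored once
  }
  write(row, text) {
    if (!this.lookup.has(text)) {
      this.lookup.set(text, this.strings.length);
      this.strings.push(text);
    }
    this.pointers.write(row, this.lookup.get(text));
  }
  read(row) {
    return this.strings[this.pointers.read(row)];
  }
}

const nameColumn = new DictionaryColumn(1000);
const amountColumn = new FixedWidthColumn(4, 1000);
nameColumn.write(234, "Institute for Health Metrics");
amountColumn.write(234, 4200);
nameColumn.write(235, "Institute for Health Metrics"); // reuses the dictionary entry

// The same row number addresses the matching value in every column.
console.log(nameColumn.read(234), amountColumn.read(234), nameColumn.strings.length);
```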

By default, column and dictionary values are compressed within Extent storage. This trades off CPU for reduced I/O, which benefits query response time. MariaDB ColumnStore utilizes the Snappy library, which provides high decompression speeds with reasonable compression. Many columns have repeating or low cardinality values which will compress extremely well, in some cases up to 10x.

Segment files are physically managed within a DBRoot directory. A DBRoot encapsulates a physical storage unit and is assigned to one physical PM server at a point in time. A DBRoot contains Segment files containing Extents. In the installation directory, each “dataN” directory corresponds to the DBRoot identified by N. The system automatically distributes data to the available DBRoots across servers.

MariaDB ColumnStore allows use of internal (local) or external storage. If external storage (e.g., SAN, GlusterFS or EBS in AWS) is utilized then the system provides automated failover in a multi-node deployment should a server fail. This is possible because the failed server’s storage can be remounted by another server. With internal storage, automated failover is not possible since a given server’s data is not replicated or available to another server.

Extent Maps & Horizontal Partitioning

The system maintains a persistent distributed data structure called an Extent Map which provides necessary metadata on Extents. This includes tracking the minimum and maximum column values within that Extent. This allows MariaDB ColumnStore to provide a simple but effective horizontal partitioning scheme. At query time, the optimizer can eliminate reading Extents that fall outside of the WHERE clause predicate for that column, for example:


If a query is executed with a WHERE clause filter of “COL1 BETWEEN 220 and 250”, then the system can eliminate COL1 Extents 1, 2 and 4 from being scanned, saving ¾ of the I/O and many comparison operations. This can extend to multiple columns, for example, a query WHERE clause filter of “COL1 BETWEEN 220 AND 250 AND COL2 < 10000” can be fulfilled entirely from the Extent Map since no rows can ever match this query based on the minimum and maximum values across both COL1 and COL2.
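
Extent elimination boils down to a range-overlap test against the Extent Map. The sketch below uses invented minimum and maximum values for four extents, chosen so that the two filters discussed above behave as described; the `extentsToScan` helper is hypothetical and is not ColumnStore's optimizer.

```javascript
// Hypothetical Extent Map: per-extent min/max for two columns.
// The numbers are invented for illustration only.
const extentMap = [
  { extent: 1, COL1: { min: 1, max: 100 }, COL2: { min: 1, max: 5000 } },
  { extent: 2, COL1: { min: 101, max: 200 }, COL2: { min: 5001, max: 9999 } },
  { extent: 3, COL1: { min: 201, max: 300 }, COL2: { min: 10000, max: 15000 } },
  { extent: 4, COL1: { min: 301, max: 400 }, COL2: { min: 15001, max: 20000 } },
];

// A range predicate can only match an extent whose [min, max] overlaps it.
function overlaps(stats, low, high) {
  return stats.max >= low && stats.min <= high;
}

// predicates: { column: [low, high] }, all combined with AND.
function extentsToScan(map, predicates) {
  return map
    .filter((e) =>
      Object.entries(predicates).every(([col, [low, high]]) => overlaps(e[col], low, high))
    )
    .map((e) => e.extent);
}

// COL1 BETWEEN 220 AND 250: only extent 3 has to be read.
console.log(extentsToScan(extentMap, { COL1: [220, 250] }));

// COL1 BETWEEN 220 AND 250 AND COL2 < 10000 (integers): nothing to read.
console.log(extentsToScan(extentMap, { COL1: [220, 250], COL2: [-Infinity, 9999] }));
```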

Use cases where this works well are time series or semi-ordered date or time derived columns. For example, consider order tracking. Each order has an order date and a ship date. In most systems order creation will correspond to the order date and this value will increase with each inserted record. In general, the ship date will be based on item availability; therefore, it will not strictly increase with each record. However, ship dates are normally semi-ordered, i.e., a ship date will usually be within days to weeks of an order date. Ship dates will also form a natural partition with some possible overlap between Extents.

In addition, the system allows for bulk deletion by extent using Extent Map minimum and maximum values. In the order management case, the administrator can drop an entire extent using order date values. This allows for a simple information lifecycle management strategy by disabling or purging old lower value data to make room for current higher value data.
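
The same metadata can identify extents that are safe to drop as a whole. A small sketch with invented order dates follows; `droppableExtents` is a hypothetical helper, not a ColumnStore command.

```javascript
// Hypothetical per-extent min/max order dates (ISO date strings compare correctly as text).
const orderDateExtents = [
  { extent: 1, min: "2014-01-01", max: "2014-12-30" },
  { extent: 2, min: "2014-12-28", max: "2015-11-15" },
  { extent: 3, min: "2015-11-10", max: "2016-12-01" },
];

// An extent can be dropped whole only if every value in it is older than the cutoff,
// that is, if its maximum is below the cutoff.
function droppableExtents(extents, cutoffDate) {
  return extents.filter((e) => e.max < cutoffDate).map((e) => e.extent);
}

console.log(droppableExtents(orderDateExtents, "2015-01-01"));
```

Extent 2 contains a few orders from before the cutoff, but because its maximum is later it is kept; overlapping extents like this are why the approach works best on ordered or semi-ordered columns.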

In summary, the storage model for MariaDB ColumnStore provides the following benefits:

  • Columnar storage is optimized for analytical queries that access a subset of columns and a majority of rows.

  • Online schema changes can be made to ColumnStore tables with no table locking.

  • Disk I/O is optimized by utilizing contiguous block storage where possible, allowing for efficient I/O streaming. Data compression further reduces the amount of I/O required.

  • Storage is automatically distributed across nodes to provide optimal distributed query execution and scale-out capabilities.

  • Bulk data loading can be run in parallel with concurrent query access.

  • An automated vertical partitioning scheme allows for query optimization to avoid reading rows that can’t possibly be found in a given extent / range of columns.

  • Data can be bulk deleted online using the vertical partitioning scheme which simplifies data lifecycle management.

In subsequent blogs I will provide details on how MariaDB ColumnStore builds upon this storage architecture to provide high-speed data ingestion, scale-out query performance, and rich ANSI SQL support.


In this blog post, I will outline MariaDB ColumnStore's architecture, which has the capacity to handle large datasets and scale out across multiple nodes as your data grows.


by david_thompson_g at December 12, 2016 11:32 PM

Peter Zaitsev

Database Solutions Engineer FAQs

In this blog series, I will discuss common questions I receive as a database Solutions Engineer at Percona. In this role, I speak with a wide array of MySQL and MongoDB users responsible for everything from extremely large and complex environments to smaller single-server environments. Typically we are contacted when the customer is about to embark on an architecture migration or redesign, or they have performance issues in their production environment. The purpose of this blog is to put together a list of common questions I field while speaking with active MySQL and MongoDB users.

We are considering a migration to AWS. What solution is right for us: EC2, RDS, or Aurora?

We get this question a lot. Moving to AWS is a hot trend. Fellow Solution Engineer Rick Golba wrote a blog post dedicated to the specifics of each of these Amazon offerings, as well as the freedom you give up moving down the tiers. This is the primary concern when considering these cloud-based solutions. With Aurora, you give up a large amount of control of your database environment. With an EC2 deployment, you can keep most of it. However, there are other considerations to make.


The largest benefit to choosing one of these Amazon offerings is reducing the cost associated with managing a physical database environment. This does not eliminate the necessary task of right-sizing your environment. Doing so can make a huge difference in the yearly costs associated with acquiring a large Amazon instance. This can also open up options when it comes to choosing between EC2, RDS, and Aurora, as there are certain limitations and restrictions with regard to table size and total data size. Here is a quick reference:

  • Amazon RDS – 6 TB*
  • Aurora – 64 TB**
  • EC2 – Depends***

* Max table size from Amazon’s documentation.

** Max size of Aurora cluster volume.

*** There are too many options to list just one.

There are several strategies when it comes to right-sizing your environment. The first and easiest way is to archive old, unused data. Percona Toolkit offers a tool called pt-archiver that can assist with this process. This tool allows you to archive unused MySQL rows into other tables or a file. The documentation for pt-archiver is here. Another strategy used by large organizations is to employ different databases for different tasks. The advantage of this strategy is that you can use the right database for a specific use-case. The disadvantage is the overhead of having experts to manage each of these varying database types and instances. This requires a significant amount of engineering effort that is not suitable for smaller deployments.

Some people might ask, “Why right-size my environment?” Most of the time, all of that data is not needed in a production database. There is likely data that is never touched taking a significant amount of space. When you lower your datasize, more Amazon options become possible. In addition to this, the operational tasks associated with managing your database environment become easier. If you’ve managed to turn a bloated table into a more manageable one, you might see increased performance as well. This reduces costs when it comes to a cloud migration.


Amazon is compatible with most MySQL deployments, but there are some exceptions. Amazon Aurora is currently compatible with MySQL 5.6. If you are interested in MySQL 5.7 features such as storing data with the JSON datatype, then Aurora might not be the right option. For a full list of MySQL 5.7 features, see the MySQL documentation. Amazon RDS and EC2 are both compatible with MySQL 5.7. One limitation of RDS is that it is not compatible with MongoDB. Amazon does offer its own cloud-hosted NoSQL solution called DynamoDB, but migration is not as seamless as it is with Amazon’s MySQL offerings. The best option for migrating to the cloud with MongoDB is an EC2 instance.


Percona has assisted with Amazon optimizations and migrations for many customers through our consulting services. Our architects have in-depth knowledge of high-performing MySQL deployments in the cloud and can assist with both your design and implementation/migration. One example of this success is Wattpad. Through performance optimizations recommended by Percona, Wattpad was able to reduce the size of their Amazon instance and save money over the course of the year.

Can we replace our enterprise monitoring solution with Percona Monitoring and Management (PMM)?

As with most answers in the database world, the short answer is “it depends.” Percona Monitoring and Management (PMM) offers a robust array of monitoring features for your database environment and is perfectly capable of replacing certain features of enterprise-grade MySQL and MongoDB monitoring platforms. Here is a short list of what PMM brings to the table:

  • Free and Open Source. Our CEO Peter Zaitsev is dedicated to keeping this true. PMM uses existing open-source elements and integrates some of Percona’s own plugins to form a complete, robust monitoring solution.
  • MongoDB Integration. If you have both MySQL and MongoDB deployments, you can see metrics and query analysis for both in one place.
  • Remotely Monitor MySQL in the Cloud. PMM is compatible with RDS.
  • Visual Query Analysis. Quickly identify problem queries.
  • Fix and Find Expensive Queries. Analyze expensive queries without needing scripts or command line tools.
  • InnoDB Monitoring. Get in-depth stats on InnoDB metrics.
  • Disk Monitoring. Be aware of system level metrics in addition to MySQL and MongoDB metrics.
  • Cluster Monitor. The recent addition of Orchestrator to PMM added this functionality.
  • Replication Dashboard. Orchestrator can also show the status of replication in an intuitive GUI.

If the list above satisfies your monitoring needs, then you should definitely be using PMM. Our development team is actively working to enhance this product and appreciates input from the community using this solution. The PMM forums are a great place to ask questions or offer feedback/suggestions.

Is moving to a synchronous replication a solution for us?

At first glance, a synchronous replication solution seems to solve all of the limitations that come with a standard MySQL deployment. It brings with it loads of great features like high availability multi-master nodes, each capable of handling writes and read scaling. However, there are several things to consider when answering this question.

Will a simpler solution meet your needs?

One of Percona’s Technical Account Managers, Michael Patrick, wrote a fantastic blog concerning choosing an HA solution. Typically the reason for moving to a clustered solution is high availability. If you’ve been bitten by downtime due to a failed master and a slow transition to a slave, moving to a cluster could be a knee-jerk reaction. However, solutions like MHA or MySQL Orchestrator might ease these pains sufficiently while adding little complexity to the environment.

Is your application and database design compatible with a clustered solution?

You must make some application-based considerations when moving to a clustered solution. One consideration is storage engine limitations with clustered solutions. Percona XtraDB Cluster and MariaDB Cluster both require InnoDB. MySQL Cluster requires the NDB storage engine. By committing to a clustered solution, other storage engine options become unavailable.

Another application consideration is how clustered solutions handle synchronous write set replication. If your application has write hot-spots, deadlocks will occur given simultaneous write transactions. There are ways of dealing with these, such as re-engineering the database structure to remove the hotspot or allowing the application layer to retry these transactions. If neither of these is an option, a clustered solution might not fit your environment.
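
As a rough illustration of the "retry in the application layer" option, the sketch below wraps a transaction in a retry loop with jittered exponential backoff. It is a minimal sketch under stated assumptions: `DeadlockError` is a stand-in for whatever deadlock or certification-failure error your database driver actually raises, and `run_with_retry` is a hypothetical helper, not part of any Percona or MySQL library.

```python
import random
import time


class DeadlockError(Exception):
    """Stand-in for the driver-specific deadlock error (assumption)."""


def run_with_retry(transaction, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run transaction() and retry it with jittered exponential backoff on deadlock."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # Wait a little longer after each failed attempt, plus random jitter
            # so that competing writers do not retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_update():
        # Simulates a write that deadlocks twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise DeadlockError("simulated deadlock")
        return "committed"

    print(run_with_retry(flaky_update, sleep=lambda _: None))
```

The whole transaction has to be re-run from the start on each attempt, so this approach only works when the application can safely replay it.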

Is your database spread across multiple geographic regions?

You can deploy cluster solutions across WAN environments. However, these solutions come with latency issues. If your application is capable of enduring longer flight times due to a cluster being spread across multiple geographic regions, this will not be a problem. However, if this delay is not tolerable, a WAN cluster might not be the right solution. There are multiple strategies for alleviating this pain point when deploying a cluster across WAN environments; a webinar given by Percona XtraDB Cluster’s Lead Software Engineer, Krunal Bauskar, covers this topic. One example is asynchronous replication between geographic regions with clusters in each. The benefit of this is that the cluster in each geographic region will have eliminated the WAN latency delay. The downside is the addition of many more nodes (likely three for each data center). This solution also complicates the environment.

Closing Thoughts

I plan to continue this blog series with more frequently asked questions that I receive when talking to MySQL and MongoDB users. If you would like to speak with an account representative (or me!) to see how Percona can help you meet your database performance needs, feel free to reach out.

by Barrett Chambers at December 12, 2016 11:32 PM

Oli Sennhauser

MySQL and MariaDB variables inflation

MySQL is well known and widely spread because of its philosophy of Keep it Simple (KISS).

We recently had a discussion about how, with newer releases, the MySQL and MariaDB relational databases are also becoming more and more complicated.

One indication for this trend is the number of MySQL server system variables and status variables.

In the following tables and graphs we compare the different releases since MySQL version 4.0:

mysql> SHOW GLOBAL STATUS LIKE 'innodb%';

Version          System  IB Sys.  Status  IB Stat.
MySQL 4.0.30        143       22    *133       **0
MySQL 4.1.25        189       26    *164       **0
MySQL 5.0.96        239       36     252        42
MySQL 5.1.73        277       36     291        42
MySQL 5.5.51        317       60     312        47
MySQL 5.6.31        438      120     341        51
MySQL 5.7.15        491      131     353        51
MySQL 8.0.0         488      124     363        51

* Use SHOW STATUS instead.


Version               System  IB Sys.  Status  IB Stat.
MariaDB 5.1.44           354       72     301        44
MariaDB 5.2.10           397       86     324        46
MariaDB 5.5.41           419      103     413        99
MariaDB 10.0.21          537      147     455        95
MariaDB 10.1.18***       589      178     517       127
MariaDB 10.2.2****       586      164     481        96

*** XtraDB 5.6
****InnoDB 5.7.14???



by Shinguz at December 12, 2016 08:43 PM

Peter Zaitsev

Percona Monitoring and Management 1.0.7 release

Percona announces the release of Percona Monitoring and Management 1.0.7.

The Percona Monitoring and Management (PMM) Server is distributed through Docker Hub, and PMM Client through tarball or system packages. The instructions for installing or upgrading PMM are available in the documentation.

PMM Server changelog

  • Added new widgets and graphs to “PXC/Galera Graphs” dashboard.
  • Fixed hostgroup filtering for ProxySQL dashboard.
  • Various fixes to MongoDB dashboards.
  • Enabled HTTPS/TLS and basic authentication support on Prometheus targets.
  • Fixed potential error with too many connections on Query Analytics API.
  • Grafana 4.0.2 with an alerting engine.
  • Prometheus 1.4.1.
  • Consul 0.7.1 with snapshot/restore feature.
  • Orchestrator 2.0.1.

PMM Client changelog

  • Automatically generate self-signed SSL certificate to protect metric services with HTTPS/TLS by default (requires re-adding services, see “check-network” output).
  • Enable HTTP basic auth for metric services when defined on PMM server and configured on a client to achieve client-side protection (requires re-adding services, see “check-network” output).
  • Added --bind-address flag to support running PMM server and client on different networks. By default, this address is the same as the client one. When running PMM on different networks, --client-address should be set to the remote (public) address and --bind-address to the local (private) address. This also assumes you configure NAT and port forwarding between those addresses.
  • Added “show-passwords” command to display the current HTTP auth credentials and password of the last created user on MySQL (useful for PMM installation on replication setup).
  • Do not pass MongoDB connection string in command-line arguments and hide the password from the process list (requires re-adding mongodb:metrics service).
  • Do not listen on a network port in the mysql:queries service (percona-qan-agent process), as there is no need for it.
  • Fixed slow log rotation for mysql:queries service for MySQL 5.1.
  • Expose PXC/Galera gcache size as a metric.
  • Use terminal color instead of emoji on “check-network” output and also “list” one.
  • Amended output of systemv service status if run adhoc (requires re-adding services).

To see a live demo, please visit

We welcome your feedback and questions on our PMM forum.

About Percona Monitoring and Management
Percona Monitoring and Management is an open-source platform for managing and monitoring MySQL and MongoDB performance. It is developed by Percona in collaboration with experts in the field of managed database services, support and consulting.

PMM is a free and open-source solution that you can run in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

by Roman Vynar at December 12, 2016 04:54 PM

December 09, 2016

Peter Zaitsev

Percona Monitoring Plugins 1.1.7 release

Percona announces the release of Percona Monitoring Plugins 1.1.7.


  • New Nagios script for MongoDB.
  • Added MySQL socket and flag options to Cacti PHP script.
  • Added disk volume check on “Mounted on” in addition to “Filesystem” to Cacti PHP script to allow monitoring of tmpfs mounts.
  • Allow delayed slave to have SQL thread stopped on pmp-check-mysql-replication-delay check.
  • Fix for --unconfigured flag of pmp-check-mysql-replication-delay.
  • Fix for max_duration check of pmp-check-mysql-innodb when system and MySQL timezones mismatch.
  • Fix rare nrpe broken pipe error on pmp-check-unix-memory check.
  • Updated package spec files.

A new tarball is available from downloads area or in packages from our software repositories. The plugins are fully supported for customers with a Percona Support contract and free installation services are provided as part of some contracts. You can find links to the documentation, forums and more at the project homepage.

About Percona Monitoring Plugins
Percona Monitoring Plugins are monitoring and graphing components designed to integrate seamlessly with widely deployed solutions such as Nagios, Cacti and Zabbix.

by Roman Vynar at December 09, 2016 07:03 PM

Percona Toolkit 2.2.20 is now available

Percona announces the availability of Percona Toolkit 2.2.20. Released December 9, 2016, Percona Toolkit is a collection of advanced command-line tools that perform a variety of MySQL server and system tasks that DBAs find too difficult or complex to perform manually. Percona Toolkit, like all Percona software, is free and open source.

This release is the current GA (Generally Available) stable release in the 2.2 series. Downloads are available here and from the Percona Software Repositories.

New Features:
  • 1636068: New --pause-file option has been implemented for pt-online-schema-change. When used pt-online-schema-change will pause while the specified file exists.
  • 1638293 and 1642364: pt-online-schema-change now supports adding and removing the DATA DIRECTORY to a new table with the --data-dir and --remove-data-dir options.
  • 1642994: Following schemas/tables have been added to the default ignore list: mysql.gtid_execution, sys.sys_config, mysql.proc, mysql.inventory, mysql.plugin, percona.* (including checksums, DSNs table), test.*, and percona_schema.*
  • 1643940: pt-summary now provides information about Transparent huge pages.
  • 1604834: New --preserve-embedded-numbers option was implemented for pt-query-digest which can be used to preserve numbers in database/table names when fingerprinting queries.
Bugs Fixed:
  • 1613915: pt-online-schema-change could miss the data due to the way ENUM values are sorted.
  • 1625005: pt-online-schema-change didn’t apply underscores to foreign keys individually.
  • 1566556: pt-show-grants didn’t work correctly with MariaDB 10 (Daniël van Eeden).
  • 1634900: pt-upgrade would fail when log contained SELECT...INTO queries.
  • 1639052: pt-table-checksum now automatically excludes checking schemas named percona and percona_schema which aren’t consistent across the replication hierarchy.
  • 1635734: pt-slave-restart --config did not recognize = as a separator.
  • 1362942: pt-slave-restart would fail on MariaDB 10.0.13.

Find release details in the release notes and the 2.2.20 milestone at Launchpad. Report bugs on the Percona Toolkit Launchpad bug tracker.

by Hrvoje Matijakovic at December 09, 2016 06:32 PM

Jean-Jerome Schmidt

Planets9s - On stable MySQL Replication setups, running MySQL on Docker and cloud lockin

Welcome to this week’s Planets9s, covering all the latest resources and technologies we create around automation and management of open source database infrastructures.

Watch the replay: how to build a stable MySQL Replication environment

Thanks to everyone who participated in this week’s webinar on building production-ready MySQL Replication environments. Krzysztof Książek, Senior Support Engineer at Severalnines, shared his top 9 tips on that topic with sanity checks before migrating into MySQL replication setup, operating system configuration, replication, backup, provisioning, performance, schema changes, reporting and disaster recovery. If you'd like to learn how to build a stable environment with MySQL replication, then watch this webinar replay.

Watch the replay

MySQL on Docker: Deploy a Homogeneous Galera Cluster with etcd

As you might know, we’ve been on a journey in the past months exploring how to make Galera Cluster run smoothly on Docker containers and that journey continues. Deploying Galera Cluster on Docker is tricky when using orchestration tools. Due to the nature of the scheduler in container orchestration tools and the assumption of homogenous images, the scheduler will just fire the respective containers according to the run command and leave the bootstrapping process to the container’s entrypoint logic when starting up. And you do not want to do that for Galera … This blog post discusses why and provides insight into how to deploy a homogeneous Galera Cluster with etcd.

Read the blog

About cloud lock-in and open source databases

If you haven’t read it yet, do check out the editorial our CEO Vinay Joosery recently published on the importance of avoiding cloud database lock-in. The cloud is no longer a question of if, but of when. Many IT leaders, however, find that one consistent barrier to their adoption of the cloud is vendor lock-in. What do you do when you are forced to stay with a provider that no longer meets your needs? And is cloud lock-in a problem? Find out what Vinay’s thoughts are on these topics.

Read the editorial

That’s it for this week! Feel free to share these resources with your colleagues and follow us in our social media channels.

Have a good end of the week,

Jean-Jérôme Schmidt
Planets9s Editor
Severalnines AB

by Severalnines at December 09, 2016 10:42 AM

December 08, 2016

Peter Zaitsev

Tuning Linux for MongoDB: Automated Tuning on Redhat and CentOS


In a previous blog post: “Tuning Linux for MongoDB,” I covered several tunings for an efficient MongoDB deployment on Linux in Production. This post expands on that one.

While I felt “Tuning Linux for MongoDB” was a very useful blog post that results in a great baseline tuning, something bugged me about how much effort and how many touch-points were required to achieve an efficient Linux installation for MongoDB. More importantly, I noticed some cases where the tunings (example: changes to disk I/O scheduler in /etc/udev.d) were ignored on some recent RedHat and CentOS versions. With these issues in mind, I started to investigate better solutions for achieving the tuned baseline.


In RedHat (and thus CentOS) 7.0, a daemon called “tuned” was introduced as a unified system for applying tunings to Linux. tuned operates with simple, file-based tuning “profiles” and provides an admin command-line interface named “tuned-adm” for applying, listing and even recommending tuned profiles.

Some operational benefits of tuned:

  • File-based configuration – Profile tunings are contained in simple, consolidated files
  • Swappable profiles – Profiles are easily changed back/forth
  • Standards compliance – Using tuned profiles ensures tunings are not overridden or ignored

Note: If you use configuration management systems like Puppet, Chef, Salt, Ansible, etc., I suggest you configure those systems to deploy tunings via tuned profiles instead of applying tunings directly, as tuned will likely start to fight this automation, overriding the changes.

The default available tuned profiles (as of RedHat 7.2.1511) are:

  • balanced
  • desktop
  • latency-performance
  • network-latency
  • network-throughput
  • powersave
  • throughput-performance
  • virtual-guest
  • virtual-host

The profiles that are generally interesting for database usage are:

  • latency-performance

    “A server profile for typical latency performance tuning. This profile disables dynamic tuning mechanisms and transparent hugepages. It uses the performance governer for p-states through cpuspeed, and sets the I/O scheduler to deadline.”

  • throughput-performance

    “A server profile for typical throughput performance tuning. It disables tuned and ktune power saving mechanisms, enables sysctl settings that improve the throughput performance of your disk and network I/O, and switches to the deadline scheduler. CPU governor is set to performance.”

  • network-latency – Includes “latency-performance,” disables transparent_hugepages, disables NUMA balancing and enables some latency-based network tunings.
  • network-throughput – Includes “throughput-performance” and increases network stack buffer sizes.

I find “network-latency” is the closest match to our recommended tunings, but some additional changes are still required.

The good news is tuned was designed to be flexible, so I decided to make a MongoDB-specific profile: enter “tuned-percona-mongodb”.



“tuned-percona-mongodb” is a performance-focused tuned profile for MongoDB on Linux, and is currently considered experimental (no guarantees/warranties). It’s hosted in our Percona-Lab Github repo.

tuned-percona-mongodb applies the following tunings (from the previous tuning article) on a Redhat/CentOS 7+ host:

  • Disabling of transparent huge pages
  • Kernel network tunings (sysctls)
  • Virtual memory dirty ratio changes (sysctls)
  • Virtual memory “swappiness” (sysctls)
  • Block-device readahead settings (on all disks except /dev/sda by default)
  • Block-device I/O scheduler (on all disks except /dev/sda by default)

The following tunings that our previous tuning article didn’t cover are also applied:

After a successful deployment of this profile, only these recommendations are outstanding:

  1. Filesystem type and mount options:
    Tuned does not handle filesystem mount options; these need to be set manually in /etc/fstab. To quickly summarize: we recommend the XFS or EXT4 filesystem type for MongoDB data when using MMAPv1 or RocksDB storage engines, and XFS ONLY when using WiredTiger. For all filesystems, using the mount options “rw,noatime” will reduce some activity.
  2. NUMA disabling or interleaving:
    Tuned does not handle NUMA settings and these still need to be handled via the MongoDB init script or the BIOS on/off switch.
  3. Linux ulimits:
    Tuned does not set Linux ulimit settings. However, Percona Server for MongoDB RPM packages do this for you at startup! See “LimitNOFILE” and “LimitNPROC” in “/usr/lib/systemd/system/mongod.service” for more information.
  4. NTP server:
    Tuned does not handle installation of RPM packages or enabling of services. You will need to install the “ntp” package and enable/start the “ntpd” service manually:

    sudo yum install ntp
    sudo systemctl enable ntpd
    sudo systemctl start ntpd

tuned-percona-mongodb: Installation

The installation of this profile is as simple as checking out the repository with a “git” command and then running “sudo make enable”. The full output is shown here:

$ git clone
$ cd tuned-percona-mongodb
$ sudo make enable
if [ -d /etc/tuned ]; then
	cp -dpR percona-mongodb /etc/tuned/percona-mongodb;
	echo "### 'tuned-percona-mongodb' is installed. Enable with 'make enable'.";
else
	echo "### ERROR: cannot find tuned config dir at /etc/tuned!";
	exit 1;
fi
### 'tuned-percona-mongodb' is installed. Enable with 'make enable'.
tuned-adm profile percona-mongodb
tuned-adm active
Current active profile: percona-mongodb

In the example above you can see “percona-mongodb” is now the active tuned profile on the system (mentioned on the last output line).

The tuned profile files are installed to “/etc/tuned/percona-mongodb”, as seen here:

$ ls -alh /etc/tuned/percona-mongodb/*.*
-rwxrwxr-x. 1 root root   677 Nov 22 20:00
-rw-rw-r--. 1 root root  1.4K Nov 22 20:00 tuned.conf

Let’s check that the “deadline” I/O scheduler is now the current scheduler on any disk that isn’t /dev/sda (“sdb” used below):

$ cat /sys/block/sdb/queue/scheduler
noop [deadline] cfq

Transparent huge pages should be disabled (it is!):

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
$ cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]

Block-device readahead should be 32 (16kb) on /dev/sdb (looks good!):

$ blockdev --getra /dev/sdb

That was easy!
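
The readahead figures quoted in this post (32 for the tuned profile, 256 for the default) are counts of 512-byte sectors, which is the unit “blockdev --getra” reports. The small Python sketch below just spells out that conversion; `readahead_kib` is a hypothetical helper name used for illustration.

```python
SECTOR_BYTES = 512  # blockdev --getra reports read-ahead in 512-byte sectors


def readahead_kib(sectors: int) -> float:
    """Convert a `blockdev --getra` value (in sectors) to KiB."""
    return sectors * SECTOR_BYTES / 1024


if __name__ == "__main__":
    for sectors in (32, 256):
        print(sectors, "sectors =", readahead_kib(sectors), "KiB")
```

This gives 16 KiB for a setting of 32 and 128 KiB for a setting of 256, matching the values in parentheses in the text.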

tuned-percona-mongodb: Uninstallation

To uninstall the profile, run “sudo make uninstall” in the github checkout directory:

$ sudo make uninstall
if [ -d /etc/tuned/percona-mongodb ]; then
	echo "### Disabling tuned profile 'tuned-percona-mongodb'";
	echo "### Changing tuned profile to 'latency-performance', adjust if necessary after!";
	tuned-adm profile latency-performance;
	tuned-adm active;
else
	echo "tuned-percona-mongodb profile not installed!";
fi
### Disabling tuned profile 'tuned-percona-mongodb'
### Changing tuned profile to 'latency-performance', adjust if necessary after!
Current active profile: latency-performance
if [ -d /etc/tuned/percona-mongodb ]; then rm -rf /etc/tuned/percona-mongodb; fi

Note: the uninstallation will enable the “latency-performance” tuned profile. Change this after the uninstall if needed.

To confirm the uninstallation, let’s check if the block-device readahead is set back to default (256/128kb):

$ sudo blockdev --getra /dev/sdb

Uninstall complete.


So far tuned shows a lot of promise for tuning Linux for MongoDB, providing a single, consistent interface for tuning the Linux operating system. In the future, I would like to see the documentation for tuned improve. However, its simplicity means documentation is rarely needed.

As mentioned, after applying “tuned-percona-mongodb” you still need to configure an NTP server, NUMA (in some cases) and the filesystem type+tunings manually. The majority of the time, effort and room for mistakes is greatly reduced using this method.

If you have any issues with this profile for tuning Linux for MongoDB, or have any questions, please create a Github issue at this URL:


by Tim Vaillancourt at December 08, 2016 10:34 PM

December 07, 2016

Peter Zaitsev

First MongoDB replica-set Configuration for MySQL DBAs

Replica-Set Configuration

In this blog post, we will work on the first replica-set configuration for MySQL DBAs. We will map as many names as possible and compare how the databases work.

Replica-sets are the most common MongoDB deployment nowadays. One of the most frequent questions is: How do you deploy a replica-set? In this blog, the setup we’ll use compares the MongoDB replica-set to a standard MySQL master-slave replication not using GTID.

Replica-Set configuration

The replica-set usually consists of 3+ instances on different hosts that communicate with each other through both dedicated connections and heartbeat packets. The latter check the other instances’ health in order to maintain the high availability of the replica-set. The names are slightly different: “primary” corresponds to “master” in MySQL, and “secondary” corresponds to “slave.” MongoDB only supports a single primary, unlike MySQL, which can have more than one master depending on how you set it up.


Unlike MySQL, MongoDB does not use files (such as binary log or relay log files) to replicate. All the statements that should be replicated are stored in the oplog collection. This collection is a capped collection, which means it handles a limited number of documents. Therefore, when it becomes full, new content replaces old documents. The amount of data that the oplog can keep is called the “oplog window,” and it is measured in seconds. If a secondary node is delayed for longer than the oplog can handle, a new initial sync is needed. The same happens in MySQL when a slave tries to read binary logs that have been deleted.
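
The capped-collection behavior and the resulting window can be modeled with a toy example. This is only a sketch under simplifying assumptions: the capacity is counted in entries (as described above) rather than bytes, timestamps are plain integers in seconds, and `CappedLog` is a hypothetical class, not a MongoDB API.

```python
from collections import deque


class CappedLog:
    """Toy model of a capped collection: fixed capacity, oldest entries dropped first."""

    def __init__(self, max_entries: int):
        # deque(maxlen=...) silently discards the oldest item when full.
        self._entries = deque(maxlen=max_entries)

    def append(self, timestamp: int, operation: str) -> None:
        self._entries.append((timestamp, operation))

    def window_seconds(self) -> int:
        """Time span covered by the entries still held in the log."""
        if not self._entries:
            return 0
        return self._entries[-1][0] - self._entries[0][0]

    def can_resume_from(self, last_applied_ts: int) -> bool:
        """A lagging reader can catch up only if its last applied entry is still present."""
        return bool(self._entries) and last_applied_ts >= self._entries[0][0]


if __name__ == "__main__":
    log = CappedLog(max_entries=3)
    for ts, op in [(100, "insert"), (110, "update"), (120, "delete"), (130, "insert")]:
        log.append(ts, op)
    print(log.window_seconds())      # span between oldest and newest retained entry
    print(log.can_resume_from(100))  # entry at ts=100 was already overwritten
    print(log.can_resume_from(120))  # entry at ts=120 is still available
```

In this model, a reader that last applied the entry at timestamp 100 has fallen outside the window and would need a full resync, which mirrors the initial sync case described above.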

When the replica-set is initialized, all the inserts, updates and deletes are saved in the oplog collection of a database called “local.” The replica-set initialization can be compared to enabling bin logs in the MySQL configuration.

Now let’s point out the most important differences between such databases: the way they handle replication, and how they keep high availability.

For a standard MySQL replication we need to enable the binlog in the config file, perform a backup, note the binlog position, restore this backup on a server with a different server id, and finally start the slave thread on the slave. In MongoDB, on the other hand, you only need a primary that has been previously configured with the replica-set name parameter, and then add the new secondaries with the same parameter. No backup needed, no restore needed, no oplog position needed.

Unlike MySQL, MongoDB is capable of electing a new primary when the primary fails. This process is called election, and each instance will vote for a new primary based on how up-to-date they are, without human intervention. This is why at least three instances are necessary for a reliable production replica-set. The election is based on votes, and for a secondary to become primary it needs the majority of votes – at least two out of three votes/boxes are required. We can also have an arbiter dedicated to voting only – it does not handle any data, but only decides which secondary should receive a vote. Most drivers are capable of following a change of primary, provided we pass the replica-set name in the connection string; with this information, drivers map primary and secondary on the fly.
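
The “majority of votes” rule is simple arithmetic, and a quick sketch shows why three instances is the practical minimum. The function names here are hypothetical and this models only the vote count, not MongoDB’s actual election protocol.

```python
def votes_needed(voting_members: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, reachable_members: int) -> bool:
    """True if the members that can still see each other hold a majority."""
    return reachable_members >= votes_needed(voting_members)


if __name__ == "__main__":
    for total in (2, 3, 5):
        # Check what happens after a single member is lost.
        print(total, "members: need", votes_needed(total),
              "votes; survives one failure:", can_elect_primary(total, total - 1))
```

With two voting members, losing one leaves a single vote out of the two required, so no primary can be elected. With three, the two survivors still form a majority.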


Note: There are a few tools capable of emulating this behavior in MySQL. One example is:

Maintaining Replica-sets

After deploying a replica-set, we should monitor it. There are a couple of commands that identify not only the available hosts, but also the replication status. They can also be used to change the replication configuration.

The command rs.status() will show all the details of the replication, such as the replica-set name, all the hosts that belong to this replica-set, and their status. This command is similar to “show slave hosts” in MySQL.

In addition, the command rs.printSlaveReplicationInfo() shows how delayed the secondaries are. It can be compared to “show slave status” in MySQL.

Replica-sets can be managed online with the command rs.config(). Passing the replica-set name as a parameter to the mongod process, or in the config file, is the only necessary action to start a replica-set. All the other settings can be managed online afterwards.

Step-by-Step How to Start Your First Replica-Set:

Follow the instructions below to start testing a replica-set with three nodes, using all the commands we’ve talked about.

For a production installation, please follow instructions on how to use our repositories here.

Download Percona Server for MongoDB:

$ cd ~
tar -xvzf percona-server-mongodb-3.2.10-3.0-trusty-x86_64.tar.gz
mv percona-server-mongodb-3.2.10-3.0 mongodb

Create folders:

cd mongodb/bin
mkdir data1 data2 data3

Generate the config files:

(This is a simple config file, and almost all parameters are the default, so please edit the database directory first.)

# Write config1.cfg to config3.cfg, one per instance, on ports 27017 to 27019.
for i in {1..3}; do
echo 'storage:
 dbPath: "'$(pwd)'/data'$i'"
systemLog:
 destination: file
 path: "'$(pwd)'/data'$i'/mongodb.log"
 logAppend: true
processManagement:
 fork: true
net:
 port: '$(( 27017 + $i -1 ))'
replication:
 replSetName: "rs01"' > config$i.cfg; done

Start the MongoDB instances:

  •  Before initializing any MongoDB instance, confirm if the config files exist:

percona@mongo32:~/mongodb/bin$ ls -lah *.cfg

  • Then start mongod process and repeat for the others:

percona@mongo32:~/mongodb/bin$ ./mongod -f config1.cfg
2016-11-10T16:56:12.854-0200 I STORAGE  [main] Counters: 0
2016-11-10T16:56:12.855-0200 I STORAGE  [main] Use SingleDelete in index: 0
about to fork child process, waiting until server is ready for connections.
forked process: 1263
child process started successfully, parent exiting
percona@mongo32:~/mongodb/bin$ ./mongod -f config2.cfg
2016-11-10T16:56:21.992-0200 I STORAGE  [main] Counters: 0
2016-11-10T16:56:21.993-0200 I STORAGE  [main] Use SingleDelete in index: 0
about to fork child process, waiting until server is ready for connections.
forked process: 1287
child process started successfully, parent exiting
percona@mongo32:~/mongodb/bin$ ./mongod -f config3.cfg
2016-11-10T16:56:24.250-0200 I STORAGE  [main] Counters: 0
2016-11-10T16:56:24.250-0200 I STORAGE  [main] Use SingleDelete in index: 0
about to fork child process, waiting until server is ready for connections.
forked process: 1310
child process started successfully, parent exiting

Initializing a replica-set:

  • Connect to the first MongoDB:

$ ./mongo
> rs.initiate()
{
 "info2" : "no configuration specified. Using a default configuration for the set",
 "me" : "mongo32:27017",
 "ok" : 1
}

  • Add a new member

rs01:PRIMARY> rs.add('mongo32:27018') // replace with your hostname; localhost is not allowed.
{ "ok" : 1 }
rs01:PRIMARY> rs.add('mongo32:27019')
{ "ok" : 1 }
rs01:PRIMARY> rs.status()
{
   "set" : "rs01",
   "date" : ISODate("2016-11-10T19:40:08.190Z"),
   "myState" : 1,
   "term" : NumberLong(1),
   "heartbeatIntervalMillis" : NumberLong(2000),
   "members" : [
      {
         "_id" : 0,
         "name" : "mongo32:27017",
         "health" : 1,
         "state" : 1,
         "stateStr" : "PRIMARY",
         "uptime" : 2636,
         "optime" : {
            "ts" : Timestamp(1478806805, 1),
            "t" : NumberLong(1)
         },
         "optimeDate" : ISODate("2016-11-10T19:40:05Z"),
         "electionTime" : Timestamp(1478804218, 2),
         "electionDate" : ISODate("2016-11-10T18:56:58Z"),
         "configVersion" : 3,
         "self" : true
      },
      {
         "_id" : 1,
         "name" : "mongo32:27018",
         "health" : 1,
         "state" : 2,
         "stateStr" : "SECONDARY",
         "uptime" : 44,
         "optime" : {
            "ts" : Timestamp(1478806805, 1),
            "t" : NumberLong(1)
         },
         "optimeDate" : ISODate("2016-11-10T19:40:05Z"),
         "lastHeartbeat" : ISODate("2016-11-10T19:40:07.129Z"),
         "lastHeartbeatRecv" : ISODate("2016-11-10T19:40:05.132Z"),
         "pingMs" : NumberLong(0),
         "syncingTo" : "mongo32:27017",
         "configVersion" : 3
      },
      {
         "_id" : 2,
         "name" : "mongo32:27019",
         "health" : 1,
         "state" : 2,
         "stateStr" : "SECONDARY",
         "uptime" : 3,
         "optime" : {
            "ts" : Timestamp(1478806805, 1),
            "t" : NumberLong(1)
         },
         "optimeDate" : ISODate("2016-11-10T19:40:05Z"),
         "lastHeartbeat" : ISODate("2016-11-10T19:40:07.130Z"),
         "lastHeartbeatRecv" : ISODate("2016-11-10T19:40:06.239Z"),
         "pingMs" : NumberLong(0),
         "configVersion" : 3
      }
   ],
   "ok" : 1
}

  • Check replication lag:

$ mongo
rs01:PRIMARY> rs.printSlaveReplicationInfo()
source: mongo32:27018
syncedTo: Thu Nov 10 2016 17:40:05 GMT-0200 (BRST)
0 secs (0 hrs) behind the primary
source: mongo32:27019
syncedTo: Thu Nov 10 2016 17:40:05 GMT-0200 (BRST)
0 secs (0 hrs) behind the primary

  • Start an election:

rs01:PRIMARY> rs.stepDown()
2016-11-10T17:41:27.271-0200 E QUERY [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host '':
2016-11-10T17:41:27.274-0200 I NETWORK [thread1] trying reconnect to ( failed
2016-11-10T17:41:27.275-0200 I NETWORK [thread1] reconnect ( ok
rs01:SECONDARY> rs.status()
{
   "set" : "rs01",
   "date" : ISODate("2016-11-10T19:41:39.280Z"),
   "myState" : 2,
   "term" : NumberLong(2),
   "heartbeatIntervalMillis" : NumberLong(2000),
   "members" : [
      {
         "_id" : 0,
         "name" : "mongo32:27017",
         "health" : 1,
         "state" : 2,
         "stateStr" : "SECONDARY",
         "uptime" : 2727,
         "optime" : {
            "ts" : Timestamp(1478806805, 1),
            "t" : NumberLong(1)
         },
         "optimeDate" : ISODate("2016-11-10T19:40:05Z"),
         "configVersion" : 3,
         "self" : true
      },
      {
         "_id" : 1,
         "name" : "mongo32:27018",
         "health" : 1,
         "state" : 2,
         "stateStr" : "SECONDARY",
         "uptime" : 135,
         "optime" : {
            "ts" : Timestamp(1478806805, 1),
            "t" : NumberLong(1)
         },
         "optimeDate" : ISODate("2016-11-10T19:40:05Z"),
         "lastHeartbeat" : ISODate("2016-11-10T19:41:37.155Z"),
         "lastHeartbeatRecv" : ISODate("2016-11-10T19:41:37.155Z"),
         "pingMs" : NumberLong(0),
         "configVersion" : 3
      },
      {
         "_id" : 2,
         "name" : "mongo32:27019",
         "health" : 1,
         "state" : 1,
         "stateStr" : "PRIMARY",
         "uptime" : 94,
         "optime" : {
            "ts" : Timestamp(1478806897, 1),
            "t" : NumberLong(2)
         },
         "optimeDate" : ISODate("2016-11-10T19:41:37Z"),
         "lastHeartbeat" : ISODate("2016-11-10T19:41:39.151Z"),
         "lastHeartbeatRecv" : ISODate("2016-11-10T19:41:38.354Z"),
         "pingMs" : NumberLong(0),
         "electionTime" : Timestamp(1478806896, 1),
         "electionDate" : ISODate("2016-11-10T19:41:36Z"),
         "configVersion" : 3
      }
   ],
   "ok" : 1
}
rs01:SECONDARY> exit

Shut down instances:

$ killall mongod

Hopefully, this was helpful. Please post any questions in the comments section.

by Adamo Tonete at December 07, 2016 11:23 PM

Percona Server for MongoDB 3.2.11-3.1 is now available

Percona announces the release of Percona Server for MongoDB 3.2.11-3.1 on December 7, 2016. Download the latest version from the Percona web site or the Percona Software Repositories.

Percona Server for MongoDB 3.2.11-3.1 is an enhanced, open-source, fully compatible, highly scalable, zero-maintenance downtime database supporting the MongoDB v3.2 protocol and drivers. It extends MongoDB with MongoRocks, Percona Memory Engine, and PerconaFT storage engine, as well as enterprise-grade features like external authentication and audit logging at no extra cost. Percona Server for MongoDB requires no changes to MongoDB applications or code.

NOTE: We deprecated the PerconaFT storage engine. It will not be available in future releases.

This release is based on MongoDB 3.2.11 and includes the following additional fixes:

  • PSMDB-93: Fixed hang during shutdown of mongod when started with the --storageEngine=PerconaFT and --nojournal options
  • PSMDB-92: Added Hot Backup to Ubuntu/Debian packages
  • PSMDB-83: Updated default configuration file to include recommended settings templates for various storage engines
  • Added support for Ubuntu 16.10 (Yakkety Yak)
  • Added binary tarballs for Ubuntu 16.04 LTS (Xenial Xerus)

The release notes are available in the official documentation.


by Alexey Zhebel at December 07, 2016 05:07 PM

MariaDB AB

Facebook MyRocks at MariaDB

Facebook MyRocks at MariaDB spetrunia Wed, 12/07/2016 - 08:58

Recently my colleague Rasmus Johansson announced that MariaDB is adding support for the Facebook MyRocks storage engine. Today I’m going to share a bit more on what that means for MariaDB users. Members of the Facebook Database Engineering team helped us answer some questions we think our community will have about MyRocks.

Benefits of MariaDB Server’s Extensible Architecture
Before discussing specifics of MyRocks, new readers may benefit from a description of MariaDB Server architecture, which is extensible at every layer including the storage layer. This means users and the community can add functionality to meet unique needs. Community contributions are one of MariaDB’s greatest advantages over other databases, and a big reason for us becoming the fastest growing open source database in the marketplace.

Openness in the storage layer is especially important because being able to use the right storage engine for the right use case ensures better performance optimization. Both MySQL and MariaDB support InnoDB - a well known, general purpose storage engine. But InnoDB is not suited to every use case, so the MariaDB engineering team is extending support for additional storage engines, including Facebook’s MyRocks for workloads requiring greater compression and IO efficiency, and MariaDB ColumnStore (currently in beta), which will provide faster time-to-insight with Massively Parallel Execution (MPP).

Facebook MyRocks for MariaDB
When searching for a storage engine that could give greater performance for web scale type applications, MyRocks was an obvious choice because of its superior handling of data compression and IO efficiency. Besides that, its LSM architecture allows for very efficient data ingestion, like read-free replication slaves, or fast bulk data loading.

As we add support for new storage engines, many of our current users may ask, “What happens to MariaDB’s support for InnoDB? Do I have to migrate?” Of course not! We have no plans to abandon InnoDB. InnoDB is a proven storage engine and we expect it to continue to be used by MariaDB users. But we do expect that deployments that need highest possible efficiency will opt for MyRocks because of its performance gains and IO efficiency. Over time, as MyRocks matures we expect it will become appropriate for even more use cases.

The first MariaDB version of MyRocks will be available in a release candidate of MariaDB Server 10.2 coming this winter. Our goal is for MyRocks to work with all MariaDB features, but some of them, like optimistic parallel replication, may not work in the first release. MariaDB is an open source project that follows the "release often, release early" approach, so our goal is to first make a release that meets core requirements, and then add support for special cases in subsequent releases.

Now let’s move onto my discussion with Facebook’s Database Engineering team!

Can you tell us a bit about the history of RocksDB at Facebook?

In 2012, we started to build an embedded storage engine optimized for flash-based SSD, by forking LevelDB. The fork became RocksDB, which was open-sourced in November 2013 [1]. After RocksDB proved to be an effective persistent key-value store for SSD, we enhanced RocksDB for other platforms. We improved its performance on DRAM in 2014 and on hard drives in 2015, two platforms that now have production use cases.

Over the past few years, we've introduced numerous features and improvements. To name a few, we built compaction filter and merge operator in 2013, backup and column families in 2014, transactions and bulk loading in 2015, and persistent cache in 2016. See the list of features that are not in LevelDB: .

Early RocksDB adopters at Facebook such as the distributed key-value store ZippyDB [2], Laser [2] and Dragon [3] went into production in early 2013. Since then, many more new or existing services at Facebook started to use RocksDB every year. Now RocksDB is used in a number of services across multiple hardware platforms at Facebook.

[1] and

Why did FB go down the RocksDB path for MySQL?

MySQL is a popular storage solution at Facebook because we have a great team dedicated to running MySQL at scale that provides a high quality of service. The MySQL tiers store many petabytes of data that have been compressed with InnoDB table compression. We are always looking for ways to improve compression and the LSM algorithm used by RocksDB has several advantages over the B-Tree used by InnoDB. This led us to MyRocks: RocksDB is a key-value storage engine. MyRocks implements the MySQL storage engine API to make RocksDB work with MySQL and provide SQL functionality. Our initial goal was to get 2x more compression from MyRocks than from compressed InnoDB without affecting read performance. We exceeded our goal. In addition to getting 2x better compression, we also got much lower write rates to storage, faster database loads, and better performance.

Lower write rates enable the use of lower endurance flash, and faster loads simplify the migration from MySQL on InnoDB to MySQL on RocksDB. While we don't expect better performance for all workloads, the way in which we operate the database tier for the initial MyRocks deployment favors RocksDB more than InnoDB. Finally, there are features unique to an LSM that we expect to support in the future, including the merge operator and compaction filters. MyRocks can be helpful to the MySQL community because of efficiency and innovation.

We considered multiple write-optimized database engines. We chose RocksDB because it has excellent performance and efficiency and because we work directly with the team. The MyRocks effort has benefited greatly from being able to collaborate on a daily basis with the RocksDB team. We appreciate that the RocksDB team treats us like a very important customer. They move fast to make RocksDB better for MyRocks.

How was MyRocks developed?

MyRocks is developed by engineers from several locations across the globe. The team had the privilege to work with Sergey Petrunia right from the beginning, and he is based in Russia. At Facebook's Menlo Park campus, Siying Dong leads RocksDB development and Yoshinori Matsunobu leads the collaboration with MySQL infrastructure and data performance teams. From the Seattle office, Herman Lee worked on the initial validation of MyRocks that gave the team the confidence to proceed with MyRocks for our user databases as well as led the MyRocks feature development. In Oregon, Mark Callaghan has been benchmarking all aspects of MyRocks and RocksDB, which has helped developers prioritize performance improvements and feature work. Since the rollout began, the entire database engineering team has been helping to make MyRocks successful by developing high-confidence testing, improving MySQL rollout speed, and addressing other issues. At the same time, the MySQL infrastructure and data performance teams worked to adapt our automation around MyRocks.

What gave Facebook the confidence to move to MyRocks in production?

Much of our early testing with the new storage engine was running the Linkbench benchmark used to simulate Facebook's social graph workload. While these results were promising, we could not rely completely on them to make a decision. In order for MyRocks to be compelling for our infrastructure, MyRocks needed to reduce space and write rates by 50% compared with InnoDB on production workloads.

Once we supported enough features in MyRocks, we created a MyRocks test replica from a large production InnoDB server. We built a tool to duplicate the read and write traffic from the production InnoDB server to the MyRocks test replica. Compared with compressed InnoDB, we confirmed that MyRocks used half the space and reduced the storage write rate by more than half while providing similar response times for read and write operations.

We ran tests where we consolidated two InnoDB production servers onto a single MyRocks server and showed that our hardware can handle the double workload. This was the final result we needed to show that MyRocks is capable of reducing our server requirements by half and gave us the confidence that we should switch from InnoDB to MyRocks.

What approach did Facebook take for deploying MyRocks in production?

Moving to a new storage engine for MySQL comes with some risk and requires extensive testing and careful planning. Starting the RocksDB deployment with our user databases that store the social graph data may seem counterintuitive. However, the team chose to go this route because of two mutually reinforcing reasons:

  1. Based on benchmark and production experiments, the efficiency gains were significant enough and proportional to the scale of the deployment
  2. The workload on our user database tier is relatively simple, well known, and something our engineering team could easily reason about as most of it comes from our TAO Cache.

The benefits we expect as well as further details on the MyRocks project can be found in Yoshinori's post.

Both MyRocks and MariaDB are open source projects that are made stronger with community involvement. How will it help MyRocks when MariaDB releases a supported version? How would you like to see the community get more involved?

We expect MyRocks to get better faster when it is used beyond Facebook. But for that to happen it needs to be in a distribution like MariaDB Server that has great documentation, expert support, a community, and many power users. The community brings more skills, more use cases, and more energy to the MyRocks effort. We look forward to getting bug reports when MyRocks doesn't perform as expected, feature requests that we might have missed, and pull requests for bug fixes and new features.

I am most excited about attending conference talks about MyRocks presented by people who don't work at Facebook. While I think it is a great storage engine, the real test is whether other people find it useful — and hopefully useful enough that they want to talk and write about it.


by spetrunia at December 07, 2016 01:58 PM

Jean-Jerome Schmidt

MySQL on Docker: Deploy a Homogeneous Galera Cluster with etcd

In the previous blog post, we looked into the multi-host networking capabilities of Docker with native networking and Calico. In this blog post, our journey to make Galera Cluster run smoothly on Docker containers continues. Deploying Galera Cluster on Docker is tricky when using orchestration tools. Due to the nature of the scheduler in container orchestration tools and the assumption of homogeneous images, the scheduler will just fire the respective containers according to the run command and leave the bootstrapping process to the container’s entrypoint logic when starting up. And you do not want to do that for Galera: starting all nodes at once means each node will form a “1-node cluster” and you’ll end up with a disjointed system.

“Homogeneousing” Galera Cluster

That might be a new word, but it holds true for stateful services like MySQL Replication and Galera Cluster. As one might know, the bootstrapping process for Galera Cluster usually requires manual intervention, where you have to decide which node is the most advanced node to start bootstrapping from. There is nothing wrong with this step: you need to be aware of the state of each database node before deciding on the sequence in which to start them up. Galera Cluster is a distributed system, and its redundancy model works like that.

However, container orchestration tools like Docker Engine Swarm Mode and Kubernetes are not aware of the redundancy model of Galera. The orchestration tool presumes containers are independent from each other. If they are dependent, then you have to have an external service that monitors the state. The best way to achieve this is to use a key/value store as a reference point for other containers when starting up.

This is where service discovery like etcd comes into the picture. The basic idea is, each node should report its state periodically to the service. This simplifies the decision process when starting up. For Galera Cluster, the node that has wsrep_local_state_comment equal to Synced shall be used as a reference node when constructing the Galera communication address (gcomm) during joining. Otherwise, the most updated node has to get bootstrapped first.
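The decision described above can be sketched as a small shell function. This is a simplified illustration, not the logic shipped in the image: the input format (one line per node with host, state and last committed sequence number) and the function name are assumptions made for the example.

```bash
# Sketch: choose between joining an existing cluster and bootstrapping a new one.
# Input on stdin, one line per node: "<host> <wsrep_local_state_comment> <wsrep_last_committed>".
# The input format and the function name are hypothetical.
pick_start_action() {
    awk '
        # Collect every node that reports itself as Synced.
        $2 == "Synced"           { synced = (synced == "" ? $1 : synced "," $1) }
        # Remember the node with the highest committed sequence number.
        NR == 1 || $3 + 0 > best { best = $3 + 0; best_host = $1 }
        END {
            if (synced != "")         print "join gcomm://" synced
            else if (best_host != "") print "bootstrap " best_host
        }'
}

# Usage with made-up node states: no node is Synced, so the most advanced one bootstraps.
printf '%s\n' 'db1 Initialized 2881' 'db2 Initialized 2890' 'db3 Initialized 2700' | pick_start_action
```

When at least one node is Synced, the function prints a gcomm:// address built from the synced nodes instead of a bootstrap candidate.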

Etcd has a very nice feature called TTL, where you can expire a key after a certain amount of time. This is useful to determine the state of a node, where the key/value entry only exists while a live node keeps reporting to it. As a result, the nodes won’t have to connect to each other to determine state (which is very troublesome in a dynamic environment) when forming a cluster. For example, consider the following keys:

    {
        "createdIndex": 10074,
        "expiration": "2016-11-29T10:55:35.218496083Z",
        "key": "/galera/my_wsrep_cluster/",
        "modifiedIndex": 10074,
        "ttl": 10,
        "value": "2881"
    },
    {
        "createdIndex": 10072,
        "expiration": "2016-11-29T10:55:34.650574629Z",
        "key": "/galera/my_wsrep_cluster/",
        "modifiedIndex": 10072,
        "ttl": 10,
        "value": "Synced"
    }

After 10 seconds (the ttl value), those keys will be removed. Basically, all nodes should report to etcd periodically with an expiring key. A container should report every N seconds while it is alive (wsrep_local_state_comment=Synced and wsrep_last_committed=#value) via a background process. If a container is down, it will no longer send the update to etcd, thus the keys are removed after expiration. This simply indicates that the node was registered but is no longer synced with the cluster. It will be skipped when constructing the Galera communication address at a later point.
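A reporting step of this kind could look roughly like the sketch below. It assumes the etcd v2 command-line client (etcdctl set with the --ttl flag). The key layout under /galera/<cluster>/<node>/ and the function name are hypothetical, and this is not the script shipped in the image.

```bash
# Sketch: publish a node's state to etcd with an expiring key.
# ETCDCTL can be overridden (for example ETCDCTL=echo) to print the commands instead of running them.
ETCDCTL=${ETCDCTL:-etcdctl}

report_state() {
    local cluster=$1 node=$2 state=$3 seqno=$4 ttl=${5:-10}
    # Hypothetical key layout: one subdirectory per node under the cluster name.
    local prefix="/galera/${cluster}/${node}"
    "$ETCDCTL" set "${prefix}/wsrep_local_state_comment" "$state" --ttl "$ttl"
    "$ETCDCTL" set "${prefix}/wsrep_last_committed" "$seqno" --ttl "$ttl"
}

# A background loop would call report_state more often than the TTL, for example:
#   while true; do report_state my_wsrep_cluster "$(hostname)" Synced 2881; sleep 5; done
```

Reporting more often than the TTL keeps the keys alive while the node is healthy. When the node stops reporting, the keys simply expire.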

The overall flow of joining procedure is illustrated in the following flow chart:

We have built a Docker image that follows the above. It is specifically built for running Galera Cluster using Docker’s orchestration tool. It is available at Docker Hub and our Github repository. It requires an etcd cluster as the discovery service (multiple etcd hosts are supported) and is based on Percona XtraDB Cluster 5.6. The image includes Percona Xtrabackup, jq (JSON processor) and a shell script tailored for Galera health checks.

You are welcome to fork or contribute to the project. Any bugs can be reported via Github or via our support page.

Deploying etcd Cluster

etcd is a distributed key value store that provides a simple and efficient way to store data across a cluster of machines. It’s open-source and available on GitHub. It provides shared configuration and service discovery. A simple use-case is to store database connection details or feature flags in etcd as key value pairs. It gracefully handles leader elections during network partitions and will tolerate machine failures, including the leader.

Since etcd is the brain of the setup, we are going to deploy it as a cluster daemon, on three nodes, instead of using containers. In this example, we are going to install etcd on each of the Docker hosts and form a three-node etcd cluster for better availability.

We used CentOS 7 as the operating system, with Docker v1.12.3, build 6b644ec. The deployment steps in this blog post are basically similar to the one used in our previous blog post.

  1. Install etcd packages:

    $ yum install etcd
  2. Modify the configuration file accordingly depending on the Docker hosts:

    $ vim /etc/etcd/etcd.conf

    For docker1 with IP address


    For docker2 with IP address


    For docker3 with IP address

  3. Start the service on docker1, followed by docker2 and docker3:

    $ systemctl enable etcd
    $ systemctl start etcd
  4. Verify our cluster status using etcdctl:

    [docker3 ]$ etcdctl cluster-health
    member 2f8ec0a21c11c189 is healthy: got healthy result from
    member 589a7883a7ee56ec is healthy: got healthy result from
    member fcacfa3f23575abe is healthy: got healthy result from
    cluster is healthy

That’s it. Our etcd is now running as a cluster on three nodes. The below illustrates our architecture:

Deploying Galera Cluster

A minimum of 3 containers is recommended for a high availability setup. Thus, we are going to create 3 replicas to start with; the service can be scaled up and down afterwards. Running standalone is also possible with the standard "docker run" command, as shown further down.

Before we start, it’s a good idea to remove any sort of keys related to our cluster name in etcd:

$ etcdctl rm /galera/my_wsrep_cluster --recursive

Ephemeral Storage

This is a recommended way if you plan on scaling the cluster out on more nodes (or scale back by removing nodes). To create a three-node Galera Cluster with ephemeral storage (MySQL datadir will be lost if the container is removed), you can use the following command:

$ docker service create \
--name mysql-galera \
--replicas 3 \
-p 3306:3306 \
--network galera-net \
--env MYSQL_ROOT_PASSWORD=mypassword \
--env XTRABACKUP_PASSWORD=mypassword \
--env CLUSTER_NAME=my_wsrep_cluster \

Persistent Storage

To create a three-node Galera Cluster with persistent storage (MySQL datadir persists if the container is removed), add the mount option with type=volume:

$ docker service create \
--name mysql-galera \
--replicas 3 \
-p 3306:3306 \
--network galera-net \
--mount type=volume,source=galera-vol,destination=/var/lib/mysql \
--env MYSQL_ROOT_PASSWORD=mypassword \
--env XTRABACKUP_PASSWORD=mypassword \
--env CLUSTER_NAME=my_wsrep_cluster \

Custom my.cnf

If you would like to include a customized MySQL configuration file, create a directory on the physical host beforehand:

$ mkdir /mnt/docker/mysql-config # repeat on all Docker hosts

Then, use the mount option with “type=bind” to map the path into the container. In the following example, the custom my.cnf is located at /mnt/docker/mysql-config/my-custom.cnf on each Docker host:

$ docker service create \
--name mysql-galera \
--replicas 3 \
-p 3306:3306 \
--network galera-net \
--mount type=volume,source=galera-vol,destination=/var/lib/mysql \
--mount type=bind,src=/mnt/docker/mysql-config,dst=/etc/my.cnf.d \
--env MYSQL_ROOT_PASSWORD=mypassword \
--env XTRABACKUP_PASSWORD=mypassword \
--env CLUSTER_NAME=my_wsrep_cluster \

Wait for a couple of minutes and verify the service is running (CURRENT STATE = Running):

$ docker service ps mysql-galera
ID                         NAME            IMAGE               NODE           DESIRED STATE  CURRENT STATE           ERROR
2vw40cavru9w4crr4d2fg83j4  mysql-galera.1  severalnines/pxc56  docker1.local  Running        Running 5 minutes ago
1cw6jeyb966326xu68lsjqoe1  mysql-galera.2  severalnines/pxc56  docker3.local  Running        Running 12 seconds ago
753x1edjlspqxmte96f7pzxs1  mysql-galera.3  severalnines/pxc56  docker2.local  Running        Running 5 seconds ago

External applications/clients can connect to any Docker host IP address or hostname on port 3306; requests will be load balanced between the Galera containers. The connection gets NATed to a Virtual IP address for each service "task" (container, in this case) using the Linux kernel's built-in load balancing functionality, IPVS. If the application containers reside in the same overlay network (galera-net), then use the assigned virtual IP address instead. You can retrieve it using the inspect option:

$ docker service inspect mysql-galera -f "{{ .Endpoint.VirtualIPs }}"
[{89n5idmdcswqqha7wcswbn6pw} {1ufbr56pyhhbkbgtgsfy9xkww}]

Our architecture is now looking like this:

As a side note, you can also run Galera in standalone mode. This is probably useful for testing purposes like backup and restore, testing the impact of queries and so on. To run it just like a standalone MySQL container, use the standard docker run command:

$ docker run -d \
-p 3306 \
--name=galera-single \
-e MYSQL_ROOT_PASSWORD=mypassword \
-e CLUSTER_NAME=my_wsrep_cluster \

Scaling the Cluster

There are two ways you can do scaling:

  1. Use “docker service scale” command.
  2. Create a new service with same CLUSTER_NAME using “docker service create” command.

Docker’s “scale” Command

The scale command enables you to scale one or more services either up or down to the desired number of replicas. The command will return immediately, but the actual scaling of the service may take some time. Galera should be run with an odd number of nodes so that a network partition always leaves one side with a majority.

So a good number to scale to would be 5, then 7, and so on:

$ docker service scale mysql-galera=5

Wait for a couple of minutes to let the new containers reach the desired state. Then, verify the running service:

$ docker service ls
ID            NAME          REPLICAS  IMAGE               COMMAND
bwvwjg248i9u  mysql-galera  5/5       severalnines/pxc56

One drawback of using this method is that you have to use ephemeral storage because Docker will likely schedule the new containers on a Docker host that already has a Galera container running. If this happens, the volume will overlap the existing Galera containers’ volume. If you would like to use persistent storage and scale in Docker Swarm mode, you should create another new service with a couple of different options, as described in the next section.

At this point, our architecture looks like this:

Another Service with Same Cluster Name

Another way to scale is to create another service with the same CLUSTER_NAME and network. However, you can’t really use the exact same command as the first one due to the following reasons:

  • The service name should be unique.
  • The port mapping must be other than 3306, since this port has been assigned to the mysql-galera service.
  • The volume name should be different to distinguish them from the existing Galera containers.

A benefit of doing this is that you will get another virtual IP address assigned to the “scaled” service. This gives your application or client the additional option of connecting to the “scaled” IP address for various tasks, e.g. to perform a full backup in desync mode, a database consistency check or server auditing.

The following example shows the command to add two more nodes to the cluster in a new service called mysql-galera-scale:

$ docker service create \
--name mysql-galera-scale \
--replicas 2 \
-p 3307:3306 \
--network galera-net \
--mount type=volume,source=galera-scale-vol,destination=/var/lib/mysql \
--env MYSQL_ROOT_PASSWORD=mypassword \
--env XTRABACKUP_PASSWORD=mypassword \
--env CLUSTER_NAME=my_wsrep_cluster \

If we look into the service list, here is what we see:

$ docker service ls
ID            NAME                REPLICAS  IMAGE               COMMAND
0ii5bedv15dh  mysql-galera-scale  2/2       severalnines/pxc56
71pyjdhfg9js  mysql-galera        3/3       severalnines/pxc56

And when you look at the cluster size on one of the containers, you should get 5:

[root@docker1 ~]# docker exec -it $(docker ps | grep mysql-galera | awk {'print $1'}) mysql -uroot -pmypassword -e 'show status like "wsrep_cluster_size"'
Warning: Using a password on the command line interface can be insecure.
| Variable_name      | Value |
| wsrep_cluster_size | 5     |

At this point, our architecture looks like this:

To get a clearer view of the process, we can simply look at the MySQL error log file (located under Docker’s data volume) on one of the running containers, for example:

$ tail -f /var/lib/docker/volumes/galera-vol/_data/error.log

Scale Down

Scaling down is simple. Just reduce the number of replicas, or remove the service that holds the minority of the containers, so that Galera stays in quorum. For example, if you have fired up two groups of nodes with 3 + 2 containers for a total of 5, the majority needs to survive, so you can only remove the second group with 2 containers. If you have three groups with 3 + 2 + 2 containers (7 in total), you can lose a maximum of 3 containers. This is because the Docker Swarm scheduler simply terminates and removes the containers corresponding to the service. Galera therefore sees the nodes as failed, since they are not shut down gracefully.
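The quorum arithmetic above can be sketched as a small shell helper. This is only an illustration: the function names max_removable and can_remove are made up for this example, and it assumes the containers are removed all at once and ungracefully, as described above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Docker or Galera.

# Largest number of nodes that can disappear at once while a strict
# majority (more than half) of the original cluster survives.
max_removable() {
    echo $(( ($1 - 1) / 2 ))
}

# Succeeds (exit status 0) when removing $2 of $1 nodes keeps quorum.
can_remove() {
    [ "$2" -le "$(max_removable "$1")" ]
}

max_removable 5    # 3 + 2 containers
max_removable 7    # 3 + 2 + 2 containers
```

The two calls print 2 and 3, matching the 3 + 2 and 3 + 2 + 2 examples.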

If you scaled up using the “docker service scale” command, you should scale down the same way, by reducing the number of replicas. To scale down, simply do:

$ docker service scale mysql-galera=3

Otherwise, if you chose to create another service to scale up, then simply remove the respective service to scale down:

$ docker service rm mysql-galera-scale

Known Limitations

There will be no automatic recovery if a split-brain happens (where all nodes are in Non-Primary state). This is because the MySQL service is still running, yet it refuses to serve any data and returns an error to the client. Docker has no way to detect this, since all it cares about is the foreground MySQL process, which has not been terminated, killed or stopped. Automating this process is risky, especially if the service discovery is co-located with the Docker host (etcd would also lose contact with other members). Even if the service discovery is healthy somewhere else, it is probably unreachable from the Galera containers’ perspective, preventing them from seeing each other’s status correctly during the glitch.

In this case, you will need to intervene manually.

Choose the most advanced node to bootstrap and then run the following command to promote the node as Primary (other nodes shall then rejoin automatically if the network recovers):

$ docker exec -it [container ID] mysql -uroot -pyoursecret -e 'set global wsrep_provider_options="pc.bootstrap=1"'

Also, there is no automatic cleanup for the discovery service registry. You can remove all entries using either of the following commands (assuming the CLUSTER_NAME is my_wsrep_cluster):

$ curl -XDELETE # or
$ etcdctl rm /galera/my_wsrep_cluster --recursive


This combination of technologies opens a door for a more reliable database setup in the Docker ecosystem. Working with service discovery to store state makes it possible to have stateful containers to achieve a homogeneous setup.

In the next blog post, we are going to look into how to manage Galera Cluster on Docker.

by ashraf at December 07, 2016 08:48 AM

Peter Zaitsev

Webinar Thursday, December 8: Virtual Columns in MySQL and MariaDB

Please join Federico Razzoli, Consultant at Percona, on Thursday, December 8, 2016, at 8 AM PT / 11 AM ET (UTC – 8) as he presents Virtual Columns in MySQL and MariaDB.

MariaDB 5.2 and MySQL 5.7 introduced virtual columns, with different implementations. Their features and limitations are similar, but not identical. The main difference is that only MySQL allows you to build an index on a non-persistent column.

In this talk, we’ll present some use cases for virtual columns. These cases include query simplification and UNIQUE constraints based on an SQL expression. In particular, we will see how to use them to index JSON data in MySQL, or dynamic columns in MariaDB.

Performance and limitations will also be discussed.

Sign up for the webinar here.

Federico Razzoli is a relational databases lover and open source supporter. He is a MariaDB Community Ambassador and wrote “Mastering MariaDB” in 2014. Currently, he works for Percona as a consultant.

by Dave Avery at December 07, 2016 12:14 AM

December 06, 2016

MariaDB AB

Using the MariaDB Audit Plugin with MySQL

The MariaDB audit plugin is bundled with MariaDB Server. Even so, the plugin is also compatible with MySQL. For a step-by-step guide on how to use the plugin with MySQL, check out my blog post here.

by geoff_montee_g at December 06, 2016 08:10 PM

Importing InnoDB Partitions in MariaDB 10.0/10.1

Transportable tablespaces for InnoDB tables is a very useful feature added in MySQL 5.6 and MariaDB 10.0. With this new feature, an InnoDB table’s tablespace file can be copied from one server to another, as long as the table uses a file-per-table tablespace.

Unfortunately, the initial transportable tablespace feature in MySQL 5.6 and MariaDB 10.0 does not support partitioned tables. Support for partitioned tables was added in MySQL 5.7. This feature will also likely be added to MariaDB 10.2, since it will contain MySQL 5.7’s InnoDB implementation. However, having this feature in new versions doesn’t help you much if you want to use it in the older versions of MySQL or MariaDB.

The good news is that there is a workaround that allows you to use transportable tablespaces in MySQL 5.6 and MariaDB 10.0/10.1 to copy partitioned tables from one server to another. For a step-by-step guide on how to use the workaround, check out my blog post here.

by geoff_montee_g at December 06, 2016 08:02 PM

December 05, 2016

Peter Zaitsev

Percona Live 2017 Open Source Database Conference Tutorial Schedule is Live!

We are excited to announce that the tutorial schedule for the Percona Live 2017 Open Source Database Conference is up!

The Percona Live Open Source Database Conference 2017 is April 24th – 27th, at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

Click through to the tutorial link right now, look them over, and pick which sessions you want to attend. Discounted passes available below!

Tutorial List:

Early Bird Discounts

Just a reminder to everyone out there: our Early Bird discount rate for the Percona Live Open Source Database Conference 2017 is only available ‘til January 8, 2017, 11:30 pm PST! This rate gets you all the excellent and amazing opportunities that Percona Live offers, at a very reasonable price!

Sponsor Percona Live

Become a conference sponsor! We have sponsorship opportunities available for this annual MySQL, MongoDB and open source database event. Sponsors become a part of a dynamic and growing ecosystem and interact with hundreds of DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solutions vendors, and entrepreneurs who attend the event.

by Kortney Runyan at December 05, 2016 08:43 PM

MongoDB Troubleshooting: My Top 5

In this blog post, I’ll discuss my top five go-to tips for MongoDB troubleshooting.

Every DBA has a war chest of their go-to solutions for any support issues they run into for a specific technology. MongoDB is no different. Even if you have picked it because it’s a good fit and it runs well for you, things will change. When things change – sometimes there is a new version of your application, or a new version of the database itself – you need to have a solid starting place.

To help new DBAs, I like to point out my top five things that cover the bulk of requests a DBA might need to work on.

Common greps to use

This section is all about ways to pare down the error log and make it a bit more manageable. The error log holds a slew of information and sometimes, without grep, it’s challenging to correlate events.

Is an index being built?

As a DBA you will often get a call saying the database has “stopped.” The developer might say, “I didn’t change anything.” Looking at the error log is a great first port of call. With this particular grep, you just want to see if all index builds were done, if a new index was built and is still building, or an index was removed. This will help catch all of the cases in question.

>grep -i index mongod.log
2016-11-11T17:08:53.731+0000 I INDEX [conn458] build index on: samples.col1 properties: { v: 1, key: { friends: 1.0 }, name: "friends_1", ns: "samples.col1" }
2016-11-11T17:08:53.733+0000 I INDEX [conn458] building index using bulk method
2016-11-11T17:08:56.045+0000 I - [conn458] Index Build: 24700/1000000 2%
2016-11-11T17:08:59.004+0000 I - [conn458] Index Build: 61000/1000000 6%
2016-11-11T17:09:02.001+0000 I - [conn458] Index Build: 103200/1000000 10%
2016-11-11T17:09:05.013+0000 I - [conn458] Index Build: 130800/1000000 13%
2016-11-11T17:09:08.013+0000 I - [conn458] Index Build: 160300/1000000 16%
2016-11-11T17:09:11.039+0000 I - [conn458] Index Build: 183100/1000000 18%
2016-11-11T17:09:14.009+0000 I - [conn458] Index Build: 209400/1000000 20%
2016-11-11T17:09:17.007+0000 I - [conn458] Index Build: 239400/1000000 23%
2016-11-11T17:09:20.010+0000 I - [conn458] Index Build: 264100/1000000 26%
2016-11-11T17:09:23.001+0000 I - [conn458] Index Build: 286800/1000000 28%
2016-11-11T17:09:30.783+0000 I - [conn458] Index Build: 298900/1000000 29%
2016-11-11T17:09:33.015+0000 I - [conn458] Index Build: 323900/1000000 32%
2016-11-11T17:09:36.000+0000 I - [conn458] Index Build: 336600/1000000 33%
2016-11-11T17:09:39.000+0000 I - [conn458] Index Build: 397000/1000000 39%
2016-11-11T17:09:42.000+0000 I - [conn458] Index Build: 431900/1000000 43%
2016-11-11T17:09:45.002+0000 I - [conn458] Index Build: 489100/1000000 48%
2016-11-11T17:09:48.003+0000 I - [conn458] Index Build: 551200/1000000 55%
2016-11-11T17:09:51.004+0000 I - [conn458] Index Build: 567700/1000000 56%
2016-11-11T17:09:54.004+0000 I - [conn458] Index Build: 589600/1000000 58%
2016-11-11T17:10:00.929+0000 I - [conn458] Index Build: 597800/1000000 59%
2016-11-11T17:10:03.008+0000 I - [conn458] Index Build: 633100/1000000 63%
2016-11-11T17:10:06.001+0000 I - [conn458] Index Build: 647200/1000000 64%
2016-11-11T17:10:09.008+0000 I - [conn458] Index Build: 660000/1000000 66%
2016-11-11T17:10:12.001+0000 I - [conn458] Index Build: 672300/1000000 67%
2016-11-11T17:10:15.009+0000 I - [conn458] Index Build: 686000/1000000 68%
2016-11-11T17:10:18.001+0000 I - [conn458] Index Build: 706100/1000000 70%
2016-11-11T17:10:21.006+0000 I - [conn458] Index Build: 731400/1000000 73%
2016-11-11T17:10:24.006+0000 I - [conn458] Index Build: 750900/1000000 75%
2016-11-11T17:10:27.000+0000 I - [conn458] Index Build: 773900/1000000 77%
2016-11-11T17:10:30.000+0000 I - [conn458] Index Build: 821800/1000000 82%
2016-11-11T17:10:33.026+0000 I - [conn458] Index Build: 843800/1000000 84%
2016-11-11T17:10:36.008+0000 I - [conn458] Index Build: 874000/1000000 87%
2016-11-11T17:10:43.854+0000 I - [conn458] Index Build: 896600/1000000 89%
2016-11-11T17:10:46.009+0000 I - [conn458] Index Build: 921800/1000000 92%
2016-11-11T17:10:49.000+0000 I - [conn458] Index Build: 941600/1000000 94%
2016-11-11T17:10:52.011+0000 I - [conn458] Index Build: 955700/1000000 95%
2016-11-11T17:10:55.007+0000 I - [conn458] Index Build: 965500/1000000 96%
2016-11-11T17:10:58.046+0000 I - [conn458] Index Build: 985200/1000000 98%
2016-11-11T17:11:01.002+0000 I - [conn458] Index Build: 995000/1000000 99%
2016-11-11T17:11:13.000+0000 I - [conn458] Index: (2/3) BTree Bottom Up Progress: 8216900/8996322 91%
2016-11-11T17:11:14.021+0000 I INDEX [conn458] done building bottom layer, going to commit
2016-11-11T17:11:14.023+0000 I INDEX [conn458] build index done. scanned 1000000 total records. 140 secs
2016-11-11T17:11:14.035+0000 I COMMAND [conn458] command samples.$cmd command: createIndexes { createIndexes: "col1", indexes: [ { ns: "samples.col1", key: { friends: 1.0 }, name: "friends_1" } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:173 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, MMAPV1Journal: { acquireCount: { w: 9996326 }, acquireWaitCount: { w: 1054 }, timeAcquiringMicros: { w: 811319 } }, Database: { acquireCount: { w: 1, W: 1 } }, Collection: { acquireCount: { W: 1 } }, Metadata: { acquireCount: { W: 12 } }, oplog: { acquireCount: { w: 1 } } } 140306ms
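When the log is large, it can help to reduce the output above to a single status. The following is a minimal sketch, assuming the log line format shown above (“Index Build: N/M P%” progress lines, followed by “build index done”). The function name index_build_status is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: report the state of the most recent index build
# found in a mongod log file passed as $1.
index_build_status() {
    # Keep only the last progress or completion line in the file.
    last=$(grep -E 'Index Build: |build index done' "$1" | tail -n 1)
    case $last in
        *'build index done'*) echo "done" ;;
        # Strip everything up to the last space, leaving e.g. "99%".
        *'Index Build: '*)    echo "${last##* }" ;;
        *)                    echo "none" ;;
    esac
}

# Usage: index_build_status mongod.log
```

It prints “done” when the last matching line is a completed build, the latest percentage when a build is still running, and “none” when the log mentions no index build at all.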

What’s happening right now?

As with the index example above, this grep removes many of the messages you might not care about or want to filter out. MongoDB does tag log lines with useful sub-components, such as “ReplicationExecutor” and “connXXX”, but I find it more helpful to remove the noisy lines than to filter by log facility. In this example, I opted not to add “| grep -v connection”: typically I look at the log with connections first to see if they are acting funny, and then filter them out to see the core of what is happening. If you only want to see the long queries and commands, replace “ms” with “connection” to make them easier to find.

>grep -v conn mongod.log | grep -v auth | grep -vi health | grep -v ms
2016-11-11T14:41:06.376+0000 I REPL [ReplicationExecutor] This node is localhost:28001 in the config
2016-11-11T14:41:06.377+0000 I REPL [ReplicationExecutor] transition to STARTUP2
2016-11-11T14:41:06.379+0000 I REPL [ReplicationExecutor] Member localhost:28003 is now in state STARTUP
2016-11-11T14:41:06.383+0000 I REPL [ReplicationExecutor] Member localhost:28002 is now in state STARTUP
2016-11-11T14:41:06.385+0000 I STORAGE [FileAllocator] allocating new datafile /Users/dmurphy/Github/dbmurphy/MongoDB32Labs/labs/rs2-1/local.1, filling with zeroes...
2016-11-11T14:41:06.586+0000 I STORAGE [FileAllocator] done allocating datafile /Users/dmurphy/Github/dbmurphy/MongoDB32Labs/labs/rs2-1/local.1, size: 256MB, took 0.196 secs
2016-11-11T14:41:06.610+0000 I REPL [ReplicationExecutor] transition to RECOVERING
2016-11-11T14:41:06.614+0000 I REPL [ReplicationExecutor] transition to SECONDARY
2016-11-11T14:41:08.384+0000 I REPL [ReplicationExecutor] Member localhost:28003 is now in state STARTUP2
2016-11-11T14:41:08.386+0000 I REPL [ReplicationExecutor] Standing for election
2016-11-11T14:41:08.388+0000 I REPL [ReplicationExecutor] Member localhost:28002 is now in state STARTUP2
2016-11-11T14:41:08.390+0000 I REPL [ReplicationExecutor] not electing self, localhost:28002 would veto with 'I don't think localhost:28001 is electable because the member is not currently a secondary (mask 0x8)'
2016-11-11T14:41:08.391+0000 I REPL [ReplicationExecutor] not electing self, we are not freshest
2016-11-11T14:41:10.387+0000 I REPL [ReplicationExecutor] Standing for election
2016-11-11T14:41:10.389+0000 I REPL [ReplicationExecutor] replSet info electSelf
2016-11-11T14:41:10.393+0000 I REPL [ReplicationExecutor] received vote: 1 votes from localhost:28003
2016-11-11T14:41:10.395+0000 I REPL [ReplicationExecutor] replSet election succeeded, assuming primary role
2016-11-11T14:41:10.396+0000 I REPL [ReplicationExecutor] transition to PRIMARY
2016-11-11T14:41:10.631+0000 I REPL [rsSync] transition to primary complete; database writes are now permitted
2016-11-11T14:41:12.390+0000 I REPL [ReplicationExecutor] Member localhost:28003 is now in state SECONDARY
2016-11-11T14:41:12.393+0000 I REPL [ReplicationExecutor] Member localhost:28002 is now in state SECONDARY
2016-11-11T14:41:12.393+0000 I REPL [ReplicationExecutor] Member localhost:28002 is now in state SECONDARY
2016-11-11T14:41:36.433+0000 I NETWORK [conn3] end connection (1 connection now open)
2016-11-11T14:41:36.433+0000 I NETWORK [initandlisten] connection accepted from #8 (3 connections now open)
2016-11-11T14:41:36.490+0000 I NETWORK [conn2] end connection (1 connection now open)
2016-11-11T14:41:36.490+0000 I NETWORK [initandlisten] connection accepted from #9 (3 connections now open)
2016-11-11T14:41:54.480+0000 I NETWORK [initandlisten] connection accepted from #10 (3 connections now open)
2016-11-11T14:41:54.486+0000 I NETWORK [initandlisten] connection accepted from #11 (4 connections now open)
2016-11-11T14:42:06.493+0000 I NETWORK [conn8] end connection (3 connections now open)
2016-11-11T14:42:06.494+0000 I NETWORK [initandlisten] connection accepted from #12 (5 connections now open)
2016-11-11T14:42:06.550+0000 I NETWORK [conn9] end connection (3 connections now open)
2016-11-11T14:42:06.550+0000 I NETWORK [initandlisten] connection accepted from #13 (5 connections now open)
2016-11-11T14:42:36.550+0000 I NETWORK [conn12] end connection (3 connections now open)
2016-11-11T14:42:36.550+0000 I NETWORK [initandlisten] connection accepted from #14 (5 connections now open)
2016-11-11T14:42:36.601+0000 I NETWORK [conn13] end connection (3 connections now open)
2016-11-11T14:42:36.601+0000 I NETWORK [initandlisten] connection accepted from #15 (5 connections now open)
2016-11-11T14:43:06.607+0000 I NETWORK [conn14] end connection (3 connections now open)
2016-11-11T14:43:06.608+0000 I NETWORK [initandlisten] connection accepted from #16 (5 connections now open)
2016-11-11T14:43:06.663+0000 I NETWORK [conn15] end connection (3 connections now open)
2016-11-11T14:43:06.663+0000 I NETWORK [initandlisten] connection accepted from #17 (5 connections now open)
2016-11-11T14:43:36.655+0000 I NETWORK [conn16] end connection (3 connections now open)
2016-11-11T14:43:36.656+0000 I NETWORK [initandlisten] connection accepted from #18 (5 connections now open)
2016-11-11T14:43:36.718+0000 I NETWORK [conn17] end connection (3 connections now open)
2016-11-11T14:43:36.719+0000 I NETWORK [initandlisten] connection accepted from #19 (5 connections now open)
2016-11-11T14:44:06.705+0000 I NETWORK [conn18] end connection (3 connections now open)
2016-11-11T14:44:06.705+0000 I NETWORK [initandlisten] connection accepted from #20 (5 connections now open)
2016-11-11T14:44:06.786+0000 I NETWORK [conn19] end connection (3 connections now open)
2016-11-11T14:44:06.786+0000 I NETWORK [initandlisten] connection accepted from #21 (5 connections now open)
2016-11-11T14:44:36.757+0000 I NETWORK [conn20] end connection (3 connections now open)
2016-11-11T14:44:36.757+0000 I NETWORK [initandlisten] connection accepted from #22 (5 connections now open)
2016-11-11T14:44:36.850+0000 I NETWORK [conn21] end connection (3 connections now open)
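A plain “ms” pattern matches anywhere in the line, so it can also catch unrelated words. To pick out only the slow operations, one option is to match the duration at the end of the line, as in the createIndexes entry shown earlier (“... 140306ms”). The sketch below assumes that log format. The function name slow_ops and the threshold are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print log lines whose last field is a duration
# such as "140306ms" and is at least $1 milliseconds. $2 is the log file.
slow_ops() {
    awk -v min="$1" '
        $NF ~ /^[0-9]+ms$/ {
            ms = $NF
            sub(/ms$/, "", ms)          # "140306ms" -> "140306"
            if (ms + 0 >= min) print    # compare numerically
        }' "$2"
}

# Usage: slow_ops 1000 mongod.log
```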

Did any elections happen? Why did they happen?

While this isn’t the most common command to run, it is very helpful if you aren’t using Percona Monitoring and Management (PMM) to track the historical frequency of elections. In this example, we want up to 20 lines before and after the word “SECONDARY”, which typically marks the point where a step-down or election takes place. You can then see whether, around that time, a command was issued, a network error occurred, a heartbeat failed, or something similar happened.

grep -i SECONDARY -A20 -B20 mongod.log
2016-11-11T14:44:38.622+0000 I COMMAND [conn22] Attempting to step down in response to replSetStepDown command
2016-11-11T14:44:38.625+0000 I REPL [ReplicationExecutor] transition to SECONDARY
2016-11-11T14:44:38.627+0000 I NETWORK [conn10] end connection (4 connections now open)
2016-11-11T14:44:38.627+0000 I NETWORK [conn11] end connection (4 connections now open)
2016-11-11T14:44:38.630+0000 I NETWORK [thread1] trying reconnect to localhost:27001 ( failed
2016-11-11T14:44:38.628+0000 I NETWORK [conn22] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server []
2016-11-11T14:44:38.630+0000 I NETWORK [initandlisten] connection accepted from #25 (5 connections now open)
2016-11-11T14:44:38.633+0000 I NETWORK [thread1] reconnect localhost:27001 ( ok
2016-11-11T14:44:40.567+0000 I REPL [ReplicationExecutor] replSetElect voting yea for localhost:27002 (1)
2016-11-11T14:44:42.223+0000 I REPL [ReplicationExecutor] Member localhost:27002 is now in state PRIMARY
2016-11-11T14:44:44.314+0000 I NETWORK [initandlisten] connection accepted from #26 (4 connections now open)
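For a quick timeline of when members changed state, the same log can be condensed to the timestamp and the state message. This is a sketch only, assuming the “transition to STATE” and “is now in state STATE” wording shown in the excerpts above. The function name repl_state_changes is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: list replica set state changes found in the
# mongod log file passed as $1, one "timestamp message" per line.
repl_state_changes() {
    grep -E 'transition to [A-Z]|is now in state [A-Z]' "$1" |
        # Keep the timestamp and the text after the "[component]" tag.
        sed -E 's/^([^ ]+) .*\] (.*)$/\1 \2/'
}

# Usage: repl_state_changes mongod.log
```

Once a transition stands out, the grep with -A20 -B20 around that timestamp shows what led up to it.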

Is replication lagged, do I have enough oplog?

Always write a single test document just to ensure replication has a recent write:


Checking lag information:

rs1:PRIMARY> db.printSlaveReplicationInfo()
source: localhost:27002
syncedTo: Fri Nov 11 2016 17:11:14 GMT+0000 (GMT)
0 secs (0 hrs) behind the primary
source: localhost:27003
syncedTo: Fri Nov 11 2016 17:11:14 GMT+0000 (GMT)
0 secs (0 hrs) behind the primary

Oplog Size and Range:

rs1:PRIMARY> db.printReplicationInfo()
configured oplog size: 192MB
log length start to end: 2154secs (0.6hrs)
oplog first event time: Fri Nov 11 2016 16:35:20 GMT+0000 (GMT)
oplog last event time: Fri Nov 11 2016 17:11:14 GMT+0000 (GMT)
now: Fri Nov 11 2016 17:16:46 GMT+0000 (GMT)
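The “log length start to end” figure is simply the gap between the first and last oplog event times. A small sketch of that arithmetic follows, assuming both timestamps fall on the same day. The function names are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers for checking the oplog window by hand.

# Convert HH:MM:SS to seconds since midnight.
hms_to_secs() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds between two same-day times: $1 = first event, $2 = last event.
window_secs() {
    echo $(( $(hms_to_secs "$2") - $(hms_to_secs "$1") ))
}

# Convert seconds to hours, rounded to one decimal place.
secs_to_hours() {
    awk -v s="$1" 'BEGIN { printf "%.1f\n", s / 3600 }'
}

window_secs 16:35:20 17:11:14    # first and last oplog event times
secs_to_hours 2154
```

This prints 2154 and 0.6, the same values db.printReplicationInfo() reports above. If a secondary ever lags by more than that window, it can no longer catch up from the oplog.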

Taming the profiler

MongoDB is filled with tons of data in the profiler. I have highlighted some key points to know:

{
	"queryPlanner" : {
		"mongosPlannerVersion" : 1,
		"winningPlan" : {
			"stage" : "SINGLE_SHARD",
			"shards" : [
				{
					"shardName" : "rs3",
					"connectionString" : "rs3/localhost:29001,localhost:29002,localhost:29003",
					"serverInfo" : {
						"host" : "Davids-MacBook-Pro-2.local",
						"port" : 29001,
						"version" : "3.0.11",
						"gitVersion" : "48f8b49dc30cc2485c6c1f3db31b723258fcbf39"
					},
					"plannerVersion" : 1,
					"namespace" : "",
					"indexFilterSet" : false,
					"parsedQuery" : {
						"name" : {
							"$eq" : "Bob"
						}
					},
					"winningPlan" : {
						"stage" : "COLLSCAN",
						"filter" : {
							"name" : {
								"$eq" : "Bob"
							}
						},
						"direction" : "forward"
					},
					"rejectedPlans" : [ ]
				}
			]
		}
	},
	"executionStats" : {
		"nReturned" : 0,
		"executionTimeMillis" : 0,
		"totalKeysExamined" : 0,
		"totalDocsExamined" : 1,
		"executionStages" : {
			"stage" : "SINGLE_SHARD",
			"nReturned" : 0,
			"executionTimeMillis" : 0,
			"totalKeysExamined" : 0,
			"totalDocsExamined" : 1,
			"totalChildMillis" : NumberLong(0),
			"shards" : [
				{
					"shardName" : "rs3",
					"executionSuccess" : true,
					"executionStages" : {
						"stage" : "COLLSCAN",
						"filter" : {
							"name" : {
								"$eq" : "Bob"
							}
						},
						"nReturned" : 0,
						"executionTimeMillisEstimate" : 0,
						"works" : 3,
						"advanced" : 0,
						"needTime" : 2,
						"needFetch" : 0,
						"saveState" : 0,
						"restoreState" : 0,
						"isEOF" : 1,
						"invalidates" : 0,
						"direction" : "forward",
						"docsExamined" : 1
					}
				}
			]
		},
		"allPlansExecution" : [
			{
				"shardName" : "rs3",
				"allPlans" : [ ]
			}
		]
	},
	"ok" : 1
}

| Metric                   | Description |
| filter                   | The formulated query that was run. Right above it you can find the parsed query; the two should be the same. It’s useful to know what the engine was actually sent in the end. |
| nReturned                | The number of documents returned via the cursor to the client running the query or command. |
| executionTimeMillis      | This used to be called just “ms”; it is how long the operation took. Typically you would measure this like a slow query in any database. |
| total(Keys|Docs)Examined | Unlike nReturned, this is the number of index keys and documents that had to be examined to answer the operation. Not all indexes have perfect coverage, and sometimes you scan many documents to find no results. |
| stage                    | While poorly named, this tells you whether a collection scan (table scan) or an index was used to answer a given operation. In the case of an index, it will say the name. |


CurrentOp and killOp explained

When using db.currentOp() to see what is running, I frequently include db.currentOp(true) so that I can see everything and not just limited items. This makes the db.currentOp() function look and act much more like SHOW PROCESSLIST in MySQL. One significant difference that commonly catches a new DBA off guard is how killing operations differs between MySQL and MongoDB. While Mongo does have a handy db.killOp() function, it is important to know that, unlike MySQL (which immediately kills the thread running the process), MongoDB is a bit different. When you run db.killOp(), MongoDB appends “killed: true” into the document structure. When the next yield occurs (if it occurs), it will tell the operation to quit. This is also how a shutdown works: if it seems like it’s not shutting down, it might be waiting for an operation to yield and notice the shutdown request.

I’m not arguing that this is bad or good, just different from MySQL and something of which you should be aware. One thing to note, however, is that MongoDB has great built-in HA. Sometimes it is better to cause an election and let the drivers gracefully handle things, rather than running the db.killOp() command (unless it’s a write, in which case you should always try to use db.killOp()).

I hope you have found some of this insightful. Look for future posts from the MongoDB team around other MongoDB areas we like to look at (or in different parts of the system) to help ourselves and clients get to the root of an issue.

by David Murphy at December 05, 2016 07:38 PM