Planet MariaDB

July 13, 2018

Peter Zaitsev

On MySQL and Intel Optane performance

MySQL 8 versus Percona Server with heavy IO application on Intel Optane

Recently, Dimitri published the results of measuring MySQL 8.0 on Intel Optane storage device. In this blog post, I wanted to look at this in more detail and explore the performance of MySQL 8, MySQL 5.7 and Percona Server for MySQL using a similar set up. The Intel Optane is a very capable device, so I was puzzled that Dimitri chose MySQL options that are either not safe or not recommended for production workloads.

Since we have an Intel Optane in our labs, I wanted to run a similar benchmark, but using settings that we would recommend our customers to use, namely:

  • use innodb_checksum
  • use innodb_doublewrite
  • use binary logs with sync_binlog=1
  • enable (by default) Performance Schema

I still used

charset=latin1
  (even though the default is utf8mb4 in MySQL 8) and I set a total size of InnoDB log files to 30GB (as in Dimitri’s benchmark). This setting allocates big InnoDB log files to ensure there is no pressure from adaptive flushing. Though I have concerns about how it works in MySQL 8, this is a topic for another research.

So let’s see how MySQL 8.0 performed with these settings, and compare it with MySQL 5.7 and Percona Server for MySQL 5.7.

I used an Intel Optane SSD 905P 960GB device on the server with 2 socket Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz CPUs.

To highlight the performance difference I wanted to show, I used a single case: sysbench 8 tables 50M rows each (which is about ~120GB of data) and buffer pool 32GB. I ran sysbench oltp_read_write in 128 threads.

First, let’s review the results for MySQL 8 vs MySQL 5.7

After achieving a steady state – we can see that MySQL 8 does not have ANY performance improvements over MySQL 5.7.

Let’s compare this with Percona Server for MySQL 5.7

MySQL 8 versus Percona Server with heavy IO application on Intel Optane

Percona Server for MySQL 5.7 shows about 60% performance improvement over both MySQL 5.7 and MySQL 8.

How did we achieve this? All our improvements are described here: https://www.percona.com/doc/percona-server/LATEST/performance/xtradb_performance_improvements_for_io-bound_highly-concurrent_workloads.html. In short:

  1. Parallel doublewrite.  In both MySQL 5.7 and MySQL 8 writes are serialized by writing to doublewrite.
  2. Multi-threaded LRU flusher. We reported and proposed a solution here https://bugs.mysql.com/bug.php?id=70500. However, Oracle have not incorporated the solution upstream.
  3. Single page eviction. This is another problematic area in MySQL’s flushing algorithm. The bug https://bugs.mysql.com/bug.php?id=81376 was reported over 2 years ago, but unfortunately it’s still overlooked.

Summarizing performance findings:

  • For Percona Server for MySQL during this workload, I observed 1.4 GB/sec  reads and 815 MB/sec  writes
  • For MySQL 5.7 and MySQL 8 the numbers are 824 MB/sec reads and  530 MB/sec writes.

My opinion is that Oracle focused on addressing the wrong performance problems in MySQL 8 and did not address the real issues. In this benchmark, using real production settings, MySQL 8 does not show any significant performance benefits over MySQL 5.7 for workloads characterized by heavy IO writes.

With this, I should admit that Intel Optane is a very performant storage. By comparison, on Intel 3600 SSD under the same workload, for Percona Server I am able to achieve only 2000 tps, which is 2.5x times slower than with Intel Optane.

Drawing some conclusions

So there are a few outcomes I can highlight:

  • Intel Optane is a very capable drive, it is easily the fastest of those we’ve tested so far
  • MySQL 8 is not able to utilize all the power of Intel Optane, unless you use unsafe settings (which to me is the equivalent of driving 200 MPH on a highway without working brakes)
  • Oracle has focused on addressing the wrong IO bottlenecks and has overlooked the real ones
  • To get all the benefits of Intel Optane performance, use a proper server—Percona Server for MySQL—which is able to utilize more IOPS from the device.

The post On MySQL and Intel Optane performance appeared first on Percona Database Performance Blog.

by Vadim Tkachenko at July 13, 2018 03:25 PM

Jean-Jerome Schmidt

How to perform Schema Changes in MySQL & MariaDB in a Safe Way

Before you attempt to perform any schema changes on your production databases, you should make sure that you have a rock solid rollback plan; and that your change procedure has been successfully tested and validated in a separate environment. At the same time, it’s your responsibility to make sure that the change causes none or the least possible impact acceptable to the business. It’s definitely not an easy task.

In this article, we will take a look at how to perform database changes on MySQL and MariaDB in a controlled way. We will talk about some good habits in your day-to-day DBA work. We’ll focus on pre-requirements and tasks during the actual operations and problems that you may face when you deal with database schema changes. We will also talk about open source tools that may help you in the process.

Test and rollback scenarios

Backup

There are many ways to lose your data. Schema upgrade failure is one of them. Unlike application code, you can’t drop a bundle of files and declare that a new version has been successfully deployed. You also can’t just put back an older set of files to rollback your changes. Of course, you can run another SQL script to change the database again, but there are cases when the only accurate way to roll back changes is by restoring the entire database from backup.

However, what if you can’t afford to rollback your database to the latest backup, or your maintenance window is not big enough (considering system performance), so you can’t perform a full database backup before the change?

One may have a sophisticated, redundant environment, but as long as data is modified in both primary and standby locations, there is not much to do about it. Many scripts can just be run once, or the changes are impossible to undo. Most of the SQL change code falls into two groups:

  • Run once – you can’t add the same column to the table twice.
  • Impossible to undo – once you’ve dropped that column, it’s gone. You could undoubtedly restore your database, but that’s not precisely an undo.

You can tackle this problem in at least two possible ways. One would be to enable the binary log and take a backup, which is compatible with PITR. Such backup has to be full, complete and consistent. For xtrabackup, as long as it contains a full dataset, it will be PITR-compatible. For mysqldump, there is an option to make it PITR-compatible too. For smaller changes, a variation of mysqldump backup would be to take only a subset of data to change. This can be done with --where option. The backup should be part of the planned maintenance.

mysqldump -u -p --lock-all-tables --where="WHERE employee_id=100" mydb employees> backup_table_tmp_change_07132018.sql

Another possibility is to use CREATE TABLE AS SELECT.

You can store data or simple structure changes in the form of a fixed temporary table. With this approach you will get a source if you need to rollback your changes. It may be quite handy if you don’t change much data. The rollback can be done by taking data out from it. If any failures occur while copying the data to the table, it is automatically dropped and not created, so make sure that your statement creates a copy you need.

Obviously, there are some limitations too.

Because the ordering of the rows in the underlying SELECT statements cannot always be determined, CREATE TABLE ... IGNORE SELECT and CREATE TABLE ... REPLACE SELECT are flagged as unsafe for statement-based replication. Such statements produce a warning in the error log when using statement-based mode and are written to the binary log using the row-based format when using MIXED mode.

A very simple example of such method could be:

CREATE TABLE tmp_employees_change_07132018 AS SELECT * FROM employees where employee_id=100;
UPDATE employees SET salary=120000 WHERE employee_id=100;
COMMMIT;

Another interesting option may be MariaDB flashback database. When a wrong update or delete happens, and you would like to revert to a state of the database (or just a table) at a certain point in time, you may use the flashback feature.

Point-in-time rollback enables DBAs to recover data faster by rolling back transactions to a previous point in time rather than performing a restore from a backup. Based on ROW-based DML events, flashback can transform the binary log and reverse purposes. That means it can help undo given row changes fast. For instance, it can change DELETE events to INSERTs and vice versa, and it will swap WHERE and SET parts of the UPDATE events. This simple idea can dramatically speed up recovery from certain types of mistakes or disasters. For those who are familiar with the Oracle database, it’s a well known feature. The limitation of MariaDB flashback is the lack of DDL support.

Create a delayed replication slave

Since version 5.6, MySQL supports delayed replication. A slave server can lag behind the master by at least a specified amount of time. The default delay is 0 seconds. Use the MASTER_DELAY option for CHANGE MASTER TO to set the delay to N seconds:

CHANGE MASTER TO MASTER_DELAY = N;

It would be a good option if you didn’t have time to prepare a proper recovery scenario. You need to have enough delay to notice the problematic change. The advantage of this approach is that you don’t need to restore your database to take out data needed to fix your change. Standby DB is up and running, ready to pick up data which minimizes the time needed.

Create an asynchronous slave which is not part of the cluster

When it comes to Galera cluster, testing changes is not easy. All nodes run the same data, and heavy load can harm flow control. So you not only need to check if changes applied successfully, but also what the impact to the cluster state was. To make your test procedure as close as possible to the production workload, you may want to add an asynchronous slave to your cluster and run your test there. The test will not impact synchronization between cluster nodes, because technically it’s not part of the cluster, but you will have an option to check it with real data. Such slave can be easily added from ClusterControl.

ClusterControl add asynchronous slave
ClusterControl add asynchronous slave

As shown in the above screenshot, ClusterControl can automate the process of adding an asynchronous slave in a few ways. You can add the node to the cluster, delay the slave. To reduce the impact on the master, you can use an existing backup instead of the master as the data source when building the slave.

Clone database and measure time

A good test should be as close as possible to the production change. The best way to do this is to clone your existing environment.

ClusterControl Clone Cluster for test
ClusterControl Clone Cluster for test

Perform changes via replication

To have better control over your changes, you can apply them on a slave server ahead of time and then do the switchover. For statement-based replication, this works fine, but for row-based replication, this can work up to a certain degree. Row-based replication enables extra columns to exist at the end of the table, so as long as it can write the first columns, it will be fine. First apply these setting to all slaves, then failover to one of the slaves and then implement the change to the master and attach that as a slave. If your modification involves inserting or removing a column in the middle of the table, it will work with row-based replication.

Operation

During the maintenance window, we do not want to have application traffic on the database. Sometimes it is hard to shut down all applications spread over the whole company. Alternatively, we want to allow only some specific hosts to access MySQL from remote (for example the monitoring system or the backup server). For this purpose, we can use the Linux packet filtering. To see what packet filtering rules are available, we can run the following command:

iptables -L INPUT -v

To close the MySQL port on all interfaces we use:

iptables -A INPUT -p tcp --dport mysql -j DROP

and to open the MySQL port again after the maintenance window:

iptables -D INPUT -p tcp --dport mysql -j DROP

For those without root access, you can change max_connection to 1 or 'skip networking'.

Logging

To get the logging process started, use the tee command at the MySQL client prompt, like this:

mysql> tee /tmp/my.out;

That command tells MySQL to log both the input and output of your current MySQL login session to a file named /tmp/my.out .Then execute your script file with source command.

To get a better idea of your execution times, you can combine it with the profiler feature. Start the profiler with

SET profiling = 1;

Then execute your Query with

SHOW PROFILES;

you see a list of queries the profiler has statistics for. So finally, you choose which query to examine with

SHOW PROFILE FOR QUERY 1;

Schema migration tools

Many times, a straight ALTER on the master is not possible - most of the cases it causes lag on the slave, and this may not be acceptable to the applications. What can be done, though, is to execute the change in a rolling mode. You can start with slaves and, once the change is applied to the slave, migrate one of the slaves as a new master, demote the old master to a slave and execute the change on it.

A tool that may help with such a task is Percona’s pt-online-schema-change. Pt-online-schema-change is straightforward - it creates a temporary table with the desired new schema (for instance, if we added an index, or removed a column from a table). Then, it creates triggers on the old table. Those triggers are there to mirror changes that happen on the original table to the new table. Changes are mirrored during the schema change process. If a row is added to the original table, it is also added to the new one. It emulates the way that MySQL alters tables internally, but it works on a copy of the table you wish to alter. It means that the original table is not locked, and clients may continue to read and change data in it.

Likewise, if a row is modified or deleted on the old table, it is also applied in the new table. Then, a background process of copying data (using LOW_PRIORITY INSERT) between old and new table begins. Once data has been copied, RENAME TABLE is executed.

Another intresting tool is gh-ost. Gh-ost creates a temporary table with the altered schema, just like pt-online-schema-change does. It executes INSERT queries, which use the following pattern to copy data from old to new table. Nevertheless it does not use triggers. Unfortunately triggers may be the source of many limitations. gh-ost uses the binary log stream to capture table changes and asynchronously applies them onto the ghost table. Once we verified that gh-ost can execute our schema change correctly, it’s time to actually execute it. Keep in mind that you may need to manually drop old tables that were created by gh-ost during the process of testing the migration. You can also use --initially-drop-ghost-table and --initially-drop-old-table flags to ask gh-ost to do it for you. The final command to execute is exactly the same as we used to test our change, we just added --execute to it.

pt-online-schema-change and gh-ost are very popular among Galera users. Nevertheless Galera has some additional options.The two methods Total Order Isolation (TOI) and Rolling Schema Upgrade (RSU) have both their pros and cons.

TOI - This is the default DDL replication method. The node that originates the writeset detects DDL at parsing time and sends out a replication event for the SQL statement before even starting the DDL processing. Schema upgrades run on all cluster nodes in the same total order sequence, preventing other transactions from committing for the duration of the operation. This method is good when you want your online schema upgrades to replicate through the cluster and don’t mind locking the entire table (similar to how default schema changes happened in MySQL).

SET GLOBAL wsrep_OSU_method='TOI';

RSU - perfom the schema upgrades locally. In this method, your writes are affecting only the node on which they are run. The changes do not replicate to the rest of the cluster.This method is good for non-conflicting operations and it will not slow down the cluster.

SET GLOBAL wsrep_OSU_method='RSU';

While the node processes the schema upgrade, it desynchronizes with the cluster. When it finishes processing the schema upgrade, it applies delayed replication events and synchronizes itself with the cluster. This could be a good option to run heavy index creations.

Conclusion

We presented here several different methods that may help you with planning your schema changes. Of course it all depends on your application and business requirements. You can design your change plan, perform necessary tests, but there is still a small chance that something will go wrong. According to Murphy’s law - “things will go wrong in any given situation, if you give them a chance”. So make sure you try out different ways of performing these changes, and pick the one that you are the most comfortable with.

by Bart Oles at July 13, 2018 09:29 AM

July 12, 2018

MariaDB AB

Ask the Experts – Live Q&A on MariaDB TX 3.0 and MariaDB Server 10.3

Ask the Experts – Live Q&A on MariaDB TX 3.0 and MariaDB Server 10.3 MariaDB Team Thu, 07/12/2018 - 15:59

Until the recent release of MariaDB TX 3.0, several enterprise database features were available only from proprietary vendors such as Oracle, Microsoft and IBM. It’s no wonder that our first webinar on MariaDB TX 3.0 drew a large crowd – and spurred a lively discussion with lots of great follow-up questions.

That’s why we’re hosting an Ask the Experts webinar on Tuesday, July 17. We’ll start with a brief overview of those new enterprise-grade features, including temporal tables and queries, the MyRocks and Spider storage engines, and more. Then we’ll dig into the nitty-gritty—a deep-dive Q&A session with MariaDB senior product and engineering leads.

Take the opportunity to get answers to your implementation questions and understand how MariaDB TX 3.0 (which includes MariaDB Server 10.3 at its core) can help your organization achieve cost savings, simplify migration, support multiple workloads with different characteristics–whether transactional, analytical, write-intensive or extreme scale, and more. Come prepared with everything you’ve ever wanted to know about MariaDB TX and the enterprise open source approach.

The Q&A takes place Tuesday, July 17 – 10 a.m. EDT / 2 p.m. GMT.

Save Your Seat

And if you missed our What’s New in MariaDB TX 3.0 webinar (or just want a review as you head into the Q&A session), watch it on demand anytime.

Login or Register to post comments

by MariaDB Team at July 12, 2018 07:59 PM

Serge Frezefond

Porting this Oracle MySQL feature to MariaDB would be great ;-)

Oracle has done a great technical work with MySQL. Specifically a nice job has been done around security. There is one useful feature that exists in Oracle MySQL and that currently does not exist in MariaDB. Oracle MySQL offers the possibility from within the server to generate asymetric key pairs. It is then possible use [...]

by Serge at July 12, 2018 03:05 PM

Peter Zaitsev

Why MySQL Stored Procedures, Functions and Triggers Are Bad For Performance

Execution map for func1()

MySQL stored procedures, functions and triggers are tempting constructs for application developers. However, as I discovered, there can be an impact on database performance when using MySQL stored routines. Not being entirely sure of what I was seeing during a customer visit, I set out to create some simple tests to measure the impact of triggers on database performance. The outcome might surprise you.

Why stored routines are not optimal performance wise: short version

Recently, I worked with a customer to profile the performance of triggers and stored routines. What I’ve learned about stored routines: “dead” code (the code in a branch which will never run) can still significantly slow down the response time of a function/procedure/trigger. We will need to be careful to clean up what we do not need.

Profiling MySQL stored functions

Let’s compare these four simple stored functions (in MySQL 5.7):

Function 1:

CREATE DEFINER=`root`@`localhost` FUNCTION `func1`() RETURNS int(11)
BEGIN
	declare r int default 0;
RETURN r;
END

This function simply declares a variable and returns it. It is a dummy function

Function 2:

CREATE DEFINER=`root`@`localhost` FUNCTION `func2`() RETURNS int(11)
BEGIN
    declare r int default 0;
    IF 1=2
    THEN
		select levenshtein_limit_n('test finc', 'test func', 1000) into r;
    END IF;
RETURN r;
END

This function calls another function, levenshtein_limit_n (calculates levenshtein distance). But wait: this code will never run – the condition IF 1=2 will never be true. So that is the same as function 1.

Function 3:

CREATE DEFINER=`root`@`localhost` FUNCTION `func3`() RETURNS int(11)
BEGIN
    declare r int default 0;
    IF 1=2 THEN
		select levenshtein_limit_n('test finc', 'test func', 1) into r;
    END IF;
    IF 2=3 THEN
		select levenshtein_limit_n('test finc', 'test func', 10) into r;
    END IF;
    IF 3=4 THEN
		select levenshtein_limit_n('test finc', 'test func', 100) into r;
    END IF;
    IF 4=5 THEN
		select levenshtein_limit_n('test finc', 'test func', 1000) into r;
    END IF;
RETURN r;
END

Here there are four conditions and none of these conditions will be true: there are 4 calls of “dead” code. The result of the function call for function 3 will be the same as function 2 and function 1.

Function 4:

CREATE DEFINER=`root`@`localhost` FUNCTION `func3_nope`() RETURNS int(11)
BEGIN
    declare r int default 0;
    IF 1=2 THEN
		select does_not_exit('test finc', 'test func', 1) into r;
    END IF;
    IF 2=3 THEN
		select does_not_exit('test finc', 'test func', 10) into r;
    END IF;
    IF 3=4 THEN
		select does_not_exit('test finc', 'test func', 100) into r;
    END IF;
    IF 4=5 THEN
		select does_not_exit('test finc', 'test func', 1000) into r;
    END IF;
RETURN r;
END

This is the same as function 3 but the function we are running does not exist. Well, it does not matter as the

select does_not_exit
  will never run.

So all the functions will always return 0. We expect that the performance of these functions will be the same or very similar. Surprisingly it is not the case! To measure the performance I used the “benchmark” function to run the same function 1M times. Here are the results:

+-----------------------------+
| benchmark(1000000, func1()) |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (1.75 sec)
+-----------------------------+
| benchmark(1000000, func2()) |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (2.45 sec)
+-----------------------------+
| benchmark(1000000, func3()) |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (3.85 sec)
+----------------------------------+
| benchmark(1000000, func3_nope()) |
+----------------------------------+
|                                0 |
+----------------------------------+
1 row in set (3.85 sec)

As we can see func3 (with four dead code calls which will never be executed, otherwise identical to func1) runs almost 3x slower compared to func1(); func3_nope() is identical in terms of response time to func3().

Visualizing all system calls from functions

To figure out what is happening inside the function calls I used performance_schema / sys schema to create a trace with ps_trace_thread() procedure

  1. Get the thread_id for the MySQL connection:
    mysql> select THREAD_ID from performance_schema.threads where processlist_id = connection_id();
    +-----------+
    | THREAD_ID |
    +-----------+
    |        49 |
    +-----------+
    1 row in set (0.00 sec)
  2. Run ps_trace_thread in another connection passing the thread_id=49:
    mysql> CALL sys.ps_trace_thread(49, concat('/var/lib/mysql-files/stack-func1-run1.dot'), 10, 0, TRUE, TRUE, TRUE);
    +--------------------+
    | summary            |
    +--------------------+
    | Disabled 0 threads |
    +--------------------+
    1 row in set (0.00 sec)
    +---------------------------------------------+
    | Info                                        |
    +---------------------------------------------+
    | Data collection starting for THREAD_ID = 49 |
    +---------------------------------------------+
    1 row in set (0.00 sec)
  3. At that point I switched to the original connection (thread_id=49) and run:
    mysql> select func1();
    +---------+
    | func1() |
    +---------+
    |       0 |
    +---------+
    1 row in set (0.00 sec)
  4. The sys.ps_trace_thread collected the data (for 10 seconds, during which I ran the
    select func1()
     ), then it finished its collection and created the dot file:
    +-----------------------------------------------------------------------+
    | Info                                                                  |
    +-----------------------------------------------------------------------+
    | Stack trace written to /var/lib/mysql-files/stack-func3nope-new12.dot |
    +-----------------------------------------------------------------------+
    1 row in set (9.21 sec)
    +-------------------------------------------------------------------------------+
    | Convert to PDF                                                                |
    +-------------------------------------------------------------------------------+
    | dot -Tpdf -o /tmp/stack_49.pdf /var/lib/mysql-files/stack-func3nope-new12.dot |
    +-------------------------------------------------------------------------------+
    1 row in set (9.21 sec)
    +-------------------------------------------------------------------------------+
    | Convert to PNG                                                                |
    +-------------------------------------------------------------------------------+
    | dot -Tpng -o /tmp/stack_49.png /var/lib/mysql-files/stack-func3nope-new12.dot |
    +-------------------------------------------------------------------------------+
    1 row in set (9.21 sec)
    Query OK, 0 rows affected (9.45 sec)

I repeated these steps for all the functions above and then created charts of the commands.

Here are the results:

Func1()

Execution map for func1()

Func2()

Execution map for func2()

Func3()

Execution map for func3()

 

As we can see there is a sp/jump_if_not call for every “if” check followed by an opening tables statement (which is quite interesting). So parsing the “IF” condition made a difference.

For MySQL 8.0 we can also see MySQL source code documentation for stored routines which documents how it is implemented. It reads:

Flow Analysis Optimizations
After code is generated, the low level sp_instr instructions are optimized. The optimization focuses on two areas:

Dead code removal,
Jump shortcut resolution.
These two optimizations are performed together, as they both are a problem involving flow analysis in the graph that represents the generated code.

The code that implements these optimizations is sp_head::optimize().

However, this does not explain why it executes “opening tables”. I have filed a bug.

When slow functions actually make a difference

Well, if we do not plan to run one million of those stored functions we will never even notice the difference. However, where it will make a difference is … inside a trigger. Let’s say that we have a trigger on a table: every time we update that table it executes a trigger to update another field. Here is an example: let’s say we have a table called “form” and we simply need to update its creation date:

mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.31 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

That is good and fast. Now we create a trigger which will call our dummy func1():

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
	declare r int default 0;
	select func1() into r;
END

Now repeat the update. Remember: it does not change the result of the update as we do not really do anything inside the trigger.

mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.90 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

Just adding a dummy trigger will add 2x overhead: the next trigger, which does not even run a function, introduces a slowdown:

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
	declare r int default 0;
END
mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.52 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

Now, lets use func3 (which has “dead” code and is equivalent to func1):

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
	declare r int default 0;
	select func3() into r;
END
mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (1.06 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

However, running the code from the func3 inside the trigger (instead of calling a function) will speed up the update:

CREATE DEFINER=`root`@`localhost` TRIGGER `test`.`form_AFTER_UPDATE`
AFTER UPDATE ON `form`
FOR EACH ROW
BEGIN
    declare r int default 0;
    IF 1=2 THEN
		select levenshtein_limit_n('test finc', 'test func', 1) into r;
    END IF;
    IF 2=3 THEN
		select levenshtein_limit_n('test finc', 'test func', 10) into r;
    END IF;
    IF 3=4 THEN
		select levenshtein_limit_n('test finc', 'test func', 100) into r;
    END IF;
    IF 4=5 THEN
		select levenshtein_limit_n('test finc', 'test func', 1000) into r;
    END IF;
END
mysql> update form set form_created_date = NOW() where form_id > 5000;
Query OK, 65536 rows affected (0.66 sec)
Rows matched: 65536  Changed: 65536  Warnings: 0

Memory allocation

Potentially, even if the code will never run, MySQL will still need to parse the stored routine—or trigger—code for every execution, which can potentially lead to a memory leak, as described in this bug.

Conclusion

Stored routines and trigger events are parsed when they are executed. Even “dead” code that will never run can significantly affect the performance of bulk operations (e.g. when running this inside the trigger). That also means that disabling a trigger by setting a “flag” (e.g.

if @trigger_disable = 0 then ...
 ) can still affect performance of bulk operations.

The post Why MySQL Stored Procedures, Functions and Triggers Are Bad For Performance appeared first on Percona Database Performance Blog.

by Alexander Rubin at July 12, 2018 01:19 PM

July 11, 2018

Jean-Jerome Schmidt

Galera Cluster Recovery 101 - A Deep Dive into Network Partitioning

One of the cool features in Galera is automatic node provisioning and membership control. If a node fails or loses communication, it will be automatically evicted from the cluster and remain unoperational. As long as the majority of nodes are still communicating (Galera calls this PC - primary component), there is a very high chance the failed node would be able to automatically rejoin, resync and resume the replication once the connectivity is back.

Generally, all Galera nodes are equal. They hold the same data set and same role as masters, capable of handling read and write simultaneously, thanks to Galera group communication and certification-based replication plugin. Therefore, there is actually no failover from the database point-of-view due to this equilibrium. Only from the application side that would require failover, to skip the unoperational nodes while the cluster is partitioned.

In this blog post, we are going to look into understanding how Galera Cluster performs node and cluster recovery in case network partition happens. Just as a side note, we have covered a similar topic in this blog post some time back. Codership has explained Galera's recovery concept in great details in the documentation page, Node Failure and Recovery.

Node Failure and Eviction

In order to understand the recovery, we have to understand how Galera detects the node failure and eviction process first. Let's put this into a controlled test scenario so we can understand the eviction process better. Suppose we have a three-node Galera Cluster as illustrated below:

The following command can be used to retrieve our Galera provider options:

mysql> SHOW VARIABLES LIKE 'wsrep_provider_options'\G

It's a long list, but we just need to focus on some of the parameters to explain the process:

evs.inactive_check_period = PT0.5S; 
evs.inactive_timeout = PT15S; 
evs.keepalive_period = PT1S; 
evs.suspect_timeout = PT5S; 
evs.view_forget_timeout = P1D;
gmcast.peer_timeout = PT3S;

First of all, Galera follows ISO 8601 formatting to represent duration. P1D means the duration is one day, while PT15S means the duration is 15 seconds (note the time designator, T, that precedes the time value). For example if one wanted to increase evs.view_forget_timeout to 1 day and a half, one would set P1DT12H, or PT36H.

Considering all hosts haven't been configured with any firewall rules, we use the following script called block_galera.sh on galera2 to simulate a network failure to/from this node:

#!/bin/bash
# block_galera.sh
# galera2, 192.168.55.172

iptables -I INPUT -m tcp -p tcp --dport 4567 -j REJECT
iptables -I INPUT -m tcp -p tcp --dport 3306 -j REJECT
iptables -I OUTPUT -m tcp -p tcp --dport 4567 -j REJECT
iptables -I OUTPUT -m tcp -p tcp --dport 3306 -j REJECT
# print timestamp
date

By executing the script, we get the following output:

$ ./block_galera.sh
Wed Jul  4 16:46:02 UTC 2018

The reported timestamp can be considered as the start of the cluster partitioning, where we lose galera2, while galera1 and galera3 are still online and accessible. At this point, our Galera Cluster architecture is looking something like this:

From Partitioned Node Perspective

On galera2, you will see some printouts inside the MySQL error log. Let's break them out into several parts. The downtime was started around 16:46:02 UTC time and after gmcast.peer_timeout=PT3S, the following appears:

2018-07-04 16:46:05 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') connection to peer 8b2041d6 with addr tcp://192.168.55.173:4567 timed out, no messages seen in PT3S
2018-07-04 16:46:05 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.55.173:4567
2018-07-04 16:46:06 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') connection to peer 737422d6 with addr tcp://192.168.55.171:4567 timed out, no messages seen in PT3S
2018-07-04 16:46:06 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 0

As it passed evs.suspect_timeout = PT5S, both nodes galera1 and galera3 are suspected as dead by galera2:

2018-07-04 16:46:07 140454904243968 [Note] WSREP: evs::proto(62116b35, OPERATIONAL, view_id(REG,62116b35,54)) suspecting node: 8b2041d6
2018-07-04 16:46:07 140454904243968 [Note] WSREP: evs::proto(62116b35, OPERATIONAL, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive
2018-07-04 16:46:07 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 0
2018-07-04 16:46:08 140454904243968 [Note] WSREP: evs::proto(62116b35, GATHER, view_id(REG,62116b35,54)) suspecting node: 737422d6
2018-07-04 16:46:08 140454904243968 [Note] WSREP: evs::proto(62116b35, GATHER, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive

Then, Galera will revise the current cluster view and the position of this node:

2018-07-04 16:46:09 140454904243968 [Note] WSREP: view(view_id(NON_PRIM,62116b35,54) memb {
        62116b35,0
} joined {
} left {
} partitioned {
        737422d6,0
        8b2041d6,0
})
2018-07-04 16:46:09 140454904243968 [Note] WSREP: view(view_id(NON_PRIM,62116b35,55) memb {
        62116b35,0
} joined {
} left {
} partitioned {
        737422d6,0
        8b2041d6,0
})

With the new cluster view, Galera will perform quorum calculation to decide whether this node is part of the primary component. If the new component sees "primary = no", Galera will demote the local node state from SYNCED to OPEN:

2018-07-04 16:46:09 140454288942848 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Flow-control interval: [16, 16]
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Trying to continue unpaused monitor
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Received NON-PRIMARY.
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 2753699)

With the latest change on the cluster view and node state, Galera returns the post-eviction cluster view and global state as below:

2018-07-04 16:46:09 140454222194432 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2753699, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2018-07-04 16:46:09 140454222194432 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

You can see the following global status of galera2 have changed during this period:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
| VARIABLE_NAME             | VARIABLE_VALUE                                                                                                                    |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 1                                                                                                                                 |
| WSREP_CLUSTER_STATUS      | non-Primary                                                                                                                       |
| WSREP_EVS_DELAYED         | 737422d6-7db3-11e8-a2a2-bbe98913baf0:tcp://192.168.55.171:4567:1,8b2041d6-7f62-11e8-87d5-12a76678131f:tcp://192.168.55.173:4567:2 |
| WSREP_LOCAL_STATE_COMMENT | Initialized                                                                                                                       |
| WSREP_READY               | OFF                                                                                                                               |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+

At this point, MySQL/MariaDB server on galera2 is still accessible (database is listening on 3306 and Galera on 4567) and you can query the mysql system tables and list out the databases and tables. However when you jump into the non-system tables and make a simple query like this:

mysql> SELECT * FROM sbtest1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

You will immediately get an error indicating WSREP is loaded but not ready to use by this node, as reported by wsrep_ready status. This is due to the node losing its connection to the Primary Component and it enters the non-operational state (the local node status was changed from SYNCED to OPEN). Data reads from nodes in a non-operational state are considered stale, unless you set wsrep_dirty_reads=ON to permit reads, although Galera still rejects any command that modifies or updates the database.

Finally, Galera will keep on listening and reconnecting to other members in the background infinitely:

2018-07-04 16:47:12 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 30
2018-07-04 16:47:13 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 30
2018-07-04 16:48:20 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 60
2018-07-04 16:48:22 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 60

The eviction process flow by Galera group communication for the partitioned node during network issue can be summarized as below:

  1. Disconnects from the cluster after gmcast.peer_timeout.
  2. Suspects other nodes after evs.suspect_timeout.
  3. Retrieves the new cluster view.
  4. Performs quorum calculation to determine the node's state.
  5. Demotes the node from SYNCED to OPEN.
  6. Attempts to reconnect to the primary component (other Galera nodes) in the background.

From Primary Component Perspective

On galera1 and galera3 respectively, after gmcast.peer_timeout=PT3S, the following appears in the MySQL error log:

2018-07-04 16:46:05 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.55.172:4567
2018-07-04 16:46:06 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') reconnecting to 62116b35 (tcp://192.168.55.172:4567), attempt 0

After it passed evs.suspect_timeout = PT5S, galera2 is suspected as dead by galera3 (and galera1):

2018-07-04 16:46:10 139955510687488 [Note] WSREP: evs::proto(8b2041d6, OPERATIONAL, view_id(REG,62116b35,54)) suspecting node: 62116b35
2018-07-04 16:46:10 139955510687488 [Note] WSREP: evs::proto(8b2041d6, OPERATIONAL, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive

Galera checks out if the other nodes respond to the group communication on galera3, it finds galera1 is in primary and stable state:

2018-07-04 16:46:11 139955510687488 [Note] WSREP: declaring 737422d6 at tcp://192.168.55.171:4567 stable
2018-07-04 16:46:11 139955510687488 [Note] WSREP: Node 737422d6 state prim

Galera revises the cluster view of this node (galera3):

2018-07-04 16:46:11 139955510687488 [Note] WSREP: view(view_id(PRIM,737422d6,55) memb {
        737422d6,0
        8b2041d6,0
} joined {
} left {
} partitioned {
        62116b35,0
})
2018-07-04 16:46:11 139955510687488 [Note] WSREP: save pc into disk

Galera then removes the partitioned node from the Primary Component:

2018-07-04 16:46:11 139955510687488 [Note] WSREP: forgetting 62116b35 (tcp://192.168.55.172:4567)

The new Primary Component is now consisted of two nodes, galera1 and galera3:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2

The Primary Component will exchange the state between each other to agree on the new cluster view and global state:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-07-04 16:46:11 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: sent state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: got state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993 from 0 (192.168.55.171)
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: got state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993 from 1 (192.168.55.173)

Galera calculates and verifies the quorum of the state exchange between online members:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 27,
        members    = 2/2 (joined/total),
        act_id     = 2753703,
        last_appl. = 2753606,
        protocols  = 0/8/3 (gcs/repl/appl),
        group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc
2018-07-04 16:46:11 139955502294784 [Note] WSREP: Flow-control interval: [23, 23]
2018-07-04 16:46:11 139955502294784 [Note] WSREP: Trying to continue unpaused monitor

Galera updates the new cluster view and global state after galera2 eviction:

2018-07-04 16:46:11 139955214169856 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2753703, view# 28: Primary, number of nodes: 2, my index: 1, protocol version 3
2018-07-04 16:46:11 139955214169856 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-07-04 16:46:11 139955214169856 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-07-04 16:46:11 139955214169856 [Note] WSREP: Assign initial position for certification: 2753703, protocol version: 3
2018-07-04 16:46:11 139956691814144 [Note] WSREP: Service thread queue flushed.
Clean up the partitioned node (galera2) from the active list:
2018-07-04 16:46:14 139955510687488 [Note] WSREP: cleaning up 62116b35 (tcp://192.168.55.172:4567)

At this point, both galera1 and galera3 will be reporting similar global status:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+------------------------------------------------------------------+
| VARIABLE_NAME             | VARIABLE_VALUE                                                   |
+---------------------------+------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 2                                                                |
| WSREP_CLUSTER_STATUS      | Primary                                                          |
| WSREP_EVS_DELAYED         | 1491abd9-7f6d-11e8-8930-e269b03673d8:tcp://192.168.55.172:4567:1 |
| WSREP_LOCAL_STATE_COMMENT | Synced                                                           |
| WSREP_READY               | ON                                                               |
+---------------------------+------------------------------------------------------------------+

They list out the problematic member in the wsrep_evs_delayed status. Since the local state is "Synced", these nodes are operational and you can redirect the client connections from galera2 to any of them. If this step is inconvenient, consider using a load balancer sitting in front of the database to simplify the connection endpoint from the clients.

Node Recovery and Joining

A partitioned Galera node will keep on attempting to establish connection with the Primary Component infinitely. Let's flush the iptables rules on galera2 to let it connect with the remaining nodes:

# on galera2
$ iptables -F

Once the node is capable of connecting to one of the nodes, Galera will start re-establishing the group communication automatically:

2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 8b2041d6 tcp://192.168.55.173:4567
2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 737422d6 tcp://192.168.55.171:4567
2018-07-09 10:46:34 140075962705664 [Note] WSREP: declaring 737422d6 at tcp://192.168.55.171:4567 stable
2018-07-09 10:46:34 140075962705664 [Note] WSREP: declaring 8b2041d6 at tcp://192.168.55.173:4567 stable

Node galera2 will then connect to one of the Primary Component (in this case is galera1, node ID 737422d6) to get the current cluster view and nodes state:

2018-07-09 10:46:34 140075962705664 [Note] WSREP: Node 737422d6 state prim
2018-07-09 10:46:34 140075962705664 [Note] WSREP: view(view_id(PRIM,1491abd9,142) memb {
        1491abd9,0
        737422d6,0
        8b2041d6,0
} joined {
} left {
} partitioned {
})
2018-07-09 10:46:34 140075962705664 [Note] WSREP: save pc into disk

Galera will then perform state exchange with the rest of the members that can form the Primary Component:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: sent state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 0 (192.168.55.172)
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 1 (192.168.55.171)
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 2 (192.168.55.173)

The state exchange allows galera2 to calculate the quorum and produce the following result:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 71,
        members    = 2/3 (joined/total),
        act_id     = 2836958,
        last_appl. = 0,
        protocols  = 0/8/3 (gcs/repl/appl),
        group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc

Galera will then promote the local node state from OPEN to PRIMARY, to start and establish the node connection to the Primary Component:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Flow-control interval: [28, 28]
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Trying to continue unpaused monitor
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 2836958)

As reported by the above line, Galera calculates the gap on how far the node is behind from the cluster. This node requires state transfer to catch up to writeset number 2836958 from 2761994:

2018-07-09 10:46:34 140075929970432 [Note] WSREP: State transfer required:
        Group state: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
        Local state: 55238f52-41ee-11e8-852f-3316bdb654bc:2761994
2018-07-09 10:46:34 140075929970432 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958, view# 72: Primary, number of nodes:
3, my index: 0, protocol version 3
2018-07-09 10:46:34 140075929970432 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Assign initial position for certification: 2836958, protocol version: 3

Galera prepares the IST listener on port 4568 on this node and asks any Synced node in the cluster to become a donor. In this case, Galera automatically picks galera3 (192.168.55.173), or it could also pick a donor from the list under wsrep_sst_donor (if defined) for the syncing operation:

2018-07-09 10:46:34 140075996276480 [Note] WSREP: Service thread queue flushed.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: IST receiver addr using tcp://192.168.55.172:4568
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.55.172:4568
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Member 0.0 (192.168.55.172) requested state transfer from '*any*'. Selected 2.0 (192.168.55.173)(SYNCED) as donor.

It will then change the local node state from PRIMARY to JOINER. At this stage, galera2 is granted with state transfer request and starts to cache write-sets:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 2836958)
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Requesting state transfer: success, donor: 2
2018-07-09 10:46:34 140075929970432 [Note] WSREP: GCache history reset: 55238f52-41ee-11e8-852f-3316bdb654bc:2761994 -> 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
2018-07-09 10:46:34 140075929970432 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset

Node galera2 starts receiving the missing writesets from the selected donor's gcache (galera3):

2018-07-09 10:46:34 140075954312960 [Note] WSREP: 2.0 (192.168.55.173): State transfer to 0.0 (192.168.55.172) complete.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Receiving IST: 74964 writesets, seqnos 2761994-2836958
2018-07-09 10:46:34 140075593627392 [Note] WSREP: Receiving IST...  0.0% (    0/74964 events) complete.
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Member 2.0 (192.168.55.173) synced with group.
2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 737422d6 tcp://192.168.55.171:4567
2018-07-09 10:46:41 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-07-09 10:46:44 140075593627392 [Note] WSREP: Receiving IST... 36.0% (27008/74964 events) complete.
2018-07-09 10:46:54 140075593627392 [Note] WSREP: Receiving IST... 71.6% (53696/74964 events) complete.
2018-07-09 10:47:02 140075593627392 [Note] WSREP: Receiving IST...100.0% (74964/74964 events) complete.
2018-07-09 10:47:02 140075929970432 [Note] WSREP: IST received: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
2018-07-09 10:47:02 140075954312960 [Note] WSREP: 0.0 (192.168.55.172): State transfer from 2.0 (192.168.55.173) complete.

Once all the missing writesets are received and applied, Galera will promote galera2 as JOINED until seqno 2837012:

2018-07-09 10:47:02 140075954312960 [Note] WSREP: Shifting JOINER -> JOINED (TO: 2837012)
2018-07-09 10:47:02 140075954312960 [Note] WSREP: Member 0.0 (192.168.55.172) synced with group.

The node applies any cached writesets in its slave queue and finishes catching up with the cluster. Its slave queue is now empty. Galera will promote galera2 to SYNCED, indicating the node is now operational and ready to serve clients:

2018-07-09 10:47:02 140075954312960 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2837012)
2018-07-09 10:47:02 140076605892352 [Note] WSREP: Synchronized with group, ready for connections

At this point, all nodes are back operational. You can verify by using the following statements on galera2:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+----------------+
| VARIABLE_NAME             | VARIABLE_VALUE |
+---------------------------+----------------+
| WSREP_CLUSTER_SIZE        | 3              |
| WSREP_CLUSTER_STATUS      | Primary        |
| WSREP_EVS_DELAYED         |                |
| WSREP_LOCAL_STATE_COMMENT | Synced         |
| WSREP_READY               | ON             |
+---------------------------+----------------+

The wsrep_cluster_size reported as 3 and the cluster status is Primary, indicating galera2 is part of the Primary Component. The wsrep_evs_delayed has also been cleared and the local state is now Synced.

The recovery process flow for the partitioned node during network issue can be summarized as below:

  1. Re-establishes group communication to other nodes.
  2. Retrieves the cluster view from one of the Primary Component.
  3. Performs state exchange with the Primary Component and calculates the quorum.
  4. Changes the local node state from OPEN to PRIMARY.
  5. Calculates the gap between local node and the cluster.
  6. Changes the local node state from PRIMARY to JOINER.
  7. Prepares IST listener/receiver on port 4568.
  8. Requests state transfer via IST and picks a donor.
  9. Starts receiving and applying the missing writeset from chosen donor's gcache.
  10. Changes the local node state from JOINER to JOINED.
  11. Catches up with the cluster by applying the cached writesets in the slave queue.
  12. Changes the local node state from JOINED to SYNCED.
ClusterControl
Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

Cluster Failure

A Galera Cluster is considered failed if no primary component (PC) is available. Consider a similar three-node Galera Cluster as depicted in the diagram below:

A cluster is considered operational if all nodes or majority of the nodes are online. Online means they are able to see each other through Galera's replication traffic or group communication. If no traffic is coming in and out from the node, the cluster will send a heartbeat beacon for the node to response in a timely manner. Otherwise, it will be put into the delay or suspected list according to how the node responses.

If a node goes down, let's say node C, the cluster will remain operational because node A and B are still in quorum with 2 votes out of 3 to form a primary component. You should get the following cluster state on A and B:

mysql> SHOW STATUS LIKE 'wsrep_cluster_status';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+

If let's say a primary switch went kaput, as illustrated in the following diagram:

At this point, every single node loses communication to each other, and the cluster state will be reported as non-Primary on all nodes (as what happened to galera2 in the previous case). Every node would calculate the quorum and find out that it is the minority (1 vote out of 3) thus losing the quorum, which means no Primary Component is formed and consequently all nodes refuse to serve any data. This is deemed as cluster failure.

Once the network issue is resolved, Galera will automatically re-establish the communication between members, exchange node's states and determine the possibility of reforming the primary component by comparing node state, UUIDs and seqnos. If the probability is there, Galera will merge the primary components as shown in the following lines:

2018-06-27  0:16:57 140203784476416 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: sent state msg: 5885911b-795c-11e8-8683-931c85442c7e
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 0 (192.168.55.171)
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 1 (192.168.55.172)
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 2 (192.168.55.173)
2018-06-27  0:16:57 140203784476416 [Warning] WSREP: Quorum: No node with complete state:

        Version      : 4
        Flags        : 0x3
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim  seqno  : 5
        First seqno  : 112714
        Last  seqno  : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.171'
        Incoming addr: '192.168.55.171:3306'

        Version      : 4
        Flags        : 0x2
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim  seqno  : 5
        First seqno  : 112714
        Last  seqno  : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.172'
        Incoming addr: '192.168.55.172:3306'

        Version      : 4
        Flags        : 0x2
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim  seqno  : 5
        First seqno  : 112714
        Last  seqno  : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.173'
        Incoming addr: '192.168.55.173:3306'

2018-06-27  0:16:57 140203784476416 [Note] WSREP: Full re-merge of primary 5224a024-791b-11e8-a0ac-8bc6118b0f96 found: 3 of 3.
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 5,
        members    = 3/3 (joined/total),
        act_id     = 112725,
        last_appl. = 112722,
        protocols  = 0/8/3 (gcs/repl/appl),
        group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Flow-control interval: [28, 28]
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Trying to continue unpaused monitor
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Restored state OPEN -> SYNCED (112725)
2018-06-27  0:16:57 140202564110080 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:112725, view# 6: Primary, number of nodes: 3, my index: 2, protocol version 3

A good indicator to know if the re-bootstrapping process is OK is by looking at the following line in the error log:

[Note] WSREP: Synchronized with group, ready for connections

ClusterControl Auto Recovery

ClusterControl comes with node and cluster automatic recovery features, because it oversees and understands the state of all nodes in the cluster. Automatic recovery is by default enabled if the cluster is deployed using ClusterControl. To enable or disable the cluster, simply clicking on the power icon in the summary bar as shown below:

Green icon means automatic recovery is turned on, while red is the opposite. You can monitor the recovery progress from the Activity -> Jobs dialog, like in this case, galera2 was totally inaccessible due to firewall blocking, thus forcing ClusterControl to report the following:

The recovery process will only be commencing after a graceful timeout (30 seconds) to give Galera node a chance to recover itself beforehand. If ClusterControl fails to recover a node or cluster, it will first pull all MySQL error logs from all accessible nodes and will raise the necessary alarms to notify the user via email or by pushing critical events to the third-party integration modules like PagerDuty, VictorOps or Slack. Manual intervention is then required. For Galera Cluster, ClusterControl will keep on trying to recover the failure until you mark the node as under maintenance, or disable the automatic recovery feature.

ClusterControl's automatic recovery is one of most favorite features as voted by our users. It helps you to take the necessary actions quickly, with a complete report on what has been attempted and recommendation steps to troubleshoot further on the issue. For users with support subscriptions, you can look for extra hands by escalating this issue to our technical support team for assistance.

Conclusion

Galera automatic node recovery and membership control are neat features to simplify the cluster management, improve the database reliability and reduce the risk of human error, as commonly haunting other open-source database replication technology like MySQL Replication, Group Replication and PostgreSQL Streaming/Logical Replication.

by ashraf at July 11, 2018 02:28 PM

Peter Zaitsev

Percona Server for MongoDB 3.6.5-1.3 Is Now Available

MongoRocks

Percona Server for MongoDBPercona announces the release of Percona Server for MongoDB 3.6.5-1.3 on July 11, 2018. Download the latest version from the Percona web site or the Percona Software Repositories.

Percona Server for MongoDB is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 3.6 Community Edition. It supports MongoDB 3.6 protocols and drivers.

Percona Server for MongoDB extends MongoDB Community Edition functionality by including the Percona Memory Engine, as well as several enterprise-grade features. Percona Server for MongoDB requires no changes to MongoDB applications or code.

This release is based on MongoDB 3.6.5 and does not include any additional changes.

The Percona Server for MongoDB 3.6.5-1.3 release notes are available in the official documentation.

The post Percona Server for MongoDB 3.6.5-1.3 Is Now Available appeared first on Percona Database Performance Blog.

by Dmitriy Kostiuk at July 11, 2018 02:25 PM

AMD EPYC Performance Testing… or Don’t get on the wrong side of SystemD

Ubuntu 16 AMD EPYC

Ever since AMD released their EPYC CPU for servers I wanted to test it, but I did not have the opportunity until recently, when Packet.net started offering bare metal servers for a reasonable price. So I started a couple of instances to test Percona Server for MySQL under this CPU. In this benchmark, I discovered some interesting discrepancies in performance between  AMD and Intel CPUs when running under systemd .

The set up

To test CPU performance, I used a read-only in-memory sysbench OLTP benchmark, as it burns CPU cycles and no IO is performed by Percona Server.

For this benchmark I used Packet.net c2.medium.x86 instances powered by AMD EPYC 7401P processors. The OS is exposed to 48 CPU threads.

For the OS I tried

  • Ubuntu 16.04 with default kernel 4.4 and upgraded to 4.15
  • Ubuntu 18.04 with kernel 4.15
  • Percona Server started from SystemD and without SystemD (for reasons which will become apparent later)

To have some points for comparison, I also ran a similar workload on my 2 socket Intel CPU server, with CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. I recognize this is not most recent Intel CPU, but this was the best I had at the time, and it also gave 48 CPU Threads.

Ubuntu 16

First, let’s review the results for Ubuntu 16

Or in tabular format:

Threads Ubuntu 16, kernel 4.4; systemd Ubuntu 16, kernel 4.4;

NO systemd

Ubuntu 16, kernel 4.15
1 943.44 948.7 899.82
2 1858.58 1792.36 1774.45
4 3533.2 3424.05 3555.94
8 6762.35 6731.57 7010.51
12 10012.18 9950.3 10062.82
16 13063.39 13043.55 12893.17
20 15347.68 15347.56 14756.27
24 16886.24 16864.81 16176.76
30 18150.2 18160.07 17860.5
36 18923.06 18811.64 19528.27
42 19374.86 19463.08 21537.79
48 20110.81 19983.05 23388.18
56 20548.51 20362.31 23768.49
64 20860.51 20729.52 23797.14
72 21123.71 21001.06 23645.95
80 21370 21191.24 23546.03
90 21622.54 21441.73 23486.29
100 21806.67 21670.38 23392.72
128 22161.42 22031.53 23315.33
192 22388.51 22207.26 22906.42
256 22091.17 21943.37 22305.06
512 19524.41 19381.69 19181.71

 

There are few conclusions we can see from this data

  1. AMD EPYC CPU scales quite well to the number of CPU Threads
  2. The recent kernel helps to boost the throughput.

Ubuntu 18.04

Now, let’s review the results for Ubuntu 18.04

Threads Ubuntu 18, systemd
Ubuntu 18, NO systemd
1 833.14 843.68
2 1684.21 1693.93
4 3346.42 3359.82
8 6592.88 6597.48
12 9477.92 9487.93
16 12117.12 12149.17
20 13934.27 13933
24 15265.1 15152.74
30 16846.02 16061.16
36 18488.88 16726.14
42 20493.57 17360.56
48 22217.47 17906.4
56 22564.4 17931.83
64 22590.29 17902.95
72 22472.75 17857.73
80 22421.99 17766.76
90 22300.09 17773.57
100 22275.24 17646.7
128 22131.86 17411.55
192 21750.8 17134.63
256 21177.25 16826.53
512 18296.61 17418.72

 

This is where the result surprised me: on Ubuntu 18.04 with SystemD running Percona Server for MySQL as a service the throughput was up to 24% better than if Percona Server for MySQL is started from a bash shell. I do not know exactly what causes this dramatic difference—systemd uses different slices for services and user commands, and somehow it affects the performance.

Baseline benchmark

To establish a baseline, I ran the same benchmark on my Intel box, running Ubuntu 16, and I tried two kernels: 4.13 and 4.15

Threads Ubuntu 16, kernel 4.13, systemd Ubuntu 16, kernel 4.15, systemd
Ubuntu 16, kernel 4.15, NO systemd
1 820.07 798.42 864.21
2 1563.31 1609.96 1681.91
4 2929.63 3186.01 3338.47
8 6075.73 6279.49 6624.49
12 8743.38 9256.18 9622.6
16 10580.14 11351.31 11984.64
20 12790.96 12599.78 14147.1
24 14213.68 14659.49 15716.61
30 15983.78 16096.03 17530.06
36 17574.46 18098.36 20085.9
42 18671.14 19808.92 21875.84
48 19431.05 22036.06 23986.08
56 19737.92 22115.34 24275.72
64 19946.57 21457.32 24054.09
72 20129.7 21729.78 24167.03
80 20214.93 21594.51 24092.86
90 20194.78 21195.61 23945.93
100 20753.44 21597.26 23802.16
128 20235.24 20684.34 23476.82
192 20280.52 20431.18 23108.36
256 20410.55 20952.64 22775.63
512 20953.73 22079.18 23489.3

 

Here we see the opposite result with SystemD: Percona Server running from a bash shell shows the better throughput compared with the SystemD service. So for some reason, systemd works differently for AMD and Intel CPUs. Please let me know if you have any ideas on how to deal with the impact that systemd has on performance.

Conclusions

So there are some conclusions from these results:

  1. AMD EPYC shows a decent performance scalability; the new kernel helps to improve it
  2. systemd shows different effects on throughput for AMD and Intel CPUs
  3. With AMD the throughput declines for a high concurrent workload with 512 threads, while Intel does not show a decline.

The post AMD EPYC Performance Testing… or Don’t get on the wrong side of SystemD appeared first on Percona Database Performance Blog.

by Vadim Tkachenko at July 11, 2018 11:52 AM

July 10, 2018

Peter Zaitsev

When Database Warm Up is Not Really UP

first few minutes MySQL Warm Up graph from PMM

The common wisdom with database performance management is that a “cold” database server has poor performance. Then, as it “warms up”, performance improves until finally you reach a completely warmed up state with peak database performance. In other words, that to get peak performance from MySQL you need to wait for database warm up.

This thinking comes from the point of view of database cache warmup. Indeed from the cache standpoint, you start with an empty cache and over time the cache is filled with data. Moreover the longer the database runs, the more statistics about data access patterns it has, and the better it can manage database cache contents.

Over recent years with the rise of SSDs, cache warmup has become less of an issue. High Performance NVMe Storage can do more than 1GB/sec read, meaning you can warm up a 100GB database cache in less than 2 minutes. Also, SSD IO latency tends to be quite good so you’re not paying as high a penalty for a higher miss rate during the warm up stage.

It is not all so rosy with database performance over time. Databases tend to delay work when possible, but there is only so much delaying you can do. When the database can’t delay work any longer performance tends to be negatively impacted. Here are some examples of delaying work:

  • Checkpointing: depending on the database technology and configuration, checkpointing may be delayed for 30 minutes or more after database start
  • Change Buffer (Innodb) can delay index maintenance work
  • Pushing Messages from Buffers to Leaves (TokuDB) can be delayed until space in the buffers is exhausted
  • Compaction for RocksDB and other LSM-Tree based system can take quite a while to reach steady state

In all these cases database performance can be a lot better almost immediately after start compared to when it is completely “warmed up”.

An experiment with database warm up

Let’s illustrate this with a little experiment running Sysbench with MySQL and Innodb storage engine for 15 minutes:

sysbench --db-driver=mysql --threads=200 --rand-type=uniform --report-interval=10 --percentile=99 --time=900 --mysql-user=root --mysql-password= /usr/share/sysbench/oltp_update_index.lua --table_size=100000000 run

Let’s look in detail at what happens during the run using graphs from Percona Monitoring and Management

PMM graph of first three minutes db warm up

As you can see the number of updates/sec we’re doing actually gets worse (and more uneven) after the first 3 minutes, while a jump to peak performance is almost immediate

InnoDB Checkpoint Age graph from PMM

The log space usage explains some of this—in the first few minutes, we did not need to do as aggressive flushing as we had to do later.

first few minutes MySQL Warm Up graph from PMM

On the InnoDB I/O graph we can see a couple of interesting things. First, you can see how quickly warm up happens—in 2 minutes the IO is already at half of its peak. You can also see the explanation for the little performance dip after its initial high performance (around 19:13)—this is where we got close to using all log space, so active flushing was required while, at the same time, a lot of IO was still needed for cache warmup.

Reaching Steady State is another term commonly used to describe the stage after warm up completes. Note though that such steady state is not guaranteed to be steady at all. In fact, the most typical steady state is unsteady. For example, you can see in this blog post both InnoDB and MyRocks have quite a variance.

Summary

While the term database warm up may imply performance after warm up will be better, it is often not the case. “Reaching Steady State” is a better term as long as you understand that “steady” does not mean uniform performance.

 

The post When Database Warm Up is Not Really UP appeared first on Percona Database Performance Blog.

by Peter Zaitsev at July 10, 2018 01:56 PM

Percona Live Europe 2018 Call for Papers is Now Open

Percona Live Europe Open Source Database Conference PLE 2018

Percona Live Europe Open Source Database Conference PLE 2018Announcing the opening of the Percona Live Europe Open Source Database Conference 2018 in Frankfurt, Germany call for papers. It will be open from now until August 10, 2018.

Our theme this year is
Connect. Accelerate. Innovate.

As a speaker at Percona Live Europe, you’ll have the opportunity to CONNECT with your peers—open source database experts and enthusiasts who share your commitment to improving knowledge and exchanging ideas. ACCELERATE your projects and career by presenting at the premier open source database event, a great way to build your personal and company brands. And influence the evolution of the open source software movement by demonstrating how you INNOVATE!

Community initiatives remain core to the open source ethos, and we are proud of the contribution we make with Percona Live Europe in showcasing thought leading practices in the open source database world.

With a nod to innovation, for the first time, this year we are introducing a business track to benefit those business leaders who are exploring the use of open source and are interested in learning more about its costs and benefits.

Speaking Opportunities

The Percona Live Europe Open Source Database Conference 2018 Call for Papers is open until August 10, 2018. We invite you to submit your speaking proposal for breakout, tutorial or lightning talk sessions. Classes and talks are invited for Foundation (either entry level or of general interest to all), Core (intermediate) and Masterclass (advanced) levels.

If selected, you will receive a complimentary full conference pass.

  • Breakout Session. Broadly cover a technology area using specific examples. Sessions should be either 25 minutes or 50 minutes in length (including Q&A).
  • Tutorial Session. Present a technical session that aims for a level between a training class and a conference breakout session. We encourage attendees to bring and use laptops for working on detailed and hands-on presentations. Tutorials will be three or six hours in length (including Q&A).
  • Lightning Talk. Give a five-minute presentation focusing on one key point that interests the open source community: technical, lighthearted or entertaining talks on new ideas, a successful project, a cautionary story, a quick tip or demonstration.

Topics and Tracks

We want proposals that cover the many aspects of application development using all open source databases, as well as new and interesting ways to monitor and manage database environments. Did you just embrace open source databases this year? What are the technical and business values of moving to or using open source databases? How did you convince your company to make the move? Was there tangible ROI?

Best practices and current trends, including design, application development, performance optimization, HA and clustering, cloud, containers and new technologies, as well as new and interesting ways to monitor and manage database environments—what’s holding your focus? Share your case studies, experiences and technical knowledge with an engaged audience of open source peers.

In the submission entry you will be asked to indicate which of these tracks your proposal best fits: tutorial, business needs; case studies/use cases; operations; or developer.

A few ideas

The conference committee is looking for proposals that cover the many aspects of using, deploying and managing open source databases, including:

  • Open source – Describe the technical and business values of moving to or using open source databases. How did you convince your company to make the move? Was there tangible ROI?
  • Security – All of us have experienced security challenges. Whether they are initiated by legislature (GDPR), bugs (Meltdown/Spectre), experience (external attacks) or due diligence (planning for the worst), when do you have ‘enough’ security? Are you finding that security requirements are preventing your ability to be agile?
  • Serverless, Cloud or On-Premise – The technology landscape is no longer a simple one, and mixing infrastructures has almost become the norm. Are you designing data architectures for the new landscape, and eager to share your experience? Have microservices become an important part of your plans?
  • MySQL – Do you have an opinion on what is new and exciting in MySQL? With the release of MySQL 8.0, are you using the latest features? How and why? Are they helping you solve any business issues, or making deployment of applications and websites easier, faster or more efficient? Did the new release get you to change to MySQL? What do you see as the biggest impact of the MySQL 8.0 release? Do you use MySQL in conjunction with other databases in your environment?
  • MongoDB – How has the 3.6 release improved your experience in application development or time-to-market? How are the new features making your database environment better? What is it about MongoDB 4.0 that excites you? What are your experiences with Atlas? Have you moved to it, and has it lived up to its promises? Do you use MongoDB in conjunction with other databases in your environment?
  • PostgreSQL – Why do you use PostgreSQL as opposed to other SQL options? Have you done a comparison or benchmark of PostgreSQL vs. other types of databases related to your tasks? Why and what were the results? How does PostgreSQL help you with application performance or deployment? How do you use PostgreSQL in conjunction with other databases in your environment?
  • SQL, NewSQL, NoSQL – It’s become a perennial question without an easy answer. How do databases compare, how do you choose the right technology for the job, how do you trade off between features and their benefits in comparing databases? If you have ever tried a hybrid database approach in a single application, how did that work out? How nicely does MongoDB play with MySQL in the real world? Do you have anything to say about using SQL with NoSQL databases?
  • High Availability – What choices are you making to ensure high availability? How do you find the balance between redundancy and cost? Are you using hot backups, and if so, what happened when you needed to rollback on them?
  • Scalability – When did you recognize you needed to address data scale? Did your data growth take you by surprise or were you always in control? Did it take a degradation in performance to get your management to sit up and take notice? How do you plan for scale if you can’t predict demand?
  • What the Future Holds – What do you see as the “next big thing”? What new and exciting features are going to be released? What’s in your next release? What new technologies will affect the database landscape? AI? Machine learning? Blockchain databases? Let us know about innovations you see on the way.

How to respond to the call for papers

For information on how to submit your proposal visit our call for papers page. The conference web pages will be updated throughout the next few weeks and bios, synopsis and slides will be published on those pages after the event.

Sponsorship

If you would like to obtain a sponsor pack for Percona Live Europe Open Source Database Conference 2018, you will find more information including a prospectus on our sponsorship page. You are welcome to contact me, Bronwyn Campbell, directly.

The post Percona Live Europe 2018 Call for Papers is Now Open appeared first on Percona Database Performance Blog.

by Bronwyn Campbell at July 10, 2018 07:00 AM

Open Query Pty Ltd

MariaDB Galera cluster and GTID

In MariaDB 10.2.12, these two don’t yet work together. GTID = Global Transaction ID.  In the master-slave asynchronous replication realm, this means that you can reconnect a slave to another server (change its master) and it’ll happily continue replicating from the correct point.  No more fussing with filenames and offsets (which of course will both differ on different machines).

So in concept the GTIID is “globally” unique – that means it’s consistent across an entire infra: a binlogged write transaction will have the same GTID no matter on which machine you look at it.

  • OK: if you are transitioning from async replication to Galera cluster, and have a cluster as slave of the old infra, then GTID will work fine.
  • PROBLEM: if you want to run an async slave in a Galera cluster, GTID will currently not work. At least not reliably.

The overview issue is MDEV-10715, the specific problem is documented in MDEV-14153 with some comments from me from late last week. MDEV-14153 documents cases where the GTID is not in fact consistent – and the way in which it isn’t is most disturbing.

The issue appears as “drift”. A GTID is made up of R-S-# where R is replication domain (0 unless set by an app), S for server-id where the write was originally done, and # which is just a number. The only required rule for the # is that that each next event has to have a higher number than the previous.  In principle there could be #s missing, that’s ok.

In certain scenarios, the # part of the GTID falls behind on the “other nodes” in the Galera cluster. There was the node where the statement was first issued, and then there are the other nodes which pick up the change through the Galera (wsrep) cluster mechanism. Those other nodes.  So at that point, different nodes in the cluster have different GTIDs for the same query. Not so great.

To me, this looked like a giant red flag though: if a GTID is assigned on a commit, and then replicated through the cluster as part of that commit, it can’t change. Not drift, or any other change. So the only possible conclusion must be that it is in fact not passed through the cluster, but “reinvented” by a receiving cluster node, which simply assumes that the current event from a particular server-id is previous-event id + 1.  That assumption is false, because as I mentioned above it’s ok for gaps to exist.  As long as the number keeps going up, it’s fine.

Here is one of the simplest examples of breakage (extract from a binlog, with obfuscated table names):

# at 12533795
#180704 5:00:22 server id 1717 end_log_pos 12533837 CRC32 0x878fe96e GTID 0-1717-1672559880 ddl
/*!100001 SET @@session.gtid_seq_no=1672559880*//*!*/;
# at 12533837
#180704 5:00:22 server id 1717 end_log_pos 12534024 CRC32 0xc6f21314 Query thread_id=4468 exec_time=0 error_code=0
SET TIMESTAMP=1530644422/*!*/;
SET @@session.time_zone='SYSTEM'/*!*/;
DROP TEMPORARY TABLE IF EXISTS `qqq`.`tmp_foobar` /* generated by server */
/*!*/;

Fact: temporary tables are not replicated (imagine restarting a slave, it wouldn’t have whatever temporary tables were supposed to exist). So, while this event is stored in the binary log (which it is to ensure that if you replay the binlog on a machine, it correctly drops the temporary table after creating and using it), it won’t go through a cluster.  Remember that Galera cluster is essentially a ROW-based replication scheme; if there are changes in non-temporary tables, of course they get replicated just fine.  So if an app creates a temporary table, does some calculations, and then inserts the result of that into a regular table, the data of that last bit will get replicated. As it should. In a nutshell, as far as data consistency goes, we’re all fine.

But the fact that we have an event that doesn’t really get replicated creates the “fun” in the “let’s assume the next event is just the previous + 1” logic. This is where the drift comes in. Sigh.

In any case, this issue needs to be fixed by let’s say “being re-implemented”: the MariaDB GTID needs to be propagated through the Galera cluster, so it’s the same on every server, as it should be. Doing anything else is always going to go wrong somewhere, so trying to catch more cases like the above example is not really the correct way to go.

If you are affected by this or related problems, please do vote on the relevant MDEV issues. That is important!  If you need help tracking down problems, feel free to ask.  If you have more information on the matter, please comment too!  I’m sure this and related bugs will be fixed, there are very capable developers at MariaDB Corp and Codership Oy (the Galera company). But the more information we can provide, the better. It often helps with tracking down problems and creating reproducible test cases.

by Arjen Lentz at July 10, 2018 06:12 AM

July 09, 2018

Peter Zaitsev

InnoDB Cluster in a nutshell – Part 1

innodb cluster in a nutshell part 1

Since MySQL 5.7 we have a new player in the field, MySQL InnoDB Cluster. This is an Oracle High Availability solution that can be easily installed over MySQL to get High Availability with multi-master capabilities and automatic failover.

This solution consists in 3 components: InnoDB Group Replication, MySQL Router and MySQL Shell, you can see how these components interact in this graphic:

Graphic describing InnoDB Group Replication, MySQL Router and MySQL Shell

In this three blog post series, we will cover each of this components to get a sense of what this tool provides and how it can help with architecture decisions.

Group Replication

This is the actual High Availability solution, and a while ago I wrote a short review of it when it still was in its labs stage. It has improved a lot since then.

This solution is based on a plugin that has to be installed (not installed by default) and works on the top of built-in replication. So it relies on binary logs and relay logs to apply writes to members of the cluster.

The main concept about this new type of replication is that all members of a cluster (i.e. each node) are considered equals. This means there is no master-slave (where slaves follow master) but members that apply transactions based on a consensus algorithm. This algorithm forces all members of a cluster to commit or reject a given transaction following decisions made by each individual member.

In practical terms, this means each member of the cluster has to decide if a transaction can be committed (i.e. no conflicts) or should be rolled back but all other members follow this decision. In other words, the transaction is either committed or rolled back according to the majority of members in a consistent state.

To achieve this, there is a service that exposes a view of cluster status indicating what members form the cluster and the current status of each of them. Additionally Group Replication requires GTID and Row Based Replication (or

binlog_format=ROW
 ) to distribute each writeset with row changes between members. This is done via binary logs and relay logs but before each transaction is pushed to binary/relay logs it has to be acknowledged by a majority of members of the clusters, in other words through consensus. This process is synchronous, unlike legacy replication. After a transaction is replicated we have a certification process to commit the transaction, and thus making it durable.

Here appears a new concept, the certification process, which is the process that confirms if a writeset can be applied/committed (i.e. a row change can be done without conflicts) after replication of the transaction is complete.

Basically this process consists of inspecting writesets to check if there are conflicts (i.e. same row updated by concurrent transactions). Based on an order set in the writeset, the conflict is resolved by ‘first-commiter wins’ while the second is rolled back in the originator. Finally, the transaction is pushed to binary/relay logs and committed.

Solution features

Some other features provided by this solution are:

  • Single-primary or multi-primary modes meaning that the cluster can operate with a single writer and multiple readers (recommended and default setup); or with multiple writers where all nodes are capable to accept write transactions. The latter is at the cost of a performance penalty due to conflict resolution.
  • Automatic failure detection, where an internal mechanism is able to detect a failed node (i.e. a crash, network problems, etc) and decide to exclude it from the cluster automatically. Also if a member can’t communicate with the cluster and gets isolated, it can’t accept transactions. This ensures that cluster data is not impacted by this situation.
  • Fault tolerance. This is the strategy that the cluster uses to support failing members. As described above, this is based on a majority. A cluster needs at least three members to support one node failure because the other two members will keep the majority. The bigger the number of nodes, the bigger the number of failing nodes the cluster supports. The maximum number of members (nodes) in a cluster is currently limited to 7. If it has seven members, then the majority is kept by four or more active members. In other words, a cluster of seven would support up to three failing nodes.

We will not cover installation and configuration aspects now. This will probably come with a new series of blogs where we can cover not only deployment but also use cases and so on.

In the next post we will talk about the next cluster component: MySQL Router, so stay tuned.

The post InnoDB Cluster in a nutshell – Part 1 appeared first on Percona Database Performance Blog.

by Francisco Bordenave at July 09, 2018 01:47 PM

July 06, 2018

Valeriy Kravchuk

Problems of Oracle's MySQL as an Open Source Product

In my previous summary blog post I listed 5 problems I see with the way Oracle handles MySQL server development. The first of them was that "Oracle does not develop MySQL server in a true open source way" and this is actually what I started my draft of that entire blog post with. Now it's time to get into details, as so far there was mostly fun around this and statements that MariaDB also could do better in the related Twitter discussion I had.

So, let me explain what forces me to think that Oracle is treating MySQL somewhat wrong for the open source product.

Nice pathway on this photo, but it's not straight and it's not clear where it goes. Same with MySQL development...
We get MySQL source code updated at GitHub only when (or, as it often happened in the past, some time after) the official release of new version happens. You can see, for example, that MySQL 8.0 source code at GitHub was actually last time updated on April 3, 2018, while MySQL 8.0.11 GA was released officially on April 19, 2018 (and that's when new code became really available in public repository). We do not see any code changes later than April 3, while it's clear that there are bug fixes already implemented for MySQL 8.0.12 (see Bug #90523 - "[MySQL 8.0 GA Release Build] InnoDB Assertion: (capacity & (capacity - 1)) == 0", for example. There is an easy way to crash official MySQL 8.0.11 binaries upon startup, fixed back before April 30, with some description of the fix even, but no source code of the fix is published) and 8.0.13 even (see Bug #90999 - "Bad usage of ppoll in libmysql"). With Oracle's approach to sharing the source code, we can not see the fixes that are already made long time ago, apply them, test them or comment on them. This is fundamentally wrong, IMHO, for any open source software.

In other projects we usually can see the code as soon as it is pushed to the branch (check MariaDB if you care, last change few hours ago at the moment). Main branches may have more strict rules for updating, but in general we see fixes as they happen, not only when new official release happens.
Side note: if you see that Bug #90523 became private after I mentioned it here, that's another wrong thing they often do. More on the in the next post, on community bug reports handling by Oracle...
Interesting enough, when the fix comes from community we can usually see the patch. This happened to the Bug #90999 mentioned above - we have a fix provided by Facebook and one can see the patch in Bug #91067 - "Contribution by Facebook: Do not use sigmask in ppoll for client libraries". When somebody makes pull request, patch source is visible. But one can never be sure if it's the final patch and had it passed all the usual QA tests and reviews, or what happens to pull requests closed because developer had not signed the agreement...

If the fix is developed by Oracle you'll see the code changed only with/after the official release. Moreover, it would be on you to identify the exact commit(s) that introduced the fix. For a long time Laurynas Biveinis from Percona cared to add comments about the exact commit that fixed the bug to public bug reports (see Bug #77689 - "mysql_execute_command SQLCOM_UNLOCK_TABLES redundant trans_check_state check?" as one of examples). Community members have to work hard to "reverse engineer" Oracle's fixes and link them back to details of real problems (community bug reports) they were intended to resolve!

Compare this to a typical changelog of MariaDB that leads you directly to commits and code changes.

What's even worse, Oracle started a practice to publish only part of their changes made for the release. Some tests, those for "security" bugs, are NOT published even if we assume they exist or even can be 100% sure they exist.

My recent enough favorite example is the "The CREATE TABLE of death" bug reported by Jean-François Gagné. If you follow his blog post and links in it you can find out all the details, including the test case that is public in MariaDB. With this public information you can go and crash any affected older MySQL versions. Bug reporter did everything to inform affected vendors properly, and responsible vendors disclosed the test (after they fixed the problem)!

Now, try to find similar test in public GitHub tree of Oracle MySQL. I tried to find it literally, try to find references to somewhat related public bug numbers etc, but failed. If you know better and can identify the related public test at GitHub, please, add a comment and correct me!

To summarize, this is what I am mostly concerned about:
  1. Public source code is updated only with the releases. There are no feature-specific code branches, development branches, just nothing public until the official release.
  2. Oracle does not provide any details about commits and their relations to bugs fixed in the release notes or anywhere else outside GitHub. One has to go study the source code to make his own conclusions.
  3. Oracle does not share some of test cases in their commits. So, some test cases remain non-public and we can only guess (based on code analysis) what was the real intention of the fix. This applies to security bugs and who knows to what else.
I would not go into other potential problems (I've heard about some others from developers, for example, related to code refactoring Oracle does) or more details. The above is enough for me to state that Oracle do wrong things with the way they publish source code and threat MySQL as open source product.

All the problems mentioned above were introduced by Oracle, these never happened in MySQL AB or Sun. MariaDB and Percona servers may have their own problems, but the above do NOT apply to them, so I state that other vendors develop MySQL forks and related projects differently, and still are in business and doing well!

by Valeriy Kravchuk (noreply@blogger.com) at July 06, 2018 07:30 PM

Peter Zaitsev

Percona Toolkit 3.0.11 Is Now Available

percona toolkit

percona toolkitPercona announces the release of Percona Toolkit 3.0.11 on July 6, 2018.

Percona Toolkit is a collection of advanced open source command-line tools, developed and used by the Percona technical staff, that are engineered to perform a variety of MySQL®, MongoDB® and system tasks that are too difficult or complex to perform manually. With over 1,000,000 downloads, Percona Toolkit supports Percona Server for MySQL, MySQL®, MariaDB®, Percona Server for MongoDB and MongoDB.

Percona Toolkit, like all Percona software, is free and open source. You can download packages from the website or install from official repositories.

This release includes the following changes:

New Features:

  • PT-1571: Improved hostname recognition in pt-secure-collect
  • PT-1569: Disabled --alter-foreign-keys-method=drop_swap in pt-online-schema-change
  • PT-242: (pt-stalk) Include SHOW SLAVE STATUS on MySQL 5.7 (Thanks Marcelo Altmann)

Fixed bugs:

  • PT-1570: pt-archiver fails to detect columns with the word *GENERATED* as part of the comment
  • PT-1563: pt-show-grantsfails for MySQL 5.6 producing an error which reports that an unknown column account_locked has been detected.
  • PT-1551: pt-table-checksum fails on MySQL 8.0.11
  • PT-241: (pt-stalk) Slave queries don’t run on MySQL 5.7  because the FQDN was missing (Thanks Marcelo Altmann)

Breaking changes:

Starting with this version, the queries checksum in pt-query-digest will use the full MD5 field as a CHAR(32) field instead of storing just the least significant bytes of the checksum as a BIGINT field. The reason for this change is that storing only the least significant bytes as a BIGINT was producing inconsistent results in MySQL 8 compared to MySQL 5.6+.

pt-online-schema-change in MySQL 8:

Due to a bug in MySQL 8.0+, it is not possible to use the drop_swapmethod to rebuild constraints because renaming a table will result in losing the foreign keys. You must specify a different method explicitly.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system.

The post Percona Toolkit 3.0.11 Is Now Available appeared first on Percona Database Performance Blog.

by Michael Coburn at July 06, 2018 06:22 PM

Another Day, Another Data Leak

another day another data leak Exactis

another day another data leak ExactisIn the last few days, there has been information released about yet another alleged data leak, placing in jeopardy “…[the] personal information on hundreds of millions of American adults, as well as millions of businesses.” In this case, the “victim” was Exactis, for whom data collection and data security are core business functions.

Some takeaways from Exactis

Please excuse the pun! In security, we have few chances to chuckle. In fact, as a Security Architect, I sigh deeply when I read about this kind of issue. Firstly, it’s preventable. Secondly, I worry that if an organization like Exactis is not getting it right, what chance the rest of the world?

As the Wired article notes the tool https://shodan.io/ can be revealing and well worth a look. For example, you can see there are still MANY elasticSearch systems exposed to the public internet here. Why not use shodan to check what everyone else in the world can see openly on your systems ?

Securing databases

Databases in themselves do not need to be at risk, as long as you take the necessary precautions. We discussed this in this blog post that I co-authored last year.

In this latest alleged gaffe, as far as I can discern, had the setup made use of iptables or a similar feature then the breach could not have occurred.

With immaculate timing, my colleague Marco Tusa wrote a post last month on how to set up iptables for Percona XtraDB Cluster, and if you are not sure if or how that applies to your setup, it is definitely worth a read. In fact, you can access all of our security blog posts if you would like some more pointers.

Of course, security does not stop with iptables. Application developers should already be familiar with the need to avoid SQL injection, and there is a decent SQL injection prevention cheat sheet here, offered by The Open Web Application Security Project (OWASP). Even if you don’t fully understand the technical details, a cheat sheet like this might help you to ask the right questions for your application.

MySQL resources

For a more in-depth look at MySQL security, I have two talks up on YouTube. The first of these is a twenty-minute presentation on hardening MySQL and the second on web application security and why you really should review yours. You could also check out our recorded webinar Security and Encryption in the MySQL world presented by Dimitri Vanoverbeke.

MongoDB resources

Of course, security challenges are not unique to SQL databases. If you are a MongoDB user, this webinar MongoDB Security: Making things secure by default might be of interest to you. Or perhaps this one on using LDAP Authentication with MongoDB? Adamo Tonete presents both of these webinars.

For a more widely applicable view, you could try Colin Charles’ recent webinar too.

There are always consequences

As Exactis are no doubt discovering, managing the fallout from such a breach is a challenge. If you are not sure where you stand on security, or what you can do to improve your situation, then audit services such as those we offer could prove to be a valuable investment.

Finally, some of you will be lucky enough to have someone dedicated to IT security in your organizations. Next time you see them, instead of avoiding their steely stare, why not invite them for a coffee* and a chat? It could be enlightening!

*Beer or scotch is also almost always accepted too…

The post Another Day, Another Data Leak appeared first on Percona Database Performance Blog.

by David Busby at July 06, 2018 03:58 PM

July 05, 2018

MariaDB AB

Use Cases for MariaDB Data Versioning

Use Cases for MariaDB Data Versioning rasmusjohansson Thu, 07/05/2018 - 19:54

Working in software development, versioning of code is something that we’ve often taken for granted. Task definitions and bug descriptions are preferably also managed by a system that versions every change. On top of this, we use a lot of documents for designing, documenting and managing our development cycles. For example, some of the tools we use are Jira, Google Docs and our own Knowledge Base to accomplish these things, which all provide versioning support. In MariaDB Server 10.3, we’ve introduced an elegant and easy way for data versioning, called System-Versioned Tables. Let’s look at what it can be used for.

A Look Back at Data for GDPR and PCI DSS Compliance

The General Data Protection Regulation (GDPR) is now enforced by the European Union (EU). All companies collecting user data in the EU have to comply with the GDPR rules. In addition to the daily questions and statements coming over email asking you to agree to new terms because of GDPR, the companies also have to store personal and private data in a way that fulfills the criteria of GDPR.

Card payments also have their own rules. There are standards like the Payment Card Industry Data Security Standard (PCI DSS), which are followed by banks and other online businesses. 1) What happened when, 2) by whom and 3) what did the data look like before and after? In MariaDB Server, the MariaDB Audit plugin is there for dealing with 1) and 2). It can also be used for 3) by looking in the audit logs on changes made, but it doesn’t give you the full data for how it looked before and after. With the newly released System-Versioned Tables this is possible. Let’s say that payment card information is stored in a database table. By turning on versioning for that table, all changes will create a new version of the row(s) affected by the change. The rows will also be time stamped, which means that you can query the row to see what it looked like before.

Handling Personal Data in Registries

When you think about versioning, one thing that comes to mind are registries of all sorts, which is the domain of GDPR when it comes to handling personal data. There are many types of personal data and one important to all of us is healthcare data, for example the patient registries of hospitals. In these registries versioning is of great importance to keep track of patients’ health history and related information such as medication. Other personal data registers are civil registers, tax registers, school and student registers and employee registers. The list is endless.

Rapidly Changing Data Sets

All of the above mentioned examples with data versioning applied in one way or another can be seen as slowly changing data. What I mean is that, although the systems can be huge and the total amount of transactions happening enormous, each piece of data doesn’t change that often. For example, my information in the civil register doesn’t change every second. But what if we have rapidly changing data such as the share rates at a stock exchange or tracking vehicle data for a shipping company. In these cases, we can make use of MariaDB’s data versioning.

Creating applications or software for the above purposes and having a database that provides data versioning out-of-the-box will lead to easier design, less customization and a more secure solutions.

Step-by-step Example

I’ll end with a GDPR example. I have a newsletter with subscribers and want to make sure that I always know when and what has happened to the subscriber data. I create a table in the database for the purpose and turn on versioning for the table.

CREATE TABLE Subscriber (
SubscriberId int(11) NOT NULL AUTO_INCREMENT,
FirstName varchar(50) NOT NULL,
LastName varchar(50) NOT NULL,
Newsletter bit NOT NULL,
PRIMARY KEY (SubscriberId)
) ENGINE=InnoDB WITH SYSTEM VERSIONING;

I insert myself as subscriber.

INSERT INTO Subscriber (FirstName, LastName, Newsletter) VALUES ('Rasmus', 'Johansson', 1);

I then try to add a column to the table.

ALTER TABLE Subscriber ADD COLUMN Gender char(1) NULL;
ERROR 4119 (HY000): Not allowed for system-versioned `Company`.`Subscriber`. Change @@system_versioning_alter_history to proceed with ALTER.

It results in the above error, because changing a versioned table is not permitted by default. I turn on the possibility to change the table and then the ALTER succeeds.

SET @@system_versioning_alter_history = 1;
ALTER TABLE Subscriber ADD COLUMN Gender char(1) NULL;
Query OK, 1 row affected (0.17 sec)

I also want a constraint on the new column.

ALTER TABLE Subscriber ADD CONSTRAINT con_gender CHECK (Gender in ('f','m'));

Then I do a couple of updates in the table.

UPDATE Subscriber SET Newsletter = 0 WHERE SubscriberId = 1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Inserted: 1 Warnings: 0

UPDATE Subscriber SET Gender = 'm' WHERE SubscriberId = 1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Inserted: 1 Warnings: 0

Finally, I delete the row in the table:

DELETE FROM Subscriber WHERE SubscriberId = 1;

If we ask for the rows in the table, including the old versions we get the following.

SELECT *, ROW_START, ROW_END FROM Subscriber FOR SYSTEM_TIME ALL;
+--------------+-----------+-----------+------------+--------+----------------------------+----------------------------+
| SubscriberId | FirstName | LastName | Newsletter | Gender | ROW_START | ROW_END |
+--------------+-----------+-----------+------------+--------+----------------------------+----------------------------+
| 1 | Rasmus | Johansson | # | NULL | 2018-06-08 10:57:36.982721 | 2018-06-08 11:14:07.654996 |
| 1 | Rasmus | Johansson | | NULL | 2018-06-08 11:14:07.654996 | 2018-06-08 11:15:05.971761 |
| 1 | Rasmus | Johansson | | NULL | 2018-06-08 11:15:05.971761 | 2018-06-08 11:15:28.459109 |
| 1 | Rasmus | Johansson | | m | 2018-06-08 11:15:28.459109 | 2038-01-19 03:14:07.999999 |
+--------------+-----------+-----------+------------+--------+----------------------------+----------------------------+

Even though I deleted the row, I get four old versions of the row. All this was handled by the database. The only thing I had to do was to turn on versioning for the table. What will you do with MariaDB’s System Versioned-Tables? We’d love to hear from you!

Login or Register to post comments

by rasmusjohansson at July 05, 2018 11:54 PM

Peter Zaitsev

Configuring PMM Monitoring for MongoDB Cluster

In this blog, we will see how to configure Percona Monitoring and Management (PMM) monitoring for a MongoDB cluster. It’s very simple, like adding a replica set or standalone instances to PMM Monitoring. 

For this example, I have used docker to create PMM Server and MongoDB sharded cluster containers. If you want the steps I used to create the MongoDB® cluster environment using docker, I have shared them at the end of this blog. You can refer to this if you would like to create the same set up.

Configuring PMM Clients

For PMM installations, you can check these links for PMM installation and pmm-client setup. The following are the members of the MongoDB cluster:

mongos:
 mongos1  
config db: (replSet name - mongors1conf)  
 mongocfg1  
 mongocfg2  
 mongocfg3
Shard1: (replSet name - mongors1)  
 mongors1n1  
 mongors1n2
 mongors1n3
Shard1: (replSet name - mongors2)  
 mongors2n1
 mongors2n2  
 mongors2n3

In this setup, I installed the pmm-client on mongos1 server. Then I added an agent to monitor MongoDB metrics with cluster option as shown in the next code block. I named the cluster “mongoClusterPMM” (Note: you have to be root user or need sudo access to execute the

pmm-admin
 command):

root@90262f1360a0:/# pmm-admin add  mongodb --cluster mongoClusterPMM
[linux:metrics]   OK, now monitoring this system.
[mongodb:metrics] OK, now monitoring MongoDB metrics using URI localhost:27017
[mongodb:queries] OK, now monitoring MongoDB queries using URI localhost:27017
[mongodb:queries] It is required for correct operation that profiling of monitored MongoDB databases be enabled.
[mongodb:queries] Note that profiling is not enabled by default because it may reduce the performance of your MongoDB server.
[mongodb:queries] For more information read PMM documentation (https://www.percona.com/doc/percona-monitoring-and-management/conf-mongodb.html).
root@90262f1360a0:/# pmm-admin list
pmm-admin 1.11.0
PMM Server      | 172.17.0.2 (password-protected)
Client Name     | 90262f1360a0
Client Address  | 172.17.0.4 
Service Manager | unix-systemv
---------------- ------------- ----------- -------- ---------------- ------------------------
SERVICE TYPE     NAME          LOCAL PORT  RUNNING  DATA SOURCE      OPTIONS                 
---------------- ------------- ----------- -------- ---------------- ------------------------
mongodb:queries  90262f1360a0  -           YES      localhost:27017  query_examples=true     
linux:metrics    90262f1360a0  42000       YES      -                                        
mongodb:metrics  90262f1360a0  42003       YES      localhost:27017  cluster=mongoClusterPMM 
root@90262f1360a0:/#

As you can see, I used the 

pmm-admin add mongodb [options]
 command which enables monitoring for system, MongoDB metrics and queries. You need to enable profiler to monitor MongoDB queries. Use the next command to enable it at database level:

use db_name
db.setProfilingLevel(1)

Check this blog to know more about QAN setup and details. If you want to enable only MongoDB metrics, rather than queries to be monitored, then you can use the command 

pmm-admin add  mongodb:metrics [options]
 . After this, go to the PMM homepage (in my case localhost:8080) in your browser and select MongoDB Cluster Summary from the drop down list under the MongoDB option. The below screenshot shows the MongoDB Cluster—“mongoClusterPMM” statistics— collected by the agent that we added in mongos1 server. 

Percona Monitoring and Management for a MongoDB cluster

Did we miss something here? And do you see any metrics in the dashboard above except “Balancer Enabled” and “Chunks Balanced”?

No. This is because, PMM doesn’t have enough data to show in the dashboard. The shards are not added to the cluster yet and as you can see it displays 0 under shards. Let’s add two shard replica sets mongors1 and mongors2 in the mongos1 instance, and enable sharding to the database to complete the cluster setup as follows:

mongos> sh.addShard("mongors1/mongors1n1:27017,mongors1n2:27017,mongors1n3:27017")
{ "shardAdded" : "mongors1", "ok" : 1 }
mongos> sh.addShard("mongors2/mongors2n1:27017,mongors2n2:27017,mongors2n3:27017")
{ "shardAdded" : "mongors2", "ok" : 1 }

Now, I’ll add some data, collection and shard keys, and enable sharding so that we can see some statistics in the dashboard:

use vinodh
db.setProfilingLevel(1)
db.testColl.insertMany([{id1:1,name:"test insert”},{id1:2,name:"test insert"},{id1:3,name:"insert"},{id1:4,name:"insert"}])
db.testColl.ensureIndex({id1:1})
sh.enableSharding("vinodh")
sh.shardCollection("vinodh.testColl", {id1:1})

At last! Now you can see statistics in the graph for the MongoDB cluster:

PMM Statistics for a MongoDB Cluster 

Full Cluster Monitoring

We are not done yet. We have just added an agent to monitor mongos1 instance, and so only the cluster related statistics and QAN are collected through this node. For monitoring all nodes in the MongoDB cluster, we need to configure all nodes in PMM under “mongoClusterPMM” cluster. This will tell PMM that the configured nodes are part of the same cluster. We could also monitor the replica set related metrics for the members in config DBs and shards. Let’s add the monitoring agents in mongos1 server to monitor all MongoDB instances remotely. We’ll use these commands:

pmm-admin add mongodb:metrics --uri "mongodb://mongocfg1:27017/admin?replicaSet=mongors1conf" mongocfg1replSet --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongocfg2:27017/admin?replicaSet=mongors1conf" mongocfg2replSet --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongocfg3:27017/admin?replicaSet=mongors1conf" mongocfg3replSet --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors1n1:27017/admin?replicaSet=mongors1" mongors1replSetn1 --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors1n2:27017/admin?replicaSet=mongors1" mongors1replSetn2 --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors1n3:27017/admin?replicaSet=mongors1" mongors1replSetn3 --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors2n1:27017/admin?replicaSet=mongors2" mongors2replSetn1 --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors2n2:27017/admin?replicaSet=mongors2" mongors2replSetn2 --cluster  mongoClusterPMM
pmm-admin add mongodb:metrics --uri "mongodb://mongors2n3:27017/admin?replicaSet=mongors2" mongors2replSetn3 --cluster  mongoClusterPMM

Once you have added them, you can check the agent’s (mongodb-exporter) status as follows:

root@90262f1360a0:/# pmm-admin list
pmm-admin 1.11.0
PMM Server      | 172.17.0.2 (password-protected)
Client Name     | 90262f1360a0
Client Address  | 172.17.0.4 
Service Manager | unix-systemv
---------------- ------------------ ----------- -------- ----------------------- ------------------------
SERVICE TYPE     NAME               LOCAL PORT  RUNNING  DATA SOURCE             OPTIONS                 
---------------- ------------------ ----------- -------- ----------------------- ------------------------
mongodb:queries  90262f1360a0       -           YES      localhost:27017         query_examples=true     
linux:metrics    90262f1360a0       42000       YES      -                                               
mongodb:metrics  90262f1360a0       42003       YES      localhost:27017         cluster=mongoClusterPMM 
mongodb:metrics  mongocfg1replSet   42004       YES      mongocfg1:27017/admin   cluster=mongoClusterPMM 
mongodb:metrics  mongocfg2replSet   42005       YES      mongocfg2:27017/admin   cluster=mongoClusterPMM 
mongodb:metrics  mongocfg3replSet   42006       YES      mongocfg3:27017/admin   cluster=mongoClusterPMM 
mongodb:metrics  mongors1replSetn1  42007       YES      mongors1n1:27017/admin  cluster=mongoClusterPMM 
mongodb:metrics  mongors1replSetn3  42008       YES      mongors1n3:27017/admin  cluster=mongoClusterPMM 
mongodb:metrics  mongors2replSetn1  42009       YES      mongors2n1:27017/admin  cluster=mongoClusterPMM 
mongodb:metrics  mongors2replSetn2  42010       YES      mongors2n2:27017/admin  cluster=mongoClusterPMM 
mongodb:metrics  mongors2replSetn3  42011       YES      mongors2n3:27017/admin  cluster=mongoClusterPMM 
mongodb:metrics  mongors1replSetn2  42012       YES      mongors1n2:27017/admin  cluster=mongoClusterPMM

So now you can monitor every member of the MongoDB cluster including their replica set and shard statistics. The next screenshot shows one of the members from the replica set under MongoDB replica set dashboard. You can select this from dashboard in this way: Cluster: mongoClusterPMM → Replica Set: mongors1 → Instance: mongors1replSetn2]:

 PMM MongoDB Replica Set Dashboard

MongoDB cluster docker setup

As I said in the beginning of this blog, I’ve provided the steps for the MongoDB cluster setup. Since this is for testing, I have used very simple configuration to setup the cluster environment using docker-compose. Before creating the MongoDB cluster, create the network in docker for the cluster nodes and PMM to connect each other like this:

Vinodhs-MBP:docker-mcluster vinodhkrish$ docker network create mongo-cluster-nw
7e3203aed630fed9b5f5f2b30e346301a58a068dc5f5bc0bfe38e2eef1d48787

I used the docker-compose.yaml file to create the docker environment here:

version: '2'
services:
  mongors1n1:
    container_name: mongors1n1
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors1 --dbpath /data/db --port 27017
    networks:
      - mongo-cluster
    ports:
      - 27047:27017
    expose:
      - "27017"
    environment:
      TERM: xterm
  mongors1n2:
    container_name: mongors1n2
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors1 --dbpath /data/db --port 27017
    networks:
      - mongo-cluster
    ports:
      - 27048:27017
    expose:
      - "27017"
    environment:
      TERM: xterm
  mongors1n3:
    container_name: mongors1n3
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors1 --dbpath /data/db --port 27017
    networks:
      - mongo-cluster
    ports:
      - 27049:27017
    expose:
      - "27017"
    environment:
      TERM: xterm
  mongors2n1:
    container_name: mongors2n1
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors2 --dbpath /data/db --port 27017
    networks:
      - mongo-cluster
    ports:
      - 27057:27017
    expose:
      - "27017"
    environment:
      TERM: xterm
  mongors2n2:
    container_name: mongors2n2
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors2 --dbpath /data/db --port 27017
    networks:
      - mongo-cluster
    ports:
      - 27058:27017
    expose:
      - "27017"
    environment:
      TERM: xterm
  mongors2n3:
    container_name: mongors2n3
    image: mongo:3.4
    command: mongod --shardsvr --replSet mongors2 --dbpath /data/db --port 27017
    networks:
      - mongo-cluster
    ports:
      - 27059:27017
    expose:
      - "27017"
    environment:
      TERM: xterm
  mongocfg1:
    container_name: mongocfg1
    image: mongo:3.4
    command: mongod --configsvr --replSet mongors1conf --dbpath /data/db --port 27017
    networks:
      - mongo-cluster
    ports:
      - 27025:27017
    environment:
      TERM: xterm
    expose:
      - "27017"
  mongocfg2:
    container_name: mongocfg2
    image: mongo:3.4
    command: mongod --configsvr --replSet mongors1conf --dbpath /data/db --port 27017
    networks:
      - mongo-cluster
    ports:
      - 27024:27017
    environment:
      TERM: xterm
    expose:
      - "27017"
  mongocfg3:
    container_name: mongocfg3
    image: mongo:3.4
    command: mongod --configsvr --replSet mongors1conf --dbpath /data/db --port 27017
    networks:
      - mongo-cluster
    ports:
      - 27023:27017
    environment:
      TERM: xterm
    expose:
      - "27017"
  mongos1:
    container_name: mongos1
    image: mongo:3.4
    depends_on:
    - mongocfg1
      - mongocfg2
    command: mongos --configdb mongors1conf/mongocfg1:27017,mongocfg2:27017,mongocfg3:27017 --port 27017
    networks:
      - mongo-cluster
    ports:
      - 27019:27017
    expose:
      - "27017"
networks:
  mongo-cluster:
    external:
      name: mongo-cluster-nw

Hint:
In the docker compose file, above, if you are using docker-compose version >=3.x, then you can use the option for “networks” in the yaml file like this:

networks:
  mongo-cluster:
    name: mongo-cluster-nw

Now start the containers and configure the cluster quickly as follows:

Vinodhs-MBP:docker-mcluster vinodhkrish$ docker-compose up -d
Creating mongocfg1 ... 
Creating mongos1 ... 
Creating mongors2n3 ... 
Creating mongocfg2 ... 
Creating mongors1n1 ... 
Creating mongocfg3 ... 
Creating mongors2n2 ... 
Creating mongors1n3 ... 
Creating mongors2n1 ... 
Creating mongors1n2 ...

ConfigDB replica set setup

Vinodhs-MBP:docker-mcluster vinodhkrish$ docker exec -it mongocfg1 mongo --quiet
> rs.initiate({  _id: "mongors1conf",configsvr: true,
...   members: [{ _id:0, host: "mongocfg1" },{_id:1, host: "mongocfg2" },{ _id:2, host: "mongocfg3"}]
...   })
{ "ok" : 1 }

Shard1 replica set setup

Vinodhs-MBP:docker-mcluster vinodhkrish$ docker exec -it mongors1n1 mongo --quiet
> rs.initiate( { _id: "mongors1", members:[ {_id: 0, host: "mongors1n1"}, {_id:1, host: "mongors1n2"}, {_id:2, host: "mongors1n3"}] })
{ "ok" : 1 }

Shard2 replica set setup:

Vinodhs-MBP:docker-mcluster vinodhkrish$ docker exec -it mongors2n1 mongo --quiet
> rs.initiate({ _id: "mongors2",  members: [ { _id: 0, host: "mongors2n1"}, { _id:1, host: "mongors2n2"}, { _id:2, host: "mongors2n3"}] })
{ "ok" : 1 }

I hope that this blog helps you to setup a MongoDB cluster and also to configure PMM to monitor it! Have a great day!

The post Configuring PMM Monitoring for MongoDB Cluster appeared first on Percona Database Performance Blog.

by Vinodh Krishnaswamy at July 05, 2018 12:18 PM

July 04, 2018

Jean-Jerome Schmidt

New Webinar: Disaster Recovery Planning for MySQL & MariaDB with ClusterControl

Everyone should have a disaster recovery plan for MySQL & MariaDB!

Join Vinay Joosery, CEO at Severalnines, on July 24th for our new webinar on Disaster Recovery Planning for MySQL & MariaDB with ClusterControl; especially if you find yourself wondering about disaster recovery planning for MySQL and MariaDB, if you’re unsure about RTO and RPO or whether you should you have a secondary datacenter, or are concerned about disaster recovery in the cloud…

Organizations need an appropriate disaster recovery plan in order to mitigate the impact of downtime. But how much should a business invest? Designing a highly available system comes at a cost, and not all businesses and certainly not all applications need five 9’s availability.

Vinay will explain key disaster recovery concepts and walk us through the relevant options from the MySQL & MariaDB ecosystem in order to meet different tiers of disaster recovery requirements; and demonstrate how ClusterControl can help fully automate an appropriate disaster recovery plan.

Sign up below to join the discussion!

Date, Time & Registration

Europe/MEA/APAC

Tuesday, July 24th at 09:00 BST / 10:00 CEST (Germany, France, Sweden)

Register Now

North America/LatAm

Tuesday, July 24th at 09:00 Pacific Time (US) / 12:00 Eastern Time (US)

Register Now

Agenda

  • Business Considerations for DR
    • Is 100% uptime possible?
    • Analyzing risk
    • Assessing business impact
  • Defining DR
    • Outage Timeline
    • RTO
    • RPO
    • RTO + RPO = 0 ?
  • DR Tiers
    • No offsite data
    • Database backup with no Hot Site
    • Database backup with Hot Site
    • Asynchronous replication to Hot Site
    • Synchronous replication to Hot Site
  • Implementing DR with ClusterControl
    • Demo
  • Q&A

Speaker

Vinay Joosery, CEO & Co-Founder, Severalnines

Vinay Joosery, CEO, Severalnines, is a passionate advocate and builder of concepts and business around distributed database systems. Prior to co-founding Severalnines, Vinay held the post of Vice-President EMEA at Pentaho Corporation - the Open Source BI leader. He has also held senior management roles at MySQL / Sun Microsystems / Oracle, where he headed the Global MySQL Telecoms Unit, and built the business around MySQL's High Availability and Clustering product lines. Prior to that, Vinay served as Director of Sales & Marketing at Ericsson Alzato, an Ericsson-owned venture focused on large scale real-time databases.

This webinar builds upon a related white paper written by Vinay on disaster recovery, which you can download here: https://severalnines.com/resources/whitepapers/disaster-recovery-planning-mysql-mariadb

We look forward to “seeing” you there!

by jj at July 04, 2018 02:12 PM

Peter Zaitsev

How to Set Up Replication Between AWS Aurora and an External MySQL Instance

Amazon RDS Aurora replication to external server

Amazon RDS Aurora replication to external serverAmazon RDS Aurora (MySQL) provides its own low latency replication. Nevertheless, there are cases where it can be beneficial to set up replication from Aurora to an external MySQL server, as Amazon RDS Aurora is based on MySQL and supports native MySQL replication. Here are some examples of when replicating from Amazon RDS Aurora to an external MySQL server can make good sense:

  • Replicating to another cloud or datacenter (for added redundancy)
  • Need to use an independent reporting slave
  • Need to have an additional physical backup
  • Need to use another MySQL flavor or fork
  • Need to failover to another cloud and back

In this blog post I will share simple step by step instructions on how to do it.

Steps to setup MySQL replication from AWS RDS Aurora to MySQL server

  1. Enable binary logs in the option group in Aurora (Binlog format = mixed). This will require a restart.
  2. Create a snapshot and restore it (create a new instance from a snapshot). This is only needed to make a consistent copy with mysqldump. As Aurora does not allow “super” privileges, running
    mysqldump --master-data
      is not possible. The snapshot is the only way to get a consistent backup with the specific binary log position.
  3. Get the binary log information from the snapshot. In the console, look for the “Alarms and Recent Events” for the restored snapshot instance. We should see something like:
    Binlog position from crash recovery is mysql-bin-changelog.000708 31278857
  4. Install MySQL 5.6 (i.e. Percona Server 5.6) on a separate EC2 instance (for Aurora 5.6 – note that you should use MySQL 5.7 for Aurora 5.7). After MySQL is up and running, import the timezones:
    # mysql_tzinfo_to_sql /usr/share/zoneinfo/|mysql

    Sample config:
    [mysqld]
    log-bin=log-bin
    log-slave-updates
    binlog-format=MIXED
    server-id=1000
    relay-log=relay-bin
    innodb_log_file_size=1G
    innodb_buffer_pool_size=2G
    innodb_flush_method=O_DIRECT
    innodb_flush_log_at_trx_commit=0 # as this is replication slave
  5. From now on we will make all backups from the restored snapshot. First get all users and import those to the new instance:
    pt-show-grants -h myhost...amazonaws.com -u percona > grants.sql

    # check that grants are valid and upload to MySQL
    mysql -f < grants.sql

    Make a backup of all schemas except for the “mysql” system tables as Aurora using different format of those (make sure we connect to the snapshot):
    host="my-snapshot...amazonaws.com"
    mysqldump --single-transaction -h $host -u percona
    --triggers --routines
    --databases `mysql -u percona -h $host -NBe
    "select group_concat(schema_name separator ' ') from information_schema.schemata where schema_name not in ('mysql', 'information_schema', 'performance_schema')"` > all.sql
  6. Restore to the local database:
    mysql -h localhost < all.sql
  7. Restore users again (some users may fail to create where there are missing databases):
    mysql -f < grants.sql
  8. Download the RDS/Aurora SSL certificate:
    # cd /etc/ssl
    # wget 'https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem'
    # chown mysql.mysql rds-combined-ca-bundle.pem
  9. Configure MySQL replication. Take the values for the binary log name and position from #3 above. Please note: now we connect to the actual instance, not a snapshot:
    # mysql -h localhost
    ...
    mysql> CHANGE MASTER TO
    MASTER_HOST='dev01-aws-1...',
    MASTER_USER='awsreplication',
    MASTER_PASSWORD='<pass>',
    MASTER_LOG_FILE = 'mysql-bin-changelog.000708',
    MASTER_LOG_POS = 31278857,
    MASTER_SSL_CA = '/etc/ssl/rds-combined-ca-bundle.pem',
    MASTER_SSL_CAPATH = '',
    MASTER_SSL_VERIFY_SERVER_CERT=1;
    mysql> start slave;
  10. Verify that the slave is working. Optionally add the SQL_Delay option to the CHANGE MASTER TO (or anytime) and specify the slave delay in seconds.

I hope those steps will be helpful for setting up an external MySQL replica.

The post How to Set Up Replication Between AWS Aurora and an External MySQL Instance appeared first on Percona Database Performance Blog.

by Alexander Rubin at July 04, 2018 01:59 PM

Open Query Pty Ltd

A reminder for Symantec certificate users

icon security lowUsing Chrome or Chromium?  Browse to PayPal, right-click while on the page, select Inspect, and click on the Console tab. Bit of an exercise, but it’ll let you see the following notice:

The SSL certificate used to load resources from https://www.paypal.com will be distrusted in M70.
Once distrusted, users will be prevented from loading these resources.
See https://g.co/chrome/symantecpkicerts for more information.

Google hasn’t really been hiding this, they have been publicly talking about it since last year.

“Symantec’s PKI business, which operates a series of Certificate Authorities under various brand names, including Thawte, VeriSign, Equifax, GeoTrust, and RapidSSL, had issued numerous certificates that did not comply with the industry-developed CA/Browser Forum Baseline Requirements.”

The M70 release of Chrome will be in beta from September, and GA in October 2018.  Yes, that’s quite soon!

All of this would be fine, if sites and companies would get a move on and deploy new certificates that don’t originate from this problematic source.  One would expect PayPal to have done this months ago, but apparently not.  It’s important to be aware of this, because come October it’ll be much more than a little inconvenience particularly for e-commerce sites: online payments via PayPal will start failing – unless and until PayPal and others update their certificates.

by Arjen Lentz at July 04, 2018 06:07 AM

July 03, 2018

MariaDB Foundation

Tampere MariaDB Developers Unconference: Reportback & Presentations

The second MariaDB Developers Unconference of 2018 was held in Tampere, Finland on 26-29 June. The event was kindly hosted by Seravo and the Crazy Town venue offered us great facilities for collaborative work and discussions. Thanks to all 27 attendees who arrived! Present were many core developers and new contributors interested in MariaDB development […]

The post Tampere MariaDB Developers Unconference: Reportback & Presentations appeared first on MariaDB.org.

by Katri Tuunanen at July 03, 2018 04:40 PM

Peter Zaitsev

Linux OS Tuning for MySQL Database Performance

Linux OS tuning for MySQL database performance

Linux OS tuning for MySQL database performanceIn this post we will review the most important Linux settings to adjust for performance tuning and optimization of a MySQL database server. We’ll note how some of the Linux parameter settings used OS tuning may vary according to different system types: physical, virtual or cloud. Other posts have addressed MySQL parameters, like Alexander’s blog MySQL 5.7 Performance Tuning Immediately After Installation. That post remains highly relevant for the latest versions of MySQL, 5.7 and 8.0. Here we will focus more on the Linux operating system parameters that can affect database performance.

Server and Operating System

Here are some Linux parameters that you should check and consider modifying if you need to improve database performance.

Kernel – vm.swappiness

The value represents the tendency of the kernel  to swap out memory pages. On a database server with ample amounts of RAM, we should keep this value as low as possible. The extra I/O can slow down or even render the service unresponsive. A value of 0 disables swapping completely while 1 causes the kernel to perform the minimum amount of swapping. In most cases the latter setting should be OK:

# Set the swappiness value as root
echo 1 > /proc/sys/vm/swappiness
# Alternatively, using sysctl
sysctl -w vm.swappiness=1
# Verify the change
cat /proc/sys/vm/swappiness
1
# Alternatively, using sysctl
sysctl vm.swappiness
vm.swappiness = 1

The change should be also persisted in /etc/sysctl.conf:

vm.swappiness = 1

Filesystems – XFS/ext4/ZFS
XFS

XFS is a high-performance, journaling file system designed for high scalability. It provides near native I/O performance even when the file system spans multiple storage devices.  XFS has features that make it suitable for very large file systems, supporting files up to 8EiB in size. Fast recovery, fast transactions, delayed allocation for reduced fragmentation and near raw I/O performance with DIRECT I/O.

The default options for mkfs.xfs are good for optimal speed, so the simple command:

# Use default mkfs options
mkfs.xfs /dev/target_volume

will provide best performance while ensuring data safety. Regarding mount options, the defaults should fit most cases. On some filesystems you can see a performance increase by adding the noatime mount option to the /etc/fstab.  For XFS filesystems the default atime behaviour is relatime, which has almost no overhead compared to noatime and still maintains sane atime values.  If you create an XFS file system on a LUN that has a battery backed, non-volatile cache, you can further increase the performance of the filesystem by disabling the write barrier with the mount option nobarrier. This helps you to avoid flushing data more often than necessary. If a BBU (backup battery unit) is not present, however, or you are unsure about it, leave barriers on, otherwise you may jeopardize data consistency. With this options on, an /etc/fstab file should look like the one below:

/dev/sda2              /datastore              xfs     defaults,nobarrier
/dev/sdb2              /binlog                 xfs     defaults,nobarrier

ext4

ext4 has been developed as the successor to ext3 with added performance improvements. It is a solid option that will fit most workloads. We should note here that it supports files up to 16TB in size, a smaller limit than xfs. This is something you should consider if extreme table space size/growth is a requirement. Regarding mount options, the same considerations apply. We recommend the defaults for a robust filesystem without risks to data consistency. However, if an enterprise storage controller with a BBU cache is present, the following mount options will provide the best performance:

/dev/sda2              /datastore              ext4     noatime,data=writeback,barrier=0,nobh,errors=remount-ro
/dev/sdb2              /binlog                 ext4     noatime,data=writeback,barrier=0,nobh,errors=remount-ro

Note: The data=writeback option results in only metadata being journaled, not actual file data. This has the risk of corrupting recently modified files in the event of a sudden power loss, a risk which is minimised with a presence of a BBU enabled controller. nobh only works with the data=writeback option enabled.

ZFS

ZFS is a filesystem and LVM combined enterprise storage solution with extended protection vs data corruption. There are certainly cases where the rich feature set of ZFS makes it an essential option to consider, most notably when advance volume management is a requirement. ZFS tuning for MySQL can be a complex topic and falls outside the scope of this blog. For further reference, there is a dedicated blog post on the subject by Yves Trudeau:

Disk Subsystem – I/O scheduler 

Most modern Linux distributions come with noop or deadline I/O schedulers by default, both providing better performance than the cfq and anticipatory ones. However it is always a good practice to check the scheduler for each device and if the value shown is different than noop or deadline the policy can change without rebooting the server:

# View the I/O scheduler setting. The value in square brackets shows the running scheduler
cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]
# Change the setting
sudo echo noop > /sys/block/sdb/queue/scheduler

To make the change persistent, you must modify the GRUB configuration file:

# Change the line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
# to:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=noop"

AWS Note: There are cases where the I/O scheduler has a value of none, most notably in AWS VM instance types where EBS volumes are exposed as NVMe block devices. This is because the setting has no use in modern PCIe/NVMe devices. The reason is that they have a very large internal queue and they bypass the IO scheduler altogether. The setting in this case is none and it is the optimal in such disks.

Disk Subsystem – Volume optimization

Ideally different disk volumes should be used for the OS installation, binlog, data and the redo log, if this is possible. The separation of OS and data partitions, not just logically but physically, will improve database performance. The RAID level can also have an impact: RAID-5 should be avoided as the checksum needed to ensure integrity is costly. The best performance without making compromises to redundancy is achieved by the use of an advanced controller with a battery-backed cache unit and preferably RAID-10 volumes spanned across multiple disks.

AWS Note: For further information about EBS volumes and AWS storage optimisation, Amazon has documentation at the following links:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-optimized-instances.html

Database settings

System Architecture – NUMA settings

Non-uniform memory access (NUMA) is a memory design where an SMP’s system processor can access its own local memory faster than non-local memory (the one assigned local to other CPUs). This may result in suboptimal database performance and potentially swapping. When the buffer pool memory allocation is larger than size of the RAM available local to the node, and the default memory allocation policy is selected, swapping occurs. A NUMA enabled server will report different node distances between CPU nodes. A uniformed one will report a single distance:

# NUMA system
numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 65525 MB
node 0 free: 296 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 65536 MB
node 1 free: 9538 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 65536 MB
node 2 free: 12701 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 65535 MB
node 3 free: 7166 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
# Uniformed system
numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 64509 MB
node 0 free: 4870 MB
node distances:
node   0
  0:  10

In the case of a NUMA system, where numactl shows different distances across nodes, the MySQL variable innodb_numa_interleave should be enabled to ensure memory interleaving. Percona Server provides improved NUMA support by introducing the flush_caches variable. When enabled, it will help with allocation fairness across nodes. To determine whether or not allocation is equal across nodes, you can examine numa_maps for the mysqld process with this script:

# The perl script numa_maps.pl will report memory allocation per CPU node:
# 3595 is the pid of the mysqld process
perl numa_maps.pl < /proc/3595/numa_maps
N0        :     16010293 ( 61.07 GB)
N1        :     10465257 ( 39.92 GB)
N2        :     13036896 ( 49.73 GB)
N3        :     14508505 ( 55.35 GB)
active    :          438 (  0.00 GB)
anon      :     54018275 (206.06 GB)
dirty     :     54018275 (206.06 GB)
kernelpagesize_kB:         4680 (  0.02 GB)
mapmax    :          787 (  0.00 GB)
mapped    :         2731 (  0.01 GB)

Conclusion

In this blog post we examined a few important OS related settings and explained how they can be tuned for better database performance.

While you are here …

You might also find value in this recorded webinar Troubleshooting Best Practices: Monitoring the Production Database Without Killing Performance

 

The post Linux OS Tuning for MySQL Database Performance appeared first on Percona Database Performance Blog.

by Spyros Voultepsis at July 03, 2018 12:57 PM

MariaDB Foundation

MariaDB 10.3.8 now available

The MariaDB Foundation is pleased to announce the availability of MariaDB 10.3.8, the latest stable release in the MariaDB 10.3 series. See the release notes and changelogs for details. Download MariaDB 10.3.8 Release Notes Changelog What is MariaDB 10.3? MariaDB APT and YUM Repository Configuration Generator Contributors to MariaDB 10.3.8 Adrian Bunk (Debian) Aleksey Midenkov […]

The post MariaDB 10.3.8 now available appeared first on MariaDB.org.

by Ian Gilfillan at July 03, 2018 07:18 AM

MariaDB AB

MariaDB Server 10.3.8 now available

MariaDB Server 10.3.8 now available dbart Mon, 07/02/2018 - 21:05

The MariaDB project is pleased to announce the immediate availability of MariaDB Server 10.3.8. See the release notes and changelog for details and visit mariadb.com/downloads to download.

Download MariaDB Server 10.3.8

Release Notes Changelog What is MariaDB 10.3?

Get MariaDB Server 10.3 as part of the MariaDB TX 3.0 download – now available.

The MariaDB project is pleased to announce the immediate availability of MariaDB Server 10.3.8. See the release notes and changelog for details.

Login or Register to post comments

by dbart at July 03, 2018 01:05 AM

July 02, 2018

Jean-Jerome Schmidt

New Whitepaper: Disaster Recovery Planning for MySQL & MariaDB

We’re happy to announce that our new whitepaper Disaster Recovery Planning for MySQL & MariaDB is now available to download for free!

Database outages are almost inevitable and understanding the timeline of an outage can help us better prepare, diagnose and recover from one. To mitigate the impact of downtime, organizations need an appropriate disaster recovery (DR) plan. However, it makes no business sense to abstract the cost of a DR solution from the design of it, so organizations have to implement the right level of protection at the lowest possible cost.

This white paper provides essential insights into how to build such a plan, discussing the database mechanisms involved as well as how these mechanisms can be fully automated with ClusterControl, a management platform for open source database systems.

Topics included in this whitepaper are…

  • Business Considerations for Disaster Recovery
    • Is 100% Uptime Possible?
    • Analysing Risk
    • Assessing Business Impact
  • Defining Disaster Recovery
    • Recovery Time Objectives
    • Recovery Point Objectives
  • Disaster Recovery Tiers
    • Offsite Data
    • Backups and Hot Sites

Download the whitepaper today!

ClusterControl
Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

About the Author

Vinay Joosery, CEO & Co-Founder, Severalnines

Vinay Joosery, CEO, Severalnines, is a passionate advocate and builder of concepts and business around distributed database systems. Prior to co-founding Severalnines, Vinay held the post of Vice-President EMEA at Pentaho Corporation - the Open Source BI leader. He has also held senior management roles at MySQL / Sun Microsystems / Oracle, where he headed the Global MySQL Telecoms Unit, and built the business around MySQL's High Availability and Clustering product lines. Prior to that, Vinay served as Director of Sales & Marketing at Ericsson Alzato, an Ericsson-owned venture focused on large scale real-time databases.

About ClusterControl

ClusterControl is the all-inclusive open source database management system for users with mixed environments that removes the need for multiple management tools. ClusterControl provides advanced deployment, management, monitoring, and scaling functionality to get your MySQL, MongoDB, and PostgreSQL databases up-and-running using proven methodologies that you can depend on to work. At the core of ClusterControl is it’s automation functionality that lets you automate many of the database tasks you have to perform regularly like deploying new databases, adding and scaling new nodes, running backups and upgrades, and more.

To learn more about ClusterControl click here.

by fwlymburner at July 02, 2018 01:57 PM

Peter Zaitsev

Fixing ER_MASTER_HAS_PURGED_REQUIRED_GTIDS when pointing a slave to a different master

gtid auto position

gtid auto positionGTID replication has made it convenient to setup and maintain MySQL replication. You need not worry about binary log file and position thanks to GTID and auto-positioning. However, things can go wrong when pointing a slave to a different master. Consider a situation where the new master has executed transactions that haven’t been executed on the old master. If the corresponding binary logs have been purged already, how do you point the slave to the new master?

The scenario

Based on technical requirements and architectural change, there is a need to point the slave to a different master by

  1. Pointing it to another node in a PXC cluster
  2. Pointing it to another master in master/master replication
  3. Pointing it to another slave of a master
  4. Pointing it to the slave of a slave of the master … and so on and so forth.

Theoretically, pointing to a new master with GTID replication is easy. All you have to do is run:

STOP SLAVE;
CHANGE MASTER TO MASTER_HOST='new_master_ip';
START SLAVE;
SHOW SLAVE STATUS\G

Alas, in some cases, replication breaks due to missing binary logs:

*************************** 1. row ***************************
Slave_IO_State:
Master_Host: pxc_57_5
Master_User: repl
Master_Port: 3306
**redacted**
Slave_IO_Running: No
Slave_SQL_Running: Yes
** redacted **
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
** redacted **
Master_Server_Id: 1
Master_UUID: 4998aaaa-6ed5-11e8-948c-0242ac120007
Master_Info_File: /var/lib/mysql/master.info
** redacted **
Last_IO_Error_Timestamp: 180613 08:08:20
Last_SQL_Error_Timestamp:
** redacted **
Retrieved_Gtid_Set:
Executed_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3
Auto_Position: 1

The strange issue here is that if you point the slave back to the old master, replication works just fine. The error says that there are missing binary logs in the new master that the slave needs. If there’s no problem with replication performance and the slave can easily catch up, then it looks like there are transactions executed in the new master that have not been executed in the old master but are recorded in the missing binary logs. The binary logs are most likely lost due to manually purging with PURGE BINARY LOGS or automatic purging if expire_logs_days is set.

At this point, it would be prudent to check and sync old master and new master with tools such as pt-table-checksum and pt-table-sync. However, if a consistency check has been performed and no differences have been found, or there’s confidence that the new master is a good copy—such as another node in the PXC cluster—you can follow the steps below to resolve the problem.

Solution

To solve the problem, the slave needs to execute the missing transactions. But since these transactions have been purged, the steps below provide the workaround.

Step 1 Find the GTID sequences that are purged from the new master that is needed by the slave

To identify which GTID sequences are missing, run SHOW GLOBAL VARIABLES LIKE 'gtid_purged'; and SHOW MASTER STATUS; on the new master and SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; on the slave:

New Master:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_purged';
+---------------+-------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-------------------------------------------------------------------------------------+
| gtid_purged | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-2,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+---------------+-------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+-------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------------------------------------------------------------------------+
| mysql-bin.000004 | 741 | | | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+------------------+----------+--------------+------------------+-------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Slave:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+------------------------------------------+
| Variable_name | Value |
+---------------+------------------------------------------+
| gtid_executed | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3 |
+---------------+------------------------------------------+
1 row in set (0.00 sec)

Take note that 1904cf31-912b-ee17-4906-7dae335b4bfc and 1904cf31-912b-ee17-4906-7dae335b4bfc are UUIDs and refer to the MySQL instance where the transaction originated from.

Based on the output:

  • The slave has executed 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3
  • The new master has executed 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6 and 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11
  • The new master has purged 1904cf31-912b-ee17-4906-7dae335b4bfc:1-2 and 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11

This means that the slave has no issue with 1904cf31-912b-ee17-4906-7dae335b4bfc it requires sequences 4-6 and sequences 3-6 are still available in the master. However, the slave cannot fetch sequences 1-11 from 4998aaaa-6ed5-11e8-948c-0242ac120007 because these has been purged from the master.

To summarize, the missing GTID sequences are 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11.

Step 2: Identify where the purged GTID sequences came from

From the SHOW SLAVE STATUS output in the introduction section, it says that the Master_UUID is 4998aaaa-6ed5-11e8-948c-0242ac120007, which means the new master is the source of the missing transactions. You can also verify the new Master’s UUID by running SHOW GLOBAL VARIABLES LIKE 'server_uuid';

mysql> SHOW GLOBAL VARIABLES LIKE 'server_uuid';
+---------------+--------------------------------------+
| Variable_name | Value |
+---------------+--------------------------------------+
| server_uuid | 4998aaaa-6ed5-11e8-948c-0242ac120007 |
+---------------+--------------------------------------+
1 row in set (0.00 sec)

If the new master’s UUID does not match the missing GTID, it is most likely that this missing sequence came from its old master, another master higher up the chain or from another PXC node. If that other master still exists, you can run the same query on those masters to check.

The missing sequences are small such as 1-11. Typically, commands executed locally are due to performing maintenance on this server directly. For example, creating users, fixing privileges or updating passwords. However, you have no guarantee that this is the reason, since the binary logs have already been purged. If you still want to point the slave to the new master, proceed to step 3 or step 4.

Step 3. Injecting the missing transactions on the slave with empty transactions

The workaround is to pretend that those missing GTID sequences have been executed on the slave by injecting 11 empty transactions as instructed here by running:

SET GTID_NEXT='UUID:SEQUENCE_NO';
BEGIN;COMMIT;
SET GTID_NEXT='AUTOMATIC';

It looks tedious, but a simple script can automate this:

cat empty_transaction_generator.sh
#!/bin/bash
uuid=$1
first_sequence_no=$2
last_sequence_no=$3
while [ "$first_sequence_no" -le "$last_sequence_no" ]
do
echo "SET GTID_NEXT='$uuid:$first_sequence_no';"
echo "BEGIN;COMMIT;"
first_sequence_no=`expr $first_sequence_no + 1`
done
echo "SET GTID_NEXT='AUTOMATIC';"
bash empty_transaction_generator.sh 4998aaaa-6ed5-11e8-948c-0242ac120007 1 11
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:1';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:2';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:3';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:4';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:5';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:6';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:7';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:8';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:9';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:10';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:11';
BEGIN;COMMIT;
SET GTID_NEXT='AUTOMATIC';

Before executing the generated output on the slave, stop replication first:

mysql> STOP SLAVE;
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:1';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:2';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:3';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:4';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:5';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:6';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:7';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:8';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:9';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:10';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:11';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='AUTOMATIC';
Query OK, 0 rows affected (0.00 sec)

There’s also an even easier solution of injecting empty transactions by using mysqlslavetrx from MySQL utilities. By stopping the slave first and running
mysqlslavetrx --gtid-set=4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 --slaves=root:password@:3306 you will achieve the same result as above.

By running SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; on the slave you can see that sequences 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 have been executed already:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+-------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-------------------------------------------------------------------------------------+
| gtid_executed | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+---------------+-------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

Resume replication and check if replication is healthy by running START SLAVE; and SHOW SLAVE STATUS\G

mysql> START SLAVE;
Query OK, 0 rows affected (0.01 sec)
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: pxc_57_5
Master_User: repl
Master_Port: 3306
** redacted **
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
** redacted **
Seconds_Behind_Master: 0
** redacted **
Master_Server_Id: 1
Master_UUID: 4998aaaa-6ed5-11e8-948c-0242ac120007
** redacted **
Retrieved_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:4-6
Executed_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11
Auto_Position: 1
** redacted **
1 row in set (0.00 sec)

At this point, we have already solved the problem. However, there’s another way to restore the slave much faster but at the cost of erasing all the existing binary logs on the slave as mentioned in this article. If you want to do this, proceed to step 4.

Step 4. Add the missing sequences to GTID_EXECUTED by modifying GTID_PURGED.

CRITICAL NOTE:
If you followed the steps in Step 3, you do not need to perform Step 4!

To add the missing transactions, you’ll need to stop the slave, reset the master, place the original value of gtid_executed and the missing sequences in gtid_purged variable. A word of caution on using this method: this will purge the existing binary logs of the slave.

mysql> STOP SLAVE;
Query OK, 0 rows affected (0.02 sec)
mysql> RESET MASTER;
Query OK, 0 rows affected (0.02 sec)
mysql> SET GLOBAL gtid_purged="1904cf31-912b-ee17-4906-7dae335b4bfc:1-3,4998aaaa-6ed5-11e8-948c-0242ac120007:1-11";
Query OK, 0 rows affected (0.02 sec)

Similar to Step 3, running SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; on the slave shows that sequence 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 has been executed already:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+-------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-------------------------------------------------------------------------------------+
| gtid_executed | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+---------------+-------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

Run START SLAVE; and SHOW SLAVE STATUS\G to resume replication and check if replication is healthy:

mysql> START SLAVE;
Query OK, 0 rows affected (0.01 sec)
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: pxc_57_5
Master_User: repl
Master_Port: 3306
** redacted **
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
** redacted **
Seconds_Behind_Master: 0
** redacted **
Master_Server_Id: 1
Master_UUID: 4998aaaa-6ed5-11e8-948c-0242ac120007
** redacted **
Retrieved_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:4-6
Executed_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11
Auto_Position: 1
** redacted **
1 row in set (0.00 sec)

Step 5. Done

Summary

In this article, I demonstrated how to point the slave to a new master even if it’s missing some binary logs that need to be executed. Although, it is possible to do so with the workarounds shared above, it is prudent to check the consistency of the old and new master first before switching the slave to the new master.

The post Fixing ER_MASTER_HAS_PURGED_REQUIRED_GTIDS when pointing a slave to a different master appeared first on Percona Database Performance Blog.

by Jaime Sicam at July 02, 2018 11:50 AM

July 01, 2018

Valeriy Kravchuk

What's Right and What's Wrong With Oracle's Way of MySQL Server Development

Recently it's quite common to state that "Oracle's Acquisition Was Actually the Best Thing to Happen to MySQL". I am not going to argue with that - Oracle proved over years that they are committed to continue active development of this great open source RDBMS, and they have invested a lot into making it better and implementing features that were missed or became important recently. Unlike Sun Microsystems, they seem to clearly know what to do with this software to make it more popular and make money on it.

Among the right things Oracle does for MySQL server development I'd like to highlight the following:
  1. MySQL server development continues, with new features added, most popular OSes supported, regular releases happened and source code still published at GitHub under GPL license.
  2. Oracle continues to maintain public MySQL bugs database and fix bugs reported there.
  3. Oracle accepts external contributions to MySQL server under clear conditions. They acknowledge contributions in public. The release notes, in particular, mention authors of each community-provided patch.
  4. Oracle works hard on improving performance and scalability of MySQL server.
  5. Oracle tries to provide good background for their new MySQL designs (check new InnoDB redo logging, for example).
  6. Oracle cooperates with MySQL Community. They organize their own community events and participate in numerous related conferences, including (but not limited to) the biggest Percona Live ones. Oracle engineers speak and write about their work in progress. Oracle seems to actively support some open source tools that work with MySQL server, like ProxySQL.
  7. Oracle still keeps and maintains pluggable storage engine architecture and plugin APIs, even though their own development is recently mostly related to InnoDB storage engine.
  8. Oracle still maintains and improves public MySQL Manual.
So, Oracle is doing good with MySQL, and dozens of their customers, community members and MySQL experts keep stating this all the time. But, as a former and current "MySQL Entomologist" (somebody who worked on processing MySQL bug reports from community and reported MySQL bugs for 13 years), I clearly see problems with the way Oracle handles MySQL server development. I write and speak about these problems in public since the end of 2012 or so, and would like to summarize them in this post.
MySQL's future is bright, but there are some clouds


Here is the list of problems I see:
  1. Oracle does not develop MySQL server in a true open source way.
  2. Oracle does not care enough to maintain public bugs database properly.
  3. Some older MySQL features remain half-backed, not well tested, not properly integrated with each other and new features, and not documented properly, for years.
    In general, Oracle's focus seem to be more on new developments and cool features for MySQL (with some of them got ignored and going nowhere with time).
  4. Oracle's internal QA efforts still seem to be somewhat limited.
    We get regression bugs, ASAN failures, debug assertions, crashes, test failures etc in the official releases, and Oracle MySQL still relies a lot on QA by MySQL Community (while not highlighting this fact that much in public).
  5. MySQL Manual still have many details missing and is not fixed fast enough.
    Moreover, it is not open source, so there is no other way for community to fix or improve it other than add comments or report documentation bugs, and wait.
In the upcoming weeks I am going to explain each of these items in a separate post, with some links to my older blog posts, MySQL server bug reports and other sources that should illustrate my points. In the meantime I am open for comments from those who disagree with the theses presented above.

by Valeriy Kravchuk (noreply@blogger.com) at July 01, 2018 03:29 PM

June 30, 2018

Peter Zaitsev

This Week in Data with Colin Charles 44: MongoDB 4.0 and Facebook MyRocks

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

There have been two big pieces of news this week: the release of MongoDB 4.0 and the fact that Facebook has migrated the Messenger backend over to MyRocks.

MongoDB 4.0 is stable, with support for multi-document ACID transactions. I quite like the engineering chalk and talks videos on the transactions page. There are also improvements to help manage your MongoDB workloads in a Kubernetes cluster. MongoDB Atlas supports global clusters (geographically distributed databases, low latency writes, and data placement controls for regulatory compliance), HIPAA compliance, and more. ZDNet calls it the “operational database that is developer friendly”. The TechCrunch take was more focused on MongoDB Atlas, MongoDB launches Global Clusters to put geographic data control within reach of anyone.

In addition to that, I found this little snippet on CNBC featuring Michael Gordon, MongoDB CFO, very interesting: last quarter MongoDB Inc reported 53% year-over-year growth in their subscription revenue business. The fastest-growing piece of the business? Cloud-hosted database as a service offering. They partner with Amazon, Google and Microsoft. They are looking to grow in the Chinese market.

Did you attend MongoDB World 2018? I personally can’t wait to see the presentations. Do not forget to read the MongoDB 4.0 release notes in the manual. Take heed of this important note: “In most cases, multi-document transaction incurs a greater performance cost over single document writes, and the availability of multi-document transaction should not be a replacement for effective schema design.”

As for Facebook Messenger migrating to MyRocks, this blog post is highly detailed: Migrating Messenger storage to optimize performance. This is a migration from the previous HBase backend to MyRocks. End users should notice a more responsive product and better search. For Facebook, storage consumption went down by 90%! The migration methodology to ensure Messenger usage was not disrupted for end users is also worth paying attention to. A more personal note from Yoshinori Matsunobu, as MyRocks is something he’s been spearheading. Don’t forget that you can try out MyRocks in Percona Server for MySQL as well as in MariaDB Server 10.2 and 10.3. To use Zstandard (or zstd for short), Percona Server for MySQL supports this (MariaDB does not, but has varying other compression algorithms).

Have you seen the Percona Open Source Database Community Blog? Jean-François Gagné recently wrote about how he posted on the Community Blog (so a very nice behind the scenes kind of post), and I hope you also read A Nice Feature in MariaDB 10.3: No InnoDB Buffer Pool in Core Dumps. Don’t forget to add this new blog to your RSS feed readers.

Lastly, as a quick note, there will unlikely be a column next week. I’m taking a short vacation, so see you in the following week!

Releases

Link List

Industry Updates

  • Louis Fahrberger (formerly of Clustrix, MariaDB Corporation, InfoBright and MySQL) is now an Account Executive in Sales for MemSQL.
  • The Wall Street Journal reports on Oracle Cloud and how the business continues to grow. “Revenues from its cloud services businesses jumped 25% year over year to $1.7 billion for its fiscal fourth quarter that ended May 31”.
  • The Financial Times reports on Red Hat sinks as currency swings cloud full-year sales outlook. The CFO, Eric Shander said, “we continue to expect strong demand for our hybrid cloud enabling technologies”.

Upcoming appearances

  • OSCON – Portland, Oregon, USA – July 16-19 2018

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

The post This Week in Data with Colin Charles 44: MongoDB 4.0 and Facebook MyRocks appeared first on Percona Database Performance Blog.

by Colin Charles at June 30, 2018 05:04 AM

June 29, 2018

MariaDB AB

MariaDB Roadshow | Minneapolis – July 12

MariaDB Roadshow | Minneapolis – July 12 MariaDB Team Fri, 06/29/2018 - 15:19

MariaDB is on its way to the Twin Cities! 

We’ll be diving into what's new in MariaDB TX 3.0 (with MariaDB Server 10.3 included), best practices for critical enterprise features such as HA and security, containerization and the brand-new Oracle-compatibility features.

Come hang out at Pinstripes Edina on July 12, 9:30 a.m. - 4:30 p.m. Network, learn and enjoy lunch on us. Reserve your seat here.

Do you have a use case you’d like to talk through in detail? Drop a line to nathan.singhal@mariadb.com and we’ll set up a one-on-one meeting at the event. 

Don’t miss out on this complimentary event! 

Speakers

Austin Rutherford
Vice President, Customer Success 

Jon Day
Senior Solutions Engineer

Shane Johnson
Senior Director, Product Marketing 

Venue

Pinstripes Edina
3849 Gallagher Drive
Edina, MN 55435

Login or Register to post comments

by MariaDB Team at June 29, 2018 07:19 PM

Peter Zaitsev

Percona XtraDB Cluster 5.7.22-29.26 Is Now Available

Percona XtraDB Cluster 5.7

Percona XtraDB Cluster 5.6Percona announces the release of Percona XtraDB Cluster 5.7.22-29.26 (PXC) on June 29, 2018. Binaries are available from the downloads section or our software repositories.

Percona XtraDB Cluster 5.7.22-29.26 is now the current release, based on the following:

Deprecated

The following variables are deprecated starting from this release:

  • wsrep-force-binlog-format
  • wsrep_sst_method = mysqldump

As long as the use of binlog_format=ROW is enforced in 5.7, wsrep_forced_binlog_format variable is much less significant. The same is related to mysqldump, as xtrabackup is now the recommended SST method.

New features

  • PXC-907: New variable wsrep_RSU_commit_timeout allows to configure RSU wait for active commit connection timeout (in microseconds).
  • Percona XtraDB Cluster now supports the keyring_vault plugin, which allows to store the master key in a vault server.
  • Percona XtraDB Cluster  5.7.22 depends on Percona XtraBackup  2.4.12 in order to fully support vault plugin functionality.

Fixed Bugs

  • PXC-2127: Percona XtraDB Cluster shutdown process hung if thread_handling option was set to pool-of-threads due to a regression in  5.7.21.
  • PXC-2128: Duplicated auto-increment values were set for the concurrent sessions on cluster reconfiguration due to the erroneous readjustment.
  • PXC-2059: Error message about the necessity of the SUPER privilege appearing in case of the CREATE TRIGGER statements fail due to enabled WSREP was made more clear.
  • PXC-2061: Wrong values could be read, depending on timing, when read causality was enforced with wsrep_sync_wait=1, because of waiting on the commit monitor to be flushed instead of waiting on the apply monitor.
  • PXC-2073CREATE TABLE AS SELECT statement was not replicated in case if result set was empty.
  • PXC-2087: Cluster was entering the deadlock state if table had an unique key and INSERT ... ON DUPLICATE KEY UPDATE statement was executed.
  • PXC-2091: Check for the maximum number of rows, that can be replicated as a part of a single transaction because of the Galera limit, was enforced even when replication was disabled with wsrep_on=OFF.
  • PXC-2103: Interruption of the local running transaction in a COMMIT state by a replicated background transaction while waiting for the binlog backup protection caused the commit fail and, eventually, an assert in Galera.
  • PXC-2130: Percona XtraDB Cluster failed to build with Python 3.
  • PXC-2142: Replacing Percona Server with Percona XtraDB Cluster on CentOS 7 with the yum swap command produced a broken symlink in place of the /etc/my.cnf configuration file.
  • PXC-2154: rsync SST is now aborted with error message if used onnode with keyring_vault plugin configured, because it doesn’t support  keyring_vault. Also Percona doesn’t recommend using rsync-based SST for data-at-rest encryption with keyring.
  •  PXB-1544: xtrabackup --copy-back didn’t read which encryption plugin to use from plugin-load setting of the my.cnf configuration file.
  •  PXB-1540: Meeting a zero sized keyring file, Percona XtraBackup was removing and immediately recreating it, and this could affect external software noticing the file had undergo some manipulations.

Other bugs fixed:

PXC-2072 “flush table <table> for export should be blocked with mode=ENFORCING”.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

The post Percona XtraDB Cluster 5.7.22-29.26 Is Now Available appeared first on Percona Database Performance Blog.

by Dmitriy Kostiuk at June 29, 2018 04:16 PM

MySQL 8.0 Hot Rows with NOWAIT and SKIP LOCKED

MySQL 8.0 hot rows

In MySQL 8.0 there are two new features designed to support lock handling: NOWAIT and SKIP LOCKED. In this post, we’ll look at how MySQL 8.0 handles hot rows. Up until now, how have you handled locks that are part of an active transaction or are hot rows? It’s likely that you have the application attempt to access the data, and if there is a lock on the requested rows, you incur a timeout and have to retry the transaction. These two new features help you to implement sophisticated lock handling scenarios allowing you to handle timeouts better and improve the application’s performance.

To demonstrate I’ll use this product table.

mysql> select @@version;
+-----------+
| @@version |
+-----------+
| 8.0.11    |
+-----------+
1 row in set (0.00 sec)

CREATE TABLE `product` (
`p_id` int(11) NOT NULL AUTO_INCREMENT,
`p_name` varchar(255) DEFAULT NULL,
`p_cost` decimal(19,4) NOT NULL,
`p_availability` enum('YES','NO') DEFAULT 'NO',
PRIMARY KEY (`p_id`),
KEY `p_cost` (`p_cost`),
KEY `p_name` (`p_name`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Let’s run through an example. The transaction below will lock the rows 2 and 3 if not already locked. The rows will get released when our transaction does a COMMIT or a ROLLBACK. Autocommit is enabled by default for any transaction and can be disabled either by using the START TRANSACTION clause or by setting the Autocommit to 0.

Session 1:

mysql> START TRANSACTION;SELECT * FROM mydb.product WHERE p_cost >=20 and p_cost <=30 FOR UPDATE;
Query OK, 0 rows affected (0.00 sec)
+------+--------+---------+----------------+
| p_id | p_name | p_cost  | p_availability |
+------+--------+---------+----------------+
|    2 | Item2  | 20.0000 | YES            |
|    3 | Item3  | 30.0000 | YES            |
+------+--------+---------+----------------+
2 rows in set (0.00 sec)

InnoDB performs row-level locking in such a way that when it searches or scans a table index, it sets shared or exclusive locks on the index records it encounters. Thus, the row-level locks are actually index-record locks.

We can get the details of a transaction such as the transaction id, row lock count etc using the command innodb engine status or by querying the performance_schema.data_locks table. The result from the innodb engine status command can however be confusing as we can see below. Our query only locked rows 3 and 4 but the output of the query reports 5 rows as locked (Count of Locked PRIMARY+ locked selected column secondary index + supremum pseudo-record). We can see that the row right next to the rows that we selected is also reported as locked. This is an expected and documented behavior. Since the table is small with only 5 rows, a full scan of the table is much faster than an index search. This causes all rows or most rows of the table to end up as locked as a result of our query.

Innodb Engine Status :-

---TRANSACTION 205338, ACTIVE 22 sec
3 lock struct(s), heap size 1136, 5 row lock(s)
MySQL thread id 8, OS thread handle 140220824467200, query id 28 localhost root

performance_schema.data_locks (another new feature in 8.0.1):

mysql> SELECT ENGINE_TRANSACTION_ID,
 CONCAT(OBJECT_SCHEMA, '.',
 OBJECT_NAME)TBL,
 INDEX_NAME,count(*) LOCK_DATA
FROM performance_schema.data_locks
where LOCK_DATA!='supremum pseudo-record'
GROUP BY ENGINE_TRANSACTION_ID,INDEX_NAME,OBJECT_NAME,OBJECT_SCHEMA;
+-----------------------+--------------+------------+-----------+
| ENGINE_TRANSACTION_ID | TBL          | INDEX_NAME | LOCK_DATA |
+-----------------------+--------------+------------+-----------+
|                205338 | mydb.product | p_cost     |         3 |
|                205338 | mydb.product | PRIMARY    |         2 |
+-----------------------+--------------+------------+-----------+
2 rows in set (0.04 sec)

mysql> SELECT ENGINE_TRANSACTION_ID as ENG_TRX_ID,
 object_name,
 index_name,
 lock_type,
 lock_mode,
 lock_data
FROM performance_schema.data_locks WHERE object_name = 'product';
+------------+-------------+------------+-----------+-----------+-------------------------+
| ENG_TRX_ID | object_name | index_name | lock_type | lock_mode | lock_data               |
+------------+-------------+------------+-----------+-----------+-------------------------+
|     205338 | product     | NULL       | TABLE     | IX        | NULL                    |
|     205338 | product     | p_cost     | RECORD    | X         | 0x800000000000140000, 2 |
|     205338 | product     | p_cost     | RECORD    | X         | 0x8000000000001E0000, 3 |
|     205338 | product     | p_cost     | RECORD    | X         | 0x800000000000320000, 5 |
|     205338 | product     | PRIMARY    | RECORD    | X         | 2                       |
|     205338 | product     | PRIMARY    | RECORD    | X         | 3                       |
+------------+-------------+------------+-----------+-----------+-------------------------+
6 rows in set (0.00 sec)

Session 1:

mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)


SELECT FOR UPDATE with innodb_lock_wait_timeout:

The innodb_lock_wait_timeout feature is one mechanism that is used to handle lock conflicts. The variable has default value set to 50 sec and causes any transaction that is waiting for a lock for more than 50 seconds to terminate and post a timeout message to the user. The parameter is configurable based on the requirements of the application.

Let’s look at how this feature works using an example with a select for update query.

mysql> select @@innodb_lock_wait_timeout;
+----------------------------+
| @@innodb_lock_wait_timeout |
+----------------------------+
|                         50 |
+----------------------------+
1 row in set (0.00 sec)

Session 1:

mysql> START TRANSACTION;SELECT * FROM mydb.product WHERE p_cost >=20 and p_cost <=30 FOR UPDATE;
Query OK, 0 rows affected (0.00 sec)
+------+--------+---------+----------------+
| p_id | p_name | p_cost  | p_availability |
+------+--------+---------+----------------+
|    2 | Item2  | 20.0000 | YES            |
|    3 | Item3  | 30.0000 | YES            |
+------+--------+---------+----------------+
2 rows in set (0.00 sec)

Session 2:

mysql> select now();SELECT * FROM mydb.product WHERE p_id=3 FOR UPDATE;select now();
+---------------------+
| now()               |
+---------------------+
| 2018-06-19 05:29:48 |
+---------------------+
1 row in set (0.00 sec)
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
+---------------------+
| now()               |
+---------------------+
| 2018-06-19 05:30:39 |
+---------------------+
1 row in set (0.00 sec)
mysql>

Autocommit is enabled (by default) and as expected the transaction waited for lock wait timeout and exited.

Session 1:

mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)


NOWAIT:

The NOWAIT clause causes a query to terminate immediately in the case that candidate rows are already locked. Considering the previous example, if the application’s requirement is to not wait for the locks to be released or for a timeout, using the NOWAIT clause is the perfect solution. (Setting the innodb_lock_wait_timeout=1 in session also has the similar effect). 

Session 1:

mysql> START TRANSACTION;SELECT * FROM mydb.product WHERE p_cost >=20 and p_cost <=30 FOR UPDATE;
Query OK, 0 rows affected (0.00 sec)
+------+--------+---------+----------------+
| p_id | p_name | p_cost  | p_availability |
+------+--------+---------+----------------+
|    2 | Item2  | 20.0000 | YES            |
|    3 | Item3  | 30.0000 | YES            |
+------+--------+---------+----------------+
2 rows in set (0.00 sec)

Session 2:

mysql>  SELECT * FROM mydb.product WHERE p_id = 3 FOR UPDATE NOWAIT;
ERROR 3572 (HY000): Statement aborted because lock(s) could not be acquired immediately and NOWAIT is set.
mysql>

Session 1:

mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)


SKIP LOCKED:

The SKIP LOCKED clause asks MySQL to non-deterministically skip over the locked rows and process the remaining rows based on the where clause. Let’s look at how this works using some examples:

Session 1:

mysql> START TRANSACTION;SELECT * FROM mydb.product WHERE p_cost >=20 and p_cost <=30 FOR UPDATE;
Query OK, 0 rows affected (0.00 sec)
+------+--------+---------+----------------+
| p_id | p_name | p_cost  | p_availability |
+------+--------+---------+----------------+
|    2 | Item2  | 20.0000 | YES            |
|    3 | Item3  | 30.0000 | YES            |
+------+--------+---------+----------------+
2 rows in set (0.00 sec)

Session 2:

mysql> SELECT * FROM mydb.product WHERE p_cost = 30 FOR UPDATE SKIP LOCKED;
Empty set (0.00 sec)
mysql>

mysql> SELECT * from mydb.product where p_id IN (1,2,3,4,5) FOR UPDATE SKIP LOCKED;
+------+--------+---------+----------------+
| p_id | p_name | p_cost  | p_availability |
+------+--------+---------+----------------+
|    1 | Item1  | 10.0000 | YES            |
|    5 | Item5  | 50.0000 | YES            |
+------+--------+---------+----------------+
2 rows in set (0.00 sec)

Session 1:

mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)

The first transaction is selecting rows 2 and 3 for update(ie locked). The second transaction skips these rows and returns the remaining rows when the SKIP LOCKED clause is used.

Important Notes: As the SELECT … FOR UPDATE clause affects concurrency, it should only be used when absolutely necessary. Make sure to index the column part of the where clause as the SELECT … FOR UPDATE is likely to lock the whole table if proper indexes are not setup for the table. When an index is used, only the candidate rows are locked.

The post MySQL 8.0 Hot Rows with NOWAIT and SKIP LOCKED appeared first on Percona Database Performance Blog.

by Arunjith Aravindan at June 29, 2018 01:44 PM

Open Query Pty Ltd

EFF STARTTLS Everywhere project: safer hops for email

Safe and secure online infrastructure is a broad topic, covering databases, privacy, web applications, and much more, and over the years we’ve specifically addressed many of these issues with information and recommendations.

The Electronic Frontier Foundation (EFF) announced the launch of STARTTLS Everywhere, their initiative to improve the security of the email ecosystem. Thanks to previous EFF efforts like Let’s Encrypt (that we’ve written about earlier on the Open Query blog), and the Certbot tool, as well as help from the major web browsers, there have been significant wins in encrypting the web. Now EFF wants to do for email what they’ve done for web browsing: make it simple and easy for everyone to help ensure their communications aren’t vulnerable to mass surveillance.

STARTTLS is an addition to SMTP, which allows one email server to say to the other, “I want to deliver this email to you over an encrypted communications channel.” The recipient email server can then say “Sure! Let’s negotiate an encrypted communications channel.” The two servers then set up the channel and the email is delivered securely, so that anybody listening in on their traffic only sees encrypted data. In other words, network observers gobbling up worldwide information from Internet backbone access points (like the NSA or other governments) won’t be able to see the contents of messages while they’re in transit, and will need to use more targeted, low-volume methods.

STARTTLS Everywhere provides software that a sysadmin can run on an email server to automatically get a valid certificate from Let’s Encrypt. This software can also configure their email server software so that it uses STARTTLS, and presents the valid certificate to other email servers. Finally, STARTTLS Everywhere includes a “preload list” of email servers that have promised to support STARTTLS, which can help detect downgrade attacks.

The net result: more secure email, and less mass surveillance.

This article is based on the announcement in EFFector, EFF’s newsletter.

by Arjen Lentz at June 29, 2018 03:38 AM

June 28, 2018

Peter Zaitsev

Faster Point In Time Recovery (PITR) in PostgreSQL Using a Delayed Standby

PostgreSQL Point in Time Recovery

PostgreSQL Point in Time RecoveryThe need to recover a database back to a certain point in time can be a nerve-racking task for DBAs and for businesses. Can this be simplified? Could it be made to work faster? Can we recover to a given point in time with zero loss of transactions/records? Fortunately, the answer to these questions is yes. PostgreSQL Point in Time Recovery (PITR) is an important facility. It offers DBAs the ability to restore a PostgreSQL database simply, quickly and without the loss of transactions or data.

In this post, we’ll help you to understand how this can be achieved, and reduce the potential for pain in the event of panic situations where you need to perform a PITR.

Before proceeding further, let us understand what could force us to perform a PITR.

  1. Someone has accidentally dropped or truncated a table.
  2. A failed deployment has made changes to the database that are difficult to reverse.
  3. You accidentally deleted or modified a lot of data, and as a consequence you cannot run your applications.

In such scenarios, you would immediately look for the latest full backup and the relevant transaction logs (aka WALs in PostgreSQL) to recover up to a known point in the past, before the error occurred. But what if your backup is corrupt and not valid?

Well, it is very important to perform a backup and recovery validation to ensure that the backups are always recoverable—we will address this in a future post. But, if the backup that you are looking at is corrupt, that can be a nightmare. One such unlucky incident for GitLab, where there was a backup restoration failure, caused a major outage followed by a data loss after recovery.

https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/

Even the best of plans can be hard to realize in practice.

It may be that our backups are intact and recoverable. Can we afford to wait until we copy/download the backup and recover it to another disk or server? What if the database size is several hundreds of GBs or several TBs like GitLab’s?

The solution to the problem is: add another standby that is always delayed by a few hours or a day.

This is one of the great features available in PostgreSQL. If you have migrated from Oracle RDBMS to PostgreSQL, you can think of it as an equivalent to FLASHBACK DATABASE in Oracle. Flashback database helps you to rewind data back in time. However, the technique does not work if you have dropped a data file. In fact, this is the case for both Oracle RDBMS and PostgreSQL PITR. 🙁

https://docs.oracle.com/cd/E11882_01/backup.112/e10642/flashdb.htm#BRADV71000

Adding a Delayed Standby in PostgreSQL

It is important that we use features like streaming replication to achieve high availability in PostgreSQL. Most of the environments have 1 master with 1 or more slaves (standby), either in the same data centre or geographically distributed. To save the time needed for PITR, you can add another slave that can always be delayed by a certain amount of time—this could be hours or days.

For example, if I know that my deployment is determined to be successful when no issues are observed in the first 12 hours, then I might delay one of the standbys by 12 hours.

To delay a standby, once you have setup streaming replication between your PostgreSQL master and slave, the following parameter needs to be added to the recovery.conf file of the slave, followed by a restart.

recovery_min_apply_delay = '12h' # or '1min' or 1d'

Now, let’s consider an example where you have inserted 10000 records at 10:27:34 AM and you have accidentally deleted 5000 records at 10:28:43 AM. Let’s say that you have a standby that is delayed by 1 hour. The steps to perform PITR using the delayed standby through until 10:27:34 AM look like this:

Steps to perform PostgreSQL Point in Time Recovery using a delayed standby


Step 1

Stop the slave (delayed standby) immediately, as soon as you have noticed that an accidental change has happened. If you know that the change has been already applied on the slave, then you cannot perform the point in time recovery using this method.

$ pg_ctl -D $PGDATA stop -mf

Step 2

Rename the recovery.conf file in your standby to another name.

$ mv $PGDATA/recovery.conf $PGDATA/recovery.conf.old

Step 3

Create a new recovery.conf file with the required parameters for PITR.

# recovery.conf file always exists in the Data Directory of Slave
recovery_target_time = '2018-06-07 10:27:34 EDT'
restore_command = 'sh /var/lib/pgsql/scripts/restore_command_script.sh %p %f'
recovery_target_action = 'pause'
recovery_target_inclusive = 'false'

recovery_target_time

Specifies the timestamp up to which you wish to recover your database.

restore_command

Shell command that can be used by PostgreSQL to fetch the required Transaction Logs (WALs) for recovery.
PostgreSQL sends the arguments %p (path to WAL file) and % f (WAL file name) to this shell command. These arguments can be used in the script you use to copy your WALs.

Here is an example script for your reference. This example relies on rsync. The script connects to the backup server to fetch the WALs requested by PostgreSQL. (We’ll cover the procedure to archive these WALs in another blog post soon: this could be a good time to subscribe to the Percona blog mailing list!)

$ cat /var/lib/pgsql/scripts/restore_command_script.sh
#!/bin/bash
# Enable passwordless ssh to Backup Server
# $1 is %p substituted by postgres as the path to WAL File
# $2 is the %f substituted by postgres as the WAL File Name
LOG=/var/lib/pgsql/scripts/restore_command.log
Backup_Server=192.168.0.12
ArchiveDir='/archives'
#
wal=$2
wal_with_path=$1
rsync --no-motd -ave ssh ${Backup_Server}:${ArchiveDir}/${wal} ${wal_with_path} >>$LOG 2>&1
if [ "$?" -ne "0" ]
then
echo "Restore Failed for WAL : $wal" >> $LOG
exit 1
fi
#

recovery_target_action

This is the action that needs to be performed after recovering the instance up to the recovery_target_time. Setting this to pause would let you modify the recovery_target_time after recovery, if you need to. You can then replay the transactions at a slow pace until your desired recovery target is reached. For example, you can recover until 2018-06-07 10:26:34 EDT and then modify recovery_target_time to 2018-06-07 10:27:34 EDT when using pause.

When you know that all the data you are looking for has been recovered, you can issue the following command to stop the recovery process, change the timeline and open the database for writes.

select pg_wal_replay_resume();

Other possible settings for this parameter are promote and shutdown. These do not allow you to replay a few more future transactions after the recovery, as you can with pause.

recovery_target_inclusive

Whether to stop recovery just after the specified recovery_target_time(true) or before(false).

Step 4

Start PostgreSQL using pg_ctl. Now, it should read the parameters in recovery.conf and perform the recovery until the time you set in the recovery_target_time.

$ pg_ctl -D $PGDATA start

Step 5

Here is how the log appears. It says that has performed point-in-time-recovery and has reached a consistent state as requested.

2018-06-07 10:43:22.303 EDT [1423] LOG: starting point-in-time recovery to 2018-06-07 10:27:34-04
2018-06-07 10:43:22.607 EDT [1423] LOG: redo starts at 0/40005B8
2018-06-07 10:43:22.607 EDT [1423] LOG: consistent recovery state reached at 0/40156B0
2018-06-07 10:43:22.608 EDT [1421] LOG: database system is ready to accept read only connections
2018-06-07 10:43:22.626 EDT [1423] LOG: recovery stopping before commit of transaction 570, time 2018-06-07 10:28:59.645685-04
2018-06-07 10:43:22.626 EDT [1423] LOG: recovery has paused
2018-06-07 10:43:22.626 EDT [1423] HINT: Execute pg_wal_replay_resume() to continue.

Step 6

You can now stop recovery and open the database for writes after PITR.

Before executing the next command, you may want to verify that you have got all the desired data by connecting to the database and executing some SQL’s. You can still perform reads before you stop recovery. If you notice that you need another few minutes (or hours) of transactions, then modify the parameter recovery_target_time and go back to step 4. Otherwise, you can stop the recovery by running the following command.

$ psql
select pg_wal_replay_resume();

Summing up

Using PostgreSQL Point in time Recovery is the most simple of procedures that does not involve any effort in identifying the latest backups, transaction logs and space or server to restore in a database emergency. These things happen! Also, it could save a lot of time because the replay of WALs is much faster than rebuilding an entire instance using backups, especially when you have a huge database.

Important post script: I tested and recorded these steps using PostgreSQL 10.4. It is possible with PostgreSQL 9.x versions, however, the parameters could change slightly and you should refer to the PostgreSQL documentation for the correct syntax.

The post Faster Point In Time Recovery (PITR) in PostgreSQL Using a Delayed Standby appeared first on Percona Database Performance Blog.

by Avinash Vallarapu at June 28, 2018 04:00 PM

What To Do When MySQL Runs Out of Memory: Troubleshooting Guide

MySQL memory troubleshooting

MySQL memory troubleshootingTroubleshooting crashes is never a fun task, especially if MySQL does not report the cause of the crash. For example, when MySQL runs out of memory. Peter Zaitsev wrote a blog post in 2012: Troubleshooting MySQL Memory Usage with a lots of useful tips. With the new versions of MySQL (5.7+) and performance_schema we have the ability to troubleshoot MySQL memory allocation much more easily.

In this blog post I will show you how to use it.

First of all, there are 3 major cases when MySQL will crash due to running out of memory:

  1. MySQL tries to allocate more memory than available because we specifically told it to do so. For example: you did not set innodb_buffer_pool_size correctly. This is very easy to fix
  2. There is some other process(es) on the server that allocates RAM. It can be the application (java, python, php), web server or even the backup (i.e. mysqldump). When the source of the problem is identified, it is straightforward to fix.
  3. Memory leaks in MySQL. This is a worst case scenario, and we need to troubleshoot.

Where to start troubleshooting MySQL memory leaks

Here is what we can start with (assuming it is a Linux server):

Part 1: Linux OS and config check
  1. Identify the crash by checking mysql error log and Linux log file (i.e. /var/log/messages or /var/log/syslog). You may see an entry saying that OOM Killer killed MySQL. Whenever MySQL has been killed by OOM “dmesg” also shows details about the circumstances surrounding it.
  2. Check the available RAM:
    • free -g
    • cat /proc/meminfo
  3. Check what applications are using RAM: “top” or “htop” (see the resident vs virtual memory)
  4. Check mysql configuration: check /etc/my.cnf or in general /etc/my* (including /etc/mysql/* and other files). MySQL may be running with the different my.cnf (run
    ps  ax| grep mysql
     )
  5. Run
    vmstat 5 5
     to see if the system is reading/writing via virtual memory and if it is swapping
  6. For non-production environments we can use other tools (like Valgrind, gdb, etc) to examine MySQL usage
Part 2:  Checks inside MySQL

Now we can check things inside MySQL to look for potential MySQL memory leaks.

MySQL allocates memory in tons of places. Especially:

  • Table cache
  • Performance_schema (run:
    show engine performance_schema status
      and look at the last line). That may be the cause for the systems with small amount of RAM, i.e. 1G or less
  • InnoDB (run
    show engine innodb status
      and check the buffer pool section, memory allocated for buffer_pool and related caches)
  • Temporary tables in RAM (find all in-memory tables by running:
    select * from information_schema.tables where engine='MEMORY'
     )
  • Prepared statements, when it is not deallocated (check the number of prepared commands via deallocate command by running show global status like ‘
    Com_prepare_sql';show global status like 'Com_dealloc_sql'
      )

The good news is: starting with MySQL 5.7 we have memory allocation in performance_schema. Here is how we can use it

  1. First, we need to enable collecting memory metrics. Run:
    UPDATE setup_instruments SET ENABLED = 'YES'
    WHERE NAME LIKE 'memory/%';
  2. Run the report from sys schema:
    select event_name, current_alloc, high_alloc
    from sys.memory_global_by_current_bytes
    where current_count > 0;
  3. Usually this will give you the place in code when memory is allocated. It is usually self-explanatory. In some cases we can search for bugs or we might need to check the MySQL source code.

For example, for the bug where memory was over-allocated in triggers (https://bugs.mysql.com/bug.php?id=86821) the select shows:

mysql> select event_name, current_alloc, high_alloc from memory_global_by_current_bytes where current_count > 0;
+--------------------------------------------------------------------------------+---------------+-------------+
| event_name                                                                     | current_alloc | high_alloc  |
+--------------------------------------------------------------------------------+---------------+-------------+
| memory/innodb/buf_buf_pool                                                     | 7.29 GiB      | 7.29 GiB    |
| memory/sql/sp_head::main_mem_root                                              | 3.21 GiB      | 3.62 GiB    |
...

The largest chunk of RAM is usually the buffer pool but ~3G in stored procedures seems to be too high.

According to the MySQL source code documentation, sp_head represents one instance of a stored program which might be of any type (stored procedure, function, trigger, event). In the above case we have a potential memory leak.

In addition we can get a total report for each higher level event if we want to see from the birds eye what is eating memory:

mysql> select  substring_index(
    ->     substring_index(event_name, '/', 2),
    ->     '/',
    ->     -1
    ->   )  as event_type,
    ->   round(sum(CURRENT_NUMBER_OF_BYTES_USED)/1024/1024, 2) as MB_CURRENTLY_USED
    -> from performance_schema.memory_summary_global_by_event_name
    -> group by event_type
    -> having MB_CURRENTLY_USED>0;
+--------------------+-------------------+
| event_type         | MB_CURRENTLY_USED |
+--------------------+-------------------+
| innodb             |              0.61 |
| memory             |              0.21 |
| performance_schema |            106.26 |
| sql                |              0.79 |
+--------------------+-------------------+
4 rows in set (0.00 sec)

I hope those simple steps can help troubleshoot MySQL crashes due to running out of memory.

Links to more resources that might be of interest

The post What To Do When MySQL Runs Out of Memory: Troubleshooting Guide appeared first on Percona Database Performance Blog.

by Alexander Rubin at June 28, 2018 12:00 PM

Jean-Jerome Schmidt

How to Improve Performance of Galera Cluster for MySQL or MariaDB

Galera Cluster comes with many notable features that are not available in standard MySQL replication (or Group Replication); automatic node provisioning, true multi-master with conflict resolutions and automatic failover. There are also a number of limitations that could potentially impact cluster performance. Luckily, if you are not aware of these, there are workarounds. And if you do it right, you can minimize the impact of these limitations and improve overall performance.

We have previously covered many tips and tricks related to Galera Cluster, including running Galera on AWS Cloud. This blog post distinctly dives into the performance aspects, with examples on how to get the most out of Galera.

Replication Payload

A bit of introduction - Galera replicates writesets during the commit stage, transferring writesets from the originator node to the receiver nodes synchronously through the wsrep replication plugin. This plugin will also certify writesets on the receiver nodes. If the certification process passes, it returns OK to the client on the originator node and will be applied on the receiver nodes at a later time asynchronously. Else, the transaction will be rolled back on the originator node (returning error to the client) and the writesets that have been transferred to the receiver nodes will be discarded.

A writeset consists of write operations inside a transaction that changes the database state. In Galera Cluster, autocommit is default to 1 (enabled). Literally, any SQL statement executed in Galera Cluster will be enclosed as a transaction, unless you explicitly start with BEGIN, START TRANSACTION or SET autocommit=0. The following diagram illustrates the encapsulation of a single DML statement into a writeset:

For DML (INSERT, UPDATE, DELETE..), the writeset payload consists of the binary log events for a particular transaction while for DDLs (ALTER, GRANT, CREATE..), the writeset payload is the DDL statement itself. For DMLs, the writeset will have to be certified against conflicts on the receiver node while for DDLs (depending on wsrep_osu_method, default to TOI), the cluster cluster runs the DDL statement on all nodes in the same total order sequence, blocking other transactions from committing while the DDL is in progress (see also RSU). In simple words, Galera Cluster handles DDL and DML replication differently.

Round Trip Time

Generally, the following factors determine how fast Galera can replicate a writeset from an originator node to all receiver nodes:

  • Round trip time (RTT) to the farthest node in the cluster from the originator node.
  • The size of a writeset to be transferred and certified for conflict on the receiver node.

For example, if we have a three-node Galera Cluster and one of the nodes is located 10 milliseconds away (0.01 second), it's very unlikely you might be able to write more than 100 times per second to the same row without conflicting. There is a popular quote from Mark Callaghan which describes this behaviour pretty well:

"[In a Galera cluster] a given row can’t be modified more than once per RTT"

To measure RTT value, simply perform ping on the originator node to the farthest node in the cluster:

$ ping 192.168.55.173 # the farthest node

Wait for a couple of seconds (or minutes) and terminate the command. The last line of the ping statistic section is what we are looking for:

--- 192.168.55.172 ping statistics ---
65 packets transmitted, 65 received, 0% packet loss, time 64019ms
rtt min/avg/max/mdev = 0.111/0.431/1.340/0.240 ms

The max value is 1.340 ms (0.00134s) and we should take this value when estimating the minimum transactions per second (tps) for this cluster. The average value is 0.431ms (0.000431s) and we can use to estimate the average tps while min value is 0.111ms (0.000111s) which we can use to estimate the maximum tps. The mdev means how the RTT samples were distributed from the average. Lower value means more stable RTT.

Hence, transactions per second can be estimated by dividing RTT (in second) into 1 second:

Resulting,

  • Minimum tps: 1 / 0.00134 (max RTT) = 746.26 ~ 746 tps
  • Average tps: 1 / 0.000431 (avg RTT) = 2320.19 ~ 2320 tps
  • Maximum tps: 1 / 0.000111 (min RTT) = 9009.01 ~ 9009 tps

Note that this is just an estimation to anticipate replication performance. There is not much we can do to improve this on the database side, once we have everything deployed and running. Except, if you move or migrate the database servers closer to each other to improve the RTT between nodes, or upgrade the network peripherals or infrastructure. This would require maintenance window and proper planning.

Chunk Up Big Transactions

Another factor is the transaction size. After the writeset is transferred, there will be a certification process. Certification is a process to determine whether or not the node can apply the writeset. Galera generates MD5 checksum pseudo keys from every full row. The cost of certification depends on the size of the writeset, which translates into a number of unique key lookups into the certification index (a hash table). If you update 500,000 rows in a single transaction, for example:

# a 500,000 rows table
mysql> UPDATE mydb.settings SET success = 1;

The above will generate a single writeset with 500,000 binary log events in it. This huge writeset does not exceed wsrep_max_ws_size (default to 2GB) so it will be transferred over by Galera replication plugin to all nodes in the cluster, certifying these 500,000 rows on the receiver nodes for any conflicting transactions that are still in the slave queue. Finally, the certification status is returned to the group replication plugin. The bigger the transaction size, the higher risk it will be conflicting with other transactions that come from another master. Conflicting transactions waste server resources, plus cause a huge rollback to the originator node. Note that a rollback operation in MySQL is way slower and less optimized than commit operation.

The above SQL statement can be re-written into a more Galera-friendly statement with the help of simple loop, like the example below:

(bash)$ for i in {1..500}; do \
mysql -uuser -ppassword -e "UPDATE mydb.settings SET success = 1 WHERE success != 1 LIMIT 1000"; \
sleep 2; \
done

The above shell command would update 1000 rows per transaction for 500 times and wait for 2 seconds between executions. You could also use a stored procedure or other means to achieve a similar result. If rewriting the SQL query is not an option, simply instruct the application to execute the big transaction during a maintenance window to reduce the risk of conflicts.

For huge deletes, consider using pt-archiver from the Percona Toolkit - a low-impact, forward-only job to nibble old data out of the table without impacting OLTP queries much.

Parallel Slave Threads

In Galera, the applier is a multithreaded process. Applier is a thread running within Galera to apply the incoming write-sets from another node. Which means, it is possible for all receivers to execute multiple DML operations that come right from the originator (master) node simultaneously. Galera parallel replication is only applied to transactions when it is safe to do so. It improves the probability of the node to sync up with the originator node. However, the replication speed is still limited to RTT and writeset size.

To get the best out of this, we need to know two things:

  • The number of cores the server has.
  • The value of wsrep_cert_deps_distance status.

The status wsrep_cert_deps_distance tells us the potential degree of parallelization. It is the value of the average distance between highest and lowest seqno values that can be possibly applied in parallel. You can use the wsrep_cert_deps_distance status variable to determine the maximum number of slave threads possible. Take note that this is an average value across time. Hence, in order get a good value, you have to hit the cluster with writes operations through test workload or benchmark until you see a stable value coming out.

To get the number of cores, you can simply use the following command:

$ grep -c processor /proc/cpuinfo
4

Ideally, 2, 3 or 4 threads of slave applier per CPU core is a good start. Thus, the minimum value for the slave threads should be 4 x number of CPU cores, and must not exceed the wsrep_cert_deps_distance value:

MariaDB [(none)]> SHOW STATUS LIKE 'wsrep_cert_deps_distance';
+--------------------------+----------+
| Variable_name            | Value    |
+--------------------------+----------+
| wsrep_cert_deps_distance | 48.16667 |
+--------------------------+----------+

You can control the number of slave applier threads using wsrep_slave_thread variable. Even though this is a dynamic variable, only increasing the number would have an immediate effect. If you reduce the value dynamically, it would take some time, until the applier thread exits after it finishes applying. A recommended value is anywhere between 16 to 48:

mysql> SET GLOBAL wsrep_slave_threads = 48;

Take note that in order for parallel slave threads to work, the following must be set (which is usually pre-configured for Galera Cluster):

innodb_autoinc_lock_mode=2

Galera Cache (gcache)

Galera uses a preallocated file with a specific size called gcache, where a Galera node keeps a copy of writesets in circular buffer style. By default, its size is 128MB, which is rather small. Incremental State Transfer (IST) is a method to prepare a joiner by sending only the missing writesets available in the donor’s gcache. IST is faster than state snapshot transfer (SST), it is non-blocking and has no significant performance impact on the donor. It should be the preferred option whenever possible.

IST can only be achieved if all changes missed by the joiner are still in the gcache file of the donor. The recommended setting for this is to be as big as the whole MySQL dataset. If disk space is limited or costly, determining the right size of the gcache size is crucial, as it can influence the data synchronization performance between Galera nodes.

The below statement will give us an idea of the amount of data replicated by Galera. Run the following statement on one of the Galera nodes during peak hours (tested on MariaDB >10.0 and PXC >5.6, galera >3.x):

mysql> SET @start := (SELECT SUM(VARIABLE_VALUE/1024/1024) FROM information_schema.global_status WHERE VARIABLE_NAME LIKE 'WSREP%bytes'); do sleep(60); SET @end := (SELECT SUM(VARIABLE_VALUE/1024/1024) FROM information_schema.global_status WHERE VARIABLE_NAME LIKE 'WSREP%bytes'); SET @gcache := (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(@@GLOBAL.wsrep_provider_options,'gcache.size = ',-1), 'M', 1)); SELECT ROUND((@end - @start),2) AS `MB/min`, ROUND((@end - @start),2) * 60 as `MB/hour`, @gcache as `gcache Size(MB)`, ROUND(@gcache/round((@end - @start),2),2) as `Time to full(minutes)`;
+--------+---------+-----------------+-----------------------+
| MB/min | MB/hour | gcache Size(MB) | Time to full(minutes) |
+--------+---------+-----------------+-----------------------+
|   7.95 |  477.00 |  128            |                 16.10 |
+--------+---------+-----------------+-----------------------+

We can estimate that the Galera node can have approximately 16 minutes of downtime, without requiring SST to join (unless Galera cannot determine the joiner state). If this is too short time and you have enough disk space on your nodes, you can change the wsrep_provider_options="gcache.size=<value>" to a more appropriate value. In this example workload, setting gcache.size=1G allows us to have 2 hours of node downtime with high probability of IST when the node rejoins.

It's also recommended to use gcache.recover=yes in wsrep_provider_options (Galera >3.19), where Galera will attempt to recover the gcache file to a usable state on startup rather than delete it, thus preserving the ability to have IST and avoiding SST as much as possible. Codership and Percona have covered this in details in their blogs. IST is always the best method to sync up after a node rejoins the cluster. It is 50% faster than xtrabackup or mariabackup and 5x faster than mysqldump.

Asynchronous Slave

Galera nodes are tightly-coupled, where the replication performance is as fast as the slowest node. Galera use a flow control mechanism, to control replication flow among members and eliminate any slave lag. The replication can be all fast or all slow on every node and is adjusted automatically by Galera. If you want to know about flow control, read this blog post by Jay Janssen from Percona.

In most cases, heavy operations like long running analytics (read-intensive) and backups (read-intensive, locking) are often inevitable, which could potentially degrade the cluster performance. The best way to execute this type of queries is by sending them to a loosely-coupled replica server, for instance, an asynchronous slave.

An asynchronous slave replicates from a Galera node using the standard MySQL asynchronous replication protocol. There is no limit on the number of slaves that can be connected to one Galera node, and chaining it out with an intermediate master is also possible. MySQL operations that execute on this server won't impact the cluster performance, apart from the initial syncing phase where a full backup must be taken on the Galera node to stage the slave before establishing the replication link (although ClusterControl allows you to build the async slave from an existing backup first, before connecting it to the cluster).

GTID (Global Transaction Identifier) provides a better transactions mapping across nodes, and is supported in MySQL 5.6 and MariaDB 10.0. With GTID, the failover operation on a slave to another master (another Galera node) is simplified, without the need to figure out the exact log file and position. Galera also comes with its own GTID implementation but these two are independent to each other.

Scaling out an asynchronous slave is one-click away if you are using ClusterControl -> Add Replication Slave feature:

Take note that binary logs must be enabled on the master (the chosen Galera node) before we can proceed with this setup. We have also covered the manual way in this previous post.

The following screenshot from ClusterControl shows the cluster topology, it illustrates our Galera Cluster architecture with an asynchronous slave:

ClusterControl automatically discovers the topology and generates the super cool diagram like above. You can also perform administration tasks directly from this page by clicking on the top-right gear icon of each box.

SQL-aware Reverse Proxy

ProxySQL and MariaDB MaxScale are intelligent reverse-proxies which understand MySQL protocol and is capable of acting as a gateway, router, load balancer and firewall in front of your Galera nodes. With the help of Virtual IP Address provider like LVS or Keepalived, and combining this with Galera multi-master replication technology, we can have a highly available database service, eliminating all possible single-point-of-failures (SPOF) from the application point-of-view. This will surely improve the availability and reliability the architecture as whole.

Another advantage with this approach is you will have the ability to monitor, rewrite or re-route the incoming SQL queries based on a set of rules before they hit the actual database server, minimizing the changes on the application or client side and routing queries to a more suitable node for optimal performance. Risky queries for Galera like LOCK TABLES and FLUSH TABLES WITH READ LOCK can be prevented way ahead before they would cause havoc to the system, while impacting queries like "hotspot" queries (a row that different queries want to access at the same time) can be rewritten or being redirected to a single Galera node to reduce the risk of transaction conflicts. For heavy read-only queries like OLAP or backup, you can route them over to an asynchronous slave if you have any.

Reverse proxy also monitors the database state, queries and variables to understand the topology changes and produce an accurate routing decision to the backend servers. Indirectly, it centralizes the nodes monitoring and cluster overview without the need to check on each and every single Galera node regularly. The following screenshot shows the ProxySQL monitoring dashboard in ClusterControl:

There are also many other benefits that a load balancer can bring to improve Galera Cluster significantly, as covered in details in this blog post, Become a ClusterControl DBA: Making your DB components HA via Load Balancers.

Final Thoughts

With good understanding on how Galera Cluster internally works, we can work around some of the limitations and improve the database service. Happy clustering!

by ashraf at June 28, 2018 09:26 AM

June 27, 2018

Peter Zaitsev

Percona Monitoring and Management 1.12.0 Is Now Available

Percona Monitoring and Management

Percona Monitoring and ManagementPMM (Percona Monitoring and Management) is a free and open-source platform for managing and monitoring MySQL and MongoDB performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

In release 1.12, we invested our efforts in the following areas:

  • Visual Explain in Query Analytics – Gain insight into MySQL’s query optimizer for your queries
  • New Dashboard – InnoDB Compression Metrics – Evaluate effectiveness of InnoDB Compression
  • New Dashboard – MySQL Command/Handler Compare – Contrast MySQL instances side by side
  • Updated Grafana to 5.1 – Fixed scrolling issues

We addressed 10 new features and improvements, and fixed 13 bugs.

Visual Explain in Query Analytics

We’re working on substantial changes to Query Analytics and the first part to roll out is something that users of Percona Toolkit may recognize – we’ve introduced a new element called Visual Explain based on pt-visual-explain.  This functionality transforms MySQL EXPLAIN output into a left-deep tree representation of a query plan, in order to mimic how the plan is represented inside MySQL.  This is of primary benefit when investigating tables that are joined in some logical way so that you can understand in what order the loops are executed by the MySQL query optimizer. In this example we are demonstrating the output of a single table lookup vs two table join:

Single Table Lookup Two Tables via INNER JOIN
SELECT DISTINCT c
FROM sbtest13
WHERE id
BETWEEN 49808
AND 49907
ORDER BY c
SELECT sbtest3.c
FROM sbtest1
INNER JOIN sbtest3
ON sbtest1.id = sbtest3.id
WHERE sbtest3.c ='long-string';

InnoDB Compression Metrics Dashboard

A great feature of MySQL’s InnoDB storage engine includes compression of data that is transparently handled by the database, saving you space on disk, while reducing the amount of I/O to disk as fewer disk blocks are required to store the same amount of data, thus allowing you to reduce your storage costs.  We’ve deployed a new dashboard that helps you understand the most important characteristics of InnoDB’s Compression.  Here’s a sample of visualizing Compression and Decompression attempts, alongside the overall Compression Success Ratio graph:

 

MySQL Command/Handler Compare Dashboard

We have introduced a new dashboard that lets you do side-by-side comparison of Command (Com_*) and Handler statistics.  A common use case would be to compare servers that share a similar workload, for example across MySQL instances in a pool of replicated slaves.  In this example I am comparing two servers under identical sysbench load, but exhibiting slightly different performance characteristics:

The number of servers you can select for comparison is unbounded, but depending on the screen resolution you might want to limit to 3 at a time for a 1080 screen size.

New Features & Improvements

  • PMM-2519: Display Visual Explain in Query Analytics
  • PMM-2019: Add new Dashboard InnoDB Compression metrics
  • PMM-2154: Add new Dashboard Compare Commands and Handler statistics
  • PMM-2530: Add timeout flags to mongodb_exporter (thank you unguiculus for your contribution!)
  • PMM-2569: Update the MySQL Golang driver for MySQL 8 compatibility
  • PMM-2561: Update to Grafana 5.1.3
  • PMM-2465: Improve pmm-admin debug output
  • PMM-2520: Explain Missing Charts from MySQL Dashboards
  • PMM-2119: Improve Query Analytics messaging when Host = All is passed
  • PMM-1956: Implement connection checking in mongodb_exporter

Bug Fixes

  • PMM-1704: Unable to connect to AtlasDB MongoDB
  • PMM-1950: pmm-admin (mongodb:metrics) doesn’t work well with SSL secured mongodb server
  • PMM-2134: rds_exporter exports memory in Kb with node_exporter labels which are in bytes
  • PMM-2157: Cannot connect to MongoDB using URI style
  • PMM-2175: Grafana singlestat doesn’t use consistent colour when unit is of type Time
  • PMM-2474: Data resolution on Dashboards became 15sec interval instead of 1sec
  • PMM-2581: Improve Travis CI tests by addressing pmm-admin check-network Time Drift
  • PMM-2582: Unable to scroll on “_PMM Add Instance” page when many RDS instances exist in an AWS account
  • PMM-2596: Set fixed height for panel content in PMM Add Instances
  • PMM-2600: InnoDB Checkpoint Age does not show data for MySQL
  • PMM-2620: Fix balancerIsEnabled & balancerChunksBalanced values
  • PMM-2634: pmm-admin cannot create user for MySQL 8
  • PMM-2635: Improve error message while adding metrics beyond “exit status 1”

Known Issues

  • PMM-2639: mysql:metrics does not work on Ubuntu 18.04 – We will address this in a subsequent release

How to get PMM Server

PMM is available for installation using three methods:

The post Percona Monitoring and Management 1.12.0 Is Now Available appeared first on Percona Database Performance Blog.

by Michael Coburn at June 27, 2018 07:36 PM

Oli Sennhauser

FromDual Backup and Recovery Manager for MariaDB and MySQL 2.0.0 has been released

FromDual has the pleasure to announce the release of the new version 2.0.0 of its popular Backup and Recovery Manager for MariaDB and MySQL (brman).

The new FromDual Backup and Recovery Manager can be downloaded from here. How to install and use the Backup and Recovery Manager is describe in FromDual Backup and Recovery Manager (brman) installation guide.

In the inconceivable case that you find a bug in the FromDual Backup and Recovery Manager please report it to the FromDual Bugtracker or just send us an email.

Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

Upgrade from 1.2.x to 2.0.0

brman 2.0.0 requires a new PHP package for ssh connections.

shell> sudo apt-get install php-ssh2

shell> cd ${HOME}/product
shell> tar xf /download/brman-2.0.0.tar.gz
shell> rm -f brman
shell> ln -s brman-2.0.0 brman

Changes in FromDual Backup and Recovery Manager 2.0.0

This release is a new major release series. It contains a lot of new features. We have tried to maintain backward-compatibility with the 1.2 release series. But you should test the new release seriously!

You can verify your current FromDual Backup Manager version with the following command:

shell> fromdual_bman --version
shell> bman --version

FromDual Backup Manager

  • brman was made ready for MariaDB 10.3
  • brman was made ready for MySQL 8.0
  • DEB and RPM packages are prepared.
  • fromdual_brman was renamed to brman (because of DEB).
  • mariabackup support was added.
  • Timestamp for physical backup and compression added to log file.
  • Option --no-purge added to not purge binary logs during binlog backup.
  • A failed archive command in physical full backup does not abort backup loop any more.
  • Return code detection for different physical backup tools improved.
  • Bug in fetching binlog file and position from xtrabackup_binlog_info fixed.
  • Version made MyEnv compliant.
  • Errors and warnings are written to STDERR.
  • General Tablespace check implemented. Bug in mysqldump. Only affects MySQL 5.7 and 8.0. MariaDB up to 10.3 has not implemented this feature yet.
  • Warning messages improved.
  • Option --quick added for logical (mysqldump) backup to speed up backup.
  • On schema backups FLUSH BINARY LOGS is executed only once when --per-schema backup is used.
  • The database user root should not be used for backups any more. User brman is suggested.
  • Option --pass-through is implemented to pass options like --ignore-table through to the backend backup tool (mysqldump, mariabackup, xtrabackup, mysqlbackup).
  • bman can report to fpmmm/Zabbix now.
  • Check for binary logging made less intrusive.
  • All return codes (rc) are matching to new schema now. That means errors do not necessarily have same error codes with new brman version.
  • If RELOAD privilege is missing --master-data and/or --flush-logs options are omitted. This makes bman backups possible for some shared hosting and cloud environments.
  • Schema backup does not require SHOW DATABASES privilege any more. This makes it possible to use bman for shared hosting and cloud environments.
  • Info messages made nicer with empty lines.
  • Option --archivedir is replaced by --archivedestination.
  • Remote copy of backup via rsync, scp and sftp is possible.
  • Connect string was shown wrong in the log file.
  • Connect string of target and catalog made URI conform.
  • bman supports now mariabackup, xtrabackup and mysqlbackup properly (recent releases).

FromDual Backup Manager Catalog

  • Catalog write is done if physical backup hits an error in archiving.
  • Renamed catalog to brman_catalog.

For subscriptions of commercial use of brman please get in contact with us.

by Shinguz at June 27, 2018 04:25 PM

Peter Zaitsev

Scaling PostgreSQL with PgBouncer: You May Need a Connection Pooler Sooner Than You Expect

PostgreSQL connection pools

PostgreSQL connection poolsAs PostgreSQL based applications scale, the need to implement connection pooling can become apparent sooner than you might expect. Since, PostgreSQL to date has no built-in connection pool handler, in this post I’ll explore some of the options for implementing connection pooling. In doing so, we’ll take a look at some of the implications for application performance.

PostgreSQL implements connection handling by “forking” it’s main OS process into a child process for each new connection. An interesting practical consequence of this is that we get a full view of resource utilization per connection in PostgreSQL at the OS level (the output below is from top):

PID USER      PR NI VIRT RES  SHR S %CPU %MEM TIME+  COMMAND             
24379 postgres  20 0 346m 148m 122m R 61.7  7.4 0:46.36 postgres: sysbench sysbench ::1(40120)
24381 postgres  20 0 346m 143m 119m R 62.7  7.1 0:46.14 postgres: sysbench sysbench ::1(40124)
24380 postgres  20 0 338m 137m 121m R 57.7  6.8 0:46.04 postgres: sysbench sysbench ::1(40122)
24382 postgres  20 0 338m 129m 115m R 57.4  6.5 0:46.09 postgres: sysbench sysbench ::1(40126)

This extra visualization comes at some additional cost though: it is more expensive—in terms of time and memory mostly—to fork an OS process than it would be, for example, to spawn a new thread for an existing process. This might be irrelevant if the rate at which connections are opened and closed is low but becomes increasingly important to consider over time as this reality changes. That is possibly one of the reasons why the need for a connection pooling mechanism often manifests itself early in the scaling life of a PostgreSQL-based application.

Scaling connections

When an application server sends a connection request to a PostgreSQL database it will be received by the Postmaster process. Postmaster, observing the limit set by max_connections, will then fork itself, creating a new backend process to handle this new connection. This backend process will live until the connection is closed by the client or terminated by PostgreSQL itself.

If the application was conceived with the database in mind it will make an effective use of connections, reusing existing ones whenever possible while avoiding idle connections from laying around for too long. Unfortunately, that is not always the case. Plus there are legitimate situations where the application popularity/usage increases. These naturally increases the rate at which new connection requests are created. There is a practical limit for the number of connections a server can manage at a given time. Beyond these limits, we start seeing contention in different areas. This in turn affects the server’s capacity to process requests, which can lead to bigger problems.

Putting a cap on the number of connections that may exist at any given time helps somewhat. However, considering the time it takes to properly establish a new connection, if the rate at which new connection requests arrive continues to rise we may soon reach a scaling problem.

Connection pooling for PostgreSQL applications

What if we could instead recycle existing connections to serve new client requests to gain on time (by avoiding the creation and remotion of yet another backend process each time), ultimately increasing transaction throughput, while also making a better use of the resources available? The strategy that revolves around the use of a cache (or pool) of connections that are kept open on the database server and re-used by different client requests is known as connection pooling.

Given there is no built-in handler for PostgreSQL, there are typically two ways of implementing such a mechanism:

  1. On the application side. Some frameworks like Ruby on Rails include their own, built-in connection pool mechanism. There are also libraries that extend database driver functionality to include connection pooling support, such as c3P0.
  2. As an external service, sitting between the application and the database server. The application will then connect to this external service instead, which will relay each request from the application to the database server through one of the connections it maintains in its pool.

An application connection pooler might provide better integration with the application, for example, when it comes to the use of prepared statements and the re-utilization of the cached result set. However, sometimes they may fall short on understanding PostgreSQL protocol and this might result in things like failing to properly clear pre-cached memory. An external service, on the other hand, won’t provide such a tight integration with a particular application but usually allows for greater customization and better maintenance of the pool. This often provides the ability to increase and decrease the number of connections it maintains cached dynamically and according to the demand observed, while also respecting pre-set threshold marks. Probably the most popular connection pooler used with PostgreSQL is PgBouncer.

A simple test

In order to illustrate the impact a connection pooler might have on the performance of a PostgreSQL server, I took advantage of the recent tests we did with sysbench-tpcc on PostgreSQL and repeated them partially by making use of PgBouncer as a connection pooler.

When we first ran the tests our goal was to optimize PostgreSQL for sysbench-tpcc workload running with 56 concurrent clients (threads), with the server having the same amount of CPUs available. The goal this time was to vary the number of concurrent clients (56, 150, 300 and 600) to see how the server would cope with the scaling of connections.

Instead of running each round for 10 hours, however, I ran them for 30 minutes only. This might mask, or at least change, the effects of checkpointing and caching observed in our initial tests.

PgBouncer

I compiled the latest version of PgBouncer (1.8.1) following the instruction on GitHub and installed it in our test box, alongside PostgreSQL.

PgBouncer can be configured with three different types of pooling:

  • Session pooling: once the client gets one of the connections in the pool assigned it will keep it until it disconnects (or a timeout is reached).
  • Transaction pooling: once the client gets a connection from the pool, it keeps it to run a single transaction only.  After that, the connection is returned to the pool. If the client wants to run other transactions it has to wait until it gets another connection assigned to it.
  • Statement pooling: in this mode, PgBouncer will return a connection to its pool as soon as the first query is processed, which means multi-statement transactions would break in this mode.

I went with transaction pooling for my tests as the workload of sysbench-tpcc is composed of several short and single-statement transactions. Here’s the configuration file I used in full, which I named pgbouncer.ini:

[databases]
sbtest = host=127.0.0.1 port=5432 dbname=sbtest
[pgbouncer]
listen_port = 6543
listen_addr = 127.0.0.1
auth_type = md5
auth_file = users.txt
logfile = pgbouncer.log
pidfile = pgbouncer.pid
admin_users = postgres
pool_mode = transaction
default_pool_size=56
max_client_conn=600

Apart from pool_mode, the other variables that matter the most are (definitions below came from PgBouncer’s manual page):

  • default_pool_size: how many server connections to allow per user/database pair.
  • max_client_conn: maximum number of client connections allowed

The users.txt file specified by auth_file contains only a single line with the user and password used to connect to PostgreSQL; more elaborate authentication methods are also supported.

Running the test

I started PgBouncer as a daemon with the following command:

$ pgbouncer -d pgbouncer.ini

Apart from running the benchmark for only 30 minutes and varying the number of concurrent threads each time, I employed the exact same options for sysbench-tpcc we used in our previous tests. The example below is from the first run with threads=56:

$ ./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=1800 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --pgsql-password=****** --db-driver=pgsql run > /var/lib/postgresql/Nando/56t.txt

For the tests using the connection pooler I adapted the connection options so as to connect with PgBouncer instead of PostgreSQL directly. Note it remains a local connection:

./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=1800 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --pgsql-password=****** --pgsql-port=6543 --db-driver=pgsql run > /var/lib/postgresql/Nando/P056t.txt

After each sysbench-tpcc execution I cleared the OS cache with the following command:

$ sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'

It is not like this will have made much of a difference. As discussed in our previous post shared_buffers was set with 75% of RAM, enough to fit all of the “hot data” in memory.

Results

Without further ado here are the results I obtained:

comparing the effects of running sysbench-tpcc with a connection pooler
TPS from sysbench-tpcc: comparing direct connection to PostgreSQL and the use of PgBouncer as a connection pooler

When running sysbench-tpcc with only 56 concurrent clients the use of direct connections to PostgreSQL provided a throughput (TPS stands for transactions per second) 2.5 times higher than that obtained when using PgBouncer. The use of a connection pooler in this case was extremely detrimental to performance. At such small scale there were no gains obtained from a pool of connections, only overhead.

When running the benchmark with 150 concurrent clients, however, we start seeing the benefits of employing a connection pooler. In fact, such benefits most probably materialize much earlier. With hindsight going from 56 concurrent transactions to 150 was too big of an initial jump.

It was somewhat surprising to me to realize at first that such throughput could be sustained with PgBouncer even when doubling and then quadrupling the number of concurrent clients. What happens, in this case, is that instead of flooding PostgreSQL with that many requests at once, they all stop at PgBouncer’s door. PgBouncer only allows the next request to proceed to PostgreSQL once one of the connections in its pool is freed. Remember, I configured it in transaction pooling mode. The process is transparent to PostgreSQL. It has no idea how many requests are waiting at PgBouncer’s door to be processed (and thus it is spared the trouble and doesn’t freak out!). It’s effectively like outsourcing connection management, beyond the ones used by the pool itself, to a contractor. PgBouncer does this hard work so PostgreSQL doesn’t need to.

This strategy seems to work great for sysbench-tpcc. With other workloads, the balance point might lie elsewhere.

Experimenting with bigger and smaller connection pools

For the tests above I set default_pool_size on PgBouncer equal to the number of CPU cores available on this server (56). To explore the tuning of this parameter, I repeated these tests using bigger connection pools (150, 300, 600) as well as a smaller one (14). The following chart summarizes the results obtained:

sysbench-tpcc with pgbouncer: comparing different pool sizes
How the use of PgBouncer impacts throughput on sysbench-tpcc: comparing the use of different pool sizes

Using a much smaller connection pool (14), sized to ¼ of the number of available CPUs, still yielded a result almost as good. That says a lot about how much leveraging connection handling alone to PgBouncer already helps. Maybe some of PgBouncer’s “gatekeeper” qualities could be incorporated by PostgreSQL itself?

Doubling the number of connections in the pool didn’t make any practical difference. But as soon as we extrapolate that number to 600 (which is the number of maximum concurrent threads I’ve tested) throughput becomes comparable to when not using a connection pooler once the number of concurrent threads is greater than the number of available CPUs. That’s true even when running as many concurrent threads as there are connections available in the pool (600). There’s a practical limit to it, which clearly is on the PostgreSQL side. That’s expected, otherwise, the need for a connection pool wouldn’t be as important.

As a starting point, setting the connection pool size equal to the number of CPUs available in the server looks like a good idea. There may be a hard limit for the pool around 150 connections or so, after which further benefits aren’t realized. However, that’s only speculation. Further tests using both different hardware and workloads would be needed to investigate this properly.

Here’s the table that summarizes the results obtained, for a different view:

Do I need a connection pooler ?

The need to couple PostgreSQL with a connection pooler depends on a number of factors including:

  • The number of connections typically established with the server. Take into account that the number usually varies significantly during the day. Think about the average number per hour, during different hours in the day, and particularly when compared with peak time.
  • The effective throughput (TPS) produced by these connections
  • The number of CPUs available in view of the effective throughput
  • The nature of your workload:
    • Whether your application opens a new connection for each new request or tends to leave connections open for longer
    • Whether it is composed of various single-statement transactions (AUTOCOMMIT=ON style) or big and long transactions.

Unfortunately, I don’t have a formula for this. But now that you understand how a connection pooler such as PgBouncer works and where it tends to benefit PostgreSQL’s performance (even if only having sysbench-tpcc’s workload as the sole example here) it’s a matter of investigating the points above and experimenting. Experimenting is the key! Though possibly using reads only on a stand-by replica at first, to stay on the safe side…

In a future post, we will cover how the architecture of a database solution changes with the use of connection poolers: where to place them, how to cope with single point of failure issues and how they can be used alongside load balancers.

The post Scaling PostgreSQL with PgBouncer: You May Need a Connection Pooler Sooner Than You Expect appeared first on Percona Database Performance Blog.

by Fernando Laudares at June 27, 2018 02:42 PM

Jean-Jerome Schmidt

Watch the Webinar Replay: MySQL & MariaDB Performance Tuning for Dummies

Thanks to everyone who participated in this week’s webinar on Performance Tuning for MySQL & MariaDB!

If you’ve missed the live session or would like to watch it again, it is now available online to view; especially if any of the following questions sound familiar to you:

You’re running MySQL or MariaDB as backend database, how do you tune it to make best use of the hardware? How do you optimize the Operating System? How do you best configure MySQL or MariaDB for a specific database workload?

MySQL & MariaDB Performance Tuning Webinar

A database server needs CPU, memory, disk and network in order to function. Understanding these resources is important for anybody managing a production database. Any resource that is weak or overloaded can become a limiting factor and cause the database server to perform poorly.

In this webinar, we discuss some of the settings that are most often tweaked and which can bring you significant improvement in the performance of your MySQL or MariaDB database. We also cover some of the variables which are frequently modified even though they should not.

Performance tuning is not easy, especially if you’re not an experienced DBA, but you can go a surprisingly long way with a few basic guidelines.

Agenda

  • What to tune and why?
  • Tuning process
  • Operating system tuning
    • Memory
    • I/O performance
  • MySQL configuration tuning
    • Memory
    • I/O performance
  • Useful tools
  • Do’s and do not’s of MySQL tuning
  • Changes in MySQL 8.0

Speaker

Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard.

This webinar builds upon blog posts by Krzysztof from the ‘Become a MySQL DBA’ series.

by jj at June 27, 2018 01:06 PM

Peter Zaitsev

Webinar 6/28: Securing Database Servers From External Attacks

securing database servers

securing database serversPlease join Percona’s Chief Evangelist Colin Charles on Thursday, June 28th, 2018, as he presents Securing Database Servers From External attacks at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).

 

A critical piece of your infrastructure is the database tier, yet people don’t pay enough attention to it judging by how many are bitten via poorly chosen defaults, or just a lack understanding of running a secure database tier. In this talk, I’ll focus on MySQL/MariaDB, PostgreSQL, and MongoDB, and cover external authentication, auditing, encryption, SSL, firewalls, replication, and more gems from over a decade of consulting in this space from Percona’s 4,000+ customers.

Register Now

 

Colin Charles

Chief Evangelist

Colin Charles is the Chief Evangelist at Percona. He was previously on the founding team of MariaDB Server in 2009, and had worked at MySQL since 2005, and been a MySQL user since 2000. Before joining MySQL, he worked actively on the Fedora and OpenOffice.org projects. He’s well known within open source communities in APAC, and has spoken at many conferences. Experienced technologist, well known in the open source world for work that spans nearly two decades within the community. Pays attention to emerging technologies from an integration standpoint. Prolific speaker at many industry-wide conferences delivering talks and tutorials with ease. Interests: application development, systems administration, database development, migration, Web-based technologies. Considered expert in Linux and Mac OS X usage/administration/roll-out’s. Specialties: MariaDB, MySQL, Linux, Open Source, Community, speaking & writing to technical audiences as well as business stakeholders.

The post Webinar 6/28: Securing Database Servers From External Attacks appeared first on Percona Database Performance Blog.

by Lorraine Pocklington at June 27, 2018 12:38 PM

June 26, 2018

MariaDB Foundation

MariaDB 10.2.16 now available

The MariaDB Foundation is pleased to announce the availability of MariaDB 10.2.16, the latest stable release in the MariaDB 10.2 series. See the release notes and changelogs for details. Download MariaDB 10.2.16 Release Notes Changelog What is MariaDB 10.2? MariaDB APT and YUM Repository Configuration Generator Contributors to MariaDB 10.2.16 Alexander Barkov (MariaDB Corporation) Alexey […]

The post MariaDB 10.2.16 now available appeared first on MariaDB.org.

by Ian Gilfillan at June 26, 2018 03:44 PM

MariaDB AB

MariaDB Server 10.2.16 now available

MariaDB Server 10.2.16 now available dbart Tue, 06/26/2018 - 11:28

The MariaDB project is pleased to announce the immediate availability of MariaDB Server 10.2.16. See the release notes and changelog for details and visit mariadb.com/downloads to download.

Download MariaDB Server 10.2.16

Release Notes Changelog What is MariaDB 10.2?

The MariaDB project is pleased to announce the immediate availability of MariaDB Server 10.2.16. See the release notes and changelog for details.

Login or Register to post comments

by dbart at June 26, 2018 03:28 PM

Peter Zaitsev

MongoDB Explain – Using PMM-QAN for MongoDB Query Analytics

MongoDB explain plan with PMM-QAN

In this blog post, we will walk through PMM-Query Analytics for MongoDB. We will see how to analyze MongoDB query performance; review the initial parameters that we need to check; and find out how to compare MongoDB query performance with and without indexes with the help of EXPLAIN plan.

The Percona Monitoring and Management QAN (PMM-QAN) dashboard helps DBAs and Developers to analyze database queries and identify performance issues more easily. Sometimes it is difficult to find issues by just enabling the profiler and tracking through mongo shell for all the slow queries.

Test Case Environment

We configured the test environment with PMM Server before we ran the test:

PMM Version:<span class="s1">1.11.0</span>
MongoDB Version: 3.4.10
MongoDB Configurations:
3 Member Replicaset (1 Primary, 2 Secondaries)
ns: "test.pro"
Test Query: Find Test case: (with and without Index)
Indexed Field: "product"
Query: db.pro.find({product:"aa"})
Query count: 1000
Total count: 20000

Please Note: We have to enable profiler in the MongoDB environment to reflect the metrics in the QAN page. The QAN dashboard lists multiple queries, based on the threshold we have set in profiler. For this demonstration, we have set profiling level to 2 to get the query reflected in the dashboard.

QAN Analysis

Let’s analyze comparisons of the plans that were collected before and after the index was added to the collection.

Query used:

db.pro.find({product:"aa"})

The QAN dashboard, lists the FIND query for the “pro” collection:

Query Time

After selecting this specific query, we will see its details just below the list.

Review parameter: “Query Time”

Without Index

Here Query Time=18ms, we will check for the query count, docs returned and docs scanned in detail in the explain plan.

With Index

Now the Query Time=15ms, we have improved the query by creating an index, and MongoDB is no longer scanning the whole collection.

EXPLAIN PLAN

Analysis of the query from the QAN EXPLAIN Plan: You can use the toggle button “Expand All” to check the complete execution stats and plans, as this in case of the “test” database

COMPARISON OF PERFORMANCE

Here we make a comparison of query performance with and without the index, and take a look at the basics that need to be checked before and after index creation:

Without Index:

Stage =”COLLSCAN”, this means that MongoDB scans every document in order to fulfil the find query, so you need to create an index to improve query performance

With Index:

Stage=”IXSCAN”, indicates that the optimizer is not scanning the whole document, but scanned only the indexed bound document

Without Index:

It scanned whole documents i.e. docsExamined:20000 and return nReturned:1000.

With Index:

MongoDB uses the index and scans only the relevant documents .i.e. docsExamined:1000 and return nReturned:1000.  We can see the performance improvement from adding an index.

This is how the query behaves after index creation. We can identify exactly which index is being used with QAN. We can discover whether that index is useful or not, and other performance related events, all with an easy UI.

As this blog post is specific to Query Analysis for MongoDB, QAN graphs and their attributes can be accessed from here, an excellent blog post written by my colleague Vinodh.

The post MongoDB Explain – Using PMM-QAN for MongoDB Query Analytics appeared first on Percona Database Performance Blog.

by Aayushi Mangal at June 26, 2018 12:12 PM

Jean-Jerome Schmidt

Schema Management Tips for MySQL & MariaDB

Database schema is not something that is written in stone. It is designed for a given application, but then the requirements may and usually do change. New modules and functionalities are added to the application, more data is collected, code and data model refactoring is performed. Thereby the need to modify the database schema to adapt to these changes; adding or modifying columns, creating new tables or partitioning large ones. Queries change too as developers add new ways for users to interact with the data - new queries could use new, more efficient indexes so we rush to create them in order to provide the application with the best database performance.

So, how do we best approach a schema change? What tools are useful? How to minimize the impact on a production database? What are the most common issues with schema design? What tools can help you to stay on top of your schema? In this blog post we will give you a short overview of how to do schema changes in MySQL and MariaDB. Please note that we will not discuss schema changes in the context of Galera Cluster. We already discussed Total Order Isolation, Rolling Schema Upgrades and tips to minimize impact from RSU in previous blog posts. We will also discuss tips and tricks related to schema design and how ClusterControl can help you to stay on top of all schema changes.

Types of Schema Changes

First things first. Before we dig into the topic, we have to understand how MySQL and MariaDB perform schema changes. You see, one schema change is not equal to another schema change.

You may have heard about online alters, instant alters or in-place alters. All of this is a result of work which is ongoing to minimize the impact of the schema changes on the production database. Historically, almost all schema changes were blocking. If you executed a schema change, all of the queries will start to pile up, waiting for the ALTER to complete. Obviously, this posed serious issues for production deployments. Sure, people immediately start to look for workarounds, and we will discuss them later in this blog, as even today those are still relevant. But also, work started to improve capability of MySQL to run DDL’s (Data Definition Language) without much impact to other queries.

Instant Changes

Sometimes it is not needed to touch any data in the tablespace, because all that has to be changed is the metadata. An example here will be dropping an index or renaming a column. Such operations are quick and efficient. Typically, their impact is limited. It is not without any impact, though. Sometimes it takes couple of seconds to perform the change in the metadata and such change requires a metadata lock to be acquired. This lock is on a per-table basis, and it may block other operations which are to be executed on this table. You’ll see this as “Waiting for table metadata lock” entries in the processlist.

An example of such change may be instant ADD COLUMN, introduced in MariaDB 10.3 and MySQL 8.0. It gives the possibility to execute this quite popular schema change without any delay. Both MariaDB and Oracle decided to include code from Tencent Game which allows to instantly add a new column to the table. This is under some specific conditions; column has to be added as the last one, full text indexes cannot exist in the table, row format cannot be compressed - you can find more information on how instant add column works in MariaDB documentation. For MySQL, the only official reference can be found on mysqlserverteam.com blog, although a bug exists to update the official documentation.

In Place Changes

Some of the changes require modification of the data in the tablespace. Such modifications can be performed on the data itself, and there’s no need to create a temporary table with a new data structure. Such changes, typically (although not always) allow other queries touching the table to be executed while the schema change is running. An example of such operation is to add a new secondary index to the table. This operation will take some time to perform but will allow DML’s to be executed.

Table Rebuild

If it is not possible to make a change in place, InnoDB will create a temporary table with the new, desired structure. It will then copy existing data to the new table. This operation is the most expensive one and it is likely (although it doesn’t always happen) to lock the DML’s. As a result, such schema change is very tricky to execute on a large table on a standalone server, without help of external tools - typically you cannot afford to have your database locked for long minutes or even hours. An example of such operation would be to change the column data type, for example from INT to VARCHAR.

Schema Changes and Replication

Ok, so we know that InnoDB allow online schema changes and if we consult MySQL documentation, we will see that the majority of the schema changes (at least among the most common ones) can be performed online. What is the reason behind dedicating hours of development to create online schema change tools like gh-ost? We can accept that pt-online-schema-change is a remnant of the old, bad times but gh-ost is a new software.

The answer is complex. There are two main issues.

For starters, once you start a schema change, you do not have control over it. You can abort it but you cannot pause it. You cannot throttle it. As you can imagine, rebuilding the table is an expensive operation and even if InnoDB allows DML’s to be executed, additional I/O workload from the DDL affects all other queries and there’s no way to limit this impact to a level that is acceptable to the application.

Second, even more serious issue, is replication. If you execute a non-blocking operation, which requires a table rebuild, it will indeed not lock DML’s but this is true only on the master. Let’s assume such DDL took 30 minutes to complete - ALTER speed depends on the hardware but it is fairly common to see such execution times on tables of 20GB size range. It is then replicated to all slaves and, from the moment DDL starts on those slaves, replication will wait for it to complete. It does not matter if you use MySQL or MariaDB, or if you have multi-threaded replication. Slaves will lag - they will wait those 30 minutes for the DDL to complete before the commence applying the remaining binlog events. As you can imagine, 30 minutes of lag (sometimes even 30 seconds will be not acceptable - it all depends on the application) is something which makes impossible to use those slaves for scale-out. Of course, there are workarounds - you can perform schema changes from the bottom to the top of the replication chain but this seriously limits your options. Especially if you use row-based replication, you can only execute compatible schema changes this way. Couple of examples of limitations of row-based replication; you cannot drop any column which is not the last one, you cannot add a column into a position other than the last one. You cannot also change column type (for example, INT -> VARCHAR).

As you can see, replication adds complexity into how you can perform schema changes. Operations which are non-blocking on the standalone host become blocking while executed on slaves. Let’s take a look at couple of methods you can use to minimize the impact of schema changes.

Online Schema Change Tools

As we mentioned earlier, there are tools, which are intended to perform schema changes. The most popular ones are pt-online-schema-change created by Percona and gh-ost, created by GitHub. In a series of blog posts we compared them and discussed how gh-ost can be used to perform schema changes and how you can throttle and reconfigure an undergoing migration. Here we will not go into details, but we would still like to mention some of the most important aspects of using those tools. For starters, a schema change executed through pt-osc or gh-ost will happen on all database nodes at once. There is no delay whatsoever in terms of when the change will be applied. This makes it possible to use those tools even for schema changes that are incompatible with row-based replication. The exact mechanisms about how those tools track changes on the table is different (triggers in pt-osc vs. binlog parsing in gh-ost) but the main idea is the same - a new table is created with the desired schema and existing data is copied from the old table. In the meantime, DML’s are tracked (one way or the other) and applied to the new table. Once all the data is migrated, tables are renamed and the new table replaces the old one. This is atomic operation so it is not visible to the application. Both tools have an option to throttle the load and pause the operations. Gh-ost can stop all of the activity, pt-osc only can stop the process of copying data between old and new table - triggers will stay active and they will continue duplicating data, which adds some overhead. Due to the rename table, both tools have some limitations regarding foreign keys - not supported by gh-ost, partially supported by pt-osc either through regular ALTER, which may cause replication lag (not feasible if the child table is large) or by dropping the old table before renaming the new one - it’s dangerous as there’s no way to rollback if, for some reason, data wasn’t copied to the new table correctly. Triggers are also tricky to support.

They are not supported in gh-ost, pt-osc in MySQL 5.7 and newer has limited support for tables with existing triggers. Other important limitations for online schema change tools is that unique or primary key has to exist in the table. It is used to identify rows to copy between old and new tables. Those tools are also much slower than direct ALTER - a change which takes hours while running ALTER may take days when performed using pt-osc or gh-ost.

On the other hand, as we mentioned, as long as the requirements are satisfied and limitations won’t come into play, you can run all schema changes utilizing one of the tools. All will happen at the same time on all hosts thus you don’t have to worry about compatibility. You have also some level of control over how the process is executed (less in pt-osc, much more in gh-ost).

You can reduce the impact of the schema change, you can pause them and let them run only under supervision, you can test the change before actually performing it. You can have them track replication lag and pause should an impact be detected. This makes those tools a really great addition to the DBA’s arsenal while working with MySQL replication.

ClusterControl
Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

Rolling Schema Changes

Typically, a DBA will use one of the online schema change tools. But as we discussed earlier, under some circumstances, they cannot be used and a direct alter is the only viable option. If we are talking about standalone MySQL, you have no choice - if the change is non-blocking, that’s good. If it is not, well, there’s nothing you can do about it. But then, not that many people run MySQL as single instances, right? How about replication? As we discussed earlier, direct alter on the master is not feasible - most of the cases it will cause lag on the slave and this may not be acceptable. What can be done, though, is to execute the change in a rolling fashion. You can start with slaves and, once the change is applied on all of them, promote one of the slaves as a new master, demote the old master to a slave and execute the change on it. Sure, the change has to be compatible but, to tell the truth, the most common cases where you cannot use online schema changes is because of a lack of primary or unique key. For all other cases, there is some sort of workaround, especially in pt-online-schema-change as gh-ost has more hard limitations. It is a workaround you would call “so so” or “far from ideal”, but it will do the job if you have no other option to pick from. What is also important, most of the limitations can be avoided if you monitor your schema and catch the issues before the table grows. Even if someone creates a table without a primary key, it is not a problem to run a direct alter which takes half a second or less, as the table is almost empty.

If it will grow, this will become a serious problem but it is up to the DBA to catch this kind of issues before they actually start to create problems. We will cover some tips and tricks on how to make sure you will catch such issues on time. We will also share generic tips on how to design your schemas.

Tips and Tricks

Schema Design

As we showed in this post, online schema change tools are quite important when working with a replication setup therefore it is quite important to make sure your schema is designed in such a way that it will not limit your options for performing schema changes. There are three important aspects. First, primary or unique key has to exist - you need to make sure there are no tables without a primary key in your database. You should monitor this on a regular basis, otherwise it may become a serious problem in the future. Second, you should seriously consider if using foreign keys is a good idea. Sure, they have their uses but they also add overhead to your database and they can make it problematic to use online schema change tools. Relations can be enforced by the application. Even if it means more work, it still may be a better idea than to start using foreign keys and be severely limited to which types of schema changes can be performed. Third, triggers. Same story as with foreign keys. They are a nice feature to have, but they can become a burden. You need to seriously consider if the gains from using them outweight the limitations they pose.

Tracking Schema Changes

Schema change management is not only about running schema changes. You also have to stay on top of your schema structure, especially if you are not the only one doing the changes.

ClusterControl provides users with tools to track some of the most common schema design issues. It can help you to track tables which do not have primary keys:

As we discussed earlier, catching such tables early is very important as primary keys have to be added using direct alter.

ClusterControl can also help you track duplicate indexes. Typically, you don’t want to have multiple indexes which are redundant. In the example above, you can see that there is an index on (k, c) and there’s also an index on (k). Any query which can use index created on column ‘k’ can also use a composite index created on columns (k, c). There are cases where it is beneficial to keep redundant indexes but you have to approach it on case by case basis. Starting from MySQL 8.0, it is possible to quickly test if an index is really needed or not. You can make a redundant index ‘invisible’ by running:

ALTER TABLE sbtest.sbtest1 ALTER INDEX k_1 INVISIBLE;

This will make MySQL ignore that index and, through monitoring, you can check if there was any negative impact on the performance of the database. If everything works as planned for some time (couple of days or even weeks), you can plan on removing the redundant index. In case you detected something is not right, you can always re-enable this index by running:

ALTER TABLE sbtest.sbtest1 ALTER INDEX k_1 VISIBLE;

Those operations are instant and the index is there all the time, and is still maintained - it’s only that it will not be taken into consideration by the optimizer. Thanks to this option, removing indexes in MySQL 8.0 will be much safer operation. In the previous versions, re-adding a wrongly removed index could take hours if not days on large tables.

ClusterControl can also let you know about MyISAM tables.

While MyISAM still may have its uses, you have to keep in mind that it is not a transactional storage engine. As such, it can easily introduce data inconsistency between nodes in a replication setup.

Another very useful feature of ClusterControl is one of the operational reports - a Schema Change Report.

In an ideal world, a DBA reviews, approves and implements all of the schema changes. Unfortunately, this is not always the case. Such review process just does not go well with agile development. In addition to that, Developer-to-DBA ratio typically is quite high which can also become a problem as DBA’s would struggle not to become a bottleneck. That’s why it is not uncommon to see schema changes performed outside of the DBA’s knowledge. Yet, the DBA is usually the one responsible for the database’s performance and stability. Thanks to the Schema Change Report, they can now keep track of the schema changes.

At first some configuration is needed. In a configuration file for a given cluster (/etc/cmon.d/cmon_X.cnf), you have to define on which host ClusterControl should track the changes and which schemas should be checked.

schema_change_detection_address=10.0.0.126
schema_change_detection_databases=sbtest

Once that’s done, you can schedule a report to be executed on a regular basis. An example output may be like below:

As you can see, two tables have changed since the previous run of the report. In the first one, a new composite index has been created on columns (k, c). In the second table, a column was added.

In the subsequent run we got information about new table, which was created without any index or primary key. Using this kind of info, we can easily act when it is needed and solve the issues before they actually start to become blockers.

by krzysztof at June 26, 2018 10:55 AM

Peter Zaitsev

Webinar 6/27: MySQL Troubleshooting Best Practices: Monitoring the Production Database Without Killing Performance

performance troubleshooting MySQL monitoring tools

performance troubleshooting MySQL monitoring toolsPlease join Percona’s Principal Support Escalation Specialist Sveta Smirnova as she presents Troubleshooting Best Practices: Monitoring the Production Database Without Killing Performance on Wednesday, June 27th at 11:00 AM PDT (UTC-7) / 2:00 PM EDT (UTC-4).

 

During the MySQL Troubleshooting webinar series, I covered many monitoring and logging tools such as:

  • General, slow, audit, binary, error log files
  • Performance Schema
  • Information Schema
  • System variables
  • Linux utilities
  • InnoDB monitors
  • PMM

However, I did not spend much time on the impact these instruments have on overall MySQL performance. And they do have an impact.

And this is the conflict many people face. MySQL Server users try exploring these monitoring instruments, see that they slow down their installations, and turn them off. This is unfortunate. If the instrument that can help you resolve a problem is OFF, you won’t have good and necessary information to help understand when, how and why the issue occurred. In the best case, you’ll re-enable instrumentation and wait for the next disaster occurrence. In the worst case, you try various fix options without any real knowledge if they solve the problem or not.

This is why it is important to understand the impact monitoring tools have on your database, and therefore how to minimize it.

Understanding and controlling the impact of MySQL monitoring tools

In this webinar, I cover why certain monitoring tools affect performance, and how to minimize the impact without turning the instrument off. You will learn how to monitor safely and effectively.

Register Now

 

Sveta Smirnova

Principal Support Escalation Specialist

Sveta joined Percona in 2015. Her main professional interests are problem-solving, working with tricky issues, bugs, finding patterns that can quickly solve typical issues and teaching others how to deal with MySQL issues, bugs and gotchas effectively. Before joining Percona, Sveta worked as Support Engineer in MySQL Bugs Analysis Support Group in MySQL AB-Sun-Oracle. She is the author of book “MySQL Troubleshooting” and JSON UDF functions for MySQL.

The post Webinar 6/27: MySQL Troubleshooting Best Practices: Monitoring the Production Database Without Killing Performance appeared first on Percona Database Performance Blog.

by Sveta Smirnova at June 26, 2018 08:48 AM

Open Query Pty Ltd

MariaDB 10.3 use case: Hidden PRIMARY KEY column for closed/legacy applications

Database symbolI really like MariaDB 10.3, it has a lot of interesting new features.  Invisible (hidden) columns, for instance.  Here is a practical use case:

You’re dealing with a legacy application, possibly you don’t even have the source code (or modifying is costly), and this app doesn’t have a PRIMARY KEY on each table. Consequences:

  • Even though InnoDB would have an internal hidden ID in this case (you can’t access that column at all), it affects InnoDB’s locking strategy – I don’t think this aspect is explicitly documented, but we’ve seen table-level locks in this scenario where we’d otherwise see more granular locking;
  • Galera cluster really wants PKs;
  • Asynchronous row-based replication can work without PKs, but it is more efficient with;

So, in a nutshell, your choices for performance and scaling are restricted.

On the other hand, you can’t just add a new ID column, because the application may:

  • use SELECT * and not be able to handle seeing the extra column, and/or
  • use INSERT without an explicit list of columns and then the server would chuck an error for the missing column.

The old way

Typically, what we used to do for cases like this is hunt for UNIQUE indexes that can be converted to PK. That’d be a clean operation with no effect on the application. We use a query like the following to help us find out if this is feasible:


SELECT
CONCAT(s.TABLE_SCHEMA,'.',s.TABLE_NAME) AS tblname,
INDEX_NAME AS idxname,
GROUP_CONCAT(s.COLUMN_NAME) AS cols,
GROUP_CONCAT(IF(s.NULLABLE='YES','X','-')) AS nullable
FROM INFORMATION_SCHEMA.STATISTICS s
WHERE s.table_schema NOT IN ('information_schema','performance_schema','mysql')
AND s.NON_UNIQUE=0
GROUP BY tblname,idxname
HAVING nullable LIKE '%X%'
ORDER BY tblname;

The output is like this:


+----------+---------+------+----------+
| tblname  | idxname | cols | nullable |
+----------+---------+------+----------+
| test.foo | a       | a,b  | -,X      |
+----------+---------+------+----------+

We see that table test.foo has a UNIQUE index over columns (a,b), but column b is NULLable.  A PK may not contain any NULLable columns, so it would not be a viable candidate.

All is not yet lost though, we can further check whether the column data actually contains NULLs. If it doesn’t, we could change the column to NOT NULL and thus solve that problem.  But strictly speaking, that’s more risky as we may not know for certain that the application will never use a NULL in that column. So that’s not ideal.

IF all tables without a PK have existing (or possible) UNIQUE indexes without any NULLable columns, we can resolve this issue. But as you can appreciate, that won’t always be the case.

With MariaDB 10.3 and INVISIBLE columns

Now for Plan B (or actually, our new plan A as it’s much nicer).  Let’s take a closer look at the foo table from the above example:


CREATE TABLE foo (
a int(11) NOT NULL,
b int(11) DEFAULT NULL,
UNIQUE KEY uidx (a,b)
) ENGINE=InnoDB

Our new solution:

ALTER TABLE foo ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY INVISIBLE FIRST

So we’re adding a AUTO_INCREMENT new column named ‘id’, technically first in the table, but we also flag it as invisible.

Normally, you won’t notice, see how we can use INSERT without an explicit (a,b) column list:


MariaDB [test]> INSERT INTO  foo VALUES (2,24);
Query OK, 1 row affected (0.007 sec)

MariaDB [test]> SELECT * FROM foo;
+---+------+
| a | b    |
+---+------+
| 1 |   20 |
| 2 |   24 |
+---+------+

So it won’t even show up with SELECT *.  Fabulous.  We can see the column only if we explicitly ask for it:


MariaDB [test]> SELECT id,a,b FROM foo;
+----+---+------+
| id | a |    b |
+----+---+------+
|  1 | 1 |   20 |
|  2 | 2 |   24 |
+----+---+------+

We solved our table structural issue by introducing a clean AUTO_INCREMENT id column covered by a PRIMARY KEY, while the application remains perfectly happy with what it can SELECT and INSERT. Total win!

by Arjen Lentz at June 26, 2018 02:05 AM

June 25, 2018

Peter Zaitsev

Running Percona XtraDB Cluster in Kubernetes/OpenShift

Diagram of Percona XtraDB Cluster / MySQL running in Kubernetes Open Shift

Kubernetes, and its most popular distribution OpenShift, receives a lot of interest as a container orchestration platform. However, databases remain a foreign entity, primarily because of their stateful nature, since container orchestration systems prefer stateless applications. That said, there has been good progress in support for StatefulSet applications and persistent storage, to the extent that it might be already comfortable to have a production database instance running in Kubernetes. With this in mind, we’ve been looking at running Percona XtraDB Cluster in Kubernetes/OpenShift.

While there are already many examples on the Internet of how to start a single MySQL instance in Kubernetes, for serious usage we need to provide:

  • High Availability: how can we guarantee availability when an instance (or Pod in Kubernetes terminology) crashes or becomes unresponsive?
  • Persistent storage: we do not want to lose our data in case of instance failure
  • Backup and recovery
  • Traffic routing: in the case of multiple instances, how do we direct an application to the correct one
  • Monitoring

Percona XtraDB Cluster in Kubernetes/OpenShift

Schematically it looks like this:


Percona XtraDB Cluster in Kubernetes/OpenShift a possible configuration for a resilient solution

The picture highlights the components we are going to use

Running this in Kubernetes assumes a high degree of automation and minimal manual intervention.

We provide our proof of concept in this project: https://github.com/Percona-Lab/percona-openshift. Please treat it like a source for ideas and as an alpha-quality project, in no way it is production ready.

Details

In our implementation we rely on Helm, the package manager for Kubernetes.  Unfortunately OpenShift does not officially support Helm out of the box, but there is a guide from RedHat on how to make it work.

In the clustering setup, it is quite typical to use a service discovery software like Zookeeper, etcd or Consul. It may become necessary for our Percona XtraDB Cluster deployment, but for now, to simplify deployment, we are going to use the DNS service discovery mechanism provided by Kubernetes. It should be enough for our needs.

We also expect the Kubernetes deployment to provide Dynamic Storage Provisioning. The major cloud providers (like Google Cloud, Microsoft Azure or Amazon Cloud) should have it. Also, it might not be easy to have Dynamic Storage Provisioning for on-premise deployments. You may need to setup GlusterFS or Ceph to provide Dynamic Storage Provisioning.

The challenge with a distributed file system is how many copies of data you will end up having. Percona XtraDB Cluster by itself has three copies, and GlusterFS will also require at least two copies of the data, so in the end we will have six copies of the data. This can’t be good for write intensive applications, but it’s also not good from the capacity standpoint.

One possible approach is to have local data copies for Percona XtraDB Cluster deployments. This will provide better performance and less impact on the network, but in the case of a big dataset (100GB+ ) the node failure will require SST with a big impact on the cluster and network. So the individual solution should be tailored for your workload and your requirements.

Now, as we have a basic setup working, it would be good to understand the performance impact of running Percona XtraDB Cluster in Kubernetes.  Is the network and storage overhead acceptable or it is too big? We plan to look into this in the future.

Once again, our project is located at https://github.com/Percona-Lab/percona-openshift, we are looking for your feedback and for your experience of running databases in Kubernetes/OpenShift.

Before you leave …

Percona XtraDB Cluster

If this article has interested you and you would like to know more about Percona XtraDB Cluster, you might enjoy our recent series of webinar tutorials that introduce this software and how to use it.

The post Running Percona XtraDB Cluster in Kubernetes/OpenShift appeared first on Percona Database Performance Blog.

by Vadim Tkachenko at June 25, 2018 04:27 PM

MongoDB Transactions: Your Very First Transaction with MongoDB 4.0

MongoDB 4.0 transactions

MongoDB 4.0MongoDB 4.0 transactions is just around the corner and with rc0 we can get a good idea of what we can expect in the GA version. MongoDB 4.0 will allow transactions to run in a replica set and, in a future release, the MongoDB transaction will work for sharded clusters. This is a really big change!

Multi-statement transactions are a big deal for a lot of companies. The transactions feature has been in development since MongoDB version 3.0; in version 3.6 logical sessions were added. In an earlier blog post we highlighted a few details from what was delivered in 3.6 that indicated that 4.0 would have transactions.

There are a few limitations for transactions and some operations are not allowed yet. A detailed list can be found in the MongoDB documentation of the Session.startTransaction() method.

One restriction that we must be aware of is that the collection MUST exist in order to use transactions.

A simple transaction will be declared in a very similar way to that we use for other databases. The caveat is that we need to start a session before starting a transaction. This means that multi-statement transactions are not the default behavior to write to the database.

How to use transactions in MongoDB® 4.0

Download MongoDB 4.0 RC (or you can install it from the repositories).

wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-ubuntu1604-4.0.0-rc1.tgz

Uncompress the files:

tar -xvzf mongodb-linux-x86_64-ubuntu1604-4.0.0-rc1.tgz

Rename the folder to mongo4.0 and create the data folder inside of the bin folder:

mv mongodb-linux-x86_64-ubuntu1604-4.0.0-rc1 mongo4.0
cd mongo4.0
cd bin
mkdir data

Start the database process:
Important: in order to have multi-statement transactions replica-set must be enabled

./mongod --dbpath data --logpath data/log.log --fork --replSet foo

Initialize the replica-set:

> rs.initiate()
foo:Primary> use percona
foo:Primary> db.createCollection('test')

Start a session and then a transaction:

session = db.getMongo().startSession()
session.startTransaction()
session.getDatabase("percona").test.insert({today : new Date()})
session.getDatabase("percona").test.insert({some_value : "abc"})

Then you can decide whether to commit the transaction or abort it:

session.commitTransaction()
session.abortTransaction()

If the startTransaction throws the IllegalOperation error, make sure the database is running with replica set.

Transaction isolation level in in MongoDB 4.0: Snapshot Isolation

MongoDB 4.0 implements snapshot isolation for the transactions. The pending uncommitted changes are only visible inside the session context (the session which has started the transaction) and are not visible outside. Here is an example:

Connection 1:

foo:PRIMARY> use percona
switched to db percona
foo:PRIMARY>  db.createCollection('test')
{
        "ok" : 1,
        "operationTime" : Timestamp(1528903182, 1),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1528903182, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}
foo:PRIMARY> session = db.getMongo().startSession()
session { "id" : UUID("bdd82af7-ab9d-4cd3-9238-f08ee928f31e") }
foo:PRIMARY> session.startTransaction()
foo:PRIMARY> session.getDatabase("percona").test.insert({today : new Date()})
WriteResult({ "nInserted" : 1 })
foo:PRIMARY> session.getDatabase("percona").test.insert({some_value : "abc"})
WriteResult({ "nInserted" : 1 })

Connection 2: starting second transaction in its own session:

foo:PRIMARY> use percona
switched to db percona
foo:PRIMARY> db.test.find()
foo:PRIMARY> db.test.find()
foo:PRIMARY> session = db.getMongo().startSession()
session { "id" : UUID("eb628bfd-425e-450c-a51b-733435474eaa") }
foo:PRIMARY> session.startTransaction()
foo:PRIMARY> session.getDatabase("percona").test.find()
foo:PRIMARY>

Connection 1: commit

foo:PRIMARY> session.commitTransaction()

Connection 2: after connection1 commits:

foo:PRIMARY> db.test.find()
{ "_id" : ObjectId("5b21361252bbe6e5b9a70a4e"), "today" : ISODate("2018-06-13T15:19:46.645Z") }
{ "_id" : ObjectId("5b21361252bbe6e5b9a70a4f"), "some_value" : "abc" }

Outside of the session it sees the new values, however inside the opened session it will not see the new values.

foo:PRIMARY> session.getDatabase("percona").test.find()
foo:PRIMARY>

Now if we commit the transaction inside connection 2 it will commit as well, and we will have 2 rows now (as there are no conflicts).

Sometimes, however, we may see the transient transaction error when committing or even doing find() inside a session:

foo:PRIMARY> session.commitTransaction()
2018-06-14T21:56:29.111+0000 E QUERY    [js] Error: command failed: {
        "errorLabels" : [
                "TransientTransactionError"
        ],
        "operationTime" : Timestamp(1529013385, 1),
        "ok" : 0,
        "errmsg" : "Transaction 0 has been aborted.",
        "code" : 251,
        "codeName" : "NoSuchTransaction",
        "$clusterTime" : {
                "clusterTime" : Timestamp(1529013385, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:520:17
assert.commandWorked@src/mongo/shell/assert.js:604:16
commitTransaction@src/mongo/shell/session.js:878:17
@(shell):1:1

From the MongoDB doc we can read that we could retry the transaction back when we have this error.

If an operation encounters an error, the returned error may have an errorLabels array field. If the error is a transient error, the errorLabels array field contains “TransientTransactionError” as an element and the transaction as a whole can be retried.

MongoDB transactions: conflict

What about transaction conflicts in MongoDB? Let’s say we are updating the same row. Here is the demo:

First we create a record, trx, in the collection:

use percona
db.test.insert({trx : 0})

Then we create session1 and update trx to change from 0 to 1:

foo:PRIMARY> session = db.getMongo().startSession()
session { "id" : UUID("0b7b8ce0-919a-401a-af01-69fe90876301") }
foo:PRIMARY> session.startTransaction()
foo:PRIMARY> session.getDatabase("percona").test.update({trx : 0}, {trx: 1})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

Then (before committing) create another session which will try to change from 0 to 2:

foo:PRIMARY> session = db.getMongo().startSession()
session { "id" : UUID("b312c662-247c-47c5-b0c9-23d77f4e9f6d") }
foo:PRIMARY> session.startTransaction()
foo:PRIMARY> session.getDatabase("percona").test.update({trx : 0}, {trx: 2})
WriteCommandError({
        "errorLabels" : [
                "TransientTransactionError"
        ],
        "operationTime" : Timestamp(1529675754, 1),
        "ok" : 0,
        "errmsg" : "WriteConflict",
        "code" : 112,
        "codeName" : "WriteConflict",
        "$clusterTime" : {
                "clusterTime" : Timestamp(1529675754, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
})

As we can see, MongoDB catches the conflict and return the error on the insert (even before the commit).

We hope this post, with its simple example how transactions will work, has been useful. Feedback is welcome: you can comment here, catch Adamo on twitter @AdamoTonete or talk to the team at @percona.

The post MongoDB Transactions: Your Very First Transaction with MongoDB 4.0 appeared first on Percona Database Performance Blog.

by Adamo Tonete at June 25, 2018 01:59 PM

Webinar Tues 6/26: MariaDB Server 10.3

MariaDB 10.3 Webinar

MariaDB 10.3 WebinarPlease join Percona’s Chief Evangelist, Colin Charles on Tuesday, June 26th, 2018, as he presents MariaDB Server 10.3 at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).

 

MariaDB Server 10.3 is out. It has some interesting features around system versioned tables, Oracle compatibility, column compression, an integrated SPIDER engine, as well as MyRocks. Learn about what’s new, how you can use it, and how it is different from MySQL.

Register Now

Colin Charles

Chief Evangelist

Colin Charles is the Chief Evangelist at Percona. He was previously on the founding team of MariaDB Server in 2009, and had worked at MySQL since 2005, and been a MySQL user since 2000. Before joining MySQL, he worked actively on the Fedora and OpenOffice.org projects. He’s well known within open source communities in APAC, and has spoken at many conferences. Experienced technologist, well known in the open source world for work that spans nearly two decades within the community. Pays attention to emerging technologies from an integration standpoint. Prolific speaker at many industry-wide conferences delivering talks and tutorials with ease. Interests: application development, systems administration, database development, migration, Web-based technologies. Considered expert in Linux and Mac OS X usage/administration/roll-out’s. Specialties: MariaDB, MySQL, Linux, Open Source, Community, speaking & writing to technical audiences as well as business stakeholders.

The post Webinar Tues 6/26: MariaDB Server 10.3 appeared first on Percona Database Performance Blog.

by Colin Charles at June 25, 2018 11:15 AM

Percona XtraBackup 2.4.12 Is Now Available

Percona_XtraBackup

Percona XtraBackupPercona announces the GA release of Percona XtraBackup 2.4.12 on June 22, 2018. You can download it from our download site and apt and yum repositories.

Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, it drives down backup costs while providing unique features for MySQL backups.

New features and improvements:

  • Percona XtraBackup now prints used arguments to standard output. Bug fixed PXB-1494.

Bugs fixed

  • xtrabackup --copy-back didn’t read which encryption plugin to use from plugin-load setting of the my.cnf configuration file. Bug fixed PXB-1544.
  • xbstream was exiting with zero return code when it failed to create one or more target files instead of returning error code 1. Bug fixed PXB-1542.
  • Meeting a zero sized keyring file, Percona XtraBackup was removing and immediately recreating it, which could affect external software noticing this file had undergo manipulations. Bug fixed PXB-1540.
  • xtrabackup_checkpoints files were encrypted during a backup, which caused additional difficulties to take incremental backups. Bug fixed PXB-202.

Other bugs fixed: PXB-1526 “Test kill_long_selects.sh failing with MySQL 5.7.21”.

Release notes with all the improvements for version 2.4.12 are available in our online documentation. Please report any bugs to the issue tracker.

The post Percona XtraBackup 2.4.12 Is Now Available appeared first on Percona Database Performance Blog.

by Dmitriy Kostiuk at June 25, 2018 09:15 AM

Open Query Pty Ltd

Using FastCGI to separate web frontend from application space

server rackFastCGI has many advantages, and it’s our preferred interface when it’s available for the required script language (such as PHP).  However, we generally see environments where the php-fpm processes (for instance) are run on the same system as the web server, even though that’s not necessary.  In FastCGI space, the web server (say nginx) passes a request through a socket or TCP/IP address:port, and the application delivers back an response (web page, web service JSON result, etc).  Obviously sockets only work locally, but the port can be on another machine.

So the question is, how do you arrange your virtual server rack?  While splitting things out like that tends to add a few ms of latency, the advantages tend to justify this approach:

  • Not running PHP on your web-server, and only having connectivity web-server -> application, is good for security:
    • should the web server get compromised, it can still only make application requests, not reconfigure the application server in any way;
    • should the application get compromised, it will not be able to gain control over the web-server as there is no connectivity path in that direction.
  • nginx is exceedingly good at “smart proxy”-style tasks, and that’s essentially already what it’s doing with FastCGI anyway;
    • you could have multiple application servers, rather than just one.  Optionally scaled dynamically as traffic needs change;
    • this effectively turns nginx into a kind of load-balancer for the application server back-ends; the way you specify this is through
      1. a single name that resolves to multiple address which are then used in a round-robin fashion (not necessarily  optimal as some requests may take longer than others), or
      2. a server group as provided by the ngx_http_upstream_module module.
    • nginx can detect when a back-end is unresponsive, and connect to an alternative;
    • thinking back to the fact that some requests take longer than others, that’s usually URL-specific.  That is, it’s relevant for queries in certain areas (such as reporting, or drill-down type searches) of a web interface.  Now, you can tell nginx to use a different FastCGI destination for those, and thus separate (insulate, really) the handling of that application traffic from the rest of the application.
  • This all fits very neatly with containers, should your infrastructure use those or if you’re considering moving towards containers and micro-services.

The possibilities are quite extensive, but naturally the details and available options will depend on your needs and what you already have in place.  We regularly help our clients with such questions. Solutions Architecture, both for evolving existing environments as well as for greenfield projects.

by Arjen Lentz at June 25, 2018 05:11 AM

June 24, 2018

Valeriy Kravchuk

On InnoDB Data Compression in MySQL

Another story that I've prepared back in April for my meeting with one of customers in London was a "compression story". We spent a lot of time on it in several support issues in the past, with only limited success.

In case of InnoDB tables, there are actually two ways to compress data (besides relying on filesystem compression or compressing individual columns at server or application side). Historically the first one was introduced by the Barracuda InnoDB file format and ROW_FORMAT=COMPRESSED it supported. Notable number of related bugs were reported with time, and it may be not that easy to identify them all (you can find current list of bugs tagged with "compression" here). I've picked up the following bugs for my "story":
  • Bug #88220 - "compressing and uncompressing InnoDB tables seems to be inconsistent". Over years Simon Mudd, Monty Solomon (see related Bug #70534 - "Removing table compression leaves compressed keys") and other community members reported several bugs related to inconsistencies and surprises with key_block_size option. It is used for both MyISAM and InnoDB storage engines (for compressed tables) and it seems nobody is going to fix the remaining problems until they are gone with MyISAM engine.
  • Bug #69588 - "MyISAM to InnoDB compressed slower than MyISAM to InnoDB, Then InnoDB to Compressed". Just a detail to take into account, noted 5 years ago by Joffrey MICHAIE, verified almost 4 years ago and then getting zero public attention from Oracle engineers.
  • Bug #62431 - "What is needed to make innodb compression work for 32KB pages?". Nothing can be done according to the manual:
    "In particular, ROW_FORMAT=COMPRESSED in the Barracuda file format assumes that the page size is at most 16KB and uses 14-bit pointers."
  • Bug #78827 - "Speedup replication of compressed tables". Come on, Daniël van Eeden, nobody cares that
    "Replication and InnoDB compressed tables are not efficiently working together."
    The bug is still "Open".
  • Bug #75110 - "Massive, to-be-compressed not committed InnoDB table is total database downtime". This problem was reported by Jouni Järvinen back in 2014. Surely this is not a bug, but it seems nobody even tried to speed up compression in any way on multiple cores.
  • Bug #84439 - "Table of row size of ~800 bytes does not compress with KEY_BLOCK_SIZE=1". It was reported by Jean-François Gagné, who asked for a reasonable error message at least. Nothing happens after verification.
  • Bug #77089 - "Misleading innochecksum error for compressed tables with key_block_size=16". This problem was reported by Laurynas Biveinis more than three years ago, immediately verified and then got zero attention.
The boats above do not use the space for mooring efficiently. They need better compression.
Transparent Page Compression for InnoDB tables was added later and looked promising. If you are lucky to use filesystem with sparse file and hole punching support and proper OS or kernel version, then you could expect notable saving of disk space with very few additional keystrokes (like COMPRESSION="zlib") when defining the table. Different compression libraries were supported. Moreover (see here), only uncompressed pages are stored in memory in this case, and this improved the efficiency of buffer pool usage. Sounded promising originally, but there are still bugs to consider:
  • Bug #78277 - "InnoDB deadlock, thread stuck on kernel calls from transparent page compression". This bug alone (reported by Mark Callaghan back in 2015) may be a reason to NOT use the feature in production, as soon as you hit it (chances are high). there are many interesting comments that there are environments where the feature works as fast as expected, but I think this summary is good enough for most users:
    "[19 Oct 2015 15:56] Mark Callaghan
    ...
    Slow on XFS, slow on ext4, btrfs core team tells me it will be slow there. But we can celebrate that it isn't slow on NVMFS - closed source, not GA, can't even find out where to buy it, not aware of anyone running it."
    The bug is still "Open".
  • Bug #81145 - "Sparse file and punch hole compression not working on Windows". Not that I care about Windows that much, but still. The bug is "Verified" for 2 years.
  • Bug #87723 - "mysqlbackup cannot work with mysql5.7 using innodb page-level compression" Now this is awesome! Oracle's own MySQL Enterprise Backup does NOT support the feature. Clearly they cared about making it useful...
    As a side note, same problem affects Percona's xtrabackup (see PXB-1394). MariaDB resolved the problem (and several related ones like MDEV-13023) with mariabackup tool.
  • Bug #87603 - "compression/tablespace ignored in create/alter table when not using InnoDB". COMPRESSION='.../' option is supported for MyISAM tables as well, and this again leads to problems when switching to another storage engine, as Tomislav Plavcic noted.
  • Bug #78672 - "assert fails in fil_io during linkbench with transparent innodb compression". This crash (assertion failure) was noted by Mark Callaghan back in 2015. May not crash anymore since 5.7.10 according to the last comment, but nobody cares to close the bug or comment anything useful. The bug is still "Verified".
That's almost all I prepared for my "compression story". It had to be sad one.

What about the moral of the story? For me it's the following:
  1. Classical InnoDB compression (page_format=compressed) has limited efficiency and does not get any attention from developers recently. If you hit some problem with this feature you have to live with it.
  2. Transparent page compression for InnoDB seems to be originally more like a proof of concept in MySQL that may not work well in production on commodity hardware, and software and was not integrated with backup tools. MariaDB improved it, added support for backing up page compressed tables efficiently with the same familiar xtrabackup-based approach, but there are still open problems to resolve (see MDEV-15527 and MDEV-15528 that I also picked up for my "story").
  3. It seems (based on public sources review at least) that both compression options do not get much attention from Oracle developers recently. If you check new features of MySQL 8.0 GA here,  you may notice that zlib version is updated, compressed temporary InnoDB tables are no longer supported and... that's all about compression for InnoDB!
This story could probably be shortened to just one link to the summary post by Mark Callaghan from Facebook (who studied the efficiency of data compression by various engines a lot, among other performance metrics), or by simple statement that if you want data to be compressed efficiently at server side do NOT use current InnoDB implementations and better use RocksDB engine (with MariaDB or Percona Server if you need other modern features also). But I can not write any story about MySQL without referring to some bugs, and this is how I've ended up with the above.

What if you just switched to MySQL 8.0 GA and need some new features from it badly? Then just wait for a miracle to happen (and hope Percona will make it one day :)

by Valeriy Kravchuk (noreply@blogger.com) at June 24, 2018 12:51 PM

June 23, 2018

Valeriy Kravchuk

On Partitioning in MySQL

Back in April I was preparing for vacations that my wife and I planned to spend in UK. Among other things planned I wanted to visit a customer's office in London and discuss few MySQL and MariaDB related topics, let's call them "stories". I tried to prepare myself for the discussion and collected a list of known active bugs (what else could I do as MySQL entomologist) for each of them. Surely live discussion was not suitable to share lists of bugs (and for some "stories" they were long), so I promised to share them later, in my blog. Time to do what I promised had finally come!

One of the stories we briefly discussed was "partitioning story". Right now I can immediately identify at least 47 active MySQL bugs in the related category.  While preparing I checked the same list and picked up 15 or so bug reports that had to illustrate my points. Let me share them here in no specific order, and add few more.
In April the latest still active bug in partitioning reported by MySQL community was  Bug #88916 - "Assertion `table->s->db_create_options == part_table->s->db_create_options'", from my colleague Elena Stepanova. Note a very simple test case that leads to assertion in debug builds, immediately verified.

Recently two more bugs were reported. Reporter of Bug #91190 - "DROP PARTITION and REORGANIZE PARTITION are slow" suspects a performance regression in MySQL 8.0.11. I've subscribed to this bug and is following the progress carefully. Same with Bug #91203 - "For partitions table, deal with NULL with is mismatch with reference guide". I think what happens with NULL value and range partitioning perfectly matches the manual, but the fact that INFORMATION_SCHEMA.PARTITIONS table may return wrong information after dropping partition with NULL value is somewhat unexpected.

Now back to the original lists for the "story" I prepared in April:
  • Bug #60023 - "No Loose Index Scan for GROUP BY / DISTINCT on InnoDB partitioned table". It was reported by Rene' Cannao' and since 2013 I strongly suspect that it's fixed in MySQL 5.6+ or, as noted in another comment, may depend on statistics properly collected for the table. Still the status remains "Verified".
  • Bug #78164 - "alter table command affect partitioned table data directory". Your custom DATA DIRECTORY settings may get lost when ALTER is applied to the whole table. Quick test shows that at least in MariaDB 10.3.7 this is no longer the case. The bug is still "Verified".
  • Bug #85126 - "Delete by range in presence of partitioning and no PK always picks wrong index". It was reported by Riccardo Pizzi 16 months ago, immediately verified (without explicit list of versions affected, by the way). One more case when ordering of indexes in CREATE TABLE may matter...
  • Bug #81712 - "lower_case_table_names=2 ignored on ADD PARTITION on Windows". Who cares about Windows these days?
  • Bug #84356 - "General tablespace table encryption". It seems partitioning allows to overcome documented limitation. If this is intended, then the manual is wrong, otherwise I suspect the lack of careful testing of partitioning integration with other features.
  • Bug #88673 - "Regression CREATE TBL from 5.7.17 to 20 (part #1: innodb_file_per_table = ON)." I've probably mentioned this bug reported by Jean-François Gagné in more than one blog post already. Take care and do not use long partition names.
  • Bug #85413 - "Failing to rename a column involved in partition". As simple as it sounds, and it still happens.
  • Bug #83435 - "ALTER TABLE is very slow when using PARTITIONED table". It was reported by Roel Van de Paar back in 2016 and still remains "Verified".
  • Bug #73084 - "Exchanging partitions defined with DATA DIRECTORY and INDEX DIRECTORY options". The bug still remains "Open" (see Bug #77772 also).
  • Bug #73648 - "innodb table replication is very slow with some of the partitioned table". It seems to be fixed last year as internal Bug #25687813 (see release notes for 5.6.38), but nobody cares to find this older duplicate and change its status or re-verify it.
  • Bug #83750 - "Import via TTS of a partitioned table only uses 1 cpu core". This feature requested by Daniël van Eeden makes a lot of sense. I truly hope to see parallel operations implemented for partitioned tables in GA MySQL versions (as I saw some parallel processing for partitions done for some upcoming "6.1" or so version back in 2008 in Riga during the MySQL's last company meeting I've attended).
  • Bug #64498 - "Running out of file handles when ALTERing partitioned MyISAM table". Too many file handles are needed. This is a documented limitation that DBAs should still take into account.
I also prepared a separate small list of partition pruning bugs:
  • Bug #83248 - "Partition pruning is not working with LEFT JOIN". I've reported it back in 2016 and it is still not fixed. There are reasons to think it is not so easy.
  • Bug #75085 - "Partition pruning on key partitioning with ENUM". It was reported by  Daniël van Eeden back in 2014!
  • Bug #77318 - "Selects waiting on MDL when altering partitioned table". One of the worst expectations DBA may have is that partitioned tables help to workaround "global" MDL locks because of partition pruning! This is not the case.
Does this story have any moral? I think so, and for me it's the following:
  1. Partitioning bugs do not get proper attention from Oracle engineers. We see bugs with wrong status and even a bug with a clear test case and a duplicate that is "Open" for 4 years. Some typical use cases are affected badly, and still no fixes (even though since 5.7 we have native partitioning in InnoDB and changing implementation gave good chance to review and either fix or re-check these bugs).
  2. MySQL DBAs should expect all kinds of surprises when running usual DDL statements (ALTER TABLE to add column even) with partitioned tables. In the best case DDL is just unexpectedly slow for them.
  3. Partition pruning may not work they way one expects.
  4. We miss parallel processing for partitioned tables. They should allow to speed up queries and DDL, not to slow them down instead...
  5. One can suspect that there is no careful internal testing performed on integration of partitioning with other features, or even basic partition maintenance operations.

by Valeriy Kravchuk (noreply@blogger.com) at June 23, 2018 05:21 PM

June 22, 2018

Peter Zaitsev

This Week in Data with Colin Charles 43: Polyglots, Security and DataOps.Barcelona

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

This is a short working week for me due to a family emergency. It caused me to skip speaking at DataOps.Barcelona and miss hanging out with the awesome of speakers and attendees. This is the first time I’ve missed a scheduled talk, and I received many messages about my absence. I am sure we will all meet again soon.

One of the talks I was planning to give at DataOps.Barcelona will be available as a Percona webinar next week: Securing Your Database Servers from External Attacks on Thursday, June 28, 2018, at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4). I am also giving a MariaDB 10.3 overview on Tuesday, June 26, 2018, at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4). I will “virtually” see you there.

If you haven’t already read Werner Vogel’s post A one size fits all database doesn’t fit anyone, I highly recommend it. It is true there is no “one size fits all” solution when it comes to databases. This is why Percona has made “the polyglot world” a theme. It’s why Amazon offers different database flavors: relational (Aurora for MySQL/PostgreSQL, RDS for MySQL/PostgreSQL/MariaDB Server), key-value (DynamoDB), document (DynamoDB), graph (Neptune), in-memory (ElastiCache for Redis & Memcached), search (Elasticsearch service). The article has a plethora of use cases, from AirBnB using Aurora, to Snapchat Stories and Tinder using DynamoDB, to Thomson Reuters using Neptune, down to McDonald’s using ElastiCache and Expedia using Elasticsearch. This kind of detail, and customer use case, is great.

There are plenty more stories and anecdotes in the post, and it validates why Percona is focused not just on MySQL, but also MariaDB, MongoDB, PostgreSQL and polyglot solutions. From a MySQL lens, it’s also worth noting that not one storage engine fits every use case. Facebook famously migrated a lot of their workload from InnoDB to MyRocks, and it is exciting to see Mark Callaghan stating that there are already three big workloads on MyRocks in production, with another two coming soon.

Releases

  • MariaDB 10.1.34 – including fixes for InnoDB defragmentation and full text search (MDEV-15824). This was from the WebScaleSQL tree, ported by KakaoTalk to MariaDB Server.
  • Percona XtraDB Cluster 5.6.40-26.25 – now with Percona Server for MySQL 5.6.40, including a new variable to configure rolling schema upgrade (RSU) wait for active commit connection timeouts.
  • Are you using the MariaDB Connector/C, Connector/J or Connector/ODBC? A slew of updates abound.

Link List

Industry Updates

Upcoming appearances

  • OSCON – Portland, Oregon, USA – July 16-19 2018

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

The post This Week in Data with Colin Charles 43: Polyglots, Security and DataOps.Barcelona appeared first on Percona Database Performance Blog.

by Colin Charles at June 22, 2018 09:26 PM

Finding the Right Direction: MongoDB Compass – Community Version

MongoDB Compass

MongoDB CompassIn this blog post, we will talk a bit about the product MongoDB Compass. This new tool has 3 main versions, these being: Community, Enterprise and Enterprise Read Only. MongoDB Compass Community is free, but a bit limited. It allows you to connect to your MongoDB Database to run queries, check queries execution plans, manage indexes, and create, drop/create collections and databases. The paid-for version offers some additional features such as Schema Analysis, Real Time Server Stats, and Document Validation.

We will focus on the Community version here, and look at how we can workaround its limitations using free open source software.

Of course, MongoDB 3.6 was released in November 2017 and it comes with a lot of new features. We’ve already covered those in some of our blog posts and webinars, which you might find interesting:

Using MongoDB Compass Community

The installation is very straightforward and it is available to all operating systems. We used MacOS version as an example but the product looks the same in all supported OS, including Linux with GUI.

Here are the main screens of MongoDB Compass, they are pretty self explanatory.

Database List

Collection List

Collection content (Documents)

Query explain

Indexes

In the community version, we don’t have Real Time Server Status, Document Validation, or Schema Analysis available. I’ve left these features offered by MongoDB Compass Enterprise out of this article.

However, following the philosophy offered in an earlier blog post — why pay if open source has you covered — I’d like to demonstrate some free tools that offer the same functionality.

I should highlight that Percona doesn’t have a partnership with those companies. These examples represent my suggestions of how open source software can deliver the same functionality as enterprise versions. There are other options out there, and if you know any that you think should be here please let us know!

Schema Validation

For schema validation, it is very likely that Compass is running behind the scenes something similar to Variety, an open project from James Cropcho and currently maintained by a few people https://github.com/variety/variety#core-maintainers.

With this tool, users can generate reports about collections, schemas and their field types.

Schema validator is a wrapper to create collection validation, this blog post from MongoDB explains in details how to create validations

Real-time Server Status

Real-time Server Status shows details about the server itself. It shows the current number of operations, memory used and network throughput. Those metrics can be gathered with open source or “homemade” scripts.

Most of the metrics are based on db.serverStatus()

We also have Percona Monitoring and Management, or PMM, that provides enterprise-grade monitoring features free of charge and not only monitors MongoDB but also MySQL and PostgreSQL see more at https://www.percona.com/doc/percona-monitoring-and-management/index.html

However, if you didn’t like Compass there are a lot of GUI tools available to run queries with IntelliSense, and this search will reveal the most common ones.

In summary, Compass is a great tool. However, with the limitations imposed in the Community version, it is just another user-friendly client. It is up to the user choose Compass over the other options available and if Community Compass is your option I hope you found this discussion useful.

The post Finding the Right Direction: MongoDB Compass – Community Version appeared first on Percona Database Performance Blog.

by Adamo Tonete at June 22, 2018 12:48 PM

June 21, 2018

Peter Zaitsev

Lock Down: Enforcing SELinux with Percona XtraDB Cluster

SELinux for PXC security

SELinux for PXC security

Why do I spend time blogging about security frameworks? Because, although there are some resources available on the Web, none apply to Percona XtraDB Cluster (PXC) directly. Actually, I rarely encounter a MySQL setup where SELinux is enforced and never when Percona XtraDB Cluster (PXC) or another Galera replication implementation is used. As we’ll see, there are good reasons for that. I originally thought this post would be a simple “how to” but it ended up with a push request to modify the SST script and a few other surprises.

Some context

These days, with all the major security breaches of the last few years, the importance of security in IT cannot be highlighted enough. For that reason, security in MySQL has been progressively tightened from version to version and the default parameters are much more restrictive than they used to be. That’s all good but it is only at the MySQL level if there is still a breach allowing access to MySQL, someone could in theory do everything the mysql user is allowed to do. To prevent such a situation, the operations that mysqld can do should be limited to only what it really needs to do. SELinux’ purpose is exactly that. You’ll find SELinux on RedHat/Centos and their derived distributions. Debian, Ubuntu and OpenSuse uses another framework, AppArmor, which is functionally similar to SELinux. I’ll talk about AppArmor in a future post, let’s focus for now on SELinux.

The default behavior of many DBAs and Sysadmins appears to be: “if it doesn’t work, disable SELinux”. Sure enough, it often solves the issue but it also removes an important security layer. I believe disabling SELinux is the wrong cure so let’s walk through the steps of configuring a PXC cluster with SELinux enforced.

Starting point

As a starting point, I’ll assume you have a running PXC cluster operating with SELinux in permissive mode. That likely means the file “/etc/sysconfig/selinux” looks like this:

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=permissive
# SELINUXTYPE= can take one of three two values:
#     targeted - Targeted processes are protected,
#     minimum - Modification of targeted policy. Only selected processes are protected.
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted

For the purpose of writing this article, I created a 3 nodes PXC cluster with the hosts: BlogSELinux1, BlogSELinux2 and BlogSELinux3. On BlogSELinux1, I set SELinux in permissive mode, I truncated the audit.log. SELinux violations are logged in the audit.log file.

[root@BlogSELinux1 ~]# getenforce
Permissive
[root@BlogSELinux1 ~]# echo '' > /var/log/audit/audit.log

Let’s begin by covering the regular PXC operation items like start, stop, SST Donor, SST Joiner, IST Donor and IST Joiner. As we execute the steps in the list, the audit.log file will record SELinux related elements.

Stop and start

Those are easy:

[root@BlogSELinux1 ~]# systemctl stop mysql
[root@BlogSELinux1 ~]# systemctl start mysql

SST Donor

On BlogSELinux3:

[root@BlogSELinux3 ~]# systemctl stop mysql

then on BlogSELinux2:

[root@BlogSELinux2 ~]# systemctl stop mysql
[root@BlogSELinux2 ~]# rm -f /var/lib/mysql/grastate.dat
[root@BlogSELinux2 ~]# systemctl start mysql

SST Joiner

We have BlogSELinux1 and BlogSELinux2 up and running, we just do:

[root@BlogSELinux1 ~]# systemctl stop mysql
[root@BlogSELinux1 ~]# rm -f /var/lib/mysql/grastate.dat
[root@BlogSELinux1 ~]# systemctl start mysql

IST Donor

We have BlogSELinux1 and BlogSELinux2 up and running, we just do:

[root@BlogSELinux2 ~]# systemctl stop mysql

Then on the first node:

[root@BlogSELinux1 ~]# mysql -e 'create database test;';
[root@BlogSELinux1 ~]# mysql -e 'create table test.testtable (id int not null, primary key (id)) engine=innodb;'
[root@BlogSELinux1 ~]# mysql -e 'insert into test.testtable (id) values (1);'

Those statements put some data in the gcache, now we just restart the second node:

[root@BlogSELinux2 ~]# systemctl start mysql

IST Joiner

We have BlogSELinux1 and BlogSELinux2 up and running, we just do:

[root@BlogSELinux1 ~]# systemctl stop mysql

Then on the second node:

[root@BlogSELinux2 ~]# mysql -e 'insert into test.testtable (id) values (2);'

to insert some data in the gcache and we restart the first node:

[root@BlogSELinux1 ~]# systemctl start mysql

First run

Now that we performed the basic operations of a cluster while recording the security violations in permissive mode, we can look at the audit.log file and start building the SELinux policy. Let’s begin by installing the tools needed to manipulate the SELinux audit log and policy files with:

[root@BlogSELinux1 ~]# yum install policycoreutils-python.x86_64

Then, we’ll use the audit2allow tool to analyze the audit.log file:

[root@BlogSELinux1 ~]# grep -i denied /var/log/audit/audit.log | grep mysqld_t | audit2allow -M PXC
******************** IMPORTANT ***********************
To make this policy package active, execute:
semodule -i PXC.pp

We end up with 2 files, PXC.te and PXC.pp. The pp file is a compiled version of the human readable te file. If we examine the content of the PXC.te file, at the beginning, we have the require section listing all the involved SELinux types and classes:

module PXC 1.0;
require {
        type unconfined_t;
        type init_t;
        type auditd_t;
        type mysqld_t;
        type syslogd_t;
        type NetworkManager_t;
        type unconfined_service_t;
        type system_dbusd_t;
        type tuned_t;
        type tmp_t;
        type dhcpc_t;
        type sysctl_net_t;
        type kerberos_port_t;
        type kernel_t;
        type unreserved_port_t;
        type firewalld_t;
        type systemd_logind_t;
        type chronyd_t;
        type policykit_t;
        type udev_t;
        type mysqld_safe_t;
        type postfix_pickup_t;
        type sshd_t;
        type crond_t;
        type getty_t;
        type lvm_t;
        type postfix_qmgr_t;
        type postfix_master_t;
        class process { getattr setpgid };
        class unix_stream_socket connectto;
        class system module_request;
        class netlink_tcpdiag_socket { bind create getattr nlmsg_read setopt };
        class tcp_socket { name_bind name_connect };
        class file { getattr open read write };
        class dir search;
}

Then, using these types and classes, the policy file adds a series of generic allow rules matching the denied found in the audit.log file. Here’s what I got:

#============= mysqld_t ==============
allow mysqld_t NetworkManager_t:process getattr;
allow mysqld_t auditd_t:process getattr;
allow mysqld_t chronyd_t:process getattr;
allow mysqld_t crond_t:process getattr;
allow mysqld_t dhcpc_t:process getattr;
allow mysqld_t firewalld_t:process getattr;
allow mysqld_t getty_t:process getattr;
allow mysqld_t init_t:process getattr;
#!!!! This avc can be allowed using the boolean 'nis_enabled'
allow mysqld_t kerberos_port_t:tcp_socket name_bind;
allow mysqld_t kernel_t:process getattr;
#!!!! This avc can be allowed using the boolean 'domain_kernel_load_modules'
allow mysqld_t kernel_t:system module_request;
allow mysqld_t lvm_t:process getattr;
allow mysqld_t mysqld_safe_t:process getattr;
allow mysqld_t policykit_t:process getattr;
allow mysqld_t postfix_master_t:process getattr;
allow mysqld_t postfix_pickup_t:process getattr;
allow mysqld_t postfix_qmgr_t:process getattr;
allow mysqld_t sysctl_net_t:file { getattr open read };
allow mysqld_t syslogd_t:process getattr;
allow mysqld_t system_dbusd_t:process getattr;
allow mysqld_t systemd_logind_t:process getattr;
allow mysqld_t tuned_t:process getattr;
allow mysqld_t udev_t:process getattr;
allow mysqld_t unconfined_service_t:process getattr;
allow mysqld_t unconfined_t:process getattr;
allow mysqld_t tuned_t:process getattr;
allow mysqld_t udev_t:process getattr;
allow mysqld_t sshd_t:process getattr;
allow mysqld_t self:netlink_tcpdiag_socket { bind create getattr nlmsg_read setopt };
allow mysqld_t self:process { getattr setpgid };
#!!!! The file '/var/lib/mysql/mysql.sock' is mislabeled on your system.
#!!!! Fix with $ restorecon -R -v /var/lib/mysql/mysql.sock
#!!!! This avc can be allowed using the boolean 'daemons_enable_cluster_mode'
allow mysqld_t self:unix_stream_socket connectto;
allow mysqld_t sshd_t:process getattr;
allow mysqld_t sysctl_net_t:dir search;
allow mysqld_t sysctl_net_t:file { getattr open read };
allow mysqld_t syslogd_t:process getattr;
allow mysqld_t system_dbusd_t:process getattr;
allow mysqld_t systemd_logind_t:process getattr;
#!!!! WARNING 'mysqld_t' is not allowed to write or create to tmp_t.  Change the label to mysqld_tmp_t.
allow mysqld_t tmp_t:file write;
allow mysqld_t tuned_t:process getattr;
allow mysqld_t udev_t:process getattr;
allow mysqld_t unconfined_service_t:process getattr;
allow mysqld_t unconfined_t:process getattr;
#!!!! This avc can be allowed using one of the these booleans:
#     nis_enabled, mysql_connect_any
allow mysqld_t unreserved_port_t:tcp_socket { name_bind name_connect };

I can understand some of these rules. For example, one of the TCP ports used by Kerberos is 4444 and it is also used by PXC for the SST transfer. Similarly, MySQL needs to write to /tmp. But what about all the other rules?

Troubleshooting

We could load the PXC.pp module we got in the previous section and consider our job done. It will likely allow the PXC node to start and operate normally but what exactly is happening? Why did MySQL or one of its subprocesses asked for the process attributes getattr of all the running processes like sshd, syslogd and cron. Looking directly in the audit.log file, I found many entries like these:

type=AVC msg=audit(1527792830.989:136): avc:  denied  { getattr } for  pid=3683 comm="ss"
  scontext=system_u:system_r:mysqld_t:s0 tcontext=system_u:system_r:init_t:s0 tclass=process
type=AVC msg=audit(1527792830.990:137): avc:  denied  { getattr } for  pid=3683 comm="ss"
  scontext=system_u:system_r:mysqld_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=process
type=AVC msg=audit(1527792830.991:138): avc:  denied  { getattr } for  pid=3683 comm="ss"
  scontext=system_u:system_r:mysqld_t:s0 tcontext=system_u:system_r:syslogd_t:s0 tclass=process

So, ss, a network utility tool, scans all the processes. That rang a bell… I knew where to look for, the sst script. Here’s the source of the problem in the wsrep_sst_xtrabackup-v2 file:

wait_for_listen()
{
    local HOST=$1
    local PORT=$2
    local MODULE=$3
    for i in {1..300}
    do
        ss -p state listening "( sport = :$PORT )" | grep -qE 'socat|nc' && break
        sleep 0.2
    done
    echo "ready ${HOST}:${PORT}/${MODULE}//$sst_ver"
}

This bash function is used when the node is a joiner and it checks using ss if the TCP port used by socat or nc is opened. The check is needed in order to avoid replying too early with the “ready” message. The code is functionally correct but wrong, security wise. Instead of looking if there is a socat or nc command running in the list of processes owned by the mysql user, it checks if any of the processes has opened the SST port and only then does it checks if the name of the command is socat or nc. Since we don’t know which processes will be running on the server, we can’t write a good security profile. For example, in the future, one could add the ntpd daemon, causing PXC to fail to start yet again. To avoid that, the function needs to be modified like this:

wait_for_listen()
{
    local HOST=$1
    local PORT=$2
    local MODULE=$3
    for i in {1..300}
    do
        sleep 0.2
        # List only our (mysql user) processes to avoid triggering SELinux
        for cmd in $(ps -u $(id -u) -o pid,comm | sed 's/^\s*//g' | tr ' ' '|' | grep -E 'socat|nc')
        do
            pid=$(echo $cmd | cut -d'|' -f1)
            # List the sockets of the pid
            sockets=$(ls -l /proc/$pid/fd | grep socket | cut -d'[' -f2 | cut -d ']' -f1 | tr '\n' '|')
            if [[ -n $sockets ]]; then
                # Is one of these sockets listening on the SST port?
                # If so, we need to break from 2 loops
                grep -E "${sockets:0:-1}" /proc/$pid/net/tcp | \
                  grep "00000000:$(printf '%X' $PORT)" > /dev/null \
                  && break 2
            fi
        done
    done
    echo "ready ${HOST}:${PORT}/${MODULE}//$sst_ver"
}

The modified function removes many of the denied messages in the audit log file and simplifies a lot the content of PXC.te. I tested the above modification and made a pull request to PXC. Among the remaining items, we have:

allow mysqld_t self:process { getattr setpgid };

setpgid is called often used after a fork to set the process group, usually through the setsid call. MySQL uses fork when it starts with the daemonize option but our installation of Percona XtraDB cluster uses mysqld_safe and does not directly run as a daemon. Another fork call is part of the wsrep source files and is used to launch processes like the SST script and is done when mysqld is already running with reduced privileges. This later invocation is certainly our culprit.

TCP ports

What about TPC ports? PXC uses quite a few. Of course there is the 3306/tcp port used to access MySQL. Galera also uses the ports 4567/tcp for replication, 4568/tcp for IST and 4444/tcp for SST. Let’s have a look which ports SELinux allows PXC to use:

[root@BlogSELinux1 audit]# semanage port -l | grep mysql
mysqld_port_t                  tcp      1186, 3306, 63132-63164

No surprise, port 3306/tcp is authorized but if you are new to MySQL, you may wonder what uses the 1186/tcp. It is the port used by NDB cluster for inter-node communication (NDB API). Now, if we try to add the missing ports:

[root@BlogSELinux1 audit]# semanage port -a -t mysqld_port_t -p tcp 4567
ValueError: Port tcp/4567 already defined
[root@BlogSELinux1 audit]# semanage port -a -t mysqld_port_t -p tcp 4568
[root@BlogSELinux1 audit]# semanage port -a -t mysqld_port_t -p tcp 4444
ValueError: Port tcp/4444 already defined

4568/tcp was successfully added but, 4444/tcp and 4567/tcp failed because they are already assigned to another security context. For example, 4444/tcp belongs to the kerberos security context:

[root@BlogSELinux1 audit]# semanage port -l | grep kerberos_port
kerberos_port_t                tcp      88, 750, 4444
kerberos_port_t                udp      88, 750, 4444

A TCP port is not allowed by SELinux to belong to more than one security context. We have no other choice than to move the two missing ports to the mysqld_t security context:

[root@BlogSELinux1 audit]# semanage port -m -t mysqld_port_t -p tcp 4444
[root@BlogSELinux1 audit]# semanage port -m -t mysqld_port_t -p tcp 4567
[root@BlogSELinux1 audit]# semanage port -l | grep mysqld
mysqld_port_t                  tcp      4567, 4444, 4568, 1186, 3306, 63132-63164

If you happen to be planning to deploy a Kerberos server on the same servers you may have to run PXC using a different port for Galera replication. In that case, and in the case where you want to run MySQL on a port other than 3306/tcp, you’ll need to add the port to the mysqld_port_t context like we just did above. Do not worry too much for the port 4567/tcp, it is reserved for tram which, from what I found, is a remote access protocol for routers.

Non-default paths

It is very frequent to run MySQL with non-standard paths/directories. With SELinux, you don’t list the authorized path in the security context, you add the security context labels to the paths. Adding a context label is a two steps process, basically change and apply. For example, if you are using /data as the MySQL datadir, you need to do:

semanage fcontext -a -t mysqld_db_t "/data(/.*)?"
restorecon -R -v /data

On a RedHat/Centos 7 server, the MySQL file contexts and their associated paths are:

[root@BlogSELinux1 ~]# bzcat /etc/selinux/targeted/active/modules/100/mysql/cil | grep filecon
(filecon "HOME_DIR/\.my\.cnf" file (system_u object_r mysqld_home_t ((s0) (s0))))
(filecon "/root/\.my\.cnf" file (system_u object_r mysqld_home_t ((s0) (s0))))
(filecon "/usr/lib/systemd/system/mysqld.*" file (system_u object_r mysqld_unit_file_t ((s0) (s0))))
(filecon "/usr/lib/systemd/system/mariadb.*" file (system_u object_r mysqld_unit_file_t ((s0) (s0))))
(filecon "/etc/my\.cnf" file (system_u object_r mysqld_etc_t ((s0) (s0))))
(filecon "/etc/mysql(/.*)?" any (system_u object_r mysqld_etc_t ((s0) (s0))))
(filecon "/etc/my\.cnf\.d(/.*)?" any (system_u object_r mysqld_etc_t ((s0) (s0))))
(filecon "/etc/rc\.d/init\.d/mysqld" file (system_u object_r mysqld_initrc_exec_t ((s0) (s0))))
(filecon "/etc/rc\.d/init\.d/mysqlmanager" file (system_u object_r mysqlmanagerd_initrc_exec_t ((s0) (s0))))
(filecon "/usr/bin/mysqld_safe" file (system_u object_r mysqld_safe_exec_t ((s0) (s0))))
(filecon "/usr/bin/mysql_upgrade" file (system_u object_r mysqld_exec_t ((s0) (s0))))
(filecon "/usr/libexec/mysqld" file (system_u object_r mysqld_exec_t ((s0) (s0))))
(filecon "/usr/libexec/mysqld_safe-scl-helper" file (system_u object_r mysqld_safe_exec_t ((s0) (s0))))
(filecon "/usr/sbin/mysqld(-max)?" file (system_u object_r mysqld_exec_t ((s0) (s0))))
(filecon "/usr/sbin/mysqlmanager" file (system_u object_r mysqlmanagerd_exec_t ((s0) (s0))))
(filecon "/usr/sbin/ndbd" file (system_u object_r mysqld_exec_t ((s0) (s0))))
(filecon "/var/lib/mysql(-files|-keyring)?(/.*)?" any (system_u object_r mysqld_db_t ((s0) (s0))))
(filecon "/var/lib/mysql/mysql\.sock" socket (system_u object_r mysqld_var_run_t ((s0) (s0))))
(filecon "/var/log/mariadb(/.*)?" any (system_u object_r mysqld_log_t ((s0) (s0))))
(filecon "/var/log/mysql.*" file (system_u object_r mysqld_log_t ((s0) (s0))))
(filecon "/var/run/mariadb(/.*)?" any (system_u object_r mysqld_var_run_t ((s0) (s0))))
(filecon "/var/run/mysqld(/.*)?" any (system_u object_r mysqld_var_run_t ((s0) (s0))))
(filecon "/var/run/mysqld/mysqlmanager.*" file (system_u object_r mysqlmanagerd_var_run_t ((s0) (s0))))

If you want to avoid security issues with SELinux, you should stay within those paths. A good example of an offending path is the PXC configuration file and directory which are now located in their own directory. These are not labeled correctly for SELinux:

[root@BlogSELinux1 ~]# ls -Z /etc/per*
-rw-r--r--. root root system_u:object_r:etc_t:s0       /etc/percona-xtradb-cluster.cnf
/etc/percona-xtradb-cluster.conf.d:
-rw-r--r--. root root system_u:object_r:etc_t:s0       mysqld.cnf
-rw-r--r--. root root system_u:object_r:etc_t:s0       mysqld_safe.cnf
-rw-r--r--. root root system_u:object_r:etc_t:s0       wsrep.cnf

I must admit that even if the security context labels on those files were not set, I got no audit messages and everything worked normally. Nevetheless, adding the labels is straightforward:

[root@BlogSELinux1 ~]# semanage fcontext -a -t mysqld_etc_t "/etc/percona-xtradb-cluster\.cnf"
[root@BlogSELinux1 ~]# semanage fcontext -a -t mysqld_etc_t "/etc/percona-xtradb-cluster\.conf\.d(/.*)?"
[root@BlogSELinux1 ~]# restorecon -v /etc/percona-xtradb-cluster.cnf
restorecon reset /etc/percona-xtradb-cluster.cnf context system_u:object_r:etc_t:s0->system_u:object_r:mysqld_etc_t:s0
[root@BlogSELinux1 ~]# restorecon -R -v /etc/percona-xtradb-cluster.conf.d/
restorecon reset /etc/percona-xtradb-cluster.conf.d context system_u:object_r:etc_t:s0->system_u:object_r:mysqld_etc_t:s0
restorecon reset /etc/percona-xtradb-cluster.conf.d/wsrep.cnf context system_u:object_r:etc_t:s0->system_u:object_r:mysqld_etc_t:s0
restorecon reset /etc/percona-xtradb-cluster.conf.d/mysqld.cnf context system_u:object_r:etc_t:s0->system_u:object_r:mysqld_etc_t:s0
restorecon reset /etc/percona-xtradb-cluster.conf.d/mysqld_safe.cnf context system_u:object_r:etc_t:s0->system_u:object_r:mysqld_etc_t:s0

Variables check list

Here is a list of all the variables you should check for paths used by MySQL

  • datadir, default is /var/lib/mysql, where MySQL stores its data
  • basedir, default is /usr, where binaries and librairies can be found
  • character_sets_dir, default is basedir/share/mysql/charsets, charsets used by MySQL
  • general_log_file, default is the datadir, where the general log is written
  • init_file, no default, sql file read and executed when the server starts
  • innodb_undo_directory, default is datadir, where InnoDB stores the undo files
  • innodb_tmpdir, default is tmpdir, where InnoDB creates temporary files
  • innodb_temp_data_file_path, default is in the datadir, where InnoDB creates the temporary tablespace
  • innodb_parallel_doublewrite_path, default is in the datadir, where InnoDB created the parallel doublewrite buffer
  • innodb_log_group_home_dir, default is the datadir, where InnoDB writes its transational log files
  • innodb_data_home_dir, default is the datadir, used a default value for the InnoDB files
  • innodb_data_file_path, default is in the datadir, path of the system tablespace
  • innodb_buffer_pool_filename, default is in the datadir, where InnoDB writes the buffer pool dump information
  • lc_messages_dir, basedir/share/mysql
  • log_bin_basename, default is the datadir, where the binlogs are stored
  • log_bin_index, default is the datadir, where the binlog index file is stored
  • log_error, no default value, where the MySQL error log is stored
  • pid-file, no default value, where the MySQL pid file is stored
  • plugin_dir, default is basedir/lib/mysql/plugin, where the MySQL plugins are stored
  • relay_log_basename, default is the datadir, where the relay logs are stored
  • relay_log_info_file, default is the datadir, may include a path
  • slave_load_tmpdir, default is tmpdir, where the slave stores files coming from LOAD DATA INTO statements.
  • slow_query_log, default is in the datadir, where the slow queries are logged
  • socket, no defaults, where the Unix socket file is created
  • ssl_*, SSL/TLS related files
  • tmpdir, default is /tmp, where temporary files are stored
  • wsrep_data_home_dir, default is the datadir, where galera stores its files
  • wsrep_provider->base_dir, default is wsrep_data_home_dir
  • wsrep_provider->gcache_dir, default is wsrep_data_home_dir, where the gcache file is stored
  • wsrep_provider->socket.ssl_*, no defaults, where the SSL/TLS related files for the Galera protocol are stored

That’s quite a long list and I may have missed some. If for any of these variables you use a non-standard path, you’ll need to adjust the context labels as we just did above.

All together

I would understand if you feel a bit lost, I am not a SELinux guru and it took me some time to understand decently how it works. Let’s recap how we can enable SELinux for PXC from what we learned in the previous sections.

1. Install the SELinux utilities

yum install policycoreutils-python.x86_64

2. Allow the TCP ports used by PXC

semanage port -a -t mysqld_port_t -p tcp 4568
semanage port -m -t mysqld_port_t -p tcp 4444
semanage port -m -t mysqld_port_t -p tcp 4567

3. Modify the SST script

Replace the wait_for_listen function in the /usr/bin/wsrep_sst_xtrabackup-v2 file by the version above. Hopefully, the next PXC release will include a SELinux friendly wait_for_listen function.

4. Set the security context labels for the configuration files

These steps seems optional but for completeness:

semanage fcontext -a -t mysqld_etc_t "/etc/percona-xtradb-cluster\.cnf"
semanage fcontext -a -t mysqld_etc_t "/etc/percona-xtradb-cluster\.conf\.d(/.*)?"
restorecon -v /etc/percona-xtradb-cluster.cnf
restorecon -R -v /etc/percona-xtradb-cluster.conf.d/

5. Create the policy file PXC.te

Create the file PXC.te with this content:

module PXC 1.0;
require {
        type unconfined_t;
        type mysqld_t;
        type unconfined_service_t;
        type tmp_t;
        type sysctl_net_t;
        type kernel_t;
        type mysqld_safe_t;
        class process { getattr setpgid };
        class unix_stream_socket connectto;
        class system module_request;
        class file { getattr open read write };
        class dir search;
}
#============= mysqld_t ==============
allow mysqld_t kernel_t:system module_request;
allow mysqld_t self:process { getattr setpgid };
allow mysqld_t self:unix_stream_socket connectto;
allow mysqld_t sysctl_net_t:dir search;
allow mysqld_t sysctl_net_t:file { getattr open read };
allow mysqld_t tmp_t:file write;

6. Compile and load the policy module

checkmodule -M -m -o PXC.mod PXC.te
semodule_package -o PXC.pp -m PXC.mod
semodule -i PXC.pp

7. Run for a while in Permissive mode

Set SELinux into permissive mode in /etc/sysconfig/selinux and reboot. Validate everything works fine in Permissive mode, check the audit.log for any denied messages. If there are denied messages, address them.

8. Enforce SELINUX

Last step, enforce SELinux:

setenforce 1
perl -pi -e 's/SELINUX=permissive/SELINUX=enforcing/g' /etc/sysconfig/selinux

Conclusion

As we can see, enabling SELinux with PXC is not straightforward but, once the process is understood, it is not that hard either. In an IT world where security is more than ever a major concern, enabling SELinux with PXC is a nice step forward. In an upcoming post, we’ll look at the other security framework, Apparmor.

The post Lock Down: Enforcing SELinux with Percona XtraDB Cluster appeared first on Percona Database Performance Blog.

by Yves Trudeau at June 21, 2018 03:57 PM

Jean-Jerome Schmidt

MySQL on Docker: Running a MariaDB Galera Cluster without Orchestration Tools - DB Container Management - Part 2

As we saw in the first part of this blog, a strongly consistent database cluster like Galera does not play well with container orchestration tools like Kubernetes or Swarm. We showed you how to deploy Galera and configure process management for Docker, so you retain full control of the behaviour.  This blog post is the continuation of that, we are going to look into operation and maintenance of the cluster.

To recap some of the main points from the part 1 of this blog, we deployed a three-node Galera cluster, with ProxySQL and Keepalived on three different Docker hosts, where all MariaDB instances run as Docker containers. The following diagram illustrates the final deployment:

Graceful Shutdown

To perform a graceful MySQL shutdown, the best way is to send SIGTERM (signal 15) to the container:

$ docker kill -s 15 {db_container_name}

If you would like to shutdown the cluster, repeat the above command on all database containers, one node at a time. The above is similar to performing "systemctl stop mysql" in systemd service for MariaDB. Using "docker stop" command is pretty risky for database service because it waits for 10 seconds timeout and Docker will force SIGKILL if this duration is exceeded (unless you use a proper --timeout value).

The last node that shuts down gracefully will have the seqno not equal to -1 and safe_to_bootstrap flag is set to 1 in the /{datadir volume}/grastate.dat of the Docker host, for example on host2:

$ cat /containers/mariadb2/datadir/grastate.dat
# GALERA saved state
version: 2.1
uuid:    e70b7437-645f-11e8-9f44-5b204e58220b
seqno:   7099
safe_to_bootstrap: 1

Detecting the Most Advanced Node

If the cluster didn't shut down gracefully, or the node that you were trying to bootstrap wasn't the last node to leave the cluster, you probably wouldn't be able to bootstrap one of the Galera node and might encounter the following error:

2016-11-07 01:49:19 5572 [ERROR] WSREP: It may not be safe to bootstrap the cluster from this node.
It was not the last one to leave the cluster and may not contain all the updates.
To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .

Galera honours the node that has safe_to_bootstrap flag set to 1 as the first reference node. This is the safest way to avoid data loss and ensure the correct node always gets bootstrapped.

If you got the error, we have to find out the most advanced node first before picking up the node as the first to be bootstrapped. Create a transient container (with --rm flag), map it to the same datadir and configuration directory of the actual database container with two MySQL command flags, --wsrep_recover and --wsrep_cluster_address. For example, if we want to know mariadb1 last committed number, we need to run:

$ docker run --rm --name mariadb-recover \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/conf.d \
        mariadb:10.2.15 \
        --wsrep_recover \
        --wsrep_cluster_address=gcomm://
2018-06-12  4:46:35 139993094592384 [Note] mysqld (mysqld 10.2.15-MariaDB-10.2.15+maria~jessie) starting as process 1 ...
2018-06-12  4:46:35 139993094592384 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
...
2018-06-12  4:46:35 139993094592384 [Note] Plugin 'FEEDBACK' is disabled.
2018-06-12  4:46:35 139993094592384 [Note] Server socket created on IP: '::'.
2018-06-12  4:46:35 139993094592384 [Note] WSREP: Recovered position: e70b7437-645f-11e8-9f44-5b204e58220b:7099

The last line is what we are looking for. MariaDB prints out the cluster UUID and the sequence number of the most recently committed transaction. The node which holds the highest number is deemed as the most advanced node. Since we specified --rm, the container will be removed automatically once it exits. Repeat the above step on every Docker host by replacing the --volume path to the respective database container volumes.

Once you have compared the value reported by all database containers and decided which container is the most up-to-date node, change the safe_to_bootstrap flag to 1 inside /{datadir volume}/grastate.dat manually. Let's say all nodes are reporting the same exact sequence number, we can just pick mariadb3 to be bootstrapped by changing the safe_to_bootstrap value to 1:

$ vim /containers/mariadb3/datadir/grasate.dat
...
safe_to_bootstrap: 1

Save the file and start bootstrapping the cluster from that node, as described in the next chapter.

Bootstrapping the Cluster

Bootstrapping the cluster is similar to the first docker run command we used when starting up the cluster for the first time. If mariadb1 is the chosen bootstrap node, we can simply re-run the created bootstrap container:

$ docker start mariadb0 # on host1

Otherwise, if the bootstrap container does not exist on the chosen node, let's say on host2, run the bootstrap container command and map the existing mariadb2's volumes. We are using mariadb0 as the container name on host2 to indicate it is a bootstrap container:

$ docker run -d \
        --name mariadb0 \
        --hostname mariadb0.weave.local \
        --net weave \
        --publish "3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb2/datadir:/var/lib/mysql \
        --volume /containers/mariadb2/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm:// \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb0.weave.local

You may notice that this command is slightly shorter as compared to the previous bootstrap command described in this guide. Since we already have the proxysql user created in our first bootstrap command, we may skip these two environment variables:

  • --env MYSQL_USER=proxysql
  • --env MYSQL_PASSWORD=proxysqlpassword

Then, start the remaining MariaDB containers, remove the bootstrap container and start the existing MariaDB container on the bootstrapped host. Basically the order of commands would be:

$ docker start mariadb1 # on host1
$ docker start mariadb3 # on host3
$ docker stop mariadb0 # on host2
$ docker start mariadb2 # on host2

At this point, the cluster is started and is running at full capacity.

Resource Control

Memory is a very important resource in MySQL. This is where the buffers and caches are stored, and it's critical for MySQL to reduce the impact of hitting the disk too often. On the other hand, swapping is bad for MySQL performance. By default, there will be no resource constraints on the running containers. Containers use as much of a given resource as the host’s kernel will allow. Another important thing is file descriptor limit. You can increase the limit of open file descriptor, or "nofile" to something higher to cater for the number of files MySQL server can open simultaneously. Setting this to a high value won't hurt.

To cap memory allocation and increase the file descriptor limit to our database container, one would append --memory, --memory-swap and --ulimit parameters into the "docker run" command:

$ docker kill -s 15 mariadb1
$ docker rm -f mariadb1
$ docker run -d \
        --name mariadb1 \
        --hostname mariadb1.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --memory 16g \
        --memory-swap 16g \
        --ulimit nofile:16000:16000 \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb1.weave.local

Take note that if --memory-swap is set to the same value as --memory, and --memory is set to a positive integer, the container will not have access to swap. If --memory-swap is not set, container swap will default to --memory multiply by 2. If --memory and --memory-swap are set to the same value, this will prevent containers from using any swap. This is because --memory-swap is the amount of combined memory and swap that can be used, while --memory is only the amount of physical memory that can be used.

Some of the container resources like memory and CPU can be controlled dynamically through "docker update" command, as shown in the following example to upgrade the memory of container mariadb1 to 32G on-the-fly:

$ docker update \
    --memory 32g \
    --memory-swap 32g \
    mariadb1

Do not forget to tune the my.cnf accordingly to suit the new specs. Configuration management is explained in the next section.

Configuration Management

Most of the MySQL/MariaDB configuration parameters can be changed during runtime, which means you don't need to restart to apply the changes. Check out the MariaDB documentation page for details. The parameter listed with "Dynamic: Yes" means the variable is loaded immediately upon changing without the necessity to restart MariaDB server. Otherwise, set the parameters inside the custom configuration file in the Docker host. For example, on mariadb3, make the changes to the following file:

$ vim /containers/mariadb3/conf.d/my.cnf

And then restart the database container to apply the change:

$ docker restart mariadb3

Verify the container starts up the process by looking at the docker logs. Perform this operation on one node at a time if you would like to make cluster-wide changes.

Backup

Taking a logical backup is pretty straightforward because the MariaDB image also comes with mysqldump binary. You simply use the "docker exec" command to run the mysqldump and send the output to a file relative to the host path. The following command performs mysqldump backup on mariadb2 and saves it to /backups/mariadb2 inside host2:

$ docker exec -it mariadb2 mysqldump -uroot -p --single-transaction > /backups/mariadb2/dump.sql

Binary backup like Percona Xtrabackup or MariaDB Backup requires the process to access the MariaDB data directory directly. You have to either install this tool inside the container, or through the machine host or use a dedicated image for this purpose like "perconalab/percona-xtrabackup" image to create the backup and stored it inside /tmp/backup on the Docker host:

$ docker run --rm -it \
    -v /containers/mariadb2/datadir:/var/lib/mysql \
    -v /tmp/backup:/xtrabackup_backupfiles \
    perconalab/percona-xtrabackup \
    --backup --host=mariadb2 --user=root --password=mypassword

You can also stop the container with innodb_fast_shutdown set to 0 and copy over the datadir volume to another location in the physical host:

$ docker exec -it mariadb2 mysql -uroot -p -e 'SET GLOBAL innodb_fast_shutdown = 0'
$ docker kill -s 15 mariadb2
$ cp -Rf /containers/mariadb2/datadir /backups/mariadb2/datadir_copied
$ docker start mariadb2

Restore

Restoring is pretty straightforward for mysqldump. You can simply redirect the stdin into the container from the physical host:

$ docker exec -it mariadb2 mysql -uroot -p < /backups/mariadb2/dump.sql

You can also use the standard mysql client command line remotely with proper hostname and port value instead of using this "docker exec" command:

$ mysql -uroot -p -h127.0.0.1 -P3306 < /backups/mariadb2/dump.sql

For Percona Xtrabackup and MariaDB Backup, we have to prepare the backup beforehand. This will roll forward the backup to the time when the backup was finished. Let's say our Xtrabackup files are located under /tmp/backup of the Docker host, to prepare it, simply:

$ docker run --rm -it \
    -v mysql-datadir:/var/lib/mysql \
    -v /tmp/backup:/xtrabackup_backupfiles \
    perconalab/percona-xtrabackup \
    --prepare --target-dir /xtrabackup_backupfiles

The prepared backup under /tmp/backup of the Docker host then can be used as the MariaDB datadir for a new container or cluster. Let's say we just want to verify restoration on a standalone MariaDB container, we would run:

$ docker run -d \
    --name mariadb-restored \
    --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
    -v /tmp/backup:/var/lib/mysql \
    mariadb:10.2.15

If you performed a backup using stop and copy approach, you can simply duplicate the datadir and use the duplicated directory as a volume maps to MariaDB datadir to run on another container. Let's say the backup was copied over under /backups/mariadb2/datadir_copied, we can run a new container by running:

$ mkdir -p /containers/mariadb-restored/datadir
$ cp -Rf /backups/mariadb2/datadir_copied /containers/mariadb-restored/datadir
$ docker run -d \
    --name mariadb-restored \
    --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
    -v /containers/mariadb-restored/datadir:/var/lib/mysql \
    mariadb:10.2.15

The MYSQL_ROOT_PASSWORD must match the actual root password for that particular backup.

Severalnines
 
MySQL on Docker: How to Containerize Your Database
Discover all you need to understand when considering to run a MySQL service on top of Docker container virtualization

Database Version Upgrade

There are two types of upgrade - in-place upgrade or logical upgrade.

In-place upgrade involves shutting down the MariaDB server, replacing the old binaries with the new binaries and then starting the server on the old data directory. Once started, you have to run mysql_upgrade script to check and upgrade all system tables and also to check the user tables.

The logical upgrade involves exporting SQL from the current version using a logical backup utility such as mysqldump, running the new container with the upgraded version binaries, and then applying the SQL to the new MySQL/MariaDB version. It is similar to backup and restore approach described in the previous section.

Nevertheless, it's a good approach to always backup your database before performing any destructive operations. The following steps are required when upgrading from the current image, MariaDB 10.1.33 to another major version, MariaDB 10.2.15 on mariadb3 resides on host3:

  1. Backup the database. It doesn't matter physical or logical backup but the latter using mysqldump is recommended.

  2. Download the latest image that we would like to upgrade to:

    $ docker pull mariadb:10.2.15
  3. Set innodb_fast_shutdown to 0 for our database container:

    $ docker exec -it mariadb3 mysql -uroot -p -e 'SET GLOBAL innodb_fast_shutdown = 0'
  4. Graceful shut down the database container:

    $ docker kill --signal=TERM mariadb3
  5. Create a new container with the new image for our database container. Keep the rest of the parameters intact except using the new container name (otherwise it would conflict):

    $ docker run -d \
            --name mariadb3-new \
            --hostname mariadb3.weave.local \
            --net weave \
            --publish "3306:3306" \
            --publish "4444" \
            --publish "4567" \
            --publish "4568" \
            $(weave dns-args) \
            --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
            --volume /containers/mariadb3/datadir:/var/lib/mysql \
            --volume /containers/mariadb3/conf.d:/etc/mysql/mariadb.conf.d \
            mariadb:10.2.15 \
            --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
            --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
            --wsrep_node_address=mariadb3.weave.local
  6. Run mysql_upgrade script:

    $ docker exec -it mariadb3-new mysql_upgrade -uroot -p
  7. If no errors occurred, remove the old container, mariadb3 (the new one is mariadb3-new):

    $ docker rm -f mariadb3
  8. Otherwise, if the upgrade process fails in between, we can fall back to the previous container:

    $ docker stop mariadb3-new
    $ docker start mariadb3

Major version upgrade can be performed similarly to the minor version upgrade, except you have to keep in mind that MySQL/MariaDB only supports major upgrade from the previous version. If you are on MariaDB 10.0 and would like to upgrade to 10.2, you have to upgrade to MariaDB 10.1 first, followed by another upgrade step to MariaDB 10.2.

Take note on the configuration changes being introduced and deprecated between major versions.

Failover

In Galera, all nodes are masters and hold the same role. With ProxySQL in the picture, connections that pass through this gateway will be failed over automatically as long as there is a primary component running for Galera Cluster (that is, a majority of nodes are up). The application won't notice any difference if one database node goes down because ProxySQL will simply redirect the connections to the other available nodes.

If the application connects directly to the MariaDB bypassing ProxySQL, failover has to be performed on the application-side by pointing to the next available node, provided the database node meets the following conditions:

  • Status wsrep_local_state_comment is Synced (The state "Desynced/Donor" is also possible, only if wsrep_sst_method is xtrabackup, xtrabackup-v2 or mariabackup).
  • Status wsrep_cluster_status is Primary.

In Galera, an available node doesn't mean it's healthy until the above status are verified.

Scaling Out

To scale out, we can create a new container in the same network and use the same custom configuration file for the existing container on that particular host. For example, let's say we want to add the fourth MariaDB container on host3, we can use the same configuration file mounted for mariadb3, as illustrated in the following diagram:

Run the following command on host3 to scale out:

$ docker run -d \
        --name mariadb4 \
        --hostname mariadb4.weave.local \
        --net weave \
        --publish "3306:3307" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb4/datadir:/var/lib/mysql \
        --volume /containers/mariadb3/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local,mariadb4.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb4.weave.local

Once the container is created, it will join the cluster and perform SST. It can be accessed on port 3307 externally or outside of the Weave network, or port 3306 within the host or within the Weave network. It's not necessary to include mariadb0.weave.local into the cluster address anymore. Once the cluster is scaled out, we need to add the new MariaDB container into the ProxySQL load balancing set via admin console:

$ docker exec -it proxysql1 mysql -uadmin -padmin -P6032
mysql> INSERT INTO mysql_servers(hostgroup_id,hostname,port) VALUES (10,'mariadb4.weave.local',3306);
mysql> INSERT INTO mysql_servers(hostgroup_id,hostname,port) VALUES (20,'mariadb4.weave.local',3306);
mysql> LOAD MYSQL SERVERS TO RUNTIME;
mysql> SAVE MYSQL SERVERS TO DISK;

Repeat the above commands on the second ProxySQL instance.

Finally for the the last step, (you may skip this part if you already ran "SAVE .. TO DISK" statement in ProxySQL), add the following line into proxysql.cnf to make it persistent across container restart on host1 and host2:

$ vim /containers/proxysql1/proxysql.cnf # host1
$ vim /containers/proxysql2/proxysql.cnf # host2

And append mariadb4 related lines under mysql_server directive:

mysql_servers =
(
        { address="mariadb1.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb4.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb1.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb4.weave.local" , port=3306 , hostgroup=20, max_connections=100 }
)

Save the file and we should be good on the next container restart.

Scaling Down

To scale down, simply shuts down the container gracefully. The best command would be:

$ docker kill -s 15 mariadb4
$ docker rm -f mariadb4

Remember, if the database node left the cluster ungracefully, it was not part of scaling down and would affect the quorum calculation.

To remove the container from ProxySQL, run the following commands on both ProxySQL containers. For example, on proxysql1:

$ docker exec -it proxysql1 mysql -uadmin -padmin -P6032
mysql> DELETE FROM mysql_servers WHERE hostname="mariadb4.weave.local";
mysql> LOAD MYSQL SERVERS TO RUNTIME;
mysql> SAVE MYSQL SERVERS TO DISK;

You can then either remove the corresponding entry inside proxysql.cnf or just leave it like that. It will be detected as OFFLINE from ProxySQL point-of-view anyway.

Summary

With Docker, things get a bit different from the conventional way on handling MySQL or MariaDB servers. Handling stateful services like Galera Cluster is not as easy as stateless applications, and requires proper testing and planning.

In our next blog on this topic, we will evaluate the pros and cons of running Galera Cluster on Docker without any orchestration tools.

by ashraf at June 21, 2018 07:11 AM

June 20, 2018

Peter Zaitsev

Percona Server for MongoDB 3.4.15-2.13 Is Now Available

MongoRocks

Percona Server for MongoDB 3.2Percona announces the release of Percona Server for MongoDB 3.4.15-2.13 on June 20, 2018. Download the latest version from the Percona web site or the Percona Software Repositories.

Percona Server for MongoDB is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 3.4 Community Edition. It supports MongoDB 3.4 protocols and drivers.

Percona Server for MongoDB extends MongoDB Community Edition functionality by including the Percona Memory Engine and MongoRocks storage engine, as well as several enterprise-grade features. It requires no changes to MongoDB applications or code.

This release is based on MongoDB 3.4.15 and does not include any additional changes.

The Percona Server for MongoDB 3.4.15-2.13 release notes are available in the official documentation.

The post Percona Server for MongoDB 3.4.15-2.13 Is Now Available appeared first on Percona Database Performance Blog.

by Dmitriy Kostiuk at June 20, 2018 04:44 PM

Percona XtraDB Cluster 5.6.40-26.25 Is Now Available

Percona XtraDB Cluster 5.7

Percona XtraDB Cluster 5.6Percona announces the release of Percona XtraDB Cluster 5.6.40-26.25 (PXC) on June 20, 2018. Binaries are available from the downloads section or our software repositories.

Percona XtraDB Cluster 5.6.40-26.25 is now the current release, based on the following:

All Percona software is open-source and free.

New feature

  • PXC-907: New variable wsrep_RSU_commit_timeout allows to configure RSU wait for active commit connection timeout (in microseconds).

Fixed Bugs

  • PXC-2128: Duplicated auto-increment values were set for the concurrent sessions on cluster reconfiguration due to the erroneous readjustment.
  • PXC-2059: Error message about the necessity of the SUPER privilege appearing in case of the CREATE TRIGGER statements fail due to enabled WSREP was made more clear.
  • PXC-2091: Check for the maximum number of rows, that can be replicated as a part of a single transaction because of the Galera limit, was enforced even when replication was disabled with wsrep_on=OFF.
  • PXC-2103: Interruption of the local running transaction in a COMMIT state by a replicated background transaction while waiting for the binlog backup protection caused the commit fail and, eventually, an assert in Galera.
  • PXC-2130Percona XtraDB Cluster failed to build with Python 3.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

The post Percona XtraDB Cluster 5.6.40-26.25 Is Now Available appeared first on Percona Database Performance Blog.

by Dmitriy Kostiuk at June 20, 2018 02:07 PM

Is Serverless Just a New Word for Cloud Based?

serverless architecture

serverless architectureServerless is a new buzzword in the database industry. Even though it gets tossed around often, there is some confusion about what it really means and how it really works. Serverless architectures rely on third-party Backend as a Service (BaaS) services. They can also include custom code that is run in managed, ephemeral containers on a Functions as a Service (FaaS) platform. In comparison to traditional Platform as a Service (PaaS) server architecture, where you pay a predetermined sum for your instances, serverless applications benefit from reduced costs of operations and lower complexity. They are also considered to be more agile, allowing for reduced engineering efforts.

In reality, there are still servers in a serverless architecture: they are just being used, managed, and maintained outside of the application. But isn’t that a lot like what cloud providers, such as Amazon RDS, Google Cloud, and Microsoft Azure, are already offering? Well, yes, but with several caveats.

When you use any of the aforementioned platforms, you still need to provision the types of instances that you plan to use and define how those platforms will act. For example, will it run MySQL, MongoDB, PostgreSQL, or some other tool? With serverless, these decisions are no longer needed. Instead, you simply consume resources from a shared resource pool, using whatever application suits your needs at that time. In addition, in a serverless world, you are only charged for the time that you use the server instead of being charged whether you use it a lot or a little (or not at all).

Remember When You Joined That Gym?

How many of us have purchased a gym membership at some point in our life? Oftentimes, you walk in with the best of intentions and happily enroll in a monthly plan. “For only $29.95 per month, you can use all of the resources of the gym as much as you want.” But, many of us have purchased such a membership and found that our visits to the gym dwindle over time, leaving us paying the same monthly fee for less usage.

Traditional Database as a Service (DBaaS) offerings are similar to your gym membership: you sign up, select your service options, and start using them right away. There are certainly cases of companies using those services consistently, just like there are gym members who show up faithfully month after month. But there are also companies who spin up database instances for a specific purpose, use the database instance for some amount of time, and then slowly find that they are accessing that instance less and less. However, the fees for the instance, much like the fees for your gym membership, keep getting charged.

What if we had a “pay as you go” gym plan? Well, some of those certainly exist. Serverless architecture is somewhat like this plan: you only pay for the resources when you use them, and you only pay for your specific usage. This would be like charging $5 for access to the weight room and $3 for access to the swimming pool, each time you use one or the other. The one big difference with serverless architecture for databases is that you still need to have your data stored somewhere in the environment and made available to you as needed. This would be like renting a gym locker to store your workout gear so that didn’t have to bring it back and forth each time you visited.

Obviously, you will pay for that storage, whether it is your data or your workout gear, but the storage fees are going to be less than your standard membership. The big advantage is that you have what you need when you need it, and you can access the necessary resources to use whatever you are storing.

With a serverless architecture, you store your data securely on low cost storage devices and access as needed. The resources required to process that data are available on an on demand basis. So, your charges are likely to be lower since you are paying a low fee for data storage and a usage fee on resources. This can work great for companies that do not need 24x7x365 access to their data since they are only paying for the services when they are using them. It’s also ideal for developers, who may find that they spend far more time working on their application code than testing it against the database. Instead of paying for the database resources while the data is just sitting there doing nothing, you now pay to store the data and incur the database associated fees at use time.

Benefits and Risks of Going Serverless

One of the biggest possible benefits of going with a serverless architecture is that you save money and hassle. Money can be saved since you only pay for the resources when you use them. Hassle is reduced since you don’t need to worry about the hardware on which your application runs. These can be big wins for a company, but you need to be aware of some pitfalls.

First, serverless can save you money, but there is no guarantee that it will save you money.

Consider 2 different people who have the exact same cell phone – maybe it’s your dad and your teenage daughter. These 2 users probably have very different patterns of usage: your dad uses the phone sporadically (if at all!) and your teenage daughter seems to have her phone physically attached to her. These 2 people would benefit from different service plans with their provider. For your dad, a basic plan that allows some usage (similar to the base cost of storage in our serverless database) with charges for usage above that cap would probably suffice. However, such a plan for your teenage daughter would probably spiral out of control and incur very high usage fees. For her, an unlimited plan makes sense. What is a great fit for one user is a poor fit for another, and the same is true when comparing serverless and DBaaS options.

The good news is that serverless architectures and DBaaS options, like Amazon RDS, Microsoft Azure, and Google Cloud, reduce a lot of the hassle of owning and managing servers. You no longer need to be concerned about Mean Time Between Failures, power and cooling issues, or many of the other headaches that come with maintaining your hardware. However, this can also have a negative consequence.

The challenge of enforced updates

About the only thing that is consistent about software in today’s world is that it is constantly changing. New versions are released with new features that may or may not be important to you. When a serverless provider decides to implement a new version or patch of their backend, there may be some downstream issues for you to manage. It is always important to test any new updates, but now some of the decisions about how and when to upgrade may be out of your control. Proper notification from the provider gives you a window of time for testing, but they are probably going to flip the switch regardless of whether or not you have completed all of your test cycles. This is true of both serverless and DBaaS options.

A risk of vendor lock-in

A common mantra in the software world is that we want to avoid vendor lock-in. Of course, from the provider’s side, they want to avoid customer churn, so we often find ourselves on opposite sides of the same issue. Moving to a new platform or provider becomes more complex as you cede more aspects of server management to the host. This means that serverless can cause deep lock-in since your application is designed to work with the environment as your provider has configured it. If you choose to move to a different provider, you need to extract your application and your data from the current provider and probably need to rework it to fit the requirements of the new provider.

The challenge of client-side optimization

Another consideration is that optimizations of server-side configurations must necessarily be more generic compared to those you might make to self-hosted servers. Optimization can no longer be done at the server level for your specific application and use; instead, you now rely on a smarter client to perform your necessary optimizations. This requires a skill set that may not exist with some developers: the ability to tune applications client-side.

Conclusion

Serverless is not going away. In fact, it is likely to grow as people come to a better understanding and comfort level with it. You need to be able to make an informed decision regarding whether serverless is right for you. Careful consideration of the pros and cons is imperative for making a solid determination. Understanding your usage patterns, user expectations, development capabilities, and a lot more will help to guide that decision.

In a future post, I’ll review the architectural differences between on-premises, PaaS, DBaaS and serverless database environments.

 

The post Is Serverless Just a New Word for Cloud Based? appeared first on Percona Database Performance Blog.

by Rick Golba at June 20, 2018 11:35 AM

Webinar Thu 6/21: How to Analyze and Tune MySQL Queries for Better Performance

database query tuning

database query tuningPlease join Percona’s MySQL Database Administrator, Brad Mickel as he presents How to Analyze and Tune MySQL Queries for Better Performance on Thursday, June 21st, 2018, at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

 

Query performance is essential in making any application successful. In order to finely tune your queries you first need to understand how MySQL executes them, and what tools are available to help identify problems.

In this session you will learn:

  1. The common tools for researching problem queries
  2. What an Index is, and why you should use one
  3. Index limitations
  4. When to rewrite the query instead of just adding a new index
Register Now

 

Brad Mickel

MySQL DBA

Bradley began working with MySQL in 2013 as part of his duties in healthcare billing. After 3 years in healthcare billing he joined Percona as part of the bootcamp process. After the bootcamp he has served as a remote database administrator on the Atlas team for Percona Managed Services.

The post Webinar Thu 6/21: How to Analyze and Tune MySQL Queries for Better Performance appeared first on Percona Database Performance Blog.

by Bradley Mickel at June 20, 2018 09:09 AM

June 19, 2018

Jean-Jerome Schmidt

Comparing RDS vs EC2 for Managing MySQL or MariaDB on AWS

RDS is a Database as a Service (DBaaS) that automatically configures and maintains your databases in the AWS cloud. The user has limited power over specific configurations in comparison to running MySQL directly on Elastic Compute Cloud (EC2). But RDS is a convenient service, as long as you can live with the instances and configurations that it offers.

Amazon RDS currently supports various MySQL and MariaDB versions as well as the, MySQL-compatible Amazon Aurora DB engine. It does support replication, but as you may expect from a predefined web console, there are some limitations.

Amazon RDS Services
Amazon RDS Services

There are some tradeoffs when using RDS. These may not only affect the way you manage and provision your database instances, but also key things like performance, security, and high availability.

In this blog, we will take a look at the differences between using RDS and running MySQL on EC2, with focus on replication. As we will see, to decide between hosting MySQL on an EC2 instance or using Amazon RDS is not an easy task.

RDS Platform Tradeoffs

The biggest size of database that AWS can host depends on your source environment, the allocation of data in your source database, and how busy your system is.

Amazon RDS Environment options
Amazon RDS Environment options
Amazon RDS instance class
Amazon RDS instance class

AWS is split into regions. Every AWS account has limits, per region, on the number of AWS resources that can be created. Once a limit for a resource has been reached, additional calls to create that resource will fail.

AWS Regions
AWS Regions

For Amazon RDS MySQL DB instances, the maximum provisioned storage limit constrains the size of a table to a maximum size of 6 TB when using InnoDB file-per-table tablespaces.

InnoDB file-per-table feature is something that you should consider even if you are not looking to migrate a big database into the cloud. You may notice that some existing DB instances have a lower limit. For example, MySQL DB instances created prior to April 2014 have a file and table size limit of 2 TB. This 2-TB file size limit also applies to DB instances or Read Replicas created from DB snapshots taken before April 2014.

One of the key differences which affects the way you set up and maintain database replication is the lack of SUPER user. To address this limitation, Amazon introduced stored procedures that take care of various DBA tasks. Below are the key procedures to manage MySQL RDS replication.

Skip replication error:

CALL mysql.rds_skip_repl_error;

Stop replication:

CALL mysql.rds_stop_replication;

Start replication

CALL mysql.rds_start_replication;

Configures an RDS instance as a Read Replica of a MySQL instance running outside of AWS.

CALL mysql.rds_set_external_master;

Reconfigures a MySQL instance to no longer be a Read Replica of a MySQL instance running outside of AWS.

CALL mysql.rds_reset_external_master;

Imports a certificate. This is needed to enable SSL communication and encrypted replication.

CALL mysql.rds_import_binlog_ssl_material;

Removes a certificate.

CALL mysql.rds_remove_binlog_ssl_material;

Changes the replication master log position to the start of the next binary log on the master.

CALL mysql.rds_next_master_log;

While stored procedures take care of a number of tasks, it is a bit of a learning curve. Lack of SUPER privilege can also create problems in using external replication monitoring.

Amazon RDS does not currently support the following:

  • Global Transaction IDs
  • Transportable Table Space
  • Authentication Plugin
  • Password Strength Plugin
  • Replication Filters
  • Semi-synchronous Replication

Last but not least - access to the shell. Amazon RDS does not allow direct host access to a DB instance via Telnet, Secure Shell (SSH), or Windows Remote Desktop Connection (RDP). You can still use the client on an application host to connect to the DB via standard tools like mysql client.

There are other limitations, as described in the RDS documentation.

High availability with MySQL on EC2

There are options to operate MySQL directly on EC2, and thereby retain control of one’s high availability options. When going down this route, it is important to understand how to leverage the different AWS features that are at your disposal. Make sure you check out our ‘DIY Cloud Database’ white paper.

To automate deployment and management/maintenance tasks (while retaining control), it is possible to use ClusterControl. Just like with RDS, you have the convenience of deploying a database setup in a few minutes via a GUI. Adding nodes, scheduling backups, performing failovers, and so on, can also be conveniently done via the GUI.

Deployment

ClusterControl can automate deployment of different high availability database setups - from master-slave replication to multi-master clusters. All the main MySQL flavours are supported - Oracle MySQL, MariaDB and Percona Server. Some initial setup of VPC/security group is required, and these are well described in the DIY Cloud Database whitepaper. Note that similar concepts apply, whether it is AWS or Google Cloud or Azure

ClusterControl Deploy in EC2
ClusterControl Deploy in EC2

Galera Cluster is a good alternative to consider when deploying a highly available MySQL service. It has established itself as a credible replacement for traditional MySQL master-slave architectures, although it is not a drop-in replacement. Most applications can still be adapted to run on it. It is possible to define different segments for databases that span across multiple AWS regions.

ClusterControl expand cluster in EC2
ClusterControl expand cluster in EC2

It is possible to setup ‘hybrid replication’ by combining synchronous replication within a Galera Cluster and asynchronous replication between the cluster and one or more slaves. Options like delaying the slave gives an additional level of protection to the data.

ClusterControl Add replication in EC2
ClusterControl Add replication in EC2

Proxy layer

To achieve high availability, deploying a highly available setup is not enough. The applications have to somehow know which nodes are working and which ones are not. Changes in topology, e.g. moving a master to another host, also need to be propagated somehow so as to avoid errors in the application layer. ClusterControl supports deployments of proxies like HAProxy, MaxScale, and ProxySQL. For HAProxy and ProxySQL, there are additional options to deploy redundant instances with Keepalived and VirtualIP.

ClusterControl manager load balancers on EC2 nodes
ClusterControl manager load balancers on EC2 nodes

Cross-region replica

Amazon RDS provides read replica services. Cross-region replicas give you the ability to scale reads, as AWS has its services in a number of datacenters around the world. All read replicas are accessible and can be used for reading in a maximum number of five regions. These nodes are independent and can be used in your upgrade path, or can be promoted to standalone databases.

In addition to that, Amazon offers Multi-AZ deployments based on DRBD, synchronous disk replication. How is it different from Read Replicas? The main difference is that only the database engine on the primary instance is active, which leads to other architectural variations.

As opposed to read replicas, database engine version upgrades happen on the primary. Another difference is that AWS RDS will failover automatically with DRBD, while read replicas (using asynchronous replication) will require manual operations from you.

Multi-AZ failover on RDS uses a DNS change to point to the standby instance, according to Amazon this should happen within 60-120 seconds during the failover. Because the standby uses the same storage data as the primary, there will probably be transaction/log recovery. Bigger databases may spend a significant amount of time on InnoDB recovery, so please consider that in your DR plan and RTO calculation.

Of course, this goes with additional cost. Let’s take a look at some basic example. The cost of db.t2.medium host with 2vCPU, 4GB ram is 185.98 USD per month, the price will double when you enable Multizone (MZ) replica to 370.98 UDB. The price will vary by region but it will double in MZ.

Cost comparision
Cost comparision
Cost comparision
Cost comparision
Cost comparision

In order to achieve the same with EC2, you can deploy your virtual machines in different regions. Each AWS Region is completely independent. The setting of AWS Region can be changed in the console, by setting the EC2_REGION environment variable, or it can be overridden by using the --region parameter with the AWS Command Line Interface. When your set of servers are ready, you can use ClusterControl to deploy and monitor your replication. You can also manually set up replication through the console using standard commands.

Cross technology replication

It is possible to setup replication between an Amazon RDS MySQL or MariaDB DB instance and a MySQL or MariaDB instance that is external to Amazon RDS. This is done using standard replication method in mysql, through binary logs. To enable binary logs, you need to modify the my.cnf configuration. Without access to the shell, this task became impossible in RDS. It's done in a not so obvious way. You have two options. One is to enable backups - set automated backups on your Amazon RDS DB instance with retention to higher than 0. Or enable replication to a prebuilt slave server. Both tasks will enable binary logs which you can later on use for your replication.

Enable binary logs via RDS backup
Enable binary logs via RDS backup

Maintain the binlogs in your master instance until you have verified that they have been applied on the replica. This maintenance ensures that you can restore your master instance in the event of a failure.

Another roadblock can be permissions. The permissions required to start replication on an Amazon RDS DB instance are restricted and not available to your Amazon RDS master user. Because of this, you must use the Amazon RDS mysql.rds_set_external_master and mysql.rds_start_replication commands to set up replication between your live database and your Amazon RDS database.

Monitor failover events for the Amazon RDS instance that is your replica. If a failover occurs, then the DB instance that is your replica might be recreated on a new host with a different network address. For information on how to monitor failover events, see Using Amazon RDS Event Notification.

In the below example, we will see how to enable replication from RDS to an external DB located on an EC2 instance.
You should have binary logs enabled, we use an RDS slave here.

Specify the number of hours to retain binary logs.

mysql -h RDS_MASTER -u<username> -u<password>
call mysql.rds_set_configuration('binlog retention hours', 7);

On RDS MASTER, create replication user with the following commands:

CREATE USER 'repl'@'ec2DBslave' IDENTIFIED BY 's3cr3tp4SSw0rd';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'ec2DBslave';

On RDS SLAVE, run the commands:

mysql -u<username> -u<password> -h RDS_SLAVE
call mysql.rds_stop_replication;
SHOW SLAVE STATUS;  Exec_Master_Log_Pos, Relay_Master_Log_File.

On RDS SLAVE, run mysqldump with the following format:

mysqldump -u<username> -u<password> -h RDS_SLAVE --routines --triggers --single-transaction --databases DB1 DB2 DB3 > mysqldump.sql

Import the DB dump to external database:

mysql -u<username> -u<password> -h ec2DBslave
tee import_database.log;
source mysqldump.sql;
CHANGE MASTER TO 
 MASTER_HOST='RDS_MASTER', 
 MASTER_USER='repl',
 MASTER_PASSWORD='s3cr3tp4SSw0rd',
 MASTER_LOG_FILE='<Relay_Master_Log_File>',
 MASTER_LOG_POS=<Exec_Master_Log_Pos>;

Create a replication filter to ignore tables created by AWS only on RDS

CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ('mysql.rds\_%');

Start replication

START SLAVE;

Verify replication status

SHOW SLAVE STATUS;

That’s it for now. Managing MySQL on AWS is a big topic. Do let us know your thoughts in the comments section below.

by Bart Oles at June 19, 2018 08:49 PM

Peter Zaitsev

Chunk Change: InnoDB Buffer Pool Resizing

innodb buffer pool chunk size

Since MySQL 5.7.5, we have been able to resize dynamically the InnoDB Buffer Pool. This new feature also introduced a new variable — innodb_buffer_pool_chunk_size — which defines the chunk size by which the buffer pool is enlarged or reduced. This variable is not dynamic and if it is incorrectly configured, could lead to undesired situations.

Let’s see first how innodb_buffer_pool_size , innodb_buffer_pool_instances  and innodb_buffer_pool_chunk_size interact:

The buffer pool can hold several instances and each instance is divided into chunks. There is some information that we need to take into account: the number of instances can go from 1 to 64 and the total amount of chunks should not exceed 1000.

So, for a server with 3GB RAM, a buffer pool of 2GB with 8 instances and chunks at default value (128MB) we are going to get 2 chunks per instance:

This means that there will be 16 chunks.

I’m not going to explain the benefits of having multiple instances, I will focus on resizing operations. Why would you want to resize the buffer pool? Well, there are several reasons, such as:

  • on a virtual server you can add more memory dynamically
  • for a physical server, you might want to reduce database memory usage to make way for other processes
  • on systems where the database size is smaller than available RAM
  • if you expect a huge growth and want to increase the buffer pool on demand

Reducing the buffer pool

Let’s start reducing the buffer pool:

| innodb_buffer_pool_size | 2147483648 |
| innodb_buffer_pool_instances | 8     |
| innodb_buffer_pool_chunk_size | 134217728 |
mysql> set global innodb_buffer_pool_size=1073741824;
Query OK, 0 rows affected (0.00 sec)
mysql> show global variables like 'innodb_buffer_pool_size';
+-------------------------+------------+
| Variable_name           | Value      |
+-------------------------+------------+
| innodb_buffer_pool_size | 1073741824 |
+-------------------------+------------+
1 row in set (0.00 sec)

If we try to decrease it to 1.5GB, the buffer pool will not change and a warning will be showed:

mysql> set global innodb_buffer_pool_size=1610612736;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> show warnings;
+---------+------+---------------------------------------------------------------------------------+
| Level   | Code | Message                                                                         |
+---------+------+---------------------------------------------------------------------------------+
| Warning | 1210 | InnoDB: Cannot resize buffer pool to lesser than chunk size of 134217728 bytes. |
+---------+------+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> show global variables like 'innodb_buffer_pool_size';
+-------------------------+------------+
| Variable_name           | Value      |
+-------------------------+------------+
| innodb_buffer_pool_size | 2147483648 |
+-------------------------+------------+
1 row in set (0.01 sec)

Increasing the buffer pool

When we try to increase the value from 1GB to 1.5GB, the buffer pool is resized but the requested innodb_buffer_pool_size is considered to be incorrect and is truncated:

mysql> set global innodb_buffer_pool_size=1610612736;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> show warnings;
+---------+------+-----------------------------------------------------------------+
| Level   | Code | Message                                                         |
+---------+------+-----------------------------------------------------------------+
| Warning | 1292 | Truncated incorrect innodb_buffer_pool_size value: '1610612736' |
+---------+------+-----------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> show global variables like 'innodb_buffer_pool_size';
+-------------------------+------------+
| Variable_name           | Value      |
+-------------------------+------------+
| innodb_buffer_pool_size | 2147483648 |
+-------------------------+------------+
1 row in set (0.01 sec)

And the final size is 2GB. Yes! you intended to set the value to 1.5GB and you succeeded in setting it to 2GB. Even if you set 1 byte higher, like setting: 1073741825, you will end up with a buffer pool of 2GB.

mysql> set global innodb_buffer_pool_size=1073741825;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> show global variables like 'innodb_buffer_pool_%size' ;
+-------------------------------+------------+
| Variable_name                 | Value      |
+-------------------------------+------------+
| innodb_buffer_pool_chunk_size | 134217728  |
| innodb_buffer_pool_size       | 2147483648 |
+-------------------------------+------------+
2 rows in set (0.01 sec)

Interesting scenarios

Increasing size in the config file

Let’s suppose one day you get up willing to change or tune some variables in your server, and you decide that as you have free memory you will increase the buffer pool. In this example, we are going to use a server with 

innodb_buffer_pool_instances = 16
  and 2GB of buffer pool size which will be increased to 2.5GB

So, we set in the configuration file:

innodb_buffer_pool_size = 2684354560

But then after restart, we found:

mysql> show global variables like 'innodb_buffer_pool_%size' ;
+-------------------------------+------------+
| Variable_name                 | Value      |
+-------------------------------+------------+
| innodb_buffer_pool_chunk_size | 134217728  |
| innodb_buffer_pool_size       | 4294967296 |
+-------------------------------+------------+
2 rows in set (0.00 sec)

And the error log says:

2018-05-02T21:52:43.568054Z 0 [Note] InnoDB: Initializing buffer pool, total size = 4G, instances = 16, chunk size = 128M

So, after we have set innodb_buffer_pool_size in the config file to 2.5GB, the database gives us a 4GB buffer pool, because of the number of instances and the chunk size. What the message doesn’t tell us is the number of chunks, and this would be useful to understand why such a huge difference.

Let’s take a look at how that’s calculated.

Increasing instances and chunk size

Changing the number of instances or the chunk size will require a restart and will take into consideration the buffer pool size as an upper limit to set the chunk size. For instance, with this configuration:

innodb_buffer_pool_size = 2147483648
innodb_buffer_pool_instances = 32
innodb_buffer_pool_chunk_size = 134217728

We get this chunk size:

mysql> show global variables like 'innodb_buffer_pool_%size' ;
+-------------------------------+------------+
| Variable_name                 | Value      |
+-------------------------------+------------+
| innodb_buffer_pool_chunk_size | 67108864   |
| innodb_buffer_pool_size       | 2147483648 |
+-------------------------------+------------+
2 rows in set (0.00 sec)

However, we need to understand how this is really working. To get the innodb_buffer_pool_chunk_size it will make this calculation: innodb_buffer_pool_size / innodb_buffer_pool_instances with the result rounded to a multiple of 1MB.

In our example, the calculation will be 2147483648 / 32 = 67108864 which 67108864%1048576=0, no rounding needed. The number of chunks will be one chunk per instance.

When does it consider that it needs to use more chunks per instance? When the difference between the required size and the innodb_buffer_pool_size configured in the file is greater or equal to 1MB.

That is why, for instance, if you try to set the innodb_buffer_pool_size equal to 1GB + 1MB – 1B you will get 1GB of buffer pool:

innodb_buffer_pool_size = 1074790399
innodb_buffer_pool_instances = 16
innodb_buffer_pool_chunk_size = 67141632
2018-05-07T09:26:43.328313Z 0 [Note] InnoDB: Initializing buffer pool, total size = 1G, instances = 16, chunk size = 64M

But if you set the innodb_buffer_pool_size equals to 1GB + 1MB you will get 2GB of buffer pool:

innodb_buffer_pool_size = 1074790400
innodb_buffer_pool_instances = 16
innodb_buffer_pool_chunk_size = 67141632
2018-05-07T09:25:48.204032Z 0 [Note] InnoDB: Initializing buffer pool, total size = 2G, instances = 16, chunk size = 64M

This is because it considers that two chunks will fit. We can say that this is how the InnoDB Buffer pool size is calculated:

determine_best_chunk_size{
  if innodb_buffer_pool_size / innodb_buffer_pool_instances < innodb_buffer_pool_chunk_size
  then
    innodb_buffer_pool_chunk_size = roundDownMB(innodb_buffer_pool_size / innodb_buffer_pool_instances)
  fi
}
determine_amount_of_chunks{
  innodb_buffer_amount_chunks_per_instance = roundDown(innodb_buffer_pool_size / innodb_buffer_pool_instances / innodb_buffer_pool_chunk_size)
  if innodb_buffer_amount_chunks_per_instance * innodb_buffer_pool_instances * innodb_buffer_pool_chunk_size - innodb_buffer_pool_size > 1024*1024
  then
    innodb_buffer_amount_chunks_per_instance++
  fi
}
determine_best_chunk_size
determine_amount_of_chunks
innodb_buffer_pool_size = innodb_buffer_pool_instances * innodb_buffer_pool_chunk_size * innodb_buffer_amount_chunks_per_instance

What is the best setting?

In order to analyze the best setting you will need to know that there is a upper limit of 1000 chunks. In our example with 16 instances, we can have no more than 62 chunks per instance.

Another thing to consider is what each chunk represents in percentage terms. Continuing with the example, each chunk per instance represent 1.61%, which means that we can increase or decrease the complete buffer pool size in multiples of this percentage.

From a management point of view, I think that you might want to consider at least a range of 2% to 5% to increase or decrease the buffer. I performed some tests to see the impact of having small chunks and I found no issues but this is something that needs to be thoroughly tested.

The post Chunk Change: InnoDB Buffer Pool Resizing appeared first on Percona Database Performance Blog.

by David Ducos at June 19, 2018 06:02 PM

Webinar Weds 20/6: Percona XtraDB Cluster 5.7 Tutorial Part 2

webinar Percona XtraDB Cluster

Including setting up Percona XtraDB Cluster with ProxySQL and PMM

webinar Percona XtraDB ClusterPlease join Percona’s Architect, Tibi Köröcz as he presents Percona XtraDB Cluster 5.7 Tutorial Part 2 on Wednesday, June 20th, 2018, at 7:00 am PDT (UTC-7) / 10:00 am EDT (UTC-4).

 

Never used Percona XtraDB Cluster before? This is the webinar for you! In this 45-minute webinar, we will introduce you to a fully functional Percona XtraDB Cluster.

This webinar will show you how to install Percona XtraDB Cluster with ProxySQL, and monitor it with Percona Monitoring and Management (PMM).

We will also cover topics like bootstrap, IST, SST, certification, common-failure situations and online schema changes.

After this webinar, you will have enough knowledge to set up a working Percona XtraDB Cluster with ProxySQL, in order to meet your high availability requirements.

You can see part one of this series here: Percona XtraDB Cluster 5.7 Tutorial Part 1

Register Now!

Tibor Köröcz

Architect

ProxySQL for Connection Pooling

Tibi joined Percona in 2015 as a Consultant. Before joining Percona, among many other things, he worked at the world’s largest car hire booking service as a Senior Database Engineer. He enjoys trying and working with the latest technologies and applications which can help or work with MySQL together. In his spare time he likes to spend time with his friends, travel around the world and play ultimate frisbee.

 

The post Webinar Weds 20/6: Percona XtraDB Cluster 5.7 Tutorial Part 2 appeared first on Percona Database Performance Blog.

by Tibor Korocz at June 19, 2018 12:18 PM

Percona Database Performance Blog Commenting Issues

Percona Database Performance Blog Comments

We are experiencing an intermittent commenting problem on the Percona Database Performance Blog.

At Percona, part of our purpose is to engage with the open source database community. A big part of this engagement is the Percona Database Performance Blog, and the participation of the community in reading and commenting. We appreciate your interest and contributions.

Currently, we are experiencing an intermittent problem with comments on the blog. Some users are unable to post comments and can receive an error message similar to the following:

Percona Blog Error MsgWe are working on correcting the issue, and apologize if it affects you. If you have a comment that you want to make on a specific blog and this issue affects you, you can also leave comments on Twitter, Facebook or LinkedIn – all blog posts are socialized on those platforms.

Thanks for your patience.

The post Percona Database Performance Blog Commenting Issues appeared first on Percona Database Performance Blog.

by Dave Avery at June 19, 2018 02:28 AM

June 18, 2018

MariaDB Foundation

MariaDB 10.1.34 and latest MariaDB Connectors now available

The MariaDB Foundation is pleased to announce the availability of MariaDB 10.1.34, the latest stable release in the MariaDB 10.1 series, as well as MariaDB Connector/C 3.0.5, MariaDB Connector/C 2.3.6, MariaDB Connector/J 2.2.5, MariaDB Connector/J 1.7.4, MariaDB Connector/ODBC 3.0.5 and MariaDB Connector/ODBC 2.0.17, the latest stable MariaDB Connector releases. See the release notes and changelogs […]

The post MariaDB 10.1.34 and latest MariaDB Connectors now available appeared first on MariaDB.org.

by Ian Gilfillan at June 18, 2018 02:15 PM

MariaDB AB

MariaDB Server 10.1.34 and updated Connectors now available

MariaDB Server 10.1.34 and updated Connectors now available dbart Mon, 06/18/2018 - 09:51

The MariaDB project is pleased to announce the immediate availability of MariaDB Server 10.1.34 and updated MariaDB Connectors. See the release notes and changelogs for details and visit mariadb.com/downloads to download.

Download MariaDB Server 10.1.34

Release Notes Changelog What is MariaDB 10.1?


Download MariaDB Connectors

The MariaDB project is pleased to announce the immediate availability of MariaDB Server 10.1.34 and updated MariaDB Connectors. See the release notes and changelogs for details.

Login or Register to post comments

by dbart at June 18, 2018 01:51 PM

Peter Zaitsev

Webinar Tues 19/6: MySQL: Scaling and High Availability – Production Experience from the Last Decade(s)

scale high availability

scale high availability
Please join Percona’s CEO, Peter Zaitsev as he presents MySQL: Scaling and High Availability – Production Experience Over the Last Decade(s) on Tuesday, June 19th, 2018 at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).

 

Percona is known as the MySQL performance experts. With over 4,000 customers, we’ve studied, mastered and executed many different ways of scaling applications. Percona can help ensure your application is highly available. Come learn from our playbook, and leave this talk knowing your MySQL database will run faster and more optimized than before.

Register Now

About Peter Zaitsev, CEO

Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With over 140 professionals in 30 plus countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of internet giants, large enterprises and many exciting startups. Percona was named to the Inc. 5000 in 2013, 2014, 2015 and 2016.

Peter was an early employee at MySQL AB, eventually leading the company’s High Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Database Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his recent ebook Practical MySQL Performance Optimization Volume 1 is one of percona.com’s most popular downloads. Peter lives in North Carolina with his wife and two children. In his spare time, Peter enjoys travel and spending time outdoors.

The post Webinar Tues 19/6: MySQL: Scaling and High Availability – Production Experience from the Last Decade(s) appeared first on Percona Database Performance Blog.

by Peter Zaitsev at June 18, 2018 11:02 AM

June 15, 2018

Peter Zaitsev

This Week in Data with Colin Charles 42: Security Focus on Redis and Docker a Timely Reminder to Stay Alert

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Much of last week, there was a lot of talk around this article: New research shows 75% of ‘open’ Redis servers infected. It turns out, it helps that one should always read beyond the headlines because they tend to be more sensationalist than you would expect. From the author of Redis, I highly recommend reading Clarifications on the Incapsula Redis security report, because it turns out that in this case, it is beyond the headline. The content is also suspect. Antirez had to write this to help the press (we totally need to help keep reportage accurate).

Not to depart from the Redis world just yet, but Antirez also had some collaboration with the Apple Information Security Team with regards to the Redis Lua subsystem. The details are pretty interesting as documented in Redis Lua scripting: several security vulnerabilities fixed because you’ll note that the Alibaba team also found some other issues. Antirez also ensured that the Redis cloud providers (notably: Redis Labs, Amazon, Alibaba, Microsoft, Google, Heroku, Open Redis and Redis Green) got notified first (and in the comments, compose.io was missing, but now added to the list). I do not know if Linux distributions were also informed, but they will probably be rolling out updates soon.

In the “be careful where you get your software” department: some criminals have figured out they could host some crypto-currency mining software that you would get pre-installed if you used their Docker containers. They’ve apparently made over $90,000. It is good to note that the Backdoored images downloaded 5 million times finally removed from Docker Hub. This, however, was up on the Docker Hub for ten months and they managed to get over 5 million downloads across 17 images. Know what images you are pulling. Maybe this is again more reason for software providers to run their own registries?

James Turnbull is out with a new book: Monitoring with Prometheus. It just got released, I’ve grabbed it, but a review will come shortly. He’s managed all this while pulling off what seems to be yet another great O’Reilly Velocity San Jose Conference.

Releases

A quiet week on this front.

Link List

  • INPLACE upgrade from MySQL 5.7 to MySQL 8.0
  • PostgreSQL relevant: What’s is the difference between streaming replication vs hot standby vs warm standby ?
  • A new paper on Amazon Aurora is out: Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes. It was presented at SIGMOD 2018, and an abstract: “One of the more novel differences between Aurora and other relational databases is how it pushes redo processing to a multi-tenant scale-out storage service, purpose-built for Aurora. Doing so reduces networking traffic, avoids checkpoints and crash recovery, enables failovers to replicas without loss of data, and enables fault-tolerant storage that heals without database involvement. Traditional implementations that leverage distributed storage would use distributed consensus algorithms for commits, reads, replication, and membership changes and amplify cost of underlying storage.” Aurora, as you know, avoids distributed consensus under most circumstances. Short 8-page read.
  • Dormando is blogging again, and this was of particular interest — Caching beyond RAM: the case for NVMe. This is done in the context of memcached, which I am certain many use.
  • It is particularly heartening to note that not only does MongoDB use Linkbench for some of their performance testing, they’re also contributing to making it better via a pull request.

Industry Updates

Trying something new here… To cover fundraising, and people on the move in the database industry.

  • Kenny Gorman — who has been on the program committee for several Percona Live conferences, and spoken at the event multiple times before — is the founder and CEO of Eventador, a stream-processing as a service company built on Apache Kafka and Apache Flink, has just raised $3.8 million in funding to fuel their growth. They are also naturally spending this on hiring. The full press release.
  • Jimmy Guerrero (formerly of MySQL and InfluxDB) is now VP Marketing & Community at YugaByte DB. YugaByte was covered in column 13 as having raised $8 million in November 2017.

Upcoming appearances

  • DataOps Barcelona – Barcelona, Spain – June 21-22, 2018 – code dataopsbcn50 gets you a discount
  • OSCON – Portland, Oregon, USA – July 16-19, 2018
  • Percona webinar on Maria Server 10.3 – June 26, 2018

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

The post This Week in Data with Colin Charles 42: Security Focus on Redis and Docker a Timely Reminder to Stay Alert appeared first on Percona Database Performance Blog.

by Colin Charles at June 15, 2018 04:18 PM

Tuning PostgreSQL for sysbench-tpcc

PostgreSQL benchmark

tuning PostgreSQL performance for sysbench-tpccPercona has a long tradition of performance investigation and benchmarking. Peter Zaitsev, CEO and Vadim Tkachenko, CTO, led their crew into a series of experiments with MySQL in this space. The discussion that always follows on the results achieved is well known and praised even by the PostgreSQL community. So when Avi joined the team and settled at Percona just enough to get acquainted with my colleagues, sure enough one of the first questions they asked him was: “did you know sysbench-tpcc also works with PostgreSQL now ?!“. 

sysbench

sysbench is “a scriptable multi-threaded benchmark tool based on LuaJIT (…) most frequently used for database benchmarks“, created and maintained by Alexey Kopytov. It’s been around for a long time now and has been a main reference for MySQL benchmarking since its inception. One of the favorites of Netflix’ Brendan Gregg, we now know. You may remember Sveta Smirnova and Alexander Korotkov’s report on their experiments in Millions of Queries per Second: PostgreSQL and MySQL’s Peaceful Battle at Today’s Demanding Workloads here. In fact, that post may serve as a nice prelude for the tests we want to show you today. It provides a good starting point as a MySQL vs PostgreSQL performance comparison.

The idea behind Sveta and Alexander’s experiments was “to provide an honest comparison for the two popular RDBMSs“, MySQL and PostgreSQL, using “the same tool, under the same challenging workloads and using the same configuration parameters (where possible)“. Since neither pgbench nor sysbench would work effectively with MySQL and PostgreSQL for both writes and reads they attempted to port pgbench‘s workload as a sysbench benchmark. 

sysbench-tpcc

More recently, Vadim came up with an implementation of the famous TPC-C workload benchmark for sysbench, sysbench-tpcc. He has since published a series of tests using Percona Server and MySQL, and worked to make it compatible with PostgreSQL too. For real now, hence the request that awaited us.

Our goal this time was less ambitious than Sveta and Alexander’s. We wanted to show you how we setup PostgreSQL to perform optimally for sysbench-tpcc, highlighting the settings we tuned the most to accomplish this. We ran our tests on the same box used by Vadim in his recent experiments with Percona Server for MySQL and MySQL.

A valid benchmark – benchmark rules

Before we present our results we shall note there are several ways to speed up database performance. You may for example disable full_page_writes, which would make a server crash unsafe, and use a minimalistic wal_level mode, which would block replication capability. These would speed things up but at the expense of reliability, making the server inappropriate for production usage.

For our benchmarks, we made sure we had all the necessary parameters in place to satisfy the following:

  1. ACID Compliance
  2. Point-in-time-recovery
  3. WALs usable by Replica/Slave for Replication
  4. Crash Recovery
  5. Frequent Checkpointing to reduce time for Crash Recovery
  6. Autovacuum

When we initially prepared sysbench-tpcc with PostgreSQL 10.3 the database size was 118 GB. By the time we completed the test, i.e. after 36000 seconds, the DB size had grown up to 335 GB. We have a total of “only” 256 GB of memory available in this server, however, based on the observations from pg_stat_database, pg_statio_user_tables and pg_statio_user_indexes 99.7% of the blocks were always in-memory:

postgres=# select ((blks_hit)*100.00)/(blks_hit+blks_read) AS “perc_mem_hit” from pg_stat_database where datname like ‘sbtest’;
   perc_mem_hit
---------------------
99.7267224322546
(1 row)

Hence, we consider it to be an in-memory workload with the whole active data set in RAM. In this post we explain how we tuned our PostgreSQL Instance for an in-memory workload, as was the case here.

Preparing the database before running sysbench

In order to run a sysbench-tpcc, we must first prepare the database to load some data. In our case, as mentioned above, this initial step resulted in a 118 GB database:

postgres=# select datname, pg_size_pretty(pg_database_size(datname)) as "DB_Size" from pg_stat_database where datname = 'sbtest';
 datname | DB_Size
---------+---------
 sbtest  | 118 GB
(1 row)

This may change depending on the arguments used. Here is the actual command we used to prepare the PostgreSQL Database for sysbench-tpcc:

$ ./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=120 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --db-driver=pgsql prepare

While we were loading the data, we wanted to see if we could speed-up the process. Here’s the customized PostgreSQL settings we used, some of them directly targeted to accelerate the data load:

shared_buffers = 192GB
maintenance_work_mem = '20GB'
wal_level = 'minimal'
autovacuum = 'OFF'
wal_compression = 'ON'
max_wal_size = '20GB'
checkpoint_timeout = '1h'
checkpoint_completion_target = '0.9'
random_page_cost = 1
max_wal_senders = 0
full_page_writes = ON
synchronous_commit = ON

We’ll discuss most of these parameters in the sections that follow, but we would like to highlight two of them here. We increased maintenance_work_mem to speed  up index creation and max_wal_size to delay checkpointing further, but not too much — this is a write-intensive phase after all. Using these parameters it took us 33 minutes to complete the prepare stage compared with 55 minutes when using the default parameters. 

If you are not concerned about crash recovery or ACID, you could turn off full_page_writes, fsync and synchrnous_commit. That would speed up the data load much more. 

Running a manual VACUUM ANALYZE after sysbench-tpcc’s initial prepare stage

Once we had prepared the database, as it is a newly created DB Instance, we ran a manual VACUUM ANALYZE on the database (in parallel jobs) using the command below. We employed all the 56 vCPUs available in the server since there was nothing else running in the machine:

$ /usr/lib/postgresql/10/bin/vacuumdb -j 56 -d sbtest -z

Having run a vacuum for the entire database we restarted PostgreSQL and cleared the OS cache before executing the benchmark in “run” mode. We repeated this process after each round.

First attempt with sysbench-tpcc

When we ran sysbench-tpcc for the first time, we observed a resulting TPS of 1978.48 for PostgreSQL with the server not properly tuned, running with default settings. We used the following command to run sysbench-tpcc for PostgreSQL for 10 hours (or 36000 seconds) for all rounds:

./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=36000 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --pgsql-password=oracle --db-driver=pgsql run

PostgreSQL performance tuning of parameters for sysbench-tpcc (crash safe)

After getting an initial idea of how PostgreSQL performed with the default settings and the actual demands of the sysbench-tpcc workload, we began making progressive adjustments in the settings, observing how they impacted the server’s performance. After several rounds we came up with the following list of parameters (all of these satisfy ACID properties):

shared_buffers = '192GB'
work_mem = '4MB'
random_page_cost = '1'
maintenance_work_mem = '2GB'
wal_level = 'replica'
max_wal_senders = '3'
synchronous_commit = 'on'
seq_page_cost = '1'
max_wal_size = '100GB'
checkpoint_timeout = '1h'
synchronous_commit = 'on'
checkpoint_completion_target = '0.9'
autovacuum_vacuum_scale_factor = '0.4'
effective_cache_size = '200GB'
min_wal_size = '1GB'
bgwriter_lru_maxpages = '1000'
bgwriter_lru_multiplier = '10.0'
logging_collector = 'ON'
wal_compression = 'ON'
log_checkpoints = 'ON'
archive_mode = 'ON'
full_page_writes = 'ON'
fsync = 'ON'

Let’s discuss our reasoning behind the tuning of the most important settings:

shared_buffers

Defines the amount of memory PostgreSQL uses for shared memory buffers. It’s arguably its most important setting, often compared (for better or worse) to MySQL’s innodb_buffer_pool_size. The biggest difference, if we dare to compare shared_buffers to the Buffer Pool, is that InnoDB bypasses the OS cache to directly access (read and write) data in the underlying storage subsystem whereas PostgreSQL do not.

Does this mean PostgreSQL does “double caching” by first loading data from disk into the OS cache to then make a copy of these pages into the shared_buffers area? Yes.

Does this “double caching” makes PostgreSQL inferior to InnoDB and MySQL in terms of memory management? No. We’ll discuss why that is the case in a follow up blog post. For now it suffice to say the actual performance depends on the workload (mix of reads and writes), the size of the “hot data” (the portion of the dataset that is most accessed and modified) and how often checkpointing takes place.

How we chose the setting for shared_buffers to optimize PostgreSQL performance

Due to these factors, the documented suggested formula of setting shared_buffers to 25% of RAM or the magic number of “8GB” is hardly ideal. What seems to be good reasoning, though, is this:

  • If you can fit the whole of your “hot data” in memory, then dedicating most of your memory to shared_buffers pays off nicely, making PostgreSQL behave as close to an in-memory database as possible.
  • If the size of your “hot data” surpasses the amount of memory you have available in the server, then you’re probably better off working with a much smaller shared_buffers area and relying more on the OS cache.

For this benchmark, considering the options we used, we found that dedicating 75% of all the available memory to shared_buffers is ideal. It is enough to fit the entire “hot data” and still leave sufficient memory for the OS to operate, handle connections and everything else.

work_mem

This setting defines the amount of memory that can be used by each query (not session) for internal sort operations (such as ORDER BY and DISTINCT), and hash tables (such as when doing hash-based aggregation). Beyond this, PostgreSQL moves the data into temporary disk files. The challenge is usually finding a good balance here. We want to avoid the use of temporary disk files, which slow down query completion and in turn may cause contention. But we don’t want to over-commit memory, which could even lead to OOM; working with high values for work_mem may be destructive when it is not really needed.

We analyzed the workload produced by sysbench-tpcc and found with some surprise that work_mem doesn’t play a role here, considering the queries that were executed. So we kept the default value of 4MB. Please note that this is seldom the case in production workloads, so it is important to always keep an eye on that parameter.

random_page_cost

This setting stipulates the cost that a non-sequentially-fetched disk page would have, and directly affects the query planner’s decisions. Going with a conservative value is particularly important when using high latency storage, such as spinning disks. This wasn’t our case, hence we could afford to equalize random_page_cost to seq_page_cost. So, we set this parameter to 1 as well, down from the default value of 4.

wal_level, max_wal_senders and archive_mode

To set up streaming replication wal_level needs to be set to at least “replica” and archive_mode must be enabled. This means the amount of WAL data produced increases significantly compared to when using default settings for these parameters, which in turn impacts IO. However, we considered these with a production environment in mind.

wal_compression

For this workload, we observed total WALs produced of size 3359 GB with wal_compression disabled and 1962 GB with wal_compression. We enabled wal_compression to reduce IO — the amount (and, most importantly, the rate) of WAL files being written to disk — at the expense of some additional CPU cycles. This proved to be very effective in our case as we had a surplus of CPU available.

checkpoint_timeout, checkpoint_completion_target and max_wal_size

We set the checkpoint_timeout to 1 hour and checkpoint_completion_target to 0.9. This means a checkpoint is forced every 1 hour and it has 90% of the time before the next checkpoint to spread the writes. However, a checkpoint is also forced when max_wal_size of WAL’s have been generated. With these parameters for a sysbench-tpcc workload, we saw that there were 3 to 4 checkpoints every 1 hour. This is especially because of the amount of WALs being generated.

In production environments we would always recommend you perform a manual CHECKPOINT before shutting down PostgreSQL in order to allow for a faster restart (recovery) time. In this context, issuing a manual CHECKPOINT took us between 1 and 2 minutes, after which we were able to restart PostgreSQL in just about 4 seconds. Please note that in our testing environment, taking time to restart PostgreSQL was not a concern, so working with this checkpoint rate benefited us. However, if you cannot afford a couple of minutes for crash recovery it is always suggested to force checkpointing to take place more often, even at the cost of some degraded performance.

full_page_writes, fsync and synchronous_commit

We set all of these parameters to ON to satisfy ACID properties.

autovacuum

We enabled autovacuum and other vacuum settings to ensure vacuum is being performed in the backend. We will discuss the importance of maintaining autovacuum enabled in a production environment, as well as the danger of doing otherwise, in a separate post. 

Amount of WAL’s (Transaction Logs) generated after 10 hours of sysbench-tpcc

Before we start to discuss the numbers it is important to highlight that we enabled wal_compression before starting sysbench. As we mentioned above, the amount of WALs generated with wal_compression set to OFF was more than twice the amount of WALs generated when having compression enabled. We observed that enabling wal_compression resulted in an increase in TPS of 21%. No wonder, the production of WALs has an important impact on IO: so much so that it is very common to find PostgreSQL servers with a dedicated storage for WALs only. Thus, it is important to highlight the fact wal_compression may benefit write-intensive workloads by sparing IO at the expense of additional CPU usage.

To find out the total amount of WALs generated after 10 Hours, we took note at the WAL offset from before we started the test and after the test completed:

WAL Offset before starting the sysbench-tpcc ⇒ 2C/860000D0
WAL Offset after 10 hours of sysbench-tpcc   ⇒ 217/14A49C50

and subtracted one from the other using pg_wal_lsn_diff, as follows:

postgres=# SELECT pg_size_pretty(pg_wal_lsn_diff('217/14A49C50','2C/860000D0'));
pg_size_pretty
----------------
1962 GB
(1 row)

1962 GB of WALs is a fairly big amount of transaction logs produced over 10 hours, considering we had enabled wal_compression .

We contemplated making use of a separate disk to store WALs to find out by how much more a dedicated storage for transaction logs would benefit overall performance. However, we wanted to keep using the same hardware Vadim had used for his previous tests, so decided against this.

Crash unsafe parameters

Setting full_page_writes, fsync and synchronous_commit to OFF may speed up the performance but it is always crash unsafe unless we have enough backup in place to consider these needs. For example, if you are using a COW FileSystem with Journaling, you may be fine with full_page_writes set to OFF. This may not be true 100% of the time though.

However, we still want to share the results with the crash unsafe parameters mentioned in the paragraph above as a reference.

Results after 10 Hours of sysbench-tpcc for PostgreSQL with default, crash safe and crash unsafe parameters

Here are the final numbers we obtained after running sysbench-tpcc for 10 hours considering each of the scenarios above:

Parameters

TPS

Default / Untuned

1978.48

Tuned (crash safe)

5736.66

Tuned (crash unsafe)

7881.72

Did we expect to get these numbers? Yes and no.

Certainly we expected a properly tuned server would outperform one running with default settings considerably but we can’t say we expected it to be almost three times better (2.899). With PostgreSQL making use of the OS cache it is not always the case that tuning shared_buffers in particular will make such a dramatic difference. By comparison, tuning MySQL’s InnoDB Buffer Pool almost always makes a difference. For PostgreSQL high performance it depends on the workload. In this case for sysbench-tpcc benchmarks, tuning shared_buffers definitely makes a difference.

On the other hand experiencing an additional order of magnitude faster (4x), when using crash unsafe settings, was not much of a surprise.

Here’s an alternative view of the results of our PostgreSQL insert performance tuning benchmarks:

Sysbench-TPCC with PostgreSQL

What did you think about this experiment? Please let us know in the comments section below and let’s get the conversation going.

Hardware spec

  • Supermicro server:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
    • Memory: 256GB of RAM
    • Storage: SAMSUNG  SM863 1.9TB Enterprise SSD
    • Filesystem: ext4/xfs
  • OS: Ubuntu 16.04.4, kernel 4.13.0-36-generic
  • PostgreSQL: version 10.3
  • sysbench-tpcc: https://github.com/Percona-Lab/sysbench-tpcc

The post Tuning PostgreSQL for sysbench-tpcc appeared first on Percona Database Performance Blog.

by Avinash Vallarapu at June 15, 2018 12:30 PM

June 14, 2018

Peter Zaitsev

Percona Monitoring and Management: Look After Your pmm-data Container

looking after pmm-datamcontainers

looking after pmm-datamcontainersIf you have already deployed PMM server using Docker you might be aware that we begin by creating a special container for persistent PMM data. In this post, I aim to explain the importance of pmm-data container when you deploy PMM server with Docker. By the end of this post, you will have a fair idea of why this Docker container is needed.

Percona Monitoring and Management (PMM) is a free and open-source solution for database troubleshooting and performance optimization that you can run in your own environment. It provides time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

What is the purpose of pmm-data?

Well, as simple as its name suggests, when PMM Server runs via Docker its data is stored in the pmm-data container. It’s a dedicated data only container which you create with bind mounts using -v i.e data volumes for holding persistent PMM data. We use pmm-data to compartmentalize the persistent data so you can more easily backup up and move data consistently across instances or containers. It acts as a single access point from which other running containers (in this case pmm-server) can access data volumes.

pmm-data container does not run, but data from the container is used by pmm-server to build graphs. PMM Server is the core of PMM that aggregates collected data and presents it in the form of tables, dashboards, and graphs in a web interface.

Why do we use docker create ?

The

docker create
  command instructs the Docker daemon to create a writable container layer over the docker image. When you execute
docker create
  using the steps shown, it will create a Docker container named pmm-data and initialize data volumes using the -v flag in conjunction with the create command. (e.g. /opt/prometheus/data).

Option -v is used multiple times in current versions of PMM to mount multiple data volumes. This allows you to create the data volume containers, and then use them from another container i.e pmm-server. We do not want to run the pmm-data container, but only to create it. nb: the number of data volumes bind mounted may change with versions of PMM

$ docker create \
   -v /opt/prometheus/data
   -v /opt/consul-data \
   -v /var/lib/mysql \
   -v /var/lib/grafana \
   --name pmm-data \
   percona/pmm-server:latest /bin/true

Make sure that the data volumes you initialize with the -v option match those given in the example. PMM Server expects you to have bind mounted those directories exactly as demonstrated in the deployment steps. For using different mount points for PMM deployment, please refer to this blog post. Data volumes are very useful as once designated and created you can share them and be include them as part of other containers. If you use -v or –volume to bind-mount a file or directory that does not yet exist on the Docker host, -v creates the endpoint for you. It is always created as a directory. Data in the pmm-data volume are actually hosted on the host’s filesystem.

Why does pmm-data not run ?

As we used

docker create
  container and not
docker run
  for pmm-data, this container does not run. It simply exists to make sure you retain all PMM data when you upgrade to a newer PMM Server image. Data volumes bind mounted on pmm-data container are shared to the running pmm-server container as the
--volumes-from
  option is used for pmm-server launch. Here we persisted data using Docker without binding it to the pmm-server by storing files in the host machine. As long as pmm-data exists, the data exists.

You can stop, destroy, or replace a container. When a non-running container is using a volume, the volume is still available to Docker and is not removed automatically. You can easily replace the pmm-server of the running container by a newer version without any impact or loss of data. For that reason, because of the need to store persistent data, we do it in a data volume. In our case, pmm-data container does not write to the same volumes as it could cause possible corruption.

Why can’t I remove pmm-data container ? What happens if I delete it ?

Removing pmm-data container results in the loss of collected metrics data.

If you remove containers that mount volumes, including the initial pmm-server container, or any subsequent containers mounted, such as pmm-server-2, you do not delete the volumes. This allows you to upgrade — or effectively migrate — data volumes between containers. Your data container might be based on an old version of container, with known security problems. It is not a big problem since it doesn’t actually run anything, but it doesn’t feel right.

As noted earlier, pmm-data stores metrics data as per the retention. You should not remove or recreate pmm-data container unless you need to wipe out all PMM data and start again. To delete the volume from disk, you must explicitly call docker rm -v against the container with a reference to the volume.

Some do’s and don’ts

  • Allocate enough disk space on the host for pmm-data to retain data.
    By default, Prometheus stores time-series data for 30 days, and QAN stores query data for 8 days.
  • Manage data retention appropriately as per your disk space available.
    You can take backup of pmm-data by extracting data from container to avoid data-loss in any situation by using steps mentioned here.

In case of any issues with metrics, here’s a good blog post regarding troubleshooting.

The post Percona Monitoring and Management: Look After Your pmm-data Container appeared first on Percona Database Performance Blog.

by Siddhant Sawant at June 14, 2018 12:58 PM

What is the Top Cause of Application Downtime Today?

Application outages lurking monster

Application outages lurking monsterI frequently talk to our customer base about what keeps them up at night. While there is a large variance of answers, they tend to fall into one of two categories. The first is the conditioned fear of some monster lurking behind the scenes that could pounce at any time. The second, of course, is the actual monster of downtime on a critical system. Ask most tech folks and they will tell you outages seem to only happen late at night or early in the morning. And that they do keep them up.

Entire companies and product lines have been built around providing those in the IT world with some ability to sleep at night. Modern enterprises have spent millions to mitigate the risk and prevent their businesses from having a really bad day because of an outage. Cloud providers are attuned to the downtime dilemma and spend lots of time, money, and effort to build in redundancy and make “High Availability” (HA) as easy as possible. The frequency of “hardware” or server issues continues to dwindle.

Where does the downtime issue start?

In my discussions, most companies I have talked to say their number one cause of outages and customer interruptions is ultimately related to the deployment of new or upgraded code. Often I hear the operations team has little or no involvement with an application until it’s put into production. It is a bit ironic that this is also the area where companies tend to drastically under-invest. They opt instead to invest in ways to “Scale Out or Up”. Or perhaps how to survive asteroids hitting two out three of their data centers.

Failing over broken or slow code from one server to another does not fix it. Adding more servers to distribute the load can mitigate a problem, but can also escalate the cost dramatically. In most cases, the solutions they apply don’t address the primary cause of the problems.

While there are some fantastic tools out there that can help with getting better visibility into code level issues — such as New Relic, AppDynamics and others — the real problem is that these often end up being used to diagnose issues after they have appeared in production. Most companies carry out some amount of testing before releasing code, but typically it is a fraction of what they should be doing. Working for a company that specializes in open source databases, we get a lot of calls on issues that have prevented companies’ end users from using critical applications. Many of these problems are fixable before they cost a loss of revenue and reputation.

I think it’s time technology companies start to rethink our QA, Testing, and Pre-Deployment requirements. How much time, effort, and money can we save if we catch these “monsters” before they make it into production?

Not to mention how much better our operations team will sleep . . .

The post What is the Top Cause of Application Downtime Today? appeared first on Percona Database Performance Blog.

by Matt Yonkovit at June 14, 2018 09:43 AM

June 13, 2018

Peter Zaitsev

Zone Based Sharding in MongoDB

MongoDB shard zones

MongoDB shard zonesIn this blog post, we will discuss about how to use zone based sharding to deploy a sharded MongoDB cluster in a customized manner so that the queries and data will be redirected per geographical groupings. This feature of MongoDB is a part of its Data Center Awareness, that allows queries to be routed to particular MongoDB deployments considering physical locations or configurations of mongod instances.

Before moving on, let’s have an overview of this feature. You might already have some questions about zone based sharding. Was it recently introduced? If zone-based sharding is something we should use, then what about tag-aware sharding?

MongoDB supported tag-aware sharding from even the initial versions of MongoDB. This means tagging a range of shard keys values, associating that range with a shard, and redirecting operations to that specific tagged shard. This tag-aware sharding, since version 3.4, is referred to as ZONES. So, the only change is its name, and this is the reason sh.addShardTag(shard, tag) method is being used.

How it works

  1. With the help of a shard key, MongoDB allows you to create zones of sharded data – also known as shard zones.
  2. Each zone can be associated with one or more shards.
  3. Similarly, a shard can associate with any number of non-conflicting zones.
  4. MongoDB migrates chunks to the zone range in the selected shards.
  5. MongoDB routes read and write to a particular zone range that resides in particular shards.

Useful for what kind of deployments/applications?

  1. In cases where data needs to be routed to a particular shard due to some hardware configuration restrictions.
  2. Zones can be useful if there is the need to isolate specific data to a particular shard. For example, in the case of GDPR compliance that requires businesses to protect data and privacy for an individual within the EU.
  3. If an application is being used geographically and you want a query to route to the nearest shards for both reads and writes.

Let’s consider a Scenario

Consider the scenario of a school where students are experts in Biology, but most students are experts in Maths. So we have more data for the maths students compare to Biology students. In this example, deployment requires that Maths students data should route to the shard with the better configuration for a large amount of data. Both read and write will be served by specific shards.  All the Biology students will be served by another shard. To implement this, we will add a tag to deploy the zones to the shards.

For this scenario we have an environment with:

DB: “school”

Collection: “students”

Fields: “sId”, “subject”, “marks” and so on..

Indexed Fields: “subject” and “sId”

We enable sharding:

sh.enableSharding("school")

And create a shardkey: “subject” and “sId” 

sh.shardCollection("school.students", {subject: 1, sId: 1});

We have two shards in our test environment

shards:

{  "_id" : "shard0000",  "host" : "127.0.0.1:27001",  "state" : 1 }
{  "_id" : "shard0001",  "host" : "127.0.0.1:27002",  "state" : 1 }

Zone Deployment

1) Disable balancer

To prevent migration of the chunks across the cluster, disable the balancer for the “students” collection:

mongos> sh.disableBalancing("school.students")
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

Before proceeding further make sure the balancer is not running. It is not a mandatory process, but it is always a good practice to make sure no migration of chunks takes place while configuring zones

mongos> sh.isBalancerRunning()
false

2) Add shard to the zone

A zone can be associated with a particular shard in the form of a tag, using the sh.addShardTag(), so a tag will be added to each shard. Here we are considering two zones so the tags “MATHS” and “BIOLOGY” need to be added.

mongos> sh.addShardTag( "shard0000" , "MATHS");
{ "ok" : 1 }
mongos> sh.addShardTag( "shard0001" , "BIOLOGY");
{ "ok" : 1 }

We can see zones are assigned in the form of tags as required against each shard.

mongos> sh.status()
 shards:
        {  "_id" : "shard0000",  "host" : "127.0.0.1:27001",  "state" : 1,  "tags" : [ "MATHS" ] }
        {  "_id" : "shard0001",  "host" : "127.0.0.1:27002",  "state" : 1,  "tags" : [ "BIOLOGY" ] }

3) Define ranges for each zone

Each zone covers one or more ranges of shard key values. Note: each range a zone covers is always inclusive of its lower boundary and exclusive of its upper boundary.

mongos> sh.addTagRange(
	"school.students",
	{ "subject" : "maths", "sId" : MinKey},
	{ "subject" : "maths", "sId" : MaxKey},
	"MATHS"
)
{ "ok" : 1 }
mongos> sh.addTagRange(
	"school.students",
	{ "subject" : "biology", "sId" : MinKey},
	{ "subject" : "biology", "sId" : MaxKey},
"BIOLOGY"
)
{ "ok" : 1 }

4) Enable balancer

Now enable the balancer so the chunks will migrate across the shards as per the requirement and all the read and write queries will be routed to the particular shards.

mongos> sh.enableBalancing("school.students")
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
mongos> sh.isBalancerRunning()
true

Let’s check how documents get routed as per the tags:

We have inserted 6 documents, 4 documents with “subject”:”maths” and 2 documents with “subject”:”biology”

mongos> db.students.find({"subject":"maths"}).count()
4
mongos> db.students.find({"subject":"biology"}).count()
2

Checking the shard distribution for the students collection:

mongos> db.students.getShardDistribution()
Shard shard0000 at 127.0.0.1:27003
data : 236B docs : 4 chunks : 4
estimated data per chunk : 59B
estimated docs per chunk : 1
Shard shard0001 at 127.0.0.1:27004
data : 122B docs : 2 chunks : 1
estimated data per chunk : 122B
estimated docs per chunk : 2

So in this test case, all the queries for the students collection have routed as per the tag used, with four documents inserted into shard0000 and two documents inserted to shard0001.

Any queries related to MATHS will route to shard0000 and queries related to BIOLOGY will route to shard0001, hence the load will be distributed as per the configuration of the shard, keeping the database performance optimized.

Sharding MongoDB using zones is a great feature provided by MongoDB. With the help of zones, data can be isolated to the specific shards. Or if we have any kind of hardware or configuration restrictions to the shards, it is a possible solution for routing the operations.

The post Zone Based Sharding in MongoDB appeared first on Percona Database Performance Blog.

by Aayushi Mangal at June 13, 2018 01:58 PM

Jean-Jerome Schmidt

ChatOps - Managing MySQL, MongoDB & PostgreSQL from Slack

What is ChatOps?

Nowadays, we make use of multiple communication channels to manage or receive information from our systems, such as email, chat and applications among others. If we could centralize this in one or just a few different possible applications, and even better, if we could integrate it with tools that we currently use in our organization, we would be able to automate processes, improve our work dynamics and communication, having a clearer picture of the current state of our system. In many companies, Slack or other collaboration tools is becoming the centre and the heart of the development and ops teams.

What is ChatBot?

A chatbot is a program that simulates a conversation, receiving entries made by the user and returns answers based on its programming.

Some products have been developed with this technology, that allow us to perform administrative tasks, or keeps the team up to date on the current status of the systems.

This allows, among other things, to integrate the communication tools we use daily, with our systems.

CCBot - ClusterControl

CCBot is a chatbot that uses the ClusterControl APIs to manage and monitor your database clusters. You will be able to deploy new clusters or replication setups, keep your team up to date on the status of the databases as well as the status of any administrative jobs (e.g., backups or rolling upgrades). You can also restart failed nodes, add new ones, promote a slave to master, add load balancers, and so on. CCBot supports most of the major chat services like Slack, Flowdock and Hipchat.

CCBot is integrated with the s9s command line, so you have several commands to use with this tool.

ClusterControl Notifications via Slack

Note that you can use Slack to handle alarms and notifications from ClusterControl. Why? A chat room is a good place to discuss incidents. Seeing an actual alarm in a Slack channel makes it easy to discuss it with the team, because all team members actually know what is being discussed and can chime in.

The main difference between CCBot and the integration of notifications via Slack is that, with CCBot, the user initiates the communication via a specific command, generating a response from the system. For notifications, ClusterControl generates an event, for example, a message about a node failure. This event is then sent to the tool that we have integrated for our notifications, for example, Slack.

You can review this post on how to configure ClusterControl in order to send notifications to Slack.

After this, we can see ClusterControl notifications in our Slack:

ClusterControl Slack Integration
ClusterControl Slack Integration

CCBot Installation

To install CCBot, once we have installed ClusterControl, we must execute the following script:

$ /var/www/html/clustercontrol/app/tools/install-ccbot.sh

We select which adapter we want to use, in this blog, we will select Slack.

-- Supported Hubot Adapters --
1. slack
2. hipchat
3. flowdock
Select the hubot adapter to install [1-3]: 1

It will then ask us for some information, such as an email, a description, the name we will give to our bot, the port, the API token and the channel to which we want to add it.

? Owner (User <user@example.com>)
? Description (A simple helpful robot for your Company)
Enter your bot's name (ccbot):
Enter hubot's http events listening port (8081):
Enter your slack API token:
Enter your slack message room (general):

To obtain the API token, we must go to our Slack -> Apps (On the left side of our Slack window), we look for Hubot and select Install.

CCBot Hubot
CCBot Hubot

We enter the Username, which must match our bot name.

In the next window, we can see the API token to use.

CCBot API Token
CCBot API Token
Enter your slack API token: xoxb-111111111111-XXXXXXXXXXXXXXXXXXXXXXXX
CCBot installation completed!

Finally, to be able to use all the s9s command line functions with CCBot, we must create a user from ClusterControl:

$ s9s user --create --cmon-user=cmon --group=admins  --controller="https://localhost:9501" --generate-key cmon

For further information about how to manage users, please check the official documentation.

We can now use our CCBot from Slack.

Here we have some examples of commands:

$ s9s --help
CCBot Help
CCBot Help

With this command we can see the help for the s9s CLI.

$ s9s cluster --list --long
CCBot Cluster List
CCBot Cluster List

With this command we can see a list of our clusters.

$ s9s cluster --cluster-id=17 --stat
CCBot Cluster Stat
CCBot Cluster Stat

With this command we can see the stats of one cluster, in this case cluster id 17.

$ s9s node --list --long
CCBot Node List
CCBot Node List

With this command we can see a list of our nodes.

$ s9s job --list
CCBot Job List
CCBot Job List

With this command we can see a list of our jobs.

$ s9s backup --create --backup-method=mysqldump --cluster-id=16 --nodes=192.168.100.34:3306 --backup-directory=/backup
CCBot Backup
CCBot Backup

With this command we can create a backup with mysqldump, in the node 192.168.100.34. The backup will be saved in the /backup directory.

ClusterControl
Single Console for Your Entire Database Infrastructure
Find out what else is new in ClusterControl

Now let's see some more complex examples:

$ s9s cluster --create --cluster-type=mysqlreplication --nodes="mysql1;mysql2" --vendor="percona" --provider-version="5.7" --template="my.cnf.repl57" --db-admin="root" --db-admin-passwd="root123" --os-user="root" --cluster-name="MySQL1"
CCBot Create Replication
CCBot Create Replication

With this command we can create a MySQL Master-Slave Replication with Percona for MySQL 5.7 version.

CCBot Check Replication Created
CCBot Check Replication Created

And we can check this new cluster.

In ClusterControl Topology View, we can check our current topology with one master and one slave node.

Topology View Replication 1
Topology View Replication 1
$ s9s cluster --add-node --nodes=mysql3 --cluster-id=24
CCBot Add Node
CCBot Add Node

With this command we can add a new slave in our current cluster.

Topology View Replication 2
Topology View Replication 2

And we can check our new topology in ClusterControl Topology View.

$ s9s cluster --add-node --cluster-id=24 --nodes="proxysql://proxysql"
CCBot Add ProxySQL
CCBot Add ProxySQL

With this command we can add a new ProxySQL node named "proxysql" in our current cluster.

Topology View Replication 3
Topology View Replication 3

And we can check our new topology in ClusterControl Topology View.

You can check the list of available commands in the documentation.
If we try to use CCBot from a Slack channel, we must add "@ccbot_name" at the beginning of our command:

@ccbot s9s backup --create --backup-method=xtrabackupfull --cluster-id=1 --nodes=10.0.0.5:3306 --backup-directory=/storage/backups

CCBot makes it easier for teams to manage their clusters in a collaborative way. It is fully integrated with the tools they use on a daily basis.

Note

If we have the following error when wanting to run the CCBot installer in our ClusterControl:

-bash: yo: command not found

We must update the version of nodejs package.

Conclusion

As we said previously, there are several ChatBot alternatives for different purposes, we can even create our own ChatBot, but as this technology facilitates our tasks and has several advantages that we mentioned at the beginning of this blog, not everything that shines is gold.

There is a very important detail to keep in mind - security. We must be very careful when using them, and take all the necessary precautions to know what we allow to do, in what way, at what moment, to whom and from where.

by Sebastian Insausti at June 13, 2018 01:04 PM

Peter Zaitsev

Webinar Thurs 6/14: MongoDB Backup and Recovery Field Guide

mongodb backup and recovery field guide

mongodb backup and recovery field guidePlease join Percona’s Sr. Technical Operations Architect, Tim Vaillancourt as he presents MongoDB Backup and Recovery Field Guide on Thursday, June 14, 2018, at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

This talk will cover backup and recovery solutions for MongoDB replica sets and clusters, focusing on online and low-impact solutions for production systems.

Register for the webinar

Tim Vaillancourt

Senior Technical Operations Architect

With experience operating infrastructures in industries such as government, online marketing/publishing, SaaS and gaming combined with experience tuning systems from the hard disk all the way up to the end-user, Tim has spent time in nearly every area of the modern IT stack with many lessons learned.

Tim is based in Amsterdam, NL and enjoys traveling, coding and music. Prior to Percona Tim was the Lead MySQL DBA of Electronic Arts’ DICE studios, helping some of the largest games in the world (“Battlefield” series, “Mirrors Edge” series, “Star Wars: Battlefront”) launch and operate smoothly while also leading the automation of MongoDB deployments for EA systems. Before the role of DBA at EA’s DICE studio, Tim served as a subject matter expert in NoSQL databases, queues and search on the Online Operations team at EA SPORTS.

Prior to moving to the gaming industry, Tim served as a Database/Systems Admin operating a large MySQL-based SaaS infrastructure at AbeBooks/Amazon Inc.

The post Webinar Thurs 6/14: MongoDB Backup and Recovery Field Guide appeared first on Percona Database Performance Blog.

by Tim Vaillancourt at June 13, 2018 08:04 AM

June 12, 2018

Oli Sennhauser

Select Hello World FromDual with MariaDB PL/SQL

MariaDB 10.3 was released GA a few weeks ago. One of the features which interests me most is the MariaDB Oracle PL/SQL compatibility mode.

So its time to try it out now...

Enabling Oracle PL/SQL in MariaDB

Oracle PL/SQL syntax is quite different from old MySQL/MariaDB SQL/PSM syntax. So the old MariaDB parser would through some errors without modification. The activation of the modification of the MariaDB PL/SQL parser is achieved by changing the sql_mode as follows:

mariadb> SET SESSION sql_mode=ORACLE;

or you can make this setting persistent in your my.cnf MariaDB configuration file:

[mysqld]

sql_mode = ORACLE

To verify if the sql_mode is already set you can use the following statement:

mariadb> pager grep --color -i oracle
PAGER set to 'grep --color -i oracle'
mariadb> SELECT @@sql_mode;
| PIPES_AS_CONCAT,ANSI_QUOTES,IGNORE_SPACE,ORACLE,NO_KEY_OPTIONS,NO_TABLE_OPTIONS,NO_FIELD_OPTIONS,NO_AUTO_CREATE_USER,SIMULTANEOUS_ASSIGNMENT |
mariadb> nopager

Nomen est omen

First of all I tried the function of the basic and fundamental table in Oracle, the DUAL table:

mariadb> SELECT * FROM dual;
ERROR 1096 (HY000): No tables used

Sad. :-( But this query on the dual table seems to work:

mariadb> SELECT 'Hello World!' FROM dual;
+--------------+
| Hello World! |
+--------------+
| Hello World! |
+--------------+

The second result looks much better. The first query should work as well but does not. We opened a bug at MariaDB without much hope that this bug will be fixed soon...

To get more info why MariaDB behaves like this I tried to investigate a bit more:

mariadb> SELECT table_schema, table_name
  FROM information_schema.tables
 WHERE table_name = 'dual';
Empty set (0.001 sec)

Hmmm. It seems to be implemented not as a real table... But normal usage of this table seems to work:

mariadb> SELECT CURRENT_TIMESTAMP() FROM dual;
+---------------------+
| current_timestamp() |
+---------------------+
| 2018-06-07 15:32:11 |
+---------------------+

If you rely heavily in your code on the dual table you can create it yourself. It is defined as follows:

"The DUAL table has one column, DUMMY, defined to be VARCHAR2(1), and contains one row with a value X."

If you want to create the dual table yourself here is the statement:

mariadb> CREATE TABLE `DUAL` (DUMMY VARCHAR2(1));
mariadb> INSERT INTO `DUAL` (DUMMY) VALUES ('X');

Anonymous PL/SQL block in MariaDB

To try some PL/SQL features out or to run a sequence of PL/SQL commands you can use anonymous blocks. Unfortunately MySQL SQL/PSM style delimiter seems still to be necessary.

It is recommended to use the DELIMITER /, then most of the Oracle examples will work straight out of the box...

DELIMITER /

BEGIN
  SELECT 'Hello world from MariaDB anonymous PL/SQL block!';
END;
/

DELIMITER ;

+--------------------------------------------------+
| Hello world from MariaDB anonymous PL/SQL block! |
+--------------------------------------------------+
| Hello world from MariaDB anonymous PL/SQL block! |
+--------------------------------------------------+

A simple PL/SQL style MariaDB Procedure

DELIMITER /

CREATE OR REPLACE PROCEDURE hello AS
BEGIN
  DECLARE
    vString VARCHAR2(255) := NULL;
  BEGIN
    SELECT 'Hello world from MariaDB PL/SQL Procedure!' INTO vString FROM dual;
    SELECT vString;
  END;
END hello;
/

BEGIN
  hello();
END;
/

DELIMITER ;

A simple PL/SQL style MariaDB Function

DELIMITER /

CREATE OR REPLACE FUNCTION hello RETURN VARCHAR2 DETERMINISTIC AS
BEGIN
  DECLARE
    vString VARCHAR2(255) := NULL;
  BEGIN
    SELECT 'Hello world from MariaDB PL/SQL Function!' INTO vString FROM dual;
    RETURN vString;
  END;
END hello;
/

DECLARE
  vString VARCHAR(255) := NULL;
BEGIN
  vString := hello();
  SELECT vString;
END;
/

DELIMITER ;

An PL/SQL package in MariaDB

Up to here there is nothing really new, just slightly different. But now let us try a PL/SQL package in MariaDB:

DELIMITER /

CREATE OR REPLACE PACKAGE hello AS
  -- must be delared as public!
  PROCEDURE helloWorldProcedure(pString VARCHAR2);
  FUNCTION helloWorldFunction(pString VARCHAR2) RETURN VARCHAR2;
END hello;
/

CREATE OR REPLACE PACKAGE BODY hello AS

  vString VARCHAR2(255) := NULL;

  -- was declared public in PACKAGE
  PROCEDURE helloWorldProcedure(pString VARCHAR2) AS
  BEGIN
    SELECT 'Hello world from MariaDB Package Procedure in ' || pString || '!' INTO vString FROM dual;
    SELECT vString;
  END;

  -- was declared public in PACKAGE
  FUNCTION helloWorldFunction(pString VARCHAR2) RETURN VARCHAR2 AS
  BEGIN
    SELECT 'Hello world from MariaDB Package Function in ' || pString || '!' INTO vString FROM dual;
    return vString;
  END;
BEGIN
  SELECT 'Package initialiser, called only once per connection!';
END hello;
/

DECLARE
  vString VARCHAR2(255) := NULL;
  -- CONSTANT seems to be not supported yet by MariaDB
  -- cString CONSTANT VARCHAR2(255) := 'anonymous block';
  cString VARCHAR2(255) := 'anonymous block';
BEGIN
  CALL hello.helloWorldProcedure(cString);
  SELECT hello.helloWorldFunction(cString) INTO vString;
  SELECT vString;
END;
/

DELIMITER ;

DBMS_OUTPUT package for MariaDB

An Oracle database contains over 200 PL/SQL packages. One of the most common one is the DBMS_OUTPUT package. In this package we can find the Procedure PUT_LINE.

This package/function has not been implemented yet by MariaDB so far. So we have to do it ourself:

DELIMITER /

CREATE OR REPLACE PACKAGE DBMS_OUTPUT AS
  PROCEDURE PUT_LINE(pString IN VARCHAR2);
END DBMS_OUTPUT;
/

CREATE OR REPLACE PACKAGE BODY DBMS_OUTPUT AS

  PROCEDURE PUT_LINE(pString IN VARCHAR2) AS
  BEGIN
    SELECT pString;
  END;
END DBMS_OUTPUT;
/

BEGIN
  DBMS_OUTPUT.PUT_LINE('Hello world from MariaDB DBMS_OUTPUT.PUT_LINE!');
END;
/

DELIMITER ;

The other Functions and Procedures have to be implemented later over time...

Now we can try to do all examples from Oracle sources!

by Shinguz at June 12, 2018 09:36 PM

Peter Zaitsev

Webinar Weds 6/13: Performance Analysis and Troubleshooting Methodologies for Databases

troubleshooting methodologies

troubleshooting methodologiesPlease join Percona’s CEO, Peter Zaitsev as he presents Performance Analysis and Troubleshooting Methodologies for Databases on Wednesday, June 13th, 2018 at 11:00 AM PDT (UTC-7) / 2:00 PM EDT (UTC-4).

 

Have you heard about the USE Method (Utilization – Saturation – Errors)? RED (Rate – Errors – Duration), or Golden Signals (Latency – Traffic – Errors – Saturations)?

In this presentation, we will talk briefly about these different-but-similar “focuses”. We’ll discuss how we can apply them to data infrastructure performance analysis, troubleshooting, and monitoring.

We will use MySQL as an example, but most of this talk applies to other database technologies too.

Register for the webinar

About Peter Zaitsev, CEO

Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With over 140 professionals in 30 plus countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of internet giants, large enterprises and many exciting startups. Percona was named to the Inc. 5000 in 2013, 2014, 2015 and 2016.

Peter was an early employee at MySQL AB, eventually leading the company’s High Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Database Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his recent ebook Practical MySQL Performance Optimization Volume 1 is one of percona.com’s most popular downloads. Peter lives in North Carolina with his wife and two children. In his spare time, Peter enjoys travel and spending time outdoors.

The post Webinar Weds 6/13: Performance Analysis and Troubleshooting Methodologies for Databases appeared first on Percona Database Performance Blog.

by Peter Zaitsev at June 12, 2018 11:48 AM

PXC loves firewalls (and System Admins loves iptables)

PXC and setting firewalls using iptables

PXC and setting firewalls using iptablesLet them stay together.

In the last YEARS, I have seen quite often that users, when installing a product such as PXC, instead of spending five minutes to understand what to do just run

iptables -F
  and save.

In short, they remove any rules for their firewall.

With this post, I want to show you how easy it can be to do the right thing instead of putting your server at risk. I’ll show you how a slightly more complex setup like PXC (compared to MySQL), can be easily achieved without risky shortcuts.

iptables is the utility used to manage the chains of rules used by the Linux kernel firewall, which is your basic security tool.
Linux comes with a wonderful firewall built into the kernel. As an administrator, you can configure this firewall with interfaces like ipchains  — which we are not going to cover — and iptables, which we shall talk about.

iptables is stateful, which means that the firewall can make decisions based on received packets. This means that I can, for instance, DROP a packet if it’s coming from bad-guy.com.

I can also create a set of rules that either will allow or reject the package, or that will redirect it to another rule. This potentially can create a very complex scenario.

However, for today and for this use case let’s keep it simple…  Looking at my own server:

iptables -v -L
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
 250K   29M ACCEPT     all  --  any    any     anywhere             anywhere             state RELATED,ESTABLISHED
    6   404 ACCEPT     icmp --  any    any     anywhere             anywhere
    0     0 ACCEPT     all  --  lo     any     anywhere             anywhere
    9   428 ACCEPT     tcp  --  any    any     anywhere             anywhere             state NEW tcp dpt:ssh
    0     0 ACCEPT     tcp  --  any    any     anywhere             anywhere             state NEW tcp dpt:mysql
    0     0 ACCEPT     tcp  --  any    any     anywhere             anywhere
  210 13986 REJECT     all  --  any    any     anywhere             anywhere             reject-with icmp-host-prohibited
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 REJECT     all  --  any    any     anywhere             anywhere             reject-with icmp-host-prohibited
Chain OUTPUT (policy ACCEPT 241K packets, 29M bytes)
 pkts bytes target     prot opt in     out     source               destination

That’s not too bad, my server is currently accepting only SSH and packets on port 3306. Please note that I used the -v option to see more information like IN/OUT and  that allows me to identify that actually row #3 is related to my loopback device, and as such it’s good to have it open.

The point is that if I try to run the PXC cluster with these settings it will fail, because the nodes will not be able to see each other.

A quite simple example when try to start the second node of the cluster:

2018-05-21T17:56:14.383686Z 0 [Note] WSREP: (3cb4b3a6, 'tcp://10.0.0.21:4567') connection to peer 584762e6 with addr tcp://10.0.0.23:4567 timed out, no messages seen in PT3S

Starting a new node will fail, given that the connectivity will not be established correctly. In the Percona documentation there is a notes section in which we mention that these ports must be open to have the cluster working correctly.:

  • 3306 For MySQL client connections and State Snapshot Transfer that use the mysqldump method.
  • 4567 For Galera Cluster replication traffic, multicast replication uses both UDP transport and TCP on this port.
  • 4568 For Incremental State Transfer.
  • 4444 For all other State Snapshot Transfer.

Of course, if you don’t know how to do it that could be a problem, but it is quite simple. Just use the following commands to add the needed rules:

iptables -I INPUT 2 --protocol tcp --match tcp --dport 3306 --source 10.0.0.1/24 --jump ACCEPT
iptables -I INPUT 3 --protocol tcp --match tcp --dport 4567 --source 10.0.0.1/24 --jump ACCEPT
iptables -I INPUT 4 --protocol tcp --match tcp --dport 4568 --source 10.0.0.1/24 --jump ACCEPT
iptables -I INPUT 5 --protocol tcp --match tcp --dport 4444 --source 10.0.0.1/24 --jump ACCEPT
iptables -I INPUT 6 --protocol udp --match udp --dport 4567 --source 10.0.0.1/24 --jump ACCEPT

Once you have done this check the layout again and you should have something like this:

[root@galera1h1n5 gal571]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
ACCEPT tcp -- 10.0.0.0/24 anywhere tcp dpt:mysql
ACCEPT tcp -- 10.0.0.0/24 anywhere tcp dpt:tram
ACCEPT tcp -- 10.0.0.0/24 anywhere tcp dpt:bmc-reporting
ACCEPT tcp -- 10.0.0.0/24 anywhere tcp dpt:krb524
ACCEPT udp -- 10.0.0.0/24 anywhere udp dpt:tram
ACCEPT icmp -- anywhere anywhere
ACCEPT tcp -- anywhere anywhere tcp dpt:ssh
ACCEPT tcp -- anywhere anywhere tcp dpt:mysql
REJECT all -- anywhere anywhere reject-with icmp-port-unreachable
Chain FORWARD (policy ACCEPT)
target prot opt source destination
REJECT all -- anywhere anywhere reject-with icmp-port-unreachable
Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Try to start the secondary node, and — tadaaa — the node will connect, will provision itself, and finally will start correctly.

All good? Well not really, you still need to perform a final step. We need to make our server accessible also for PMM monitoring agents.

You have PMM right? If you don’t take a look here and you will want it. 😀

Anyhow PMM will not work correctly with the rules I have, and the result will be an empty set of graphs when accessing the server statistics. Luckily, PMM has a very easy way to help you identify the issue:

[root@galera1h1n5 gal571]# pmm-admin check-network
PMM Network Status
Server Address | 192.168.1.52
Client Address | 192.168.1.205
* System Time
NTP Server (0.pool.ntp.org) | 2018-05-24 08:05:37 -0400 EDT
PMM Server | 2018-05-24 12:05:34 +0000 GMT
PMM Client | 2018-05-24 08:05:37 -0400 EDT
PMM Server Time Drift | OK
PMM Client Time Drift | OK
PMM Client to PMM Server Time Drift | OK
* Connection: Client --> Server
-------------------- -------
SERVER SERVICE STATUS
-------------------- -------
Consul API OK
Prometheus API OK
Query Analytics API OK
Connection duration | 1.051724ms
Request duration | 311.924µs
Full round trip | 1.363648ms
* Connection: Client <-- Server
-------------- ------------ -------------------- ------- ---------- ---------
SERVICE TYPE NAME REMOTE ENDPOINT STATUS HTTPS/TLS PASSWORD
-------------- ------------ -------------------- ------- ---------- ---------
linux:metrics galera1h1n5 192.168.1.205:42000 DOWN NO NO
mysql:metrics gal571 192.168.1.205:42002 DOWN NO NO
When an endpoint is down it may indicate that the corresponding service is stopped (run 'pmm-admin list' to verify).
If it's running, check out the logs /var/log/pmm-*.log
When all endpoints are down but 'pmm-admin list' shows they are up and no errors in the logs,
check the firewall settings whether this system allows incoming connections from server to address:port in question.
Also you can check the endpoint status by the URL: http://192.168.1.52/prometheus/targets

What you want more? You have all the information to debug and build your new rules. I just need to open the ports 42000 42002 on my firewall:

iptables -I INPUT 7 --protocol tcp --match tcp --dport 42000 --source 192.168.1.1/24 --jump ACCEPT
iptables -I INPUT 8 --protocol tcp --match tcp --dport 42002 --source 192.168.1.1/24 --jump ACCEPT

Please note that we are handling the connectivity for PMM using a different range of IPs/subnet. This because it is best practice to have PXC nodes communicate to a dedicated network/subnet (physical and logical).

Run the test again:

* Connection: Client <-- Server
-------------- ------------ -------------------- ------- ---------- ---------
SERVICE TYPE NAME REMOTE ENDPOINT STATUS HTTPS/TLS PASSWORD
-------------- ------------ -------------------- ------- ---------- ---------
linux:metrics galera1h1n5 192.168.1.205:42000 OK YES YES
mysql:metrics gal571 192.168.1.205:42002 OK YES YES

Done …  I just repeat this on all my nodes and I will have set my firewall to handle the PXC related security.

Now that all my settings are working well I can save my firewall’s rules:

iptables-save > /etc/sysconfig/iptables

For Ubuntu you may need some additional steps as for (https://help.ubuntu.com/community/IptablesHowTo#Using_iptables-save.2Frestore_to_test_rules)

There are some nice tools to help you even more, if you are very lazy, like UFW and the graphical one, GUFW. Developed to ease iptables firewall configuration, ufw provides a user friendly way to create an IPv4 or IPv6 host-based firewall. By default UFW is disabled in Ubuntu. Given that ultimately they use iptables, and their use is widely covered in other resources such as the official Ubuntu documentation, I won’t cover these here.

Conclusion

Please don’t make the mistake of flushing/ignoring your firewall, when to make this right is just a matter of 5 commands. It’s easy enough to be done by everyone and it’s good enough to stop the basic security attacks.

Happy MySQL (and PXC) to everyone.

The post PXC loves firewalls (and System Admins loves iptables) appeared first on Percona Database Performance Blog.

by Marco Tusa at June 12, 2018 10:37 AM

Jean-Jerome Schmidt

How to Benchmark Performance of MySQL & MariaDB using SysBench

What is SysBench? If you work with MySQL on a regular basis, then you most probably have heard of it. SysBench has been in the MySQL ecosystem for a long time. It was originally written by Peter Zaitsev, back in 2004. Its purpose was to provide a tool to run synthetic benchmarks of MySQL and the hardware it runs on. It was designed to run CPU, memory and I/O tests. It had also an option to execute OLTP workload on a MySQL database. OLTP stands for online transaction processing, typical workload for online applications like e-commerce, order entry or financial transaction systems.

In this blog post, we will focus on the SQL benchmark feature but keep in mind that hardware benchmarks can also be very useful in identifying issues on database servers. For example, I/O benchmark was intended to simulate InnoDB I/O workload while CPU tests involve simulation of highly concurrent, multi-treaded environment along with tests for mutex contentions - something which also resembles a database type of workload.

SysBench history and architecture

As mentioned, SysBench was originally created in 2004 by Peter Zaitsev. Soon after, Alexey Kopytov took over its development. It reached version 0.4.12 and the development halted. After a long break Alexey started to work on SysBench again in 2016. Soon version 0.5 has been released with OLTP benchmark rewritten to use LUA-based scripts. Then, in 2017, SysBench 1.0 was released. This was like day and night compared to the old, 0.4.12 version. First and the foremost, instead of hardcoded scripts, now we have the ability to customize benchmarks using LUA. For instance, Percona created TPCC-like benchmark which can be executed using SysBench. Let’s take a quick look at the current SysBench architecture.

SysBench is a C binary which uses LUA scripts to execute benchmarks. Those scripts have to:

  1. Handle input from command line parameters
  2. Define all of the modes which the benchmark is supposed to use (prepare, run, cleanup)
  3. Prepare all of the data
  4. Define how the benchmark will be executed (what queries will look like etc)

Scripts can utilize multiple connections to the database, they can also process results should you want to create complex benchmarks where queries depend on the result set of previous queries. With SysBench 1.0 it is possible to create latency histograms. It is also possible for the LUA scripts to catch and handle errors through error hooks. There’s support for parallelization in the LUA scripts, multiple queries can be executed in parallel, making, for example, provisioning much faster. Last but not least, multiple output formats are now supported. Before SysBench generated only human-readable output. Now it is possible to generate it as CSV or JSON, making it much easier to do post-processing and generate graphs using, for example, gnuplot or feed the data into Prometheus, Graphite or similar datastore.

Why SysBench?

The main reason why SysBench became popular is the fact that it is simple to use. Someone without prior knowledge can start to use it within minutes. It also provides, by default, benchmarks which cover most of the cases - OLTP workloads, read-only or read-write, primary key lookups and primary key updates. All which caused most of the issues for MySQL, up to MySQL 8.0. This was also a reason why SysBench was so popular in different benchmarks and comparisons published on the Internet. Those posts helped to promote this tool and made it into the go-to synthetic benchmark for MySQL.

Another good thing about SysBench is that, since version 0.5 and incorporation of LUA, anyone can prepare any kind of benchmark. We already mentioned TPCC-like benchmark but anyone can craft something which will resemble her production workload. We are not saying it is simple, it will be most likely a time-consuming process, but having this ability is beneficial if you need to prepare a custom benchmark.

Being a synthetic benchmark, SysBench is not a tool which you can use to tune configurations of your MySQL servers (unless you prepared LUA scripts with custom workload or your workload happen to be very similar to the benchmark workloads that SysBench comes with). What it is great for is to compare performance of different hardware. You can easily compare performance of, let’s say, different type of nodes offered by your cloud provider and maximum QPS (queries per second) they offer. Knowing that metric and knowing what you pay for given node, you can then calculate even more important metric - QP$ (queries per dollar). This will allow you to identify what node type to use when building a cost-efficient environment. Of course, SysBench can be used also for initial tuning and assessing feasibility of a given design. Let’s say we build a Galera cluster spanning across the globe - North America, EU, Asia. How many inserts per second can such a setup handle? What would be the commit latency? Does it even make sense to do a proof of concept or maybe network latency is high enough that even a simple workload does not work as you would expect it to.

What about stress-testing? Not everyone has moved to the cloud, there are still companies preferring to build their own infrastructure. Every new server acquired should go through a warm-up period during which you will stress it to pinpoint potential hardware defects. In this case SysBench can also help. Either by executing OLTP workload which overloads the server, or you can also use dedicated benchmarks for CPU, disk and memory.

As you can see, there are many cases in which even a simple, synthetic benchmark can be very useful. In the next paragraph we will look at what we can do with SysBench.

What SysBench can do for you?

What tests you can run?

As mentioned at the beginning, we will focus on OLTP benchmarks and just as a reminder we’ll repeat that SysBench can also be used to perform I/O, CPU and memory tests. Let’s take a look at the benchmarks that SysBench 1.0 comes with (we removed some helper LUA files and non-database LUA scripts from this list).

-rwxr-xr-x 1 root root 1.5K May 30 07:46 bulk_insert.lua
-rwxr-xr-x 1 root root 1.3K May 30 07:46 oltp_delete.lua
-rwxr-xr-x 1 root root 2.4K May 30 07:46 oltp_insert.lua
-rwxr-xr-x 1 root root 1.3K May 30 07:46 oltp_point_select.lua
-rwxr-xr-x 1 root root 1.7K May 30 07:46 oltp_read_only.lua
-rwxr-xr-x 1 root root 1.8K May 30 07:46 oltp_read_write.lua
-rwxr-xr-x 1 root root 1.1K May 30 07:46 oltp_update_index.lua
-rwxr-xr-x 1 root root 1.2K May 30 07:46 oltp_update_non_index.lua
-rwxr-xr-x 1 root root 1.5K May 30 07:46 oltp_write_only.lua
-rwxr-xr-x 1 root root 1.9K May 30 07:46 select_random_points.lua
-rwxr-xr-x 1 root root 2.1K May 30 07:46 select_random_ranges.lua

Let’s go through them one by one.

First, bulk_insert.lua. This test can be used to benchmark the ability of MySQL to perform multi-row inserts. This can be quite useful when checking, for example, performance of replication or Galera cluster. In the first case, it can help you answer a question: “how fast can I insert before replication lag will kick in?”. In the later case, it will tell you how fast data can be inserted into a Galera cluster given the current network latency.

All oltp_* scripts share a common table structure. First two of them (oltp_delete.lua and oltp_insert.lua) execute single DELETE and INSERT statements. Again, this could be a test for either replication or Galera cluster - push it to the limits and see what amount of inserting or purging it can handle. We also have other benchmarks focused on particular functionality - oltp_point_select, oltp_update_index and oltp_update_non_index. These will execute a subset of queries - primary key-based selects, index-based updates and non-index-based updates. If you want to test some of these functionalities, the tests are there. We also have more complex benchmarks which are based on OLTP workloads: oltp_read_only, oltp_read_write and oltp_write_only. You can run either a read-only workload, which will consist of different types of SELECT queries, you can run only writes (a mix of DELETE, INSERT and UPDATE) or you can run a mix of those two. Finally, using select_random_points and select_random_ranges you can run some random SELECT either using random points in IN() list or random ranges using BETWEEN.

How you can configure a benchmark?

What is also important, benchmarks are configurable - you can run different workload patterns using the same benchmark. Let’s take a look at the two most common benchmarks to execute. We’ll have a deep dive into OLTP read_only and OLTP read_write benchmarks. First of all, SysBench has some general configuration options. We will discuss here only the most important ones, you can check all of them by running:

sysbench --help

Let’s take a look at them.

  --threads=N                     number of threads to use [1]

You can define what kind of concurrency you’d like SysBench to generate. MySQL, as every software, has some scalability limitations and its performance will peak at some level of concurrency. This setting helps to simulate different concurrencies for a given workload and check if it already has passed the sweet spot.

  --events=N                      limit for total number of events [0]
  --time=N                        limit for total execution time in seconds [10]

Those two settings govern how long SysBench should keep running. It can either execute some number of queries or it can keep running for a predefined time.

  --warmup-time=N                 execute events for this many seconds with statistics disabled before the actual benchmark run with statistics enabled [0]

This is self-explanatory. SysBench generates statistical results from the tests and those results may be affected if MySQL is in a cold state. Warmup helps to identify “regular” throughput by executing benchmark for a predefined time, allowing to warm up the cache, buffer pools etc.

  --rate=N                        average transactions rate. 0 for unlimited rate [0]

By default SysBench will attempt to execute queries as fast as possible. To simulate slower traffic this option may be used. You can define here how many transactions should be executed per second.

  --report-interval=N             periodically report intermediate statistics with a specified interval in seconds. 0 disables intermediate reports [0]

By default SysBench generates a report after it completed its run and no progress is reported while the benchmark is running. Using this option you can make SysBench more verbose while the benchmark still runs.

  --rand-type=STRING   random numbers distribution {uniform, gaussian, special, pareto, zipfian} to use by default [special]

SysBench gives you ability to generate different types of data distribution. All of them may have their own purposes. Default option, ‘special’, defines several (it is configurable) hot-spots in the data, something which is quite common in web applications. You can also use other distributions if your data behaves in a different way. By making a different choice here you can also change the way your database is stressed. For example, uniform distribution, where all of the rows have the same likeliness of being accessed, is much more memory-intensive operation. It will use more buffer pool to store all of the data and it will be much more disk-intensive if your data set won’t fit in memory. On the other hand, special distribution with couple of hot-spots will put less stress on the disk as hot rows are more likely to be kept in the buffer pool and access to rows stored on disk is much less likely. For some of the data distribution types, SysBench gives you more tweaks. You can find this info in ‘sysbench --help’ output.

  --db-ps-mode=STRING prepared statements usage mode {auto, disable} [auto]

Using this setting you can decide if SysBench should use prepared statements (as long as they are available in the given datastore - for MySQL it means PS will be enabled by default) or not. This may make a difference while working with proxies like ProxySQL or MaxScale - they should treat prepared statements in a special way and all of them should be routed to one host making it impossible to test scalability of the proxy.

In addition to the general configuration options, each of the tests may have its own configuration. You can check what is possible by running:

root@vagrant:~# sysbench ./sysbench/src/lua/oltp_read_write.lua  help
sysbench 1.1.0-2e6b7d5 (using bundled LuaJIT 2.1.0-beta3)

oltp_read_only.lua options:
  --distinct_ranges=N           Number of SELECT DISTINCT queries per transaction [1]
  --sum_ranges=N                Number of SELECT SUM() queries per transaction [1]
  --skip_trx[=on|off]           Don't start explicit transactions and execute all queries in the AUTOCOMMIT mode [off]
  --secondary[=on|off]          Use a secondary index in place of the PRIMARY KEY [off]
  --create_secondary[=on|off]   Create a secondary index in addition to the PRIMARY KEY [on]
  --index_updates=N             Number of UPDATE index queries per transaction [1]
  --range_size=N                Range size for range SELECT queries [100]
  --auto_inc[=on|off]           Use AUTO_INCREMENT column as Primary Key (for MySQL), or its alternatives in other DBMS. When disabled, use client-generated IDs [on]
  --delete_inserts=N            Number of DELETE/INSERT combinations per transaction [1]
  --tables=N                    Number of tables [1]
  --mysql_storage_engine=STRING Storage engine, if MySQL is used [innodb]
  --non_index_updates=N         Number of UPDATE non-index queries per transaction [1]
  --table_size=N                Number of rows per table [10000]
  --pgsql_variant=STRING        Use this PostgreSQL variant when running with the PostgreSQL driver. The only currently supported variant is 'redshift'. When enabled, create_secondary is automatically disabled, and delete_inserts is set to 0
  --simple_ranges=N             Number of simple range SELECT queries per transaction [1]
  --order_ranges=N              Number of SELECT ORDER BY queries per transaction [1]
  --range_selects[=on|off]      Enable/disable all range SELECT queries [on]
  --point_selects=N             Number of point SELECT queries per transaction [10]

Again, we will discuss the most important options from here. First of all, you have a control of how exactly a transaction will look like. Generally speaking, it consists of different types of queries - INSERT, DELETE, different type of SELECT (point lookup, range, aggregation) and UPDATE (indexed, non-indexed). Using variables like:

  --distinct_ranges=N           Number of SELECT DISTINCT queries per transaction [1]
  --sum_ranges=N                Number of SELECT SUM() queries per transaction [1]
  --index_updates=N             Number of UPDATE index queries per transaction [1]
  --delete_inserts=N            Number of DELETE/INSERT combinations per transaction [1]
  --non_index_updates=N         Number of UPDATE non-index queries per transaction [1]
  --simple_ranges=N             Number of simple range SELECT queries per transaction [1]
  --order_ranges=N              Number of SELECT ORDER BY queries per transaction [1]
  --point_selects=N             Number of point SELECT queries per transaction [10]
  --range_selects[=on|off]      Enable/disable all range SELECT queries [on]

You can define what a transaction should look like. As you can see by looking at the default values, majority of queries are SELECTs - mainly point selects but also different types of range SELECTs (you can disable all of them by setting range_selects to off). You can tweak the workload towards more write-heavy workload by increasing the number of updates or INSERT/DELETE queries. It is also possible to tweak settings related to secondary indexes, auto increment but also data set size (number of tables and how many rows each of them should hold). This lets you customize your workload quite nicely.

  --skip_trx[=on|off]           Don't start explicit transactions and execute all queries in the AUTOCOMMIT mode [off]

This is another setting, quite important when working with proxies. By default, SysBench will attempt to execute queries in explicit transaction. This way the dataset will stay consistent and not affected: SysBench will, for example, execute INSERT and DELETE on the same row, making sure the data set will not grow (impacting your ability to reproduce results). However, proxies will treat explicit transactions differently - all queries executed within a transaction should be executed on the same host, thus removing the ability to scale the workload. Please keep in mind that disabling transactions will result in data set diverging from the initial point. It may also trigger some issues like duplicate key errors or such. To be able to disable transactions you may also want to look into:

  --mysql-ignore-errors=[LIST,...] list of errors to ignore, or "all" [1213,1020,1205]

This setting allows you to specify error codes from MySQL which SysBench should ignore (and not kill the connection). For example, to ignore errors like: error 1062 (Duplicate entry '6' for key 'PRIMARY') you should pass this error code: --mysql-ignore-errors=1062

What is also important, each benchmark should present a way to provision a data set for tests, run them and then clean it up after the tests complete. This is done using ‘prepare’, ‘run’ and ‘cleanup’ commands. We will show how this is done in the next section.

Examples

In this section we’ll go through some examples of what SysBench can be used for. As mentioned earlier, we’ll focus on the two most popular benchmarks - OLTP read only and OLTP read/write. Sometimes it may make sense to use other benchmarks, but at least we’ll be able to show you how those two can be customized.

Primary Key lookups

First of all, we have to decide which benchmark we will run, read-only or read-write. Technically speaking it does not make a difference as we can remove writes from R/W benchmark. Let’s focus on the read-only one.

As a first step, we have to prepare a data set. We need to decide how big it should be. For this particular benchmark, using default settings (so, secondary indexes are created), 1 million rows will result in ~240 MB of data. Ten tables, 1000 000 rows each equals to 2.4GB:

root@vagrant:~# du -sh /var/lib/mysql/sbtest/
2.4G    /var/lib/mysql/sbtest/
root@vagrant:~# ls -alh /var/lib/mysql/sbtest/
total 2.4G
drwxr-x--- 2 mysql mysql 4.0K Jun  1 12:12 .
drwxr-xr-x 6 mysql mysql 4.0K Jun  1 12:10 ..
-rw-r----- 1 mysql mysql   65 Jun  1 12:08 db.opt
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:12 sbtest10.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:12 sbtest10.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:10 sbtest1.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:10 sbtest1.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:10 sbtest2.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:10 sbtest2.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:10 sbtest3.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:10 sbtest3.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:10 sbtest4.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:10 sbtest4.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:11 sbtest5.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:11 sbtest5.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:11 sbtest6.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:11 sbtest6.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:11 sbtest7.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:11 sbtest7.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:11 sbtest8.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:11 sbtest8.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:12 sbtest9.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:12 sbtest9.ibd

This should give you idea how many tables you want and how big they should be. Let’s say we want to test in-memory workload so we want to create tables which will fit into InnoDB buffer pool. On the other hand, we want also to make sure there are enough tables not to become a bottleneck (or, that the amount of tables matches what you would expect in your production setup). Let’s prepare our dataset. Please keep in mind that, by default, SysBench looks for ‘sbtest’ schema which has to exist before you prepare the data set. You may have to create it manually.

root@vagrant:~# sysbench /root/sysbench/src/lua/oltp_read_only.lua --threads=4 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --mysql-port=3306 --tables=10 --table-size=1000000 prepare
sysbench 1.1.0-2e6b7d5 (using bundled LuaJIT 2.1.0-beta3)

Initializing worker threads...

Creating table 'sbtest2'...
Creating table 'sbtest3'...
Creating table 'sbtest4'...
Creating table 'sbtest1'...
Inserting 1000000 records into 'sbtest2'
Inserting 1000000 records into 'sbtest4'
Inserting 1000000 records into 'sbtest3'
Inserting 1000000 records into 'sbtest1'
Creating a secondary index on 'sbtest2'...
Creating a secondary index on 'sbtest3'...
Creating a secondary index on 'sbtest1'...
Creating a secondary index on 'sbtest4'...
Creating table 'sbtest6'...
Inserting 1000000 records into 'sbtest6'
Creating table 'sbtest7'...
Inserting 1000000 records into 'sbtest7'
Creating table 'sbtest5'...
Inserting 1000000 records into 'sbtest5'
Creating table 'sbtest8'...
Inserting 1000000 records into 'sbtest8'
Creating a secondary index on 'sbtest6'...
Creating a secondary index on 'sbtest7'...
Creating a secondary index on 'sbtest5'...
Creating a secondary index on 'sbtest8'...
Creating table 'sbtest10'...
Inserting 1000000 records into 'sbtest10'
Creating table 'sbtest9'...
Inserting 1000000 records into 'sbtest9'
Creating a secondary index on 'sbtest10'...
Creating a secondary index on 'sbtest9'...

Once we have our data, let’s prepare a command to run the test. We want to test Primary Key lookups therefore we will disable all other types of SELECT. We will also disable prepared statements as we want to test regular queries. We will test low concurrency, let’s say 16 threads. Our command may look like below:

sysbench /root/sysbench/src/lua/oltp_read_only.lua --threads=16 --events=0 --time=300 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --mysql-port=3306 --tables=10 --table-size=1000000 --range_selects=off --db-ps-mode=disable --report-interval=1 run

What did we do here? We set the number of threads to 16. We decided that we want our benchmark to run for 300 seconds, without a limit of executed queries. We defined connectivity to the database, number of tables and their size. We also disabled all range SELECTs, we also disabled prepared statements. Finally, we set report interval to one second. This is how a sample output may look like:

[ 297s ] thds: 16 tps: 97.21 qps: 1127.43 (r/w/o: 935.01/0.00/192.41) lat (ms,95%): 253.35 err/s: 0.00 reconn/s: 0.00
[ 298s ] thds: 16 tps: 195.32 qps: 2378.77 (r/w/o: 1985.13/0.00/393.64) lat (ms,95%): 189.93 err/s: 0.00 reconn/s: 0.00
[ 299s ] thds: 16 tps: 178.02 qps: 2115.22 (r/w/o: 1762.18/0.00/353.04) lat (ms,95%): 155.80 err/s: 0.00 reconn/s: 0.00
[ 300s ] thds: 16 tps: 217.82 qps: 2640.92 (r/w/o: 2202.27/0.00/438.65) lat (ms,95%): 125.52 err/s: 0.00 reconn/s: 0.00

Every second we see a snapshot of workload stats. This is quite useful to track and plot - final report will give you averages only. Intermediate results will make it possible to track the performance on a second by second basis. The final report may look like below:

SQL statistics:
    queries performed:
        read:                            614660
        write:                           0
        other:                           122932
        total:                           737592
    transactions:                        61466  (204.84 per sec.)
    queries:                             737592 (2458.08 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      204.8403
    time elapsed:                        300.0679s
    total number of events:              61466

Latency (ms):
         min:                                   24.91
         avg:                                   78.10
         max:                                  331.91
         95th percentile:                      137.35
         sum:                              4800234.60

Threads fairness:
    events (avg/stddev):           3841.6250/20.87
    execution time (avg/stddev):   300.0147/0.02

You will find here information about executed queries and other (BEGIN/COMMIT) statements. You’ll learn how many transactions were executed, how many errors happened, what was the throughput and total elapsed time. You can also check latency metrics and the query distribution across threads.

If we were interested in latency distribution, we could also pass ‘--histogram’ argument to SysBench. This results in an additional output like below:

Latency histogram (values are in milliseconds)
       value  ------------- distribution ------------- count
      29.194 |******                                   1
      30.815 |******                                   1
      31.945 |***********                              2
      33.718 |******                                   1
      34.954 |***********                              2
      35.589 |******                                   1
      37.565 |***********************                  4
      38.247 |******                                   1
      38.942 |******                                   1
      39.650 |***********                              2
      40.370 |***********                              2
      41.104 |*****************                        3
      41.851 |*****************************            5
      42.611 |*****************                        3
      43.385 |*****************                        3
      44.173 |***********                              2
      44.976 |**************************************** 7
      45.793 |***********************                  4
      46.625 |***********                              2
      47.472 |*****************************            5
      48.335 |**************************************** 7
      49.213 |***********                              2
      50.107 |**********************************       6
      51.018 |***********************                  4
      51.945 |**************************************** 7
      52.889 |*****************                        3
      53.850 |*****************                        3
      54.828 |***********************                  4
      55.824 |***********                              2
      57.871 |***********                              2
      58.923 |***********                              2
      59.993 |******                                   1
      61.083 |******                                   1
      63.323 |***********                              2
      66.838 |******                                   1
      71.830 |******                                   1

Once we are good with our results, we can clean up the data:

sysbench /root/sysbench/src/lua/oltp_read_only.lua --threads=16 --events=0 --time=300 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --mysql-port=3306 --tables=10 --table-size=1000000 --range_selects=off --db-ps-mode=disable --report-interval=1 cleanup

Write-heavy traffic

Let’s imagine here that we want to execute a write-heavy (but not write-only) workload and, for example, test I/O subsystem’s performance. First of all, we have to decide how big the dataset should be. We’ll assume ~48GB of data (20 tables, 10 000 000 rows each). We need to prepare it. This time we will use the read-write benchmark.

root@vagrant:~# sysbench /root/sysbench/src/lua/oltp_read_write.lua --threads=4 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --mysql-port=3306 --tables=20 --table-size=10000000 prepare

Once this is done, we can tweak the defaults to force more writes into the query mix:

root@vagrant:~# sysbench /root/sysbench/src/lua/oltp_read_write.lua --threads=16 --events=0 --time=300 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --mysql-port=3306 --tables=20 --delete_inserts=10 --index_updates=10 --non_index_updates=10 --table-size=10000000 --db-ps-mode=disable --report-interval=1 run

As you can see from the intermediate results, transactions are now on a write-heavy side:

[ 5s ] thds: 16 tps: 16.99 qps: 946.31 (r/w/o: 231.83/680.50/33.98) lat (ms,95%): 1258.08 err/s: 0.00 reconn/s: 0.00
[ 6s ] thds: 16 tps: 17.01 qps: 955.81 (r/w/o: 223.19/698.59/34.03) lat (ms,95%): 1032.01 err/s: 0.00 reconn/s: 0.00
[ 7s ] thds: 16 tps: 12.00 qps: 698.91 (r/w/o: 191.97/482.93/24.00) lat (ms,95%): 1235.62 err/s: 0.00 reconn/s: 0.00
[ 8s ] thds: 16 tps: 14.01 qps: 683.43 (r/w/o: 195.12/460.29/28.02) lat (ms,95%): 1533.66 err/s: 0.00 reconn/s: 0.00

Understanding the results

As we showed above, SysBench is a great tool which can help to pinpoint some of the performance issues of MySQL or MariaDB. It can also be used for initial tuning of your database configuration. Of course, you have to keep in mind that, to get the best out of your benchmarks, you have to understand why results look like they do. This would require insights into the MySQL internal metrics using monitoring tools, for instance, ClusterControl. This is quite important to remember - if you don’t understand why the performance was like it was, you may draw incorrect conclusions out of the benchmarks. There is always a bottleneck, and SysBench can help raise the performance issues, which you then have to identify.

by krzysztof at June 12, 2018 08:08 AM

June 11, 2018

Peter Zaitsev

ProxySQL Experimental Feature: Native ProxySQL Clustering

ProxySQL Cluster

ProxySQL 1.4.2 introduced native clustering, allowing several ProxySQL instances to communicate with and share configuration updates with each other. In this blog post, I’ll review this new feature and how we can start working with 3 nodes.

Before I continue, let’s review two common methods to installing ProxySQL.

ProxySQL as a centralized server

This is the most common installation, where ProxySQL is between application servers and the database. It is simple, but without any high availability. If ProxySQL goes down you lose all connectivity to the database.

ProxySQL Install most common set up

ProxySQL on app instances

Another common setup is to install ProxySQL onto each application server. This is good because the loss of one ProxySQL/App server will not bring down the entire application.

ProxySQL Install master-slave

For more information about the previous installation, please visit this link Where Do I Put ProxySQL?

Sometimes our application and databases grow fast. Maybe you need add a loadbalancer, for example, and in that moment you start thinking … “What could I do to configure and maintain all these ProxySQL nodes without mistakes?

To do that, there are many tools like Ansible, Puppet, and Chef, but you will need write/create/maintain scripts to do those tasks. This is really difficult to administer for one person.

Now, there is a native solution, built into ProxySQL, to create and administer a cluster in an easy way.

At the moment this feature is EXPERIMENTAL and subject to change. Think very carefully before installing it in production, in fact I strongly recommend you wait. However, if you would like to start testing this feature, you need to install ProxySQL 1.4.2, or better.

This clustering feature is really useful if you have installed one ProxySQL per application instance, because all the changes in one of the ProxySQL nodes will be propagated to all the other ProxySQL nodes. You can also configure a “master-slave” style setup with ProxySQL clustering.

There are only 4 tables where you can make changes and propagate the configuration:

  • mysql_query_rules
  • mysql_servers
  • mysql_users
  • proxysql_servers

How does it work?

It’s easy. When you make a change like INSERT/DELETE/UPDATE on any of these tables, after running the command

LOAD … TO RUNTIME
 , ProxySQL creates a new checksum of the table’s data and increments the version number in the table runtime_checksums_values. Below we can see an example.

admin ((none))>SELECT name, version, FROM_UNIXTIME(epoch), checksum FROM runtime_checksums_values ORDER BY name;
+-------------------+---------+----------------------+--------------------+
| name              | version | FROM_UNIXTIME(epoch) | checksum           |
+-------------------+---------+----------------------+--------------------+
| admin_variables   | 0       | 1970-01-01 00:00:00  |                    |
| mysql_query_rules | 1       | 2018-04-26 15:58:23  | 0x0000000000000000 |
| mysql_servers     | 1       | 2018-04-26 15:58:23  | 0x0000000000000000 |
| mysql_users       | 4       | 2018-04-26 18:36:12  | 0x2F35CAB62143AE41 |
| mysql_variables   | 0       | 1970-01-01 00:00:00  |                    |
| proxysql_servers  | 1       | 2018-04-26 15:58:23  | 0x0000000000000000 |
+-------------------+---------+----------------------+--------------------+

Internally, all nodes are monitoring and communicating with all the other ProxySQL nodes. When another node detects a change in the checksum and version (both at the same time), each node will get a copy of the table that was modified, make the same changes locally, and apply the new config to RUNTIME to refresh the new config, make it visible to the applications connected and automatically save it to DISK for persistence.

ProxySQL Cluster

The following setup creates a “synchronous cluster” so any changes to these 4 tables on any ProxySQL server will be replicated to all other ProxySQL nodes. Be careful!

How can I start testing this new feature?

1) To start we need to get at least 2 nodes. Download and install ProxySQL 1.4.2 or higher and start a clean version.

2) On all nodes, we need to update the following global variables. These changes will set the username and password used by each node’s internal communication to cluster1/clusterpass. These must be the same on all nodes in this cluster.

update global_variables set variable_value='admin:admin;cluster1:clusterpass' where variable_name='admin-admin_credentials';
update global_variables set variable_value='cluster1' where variable_name='admin-cluster_username';
update global_variables set variable_value='clusterpass' where variable_name='admin-cluster_password';
update global_variables set variable_value=200 where variable_name='admin-cluster_check_interval_ms';
update global_variables set variable_value=100 where variable_name='admin-cluster_check_status_frequency';
update global_variables set variable_value='true' where variable_name='admin-cluster_mysql_query_rules_save_to_disk';
update global_variables set variable_value='true' where variable_name='admin-cluster_mysql_servers_save_to_disk';
update global_variables set variable_value='true' where variable_name='admin-cluster_mysql_users_save_to_disk';
update global_variables set variable_value='true' where variable_name='admin-cluster_proxysql_servers_save_to_disk';
update global_variables set variable_value=3 where variable_name='admin-cluster_mysql_query_rules_diffs_before_sync';
update global_variables set variable_value=3 where variable_name='admin-cluster_mysql_servers_diffs_before_sync';
update global_variables set variable_value=3 where variable_name='admin-cluster_mysql_users_diffs_before_sync';
update global_variables set variable_value=3 where variable_name='admin-cluster_proxysql_servers_diffs_before_sync';
load admin variables to RUNTIME;
save admin variables to disk;

3) Add all IPs from the other ProxySQL nodes into each other node:

INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES ('10.138.180.183',6032,100,'PRIMARY');
INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES ('10.138.244.108',6032,99,'SECONDARY');
INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES ('10.138.244.244',6032,98,'SECONDARY');
LOAD PROXYSQL SERVERS TO RUNTIME;
SAVE PROXYSQL SERVERS TO DISK;

At this moment, we have all nodes synced.

In the next example from the log file, we can see when node1 detected node2.

[root@proxysql1 ~]# $ tail /var/lib/proxysql/proxysql.log
...
2018-05-10 11:19:51 [INFO] Cluster: Fetching ProxySQL Servers from peer 10.138.244.108:6032 started
2018-05-10 11:19:51 [INFO] Cluster: Fetching ProxySQL Servers from peer 10.138.244.108:6032 completed
2018-05-10 11:19:51 [INFO] Cluster: Loading to runtime ProxySQL Servers from peer 10.138.244.108:6032
2018-05-10 11:19:51 [INFO] Destroyed Cluster Node Entry for host 10.138.148.242:6032
2018-05-10 11:19:51 [INFO] Cluster: Saving to disk ProxySQL Servers from peer 10.138.244.108:6032
2018-05-10 11:19:52 [INFO] Cluster: detected a new checksum for proxysql_servers from peer 10.138.180.183:6032, version 6, epoch 1525951191, checksum 0x3D819A34C06EF4EA . Not syncing yet ...
2018-05-10 11:19:52 [INFO] Cluster: checksum for proxysql_servers from peer 10.138.180.183:6032 matches with local checksum 0x3D819A34C06EF4EA , we won't sync.
2018-05-10 11:19:52 [INFO] Cluster: closing thread for peer 10.138.148.242:6032
2018-05-10 11:19:52 [INFO] Cluster: detected a new checksum for proxysql_servers from peer 10.138.244.244:6032, version 4, epoch 1525951163, checksum 0x3D819A34C06EF4EA . Not syncing yet ...
2018-05-10 11:19:52 [INFO] Cluster: checksum for proxysql_servers from peer 10.138.244.244:6032 matches with local checksum 0x3D819A34C06EF4EA , we won't sync
...

Another example is to add users to the table mysql_users. Remember these users are to enable MySQL connections between the application (frontend) and MySQL (backend).

We will add a new username and password on any server; in my test I’ll use node2:

admin proxysql2 ((none))>INSERT INTO mysql_users(username,password) VALUES ('user1','crazyPassword');
Query OK, 1 row affected (0.00 sec)
admin proxysql2 ((none))>LOAD MYSQL USERS TO RUNTIME;
Query OK, 0 rows affected (0.00 sec)

In the log file from node3, we can see the update immediately:

[root@proxysql3 ~]# $ tail /var/lib/proxysql/proxysql.log
...
2018-05-10 11:30:57 [INFO] Cluster: detected a new checksum for mysql_users from peer 10.138.244.108:6032, version 2, epoch 1525951873, checksum 0x2AF43564C9985EC7 . Not syncing yet ...
2018-05-10 11:30:57 [INFO] Cluster: detected a peer 10.138.244.108:6032 with mysql_users version 2, epoch 1525951873, diff_check 3. Own version: 1, epoch: 1525950968. Proceeding with remote sync
2018-05-10 11:30:57 [INFO] Cluster: detected a peer 10.138.244.108:6032 with mysql_users version 2, epoch 1525951873, diff_check 4. Own version: 1, epoch: 1525950968. Proceeding with remote sync
2018-05-10 11:30:57 [INFO] Cluster: detected peer 10.138.244.108:6032 with mysql_users version 2, epoch 1525951873
2018-05-10 11:30:57 [INFO] Cluster: Fetching MySQL Users from peer 10.138.244.108:6032 started
2018-05-10 11:30:57 [INFO] Cluster: Fetching MySQL Users from peer 10.138.244.108:6032 completed
2018-05-10 11:30:57 [INFO] Cluster: Loading to runtime MySQL Users from peer 10.138.244.108:6032
2018-05-10 11:30:57 [INFO] Cluster: Saving to disk MySQL Query Rules from peer 10.138.244.108:6032
2018-05-10 11:30:57 [INFO] Cluster: detected a new checksum for mysql_users from peer 10.138.244.244:6032, version 2, epoch 1525951857, checksum 0x2AF43564C9985EC7 . Not syncing yet ...
2018-05-10 11:30:57 [INFO] Cluster: checksum for mysql_users from peer 10.138.244.244:6032 matches with local checksum 0x2AF43564C9985EC7 , we won't sync.
2018-05-10 11:30:57 [INFO] Cluster: detected a new checksum for mysql_users from peer 10.138.180.183:6032, version 2, epoch 1525951886, checksum 0x2AF43564C9985EC7 . Not syncing yet ...
2018-05-10 11:30:57 [INFO] Cluster: checksum for mysql_users from peer 10.138.180.183:6032 matches with local checksum 0x2AF43564C9985EC7 , we won't sync.
...

What happens if some node is down?

In this example, we will see and find out what happens if one node is down or has a network glitch, or other issue. I’ll stop ProxySQL node3:

[root@proxysql3 ~]# service proxysql stop
Shutting down ProxySQL: DONE!

On ProxySQL node1, we can check that node3 is unreachable:

[root@proxysql1 ~]# tailf /var/lib/proxysql/proxysql.log
2018-05-10 11:57:33 ProxySQL_Cluster.cpp:180:ProxySQL_Cluster_Monitor_thread(): [WARNING] Cluster: unable to connect to peer 10.138.244.244:6032 . Error: Can't connect to MySQL server on '10.138.244.244' (107)
2018-05-10 11:57:33 ProxySQL_Cluster.cpp:180:ProxySQL_Cluster_Monitor_thread(): [WARNING] Cluster: unable to connect to peer 10.138.244.244:6032 . Error: Can't connect to MySQL server on '10.138.244.244' (107)
2018-05-10 11:57:33 ProxySQL_Cluster.cpp:180:ProxySQL_Cluster_Monitor_thread(): [WARNING] Cluster: unable to connect to peer 10.138.244.244:6032 . Error: Can't connect to MySQL server on '10.138.244.244' (107)

And another check can be run in any ProxySQL node like node2, for example:

admin proxysql2 ((none))>SELECT hostname, checksum, FROM_UNIXTIME(changed_at) changed_at, FROM_UNIXTIME(updated_at) updated_at FROM stats_proxysql_servers_checksums WHERE name='proxysql_servers' ORDER BY hostname;
+----------------+--------------------+---------------------+---------------------+
| hostname       | checksum           | changed_at          | updated_at          |
+----------------+--------------------+---------------------+---------------------+
| 10.138.180.183 | 0x3D819A34C06EF4EA | 2018-05-10 11:19:39 | 2018-05-10 12:01:59 |
| 10.138.244.108 | 0x3D819A34C06EF4EA | 2018-05-10 11:19:38 | 2018-05-10 12:01:59 |
| 10.138.244.244 | 0x3D819A34C06EF4EA | 2018-05-10 11:19:39 | 2018-05-10 11:56:59 |
+----------------+--------------------+---------------------+---------------------+
3 rows in set (0.00 sec)

In the previous result, we can see node3 (10.138.244.244) is not being updated; the column updated_at should have a later datetime. This means that node3 is not running (or is down or network glitch).

At this point, any change to any of the tables, mysql_query_rules, mysql_servers, mysql_users, proxysql_servers, will be replicated between nodes 1 & 2.

In this next example, while node3 is offline, we will add another user to mysql_users table.

admin proxysql2 ((none))>INSERT INTO mysql_users(username,password) VALUES ('user2','passwordCrazy');
Query OK, 1 row affected (0.00 sec)
admin proxysql2 ((none))>LOAD MYSQL USERS TO RUNTIME;
Query OK, 0 rows affected (0.00 sec)

That change was propagated to node1:

[root@proxysql3 ~]# $ tail /var/lib/proxysql/proxysql.log
...
2018-05-10 12:12:36 [INFO] Cluster: detected a peer 10.138.244.108:6032 with mysql_users version 3, epoch 1525954343, diff_check 4. Own version: 2, epoch: 1525951886. Proceeding with remote sync
2018-05-10 12:12:36 [INFO] Cluster: detected peer 10.138.244.108:6032 with mysql_users version 3, epoch 1525954343
2018-05-10 12:12:36 [INFO] Cluster: Fetching MySQL Users from peer 10.138.244.108:6032 started
2018-05-10 12:12:36 [INFO] Cluster: Fetching MySQL Users from peer 10.138.244.108:6032 completed
2018-05-10 12:12:36 [INFO] Cluster: Loading to runtime MySQL Users from peer 10.138.244.108:6032
2018-05-10 12:12:36 [INFO] Cluster: Saving to disk MySQL Query Rules from peer 10.138.244.108:6032
...

We keep seeing node3 is out of sync about 25 minutes ago.

admin proxysql2 ((none))>SELECT hostname, checksum, FROM_UNIXTIME(changed_at) changed_at, FROM_UNIXTIME(updated_at) updated_at FROM stats_proxysql_servers_checksums WHERE name='mysql_users' ORDER BY hostname;
+----------------+--------------------+---------------------+---------------------+
|      hostname  |     checksum       |     changed_at      |      updated_at     |
+----------------+--------------------+---------------------+---------------------+
| 10.138.180.183 | 0x3D819A34C06EF4EA | 2018-05-10 11:19:39 | 2018-05-10 12:21:35 |
| 10.138.244.108 | 0x3D819A34C06EF4EA | 2018-05-10 11:19:38 |2018-05-10 12:21:35  |
| 10.138.244.244 | 0x3D819A34C06EF4EA | 2018-05-10 11:19:39 |2018-05-10 12:21:35  |
+----------------+--------------------+---------------------+---------------------+
3 rows in set (0.00 sec)

Let’s start node3 and check if the sync works. node3 should connect to the other nodes and get the last changes.

[root@proxysql3 ~]# tail /var/lib/proxysql/proxysql.log
...
2018-05-10 12:30:02 [INFO] Cluster: detected a peer 10.138.244.108:6032 with mysql_users version 3, epoch 1525954343, diff_check 3. Own version: 1, epoch: 1525955402. Proceeding with remote sync
2018-05-10 12:30:02 [INFO] Cluster: detected a peer 10.138.180.183:6032 with mysql_users version 3, epoch 1525954356, diff_check 3. Own version: 1, epoch: 1525955402. Proceeding with remote sync
…
2018-05-10 12:30:03 [INFO] Cluster: detected peer 10.138.180.183:6032 with mysql_users version 3, epoch 1525954356
2018-05-10 12:30:03 [INFO] Cluster: Fetching MySQL Users from peer 10.138.180.183:6032 started
2018-05-10 12:30:03 [INFO] Cluster: Fetching MySQL Users from peer 10.138.180.183:6032 completed
2018-05-10 12:30:03 [INFO] Cluster: Loading to runtime MySQL Users from peer 10.138.180.183:6032
2018-05-10 12:30:03 [INFO] Cluster: Saving to disk MySQL Query Rules from peer 10.138.180.183:6032

Looking at the status from the checksum table, we can see node3 is now up to date.

admin proxysql2 ((none))>SELECT hostname, checksum, FROM_UNIXTIME(changed_at) changed_at, FROM_UNIXTIME(updated_at) updated_at FROM stats_proxysql_servers_checksums WHERE name='mysql_users' ORDER BY hostname;
+----------------+--------------------+---------------------+---------------------+
| hostname | checksum | changed_at | updated_at |
+----------------+--------------------+---------------------+---------------------+
| 10.138.180.183 | 0x3D819A34C06EF4EA | 2018-05-10 11:19:39 | 2018-05-10 12:21:35 |
| 10.138.244.108 | 0x3D819A34C06EF4EA | 2018-05-10 11:19:38 |2018-05-10 12:21:35 |
| 10.138.244.244 | 0x3D819A34C06EF4EA | 2018-05-10 11:19:39 |2018-05-10 12:21:35 |
+----------------+--------------------+---------------------+---------------------+
3 rows in set (0.00 sec)admin proxysql2 ((none))>SELECT hostname, checksum, FROM_UNIXTIME(changed_at) changed_at, FROM_UNIXTIME(updated_at) updated_at FROM stats_proxysql_servers_checksums WHERE name='mysql_users' ORDER BY hostname;
+----------------+--------------------+---------------------+---------------------+
| hostname | checksum | changed_at | updated_at |
+----------------+--------------------+---------------------+---------------------+
| 10.138.180.183 | 0x3928F574AFFF4C65 | 2018-05-10 12:12:24 | 2018-05-10 12:31:58 |
| 10.138.244.108 | 0x3928F574AFFF4C65 | 2018-05-10 12:12:23 | 2018-05-10 12:31:58 |
| 10.138.244.244 | 0x3928F574AFFF4C65 | 2018-05-10 12:30:19 | 2018-05-10 12:31:58 |
+----------------+--------------------+---------------------+---------------------+
3 rows in set (0.00 sec)

Now we have 3 ProxySQL nodes up to date. This example didn’t add any MySQL servers, hostgroups, etc, because the functionality is the same. The post is intended as an introduction to this new feature and how you can create and test a ProxySQL cluster.

Just remember that this is still an experimental feature and is subject to change with newer versions of ProxySQL.

Summary

This feature is really needed if you have more than one ProxySQL running for the same application in different instances. It is easy to maintain and configure for a single person and is easy to create and attach new nodes.

Hope you find this post helpful!

http://www.proxysql.com/blog/proxysql-cluster
http://www.proxysql.com/blog/proxysql-cluster-part2
http://www.proxysql.com/blog/proxysql-cluster-part3-mysql-servers
https://github.com/sysown/proxysql/wiki/ProxySQL-Cluster

The post ProxySQL Experimental Feature: Native ProxySQL Clustering appeared first on Percona Database Performance Blog.

by Walter Garcia at June 11, 2018 12:18 PM