Planet MariaDB

June 24, 2019

Oli Sennhauser

Oops! - That SQL Query was not intended... Flashback

It is Saturday night at 23:19. Time to go to bed after a hard migration day. Just a last clean-up query before finishing: Tap tap tap. Enter! - Oops!

SQL> UPDATE prospect_lists_prospects SET prospect_list_id = '73ae6cca-7b34-c4a3-5500-5d0e2674dbb6';
Query OK, 4686 rows affected (0.21 sec)
Rows matched: 5666  Changed: 4686  Warnings: 0

A verification query to confirm that I am really in a mess:

SQL> SELECT prospect_list_id, COUNT(*) FROM prospect_lists_prospects GROUP BY prospect_list_id;
+--------------------------------------+----------+
| prospect_list_id                     | count(*) |
+--------------------------------------+----------+
| 73ae6cca-7b34-c4a3-5500-5d0e2674dbb6 |     5666 |
+--------------------------------------+----------+

And of course I had not issued a START TRANSACTION; command beforehand. So no ROLLBACK!
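For the record, the safer pattern would have been to wrap the statement in an explicit transaction, roughly like this (the WHERE clause is only a placeholder here):

SQL> START TRANSACTION;
SQL> UPDATE prospect_lists_prospects
        SET prospect_list_id = '73ae6cca-7b34-c4a3-5500-5d0e2674dbb6'
      WHERE prospect_list_id = '...';
SQL> -- check "Rows matched" and the data, then
SQL> ROLLBACK;   -- or COMMIT; if the change really was intended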

Next look at the backup:

# ll backup/daily/bck_schema_crm_2019-06*
-rw-rw-r-- 1 mysql mysql  7900060 Jun  1 02:13 backup/daily/bck_schema_crm_2019-06-01_02-13-01.sql.gz
-rw-rw-r-- 1 mysql mysql  7900061 Jun  2 02:13 backup/daily/bck_schema_crm_2019-06-02_02-13-01.sql.gz
-rw-rw-r-- 1 mysql mysql  7900091 Jun  3 02:13 backup/daily/bck_schema_crm_2019-06-03_02-13-01.sql.gz
-rw-rw-r-- 1 mysql mysql  7903126 Jun  4 02:13 backup/daily/bck_schema_crm_2019-06-04_02-13-01.sql.gz
-rw-rw-r-- 1 mysql mysql  7903192 Jun  5 02:13 backup/daily/bck_schema_crm_2019-06-05_02-13-02.sql.gz
-rw-rw-r-- 1 mysql mysql  7903128 Jun  6 02:13 backup/daily/bck_schema_crm_2019-06-06_02-13-01.sql.gz
-rw-rw-r-- 1 mysql mysql  7912886 Jun 21 02:13 backup/daily/bck_schema_crm_2019-06-21_02-13-01.sql.gz
-rw-rw-r-- 1 mysql mysql  7920566 Jun 22 02:13 backup/daily/bck_schema_crm_2019-06-22_02-13-01.sql.gz

Yes! The backup is there and was done with the FromDual Backup Manager. So I am confident Restore and Point-in-Time-Recovery will work... But Point-in-Time-Recovery with the Binary Logs for just one schema is a bit tricky and not really officially supported.

So basically what I want to do is just undo this UPDATE command. But unfortunately this UPDATE was not trivially reversible (the old values cannot be derived from the statement itself). Then I remembered a presentation about MariaDB 10.2 New Features (p. 41) where the speaker was talking about the flashback functionality of the mysqlbinlog utility.

Undo MySQL Binary Log Events with MariaDB mysqlbinlog utility

First of all I analysed the MySQL Binary Log to find the Binary Log Events to undo:

# mysqlbinlog --start-position=348622898 --verbose mysql-bin.000080 | less
# at 348622898
#190622 23:19:43 server id 7  end_log_pos 348622969 CRC32 0xd358d264    Query   thread_id=791264        exec_time=0     error_code=0
SET TIMESTAMP=1561238383/*!*/;
BEGIN
/*!*/;
# at 348622969
#190622 23:19:43 server id 7  end_log_pos 348623049 CRC32 0x71340183    Table_map: `crm`.`prospect_lists_prospects` mapped to number 2857
# at 348623049
#190622 23:19:43 server id 7  end_log_pos 348631021 CRC32 0x53d65c9b    Update_rows: table id 2857
...
### UPDATE `crm`.`prospect_lists_prospects`
### WHERE
###   @1='ff700497-41cc-e530-a690-5d0e606cd942'
###   @2='b851169d-5e94-5c43-3593-5d0e2825d848'
###   @3='2078d1ae-f7b4-a082-38a5-5d0e581584fc'
###   @4='Prospects'
###   @5='2019-06-22 17:07:41'
###   @6=0
### SET
###   @1='ff700497-41cc-e530-a690-5d0e606cd942'
###   @2='73ae6cca-7b34-c4a3-5500-5d0e2674dbb6'
###   @3='2078d1ae-f7b4-a082-38a5-5d0e581584fc'
###   @4='Prospects'
###   @5='2019-06-22 17:07:41'
###   @6=0
# at 349828089
#190622 23:19:43 server id 7  end_log_pos 349828120 CRC32 0x83f41493    Xid = 8361402
COMMIT/*!*/;

So the relevant part in the MySQL Binary Log is between position 348622898 and 349828120.
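If the start position had not been known yet, the relevant events could first have been narrowed down by time, for example like this (timestamps taken from this incident; the exact positions are then read off the output):

# mysqlbinlog --start-datetime="2019-06-22 23:19:00" --stop-datetime="2019-06-22 23:20:00" \
    --base64-output=decode-rows --verbose mysql-bin.000080 | less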

Now let us try the reverse operation. But for this we have to solve a little problem. The database is a MySQL 5.7. But the feature --flashback is only available in MariaDB 10.2 and newer. So we have to bring either the MySQL 5.7 Binary Logs to the MariaDB mysqlbinlog utility or the MariaDB mysqlbinlog utility to the MySQL 5.7 Binary Logs.

For a first attempt I moved the MySQL 5.7 Binary Logs to a MariaDB 10.3 testing system and checked whether mixing Binary Logs and utility works at all:

# mysqlbinlog --start-position=348622898 --stop-position=349828120 -v mysql-bin.000080 | grep -c 'UPDATE `crm`.`prospect_lists_prospects`'
4686

Looks good! Exactly the number of row changes expected. Now let us look at the statements with --flashback:

# mysqlbinlog --flashback --start-position=348622898 --stop-position=349828120 mysql-bin.000080 -v | less
'/*!*/;
### UPDATE `crm`.`prospect_lists_prospects`
### WHERE
###   @1='ff700497-41cc-e530-a690-5d0e606cd942'
###   @2='73ae6cca-7b34-c4a3-5500-5d0e2674dbb6'
###   @3='2078d1ae-f7b4-a082-38a5-5d0e581584fc'
###   @4='Prospects'
###   @5='2019-06-22 17:07:41'
###   @6=0
### SET
###   @1='ff700497-41cc-e530-a690-5d0e606cd942'
###   @2='b851169d-5e94-5c43-3593-5d0e2825d848'
###   @3='2078d1ae-f7b4-a082-38a5-5d0e581584fc'
###   @4='Prospects'
###   @5='2019-06-22 17:07:41'
###   @6=0

Looks good! Seems to be the reverse query. And now let us do the final repair job:

# /home/mysql/product/mariadb-10.3/mysqlbinlog --flashback --start-position=348622898 --stop-position=349828120 mysql-bin.000080 \
| /home/mysql/product/mysql-5.7/bin/mysql --user=root --port=3320 --host=127.0.0.1 crm --force
ERROR 1193 (HY000) at line 21339: Unknown system variable 'check_constraint_checks'

The --force option was used to make the mysql utility continue even if an error occurs, which was the case in our scenario. This option should usually not be used. We had tried this step out on a testing system beforehand, so I knew what was happening and why this error occurs...
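An alternative to --force (sketched here under the assumption that the only incompatibility is the MariaDB-specific check_constraint_checks session variable named in the error message) would be to filter that statement out before piping the output into the MySQL 5.7 client:

# /home/mysql/product/mariadb-10.3/mysqlbinlog --flashback --start-position=348622898 --stop-position=349828120 mysql-bin.000080 \
| grep -v 'check_constraint_checks' \
| /home/mysql/product/mysql-5.7/bin/mysql --user=root --port=3320 --host=127.0.0.1 crm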

Now the final test on the repaired system shows the situation as it was before the accident:

SQL> SELECT IFNULL(prospect_list_id, 'Total:'), COUNT(*)
  FROM prospect_lists_prospects GROUP BY prospect_list_id
  WITH ROLLUP;
+--------------------------------------+----------+
| IFNULL(prospect_list_id, 'Total:')   | count(*) |
+--------------------------------------+----------+
| 1178ec2b-6aa9-43e4-a27e-5d0e264cac4c |       91 |
| 1bd03c4e-b3f3-b3eb-f237-5d0e26413ae9 |      946 |
| 1c0901f1-41b2-cf42-074d-5d0cdc12b47d |        5 |
| 21d9a74f-73af-9a5d-84ba-5d0e280772ef |      107 |
| 37230208-a431-f6d8-a428-5d0e28d9ec77 |      264 |
| 4b48da8a-33d9-4896-5000-5d0e287ffe39 |        3 |
| 5d06f6cc-3fe9-f501-b680-5d0ccfd19033 |        2 |
| 5e39a569-3213-cc64-496f-5d0e28e851c9 |        5 |
| 680a879c-ff3c-b955-c3b8-5d0e28c833c5 |      315 |
| 73ae6cca-7b34-c4a3-5500-5d0e2674dbb6 |      980 |
| 756c4803-dc73-dc09-b301-5d0e28e69546 |        2 |
| 8eb0ec25-6bbb-68de-d44f-5d0e262cd93d |      833 |
| 913861f0-a865-7c94-8109-5d0e28d714b6 |       12 |
| 96a10d6a-c10e-c945-eaeb-5d0e280aa16c |       74 |
| a43147a8-90f2-a5b3-5bcf-5d0e2862248a |       15 |
| ae869fb1-dd88-19c0-b0d6-538f7b7e329a |       20 |
| b57eb9ba-5a93-8570-5914-5d0e28d975a9 |       25 |
| b851169d-5e94-5c43-3593-5d0e2825d848 |      978 |
| be320e31-1a5b-fe86-09d7-5d0e28a0fd2e |        7 |
| c762abde-bc63-2383-ba30-5d0e28a714c9 |      160 |
| cbbd0ba7-dc25-f29f-36f4-5d0e287c3006 |       99 |
| d23490c8-99eb-f298-6aad-5d0e28e7fd4f |       52 |
| d5000593-836c-3679-ecb5-5d0e28dd076c |       57 |
| d81e9aae-ef60-fca2-7d99-5d0e269de1c0 |      421 |
| df768570-f9b8-2333-66c4-5a6768e34ed3 |        3 |
| e155d58a-19e8-5163-f846-5d0e282ba4b8 |       66 |
| f139b6a0-9598-0cd4-a204-5d0e28c2eccd |      120 |
| f165c48b-4fc1-b081-eee3-5d0cdd7947d5 |        4 |
| Total:                               |     5666 |
+--------------------------------------+----------+

Flashback of MySQL 5.7 Binary Logs with MariaDB 10.3 mysqlbinlog utility was successful!

If you want to learn more about Backup and Recovery strategies contact our MariaDB/MySQL consulting team or book one of our MariaDB/MySQL training classes.

by Shinguz at June 24, 2019 12:12 PM

June 21, 2019

Oli Sennhauser

Do not underestimate performance impacts of swapping on NUMA database systems

If your MariaDB or MySQL database system is swapping, this can have a significant impact on your database query performance! Further, it can also slow down your database shutdown and thus influence the whole reboot of your machine. This is especially painful if you have only short maintenance windows or if you do not want to spend the whole night on operational tasks.

When we do reviews of our customers' MariaDB or MySQL database systems, one of the items to check is Swap Space and swapping. With the free command you can find out whether your system has Swap Space enabled at all and how much of your Swap Space is used:

# free
              total        used        free      shared  buff/cache   available
Mem:       16106252     3300424      697284      264232    12108544    12011972
Swap:      31250428     1701792    29548636

With the command:

# swapon --show
NAME      TYPE       SIZE USED PRIO
/dev/sdb2 partition 29.8G 1.6G   -1

you can see on which disk drive your Swap Space is physically located. And with the following 3 commands you can find out whether your system is currently swapping or not:

# vmstat 1
procs ------------memory------------ ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff    cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 1701784 692580 355716 11757864    2   12   685   601  237  146  9  3 86  2  0
 0  0 1701784 692472 355720 11757840    0    0     0   196  679 2350  2  1 97  1  0
 0  0 1701784 692720 355728 11757332    0    0     0   104  606 2170  0  1 98  1  0

# sar -W 1
15:44:30     pswpin/s pswpout/s
15:44:31         0.00      0.00
15:44:32         0.00      0.00
15:44:33         0.00      0.00

# sar -S 1
15:43:02    kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
15:43:03     29548648   1701780      5.45     41552      2.44
15:43:04     29548648   1701780      5.45     41552      2.44
15:43:05     29548648   1701780      5.45     41552      2.44

Side note: Recent Linux distributions tend to use Swap Files instead of Swap Partitions. The performance impact seems to be negligible compared to the operational advantages of Swap Files... [ 1 ] [ 2 ] [ 3 ] [ 4 ]

What is Swap Space on a Linux system

Modern Operating Systems like Linux manage Virtual Memory (VM) which consists of RAM (fast) and disk (HDD very slow, SSD slow). If the Operating System runs short of fast RAM it tries to write some "old" pages to slow disk to get more free fast RAM for "new" pages and/or for the file system cache. This technique enables the Operating System to keep more and/or bigger processes running than would fit into physical RAM (overcommitment of RAM).
If one of those "old" pages is needed again it has to be swapped in, which technically is a physical random disk read (which is slow; this is also called a major page fault).
If such a page holds a MariaDB or MySQL database block, this disk read into RAM will slow down your SELECT queries, but also INSERT, UPDATE and DELETE when you do write queries. This can severely slow down, for example, your clean-up jobs which have to remove "old" data (possibly located on disk in Swap Space).

Sizing of Swap Space for database systems

A rule of thumb for Swap Space is: Always have Swap Space but never use it (disk is cheap nowadays)!

A reasonable Swap Space sizing for database systems is the following:

Amount of RAM              Swap Space
4 GiB of RAM or less       a minimum of 4 GiB of Swap Space (is this really a Database server?)
8 GiB to 16 GiB of RAM     a minimum of once the amount of RAM of Swap Space
24 GiB to 64 GiB of RAM    a minimum of half the amount of RAM of Swap Space
more than 64 GiB of RAM    a minimum of 32 GiB of Swap Space

If you have a close look at your Swap usage, monitor your Swap Space precisely and know exactly what you are doing, you can lower these values...

It is NOT recommended to disable Swap Space

Some people tend to disable Swap Space. We see this mainly in virtualized environments (virtual machines) and cloud servers. From the VM/Cloud administrator point of view I can even understand why they disable Swap. But from the MariaDB / MySQL DBA point of view this is a bad idea.

If you do a proper MariaDB / MySQL configuration (innodb_buffer_pool_size = 75% of RAM) the server should not swap a lot. But if you exaggerate with the memory configuration the system starts swapping heavily, until in the end the OOM killer is activated by Linux and kills the troublemaker (typically the database process). If you have sufficient Swap Space enabled you get some time to detect a bad database configuration and act accordingly. If you have Swap Space disabled completely you do not get this safety buffer and the OOM killer will act immediately and kill your database process when you run out of RAM. This really cannot be in the interest of the DBA.

Some literature to read further about Swap: In defence of swap: common misconceptions

Influence swapping - Swappiness

The Linux kernel documentation tells us the following about swappiness:

swappiness

This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase aggressiveness, lower values decrease the amount of swap. A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.

The default value is 60.

Source: Documentation for /proc/sys/vm/*

An informative article on StackExchange: Why is swappiness set to 60 by default?

To change your swappiness the following commands will help:

# sysctl vm.swappiness 
vm.swappiness = 60

# sysctl vm.swappiness=1

# sysctl vm.swappiness 
vm.swappiness = 1

To make these changes persistent you have to write them to a configuration file, depending on your Operating System:

#
# /etc/sysctl.d/99-sysctl.conf
#
vm.swappiness=1

Who is using the Swap Space?

For further analysing your Swap Space and to find who is using your Swap Space please see our article MariaDB and MySQL swap analysis.

What if your system is still swapping? - NUMA!

If you did everything correctly up to here and your system is still swapping you have possibly missed one point: NUMA systems behave a bit trickily with respect to databases and swapping. The first person who wrote extensively about this problem in the MySQL ecosystem was Jeremy Cole, back in 2010, in his two well-written articles.

What NUMA is you can find here: Non-uniform memory access.

If you have spent your money on an expensive multi-socket system, you can find out with the following command whether it exposes more than one NUMA node:

# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping:              1
CPU MHz:               2600.000
CPU max MHz:           2600.0000
CPU min MHz:           1200.0000
BogoMIPS:              5201.37
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0-13,28-41
NUMA node1 CPU(s):     14-27,42-55
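
If the numactl package is installed, the following command additionally shows how much memory (and how much of it is free) is attached to each NUMA node:

# numactl --hardware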

If you are now in the unfortunate situation of having such a huge box with several sockets you can do different things:

  • Configure your MariaDB / MySQL database to allocate memory evenly on both sockets with the parameter innodb_numa_interleave. This works since MySQL 5.6.27, MySQL 5.7.9 and MariaDB 10.2.4, but there were various bugs in this area in Debian and CentOS packages (e.g. #80288, #78953, #79354 and MDEV-18660).
  • Disable NUMA support in your BIOS (Node Interleaving = enabled). Then there is no NUMA presentation to the Operating System any more.
  • Start your MariaDB / MySQL database with numactl --interleave all as described here: MySQL and NUMA.
  • Set innodb_buffer_pool_size to 75% of half of your RAM. A pity if you have bought too much RAM.
  • Play around with the following Linux settings, which could help to decrease swapping: vm.zone_reclaim_mode=0 and kernel.numa_balancing=0 (see the combined sketch below).
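
As a rough combined sketch of the first and last options from this list (file names and the Buffer Pool value are examples only and must be adapted to your system):

#
# /etc/my.cnf.d/numa.cnf (example file name)
#
[mysqld]
innodb_buffer_pool_size = 96G   # example: ~75% of RAM on a 128 GiB box
innodb_numa_interleave  = ON

#
# /etc/sysctl.d/99-numa.conf (example file name)
#
vm.swappiness=1
vm.zone_reclaim_mode=0
kernel.numa_balancing=0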


by Shinguz at June 21, 2019 07:26 AM

May 26, 2019

Valeriy Kravchuk

MySQL Support Engineer's Chronicles, Issue #10

As promised, I am trying to write one blog post in this series per week. So, even though writing about InnoDB row formats took a lot of time and effort this weekend, I still plan to summarize my findings, questions, discussions, bugs and links I've collected over this week.

I've shared two links this week on Facebook that got a lot of comments (unlike links to my typical blog posts). The first one was to Marko Mäkelä's blog post at MariaDB.com, "InnoDB Quality Improvements in MariaDB Server". I do not see any comments (or any obvious way to comment) there, but the comments I've got at Facebook were mostly related to the statement that  
"We are no longer merging new MySQL features with MariaDB..."
noted in the text by Mark Callaghan, and to the idea that "InnoDB" is a trademark of Oracle, so using it to refer to a fork (one that is probably incompatible with the "upstream" InnoDB in too many ways since MariaDB 10.1) is wrong, as stated by Matt Lord and Sunny Bains. People in the comments mostly agree that a new name makes sense (there are more reasons to give it one now anyway than there were in the case of XtraDB by Percona), and we had a lot of nice and funny suggestions on Slack internally (FudDB was not among them, as this has been a registered trademark of Miguel Angel Nieto for many years already). We shall see how this may end up, but I would not be surprised by a new name announced soon. I suggest you read the comments in any case if you have a Facebook account, as many of them are interesting.

The second link was to Kaj Arnö's post at mariadb.org, "On Contributions, Pride and Cockiness". It's worth checking just because of Monty's photo there. Laurynas Biveinis stated in the comments that any comparison of the number of pull requests (open and processed) is meaningless when the development models used by the parties are different (closed, with contributions coming mostly via bug reports, in the case of Oracle; or all changes, external and internal, coming via pull requests, in the case of Percona). MariaDB uses a mix of a kind, where some contributions from contractors come via pull requests, while engineers from MariaDB Corporation work on the GitHub sources of MariaDB Server directly. Anyway (meaningless statistics aside), MariaDB seems to be the easiest target for contributions from the Community at the moment, and nobody argued against that. My followers also agreed that the same workflow for internal and external contributions is the preferred development model in an ideal world.

This kind of public discussion of (serious and funny) MySQL-related matters on Facebook (along with public discussions of MySQL bugs) makes me think the way I use my Facebook page is proper and good for mankind.

Now back to notes made while working on Support issues. This week I had to explain one case where a MariaDB server was shut down normally (but unexpectedly for the DBA):
2019-05-22 10:37:55 0 [Note] /usr/libexec/mysqld (initiated by: unknown): Normal shutdown
This Percona blog post summarizes different ways to find the process which sent a HUP/KILL/TERM or other signal to the mysqld process. I have successfully used a SystemTap-based solution like the one suggested in that blog post in the past. In this context I also find this summary of the ways to force MySQL to fail useful, for all kinds of testing. The SELinux manual is also useful to re-read at times.
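One approach commonly used for this (mentioned here only as an illustration, not necessarily what that post suggests) is to let the Linux audit subsystem record kill() syscalls, so the sender of the next unexpected signal can be identified; the rule key name below is arbitrary:
auditctl -a always,exit -F arch=b64 -S kill -k mysqld_signals
ausearch -k mysqld_signals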

This week I've spent a lot of time and some effort trying to reproduce the error (1942 and/or 1940, if anyone cares) on a Galera node acting as an async replication slave. These efforts ended up with a bug report, MDEV-19572. Surely the idea to replicate MyISAM tables outside of the mysql database to a Galera cluster is bad on multiple levels, but why does the error appear only after running normally for a long time? In the process of testing I was reading various remotely related posts, so I checked this and that... I also hit other problems in the process, like this crash that probably happened while unintentionally sending some signal to the node:
190523 17:19:46 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.2.23-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=3
max_threads=153
thread_count=65544
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467240 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/home/openxs/dbs/maria10.2/bin/mysqld(my_print_stacktrace+0x29)[0x7f6475eb5b49]
/home/openxs/dbs/maria10.2/bin/mysqld(handle_fatal_signal+0x33d)[0x7f64759d50fd]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f6473887330]
/home/openxs/dbs/maria10.2/bin/mysqld(+0xb3b817)[0x7f6475ebc817]
/home/openxs/dbs/maria10.2/bin/mysqld(+0xb3b9e6)[0x7f6475ebc9e6]
/home/openxs/dbs/maria10.2/bin/mysqld(+0xb3bb8a)[0x7f6475ebcb8a]
/home/openxs/dbs/maria10.2/bin/mysqld(lf_hash_delete+0x61)[0x7f6475ebcfa1]
/home/openxs/dbs/maria10.2/bin/mysqld(+0x601eed)[0x7f6475982eed]
include/my_atomic.h:298(my_atomic_storeptr)[0x7f6475983464]
sql/table_cache.cc:534(tdc_delete_share_from_hash)[0x7f6475811f17]
sql/table_cache.cc:708(tdc_purge(bool))[0x7f64759351ea]
sql/sql_base.cc:376(close_cached_tables(THD*, TABLE_LIST*, bool, unsigned long))[0x7f64757c9ec7]
nptl/pthread_create.c:312(start_thread)[0x7f647387f184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6472d8c03d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
So far I was not able to find the exact same backtrace in any known MariaDB bug, so one day I'll have to try to reproduce this crash as well.

I try to check some MariaDB ColumnStore issues from time to time, for a change, and this week I ended up reading this KB page while trying to understand how much we can control the placement of data there.

Finally, for the record, this is the way to "fix" InnoDB statistics if needed (and the need is real, as you can find out from Bug #95507 - "innodb_stats_method is not honored when innodb_stats_persistent=ON", reported by my colleague Sergei Petrunia):
update mysql.innodb_index_stats set last_update=now(), stat_value=445000000 where database_name='test' and table_name='t1' and index_name='i1' and stat_name='n_diff_pfx01';
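If I remember the persistent statistics handling correctly, the manually changed values are only picked up once the table definition is reloaded, so something like the following is usually needed afterwards (same table as in the example above):
flush table test.t1;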
I like to return to familiar nice places and topics, like Regent's Canal or MySQL bugs...
Last but not least, MySQL bugs. This week I've subscribed to the following (already "Verified") interesting bug reports (besides the one mentioned above):
  • Bug #95484 - "EXCHANGE PARTITION works wrong/weird with different ROW_FORMAT.". Jean-François Gagné found out that there is a way to have partitions with different row_format values in the same InnoDB table, at least in MySQL 5.7. So why is this not supported officially? See also his Bug #95478 - "CREATE TABLE LIKE does not honour ROW_FORMAT.". It's a week of ROW_FORMAT studies for me, for sure!
  • Bug #95462 - "Data comparison broke in MySQL 8.0.16". It's common knowledge how much I like regression bugs. MySQL 8.0.16 introduced a new one, reported by Raman Haran, probably based on some good and valid intentions. But undocumented changes in behavior in GA versions are hardly acceptable, no matter what the intentions are.
That's all for now. Some more links to MySQL bugs from me are always available on Twitter.

by Valerii Kravchuk (noreply@blogger.com) at May 26, 2019 06:54 PM

On Importing InnoDB Tablespaces and Row Formats

Let me start with a short summary and then proceed with a long story, code snippets, hexdumps, links and awk functions converted from the source code of MariaDB server. This blog post can be summarized as follows:
  • One can find the row_format used to create a table explicitly in the .frm file (or in the outputs of SHOW CREATE TABLE or SHOW TABLE STATUS). The Internals manual may help to find out where it is stored, and reading the source code helps to find the way to interpret the values.
  • For InnoDB tables created without specifying the row_format explicitly, neither a logical backup nor the .frm file itself contains the information about the row format used. There are 4 of them (Redundant, Compact, Dynamic and Compressed). The one used implicitly is defined by the current value of innodb_default_row_format, which may change dynamically.
  • At the .ibd file level there is no (easy) way to distinguish Redundant from Compact; this detail has to come from elsewhere. If the source table's row format has NOT changed you can find it in information_schema.innodb_sys_tables (or innodb_tables in the case of MySQL 8), or in the output of SHOW TABLE STATUS.
  • There is an easy enough way to check tablespace level flags in the .ibd file (sample awk functions/script are presented below) and this helps to find out that the row format was Compressed or Dynamic.
  • So far in basic cases (encryption etc aside) individual .ibd files for InnoDB tables from MariaDB (even 10.3.x) and MySQL 8.0.x are compatible enough.
  • You have to take all the above into account while importing individual tables to do partial restore or copy/move tablespaces from one database to the other.
  • Some useful additional reading and links may be found in MariaDB bug reports MDEV-19523 and MDEV-15049. Yes, reading MariaDB MDEVs may help MySQL DBAs to understand some things better!
Now the real story.
I miss London, so I am going to be there on June 13 to participate in the Open Databases Meetup. Should I speak about importing InnoDB tablespaces there?

* * *
This is a long enough blog post about a "research" I had to do while working in Support recently. It all started with a question like this in a support issue earlier in May:
"Is it somehow possible to extract ROW_FORMAT used from a table in a backup in XtraBackup format?"
The context was importing a tablespace for an InnoDB table and error 1808, "Schema mismatch"; the customer hoped to find out the proper format without import attempts, in some way that can be scripted easily. When one tries to import an .ibd file with a format that does not match the .frm file or the data dictionary content, one gets a very clear message in MariaDB (that presents all the details) due to the fix in MDEV-16851, but the idea was to avoid the trial and error path entirely.

There were several ideas on how to proceed. Given the .frm file, one could use the mysqlfrm utility (MySQL Utilities, which are only under Sustaining Support by Oracle, can still be found) to get the full CREATE TABLE from the .frm. But I was sure that just checking ROW_FORMAT should be easier than that. (A later test of the latest mysqlfrm I could get running on Fedora 29 proved that it was a good idea to avoid it, due to some problems I may write about one day.) The fine MySQL Internals Manual clearly describes the .frm file format and shows that at offset 0x28 in the header section we have row_type encoded as one byte:
0028 1 00 create_info->row_type
A quick search in the source code ended up with the following, defined in sql/handler.h (links refer to MariaDB code, but the idea is the same for MySQL as well):
enum row_type { ROW_TYPE_NOT_USED=-1, ROW_TYPE_DEFAULT, ROW_TYPE_FIXED,
                ROW_TYPE_DYNAMIC, ROW_TYPE_COMPRESSED,
                ROW_TYPE_REDUNDANT, ROW_TYPE_COMPACT, ROW_TYPE_PAGE };
The rest looked clear at that moment. We should see decimal values from 2 to 5 at offset 0x28 (decimal 40) from the beginning of the .frm file, representing the row formats supported by InnoDB. I quickly created a set of tables with different row formats:
MariaDB [test]> create table ti1(id int primary key, c1 int) engine=InnoDB row_format=redundant;
Query OK, 0 rows affected (0.147 sec)

MariaDB [test]> create table ti2(id int primary key, c1 int) engine=InnoDB row_format=compact;
Query OK, 0 rows affected (0.145 sec)

MariaDB [test]> create table ti3(id int primary key, c1 int) engine=InnoDB row_format=dynamic;
Query OK, 0 rows affected (0.149 sec)

MariaDB [test]> create table ti4(id int primary key, c1 int) engine=InnoDB row_format=compressed;
Query OK, 0 rows affected (0.130 sec)

MariaDB [test]> create table ti5(id int primary key, c1 int) engine=InnoDB;    
Query OK, 0 rows affected (0.144 sec)

MariaDB [test]> insert into ti5 values(5,5);
Query OK, 1 row affected (0.027 sec)
and checked the content of the .frm files with hexdump:
[openxs@fc29 maria10.3]$ hexdump -C data/test/ti1.frm | more
00000000  fe 01 0a 0c 12 00 56 00  01 00 b2 03 00 00 f9 01  |......V.........|
00000010  09 00 00 00 00 00 00 00  00 00 00 02 21 00 08 00  |............!...|
00000020  00 05 00 00 00 00 08 00  04 00 00 00 00 00 00 f9  |................|
...
As you can see, we get the expected value 04 for ROW_TYPE_REDUNDANT of the table ti1. After that it's easy to come up with a command line that just shows the numeric row format, like this:
[openxs@fc29 server]$ hexdump --skip 40 --length=1 ~/dbs/maria10.3/data/test/ti1.frm | awk '{print $2}'
0004
or even better:
[openxs@fc29 maria10.3]$ hexdump -C data/test/ti1.frm | awk '/00000020/ {print $10}'
04
[openxs@fc29 maria10.3]$ hexdump -C data/test/ti2.frm | awk '/00000020/ {print $10}'
05
[openxs@fc29 maria10.3]$ hexdump -C data/test/ti3.frm | awk '/00000020/ {print $10}'
02
[openxs@fc29 maria10.3]$ hexdump -C data/test/ti4.frm | awk '/00000020/ {print $10}'
03
[openxs@fc29 maria10.3]$ hexdump -C data/test/ti5.frm | awk '/00000020/ {print $10}'
00
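If one wanted to script this a bit further, here is a minimal sketch (gawk with strtonum assumed; the function name is made up) that maps that byte back to the name from the row_type enum in sql/handler.h:
frm_row_type() {
    # read the row_type byte at offset 0x28 of the .frm file and map it to its name
    local v=$(hexdump -C "$1" | awk '/^00000020/ {print strtonum("0x"$10)}')
    local names=(DEFAULT FIXED DYNAMIC COMPRESSED REDUNDANT COMPACT PAGE)
    echo "${names[$v]:-UNKNOWN}"
}
frm_row_type data/test/ti1.frm    # should print REDUNDANT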
But in the real customer case there was no problem with tables created with explicit row_format set (assuming the correct .frm was in place). The problem was with tables like ti5 above, i.e. those created with the default row format:
MariaDB [test]> show variables like 'innodb%format';
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| innodb_default_row_format | dynamic |
| innodb_file_format        |         |
+---------------------------+---------+
2 rows in set (0.001 sec)
In the .frm file (and in the SHOW CREATE TABLE output) the format is NOT set, it's the default, 0 (or 0x00 in hex). The problem happens when we try to import such a table into an instance with a different innodb_default_row_format. Consider the following test case:
[openxs@fc29 maria10.3]$ bin/mysql --socket=/tmp/mariadb.sock -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 9
Server version: 10.3.15-MariaDB Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [test]> create database test2;
Query OK, 1 row affected (0.000 sec)

MariaDB [test]> use test2
Database changed
MariaDB [test2]> set global innodb_default_row_format=compact;
Query OK, 0 rows affected (0.000 sec)

MariaDB [test2]> create table ti0(id int primary key, c1 int) engine=InnoDB;
Query OK, 0 rows affected (0.165 sec)

MariaDB [test2]> show create table ti0\G
*************************** 1. row ***************************
       Table: ti0
Create Table: CREATE TABLE `ti0` (
  `id` int(11) NOT NULL,
  `c1` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.000 sec)

MariaDB [test2]> \! hexdump -C data/test2/ti0.frm | awk '/00000020/ {print $10}'
00
In this test we create a new table, ti0, in another database, test2, that has ROW_TYPE_DEFAULT (0) in the .frm file, same as the test.ti5 table created above. But then we try to import the ti5 tablespace, by first exporting it properly in another session:
MariaDB [test]> flush tables ti5 for export;
Query OK, 0 rows affected (0.001 sec)
and then discarding the original test2.ti0 tablespace, copying the .ibd and .cfg files (with proper renaming) and running ALTER TABLE ... IMPORT TABLESPACE:
MariaDB [test2]> alter table ti0 discard tablespace;
Query OK, 0 rows affected (0.058 sec)

MariaDB [test2]> \! cp data/test/ti5.cfg data/test2/ti0.cfg
MariaDB [test2]> \! cp data/test/ti5.ibd data/test2/ti0.ibd
MariaDB [test2]> alter table ti0 import tablespace;
ERROR 1808 (HY000): Schema mismatch (Table flags don't match, server table has 0x1 and the meta-data file has 0x21; .cfg file uses ROW_FORMAT=DYNAMIC)
we fail with error 1808 (which has all the details about the original table's row format, DYNAMIC, and hex information about some flags that are different). We failed because now innodb_default_row_format is different, it's COMPACT!


We cannot fool the target server by removing (or not copying) the non-mandatory .cfg file:
MariaDB [test2]> \! rm data/test2/ti0.cfg
MariaDB [test2]> alter table ti0 import tablespace;
ERROR 1808 (HY000): Schema mismatch (Expected FSP_SPACE_FLAGS=0x0, .ibd file contains 0x21.)
Now we see slightly different text, but the same error 1808. The real row format of an InnoDB table is stored somewhere in the .ibd file. As you can guess, copying the .frm file (as may happen when we copy back files from an Xtrabackup- or mariabackup-based backup to do a partial restore) also does not help - the files had the same row_format anyway and we verified that. So, the real row format of an InnoDB table is also stored somewhere in InnoDB (the data dictionary). When it does not match the one we see in the .ibd file we get error 1808.

How to resolve this error? There are two ideas to explore (assuming we found the real format in .ibd file somehow):
  1. Try to create target table with proper row_format and then import.
  2. Set innodb_default_row_format properly and create target table without explicit row format set, and then import.
The first one works, as one can find out (but we will surely end up with a different .frm file than the original table had). Check these:
MariaDB [test2]> select * from test.ti5;
+----+------+
| id | c1   |
+----+------+
|  5 |    5 |
+----+------+
1 row in set (0,001 sec)

MariaDB [test2]> alter table ti0 discard tablespace;
Query OK, 0 rows affected (0,066 sec)

MariaDB [test2]> \! cp data/test/ti5.ibd data/test2/ti0.ibd
MariaDB [test2]> alter table ti0 import tablespace;
ERROR 1808 (HY000): Schema mismatch (Expected FSP_SPACE_FLAGS=0x0, .ibd file contains 0x21.)
MariaDB [test2]> \! cp data/test/ti5.cfg data/test2/ti0.cfg
MariaDB [test2]> alter table ti0 import tablespace;
ERROR 1808 (HY000): Schema mismatch (Table flags don't match, server table has 0x1 and the meta-data file has 0x21; .cfg file uses ROW_FORMAT=DYNAMIC)
MariaDB [test2]> drop table ti0;
Query OK, 0 rows affected (0,168 sec)
So, if you care to understand the flags (we'll work on that below) or care to copy the .cfg file as well, you surely can get the row format of the table. Now let's re-create ti0 with an explicitly defined Dynamic row format and try to import again:
MariaDB [test2]> create table ti0(id int primary key, c1 int) engine=InnoDB row_format=Dynamic;
Query OK, 0 rows affected (0,241 sec)

MariaDB [test2]> alter table ti0 discard tablespace;
Query OK, 0 rows affected (0,071 sec)

MariaDB [test2]> \! cp data/test/ti5.ibd data/test2/ti0.ibd
MariaDB [test2]> alter table ti0 import tablespace;
Query OK, 0 rows affected, 1 warning (0,407 sec)

MariaDB [test2]> show warnings\G
*************************** 1. row ***************************
  Level: Warning
   Code: 1810
Message: IO Read error: (2, No such file or directory) Error opening './test2/ti0.cfg', will attempt to import without schema verification
1 row in set (0,000 sec)

MariaDB [test2]> select * from ti0;
+----+------+
| id | c1   |
+----+------+
|  5 |    5 |
+----+------+
1 row in set (0,001 sec)
We see that copying the .cfg file is not really mandatory and that explicitly setting ROW_FORMAT (assuming the .frm file is NOT copied) works.

The second idea also surely works (and the customer, in his trial and error attempts, just tried all possible formats until the import was successful). Luckily, from the first error we know the original format used for sure:
MariaDB [test2]> drop table ti0;
Query OK, 0 rows affected (0.084 sec)

MariaDB [test2]> set global innodb_default_row_format=dynamic;
Query OK, 0 rows affected (0.000 sec)

MariaDB [test2]> create table ti0(id int primary key, c1 int) engine=InnoDB;
Query OK, 0 rows affected (0.171 sec)

MariaDB [test2]> alter table ti0 discard tablespace;
Query OK, 0 rows affected (0.049 sec)

MariaDB [test2]> \! cp data/test/ti5.cfg data/test2/ti0.cfg
MariaDB [test2]> \! cp data/test/ti5.ibd data/test2/ti0.ibd
MariaDB [test2]> alter table ti0 import tablespace;
Query OK, 0 rows affected (0.307 sec)

MariaDB [test2]> select * from ti0;
+----+------+
| id | c1   |
+----+------+
|  5 |    5 |
+----+------+
1 row in set (0.000 sec)

MariaDB [test2]> show create table ti0\G
*************************** 1. row ***************************
       Table: ti0
Create Table: CREATE TABLE `ti0` (
  `id` int(11) NOT NULL,
  `c1` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.000 sec)
Now we can proceed with UNLOCK TABLES in the other session where we flushed test.ti5 for export.

How could we find out the row format to use without trial and error, now that we know that in one specific case the .frm file (or even the CREATE TABLE statement shown by the server or mysqldump) does not contain it?

First of all we could try to save this information (select @@innodb_default_row_format) along with the backup. But that would show the value of this variable at the moment of asking, and it could have been different when the specific table was created. So this does not work in the general case.

We could use SHOW TABLE STATUS also, as follows:
MariaDB [test]> show create table ti5\G
*************************** 1. row ***************************
       Table: ti5
Create Table: CREATE TABLE `ti5` (
  `id` int(11) NOT NULL,
  `c1` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.001 sec)

MariaDB [test]> show table status like 'ti5'\G
*************************** 1. row ***************************
            Name: ti5
          Engine: InnoDB
         Version: 10
      Row_format: Dynamic
...
In the example above that table was created without setting row_format explicitly, but we see the real one used in the output of SHOW TABLE STATUS. So, if we cared enough, this kind of output could be saved when the data were backed up or exported.

Then we could try to get it for each table from the InnoDB data dictionary of the system we get the .ibd files from. In older MySQL versions we'd probably have to dig into the real data dictionary tables on disk, but in any recent MySQL (up to 5.7; 8.0 may be somewhat different due to the new data dictionary) or MariaDB we have a convenient, SQL-based way to get this information. There are two INFORMATION_SCHEMA tables to consider: INNODB_SYS_TABLESPACES and INNODB_SYS_TABLES. The first one is not good enough, as it considers Compact and Redundant row formats the same (even though the fine MySQL Manual does not say this):
MariaDB [test]> select * from information_schema.innodb_sys_tablespaces where name like '%ti%';
+-------+----------------------------+------+----------------------+-----------+---------------+------------+---------------+-----------+----------------+
| SPACE | NAME                       | FLAG | ROW_FORMAT           | PAGE_SIZE | ZIP_PAGE_SIZE | SPACE_TYPE | FS_BLOCK_SIZE | FILE_SIZE | ALLOCATED_SIZE |
+-------+----------------------------+------+----------------------+-----------+---------------+------------+---------------+-----------+----------------+
|     3 | mysql/transaction_registry |   33 | Dynamic              |     16384 |         16384 | Single     |          4096 |    147456 |         147456 |
|     4 | mysql/gtid_slave_pos       |   33 | Dynamic              |     16384 |         16384 | Single     |          4096 |     98304 |          98304 |
|     6 | test/ti1                   |    0 | Compact or Redundant |     16384 |         16384 | Single     |          4096 |     98304 |          98304 |
|     7 | test/ti2                   |    0 | Compact or Redundant |     16384 |         16384 | Single     |          4096 |     98304 |          98304 |
|     8 | test/ti3                   |   33 | Dynamic              |     16384 |         16384 | Single     |          4096 |     98304 |          98304 |
|     9 | test/ti4                   |   41 | Compressed           |     16384 |          8192 | Single     |          4096 |     65536 |          65536 |
|    10 | test/ti5                   |   33 | Dynamic              |     16384 |         16384 | Single     |          4096 |     98304 |          98304 |
+-------+----------------------------+------+----------------------+-----------+---------------+------------+---------------+-----------+----------------+
7 rows in set (0.000 sec)
The second one works perfectly:
MariaDB [test2]> select * from information_schema.innodb_sys_tables where name like '%ti%';
+----------+----------------------------+------+--------+-------+------------+---------------+------------+
| TABLE_ID | NAME                       | FLAG | N_COLS | SPACE | ROW_FORMAT | ZIP_PAGE_SIZE | SPACE_TYPE |
+----------+----------------------------+------+--------+-------+------------+---------------+------------+
|       19 | mysql/gtid_slave_pos       |   33 |      7 |     4 | Dynamic    |             0 | Single     |
|       18 | mysql/transaction_registry |   33 |      8 |     3 | Dynamic    |             0 | Single     |
|       21 | test/ti1                   |    0 |      5 |     6 | Redundant  |             0 | Single     |
|       22 | test/ti2                   |    1 |      5 |     7 | Compact    |             0 | Single     |
|       23 | test/ti3                   |   33 |      5 |     8 | Dynamic    |             0 | Single     |
|       24 | test/ti4                   |   41 |      5 |     9 | Compressed |          8192 | Single     |
|       25 | test/ti5                   |   33 |      5 |    10 | Dynamic    |             0 | Single     |
|       26 | test2/ti0                  |    1 |      5 |    11 | Compact    |             0 | Single     |
+----------+----------------------------+------+--------+-------+------------+---------------+------------+
8 rows in set (0.000 sec)
In the table above I was wondering about the exact values in the FLAG column (note that 33, 0x21 in hex, looks familiar from the error messages in the previous examples). The MySQL Manual says just this:
"A numeric value that represents bit-level information about tablespace format and storage characteristics."
MariaDB's KB page is now way more detailed after my bug report, MDEV-19523, was closed. See the link for the details, or check the code of the i_s_dict_fill_sys_tables() function if you want to interpret the data properly:
/**********************************************************************//**
Populate information_schema.innodb_sys_tables table with information
from SYS_TABLES.
@return 0 on success */
static
int
i_s_dict_fill_sys_tables(
/*=====================*/
    THD*        thd,        /*!< in: thread */
    dict_table_t*    table,        /*!< in: table */
    TABLE*        table_to_fill)    /*!< in/out: fill this table */
{
    Field**          fields;
    ulint            compact = DICT_TF_GET_COMPACT(table->flags);
    ulint            atomic_blobs = DICT_TF_HAS_ATOMIC_BLOBS(
                                table->flags);
    const ulint zip_size = dict_tf_get_zip_size(table->flags);
    const char*        row_format;

    if (!compact) {
        row_format = "Redundant";
    } else if (!atomic_blobs) {
        row_format = "Compact";
    } else if (DICT_TF_GET_ZIP_SSIZE(table->flags)) {
        row_format = "Compressed";
    } else {
        row_format = "Dynamic";
    }
...
Another part of the code shows how the checks above are performed:
#define DICT_TF_GET_COMPACT(flags) \
        ((flags & DICT_TF_MASK_COMPACT) \
        >> DICT_TF_POS_COMPACT)
/** Return the value of the ZIP_SSIZE field */
#define DICT_TF_GET_ZIP_SSIZE(flags) \
        ((flags & DICT_TF_MASK_ZIP_SSIZE) \
        >> DICT_TF_POS_ZIP_SSIZE)
/** Return the value of the ATOMIC_BLOBS field */
#define DICT_TF_HAS_ATOMIC_BLOBS(flags) \
        ((flags & DICT_TF_MASK_ATOMIC_BLOBS) \
        >> DICT_TF_POS_ATOMIC_BLOBS)
...
We miss masks and flags to double check (in the same storage/innobase/include/dict0mem.h file):
/** Width of the COMPACT flag */
#define DICT_TF_WIDTH_COMPACT        1

/** Width of the ZIP_SSIZE flag */
#define DICT_TF_WIDTH_ZIP_SSIZE        4

/** Width of the ATOMIC_BLOBS flag.  The ROW_FORMAT=REDUNDANT and
ROW_FORMAT=COMPACT broke up BLOB and TEXT fields, storing the first 768 bytes
in the clustered index. ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPRESSED
store the whole blob or text field off-page atomically.
Secondary indexes are created from this external data using row_ext_t
to cache the BLOB prefixes. */
#define DICT_TF_WIDTH_ATOMIC_BLOBS    1

...

/** Zero relative shift position of the COMPACT field */
#define DICT_TF_POS_COMPACT        0
/** Zero relative shift position of the ZIP_SSIZE field */
#define DICT_TF_POS_ZIP_SSIZE        (DICT_TF_POS_COMPACT        \
                    + DICT_TF_WIDTH_COMPACT)
/** Zero relative shift position of the ATOMIC_BLOBS field */
#define DICT_TF_POS_ATOMIC_BLOBS    (DICT_TF_POS_ZIP_SSIZE        \
+ DICT_TF_WIDTH_ZIP_SSIZE)
If we do some basic math we can find out that DICT_TF_POS_ZIP_SSIZE is 1 and DICT_TF_POS_ATOMIC_BLOBS is 5, etc. The masks are defined as:
/** Bit mask of the COMPACT field */
#define DICT_TF_MASK_COMPACT                \
        ((~(~0U << DICT_TF_WIDTH_COMPACT))    \
        << DICT_TF_POS_COMPACT)
/** Bit mask of the ZIP_SSIZE field */
#define DICT_TF_MASK_ZIP_SSIZE                \
        ((~(~0U << DICT_TF_WIDTH_ZIP_SSIZE))    \
        << DICT_TF_POS_ZIP_SSIZE)
/** Bit mask of the ATOMIC_BLOBS field */
#define DICT_TF_MASK_ATOMIC_BLOBS            \
        ((~(~0U << DICT_TF_WIDTH_ATOMIC_BLOBS))    \
        << DICT_TF_POS_ATOMIC_BLOBS)
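With these widths and positions, the flag values seen earlier can already be decoded by hand as a quick sanity check (the zip page size formula 512 << ZIP_SSIZE is my reading of the code):
flags = 33 = 0b100001 : COMPACT = 1, ZIP_SSIZE = 0, ATOMIC_BLOBS = 1  -> Dynamic
flags = 41 = 0b101001 : COMPACT = 1, ZIP_SSIZE = 4, ATOMIC_BLOBS = 1  -> Compressed
                        (zip page size 512 << 4 = 8192, matching ZIP_PAGE_SIZE above)
flags =  1 = 0b000001 : COMPACT = 1, ZIP_SSIZE = 0, ATOMIC_BLOBS = 0  -> Compact
flags =  0 = 0b000000 : COMPACT = 0                                   -> Redundant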

Basically we have what we need now, bit positions and masks. We can create a function that returns the row format based on the decimal value of the flags. Consider this primitive awk example:
openxs@ao756:~/dbs/maria10.3$ awk '
> function DICT_TF_GET_COMPACT(flags) {
>   return rshift(and(flags, DICT_TF_MASK_COMPACT), DICT_TF_POS_COMPACT);
> }
>
> function DICT_TF_GET_ZIP_SSIZE(flags)
> {
>   return rshift(and(flags, DICT_TF_MASK_ZIP_SSIZE), DICT_TF_POS_ZIP_SSIZE);
> }
>
> function DICT_TF_HAS_ATOMIC_BLOBS(flags)
> {
>   return rshift(and(flags, DICT_TF_MASK_ATOMIC_BLOBS), DICT_TF_POS_ATOMIC_BLOBS);
> }
>
> function innodb_row_format(flags)
> {
>     compact = DICT_TF_GET_COMPACT(flags);
>     atomic_blobs = DICT_TF_HAS_ATOMIC_BLOBS(flags);
>
>     if (!compact) {
>         row_format = "Redundant";
>     } else if (!atomic_blobs) {
>         row_format = "Compact";
>     } else if (DICT_TF_GET_ZIP_SSIZE(flags)) {
>         row_format = "Compressed";
>     } else {
>         row_format = "Dynamic";
>     }
>     return row_format;
> }
>
> BEGIN {
> DICT_TF_WIDTH_COMPACT=1;
> DICT_TF_WIDTH_ZIP_SSIZE=4;
> DICT_TF_WIDTH_ATOMIC_BLOBS=1;
>
> DICT_TF_POS_COMPACT=0;
> DICT_TF_POS_ZIP_SSIZE=DICT_TF_POS_COMPACT + DICT_TF_WIDTH_COMPACT;
> DICT_TF_POS_ATOMIC_BLOBS=DICT_TF_POS_ZIP_SSIZE + DICT_TF_WIDTH_ZIP_SSIZE;
>
> DICT_TF_MASK_COMPACT=lshift(compl(lshift(compl(0), DICT_TF_WIDTH_COMPACT)),DICT_TF_POS_COMPACT);
> DICT_TF_MASK_ZIP_SSIZE=lshift(compl(lshift(compl(0), DICT_TF_WIDTH_ZIP_SSIZE)),DICT_TF_POS_ZIP_SSIZE);
> DICT_TF_MASK_ATOMIC_BLOBS=lshift(compl(lshift(compl(0), DICT_TF_WIDTH_ATOMIC_BLOBS)),DICT_TF_POS_ATOMIC_BLOBS);
>
> print innodb_row_format(0), innodb_row_format(1), innodb_row_format(33), innodb_row_format(41);
> }'
Redundant Compact Dynamic Compressed
openxs@ao756:~/dbs/maria10.3$
 
So, we know how to get the format based on the decimal value of the flags. The remaining subtask is to find out where the flags are located in the .ibd file. Instead of digging into the code (server/storage/innobase/include/fsp0fsp.h etc.) one can just check this great blog post by Jeremy Cole to find out that the flags are at bytes 54-57, that is 16 bytes after the FIL header which is 38 bytes long (4 bytes starting at hex offset 0x36 in the .ibd file). In the hexdumps below these are the bytes on the 00000030 line:

[openxs@fc29 maria10.3]$ hexdump -C data/test/ti2.ibd | more
00000000  5d 4f 09 aa 00 00 00 00  00 00 00 00 00 00 00 00  |]O..............|
00000010  00 00 00 00 00 19 11 ee  00 08 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 07 00 00  00 07 00 00 00 00 00 00  |................|
00000030  00 06 00 00 00 40 00 00  00 00 00 00 00 04 00 00  |.....@..........|
...


[openxs@fc29 maria10.3]$ hexdump -C data/test/ti4.ibd | more
00000000  6c cd 19 15 00 00 00 00  00 00 00 00 00 00 00 00  |l...............|
00000010  00 00 00 00 00 19 44 9f  00 08 00 00 00 00 00 00  |......D.........|
00000020  00 00 00 00 00 09 00 00  00 09 00 00 00 00 00 00  |................|
00000030  00 06 00 00 00 40 00 00  00 29 00 00 00 04 00 00  |.....@...)......|
...


[openxs@fc29 maria10.3]$ hexdump -C data/test/ti5.ibd | more
00000000  d8 21 6d 2e 00 00 00 00  00 00 00 00 00 00 00 00  |.!m.............|
00000010  00 00 00 00 00 19 62 9d  00 08 00 00 00 00 00 00  |......b.........|
00000020  00 00 00 00 00 0a 00 00  00 0a 00 00 00 00 00 00  |................|
00000030  00 06 00 00 00 40 00 00  00 21 00 00 00 04 00 00  |.....@...!......|
...
As you can see we have the hex values 0x00, 0x29 (41 decimal), 0x21 (33 decimal) etc., and, theoretically, we can find out the exact row_format used (and other details) from that, based on the information presented above. For the row format we need just one byte and we can get it as follows in hex:
openxs@ao756:~/dbs/maria10.3$ hexdump -C data/test/t*.ibd | awk '/00000030/ {print $11}'
21
openxs@ao756:~/dbs/maria10.3$ hexdump -C data/test/t*.ibd | awk '/00000030/ {flags=strtonum("0x"$11); print flags;}'
33
To use the awk functions defined above we need to convert hex to decimal, hence the small trick with the strtonum() function. Now, let me put it all together and show that we can apply this to MySQL as well (I mostly checked MariaDB code in the process). Let me create the same tables ti1 ... ti5 in MySQL 8.0.x:
openxs@ao756:~/dbs/8.0$ bin/mysqld_safe --no-defaults --basedir=/home/openxs/dbs/8.0 --datadir=/home/openxs/dbs/8.0/data --port=3308 --socket=/tmp/mysql8.sock &
[1] 31790
openxs@ao756:~/dbs/8.0$ 2019-05-26T10:55:18.274601Z mysqld_safe Logging to '/home/openxs/dbs/8.0/data/ao756.err'.
2019-05-26T10:55:18.353458Z mysqld_safe Starting mysqld daemon with databases from /home/openxs/dbs/8.0/data

openxs@ao756:~/dbs/8.0$ bin/mysql --socket=/tmp/mysql8.sock -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 8.0.13 Source distribution

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select @@innodb_default_row_format;
+-----------------------------+
| @@innodb_default_row_format |
+-----------------------------+
| dynamic                     |
+-----------------------------+
1 row in set (0,00 sec)

mysql> create table ti1(id int primary key, c1 int) engine=InnoDB row_format=redundant;
Query OK, 0 rows affected (0,65 sec)

mysql> create table ti2(id int primary key, c1 int) engine=InnoDB row_format=compact;
Query OK, 0 rows affected (0,44 sec)

mysql> create table ti3(id int primary key, c1 int) engine=InnoDB row_format=dynamic;
Query OK, 0 rows affected (0,51 sec)

mysql> create table ti4(id int primary key, c1 int) engine=InnoDB row_format=compressed;
Query OK, 0 rows affected (0,68 sec)

mysql> create table ti5(id int primary key, c1 int) engine=InnoDB;
Query OK, 0 rows affected (0,59 sec)

mysql> select * from information_schema.innodb_sys_tables where name like 'test/ti%';
ERROR 1109 (42S02): Unknown table 'INNODB_SYS_TABLES' in information_schema
mysql> select * from information_schema.innodb_tables where name like 'test/ti%';
+----------+----------+------+--------+-------+------------+---------------+------------+--------------+
| TABLE_ID | NAME     | FLAG | N_COLS | SPACE | ROW_FORMAT | ZIP_PAGE_SIZE | SPACE_TYPE | INSTANT_COLS |
+----------+----------+------+--------+-------+------------+---------------+------------+--------------+
|     1158 | test/ti1 |    0 |      5 |     6 | Redundant  |             0 | Single     |            0 |
|     1159 | test/ti2 |    1 |      5 |     7 | Compact    |             0 | Single     |            0 |
|     1160 | test/ti3 |   33 |      5 |     8 | Dynamic    |             0 | Single     |            0 |
|     1161 | test/ti4 |   41 |      5 |     9 | Compressed |          8192 | Single     |            0 |
|     1162 | test/ti5 |   33 |      5 |    10 | Dynamic    |             0 | Single     |            0 |
+----------+----------+------+--------+-------+------------+---------------+------------+--------------+
5 rows in set (0,03 sec)
Now let's combine some shell and awk together:
openxs@ao756:~/dbs/8.0$ for file in `ls data/test/ti*.ibd`
> do
> echo $file
> hexdump -C $file | awk '
> function DICT_TF_GET_COMPACT(flags) {
>   return rshift(and(flags, DICT_TF_MASK_COMPACT), DICT_TF_POS_COMPACT);
> }
>
> function DICT_TF_GET_ZIP_SSIZE(flags)
> {
>   return rshift(and(flags, DICT_TF_MASK_ZIP_SSIZE), DICT_TF_POS_ZIP_SSIZE);
> }
>
> function DICT_TF_HAS_ATOMIC_BLOBS(flags)
> {
>   return rshift(and(flags, DICT_TF_MASK_ATOMIC_BLOBS), DICT_TF_POS_ATOMIC_BLOBS);
> }
>
> function innodb_row_format(flags)
> {
>     compact = DICT_TF_GET_COMPACT(flags);
>     atomic_blobs = DICT_TF_HAS_ATOMIC_BLOBS(flags);
>
>     if (!compact) {
>         row_format = "Redundant";
>     } else if (!atomic_blobs) {
>         row_format = "Compact";
>     } else if (DICT_TF_GET_ZIP_SSIZE(flags)) {
>         row_format = "Compressed";
>     } else {
>         row_format = "Dynamic";
>     }
>     return row_format;
> }
>
> BEGIN {
> DICT_TF_WIDTH_COMPACT=1;
> DICT_TF_WIDTH_ZIP_SSIZE=4;
> DICT_TF_WIDTH_ATOMIC_BLOBS=1;
>
> DICT_TF_POS_COMPACT=0;
> DICT_TF_POS_ZIP_SSIZE=DICT_TF_POS_COMPACT + DICT_TF_WIDTH_COMPACT;
> DICT_TF_POS_ATOMIC_BLOBS=DICT_TF_POS_ZIP_SSIZE + DICT_TF_WIDTH_ZIP_SSIZE;
>
> DICT_TF_MASK_COMPACT=lshift(compl(lshift(compl(0), DICT_TF_WIDTH_COMPACT)),DICT_TF_POS_COMPACT);
> DICT_TF_MASK_ZIP_SSIZE=lshift(compl(lshift(compl(0), DICT_TF_WIDTH_ZIP_SSIZE)),DICT_TF_POS_ZIP_SSIZE);
> DICT_TF_MASK_ATOMIC_BLOBS=lshift(compl(lshift(compl(0), DICT_TF_WIDTH_ATOMIC_BLOBS)),DICT_TF_POS_ATOMIC_BLOBS);
> }
> /00000030/ {flags=strtonum("0x"$11); print innodb_row_format(flags);}'
> done
data/test/ti1.ibd
Redundant
data/test/ti2.ibd
Redundant
data/test/ti3.ibd
Dynamic
data/test/ti4.ibd
Compressed
data/test/ti5.ibd
Dynamic
openxs@ao756:~/dbs/8.0$
So, we proved that there is a way (based on some code analysis and scripting) to find out the exact row format that was used to create an InnoDB table based solely on the .ibd file and nothing else, in all cases but one! If you were reading carefully you noticed Redundant printed for ti2.ibd as well; we've seen the same in the INNODB_SYS_TABLESPACES table. The flags in the tablespace are the same for both Redundant and Compact row formats, see this part of the code also. It seems this is exactly one of the reasons why the .cfg file may be needed when we export a tablespace.

One day I'll find out and create a follow-up post. Too much code reading for my limited abilities today...

by Valerii Kravchuk (noreply@blogger.com) at May 26, 2019 04:18 PM

May 24, 2019

Oli Sennhauser

Dropped Tables with FromDual Backup Manager

Some applications have the bad behaviour of running CREATE or DROP TABLE statements while our FromDual Backup Manager (bman) backup is running.

This leads to the following bman error message:

/opt/mysql/product/5.7.26/bin/mysqldump --user=dba --host=migzm96i --port=3306 --all-databases --quick --single-transaction --flush-logs --triggers --routines --hex-blob --events | tee >(md5sum --binary >/tmp/checksum.23357.md5) | gzip -1
to Destination: /var/mysql/dumps/mysql96i/daily/bck_mysql96i_full_2019-05-22_06-50-01.sql.gz
ERROR: /opt/mysql/product/5.7.26/bin/mysqldump command failed (rc=253).
mysqldump: [Warning] Using a password on the command line interface can be insecure.
Error: Couldn't read status information for table m_report_builder_cache_157_20190521035354 ()
mysqldump: Couldn't execute 'show create table `m_report_builder_cache_157_20190521035354`': Table 'totara.m_report_builder_cache_157_20190521035354' doesn't exist (1146)

There are various strategies to work around this problem:

  • If the table is only needed temporarily, create it with the CREATE command as a TEMPORARY TABLE instead of a normal table (see the sketch after this list). This workaround would not work in this case because the table is a caching table which must be available to other connections as well.
  • Try to schedule your application job and your bman job in such a way that they do not collide. With bman that is quite easy, but sometimes not with the application.
  • Try to create the table in its own schema (e.g. cache) which is excluded from the bman backup. Then you can easily do a bman backup without the cache schema, for example like this:
    $ bman --target=brman@127.0.0.1:3306 --type=schema --schema=-cache --policy=daily

  • If this strategy also does not work (because you cannot change the application behaviour), try to ignore the table. The underlying mysqldump command knows the option --ignore-table:
    mysqldump --help
    ...
      --ignore-table=name Do not dump the specified table. To specify more than one
                          table to ignore, use the directive multiple times, once
                          for each table.  Each table must be specified with both
                          database and table names, e.g.,
                          --ignore-table=database.table.
    

    This option can be used in bman as well. Options to the underlying application are passed through FromDual Backup Manager as follows:
    $ bman --target=brman@127.0.0.1:3306 --type=full --policy=daily --pass-through='--ignore-table=totara.m_report_builder_cache_157_20190521035354'

  • The problem here is that this table contains a timestamp in its name (20190521035354), so the table name changes all the time. Passing wildcards through with --ignore-table is not possible; mysqldump does not support this feature (yet). The only solution we have in this case is to ignore the error message, with the risk that other possible error messages are ignored as well. This is achieved again with the --pass-through option:
    $ bman --target=brman@127.0.0.1:3306 --type=full --policy=daily --pass-through='--force'
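
For the first strategy from the list above, a minimal sketch could look like the statement below (the table name and columns are made up for illustration; the real caching table is created by the application). A TEMPORARY TABLE is visible only to the connection that created it, which is exactly why mysqldump would no longer stumble over it - and also why this does not help when other connections need the table:

CREATE TEMPORARY TABLE `totara`.`report_builder_cache_example` (
  `id` INT UNSIGNED NOT NULL PRIMARY KEY,
  `payload` LONGTEXT
) ENGINE=InnoDB;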

I hope these few tricks help you to make your FromDual Backup Manager (bman) backups hassle-free.


by Shinguz at May 24, 2019 05:23 AM

May 19, 2019

Valeriy Kravchuk

MySQL Support Engineer's Chronicles, Issue #9

My previous post from this series was published more than 1.5 years ago. I had never planned to stop writing about my everyday work on a regular basis, but sometimes it's not easy to pick something really interesting for a wider MySQL audience, and when in doubt I always prefer to write about MySQL bugs...

In any case, any long way starts from the first step, so I decided to write one post in this series per week and try to summarize in it whatever findings, questions, discussions, bugs and links I've collected over the week. My work experience differs week after week, so some of these posts may be boring or less useful, but I still want to try to create them on a regular basis.

I was working on an (upcoming) blog post (inspired by one customer issue) on the impact of the innodb_default_row_format setting on importing tablespaces (and the related check of the row format really used in both .frm and .ibd files) and found the FSP header description in this old post by Jeremy Cole useful for further checks in the InnoDB source code. The MySQL manual is not very informative (and the MariaDB KB page is just wrong/incomplete) when describing the flags of a table or tablespace, unfortunately, so I've reported MDEV-19523 to get this improved.

If you ever wonder what MariaDB plans to do with InnoDB in the future, please, check MDEV-11633 among other sources.

This week we in Support got a customer (on MySQL 8.0.x) complaining that they could not start the server any more on Windows 10 after moving the datadir to another drive. Check this blog post by my colleague Nil for the reason, explanations and a way to fix/prevent this from happening. One of those cases where the MySQL Forums give a useful hint.

If you build MariaDB (and MySQL) from source on a regular basis (as I do), you may wonder at times how to disable some storage engine plugin at build time (for example, NOT to be affected by some temporary bugs in it when you do not really need it for testing or production use). Save this as a hint:
-DPLUGIN_MROONGA=NO
This is what you have to add to the cmake command line to prevent building Mroonga, for example. The same approach applies to TokuDB etc. See also this KB page for more details.
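For example, a complete build invocation might look like this (the source and installation paths are placeholders, only the -DPLUGIN_...=NO switches matter here):
cmake ../server -DPLUGIN_MROONGA=NO -DPLUGIN_TOKUDB=NO \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_INSTALL_PREFIX=$HOME/dbs/maria
make -j 4 && make install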

I never noticed before that the "Explain Analyzer" service exists at mariadb.org, but it seems some customers use it and even prefer to share its output instead of plain text EXPLAIN. Just copy/paste any EXPLAIN ...\G output there and decide if the result is useful. For Support purposes, and for queries accessing fewer than 10 tables or so, I'd prefer the usual text output.

Yet another public service at mariadb.org I noted this week by pure chance is the "MariaDB CI" page with buildbot status and ways to check what is building now, what failed, etc. The MariaDB Foundation works in a truly open manner at all levels.

If you ever care to find out which exact versions of MariaDB (or MySQL) contain a specific commit, you can do so using the git tag --contains commit_hash command.
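For example (the commit hash and the resulting list of tags below are purely illustrative):
git tag --contains 1a2b3c4d
mariadb-10.3.14
mariadb-10.4.4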

I still do not care about Kubernetes at all, but it seems customers are starting to use it in production, so here is a hint for myself on how to run a specific command in a running container:
kubectl exec -it <pod> --container <container> -- vi grastate.dat
I may have to write or speak about some details of MySQL and MariaDB architecture soon, so I was looking for related pictures and texts, and found useful details in several places. For example, if you are interested in different storage engines and the efficiency of indexing, check this blog post by Mark Callaghan.

Last but not least, I've nominated the following bugs for bug of the day on Twitter this week:
  • Bug #95269 - "binlog_row_image=minimal causes assertion failure". I really wonder why this combination was missed in any regular testing of debug builds (that I hope Oracle does).
  • Bug #90681 - "MySQL 8.0 fails to install and start from Oracle .debs on debian 9 x86_64". It seems proper documentation is missing for users to know what conflicting packages to remove, what paths to clean up (if any) etc. Maybe this is no longer a concern (I do not use Oracle .deb packages, so I don't know), but in any case having this bug just "Open" helps nobody.
  • Bug #87312 - "Test main.events_time_zone is fundamentally unstable". It's even more strange to see this bug report about unstable test case "Open" for more than 2 years. Is it really hard to run MTR many times or check the code and improve, or just agree to disable it?
  • Bug #95411 - "LATERAL produces wrong results (values instead of NULLs) on 8.0.16". This regression bug in optimizer of 8.0.16 (vs 8.0.14) leads to wrong results, but so far nobody cared to verify it (even though it has simple and clear "How to repeat" instructions). This is sad.
I've also participated in a discussion there. As a result I ended up reading some recent MEB 8.0 manual pages (and more here). MySQL Enterprise Backup really provides a lot of potentially useful options that mariabackup may benefit from one day...

I spent first two weeks of May properly last year, on vacation in UK. Battersea Park here.
That's more or less all I had written down for further review this week that I am ready to share. Stay tuned for what may come up next week!

by Valerii Kravchuk (noreply@blogger.com) at May 19, 2019 12:33 PM

May 07, 2019

Oli Sennhauser

FromDual Ops Center for MariaDB and MySQL 0.9.1 has been released

FromDual has the pleasure to announce the release of the new version 0.9.1 of its popular FromDual Ops Center for MariaDB and MySQL focmm.

The FromDual Ops Center for MariaDB and MySQL (focmm) helps DBAs and System Administrators to manage MariaDB and MySQL database farms. Ops Center makes the lives of DBAs and Admins easier!

The main task of Ops Center is to support you in your daily MySQL and MariaDB operation tasks. You can find more information about FromDual Ops Center here.

Download

The new FromDual Ops Center for MariaDB and MySQL (focmm) can be downloaded from here. How to install and use focmm is documented in the Ops Center User Guide.

In the inconceivable case that you find a bug in the FromDual Ops Center for MariaDB and MySQL please report it to the FromDual bug tracker or just send us an email.

Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

Installation of Ops Center 0.9.1

You can find a complete guide on how to install FromDual Ops Center in the Ops Center User Guide.

Upgrade from 0.3 or 0.9.0 to 0.9.1

Upgrade from 0.3 or 0.9.0 to 0.9.1 should happen automatically. Please do a backup of your Ops Center Instance before you upgrade! Please also check Upgrading.

Changes in Ops Center 0.9.1

Upgrade

  • Severe upgrade bug fixed which prohibited the installation of v0.9.0.

Build and Packaging

  • RPM package for RHEL/CentOS 7 is available now.
  • DEB package for Ubuntu 18.04 LTS is available now.
  • SElinux Policy Package file added.
  • COMMIT tag was not replaced correctly during build. This is fixed now.

by Shinguz at May 07, 2019 03:12 PM

Percona

MariaDB Track at Percona Live

MariaDB track at Percona Live 2019

Less than one month is left until Percona Live. This time the Committee work was a bit unusual. Instead of having one big committee for the whole conference we had a few mini-committees, each responsible for a track. Each independent mini-committee, in turn, had a leader who was responsible for the whole process. I led the MariaDB track. In this post, I want to explain how we worked, which topics we chose, and why.

For MariaDB, we had seven slots (five for 50-minute talks and two for 25-minute talks) and 19 submissions, so we had to reject two out of three proposals. We also had to decide how many topics the program should cover. My aim here was to use the MariaDB track to demonstrate as many unique MariaDB features as possible. I also wanted to have as many speakers as possible, considering the number of slots we had available.

The committee agreed, and we tried our best to make the program cover a variety of topics. If someone sent us two or more proposals, we chose only one, to allow more speakers to attend.

We also looked to identify gaps in the submitted sessions. For example, if we wanted a topic to be covered and no one had sent a proposal on such a subject, we invited potential speakers and asked them to submit with that topic in mind. Or we asked those who had already submitted similar talks to improve them.

In the end, we have five 50-minute sessions, one MariaDB session in the MySQL track, two 25-minute sessions, one tutorial, and one keynote. All of them are by different speakers.

The Program

The first MariaDB event will be a tutorial: "Expert MariaDB: Harness the Strengths of MariaDB Server" by Colin Charles on Tuesday, May 28.

Colin started his MySQL career as a Community Engineer back in the MySQL AB times. He worked on numerous MySQL events, both big and small, including Percona Live’s predecessor, O’Reilly’s MySQL Conference and Expo. Colin joined Monty Program Ab, and MariaDB Corporation as a Chief Evangelist, then spent two years as Chief Evangelist at Percona. Colin is now a Consultant at Codership, the makers of Galera Cluster.

Colin will not only talk about unique MariaDB features up to version 10.4, but will also help you try all of them out. This tutorial is a must-attend for everyone interested in MariaDB.

Next day: Wednesday, May 29 – the first conference day – will be the MariaDB Track day.

MariaDB talks will start with the keynote by Vicentiu Ciorbaru about new MariaDB features in version 10.4. He will highlight all the significant additions in this version.

Vicentiu started his career at MariaDB Foundation as a very talented Google Summer of Code student. His first project was Roles. Then he worked a lot on MariaDB Optimizer, bug fixes, and code maintenance. At the same time, he discovered a talent for public speaking, and now he is the face of MariaDB Foundation.

We at the committee had a hard choice: either to accept his 50-minute session proposal or ask him to make a keynote. This decision was not easy, because a keynote is shorter than 50 minutes. At the same time, though, everyone at the conference will be able to see it. Brand new features of version 10.4 are a very important topic. Therefore, we decided that it would be best to have Vicentiu as a keynote speaker.

Morning sessions

Sessions will start with a talk by Alexander Rubin, “Opensource Column Store Databases: MariaDB ColumnStore vs. ClickHouse”. Alex began his MySQL career as a web developer, then joined MySQL AB as a consultant. He then moved to Percona as a Principal Architect. It was our loss when he left Percona to start applying his recommendations himself on behalf of the medical startup VirtualHealth! During his career as a MySQL consultant, he tried all the sexiest database products, loaded terabytes of data into them, and ran deadly intensive loads. He is the one who knows best about database strengths and weaknesses. I would recommend his session to everyone who is considering a column store solution.

The next talk is “Galera Cluster New Features” by Seppo Jaakola. This session is about the long-awaited Galera 4 library. Seppo is one of the three founders of Codership Oy: the company which brought us the Galera library. Before the year 2007, when the Galera library was first released, MySQL users had to choose between asynchronous replication and asynchronous replication (that’s not a typo). Seppo brought us a solution which allowed us to continue using InnoDB in the style we were used to, while writing to all nodes. The Galera library looks after the data consistency. After more than ten years the product is mature and leaving its competitors far behind. The new version brings us streaming replication technology and other improvements which relax usage limitations and make Galera Cluster more stable. I recommend this session for everyone who looks forward to a synchronous replication future.

Afternoon sessions

After the lunch break, we will meet MariaDB users Sandeep Jangra and Andre Van Looveren, who will show how they use MariaDB at Walmart in their talk “Lessons Learned Building a Fully Automated Database Platform as a Service Using Open Source Technologies in the Cloud”. Sandeep and Andre manage more than 6000 MariaDB installations. In addition to setting up automation, they have experience with migrations and upgrades. This talk will be an excellent case study, which I recommend to everyone who is considering implementing automation for a farm of MariaDB or MySQL servers.

The next topic is “MariaDB Security Features and Best Practices” by Robert Bindar. Robert is a Server Developer at the MariaDB Foundation. He will cover security best practices for MariaDB deployments, including the latest security features added in version 10.4.

At 4:15 pm we will have two MariaDB topics in parallel.

“MariaDB and MySQL – What Statistics Optimizer Needs Or When and How Not to Use Indexes” by Sergei Golubchik – a Member of the MariaDB Foundation Board – covers optimization techniques which are often ignored in favor of indexes. Sergei worked on MySQL, and then on MariaDB, from their very first days. I’ve known him since 2006, when I joined the MySQL team. Each time I am struggling to find out how a particular piece of code works, just a couple of words from Sergei help to solve the issue! He has encyclopedic knowledge of both the MariaDB and MySQL databases. In this session, Sergei will explain which statistics the optimizer can use in addition to indexes. While he will focus on specific MariaDB features, he will cover MySQL too. Spoiler: these are not only histograms!

Backups in the MySQL track…

In the parallel MySQL track, Iwo Panowicz and Juan Pablo Arruti will speak about backups in their talk “Percona XtraBackup vs. Mariabackup vs. MySQL Enterprise Backup”. Iwo and Juan Pablo are Support Engineers at Percona. Iwo joined Percona two years ago, and now he is one of the most senior engineers in the EMEA team. Linux, PMM, analyzing core files, engineering best practices: Iwo is well equipped to answer all these and many more questions. Juan Pablo works in the American Support team on everything around MariaDB and MySQL: replication, backup, performance issues, data corruption… Through their support work, Iwo and Juan Pablo have had plenty of chances to find out the strengths and weaknesses of different backup solutions.

The three tools they will cover in the talk can be used to make a physical backup of MySQL and MariaDB databases, which is the fastest and most recommended way to back up an actively used server. But what is the difference? When and why should you prefer one instrument over another? Iwo and Juan Pablo will answer these questions.

At the end of the day we will have two 25-minute sessions.

Jim Tommaney will present “Tips and Tricks with MariaDB ColumnStore”. Unlike Alex Rubin, who is an end user of ColumnStore databases, Jim is from another side: development. Thus his insights into MariaDB ColumnStore could be fascinating. If you are considering ColumnStore: this topic is a must-go!

Daniel Black will close the day with his talk “Squash That Old Bug”. This topic is the one I personally am looking forward to the most! Not only because I stick with bugs. But, well… the lists of accepted patches which Daniel posts to the MariaDB and MySQL servers are impressive. Especially when you know how strict the quality control for external patches in MariaDB and MySQL is! In his talk, Daniel is going to help you to start contributing yourself, and to do it successfully, so that your patches are accepted. This session is very important for anyone who has asked themselves why one or another MariaDB or MySQL bug has not been fixed for a long time. I do not know a single user who has not asked that question!

Conclusion

This blog about the MariaDB track at Percona Live covers eight sessions, one keynote, one tutorial, 12 speakers, and seven mini-committee members – two of whom are also speakers. We worked hard, and continue to work hard, to bring you a great MariaDB program.

I cannot wait for the show to begin!


Photo by shannon VanDenHeuvel on Unsplash

by Sveta Smirnova at May 07, 2019 11:04 AM

May 01, 2019

Valeriy Kravchuk

Fun with Bugs #85 - On MySQL Bug Reports I am Subscribed to, Part XX

We have a public holiday here today and it's raining outside for a third day in a row already, so I hardly have anything better to do than writing yet another review of public MySQL bug reports that I've subscribed to recently.

I am not sure if these reviews are really considered useful by anyone but a few of my readers, but I am still going to try, in the hope of ending up with some useful conclusions. Last time I stopped at Bug #94903, so let me continue with the next bug on my list:
  • Bug #94912 - "O_DIRECT_NO_FSYNC possible write hole". In this bug report Janet Campbell shared some concerns related to the way O_DIRECT_NO_FSYNC (and O_DIRECT) settings for innodb_flush_method work. Check comments, including those by Sunny Bains, where he agrees that "...this will cause problems where the redo and data are on separate devices.". Useful reading for anyone interested in InnoDB internals or using  innodb_dedicated_server setting in MySQL 8.0.14+.
  • Bug #94971 - "Incorrect key file error during log apply table stage in online DDL". Monty Solomon reported yet another case when "online" ALTER for an InnoDB table fails in a weird way. The bug is still "Open" and there is no clear test case to just copy/paste, but both the problem and potential solutions (make sure you have a "big enough" innodb_online_alter_log_max_size, or better use the pt-online-schema-change or gh-ost tools) were already discussed here.
  • Bug #94973 - "Wrong result with subquery in where clause and order by". Yet another wrong-results bug with a subquery, on MySQL 5.7.25, was reported by Andreas Kohlbecker. We can only guess if MySQL 8 is also affected (MariaDB 10.3.7 is not, based on my test results shared below), as the Oracle engineer who verified the bug had NOT cared to check or share the results of such a check. What can be easier than running this (a bit modified) test case on every MySQL major version and copy-pasting the results:
    MariaDB [test]> CREATE TABLE `ReferenceB` (
        ->   `id` int(11) NOT NULL,
        ->   `bitField` bit(1) NOT NULL,
        ->   `refType` varchar(255) NOT NULL,
        ->   `externalLink` longtext,
        ->   PRIMARY KEY (`id`)
        -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
    Query OK, 0 rows affected (0.170 sec)

    MariaDB [test]> INSERT INTO ReferenceB (id, bitField, refType, externalLink) VALUES(1, 0, 'JOU', NULL);
    Query OK, 1 row affected (0.027 sec)

    MariaDB [test]> INSERT INTO ReferenceB (id, bitField, refType, externalLink) VALUES(2, 0, 'JOU', NULL);
    Query OK, 1 row affected (0.002 sec)

    MariaDB [test]> SELECT hex(bitField) from ReferenceB  where id in (select id as
    y0_ from ReferenceB  where refType='JOU') order by externalLink asc;
    +---------------+
    | hex(bitField) |
    +---------------+
    | 0             |
    | 0             |
    +---------------+
    2 rows in set (0.028 sec)
    But we do not see anything like that in the bug report... This is sad.
  • Bug #94994 - "Memory leak detect on temptable storage engine". Yet another memory leak (found with ASan) reported by Zhao Jianwei, who had also suggested a patch.
  • Bug #95008 - "applying binary log doesn't work with blackhole engine tables". This bug was reported by Thomas Benkert. It seems there is a problem with applying row-based events to a BLACKHOLE table, and this prevents some nice recovery tricks from working.
  • Bug #95020 - "select no rows return but check profile process Creating sort index". Interesting finding from cui jacky. I can reproduce this with MariaDB as well. It seems we either have to define some new stage or define "Creating sort index" better than in the current manual. This:
    The thread is processing a SELECT that is resolved using an internal temporary table.
    is plain wrong in the case shown in the bug report IMHO.
  • Bug #95040 - "Duplicately remove locks from lock_sys->prdt_page_hash in btr_compress". One of those rare cases when Zhai Weixiang does not provide the patch, just suggests the fix based on code review :)
  • Bug #95045 - "Data Truncation error occurred on a write of column 0Data was 0 bytes long and". This really weird regression bug in MySQL 8.0.14+ was reported by Adarshdeep Cheema. MariaDB 10.3 is surely not affected.
  • Bug #95049 - "Modified rows are not locked after rolling back to savepoint". The bug reporter, John Lin, found that the fine MySQL manual does not describe the real current implementation. Surprise!
  • Bug #95058 - "Index not used for column with IS TRUE or IS FALSE operators". Take extra care when using BOOLEAN columns in MySQL. As noted by Monty Solomon, a proper index is NOT used when you try to check BOOLEAN values as the manual suggests, using IS TRUE or IS FALSE conditions. Roy Lyseng explained how such queries are treated internally, but surely there is a better way. MariaDB 10.3.7 is also affected, unfortunately.
  • Bug #95064 - "slave server may has gaps in Executed_Gtid_Set when a special case happen ". Nice bug report from yoga yoga, who had also contributed a patch. Parallel slave can easily get out of sync with master in case of lock wait timeout and failed retries. Again, we do NOT see any check if MySQL 8 is affected, unfortunately.
  • Bug #95065 - "Strange memory management when using full-text indexes". We all know that InnoDB FULLTEXT indexes implementation is far from perfect. Now, thanks to Yura Sorokin, we know also about a verified memory leak bug there that may lead to OOM killing of MySQL server.
  • Bug #95070 - "INSERT .. VALUES ( .., (SELECT ..), ..) takes shared lock with READ-COMMITTED". Seunguck Lee found yet another case of InnoDB locking behavior that MySQL manual does not explain. The bug is still "Open" for some reason.
  • Bug #95115 - "mysqld deadlock of all client threads originating from 3-way deadlock". It took some effort from the bug reporter, Sandeep Dube, and other community users (mostly Jacek Cencek) to attract proper attention to this bug from the right Oracle developer, Dmitry Lenev, until it ended up "Verified" based on code review. We still cannot be sure if MySQL 8 is also affected.
That's all for now. I have a few more new bug reports that I monitor, but I do not plan to continue with this kind of review in this blog in the upcoming few months. I hope I'll soon get a reason to write a different kind of post, with a more in-depth study of various topics...

In any case you may follow me on Twitter for anything related to recent interesting or wrongly handled MySQL bug reports.

This view of Chelsea from our apartment at Chelsea Cloisters reminds me that last year I spent spring holiday season properly - no time was devoted to MySQL bugs :)
To summarize:
  1. Do not use the O_DIRECT_NO_FSYNC value for innodb_flush_method if your redo logs are located on a different device than your data files. Just don't (see the sketch after this list).
  2. Some Oracle engineers who process bugs still do not care to check if all supported major versions are affected and/or share the results of such checks in public.
  3. There are still many details of InnoDB locking to study, document properly and maybe fix.
  4. I am really concerned with the state of MySQL optimizer. We see all kinds of weird bugs (including regressions) and very few fixes in each maintenance release.
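
A minimal my.cnf sketch for point 1 (the alternative value shown is only an illustration of a flush method that still calls fsync(), not a recommendation from this post):
[mysqld]
# do NOT set this when redo logs and data files live on separate devices:
# innodb_flush_method = O_DIRECT_NO_FSYNC
innodb_flush_method = O_DIRECT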

by Valerii Kravchuk (noreply@blogger.com) at May 01, 2019 03:54 PM

April 30, 2019

Oli Sennhauser

FromDual Ops Center for MariaDB and MySQL 0.9 has been released

Caution: We have introduced an evil bug which prohibits the installation of focmm. Sorry! Somehow it slipped through our QA. To fix this bug, update the file lib/Upgrade.inc on line 1965 as follows:

- $sql = sprintf("REPLACE INTO `focmm_configuration` (`key`, `value`) VALUES ('%s', '%s'), ('%s', '%s'), ('%s', '%s')"
+ $sql = sprintf("REPLACE INTO `focmm_configuration` (`key`, `value`) VALUES ('%s', '%s'), ('%s', '%s')"

In the meantime we are preparing a new release.



FromDual has the pleasure to announce the release of the new version 0.9 of its popular FromDual Ops Center for MariaDB and MySQL focmm.

The FromDual Ops Center for MariaDB and MySQL (focmm) helps DBAs and System Administrators to manage MariaDB and MySQL database farms. Ops Center makes the lives of DBAs and Admins easier!

The main task of Ops Center is to support you in your daily MySQL and MariaDB operation tasks. You can find more information about FromDual Ops Center here.

Download

The new FromDual Ops Center for MariaDB and MySQL (focmm) can be downloaded from here. How to install and use focmm is documented in the Ops Center User Guide.

In the inconceivable case that you find a bug in the FromDual Ops Center for MariaDB and MySQL please report it to the FromDual bug tracker or just send us an email.

Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

Installation of Ops Center 0.9

You can find a complete guide on how to install FromDual Ops Center in the Ops Center User Guide.

Upgrade from 0.3 to 0.9

Upgrade from 0.3 to 0.9 should happen automatically. Please do a backup of your Ops Center instance before you upgrade! Please also check Upgrading.

Changes in Ops Center 0.9

Everything has changed!


by Shinguz at April 30, 2019 07:17 AM

April 27, 2019

Valeriy Kravchuk

Fun with Bugs #84 - On Some Public Bugs Fixed in MySQL 5.7.26

Oracle released minor MySQL Server versions in all supported branches on April 25, 2019. MySQL 5.7.26 is just one of them, but recently I prefer to ignore MySQL 8 releases (after checking that I can build them from source code at least somewhere, even if it takes 18G+ of disk space and that they work in basic tests), as there are more chances for MySQL 5.7 bug fixes to affect me (and customers I care about) directly.

So, in this yet another boring blog post (that would never be a reason for any award) I plan to concentrate on bugs reported in public MySQL bugs database and fixed in MySQL 5.7.26. As usual I name bug reporters explicitly and give links to their remaining currently active bug reports, if any. This time the list is short enough, so I do not even split it by categories:
  • Bug #93164 - "Memory leak in innochecksum utility detected by ASan". This bug was reported by Yura Sorokin from Percona, who also had contributed a patch (for some reason this is not mentioned in the official release notes).
  • Bug #90402 - "innodb async io error handling in io_event". Wei Zhao found yet another case when wrong data type was used in the code and I/O error was not handled, and this could lead even to crashes. He had submitted a patch.
  • Bug #89126 - "create table panic on innobase_parse_hint_from_comment". Nice bug report with a patch from Yan Huang. Note also detailed analysis and test case provided by Marcelo Altmann in the comment. It's a great example of cooperation of all sides: Oracle MySQL developers, bugs verification team, bug reporter and other community users.
  • Bug #92241 - "alter partitioned table add auto_increment diff result depending on algorithm". Yet another great finding from Shane Bester himself!
  • Bug #94247 - "Contribution: Fix fractional timeout values used with WAIT_FOR_EXECUTED_GTI ...". This bug report was created based on pull request from Dirkjan Bussink, who had suggested a patch to fix the problem. Note the comment from Shlomi Noach that refers to Bug #94311 (still private).
  • Bug #85158 - "heartbeats/fakerotate cause a forced sync_master_info". Note MTR test case contributed by Sveta Smirnova and code analysis in a comment from Vlad Lesin (both from Percona at that time) in this bug report from Trey Raymond.
  • Bug #92690 - "Group Replication split brain with faulty network". I do not care about group replication (I have enough Galera in my life instead), but I could not skip this report by Przemyslaw Malkowski from Percona, with detailed steps on how to reproduce. Note comments from other community members. Yet another case to show that good bug reports attract community feedback and are fixed relatively fast.
  • Bug #93750 - "Escaping of column names for GRANT statements does not persist in binary logs". Clear and simple bug report from Andrii Ustymenko. I wonder why it was not found by internal testing/QA. Quick test shows that MariaDB 10.3.7, for example, is not affected:
    c:\Program Files\MariaDB 10.3\bin>mysql -uroot -proot -P3316 test
    Welcome to the MariaDB monitor.  Commands end with ; or \g.
    Your MariaDB connection id is 9
    Server version: 10.3.7-MariaDB-log mariadb.org binary distribution

    Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

    MariaDB [test]> create table t_from(id int primary key, `from` int, c1 int);
    Query OK, 0 rows affected (0.582 sec)

    MariaDB [test]> create user 'user01'@'%' identified by 'user01';
    Query OK, 0 rows affected (0.003 sec)

    MariaDB [test]> grant select (`id`,`from`) on `test`.`t_from` to 'user01'@'%';
    Query OK, 0 rows affected (0.054 sec)

    MariaDB [test]> show master status;
    +------------------+----------+--------------+------------------+
    | File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
    +------------------+----------+--------------+------------------+
    | pc-PC-bin.000007 |      852 |              |                  |
    +------------------+----------+--------------+------------------+
    1 row in set (0.030 sec)

    MariaDB [test]> show binlog events in 'pc-PC-bin.000007';
    +------------------+-----+-------------------+-----------+-------------+--------
    -------------------------------------------------------------------+
    | Log_name         | Pos | Event_type        | Server_id | End_log_pos | Info
                                                                       |
    +------------------+-----+-------------------+-----------+-------------+--------
    -------------------------------------------------------------------+
    | pc-PC-bin.000007 |   4 | Format_desc       |         1 |         256 | Server
    ver: 10.3.7-MariaDB-log, Binlog ver: 4                             |
    | pc-PC-bin.000007 | 256 | Gtid_list         |         1 |         299 | [0-1-42
    ]                                                                  |
    | pc-PC-bin.000007 | 299 | Binlog_checkpoint |         1 |         342 | pc-PC-b
    in.000007                                                          |
    ...
    | pc-PC-bin.000007 | 708 | Query             |         1 |         852 | use `te
    st`; grant select (`id`,`from`) on `test`.`t_from` to 'user01'@'%' |

    +------------------+-----+-------------------+-----------+-------------+--------
    -------------------------------------------------------------------+
    9 rows in set (0.123 sec)
  • Bug #73936 - "If the storage engine supports RBR, unsafe SQL statementes end up in binlog". Nice bug report with an MTR test case by Santosh Praneeth Banda. Note that the last comment about the fix mentions only MySQL 8.0.15, not a single word about the fix in MySQL 5.7.26 (or anything about MySQL 5.6.x, while the bug was reported for 5.6).
  • Bug #93341 - "Check for tirpc needs improvement". The need for improvement of CMake check was noted by Terje Røsten.
  • Bug #91803 - "mysqladmin shutdown does not wait for MySQL to shut down anymore". This regression bug (without a "regression" tag) was reported by Christian Roser.
  • Bug #91541 - ""Flush status" statement adds twice to global values ". Yura Sorokin contributed a detailed analysis, an MTR test case and a patch to this bug reported by Carlos Tutte.
  • Bug #90351 - "GLOBAL STATUS variables drift after rollback". Zsolt Parragi contributed a patch to this bug found and reported by Iwo P. For some reason this contribution is not highlighted in the release notes.
  • Bug #81441 - "Warning about localhost when using skip-name-resolve". One of many bug reports from Monty Solomon in which he (and other community members like Jean-François Gagné) had to spend a lot of effort and fight with a member of the bugs verification team to get the bug accepted as a real code bug and then get it fixed in all affected versions.
  • Bug #90902 - "Select Query With Complex Joins Leaks File Handles". This bug was reported by James Wilson. I still wonder if MySQL 5.6 was affected. The bug report says nothing about this (while I expect all supported GA versions to be checked when a bug is verified, and the results of such a check to be clearly documented).
The future looks bright for MySQL 5.7
To summarize:
  1. Consider upgrade to 5.7.26 if you use complex joins, partitioned tables with auto_increment columns or rely on InnoDB or replication a lot.
  2. It's good to see crashing bugs that do not end up as hidden/"security", maybe because they are reported with patches...
  3. It's good to see examples of cooperation of several community users contributing to the same bug report!
  4. Percona engineers contribute a lot to MySQL, both in the form of bug reports and patches, and by helping other community users to make their point and get their bugs fixed fast.
  5. There are still things to improve in the way Oracle engineers handle bug verification, IMHO.
  6.  It's also a bit strange to see only one optimizer-related fix in this release. It means that either MySQL optimizer is already near perfect and there are no bugs to fix (check yourself, but I see 123 bugs here), or that nobody cares that much about MySQL optimizer in 5.7 these days.
  7. It seems for some bugs fixed in previous MySQL 8.0.x minor release there is no extra check/updates in public comments about the versions with the fix when it is released in MySQL 5.6 or 5.7.

by Valerii Kravchuk (noreply@blogger.com) at April 27, 2019 04:08 PM

April 15, 2019

Oli Sennhauser

MariaDB Prepared Statements, Transactions and Multi-Row Inserts

Last week at the MariaDB/MySQL Developer Training we had one participant asking some tricky questions to which I did not know the answer by heart.

Also MariaDB documentation was not too verbose (here and here).

So time to do some experiments:
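
The post does not show the definition of the test table used below; a minimal sketch that is consistent with the SELECT output (an AUTO_INCREMENT id, a data column and a timestamp set at insert time) and with the transactional tests (so InnoDB is assumed) could be:

SQL> CREATE TABLE `test`.`test` (
       `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
       `data` VARCHAR(128),
       `ts` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
     ) ENGINE=InnoDB;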

Prepared Statements and Multi-Row Inserts

SQL> PREPARE stmt1 FROM 'INSERT INTO `test`.`test` (`data`) VALUES (?), (?), (?)';
Statement prepared
SQL> SET @d1 = 'Bli';
SQL> SET @d2 = 'Bla';
SQL> SET @d3 = 'Blub';
SQL> EXECUTE stmt1 USING @d1, @d2, @d3;
Query OK, 3 rows affected (0.010 sec)
Records: 3  Duplicates: 0  Warnings: 0
SQL> DEALLOCATE PREPARE stmt1;
SQL> SELECT * FROM test;
+----+------+---------------------+
| id | data | ts                  |
+----+------+---------------------+
|  1 | Bli  | 2019-04-15 17:26:22 |
|  2 | Bla  | 2019-04-15 17:26:22 |
|  3 | Blub | 2019-04-15 17:26:22 |
+----+------+---------------------+

Prepared Statements and Transactions

SQL> SET SESSION autocommit=Off;
SQL> START TRANSACTION;
SQL> PREPARE stmt2 FROM 'INSERT INTO `test`.`test` (`data`) VALUES (?)';
Statement prepared

SQL> SET @d1 = 'BliTrx';
SQL> EXECUTE stmt2 USING @d1;
Query OK, 1 row affected (0.000 sec)

SQL> SET @d1 = 'BlaTrx';
SQL> EXECUTE stmt2 USING @d1;
Query OK, 1 row affected (0.000 sec)
SQL> COMMIT;

-- Theoretically we should do a START TRANSACTION; here again...
SQL> SET @d1 = 'BlubTrx';
SQL> EXECUTE stmt2 USING @d1;
Query OK, 1 row affected (0.000 sec)
SQL> ROLLBACK;

SQL> DEALLOCATE PREPARE stmt2;
SQL> SELECT * FROM test;
+----+---------+---------------------+
| id | data    | ts                  |
+----+---------+---------------------+
| 10 | BliTrx  | 2019-04-15 17:33:30 |
| 11 | BlaTrx  | 2019-04-15 17:33:39 |
+----+---------+---------------------+

Prepared Statements and Transactions and Multi-Row Inserts

SQL> SET SESSION autocommit=Off;
SQL> START TRANSACTION;
SQL> PREPARE stmt3 FROM 'INSERT INTO `test`.`test` (`data`) VALUES (?), (?), (?)';
Statement prepared

SQL> SET @d1 = 'Bli1Trx';
SQL> SET @d2 = 'Bla1Trx';
SQL> SET @d3 = 'Blub1Trx';
SQL> EXECUTE stmt3 USING @d1, @d2, @d3;
Query OK, 3 rows affected (0.000 sec)
SQL> COMMIT;

-- Theoretically we should do a START TRANSACTION; here again...
SQL> SET @d1 = 'Bli2Trx';
SQL> SET @d2 = 'Bla2Trx';
SQL> SET @d3 = 'Blub2Trx';
SQL> EXECUTE stmt3 USING @d1, @d2, @d3;
Query OK, 3 rows affected (0.000 sec)
SQL> ROLLBACK;

-- Theoretically we should do a START TRANSACTION; here again...
SQL> SET @d1 = 'Bli3Trx';
SQL> SET @d2 = 'Bla3Trx';
SQL> SET @d3 = 'Blub3Trx';
SQL> EXECUTE stmt3 USING @d1, @d2, @d3;
Query OK, 3 rows affected (0.001 sec)
SQL> COMMIT;

SQL> DEALLOCATE PREPARE stmt3;
SQL> SELECT * FROM test;
+----+----------+---------------------+
| id | data     | ts                  |
+----+----------+---------------------+
|  1 | Bli1Trx  | 2019-04-15 17:37:50 |
|  2 | Bla1Trx  | 2019-04-15 17:37:50 |
|  3 | Blub1Trx | 2019-04-15 17:37:50 |
|  7 | Bli3Trx  | 2019-04-15 17:38:38 |
|  8 | Bla3Trx  | 2019-04-15 17:38:38 |
|  9 | Blub3Trx | 2019-04-15 17:38:38 |
+----+----------+---------------------+

Everything seems to work as expected. Now we know it for sure!

by Shinguz at April 15, 2019 04:09 PM

Valeriy Kravchuk

Fun with Bugs #83 - On MySQL Bug Reports I am Subscribed to, Part XIX

I do not have much to say yet on the popular topic of upgrading everything to MySQL 8, so let me just continue reviewing public MySQL bug reports that I've subscribed to recently. After my previous post at least one bug, Bug #94747, got enough comments and clarifications (up to the specific commit that introduced this regression, pointed out by Daniel Black!) to have it re-classified and verified as an InnoDB code bug. So, I see good reasons to continue attracting wide public attention to selected MySQL bugs - this helps to make MySQL better eventually.

As usual, I start from the oldest bug reports:
  • Bug #94758 - "record with REC_INFO_MIN_REC_FLAG is not the min record on non-leaf page". It was reported by a well known person, Zhai Weixiang, who contributed a lot to MySQL code and quality. This time he added a function to the code to prove his point and show that data may be stored in an unexpected order on the root node of InnoDB table. For this very reason (Oracle's code modified to show the problem) this report was marked as "Not a Bug". This is weird, one may prove the point by checking memory with gdb if needed (or maybe by checking data pages on disk as well), without any code modifications.
  • Bug #94775 - "Innodb_row_lock_current_waits status variable incorrect values on idle server". If you read this bug report by Uday Sitaram you can find out a statement that some status variables, like Innodb_row_lock_current_waits, are designed to be "fuzzy", so no matter what value you may see it's probably not a bug. Very enlightening!
  • Bug #94777 - "Question about the redo log write_ahead_buffer". One may argue that the public bugs database is not a proper place to ask questions, but in this case Chen Zongzhi actually proved that MySQL 8.0 works better and started some discussion that probably reveals a real bug (see comments starting from this one, "[5 Apr 15:59] Inaam Rana "). So, even if the "Not a Bug" status is correct for the original finding, it seems there is something to study, and we have hope that this study happens elsewhere (although I'd prefer to see this or a new public bug report for it "Verified").
  • Bug #94797 - "Auto_increment values may decrease when adding a generated column". I can not reproduce this problem reported by Fengchun Hua with MariaDB 10.1.x. My related comments in the bug remain hidden and I've already agreed not to make any such comments in the bugs database. So, for now we have a "Verified" bug in MySQL 5.7.
  • Bug #94800 - "Lost connection (for Debug version) or wrong result (for release version)". According to my tests, MariaDB 10.3.7 is not affected by this bug reported by Weidong Yu, who had also suggested a fix. See also his Bug #94802 - "The behavior between insert stmt and "prepare stmt and execute stmt" different ". (MariaDB 10.3.7 is also not affected).
  • Bug #94803 - "rpl sql_thread may broken due to XAER_RMFAIL error for unfinished xa transaction". This bug reported by Dennis Gao is verified based on code review, but we still do not know if any major version besides 5.7 is affected.
  • Bug #94814 - "slave replication lock wait timeout because of wrong trx order in binlog file". Yet another case when XA transactions may break replication was found by Zhenghu Wen. The bug is still "Open" and I am really interested to see it properly processed soon.
  • Bug #94816 - "Alter table results in foreign key error that appears to drop referenced table". From reading this report I conclude that MySQL 5.7.25 (and Percona Server 5.7.25-28, for that matter) is affected (the src table disappears) and that this was verified, but still the bug ended up as "Can't repeat" (?) with a statement that there is a fix in MySQL 8.0 that cannot be back-ported. This is really weird, as we have plenty of bugs NOT affecting 8.0 but verified as valid 5.7.x bugs. Moreover, I've verified that in the case of MySQL 8.0.x the ref table just cannot be created:
    mysql> create table ref (
        -> a_id int unsigned not null,
        -> b_id int unsigned not null,
        ->
        -> constraint FK_ref_a_b foreign key (b_id,a_id) references src (b_id,a_id)
        -> ) engine=InnoDB;
    ERROR 1822 (HY000): Failed to add the foreign key constraint. Missing index for constraint 'FK_ref_a_b' in the referenced table 'src'
    But it means the test case does not apply to 8.0 "as is" and that MySQL 8.0 is not affected, but from the above it's not obvious whether there is a fix to back-port at all. As a next step I tried essentially the same test case on MariaDB 10.3 and ended up with a crash that I've reported as MDEV-19250. So, this bug report that was not even accepted by the Oracle MySQL team ended up as the source of a useful check and bug report for MariaDB.
  • Bug #94835 - "debug-assert while restarting server post install component". This is a classical Percona style bug report from Krunal Bauskar. Percona engineers carefully work on debug builds and find many unique new bugs that way.
  • Bug #94850 - "Not able to import partitioned tablespace older than 8.0.14". This regression bug (for cases when lower_case_table_names=1) was reported by Sean Ren.
  • Bug #94858 - "Deletion count incorrect when rows deleted through multi-hop foreign keys". I've checked that MariaDB 10.3 is also affected by this bug reported by Sawyer Knoblich.
  • Bug #94862 - "MySQL optimizer scan full index for max() on indexed column." Nice bug report from Seunguck Lee. As one can easily check MariaDB is not affected:
    MariaDB [test]> explain select max(fd2) from test;
    +------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
    | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
    +------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
    |    1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Select tables optimized away |
    +------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
    1 row in set (0,001 sec)

    MariaDB [test]> explain select get_timestamp(max(fd2)) from test;
    +------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
    | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
    +------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
    |    1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Select tables optimized away |
    +------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
    1 row in set (0,001 sec)

    MariaDB [test]> select version();
    +-----------------+
    | version()       |
    +-----------------+
    | 10.3.14-MariaDB |
    +-----------------+
    1 row in set (0,000 sec)
  • Bug #94881 - "slave replication lock wait timeout because of supremum record". I fail to understand why this bug report from Zhenghu Wen ended up as "Closed". There is a detailed enough code analysis, but no test case to just copy/paste. The problem happens only with XA transactions, and it's not clear if the recent MySQL 5.7.25 is also affected. It means the bug could be in "Need Feedback" or even "Can't Repeat" status, but I see zero reasons to close it at the moment. Looks very wrong to me.
  • Bug #94903 - "optimizer chooses inefficient plan for order by + limit in subquery". It seems that recently a lot of effort from both the bug reporter (Василий Лукьянчиков in this case) and even an Oracle developer (Guilhem Bichot in this case) may be needed to force proper processing of a real bug.
It may take more than one dram of a good single malt to keep up with recent style of MySQL bugs processing...
* * *
To summarize:
  1. Attracting public attention of MySQL community users (via blog posts in this series or by any other means) to some MySQL bugs still helps to get them processed properly.
  2. Oracle MySQL engineers who work on bugs continue to refuse further processing of some valid bug reports based on formal and not entirely correct assumptions. In some cases I clearly miss checks for possible regressions vs older versions.
  3. As I already stated, Oracle does not seem to care much about bugs in XA transactions and possible replication problems they may cause.
  4. I encourage community users to share their findings and concerns in public MySQL bugs database. Even if they end up as "Not a Bug", they may still start useful discussions and fixes.
  5. By the way, my comment about the related discussion in MariaDB MDEV-15641 is still private in Bug #94610. This is unfortunate.

by Valerii Kravchuk (noreply@blogger.com) at April 15, 2019 06:02 AM

April 04, 2019

Valeriy Kravchuk

Fun with Bugs #82 - On MySQL Bug Reports I am Subscribed to, Part XVIII

I've got a few comments on my post on references to MariaDB in MySQL bug reports (not in the blog, but via social media and in personal messages), and all but one of the comments from current and former colleagues whose opinion I value a lot confirmed that this really looks like a kind of attempt to advertise MariaDB. So, from now on I'll try to keep my findings on how tests shared by MySQL bug reporters work in MariaDB to myself, MariaDB JIRA and this blog (where I can and will advertise whatever makes sense to me), and avoid adding them to MySQL bug reports.

That said, I still think that it's normal to share links to MariaDB bug reports that add something useful (like patches, explanations or better test cases), and I keep insisting that this kind of feedback should not be hidden. Yes, I want to mention Bug #94610 (and related MDEV-15641) again, as a clear example of censorship that is not reasonable and should not be tolerated.

In the meantime, since my previous post in this series I've subscribed to 30 or so new MySQL bug reports. Some of them are listed below, started from the oldest. This time I am not going to exclude "inactive" reports that were not accepted by Oracle MySQL engineers as valid:
  • Bug #94629 - "no variable can skip a single channel error in mysql replication". This is a request to add support for per-channel options to skip N transactions or specific errors. It is not accepted ("Not a Bug") just because one can stop replication on all channels and start it on one to skip transaction(s) there, then resume replication for all channels. Do you really think this is the right and only way to process such a report?
  • Bug #94647 - "Memory leak in MEMORY table by glibc". This is also not a bug because one can use something like malloc-lib=jemalloc with mysqld_safe or Environment="LD_PRELOAD=/path/to/jemalloc" with systemd services. There might be some cost related to that in older versions... Note that the similar MDEV-14050 is still open.
  • Bug #94655 - "Some GIS function do not use spatial index anymore". Yet another regression vs MySQL 5.7 was reported by Cedric Tabin. It ended up verified as a feature request without a regression tag...
  • Bug #94664 - "Binlog related deadlock leads to all incoming connection choked.". This report from Yanmin Qiao ended up as a duplicate of Bug #92108 - "Deadlock by concurrent show binlogs, pfs session_variables table & binlog purge" (fixed in MySQL 5.7.25+, thanks to Sveta Smirnova for the hint). See also Bug #91941.
  • Bug #94665 - "enabling undo-tablespace encryption doesn't mark tablespace encryption flag". Nice finding by Krunal Bauskar from Percona.
  • Bug #94699 - "Mysql deadlock and bugcheck on aarch64 under stress test". Bug report with a patch contributed by Cai Yibo. The fix is included in upcoming MySQL 8.0.17 and the bug is already closed.
  • Bug #94709 - "Regression behavior for full text index". This regression was reported by Carlos Tutte and properly verified (with regression tag added and all versions checked) by Umesh Shastry. See also detailed analysis of possible reason in the comment from Nikolai Ikhalainen.
  • Bug #94723 - "Incorrect simple query result with func result and FROM table column in where". Michal Vrabel found this interesting case where MySQL 8.0.15 returns wrong results. I've checked the test case on MariaDB 10.3.7 and it is not affected. Feel free to consider this check and statement my lame attempt to advertise MariaDB. I don't mind.
  • Bug #94730 - "Kill slave may cause start slave to report an error.". This bug was declared a duplicate of a nice Bug #93397 - "Replication does not start if restart MySQL after init without start slave." reported by Jean-François Gagné earlier. Both bugs were reported for MySQL 5.7.x, but I do not see any public attempt to verify if MySQL 5.6 or 8.0 is also affected. In the past it was required to check/verify a bug on all supported GA versions if the test case applied. Nowadays this approach is not followed way too often, even when the bug reporter cared enough to provide an MTR test case.
  • Bug #94737 - "MySQL uses composite hash index when not possible and returns wrong result". Yet another optimizer bug was reported by Simon Banaan. Again, MariaDB 10.3.7 is NOT affected. I can freely and happily state this here, even if it's inappropriate to state so in the bug report itself. By the way, other MySQL versions were probably not checked. Also, unlike the Oracle engineer who verified the bug, I do not hesitate to copy/paste the entire results of my testing here:
    MariaDB [test]> show create table tmp_projectdays_4\G
    *************************** 1. row ***************************
           Table: tmp_projectdays_4
    Create Table: CREATE TABLE `tmp_projectdays_4` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `project` int(11) NOT NULL,
      `datum` date NOT NULL,
      `voorkomen` tinyint(1) NOT NULL DEFAULT 1,
      `tijden` tinyint(1) NOT NULL DEFAULT 0,
      `personeel` tinyint(1) NOT NULL DEFAULT 0,
      `transport` tinyint(1) NOT NULL DEFAULT 0,
      `materiaal` tinyint(1) NOT NULL DEFAULT 0,
      `materiaaluit` tinyint(1) NOT NULL DEFAULT 0,
      `materiaalin` tinyint(1) NOT NULL DEFAULT 0,
      `voertuigen` varchar(1024) DEFAULT '',
      `medewerkers` varchar(1024) DEFAULT '',
      `personeel_nodig` int(11) DEFAULT 0,
      `personeel_gepland` int(11) DEFAULT 0,
      `voertuigen_nodig` int(11) DEFAULT 0,
      `voertuigen_gepland` int(11) DEFAULT 0,
      `created` datetime DEFAULT NULL,
      `modified` datetime DEFAULT NULL,
      `creator` int(11) DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `project` (`project`,`datum`) USING HASH
    ) ENGINE=MEMORY AUTO_INCREMENT=2545 DEFAULT CHARSET=utf8mb4
    1 row in set (0.001 sec)

    MariaDB [test]> explain SELECT COUNT(1) FROM `tmp_projectdays_4` WHERE `project`
     IN(15409,15911,15929,15936,16004,16005,16007,16029,16031,16052,16054,16040,1248
    5,15892,16035,16060,16066,16093,16057,16027,15988,15440,15996,11457,15232,15704,
    12512,12508,14896,15594,16039,14997,16058,14436,16006,15761,15536,16016,16019,11
    237,13332,16037,14015,15537,15369,15756,12038,14327,13673,11393,14377,15983,1251
    4,12511,13585,12732,14139,14141,12503,15727,15531,15746,15773,15207,13675,15676,
    15663,10412,13677,15528,15530,10032,15535,15693,15532,15533,15534,15529,16056,16
    064,16070,15994,15918,16045,16073,16074,16077,16069,16022,16081,15862,16048,1606
    2,15610,15421,16001,15896,15004,15881,15882,15883,15884,15886,16065,15814,16076,
    16085,16174,15463,15873,15874,15880,15636,16092,15909,16078,15923,16026,16047,16
    094,16111,15914,15919,16041,16063,16068,15971,16080,15961,16038,16096,16127,1564
    1,13295,16146,15762,15811,15937,16150,16152,14438,16086,16156,15593,16147,15910,
    16106,16107,16161,16132,16095,16137,16072,16097,16110,16114,16162,16166,16175,16
    176,16178,15473,16160,15958,16036,16042,16115,16165,16167,16170,16177,16185,1582
    3,16190,16169,15989,16194,16116,16131,16157,16192,16197,16203,16193,16050,16180,
    16209,15522,16148,16205,16201,15990,16158,16216,16033,15974,16112,16133,16181,16
    188,16189,16212,16238,16241,16183,15640,15638,16087,16088,16129,16186,16164,1610
    8,15985,16244,15991,15763,16049,15999,16104,16208,13976,16122,15924,16046,16242,
    16151,16117,16187);

    +------+-------------+-------------------+------+---------------+------+---------+------+------+-------------+
    | id   | select_type | table             | type | possible_keys | key  | key_len | ref  | rows | Extra       |
    +------+-------------+-------------------+------+---------------+------+---------+------+------+-------------+
    |    1 | SIMPLE      | tmp_projectdays_4 | ALL  | project       | NULL | NULL    | NULL | 2544 | Using where |
    +------+-------------+-------------------+------+---------------+------+---------+------+------+-------------+
    1 row in set (0.004 sec)

    MariaDB [test]> SELECT COUNT(1) FROM `tmp_projectdays_4` WHERE `project` IN(1540
    9,15911,15929,15936,16004,16005,16007,16029,16031,16052,16054,16040,12485,15892,
    16035,16060,16066,16093,16057,16027,15988,15440,15996,11457,15232,15704,12512,12
    508,14896,15594,16039,14997,16058,14436,16006,15761,15536,16016,16019,11237,1333
    2,16037,14015,15537,15369,15756,12038,14327,13673,11393,14377,15983,12514,12511,
    13585,12732,14139,14141,12503,15727,15531,15746,15773,15207,13675,15676,15663,10
    412,13677,15528,15530,10032,15535,15693,15532,15533,15534,15529,16056,16064,1607
    0,15994,15918,16045,16073,16074,16077,16069,16022,16081,15862,16048,16062,15610,
    15421,16001,15896,15004,15881,15882,15883,15884,15886,16065,15814,16076,16085,16
    174,15463,15873,15874,15880,15636,16092,15909,16078,15923,16026,16047,16094,1611
    1,15914,15919,16041,16063,16068,15971,16080,15961,16038,16096,16127,15641,13295,
    16146,15762,15811,15937,16150,16152,14438,16086,16156,15593,16147,15910,16106,16
    107,16161,16132,16095,16137,16072,16097,16110,16114,16162,16166,16175,16176,1617
    8,15473,16160,15958,16036,16042,16115,16165,16167,16170,16177,16185,15823,16190,
    16169,15989,16194,16116,16131,16157,16192,16197,16203,16193,16050,16180,16209,15
    522,16148,16205,16201,15990,16158,16216,16033,15974,16112,16133,16181,16188,1618
    9,16212,16238,16241,16183,15640,15638,16087,16088,16129,16186,16164,16108,15985,
    16244,15991,15763,16049,15999,16104,16208,13976,16122,15924,16046,16242,16151,16
    117,16187);

    +----------+
    | COUNT(1) |
    +----------+
    |     2544 |
    +----------+
    1 row in set (0.025 sec)

    MariaDB [test]> select version();
    +--------------------+
    | version()          |
    +--------------------+
    | 10.3.7-MariaDB-log |
    +--------------------+
    1 row in set (0.021 sec)
    When the job is done properly, I see no reason NOT to share the results.
  • Bug #94747 - "4GB Limit on large_pages shared memory set-up". My former colleague Nikolai Ikhalainen from Percona noted this nice undocumented "feature" (had I forgotten to advertise Percona recently? Sorry about that...). He proved with a C program that one can create shared memory segments on Linux larger than 4GB; one just has to use the proper data type, unsigned long integer, in MySQL's code. Still, this report ended up as a non-critical bug in the "MySQL Server: Documentation" category, or maybe even a feature request internally. What a shame!
    Spring in Paris is nice, as this photo taken 3 years ago proves. The way MySQL bug reports are handled this spring is not nice at all in some cases.
    To summarize:
    1. It seems that recently the fact that there is some limited workaround already published somewhere is a good enough reason NOT to accept a valid feature request. Noted.
    2. Regression bugs (reports about a drop in performance, or a problem that had not happened with an older version but happens with some recent one) are still sometimes not marked with the regression tag. Moreover, clear performance regressions in MySQL 8.0.x vs MySQL 5.7.x may end up as just feature requests... A request to "Make MySQL Great Again" maybe?
    3. MySQL engineers who verify bugs often do not care to check all major versions and/or share the results of their tests. This is unfortunate.
    4. Some bugs are not classified properly upon verification. The fact that the wrong data type is used is anything but a severity 3 documentation problem, really.

    by Valerii Kravchuk (noreply@blogger.com) at April 04, 2019 07:26 PM

    April 02, 2019

    Peter Zaitsev

    Percona XtraDB Cluster Operator 0.3.0 Early Access Release Is Now Available

    Percona announces the release of Percona XtraDB Cluster Operator 0.3.0 early access.

    The Percona XtraDB Cluster Operator simplifies the deployment and management of Percona XtraDB Cluster in a Kubernetes or OpenShift environment. It extends the Kubernetes API with a new custom resource for deploying, configuring and managing the application through the whole life cycle.

    You can install the Percona XtraDB Cluster Operator on Kubernetes or OpenShift. While the operator does not support all the Percona XtraDB Cluster features in this early access release, instructions on how to install and configure it are already available along with the operator source code, hosted in our Github repository.

    The Percona XtraDB Cluster Operator is an early access release. Percona doesn’t recommend it for production environments.

    New features

    Improvements

    Fixed Bugs

    • CLOUD-148: Pod Disruption Budget code caused the wrong configuration to be applied for ProxySQL and lacked support for multiple availability zones.
    • CLOUD-138: The restore-backup.sh script was exiting with an error because its code was not taking into account image version numbers.
    • CLOUD-118: The backup recovery job was unable to start if Persistent Volume for backup and Persistent Volume for Pod-0 were placed in different availability zones.

    Percona XtraDB Cluster is an open source, cost-effective and robust clustering solution for businesses. It integrates Percona Server for MySQL with the Galera replication library to produce a highly-available and scalable MySQL® cluster complete with synchronous multi-master replication, zero data loss and automatic node provisioning using Percona XtraBackup.

    Help us improve our software quality by reporting any bugs you encounter using our bug tracking system.

    by Dmitriy Kostiuk at April 02, 2019 08:37 PM

    Simple STONITH with ProxySQL and Orchestrator

    Diagram: a 3-DC replication topology managed by Orchestrator, with ProxySQL handling the routing

    Distributed systems are hard – I just want to echo that. In MySQL, we have quite a number of options to run highly available systems. However, real fault tolerant systems are difficult to achieve.

    Take for example a common use case of multi-DC replication where Orchestrator is responsible for managing the topology, while ProxySQL takes care of the routing/proxying to the correct server, as illustrated below. A rare case you might encounter is that the primary MySQL node01 on DC1 might have a blip of a couple of seconds. Because Orchestrator uses an adaptive health check – it consults not only the node itself but also its replicas – it can react really fast and promote the node in DC2.

    Why is this problematic?

    The problem occurs when node01 resolves its temporary issue. A race condition could occur within ProxySQL that could mark it back as read-write. You can increase an “offline” period within ProxySQL to make sure Orchestrator rediscovers the node first. Hopefully, it will set it to read-only immediately, but what we want is an extra layer of predictable behavior. This normally comes in the form of STONITH – by taking the other node out of action, we practically reduce the risk of conflict close to zero.
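
    As a side note, that “offline” window can be lengthened from ProxySQL's admin interface. The statements below are only a sketch: they assume the admin interface on its default port and the mysql-shun_recovery_time_sec variable, so check the exact variable name against your ProxySQL version's documentation:

    -- run against the ProxySQL admin interface (port 6032 by default)
    -- keep a shunned server out of rotation longer, e.g. 30 seconds
    UPDATE global_variables SET variable_value='30'
      WHERE variable_name='mysql-shun_recovery_time_sec';
    LOAD MYSQL VARIABLES TO RUNTIME;
    SAVE MYSQL VARIABLES TO DISK;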

    The solution

    Orchestrator supports hooks to do this, but we can also do it easily with ProxySQL using its built-in scheduler. In this case, we create a script where Orchestrator is consulted frequently for any nodes recently marked as downtimed, and we also mark them as such in ProxySQL. The script proxy-oc-tool.sh can be found on Github.

    What does this script do? In the case of our topology above:

    • If for any reason, connections to MySQL on node01 fail, Orchestrator will pick node02 as the new primary.
    • Since node01 is unreachable – cannot modify read_only nor update replication – it will be marked as downtimed with lost-in-recovery as the reason.
    • If node01 comes back online, and ProxySQL sees it before the next Orchestrator check, it can rejoin the pool. Then it’s possible that you have two writeable nodes in the hostgroup.
    • To prevent the condition above, as soon as the node is marked with downtime from Orchestrator, the script proxy-oc-tool.sh will mark it OFFLINE_SOFT so it never rejoins the writer_hostgroup in ProxySQL (see the sketch after this list).
    • Once an operator fixes node01, i.e. reattaches it as a replica and removes the downtimed mark, the script proxy-oc-tool.sh will mark it back ONLINE automatically.
    • Additionally, if DC1 gets completely disconnected from DC2 and AWS, the script will not be able to reach Orchestrator’s raft-leader and will set all nodes to OFFLINE_SOFT, preventing isolated writes on DC1.

    Adding the script to ProxySQL is simple. First you download and set permissions. I placed the script in /usr/bin/ – but you can put it anywhere accessible by the ProxySQL process.

    wget https://gist.githubusercontent.com/dotmanila/1a78ef67da86473c70c7c55d3f6fda89/raw/b671fed06686803e626c1541b69a2a9d20e6bce5/proxy-oc-tool.sh
    chmod 0755 proxy-oc-tool.sh
    mv proxy-oc-tool.sh /usr/bin/

    Note, you will need to edit some variables in the script, i.e. ORCHESTRATOR_PATH.

    Then load into the scheduler:

    INSERT INTO scheduler (interval_ms, filename)
      VALUES (5000, '/usr/bin/proxy-oc-tool.sh');
    LOAD SCHEDULER TO RUNTIME;
    SAVE SCHEDULER TO DISK;

    I’ve set the interval to five seconds since inside ProxySQL, a shunned node will need about 10 seconds before the next read-only check is done. This way, this script is still ahead of ProxySQL and is able to mark the dead node as OFFLINE_SOFT.

    Because this is the simple version, there are obvious additional improvements to be made in the script, like using scheduler args to specify ORCHESTRATOR_PATH and implementing error checking.
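
    As a sketch of the first idea, the ProxySQL scheduler table has arg1–arg5 columns that could pass such settings to the script; the Orchestrator URL below is just a placeholder:

    INSERT INTO scheduler (interval_ms, filename, arg1)
      VALUES (5000, '/usr/bin/proxy-oc-tool.sh', 'http://orchestrator.example.com:3000');
    LOAD SCHEDULER TO RUNTIME;
    SAVE SCHEDULER TO DISK;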

    by Jervin Real at April 02, 2019 10:12 AM

    March 30, 2019

    Valeriy Kravchuk

    On References to MariaDB and MariaDB Bugs (MDEVs) in MySQL Bug Reports

    Recently I noted that some of my comments to public MySQL bug reports got hidden by somebody from Oracle with privileges to do so. I was not able to find out who did that and when, as this information is not communicated to bug subscribers (this may change if my feature request, Bug #94807 - "Subscriber should be notified when comment is made private", is eventually implemented).

    When it happened for the first time I thought it was probably non-intentional. When it happened for a second time I complained with a tweet that got a few likes and zero comments. Recently this happened again and yet another tweet did not get much attention, but at least I've got a comment via Bug #94797 that my comment there (where I checked the test case on a MariaDB version I had at hand to find out it's not affected, something I often do for bugs mentioned in my blog posts here) was hidden as irrelevant and "an attempt to advertise MariaDB".

    Snow hides everything, good and bad, dog shit, holes in the road and autumn flowers... Do we really want information provided in comments to public MySQL bugs to get hidden just because someone once decided it's "bad"?
    I really wonder if any of my readers think that I advertise MariaDB with my public posts or public comments anywhere or specifically in MySQL bug reports?

    I'd also like to share here, where no one besides me can hide or delete comments (I hope), what was hidden in the case that caused me to tweet about the censorship I have to deal with. In Bug #94610 - "Server stalls because ALTER TABLE on partitioned table holds dict mutex", which ended up as "Not a Bug" (not even a duplicate of the verified Bug #83435 - "ALTER TABLE is very slow when using PARTITIONED table" that it referred to and extended, with the global mutex usage highlighted and the impact explained), I've added the following comment:
    "[12 Mar 7:30] Valeriy Kravchuk
    Not only it stalls, but if it stalls for long enough time it will crash :)

    Useful related reading is here: https://jira.mariadb.org/browse/MDEV-15641
    "
    The comment was hidden very soon. Now, if you check that link, you'll see a confirmed, unresolved MariaDB bug report. I mostly had this comment to MDEV-15641 in mind, where my colleague and well known InnoDB developer Marko Mäkelä stated:
    "The row_log_table_apply() is actually invoked while holding both dict_sys->mutex and dict_operation_lock. If there is a lot of log to apply, this may actually cause InnoDB to crash."
    I may be mistaken in linking these two bug reports together, but isn't highlighting the possibility of a crash due to a long semaphore wait important for understanding the impact of the bug report and triaging it properly? What wrong could MySQL users and bug report readers see if they follow the link to the MariaDB bug I considered relevant? What was advertised by this comment that is harmful or useless for the MySQL Community?

    I was even more surprised by these recent actions on my comments because in the past I had never noted a similar approach. Check the following bug reports, for example (I searched for those with "MDEV" and "Kravchuk" in them to get these):
    • Bug #80919 - "MySQL Crashes when Droping Indexes - Long semaphore wait". In this bug report (a real bug, fixed in 5.7.22) I've added a comment that refers to MDEV-14637. The comment still remains public and, IMHO, is still useful. Providing this link helped to get proper attention to the bug, so it was re-opened and finally got comments from Oracle engineers. Was it an attempt to advertise MariaDB? How is this case different from my comment in Bug #94610 quoted above?
    • Bug #84185 - "Not all "Statements writing to a table with an auto-increment..." are unsafe". I reported this "upstream" MySQL bug based on MDEV-10170 - "Misleading "Statements writing to a table with an auto-increment column after selecting from another table are unsafe" on DELETE ... SELECT", previously found by my colleague Hartmut Holzgraefe. I've also added a link to the "upstream" MySQL bug report to that MDEV. Does anybody in the MySQL or MariaDB user communities think that such cross-references are useless, harmful, or may be considered an "attempt to advertise a competitor" if either of the vendors fixes the bug first?
    • Bug #48392 - "mysql_upgrade improperly escapes passwords with single quotes". I verified this bug in 2009 while working for MySQL at Sun, and it still remains "Verified" (I had not re-checked if it's still repeatable with current MySQL versions). Then in 2013 a community user added a comment referring to the MariaDB bug, MDEV-4664 - "mysql_upgrade crashes if root's password contains an apostrophe/single quotation mark", that was fixed later, in 2015. This comment still remains public and is useful!
    So, have my comments that mention MDEVs or MariaDB in general recently become so irrelevant, and so much of a MariaDB advertisement, compared to the previous ones? What exact community standards or rules do they break? Is it now forbidden for any user of the MySQL bugs database to mention MariaDB or bugs in it, or to use MariaDB in tests to make some point and share the results in public in the MySQL bugs database, or is the problem with me personally doing this?

    I'd be happy to read explanations or opinions from MySQL community users and my former Oracle colleagues in comments to this blog post.

    by Valerii Kravchuk (noreply@blogger.com) at March 30, 2019 01:28 PM

    March 29, 2019

    Peter Zaitsev

    Percona Server for MongoDB 3.4.20-2.18 Is Now Available


    Percona announces the release of Percona Server for MongoDB 3.4.20-2.18 on March 29, 2019. Download the latest version from the Percona website or the Percona software repositories.

    Percona Server for MongoDB is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 3.4 Community Edition. It supports MongoDB 3.4 protocols and drivers.

    Percona Server for MongoDB extends Community Edition functionality by including the Percona Memory Engine storage engine, as well as several enterprise-grade features:

    Also, it includes MongoRocks storage engine, which is now deprecated. Percona Server for MongoDB requires no changes to MongoDB applications or code.

    Release 3.4.20-2.18 extends the buildInfo command with the psmdbVersion key to report the version of Percona Server for MongoDB. If this key exists then Percona Server for MongoDB is installed on the server. This key is not available from MongoDB.
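
    As a quick sketch (assuming the mongo shell is on the path and the server listens on the default port), the key can be checked like this:

    # prints the Percona Server for MongoDB version, or "undefined" on upstream MongoDB
    mongo --quiet --eval 'db.runCommand({ buildInfo: 1 }).psmdbVersion'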

    Improvements

    • PSMDB-216: The database command buildInfo provides the psmdbVersion key to report the version of Percona Server for MongoDB. If this key exists then Percona Server for MongoDB is installed on the server. This key is not available from MongoDB.

    The Percona Server for MongoDB 3.4.20-2.18 release notes are available in the official documentation.

    by Borys Belinsky at March 29, 2019 05:08 PM

    PostgreSQL: Access ClickHouse, One of the Fastest Column DBMSs, With clickhousedb_fdw

    Database management systems are meant to house data but, occasionally, they may need to talk with another DBMS. For example, to access an external server which may be hosting a different DBMS. With heterogeneous environments becoming more and more common, a bridge between the servers is established. We call this bridge a “Foreign Data Wrapper” (FDW). PostgreSQL completed its support of SQL/MED (SQL Management of External Data) with release 9.3 in 2013. A foreign data wrapper is a shared library that is loaded by a PostgreSQL server. It enables the creation of foreign tables in PostgreSQL that act as proxies for another data source.

    When you query a foreign table, Postgres passes the request to the associated foreign data wrapper. The FDW creates the connection and retrieves or updates the data in the external data store. Since the PostgreSQL planner is involved in this process as well, it may perform certain operations, like aggregates or joins, on the data retrieved from the data source. I cover some of these later in this post.

    ClickHouse Database

    ClickHouse is an open source column based database management system which claims to be 100–1,000x faster than traditional approaches, capable of processing more than a billion rows in less than a second.

    clickhousedb_fdw

    clickhousedb_fdw is an open source project – GPLv2 licensed – from Percona. Here’s the link for GitHub project repository:

    https://github.com/Percona-Lab/clickhousedb_fdw

    It is an FDW for ClickHouse that allows you to SELECT from, and INSERT INTO, a ClickHouse database from within a PostgreSQL v11 server.

    The FDW supports advanced features like aggregate pushdown and joins pushdown. These significantly improve performance by utilizing the remote server’s resources for these resource intensive operations.

    If you would like to follow this post and try the FDW between Postgres and ClickHouse, you can download and set up the ontime dataset for ClickHouse. After following the instructions, test that you have the desired data. The ClickHouse client is a client CLI for the ClickHouse Database.
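
    A minimal sanity check with the ClickHouse client could look like this (assuming the dataset was loaded into the ontime table in the default database, as used in the examples below):

    clickhouse-client --query "SELECT count(*) FROM default.ontime"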

    Prepare Data for ClickHouse

    Now that the data is ready in ClickHouse, the next step is to set up PostgreSQL. We need to create a ClickHouse foreign server, user mapping, and foreign tables.

    Install the clickhousedb_fdw extension

    There are manual ways to install clickhousedb_fdw, but it also uses PostgreSQL’s coolest extension install feature. By just entering a SQL command you can start using the extension:

    CREATE EXTENSION clickhousedb_fdw;

    CREATE SERVER clickhouse_svr FOREIGN DATA WRAPPER clickhousedb_fdw
    OPTIONS(dbname 'test_database', driver '/usr/lib/libclickhouseodbc.so');

    CREATE USER MAPPING FOR CURRENT_USER SERVER clickhouse_svr;

    CREATE FOREIGN TABLE clickhouse_tbl_ontime (  "Year" Int,  "Quarter" Int8,  "Month" Int8,  "DayofMonth" Int8,  "DayOfWeek" Int8,  "FlightDate" Date,  "UniqueCarrier" Varchar(7),  "AirlineID" Int,  "Carrier" Varchar(2),  "TailNum" text,  "FlightNum" text,  "OriginAirportID" Int,  "OriginAirportSeqID" Int,  "OriginCityMarketID" Int,  "Origin" Varchar(5),  "OriginCityName" text,  "OriginState" Varchar(2),  "OriginStateFips" text,  "OriginStateName" text,  "OriginWac" Int,  "DestAirportID" Int,  "DestAirportSeqID" Int,  "DestCityMarketID" Int,  "Dest" Varchar(5),  "DestCityName" text,  "DestState" Varchar(2),  "DestStateFips" text,  "DestStateName" text,  "DestWac" Int,  "CRSDepTime" Int,  "DepTime" Int,  "DepDelay" Int,  "DepDelayMinutes" Int,  "DepDel15" Int,  "DepartureDelayGroups" text,  "DepTimeBlk" text,  "TaxiOut" Int,  "WheelsOff" Int,  "WheelsOn" Int,  "TaxiIn" Int,  "CRSArrTime" Int,  "ArrTime" Int,  "ArrDelay" Int,  "ArrDelayMinutes" Int,  "ArrDel15" Int,  "ArrivalDelayGroups" Int,  "ArrTimeBlk" text,  "Cancelled" Int8,  "CancellationCode" Varchar(1),  "Diverted" Int8,  "CRSElapsedTime" Int,  "ActualElapsedTime" Int,  "AirTime" Int,  "Flights" Int,  "Distance" Int,  "DistanceGroup" Int8,  "CarrierDelay" Int,  "WeatherDelay" Int,  "NASDelay" Int,  "SecurityDelay" Int,  "LateAircraftDelay" Int,  "FirstDepTime" text,  "TotalAddGTime" text,  "LongestAddGTime" text,  "DivAirportLandings" text,  "DivReachedDest" text,  "DivActualElapsedTime" text,  "DivArrDelay" text,  "DivDistance" text,  "Div1Airport" text,  "Div1AirportID" Int,  "Div1AirportSeqID" Int,  "Div1WheelsOn" text,  "Div1TotalGTime" text,  "Div1LongestGTime" text,  "Div1WheelsOff" text,  "Div1TailNum" text,  "Div2Airport" text,  "Div2AirportID" Int,  "Div2AirportSeqID" Int,  "Div2WheelsOn" text,  "Div2TotalGTime" text,  "Div2LongestGTime" text,"Div2WheelsOff" text,  "Div2TailNum" text,  "Div3Airport" text,  "Div3AirportID" Int,  "Div3AirportSeqID" Int,  "Div3WheelsOn" text,  "Div3TotalGTime" text,  "Div3LongestGTime" text,  "Div3WheelsOff" text,  "Div3TailNum" text,  "Div4Airport" text,  "Div4AirportID" Int,  "Div4AirportSeqID" Int,  "Div4WheelsOn" text,  "Div4TotalGTime" text,  "Div4LongestGTime" text,  "Div4WheelsOff" text,  "Div4TailNum" text,  "Div5Airport" text,  "Div5AirportID" Int,  "Div5AirportSeqID" Int,  "Div5WheelsOn" text,  "Div5TotalGTime" text,  "Div5LongestGTime" text,  "Div5WheelsOff" text,  "Div5TailNum" text) server clickhouse_svr options(table_name 'ontime');

    postgres=# SELECT a."Year", c1/c2 as Value FROM ( select "Year", count(*)*1000 as c1          
               FROM clickhouse_tbl_ontime          
               WHERE "DepDelay">10 GROUP BY "Year") a                        
               INNER JOIN (select "Year", count(*) as c2 from clickhouse_tbl_ontime          
               GROUP BY "Year" ) b on a."Year"=b."Year" LIMIT 3;
    Year |   value    
    ------+------------
    1987 |        199
    1988 | 5202096000
    1989 | 5041199000
    (3 rows)

    Performance Features

    PostgreSQL has improved foreign data wrapper processing by adding the pushdown feature. Push down improves performance significantly, as the processing of data takes place earlier in the processing chain. Push down abilities include:

    • Operator and function Pushdown
    • Predicate Pushdown
    • Aggregate Pushdown
    • Join Pushdown

    Operator and function Pushdown

    Functions and operators are sent to ClickHouse instead of being calculated and filtered at the PostgreSQL end.

    postgres=# EXPLAIN VERBOSE SELECT avg("DepDelay") FROM clickhouse_tbl_ontime WHERE "DepDelay" <10; 
               Foreign Scan  (cost=1.00..-1.00 rows=1000 width=32) Output: (avg("DepDelay"))  
               Relations: Aggregate on (clickhouse_tbl_ontime)  
               Remote SQL: SELECT avg("DepDelay") FROM "default".ontime WHERE (("DepDelay" < 10))(4 rows)

    Predicate Pushdown

    Instead of filtering the data at the PostgreSQL end, clickhousedb_fdw sends the predicate to the ClickHouse database.

    postgres=# EXPLAIN VERBOSE SELECT "Year" FROM clickhouse_tbl_ontime WHERE "Year"=1989;                                  
               Foreign Scan on public.clickhouse_tbl_ontime  Output: "Year"  
               Remote SQL: SELECT "Year" FROM "default".ontime WHERE (("Year" = 1989)

    Aggregate Pushdown

    Aggregate push down is a new feature of PostgreSQL FDW. There are currently very few foreign data wrappers that support aggregate push down – clickhousedb_fdw is one of them. The planner decides which aggregates are pushed down and which aren’t. Here is an example for both cases.

    postgres=# EXPLAIN VERBOSE SELECT count(*) FROM clickhouse_tbl_ontime;
              Foreign Scan (cost=1.00..-1.00 rows=1000 width=8)
              Output: (count(*)) Relations: Aggregate on (clickhouse_tbl_ontime)
              Remote SQL: SELECT count(*) FROM "default".ontime

    Join Pushdown

    Again, this is a new feature in PostgreSQL FDW, and our clickhousedb_fdw also supports join push down. Here’s an example of that.

    postgres=# EXPLAIN VERBOSE SELECT a."Year"
                               FROM clickhouse_tbl_ontime a
                               LEFT JOIN clickhouse_tbl_ontime b ON a."Year" = b."Year";
            Foreign Scan (cost=1.00..-1.00 rows=1000 width=50);
            Output: a."Year" Relations: (clickhouse_tbl_ontime a) LEFT JOIN (clickhouse_tbl_ontime b)
            Remote SQL: SELECT r1."Year" FROM "default".ontime r1 ALL LEFT JOIN "default".ontime r2 ON (((r1."Year" = r2."Year")))

    Percona’s support for PostgreSQL

    As part of our commitment to being unbiased champions of the open source database ecosystem, Percona offers support for PostgreSQL – you can read more about that here. And as you can see, as part of our support commitment, we’re now developing our own open source PostgreSQL projects such as the clickhousedb_fdw. Subscribe to the blog to be amongst the first to know of PostgreSQL and other open source projects from Percona.

    As an author of the new clickhousedb_fdw – as well as other FDWs – I’d be really happy to hear of your use cases and your experience of using this feature.


    Photo by Hidde Rensink on Unsplash

    by Ibrar Ahmed at March 29, 2019 02:01 PM

    Kurt von Finck

    Last post. I’m gone.

    Last post. I’m gone.

    https://reddit.com/r/ploos

    https://pluspora.com

    I’m “mneptok” just about everywhere. I’ll see you all in the next life, when we are all cats.

    https://youtu.be/FqHIkkRrwcQ

    by mneptok at March 29, 2019 02:57 AM

    March 27, 2019

    Peter Zaitsev

    PostgreSQL Upgrade Using pg_dump/pg_restore

    In this blog post, we will explore pg_dump / pg_restore, one of the most commonly used options for performing a PostgreSQL upgrade. It is important to understand the scenarios under which the pg_dump and pg_restore utilities will be helpful.

    This post is the second of our Upgrading or Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions series where we’ll be exploring different methods available to upgrade your PostgreSQL databases.

    About pg_dump

    pg_dump is a utility to perform a backup of a single database. You cannot back up multiple databases unless you do so using separate commands in parallel. If your upgrade plan needs global objects to be copied over, pg_dump needs to be supplemented by pg_dumpall. To know more about pg_dumpall, you may refer to our previous blog post.

    pg_dump formats

    pg_dump can produce dumps in multiple formats – plain text and custom format – each with its own advantages. When you use pg_dump with the custom format (-Fc), you must use pg_restore to restore the dump.

    If the dump is taken using a plain-text format, pg_dump generates a script file of multiple SQL commands. It can be restored using psql.

    A custom format dump, however, is compressed and is not human-readable.

    A dump taken in plain text format may be slightly larger in size when compared to a custom format dump.

    At times, you may wish to perform schema changes in your target PostgreSQL database before restore, for example, table partitioning. Or you may wish to restore only a selected list of objects from a dump file.

    In such cases, you cannot restore a selected list of tables from a plain format dump of a database. If you take the database dump in custom format,  you can use pg_restore, which will help you choose a specific set of tables for restoration.
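
    For example, a custom format dump followed by a single-table restore could look roughly like this (the database and table names are hypothetical, and connection options are omitted for brevity):

    /usr/lib/postgresql/11/bin/pg_dump -Fc -d databasename -p 5432 > /tmp/databasename.dump
    /usr/lib/postgresql/11/bin/pg_restore -d databasename -t tablename /tmp/databasename.dump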

    Steps involved in upgrade

    The most important point to remember is that both dump and restore should be performed using the latest binaries. For example, if we need to migrate from version 9.3 to version 11, we should be using the pg_dump binary of PostgreSQL 11 to connect to 9.3.

    When a server is equipped with two different versions of binaries, it is good practice to specify the full path of the pg_dump from the latest version as follows :

    /usr/lib/postgresql/11/bin/pg_dump <connection_info_of_source_system> <options>

    Getting the global dumps

    In PostgreSQL, users/roles are global to the database cluster, and the same user can have privileges on objects in different databases. These are called “Globals” because they are applicable to all the databases within the instance. Creation of globals in the target system at the earliest opportunity is very important, because the rest of the DDLs may contain GRANTs to these users/roles. It is good practice to dump the globals into a file, and to examine the file, before importing it into the destination system. This can be achieved using the following command :

    /usr/lib/postgresql/11/bin/pg_dumpall -g -p 5432 > /tmp/globals_only.sql

    Since this produces a plain SQL dump file, it can be fed to psql connected to the destination server. If there are no modifications required, the globals can be directly piped to the destination server using the command in the next example.

    /usr/lib/postgresql/11/bin/pg_dumpall -g <source_connection_info> | psql -p <destination_connection_info>

    The above command would work for an upgrade in a local server. You can add an additional argument -h for hostname in the <destination_connection_info> if you are performing an upgrade to a remote database server.

    Schema Only Dumps

    The next stage of the migration involves the creation of schema objects. At this point, you might want to move different database objects to different tablespaces, and partition a few of the tables. If such schema modifications are part of the plan, then we should extract the schema definition to a plain text file. Here’s an example command that can be used to achieve this :

    /usr/lib/postgresql/11/bin/pg_dump -s -d databasename -p 5432 > /tmp/schema_only.sql

    In general, the majority of the database objects won’t need any modifications. In such cases, it is good practice to dump the schema objects as such into the destination database using a PIPE, using a similar command to this:

    /usr/lib/postgresql/11/bin/pg_dump -s -d databasename <source_connection> | psql -d database <destination_connection>

    Once all the schema objects are created, we should be able to drop only those objects which need modification. We can then recreate them with their modified definition.
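
    As a sketch of that step (the table, its columns, and the partitioning scheme below are purely hypothetical), dropping and recreating a table with partitioning could look like this:

    DROP TABLE IF EXISTS measurements;

    CREATE TABLE measurements (
        id         bigint      NOT NULL,
        created_at timestamptz NOT NULL,
        value      numeric
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE measurements_2019 PARTITION OF measurements
        FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');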

    Copying data

    This is the stage when the majority of the data transfers between the database servers. If there is good bandwidth between source and destination, we should look to achieve maximum parallelism at this stage. In many situations, we could analyze the foreign key dependency hierarchy and import data in parallel batches for a group of tables. Data-only copying is possible using the -a or --data-only flag of pg_dump.
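
    A rough sketch of such a parallel batch (the table names are hypothetical, and the connection placeholders follow the same convention as the commands above) could be:

    # copy a batch of tables that have no FK dependencies on each other, in parallel
    for t in lookup_a lookup_b lookup_c; do
      /usr/lib/postgresql/11/bin/pg_dump -a -t "public.${t}" -d databasename <source_connection_info> \
        | psql -d databasename <destination_connection_info> &
    done
    wait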

    Copying the data of individual tables

    You might have to incorporate schema changes as part of an upgrade. In this case, you can copy the data of a few tables individually. We provide an example here:

    /usr/lib/postgresql/11/bin/pg_dump <sourcedb_connection_info> -d <database> -a -t schema.tablename | psql <destinationdb_connection_info> <databasename>

    There could be special situations where you need to append only a partial selection of the data. This happens especially with time-series data. In such cases, you can use copy commands with a WHERE clause to extract and import specific data. You can see this in the following example :

    /usr/lib/postgresql/11/bin/psql <sourcedb_connection_info> -c "COPY (select * from <table> where <filter condition>) TO STDOUT" > /tmp/selected_table_data.sql
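
    The extracted rows can then be appended on the destination side; a minimal sketch (placeholders as above) would be:

    /usr/lib/postgresql/11/bin/psql <destinationdb_connection_info> -c "COPY <table> FROM STDIN" < /tmp/selected_table_data.sql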

    Summary

    pg_dump/pg_restore may be useful if you need to perform a faster upgrade of a PostgreSQL server with a modified schema and bloat-free relations. To see more about this method in action, please subscribe to our webinar here.


    image based on photos by Skitterphoto and Magda Ehlers from Pexels

    by Jobin Augustine at March 27, 2019 06:09 PM

    Percona University Travels to South America

    We started hosting Percona University a few years back with the aim of sharing knowledge with the open source database community. The events are held in cities across the world. The next Percona University days will be visiting Uruguay, Argentina, and Brazil, in a lightning tour at the end of April.

    • Montevideo, Tuesday, April 23 2019 from 8.30am to 6.30pm
    • Buenos Aires, Thursday, April 25 2019 from 1.30pm to 10.30pm
    • São Paulo, Saturday, April 27 2019 from 9.30am to 7.30pm

    What is Percona University?

    It is a technical educational event. We encourage people to join us at any point during these talks – we understand that not everyone can take half a day off from their work or studies. As long as you register – that’s essential.

    What is on the agenda for each of the events?

    Full agendas and registration forms for the Montevideo, Buenos Aires, and São Paulo events can be accessed at their respective links.

    Does the word “University” mean that we won’t cover any in-depth topics, and these events would only interest college/university students?

    No, it doesn’t. We designed Percona University presentations for all kinds of “students,” including professionals with years of database industry experience. The word “University” means that this event series is about educating attendees on technical topics (it’s not a sales-oriented event, it’s about sharing knowledge with the community).

    Does Percona University cover only Percona technology?

    We will definitely mention Percona technology, but we will also focus on real-world technical issues and recommend solutions that work (regardless of whether Percona developed them).

    Are there other Percona University events coming up besides these in South America?

    We will hold more Percona University events in different locations in the future. Our newsletter is a good source of information about when and where they will occur. If you’d like to partner with Percona in organizing a Percona University event, contact Tom Basil who leads our community team… or Lorraine Pocklington our Community Manager. You can also check our list of technical webinars to get further educational insights.

    These events are free and low-key! They aren’t meant to look like a full conference (like our Percona Live series). Percona University has a different format. We make it informal, so it’s designed to be perfect for learning and networking. This is an in-person database community gathering, so feel free to come with interesting cases and tricky questions!

    by Agustín at March 27, 2019 03:28 PM

    March 26, 2019

    Peter Zaitsev

    Upcoming Webinar Wed 3/27: Monitoring PostgreSQL with Percona Monitoring and Management (PMM)

    Please join Percona’s Product Manager, Michael Coburn, as he presents his talk Monitoring PostgreSQL with Percona Monitoring and Management (PMM) on March 27th, 2019 at 11:00 AM PDT (UTC-7) / 2:00 PM EDT (UTC-4).

    Register Now

    In this webinar, learn how to monitor PostgreSQL using Percona Monitoring and Management (PMM) so that you can:

    • Gain greater visibility of performance and bottlenecks for PostgreSQL
    • Consolidate your PostgreSQL servers into the same monitoring platform you already use for MySQL and MongoDB
    • Respond more quickly and efficiently to Severity 1 issues

    We’ll also show how using PMM’s External Exporters can help you integrate PostgreSQL in only minutes!

    In order to learn more, register for this webinar on how to monitor PostgreSQL with PMM.

    by Michael Coburn at March 26, 2019 03:58 PM

    March 25, 2019

    Peter Zaitsev

    Percona Server for MongoDB Operator 0.3.0 Early Access Release Is Now Available

    Percona announces the availability of the Percona Server for MongoDB Operator 0.3.0 early access release.

    The Percona Server for MongoDB Operator simplifies the deployment and management of Percona Server for MongoDB in a Kubernetes or OpenShift environment. It extends the Kubernetes API with a new custom resource for deploying, configuring and managing the application through the whole life cycle.

    You can install the Percona Server for MongoDB Operator on Kubernetes or OpenShift. While the operator does not support all the Percona Server for MongoDB features in this early access release, instructions on how to install and configure it are already available along with the operator source code in our Github repository.

    The Percona Server for MongoDB Operator is an early access release. Percona doesn’t recommend it for production environments.

    New Features

    Improvements

    Fixed Bugs

    • CLOUD-141: Operator failed to rescale cluster after self-healing.
    • CLOUD-151: Dashboard upgrade in Percona Monitoring and Management caused loop due to no write access.
    • CLOUD-152: Percona Server for MongoDB crash took place in case of no backup section in the Operator configuration file.
    • CLOUD-91: The Operator was throwing error messages with Arbiters disabled in the deploy/cr.yaml configuration file.

    Percona Server for MongoDB is an enhanced, open source and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB Community Edition. It supports MongoDB® protocols and drivers. Percona Server for MongoDB extends MongoDB Community Edition functionality by including the Percona Memory Engine, as well as several enterprise-grade features. It requires no changes to MongoDB applications or code.

    Help us improve our software quality by reporting any bugs you encounter using our bug tracking system.

    by Dmitriy Kostiuk at March 25, 2019 02:13 PM

    How to Perform Compatible Schema Changes in Percona XtraDB Cluster (Advanced Alternative)?

    If you are using Galera replication, you know that schema changes may be a serious problem. With its current implementation, there is no way even a simple ALTER will be unobtrusive for live production traffic. It is a fact that with the default TOI alter method, a Percona XtraDB Cluster (PXC) cluster suspends writes in order to execute the ALTER in the same order on all nodes.

    For factual data structure changes, we have to adapt to the limitations, and either plan for a maintenance window, or use pt-online-schema-change, where interruptions should be very short. I suggest you be extra careful here, as normally you cannot kill an ongoing ALTER query in Galera cluster.

    For schema compatible changes, that is, ones that cannot break ROW replication when the writer node and applier nodes have different metadata, we can consider using the Rolling Schema Update (RSU) method. An example of a 100% replication-safe DDL is OPTIMIZE TABLE (aka noop-ALTER). However, the following are safe to consider too (a few example statements follow the list):

    • adding and removing a secondary index,
    • renaming an index,
    • changing the ROW_FORMAT (for example enabling/disabling table compression),
    • changing the KEY_BLOCK_SIZE (compression property).
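
    A few example statements of such changes, using the db1.sbtest1 table from the test below (the column and index names are hypothetical):

    ALTER TABLE db1.sbtest1 ADD INDEX idx_c (c);
    ALTER TABLE db1.sbtest1 RENAME INDEX idx_c TO idx_c2;
    ALTER TABLE db1.sbtest1 ROW_FORMAT=COMPRESSED, KEY_BLOCK_SIZE=8;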

    However, a lesser known fact is that even using the RSU method or pt-online-schema-change for the above may not save us from some unwanted disruptions.

    RSU and Concurrent Queries

    Let’s take a closer look at a very simple scenario with noop ALTER. We will set wsrep_OSU_method to RSU to avoid a cluster-wide stall. In fact, this mode turns off replication for the following DDL (and only for DDL), so you have to remember to repeat the same ALTER on every cluster member later.

    For simplicity, let’s assume there is only one node used for writes. In the first client session, we change the method accordingly to prepare for DDL:

    node1 > set wsrep_OSU_method=RSU;
    Query OK, 0 rows affected (0.00 sec)
    node1 > select @@wsrep_OSU_method,@@wsrep_on,@@wsrep_desync;
    +--------------------+------------+----------------+
    | @@wsrep_OSU_method | @@wsrep_on | @@wsrep_desync |
    +--------------------+------------+----------------+
    | RSU                |          1 |              0 |
    +--------------------+------------+----------------+
    1 row in set (0.00 sec)

    (By the way, as seen above, the desync mode is not enabled yet, as it will be automatically enabled around the DDL query only, and disabled right after it finishes).

    In a second client session, we start a long enough SELECT query:

    node1 > select count(*) from db1.sbtest1 a join db1.sbtest1 b where a.id<10000;
    ...

    And while it’s ongoing, let’s rebuild the table:

    node1 > alter table db1.sbtest1 engine=innodb;
    Query OK, 0 rows affected (0.98 sec)
    Records: 0 Duplicates: 0 Warnings: 0

    Surprisingly, immediately the client in the second session receives its SELECT failure:

    ERROR 1213 (40001): WSREP detected deadlock/conflict and aborted the transaction. Try restarting the transaction

    So, even a simple SELECT is aborted if it conflicts with the local, concurrent ALTER (RSU)… We can see more details in the error log:

    2018-12-04T21:39:17.285108Z 0 [Note] WSREP: Member 0.0 (node1) desyncs itself from group
    2018-12-04T21:39:17.285124Z 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 471796)
    2018-12-04T21:39:17.305018Z 12 [Note] WSREP: Provider paused at 7bf59bb4-996d-11e8-b3b6-8ed02cd38513:471796 (30)
    2018-12-04T21:39:17.324509Z 12 [Note] WSREP: --------- CONFLICT DETECTED --------
    2018-12-04T21:39:17.324532Z 12 [Note] WSREP: cluster conflict due to high priority abort for threads:
    2018-12-04T21:39:17.324535Z 12 [Note] WSREP: Winning thread:
    THD: 12, mode: total order, state: executing, conflict: no conflict, seqno: -1
    SQL: alter table db1.sbtest1 engine=innodb
    2018-12-04T21:39:17.324537Z 12 [Note] WSREP: Victim thread:
    THD: 11, mode: local, state: executing, conflict: no conflict, seqno: -1
    SQL: select count(*) from db1.sbtest1 a join db1.sbtest1 b where a.id<10000
    2018-12-04T21:39:17.324542Z 12 [Note] WSREP: MDL conflict db=db1 table=sbtest1 ticket=MDL_SHARED_READ solved by abort
    2018-12-04T21:39:17.324544Z 12 [Note] WSREP: --------- CONFLICT DETECTED --------
    2018-12-04T21:39:17.324545Z 12 [Note] WSREP: cluster conflict due to high priority abort for threads:
    2018-12-04T21:39:17.324547Z 12 [Note] WSREP: Winning thread:
    THD: 12, mode: total order, state: executing, conflict: no conflict, seqno: -1
    SQL: alter table db1.sbtest1 engine=innodb
    2018-12-04T21:39:17.324548Z 12 [Note] WSREP: Victim thread:
    THD: 11, mode: local, state: executing, conflict: must abort, seqno: -1
    SQL: select count(*) from db1.sbtest1 a join db1.sbtest1 b where a.id<10000
    2018-12-04T21:39:18.517457Z 12 [Note] WSREP: resuming provider at 30
    2018-12-04T21:39:18.517482Z 12 [Note] WSREP: Provider resumed.
    2018-12-04T21:39:18.518310Z 0 [Note] WSREP: Member 0.0 (node1) resyncs itself to group
    2018-12-04T21:39:18.518342Z 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 471796)
    2018-12-04T21:39:18.519077Z 0 [Note] WSREP: Member 0.0 (node1) synced with group.
    2018-12-04T21:39:18.519099Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 471796)
    2018-12-04T21:39:18.519119Z 2 [Note] WSREP: Synchronized with group, ready for connections
    2018-12-04T21:39:18.519126Z 2 [Note] WSREP: Setting wsrep_ready to true

    Another example – a simple sysbench test, during which I did noop ALTER in RSU mode:

    # sysbench /usr/share/sysbench/oltp_read_only.lua --table-size=1000 --tables=8 --mysql-db=db1 --mysql-user=root --threads=8 --time=200 --report-interval=1 --events=0 --db-driver=mysql run
    sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)
    Running the test with following options:
    Number of threads: 8
    Report intermediate results every 1 second(s)
    Initializing random number generator from current time
    Initializing worker threads...
    Threads started!
    [ 1s ] thds: 8 tps: 558.37 qps: 9004.30 (r/w/o: 7880.62/0.00/1123.68) lat (ms,95%): 18.28 err/s: 0.00 reconn/s: 0.00
    [ 2s ] thds: 8 tps: 579.01 qps: 9290.22 (r/w/o: 8130.20/0.00/1160.02) lat (ms,95%): 17.01 err/s: 0.00 reconn/s: 0.00
    [ 3s ] thds: 8 tps: 597.36 qps: 9528.89 (r/w/o: 8335.17/0.00/1193.72) lat (ms,95%): 15.83 err/s: 0.00 reconn/s: 0.00
    FATAL: mysql_stmt_store_result() returned error 1317 (Query execution was interrupted)
    FATAL: `thread_run' function failed: /usr/share/sysbench/oltp_common.lua:432: SQL error, errno = 1317, state = '70100': Query execution was interrupted

    So, SELECT queries are aborted to resolve the MDL lock request that a DDL in RSU needs immediately. This of course applies to INSERT, UPDATE and DELETE as well. That’s quite an intrusive way to accomplish the goal…

    “Manual RSU”

    Let’s try a “manual RSU” workaround instead. In fact, we can achieve the same isolated DDL execution as in RSU, by putting a node in desync mode (to avoid flow control) and disabling replication for our session. That way, the ALTER will only be executed in that particular node.

    Session 1:

    node1 > set wsrep_OSU_method=TOI; set global wsrep_desync=1; set wsrep_on=0;
    Query OK, 0 rows affected (0.01 sec)
    Query OK, 0 rows affected (0.00 sec)
    Query OK, 0 rows affected (0.00 sec)
    node1 > select @@wsrep_OSU_method,@@wsrep_on,@@wsrep_desync;
    +--------------------+------------+----------------+
    | @@wsrep_OSU_method | @@wsrep_on | @@wsrep_desync |
    +--------------------+------------+----------------+
    | TOI                |          0 |              1 |
    +--------------------+------------+----------------+
    1 row in set (0.00 sec)

    Session 2:

    node1 > select count(*) from db1.sbtest1 a join db1.sbtest1 b where a.id<10000;
    +-----------+
    | count(*)  |
    +-----------+
    | 423680000 |
    +-----------+
    1 row in set (14.07 sec)

    Session 1:

    node1 > alter table db1.sbtest1 engine=innodb;
    Query OK, 0 rows affected (13.52 sec)
    Records: 0 Duplicates: 0 Warnings: 0

    Session 3:

    node1 > select id,command,time,state,info from information_schema.processlist where user="root";
    +----+---------+------+---------------------------------+-----------------------------------------------------------------------------------------+
    | id | command | time | state                           | info |
    +----+---------+------+---------------------------------+-----------------------------------------------------------------------------------------+
    | 11 | Query   | 9    | Sending data                    | select count(*) from db1.sbtest1 a join db1.sbtest1 b where a.id<10000 |
    | 12 | Query   | 7    | Waiting for table metadata lock | alter table db1.sbtest1 engine=innodb |
    | 17 | Query   | 0    | executing                       | select id,command,time,state,info from information_schema.processlist where user="root" |
    +----+---------+------+---------------------------------+-----------------------------------------------------------------------------------------+
    3 rows in set (0.00 sec)
    node1 > select id,command,time,state,info from information_schema.processlist where user="root";
    +----+---------+------+----------------+-----------------------------------------------------------------------------------------+
    | id | command | time | state          | info |
    +----+---------+------+----------------+-----------------------------------------------------------------------------------------+
    | 11 | Sleep   | 14   |                | NULL |
    | 12 | Query   | 13   | altering table | alter table db1.sbtest1 engine=innodb |
    | 17 | Query   | 0    | executing      | select id,command,time,state,info from information_schema.processlist where user="root" |
    +----+---------+------+----------------+-----------------------------------------------------------------------------------------+
    3 rows in set (0.00 sec)

    In this case, there was no interruption; the ALTER waited for its MDL lock request to succeed gracefully, and did its job when it became possible.

    Remember, you have to execute the same commands on the rest of the nodes to make them consistent – even for noop-alter, it’s important to make the nodes consistent in terms of table size on disk.

    Kill Problem

    Another fact is that you cannot cancel or kill a DDL query executed in RSU or in TOI method:

    node1 > kill query 12;
    ERROR 1095 (HY000): You are not owner of thread 12

    This may be an annoying problem when you need to unblock a node urgently. Fortunately, the workaround with wsrep_on=0 also allows you to kill an ALTER without that restriction:

    Session 1:

    node1 > kill query 22;
    Query OK, 0 rows affected (0.00 sec)

    Session 2:

    node1 > alter table db1.sbtest1 engine=innodb;
    ERROR 1317 (70100): Query execution was interrupted

    Summary

    The RSU method may be more intrusive than you’d expect. For schema compatible changes, it is worth considering “manual RSU” with

    set global wsrep_desync=1; set wsrep_on=0;

    When using it though, please remember that wsrep_on applies to all types of writes, both DDL and DML, so be extra careful to set it back to 1 after the ALTER is done. So the procedure will look like this:

    SET GLOBAL wsrep_desync=1;
    SET wsrep_on=0;
    ALTER ...  /* compatible schema change only! */
    SET wsrep_on=1;
    SET GLOBAL wsrep_desync=0;

    Incidentally, as in my opinion the current RSU behavior is unnecessarily intrusive, I have filed this change suggestion: https://jira.percona.com/browse/PXC-2293


    Photo by Pierre Bamin on Unsplash

    by Przemysław Malkowski at March 25, 2019 12:37 PM

    March 20, 2019

    Peter Zaitsev

    MongoDB on ARM Processors

    ARM processors have been around for a while. In mid-2015/2016 there were a couple of attempts by the community to port MongoDB to work with this architecture. At the time, the main storage engine was MMAP and most of the available ARM boards were 32-bits. Overall, the port worked, but the fact is having MongoDB running on a Raspberry Pi was more a hack than a setup. The public cloud providers didn’t yet offer machines running with these processors.

    The ARM processors are power-efficient and, for this reason, they are used in smartphones, smart devices and, now, even laptops. It was just a matter of time to have them available in the cloud as well. Now that AWS is offering ARM-based instances you might be thinking: “Hmmm, these instances include the same amount of cores and memory compared to the traditional x86-based offers, but cost a fraction of the price!”.

    But do they perform alike?

    In this blog, we selected three different AWS instances to compare: one powered by  an ARM processor, the second one backed by a traditional x86_64 Intel processor with the same number of cores and memory as the ARM instance, and finally another Intel-backed instance that costs roughly the same as the ARM instance but carries half as many cores. We acknowledge these processors are not supposed to be “equivalent”, and we do not intend to go deeper in CPU architecture in this blog. Our goal is purely to check how the ARM-backed instance fares in comparison to the Intel-based ones.

    These are the instances we will consider in this blog post.

    Methodology

    We will use the Yahoo Cloud Serving Benchmark (YCSB, https://github.com/brianfrankcooper/YCSB) running on a dedicated instance (c5d.4xlarge) to simulate load in three distinct tests:

    1. a load of 1 billion documents in one collection having only the primary key (which we’ll call Inserts).
    2. a workload comprised of exclusively reads (Reads)
    3. a workload comprised of a mix of 75% reads with 5% scans plus 25% updates (Reads/Updates)

    We will run each test with a varying number of concurrent threads (32, 64, and 128), repeating each set three times and keeping only the second-best result.

    All instances will run the same MongoDB version (4.0.3, installed from a tarball and running with default settings) and operating system, Ubuntu 16.04. We chose this setup because MongoDB’s offering includes an ARM version for Ubuntu-based machines.

    All the instances will be configured with:

    • 100 GB EBS with 5000 PIOPS and 20 GB EBS boot device
    • Data volume formatted with XFS, 4k blocks
    • Default swappiness and disk scheduler
    • Default kernel parameters
    • Enhanced cloud watch configured
    • Free monitoring tier enabled

    Preparing the environment

    We start with the setup of the benchmark software we will use for the test, YCSB. The first task was to spin up a powerful machine (c5d.4xlarge) to run the software and then prepare the environment:

    The YCSB program requires Java, Maven, Python, and pymongo, which don’t come by default in our Linux version – Ubuntu server x86. Here are the steps we used to configure our environment:

    Installing Java

    sudo apt-get install default-jdk   # "java-devel" is not an apt package name on Ubuntu

    Installing Maven

    wget http://ftp.heanet.ie/mirrors/www.apache.org/dist/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
    sudo tar xzf apache-maven-*-bin.tar.gz -C /usr/local
    cd /usr/local
    sudo ln -s apache-maven-* maven
    sudo vi /etc/profile.d/maven.sh

    Add the following to maven.sh

    export M2_HOME=/usr/local/maven
    export PATH=${M2_HOME}/bin:${PATH}

    Installing Python 2.7

    sudo apt-get install python2.7

    Installing pip to resolve the pymongo dependency

    sudo apt-get install python-pip

    Installing pymongo (driver)

    sudo pip install pymongo

    Installing YCSB

    curl -O --location https://github.com/brianfrankcooper/YCSB/releases/download/0.5.0/ycsb-0.5.0.tar.gz
    tar xfvz ycsb-0.5.0.tar.gz
    cd ycsb-0.5.0

    YCSB comes with different workloads, and also allows for the customization of a workload to match our own requirements. If you want to learn more about the workloads have a look at https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workload_template

    First, we will edit the workloads/workloada file to perform 1 billion inserts (for our first test) while also preparing it to later perform only reads (for our second test):

    recordcount=1000000
    operationcount=1000000
    workload=com.yahoo.ycsb.workloads.CoreWorkload
    readallfields=true
    readproportion=1
    updateproportion=0.0

    We will then change the workloads/workloadb file so as to provide a mixed workload for our third test.  We also set it to perform 1 billion reads, but we break it down into 70% reads, 5% scans, and 25% updates, while also placing a cap on the maximum number of scanned documents (2000) in an effort to emulate real traffic – workloads are not perfect, right?

    recordcount=10000000
    operationcount=10000000
    workload=com.yahoo.ycsb.workloads.CoreWorkload
    readallfields=true
    readproportion=0.7
    updateproportion=0.25
    scanproportion=0.05
    insertproportion=0
    maxscanlength=2000

    With that, we have the environment configured for testing.

    Running the tests

    With all instances configured and ready, we run the stress test against our MongoDB servers using the following command:

    ./bin/ycsb [load/run] mongodb -s -P workloads/workload[ab] -threads [32/64/128] \
     -p mongodb.url=mongodb://xxx.xxx.xxx.xxx:27017/ycsb0000[0-9] \
     -jvm-args="-Dlogback.configurationFile=disablelogs.xml"

    The parameters between brackets varied according to the instance and operation being executed:

    • [load/run] load means insert data while run means perform action (update/read)
    • workload[a/b] reference the different workloads we’ve used
    • [32/64/128] indicate the number of concurrent threads being used for the test
    • ycsb0000[0-9] is the database name we’ve used for the tests (for reference only)
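    As an illustration, a load run for the insert test might look like this (a sketch only; the host 10.0.0.12, the database name, and the thread count are placeholders, not the values from our tests):

    ./bin/ycsb load mongodb -s -P workloads/workloada -threads 64 \
     -p mongodb.url=mongodb://10.0.0.12:27017/ycsb00001 \
     -jvm-args="-Dlogback.configurationFile=disablelogs.xml"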

    Results

    Without further ado, the table below summarizes the results for our tests:

    Performance cost

    Considering throughput alone – and in the context of those tests, particularly the last one – you may get more performance for the same cost. That’s certainly not always the case, which our results above also demonstrate. And, as usual, it depends on “how much performance do you need” – a matter that is even more pertinent in the cloud. With that in mind, we had another look at our data under the “performance cost” lens.

    As we saw above, the c5.4xlarge instance performed better than the other two instances for a little over 50% more (in terms of cost). Did it deliver 50% more (performance) as well? Well, sometimes it did even more than that, but not always. We used the following formula to extrapolate the OPS (Operations Per Second) data we got from our tests into OPH (Operations Per Hour), so we could then calculate how much bang (operations) for the buck (US$1) each instance was able to provide:

    transactions/hour/US$1 = (OPS * 3600) / instance cost per hour
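    To make the arithmetic concrete, here is a hypothetical example (the figures are made up and are not taken from our results): an instance sustaining 10,000 OPS at US$0.68 per hour would deliver roughly 52.9 million operations per US$1:

    $ echo "scale=0; (10000 * 3600) / 0.68" | bc
    52941176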

    This is, of course, an artificial metric that aims to correlate performance and cost. For this reason, instead of plotting the raw values, we have normalized the results using the best performing instance as the baseline (100%):

    The intent behind these charts was only to demonstrate another way to evaluate how much we’re getting for what we’re paying. Of course, you need to have a clear understanding of your own requirements in order to make a balanced decision.

    Parting thoughts

    We hope this post awakens your curiosity not only about how MongoDB may perform on ARM-based servers, but also by demonstrating another way you can perform your own tests with the YCSB benchmark. Feel free to reach out to us through the comments section below if you have any suggestions, questions, or other observations to make about the work we presented here.

    by Adamo Tonete at March 20, 2019 05:31 PM

    March 19, 2019

    Peter Zaitsev

    How To Test and Deploy Kubernetes Operator for MySQL(PXC) in OSX/macOS?

    In this blog post, I’m going to show you how to test Kubernetes locally on OSX/macOS. Testing Kubernetes without having access to a cloud operator in a local lab is not as easy as it sounds. I’d like to share some of my experiences in this adventure. For those who already have experience with the VirtualBox & Vagrant combination, I can tell you that it doesn’t work. Since Kubernetes requires virtualization, setting up another virtual environment within another VirtualBox has several issues. After trying to bring up a cluster for a day or two, I gave up my traditional lab and figured out that Kubernetes has an alternate solution called minikube.

    Installation

    If your OSX/macOS doesn’t have brew I strongly recommend installing it. My OSX/macOS version at the time of this post was macOS 10.14.3 (18D109).

    $ brew update && brew install kubectl && brew cask install docker minikube virtualbox

    Once minikube is installed, we’ll need to start the virtual environment that is required to run our operator.

    I’m starting my minikube environment with 4 GB of memory since our Percona XtraDB Cluster (PXC) will have 3 MySQL nodes + 1 ProxySQL pod.

    $ minikube start --memory 4096
    😄  minikube v0.35.0 on darwin (amd64)
    🔥  Creating virtualbox VM (CPUs=2, Memory=4096MB, Disk=20000MB) ...
    📶  "minikube" IP address is 192.168.99.100
    🐳  Configuring Docker as the container runtime ...
    ✨  Preparing Kubernetes environment ...
    🚜  Pulling images required by Kubernetes v1.13.4 ...
    🚀  Launching Kubernetes v1.13.4 using kubeadm ...
    ⌛  Waiting for pods: apiserver proxy etcd scheduler controller addon-manager dns
    🔑  Configuring cluster permissions ...
    🤔  Verifying component health .....
    💗  kubectl is now configured to use "minikube"
    🏄  Done! Thank you for using minikube!

    We’re now ready to install Percona XtraDB Cluster on Kubernetes.

    Setup

    Clone and download Kubernetes Operator for MySQL.

    $ git clone -b release-0.2.0 https://github.com/percona/percona-xtradb-cluster-operator
    Cloning into 'percona-xtradb-cluster-operator'...
    remote: Enumerating objects: 191, done.
    remote: Counting objects: 100% (191/191), done.
    remote: Compressing objects: 100% (114/114), done.
    remote: Total 10321 (delta 73), reused 138 (delta 67), pack-reused 10130
    Receiving objects: 100% (10321/10321), 17.04 MiB | 3.03 MiB/s, done.
    Resolving deltas: 100% (3526/3526), done.
    Checking out files: 100% (5159/5159), done.
    $ cd percona-xtradb-cluster-operator

    Here we have to make the following modifications for this operator to work on OSX/macOS.

    1. Reduce memory allocation for each pod.
    2. Reduce CPU usage for each pod.
    3. Change the topology type (because we want to run all PXC instances on one node).

    $ sed -i.bak 's/1G/500m/g' deploy/cr.yaml
    $ grep "memory" deploy/cr.yaml
            memory: 500m
          #   memory: 500m
            memory: 500m
          #   memory: 500m
    $ sed -i.bak 's/600m/200m/g' deploy/cr.yaml
    $ grep "cpu" deploy/cr.yaml
            cpu: 200m
          #   cpu: "1"
            cpu: 200m
          #   cpu: 700m
    $ grep "topology" deploy/cr.yaml
          topologyKey: "kubernetes.io/hostname"
        #   topologyKey: "failure-domain.beta.kubernetes.io/zone"
    $ sed -i.bak 's/kubernetes\.io\/hostname/none/g' deploy/cr.yaml
    $ grep "topology" deploy/cr.yaml
          topologyKey: "none"
        #   topologyKey: "failure-domain.beta.kubernetes.io/zone"

    We’re now ready to deploy our PXC via the operator.

    $ kubectl apply -f deploy/crd.yaml
    customresourcedefinition.apiextensions.k8s.io/perconaxtradbclusters.pxc.percona.com created
    customresourcedefinition.apiextensions.k8s.io/perconaxtradbbackups.pxc.percona.com created
    $ kubectl create namespace pxc
    namespace/pxc created
    $ kubectl config set-context $(kubectl config current-context) --namespace=pxc
    Context "minikube" modified.
    $ kubectl apply -f deploy/rbac.yaml
    role.rbac.authorization.k8s.io/percona-xtradb-cluster-operator created
    rolebinding.rbac.authorization.k8s.io/default-account-percona-xtradb-cluster-operator created
    $ kubectl apply -f deploy/operator.yaml
    deployment.apps/percona-xtradb-cluster-operator created
    $ kubectl apply -f deploy/secrets.yaml
    secret/my-cluster-secrets created
    $ kubectl apply -f deploy/configmap.yaml
    configmap/pxc created
    $ kubectl apply -f deploy/cr.yaml
    perconaxtradbcluster.pxc.percona.com/cluster1 created

    Here we’re ready to monitor the progress of our deployment.

    $ kubectl get pods
    NAME                                               READY   STATUS              RESTARTS   AGE
    cluster1-pxc-node-0                                0/1     ContainerCreating   0          86s
    cluster1-pxc-proxysql-0                            1/1     Running             0          86s
    percona-xtradb-cluster-operator-5857dfcb6c-g7bbg   1/1     Running             0          109s

    If any of the nodes has difficulty moving to the Running state, you can inspect it with kubectl describe:

    $ kubectl describe pod cluster1-pxc-node-0
    Name:               cluster1-pxc-node-0
    Namespace:          pxc
    Priority:           0
    .
    ..
    ...
    Events:
      Type     Reason            Age                     From               Message
      ----     ------            ----                    ----               -------
      Warning  FailedScheduling  3m47s (x14 over 3m51s)  default-scheduler  pod has unbound immediate PersistentVolumeClaims
      Normal   Scheduled         3m47s                   default-scheduler  Successfully assigned pxc/cluster1-pxc-node-0 to minikube
      Normal   Pulling           3m45s                   kubelet, minikube  pulling image "perconalab/pxc-openshift:0.2.0"
      Normal   Pulled            118s                    kubelet, minikube  Successfully pulled image "perconalab/pxc-openshift:0.2.0"
      Normal   Created           117s                    kubelet, minikube  Created container
      Normal   Started           117s                    kubelet, minikube  Started container
      Warning  Unhealthy         89s                     kubelet, minikube  Readiness probe failed:

    At this stage, we’re ready to verify our cluster as soon as we see the following output (READY 1/1):

    $ kubectl get pods
    NAME                                               READY   STATUS    RESTARTS   AGE
    cluster1-pxc-node-0                                1/1     Running   0          7m38s
    cluster1-pxc-node-1                                1/1     Running   0          4m46s
    cluster1-pxc-node-2                                1/1     Running   0          2m25s
    cluster1-pxc-proxysql-0                            1/1     Running   0          7m38s
    percona-xtradb-cluster-operator-5857dfcb6c-g7bbg   1/1     Running   0          8m1s

    In order to connect to this cluster, we’ll need to deploy a client shell access.

    $ kubectl run -i --rm --tty percona-client --image=percona:5.7 --restart=Never -- bash -il
    If you don't see a command prompt, try pressing enter.
    bash-4.2$ mysql -h cluster1-pxc-proxysql -uroot -p
    Enter password:
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 3617
    Server version: 5.5.30 (ProxySQL)
    Copyright (c) 2009-2019 Percona LLC and/or its affiliates
    Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
    Oracle is a registered trademark of Oracle Corporation and/or its
    affiliates. Other names may be trademarks of their respective
    owners.
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    mysql> \s
    --------------
    mysql  Ver 14.14 Distrib 5.7.25-28, for Linux (x86_64) using  6.2
    Connection id:		3617
    Current database:	information_schema
    Current user:		root@cluster1-pxc-proxysql-0.cluster1-pxc-proxysql.pxc.svc.cluste
    SSL:			Not in use
    Current pager:		stdout
    Using outfile:		''
    Using delimiter:	;
    Server version:		5.5.30 (ProxySQL)
    Protocol version:	10
    Connection:		cluster1-pxc-proxysql via TCP/IP
    Server characterset:	latin1
    Db     characterset:	utf8
    Client characterset:	latin1
    Conn.  characterset:	latin1
    TCP port:		3306
    Uptime:			14 min 1 sec
    Threads: 1  Questions: 3  Slow queries: 0
    --------------

    A few things to remember:

    • Secrets for this setup are under deploy/secrets.yaml; you can decode them via

    $ echo -n '{secret}' |base64 -D

    • To reconnect shell

    $ kubectl run -i --tty percona-client --image=percona:5.7 -- sh

    • To redeploy the pod, delete it first and repeat the above steps without the configuration changes

    $ kubectl delete -f deploy/cr.yaml

    • To stop and delete the minikube virtual environment

    $ minikube stop

    $ minikube delete

    Credits


    Photo by frank mckenna on Unsplash

    by Alkin Tezuysal at March 19, 2019 05:12 PM

    Upcoming Webinar Thurs 3/21: MySQL Performance Schema in 1 hour

    Please join Percona’s Principal Support Engineer, Sveta Smirnova, as she presents MySQL Performance Schema in 1 hour on Thursday, March 21st, 2019, at 10:00 am PDT (UTC-7) / 1:00 pm EDT (UTC-4).

    Register Now

    MySQL 8.0 Performance Schema is a mature tool, used by humans and monitoring products. It was born in 2010 as “a feature for monitoring server execution at a low level.” The tool has grown over the years with performance fixes and DBA-facing features. In this webinar, I will give an overview of Performance Schema, focusing on its tuning, performance, and usability.

    Performance Schema helps to troubleshoot query performance, complicated locking issues and memory leaks. It can also troubleshoot resource usage, problematic behavior caused by inappropriate settings and much more. Additionally, it comes with hundreds of options which allow for greater precision tuning.
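    For a flavor of the kind of data we will discuss, the statement digest summary is one of the Performance Schema tables typically used for query troubleshooting. A sketch of such a query (the table and columns are standard Performance Schema; connection options are omitted):

    $ mysql -e "SELECT digest_text, count_star, sum_timer_wait
                FROM performance_schema.events_statements_summary_by_digest
                ORDER BY sum_timer_wait DESC LIMIT 5"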

    Performance Schema is a potent and very complicated tool. What’s more, it does not affect performance in most cases. However, it collects a lot of data and sometimes this data is hard to read.

    In this webinar, I will guide you through the main Performance Schema features, design, and configuration. You will learn how to get the best of it. I will cover its companion sys schema and graphical monitoring tools.

    In order to learn more, register for MySQL Performance Schema in 1 hour today.

    by Sveta Smirnova at March 19, 2019 04:19 PM

    March 18, 2019

    Peter Zaitsev

    PostgreSQL Upgrade Using pg_dumpall

    There are several approaches to assess when you need to upgrade PostgreSQL. In this blog post, we look at the option of upgrading a postgres database using pg_dumpall. As this tool can also be used to back up PostgreSQL clusters, it is a valid option for upgrading a cluster too. We consider the advantages and disadvantages of this approach, and show you the steps needed to achieve the upgrade.

    This is the first of our Upgrading or Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions series where we’ll be exploring different paths to accomplish postgres upgrade or migration. The series will culminate with a practical webinar to be aired April 17th (you can register here).

    We begin this journey by providing you the most straightforward way to carry on with a PostgreSQL upgrade or migration: by rebuilding the entire database from a logical backup.

    Defining the scope

    Let’s define what we mean by upgrading or migrating PostgreSQL using pg_dumpall.

    If you need to perform a PostgreSQL upgrade within the same database server, we’d call that an in-place upgrade or just an upgrade. Whereas a procedure that involves migrating your PostgreSQL server from one server to another server, combined with an upgrade from an older version (let’s say 9.3) to a newer PostgreSQL version (say PG 11.2), can be considered a migration.

    There are two ways to achieve this requirement using logical backups:

    1. Using pg_dumpall
    2. Using pg_dumpall + pg_dump + pg_restore

    We’ll be discussing the first option (pg_dumpall) here, and will leave the discussion of the second option for our next post.

    pg_dumpall

    pg_dumpall can be used to obtain a text-format dump of the whole database cluster, which includes all databases in the cluster. This is the only method that can be used to back up globals such as users and roles in PostgreSQL.
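    If you only need those globals (roles, tablespaces) rather than the full cluster, pg_dumpall can also dump them on their own; a minimal sketch (the output file path is just an example):

    $ pg_dumpall --globals-only > /tmp/globals.sql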

    There are, of course, advantages and disadvantages in employing this approach to upgrading PostgreSQL by rebuilding the database cluster using pg_dumpall.

    Advantages of using pg_dumpall for upgrading a PostgreSQL server :

    1. Works well for a tiny database cluster.
    2. Upgrade can be completed using just a few commands.
    3. Removes bloat from all the tables and shrinks the tables to their absolute sizes.

    Disadvantages of using pg_dumpall for upgrading a PostgreSQL server :

    1. Not the best option for databases that are huge in size (several GBs or TBs), as it might involve more downtime.
    2. Cannot use parallel mode. Backup/restore can use just one process.
    3. Requires double the space on disk as it involves temporarily creating a copy of the database cluster for an in-place upgrade.

    Let’s look at the steps involved in performing an upgrade using pg_dumpall:

    1. Install new PostgreSQL binaries in the target server (which could be the same one as the source database server if it is an in-place upgrade).

      -- For a RedHat family OS
      # yum install postgresql11*
      Or
      -- In an Ubuntu/Debian OS
      # apt install postgresql11
    2. Shutdown all the writes to the database server to avoid data loss/mismatch between the old and new version after upgrade.
    3. If you are doing an upgrade within the same server, create a cluster using the new binaries on a new data directory and start it using a port other than the source. For example, if the older version PostgreSQL is running on port 5432, start the new cluster on port 5433. If you are upgrading and migrating the database to a different server, create a new cluster using new binaries on the target server – the cluster may not need to run on a different port other than the default, unless that’s your preference.

      $ /usr/pgsql-11/bin/initdb -D new_data_directory
      $ cd new_data_directory
      $ echo "port = 5433" >> postgresql.auto.conf
      $ /usr/pgsql-11/bin/pg_ctl -D new_data_directory start
    4. You might have a few extensions installed in the old version PostgreSQL cluster. Get the list of all the extensions created in the source database server and install them for the new versions. You can exclude those you get with the contrib module by default. To see the list of extensions created and installed in your database server, you can run the following command.

      $ psql -d dbname -c "\dx"

      Please make sure to check all the databases in the cluster as the extensions you see in one database may not match the list of those created in another database.
    5. Prepare a postgresql.conf file for the new cluster. Carefully prepare this by looking at the existing configuration file of the older version postgres server.
    6. Use pg_dumpall to take a cluster backup and restore it to the new cluster.

      -- Command to dump the whole cluster to a file.
      $ /usr/pgsql-11/bin/pg_dumpall > /tmp/dumpall.sql
      -- Command to restore the dump file to the new cluster (assuming it is running on port 5433 of the same server).
      $ /usr/pgsql-11/bin/psql -p 5433 -f /tmp/dumpall.sql

      Note that I have used the new pg_dumpall from the new binaries to take the backup.
      Another, easier way is to use a pipe to avoid the time involved in creating a dump file. Just add a hostname if you are performing an upgrade and migration.

      $ pg_dumpall -p 5432 | psql -p 5433
      Or
      $ pg_dumpall -p 5432 -h source_server | psql -p 5433 -h target_server
    7. Run ANALYZE to update statistics of each database on the new server (see the sketch after this list).
    8. Restart the database server using the same port as the source.
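
    For step 7, one way to run ANALYZE across all databases in one go is vacuumdb from the new binaries (a sketch, assuming the new cluster is still listening on port 5433):

    $ /usr/pgsql-11/bin/vacuumdb --all --analyze-only -p 5433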

    Our next post in this series provides a similar way of upgrading your PostgreSQL server while at the same time providing some flexibility to carry on with changes like the ones described above. Stay tuned!


    Image based on photo by Sergio Ortega on Unsplash

    by Avinash Vallarapu at March 18, 2019 02:59 PM

    March 15, 2019

    Peter Zaitsev

    Percona Server for MySQL 8.0.15-5 Is Now Available

    Percona announces the release of Percona Server for MySQL 8.0.15-5 on March 15, 2019 (downloads are available here and from the Percona Software Repositories).

    This release includes fixes to bugs found in previous releases of Percona Server for MySQL 8.0.

    Incompatible changes

    In previous releases, the audit log used to produce time stamps inconsistent with the ISO 8601 standard. Release 8.0.15-5 of Percona Server for MySQL solves this problem. This change, however, may break programs that rely on the old time stamp format.

    Starting from release 8.0.15-5, Percona Server for MySQL uses the upstream implementation of binary log encryption. The variable encrypt_binlog is removed and the related command line option --encrypt_binlog is not supported. It is important that you remove the encrypt_binlog variable from your configuration file before you attempt to upgrade either from another release in the Percona Server for MySQL 8.0 series or from Percona Server for MySQL 5.7. Otherwise, a server boot error will be produced reporting an unknown variable. The implemented binary log encryption is compatible with the old format: binary logs encrypted with a previous version of the MySQL 8.0 series or of Percona Server for MySQL are supported.
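    A quick way to check whether the obsolete option is still present before you upgrade (a sketch; adjust the paths to wherever your configuration files actually live):

    $ grep -R "encrypt_binlog" /etc/my.cnf /etc/my.cnf.d/ /etc/mysql/ 2>/dev/null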

    See MySQL documentation for more information: Encrypting Binary Log Files and Relay Log Files and binlog_encryption variable.

    This release is based on MySQL 8.0.14 and MySQL 8.0.15. It includes all bug fixes in these releases. Percona Server for MySQL 8.0.14 was skipped.

    Percona Server for MySQL 8.0.15-5 is now the current GA release in the 8.0 series. All of Percona’s software is open-source and free.

    Percona Server for MySQL 8.0 includes all the features available in MySQL 8.0 Community Edition in addition to enterprise-grade features developed by Percona. For a list of highlighted features from both MySQL 8.0 and Percona Server for MySQL 8.0, please see the GA release announcement.

    Note

    If you are upgrading from 5.7 to 8.0, please ensure that you read the upgrade guide and the document Changed in Percona Server for MySQL 8.0.

    Bugs Fixed

    • The audit log produced time stamps inconsistent with the ISO 8601 standard. Bug fixed PS-226.
    • FLUSH commands written to the binary log could cause errors in case of replication. Bug fixed PS-1827 (upstream #88720).
    • When audit_plugin was enabled, the server could use a lot of memory when handling large queries. Bug fixed PS-5395.
    • The page cleaner could sleep for a long time when the system clock was adjusted to an earlier point in time. Bug fixed PS-5221 (upstream #93708).
    • In some cases, the MyRocks storage engine could crash without triggering the crash recovery. Bug fixed PS-5366.
    • In some cases, when it failed to read from a file, InnoDB did not inform the name of the file in the related error message. Bug fixed PS-2455 (upstream #76020).
    • The ACCESS_DENIED field of the information_schema.user_statistics table was not updated correctly. Bugs fixed PS-3956 and PS-4996.
    • MyRocks could crash while running START TRANSACTION WITH CONSISTENT SNAPSHOT if other transactions were in specific states. Bug fixed PS-4705.
    • In some cases, the server using the MyRocks storage engine could crash when TTL (Time to Live) was defined on a table. Bug fixed PS-4911.
    • MyRocks incorrectly processed transactions in which multiple statements had to be rolled back. Bug fixed PS-5219.
    • A stack buffer overrun could happen if the redo log encryption with key rotation was enabled. Bug fixed PS-5305.
    • The TokuDB storage engine would assert on load when used with jemalloc 5.x. Bug fixed PS-5406.

    Other bugs fixed: PS-4106, PS-4107, PS-4108, PS-4121, PS-4474, PS-4640, PS-5055, PS-5218, PS-5263, PS-5328, and PS-5369.

    Find the release notes for Percona Server for MySQL 8.0.15-5 in our online documentation. Report bugs in the Jira bug tracker.

    by Borys Belinsky at March 15, 2019 06:31 PM

    Percona Server for MongoDB 3.6.11-3.1 Is Now Available

    Percona announces the release of Percona Server for MongoDB 3.6.11-3.1 on March 15, 2019. Download the latest version from the Percona website or the Percona software repositories.

    Percona Server for MongoDB is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 3.6 Community Edition. It supports MongoDB 3.6 protocols and drivers.

    Percona Server for MongoDB extends Community Edition functionality by including the Percona Memory Engine storage engine, as well as several enterprise-grade features. Also, it includes MongoRocks storage engine, which is now deprecated. Percona Server for MongoDB requires no changes to MongoDB applications or code.

    Release 3.6.11-3.1 extends the buildInfo command with the psmdbVersion key to report the version of Percona Server for MongoDB. If this key exists, then Percona Server for MongoDB is installed on the server. This key is not available from MongoDB.

    Improvements

    • PSMDB-216: The database command buildInfo provides the psmdbVersion key to report the version of Percona Server for MongoDB. If this key exists then Percona Server for MongoDB is installed on the server. This key is not available from MongoDB.
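
    A quick way to check for the new key from the shell (a sketch; run against a mongod of this release, connection options omitted):

    $ mongo --quiet --eval 'db.runCommand({buildInfo: 1}).psmdbVersion'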

    The Percona Server for MongoDB 3.6.11-3.1 release notes are available in the official documentation.

    by Borys Belinsky at March 15, 2019 05:43 PM

    Oli Sennhauser

    Uptime of a MariaDB Galera Cluster

    A while ago somebody on Google Groups asked for the Uptime of a Galera Cluster. The answer is easy... Wait, no! Not so easy... The uptime of a Galera Node is easy (or not?). But Uptime of the whole Galera Cluster?

    My answer then was: "Grep the error log." My answer now is still: "Grep the error log." But slightly different:

    $ grep 'view(view_id' *
    2019-03-07 16:10:26 [Note] WSREP: view(view_id(PRIM,0e0a2851,1) memb {
    2019-03-07 16:14:37 [Note] WSREP: view(view_id(PRIM,0e0a2851,2) memb {
    2019-03-07 16:16:23 [Note] WSREP: view(view_id(PRIM,0e0a2851,3) memb {
    2019-03-07 16:55:56 [Note] WSREP: view(view_id(NON_PRIM,0e0a2851,3) memb {
    2019-03-07 16:56:04 [Note] WSREP: view(view_id(PRIM,6d80bb1a,5) memb {
    2019-03-07 17:00:28 [Note] WSREP: view(view_id(NON_PRIM,6d80bb1a,5) memb {
    2019-03-07 17:01:11 [Note] WSREP: view(view_id(PRIM,24f67954,7) memb {
    2019-03-07 17:18:58 [Note] WSREP: view(view_id(NON_PRIM,24f67954,7) memb {
    2019-03-07 17:19:31 [Note] WSREP: view(view_id(PRIM,a380c8cb,9) memb {
    2019-03-07 17:20:27 [Note] WSREP: view(view_id(PRIM,a380c8cb,11) memb {
    2019-03-08  7:58:38 [Note] WSREP: view(view_id(PRIM,753a350f,15) memb {
    2019-03-08 11:31:38 [Note] WSREP: view(view_id(NON_PRIM,753a350f,15) memb {
    2019-03-08 11:31:43 [Note] WSREP: view(view_id(PRIM,489e3c67,17) memb {
    2019-03-08 11:31:58 [Note] WSREP: view(view_id(PRIM,489e3c67,18) memb {
    ...
    2019-03-22  7:05:53 [Note] WSREP: view(view_id(NON_PRIM,49dc20da,49) memb {
    2019-03-22  7:05:53 [Note] WSREP: view(view_id(PRIM,49dc20da,50) memb {
    2019-03-26 12:14:05 [Note] WSREP: view(view_id(NON_PRIM,49dc20da,50) memb {
    2019-03-27  7:33:25 [Note] WSREP: view(view_id(NON_PRIM,22ae25aa,1) memb {
    

    So this Cluster had an Uptime of about 18 days and 20 hours. Why can I see this? Simple: In the brackets there is a number at the very right. This number seems to be the same as wsrep_cluster_conf_id which is reset by a full Galera Cluster shutdown.
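
    You can cross-check the number from the error log against the status variable on a running node (a sketch; connection options omitted):

    $ mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_conf_id'"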

    So far so good. But, wait, what is the definition of Uptime? Hmmm, not so helpful, how should I interpret this for a 3-Node Galera Cluster?

    I would say a good definition for Uptime of a Galera Cluster would be: "At least one Galera Node must be available for the application for reading and writing." That means PRIM in the output above. And we still cannot say from the output above if there was at least one Galera Node available (reading and writing) at any time. For this we have to compare ALL 3 MariaDB Error Logs... So it does not help, we need a good Monitoring solution to answer this question...

    PS: Who has found the little fake in this blog?

    by Shinguz at March 15, 2019 04:58 PM

    Linux system calls of MySQL process

    We had the problem today that a MySQL Galera Cluster node with the multi-tenancy pattern caused a lot of system time (sy 75%, load average about 30 (you really must read this article by Brendan Gregg, it is worth it!)), so we wanted to find out which system calls are being used to see what could cause this issue (to verify if it is a TOC or a TDC problem):

    $ sudo strace -c -p $(pidof -s mysqld) -f -e trace=all
    Process 5171 attached with 41 threads
    Process 16697 attached
    ^C
    Process 5171 detached
    ...
    Process 5333 detached
    Process 16697 detached
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     66.85    1.349700         746      1810           io_getevents
     25.91    0.523055        1298       403       197 futex
      4.45    0.089773        1069        84        22 read
      2.58    0.052000       13000         4         3 restart_syscall
      0.19    0.003802        1901         2           select
      0.01    0.000235           3        69         1 setsockopt
      0.01    0.000210          18        12           getdents
      0.00    0.000078           2        32           write
      0.00    0.000056           1        49           fcntl
      0.00    0.000026           4         6           openat
      0.00    0.000012           2         6           close
      0.00    0.000000           0         2         2 open
      0.00    0.000000           0        22           stat
      0.00    0.000000           0         2           mmap
      0.00    0.000000           0         7           mprotect
      0.00    0.000000           0        16           pread
      0.00    0.000000           0         1           access
      0.00    0.000000           0         1           sched_yield
      0.00    0.000000           0         5           madvise
      0.00    0.000000           0         1           accept
      0.00    0.000000           0         1           getsockname
      0.00    0.000000           0         1           clone
      0.00    0.000000           0         1           set_robust_list
    ------ ----------- ----------- --------- --------- ----------------
    100.00    2.018947                  2537       225 total
    
    $ man io_getevents
    ...
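
    To follow up on the TOC/TDC suspicion mentioned above, the table cache counters and the related settings can be compared as well (a sketch; connection options omitted):

    $ mysql -e "SHOW GLOBAL STATUS LIKE 'Opened_table%'"
    $ mysql -e "SHOW GLOBAL VARIABLES LIKE 'table%cache%'"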
    

    See also: Configuration of MySQL for Shared Hosting.

    by Shinguz at March 15, 2019 04:06 PM

    Peter Zaitsev

    MySQL Ripple: The First Impression of a MySQL Binlog Server

    Just about a month ago, Pavel Ivanov released Ripple under the Apache-2.0 license. Ripple is a MySQL binlog server: software which receives binary logs from MySQL or MariaDB servers and delivers them to another MySQL or MariaDB server. Practically, this is an intermediary master which does not store any data, except the binary logs themselves, and does not apply events. This solution allows saving a lot of resources on the server, which acts only as a middle-man between the master and its actual slave(s).

    The intermediary server, keeping binary logs only and not doing any other job, is a prevalent use case which allows us to remove IO (binlog read) and network (binlog retrieval via network) load from the actual master and free its resources for updates. The intermediary master, which does not do any work itself, distributes binary logs to the slaves connected to it. This way you can have an increased number of slaves attached to such a server without affecting the application that is running the updates.

    Currently, users exploit the Blackhole storage engine to emulate similar behavior. But Blackhole is just a workaround: it still executes all the events in the binary logs, requires valid MySQL installation, and has a lot of issues. Such a pain!

    Therefore a new product which can do the same job and is released with an open source license is something worth trying.

    A simple test

    For this blog, I did a simple test. First, I installed it as described in the README file. Instructions are pretty straightforward, and I successfully built the server on my Ubuntu 18.04.2 LTS laptop. Guidelines suggest installing libmariadbclient-dev, and I replaced libmysqlclient-dev, which I already had on my machine. Probably this was not needed, but since the tool claims to support both MySQL and MariaDB binary log formats, I preferred to install the MariaDB client.

    There is no manual or usage instructions. However, the tool supports the -help command, and it is, again, straightforward.

    The server can be started with options:

    $./bazel-bin/rippled -ripple_datadir=./data -ripple_master_address=127.0.0.1 -ripple_master_port=13001 -ripple_master_user=root -ripple_server_ports=15000

    Where:

    • -ripple_datadir: datadir where Ripple stores binary logs
    • -ripple_master_address: master host
    • -ripple_master_port: master port
    • -ripple_master_user: replication user
    • -ripple_server_ports: comma-separated ports on which Ripple will listen

    I did not find an option for securing binary log retrieval. The slave can connect to the Ripple server with any credentials. Keep this in mind when deploying Ripple in production.

    Now, let’s run a simple test. I have two servers, both running on localhost: one on port 13001 (master) and another one on port 13002 (slave). The command line I used to start rippled points to the master. Binary logs are stored in the data directory:

    $ ls -l data/
    total 14920
    -rw-rw-r-- 1 sveta sveta 15251024 Mar 6 01:43 binlog.000000
    -rw-rw-r-- 1 sveta sveta 71 Mar 6 00:50 binlog.index

    I pointed the slave to the Ripple server with the command

    mysql> change master to master_host='127.0.0.1',master_port=15000, master_user='ripple';
    Query OK, 0 rows affected, 1 warning (0.02 sec)

    Then started the slave.

    On the master, I created the database sbtest and ran the sysbench oltp_read_write.lua test for a single table. After some time, I stopped the load and checked the content of the table on master and slave:

    master> select count(*) from sbtest1;
    +----------+
    | count(*) |
    +----------+
    | 10000 |
    +----------+
    1 row in set (0.08 sec)
    master> checksum table sbtest1;
    +----------------+------------+
    | Table | Checksum |
    +----------------+------------+
    | sbtest.sbtest1 | 4162333567 |
    +----------------+------------+
    1 row in set (0.11 sec)
    slave> select count(*) from sbtest1;
    +----------+
    | count(*) |
    +----------+
    | 10000 |
    +----------+
    1 row in set (0.40 sec)
    slave> checksum table sbtest1;
    +----------------+------------+
    | Table | Checksum |
    +----------------+------------+
    | sbtest.sbtest1 | 1797645970 |
    +----------------+------------+
    1 row in set (0.13 sec)
    slave> checksum table sbtest1;
    +----------------+------------+
    | Table | Checksum |
    +----------------+------------+
    | sbtest.sbtest1 | 4162333567 |
    +----------------+------------+
    1 row in set (0.10 sec)

    It took some time for the slave to catch up, but everything was applied successfully.
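
    A quick way to watch the slave catching up is the usual replication status check on the slave itself (a sketch; the port matches the slave used in this test, credentials omitted):

    $ mysql --host=127.0.0.1 --port=13002 -e "SHOW SLAVE STATUS\G" | grep -E "Seconds_Behind_Master|Slave_SQL_Running"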

    Ripple has nice verbose logging:

    $ ./bazel-bin/rippled -ripple_datadir=./data -ripple_master_address=127.0.0.1 -ripple_master_port=13001 -ripple_master_user=root -ripple_server_ports=15000
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I0306 15:57:13.641451 27908 rippled.cc:48] InitPlugins
    I0306 15:57:13.642007 27908 rippled.cc:60] Setup
    I0306 15:57:13.642937 27908 binlog.cc:307] Starting binlog recovery
    I0306 15:57:13.644090 27908 binlog.cc:350] Scanning binlog file: binlog.000000
    I0306 15:57:13.872016 27908 binlog.cc:417] Binlog recovery complete
    binlog file: binlog.000000, offset: 15251088, gtid: 6ddac507-3f90-11e9-8ee9-00163e000000:0-0-7192
    I0306 15:57:13.872050 27908 rippled.cc:106] Recovered binlog
    I0306 15:57:13.873811 27908 mysql_server_port_tcpip.cc:150] Listen on host: localhost, port: 15000
    I0306 15:57:13.874282 27908 rippled.cc:62] Start
    I0306 15:57:13.874511 27910 mysql_master_session.cc:181] Master session starting
    I0306 15:57:13.882601 27910 mysql_client_connection.cc:148] connected to host: 127.0.0.1, port: 13001
    I0306 15:57:13.895349 27910 mysql_master_session.cc:137] Connected to host: 127.0.0.1, port: 13001, server_id: 1, server_name:
    W0306 15:57:13.898556 27910 mysql_master_session.cc:197] master does not support semi sync
    I0306 15:57:13.898583 27910 mysql_master_session.cc:206] start replicating from '6ddac507-3f90-11e9-8ee9-00163e000000:0-0-7192'
    I0306 15:57:13.899031 27910 mysql_master_session.cc:229] Master session entering main loop
    I0306 15:57:13.899550 27910 binlog.cc:626] Update binlog position to end_pos: binlog.000000:15251152, gtid: 0-0-7192
    I0306 15:57:13.899572 27910 binlog.cc:616] Skip writing event [ Previous_gtids len = 67 ]
    I0306 15:57:13.899585 27910 binlog.cc:626] Update binlog position to end_pos: binlog.000000:15251152, gtid: 0-0-7192
    ...

    Conclusion

    It may be good to run more tests before using Ripple in production, and to explore its other options, but from a first look it seems to be a very nice and useful product.


    Photo by Kishor on Unsplash

    by Sveta Smirnova at March 15, 2019 01:16 PM

    March 14, 2019

    Oli Sennhauser

    MariaDB and MySQL Database Consolidation

    We see at various customers the request for consolidating their MariaDB and MySQL infrastructure. The advantage of such a measure is clear in the first step: saving costs! And this request typically comes from managers. But what we unfortunately rarely see is this request being questioned from the IT engineering perspective. Because it comes, as with anything in life, with some "costs". So, saving costs with consolidation on one side comes with "costs" for operational complexity on the other side.

    To give you some arguments for arguing with managers we collected some topics to consider before consolidating:

    • Bigger Database Instances are more demanding in handling than smaller ones:
      • Backup and Restore time takes longer. Copying files around takes longer, etc.
      • Possibly your logical backup with mysqldump does not restore any longer in a reasonable amount of time (Mean Time to Repair/Recover (MTTR) is not met any more). You have to think about some physical backup methods including MariaDB or MySQL Enterprise Backup solutions.
      • Consolidated database instances typically contain many different schemas of various different applications. In case of problems you typically want to restore and possibly recover only one single schema and not all schemas. And this becomes much more complicated (depending on your backup strategy). MariaDB/MySQL tooling is not yet (fully) prepared for this situation (#17365). Possibly your old backup strategy is not adequate any more?
      • Binary Logs are written globally, not per schema. Have you considered how to do a PiTR for one or several schemas on your consolidated instance? Not an easy game.
      • When you restore a schema you do not want the application interfering with your restore. How can you properly exclude the one application from your database instance while you are restoring? Locking accounts (possible only with MariaDB 10.4 and MySQL 5.7 and newer). Tricks like --skip-networking, adding Firewall rules, --read-only, database port change (--port=3307), do not work any more (as easily)!
      • In short the costs are: Restore/Recovery Operations become more demanding!
    • Do NOT mix schemas of different criticalities into the same database instance! The worst cases we have seen were some development schemas which were on the same high-availability Cluster as highly critical transactional systems. The developers did some nasty things on their development systems (which IMHO is OK for them on a development system). What nobody considered in this case was that the troubles from the development schema brought down the whole production schema which was located on the same machine... Cost: Risk of failure of your important services caused by some non-important services AND planning becomes more expensive because you need to know more about all the other instances as well.
    • This phenomena is also called Noisy Neighbor effect. Noisy Neighbors become a bigger issue with consolidated systems. You have to know much more in detail what you and everybody else is doing on the system! Do you...? Costs are: More know-how is required, better education and training of people, more clever people, better planning, better monitoring, etc.
    • When you consolidate different applications into one system it becomes more critical than the previous ones on their own. So you have to think about High-Availability solutions. Costs are: 1 to 4 new instances (for HA), more complexity, more know-how, more technologies... Do you plan to buy an Enterprise Support subscription?
    • Do NOT mix different maintenance windows (Asia vs. Europe vs. America) or daily online-business and nightly job processing. You get shorter maintenance windows. Costs are: Better planning is needed, costly night and weekend maintenance time, etc...

      Europe     12:00
      China      19:00  (7 hours ahead of us)
      US east    07:00  (5 hours behind us)
      US west    04:00  (8 hours behind us)
    • Resource Fencing becomes more tricky. Within the same instance resource fencing becomes more tricky and is not really doable atm. MySQL 8.0 shows some first steps with the Resource Groups but this is pretty complicated and is by far not complete and usable yet. A better way would be to install several instances on the same machine and fence them with some O/S means like Control Groups (see the sketch after this list). This comes at the cost of know-how, complexity and more complicated set-ups.
    • Naming conflicts can happen: Application a) is called `wiki` and application b) is called `wiki` as well and for some reasons you cannot rename them (any more).
    • Monitoring becomes much more demanding and needs to be done more fine grained. You want to know exactly what is going on your system because it can easily have some side effects on many different schemas/applications. Example of today: We were running out of kernel file descriptors (file-max) and we did not recognize it in the beginning.
    • Consolidated things are a much a higher Bulk Risk (this is true also for SAN or Virtualisation Clusters). When you have an outage not only one application is down but the whole company is down. We have seen this already for SAN and Virtualisation Clusters and we expect to see that soon also on highly consolidated Database Clusters. Costs: Damage on the company is bigger for one incident.
    • Different applications have different configuration requirements which possibly conflict with other requirements from other applications (Jira from Atlassian is a good example for this).
      Server variables cannot be adjusted any more according to somebody’s individual wishes...
      • sql_mode: Some old legacy applications still require ONLY_FULL_GROUP_BY) :-(
      • The requirements are conflicting: Performance/fast vs. Safe/durability: innodb_flush_log_at_trx_commit, sync_binlog, crash-safe binary logging, etc.
      • Transaction isolation: transaction_isolation = READ-COMMITTED (old: tx_isolation, Jira again as an example) vs. REPEATABLE-READ (default). Other applications do not expect the transaction isolation behaviour to change and cannot cope with it. Have you ever asked your developers if their application can cope with a different transaction isolation level? :-) Do they know what you are talking about?
      • Character set (utf8_bin for Jira as example again), which can be changed globally or on a schema level, but it has to be done correctly for all participants.
    • Some applications require MariaDB some application require MySQL. They are not the same databases any more nowadays (8.0 vs. 10.3/10.4). So you cannot consolidate them (easily).
    • You possibly get a mixture of persistent connections (typically Java with connection pooling) and non-persistent connections (typically PHP and other languages). Which causes different database behaviour, which has an impact on how you configure the database instance. Which is more demanding and needs more knowledge of the database AND the application or you solve it with more RAM.
    • You need to know much more about you application to understand what it does and how could it interfere with others...
    • When you consolidate more and more schemas into your consolidated database server you have to adjust your database setting as well from time to time (innodb_buffer_pool_size, table_open_cache, table_definition_cache, O/S File descriptors, etc). And possibly add more RAM, CPU and stronger I/O. When is your network saturated? Have you thought about this already?
    • Upgrading MariaDB/MySQL and changes in database configuration become more demanding in communication and coordination. Potentially several development teams are affected. And they possibly have even different requirements/needs in O/S, forks and database versions. Or they are not even willing or able to upgrade.
    • If you have different schemas on the same Instance it is easier to access data in different schemas at the same time in the same query. This can cause (unwanted) dependencies between those schemas. The database becomes the interface between applications. Here you have to be very restrictive with user privileges to avoid these dependencies. From an architecture point of view it would be more preferable to use clearly defined interfaces outside of the database. For example APIs. But those APIs require much more development resources than a simple SQL query. The problem comes later: If you want to separate the schemas again into different instances the effort is increasing significantly to split/rewrite the JOIN queries and the underlying data sets. Or the depending schemas must be moved all together which causes longer downtimes for applications and requires more coordination between teams.
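
    As referenced in the resource-fencing point above, a minimal sketch of fencing one database instance with Control Groups via systemd (the unit name and the limits are hypothetical and must be adapted to your setup):

    $ sudo systemctl set-property mariadb@instance1.service MemoryMax=8G CPUQuota=200%

    With --runtime the limit applies only until the next reboot instead of being written as a persistent drop-in.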

    This leads us to the result that consolidation lets us save some costs on infrastructure but adds additional costs for complexity, skills, etc. These costs will grow exponentially and thus at some point it is not worth the effort any more. This will end up in not only one big consolidated instance but possibly in a handful of them.

    Where this point lies for you, you have to find out yourself...

    Alternatives to consolidating everything into one instance

    • 1 Machine can contain 1 to many Database Instances can contain 1 to many Schemas. Instead of putting all schemas into one machine, think about installing several instances on one machine. This comes at the cost of more complexity. MyEnv will help you to manage this additional complexity.
    • 1 Machine can contain 1 to many Virtual Machines (VMs, kvm, XEN, VMWare, etc.) can contain 1 to many Instance(s) can contain 1 to many Schemas. This comes at the cost of even more complexity and pretty complex technology (Virtualization).

    A big thanks to Antoniya K. for her valuable feedback!

    by Shinguz at March 14, 2019 10:05 PM

    Peter Zaitsev

    Percona’s Open Source Data Management Software Survey

    Click Here to Complete our New Survey!

    Last year we informally surveyed the open source community and our conference attendees.
    The results revealed that:

    • 48% of those in the cloud choose to self-manage their databases, but 52% were comfortable relying on the DBaaS offering of their cloud vendor.
    • 49% of people said “performance issues” when asked, “what keeps you up at night?”
    • The major decision influence for buying services was price, with 42% of respondents keen to make the most of their money.

    We found this information so interesting that we wanted to find out more! As a result, we are pleased to announce the launch of our first annual Open Source Data Management Software Survey.

    The final results will be 100% anonymous, and will be made freely available on Creative Commons.

    How Will This Survey Help The Community?

    Unlimited access to accurate market data is important. Millions of open source projects are in play, and most are dependent on databases. Accurate market data helps you track the popularity of different databases, as well as seeing how and where these databases are run. This helps us all build better software and take advantage of shifting trends.

    Thousands of vendors are focused on helping SysAdmins, DBAs, and Developers get the most out of their database infrastructure. Insightful market data enables them to create better tools that meet current demands and grow the open source database market.

    We want to assist companies who are still deciding what, how, and where to run their systems. This information will help them understand the industry direction and allow them to make an informed decision on the software and services they choose.

    How Can You Help Make This Survey A Success?

    Firstly, please share your insight into current trends and new developments in open source data management software.

    Secondly, please share this survey with other people who work in the industry, and encourage them to contribute.

    The more responses we receive, the more useful this will be to the whole open source community. If we missed anything, or you would like to ask other questions in future, let us know!

    So tell us: who are the big fish, and which minnows are nibbling at their tails?! Is the cloud giving you altitude sickness, or are you flying high? What is the next big thing and is everyone on board, or is your company lagging behind?

    Preliminary results will be presented at our annual Percona Live Conference in Austin, Texas (May 28-30, 2019) by our CEO, Peter Zaitsev and released to the open source community when finalized.

    Click Here to Have Your Say!

    by Rachel Pescador at March 14, 2019 11:08 AM

    March 13, 2019

    Oli Sennhauser

    FromDual Performance Monitor for MariaDB and MySQL 1.0.2 has been released

    FromDual has the pleasure to announce the release of the new version 1.0.2 of its popular Database Performance Monitor for MariaDB, MySQL, Galera Cluster and Percona Server, fpmmm.

    The new FromDual Performance Monitor for MariaDB and MySQL (fpmmm) can be downloaded from here. How to install and use fpmmm is documented in the fpmmm Installation Guide.

    In the inconceivable case that you find a bug in the FromDual Performance Monitor for MariaDB and MySQL, please report it to the FromDual Bugtracker or just send us an email.

    Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

    Monitoring as a Service (MaaS)

    You do not want to set up your database monitoring yourself? No problem: choose our MariaDB and MySQL Monitoring as a Service (MaaS) program to save costs!

    Upgrade from 1.0.x to 1.0.2

    shell> cd /opt
    shell> tar xf /download/fpmmm-1.0.2.tar.gz
    shell> rm -f fpmmm
    shell> ln -s fpmmm-1.0.2 fpmmm
    

    Changes in FromDual Performance Monitor for MariaDB and MySQL 1.0.2

    This release contains various bug fixes.

    You can verify your current FromDual Performance Monitor for MariaDB and MySQL version with the following command:

    shell> fpmmm --version
    

    fpmmm agent

    • Server entropy probe added.
    • Processlist empty state is covered.
    • Processlist statements made more robust.
    • Error caught properly after query.
    • Branch for Ubuntu is different, fixed.
    • PHP variable variables_order is included in the program.
    • Fixed the documentation URL in file INSTALL.
    • Connection was not set to utf8. This is fixed now.
    • fprint error fixed.
    • Library myEnv.inc updated from MyEnv project.

    fpmmm Templates

    • Backup template added.
    • SQL thread and IO thread error more verbose and running again triggers implemented. Typo in slave template fixed.
    • Forks graph fixed, y axis starts from 0.

    fpmmm agent installer

    • Error messages made more flexible.

    For subscriptions of commercial use of fpmmm please get in contact with us.

    by Shinguz at March 13, 2019 07:58 PM

    Peter Zaitsev

    Super Saver Discount Ends 17 March for Percona Live 2019

    Tutorials and initial sessions are set for the Percona Live Open Source Database Conference 2019, to be held May 28-30 at the Hyatt Regency in Austin, Texas! Percona Live 2019 is the premier open source database conference event for users of MySQL®, MariaDB®, MongoDB®, and PostgreSQL. It will feature 13 tracks presented over two days, plus a day of hands-on tutorials. Register now to enjoy our best Super Saver Registration rates which end March 17, 2019 at 11:30 p.m. PST.

    Sample Sessions

    Here is one item from each of our 13 tracks, samples from our full conference schedule.  Note too that many more great talks will be announced soon!

    1. MySQL®: The MySQL Query Optimizer Explained Through Optimizer Trace by Øystein Grøvlen of Alibaba Cloud.
    2. MariaDB®:  MariaDB Security Features and Best Practices by Robert Bindar of MariaDB Foundation.
    3. MongoDB®: MongoDB: Where Are We Going From Here? presented by David Murphy, Huawei
    4. PostgreSQL: A Deep Dive into PostgreSQL Indexing by Ibrar Ahmed, Percona
    5. Other Open Source Databases: ClickHouse Data Warehouse 101: The First Billion Rows by Alexander Zaitsev and Robert Hodges, Altinity
    6. Observability & Monitoring: Automated Database Monitoring at Uber with M3 and Prometheus by Rob Skillington and Richard Artoul, Uber
    7. Kubernetes: Complex Stateful Applications Made Easier with Kubernetes by Patrick Galbraith of Oracle MySQL
    8. Automation & AI: Databases at Scale, at Square by Emily Slocombe, Square
    9. Java Development for Open Source Databases: Introducing Java Profiling via Flame Graphs by Agustín Gallego, Percona
    10. Migration to Open Source Databases: Migrating between Proprietary and Open Source Database Technologies – Calculating your ROI by John Schultz, The Pythian Group
    11. Polyglot Persistence: A Tale of 8T (Transportable Tablespaces Vs Mysqldump) by Kristofer Grahn, Verisure AB
    12. Database Security & Compliance: MySQL Security and Standardization at PayPal by Stacy Yuan and Yashada Jadhav, Paypal Holdings Inc
    13. Business and Enterprise: MailChimp – Scale A MySQL Perspective by John Scott, MailChimp

    Venue

    Percona Live 2019 will be held at the downtown Hyatt Regency Austin Texas.  Located on the shores of Lady Bird Lake, it’s near water sports like kayaking, canoeing, stand-up paddling, and rowing. There are many food and historical sites nearby, such as the Texas Capitol, the LBJ Library, and Barton Springs Pool.  Book here for Percona’s conference room rate.

    Sponsorships

    Sponsors of Percona Live 2019 can interact with DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solution vendors, and entrepreneurs who typically attend Percona Live. Download our prospectus for more information.

    by Bronwyn Campbell at March 13, 2019 04:18 PM

    Live MySQL Slave Rebuild with Percona Toolkit

    Recently, we had an edge case where a MySQL slave went out-of-sync but it couldn’t be rebuilt from scratch. The slave was acting as a master server to some applications, and data was being written to it. It was a design error, and this is not recommended, but it happened. So how do you synchronize the data in this circumstance? This blog post describes the steps taken to recover from this situation. The tools used to recover the slave were pt-slave-restart, pt-table-checksum, pt-table-sync, and mysqldiff.

    Scenario

    To illustrate this situation, a master x slave configuration was built, with sysbench running on the master server to simulate a general application workload. The environment was set up with Percona Server 5.7.24-26 and sysbench 1.0.16.

    Below are the sysbench commands to prepare and simulate the workload:

    # Create Data
    sysbench --db-driver=mysql --mysql-user=root --mysql-password=msandbox \
      --mysql-socket=/tmp/mysql_sandbox45008.sock --mysql-db=test --range_size=100 \
      --table_size=5000 --tables=100 --threads=1 --events=0 --time=60 \
      --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua prepare
    # Simulate Workload
    sysbench --db-driver=mysql --mysql-user=root --mysql-password=msandbox \
      --mysql-socket=/tmp/mysql_sandbox45008.sock --mysql-db=test --range_size=100 \
      --table_size=5000 --tables=100 --threads=10 --events=0 --time=6000 \
      --rand-type=uniform /usr/share/sysbench/oltp_read_write.lua --report-interval=1 run

    With the environment set, the slave server was stopped, and some operations to desynchronize the slave were performed to reproduce the problem.
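    As an illustration only (the post does not list the exact operations performed), the kind of statements that can produce this situation looks roughly like the sketch below; the binary log file name is hypothetical:

    -- On the master, while the slave is stopped: rotate and purge binary logs,
    -- so the slave's saved replication position no longer exists.
    FLUSH BINARY LOGS;
    PURGE BINARY LOGS TO 'mysql-bin.000007';
    -- On the slave: write to it directly, so its data drifts away from the master.
    DELETE FROM test.sbtest100 WHERE id BETWEEN 1 AND 100;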

    Fixing the issue

    With the slave desynchronized, a restart on the replication was executed. Immediately, the error below appeared:

    Last_IO_Errno: 1236
    Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'

    To recover the slave from this error, we had to point the slave to an existing binary log with a valid binary log position. To get a valid binary log position, the command shown below had to be executed on the master:

    master [localhost] {msandbox} ((none)) > show master status\G
    *************************** 1. row ***************************
    File: mysql-bin.000007
    Position: 218443612
    Binlog_Do_DB:
    Binlog_Ignore_DB:
    Executed_Gtid_Set:
    1 row in set (0.01 sec)

    Then, a CHANGE MASTER command was run on the slave:

    slave1 [localhost] {msandbox} (test) > change master to master_log_file='mysql-bin.000007', MASTER_LOG_POS=218443612;
    Query OK, 0 rows affected (0.00 sec)
    slave1 [localhost] {msandbox} (test) > start slave;
    Query OK, 0 rows affected (0.00 sec)
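    After restarting replication, the state can be checked with SHOW SLAVE STATUS; a minimal sketch of what to look at (not the full output):

    -- Run on the slave; the fields of interest are Slave_IO_Running, Slave_SQL_Running,
    -- Last_IO_Errno/Last_IO_Error, Last_SQL_Errno/Last_SQL_Error and Seconds_Behind_Master.
    SHOW SLAVE STATUS\G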

    Now the slave had a valid binary log file to read, but since it was inconsistent, it hit another error:

    Last_SQL_Errno: 1032
                   Last_SQL_Error: Could not execute Delete_rows event on table test.sbtest8; Can't find record in 'sbtest8', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000005, end_log_pos 326822861

    Working past the errors

    Before fixing the inconsistencies, it was necessary to keep the replication running and to skip the errors. For this, the pt-slave-restart tool was used. The tool needs to be run on the slave server:

    pt-slave-restart --user root --socket=/tmp/mysql_sandbox45008.sock --ask-pass

    The tool skips errors and starts the replication threads. Below is an example of the output of the pt-slave-restart:

    $ pt-slave-restart --user root --socket=/tmp/mysql_sandbox45009.sock --ask-pass
    Enter password:
    2019-02-22T14:18:01 S=/tmp/mysql_sandbox45009.sock,p=...,u=root mysql-relay.000007        1996 1146
    2019-02-22T14:18:02 S=/tmp/mysql_sandbox45009.sock,p=...,u=root mysql-relay.000007        8698 1146
    2019-02-22T14:18:02 S=/tmp/mysql_sandbox45009.sock,p=...,u=root mysql-relay.000007       38861 1146

    Finding the inconsistencies

    With the tool running in one terminal, the phase to check the inconsistencies began. First things first, an object definition check was performed using the mysqldiff utility. The mysqldiff tool is part of MySQL Utilities. To execute the tool:

    $ mysqldiff --server1=root:msandbox@localhost:48008 --server2=root:msandbox@localhost:48009 test:test --difftype=sql --changes-for=server2

    And below are the differences found between the master and the slave:

    1-) A table that doesn’t exist

    # WARNING: Objects in server1.test but not in server2.test:
    # TABLE: joinit

    2-) A wrong table structure

    # Comparing `test`.`sbtest98` to `test`.`sbtest98` [FAIL]
    # Transformation for --changes-for=server2:
    #
    ALTER TABLE `test`.`sbtest98`
    DROP INDEX k_98,
    DROP COLUMN x,
    ADD INDEX k_98 (k);
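    A minimal sketch (my own, not from the original post) of how such a structural fix can be applied only on the slave, keeping the change out of the slave's own binary log:

    -- Run on the slave; sql_log_bin=0 keeps this session's changes out of the slave's binlog.
    SET SESSION sql_log_bin = 0;
    ALTER TABLE `test`.`sbtest98` DROP INDEX k_98, DROP COLUMN x, ADD INDEX k_98 (k);
    SET SESSION sql_log_bin = 1;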

    After applying the recommendations on the slave (creating the missing table and applying the table modification above), the object definitions were equal. The next step was to check data consistency. For this, pt-table-checksum was used to identify which tables were out of sync. This command was run on the master server.

    $ pt-table-checksum -uroot -pmsandbox --socket=/tmp/mysql_sandbox48008.sock --replicate=percona.checksums --create-replicate-table --empty-replicate-table --no-check-binlog-format --recursion-method=hosts

    And an output example:

    01 master]$ pt-table-checksum --recursion-method dsn=D=percona,t=dsns --no-check-binlog-format --nocheck-replication-filter --host 127.0.0.1 --user root --port 48008 --password=msandbox
    Checking if all tables can be checksummed ...
    Starting checksum ...
      at /usr/bin/pt-table-checksum line 332.
    Replica lag is 66 seconds on bm-support01.bm.int.percona.com.  Waiting.
    Replica lag is 46 seconds on bm-support01.bm.int.percona.com.  Waiting.
    Replica lag is 33 seconds on bm-support01.bm.int.percona.com.  Waiting.
               TS ERRORS  DIFFS     ROWS  DIFF_ROWS  CHUNKS SKIPPED    TIME TABLE
    02-26T16:27:59      0      0     5000          0       1       0   0.037 test.sbtest1
    02-26T16:27:59      0      0     5000          0       1       0   0.039 test.sbtest10
    02-26T16:27:59      0      1     5000          0       1       0   0.033 test.sbtest100
    02-26T16:27:59      0      1     5000          0       1       0   0.034 test.sbtest11
    02-26T16:27:59      0      1     5000          0       1       0   0.040 test.sbtest12
    02-26T16:27:59      0      1     5000          0       1       0   0.034 test.sbtest13

    Fixing the data inconsistencies

    Analyzing the DIFFS column, it is possible to identify which tables were compromised. With this information, the pt-table-sync tool was used to fix these inconsistencies. The tool synchronizes MySQL table data efficiently, performing no-op changes on the master so they can be replicated and applied on the slave. The tool needs to be run on the slave server. Below is an example of the tool running:

    $ pt-table-sync --execute --sync-to-master h=localhost,u=root,p=msandbox,D=test,t=sbtest100,S=/tmp/mysql_sandbox48009.sock

    It is possible to perform a dry-run of the tool before executing the changes to check what changes the tool will apply:

    $ pt-table-sync --print --sync-to-master h=localhost,u=root,p=msandbox,D=test,t=sbtest100,S=/tmp/mysql_sandbox48009.sock
    REPLACE INTO `test`.`sbtest100`(`id`, `k`, `c`, `pad`) VALUES ('1', '1654', '97484653464-60074971666-42998564849-40530823048-27591234964-93988623123-02188386693-94155746040-59705759910-14095637891', '15000678573-85832916990-95201670192-53956490549-57402857633') /*percona-toolkit src_db:test src_tbl:sbtest100 src_dsn:D=test,P=48008,S=/tmp/mysql_sandbox48009.sock,h=127.0.0.1,p=...,t=sbtest100,u=root dst_db:test dst_tbl:sbtest100 dst_dsn:D=test,S=/tmp/mysql_sandbox48009.sock,h=localhost,p=...,t=sbtest100,u=root lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:17806 user:vinicius.grippa host:bm-support01.bm.int.percona.com*/;
    REPLACE INTO `test`.`sbtest100`(`id`, `k`, `c`, `pad`) VALUES ('2', '3007', '31679133794-00154186785-50053859647-19493043469-34585653717-64321870163-33743380797-12939513287-31354198555-82828841987', '30122503210-11153873086-87146161761-60299188705-59630949292') /*percona-toolkit src_db:test src_tbl:sbtest100 src_dsn:D=test,P=48008,S=/tmp/mysql_sandbox48009.sock,h=127.0.0.1,p=...,t=sbtest100,u=root dst_db:test dst_tbl:sbtest100 dst_dsn:D=test,S=/tmp/mysql_sandbox48009.sock,h=localhost,p=...,t=sbtest100,u=root lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:17806 user:vinicius.grippa host:bm-support01.bm.int.percona.com*/;

    After executing the pt-table-sync, we recommend that you run the pt-table-checksum again and check if the DIFFS column shows the value of 0.
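    The remaining differences can also be checked directly on the replica by querying the table pt-table-checksum writes to (percona.checksums here, matching the --replicate option used above); this is essentially the query suggested in the tool's documentation:

    -- Run on the replica: list tables whose row counts or checksums differ from the master.
    SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks
    FROM percona.checksums
    WHERE (
      master_cnt <> this_cnt
      OR master_crc <> this_crc
      OR ISNULL(master_crc) <> ISNULL(this_crc))
    GROUP BY db, tbl;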

    Conclusion

    This blog post was intended to cover all possible issues that could happen on a slave when it goes out-of-sync such as DDL operations, binary log purge and DML operations. This process involves many steps and it could take a long time to finish, especially in large databases. Note that this process might take longer than the backup/restore process. However, in situations like the one mentioned above, it might be the only solution to recover a slave.


    Image based on Photo by Randy Fath on Unsplash

     

    by Vinicius Grippa at March 13, 2019 11:58 AM

    March 12, 2019

    Peter Zaitsev

    Upcoming Webinar Thurs 3/14: Web Application Security – Why You Should Review Yours

    Please join Percona's Information Security Architect, David Busby, as he presents his talk Web Application Security – Why You Should Review Yours on March 14th, 2019, at 6:00 AM PDT (UTC-7) / 9:00 AM EDT (UTC-4).

    Register Now

    In this talk, we take a look at the whole stack and I don’t just mean LAMP.

    We’ll cover what an attack surface is and some areas you may look to in order to ensure that you can reduce it.

    For instance, what’s an attack surface?

    Acronym Hell, what do they mean?

    Vulnerability Naming, is this media naming stupidity or driving the message home?

    Detection, Prevention and avoiding the boy who cried wolf are some further examples.

    Additionally, we’ll cover emerging technologies to keep an eye on or even implement yourself to help improve your security posture.

    There will also be a live compromise demo (or backup video if something fails) that covers compromising a PCI compliant network structure to reach the database system. Through this compromise you can ultimately exploit multiple failures to gain bash shell access over the MySQL protocol.

    by David Busby at March 12, 2019 08:59 PM

    PMM’s Custom Queries in Action: Adding a Graph for InnoDB mutex waits

    PMM mutex wait graph

    One of the great things about Percona Monitoring and Management (PMM) is its flexibility. An example of that is how one can go beyond the exporters to collect data. One approach to achieve that is using textfile collectors, as explained in  Extended Metrics for Percona Monitoring and Management without modifying the Code. Another method, which is the subject matter of this post, is to use custom queries.

    While working on a customer’s contention issue I wanted to check the behaviour of InnoDB Mutexes over time. Naturally, I went straight to PMM and didn’t find a graph suitable for my needs. No graph, no problem! Luckily anyone can enhance PMM. So here’s how I made the graph I needed.

    The final result will look like this:

    Custom Queries

    What is it?

    Starting from version 1.15.0, PMM provides users with the ability to take a SQL SELECT statement and turn the result set into a metric series in PMM. That is what custom queries are.

    How do I enable that feature?

    This feature is ON by default. You only need to edit the configuration file, which uses YAML syntax.

    Where is the configuration file located?

    The config file location is /usr/local/percona/pmm-client/queries-mysqld.yml by default. You can change it when adding MySQL metrics via pmm-admin:

    pmm-admin add mysql:metrics ... -- --queries-file-name=/usr/local/percona/pmm-client/query.yml

    How often is data being collected?

    The queries are executed at the LOW RESOLUTION level, which by default is every 60 seconds.

    InnoDB Mutex monitoring

    The method used to gather mutex status is querying the Performance Schema, as explained here: https://dev.mysql.com/doc/refman/5.7/en/monitor-innodb-mutex-waits-performance-schema.html, but with the SUM_TIMER_WAIT > 0 condition intentionally removed, so the query used looks like this:

    SELECT
    EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT
    FROM performance_schema.events_waits_summary_global_by_event_name
    WHERE EVENT_NAME LIKE 'wait/synch/mutex/innodb/%'

    For this query to return data, some requirements need to be met:

    • The most important one: Performance Schema needs to be enabled
    • Consumers for "events_waits" enabled
    • Instruments for 'wait/synch/mutex/innodb' enabled

    If performance schema is enabled, the other two requirements are met by running these two queries:

    update performance_schema.setup_instruments set enabled='YES' where name like 'wait/synch/mutex/innodb%';
    update performance_schema.setup_consumers set enabled='YES' where name like 'events_waits%';
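    As a quick sanity check (a small sketch of my own, not from the original post), you can confirm nothing relevant is still disabled before wiring up the custom query:

    -- Both queries should return an empty set once the instrumentation is fully enabled.
    SELECT NAME, ENABLED FROM performance_schema.setup_instruments
    WHERE NAME LIKE 'wait/synch/mutex/innodb%' AND ENABLED = 'NO';
    SELECT NAME, ENABLED FROM performance_schema.setup_consumers
    WHERE NAME LIKE 'events_waits%' AND ENABLED = 'NO';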

    YAML Configuration File

    This is where the magic happens. The YAML syntax is covered in depth in the documentation: https://www.percona.com/doc/percona-monitoring-and-management/conf-mysql.html#pmm-conf-mysql-executing-custom-queries

    The one used for this issue is:

    ---
    mysql_global_status_innodb_mutex:
        query: "SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT FROM performance_schema.events_waits_summary_global_by_event_name WHERE EVENT_NAME LIKE 'wait/synch/mutex/innodb/%'"
        metrics:
          - EVENT_NAME:
              usage: "LABEL"
              description: "Name of the mutex"
          - COUNT_STAR:
              usage: "COUNTER"
              description: "Number of calls"
          - SUM_TIMER_WAIT:
              usage: "GAUGE"
              description: "Duration"

    The key info is:

    • The metric name is mysql_global_status_innodb_mutex
    • Since EVENT_NAME is used as a label, it will be possible to have values per event

    Remember that this should go in the queries-mysqld.yml file, full path /usr/local/percona/pmm-client/queries-mysqld.yml, on the database node.

    Once that is done, you will start to have those metrics available in Prometheus. Now we have a graph to build!

    Creating the graph in Grafana

    Before jumping into Grafana to add the graph, we need a proper Prometheus query (a.k.a. PromQL). I came up with these two (one for COUNT_STAR, one for SUM_TIMER_WAIT):

    topk(5, label_replace(rate(mysql_global_status_innodb_mutex_COUNT_STAR{instance="$host"}[$interval]), "mutex", "$2", "EVENT_NAME", "(.*)/(.*)" ) or label_replace(irate(mysql_global_status_innodb_mutex_COUNT_STAR{instance="$host"}[5m]), "mutex", "$2", "EVENT_NAME", "(.*)/(.*)" ))

    and

    topk(5, label_replace(rate(mysql_global_status_innodb_mutex_SUM_TIMER_WAIT{instance="$host"}[$interval]), "mutex", "$2", "EVENT_NAME", "(.*)/(.*)" ) or label_replace(irate(mysql_global_status_innodb_mutex_SUM_TIMER_WAIT{instance="$host"}[5m]), "mutex", "$2", "EVENT_NAME", "(.*)/(.*)" ))

    These queries basically return the rate values of each mutex event for a specific host, and apply a regex replacement to keep only the name of the event, discarding whatever comes before the last slash character.

    Once we are good with our PromQL queries, we can go and add the graph.

    Finally, I got the graph that I needed with very little effort.

    The dashboard is also published on the Grafana Labs Community dashboards site.

    Summary

    PMM's collection of graphs and dashboards is quite complete, but it is also natural that there are specific metrics that might not be there. For those cases, you can count on the flexibility and ease of use of PMM to collect metrics and create custom graphs. So go ahead, embrace PMM, customize it, make it yours!

    The JSON for this graph, so it can be imported easily, is:

    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fill": 0,
      "gridPos": {
        "h": 18,
        "w": 24,
        "x": 0,
        "y": 72
      },
      "id": null,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": false,
        "max": true,
        "min": true,
        "rightSide": false,
        "show": true,
        "sideWidth": 0,
        "sort": "avg",
        "sortDesc": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 2,
      "links": [],
      "nullPointMode": "null",
      "percentage": false,
      "pointradius": 0.5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "/Timer Wait/i",
          "yaxis": 2
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "topk(5, label_replace(rate(mysql_global_status_innodb_mutex_COUNT_STAR{instance=\"$host\"}[$interval]), \"mutex\", \"$2\", \"EVENT_NAME\", \"(.*)/(.*)\" )) or topk(5,label_replace(irate(mysql_global_status_innodb_mutex_COUNT_STAR{instance=\"$host\"}[5m]), \"mutex\", \"$2\", \"EVENT_NAME\", \"(.*)/(.*)\" ))",
          "format": "time_series",
          "interval": "$interval",
          "intervalFactor": 1,
          "legendFormat": "{{ mutex }} calls",
          "refId": "A",
          "hide": false
        },
        {
          "expr": "topk(5, label_replace(rate(mysql_global_status_innodb_mutex_SUM_TIMER_WAIT{instance=\"$host\"}[$interval]), \"mutex\", \"$2\", \"EVENT_NAME\", \"(.*)/(.*)\" )) or topk(5, label_replace(irate(mysql_global_status_innodb_mutex_SUM_TIMER_WAIT{instance=\"$host\"}[5m]), \"mutex\", \"$2\", \"EVENT_NAME\", \"(.*)/(.*)\" ))",
          "format": "time_series",
          "interval": "$interval",
          "intervalFactor": 1,
          "legendFormat": "{{ mutex }} timer wait",
          "refId": "B",
          "hide": false
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeShift": null,
      "title": "InnoDB Mutex",
      "tooltip": {
        "shared": true,
        "sort": 2,
        "value_type": "individual"
      },
      "transparent": false,
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": "",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "decimals": null,
          "format": "ns",
          "label": "",
          "logBase": 1,
          "max": null,
          "min": "0",
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }

    by Daniel Guzmán Burgos at March 12, 2019 06:31 PM

    March 11, 2019

    Peter Zaitsev

    Switch your PostgreSQL Primary for a Read Replica, Without Downtime

    postgres read replica from primary

    In my ongoing research to identify solutions and similarities between MySQL and PostgreSQL, I recently faced a simple issue. I needed to perform a slave shift from one IP to another, and I did not want to have to restart the slave that is serving the reads. In MySQL, I can repoint the replication online with the CHANGE MASTER TO command, so I was looking for a similar solution in PostgreSQL. In my case, I could also afford some stale reads, so a few seconds of delay would have been OK, but I couldn't take down the server.

    After brief research, I noticed that there is no solution that allows you to do that without restarting the PostgreSQL server instance. I was a bit disappointed, because I was just trying to move the whole traffic from one subnet to another, so not really changing the Master, but just the pointer.

    At this point I raised my question to my colleagues who are experts in PG. Initially they confirmed to me that there is no real dynamic solution/command for that. However, while discussing this, one of them (Jobin Augustine) suggested a way that is not "officially supported" but might work.

    In brief, given that the WAL Receiver uses its own process, killing it would trigger an internal refresh operation, and that could result in having the replication restart from the new desired configuration.

    This was an intriguing suggestion, but I wondered if it might have some negative side effects. In any case, I decided to try it and see what would happen.

    This article describes the process I followed to test the approach. To be clear: this is not an "official" solution, and it is not recommended as best practice.

    From now on in this article I will drop the standard MySQL terms and instead use Primary for Master and Replica for Slave.

    Scenarios

    I carried out two main tests:

    1. No write load
    2. Writes happening

    For each of these I took the following steps:

    a) move the Replica to the same Primary (different IP)
    b) move the Replica to a different Primary/Replica, creating a chain, so from:

                              +--------+
                              | Primary|
                              +----+---+
                                   |
                    +--------+     |     +--------+
                    |Replica1+<----+---->+Replica2|
                    +--------+           +--------+

    To:

                              +-------+
                              |Primary|
                              +---+---+
                                  |
                                  v
                              +---+----+
                              |Replica2|
                              +---+----+
                                  |
                                  v
                              +---+----+
                              |Replica1|
                              +--------+

    The other thing was to try to be as non-invasive as possible. Given that, I used KILL SIGQUIT(3) instead of the more brutal SIGKILL.

    SIGQUIT: "The SIGQUIT signal is sent to a process by its controlling terminal when the user requests that the process quit and perform a core dump."

    Note that I did try this with SIGTERM (15), which is the nicest approach, but it did not in fact force the process to perform the shift as desired.

    In general, in all the following tests, what I execute is:

    ps aux|grep 'wal receiver'
    kill -3 <pid>
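    On the replica itself, one way to confirm that the WAL receiver came back up with the new connection settings is to query pg_stat_wal_receiver (a sketch of my own; the conninfo column assumes PostgreSQL 9.6 or later):

    -- Run on the replica: conninfo should show the new host after the receiver restarts.
    select pid, status, conninfo from pg_stat_wal_receiver;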

    These are the current IPs for the nodes:

    Node1 (Primary):

    NIC1 = 192.168.1.81
    NIC2 = 192.168.4.81
    NIC3 = 10.0.0.81

    Node2 (replica1):

    NIC1 = 192.168.1.82
    NIC2 = 192.168.4.82
    NIC3 = 10.0.0.82

    Node3 (replica2):

    NIC1 = 192.168.1.83
    NIC2 = 192.168.4.83
    NIC3 = 10.0.0.83

    The starting position is:

    select pid,usesysid,usename,application_name,client_addr,client_port,backend_start,state,sent_lsn,write_lsn,flush_lsn,sync_state from pg_stat_replication;
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     22495 |    24601 | replica | node2            | 192.168.4.82 |       49518 | 2019-02-06 11:07:46.507511-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

    And now let's roll the ball and see what happens.

    Experiment 1 – moving to same Primary no load

    I will move Node2 to point to 192.168.1.81

    In my recovery.conf
    primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

    change to:

    primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.1.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

    [root@pg1h3p82 data]# ps aux|grep 'wal receiver'
    postgres 8343 0.0 0.0 667164 2180 ? Ss Feb06 16:27 postgres: wal receiver process streaming 10/FD6C60E8

    Checking the replication status:

    [root@pg1h3p82 data]# ps aux|grep 'wal receiver'
    postgres  8343  0.0  0.0 667164  2180 ?        Ss   Feb06  16:27 postgres: wal receiver process   streaming 10/FD6C60E8
                                                                      Tue 19 Feb 2019 12:10:22 PM EST (every 1s)
     pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     23748 |    24601 | replica | node2            | 192.168.4.82 |       49522 | 2019-02-19 12:09:31.054915-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
    (2 rows)
                                                                      Tue 19 Feb 2019 12:10:23 PM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
    (1 row)
                                                                      Tue 19 Feb 2019 12:10:26 PM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     23756 |    24601 | replica | node2            | 192.168.1.82 |       37866 | 2019-02-19 12:10:26.904766-05 | catchup   | 10/FD460000 | 10/FD3A0000 | 10/FD6C60E8 | async
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
    (2 rows)
                                                                      Tue 19 Feb 2019 12:10:28 PM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     23756 |    24601 | replica | node2            | 192.168.1.82 |       37866 | 2019-02-19 12:10:26.904766-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
    (2 rows)

    It took six seconds to kill the process, shift to the new IP, and perform the catch-up.

    Experiment 2 – moving to Different Primary (as a chain of replicas) No load

    I will move Node2 to point to 192.168.4.83

    In my recovery.conf
    primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.1.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
    change to:
    primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.83 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

    [root@pg1h3p82 data]# ps aux|grep 'wal receiver'
    postgres 25859 0.0 0.0 667164 3484 ? Ss Feb19 1:53 postgres: wal receiver process

    On Node1

    Thu 21 Feb 2019 04:23:26 AM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
     31241 |    24601 | replica | node2            | 192.168.1.82 |       38232 | 2019-02-21 04:17:24.535662-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async
    (2 rows)
                                                                      Thu 21 Feb 2019 04:23:27 AM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

    On Node3

    pid | usesysid | usename | application_name | client_addr | client_port | backend_start | state | sent_lsn | write_lsn | flush_lsn | sync_state
    -----+----------+---------+------------------+-------------+-------------+---------------+-------+----------+-----------+-----------+------------
    (0 rows)
                                                                      Thu 21 Feb 2019 04:23:30 AM EST (every 1s)
     pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    ------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     1435 |    24601 | replica | node2            | 192.168.4.82 |       58116 | 2019-02-21 04:23:29.846798-05 | streaming | 10/FD6C60E8 | 10/FD6C60E8 | 10/FD6C60E8 | async

    In this case, shifting to a new primary took four seconds.

    Now all this is great, but I was working with NO load. What would happen if reads and writes were taking place?

    Experiment 3 – moving to same Primary WITH Load

    I will move Node2 to point to 192.168.4.81

    In my recovery.conf
    primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.1.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
    change to:
    primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

    [root@pg1h3p82 data]# ps aux|grep 'wal receiver'
    postgres 20765 0.2 0.0 667196 3712 ? Ss 06:23 0:00 postgres: wal receiver process streaming 11/E33F760

    Thu 21 Feb 2019 06:23:03 AM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
     31649 |    24601 | replica | node2            | 192.168.1.82 |       38236 | 2019-02-21 06:21:23.539493-05 | streaming | 11/8FEC000 | 11/8FEC000 | 11/8FEC000 | async
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/8FEC000 | 11/8FEC000 | 11/8FEC000 | async
                                                                     Thu 21 Feb 2019 06:23:04 AM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/904DCC0 | 11/904C000 | 11/904C000 | async
                                                                     Thu 21 Feb 2019 06:23:08 AM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
     31778 |    24601 | replica | node2            | 192.168.4.82 |       49896 | 2019-02-21 06:23:08.978179-05 | catchup   | 11/9020000 |            |            | async
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/9178000 | 11/9178000 | 11/9178000 | async
                                                                     Thu 21 Feb 2019 06:23:09 AM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn  | write_lsn  | flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+------------+------------+------------+------------
     31778 |    24601 | replica | node2            | 192.168.4.82 |       49896 | 2019-02-21 06:23:08.978179-05 | streaming | 11/91F7860 | 11/91F7860 | 11/91F7860 | async
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/91F7860 | 11/91F7860 | 11/91F7860 | async

    In this case, shifting to a new primary took six seconds.

    Experiment 4 – moving to Different Primary (as a chain of replicas) WITH Load

    I move Node2 to point to 192.168.4.83
    In my recovery.conf
    primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

    change to:
    primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.83 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

    [root@pg1h3p82 data]# ps aux|grep 'wal receiver'
    postgres 21158 6.3 0.0 667196 3704 ? Ds 06:30 0:09 postgres: wal receiver process streaming 11/4F000000

    Node1

    Thu 21 Feb 2019 06:30:56 AM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     31778 |    24601 | replica | node2            | 192.168.4.82 |       49896 | 2019-02-21 06:23:08.978179-05 | streaming | 11/177F8000 | 11/177F8000 | 11/177F8000 | async
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/177F8000 | 11/177F8000 | 11/177F8000 | async
    (2 rows)
                                                                      Thu 21 Feb 2019 06:30:57 AM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/17DAA000 | 11/17DAA000 | 11/17DAA000 | async
    (1 row)

    Node3

    Thu 21 Feb 2019 06:31:01 AM EST (every 1s)
     pid | usesysid | usename | application_name | client_addr | client_port | backend_start | state | sent_lsn | write_lsn | flush_lsn | sync_state
    -----+----------+---------+------------------+-------------+-------------+---------------+-------+----------+-----------+-----------+------------
    (0 rows)
                                                                     Thu 21 Feb 2019 06:31:02 AM EST (every 1s)
     pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |  state  |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    ------+----------+---------+------------------+--------------+-------------+-------------------------------+---------+-------------+-------------+-------------+------------
     1568 |    24601 | replica | node2            | 192.168.4.82 |       58122 | 2019-02-21 06:31:01.937957-05 | catchup | 11/17960000 | 11/17800000 | 11/177F8CC0 | async
    (1 row)
                                                                      Thu 21 Feb 2019 06:31:03 AM EST (every 1s)
     pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    ------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
     1568 |    24601 | replica | node2            | 192.168.4.82 |       58122 | 2019-02-21 06:31:01.937957-05 | streaming | 11/1A1D3D08 | 11/1A1D3D08 | 11/1A1D3D08 | async
    (1 row)

    In this case shifting to a new primary took seven seconds.

    Finally, I did another test. I was wondering, can I move the server Node2 back under the main Primary Node1 while writes are happening?

    Well, here’s what happened:

    In my recovery.conf
    primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.83 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
    change to:
    primary_conninfo = 'application_name=node2 user=replica password=replica connect_timeout=10 host=192.168.4.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'

    After I killed the process, as I did in the previous examples, Node2 rejoined the Primary Node1, but …

    Thu 21 Feb 2019 06:33:58 AM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
      1901 |    24601 | replica | node2            | 192.168.4.82 |       49900 | 2019-02-21 06:33:57.81308-05  | catchup   | 11/52E40000 | 11/52C00000 | 11/52BDFFE8 | async
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/5D3F9EC8 | 11/5D3F9EC8 | 11/5D3F9EC8 | async

    … Node2 was not really able to catch up quickly, at least not while the load on the primary remained high. As soon as I reduced the application pressure:

    Thu 21 Feb 2019 06:35:29 AM EST (every 1s)
      pid  | usesysid | usename | application_name | client_addr  | client_port |         backend_start         |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | sync_state
    -------+----------+---------+------------------+--------------+-------------+-------------------------------+-----------+-------------+-------------+-------------+------------
      1901 |    24601 | replica | node2            | 192.168.4.82 |       49900 | 2019-02-21 06:33:57.81308-05  | streaming | 11/70AE8000 | 11/70000000 | 11/70000000 | async
     22449 |    24601 | replica | node3            | 192.168.4.83 |       43648 | 2019-02-06 10:56:32.612439-05 | streaming | 11/70AE8000 | 11/70AE8000 | 11/70AE8000 | async

    Node2 was able to catch up and align itself.

    Conclusions

    In all tests, the Replica was able to rejoin the Primary or the new Primary, with obviously different times.

    From the tests I carried out so far, it seems that modifying the replication source, and then killing the “WAL receiver” thread, is a procedure that allows us to shift the replication source without the need for a service restart.

    This is even more efficient compared to the MySQL solution, given the time taken for the recovery and the flexibility.

    What I am still wondering is IF this might cause some data inconsistency issues or not. I asked some of the PG experts inside the company, and it seems that the process should be relatively safe, but I would appreciate any feedback/comment in case you know this may not be a safe operation.

    Good PostgreSQL to everybody!


    Photo by rawpixel.com from Pexels

    by Marco Tusa at March 11, 2019 12:59 PM

    March 09, 2019

    Valeriy Kravchuk

    Fun with Bugs #81 - On MySQL Bug Reports I am Subscribed to, Part XVII

    Two weeks have passed since my previous review of public MySQL bug reports that I consider interesting enough to subscribe to. Over this period I picked up a dozen or so new public bug reports that I'd like to briefly review today.

    Here is my recent subscriptions list, starting from the oldest bug reports:
    • Bug #94431 - "Can't upgrade from 5.7 to 8.0 if any database have a hyphen in their name". It seems one actually needs a database like that created in MySQL 5.6 with at least one InnoDB table having FULLTEXT index to hit the problem. Great finding by Phil Murray. Note that after several unsuccessful attempts by others the bug was eventually reproduced and verified by Jesper Wisborg Krogh. Let's hope we'll see it fixed in MySQL 8.0.16.
    • Bug #94435 - "mysql command hangs up and cosume CPU almost 100%". It was reported by Masaaki HIROSE, whose previous related/similar Bug #94219 - "libmysqlclient enters and infinite loop and consume CPU usage 100%" ended up as "Not a bug" (wrongly, IMHO, as nobody cared enough to reproduce the steps instead of commenting on their correctness and checking something else). Bug reporter had not only insisted and provided all the details, but also tried to analyze the reasons of the bug and provided links to other potentially related bug reports (Bug #88428 - "mysql_real_query hangs with EINTR errno (using YASSL)" and Bug #92394 - "libmysqlclient enters infinite loop after signal (race condition)"). Great job and nice to see the bug "Verified" eventually.
    • Bug #94441 - "empty ibuf aio reads in innodb status". This regression vs MySQL 5.6 was noted by Nikolai Ikhalainen from Percona. MariaDB 10.3.7 is also affected, unfortunately:
      ...
      I/O thread 9 state: native aio handle (write thread)
      Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
       ibuf aio reads:, log i/o's:, sync i/o's:Pending flushes (fsync) log: 0; buffer pool: 0
      1344 OS file reads, 133 OS file writes, 2 OS fsyncs
      ...
    • Bug #94448 - "Rewrite LOG_BLOCK_FIRST_REC_GROUP during recovery may be dangerous.". Yet another MySQL 8 regression (not marked with "regression" tag) was found by Kang Wang.
    • Bug #94476 - "mysql semisync replication stuck with master in Waiting to finalize termination". It has "Need feedback" status at the moment. I've subscribed to this report from Shirish Keshava Murthy mostly to find out how a report that may look like a free support request will be processed by Oracle engineers. Pure curiosity, for now.
    • Bug #94504 - "AIO::s_log seems useless". This problem was reported by Yuhui Wang. It's a regression in a sense that part of the code is no longer needed (and seems not to be used) in MySQL 8, but still remains.
    • Bug #94541 - "Assertion on import via Transportable Tablespace". This bug reported by  Daniël van Eeden was verified based on code review and some internal discussion. We do not know if any other version besides 5.7.25 is affected, though. The assertion itself:
      InnoDB: Failing assertion: btr_page_get_prev(next_page, mtr) == btr_pcur_get_block(cursor)->page.id.page_no()
      does not seem to be unique. We can find it in MDEV-18455 also (in other context).
    • Bug #94543 - "MySQL does not compile with protobuf 3.7.0". I care about build/compiling bugs historically, as I mostly use MySQL binaries that I built myself from GitHub source. So, I've immediately subscribed to this bug report from Laurynas Biveinis.
    • Bug #94548 - "Optimizer error evaluating JSON_Extract". This bug was reported by Dave Pullin. From my quick test it seems MariaDB 10.3.7 is also affected. Error message is different in the failing case, but the point is the same - the function is not evaluated if the column from derived table that is built using the function is not referenced in the SELECT list. This optimization is questionable and may lead to hidden "bombs" in the application code.
    • Bug #94550 - "generated columns referring to current_timestamp fail". I tried to check simple test case in this bug report by Mario Beck on MariaDB 10.3.7, but it does not seem to accept NOT NULL constraint for generated stored columns at all:
      MariaDB [test]> CREATE TABLE `t2` (
      -> `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
      -> `content` varchar(42) DEFAULT NULL,
      -> `bucket` tinyint(4) GENERATED ALWAYS AS ((floor((to_seconds(`created_at`) / 10)) % 3)) STORED NOT NULL);
      ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'NOT NULL)' at line 4
      I do not see this option in the formal syntax described here either. But in the case of MariaDB we can actually make sure the generated column is never NULL by adding a CHECK constraint like this:
      MariaDB [test]> CREATE TABLE `t2` (
          ->   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
          ->   `content` varchar(42) DEFAULT NULL,
          ->   `bucket` tinyint(4) GENERATED ALWAYS AS ((floor((to_seconds(`created_at`) / 10)) % 3)) STORED);
      Query OK, 0 rows affected (0.434 sec)

      MariaDB [test]> INSERT INTO t2 (content) VALUES ("taraaaa");
      Query OK, 1 row affected (0.070 sec)

      MariaDB [test]> alter table t2 add constraint cnn CHECK (`bucket` is NOT NULL);
      Query OK, 1 row affected (1.159 sec)
      Records: 1  Duplicates: 0  Warnings: 0

      MariaDB [test]> INSERT INTO t2 (content) VALUES ("tarabbb");
      Query OK, 1 row affected (0.029 sec)

      MariaDB [test]> INSERT INTO t2 (content) VALUES ("");
      Query OK, 1 row affected (0.043 sec)

      MariaDB [test]> select * from t2;
      +---------------------+---------+--------+
      | created_at          | content | bucket |
      +---------------------+---------+--------+
      | 2019-03-09 17:28:03 | taraaaa |      0 |
      | 2019-03-09 17:29:43 | tarabbb |      1 |
      | 2019-03-09 17:29:50 |         |      2 |
      +---------------------+---------+--------+
      3 rows in set (0.002 sec)

      MariaDB [test]> show create table t2\G
      *************************** 1. row ***************************
             Table: t2
      Create Table: CREATE TABLE `t2` (
        `created_at` timestamp NOT NULL DEFAULT current_timestamp(),
        `content` varchar(42) DEFAULT NULL,
        `bucket` tinyint(4) GENERATED ALWAYS AS (floor(to_seconds(`created_at`) / 10)
      MOD 3) STORED,
        CONSTRAINT `cnn` CHECK (`bucket` is not null)

      ) ENGINE=InnoDB DEFAULT CHARSET=latin1
      1 row in set (0.011 sec)
      So, maybe after all we can state that MariaDB is NOT affected.
    • Bug #94552 - "innodb.virtual_basic fails when valgrind is enabled". I still wonder if anyone in Oracle runs the MTR test suite on Valgrind-enabled builds (the -DWITH_VALGRIND=1 cmake option), at least in the process of an official release (and if they check the failures). It seems not to be the case, based on this bug report from Manuel Ung.
    • Bug #94553 - "Crash in trx_undo_rec_copy". Bernardo Perez noted that as a side effect of still "Verified" Bug #82734 - "trx_undo_rec_copy needlessly relies on buffer pool page alignment" (that affects both MySQL 5.7 and 8.0) we may get crashes while working with generated columns. I hope to see them both fixed soon, but for now Bug #94553 has status "Need Feedback", probably in a hope to get a repeatable test case. I'll watch it carefully.
    • Bug #94560 - "record comparison in spatial index non-leaf rtree node seems incorrect". I doubt spatial indexes of InnoDB are widely used, and I have no doubts there are many bugs waiting to be discovered in this area. This specific bug was reported by Jie Zhou who had also suggested a fix.
    • Bug #94610 - "Server stalls because ALTER TABLE on partitioned table holds dict mutex". My former colleague Justin Swanhart reported this bug just yesterday, so no wonder it is not verified yet. It refers to a well known verified old Bug #83435 - "ALTER TABLE is very slow when using PARTITIONED table"  (that I've also subscribed to immediately) from Roel Van de Paar, affecting both MySQL 5.6 and 5.7. I hope to see this bug verified and fixed soon, as recently I see this kind of state for main thread:
      Main thread process no. 3185, id 140434206619392, state: enforcing dict cache limit
      too often in INNODB STATUS outputs to my liking...
    As you could note, I still try to check (at least in some cases) if MariaDB is also affected by the same problem. I think it's a useful check both for me (as I work mostly with MariaDB as a support engineer) and for the reader (to know if switching to MariaDB may help in any way or if there are any chances for MariaDB engineers to contribute anything useful, like a fix).

    "Hove, actually". For years residents of Hove used this humorous reply when they live in Brighton... "Regression, actually" is what I want to say (seriously) about every other MySQL bug report I subscribe to... So, you see Hove and many regression bugs above!
    To summarize:
    1. Sometimes Oracle engineers demonstrate proper collective effort to understand and carefully verify public bug reports. Good to know they are not ready to give up fast!
    2. I have to copy-paste this item from my previous post. As the list above proves, Oracle engineers still do not use the "regression" tag when setting "Verified" status for obvious regression bugs. I think bug reporters should then take care to always set it when they report a regression of any kind.
    3. It seems there are no regular MTR test runs on Valgrind builds performed by Oracle engineers, or maybe they just ignore the failures.

    by Valerii Kravchuk (noreply@blogger.com) at March 09, 2019 09:17 PM

    March 07, 2019

    Peter Zaitsev

    Reducing High CPU on MySQL: a Case Study

    CPU Usage after query tuning

    In this blog post, I want to share a case we worked on a few days ago. I'll show you how we approached the resolution of a MySQL performance issue and used Percona Monitoring and Management (PMM) to support troubleshooting. The customer had noticed linear, high CPU usage on one of their MySQL instances and was not able to figure out why, as there was not much traffic hitting the app. We needed to reduce the high CPU usage on MySQL. The server is a small instance:

    Models | 6xIntel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz
    10GB RAM

    This symptom can be caused by many different things. Let's see how PMM can be used to troubleshoot the issue.

    CPU

    The original issue - CPU usage at almost 100% during application use

    It's important to understand where the CPU time is being consumed: user space, system space, iowait, and so on. Here we can see that CPU usage was hitting almost 100%, and the majority of the time was being spent on user space, in other words, the time the CPU spent executing user code, such as MySQL. Once we determined that the time was being spent on user space, we could discard other possible issues. For example, we could eliminate the possibility that a high number of threads were competing for CPU resources, since that would cause an increase in context switches, which in turn would be taken care of by the kernel – system space.

    With that we decided to look into MySQL metrics.

    MySQL

    Thread activity graph in PMM for MySQL

    Queries per second

    As expected, there weren't a lot of threads running, 10 on average, and MySQL wasn't being hammered with questions/transactions. It was running from 500 to 800 QPS (queries per second). The next step was to check the type of workload that was running on the instance:

    All the commands are of a SELECT type, in red in this graph

    In red we can see that almost all commands are SELECTs. With that in mind, we checked the handlers using SHOW STATUS LIKE 'Handler%' to verify whether those selects were doing an index scan, a full table scan, or something else.
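    For instance, a rough way to sample this outside of PMM (a sketch of my own) is to read the counter twice and compare:

    -- Two readings of the full-scan row counter taken ten seconds apart;
    -- the difference divided by ten approximates rows read per second by full table scans.
    SHOW GLOBAL STATUS LIKE 'Handler_read_rnd_next';
    SELECT SLEEP(10);
    SHOW GLOBAL STATUS LIKE 'Handler_read_rnd_next';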

    Showing that the query was a full table scan

    Blue in this graph represents Handler_read_rnd_next, which is the counter MySQL increments every time it reads a row while doing a full table scan. Bingo!!! Around 350 selects were reading 2.5 million rows per second. But wait: why was this causing CPU issues rather than IO issues? If you refer to the first graph (the CPU graph), we cannot see any iowait.

    That is because the data was stored in the InnoDB Buffer Pool, so instead of having to read those 2.5M rows per second from disk, it was fetching them from memory. The stress had moved from disk to CPU. Now that we had identified that the issue was caused by a query or queries, we went to QAN to check the queries and their status:

    identifying the long running query in QAN

    The first query, a SELECT on table store.clients, was responsible for 98% of the load and was executing in 20+ seconds.

    The initial query load

    EXPLAIN confirmed our suspicions. The query was accessing the table using type ALL, which is the last type we want, as it means "Full Table Scan". Taking a look at the fingerprint of the query, we identified that it was a simple query:

    Fingerprint of query
    Indexes on table did not include a key column

    The query was filtering clients based on the status field:

    SELECT * FROM store.clients WHERE status = ?

    As shown in the indexes, that column was not indexed. Talking with the customer, this turned out to be a query that was introduced as part of a new software release.
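    To make the full table scan concrete, here is a hypothetical EXPLAIN sketch (the literal status value is made up; table and column names follow the post's anonymized example):

    EXPLAIN SELECT * FROM store.clients WHERE status = 'active'\G
    -- Before the index: type: ALL, key: NULL, rows in the millions (full table scan).
    -- After ADD KEY (status): type: ref, key: status, rows limited to the matching clients.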

    From that point, we were confident that we had identified the problem. There could be more, but this particular query was definitely hurting the performance of the server. We decided to add an index, and also sent an annotation to PMM so we could refer back to the graphs to check when the index had been added, whether CPU usage had dropped, and how Handler_read_rnd_next behaved.

    To run the alter we decided to use pt-online-schema-change, as it was a busy table and the tool has safeguards to prevent the situation from becoming even worse. For example, we wanted to pause or even abort the alter in case the number of Threads_Running exceeded a certain threshold. The threshold is controlled by --max-load (25 by default) and --critical-load (50 by default):

    pmm-admin annotate "Started ALTER store.clients ADD KEY (status)" && \
    pt-online-schema-change --alter "ADD KEY (status)" --execute u=root,D=store,t=clients && \
    pmm-admin annotate "Finished ALTER store.clients ADD KEY (status)"
    Your annotation was successfully posted.
    No slaves found. See --recursion-method if host localhost.localdomain has slaves.
    Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
    Operation, tries, wait:
    analyze_table, 10, 1
    copy_rows, 10, 0.25
    create_triggers, 10, 1
    drop_triggers, 10, 1
    swap_tables, 10, 1
    update_foreign_keys, 10, 1
    Altering `store`.`clients`...
    Creating new table...
    Created new table store._clients_new OK.
    Altering new table...
    Altered `store`.`_clients_new` OK.
    2019-02-22T18:26:25 Creating triggers...
    2019-02-22T18:27:14 Created triggers OK.
    2019-02-22T18:27:14 Copying approximately 4924071 rows...
    Copying `store`.`clients`: 7% 05:46 remain
    Copying `store`.`clients`: 14% 05:47 remain
    Copying `store`.`clients`: 22% 05:07 remain
    Copying `store`.`clients`: 30% 04:29 remain
    Copying `store`.`clients`: 38% 03:59 remain
    Copying `store`.`clients`: 45% 03:33 remain
    Copying `store`.`clients`: 52% 03:06 remain
    Copying `store`.`clients`: 59% 02:44 remain
    Copying `store`.`clients`: 66% 02:17 remain
    Copying `store`.`clients`: 73% 01:50 remain
    Copying `store`.`clients`: 79% 01:23 remain
    Copying `store`.`clients`: 87% 00:53 remain
    Copying `store`.`clients`: 94% 00:24 remain
    2019-02-22T18:34:15 Copied rows OK.
    2019-02-22T18:34:15 Analyzing new table...
    2019-02-22T18:34:15 Swapping tables...
    2019-02-22T18:34:27 Swapped original and new tables OK.
    2019-02-22T18:34:27 Dropping old table...
    2019-02-22T18:34:32 Dropped old table `store`.`_clients_old` OK.
    2019-02-22T18:34:32 Dropping triggers...
    2019-02-22T18:34:32 Dropped triggers OK.
    Successfully altered `store`.`clients`.
    Your annotation was successfully posted.

    Results

    MySQL Handlers after query tuning MySQL query throughput after query tuning
    Query analysis by EXPLAIN in PMM after tuning

As we can see, above, CPU usage dropped to less than 25%, which is 1/4 of the previous usage level. Handler_read_rnd_next dropped and we can’t even see it once pt-osc has finished. We had a small increase on Handler_read_next as expected, because now MySQL is using the index to resolve the WHERE clause. One interesting outcome is that the instance was able to increase its QPS by 2x after the index was added, as CPU/Full Table Scan was no longer limiting performance. On average, query time dropped from 20s to only 661ms.

    Summary:

    1. Applying the correct troubleshooting steps to your problems is crucial:
      a) Understand what resources have been saturated.
  b) Understand what, if anything, is causing an error.
      c) From there you can divert into the areas that are related to that resource and start to narrow down the issue.
      d) Tackle the problems bit by bit.
    2. Having the right tools for the job is key to success. PMM is a great example of a tool that can help you quickly identify, drill into, and fix bottlenecks.
    3. Have realistic load tests. In this case, the new release had been tested at a concurrency level that was not representative of production.
    4. By identifying the culprit query we were able to:
      a.) Drop average query time from 20s to 661ms
      b.) Increase QPS by 2x
      c.) Reduce the usage of CPU to 1/4 of its level prior to our intervention

    Disclosure: For security reasons, sensitive information, such as database, table, column names have been modified and graphs recreated to simulate a similar problem.

    by Marcelo Altmann at March 07, 2019 03:17 PM

    March 06, 2019

    Peter Zaitsev

    Settling the Myth of Transparent HugePages for Databases

    The concept of Linux HugePages has existed for quite a while: for more than 10 years, introduced to Debian in 2007 with kernel version 2.6.23. Whilst a smaller page size is useful for general use, some memory intensive applications may gain performance by using bigger memory pages. By having bigger memory chunks available to them, they can reduce lookup time as well as improve the performance of read/write operations. To be able to make use of HugePages, applications need to carry the specific code directive, and changing applications across the board is not necessarily a simple task. So enter Transparent HugePages (THP).

    By reputation, THPs are said to have a negative impact on performance. For this post, I set out to either prove or debunk the case for the use of THPs for database applications.

    The Linux context

    On Linux – and for that matter all operating systems that I know of – memory is divided into small chunks called pages. A typical memory page size is set to 4k. You can obtain the value of page size on Linux using getconf.

    # getconf PAGE_SIZE
    4096

Generally, the latest processors support multiple page sizes. However, Linux defaults to a minimal 4k page size. For a system with 64GB of physical memory, this memory will be divided into more than 16 million pages. Resolving the mapping between these pages and physical memory (which is called page table walking) is undertaken by the CPU’s memory management unit (MMU). To optimize page lookups, the CPU maintains a cache of recently used translations called the Translation Lookaside Buffer (TLB). The higher the number of pages, the lower the percentage of pages that are maintained in the TLB. This translates to a higher cache miss ratio. With every cache miss, a more expensive search must be done via page table walking. In effect, that leads to a degradation in performance.
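
One way to observe this page-walk pressure directly is with perf; a sketch only, since the event names vary by CPU and perf version, and the PID and duration below are examples:

perf stat -e dTLB-loads,dTLB-load-misses -p 12345 -- sleep 60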

    So what if we could increase the page size? We could then reduce the number of pages accessed, and reduce the cost of page walking. Cache hit ratio might then improve because more relevant data now fits in one page rather than multiple pages.

    The Linux kernel will always try to allocate a HugePage (if enabled) and will fall back to the default 4K if a contiguous chunk of the required memory size is not available in the required memory space.

    The implication for applications

    As mentioned, for an application to make use of HugePages it has to contain an explicit instruction to do so. It’s not always practical to change applications in this way so there’s another option.

    Transparent HugePages provides a layer within the Linux kernel – probably since version 2.6.38 – which if enabled can potentially allocate HugePages for applications without them actually “knowing” it; hence the transparency. The expectation is that this will improve application performance.

    In this blog, I’ll attempt to find the reasons why THP might help improve database performance. There’s a lot of discussion amongst database experts that classic HugePages give a performance gain, but you’ll see a performance hit with Transparent HugePages. I decided to take up the challenge and perform various benchmarks, with different settings, and with different workloads.

    So do Transparent HugePages (THP) improve application performance? More specifically, do they improve performance for database workloads? Most industry standard databases recommend disabling THP and enabling HugePages alone.

    So is this a myth or does THP degrade performance for databases? Time to break this myth.

    Enabling THP

    The current setting can be seen using the command line

    # cat /sys/kernel/mm/transparent_hugepage/enabled
    [always] madvise never

    Temporary Change

    It can be enabled or disabled using the command line.

    # echo never > /sys/kernel/mm/transparent_hugepage/enabled

    Permanent Change via grub

Or permanently, by setting the appropriate kernel boot parameter in /etc/default/grub.
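
A minimal sketch of the permanent change on Ubuntu/Debian, assuming the standard transparent_hugepage kernel parameter; other distributions regenerate the grub config with grub2-mkconfig instead of update-grub:

# in /etc/default/grub, append the parameter, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="... transparent_hugepage=never"
sudo update-grub
sudo reboot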

You can choose one of three configurations for THP: always, never, or madvise. Whilst the always and never options are self-explanatory, madvise allows only applications that are optimized for HugePages to use THP. Applications can request Transparent HugePages by making the madvise system call.

    Why was the madvise option added? We will discuss that in a later section.

    Transparent HugePages problems

    The khugepaged CPU usage

    The allocation of a HugePage can be tricky. Whilst traditional HugePages are reserved in virtual memory, THPs are not. In the background, the kernel attempts to allocate a THP, and if it fails, will default to the standard 4k page. This all happens transparently to the user.

    The allocation process can potentially involve a number of kernel processes which may include kswapd, defrag, and kcompactd. All of these are responsible for making space in the virtual memory for a future THP. When required, the allocation is made by another kernel process; khugepaged. This process manages Transparent HugePages.

    Spikes

    It depends on how khugepaged is configured, but since no memory is reserved beforehand, there is potential for performance degradation. With every attempt to allocate a HugePage, potentially a number of kernel processes are invoked. These carry out certain actions to make enough room in the virtual memory for a THP allocation. Although no notifications are provided to the application, precious resources are spent, and this can lead to spikes in performance with any dips indicating an attempt to allocate THP.

    Memory Bloating

HugePages are not for every application. For example, an application that wants to allocate only one byte of data would be better off using a 4k page rather than a huge one. That way, memory is more efficiently used. To prevent this, one option is to configure THP to “madvise”. By doing this, HugePages are disabled system-wide but are available to applications that make a madvise call to allocate THP in the madvise memory region.

    Swapping

The Linux kernel keeps track of memory pages and differentiates between pages that are actively being used and the ones that are not immediately required. It may unload a page from active memory to disk if that page is no longer required, or load it back when it is needed again.

When the page size is 4k, these memory operations are understandably fast. However, consider a 1GB page size: there will be a significant performance hit when such a page is swapped out. When a THP is swapped out, it is split into standard page sizes. Unlike conventional HugePages, which are reserved in RAM and are never swapped, THPs are swappable pages. They could, therefore, potentially be swapped, causing a dip in performance. Although in recent years there have been many performance improvements around the process of swapping out THPs, it still impacts performance negatively.
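
A few quick ways to see whether THP is actually in play on a running system; these are standard kernel interfaces, though the exact counters exposed vary by kernel version:

grep AnonHugePages /proc/meminfo      # anonymous memory currently backed by huge pages
grep thp_ /proc/vmstat                # THP allocation, collapse and split activity
ps -o pid,etime,comm -C khugepaged    # the background collapse daemon mentioned above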

    Benchmark

    I decided to benchmark with and without Transparent HugePages enabled. Initially, I used pgbench – a PostgreSQL benchmarking tool based on TPCB – for a duration of ten minutes. The benchmark used a mixed mode of READ/WRITE. The results with and without the Transparent HugePages show no degradation or improvement in the benchmark. To be sure, I repeated the same benchmark for 60 minutes and got almost the same results.  I performed another benchmark with a TPCC workload using the sysbench benchmarking tool. The results are almost the same.
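
A hedged sketch of the kind of pgbench run described above; the scale factor, client count and duration below are illustrative only and were chosen per database size in the actual tests:

pgbench -i -s 3000 pgbench            # initialize test data (size grows with -s)
pgbench -c 64 -j 32 -T 600 pgbench    # mixed read/write TPC-B-like run for 10 minutes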

    Benchmark Machine

    • Supermicro server:
      • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
      • 2 sockets / 28 cores / 56 threads
      • Memory: 256GB of RAM
      • Storage: SAMSUNG  SM863 1.9TB Enterprise SSD
      • Filesystem: ext4/xfs
    • OS: Linux smblade01 4.15.0-42-generic #45~16.04.1-Ubuntu
    • PostgreSQL: version 11

    Benchmark TPCB (pgbench) – 10 Minute duration

    The following graphs show results for two different database sizes; 48GB and 112GB with 64, 128 and 256 clients each. All other settings were kept unchanged for these benchmarks to ensure that our results are comparable. It is evident that both lines — representing execution with or without THP — are almost overlapping one another. This suggests no performance gains.

Figure 1.1 PostgreSQL’s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 1.2 PostgreSQL’s Benchmark, 10 minutes execution time where database workload (112GB) > shared_buffer (64GB)

Figure 1.3 PostgreSQL’s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB) - dTLB-misses

Figure 1.4 PostgreSQL’s Benchmark, 10 minutes execution time where database workload (112GB) > shared_buffer (64GB) - dTLB-misses

    Benchmark TPCB (pgbench) – 60 Minute duration

Figure 2.1 PostgreSQL’s Benchmark, 60 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 2.2 PostgreSQL’s Benchmark, 60 minutes execution time where database workload (112GB) > shared_buffer (64GB)

Figure 2.3 PostgreSQL’s Benchmark, 60 minutes execution time where database workload (48GB) < shared_buffer (64GB) - dTLB-misses

Figure 2.4 PostgreSQL’s Benchmark, 60 minutes execution time where database workload (112GB) > shared_buffer (64GB) - dTLB-misses

Benchmark TPCC (sysbench) – 10 Minute duration

Figure 3.1 PostgreSQL’s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 3.2 PostgreSQL’s Benchmark, 10 minutes execution time where database workload (112GB) > shared_buffer (64GB)

Figure 3.3 PostgreSQL’s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB) - dTLB-misses

Figure 3.4 PostgreSQL’s Benchmark, 10 minutes execution time where database workload (112GB) > shared_buffer (64GB) - dTLB-misses

    Conclusion

I attained these results by running different benchmarking tools and evaluating different OLTP benchmarking standards. The results clearly indicate that for these workloads, THP has a negative impact on overall database performance. Although the performance degradation is negligible, it is clear that there is no performance gain to be had, as one might otherwise expect. This is very much in line with the recommendations of the various database vendors, which suggest disabling THP.

    THP may be beneficial for various applications, but it certainly doesn’t give any performance gains when handling an OLTP workload.

    We can safely say that the “myth” is derived from experience and that the rumors are true.

    Summary

    • The complete benchmark data is available at GitHub[1]
• The complete “nmon” reports, which include CPU, memory, and other usage data, can be found at GitHub[2]
    • This whole benchmark is based around OLTP. Watch out for the OLAP benchmark. Maybe THP will have more effect on this type of workload.

    [1] – https://github.com/Percona-Lab-results/THP-POSTGRESQL-2019/blob/master/results.xlsx

    [2] – https://github.com/Percona-Lab-results/THP-POSTGRESQL-2019/tree/master/results

     

     

    by Ibrar Ahmed at March 06, 2019 01:07 PM

    March 05, 2019

    Peter Zaitsev

    Upcoming Webinar Thurs 3/7: Enhancing MySQL Security

Enhancing MySQL Security Webinar

Join Percona Support Engineer Vinicius Grippa as he presents his talk Enhancing MySQL Security on Thursday, March 7th, 2019 at 7:00 AM PST (UTC-8) / 10:00 AM EST (UTC-5).

    Register Now

    Security is always a challenge when it comes to data. What’s more, regulations like GDPR add a whole new layer on top of it, with rules more and more restrictive to access and manipulate data. Join us in this presentation to check security best practices, as well as traditional and new features available for MySQL including features coming with the new MySQL 8.

In this talk, DBAs and sysadmins will walk through the security features available in the OS and MySQL. For instance, these features include:

– OS security
    – SSL
    – ACL
    – TDE
    – Audit Plugin
    – MySQL 8 features (undo, redo and binlog encryption)
    – New caching_sha2_password
    – Roles
    – Password Management
    – FIPS mode

In order to learn more, register for this webinar on Enhancing MySQL Security.

    by Vinicius Grippa at March 05, 2019 09:57 PM

    How to Upgrade Amazon Aurora MySQL from 5.6 to 5.7

    Over time, software evolves and it is important to stay up to date if you want to benefit from new features and performance improvements.  Database engines follow the exact same logic and providers are always careful to provide an easy upgrade path. With MySQL, the mysql_upgrade tool serves that purpose.

    A database upgrade process becomes more challenging in a managed environment like AWS RDS where you don’t have shell access to the database host and don’t have access to the SUPER MySQL privilege. This post is a collaboration between Fattmerchant and Percona following an engagement focused on the upgrade of the Fattmerchant database from Amazon Aurora MySQL 5.6 to Amazon Aurora MySQL 5.7. Jacques Fu, the CTO of Fattmerchant, is the co-author of this post.  Our initial plan was to follow a path laid out previously by others but we had difficulties finding any complete and detailed procedure outlining the steps. At least, with this post, there is now one.

    Issues with the regular upgrade procedure

    How do we normally upgrade a busy production server with minimal downtime?  The simplest solution is to use a slave server with the newer version. Such a procedure has the side benefit of providing a “staging” database server which can be used to test the application with the new version. Basically we need to follow these steps:

    1. Enable replication on the old server
    2. Make a consistent backup
    3. Restore the backup on a second server with the newer database version – it can be a temporary server
    4. Run mysql_upgrade if needed
    5. Configure replication with the old server
6. Test the application against the new version. If the tests include conflicting writes, you may have to jump back to step 3
7. If tests are OK and the new server is in sync, replication-wise, with the old server, stop the application (only for a short while)
    8. Repoint the application to the new server
    9. Reset the slave
    10. Start the application

    If the new server was temporary, you’ll need to repeat most of the steps the other way around, this time starting from the new server and ending on the old one.

What we thought would be a simple task turned out to be much more complicated. We were preparing to upgrade our database from Amazon Aurora MySQL 5.6 to 5.7 when we discovered that there was no option for an in-place upgrade. Unlike a standard AWS RDS MySQL instance (see RDS MySQL upgrade 5.6 to 5.7), at the time of this article you cannot perform an in-place upgrade, or even restore a backup, across the major versions of Amazon Aurora MySQL.

    We initially chose Amazon Aurora for the benefits of the tuning work that AWS provided out of the box, but we realized with any set of pros there comes a list of cons. In this case, the limitations meant that something that should have been straightforward took us off the documented path.

    Our original high-level plan

Since we couldn’t use an RDS snapshot to provision a new Amazon Aurora MySQL 5.7 instance, we had to fall back to the use of a logical backup. The intended steps were:

    1. Backup the Amazon Aurora MySQL 5.6 write node with mysqldump
    2. Spin up an empty Amazon Aurora MySQL 5.7 cluster
    3. Restore the backup
    4. Make the Amazon Aurora MySQL 5.7 write node a slave of the Amazon Aurora MySQL 5.6 write node
    5. Once in sync, transfer the application to the Amazon Aurora MySQL 5.7 cluster

    Even those simple steps proved to be challenging.

    Backup of the Amazon Aurora MySQL 5.6 cluster

    First, the Amazon Aurora MySQL 5.6 write node must generate binary log files. The default cluster parameter group that is generated when creating an Amazon Aurora instance does not enable these settings. Our 5.6 write node was not generating binary log files, so we copied the default cluster parameter group to a new “replication” parameter group and changed the “binlog_format” variable to MIXED.  The parameter is only effective after a reboot, so overnight we rebooted the node. That was a first short downtime.

At that point, we were able to confirm, using “show master status;”, that the write node was indeed generating binlog files. Since our procedure involves a logical backup and restore, we had to make sure the binary log files were kept for long enough. With a regular MySQL server, the variable “expire_logs_days” controls the binary log file retention time. With RDS, you have to use the mysql.rds_set_configuration procedure. We set the retention time to two weeks:

    CALL mysql.rds_set_configuration('binlog retention hours', 336);

    You can confirm the new setting is used with:

    CALL mysql.rds_show_configuration;

For the following step, we needed a mysqldump backup along with its consistent replication coordinates. The --master-data option of mysqldump implies “FLUSH TABLES WITH READ LOCK” while the replication coordinates are read from the server. That flush requires the SUPER privilege, and this privilege is not available in RDS.

Since we wanted to avoid downtime, it was out of the question to pause the application for the time it would take to back up 100GB of data. The solution was to take a snapshot and use it to provision a temporary Amazon Aurora MySQL 5.6 cluster of one node. As part of the creation process, the events tab of the AWS console shows the binary log file and position consistent with the snapshot. It looks like this:

Consistent snapshot replication coordinates

    From there, the temporary cluster is idle so it is easy to back it up with mysqldump. Since our dataset is large we considered the use of MyDumper but the added complexity was not worthwhile for a one time operation. The dump of a large database can take many hours. Essentially we performed:

    mysqldump -h entrypoint-temporary-cluster -u awsrootuser -pxxxx \
     --no-data --single-transaction -R -E -B db1 db2 db3 > schema.sql
    mysqldump -h entrypoint-temporary-cluster -nt --single-transaction \
     -u awsrootuser -pxxxx -B db1 db2 db3 | gzip -1 > dump.sql.gz
    pt-show-grants -h entrypoint-temporary-cluster -u awsrootuser -pxxxx > grants.sql

The schema consists of three databases: db1, db2 and db3. We have not included the mysql schema because it would cause issues with the new 5.7 instance. You’ll see why we dumped the schema and the data separately in the next section.

    Restore to an empty Amazon Aurora MySQL 5.7 cluster

With our backup done, we are ready to spin up a brand new Amazon Aurora MySQL 5.7 cluster and restore the backup. Make sure the new Amazon Aurora MySQL 5.7 cluster is in a subnet with access to the Amazon Aurora MySQL 5.6 production cluster. In our schema, there are a few very large tables with a significant number of secondary keys. To speed up the restore, we removed the secondary indexes of these tables from the schema.sql file and created a restore-indexes.sql file with the list of alter table statements needed to recreate them. Then we restored the data using these steps:

    cat grants.sql | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx
    cat schema-modified.sql | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx
    zcat dump.sql.gz | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx
    cat restore-indexes.sql | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx

    Configure replication

At this point, we have a new Amazon Aurora MySQL 5.7 cluster provisioned with a dataset at known replication coordinates from the Amazon Aurora MySQL 5.6 production cluster. It is now very easy to set up replication. First we need to create a replication user in the Amazon Aurora MySQL 5.6 production cluster:

    GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'repl_user'@'%' identified by 'agoodpassword';

    Then, in the new Amazon Aurora MySQL 5.7 cluster, you configure replication and start it by:

    CALL mysql.rds_set_external_master ('mydbcluster.cluster-123456789012.us-east-1.rds.amazonaws.com', 3306,
      'repl_user', 'agoodpassword', 'mysql-bin-changelog.000018', 65932380, 0);
    CALL mysql.rds_start_replication;

    The endpoint mydbcluster.cluster-123456789012.us-east-1.rds.amazonaws.com points to the Amazon Aurora MySQL 5.6 production cluster.

Now, if everything went well, the new Amazon Aurora MySQL 5.7 cluster will be actively syncing with its master, the current Amazon Aurora MySQL 5.6 production cluster. This process can take a significant amount of time depending on the write load and the type of instance used for the new cluster. You can monitor the progress with the SHOW SLAVE STATUS\G command; the Seconds_Behind_Master value tells you how far behind, in seconds, the new cluster is compared to the old one. It is not a measurement of how long it will take to resync.
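
A quick way to watch the catch-up progress from any client host; the endpoint name is the same placeholder used earlier in this post:

mysql -h entrypoint-new-aurora-57 -u awsroot -p -e "SHOW SLAVE STATUS\G" \
  | egrep 'Master_Log_File|Exec_Master_Log_Pos|Seconds_Behind_Master'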

    You can also monitor throughput using the AWS console. In this screenshot you can see the replication speeding up over time before it peaks when it is completed.

    Replication speed

    Test with Amazon Aurora MySQL 5.7

    At this point, we have an Amazon Aurora MySQL 5.7 cluster in sync with the production Amazon Aurora MySQL 5.6 cluster. Before transferring the production load to the new cluster, you need to test your application with MySQL 5.7. The easiest way is to snapshot the new Amazon Aurora MySQL 5.7 cluster and, using the snapshot, provision a staging Amazon Aurora MySQL 5.7 cluster. Test your application against the staging cluster and, once tested, destroy the staging cluster and any unneeded snapshots.

    Switch production to the Amazon Aurora MySQL 5.7 cluster

    Now that you have tested your application with the staging cluster and are satisfied how it behaves with Amazon Aurora MySQL 5.7, the very last step is to migrate the production load. Here are the last steps you need to follow:

    1. Make sure the Amazon Aurora MySQL 5.7 cluster is still in sync with the Amazon Aurora MySQL 5.6 cluster
    2. Stop the application
3. Validate that the SHOW MASTER STATUS output of the 5.6 cluster is no longer moving
4. Validate from SHOW SLAVE STATUS\G in the 5.7 cluster that Master_Log_File and Exec_Master_Log_Pos match the output of SHOW MASTER STATUS from the 5.6 cluster
    5. Stop the slave in the 5.7 cluster with CALL mysql.rds_stop_replication;
    6. Reset the slave in the 5.7 cluster with CALL mysql.rds_reset_external_master;
    7. Reconfigure the application to use the 5.7 cluster endpoint
    8. Start the application

    The application is down from steps 2 to 8.  Although that might appear to be a long time, these steps can easily be executed within a few minutes.

    Summary

So, in summary, although RDS Aurora doesn’t support an in-place upgrade between Amazon Aurora MySQL 5.6 and 5.7, there is a possible migration path that minimizes downtime. In our case, we were able to limit the downtime to only a few minutes.

    Co-Author: Jacques Fu, Fattmerchant

     

    Jacques is CTO and co-founder at the fintech startup Fattmerchant, author of Time Hacks, and co-founder of the Orlando Devs, the largest developer meetup in Orlando. He has a passion for building products, bringing them to market, and scaling them.

    by Yves Trudeau at March 05, 2019 05:31 PM

    Shlomi Noach

    Un-split brain MySQL via gh-mysql-rewind

    We are pleased to release gh-mysql-rewind, a tool that allows us to move MySQL back in time, automatically identify and rewind split brain changes, restoring a split brain server into a healthy replication chain.

    I recently had the pleasure of presenting gh-mysql-rewind at FOSDEM. Video and slides are available. Consider following along with the video.

    Motivation

    Consider a split brain scenario: a "standard" MySQL replication topology suffered network isolation, and one of the replicas was promoted as new master. Meanwhile, the old master was still receiving writes from co-located apps.

    Once the network isolation is over, we have a new master and an old master, and a split-brain situation: some writes only took place on one master; others only took place on the other. What if we wanted to converge the two? What paths do we have to, say, restore the old, demoted master, as a replica of the newly promoted master?

    The old master is unlikely to agree to replicate from the new master. Changes have been made. AUTO_INCREMENT values have been taken. UNIQUE constraints will fail.

    A few months ago, we at GitHub had exactly this scenario. An entire data center went network isolated. Automation failed over to a 2nd DC. Masters in the isolated DC meanwhile kept receiving writes. At the end of the failover we ended up with a split brain scenario - which we expected. However, an additional, unexpected constraint forced us to fail back to the original DC.

    We had to make a choice: we've already operated for a long time in the 2nd DC and took many writes, that we were unwilling to lose. We were OK to lose (after auditing) the few seconds of writes on the isolated DC. But, how do we converge the data?

    Backups are the trivial way out, but they incur long recovery time. Shipping backup data over the network for dozens of servers takes time. Restore time, catching up with changes since backup took place, warming up the servers so that they can handle production traffic, all take time.

Could we have reduced the time to recovery?

    There are multiple ways to do that: local backups, local delayed replicas, snapshots... We have embarked on several. In this post I wish to outline gh-mysql-rewind, which programmatically identifies the rogue (aka "bad") transactions on the network isolated master, rewinds/reverts them, applies some bookkeeping and restores the demoted master as a healthy replica under the newly promoted master, thereby prepared to be promoted if needed.

    General overview

gh-mysql-rewind is a shell script. It utilizes multiple technologies, some of which do not speak to each other, to be able to do the magic. It assumes and utilizes the following: MySQL GTIDs, Row Based Replication with FULL row image, and MariaDB's mysqlbinlog with its flashback capability.

Some breakdown follows.

    GTID

    MySQL GTIDs keep track of all transactions executed on a given server. GTIDs indicate which server (UUID) originated a write, and ranges of transaction sequences. In a clean state, only one writer will generate GTIDs, and on all the replicas we would see the same GTID set, originated with the writer's UUID.

    In a split brain scenario, we would see divergence. It is possible to use GTID_SUBTRACT(old_master-GTIDs, new-master-GTIDs) to identify the exact set of transactions executed on the old, demoted master, right after the failover. This is the essence of the split brain.

    For example, assume that just before the network partition, GTID on the master was 00020192-1111-1111-1111-111111111111:1-5000. Assume after the network partition the new master has UUID of 00020193-2222-2222-2222-222222222222. It began to take writes, and after some time its GTID set showed 00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200.

    On the demoted master, other writes took place, leading to the GTID set 00020192-1111-1111-1111-111111111111:1-5042.

    We will run...

    SELECT GTID_SUBTRACT(
      '00020192-1111-1111-1111-111111111111:1-5042',
      '00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200'
    );
    
    > '00020192-1111-1111-1111-111111111111:5001-5042'
    

    ...to identify the exact set of "bad transactions" on the demoted master.

    Row Based Replication

    With row based replication, and with FULL image format, each DML (INSERT, UPDATE, DELETE) writes to the binary log the complete row data before and after the operation. This means the binary log has enough information for us to revert the operation.
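
The binary log settings this relies on are ordinary MySQL variables; a minimal sketch, shown here as runtime changes although they would normally live in my.cnf:

mysql -e "SET GLOBAL binlog_format = 'ROW'"
mysql -e "SET GLOBAL binlog_row_image = 'FULL'"   # FULL is the default row image
mysql -e "SHOW VARIABLES LIKE 'binlog_row_image'"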

    Flashback

    Developed by Alibaba, flashback has been incorporated in MariaDB. MariaDB's mysqlbinlog utility supports a --flashback flag, which interprets the binary log in a special way. Instead of printing out the events in the binary log in order, it prints the inverted operations in reverse order.

    To illustrate, let's assume this pseudo-code sequence of events in the binary log:

    insert(1, 'a')
    insert(2, 'b')
    insert(3, 'c')
    update(2, 'b')->(2, 'second')
    update(3, 'c')->(3, 'third')
    insert(4, 'd')
    delete(1, 'a')
    

    A --flashback of this binary log would produce:

    insert(1, 'a')
    delete(4, 'd')
    update(3, 'third')->(3, 'c')
    update(2, 'second')->(2, 'b')
    delete(3, 'c')
    delete(2, 'b')
    delete(1, 'a')
    

    Alas, MariaDB and flashback do not speak MySQL GTID language. GTIDs are one of the major points where MySQL and MariaDB have diverged beyond compatibility.

    The output of MariaDB's mysqlbinlog --flashback has neither any mention of GTIDs, nor does the tool take notice of GTIDs in the binary logs in the first place.
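
For orientation, the flashback step on its own is just a mysqlbinlog invocation; a sketch assuming MariaDB's build of the tool is installed alongside MySQL's, with the path, file name and position as placeholders:

/opt/mariadb/bin/mysqlbinlog --flashback --start-position=4242 mysql-bin.000123 > rewind.sql
mysql < rewind.sql
# note: this alone does none of the GTID bookkeeping that gh-mysql-rewind adds on top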

    gh-mysql-rewind

    This is where we step in. GTIDs provide the information about what went wrong. flashback has the mechanism to generate the reverse sequence of statements. gh-mysql-rewind:

    • uses GTIDs to detect what went wrong
    • correlates those GTID entries with binary log files: identifies which binary logs actually contain those GTID events
    • invokes MariaDB's mysqlbinlog --flashback to generate the reverse of those binary logs
    • injects (dummy) GTID information into the output
    • computes ETA

This last part is worth elaborating. We have created a time machine. We have the mechanics to make it work. But as any Sci-Fi fan knows, one of the most important parts of time travel is knowing ahead of time where (when) you are going to land. Are you back in the Renaissance? Or are you suddenly going to appear in the middle of the French Revolution? Better dress accordingly.

    In our scenario it is not enough to move MySQL back in time to some consistent state. We want to know at what time we landed, so that we can instruct the rewinded server to join the replication chain as a healthy replica. In MySQL terms, we need to make MySQL "forget" everything that ever happened after the split brain: not only in terms of data (which we already did), but in terms of GTID history.

gh-mysql-rewind will do the math to project, ahead of time, at what "time" (i.e. GTID set) our time machine will arrive. It will issue a RESET MASTER; SET GLOBAL gtid_purged='gtid-of-the-landing-time' to make our rewound MySQL consistent not only with some past dataset, but also with its own perception of the point in time where that dataset existed.

    Limitations

    Some limitations are due to MariaDB's incompatibility with MySQL, some are due to MySQL DDL nature, some due to the fact gh-mysql-rewind is a shell script.

    • Cannot rewind DDL. DDLs are silently ignored, and will impose a problem when trying to re-apply them.
    • JSON, POINT data types are not supported.
• The logic rewinds the MySQL server farther into the past than strictly required. This simplifies the code considerably, but imposes superfluous time to rewind and reapply, i.e. time to recover.
    • Currently, this only works one server at a time. If a group of 10 servers were network isolated together, the operation would need to run on each of these 10 servers.
    • Runs locally on each server. Requires both MySQL's mysqlbinlog as well as MariaDB's mysqlbinlog.

    Testing

There are a lot of moving parts to this mechanism. A mixture of technologies that don't normally speak to each other, injection of data, prediction of ETA... How reliable is all this?

We run continuous gh-mysql-rewind testing in production to consistently prove that it works as expected. Our testing uses a non-production, dedicated, functional replica. It contaminates the data on the replica, lets gh-mysql-rewind automatically move it back in time, then joins the replica back into the healthy chain.

    That's not enough. We actually create a scenario where we can predict, ahead of testing, what the time-of-arrival will be. We checksum the data on that replica at that time. After contaminating and effectively breaking replication, we expect gh-mysql-rewind to revert the changes back to our predicted point in time. We checksum the data again. We expect 100% match.

    See the video or slides for more detail on our testing setup.

    Status

At this time the tool is one of several solutions we hope to never need to employ. It is stable and tested. We are looking forward to a promising MySQL development that will provide GTID-revert capabilities using standard commands, such as SELECT undo_transaction('00020192-1111-1111-1111-111111111111:5042').

We have released gh-mysql-rewind as open source, under the MIT license. The public release is a stripped down version of our own script, which has some GitHub-specific integration. We have general ideas about incorporating this functionality into higher-level tools.

    gh-mysql-rewind is developed by the database-infrastructure team at GitHub.

    by shlomi at March 05, 2019 01:51 PM

    March 04, 2019

    Peter Zaitsev

    Percona XtraBackup 8.0.5 Is Now Available

Percona XtraBackup 8.0

    Percona is glad to announce the release of Percona XtraBackup 8.0.5 on March 4, 2019. Downloads are available from our download site and from apt and yum repositories.

    Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, it drives down backup costs while providing unique features for MySQL backups.

Percona XtraBackup 8.0.5 introduces support for undo tablespaces created using the new syntax (CREATE UNDO TABLESPACE) available since MySQL 8.0.14. Percona XtraBackup also supports the binary log encryption introduced in MySQL 8.0.14.

    Two new options were added to xbstream. Use the --decompress option with xbstream to decompress individual qpress files. With the --decompress-threads option, specify the number of threads to apply when decompressing. Thanks to Rauli Ikonen for this contribution.
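
A hedged sketch of the new xbstream options; the paths and thread count are examples only:

xbstream -x --decompress --decompress-threads=4 -C /data/restore < backup.xbstream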

    This release of Percona XtraBackup is a General Availability release ready for use in a production environment.

    All Percona software is open-source and free.

    Please note the following about this release:

    • The deprecated innobackupex has been removed. Use the xtrabackup command to back up your instances: $ xtrabackup --backup --target-dir=/data/backup
• When migrating from earlier database server versions, back up and restore using Percona XtraBackup 2.4, and then use mysql_upgrade from MySQL 8.0.x
    • If using yum or apt repositories to install Percona XtraBackup 8.0.5, ensure that you have enabled the new tools repository. You can do this with the percona-release enable tools release command and then install the percona-xtrabackup-80 package.

    New Features

    • PXB-1548: Percona XtraBackup enables updating the ib_buffer_pool file with the latest pages present in the buffer pool using the --dump-innodb-buffer-pool option. Thanks to Marcelo Altmann for contribution.
    • PXB-1768: Added support for undo tablespaces created with the new MySQL 8.0.14 syntax.
    • PXB-1781: Added support for binary log encryption introduced in MySQL 8.0.14.
    • PXB-1797: For xbstream, two new options were added. The --decompress option enables xbstream to decompress individual qpress files. The --decompress-threads option controls the number of threads to apply when decompressing. Thanks to Rauli Ikonen for this contribution.

    Bugs Fixed

    • Using --lock-ddl-per-table caused the server to scan all records of partitioned tables which could lead to the “out of memory” error. Bugs fixed PXB-1691 and PXB-1698.
• When Percona XtraBackup was run with the --slave-info option, incorrect coordinates were written to the xtrabackup_slave_info file. Bug fixed PXB-1737.
    • Percona XtraBackup could crash at the prepare stage when making an incremental backup if the variable innodb-rollback-segments was changed after starting the MySQL Server. Bug fixed PXB-1785.
    • The full backup could fail when Percona Server was started with the --innodb-encrypt-tables parameter. Bug fixed PXB-1793.

Other bugs fixed: PXB-1632, PXB-1715, PXB-1770, PXB-1771, PXB-1773.

    by Borys Belinsky at March 04, 2019 07:16 PM

    Upcoming Webinar Wed 3/6: High Availability and Disaster Recovery in Amazon RDS

High Availability and Disaster Recovery in Amazon RDS Webinar

Join Percona CEO Peter Zaitsev as he presents High Availability and Disaster Recovery in Amazon RDS on Wednesday, March 6th, 2019, at 11:00 AM PST (UTC-8) / 2:00 PM EST (UTC-5).

    Register Now

    In this hour-long webinar, Peter describes the differences between high availability (HA) and disaster recovery (DR). Afterward, Peter will go through scenarios detailing how each is handled manually and in Amazon RDS.

    He will review the pros and cons of managing HA and DR in the traditional database environment as well in the cloud. Having full control of these areas is daunting. However, Amazon RDS makes meeting these needs easier and more efficient.

    Regardless of which path you choose, monitoring your environment is vital. Peter’s talk will make that message clear. A discussion of metrics you should regularly review to keep your environment working correctly and performing optimally concludes the webinar.

In order to learn more, register for Peter’s webinar on High Availability and Disaster Recovery in Amazon RDS.

    by Peter Zaitsev at March 04, 2019 04:14 PM

    PostgreSQL Webinar Wed April 17 – Upgrading or Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions

A date for your diary. On Wednesday, April 17 at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4), Percona’s PostgreSQL Support Technical Lead, Avinash Vallarapu, and Senior Support Engineers Fernando Laudares, Jobin Augustine and Nickolay Ihalainen will demonstrate the upgrade of a legacy version of PostgreSQL to a newer version, using built-in as well as open source tools. In the lead up to the live webinar, we’ll be publishing a series of five blog posts that will help you to understand the solutions available to perform a PostgreSQL upgrade.

    Register Now

    Synopsis

Are you stuck with an application that is using an older version of PostgreSQL which is no longer supported? Are you looking for the methods available to upgrade a legacy PostgreSQL cluster (< PostgreSQL 9.3)? Are you searching for solutions that could upgrade your PostgreSQL with minimal downtime? Are you afraid that your application may not work with the latest PostgreSQL versions because it was built on a legacy version a few years ago? Do you want to confirm that you are doing your PostgreSQL upgrades the right way? Do you think that you need to buy an enterprise license to minimize the downtime involved in upgrades?

Then we suggest you subscribe to our webinar, which should answer most of your questions about PostgreSQL upgrades.

    This webinar starts with a list of solutions that are built-in to PostgreSQL to help us upgrade a legacy version of PostgreSQL with minimal downtime. The advantages of choosing such methods will also be discussed. You’ll notice a list of prerequisites for each solution, reducing the scope of possible mistakes. It’s important to minimize downtime when upgrading from an older version of PostgreSQL server. Therefore, we will present three open source solutions that will help us either to minimize or to completely avoid downtime.

    Our presentation will show the full process of upgrading a set of PostgreSQL servers to the latest available version. Furthermore, we’ll show the pros and cons for each of the methods we employed.

    The webinar programme

    Topics covered in this webinar will include:

    1. PostgreSQL upgrade using pg_dump/pg_restore (with downtime)
    2. PostgreSQL upgrade using pg_dumpall (with downtime)
    3. Continuous replication from a legacy PostgreSQL version to a newer version using Slony.
    4. Replication between major PostgreSQL versions using Logical Replication.
    5. Fast upgrade of legacy PostgreSQL with minimum downtime.

    In the 45 minute session, we’ll walk you through the methods and demonstrate some of the methods you may find useful in your database environment. We’ll see how simple and quick it is to perform the upgrade using our approach.

    Register Now


    Image adapted from Photo by Magda Ehlers from Pexels

    by Avinash Vallarapu at March 04, 2019 02:34 PM

    March 01, 2019

    Oli Sennhauser

    MariaDB and MySQL consulting by plane

Since January 2019, FromDual has also been trying to actively contribute a little bit to the fight against global warming.

The best thing for the climate would be to NOT travel to the customer at all! For these cases we have our FromDual remote-DBA services for MariaDB and MySQL.

But sometimes customers want or need us on-site for our FromDual in-house trainings or our FromDual on-site consulting engagements. In these cases we try to travel by train. Travelling by train is, after walking or travelling by bicycle, the most climate-friendly way to travel.


But some customers are located more than 7 to 8 hours away by train. For these customers we have to take the plane, which is not good for the climate at all. But at least we compensate for our CO2 emissions via MyClimate.org.


    by Shinguz at March 01, 2019 02:27 PM

    February 28, 2019

    Peter Zaitsev

    Percona XtraDB Cluster 5.6.43-28.32 Is Now Available


    Percona is glad to announce the release of Percona XtraDB Cluster 5.6.43-28.32 on February 28, 2019. Binaries are available from the downloads section or from our software repositories.

    This release of Percona XtraDB Cluster includes the support of Ubuntu 18.10 (Cosmic Cuttlefish). Percona XtraDB Cluster 5.6.43-28.32 is now the current release, based on the following:

    All Percona software is open-source and free.

    Bugs Fixed

    • PXC-2388: In some cases, DROP FUNCTION function_name was not replicated.

    Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

    by Borys Belinsky at February 28, 2019 09:24 PM

    Percona XtraDB Cluster 5.7.25-31.35 Is Now Available

Percona XtraDB Cluster 5.7

Percona is glad to announce the release of Percona XtraDB Cluster 5.7.25-31.35 on February 28, 2019. Binaries are available from the downloads section or from our software repositories.

    This release of Percona XtraDB Cluster includes the support of Ubuntu 18.10 (Cosmic Cuttlefish). Percona XtraDB Cluster 5.7.25-31.35 is now the current release, based on the following:

    All Percona software is open-source and free.

    Bugs Fixed

• PXC-2346: mysqld could crash when executing mysqldump --single-transaction while the binary log is disabled. This problem was also reported in PXC-1711, PXC-2371, PXC-2419.
    • PXC-2388: In some cases, DROP FUNCTION function_name was not replicated.

    Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

    by Borys Belinsky at February 28, 2019 08:56 PM

    Percona Server for MongoDB 4.0.6-3 Is Now Available

Percona Server for MongoDB

    Percona announces the release of Percona Server for MongoDB 4.0.6-3 on February 28, 2019. Download the latest version from the Percona website or the Percona software repositories.

    Percona Server for MongoDB is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 4.0 Community Edition. It supports MongoDB 4.0 protocols and drivers.

Percona Server for MongoDB extends the functionality of the MongoDB 4.0 Community Edition by including the Percona Memory Engine storage engine, encrypted WiredTiger storage engine, audit logging, SASL authentication, hot backups, and enhanced query profiling. Percona Server for MongoDB requires no changes to MongoDB applications or code.

Release 4.0.6-3 extends the buildInfo command with the psmdbVersion key to report the version of Percona Server for MongoDB. If this key exists then Percona Server for MongoDB is installed on the server. This key is not available from MongoDB.

This release includes all features of MongoDB Community Edition 4.0. Most notable among these are:

Note that the MMAPv1 storage engine is deprecated in MongoDB Community Edition 4.0.

    Improvements

    • PSMDB-216: The database command buildInfo provides the psmdbVersion key to report the version of Percona Server for MongoDB. If this key exists then Percona Server for MongoDB is installed on the server. This key is not available from MongoDB.

    The Percona Server for MongoDB 4.0.6-3 release notes are available in the official documentation.

    by Borys Belinsky at February 28, 2019 05:08 PM

    MySQL 8.0 Bug 94394, Fixed!

MySQL optimizer bugs

Last week I came across a bug in MySQL 8.0, "absence of mysql.user leads to auto-apply of --skip-grant-tables" (#94394), which would leave MySQL running in an undesirable state. My colleague Sveta Smirnova blogged about the issue and it also caught the interest of Valeriy Kravchuk in Fun with Bugs #80 – On MySQL Bug Reports I am Subscribed to, Part XVI. Thanks for the extra visibility!

    Credit is now due to Oracle for the quick response, as it was fixed in less than one week (including a weekend):

    Fixed in 8.0.16.

Previously, if the grant tables were corrupted, the MySQL server
wrote a message to the error log but continued as if the
--skip-grant-tables option had been specified. This resulted in the
server operating in an unexpected state unless --skip-grant-tables
had in fact been specified. Now, the server stops after writing a
message to the error log unless started with --skip-grant-tables.
(Starting the server with that option enables you to connect to
perform diagnostic operations.)

    I think that this particular bug reflects some of the nice things about the MySQL community (and Open Source in general); anyone can find and report a bug, or make a feature request, to one of the software vendors (MySQL, Percona, or MariaDB) and try to improve the software. Sometimes bugs hang around for a while, either because they are hard to fix, viewed as lower in priority (despite the reporter’s opinion), or perhaps the bug does not have enough public visibility. Then a member of the community notices the bug and takes an interest and soon there is more interest. If you are lucky the bug gets fixed quickly! You can of course also provide a fix for the bug yourself, which may speed up the process with a little luck.

    If you have not yet reported a bug, or want to find if you are reporting them in the right sort of way then you can take a look at How to create a useful MySQL bug report…and make sure it’s properly processed by Valeriy from FOSDEM 2019.

    🐛 🐛 🐛 You can help to find more!

    by Ceri Williams at February 28, 2019 03:05 PM

    February 27, 2019

    Peter Zaitsev

    Charset and Collation Settings Impact on MySQL Performance


Following my post MySQL 8 is not always faster than MySQL 5.7, this time I decided to test very simple read-only, CPU intensive workloads, where all data fits in memory. In this workload there are NO IO operations, only memory and CPU operations.

    My Testing Setup

    Environment specification

    • Release | Ubuntu 18.04 LTS (bionic)
    • Kernel | 4.15.0-20-generic
    • Processors | physical = 2, cores = 28, virtual = 56, hyperthreading = yes
• Models | 56x Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
    • Memory Total | 376.6G
    • Provider | packet.net x2.xlarge.x86 instance

I will test two workloads, sysbench oltp_read_only and oltp_point_select, varying the number of threads:

    sysbench oltp_read_only --mysql-ssl=off --report-interval=1 --time=300 --threads=$i --tables=10 --table-size=10000000 --mysql-user=root run

    sysbench oltp_point_select --mysql-ssl=off --report-interval=1 --time=300 --threads=$i --tables=10 --table-size=10000000 --mysql-user=root run

    The results for OLTP read-only (latin1 character set):

threads   MySQL 5.7.25 throughput   MySQL 8.0.15 throughput   throughput ratio (5.7/8.0)
1         1241.18                   1114.4                    1.11
4         4578.18                   4106.69                   1.11
16        15763.64                  14303.54                  1.10
24        21384.57                  19472.89                  1.10
32        25081.17                  22897.04                  1.10
48        32363.27                  29600.26                  1.09
64        39629.09                  35585.88                  1.11
128       38448.23                  34718.42                  1.11
256       36306.44                  32798.12                  1.11

    The results for point_select (latin1 character set):

threads   MySQL 5.7.25 throughput   MySQL 8.0.15 throughput   throughput ratio (5.7/8.0)
1         31672.52                  28344.25                  1.12
4         110650.7                  98296.46                  1.13
16        390165.41                 347026.49                 1.12
24        534454.55                 474024.56                 1.13
32        620402.74                 554524.73                 1.12
48        806367.3                  718350.87                 1.12
64        1120586.03                972366.59                 1.15
128       1108638.47                960015.17                 1.15
256       1038166.63                891470.11                 1.16

    We can see that in the OLTP read-only workload, MySQL 8.0.15 is slower by 10%, and for the point_select workload MySQL 8.0.15 is slower by 12-16%.

    Although the difference is not necessarily significant, this is enough to reveal that MySQL 8.0.15 does not perform as well as MySQL 5.7.25 in the variety of workloads that I am testing.

    However, it appears that the dynamic of the results will change if we use the utf8mb4 character set instead of latin1.

Let’s compare MySQL 5.7.25 latin1 vs utf8mb4, as utf8mb4 is now the default CHARSET in MySQL 8.0.

But before we do that, let’s also take a look at COLLATION.

MySQL 5.7.25 uses the default collation utf8mb4_general_ci for utf8mb4. However, I read that to get proper sorting and comparison for Eastern European languages, you may want to use the utf8mb4_unicode_ci collation. For MySQL 8.0 the default collation is utf8mb4_0900_ai_ci.
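
Before comparing, it is worth checking what the server is actually using; the query below is standard, and the my.cnf lines are a sketch of how the benchmark schema can be forced onto a given charset and collation before the sysbench tables are created:

mysql -e "SELECT @@character_set_server, @@collation_server"

# my.cnf sketch (values are examples):
#   [mysqld]
#   character-set-server = utf8mb4
#   collation-server     = utf8mb4_0900_ai_ci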

    So let’s compare each version latin1 vs utf8mb4 (with default collation). First 5.7:

Threads   utf8mb4_general_ci   latin1     latin1 ratio
4         2957.99              4578.18    1.55
24        13792.55             21384.57   1.55
64        24516.99             39629.09   1.62
128       23977.07             38448.23   1.60

    So here we can see that utf8mb4 in MySQL 5.7 is really much slower than latin1 (by 55-60%)

    And the same for MySQL 8.0.15

Threads   utf8mb4_0900_ai_ci (default)   latin1     latin1 ratio
4         3968.88                        4106.69    1.03
24        18446.19                       19472.89   1.06
64        32776.35                       35585.88   1.09
128       31301.75                       34718.42   1.11

    For MySQL 8.0 the hit from utf8mb4 is much lower (up to 11%)

    Now let’s compare all collations for utf8mb4

    For MySQL 5.7

    Threads   utf8mb4_general_ci (default)   utf8mb4_bin   utf8mb4_unicode_ci   utf8mb4_unicode_520_ci
    4         2957.99                        3328.8        2157.61              1942.78
    24        13792.55                       15857.29      9989.96              9095.17
    64        24516.99                       28125.16      16207.26             14768.64
    128       23977.07                       27410.94      15970.6              14560.6

    If you plan to use utf8mb4_unicode_ci, you will take an even bigger performance hit (compared to utf8mb4_general_ci).

    And for MySQL 8.0.15

    Threads   utf8mb4_general_ci   utf8mb4_bin   utf8mb4_unicode_ci   utf8mb4_0900_ai_ci (default)
    4         3461.8               3628.01       3363.7               3968.88
    24        16327.45             17136.16      15740.83             18446.19
    64        28960.62             30390.29      27242.72             32776.35
    128       27967.25             29256.89      26489.83             31301.75

    So now let’s compare MySQL 8.0 vs MySQL 5.7 in utf8mb4 with default collations:

    Threads   MySQL 8.0 (utf8mb4_0900_ai_ci)   MySQL 5.7 (utf8mb4_general_ci)   Ratio (8.0 / 5.7)
    4         3968.88                          2957.99                          1.34
    24        18446.19                         13792.55                         1.34
    64        32776.35                         24516.99                         1.34
    128       31301.75                         23977.07                         1.31

    So there we are. In this case, MySQL 8.0 is actually better than MySQL 5.7 by 34%.

    Conclusions

    There are several observations to make:

    • MySQL 5.7 outperforms MySQL 8.0 with the latin1 charset
    • MySQL 8.0 outperforms MySQL 5.7 by a wide margin if we use the utf8mb4 charset
    • Be aware that utf8mb4 is now the default in MySQL 8.0, while MySQL 5.7 defaults to latin1
    • When running a comparison between MySQL 8.0 and MySQL 5.7, be aware of which charset you are using, as it may affect the comparison a lot (see the sketch below).
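
    As a minimal sketch of that last point, assuming sysbench loaded its tables into the default sbtest schema, you can check which collation the test tables actually ended up with before comparing numbers:

    mysql -e "SELECT table_name, table_collation FROM information_schema.tables WHERE table_schema = 'sbtest';"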

    by Vadim Tkachenko at February 27, 2019 11:11 PM

    February 26, 2019

    Peter Zaitsev

    Percona XtraBackup Now Supports Dump of InnoDB Buffer Pool

    InnoDB keeps hot data in memory in its buffer, the InnoDB Buffer Pool. For a long time, when a MySQL instance needed to bounce, this hot cached data was lost and the instance required a warm-up period before it could perform as well as it did before the service restart.

    That is not the case anymore. Newer versions of MySQL/MariaDB allow users to save the state of this buffer by dumping the tablespace IDs and page IDs to a file on disk; that file is loaded automatically on startup, bringing the newly started server's buffer pool back to the state it was in prior to the restart.

    Details about the MySQL implementation can be found at https://dev.mysql.com/doc/refman/5.7/en/innodb-preload-buffer-pool.html
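
    For reference, the dump and load mechanism is driven by a handful of stock server variables, so it can also be triggered and monitored by hand. A minimal sketch (the percentage is illustrative):

    mysql -e "SET GLOBAL innodb_buffer_pool_dump_pct = 75;"
    mysql -e "SET GLOBAL innodb_buffer_pool_dump_now = ON;"
    # Check progress/completion of the dump:
    mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_dump_status';"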

    With that in mind, Percona XtraBackup version 2.4.13 can now instruct MySQL to dump the contents of the buffer pool while taking a backup. This means you can restore the backup on a new server and have MySQL perform just like the source instance in terms of InnoDB Buffer Pool data.

    How it works

    The buffer pool dump happens at the beginning of backup if --dump-innodb-buffer-pool is set.

    The user can choose to change the default innodb_buffer_pool_dump_pct. If --dump-innodb-buffer-pool-pct is set, it stores the current MySQL innodb_buffer_pool_dump_pct value, then changes it to the desired percentage. After the backup finishes, the original value is restored.

    The actual file copy happens at the end of the backup.
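
    Putting the two options together, a backup invocation could look roughly like this (user, password and target directory are illustrative):

    xtrabackup --backup --user=backupuser --password=secret \
      --dump-innodb-buffer-pool --dump-innodb-buffer-pool-pct=100 \
      --target-dir=/data/backups/full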

    Percona XtraDB Cluster

    A very good use case is PXC/Galera. When a node initiates SST, we would like the joiner to have a copy of InnoDB Buffer Pool from the donor. We can configure PXC nodes to do that:

    [xtrabackup]
    dump-innodb-buffer-pool
    dump-innodb-buffer-pool-pct=100

    Here is an example of a PXC node that just received SST:

    Before PXB-1548:

    [root@marcelo-altmann-pxb-pxc-3 ~]# systemctl stop mysql && rm -rf /var/lib/mysql/* && systemctl start mysql && mysql -psekret -e "SHOW ENGINE INNODB STATUS\G" | grep 'Database pages'
    mysql: [Warning] Using a password on the command line interface can be insecure.
    Database pages 311

    Joiner started with a cold buffer pool.

    After adding dump-innodb-buffer-pool and dump-innodb-buffer-pool-pct=100 to my.cnf :

    [root@marcelo-altmann-pxb-pxc-3 ~]# systemctl stop mysql && rm -rf /var/lib/mysql/* && systemctl start mysql && mysql -psekret -e "SHOW ENGINE INNODB STATUS\G" | grep 'Database pages'
    mysql: [Warning] Using a password on the command line interface can be insecure.
    Database pages 30970

    Joiner started with a copy of the buffer pool from the donor, which will reduce the joiner warm-up period.

    Conclusion

    The new version of Percona XtraBackup can help to minimize the time a newly restored backup will take to perform like the source server.


    Photo by Jametlene Reskp on Unsplash

    by Marcelo Altmann at February 26, 2019 10:38 AM

    February 25, 2019

    MariaDB Foundation

    MariaDB 10.4.3 now available

    The MariaDB Foundation is pleased to announce the availability of MariaDB 10.4.3, the first release candidate in the MariaDB 10.4 series. See the release notes and changelogs for details. Download MariaDB 10.4.3 Release Notes Changelog What is MariaDB 10.4? MariaDB APT and YUM Repository Configuration Generator Contributors to MariaDB 10.4.3 Aleksey Midenkov (Tempesta) Alexander Barkov […]

    The post MariaDB 10.4.3 now available appeared first on MariaDB.org.

    by Ian Gilfillan at February 25, 2019 07:07 PM

    Peter Zaitsev

    MySQL Challenge: 100k Connections

    In this post, I want to explore a way to establish 100,000 connections to MySQL. Not just idle connections, but executing queries.

    100,000 connections. Is that really needed for MySQL, you may ask? Although it may seem excessive, I have seen a lot of different setups in customer deployments. Some deploy an application connection pool, with 100 application servers and 1,000 connections in each pool. Some applications use a “re-connect and repeat if the query is too slow” technique, which is a terrible practice. It can lead to a snowball effect, and could establish thousands of connections to MySQL in a matter of seconds.

    So now I want to set an overachieving goal and see if we can achieve it.

    Setup

    For this I will use the following hardware:

    • Bare metal server provided by packet.net, instance size: c2.medium.x86
    • Physical cores @ 2.2 GHz (1 x AMD EPYC 7401P)
    • Memory: 64 GB of ECC RAM
    • Storage: INTEL® SSD DC S4500, 480GB

    This is a server grade SATA SSD.

    I will use five of these boxes, for the reason explained below. One box for the MySQL server and four boxes for client connections.

    For the server I will use Percona Server for MySQL 8.0.13-4 with the thread pool feature. The thread pool will be required to support the thousands of connections.

    Initial server setup

    Network settings (Ansible format):

    - { name: 'net.core.somaxconn', value: 32768 }
    - { name: 'net.core.rmem_max', value: 134217728 }
    - { name: 'net.core.wmem_max', value: 134217728 }
    - { name: 'net.ipv4.tcp_rmem', value: '4096 87380 134217728' }
    - { name: 'net.ipv4.tcp_wmem', value: '4096 87380 134217728' }
    - { name: 'net.core.netdev_max_backlog', value: 300000 }
    - { name: 'net.ipv4.tcp_moderate_rcvbuf', value: 1 }
    - { name: 'net.ipv4.tcp_no_metrics_save', value: 1 }
    - { name: 'net.ipv4.tcp_congestion_control', value: 'htcp' }
    - { name: 'net.ipv4.tcp_mtu_probing', value: 1 }
    - { name: 'net.ipv4.tcp_timestamps', value: 0 }
    - { name: 'net.ipv4.tcp_sack', value: 0 }
    - { name: 'net.ipv4.tcp_syncookies', value: 1 }
    - { name: 'net.ipv4.tcp_max_syn_backlog', value: 4096 }
    - { name: 'net.ipv4.tcp_mem', value: '50576   64768 98152' }
    - { name: 'net.ipv4.ip_local_port_range', value: '4000 65000' }
    - { name: 'net.ipv4.netdev_max_backlog', value: 2500 }
    - { name: 'net.ipv4.tcp_tw_reuse', value: 1 }
    - { name: 'net.ipv4.tcp_fin_timeout', value: 5 }

    These are the typical settings recommended for 10Gb networks and highly concurrent workloads.
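
    Outside of Ansible, the same values can be applied with plain sysctl. A minimal sketch for one of the settings (the drop-in file name is illustrative):

    # Apply immediately:
    sysctl -w net.core.somaxconn=32768
    # Persist across reboots:
    echo 'net.core.somaxconn = 32768' >> /etc/sysctl.d/99-mysql-highconn.conf
    sysctl --system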

    Limits settings for systemd:

    [Service]
    LimitNOFILE=1000000
    LimitNPROC=500000
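
    One way to apply these limits without touching the packaged unit file is a systemd drop-in. A sketch, assuming the service is named mysqld.service (on some distributions it is mysql.service):

    mkdir -p /etc/systemd/system/mysqld.service.d
    printf '[Service]\nLimitNOFILE=1000000\nLimitNPROC=500000\n' \
      > /etc/systemd/system/mysqld.service.d/limits.conf
    systemctl daemon-reload
    systemctl restart mysqld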

    And the relevant setting for MySQL in my.cnf:

    back_log=3500
    max_connections=110000

    For the client I will use sysbench version 0.5 and not 1.0.x, for the reasons explained below.

    The workload is

    sysbench --test=sysbench/tests/db/select.lua --mysql-host=139.178.82.47 --mysql-user=sbtest --mysql-password=sbtest --oltp-tables-count=10 --report-interval=1 --num-threads=10000 --max-time=300 --max-requests=0 --oltp-table-size=10000000 --rand-type=uniform --rand-init=on run

    Step 1. 10,000 connections

    This one is very easy, as there is not much to do to achieve this. We can do this with only one client. But you may face the following error on the client side:

    FATAL: error 2004: Can't create TCP/IP socket (24)

    This is caused by the open file limit, which is also a limit on TCP/IP sockets. This can be fixed by setting the following on the client:

    ulimit -n 100000

    The performance we observe:

    [  26s] threads: 10000, tps: 0.00, reads: 33367.48, writes: 0.00, response time: 3681.42ms (95%), errors: 0.00, reconnects:  0.00
    [  27s] threads: 10000, tps: 0.00, reads: 33289.74, writes: 0.00, response time: 3690.25ms (95%), errors: 0.00, reconnects:  0.00

    Step 2. 25,000 connections

    With 25,000 connections, we hit an error on MySQL side:

    Can't create a new thread (errno 11); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug

    If you try to lookup information on this error you might find the following article:  https://www.percona.com/blog/2013/02/04/cant_create_thread_errno_11/

    But it does not help in our case, as we have all limits set high enough:

    cat /proc/`pidof mysqld`/limits
    Limit                     Soft Limit Hard Limit           Units
    Max cpu time              unlimited  unlimited            seconds
    Max file size             unlimited  unlimited            bytes
    Max data size             unlimited  unlimited            bytes
    Max stack size            8388608    unlimited            bytes
    Max core file size        0          unlimited            bytes
    Max resident set          unlimited  unlimited            bytes
    Max processes             500000     500000               processes
    Max open files            1000000    1000000              files
    Max locked memory         16777216   16777216             bytes
    Max address space         unlimited  unlimited            bytes
    Max file locks            unlimited  unlimited            locks
    Max pending signals       255051     255051               signals
    Max msgqueue size         819200     819200               bytes
    Max nice priority         0          0
    Max realtime priority     0          0
    Max realtime timeout      unlimited unlimited            us

    This is where we start using the thread pool feature:  https://www.percona.com/doc/percona-server/8.0/performance/threadpool.html

    Add:

    thread_handling=pool-of-threads

    to my.cnf and restart Percona Server.
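
    After the restart it is worth confirming that the thread pool is really in effect. A minimal check:

    # Should report pool-of-threads:
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'thread_handling';"
    # Size of the pool (typically sized from the number of CPUs):
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'thread_pool_size';"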

    The results:

    [   7s] threads: 25000, tps: 0.00, reads: 33332.57, writes: 0.00, response time: 974.56ms (95%), errors: 0.00, reconnects:  0.00
    [   8s] threads: 25000, tps: 0.00, reads: 33187.01, writes: 0.00, response time: 979.24ms (95%), errors: 0.00, reconnects:  0.00

    We have the same throughput, but actually the 95% response time has improved (thanks to the thread pool) from 3690 ms to 979 ms.

    Step 3. 50,000 connections

    This is where we encountered the biggest challenge. At first, trying to get 50,000 connections in sysbench we hit the following error:

    FATAL: error 2003: Can't connect to MySQL server on '139.178.82.47' (99)

    Error (99) is cryptic and it means: Cannot assign requested address.

    It comes from the limit on the number of local ports an application can open. By default on my system it is:

    cat /proc/sys/net/ipv4/ip_local_port_range
    32768   60999

    This says there are only 28,231 available ports — 60999 minus 32768 — or the limit of TCP connections you can establish from or to the given IP address.

    You can extend this using a wider range, on both the client and the server:

    echo 4000 65000 > /proc/sys/net/ipv4/ip_local_port_range

    This will give us 61,000 connections, but this is very close to the limit for one IP address (the maximum port is 65535). The key takeaway here is that if we want more connections we need to allocate more IP addresses for the MySQL server. In order to achieve 100,000 connections, I will use two IP addresses on the server running MySQL.
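
    A rough sketch of that idea, with an illustrative interface name and secondary address rather than the ones used in this test:

    # Add a second address on the MySQL host; each (client IP, server IP) pair then
    # gets its own ~61k pool of ports:
    ip addr add 10.0.0.2/24 dev eth0
    # Point part of the sysbench clients at the new address with --mysql-host=10.0.0.2.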

    After sorting out the port ranges, we hit the following problem with sysbench:

    sysbench 0.5:  multi-threaded system evaluation benchmark
    Running the test with following options:
    Number of threads: 50000
    FATAL: pthread_create() for thread #32352 failed. errno = 12 (Cannot allocate memory)

    In this case, it’s a problem with sysbench memory allocation (namely lua subsystem). Sysbench can allocate memory for only 32,351 connections. This is a problem which is even more severe in sysbench 1.0.x.

    Sysbench 1.0.x limitation

    Sysbench 1.0.x uses a different Lua JIT, which hits memory problems even with 4,000 connections, so it is impossible to go over 4,000 connections in sysbench 1.0.x.

    So it seems we hit a limit with sysbench sooner than with Percona Server. In order to use more connections, we need to use multiple sysbench clients, and if 32,351 connections is the limit for sysbench, we have to use at least four sysbench clients to get up to 100,000 connections.

    For 50,000 connections I will use two client servers, each running a separate sysbench with 25,000 threads.

    The results for each sysbench looks like:

    [  29s] threads: 25000, tps: 0.00, reads: 16794.09, writes: 0.00, response time: 1799.63ms (95%), errors: 0.00, reconnects:  0.00
    [  30s] threads: 25000, tps: 0.00, reads: 16491.03, writes: 0.00, response time: 1800.70ms (95%), errors: 0.00, reconnects:  0.00

    So we have about the same throughput (16794*2 = 33588 tps in total); however, the 95% response time has doubled. This is to be expected, as we are using twice as many connections compared to the 25,000 connections benchmark.

    Step 4. 75,000 connections

    To achieve 75,000 connections we will use three servers with sysbench, each running 25,000 threads.

    The results for each sysbench:

    [ 157s] threads: 25000, tps: 0.00, reads: 11633.87, writes: 0.00, response time: 2651.76ms (95%), errors: 0.00, reconnects:  0.00
    [ 158s] threads: 25000, tps: 0.00, reads: 10783.09, writes: 0.00, response time: 2601.44ms (95%), errors: 0.00, reconnects:  0.00

    Step 5. 100,000 connections

    There is nothing eventful about achieving 75k and 100k connections. We just spin up an additional server and start sysbench. For 100,000 connections we need four servers for sysbench, each of which shows:

    [ 101s] threads: 25000, tps: 0.00, reads: 8033.83, writes: 0.00, response time: 3320.21ms (95%), errors: 0.00, reconnects:  0.00
    [ 102s] threads: 25000, tps: 0.00, reads: 8065.02, writes: 0.00, response time: 3405.77ms (95%), errors: 0.00, reconnects:  0.00

    So we have the same throughput (8065*4=32260 tps in total) with 3405ms 95% response time.

    A very important takeaway from this: with 100k connections and using a thread pool, the 95% response time is even better than for 10k connections without a thread pool. The thread pool allows Percona Server to manage resources more efficiently and provides better response times.

    Conclusions

    100k connections is quite achievable for MySQL, and I am sure we could go even further. There are three components to achieve this:

    • Thread pool in Percona Server
    • Proper tuning of network limits
    • Using multiple IP addresses on the server box (one IP address per approximately 60k connections)

    Appendix: full my.cnf

    [mysqld]
    datadir {{ mysqldir }}
    ssl=0
    skip-log-bin
    log-error=error.log
    # Disabling symbolic-links is recommended to prevent assorted security risks
    symbolic-links=0
    character_set_server=latin1
    collation_server=latin1_swedish_ci
    skip-character-set-client-handshake
    innodb_undo_log_truncate=off
    # general
    table_open_cache = 200000
    table_open_cache_instances=64
    back_log=3500
    max_connections=110000
    # files
    innodb_file_per_table
    innodb_log_file_size=15G
    innodb_log_files_in_group=2
    innodb_open_files=4000
    # buffers
    innodb_buffer_pool_size= 40G
    innodb_buffer_pool_instances=8
    innodb_log_buffer_size=64M
    # tune
    innodb_doublewrite= 1
    innodb_thread_concurrency=0
    innodb_flush_log_at_trx_commit= 0
    innodb_flush_method=O_DIRECT_NO_FSYNC
    innodb_max_dirty_pages_pct=90
    innodb_max_dirty_pages_pct_lwm=10
    innodb_lru_scan_depth=2048
    innodb_page_cleaners=4
    join_buffer_size=256K
    sort_buffer_size=256K
    innodb_use_native_aio=1
    innodb_stats_persistent = 1
    #innodb_spin_wait_delay=96
    innodb_adaptive_flushing = 1
    innodb_flush_neighbors = 0
    innodb_read_io_threads = 16
    innodb_write_io_threads = 16
    innodb_io_capacity=1500
    innodb_io_capacity_max=2500
    innodb_purge_threads=4
    innodb_adaptive_hash_index=0
    max_prepared_stmt_count=1000000
    innodb_monitor_enable = '%'
    performance_schema = ON

    by Vadim Tkachenko at February 25, 2019 03:00 PM

    February 23, 2019

    Valeriy Kravchuk

    Fun with Bugs #80 - On MySQL Bug Reports I am Subscribed to, Part XVI

    Today I'd like to continue my review of public MySQL bug reports with a list of some bugs I've subscribed to over the last 3 weeks. It's already long enough and includes nice cases to check and share. Note that I usually subscribe to a bug either because it directly affects me or customers I work with, or because I consider it technically interesting (so I mostly care about InnoDB, replication, partitioning and optimizer bugs), or because it's a "metabug" - a problem in the way a public bug report is handled by Oracle engineers. These are my interests related to MySQL bugs.

    As usual, I start with the oldest bugs and try to mention bug reporters by name with links to their other reports whenever this may give something useful to a reader. I try to check if MariaDB is also affected in some cases. Check also my summary comments at the end of this blog post.
    • Bug #94148 - "Unnecessary Shared lock on parent table During UPDATE on a child table". In this bug report Uday Varagani reasonably pointed out that formally there is no need to lock the parent row when a column NOT included in the foreign key gets updated. This happens, though, when the column is included in the index used to support the foreign key constraint. IMHO it's a reasonable feature request, and both Trey Raymond and Sveta Smirnova tried their best to highlight this, but the report now has a "Need Feedback" status with a request to explain the new algorithm suggested. It's simple: "Stop it", that is, check that the changed column is NOT one the foreign key is defined on, even if it's in the same index... I see no reason NOT to verify this as a reasonable feature request. Is it a new policy that every feature request should come with details on how to implement it? I truly doubt it.
    • Bug #94224 - "[5.6] Optimizer reconsiders index based on index definition order, not value". Domas Mituzas found yet another case (see also Bug #36817 - "Non optimal index choice, depending on index creation order" from Jocelyn Fournier, a bug I verified more than 10 years ago) where in MySQL the order of index definition matters more to the optimizer than anything else. My quick check shows that MariaDB 10.3.7 is not affected:
      MariaDB [test]> explain select distinct b from t1 where c not in (0) and d > 0;
      +------+-------------+-------+-------+---------------+--------------------+---------+------+------+-------------+
      | id   | select_type | table | type  | possible_keys | key                | key_len | ref  | rows | Extra       |
      +------+-------------+-------+-------+---------------+--------------------+---------+------+------+-------------+
      |    1 | SIMPLE      | t1    | index | NULL          | non_covering_index | 9       | NULL |    1 | Using where |
      +------+-------------+-------+-------+---------------+--------------------+---------+------+------+-------------+
      1 row in set (0.002 sec)

      MariaDB [test]> alter table t1 add index covering_index (b, c, d);
      Query OK, 0 rows affected (0.149 sec)
      Records: 0  Duplicates: 0  Warnings: 0

      MariaDB [test]> explain select distinct b from t1 where c not in (0) and d > 0;
      +------+-------------+-------+-------+---------------+----------------+---------+------+------+--------------------------+
      | id   | select_type | table | type  | possible_keys | key            | key_len | ref  | rows | Extra                    |
      +------+-------------+-------+-------+---------------+----------------+---------+------+------+--------------------------+
      |    1 | SIMPLE      | t1    | index | NULL          | covering_index | 14      | NULL |    1 | Using where; Using index |
      +------+-------------+-------+-------+---------------+----------------+---------+------+------+--------------------------+
      1 row in set (0.025 sec)
      Fortunately MySQL 8 is no longer affected. Unfortunately we do not see a public comment showing the results of testing on MySQL 5.7 (or any version, for that matter) from the engineer who verified the bug. I already pointed out in my previous blog post that this "metabug" is becoming popular.
    • Bug #94243 - "WL#9508 introduced non-idiomatic potentially-broken C macros". Laurynas Biveinis from Percona found new code that in an ideal world would not pass any serious code review.
    • Bug #94251 - "Aggregate function result is dependent by window is defined directly or as named". This bug was reported by Владислав Сокол. From what I see:
      MariaDB [test]> WITH RECURSIVE cte AS (
          -> SELECT 1 num
          -> UNION ALL
          -> SELECT num+1 FROM cte WHERE num < 5
          -> )
          -> SELECT num, COUNT(*) OVER (frame) cnt_named, COUNT(*) OVER (ORDER BY num DESC) cnt_direct
          -> FROM cte
          -> WINDOW frame AS (ORDER BY num DESC);
      +------+-----------+------------+
      | num  | cnt_named | cnt_direct |
      +------+-----------+------------+
      |    1 |         5 |          5 |
      |    2 |         4 |          4 |
      |    3 |         3 |          3 |
      |    4 |         2 |          2 |
      |    5 |         1 |          1 |
      +------+-----------+------------+
      5 rows in set (0.117 sec)

      MariaDB [test]> WITH RECURSIVE cte AS (
          -> SELECT 1 num
          -> UNION ALL
          -> SELECT num+1 FROM cte WHERE num < 5
          -> )
          -> SELECT num, COUNT(*) OVER (frame) cnt_named, COUNT(*) OVER (ORDER BY num DESC) cnt_direct
          -> FROM cte
          -> WINDOW frame AS (ORDER BY num DESC)
          -> ORDER BY num desc;
      +------+-----------+------------+
      | num  | cnt_named | cnt_direct |
      +------+-----------+------------+
      |    5 |         1 |          1 |
      |    4 |         2 |          2 |
      |    3 |         3 |          3 |
      |    2 |         4 |          4 |
      |    1 |         5 |          5 |
      +------+-----------+------------+
      5 rows in set (0.003 sec)
      MariaDB 10.3.7 is NOT affected.
    • Bug #94283 - "MySQL 8.0.15 is slower than MySQL 5.7.25". Percona's CTO Vadim Tkachenko reported that MySQL 8.0.15 is notably slower than 5.7.25 on a simple oltp_read_write sysbench test. He had recently written a separate blog post about this, with more details. There is one detail to clarify based on today's comment from Peter Zaitsev (whether the same default character set was used), but as my dear friend Sinisa Milivojevic verified the bug without any questions, requests, or his own test outputs shared, we can assume that Oracle officially accepted this performance regression (even though the "regression" tag was not set).

      Check also the later Bug #94387 - "MySQL 8.0.15 is slower than MySQL 5.7.25 in read only workloads", yet another performance regression report from Vadim, where he found that on read-only, all-in-memory workloads (sysbench oltp_point_select) MySQL 8.0.15 may also be slower than MySQL 5.7.25.
    • Bug #94302 - "reset master could not break dump thread in some cases". This bug was reported by Ashe Sun. This is definitely a corner case, as it happens only while the master is still writing to the very first binary log. We cannot find out from the public comments in the bug report whether any other versions besides 5.7.x are affected. This is yet another "metabug" - during my days in Oracle's MySQL bugs verification team we had to check all still-supported versions and present the results explicitly.
    • Bug #94319 - "Format_description_log_event::write can cause segfaults". Nice bug report by Manuel Ung from Facebook.
    • Bug #94330 - "Test for possible compressed failures before upgrade?". Change of zlib version starting from MySQL 5.7.24 means that some operations for InnoDB tables with ROW_FORMAT=COMPRESSED that previously worked may start to fail. In this report Monty Solomon asks for some way to determine if there will be a problem with existing compressed tables before upgrading to 5.7.24. The bug is still "Open".
    • Bug #94338 - "Dirty read-like behavior in READ COMMITTED transaction". The bug reporter, Masaki Oguro, stated that MySQL 8 is not affected (only 5.6 and 5.7), and the bug is verified on these versions, so we should assume that's really the case. But I miss a public comment showing the result of testing on the recent MySQL 8.0.15.
    • Bug #94340 - "backwards incompatible changes in 8.0: Error number: 3747". Simon Mudd complains about incompatible change in 8.0.13 that does not allow slave to easily switch from SBR to RBR without restart (and was not clearly documented as a change in behavior). Make sure to read all comments.
    • Bug #94370 - "Performance regression of btr_cur_prefetch_siblings". Nice bug report with a patch from Zhai Weixiang.
    • Bug #94383 - "simple ALTER cause unnecessary InnoDB index rebuilds, 5.7.23 or later 5.7 rlses". In this bug report Mikhail Izioumtchenko presented the detailed analysis and suggested diagnostics patches to show what really happens and why. This bug is also a regression of a kind, so while testing results are presented, I still think that it could be processed better according to the good old rules I have in mind.
    • Bug #94394 - "Absence of mysql.user leads to auto-apply of --skip-grant-tables". Great finding by Ceri Williams from Percona. Sveta Smirnova provided a separate MTR test case and clarified the impact of the bug. Surely this is also a regression compared to MySQL 5.7, as there you cannot start MySQL if the mysql.user table is missing. I leave it to the reader to decide if there is any security-related impact of this bug...
    • Bug #94396 - "Error message too broad: The used command is not allowed with this MySQL version". This bug was reported by my former colleague in Percona Support, famous Bill Karwin. Informative error messages matter for good user experience.
    We rely on MySQL in the same way as the guys on top of the dolphin pyramid on this strange monument in a courtyard somewhere at the Lanes. A reliable foundation matters, so regressions had better be avoided.
    To summarize:
    1. Looks like it's time for Oracle to spend some effort on making MySQL 8 great again, by fixing some of the bugs mentioned above, especially the performance regressions vs MySQL 5.7 found recently by Vadim Tkachenko from Percona.
    2. Oracle continues to introduce backward-incompatible changes in behavior in minor MySQL 8.0.x releases at GA stage. This is not really good for any production environment.
    3. Asking bug reporters to provide "the basics of such a new algorithm" when they complain that current one is wrong or not optimal is a new word in bugs processing!
    4. When I joined the MySQL bugs verification team in 2005 we set up a culture of bugs processing that included, among other things, presenting in a public comment any successful or unsuccessful attempt to verify the bug, by copy-pasting all commands and statements used along with the outputs, whenever possible and with enough context to show what was really checked. I learned this approach from Oracle's Tom Kyte, whom I had followed closely over the previous 10 years. I used to think it had been the standard for more than a decade, a kind of my (and not only my) "heritage". It's sad to see this approach no longer followed by many Oracle engineers who process bugs, in too many cases.
    5. Oracle engineers still do not use the "regression" tag when setting "Verified" status for obvious regression bugs. I think bug reporters should then take care to always set it themselves when they report a regression of any kind.

    by Valerii Kravchuk (noreply@blogger.com) at February 23, 2019 06:10 PM

    February 22, 2019

    Peter Zaitsev

    Percona Live 2019 First Sneak Peek!

    We know you've been really looking forward to a glimpse of what to expect at Percona Live Austin, so here is the first sneak peek of the agenda!

    Our conference committee has been reviewing hundreds of talks over the last few weeks and is delighted to present some initial talks.

    • New features in MySQL 8.0 Replication by Luís Soares, Oracle OSS
    • Shaping the Future of Privacy & Data Protection by Cristina DeLisle, XWiki SAS
    • Galera Cluster New Features by Seppo Jaakola, Codership
    • MySQL Security and Standardization at PayPal by Stacy Yuan & Yashada Jadha, PayPal
    • Mailchimp Scale: a MySQL Perspective by John Scott, Mailchimp
    • The State of Databases in 2019 by Dinesh Joshi, Apache Cassandra

    PingCAP will be sponsoring the TiDB track and have a day of really exciting content to share! Liu Tang, Chief Engineer at PingCAP, will be presenting: Using Chaos Engineering to Build a Reliable TiDB. Keep your eye out for more coming soon!

    We could not put on this conference without the support of our sponsors. Being a sponsor at Percona Live gives companies the opportunity to showcase their products and services, interact with the community for invaluable face time, meet with users or customers, and showcase their recruitment opportunities.

    It is with great pleasure that we announce the first round of sponsors for Percona Live!

    Diamond Sponsors

    continuent

     

    VividCortex

     

    Silver Sponsors

    pingcapmysql

    If you’d like to find out more about being a sponsor, download the prospectus here
     
    Stay tuned for more updates on the conference agenda! 

    by Bronwyn Campbell at February 22, 2019 05:31 PM

    Oli Sennhauser

    FromDual Backup and Recovery Manager for MariaDB and MySQL 2.1.0 has been released

    FromDual has the pleasure to announce the release of the new version 2.1.0 of its popular Backup and Recovery Manager for MariaDB and MySQL (brman).

    The new FromDual Backup and Recovery Manager can be downloaded from here. How to install and use the Backup and Recovery Manager is described in the FromDual Backup and Recovery Manager (brman) installation guide.

    In the inconceivable case that you find a bug in the FromDual Backup and Recovery Manager please report it to the FromDual Bugtracker or just send us an email.

    Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

    Upgrade from 1.2.x to 2.1.0

    brman 2.1.0 requires a new PHP package for ssh connections.

    shell> sudo apt-get install php-ssh2
    
    shell> cd ${HOME}/product
    shell> tar xf /download/brman-2.1.0.tar.gz
    shell> rm -f brman
    shell> ln -s brman-2.1.0 brman
    

    Changes in FromDual Backup and Recovery Manager 2.1.0

    This release is a new major release series. It contains a lot of new features. We have tried to maintain backward-compatibility with the 1.2 and 2.0 release series. But you should test the new release seriously!

    You can verify your current FromDual Backup Manager version with the following command:

    shell> fromdual_bman --version
    shell> bman --version
    

    FromDual Backup Manager

    • Usage (--help) updated.
    • Some WARN severities downgraded to INFO to keep mail output clean.
    • Error messages made more flexible and fixed PHP library advice.
    • Split some redundant code from bman library into brman library.
    • Security fix: Password from config file is hidden now.
    • Bug on simulation of physical backup fixed (xtrabackup_binlog_info not found).
    • Options --backup-name and --backup-overwrite introduced for restore automation.
    • Minor typo bugs fixed.
    • Option --options removed.
    • Sort order for schema backup changed to ORDER BY ASC.
    • 2 PHP errors fixed for simulation.
    • Maskerade API added.
    • Physical backup sftp archiving with special characters (+foodmarat) in archive directory name fixed.

    FromDual Recovery Manager

    • Rman has progress report.
    • Full logical restore is implemented.
    • Schema logical restore is implemented.
    • Physical restore is implemented.
    • Physical restore of compressed backups is implemented.
    • Option --cleanup-first was implemented for physical backup as well.
    • Option: --stop-instance implemented.

    FromDual Backup Manager Catalog

    • No changes.

    Subscriptions for commercial use of the FromDual Backup and Recovery Manager can be obtained directly from us.

    by Shinguz at February 22, 2019 04:14 PM

    MariaDB Foundation

    “Account Locking and Password Expiration Overview” – MariaDB Unconference Presentations

    Security is one of the hottest topics in computer software today; everybody handles highly valuable data. From private personal data and medical records for clinics to customers' credit card information for online businesses, malicious data breaches are always part of the worst-case scenario. Robert Bindar (robert@mariadb.org) is going to present a session at the 2019 MariaDB Unconference, New York about […]

    The post “Account Locking and Password Expiration Overview” – MariaDB Unconference Presentations appeared first on MariaDB.org.

    by Anna Widenius at February 22, 2019 02:45 PM

    Peter Zaitsev

    PostgreSQL fsync Failure Fixed – Minor Versions Released Feb 14, 2019

    In case you didn't already see this news, PostgreSQL has got its first minor version release of 2019, which includes minor version updates for all supported PostgreSQL major versions. We indicated in our previous blog post that PostgreSQL 9.3 had gone EOL and would not receive any more updates.

    What’s new in this release?

    One of the common fixes applied to all the supported PostgreSQL versions is to panic instead of retrying after an fsync() failure. This fsync failure has been under discussion for a year or two now, so let's take a look at the implications.

    A fix to the Linux fsync issue for PostgreSQL Buffered IO in all supported versions

    PostgreSQL performs two types of IO. Direct IO – though almost never – and the much more commonly performed Buffered IO.

    PostgreSQL uses O_DIRECT when it is writing to WALs (Write-Ahead Logs, aka Transaction Logs) only when wal_sync_method is set to open_datasync or to open_sync with no archiving or streaming enabled. The default wal_sync_method may be fdatasync, which does not use O_DIRECT. This means that almost all the time in your production database server you'll see PostgreSQL using O_SYNC / O_DSYNC while writing to WALs, whereas writing the modified/dirty buffers from shared buffers to datafiles is always done through Buffered IO. Let's understand this further.
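
    You can check which method your own server is using. A minimal sketch:

    # WAL sync method currently in effect:
    psql -c "SHOW wal_sync_method;"
    # fsync itself should normally remain on:
    psql -c "SHOW fsync;"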

    Upon checkpoint, dirty buffers in shared buffers are written to the page cache managed by the kernel. Through an fsync(), these modified blocks are applied to disk. If an fsync() call is successful, all dirty pages from the corresponding file are guaranteed to be persisted on the disk. But when an fsync() fails, PostgreSQL cannot guarantee that a copy of the modified/dirty page is still available, because writes from the page cache to storage are completely managed by the kernel, not by PostgreSQL.

    This could still be fine if the next fsync retries flushing of the dirty page. But, in reality, the data is discarded from the page cache upon an error with fsync. And the next fsync would obviously succeed ignoring the previous errors, because it now includes the next set of dirty buffers that need to be written to disk and not the ones that failed earlier.

    To understand it better, consider an example of Linux trying to write dirty pages from page cache to a USB stick that was removed during an fsync. Neither the ext4 file system nor the btrfs nor an xfs tries to retry the failed writes. A silently failing fsync may result in data loss, block corruption, table or index out of sync, foreign key or other data integrity issues… and deleted records may reappear.

    Until a while ago, when we used local storage or storage using RAID Controllers with write cache, it might not have been a big problem. This issue goes back to the time when PostgreSQL was designed for buffered IO but not Direct IO. Should this now be considered an issue with PostgreSQL and the way it’s designed? Well, not exactly.

    All this started with the error handling during a writeback in Linux. A writeback asynchronously performs dirty page writes from page cache to filesystem. In ext4 like filesystems, upon a writeback error, the page is marked clean and up to date, and the user space is unaware of the problem.

    fsync errors are now detected

    Starting from kernel 4.13, we can now reliably detect such errors during fsync. Any open file descriptor to a file includes a pointer to the address_space structure, and a new 32-bit value (errseq_t) has been added that is visible to all the processes accessing that file. With the new minor version for all supported PostgreSQL versions, a PANIC is triggered upon such an error. This causes a database crash and initiates recovery from the last CHECKPOINT. There is a patch expected in PostgreSQL 12 that works with newer kernel versions and modifies the way PostgreSQL handles the file descriptors. A long term solution to this issue may be Direct IO, but you might see a different approach to this in PG 12.

    A good amount of work on this issue was done by Jeff Layton on reporting writeback errors, and by Matthew Wilcox. What this patch means is that a writeback error gets reported during an fsync, and can be seen by another process that opens that file. A new 32-bit value that stores an error code and a sequence number is added via a new typedef, errseq_t, so these errors now live in the address_space. But if the struct inode is gone due to memory pressure, this patch has no value.

    Can I enable or disable the PANIC on fsync failure in the newer PostgreSQL releases?

    Yes. You can leave the parameter data_sync_retry at false (the default), in which case a PANIC-level error is raised and the server recovers from WAL through a database crash. You must be sure to have a proper high-availability mechanism so that the impact is minimal for your application. You could let your application fail over to a slave, which could minimize the impact.

    You can always set data_sync_retry to true if you are sure about how your OS behaves during write-back failures. By setting this to true, PostgreSQL will just report an error and continue to run.
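
    data_sync_retry can only be set at server start, so changing it means editing postgresql.conf and restarting. A sketch, with an illustrative config path and service name:

    # 'off' is equivalent to the 'false' mentioned above:
    echo "data_sync_retry = off" >> /etc/postgresql/11/main/postgresql.conf
    systemctl restart postgresql
    psql -c "SHOW data_sync_retry;"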

    Some of the other possible issues now fixed and common to these minor releases

    1. A lot of features and fixes related to PARTITIONING have been applied in this minor release. (PostgreSQL 10 and 11 only).
    2. Autovacuum has been made more aggressive about removing leftover temporary tables.
    3. Deadlock when acquiring multiple buffer locks.
    4. Crashes in logical replication.
    5. Incorrect planning of queries in which a lateral reference must be evaluated at a foreign table scan.
    6. Fixed some issues reported with ANALYZE and TRUNCATE operations.
    7. Fix to contrib/hstore to calculate correct hash values for empty hstore values that were created in version 8.4 or before.
    8. A fix to pg_dump’s handling of materialized views with indirect dependencies on primary keys.

    We always recommend that you keep your PostgreSQL databases updated to the latest minor versions. Applying a minor release might need a restart after updating the new binaries.

    Here is the sequence of steps you should follow to upgrade to the latest minor versions after thorough testing (a shell sketch follows the list):

    1. Shutdown the PostgreSQL database server
    2. Install the updated binaries
    3. Restart your PostgreSQL database server
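
    A minimal sketch of those three steps on a Debian/Ubuntu style installation (package and service names are illustrative and depend on how PostgreSQL was installed):

    sudo systemctl stop postgresql                        # 1. shut down the server
    sudo apt-get update
    sudo apt-get install --only-upgrade postgresql-11     # 2. install updated binaries
    sudo systemctl start postgresql                       # 3. restart the server
    psql -c "SELECT version();"                           # confirm the new minor version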

    Most of the time, you can choose to update the minor versions in a rolling fashion in a master-slave (replication) setup because it avoids downtime for both reads and writes simultaneously. For a rolling style update, you could perform the update on one server after another… but not all at once. However, the best method that we’d almost always recommend is – shutdown, update and restart all instances at once.

    If you are currently running your databases on PostgreSQL 9.3.x or earlier, we recommend that you prepare a plan to upgrade your PostgreSQL databases to a supported version ASAP. Please subscribe to our blog posts so that you can hear about the various options for upgrading your PostgreSQL databases to a supported major version.


    Photo by Andrew Rice on Unsplash

    by Avinash Vallarapu at February 22, 2019 01:47 PM

    MariaDB Foundation

    “How to write your first patch ? ” – MariaDB Unconference Presentations

     Have you ever wondered how to get started with contributions to the world’s most popular open source database? Did you have a problems with building and configuring from source code, writing the contribution patch and testing the server with  use of mysql-test-run (mtr) framework  afterwards? How to make your patch visible to other developers? In […]

    The post “How to write your first patch ? ” – MariaDB Unconference Presentations appeared first on MariaDB.org.

    by Anna Widenius at February 22, 2019 12:51 PM

    Peter Zaitsev

    Measuring Percona Server for MySQL On-Disk Decryption Overhead

    Percona Server for MySQL 8.0 comes with enterprise grade total data encryption features. However, there is always the question of how much overhead – or performance penalty – comes with the data decryption. As we saw in my networking performance post, SSL under high concurrency might be problematic. Is this the case for data decryption?

    To measure any overhead, I will start with a simplified read-only workload, where data gets decrypted during read IO.

    MySQL decryption schematic

    During query execution, the data in memory is already decrypted so there is no additional processing time. The decryption happens only for blocks that require a read from storage.

    For the benchmark I will use the following workload:

    sysbench oltp_read_only --mysql-ssl=off --tables=20 --table-size=10000000 --threads=$i --time=300 --report-interval=1 --rand-type=uniform run

    The datasize for this workload is about 50GB, so I will use innodb_buffer_pool_size = 5GB to emulate heavy disk read IO during the benchmark. In the second run, I will use innodb_buffer_pool_size = 60GB so that all data is kept in memory and there are NO disk read IO operations.

    I will only use table-level encryption at this time (i.e. no encryption for the binary log, system tablespace, or redo and undo logs).
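
    For context, table-level encryption is switched on per table and requires a keyring plugin to be loaded. A minimal sketch with illustrative schema and table names:

    mysql -e "ALTER TABLE sbtest.sbtest1 ENCRYPTION='Y';"
    # List tables that carry the ENCRYPTION option:
    mysql -e "SELECT table_schema, table_name, create_options FROM information_schema.tables WHERE create_options LIKE '%ENCRYPTION%';"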

    The server I am using has AES hardware CPU acceleration. Read more at https://en.wikipedia.org/wiki/AES_instruction_set

    Benchmark N1, heavy read IO

    Threads   Encrypted storage   No encryption   Encryption overhead
    1         389.11              423.47          1.09
    4         1531.48             1673.2          1.09
    16        5583.04             6055            1.08
    32        8250.61             8479.61         1.03
    64        8558.6              8574.43         1.00
    96        8571.55             8577.9          1.00
    128       8570.5              8580.68         1.00
    256       8576.34             8585            1.00
    512       8573.15             8573.73         1.00
    1024      8570.2              8562.82         1.00
    2048      8422.24             8286.65         0.98

    Benchmark N2, data in memory, no read IO

    Threads   Encryption   No encryption
    1         578.91       567.65
    4         2289.13      2275.12
    16        8304.1       8584.06
    32        13324.02     13513.39
    64        20007.22     19821.48
    96        19613.82     19587.56
    128       19254.68     19307.82
    256       18694.05     18693.93
    512       18431.97     18372.13
    1024      18571.43     18453.69
    2048      18509.73     18332.59

    Observations

    For a high number of threads, there is no measurable difference between encrypted and unencrypted storage. This is because a lot of CPU resources are spent on contention and waits, so the relative time spent in decryption is negligible.

    However, we can see some performance penalty for a low number of threads: up to 9% penalty for hardware decryption. When data fully fits into memory, there is no measurable difference between encrypted and unencrypted storage.

    So if you have hardware support then you should see little impact when using storage encryption with MySQL. The easiest way to check if you have support for this is to look at CPU flags and search for ‘aes’ string:

    > lscpu | grep aes
    Flags: ... tsc_deadline_timer aes xsave avx f16c ...

    by Vadim Tkachenko at February 22, 2019 12:38 PM

    Chris Calender

    MariaDB MaxScale Masking Basics and Examples

    I wanted to take a moment to write up a post on MariaDB MaxScale’s masking basics and include some real-world examples.

    We have nice documentation on the subject, and Dipti wrote a nice blog post on it as well. I just wanted to provide my take on it, and hopefully build upon what is already there and offer some additional insights.

    To provide a 50-foot overview, the masking filter makes it possible to obfuscate the returned value of a particular column.

    3 quite common columns where this would be very beneficial: Social Security Number (“SSN”), Date of Birth (“DOB”), and Credit Card Number (“CCNUM”).

    Using masking assumes you already have a MaxScale service up and running, for instance the read-write splitter.

    In this case, you would already have a configuration file similar to this (3 backend servers: 1 master (server1) and 2 slaves (server2 & server3), with readwritesplit (Read-Write-Service) and its listener (Read-Write-Listener) set up):

    [maxscale]
    threads=4
    log_info=1
    local_address=192.168.1.183
    log_debug=1     # debug only
    
    [server1]
    type=server
    address=127.0.0.1
    port=3306
    protocol=MySQLBackend
    
    [server2]
    type=server
    address=127.0.0.1
    port=3344
    protocol=MySQLBackend
    
    [server3]
    type=server
    address=127.0.0.1
    port=3340
    protocol=MySQLBackend
    
    [Read-Write-Service]
    type=service
    router=readwritesplit
    servers=server1,server2,server3
    user=root
    passwd=xxx
    max_slave_connections=100%
    enable_root_user=1
    
    [Read-Write-Listener]
    type=listener
    service=Read-Write-Service
    protocol=MySQLClient
    port=4006
    
    [MaxAdmin-Service]
    type=service
    router=cli
    enable_root_user=1
    

    In the examples from the aforementioned manual and blog post, you will see something like this for your “configuration” addition:

    [MyMasking]
    type=filter
    module=masking
    rules=...
    
    [MyService]
    type=service
    ...
    filters=MyMasking
    

    MyMasking is the name you will choose for your masking filter.

    MyService is a service you already have defined and running. In this example, it is [Read-Write-Service].

    Thus, I simply add the following line to [Read-Write-Service]:

    filters=MyMasking
    

    If you already have a filter defined for this service, say NamedServerFilter, then you can add a second filter like this (i.e., each filter is separated by a “|”):

    filters=NamedServerFilter | MyMasking
    

    And then add your [MyMasking] section/configuration:

    [MyMasking]
    type=filter
    module=masking
    warn_type_mismatch=always
    large_payload=abort
    rules=/etc/maxscale.modules.d/masking_rules.json
    

    In the above, the type is “filter”, and the module is “masking”. Both of those are self-explanatory.

    The "warn_type_mismatch" option instructs MaxScale to log a warning if a masking rule matches a column that is not of one of the allowed types. Possible values are "never" and "always" (with "never" being the default). However, a limitation of masking is that it can only be used for masking columns of the following types: BINARY, VARBINARY, CHAR, VARCHAR, BLOB, TINYBLOB, MEDIUMBLOB, LONGBLOB, TEXT, TINYTEXT, MEDIUMTEXT, LONGTEXT, ENUM and SET. If the type of the column is something else (INTs, DATEs, etc.), then no masking will be performed. So you might want to be "warned" if this happens, which is why I chose "always".

    The "large_payload" option specifies how the masking filter should treat payloads larger than 16MB. Possible values are "ignore" and "abort" (with "abort" being the default). If you choose ignore and the result set is > 16MB, then no masking will be performed and the result set will be returned to the client. If abort, then the client connection is closed.

    And "rules" defines the path and name of the masking_rules.json file, which you use to define your rules: what you want filtered, which columns, from which tables or schemas (or database-wide), plus options for how to handle the display, and so forth. It is very flexible, suffice to say.

    Thus my updated config file becomes:

    [maxscale]
    threads=4
    log_info=1
    local_address=192.168.1.183
    log_debug=1     # debug only
    
    [server1]
    type=server
    address=127.0.0.1
    port=3306
    protocol=MySQLBackend
    
    [server2]
    type=server
    address=127.0.0.1
    port=3344
    protocol=MySQLBackend
    
    [server3]
    type=server
    address=127.0.0.1
    port=3340
    protocol=MySQLBackend
    
    [Read-Write-Service]
    type=service
    router=readwritesplit
    servers=server1,server2,server3
    user=root
    passwd=xxx
    max_slave_connections=100%
    enable_root_user=1
    filters=MyMasking
    
    [Read-Write-Listener]
    type=listener
    service=Read-Write-Service
    protocol=MySQLClient
    port=4006
    
    [MaxAdmin-Service]
    type=service
    router=cli
    enable_root_user=1
    
    [MyMasking]
    type=filter
    module=masking
    warn_type_mismatch=always
    large_payload=abort
    rules=/etc/maxscale.modules.d/masking_rules.json
    

    In MaxScale 2.3, there is also a "prevent_function_usage" option, which can be set to "true" or "false". If true, then all statements that contain functions referring to masked columns will be rejected; otherwise they are not. True is the default, so I'll omit this option here, which keeps the config usable for all 2.x MaxScale setups.
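
    To illustrate, with prevent_function_usage left at its default of true, a query that wraps a masked column in a function should be rejected by the filter rather than returning data (connection details are the ones from the example config above):

    # Expected to be rejected by the masking filter, not answered:
    mysql -uroot -pxxx -P4006 --protocol=tcp -e "SELECT CONCAT(SSN) FROM employees.employees;"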

    Now we need to create masking_rules.json (in /etc/maxscale.modules.d/), and we should be all set to start masking.

    chris@chris-linux-laptop-64:/etc/maxscale.modules.d$ cat masking_rules.json
    {
    	"rules": [
    		{
    			"replace": {
    				"column": "SSN"
    			},
    			"with": {
    				"fill": "*"
    			}
    		}
    	]
    }
    

    This is the most basic. In this rule, *any* column named “SSN” in *any* schema will be replaced with all “*”s.

    So, once you’ve made your config change, and created masking_rules.json, it’s time to restart MaxScale so that it reads/loads your new masking filter:

    sudo service maxscale restart
    

    Now for some testing:

    CREATE SCHEMA employees;
    
    USE employees;
    
    CREATE TABLE employees (name char(10), location char(10), SSN char(11), DOB char(10), CCNUM char(16)); 
    
    INSERT INTO employees VALUES ('chris', 'hanger18', '123-45-6789', '07/07/1947', '6011123456789012');
    

    Note that I made DOB a CHAR column so that masking would be applicable as it is not for a DATE column.

    Thus with no masking, we see everything:

    SELECT * FROM employees.employees;
    +-------+----------+-------------+------------+------------------+
    | name  | location | SSN         | DOB        | CCNUM            |
    +-------+----------+-------------+------------+------------------+
    | chris | hanger18 | 123-45-6789 | 07/07/1947 | 6011123456789012 |
    +-------+----------+-------------+------------+------------------+
    

    Now, connect to the service listener, in this case [Read-Write-Listener] running on port 4006:

    mysql -uroot -pxxx -P4006 --protocol=tcp
    
    SELECT * FROM employees.employees;
    +-------+----------+-------------+------------+------------------+
    | name  | location | SSN         | DOB        | CCNUM            |
    +-------+----------+-------------+------------+------------------+
    | chris | hanger18 | *********** | 07/07/1947 | 6011123456789012 |
    +-------+----------+-------------+------------+------------------+
    

    So we successfully ***’ed out SSN. Now, to also handle DOB and CCNUM. So edit the masking_rules.json file to:

    {
    	"rules": [
    		{
    			"replace": {
    				"column": "SSN"
    			},
    			"with": {
    				"fill": "*"
    			}
    		},
    		{
    			"replace": {
    				"column": "DOB"
    			},
    			"with": {
    				"fill": "*"
    			}
    		},
    		{
    			"replace": {
    				"column": "CCNUM"
    			},
    			"with": {
    				"fill": "*"
    			}
    		}
    	]
    }
    

    For the time being, you can still use MaxAdmin to reload the file without having to restart MaxScale (though do note that maxadmin is deprecated in 2.3 and will be removed soon; I suspect all the functionality it provided will be available via maxctrl soon, if not already):

    sudo maxadmin
    MaxScale> call command masking reload MyMasking
    

    Assuming the last command completed without errors, you can now simply re-query (via port 4006). However, first exit the existing connection to port 4006 and then re-connect:

    select * from employees.employees;
    +-------+----------+-------------+------------+------------------+
    | name  | location | SSN         | DOB        | CCNUM            |
    +-------+----------+-------------+------------+------------------+
    | chris | hanger18 | *********** | ********** | **************** |
    +-------+----------+-------------+------------+------------------+
    

    Note: The column names are case-sensitive, so if you have columns like “SSN” and “ssn”, then you will need to add 2 entries to masking_rules.json.

    Here is a table that uses “ssn” instead of “SSN” (everything else is the same):

    CREATE TABLE employees2 (name char(10), location char(10), ssn char(11), DOB char(10), CCNUM char(16));
    
    INSERT INTO employees2 VALUES ('chris', 'hanger18', '123-45-6789', '07/07/1947', '6011123456789012');
    
    SELECT * FROM employees.employees2;
    +-------+----------+-------------+------------+------------------+
    | name  | location | ssn         | DOB        | CCNUM            |
    +-------+----------+-------------+------------+------------------+
    | chris | hanger18 | 123-45-6789 | ********** | **************** |
    +-------+----------+-------------+------------+------------------+
    

    As you can see, "ssn" is not masked, but DOB and CCNUM still are. So let's add a section for "ssn" in masking_rules.json:

    {
    	"rules": [
    		{
    			"replace": {
    				"column": "SSN"
    			},
    			"with": {
    				"fill": "*"
    			}
    		},
    		{
    			"replace": {
    				"column": "ssn"
    			},
    			"with": {
    				"fill": "*"
    			}
    		},
    		{
    			"replace": {
    				"column": "DOB"
    			},
    			"with": {
    				"fill": "*"
    			}
    		},
    		{
    			"replace": {
    				"column": "CCNUM"
    			},
    			"with": {
    				"fill": "*"
    			}
    		}
    	]
    }
    

    Then reload the file:

    sudo maxadmin
    MaxScale> call command masking reload MyMasking
    

    And then exit port 4006 and re-connect, and re-issue the query:

    SELECT * FROM employees.employees2;
    +-------+----------+-------------+------------+------------------+
    | name  | location | ssn         | DOB        | CCNUM            |
    +-------+----------+-------------+------------+------------------+
    | chris | hanger18 | *********** | ********** | **************** |
    +-------+----------+-------------+------------+------------------+
    

    There we have it.

    And again, you have many more options when it comes to your string replacements, matching, fill, values, obfuscation, pcre2 regex, and so forth. I’ll leave you to the manual page to investigate those options if you wish.

    All in all, I hope this is helpful for anyone wanting to get started using MaxScale’s masking filter.

    by chris at February 22, 2019 10:48 AM

    MariaDB Foundation

    MariaDB 10.3.13 and MariaDB Connector/C 3.0.9 now available

    The MariaDB Foundation is pleased to announce the availability of MariaDB 10.3.13, the latest stable release in the MariaDB 10.3 series, as well as MariaDB Connector/C 3.0.9, the latest stable release in the MariaDB Connector/C series. See the release notes and changelogs for details. Download MariaDB 10.3.13 Release Notes Changelog What is MariaDB 10.3? MariaDB […]

    The post MariaDB 10.3.13 and MariaDB Connector/C 3.0.9 now available appeared first on MariaDB.org.

    by Ian Gilfillan at February 22, 2019 02:21 AM

    February 21, 2019

    Peter Zaitsev

    Percona Server for MongoDB Operator 0.2.1 Early Access Release Is Now Available

    Percona announces the availability of the Percona Server for MongoDB Operator 0.2.1 early access release.

    The Percona Server for MongoDB Operator simplifies the deployment and management of Percona Server for MongoDB in a Kubernetes or OpenShift environment. It extends the Kubernetes API with a new custom resource for deploying, configuring and managing the application through the whole life cycle.

    Note: PerconaLabs is one of the open source GitHub repositories for unofficial scripts and tools created by Percona staff. These handy utilities can help save your time and effort.

    Percona software builds located in the Percona repository are not officially released software, and also aren’t covered by Percona support or services agreements.

    You can install the Percona Server for MongoDB Operator on Kubernetes or OpenShift. While the operator does not support all the Percona Server for MongoDB features in this early access release, instructions on how to install and configure it are already available along with the operator source code in our Github repository.

    The Percona Server for MongoDB Operator on Percona-Lab is an early access release. Percona doesn’t recommend it for production environments.

    Improvements

    • Backups to S3 compatible storages
    • CLOUD-117: An error-proofing functionality was included in this release. It doesn’t allow unsafe configurations by default, preventing users from configuring a cluster with more than one Arbiter node or a Replica Set with fewer than three nodes.
      • For those who still need such configurations, this protection can be disabled by setting allowUnsafeConfigurations=true in the deploy/cr.yaml file.

    Fixed Bugs

    • CLOUD-105: The Service-per-Pod feature used with the LoadBalancer didn’t work with cluster sizes not equal to 1.
    • CLOUD-137: PVC assigned to the Arbiter Pod had the same size as PVC of the regular Percona Server for MongoDB Pods, despite the fact that Arbiter doesn’t store data.

    Percona Server for MongoDB is an enhanced, open source and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB Community Edition. It supports MongoDB protocols and drivers. Percona Server for MongoDB extends MongoDB Community Edition functionality by including the Percona Memory Engine, as well as several enterprise-grade features. It requires no changes to MongoDB applications or code.

    Help us improve our software quality by reporting any bugs you encounter using our bug tracking system.

    by Dmitriy Kostiuk at February 21, 2019 09:48 PM

    MySQL 8 is not always faster than MySQL 5.7

    MySQL 8.0.15 performs worse in sysbench oltp_read_write than MySQL 5.7.25

    Initially I was testing group replication performance and was puzzled why MySQL 8.0.15 performs consistently worse than MySQL 5.7.25.

    It appears that a single server instance is affected by a performance degradation.

    My testing setup

    Hardware details:
    Bare metal server provided by packet.net, instance size: c2.medium.x86
    24 Physical Cores @ 2.2 GHz
    (1 X AMD EPYC 7401P)
    Memory: 64 GB of ECC RAM

    Storage : INTEL® SSD DC S4500, 480GB

    This is a server grade SATA SSD.

    Benchmark

    sysbench oltp_read_write --report-interval=1 --time=1800 --threads=24 --tables=10 --table-size=10000000 --mysql-user=root --mysql-socket=/tmp/mysql.sock run
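
    For completeness: the tables used by this run would have been created beforehand with sysbench’s prepare step. The exact invocation below is a sketch under that assumption (reusing the connection parameters from the run command), not something quoted from the original setup:

    sysbench oltp_read_write --tables=10 --table-size=10000000 --mysql-user=root --mysql-socket=/tmp/mysql.sock prepare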

    In the following summary I used these combinations:

    • innodb_flush_log_at_trx_commit=0 or 1
    • Binlog: off or on
    • sync_binlog=1000 or sync_binlog=1

    The summary table, the number are transactions per second (tps – the more the better)

    +-------------------------------------------+--------------+--------------+-------+
    | case                                      | MySQL 5.7.25 | MySQL 8.0.15 | ratio |
    +-------------------------------------------+--------------+--------------+-------+
    | trx_commit=0, binlog=off                  | 11402 tps    | 9840(*)      | 1.16  |
    +-------------------------------------------+--------------+--------------+-------+
    | trx_commit=1, binlog=off                  | 8375         | 7974         | 1.05  |
    +-------------------------------------------+--------------+--------------+-------+
    | trx_commit=0, binlog=on, sync_binlog=1000 | 10862        | 8871         | 1.22  |
    +-------------------------------------------+--------------+--------------+-------+
    | trx_commit=0, binlog=on, sync_binlog=1    | 7238         | 6459         | 1.12  |
    +-------------------------------------------+--------------+--------------+-------+
    | trx_commit=1, binlog=on, sync_binlog=1    | 5970         | 5043         | 1.18  |
    +-------------------------------------------+--------------+--------------+-------+

    Summary: MySQL 8.0.15 is consistently worse than MySQL 5.7.25.

    In the worst case, with trx_commit=0 and sync_binlog=1000, it is worse by 22%, which is huge.

    I was looking to use these settings for group replication testing, but these settings, when used with MySQL 8.0.15, provide much worse results than I had with MySQL 5.7.25

    (*)  in the case of trx_commit=0, binlog=off, MySQL 5.7.25 performance is very stable, and practically stays at the 11400 tps level. MySQL 8.0.15 varies a lot from 8758 tps to 10299 tps in 1 second resolution measurements

    Update:

    To clarify some comments, I’ve used latin1 CHARSET in this benchmark for both MySQL 5.7 and MySQL 8.0

    Appendix:

    [mysqld]
    datadir= /mnt/data/mysql
    socket=/tmp/mysql.sock
    ssl=0
    #innodb-encrypt-tables=ON
    character_set_server=latin1
    collation_server=latin1_swedish_ci
    skip-character-set-client-handshake
    #skip-log-bin
    log-error=error.log
    log_bin = binlog
    relay_log=relay
    sync_binlog=1000
    binlog_format = ROW
    binlog_row_image=MINIMAL
    server-id=1
    # Disabling symbolic-links is recommended to prevent assorted security risks
    symbolic-links=0
    # Recommended in standard MySQL setup
    # general
     table_open_cache = 200000
     table_open_cache_instances=64
     back_log=3500
     max_connections=4000
    # files
     innodb_file_per_table
     innodb_log_file_size=15G
     innodb_log_files_in_group=2
     innodb_open_files=4000
    # buffers
     innodb_buffer_pool_size= 40G
     innodb_buffer_pool_instances=8
     innodb_log_buffer_size=64M
    # tune
     innodb_doublewrite= 1
     innodb_thread_concurrency=0
     innodb_flush_log_at_trx_commit= 0
     innodb_flush_method=O_DIRECT_NO_FSYNC
     innodb_max_dirty_pages_pct=90
     innodb_max_dirty_pages_pct_lwm=10
     innodb_lru_scan_depth=2048
     innodb_page_cleaners=4
     join_buffer_size=256K
     sort_buffer_size=256K
     innodb_use_native_aio=1
     innodb_stats_persistent = 1
     #innodb_spin_wait_delay=96
    # perf special
     innodb_adaptive_flushing = 1
     innodb_flush_neighbors = 0
     innodb_read_io_threads = 16
     innodb_write_io_threads = 16
     innodb_io_capacity=1500
     innodb_io_capacity_max=2500
     innodb_purge_threads=4
     innodb_adaptive_hash_index=0
    max_prepared_stmt_count=1000000


    Photo by Suzy Hazelwood from Pexels

     

    by Vadim Tkachenko at February 21, 2019 06:10 PM

    Parallel queries in PostgreSQL

    Modern CPU models have a huge number of cores. For many years, applications have been sending queries in parallel to databases. Where there are reporting queries that deal with many table rows, the ability for a query to use multiple CPUs helps us with a faster execution. Parallel queries in PostgreSQL allow us to utilize many CPUs to finish report queries faster. The parallel queries feature was first implemented in PostgreSQL 9.6: starting from that release, a report query is able to use many CPUs and finish faster.

    The initial implementation of the parallel queries execution took three years. Parallel support requires code changes in many query execution stages. PostgreSQL 9.6 created an infrastructure for further code improvements. Later versions extended parallel execution support for other query types.

    Limitations

    • Do not enable parallel executions if all CPU cores are already saturated. Parallel execution steals CPU time from other queries, and increases response time.
    • Most importantly, parallel processing significantly increases memory usage with high WORK_MEM values, as each hash join or sort operation takes a work_mem amount of memory (see the arithmetic sketch after this list).
    • Next, low latency OLTP queries can’t be made any faster with parallel execution. In particular, queries that return a single row can perform badly when parallel execution is enabled.
    • The Pierian spring for developers is a TPC-H benchmark. Check if you have similar queries for the best parallel execution.
    • Parallel execution supports only SELECT queries without lock predicates.
    • Proper indexing might be a better alternative to a parallel sequential table scan.
    • There is no support for cursors or suspended queries.
    • Windowed functions and ordered-set aggregate functions are non-parallel.
    • There is no benefit for an IO-bound workload.
    • There are no parallel sort algorithms. However, queries with sorts still can be parallel in some aspects.
    • Replace CTE (WITH …) with a sub-select to support parallel execution.
    • Foreign data wrappers do not currently support parallel execution (but they could!)
    • There is no support for FULL OUTER JOIN.
    • Clients setting max_rows disable parallel execution.
    • If a query uses a function that is not marked as PARALLEL SAFE, it will be single-threaded.
    • SERIALIZABLE transaction isolation level disables parallel execution.
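
    As a rough illustration of the memory arithmetic behind the WORK_MEM point above (the numbers are assumptions for the sake of the example, not measurements):

    -- with these settings, a plan containing two hash joins runs in 5 processes
    -- (4 workers + the leader), and each process may allocate work_mem per hash join:
    -- 64MB * 5 processes * 2 hash joins = roughly 640MB in the worst case
    SET work_mem = '64MB';
    SET max_parallel_workers_per_gather = 4;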

    Test environment

    The PostgreSQL development team have tried to improve TPC-H benchmark queries’ response time. You can download the benchmark and adapt it to PostgreSQL by using these instructions. It’s not an official way to use the TPC-H benchmark, so you shouldn’t use it to compare different databases or hardware.

    1. Download TPC-H_Tools_v2.17.3.zip (or newer version) from official TPC site.
    2. Rename makefile.suite to Makefile and modify it as requested at https://github.com/tvondra/pg_tpch . Compile the code with make command
    3. Generate data: ./dbgen -s 10 generates 23GB database which is enough to see the difference in performance for parallel and non-parallel queries.
    4. Convert tbl files to csv with for + sed (see the sketch after this list)
    5. Clone pg_tpch repository and copy csv files to pg_tpch/dss/data
    6. Generate queries with qgen command
    7. Load data to the database with ./tpch.sh command.
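
    A minimal sketch of step 4, assuming the .tbl files produced by dbgen are in the current directory (pg_tpch expects the trailing '|' delimiter removed):

    # strip the trailing '|' that dbgen appends and rename .tbl to .csv
    for f in *.tbl; do
        sed 's/|$//' "$f" > "${f%.tbl}.csv"
    done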

    Parallel sequential scan

    A parallel sequential scan might be faster not because of parallel reads, but due to the scattering of data processing across many CPU cores. Modern OS provides good caching for PostgreSQL data files. Read-ahead allows getting more blocks from storage than just the block requested by the PG daemon. As a result, query performance is not limited by disk IO. It consumes CPU cycles for:

    • reading rows one by one from table data pages
    • comparing row values and WHERE conditions

    Let’s try to execute a simple select query:

    tpch=# explain analyze select l_quantity as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day;
    QUERY PLAN
    --------------------------------------------------------------------------------------------------------------------------
    Seq Scan on lineitem (cost=0.00..1964772.00 rows=58856235 width=5) (actual time=0.014..16951.669 rows=58839715 loops=1)
    Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)
    Rows Removed by Filter: 1146337
    Planning Time: 0.203 ms
    Execution Time: 19035.100 ms

    Without aggregation, a sequential scan produces too many rows, so the query is executed by a single CPU core.

    After adding SUM(), it’s clear that two workers will help us make the query faster:

    explain analyze select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day;
    QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------------------------------
    Finalize Aggregate (cost=1589702.14..1589702.15 rows=1 width=32) (actual time=8553.365..8553.365 rows=1 loops=1)
    -> Gather (cost=1589701.91..1589702.12 rows=2 width=32) (actual time=8553.241..8555.067 rows=3 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    -> Partial Aggregate (cost=1588701.91..1588701.92 rows=1 width=32) (actual time=8547.546..8547.546 rows=1 loops=3)
    -> Parallel Seq Scan on lineitem (cost=0.00..1527393.33 rows=24523431 width=5) (actual time=0.038..5998.417 rows=19613238 loops=3)
    Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)
    Rows Removed by Filter: 382112
    Planning Time: 0.241 ms
    Execution Time: 8555.131 ms

    The more complex query is 2.2X faster compared to the plain, single-threaded select.

    Parallel Aggregation

    A “Parallel Seq Scan” node produces rows for partial aggregation. A “Partial Aggregate” node reduces these rows with SUM(). At the end, the SUM counter from each worker is collected by the “Gather” node.

    The final result is calculated by the “Finalize Aggregate” node. If you have your own aggregation functions, do not forget to mark them as “parallel safe”.
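
    As a minimal sketch of what “parallel safe” marking looks like, here is a made-up helper function (is_urgent is not part of the benchmark schema); for a full user-defined aggregate the corresponding option is PARALLEL = SAFE in CREATE AGGREGATE:

    -- an illustrative SQL function explicitly marked as safe for parallel workers
    CREATE OR REPLACE FUNCTION is_urgent(priority text) RETURNS int
        LANGUAGE sql IMMUTABLE PARALLEL SAFE
        AS $$ SELECT CASE WHEN priority IN ('1-URGENT', '2-HIGH') THEN 1 ELSE 0 END $$;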

    Number of workers

    We can increase the number of workers without server restart:

    alter system set max_parallel_workers_per_gather=4;
    select * from pg_reload_conf();

    Now, there are 4 workers in the explain output:

    tpch=# explain analyze select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day;
    QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------------------------------
    Finalize Aggregate (cost=1440213.58..1440213.59 rows=1 width=32) (actual time=5152.072..5152.072 rows=1 loops=1)
    -> Gather (cost=1440213.15..1440213.56 rows=4 width=32) (actual time=5151.807..5153.900 rows=5 loops=1)
    Workers Planned: 4
    Workers Launched: 4
    -> Partial Aggregate (cost=1439213.15..1439213.16 rows=1 width=32) (actual time=5147.238..5147.239 rows=1 loops=5)
    -> Parallel Seq Scan on lineitem (cost=0.00..1402428.00 rows=14714059 width=5) (actual time=0.037..3601.882 rows=11767943 loops=5)
    Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)
    Rows Removed by Filter: 229267
    Planning Time: 0.218 ms
    Execution Time: 5153.967 ms

    What’s happening here? We have changed the number of workers from 2 to 4, but the query became only 1.6599 times faster. Actually, scaling is amazing. We had two workers plus one leader. After a configuration change, it becomes 4+1.

    The biggest improvement from parallel execution that we can achieve is: 5/3 = 1.66(6)X faster.

    How does it work?

    Processes

    Query execution always starts in the “leader” process. A leader executes all non-parallel activity and its own contribution to parallel processing. Other processes executing the same queries are called “worker” processes. Parallel execution utilizes the Dynamic Background Workers infrastructure (added in 9.4). As other parts of PostgreSQL uses processes, but not threads, the query creating three worker processes could be 4X faster than the traditional execution.

    Communication

    Workers communicate with the leader using a message queue (based on shared memory). Each process has two queues: one for errors and the second one for tuples.

    How many workers to use?

    Firstly, the max_parallel_workers_per_gather parameter is the smallest limit on the number of workers. Secondly, the query executor takes workers from the pool limited by max_parallel_workers size. Finally, the top-level limit is max_worker_processes: the total number of background processes.
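
    To see the three limits currently in effect on a given server (on PostgreSQL 10 or later, where max_parallel_workers exists), something like this works:

    SHOW max_parallel_workers_per_gather;  -- per-Gather-node limit
    SHOW max_parallel_workers;             -- pool of parallel workers
    SHOW max_worker_processes;             -- all background worker processes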

    Failed worker allocation leads to single-process execution.

    The query planner could consider decreasing the number of workers based on a table or index size. min_parallel_table_scan_size and min_parallel_index_scan_size control this behavior.

    set min_parallel_table_scan_size='8MB'
    8MB table => 1 worker
    24MB table => 2 workers
    72MB table => 3 workers
    x => log(x / min_parallel_table_scan_size) / log(3) + 1 worker

    Each time the table is 3X bigger than min_parallel_(index|table)_scan_size, postgres adds a worker. The number of workers is not cost-based! A circular dependency makes a complex implementation hard. Instead, the planner uses simple rules.

    In practice, these rules are not always acceptable in production and you can override the number of workers for the specific table with ALTER TABLE … SET (parallel_workers = N).
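
    For example, to pin the lineitem table used above to four workers regardless of its size, and to later return to the size-based default (a sketch, assuming that table exists in your database):

    ALTER TABLE lineitem SET (parallel_workers = 4);
    ALTER TABLE lineitem RESET (parallel_workers);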

    Why parallel execution is not used?

    Besides the long list of parallel execution limitations, PostgreSQL checks costs:

    parallel_setup_cost: avoids parallel execution for short queries. It models the time spent on memory setup, process start, and initial communication.

    parallel_tuple_cost: the communication between the leader and workers could take a long time. The time is proportional to the number of tuples sent by workers. This parameter models the communication cost.

    Nested loop joins

    PostgreSQL 9.6+ could execute a “Nested loop” in parallel due to the simplicity of the operation.

    explain (costs off) select c_custkey, count(o_orderkey)
                    from    customer left outer join orders on
                                    c_custkey = o_custkey and o_comment not like '%special%deposits%'
                    group by c_custkey;
                                          QUERY PLAN
    --------------------------------------------------------------------------------------
     Finalize GroupAggregate
       Group Key: customer.c_custkey
       ->  Gather Merge
             Workers Planned: 4
             ->  Partial GroupAggregate
                   Group Key: customer.c_custkey
                   ->  Nested Loop Left Join
                         ->  Parallel Index Only Scan using customer_pkey on customer
                         ->  Index Scan using idx_orders_custkey on orders
                               Index Cond: (customer.c_custkey = o_custkey)
                               Filter: ((o_comment)::text !~~ '%special%deposits%'::text)

    Gather happens in the last stage, so “Nested Loop Left Join” is a parallel operation. “Parallel Index Only Scan” is available from version 10. It acts in a similar way to a parallel sequential scan. The c_custkey = o_custkey condition reads a single order for each customer row, so it is not parallel.

    Hash Join

    Until PostgreSQL 11, each worker builds its own hash table. As a result, 4+ workers weren’t able to improve performance. The new implementation uses a shared hash table. Each worker can utilize WORK_MEM to build the hash table.

    select
            l_shipmode,
            sum(case
                    when o_orderpriority = '1-URGENT'
                            or o_orderpriority = '2-HIGH'
                            then 1
                    else 0
            end) as high_line_count,
            sum(case
                    when o_orderpriority <> '1-URGENT'
                            and o_orderpriority <> '2-HIGH'
                            then 1
                    else 0
            end) as low_line_count
    from
            orders,
            lineitem
    where
            o_orderkey = l_orderkey
            and l_shipmode in ('MAIL', 'AIR')
            and l_commitdate < l_receiptdate
            and l_shipdate < l_commitdate
            and l_receiptdate >= date '1996-01-01'
            and l_receiptdate < date '1996-01-01' + interval '1' year
    group by
            l_shipmode
    order by
            l_shipmode
    LIMIT 1;
                                                                                                                                        QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=1964755.66..1964961.44 rows=1 width=27) (actual time=7579.592..7922.997 rows=1 loops=1)
       ->  Finalize GroupAggregate  (cost=1964755.66..1966196.11 rows=7 width=27) (actual time=7579.590..7579.591 rows=1 loops=1)
             Group Key: lineitem.l_shipmode
             ->  Gather Merge  (cost=1964755.66..1966195.83 rows=28 width=27) (actual time=7559.593..7922.319 rows=6 loops=1)
                   Workers Planned: 4
                   Workers Launched: 4
                   ->  Partial GroupAggregate  (cost=1963755.61..1965192.44 rows=7 width=27) (actual time=7548.103..7564.592 rows=2 loops=5)
                         Group Key: lineitem.l_shipmode
                         ->  Sort  (cost=1963755.61..1963935.20 rows=71838 width=27) (actual time=7530.280..7539.688 rows=62519 loops=5)
                               Sort Key: lineitem.l_shipmode
                               Sort Method: external merge  Disk: 2304kB
                               Worker 0:  Sort Method: external merge  Disk: 2064kB
                               Worker 1:  Sort Method: external merge  Disk: 2384kB
                               Worker 2:  Sort Method: external merge  Disk: 2264kB
                               Worker 3:  Sort Method: external merge  Disk: 2336kB
                               ->  Parallel Hash Join  (cost=382571.01..1957960.99 rows=71838 width=27) (actual time=7036.917..7499.692 rows=62519 loops=5)
                                     Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
                                     ->  Parallel Seq Scan on lineitem  (cost=0.00..1552386.40 rows=71838 width=19) (actual time=0.583..4901.063 rows=62519 loops=5)
                                           Filter: ((l_shipmode = ANY ('{MAIL,AIR}'::bpchar[])) AND (l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate) AND (l_receiptdate >= '1996-01-01'::date) AND (l_receiptdate < '1997-01-01 00:00:00'::timestamp without time zone))
                                           Rows Removed by Filter: 11934691
                                     ->  Parallel Hash  (cost=313722.45..313722.45 rows=3750045 width=20) (actual time=2011.518..2011.518 rows=3000000 loops=5)
                                           Buckets: 65536  Batches: 256  Memory Usage: 3840kB
                                           ->  Parallel Seq Scan on orders  (cost=0.00..313722.45 rows=3750045 width=20) (actual time=0.029..995.948 rows=3000000 loops=5)
     Planning Time: 0.977 ms
     Execution Time: 7923.770 ms

    Query 12 from TPC-H is a good illustration for a parallel hash join. Each worker helps to build a shared hash table.

    Merge Join

    Due to the nature of merge join it’s not possible to make it parallel. Don’t worry if it’s the last stage of the query execution: you can still see parallel execution for queries with a merge join.

    -- Query 2 from TPC-H
    explain (costs off) select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
    from    part, supplier, partsupp, nation, region
    where
            p_partkey = ps_partkey
            and s_suppkey = ps_suppkey
            and p_size = 36
            and p_type like '%BRASS'
            and s_nationkey = n_nationkey
            and n_regionkey = r_regionkey
            and r_name = 'AMERICA'
            and ps_supplycost = (
                    select
                            min(ps_supplycost)
                    from    partsupp, supplier, nation, region
                    where
                            p_partkey = ps_partkey
                            and s_suppkey = ps_suppkey
                            and s_nationkey = n_nationkey
                            and n_regionkey = r_regionkey
                            and r_name = 'AMERICA'
            )
    order by s_acctbal desc, n_name, s_name, p_partkey
    LIMIT 100;
                                                    QUERY PLAN
    ----------------------------------------------------------------------------------------------------------
     Limit
       ->  Sort
             Sort Key: supplier.s_acctbal DESC, nation.n_name, supplier.s_name, part.p_partkey
             ->  Merge Join
                   Merge Cond: (part.p_partkey = partsupp.ps_partkey)
                   Join Filter: (partsupp.ps_supplycost = (SubPlan 1))
                   ->  Gather Merge
                         Workers Planned: 4
                         ->  Parallel Index Scan using part_pkey on part
                               Filter: (((p_type)::text ~~ '%BRASS'::text) AND (p_size = 36))
                   ->  Materialize
                         ->  Sort
                               Sort Key: partsupp.ps_partkey
                               ->  Nested Loop
                                     ->  Nested Loop
                                           Join Filter: (nation.n_regionkey = region.r_regionkey)
                                           ->  Seq Scan on region
                                                 Filter: (r_name = 'AMERICA'::bpchar)
                                           ->  Hash Join
                                                 Hash Cond: (supplier.s_nationkey = nation.n_nationkey)
                                                 ->  Seq Scan on supplier
                                                 ->  Hash
                                                       ->  Seq Scan on nation
                                     ->  Index Scan using idx_partsupp_suppkey on partsupp
                                           Index Cond: (ps_suppkey = supplier.s_suppkey)
                   SubPlan 1
                     ->  Aggregate
                           ->  Nested Loop
                                 Join Filter: (nation_1.n_regionkey = region_1.r_regionkey)
                                 ->  Seq Scan on region region_1
                                       Filter: (r_name = 'AMERICA'::bpchar)
                                 ->  Nested Loop
                                       ->  Nested Loop
                                             ->  Index Scan using idx_partsupp_partkey on partsupp partsupp_1
                                                   Index Cond: (part.p_partkey = ps_partkey)
                                             ->  Index Scan using supplier_pkey on supplier supplier_1
                                                   Index Cond: (s_suppkey = partsupp_1.ps_suppkey)
                                       ->  Index Scan using nation_pkey on nation nation_1
                                             Index Cond: (n_nationkey = supplier_1.s_nationkey)

    The “Merge Join” node is above “Gather Merge”. Thus merge is not using parallel execution. But the “Parallel Index Scan” node still helps with the part_pkey segment.

    Partition-wise join

    PostgreSQL 11 disables the partition-wise join feature by default. Partition-wise join has a high planning cost. Joins for similarly partitioned tables could be done partition-by-partition. This allows postgres to use smaller hash tables. Each per-partition join operation could be executed in parallel.

    tpch=# set enable_partitionwise_join=t;
    tpch=# explain (costs off) select * from prt1 t1, prt2 t2
    where t1.a = t2.b and t1.b = 0 and t2.b between 0 and 10000;
                        QUERY PLAN
    ---------------------------------------------------
     Append
       ->  Hash Join
             Hash Cond: (t2.b = t1.a)
             ->  Seq Scan on prt2_p1 t2
                   Filter: ((b >= 0) AND (b <= 10000))
             ->  Hash
                   ->  Seq Scan on prt1_p1 t1
                         Filter: (b = 0)
       ->  Hash Join
             Hash Cond: (t2_1.b = t1_1.a)
             ->  Seq Scan on prt2_p2 t2_1
                   Filter: ((b >= 0) AND (b <= 10000))
             ->  Hash
                   ->  Seq Scan on prt1_p2 t1_1
                         Filter: (b = 0)
    tpch=# set parallel_setup_cost = 1;
    tpch=# set parallel_tuple_cost = 0.01;
    tpch=# explain (costs off) select * from prt1 t1, prt2 t2
    where t1.a = t2.b and t1.b = 0 and t2.b between 0 and 10000;
                            QUERY PLAN
    -----------------------------------------------------------
     Gather
       Workers Planned: 4
       ->  Parallel Append
             ->  Parallel Hash Join
                   Hash Cond: (t2_1.b = t1_1.a)
                   ->  Parallel Seq Scan on prt2_p2 t2_1
                         Filter: ((b >= 0) AND (b <= 10000))
                   ->  Parallel Hash
                         ->  Parallel Seq Scan on prt1_p2 t1_1
                               Filter: (b = 0)
             ->  Parallel Hash Join
                   Hash Cond: (t2.b = t1.a)
                   ->  Parallel Seq Scan on prt2_p1 t2
                         Filter: ((b >= 0) AND (b <= 10000))
                   ->  Parallel Hash
                         ->  Parallel Seq Scan on prt1_p1 t1
                               Filter: (b = 0)

    Above all, a partition-wise join can use parallel execution only if partitions are big enough.

    Parallel Append

    Parallel Append distributes whole sub-plans across different workers, instead of distributing different blocks of the same relation between workers. Usually, you can see this with UNION ALL queries. The drawback is less parallelism, because every worker ultimately works on a single sub-plan.

    There are just two workers launched even with four workers enabled.

    tpch=# explain (costs off) select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '1998-12-01' - interval '105' day union all select sum(l_quantity) as sum_qty from lineitem where l_shipdate <= date '2000-12-01' - interval '105' day;
                                               QUERY PLAN
    ------------------------------------------------------------------------------------------------
     Gather
       Workers Planned: 2
       ->  Parallel Append
             ->  Aggregate
                   ->  Seq Scan on lineitem
                         Filter: (l_shipdate <= '2000-08-18 00:00:00'::timestamp without time zone)
             ->  Aggregate
                   ->  Seq Scan on lineitem lineitem_1
                         Filter: (l_shipdate <= '1998-08-18 00:00:00'::timestamp without time zone)

    Most important variables

    • WORK_MEM limits the memory usage of each process! Not just for queries: work_mem * processes * joins => could lead to significant memory usage.
    • max_parallel_workers_per_gather  – how many workers an executor will use for the parallel execution of a planner node
    • max_worker_processes – adapt the total number of workers to the number of CPU cores installed on a server
    • max_parallel_workers – same for the number of parallel workers

    Summary

    Starting from 9.6, parallel query execution can significantly improve the performance of complex queries that scan many rows or index records. In PostgreSQL 10, parallel execution was enabled by default. Do not forget to disable parallel execution on servers with a heavy OLTP workload. Sequential scans or index scans still consume a significant amount of resources. If you are not running a report against the whole dataset, you may improve query performance just by adding missing indexes or by using proper partitioning.
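
    If you do need to switch parallel execution off on a busy OLTP server, one way to do it instance-wide (an assumption about the desired scope; it can also be set per session or per role) is:

    ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
    SELECT pg_reload_conf();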

    Image compiled from photos by Nathan Gonthier and Pavel Nekoranec on Unsplash

    by Nickolay Ihalainen at February 21, 2019 02:05 PM

    February 20, 2019

    Henrik Ingo

    20 years later, what's left of the CAP theorem?

    The CAP theorem was published in (party like it's...) 1999: Fox Armando, Brewer Eric A: Harvest, Yield, and Scalable Tolerant Systems.

    Since its publication it has provided a beacon and rallying cry around which web scale distributed databases could be built and debated. It(s interpretation) has also evolved. Quite quickly the original 1999 formulation was abandoned, and from there it has further eroded as real world database implementations have provided ever finer grained trade-offs for navigating the space that - after all - was correctly mapped out by the CAP theorem.

    Pick ANY two? Really?

    read more

    by hingo at February 20, 2019 09:20 PM

    Peter Zaitsev

    Percona Monitoring and Management (PMM) 1.17.1 Is Now Available

    Percona Monitoring and Management (PMM) is a free and open-source platform for managing and monitoring MySQL®, MongoDB®, and PostgreSQL performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL®, MongoDB®, and PostgreSQL® servers to ensure that your data works as efficiently as possible.

    In this release, we are introducing support for detection of our upcoming PMM 2.0 release in order to avoid potential version conflicts in the future, as PMM 1.x will not be compatible with PMM 2.x.

    Another improvement in this release is that we have updated the tooltips for the MySQL Query Response Time dashboard, providing a description of what the graphs display, along with links to related documentation resources. An example of Tooltips in action:

    The PMM 1.17.1 release provides fixes for the CVE-2018-16492 and CVE-2018-16487 vulnerabilities, related to Node.js modules. The authentication system used in PMM is not susceptible to the attacks described in these CVE reports. PMM does not use client-side data objects to control user access.

    In release 1.17.1 we have included two improvements and fixed nine bugs.

    Improvements

    • PMM-1339: Improve tooltips for MySQL Query Response Time dashboard
    • PMM-3477: Add Ubuntu 18.10 support

    Fixed Bugs

    • PMM-3471: Fix global status metric names in mysqld_exporter for MySQL 8.0 compatibility
    • PMM-3400: Duplicate column in the Query Analytics dashboard Explain section
    • PMM-3353: postgres_exporter does not work with PostgreSQL 11
    • PMM-3188: Duplicate data on Amazon RDS / Aurora MySQL Metrics dashboard
    • PMM-2615: Fix wrong formatting in log which appears if pmm-qan-agent process fails to start
    • PMM-2592: MySQL Replication Dashboard shows error with multi-source replication
    • PMM-2327: Member State Uptime and Max Member Ping time charts on the MongoDB ReplSet dashboard return an error
    • PMM-955: Fix format of User Time and CPU Time Graphs on MySQL User Statistics dashboard
    • PMM-3522: CVE-2018-16492 and CVE-2018-16487

    Help us improve our software quality by reporting any Percona Monitoring and Management bugs you encounter using our bug tracking system.

    by Dmitriy Kostiuk at February 20, 2019 03:11 PM

    ProxySQL Native Support for Percona XtraDB Cluster (PXC)

    ProxySQL in its versions up to 1.x did not natively support Percona XtraDB Cluster (PXC). Instead, it relied on the flexibility offered by the scheduler. This approach allowed users to implement their own preferred way to manage the ProxySQL behaviour in relation to the Galera events.

    From version 2.0 we can use native ProxySQL support for PXC. The mechanism to activate native support is very similar to the one already in place for group replication.

    In brief it is based on the table [runtime_]mysql_galera_hostgroups and the information needed is mostly the same:

    • writer_hostgroup: the hostgroup ID that refers to the WRITER
    • backup_writer_hostgroup: the hostgroup ID referring to the hostgroup that will contain the candidate servers
    • reader_hostgroup: the reader hostgroup ID, containing the list of servers that need to be taken into consideration
    • offline_hostgroup: The Hostgroup ID that will eventually contain the writer that will be put OFFLINE
    • active: True[1]/False[0] if this configuration needs to be used or not
    • max_writers: This will contain the MAX number of writers you want to have at the same time. In a sane setup this should be always 1, but if you want to have multiple writers, you can define it up to the number of nodes.
    • writer_is_also_reader: If true [1] the Writer will NOT be removed from the reader HG
    • max_transactions_behind: The number of wsrep_local_recv_queue after which the node will be set OFFLINE. This must be carefully set, observing the node behaviour.
    • comment: I suggest putting some meaningful notes here to identify what is what.

    Given the above let us see what we need to do in order to have a working galera native solution.
    I will have three Servers:

    192.168.1.205 (Node1)
      192.168.1.21  (Node2)
      192.168.1.231 (node3)

    As set of Hostgroup, I will have:

    Writer  HG-> 100
    Reader  HG-> 101
    BackupW HG-> 102
    offHG   HG-> 9101

    To set it up

    Servers first:

    INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,1000);
    INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,1000);
    INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,1000);

    Then the galera settings:

    insert into mysql_galera_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) values (100,102,101,9101,0,1,1,16);

    As usual if we want to have R/W split we need to define the rules for it:

    insert into mysql_query_rules (rule_id,proxy_port,schemaname,username,destination_hostgroup,active,retries,match_digest,apply) values(1040,6033,'windmills','app_test',100,1,3,'^SELECT.*FOR UPDATE',1);
    insert into mysql_query_rules (rule_id,proxy_port,schemaname,username,destination_hostgroup,active,retries,match_digest,apply) values(1041,6033,'windmills','app_test',101,1,3,'^SELECT.*@@',1);
    save mysql query rules to disk;
    load mysql query rules to run;

    Then another important variable… the server version. Please do yourself a good service and NEVER use the default.

    update global_variables set variable_value='5.7.0' where variable_name='mysql-server_version';
    LOAD MYSQL VARIABLES TO RUNTIME;SAVE MYSQL VARIABLES TO DISK;

    Finally activate the whole thing:

    save mysql servers to disk;
    load mysql servers to runtime;

    One thing to note before we go ahead. In the list of servers I had:

    1. Filled only the READER HG
    2. Used the same weight

    This because of the election mechanism ProxySQL will use to identify the writer, and the (many) problems that may be attached to it.

    For now let us go ahead and see what happens when I load this information to runtime.

    Before running the above commands:

    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

    After:

    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | 1000   | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 501        |
    | 1000   | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 501        |
    | 1000   | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 546        |
    | 1000   | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 467        |
    | 1000   | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 546        |
    | 1000   | 102       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 467        |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    mysql> select * from runtime_mysql_galera_hostgroups \G
    *************************** 1. row ***************************
           writer_hostgroup: 100
    backup_writer_hostgroup: 102
           reader_hostgroup: 101
          offline_hostgroup: 9101
                    active: 0  <----------- note this
                max_writers: 1
      writer_is_also_reader: 1
    max_transactions_behind: 16
                    comment: NULL
    1 row in set (0.01 sec)

    As we can see, ProxySQL has taken the nodes from my READER group and distributed them across the writer and backup_writer hostgroups.

    But – there is a but – wasn’t my rule set with Active=0? Indeed it was, and I assume this is a bug (#Issue 1902).

    The other thing we should note is that ProxySQL has elected node 3 (192.168.1.231) as the writer.
    As I said before, what should we do if we want to have a specific node as the preferred writer?

    We need to modify its weight. So say we want to have node 1 (192.168.1.205) as writer we will need something like this:

    INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,10000);
    INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,100);
    INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,100);

    Doing that will give us :

    +--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | weight | hostgroup | srv_host      | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
    +--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | 10000  | 100       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 2209       |
    | 100    | 101       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 546        |
    | 100    | 101       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 508        |
    | 10000  | 101       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 2209       |
    | 100    | 102       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 546        |
    | 100    | 102       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 508        |
    +--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

    If you noticed, given we set the WEIGHT of node 1 higher, this node will also become the most utilized for reads.
    We probably do not want that, so let us modify the reader weight.

    update mysql_servers set weight=10 where hostgroup_id=101 and hostname='192.168.1.205';
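
    As with the earlier changes to mysql_servers, my assumption is that this update only takes effect once it is loaded to runtime (and optionally persisted), for example:

    LOAD MYSQL SERVERS TO RUNTIME;
    SAVE MYSQL SERVERS TO DISK;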

    At this point, if we trigger a failover with set global wsrep_reject_queries=all; on node 1,
    ProxySQL will take action and will elect another node as writer:

    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 562        |
    | 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 562        |
    | 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 588        |
    | 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 588        |
    | 10000  | 9101      | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 468        |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

    Node 3 (192.168.1.231) is the new writer and node 1 is in the special group for OFFLINE.
    Let’s see now what happens if we put node 1 back.

    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | 10000  | 100       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 449        |
    | 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 532        |
    | 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 569        |
    | 10000  | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 449        |
    | 100    | 102       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 532        |
    | 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 569        |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

    Oops - the node has come back into the READER hostgroup with the HIGHEST weight and, as such, it will once more be the most used node. To fix it, we need to re-run the update as before.

    But is there a way to avoid this? In short, the answer is NO!
    This, in my opinion, is BAD and is worth a feature request, because it can really bring a node to its knees.

    Now this is not the only problem. There is another point that is probably worth discussing, which is the fact that ProxySQL currently performs both FAILOVER and FAILBACK.

    Failover is obviously something we want to have, but failback is another discussion. The point is, once the failover is complete and the cluster has redistributed the incoming requests, doing a failback is an impacting operation that can be a disruptive one too.

    If all nodes are treated as equal, there is no real way to prevent it. But if YOU set a node to be the main writer, something can be done; let us see what and how.
    Say we have:

    INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,1000);
    INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,100);
    INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,100);
    +--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | weight | hostgroup | srv_host      | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
    +--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | 1000   | 100       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 470        |
    | 100    | 101       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 558        |
    | 100    | 101       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 613        |
    | 10     | 101       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 470        |
    | 100    | 102       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 558        |
    | 100    | 102       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 613        |
    +--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

    Let us put the node down:
    set global wsrep_reject_queries=all;

    And check:

    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 519        |
    | 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 519        |
    | 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 506        |
    | 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 506        |
    | 1000   | 9101      | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 527        |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

    We can now manipulate the weight in the special OFFLINE group and see what happens:

    update mysql_servers set weight=10 where hostgroup_id=9101 and hostname='192.168.1.205';

    Then I put the node up again:
    set global wsrep_reject_queries=none;

    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
    | 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 537        |
    | 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 537        |
    | 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 573        |
    | 10     | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 458        |
    | 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 573        |
    | 10     | 102       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 458        |
    +--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

    That’s it, the node is back, with no service interruption.

    At this point we can decide whether to make this node a reader like the others, or to wait and plan a proper time of day when we can put it back as writer; in the meantime it gets a bit of load to warm its bufferpool.

    The other point – and important information – is: what is ProxySQL currently checking on Galera? From reading the code, the proxy will trap the following:

    • read_only
    • wsrep_local_recv_queue
    • wsrep_desync
    • wsrep_reject_queries
    • wsrep_sst_donor_rejects_queries
    • primary_partition

    Plus the standard sanity checks on the node.
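
    As a quick manual cross-check (a sketch of what you can run yourself on a node, not what ProxySQL executes internally), the same variables can be inspected from a MySQL client:

    SHOW GLOBAL VARIABLES LIKE 'read_only';
    SHOW GLOBAL STATUS    LIKE 'wsrep_local_recv_queue';
    SHOW GLOBAL VARIABLES LIKE 'wsrep_desync';
    SHOW GLOBAL VARIABLES LIKE 'wsrep_reject_queries';
    SHOW GLOBAL VARIABLES LIKE 'wsrep_sst_donor_rejects_queries';
    SHOW GLOBAL STATUS    LIKE 'wsrep_cluster_status';   -- Primary vs non-Primary partition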

    Finally, to monitor the whole situation, we can use this:

    mysql> select * from mysql_server_galera_log order by time_start_us desc limit 10;
    +---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+
    | hostname      | port | time_start_us    | success_time_us | primary_partition | read_only | wsrep_local_recv_queue | wsrep_local_state | wsrep_desync | wsrep_reject_queries | wsrep_sst_donor_rejects_queries | error |
    +---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+
    | 192.168.1.231 | 3306 | 1549982591661779 | 2884            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
    | 192.168.1.21  | 3306 | 1549982591659644 | 2778            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
    | 192.168.1.205 | 3306 | 1549982591658728 | 2794            | YES               | NO        | 0                      | 4                 | NO           | YES                  | NO                              | NULL  |
    | 192.168.1.231 | 3306 | 1549982586669233 | 2827            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
    | 192.168.1.21  | 3306 | 1549982586663458 | 5100            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
    | 192.168.1.205 | 3306 | 1549982586658973 | 4132            | YES               | NO        | 0                      | 4                 | NO           | YES                  | NO                              | NULL  |
    | 192.168.1.231 | 3306 | 1549982581665317 | 3084            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
    | 192.168.1.21  | 3306 | 1549982581661261 | 3129            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
    | 192.168.1.205 | 3306 | 1549982581658242 | 2786            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
    | 192.168.1.231 | 3306 | 1549982576661349 | 2982            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
    +---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+

    As you can see above, the log table keeps track of what has changed. In this case, it reports that node 1 has wsrep_reject_queries activated, and it will continue to log this until we set wsrep_reject_queries=none.
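
    To focus on a single node, the same log can be filtered; a sketch reusing the host from the example above:

    SELECT hostname, time_start_us, wsrep_reject_queries, wsrep_desync, read_only
    FROM mysql_server_galera_log
    WHERE hostname = '192.168.1.205'
    ORDER BY time_start_us DESC
    LIMIT 5;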

    Conclusions

    The ProxySQL Galera native integration is a useful feature to manage any Galera implementation, no matter whether it is Percona PXC, MariaDB Cluster or MySQL/Galera.

    The generic approach is obviously a good thing; still, it may miss some specific extensions, like the performance_schema pxc_cluster_view table we have in PXC.

    I have already raised my concerns about the failover/failback behaviour, and I am here again to remind you: whenever you do a controlled failover, REMEMBER to change the weight to prevent an immediate failback.

    This is obviously not possible in the case of a real failover, and, for instance, a simple temporary eviction will cause two downtimes instead of only one. Some environments are fine with that; others are not.

    Personally, I think there should be a FLAG in the configuration, so that we can decide whether or not failback should be executed.

     

    by Marco Tusa at February 20, 2019 02:11 PM

    February 19, 2019

    Oli Sennhauser

    MySQL Enterprise Backup Support Matrix

    MySQL Enterprise Backup (MEB) is somewhat limited with regard to the support of older MySQL versions, so you should consider the following release matrix:

    MEB/MySQL   Supported   5.5   5.6   5.7   8.0
    3.11.x      NO          x     x
    3.12.x      YES         x     x
    4.0.x       NO                      x
    4.1.x       YES                     x
    8.0.x       YES                           8.0.x*

    * MySQL Enterprise Backup 8.0.15 only supports MySQL 8.0.15. For earlier versions of MySQL 8.0, use the MySQL Enterprise Backup version with the same version number as the server.

    MySQL Enterprise Backup is available for download from the My Oracle Support (MOS) website. This release will be available on Oracle eDelivery (OSDC) after the next upload cycle. MySQL Enterprise Backup is a commercial extension to the MySQL family of products.

    As an Open Source alternative, Percona XtraBackup for MySQL databases is available.

    Compatibility with MySQL Versions: 3.11, 3.12, 4.0, 4.1, 8.0.

    MySQL Enterprise Backup User's Guide: 3.11, 3.12, 4.0, 4.1, 8.0.

    by Shinguz at February 19, 2019 06:13 PM

    Peter Zaitsev

    Percona Server for MongoDB 3.4.19-2.17 Is Now Available

    Percona Server for MongoDB


    Percona announces the release of Percona Server for MongoDB 3.4.19-2.17 on February 19, 2019. Download the latest version from the Percona website or the Percona Software Repositories.

    Percona Server for MongoDB 3.4 is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 3.4 Community Edition. It supports MongoDB 3.4 protocols and drivers.

    Percona Server for MongoDB extends MongoDB Community Edition functionality by including the Percona Memory Engine and MongoRocks storage engines, as well as several enterprise-grade features.

    Percona Server for MongoDB requires no changes to MongoDB applications or code. This release is based on MongoDB 3.4.19.

    In this release, Percona Server for MongoDB supports the ngram full-text search engine. Thanks to Sunguck Lee (@SunguckLee) for this contribution. To enable the ngram full-text search engine, create an index passing ngram to the default_language parameter:

    mongo > db.collection.createIndex({name:"text"}, {default_language: "ngram"})

    New Features

    • PSMDB-250: The ngram full-text search engine has been added to Percona Server for MongoDB. Thanks to Sunguck Lee (@SunguckLee) for this contribution.

    Bugs Fixed

    • PSMDB-272: mongos could crash when running the createBackup command.

    Other bugs fixed: PSMDB-247

    The Percona Server for MongoDB 3.4.19-2.17 release notes are available in the official documentation.

    by Borys Belinsky at February 19, 2019 01:43 PM

    How Network Bandwidth Affects MySQL Performance

    10gb network and 10gb with SSL

    The network is a major part of a database infrastructure. However, performance benchmarks are often done on a local machine, where the client and the server are collocated – I am guilty of this myself. This is done to simplify the setup and to exclude one more variable (the networking part), but with this we also miss looking at how the network affects performance.

    The network is even more important for clustering products like Percona XtraDB Cluster and MySQL Group Replication. Also, we are working on our Percona XtraDB Cluster Operator for Kubernetes and OpenShift, where network performance is critical for overall performance.

    In this post, I will look into networking setups. These are simple and trivial, but are a building block towards understanding networking effects for more complex setups.

    Setup

    I will use two bare-metal servers, connected via a dedicated 10Gb network. I will emulate a 1Gb network by changing the network interface speed with the following command:

    ethtool -s eth1 speed 1000 duplex full autoneg off

    network test topology

    I will run a simple benchmark:

    sysbench oltp_read_only --mysql-ssl=on --mysql-host=172.16.0.1 --tables=20 --table-size=10000000 --mysql-user=sbtest --mysql-password=sbtest --threads=$i --time=300 --report-interval=1 --rand-type=pareto

    This is run with the number of threads varied from 1 to 2048. All data fits into memory – innodb_buffer_pool_size is big enough – so the workload is CPU-intensive and in-memory: there is no IO overhead.
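
    As a rough cross-check that the dataset really fits in the buffer pool (a sketch assuming the default sysbench schema name sbtest), you can compare the table sizes with the configured buffer pool:

    SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS dataset_gb
    FROM information_schema.tables
    WHERE table_schema = 'sbtest';

    SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;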

    Operating System: Ubuntu 16.04

    Benchmark N1. Network bandwidth

    In the first experiment I will compare 1Gb network vs 10Gb network.

    1gb vs 10gb network

    threads/throughput 1Gb network 10Gb network
    1 326.13 394.4
    4 1143.36 1544.73
    16 2400.19 5647.73
    32 2665.61 10256.11
    64 2838.47 15762.59
    96 2865.22 17626.77
    128 2867.46 18525.91
    256 2867.47 18529.4
    512 2867.27 17901.67
    1024 2865.4 16953.76
    2048 2761.78 16393.84

     

    Obviously the 1Gb network performance is a bottleneck here, and we can improve our results significantly if we move to the 10Gb network.

    To see that the 1Gb network is the bottleneck, we can check the network traffic chart in PMM:

    network traffic in PMM

    We can see that we achieved 116MiB/sec (or 928Mb/sec) in throughput, which is very close to the network bandwidth.
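
    If PMM is not available, a rough server-side estimate can be derived from the MySQL traffic counters. A sketch: the counters are cumulative, so sample them twice and divide the delta by the interval.

    SHOW GLOBAL STATUS WHERE Variable_name IN ('Bytes_sent', 'Bytes_received');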

    But what can we do if our network infrastructure is limited to 1Gb?

    Benchmark N2. Protocol compression

    There is a feature in the MySQL protocol whereby you can enable compression for the network exchange between the client and the server. For sysbench, this is enabled with:

    --mysql-compression=on
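
    To verify that the compressed protocol is actually negotiated, the session status can be checked from a client started with the compression flag (for example mysql --compress); a sketch:

    SHOW SESSION STATUS LIKE 'Compression';   -- ON when the compressed client/server protocol is in use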

    Let’s see how it will affect our results.

    1gb network with compression protocol

    threads/throughput 1Gb network 1Gb with compression protocol
    1 326.13 198.33
    4 1143.36 771.59
    16 2400.19 2714
    32 2665.61 3939.73
    64 2838.47 4454.87
    96 2865.22 4770.83
    128 2867.46 5030.78
    256 2867.47 5134.57
    512 2867.27 5133.94
    1024 2865.4 5129.24
    2048 2761.78 5100.46

     

    Here is an interesting result: when we use all of the available network bandwidth, protocol compression actually helps to improve the result.

    10Gb network with compression protocol

    threads/throughput 10Gb 10Gb with compression
    1 394.4 216.25
    4 1544.73 857.93
    16 5647.73 3202.2
    32 10256.11 5855.03
    64 15762.59 8973.23
    96 17626.77 9682.44
    128 18525.91 10006.91
    256 18529.4 9899.97
    512 17901.67 9612.34
    1024 16953.76 9270.27
    2048 16393.84 9123.84

     

    But this is not the case with the 10Gb network. The CPU resources needed for compression/decompression are a limiting factor, and with compression the throughput actually only reaches about half of what we have without compression.

    Now let’s talk about protocol encryption, and how using SSL affects our results.
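
    Before comparing numbers, it is worth confirming that the connections are really encrypted; a quick client-side sketch:

    SHOW SESSION STATUS LIKE 'Ssl_cipher';    -- empty when the session is not using TLS
    SHOW SESSION STATUS LIKE 'Ssl_version';   -- e.g. TLSv1.2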

    Benchmark N3. Network encryption

    1gb network and 1gb with SSL

    threads/throughput 1Gb network 1Gb SSL
    1 326.13 295.19
    4 1143.36 1070
    16 2400.19 2351.81
    32 2665.61 2630.53
    64 2838.47 2822.34
    96 2865.22 2837.04
    128 2867.46 2837.21
    256 2867.47 2837.12
    512 2867.27 2836.28
    1024 2865.4 1830.11
    2048 2761.78 1019.23

    10gb network and 10gb with SSL

    threads/throughput 10Gb 10Gb SSL
    1 394.4 359.8
    4 1544.73 1417.93
    16 5647.73 5235.1
    32 10256.11 9131.34
    64 15762.59 8248.6
    96 17626.77 7801.6
    128 18525.91 7107.31
    256 18529.4 4726.5
    512 17901.67 3067.55
    1024 16953.76 1812.83
    2048 16393.84 1013.22

     

    For the 1Gb network, SSL encryption shows some penalty – about 10% for the single thread – but otherwise we hit the bandwidth limit again. We also see some scalability hit with a high number of threads, which is more visible in the 10Gb network case.

    With 10Gb, the SSL protocol does not scale after 32 threads. Actually, it appears to be a scalability problem in OpenSSL 1.0, which MySQL currently uses.

    In our experiments, we saw that OpenSSL 1.1.1 provides much better scalability, but you need a special build of MySQL, compiled from source and linked against OpenSSL 1.1.1, to achieve this. I don't show those results here, as we do not have production binaries.

    Conclusions

    1. Network performance and utilization will affect the general application throughput.
    2. Check whether you are hitting network bandwidth limits.
    3. Protocol compression can improve the results if you are limited by network bandwidth, but it can also make things worse if you are not.
    4. SSL encryption has some penalty (~10%) with a low number of threads, but it does not scale for high-concurrency workloads.

    by Vadim Tkachenko at February 19, 2019 11:52 AM

    February 18, 2019

    Peter Zaitsev

    Percona Server for MySQL 5.7.25-28 Is Now Available

    Percona is glad to announce the release of Percona Server 5.7.25-28 on February 18, 2019. Downloads are available here and from the Percona Software Repositories.

    This release is based on MySQL 5.7.25 and includes all the bug fixes in it. Percona Server 5.7.25-28 is now the current GA (Generally Available) release in the 5.7 series.

    All software developed by Percona is open-source and free.

    In this release, Percona Server introduces the variable binlog_skip_flush_commands. This variable controls whether or not FLUSH commands are written to the binary log. Setting this variable to ON can help avoid problems in replication. For more information, refer to our documentation.
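
    A minimal illustration, assuming (as the documentation describes) that this is a dynamic global variable; verify against the linked documentation before relying on it:

    SET GLOBAL binlog_skip_flush_commands = ON;                 -- FLUSH commands are no longer written to the binary log
    SHOW GLOBAL VARIABLES LIKE 'binlog_skip_flush_commands';   -- confirm the new setting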

    Note

    If you’re currently using Percona Server 5.7, Percona recommends upgrading to this version of 5.7 prior to upgrading to Percona Server 8.0.

    Bugs fixed

    • FLUSH commands written to the binary log could cause errors in case of replication. Bug fixed #1827 (upstream #88720).
    • Running LOCK TABLES FOR BACKUP followed by STOP SLAVE SQL_THREAD could block replication, preventing it from being restarted normally. Bug fixed #4758.
    • The ACCESS_DENIED field of the information_schema.user_statistics table was not updated correctly. Bug fixed #3956.
    • MySQL could report that the maximum number of connections was exceeded with too many connections being in the CLOSE_WAIT state. Bug fixed #4716 (upstream #92108).
    • Wrong query results could be received in semi-join subqueries with materialization-scan that allowed inner tables of different semi-join nests to interleave. Bug fixed #4907 (upstream bug #92809).
    • In some cases, the server using the MyRocks storage engine could crash when TTL (Time to Live) was defined on a table. Bug fixed #4911.
    • Running a SELECT statement with the ORDER BY and LIMIT clauses could result in less than optimal performance. Bug fixed #4949 (upstream #92850).
    • There was a typo in mysqld_safe.sh: trottling was replaced with throttling. Bug fixed #240. Thanks to Michael Coburn for the patch.
    • MyRocks could crash while running START TRANSACTION WITH CONSISTENT SNAPSHOT if other transactions were in specific states. Bug fixed #4705.
    • In some cases, mysqld could crash when inserting data into a database the name of which contained special characters (CVE-2018-20324). Bug fixed #5158.
    • MyRocks incorrectly processed transactions in which multiple statements had to be rolled back. Bug fixed #5219.
    • In some cases, the MyRocks storage engine could crash without triggering the crash recovery. Bug fixed #5366.
    • When bootstrapped with undo or redo log encryption enabled on very fast storage, the server could fail to start. Bug fixed #4958.

    Other bugs fixed: #2455, #4791, #4855, #4996, #5268.

    This release also contains fixes for the following CVE issues: CVE-2019-2534, CVE-2019-2529, CVE-2019-2482, CVE-2019-2434.

    Find the release notes for Percona Server for MySQL 5.7.25-28 in our online documentation. Report bugs in the Jira bug tracker.

     

    by Borys Belinsky at February 18, 2019 04:38 PM

    Percona Server for MongoDB 4.0.5-2 Is Now Available

    Percona Server for MongoDB


    Percona announces the release of Percona Server for MongoDB 4.0.5-2 on February 18, 2019. Download the latest version from the Percona website or the Percona Software Repositories.

    Percona Server for MongoDB is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 4.0 Community Edition. It supports MongoDB 4.0 protocols and drivers.

    Percona Server for MongoDB extends Community Edition functionality by including the Percona Memory Engine storage engine, as well as several enterprise-grade features. It also includes MongoRocks storage engine (which is now deprecated). Percona Server for MongoDB requires no changes to MongoDB applications or code.

    This release includes all features of MongoDB 4.0 Community Edition. Most notable among these are:

    Note that the MMAPv1 storage engine is deprecated in MongoDB 4.0 Community Edition.

    In Percona Server for MongoDB 4.0.5-2, data at rest encryption becomes GA. The data at rest encryption feature now covers the temporary files used for external sorting and the rollback files. You can decrypt and examine the contents of the rollback files using the new perconadecrypt command line tool.

    In this release, Percona Server for MongoDB supports the ngram full-text search engine. Thanks to Sunguck Lee (@SunguckLee) for this contribution. To enable the ngram full-text search engine, create an index passing ngram to the default_language parameter:

    mongo > db.collection.createIndex({name:"text"}, {default_language: "ngram"})

    New Features

    • PSMDB-276: The perconadecrypt tool is now available for decrypting the encrypted rollback files.
    • PSMDB-250: The ngram full-text search engine has been added to Percona Server for MongoDB. Thanks to Sunguck Lee (@SunguckLee) for this contribution.

    Bugs Fixed

    • PSMDB-234: It was possible to use a key file for encryption the owner of which was not the owner of the mongod process.
    • PSMDB-273: When using data at rest encryption, temporary files for external sorting and rollback files were not encrypted.
    • PSMDB-257: MongoDB could not be started with a group-readable key file owned by root.
    • PSMDB-272: mongos could crash when running the createBackup command.

    Other bugs fixed: PSMDB-247

    The Percona Server for MongoDB 4.0.5-2 release notes are available in the official documentation.

    by Borys Belinsky at February 18, 2019 04:13 PM