Planet MariaDB

September 01, 2015

MariaDB AB

Evaluating MySQL Parallel Replication Part 3: Benchmarks in Production

jeanfrancoisgagne

Parallel replication is a long-awaited MySQL feature. It is available in MariaDB 10.0 and in MySQL 5.7. In this 3rd post of the series, we present benchmark results from Booking.com production environments.

This is a repost of Jean-François Gagné's blog post on blog.booking.com.

Note: this post has an annex: Under the Hood. Benchmarking is a complex art and reporting results accurately is even harder. If all the details were put in a single article, it would make a very long post. The links to the annex should satisfy readers eager for more details.

Parallel replication is on its way and with it comes the hope that more transactions can be run on a master without introducing slave lag. But it remains to be seen whether this dream will come true - will parallel replication hold its promise or will there be surprises after its deployment? We would like to know whether or not we can count on that feature in the future.

To get answers, nothing is better than experimenting with the technology. Benchmark results have already been published (MariaDB 10.0: 10 times improvement; and MySQL 5.7: 3 to 6 times improvement) but results might be different in our production environments. The best test would be to run parallel replication on our real workloads but this is not trivial. To be able to run transactions in parallel, a slave needs parallelism information from the master (for more details, see Part 1). With a master in an older version (MySQL 5.6 in our case), slaves do not have this information.

Luckily, we can use slave group commit on an intermediate master to identify transactions that can be run in parallel. The trick is to execute transactions sequentially on a slave, but to delay their commit. While the commit is delayed, the next transaction is started. If the two transactions are non-conflicting, the second could complete and the two transactions would commit together. In this scenario, grouping has succeeded (the two transactions committed in a single group) and parallelism is identified (as they commit together, the transactions are non-conflicting, thus can be run in parallel on slaves). For more details on slave group commit, see Part 2.
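For illustration, here is a minimal sketch of enabling slave group commit on a MariaDB 10.0 intermediate master. The variables are the ones this feature introduced; the wait thresholds are illustrative values, not necessarily the ones used in our tests.

-- Illustrative sketch: delay commits on the intermediate master so that
-- non-conflicting transactions commit (and are marked) together.
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 0;        -- execute transactions sequentially
SET GLOBAL binlog_commit_wait_count = 20;     -- wait for up to 20 commits to group
SET GLOBAL binlog_commit_wait_usec = 100000;  -- but never delay a commit by more than 100 ms
START SLAVE;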

Our benchmarks are established within four production environments: E1, E2, E3, and E4. Each of these environments is composed of one production master, with some intermediate masters, and a leaf slave. A complete description of those environments can be found in the annex.

Slave group commit identifies less parallelism than a parallel execution on a master would identify (details in the Group Commit: Slave vs. Master section of the annex). Even so, decent group sizes are obtained, as shown in the group commit size graphs in the annex. Usually we have group sizes of at least 5, most of the time they are larger than 10, and sometimes they are as big as 15 or even 20. These will allow us to test parallel replication with a real production workload.

Before closing this long introduction, let's talk a little about our expectations. InnoDB is getting better at using many cores (RO and RW benchmark results) but single-threaded replication gets in the way of pushing more writes in a replicating environment. Without parallel replication, only a single core can be used on a master to perform writes without incurring slave lag. This is disappointing as servers come with 12 or more cores (of which only one can be used for writes). Ideally, we would like to use a significant percentage of the cores of a server for writes (25% could be a good start). So speedups of 3 would be good results at this point (12 cores), and speedups of 6 and 10 would be needed in the near future (24 and 40 cores).

The Test: Catching Up with 24 Hours of Transactions

Our benchmark scenario is as follows: after restoring a backup of the database, starting MariaDB, waiting for the buffer pool to be loaded, and running the slave for at least 12 hours, we measure the time the slave takes to process 24 hours of production transactions.
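One way to take such a measurement is sketched below; the binary log file name and position marking the end of the 24-hour window are hypothetical placeholders.

-- Hypothetical sketch: time how long the slave takes to reach the binlog
-- position corresponding to the end of the 24-hour window.
SELECT NOW() INTO @start_time;
START SLAVE UNTIL MASTER_LOG_FILE = 'binlog.001234', MASTER_LOG_POS = 4;
-- MASTER_POS_WAIT() blocks until the SQL thread reaches the target position.
SELECT MASTER_POS_WAIT('binlog.001234', 4);
SELECT TIMEDIFF(NOW(), @start_time) AS catch_up_time;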

The tests are run in the following binary log configurations:

  • Intermediate Master (IM): both binary logs and log-slave-updates are enabled.
  • Slave with Binary Logs (SB): binary logs are enabled but log-slave-updates is disabled.
  • Standard Slave (SS): both binary logs and log-slave-updates are disabled.

And in the following durability configurations:

  • High Durability (HD): sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1.
  • No Durability (ND): sync_binlog = 0 and innodb_flush_log_at_trx_commit = 2 (also known as relaxed durability).
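As a sketch, the two durability profiles map to the following dynamic settings (the binary log configurations above, by contrast, are startup options and require a server restart):

-- High Durability (HD): sync the binary log and the InnoDB redo log
-- on every transaction commit.
SET GLOBAL sync_binlog = 1;
SET GLOBAL innodb_flush_log_at_trx_commit = 1;

-- No Durability (ND): let the OS decide when to sync the binary log;
-- write the redo log at commit but sync it only about once per second.
SET GLOBAL sync_binlog = 0;
SET GLOBAL innodb_flush_log_at_trx_commit = 2;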

For each of those configurations (6 in total: IM-HD, IM-ND, SB-HD, SB-ND, SS-HD, and SS-ND), the tests are run with different values of slave_parallel_threads (SPT). The full results are presented in the annex and the most interesting results are presented below (SB-HD and SB-ND). The times presented are in the format hours:minutes.seconds. Below the time taken to process 24 hours of transactions, the speedup achieved over the single-threaded run is shown in parentheses.
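For reference, switching SPT between runs looks like this in MariaDB 10.0 (a sketch; the variable can only be changed while the slave threads are stopped):

-- Sketch: change the number of parallel applier threads between test runs.
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 20;
START SLAVE;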

Execution Times and Speedups for Slave with Binary Logs
            E1                 E2                 E3                 E4
SPT    SB-HD    SB-ND     SB-HD    SB-ND     SB-HD    SB-ND     SB-HD    SB-ND
  0   9:11.11  5:22.29   2:50.58  1:11.30   9:58.58  9:06.20   7:43.06  7:26.24
  5   6:47.38  5:15.03   1:40.32  1:11.19   8:41.47  8:24.40   6:36.06  6:29.52
      (1.35)   (1.02)    (1.70)   (1.00)    (1.15)   (1.08)    (1.17)   (1.15)
 10   6:16.49  5:10.02   1:28.48  1:09.40   8:23.07  8:16.23   6:17.59  6:13.48
      (1.46)   (1.04)    (1.93)   (1.03)    (1.19)   (1.10)    (1.23)   (1.19)
 20   6:00.22  5:07.16   1:24.52  1:09.03   8:06.12  8:05.13   6:05.40  6:05.48
      (1.53)   (1.05)    (2.01)   (1.04)    (1.23)   (1.13)    (1.27)   (1.22)
 40   5:53.42  5:05.36   1:22.12  1:08.32   8:07.26  8:04.31   5:59.30  5:59.57
      (1.56)   (1.06)    (2.08)   (1.04)    (1.23)   (1.13)    (1.29)   (1.24)

In the Graph during Tests section of the annex, you can find many details about the different test runs.

Discussion

There are lots of things to say about those results. Let's start with observations that are not related to parallel replication (SPT=0):

  • Obs1: Standard Slave results (without binary logs) are very close to the results of Slave with Binary Logs (without log-slave-updates) (details in the annex).
  • Obs2: log-slave-updates has a visible cost for E1 (time difference between IM-HD and SB-HD), a less obvious but still noticeable cost for E2 and E3, and it is a win for E4 (that last one is disturbing; the numbers are in the annex).
  • Obs3: relaxing durability is a huge win for E1 and E2, a more limited win for E3, and a much smaller one for E4.

With reference to Obs1 above, this shows that binary logs should probably not be disabled on slaves: the cost is almost nonexistent and the wins are big (tracing errant transactions and being a candidate for master promotion). However, slaves with log-slave-updates are slower than slaves with only binary logs enabled (Obs2 above), so log-slave-updates should be avoided when possible. Binlog Servers can be used to replace log-slave-updates for intermediate masters, see MySQL Slave Scaling for more details (see also Better Parallel Replication for MySQL for an explanation of why log-slave-updates is bad on intermediate masters for parallel replication).


With reference to Obs3 above, this can be explained by the different workload of the four environments (more details about the workloads can be found in the annex):

  • E2 is a CPU-bound workload (the dataset fits in RAM).
  • E1 is also mostly CPU-bound but with some cache misses in the InnoDB buffer pool, so it needs a page fetch from disk before doing a write.
  • E3 is a mixed CPU and IO workload (more cache misses in the InnoDB buffer pool but still with enough cache hit to get a good commit throughput).
  • E4 is an IO-bound workload (mostly cache misses).

Relaxing durability on CPU-bound workloads achieves good throughput improvements, but this does not happen on IO-bound workloads.


Now, let's focus on parallel replication. The Standard Slave results (SS-HD and SS-ND) are not worth discussing as they are very close to the Slave with Binary Logs results (Obs1 above). We will also not discuss Intermediate Master results (IM-HD and IM-ND) as they should be replaced by Binlog Servers. So all observations below are made on the results of Slave with Binary Logs (SB-HD and SB-ND):

  • Obs4: the best speedup (~2.10) is in E2 with high durability. E1 follows with a speedup of ~1.56 (always with high durability).
  • Obs5: the speedups for E4 are modest (~1.29) and the results are almost identical for both durability settings.
  • Obs6: for E1 and E2, the speedups with no durability are almost non-existent (less than 1.10).
  • Obs7: for both E1 and E2, relaxing durability with single-threaded replication leads to faster execution than enabling parallel replication.
  • Obs8: the results for E3 are halfway between E1/E2 and E4: both SB-HD and SB-ND get some modest speedups from parallel replication (like E4 and opposite to E1/E2) and relaxing durability makes things run a little faster (like E1/E2 and opposite to E4), but not to the point where single-threaded low durability is faster than multi-threaded high durability.

All those observations point to the importance of the workload in parallel replication speedups:

  • CPU-bound workloads seem to get modest speedups in high-durability configurations.
  • Relaxing durability for CPU-bound workloads looks like a better option than enabling parallel replication on a high-durability configuration.
  • IO-bound workloads get more limited speedups.

Our first reaction is disappointment: the speedups are not as high as expected. Don't get us wrong: faster is always better, especially when the only thing to do is to upgrade the software, which we will do anyway eventually. However, having only 25% more writes on a master (or 110% depending on which environment we look at) will not help us in the long term. Parallel replication is not the solution (at least not the only solution) that will allow us to stop/avoid sharding.

Ideas and Future Work

We have a hypothesis explaining the modest speedups: long-running transactions. In the presence of long-running transactions, the parallel replication pipeline on the slave stalls. Let's take the following six transactions committing on the master in two commit groups (B for begin and C for commit):

   --------Time-------->
T1:  B-----------C
T2:           B--C
T3:           B--C
T4:           B--C
T5:              B--C
T6:              B--C

Running those transactions on a single-threaded slave takes 33 units of time (time scale is at the bottom):

   ----------------Time---------------->
T1:  B-----------C
T2:               B--C
T3:                   B--C
T4:                       B--C
T5:                           B--C
T6:                               B--C
              1         2         3
     123456789012345678901234567890123

Running those transactions on a multi-threaded slave with SPT=4 takes 17 units of time:

   ---------Time-------->
T1:  B-----------C
T2:  B-- . . . . C
T3:  B-- . . . . C
T4:  B-- . . . . C
T5:               B--C
T6:               B--C
              1
     12345678901234567

So we barely achieve a speedup of 2 (33/17 ≈ 1.94, and the second commit group does not even contain a large transaction). The low speedup is explained by T1 being much bigger than the other transactions in the group. So our intuition is that to get better speedups with parallel replication, all transactions should be of similar size, and bigger transactions should be broken down into smaller ones (when possible).

We have many of those big transactions in our workload at Booking.com. Most of our design choices predate MySQL 5.6, when a commit was expensive. Reducing the number of commits was a good optimization at that time, so doing many changes in a single transaction was a good thing. Now, with binary log group commit, this optimization is less useful, but it does no harm. However, it is very bad for parallel replication.
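To make the idea concrete, here is a sketch of what breaking down such a transaction could look like; the table and predicate are hypothetical, not taken from our actual schema.

-- Before: one large transaction, which stalls the parallel applier.
BEGIN;
UPDATE orders SET archived = 1 WHERE created < '2015-01-01';
COMMIT;

-- After: the same work in bounded chunks; each small transaction can
-- commit in a group with unrelated transactions and replicate in parallel.
BEGIN;
UPDATE orders SET archived = 1
WHERE created < '2015-01-01' AND archived = 0
LIMIT 10000;
COMMIT;
-- Repeat until ROW_COUNT() reports 0 rows affected.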

There are at least two other things to discuss from those results, but this post is already too long, so you will have to go to the annex to read the Additional Discussions.

Conclusion

It is possible to test parallel replication with a true production workload, even if the master is running an old version of MySQL. Thanks to slave group commit in MariaDB 10.0, we can identify parallelism on an intermediate master and enable parallel replication on a slave. Even if this parallelism identification is not as good as it would be on a master, we get decent group sizes.

Our CPU-bound workloads are getting speedups of ~1.56 to ~2.10 with high-durability constraints. This is a little disappointing: we would like to have more than two cores busy applying writes on slaves. Our guess is that better speedups could be obtained by hunting down large transactions, but that still needs to be verified. At this point, and for this type of workload, our tests show that relaxing durability is a better optimization than enabling parallel replication. Finally, with relaxed durability, parallel replication shows almost no benefit (4% to 6%), and it is still unknown whether hunting down large transactions and splitting them would result in better speedups.

Our IO-bound workloads are getting speedups of ~1.23 to ~1.29, which is also disappointing but expected, because it is hard to fight against the seek time of magnetic disks. In this type of workload, the relaxed-durability setting also benefits from parallel replication. However, at high enough parallelism on the slave, and for this type of workload, relaxing durability is not very beneficial. It is hard to tell what types of improvement would come from hunting down large transactions for this type of workload.

The next step on parallel replication evaluation would be to try optimistic parallel replication. This will make a good fourth part in the series.

This is a repost of Jean-François Gagné's blog post on booking.com.

About the Author

Jean-François Gagné is a System Engineer at Booking.com supporting the growth of Booking.com with a focus on growing their MariaDB and MySQL infrastructure.

by jeanfrancoisgagne at September 01, 2015 11:17 AM

Shlomi Noach

Orchestrator 1.4.340: GTID, binlog servers, Smart Mode, failovers and lots of goodies

Orchestrator 1.4.340 is released. Not quite competing with the latest MySQL changelog, and as I haven't blogged about the orchestrator feature set in a while, this is a quick listing of orchestrator features available since my last publication:

  • Supports GTID (Oracle & MariaDB)
    • GTID still not being used in automated recovery -- in progress.
    • enable-gtid, disable-gtid, skip-query for GTID commands
  • Supports binlog servers (MaxScale)
    • Discovery & operations on binlog servers
    • Understanding slave repositioning in a binlog-server architecture
  • Smart mode: relocate & relocate-below commands (or Web/GUI drag-n-drop) let orchestrator figure out the best way of slave repositioning. Orchestrator picks from GTID, Pseudo GTID, binlog servers, binlog file:pos math (and more) options, or combinations of the above. Fine grained commands still there, but mostly you won't need them.
  • Crash recoveries (did you know orchestrator does that?):
    • For intermediate master recovery: improved logic in picking the best recovery plan (prefer in-DC, prefer promoting local slave, supporting binlog server topologies, ...)
    • For master recovery: even better slave promotion; supports candidate slaves (prefer promoting such slaves); supports binlog server shared topologies
    • Better auditing and logging of recovery cases
    • Better analysis of crash scenarios, also in the event of lost VIPs, hanging connections; emergent checks in crash suspected scenarios
    • recover-lite: do all topology-only recovery steps, without invoking external processes
  • Better browser support: the Web UI used to work only on Firefox and Chrome (and the latter has had issues); it should now work well on all browsers, at the cost of reduced d3 animation. More work still in progress.
  • Faster, more parallel, less blocking operations on all counts; removed a lot of serialized code; fewer locks.
  • Web enhancements
    • More verbose drag-n-drop (operation hint; color hints)
    • Drag-n-drop for slaves-of-a-server
    • Replication/crash analysis dashboard
  • Pools: orchestrator can be told about instance-to-pool association (submit-pool-instances command)
    • And can then present pool status (web)
    • Or pool hints within topologies (web)
    • Or queried for all pools (cluster-pool-instances command)
  • Other:
    • Supports MySQL 5.7 (tested with 5.7.8)
    • Configurable graphite path for metrics
    • --noop flag; does all the work except for actually changing master on slaves. Shows intentions.
    • Web (or cli which-cluster-osc-slaves command) provide list of control slaves to use in pt-osc operation
    • hostname-unresolve: force orchestrator to unresolve a fqdn into VIP/CNAME/... when issuing a CHANGE MASTER TO
  • 3rd party contributions (hey, thanks!) include:
    • More & better SSL support
    • Vagrant templates
  • For developers:
    • Orchestrator now go-gettable. Just go get github.com/outbrain/orchestrator
    • Improved build script; supports more architectures

Orchestrator is free and open source (Apache 2.0 License).

I'll be speaking about orchestrator at Percona Live Amsterdam.

by shlomi at September 01, 2015 10:10 AM

Peter Zaitsev

Booking.com’s Jean-François Gagné on Percona Live Amsterdam

Booking.com, one of the world’s leading e-commerce companies, helps travelers book nearly 1 million rooms per night. Established in 1996, Booking.com B.V. guarantees the best prices for any type of property, from small, family-run bed and breakfasts to executive apartments and five-star luxury suites.

The travel website is also a dedicated contributor to the MySQL and Perl communities. Other open source technologies it uses include CentOS Linux, Nginx, Python, Puppet, Git and more.

A Diamond sponsor of Percona Live Amsterdam Sept. 21-23, you can meet the people who power Booking.com at booth 205. Enter promo code “BlogInterview” at registration to save €20!

In the meantime, meet Jean-François Gagné, a system engineer at Booking.com. He’ll be presenting a couple of talks: “Riding the Binlog: an in Deep Dissection of the Replication Stream” and “Binlog Servers at Booking.com.”


Tom: Hi Jean-François, in your session, “Riding the Binlog: an in Deep Dissection of the Replication Stream“, you talk about how we can think of the binary logs as a transport for a “Stream of Transactions”. What will be the top 3 things attendees will come away with following this 50-minute talk?

Jean-François: Hi Tom, thanks for this opportunity to give a sneak peek of my talk.  The most important subject that will be discussed is that the binary logs evolve: through the use of “log-slave-updates”, the stream can grow, shrink or morph.  Said another way: the binary logs of a slave can be very different from the binary logs of the master, and this should be taken into account when relying on them (including when replicating using intermediate masters and when promoting a slave as a new master using GTIDs).  We will also explore how the binary logs can be decomposed into sub-streams, or viewed as the multiplexing of many streams.  We will also look at de-multiplexing functions and the new possibilities they open up.


Tom: Percona Live, starting with this conference, has a new venue and a broader theme – now encompassing, in addition to MySQL, MongoDB, NoSQL and data in the cloud. Your thoughts? And what do you think is missing – what would you change (if anything)?

Jean-François: I think you forgot the best of all changes: going from a 2-day conference last year in London to a 3-day conference this year.  This will allow better knowledge exchange and I am very happy about that.  I think this event will be a success, with a good balance of sessions focused on technologies and presentations about specific use-cases of those technologies.  If I had one wish, I would like to see more sessions about specific use-cases of NoSQL technologies, with an in-depth discussion about why they are a better choice than more traditional solutions: maybe more of those sessions will be submitted next year.


Tom: Which other session(s) are you most looking forward to besides your own?

Jean-François: I will definitely attend the Facebook session about Semi-Synchronous Replication: it is very close to my interests, especially as Booking.com is thinking about using loss-less semi-sync replication in the future, and I look forward to hearing war stories about this feature.  All sessions dissecting the internals of a technology (InnoDB, TokuDB, RocksDB, …) will also have my attention.  Finally, it is always interesting to hear about how large companies are using databases, so I plan to attend the MySQL@Wikimedia session.


Tom: As a resident of Amsterdam, what are some of the must-do activities/sightseeing for those visiting for Percona Live from out of town?

Jean-François: Seeing the city from a high point is impressive, and you will have the opportunity to enjoy that view from the Booking.com office at the Community Dinner.  Also, I recommend finding a bike and discovering the city pedaling (there are many rental shops, just ask Google).  From the conference venue, you can do a 70-minute ride crossing three nice parks: the Westerpark, the Rembrandtpark and the Vondelpark – https://goo.gl/P13Mc7 – and you can discover the first or third park in a shorter ride (45 minutes).  If you feel a little more adventurous, I recommend a 90-minute ride south following the Amstel: once out of Amsterdam, you will have the water on one side at the level of the road, and the fields (polders) 3 meters below on the other side (https://goo.gl/OPDv5z).  This will allow you to see for yourself why this place is called the “Low Countries”.

The post Booking.com’s Jean-François Gagné on Percona Live Amsterdam appeared first on MySQL Performance Blog.

by Tom Diederich at September 01, 2015 10:00 AM

August 31, 2015

Peter Zaitsev

High-load clusters and desynchronized nodes on Percona XtraDB Cluster

There can be a lot of confusion and lack of planning in Percona XtraDB Clusters in regards to nodes becoming desynchronized for various reasons. This can happen in a few ways.

When I say “desynchronized” I mean a node that is permitted to build up a potentially large wsrep_local_recv_queue while some operation is happening.  For example, a node taking a backup would set wsrep_desync=ON during the backup and potentially fall behind replication by some amount.

Some of these operations may completely block Galera from applying transactions, while others may simply increase load on the server enough that it falls behind and applies at a reduced rate.

In all the cases above, flow control is NOT used while the node cannot apply transactions, but it MAY be used while the node is recovering from the operation.  For an example of this, see my last blog about IST.

If a cluster is fairly busy, then the flow control that CAN happen when the above operations catch up MAY be detrimental to performance.

Example setup

Let us take my typical 3 node cluster with workload on node1.  We are taking a blocking backup of some kind on node3 so we are executing the following steps:

  1. node3> set global wsrep_desync=ON;
  2. Node3’s “backup” starts, this starts with FLUSH TABLES WITH READ LOCK;
  3. Galera is paused on node3 and the wsrep_local_recv_queue grows some amount
  4. Node3’s “backup” finishes, finishing with UNLOCK TABLES;
  5. node3> set global wsrep_desync=OFF;
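Throughout these steps, a few Galera status counters are worth polling on node3; this is a sketch using standard wsrep status variables, not the exact monitoring behind the graphs below.

-- Counters worth watching on node3 while the backup runs and afterwards:
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';    -- depth of the apply queue
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; -- Synced / Donor/Desynced / Joined
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused'; -- fraction of time paused by flow control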

During the backup

This includes up through step 3 above.  My node1 is unaffected by the backup on node3; I can see it averaging 5-6k writesets (transactions) per second, as it did before we began:

[Graph: node1 writesets per second, unaffected during the backup]


node2 is also unaffected:

[Graph: node2 writesets per second, also unaffected]

but node3 is not applying and its queue is building up:

[Graph: node3 apply rate stopped and receive queue growing]

Unlock tables, still wsrep_desync=ON

Let’s examine briefly what happens when node3 is permitted to start applying, but wsrep_desync stays enabled:

[Graph: node1 write throughput after UNLOCK TABLES, with wsrep_desync still ON]

node1’s performance is pretty much the same, and node3 is not using flow control yet. However, there is a problem:

[Graph: node3 receive queue still growing instead of draining]

It’s hard to notice, but node3 is NOT catching up, instead it is falling further behind!  We have potentially created a situation where node3 may never catch up.

The PXC nodes were close enough to the red-line of performance that node3 can only apply about as fast as new transactions are coming into node1 (and somewhat slower until it heats up a bit).

This represents a serious concern in PXC capacity planning:

Nodes do not only need to be fast enough to handle normal workload, but also to catch up after maintenance operations or failures cause them to fall behind.

Experienced MySQL DBAs will realize this isn’t all that different from Master/Slave replication.

Flow Control as a way to recovery

So here’s the trick:  if we turn off wsrep_desync on node3 now, node3 will use flow control if and only if the incoming replication exceeds node3’s apply rate.  This gives node3 a good chance of catching up, but the tradeoff is reducing write throughput of the cluster.  Let’s see what this looks like in context with all of our steps.  wsrep_desync is turned off at the peak of the replication queue size on node3, around 12:20PM:

[Graph: cluster write throughput dropping once node3 starts sending flow control]

[Graph: node3 receive queue peaking and starting to drain after wsrep_desync=OFF]

So at the moment node3 starts utilizing flow control to prevent falling further behind, our write throughput (in this specific environment and workload) is reduced by approximately 1/3rd (YMMV).   The cluster will remain in this state until node3 catches up and returns to the ‘Synced’ state.  This catchup is still happening as I write this post, almost 4 hours after it started and will likely take another hour or two to complete.

I can see a more realtime representation of this by using myq_status on node1, summarizing every minute:

[root@node1 ~]# myq_status -i 1m wsrep
mycluster / node1 (idx: 1) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
19:58:47 P   5  3 Sync 0.9ms 3128 2.0M   0   27 213b   0 25.4s   0   0   0 3003k  16k  62%
19:59:47 P   5  3 Sync 1.1ms 3200 2.1M   0   31 248b   0 18.8s   0   0   0 3003k  16k  62%
20:00:47 P   5  3 Sync 0.9ms 3378 2.2M  32   27 217b   0 26.0s   0   0   0 3003k  16k  62%
20:01:47 P   5  3 Sync 0.9ms 3662 2.4M  32   33 266b   0 18.9s   0   0   0 3003k  16k  62%
20:02:47 P   5  3 Sync 0.9ms 3340 2.2M  32   27 215b   0 27.2s   0   0   0 3003k  16k  62%
20:03:47 P   5  3 Sync 0.9ms 3193 2.1M   0   27 215b   0 25.6s   0   0   0 3003k  16k  62%
20:04:47 P   5  3 Sync 0.9ms 3009 1.9M  12   28 224b   0 22.8s   0   0   0 3003k  16k  62%
20:05:47 P   5  3 Sync 0.9ms 3437 2.2M   0   27 218b   0 23.9s   0   0   0 3003k  16k  62%
20:06:47 P   5  3 Sync 0.9ms 3319 2.1M   7   28 220b   0 24.2s   0   0   0 3003k  16k  62%
20:07:47 P   5  3 Sync 1.0ms 3388 2.2M  16   31 251b   0 22.6s   0   0   0 3003k  16k  62%
20:08:47 P   5  3 Sync 1.1ms 3695 2.4M  19   39 312b   0 13.9s   0   0   0 3003k  16k  62%
20:09:47 P   5  3 Sync 0.9ms 3293 2.1M   0   26 211b   0 26.2s   0   0   0 3003k  16k  62%

This reports around 20-25 seconds of flow control every minute, which is consistent with that ~1/3rd of performance reduction we see in the graphs above.

Watching node3 the same way proves it is sending the flow control (FlowC snt):

mycluster / node3 (idx: 2) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
17:38:09 P   5  3 Dono 0.8ms    0   0b   0 4434 2.8M 16m 25.2s  31   0   0 18634  16k  80%
17:39:09 P   5  3 Dono 1.3ms    0   0b   1 5040 3.2M 16m 22.1s  29   0   0 37497  16k  80%
17:40:09 P   5  3 Dono 1.4ms    0   0b   0 4506 2.9M 16m 21.0s  31   0   0 16674  16k  80%
17:41:09 P   5  3 Dono 0.9ms    0   0b   0 5274 3.4M 16m 16.4s  27   0   0 22134  16k  80%
17:42:09 P   5  3 Dono 0.9ms    0   0b   0 4826 3.1M 16m 19.8s  26   0   0 16386  16k  80%
17:43:09 P   5  3 Jned 0.9ms    0   0b   0 4957 3.2M 16m 18.7s  28   0   0 83677  16k  80%
17:44:09 P   5  3 Jned 0.9ms    0   0b   0 3693 2.4M 16m 27.2s  30   0   0  131k  16k  80%
17:45:09 P   5  3 Jned 0.9ms    0   0b   0 4151 2.7M 16m 26.3s  34   0   0  185k  16k  80%
17:46:09 P   5  3 Jned 1.5ms    0   0b   0 4420 2.8M 16m 25.0s  30   0   0  245k  16k  80%
17:47:09 P   5  3 Jned 1.3ms    0   0b   1 4806 3.1M 16m 21.0s  27   0   0  310k  16k  80%

There are a lot of flow control messages (around 30) per minute.  This is a lot of ON/OFF toggles of flow control where writes are briefly delayed rather than a steady “you can’t write” for 20 seconds straight.

It also interestingly spends a long time in the Donor/Desynced state (even though wsrep_desync was turned OFF hours before) and then moves to the Joined state (this has the same meaning as during an IST).

Does it matter?

As always, it depends.

If these are web requests and suddenly the database can only handle ~66% of the traffic, that’s likely a problem, but maybe it just slows down the website somewhat.  I want to emphasize that WRITES are what is affected here.  Reads on any and all nodes should be normal (though you probably don’t want to read from node3 since it is so far behind).

If this were some queue processing that had reduced throughput, I’d expect it to possibly catch up later.

This can only be answered for your application, but the takeaways for me are:

  • Don’t underestimate your capacity requirements
  • Being at the redline normally means you are well past the redline for abnormal events.
  • Plan for maintenance and failure recoveries
  • Where possible, build queuing into your workflows so diminished throughput in your architecture doesn’t generate failures.

Happy clustering!

Graphs in this post courtesy of VividCortex.

The post High-load clusters and desynchronized nodes on Percona XtraDB Cluster appeared first on MySQL Performance Blog.

by Jay Janssen at August 31, 2015 10:00 AM

August 28, 2015

Peter Zaitsev

Percona Toolkit 2.2.15 is now available

Percona is pleased to announce the availability of Percona Toolkit 2.2.15, released August 28, 2015. Percona Toolkit is a collection of advanced command-line tools to perform a variety of MySQL server and system tasks that are too difficult or complex for DBAs to perform manually. Percona Toolkit, like all Percona software, is free and open source.

This release is the current GA (Generally Available) stable release in the 2.2 series. It includes multiple bug fixes as well as continued preparation for MySQL 5.7 compatibility. Full details are below. Downloads are available here and from the Percona Software Repositories.

New Features:

  • Added --max-flow-ctl option to pt-online-schema-change and pt-archiver with a value set in percent. When a Percona XtraDB Cluster node is very loaded, it sends flow control signals to the other nodes to stop sending transactions in order to catch up. When the average value of time spent in this state (in percent) exceeds the maximum provided in the option, the tool pauses until it falls below again. The default is no flow control checking.
  • Added the --sleep option for pt-online-schema-change to avoid performance problems. The option accepts float values in seconds.
  • Implemented ability to specify --check-slave-lag multiple times for pt-archiver. The following example enables lag checks for two slaves:
    pt-archiver --no-delete --where '1=1' --source h=oltp_server,D=test,t=tbl --dest h=olap_server --check-slave-lag h=slave1 --check-slave-lag h=slave2 --limit 1000 --commit-each
  • Added the --rds option to pt-kill, which makes the tool use Amazon RDS procedure calls instead of the standard MySQL kill command.

Bugs Fixed:

  • Fixed bug 1042727: pt-table-checksum doesn’t reconnect the slave $dbh
    Before, the tool would die if any slave connection was lost. Now the tool waits forever for slaves.
  • Fixed bug 1056507: pt-archiver --check-slave-lag agressiveness
    The tool now checks replication lag every 100 rows instead of every row, which significantly improves efficiency.
  • Fixed bug 1215587: Adding underscores to constraints when using pt-online-schema-change can create issues with constraint name length
    Before, multiple schema changes lead to underscores stacking up on the name of the constraint until it reached the 64 character limit. Now there is a limit of two underscores in the prefix, then the tool alternately removes or adds one underscore, attempting to make the name unique.
  • Fixed bug 1277049: pt-online-schema-change can’t connect with comma in password
    For all tools, documented that commas in passwords provided on the command line must be escaped.
  • Fixed bug 1441928: Unlimited chunk size when using pt-online-schema-change with --chunk-size-limit=0 inhibits checksumming of single-nibble tables
    When comparing table size with the slave table, the tool now ignores --chunk-size-limit if it is set to zero to avoid multiplying by zero.
  • Fixed bug 1443763: Update documentation and/or implentation of pt-archiver --check-interval
    Fixed the documentation for --check-interval to reflect its correct behavior.
  • Fixed bug 1449226: pt-archiver dies with “MySQL server has gone away” when --innodb_kill_idle_transaction is set to a low value and --check-slave-lag is enabled
    The tool now sends a dummy SQL query to avoid timing out.
  • Fixed bug 1446928: pt-online-schema-change not reporting meaningful errors
    The tool now produces meaningful errors based on text from MySQL errors.
  • Fixed bug 1450499: ReadKeyMini causes pt-online-schema-change session to lock under some circumstances
    Removed ReadKeyMini, because it is no longer necessary.
  • Fixed bug 1452914: --purge and --no-delete are mutually exclusive, but still allowed to be specified together by pt-archiver
    The tool now issues an error when --purge and --no-delete are specified together.
  • Fixed bug 1455486: pt-mysql-summary is missing the --ask-pass option
    Added the --ask-pass option to the tool.
  • Fixed bug 1457573: pt-mysql-summary fails to download pt-diskstats pt-pmp pt-mext pt-align
    Added the -L option to curl and changed download address to use HTTPS.
  • Fixed bug 1462904: pt-duplicate-key-checker doesn’t support triple quote in column name
    Updated TableParser module to handle literal backticks.
  • Fixed bug 1488600: pt-stalk doesn’t check TokuDB status
    Implemented status collection similar to how it is performed for InnoDB.
  • Fixed bug 1488611: various testing bugs related to newer Perl versions

Details of the release can be found in the release notes and the 2.2.15 milestone on Launchpad. Bugs can be reported on the Percona Toolkit launchpad bug tracker.

The post Percona Toolkit 2.2.15 is now available appeared first on MySQL Performance Blog.

by Alexey Zhebel at August 28, 2015 05:30 PM

August 26, 2015

Peter Zaitsev

ObjectRocket’s David Murphy talks about MongoDB, Percona Live Amsterdam

Say hello to David Murphy, lead DBA and MongoDB Master at ObjectRocket (a Rackspace company). David works on sharding, tool building, very large-scale issues and high-performance MongoDB architecture. Prior to ObjectRocket he was a MySQL/NoSQL architect at Electronic Arts. David enjoys large-scale operational tool building, high performance OS and database tuning. He is also a core code contributor to MongoDB. He’ll be speaking next month at Percona Live Amsterdam, which runs Sept. 21-23. Enter promo code “BlogInterview” at registration to save €20!


Tom: David, your 3-hour tutorial is titled “Mongo Sharding from the trench: A Veterans field guide.” Did your experience working with vast amounts of data at Rackspace give you a unique perspective that now puts you in a position to help people just getting started? Can you give a couple of examples?

David: I think this is something I grew into organically, from the days of supporting Cpanel-type MySQL instances to today. I have worked in a few verticals, from hosting to advertising to gaming, before finally entering platform services. Those gave me a host of knowledge around how customers need systems to work, and the number and range of workloads we see at Rackspace re-enforces this.

ObjectRocket’s David Murphy

Many times the unique perspective comes with the scale, such as someone scaling up a single node to the multi-terabyte range. When they go to “shard”, they can find that the process, normally very light and unnoticeable in most Mongo sharding, can severely lock the metadata for an extended time. In other cases, the “balancer” might not be able to keep up with the amount of work being asked of it.

Toward the smaller end of the spectrum, having seen so many workloads from big to small, I can see similar thought processes and trends. Having worked with so many of these workloads, and honestly having learned along the evolution of Mongo, helps me explain to clients the good, the bad, and the hairy. Many times discussions come down to people not using connection pooling, non-indexed sorting, or complex operators such as $in, $nin, and more. In these cases, I can talk to people about the balance of using these concepts and when they will become bigger issues for them. My goal is to give them enough knowledge to help determine when it is correct to use development resources to fix an issue, and when it’s manageable and development could be better spent elsewhere.


Tom: The title of your tutorial also sounds like the perfect title of a book. Do you have any for one?

David: What an excellent question! I have thought about this; however, I think it would be the goal of a book, if I can find the time to do it. A working title might be “Mongo from the trenches: Surviving the minefield to get ahead”. I think the book might be broken into three sections: “When should you use or not use Mongo”, “Schema and Operators in the NoSQL world”, and “Sharding”. I would do this as it could be a great mini book on its own; the community really could use a level of depth similar to the MySQL 5.0 certification guides. I liked those books as they helped someone understand all the bits of what to consider with your schema design and how it affects the application as much as the database hosts. Then the second half, more administration-geared, took those same schema and design choices and helped you manage them with confidence.

In the end, Mongo is a good product that works well for most people; as it matures we need more discussion on topics such as what you should monitor, how you should predict issues, and how valuable regular audits are. Especially in an ecosystem where it’s easy to spin something up, launch it, and move on to the next project.


Tom: When and why would you recommend using MongoDB instead of MySQL?

David: I am glad I mentioned this is worthy of a book already, as it is such a complex topic and one that gets me very excited.

I feel there is a bit of misinformation on both sides of this field. Many experts in the MySQL camp know that when someone says they can’t get more than 1000 TPS via MySQL, 9 out of 10 times it is a design issue, not a technology issue; the Mongo crowd love this, and due to the inherent sharding nature of Mongo they can sidestep these types of issues. Conversely, in the Mongo camp you will hear how bad the SQL standard is; however, omitting transactions for a moment, the same types of operations exist in MySQL and Mongo. There are some interesting powers in Mongo aggregation; however, SQL is more powerful, and just as complex as some map-reduce jobs and aggregations I have written.

As to your question, MySQL will always win in regards to repeatable reads to the database in a transaction. There is some talk of limited transactions in Mongo; however, these will likely not become global and cluster-wide anytime soon, if ever. I don’t trust floats in Mongo for financials; it’s not that Mongo doesn’t do them, but rather that JavaScript-type floats are what is present. Sometimes you need to store data as a 64-bit integer and do math in the app to make it a high-precision float. MySQL, on the other hand, has excellent support for precision.

Another area is simply looking at the history of Mongo and MySQL. Mongo, until WiredTiger and RocksDB, was very similar to MyISAM from a locking behavior and support perspective. With the advent of the new storage systems, we will see major leaps forward in the types of flows you will want in Mongo. With the writer lock issue gone, locking between the systems is becoming more and more similar, making the decision much harder.

The news is not all one-sided, however: subdocument and array support in Mongo is amazing; there are so many things I can do in Mongo that I could not do even with bitwise SET/ENUM operators. So if you need that type of system, or you want to create a semi-denormalized form of a view in the database, Mongo can do this with ease and on the fly; MySQL, on the other hand, would take careful planning and need whole tables updated. In this regard I feel more people could use Mongo and its ability to have a versioned document schema, allowing more incremental changes to documents: with new code releases, the application can read old versions and “upgrade” them to the latest form, removing a whole flurry of maintenance-related pains that RDBMSs have, to the frustration of developers who just want to launch the new product.

The last thing I would want to say here is that you need not choose: why not use both? Mongo can be very powerful for keeping a semi-denormalized version of the data that is nimble, allowing fast application or system updates and features, leaving MySQL for the very specific workloads that need precision, are simple, and are not expected to have schema changes. I am a huge fan of keeping the transactional portions in MySQL, and the rest in Mongo, allowing you to scale the bulk of your data needs quickly up and down, and to change more slowly the parts that need to be 100% consistent all of the time, with no room for eventual consistency.


Tom: What another session(s) are you most looking forward to besides your own at Percona Live Amsterdam?

David: There are a few that are near and dear to me.

“Turtles all the way down: tuning Linux for database workloads” looks like a great one. It reflects one view I have always had: DBAs should be DBAs, SysAdmins, and storage people rolled into one. That way they can understand the impacts of the application down to the blocks the database reads.

“TokuDB internals” is another one. I have used TokuDB in MySQL and Mongo to some degree, but it has never had in-depth documentation. A topic like that is a great way to fill any gaps for experienced and new people alike.

Database Reliability Engineering” looks like a great talk from a great speaker.

As an InnoDB geek, I like the idea around “Understanding InnoDB locks: case studies.”

I see a huge amount of potential for MaxScale if anyone else is curious, “Anatomy of a Proxy Server: MaxScale Internals” should be good for R/W splits and split writing type cases.

Finally, one of my favorite people is Charity as she always is so energetic and can get to the heart of the matter. If you are not going to “Upgrade your database: without losing your data, your perf or your mind” you are missing out!


Tom: Thanks for speaking with me, David! Is there anything else you’d like to add: either about Rackspace or Percona Live Amsterdam?

David: In regards to Rackspace, I urge everyone to check out the Data Services group. We handle everything from Redis to Hadoop, with a goal of augmenting your teams or providing experts to help keep your uptime as high as possible. With options from dedicated hosts to platform-type services, there is something that helps everyone. Rackspace is not just a cloud company but a real support company that provides amazing hardware to use, or support for hardware in other locations, and it is growing rapidly.

With Percona Amsterdam, everyone should come: the group of speakers is simply amazing. I, for one, am excited by so many topics because they are all so compelling. Outside of that, you will find it hard to find another gathering of database experts with multiple technologies under their belt who truly believe in picking the right technology for the right use case.

The post ObjectRocket’s David Murphy talks about MongoDB, Percona Live Amsterdam appeared first on MySQL Performance Blog.

by Tom Diederich at August 26, 2015 08:07 PM

Jean-Jerome Schmidt

Become a MySQL DBA blog series - Analyzing your SQL Workload using pt-query-digest

In our previous post, we discussed the different ways to collect data about slow queries - MySQL offers slow log, general log and binary log. Using tcpdump, you can grab network traffic data - a good and low-impact method of collecting basic metrics for slow queries. So, now that you are sitting on top of a large pile of data, perhaps tens of gigabytes of logs, how do you get a whole picture of the workload? 

Mid to large size applications tend to have hundreds of SQL statements distributed throughout a large code base, with potentially hundreds of queries running every second. That can generate a lot of data. How do we identify the causes of bottlenecks slowing down our applications? Obviously, going through the information query by query would not be great - we’d get drowned in all the entries. We need to find a way to aggregate the data and make sense of it all.

One solution would be to write your own scripts to parse the logs and generate some sensible output. But why reinvent the wheel when such a script already exists, and is only a single wget away? In this blog, we’ll have a close look at pt-query-digest, which is part of Percona Toolkit. 

This is the tenth installment in the ‘Become a MySQL DBA’ blog series. Our previous posts in the DBA series include Query Tuning Process, Configuration Tuning,  Live Migration using MySQL Replication, Database Upgrades, Replication Topology Changes, Schema Changes, High Availability, Backup & Restore, Monitoring & Trending.

Processing data

Pt-query-digest accepts data from the general log, binary log, slow log or tcpdump - this covers all of the ways MySQL can generate query data. In addition to that, it’s possible to poll the MySQL process list at a defined interval - a process which can be resource-intensive and far from ideal, but which can still be used as an alternative.

Let’s start with data collected in the slow log. For completeness we’ve used Percona Server with log_slow_verbosity=full. As a source of queries we’ve used sysbench started with --oltp-read-only=off and default query mix.
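For reference, a sketch of the logging setup assumed above (Percona Server variables; capturing every query with long_query_time = 0 is our assumption, and it is what makes a 100% profile meaningful):

-- Sketch: log every query, with Percona Server's extended slow-log detail.
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0;
SET GLOBAL log_slow_verbosity = 'full';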

Slow log - summary part

First of all, we need to process all the data in our slow query log. This can be done by running:

$ pt-query-digest --limit=100% /var/lib/mysql/slow.log > ptqd1.out

We’ve used limit=100% to ensure all queries are included in the report, otherwise some of the less impacting queries may not end up in the output.

After a while you’ll be presented with a report file. It starts with a summary section.

# 52.5s user time, 100ms system time, 33.00M rss, 87.85M vsz
# Current date: Sat Aug 22 16:56:34 2015
# Hostname: vagrant-ubuntu-trusty-64
# Files: /var/lib/mysql/slow.log
# Overall: 187.88k total, 11 unique, 3.30k QPS, 0.73x concurrency ________
# Time range: 2015-08-22 16:54:17 to 16:55:14

At the beginning, you’ll see information about how many queries were logged, how many unique query types there were, QPS and what the concurrency looked like. You can also check the timeframe for the report.

To calculate unique queries, pt-query-digest strips the values passed in the WHERE condition, removes all whitespace (among other transformations) and calculates a hash of the query. Such a hash, called a ‘Query ID’, is used in pt-query-digest to differentiate between queries.
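As an illustration of that distillation (our own example, mirroring the sbtest queries later in the report), these two literal queries differ only in their values, so they collapse to the same fingerprint and therefore the same Query ID:

-- Both of these literal queries...
SELECT c FROM sbtest1 WHERE id=100790;
SELECT c FROM sbtest1 WHERE id=42;
-- ...share the fingerprint below (note the digit in the table name is abstracted too):
-- select c from sbtest? where id=?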

# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time            42s     3us    66ms   221us   541us   576us   131us
# Lock time             5s       0    25ms    26us    38us   146us    21us
# Rows sent          2.78M       0     100   15.54   97.36   34.53    0.99
# Rows examine       6.38M       0     300   35.63  192.76   77.79    0.99
# Rows affecte      36.66k       0       1    0.20    0.99    0.40       0
# Bytes sent       354.43M      11  12.18k   1.93k  11.91k   4.21k  192.76
# Merge passes           0       0       0       0       0       0       0
# Tmp tables         9.17k       0       1    0.05       0    0.22       0
# Tmp disk tbl           0       0       0       0       0       0       0
# Tmp tbl size       1.11G       0 124.12k   6.20k       0  26.98k       0
# Query size        10.02M       5     245   55.95  151.03   50.65   36.69
# InnoDB:
# IO r bytes        22.53M       0  96.00k  139.75       0   1.73k       0
# IO r ops           1.41k       0       6    0.01       0    0.11       0
# IO r wait             2s       0    38ms    10us       0   382us       0
# pages distin     355.29k       0      17    2.15    4.96    2.13    1.96
# queue wait             0       0       0       0       0       0       0
# rec lock wai         2ms       0   646us       0       0     2us       0
# Boolean:
# Filesort       9% yes,  90% no
# Tmp table      4% yes,  95% no

Another part of the summary is pretty self-explanatory. This is a list of different attributes like execution time, lock time, number of rows sent, examined or modified. There’s also information about temporary tables, query size and network traffic. We also have some InnoDB metrics.

For each of those metrics you can find different statistical data. What was the total execution time (or how many rows in total were examined)? What was the minimal execution time (or minimal number of rows examined)? What was the longest query (or the most rows examined by a single query)? We have also calculated the average, 95th percentile, standard deviation and median. Finally, at the end of this section you can find information about how large a percentage of queries created a temporary table or used a filesort algorithm.

The next section tells us more about each query type. By default pt-query-digest sorts queries by total execution time. This can be changed, though (using the --order-by flag), and you can easily sort queries by, for example, the total number of rows examined.

# Profile
# Rank Query ID           Response time Calls R/Call V/M   Item
# ==== ================== ============= ===== ====== ===== ==============
#    1 0x558CAEF5F387E929 14.6729 35.2% 93956 0.0002  0.00 SELECT sbtest?
#    2 0x737F39F04B198EF6  6.4087 15.4%  9388 0.0007  0.00 SELECT sbtest?
#    3 0x84D1DEE77FA8D4C3  3.8566  9.2%  9389 0.0004  0.00 SELECT sbtest?
#    4 0x3821AE1F716D5205  3.2526  7.8%  9390 0.0003  0.00 SELECT sbtest?
#    5 0x813031B8BBC3B329  3.1080  7.5%  9385 0.0003  0.00 COMMIT
#    6 0xD30AD7E3079ABCE7  2.2880  5.5%  9387 0.0002  0.00 UPDATE sbtest?
#    7 0x6EEB1BFDCCF4EBCD  2.2869  5.5%  9389 0.0002  0.00 SELECT sbtest?
#    8 0xE96B374065B13356  1.7520  4.2%  9384 0.0002  0.00 UPDATE sbtest?
#    9 0xEAB8A8A8BEEFF705  1.6490  4.0%  9386 0.0002  0.00 DELETE sbtest?
#   10 0xF1256A27240AEFC7  1.5952  3.8%  9384 0.0002  0.00 INSERT sbtest?
#   11 0x85FFF5AA78E5FF6A  0.8382  2.0%  9438 0.0001  0.00 BEGIN

As you can see, we have eleven queries listed. For each of them we see its total response time and its share of the overall response time. We can check how many calls there were in total, the mean request time for the query, and its variance-to-mean ratio, which tells us whether the workload was stable or not. Finally, we see the query in its distilled form.

This concludes the summary part; the rest of the report discusses the individual queries in more detail.

Slow log - queries

In this section pt-query-digest prints data about each distinct query type it identified. Queries are sorted from the most time-consuming to the least impacting one.

# Query 1: 1.65k QPS, 0.26x concurrency, ID 0x558CAEF5F387E929 at byte 46229
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.00
# Time range: 2015-08-22 16:54:17 to 16:55:14
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         50   93956
# Exec time     35     15s     9us    66ms   156us   236us   713us   119us
# Lock time     52      3s       0    25ms    27us    33us   199us    20us

From the data above we can learn that this query was responsible for 35% of total execution time. It also made up 50% of all queries executed in this slow log, and was responsible for 52% of the total lock time. We can see that the query is pretty quick - on average it took 156 microseconds to finish, and the longest execution took 66 milliseconds.

# Rows sent      3  91.75k       1       1       1       1       0       1
# Rows examine   1  91.75k       1       1       1       1       0       1
# Rows affecte   0       0       0       0       0       0       0       0
# Bytes sent     4  17.47M     195     195     195     195       0     195
# Merge passes   0       0       0       0       0       0       0       0
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size    32   3.27M      36      37   36.50   36.69    0.50   36.69

This query sent 3% and examined 1% of all rows, which means it’s pretty well optimized when we compare that with its share of the query count (50% of all queries). This is clearly depicted in the next columns - it scans and sends only a single row, which means it is most likely based on a primary or unique key lookup. No temporary tables were created by this query.

# InnoDB:
# IO r bytes    46  10.42M       0  96.00k  116.31       0   1.37k       0
# IO r ops      46     667       0       6    0.01       0    0.09       0
# IO r wait     68      1s       0    38ms    12us       0   498us       0
# pages distin  14  51.33k       0       8    0.56    3.89    1.33       0
# queue wait     0       0       0       0       0       0       0       0
# rec lock wai   0       0       0       0       0       0       0       0

The next part contains InnoDB statistics. We can see that the query did some disk operations; what’s most important is the fact that it’s responsible for 68% of the InnoDB I/O wait. There are no queue waits (time spent by a thread before it could enter the InnoDB kernel) or record lock waits.

# String:
# Databases    sbtest
# Hosts        10.0.0.200
# InnoDB trxID 112C (10/0%), 112D (10/0%), 112E (10/0%)... 9429 more
# Last errno   0
# Users        sbtest
# Query_time distribution
#   1us  #
#  10us  ####
# 100us  ################################################################
#   1ms  #
#  10ms  #
# 100ms
#    1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `sbtest` LIKE 'sbtest1'\G
#    SHOW CREATE TABLE `sbtest`.`sbtest1`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT c FROM sbtest1 WHERE id=100790\G

Finally, we have some data covering who, where and what. What schema was used for the query? Which host did it come from? Which user executed it? What InnoDB transaction IDs did it have? Did it end up in error, and what was the error number?

We also have a nice chart covering the execution time distribution - it’s very useful for locating all those queries which are “fast most of the time but sometimes not really”. These queries may suffer from internal or row-level locking issues, or have their execution plan flipping. Finally, we have SHOW TABLE STATUS and SHOW CREATE TABLE statements printed, covering all of the tables involved in the query. It comes in really handy - you can just copy them from the report and paste into the MySQL console - a very nice touch. At the end we can find a full version of the query (along with EXPLAIN, again, for easy copy/paste). The full version is the instance of the given query type which had the longest execution time. For DML, where EXPLAIN doesn’t work yet, pt-query-digest prints the equivalent written in the form of a SELECT:

UPDATE sbtest1 SET k=k+1 WHERE id=100948\G
# Converted for EXPLAIN
# EXPLAIN /*!50100 PARTITIONS*/
select  k=k+1 from sbtest1 where  id=100948\G

 

Binary log

Another example of data that pt-query-digest can process is the binary log. As we mentioned last time, binary logs contain data regarding DMLs only, so if you are looking for SELECTs, you won't find them here. Still, if you are suffering from DML-related issues (significant locking, for example), binary logs may be useful to, at least preliminarily, check the health of the database. What's important is that binary logs have to be in 'statement' or 'mixed' format to be useful for pt-query-digest. If that's the case, here are the steps you need to follow to get the report.
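
You can quickly verify the format before going further, for example:

mysql> SHOW GLOBAL VARIABLES LIKE 'binlog_format';

If it reports ROW, pt-query-digest will have nothing to parse, as row events do not contain SQL statements.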

First, we need to parse binary logs:

$ mysqlbinlog /var/lib/mysql/binlog.000004 > binlog.000004.txt

Then we need to feed them to pt-query-digest, ensuring that we set --type flag to ‘binlog’:

$ pt-query-digest --type binlog binlog.000004.txt > ptqd_bin.out
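
Since pt-query-digest also reads standard input, the two steps can be combined into one pipeline (same files as above):

$ mysqlbinlog /var/lib/mysql/binlog.000004 | pt-query-digest --type binlog > ptqd_bin.out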

Here’s an example output - it’s a bit messy but all data that could have been derived from the binary log is there:

# 7s user time, 40ms system time, 26.02M rss, 80.88M vsz
# Current date: Mon Aug 24 16:43:55 2015
# Hostname: vagrant-ubuntu-trusty-64
# Files: binlog.000004.txt
# Overall: 48.31k total, 9 unique, 710.44 QPS, 0.07x concurrency _________
# Time range: 2015-08-24 12:17:26 to 12:18:34
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time             5s       0      1s   103us       0    10ms       0
# Query size         4.48M       1     245   81.09  234.30   85.44   38.53
# @@session.au           1       1       1       1       1       0       1
# @@session.au           1       1       1       1       1       0       1
# @@session.au           1       1       1       1       1       0       1
# @@session.ch           8       8       8       8       8       0       8
# @@session.co           8       8       8       8       8       0       8
# @@session.co           8       8       8       8       8       0       8
# @@session.fo           1       1       1       1       1       0       1
# @@session.lc           0       0       0       0       0       0       0
# @@session.ps           2       2       2       2       2       0       2
# @@session.sq           0       0       0       0       0       0       0
# @@session.sq       1.00G   1.00G   1.00G   1.00G   1.00G       0   1.00G
# @@session.un           1       1       1       1       1       0       1
# error code             0       0       0       0       0       0       0


# Query 1: 146.39 QPS, 0.03x concurrency, ID 0xF1256A27240AEFC7 at byte 13260934
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.98
# Time range: 2015-08-24 12:17:28 to 12:18:34
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         20    9662
# Exec time     40      2s       0      1s   206us       0    14ms       0
# Query size    50   2.25M     243     245  243.99  234.30       0  234.30
# error code     0       0       0       0       0       0       0       0
# String:
# Databases    sbtest
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s  ################################################################
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `sbtest` LIKE 'sbtest1'\G
#    SHOW CREATE TABLE `sbtest`.`sbtest1`\G
INSERT INTO sbtest1 (id, k, c, pad) VALUES (99143, 99786, '17998642616-46593331146-44234899131-07942028688-05655233655-90318274520-47286113949-12081567668-14603691757-62610047808', '13891459194-69192073278-13389076319-15715456628-59691839045')\G

As you can see, we have data about the number of queries executed and their execution time. There's also information about the query size and the query itself.

Tcpdump

pt-query-digest works nicely with data captured by tcpdump. We’ve discussed some ways you can collect the data in our last post. One command you may use is:

$ tcpdump -s 65535 -x -nn -q -tttt -i any port 3306 > mysql.tcp.txt
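
If you prefer to capture to a pcap file first (for example, to copy it off the production host and process it elsewhere), tcpdump can later decode it into the text form pt-query-digest expects; a sketch:

$ tcpdump -s 65535 -w mysql.pcap -i any port 3306
$ tcpdump -r mysql.pcap -x -nn -q -tttt > mysql.tcp.txt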

Once the capture process ends, we can proceed with processing the data:

$ pt-query-digest --limit=100% --type tcpdump mysql.tcp.txt > ptqd_tcp.out

Let’s take a look at the final report:

# 113s user time, 3.4s system time, 41.98M rss, 96.71M vsz
# Current date: Mon Aug 24 18:23:20 2015
# Hostname: vagrant-ubuntu-trusty-64
# Files: mysql.tcp.txt
# Overall: 124.42k total, 12 unique, 1.98k QPS, 0.95x concurrency ________
# Time range: 2015-08-24 18:15:58.272774 to 18:17:00.971587
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time            59s       0   562ms   477us     2ms     3ms   152us
# Rows affecte      24.24k       0       1    0.20    0.99    0.40       0
# Query size         6.64M       5     245   55.97  151.03   50.67   36.69
# Warning coun           0       0       0       0       0       0       0

# Profile
# Rank Query ID           Response time Calls R/Call V/M   Item
# ==== ================== ============= ===== ====== ===== ==============
#    1 0x558CAEF5F387E929 18.1745 30.6% 62212 0.0003  0.01 SELECT sbtest?
#    2 0x813031B8BBC3B329  8.9132 15.0%  6220 0.0014  0.09 COMMIT
#    3 0x737F39F04B198EF6  6.8068 11.4%  6220 0.0011  0.00 SELECT sbtest?
#    4 0xD30AD7E3079ABCE7  5.0876  8.6%  6220 0.0008  0.00 UPDATE sbtest?
#    5 0x84D1DEE77FA8D4C3  3.7781  6.4%  6220 0.0006  0.00 SELECT sbtest?
#    6 0xE96B374065B13356  3.5256  5.9%  6220 0.0006  0.00 UPDATE sbtest?
#    7 0x6EEB1BFDCCF4EBCD  3.4662  5.8%  6220 0.0006  0.00 SELECT sbtest?
#    8 0xEAB8A8A8BEEFF705  2.9711  5.0%  6220 0.0005  0.00 DELETE sbtest?
#    9 0xF1256A27240AEFC7  2.9067  4.9%  6220 0.0005  0.00 INSERT sbtest?
#   10 0x3821AE1F716D5205  2.6769  4.5%  6220 0.0004  0.00 SELECT sbtest?
#   11 0x85FFF5AA78E5FF6A  1.1544  1.9%  6222 0.0002  0.00 BEGIN
#   12 0x5D51E5F01B88B79E  0.0030  0.0%     2 0.0015  0.00 ADMIN CONNECT
# Query 1: 992.25 QPS, 0.29x concurrency, ID 0x558CAEF5F387E929 at byte 293768777
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.01
# Time range: 2015-08-24 18:15:58.273745 to 18:17:00.971587
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         50   62212
# Exec time     30     18s       0   448ms   292us   657us     2ms   113us
# Rows affecte   0       0       0       0       0       0       0       0
# Query size    32   2.17M      36      37   36.50   36.69    0.50   36.69
# Warning coun   0       0       0       0       0       0       0       0
# String:
# Databases    sbtest^@mysql_native_password
# Hosts        10.0.0.200
# Users        sbtest
# Query_time distribution
#   1us  #
#  10us  ########################
# 100us  ################################################################
#   1ms  ###
#  10ms  #
# 100ms  #
#    1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `sbtest^@mysql_native_password` LIKE 'sbtest1'\G
#    SHOW CREATE TABLE `sbtest^@mysql_native_password`.`sbtest1`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT c FROM sbtest1 WHERE id=138338\G

It looks fairly similar to the report generated from the binary logs. What's important is that the resolution is much better. The binary log could only show us query execution time rounded to the nearest second; here we enjoy the same resolution as when using the slow log with long_query_time=0.

Unfortunately, we are missing all the additional information that the slow log could provide - only the query count, query execution time and query size are there. The better resolution, though, allows the query execution time graph to be more detailed, so it can be used to quickly check whether a given query has stable performance.

In the previous blog post we mentioned that it's common to combine different methods of gathering information about queries in the system. We hope you now see why - if you need detailed data, there's no other way than to use the slow query log. For day-to-day checks it's enough to use TCP traffic - it has less impact than logging all queries to the slow log, and it's still good enough for determining whether a query suffers from slowdowns or not.

In the next post we will walk you through some examples of pt-query-digest reports and how we can derive important information from them. We will also show you how you can benefit from the different additional options that pt-query-digest offers to get more detailed information.

by Severalnines at August 26, 2015 01:14 PM

August 25, 2015

Jean-Jerome Schmidt

Howto: Online Upgrade of MariaDB Galera Cluster 5.5 to MariaDB 10

The MariaDB team released a GA version of MariaDB Galera Cluster 10 in June 2014. MariaDB 10 is the equivalent of MySQL 5.6, and is therefore packed with lots of great features.

In this blog post, we’ll look into how to perform an online upgrade to MariaDB Galera Cluster 10. At the time of writing, MariaDB 10.1 was still in beta so the instructions in this blog are applicable to MariaDB 10.0. If you are running the Codership build of Galera (Galera Cluster for MySQL), you might be interested in the online upgrade to MySQL 5.6 instead. 

Offline Upgrade

An offline upgrade requires downtime, but it is more straightforward. If you can afford a maintenance window, this is probably a safer way to reduce the risk of upgrade failures. The major steps consist of stopping the cluster, upgrading all nodes, then bootstrapping and starting the nodes. We covered the procedure in detail in this blog post.

Online Upgrade

An online upgrade has to be done in a rolling upgrade/restart fashion, i.e., upgrade one node at a time and then proceed to the next. During the upgrade, you will have a mix of MariaDB 5.5 and 10 nodes. This can cause problems if not handled with care.

Here is the list that you need to check prior to the upgrade:

  • Read and understand the changes with the new version
  • Note the unsupported configuration options between the major versions
  • Determine your cluster ID from the ClusterControl summary bar
  • garbd nodes will also need to be upgraded
  • All nodes must have internet connection
  • SST must be avoided during the upgrade so we must ensure each node’s gcache is appropriately configured and loaded prior to the upgrade.
  • Some nodes will be read-only during the period of upgrade which means there will be some impact on the cluster’s write performance. Perform the upgrade during non-peak hours.
  • The load balancer must be able to detect and exclude backend DB servers in read-only mode. Writes coming from MariaDB 10.0 are not compatible with 5.5 in the same cluster. Percona's clustercheck and the ClusterControl mysqlchk script should be able to handle this by default.
  • If you are running on ClusterControl, ensure ClusterControl auto recovery feature is turned off to prevent ClusterControl from recovering a node during its upgrade. 

Here is what we’re going to do to perform the online upgrade:

  1. Set up MariaDB 10 repository on all DB nodes.
  2. Increase gcache size and perform rolling restart.
  3. Backup all databases.
  4. Turn off ClusterControl auto recovery.
  5. Start the maintenance window.
  6. Upgrade packages in Db1 to MariaDB 10.
  7. Add compatibility configuration options on my.cnf of Db1 (node will be read-only).
  8. Start Db1. At this point the cluster will consist of MariaDB 10 and 5.5 nodes. (Db1 is read-only, writes go to Db2 and Db3)
  9. Upgrade packages in Db2 to MariaDB 10.
  10. Add compatibility configuration options on my.cnf of Db2 (node will be read-only).
  11. Start Db2. (Db1 and Db2 are read-only, writes go to Db3)
  12. Bring down Db3 so the read-only on upgraded nodes can be turned off (no more MariaDB 5.5 at this point)
  13. Turn off read-only on Db1 and Db2. (Writes go to Db1 and Db2)
  14. Upgrade packages in Db3 to MariaDB 10.
  15. Start Db3. At this point the cluster will consist of MariaDB 10 nodes. (Writes go to Db1, Db2 and Db3)
  16. Clean up the compatibility options on all DB nodes.
  17. Verify nodes performance and availability.
  18. Turn on ClusterControl auto recovery.
  19. Close maintenance window

Upgrade Steps

In the following example, we have a 3-node MariaDB Galera Cluster 5.5 that we installed via the Severalnines Configurator, running on CentOS 6 and Ubuntu 14.04. The steps performed here should work regardless of whether the cluster was deployed with or without ClusterControl. Omit sudo if you are running as root.

Preparation

1. MariaDB 10 packages are available in the MariaDB package repository. Replace the existing MariaDB repository URL with version 10.

On CentOS 6:

$ sed -i 's/5.5/10.0/' /etc/yum.repos.d/MariaDB.repo

On Ubuntu 14.04:

$ sudo sed -i 's/5.5/10.0/' /etc/apt/sources.list.d/MariaDB.list
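
It doesn't hurt to verify the change before proceeding; the repository URL should now reference 10.0, similar to this (CentOS shown, exact URL depends on your platform):

$ grep baseurl /etc/yum.repos.d/MariaDB.repo
baseurl = http://yum.mariadb.org/10.0/centos6-amd64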

2. Increase the gcache size to a suitable amount. To be safe, increase the size so it can hold 1 to 2 hours of downtime, as explained in this blog post under the 'Determining good gcache size' section. In this example, we are going to increase the gcache size to 1GB. Open the MariaDB configuration file:

$ vim /etc/my.cnf # CentOS
$ sudo vim /etc/mysql/my.cnf # Ubuntu

Append or modify the following line under wsrep_provider_options:

wsrep_provider_options="gcache.size=1G"

Perform a rolling restart to apply the change. For ClusterControl users, you can use Manage > Upgrades > Rolling Restart.
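
If you are unsure what a suitable gcache size is, you can approximate the write rate from the standard Galera status counters; a rough sketch is to sample these twice, one hour apart, and sum the two deltas to estimate how much gcache one hour of downtime requires:

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_replicated_bytes';
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_received_bytes';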

3. Backup all databases. This is critical before performing any upgrade so you have something to fall back to if the upgrade fails. For ClusterControl users, you can use Backup > Start a Backup Immediately.

4. Turn off ClusterControl auto recovery from the UI, similar to screenshot below:

5. If you installed HAProxy through ClusterControl, there was a bug in previous versions of the health check script when detecting read-only nodes. Run the following command on all DB nodes to fix it (this is fixed in the latest version):

$ sudo sed -i 's/YES/ON/g' /usr/local/sbin/mysqlchk

 

Upgrading Database Server Db1 and Db2

1. Stop the MariaDB service and remove the MariaDB Galera server and galera packages.

CentOS:

$ service mysql stop
$ yum remove MariaDB*server MariaDB-client galera

Ubuntu:

$ sudo killall -15 mysqld mysqld_safe
$ sudo apt-get remove mariadb-galera* galera*

2. Modify the MariaDB configuration file for 10's incompatible option by commenting out or removing the following line (if it exists):

#engine-condition-pushdown=1

Then, append or modify the following lines to set the backward compatibility options:

[MYSQLD]
# Required for compatibility with galera-2. Ignore if you have galera-3 installed.
# Append socket.checksum=1 in wsrep_provider_options:
wsrep_provider_options="gcache.size=1G; socket.checksum=1"

# Required for replication compatibility
# Add following lines under [mysqld] directive:
binlog_checksum=NONE
read_only=ON

3. For CentOS, install the latest version via package manager and follow the substeps (a) and (b):
CentOS:

$ yum clean metadata
$ yum install MariaDB-Galera-server MariaDB-client MariaDB-common MariaDB-compat galera

3a) Start MariaDB with skip grant and run the mysql_upgrade script to upgrade the system tables:

$ mysqld --skip-grant-tables --user=mysql --wsrep-provider='none' &
$ mysql_upgrade -uroot -p

Make sure the last line returns ‘OK’, indicating the mysql_upgrade succeeded.

3b) Gracefully kill the running mysqld process and start the server:

$ sudo killall -15 mysqld
$ sudo service mysql start

Ubuntu:

$ sudo apt-get update
$ sudo apt-get install mariadb-galera-server

**For Ubuntu and Debian packages, mysql_upgrade will be run automatically when they are installed.

4. Monitor the MariaDB error log and ensure the node joins through IST, similar to below:

150821 15:08:04 [Note] WSREP: Signalling provider to continue.
150821 15:08:04 [Note] WSREP: SST received: 2d14e556-473a-11e5-a56e-27822412a930:517325
150821 15:08:04 [Note] WSREP: Receiving IST: 1711 writesets, seqnos 517325-519036
150821 15:08:04 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.0.21-MariaDB-wsrep'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server, wsrep_25.10.r4144
150821 15:08:04 [Note] WSREP: IST received: 2d14e556-473a-11e5-a56e-27822412a930:519036
150821 15:08:04 [Note] WSREP: 0.0 (192.168.55.139): State transfer from 2.0 (192.168.55.141) complete.
150821 15:08:04 [Note] WSREP: Shifting JOINER -> JOINED (TO: 519255)
150821 15:08:04 [Note] WSREP: Member 0.0 (192.168.55.139) synced with group.
150821 15:08:04 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 519260)
150821 15:08:04 [Note] WSREP: Synchronized with group, ready for connections

Repeat the same step for the second node. Once the upgrade is completed on Db1 and Db2, you should notice that writes are now redirected to Db3, which is still running on MariaDB 5.5:

This doesn’t mean that both upgraded MariaDB servers are down, they are just set to be read-only to prevent writes coming from 10 being replicated to 5.5 (which is not supported). We will switch the writes over just before we start the upgrade of the last node.

Upgrading the Last Database Server (Db3)

1. Stop MariaDB service on Db3:

$ sudo service mysql stop # CentOS
$ sudo killall -15 mysqld mysqld_safe # Ubuntu

2. Turn off read-only on Db1 and Db2 so they can start to receive writes. From this point, writes will go to MariaDB 10 only. On Db1 and Db2, run the following statement:

$ mysql -uroot -p -e 'SET GLOBAL read_only = OFF'

Verify that you see something like below on the HAProxy statistic page (ClusterControl > Nodes > select HAproxy node), indicating Db1 and Db2 are now up and active:

3. Modify the MariaDB configuration file for 10's incompatible option by commenting out or removing the following line (if it exists):

#engine-condition-pushdown=1

** There is no need to enable backward compatibility options anymore since the other nodes (Db1 and Db2) are already in 10.

4. Now we can proceed with the upgrade on Db3. Install the latest version via package manager. For CentOS, follow the substeps (a) and (b):
CentOS:

$ yum clean metadata
$ yum install MariaDB-Galera-server MariaDB-client MariaDB-common MariaDB-compat galera

4a) Start MariaDB with skip grant and run the mysql_upgrade script to upgrade the system tables:

$ mysqld --skip-grant-tables --user=mysql --wsrep-provider='none' &
$ mysql_upgrade -uroot -p

Make sure the last line returns ‘OK’, indicating the mysql_upgrade succeeded.

4b) Gracefully kill the running mysqld process and start the server:

$ sudo killall -15 mysqld
$ sudo service mysql start

Ubuntu:

$ sudo apt-get update
$ sudo apt-get install mariadb-galera-server

**For Ubuntu and Debian packages, mysql_upgrade will be run automatically when they are installed.

Wait until the last node joins the cluster and verify the status and version from the ClusterControl Overview tab:

That’s it. Your cluster is upgraded to MariaDB Galera 10. Next, we need to perform some cleanups.

Cleaning Up

1. Fetch the new configuration file contents into ClusterControl by going to ClusterControl > Manage > Configurations > Reimport Configuration.

2. Remove the backward compatibility option on Db1 and Db2 by modifying the following lines:

# Remove socket.checksum=1 in wsrep_provider_options:
wsrep_provider_options="gcache.size=1G"

# Remove or comment following lines:
#binlog_checksum=NONE
#read_only=ON

Restart Db1 and Db2 (one node at a time) to immediately apply the changes.

3. Enable ClusterControl automatic recovery:

4. On some occasions, you might see "undefined()" appearing on the Overview page. This is due to a bug in previous ClusterControl versions when detecting a new node. To reset the node status, run the following on the ClusterControl node:

$ sudo service cmon stop
$ mysql -uroot -p -e 'truncate cmon.server_node'
$ sudo service cmon start

Take note that the upgrade preparation part could take a long time if you have a huge dataset to back up. The upgrading process (excluding preparation) took us approximately 45 minutes to complete.

For MySQL Galera users, we have covered the similar online upgrade to MySQL 5.6 in this blog post.


by Severalnines at August 25, 2015 09:32 AM

August 24, 2015

Peter Zaitsev

Advanced Query Tuning in MySQL 5.6 and MySQL 5.7 Webinar: Q&A

Thank you for attending my July 22 webinar titled “Advanced Query Tuning in MySQL 5.6 and 5.7” (my slides and a replay available here). As promised here is the list of questions and my answers (thank you for your great questions).

Q: Here is the explain example:

mysql> explain extended select id, site_id from test_index_id where site_id=1
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test_index_id
         type: ref
possible_keys: key_site_id
          key: key_site_id
      key_len: 5
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where; Using index

why is site_id a covered index for the query, given the fact that a) we are selecting “id”, b) key_site_id only contains site_id?

As the table is InnoDB, all secondary keys always contain the primary key ("id"); in this case the secondary index contains all the information needed to satisfy the above query, and key_site_id will be a "covered index".
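
For illustration, a hypothetical table definition consistent with the EXPLAIN above would be:

CREATE TABLE test_index_id (
  id INT NOT NULL AUTO_INCREMENT,
  site_id INT,
  PRIMARY KEY (id),
  KEY key_site_id (site_id)
) ENGINE=InnoDB;

Internally, every entry of key_site_id stores (site_id, id), so both selected columns can be read from the index alone, without touching the clustered data.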

Q: Applications change over time. Do you suggest doing a periodic analysis of indexes that are being used, and dropping the ones that are not? If yes, any suggestions on how to tackle that?

Yes, that is a good idea. Usually it can be done easily with Percona Toolkit or Performance Schema in MySQL 5.6:

  1. Enable the slow query log, log every query, then use the pt-index-usage tool (an example invocation follows the query below)
  2. Or use the following query (as suggested by the FromDual blog post):

SELECT object_schema, object_name, index_name
  FROM performance_schema.table_io_waits_summary_by_index_usage
 WHERE index_name IS NOT NULL
   AND count_star = 0
 ORDER BY object_schema, object_name;
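
For option 1, a typical invocation against the collected slow log might look like this (host and credentials are placeholders):

$ pt-index-usage /var/lib/mysql/slow.log --host 127.0.0.1 --user root --ask-pass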

Q: If a duplicate index is found on 5.6/5.7, will that cause a performance impact on the DB while querying?

Duplicate keys can have a negative impact on selects (see the detection example below):

  1. MySQL can get confused and choose a wrong index
  2. Total index size can grow, which can cause MySQL to run out of RAM
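
Percona Toolkit can also find duplicate and redundant indexes for you; for example (connection options are placeholders):

$ pt-duplicate-key-checker --host 127.0.0.1 --user root --ask-pass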

Q: What is the suggested method to measure performance on queries (other than the slow query log) so as to know where to create indexes?

The slow query log is the most common method. In MySQL 5.6 you can also use Performance Schema and query the events_statements_summary_by_digest table.
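
For example, pulling the heaviest statements by total latency from that table (timers are in picoseconds, hence the division):

SELECT digest_text, count_star,
       sum_timer_wait/1000000000000 AS total_latency_sec
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 10;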

Q: I’m not sure if this was covered in the webinar but… are there any best-practices for fulltext indexes?

That was not covered in this webinar, however, I’ve done a number of presentations regarding Full Text Indexes. For example: Creating Geo Enabled Applications with MySQL 5.6

Q: What would be the limit on index size or the number of indexes you can define per table?

There is no limit on index size on disk; however, it is good (performance-wise) to have the active indexes fit in RAM.

In InnoDB there are a number of index limitations, i.e. a table can contain a maximum of 64 secondary indexes.

Q:  If a table has two columns you would like to sum, can you have that sum indexed as a calculated index? To add to that, can that calculated index have “case when”?

Just to clarify, this is only a feature of MySQL 5.7 (not released yet).

Yes, it is documented now:

CREATE TABLE triangle (
  sidea DOUBLE,
  sideb DOUBLE,
  sidec DOUBLE AS (SQRT(sidea * sidea + sideb * sideb))
);
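
Regarding the "case when" part of the question: generated columns accept general expressions, so something along these lines should be possible (untested, since 5.7 is not GA yet, and the syntax may still change):

CREATE TABLE events (
  id INT NOT NULL PRIMARY KEY,
  day_of_week TINYINT,
  is_weekend TINYINT AS (CASE WHEN day_of_week IN (1, 7) THEN 1 ELSE 0 END),
  KEY (is_weekend)
);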

Q: I have noticed that you created indexes on columns like DayOfTheWeek with very low cardinality. Shouldn’t that be a bad practice normally?

Yes, you are right! Unless you are doing queries like "select count(*) from … where DayOfTheWeek = 7", those indexes may not be very useful.

Q: I saw an article saying that if you don't specify a primary key upfront, MySQL/InnoDB creates one in the background (hidden). Is it different from a primary key itself, especially if most of the WHERE fields used are not in the primary key? And is there a way to identify the tables with the hidden primary key indexes?

The “hidden” primary key will be 6 bytes, which will also be appended (duplicated) to all secondary keys. You can create an INT primary key auto_increment, which will be smaller (if you do not plan to store more than 4 billion rows). In addition, you will not be able to use the hidden primary key in your queries.

The following query (against information_schema) can be used to find all tables without a declared primary key (i.e., with a "hidden" primary key):

SELECT tables.table_schema, tables.table_name, tables.table_rows
FROM information_schema.tables
LEFT JOIN (
  SELECT table_schema, table_name
  FROM information_schema.statistics
  GROUP BY table_schema, table_name, index_name
  HAVING
    SUM(
      CASE WHEN non_unique = 0 AND nullable != 'YES' THEN 1 ELSE 0 END
    ) = COUNT(*)
) puks
ON tables.table_schema = puks.table_schema AND tables.table_name = puks.table_name
WHERE puks.table_name IS NULL
AND tables.table_type = 'BASE TABLE' AND engine='InnoDB'

You may also use the mysql.innodb_index_stats table to find tables with the hidden primary key (GEN_CLUST_INDEX):

Example:

mysql> select * from mysql.innodb_index_stats;
+---------------+------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name | index_name      | last_update         | stat_name    | stat_value | sample_size | stat_description                  |
+---------------+------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
| test          | t1         | GEN_CLUST_INDEX | 2015-08-08 20:48:23 | n_diff_pfx01 | 96         | 1           | DB_ROW_ID                         |
| test          | t1         | GEN_CLUST_INDEX | 2015-08-08 20:48:23 | n_leaf_pages | 1          | NULL        | Number of leaf pages in the index |
| test          | t1         | GEN_CLUST_INDEX | 2015-08-08 20:48:23 | size         | 1          | NULL        | Number of pages in the index      |
+---------------+------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+

Q: You are using ALTER TABLE to create the index, but how does MySQL sort the data for creating the index? Doesn't it use a temp table for that?

That is a very good question: the behavior of the “alter table … add index” has changed over time. As documented in Overview of Online DDL:

Historically, many DDL operations on InnoDB tables were expensive. Many ALTER TABLE operations worked by creating a new, empty table defined with the requested table options and indexes, then copying the existing rows to the new table one-by-one, updating the indexes as the rows were inserted. After all rows from the original table were copied, the old table was dropped and the copy was renamed with the name of the original table.

MySQL 5.5, and MySQL 5.1 with the InnoDB Plugin, optimized CREATE INDEX and DROP INDEX to avoid the table-copying behavior. That feature was known as Fast Index Creation

When MySQL uses “Fast Index Creation” operation it will create a set of temporary files in MySQL’s tmpdir:

To add a secondary index to an existing table, InnoDB scans the table, and sorts the rows using memory buffers and temporary files in order by the values of the secondary index key columns. The B-tree is then built in key-value order, which is more efficient than inserting rows into an index in random order.

Q: How good is InnoDB deadlock handling in 5.7 compared to 5.6? Does that depend on the parameter setup?

A discussion of InnoDB deadlocks is outside the scope of this presentation. Valerii Kravchuk and Nilnandan Joshi did an excellent talk at Percona Live 2015 (slides available): Understanding InnoDB Locks and Deadlocks

Q: What is the performance impact of generating a virtual column for a table with 66 million records, and of generating the index? How would you go about it? Do you have any suggestions on how to reorganize indexes on the physical disk?

As MySQL 5.7 is not released yet, the behavior of virtual columns may change. The main question here is whether these will be online operations: a) adding a virtual column (as this is only a metadata change, it should be a very light operation anyway), and b) adding an index on that virtual column. In the labs release it was not online; however, this can change.

Thank you again for attending.

The post Advanced Query Tuning in MySQL 5.6 and MySQL 5.7 Webinar: Q&A appeared first on MySQL Performance Blog.

by Alexander Rubin at August 24, 2015 02:16 PM

August 21, 2015

Peter Zaitsev

Find unused indexes on MongoDB and TokuMX

Finding and removing unused indexes is a pretty common technique to improve the overall performance of relational databases. Fewer indexes mean faster inserts and updates, but also less disk space used. The usual way to do it is to log all queries' execution plans and then get a list of those indexes that are not used. The same theory applies to MongoDB and TokuMX, so in this blog post I'm going to explain how to find those.

Profiling in MongoDB

To understand what profiling is, you only need to think of MySQL's slow query log; it is basically the same idea. It can be enabled with the following command:

db.setProfilingLevel(level, slowms)

There are three different levels:

0: No profiling enabled.
1: Only those queries slower than “slowms” are profiled.
2: All queries are profiled, similar to long_query_time=0.
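
For example, to profile every query, or only those slower than 100 milliseconds:

db.setProfilingLevel(2)
db.setProfilingLevel(1, 100)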

Once it is enabled you can use db.system.profile.find().pretty() to read it. You would need to scan through all the profiles and find those indexes that are never used. To make things easier, there is a javascript program that will find the unused indexes after reading all the profile information. Unfortunately, it only works with MongoDB 2.x.

The javascript is hosted in this github project: https://github.com/wfreeman/indexalizer. You just need to start the mongo shell with indexStats.js loaded and run the db.indexStats() command.
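
Concretely, that means something like this (assuming indexStats.js is in the current directory):

$ mongo --shell indexStats.js
> db.indexStats()

Here is a sample output: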

scanning profile {ns:"test.col"} with 2 records... this could take a while.
{
	"query" : {
		"b" : 1
	},
	"count" : 1,
	"index" : "",
	"cursor" : "BtreeCursor b_1",
	"millis" : 0,
	"nscanned" : 1,
	"n" : 1,
	"scanAndOrder" : false
}
{
	"query" : {
		"b" : 2
	},
	"count" : 1,
	"index" : "",
	"cursor" : "BtreeCursor b_1",
	"millis" : 0,
	"nscanned" : 1,
	"n" : 1,
	"scanAndOrder" : false
}
checking for unused indexes in: col
this index is not being used:
"_id_"
this index is not being used:
"a_1"

 

So “a_1” is not used and could be dropped. We can ignore “_id_” because that one is needed :)

There is a problem with profiling: it will affect performance, so you should run it only for a few hours, and usually during off-peak times. That means there is a possibility that not all the queries from your application are going to be executed during that maintenance window. What alternative does TokuMX provide?

Finding unused indexes in TokuMX

Good news for all of us. TokuMX doesn’t require you to enable profiling. Index usage statistics are stored as part of every query execution and you can access them with a simple db.collection.stats() command. Let me show you an example:

> db.col.stats()
[...]
{
"name" : "a_1",
"count" : 5,
"size" : 140,
"avgObjSize" : 28,
"storageSize" : 16896,
"pageSize" : 4194304,
"readPageSize" : 65536,
"fanout" : 16,
"compression" : "zlib",
"queries" : 0,
"nscanned" : 0,
"nscannedObjects" : 0,
"inserts" : 0,
"deletes" : 0
},
{
"name" : "b_1",
"count" : 5,
"size" : 140,
"avgObjSize" : 28,
"storageSize" : 16896,
"pageSize" : 4194304,
"readPageSize" : 65536,
"fanout" : 16,
"compression" : "zlib",
"queries" : 2,
"nscanned" : 2,
"nscannedObjects" : 2,
"inserts" : 0,
"deletes" : 0
}
],
"ok" : 1
}

 

There are our statistics, without profiling enabled. "queries" means the number of times an index has been used by a query execution. b_1 has been used twice and a_1 has never been used. You can use this small javascript code I've written to scan all collections inside the current database:

db.forEachCollectionName(function (cname) {
	output = db.runCommand({collstats : cname });
	print("Checking " + output.ns + "...")
	output.indexDetails.forEach(function(findUnused) { if (findUnused.queries == 0) { print( "Unused index: " + findUnused.name ); }})
});

 

An example using the same data:

> db.forEachCollectionName(function (cname) {
... output = db.runCommand({collstats : cname });
... print("Checking " + output.ns + "...")
... output.indexDetails.forEach(function(findUnused) { if (findUnused.queries == 0) { print( "Unused index: " + findUnused.name ); }})
...
... });
Checking test.system.indexes...
Checking test.col...
Unused index: a_1

 

Conclusion

Finding unused indexes is a regular task that every DBA should do. In MongoDB you have to use profiling, while in TokuMX nothing needs to be enabled, because the information is gathered by default without impacting service performance.

The post Find unused indexes on MongoDB and TokuMX appeared first on MySQL Performance Blog.

by Miguel Angel Nieto at August 21, 2015 03:33 PM

Henrik Ingo

3 modifications to the Raft consensus algorithm (paper)

August is usually a slower month, as a lot of people are on vacation. I try to take advantage of that to work on tasks that require a bit more deliberation and quiet time. This summer I returned to re-reading the paper on the Raft algorithm; in particular, my colleagues in New York pointed out that the PhD thesis that extends the original paper is now complete, and contains some additional details.

read more

by hingo at August 21, 2015 02:17 PM

August 20, 2015

Peter Zaitsev

Optimizing PXC Xtrabackup State Snapshot Transfer

State Snapshot Transfer (SST) at a glance

PXC uses a protocol called State Snapshot Transfer to provision a node joining an existing cluster with all the data it needs to synchronize.  This is analogous to cloning a slave in asynchronous replication:  you take a full backup of one node and copy it to the new one, while tracking the replication position of the backup.

PXC automates this process using scriptable SST methods.  The most common of these methods is the xtrabackup-v2 method, which is the default in PXC 5.6.  Xtrabackup is generally favored over other SST methods because it is non-blocking on the Donor node (the node contributing the backup).

The basic flow of this method is:

  • The Joiner:
    • joins the cluster
    • Learns it needs a full SST and clobbers its local datadir (the SST will replace it)
    • prepares for a state transfer by opening a socat on port 4444 (by default)
    • The socat pipes the incoming files into the datadir/.sst directory
  • The Donor:
    • is picked by the cluster (could be configured or be based on WAN segments)
    • starts a streaming Xtrabackup and pipes the output of that via socat to the Joiner on port 4444.
    • Upon finishing its backup, it sends an indication of this along with the final Galera GTID of the backup to the Joiner
  • The Joiner:
    • Records all changes from the Donor’s backup’s GTID forward in its gcache (and overflow pages, this is limited by available disk space)
    • runs the --apply-log phase of Xtrabackup on the received backup
    • Moves the datadir/.sst directory contents into the datadir
    • Starts mysqld
    • Applies all the transactions it needs (Joining and Joined states just like IST does it)
    • Moves to the ‘Synced’ state and is done.
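
Once mysqld is up on the Joiner, you can follow it through the Joining/Joined/Synced states with the standard Galera status counter, for example:

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';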

There are a lot of moving pieces here, and nothing is really tuned by default.  On larger clusters, SST can be quite scary because it may take hours or even days.  Any failure can mean starting over again from the start.

This blog will concentrate on some ways to make a good dent in the time SST can take.  Many of these methods are trade-offs and may not apply to your situations.  Further, there may be other ways I haven’t thought of to speed things up, please share what you’ve found that works!

The Environment

I am testing SST on a PXC 5.6.24 cluster in AWS.  The nodes are c3.4xlarge and the datadirs are RAID-0 over the two ephemeral SSD drives in that instance type.  These instances are all in the same region.

My simulated application is using only node1 in the cluster and is sysbench OLTP with 200 tables with 1M rows each.  This comes out to just under 50G of data.  The test application runs on a separate server with 32 threads.

The PXC cluster itself is tuned to best practices for InnoDB and Galera performance.

Baseline

In my first test the cluster has a single member (receiving the workload) and I am joining node2.  This configuration is untuned for SST.  I measured the time from when mysqld started on node2 until it entered the Synced state (i.e., fully caught up).  In the log, it looked like this:

150724 15:59:24 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
... lots of other output ...
2015-07-24 16:48:39 31084 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 4647341)

Doing some math on the above, we find that the SST took about 49 minutes to complete.

--use-memory

One of the first things I noticed was that the --apply-log step on the Joiner was very slow.  Anyone who uses Xtrabackup a lot will know that --apply-log will be a lot faster if you give it some extra RAM to use while making the backup consistent via the --use-memory option.  We can set this in our my.cnf like this:

[sst]
inno-apply-opts="--use-memory=20G"

The [sst] section is a special one understood only by the xtrabackup-v2 script.  inno-apply-opts allows me to specify arguments to innobackupex when it runs.

Note that this change only applies to the Joiner (i.e., you don’t have to put it on all your nodes and restart them to take advantage of it).

This change immediately makes a huge improvement to our above scenario (node2 joining node1 under load) and the SST now takes just over 30 minutes.

wsrep_slave_threads

Another slow part of getting to Synced is how long it takes to apply transactions up to realtime after the backup is restored and in place on the Joiner.  We can improve this throughput by increasing the number of apply threads on the Joiner to make better use of the CPU.  Prior to this, wsrep_slave_threads was set to 1, but if I increase it to 32 (there are 16 cores on this instance type) my SST now takes 25m 32s.
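
Like the --use-memory change, this is a quick my.cnf adjustment on the Joiner (a sketch; size it to your core count):

[mysqld]
wsrep_slave_threads=32

The variable is also dynamic, so it can be raised on a running node with SET GLOBAL wsrep_slave_threads=32.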

Compression

xtrabackup-v2 supports adding a compression process into the datastream.  On the Donor it compresses and on the Joiner it decompresses.  This allows you to trade CPU for transfer speed.  If your bottleneck turns out to be network transport and you have spare CPU, this can help a lot.

Further, I can use pigz instead of gzip to get parallel compression, but theoretically any compression utility can work, as long as it can compress and decompress standard input to standard output.  I install the 'pigz' package on all my nodes and change my my.cnf like this:

[sst]
inno-apply-opts="--use-memory=20G"
compressor="pigz"
decompressor="pigz -d"

Both the Joiner and the Donor must have the respective decompressor and compressor settings or the SST will fail with a vague error message (not actually having pigz installed will do the same thing).

By adding compression, my SST is down to 21 minutes, but there’s a catch.  My application performance starts to take a serious nose-dive during this test.  Pigz is consuming most of the CPU on my Donor, which is also my primary application node.  This may or may not hurt your application workload in the same way, but this emphasizes the importance of understanding (and measuring) the performance impact of SST has on your Donor nodes.

Dedicated donor

To alleviate the problem with the application, I now leave node2 up and spin up node3.  Since I’m expecting node2 to normally not be receiving application traffic directly, I can configure node3 to prefer node2 as its donor like this:

[mysqld]
...
wsrep_sst_donor = node2,

When node3 starts, this setting instructs the cluster that node3 is the preferred donor, but if that’s not available, pick something else (that’s what the trailing comma means).

Donor nodes are permitted to fall behind in replication apply as needed without sending flow control.  Sending application traffic to such a node may see an increase in the amount of stale data as well as certification failures for writes (not to mention the performance issues we saw above with node1).  Since node2 is not getting application traffic, moving into the Donor state and doing an expensive SST with pigz compression should be relatively safe for the rest of the cluster (in this case, node1).

Even if you don’t have a dedicated donor, if you use a load balancer of some kind in front of your cluster, you may elect to consider Donor nodes as failing their health checks so application traffic is diverted during any state transfer.

When I brought up node3, with node2 as the donor, the SST time dropped to 18m 33s.

Conclusion

Each of these tunings helped the SST speed, though the later adjustments maybe had less of a direct impact.  Depending on your workload, database size, network and CPU available, your mileage may of course vary.  Your tunings should vary accordingly, but also realize you may actually want to limit (and not increase) the speed of state transfers in some cases to avoid other problems. For example, I’ve seen several clusters get unstable during SST and the only explanation for this is the amount of network bandwidth consumed by the state transfer preventing the actual Galera communication between the nodes. Be sure to consider the overall state of production when tuning your SSTs.

The post Optimizing PXC Xtrabackup State Snapshot Transfer appeared first on MySQL Performance Blog.

by Jay Janssen at August 20, 2015 03:38 PM

August 19, 2015

Peter Zaitsev

How much could you benefit from MySQL 5.6 parallel replication?

I have heard this question quite often: “At busy times, our replicas start lagging quite frequently. We are using N schemas, so which performance boost could we expect from MySQL 5.6 parallel replication?” Here is a quick way to give you a rough estimate of the potential benefit.

General idea

In MySQL 5.6, parallelism is added at the schema level. So in theory, if you have N schemas and if you use N parallel threads, replication could be up to N times faster. This assumes at least 2 things:

  • Replication throughput scales linearly with the number of parallel threads.
  • Writes are evenly distributed across schemas.

Both assumptions are of course not realistic. But it is easy to know the distribution of writes, and that can already give you an idea about how much you could benefit from parallel replication.
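
For reference, schema-level parallelism is enabled on a 5.6 slave with a single variable; a minimal my.cnf sketch:

[mysqld]
slave_parallel_workers = 8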

Writes are stored in binary logs but it is much easier to work with the slow query log, so we can enable full slow query logging for some time with long_query_time = 0 and then use pt-query-digest to analyze the resulting log file.
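
Enabling that capture on a running server is straightforward (remember to revert long_query_time afterwards, as logging every query has a cost):

mysql> SET GLOBAL slow_query_log = ON;
mysql> SET GLOBAL long_query_time = 0;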

An example

I have a test server with 3 schemas, and I’ve run some sysbench load on it to get a decent slow query log file. Once done, I can run this command:

pt-query-digest --filter '$event->{arg} !~ m/^select|^set|^commit|^show|^admin|^rollback|^begin/i' --group-by db --report-format profile slow_query.log > digest.out

and here is the result I get:

# Profile
# Rank Query ID Response time  Calls  R/Call V/M   Item
# ==== ======== ============== ====== ====== ===== ====
#    1 0x       791.6195 52.1% 100028 0.0079  0.70 db3
#    2 0x       525.1231 34.5% 100022 0.0053  0.68 db1
#    3 0x       203.4649 13.4% 100000 0.0020  0.64 db2

In a perfect world, with 3 parallel threads and each schema handling 33% of the total write workload, I could expect a 3x performance improvement.

However here we can see in the report that the 3 replication threads will only work simultaneously 25% of the time in the best case (13.4/52.1 = 0.25). We can also expect 2 replication threads to work simultaneously for some part of the workload, but let’s ignore that for clarity.

It means that instead of the theoretical 200% performance improvement (3 parallel threads 100% of the time), we can hardly expect more than a 50% performance improvement (3 parallel threads 25% of the time). And the reality is that the benefit will be much lower than that.

Conclusion

Parallel replication in MySQL 5.6 is a great step forward; however, don't expect too much if your writes are not evenly distributed across all your schemas. The pt-query-digest trick I shared can give you a rough idea of whether your workload is a good fit for multi-threaded slaves in 5.6.

I’m expecting much better results for 5.7, partly because parallelism is handled differently, but also because you can tune how efficient parallel replication will be by adjusting the binlog group commit settings.

The post How much could you benefit from MySQL 5.6 parallel replication? appeared first on MySQL Performance Blog.

by Stephane Combaudon at August 19, 2015 10:00 AM

Jean-Jerome Schmidt

Howto: Online Upgrade of Galera Cluster to MySQL 5.6

Oracle released a GA version of MySQL 5.6 in February 2013, Codership released the first GA in their patched 5.6 series in November 2013. Galera Cluster for MySQL 5.6 has been around for almost 2 years now, so what are you waiting for? :-)

Okay, this is a major upgrade, so there are risks! Therefore, an upgrade must be carefully planned and tested. In this blog post, we'll look into how to perform an online upgrade of your Galera Cluster (the Codership build of Galera) to MySQL 5.6.

Offline Upgrade

An offline upgrade requires downtime, but it is more straightforward. If you can afford a maintenance window, this is probably a safer way to reduce the risk of upgrade failures. The major steps consist of stopping the cluster, upgrading all nodes, then bootstrapping and starting the nodes. We covered the procedure in detail in this blog post.

Online Upgrade

An online upgrade has to be done in a rolling upgrade/restart fashion, i.e., upgrade one node at a time and then proceed to the next. During the upgrade, you will have a mix of MySQL 5.5 and 5.6 nodes. This can cause problems if not handled with care.

Here is the list that you need to check prior to the upgrade:

  • Read and understand the changes with the new version
  • Note the unsupported configuration options between the major versions
  • Determine your cluster ID from the ClusterControl summary bar
  • garbd nodes will also need to be upgraded
  • All nodes must have internet connection
  • SST must be avoided during the upgrade, so we must ensure each node's gcache is appropriately configured and loaded prior to the upgrade.
  • Some nodes will be read-only during the period of upgrade which means there will be some impact on the cluster’s write performance. Perform the upgrade during non-peak hours.
  • The load balancer must be able to detect and exclude backend DB servers in read-only mode. Writes coming from 5.6 are not compatible with 5.5 in the same cluster. Percona's clustercheck and the ClusterControl mysqlchk script should be able to handle this by default.
  • If you are running on ClusterControl, ensure the ClusterControl auto recovery feature is turned off to prevent ClusterControl from recovering a node during its upgrade.

Here is what we’re going to do to perform the online upgrade:

  1. Set up Codership repository on all DB nodes.
  2. Increase gcache size and perform rolling restart.
  3. Backup all databases.
  4. Turn off ClusterControl auto recovery.
  5. Start the maintenance window.
  6. Upgrade packages in Db1 to 5.6.
  7. Add compatibility configuration options on my.cnf of Db1 (node will be read-only).
  8. Start Db1. At this point the cluster will consist of MySQL 5.5 and 5.6 nodes. (Db1 is read-only, writes go to Db2 and Db3)
  9. Upgrade packages in Db2 to 5.6.
  10. Add compatibility configuration options on my.cnf of Db2 (node will be read-only).
  11. Start Db2. (Db1 and Db2 are read-only, writes go to Db3)
  12. Bring down Db3 so the read-only on upgraded nodes can be turned off (no more MySQL 5.5 at this point)
  13. Turn off read-only on Db1 and Db2. (Writes go to Db1 and Db2)
  14. Upgrade packages in Db3 to 5.6.
  15. Start Db3. At this point the cluster will consist of MySQL 5.6 nodes. (Writes go to Db1, Db2 and Db3)
  16. Clean up the compatibility options on all DB nodes.
  17. Verify nodes performance and availability.
  18. Turn on ClusterControl auto recovery.
  19. Close maintenance window

Upgrade Steps

In this example, we have a three-node MySQL Galera Cluster 5.5 (Codership build) that we installed via the Severalnines Configurator, running on CentOS 6 and Ubuntu 14.04. The steps performed here should work regardless of whether the cluster was deployed with or without ClusterControl. Omit sudo if you are running as root.

Preparation

1. MySQL 5.6 packages are available in the Codership package repository. We need to enable the repository on each of the DB nodes. You can find instructions for other operating systems here.

On CentOS 6, add the following lines into /etc/yum.repos.d/Codership.repo:

[codership]
name = Galera
baseurl = http://releases.galeracluster.com/centos/6/x86_64
gpgkey = http://releases.galeracluster.com/GPG-KEY-galeracluster.com
gpgcheck = 1

On Ubuntu 14.04:

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv BC19DDBA
$ echo "deb http://releases.galeracluster.com/ubuntu trusty main" | sudo tee -a /etc/apt/sources.list
$ sudo apt-get update

2. Increase the gcache size to a suitable amount. To be safe, increase the size so it can hold 1 to 2 hours of downtime, as explained in this blog post under the 'Determining good gcache size' section. In this example, we are going to increase the gcache size to 1GB. Open the MySQL configuration file:

$ vim /etc/my.cnf # CentOS
$ sudo vim /etc/mysql/my.cnf # Ubuntu

Append or modify the following line under wsrep_provider_options:

wsrep_provider_options="gcache.size=1G"

Perform a rolling restart to apply the change. For ClusterControl users, you can use Manage > Upgrades > Rolling Restart.

3. Backup all databases. This is critical before performing any upgrade so you have something to fall back on if the upgrade fails. For ClusterControl users, you can use Backup > Start a Backup Immediately.

4. Turn off ClusterControl auto recovery from the UI, similar to screenshot below:

5. If you installed HAProxy through ClusterControl, there was a bug in previous versions of the health check script when detecting read-only nodes. Run the following command on all DB nodes to fix it (this is fixed in the latest version):

$ sudo sed -i 's/YES/ON/g' /usr/local/sbin/mysqlchk

Upgrading Database Server Db1 and Db2

1. Stop the MySQL service and remove the existing packages. For Ubuntu/Debian, remove the symlink to the MySQL 5.5 base directory, which should be installed under /usr/local/mysql:

CentOS:

$ service mysql stop
$ yum remove MySQL-*

Ubuntu:

$ sudo service mysql stop
$ sudo rm -f /usr/local/mysql

IMPORTANT: As time is critical to avoid SST (in this post our gcache size can hold up to ~1 hour downtime without SST), you can download the packages directly from the repository before bringing down MySQL and then use local install command (yum localinstall or dpkg -i) instead to speed up the installation process. We have seen cases where the MySQL installation via package manager took a very long time due to slow connection to Codership’s repository.
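
A sketch of that pre-download approach, using the package names from this post (run the downloads while MySQL is still up, then use the local install in place of the package-manager install below; yumdownloader comes from the yum-utils package, and on Ubuntu you may need apt-get -f install afterwards to resolve dependencies):

CentOS:

$ yumdownloader mysql-wsrep-server-5.6 mysql-wsrep-client-5.6
$ yum localinstall mysql-wsrep-*.rpm

Ubuntu:

$ apt-get download mysql-wsrep-5.6
$ sudo dpkg -i mysql-wsrep-5.6_*.deb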

2. Modify the MySQL configuration file for 5.6's incompatible option by commenting out or removing the following line (if it exists):

#engine-condition-pushdown=1

Then, append or modify the following lines to set the backward compatibility options:

[MYSQLD]
# new basedir installed with apt (Ubuntu/Debian only)
basedir=/usr

# Required for compatibility with galera-2
# Append socket.checksum=1 in wsrep_provider_options:
wsrep_provider_options="gcache.size=1G; socket.checksum=1"

# Required for replication compatibility
# Add following lines under [mysqld] directive:
log_bin_use_v1_row_events=1
binlog_checksum=NONE
gtid_mode=0
read_only=ON

[MYSQLD_SAFE]
# new basedir installed with apt (Ubuntu/Debian only)
basedir=/usr

3. Once configured, install the latest version via package manager:
CentOS:

$ yum install mysql-wsrep-server-5.6 mysql-wsrep-client-5.6

Ubuntu (also need to change the mysql client path in mysqlchk script):

$ sudo apt-get install mysql-wsrep-5.6
$ sudo sed -i 's|MYSQL_BIN=.*|MYSQL_BIN="/usr/bin/mysql"|g' /usr/local/sbin/mysqlchk

4. Start MySQL with skip grant and run the mysql_upgrade script to upgrade the system tables:

$ sudo mysqld --skip-grant-tables --user=mysql --wsrep-provider='none' &
$ sudo mysql_upgrade -uroot -p

Make sure the last line returns ‘OK’, indicating the mysql_upgrade succeeded.

5. Gracefully kill the running mysqld process and start the server:

$ sudo killall -15 mysqld
$ sudo service mysql start

Monitor the MySQL error log and ensure the node joins through IST, similar to below:

2015-08-14 16:56:12 89049 [Note] WSREP: Signalling provider to continue.
2015-08-14 16:56:12 89049 [Note] WSREP: inited wsrep sidno 1
2015-08-14 16:56:12 89049 [Note] WSREP: SST received: 82f872f9-4188-11e5-aa6b-2ab795bec872:46229
2015-08-14 16:56:12 89049 [Note] WSREP: Receiving IST: 49699 writesets, seqnos 46229-95928
2015-08-14 16:56:12 89049 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.23'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL), wsrep_25.10
2015-08-14 16:56:20 89049 [Note] WSREP: 0 (192.168.55.139): State transfer to 1 (192.168.55.140) complete.
2015-08-14 16:56:20 89049 [Note] WSREP: IST received: 82f872f9-4188-11e5-aa6b-2ab795bec872:95928
2015-08-14 16:56:20 89049 [Note] WSREP: Member 0 (192.168.55.139) synced with group.
2015-08-14 16:56:20 89049 [Note] WSREP: 1 (192.168.55.140): State transfer from 0 (192.168.55.139) complete.
2015-08-14 16:56:20 89049 [Note] WSREP: Shifting JOINER -> JOINED (TO: 97760)
2015-08-14 16:56:20 89049 [Note] WSREP: Member 1 (192.168.55.140) synced with group.
2015-08-14 16:56:20 89049 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 97815)
2015-08-14 16:56:20 89049 [Note] WSREP: Synchronized with group, ready for connections

Repeat the same steps for the second node. Once the upgrade is completed on Db1 and Db2, you should notice that writes are now redirected to Db3, which is still running on MySQL 5.5:

This doesn’t mean that both upgraded MySQL servers are down, they are just set to be read-only to prevent writes coming from 5.6 being replicated to 5.5 (which is not supported). We will switch the writes over just before we start the upgrade of the last node.

Upgrading the Last Database Server (Db3)

1. Stop MySQL service on Db3:

$ sudo service mysql stop

2. Turn off read-only on Db1 and Db2 so they can start to receive writes. From this point, writes will go to MySQL 5.6 only. On Db1 and Db2, run the following statement:

$ mysql -uroot -p -e 'SET GLOBAL read_only = OFF'

Verify that you see something like below on the HAProxy statistic page (ClusterControl > Nodes > select HAproxy node), indicating Db1 and Db2 are now up and active:

3. Modify the MySQL configuration file for 5.6's incompatible option by commenting out or removing the following line (if it exists):

#engine-condition-pushdown=1

Then, modify the following lines for the new basedir path (Ubuntu/Debian only):

[MYSQLD]
# new basedir installed with apt (Ubuntu/Debian only)
basedir=/usr

[MYSQLD_SAFE]
# new basedir installed with apt (Ubuntu/Debian only)
basedir=/usr

** There is no need to enable backward compatibility options anymore since the other nodes (Db1 and Db2) are already in 5.6.

4. Now we can proceed with the upgrade on Db3:

CentOS:

$ yum remove MySQL-*
$ yum install mysql-wsrep-server-5.6 mysql-wsrep-client-5.6

Ubuntu (also need to change the mysql client path in mysqlchk script):

$ sudo rm -f /usr/local/mysql
$ sudo apt-get install mysql-wsrep-5.6
$ sudo sed -i 's|MYSQL_BIN=.*|MYSQL_BIN="/usr/bin/mysql"|g' /usr/local/sbin/mysqlchk

5. Start MySQL with skip grant and run mysql_upgrade script to upgrade system tables:

$ sudo mysqld --skip-grant-tables --user=mysql --wsrep-provider='none' &
$ sudo mysql_upgrade -uroot -p

Make sure the last line returns ‘OK’, indicating the mysql_upgrade succeeded.

6. Gracefully kill the running mysqld process and start the server:

$ sudo killall -15 mysqld
$ sudo service mysql start

Wait until the last node joins the cluster and verify the status and version from the ClusterControl Overview tab:

That’s it. Your cluster is upgraded to 5.6. Next, we need to perform some cleanups.

Cleaning Up

1. Fetch the new configuration file contents into ClusterControl by going to ClusterControl > Manage > Configurations > Reimport Configuration.

2. Remove the backward compatibility option on Db1 and Db2 by modifying the following lines:

# Remove socket.checksum=1 in wsrep_provider_options:
wsrep_provider_options="gcache.size=1G"

# Remove or comment following lines:
#log_bin_use_v1_row_events=1
#gtid_mode=0
#binlog_checksum=NONE
#read_only=ON

Restart Db1 and Db2 (one node at a time) to immediately apply the changes.
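
For example, a rolling restart could look like this (a sketch; wait until each node reports being synced before restarting the next one):

$ sudo service mysql restart
$ mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"   # wait for 'Synced'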

3. Enable ClusterControl automatic recovery.

4. On some occasions, you might see “undefined()” appearing on the Overview page. This is due to a bug in previous ClusterControl versions when detecting a new node. To reset the node status, run the following commands on the ClusterControl node:

$ sudo service cmon stop
$ mysql -uroot -p -e 'truncate cmon.server_node'
$ sudo service cmon start

Take note that the upgrade preparation part could take a long time if you have a huge dataset to back up. The upgrade process (excluding preparation) took us approximately 45 minutes to complete. For MariaDB users, we will cover the online upgrade to MariaDB Cluster 10 in an upcoming post. Stay tuned!

by Severalnines at August 19, 2015 09:59 AM

August 18, 2015

Peter Zaitsev

Featured Talk: The Future of Replication is Today: New Features in Practice

In the past years, MySQL 5.6, MySQL 5.7 and MariaDB 10 have all been successful in implementing new features. For many DBAs, the “old way” of replicating data is comfortable, so taking the action to implement these new features seems like a momentous leap rather than a simple step. But perhaps it isn’t that complicated…

Giuseppe Maxia, a Quality Assurance Architect at VMware and loyal member of the Percona Live Conference Committee will be presenting “The Future of Replication is Today: New Features in Practice” at the Percona Live Data Performance Conference this September in Amsterdam.
Percona’s Community Manager, Tom Diederich, had an opportunity to catch up with Giuseppe last week and get an in-depth look at some of the items Giuseppe will be covering in his talk, in addition to getting his take on some of the hot sessions to hit while at the conference. This is how it went:

(Hint: Read to the end to find a special discount code) 

 

Tom: Your talk is titled, “The Future of Replication is today: new features in practice.” What are the top 3 areas in which replication options have improved in MySQL 5.6, MySQL 5.7, and MariaDB 10?
Giuseppe: Replication has been stagnant for over 10 years. Before MySQL 5.6, the only important change in the technology was the introduction of row-based replication in 2008. After that, we had to wait till 2013 to see global transaction identifiers in MySQL 5.6, followed by the same feature, with a different implementation, in 2014 with MariaDB 10. GTID has been complemented, in both flavors, with crash-safe replication tables, a feature that guarantees a reliable resume of replication after a server failure. There is also the parallel applier, a minor feature that has been implemented in both MySQL 5.6 and MariaDB, and improved in the latest versions, although it seems to lack proper support for monitoring. The last feature that was introduced in MySQL 5.6 and MariaDB 10 is multi-source replication, i.e. the ability to replicate from multiple masters to a single slave. In both editions, the implementation is quite simple, and not so different from what DBAs are used to doing for regular replication.
Tom: For DBAs, how difficult will it be to make the change from the “old way” of replicating data — to stop using the same comfortable features that have been around for several years — and put into practice some of the latest features?
Giuseppe: The adoption of new features can be deceptively simple. For example, GTID in MariaDB comes out of the box and its adoption could be as easy as running a backup followed by a restore, but it can produce unpleasant results if you try to combine this feature with multi-source replication without planning ahead. That said, the transition could be simpler than its counterpart in MySQL.
MySQL 5.6 and 5.7 require some reconfiguration to run GTID, and users can face unpleasant failures due to the complexity of the rules applying to this feature. They will need to read the manual thoroughly and test the deployment extensively before trusting an upgrade in production.
For multi-source replication, the difficulties are, in my experience, hidden in the users’ expectations. When speaking about multi-source (or multi-master, as it is commonly referred to), many users have the mistaken expectation that they can easily insert anything into multiple masters as if they were doing it on a single server. However, the nature of asynchronous replication and the current implementation of multi-source topologies do not handle conflicts, and this fact will probably surprise and anger the early adopters.
Tom: What is still missing in replication technology? How can MySQL improve?
Giuseppe: There are two areas where the current implementation is lacking. The first one is monitoring data: while new features have been added to replication, not enough effort has been made to cover the monitoring needs. The current way of monitoring replication is hard-wired around the original replication feature, and little has been done to give users a deeper view of what is going on. With the latest releases at our disposal, we can run parallel replication using multiple masters, and yet we have very little visibility into what goes on inside the dozens of threads that the new features can unchain inside a single slave. It’s like driving an F1 racing car with the dashboard of a Ford Model T. MySQL 5.7 has moved a few steps in that direction, with the new replication tables in performance_schema, but it is still a drop in the ocean compared to what we need.
The second area where replication is still too much tied to its past is heterogeneous replication. While relational databases still dominate the front-end of the web economy, the back-end is largely being run by different structures, such as Hadoop, MongoDB, and Cassandra. Moving data back and forth between the relational storage and its growing siblings has become an urgent need. There have been a few sparks of change in this direction, but nothing that qualifies as a promising change yet.
Tom: Which other session(s) are you most looking forward to besides your own?
Giuseppe: I am always interested in the sessions that explain and discuss new features. I am most interested in the talks by Oracle engineers, who have been piling up many features in recent years, and I am sure they have something more up their sleeve that will appear at the conference. I also eagerly attend sessions about complementary tools, which are usually highly educational and often give me more ideas.

Want to read more on the topic? Visit Giuseppe’s blog:

 MySQL Replication Monitoring 101

The Percona Live Data Performance Conference is the premier event for the rich and diverse MySQL, NoSQL and data in the cloud ecosystems in Europe. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and IoT (Internet of Things) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

This year’s conference will feature one day of tutorials and two days of keynote talks and breakout sessions related to MySQL, NoSQL and Data in the Cloud. Attendees will get briefed on the hottest topics, learn about building and maintaining high-performing deployments and hear from top industry leaders.

The Percona Live Europe Data Performance Conference will be September 21-23 at the Mövenpick Hotel Amsterdam City Centre.

Register using code “FeaturedTalk” and save 20 euros off of registration!

Hope to see you in Amsterdam!

The post Featured Talk: The Future of Replication is Today: New Features in Practice appeared first on MySQL Performance Blog.

by Kortney Runyan at August 18, 2015 11:17 PM

MariaDB Foundation

MariaDB Galera Cluster 10.0.21 and 5.5.45 now available

The MariaDB project is pleased to announce the immediate availability of MariaDB Galera Cluster 10.0.21 and 5.5.45. These are both Stable (GA) releases.

Download MariaDB Galera Cluster 10.0.21

Release Notes Changelog What is MariaDB Galera Cluster?

 

Download MariaDB Galera Cluster 5.5.45

Release Notes Changelog What is MariaDB Galera Cluster?

MariaDB APT and YUM Repository Configuration Generator

See the Release Notes and Changelogs for detailed information on these releases.

Thanks, and enjoy MariaDB!


by Daniel Bartholomew at August 18, 2015 07:46 PM

MariaDB AB

MariaDB on IBM z Systems with Linux

dion_cornett_g

The Convergence of Enterprise-Class and Choice – an Open Source Ecosystem Running on a True Enterprise-Class Platform: MariaDB Enterprise on IBM z Systems

While selling free software sounds like a hard job, the growing mainstream adoption of open source makes it one of the most exciting segments in IT. In my daily meetings with customers, there is no ambiguity that open source is the foundation for the hottest trends in technology, including Cloud, Analytics, Mobile and Social. Gartner reports that 70% of all new applications are built using an open source database. (1)

Yet until now, many customers had to make compromises related to leveraging the innovation, high-quality code, lower costs, and choice inherent in open source, versus leveraging the highly secure, scalable and power-efficient enterprise-class z Systems offered by IBM.

Today at LinuxCon North America that changes. We at MariaDB Corporation are proud to be part of a new ecosystem with IBM that enables MariaDB Enterprise to run on Linux on z Systems as well as on the new IBM LinuxONE. MariaDB on z Systems extends our current enterprise-level open source offerings, which deliver Community-driven innovation and a comprehensive value proposition, combined with performance, high availability and security at web scale.

Benefits to Our Customers

Choice – Now, with the availability of MariaDB Enterprise on z Systems, clients can have the best of both worlds—an open source ecosystem running on a true enterprise-class platform.

Security – The world-class security of IBM z Systems, when combined with the hyper-secure community-driven innovations within MariaDB, provide unparalleled protection against a growing IT threat environment.

Extensibility – Joining with IBM on z Systems, MariaDB is now able to offer customers the only open source enterprise database platform that runs from x86 and Power to Mainframe and Cloud.

Consolidation – The ability to scale MariaDB with IBM’s industry-leading mainframe virtualization technologies means clients can easily consolidate clusters with many servers on one mainframe. MariaDB on IBM z Systems can host more servers per core than any other system, with high-speed encryption, disaster recovery, and continuous availability capabilities, thereby also saving electricity, cooling and management costs.

Future-Proofing Your Database Investment – The customer’s database investment, whether on premises, SaaS or Cloud, is protected. Moreover, a true open source development model means that the latest innovations can always be incorporated. MariaDB will support the customer’s needs no matter the direction they choose, all without vendor lock-in.

"As the ONE default database platform for leading Linux distributors, including Red Hat and SUSE, MariaDB is excited to support IBM LinuxONE,” stated Patrik Sallner, CEO of MariaDB. “With Linux on IBM z Systems growing at twice the rate of the Linux market overall, there is clear customer demand for open source solutions on IBM’s highly scalable and secure platform. These qualities align perfectly with MariaDB’s true open source model, which leverages Community innovations to provide best-in-class scalability, improved performance, and security, for on-premise, hybrid and cloud applications.”

MariaDB Enterprise on IBM z Systems (LinuxONE) will be available later this calendar year.

MariaDB Community Server on z Systems

The MariaDB community version runs out of the box on Linux on z Systems. You can install MariaDB 5.x through Yum on RHEL or Zypper on SLES. See the installation guide for MariaDB 10.x (https://github.com/linux-on-ibm-z/docs/wiki/Building-MariaDB-10.0). MariaDB Inc. and IBM are working together to bring MariaDB Enterprise to z Systems.
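
As an illustration, installing the community packages could be as simple as the following (a sketch; the exact package names may differ between distribution releases):

$ sudo yum install mariadb-server     # RHEL
$ sudo zypper install mariadb         # SLES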

For more information, go to https://mariadb.com

For more about the IBM partnership, go to http://www-03.ibm.com/press/us/en/pressrelease/47474.wss

1. Gartner, The State of Open-Source RDBMSs, 2015, Donald Feinberg, Merv Adrian, April 21, 2015

About the Author

Dion Cornett is VP Global Sales at MariaDB and was previously VP Sales Strategy at Red Hat.

by dion_cornett_g at August 18, 2015 03:06 PM

August 17, 2015

Henrik Ingo

Slides and video of my CLS and Oscon talks 2015

A month ago I made an exciting journey to Portland to present two talks in the field of Open Source business strategy. One was at OSCON, and the other was a keynote session at the Community Leadership Summit.

O'Reilly does an awesome job recording all the talks at OSCON, and they let the speakers download and share a copy of their own talk. Thank you O'Reilly! The CLS talk was also recorded, but I haven't seen it published yet. The Kaltura guys do all of the filming and post-production on the side of their day jobs, which is a respectable amount of work to do for the community. I'll update this post when the video becomes available.

by hingo at August 17, 2015 12:58 PM

Peter Zaitsev

MySQL is crashing: a support engineer’s point of view

In MySQL QA Episode #12, “MySQL is Crashing, now what?,” Roel demonstrated how to collect crash-related information that will help Percona discover and fix the issue you are experiencing.

As a Support Engineer, I (Sveta) am very happy to see this post – but as a person who understands writing better than recording, I’d like to have the same information in textual form. We discussed it, and decided to do a joint blog post. Hence, this post :)

If you haven’t seen the video yet, or you do not have any experience with gdb, core files and crashes, I highly recommend watching it first.

Once you have an idea of why crashes happen, what to do after a crash happens in your environment, and how to open a Support issue and/or a bug report, you’re ready for the next step: what information do you need to provide? Note that the more complete and comprehensive the information you provide, the quicker the evaluation and potential fix process will go – it’s a win-win situation!

First, we need the MySQL error log file. If possible, please send us the full error log file. Often users like to send only the part which they think is relevant, but the error log file can contain other information recorded before the crash happened: for example, records about table corruptions, lack of disk space, issues with the InnoDB dictionary, etc.

If your error log is quite large, please note it will compress very well using a standard compression tool like gzip. If for some reason you cannot send the full error log file, please send all lines written after the words “mysqld: ready for connections” (as seen the last time before the actual crash) until the end of the error log file (alternatively, you can also search for rows starting with the word “Version:”). Or, if you use scripts (or mysqld_safe) which automatically restart MySQL Server in case of disaster, look for the second-to-last server start – the last one will be the automatic restart after the crash.
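
A minimal sketch of how to locate and extract that section (the log path here is illustrative; adjust it to your installation):

$ grep -n 'ready for connections' /var/log/mysql/error.log | tail -2   # note the line number of the start before the crash
$ tail -n +<line-number> /var/log/mysql/error.log | gzip > error_extract.log.gz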

An example which includes an automatic restart as mentioned above:

2015-08-03 14:24:03 9911 [Note] /home/sveta/SharedData/Downloads/5.6.25/bin/mysqld: ready for connections.
Version: '5.6.25-73.1-log'  socket: '/tmp/mysql_sandbox21690.sock'  port: 21690  Percona Server (GPL), Release 73.1, Revision 07b797f
2015-08-03 14:24:25 7f5b193f9700 InnoDB: Buffer pool(s) load completed at 150803 14:24:25
11:25:12 UTC - mysqld got signal 4 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/
key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=1
max_threads=216
thread_count=1
connection_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 348059 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(my_print_stacktrace+0x2e)[0x8dd38e]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(handle_fatal_signal+0x491)[0x6a5dc1]
/lib64/libpthread.so.0(+0xf890)[0x7f5c58ac8890]
/lib64/libc.so.6(__poll+0x2d)[0x7f5c570fbc5d]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(_Z26handle_connections_socketsv+0x1c2)[0x5f64c2]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(_Z11mysqld_mainiPPc+0x1b5d)[0x5fd87d]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5c57040b05]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld[0x5f10fd]
You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.
150803 14:25:12 mysqld_safe Number of processes running now: 0
150803 14:25:12 mysqld_safe mysqld restarted
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld: /lib64/libssl.so.1.0.0: no version information available (required by /home/sveta/SharedData/Downloads/5.6.25/bin/mysqld)
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld: /lib64/libcrypto.so.1.0.0: no version information available (required by /home/sveta/SharedData/Downloads/5.6.25/bin/mysqld)
2015-08-03 14:25:12 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2015-08-03 14:25:12 0 [Note] /home/sveta/SharedData/Downloads/5.6.25/bin/mysqld (mysqld 5.6.25-73.1-log) starting as process 10038 ...
2015-08-03 14:25:12 10038 [Warning] Buffered warning: Changed limits: max_open_files: 1024 (requested 50005)
2015-08-03 14:25:12 10038 [Warning] Buffered warning: Changed limits: max_connections: 214 (requested 10000)
2015-08-03 14:25:12 10038 [Warning] Buffered warning: Changed limits: table_open_cache: 400 (requested 4096)
2015-08-03 14:25:12 10038 [Note] Plugin 'FEDERATED' is disabled.
2015-08-03 14:25:12 10038 [Note] InnoDB: Using atomics to ref count buffer pool pages
2015-08-03 14:25:12 10038 [Note] InnoDB: The InnoDB memory heap is disabled
2015-08-03 14:25:12 10038 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2015-08-03 14:25:12 10038 [Note] InnoDB: Memory barrier is not used
2015-08-03 14:25:12 10038 [Note] InnoDB: Compressed tables use zlib 1.2.3
2015-08-03 14:25:12 10038 [Note] InnoDB: Using Linux native AIO
2015-08-03 14:25:12 10038 [Note] InnoDB: Using CPU crc32 instructions
2015-08-03 14:25:12 10038 [Note] InnoDB: Initializing buffer pool, size = 4.0G
2015-08-03 14:25:13 10038 [Note] InnoDB: Completed initialization of buffer pool
2015-08-03 14:25:13 10038 [Note] InnoDB: Highest supported file format is Barracuda.
2015-08-03 14:25:13 10038 [Note] InnoDB: The log sequence numbers 514865622 and 514865622 in ibdata files do not match the log sequence number 514865742 in the ib_logfiles!
2015-08-03 14:25:13 10038 [Note] InnoDB: Database was not shutdown normally!
2015-08-03 14:25:13 10038 [Note] InnoDB: Starting crash recovery.
2015-08-03 14:25:13 10038 [Note] InnoDB: Reading tablespace information from the .ibd files...
2015-08-03 14:25:14 10038 [Note] InnoDB: Restoring possible half-written data pages
2015-08-03 14:25:14 10038 [Note] InnoDB: from the doublewrite buffer...
InnoDB: Last MySQL binlog file position 0 150866, file name mysql-bin.000006
2015-08-03 14:25:16 10038 [Note] InnoDB: 128 rollback segment(s) are active.
2015-08-03 14:25:16 10038 [Note] InnoDB: Waiting for purge to start
2015-08-03 14:25:16 10038 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.25-rel73.1 started; log sequence number 514865742
2015-08-03 14:25:16 7f67ceff9700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
2015-08-03 14:25:16 10038 [Note] Recovering after a crash using mysql-bin
2015-08-03 14:25:16 10038 [Note] Starting crash recovery...
2015-08-03 14:25:16 10038 [Note] Crash recovery finished.
2015-08-03 14:25:17 10038 [Note] RSA private key file not found: /home/sveta/sandboxes/rsandbox_Percona-Server-5_6_25/master/data//private_key.pem. Some authentication plugins will not work.
2015-08-03 14:25:17 10038 [Note] RSA public key file not found: /home/sveta/sandboxes/rsandbox_Percona-Server-5_6_25/master/data//public_key.pem. Some authentication plugins will not work.
2015-08-03 14:25:17 10038 [Note] Server hostname (bind-address): '127.0.0.1'; port: 21690
2015-08-03 14:25:17 10038 [Note]   - '127.0.0.1' resolves to '127.0.0.1';
2015-08-03 14:25:17 10038 [Note] Server socket created on IP: '127.0.0.1'.
2015-08-03 14:25:17 10038 [Warning] 'proxies_priv' entry '@ root@thinkie' ignored in --skip-name-resolve mode.
2015-08-03 14:25:17 10038 [Note] Event Scheduler: Loaded 0 events
2015-08-03 14:25:17 10038 [Note] /home/sveta/SharedData/Downloads/5.6.25/bin/mysqld: ready for connections.
Version: '5.6.25-73.1-log'  socket: '/tmp/mysql_sandbox21690.sock'  port: 21690  Percona Server (GPL), Release 73.1, Revision 07b797f

Usually the error log file contains the actual query which caused the crash. If it does not and you know the query (for example, if your application logs errors / query problems), please send us this query too. Additionally, if possible, include the CREATE TABLE statements for any tables mentioned in the query. Actually, working with the query is the first thing you can do to resolve the issue: try to run the query (on a non-production/test server which is as close a copy to your production server as possible) to ensure it crashes MySQL Server consistently. If so, you can try to create a temporary workaround by avoiding this kind of query in your application.

If you have doubts as to which query caused the crash, but have the general query log turned ON, you can use the parse_general_log.pl utility from percona-qa to create a potential test case. Simply execute:

$ sudo yum install bzr
$ cd ~
$ bzr branch lp:percona-qa
$ cp /path_that_contains_your_general_log/your_log_file.sql ~
$ ~/percona-qa/parse_general_log.pl -i./your_log_file.sql -o./output.sql

And subsequently execute output.sql against mysqld on a non-production test server to see if a crash is produced. Alternatively, you may mail us the output.sql file (provided your company privacy etc. policies allow for this). If you want to try and reduce the testcase further, please see QA Episode #7 on reducing testcases.
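
Replaying the generated file could be done like this (a sketch; --force makes the client continue past errors so the whole file is attempted):

$ mysql -uroot -p --force < ./output.sql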

The next thing we need is a backtrace. You usually have a simple backtrace shown in the error log directly after the crash. An example (extracted from an error log) of what this looks like:

stack_bottom = 0 thread_stack 0x40000
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(my_print_stacktrace+0x2e)[0x8dd38e]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(handle_fatal_signal+0x491)[0x6a5dc1]
/lib64/libpthread.so.0(+0xf890)[0x7f5c58ac8890]
/lib64/libc.so.6(__poll+0x2d)[0x7f5c570fbc5d]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(_Z26handle_connections_socketsv+0x1c2)[0x5f64c2]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(_Z11mysqld_mainiPPc+0x1b5d)[0x5fd87d]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5c57040b05]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld[0x5f10fd]

Note that the above backtrace is mangled. You can send us the file like this (we can demangle it). However, if you want to work with it yourself more comfortably, you can unmangle it with the help of the c++filt utility:

sveta@linux-85fm:~/sandboxes/rsandbox_Percona-Server-5_6_25> cat master/data/msandbox.err | c++filt
...
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(my_print_stacktrace+0x2e)[0x8dd38e]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(handle_fatal_signal+0x491)[0x6a5dc1]
/lib64/libpthread.so.0(+0xf890)[0x7f5c58ac8890]
/lib64/libc.so.6(__poll+0x2d)[0x7f5c570fbc5d]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(handle_connections_sockets()+0x1c2)[0x5f64c2]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld(mysqld_main(int, char**)+0x1b5d)[0x5fd87d]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5c57040b05]
/home/sveta/SharedData/Downloads/5.6.25/bin/mysqld[0x5f10fd]
...

Now the backtrace looks much nicer. When sending error log reports to us, please do not use c++filt on them before sending. We have a list of known bugs, and to scan the known bugs list we need to receive the error log unaltered.

You can also turn core files ON. Core files are memory dump files, created when a process crashes. They are very helpful for debugging, because they contain not only the backtrace of the crashing thread, but backtraces of all threads, and much of what was in memory at the time the crash occurred.

Sidenote: Please note it is always a good idea to have the debuginfo package (for example Percona-Server-56-debuginfo.x86_64 from the Percona Repository) installed. This package provides the debugging symbols for Percona Server (there are similar packages for other distributions) and ensures that stack traces are more readable and contain more information. It is important to ensure that you have the right package version etc., as symbols are different for each version of mysqld. If you have installed Percona Server from our repository, you can simply install the debuginfo package; the version will be correct, and the package will be auto-updated when Percona Server is updated.

By default the MySQL server does not create core files. To let it do so, you can follow instructions from the “GDB Cheat sheet” (page 2 under header ‘Core Files Cheat Sheet’). In short:

  • Add the option core-file under the [mysqld] section of your configuration file (see the snippet after this list)
  • Tune your OS options, so it allows mysqld to create core files as described in the cheat sheet
    echo "core.%p.%e.%s" > /proc/sys/kernel/core_pattern
    ulimit -c unlimited
    sudo sysctl -w fs.suid_dumpable=2

    Note: some systems use ‘kernel’ instead of ‘fs’: use kernel.suid_dumpable=2 instead if you get key or file warnings/errors.
  • Restart the MySQL server
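
For reference, the configuration change from the first bullet is just this minimal my.cnf fragment:

[mysqld]
core-file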

Besides the core file generated by the MySQL server, you can also set up the operating system to dump a core file. These are two different core files (for a single crash of the mysqld binary), and the amount of information contained within may differ. The procedure above shows how to set up the one for the MySQL Server alone.

If you would like your operating system to dump a core file as well, please see the MySQL QA Episode #12 video. Also, please note that changing the ulimit and fs.suid_dumpable settings may alter the security of your system. Please read more about these options online before using them or leaving them permanently enabled on a production system.

Once a core file is generated, you can use the GDB utility to debug the core file (also called a ‘coredump’). GDB allows you to better resolve backtraces (also called ‘stack traces’ or ‘stacks’), for example by taking a backtrace of all threads instead of only the crashing thread. This is of course better than the single backtrace available in the error log file. To use GDB, you need to first start it:

gdb /path_to_mysqld /path_to_core

/path_to_core is usually your data directory (for coredumps produced by mysqld as a result of using the --core-file option in your my.cnf file), or sometimes the same directory where the crashing binary is (for coredumps produced by the OS) – though you can specify an alternate fixed location for OS coredumps, as shown in the cheat sheet. Note that OS-generated dumps are sometimes written with very few privileges, so you may have to use chown/chmod/sudo to access them.

Once you’re in GDB, and all looks fine, run the commands bt (backtrace) and thread apply all bt (get a backtrace for all threads) to get the stacktraces. bt should more or less match the backtrace seen in the error log, but sometimes this may differ.

For us, ideally you would run the following commands in GDB (as seen in the cheat sheet):

set trace-commands on
set pagination off
set print pretty on
set print array on
set print array-indexes on
set print elements 4096
set logging file gdb_standard.txt
set logging on
thread apply all bt
set logging off
set logging file gdb_full.txt
set logging on
thread apply all bt full

After you run these commands and have exited (quit) GDB, please send us the ./gdb_standard.txt and ./gdb_full.txt files.

Finally, we would be happy to receive the actual core file from you. In terms of security and privacy, please note that a core file often contains fragments, or sections, or even the full memory of your server.

However, a core file without mysqld is useless, so please add the mysqld binary together with the core file. If you use our compiled binaries you can also just specify the exact package and file name which you downloaded, but if you use a self-compiled version of the server, the mysqld binary is required for us to resolve backtraces and other necessary information (like variables) from your core file. Generally speaking, it’s easier just to send mysqld along.
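
As an illustration (the core file name here is hypothetical; it depends on your core_pattern setting and paths), both files can be bundled into a single archive before uploading:

$ tar czf crash_bundle.tar.gz /var/lib/mysql/core.10038.mysqld.11 /usr/sbin/mysqld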

Also, it would be really nice if you send us the library files which are dynamically linked with the mysqld you use. You can get them by using a tool called ldd_files.sh from the percona-qa suite. Just create a temporary directory, copy your mysqld binary to it and run the tool on it:

sveta@thinkie:~/tmp> wget http://bazaar.launchpad.net/~percona-core/percona-qa/trunk/download/head:/ldd_files.sh-20150713030145-8xdk0llrd3skfsan-1/ldd_files.sh
sveta@thinkie:~/tmp> mkdir tmp
sveta@thinkie:~/tmp> cd tmp/
sveta@thinkie:~/tmp/tmp> cp /home/sveta/SharedData/Downloads/5.6.25/bin/mysqld . # Copy of your mysqld
sveta@thinkie:~/tmp/tmp> ../ldd_files.sh mysqld # Run the tool on it
cp: cannot stat ‘./mysqld: /lib64/libssl.so.1.0.0: no version information available’: No such file or directory # Ignore
cp: cannot stat ‘./mysqld: /lib64/libcrypto.so.1.0.0: no version information available’: No such file or directory # Ignore
sveta@thinkie:~/tmp/tmp> ls
ld-linux-x86-64.so.2 libaio.so.1 libcrypto.so.1.0.0 libcrypt.so.1 libc.so.6 libdl.so.2 libgcc_s.so.1 libm.so.6 libpthread.so.0 librt.so.1 libssl.so.1.0.0 libstdc++.so.6 libz.so.1 mysqld # Files to supply in combination with mysqld

These library files are needed in case some of the frames from the stacktrace are system calls, so that our developers can also resolve/check those frames.

Summary

If you hit a crash, please send us (in order of preference, but even better ‘all of these’):

  • The error log file (please send it unaltered – i.e. before c++filt was executed – which allows us to scan for known bugs)
  • The crashing query (from your application logs and/or extracted from the core file – ref the query extraction blog post)
    • Please include the matching CREATE TABLE statements
  • A resolved backtrace (and/or preferably the ./gdb_standard.txt and ./gdb_full.txt files)
  • The core file together with the mysqld binary and preferably the ldd files

Thank you!

 

The post MySQL is crashing: a support engineer’s point of view appeared first on MySQL Performance Blog.

by Sveta Smirnova at August 17, 2015 11:38 AM

Jean-Jerome Schmidt

Become a MySQL DBA blog series - The Query Tuning Process

Query tuning is something that a DBA does on a daily basis - analyse queries and updates, how these interact with the data and schema, and optimize for performance. This is an extremely important task as this is where database performance can be significantly improved - sometimes by orders of magnitude. 

In the next few posts, we will cover the basics of query tuning - indexing, what types of queries to avoid, optimizer hints, EXPLAIN and execution plans, schema tips, and so on. We will start, though, by discussing the process of query review - how to gather data and which methods are the most efficient.

This is the ninth installment in the 'Become a MySQL DBA' blog series. Our previous posts in the DBA series include Configuration Tuning, Live Migration using MySQL Replication, Database Upgrades, Replication Topology Changes, Schema Changes, High Availability, Backup & Restore, Monitoring & Trending.

Data gathering

There are a couple of ways to grab information about queries that are executed on the database. MySQL itself provides three ways - general log, binlog and slow query log.

General log

The general log is the least popular way as it causes a significant amount of logging and has a high impact on overall performance (thanks to PavelK for correcting us in the comments section). It does, though, store data about queries that are being executed, together with all the information needed to assess how long a given query took.

                   40 Connect   root@localhost on sbtest
                   40 Query     set autocommit=0
                   40 Query     set session read_buffer_size=16384
                   40 Query     set global read_buffer_size=16384
                   40 Query     select count(*) from sbtest.sbtest3 where pad like '6%'
150812  7:37:42    40 Query     select count(*) from sbtest.sbtest3 where pad like '6%'
150812  7:41:45    40 Query     select count(*) from sbtest.sbtest3 where pad like '6%'
150812  7:45:46    40 Query     select count(*) from sbtest.sbtest3 where pad like '6%'
150812  7:49:56    40 Query     select count(*) from sbtest.sbtest3 where pad like '6%'
150812  7:54:08    40 Quit

Given the additional impact, the general log is not really a feasible way of collecting slow queries, but it can still be a valid source if you have it enabled for some other reason.

Binary log

Binary logs store all modifications that were executed on the database - they are used for replication or for point-in-time recovery. There’s no reason, though, why you couldn’t use this data to check the performance of DMLs - as long as the query has been logged in its original format. This means that you should be fine for the majority of the writes as long as the binlog format is set to ‘mixed’. Even better would be to use the ‘statement’ format, but it’s not recommended due to possible issues with data consistency between the nodes. The main difference between the ‘statement’ and ‘mixed’ formats is that in ‘mixed’ format, all queries which might cause inconsistency are logged in a safe, ‘row’ format. That format, though, doesn’t preserve the original query statement, so the data cannot be used for a query review.

If these requirements are fulfilled, binary logs will give us enough data to work with - the exact query statement and the time taken to execute it on the master. Note that this is not a very popular way of collecting the data. It has its own uses, though. For example, if we are concerned about write traffic, using binary logs is a perfectly valid way of getting the data, especially if they are already enabled.

Slow query log

The slow query log is probably the most common source of information for slow queries. It was designed to log the most important information about the queries - how long they took, how many rows were scanned, how many rows were sent to the client.

The slow query log can be enabled by setting the slow_query_log variable to 1. Its location can be set using the slow_query_log_file variable. Another variable, long_query_time, sets the threshold above which queries are logged. By default it’s 10 seconds, which means queries that execute in under 10 seconds will not be logged. This variable is dynamic and you can change it at any time. If you set it to 0, all queries will be logged into the slow log. It is possible to use fractions when setting long_query_time, so settings like 0.1 or 0.0001 are valid.

What you need to remember when dealing with long_query_time is that a change on the global level affects only new connections. When changing it in a session, it affects the current session only (as one would expect). If you use some kind of connection pooling, this may become a significant issue. Percona Server has an additional variable, slow_query_log_use_global_control, which eliminates this drawback - it makes long_query_time (and a couple of other slow-log-related settings introduced in Percona Server) a truly dynamic variable that also affects currently open sessions.
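
As an example (the values are illustrative), the slow log can be enabled and tuned at runtime like this:

$ mysql -uroot -p -e "SET GLOBAL slow_query_log = 1"
$ mysql -uroot -p -e "SET GLOBAL long_query_time = 0.1"
$ # Percona Server only: make the new threshold apply to already-open sessions too
$ mysql -uroot -p -e "SET GLOBAL slow_query_log_use_global_control = 'long_query_time'"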

Let’s take a look at the content of the slow query log:

# Time: 150812  8:25:19
# User@Host: root[root] @ localhost []  Id:    39
# Schema: sbtest  Last_errno: 0  Killed: 0
# Query_time: 238.396414  Lock_time: 0.000130  Rows_sent: 1  Rows_examined: 59901000  Rows_affected: 0
# Bytes_sent: 69
SET timestamp=1439367919;
select count(*) from sbtest.sbtest3 where pad like '6%';

In this entry we can see information about the time when the query was logged, the user who executed the query, the thread id inside MySQL (something you’d see as Id in your processlist output), the current schema, whether the query failed with some error code, and whether it was killed or not. Then we have the most interesting data: how long it took to execute the query, how much of this time was spent on row-level locking, how many rows were sent to the client, how many rows were scanned in MySQL, and how many rows were modified by the query. Finally we have info on how many bytes were sent to the client, the timestamp of the time when the query was executed, and the query itself.

This gives us a pretty good idea of what may be wrong with the query. Query_time is obvious - the longer a query takes to execute, the more impact it will have on the server. But there are also other clues. For example, if we see that the number of rows examined is high compared to the rows sent, it may mean that the query is not indexed properly and is scanning many more rows than it should. A high lock time can be a hint that we are suffering from row-level locking contention. If a query had to wait some time to grab all the locks it needed, something is definitely not right. It could be that some other long-running query already acquired some of the locks needed, or that there are long-running transactions that stay open and do not release their locks. As you can see, even such simple information may be of great value for a DBA.

Percona Server can log some additional information into the slow query log - you can manage what’s being logged by using variable log_slow_verbosity. Below is a sample of such data:

# Time: 150812  9:41:32
# User@Host: root[root] @ localhost []  Id:    44
# Schema: sbtest  Last_errno: 0  Killed: 0
# Query_time: 239.092995  Lock_time: 0.000085  Rows_sent: 1  Rows_examined: 59901000  Rows_affected: 0
# Bytes_sent: 69  Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0
# InnoDB_trx_id: 13B08
# QC_Hit: No  Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: No  Filesort_on_disk: No  Merge_passes: 0
#   InnoDB_IO_r_ops: 820579  InnoDB_IO_r_bytes: 13444366336  InnoDB_IO_r_wait: 206.397731
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 65314
SET timestamp=1439372492;
select count(*) from sbtest.sbtest3 where pad like '6%';

As you can see, we have all the data from the default settings and much more. We can see if a query created temporary tables, how many in total and how many of them were created on disk. We can also see the total size of those tables. This data gives us important insight - are temporary tables an issue or not? Small temporary tables may be fine, larger ones can significantly impact overall performance. Next, we see some characteristics of the query - did it use the query cache, did it make a full table scan? Did it make a join without using indexes? Did it run a sort operation? Did it use disk for the sorting? How many merge passes did the filesort algorithm have to do?

The next section contains information about InnoDB - how many read operations it had to do, how many bytes of data it read, how much time MySQL spent waiting for InnoDB I/O activity, row lock acquisition or waiting in the queue to start being processed by InnoDB. We can also see the approximate number of unique pages that the query accessed. Finally, we see the timestamp and the query itself.

This additional information is useful to pinpoint the problem with a query. By looking at our example it’s clear that the query does not use any index and it makes a full table scan. By looking at the InnoDB data we can confirm that the query did a lot of I/O - it scanned ~13G of data. We can also confirm that the majority of the query execution time (239 seconds) was spent on InnoDB I/O (InnoDB_IO_r_wait - 206 seconds).

Impact of the slow log on the performance

The slow log is definitely a great way of collecting data about the performance of queries. Unfortunately, it comes at a price - enabling the slow query log adds some additional load on MySQL, a load that impacts overall performance. We are talking here about both throughput and stability; we do not want to see drops in performance, as they don’t play well with user experience and may cause additional problems through temporary pileups in queries, scanned rows, etc.

We’ve prepared a very simple and generic benchmark to show you the impact of different slow log verbosity levels. We set up an m4.xlarge instance on AWS and used sysbench to build a single table. The workload is two threads (m4.xlarge has four cores), it’s read-only, and the data set fits in memory - a very simple CPU-bound workload. Below is the exact sysbench command:

sysbench \
--test=/root/sysbench/sysbench/tests/db/oltp.lua \
--num-threads=2 \
--max-requests=0 \
--max-time=600 \
--mysql-host=localhost \
--mysql-user=sbtest \
--mysql-password=sbtest \
--oltp-tables-count=1 \
--oltp-read-only=on \
--oltp-index-updates=200 \
--oltp-non-index-updates=10 \
--report-interval=1 \
--oltp-table-size=800000 \
run

We used four verbosity stages for the slow log:

  • disabled
  • enabled, no Percona Server features
  • enabled, log_slow_verbosity='full'
  • enabled, log_slow_verbosity='full,profiling_use_getrusage,profiling'

The last one adds profiling information for a query - the time spent in each of the states it went through and the overall CPU time used for it. Here are the results:

As you can see, throughput-wise, the impact is not that bad, except for the most verbose option. Unfortunately, it’s not a stable throughput - as you can see, there are many periods when no transaction was executed; this is true for all runs with the slow log enabled. This is a significant drawback of using the slow log to collect the data - if you are ok with some impact, it’s a great tool. If not, we need to look for an alternative. Of course, this is a very generic test and your mileage may vary - the impact may depend on many factors: CPU utilization, I/O throughput, number of queries per second, exact query mix, etc. If you are interested in checking the impact on your system, you need to perform tests on your own.

Using tcpdump to grab the data

As we have seen, using slow log, while allowing you to collect a great deal of information, significantly impacts the throughput of the server. That’s why yet another way of collecting the data was developed. The idea is simple - MySQL sends the data over the network so all queries are there. If you capture the traffic between the application and the MySQL server, you’ll have all the queries exchanged during that time. You know when a query started, you know when a given query finished - this allows you to calculate the query’s execution time.

Using the following command you can capture the traffic hitting port 3306 on a given host.

tcpdump -s 65535 -x -nn -q -tttt -i any -c 1000 port 3306 > mysql.tcp.txt

Of course, it still causes some performance impact; let’s compare it with a clean server, with no slow log enabled:

As you can see, total throughput is lower and spiky, but it’s somewhat more stable than when the slow log is enabled. What’s more important, tcpdump can be executed on the MySQL host but it can also be used on a proxy node (you may need to change the port in some cases), to ease the load on the MySQL node itself. In such a case the performance impact will be even lower. Of course, tcpdump can’t provide you with as detailed information as the query log does - all you can grab is the query itself and the amount of data sent from the server to the client. There’s no info about rows scanned or sent, there’s no info on whether a query created a temporary table or not - just the execution time and query size.

Given the fact that both main methods, slow log and tcpdump, have their pros and cons, it’s common to combine them. You can use long_query_time to filter out most of the queries and log only the slowest ones. You can use tcpdump to collect the data on a regular basis (or even all the time) and use the slow log in some particular cases, if you find an intriguing query. You can use data from the slow log only for thorough query reviews that happen a couple of times a year and stick to tcpdump on a daily basis. The two methods complement each other and it’s up to the DBA to decide how to use them.

Once we have data captured using any of the methods described in this blog post, we need to process it. While data in the log files can easily be read by a human, and it’s not rocket science to print and parse data captured by tcpdump, that’s not really the way you’d like to approach a query review - it’s definitely too hard to get the total picture and there’s too much noise. You need something that will aggregate the information you collected and present you with a nice summary. We’ll discuss such a tool in the next post in this series.

 


by Severalnines at August 17, 2015 05:03 AM

August 14, 2015

MariaDB AB

Optimizing Conservative In-order Parallel Replication with MariaDB 10.0

geoff_montee_g

Conservative in-order parallel replication is a great feature in MariaDB 10.0 that improves replication performance by using knowledge of group commit on the master to commit transactions in parallel on a slave. If slave_parallel_threads is greater than 0, then the SQL thread will instruct multiple worker threads to concurrently apply transactions that were committed in the same group commit on the master.

Conservative in-order parallel replication is a good alternative to out-of-order parallel replication for use cases where explicitly setting domain IDs for independent transactions is impractical or impossible.

Although setting slave_parallel_threads is enough to enable conservative in-order parallel replication, you may have to tweak binlog_commit_wait_usec and binlog_commit_wait_count in order to increase your group commit ratio on the master, which is necessary to enable parallel applying on the slave. In this blog post, I'll show an example where this is the case.

Note: MariaDB 10.1 also adds the slave_parallel_mode configuration variable to enable other modes for in-order parallel replication.

Configure the master and slave

For our master, let's configure the following settings:

[mysqld]
log_bin
binlog_format=ROW
server_id=1

For our slave, let's configure the following:

[mysqld]
server_id=2
slave_parallel_threads=2

Set up replication on master

Now, let's set up the master for replication:

MariaDB [(none)]> CREATE USER 'repl'@'172.31.31.73' IDENTIFIED BY 'password';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'172.31.31.73';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> RESET MASTER;
Query OK, 0 rows affected (0.22 sec)

MariaDB [(none)]> SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: master-bin.000001
        Position: 322
    Binlog_Do_DB: 
Binlog_Ignore_DB: 
1 row in set (0.00 sec)

MariaDB [(none)]> SELECT BINLOG_GTID_POS('master-bin.000001', 322);
+-------------------------------------------+
| BINLOG_GTID_POS('master-bin.000001', 322) |
+-------------------------------------------+
|                                           |
+-------------------------------------------+
1 row in set (0.00 sec)

If you've set up GTID replication with MariaDB 10.0 before, you've probably used BINLOG_GTID_POS to convert a binary log position to its corresponding GTID position. On newly installed systems like my example above, this GTID position might be blank.

Now, let's set up replication on the slave:

MariaDB [(none)]> SET GLOBAL gtid_slave_pos ='';
Query OK, 0 rows affected (0.09 sec)

MariaDB [(none)]> CHANGE MASTER TO master_host='172.31.31.72', master_user='repl', master_password='password', master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.04 sec)

MariaDB [(none)]> START SLAVE;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 172.31.31.72
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: master-bin.000001
          Read_Master_Log_Pos: 322
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 601
        Relay_Master_Log_File: master-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 322
              Relay_Log_Space: 898
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 
1 row in set (0.00 sec)

Create a test table on master

Let's set up a test table on the master using mysqlslap. The table will automatically be replicated to the slave:

mysqlslap -u root --create-schema=db1 --no-drop \
    --create="CREATE TABLE test_table (id INT AUTO_INCREMENT PRIMARY KEY,file BLOB);" 

Generate some data on master

Now, in a Linux shell on the master, let's create a random 1 KB file:

[gmontee@master ~]$ dd if=/dev/urandom of=/tmp/file.out bs=1KB count=1
1+0 records in
1+0 records out
1000 bytes (1.0 kB) copied, 0.000218694 s, 4.6 MB/s
[gmontee@master ~]$ chmod 0644 /tmp/file.out

Get group commit status on master (before first test)

Before we insert our data on the master, let's get the starting values of Binlog_commits and Binlog_group_commits.

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Binlog_commits       | 20    |
| Binlog_group_commits | 20    |
+----------------------+-------+
2 rows in set (0.00 sec)

Insert some data on master

Now, let's use mysqlslap to insert our random file into the table a bunch of times:

mysqlslap -u root --create-schema=db1 --concurrency=5 --iterations=20 --no-drop \
    --query="INSERT INTO db1.test_table (file) VALUES (LOAD_FILE('/tmp/file.out'));"

Get group commit status on master (after first test)

After inserting our data, let's get the values of Binlog_commits and Binlog_group_commits again.

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Binlog_commits       | 120   |
| Binlog_group_commits | 120   |
+----------------------+-------+
2 rows in set (0.00 sec)

To get the group commit ratio for our batch job, we would subtract the pre-job Binlog_commits and Binlog_group_commits values from the post-job values:

transactions/group commit = (Binlog_commits (after) - Binlog_commits (before))/(Binlog_group_commits (after) - Binlog_group_commits (before))

So here, we have:

transactions/group commit = (120 - 20) / (120 - 20) = 100 / 100 = 1 transaction/group commit

At 1 transaction per group commit, there isn't any potential for the slave to apply transactions in parallel.
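
If you want to script this calculation, a rough bash sketch could look like the following (it samples the two counters around a workload; add login credentials as needed):

before=($(mysql -u root -N -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Binlog_commits', 'Binlog_group_commits')" | awk '{print $2}'))
# ... run the workload here (e.g. a mysqlslap job like the ones in this post) ...
after=($(mysql -u root -N -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Binlog_commits', 'Binlog_group_commits')" | awk '{print $2}'))
# index 0 is Binlog_commits, index 1 is Binlog_group_commits (alphabetical order)
echo "scale=2; (${after[0]} - ${before[0]}) / (${after[1]} - ${before[1]})" | bc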

Insert some more data on master

Now let's change binlog_commit_wait_count and binlog_commit_wait_usec:

MariaDB [(none)]> SET GLOBAL binlog_commit_wait_count=2;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> SET GLOBAL binlog_commit_wait_usec=10000;
Query OK, 0 rows affected (0.00 sec)

And then let's insert some more data:

mysqlslap -u root --create-schema=db1 --concurrency=5 --iterations=20 --no-drop \
    --query="INSERT INTO db1.test_table (file) VALUES (LOAD_FILE('/tmp/file.out'));"

Get group commit status on master (after second test)

After changing the values for binlog_commit_wait_count and binlog_commit_wait_usec and inserting more data, let's get the values of Binlog_commits and Binlog_group_commits again.

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Binlog_commits       | 220   |
| Binlog_group_commits | 145   |
+----------------------+-------+
2 rows in set (0.00 sec)

So here, we have:

transactions/group commit = (220 - 120) / (145 - 120) = 100 / 25 = 4 transactions/group commit

At 4 transactions/group commit, there is much more potential for the slave to apply transactions in parallel.

Check slave concurrency

Now that the values of binlog_commit_wait_count and binlog_commit_wait_usec have been tweaked on the master to allow for parallel applying on the slave, let's execute a bigger job on the master and then see if the slave actually does apply its transactions in parallel.

First, let's run this on the master:

mysqlslap -u root --create-schema=db1 --concurrency=100 --iterations=100 --no-drop \
    --query="INSERT INTO db1.test_table (file) VALUES (LOAD_FILE('/tmp/file.out'));"

And then, let's execute SHOW FULL PROCESSLIST on the slave:

MariaDB [(none)]> SHOW FULL PROCESSLIST;
+----+-------------+-----------+------+---------+------+-----------------------------------------------+-----------------------+----------+
| Id | User        | Host      | db   | Command | Time | State                                         | Info                  | Progress |
+----+-------------+-----------+------+---------+------+-----------------------------------------------+-----------------------+----------+
|  2 | system user |           | NULL | Connect | 2266 | Waiting for master to send event              | NULL                  |    0.000 |
|  3 | system user |           | NULL | Connect |    7 | closing tables                                | NULL                  |    0.000 |
|  4 | system user |           | NULL | Connect |    7 | closing tables                                | NULL                  |    0.000 |
|  5 | system user |           | NULL | Connect | 2266 | Waiting for room in worker thread event queue | NULL                  |    0.000 |
|  8 | root        | localhost | NULL | Query   |    0 | init                                          | SHOW FULL PROCESSLIST |    0.000 |
+----+-------------+-----------+------+---------+------+-----------------------------------------------+-----------------------+----------+
5 rows in set (0.00 sec)

Here, both of our worker threads have the state "closing tables", so both of them are applying transactions in parallel.

Conclusion

If you want to use conservative in-order parallel replication to improve slave performance, but you find that your slave isn't applying transactions in parallel, you may want to adjust binlog_commit_wait_count and binlog_commit_wait_usec.
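On the slave side, also make sure worker threads are actually available; without them, no amount of grouping on the master will help. A minimal sketch for MariaDB 10.0 (slave_parallel_threads can only be changed while the slave threads are stopped):

STOP SLAVE;
SET GLOBAL slave_parallel_threads = 2;  -- sized to taste; 2 matches the two workers in the processlist above
START SLAVE;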

About the Author

Geoff Montee is a Support Engineer with MariaDB. He has previous experience as a Database Administrator/Software Engineer with the U.S. Government, and as a System Administrator and Software Developer at Florida State University.

by geoff_montee_g at August 14, 2015 09:34 AM

August 13, 2015

Peter Zaitsev

The language of compression

Leif Walsh & friends

Leif Walsh will talk about the language of compression at Percona Live Amsterdam

Storage. Everyone needs it. Whether your data is in MySQL, a NoSQL, or somewhere in the cloud, with ever-growing data volumes – along with the need for SSDs to cut latency and replication to provide insurance – an organization’s storage footprint is an important place to look for savings. That’s where compression comes in (squeeze!) to save disk space.

Two Sigma software engineer Leif Walsh speaks the language of compression. Fluently. In fact, he’ll be speaking on that exact subject September 22 during the Percona Live conference in Amsterdam.

I asked him about his talk, and about Amsterdam, the other day. Here’s what he had to say.

* * *

Tom: Hi Leif, how will your talk help IT decision-makers cut through the marketing mumbo-jumbo on what’s important to focus on and what is not?
Leif: My talk will have three lessons aimed at those making storage decisions for their company:

  1. What are the key factors to consider when evaluating storage options, and how can they affect your bottom line?  This covers not only how storage tech influences your hardware, operations, and management costs, but also how it can facilitate new development initiatives and cut time-to-market for your products.
  2. How should you read benchmarks and marketing materials about storage technology?  You’ll learn what to look for in promotional material, and how to think critically about whether that material is applicable to your business needs.
  3. What’s the most effective way to communicate with storage vendors about your application’s requirements?  A lot of time can be spent in the early stages of a relationship in finding a common language for users and vendors to have meaningful discussions about users’ needs and vendors’ capacity to meet those needs.  With the tools you’ll learn in my talk, you’ll be able to accelerate quickly to the high-bandwidth conversations you need to have in order to make the right decision, and consequently, you’ll be empowered to evaluate more choices to find the best one faster.

Tom: In addition to IT decision-makers, who else should attend your session and what will they take away afterward?
Leif: My talk is primarily about the language that everyone in the storage community should be using to communicate. Therefore, storage vendors should attend to get ideas for how to express their benchmarks and their system’s properties more effectively, and application developers and operations people will learn strategies for getting better support and for making a convincing case to the decision makers in their own company.

Tom: Which session(s) are you most looking forward to besides your own?
Leif: Sam Kottler is a good friend and an intensely experienced systems engineer with a dynamic and boisterous personality, so I can’t wait to hear more about his experiences with Linux tuning.

As one of the original developers of TokuMX, I’ll absolutely have to check out Stephane’s talk about it, but I promise not to heckle. Charity Majors is always hilarious and has great experiences and insights to share, so I’ll definitely check out her talk too.

* * *

Catch Leif’s talk at Percona Live in Amsterdam September 21-23. Enter the promo code “BlogInterview” at registration and save €20! Register now!

The post The language of compression appeared first on MySQL Performance Blog.

by Tom Diederich at August 13, 2015 07:19 PM

Shlomi Noach

Percona Live Amsterdam: Community Dinner, Sep. 22nd

Keeping up with tradition, there will be a community event held at the upcoming Percona Live Europe: Amsterdam 2015 conference.

This year, Booking.com will be hosting the event at the company's headquarters in the heart of Amsterdam.

We will hold a community dinner (dish selection includes vegetarian; beverages will be served) in our cafeteria and hope to add some spicy activities to the event!

Space is limited, and tickets can be purchased via Eventbrite.

Special thanks to Daniël van Eeden and Jean-François Gagné for their work in making this happen!

Tuesday, September 22, 2015 from 6:30 PM to 10:00 PM (CEST)
Herengracht 597, 1017 CE
Amsterdam

Location: https://goo.gl/maps/06oOA

Walking route from conference venue: https://goo.gl/maps/Ocptu

by shlomi at August 13, 2015 05:21 PM

Peter Zaitsev

MySQL Quality Assurance: A Vision for the Future by Roel Van de Paar (Final Episode 13)

Welcome to the final – but most important – episode in the MySQL QA Series.

In it, I present my vision for all MySQL Quality Assurance – for all distributions – worldwide.

Episode 13: A Better Approach to all MySQL Regression, Stress & Feature Testing: Random Coverage Testing & SQL Interleaving

1. pquery Review
2. Random Coverage Testing
3. SQL Interleaving
4. The past & the future

Presented by Roel Van de Paar. Full-screen viewing @ 720p resolution recommended

Interested in the full MySQL QA Series?

The post MySQL Quality Assurance: A Vision for the Future by Roel Van de Paar (Final Episode 13) appeared first on MySQL Performance Blog.

by Roel Van de Paar at August 13, 2015 05:13 PM

August 12, 2015

Peter Zaitsev

Percona Live Amsterdam Discounted Pricing and Community Dinner!

The countdown is on for the annual Percona Live Data Performance Conference and Expo in Europe! This year the conference will be taking place in the great city of Amsterdam September 21-23rd. This three day conference will focus on the latest trends, news and best practices in the MySQL, NoSQL and Data in the Cloud markets, while looking forward to what’s on the long-term horizon within the global Data Performance industry. With 84 breakout sessions, 8 tutorial sessions and 6 keynotes, including a Data in the Cloud keynote panel, there will certainly be no lack of content.
Advanced Rate Registration ENDS August 16th so make sure to register now to secure the best price possible.

As it is a Percona Live Conference, there will certainly be no lack of FUN either!!!!

As tradition holds, there will be a Community Dinner. Tuesday night, September 22, Percona Live Diamond Sponsor Booking.com will be hosting the Community Dinner of the year at their very own headquarters located in historic Rembrandt Square in the heart of the city. After breakout sessions conclude, attendees will be picked up right outside of the venue and taken to Booking.com’s headquarters by canal boats! This gives all attendees the opportunity to play “tourist” while viewing the beauty of Amsterdam from the water. Attendees will be dropped off right next to Booking.com’s office! Come and show your support for the community while enjoying dinner and drinks! Space is limited so make sure to sign up ASAP!

So don’t forget: register for the conference and sign up for the community dinner before space is gone!
See you in Amsterdam!

The post Percona Live Amsterdam Discounted Pricing and Community Dinner! appeared first on MySQL Performance Blog.

by Kortney Runyan at August 12, 2015 09:37 PM

Jean-Jerome Schmidt

Become a MySQL DBA - Webinar Series: Schema Changes for MySQL Replication & Galera Cluster

With the rise of agile development methodologies, more and more systems and applications are built in series of iterations. This is true for the database schema as well, as it has to evolve together with the application. Unfortunately, schema changes and databases do not play well together. Changes usually require plenty of advance scheduling, and can be disruptive to your operations. 

In this new webinar, we will discuss how to implement schema changes in the way that least impacts your operations while ensuring the availability of your database. We will also cover some real-life examples and discuss how to handle them.

DATE & TIME

Europe/MEA/APAC
Tuesday, August 25th at 09:00 BST / 10:00 CEST (Germany, France, Sweden)
Register Now

North America/LatAm
Tuesday, August 25th at 09:00 Pacific Time (US) / 12:00 Eastern Time (US)
Register Now

AGENDA

  • Different methods to perform schema changes on MySQL Replication and Galera
    • rolling schema change
    • online alters
    • external tools, e.g., pt-online-schema-change
  • Differences between MySQL 5.5 and 5.6
  • Differences between MySQL Replication and Galera
  • Example real-life scenarios with MySQL Replication and Galera setups

SPEAKER

Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard. This webinar builds upon recent blog posts and related webinar series by Krzysztof on how to become a MySQL DBA.

We look forward to “seeing” you there and to insightful discussions!

by Severalnines at August 12, 2015 11:04 AM

Peter Zaitsev

MySQL replication primer with pt-table-checksum and pt-table-sync

MySQL replication is a process that allows you to easily maintain multiple copies of MySQL data by having them copied automatically from a master to a slave database.

It’s essential to make sure the slave servers have the same set of data as the master to ensure data is consistent within the replication stream. MySQL slave server data can drift from the master for many reasons – e.g. replication errors, accidental direct updates on slave, etc.

Here at Percona Support we highly recommend that our customers periodically run the pt-table-checksum tool to verify data consistency within their replication streams - especially after fixing replication errors on slave servers, to ensure that the slave has identical data to its master. You don’t want to put yourself in a situation where you need to fail over to a slave server and then find different data on it.

In this post, I will examine the usage of the pt-table-checksum and pt-table-sync tools from Percona Toolkit on different replication topologies. We often receive queries from customers about how to run these tools, and I hope this post will help.

Percona Toolkit is a free collection of advanced command-line tools to perform a variety of MySQL server and system tasks that are too difficult or complex to perform manually.

One of those tools is pt-table-checksum, which works by dividing table rows into chunks. The size of a chunk changes dynamically during the operation to avoid overloading the server. pt-table-checksum has many safeguards, including varying the chunk size, to make sure queries run in a desired amount of time.

pt-table-checksum verifies the chunk size by running an EXPLAIN query on each chunk. It also continuously monitors the slave servers to make sure replicas do not fall too far behind; if they do, the tool pauses itself to allow the slaves to catch up. There are many other safeguards built in, and you can find all the details in the documentation.

In my first example, I am going to run pt-table-checksum against a pair of replication servers - i.e. a master with only one slave in the replication topology. We will run the pt-table-checksum tool on the master server to verify data integrity on the slave, and if pt-table-checksum finds differences, we will sync those changes to the slave server via the pt-table-sync tool.

I have created a dummy table under the test database and inserted 10 records on the master server, as below:

mysql-master> create table dummy (id int(11) not null auto_increment primary key, name char(5)) engine=innodb;
Query OK, 0 rows affected (0.08 sec)
mysql-master> insert into dummy VALUES (1,'a'), (2,'b'), (3,'c'), (4,'d'), (5,'e'), (6,'f'), (7,'g'), (8,'h'), (9,'i'), (10,'j');
Query OK, 10 rows affected (0.00 sec)
Records: 10  Duplicates: 0  Warnings: 0
mysql-master> select * from dummy;
+------+------+
| id   | name |
+------+------+
|    1 | a    |
|    2 | b    |
|    3 | c    |
|    4 | d    |
|    5 | e    |
|    6 | f    |
|    7 | g    |
|    8 | h    |
|    9 | i    |
|   10 | j    |
+------+------+
10 rows in set (0.00 sec)

Then I intentionally deleted a few records from the slave server to make it inconsistent with the master for the purpose of this post.

mysql-slave> delete from dummy where id>5;
Query OK, 5 rows affected (0.03 sec)
mysql-slave> select * from dummy;
+----+------+
| id | name |
+----+------+
|  1 | a    |
|  2 | b    |
|  3 | c    |
|  4 | d    |
|  5 | e    |
+----+------+
5 rows in set (0.00 sec)

Now, in this case the master server has 10 records in the dummy table while the slave server has only 5 (the records with id>5 are missing) - we will run pt-table-checksum at this point on the master server to see if the tool catches those differences.

[root@master]# pt-table-checksum --replicate=percona.checksums --ignore-databases mysql h=localhost,u=checksum_user,p=checksum_password
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
07-11T18:30:13      0      1       10       1       0   1.044 test.dummy

This needs to be executed on the master. The user and password you specify will be used to connect not only to the master but to the slaves as well. You need the following privileges for the pt-table-checksum mysql user:

mysql-master> GRANT REPLICATION SLAVE,PROCESS,SUPER, SELECT ON *.* TO `checksum_user`@'%' IDENTIFIED BY 'checksum_password';
mysql-master> GRANT ALL PRIVILEGES ON percona.* TO `checksum_user`@'%';

Earlier, in the pt-table-checksum command, I used the --replicate option, which writes checksum results to the specified table, percona.checksums. Next I passed the --ignore-databases option, which accepts a comma-separated list of databases to ignore. Moreover, the --create-replicate-table and --empty-replicate-table options are enabled ("Yes") by default, and you can specify both options explicitly if you want to use a table other than percona.checksums.

pt-table-checksum reported 1 DIFF, which is the number of chunks that differ from the master on one or more slaves. You can find details about the tabular columns, e.g. TS, ERRORS and so on, in the pt-table-checksum documentation. After that, I ran the next command to identify which table has differences on the slave.

[root@master]# pt-table-checksum --replicate=percona.checksums --replicate-check-only --ignore-databases mysql h=localhost,u=checksum_user,p=checksum_password
Differences on slave
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
test.dummy 1 -5 1

In this command I used the --replicate-check-only option, which only reports the tables with differences; that is, only the checksum differences found on detected replicas are printed. It doesn’t checksum any tables. It checks replicas for differences found by previous checksumming, and then exits.

You may also log in to the slave and execute the query below to find out which tables have inconsistencies.

mysql-slave> SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks
FROM percona.checksums
WHERE (
master_cnt <> this_cnt
OR master_crc <> this_crc
OR ISNULL(master_crc) <> ISNULL(this_crc))
GROUP BY db, tbl;

pt-table-checksum identified that the test.dummy table is different on the slave; now we are going to use the pt-table-sync tool to synchronize table data between the MySQL servers.

[root@slave]# pt-table-sync --print --replicate=percona.checksums --sync-to-master h=localhost,u=checksum_user,p=checksum_password
REPLACE INTO `test`.`dummy`(`id`, `name`) VALUES ('6', 'f') /*percona-toolkit src_db:test src_tbl:dummy src_dsn:P=3306,h=192.168.0.130,p=...,u=checksum_user dst_db:test dst_tbl:dummy dst_dsn:h=localhost,p=...,u=checksum_user lock:1 transaction:1 changing_src:percona.checksums replicate:percona.checksums bidirectional:0 pid:24683 user:root host:slave*/;
REPLACE INTO `test`.`dummy`(`id`, `name`) VALUES ('7', 'g') /*percona-toolkit src_db:test src_tbl:dummy src_dsn:P=3306,h=192.168.0.130,p=...,u=checksum_user dst_db:test dst_tbl:dummy dst_dsn:h=localhost,p=...,u=checksum_user lock:1 transaction:1 changing_src:percona.checksums replicate:percona.checksums bidirectional:0 pid:24683 user:root host:slave*/;
REPLACE INTO `test`.`dummy`(`id`, `name`) VALUES ('8', 'h') /*percona-toolkit src_db:test src_tbl:dummy src_dsn:P=3306,h=192.168.0.130,p=...,u=checksum_user dst_db:test dst_tbl:dummy dst_dsn:h=localhost,p=...,u=checksum_user lock:1 transaction:1 changing_src:percona.checksums replicate:percona.checksums bidirectional:0 pid:24683 user:root host:slave*/;
REPLACE INTO `test`.`dummy`(`id`, `name`) VALUES ('9', 'i') /*percona-toolkit src_db:test src_tbl:dummy src_dsn:P=3306,h=192.168.0.130,p=...,u=checksum_user dst_db:test dst_tbl:dummy dst_dsn:h=localhost,p=...,u=checksum_user lock:1 transaction:1 changing_src:percona.checksums replicate:percona.checksums bidirectional:0 pid:24683 user:root host:slave*/;
REPLACE INTO `test`.`dummy`(`id`, `name`) VALUES ('10', 'j') /*percona-toolkit src_db:test src_tbl:dummy src_dsn:P=3306,h=192.168.0.130,p=...,u=checksum_user dst_db:test dst_tbl:dummy dst_dsn:h=localhost,p=...,u=checksum_user lock:1 transaction:1 changing_src:percona.checksums replicate:percona.checksums bidirectional:0 pid:24683 user:root host:slave*/;

This time I ran the pt-table-sync tool from the opposite host, i.e. from the slave, as I used the --sync-to-master option, which treats the DSN as a slave and syncs it to its master. Again, pt-table-sync will use the MySQL username and password you specify to connect to the slave as well as to its master. The --replicate option here examines the specified table to find the data differences, and --print just prints the SQL (REPLACE queries) without actually executing it.

You may audit the queries before executing them to sync the data between master and slave. As you can see, only the records missing on the slave are printed. Once you are happy with the results, you can substitute --print with --execute to do the actual synchronization.

As a reminder, these queries are always executed on the master, as this is the only safe way to make the changes on the slave. On the master they are no-op changes, since those records already exist there, but they then flow to the slave via the replication stream and bring it in sync with the master.

If you find lots of differences on your slave server, it may lag during the synchronization of those changes. As I mentioned earlier, you can use the --print option to go through the queries that are going to be executed to sync the slave with the master. I found this post useful if you see a huge difference in a table between master and slave(s).

Note, you may use the --dry-run option initially, which only analyzes and prints information about the sync algorithm and then exits. It shows verbose output and doesn’t make any changes. The --dry-run parameter basically instructs pt-table-sync not to actually do the sync, but just to perform some checks.

Let me present another replication topology, where the master has two slaves: slave2 runs on the non-default port 3307, while the master and slave1 run on port 3306. Further, slave2 is out of sync with the master, and I will show you how to bring it back in sync.

mysql-master> SELECT * FROM dummy;
+----+------+
| id | name |
+----+------+
|  1 | a    |
|  2 | b    |
|  3 | c    |
|  4 | d    |
|  5 | e    |
+----+------+
5 rows in set (0.00 sec)
mysql-slave1> SELECT * FROM test.dummy;
+----+------+
| id | name |
+----+------+
|  1 | a    |
|  2 | b    |
|  3 | c    |
|  4 | d    |
|  5 | e    |
+----+------+
5 rows in set (0.00 sec)
mysql-slave2> SELECT * FROM test.dummy;
+----+------+
| id | name |
+----+------+
|  1 | a    |
|  2 | b    |
|  3 | c    |
+----+------+

Let’s run the pt-table-checksum tool on the master database server.

[root@master]# pt-table-checksum --replicate percona.checksums --ignore-databases=mysql h=192.168.0.130,u=checksum_user,p=checksum_password --recursion-method=dsn=D=percona,t=dsns
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
07-23T13:57:39      0      0        2       1       0   0.310 percona.dsns
07-23T13:57:39      0      1        5       1       0   0.036 test.dummy

I used the --recursion-method parameter this time, which tells pt-table-checksum how to find slaves in the replication stream; it’s pretty useful when your servers run on a non-standard port, i.e. other than 3306. I created the dsns table under the percona database with the following entries; a sketch of the table definition follows the listing below. You may also find the dsns table structure in the documentation.

mysql> SELECT * FROM dsns;
+----+-----------+------------------------------------------------------------+
| id | parent_id | dsn                                                        |
+----+-----------+------------------------------------------------------------+
|  1 |         1 | h=192.168.0.134,u=checksum_user,p=checksum_password,P=3306 |
|  2 |         2 | h=192.168.0.132,u=checksum_user,p=checksum_password,P=3307 |
+----+-----------+------------------------------------------------------------+
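For reference, here is a sketch of that table’s definition, following the structure given in the pt-table-checksum documentation, together with the two entries above:

CREATE TABLE percona.dsns (
  id        INT NOT NULL AUTO_INCREMENT,
  parent_id INT DEFAULT NULL,
  dsn       VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
);

-- One row per slave; P= carries the non-default port for slave2.
INSERT INTO percona.dsns (dsn) VALUES
  ('h=192.168.0.134,u=checksum_user,p=checksum_password,P=3306'),
  ('h=192.168.0.132,u=checksum_user,p=checksum_password,P=3307');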

Next I ran the pt-table-checksum command below to identify which slave server has differences in the test.dummy table.

[root@master]# pt-table-checksum --replicate=percona.checksums --replicate-check-only --ignore-databases=mysql h=192.168.0.130,u=checksum_user,p=checksum_password --recursion-method=dsn=D=percona,t=dsns
Differences on slave2
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
test.dummy 1 -2 1

This shows that slave2 has different data in the test.dummy table compared to the master. Now let’s run the pt-table-sync tool to sync those differences and make slave2 identical to the master.

[root@slave2]# ./pt-table-sync --print --replicate=percona.checksums --sync-to-master h=192.168.0.132,u=checksum_user,p=checksum_password
REPLACE INTO `test`.`dummy`(`id`, `name`) VALUES ('4', 'd') /*percona-toolkit src_db:test src_tbl:dummy src_dsn:P=3306,h=192.168.0.130,p=...,u=checksum dst_db:test dst_tbl:dummy dst_dsn:h=192.168.0.132,p=...,u=checksum lock:1 transaction:1 changing_src:percona.checksums replicate:percona.checksums bidirectional:0 pid:1514 user:root host:slave2*/;
REPLACE INTO `test`.`dummy`(`id`, `name`) VALUES ('5', 'e') /*percona-toolkit src_db:test src_tbl:dummy src_dsn:P=3306,h=192.168.0.130,p=...,u=checksum dst_db:test dst_tbl:dummy dst_dsn:h=192.168.0.132,p=...,u=checksum lock:1 transaction:1 changing_src:percona.checksums replicate:percona.checksums bidirectional:0 pid:1514 user:root host:slave2*/;

It shows that 2 rows are different on slave2. Substituting --print with --execute synchronized the differences on slave2, and re-running the pt-table-checksum tool showed no more differences.

Conclusion:
pt-table-checksum and pt-table-sync are the finest tools from Percona Toolkit for validating data between master and slave(s). With the help of these tools you can easily identify data drift and fix it. I covered a couple of replication topologies above, showing how to check replication consistency and how to fix it in case of data drift. You may script the pt-table-checksum / pt-table-sync steps and run the checksum script from cron to periodically check data consistency within the replication stream.

This procedure is only safe for a single-level master-slave(s) hierarchy. I will discuss the procedure for other topologies in future posts - i.e. I will describe more complex scenarios on how to use these tools in chain replication (master -> slave1 -> slave2) and in a Percona XtraDB Cluster setup.

The post MySQL replication primer with pt-table-checksum and pt-table-sync appeared first on MySQL Performance Blog.

by Muhammad Irfan at August 12, 2015 07:00 AM

August 11, 2015

Peter Zaitsev

MySQL QA Episode 12: My server is crashing… Now what? For customers or users experiencing a crash

My server is crashing… Now what?

This special episode in the MySQL QA Series is for customers or users experiencing a crash.

  1. A crash?
    1. Cheat sheet: https://goo.gl/rrmB9i
    2. Server install & crash. Note this is a demonstration: do not do this on a production server!
      sudo yum install -y http://www.percona.com/downloads/percona-release/redhat/0.1-3/percona-release-0.1-3.noarch.rpm
      sudo yum install -y Percona-Server-client-56 Percona-Server-server-56
      sudo service mysql start
  2. Gimme Stacks!
    1. Debug info packages (can be executed on a production system, but do match your 5.5, 5.6 or 5.7 version correctly)
      sudo yum install -y Percona-Server-56-debuginfo
  3. Testcase?

Full-screen viewing @ 720p resolution recommended.

Also checkout the full MySQL QA Series!

The post MySQL QA Episode 12: My server is crashing… Now what? For customers or users experiencing a crash appeared first on MySQL Performance Blog.

by Roel Van de Paar at August 11, 2015 10:00 AM

August 10, 2015

Colin Charles

LinuxCon North America in Seattle

I’m excited to be at LinuxCon North America in Seattle next week (August 17-19 2015). I’ve spoken at many LinuxCon events, and this one won’t be any different. Part of the appeal of the conference is being able to visit a new place every year.

MariaDB Corporation will have a booth, so you’ll always be able to see friendly Rod Allen camped there. In between talks and meetings, there will also be Max Mether and quite possibly all the other folk that live in Seattle (Kolbe Kegel, Patrick Crews, Gerry Narvaja).

For those in the database space, don’t forget to come attend some of our talks (represented by MariaDB Corporation and Oracle Corporation):

  1. MariaDB: The New MySQL is Five Years Old & Everywhere by Colin Charles
  2. MySQL High Availability in 2015 by Colin Charles
  3. Handling large MySQL and MariaDB farms with MaxScale by Max Mether
  4. The Proper Care and Feeding of a MySQL Database for a Linux Administrator by Dave Stokes
  5. MySQL Security in a Cloudy World by Dave Stokes

See you in Seattle soon!

by Colin Charles at August 10, 2015 09:00 AM

Jean-Jerome Schmidt

Become a MySQL DBA blog series - Configuration Tuning for Performance

A database server needs CPU, memory, disk and network in order to function. Understanding these resources is important for a DBA, as any resource that is weak or overloaded can become a limiting factor and cause the database server to perform poorly. A main task of the DBA is to tune operating system and database configurations and avoid overutilization or underutilization of the available resources.

In this blog post, we’ll discuss some of the settings that are most often tweaked and which can bring you significant improvement in the performance. We will also cover some of the variables which are frequently modified even though they should not. Performance tuning is not easy, but you can go a surprisingly long way with a few basic guidelines.
 
This is the eighth installment in the ‘Become a MySQL DBA’ blog series. Our previous posts in the DBA series include Live Migration using MySQL Replication, Database Upgrades, Replication Topology Changes, Schema Changes, High Availability, Backup & Restore, Monitoring & Trending.

Performance tuning - a continuous process

Installing MySQL is usually the first step in the process of tuning both OS and database configurations. This is a never-ending story as a database is a dynamic system. Your MySQL database can be CPU-bound at first, as you have plenty of memory and little data. With time, though, it may change and disk access may become more frequent. As you can imagine, the configuration of a server where I/O is the main concern will look different to that of a server where all data fits in memory. Additionally, your query mix may also change in time and as such, access patterns or utilization of the features available in MySQL (like adaptive hash index), can change with it.

What’s also important to keep in mind is that, most of the time, tweaks in the MySQL configuration will not give you a significant difference in performance. There are a couple of exceptions, but you should not expect anything like a 10x improvement. Adding a correct index may help you much more than tweaking your my.cnf. 

The tuning process

Let’s start with a description of the tuning process.

To begin, you would need a deterministic environment to test your changes and observe results. The environment should be as close to production as possible. By that we mean both data and traffic. For safety reasons you should not implement and test changes directly on the production systems. It’s also much easier to make changes in the testing environment - some of the tweaks require MySQL to restart - this is not something you can do on production.

Another thing to keep in mind - when you make changes, it is very easy to lose track of which change affected your workload in a particular way. People tend to take shortcuts and make multiple tweaks at the same time - it’s not the best way. After implementing multiple changes at the same time, you do not really know what impact each of them had. The result of all the changes is known but it’s not unlikely that you’d be better off implementing only one of the five changes you made.

After each config change, you also want to ensure your system is in the same state - restore the data set to a known (and always the same) position - e.g., you can restore your data from a given backup. Then you need to run exactly the same query mix to reproduce the production workload - this is the only way to ensure that your results are meaningful and that you can reproduce them. What’s also important is to isolate the testing environment from the rest of your infrastructure so your results won’t be affected by external factors. This means that you do not want to use VMs located on a shared host as other VMs may impact your tests. The same is true for storage - shared SAN may cause some unexpected results.

OS tuning

You’ll want to check operating system settings related to the way the memory and filesystem cache are handled. In general, we want to keep both vm.dirty_ratio and vm.dirty_background_ratio low.

vm.dirty_background_ratio is the percentage of system memory that can be used to cache modified (“dirty”) pages before the background flush process kicks in. A higher value means more dirty pages accumulate before flushing starts, so more work has to be done to clean the cache.

vm.dirty_ratio, on the other hand, is a hard limit of the memory that can be used to cache dirty pages. It can be reached if, due to high write activity, the background process cannot flush data fast enough to keep up with new modifications. Once vm.dirty_ratio is reached, all I/O activity is locked until dirty pages have been written to disk. Default setting here is usually 40% (it may be different in your distribution), which is pretty high for any host with large memory. Let’s say that for a 128GB instance, it amounts to ~51GB which may lock your I/O for a significant amount of time, even if you are using fast SSD’s.

In general, we want to see both of those variables set to low numbers, 5 - 10%, as we want background flushing to kick in early on and to keep any stalls as short as possible.

Another important system variable to tune is vm.swappiness. When using MySQL we do not want to use swap unless in dire need - swapping the InnoDB buffer pool out to disk defeats the point of having an in-memory buffer pool. On the other hand, if the alternative is the OOM killer terminating MySQL, we’d prefer to swap. Historically, such behavior could be achieved by setting vm.swappiness to 0. Since kernel 3.5-rc1 (and this change has been backported to older kernels in some distros - CentOS for example), this behavior has changed and setting it to 0 prevents swapping entirely. Therefore it’s recommended to set vm.swappiness to 1, to allow some swapping to happen should it be the only way to keep MySQL up. Sure, it will slow down the system, but an OOM kill of MySQL is very harsh. It may result in data loss (if you do not run with full durability settings) or, in the best case scenario, trigger InnoDB recovery, a process which may take some time to complete.

Another memory-related setting - ensure that you have NUMA interleave set to all. You can do it by modifying the startup script to start MySQL via:

numactl --interleave=all $command

This setting balances memory allocation between the NUMA nodes and minimizes the chance that one of the nodes runs out of memory.

Memory allocators can also have a significant impact on MySQL performance. This is a larger topic and we’ll only scratch the surface here. You can choose different memory allocators to use with MySQL. Their performance differs between versions and between workloads, so the exact choice should be made only after you have performed detailed tests to confirm which one works best in your environment. The most common choices you’ll be looking into are the default glibc malloc, tcmalloc and jemalloc. You can add new allocators by installing a new package (for jemalloc and tcmalloc) and then using either LD_PRELOAD (i.e. export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4.1.2") or the malloc-lib variable in the [mysqld_safe] section of my.cnf.

Next, you’d want to take a look at disk schedulers. CFQ, which is usually the default, is tuned for a desktop workload. It doesn’t work well for a database workload. Most of the time you’ll see better results if you change it to noop or deadline. There’s little difference between those two schedulers; we found that noop is slightly better for SAN-based storage (a SAN is usually better at handling the workload, as it knows more about the underlying hardware and what’s actually stored in its cache than the operating system does). The differences are minimal, though, and most of the time you won’t go wrong with either option. Again, testing may help you squeeze a bit more from your system.

If we are talking about disks, the best choice for a filesystem will most often be either EXT4 or XFS - this has changed a couple of times in the past, and if you’d like to get the most out of your I/O subsystem, you’d probably have to do some testing on your setup. No matter which filesystem you use, though, you should mount the MySQL volume with the noatime and nodiratime options - the fewer writes to metadata, the lower the overall overhead.

MySQL configuration tuning

MySQL configuration tuning is a topic for a whole book; it’s not possible to cover it in a single blog post. We’ll try to mention some of the more important variables here.

InnoDB Buffer Pool

Let’s start with something rather obvious - the InnoDB buffer pool. We still see, from time to time (although it becomes less and less frequent, which is really nice), that it’s not set up correctly. Defaults are way too conservative. What is the buffer pool and why is it so important? The buffer pool is memory used by InnoDB to cache data. It is used for caching both reads and writes - every page that has been modified had to be loaded into the buffer pool first. It then becomes a dirty page - a page that has been modified and is not yet flushed to the tablespace. As you can imagine, such a buffer is really important for a database to perform correctly. The worse the “memory/disk” ratio is, the more I/O-bound your workload will be. I/O-bound workloads tend to be slow.

You may have heard the rule of thumb to set the InnoDB buffer pool to 80% of the total memory in the system. It worked in times when 8GB was a huge amount of memory, but that is not true nowadays. When calculating the InnoDB buffer pool size, you need to take into consideration the memory requirements of the rest of MySQL (assuming that MySQL is the only application running on the server). We are talking here, for example, about all those per-connection or even per-query buffers like the join buffer, or the in-memory temporary table max size. You also need to take into consideration the maximum allowed connections - more connections mean more memory usage.

For a MySQL database server with 24 to 32 cores and 128GB memory, handling up to 20 - 30 simultaneously running connections and up to a few hundred simultaneously connected clients, we’d say that reserving 10 - 15GB of memory for everything other than the buffer pool should be enough; if you want to stay on the safe side, 20GB should be plenty. In general, unless you know the behaviour of your database, finding the ideal buffer pool size is somewhat a process of trial and error. At the moment of writing, the InnoDB buffer pool size is not a dynamic variable, so changes require a restart. Therefore it is safer to err on the side of “too small”. This will change with MySQL 5.7, as Oracle introduced a dynamically resizable buffer pool, something which will make tuning much easier.
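One rough way to sanity-check the size you picked is the buffer pool miss ratio, computed from two status counters. A sketch, assuming the counters are read from information_schema.GLOBAL_STATUS (they are cumulative since startup, so sample them under a representative workload):

-- Innodb_buffer_pool_reads = logical reads that had to go to disk;
-- Innodb_buffer_pool_read_requests = all logical read requests.
SELECT
  (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads')
  /
  (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')
  AS miss_ratio;  -- a persistently high value suggests the pool is too small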

MySQL uses many buffers other than the InnoDB buffer pool - they are controlled by variables: join_buffer_size, sort_buffer_size, read_buffer_size, read_rnd_buffer_size. These buffers are allocated per-session (with an exception of the join buffer, which is allocated per JOIN). We’ve seen MySQL with those buffers set to hundreds of megabytes - it’s somewhat natural that by increasing join_buffer_size, you’d expect your JOINs to perform faster.

By default those variables have rather small values and it actually makes sense - we’ve seen that low settings, up to 256K, can be significantly faster than larger values like for example 4M. It is hard to tell the exact reason for this behavior, most likely there are many of them. One, definitely, is the fact that Linux changes the way memory is allocated. Up to 256KB it uses malloc(). For larger chunks of memory - mmap(). What’s important to remember is that when it comes to those variables, any change has to be backed by benchmarks that confirm the new setting is indeed the correct one. Otherwise you may be reducing your performance instead of increasing it.
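Since all four variables are dynamic at the session level, a safe way to experiment is to override them for a single benchmark connection rather than globally. A sketch, using the 256KB threshold mentioned above as the value:

-- Session-only overrides: other connections keep the server defaults.
SET SESSION join_buffer_size     = 256 * 1024;
SET SESSION sort_buffer_size     = 256 * 1024;
SET SESSION read_buffer_size     = 256 * 1024;
SET SESSION read_rnd_buffer_size = 256 * 1024;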

InnoDB Durability

Another variable with a significant impact on MySQL performance is innodb_flush_log_at_trx_commit. It governs to what extent InnoDB is durable. The default (1) ensures your data is safe even if the database server gets killed - under any circumstances there will be no data loss. The other settings relax this: with 2, you may lose up to 1s of transactions if the whole server crashes; with 0, you may lose up to 1s of transactions if just the mysqld process gets killed.

Full durability is obviously a great thing to have, but it comes at a significant price - I/O load is much higher because a flush operation has to happen after each commit. Therefore, under some circumstances, it’s quite common to reduce durability and accept the risk of data loss in certain conditions. That’s true for master - multiple slaves setups where, usually, it’s perfectly fine to have one slave in the rebuild process after a crash because the rest of them can easily handle the workload. The same is true for Galera clusters - the whole cluster works as a single instance, so even if one node crashes and somehow loses its data, it can still resync from another node in the cluster - it’s not worth paying the high price of full durability (especially as writes in Galera are already more expensive than in regular MySQL) when you can easily recover from such situations. 
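The variable is dynamic, so relaxed durability can be tried without a restart. A sketch for a rebuildable slave or Galera node, under the trade-offs just described:

-- Accept up to ~1s of lost transactions on an OS crash in exchange
-- for one log flush per second instead of one per commit.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;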

I/O-related settings

Other variables which may have significant impact on some workloads are innodb_io_capacity, innodb_io_capacity_max and innodb_lru_scan_depth. Those variables define the number of disk operations that can be done by InnoDB’s background threads to, e.g., flush dirty pages from the InnoDB buffer pool. Default settings are conservative, which is fine most of the time. If your workload is very write-intensive, you may want to tune those settings and see if you are not preventing InnoDB from using your I/O subsystem fully. This is especially true if you have fast storage: SSD or PCIe SSD card.
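All three variables are dynamic. The numbers below are purely hypothetical starting points for SSD-backed storage and must be validated against your own hardware:

SET GLOBAL innodb_io_capacity     = 2000;  -- sustained IOPS budget for background work
SET GLOBAL innodb_io_capacity_max = 4000;  -- ceiling for emergency flushing
SET GLOBAL innodb_lru_scan_depth  = 2000;  -- pages scanned per buffer pool instance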

When it comes to disks, innodb_flush_method is another setting you may want to look at. We’ve seen visible performance gains by switching this setting from the default, fdatasync, to O_DIRECT. Such a gain is clearly visible on setups with a hardware RAID controller backed by a BBU. On the other hand, when it comes to EBS volumes, we’ve seen better results using O_DSYNC. Benchmarking here is very important to understand which setting is better in your particular case.

InnoDB Redo Logs

The size of InnoDB’s redo logs is also something you may want to take a look at. It is governed by innodb_log_file_size and innodb_log_files_in_group. By default we have two logs in a group, each ~50MB in size. Those logs are used to store write transactions and they are written sequentially. The main problem here is that MySQL must not run out of space in the logs; if the logs are almost full, it will have to stop all write activity and focus on flushing the data to the tablespaces. Of course, this is very bad for the application as no writes can happen during this time. This is one of the reasons why the InnoDB I/O settings we discussed above are very important. We can also help by increasing the redo log size through innodb_log_file_size. The rule of thumb is to set the logs large enough to cover at least 1h of writes. We discussed InnoDB I/O settings in more detail in an earlier post, where we also covered a method for calculating InnoDB redo log size.
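A quick sketch of that sizing method: sample the cumulative log-write counter twice over a known interval and extrapolate to an hour:

-- Take two samples, e.g. 60 seconds apart.
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';
-- (wait 60 seconds, then sample again)
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';

-- bytes_per_hour       = (sample2 - sample1) * 3600 / 60
-- innodb_log_file_size ~ bytes_per_hour / innodb_log_files_in_group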

Query Cache

The MySQL query cache is also often “tuned” - this cache stores hashes of SELECT statements and their results. There are two problems with it. First, the cache may be frequently flushed - if any DML is executed against a given table, all results related to that table are removed from the query cache. This seriously impacts the usefulness of the MySQL query cache. The second problem is that the query cache is protected by a mutex and access is serialized. This is a significant drawback and limitation for any workload with higher concurrency. Therefore it is strongly recommended to “tune” the MySQL query cache by disabling it altogether. You can do that by setting query_cache_type to OFF. It’s true that in some cases it can be of some use, but most of the time it’s not. Instead of relying on the MySQL query cache, you can also leverage external systems like Memcached or Redis to cache data.
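Disabling it takes two settings (a sketch; note that if the server was started with query_cache_type=0, some versions refuse runtime changes, so setting it in my.cnf is the reliable route):

SET GLOBAL query_cache_type = OFF;  -- stop using the cache
SET GLOBAL query_cache_size = 0;    -- release the memory it held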

Internal contention handling

Another set of settings that you may want to look at are variables that control how many instances/partitions of a given structure that MySQL should create. We are talking here about the variables: innodb_buffer_pool_instances, table_open_cache_instances, metadata_locks_hash_instances and innodb_adaptive_hash_index_partitions. Those options were introduced when it became clear that, for example, a single buffer pool or single adaptive hash index can become a point of contention for workloads with high concurrency. Once you find out that one of those structures becomes a pain point (we discussed how you can catch these situations in an earlier blog post) you’ll want to adjust the variables. Unfortunately, there are no rules of thumb here. It’s suggested that a single buffer pool instance should be at least 2GB in size, so for smaller buffer pools you may want to stick to this limit. In case of the other variables, if we are talking about issues with contentions, you will probably increase the number of instances/partitions of those data structures, but there are no rules on how to do that - you need to observe your workload and decide at which point contention is no longer an issue.

Other settings

There are a few other settings you may want to look at; some are best applied at setup time, while others can be changed dynamically. These settings won’t have a large impact on performance (sometimes the impact may even be negative), but it is still important to keep them in mind. 

max_connections - on one hand you want to keep it high enough to handle any incoming connections. On the other hand, you don’t want to keep it too high, as most servers are not able to handle hundreds or more connections running simultaneously. One way around this problem is to implement connection pooling on the application side, or to use a load balancer like HAProxy to throttle the load.

log_bin - if you are using MySQL replication, you need to have binary logs enabled. Even if you do not use replication, it’s very handy to keep them enabled, as they can be used for point-in-time recovery.

skip_name_resolve - this variable decides whether a DNS lookup is performed on the host that is the source of an incoming connection. With lookups enabled (skip_name_resolve off), FQDNs can be used as the host in MySQL grants. With them disabled, only users defined with IP addresses as the host will work. The problem with having DNS lookups enabled is that they can introduce extra latency. DNS servers can also stop responding (because of a crash or network issues), and in such a case MySQL won’t be able to accept any new connections.

innodb_file_per_table - this variable decides if InnoDB tables are to be created in a separate tablespace (when set to 1) or in the shared tablespace (when set to 0). It’s much easier to manage MySQL when each of the InnoDB tables has a separate tablespace. For example, with separate tablespaces you can easily reclaim disk space by dropping the table or partition. With shared tablespace it doesn’t work - the only way of reclaiming the disk space is to dump the data, clean the MySQL data directory and then reload the data again. Obviously, this is not convenient.
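innodb_file_per_table is dynamic and affects only tables created (or rebuilt) afterwards; a sketch, with a hypothetical table name:

SET GLOBAL innodb_file_per_table = 1;
-- Rebuild an existing table into its own .ibd file (hypothetical name):
ALTER TABLE mydb.big_table ENGINE=InnoDB;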

That is it for now. As we mentioned at the beginning, tweaking those settings might not make your MySQL database blazing fast - you are more likely to speed it up by tuning your queries. But they should still have visible impact on the overall performance. Good luck with the tuning work!

by Severalnines at August 10, 2015 12:00 AM

August 07, 2015

Chris Calender

MariaDB 10.1.6 Overview and Highlights

MariaDB 10.1.6 was recently released, and is available for download here:

https://downloads.mariadb.org/mariadb/10.1.6/

This is the 4th beta, and 7th overall, release of MariaDB 10.1. There were not many major changes in this release, but a few notable items, as well as many overall bugs fixed (I counted 156, down ~50% from 10.1.5).

Since it’s beta, I’ll only cover the major changes and additions, and omit covering general bug fixes (feel free to browse them all here).

To me, these are the highlights:

  • RESET MASTER is extended with a TO # clause, which allows one to specify the number of the first binary log. (MDEV-8469)
  • Added support for binlog_row_image=minimal for compatibility with MySQL.
  • New system variables: log_bin_basename, log_bin_index and relay_log_basename.
  • New system variable: innodb_buf_dump_status_frequency, for determining how often the buffer pool dump status should be printed in the logs.
  • New status variables: Com_create_temporary_table and Com_drop_temporary_table for tracking the number of CREATE/DROP TEMPORARY TABLE statements.
  • Connect updated to 1.04.0001.
  • Mroonga updated to 5.04 (earlier versions of Mroonga did not work in 10.1).

Of course, it goes without saying that you should not use this on production systems since it is still only beta. However, I definitely recommend installing it on a test server and testing it out. And if you happen to be running a previous version of 10.1, then you should definitely upgrade to this latest release.

You can read more about the 10.1.6 release here:

https://mariadb.com/kb/en/mariadb-1016-release-notes/

And if interested, you can review the full list of changes in 10.1.6 (changelogs) here:

https://mariadb.com/kb/en/mariadb-1016-changelog/

Hope this helps.

by chris at August 07, 2015 10:14 PM

MySQL 5.6.26 Overview and Highlights

MySQL 5.6.26 was recently released (it is the latest MySQL 5.6, is GA), and is available for download here.

For this release, there are 3 “Functionality Added or Changed” items, 1 “Security Fix”, and 36 other bug fixes.

Out of those other 36 bugs, 13 are InnoDB, 1 Partitioning, 3 Replication, and 19 misc. (including 3 potentially crashing bug fixes, and 1 performance-related fix). Here are the ones of note:

  • Functionality Added/Changed: Replication: When using a multi-threaded slave, each worker thread has its own queue of transactions to process. In previous MySQL versions, STOP SLAVE waited for all workers to process their entire queue. This logic has been changed so that STOP SLAVE first finds the newest transaction that was committed by any worker thread. Then, it waits for all workers to complete transactions older than that. Newer transactions are not processed. The new logic allows STOP SLAVE to complete faster in case some worker queues contain multiple transactions. (Bug #75525)
  • Functionality Added/Changed: Previously, the max_digest_length system variable controlled the maximum digest length for all server functions that computed statement digests. However, whereas the Performance Schema may need to maintain many digest values, other server functions such as MySQL Enterprise Firewall need only one digest per session. Increasing the max_digest_length value has little impact on total memory requirements for those functions, but can increase Performance Schema memory requirements significantly. To enable configuring digest length separately for the Performance Schema, its digest length is now controlled by the new performance_schema_max_digest_length system variable.
  • Functionality Added/Changed: Previously, changes to the validate_password plugin dictionary file (named by the validate_password_dictionary_file system variable) while the server was running required a restart for the server to recognize the changes. Now validate_password_dictionary_file can be set at runtime and assigning a value causes the named file to be read without a restart. In addition, two new status variables are available. validate_password_dictionary_file_last_parsed indicates when the dictionary file was last read, and validate_password_dictionary_file_words_count indicates how many words it contains. (Bug #66697)
  • Security-related: Due to the LogJam issue (https://weakdh.org/), OpenSSL has changed the Diffie-Hellman key length parameters for openssl-1.0.1n and up. OpenSSL has provided a detailed explanation at http://openssl.org/news/secadv_20150611.txt. To adopt this change in MySQL, the key length used in vio/viosslfactories.c for creating Diffie-Hellman keys has been increased from 512 to 2,048 bits. (Bug #77275)
  • InnoDB: Importing a tablespace with a full-text index resulted in an assertion when attempting to rebuild the index.
  • InnoDB: Opening a foreign key-referenced table with foreign_key_checks enabled resulted in an error when the table or database name contained special characters.
  • InnoDB: The page_zip_verify_checksum function returned false for a valid compressed page.
  • InnoDB: A failure to load a change buffer bitmap page during a concurrent delete tablespace operation caused a server exit.
  • InnoDB: After dropping a full-text search index, the hidden FTS_DOC_ID and FTS_DOC_ID_INDEX columns prevented online DDL operations. (Bug #76012)
  • InnoDB: An index record was not found on rollback due to inconsistencies in the purge_node_t structure. (Bug #70214)
  • Partitioning: In certain cases, ALTER TABLE … REBUILD PARTITION was not handled correctly when executed on a locked table.
  • Replication: If flushing the cache to the binary log failed, for example due to a disk problem, the error was not detected by the binary log group commit logic. This could cause inconsistencies between the master and the slave. The fix uses the binlog_error_action variable to decide how to handle this situation. If binlog_error_action=ABORT_SERVER, then the server aborts after informing the client with an ER_BINLOGGING_IMPOSSIBLE error. If binlog_error_action=IGNORE_ERROR, then the error is ignored and binary logging is disabled until the server is restarted again. The same is mentioned in the error log file, and the transaction is committed inside the storage engine without being added to the binary log. (Bug #76795)
  • Replication: When using GTIDs, a multi-threaded slave which had relay_log_recovery=1 and that stopped unexpectedly could encounter a relay-log-recovery cannot be executed when the slave was stopped with an error or killed in MTS mode error upon restart. The fix ensures that the relay log recovery process checks if GTIDs are in use or not. If GTIDs are in use, the multi-threaded slave recovery process uses the GTID protocol to fill any unprocessed transactions. (Bug #73397)
  • Replication: When two slaves with the same server_uuid were configured to replicate from a single master, the I/O thread of the slaves kept reconnecting and generating new relay log files without new content. In such a situation, the master now generates an error which is sent to the slave. By receiving this error from the master, the slave I/O thread does not try to reconnect, avoiding this problem. (Bug #72581)
  • Crashing Bug: Incorrect cost calculation for the semi-join Duplicate Weedout strategy could result in a server exit.
  • Crashing Bug: For large values of max_digest_length, the Performance Schema could encounter an overflow error when computing memory requirements, resulting in a server exit.
  • Crashing Bug: GROUP BY or ORDER BY on a CHAR(0) NOT NULL column could lead to a server exit.
  • Performance-related: When choosing join order, the optimizer could incorrectly calculate the cost of a table scan and choose a table scan over a more efficient eq_ref join. (Bug #71584)

Conclusions:

So while there were no major changes, and not too many overall bug fixes, the security fix could be an issue if you run the latest RHEL/CentOS with SSL connections + a DHE SSL cipher specified with --ssl-cipher=DHE-RSA-… Also, some of those InnoDB bugs are nasty, especially the fulltext bugs, thus if you use InnoDB’s fulltext, I’d recommend planning for an upgrade.

The full 5.6.26 changelogs can be viewed here (which has more details about all of the bugs listed above):

http://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-26.html

Hope this helps. :)

by chris at August 07, 2015 09:16 PM

MySQL 5.5.45 Overview and Highlights

MySQL 5.5.45 was recently released (it is the latest MySQL 5.5, is GA), and is available for download here:

http://dev.mysql.com/downloads/mysql/5.5.html

This release, similar to the last 5.5 release, is mostly uneventful.

There were 0 “Functionality Added or Changed” items this time, 1 “Security Fix”, and just 9 bugs overall fixed.

Out of the 9 bugs, there were 3 InnoDB bugs, 1 security-related bug, and 1 potential crashing bug. Here are the ones worth noting:

  • InnoDB: An index record was not found on rollback due to inconsistencies in the purge_node_t structure.
  • InnoDB: An assertion was raised when InnoDB attempted to dereference a NULL foreign key object.
  • InnoDB: On Unix-like platforms, os_file_create_simple_no_error_handling_func and os_file_create_func opened files in different modes when innodb_flush_method was set to O_DIRECT. (Bug #76627)
  • Security-related: Due to the LogJam issue (https://weakdh.org/), OpenSSL has changed the Diffie-Hellman key length parameters for openssl-1.0.1n and up. OpenSSL has provided a detailed explanation at http://openssl.org/news/secadv_20150611.txt. To adopt this change in MySQL, the key length used in vio/viosslfactories.c for creating Diffie-Hellman keys has been increased from 512 to 2,048 bits. (Bug #77275)
  • Crashing Bug: GROUP BY or ORDER BY on a CHAR(0) NOT NULL column could lead to a server exit.

I don’t think I’d call any of these urgent for everyone (unless you run the latest RHEL/CentOS with SSL connections + a DHE SSL cipher specified with --ssl-cipher=DHE-RSA-…), but if you are running 5.5, especially if not a very recent 5.5, you should consider upgrading.

For reference, the full 5.5.45 changelog can be viewed here:

http://dev.mysql.com/doc/relnotes/mysql/5.5/en/news-5-5-45.html

Hope this helps.

by chris at August 07, 2015 07:55 PM

MariaDB Foundation

MariaDB: InnoDB foreign key constraint errors

Introduction

A foreign key is a field (or collection of fields) in one table that uniquely identifies a row of another table. The table containing the foreign key is called the child table, and the table containing the candidate key is called the referenced or parent table. The purpose of the foreign key is to identify a particular row of the referenced table. Therefore, it is required that the foreign key is equal to the candidate key in some row of the parent table, or else has no value (the NULL value). This is called a referential integrity constraint between the two tables. Because violations of these constraints can be the source of many database problems, most database management systems provide mechanisms to ensure that every non-null foreign key corresponds to a row of the referenced table. Consider the following simple example:

create table parent (
    id int not null primary key,
    name char(80)
) engine=innodb;

create table child (
    id int not null,
    name char(80),
    parent_id int, 
    foreign key(parent_id) references parent(id)
) engine=innodb;

As far as I know, only a handful of storage engines for MariaDB and/or MySQL support foreign keys, InnoDB being by far the most widely used of them.

MariaDB foreign key syntax is documented at https://mariadb.com/kb/en/mariadb/foreign-keys/ (and MySQL at http://dev.mysql.com/doc/refman/5.5/en/innodb-foreign-key-constraints.html). While most of the syntax is parsed and checked when the CREATE TABLE or ALTER TABLE statement is parsed, there are still several error cases that can happen inside InnoDB. Yes, InnoDB has its own internal foreign key constraint parser (the function dict_create_foreign_constraints_low() in dict0dict.c).

However, the error messages shown by CREATE or ALTER TABLE and SHOW WARNINGS in versions of MariaDB prior to 5.5.45 and 10.0.21 are not very informative or clear. There are additional error messages in the SHOW ENGINE INNODB STATUS output, which help, but they are not an ideal solution. In this blog I’ll present a few of the most frequent error cases using MariaDB 5.5.44, and show how these error messages are improved in MariaDB 5.5.45 and 10.0.21. I will use the default InnoDB (i.e. XtraDB), but innodb_plugin works very similarly.

Constraint name not unique

Foreign key constraint names must be unique within a database. However, the error message produced when they are not is vague and leaves a lot to guess at:

--------------
CREATE TABLE t1 (
  id int(11) NOT NULL PRIMARY KEY,
  a int(11) NOT NULL,
  b int(11) NOT NULL,
  c int not null,
  CONSTRAINT test FOREIGN KEY (b) REFERENCES t1 (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
--------------
Query OK, 0 rows affected (0.45 sec)

--------------
CREATE TABLE t2 (
id int(11) NOT NULL PRIMARY KEY,
a int(11) NOT NULL,
b int(11) NOT NULL,
c int not null,
CONSTRAINT mytest FOREIGN KEY (c) REFERENCES t1(id),
CONSTRAINT test FOREIGN KEY (b) REFERENCES t2 (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
--------------

ERROR 1005 (HY000): Can't create table `test`.`t2` (errno: 121 "Duplicate key on write or update")
--------------
show warnings
--------------

+---------+------+---------------------------------------------------------------------------------+
| Level   | Code | Message                                                                         |
+---------+------+---------------------------------------------------------------------------------+
| Error   | 1005 | Can't create table `test`.`t2` (errno: 121 "Duplicate key on write or update")  |
| Warning | 1022 | Can't write; duplicate key in table 't2'                                        |
+---------+------+---------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

These messages are not very helpful, because the statement contains two foreign key constraints and the messages do not say which one is the problem. Looking into SHOW ENGINE INNODB STATUS we get a better message:

show engine innodb status
--------------
------------------------
LATEST FOREIGN KEY ERROR
------------------------
2015-07-30 12:37:48 7f44a1111700 Error in foreign key constraint creation for table `test`.`t2`.
A foreign key constraint of name `test`.`test`
already exists. (Note that internally InnoDB adds 'databasename'
in front of the user-defined constraint name.)
Note that InnoDB's FOREIGN KEY system tables store
constraint names as case-insensitive, with the
MySQL standard latin1_swedish_ci collation. If you
create tables or databases whose names differ only in
the character case, then collisions in constraint
names can occur. Workaround: name your constraints
explicitly with unique names.

In MariaDB 5.5.45 and 10.0.21, the message is clearly improved:

CREATE TABLE t1 (
  id int(11) NOT NULL PRIMARY KEY,
  a int(11) NOT NULL,
  b int(11) NOT NULL,
  c int not null,
  CONSTRAINT test FOREIGN KEY (b) REFERENCES t1 (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
--------------

Query OK, 0 rows affected (0.14 sec)

--------------
CREATE TABLE t2 (
  id int(11) NOT NULL PRIMARY KEY,
  a int(11) NOT NULL,
  b int(11) NOT NULL,
  c int not null,
  CONSTRAINT mytest FOREIGN KEY (c) REFERENCES t1(id),
  CONSTRAINT test FOREIGN KEY (b) REFERENCES t2 (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
--------------

ERROR 1005 (HY000): Can't create table 'test.t2' (errno: 121)
--------------
show warnings
--------------

+---------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                                                                                                                                                                     |
+---------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning |  121 | Create or Alter table `test`.`t2` with foreign key constraint failed. Foreign key constraint `test/test` already exists on data dictionary. Foreign key constraint names need to be unique in database. Error in foreign key definition: CONSTRAINT `test` FOREIGN KEY (`b`) REFERENCES `test`.`t2` (`id`). |
| Error   | 1005 | Can't create table 'test.t2' (errno: 121)                                                                                                                                                                                                                                                                   |
+---------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

No index

The referenced table must have an index in which the referenced columns appear as the first columns. If it does not:

create table t1(a int, b int, key(b)) engine=innodb
--------------
Query OK, 0 rows affected (0.46 sec)

--------------
create table t2(a int, b int, constraint b foreign key (b) references t1(b), constraint a foreign key a (a) references t1(a)) engine=innodb
--------------

ERROR 1005 (HY000): Can't create table `test`.`t2` (errno: 150 "Foreign key constraint is incorrectly formed")
--------------
show warnings
--------------

+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                         |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning |  150 | Create table 'test/t2' with foreign key constraint failed. There is no index in the referenced table where the referenced columns appear as the first columns. |
| Error   | 1005 | Can't create table `test`.`t2` (errno: 150 "Foreign key constraint is incorrectly formed")                                                                     |
| Warning | 1215 | Cannot add foreign key constraint                                                                                                                              |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

Fine, but again we have no idea which of the two foreign keys caused the problem. As before, there is a better message in the SHOW ENGINE INNODB STATUS output:

LATEST FOREIGN KEY ERROR
------------------------
2015-07-30 13:44:31 7f30e1520700 Error in foreign key constraint of table test/t2:
 foreign key a (a) references t1(a)) engine=innodb:
Cannot find an index in the referenced table where the
referenced columns appear as the first columns, or column types
in the table and the referenced table do not match for constraint.
Note that the internal storage type of ENUM and SET changed in
tables created with >= InnoDB-4.1.12, and such columns in old tables
cannot be referenced by such columns in new tables.
See http://dev.mysql.com/doc/refman/5.6/en/innodb-foreign-key-constraints.html
for correct foreign key definition.

In MariaDB 5.5.45 and 10.0.21, the message is clearly improved:

create table t1(a int, b int, key(b)) engine=innodb
--------------

Query OK, 0 rows affected (0.16 sec)

--------------
create table t2(a int, b int, constraint b foreign key (b) references t1(b), constraint a foreign key a (a) references t1(a)) engine=innodb
--------------

ERROR 1005 (HY000): Can't create table 'test.t2' (errno: 150)
--------------
show warnings
--------------

+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                                                                                                |
+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning |  150 | Create  table '`test`.`t2`' with foreign key constraint failed. There is no index in the referenced table where the referenced columns appear as the first columns. Error close to  foreign key a (a) references t1(a)) engine=innodb. |
| Error   | 1005 | Can't create table 'test.t2' (errno: 150)                                                                                                                                                                                              |
+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Referenced table not found

A table that is referenced in a foreign key constraint must exist in the InnoDB data dictionary. If it does not:

create table t1 (f1 integer primary key) engine=innodb
--------------

Query OK, 0 rows affected (0.47 sec)

--------------
alter table t1 add constraint c1 foreign key (f1) references t11(f1)
--------------

ERROR 1005 (HY000): Can't create table `test`.`#sql-2612_2` (errno: 150 "Foreign key constraint is incorrectly formed")
--------------
show warnings
--------------

+---------+------+-----------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                             |
+---------+------+-----------------------------------------------------------------------------------------------------+
| Error   | 1005 | Can't create table `test`.`#sql-2612_2` (errno: 150 "Foreign key constraint is incorrectly formed") |
| Warning | 1215 | Cannot add foreign key constraint                                                                   |
+---------+------+-----------------------------------------------------------------------------------------------------+
show engine innodb status
--------------
LATEST FOREIGN KEY ERROR
------------------------
2015-07-30 13:44:34 7f30e1520700 Error in foreign key constraint of table test/#sql-2612_2:
 foreign key (f1) references t11(f1):
Cannot resolve table name close to:
(f1)

Both messages refer to an internal table name, and the foreign key error message refers to an incorrect name. In MariaDB 5.5.45 and 10.0.21, the message is clearly improved:

create table t1 (f1 integer primary key) engine=innodb
--------------

Query OK, 0 rows affected (0.11 sec)

--------------
alter table t1 add constraint c1 foreign key (f1) references t11(f1)
--------------

ERROR 1005 (HY000): Can't create table 'test.#sql-2b40_2' (errno: 150)
--------------
show warnings
--------------

+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                                    |
+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning |  150 | Alter  table `test`.`t1` with foreign key constraint failed. Referenced table `test`.`t11` not found in the data dictionary close to  foreign key (f1) references t11(f1). |
| Error   | 1005 | Can't create table 'test.#sql-2b40_2' (errno: 150)                                                                                                                         |
+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

--------------
show engine innodb status
--------------
150730 13:50:36 Error in foreign key constraint of table `test`.`t1`:
Alter  table `test`.`t1` with foreign key constraint failed. Referenced table `test`.`t11` not found in the data dictionary close to  foreign key (f1) references t11(f1).

Temporary tables

Temporary tables can’t have foreign key constraints because temporary tables are not stored in the InnoDB data dictionary.

create temporary table t2(a int, foreign key(a) references t1(a)) engine=innodb
--------------

ERROR 1005 (HY000): Can't create table `test`.`t2` (errno: 150 "Foreign key constraint is incorrectly formed")
--------------
show warnings
--------------

+---------+------+--------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                    |
+---------+------+--------------------------------------------------------------------------------------------+
| Error   | 1005 | Can't create table `test`.`t2` (errno: 150 "Foreign key constraint is incorrectly formed") |
| Warning | 1215 | Cannot add foreign key constraint                                                          |
+---------+------+--------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

--------------
show engine innodb status
--------------
LATEST FOREIGN KEY ERROR
------------------------
2015-07-30 13:44:35 7f30e1520700 Error in foreign key constraint of table tmp/#sql2612_2_1:
foreign key(a) references t1(a)) engine=innodb:
Cannot resolve table name close to:
(a)) engine=innodb

--------------
alter table t1 add foreign key(b) references t1(a)
--------------

ERROR 1005 (HY000): Can't create table `test`.`#sql-2612_2` (errno: 150 "Foreign key constraint is incorrectly formed")
--------------
show warnings
--------------

+---------+------+-----------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                             |
+---------+------+-----------------------------------------------------------------------------------------------------+
| Error   | 1005 | Can't create table `test`.`#sql-2612_2` (errno: 150 "Foreign key constraint is incorrectly formed") |
| Warning | 1215 | Cannot add foreign key constraint                                                                   |
+---------+------+-----------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

These error messages do not really help the user, because the actual reason for the error is not printed and the foreign key error references an internal table name. In MariaDB 5.5.45 and 10.0.21 this is clearly improved:

create temporary table t1(a int not null primary key, b int, key(b)) engine=innodb
--------------

Query OK, 0 rows affected (0.04 sec)

--------------
create temporary table t2(a int, foreign key(a) references t1(a)) engine=innodb
--------------

ERROR 1005 (HY000): Can't create table 'test.t2' (errno: 150)
--------------
show warnings
--------------

+---------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                                             |
+---------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning |  150 | Create  table `tmp`.`t2` with foreign key constraint failed. Referenced table `tmp`.`t1` not found in the data dictionary close to foreign key(a) references t1(a)) engine=innodb.  |
| Error   | 1005 | Can't create table 'test.t2' (errno: 150)                                                                                                                                           |
+---------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

--------------
alter table t1 add foreign key(b) references t1(a)
--------------

ERROR 1005 (HY000): Can't create table 'test.#sql-2b40_2' (errno: 150)
--------------
show warnings
--------------

+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                             |
+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning |  150 | Alter  table `tmp`.`t1` with foreign key constraint failed. Referenced table `tmp`.`t1` not found in the data dictionary close to foreign key(b) references t1(a).  |
| Error   | 1005 | Can't create table 'test.#sql-2b40_2' (errno: 150)                                                                                                                  |
+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Column count does not match

There should be exactly the same number of columns in both the foreign key column list and the referenced column list. However, this currently raises the following error:

create table t1(a int not null primary key, b int, key(b)) engine=innodb
--------------

Query OK, 0 rows affected (0.17 sec)

--------------
alter table t1 add foreign key(a,b) references t1(a)
--------------

ERROR 1005 (HY000): Can't create table 'test.#sql-4856_1' (errno: 150)
--------------
show warnings
--------------

+-------+------+----------------------------------------------------+
| Level | Code | Message                                            |
+-------+------+----------------------------------------------------+
| Error | 1005 | Can't create table 'test.#sql-4856_1' (errno: 150) |
+-------+------+----------------------------------------------------+
1 row in set (0.00 sec)

--------------
show engine innodb status;
--------------
LATEST FOREIGN KEY ERROR
------------------------
150730 15:15:57 Error in foreign key constraint of table test/#sql-4856_1:
foreign key(a,b) references t1(a):
Syntax error close to:
2015-07-30 13:44:35 7f30e1520700 Error in foreign key constraint of table tmp/#sql2612_2_2:
foreign key(b) references t1(a):
Cannot resolve table name close to:
(a)

The error message is not clear and the foreign key error refers to an internal table name. In MariaDB 5.5.45 and 10.0.21 there is additional information:

create table t1(a int not null primary key, b int, key(b)) engine=innodb
--------------

Query OK, 0 rows affected (0.14 sec)

--------------
alter table t1 add foreign key(a,b) references t1(a)
--------------

ERROR 1005 (HY000): Can't create table 'test.#sql-2b40_2' (errno: 150)
--------------
show warnings
--------------

+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                                                                         |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning |  150 | Alter table `test`.`t1` with foreign key constraint failed. Foreign key constraint parse error in foreign key(a,b) references t1(a) close to ). Too few referenced columns, you have 1 when you should have 2. |
| Error   | 1005 | Can't create table 'test.#sql-2b40_2' (errno: 150)                                                                                                                                                             |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Incorrect cascading

A user may define a foreign key constraint with ON UPDATE SET NULL or ON DELETE SET NULL. However, this requires that the referenced columns are not defined as NOT NULL. Currently, the error message in this situation is:

create table t1 (f1 integer not null primary key) engine=innodb
--------------

Query OK, 0 rows affected (0.40 sec)

--------------
alter table t1 add constraint c1 foreign key (f1) references t1(f1) on update set null
--------------

ERROR 1005 (HY000): Can't create table `test`.`#sql-2612_2` (errno: 150 "Foreign key constraint is incorrectly formed")
--------------
show warnings
--------------

+---------+------+-----------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                             |
+---------+------+-----------------------------------------------------------------------------------------------------+
| Error   | 1005 | Can't create table `test`.`#sql-2612_2` (errno: 150 "Foreign key constraint is incorrectly formed") |
| Warning | 1215 | Cannot add foreign key constraint                                                                   |
+---------+------+-----------------------------------------------------------------------------------------------------+

--------------
show engine innodb status;
--------------
LATEST FOREIGN KEY ERROR
------------------------
2015-07-30 13:44:37 7f30e1520700 Error in foreign key constraint of table test/#sql-2612_2:
 foreign key (f1) references t1(f1) on update set null:
You have defined a SET NULL condition though some of the
columns are defined as NOT NULL.

Neither error message is very useful: the first does not really tell how the foreign key constraint is incorrectly formed, and the latter does not say which column has the problem. This is improved in MariaDB 5.5.45 and 10.0.21:

create table t1 (f1 integer not null primary key) engine=innodb
--------------

Query OK, 0 rows affected (0.10 sec)

--------------
alter table t1 add constraint c1 foreign key (f1) references t1(f1) on update set null
--------------

ERROR 1005 (HY000): Can't create table 'test.#sql-2b40_2' (errno: 150)
--------------
show warnings
--------------

+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                                                                                         |
+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning |  150 | Alter  table `test`.`t1` with foreign key constraint failed. You have defined a SET NULL condition but column f1 is defined as NOT NULL in  foreign key (f1) references t1(f1) on update set null close to  on update set null. |
| Error   | 1005 | Can't create table 'test.#sql-2b40_2' (errno: 150)                                                                                                                                                                              |
+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Incorrect types

Column types for foreign key columns and referenced columns should match and use the same character set. If they do not, you currently get:

create table t1 (id int not null primary key, f1 int, f2 int, key(f1)) engine=innodb
--------------

Query OK, 0 rows affected (0.47 sec)

--------------
create table t2(a char(20), key(a), foreign key(a) references t1(f1)) engine=innodb
--------------

ERROR 1005 (HY000): Can't create table `test`.`t2` (errno: 150 "Foreign key constraint is incorrectly formed")
--------------
show warnings
--------------

+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                         |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning |  150 | Create table 'test/t2' with foreign key constraint failed. There is no index in the referenced table where the referenced columns appear as the first columns. |
| Error   | 1005 | Can't create table `test`.`t2` (errno: 150 "Foreign key constraint is incorrectly formed")                                                                      |
| Warning | 1215 | Cannot add foreign key constraint                                                                                                                               |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

--------------
show engine innodb status;
--------------
LATEST FOREIGN KEY ERROR
------------------------
2015-07-30 13:44:39 7f30e1520700 Error in foreign key constraint of table test/t2:
foreign key(a) references t1(f1)) engine=innodb:
Cannot find an index in the referenced table where the
referenced columns appear as the first columns, or column types
in the table and the referenced table do not match for constraint.
Note that the internal storage type of ENUM and SET changed in
tables created with >= InnoDB-4.1.12, and such columns in old tables
cannot be referenced by such columns in new tables.
See http://dev.mysql.com/doc/refman/5.6/en/innodb-foreign-key-constraints.html
for correct foreign key definition.

But we do have an index on the referenced column f1 in table t1, so the message points at the wrong cause: the real problem here is the type mismatch. And if there were multiple columns in both the foreign key column list and the referenced column list, where would we look for the error? In MariaDB 5.5.45 and 10.0.21 this is improved:

create table t1 (id int not null primary key, f1 int, f2 int, key(f1)) engine=innodb
--------------

Query OK, 0 rows affected (0.15 sec)

--------------
create table t2(a char(20), key(a), foreign key(a) references t1(f1)) engine=innodb
--------------

ERROR 1005 (HY000): Can't create table 'test.t2' (errno: 150)
--------------
show warnings
--------------

+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                                                            |
+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning |  150 | Create  table `test`.`t2` with foreign key constraint failed. Field type or character set for column a does not mach referenced column f1 close to foreign key(a) references t1(f1)) engine=innodb |
| Error   | 1005 | Can't create table 'test.t2' (errno: 150)                                                                                                                                                          |
+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Conclusions

There are several different ways to incorrectly define a foreign key constraint. In many cases when using earlier versions of MariaDB (and MySQL), the error messages produced by these cases were not very clear or helpful. In MariaDB 5.5.45 and 10.0.21 there are clearly improved error messages to help out the user. Naturally, there is always room for further improvements, so feedback is more than welcome!


by Jan Lindstrom at August 07, 2015 05:51 PM

Shlomi Noach

Baffling 5.7 global/status variables issues, unclean migration path

MySQL 5.7 introduces a change in the way we query for global variables and status variables: the INFORMATION_SCHEMA.(GLOBAL|SESSION)_(VARIABLES|STATUS) tables are now deprecated and empty. Instead, we are to use the respective performance_schema.(global|session)_(variables|status) tables.

But the change goes further than that; there is also a security change. Oracle created a pitfall by making two changes at the same time:

  1. Variables/status moved to a different table
  2. Privileges required on said table

As an example, my non-root user gets:

mysql> show session variables like 'tx_isolation';
ERROR 1142 (42000): SELECT command denied to user 'normal_user'@'my_host' for table 'session_variables'

Who gets affected by this? Nearly everyone and everything.

  • Your Nagios will not be able to read status variables
  • Your ORM will not be able to determine session variables
  • Your replication user will fail connecting (see this post by Giuseppe)
  • And most everyone else.

The problem with the above is that it involves two unrelated changes to your setup, which are not entirely simple to coordinate:

  1. Change your app code to choose the correct schema (information_schema vs. performance_schema)
  2. GRANT the permissions on your database

Perhaps at this point you still do not consider this to be a problem. You may be thinking: well, let's first prepare by creating the GRANTs, and once that is in place, we can, at our leisure, modify the code.

Not so fast. Can you really just create those GRANTs?

Migration woes

How do you migrate to a new MySQL version? You do not reinstall all your servers. You want an easy migration path, and that path is: introduce one or two slaves of a newer version, see that everything works to your satisfaction, slowly upgrade all your other slaves, eventually switchover/upgrade your master.

This should not be any different for 5.7. We would like to provision a 5.7 slave in our topologies and just see that everything works. Well, we have, and things don't just work. Our Nagios stops working for that 5.7 slave. Orchestrator started complaining (by this time I had already fixed it to be more tolerant of the 5.7 problems, so no crashes here).

I hope you see the problem by now.

You cannot issue a GRANT SELECT ON performance_schema.global_variables TO '...' on your 5.6 master.

The table simply does not exist there, which means the statement will not go to binary logs, which means it will not replicate on your 5.7 slave, which means you will not be able to SHOW GLOBAL VARIABLES on your slave, which means everything remains broken.

Yes, you can issue this directly on your 5.7 slaves. It's doable, but undesired. It's ugly in terms of automation (and will quite possibly break some assumptions and sanity checks your automation uses); in terms of validity testing. It's unfriendly to GTID (make sure to SET SQL_LOG_BIN=0 before that).
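
For completeness, the direct-on-the-slave workaround might look like the sketch below (the user and host are the placeholders from the earlier example; run it on each 5.7 slave individually):

SET SESSION SQL_LOG_BIN = 0;  -- keep the GRANT out of the binary log
GRANT SELECT ON performance_schema.* TO 'normal_user'@'my_host';
SET SESSION SQL_LOG_BIN = 1;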

WHY in the first place?

It seems like a security thing. I'm not sure whether this was intended. So you prevent a SHOW GLOBAL VARIABLES for a normal user. Makes sense. And yet:

mysql> show global variables like 'hostname';
ERROR 1142 (42000): SELECT command denied to user 'normal_user'@'my_host' for table 'global_variables'

mysql> select @@global.hostname;
+---------------------+
| @@global.hostname   |
+---------------------+
| myhost.mydomain.com |
+---------------------+

mysql> select @@version;
+--------------+
| @@version    |
+--------------+
| 5.7.8-rc-log |
+--------------+

Seems like I'm allowed access to that info after all. So it's not strictly a security design decision. For status variables, I admit, I don't have a similar workaround.

Solutions?

The following are meant to be solutions, but do not really solve the problem:

  • SHOW commands. SHOW GLOBAL|SESSION VARIABLES|STATUS will work properly, and will implicitly know whether to provide the results via information_schema or performance_schema tables.
    • But, aren't we meant to be happier with SELECT queries? So that I can really do stuff that is smarter than LIKE 'variable_name%'? (see the sketch after this list)
    • And of course you cannot use SHOW in server side cursors. Your stored routines are in a mess now.
    • This does not solve the GRANTs problem.
  • show_compatibility_56: a boolean variable introduced in 5.7. It truly is a time-travel-paradox novel in disguise, in multiple respects.
    • Documentation introduces it, and says it is deprecated.
      • time-travel-paradox :O
    • But it actually works in 5.7.8 (latest)
      • time-travel-paradox plot thickens
    • Your automation scripts do not know in advance whether your MySQL has this variable
      • Hence SELECT @@global.show_compatibility_56 will produce an error on 5.6
      • But the "safe" way of SHOW GLOBAL VARIABLES LIKE 'show_compatibility_56' will fail on a privilege error on 5.7
      • time-travel-paradox :O
    • As advised by my colleague Simon J. Mudd, show_compatibility_56 should default to OFF. I support this line of thought. Otherwise it's old_passwords=1 all over again.
    • show_compatibility_56 doesn't solve the GRANTs problem.
    • This does not solve any migration path. It just postpones the moment when I will hit the same problem. When I flip the variable from "1" to "0", I'm back at square one.
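
To illustrate the kind of query the SELECT interface is meant to enable (a sketch, assuming a 5.7 server and SELECT privileges on the performance_schema tables):

SELECT variable_name, variable_value
FROM performance_schema.global_variables
WHERE variable_name IN ('max_connections', 'gtid_mode', 'read_only');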

Suggestion

I claim security is not the issue, as presented above. I claim Oracle will yet again fall into the same trap as with the no-easy-way-to-migrate-to-GTID situation in 5.6, if the current solution remains unchanged. I claim that there have been too many changes at once. Therefore, I suggest one of the following two alternative flows:

  1. Flow 1: keep information_schema, later migration into performance_schema
    • In 5.7, information_schema tables should still produce the data.
    • No security constraints on information_schema
    • Generate WARNINGs on reading from information_schema ("...this will be deprecated...")
    • performance_schema also available. With security constraints, whatever.
    • In 5.8 remove information_schema tables; we are left with performance_schema only.
  2. Flow 2: easy migration into performance_schema:
    • In 5.7, performance_schema tables should not require any special privileges. Any user can read from them.
    • Keep show_compatibility_56 as it is.
    • SHOW commands choose between information_schema or performance_schema on their own -- just as things are done now.
    • In 5.8, performance_schema tables will require SELECT privileges.

As always, I love the work done by the engineers; and I love how they listen to the community.

Comments are most welcome. Have I missed the simple solution here? Are there even more complications to these features? Thoughts on my suggested two flows?

[UPDATE 2015-08-19]

Please see this followup by Morgan Tocker of Oracle.

by shlomi at August 07, 2015 12:39 PM

Peter Zaitsev

The MySQL query cache: Worst enemy or best friend?

During the last couple of months I have been involved in an unusually high number of performance audits for e-commerce applications running Magento. And although the systems were quite different, they also had one thing in common: the MySQL query cache was very useful. That was counter-intuitive for me, as I had always expected the query cache to be such a bottleneck that response time would be better with the query cache turned off, no matter what. That led me to run a few experiments to better understand when the query cache can be helpful.

Some context

The query cache is well known for its contentions: a global mutex has to be acquired for any read or write operation, which means that any access is serialized. This was not an issue 15 years ago, but with today’s multi-core servers, such serialization is the best way to kill performance.

However, from a performance point of view, any query cache hit is served in a few tens of microseconds, while the fastest InnoDB access (a primary key lookup) still requires several hundred microseconds. Yes, the query cache is at least an order of magnitude faster than any query that goes to InnoDB.

A simple test

To better understand how good or bad the query cache can be, I set up a very simple benchmark:

  • 1M records were inserted in 16 tables.
  • A moderate write load (65 updates/s) was run with a modified version of the update_index.lua sysbench script (see the end of the post for the code).
  • The select.lua sysbench script was run, with several values for the --num-threads option.

Note that the test is designed to be unfavorable to the query cache as the whole dataset fits in the buffer pool and the SELECT statements are very simple. Also note that I configured the query cache to be large enough so that no entry was evicted from the cache due to low memory.
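
One way to verify that last claim during a run (a sketch; Qcache_lowmem_prunes is the standard counter of cache entries dropped for lack of memory):

-- should remain 0 for the duration of the benchmark
SHOW GLOBAL STATUS LIKE 'Qcache_lowmem_prunes';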

Results – MySQL query cache ON

First here are the results when the query cache is enabled:

[Graph: throughput vs. number of concurrent threads, query cache ON]

This configuration scales well up to 4 concurrent threads, but then the throughput degrades very quickly. With 10 concurrent threads, SHOW PROCESSLIST is enough to show you that all threads spend all their time waiting for the query cache mutex. Okay, this is not a surprise.
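
This is easy to see for yourself (a sketch; "Waiting for query cache lock" is the thread state MySQL reports for this wait):

SHOW PROCESSLIST;
-- at 10 concurrent threads, nearly every session shows
-- State: Waiting for query cache lock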

Results – MySQL query cache OFF

When the query cache is disabled, this is another story:

[Graph: throughput vs. number of concurrent threads, query cache OFF]

Throughput scales well up to somewhere between 10 and 20 threads (for the record the server I was using had 16 cores). But more importantly, even at the higher concurrencies, the overall throughput continued to increase: at 20 concurrent threads, MySQL was able to serve nearly 3x more queries without the query cache.

Conclusion

With Magento, you can expect a light write workload, very low concurrency and also quite complex SELECT statements. Given the results of our simple benchmarks, it is after all not that surprising that the MySQL query cache is a good fit in this case.

It is also worth noting that many applications run a database workload where writes are light and concurrency is low: the query cache should then not be discarded immediately. And maybe it is time for Oracle to make plans to improve the query cache as suggested by Peter a few years ago?

Annex: sysbench commands

# Modified update_index.lua
function event(thread_id)
   -- pick one of the sbtest tables at random
   local table_name = "sbtest" .. sb_rand_uniform(1, oltp_tables_count)
   -- update one random row, then sleep 15ms to throttle the write load (~65 updates/s)
   rs = db_query("UPDATE " .. table_name .. " SET k=k+1 WHERE id=" .. sb_rand(1, oltp_table_size))
   db_query("SELECT SLEEP(0.015)")
end

# Populate the tables
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-table-size=1000000 --oltp-tables-count=16 --num-threads=16 --test=/usr/share/doc/sysbench/tests/db/insert.lua prepare
# Write workload
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-tables-count=16 --num-threads=1 --test=/usr/share/doc/sysbench/tests/db/update_index.lua --max-requests=1000000 run
# Read workload
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-tables-count=16 --num-threads=1 --test=/usr/share/doc/sysbench/tests/db/select.lua --max-requests=10000000 run

The post The MySQL query cache: Worst enemy or best friend? appeared first on MySQL Performance Blog.

by Stephane Combaudon at August 07, 2015 10:00 AM

August 06, 2015

MariaDB Foundation

MariaDB 10.0.21 and 5.5.45 now available

The MariaDB project is pleased to announce the immediate availability of MariaDB 10.0.21 and 5.5.45. These are both Stable (GA) releases.

Download MariaDB 10.0.21

Release Notes Changelog What is MariaDB 10.0?

 

Download MariaDB 5.5.45

Release Notes Changelog What is MariaDB 5.5?

MariaDB APT and YUM Repository Configuration Generator

See the Release Notes and Changelogs for detailed information on these releases.

Thanks, and enjoy MariaDB!

by Daniel Bartholomew at August 06, 2015 06:52 PM

Jean-Jerome Schmidt

Architecting for Failure - Disaster Recovery of MySQL/MariaDB Galera Cluster

Failure is a fact of life, and cannot be avoided. No IT vendor in their right mind will claim 100% system availability, although some might claim several nines :-) We might have armies of ops soldiers doing everything they possibly can to avoid failure, yet we still need to prepare for it. What do you do when the sh*t hits the fan?

Whether you use unbreakable private datacenters or public cloud platforms, Disaster Recovery (DR) is indeed a key issue. This is not just about copying your data to a backup site and being able to restore it; it is about business continuity and how fast you can recover services when disaster strikes.

In this blog post, we will look into different ways of designing your Galera Clusters for fault tolerance, including failover and failback strategies. 

Disaster Recovery (DR)

Disaster recovery involves a set of policies and procedures to enable the recovery or continuation of infrastructure following a natural or human-induced disaster. A DR site is a backup site in another location where an organization can relocate following a disaster, such as fire, flood, terrorist threat or other disruptive event. A DR site is an integral part of a Disaster Recovery/Business Continuity plan.

Most large IT organizations have some sort of DR plan, not so for smaller organisations though - because of the high cost vs the relative risk. Thanks to the economics of public clouds, this is changing though. Smaller organisations with tighter budgets are also able to have something in place. 

Setting up Galera Cluster for DR Site

A good DR strategy will try to minimize downtime, so that in the event of a major failure, a backup site can instantly take over to continue operations. One key requirement is to have the latest data available. Galera is a great technology for that, as it can be deployed in different ways - one cluster stretched across multiple sites, multiple clusters kept in sync via asynchronous replication, mixture of synchronous and asynchronous replication, and so on. The actual solution will be dictated by factors like WAN latency, eventual vs strong data consistency and budget.
 
Let’s have a look at the different options to deploy Galera and how this affects the database part of your DR strategy. 

Active Passive Master-Master Cluster

This setup consists of six Galera nodes on two sites, forming a single Galera cluster across the WAN. You would need a third site to act as an arbitrator, voting for quorum and preserving the “primary component” if the primary site is unreachable. The DR site should be available immediately, without any intervention.

Failover strategy:

  1. Redirect your traffic to the DR site (e.g. update DNS records, etc.). The assumption here is that the DR site’s application instances are configured to access the local database nodes.

Failback strategy:

  1. Redirect your traffic back to primary site.

Advantages:

  • Failover and failback without downtime. Applications can switch to both sites back and forth.
  • Easier to switch sides, without extra steps for re-bootstrapping and reprovisioning the cluster. Both sites can receive reads/writes at any moment, provided the cluster is in quorum.
  • SST (or IST) during failback won’t be painful as a set of nodes is available to serve the joiner on each site.

Disadvantages:

  • Highest cost. You need to have at least three sites with minimum of 7 nodes (including garbd). 
  • With the disaster recovery site mostly inactive, this would not be the best utilisation of your resources.
  • Requires low and reliable latency between sites, or else there is a risk for lag - especially for large transactions (even  with different segment IDs assigned)
  • Risk for performance degradation is higher with more nodes in the same cluster, as they are synchronous copies. Nodes with uniform specs are required.
  • Tightly coupled cluster across both sites. This means there is a high level of communication between the two sets of nodes, and e.g., a cluster failure will affect both sites. (on the other hand, a loosely coupled system means that the two databases would be largely independent, but with occasional synchronisation points)

Active Passive Master-Master Node

Two nodes are located in the primary site while the third node is located in the disaster recovery site. If the primary site is down, the cluster will fail as it is out of quorum. galera3 will need to be bootstrapped manually as a single-node primary component. Once the primary site comes back up, galera1 and galera2 need to rejoin galera3 to get synced. Having a pretty large gcache should help to reduce the risk of SST over the WAN.

Failover strategy:

  1. Bootstrap galera3 as primary component, running as a “single node cluster” (see the sketch after this list).
  2. Point database traffic to the DR site.
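
A minimal sketch of the bootstrap step (run on galera3 only; pc.bootstrap forces a new primary component, so be certain the primary site really is down before issuing it):

SET GLOBAL wsrep_provider_options = 'pc.bootstrap=YES';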

Failback strategy:

  1. Let galera1 and galera2 join the cluster.
  2. Once synced, point database traffic to the primary site.

Advantages:

  • Reads/writes to any node.
  • Easy failover with single command to promote the disaster recovery site as primary component.
  • Simple architecture setup and easy to administer.
  • Low cost (only 3 nodes required)

Disadvantages:

  • Failover is manual, as the administrator needs to promote the single node as primary component. There would be downtime in the meantime.
  • Performance might be an issue, as the DR site will be running with a single node to run all the load. It may be possible to scale out with more nodes after switching to the DR site, but beware of the additional load.
  • Failback will be harder if SST happens over WAN.
  • Increased risk, due to tightly coupled system between the primary and DR sites

Active Passive Master-Slave Cluster via Async Replication

This setup will make the primary and DR site independent of each other, loosely connected with asynchronous replication. One of the Galera nodes in the DR site will be a slave that replicates from one of the Galera nodes (master) in the primary site. Ensure that both sites are producing binary logs with GTID and that log_slave_updates is enabled - the updates that come from the asynchronous replication stream will then be applied to the other nodes in the cluster.
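
As a sketch of how the DR-site slave might be attached, assuming MySQL 5.6-style GTID auto-positioning (the host name and credentials below are hypothetical):

CHANGE MASTER TO
  MASTER_HOST = 'galera1.primary.example',  -- hypothetical master on the primary site
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '...',
  MASTER_AUTO_POSITION = 1;
START SLAVE;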

By having two separate clusters, they will be loosely coupled and not affect each other. E.g. a cluster failure on the primary site will not affect the backup site. Performance-wise, WAN latency will not impact updates on the active cluster. These are shipped asynchronously to the backup site. The DR cluster could potentially run on smaller instances in a public cloud environment, as long as they can keep up with the primary cluster. The instances can be upgraded if needed. 

It’s also possible to have a dedicated slave instance as replication relay, instead of using one of the Galera nodes as slave.

Failover strategy:

  1. Ensure the slave in the DR site is up-to-date (up until the point when the primary site was down).
  2. Stop the replication between the slave and the primary site. Make sure all replication events are applied (see the sketch after this list).
  3. Redirect database traffic to the DR site.
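
Steps 1 and 2 might look like this on the DR-site slave (a sketch; with GTIDs, the retrieved and executed GTID sets should match before replication is stopped):

SHOW SLAVE STATUS\G  -- compare Retrieved_Gtid_Set with Executed_Gtid_Set
STOP SLAVE;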

Failback strategy:

  1. On primary site, setup one of the Galera nodes (e.g., galera3) as slave to replicate from a (master) node in the DR site (galera2).
  2. Once the primary site catches up, switch database traffic back to the primary cluster.
  3. Stop the replication between primary site and DR site.
  4. Re-slave one of the Galera nodes on DR site to replicate from the primary site.

Advantages:

  • No downtime required during failover/failback.
  • No performance impact on the primary site since it is independent from the backup site.
  • Disaster recovery site can be used for other purposes like data backup, binary logs backup and reporting or large analytical queries (OLAP).

Disadvantages:

  • There is a chance of missing some data during failover if the slave was behind.
  • Pretty costly, as you have to setup a similar number of nodes on the disaster recovery site.
  • The failback strategy can be risky, it requires some expertise on switching master/slave role.

Active Passive Master-Slave Replication Node

Galera cluster on the primary site replicates to a single-instance slave in the DR site using asynchronous MySQL replication with GTID. Note that MariaDB has a different GTID implementation, so it comes with slightly different instructions.
Take extra precautions to ensure the slave is replicating without replication lag, to avoid any data loss during failover. From the slave's point of view, switching to another master should be easy with GTID.

Failover to DR site:

  1. Ensure the slave has caught up with the master. If it does not and the primary site is already down, you might miss some data. This will make things harder.
  2. If the slave has READ_ONLY=on, disable it so it can receive writes (see the sketch after this list).
  3. Redirect database traffic to the DR site
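
Steps 1 and 2 on the DR slave might look like the following sketch:

SHOW SLAVE STATUS\G           -- verify all received events have been applied
SET GLOBAL read_only = OFF;   -- allow application writes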

Failback to primary site:

  1. Use xtrabackup to move the data from the DR site to a Galera node on the primary site - this is an online process which may cause some performance drop, but it is non-blocking for InnoDB-only databases
  2. Once data is in place, slave the Galera node off the DR host using the data from xtrabackup
  3. At the same time, slave the DR site off the Galera node - to form master-master replication
  4. Rebuild the rest of the Galera cluster using either xtrabackup or SST
  5. Wait until primary site catches up on replication with DR site
  6. Perform a failback by stopping the writes to DR, ensuring that replication is in sync, and finally repointing writes to production
  7. Set the slave as read-only, stop the replication from DR to prod leaving only prod -> DR replication

Advantages:

  • Replication slave should not cause performance impact to the Galera cluster.
  • If you are using MariaDB, you can utilize multi-source replication, where the slave in the DR site is able to replicate from multiple masters.
  • Lowest cost and relatively faster to deploy.
  • Slave on disaster recovery site can be used for other purposes like data backup, binary logs backup and running huge analytical queries (OLAP).

Disadvantages:

  • There is a chance of missing data during failover if the slave was behind.
  • More hassle in failover/failback procedures.
  • Downtime during failover.

The above are a few options for your disaster recovery plan. You can design your own, make sure you perform failover/failback tests and document all procedures. Trust us - when disaster strikes, you won’t be as cool as when you’re reading this post.

by Severalnines at August 06, 2015 01:26 PM

MariaDB AB

Performance gains with MariaDB and IBM POWER8

MariaDB and Foedus paths crossed less than a year ago when I met Paolo Messina at a two-day IBM event in Tuscany. IBM invited me, as the MariaDB Italian representative, to introduce MariaDB as an Open Source Solution for the POWER8 platform.

Foedus was there as one of the top Italian IBM Partners and as an Italian ISV that provides an ERP solution platform named “Octobus”. Since 2008, Foedus has developed innovative solutions and management tools using the LAMP stack as the foundation of their technology, while staying open to other platforms: Octobus, which is developed in Java, also runs on the Windows platform.

When we met in October 2014, Foedus was about to test their solution on a POWER8 machine. They wanted to consider POWER8 as an option because of the great performance of POWER8 machines, and because it was a great opportunity to run a native Linux-based solution on an IBM Linux native system. One of their best clients agreed to host a POC in their live environment--actually based on a WAMP stack--and use their data.

The plan was to install Octobus in a POWER8 LAMP stack test environment that could run in parallel with the production system, and to test performance with static data. At a certain point in time, the plan was to stop the production environment, move all of the data to the POWER8 machine, switch production to the new environment for at least two days, and then monitor the performance.

At the beginning, MariaDB was not part of this plan. However, IBM suggested to Foedus that they add MariaDB to the configuration and experience the fantastic performance of MariaDB version 10 optimized for POWER8. They agreed. They also decided to use RedHat version 7 for the operating system.

At that time, Foedus was not very familiar with MariaDB, so they were worried this change would require some rework, either in the source code, in the configuration or in the parameter settings. None of that was necessary. They just installed Octobus, installed MariaDB 10, took a backup of the MySQL database, moved the data to the POWER8 machine, started the service, and the job was done!

They ran tests for two full days on the new environment, in production. The results were so impressive that the final customer is now considering IBM POWER8 with MariaDB and Linux as an option for next year. The most impressive result they observed came from testing some well-known, long-running queries: the MariaDB optimizer and the POWER8-specific binaries allowed Foedus to cut query execution time by more than a factor of ten.

If you want to know more about Foedus and their experience, you can find a video that introduces them on this page: https://mariadb.com/blog/mariadb-galera-available-ibm-power8-platform.

About the Author

Maria-Luisa Raviol is a Senior Sales Engineer with over 20 years of industry experience.

by maria-luisaraviol at August 06, 2015 09:55 AM

August 05, 2015

Peter Zaitsev

PXC – Incremental State transfers in detail

IST Basics

State transfers in Galera remain a mystery to most people.  Incremental State transfers (as opposed to full State Snapshot transfers) are used under the following conditions:

  • The Joiner node reports a valid Galera GTID to the cluster
  • The Donor node selected contains all the transactions the Joiner needs to catch up to the rest of the cluster in its Gcache
  • The Donor node can establish a TCP connection to the Joiner on port 4568 (by default) - a quick check follows this list
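
To sanity-check that last condition, you can probe the IST port from the prospective Donor; a minimal sketch (assuming netcat is installed; the host name is a placeholder):

$ nc -z -w 2 joiner-host 4568 && echo "IST port reachable"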

IST states

Galera has many internal node states related to Joiner nodes.  They currently are:

  1. Joining
  2. Joining: preparing for State Transfer
  3. Joining: requested State Transfer
  4. Joining: receiving State Transfer
  5. Joining: State Transfer request failed
  6. Joining: State Transfer failed
  7. Joined

I don’t claim any special knowledge of most of these states apart from what their titles indicate.  Many of these states occur very briefly and it is unlikely you’ll ever actually see them in a node’s wsrep_local_state_comment.

During IST, however, I have observed the following states have the potential to take a long while:

Joining: receiving State Transfer

During this state, transactions are being streamed to the Joiner’s wsrep_local_recv_queue.  You can connect to the node at this time and poll its state.  If you do, you’ll easily see the inbound queue increasing (usually quickly) but no writesets being ‘received’ (read: applied).  It’s not clear to me if there is a reason why transaction apply couldn’t be started during this stream, but it does not do so currently.
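
For instance, a minimal polling sketch (it assumes a local mysql client with working credentials):

$ while true; do mysql -N -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue'"; sleep 1; done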

The further behind the Joiner is, the longer this can take.  Here’s some output from the latest release of myq-tools showing wsrep stats:

[root@node2 ~]# myq_status wsrep
mycluster / node2 (idx: 1) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:04:40 P   4  2 J:Rc 0.4ms    0   0b   0    1 197b  4k   0ns   0   0   0  367k    0  94%
14:04:41 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  5k   0ns   0   0   0  368k    0  93%
14:04:42 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  6k   0ns   0   0   0  371k    0  92%
14:04:43 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  7k   0ns   0   0   0  373k    0  92%
14:04:44 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  8k   0ns   0   0   0  376k    0  92%
14:04:45 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 10k   0ns   0   0   0  379k    0  92%
14:04:46 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 11k   0ns   0   0   0  382k    0  92%
14:04:47 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 12k   0ns   0   0   0  386k    0  91%
14:04:48 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 13k   0ns   0   0   0  390k    0  91%
14:04:49 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 14k   0ns   0   0   0  394k    0  91%
14:04:50 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 15k   0ns   0   0   0  397k    0  91%
14:04:51 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 16k   0ns   0   0   0  401k    0  91%
14:04:52 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 18k   0ns   0   0   0  404k    0  91%
14:04:53 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 19k   0ns   0   0   0  407k    0  91%
14:04:54 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 20k   0ns   0   0   0  411k    0  91%

The node is in ‘J:Rc’ (Joining: Receiving) state and we can see the Inbound queue growing (wsrep_local_recv_queue). Otherwise this node is not sending or receiving transactions.

Joining

Once all the requested transactions are copied over, the Joiner flips to the ‘Joining’ state, during which it starts applying the transactions as quickly as the wsrep_slave_threads can go.  For example:

Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:04:55 P   4  2 Jing 0.6ms    0   0b   0 2243 3.7M 19k   0ns   0   0   0  2236  288  91%
14:04:56 P   4  2 Jing 0.5ms    0   0b   0 4317 7.0M 16k   0ns   0   0   0  6520  199  92%
14:04:57 P   4  2 Jing 0.5ms    0   0b   0 4641 7.5M 12k   0ns   0   0   0 11126  393  92%
14:04:58 P   4  2 Jing 0.4ms    0   0b   0 4485 7.2M  9k   0ns   0   0   0 15575  200  93%
14:04:59 P   4  2 Jing 0.5ms    0   0b   0 4564 7.4M  5k   0ns   0   0   0 20102  112  93%

Notice that the Inbound msgs (wsrep_received) start increasing rapidly and the queue decreases accordingly.

Joined

14:05:00 P   4  2 Jned 0.5ms    0   0b   0 4631 7.5M  2k   0ns   0   0   0 24692   96  94%

Towards the end, the node briefly switches to the ‘Joined’ state, though that is a fast state in this case. ‘Joining’ and ‘Joined’ are similar states; the difference (I believe) is that:

  • ‘Joining’ is applying transactions acquired via the IST
  • ‘Joined’ is applying transactions that have queued up via standard Galera replication since the IST (i.e., everything that has been happening on the cluster since the IST started)

Flow control during Joining/Joined states

The Codership documentation says something interesting about ‘Joined’ (from experimentation, I believe the ‘Joining’ state behaves the same here):

Nodes in this state can apply write-sets. Flow Control here ensures that the node can eventually catch up with the cluster. It specifically ensures that its write-set cache never grows. Because of this, the cluster wide replication rate remains limited by the rate at which a node in this state can apply write-sets. Since applying write-sets is usually several times faster than processing a transaction, nodes in this state hardly ever effect cluster performance.

What this essentially means is that a Joiner’s wsrep_local_recv_queue is allowed to shrink but NEVER GROW during an IST catchup.  Growth will trigger flow control, but why would it grow?  Writes on other cluster nodes must still be replicated to our Joiner and added to the queue.

If the Joiner’s apply rate is less than the rate of writes coming from Cluster replication, flow control will be applied to slow down Cluster replication (read: your application writes).  As far as I can tell, there is no way to tune this or turn it off.  The Codership manual continues here:

The one occasion when nodes in the JOINED state do effect cluster performance is at the very beginning, when the buffer pool on the node in question is empty.

Essentially a Joiner node with a cold cache can really hurt performance on your cluster.  This can possibly be improved by:

  • Better IO and other resources available to the Joiner for a quicker cache warmup.  A huge example of this would be flash over conventional storage.
  • Buffer pool preloading (a my.cnf sketch follows this list)
  • More Galera apply threads
  • etc.
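
On the buffer pool preloading point, a minimal my.cnf sketch (assuming MySQL/Percona Server 5.6 or later, where the buffer pool dump/restore options exist; the values are illustrative only):

[mysqld]
# persist buffer pool contents across restarts so the Joiner comes up warm
innodb_buffer_pool_dump_at_shutdown = 1
innodb_buffer_pool_load_at_startup = 1
# more Galera applier threads for faster catch-up
wsrep_slave_threads = 16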

Synced

From what I can tell, the ‘Joined’ state ends when the wsrep_local_recv_queue drops below the node’s configured flow control limit.  At that point it changes to ‘Synced’ and the node behaves more normally (with respect to flow control).
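
You can read that limit out of wsrep_provider_options; a minimal sketch (assuming a local mysql client):

$ mysql -N -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'" | tr ';' '\n' | grep fc_limit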

Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:05:01 P   4  2 Sync 0.5ms    0   0b   0 3092 5.0M   0   0ns   0   0   0 27748  150  94%
14:05:02 P   4  2 Sync 0.5ms    0   0b   0 1067 1.7M   0   0ns   0   0   0 28804  450  93%
14:05:03 P   4  2 Sync 0.5ms    0   0b   0 1164 1.9M   0   0ns   0   0   0 29954   67  92%
14:05:04 P   4  2 Sync 0.5ms    0   0b   0 1166 1.9M   0   0ns   0   0   0 31107  280  92%
14:05:05 P   4  2 Sync 0.5ms    0   0b   0 1160 1.9M   0   0ns   0   0   0 32258  606  91%
14:05:06 P   4  2 Sync 0.5ms    0   0b   0 1154 1.9M   0   0ns   0   0   0 33401  389  90%
14:05:07 P   4  2 Sync 0.5ms    0   0b   0 1147 1.8M   1   0ns   0   0   0 34534  297  90%
14:05:08 P   4  2 Sync 0.5ms    0   0b   0 1147 1.8M   0   0ns   0   0   0 35667  122  89%
14:05:09 P   4  2 Sync 0.5ms    0   0b   0 1121 1.8M   0   0ns   0   0   0 36778  617  88%

Conclusion

You may not notice these states during IST unless you are watching the Joiner closely, but if your IST is taking a long while, the states above should make it easy to understand what is happening.

The post PXC – Incremental State transfers in detail appeared first on MySQL Performance Blog.

by Jay Janssen at August 05, 2015 10:00 AM

Jean-Jerome Schmidt

Become a MySQL DBA blog series - Live Migration using MySQL Replication

Migrating your database to a new datacenter can be a high-risk and time-consuming process. A database contains state, and can be much harder to migrate compared to web servers, queues or cache servers.

In this blog post, we will give you some tips on how to migrate your data from one service provider to another. The process is somewhat similar to our previous post on how to upgrade MySQL, but there are a couple of important differences. 
 
This is the seventh installment in the ‘Become a MySQL DBA’ blog series. Our previous posts in the DBA series include Database Upgrades, Replication Topology Changes, Schema Changes, High Availability, Backup & Restore, Monitoring & Trending.

MySQL Replication or Galera?

Switching to another service provider (e.g., moving from AWS to Rackspace or from colocated servers to cloud) very often means one would build a brand new infrastructure in parallel, sync it with the old infrastructure and then switch over to it.  To connect and sync them, you may want to leverage MySQL replication. 

If you are using Galera Cluster, it might be easier to move your Galera nodes to a different datacenter. However, note that the whole cluster still has to be treated as a single database. This means that your production site might suffer from the additional latency introduced when stretching Galera Cluster over the WAN. It is possible to minimize impact by tuning network settings in both Galera and the operating system, but the impact cannot be entirely eliminated. It is also possible to set up asynchronous MySQL Replication between the old and the new cluster instead, if the latency impact is not acceptable.

Setting up secure connectivity

MySQL replication is unencrypted, and therefore not safe to use over the WAN. There are numerous ways of ensuring that your data will be transferred safely. You should investigate whether it is possible to establish a VPN connection between your current infrastructure and your new service provider. Most providers (for example, both Rackspace and AWS) provide such a service - you can connect your “cloudy” part to your existing datacenter via a virtual private network.

If, for some reason, this solution does not work for you (maybe it requires hardware that you do not have access to), you can use software to build a VPN - one option is OpenVPN. It works nicely for setting up an encrypted connection between your datacenters.

If OpenVPN is not an option, there are more ways to ensure replication will be encrypted. For example, you can use SSH to create a tunnel between the old and new datacenters, and forward port 3306 from the current MySQL slave (or master) to the new node. It can be done in a very simple way, as long as you have SSH connectivity between the hosts:

$ ssh -L local_port:old_dc_host:mysql_port_in_old_dc root@old_dc_host -N &

For example:

$ ssh -L 3307:10.0.0.201:3306 root@10.0.0.201 -N &

Now, you can connect to the remote server by using 127.0.0.1:3307:

$ mysql -P3307 -h 127.0.0.1

It works similarly for replication - just remember to use 127.0.0.1 for master_host and 3307 for master_port, as in the sketch below.
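
A hedged sketch, run on the new slave where the tunnel’s local end lives (the replication user and password are placeholders):

$ mysql -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3307, MASTER_USER='repl_user', MASTER_PASSWORD='repl_password'; START SLAVE;"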

Last but not least, you can encrypt your replication using SSL. This previous blog post explains how it can be done and what kind of impact it may have on replication performance.

If you decided to use Galera replication across both datacenters, the above suggestions also apply here. When it comes to the SSL, we previously blogged about how to encrypt Galera replication traffic. For a more complete solution, you may want to encrypt all database connections from client applications and any management/monitoring infrastructure.

Setting up the new infrastructure

Once you have connectivity, you need to start building the new infrastructure. For that, you will probably use xtrabackup, unless you are combining the migration with a MySQL upgrade. We discussed it in the previous post but it’s important enough to reiterate - if you plan to perform a major version upgrade (e.g., from MySQL 5.5 to MySQL 5.6), the only supported option is via dump and reload. Xtrabackup does a binary copy, which works fine as long as we have the same MySQL version on both ends. We would not recommend performing an upgrade together with the migration. Migrating to a new infrastructure is significant enough on its own, so combining it with another major change increases complexity and risk. That’s true for other things too - you want to take a step-by-step approach to changes. Only by changing things one at a time can you understand the results of the changes and how they impact your workload - if you make more than one change at a given time, you cannot be sure which one is responsible for a given (new) behavior that you’ve observed.

When you have a new MySQL instance up and running in the new datacenter, you need to slave it off a node in the old datacenter - to ensure that data in both datacenters stays in sync. This will come in handy as you prepare for the final cutover. It’s also a nice way of ensuring that the new environment can handle your write load.

The next step is to build a complete staging infrastructure in the new location and perform tests and benchmarks. This is a very important step that shouldn’t be skipped - the point is that you, as the DBA, have to understand the capacity of your infrastructure. When you change provider, things change with it. New hardware/VMs are faster or slower. There’s more or less memory per instance. You need to understand anew how your workload will fit in the hardware you are going to use. For that you’ll probably use Percona Playback or pt-log-player to replay some real-life queries on the staging system. You’ll want to test the performance and ensure it is at a level acceptable to you. You also want to perform all of the standard acceptance tests that you run on your new releases - just to confirm that everything is up and running. In general, all applications should be built in a way that does not rely on the hardware configuration or the current setup. But something might have slipped through, and your app may depend on some of the config tweaks or hardware solutions that you do not have in the new environment.

Finally, once you are happy with your tests, you’ll want to build a production-ready infrastructure. After this is done, you may want to run some read-only tests for final verification. This would be the final step before the cutover.

Cutover

After all those tests have been performed and the infrastructure has been deemed production-ready, the last step is to cut traffic over from the old infrastructure.

Globally speaking, this is a complex process, but when we look at the database tier it’s more or less the same thing as a standard failover - something you may have done multiple times in the past. We covered it in detail in an earlier post; in short, the steps are: stop the traffic, ensure it’s stopped, wait while the application is moved to the new datacenter (DNS record changes or whatnot), do some smoke tests to ensure all looks good, go live, monitor closely for a while.

This cutover will require some downtime, as we can see. The problem is to make sure we have a consistent state across the old site and the new one. If we want to do it without downtime, then we need to set up master-master replication. The reason is that as we refresh DNS and move sessions over from the old site to the new one, both systems will be in use at the same time - until all sessions are redirected to the new site. In the meantime, any changes on the new site need to be reflected on the old site.

Using Galera Cluster as described in this blog post can also be a way to keep data between the two sites in sync. 

We are aware this is a very brief description of the data migration process. Hopefully it will be enough to point you in a good direction and help you identify what additional information you need to look for.

Related Resources

Webinar Replay: Migrating to MySQL, MariaDB Galera and/or Percona XtraDB Cluster

by Severalnines at August 05, 2015 05:28 AM

August 04, 2015

Peter Zaitsev

MySQL QA Episode 11: Valgrind Testing: Pro’s, Con’s, Why and How

Today’s episode is all about Valgrind – from the pro’s to the con’s, from the why to the how! This episode will be of interest to anyone who is or wants to work with Valgrind on a regular or semi-regular basis.

  1. Pro’s/Why
  2. Con’s
  3. How
    1. Using the latest version
      sudo [yum/apt-get] install valgrind
      #OR#
      sudo [yum/apt-get] remove valgrind
      sudo [yum/apt-get] install bzip2 glibc*
      wget http://valgrind.org/downloads/valgrind-3.10.1.tar.bz2
      tar -xf valgrind-3.10.1.tar.bz2; cd valgrind-3.10.1
      ./configure; make; sudo make install
      valgrind --version # This should now read 3.10.1
    2. VGDB (cd ./mysql-test)
      ./lib/v1/mysql-test-run.pl --start-and-exit --valgrind --valgrind-option="--leak-check=yes"
      --valgrind-option="--vgdb=yes" --valgrind-option="--vgdb-error=0" 1st
      #OR# valgrind --leak-check=yes --vgdb=yes --vgdb-error=0 ./mysqld --options…
      #ALSO CONSIDER# --num-callers=40 --show-reachable=yes --track-origins=yes
      gdb /Percona-server/bin/mysqld
      (gdb) target remote | vgdb
      (gdb) bt
    3. pquery Valgrind
    4. Valgrind stacks overview & analysis

Full-screen viewing @ 720p resolution recommended.

Check out the full MySQL QA Series today!

The post MySQL QA Episode 11: Valgrind Testing: Pro’s, Con’s, Why and How appeared first on MySQL Performance Blog.

by Roel Van de Paar at August 04, 2015 10:00 AM

Shlomi Noach

On SHOW BINLOG/RELAYLOG EVENTS

Some notes after working with SHOW BINLOG EVENTS and SHOW RELAYLOG EVENTS statements; there are a few gotchas and some interesting facts. My reflections also follow.

I'm calling these commands from orchestrator when working with Pseudo-GTID (which I do a lot). I prefer to work with an agent-free design, where a single, remote service can do everything: examine replication status, scan binary logs for information, and recover broken topologies by gluing together servers that were not previously directly associated.

Alas, documentation is short on these commands, and some stuff I learned the hard way.

Basically, the SHOW BINLOG/RELAYLOG EVENTS commands are a poor man's replacement for mysqlbinlog, only you can issue them over the MySQL protocol, and you do not have to have the binary/relay log files locally on your host.

Fun fact

The binary logs are called so because they are written in a binary (as opposed to textual) format. You are familiar with the binlog position you see in SHOW MASTER STATUS or SHOW SLAVE STATUS. You are familiar with the binlog position as you see it when you execute "mysqlbinlog mybinlog.001234". The position of a new entry equals the file size of the binary log at that time. If:

$ ls -l master/data/mysql-bin.015901
-rw-rw---- 1 user user 401408 Jul 18 02:44 master/data/mysql-bin.015901

Then the next entry will be at position 401408, as this is the file size in bytes.

And so when MySQL writes an entry to the binary log, it (of course) knows the entry's position in the binary log, but then also immediately knows the position of the next entry.

We'll revisit this fact later.

Output of SHOW BINLOG/RELAYLOG EVENTS

The output of both statements depends on the binlog_format. With Statement Based Replication it may look like:

> SHOW BINLOG EVENTS IN 'mysql-bin.015903' LIMIT 40,20;
+------------------+------+------------+-----------+-------------+---------------------------------------------------------------------------------------------------------------------+
| Log_name         | Pos  | Event_type | Server_id | End_log_pos | Info                                                                                                                |
+------------------+------+------------+-----------+-------------+---------------------------------------------------------------------------------------------------------------------+
| mysql-bin.015903 | 3650 | Gtid       |         1 |        3698 | SET @@SESSION.GTID_NEXT= '230ea8ea-81e3-11e4-972a-e25ec4bd140a:790000'                                              |
| mysql-bin.015903 | 3698 | Query      |         1 |        3785 | BEGIN                                                                                                               |
| mysql-bin.015903 | 3785 | User var   |         1 |        3832 | @`id`=7                                                                                                             |
| mysql-bin.015903 | 3832 | Query      |         1 |        3969 | use `test`; replace into test.t (id, i, t) values (@id, @id, now())                                                 |
| mysql-bin.015903 | 3969 | Xid        |         1 |        4000 | COMMIT /* xid=967 */                                                                                                |
| mysql-bin.015903 | 4000 | Gtid       |         1 |        4048 | SET @@SESSION.GTID_NEXT= '230ea8ea-81e3-11e4-972a-e25ec4bd140a:790001'                                              |
| mysql-bin.015903 | 4048 | Query      |         1 |        4135 | BEGIN                                                                                                               |
| mysql-bin.015903 | 4135 | User var   |         1 |        4182 | @`id`=9                                                                                                             |
| mysql-bin.015903 | 4182 | Query      |         1 |        4319 | use `test`; replace into test.t (id, i, t) values (@id, @id, now())                                                 |
| mysql-bin.015903 | 4319 | Xid        |         1 |        4350 | COMMIT /* xid=969 */                                                                                                |
| mysql-bin.015903 | 4350 | Gtid       |         1 |        4398 | SET @@SESSION.GTID_NEXT= '230ea8ea-81e3-11e4-972a-e25ec4bd140a:790002'                                              |
| mysql-bin.015903 | 4398 | Query      |         1 |        4576 | use `meta`; drop view if exists `meta`.`_pseudo_gtid_hint__asc(55B50F9C:00000000000000D4:C96BF9F7)` |
| mysql-bin.015903 | 4576 | Gtid       |         1 |        4624 | SET @@SESSION.GTID_NEXT= '230ea8ea-81e3-11e4-972a-e25ec4bd140a:790003'                                              |
| mysql-bin.015903 | 4624 | Query      |         1 |        4711 | BEGIN                                                                                                               |
| mysql-bin.015903 | 4711 | User var   |         1 |        4758 | @`id`=6                                                                                                             |
| mysql-bin.015903 | 4758 | Query      |         1 |        4895 | use `test`; replace into test.t (id, i, t) values (@id, @id, now())                                                 |
| mysql-bin.015903 | 4895 | Xid        |         1 |        4926 | COMMIT /* xid=971 */                                                                                                |
| mysql-bin.015903 | 4926 | Gtid       |         1 |        4974 | SET @@SESSION.GTID_NEXT= '230ea8ea-81e3-11e4-972a-e25ec4bd140a:790004'                                              |
| mysql-bin.015903 | 4974 | Query      |         1 |        5077 | BEGIN                                                                                                               |
| mysql-bin.015903 | 5077 | User var   |         1 |        5133 | @`hostname`=_utf8 0x736E6F6163682D616D7339 COLLATE utf8_general_ci                                                  |
+------------------+------+------------+-----------+-------------+---------------------------------------------------------------------------------------------------------------------+

The first thing to note is that there's far less data in this output compared to mysqlbinlog's. For example, you can't see the TIMESTAMP at which the queries were issued. You can't see additional metadata like the RAND() seed. That much is documented.

You do know the binlog file name, the position, the end-position (which is the position of the next entry), and the originating server_id.

However, with Row Based Replication things are even less informative:

> SHOW BINLOG EVENTS IN 'mysql-bin.015902' LIMIT 20,20;
+------------------+------+-------------+-----------+-------------+------------------------------------------------------------------------+
| Log_name         | Pos  | Event_type  | Server_id | End_log_pos | Info                                                                   |
+------------------+------+-------------+-----------+-------------+------------------------------------------------------------------------+
| mysql-bin.015902 | 1365 | Query       |         1 |        1453 | BEGIN                                                                  |
| mysql-bin.015902 | 1453 | Table_map   |         1 |        1535 | table_id: 71 (meta.pseudo_gtid_status)                                 |
| mysql-bin.015902 | 1535 | Update_rows |         1 |        1829 | table_id: 71 flags: STMT_END_F                                         |
| mysql-bin.015902 | 1829 | Xid         |         1 |        1860 | COMMIT /* xid=26 */                                                    |
| mysql-bin.015902 | 1860 | Gtid        |         1 |        1908 | SET @@SESSION.GTID_NEXT= '230ea8ea-81e3-11e4-972a-e25ec4bd140a:789767' |
| mysql-bin.015902 | 1908 | Query       |         1 |        1988 | BEGIN                                                                  |
| mysql-bin.015902 | 1988 | Table_map   |         1 |        2035 | table_id: 70 (test.t)                                                  |
| mysql-bin.015902 | 2035 | Update_rows |         1 |        2129 | table_id: 70 flags: STMT_END_F                                         |
| mysql-bin.015902 | 2129 | Xid         |         1 |        2160 | COMMIT /* xid=29 */                                                    |
| mysql-bin.015902 | 2160 | Gtid        |         1 |        2208 | SET @@SESSION.GTID_NEXT= '230ea8ea-81e3-11e4-972a-e25ec4bd140a:789768' |
| mysql-bin.015902 | 2208 | Query       |         1 |        2288 | BEGIN                                                                  |
| mysql-bin.015902 | 2288 | Table_map   |         1 |        2335 | table_id: 70 (test.t)                                                  |
| mysql-bin.015902 | 2335 | Update_rows |         1 |        2429 | table_id: 70 flags: STMT_END_F                                         |
| mysql-bin.015902 | 2429 | Xid         |         1 |        2460 | COMMIT /* xid=32 */                                                    |
| mysql-bin.015902 | 2460 | Gtid        |         1 |        2508 | SET @@SESSION.GTID_NEXT= '230ea8ea-81e3-11e4-972a-e25ec4bd140a:789769' |
| mysql-bin.015902 | 2508 | Query       |         1 |        2588 | BEGIN                                                                  |
| mysql-bin.015902 | 2588 | Table_map   |         1 |        2635 | table_id: 70 (test.t)                                                  |
| mysql-bin.015902 | 2635 | Update_rows |         1 |        2729 | table_id: 70 flags: STMT_END_F                                         |
| mysql-bin.015902 | 2729 | Xid         |         1 |        2760 | COMMIT /* xid=34 */                                                    |
| mysql-bin.015902 | 2760 | Gtid        |         1 |        2808 | SET @@SESSION.GTID_NEXT= '230ea8ea-81e3-11e4-972a-e25ec4bd140a:789770' |
+------------------+------+-------------+-----------+-------------+------------------------------------------------------------------------+

What's this? We don't get the statement anymore. I mean, sure, this is RBR, but there's no more info! All we know is that there was an Update_rows on test.t - and even that only after we've crossed two different entries: one mapping test.t to table_id 70, the other telling us an UPDATE operation was made on 70.

This is the reason why Pseudo GTID injection is made via DDL.

With RBR we also get some excessive information. The table_id mentioned above is true for this server, but on a slave the table would have a different id, and the SHOW output would look different. Likewise, the commit number in "COMMIT /* xid=32 */" changes from server to server. There are also other entries which may look different; character encoding changes shape. Really weird stuff.

The output of SHOW RELAYLOG EVENTS is very similar, except for the following:

  1. Obviously we're parsing and outputting relay logs
  2. Pos column relates to the position in the relay log where the entry starts (makes sense)
  3. End_log_pos column relates to the next position in the master's binary log that correlates to this query. Read that again: there is no telling what the end-log-pos is for the relay log entry itself. We'll revisit that shortly.

Locking issues

Unfortunately, issuing a SHOW BINLOG EVENTS completely locks down the binary logs on the server.

Yes. This means you cannot write to the binary logs while issuing the SHOW statement. And that means you cannot write at all; everything piles up and your server is under complete lockdown.

Before you throw your hands up in the air and move on to read something that can cheer you up, note that things are not completely bad. At Booking.com we are able to scan 24 hours worth of binary logs (i.e. dozens to hundreds of binary logs) on business-critical MySQL masters without making an impact. How?

SHOW...LIMIT offset, row_count vs SHOW...FROM pos LIMIT row_count

Of course the solution is to do many small steps instead of few giant steps. Reading all of a binary log's entries is suicidal to your application & database. The SHOW BINLOG/RELAYLOG EVENTS statements support a LIMIT clause.

As per previous example, you may:

SHOW BINLOG EVENTS IN 'mysql-bin.015903' LIMIT 40,20;

This skips 40 entries (offset), then reads 20 (row_count).

However, much like a non-indexed table scan, to skip the first 40 entries you must nonetheless iterate through them. Since we don't know exactly where in the binlog the 41st entry begins, we need to unpack the binary log from the start and find out the full-scan way.

Don't iterate a binary log's entries via LIMIT 0,1000; LIMIT 1000,1000; LIMIT 2000,1000; ... This is O(n²)

In contrast, you may:

SHOW BINLOG EVENTS IN 'mysql-bin.015903' FROM 3650 LIMIT 20;

Remember that an entry's position is the binlog file size just before the entry was written? To execute the above, MySQL merely seeks 3650 bytes into the binlog file. No scan required.

You must provide a valid position value. If I just guess my way:

> SHOW BINLOG EVENTS IN 'mysql-bin.015903' FROM 3651 LIMIT 20;
ERROR 1220 (HY000): Error when executing command SHOW BINLOG EVENTS: Wrong offset or I/O error

And so you must have prior knowledge of positions. This is fine if you're scanning a binary log: you simply read the End_log_pos of the last entry you've just read and memorize it. Thus, to scan a binary log, you can:

> SHOW BINLOG EVENTS IN 'mysql-bin.015903' LIMIT 20;

> SHOW BINLOG EVENTS IN 'mysql-bin.015903' FROM 1499 LIMIT 20;

> SHOW BINLOG EVENTS IN 'mysql-bin.015903' FROM 3650 LIMIT 20;
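
A minimal shell sketch of this chunked scan, feeding each chunk's last End_log_pos back in as the next FROM position (the binlog name is from the example above; a local mysql client with credentials is assumed):

pos=4   # the first event in a binary log starts at position 4, after the magic header
while true; do
  out=$(mysql -N -e "SHOW BINLOG EVENTS IN 'mysql-bin.015903' FROM $pos LIMIT 20")
  [ -z "$out" ] && break                              # no more entries: end of file
  pos=$(echo "$out" | tail -n 1 | awk '{print $5}')   # End_log_pos of the last entry
done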

But it's more difficult to do this for relay logs: remember that End_log_pos in SHOW RELAYLOG EVENTS relates to the next position in the master's binary log, so you need to resort to an off-by-one or something. The good news, however, is that reading relay logs does not block writes on your server.

Locking thoughts

As per my comments in the bug report, I don't see why SHOW BINLOG EVENTS should always lock writes:

  • If we're scanning an old binary log (not the one being written to) then obviously we should not be blocking writes.
  • If we're scanning the current binlog, but are known to end with a statement that is already written (i.e. not trying to scan past the current writing position) then we should not be blocking writes.
  • OK to block writes when trying to read past current position.
  • OK to block on a FLUSH BINARY LOGS or PURGE MASTER LOGS

Why is the output so limited?

I'm actually unsure why the output of these commands is so limited compared to the mysqlbinlog output. A replicating slave talks to the master over the MySQL protocol, and the slave does get all the data. It does get event metadata (timestamp, random seed, session variables). It does get complete statement/row data. What I'm saying is that this functionality is already built in; it could very easily be added to the SHOW commands.

by shlomi at August 04, 2015 09:38 AM

August 03, 2015

Chris Calender

LDAP Authentication with MariaDB PAM Plugin

This is getting more and more common, so I wanted to provide the steps required to get LDAP authentication working with MariaDB PAM plugin.

Unless you’re already familiar with setting up the MariaDB PAM plugin, I’d first recommend getting this to work with a standard Linux user (steps 1-4), and then, once all is working fine, progressing to the LDAP users (steps 5-10). (And if you do not want to test this for the Linux user account, you may skip steps #2 and #3.)

  1. Enable plugin by running the following from the command line client:
    INSTALL SONAME 'auth_pam';

    You should see an entry like this afterward in SHOW PLUGINS:

    | pam | ACTIVE | AUTHENTICATION | auth_pam.so | GPL |
  2. Create the mysql user account (note it does not have a password, as it will obtain this from your Linux user, and eventually the LDAP account) and provide it with the GRANTS you want it to have:
    CREATE USER 'chris'@'localhost' IDENTIFIED VIA pam USING 'mariadb';
    GRANT ALL ON db1.* TO 'chris'@'localhost';

    Note “mariadb” is the PAM service name I’ve specified. It is good to specify this so you don’t overwrite the existing default policy (in case it is being used).

  3. Create PAM policy in “/etc/pam.d/mariadb” (ensure readable and ensure the file name, “mariadb”, matches the PAM service name you specified for your user in the above step):
    auth required pam_unix.so
    account required pam_unix.so

    (Restart MariaDB instance afterward.)

    Then, you should be able to connect via the command line with (assuming you have a Linux user ‘chris’):

    mysql -u chris -p

    This should allow you to login. Now you can move on to integrating LDAP.

  4. Verify the LDAP user exists with:
    shell> id chris

    It should return uid, gid, groups, etc.

  5. If using the MySQL client, you’ll need to enable the clear text plugin:
    [mysqld]
    pam_use_cleartext_plugin

    If you need to do this, it is recommended you begin using SSL connections, if not already.

    Also, you’ll need to restart MariaDB after this change, but wait until after step #6.

  6. We need to edit the PAM policy in “/etc/pam.d/mariadb” to:
    auth required pam_ldap.so
    account required pam_ldap.so

    (We’re basically just replacing “pam_unix.so” with “pam_ldap.so”.)

    Now, restart MariaDB.

  7. Next, you need to ensure that you have libpam-ldap/openldap installed (so you have “pam_ldap.so”, that is the key). You can install this on RedHat/CentOS with the following:
    # yum install openldap openldap-clients
  8. After that, you’ll need to configure /etc/ldap.conf. Here is a sample configuration:
    debug 10 # set debug level only during the initial configuration
    base dc=corp,dc=company_name,dc=com
    binddn cn=service_account,OU=Service Accounts,OU=US Security,DC=corp,DC=company_name,DC=com
    bindpw <password>
    timelimit 120
    idle_timelimit 3600
    uri ldaps://<LDAP URL>:<LDAP PORT>

    And if using Active Directory, you should also add these lines:

    pam_login_attribute samaccountname
    pam_member_attribute member
    nss_map_objectclass posixAccount user
    nss_map_objectclass shadowAccount user
    nss_map_attribute uid sAMAccountName
    nss_map_attribute homeDirectory unixHomeDirectory
    nss_map_attribute shadowLastChange pwdLastSet
    nss_map_objectclass posixGroup group
    nss_map_attribute uniqueMember member
    pam_login_attribute sAMAccountName
    pam_filter objectclass=User
    pam_password ad

    Note I obtained the sample ldap.conf from this Alexander Rubin post.

  9. After that, make sure you can connect to ldap and that you can search ldap with ldapsearch, which you can verify with:
    shell> telnet <ldap server> <ldap port> (this should report "connected")
    shell> ldapsearch -w <password for bind user> -x -D 'cn=USER,OU=People...' "(&(ObjectClass=user)(cn=USERNAME))"
  10. After this, things should be all set up, as the plugin is installed properly, the user has been created in MariaDB, we’ve installed pam_ldap.so, we’ve updated /etc/pam.d/mariadb to use the pam_ldap.so instead of the pam_unix.so, and created the appropriate ldap.conf. Thus you should be able to login with the following (this time assuming “chris” is an LDAP user account):
    mysql -u chris -p
  11. If you want to know more about user mapping, you should read this post by Geoff Montee as well as this post by Alexander Rubin.

    I hope this helps.

by chris at August 03, 2015 10:13 PM

Peter Zaitsev

Percona XtraBackup 2.2.12 is now available

Percona is glad to announce the release of Percona XtraBackup 2.2.12 on August 3, 2015. Downloads are available from our download site or Percona Software Repositories.

Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, Percona XtraBackup drives down backup costs while providing unique features for MySQL backups.

Bugs Fixed:

  • Percona XtraBackup would segfault during the prepare phase of certain FTS pages. Bug fixed #1460138.
  • Fixed compilation error due to missing dependency caused by the upstream bug #77226. Bug fixed #1461129.
  • A regression introduced by the fix for bug #1403237 in Percona XtraBackup 2.2.8 could cause xtrabackup to read the redo log from an incorrect offset, which would cause an assertion. Bug fixed #1464608.
  • Fixed uninitialized current_thd thread-local variable. This also completely fixes bug #1415191. Bug fixed #1467574.
  • Since the release of Percona XtraBackup 2.2.11, innobackupex issues a FLUSH TABLES before running the FLUSH TABLES WITH READ LOCK. While this helps backups in some situations, it also implies that the FLUSH TABLES statement will be written to the binary log. On MariaDB 10.0 with GTID enabled, when the backup was taken on a slave, this altered the GTID of that slave and Percona XtraBackup didn’t see the correct GTID anymore. Bug fixed #1466446 (Julien Pivotto).
  • RPM compilation of Percona XtraBackup was still requiring bzr. Bug fixed #1466888 (Julien Pivotto).
  • Compiling Percona XtraBackup RPMs with XB_VERSION_EXTRA option would create an incorrect RPM version. Bug fixed #1467424 (Julien Pivotto).
  • Percona XtraBackup would complete successfully even when the redo log wasn’t copied completely. This means that backups were considered successful even when they were corrupt. Bug fixed #1470847.
  • In rare cases when there are two or more tablespaces with the same ID in the data directory, xtrabackup picks up the first one in lexical order, which could lead to losing the correct table. Bug fixed #1475487.
  • Percona XtraBackup was missing revision_id in binaries. Bug fixed #1394174.

Release notes with all the bugfixes for Percona XtraBackup 2.2.12 are available in our online documentation. Bugs can be reported on the launchpad bug tracker. Percona XtraBackup is an open source, free MySQL hot backup software that performs non-blocking backups for InnoDB and XtraDB databases.

The post Percona XtraBackup 2.2.12 is now available appeared first on MySQL Performance Blog.

by Hrvoje Matijakovic at August 03, 2015 12:31 PM

Checkpoint strikes back

In my recent benchmarks for MongoDB, we can see that the two engines WiredTiger and TokuMX suffer from periodic drops in throughput, which are clearly related to the checkpoint interval - and therefore I attribute them to checkpoint activity.

The funny thing is that I thought we had solved checkpointing issues in InnoDB once and for all. There are a bunch of posts on this issue in InnoDB, dated some 4 years ago. We did a lot of research back then while working on a fix for Percona Server.

But, like Baron said, “History Repeats”… and it seems it also repeats in technical issues.

So, let’s take a look at what checkpointing is, and why it is a problem for storage engines.

As you may know, a transactional engine writes transaction logs (you may see names like “redo logs”, “write-ahead logs” or WAL) to be able to perform crash recovery after a database crash or server power outage. To keep the recovery time bounded (we all expect the database to start quickly), the engine has to limit how many changes accumulate in the logs. For this, a transactional engine performs a “checkpoint”, which basically synchronizes the changes in memory with the corresponding changes in the logs - so old log records can be deleted. Often this results in writing changed database pages from memory to permanent storage.

InnoDB takes the approach of limiting the total size of its log files (innodb_log_file_size * innodb_log_files_in_group), which I call “size-limited checkpointing”.
Both TokuDB and WiredTiger instead limit changes by time period (by default 60 seconds), which means the log files may grow without bound within a given time interval (I call this “time-limited checkpointing”).
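
For reference, a minimal my.cnf sketch of the knobs that bound InnoDB’s redo log size (the values are illustrative only):

[mysqld]
# total redo log capacity = innodb_log_file_size * innodb_log_files_in_group
innodb_log_file_size = 2G
innodb_log_files_in_group = 2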

Another difference is that InnoDB takes a “fuzzy checkpointing” approach (which was not really “fuzzy” until we fixed it): instead of waiting until the size limit is reached, it performs checkpointing all the time, and the more intensively the closer it gets to the log size limit. This allows it to achieve a more or less smooth throughput, without significant drops.

Unlike InnoDB, both TokuDB (of that I am sure) and WiredTiger (here I speculate; I did not look into WiredTiger internals) wait until the last moment and perform the checkpoint strictly at the prescribed interval. If the database happens to contain many changes in memory, this results in performance stalls. The effect is close to hitting a wall at full speed: user queries get locked until the engine writes out all the changes in memory it has to write.

Interestingly enough, RocksDB, because it has a different architecture (I may write about it in the future, but for now I will point to the RocksDB wiki), does not have this problem with checkpoints (it does, however, have its own background activity, like level compactions and tombstone maintenance, but that is a different topic).

I do not know how WiredTiger is going to approach this issue with checkpoints, but we are looking to improve TokuDB to make this less painful for user queries - and eventually to move to a “fuzzy checkpointing” model.

The post Checkpoint strikes back appeared first on MySQL Performance Blog.

by Vadim Tkachenko at August 03, 2015 10:00 AM

Jean-Jerome Schmidt

Deploy Asynchronous Replication Slave to MariaDB Galera Cluster 10.x using ClusterControl

Combining Galera and asynchronous replication in the same MariaDB setup, aka Hybrid Replication, can be useful - e.g. as a live backup node in a remote datacenter or reporting/analytics server. We already blogged about this setup for Codership/Galera or Percona XtraDB Cluster users, but master failover did not work for MariaDB because of its different GTID approach. In this post, we will show you how to deploy an asynchronous replication slave to MariaDB Galera Cluster 10.x (with master failover!), using GTID with ClusterControl v1.2.10. 

Preparing the Master

First and foremost, you must ensure that the master and slave nodes are running MariaDB Galera 10.0.2 or later. A MariaDB replication slave requires at least one master with GTID among the Galera nodes. However, we would recommend configuring all of the MariaDB Galera nodes as masters. GTID, which is automatically enabled in MariaDB, will be used to do master failover.

The following must be true for the masters:

  • At least one master among the Galera nodes
  • All masters must be configured with the same domain ID
  • log_slave_updates must be enabled
  • All masters’ MariaDB port is accessible by ClusterControl and slaves
  • Must be running MariaDB version 10.0.2 or later

To configure a Galera node as master, change the MariaDB configuration file for that node as per below:

gtid_domain_id=<must be same across all mariadb servers participating in replication>
server_id=<must be unique>
binlog_format=ROW
log_slave_updates=1
log_bin=binlog


Preparing the Slave

For the slave, you will need a separate host or VM, with or without MariaDB installed. If you do not have MariaDB installed, and choose to have ClusterControl install MariaDB on the slave, ClusterControl will perform the necessary actions to prepare the slave: configure the root password (based on monitored_mysql_root_password), create the slave user (based on repl_user, repl_password), configure MariaDB, start the server, and finally start replication.

In short, we must perform the following actions beforehand:

  • The slave node must be accessible using passwordless SSH from the ClusterControl server
  • MariaDB port (default 3306) and netcat port 9999 on the slave are open for connections
  • You must configure the following options in the ClusterControl configuration file for the respective cluster ID under /etc/cmon.cnf or /etc/cmon.d/cmon_<cluster ID>.cnf:
    • repl_user=<the replication user>
    • repl_password=<password for replication user>
    • monitored_mysql_root_password=<the mysql root password of all nodes including slave>
  • The slave configuration template file must be configured beforehand, and must have at least the following variables defined in the MariaDB configuration template:
    • gtid_domain_id (the value must be the same across all nodes participating in the replication)
    • server_id
    • basedir
    • datadir

To prepare the MariaDB configuration file for the slave, go to ClusterControl > Manage > Configurations > Template Configuration files > edit my.cnf.slave and add the following lines:

[mysqld]
bind-address=0.0.0.0
gtid_domain_id=1
log_bin=binlog
log_slave_updates=1
expire_logs_days=7
server_id=1001
binlog_format=ROW
slave_net_timeout=60
basedir=/usr
datadir=/var/lib/mysql


Attaching a Slave via ClusterControl

Let’s now add a MariaDB slave using ClusterControl. Our example cluster is running MariaDB 10.1.2 with ClusterControl v1.2.10. Our deployment will look as follows:

1. Configure Galera nodes as master. Go to ClusterControl > Manage > Configurations, and click Edit/View on each configuration file and append the following lines under mysqld directive:

mariadb1:

gtid_domain_id=1
server_id=101
binlog_format=ROW
log_slave_updates=1
log_bin=binlog
expire_logs_days=7

mariadb2:

gtid_domain_id=1
server_id=102
binlog_format=ROW
log_slave_updates=1
log_bin=binlog
expire_logs_days=7

mariadb3:

gtid_domain_id=1
server_id=103
binlog_format=ROW
log_slave_updates=1
log_bin=binlog
expire_logs_days=7

2. Perform a rolling restart from ClusterControl > Manage > Upgrades > Rolling Restart. Optionally, you can restart one node at a time under ClusterControl > Nodes > select the corresponding node > Shutdown > Execute, and then start it again.

3. On the ClusterControl node, setup passwordless SSH to the slave node:

$ ssh-copy-id -i ~/.ssh/id_rsa 10.0.0.128

4. Then, ensure the following lines exist in the corresponding cmon.cnf or cmon_<cluster ID>.cnf:

repl_user=slave
repl_password=slavepassword123
monitored_mysql_root_password=myr00tP4ssword

Restart CMON daemon to apply the changes:

$ service cmon restart

5. Go to ClusterControl > Manage > Configurations > Create New Template or Edit/View existing template, and then add the following lines:

[mysqld]
bind-address=0.0.0.0
gtid_domain_id=1
log_bin=binlog
log_slave_updates=1
expire_logs_days=7
server_id=1001
binlog_format=ROW
slave_net_timeout=60
basedir=/usr
datadir=/var/lib/mysql

6. Now, we are ready to add the slave. Go to ClusterControl > Cluster Actions > Add Replication Slave. Choose a master and the configuration file prepared above.

Click on Proceed. A job will be triggered and you can monitor the progress at ClusterControl > Logs > Jobs. You should notice that ClusterControl uses non-GTID replication in the Job details.

You can simply ignore that, as we will set up MariaDB GTID replication manually later. Once the add job is completed, you should see the master and slave nodes in the grids.

At this point, the slave is replicating from the designated master the old way (using binlog file/position).

Replication using MariaDB GTID

Ensure the slave has caught up with the master host; the lag value should be 0. Then, stop the slave on slave1:

MariaDB> STOP SLAVE;

Verify the slave status and ensure Slave_IO_Running and Slave_SQL_Running return No. Retrieve the latest values of Master_Log_File and Read_Master_Log_Pos:

MariaDB> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 10.0.0.131
                  Master_User: slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.000022
          Read_Master_Log_Pos: 55111710
               Relay_Log_File: ip-10-0-0-128-relay-bin.000002
                Relay_Log_Pos: 28699373
        Relay_Master_Log_File: binlog.000022
             Slave_IO_Running: No
            Slave_SQL_Running: No

Then on the master (mariadb1) run the following statement to retrieve the GTID:

MariaDB> SELECT binlog_gtid_pos('binlog.000022',55111710);
+-------------------------------------------+
| binlog_gtid_pos('binlog.000022',55111710) |
+-------------------------------------------+
| 1-103-613991                              |
+-------------------------------------------+

The result of the function call is the current GTID, which corresponds to that binary log position on the master.

Set the GTID slave position on the slave node. Run the following statement on slave1:

MariaDB> STOP SLAVE;
MariaDB> SET GLOBAL gtid_slave_pos = '1-103-613991';
MariaDB> CHANGE MASTER TO master_use_gtid=slave_pos;
MariaDB> START SLAVE;

The slave will start catching up with the master using GTID, as you can verify with the SHOW SLAVE STATUS command:

        ...
        Seconds_Behind_Master: 384
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 101
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-103-626963

 

Master Failover and Recovery

ClusterControl doesn’t support MariaDB slave failover with GTID via the ClusterControl UI in the current version (v1.2.10); this will be supported in v1.2.11. So, if you are using 1.2.10 or earlier, failover has to be performed manually whenever the designated master fails. Initially, when you added a replication slave via ClusterControl, it only added the slave user on the designated master (mariadb1). To ensure failover works, we have to explicitly add the slave user on mariadb2 and mariadb3.

Run the following command on mariadb2 or mariadb3 once; it should replicate to all Galera nodes:

MariaDB> GRANT REPLICATION SLAVE ON *.* TO slave@'10.0.0.128' IDENTIFIED BY 'slavepassword123';
MariaDB> FLUSH PRIVILEGES;

If mariadb1 fails, to switch to another master you just need to run the following statements on slave1:

MariaDB> STOP SLAVE;
MariaDB> CHANGE MASTER TO MASTER_HOST='10.0.0.132';
MariaDB> START SLAVE;

The slave will resume from the last Gtid_IO_Pos. Check the slave status to verify everything is working:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.0.132
                  Master_User: slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.000020
          Read_Master_Log_Pos: 140875476
               Relay_Log_File: ip-10-0-0-128-relay-bin.000002
                Relay_Log_Pos: 1897915
        Relay_Master_Log_File: binlog.000020
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 130199239
              Relay_Log_Space: 12574457
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 133
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 102
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-103-675356

That’s it! Please give it a try and let us know how these instructions worked for you.

by Severalnines at August 03, 2015 05:22 AM

August 01, 2015

Open Query

on ORDER BY optimization | Domas Mituzas

http://dom.as/2015/07/30/on-order-by-optimization/

An insightful exploration by Domas (Facebook) on how some of the MySQL optimiser’s decision logic is sometimes naive, in this case regarding ORDER BY optimisation.

Quite often, “simple” logic can work better than complex logic as chasing all the corner cases can just make things worse – but sometimes, logic can be too simple.

Everything must be made as simple as possible, but no simpler.
— Albert Einstein / Roger Sessions

by Arjen Lentz at August 01, 2015 03:44 AM

July 31, 2015

MariaDB AB

MariaDB automatic failover with MaxScale and MariaDB Replication Manager

guillaumelefranc

Mandatory disclaimer: the techniques described in this blog post are experimental, so use them at your own risk. Neither I nor MariaDB Corporation will be held responsible if anything bad happens to your servers.

Context

MaxScale 1.2.0 and above can call external scripts on monitor events. In the case of a classic master-slave setup, this can be used for automatic failover and promotion using MariaDB Replication Manager. The following use case is demonstrated using three MariaDB servers (one master, two slaves) and a MaxScale server. Please refer to my Vagrant files if you want to jumpstart such a testing platform.

Requirements

  • A mariadb-repmgr binary, version 0.4.0 or above. Grab it from the GitHub Releases page, and extract it into /usr/local/bin/ on your MaxScale server.
  • A working MaxScale installation with MySQL Monitor setup and whatever router you like. Please refer to the MaxScale docs for more information on how to configure it correctly.

MaxScale installation and configuration

The MySQL Monitor has to be configured to call scripts. Add the following three lines to your [MySQL Monitor] section:

monitor_interval=1000
script=/usr/local/bin/failover.sh
events=master_down
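For context, a complete monitor section might then look like this (a sketch; the servers, user and passwd values are placeholders from your existing MaxScale configuration):

[MySQL Monitor]
type=monitor
module=mysqlmon
servers=server1,server2,server3
user=maxscale
passwd=maxscalepwd
monitor_interval=1000
script=/usr/local/bin/failover.sh
events=master_down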

Failover script

As of the current MaxScale development branch, custom options are not supported, so we have to use a wrapper script to call MariaDB Replication Manager. Create the following script in /usr/local/bin/failover.sh:

#!/bin/bash
# failover.sh
# wrapper script to repmgr

# user:password pair, must have administrative privileges.
user=root:admin
# user:password pair, must have REPLICATION SLAVE privileges. 
repluser=repluser:replpass

ARGS=$(getopt -o '' --long 'event:,initiator:,nodelist:' -- "$@")

eval set -- "$ARGS"

while true; do
    case "$1" in
        --event)
            shift;
            event=$1
            shift;
        ;;
        --initiator)
            shift;
            initiator=$1
            shift;
        ;;
        --nodelist)
            shift;
            nodelist=$1
            shift;
        ;;
        --)
            shift;
            break;
        ;;
    esac
done
cmd="mariadb-repmgr -user $user -rpluser $repluser -hosts $nodelist -failover=dead"
eval $cmd

Make sure to set the user and repluser script variables to your actual user:password pairs for the administrative user and the replication user. Also make sure to make the script executable (chmod +x), as it's very easy to forget that step.
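To sanity-check the wrapper before letting MaxScale call it, you can make it executable and invoke it by hand with the same style of arguments the monitor passes (a sketch; the node list format is an assumption based on the getopt options above):

$ chmod +x /usr/local/bin/failover.sh
$ /usr/local/bin/failover.sh --event=master_down --initiator=server3 \
      --nodelist=192.168.56.111:3306,192.168.56.112:3306,192.168.56.113:3306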

Testing that failover works

Let's check the current status, where I have configured server3 as a master and server1-2 as slaves:

$ maxadmin -pmariadb "show servers"
Server 0x1b1f440 (server1)
    Server:             192.168.56.111
    Status:                     Slave, Running
    Protocol:           MySQLBackend
    Port:               3306
    Server Version:         10.0.19-MariaDB-1~trusty-log
    Node Id:            1
    Master Id:          3
    Slave Ids:          
    Repl Depth:         1
    Number of connections:      0
    Current no. of conns:       0
    Current no. of operations:  0
Server 0x1b1f330 (server2)
    Server:             192.168.56.112
    Status:                     Slave, Running
    Protocol:           MySQLBackend
    Port:               3306
    Server Version:         10.0.19-MariaDB-1~trusty-log
    Node Id:            2
    Master Id:          3
    Slave Ids:          
    Repl Depth:         1
    Number of connections:      8
    Current no. of conns:       1
    Current no. of operations:  0
Server 0x1a7b2c0 (server3)
    Server:             192.168.56.113
    Status:                     Master, Running
    Protocol:           MySQLBackend
    Port:               3306
    Server Version:         10.0.19-MariaDB-1~trusty-log
    Node Id:            3
    Master Id:          -1
    Slave Ids:          1, 2 
    Repl Depth:         0
    Number of connections:      2
    Current no. of conns:       0
    Current no. of operations:  0

Everything looks normal. Let's try failover by shutting down server3.

server3# service mysql stop
     * Stopping MariaDB database server mysqld    [ OK ]

Let's check the server status again:

$ maxadmin -pmariadb "show servers"
Server 0x1b1f440 (server1)
    Server:             192.168.56.111
    Status:                     Slave, Running
    Protocol:           MySQLBackend
    Port:               3306
    Server Version:         10.0.19-MariaDB-1~trusty-log
    Node Id:            1
    Master Id:          2
    Slave Ids:          
    Repl Depth:         1
    Number of connections:      0
    Current no. of conns:       0
    Current no. of operations:  0
Server 0x1b1f330 (server2)
    Server:             192.168.56.112
    Status:                     Master, Running
    Protocol:           MySQLBackend
    Port:               3306
    Server Version:         10.0.19-MariaDB-1~trusty-log
    Node Id:            2
    Master Id:          -1
    Slave Ids:          1
    Repl Depth:         0
    Number of connections:      8
    Current no. of conns:       1
    Current no. of operations:  0
Server 0x1a7b2c0 (server3)
    Server:             192.168.56.113
    Status:                     Down
    Protocol:           MySQLBackend
    Port:               3306
    Server Version:         10.0.19-MariaDB-1~trusty-log
    Node Id:            3
    Master Id:          -1
    Slave Ids:          
    Repl Depth:         0
    Number of connections:      2
    Current no. of conns:       0
    Current no. of operations:  0

MariaDB Replication Manager has promoted server2 to be the new master, and server1 has been reslaved to server2. server3 is now marked as down. If you restart server3, it will be marked as "Running" but not as a slave. To put it back in the cluster, you just need to repoint replication to the new master with GTID, using this command: CHANGE MASTER TO MASTER_HOST='server2', MASTER_USE_GTID=CURRENT_POS; The failover script could handle this case as well, although that remains to be tested.
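For reference, rejoining server3 after a restart might look like this (a sketch run on server3, assuming its GTID state is intact and server2 is reachable under that name; repluser/replpass come from the wrapper script above):

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='server2',
  MASTER_USER='repluser',
  MASTER_PASSWORD='replpass',
  MASTER_USE_GTID=CURRENT_POS;
START SLAVE;
SHOW SLAVE STATUS\G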

About the Author

guillaumelefranc's picture

Guillaume Lefranc is managing the MariaDB Remote DBA Services Team, delivering performance tuning and high availability services worldwide. He's a believer in DevOps culture, Agile software development, and Craft Brewing.

by guillaumelefranc at July 31, 2015 12:14 PM

Peter Zaitsev

MySQL QA Episode 10: Reproducing and Simplifying: How to get it Right

Welcome to the 10th episode in the MySQL QA series! Today we’ll talk about reproducing and simplifying: How to get it Right.

Note that unless you are a QA engineer stuck on a remote (and additionally difficult-to-reproduce or difficult-to-reduce) bug, this episode will largely be uninteresting for you.

However, what you may like to see – especially if you watched episodes 7 (and possibly 8 and 9) – is how reducer automatically generates handy start/stop/client (cl) etc. scripts, all packed into a handy bug tarball, in combination with the reduced SQL testcase.

This somewhat separate part is covered directly after the introduction (ends at 11:17), as well as with an example towards the end of the video (starts at time index 30:35).

The “in-between” part (11:17 to 30:35) is all about reproducing and simplifying, which – unless you are working on a remote case – can likely be skipped by most; remember that 85-95% of bugs reproduce & reduce very easily, and for those, episode 7, episode 8 (especially the FORCE_SKIPV/FORCE_SPORADIC parts) and the script-related parts of this episode (start to 11:17 and 30:35 to end) will suffice.

As per the above, the topics covered in this video are:
1. percona-qa/reproducing_and_simplification.txt
2. Automatically generated scripts (produced by Reducer)

========= Example bug excerpt for copy/paste – as per the video
Though the testcase above should suffice for reproducing the bug, the attached tarball gives the testcase as an exact match of our system, including some handy utilities
$ vi {epoch}_mybase # Update base path in this file (the only change required!)
$ ./{epoch}_init # Initializes the data dir
$ ./{epoch}_start # Starts mysqld (MYEXTRA --option)
$ ./{epoch}_stop # Stops mysqld
$ ./{epoch}_cl # To check mysqld is up
$ ./{epoch}_run # Run the testcase (produces output) using mysql CLI
$ ./{epoch}_run_pquery # Run the testcase (produces output) using pquery
$ vi /dev/shm/{epoch}/error.log.out # Verify the error log
$ ./{epoch}_gdb # Brings you to a gdb prompt
$ ./{epoch}_parse_core # Create {epoch}_STD.gdb and {epoch}_FULL.gdb; standard and full var gdb stack traces

Full-screen viewing @ 720p resolution recommended

The post MySQL QA Episode 10: Reproducing and Simplifying: How to get it Right appeared first on MySQL Performance Blog.

by Roel Van de Paar at July 31, 2015 10:00 AM

July 30, 2015

Oli Sennhauser

Max_used_connections per user/account

The maximum number of connections that can be opened concurrently against my MySQL or MariaDB database is configurable and can be checked with the following command:

SHOW GLOBAL VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 505   |
+-----------------+-------+

Whether this limit was ever reached in the past can be checked with:

SHOW GLOBAL STATUS LIKE 'max_use%';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 23    |
+----------------------+-------+

But on MySQL instances with many different applications (= databases/schemas), and thus many different users, it is a bit more complicated to find out how many concurrent connections each of these users has opened. We can configure the maximum number of connections one specific user can have open at the same time with:

SHOW GLOBAL VARIABLES LIKE 'max_user_connections';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| max_user_connections | 500   |
+----------------------+-------+

Furthermore, we can limit one specific user with:

GRANT USAGE ON *.* TO 'repl'@'%'
WITH MAX_CONNECTIONS_PER_HOUR 100 MAX_USER_CONNECTIONS 10;

and check with:

SELECT User, Host, max_connections, max_user_connections
  FROM mysql.user;
+------+---------------+-----------------+----------------------+
| User | Host          | max_connections | max_user_connections |
+------+---------------+-----------------+----------------------+
| root | localhost     |               0 |                    0 |
| repl | %             |             100 |                   10 |
| repl | 192.168.1.139 |               0 |                    0 |
+------+---------------+-----------------+----------------------+

But currently we have no way to check whether this limit was reached, or nearly reached, in the past...

A feature request for this was opened at MySQL with bug #77888.

Solution

If you cannot wait for the implementation, here is a little workaround:

DROP TABLE IF EXISTS mysql.`max_used_connections`;

CREATE TABLE mysql.`max_used_connections` (
  `USER` char(16) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `HOST` char(60) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `MAX_USED_CONNECTIONS` bigint(20) NOT NULL,
  PRIMARY KEY (`USER`, `HOST`) USING HASH
) ENGINE=MEMORY DEFAULT CHARSET=utf8
;

DROP EVENT IF EXISTS mysql.gather_max_used_connections;

-- event_scheduler = on
CREATE DEFINER=root@localhost EVENT mysql.gather_max_used_connections
ON SCHEDULE EVERY 10 SECOND
DO
INSERT INTO mysql.max_used_connections
SELECT user, host, current_connections
  FROM performance_schema.accounts
 WHERE user IS NOT NULL
   AND host IS NOT NULL
    ON DUPLICATE KEY
UPDATE max_used_connections = IF(current_connections > max_used_connections, current_connections, max_used_connections)
;
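As the comment above hints, the event scheduler must be running for the event to fire; if it is not already enabled in your configuration, it can be switched on at runtime:

SET GLOBAL event_scheduler = ON;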

SELECT * FROM mysql.max_used_connections;

+--------+-----------+----------------------+
| USER   | HOST      | MAX_USED_CONNECTIONS |
+--------+-----------+----------------------+
| root   | localhost |                    4 |
| zabbix | localhost |                   21 |
+--------+-----------+----------------------+

Caution: because we used a MEMORY table, those values are reset at every MySQL restart (as happens with the PERFORMANCE_SCHEMA or the INFORMATION_SCHEMA).
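If you want the counters to survive a restart, a minimal variation of the workaround is to use a persistent engine instead (a sketch; USING HASH is a MEMORY-engine index type, so the primary key becomes a regular one):

CREATE TABLE mysql.`max_used_connections` (
  `USER` char(16) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `HOST` char(60) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `MAX_USED_CONNECTIONS` bigint(20) NOT NULL,
  PRIMARY KEY (`USER`, `HOST`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;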

by Shinguz at July 30, 2015 09:34 PM

Peter Zaitsev

Why base64-output=DECODE-ROWS does not print row events in MySQL binary logs

Lately I saw many cases when users specified the option --base64-output=DECODE-ROWS to print out a statement representation of row events in MySQL binary logs, just to get nothing. The reason for this is obvious: the option --base64-output=DECODE-ROWS does not convert row events into their string representation; that is the job of the option --verbose. But why do users mix these two options so often? This blog post is the result of my investigations.

There are already two great blog posts about printing row events on the Percona blog: “Debugging problems with row based replication” by Justin Swanhart and “Identifying useful info from MySQL row-based binary logs” by Alok Pathak.

Both authors run mysqlbinlog with the options --base64-output=decode-rows -vv and demonstrate how a combination of them can produce human-readable output of row events. However, one thing which is not yet clear is what the differences are between these options. I want to underline the differences in this post.

Let’s check the user manual first.

--base64-output=value

This option determines when events should be displayed encoded as base-64 strings using BINLOG statements. The option has these permissible values (not case sensitive):

    AUTO (“automatic”) or UNSPEC (“unspecified”) displays BINLOG statements automatically when necessary (that is, for format description events and row events). If no --base64-output option is given, the effect is the same as --base64-output=AUTO.
    Note

    Automatic BINLOG display is the only safe behavior if you intend to use the output of mysqlbinlog to re-execute binary log file contents. The other option values are intended only for debugging or testing purposes because they may produce output that does not include all events in executable form.

    NEVER causes BINLOG statements not to be displayed. mysqlbinlog exits with an error if a row event is found that must be displayed using BINLOG.

    DECODE-ROWS specifies to mysqlbinlog that you intend for row events to be decoded and displayed as commented SQL statements by also specifying the --verbose option. Like NEVER, DECODE-ROWS suppresses display of BINLOG statements, but unlike NEVER, it does not exit with an error if a row event is found.

For examples that show the effect of --base64-output and --verbose on row event output, see Section 4.6.8.2, “mysqlbinlog Row Event Display”.

Literally, --base64-output=DECODE-ROWS just suppresses the BINLOG statement and does not print anything.

To test its effect I ran the command

insert into t values (2, 'bar');

on an InnoDB table while the binary log uses the ROW format. As expected, with no options specified I receive unreadable output:

$mysqlbinlog var/mysqld.1/data/master-bin.000002
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150720 15:19:15 server id 1  end_log_pos 120 CRC32 0x3d52aee2  Start: binlog v 4, server v 5.6.25-73.1-debug-log created 150720 15:19:15
BINLOG '
Q+esVQ8BAAAAdAAAAHgAAAAAAAQANS42LjI1LTczLjEtZGVidWctbG9nAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAeKu
Uj0=
'/*!*/;
# at 120
#150720 15:19:21 server id 1  end_log_pos 192 CRC32 0xbebac59d  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1437394761/*!*/;
SET @@session.pseudo_thread_id=2/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 192
#150720 15:19:21 server id 1  end_log_pos 239 CRC32 0xe143838b  Table_map: `test`.`t` mapped to number 70
# at 239
#150720 15:19:21 server id 1  end_log_pos 283 CRC32 0x75523a2d  Write_rows: table id 70 flags: STMT_END_F
BINLOG '
SeesVRMBAAAALwAAAO8AAAAAAEYAAAAAAAEABHRlc3QAAXQAAgMPAv8AA4uDQ+E=
SeesVR4BAAAALAAAABsBAAAAAEYAAAAAAAEAAgAC//wCAAAAA2Jhci06UnU=
'/*!*/;
# at 283
#150720 15:19:21 server id 1  end_log_pos 314 CRC32 0xd183c769  Xid = 14
COMMIT/*!*/;
# at 314
#150720 15:19:22 server id 1  end_log_pos 362 CRC32 0x892fe43b  Rotate to master-bin.000003  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

The INSERT is here:

BINLOG '
SeesVRMBAAAALwAAAO8AAAAAAEYAAAAAAAEABHRlc3QAAXQAAgMPAv8AA4uDQ+E=
SeesVR4BAAAALAAAABsBAAAAAEYAAAAAAAEAAgAC//wCAAAAA2Jhci06UnU=
'/*!*/;

But this string is not for humans.

What will happen if I add the option --base64-output=DECODE-ROWS?

$mysqlbinlog var/mysqld.1/data/master-bin.000002 --base64-output=DECODE-ROWS
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150720 15:19:15 server id 1  end_log_pos 120 CRC32 0x3d52aee2  Start: binlog v 4, server v 5.6.25-73.1-debug-log created 150720 15:19:15
# at 120
#150720 15:19:21 server id 1  end_log_pos 192 CRC32 0xbebac59d  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1437394761/*!*/;
SET @@session.pseudo_thread_id=2/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 192
#150720 15:19:21 server id 1  end_log_pos 239 CRC32 0xe143838b  Table_map: `test`.`t` mapped to number 70
# at 239
#150720 15:19:21 server id 1  end_log_pos 283 CRC32 0x75523a2d  Write_rows: table id 70 flags: STMT_END_F
# at 283
#150720 15:19:21 server id 1  end_log_pos 314 CRC32 0xd183c769  Xid = 14
COMMIT/*!*/;
# at 314
#150720 15:19:22 server id 1  end_log_pos 362 CRC32 0x892fe43b  Rotate to master-bin.000003  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

The row event was just suppressed!

Let's now check the option --verbose:

--verbose, -v

Reconstruct row events and display them as commented SQL statements. If this option is given twice, the output includes comments to indicate column data types and some metadata.

For examples that show the effect of --base64-output and --verbose on row event output, see Section 4.6.8.2, “mysqlbinlog Row Event Display”.

Surprisingly, --base64-output=DECODE-ROWS is not needed:

$mysqlbinlog var/mysqld.1/data/master-bin.000002 --verbose
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150720 15:19:15 server id 1  end_log_pos 120 CRC32 0x3d52aee2  Start: binlog v 4, server v 5.6.25-73.1-debug-log created 150720 15:19:15
BINLOG '
Q+esVQ8BAAAAdAAAAHgAAAAAAAQANS42LjI1LTczLjEtZGVidWctbG9nAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAeKu
Uj0=
'/*!*/;
# at 120
#150720 15:19:21 server id 1  end_log_pos 192 CRC32 0xbebac59d  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1437394761/*!*/;
SET @@session.pseudo_thread_id=2/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 192
#150720 15:19:21 server id 1  end_log_pos 239 CRC32 0xe143838b  Table_map: `test`.`t` mapped to number 70
# at 239
#150720 15:19:21 server id 1  end_log_pos 283 CRC32 0x75523a2d  Write_rows: table id 70 flags: STMT_END_F
BINLOG '
SeesVRMBAAAALwAAAO8AAAAAAEYAAAAAAAEABHRlc3QAAXQAAgMPAv8AA4uDQ+E=
SeesVR4BAAAALAAAABsBAAAAAEYAAAAAAAEAAgAC//wCAAAAA2Jhci06UnU=
'/*!*/;
### INSERT INTO `test`.`t`
### SET
###   @1=2
###   @2='bar'
# at 283
#150720 15:19:21 server id 1  end_log_pos 314 CRC32 0xd183c769  Xid = 14
COMMIT/*!*/;
# at 314
#150720 15:19:22 server id 1  end_log_pos 362 CRC32 0x892fe43b  Rotate to master-bin.000003  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

INSERT statement successfully restored as:

### INSERT INTO `test`.`t`
### SET
###   @1=2
###   @2='bar'
# at 283

Why do the bloggers mentioned above suggest using --base64-output=DECODE-ROWS? Let's try to use both options:

$mysqlbinlog var/mysqld.1/data/master-bin.000002 --base64-output=DECODE-ROWS --verbose
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150720 15:19:15 server id 1  end_log_pos 120 CRC32 0x3d52aee2  Start: binlog v 4, server v 5.6.25-73.1-debug-log created 150720 15:19:15
# at 120
#150720 15:19:21 server id 1  end_log_pos 192 CRC32 0xbebac59d  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1437394761/*!*/;
SET @@session.pseudo_thread_id=2/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 192
#150720 15:19:21 server id 1  end_log_pos 239 CRC32 0xe143838b  Table_map: `test`.`t` mapped to number 70
# at 239
#150720 15:19:21 server id 1  end_log_pos 283 CRC32 0x75523a2d  Write_rows: table id 70 flags: STMT_END_F
### INSERT INTO `test`.`t`
### SET
###   @1=2
###   @2='bar'
# at 283
#150720 15:19:21 server id 1  end_log_pos 314 CRC32 0xd183c769  Xid = 14
COMMIT/*!*/;
# at 314
#150720 15:19:22 server id 1  end_log_pos 362 CRC32 0x892fe43b  Rotate to master-bin.000003  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

In this case the row event was suppressed and the statement is printed as a comment. Note that the resulting file cannot be used to re-apply events, because the statements are commented out. This is very useful when the binary log is big and you just need to investigate what it contains, not re-apply the events.

This is not the main purpose of this post, but you can also find information about column metadata if you specify the option --verbose twice:

$mysqlbinlog var/mysqld.1/data/master-bin.000002 --base64-output=DECODE-ROWS --verbose --verbose
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150720 15:19:15 server id 1  end_log_pos 120 CRC32 0x3d52aee2  Start: binlog v 4, server v 5.6.25-73.1-debug-log created 150720 15:19:15
# at 120
#150720 15:19:21 server id 1  end_log_pos 192 CRC32 0xbebac59d  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1437394761/*!*/;
SET @@session.pseudo_thread_id=2/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 192
#150720 15:19:21 server id 1  end_log_pos 239 CRC32 0xe143838b  Table_map: `test`.`t` mapped to number 70
# at 239
#150720 15:19:21 server id 1  end_log_pos 283 CRC32 0x75523a2d  Write_rows: table id 70 flags: STMT_END_F
### INSERT INTO `test`.`t`
### SET
###   @1=2 /* INT meta=0 nullable=1 is_null=0 */
###   @2='bar' /* VARSTRING(255) meta=255 nullable=1 is_null=0 */
# at 283
#150720 15:19:21 server id 1  end_log_pos 314 CRC32 0xd183c769  Xid = 14
COMMIT/*!*/;
# at 314
#150720 15:19:22 server id 1  end_log_pos 362 CRC32 0x892fe43b  Rotate to master-bin.000003  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Note: this is, again, the job of --verbose, not --base64-output=DECODE-ROWS.

To conclude:

  • If you want to see the statement representation of row events, use the option --verbose (-v).
  • If you want to see column metadata as well, specify --verbose twice: --verbose --verbose or -vv.
  • If you want to suppress the output of row events, specify the option --base64-output=DECODE-ROWS.
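Putting it together, the invocation most people actually want when investigating a large binary log combines both (using the same file as throughout this post):

$mysqlbinlog var/mysqld.1/data/master-bin.000002 --base64-output=DECODE-ROWS -vv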

The post Why base64-output=DECODE-ROWS does not print row events in MySQL binary logs appeared first on MySQL Performance Blog.

by Sveta Smirnova at July 30, 2015 07:00 AM

Jean-Jerome Schmidt

Webinar Replay & Slides: Become a MySQL DBA - Designing High Availability for MySQL

Thanks to everyone who joined us yesterday for this live session on designing HA for MySQL led by Krzysztof Książek, Senior Support Engineer at Severalnines. The replay and slides to the webinar are now available to watch and read online via the links below.

Watch the replay:

 

Read the slides:

 

AGENDA

  • HA - what is it?
  • Caching layer
  • HA solutions
    • MySQL Replication
    • MySQL Cluster
    • Galera Cluster
    • Hybrid Replication
  • Proxy layer
    • HAProxy
    • MaxScale
    • Elastic Load Balancer (AWS)
  • Common issues
    • Split brain scenarios
    • GTID-based failover and Errant Transactions

 

SPEAKER

Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard. This webinar builds upon recent blog posts and related webinar series by Krzysztof on how to become a MySQL DBA.

For further blogs in this series visit: http://www.severalnines.com/blog-categories/db-ops

by Severalnines at July 30, 2015 02:05 AM

July 29, 2015

Shlomi Noach

Pseudo GTID, ASCENDING

Pseudo GTID is a technique where we inject Globally Unique entries into MySQL, gaining GTID abilities without using GTID. It is supported by orchestrator and described in more detail here, here and here.

Quick recap: we can join two slaves to replicate from one another even if they were never in a parent-child relationship, based on our uniquely identifiable entries which can be found in the slaves' binary logs or relay logs. Having Pseudo-GTID injected and controlled by us allows us to optimize failovers into quick operations, especially where a large number of servers is involved.

Ascending Pseudo-GTID further speeds up this process for delayed/lagging slaves.

Recap, visualized

(but do look at the presentation):

  1. Find last pseudo GTID in slave’s binary log (or last applied one in relay log)
  2. Search for exact match on new master’s binary logs
  3. Fast forward both through successive identical statements until end of slave’s applied entries is reached
  4. Point slave into cursor position on master

What happens if the slave we wish to reconnect is lagging? Or perhaps it is a delayed replica, set to run 24 hours behind its master?

The naive approach would expand bullet #2 into:

  • Search for exact match on master’s last binary logs
  • Unfound? Move on to previous (older) binary log on master
  • Repeat

The last Pseudo-GTID executed by the slave was issued by the master over 24 hours ago. Suppose the master generates one binary log per hour. This means we would need to full-scan 24 binary logs of the master where the entry will not be found; to only be matched in the 25th binary log (it's an off-by-one problem, don't hold the exact number against me).

Ascending Pseudo GTID

Since we control the generation of Pseudo-GTID, and since we control the search for Pseudo-GTID, we are free to choose the form of Pseudo-GTID entries. We recently switched into using Ascending Pseudo-GTID entries, and this works like a charm. Consider these Pseudo-GTID entries:

drop view if exists `meta`.`_pseudo_gtid_hint__asc:55B364E3:0000000000056EE2:6DD57B85`
drop view if exists `meta`.`_pseudo_gtid_hint__asc:55B364E8:0000000000056EEC:ACF03802`
drop view if exists `meta`.`_pseudo_gtid_hint__asc:55B364ED:0000000000056EF8:06279C24`
drop view if exists `meta`.`_pseudo_gtid_hint__asc:55B364F2:0000000000056F02:19D785E4`

The above entries are ascending in lexical order. The above is generated using a UTC timestamp, along with other watchdog/random values. For a moment let's trust that our generation is indeed always ascending. How does that help us?

Suppose the last entry found in the slave is

drop view if exists `meta`.`_pseudo_gtid_hint__asc:55B364E3:0000000000056EE2:6DD57B85`

And this is what we're to search on the master's binary logs. Starting with the optimistic hope that the entry is in the master's last binary log, we start reading. By nature of binary logs we have to scan them sequentially from start to end. As we read the binary log entries, we soon meet the first Pseudo-GTID injection, and it reads:

drop view if exists `meta`.`_pseudo_gtid_hint__asc:55B730E6:0000000000058F02:19D785E4`

 

At this stage we know we can completely skip scanning the rest of the binary log. Our entry will not be there: this entry is larger than the one we're looking for, and they'll only get larger as we get along in the binary log. It is therefore safe to ignore the rest of this file and move on to the next-older binary log on the master, to repeat our search there.
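The skip decision is nothing more than a lexical string comparison, which you can convince yourself of in plain SQL (a sketch using the hint parts of the two entries above):

SELECT '55B730E6:0000000000058F02:19D785E4' > '55B364E3:0000000000056EE2:6DD57B85'
    AS can_skip_binlog;
-- returns 1: the first Pseudo-GTID in this binlog already sorts after our target,
-- so the target cannot appear later in this file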

Binary logs that cannot contain the entry are only briefly examined: orchestrator will probably read no more than the first 1,000 entries or so (can't give you an exact number, it's your workload) before giving up on a binary log.

On every topology chain we have 2 delayed replica slaves, to help us out in case we make the grave mistake of DELETing the wrong data. These slaves would take, on some chains, 5-6 minutes to reconnect to a new master using Pseudo-GTID, since it required scanning many, many GBs of binary logs. This is no longer the case; we've reduced scan time for such servers to about 25s at worst, and much quicker on average. There can still be dozens of binary logs to open, but all but one are given up on very quickly. I should stress that those 25s are nonblocking for other slaves which are more up to date than the delayed replicas.

Can there be a mistake?

Notice that the above algorithm does not require each and every entry to be ascending; it just compares the first entry in each binlog to determine whether our target entry can be there or not. This means that if we've messed up our Ascending order and injected some out-of-order entries, we can still get away with it -- as long as those entries are not the first ones in the binary log, nor are they the last entries executed by the slave.

But why be so negative? We're using UTC timestamp as the major sorting order, and inject Pseudo-GTID every 5 seconds; even with leap second we're comfortable.

On my TODO is to also include a "Plan B" full-scan search: if the Ascending algorithm fails, we can still opt for the full scan option. So there would be no risk at all.

Example

We inject Pseudo-GTID via event-scheduler. These are the good parts of the event definition:

create event if not exists
  create_pseudo_gtid_event
  on schedule every 5 second starts current_timestamp
  on completion preserve
  enable
  do
    begin
      set @connection_id := connection_id();
      set @now := now();
      set @rand := floor(rand()*(1 << 32));
      set @pseudo_gtid_hint := concat_ws(':', lpad(hex(unix_timestamp(@now)), 8, '0'), lpad(hex(@connection_id), 16, '0'), lpad(hex(@rand), 8, '0'));

      set @_create_statement := concat('drop ', 'view if exists `meta`.`_pseudo_gtid_', 'hint__asc:', @pseudo_gtid_hint, '`');
      PREPARE st FROM @_create_statement;
      EXECUTE st;
      DEALLOCATE PREPARE st;
    end;
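To actually run this excerpt you would also need the target schema to exist and the event scheduler to be enabled (a sketch):

CREATE DATABASE IF NOT EXISTS meta;
SET GLOBAL event_scheduler = ON;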

We accompany this by the following orchestrator configuration:

 "PseudoGTIDPattern": "drop view if exists .*?`_pseudo_gtid_hint__",
 "PseudoGTIDMonotonicHint": "asc:",

"PseudoGTIDMonotonicHint" notes a string; if that string ("asc:") is found in the slave's Pseudo-GTID entry, then the entry is assumed to have been injected as part of ascending entries, and the optimization kicks in.

The Manual has more on this.

by shlomi at July 29, 2015 10:59 AM

Peter Zaitsev

Multi-source replication in MySQL 5.7 vs Tungsten Replicator

MySQL 5.7 comes with a new set of features, and multi-source replication is one of them. In a few words, this means that one slave can replicate from different masters simultaneously.

During the last couple of months I’ve been playing a lot with this, trying to analyze its potential in a real case that I’ve been facing while working with a customer.

This was motivated by the fact that my customer is already using multi-sourced slaves with Tungsten Replicator, and I wanted to do a side-by-side comparison between Tungsten Replicator and multi-source replication in MySQL 5.7.

Consider the following scenario:

DB1 is our main master attending mostly writes from several applications; it also needs to serve read traffic, which is pushing its capacity close to the limit. It has 6 replication slaves attached, using regular replication.
A1, A2, A3, B1, B2 and DB7 are reporting slaves used to offload some reads from the master and also working on some offline ETL processes.

Since they had some idle capacity, the customer decided to go further and set up a different architecture:
A1 and B1 also became masters of other slaves using Tungsten Replicator; in this case group A is a set of servers for a statistics application and group B is attending a finance application, so A2, A3 and B2 became multi-sourced slaves.
New applications write directly to A1 and B1 without impacting the write capacity of the main master.

Pros and Cons of this approach

Pros

  • It just works. We’ve been running this way for a long time now and we haven’t suffered major issues.
  • Tungsten Replicator has some built in tools and scripts to make slave provision easy.

Cons

  • Tungsten Replicator is a great product but bigger than needed for this architecture. In some cases we had to configure Java Virtual Machine with 4GB of RAM to make it work properly.
  • Tungsten is a complex tool that needs some extra expertise to deploy it, make it work and troubleshoot issues when errors happen (i.e. handling duplicate keys errors)

With all this in mind we moved a step forward and started to test if we can move this architecture to use legacy replication only.

New architecture design:

We added some storage capacity to DB7 for our testing purposes, and the goal here is to replace all Tungsten-replicated slaves with a single server where all databases are consolidated.

For some data dependencies we weren’t able to completely separate the A1 and B1 servers to become master-only, so they are currently acting as masters of DB7 and slaves of DB1. By data dependency I mean that DB1 replicates its schemas to all of its direct slaves, including DB7. DB7 also gets replication of the finance DB running locally on B1 and the stats DB running locally on A1.

Some details about how this was done and how multi-source replication is implemented:

  • The main difference from regular replication, as known up to version 5.6, is that now you have replication channels. Each channel means a different source; in other words, each master has its own replication channel.
  • Replication needs to be set up as crash safe, meaning that both the master_info_repository and
    relay_log_info_repository variables need to be set to TABLE (see the sketch after this list).
  • We haven’t considered GTID because the servers acting as masters run different versions than our test multi-sourced slave.
  • log_slave_updates needs to be disabled in A1 and B1 to avoid having duplicate data in DB7 due to the replication flow.
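As mentioned in the list above, the crash-safe repositories can be switched at runtime while replication is stopped (a sketch; persist the same settings in my.cnf so they survive a restart):

SET GLOBAL master_info_repository = 'TABLE';
SET GLOBAL relay_log_info_repository = 'TABLE';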

Pros and Cons of this approach

Pros

  • MySQL 5.7 can replicate from different versions of master, we tested multi-source replication working along with 5.5 and 5.6 simultaneously and didn’t suffer problems besides those known changes with timestamp based fields.
  • Administration becomes easier. Any DBA already familiar with legacy replication can adapt to handle multiple channels without much learning, some new variables and a couple of new tables and you’re ready to go here.

Cons

  • 5.7 is not production ready yet. At this point we don’t have a GA release date, which means we may expect bugs to appear in the short/mid term.
  • Multi-source is still tricky for some special cases: database and table filtering works globally (you can’t set per-channel filters), and administration commands like sql_slave_skip_counter are still global, which means you can’t easily skip a statement in a particular channel.

Now the funny part: The How

It was easier than you might think. First of all we needed to start from a backup of the data coming from our masters. Due to the versions used in production (the main master is 5.5; A1 and B1 are 5.6) we started from a logical dump, so we avoided having to deal with mysql_upgrade issues.

Disclaimer: this does not pretend to be a guide on how to setup multi-source replication

For the matter of our case we did the backup/restore using mydumper/myloader as follow:

[root@db1]$ mydumper -l 600 -v 3 -t 8 --outputdir /mnt/backup_db1/20150708 --less-locking --regex="^(database1.|database2.|database3.)"
[root@a1]$ mydumper -l 600 -v 3 -t 8 --outputdir /mnt/backup_a1/20150708 --less-locking --regex="^(tungsten_stats.|stats.)"
[root@b1]$ mydumper -l 600 -v 3 -t 8 --outputdir /mnt/backup_b1/20150708 --less-locking --regex="^(tungsten_finance.|finance.)"

Notice each command was run on its respective master server. Now the restore part:

[root@db7]$ myloader -d /mnt/backup_db1/20150708  -o -t 8 -q 10000 -h localhost
[root@db7]$ myloader -d /mnt/backup_a1/20150708 -o -t 8 -q 10000 -h localhost
[root@db7]$ myloader -d /mnt/backup_b1/20150708 -o -t 8 -q 10000 -h localhost

So at this point we have a new slave with a copy of the databases from 3 different masters. Just for context: we need to dump/restore the tungsten* databases because they are constantly updated by the Replicator (which at this point is still in use). Pretty easy, right?

Now the most important part of this whole process: setting up replication. The procedure is very similar to regular replication, but now we need to consider which binlog position is necessary for each replication channel. This is very easy to get from each backup, in this case by reading the metadata file created by mydumper. Any known backup method (either logical or physical) gives you a way to get the binlog coordinates, for example --master-data=2 in mysqldump or the xtrabackup_binlog_info file in xtrabackup.
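The metadata file mydumper writes is plain text, so the coordinates can simply be read from each dump directory (a sketch; the exact layout may vary between mydumper versions, and the values shown are the ones used for the a1 channel below):

[root@db7]$ grep -A 2 'SHOW MASTER STATUS' /mnt/backup_a1/20150708/metadata
SHOW MASTER STATUS:
        Log: a1-bin.394460
        Pos: 56004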

Once we got the replication info (and created a replication user on each master), we only needed to run the well-known CHANGE MASTER TO and START SLAVE commands, but here we have our new way to do it:

db7:information_schema> change master to master_host='db1', master_user='rep', master_password='rep', master_log_file='db1-bin.091487', master_log_pos=74910596 FOR CHANNEL 'main_master';
       Query OK, 0 rows affected (0.02 sec)
db7:information_schema> change master to master_host='a1', master_user='rep', master_password='rep', master_log_file='a1-bin.394460', master_log_pos=56004 FOR CHANNEL 'a1_slave';
       Query OK, 0 rows affected (0.02 sec)
db7:information_schema> change master to master_host='b1', master_user='rep', master_password='rep', master_log_file='b1-bin.1653245', master_log_pos=2563356 FOR CHANNEL 'b1_slave';
       Query OK, 0 rows affected (0.02 sec)

Replication is set and now we are good to go:

db7:information_schema> START SLAVE FOR CHANNEL 'main_master';
       Query OK, 0 rows affected (0.00 sec)
db7:information_schema> START SLAVE FOR CHANNEL 'a1_slave';
       Query OK, 0 rows affected (0.00 sec)
db7:information_schema> START SLAVE FOR CHANNEL 'b1_slave';
       Query OK, 0 rows affected (0.00 sec)

The new commands include the FOR CHANNEL 'channel_name' option to handle replication channels independently.

At this point we have a slave running 3 replication channels from different sources. We can check the status of replication with our known command SHOW SLAVE STATUS (TL;DR):

db7:information_schema> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db1
                  Master_User: rep
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: db1-bin.077011
          Read_Master_Log_Pos: 15688468
               Relay_Log_File: db7-relay-main_master.000500
                Relay_Log_Pos: 18896705
        Relay_Master_Log_File: db1-bin.076977
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: mysql.%,temp.%
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 18896506
              Relay_Log_Space: 2260203264
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 31047
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1004
                  Master_UUID: 65107c0c-7ab5-11e4-a85a-bc305bf01f00
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: System lock
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name: main_master
*************************** 2. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: a1
                  Master_User: slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: a1-bin.072336
          Read_Master_Log_Pos: 10329256
               Relay_Log_File: db7-relay-db3_slave.000025
                Relay_Log_Pos: 10329447
        Relay_Master_Log_File: a1-bin.072336
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: mysql.%,temp.%
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 10329256
              Relay_Log_Space: 10329697
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 4000
                  Master_UUID: 0f061ec4-6fad-11e4-a069-a0d3c10545b0
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name: a1_slave
*************************** 3. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: b1.las1.fanops.net
                  Master_User: slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: b1-bin.093214
          Read_Master_Log_Pos: 176544432
               Relay_Log_File: db7-relay-db8_slave.000991
                Relay_Log_Pos: 176544623
        Relay_Master_Log_File: b1-bin.093214
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: mysql.%,temp.%
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 176544432
              Relay_Log_Space: 176544870
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1001
                  Master_UUID:
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name: b1_slave
3 rows in set (0.00 sec)

Yeah I know, the output is too large. The Oracle guys noticed it too, so they have created a set of new tables in the performance_schema DB to help us retrieve this information in a friendlier manner; check this link for more information. We could also run SHOW SLAVE STATUS FOR CHANNEL 'b1_slave', for instance.
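For example, a quick per-channel health check might look like this (a sketch; these performance_schema replication tables are new in 5.7 and their columns changed between milestone releases):

db7:information_schema> SELECT CHANNEL_NAME, SERVICE_STATE
    FROM performance_schema.replication_connection_status;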

Some limitations found during tests:

  • As mentioned, some configurations are still global and can’t be set per replication channel, for instance replication filters, which can be set without restarting MySQL but will affect all replication channels, as you can see here.
  • Replication events are somehow serialized on the slave side, just like a global counter that is not well documented yet. In practice this means that you need to be very careful when troubleshooting issues, because you may run into unexpected behavior: for instance, if you have 2 replication channels failing with a duplicate key error, it is not easy to predict which event you will skip when running set global sql_slave_skip_counter=1

Conclusions
So far this new feature looks very nice and provides some extra flexibility to slaves, which helps to reduce architecture complexity when we want to consolidate databases from different sources into a single server. After some time testing it, I’d say that I prefer this type of replication over Tungsten Replicator in these kinds of scenarios due to its simplicity of administration; i.e. pt-table-checksum and pt-table-sync will work without the limitations present in Tungsten.

With the exception of some limitations that need to be addressed, I believe this new feature is game changing and will definitely make DBAs’ lives easier. I still have a lot to test, but that is material for a future post.

The post Multi-source replication in MySQL 5.7 vs Tungsten Replicator appeared first on MySQL Performance Blog.

by Francisco Bordenave at July 29, 2015 07:00 AM

July 28, 2015

Jean-Jerome Schmidt

Become a MySQL DBA blog series - Database upgrades

Database vendors typically release patches with bug/security fixes on a monthly basis, so why should we care? The news is full of reports of security breaches and hacked systems, so unless security is not a concern, you might want to have the most current security fixes on your systems. Major versions are rarer, and usually harder (and riskier) to upgrade to. But they might bring along some important features that make the upgrade worth the effort.

In this blog post, we will cover one of the most basic tasks of the DBA - minor and major database upgrades.  

This is the sixth installment in the ‘Become a MySQL DBA’ blog series. Our previous posts in the DBA series include Replication Topology Changes, Schema Changes, High Availability, Backup & Restore, Monitoring & Trending.

MySQL upgrades

Once every couple of years, a MySQL version becomes outdated and is no longer supported by Oracle. It happened to MySQL 5.1 on December 4, 2013, and earlier to MySQL 5.0 on January 9, 2012. It will also happen to MySQL 5.5 sometime in 2018, 8 years after the GA was released. It means that for both MySQL 5.0 and MySQL 5.1, users cannot rely on fixes - not even for serious security bugs. This is usually the point where you really need to plan an upgrade of MySQL to a newer version.

You won’t be dealing only with major version upgrades, though - it’s more likely that you’ll be upgrading to minor versions more often, like 5.6.x -> 5.6.y. Most likely, the newest version brings some fixes for bugs that affect your workload, but it can be for any other reason.

There is a significant difference in the way you perform a major and a minor version upgrade.

Preparations

Before you can even think about performing an upgrade, you need to decide what kind of testing you need to do. Ideally, you have a staging/development environment where you do tests for your regular releases. If that is the case, the best way of doing pre-upgrade tests will be to build a database layer of your staging environment using the new MySQL version. Once that is done, you can proceed with a regular set of tests. More is better - you want to focus not only on the “feature X works/does not work” aspect but also performance.

On the database side, you can also do some generic tests. For that you would need a list of queries in slow log format. Then you can use pt-upgrade to run them on both the old and the new MySQL version, comparing the response time and result sets. In the past, we have noticed that pt-upgrade returns a lot of false positives - it may report a query as slow while in fact the query is perfectly fine on both versions. For that, you may want to introduce some additional sanity checks - parse the pt-upgrade output, grab the slow queries it reported, execute them once more on the servers and compare the results again. What you need to keep in mind is that you should connect to both the old and the new database servers in the same way (a socket connection will be faster than TCP).
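A minimal pt-upgrade run could look like this (a sketch; the DSNs and the slow log path are placeholders, and you would normally keep the report for later diffing):

$ pt-upgrade /path/to/slow.log h=old_version_host h=new_version_host > pt_upgrade_report.txt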

Typical results from such generic tests are queries where the execution plan has changed - usually it’s enough to add some indexes or force the optimizer to pick the correct one. You may also see queries with discrepancies in the result set - most likely the result of a missing explicit ORDER BY in the query - you can’t rely on rows being sorted in any particular way if you didn’t sort them explicitly.

Minor version upgrades

A minor upgrade is relatively easy to perform - most of the time, all you need to do is install the new version using the package manager of your distribution. Once you do that, you need to ensure that MySQL has been started after the upgrade and then run the mysql_upgrade script. This script goes through the tables in your database and ensures all of them are compatible with the current version. It may also fix your system tables if required.
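On a Debian/Ubuntu-style system, this typically boils down to something like the following (a sketch; the exact package name depends on your distribution and vendor repository):

$ sudo apt-get update
$ sudo apt-get install mysql-server-5.6
$ sudo service mysql status    # ensure MySQL was started after the upgrade
$ mysql_upgrade -u root -p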

Obviously, installing the new version of a package requires the service to be stopped. Therefore you need to plan the upgrade process. It may slightly differ depending if you use Galera Cluster or MySQL replication.

MySQL replication

When we are dealing with MySQL replication, the upgrade process is fairly simple. You need to upgrade slave by slave, taking them out of rotation for the time required to perform the upgrade (a short time if everything goes right, not more than a few minutes of downtime). For that you may need to make some temporary changes in your proxy configuration to ensure that the traffic won’t be routed to the slave that is under maintenance. It’s hard to give any details here because it depends on your setup. In some cases, it might not even be necessary to make any changes, as the proxy can adapt to topology changes on its own and detect which node is available and which is not. That’s how you should configure your proxy, by the way.

Once every slave has been updated, you need to execute a planned failover. We discussed the process in an earlier blog post. The process may also depend on your setup. It doesn’t have to be a manual one if you have tools to automate it for you (MHA for example). Once a new master is elected and failover is completed, you should perform the upgrade on the old master which, at this point, should be slaving off the new master. This concludes the minor version upgrade for the MySQL replication setup.

Galera Cluster

With Galera, it is somewhat easier to perform upgrades - you need to stop the nodes one by one, upgrade the stopped node and then restart it before moving on to the next. If your proxy needs some manual tweaks to ensure traffic won’t hit nodes which are undergoing maintenance, you will have to make those changes. If it can detect everything automatically, all you need to do is stop MySQL, upgrade and restart. Once you’ve gone over all the nodes in the cluster, the upgrade is complete.

Major version upgrades

A major version upgrade in MySQL would be 5.x -> 5.y or even 4.x -> 5.y. Such an upgrade is trickier and more complex than the minor upgrades we just covered in the earlier paragraphs.

The recommended way of performing the upgrade is to dump and reload the data - this requires some time (depending on the database size) and it’s usually not feasible to do it while the slave is out of rotation. Even when using mydumper/myloader, the process will take too long. In general, if the dataset is larger than a hundred gigabytes, it will probably require additional preparations.

While it might be possible to do just a binary upgrade (install new packages), it is not recommended, as there could be some incompatibilities in binary format between the old version and the new one which, even after mysql_upgrade has been executed, may still cause some problems. We’ve seen cases where a binary upgrade resulted in some weird behavior in how the optimizer works, or caused instability. All those issues were solved by performing the dump/reload process. So, while you may be ok running a binary upgrade, you may also run into serious problems - it’s your call and eventually it’s your decision. If you decide to perform a binary upgrade, you need to do detailed (and time-consuming) tests to ensure it does not break anything. Otherwise you are at risk. That’s why dump and reload is the officially recommended way to upgrade MySQL, and that’s why we will focus on this approach to the upgrade.

MySQL replication

If our setup is based on MySQL replication, we will build a slave on the new MySQL version. Let’s say we are upgrading from MySQL 5.5 to MySQL 5.6. As we have to perform a long dump/reload process, we may want to build a separate MySQL host for that. The simplest way would be to use xtrabackup to grab the data from one of the slaves along with the replication coordinates. That data will allow you to slave the new node off the old master. Once the new node (still running MySQL 5.5 - xtrabackup just moves the data, so we have to use the same, original MySQL version) is up and running, it’s time to dump the data. You can use any of the logical backup tools that we discussed in our earlier post on Backup and Restore. It doesn’t matter which, as long as you can restore the data later.

After the dump has completed, it’s time to stop MySQL, wipe out the current data directory, install MySQL 5.6 on the node, initialize the data directory using the mysql_install_db script and start the new MySQL version. Then it’s time to load the dumps, a process which may also take a lot of time. Once done, you should have a new and shiny MySQL 5.6 node. It’s now time to sync it back with the master: you can use the coordinates collected by xtrabackup to slave the node off a member of the production cluster running MySQL 5.5. What’s important to remember here is that, since you eventually want to slave the node off the current production cluster, you need to ensure the binary logs won’t rotate out. For large datasets, the dump/reload process may take days, so adjust expire_logs_days accordingly on the master, and confirm you have enough free disk space for all those binlogs.
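
A hedged example, assuming the dump/reload could take up to a week:

mysql -e "SET GLOBAL expire_logs_days = 7;"
# persist it in my.cnf too, in case the master restarts:
# expire_logs_days = 7
df -h /var/lib/mysql            # keep an eye on binlog disk usage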

Once we have a MySQL 5.6 node slaving off the MySQL 5.5 master, it’s time to go over the 5.5 slaves and upgrade them. The easiest way now is to leverage xtrabackup to copy the data from the 5.6 node. So we take a 5.5 slave out of rotation, stop the MySQL server, wipe out the data directory, upgrade MySQL to 5.6 and restore the data from the other 5.6 slave using xtrabackup. Once that’s done, you can set up replication again and you should be all set. A sketch of that rebuild is shown below.
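
A hedged outline of the xtrabackup-based rebuild; the hostname and paths are placeholders:

# on the upgraded 5.6 node: stream a backup to the slave being rebuilt
innobackupex --stream=xbstream /tmp | ssh slave55 "xbstream -x -C /var/lib/mysql"

# on the slave being rebuilt (packages already upgraded to 5.6)
innobackupex --apply-log /var/lib/mysql
chown -R mysql:mysql /var/lib/mysql
service mysql start
# then CHANGE MASTER TO the coordinates recorded in xtrabackup_binlog_info
# and START SLAVE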

This process is much faster than doing a dump/reload for each of the slaves; it’s perfectly fine to do it once per replication cluster and then use physical backups to rebuild the other slaves. If you use AWS, you can rely on EBS snapshots instead of xtrabackup. As with the logical backup, it doesn’t really matter how you rebuild the slaves, as long as it works.

Finally, once all of the slaves have been upgraded, you need to fail over from the 5.5 master to one of the 5.6 slaves. At this point it may happen that you won’t be able to keep the 5.5 node in the replication topology (even if you set up master-master replication between them). In general, replicating from a newer version of MySQL to an older one is not supported, and replication might break. One way or another, you’ll want to upgrade and rebuild the old master using the same process as with the slaves.

Galera Cluster

Compared to MySQL replication, Galera is at the same time both trickier and easier to upgrade. A cluster created with Galera should be treated as a single MySQL server. This is crucial to remember when discussing Galera upgrades: it’s not a master with some slaves or many masters connected to each other, it’s more like a single server. To perform an upgrade of a single MySQL server you need to either do an offline upgrade (take it out of rotation, dump the data, upgrade MySQL to 5.6, load the data, bring it back into rotation) or create a slave, upgrade it and finally fail over to it (the process we described in the previous section while discussing the MySQL replication upgrade).

The same thing applies to a Galera cluster: you either take everything down for the upgrade (all nodes) or you build a slave, another Galera cluster connected via MySQL replication.

An online upgrade process may look as follows. For starters, you need to create the slave on MySQL 5.6; the process is exactly the same as above: create a node with MySQL 5.5 (it can be Galera but it’s not required), use xtrabackup to copy the data and replication coordinates, dump the data using a logical backup tool, wipe out the data directory, upgrade MySQL to 5.6 Galera, bootstrap the cluster, load the data and slave the node off the 5.5 Galera cluster.

At this point you should have two Galera clusters, 5.5 and a single node of Galera 5.6, connected via replication. The next step is to build the 5.6 cluster up to production size. It’s hard to say exactly how to do this: if you are in the cloud, you can just spin up new instances; if you are using colocated servers in a datacenter, you may need to move some of the hardware from the old cluster to the new one. You need to keep the total capacity of the system in mind, to make sure it can cope with some nodes taken out of rotation. While hardware management may be tricky, the nice part is that you don’t have to do much to build up the 5.6 cluster: Galera will use SST to populate the new nodes automatically.

In general, the goal of this phase is to build a 5.6 cluster large enough to handle the production workload. Once that’s done, you fail over to the 5.6 Galera cluster, which concludes the upgrade. Of course, you may still need to add more nodes to it, but that is now the regular process of provisioning Galera nodes, only with 5.6 instead of 5.5.

 


by Severalnines at July 28, 2015 12:56 PM

Peter Zaitsev

MySQL QA Episode 9: Reducing Testcases for Experts: multi-threaded reducer.sh

Welcome to MySQL QA Episode 9. This episode will go more in-depth into reducer.sh: Reducing Testcases for Experts: multi-threaded reducer.sh

We will explore how to use reducer.sh to do true multi-threaded testcase reduction – a world’s first.

Topics:

  1. Expert configurable variables & their default reducer.sh settings
    1. PQUERY_MULTI
    2. PQUERY_MULTI_THREADS
    3. PQUERY_MULTI_CLIENT_THREADS
    4. PQUERY_MULTI_QUERIES
    5. PQUERY_REVERSE_NOSHUFFLE_OPT

Full-screen viewing @ 720p resolution recommended.

The post MySQL QA Episode 9: Reducing Testcases for Experts: multi-threaded reducer.sh appeared first on MySQL Performance Blog.

by Roel Van de Paar at July 28, 2015 10:00 AM

Henrik Ingo

It was my fault

Last Friday noonish, I was back at PDX. I had decided to invest in the Thursday night parties - to strengthen those bonds of friendship that are the backbone of the open source community - then sleep, pack and take the light rail to the airport in the morning, skipping the remaining Friday morning conference sessions. I had already been at the convention center 6 days in a row, figured it would be enough for now.


by hingo at July 28, 2015 06:56 AM

July 27, 2015

MariaDB Foundation

MariaDB 10.1.6 now available

Download MariaDB 10.1.6

Release Notes Changelog What is MariaDB 10.1?

MariaDB APT and YUM Repository Configuration Generator

The MariaDB project is pleased to announce the immediate availability of MariaDB 10.1.6. This is a Beta release.

See the Release Notes and Changelog for detailed information on this release and the What is MariaDB 10.1? page in the MariaDB Knowledge Base for general information about the MariaDB 10.1 series.

Thanks, and enjoy MariaDB!

by Daniel Bartholomew at July 27, 2015 03:10 PM

July 25, 2015

Shlomi Noach

What makes a MySQL server failure/recovery case?

Or: How do you reach the conclusion your MySQL master/intermediate-master is dead and must be recovered?

This is an attempt at making a holistic diagnosis of our replication topologies. The aim is to cover obvious and not-so-obvious crash scenarios, and to be able to act accordingly and heal the topology.

At Booking.com we are dealing with very large numbers of MySQL servers. We have many topologies, and many servers in each topology. See past numbers to get a feel for it. At these numbers, failures happen frequently. Typically we would see normal slaves failing, but occasionally, and far more frequently than we would like to be paged for, an intermediate master or a master would crash. But our current (and ever in transition) setup also includes SANs, DNS records and VIPs, any of which can fail and bring down our topologies.

Tackling issues of monitoring, disaster analysis and recovery processes, I feel safe claiming the following:

  • The fact your monitoring tool cannot access your database does not mean your database has failed.
  • The fact your monitoring tool can access your database does not mean your database is available.
  • The fact your database master is unwell does not mean you should fail over.
  • The fact your database master is alive and well does not mean you should not fail over.

Bummer. Let's review a simplified topology with a few failure scenarios. Some of these scenarios you will find familiar. Some others may be caused by setups you're not using. I would love to say I've seen it all but the more I see the more I know how strange things can become.

We will consider the simplified case of a master with three replicas: we have M as master, A, B, C as slaves.

[diagram: master M with slaves A, B and C]

 

A common monitoring scheme is to monitor each machine's IP, the availability of the MySQL port (3306) and the responsiveness to some simple query (e.g. "SELECT 1"). Some of these checks may run locally on the machine, others remotely.
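
A hedged sketch of such checks run from a monitoring host (the hostname is a placeholder):

ping -c 1 db-master.example.com                # is the machine/IP reachable?
nc -z -w 2 db-master.example.com 3306          # is the MySQL port open?
mysql -h db-master.example.com -e "SELECT 1"   # does the server answer a simple query?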

Now consider your monitoring tool fails to connect to your master.

[diagram: monitoring cannot reach M; the state of A, B and C is unknown]

I've marked the slaves with question marks, as the common monitoring scheme does not associate the master's monitoring result with the slaves'. Can you safely conclude your master is dead? Are you feeling comfortable initiating a failover process? How about:

  • Temporary network partitioning; it just so happens that your monitoring tool cannot access the master, though everyone else can.
  • DNS/VIP/name cache/name resolving issues. Sometimes similar to the above; does your monitoring tool's host think the master's IP is what it really is? Has something just changed? Has some cache expired? Is some cache stale?
  • MySQL connection rejection. This could be due to a serious "Too many connections" problem on the master, or due to accidental network noise.

Now consider the following case: a first tier slave is failing to connect to the master:

[diagram: a first tier slave cannot connect to M]

The slave's IO thread is broken; do we have a problem here? Is the slave failing to connect because the master is dead, or because the slave itself suffers from a network partitioning glitch?

A holistic diagnosis

In the holistic approach we couple the master's monitoring with that of its direct slaves. Before I continue to describe some logic, the previous statement is something we must reflect upon.

We should associate the master's state with that of its direct slaves. Hence we must know which are its direct slaves. We might have slaves D, E, F, G replicating from B, C. They are not in our story. But slaves come and go. Get provisioned and de-provisioned. They get repointed elsewhere. Our monitoring needs to be aware of the state of our replication topology.

My preferred tool for the job is orchestrator, since I author it. It is not a standard monitoring tool and does not serve metrics; but it observes your topologies and records them. And notes changes. And acts as a higher level failure detection mechanism which incorporates the logic described below.

We continue our discussion under the assumption we are able to reliably claim we know our replication topology. Let's revisit our scenarios from above and then add some.

We will further only require MySQL client protocol connection to our database servers.

Dead master

A "real" dead master is perhaps the clearest failure. MySQL has crashed (signal 11); or the kernel panicked; or the disks failed; or power went off. The server is really not serving. This is observed as:

[diagram: M is dead; A, B and C all report they cannot connect to it]

In the holistic approach, we observe that:

  • We cannot reach the master (our MySQL client connection fails).
  • But we are able to connect to the slaves A, B, C
  • And A, B, C are all telling us they cannot connect to the master

We have now cross-referenced the death of the master with its three slaves. The funny thing is that the MySQL server on the master may still be up and running. Perhaps the master is suffering from some weird network partitioning problem (when I say "weird", I mean we have it; discussed further below). And perhaps some application is actually still able to talk to the master!

And yet our entire replication topology is broken. Replication is not there for beauty; it serves our application code. And it's turning stale. Even if by some chance things are still operating on the master, this still makes for a valid failover scenario.

Unreachable master

Compare the above with:

[diagram: monitoring cannot reach M, but A, B and C are replicating happily]

Our monitoring scheme cannot reach our master. But it can reach the slaves, and they're all saying: "I'm happy!"

This gives us suspicion enough to avoid failing over. We may not actually have a problem: it's just us that are unable to connect to the master.

Right?

There are still interesting use cases. Consider the problem of "Too many connections" on the master. You are unable to connect; the application starts throwing errors; but the slaves are happy. They were there first. They started replicating at the dawn of time, long before there was an issue. Their persistent connections are good to go.

Or the master may suffer a deadlock. A long, blocking ALTER TABLE. An accidental FLUSH TABLES WITH READ LOCK. Or whatever occasional bug we hit. Slaves are still connected, but new connections are hanging, and your monitoring query is unable to get through.

And still our holistic approach can find that out: as we are able to connect to our slaves, we are also able to ask them: what do your relay logs have to say about this? Are we progressing in replication position? Do we actually find application content in the slaves' relay logs? We can do all this via the MySQL protocol ("SHOW SLAVE STATUS", "SHOW RELAYLOG EVENTS").
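
A hedged sketch of that probe on one of the slaves:

SHOW SLAVE STATUS\G
-- sample this periodically: if Read_Master_Log_Pos keeps advancing,
-- the slave is still receiving data from the master
SHOW RELAYLOG EVENTS LIMIT 10;
-- do the relay logs actually contain fresh application events?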

Understanding the topology gives you greater insight into your failure case; you have increasing levels of confidence in your analysis. Strike that: in your automated analysis.

Dead master and slaves

They're all gone!

[diagram: M, A, B and C are all unreachable]

You cannot reach the master and you cannot reach any of its slaves. Once you are able to associate your master and slaves you can conclude you either have a complete DC power failure problem (or is this cross DC?) or you are having a network partitioning problem. Your application may or may not be affected -- but at least you know where to start. Compare with:

Failed DC

[diagram: M, A and B are in a failed DC; C is in another DC]

I'm stretching it now, because when a DC fails all the red lights start flashing. Nonetheless, if M, A, B are all in one DC and C is on another, you have yet another diagnosis.

Dead master and some slaves

[diagram: M is dead, along with some of its slaves]

Things start getting complicated when you're unable to get an authoritative answer from everyone. What happens if the master is dead as well as one of its slaves? We previously expected all slaves to say "we cannot replicate". For us, the master being unreachable, some slaves being dead and all the others complaining about their IO threads is a good enough indication that the master is dead.

All first tier slaves not replicating

[diagram: all of M's first tier slaves have stopped replicating]

Not a failover case, but it certainly needs to ring the bells. All of the master's direct slaves are failing replication on some SQL error or are just stopped. Our topology is turning stale.

Intermediate masters

With intermediate masters the situation is not all that different. In the diagram below:

[diagram: D replicates from intermediate master A; E, F and G replicate from intermediate master C]

The servers E, F, G replicating from C provide us with the holistic view on C. D provides the holistic view on A.

Reducing noise

Intermediate master failover is a much simpler operation than master failover. Changing masters requires name-resolution changes (of some sort), whereas moving slaves around the topology affects no one.

This implies:

  • We don't mind over-reacting on failing over intermediate masters
  • We pay with more noise

Sure, we don't mind failing over D elsewhere, but as D is the only slave of A, a mere hiccup on D is enough for us to get an alert ("all" of the intermediate master's slaves are not replicating). To that effect, orchestrator treats single-slave scenarios differently from multiple-slave scenarios.

Not so fun setups and failures

At Booking.com we are in transition between setups. We have some legacy configuration, we have a roadmap, two ongoing solutions, some experimental setups, and/or all of the above combined. Sorry.

Some of our masters are on SAN. We are moving away from this; for those masters on SANs we have cold standbys in an active-passive mode; so master failure -> unmount SAN -> mount SAN on cold standby -> start MySQL on cold standby -> start recovery -> watch some TV -> go shopping -> end recovery.

Except that SANs fail, too. When the master fails, switching over to the cold standby is pointless if the origin of the problem is the SAN. And given that some other masters share the same SAN... whoa. As I said, we're moving away from this setup, in favor of Pseudo-GTID and then Binlog Servers.

The SAN setup also implied using VIPs for some servers. The slaves reference the SAN master via a VIP, and when the cold standby starts up it assumes the VIP, so the slaves know nothing about the change. The same setup goes for DC masters. What happens when the VIP goes down? MySQL is running happily, but the slaves are unable to connect. Does that make for a failover scenario? For intermediate masters we're pushing it to be so, failing over to a normal local-disk based server; this improves our confidence in non-SAN setups (which we have plenty of, anyhow).

Double checking

You sample your servers once every X seconds, but in a failover scenario you want to make sure your data is up to date. When orchestrator suspects a dead master (i.e. it cannot reach the master) it immediately contacts the master's direct slaves and checks their status.

Likewise, when orchestrator sees a first tier slave with broken IO thread, it immediately contacts the master to check if everything is fine.

For intermediate masters orchestrator is not so concerned and does not issue emergency checks.

How to fail over

That's a different story, for some other time. But failing over makes for complex decisions, based on who the replicating slaves are; with/without log-slave-updates; with/without GTID; with/without Pseudo-GTID; whether binlog servers are available; which slaves are available in which data centers. Or you may be using Galera (we're not), which answers most of the above.

Anyway, we use orchestrator for that; it knows our topologies, knows what they should look like, understands how to heal them, knows MySQL replication rules, and invokes external processes to do the stuff it doesn't understand.

by shlomi at July 25, 2015 07:00 AM

July 24, 2015

Peter Zaitsev

InnoDB vs TokuDB in LinkBench benchmark

Previously I tested Tokutek’s Fractal Trees (TokuMX & TokuMXse) as MongoDB storage engines – today let’s look into the MySQL area.

I am going to use a modified LinkBench under a heavy IO load.

I compared InnoDB without compression, InnoDB with 8k compression and TokuDB with quicklz compression.
The uncompressed data size is 115GiB, and the cache size is 12GiB for InnoDB and 8GiB + 4GiB of OS cache for TokuDB.

It is important to note that I used tokudb_fanout=128, which is only available in our latest Percona Server release.
I will write more later on Fractal Tree internals and what tokudb_fanout means. For now let’s just say it changes the shape of the fractal tree (compared to the default tokudb_fanout=16).

I am using two storage options:

  • Intel P3600 PCIe SSD 1.6TB (marked as “i3600” on charts) – as a high end performance option
  • Crucial M500 SATA SSD 900GB (marked as “M500” on charts) – as a low end SATA SSD

The full results and engine options are available here.

Results on Crucial M500 (throughput, more is better)

[chart: LinkBench throughput on the Crucial M500]

    Engine Throughput [ADD_LINK/10sec]

  • InnoDB: 6029
  • InnoDB 8K: 6911
  • TokuDB: 14633

Here TokuDB outperforms InnoDB by almost 2x, but it also shows a great variance in results, which I attribute to checkpoint activity.

Results on Intel P3600 (throughput, more is better)

[chart: LinkBench throughput on the Intel P3600]

    Engine Throughput [ADD_LINK/10sec]

  • InnoDB: 27739
  • InnoDB 8K: 9853
  • TokuDB: 20594

To understand why InnoDB shines on fast storage, let’s review the IO usage of all engines.
The following chart shows the reads in KiB that each engine performs, on average, per client request.

[chart: IO reads in KiB per client request]

The following chart shows the writes in KiB that each engine performs, on average, per client request.

[chart: IO writes in KiB per client request]

Here we can make an interesting observation: TokuDB on average performs half as many writes as InnoDB, and this is what allows TokuDB to do better on slow storage. On fast storage, where there is no performance penalty for many writes, InnoDB is able to get ahead, as InnoDB is still better at using CPUs.

It is worth remembering, though, that:

  • On fast, expensive storage, TokuDB provides better compression, which allows you to store more data in a limited capacity
  • TokuDB still writes half as much as InnoDB, which means twice the lifetime for an SSD (still expensive)

Also, looking at the results, I can conclude that InnoDB compression is inefficient in its implementation, as it is not able to benefit, first, from doing fewer reads (well, it helps it do better than uncompressed InnoDB, but not by much) and, second, from fast storage.

The post InnoDB vs TokuDB in LinkBench benchmark appeared first on MySQL Performance Blog.

by Vadim Tkachenko at July 24, 2015 02:12 PM

July 23, 2015

Peter Zaitsev

The Q&A: Creating best-in-class backup solutions for your MySQL environment

Thank you for attending my July 15 webinar, “Creating Best in Class Backup solutions for your MySQL environment.” Due to the amount of content we discussed and some minor technical difficulties near the end of the webinar, we have decided to cover the final two slides of the presentation, along with the questions asked by attendees, in this blog post.

The slides are available for download, and you can watch the webinar in its entirety here.

The final two slides were about our tips for a good backup and recovery strategy. Let’s go over the bullet points, along with the explanations they would have received during the webinar:

  • Use the three types of backups
    • Binary for full restores and new slaves
      • Binary backups are easy to restore and take the least amount of time to restore. The mean time to recover is mostly bound by the time it takes to transfer the backup to the appropriate target server.
    • Logical for partial restores
      • Logical backups, especially when done per table, come in handy when you want to restore one or a few smaller tables.
    • Binlogs for point-in-time recovery
      • Very often the need is for point-in-time recovery. A full backup of any type (logical or binary) is only half the story; we still need the DML statements processed on the server to bring it to the latest state, and that’s where binary log (binlog) backups come into the picture.
  • Store on more than one server and off-site
    • Store your backups in more than one location. What if the backup server goes down? Off-site storage like Amazon S3 and Glacier, with weekly or monthly backup retention, can be a cheaper option.
  • Test your backups!!!!
    • Testing your backups is very important; it’s always good to know that backups are recoverable and not corrupted. Spin up an EC2 instance if you want, copy and restore the backup there, and roll forward a day’s worth of binlogs just to be sure.
  • Document restore procedures, script them and test them!!!
    • When you test your backups, also document the steps needed to restore them, to avoid a last-minute hassle over which commands to use.
  • If taking backups from a slave, run pt-table-checksum
    • Backups are mostly taken from slaves, so make sure to checksum them regularly; you don’t want to back up inconsistent data.
  • Configuration files and scripts
    • Data is not the only thing you should be backing up; back up your config files, scripts and user grants to a secure location.
  • Do you need to back up everything every day?
    • For very large instances a full logical backup is tough. In such cases, evaluate your backup needs: do you really want to back up all the tables? Most of the time the smaller tables are the more important ones, and the ones needing partial restores, so back up only those.
  • Hardlinking backups can save a lot of disk space in some circumstances
    • Some schemas contain only a few high-activity tables; the rest are updated maybe once a week, or by an archiver job that runs monthly. Hardlink those files to the previous backup’s copies; it can save a good amount of space in such scenarios.
  • Monitor your backups
    • Lastly, monitor your backups. You do not want to find out that your backups have been failing the whole time. Even a simple email notification from your backup scripts can reduce the chance of failure.

Now let’s try to answer some of the questions asked during the webinar:

Q: --use-memory=2G, is that pretty standard? If we have more memory, should we use a higher value?
Usually we evaluate the value based on the size of xtrabackup_logfile (the amount of transactions to apply). If you have more free memory, feel free to give it to --use-memory; you don’t want memory to be the bottleneck in the restore process.
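
A hedged example, assuming a restore host with plenty of spare RAM during the prepare phase:

# innobackupex --apply-log --use-memory=8G /backups/full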

Q: Which is the best backup option for an 8TB DB?
It usually depends on the type of data you have and the business requirements for the backups. For example, a full xtrabackup followed by incrementals on the weekdays would be a good idea. The time required for backups plays an important role here: backing up to a slow NAS share can be time consuming, and it will make xtrabackup record a lot of transactions, which will further increase your restore time. Also look into backing up very important medium and small tables via logical backups.

Q : I’m not sure if this will be covered, but if you have a 3 master-master-master cluster using haproxy, is it recommended to run the backup from the haproxy server or directly on a specific server? Would it be wise to have a 4th server which would be part of the cluster, but not read from to perform the backups?
I am assuming this a Galera cluster setup, in which case you can do backups locally on any of the node by using tools like percona xtrabackup, however the best solution would be spinning off a slave from one of the nodes and running backups there.

Q: With Mydumper, can we stream the data over SSH or netcat to another server? Or would one have to use something like NFS? I’ve used mysqldump and piped it over netcat before; curious if we can do that with Mydumper?
Mydumper is similar in nature to other MySQL client tools. They can be run remotely (the --host option), which means you can run mydumper from another server to back up the master or a slave. Mydumper can be piped, too.

Q: Is Mydumper still maintained? It hasn’t had a release since March of last year.
Indeed, Max Bubenick from Percona is currently maintaining the project. He has actually added new features that make the tool more comprehensive and feature rich. He is planning the next release soon; stay tuned for the blog post.

Q: Is MyDumper open source? Are prepare and restore the same thing?
Absolutely. Right now you need to download the source and compile it, but very soon we will have packages built for it too. Prepare and restore are common terms in backup lingo. In the webinar, restore means copying the backup files from their storage location to the destination location, whereas prepare means applying the transactions to the backup, making it ready to restore.

Q: Is binlog mirroring needed on Galera (PXC)?
It is a good idea to keep binlog mirroring. Even though IST and SST will do their job of joining a node, the binlogs can play a role if you want to roll forward a particular schema on a slave or QA instance.

Q: We know that Percona XtraBackup takes full and incremental backups. Does MyDumper also help in taking incremental backups?
At this moment we do not have the ability to take incremental backups with mydumper or with any other logical backup tool. However, weekly full (logical) backups plus daily binlog backups can serve as the same strategy as other incremental backup solutions, and they are easy to restore :)

Q: Is it possible to encrypt the output file? What would be the best methodology to back up data with a database size of 7 to 8GB that increases 25% each day? What is the difference between innobackupex and mydumper?
Indeed it is possible to encrypt the backup files; as a matter of fact, we encrypt backup files with GPG keys before uploading them to off-site storage. The best method to back up a 7 to 8GB instance is to implement all three types of backups we discussed in the webinar; your scenario requires planning for the future, so it’s always best to have different solutions available as the data grows. Innobackupex is part of the Percona XtraBackup toolkit and is a script which performs binary backups of databases, whereas MyDumper is a logical backup tool which creates backups as text files.

Q: How can I optimize a MySQL dump of a large database? The main bottleneck while taking a dump of a large database is that if any table is found to be corrupted, the dump never gets past it, even temporarily skipping the corrupted table. Can we take a backup of a large database without a locking mechanism, i.e. does someone know how to make the backup without locking the tables? Are there any tools which would be faster for backup and restoration, or how can we use mysqldump to handle this kind of issue during crash recovery?
Mysqldump is a logical backup tool, and as such it executes full table scans to back up the tables and write them to the output file; hence it is very difficult to improve the performance of mysqldump (query-wise). Assuming the corruption refers to MyISAM tables, it is highly recommended you repair them before backing up; also, to make sure mysqldump doesn’t fail with an error on such a corrupt table, try its --force option. If you’re using MyISAM tables, the first recommendation is to switch to InnoDB; with most tables on InnoDB, locking can be greatly reduced, to the point where it is negligible: look into --single-transaction (a hedged example is shown below). Faster backup and recovery can be achieved with binary backups; look into the Percona XtraBackup tool, for which we have comprehensive documentation to get you started.
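
A hedged example of such a non-blocking logical dump of an InnoDB-only schema ("mydb" is a placeholder):

# mysqldump --single-transaction --routines --triggers mydb > mydb.sql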

Hope this was a good webinar and we have answered most of your questions. Stay tuned for more such webinars from Percona.

The post The Q&A: Creating best-in-class backup solutions for your MySQL environment appeared first on MySQL Performance Blog.

by Akshay Suryawanshi at July 23, 2015 01:55 PM

MySQL QA Episode 8: Reducing Testcases for Engineers: tuning reducer.sh

Welcome to MySQL QA Episode 8: Reducing Testcases for Engineers: tuning reducer.sh

  1. Advanced configurable variables & their default/vanilla reducer.sh settings
    1. FORCE_SKIPV
    2. FORCE_SPORADIC
    3. TIMEOUT_COMMAND & TIMEOUT_CHECK
    4. MULTI_THREADS
    5. MULTI_THREADS_INCREASE
    6. QUERYTIMEOUT
    7. STAGE1_LINES
    8. SKIPSTAGE
    9. FORCE_KILL
  2. Some examples
    1. FORCE_SKIPV/FORCE_SPORADIC
    2. TIMEOUT_COMMAND/TIMEOUT_CHECK

Full-screen viewing @ 720p resolution recommended.

The post MySQL QA Episode 8: Reducing Testcases for Engineers: tuning reducer.sh appeared first on MySQL Performance Blog.

by Roel Van de Paar at July 23, 2015 10:00 AM

July 22, 2015

Peter Zaitsev

SELinux and the MySQL init script

I recently worked with a customer who had a weird issue: when their MySQL server was already started (Percona Server 5.5) and they ran service mysql start a second time, the init script was not able to detect that an instance was already running. As a result, it tried to start a second instance with the same settings as the first one. Of course this fails and creates a mess. What was the issue? A missing rule in SELinux. At least, that’s what it looks like.

Summary

If SELinux is set to enforcing and you are using Percona Server on CentOS/RHEL 6 (other versions could be affected), service mysql start doesn’t work properly, and the fix is simple:

# grep mysqld_safe /var/log/audit/audit.log | audit2allow -M mysqld_safe
# semodule -i mysqld_safe.pp
# service mysql restart

Other options are:

  • Set SELinux to permissive
  • Use the CentOS/RHEL standard MySQL init script (note that I didn’t extensively check whether that could trigger other errors)

How did we see the issue?

That was pretty easy: if an instance is already running and you run service mysql start again, you should see something like this in the MySQL error log:

150717 08:47:58 mysqld_safe A mysqld process already exists

But if you instead see tons of error messages like:

2015-07-17 08:47:05 27065 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11
2015-07-17 08:47:05 27065 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files.

it means the init script is broken somewhere.

Investigation

When the issue was brought to my attention, I tried to reproduce it on my local box, but with no luck. What was so special about the configuration used by the customer?

The only thing slightly out of the ordinary was SELinux, which was set to enforcing. So we set SELinux to permissive, and guess what? service mysql start now worked properly and didn’t allow 2 concurrent instances to run!
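
A quick, hedged way to check this on your own box (setenforce 0 lasts only until the next reboot):

# getenforce
Enforcing
# setenforce 0
# getenforce
Permissive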

The next step was to look at the SELinux logs to find any error related to MySQL, and we discovered messages like:

type=SYSCALL msg=audit(1437121845.464:739): arch=c000003e syscall=62 success=no exit=-13
a0=475 a1=0 a2=0 a3=7fff0e954130 items=0 ppid=1 pid=5732 auid=500 uid=0 gid=0 euid=0 suid=0
fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=5 comm="mysqld_safe" exe="/bin/bash"
subj=unconfined_u:system_r:mysqld_safe_t:s0 key=(null)

At this point we knew that a rule was missing for mysqld_safe; we needed to add a new one.

Deeper investigation

What actually happens is that SELinux prevents this check in mysqld_safe from being executed:

if kill -0 $PID > /dev/null 2> /dev/null

so the script assumes the mysqld process is not running. That’s why a second mysqld is started.

However, users of Oracle MySQL will probably never experience this issue, simply because the init script is a bit different: before calling mysqld_safe, it tries to ping a potential mysqld instance, and if it gets a positive reply or an Access denied error, it concludes that mysqld is already running and doesn’t invoke mysqld_safe.

The fix

Fortunately, this is quite simple. You can generate the corresponding rule with audit2allow:

grep mysqld_safe /var/log/audit/audit.log | audit2allow -M mysqld_safe

And after checking the corresponding .te file, we were able to load the new module:

semodule -i mysqld_safe.pp

After stopping MySQL, you can now use service mysql start normally.

Conclusion

This issue was quite interesting to work on because finding the culprit was not that easy. It also only triggers when SELinux is enabled and Percona Server is used. Now, should the init script of Percona Server be fixed? I’m not sure of the potential problems that could occur if we did so, but of course feel free to leave your feedback in the comments.

The post SELinux and the MySQL init script appeared first on MySQL Performance Blog.

by Stephane Combaudon at July 22, 2015 05:09 PM

July 21, 2015

Peter Zaitsev

Percona now offering 24/7 support for MongoDB and TokuMX

Today Percona announced the immediate availability of 24/7, enterprise-class support for MongoDB and TokuMX. The new support service helps organizations achieve maximum application performance without database bloat. Customers have round-the-clock access (365 days a year) to the most trusted team of database experts in the open source community.

The news means that Percona now offers support across the entire open-source database ecosystem, including the entire LAMP stack (Linux, Apache, MySQL, and PHP/Python/Perl), providing a single, expert, proven service provider for companies to turn to in good times (always best to be proactive) – and during emergencies, too.

Today’s support announcement follows Percona’s acquisition of Tokutek, which included the Tokutek distribution of MongoDB – making Percona the first vendor to offer both MySQL and MongoDB software and solutions.

Like Percona’s other support services, support for MongoDB and TokuMX enables organizations to talk directly with Percona’s support experts at any time, day or night.

The Percona Support team is always ready to help resolve database and server instability, initiate data recovery, optimize performance, deal with response and outage issues – and ensure proactive system monitoring and alert responses. Percona also offers support across on-premises, cloud, and hybrid deployments.

The post Percona now offering 24/7 support for MongoDB and TokuMX appeared first on MySQL Performance Blog.

by Tom Diederich at July 21, 2015 07:22 PM

MySQL QA Episode 7: Reducing Testcases for Beginners – single-threaded reducer.sh!

Welcome to MySQL QA Episode #7 – Reducing Testcases for Beginners: single-threaded reducer.sh!

In this episode we’ll learn how to use reducer.sh. Topics discussed;

  1. reducer.sh introduction/concepts
  2. Basic configurable variables & their default reducer.sh settings
    1. INPUTFILE options
    2. MODE=x
    3. TEXT=”text”
    4. WORKDIR_LOCATION & WORKDIR_M3_DIRECTORY
    5. MYEXTRA
    6. MYBASE
    7. PQUERY_MOD & PQUERY_LOC
    8. MODE5_COUNTTEXT, MODE5_ADDITIONAL_TEXT & MODE5_ADDITIONAL_COUNTTEXT
    9. How to learn more about each of the settings
  3. Manual example
  4. Introduction to the script’s self-recursion concept – subreducer
  5. Quick setup re-cap, details of an already executed QA run
  6. Examples from pquery-prep-red.sh (including some issue reviews)
  7. Gotcha’s
  8. QUERYTIMEOUT & STAGE1_LINES Variables

Full-screen viewing @ 720p resolution recommended.

If the speed is too slow for you, consider setting YouTube to 1.25 playback speed.

The post MySQL QA Episode 7: Reducing Testcases for Beginners – single-threaded reducer.sh! appeared first on MySQL Performance Blog.

by Roel Van de Paar at July 21, 2015 10:00 AM

July 20, 2015

Peter Zaitsev

Percona Live Amsterdam discounted pricing ends July 26!

The Percona Live Data Performance Conference in Amsterdam is just two months away and it’s going to be an incredible event. With a new expanded focus on MySQL, NoSQL, and Data in the Cloud, this conference will be jam-packed with talks from some of the industry’s leading experts from MongoDB, VMware, Oracle, MariaDB, Facebook, Booking.com, Pythian, Google, Rackspace, Yelp (and many more, including of course Percona).

Early Bird pricing ends this Sunday (July 26)! So if you want to save €25, then you’d better register now. And for all of my readers, you can take an additional 10% off the entire registration price by using the promo code “10off” at checkout.

It’s also important to book your room at the Mövenpick Hotel for a special rate – but hurry because that deal ends July 27 and the rooms are disappearing fast due to some other big events going on in Amsterdam that week.

Sponsorship opportunities are also still available for Percona Live Amsterdam. Event sponsors become part of a dynamic and fast-growing ecosystem and interact with hundreds of DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solution vendors and entrepreneurs who typically attend the event. This year’s conference will feature expanded accommodations and turnkey kiosks. Current sponsors include:

We’ve got a fantastic conference schedule on tap. Sessions, which will follow each morning’s keynote addresses, feature a variety of topics related to MySQL and NoSQL, High Availability, DevOps, Programming, Performance Optimization, Replication and Backup, MySQL in the Cloud, MySQL Case Studies, Security, and What’s New in MySQL and MongoDB.

Sessions Include:
  • “InnoDB: A Journey to the Core,” Jeremy Cole, Sr. Systems Engineer, Google, Inc. and Davi Arnaut, Software Engineer, LinkedIn
  • “MongoDB Patterns and Antipatterns for Dev and Ops,” Steffan Mejia, Principal Consulting Engineer, MongoDB, Inc.
  • “NoSQL’s Biggest Lie: SQL Never Went Away,” Matthew Revell, Lead Developer Advocate, Couchbase
  • “The Future of Replication is Today: New Features in Practice,” Giuseppe Maxia, Quality Assurance Architect, VMware
  • “What’s New in MySQL 5.7,” Geir Høydalsvik, Senior Software Development Director, Oracle
Tutorials include:
  • “Best Practices for MySQL High Availability,” Colin Charles, Chief Evangelist, MariaDB
  • “Mongo Sharding from the Trench: A Veterans Field Guide,” David Murphy, Lead DBA, Rackspace Data Stores
  • “Advanced Percona XtraDB Cluster in a Nutshell, La Suite: Hands on Tutorial Not for Beginners!,” Frederic Descamps, Senior Architect, Percona

The conference’s evening events will be a perfect way to network, relax and have FUN while seeing the beautiful city of Amsterdam!

Monday night, September 21, after the tutorial sessions conclude, attendees are invited to the Delirium Cafe located across the street from the conference venue. With more than 500 beers on tap and great food, this will be the perfect way to kick off the Conference.

Tuesday night, September 22, Booking.com will be hosting the Community dinner of the year at their very own headquarters located in historic Rembrandt Square in the heart of the city. Hop on one of the sponsored canal boats that will pick you up right outside of the Mövenpick for your chance to see the city from the water on the way to the community dinner! You’ll be dropped off right next to Booking.com’s offices!

Wednesday night, September 23, there will be a closing reception taking place at the Mövenpick for your last chance to visit with our exhibitors and to wrap up what promises to be an amazing conference!

See you in Amsterdam!

The post Percona Live Amsterdam discounted pricing ends July 26! appeared first on MySQL Performance Blog.

by Kortney Runyan at July 20, 2015 09:32 PM

Fractal Tree library as a Key-Value store

As you may know, Tokutek is now part of Percona and I would like to explain some internals of TokuDB and TokuMX – what performance benefits they bring, along with further optimizations we are working on.

However, before going into deep details, I feel it is necessary to explain the fundamentals of a Key-Value store, and how Fractal Tree handles it.

Before that, allow me to say that I hear opinions that the “Fractal Tree” name does not reflect the internal structure and looks more like a marketing term than a technical one. I will not go into this discussion and will keep using the name “Fractal Tree”, simply out of respect for the inventors. I think they are in a position to give their invention any name they want.

So, with that said, the Fractal Tree library implements a new data structure for more efficient handling of a Key-Value store (with the main focus on insertion, but more on this later).

You may question how Key-Value relates to SQL transactional databases, as this is more from the NoSQL world. That is partially true, and the Fractal Tree Key-Value library is successfully used in the Percona TokuMX (based on MongoDB 2.4) and Percona TokuMXse (a storage engine for MongoDB 3.0) products.

But if we look at a Key-Value store in general, it may actually be a good fit for use in relational databases. To explain this, let’s take a look at the Key-Value details.

So what is Key-Value data structure?

We will use the notation (k,v), or key=>val, which basically means we associate some value “v” with a key “k”. For software developers, the following analogies may be familiar:
key-value access is implemented as a dictionary in Python, an associative array in PHP or a map in C++.
(More details in Wikipedia.)

I will define a key-value structure as a list of pairs (k,v).

It is important to note that neither the key nor the value has to be a scalar (a single value); both can be compound.
That is, "k1, k2, k3 => v1, v2", which we can read as "give me two values by a 3-part key".

This brings us closer to a database table structure.
If we apply the additional requirement that every (k) in the list (k,v) must be unique, this represents
a PRIMARY KEY for a traditional database table.
To understand this better, let’s take a look at the following table:
CREATE TABLE metrics (
ts timestamp,
device_id int,
metric_id int,
cnt int,
val double,
PRIMARY KEY (ts, device_id, metric_id),
KEY metric_id (metric_id, ts),
KEY device_id (device_id, ts)
)

We can state that the Key-Value structure (ts, device_id, metric_id => cnt, val), with the requirement
that "ts, device_id, metric_id" be unique, represents the PRIMARY KEY for this table; in fact, this is how InnoDB (and TokuDB, for that matter) stores data internally.

Secondary indexes can also be represented in Key=>Value notation. For example, as used in TokuDB and InnoDB:
(secondary_index_key => primary_key), where the key of a secondary index points to a primary key (so later we can get the values by looking up the primary key). Please note that secondary_index_key may not be unique (unless we add a UNIQUE constraint to the secondary index).

Or, taking our table again, the secondary keys are defined as
(metric_id, ts => ts, device_id, metric_id)
and
(device_id, ts => ts, device_id, metric_id)
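
To make this concrete, here is a hedged illustration of how a query against the metrics table above maps to Key-Value operations (a conceptual sketch, not the engine's literal execution plan):

SELECT cnt, val FROM metrics
WHERE metric_id = 7 AND ts >= '2015-07-01 00:00:00';

-- 1) range lookup in (metric_id, ts => ts, device_id, metric_id)
--    collects the matching primary keys
-- 2) point lookups in (ts, device_id, metric_id => cnt, val)
--    fetch cnt and val for each primary key found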

It is expected that a Key-Value store supports basic data manipulation and extraction operations, such as:

        – Add or Insert: add a (key => value) pair to a collection
        – Update: from (key => value1) to (key => value2), that is, update the "value" assigned to "key"
        – Delete: remove(key), that is, delete the (key => value) pair from a collection
        – Lookup (select): return the "value" assigned to "key"

and I want to add a fifth operation:

        – Range lookup: return all values for keys defined by a range, such as "key > 5" or "key >= 10 and key < 15"

The way software implements the internal structure of a Key-Value store defines the performance of the mentioned operations, especially when the data size of the store grows beyond memory capacity.

For decades, the most popular data structure for representing a Key-Value store on disk has been the B-Tree, and with good reason. I won’t go into B-Tree details (see for example https://en.wikipedia.org/wiki/B-tree), but it provides probably the best possible time for Lookup operations. However, it has challenges when it comes to Insert operations.

And this is the area where newcomers such as the Fractal Tree and the LSM-tree (https://en.wikipedia.org/wiki/Log-structured_merge-tree) propose structures which provide better performance for Insert operations (often at the expense of Lookup/Select operations, which may become slower).

To get familiar with the LSM-tree (the structure used by RocksDB) I recommend http://www.benstopford.com/2015/02/14/log-structured-merge-trees/. As for the Fractal Tree, I am going to cover the details in upcoming posts.

The post Fractal Tree library as a Key-Value store appeared first on MySQL Performance Blog.

by Vadim Tkachenko at July 20, 2015 01:50 PM

MariaDB AB

Five things you must know about parallel replication in MariaDB 10.x

guillaumelefranc

When MariaDB 10.0 was launched as GA in 2014 it introduced a major feature: Parallel Replication. Parallel Replication is a fantastic addition to the long list of new MariaDB features, along with Global Transaction IDs. However, I don’t think many people understand this feature very well or know how to leverage its potential to the fullest extent. This blog post will explain a few of the gotchas you may run into while setting up Parallel Replication.

1. Single-thread replication can be slower in MariaDB 10.x than in MariaDB 5.5

The addition of new features in a new branch often comes at a cost. If you have a huge replication workload, you may find that MariaDB 10.x replication is somewhat slower than its counterpart in MariaDB 5.5. Thankfully, that can be counterbalanced by setting up Parallel Replication, and the benefits will be much bigger than the ones you’d get by sticking with 5.5.

2. Parallel replication is not enabled by default.

Newsflash: you have to enable it, for example in your my.cnf (or with SET GLOBAL; the variable is dynamic).

slave_parallel_threads = 8

I’ve seen customers putting it unnecessary large values, like 24 of 32. Don’t do that (unless you have 64 CPU Cores). You might introduce more contention, and the potential for parallel replication is not that big, which leads us to our third bullet point.

3. There needs to be a potential for parallel replication.

I won’t enter in depth in the specifics of parallel replication (I’ll point out to a couple of resources at the end of this blog post). Basically, you must have group commit happening on the master for that, in other words, write events which are committed as a group. To check if you have group commit happening, look up the two following status variables on the master:

MariaDB(db-01)[(none)]> show global status like 'binlog_%commits';
+----------------------+------------+
| Variable_name        | Value      |
+----------------------+------------+
| Binlog_commits       | 3790021298 |
| Binlog_group_commits | 3000740090 |
+----------------------+------------+

The bigger the difference between these two values, the more grouping of commits is happening. In the example above I have 1 group commit for every 1.26 transactions; not fantastic, but still OK.

4. You can make parallel replication faster by making commits slower

To increase the group commit potential, you might actually need to make commits slower on the master. Don’t close this blog yet and call me crazy: we can introduce a very small amount of latency on write transactions by using two configuration variables: binlog_commit_wait_count and binlog_commit_wait_usec.

The first variable, binlog_commit_wait_count, holds transactions back until the given number of them are ready to commit together. You can control this delay with the second variable, an interval in microseconds representing the maximum time those transactions will wait. With those two variables, you can fine-tune your group commit very precisely while taking a negligible performance hit on write commit time.

Of course, you have to start with sensible values and increase or decrease them as needed. Typical values for testing are binlog_commit_wait_count=10 and binlog_commit_wait_usec=5000. This means that up to 10 transactions will wait a maximum of 5ms to commit together. As you can guess, a 5ms delay is not an issue for most real-world applications.
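
Both variables are dynamic, so a hedged starting point can be applied live and tuned from there:

SET GLOBAL binlog_commit_wait_count = 10;
SET GLOBAL binlog_commit_wait_usec = 5000;
-- then recheck Binlog_commits vs Binlog_group_commits to see if grouping improved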

5. You can make things even better with GTID and Domain Identifiers

Parallel Replication works without GTID, but turning on GTID has major advantages. Among other nice features like crash-safe replication and easy topology management, you can use multiple replication streams, each identified by a different Domain ID. Imagine you have two totally different applications on your server; you just have to implement the following in your code:

Application ACME -> SET gtid_domain_id=1
Application BETA -> SET gtid_domain_id=2

By using different Domain Identifiers, you make sure those transactions can execute out-of-order, making use of two different replication threads on the slaves.

Bibliography

Try MariaDB 10

About the Author


Guillaume Lefranc is managing the MariaDB Remote DBA Services Team, delivering performance tuning and high availability services worldwide. He's a believer in DevOps culture, Agile software development, and Craft Brewing.

by guillaumelefranc at July 20, 2015 12:09 PM

Daniël van Eeden

Inserting large rows in MySQL and MariaDB

As the maximum storage size for a LONGBLOB in MySQL is 4GB and the maximum max_allowed_packet size is 1GB, I was wondering how it is possible to use the full LONGBLOB.

So I started testing this. I wrote a Python script with MySQL Connector/Python and used MySQL Sandbox to bring up an instance of MySQL 5.6.25.

One of the first settings I had to change was max_allowed_packet, which was expected. I set it to 1GB.

The next setting was less expected: innodb_log_file_size. The server enforces that a transaction has to fit in 10% of the InnoDB log files, so I had to set it to 2 files of 5GB to be able to insert one record of (almost) 1GB.
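
A hedged sketch of the relevant my.cnf settings for the (almost) 1GB row test:

max_allowed_packet = 1G
innodb_log_file_size = 5G
innodb_log_files_in_group = 2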

So that worked for a row of a bit less than 1GB; this is because there is some overhead in the packet and the total has to fit in 1GB.

For the next step (>1GB) I switched from Python to C so I could use mysql_stmt_send_long_data(), which allows you to upload data in multiple chunks.

I expected that to work, but it didn’t. This is because in MySQL 5.6 and up max_long_data_size was replaced by max_allowed_packet, and max_allowed_packet can only be set to a maximum of 1GB, while max_long_data_size could be set to 4GB.

So I switched from MySQL 5.6 to MariaDB 10.1, because MariaDB still has max_long_data_size. That worked: now I could upload rows of up to (almost) 4GB.

I also noticed InnoDB will complain in the error log about large tuples:
InnoDB: Warning: tuple size very big: 1100000025

So you can insert CD ISO images into your database. For small DVD images this could work if your connector uses the COM_STMT_SEND_LONG_DATA command.

But it's best to avoid this and keep the size of rows smaller.

The scripts I used for my tests (and references to the bugs I found):
https://github.com/dveeden/mysql_supersize

by Daniël van Eeden (noreply@blogger.com) at July 20, 2015 05:47 AM

July 17, 2015

MariaDB AB

Greater Developer Automation and Efficiency with MariaDB Enterprise Summer 2015

diptijoshi

In the last two releases of MariaDB Enterprise, we have provided enhanced performance with the introduction of certified MariaDB binaries for POWER8 and optimized binaries for the x86 platform. This Summer we make it more efficient and automated for developers and DBAs to use our high performance MariaDB binaries.

Ease of Use

MariaDB Enterprise now comes with Docker images as well as Chef recipes and cookbooks, so developers can easily deploy and run their database applications. Using Docker images from the MariaDB Enterprise repository,

  • you can run multiple instances of MariaDB Enterprise Server with different configurations on the same physical server, and
  • easily move your MariaDB Enterprise Server instance from one machine to another.

Greater Automation

MariaDB MaxScale™, an additional part of the MariaDB Enterprise subscription, allows DBAs to launch external scripts when MaxScale detects a change in the state of any backend server. This applies to MariaDB master-slave replication clusters as well as to MariaDB Enterprise Galera Cluster. DBAs can now automatically trigger failover by promoting a slave to master when the master fails, or simply opt to be notified of a master or slave failure so they can take appropriate corrective action. You can also launch scripts with separate behaviour for a database server node being down versus the database server being unreachable. In MariaDB Enterprise Galera Cluster, in addition to database nodes going up and down, you can also launch external scripts when the sync status of a Galera node changes.

Learn more about MariaDB Enterprise and MariaDB MaxScale, or simply download them to experience the efficiency and automation in this new release.

About the Author


Dipti Joshi is Sr Product Manager at MariaDB, Corp.

by diptijoshi at July 17, 2015 07:47 AM

July 16, 2015

Peter Zaitsev

Bypassing SST in Percona XtraDB Cluster with binary logs

In my previous post, I used incremental backups in Percona XtraBackup as a method for rebuilding a Percona XtraDB Cluster (PXC) node without triggering an actual SST. Practically, this reproduces the SST steps, but it can be handy if you already have backups available to use.

In this post, I want to present another methodology for this that also uses a full backup, but instead of incrementals uses any binary logs that the cluster may be producing.

Binary logs on PXC

Binary logs are not strictly needed in PXC for replication, but you may be using them for backups or for asynchronous slaves of the cluster.  To set them up properly, we need the following settings added to our config:

server-id=1
log-bin
log-slave-updates

As I stated, none of these are strictly needed for PXC.

  • server-id=1 — We recommend PXC nodes share the same server-id.
  • log-bin — actually enable the binary log
  • log-slave-updates — log ALL updates to the cluster to this server’s binary log

This doesn’t need to be set on every node, but likely you would set these on at least two nodes in the cluster for redundancy.

Note that this strategy should work with or without 5.6 asynchronous GTIDs.

Recovering data with backups and binary logs

This methodology is conventional point-in-time backup recovery for MySQL.  We have a full backup that was taken at a specific binary log position:

... backup created in the past...
# innobackupex --no-timestamp /backups/full
# cat /backups/full/xtrabackup_binlog_info
node3-bin.000002	735622700

We have this binary log and all binary logs since:

-rw-r-----. 1 root root 1.1G Jul 14 18:53 node3-bin.000002
-rw-r-----. 1 root root 1.1G Jul 14 18:53 node3-bin.000003
-rw-r-----. 1 root root 321M Jul 14 18:53 node3-bin.000004

Recover the full backup

We start by preparing the backup with --apply-log:

# innobackupex --apply-log --use-memory=1G /backups/full
...
xtrabackup: Recovered WSREP position: 1663c027-2a29-11e5-85da-aa5ca45f600f:60072936
...
InnoDB: Last MySQL binlog file position 0 735622700, file name node3-bin.000002
...
# innobackupex --copy-back /backups/full
# chown -R mysql.mysql /var/lib/mysql

The output confirms the same binary log file and position that we knew from before.

Start MySQL without Galera

We need to start mysql, but without Galera so we can apply the binary log changes before trying to join the cluster. We can do this simply by commenting out all the wsrep settings in the MySQL config.

# grep wsrep /etc/my.cnf
#wsrep_cluster_address           = gcomm://pxc.service.consul
#wsrep_cluster_name              = mycluster
#wsrep_node_name                 = node3
#wsrep_node_address              = 10.145.50.189
#wsrep_provider                  = /usr/lib64/libgalera_smm.so
#wsrep_provider_options          = "gcache.size=8G; gcs.fc_limit=1024"
#wsrep_slave_threads             = 4
#wsrep_sst_method                = xtrabackup-v2
#wsrep_sst_auth                  = sst:secret
# systemctl start mysql

Apply the binary logs

We now check our binary log starting position:

# mysqlbinlog -j 735622700 node3-bin.000002 | grep Xid | head -n 1
#150714 18:38:36 server id 1  end_log_pos 735623273 CRC32 0x8426c6bc 	Xid = 60072937

We can compare the Xid on this binary log position to that of the backup. The Xid in a binary log produced by PXC will be the seqno of the GTID of that transaction. The starting position in the binary log shows us the next Xid is one increment higher, so this makes sense: we can start at this position in the binary log and apply all changes as high as we can go to get the datadir up to a more current position.

# mysqlbinlog -j 735622700 node3-bin.000002 | mysql
# mysqlbinlog node3-bin.000003 | mysql
# mysqlbinlog node3-bin.000004 | mysql

This action isn’t particularly fast as binlog events are being applied by a single connection thread. Remember that if the cluster is taking writes while this is happening, the amount of time you have is limited by the size of gcache and the rate at which it is being filled up.

Prime the grastate

Once the binary logs are applied, we can check the final log’s last position to get the seqno we need:

[root@node3 backups]# mysqlbinlog node3-bin.000004 | tail -n 500
...
#150714 18:52:52 server id 1  end_log_pos 335782932 CRC32 0xb983e3b3 	Xid = 63105191
...

This is the seqno we will put in our grastate.dat. Like in the last post, we can copy a grastate.dat from another node to get the proper format. However, this time we must put the proper seqno into place:

# cat grastate.dat
# GALERA saved state
version: 2.1
uuid:    1663c027-2a29-11e5-85da-aa5ca45f600f
seqno:   63105191
cert_index:

Be sure the grastate.dat has the proper permissions, uncomment the wsrep settings and restart mysql on the node:

# chown mysql.mysql /var/lib/mysql/grastate.dat
# grep wsrep /etc/my.cnf
wsrep_cluster_address           = gcomm://pxc.service.consul
wsrep_cluster_name              = mycluster
wsrep_node_name                 = node3
wsrep_node_address              = 10.145.50.189
wsrep_provider                  = /usr/lib64/libgalera_smm.so
wsrep_provider_options          = "gcache.size=8G; gcs.fc_limit=1024"
wsrep_slave_threads             = 4
wsrep_sst_method                = xtrabackup-v2
wsrep_sst_auth                  = sst:secret
# systemctl restart mysql

The node should now attempt to join the cluster with the proper GTID:

2015-07-14 19:28:50 4234 [Note] WSREP: Found saved state: 1663c027-2a29-11e5-85da-aa5ca45f600f:63105191

This, of course, still does not guarantee an IST. See my previous post for more details on the conditions needed for that to happen.

The post Bypassing SST in Percona XtraDB Cluster with binary logs appeared first on MySQL Performance Blog.

by Jay Janssen at July 16, 2015 10:00 AM

Bypassing SST in Percona XtraDB Cluster with incremental backups

Beware the SST

In Percona XtraDB Cluster (PXC) I often run across users who are fearful of SSTs on their clusters. I’ve always maintained that if you can’t cope with an SST, PXC may not be right for you, but that doesn’t change the fact that SSTs with multiple terabytes of data can be quite costly.

SST, by current definition, is a full backup from a Donor to a Joiner.  The most popular method is Percona XtraBackup, so we’re talking about a donor node that must:

  1. Run a full XtraBackup that reads its entire datadir
  2. Keep up with Galera replication to it as much as possible (though laggy donors don’t send flow control)
  3. Possibly still be serving application traffic if you don’t remove Donors from rotation.

So, I’ve been interested in alternative ways to work around state transfers and I want to present one way I’ve found that may be useful to someone out there.

Percona XtraBackup and Incrementals

It is possible to use Percona XtraBackup full and incremental backups to build a datadir that might possibly IST.  First we’ll focus on the mechanics of the backups, preparing them and getting the Galera GTID, and then later discuss when it may be viable for IST.

Suppose I have a fairly recent full XtraBackup and one or more incremental backups that I can apply on top of it to get VERY close to realtime on my cluster (more on that ‘VERY’ later).

# innobackupex --no-timestamp /backups/full
... sometime later ...
# innobackupex --incremental /backups/inc1 --no-timestamp --incremental-basedir /backups/full
... sometime later ...
# innobackupex --incremental /backups/inc2 --no-timestamp --incremental-basedir /backups/inc1

In my proof of concept test, I now have a full and two incrementals:

# du -shc /backups/*
909M	full
665M	inc1
812M	inc2
2.4G	total

To recover this data, I follow the normal Xtrabackup incremental apply process:

# cp -av /backups/full /backups/restore
# innobackupex --apply-log --redo-only --use-memory=1G /backups/restore
...
xtrabackup: Recovered WSREP position: 1663c027-2a29-11e5-85da-aa5ca45f600f:35694784
...
# innobackupex --apply-log --redo-only /backups/restore --incremental-dir /backups/inc1 --use-memory=1G
# innobackupex --apply-log --redo-only /backups/restore --incremental-dir /backups/inc2 --use-memory=1G
...
xtrabackup: Recovered WSREP position: 1663c027-2a29-11e5-85da-aa5ca45f600f:46469942
...
# innobackupex --apply-log /backups/restore --use-memory=1G

I can see that as I roll forward on my incrementals, I get a higher and higher GTID. Galera’s GTID is stored in the Innodb recovery information, so Xtrabackup extracts it after every batch it applies to the datadir we’re restoring.
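
As an aside, had the backups been taken with the --galera-info option, the same GTID would also be written to a flat file, which avoids grepping the apply-log output (a sketch, reusing the position recovered above):

# cat /backups/restore/xtrabackup_galera_info
1663c027-2a29-11e5-85da-aa5ca45f600f:46469942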

We now have a datadir that is ready to go. We need to copy it into the datadir of our joiner node and set up a grastate.dat. Without a grastate, starting the node would force an SST no matter what.

# innobackupex --copy-back /backups/restore
# ... copy a grastate.dat from another running node ...
# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    1663c027-2a29-11e5-85da-aa5ca45f600f
seqno:   -1
cert_index:
# chown -R mysql.mysql /var/lib/mysql/

If I start the node now, it should see the grastate.dat with the -1 seqno and run --wsrep_recover to extract the GTID from InnoDB (I could have also just put that directly into my grastate.dat).
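
For reference, a sketch of extracting that GTID by hand instead of relying on the -1 seqno; where the message lands depends on your log-error setting, and the position shown simply reuses the one recovered above:

# mysqld --wsrep-recover --user=mysql 2>&1 | grep 'Recovered position'
WSREP: Recovered position: 1663c027-2a29-11e5-85da-aa5ca45f600f:46469942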

This will allow the node to startup from merged Xtrabackup incrementals with a known Galera GTID.

But will it IST?

That’s the question.  IST happens when the selected donor still has, inside its gcache, all the transactions the joiner needs to get fully caught up.  There are several implications of this:

  • A gcache is mmap allocated and does not persist across restarts on the donor.  A restart essentially purges the mmap.
  • You can query the oldest GTID seqno on a donor by checking the status variable ‘wsrep_local_cached_downto’ (see the sketch after this list).  This variable is not available on 5.5, so you are forced to guess whether you can IST or not.
  • Most PXC 5.6 releases will auto-select a donor based on IST candidacy.  Prior to that (i.e., 5.5), donor selection was not based on IST candidacy at all, meaning you had to be much more careful and do donor selection manually.
  • There’s no direct mapping from the earliest GTID in a gcache to a specific time, so knowing at a glance if a given incremental will be enough to IST is difficult.
  • It’s also difficult to know how big to make your gcache (set in MB/GB/etc.)  with respect to your backups (which are scheduled by the day/hour/etc.)
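
As a concrete example of querying the donor’s gcache coverage (the value shown is illustrative):

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_cached_downto';
+---------------------------+----------+
| Variable_name             | Value    |
+---------------------------+----------+
| wsrep_local_cached_downto | 60072936 |
+---------------------------+----------+

If the seqno recovered from your backups is at or above this value, the donor can in principle serve an IST.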

All that being said, we’re still talking about backups here.  The above method will work only if:

  • You do frequent incremental backups
  • You have a large gcache (hopefully more on this in a future blog post)
  • You can restore a backup in less time than it takes for your gcache to overflow

The post Bypassing SST in Percona XtraDB Cluster with incremental backups appeared first on MySQL Performance Blog.

by Jay Janssen at July 16, 2015 09:00 AM

July 15, 2015

Jean-Jerome Schmidt

How to Avoid SST when adding a new node to Galera Cluster for MySQL or MariaDB

State Snapshot Transfer (SST) is a way for Galera to transfer a full data copy from an existing node (donor) to a new node (joiner). If you come from a MySQL replication background, it is similar to taking a backup of a master and restoring on a slave. In Galera Cluster, the process is automated and is triggered depending on the joiner state.

SST can be painful on some occasions, as it can block the donor node (with SST methods like mysqldump or rsync) and burden it when backing up the data and feeding it to the joiner. For a dataset of a few hundred gigabytes or more, the syncing process can take hours to complete, even if you have a fast network. It might be advisable to avoid SST, e.g., when running in WAN environments with slower connections and limited bandwidth, or if you just want a very fast way of introducing a new node in your cluster.

In this blog post, we’ll show you how to avoid SST. 

SST Methods

Through the variable wsrep_sst_method, it is possible to set the following methods:

  • mysqldump
  • rsync
  • xtrabackup/xtrabackup-v2

xtrabackup is non-blocking for the donor. Use xtrabackup-v2 (and not xtrabackup) if you are running on MySQL version 5.5.54 and later. 

Incremental State Transfer (IST)

To avoid SST, we’ll make use of IST and gcache. IST is a method to prepare a joiner by sending only the missing writesets available in the donor’s gcache. gcache is a file where a Galera node keeps a copy of writesets. IST is faster than SST; it is non-blocking and has no significant performance impact on the donor. It should be the preferred option whenever possible.

IST can only be achieved if all changes missed by the joiner are still in the gcache file of the donor. You will see the following in the donor’s MySQL error log:

WSREP: async IST sender starting to serve tcp://10.0.0.124:4568 sending 689768-761291

And on the joiner side:

150707 17:15:53 [Note] WSREP: Signalling provider to continue.
150707 17:15:53 [Note] WSREP: SST received: d38587ce-246c-11e5-bcce-6bbd0831cc0f:689767
150707 17:15:53 [Note] WSREP: Receiving IST: 71524 writesets, seqnos 689767-761291

Determining a good gcache size

Galera uses a pre-allocated gcache file of a specific size to store writesets in circular buffer style. By default, its size is 128MB. We have covered this in detail here. It is important to determine the right gcache size, as it can influence the data synchronization performance among Galera nodes.

The below gives an idea of the amount of data replicated by Galera. Run the following statements on one of the Galera nodes during peak hours (this works on MariaDB 10 and PXC 5.6, Galera 3.x):

mysql> set @start := (select sum(VARIABLE_VALUE/1024/1024) from information_schema.global_status where VARIABLE_NAME like 'WSREP%bytes');
mysql> do sleep(60);
mysql> set @end := (select sum(VARIABLE_VALUE/1024/1024) from information_schema.global_status where VARIABLE_NAME like 'WSREP%bytes');
mysql> set @gcache := (select SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(variable_value,';',29),';',-1),'=',-1),'M',1) from information_schema.global_variables where variable_name like 'wsrep_provider_options');
mysql> select round((@end - @start),2) as `MB/min`, round((@end - @start),2) * 60 as `MB/hour`, @gcache as `gcache Size(MB)`, round(@gcache/round((@end - @start),2),2) as `Time to full(minutes)`;

+--------+---------+-----------------+-----------------------+
| MB/min | MB/hour | gcache Size(MB) | Time to full(minutes) |
+--------+---------+-----------------+-----------------------+
|   7.95 |  477.00 |  128            |                 16.10 |
+--------+---------+-----------------+-----------------------+

We can tell that the Galera node can tolerate approximately 16 minutes of downtime without requiring SST to join (unless Galera cannot determine the joiner state). If this is too short a time and you have enough disk space on your nodes, you can change wsrep_provider_options=”gcache.size=<value>” to an appropriate value. In this example, setting gcache.size=1G allows us to have 2 hours of node downtime with a high probability of IST when the node rejoins.

Avoiding SST on New Node

Sometimes, SST is unavoidable. This can happen when Galera fails to determine the joiner state when a node is joining. The state is stored inside grastate.dat. Should any of the following scenarios happen, SST will be triggered:

  • grastate.dat does not exist under MySQL data directory - it could be a new node with a clean data directory, or e.g., the DBA manually deleted the file and intentionally forces Galera to perform SST 
  • The grastate.dat file has no seqno or group ID. This can happen if the node crashed during DDL.
  • The seqno inside grastate.dat shows -1 while the MySQL server is still down, which means unclean shutdown or MySQL crashed/aborted due to database inconsistency. (Thanks to Jay Janssen from Percona for pointing this out)
  • grastate.dat is unreadable, due to lack of permissions or corrupted file system.

To avoid SST, we would get a full backup from one of the available nodes, restore the backup on the new node and create a Galera state file so Galera can determine the node’s state and skip SST. 

In the following example, we have two Galera nodes with a garbd. We are going to convert the garbd node to a MySQL Galera node (Db3). We have a full daily backup created using xtrabackup. We’ll create an incremental backup to get as close to the latest data before IST.

1. Install MySQL server for Galera on Db3. We will not cover the installation steps here. Please use the same Galera vendor as for the existing Galera nodes (Codership, Percona or MariaDB). If you are running ClusterControl, you can just use the ‘Add Node’ function and disable the cluster/node auto-recovery beforehand (as we don’t want ClusterControl to automatically join the new node, which would trigger SST).

2. The full backup is stored on the ClusterControl node. Firstly, copy and extract the full backup to Db3:

[root@db3]$ mkdir -p /restore/full
[root@db3]$ cd /restore/full
[root@db3]$ scp root@clustercontrol:/root/backups/mysql_backup/BACKUP-1/backup-full-2015-07-08_113938.xbstream.gz .
[root@db3]$ gunzip backup-full-2015-07-08_113938.xbstream.gz
[root@db3]$ xbstream -x < backup-full-2015-07-08_113938.xbstream

3. Increase the gcache size on all nodes to increase the chance of IST. Append the gcache.size parameter to the wsrep_provider_options line in the MySQL configuration file:

wsrep_provider_options="gcache.size=1G"

Perform a rolling restart, one Galera node at a time (ClusterControl users can use Manage > Upgrades > Rolling Restart):

$ service mysql restart

4. Before creating an incremental backup on Db1 to get the latest data since the last full backup, we need to copy back the xtrabackup_checkpoints file that we got from the extracted full backup on Db3. The incremental backup, when applied, will bring us closer to the current database state. Since we have 2 hours of buffer after increasing the gcache size, we should have enough time to restore the backup and create the necessary files to skip SST.

Create a base directory, which in our case will be /root/temp. Copy xtrabackup_checkpoints from Db3 into it:

[root@db1]$ mkdir -p /root/temp
[root@db1]$ scp root@db3:/restore/full/xtrabackup_checkpoints /root/temp/

5. Create a target directory for incremental backup in Db3:

[root@db3]$ mkdir -p /restore/incremental

6. Now it is safe to create an incremental backup on Db1 based on information inside /root/temp/xtrabackup_checkpoints and stream it over to Db3 using SSH:

[root@db1]$ innobackupex --user=root --password=password --incremental --galera-info --incremental-basedir=/root/temp --stream=xbstream ./ 2>/dev/null | ssh root@db3 "xbstream -x -C /restore/incremental"

If you don’t have a full backup, you can generate one by running the following command on Db1 and stream it directly to Db3:

[root@db1]$ innobackupex --user=root --password=password --galera-info --stream=tar ./ | pigz | ssh root@db3 "tar xizvf - -C /restore/full"

7. Prepare the backup files:

[root@db3]$ innobackupex --apply-log --redo-only /restore/full
[root@db3]$ innobackupex --apply-log /restore/full --incremental-dir=/restore/incremental

Ensure you get the following line at the end of the output stream as an indicator that the above succeeded:

150710 14:08:20  innobackupex: completed OK!

8. Build a Galera state file under the /restore/full directory based on the latest information from the incremental backup. You can get the information inside /restore/incremental/xtrabackup_galera_info:

[root@db3]$ cat /restore/full/xtrabackup_galera_info
d38587ce-246c-11e5-bcce-6bbd0831cc0f:1352215

Create a new file called grastate.dat under the full backup directory:

[root@db3]$ vim /restore/full/grastate.dat

And add the following lines (based on the xtrabackup_galera_info):

# GALERA saved state
version: 2.1
uuid:    d38587ce-246c-11e5-bcce-6bbd0831cc0f
seqno:   1352215
cert_index:

9. The backup is prepared and ready to be copied over to the MySQL data directory. Clear the existing path, copy the prepared data and assign correct ownership:

[root@db3]$ rm -Rf /var/lib/mysql/*
[root@db3]$ innobackupex --copy-back /restore/full
[root@db3]$ chown -Rf mysql.mysql /var/lib/mysql

10. If you installed garbd through ClusterControl, remove it by going to Manage > Load Balancer > Remove Garbd > Remove and skip the following command. Otherwise, stop garbd service on this node:

[root@db3]$ killall -9 garbd

11. The new node is now ready to join the rest of the cluster. Fire it up:

[root@db3]$ service mysql start

Voila, the new node should bypass SST and sync via IST. Monitor the MySQL error log and ensure you see something like below:

150710 14:49:58 [Note] WSREP: Signalling provider to continue.
150710 14:49:58 [Note] WSREP: SST received: d38587ce-246c-11e5-bcce-6bbd0831cc0f:1352215
150710 14:49:58 [Note] WSREP: Receiving IST: 4921 writesets, seqnos 1352215-1357136

by Severalnines at July 15, 2015 11:51 AM

Henrik Ingo

Speaking at CLS and Oscon and shouting at Portland Timbers next week

Vacation is almost over - and it's still 17 degrees outside :-( It's time to start packing for Portland.

I'm as excited as ever, since this year I'm delivering 2 talks. Both fall into a category that is a long-time passion of mine - open source community and business. It's refreshing to not have to talk about databases for once :-)

CLS

The Community Leadership Summit is mostly an unconference, but in recent years has started adding short 15-minute pre-arranged talks. (Kind of like morning keynotes, even if they don't call them that.) On Sunday the 19th, I will be doing a talk called Open Source Governance Models Revisited.

by hingo at July 15, 2015 09:49 AM

July 13, 2015

Jean-Jerome Schmidt

s9s Tools and Resources: The 'Become a MySQL DBA' Series, ClusterControl 1.2.10, Advisors and More!

Check Out Our Latest Technical Resources for MySQL, MariaDB, Postgres and MongoDB

This blog is packed with all the latest resources and tools we’ve recently published! Please do check it out and let us know if you have any comments or feedback.

Live Technical Webinar

In this webinar, we will look at some of the most widely used HA alternatives in the MySQL world and discuss their pros and cons. Krzysztof is a MySQL DBA with experience managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard. This webinar builds upon recent blog posts by Krzysztof on OS and database monitoring.

Register for the webinar

Product Announcements & Resources

ClusterControl 1.2.10 Release

We were pleased to announce a milestone release of ClusterControl in May, which includes several brand new features, making it a fully programmable DevOps platform to manage leading open source databases.

ClusterControl Developer Studio Release

With ClusterControl 1.2.10, we introduced our new, powerful ClusterControl DSL (Domain Specific Language), which allows you to extend the functionality of your ClusterControl platform by creating Advisors, Auto Tuners, or “mini Programs”. Check it out and start creating your own advisors! We’d love to hear your feedback!

Technical Webinar - Replay

We recently started a ‘Become a MySQL DBA’ blog and webinar series, which we’re extending throughout the summer. Here are the first details of that series:

Become a MySQL DBA - Deciding on a relevant backup solution

In this webinar, we discussed the multiple ways to take backups, which method best fits specific needs and how to implement point in time recovery.

Watch the replay and view the slides here

Technical Blogs

Here is a listing of our most recent technical blogs. Do check them out and let us know if you have any questions.

Become a MySQL DBA Blog Series

Further Technical Blogs:

We are hiring!

We’re looking for an enthusiastic frontend developer! If you know of anyone, who might be interested, please do let us know.

We trust these resources are useful. If you have any questions on them or on related topics, please do contact us!

Your Severalnines Team

by Severalnines at July 13, 2015 11:21 AM

July 09, 2015

Colin Charles

#PerconaLive Amsterdam – schedule now out

The schedule is out for Percona Live Europe: Amsterdam (September 21-23 2015), and you can see it at: https://www.percona.com/live/europe-amsterdam-2015/program.

From MariaDB Corporation/Foundation, we have 1 tutorial: Best Practices for MySQL High Availability – Colin Charles (MariaDB)

And 5 talks:

  1. Using Docker for Fast and Easy Testing of MariaDB and MaxScale – Andrea Tosatto (Colt Engine s.r.l.) (I expect Maria Luisa will be giving this talk with him – she’s a wonderful colleague from Italy)
  2. Databases in the Hosted Cloud – Colin Charles (MariaDB)
  3. Database Encryption on MariaDB 10.1 – Jan Lindström (MariaDB Corporation), Sergei Golubchik (Monty Program Ab)
  4. Meet MariaDB 10.1 – Colin Charles (MariaDB), Monty Widenius (MariaDB Foundation)
  5. Anatomy of a Proxy Server: MaxScale Internals – Ivan Zoratti (ScaleDB Inc.)

OK, Ivan is from ScaleDB now, but he was the ex-CTO of SkySQL Ab and one of the primary architects behind MaxScale! We may have more talks, as there are some TBD holes to be filled, but the current schedule looks pretty amazing already.

What are you waiting for, register now!

by Colin Charles at July 09, 2015 02:39 AM

July 03, 2015

Stephane Varoqui

Social Networking Using OQGraph

I was given the chance to experiment with a typical social networking query on an existing 60-million-edge dataset.

How You're Connected


Such algorithms, among others, are simply hardcoded into OQGraph.

With the upgrade to OQGraph v3 in MariaDB 10, we can work directly on top of the existing tables holding the edges, a kind of featured VIRTUAL VIEW.

CREATE OR REPLACE TABLE `relations` (
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `id1` int(10) unsigned NOT NULL,
  `id2` int(10) unsigned NOT NULL,
  `relation_type` tinyint(3) unsigned DEFAULT NULL,
  KEY `id1` (`id1`),
  KEY `id2` (`id2`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

oqgraph=# select count(*) from relations;

+----------+
| count(*) |
+----------+
| 59479722 |
+----------+
1 row in set (23.05 sec)

Very nice integration of table discovery, which saved me from referring to the documentation to find out all the column definitions.

CREATE TABLE `oq_graph`
ENGINE=OQGRAPH `data_table`='relations' `origid`='id1' `destid`='id2';

oqgraph=# SELECT * FROM oq_graph WHERE latch='breadth_first' AND origid=175135 AND destid=7;
+---------------+--------+--------+--------+------+--------+
| latch         | origid | destid | weight | seq  | linkid |
+---------------+--------+--------+--------+------+--------+
| breadth_first | 175135 |      7 |   NULL |    0 | 175135 |
| breadth_first | 175135 |      7 |      1 |    1 |      7 |
+---------------+--------+--------+--------+------+--------+
2 rows in set (0.00 sec)


oqgraph=# SELECT * FROM oq_graph WHERE latch='breadth_first' AND origid=175135 AND destid=5615775;
+---------------+--------+---------+--------+------+----------+
| latch         | origid | destid  | weight | seq  | linkid   |
+---------------+--------+---------+--------+------+----------+
| breadth_first | 175135 | 5615775 |   NULL |    0 |   175135 |
| breadth_first | 175135 | 5615775 |      1 |    1 |        7 |
| breadth_first | 175135 | 5615775 |      1 |    2 | 13553091 |
| breadth_first | 175135 | 5615775 |      1 |    3 |  1440976 |
| breadth_first | 175135 | 5615775 |      1 |    4 |  5615775 |
+---------------+--------+---------+--------+------+----------+
5 rows in set (0.44 sec)

What we first highlight is that the underlying table indexes, KEY `id1` (`id1`) and KEY `id2` (`id2`), are used by OQGraph to navigate the vertices via a number of key reads and range scans; this 5-level relation took around 2689 jumps and 77526 range accesses to the table.

Meaning the traversal touched around 2500 vertices, with an average of 30 edges per vertex.
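
Other algorithms are reachable through the same latch column. For instance, a weighted shortest path via Dijkstra, a sketch assuming the 'dijkstras' latch value listed in the OQGraph v3 documentation:

oqgraph=# SELECT * FROM oq_graph WHERE latch='dijkstras' AND origid=175135 AND destid=5615775;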

# MyISAM

oqgraph=# SELECT * FROM oq_graph_myisam WHERE latch='breadth_first' AND origid=175135 AND destid=5615775;
+---------------+--------+---------+--------+------+----------+
| latch         | origid | destid  | weight | seq  | linkid   |
+---------------+--------+---------+--------+------+----------+
| breadth_first | 175135 | 5615775 |   NULL |    0 |   175135 |
| breadth_first | 175135 | 5615775 |      1 |    1 |        7 |
| breadth_first | 175135 | 5615775 |      1 |    2 | 13553091 |
| breadth_first | 175135 | 5615775 |      1 |    3 |  1440976 |
| breadth_first | 175135 | 5615775 |      1 |    4 |  5615775 |
+---------------+--------+---------+--------+------+----------+
5 rows in set (0.11 sec)

I need to investigate this speed difference with MyISAM further. Ideas are welcome!

by Stephane Varoqui (noreply@blogger.com) at July 03, 2015 11:05 PM

Slave Election is welcoming GTID

Slave election is a popular HA architecture. The first MySQL/MariaDB toolkit to manage switchover and failover in a correct way was introduced by Yoshinori Matsunobu with MHA.

Failover and switchover in asynchronous clusters require caution:

- The CAP theorem needs to be satisfied: getting strong consistency requires the slave election to reject transactions that ended up on the old master when electing the candidate master.

- Slave election needs to take care that all events on the old master are applied to the candidate master before switching roles.

- It should be instrumented to find a good candidate master and to make sure it is set up to take the master role.

- It needs topology detection: a master role can't be predefined, as the role moves around the nodes.

- It needs monitoring to escalate a switchover to a failover.

MHA was coded at a time when no unique event ID was possible in a cluster; each event was tracked as an independent coordinate on each node, forcing the MHA architecture to have an internal way to rematch coordinates on all nodes.

With the introduction of GTID, MHA carries that heritage and looks unnecessarily complex, with an agent-based solution and a requirement for SSH connections to all nodes.

A lighter MHA was needed for MariaDB when replication uses GTID, and that's what my colleague Guillaume Lefranc has been addressing in a new MariaDB toolkit.

In MariaDB, GTID usage is as simple as:

#>stop slave;change master to master_use_gtid=current_pos;start slave; 
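
To check where the slave stands afterwards, you can inspect the GTID position variables that MariaDB 10 exposes:

#>select @@gtid_slave_pos, @@gtid_current_pos;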

As a bonus, the code is written in Go and does not require any external dependencies.
We can enjoy a single command-line procedure in interactive mode.

mariadb-repmgr -hosts=9.3.3.55:3306,9.3.3.56:3306,9.3.3.57:3306 -user=admin:xxxxx -rpluser=repl:xxxxxx -pre-failover-script="/root/pre-failover.sh" -post-failover-script="/root/post-failover.sh" -verbose -maxdelay 15    
Don't be afraid: the default is to run in interactive mode, and it does not launch anything yet.


In my post-failover script, I usually update some HAProxy configuration stored on a NAS or a SAN and reload, or shoot in the head, all proxies.

Note that the newly elected master will be passed as the second argument of the script.
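
A minimal sketch of such a post-failover script, assuming the HAProxy configuration lives on shared storage and contains a single master backend line; only the second argument's meaning is documented above, and treating the first as the old master is an assumption:

#!/bin/bash
# post-failover.sh: $2 is the newly elected master (per the text above);
# treating $1 as the old master is an assumption
NEW_MASTER="$2"

# rewrite the master backend line in the shared HAProxy config
sed -i "s/^\( *server master \).*/\1${NEW_MASTER}:3306 check/" /nas/haproxy/haproxy.cfg

# reload (or shoot in the head) every proxy
for proxy in proxy1 proxy2; do
  ssh "$proxy" 'service haproxy reload'
done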

I strongly advise against auto-failover based on monitoring alone: get a good replication monitoring tool and analyze all master status alerts, checking for false positives, before enjoying pre-coded failover.

Lossless semi-synchronous replication (MDEV-162) and multiple performance improvements to semi-synchronous replication (MDEV-7257) have made it into MariaDB 10.1; they can be used to greatly reduce data loss in case of failure. Combined with parallel replication, it is now possible to have an HA architecture that is as robust as asynchronous replication can be, and that, under replication delay control, is crash-safe as well.

Galera, aka MariaDB Galera Cluster, has a write speed limit bound to the network speed, but comes with the advantage of always offering crash-safe consistency. Slave-election HA is limited by the master's disk speed and does not suffer from lower network speed, but it loses consistency in a failover when slaves can't catch up.

Interesting times ahead to see how flash storage adoption favors one architecture or the other.

by Stephane Varoqui (noreply@blogger.com) at July 03, 2015 11:04 PM

Chris Calender

MariaDB 10.0.20 Overview and Highlights

MariaDB 10.0.20 was recently released, and is available for download here:

https://downloads.mariadb.org/mariadb/10.0.20/

This is the eleventh GA release of MariaDB 10.0, and the 21st release of MariaDB 10.0 overall.

There were no major functionality changes, but there was one security fix, 6 crashing bugs fixed, some general upstream fixes, and quite a few bug fixes, so let me cover the highlights:

  • Security Fix: The client command-line option --ssl-verify-server-cert (and the MYSQL_OPT_SSL_VERIFY_SERVER_CERT option of the client API), when used together with --ssl, will ensure that the established connection is SSL-encrypted and that the MariaDB server has a valid certificate. This fixes CVE-2015-3152 (see the example after this list).
  • Crashing Bug: mysql_upgrade crashes the server with REPAIR VIEW (MDEV-8115).
  • Crashing Bug: Server crashes in intern_plugin_lock on concurrent installing semisync plugin and setting rpl_semi_sync_master_enabled (MDEV-363).
  • Crashing Bug: Server crash on updates with joins still on 10.0.18 (MDEV-8114).
  • Crashing Bug: Too large scale in DECIMAL dynamic column getter crashes mysqld (MDEV-7505).
  • Crashing Bug: Server crashes in get_server_from_table_to_cache on empty name (MDEV-8224).
  • Crashing Bug: FreeBSD-specific bug that caused a segfault on FreeBSD 10.1 x86 (MDEV-7398).
  • XtraDB upgraded to 5.6.24-72.2
  • InnoDB updated to InnoDB-5.6.25
  • Performance Schema updated to 5.6.25
  • TokuDB upgraded to 7.5.7
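
For example, a client invocation that benefits from the fix (hostname and credentials are placeholders):

mysql --ssl --ssl-verify-server-cert -h db.example.com -u app -p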

Given the security fix, you may want to consider upgrading if that particular CVE is of concern to you. Also, please review the crashing bugs to see if they might affect you, and upgrade if so. And if you are running TokuDB, XtraDB, InnoDB, or Performance Schema, you may also want to benefit from those fixes, as well as the new MariaDB fixes (139 in all).

You can read more about the 10.0.20 release here:

https://mariadb.com/kb/en/mariadb-10020-release-notes/

And if interested, you can review the full list of changes in 10.0.20 (changelogs) here:

https://mariadb.com/kb/en/mariadb-10020-changelog/

Hope this helps.

by chris at July 03, 2015 03:18 PM