Planet MariaDB

August 29, 2016

Peter Zaitsev

Percona Live Europe featured talk with Alexander Krasheninnikov — Processing 11 billion events a day with Spark in Badoo


Welcome to a new Percona Live Europe featured talk with Percona Live Europe 2016: Amsterdam speakers! In this series of blogs, we’ll highlight some of the speakers that will be at this year’s conference. We’ll also discuss the technologies and outlooks of the speakers themselves. Make sure to read to the end to get a special Percona Live Europe registration bonus!

In this Percona Live Europe featured talk, we’ll meet Alexander Krasheninnikov, Head of Data Team at Badoo. His talk will be on Processing 11 billion events a day with Spark in Badoo. Badoo is one of the world’s largest and fastest growing social networks for meeting new people. I had a chance to speak with Alexander and learn a bit more about the database environment at Badoo:

Percona: Give me a brief history of yourself: how you got into database development, where you work, what you love about it?

Alexander: Currently, I work at Badoo as Head of Data Team. Our team is responsible for providing internal APIs for statistics data collection and processing.

I started as a developer at Badoo, but the project I am going to cover in my talk led to the creation of a separate department.

Percona: Your talk is called “Processing 11 billion events a day with Spark in Badoo.” What were the issues with your environment that led you to Spark? How did Spark solve these needs?

Alexander: When we designed the Unified Data Stream system at Badoo, we extracted several requirements: scalability, fault tolerance and reliability. Altogether, these requirements moved us towards using Hadoop as a deep data storage and data processing framework. Our initial implementation was built on top of Scribe + WebHDFS + Hive. But we realized that the processing speed and the lag in data delivery were unacceptable (we need near-realtime data processing). Someone on our BI team mentioned Spark as being significantly faster than Hive in some cases (especially ones similar to ours). When we investigated Spark’s API, we found the Streaming submodule — ideal for our needs. Additionally, this framework allowed us to use some third-party libraries and write our own code. We’ve actually created an aggregation framework that follows the “divide and conquer” principle. Without Spark, we would definitely have gone a long way re-inventing a lot of what it provides.

Percona: Why is tracking the event stream important for your business model? How are you using the data Spark is providing you to reach business goals?

Alexander: The event stream always represents some important business/technical metrics — votes, messages, likes and so on. All of this, brought together, forms the “health” of our product. The primary goal of our Spark-based system is to process a heterogeneous event stream in one uniform way, and draw charts automatically. We achieved this goal, and now we have hundreds of charts and dozens of developers/analysts/product team members using them. The system also evolved, and now we perform automatic anomaly detection over the event stream. We report strange data behavior to all the interested people.

Percona: What is changing in data use in your businesses model that keeps you awake at night? What tools or features are you looking for to address these issues?

Alexander: As I’ve mentioned before, we have an anomaly detection process for our metrics. If some of our metrics are out of the expected bounds, this is treated as an anomaly, and notifications are sent. Also, we have self-monitoring functionality for the whole system — a small event rate of heartbeats is generated and processed by two different systems. If those show a significant difference — that definitely keeps me awake at night! 🙂

Percona: What are you looking forward to the most at Percona Live Europe this year?

Alexander: My main interest is distributed open source databases. At Percona Live Europe, I expect to gain a lot of new information from the appropriate conference sections. Particularly, I want to get some knowledge about Yandex ClickHouse, as it looks very promising.

You can read more about how Alexander and Badoo use Spark here: techblog.badoo.com.

Want to find out more about Alexander, Spark and Badoo? Register for Percona Live Europe 2016, and come see his talk Processing 11 billion events a day with Spark in Badoo.

Use the code FeaturedTalk and receive €25 off the current registration price!

Percona Live Europe 2016: Amsterdam is the premier event for the diverse and active open source database community. The conferences have a technical focus with an emphasis on the core topics of MySQL, MongoDB, and other open source databases. Percona Live tackles subjects such as analytics, architecture and design, security, operations, scalability and performance. It also provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs. This conference is an opportunity to network with peers and technology professionals by bringing together accomplished DBAs, system architects and developers from around the world to share their knowledge and experience. All of these people help you learn how to tackle your open source database challenges in a whole new way.

This conference has something for everyone!

Percona Live Europe 2016: Amsterdam is October 3-5 at the Mövenpick Hotel Amsterdam City Centre.

by Dave Avery at August 29, 2016 06:51 PM

Jean-Jerome Schmidt

MySQL on Docker: Introduction to Docker Swarm Mode and Multi-Host Networking

In the previous blog post, we looked into Docker’s single-host networking for MySQL containers. This time, we are going to look into the basics of multi-host networking and Docker swarm mode, a built-in orchestration tool to manage containers across multiple hosts.

Docker Engine - Swarm Mode

Running MySQL containers on multiple hosts can get a bit more complex depending on the clustering technology you choose.

Before we try to run MySQL on containers + multi-host networking, we have to understand how the image works, how many resources to allocate (disk, memory, CPU), how the networking is set up (the overlay network drivers - default, flannel, weave, etc.) and how fault tolerance works (how the container is relocated, failed over and load balanced), because all of these will impact the overall operations, uptime and performance of the database. It is recommended to use an orchestration tool to get more manageability and scalability on top of your Docker engine cluster. The latest Docker Engine (version 1.12, released on July 14th, 2016) includes swarm mode for natively managing a cluster of Docker Engines, called a Swarm. Take note that Docker Engine Swarm mode and Docker Swarm are two different projects, with different installation steps, even though they both work in a similar way.

Here are some noteworthy things you should know before entering the swarm world:

  • The following ports must be opened:
    • 2377 (TCP) - Cluster management
    • 7946 (TCP and UDP) - Nodes communication
    • 4789 (UDP) - Overlay network (VXLAN) traffic
  • There are 2 types of nodes:
    • Manager - Manager nodes perform the orchestration and cluster management functions required to maintain the desired state of the swarm. Manager nodes elect a single leader to conduct orchestration tasks.
    • Worker - Worker nodes receive and execute tasks dispatched from manager nodes. By default, manager nodes are also worker nodes, but you can configure managers to be manager-only nodes.

More details in the Docker Engine Swarm documentation.
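As a minimal sketch (assuming RHEL/CentOS hosts running firewalld; adapt for ufw or plain iptables), the ports above could be opened on every Docker host like this:

[root@docker1]$ firewall-cmd --permanent --add-port=2377/tcp                        # cluster management
[root@docker1]$ firewall-cmd --permanent --add-port=7946/tcp --add-port=7946/udp    # node communication
[root@docker1]$ firewall-cmd --permanent --add-port=4789/udp                        # overlay (VXLAN) traffic
[root@docker1]$ firewall-cmd --reload

Repeat on docker2 and docker3 (or skip this step entirely if the hosts sit on a trusted network without a host firewall).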

In this blog, we are going to deploy application containers on top of a load-balanced Galera Cluster on 3 Docker hosts (docker1, docker2 and docker3), connected through an overlay network. We will use Docker Engine Swarm mode as the orchestration tool.

“Swarming” Up

Let’s cluster our Docker nodes into a Swarm. Swarm mode requires an odd number of managers (obviously more than one) to maintain quorum for fault tolerance. So, we are going to use all the physical hosts as manager nodes. Note that by default, manager nodes are also worker nodes.

  1. Firstly, initialize Swarm mode on docker1. This will make the node a manager and the leader:

    [root@docker1]$ docker swarm init --advertise-addr 192.168.55.111
    Swarm initialized: current node (6r22rd71wi59ejaeh7gmq3rge) is now a manager.
    
    To add a worker to this swarm, run the following command:
    
        docker swarm join \
        --token SWMTKN-1-16kit6dksvrqilgptjg5pvu0tvo5qfs8uczjq458lf9mul41hc-dzvgu0h3qngfgihz4fv0855bo \
        192.168.55.111:2377
    
    To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
  2. We are going to add two more nodes as managers. Generate the join command for the other nodes to register as managers:

    [docker1]$ docker swarm join-token manager
    To add a manager to this swarm, run the following command:
    
        docker swarm join \
        --token SWMTKN-1-16kit6dksvrqilgptjg5pvu0tvo5qfs8uczjq458lf9mul41hc-7fd1an5iucy4poa4g1bnav0pt \
        192.168.55.111:2377
  3. On docker2 and docker3, run the following command to register the node:

    $ docker swarm join \
        --token SWMTKN-1-16kit6dksvrqilgptjg5pvu0tvo5qfs8uczjq458lf9mul41hc-7fd1an5iucy4poa4g1bnav0pt \
        192.168.55.111:2377
  4. Verify that all nodes are added correctly:

    [docker1]$ docker node ls
    ID                           HOSTNAME       STATUS  AVAILABILITY  MANAGER STATUS
    5w9kycb046p9aj6yk8l365esh    docker3.local  Ready   Active        Reachable
    6r22rd71wi59ejaeh7gmq3rge *  docker1.local  Ready   Active        Leader
    awlh9cduvbdo58znra7uyuq1n    docker2.local  Ready   Active        Reachable

    At the moment, we have docker1.local as the leader. You can run the “docker network” and “docker service” commands from any of the manager nodes.

Overlay Network

The only way to let containers running on different hosts connect to each other is by using an overlay network. It can be thought of as a container network that is built on top of another network (in this case, the physical hosts network). Docker Swarm mode comes with a default overlay network which implements a VxLAN-based solution with the help of libnetwork and libkv. You can however choose another overlay network driver like Flannel, Calico or Weave, where extra installation steps are necessary. We are going to cover more on that later in an upcoming blog post.

In Docker Engine Swarm mode, you can create an overlay network only from a manager node and it doesn’t need an external key-value store like etcd, consul or Zookeeper.

The swarm makes the overlay network available only to nodes in the swarm that require it for a service. When you create a service that uses an overlay network, the manager node automatically extends the overlay network to nodes that run service tasks.

Let’s create an overlay network for our containers. We are going to deploy Percona XtraDB Cluster and application containers on separate Docker hosts to achieve fault tolerance. These containers must be running on the same overlay network so they can communicate with each other.

We are going to name our network “mynet”. You can only create this on the manager node:

[docker1]$ docker network create --driver overlay mynet

Let’s see what networks we have now:

[docker1]$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
213ec94de6c9        bridge              bridge              local
bac2a639e835        docker_gwbridge     bridge              local
5b3ba00f72c7        host                host                local
03wvlqw41e9g        ingress             overlay             swarm
9iy6k0gqs35b        mynet               overlay             swarm
12835e9e75b9        none                null                local

There are now 2 overlay networks with a Swarm scope. The “mynet” network is what we are going to use today when deploying our containers. The ingress overlay network comes by default. The swarm manager uses ingress load balancing to expose the services you want externally to the swarm.
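If you want to look at how these networks are set up (subnets, attached containers), they can be inspected like any other Docker network; a quick sketch:

[docker1]$ docker network inspect ingress
[docker1]$ docker network inspect mynet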

Deployment using Services and Tasks

We are going to deploy the Galera Cluster containers through services and tasks. When you create a service, you specify which container image to use and which commands to execute inside the running containers. There are two types of services:

  • Replicated services - Distribute a specific number of replica tasks among the nodes, based on the scale you set in the desired state, for example “--replicas 3”.
  • Global services - One task for the service on every available node in the cluster, for example “--mode global”. If you have 7 Docker nodes in the Swarm, there will be one container on each of them.

Docker Swarm mode has a limitation in managing persistent data storage. When a node fails, the manager gets rid of the containers and creates new containers in place of the old ones to meet the desired replica state. Since a container is discarded when it goes down, we would lose the corresponding data volume as well. Fortunately for Galera Cluster, the MySQL container can be automatically provisioned with state/data when it joins the cluster.
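If you did want node-local persistence for a non-Galera workload, “docker service create” accepts a --mount flag. The sketch below is only an illustration and not part of this deployment (service, volume and image names are made up):

docker service create \
--name some-mysql \
--mount type=volume,source=mysql-data,target=/var/lib/mysql \
--env MYSQL_ROOT_PASSWORD=mypassword \
mysql:5.6

Keep in mind that the named volume lives on whichever node runs the task, so this only helps when the task is rescheduled onto the same node.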

Deploying Key-Value Store

The Docker image that we are going to use is from Percona-Lab. The image requires the MySQL containers to have access to a key-value store (only etcd is supported) for IP address discovery during cluster initialization and bootstrap. The containers will look for other IP addresses in etcd and, if any are found, start MySQL with the proper wsrep_cluster_address. Otherwise, the first container will start with the bootstrap address, gcomm://.

  1. Let’s deploy our etcd service. We will use the etcd image available here. It requires us to generate a discovery URL based on the number of etcd nodes that we are going to deploy. In this case, we are going to set up a standalone etcd container, so the command is:

    [docker1]$ curl -w "\n" 'https://discovery.etcd.io/new?size=1'
    https://discovery.etcd.io/a293d6cc552a66e68f4b5e52ef163d68
  2. Then, use the generated URL as the “-discovery” value when creating the service for etcd:

    [docker1]$ docker service create \
    --name etcd \
    --replicas 1 \
    --network mynet \
    -p 2379:2379 \
    -p 2380:2380 \
    -p 4001:4001 \
    -p 7001:7001 \
    elcolio/etcd:latest \
    -name etcd \
    -discovery=https://discovery.etcd.io/a293d6cc552a66e68f4b5e52ef163d68

    At this point, Docker swarm mode will orchestrate the deployment of the container on one of the Docker hosts.

  3. Retrieve the etcd service virtual IP address. We are going to use that in the next step when deploying the cluster:

    [docker1]$ docker service inspect etcd -f "{{ .Endpoint.VirtualIPs }}"
    [{03wvlqw41e9go8li34z2u1t4p 10.255.0.5/16} {9iy6k0gqs35bn541pr31mly59 10.0.0.2/24}]

    At this point, our architecture looks like this:

Deploying Database Cluster

  1. Specify the virtual IP address for etcd in the following command to deploy Galera (Percona XtraDB Cluster) containers:

    [docker1]$ docker service create \
    --name mysql-galera \
    --replicas 3 \
    -p 3306:3306 \
    --network mynet \
    --env MYSQL_ROOT_PASSWORD=mypassword \
    --env DISCOVERY_SERVICE=10.0.0.2:2379 \
    --env XTRABACKUP_PASSWORD=mypassword \
    --env CLUSTER_NAME=galera \
    perconalab/percona-xtradb-cluster:5.6
  2. The deployment takes some time, as the image is downloaded onto the assigned worker/manager nodes. You can verify the status with the following command:

    [docker1]$ docker service ps mysql-galera
    ID                         NAME                IMAGE                                  NODE           DESIRED STATE  CURRENT STATE            ERROR
    8wbyzwr2x5buxrhslvrlp2uy7  mysql-galera.1      perconalab/percona-xtradb-cluster:5.6  docker1.local  Running        Running 3 minutes ago
    0xhddwx5jzgw8fxrpj2lhcqeq  mysql-galera.2      perconalab/percona-xtradb-cluster:5.6  docker3.local  Running        Running 2 minutes ago
    f2ma6enkb8xi26f9mo06oj2fh  mysql-galera.3      perconalab/percona-xtradb-cluster:5.6  docker2.local  Running        Running 2 minutes ago
  3. We can see that the mysql-galera service is now running. Let’s list out all services we have now:

    [docker1]$ docker service ls
    ID            NAME          REPLICAS  IMAGE                                  COMMAND
    1m9ygovv9zui  mysql-galera  3/3       perconalab/percona-xtradb-cluster:5.6
    au1w5qkez9d4  etcd          1/1       elcolio/etcd:latest                    -name etcd -discovery=https://discovery.etcd.io/a293d6cc552a66e68f4b5e52ef163d68
  4. Swarm mode has an internal DNS component that automatically assigns each service in the swarm a DNS entry. So you can use the service name to resolve to the virtual IP address:

    [docker2]$ docker exec -it $(docker ps | grep etcd | awk {'print $1'}) ping mysql-galera
    PING mysql-galera (10.0.0.4): 56 data bytes
    64 bytes from 10.0.0.4: seq=0 ttl=64 time=0.078 ms
    64 bytes from 10.0.0.4: seq=1 ttl=64 time=0.179 ms

    Or, retrieve the virtual IP address through the “docker service inspect” command:

    [docker1]# docker service inspect mysql-galera -f "{{ .Endpoint.VirtualIPs }}"
    [{03wvlqw41e9go8li34z2u1t4p 10.255.0.7/16} {9iy6k0gqs35bn541pr31mly59 10.0.0.4/24}]

    Our architecture now can be illustrated as below:

Deploying Applications

Finally, you can create the application service and pass the MySQL service name (mysql-galera) as the database host value:

[docker1]$ docker service create \
--name wordpress \
--replicas 2 \
-p 80:80 \
--network mynet \
--env WORDPRESS_DB_HOST=mysql-galera \
--env WORDPRESS_DB_USER=root \
--env WORDPRESS_DB_PASSWORD=mypassword \
wordpress

Once deployed, we can then retrieve the virtual IP address for wordpress service through the “docker service inspect” command:

[docker1]# docker service inspect wordpress -f "{{ .Endpoint.VirtualIPs }}"
[{p3wvtyw12e9ro8jz34t9u1t4w 10.255.0.11/16} {kpv8e0fqs95by541pr31jly48 10.0.0.8/24}]

At this point, this is what we have:

Our distributed application and database setup is now deployed using Docker containers.

Connecting to the Services and Load Balancing

At this point, the following ports are published (based on the -p flag on each “docker service create” command) on all Docker nodes in the cluster, whether or not the node is currently running the task for the service:

  • etcd - 2380, 2379, 7001, 4001
  • MySQL - 3306
  • HTTP - 80
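To double-check what each service publishes, the same “docker service inspect” format syntax used earlier also works for the port list (a quick sketch):

[docker1]$ docker service inspect etcd -f "{{ .Endpoint.Ports }}"
[docker1]$ docker service inspect mysql-galera -f "{{ .Endpoint.Ports }}"
[docker1]$ docker service inspect wordpress -f "{{ .Endpoint.Ports }}"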

If we connect directly to the PublishedPort, with a simple loop, we can see that the MySQL service is load balanced among containers:

[docker1]$ while true; do mysql -uroot -pmypassword -h127.0.0.1 -P3306 -NBe 'select @@wsrep_node_address'; sleep 1; done
10.255.0.10
10.255.0.8
10.255.0.9
10.255.0.10
10.255.0.8
10.255.0.9
10.255.0.10
10.255.0.8
10.255.0.9
10.255.0.10
^C

At the moment, the Swarm manager handles the load balancing internally and there is no way to configure the load balancing algorithm. We can use external load balancers to route outside traffic to these Docker nodes. If any of the Docker nodes goes down, the service will be relocated to the other available nodes.
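Relocation can also be observed without pulling a cable. A small sketch using the node names from this setup: draining a node makes Swarm reschedule its tasks onto the remaining nodes, and “docker service ps” shows the new placement.

[docker1]$ docker node update --availability drain docker2.local
[docker1]$ docker service ps mysql-galera
[docker1]$ docker node update --availability active docker2.local

The last command puts the node back into rotation.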

That’s all for now. In the next blog post, we’ll take a deeper look at Docker overlay network drivers for MySQL containers.

by Severalnines at August 29, 2016 06:51 PM

August 26, 2016

MariaDB AB

Configuring MariaDB Master and MariaDB MaxScale for Data Streaming

Massimiliano Pinto

In the previous blog, I introduced Data Streaming with MariaDB MaxScale. In this post, I will show you how to configure a MariaDB Master server and MariaDB MaxScale to stream binary log events from the Master server to MaxScale, and convert them to AVRO records.

First of all, some checks and possibly some modifications will be needed on the MariaDB 10 database.

Configuring the Master Database

MariaDB MaxScale requires that the binary log events contain the data of each row being modified, rather than just the operation performed on the row (insert, update or delete). Additionally, in order to stream the entire content of the rows being modified, replication on the Master database must be configured with the ‘row’ binlog format and the full row image. Simply edit my.cnf, add the two required options and restart the mysqld process.

[mysqld]
....
binlog_format=row
binlog_row_image=full

This way, in row-based replication, each row change event contains two images, a “before” image whose columns are matched against when searching for the row to be updated, and an “after” image containing the changes.

You can find out more about replication formats from the MariaDB Knowledge Base.
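A quick way to confirm the settings after the restart (a sketch; the host name and credentials are placeholders):

$ mysql -h master-host -u root -p -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('binlog_format', 'binlog_row_image');"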

Configuring the MaxScale Server

MariaDB MaxScale can be configured to run on the same server as the MariaDB 10 Master or separately on a dedicated server. In this blog we will show the configuration for MariaDB MaxScale running on its own dedicated server.

If MariaDB MaxScale is installed on a dedicated server, it needs to register with the Master database as a Slave. To do so, a binlog server needs to be configured in MariaDB MaxScale so that it can receive binlog events from the MariaDB 10 Master database.

AVRO Service Setup

We start by adding two new services into the configuration file. The first service (replication-service) uses the binlogrouter plugin, which reads the binary logs from the Master server.

The binlog router supports both old-style binlog events without GTID and MariaDB 10 binlog events with GTID. However, for the purpose of data streaming, MariaDB 10 GTID is required, so we need to explicitly set the router option mariadb10-compatibility to 1.

# The Replication Proxy service
[replication-service]
type=service
router=binlogrouter
router_options=server-id=4000,
               master-id=3000,
               binlogdir=/var/lib/maxscale/binlog/,
               mariadb10-compatibility=1
user=maxuser
passwd=maxpwd

The second service (avro-service) reads the binary logs as they are streamed from the Master through the binlog router and converts them into AVRO format files. Please note that the source of the binary log events is set to “replication-service” - this is what enables avro-service to continuously consume the binary log events received by the replication service and convert them into AVRO records. The router option avrodir is the directory where the converted AVRO records are stored. The router option filestem is the prefix of the source binlog files.

# The Avro conversion service
[avro-service]
type=service
router=avrorouter
source=replication-service
router_options=avrodir=/var/lib/maxscale/avro/,
               filestem=binlog

Next, we need to set up the listener on MaxScale, so that MaxScale can be administratively configured to register with the Master server as a Slave.

# The listener for the replication-service
[replication-listener]
type=listener
service=replication-service
protocol=MySQLClient
port=4000

Now, we need to set up the listener on MaxScale where the CDC protocol clients can request AVRO change data records.

# The client listener for the avro-service
[avro-listener]
type=listener
service=avro-service
protocol=CDC
port=4001

AVRO Schema Setup

Before starting the conversion process, AVRO schema files for all the tables that need to be replicated must be present on the MaxScale server.

Every converted AVRO record corresponds to a single table record and contains the schema of the table, including a schema id that refers to the id of an AVRO schema file. The schema files are in JSON format, per the AVRO specification, and are stored in $avrodir. So that MaxScale can obtain these schema files before the conversion process starts, either:

  • the binary log events need to include the CREATE TABLE and any ALTER TABLE events, OR;

  • the schema files need to be created on MaxScale manually using the cdc_schema Go utility.

All AVRO file schemas follow the same general idea. They are written in JSON and have the following format:

{
    "Namespace": "MaxScaleChangeDataSchema.avro",
    "Type": "record",
    "Name": "ChangeRecord",
    "Fields":
    [
        {
            "Name": "name",
            "Type": "string"
        },
        {
            "Name":"address",
            "Type":"string"
        },
        {
            "Name":"age",
            "Type":"int"
        }
    ]
}

The AVRO converter uses the schema file to identify the columns, their names and what type they are. The Name field contains the name of the column and the Type contains the AVRO type. Read the AVRO specification for details on the layout of the schema files.

Starting MariaDB MaxScale

The next step is to start MariaDB MaxScale and set up the binlog server. We do that by connecting to the listener of the replication-service and executing a CHANGE MASTER command to set the Master server, then starting the Slave on MaxScale.

An example of the CHANGE MASTER command required to configure the binlog server:

# mysql -h maxscale-host -P 4000
CHANGE MASTER TO MASTER_HOST='172.18.0.1',
       MASTER_PORT=3000,
       MASTER_LOG_FILE='binlog.000001',
       MASTER_LOG_POS=4,
       MASTER_USER='maxuser',
       MASTER_PASSWORD='maxpwd';
START SLAVE;

That command will start the replication of binary logs from the Master server at 172.18.0.1:3000. After the binary log streaming has started, the AVRO router will automatically start converting the binlogs into AVRO files. Please note that the first binlog file to be converted is the one with the highest sequence number in the $binlogdir with the $filestem prefix.
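A quick sketch to see which file that will be, using the binlogdir and filestem values from the configuration above:

[maxscale@localhost]$ ls -1 /var/lib/maxscale/binlog/binlog.* | tail -1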

Now, let us create an example table and insert some data into it to see if our setup works.

First, create a simple test table using the following statement and populate it.

CREATE TABLE test.t1 (id INT); 
INSERT INTO test.t1 VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);

The table creation and data insert events will be converted into an AVRO file, which can be inspected using the maxavrocheck utility program.

[maxscale@localhost]$ maxavrocheck /var/lib/maxscale/avro/test.t1.000001.avro
File sync marker: caaed7778bbe58e701eec1f96d7719a
/var/lib/maxscale/avro/test.t1.000001.avro: 1 blocks, 1 records and 12 bytes

Now we can see the generated AVRO record. In the next blog post, we will explore how these AVRO records can be requested in real time with the MariaDB MaxScale CDC API.

Related Blog Posts

Data Streaming with MariaDB MaxScale

MaxScale as a Replication Proxy, aka the Binlog Server

About the Author


Massimiliano is a Senior Software Solutions Engineer working mainly on MaxScale. Massimiliano has worked for almost 15 years in web companies, playing the roles of Technical Leader and Software Engineer. Prior to joining MariaDB he worked at Banzai Group and Matrix S.p.A., big players in the Italian web industry. He is still a guy who likes the terminal window on his Mac a bit too much. His skills also include Apache modules and PHP extensions.

by Massimiliano Pinto at August 26, 2016 09:06 AM

Open Query

On Open Source and Business Choices

Open Source is a whole-of-process approach to development that can produce high-quality products better tailored to users’ real world needs.  A key reason for this is the early feedback cycle built into that complete process.

Simply publishing something under an Open Source license (while not applying Open Source development processes) does not yield the same quality and other benefits.  So, not all Open Source is the same.

Publishing source of a product “later” (for instance when the monetary benefit has diminished for the company) is meaningless.  In this scenario, there is no “Open Source benefit” to users whatsoever, it’s simply a proprietary product. There is no opportunity for the client to make custom modifications or improvements, or ask a third party to work on such matters – neither is there any third party opportunity to verify and validate either code quality or security.

Open Source is not a marketing gimmick.  Labels such as “Open Source”, or “Enterprise”, on their own, do not have any more positive outcome than a greasy hamburger labeled with “healthy”.  If a company “believes” in Open Source software, they’ll use the open source development model for their software development.

And now we see things like this: Uproar: MariaDB Corp. veers away from open source (by Simon Phipps, InfoWorld, August 2016)

So what does it mean when a company publishes some of their software under an open source license, and releases some related products under a proprietary license?  To me, it’s generally a strong indication that the company either doesn’t believe in that model, or doesn’t understand it.  And we’ve seen it before.

It also reminds me of an interaction I had many years ago.  A Marketing VP asked me “How can we leverage our [Open Source] community?”  I answered the only possible way: “One does not ‘leverage’ the community, that’s not how it works.”  Of course that wasn’t the answer the VP wanted to hear, but that doesn’t make it less true.  They saw the community as an asset to use, rather than work with.  People don’t like getting used, and in the Open Source space that’s even more true.

Companies that have turned their back on their earlier Open Source work and who have devised some other model to (arguably) make more money, have all discovered that this fundamentally changes their market.  They’ll lose some of their users, customers and supporters, and gain some new different clients.  It’s a different market.  Whether and how that pans out in terms of commercial success is never certain.  Given that we know that the Open Source development process yields benefits in terms of quality and features users want, we can say that non-OSS products lack (some of) those benefits, so to put it bluntly, it’ll be a different product of possibly lower quality, and the feature set is likely to differ as well.

Naturally we cannot ascertain code quality directly as we can’t review closed code, bug systems of proprietary software tend to be closed, and changelogs are condensed for marketing purposes.  But as far back as a decade and a half ago there have been independent studies that worked out “lines of code per software flaw”, and the results came out significantly in favour of Open Source software, which has proportionally much fewer bugs.  Bugs also tend to get fixed quicker in Open Source software.  None of this is new(s); see for instance Open-source vs. proprietary software bugs: Which get squashed fastest? (CNET, 2007)

For complete products (libraries are a slightly different beast) with a relatively large market scope, source code being available does not in any way diminish a company’s ability to make money.  Having the core developers, tech writers and support people gives them a significant edge in the open market, and that’s a business asset you can leverage.  You do that by focusing on those aspects in your communications – that’s basic marketing, you draw attention to the positive aspects that make your company/product stand out from the rest.  Clearly, this objective cannot be achieved by force, as you don’t make a (potential) client like or trust you by denying them choice or transparency.

There is one other known option aside from not believing or not understanding, and that’s fear. But fear is an awkward business driver, it makes for very bad decisions.

MariaDB Corp in part uses the Open Source development model, in part they’re an Open Source publisher (in-house work that’s only made available at a later stage in the development process), and now some proprietary product has been added to the mix (actually new versions of an existing product).  Looking at this I am rather unclear about what they believe in.  Of course companies can make business choices as they see fit – but they never operate in a vacuum.  In the end it doesn’t matter much what I believe personally, the market will do what it will – historically, it responds in the various ways as described above.  We’ll see how it pans out.

Open Query does not recommend (or re-sell at all) proprietary tools, as it just doesn’t make sense for us or our clients.  We often do bugfixes and improvements which we contribute upstream – for proprietary tools we can’t do that and thus it becomes a hindrance for us and our clients.  On the specific practical level, we’ve actually never used MaxScale (the product that MariaDB Corp will now sell under different conditions for future versions), and this stems from our experience with its effective predecessor MySQL Proxy.  Having a complex set of scripted logic in a proxy slows down applications and introduces a rather large extra (single) point of potential failure into the infrastructure.  So, while Simon refers to MaxScale as an essential tool for scalable environments, we know from experience that there are other ways of achieving that desired objective, and without the downsides.

Rather than promoting a single tool for many wildly different jobs, we utilise a few different tools depending on the needs of particular client infrastructure.  We still have a couple of (now legacy) MySQL-MMM deployments, but also quite a few Galera clusters, and other setups to suit our clients’ needs.  Key is to not only make the infrastructure convenient to use for applications, but also to not introduce any more single points of failure.  We build resilience into the client’s server infrastructure, without adding significant overhead in either performance or maintenance requirements.

We believe that that’s what clients want, and since potential clients come to us asking exactly for that (and note our approach with relief) we think that we’re doing the right thing by our clients.  We’ve used this approach for over 9 years, and we’ll just keep on doing that – our basic approach doesn’t change even when our tools do.  If you’d like to talk with us about helping you with your infra, using our approach and way of working, contact us today!

by Arjen Lentz at August 26, 2016 04:24 AM

August 25, 2016

MariaDB Foundation

MariaDB 10.0.27 now available

The MariaDB project is pleased to announce the immediate availability of MariaDB 10.0.27. See the release notes and changelog for details on this release.

  • Download MariaDB 10.0.27
  • Release Notes
  • Changelog
  • What is MariaDB 10.0?
  • MariaDB APT and YUM Repository Configuration Generator

Thanks, and enjoy MariaDB!

The post MariaDB 10.0.27 now available appeared first on MariaDB.org.

by Daniel Bartholomew at August 25, 2016 04:39 PM

Peter Zaitsev

Percona Live Europe featured talk with Krzysztof Książek — MySQL Load Balancers – MaxScale, ProxySQL, HAProxy, MySQL Router & nginx

Welcome to the first Percona Live Europe featured talk with Percona Live Europe 2016: Amsterdam speakers! In this series of blogs, we’ll highlight some of the speakers that will be at this year’s conference. We’ll also discuss the technologies and outlooks of the speakers themselves. Make sure to read to the end to get a special Percona Live Europe registration bonus!

In this Percona Live Europe featured talk, we’ll meet Krzysztof Książek, Senior Support Engineer at Severalnines AB. His talk will be on MySQL Load Balancers – MaxScale, ProxySQL, HAProxy, MySQL Router & nginx: a close up look. Load balancing MySQL connections and queries using HAProxy has been popular in the past years. However, the recent arrival of MaxScale, MySQL Router, ProxySQL and now also Nginx as a reverse proxy has changed the game. Which use cases are best for which solution, and how well do they integrate into your environment?

I had a chance to speak with Krzysztof and learn a bit more about these questions:

Percona: Give me a brief history of yourself: how you got into database development, where you work, what you love about it?

Krzysztof: I was working as a system administrator in a hosting company in Poland. They had a need for a dedicated MySQL DBA. So I volunteered for the job. Later, I decided it was time to move on and joined Laine Campbell’s PalominoDB. I had a great time there, working with large MySQL deployments. At the beginning of 2015, I joined Severalnines as Senior Support Engineer. It was a no-brainer for me as I was always interested in building and managing scalable clusters based on MySQL — this is exactly what Severalnines helps its customers with.

Percona: Your talk is called “MySQL Load Balancers: MaxScale, ProxySQL, HAProxy, MySQL Router & nginx – a close up look.” Why are more load balancing solutions becoming available? What problems does load balancing solve for database environments?

Krzysztof: Load balancers are a must in highly scalable environments that are usually distributed across multiple servers or data centers. Large MySQL setups can quickly become very complex — many clusters, each containing numerous nodes and using different and interconnected technologies: MySQL replication, Galera Cluster. Load balancers not only help to maintain availability of the database tier by routing traffic to available nodes, but they also hide the complexity of the database tier from the application.

Percona: You call out three general groups of load balancers: application connectors, TCP reverse proxies, and SQL-aware load balancers. What workloads do these three groups generally address best?

Krzysztof: I wouldn’t say “workloads” — I’d say more like “use cases.” Each of those groups will handle all types of workloads but they do it differently. TCP reverse proxies like HAProxy or nginx will just route packets: fast and robust. They won’t understand the state of MySQL backends, though. For that you need to use external scripts like Percona’s clustercheck or Severalnines’ clustercheck-iptables.

On the other hand, should you want to build your application to be more database-aware, you can use mysqlnd and manage complex HA topologies from your application. Finally, SQL-aware load balancers like ProxySQL or MaxScale can be used to move complexity away from the application and, for example, perform read-write split in the proxy layer. They detect the MySQL state and can make necessary changes in routing — such as moving writes to a newly promoted master. They can also empower the DBA by allowing him to (for example) rewrite queries as they pass the proxy.

Percona: Where do you see load balancing technologies heading in order to deal with some of the database trends that keep you awake at night?

Krzysztof: Personally, I love to see the “empowerment” of DBA’s. For example, ProxySQL not only routes packets and helps to maintain high availability (although this is still the main role of a proxy), it is also a flexible tool that can help a DBA tackle many day-to-day problems. An offending query? You can cache it in the proxy or you can rewrite it on the fly. Do you need to test your system before an upgrade, using real-world queries? You can configure ProxySQL to mirror the production traffic on a test system. You can use it to build a sharded environment. These things, in the past, typically weren’t possible for a DBA to do — the application had to be modified and new code had to be deployed. Activities like those take time, time that is very precious when the ops staff is dealing with databases on fire from a high load. Now I can do all that just through reconfiguring a proxy. Isn’t it great?
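As a hedged illustration of the kind of day-to-day change Krzysztof describes (not something from his talk), a query can be cached in ProxySQL through its admin interface; the credentials, rule id and query pattern below are assumptions for the example:

$ mysql -u admin -padmin -h 127.0.0.1 -P 6032 -e "
    INSERT INTO mysql_query_rules (rule_id, active, match_digest, cache_ttl, apply)
    VALUES (10, 1, '^SELECT .* FROM hot_table', 5000, 1);
    LOAD MYSQL QUERY RULES TO RUNTIME;
    SAVE MYSQL QUERY RULES TO DISK;"

This caches matching SELECT results for five seconds without touching the application; similar rules (using replace_pattern or mirror_hostgroup) cover the query-rewriting and traffic-mirroring cases he mentions.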

Percona: What are you looking forward to the most at Percona Live Europe this year?

Krzysztof: The Percona Live Europe agenda looks great and, as always, it’s a hard choice to decide which talks to attend. I’d love to learn more about the upcoming MySQL 8.0: there are quite a few talks covering both performance improvements and different features of 8.0. There’s also a new Galera version in the works with great features like non-blocking DDLs, so it would be great to see what’s happening there. We’re also excited to run the “Become a MySQL DBA” tutorial again (our blog series on the same topic has been very popular).

Additionally, I’ve been working within the MySQL community for a while and I have many friends who, unfortunately, I don’t see very often. Percona Live Europe is an event that brings us together and where we can catch up. I’m definitely looking forward to this.

You can read more about Krzysztof’s thoughts on load balancers at the Severalnines blog.

Want to find out more about Krzysztof, load balancers and Severalnines? Register for Percona Live Europe 2016, and come see his talk MySQL Load Balancers – MaxScale, ProxySQL, HAProxy, MySQL Router & nginx: a close up look.

Use the code FeaturedTalk and receive €25 off the current registration price!

Percona Live Europe 2016: Amsterdam is the premier event for the diverse and active open source database community. The conferences have a technical focus with an emphasis on the core topics of MySQL, MongoDB, and other open source databases. Percona Live tackles subjects such as analytics, architecture and design, security, operations, scalability and performance. It also provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs. This conference is an opportunity to network with peers and technology professionals by bringing together accomplished DBAs, system architects and developers from around the world to share their knowledge and experience. All of these people help you learn how to tackle your open source database challenges in a whole new way.

This conference has something for everyone!

Percona Live Europe 2016: Amsterdam is October 3-5 at the Mövenpick Hotel Amsterdam City Centre.

by Dave Avery at August 25, 2016 04:36 PM

PostgreSQL Day at Percona Live Amsterdam 2016


Introducing PostgreSQL Day at Percona Live Europe, Amsterdam 2016.

As modern open source database deployments change, often including more than just a single open source database, Percona Live has also changed. We changed our model from being a purely MySQL-focused conference (with variants) to include a significant amount of MongoDB content. We’ve also expanded our overview of the open source database landscape and included introductory talks on many other technologies. These included practices we commonly see used in the world, and new up and coming solutions we think show promise.

In getting Percona Live Europe 2016 ready, something unexpected happened: we noticed the PostgreSQL community come together and submit many interesting talks about this great open source database technology. This effort on their part pushed us to go further than we initially planned this year, and we’ve put together a full day of PostgreSQL talks. At Percona Live Europe this year, we will be running our first ever PostgreSQL Day on October 4th!

Some folks have been questioning this decision: do we really need so much PostgreSQL content? Isn’t there some tension between the MySQL and PostgreSQL communities? (Here is a link to a very recent example.)  

While it might be true (and I think it is) that some contention exists between these groups, I don’t think isolation and indifference are the answers to improving cooperation. They certainly aren’t the best plan for the open source database community at large, because there is too much we can learn from each other — especially when it comes to promoting open source databases as a real alternative to commercial ones.

Every open source community has its own set of “zealots” (or maybe just “strict adherents”). But our dedication to one particular technology shouldn’t blind us to the value of others. The MySQL and PostgreSQL communities have both successfully obtained support through substantial large scale deployments. There are more and more engineers joining those communities, looking to find better solutions for the problems they face and learn from others’ technologies.

Through the years I have held very productive discussions with people like Josh Berkus, Bruce Momjian, Oleg Bartunov,  Ilya Kosmodemiansky and Robert Treat (to name just a few) about how things are done in MySQL versus PostgreSQL — and what could be done better in both.

At PGDay this year, I was glad to see Alexey Kopytov speaking about what MySQL does better; it got some very constructive conversations going. I was also pleased that my keynote on Migration to the Open Source Databases at the same conference was well attended and also sparked some good conversations.

I want this trend to continue to grow. This is why I think running a PostgreSQL Day as part of Percona Live Europe, Amsterdam is an excellent development. It provides an outstanding opportunity for people interested in PostgreSQL to further their knowledge through exposure to  MySQL, MongoDB and other open source technologies. This holds true for folks attending the conference mainly as MySQL and MongoDB users: they get exposed to the state of PostgreSQL in 2016.

Even more, I hope that this new track will spark productive conversations in the hallways, at lunches and other events between the speakers themselves. It’s really the best way to see what we can learn from each other. In the end, it benefits all technologies.

I believe the whole conference is worth attending, but for people who only wish to attend our new  PostgreSQL Day on October 4th, you can register for a single day conference pass using the PostgreSQLRocks discount code (€200, plus VAT).  

I’m looking forward to meeting and speaking with members of the PostgreSQL community at Percona Live!

by Peter Zaitsev at August 25, 2016 01:21 PM

Jean-Jerome Schmidt

Planets9s - Join us next Tuesday for part 1 of our MySQL Query Tuning Trilogy

Welcome to this week’s Planets9s, covering all the latest resources and technologies we create around automation and management of open source database infrastructures.

Join us next Tuesday for part 1 of our MySQL Query Tuning Trilogy

Remember to join us next Tuesday, August 30th for the first part of our upcoming webinar trilogy on MySQL Query Tuning. In this first webinar we will discuss building, collecting, analysing, tuning and testing processes as well as the main tools involved, tcpdump and pt-query-digest. If you haven’t done so yet, sign up below to join us and get your questions answered around MySQL query tuning.

Register for the webinar

Deploying ClusterControl and MySQL-based systems on AWS using Ansible

We recently made a number of enhancements to the ClusterControl Ansible Role, so it now also supports automatic deployment of MySQL-based systems (MySQL Replication, Galera Cluster, NDB Cluster). The updated role uses the awesome ClusterControl RPC interface to automate deployments. It is available at Ansible Galaxy and Github.

Read the blog

Become a MongoDB DBA: backing up your data

In this blog post we describe what tools are available for making backups in MongoDB and what strategies to use. In previous posts of our MongoDB DBA series, we have covered Deployment, Configuration and Monitoring. The next step now is ensuring your data gets backed up safely. Find out how in this latest installment of our Become a MongoDB DBA blog series.

Read the blog

That’s it for this week! Feel free to share these resources with your colleagues and follow us in our social media channels.

Have a good end of the week,

Jean-Jérôme Schmidt
Planets9s Editor
Severalnines AB

by Severalnines at August 25, 2016 11:13 AM

MariaDB AB

Data Streaming with MariaDB MaxScale

Massimiliano Pinto

Data Streaming with MariaDB MaxScale

While traditional analytics databases exist, Apache Hadoop is becoming the de facto data store for big data. It’s an open-source software framework for distributed storage and distributed processing of very large data sets. There is a need for the ability to transfer data from the MariaDB/MySQL operational data store into Hadoop. While tools such as Apache Sqoop exist to export data out of MariaDB/MySQL into Hadoop, its performance is not suitable for streaming or real-time data transfer, as it operates as a batch application.

To address this need, the MariaDB MaxScale team has designed a modular solution with MariaDB MaxScale to stream binlog events coming from the Master database to the data lake via messaging systems such as Kafka’s distributed broker. The binlog events for inserts, updates and deletes are converted into AVRO or JSON format before they are forwarded to the data lake. Kafka is used as the data ingestion pipeline for a distributed data processing environment. MariaDB MaxScale acts as the Kafka producer, whereas big data platforms such as Hadoop, Cassandra, Spark or any other analytic database act as the consumer applications, consuming the data through the Kafka broker.

MariaDB MaxScale Plugins for Data Streaming

The current MariaDB MaxScale binlog router provides change data capture and flow from the MariaDB Master database towards the MariaDB Slave database, while caching binlog events on the MaxScale server itself. By extending this approach, two new plugins are introduced in MariaDB MaxScale:

  • Avro Router: To convert the change data events from binlog events to AVRO and JSON events.
  • Change Data Protocol Plugin: To publish AVRO or JSON change data events to registered clients via CDC Client API.

The avrorouter is a new MaxScale component that has been added in order to convert MySQL binary log events into AVRO records: it’s basically a MariaDB 10.0/10.1-compatible binary log to AVRO file converter. It consumes binary logs from a local directory and transforms them into a set of AVRO files. These files can then be queried by clients for various purposes.

This router is intended to be used in tandem with the Binlog Server. The Binlog Server can connect to a master server and request binlog records. These records can then be consumed by the “avrorouter” directly from the binlog cache of the Binlog Server. This allows MariaDB MaxScale to automatically transform binlog events on the master to local Avro format files.

The converted AVRO files can be requested any time with the new CDC protocol plugin. This protocol should be used to communicate with the avrorouter. The clients can request either AVRO or JSON format data streams from a database table.

AVRO

An AVRO file is a binary Object Container File that consists of a file header and one or more file data blocks. The header contains the JSON version of the schema.

Note: Each AVRO file contains data related to only ONE table.

AVRO relies on schemas: the schema is used whenever AVRO data is read, and it is always present when the data is written. AVRO schemas are defined with JSON. In the context of the MariaDB MaxScale binlog-to-AVRO conversion, each AVRO file contains data related to one table, and for each Master database table there is a corresponding AVRO schema file on MariaDB MaxScale. A cdc-schema utility is provided to generate the AVRO schema files on MariaDB MaxScale from the MariaDB database tables.

Next up, we’ll have upcoming blogs on how to use MariaDB MaxScale for data streaming, including:

  • MariaDB MaxScale 2.0 Configuring MariaDB Master and MariaDB MaxScale for Data Streaming Service
  • How to Stream Change Data through MariaDB MaxScale using CDC API
  • Real-time Data Streaming to Kafka with MaxScale CDC

Documentation Links

MariaDB MaxScale 2.0:

Change Data Capture (CDC) Protocol

Avro Router

Avro Route Tutorial

AVRO:

Apache AVRO

About the Author


Massimiliano is a Senior Software Solutions Engineer working mainly on MaxScale. Massimiliano has worked for almost 15 years in web companies, playing the roles of Technical Leader and Software Engineer. Prior to joining MariaDB he worked at Banzai Group and Matrix S.p.A., big players in the Italian web industry. He is still a guy who likes the terminal window on his Mac a bit too much. His skills also include Apache modules and PHP extensions.

by Massimiliano Pinto at August 25, 2016 08:31 AM

August 24, 2016

Oli Sennhauser

Beware of large MySQL max_sort_length parameter

Today we had a very interesting phenomenon at a customer. He complained that MySQL always gets errors of the following type:

[ERROR] mysqld: Sort aborted: Error writing file '/tmp/MYGbBrpA' (Errcode: 28 - No space left on device)

After a first investigation we found that df -h /tmp showed a full disk from time to time, but we could not see any files with ls -la /tmp/MY*.

After some more investigation we even found, in the Slow Query Log, the query that was producing the problem. It looked similar to this one:

SELECT * FROM test ORDER BY field5, field4, field3, field2, field1;

Now we were able to reproduce the problem at will with the following table:

CREATE TABLE `test` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `data` varchar(64) DEFAULT NULL,
  `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `field1` varchar(16) DEFAULT NULL,
  `field2` varchar(16) DEFAULT NULL,
  `field3` varchar(255) DEFAULT NULL,
  `field4` varchar(255) DEFAULT NULL,
  `field5` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=8912746 DEFAULT CHARSET=utf8
;

And we saw the query in SHOW PROCESSLIST:

| Query   |   26 | Creating sort index | select * from test order by field5, field4, field3, field2, field1 |

But we were still not able to see who (or rather how) the hell mysqld was filling our disk!

I further remembered that I had seen some strange settings in the my.cnf earlier, when we did the review of the database configuration. But somehow I had ignored them.

[mysqld]
max_sort_length  = 8M
sort_buffer_size = 20M

Now I remembered these settings again. We changed max_sort_length back to the default of 1k and suddenly our space problems disappeared!

We played around a bit with different values of max_sort_length and got the following execution times for our query:

max_sort_length | execution time [s] | comment
64              |   8.8 s            |
128             |   8.2 s            |
256             |   9.3 s            |
512             |  11.8 s            |
1k              |  14.9 s            |
2k              |  20.0 s            |
8k              | 129.0 s            |
8M              |  75.0 s            | disk full (50 G)

Conclusion

We set the value of max_sort_length back to the default. Our problems disappeared and we got working, and much faster, SELECT queries.

Do not needlessly change MySQL default values without verifying the impact. Things can become worse than before!!!

The default value of max_sort_length is a good compromise between performance and an appropriate sort length.
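For reference, a small sketch of putting the value back at runtime (credentials are placeholders; keep my.cnf in sync so the change survives a restart):

shell> mysql -u root -p -e "SET GLOBAL max_sort_length = 1024; SHOW GLOBAL VARIABLES LIKE 'max_sort_length';"

Note that SET GLOBAL only affects new sessions; existing connections keep their old value.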

Addendum

What I really did not like about this solution was that I did not understand how the problem occurred. So I investigated some more. We were discussing back and forth whether this could be because of XFS, because of sparse files, or because of some kind of memory mapped files (see also man mmap).

In the end I had the idea of looking at the lsof output while my query was running:

mysql> SELECT * FROM test ORDER BY field5, field4, field3, field2, field1;
ERROR 3 (HY000): Error writing file '/tmp/MYBuWcXP' (Errcode: 28 - No space left on device)

shell> lsof -p 14733

COMMAND   PID  USER   FD   TYPE             DEVICE   SIZE/OFF     NODE NAME
mysqld  14733 mysql   32u   REG               8,18 9705619456 30147474 /tmp/MYck8vf4 (deleted)
mysqld  14733 mysql   49u   REG               8,18  749797376 30147596 /tmp/MYBuWcXP (deleted)

So it looks like there were some deleted files which were still growing!

Further information from the IRC channel led me to the libc temporary files (see also man 3 tmpfile).

And some hints from MadMerlin|work pointed me to:

shell> ls /proc/<pid>/fd

Where you can also see those temporary files.

Thanks to MadMerlin|work for the hints!


by Shinguz at August 24, 2016 09:40 PM

Federico Razzoli

Thoughts on MaxScale new license

MaxScale has been open source until now, just like all MariaDB projects. But the 2.0 version is released under a new license called BSL, which basically makes the covered work non-free until the Change Date (in this case 2019-01-01), when the license will be converted to GPL.

It looks open source friendly, after all. The license will be GPL, just be patient. And the code is available. Right?

No. Completely wrong. For plenty of reasons.

Some reasons

It is a lock-in. No matter how many times Monty repeats that there is no lock-in, we have a brain. If you don’t allow anyone to fix bugs except for yourself, it is a lock-in. If you force your users to buy your support, they won’t buy your competitors’ support.

MariaDB business moves to a non-free product. Yes, 1.4 is free and this won’t change. And yes, when 3.0 is out, 2.0 will be free. But why should they maintain a free version, if money comes from non-free versions? Monty says that open source religion doesn’t put bread on the table. I suppose that maintaining free branches also doesn’t put bread on the table.

I wasn’t able to find any official EOL date for any MaxScale version – if there is one, please comment below.

MariaDB moves innovation to the non-free world. New features are non-free. When they become old, they will be free. Monty also stated that this is the correct way to make money for a lot of projects. And he seems to advise this model to the start-ups that use his venture capital, OpenOcean. Suddenly, BSL seems to be the only way for projects to survive. Is he protecting other projects’ interests, or using them for his own marketing?

MariaDB accused Oracle several times. When Oracle implemented a couple of features and only distributed them in a non-GPL edition (threadpool, PAM authentication), MariaDB pointed out that it had the same features as open source. Which was great. Except that… now MySQL Router is open source, and MaxScale 2.0 is not. Monty has several justifications for this. But I fail to understand why open core is evil and BSL is good.

I mentioned Monty too many times. Is this an attack against Monty? Definitely not, but all the articles I could find express Monty’s opinion, not MariaDB Corporation’s or anyone else’s opinion. I cannot answer the silence.

What is the MariaDB Foundation?

MariaDB Corporation has the legal right to make MaxScale non-free. They own it. They sometimes call it MariaDB MaxScale. They can: they also own the MariaDB trademark.

So, what’s the role of MariaDB Foundation?

They claim they safeguard MariaDB. They don’t mention the ecosystem, the community, or other tools. They don’t mention, of course, MaxScale. Which is quite strange: they claimed that their model is the Apache Foundation, which supports an entire ecosystem in many ways, and owns the trademarks.

Also, the board of directors has 6 members, 3 of whom are from MariaDB Corporation. In this situation, they cannot have an independent opinion on MariaDB Corporation’s actions.

A curious aspect is that they declare they follow the Ubuntu Code of Conduct. Please read its last paragraph and draw your own conclusions.

My position on MariaDB and MaxScale

I am still grateful to MariaDB Corporation for creating and maintaining MariaDB (and to some of their engineers for creating MySQL).

From a technical point of view, they have many interesting features that are not in MySQL. Some of them come from the community, for example the CONNECT engine and their implementation of encryption. And the reason is that MariaDB is very open to the community.

Which brings us to a less technical point of view: MariaDB openness. Their JIRA account allows us to see the bugs (including their current status…). You can also see who is working on what, when next versions will be released, and what they will have. The team is active on the mailing lists and IRC. The documentation is a wiki and the license is free.

I have been a MariaDB supporter for years. I wrote Mastering MariaDB and I am one of their Community Ambassadors, chosen by Colin Charles (who recently left MariaDB). Will my position on the MariaDB project change? I don’t know, it’s too early to answer. For sure, I won’t deny that its openness is amazing and should be a model for everyone. (And I hope this won’t change.)

Has my position on MaxScale changed? Of course it has. I wouldn’t use it for personal projects. Of course I could provide support for it but, given the license change, that seems unlikely to me. There are free alternatives: ProxySQL, MySQL Router, HAProxy. ProxySQL is by far the most interesting, if you ask me.

Has my position changed forever? The answer depends on another question: will MariaDB fix its big mistake? I have no logical reason to be optimistic, but I still hope it will. In the past they have apparently been open to criticism. After a complaint on this blog, they made the MaxScale binaries freely available, and I wrote a thank you post. What I couldn’t know was that they were preparing to close MaxScale’s next versions.

 


by Federico at August 24, 2016 03:48 PM

Peter Zaitsev

How to stop offending queries with ProxySQL


This blog discusses how to find and address badly written queries using ProxySQL.

All of us are very good at writing good queries. We know this is always true! 😉

But sometimes a bad query escapes our control and hits our database. There is the new guy, the probie, who just joined the company and writes all his code with SELECT * and no WHERE clause. We’ve told him “STOP” millions of times, but he refuses to listen. Or there is a new code injection, and it will take the developers some time to isolate and fix the part of the code that is sending killer queries to our database.

The above are true stories; things that happen every day in at least a few environments.

Isolating the bad query isn’t the main problem: that is something that we can do very fast. The issue is identifying the code that is generating the query, and disabling that code without killing the whole application.

That part can take days.

ProxySQL allows us to act fast and stop any offending query in seconds. I will show you how.

Let us say our offending query does this:

SELECT * from history;

Where history is a 2 TB table, partitioned by year, in our DWH.

That query will definitely create some issues on the database. It’s easy to identify it as badly designed.

Unfortunately, it was inserted into an ETL process that uses a multi-threaded approach with auto-recovery, so when you kill the query, the process restarts it. It then takes the developers some time to stop that code. In the meantime, the reporting system serving your company in real time is slooow (or down).

With ProxySQL, you can stop that query in one second:

INSERT INTO mysql_query_rules (rule_id, active, match_pattern, error_msg, apply) VALUES (89,1,'^SELECT \* from history$','Query not allowed',1);
LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;

Done, your database never receives that query again! Now the application gets a message saying that the query is not allowed.
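Once the application code is fixed, the rule can be switched off (or deleted) just as quickly. A minimal sketch, run against the ProxySQL admin interface:

UPDATE mysql_query_rules SET active=0 WHERE rule_id=89;
LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;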

And look, it’s possible to do things even better:

INSERT INTO mysql_query_rules (rule_id, active, match_digest, flagOUT, apply) VALUES (89,1,'^SELECT \* FROM history', 100, 0);
INSERT INTO mysql_query_rules (rule_id, active, flagIN, match_digest, destination_hostgroup, apply) VALUES (1001,1, 100, 'WHERE', 502, 1);
INSERT INTO mysql_query_rules (rule_id, active, flagIN, error_msg, apply) VALUES (1002,1, 100, 'Query not allowed', 1);
LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;

In this case, ProxySQL checks for any query matching SELECT * FROM history. If the query has a WHERE clause, it redirects it to the server for execution. If the query does not have a WHERE clause, it stops the query and sends an error message to the application.
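To confirm that the chain above is actually matching, the rule hit counters can be checked from the admin interface. A small sketch, using the rule_ids from the example above:

SELECT rule_id, hits FROM stats.stats_mysql_query_rules WHERE rule_id IN (89, 1001, 1002);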

Conclusion

This is a very basic example of an offending query, but I think it makes clear how ProxySQL helps any DBA stop one quickly in case of an emergency.
This gives the DBAs and the developers time to coordinate a better plan of action to permanently fix the issue.

References

https://github.com/sysown/proxysql
http://www.proxysql.com/2015/09/proxysql-tutorial-setup-in-mysql.html
https://github.com/sysown/proxysql/blob/v1.2.2/doc/configuration_howto.md
https://github.com/sysown/proxysql/blob/v1.2.2/INSTALL.md

by Marco Tusa at August 24, 2016 12:46 AM

August 23, 2016

Peter Zaitsev

Percona Server 5.7.14-7 is now available


Percona announces the GA release of Percona Server 5.7.14-7 on August 23, 2016. Download the latest version from the Percona web site or the Percona Software Repositories.

Based on MySQL 5.7.14, including all the bug fixes in it, Percona Server 5.7.14-7 is the current GA release in the Percona Server 5.7 series. Percona provides completely open-source and free software. Find release details in the 5.7.14-7 milestone at Launchpad.

New Features:
Bugs Fixed:
  • Fixed potential cardinality 0 issue for TokuDB tables if ANALYZE TABLE finds only deleted rows and no actual logical rows before it times out. Bug fixed #1607300 (#1006, #732).
  • TokuDB database.table.index names longer than 256 characters could cause a server crash if background analyze table status was checked while running. Bug fixed #1005.
  • PAM Authentication Plugin would abort authentication while checking UNIX user group membership if there were more than a thousand members. Bug fixed #1608902.
  • If DROP DATABASE would fail to delete some of the tables in the database, the partially-executed command is logged in the binlog as DROP TABLE t1, t2, ... for the tables for which the drop succeeded. A slave might fail to replicate such a DROP TABLE statement if there are foreign key relationships to any of the dropped tables and the slave has a different schema from the master. The fix is to check, on the master, whether any of the tables in the database being dropped participate in a foreign key relationship, and to fail the DROP DATABASE statement immediately. Bug fixed #1525407 (upstream #79610).
  • PAM Authentication Plugin didn’t support spaces in the UNIX user group names. Bug fixed #1544443.
  • Due to security reasons ld_preload libraries can now only be loaded from the system directories (/usr/lib64, /usr/lib) and the MySQL installation base directory.
  • In the client library, any EINTR received during network I/O was not handled correctly. Bug fixed #1591202 (upstream #82019).
  • SHOW GLOBAL STATUS was locking more than the upstream implementation which made it less suitable to be called with high frequency. Bug fixed #1592290.
  • The included .gitignore in the percona-server source distribution had a line *.spec, which means someone trying to check in a copy of the percona-server source would be missing the spec file required to build the RPMs. Bug fixed #1600051.
  • Audit Log Plugin did not transcode queries. Bug fixed #1602986.
  • If the changed page bitmap redo log tracking thread stops due to any reason, then shutdown will wait for a long time for the log tracker thread to quit, which it never does. Bug fixed #1606821.
  • Changed page tracking was initialized too late by InnoDB. Bug fixed #1612574.
  • Fixed stack buffer overflow if --ssl-cipher had more than 4000 characters. Bug fixed #1596845 (upstream #82026).
  • Audit Log Plugin events did not report the default database. Bug fixed #1435099.
  • Canceling the TokuDB Background ANALYZE TABLE job twice or while it was in the queue could lead to server assertion. Bug fixed #1004.
  • Fixed various spelling errors in comments and function names. Bug fixed #728 (Otto Kekäläinen).
  • Implemented set of fixes to make PerconaFT build and run on the AArch64 (64-bit ARMv8) architecture. Bug fixed #726 (Alexey Kopytov).
Other bugs fixed:

#1542874 (upstream #80296), #1610242, #1604462 (upstream #82283), #1604774 (upstream #82307), #1606782, #1607359, #1607606, #1607607, #1607671, #1609422, #1610858, #1612551, #1613663, #1613986, #1455430, #1455432, #1581195, #998, #1003, and #730.

The release notes for Percona Server 5.7.14-7 are available in the online documentation. Please report any bugs on the launchpad bug tracker.

by Hrvoje Matijakovic at August 23, 2016 05:57 PM

Jean-Jerome Schmidt

Register for Part 1 of our MySQL Query Tuning Trilogy

Remember to join us Tuesday, August 30th for the first part of our upcoming webinar trilogy on MySQL Query Tuning. This first of three in-depth webinar sessions led by Krzysztof Książek, Senior Support Engineer at Severalnines, covers MySQL query tuning process and tools.

When done right, tuning MySQL queries and indexes can increase the performance of your application and decrease response times. We will be covering this complex topic over the course of three webinars of 60 minutes each, so feel free to also register for parts 2 & 3 here.

In this first part of the trilogy we will discuss building, collecting, analysing, tuning and testing processes as well as the main tools involved, tcpdump and pt-query-digest. Register below to join us and get your questions answered around MySQL query tuning.

Date & Registration

Part 1: Query tuning process and tools

Tuesday, August 30th

Register

Feel free to also register for Parts 2 & 3.

Agenda

  • MySQL Query Tuning Trilogy: Process and tools
  • Query tuning process
    • Build
    • Collect
    • Analyse
    • Tune
    • Test
  • Tools
    • tcpdump
    • pt-query-digest

Speaker

Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience in managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard. He’s the main author of the Severalnines blog and webinar series: Become a MySQL DBA.

We look forward to “seeing” you there!

by Severalnines at August 23, 2016 02:10 PM

August 22, 2016

Peter Zaitsev

Percona Server 5.6.32-78.0 is now available


Percona announces the release of Percona Server 5.6.32-78.0 on August 22nd, 2016. Download the latest version from the Percona web site or the Percona Software Repositories.

Based on MySQL 5.6.32, including all the bug fixes in it, Percona Server 5.6.32-78.0 is the current GA release in the Percona Server 5.6 series. Percona Server is open-source and free – this is the latest release of our enhanced, drop-in replacement for MySQL. Complete details of this release are available in the 5.6.32-78.0 milestone on Launchpad.

New Features:
Bugs Fixed:
  • Fixed potential cardinality 0 issue for TokuDB tables if ANALYZE TABLE finds only deleted rows and no actual logical rows before it times out. Bug fixed #1607300 (#1006, #732).
  • TokuDB database.table.index names longer than 256 characters could cause server crash if background analyze table status was checked while running. Bug fixed #1005.
  • PAM Authentication Plugin would abort authentication while checking UNIX user group membership if there were more than a thousand members. Bug fixed #1608902.
  • If DROP DATABASE would fail to delete some of the tables in the database, the partially-executed command is logged in the binlog as DROP TABLE t1, t2, ... for the tables for which the drop succeeded. A slave might fail to replicate such a DROP TABLE statement if there are foreign key relationships to any of the dropped tables and the slave has a different schema from the master. The fix is to check, on the master, whether any of the tables in the database being dropped participate in a foreign key relationship, and to fail the DROP DATABASE statement immediately. Bug fixed #1525407 (upstream #79610).
  • PAM Authentication Plugin didn’t support spaces in the UNIX user group names. Bug fixed #1544443.
  • Due to security reasons ld_preload libraries can now only be loaded from the system directories (/usr/lib64, /usr/lib) and the MySQL installation base directory.
  • Percona Server 5.6 could not be built with the -DMYSQL_MAINTAINER_MODE=ON option. Bug fixed #1590454.
  • In the client library, any EINTR received during network I/O was not handled correctly. Bug fixed #1591202 (upstream #82019).
  • The included .gitignore in the percona-server source distribution had a line *.spec, which means someone trying to check in a copy of the percona-server source would be missing the spec file required to build the RPMs. Bug fixed #1600051.
  • Audit Log Plugin did not transcode queries. Bug fixed #1602986.
  • LeakSanitizer-enabled build failed to bootstrap server for MTR. Bug fixed #1603978 (upstream #81674).
  • Fixed MYSQL_SERVER_PUBLIC_KEY connection option memory leak. Bug fixed #1604419.
  • The fix for bug #1341067 added a call to free some of the heap memory allocated by OpenSSL. This is not safe for repeated calls if OpenSSL is linked twice through different libraries and each is trying to free the same. Bug fixed #1604676.
  • If the changed page bitmap redo log tracking thread stops due to any reason, then shutdown will wait for a long time for the log tracker thread to quit, which it never does. Bug fixed #1606821.
  • Audit Log Plugin events did not report the default database. Bug fixed #1435099.
  • Canceling the TokuDB Background ANALYZE TABLE job twice or while it was in the queue could lead to server assertion. Bug fixed #1004.
  • Fixed various spelling errors in comments and function names. Bug fixed #728 (Otto Kekäläinen).
  • Implemented set of fixes to make PerconaFT build and run on the AArch64 (64-bit ARMv8) architecture. Bug fixed #726 (Alexey Kopytov).

Other bugs fixed: #1603073, #1604323, #1604364, #1604462, #1604774, #1606782, #1607224, #1607359, #1607606, #1607607, #1607671, #1608385, #1608437, #1608845, #1609422, #1610858, #1612084, #1612551, #1455430, #1455432, #1610242, #998, #1003, #729, and #730.

Release notes for Percona Server 5.6.32-78.0 are available in the online documentation. Please report any bugs on the launchpad bug tracker.

by Hrvoje Matijakovic at August 22, 2016 10:44 PM

MariaDB AB

Magicbricks Migrates to MariaDB to Support its High Volume Traffic

Guest

The following is a guest blog post from Subodh Kumar, head of technology at Magicbricks, India's largest online property portal.

To support our growing online traffic, Magicbricks migrated from a proprietary database to MariaDB (version 10.1.x).

With this migration, we’ve re-factored our application architecture to separate read and write database calls. This has allowed us to load balance our heavy read calls across multiple slave instances without any worries about lag during data syncs.

Using MariaDB, we are now able to serve approximately 7 million page views (from our web and mobile sites) and approximately 6 million API calls per day. MariaDB has not only helped us support this high volume of traffic but has also streamlined our database-related operations. We were easily able to set up multi-master, near real-time replication. Not to mention, this comes with no additional license requirements, which was a primary consideration with the proprietary database servers we had previously deployed.

This deployment allows Magicbricks to scale its applications with as many database instances as desired.

The average load factor with the previous proprietary database was around 15 to 20, which has now been tremendously reduced to approximately three after the MariaDB deployment.

by Guest at August 22, 2016 10:40 PM

Peter Zaitsev

Query rewrite plugin: scalability fix in MySQL 5.7.14


In this post, we’ll look at a scalability fix for issues the query rewrite plugin had on performance.

Several months ago, Vadim blogged about the impact of a query rewrite plugin on performance. We decided to re-evaluate the latest release of MySQL 5.7 (5.7.14), which includes fixes for this issue.

I reran tests for MySQL 5.7.13 and 5.7.14 using the same setup and the same test: sysbench OLTP_RO without and with the query rewrite plugin enabled.
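For readers who have not used it before, here is a minimal sketch of how the Rewriter plugin that ships with MySQL 5.7 is typically enabled and given a rule. The exact rules used in this benchmark are not shown here; the table, procedure and status variable names below are the standard ones from the plugin, and the install script location can vary by distribution:

-- after loading the plugin with the bundled install_rewriter.sql script:
mysql> INSERT INTO query_rewrite.rewrite_rules (pattern, replacement)
       VALUES ('SELECT ?', 'SELECT ? + 1');
mysql> CALL query_rewrite.flush_rewrite_rules();
mysql> SHOW GLOBAL STATUS LIKE 'Rewriter%';   -- e.g. Rewriter_number_rewritten_queries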

[Graph: sysbench OLTP_RO results for MySQL 5.7.13 and 5.7.14, with and without the query rewrite plugin]

MySQL 5.7.14 performs much better, with almost no overhead. Let’s check PMP for these runs:

MySQL 5.7.13

206 __lll_lock_wait(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),plugin_unlock_list,mysql_audit_release,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
    152 __lll_lock_wait(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),plugin_foreach_with_mask,mysql_audit_acquire_plugins,mysql_audit_notify,invoke_pre_parse_rewrite_plugins,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     97 __lll_lock_wait(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),plugin_lock,acquire_plugins,plugin_foreach_with_mask,mysql_audit_acquire_plugins,mysql_audit_notify,invoke_pre_parse_rewrite_plugins,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     34 __io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_handler,fil_aio_wait,io_handler_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     18 send(libpthread.so.0),vio_write,net_write_packet,net_flush,net_send_eof,THD::send_statement_status,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      9 recv(libpthread.so.0),vio_read,net_read_raw_loop,net_read_packet,my_net_read,Protocol_classic::read_packet,Protocol_classic::get_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      8 poll(libc.so.6),vio_io_wait,vio_socket_io_wait,vio_read,net_read_raw_loop,net_read_packet,my_net_read,Protocol_classic::read_packet,Protocol_classic::get_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)

MySQL 5.7.14

309 send(libpthread.so.0),vio_write,net_write_packet,net_flush,net_send_eof,THD::send_statement_status,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     43 recv(libpthread.so.0),vio_read,net_read_raw_loop,net_read_packet,my_net_read,Protocol_classic::read_packet,Protocol_classic::get_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     34 __io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_handler,fil_aio_wait,io_handler_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     15 poll(libc.so.6),vio_io_wait,vio_socket_io_wait,vio_read,net_read_raw_loop,net_read_packet,my_net_read,Protocol_classic::read_packet,Protocol_classic::get_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      7 send(libpthread.so.0),vio_write,net_write_packet,net_flush,net_send_ok,Protocol_classic::send_ok,THD::send_statement_status,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      7 pthread_cond_wait,os_event::wait_low,srv_worker_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      7 pthread_cond_wait,os_event::wait_low,buf_flush_page_cleaner_worker,start_thread(libpthread.so.0),clone(libc.so.6)

No sign of extra locks with the plugin API in PMP for MySQL 5.7.14. Good job!

Also, note that the fix for this issue should help to improve performance for any other audit based API plugins.

by Alexey Stroganov at August 22, 2016 09:46 PM

Jean-Jerome Schmidt

Infrastructure Automation - Deploying ClusterControl and MySQL-based systems on AWS using Ansible

We recently made a number of enhancements to the ClusterControl Ansible Role, so it now also supports automatic deployment of MySQL-based systems (MySQL Replication, Galera Cluster, NDB Cluster). The updated role uses the awesome ClusterControl RPC interface to automate deployments. It is available at Ansible Galaxy and Github.

TLDR; It is now possible to define your database clusters directly in a playbook (see below example), and let Ansible and ClusterControl automate the entire deployment:

cc_cluster:
  - deployment: true
    operation: "create"
    cluster_type: "galera"
    mysql_hostnames:
      - "192.168.1.101"
      - "192.168.1.102"
      - "192.168.1.103"
    mysql_password: "MyPassword2016"
    mysql_port: 3306
    mysql_version: "5.6"
    ssh_keyfile: "/root/.ssh/id_rsa"
    ssh_user: "root"
    vendor: "percona"

What’s New?

The major improvement is that you can now automatically deploy a new database setup while deploying ClusterControl. Or you can also register already deployed databases.

Define your database cluster in the playbook, within the “cc_cluster” item, and ClusterControl will perform the deployment. We also introduced a bunch of new variables to simplify the initial setup of ClusterControl for default admin credentials, ClusterControl license and LDAP settings.

Along with these improvements, we can leverage Ansible’s built-in cloud modules to automate the rest of the infrastructure that our databases rely on - instance provisioning, resource allocation, network configuration and storage options. In simple terms, write your infrastructure as code and let Ansible work together with ClusterControl to build the entire stack.

We also included example playbooks in the repository for reference. Check them out at our Github repository page.

Example Deployment on Amazon EC2

In this example we are going to deploy two clusters on Amazon EC2 using our new role:

  • 1 node for ClusterControl
  • 3 nodes Galera Cluster (Percona XtraDB Cluster 5.6)
  • 4 nodes MySQL Replication (Percona Server 5.7)

The following diagram illustrates our setup with Ansible:

First, let’s decide what our infrastructure in AWS will look like:

  • Region: us-west-1
  • Availability Zone: us-west-1a
  • AMI ID: ami-d1315fb1
    • AMI name: RHEL 7.2 HVM
    • SSH user: ec2-user
  • Instance size: t2.medium
  • Keypair: mykeypair
  • VPC subnet: subnet-9ecc2dfb
  • Security group: default

Preparing our Ansible Master

We are using Ubuntu 14.04 as the Ansible master host in a local data-center to deploy our cluster on AWS EC2.

  1. If you already have Ansible installed, you may skip this step:

    $ apt-get install ansible python-setuptools
  2. Install boto (required by ec2 python script):

    $ pip install boto
  3. Download and configure ec2.py and ec2.ini under /etc/ansible. Ensure the Python script is executable:

    $ cd /etc/ansible
    $ wget https://raw.githubusercontent.com/ansible/ansible/devel/contrib/inventory/ec2.py
    $ wget https://raw.githubusercontent.com/ansible/ansible/devel/contrib/inventory/ec2.ini
    $ chmod 755 /etc/ansible/ec2.py
  4. Set up Secret and Access key environment variables. Get the AWS Secret and Access Key from Amazon EC2 Identity and Access Management (IAM) and configure them as per below:

    $ export AWS_ACCESS_KEY_ID='YOUR_AWS_API_KEY'
    $ export AWS_SECRET_ACCESS_KEY='YOUR_AWS_API_SECRET_KEY'
  5. Configure environment variables for ec2.py and ec2.ini.

    export ANSIBLE_HOSTS=/etc/ansible/ec2.py
    export EC2_INI_PATH=/etc/ansible/ec2.ini
  6. Configure AWS keypair. Ensure the keypair exists on the node. For example, if the keypair is located under /root/mykeypair.pem, use the following command to add it to the SSH agent:

    $ ssh-agent bash
    $ ssh-add /root/mykeypair.pem
  7. Verify if Ansible can see our cloud. If there are running EC2 instances, you should get a list of them in JSON by running this command:

    $ /etc/ansible/ec2.py --list

Note that you can put the exports from steps 4 and 5 inside your .bashrc or .bash_profile file to load the environment variables automatically.

Define our infrastructure inside the Ansible playbook

We are going to create 2 playbooks. The first one is the definition of our EC2 instances in AWS (ec2-instances.yml) and the second one to deploy ClusterControl and the database clusters (deploy-everything.yml).

Here is an example content of ec2-instances.yml:

- name: Create instances
  hosts: localhost
  gather_facts: False

  tasks:
    - name: Provision ClusterControl node
      ec2:
        count: 1
        region: us-east-1
        zone: us-east-1a
        key_name: mykeypair
        group: default
        instance_type: t2.medium
        image: ami-3f03c55c
        wait: yes
        wait_timeout: 500
        volumes:
          - device_name: /dev/sda1
            device_type: standard
            volume_size: 20
            delete_on_termination: true
        monitoring: no
        vpc_subnet_id: subnet-9ecc2dfb
        assign_public_ip: yes
        instance_tags:
          Name: clustercontrol
          set: ansible
          group: clustercontrol

    - name: Provision Galera nodes
      ec2:
        count: 3
        region: us-east-1
        zone: us-east-1a
        key_name: mykeypair
        group: default
        instance_type: t2.medium
        image: ami-3f03c55c
        wait: yes
        wait_timeout: 500
        volumes:
          - device_name: /dev/sdf
            device_type: standard
            volume_size: 20
            delete_on_termination: true
        monitoring: no
        vpc_subnet_id: subnet-9ecc2dfb
        assign_public_ip: yes
        instance_tags:
          Name: galeracluster
          set: ansible
          group: galeracluster

    - name: Provision MySQL Replication nodes
      ec2:
        count: 4
        region: us-east-1
        zone: us-east-1a
        key_name: mykeypair
        group: default
        instance_type: t2.medium
        image: ami-3f03c55c
        wait: yes
        wait_timeout: 500
        volumes:
          - device_name: /dev/sdf
            device_type: standard
            volume_size: 20
            delete_on_termination: true
        monitoring: no
        vpc_subnet_id: subnet-9ecc2dfb
        assign_public_ip: yes
        instance_tags:
          Name: replication
          set: ansible
          group: replication

There are three types of instances with different instance_tags (Name, set and group) in the playbook. The “group” tag distinguishes our host groups so they can be referenced in our deployment playbook as part of the Ansible host inventory. The "set" tag marks the instances as created by Ansible. Since we are provisioning everything from a local data-center, we set assign_public_ip to “yes” so the instances are reachable inside the VPC under “subnet-9ecc2dfb”.

Next, we create the deployment playbook as per below (deploy-everything.yml):

- name: Configure ClusterControl instance.
  hosts: tag_group_clustercontrol
  become: true
  user: ec2-user
  gather_facts: true

  roles:
    - { role: severalnines.clustercontrol, tags: controller }

  vars:
    cc_admin:
      - email: "admin@email.com"
        password: "test123"

- name: Configure Galera Cluster and Replication instances.
  hosts: 
    - tag_group_galeracluster
    - tag_group_replication
  user: ec2-user
  become: true
  gather_facts: true

  roles:
    - { role: severalnines.clustercontrol, tags: dbnodes }

  vars:
    clustercontrol_ip_address: "{{ hostvars[groups['tag_group_clustercontrol'][0]]['ec2_ip_address'] }}"

- name: Create the database clusters.
  hosts: tag_group_clustercontrol
  become: true
  user: ec2-user

  roles:
    - { role: severalnines.clustercontrol, tags: deploy-database }

  vars:
    cc_cluster:
      - deployment: true
        operation: "create"
        cluster_type: "galera"
        mysql_cnf_template: "my.cnf.galera"
        mysql_datadir: "/var/lib/mysql"
        mysql_hostnames:
          - "{{ hostvars[groups['tag_group_galeracluster'][0]]['ec2_ip_address'] }}"
          - "{{ hostvars[groups['tag_group_galeracluster'][1]]['ec2_ip_address'] }}"
          - "{{ hostvars[groups['tag_group_galeracluster'][2]]['ec2_ip_address'] }}"
        mysql_password: "password"
        mysql_port: 3306
        mysql_version: "5.6"
        ssh_keyfile: "/root/.ssh/id_rsa"
        ssh_user: "root"
        sudo_password: ""
        vendor: "percona"
      - deployment: true
        operation: "create"
        cluster_type: "replication"
        mysql_cnf_template: "my.cnf.repl57"
        mysql_datadir: "/var/lib/mysql"
        mysql_hostnames:
          - "{{ hostvars[groups['tag_group_replication'][0]]['ec2_ip_address'] }}"
          - "{{ hostvars[groups['tag_group_replication'][1]]['ec2_ip_address'] }}"
          - "{{ hostvars[groups['tag_group_replication'][2]]['ec2_ip_address'] }}"
          - "{{ hostvars[groups['tag_group_replication'][3]]['ec2_ip_address'] }}"
        mysql_password: "password"
        mysql_port: 3306
        mysql_version: "5.7"
        ssh_keyfile: "/root/.ssh/id_rsa"
        ssh_user: "root"
        sudo_password: ""
        vendor: "percona"

The Ansible user is “ec2-user” for the RHEL 7.2 image. The playbook shows the deployment flow as:

  1. Install and configure ClusterControl (tags: controller)
  2. Set up passwordless SSH from the ClusterControl node to all database nodes (tags: dbnodes). In this section, we have to define clustercontrol_ip_address so we know which ClusterControl node is used to manage our nodes.
  3. Perform the database deployment. The database cluster item definition will be passed to the ClusterControl RPC interface listening on the EC2 instance that has “tag_group_clustercontrol”. For MySQL replication, the first node in the mysql_hostnames list is the master.

The above are the simplest variables used to get you started. For more customization options, you can refer to the documentation page of the role under Variables section.

Fire them up

You need to have the Ansible role installed. Grab it from Ansible Galaxy or Github repository:

$ ansible-galaxy install severalnines.clustercontrol

Then, create the EC2 instances:

$ ansible-playbook -i /etc/ansible/ec2.py ec2-instances.yml

Refresh the inventory:

$ /etc/ansible/ec2.py --refresh-cache

Verify all EC2 instances are reachable before the deployment begins (you should get SUCCESS for all nodes):

$ ansible -m ping "tag_set_ansible" -u ec2-user

Install ClusterControl and deploy the database cluster:

$ ansible-playbook -i /etc/ansible/ec2.py deploy-everything.yml

Wait for a couple of minutes until the playbook completes. Then, log in to ClusterControl using the default email address and password defined in the playbook, and you should land on the ClusterControl dashboard. Go to Settings -> Cluster Job; you should see the “Create Cluster” jobs scheduled and the deployments in progress.

This is our final result on ClusterControl dashboard:

The total deployment time, from installing Ansible to the database deployment, took about 50 minutes. This included waiting for the instances to be created and the database clusters to be deployed. This is pretty good, considering we were spinning up 8 nodes and deploying two database clusters from scratch. How long does it take you to deploy two clusters from scratch?

Future Plan

At the moment, the Ansible role only supports deployment of the following:

  • Create new Galera Cluster
    • Percona XtraDB Cluster (5.5/5.6)
    • MariaDB Galera Cluster (5.5/10.1)
    • MySQL Galera Cluster - Codership (5.5/5.6)
  • Create new MySQL Replication
    • Percona Server (5.7/5.6)
    • MariaDB Server (10.1)
    • MySQL Server - Oracle (5.7)
  • Add existing Galera Cluster
    • Percona/MariaDB/Codership (all stable version)

We’re in the process of adding support for other cluster types supported by ClusterControl.

We’d love to hear your feedback in the comments section below. Would you like to see integration with more cloud providers (Azure, Google Cloud Platform, Rackspace)? What about virtualization platforms like OpenStack, VMware, Vagrant and Docker? How about load balancers (HAProxy, MaxScale and ProxySQL)? And Galera arbitrator (garbd), asynchronous replication slaves to Galera clusters, and backup management right from the Ansible playbook? The list can be very long, so let us know what is important to you. Happy automation!

by Severalnines at August 22, 2016 02:36 PM

August 20, 2016

Peter Zaitsev

Sharding with ProxySQL


Recently a colleague of mine asked me to provide a simple example of how ProxySQL can perform sharding.

That request moved me to write a short tutorial, in the hope that it will illustrate ProxySQL's sharding functionality and help people out there better understand how to use it.

ProxySQL is a very powerful platform that allows us to manipulate and manage our connections and queries in a simple but effective way.

In this article I will show you how.

Before starting, it is better to clarify some basic concepts.

ProxySQL organizes its internal set of servers into Host Groups (HG); each HG can be associated with users and with Query Rules (QR).
Each QR can be final (apply = 1) or can let ProxySQL continue to parse other QRs.
A QR can be a rewrite action or a simple match, it can have a specific target HG or be generic, and finally QRs are defined using regexes.

You can see QRs as a sequence of filters and transformations that you can arrange as you like.

These simple basic rules give us enormous flexibility, and allow us to create very simple actions, like a plain query rewrite, or very complex chains that can have dozens of QRs concatenated. Documentation can be found here.

The information related to HGs and QRs is easily accessible using the ProxySQL administration interface, in the tables mysql_servers, mysql_query_rules and stats.stats_mysql_query_rules; the last one allows us to evaluate whether and how the rules are used.

Regarding sharding, what can ProxySQL do to help us achieve what we need in a (relatively) easy way?
Some people/companies include the sharding logic in the application, use multiple connections to reach the different targets, or have some logic to split the load across several schemas/tables. ProxySQL allows us to simplify the way connectivity and query distribution is supposed to work, by reading data in the query or accepting HINTs.

Whatever the requirements, the sharding exercise can be summarized in a few different categories:
• Splitting the data inside the same container (like sharding by State, where each State is a schema)
• Splitting by physical data location (this can mean multiple MySQL servers in the same room, as well as geographically distributed ones)
• A combination of the two, where I split by State using a dedicated server, and additionally split by schema/table by something else (say by gender)

In the following examples I will show how to use ProxySQL to cover the three different scenarios defined above, and a bit more.

The examples below report output from the ProxySQL admin interface and from the MySQL console.
I will mark each one as follows:

  • Mc for MySQL console
  • Pa for ProxySQL Admin

Please note that the mysql console MUST be started with the -c flag to pass comments through in the query. This is because the default behaviour of the mysql console is to strip comments.

I am going to illustrate procedures that you can replicate on your laptop, and where possible I will mention a real implementation.

This is because I want you to test the ProxySQL functionality directly.

For the examples described below I have used ProxySQL v1.2.2, which is going to become the master branch in a few days. You can download it from:

git clone https://github.com/sysown/proxysql.git
git checkout v1.2.2

Then to compile :

cd <path to proxy source code>
make
make install

If you need full instructions on how to install and configure ProxySQL, then read here and here.

Finally, you need to have the world test DB loaded; the world test DB can be found here.

The first example/exercise is:

Shard inside the same MySQL Server using 3 different schemas split by continent.

Obviously you can have any number of shards and corresponding schemas. What is relevant here is to demonstrate how traffic can be redirected to different targets (schemas) while maintaining the same structure (tables), discriminating the target on the basis of some relevant information in the data or passed by the application.

OK, let us get the ball rolling.

Having :

[Mc]
+---------------+-------------+
| Continent     | count(Code) |
+---------------+-------------+
| Asia          |          51 | <--
| Europe        |          46 | <--
| North America |          37 |
| Africa        |          58 | <--
| Oceania       |          28 |
| Antarctica    |           5 |
| South America |          14 |
+---------------+-------------+

For this exercise I will use 3 hosts in replica.

Summarizing, I will need:
3 hosts: 192.168.1.[5-6-7]
3 schemas: Continent X + world schema
1 user: user_shardRW
3 hostgroups: 10, 20, 30 (for future use)

We will first create the schemas Asia, Europe, Africa and North_America.

[Mc]
Create schema [Asia|Europe|North_America|Africa];
create table Asia.City as select a.* from  world.City a join Country on a.CountryCode = Country.code where Continent='Asia' ;
create table Europe.City as select a.* from  world.City a join Country on a.CountryCode = Country.code where Continent='Europe' ;
create table Africa.City as select a.* from  world.City a join Country on a.CountryCode = Country.code where Continent='Africa' ;
create table North_America.City as select a.* from  world.City a join Country on a.CountryCode = Country.code where Continent='North America' ;
create table Asia.Country as select * from  world.Country where Continent='Asia' ;
create table Europe.Country as select * from  world.Country where Continent='Europe' ;
create table Africa.Country as select * from  world.Country  where Continent='Africa' ;
create table North_America.Country as select * from  world.Country where Continent='North America' ;

Create the user

grant all on *.* to user_shardRW@'%' identified by 'test';

Now let us start to configure ProxySQL:

[Pa]
insert into mysql_users (username,password,active,default_hostgroup,default_schema) values ('user_shardRW','test',1,10,'test_shard1');
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK;
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.5',10,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.6',20,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.7',30,3306,100);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

With this we have defined the User, the servers and the Host groups.

Let us start to define the logic with the query rules:

[Pa]
delete from mysql_query_rules where rule_id > 30;
INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,replace_pattern,apply) VALUES (31,1,'user_shardRW',"^SELECT\s*(.*)\s*from\s*world.(\S*)\s(.*).*Continent='(\S*)'\s*(\s*.*)$","SELECT \1 from \4.\2 WHERE 1=1 \5",1);
LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;

I am now going to query the master (or a single node), but I expect ProxySQL to redirect the query to the right shard, catching the value of the Continent.

[Mc]
 SELECT name,population from world.City  WHERE Continent='Europe' and CountryCode='ITA' order by population desc limit 1;
+------+------------+
| name | population |
+------+------------+
| Roma |    2643581 |
+------+------------+

Well, you could say… “hey, you are querying the schema world, of course you get back the correct data”.

But this is not what really happened: ProxySQL did not query the schema world but the schema Europe.

Let's see the details:

[Pa]
select * from stats_mysql_query_digest;
Original    :SELECT name,population from world.City  WHERE Continent='Europe' and CountryCode='ITA' order by population desc limit 1;
Transformed :SELECT name,population from Europe.City WHERE ?=? and CountryCode=? order by population desc limit ?

Let me explain what happened.

Rule 31 in ProxySQL takes all the fields we pass in the query, catches the CONTINENT in the WHERE clause, takes any condition after the WHERE, and reorganizes the whole query using the regex.

Does this work for any table in the sharded schemas?

Of course it does.

A query like: SELECT name,population from world.Country WHERE Continent='Asia' ;
Will be transformed into: SELECT name,population from Asia.Country WHERE ?=?

[Mc]
+----------------------+------------+
| name                 | population |
+----------------------+------------+
| Afghanistan          |   22720000 |
| United Arab Emirates |    2441000 |
| Armenia              |    3520000 |
<snip ...>
| Vietnam              |   79832000 |
| Yemen                |   18112000 |
+----------------------+------------+

Another possible approach to instruct ProxySQL to shard is to pass a hint inside a comment.
Let's see how.
First, let me disable the rule I have just inserted; this is not really needed, but this way you can see how 🙂

[Pa]
mysql> update mysql_query_rules set active=0 where rule_id=31;
Query OK, 1 row affected (0.00 sec)
mysql> LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;
Query OK, 0 rows affected (0.00 sec)

Done.

Now what I want is that *ANY* query containing the comment /* continent=X */ should go to the continent X schema, on the same server.

To do so, I will instruct ProxySQL to replace any reference to the world schema inside the query I am going to submit.

[Pa]
delete from mysql_query_rules where rule_id in (31,33,34,35,36);
INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,replace_pattern,apply,FlagOUT,FlagIN) VALUES (31,1,'user_shardRW',"\S*\s*\/\*\s*continent=.*Asia\s*\*.*",null,0,23,0);
INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,replace_pattern,apply,FlagIN,FlagOUT) VALUES (32,1,'user_shardRW','world.','Asia.',0,23,23);
INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,replace_pattern,apply,FlagOUT,FlagIN) VALUES (33,1,'user_shardRW',"\S*\s*\/\*\s*continent=.*Europe\s*\*.*",null,0,25,0);
INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,replace_pattern,apply,FlagIN,FlagOUT) VALUES (34,1,'user_shardRW','world.','Europe.',0,25,25);
INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,replace_pattern,apply,FlagOUT,FlagIN) VALUES (35,1,'user_shardRW',"\S*\s*\/\*\s*continent=.*Africa\s*\*.*",null,0,24,0);
INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,replace_pattern,apply,FlagIN,FlagOUT) VALUES (36,1,'user_shardRW','world.','Africa.',0,24,24);
LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;

How does this work?

I have defined mainly two concatenated rules per continent. The first captures the incoming query that contains the desired value (like continent = Asia).
If there is a match, ProxySQL exits that action, but while doing so it reads the Apply field, and if Apply is 0 it reads the FlagOUT value. At that point it goes to the first rule (in sequence) that has a FlagIN value equal to that FlagOUT.

The second rule gets the request and replaces the value of world with the one I have defined. In short, it replaces whatever is in match_pattern with the value in replace_pattern.

Now, what happens here is that ProxySQL uses Google's RE2 library for regexes. RE2 is very fast but has some limitations; for example, it does NOT support (at the time of writing) the g flag. In other words, if I have a select with many tables, and thus several occurrences of “world.”, RE2 will replace ONLY the first instance.

As such a query like:

Select /* continent=Europe */ * from world.Country join world.City on world.City.CountryCode=world.Country.Code where Country.code='ITA' ;

Will be transformed into :

Select /* continent=Europe */ * from Europe.Country join world.City on world.City.CountryCode=world.Country.Code where Country.code='ITA' ;

And fail.

The other day Rene’ and I were discussing how to solve this, given the missing feature in RE2. In the end we opted for recursive actions.

What does this mean?

It means that, from v1.2.2, ProxySQL has new functionality that allows recursive calls to a Query Rule. The maximum number of iterations that ProxySQL can run is managed by the option (global variable) mysql-query_processor_iterations.
mysql-query_processor_iterations defines how many operations a query process can execute as a whole (from start to end).
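As a side note, the current cap can be inspected and raised from the ProxySQL admin interface if a chain of rules needs more passes; a minimal sketch (the value 10 below is just an arbitrary example):

[Pa]
SELECT * FROM global_variables WHERE variable_name='mysql-query_processor_iterations';
UPDATE global_variables SET variable_value=10 WHERE variable_name='mysql-query_processor_iterations';
LOAD MYSQL VARIABLES TO RUNTIME; SAVE MYSQL VARIABLES TO DISK;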

This new implementation allows us to make a Query Rule reference itself so that it is executed multiple times.

If you go back you will notice that QR 34 has FlagIN and FlagOUT pointing to the same value of 25, and Apply = 0. This causes ProxySQL to recursively call rule 34 until it has changed ALL the occurrences of the word world.

The result is the following:

[Mc]
Select /* continent=Europe */ Code, City.Name, City.population  from world.Country join world.City on world.City.CountryCode=world.Country.Code where City.population > 10000 group by Name order by City.Population desc limit 5;
+------+---------------+------------+
| Code | Name          | population |
+------+---------------+------------+
| RUS  | Moscow        |    8389200 |
| GBR  | London        |    7285000 |
| RUS  | St Petersburg |    4694000 |
| DEU  | Berlin        |    3386667 |
| ESP  | Madrid        |    2879052 |
+------+---------------+------------+

You can see ProxySQL internal information using the following queries:

[Pa]
 select active,hits, mysql_query_rules.rule_id, match_digest, match_pattern, replace_pattern, cache_ttl, apply,flagIn,flagOUT FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id;
+--------+------+---------+---------------------+----------------------------------------+-----------------+-----------+-------+--------+---------+
| active | hits | rule_id | match_digest        | match_pattern                          | replace_pattern | cache_ttl | apply | flagIN | flagOUT |
+--------+------+---------+---------------------+----------------------------------------+-----------------+-----------+-------+--------+---------+
| 1      | 1    | 33      | NULL                | S*s*/*s*continent=.*Europes**.* | NULL            | NULL      | 0     | 0      | 25      |
| 1      | 4    | 34      | NULL                | world.                                 | Europe.         | NULL      | 0     | 25     | 25      |
| 1      | 0    | 35      | NULL                | S*s*/*s*continent=.*Africas**.* | NULL            | NULL      | 0     | 0      | 24      |
| 1      | 0    | 36      | NULL                | world.                                 | Africa.         | NULL      | 0     | 24     | 24      |
+--------+------+---------+---------------------+----------------------------------------+-----------------+-----------+-------+--------+---------+

And:

[Pa]
select * from stats_mysql_query_digest;
<snip and taking only digest_text>
Select Code, City.Name, City.population from Europe.Country join Europe.City on Europe.City.CountryCode=Europe.Country.Code where City.population > ? group by Name order by City.Population desc limit ?

As you can see, ProxySQL has nicely replaced the word world. in the query with Europe., and it ran Query Rule 34 four times (hits).

This obviously works for INSERT/UPDATE/DELETE as well.

Queries like:

insert into  /* continent=Europe */  world.City values(999999,'AAAAAAA','ITA','ROMA',0) ;

Will be transformed into:

[Pa]
select digest_text from stats_mysql_query_digest;
+-------------------------------------------+
| digest_text                               |
+-------------------------------------------+
| insert into Europe.City values(?,?,?,?,?) |
+-------------------------------------------+

And executed only on the desired schema.

Sharding by Host

Using hint

How can I shard by redirecting the queries to a host (instead of a schema)?
This is even easier 🙂

The main point is that whatever matches the rule should go to a defined HG.
No rewrite is implied, which means less work.

So how is this done?
As said before, I have 3 nodes: 192.168.1.[5-6-7].
For this example I will use the world DB (no continent schemas), distributed on each node, and I will retrieve the node's bind IP to be sure I am going to the right place.

What I will do is instruct ProxySQL to send my query to a specific host by using a HINT. I choose the hint “shard_host_HG” and I am going to inject it into the query as a comment.

As such the Query Rules will be:

[Pa]
delete from mysql_query_rules where rule_id in (40,41,42, 10,11,12);
INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,destination_hostgroup,apply) VALUES (10,1,'user_shardRW',"\/\*\s*shard_host_HG=.*Europe\s*\*.",10,0);
INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,destination_hostgroup,apply) VALUES (11,1,'user_shardRW',"\/\*\s*shard_host_HG=.*Asia\s*\*.",20,0);
INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,destination_hostgroup,apply) VALUES (12,1,'user_shardRW',"\/\*\s*shard_host_HG=.*Africa\s*\*.",30,0);
LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;

While the queries I am going to test are:

[Mc]
Select /* shard_host_HG=Europe */ City.Name, City.Population from world.Country join world.City on world.City.CountryCode=world.Country.Code where Country.code='ITA' limit 5; SELECT * /* shard_host_HG=Europe */ from information_schema.GLOBAL_VARIABLES where variable_name like 'bind%';
Select /* shard_host_HG=Asia */ City.Name, City.Population from world.Country join world.City on world.City.CountryCode=world.Country.Code where Country.code='IND' limit 5; SELECT * /* shard_host_HG=Asia */ from information_schema.GLOBAL_VARIABLES where variable_name like 'bind%';
Select /* shard_host_HG=Africa */ City.Name, City.Population from world.Country join world.City on world.City.CountryCode=world.Country.Code where Country.code='ETH' limit 5; SELECT * /* shard_host_HG=Africa */ from information_schema.GLOBAL_VARIABLES where variable_name like 'bind%';

Running the query for Africa, I will get:

[Mc]
Select /* shard_host_HG=Africa */ City.Name, City.Population from world.Country join world.City on world.City.CountryCode=world.Country.Code where Country.code='ETH' limit 5; SELECT * /* shard_host_HG=Africa */ from information_schema.GLOBAL_VARIABLES where variable_name like 'bind%';
+-------------+------------+
| Name        | Population |
+-------------+------------+
| Addis Abeba |    2495000 |
| Dire Dawa   |     164851 |
| Nazret      |     127842 |
| Gonder      |     112249 |
| Dese        |      97314 |
+-------------+------------+
+---------------+----------------+
| VARIABLE_NAME | VARIABLE_VALUE |
+---------------+----------------+
| BIND_ADDRESS  | 192.168.1.7    |
+---------------+----------------+

That will give me :

[Pa]
select active,hits, mysql_query_rules.rule_id, match_digest, match_pattern, replace_pattern, cache_ttl, apply,flagIn,flagOUT FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id;
+--------+------+---------+---------------------+----------------------------------------+-----------------+-----------+-------+--------+---------+
| active | hits | rule_id | match_digest        | match_pattern                          | replace_pattern | cache_ttl | apply | flagIN | flagOUT |
+--------+------+---------+---------------------+----------------------------------------+-----------------+-----------+-------+--------+---------+
| 1      | 0    | 40      | NULL                | /*s*shard_host_HG=.*Europes**.    | NULL            | NULL      | 0     | 0      | 0       |
| 1      | 0    | 41      | NULL                | /*s*shard_host_HG=.*Asias**.      | NULL            | NULL      | 0     | 0      | 0       |
| 1      | 2    | 42      | NULL                | /*s*shard_host_HG=.*Africas**.    | NULL            | NULL      | 0     | 0      | 0       | <-- Note the HITS (2 as the run queries)
+--------+------+---------+---------------------+----------------------------------------+-----------------+-----------+-------+--------+---------+

In this example we have NO replace_pattern; this is only a matching and redirecting rule, where the destination HG is defined by the value of the destination_hostgroup attribute at insert time. In the case of Africa it is HG 30.

The server in HG 30 is:

[Pa]
select hostgroup_id,hostname,port,status from mysql_servers ;
+--------------+-------------+------+--------+
| hostgroup_id | hostname    | port | status |
+--------------+-------------+------+--------+
| 10           | 192.168.1.5 | 3306 | ONLINE |
| 20           | 192.168.1.6 | 3306 | ONLINE |
| 30           | 192.168.1.7 | 3306 | ONLINE | <---
+--------------+-------------+------+--------+

Which matches perfectly with our returned value.

You can try the other two continents on your own.

Using destination_hostgroup

Another way to assign which final host a query should go to is to use destination_hostgroup: set the schemaname attribute and use the USE schema syntax in the query.

like:

[Pa]
INSERT INTO mysql_query_rules (active,schemaname,destination_hostgroup,apply) VALUES
(1, 'shard00', 1, 1), (1, 'shard01', 1, 1), (1, 'shard03', 1, 1),
(1, 'shard04', 2, 1), (1, 'shard06', 2, 1), (1, 'shard06', 2, 1),
(1, 'shard07', 3, 1), (1, 'shard08', 3, 1), (1, 'shard09', 3, 1);

And then in the query do something like:

use shard02; Select * from tablex;

I mention this method because it is one of the most common ones at the moment among large companies using sharding.

But it is not safe, because it relies on the query actually being executed in the desired HG, and the risk of error is high.

Just think of a query doing a join against a specific shard:

use shard01; Select * from tablex join shard03 on tablex.id = shard03.tabley.id;

This will probably generate an error, because shard03 is probably NOT present on the host containing shard01.

As such, this approach can be used ONLY when you are 100% sure about what you are doing, and when you are sure that NO query will have an explicit schema declaration.

Shard By Host and by Schema

Finally, it is obviously possible to combine the two approaches: shard by host and keep only a subset of schemas on each host.

To do so, let us use all three nodes and distribute the schemas as follows:

  • Europe on Server 192.168.1.5 -> HG 10
  • Asia on Server 192.168.1.6 -> HG 20
  • Africa on Server 192.168.1.7 -> HG 30

I have already set both groups of query rules using hints, so all I have to do is use them BOTH to combine the operations:

[Mc]
Select /* shard_host_HG=Asia */ /* continent=Asia */  City.Name, City.Population from world.Country join world.City on world.City.CountryCode=world.Country.Code where Country.code='IND' limit 5; SELECT * /* shard_host_HG=Asia */ from information_schema.GLOBAL_VARIABLES where variable_name like 'bind%';
+--------------------+------------+
| Name               | Population |
+--------------------+------------+
| Mumbai (Bombay)    |   10500000 |
| Delhi              |    7206704 |
| Calcutta [Kolkata] |    4399819 |
| Chennai (Madras)   |    3841396 |
| Hyderabad          |    2964638 |
+--------------------+------------+
5 rows in set (0.00 sec)
+---------------+----------------+
| VARIABLE_NAME | VARIABLE_VALUE |
+---------------+----------------+
| BIND_ADDRESS  | 192.168.1.6    |
+---------------+----------------+
1 row in set (0.01 sec)

[Pa]
mysql> select digest_text from stats_mysql_query_digest;
+--------------------------------------------------------------------------------------------------------------------------------------------+
| digest_text                                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------------------------------+
| SELECT * from information_schema.GLOBAL_VARIABLES where variable_name like ?                                                               |
| Select City.Name, City.Population from Asia.Country join Asia.City on Asia.City.CountryCode=Asia.Country.Code where Country.code=? limit ? |
+--------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
mysql> select active,hits, mysql_query_rules.rule_id, match_digest, match_pattern, replace_pattern, cache_ttl, apply,flagIn,flagOUT FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id;
+--------+------+---------+---------------------+----------------------------------------+-----------------+-----------+-------+--------+---------+
| active | hits | rule_id | match_digest        | match_pattern                          | replace_pattern | cache_ttl | apply | flagIN | flagOUT |
+--------+------+---------+---------------------+----------------------------------------+-----------------+-----------+-------+--------+---------+
| 1      | 0    | 10      | NULL                | \/\*\s*shard_host_HG=.*Europe\s*\*.    | NULL            | NULL      | 0     | 0      | NULL    |
| 1      | 2    | 11      | NULL                | \/\*\s*shard_host_HG=.*Asia\s*\*.      | NULL            | NULL      | 0     | 0      | NULL    |
| 1      | 0    | 12      | NULL                | \/\*\s*shard_host_HG=.*Africa\s*\*.    | NULL            | NULL      | 0     | 0      | NULL    |
| 1      | 0    | 13      | NULL                | NULL                                   | NULL            | NULL      | 0     | 0      | 0       |
| 1      | 1    | 31      | NULL                | \S*\s*\/\*\s*continent=.*Asia\s*\*.*   | NULL            | NULL      | 0     | 0      | 23      |
| 1      | 4    | 32      | NULL                | world.                                 | Asia.           | NULL      | 0     | 23     | 23      |
| 1      | 0    | 33      | NULL                | \S*\s*\/\*\s*continent=.*Europe\s*\*.* | NULL            | NULL      | 0     | 0      | 25      |
| 1      | 0    | 34      | NULL                | world.                                 | Europe.         | NULL      | 0     | 25     | 25      |
| 1      | 0    | 35      | NULL                | \S*\s*\/\*\s*continent=.*Africa\s*\*.* | NULL            | NULL      | 0     | 0      | 24      |
| 1      | 0    | 36      | NULL                | world.                                 | Africa.         | NULL      | 0     | 24     | 24      |
+--------+------+---------+---------------------+----------------------------------------+-----------------+-----------+-------+--------+---------+

As you can see, rule 11 has two HITS, which means my queries will go to the associated HG; but given that apply for rule 11 is 0, ProxySQL will continue to process the query rules.

As such, it will also transform the queries according to rules 31 and 32, each one having the expected number of hits (1 for the first, and 4 for the second because of the loop).
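
For completeness, the chained rules roughly correspond to inserts like the following. This is only a sketch reconstructed from the rule table above, not a copy of the original definitions, and it omits the destination_hostgroup, which in a real setup you would set according to where the Asia schema lives:

[Pa]
-- Sketch only: rule 31 matches the continent hint and sets flagOUT=23 without applying;
-- rule 32 is evaluated only for queries carrying flagIN=23 and rewrites world. into Asia.
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, flagOUT, apply)
VALUES (31, 1, '\S*\s*\/\*\s*continent=.*Asia\s*\*.*', 23, 0);
INSERT INTO mysql_query_rules (rule_id, active, flagIN, match_pattern, replace_pattern, flagOUT, apply)
VALUES (32, 1, 23, 'world.', 'Asia.', 23, 0);
LOAD MYSQL QUERY RULES TO RUNTIME;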

And this is all we had to do to get a clean two-layer sharding setup working in ProxySQL.

Conclusion

ProxySQL allows the user to access data distributed across shards in a very simple way.
The query rules, which follow the consolidated RegEx syntax, combined with the possibility to chain rules and with the host group concept, give us huge flexibility with relative simplicity.

References:

https://github.com/sysown/proxysql/tree/v1.2.2/doc
https://github.com/google/re2/wiki/Syntax
http://www.proxysql.com/2015/09/proxysql-tutorial-setup-in-mysql.html
https://github.com/sysown/proxysql/blob/v1.2.2/doc/configuration_howto.md
https://github.com/sysown/proxysql/blob/v1.2.2/INSTALL.md
https://dev.mysql.com/doc/index-other.html

Credits

It is obvious that I need to acknowledge and give kudos for the work René Cannaò is doing to make ProxySQL a solid, fast and flexible product. I should also mention that I have been, and still am, working with him very often, probably more often than he likes, asking for fixes and discussing optimizations. He tries to satisfy those requests with surprising speed and efficiency.

by Marco Tusa at August 20, 2016 02:34 PM

August 19, 2016

Peter Zaitsev

Top Most Overlooked MySQL Performance Optimizations: Q & A

Thank you for attending my 22nd July 2016 webinar titled “Top Most Overlooked MySQL Performance Optimizations”. In this blog, I will provide answers to the Q & A for that webinar.

For hardware, which disk raid level do you suggest? Is raid5 suggested performance-wise and data-integrity-wise?
RAID 5 comes with high overhead, as each write turns into a sequence of four physical I/O operations, two reads and two writes. We know that RAID 5s have some write penalty, and it could affect the performance on spindle disks. In most cases, we advise using alternative RAID levels. Use RAID 5 when disk capacity is more important than performance (e.g., archive databases that aren’t used often). Since write performance isn’t a problem in the case of SSD, but capacity is expensive, RAID 5 can help by wasting less disk space.

Regarding collecting table statistics, do you have any suggestions for analyzing large tables (over 300GB) since we had issues with MySQL detecting the wrong cardinality?
The MySQL optimizer makes decisions about the execution plan (EXPLAIN) based on table statistics. Statistics are re-estimated automatically, and can also be re-estimated explicitly by running the ANALYZE TABLE statement for the table, or the OPTIMIZE TABLE statement for InnoDB tables (which rebuilds the table and then performs an ANALYZE on it).

When the MySQL optimizer is not picking the right index in EXPLAIN, it could be caused by outdated or wrong statistics (optimizer bugs aside). So, when you optimize the table you rebuild it so the data is stored in a more compact way (assuming it changed a lot in the past), and then statistics are re-estimated based on a random sample of pages checked in the table. As a result, you come up with statistics that better reflect the data you have at the moment, which allows the optimizer to choose a better plan. When an explicit index hint is added, you reduce the possible choices for the optimizer, and it can use a good enough plan even with wrong statistics.

If you use version 5.6.x or 5.7.x with InnoDB tables, there is a way to store/fix statistics when the plans are good. Using Persistent Optimizer Statistics prevents them from changing automatically. It’s recommended you run ANALYZE TABLE to calculate statistics (if really needed) during off-peak time, and make sure the table in question is not in use. Check this blogpost too.
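
As a rough illustration, and assuming MySQL 5.6+ with InnoDB (the db1.orders table below is purely hypothetical), pinning statistics for a table could look like this:

-- Sketch only; db1.orders is a made-up table name used for illustration.
SET GLOBAL innodb_stats_persistent = ON;   -- use persistent statistics (the default in 5.6+)
ALTER TABLE db1.orders STATS_PERSISTENT=1, STATS_AUTO_RECALC=0, STATS_SAMPLE_PAGES=64;
ANALYZE TABLE db1.orders;                  -- run during off-peak time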

Regarding the buffer pool, when do you think using multiple buffer pool instances makes sense?
Multiple InnoDB buffer pools were introduced in MySQL 5.5, and the default value was 1. In MySQL 5.6 the default is 8. Enabling innodb_buffer_pool_instances is useful in highly concurrent workloads, as it may reduce contention on the global mutexes. innodb_buffer_pool_instances helps improve scalability on multi-core machines: with multiple buffer pools, access to the buffer pool is split across all instances, so no single mutex controls the access pattern.

innodb_buffer_pool_instances only takes effect when innodb_buffer_pool_size is set to at least 1GB, and the total specified size is divided among all the buffer pool instances. Further, innodb_buffer_pool_instances is not a dynamic option, so changing it requires a server restart to take effect.
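
To check what a server is currently running with, something like the following works (it is read-only and safe to run anywhere); changing innodb_buffer_pool_instances itself has to be done in the configuration file, followed by a restart:

-- Quick check of the current buffer pool configuration
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_instances';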

What do you mean by “PK is appended to secondary index”?
In InnoDB, secondary indexes are stored along with their corresponding primary key values. InnoDB uses this primary key value to locate the row in the clustered index. So, primary keys are implicitly appended to secondary keys.
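
A small sketch (the table and column names are made up) shows the practical effect: because the primary key is stored inside every secondary index entry, a query that only needs the indexed column and the primary key can be answered from the secondary index alone:

-- Hypothetical table, for illustration only
CREATE TABLE t (id INT PRIMARY KEY, col_b INT, KEY idx_b (col_b)) ENGINE=InnoDB;
-- Covered by idx_b alone: the PK (id) is implicitly appended to every idx_b entry
SELECT id FROM t WHERE col_b = 5;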

About duplicate keys: if I have a UNIQUE KEY on two columns, is it OK to also set a key on each of these columns? Or should I only keep the unique key on the columns and get rid of the regular key on each column?
As I mentioned during the talk, for a composite index the leftmost prefix is used. For example, if you have a UNIQUE INDEX on columns A,B as (A,B), then this index is not used for lookups for the query below:

SELECT * FROM test WHERE B='xxx';

For that query, you need a separate index on B column.
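
In other words, assuming the hypothetical test table from the question, an index layout like the following avoids redundancy:

-- Sketch only; table and column names are hypothetical
ALTER TABLE test ADD UNIQUE KEY uk_a_b (A, B);  -- also serves lookups on A alone (leftmost prefix)
ALTER TABLE test ADD KEY idx_b (B);             -- needed for lookups on B alone
-- A separate single-column key on A would be redundant with the leftmost prefix of uk_a_b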

Regarding MyISAM settings, don’t the mysql and information_schema schemas require MyISAM? If so, do any settings need to be changed from their defaults?
information_schema tables are built on demand as temporary tables, and performance_schema uses the PERFORMANCE_SCHEMA storage engine, so only the mysql system database tables use the MyISAM engine. The mysql system database is not used much, and the default settings for the MyISAM engine are usually fine.

Will stored functions make my app slower compared to plain queries?
I’m not sure how you’re comparing “queries” versus “stored functions.” Functions also need to be parsed and optimized, similar to building a query execution plan, and they might be slower compared to well-coded SQL, even allowing for the overhead of copying the resulting data set back to the client. Typically, functions contain many SQL queries. The trade-off is that more of the work is done on the database server and less on the client (application) side, which increases the load on the database server.

Will foreign keys make my fetches slower?
MySQL enforces referential integrity (which ensures data consistency between related tables) via foreign keys for the InnoDB storage engine. There is some overhead on INSERT/UPDATE/DELETE for a foreign key column, which has to check whether the value exists in the related column of the other table; but since this is an index lookup, the cost shouldn’t be high. However, locking overhead comes into play as well. This blogpost from our CEO is informative on this topic. Foreign keys especially affect writes, but I don’t think they make fetches slower for SELECTs.

Large pool size can have a negative impact to performance? About 62GB of pool size?
The InnoDB buffer pool is by far the most important option for InnoDB Performance, as it’s the main cache for data and indexes and it must be set correctly. Setting it large enough (i.e., larger than your dataset) shouldn’t cause any problems as long as you leave enough memory for OS needs and for MySQL buffers (e.g., sort buffer, join buffer, temporary tables, etc.).

62GB doesn’t necessarily mean a big InnoDB buffer pool. It depends on how much memory your MySQL server contains, and what the size of your total InnoDB dataset is. A good rule of thumb is to set the InnoDB buffer pool size as large as possible, while still leaving enough memory for MySQL buffers and for OS.
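
As a rough sanity check, you can compare the approximate dataset size with the configured buffer pool; the numbers below come from table statistics and are only approximations:

-- Approximate total InnoDB data + index size, in GB
SELECT ROUND(SUM(data_length + index_length)/1024/1024/1024, 1) AS innodb_size_gb
FROM information_schema.TABLES
WHERE engine = 'InnoDB';
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';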

You find duplicate, redundant indexes by looking at information_schema.key_column_usage directly?
The key_column_usage view provides information about key column constraints. It doesn’t provide information about duplicate or redundant indexes.

Can you find candidate missing indexes by looking at the slow query log?
Yes. As I mentioned, you can find queries that don’t use any index by enabling log_queries_not_using_indexes; they are written to the slow_query_log. You can also enable the user_statistics feature, which adds several information_schema tables, and use them to analyze index usage. pt-index-usage is yet another tool from Percona Toolkit for this purpose. Also, check this blogpost on this topic.
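
A minimal way to enable this at runtime (persist the settings in the configuration file if you want them to survive a restart):

SET GLOBAL slow_query_log = ON;
SET GLOBAL log_queries_not_using_indexes = ON;
-- Queries that hit no index now land in the slow log; reviewing them (e.g., with pt-query-digest)
-- helps spot candidate missing indexes.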

How to find the unused indexes? They also have an impact on performance.
Unused indexes can be found with the help of the pt-index-usage tool from Percona Toolkit, as mentioned above. If you are using Percona Server, you can also use the User Statistics feature. Check this blogpost from my colleague, which shows another technique to find unused indexes.
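
If you are on Percona Server (or MariaDB) with the User Statistics feature available, a rough sketch of the check looks like this; treat it as an illustration rather than a complete recipe:

-- Requires the User Statistics feature; let it collect data over a representative period first
SET GLOBAL userstat = ON;
SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME, ROWS_READ
FROM information_schema.INDEX_STATISTICS
ORDER BY ROWS_READ DESC;
-- Indexes that exist in information_schema.STATISTICS but never show up here were not used.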

As far as I understand, MIXED will automatically use ROW for non-deterministic and STATEMENT for deterministic queries. I’ve been using it for years now without any problems. So why this recommendation of ROW?

In MIXED mode, MySQL uses statement-based replication for most queries, switching to row-based replication only when statement-based replication would cause an inconsistency. We recommend ROW-based logging because it’s efficient and performs better, as it requires fewer row locks. However, RBR can generate more data if a DML query affects many rows, since a significant amount of data may need to be written to the binary log (you can configure the binlog_row_image parameter to control the amount of logging). Also, make sure you have good network bandwidth between the master and slave(s) for RBR, as it needs to send more data to the slaves. Another important thing to get the best performance out of ROW-based replication is to make sure all your database tables contain a primary key or unique key (because of this bug http://bugs.mysql.com/bug.php?id=53375).
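
As a small sketch, switching an already-running master to row-based logging and trimming the row image can be done like this (mirror the change in the configuration file so it persists across restarts):

SET GLOBAL binlog_format = 'ROW';
SET GLOBAL binlog_row_image = 'minimal';  -- log only the changed columns plus what identifies the row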

Can you give a brief overview of sharding…The pros and cons also.
With sharding, the data is split across multiple databases, with each shard storing a subset of the data. Sharding is useful to scale writes if you have a huge dataset and a single server can’t handle the amount of writes.

Performance and throughput can be better with sharding. On the other hand, it requires a lot of development and administration effort. The application needs to be aware of the shards and keep track of which data is stored in which shard. You can use the MySQL Fabric framework to manage farms of MySQL servers. Check the manual for details.

Why not mixed replication mode instead of row-based replication ?
As I mentioned above, MIXED uses a STATEMENT-based format by default, and converts to ROW-based replication format for non-deterministic queries. But ROW-based format is recommended as there could still be cases where MySQL fails to detect non-deterministic query behavior and replicates in a STATEMENT-based format.

Can you specify a few variables which could reduce slave lag?
Because of the single-threaded nature of MySQL replication (prior to MySQL 5.6), there is always a chance that a MySQL slave can lag behind the master. I would suggest considering the parameters below to reduce slave lag:

  • innodb_flush_log_at_trx_commit <> 1. Set it to 2 or 0; however, this could cost you up to 1 second of transactions in case of a crash.
  • innodb_flush_method = O_DIRECT. For Unix-like operating systems, O_DIRECT is recommended to avoid double buffering. If your InnoDB data and log files are located on a SAN, then O_DIRECT is probably not a good choice.
  • log_bin = 0. Disable binary logging (if enabled) to minimize extra disk I/O.
  • sync_binlog = 0. Disable sync_binlog.

Those parameters will definitely help to minimize slave lag. However, along with that, make sure your slave(s) hardware is as strong as the master’s, and make sure your read queries are fast enough. Don’t overload the slave too much, and distribute read traffic evenly between slave(s). Also, you should have the same table definitions on the slave(s) as on the master (e.g., the master’s indexes must exist on the slave(s) tables too). Last but not least, I wrote a blogpost on how to diagnose and cure replication lag; it might be useful for further reading.
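
For the dynamic settings above, a minimal sketch of applying them on a slave looks like the following; innodb_flush_method and disabling the binary log are not dynamic, so those belong in the configuration file followed by a restart:

-- Relaxed durability on the slave only; accept up to ~1 second of lost transactions on a crash
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
SET GLOBAL sync_binlog = 0;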

by Muhammad Irfan at August 19, 2016 11:34 PM

MariaDB AB

Configuring LDAP Authentication and Group Mapping With MariaDB

Geoff Montee

Enterprise users who have a large number of MariaDB servers often want to centralize their MariaDB user account administration -- especially for the user accounts of the database administration team. This can simplify some database administration tasks, since users do not have to be manually created on every server. Additionally, centralizing user account administration can make the enterprise environment more secure. For example, if a particular user needs to have their access to the servers revoked, the revocation only needs to happen once in the centralized repository, and the change will be reflected on all servers. This makes it much less likely that the database administration team will forget to remove the user's access from some of the servers, which could cause security problems if the user then tried to use the account in an unauthorized manner.

We've blogged in the past that MariaDB supports this kind of centralized user account administration with the PAM authentication plugin and PAM user mapping module and also about support for group mapping in the PAM user mapping module. Many enterprise users prefer to integrate these components with LDAP, but LDAP can be quite difficult to integrate with these components. For a step by step guide on how to do this, I’ve detailed specific instructions here.

About the Author

Geoff Montee is a Support Engineer with MariaDB. He has previous experience as a Database Administrator/Software Engineer with the U.S. Government, and as a System Administrator and Software Developer at Florida State University.

by Geoff Montee at August 19, 2016 07:51 PM

Colin Charles

Speaking in August 2016

I know this is a tad late, but there have been some changes, etc. recently, so apologies for the delay of this post. I still hope to meet many of you to chat about MySQL/Percona Server/MariaDB Server, MongoDB, open source databases, and open source in general in the remainder of August 2016.

  • LinuxCon+ContainerCon North America – August 22-24 2016 – Westin Harbour Castle, Toronto, Canada – I’ll be speaking about lessons one can learn from database failures and enjoying the spectacle that is the 25th anniversary of Linux!
  • Chicago MySQL Meetup Group – August 29 2016 – Vivid Seats, Chicago, IL – more lessons from database failures here, and I’m looking forward to meeting users, etc. in the Chicago area

While not speaking, Vadim Tkachenko and I will be present at the @scale conference. I really enjoyed my time there previously, and if you get an invite, it’s truly a great place to learn and network.

by Colin Charles at August 19, 2016 04:54 PM

Peter Zaitsev

Percona Server 5.5.51-38.1 is now available

Percona announces the release of Percona Server 5.5.51-38.1 on August 19, 2016. Based on MySQL 5.5.51, including all the bug fixes in it, Percona Server 5.5.51-38.1 is now the current stable release in the 5.5 series.

Percona Server is open-source and free. You can find release details in the 5.5.51-38.1 milestone on Launchpad. Downloads are available here and from the Percona Software Repositories.

Bugs Fixed:
  • PAM Authentication Plugin would abort authentication while checking UNIX user group membership if there were more than a thousand members. Bug fixed #1608902.
  • PAM Authentication Plugin didn’t support spaces in the UNIX user group names. Bug fixed #1544443.
  • If DROP DATABASE failed to delete some of the tables in the database, the partially-executed command was logged in the binlog as DROP TABLE t1, t2, ... for the tables for which the drop succeeded. A slave might fail to replicate such a DROP TABLE statement if there are foreign key relationships to any of the dropped tables and the slave has a different schema from the master. Fixed by checking, on the master, whether any of the tables to be dropped participate in a foreign key relationship, and failing the DROP DATABASE statement immediately in that case. Bug fixed #1525407 (upstream #79610).
  • Percona Server 5.5 could not be built with the -DMYSQL_MAINTAINER_MODE=ON option. Bug fixed #1590454.
  • In the client library, any EINTR received during network I/O was not handled correctly. Bug fixed #1591202 (upstream #82019).
  • The included .gitignore in the percona-server source distribution had a line *.spec, which means someone trying to check in a copy of the percona-server source would be missing the spec file required to build the RPM packages. Bug fixed #1600051.
  • The fix for bug #1341067 added a call to free some of the heap memory allocated by OpenSSL. This was not safe for repeated calls if OpenSSL is linked twice through different libraries and each is trying to free the same memory. Bug fixed #1604676.
  • If the changed page bitmap redo log tracking thread stopped for any reason, shutdown would wait a long time for the log tracker thread to quit, which it never did. Bug fixed #1606821.
  • Performing slow InnoDB shutdown (innodb_fast_shutdown set to 0) could result in an incomplete purge, if a separate purge thread is running (which is a default in Percona Server). Bug fixed #1609364.
  • For security reasons, LD_PRELOAD libraries can now only be loaded from the system directories (/usr/lib64, /usr/lib) and the MySQL installation base directory.
Other bugs fixed:

#1515591 (upstream #79249), #1612551, #1609523, #756387, #1097870, #1603073, #1606478, #1606572, #1606782, #1607224, #1607359, #1607606, #1607607, #1607671, #1608385, #1608424, #1608437, #1608515, #1608845, #1609422, #1610858, #1612084, #1612118, and #1613641.

Find the release notes for Percona Server 5.5.51-38.1 in our online documentation. Report bugs on the launchpad bug tracker.

by Hrvoje Matijakovic at August 19, 2016 03:58 PM

August 18, 2016

Peter Zaitsev

ProxySQL 1.2.1 GA Release

The GA release of ProxySQL 1.2.1 is available. You can get it from https://github.com/sysown/proxysql/releases. There are also Docker images for Release 1.2.1: https://hub.docker.com/r/percona/proxysql/.

ProxySQL is a high-performance proxy, currently for MySQL and its forks (like Percona Server and MariaDB). It acts as an intermediary for client requests seeking resources from the database. ProxySQL was created for DBAs by René Cannaò, as a means of solving complex replication topology issues.

This post is published with René’s approval. René is busy implementing more new ProxySQL features, so I decided to make this announcement!

Release highlights:
  • Support for backend SSL connections
  • Support for encrypted passwords (the mysql_users table now supports both plain text passwords and hashed passwords, in the same format as mysql.user.password)
  • Improved monitoring module
  • Better integration with Percona XtraDB Cluster
    • New feature: the Scheduler, that allows the extension of ProxySQL with external scripts

The last point is especially important in conjunction with our recent Percona XtraDB Cluster 5.7 RC1 release. When we ship Percona XtraDB Cluster 5.7 GA, we plan to make ProxySQL the default proxy solution choice for Percona XtraDB Cluster. ProxySQL is aware of the cluster and node status, and can direct traffic appropriately.
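
As a sketch of how the new Scheduler can hook such a check script into ProxySQL from the admin interface (the script path, interval and argument below are illustrative assumptions, not the officially shipped configuration):

-- Run an external Galera check script every 5 seconds (path, interval and argument are assumptions)
INSERT INTO scheduler (id, active, interval_ms, filename, arg1)
VALUES (1, 1, 5000, '/usr/bin/proxysql_galera_checker', '10');
LOAD SCHEDULER TO RUNTIME;
SAVE SCHEDULER TO DISK;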

ProxySQL 1.2.1 comes with additional scripts to support this.

ProxySQL 1.2.1 and these scripts are compatible with existing Percona XtraDB Cluster 5.6 GA releases.

ProxySQL 1.2.1 is a solid release, currently used by many demanding high-performance production workloads – it is already battle tested! Give it a try if you are looking for a proxy solution.

ProxySQL is available under OpenSource license GPLv3, which allows you unlimited usage in production. ProxySQL has no plans to change the license!

by Vadim Tkachenko at August 18, 2016 05:55 PM

MariaDB AB

Installing MariaDB 10.1.16 on Mac OS X with Homebrew

Ben Stillman

Developing on your Mac? Get the latest stable MariaDB version set up on OS X easily with Homebrew. See this step by step guide on installing MariaDB 10.1.16. 

 

1 Install Xcode

xcode-select --install

bens-mbp:~ ben$ xcode-select --install
xcode-select: note: install requested for command line developer tools

 

2 Install Homebrew

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Bens-MacBook-Pro:~ ben$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
==> This script will install:
/usr/local/bin/brew
/usr/local/Library/...
/usr/local/share/doc/homebrew
/usr/local/share/man/man1/brew.1
/usr/local/share/zsh/site-functions/_brew
/usr/local/etc/bash_completion.d/brew

Press RETURN to continue or any other key to abort
==> /usr/bin/sudo /bin/mkdir -p /Users/ben/Library/Caches/Homebrew
Password:
==> /usr/bin/sudo /bin/chmod g+rwx /Users/ben/Library/Caches/Homebrew
==> /usr/bin/sudo /usr/sbin/chown ben /Users/ben/Library/Caches/Homebrew
==> Downloading and installing Homebrew...
remote: Counting objects: 537, done.
remote: Compressing objects: 100% (478/478), done.
remote: Total 537 (delta 31), reused 341 (delta 28), pack-reused 0
Receiving objects: 100% (537/537), 817.70 KiB | 1.25 MiB/s, done.
Resolving deltas: 100% (31/31), done.
From https://github.com/Homebrew/brew
 * [new branch]      master     -> origin/master
HEAD is now at 984ed83 doctor: print check on --debug.
==> Tapping homebrew/core
Cloning into '/usr/local/Library/Taps/homebrew/homebrew-core'...
remote: Counting objects: 3716, done.
remote: Compressing objects: 100% (3603/3603), done.
remote: Total 3716 (delta 15), reused 1863 (delta 4), pack-reused 0
Receiving objects: 100% (3716/3716), 2.88 MiB | 3.74 MiB/s, done.
Resolving deltas: 100% (15/15), done.
Checking connectivity... done.
Tapped 3594 formulae (3,743 files, 8.9M)
==> Installation successful!
==> Next steps
Run `brew help` to get started
Further documentation: https://git.io/brew-docs
==> Homebrew has enabled anonymous aggregate user behaviour analytics
Read the analytics documentation (and how to opt-out) here:
  https://git.io/brew-analytics

 

3 Check Homebrew

brew doctor

bens-mbp:~ ben$ brew doctor
Your system is ready to brew.

 

4 Update Homebrew

brew update

bens-mbp:~ ben$ brew update
Already up-to-date.

 

5 Verify MariaDB Version in Homebrew Repo

brew info mariadb

Bens-MacBook-Pro:~ ben$ brew info mariadb
mariadb: stable 10.1.16 (bottled), devel 10.2.1
Drop-in replacement for MySQL
https://mariadb.org/
Conflicts with: mariadb-connector-c, mysql, mysql-cluster, mysql-connector-c, mytop, percona-server
Not installed
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/mariadb.rb
==> Dependencies
Build: cmake ✘
Required: openssl ✘
==> Options
--universal
    Build a universal binary
--with-archive-storage-engine
    Compile with the ARCHIVE storage engine enabled
--with-bench
    Keep benchmark app when installing
--with-blackhole-storage-engine
    Compile with the BLACKHOLE storage engine enabled
--with-embedded
    Build the embedded server
--with-libedit
    Compile with editline wrapper instead of readline
--with-local-infile
    Build with local infile loading support
--with-test
    Keep test when installing
--devel
    Install development version 10.2.1
==> Caveats
A "/etc/my.cnf" from another install may interfere with a Homebrew-built
server starting up correctly.

To connect:
    mysql -uroot

To have launchd start mariadb now and restart at login:
  brew services start mariadb
Or, if you don't want/need a background service you can just run:
  mysql.server start

 

6 Install MariaDB

brew install mariadb

Bens-MacBook-Pro:~ ben$ brew install mariadb
==> Installing dependencies for mariadb: openssl
==> Installing mariadb dependency: openssl
==> Downloading https://homebrew.bintray.com/bottles/openssl-1.0.2h_1.el_capitan.bottle.tar.gz
######################################################################## 100.0%
==> Pouring openssl-1.0.2h_1.el_capitan.bottle.tar.gz
==> Caveats
A CA file has been bootstrapped using certificates from the system
keychain. To add additional certificates, place .pem files in
  /usr/local/etc/openssl/certs

and run
  /usr/local/opt/openssl/bin/c_rehash

This formula is keg-only, which means it was not symlinked into /usr/local.

Apple has deprecated use of OpenSSL in favor of its own TLS and crypto libraries

Generally there are no consequences of this for you. If you build your
own software and it requires this formula, you'll need to add to your
build variables:

    LDFLAGS:  -L/usr/local/opt/openssl/lib
    CPPFLAGS: -I/usr/local/opt/openssl/include

==> Summary
  /usr/local/Cellar/openssl/1.0.2h_1: 1,691 files, 12M
==> Installing mariadb
==> Downloading https://homebrew.bintray.com/bottles/mariadb-10.1.16.el_capitan.bottle.tar.gz
######################################################################## 100.0%
==> Pouring mariadb-10.1.16.el_capitan.bottle.tar.gz
==> /usr/local/Cellar/mariadb/10.1.16/bin/mysql_install_db --verbose --user=ben --basedir=/usr/local/Cellar/mariadb/10.1.16 --datadir=/usr/local/var/mysql --tmpdir=/tmp
==> Caveats
A "/etc/my.cnf" from another install may interfere with a Homebrew-built
server starting up correctly.

To connect:
    mysql -uroot

To have launchd start mariadb now and restart at login:
  brew services start mariadb
Or, if you don't want/need a background service you can just run:
  mysql.server start
==> Summary
  /usr/local/Cellar/mariadb/10.1.16: 573 files, 137.1M

 

7 Run the Database Installer

mysql_install_db

Bens-MacBook-Pro:10.1.16 ben$ mysql_install_db
Installing MariaDB/MySQL system tables in '/usr/local/var/mysql' ...
2016-08-16 19:15:02 140735320776704 [Note] /usr/local/Cellar/mariadb/10.1.16/bin/mysqld (mysqld 10.1.16-MariaDB) starting as process 83824 ...
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: The InnoDB memory heap is disabled
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: Memory barrier is not used
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: Compressed tables use zlib 1.2.5
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: Using SSE crc32 instructions
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: Initializing buffer pool, size = 128.0M
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: Completed initialization of buffer pool
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: Highest supported file format is Barracuda.
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: 128 rollback segment(s) are active.
2016-08-16 19:15:02 140735320776704 [Note] InnoDB: Waiting for purge to start
2016-08-16 19:15:02 140735320776704 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.30-76.3 started; log sequence number 1616819
2016-08-16 19:15:02 123145313034240 [Note] InnoDB: Dumping buffer pool(s) not yet started
OK
Filling help tables...
2016-08-16 19:15:04 140735320776704 [Note] /usr/local/Cellar/mariadb/10.1.16/bin/mysqld (mysqld 10.1.16-MariaDB) starting as process 83828 ...
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: The InnoDB memory heap is disabled
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: Memory barrier is not used
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: Compressed tables use zlib 1.2.5
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: Using SSE crc32 instructions
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: Initializing buffer pool, size = 128.0M
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: Completed initialization of buffer pool
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: Highest supported file format is Barracuda.
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: 128 rollback segment(s) are active.
2016-08-16 19:15:04 140735320776704 [Note] InnoDB: Waiting for purge to start
2016-08-16 19:15:04 140735320776704 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.30-76.3 started; log sequence number 1616829
2016-08-16 19:15:04 123145313034240 [Note] InnoDB: Dumping buffer pool(s) not yet started
OK
Creating OpenGIS required SP-s...
2016-08-16 19:15:07 140735320776704 [Note] /usr/local/Cellar/mariadb/10.1.16/bin/mysqld (mysqld 10.1.16-MariaDB) starting as process 83833 ...
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: The InnoDB memory heap is disabled
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: Memory barrier is not used
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: Compressed tables use zlib 1.2.5
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: Using SSE crc32 instructions
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: Initializing buffer pool, size = 128.0M
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: Completed initialization of buffer pool
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: Highest supported file format is Barracuda.
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: 128 rollback segment(s) are active.
2016-08-16 19:15:07 140735320776704 [Note] InnoDB: Waiting for purge to start
2016-08-16 19:15:07 140735320776704 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.30-76.3 started; log sequence number 1616839
2016-08-16 19:15:07 123145313034240 [Note] InnoDB: Dumping buffer pool(s) not yet started
OK

To start mysqld at boot time you have to copy
support-files/mysql.server to the right place for your system

PLEASE REMEMBER TO SET A PASSWORD FOR THE MariaDB root USER !
To do so, start the server, then issue the following commands:

'/usr/local/Cellar/mariadb/10.1.16/bin/mysqladmin' -u root password 'new-password'
'/usr/local/Cellar/mariadb/10.1.16/bin/mysqladmin' -u root -h Bens-MacBook-Pro.local password 'new-password'

Alternatively you can run:
'/usr/local/Cellar/mariadb/10.1.16/bin/mysql_secure_installation'

which will also give you the option of removing the test
databases and anonymous user created by default.  This is
strongly recommended for production servers.

See the MariaDB Knowledgebase at http://mariadb.com/kb or the
MySQL manual for more instructions.

You can start the MariaDB daemon with:
cd '/usr/local/Cellar/mariadb/10.1.16' ; /usr/local/Cellar/mariadb/10.1.16/bin/mysqld_safe --datadir='/usr/local/var/mysql'

You can test the MariaDB daemon with mysql-test-run.pl
cd '/usr/local/Cellar/mariadb/10.1.16/mysql-test' ; perl mysql-test-run.pl

Please report any problems at http://mariadb.org/jira

The latest information about MariaDB is available at http://mariadb.org/.
You can find additional information about the MySQL part at:
http://dev.mysql.com
Support MariaDB development by buying support/new features from MariaDB
Corporation Ab. You can contact us about this at sales@mariadb.com.
Alternatively consider joining our community based development effort:
http://mariadb.com/kb/en/contributing-to-the-mariadb-project/

 

8 Start MariaDB

mysql.server start

Bens-MacBook-Pro:10.1.16 ben$ mysql.server start
Starting MySQL
. SUCCESS!

 

9 Secure the Installation

mysql_secure_installation

Bens-MacBook-Pro:10.1.16 ben$ mysql_secure_installation

NOTE: RUNNING ALL PARTS OF THIS SCRIPT IS RECOMMENDED FOR ALL MariaDB
      SERVERS IN PRODUCTION USE!  PLEASE READ EACH STEP CAREFULLY!

In order to log into MariaDB to secure it, we'll need the current
password for the root user.  If you've just installed MariaDB, and
you haven't set the root password yet, the password will be blank,
so you should just press enter here.

Enter current password for root (enter for none):
OK, successfully used password, moving on...

Setting the root password ensures that nobody can log into the MariaDB
root user without the proper authorisation.

Set root password? [Y/n]
New password:
Re-enter new password:
Password updated successfully!
Reloading privilege tables..
 ... Success!


By default, a MariaDB installation has an anonymous user, allowing anyone
to log into MariaDB without having to have a user account created for
them.  This is intended only for testing, and to make the installation
go a bit smoother.  You should remove them before moving into a
production environment.

Remove anonymous users? [Y/n]
 ... Success!

Normally, root should only be allowed to connect from 'localhost'.  This
ensures that someone cannot guess at the root password from the network.

Disallow root login remotely? [Y/n]
 ... Success!

By default, MariaDB comes with a database named 'test' that anyone can
access.  This is also intended only for testing, and should be removed
before moving into a production environment.

Remove test database and access to it? [Y/n]
 - Dropping test database...
 ... Success!
 - Removing privileges on test database...
 ... Success!

Reloading the privilege tables will ensure that all changes made so far
will take effect immediately.

Reload privilege tables now? [Y/n]
 ... Success!

Cleaning up...

All done!  If you've completed all of the above steps, your MariaDB
installation should now be secure.

Thanks for using MariaDB!

 

10 Connect to MariaDB

mysql -u root -p

Bens-MacBook-Pro:10.1.16 ben$ mysql -u root -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 11
Server version: 10.1.16-MariaDB Homebrew

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

About the Author

Ben Stillman is a Principal Consultant working with MariaDB and MySQL.

by Ben Stillman at August 18, 2016 03:27 PM

Jean-Jerome Schmidt

Planets9s - Deploying HAProxy & MaxScale with ClusterControl, Percona Live & more

Welcome to this week’s Planets9s, covering all the latest resources and technologies we create around automation and management of open source database infrastructures.

ClusterControl Tips & Tricks: transparent database failover for your applications

If you’re a MySQL user, you will know that to achieve high availability, deploying a cluster is not enough. Nodes may (and will most probably) go down, and your system has to be able to adapt accordingly. In this blog post, we share some tips on how you can achieve high availability using ClusterControl for deploying HAProxy or MaxScale.

Read the blog

Our talks & tutorials at Percona Live Amsterdam

We’ll have a whole team at this year’s Percona Live conference in Amsterdam, and would like to encourage you to join us there if you can. We have two full day tutorials on how to become a MySQL or MongoDB DBA; talks about load balancers such as HAProxy, MaxScale, ProxySQL, nginx & more; how to upgrade to MySQL 5.7; and how to automate, monitor and manage your MongoDB servers.

Find out more

Become a ClusterControl DBA: adding existing databases and clusters

Following the recent introduction of our new ‘onboarding wizard’ for ClusterControl, we’ve updated this blog in the ‘Become a ClusterControl DBA’ series to reflect the enhanced way by which you can now add existing infrastructure components to ClusterControl for MySQL Galera, MySQL master-slave replication, PostgreSQL replication set and MongoDB replication set and manage them.

Read the blog

That’s it for this week! Feel free to share these resources with your colleagues and follow us in our social media channels.

Have a good end of the week,

Jean-Jérôme Schmidt
Planets9s Editor
Severalnines AB

by Severalnines at August 18, 2016 10:55 AM

August 17, 2016

Peter Zaitsev

TokuDB/PerconaFT fragmented data file performance improvements

In this blog post, we’ll discuss how we’ve improved TokuDB and PerconaFT fragmented data file performance.

Through our internal benchmarking and some user reports, we have found that with long term heavy write use TokuDB/PerconaFT performance can degrade significantly on large data files. Using smaller node sizes makes the problem worse (which is one of our performance tuning recommendations when you have faster storage). The problem manifests as low CPU utilization, a drop in overall TPS and high client response times during prolonged checkpointing.

This post explains a little about how PerconaFT structures dictionary files and where the current implementation breaks down. Hopefully, it explains the nature of the issue and how our solution helps address it. It also provides some contrived benchmarks that prove the solution.

PerconaFT map file disk format

NOTE: This post uses the terms index, data file, and dictionary somewhat interchangeably. We will use the PerconaFT term “dictionary” to refer specifically to a PerconaFT key/value data file.

PerconaFT stores every dictionary in its own data file on disk. TokuDB stores each index in a PerconaFT dictionary, plus one additional dictionary per table for some metadata. For example, if you have one TokuDB table with two secondary indices, you would have four data files or dictionaries: one small metadata dictionary for the table, one dictionary for the primary key/index, and one for each secondary index.

Each dictionary file has three major parts:

  • Two headers (yes, two) made up of various bits of metadata: file versions, a checkpoint logical sequence number (CLSN), the offset of this header’s block translation table, etc.
  • Two (yes, two, one per header) block translation tables (BTT) that map block numbers (BNs) to the physical offsets and sizes of the data blocks within the file.
  • Data blocks and holes (unused space). Unlike InnoDB, PerconaFT data blocks (nodes) are variable in size: anything from a minimum of a few bytes for an empty internal node all the way up to the block size defined when the tree was created (4MB by default if we don’t use compression), depending on the amount of data within that node.

Each dictionary file contains two versions of the header stored on disk, and only one is valid at any given point in time. Since the header structure has a fixed size, we always know their locations: the first is at offset zero, and the other is immediately after the first. The header that is currently valid is the one with the later/larger CLSN.

We write the header and the BTT to disk during a checkpoint or when a dictionary is closed (the only time we do so). The header overwrites the older header (the one with the older CLSN) on disk. From that moment onward, the disk space used by the previous version of the dictionary (the whole thing, not just the header) that is not also used by the latest version, is considered immediately free.

There is much more magic to how the PerconaFT does checkpoint and consistency, but that is really out of the scope of this post. Maybe a later post that addresses the sharp checkpoint of the PerconaFT can dive into this.

The block allocator

The block allocator is the algorithm and container that manages the list of known used blocks and unused holes within an open dictionary file. When a node gets written, it is the responsibility of the block allocator to find a suitable location in the file for the nodes data. It is always placed into a new block, never overwrites an existing block (except for reclaimed block space from blocks that are removed or moved and recorded during the last checkpoint). Conversely, when a node gets destroyed it is the responsibility of the block allocator to release that used space and create a hole out of the old block. That hole also must be merged with any other holes that are adjacent to it to have a record of just one large hole rather than a series of consecutive smaller holes.

Fragmentation and large files

The current implementation of the PerconaFT block allocator maintains a simple array of used blocks in memory for each open dictionary. The used blocks are ordered ascending by their offset in the file. The holes between the blocks are calculated by knowing the offset and size of the two bounding blocks. For example, one can calculate the hole offset and size between two adjacent blocks as: b[n].offset + b[n].size and b[n+1].offset - (b[n].offset + b[n].size), respectively.

To find a suitable hole to place node data, the current block allocator starts at the first block in the array. It iterates through the blocks looking for a hole between blocks that is large enough to hold the nodes data. Once we find a hole, we cut the space needed for the node out of the hole and the remainder is left as a hole for another block to possibly use later.

Note: Forcing alignment to 512-byte offsets for direct I/O has overhead, regardless of whether direct I/O is actually used or not.

This linear search severely degrades the PerconaFT performance for very large and fragmented dictionary files. We have some solid evidence from the field that this does occur. We can see it via various profiling tools as a lot of time spent within block_allocator_strategy::first_fit. It is also quite easy to create a case by using very small node (block) sizes and small fanouts (forces the existence of more nodes, and thus more small holes). This fragmentation can and does cause all sorts of side effects as the search operation locks the entire structure within memory. It blocks nodes from translating their node/block IDs into file locations.

Let’s fix it…

In this block storage paradigm, fragmentation is inevitable. We can try to dance around and propose different ways to prevent fragmentation (at the expense of higher CPU costs, online/offline operations, etc…). Or, we can look at the way the block allocator works and try to make it more efficient. Attacking the latter of the two options is a better strategy (not to say we aren’t still actively looking into the former).

Tree-based “Max Hole Size” (MHS) lookup

The linear search block allocator has no idea where bigger and smaller holes might be located within the set (a core limitation). It must use brute force to find a hole big enough for the data it needs to store. To address this, we implemented a new in-memory, tree-based algorithm (red-black tree). This replaces the current in-memory linear array and integrates the hole size search into the tree structure itself.

In this new block allocator implementation, we store the set of known in-use blocks within the node structure of a binary tree instead of a linear array. We order the tree by the file offset of the blocks. We then added a little extra data to each node of this new tree structure: it tells us the maximum hole we can expect to find in each child subtree. So now when searching for a hole, we can quickly drill down the tree to find an available hole of the correct size without needing to perform a fully linear scan. The trade-off is that merging holes together and updating the parental max hole sizes is slightly more intricate and time-consuming than in a linear structure. The huge improvement in search efficiency makes this extra overhead pure noise.

[Diagram: five example blocks and the holes between them, organized in the MHS tree]

We can see in this overly simplified diagram, we have five blocks:

  • offset 0 : 1 byte
  • offset 3 : 2 bytes
  • offset 6 : 3 bytes
  • offset 10 : 5 bytes
  • offset 20 : 8 bytes

We can calculate four holes in between those blocks:

  • offset 1 : 2 bytes
  • offset 5 : 1 byte
  • offset 9 : 1 byte
  • offset 15 : 5 bytes

We see that the search for a 4-byte hole traverses down the right side of the tree. It discovers a hole at offset 15. This hole is big enough for our 4 bytes. It does this without needing to visit the nodes at offsets 0 and 3. For you algorithmic folks out there, we have gone from an O(n) to an O(log n) search. This is tremendously more efficient when we get into severe fragmentation states. A side effect is that we tend to allocate blocks from holes closer to the needed size rather than from the first one big enough to fit. The small hole fragmentation issue may actually increase over time, but that has yet to be seen in our testing.

Benchmarks

As our CTO Vadim Tkachenko asserts, there are “Lies, Damned Lies and Benchmarks.” We’re going to show a simple test case where we thought, “What is the worst possible scenario that I can come up with in a small-ish benchmark to show the differences?”. So, rather than try and convince you using some pseudo-real-world benchmark that uses sleight of hand, I’m telling you up front that this example is slightly absurd, but pushes the issue to the foreground.

That scenario is actually pretty simple. We shape the tree to have as many nodes as possible, and intentionally use settings that reduce concurrency. We will use a standard sysbench OLTP test, and run it for about three hours after the prepare stage has completed:

  • Hardware:
    • Intel i7, 4 core hyperthread (8 virtual cores) @ 2.8 GHz
    • 16 GB of memory
    • Samsung 850 Pro SSD
  • Sysbench OLTP:
    • 1 table of 160M rows or about 30GB of primary key data and 4GB secondary key data
    • 24 threads
    • We started each test server instance with no data. Then we ran the sysbench prepare, then the sysbench run with no shutdown in between the prepare and run.
    • prepare command : /data/percona/sysbench/sysbench/sysbench --test=/data/percona/sysbench/sysbench/tests/db/parallel_prepare.lua --mysql-table-engine=tokudb --oltp-tables-count=1 --oltp-table-size=160000000 --mysql-socket=$(PWD)/var/mysql.sock --mysql-user=root --num_threads=1 run
    • run command : /data/percona/sysbench/sysbench/sysbench --test=/data/percona/sysbench/sysbench/tests/db/oltp.lua --mysql-table-engine=tokudb --oltp-tables-count=1 --oltp-table-size=160000000 --rand-init=on --rand-type=uniform --num_threads=24 --report-interval=30 --max-requests=0 --max-time=10800 --percentile=99 --mysql-socket=$(PWD)/var/mysql.sock --mysql-user=root run
  • mysqld/TokuDB configuration
    • innodb_buffer_pool_size=5242880
    • tokudb_directio=on
    • tokudb_empty_scan=disabled
    • tokudb_commit_sync=off
    • tokudb_cache_size=8G
    • tokudb_checkpointing_period=300
    • tokudb_checkpoint_pool_threads=1
    • tokudb_enable_partial_eviction=off
    • tokudb_fsync_log_period=1000
    • tokudb_fanout=8
    • tokudb_block_size=8K
    • tokudb_read_block_size=1K
    • tokudb_row_format=tokudb_uncompressed
    • tokudb_cleaner_period=1
    • tokudb_cleaner_iterations=10000

[Benchmark graphs: throughput, response time, and CPU utilization over the test run]

So as you can see: amazing results, right? Sustained throughput, immensely better response time and better utilization of available CPU resources. Of course, this is all fake with a tree shape that no sane user would implement. It illustrates what happens when the linear list contains small holes: exactly what we set out to fix!

Closing

Look for this improvement to appear in Percona Server 5.6.32-78.0 and 5.7.14-7. It’s a good one for you if you have huge TokuDB data files with lots and lots of nodes.

Credits!

Throughout this post, I referred to “we” numerous times. That “we” encompasses a great many people that have looked into this in the past and implemented the current solution. Some are current and former Percona and Tokutek employees that you may already know by name. Some are newer at Percona. I got to take their work and research, incorporate it into the current codebase, test and benchmark it, and report it here for all to see. Many thanks go out to Jun Yuan, Leif Walsh, John Esmet, Rich Prohaska, Bradley Kuszmaul, Alexey Stroganov, Laurynas Biveinis, Vlad Lesin, Christian Rober and others for all of the effort in diagnosing this issue, inventing a solution, and testing and reviewing this change to the PerconaFT library.

by George O. Lorch III at August 17, 2016 05:05 PM

Jean-Jerome Schmidt

ClusterControl Tips & Tricks - Transparent Database Failover for your Applications

ClusterControl is a great tool to deploy and manage databases clusters - if you are into MySQL, you can easily deploy clusters based on both traditional MySQL master-slave replication, Galera Cluster or MySQL NDB Cluster. To achieve high availability, deploying a cluster is not enough though. Nodes may (and will most probably) go down, and your system has to be able to adapt to those changes.

This adaptation can happen at different levels. You can implement some kind of logic within the application - it would check the state of cluster nodes and direct traffic to the ones which are reachable at the given moment. You can also build a proxy layer which will implement high availability in your system. In this blog post, we’d like to share some tips on how you can achieve that using ClusterControl.

Deploying HAProxy using ClusterControl

HAProxy is the standard - one of the most popular proxies used in connection with MySQL (but not only, of course). ClusterControl supports deployment and monitoring of HAProxy nodes. It also helps to implement high availability of the proxy itself using keepalived.

Deployment is pretty simple - you need to pick or fill in the IP address of a host where HAProxy will be installed, pick port, load balancing policy, decide if ClusterControl should use existing repository or the most recent source code to deploy HAProxy. You can also pick which backend nodes you’d like to have included in the proxy configuration, and whether they should be active or backup.

By default, the HAProxy instance deployed by ClusterControl won’t work correctly with a master-slave replication setup - it’s designed to implement round-robin type of load-balancing (e.g., for Galera Cluster where all nodes are writeable). There’s a way to go around this issue, though - in the following repository you can find a check script which is intended to work with MySQL Replication. You will need to replace the check deployed by ClusterControl with this particular file.

Keepalived is used to add high availability to the proxy layer. When you have at least two HAProxy nodes in your system, you can install Keepalived from the ClusterControl UI.

You’ll have to pick two HAProxy nodes and they will be configured as an active - standby pair. A Virtual IP would be assigned to the active server and, should it fail, it will be reassigned to the standby proxy. This way you can just connect to the VIP and all your queries will be routed to the currently active and working HAProxy node.

You can find more details in how the internals are configured by reading through our HAProxy tutorial.

Deploying MaxScale using ClusterControl

While HAProxy is a rock-solid proxy and very popular choice, it lacks database awareness, e.g., read-write split. The only way to do it in HAProxy is to create two backends and listen on two ports - one for reads and one for writes. This is, usually, fine but it requires you to implement changes in your application - the application has to understand what is a read and what is a write, and then direct those queries to the correct port. It’d be much easier to just connect to a single port and let the proxy decide what to do next - this is something HAProxy cannot do as what it does is just routing packets - no packet inspection is done and, especially, it has no understanding of the MySQL protocol.

MaxScale solves this problem - it talks MySQL protocol and it can (among other things) perform a read-write split. Installation of MaxScale from ClusterControl is simple - you want to go to Manage -> Load Balancer section and fill the “Install MaxScale” tab with the required data.

In short, we need to pick where MaxScale will be installed, what admin user and password it should have, and which user it should use to connect to the database. Next, we can pick the number of threads MaxScale should use, the ports, and which nodes should be added to the load balancer.

By default MaxScale is configured with two ways of accessing the database. You can use the round-robin listener on port 4006 - it will split connections between the available nodes in a round-robin fashion. If you want to use MaxScale’s ability to perform a read/write split, you need to connect to port 4008. Once connected, MaxScale will begin to parse your MySQL traffic and route it according to what queries you execute. In short, SELECT queries will be routed to slaves (or, in the case of Galera Cluster, to all nodes except the one picked as the master), while the remaining traffic will hit the master. Explicitly opened transactions will also be opened on the master only. In MySQL replication, the master is self-explanatory - it’s the master node that MaxScale will use. In Galera Cluster things are slightly different, as it’s a multi-master environment. What MaxScale does is check the wsrep_local_index value on all Galera nodes - the one with the lowest index is treated as the master. If the master goes down, the node with the next lowest index is picked.
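
To illustrate that single entry point, here is a minimal Scala/JDBC sketch (the host name, credentials and table names are placeholders rather than part of the setup above, and it assumes MySQL Connector/J is on the classpath): the application opens one connection to the read/write split port and lets MaxScale route each statement.

import java.sql.DriverManager

// Hypothetical MaxScale host, credentials and table - placeholders only.
// Port 4008 is the read/write split listener described above; the
// application talks to this single port and MaxScale does the routing.
val conn = DriverManager.getConnection(
  "jdbc:mysql://maxscale-host:4008/mydb", "app_user", "app_password")
val stmt = conn.createStatement()
// A plain SELECT is generally routed to a slave (or, in Galera, to a node
// other than the one MaxScale treats as the "master") ...
val rs = stmt.executeQuery("SELECT COUNT(*) FROM t1")
while (rs.next()) println(rs.getLong(1))
// ... while writes and explicitly opened transactions go to the master.
stmt.executeUpdate("UPDATE t1 SET updated_at = NOW() WHERE id = 1")
conn.close()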

MaxScale, like every proxy, can become a single point of failure, and it has to be made redundant to achieve high availability. There are a couple of methods to do that. One of them is to colocate MaxScale on the web nodes. The idea here is that, most of the time, the MaxScale process will work just fine, and the reason for its unavailability is that the whole node went down. In that case, if MaxScale is colocated with the web node, not much harm has been done because that particular web node will not be available either.

Another method, not supported directly from the ClusterControl UI, is to use Keepalived in a similar way as we did in the case of HAProxy.

You can do most of the hard work from the UI - you need to deploy MaxScale on two hosts. Then, deploy HAProxy on the same two nodes - so that ClusterControl will allow you to install Keepalived. That’s what you need to do next - install Keepalived, making sure the IP addresses of the HAProxy nodes you picked are the same as the IPs of your MaxScale nodes. Once Keepalived has been set up by ClusterControl, you can simply remove the HAProxy nodes.

The next step requires CLI access - you need to log into both MaxScale (and Keepalived) nodes and edit the Keepalived configuration. On CentOS it is located in /etc/keepalived/keepalived.conf. What you need to do is, basically, pass it through sed 's/haproxy/maxscale/g' - replace all mentions of haproxy with maxscale - and then apply the changes by restarting Keepalived on both nodes. The last step is to edit MaxScale’s configuration and make sure it listens on either the Virtual IP or on all interfaces - restart MaxScale to apply your changes and you are all set: MaxScale’s health will be monitored by Keepalived and, should the active node fail, the Virtual IP will be moved to the standby node.

Given how MaxScale decides which of the nodes in a Galera Cluster is the “master”, all MaxScale nodes will see the cluster in the same way, as long as you do not use any stickiness. By default, the master node is the node with the lowest wsrep_local_index, which means the old master will become the master again if it ever comes back online - you can change that so MaxScale sticks to the new master. You just have to make sure all nodes of the cluster have been added to MaxScale.

by Severalnines at August 17, 2016 03:50 PM

Peter Zaitsev

How Apache Spark makes your slow MySQL queries 10x faster (or more)

slow MySQL queries

In this blog post, we’ll discuss how to improve the performance of slow MySQL queries using Apache Spark.

Introduction

In my previous blog post, I wrote about using Apache Spark with MySQL for data analysis and showed how to transform and analyze a large volume of data (text files) with Apache Spark. Vadim also performed a benchmark comparing the performance of MySQL and Spark with the Parquet columnar format (using Air traffic performance data). That works great, but what if we don’t want to move our data from MySQL to another store (i.e., a columnar format), and instead want to run “ad hoc” queries on top of an existing MySQL server? Apache Spark can help here as well.

TL;DR version:

Using Apache Spark on top of the existing MySQL server(s) (without the need to export or even stream data to Spark or Hadoop), we can increase query performance more than ten times. Using multiple MySQL servers (replication or Percona XtraDB Cluster) gives us an additional performance increase for some queries. You can also use the Spark cache function to cache the whole MySQL table (or the results of a MySQL query).

The idea is simple: Spark can read MySQL data via JDBC and can also execute SQL queries, so we can connect it directly to MySQL and run the queries. Why is this faster? For long running (i.e., reporting or BI) queries, it can be much faster as Spark is a massively parallel system. MySQL can only use one CPU core per query, whereas Spark can use all cores on all cluster nodes. In my examples below, MySQL queries are executed inside Spark and run 5-10 times faster (on top of the same MySQL data).

In addition, Spark can add “cluster” level parallelism. In the case of MySQL replication or Percona XtraDB Cluster, Spark can split the query into a set of smaller queries (in the case of a partitioned table, for example, it will run one query per partition) and run those in parallel across multiple slave servers or multiple Percona XtraDB Cluster nodes. Finally, it will use map/reduce-type processing to aggregate the results.

I’ve used the same “Airlines On-Time Performance” database as in previous posts. Vadim created some scripts to download data and upload it to MySQL. You can find the scripts here: https://github.com/Percona-Lab/ontime-airline-performance. I’ve also used Apache Spark 2.0, which was released July 26, 2016.

Apache Spark Setup

Starting Apache Spark in standalone mode is easy. To recap:

  1. Download Apache Spark 2.0 and place it somewhere.
  2. Start the master.
  3. Start the slave (worker) and attach it to the master.
  4. Start the app (in this case spark-shell or spark-sql).

Example:

root@thor:~/spark# ./sbin/start-master.sh
less ../logs/spark-root-org.apache.spark.deploy.master.Master-1-thor.out
15/08/25 11:21:21 INFO Master: Starting Spark master at spark://thor:7077
15/08/25 11:21:21 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/08/25 11:21:21 INFO MasterWebUI: Started MasterWebUI at http://10.60.23.188:8080
root@thor:~/spark# ./sbin/start-slave.sh spark://thor:7077

To connect to Spark we can use spark-shell (Scala), pyspark (Python) or spark-sql. Since spark-sql is similar to MySQL cli, using it would be the easiest option (even “show tables” works). I also wanted to work with Scala in interactive mode so I’ve used spark-shell as well. In all the examples I’m using the same SQL query in MySQL and Spark, so working with Spark is not that different.

To work with a MySQL server in Spark we need Connector/J for MySQL. Download the package and copy the mysql-connector-java-5.1.39-bin.jar to the Spark directory, then add the class path to conf/spark-defaults.conf:

spark.driver.extraClassPath = /usr/local/spark/mysql-connector-java-5.1.39-bin.jar
spark.executor.extraClassPath = /usr/local/spark/mysql-connector-java-5.1.39-bin.jar

Running MySQL queries via Apache Spark

For this test I was using one physical server with 12 CPU cores (an older Intel(R) Xeon(R) CPU L5639 @ 2.13GHz), 48GB of RAM and SSD disks. I’ve installed MySQL and started the Spark master and Spark slave on the same box.

Now we are ready to run MySQL queries inside Spark. First, start the shell (from the Spark directory, /usr/local/spark in my case):

$ ./bin/spark-shell --driver-memory 4G --master spark://server1:7077

Then we will need to connect to MySQL from spark and register the temporary view:

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" ->  "jdbc:mysql://localhost:3306/ontime?user=root&password=",
  "dbtable" -> "ontime.ontime_part",
  "fetchSize" -> "10000",
  "partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "28"
  )).load()
jdbcDF.createOrReplaceTempView("ontime")

So we have created a “datasource” for Spark (or in other words, a “link” from Spark to MySQL). The Spark table name is “ontime” (linked to the MySQL ontime.ontime_part table) and we can run SQL queries in Spark, which in turn will be parsed and translated into MySQL queries.

The “partitionColumn” option is very important here. It tells Spark to run multiple queries in parallel, one query per partition.

Now we can run the query:

val sqlDF = sql("select min(year), max(year) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and (origin = 'RDU' or dest = 'RDU') GROUP by carrier HAVING cnt > 100000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT  10")
sqlDF.show()

MySQL Query Example

Let’s go back to MySQL for a second and look at the query example. I’ve chosen the following query (from my older blog post):

select min(year), max(year) as max_year, Carrier, count(*) as cnt,
sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
FROM ontime
WHERE
DayOfWeek not in (6,7)
and OriginState not in ('AK', 'HI', 'PR', 'VI')
and DestState not in ('AK', 'HI', 'PR', 'VI')
GROUP by carrier HAVING cnt > 100000 and max_year > '1990'
ORDER by rate DESC, cnt desc
LIMIT  10

The query will find the total number of delayed flights per airline. In addition, the query will calculate a smarter “ontime” rating, taking into consideration the number of flights (we do not want to compare smaller air carriers with the large ones, and we want to exclude older airlines that are no longer in business).

The main reason I’ve chosen this query is that it is hard to optimize in MySQL. The conditions in the “where” clause filter out only ~30% of the rows - about 71% of the table still matches. I’ve done a basic calculation:

mysql> select count(*) FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI');
+-----------+
| count(*)  |
+-----------+
| 108776741 |
+-----------+
mysql> select count(*) FROM ontime;
+-----------+
| count(*)  |
+-----------+
| 152657276 |
+-----------+
mysql> select round((108776741/152657276)*100, 2);
+-------------------------------------+
| round((108776741/152657276)*100, 2) |
+-------------------------------------+
|                               71.26 |
+-------------------------------------+

Table structure:

CREATE TABLE `ontime_part` (
  `YearD` int(11) NOT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `MonthD` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date DEFAULT NULL,
  `UniqueCarrier` char(7) DEFAULT NULL,
  `AirlineID` int(11) DEFAULT NULL,
  `Carrier` char(2) DEFAULT NULL,
  `TailNum` varchar(50) DEFAULT NULL,
...
  `id` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`,`YearD`),
  KEY `covered` (`DayOfWeek`,`OriginState`,`DestState`,`Carrier`,`YearD`,`ArrDelayMinutes`)
) ENGINE=InnoDB AUTO_INCREMENT=162668935 DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (YearD)
(PARTITION p1987 VALUES LESS THAN (1988) ENGINE = InnoDB,
 PARTITION p1988 VALUES LESS THAN (1989) ENGINE = InnoDB,
 PARTITION p1989 VALUES LESS THAN (1990) ENGINE = InnoDB,
 PARTITION p1990 VALUES LESS THAN (1991) ENGINE = InnoDB,
 PARTITION p1991 VALUES LESS THAN (1992) ENGINE = InnoDB,
 PARTITION p1992 VALUES LESS THAN (1993) ENGINE = InnoDB,
 PARTITION p1993 VALUES LESS THAN (1994) ENGINE = InnoDB,
 PARTITION p1994 VALUES LESS THAN (1995) ENGINE = InnoDB,
 PARTITION p1995 VALUES LESS THAN (1996) ENGINE = InnoDB,
 PARTITION p1996 VALUES LESS THAN (1997) ENGINE = InnoDB,
 PARTITION p1997 VALUES LESS THAN (1998) ENGINE = InnoDB,
 PARTITION p1998 VALUES LESS THAN (1999) ENGINE = InnoDB,
 PARTITION p1999 VALUES LESS THAN (2000) ENGINE = InnoDB,
 PARTITION p2000 VALUES LESS THAN (2001) ENGINE = InnoDB,
 PARTITION p2001 VALUES LESS THAN (2002) ENGINE = InnoDB,
 PARTITION p2002 VALUES LESS THAN (2003) ENGINE = InnoDB,
 PARTITION p2003 VALUES LESS THAN (2004) ENGINE = InnoDB,
 PARTITION p2004 VALUES LESS THAN (2005) ENGINE = InnoDB,
 PARTITION p2005 VALUES LESS THAN (2006) ENGINE = InnoDB,
 PARTITION p2006 VALUES LESS THAN (2007) ENGINE = InnoDB,
 PARTITION p2007 VALUES LESS THAN (2008) ENGINE = InnoDB,
 PARTITION p2008 VALUES LESS THAN (2009) ENGINE = InnoDB,
 PARTITION p2009 VALUES LESS THAN (2010) ENGINE = InnoDB,
 PARTITION p2010 VALUES LESS THAN (2011) ENGINE = InnoDB,
 PARTITION p2011 VALUES LESS THAN (2012) ENGINE = InnoDB,
 PARTITION p2012 VALUES LESS THAN (2013) ENGINE = InnoDB,
 PARTITION p2013 VALUES LESS THAN (2014) ENGINE = InnoDB,
 PARTITION p2014 VALUES LESS THAN (2015) ENGINE = InnoDB,
 PARTITION p2015 VALUES LESS THAN (2016) ENGINE = InnoDB,
 PARTITION p_new VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

Even with a “covered” index, MySQL will have to scan ~70M-100M rows and create a temporary table:

mysql>  explain select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_part WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT  10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ontime_part
         type: range
possible_keys: covered
          key: covered
      key_len: 2
          ref: NULL
         rows: 70483364
        Extra: Using where; Using index; Using temporary; Using filesort
1 row in set (0.00 sec)

What is the query response time in MySQL?

mysql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_part WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT  10;
+------------+----------+---------+----------+-----------------+------+
| min(yearD) | max_year | Carrier | cnt      | flights_delayed | rate |
+------------+----------+---------+----------+-----------------+------+
|       2003 |     2013 | EV      |  2962008 |          464264 | 0.16 |
|       2003 |     2013 | B6      |  1237400 |          187863 | 0.15 |
|       2006 |     2011 | XE      |  1615266 |          230977 | 0.14 |
|       2003 |     2005 | DH      |   501056 |           69833 | 0.14 |
|       2001 |     2013 | MQ      |  4518106 |          605698 | 0.13 |
|       2003 |     2013 | FL      |  1692887 |          212069 | 0.13 |
|       2004 |     2010 | OH      |  1307404 |          175258 | 0.13 |
|       2006 |     2013 | YV      |  1121025 |          143597 | 0.13 |
|       2003 |     2006 | RU      |  1007248 |          126733 | 0.13 |
|       1988 |     2013 | UA      | 10717383 |         1327196 | 0.12 |
+------------+----------+---------+----------+-----------------+------+
10 rows in set (19 min 16.58 sec)

19 minutes is definitely not great.

SQL in Spark

Now we want to run the same query inside Spark and let Spark read data from MySQL. We will create a “datasource” and execute the query:

scala> val jdbcDF = spark.read.format("jdbc").options(
     |   Map("url" ->  "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
     |   "dbtable" -> "ontime.ontime_sm",
     |   "fetchSize" -> "10000",
     |   "partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2015", "numPartitions" -> "48"
     |   )).load()
16/08/02 23:24:12 WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 27; Input number of partitions: 48; Lower bound: 1988; Upper bound: 2015.
jdbcDF: org.apache.spark.sql.DataFrame = [id: int, YearD: date ... 19 more fields]
scala> jdbcDF.createOrReplaceTempView("ontime")
scala> val sqlDF = sql("select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT  10")
sqlDF: org.apache.spark.sql.DataFrame = [min(yearD): date, max_year: date ... 4 more fields]
scala> sqlDF.show()
+----------+--------+-------+--------+---------------+----+
|min(yearD)|max_year|Carrier|     cnt|flights_delayed|rate|
+----------+--------+-------+--------+---------------+----+
|      2003|    2013|     EV| 2962008|         464264|0.16|
|      2003|    2013|     B6| 1237400|         187863|0.15|
|      2006|    2011|     XE| 1615266|         230977|0.14|
|      2003|    2005|     DH|  501056|          69833|0.14|
|      2001|    2013|     MQ| 4518106|         605698|0.13|
|      2003|    2013|     FL| 1692887|         212069|0.13|
|      2004|    2010|     OH| 1307404|         175258|0.13|
|      2006|    2013|     YV| 1121025|         143597|0.13|
|      2003|    2006|     RU| 1007248|         126733|0.13|
|      1988|    2013|     UA|10717383|        1327196|0.12|
+----------+--------+-------+--------+---------------+----+

spark-shell does not show the query time. This can be retrieved from the Web UI or from spark-sql. I’ve re-run the same query in spark-sql:

./bin/spark-sql --driver-memory 4G  --master spark://thor:7077
spark-sql> CREATE TEMPORARY VIEW ontime
         > USING org.apache.spark.sql.jdbc
         > OPTIONS (
         >      url  "jdbc:mysql://localhost:3306/ontime?user=root&password=",
         >      dbtable "ontime.ontime_part",
         >      fetchSize "1000",
         >      partitionColumn "yearD", lowerBound "1988", upperBound "2014", numPartitions "48"
         > );
16/08/04 01:44:27 WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 26; Input number of partitions: 48; Lower bound: 1988; Upper bound: 2014.
Time taken: 3.864 seconds
spark-sql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT  10;
16/08/04 01:45:13 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
2003    2013    EV      2962008 464264  0.16
2003    2013    B6      1237400 187863  0.15
2006    2011    XE      1615266 230977  0.14
2003    2005    DH      501056  69833   0.14
2001    2013    MQ      4518106 605698  0.13
2003    2013    FL      1692887 212069  0.13
2004    2010    OH      1307404 175258  0.13
2006    2013    YV      1121025 143597  0.13
2003    2006    RU      1007248 126733  0.13
1988    2013    UA      10717383        1327196 0.12
Time taken: 139.628 seconds, Fetched 10 row(s)

So the response time of the same query is almost 10x shorter (on the same server, just one box). But how was this query translated into MySQL queries, and why is it so much faster? Here is what is happening inside MySQL:

Inside MySQL

Spark:

scala> sqlDF.show()
[Stage 4:>                                                        (0 + 26) / 26]

MySQL:

mysql> select id, info from information_schema.processlist where info is not NULL and info not like '%information_schema%';
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id    | info                                                                                                                                                                                                                                                    |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10948 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2001 AND yearD < 2002) |
| 10965 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2007 AND yearD < 2008) |
| 10966 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1991 AND yearD < 1992) |
| 10967 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1994 AND yearD < 1995) |
| 10968 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1998 AND yearD < 1999) |
| 10969 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2010 AND yearD < 2011) |
| 10970 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2002 AND yearD < 2003) |
| 10971 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2006 AND yearD < 2007) |
| 10972 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1990 AND yearD < 1991) |
| 10953 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2009 AND yearD < 2010) |
| 10947 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1993 AND yearD < 1994) |
| 10956 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD < 1989 or yearD is null)  |
| 10951 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2005 AND yearD < 2006) |
| 10954 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1996 AND yearD < 1997) |
| 10955 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2008 AND yearD < 2009) |
| 10961 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1999 AND yearD < 2000) |
| 10962 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2011 AND yearD < 2012) |
| 10963 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2003 AND yearD < 2004) |
| 10964 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1995 AND yearD < 1996) |
| 10957 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2004 AND yearD < 2005) |
| 10949 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1989 AND yearD < 1990) |
| 10950 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1997 AND yearD < 1998) |
| 10952 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2013)                  |
| 10958 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1992 AND yearD < 1993) |
| 10960 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2000 AND yearD < 2001) |
| 10959 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2012 AND yearD < 2013) |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
26 rows in set (0.00 sec)

Spark is running 26 queries in parallel, which is great. As the table is partitioned it only uses one partition per query, but scans the whole partition:

mysql> explain partitions SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2001 AND yearD < 2002)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ontime_part
   partitions: p2001
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5814106
        Extra: Using where
1 row in set (0.00 sec)

In this case, as the box has 12 CPU cores / 24 threads, it efficiently executes 26 queries in parallel and the partitioned table helps to avoid contention issues (I wish MySQL could scan partitions in parallel, but it can’t at the time of writing).

Another interesting thing is that Spark can “push down” some of the conditions to MySQL, but only those inside the “where” clause. All group by/order by/aggregations are done inside Spark. Spark needs to retrieve the data from MySQL to satisfy those operations and will not push down group by/order by/etc. to MySQL.

That also means that queries without “where” conditions (for example “select count(*) as cnt, carrier from ontime group by carrier order by cnt desc limit 10”) will have to retrieve all data from MySQL and load it into Spark (whereas MySQL would do all of the grouping internally). Running it in Spark might be slower or faster (depending on the amount of data and the use of indexes), but it also requires more resources and potentially more memory dedicated to Spark. The above query is translated into 26 queries, each of which does a “select carrier from ontime_part where (yearD >= N AND yearD < N)”
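
If you want to verify what gets pushed down for a particular query, you can print the DataFrame’s physical plan from spark-shell. Here is a minimal sketch, reusing the sqlDF defined above; in Spark 2.0 the JDBC scan node should list the pushed-down predicates (as “PushedFilters”), while the aggregation and sort operators above it are the parts executed by Spark:

// Print the physical plan of the query defined earlier: the JDBC scan shows
// which WHERE predicates were pushed down to MySQL, while everything above
// it (group by, order by, limit) is executed inside Spark.
sqlDF.explain()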

Pushing down the whole query into MySQL 

If we want to avoid sending all data from MySQL to Spark we have the option of creating a temporary table on top of a query (similar to MySQL’s create temporary table as select …). In Scala:

val tableQuery =
 "(select yeard, count(*) from ontime group by yeard) tmp"
 val jdbcDFtmp = spark.read.format("jdbc").options(
   Map("url" ->  "jdbc:mysql://localhost:3306/ontime?user=root&password=",
   "dbtable" -> tableQuery,
   "fetchSize" -> "10000"
   )).load()
jdbcDFtmp.createOrReplaceTempView("ontime_tmp")

In Spark SQL:

CREATE TEMPORARY VIEW ontime_tmp
USING org.apache.spark.sql.jdbc
OPTIONS (
     url  "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
     dbtable "(select yeard, count(*) from ontime_part group by yeard) tmp",
     fetchSize "1000"
);
select * from ontime_tmp;

Please note:

  1. We do not want to use “partitionColumn” here, otherwise we will see 26 queries like this in MySQL: “SELECT yeard, count(*) FROM (select yeard, count(*) from ontime_part group by yeard) tmp where (yearD >= N AND yearD < N)” (obviously not optimal)
  2. This is not a good use of Spark, more like a “hack.” The only good reason to do it is to be able to have the result of the query as a source of an additional query.

Query cache in Spark

Another option is to cache the result of the query (or even the whole table) and then use .filter in Scala for faster processing. This requires sufficient memory dedicated to Spark. The good news is that we can add additional nodes to Spark and get more memory for the Spark cluster.

Spark SQL example:

CREATE TEMPORARY VIEW ontime_latest
USING org.apache.spark.sql.jdbc
OPTIONS (
     url  "jdbc:mysql://localhost:3306/ontime?user=root&password=",
     dbtable "ontime.ontime_part partition (p2013, p2014)",
     fetchSize "1000",
     partitionColumn "yearD", lowerBound "1988", upperBound "2014", numPartitions "26"
);
cache table ontime_latest;
spark-sql> cache table ontime_latest;
Time taken: 465.076 seconds
spark-sql> select count(*) from ontime_latest;
5349447
Time taken: 0.526 seconds, Fetched 1 row(s)
spark-sql> select count(*), dayofweek from ontime_latest group by dayofweek;
790896  1
634664  6
795540  3
794667  5
808243  4
743282  7
782155  2
Time taken: 0.541 seconds, Fetched 7 row(s)
spark-sql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_latest WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and (origin='RDU' or dest = 'RDU') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT  10;
2013    2013    MQ      9339    1734    0.19
2013    2013    B6      3302    516     0.16
2013    2013    EV      9225    1331    0.14
2013    2013    UA      1317    177     0.13
2013    2013    AA      5354    620     0.12
2013    2013    9E      5520    593     0.11
2013    2013    WN      10968   1130    0.1
2013    2013    US      5722    549     0.1
2013    2013    DL      6313    478     0.08
2013    2013    FL      2433    205     0.08
Time taken: 2.036 seconds, Fetched 10 row(s)

Here we cache partitions p2013 and p2014 in Spark. This retrieves the data from MySQL and loads it in Spark. After that all queries run on the cached data and will be much faster.

With Scala we can cache the result of a query and then use filters to only get the information we need:

val sqlDF = sql("SELECT flightdate, origin, dest, depdelayminutes, arrdelayminutes, carrier, TailNum, Cancelled, Diverted, Distance from ontime")
sqlDF.cache().show()
scala> sqlDF.filter("flightdate='1988-01-01'").count()
res5: Long = 862

Using Spark with Percona XtraDB Cluster

As Spark can be used in cluster mode and scale with more and more nodes, reading data from a single MySQL server becomes a bottleneck. We can use MySQL replication slave servers or Percona XtraDB Cluster (PXC) nodes as a Spark datasource. To test it out, I’ve provisioned a Percona XtraDB Cluster with three nodes on AWS (I’ve used m4.2xlarge Ubuntu instances) and also started Apache Spark on each node:

  1. Node1 (pxc1): Percona Server + Spark Master + Spark worker node + Spark SQL running
  2. Node2 (pxc2): Percona Server + Spark worker node
  3. Node3 (pxc3): Percona Server + Spark worker node

All the Spark worker nodes use the memory configuration option:

cat conf/spark-env.sh
export SPARK_WORKER_MEMORY=24g

Then I can start spark-sql (the Connector/J JAR file also needs to be copied to all nodes):

$ ./bin/spark-sql --driver-memory 4G --master spark://pxc1:7077

When creating a table, I still use localhost to connect to MySQL (url “jdbc:mysql://localhost:3306/ontime?user=root&password=xxx”). As the Spark worker nodes are running on the same instances as the Percona XtraDB Cluster nodes, each worker will use its local connection. Running a Spark SQL query will then evenly distribute all 26 MySQL queries among the three MySQL nodes.

Alternatively, we can run the Spark cluster on a separate host and connect it to HAProxy, which in turn will load balance selects across multiple Percona XtraDB Cluster nodes.
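
As a rough sketch of that alternative (the HAProxy host name and port below are placeholders, not taken from the setup above), the only change on the Spark side is the JDBC URL - each partition query then goes through the proxy and is balanced across the Percona XtraDB Cluster nodes:

// Hypothetical HAProxy endpoint in front of the PXC nodes; host and port
// are placeholders.
val jdbcViaProxy = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://haproxy-host:3307/ontime?user=root&password=xxx",
  "dbtable" -> "ontime.ontime_part",
  "fetchSize" -> "10000",
  "partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "26"
  )).load()
jdbcViaProxy.createOrReplaceTempView("ontime")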

Query Performance Benchmark

Finally, here is the query response time test on the three AWS Percona XtraDB Cluster nodes:

Query 1:

select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_part WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;

Query / Index type             | MySQL Time       | Spark Time (3 nodes) | Times Improvement
No covered index (partitioned) | 19 min 16.58 sec | 192.17 sec           | 6.02
Covered index (partitioned)    | 2 min 10.81 sec  | 48.38 sec            | 2.7

 

Query 2: 

select dayofweek, count(*) from ontime_part group by dayofweek;

Query / Index type             | MySQL Time       | Spark Time (3 nodes) | Times Improvement
No covered index (partitioned) | 19 min 15.21 sec | 195.058 sec          | 5.92
Covered index (partitioned)    | 1 min 10.38 sec  | 27.323 sec           | 2.58

 

Now, this looks really good, but it can be better. With three m4.2xlarge nodes we have 8*3 = 24 cores in total (although they are shared between Spark and MySQL). We could expect a 10x improvement, especially without a covered index.

However, on m4.2xlarge the amount of RAM did not allow MySQL to run fully in memory, so all reads came from EBS (non-provisioned IOPS), which only gave me ~120MB/sec. I’ve redone the test on a set of three dedicated servers:

  • 28 cores E5-2683 v3 @ 2.00GHz
  • 240GB of RAM
  • Samsung 850 PRO

The test ran completely from RAM:

Query 1 (from the above)

Query / Index type             | MySQL Time      | Spark Time (3 nodes) | Times Improvement
No covered index (partitioned) | 3 min 13.94 sec | 14.255 sec           | 13.61
Covered index (partitioned)    | 2 min 2.11 sec  | 9.035 sec            | 13.52

 

Query 2: 

select dayofweek, count(*) from ontime_part group by dayofweek;

Query / Index type             | MySQL Time      | Spark Time (3 nodes) | Times Improvement
No covered index (partitioned) | 2 min 0.36 sec  | 7.055 sec            | 17.06
Covered index (partitioned)    | 1 min 6.85 sec  | 4.514 sec            | 14.81

 

With this number of cores, and running from RAM, we actually do not have enough concurrency, as the table only has 26 partitions. I’ve also tried the non-partitioned table with an ID primary key, using 128 Spark partitions.

Note about partitioning

I’ve used a partitioned table (partitioned by year) in my tests to help reduce MySQL-level contention. At the same time, the “partitionColumn” option in Spark does not require that the MySQL table is partitioned. For example, if a table has a primary key, we can use this CREATE VIEW in Spark:

CREATE OR REPLACE TEMPORARY VIEW ontime
USING org.apache.spark.sql.jdbc
OPTIONS (
  url  "jdbc:mysql://127.0.0.1:3306/ontime?user=root&password=",
  dbtable "ontime.ontime",
  fetchSize "1000",
  partitionColumn "id", lowerBound "1", upperBound "162668934", numPartitions "128"
);

Assuming we have enough MySQL servers (i.e., nodes or slaves), we can increase the number of partitions and that can improve the parallelism (as opposed to only 26 partitions when partitioning by year). Actually, the above test gives us an even better response time: 6.44 seconds for query 1.

Where Spark doesn’t work well

For faster queries (those that use indexes or can efficiently use an index) it does not make sense to use Spark. Retrieving data from MySQL and loading it into Spark is not free. This overhead can be significant for faster queries. For example, a query like this 

select count(*) from ontime_part where YearD = 2013 and DayOfWeek = 7 and OriginState = 'NC' and DestState = 'NC';
 will only scan 1300 rows and return instantly (0.00 seconds reported by MySQL).

An even better example is this: 

select max(id) from ontime_part
In MySQL, the query will use the index and all calculations will be done inside MySQL. Spark, on the other hand, will have to retrieve all IDs (select id from ontime_part) from MySQL and calculate the maximum itself. That took 24.267 seconds.
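
If you do need such a query from Spark, one workaround is the query push-down trick shown earlier: wrap the aggregate in a derived table so that MySQL computes it (using the index) and Spark only fetches the one-row result. A minimal sketch in Scala:

// MySQL computes max(id) itself; Spark receives a single row instead of
// pulling all IDs over JDBC.
val maxQuery = "(select max(id) as max_id from ontime.ontime_part) tmp"
val jdbcMax = spark.read.format("jdbc").options(
  Map("url" ->  "jdbc:mysql://localhost:3306/ontime?user=root&password=",
  "dbtable" -> maxQuery,
  "fetchSize" -> "1000"
  )).load()
jdbcMax.show()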

Conclusion

Using Apache Spark as an additional engine layer on top of MySQL can help speed up slow reporting queries and add much-needed scalability for long-running SELECT queries. In addition, Spark can help with query caching for frequent queries.

PS: Visual explain plan with Spark

The Spark Web GUI provides lots of ways to monitor Spark jobs. For example, it shows the “job” progress:

spark_jobs

And SQL visual explain details:

slow MySQL queries

by Alexander Rubin at August 17, 2016 03:26 PM

MariaDB Foundation

MariaDB Galera Cluster 5.5.51 and Connector/J 1.5.1 now available

The MariaDB project is pleased to announce the immediate availability of MariaDB Galera Cluster 5.5.51 Stable (GA), and MariaDB Connector/J 1.5.1 Release Candidate (RC). See the release notes and changelogs for details on these releases.
  • Download MariaDB Galera Cluster 5.5.51
  • Release Notes
  • Changelog
  • What is MariaDB Galera Cluster?
  • MariaDB APT and YUM Repository Configuration Generator
[…]

The post MariaDB Galera Cluster 5.5.51 and Connector/J 1.5.1 now available appeared first on MariaDB.org.

by Daniel Bartholomew at August 17, 2016 01:48 PM

Colin Charles

What’s next

I received an overwhelming number of comments when I said I was leaving MariaDB Corporation. Thank you – it is really nice to be appreciated.

I haven’t left the MySQL ecosystem. In fact, I’ve joined Percona as their Chief Evangelist in the CTO Office, and I’m going to focus on the MySQL/Percona Server/MariaDB Server ecosystem, while also looking at MongoDB and other solutions that are good for Percona customers. Thanks again for the overwhelming response on the various social media channels, and via emails, calls, etc.

Here’s to a great time at Percona to focus on open source databases and solutions around them!

My first blog post on the Percona blog is I’m Colin Charles, and I’m here to evangelize open source databases! – see also the press release.

by Colin Charles at August 17, 2016 03:47 AM

August 16, 2016

Monty Says

Applying the Business Source Licensing (BSL)

I believe that Open Source is one of the best ways to develop software. However, as I have written in blogs before, the Open Source model presents challenges to creating a software company that has the needed resources to continually invest in product development and innovation.

One reason for this is a lack of understanding of the costs associated with developing and extending software. As one example of what I regard as unrealistic user expectations, here is a statement from a large software company when I asked them to support MariaDB development financially:

“As you may remember, we’re a fairly traditional and conservative company. A donation from us would require feature work in exchange for the donation. Unfortunately, I cannot think of a feature that I would want developed that we would be willing to pay for this year.”

This thinking is flawed on many fronts -- a new feature can take more than a year to develop! It also shows that the company saw that features create value worth investing in, but was not willing to pay for features that had already been developed, and was not prepared to invest in keeping alive a product they depend upon. They also don't trust the development team to independently define new features that would bring value. Without that investment, a technology company cannot fund ongoing research and development, thereby dooming its survival.

To be able to compete with closed source technology companies who have massive profit margins, one needs income.

Dual licensing on Free Software, as we applied it at MySQL, works only for a limited subset of products (something I have called ‘infrastructure software’) that customers need to combine with their own closed source software and distribute to their customers. Most software products are not like that. This is why David Axmark and I created the Business Source license (BSL), a license designed to harmonize producing Open Source software and running a successful software company.

The intent of BSL is to increase the overall freedom and innovation in the software industry, for customers, developers, user and vendors. Finally, I hope that BSL will pave the way for a new business model that sustains software development without relying primarily on support.

For those who are interested in the background, Linus Nyman, a doctoral student at the Hanken School of Economics in Finland, and I worked together on an academic article on the BSL.

Today, MariaDB Corporation is excited to introduce the beta release of MariaDB MaxScale 2.0, our database proxy, which is released under BSL. I am very happy to see MariaDB MaxScale being released under BSL, rather than under an Open Core or Closed Source license.  Developing software under BSL will provide more resources to enhance it for future releases, in similar ways as Dual Licensing did for MySQL. MariaDB Corporation will over time create more BSL products. Even with new products coming under BSL, MariaDB Server will continue to be licensed under GPL in perpetuity. Keep in mind that because MariaDB Server extends earlier MySQL GPL code it is forever legally bound by the original GPL license of MySQL.

In addition to putting MaxScale under BSL, we have also created a framework to make it easy for anyone else to license their software under BSL.

Here follows the copyright notice used in the MaxScale 2.0 source code:

/*
* Copyright (c) 2016 MariaDB Corporation Ab
*
* Use of this software is governed by the Business Source License
* included in the LICENSE.TXT file and at www.mariadb.com/bsl.
*
* Change Date: 2019-01-01
*
* On the date above, in accordance with the Business Source
* License, use of this software will be governed by version 2
* or later of the General Public License.
*/

Two out of three top characteristics of the BSL are already shown here: The Change Date and the Change License. Starting on 1 January 2019 (the Change Date), MaxScale 2.0 is governed by GPLv2 or later (the Change License).

The centrepiece of the LICENSE.TXT file itself is this text:

Use Limitation: Usage of the software is free when your application uses the Software with a total of less than three database server instances for production purposes.

This third top characteristic is in effect until the Change Date.

What this means is that the software can be distributed, used, modified, etc., for free, within the use limitation. Beyond it, a commercial relationship is required – which, in the case of MaxScale 2.0, is a MariaDB Enterprise Subscription, which permits the use of MaxScale with three or more database servers.

You can find the full license text for MaxScale at mariadb.com/bsl and a general BSL FAQ at mariadb.com/bsl-faq-adopting. Feel free to copy or refer to them for your own BSL software!

The key characteristics of BSL are as follows:
  • The source code of BSL software is available in full from day one.
  • Users of BSL software can modify, distribute and compile the source.
  • Code contributions are encouraged and accepted through the "new BSD" license.
  • The BSL is purposefully designed to avoid vendor lock-in; by this I mean that users of BSL software do not depend on one single vendor for support, bug fixes or enhancements of the BSL product.
  • The Change Date and Change License provide a time-delayed safety net for users, should the vendor stop developing the software.
  • Testing BSL software is always free of cost.
  • Production use of the software is free of cost within the use limitation.
  • Adoption of BSL software is encouraged with use limitations that provide ample freedom.
  • Monetisation of BSL software is driven by incremental sales in cases where the use limitation applies.

Whether BSL will be widely adopted remains to be seen. It’s certainly my desire that this new business model will inspire companies who develop Closed Source software or Open Core software to switch to BSL, which will ultimately result in more Open Source software in the community. With BSL, companies can realize a similar amount of revenue as they could with closed source or open core, while the free-of-cost usage in core production scenarios establishes a much larger user base to drive testing, innovation and adoption.


by Michael "Monty" Widenius (noreply@blogger.com) at August 16, 2016 06:56 PM

Peter Zaitsev

Webinar Thursday 8/18: Preventing and Resolving MySQL Downtime

MySQL Downtime

Join Percona’s Jervin Real for a webinar on Thursday August 18, 2016 at 10 am PDT (UTC-7) on Preventing and Resolving MySQL Downtime.

Preventing MySQL downtime and emergencies is difficult. Often these emergencies are caused by complex combinations of several things going wrong. Without knowledge of the causes of emergencies, proactive preventative measures often fail to prevent further problems, no matter how sincere the effort. This talk discusses some of the ways to prevent real production system emergencies, and suggests specific actions for:

  • Application stack configuration
  • MySQL server configuration
  • Operating system configuration
  • Troublesome server features
  • Special features of Percona Server
  • MySQL health checks
  • Percona Toolkit

Register for the webinar here.


Jervin Real, Technical Services Manager
As Technical Services Manager, Jervin partners with Percona’s customers on building reliable and highly performant MySQL infrastructures, while also doing other fun stuff like watching cat videos on the internet. Jervin joined Percona in April 2010. Starting as a PHP programmer, Jervin quickly got involved with the LAMP stack. He has worked on several high-traffic sites and a number of specialized web applications, such as mobile content distribution. Before joining Percona, Jervin also worked with several hosting companies, providing care for customer-hosted services and data on both Linux and Windows.

by Dave Avery at August 16, 2016 05:14 PM

Percona Toolkit 2.2.19 is now available

percona toolkit 2.2.19

Percona is pleased to announce the availability of Percona Toolkit 2.2.19, released August 16, 2016. Percona Toolkit is a collection of advanced command-line tools that perform a variety of MySQL server and system tasks that DBAs find too difficult or complex to perform manually. Percona Toolkit, like all Percona software, is free and open source.

This release is the current GA (Generally Available) stable release in the 2.2 series. Downloads are available here and from the Percona Software Repositories.

New Features:
  • 1221372: pt-online-schema-change now aborts with an error if the server is a slave, because this can break data consistency in case of row-based replication. If you are sure that the slave will not use row-based replication, you can disable this check using the --force-slave-run option.
  • 1485195: pt-table-checksum now forces replica table character set to UTF-8.
  • 1517155: Introduced --create-table-engine option to pt-heartbeat, which sets a storage engine for the heartbeat table different from the database default engine.
  • 1595678 and 1595912: Introduced --slave-user and --slave-password options to pt-online-schema-change, pt-table-sync, and pt-table-checksum.
  • 1610385: pt-online-schema-change now re-checks the list of slaves in the DSN table. This enables changing the contents of the table while the tool is running.

Bugs Fixed:
  • 1581752: Fixed pt-query-digest date and time parsing from MySQL 5.7 slow query log.
  • 1592166: Fixed memory leak when pt-kill kills a query.
  • 1592608: Fixed overflow of CONCAT_WS when pt-table-checksum or pt-table-sync checksums large BLOB, TEXT, or BINARY columns.
  • 1593265: Fixed pt-archiver deleting rows that were not archived.
  • 1610386: Fixed pt-slave-restart handling of GTID ranges where the left-side integer is larger than 9.
  • 1610387: Removed extra word ‘default’ from the --verbose help for pt-slave-restart.
  • 1610388: Fixed pt-table-sync not quoting enum values properly. They are now recognized as CHAR fields.

Find release details in the release notes and the 2.2.19 milestone at Launchpad. Report bugs on the Percona Toolkit Launchpad bug tracker.

by Hrvoje Matijakovic at August 16, 2016 03:05 PM

I’m Colin Charles, and I’m here to evangelize open source databases!

Colin Charles

Let me introduce myself: I’m Colin Charles.

Percona turns ten years old this year. To me, there is no better time to join the company as the Chief Evangelist in the CTO office.

I’ve been in the MySQL world a tad longer than Percona has, and have had the pleasure of working on MySQL at MySQL AB and Sun Microsystems. Most recently I was one of the founding team members for MariaDB Server in 2009. I watched that grow into the MariaDB Corporation (after the merger with SkySQL) and the MariaDB Foundation.

For me, it’s about the right server for the right job. Today they all support a myriad of different features and different storage engines. Each server has its own community that supports and discusses their pros and cons. This is now true for both the MySQL and MongoDB ecosystems.

I’ve always had a lot of respect for the work Percona does — pragmatic engineering, deeply technical consulting (and blog posts) and amazing conferences. A big deal for me, and a big reason why I’m now here, is that Percona truly believes in the spirit of open source software development. Their obvious support of the open source community is a great pull factor for users as well.

I just spent time on the Percona Live Europe conference committee. (I’ve been involved in MySQL-related conferences since 2006, and was even Program Chair for a couple of years). There, I got to see how the conference is evolving beyond just stock MySQL to also include MongoDB and other open source databases.

Recently I visited a customer who was not just interested in using a database, but also in offering a database-as-a-service to their internal customers. I discussed OpenStack with them, and knowing that Percona, the company I now represent, can support the architecture and deployment too? That’s kind of priceless.

We’re all crazy about databases and their position in the overall IT structure. They provide us with cool apps, internet functionality, and all sorts of short cuts to our daily lives. Percona’s role in providing solutions that address the issues that infrastructure faces is what really excites me about my new journey here.

by Colin Charles at August 16, 2016 02:02 PM

Jean-Jerome Schmidt

Our Talks & Tutorials at Percona Live Amsterdam

This year’s Percona Live Conference in Amsterdam is drawing closer, and we’re looking forward to presenting our talks and tutorials there, and to the opportunity to discuss these topics directly with the wider community of MySQL and MongoDB users present.

We’re happy to have a couple of tutorials and some talks that have been selected for the conference, and thought we’d share some of their details here already.

Do check them out on the conference website and if you haven’t done so yet, sign up to join everyone in October for this European Percona Live Conference.

Severalnines Tutorials

Become a MySQL DBA - Full day tutorial

Tutor: Krzysztof Książek

Whether you’re looking for monitoring and trending advice for your MySQL installation, how to diagnose issues with your MySQL setup, how to perform backups and database upgrades, or tips and tricks for some of the common maintenance operations related to MySQL - this full-day tutorial will provide you with the knowledge and tools needed to become a MySQL DBA.

Become a MongoDB DBA - Full day tutorial in cooperation with Percona

Tutor: Art van Scheppingen

This hands-on tutorial is intended to help you navigate your way through the steps that lead to becoming a MongoDB DBA. We are going to talk about the most important aspects of managing MongoDB infrastructure and we will be sharing best practices and tips on how to perform the most common activities. The topics covered will include how different MongoDB is to MySQL, monitoring and trending for your MongoDB installation, how to diagnose issues with your MongoDB setup, backups and more …

Severalnines Talks

Migrating To MySQL 5.7 - The Live Database Upgrade Guide

Speaker: Krzysztof Książek

There are a few things you need to keep in mind when planning a MySQL upgrade, such as important changes between versions 5.6 and 5.7, as well as the detailed testing that needs to precede any such upgrade process. In this session we’ll look at how to best research, prepare and perform such tests before the time comes to finally start the upgrade process.

How to automate, monitor and manage your MongoDB servers

Speaker: Art van Scheppingen

In this session we will go beyond the MongoDB deployment phase and show you how you can automate tasks, monitor a cluster and manage MongoDB. Art has presented this talk at Percona Live Santa Clara and also at the recent Percona Live Community Open House MongoDB in New York. It is based upon the experience he gained writing the ‘Become a MongoDB DBA’ blog series.

MySQL Load Balancers - MaxScale, ProxySQL, HAProxy, MySQL Router & nginx - a close up look

Speaker: Krzysztof Książek

This session aims to give a solid grounding in load balancer technologies for MySQL and MariaDB. We will review the wide variety of open-source options available: application connectors (php-mysqlnd, jdbc), TCP reverse proxies (HAProxy, Keepalived, Nginx) and SQL-aware load balancers (MaxScale, ProxySQL, MySQL Router), and look at what considerations you should make when assessing their suitability for your environment.

We’ll communicate further on our participation at the conference over the coming weeks and we’re already looking forward to seeing you all there. If you haven’t registered for the conference, you can follow this link to do so.

We look forward to meeting you in Amsterdam!

by Severalnines at August 16, 2016 01:41 PM

August 15, 2016

MariaDB AB

Invitation to Join MariaDB MaxScale 2.0 Beta!

Dipti Joshi

Today, MariaDB MaxScale, the database proxy for MariaDB, is reaching its next major milestone – the availability of MariaDB MaxScale 2.0 Beta software release. Beta is an important time in our release and we encourage you to download this release today!

MariaDB MaxScale 2.0 introduces several new capabilities:

Data Streaming

  • Built upon MaxScale’s Binlog Server functionality, you can now stream transactional data in real time from MariaDB to other big data stores like Hadoop or a data warehouse through messaging systems, like Kafka, for real-time analytics and machine learning applications. The package includes sample client applications for a Kafka producer and a standalone Python application to receive streaming data from MaxScale.

Better Security

  • Transport layer security with end-to-end SSL through MaxScale
  • MaxAdmin security improvements to enable configurable prevention of remote access
  • Connection rate limitation to protect against DDoS attacks

High Availability

  • Minimize downtime with read mode for MariaDB/MySQL master-slave clusters

The release notes for MariaDB MaxScale 2.0, including the list of bugs fixed, can be found here. Binaries for MaxScale 2.0 Beta are available for download here.

MaxScale documentation can be found in our Knowledge Base.

In case you want to build the binaries yourself, the source can be found on GitHub, tagged with maxscale-2.0.0.

About the Author

Dipti Joshi is Senior Product Manager for MariaDB MaxScale and MariaDB ColumnStore.

by Dipti Joshi at August 15, 2016 08:22 PM

Introducing MaxScale 2.0 Beta Release

Michael Widenius

I am happy to see MariaDB Corporation announcing a major new version of MariaDB MaxScale.  MaxScale 2.0 Beta is now available here.

MariaDB MaxScale is a database proxy, but it can do much more than what you normally associate with a proxy. MaxScale is a multi-threaded and event-driven engine, where the main functionality is provided by plugins that are loaded at runtime. The functionality is not restricted to routing, like sending write statements to the Master and sharing read statements between the Slaves; MaxScale also provides advanced filtering, enhanced security and authentication.

With MaxScale plugins you can handle the scalability and availability of your database cluster, and also secure it and manage maintenance downtime. Thanks to the plugin architecture of MaxScale it’s easy to extend it with custom plugins to allow it to handle new tasks. We are also working on a soon-to-be-released MaxScale development package to make it even easier to extend MaxScale.
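As a rough illustration of how these plugins fit together, here is a minimal read/write splitting configuration sketch. The layout (servers, a monitor, a service and a listener) follows the standard MaxScale configuration format, but the addresses, credentials and port are placeholders rather than values from this announcement:

# /etc/maxscale.cnf -- minimal read/write splitting sketch (illustrative values)

[server1]
type=server
address=192.168.0.11
port=3306
protocol=MySQLBackend

[server2]
type=server
address=192.168.0.12
port=3306
protocol=MySQLBackend

# The monitor plugin tracks which server is the Master and which are Slaves
[MySQL Monitor]
type=monitor
module=mysqlmon
servers=server1,server2
user=maxscale
passwd=maxscale_pw

# The router plugin sends writes to the Master and spreads reads over the Slaves
[Splitter Service]
type=service
router=readwritesplit
servers=server1,server2
user=maxscale
passwd=maxscale_pw

# The listener plugin exposes the service to clients on a TCP port
[Splitter Listener]
type=listener
service=Splitter Service
protocol=MySQLClient
port=4006

Clients then connect to MaxScale on port 4006 (in this sketch) as if it were a single MySQL server, and the readwritesplit router decides where each statement goes.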

Some of the major new features in MaxScale 2.0 are:

Data Streaming:

  • Change Data Capture (CDC) allows you to replicate binlog events from MariaDB to Kafka in real time. This allows you to now leverage real-time data for machine learning or data analytics.
  • A beta release supports binlog-to-Avro conversion and distribution modules. These modules allow MaxScale to connect to a MariaDB 10.0 Master server and convert the binary log events to Avro format change records.

High Availability:

  • The readwritesplit routing module supports a high availability read mode where read queries are allowed, even if the Master server goes down.
  • The MariaDB monitor module, mysqlmon, supports stale states for both the Master and Slave servers. Even if the Master goes down, a Slave server will retain its Slave state and continue to be used as long as it is running (a configuration sketch follows this list).
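If I read the 2.0 documentation correctly, these behaviours map to the readwritesplit parameter master_failure_mode and the mysqlmon parameters detect_stale_master and detect_stale_slave. The fragment below is a hypothetical sketch only; verify the exact names and values against the MaxScale 2.0 release notes:

# Hypothetical maxscale.cnf fragment -- parameter names assumed, check the 2.0 docs

[Splitter Service]
type=service
router=readwritesplit
# keep answering read queries if the Master goes away (assumed value)
master_failure_mode=error_on_write

[MySQL Monitor]
type=monitor
module=mysqlmon
# let a running Slave keep its state even when the Master is down (assumed options)
detect_stale_master=true
detect_stale_slave=true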

Security:

  • End-to-end SSL secures data in motion.
  • MaxAdmin can be configured to only accept local authorization based on the UNIX identity. This is a good security enhancement when you do not want admins accessing the system remotely.
  • A new connection rate limitation reduces the impact of attacks before they are officially identified. You can specify a maximum number of connections for a service; when that limit is reached, further connection attempts are rejected with an error (see the sketch after this list).
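The connection limit is, as far as I can tell, a per-service setting; the parameter name below (max_connections) is an assumption to check against the documentation rather than a confirmed name:

# Hypothetical fragment -- cap client connections to a service (parameter name assumed)
[Splitter Service]
type=service
router=readwritesplit
max_connections=500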

We have also changed the query classifier component that MaxScale uses when deciding what to do with a particular query. It used to be based upon the MariaDB embedded library, but is now based upon sqlite3.

One of the things that I am most excited about for MariaDB MaxScale 2.0 is that it is the first MariaDB product to adopt the Business Source License (BSL). The excitement comes from the fact that we are adopting the BSL instead of resorting to a more closed approach like Open Core. BSL, while not being an OSI-certified Open Source license, supports the core freedoms of Open Source software, including making all code (not just some) openly available from day one. Anyone can modify, extend and compile the code, test the software, and most will even be able to use the software for free. For MaxScale 2.0, one can use it for free with fewer than three database servers. MariaDB Corporation will also give free licenses to active contributors.

We at MariaDB Corporation think that this new license provides a great way to harmonize ongoing development innovation, community contribution and sustainable engineering.

If you’d like to learn more about the BSL, please visit my blog and our MariaDB BSL FAQ.

Looking forward to getting your thoughts!

by Michael Widenius at August 15, 2016 08:11 PM

Jean-Jerome Schmidt

Become a ClusterControl DBA: Adding Existing Databases and clusters (updated)

In our previous blog post we covered the deployment of four types of clustering/replication: MySQL Galera, MySQL master-slave replication, PostgreSQL replication set and MongoDB replica set. This should enable you to create new clusters with great ease, but what if you already have 20 replication setups deployed and wish to manage them with ClusterControl?

This blog post will cover adding existing infrastructure components for these four types of clustering/replication to ClusterControl and how to have ClusterControl manage them.

Adding an existing Galera cluster to ClusterControl

Adding an existing Galera cluster to ClusterControl requires a MySQL user with the proper grants and an SSH user that is able to log in (without password) from the ClusterControl node to your existing databases and clusters.
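As a rough sketch of those prerequisites (hostnames, the user name and the exact privilege set are illustrative; check the ClusterControl documentation for the precise grants it expects):

# On the ClusterControl node: set up passwordless SSH to each database node
$ ssh-keygen -t rsa
$ ssh-copy-id root@galera1.example.com

# On one of the Galera nodes: create a MySQL user ClusterControl can connect with
$ mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'cmon'@'clustercontrol.example.com' IDENTIFIED BY 'secret' WITH GRANT OPTION"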

Install ClusterControl on a separate VM. Once it is up, open the two-step dialogue for adding an existing cluster. All you have to do is to add one of the Galera nodes and ClusterControl will figure out the rest:

After this, ClusterControl will connect to the host behind the scenes, detect all the necessary details for the full cluster and register the cluster in the overview.

Adding an existing MySQL master-slave to ClusterControl

Adding an existing MySQL master-slave topology requires a bit more work than adding a Galera cluster. While ClusterControl is able to extract the necessary information by itself for Galera, in the case of master-slave you need to specify every host within the replication setup.

After this, ClusterControl will connect to every host, see if they are part of the same topology and register them as part of one cluster (or server group) in the GUI.

Adding an existing MySQL NDB Cluster to ClusterControl

Adding an existing MySQL NDB Cluster takes four steps in total: defining SSH, management nodes, data nodes and finally the MySQL nodes.

After this ClusterControl will connect to the management, data and MySQL nodes and see if they are indeed part of the cluster. Then it will register the cluster, start monitoring and managing it.

Adding an existing PostgreSQL replication set to ClusterControl

Similar to adding the MySQL master-slave setup above, the PostgreSQL replication set also requires you to fill in all hosts within the same replication set.

After this, ClusterControl will connect to every host, see if they are part of the same topology and register them as part of the same group.

Adding an existing MongoDB replica set to ClusterControl

Adding an existing MongoDB replica set is just as easy as Galera: just one of the hosts in the replica set needs to be specified with its credentials and ClusterControl will automatically discover the other nodes in the replica set.

Adding an existing MongoDB sharded cluster set to ClusterControl

Adding an existing MongoDB sharded cluster is almost as easy as a MongoDB replica set: all shard routers in the cluster need to be specified with their credentials, and ClusterControl will automatically discover all shards and replica sets in the cluster.

Expanding your existing infrastructure

After adding the existing databases and clusters, they are now manageable via ClusterControl, and we can scale out our clusters.

For MySQL, MongoDB and PostgreSQL replication sets, this can easily be achieved via the same way we showed in our previous blogpost: simply add a node and ClusterControl will take care of the rest.

For Galera, there is a bit more choice. The most obvious option is to add a (Galera) node to the cluster by simply choosing “add node” in the cluster list or cluster overview. Expanding your Galera cluster this way should happen in increments of two, to ensure your cluster can always retain a majority during a split-brain situation.

Alternatively, you could add a replication slave and thus create an asynchronous slave of your synchronous cluster, which looks like this:

Adding a slave node blindly under one of the Galera nodes can be dangerous: if that node goes down, the slave won’t receive updates from its master anymore. We blogged about this paradigm earlier and you can read how to solve it in this blog post:

http://severalnines.com/blog/deploy-asynchronous-replication-slave-mariadb-galera-cluster-gtid-clustercontrol

Final thoughts

We showed you how easy it is to add existing databases and clusters to ClusterControl; you can literally add clusters within minutes. So nothing should hold you back from using ClusterControl to manage your existing infrastructure. If you have a large infrastructure, the addition of ClusterControl will give you more overview and save time in troubleshooting and maintaining your clusters.

Now the challenge is how to leverage ClusterControl to keep track of key performance indicators, show the global health of your clusters and proactively alert you in time when something is predicted to happen. And that’s the subject we'll cover next time.

Read also in the same series: Become a ClusterControl DBA - Deploying your Databases and Clusters

by Severalnines at August 15, 2016 06:36 PM

Peter Zaitsev

Percona Live Europe 2016 Schedule Now Live

This post reveals the full Percona Live Europe 2016 schedule for Amsterdam this October 3-5.

The official Percona Live Open Source Database Conference Europe 2016 schedule is now live, and you can find it here.

The schedule demonstrates that this conference has something for everyone! Whether your interest is in MySQL, MongoDB or other open source databases, there are talks that will interest you.

The Percona Live Open Source Database Conference is the premier event for the diverse and active open source database community, as well as businesses that develop and use open source database software. The conferences have a technical focus with an emphasis on the core topics of MySQL, MongoDB, and other open source databases. Tackling subjects such as analytics, architecture and design, security, operations, scalability and performance, Percona Live provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs. This conference is an opportunity to network with peers and technology professionals by bringing together accomplished DBA’s, system architects and developers from around the world to share their knowledge and experience – all to help you learn how to tackle your open source database challenges in a whole new way.

Some of the talks for each area are:

MySQL

MongoDB

Open Source Databases

Check out the full schedule now!

Advanced Tickets

Purchase your passes now and get the advanced tickets discount. The earlier you buy, the better the value. You can register for Percona Live Europe here.

Sponsor Percona Live

Sponsor the Percona Live Open Source Database Performance Conference Europe 2016. Sponsorship gets you bigger visibility at the most important open source database conference in Europe. Benefits to sponsorship include:

  • Worldwide Audience: Made up of DBAs, developers, CTOs, CEOs, technology evangelists, entrepreneurs, and technology vendors.
  • Perfect Location: In Amsterdam City Centre, walking distance from Amsterdam Central Station.
  • Perfect Event: The showcase event for the rich and diverse MySQL, MongoDB and open source database markets in Europe.

Click here to sponsor now.

by Kortney Runyan at August 15, 2016 05:01 PM

Colin Charles

Changing of the guard

I posted a message to the internal mailing lists at MariaDB Corporation. I have departed (I resigned) the company, but definitely not the community. Thank you all for the privilege of serving the large MariaDB Server community of users, all 12 million+ of you. See you on the mailing lists, IRC, and the developer meetings.

The Japanese have a saying, “leave when the cherry blossoms are full”.

I’ve been one of the earliest employees of this post-merge company, and was on the founding team of the MariaDB Server having been around since 2009. I didn’t make the first company meeting in Mallorca (August 2009) due to the chickenpox, but I’ve been to every one since.

We made the first stable MariaDB Server 5.1 release in February 2010. Our first Linux distribution release was in openSUSE. Our then tagline: MariaDB: Community Developed. Feature Enhanced. Backward Compatible.

In 2013, we had to make a decision: merge with our sister company SkySQL or take on investment of equal value to compete; majority of us chose to work with our family.

Our big deal was releasing MariaDB Server 5.5 – Wikipedia migrated, Google wanted in, and Red Hat pushed us into the enterprise space.

Besides managing distributions and other community related activities (and in the pre-SkySQL days Rasmus and I did everything from marketing to NRE contract management, down to even doing press releases – you wear many hats when you’re in a startup of less than 20 people), in this time, I’ve written over 220 blog posts, spoken at over 130 events (an average of 18 per year), and given generally over 250 talks, tutorials and keynotes. I’ve had numerous face-to-face meetings with customers, figuring out what NRE they may need and providing them solutions. I’ve done numerous internal presentations, audience varying from the professional services & support teams, as well as the management team. I’ve even technically reviewed many books, including one of the best introductions by our colleague, Learning MySQL & MariaDB.

It’s been a good run. Seven years. An uncountable number of flights. Too many weekends away working for the cause. A whole bunch of great meetings with many of you. Seen the company go from bootstrap, merger, Series A, and Series B.

It’s been a true privilege to work with many of you. I have the utmost respect for Team MariaDB (and of course my SkySQL brethren!). I’m going to miss many of you. The good thing is that MariaDB Server is an open source project, and I’m not going to leave the project or #maria. I in fact hope to continue speaking and working on MariaDB Server.

I hope to remain connected to many of you.

Thank you for this great privilege.

Kind Regards,
Colin Charles

by Colin Charles at August 15, 2016 03:29 PM

August 12, 2016

Peter Zaitsev

Tuning Linux for MongoDB

In this post, we’ll discuss tuning Linux for MongoDB deployments.

By far the most common operating system you’ll see MongoDB running on is Linux 2.6 and 3.x. Linux flavors such as CentOS and Debian do a fantastic job of being a stable, general-purpose operating system. Linux runs software on hardware ranging from tiny computers like the Raspberry Pi up to massive data center servers. To make this flexibility work, however, Linux defaults to some “lowest common denominator” tunings so that the OS will boot on anything.

Working with databases, we often focus on the queries, patterns and tunings that happen inside the database process itself. This means we sometimes forget that the operating system below it is the life-support of the database, the air that it breathes so to speak. Of course, a highly-scalable database such as MongoDB runs fine on these general-purpose defaults without complaints, but the efficiency can be equivalent to running in regular shoes instead of sleek runners. At small scale, you might not notice the lost efficiency, but at large scale (especially when data exceeds RAM) improved tunings equate to fewer servers and lower operational costs. For all use cases and scales, good OS tunings also provide some improvement in response times and remove extra “what if…?” questions when troubleshooting.

Overall, memory, network and disk are the system resources important to MongoDB. This article covers how to optimize each of these areas. Of course, while we have successfully deployed these tunings to many live systems, it’s always best to test before applying changes to your servers.

If you plan on applying these changes, I suggest performing them with one full reboot of the host. Some of these changes don’t require a reboot, but test that they get re-applied if you reboot in the future. MongoDB’s clustered nature should make this relatively painless, plus it might be a good time to do that dreaded “yum upgrade” / “aptitude upgrade“, too.

Linux Ulimit

To prevent a single user from impacting the entire system, Linux has a facility to implement some system resource constraints on processes, file handles and other system resources on a per-user-basis. For medium-high-usage MongoDB deployments, the default limits are almost always too low. Considering MongoDB generally uses dedicated hardware, it makes sense to allow the Linux user running MongoDB (e.g., “mongod”) to use a majority of the available resources.

Now you might be thinking: “Why not disable the limit (or set it to unlimited)?” This is a common recommendation for database servers. I think you should avoid this for two reasons:

  • If you hit a problem, a lack of a limit on system resources can allow a relatively smaller problem to spiral out of control, often bringing down other services (such as SSH) crucial to solving the original problem.
  • All systems DO have an upper-limit, and understanding those limitations instead of masking them is an important exercise.

In most cases, a limit of 64,000 “max user processes” and 64,000 “open files” (both have defaults of 1024) will suffice. To be more exact you need to do some math on the number of applications/clients, the maximum size of their connection pools and some case-by-case tuning for the number of inter-node connections between replica set members and sharding processes. (We might address this in a future blog post.)

You can deploy these limits by adding a file in “/etc/security/limits.d” (or appending to “/etc/security/limits.conf” if there is no “limits.d”). Below is an example file for the Linux user “mongod”, raising open-file and max-user-process limits to 64,000:

mongod       soft        nproc        64000
mongod       hard        nproc        64000
mongod       soft        nofile       64000
mongod       hard        nofile       64000

Note: this change only applies to new shells, meaning you must restart “mongod” or “mongos” to apply this change!
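After restarting the process, one way to confirm the new limits were picked up is to read the running process’s limits file (the numbers shown are just an example):

$ cat /proc/$(pidof mongod)/limits | egrep "processes|files"
Max processes             64000                64000                processes
Max open files            64000                64000                files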

Virtual Memory
Dirty Ratio

The “dirty_ratio” is the percentage of total system memory that can hold dirty pages. The default on most Linux hosts is between 20-30%. When you exceed the limit the dirty pages are committed to disk, creating a small pause. To avoid this hard pause there is a second ratio: “dirty_background_ratio” (default 10-15%) which tells the kernel to start flushing dirty pages to disk in the background without any pause.

20-30% is a good general default for “dirty_ratio”, but on large-memory database servers this can be a lot of memory! For example, on a 128GB-memory host this can allow up to 38.4GB of dirty pages. The background ratio won’t kick in until 12.8GB! We recommend that you lower this setting and monitor the impact on query performance and disk IO. The goal is to reduce memory usage without impacting query performance negatively. Reducing cache sizes also guarantees data gets written to disk in smaller batches and more frequently, which increases disk throughput compared with huge, infrequent bulk writes.

A recommended setting for dirty ratios on large-memory (64GB+ perhaps) database servers is “vm.dirty_ratio = 15” and “vm.dirty_background_ratio = 5”, or possibly less. (Red Hat recommends lower ratios of 10 and 3 for high-performance/large-memory servers.)

You can set this by adding the following lines to “/etc/sysctl.conf”:

vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

To check these current running values:

$ sysctl -a | egrep "vm.dirty.*_ratio"
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15

Swappiness

“Swappiness” is a Linux kernel setting that influences how aggressively the Virtual Memory manager swaps pages out to disk, ranging from 0 to 100. A setting of 0 tells the kernel to swap only to avoid out-of-memory problems. A setting of 100 tells it to swap aggressively to disk. The Linux default is usually 60, which is not ideal for database usage.

It is common to see a setting of “0” (or sometimes “10”) on database servers, telling the kernel to avoid swapping to disk as much as possible for better response times. However, Ovais Tariq details a known bug (or feature) when using a setting of 0 in this blog post: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/.

Due to this bug, we recommend using a setting of “1” (or “10” if you prefer some disk swapping) by adding the following to your “/etc/sysctl.conf”:

vm.swappiness = 1

To check the current swappiness:

$ sysctl vm.swappiness
vm.swappiness = 1

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply a dirty_ratio or swappiness change!

Transparent HugePages

*Does not apply to Debian/Ubuntu or CentOS/RedHat 5 and lower*

Transparent HugePages is an optimization introduced in CentOS/RedHat 6.0, with the goal of reducing overhead on systems with large amounts of memory. However, due to the way MongoDB uses memory, this feature actually does more harm than good, as memory accesses are rarely contiguous.

Disable THP entirely by adding the following flag to your Linux kernel boot options:

transparent_hugepage=never

Usually this requires changes to the GRUB boot-loader config in the directory “/boot/grub” (or “/etc/grub.d” on newer systems). Red Hat covers this in more detail in this article (same method on CentOS): https://access.redhat.com/solutions/46111.

Note: We recommended rebooting the system to clear out any previous huge pages and validate that the setting will persist on reboot.
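One way to verify the setting after the reboot is to read the sysfs flag; the bracketed value is the active one (the path can differ slightly on older RedHat kernels):

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]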

NUMA (Non-Uniform Memory Access) Architecture

Non-Uniform Memory Access is a memory architecture that takes into account the locality of caches and CPUs for lower latency. Unfortunately, MongoDB is not “NUMA-aware” and leaving NUMA in its default setup can cause severe memory imbalance.

There are two ways to disable NUMA: one is via an on/off switch in the system BIOS config, the second is using the “numactl” command to set NUMA interleaved mode (similar effect to disabling NUMA) when starting MongoDB. Both methods achieve the same result. I lean towards using the “numactl” command, as it future-proofs you for the mostly inevitable addition of NUMA awareness. On CentOS 7+ you may need to install the “numactl” yum/rpm package.

To make mongod start using interleaved mode, add “numactl --interleave=all” before your regular “mongod” command:

$ numactl --interleave=all mongod <options here>

To check mongod’s NUMA setting:

$ sudo numastat -p $(pidof mongod)
Per-node process memory usage (in MBs) for PID 7516 (mongod)
                           Node 0           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                        28.53           28.53
Stack                        0.20            0.20
Private                      7.55            7.55
----------------  --------------- ---------------
Total                       36.29           36.29

If you see only 1 x NUMA-node column (“Node 0”), NUMA is disabled. If you see more than 1 x NUMA-node, make sure the metric numbers (“Heap”, etc.) are balanced between nodes. Otherwise, NUMA is NOT in “interleave” mode.

Note: some MongoDB packages already ship logic to disable NUMA in the init/startup script. Check for this using “grep” first. Your hardware or BIOS manual should cover disabling NUMA via the system BIOS.

Block Device IO Scheduler and Read-Ahead

For tuning flexibility, we recommend that MongoDB data sits on its own disk volume, preferably with its own dedicated disks/RAID array. While it may complicate backups, for the best performance you can also dedicate a separate volume for the MongoDB journal to separate its disk activity noise from the main data set. The journal does not yet have its own config/command-line setting, so you’ll need to mount a volume to the “journal” directory inside the dbPath. For example, “/var/lib/mongo/journal” would be the journal mount-path if the dbPath was set to “/var/lib/mongo”.

Aside from good hardware, the block device MongoDB stores its data on can benefit from two major adjustments:

IO Scheduler

The IO scheduler is an algorithm the kernel will use to commit reads and writes to disk. By default most Linux installs use the CFQ (Completely-Fair Queue) scheduler. This is designed to work well for many general use cases, but with little latency guarantees. Two other popular schedulers are “deadline” and “noop”. Deadline excels at latency-sensitive use cases (like databases) and noop is closer to no scheduling at all.

We generally suggest using the “deadline” IO scheduler for cases where you have real, non-virtualised disks under MongoDB. (For example, a “bare metal” server.) In some cases I’ve seen “noop” perform better with certain hardware RAID controllers, however. The difference between “deadline” and “cfq” can be massive for disk-bound deployments.

If you are running MongoDB inside a VM (which has its own IO scheduler beneath it), it is best to use “noop” and let the virtualization layer take care of the IO scheduling itself.

Read-Ahead

Read-ahead is a per-block device performance tuning in Linux that causes data ahead of a requested block on disk to be read and then cached into the filesystem cache. Read-ahead assumes that there is a sequential read pattern and something will benefit from those extra blocks being cached. MongoDB tends to have very random disk patterns and often does not benefit from the default read-ahead setting, wasting memory that could be used for more hot data. Most Linux systems have a default setting of 128KB/256 sectors (128KB = 256 x 512-byte sectors). This means if MongoDB fetches a 64kb document from disk, 128kb of filesystem cache is used and maybe the extra 64kb is never accessed later, wasting memory.

For this setting, we suggest a starting-point of 32 sectors (=16KB) for most MongoDB workloads. From there you can test increasing/reducing this setting and then monitor a combination of query performance, cached memory usage and disk read activity to find a better balance. You should aim to use as little cached memory as possible without dropping the query performance or causing significant disk activity.

Both the IO scheduler and read-ahead can be changed by adding a file to the udev configuration at “/etc/udev/rules.d”. In this example I am assuming the block device serving mongo data is named “/dev/sda” and I am setting “deadline” as the IO scheduler and 16kb/32-sectors as read-ahead:

# set deadline scheduler and 16kb read-ahead for /dev/sda
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"

To check the IO scheduler was applied ([square-brackets] = enabled scheduler):

$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq

To check the current read-ahead setting:

$ sudo blockdev --getra /dev/sda
32

Note: this change should be applied and tested with a full system reboot!

Filesystem and Options

It is recommended that MongoDB uses only the ext4 or XFS filesystems for on-disk database data. ext3 should be avoided due to its poor pre-allocation performance. If you’re using WiredTiger (MongoDB 3.0+) as a storage engine, it is strongly advised that you ONLY use XFS due to serious stability issues on ext4.

Each time you read a file, the filesystem performs an access-time metadata update by default. However, MongoDB (and most applications) does not use this access-time information. Disabling access-time updates on MongoDB’s data volume therefore removes the small amount of disk IO activity they cause.

You can disable access-time updates by adding the flag “noatime” to the filesystem options field (the 4th field) in “/etc/fstab” for the disk serving MongoDB data:

/dev/mapper/data-mongodb /var/lib/mongo        ext4        defaults,noatime    0 0

Use “grep” to verify the volume is currently mounted with the new options:

$ grep "/var/lib/mongo" /proc/mounts
/dev/mapper/data-mongodb /var/lib/mongo ext4 rw,seclabel,noatime,data=ordered 0 0

Note: to apply a filesystem-options change, you must remount (umount + mount) the volume again after stopping MongoDB, or reboot.

Network Stack

Several defaults of the Linux kernel network tunings are either not optimal for MongoDB, limit a typical host with 1000mbps network interfaces (or better) or cause unpredictable behavior with routers and load balancers. We suggest some increases to the relatively low throughput settings (net.core.somaxconn and net.ipv4.tcp_max_syn_backlog) and a decrease in keepalive settings, seen below.

Make these changes permanent by adding the following to “/etc/sysctl.conf” (or a new file “/etc/sysctl.d/mongodb-sysctl.conf”, if “/etc/sysctl.d” exists):

net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096

To check the current values of any of these settings:

$ sysctl net.core.somaxconn
net.core.somaxconn = 4096

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply this change!

NTP Daemon

All of these deeper tunings make it easy to forget about something as simple as your clock source. As MongoDB is a cluster, it relies on a consistent time across nodes. Thus the NTP Daemon should run permanently on all MongoDB hosts, mongos and arbiters included. Be sure to check the time syncing won’t fight with any guest-based virtualization tools like “VMWare tools” and “VirtualBox Guest Additions”.

This is installed on RedHat/CentOS with:

$ sudo yum install ntp

And on Debian/Ubuntu:

$ sudo apt-get install ntp

Note: Start and enable the NTP Daemon (for starting on reboots) after installation. The commands to do this vary by OS and OS version, so please consult your documentation.
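As one example (service names and init systems vary, so adjust for your platform):

# RedHat/CentOS 7 (systemd) -- the service is called "ntpd"
$ sudo systemctl enable ntpd
$ sudo systemctl start ntpd

# Debian/Ubuntu -- the service is called "ntp"
$ sudo service ntp start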

Security-Enhanced Linux (SELinux)

Security-Enhanced Linux is a kernel-level security access control module that has an unfortunate tendency to be disabled or set to warn-only on Linux deployments. As SELinux is a strict access control system, sometimes it can cause unexpected errors (permission denied, etc.) with applications that were not configured properly for SELinux. Often people disable SELinux to resolve the issue and forget about it entirely. While implementing SELinux is not an end-all solution, it massively reduces the local attack surface of the server. We recommend deploying MongoDB with SELinux in “Enforcing” mode.

The modes of SELinux are:

  1. Enforcing – Block and log policy violations.
  2. Permissive – Log policy violations only.
  3. Disabled – Completely disabled.

As database servers are usually dedicated to one purpose, such as running MongoDB, the work of setting up SELinux is a lot simpler than on a multi-use server with many processes and users (such as an application/web server, etc.). The OS access pattern of a database server should be extremely predictable. Introducing “Enforcing” mode at the very beginning of your testing/installation, instead of after-the-fact, avoids a lot of gotchas with SELinux. Logging for SELinux is directed to “/var/log/audit/audit.log” and the configuration is at “/etc/selinux”.

Luckily, Percona Server for MongoDB RPM packages (CentOS/RedHat) are SELinux “Enforcing” mode compatible as they install/enable an SELinux policy at RPM install time! Debian/Ubuntu SELinux support is still in planning.

Here you can see the SELinux policy shipped in the Percona Server for MongoDB version 3.2 server package:

$ rpm -ql Percona-Server-MongoDB-32-server | grep selinux
/etc/selinux/targeted/modules/active/modules/mongod.pp

To change the SELinux mode to “Enforcing”:

$ sudo setenforce Enforcing

To check the running SELinux mode:

$ sudo getenforce
Enforcing

Linux Kernel and Glibc Version

The version of the Linux kernel and Glibc itself may be more important than you think. Some community benchmarks show a significant improvement on OLTP throughput benchmarks with the recent Linux 3.x kernels versus the 2.6 still widely deployed. To avoid serious bugs, MongoDB should at minimum use Linux 2.6.36 and Glibc 2.13 or newer.
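To check what you are currently running (the output below is just an example; yours will differ):

$ uname -r
3.10.0-327.el7.x86_64
$ ldd --version | head -n1
ldd (GNU libc) 2.17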

I hope to create a follow-up post on the specific differences seen under MongoDB workloads on Linux 3.2+ versus 2.6. Until then, I recommend you test the difference using your own workloads and any results/feedback are appreciated.

What’s Next?

What’s the next thing to tune? At this point, tuning becomes case-by-case and open-ended. I appreciate any comments on use-case/tunings pairings that worked for you. Also, look out for follow-ups to this article for a few tunings I excluded due to lack of testing.

Not knowing the next step might mean you’re done tuning, or that you need more visibility into your stack to find the next bottleneck. Good monitoring and data visibility are invaluable for this type of investigation. Look out for future posts regarding monitoring your MongoDB (or MySQL) deployment and consider using Percona Monitoring and Management as an all-in-one monitoring solution. You could also try using Percona-Lab/prometheus_mongodb_exporter, prometheus/node_exporter and Percona-Lab/grafana_mongodb_dashboards for monitoring MongoDB/Linux with Prometheus and Grafana.

The road to an efficient database stack requires patience, analysis and iteration. Tomorrow a new hardware architecture or change in kernel behavior could come, be the first to spot the next bottleneck! Happy hunting.

by Tim Vaillancourt at August 12, 2016 07:36 PM

August 11, 2016

Peter Zaitsev

Percona XtraDB Cluster 5.7.12 RC1 is now available

Percona announces the first release candidate (RC1) in the Percona XtraDB Cluster 5.7 series on August 9, 2016. Binaries are available from the downloads area or our software repositories.

Percona XtraDB Cluster 5.7.12-5rc1-26.16 is based on the following:

This release includes all changes from upstream releases and the following:

New Features

  • PXC Strict Mode: Use the pxc_strict_mode variable in the configuration file or the --pxc-strict-mode option during mysqld startup (a configuration sketch follows this list).
  • Galera instruments exposed in Performance Schema: This includes mutexes, condition variables, file instances, and threads.
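A minimal sketch of the first option, assuming the ENFORCING value (check the Percona XtraDB Cluster documentation for the full list of supported modes):

# /etc/my.cnf
[mysqld]
pxc_strict_mode=ENFORCING

Or, equivalently, on the command line:

$ mysqld --pxc-strict-mode=ENFORCING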

Bug Fixes

  • Fixed error messages.
  • Fixed the failure of SST via mysqldump with gtid_mode=ON.
  • Added check for TOI that ensures node readiness to process DDL+DML before starting the execution.
  • Removed protection against repeated calls of wsrep->pause() on the same node to allow parallel RSU operation.
  • Changed wsrep_row_upd_check_foreign_constraints to ensure that fk-reference-table is open before marking it open.
  • Fixed error when running SHOW STATUS during group state update.
  • Corrected the return code of sst_flush_tables() function to return a non-negative error code and thus pass assertion.
  • Fixed memory leak and stale pointer due to stats not freeing when toggling the wsrep_provider variable.
  • Fixed failure of ROLLBACK to register wsrep_handler.
  • Fixed failure of symmetric encryption during SST.

Other Changes

  • Added support for sending the keyring when performing encrypted SST.
  • Changed the code of THD_PROC_INFO to reflect what the thread is currently doing.
  • Using XtraBackup as the SST method now requires Percona XtraBackup 2.4.4 or later.
  • Improved rollback process to ensure that when a transaction is rolled back, any statements open by the transaction are also rolled back.
  • Removed the sst_special_dirs variable.
  • Disabled switching of slave_preserve_commit_order to ON when running PXC in cluster mode, as it conflicts with existing multi-master commit ordering resolution algorithm in Galera.
  • Based the default my.cnf on Percona Server 5.7 configuration with Galera/wsrep settings from PXC 5.6.
  • Other low-level fixes and improvements for better stability.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

by Alexey Zhebel at August 11, 2016 11:01 PM

Shlomi Noach

MySQL vs. PostgreSQL, gh-ost perspective

Last week we released gh-ost, GitHub's online schema migration tool for MySQL. As with other open source releases in the MySQL ecosystem, this release was echoed by several "Why not PostgreSQL?" comments. Having been active in open source for many years now, I'm familiar with these responses, and I find this is a good time to share my thoughts. Why? XKCD knows the answer:

XKCD: Duty Calls

I picked one post I wish to address (latest commit: 3dfbd2cd3f5468f035ec86442d2c670a510118d8). The author invested some time writing it. It nicely summarizes claims I've heard over the years, as well as some prejudice. Through responding to this post I will be generalizing thoughts and impressions to address the common reactions. Dear @brandur, let's grab a beer some day; I fundamentally disagree with your post and with its claims.

EDIT: linked post has been updated following this writing; I'd like to thank the author for his consideration. Also see his followup post. The version I've responded to in this post is this commit.

This is not an anti-PostgreSQL post

Disclosure: I appreciate PostgreSQL. I always wanted to be a proficient PostgreSQL user/DBA. I think this project is one of the finest examples of quality open source. It sets some high standards for open source in general, and for RDBMS in particular. I am not emotionally attached to MySQL to the extent that I would hate everything that is not called "MySQL". I never understood this approach. I am not interested in religious wars. I'm an engineer and this post follows engineering guidelines.

Background

gh-ost delivers powerful online schema migrations to MySQL, differentiating itself from existing tools by being triggerless, auditable, controllable, testable, and by imposing low workload on the migrated master. It addresses the same problem that existing tools have addressed since 2009.

Feature X

The most basic premise of this post is: MySQL does not have feature X, PostgreSQL does, therefore PostgreSQL. 

We'll discuss the truth of the above shortly, but let's first discuss the essence of this guideline.

It should be generally agreed that a statement of the form "A doesn't have feature X therefore B" is incomplete. We understand complex systems have varying feature sets.

MySQL has some features PostgreSQL doesn't. Take, as an example, the feature R: MySQL has had it for ages, and yet PostgreSQL was slow to adopt it, and relied on 3rd party solutions for many years. MySQL's implementation of R is far more elaborate than PostgreSQL's.

But if we follow the rule suggested above, we must now migrate from PostgreSQL to MySQL, because PostgreSQL does not have feature R (or one of its variants). Infinite loop!

In practice, we evaluate the pros and cons, the features the products A and B have or do not have. Which feature is more important to us? X or R? Is one of them fundamentally required for our operation? Can we work around it if we don't get it directly from the product? That, and experimentation, is the way an engineer should approach a choice of technology.

In the world of RDBMS we are interested, among others and in no particular order, in write latency and throughput, read scale-out, durability, loss of data in the event of failure, failover and promotion schemes, DR, consistency, SQL features, etc. By this list alone it is impossible to claim that "PostgreSQL is better than MySQL" or that "MySQL is better than PostgreSQL".

The particular claim and advice

The author suggests we should be using PostgreSQL because it inherently solves the problem for which we embarked on developing gh-ost. That is, that PostgreSQL supports true online schema changes. That statement is misleading and I resent the way that statement is delivered.

The post does not mention that PostgreSQL supports online schema changes only for a limited set of operations. I went ahead to double check with the PostgreSQL documentation. I love the documentation! It is detailed and honest. I'm thoroughly satisfied that PostgreSQL only supports a limited set of online operations. What I object to is that the post does not mention this, and leads us to "understand" that PostgreSQL has a magic wand.

Online table operations are supported in PostgreSQL for adding/removing indexes, for removing columns, and for adding columns under certain conditions. As an example, adding a nullable column is an online operation, whereas adding a column with default value is a locking operation.

A very big part of our schema migrations consists of adding/removing indexes and adding columns. Many of these operations fall under the lockless, online operations PostgreSQL would allow. However, a large part of our migrations also consists of adding columns with default values, changing data types (e.g., from INT to BIGINT), changing column characteristics (e.g., the length of a text column), space reclamation, and others. These changes are blocking in PostgreSQL.

The premise of the post now turns to: it's a pity you invested time and money in developing a tool that solves 100% of your problems when you could have switched to PostgreSQL which would solve 40% of your problems!

If I were to insist my fellow engineers at GitHub migrate to PostgreSQL in order to solve the migration problem, and then, once this technical transition is complete let them know 60% of the migrations are not at all addressed and that we are still stuck with the very same problem we started with, I would not be a very popular engineer.

Moreover

"the same advancements never happened in MySQL" is a false statement.

As mentioned in the gh-ost announcement, MySQL as of 5.6 does indeed support online, non blocking alter table. In fact, it supports many more variants of online alter table than PostgreSQL does (however, noticeable difference is that PostgreSQL makes those changes transactional whereas MySQL does not).

Also as mentioned, one of the shortcomings of the MySQL implementation is that it is aggressive, and may cause a high impact on the running master. In my understanding the PostgreSQL implementation is no different. There's nothing to cause throttling, to play nice with the running workload. Yes, in PostgreSQL you can Ctrl-C your ALTER, but who wants to Ctrl-C a 10 hour operation?

gh-ost addresses that as well. Its throttling gives super powers over the migration process, kicking in throttling based on master load, replication lag, various human controlled criteria, effectively making it very lightweight on the master.

Misdirection?

"there's a level of seemingly willful misdirection here that I just can't wrap my head around"

XKCD to the rescue again:

XKCD: Internet Argument

I dare say this is not the kind of thing a person would say in person, and the accusation is rather severe. It is also ironic. Dear author, consider:

  • PostgreSQL does not really solve 100% of the problem gh-ost does, and yet you claim we'd be better off with PostgreSQL.
  • MySQL does indeed provide more variants of online alter table than PostgreSQL does, and yet you claim it has no online alter capabilities.
  • I might claim there's a seemingly willful misdirection in your post. I might claim nowhere in your write up do you mention the deficiencies in PostgreSQL.

Instead, I'd rather like to think that you, and others, are misinformed, basing your opinion on rumors and prejudice instead of facts.

I also observe that people all around the world like to willfully differentiate themselves from others. Even in tech. This is the topic for another post, but consider explaining to a complete outsider, say your doctor, why people who work in tech, are engineers, work with data, work with databases, work with relational databases, work with open source relational databases, people who have so much shared experience, still insist on "us and them", and seek to see the negative in the other. Sheesh.

Paraphrasing a memorable sarcastic quote from the movie Erin Brockovich: the fact so many of the largest tech companies today choose to use MySQL as their backend database does not mean it's crap.

No. We really think MySQL does a good job. It is not perfect. We work around some of its limitations.

Claims

The claim "you'd be better off with PostgreSQL" (not a quote from aforementioned post) cannot be made without understanding the specific workload of a company/product. It would be presumptuous of me to approach a PostgreSQL based company and say "oh hey why use PostgreSQL? You'd be better off with MySQL!"

It makes perfect sense to say "PostgreSQL handles problem X better than MySQL" or even "if your one most important requirement is X, you should switch to PostgreSQL". Otherwise claiming one database is wholly better than the other cannot be taken seriously.

Deficiencies? Any project of scale has deficiencies. It is granted. We observe and measure, and take features and deficiencies into calculation, and that makes for good engineering.

  • If you're using PostgreSQL and it works well for you, you're doing the right thing.
  • It you're using MySQL and it works well for you, you're doing the right thing.
  • If you found that PostgreSQL works better for you where MySQL does not, and you decided to switch over, you're doing the right thing.
  • If you found that MySQL works better for you where PostgreSQL wasn't, and you decided to switch over, you're doing the right thing.
  • If you found that PostgreSQL works better for you where MySQL wasn't, but decided to stick with MySQL because migrating would be too costly, you're doing the right thing.
  • If you found that MySQL works better for you where PostgreSQL wasn't, but decided to stick with PostgreSQL because migrating would be too costly, you're doing the right thing.
  • If you pick one over the other because of licensing constraints, you're doing the right thing.
  • If you choose to switch over because of rumors, prejudice, FUD, politics, religion, you're doing it wrong.

Final personal note, on pride

"Yesterday, GitHub broadcasted an indomitable sense of self-satisfaction far and wide..."

Oh hey, XKCD again. But I would like to ask an honest question: if some pg-gh-ost were to be released, a tool that would solve 100% of your PostgreSQL migrations requirements, doing it better than PostgreSQL does, covering all cases, throttling as your daily sqoop imports kick in, as your rush hour traffic kicks in, giving you far and wide greater control over the migration process, what would you do?

Would you write an offensive post filled with accusations, ranting about the deficiencies of PostgreSQL and how people even consider using such a database that needs a third party tool to do a better job at migrations? Would you tweet something like "Or... Use MySQL!"

Or would you embrace a project that enriches the PostgreSQL ecosystem, makes it an even greater database to work with, understanding that PostgreSQL is not yet perfect and that more work needs to be done?

I take pride in my craft and love making an impact; if we ever do meet for beer I'm happy to share more  thoughts.

s/gh-ost/anything/g

Peace on earth

by shlomi at August 11, 2016 08:34 PM

Peter Zaitsev

Percona Memory Engine for MongoDB

This post discusses Percona Server for MongoDB’s new in-memory storage engine, Percona Memory Engine for MongoDB.

Percona Server for MongoDB introduced the Memory Engine starting with the 3.2.8-2.0 version. To use it, run Percona Server for MongoDB with the --storageEngine=inMemory option.

In-memory is a special configuration of WiredTiger that doesn’t store user data on disk. With this engine, data fully resides in the virtual memory of the system (and might get lost on server shutdown).

Despite the fact that the engine is purely in-memory, it writes a small amount of diagnostic data and statistics to disk. The latter can be controlled with the --inMemoryStatisticsLogDelaySecs option. The --dbpath option controls where to store the files. Generally, in-memory cannot run on the database directory previously used by any other engine (including WiredTiger).

The engine uses the desired amount of memory when configured with the --inMemorySizeGB option. This option takes fractional numbers to allow precise memory size specification. When you reach the specified memory limit, a WT_CACHE_FULL error is returned for all kinds of operations that cause user data size to grow. These include inserting new documents, creating indexes, updating documents by adding or extending fields, running aggregation workflows and others. However, you can still perform read queries on a full engine.
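Putting the options mentioned above together, a minimal invocation might look like this (the data path and the sizes are illustrative only):

$ mongod --storageEngine=inMemory \
         --dbpath=/var/lib/mongo-inmem \
         --inMemorySizeGB=8 \
         --inMemoryStatisticsLogDelaySecs=300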

Since Percona Memory Engine executes fewer operations and makes no disk I/O system calls, it performs better compared to conventional durable storage engines, including WiredTiger’s standard disk-based configuration.

Performance

The following graphs show Percona Memory Engine versus WiredTiger performance. Both engines use the default configuration with 140GB cache size specified. The hardware is 56-core Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz with 256GB of RAM and RAID1 2xHDD. Test data set is about cache size and fully fits in memory.

Memory Engine vs WiredTiger Insert

Memory Engine vs WiredTiger OLTP

You can clearly see that Percona Memory Engine has better throughput and less jitter on all kinds of workloads. Checkpointing can cause jitter in WiredTiger; such pauses are absent in Percona Memory Engine, as there’s no need to periodically sync in-memory data structures with their on-disk representations.

However, the performance of Percona Memory Engine drops when it’s about to become full (currently, when it’s 99% full). We’ve marked this issue as fixed (https://jira.mongodb.org/browse/SERVER-24580) but it still crops up in extreme cases.

Percona Memory Engine might use up to 1.5 times more memory above the set configuration when it’s close to full. WiredTiger almost never exceeds the specified cache memory limit. This might change in future versions. But current users should avoid possible swapping or OOM-killing of the server with Percona Memory Engine if (mis)configured to use all or close to all of available system RAM.

You can download the latest version of Percona Server for MongoDB, which includes the new Percona Memory Engine feature, here.

by Denis Protyvenskyi at August 11, 2016 05:19 PM

Percona Server for MongoDB 3.2.8-2.0 is now available

Percona announces the release of Percona Server for MongoDB 3.2.8-2.0 on August 11, 2016. Download the latest version from the Percona web site or the Percona Software Repositories.

Percona Server for MongoDB 3.2.8-2.0 is an enhanced, open-source, fully compatible, highly scalable, zero-maintenance downtime database supporting the MongoDB v3.2 protocol and drivers. It extends MongoDB with MongoRocks, Percona Memory Engine, and PerconaFT storage engine, as well as enterprise-grade features like external authentication and audit logging at no extra cost. Percona Server for MongoDB requires no changes to MongoDB applications or code.

Note:

We deprecated the PerconaFT storage engine. It will not be available in future releases.


This release is based on MongoDB 3.2.8, and includes the following additional changes:

  • Introducing the new Percona Memory Engine, which is based on a special configuration of WiredTiger that stores data in memory instead of the disk.
  • --auditDestination can now be set to file, syslog, or console.
  • --auditFormat can now be set to JSON or BSON.

    Note

    For more information, see Audit Logging.

  • The MongoRocks engine now supports LZ4 compression. This is an upstream feature of MongoRocks contributed by Percona. To enable it, use the --rocksdbCompression option when running PSMDB with the MongoRocks storage engine. For example:
    ./mongod --dbpath=./data --storageEngine=rocksdb --rocksdbCompression=lz4

    For a high-compression variant of LZ4:
    ./mongod --dbpath=./data --storageEngine=rocksdb --rocksdbCompression=lz4hc

    Note

    If you want to configure this permanently, set the following parameters in the /etc/mongod.conf file:

    storage:
      engine: rocksdb
      rocksdb:
        compression: lz4

The release notes are available in the official documentation.

 

by Alexey Zhebel at August 11, 2016 03:48 PM

Introducing Percona Memory Engine for MongoDB

I’m pleased to announce the latest Percona Server for MongoDB feature: Percona Memory Engine for MongoDB.

Everybody understands that memory is much faster than disk – even the fastest solid state storage can’t compete with it. As such the choice for the most demanding workloads, where performance and predictable latency are paramount, is in-memory computing.

MongoDB is no exception. MongoDB can benefit from a storage engine option that stores data in memory. In fact, MongoDB introduced it in the 3.2 release with their In-Memory Storage Engine. Unfortunately, their engine is only available in their closed source MongoDB Enterprise Edition. Users of their open source MongoDB Community Edition were out of luck. Until now.

At Percona we strive to provide the best open source MongoDB variant software with Percona Server for MongoDB. To meet this goal, we spent the last few months working on an open source implementation of an in-memory storage engine: introducing Percona Memory Engine for MongoDB!

Percona Memory Engine for MongoDB provides the same performance gains as the current implementation of MongoDB’s in-memory engine. Both are based on WiredTiger, but optimize it for cases where data fits in memory and does not need to be persistent.

To make migrating from MongoDB Enterprise Edition to Percona Server for MongoDB as simple as possible, we made our command line and configuration options as compatible as possible with the MongoDB In-Memory Storage Engine.

Look for more blog posts showing the performance advantages of Percona Memory Engine for MongoDB compared to conventional disk-based engines, as well as some use cases and best practices for using Percona Memory Engine in your MongoDB deployments. Below is a quick list of advantages that in-memory processing provides:

  • Reduced costs. Storing data in memory means you avoid additional costs for high-performance storage, which is a great advantage for cloud systems (where high-performance storage comes at a premium).
  • Very high performance reads. In-memory processing provides highly predictable latency as all reads come from memory instead of being pulled from a disk.
  • Very high performance writes. In-memory processing removes the need for persisted data on disk, which is very useful for cases where data durability is not critical.

From a developer standpoint, Percona Memory Engine addresses several practical use cases:

  • Application cache. Replace services such as memcached and custom application-level data structures with the full power of MongoDB features.
  • Sophisticated data manipulation. Augment performance for data manipulation operations such as aggregation and MapReduce.
  • Session management. Decrease application response times by keeping active user sessions in memory.
  • Transient Runtime State. Store application stateful runtime data that doesn’t require on-disk storage.
  • Real-time Analytics. Use in-memory computing in situations where response time is more critical than persistence.
  • Multi-tier object sharing. Facilitate data sharing in multi-tier/multi-language applications.
  • Application Testing. Reduce turnaround time for automated application tests.

I’m including a simple benchmark result for very intensive write workloads that compares Percona Memory Engine and WiredTiger. As you can see, you can get dramatically better performance with Percona Memory Engine!

Download Percona Memory Engine for MongoDB here.

Percona Memory Engine for MongoDB

by Peter Zaitsev at August 11, 2016 03:13 PM

Jean-Jerome Schmidt

Planets9s - Become a MySQL DBA, Polyglot Persistence Meetups, MySQL Query Tuning and more

Welcome to this week’s Planets9s, covering all the latest resources and technologies we create around automation and management of open source database infrastructures.

Become a MySQL DBA at Percona Live this October

Whether you’re looking for monitoring and trending advice for your MySQL installation, how to diagnose issues with your MySQL setup, how to do backups and database upgrades, or tips and tricks for some of the common maintenance operations related to MySQL … this full-day tutorial, which we already held at Percona Live in Santa Clara last April, is now also on the tutorials schedule at Percona Live in Amsterdam this October. Do sign up for the conference, join us for this tutorial and ‘Become a MySQL DBA’!

Find out more

Sign up for our webinar on MySQL Query Tuning - Process & Tools

Join us for the first part of our upcoming webinar trilogy on MySQL Query Tuning led by Krzysztof Książek, Senior Support Engineer at Severalnines. This session focuses on the query tuning process and related tools. Building, collecting, analysing, tuning and testing will be discussed in detail as well as the main tools involved, tcpdump and pt-query-digest.

Sign up for the webinar

Join us at the Spotify & Booking.com HQs for this month’s Polyglot Persistence Meetups

For this month’s Polyglot Persistence Meetups, we’re lucky enough to be hosted by Spotify in Stockholm on August 23rd and by Booking.com in Amsterdam on August 31st. There’ll be interesting talks to listen to and discuss … as well as drinks and nibbles of course for some good chats. There are still seats available, so do feel free to sign up and join us there.

Sign up for the meetups

Become a MongoDB DBA: Backing up your data

Backups in MongoDB aren’t that different from MySQL backups. You have to start a copy process, ship the files to a safe place and ensure the backup is consistent. Consistency is the biggest concern, as MongoDB doesn’t feature a transaction mode that allows you to create a consistent snapshot. There are, however, other ways to ensure we make a consistent backup. In this blog post, we describe what tools are available for making backups in MongoDB and what strategies to use.

Read the blog

That’s it for this week! Feel free to share these resources with your colleagues and follow us in our social media channels.

Have a good end of the week,

Jean-Jérôme Schmidt
Planets9s Editor
Severalnines AB

by Severalnines at August 11, 2016 11:51 AM

Peter Zaitsev

Small innodb_page_size as a performance boost for SSD

performance boost for SSD

In this blog post, we’ll discuss how a small innodb_page_size can create a performance boost for SSD.

In my previous post Testing Samsung storage in tpcc-mysql benchmark of Percona Server I compared different Samsung devices. Most solid state drives (SSDs) use 4KiB as an internal page size, and the InnoDB default page size is 16KiB. I wondered how using a different innodb_page_size might affect the overall performance.

Fortunately, MySQL 5.7 comes with the option innodb_page_size, so you can set an InnoDB page size other than the standard 16KiB. This option is still quite inconvenient to use, however. You can’t change innodb_page_size for an existing database. Instead, you need to create a brand new database with a different innodb_page_size and reload the whole dataset. This is a serious showstopper for production adoption. Specifying innodb_page_size for individual tables or indexes would be a welcome addition, and you could change it with a simple ALTER TABLE foo page_size=4k.
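
For reference, here is a minimal sketch of preparing such a test instance from scratch with MySQL 5.7 (the configuration path and datadir below are assumptions for a throwaway setup):

# innodb_page_size can only be chosen when the data directory is initialized
printf '[mysqld]\ndatadir = /data/mysql-4k\ninnodb_page_size = 4k\n' > /etc/my-4k.cnf
mysqld --defaults-file=/etc/my-4k.cnf --initialize-insecure --user=mysql

After initialization, the whole dataset has to be reloaded into the new instance.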

Anyway, this doesn’t stop us from using innodb_page_size=4k in the testing environment. Let’s see how it affects the results using the same conditions described in my previous post.

performance boost for SSD

Again we see that the PM1725 outperforms the SM863 when we have limited memory, and the results are almost equal when we have plenty of memory.

But what about innodb_page_size 4k vs. 16k?

Here is a direct comparison chart:

performance boost for SSD

Tabular results (in NOTPM, more is better):

Buffer Pool, GiB pm1725_16k pm1725_4k sam850_16k sam850_4k sam863_16k sam863_4k pm1725 4k/16k
5 42427.57 73287.07 1931.54 2682.29 14709.69 48841.04 1.73
15 78991.67 134466.86 2750.85 6587.72 31655.18 93880.36 1.70
25 108077.56 173988.05 5156.72 10817.23 56777.82 133215.30 1.61
35 122582.17 195116.80 8986.15 11922.59 93828.48 164281.55 1.59
45 127828.82 209513.65 12136.51 20316.91 123979.99 192215.27 1.64
55 130724.59 216793.99 19547.81 24476.74 127971.30 212647.97 1.66
65 131901.38 224729.32 27653.94 23989.01 131020.07 220569.86 1.70
75 133184.70 229089.61 38210.94 23457.18 131410.40 223103.07 1.72
85 133058.50 227588.18 39669.90 24400.27 131657.16 227295.54 1.71
95 133553.49 226241.41 39519.18 24327.22 132882.29 223963.99 1.69
105 134021.26 224831.81 39631.03 24273.07 132126.29 222796.25 1.68
115 134037.09 225632.80 39469.34 24073.36 132683.55 221446.90 1.68

 

It’s interesting to see that 4k pages help to improve the performance up to 70%, but only for the PM1725 and SM863. For the low-end Samsung 850 Pro, using a 4k innodb_page_size actually makes things worse when using a high amount of memory.

I think a 70% performance gain is too significant to ignore, even if manipulating innodb_page_size requires extra work. It is worthwhile to evaluate whether a different innodb_page_size setting helps a fast SSD under your workload.

And hopefully MySQL 8.0 makes it easier to use different page sizes!

by Vadim Tkachenko at August 11, 2016 12:13 AM

August 10, 2016

Jean-Jerome Schmidt

Join us in Stockholm & Amsterdam for this month’s Polyglot Persistence meetups

We’re continuing our Polyglot Persistence meetups series this month with our first date in Stockholm and our second date in Amsterdam!

Do join us if you’re near or in either city at the end of August to discuss all things open source database such as MySQL, MongoDB, PostgreSQL and related technologies.

We’re lucky enough to be hosted by Spotify in Stockholm and by Booking.com in Amsterdam. There’ll be interesting talks to listen to and discuss … as well as drinks and nibbles of course for some good chats.

Here are the dates and sign-up links for both meetups; we do hope to see you there!

Stockholm - 23rd of August

 

Amsterdam - 31st of August

If you have any suggestions for (future) talks and locations, please do contact us via one of the meetup pages.

See you there!

About the Polyglot Persistence Meetups

This meetup series is for all those database administrators, system admins, devops professionals, BI professionals and others who deal with multiple types of databases, in particular MySQL, MariaDB, MongoDB and PostgreSQL. The purpose is to provide talks and discussion forums to share best practices and ideas on how best to automate and manage mixed open source database environments and extract meaning from the data in the most comprehensive way.

by Severalnines at August 10, 2016 08:33 PM

MariaDB Foundation

MariaDB 5.5.51 and updated connectors now available

The MariaDB project is pleased to announce the immediate availability of MariaDB 5.5.51, MariaDB Connector/J 1.5.0 RC, and MariaDB Connector/C 2.3.1. See the release notes and changelogs for details on these releases. Download MariaDB 5.5.51 Release Notes Changelog What is MariaDB 5.5? MariaDB APT and YUM Repository Configuration Generator Download MariaDB Connector/J 1.5.0 RC Release […]

The post MariaDB 5.5.51 and updated connectors now available appeared first on MariaDB.org.

by Daniel Bartholomew at August 10, 2016 05:52 PM

August 09, 2016

Peter Zaitsev

tpcc-mysql benchmark tool: less random with multi-schema support


In this blog post, I’ll discuss changes I’ve made to the tpcc-mysql benchmark tool. These changes make it less random and add multi-schema support.

This post might only be interesting to performance researchers. The tpcc-mysql benchmark tool is what I use to test different hardware (as an example, see my previous post: https://www.percona.com/blog/2016/07/26/testing-samsung-storage-in-tpcc-mysql-benchmark-percona-server/).

The first change is support for multiple schemas, rather than just one schema. Supporting only one schema creates too much internal locking in MySQL on the same rows or the same index. Locking is fine if we want to compare different MySQL server versions, but it limits comparing different hardware or Linux kernels. In this case, we want to push MySQL as much as possible to load the underlying components. One solution is to partition several tables, but since MySQL still does not support foreign keys over partitioned tables, we would need to remove the foreign keys as well. A better solution is using multiple schemas (which is sort of like artificial partitioning). I’ve implemented this update in the latest code of tpcc-mysql: https://github.com/Percona-Lab/tpcc-mysql.

The second change I proposed is replacing fully random text fields with generated text, something similar to what is used in the TPC-H benchmark. The problem with fully random strings is that they take up a majority of the space in tpcc-mysql schemas, but they aren’t at all compressible. This makes it hard to use tpcc-mysql to compare compression methods in InnoDB (as well as different compression algorithms). This implementation is available in a different branch for now: https://github.com/Percona-Lab/tpcc-mysql/tree/less_random.

If you are using tpcc-mysql, please test these changes.

by Vadim Tkachenko at August 09, 2016 10:34 PM

Webinar Thursday 8/11 at 10 am: InnoDB Troubleshooting

Join Sveta Smirnova Thursday, August 11 at 10 am PDT (UTC-7) for a webinar on InnoDB Troubleshooting.

InnoDB is one of the most popular database engines. This general-purpose storage engine is widely used, has been MySQL’s default engine since version 5.6, and holds MySQL system tables since 5.7. It is hard to find a MySQL installation that doesn’t have at least one InnoDB table.

InnoDB is not a simple engine. It has its own locks, transactions, log files, monitoring, options and more. It is also under active development. Some of the latest features introduced in 5.6 are read-only transactions and multiple buffer pools (which now can persist on the disk between restarts). In 5.7, InnoDB added spatial indexes and general tablespaces (which can be created to hold table data per user choice). InnoDB development continues forward today.

Its features provide a great deal of power for users, but at the same time make troubleshooting a complex task.

This webinar will try to make InnoDB troubleshooting easier. You will learn specific tools in InnoDB, how and when to use them, how to get useful information from numerous InnoDB metrics and how to decode the engine status.

Register for this webinar here.

Sveta Smirnova, Principal Technical Services Engineer

Sveta joined Percona in 2015. Her main professional interests are problem-solving, working with tricky issues and bugs, finding patterns that can quickly solve typical issues, and teaching others how to deal with MySQL issues, bugs and gotchas effectively. Before joining Percona, Sveta worked as a Support Engineer in the MySQL Bugs Analysis Support Group at MySQL AB, Sun and Oracle. She is the author of the book “MySQL Troubleshooting” and of the JSON UDF functions for MySQL.

by Dave Avery at August 09, 2016 08:48 PM

August 08, 2016

Jean-Jerome Schmidt

Become a MySQL DBA at Percona Live Amsterdam this October

There’s a new tradition that’s establishing itself in the MySQL community, and that’s the Percona Live Conference in Europe, which has been taking place for the past few years in addition to the long-standing MySQL User Conference in Santa Clara.

This year the conference takes place in Amsterdam once again following last year’s success and we’re looking forward to joining everyone there in October.

To kick things off, we were delighted to receive confirmation that our full day tutorial, Become a MySQL DBA, has been selected for the Tutorials Schedule of the conference.

We conducted this tutorial in April this year at the Percona Live Santa Clara conference. It is based on the experience we gained while writing our ‘Become a MySQL DBA’ blog series, which is at its 20th installment and is a step-by-step “cookbook” on how to best administer MySQL.

The tutorial addresses the following topics and questions:

  • Monitoring and trending for your MySQL installation
    • What’s the most important to look after?
    • What tools to use?
    • How to ensure you are proactive in monitoring health of your MySQL?
  • How to diagnose issues with your MySQL setup?
    • Slow queries
    • Performance problems - what to look for?
    • Error logs
    • Hardware and OS issues
  • Backups
    • Binary and logical backup
    • What tools to use?
  • Most common maintenance operations
    • Schema changes
    • Batch operations
    • Replication topology changes
  • Database upgrades
    • How to prepare for an upgrade?
    • Performing minor and major version upgrades

We will provide a setup using virtual machines that you can freely test on.

We’ll communicate again about our participation at the conference over the coming weeks, and we’re already looking forward to seeing you all there. If you haven’t registered for the conference, you can follow this link to do so.

by Severalnines at August 08, 2016 05:59 PM

Peter Zaitsev

Docker Images for MySQL Group Replication 5.7.14

MySQL Group Replication

In this post, I will point you to Docker images for MySQL Group Replication testing.

There is a new release of the MySQL Group Replication plugin for MySQL 5.7.14. It’s a “beta” plugin, and it is probably the last (or at least one of the final) pre-release packages before Group Replication goes GA (during Oracle OpenWorld 2016, in our best guess).

Since it is close to GA, it would be great to get a better understanding of this new technology. Unfortunately, MySQL Group Replication installation process isn’t very user-friendly.

Or, to put it another way, totally un-user-friendly! It consists of a mere “50 easy steps” – by which I think they mean “easy” to mess up.

Matt Lord, in his post http://mysqlhighavailability.com/mysql-group-replication-a-quick-start-guide/, acknowledges: “getting a working MySQL service consisting of 3 Group Replication members is not an easy “point and click” or automated single command style operation.”

I’m not providing a review of MySQL Group Replication 5.7.14 yet – I need to play around with it a lot more. To make this process easier for myself, and hopefully more helpful to you, I’ve prepared Docker images for the testing of MySQL Group Replication.

Docker Images

To start the first node, run:

docker run -d --net=cluster1 --name=node1  perconalab/mysql-group-replication --group_replication_bootstrap_group=ON

To join all following nodes:

docker run -d --net=cluster1 --name=node2  perconalab/mysql-group-replication --group_replication_group_seeds='node1:6606'

Of course, you need to have Docker Network running:

docker network create cluster1
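
Once the nodes are up, group membership can be checked from inside any container. This is only a sketch; it assumes the image permits a passwordless local root login, which may not be the case:

docker exec -it node1 mysql -e "SELECT * FROM performance_schema.replication_group_members\G"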

I hope this will make the testing process easier!

by Vadim Tkachenko at August 08, 2016 04:06 PM

August 04, 2016

Peter Zaitsev

Percona XtraDB Cluster on Ceph

Ceph

This post discusses how XtraDB Cluster and Ceph are a good match, and how their combination allows for faster SST and a smaller disk footprint.

My last post was an introduction to Red Hat’s Ceph. As interesting and useful as it was, it wasn’t a practical example. Like most of the readers, I learn about and see the possibilities of technologies by burning my fingers on them. This post dives into a real and novel Ceph use case: handling of the Percona XtraDB Cluster SST operation using Ceph snapshots.

If you are familiar with Percona XtraDB Cluster, you know that a full state snapshot transfer (SST) is required to provision a new cluster node. Similarly, SST can also be triggered when a cluster node happens to have a corrupted dataset. Those SST operations consist essentially of a full copy of the dataset sent over the network. The most common SST methods are Xtrabackup and rsync. Both of these methods imply a significant impact and load on the donor while the SST operation is in progress.

For example, the whole dataset will need to be read from the storage and sent over the network, an operation that requires a lot of IO operations and CPU time. Furthermore, with the rsync SST method, the donor is under a read lock for the whole duration of the SST, so it cannot accept any write operations. Such constraints on SST operations are often the main motivation behind the reluctance to use Percona XtraDB Cluster with large datasets.

So, what could we do to speed up SST? In this post, I will describe a method of performing SST operations when the data is not local to the nodes. You could easily modify the solution I am proposing for any non-local data source technology that supports snapshots/clones, and has an accessible management API. Off the top of my head (other than Ceph) I see AWS EBS and many SAN-based storage solutions as good fits.

The challenges of clone-based SST

If we could use snapshots and clones, what would be the logical steps for an SST? Let’s have a look at the following list:

  1. New node starts (joiner) and unmounts its current MySQL datadir
  2. The joiner asks for an SST
  3. The donor creates a consistent snapshot of its MySQL datadir with the Galera position
  4. The donor sends to the joiner the name of the snapshot to use
  5. The joiner creates a clone of the snapshot name provided by the donor
  6. The joiner mounts the snapshot clone as the MySQL datadir and adjusts ownership
  7. The joiner initializes MySQL on the mounted clone

As we can see, all these steps are fairly simple, but they hide some challenges for an SST method based on cloning. The first challenge is the need to mount the snapshot clone. Mounting a block device requires root privileges – and SST scripts normally run under the MySQL user. The second challenge I encountered wasn’t expected. MySQL opens the datadir and some files in it before the SST happens. Consequently, those files are then kept open in the underlying mount point, a situation that is far from ideal. Fortunately, there are solutions to both of these challenges, as we will see below.

SST script

So, let’s start with the SST script. The script is available in my Github at:

https://github.com/y-trudeau/ceph-related-tools/raw/master/wsrep-sst/wsrep_sst_ceph

You should install the script in the /usr/bin directory, along with the other user scripts. Once installed, I recommend:

chown root.root /usr/bin/wsrep_sst_ceph
chmod 755 /usr/bin/wsrep_sst_ceph

The script has a few parameters that can be defined in the [sst] section of the my.cnf file.

cephlocalpool
The Ceph pool where this node should create the clone. It can be a different pool from the one used for the original dataset. For example, it could have a replication factor of 1 (no replication) for a read-scaling node. The default value is: mysqlpool
cephmountpoint
What mount point to use. It defaults to the MySQL datadir as provided to the SST script.
cephmountoptions
The options used to mount the filesystem. The default value is: rw,noatime
cephkeyring
The Ceph keyring file to authenticate against the Ceph cluster with cephx. The user under which MySQL is running must be able to read the file. The default value is: /etc/ceph/ceph.client.admin.keyring
cephcleanup
Whether or not the script should clean up the snapshots and clones that are no longer used. Enable = 1, Disable = 0. The default value is: 0
Root privileges

In order to allow the SST script to perform privileged operations, I added an extra SST role: “mount”. The SST script on the joiner will call itself back with sudo and will pass “mount” for the role parameter. To allow the elevation of privileges, the following line must be added to the /etc/sudoers file:

mysql ALL=NOPASSWD: /usr/bin/wsrep_sst_ceph

Files opened by MySQL before the SST

Upon startup, MySQL opens files at two places in the code before the SST completes. The first one is in the mysqld_main function, which sets the current working directory to the datadir (an empty directory at that point). After the SST, a block device is mounted on the datadir. The issue is that MySQL tries to find the files in the empty mount point directory. I wrote a simple patch, presented below, and issued a pull request:

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index 90760ba..bd9fa38 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -5362,6 +5362,13 @@ a file name for --log-bin-index option", opt_binlog_index_name);
       }
     }
   }
+
+  /*
+   * Forcing a new setwd in case the SST mounted the datadir
+   */
+  if (my_setwd(mysql_real_data_home,MYF(MY_WME)) && !opt_help)
+    unireg_abort(1);        /* purecov: inspected */
+
   if (opt_bin_log)
   {
     /*

With this patch, I added a new my_setwd call right after the SST completes. The Percona engineering team approved the patch, and it should be added to the upcoming release of Percona XtraDB Cluster.

The Galera library is the other source of opened files before the SST. Here, the fix is just in the configuration. You must define the base_dir Galera provider option outside of the datadir. For example, if you use /var/lib/mysql as the datadir and cephmountpoint, then you should use:

wsrep_provider_options="base_dir=/var/lib/galera"

Of course, if you have other provider options, don’t forget to add them there.
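
For example, combining base_dir with another provider option (gcache.size is used here purely as an illustration) would look like this:

wsrep_provider_options="base_dir=/var/lib/galera;gcache.size=1G"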

Walkthrough

So, what are the steps required to use Ceph with Percona XtraDB Cluster? (I assume that you have a working Ceph cluster.)

1. Join the Ceph cluster

The first thing you need is a working Ceph cluster with the needed CephX credentials. While the setup of a Ceph cluster is beyond the scope of this post, we will address it in a subsequent post. For now, we’ll focus on the client side.

You need to install the Ceph client packages on each node. On my test servers using Ubuntu 14.04, I did:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
sudo apt-add-repository 'deb http://download.ceph.com/debian-infernalis/ trusty main'
apt-get update
apt-get install ceph

These commands also installed all the dependencies. Next, I copied the Ceph cluster configuration file /etc/ceph/ceph.conf:

[global]
fsid = 87671417-61e4-442b-8511-12659278700f
mon_initial_members = odroid1, odroid2
mon_host = 10.2.2.100, 10.2.2.20, 10.2.2.21
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_journal = /var/lib/ceph/osd/journal
osd_journal_size = 128
osd_pool_default_size = 2

and the authentication file /etc/ceph/ceph.client.admin.keyring from another node. I made sure these files were readable by all. You can define more refined privileges for a production system with CephX, the security layer of Ceph.

Once everything is in place, you can test if it is working with this command:

root@PXC3:~# ceph -s
    cluster 87671417-61e4-442b-8511-12659278700f
     health HEALTH_OK
     monmap e2: 3 mons at {odroid1=10.2.2.20:6789/0,odroid2=10.2.2.21:6789/0,serveur-famille=10.2.2.100:6789/0}
            election epoch 474, quorum 0,1,2 odroid1,odroid2,serveur-famille
     mdsmap e204: 1/1/1 up {0=odroid3=up:active}
     osdmap e995: 4 osds: 4 up, 4 in
      pgmap v275501: 1352 pgs, 5 pools, 321 GB data, 165 kobjects
            643 GB used, 6318 GB / 7334 GB avail
                1352 active+clean
  client io 16491 B/s rd, 2425 B/s wr, 1 op/s

Which gives the current state of the Ceph cluster.

2. Create the Ceph pool

Before we can use Ceph, we need to create a first RBD image, put a filesystem on it and mount it for MySQL on the bootstrap node. We need at least one Ceph pool since the RBD images are stored in a Ceph pool.  We create a Ceph pool with the command:

ceph osd pool create mysqlpool 512 512 replicated

Here, we have defined the pool mysqlpool with 512 placement groups. On a larger Ceph cluster, you might need to use more placement groups (again, a topic beyond the scope of this post). The pool we just created is replicated. Each object in the pool will have two copies as defined by the osd_pool_default_size parameter in the ceph.conf file. If needed, you can modify the size of a pool and its replication factor at any moment after the pool is created.
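
For example, the replication factor of the pool we just created can be inspected and changed with the standard Ceph commands (shown here only as a sketch):

ceph osd pool get mysqlpool size
ceph osd pool set mysqlpool size 3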

3. Create the first RBD image

Now that we have a pool, we can create a first RBD image:

root@PXC1:~# rbd -p mysqlpool create PXC --size 10240 --image-format 2

and “map” the RBD image to a host block device:

root@PXC1:~# rbd -p mysqlpool map PXC
/dev/rbd1

The command returns the local RBD block device that corresponds to the RBD image.

The rest of the steps are not specific to RBD images. We need to create a filesystem and prepare the mount points:

mkfs.xfs /dev/rbd1
mount /dev/rbd1 /var/lib/mysql -o rw,noatime,nouuid
chown mysql.mysql /var/lib/mysql
mysql_install_db --datadir=/var/lib/mysql --user=mysql
mkdir /var/lib/galera
chown mysql.mysql /var/lib/galera

You need to mount the RBD device and run the mysql_install_db tool only on the bootstrap node. You need to create the directories /var/lib/mysql and /var/lib/galera on the other nodes and adjust the permissions similarly.

4. Modify the my.cnf files

You will need to set or adjust the specific wsrep_sst_ceph settings in the my.cnf file of all the servers. Here are the relevant lines from the my.cnf file of one of my cluster nodes:

[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options="base_dir=/var/lib/galera"
wsrep_cluster_address=gcomm://10.0.5.120,10.0.5.47,10.0.5.48
wsrep_node_address=10.0.5.48
wsrep_sst_method=ceph
wsrep_cluster_name=ceph_cluster
[sst]
cephlocalpool=mysqlpool
cephmountoptions=rw,noatime,nodiratime,nouuid
cephkeyring=/etc/ceph/ceph.client.admin.keyring
cephcleanup=1

At this point, we can bootstrap the cluster on the node where we mounted the initial RBD image:

/etc/init.d/mysql bootstrap-pxc

5. Start the other XtraDB Cluster nodes

The first node does not perform an SST, so nothing exciting so far. With the patched version of MySQL (the above patch), starting MySQL on a second node triggers a Ceph SST operation. In my test environment, the SST takes about five seconds to complete on low-powered VMs. Interestingly, the duration is not directly related to the dataset size. Because of this, a much larger dataset, on a quiet database, should take about the same time. A very busy database may need more time, since an SST requires a “flush tables with read lock” at some point.

So, after their respective Ceph SST, the other two nodes have:

root@PXC2:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC2:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC2-1463776424 -    /dev/rbd1
root@PXC3:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC3:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC3-1464118729 -    /dev/rbd1

The original RBD image now has two snapshots that are mapped to the clones mounted by the other two nodes:

root@PXC3:~# rbd -p mysqlpool ls
PXC
PXC2-1463776424
PXC3-1464118729
root@PXC3:~# rbd -p mysqlpool info PXC2-1463776424
rbd image 'PXC2-1463776424':
        size 10240 MB in 2560 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.108b4246146651
        format: 2
        features: layering
        flags:
        parent: mysqlpool/PXC@1463776423
        overlap: 10240 MB

Discussion

Apart from allowing faster SST, what other benefits do we get from using Ceph with Percona XtraDB Cluster?

The first benefit is that the inherent data duplication over the network removes the need for local data replication. Thus, instead of using RAID-10 or RAID-5 with an array of disks, we could use a simple RAID-0 stripe set if the data is already replicated to more than one server.

The second benefit is a bit less obvious: you don’t need as much storage. Why? A Ceph clone only stores the delta from its original snapshot. So, for large, read intensive datasets, the disk space savings can be very significant. Of course, over time, the clone will drift away from its parent snapshot and will use more and more space. When we determine that a Ceph clone uses too much disk space, we can simply refresh the clone by restarting MySQL and forcing a full SST. The SST script will automatically drop the old clone and snapshot when the cephcleanup option is set, and it will create a new fresh clone. You can easily evaluate how much space is consumed by the clone using the following commands:

root@PXC2:~# rbd -p mysqlpool du PXC2-1463776424
warning: fast-diff map is not enabled for PXC2-1463776424. operation may be slow.
NAME            PROVISIONED USED
PXC2-1463776424      10240M 164M

Also, nothing prevents you from using a different configuration of Ceph pools in the same XtraDB Cluster. Therefore, a Ceph clone can use a different pool than its parent snapshot. That’s the whole purpose of the cephlocalpool parameter. Strictly speaking, you only need one node to use a replicated pool, as the other nodes could run on clones whose data is stored in a non-replicated pool (saving a lot of storage space). Furthermore, we can define the OSD affinity of the non-replicated pool in a way that it stores data on the host where it is used, reducing the cross-node network latency.

Using Ceph for the XtraDB Cluster SST operation demonstrates just one of the many possibilities Ceph offers to MySQL. We continue to work with the Red Hat team and Red Hat Ceph Storage architects to find new and useful ways of addressing database issues in the Ceph environment. There are many more posts to come, so stay tuned!

DISCLAIMER: The wsrep_sst_ceph script isn’t officially supported by Percona.

by Yves Trudeau at August 04, 2016 10:31 PM

Jean-Jerome Schmidt

Planets9s - MySQL Query Tuning webinar, #ClusterControl CrowdChat & Database BackUps

Welcome to this week’s Planets9s, covering all the latest resources and technologies we create around automation and management of open source database infrastructures.

Sign up for our webinar on MySQL Query Tuning - Process & Tools

Join us for the first part of our upcoming webinar trilogy on MySQL Query Tuning led by Krzysztof Książek, Senior Support Engineer at Severalnines. This session focuses on the query tuning process and related tools. Building, collecting, analysing, tuning and testing will be discussed in detail as well as the main tools involved, tcpdump and pt-query-digest.

Sign up for the webinar

Check out our new #ClusterControl CrowdChat

This week we launched a new CrowdChat to discuss all things #ClusterControl, which is hosted by our team of subject matter experts. CrowdChat is a community platform that works across Facebook, Twitter, and LinkedIn to allow you to discuss a topic using a specific #hashtag. This crowdchat focuses on the hashtag #ClusterControl. So if you’re a DBA, architect, CTO, or a database novice, register to join and become part of the conversation!

Check out the #ClusterControl CrowdChat

ClusterControl Tips & Tricks: Customising Your Database BackUps

ClusterControl provides centralized backup management and it supports the standard mysqldump and Percona Xtrabackup backup methods. We believe the chosen command line arguments for the respective methods are optimal for most database workloads, and comply with the MySQL backup best practices. We are influenced by all the feedback we have received over the years, when working with DBAs and sysadmins. However, you might still want to customize your backup. This blog post shows you how to do this.

Read the blog

That’s it for this week! Feel free to share these resources with your colleagues and follow us in our social media channels.

Have a good end of the week,

Jean-Jérôme Schmidt
Planets9s Editor
Severalnines AB

by Severalnines at August 04, 2016 07:20 AM

August 03, 2016

Peter Zaitsev

Testing Docker multi-host network performance


In this post, I’ll review Docker multi-host network performance.

In a past post, I tested Docker network. The MySQL Server team provided their own results, which are in line with my observations.

For this set of tests, I wanted to focus more on Docker networking using multiple hosts. Mostly because when we set up a high availability (HA) environment (using Percona XtraDB Cluster, for example) the expectation is that instances are running on different hosts.

Another reason for this test is that Docker recently announced the 1.12 release, which supports Swarm Mode. Swarm Mode is quite interesting by itself — with this release, Docker targets going deeper on Orchestration deployments in order to compete with Kubernetes and Apache Mesos. I would say Swarm Mode is still rough around the edges (expected for a first release), but I am sure Docker will polish this feature in the next few releases.

Swarm Mode also expects that you run services on different physical hosts, and that services communicate over the Docker network. I wanted to see how much of a performance hit we get when we run over the Docker network on multiple hosts.

Network performance is especially important for clustering setups like Percona XtraDB Cluster and  MySQL Group Replication (which just put out another Lab release).

For my setup, I used two physical servers connected over a 10GB network. Both servers use 56 cores total of Intel CPUs.

Sysbench setup: the data fits into memory, and I will only use primary key lookups. Testing over the network gives the worst-case scenario for network round trips, but it also gives good visibility into performance impacts.
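
For reference, a point-select run of this kind can be expressed roughly as below. This is a sketch in current sysbench 1.x syntax with placeholder host, credentials, thread count and table sizes – not the exact command used for these results:

sysbench oltp_point_select \
    --mysql-host=10.10.1.2 --mysql-user=sbtest --mysql-password=sbtest \
    --tables=8 --table-size=1000000 --threads=64 --time=300 run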

The following are options for Docker network:

  • No Docker containers (marked as “direct” in the following results)
  • Docker container uses “host” network (marked as “host”)
  • Docker container uses “bridge” network, where service port exposed via port forwarding (marked as “bridge”)
  • Docker container uses “overlay” network, where both client and server are started in containers connected via the overlay network (marked as “overlay” in the results). For the “overlay” network it is possible to use third-party plugins with different implementations of the network; the best known are Calico and Weave, both of which appear in the results below.

For multi-host networking setup, only “overlay” (and plugins implementations) are feasible. I used “direct”, “host” and “bridge” only for the reference and as a comparison to measure the overhead of overlay implementations.
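
To give a feel for how these configurations are created, here is a rough sketch. With Docker 1.12, a multi-host “overlay” network requires either Swarm mode or an external key-value store configured on every daemon, and mysql:5.7 is used only as an illustrative image, not the actual server from the benchmark:

# overlay network spanning multiple hosts
docker network create -d overlay --subnet=10.0.9.0/24 bench
# "host" and "bridge" reference setups on a single host
docker run -d --net=host -e MYSQL_ALLOW_EMPTY_PASSWORD=1 --name=db-host mysql:5.7
docker run -d -p 3306:3306 -e MYSQL_ALLOW_EMPTY_PASSWORD=1 --name=db-bridge mysql:5.7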

The results I observed are:

Client Server Throughput, tps Ratio to “direct-direct”
Direct Direct 282780 1.0
Direct Host 280622 0.99
Direct Bridge 250104 0.88
Bridge Bridge 235052 0.83
overlay overlay 120503 0.43
Calico overlay Calico overlay 246202 0.87
Weave overlay Weave overlay 11554 0.044

 

Observations
  • “Bridge” network added overhead, about 12%, which is in line with my previous benchmark. I wonder, however, if this is Docker overhead or just the Linux implementation of bridge networks. Docker should be using the setup that I described in Running Percona XtraDB Cluster nodes with Linux Network namespaces on the same host, and I suspect that the Linux network namespaces and bridges add overhead. I need to do more testing to verify.
  • The native “overlay” Docker network struggled with performance problems. I observed issues with ksoftirqd using 100% of one CPU core, and I have seen similar reports. It seems that network interrupts in the Docker “overlay” network are not distributed properly across multiple CPUs. This is not the case with the “direct” and “bridge” configurations. I believe this is a problem with the Docker “overlay” network (hopefully, it will eventually be fixed).
  • Weave network showed absolutely terrible results. I see a lot of CPU allocated to “weave” containers, so I think there are serious scalability issues in their implementation.
  • The Calico plugin showed the best result for multi-host containers, even better than the “bridge-bridge” network setup.
Conclusion

If you need to use Docker “overlay” network — which is a requirement if you are looking to deploy a multi-host environment or use Docker Swarm mode — I recommend you consider using the Calico network plugin for Docker. Native Docker “overlay” network can be used for prototype or quick testing cases, but at this moment it shows performance problems on high-end hardware.

 

by Vadim Tkachenko at August 03, 2016 07:26 PM

Oli Sennhauser

FromDual Performance Monitor for MySQL and MariaDB 0.10.6 has been released

FromDual has the pleasure to announce the release of the new version 0.10.6 of its popular Database Performance Monitor for MySQL, MariaDB, Galera Cluster and Percona Server fpmmm.

You can download fpmmm from here.

In the inconceivable case that you find a bug in fpmmm please report it to our Bug-tracker.

Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

This release contains various bug fixes and improvements. The previous release had some major bugs, so we recommend upgrading.

Changes in fpmmm v0.10.6

fpmmm agent

  • Do not connect to server bug fixed.
  • Special case when lock file was removed when it was read is fixed.
  • Added ORDER BY to all GROUP BY to be compliant for the future.
  • Zabbix 3.0 templates added.
  • MaaS: Function curl_file_create implemented for php < 5.5
  • MaaS: Debug message fixed.
  • Maas: Curl upload fixed.
  • MaaS: InnoDB: Deadlock and Foreign Key errors are only escaped with xxx when used in MaaS. Otherwise they are sent normally. Foreign Key errors with MaaS is now also escaped with xxx.

Process module

  • Wrong substitution in process vm calculation fixed.

Galera module

  • Template: Galera items changed from normal to delta.

InnoDB module

  • Template: Fixed InnoDB template to work with Zabbix v3.0.
  • Template: InnoDB locking graph improved.

For subscriptions of commercial use of fpmmm please get in contact with us.

by Shinguz at August 03, 2016 05:40 PM

MariaDB AB

MariaDB in Tokyo

Colin Charles

In July, I visited Tokyo, which is a hotbed for technology (especially around databases), to attend the yearly event db tech showcase 2016, held at the Akihabara UDX. My talk focused on Lessons from Database Failures, my third time speaking on the topic and it was well received by the audience.

Towards the end of my session, there was some discussion about what I didn't get to talk about - MariaDB ColumnStore, which many InfiniDB users expressed interest in. I was able to talk about it briefly in the hallway and encouraged people to attend the Tokyo MariaDB User Group Meetup.

A few days later, we had the Tokyo MariaDB User Group Meetup, organised by Spiral Arms and hosted at All About Corp.

My talk at the Meetup focused on MariaDB 10.1: What's New and What's Coming in 10.2 and I was happy to see that many of the attendees were already using MariaDB Server 10.0 and 10.1. Encryption was a key topic for many who attended my session. There was also a separate talk on groonga/mroonga, where we realised that the versions shipping inside of MariaDB Server are quite dated. There really should be a goal to ensure that developers can commit directly (but first, also ensure that mroonga builds on all platforms that we support, e.g. OpenBSD and FreeBSD). Of course, it was also wonderful to hear about the SPIDER roadmap and what's planned, considering there are more users of this engine these days.

The last talk focused on MariaDB ColumnStore, which Kentoku (SPIDER’s developer) gave in Japanese. This presentation had lots of Q&A. I think the most important clarification from the Q&A is that the ColumnStore source is available on GitHub. There were also questions around why some code is stored on github.com/mariadb-corporation, while other code is on github.com/mariadb. The clearest explanation is about ownership – so we did talk about the trademark document. After that, it was a free-for-all for sushi, beer and snacks, so people could say cheers and continue talking.

All in all, both talks in Tokyo were successful, and it’s abundantly clear that there are lots of people there who are very into databases.

by Colin Charles at August 03, 2016 05:09 PM

Jean-Jerome Schmidt

Webinar: MySQL Query Tuning Trilogy (Part 1) - Process & Tools

Join us for the first part of our upcoming webinar trilogy on MySQL Query Tuning. Over the course of three in-depth webinar sessions led by Krzysztof Książek, Senior Support Engineer at Severalnines, we’ll cover SQL tuning, indexing, the MySQL optimizer and how to leverage EXPLAIN to gain insight into execution plans.

Tuning MySQL queries and indexes can significantly increase the performance of your application as well as decrease response times, when done right. This is why we’re covering this complex topic over the course of three webinars of 60 minutes each.

This first part of the trilogy focuses on the query tuning process and related tools. Building, collecting, analysing, tuning and testing will be discussed in detail as well as the main tools involved, tcpdump and pt-query-digest.

Date & Registration

Part 1: Query tuning process and tools

Tuesday, August 30th

Register

Feel free to also register for Parts 2 & 3.

Agenda

  • MySQL Query Tuning Trilogy: Process and tools
  • Query tuning process
    • Build
    • Collect
    • Analyse
    • Tune
    • Test
  • Tools
    • tcpdump
    • pt-query-digest

Speaker

Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience in managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard. He’s the main author of the Severalnines blog and webinar series: Become a MySQL DBA.

We look forward to “seeing” you there!

by Severalnines at August 03, 2016 01:22 PM

Oli Sennhauser

MySQL Environment MyEnv 1.3.1 has been released

FromDual has the pleasure to announce the release of the new version 1.3.1 of its popular MySQL, Galera Cluster, MariaDB and Percona Server multi-instance environment MyEnv.

The new MyEnv can be downloaded here.

In the inconceivable case that you find a bug in the MyEnv please report it to our bug tracker.

Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

Upgrade from 1.1.x or higher to 1.3.1

# cd ${HOME}/product
# tar xf /download/myenv-1.3.1.tar.gz
# rm -f myenv
# ln -s myenv-1.3.1 myenv

If you are using plug-ins for showMyEnvStatus create all the links in the new directory structure:

cd ${HOME}/product/myenv
ln -s ../../utl/oem_agent.php plg/showMyEnvStatus/

Changes in MyEnv 1.3.1

MyEnv

  • Bash function bootstrap added.
  • Galera options --bootstrap --new-cluster and start method bootstrap was implemented. Typo fixed.
  • New 5.7 variables added and 5.6 variables to avoid nasty warnings in the error log added to the my.cnf template. Further new file system structure was prepared.
  • MySQL 5.7 variables for error log behaviour added.
  • Comment for log_bin added to my.cnf template.
  • ulimit problem fixed rudely in MyEnv init script.
  • wsrep_provider for CentOS added in my.cnf template.
  • Cgroup template improved.
  • Cgroup how-to improved and configuration example added.

MyEnv Installer

  • default as instance name set to blacklist.
  • Typo fixed in help of installMyEnv.

MyEnv Utilities

  • Test table prepared for explicit_defaults_for_timestamp configuration.
  • insert_test.sh now has optional parameters for user, host etc.

For subscriptions of commercial use of MyEnv please get in contact with us.

by Shinguz at August 03, 2016 06:27 AM

August 02, 2016

Peter Zaitsev

Take Percona’s One-Click Database Security Downtime Poll

database security downtime poll

Take Percona’s database security downtime poll.

As Peter Zaitsev mentioned recently in his blog post on database support, data breach costs can hit both your business reputation and your bottom line. Costs vary depending on company size and market, but recent studies estimate direct costs ranging on average from $1.6M to $7.01M. Everyone agrees that leaving rising security risks and costs unchecked is a recipe for disaster.

Reducing security-based outages doesn’t have a simple answer, but can be a combination of internal and external monitoring, support contracts, enhanced security systems, and a better understanding of security configuration settings.

Please take a few seconds and answer the following poll. It will help the community get an idea of how security breaches can impact their critical database environments.

If you’ve faced specific issues, feel free to comment below. We’ll post a follow-up blog with the results!

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

You can see the results of our last blog poll on high availability here.

by Dave Avery at August 02, 2016 10:12 PM

High Availability Poll Results

high availability poll

This blog reports the results of Percona’s high availability poll.

High availability (HA) is always a hot topic. The reality is that if your data is not available, your customers cannot do business with you. In fact, estimates show the average cost of downtime is about $5K per minute. With an average outage taking 40 minutes to correct, you could be looking at a potential cost of $200K if your MySQL instance goes down. Whether your database is on premise, or in public or private clouds, it is critical that your database deployment does not have a potentially devastating single point of failure.

The results from Percona’s high availability poll responses are in:

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

With over 700 unique participants and 844 different selections, MySQL replication was the clear frontrunner when it comes to high availability solutions.

Percona has HA solutions available, come find out more at our website.

If you’re using other solutions or have specific issues, feel free to comment below.

Check out the latest Percona one-click poll on database security here.

by Dave Avery at August 02, 2016 10:11 PM

August 01, 2016

Peter Zaitsev

Introduction into storage engine troubleshooting: Q & A

storage engine troubleshooting

In this blog, I will provide answers to the Q & A for the “Introduction into storage engine troubleshooting” webinar.

First, I want to thank everybody for attending the July 14 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: At which isolation level do pt-online-schema-change and pt-archiver copy data from a table?

A: Neither tool changes the server’s default transaction isolation level: they use either the default REPEATABLE READ or whatever you have set in my.cnf.

Q: Can I create an index to optimize a query which has group by A and order by B, both from different tables and A column is from the first table in the two table join?

A: Do you mean a query like SELECT ... FROM a, b GROUP BY a.A ORDER BY b.B? Yes, this is possible:

mysql> explain select A, B, count(*) from a join b on(a.A=b.id) WHERE b.B < 4 GROUP BY a.A, b.B ORDER BY b.B ASC;
+----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref       | rows | Extra                                                     |
+----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+
|  1 | SIMPLE      | b     | range | PRIMARY,B     | B    | 5       | NULL      |   15 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | a     | ref   | A             | A    | 5       | test.b.id |    1 | Using index                                               |
+----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+
2 rows in set (0.00 sec)

Q: Where can I find recommendations on what kind of engine to use for different application types or use cases?

A: Storage engines are always being actively developed, therefore I suggest that you don’t search for generic recommendations. These can be outdated just a few weeks after they are written. Study engines instead. For example, just a few years ago MyISAM was the only engine (among those officially supported) that could work with FULLTEXT indexes and SPATIAL columns. Now InnoDB supports both: FULLTEXT indexes since version 5.6 and GIS features in 5.7. Today I can recommend InnoDB as a general-purpose engine for all installations, and TokuDB for write-heavy workloads when you cannot use high-speed disks.

Alternative storage engines can help to realize specific business needs. For example, CONNECT brings data to your server from many sources, SphinxSE talks to the Sphinx daemon, etc.

Other alternative storage engines increase the speed of certain workloads. Memory, for example, can be a good fit for temporary tables.

Q: Can you please explain how we find the full text of the query when we query the view ‘statements_with_full_table_Scans’?

A: Do you mean a view in the sys schema? Sys schema views take information from the summary_* and digest tables in Performance Schema, and therefore do not contain full queries (only digests). The full text of a query can be found in the events_statements_* tables in Performance Schema. Note that even the events_statements_history_long table can be rewritten very quickly, and you may want to save data from it periodically.
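
For example, recent statement text can be pulled straight from Performance Schema. A minimal sketch (the LIMIT is arbitrary):

# the events_statements_history_long consumer must be enabled in
# performance_schema.setup_consumers for this table to be populated
mysql -e "SELECT SQL_TEXT FROM performance_schema.events_statements_history_long LIMIT 10;"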

Q: Hi, is TokuDB supported for the new document protocol?

A: As Alex Rubin showed in his detailed blog post, the new document protocol just converts NoSQL queries into SQL, and is thus not limited to any storage engine. To use documents and collections, a storage engine must support generated columns (which TokuDB currently does not). So support of X Protocol for TokuDB is limited to relational table access.

Q: Please comment on “read committed” versus “repeatable read.”
Q: Repeatable read holds the cursor on the result set for the client versus read committed where the cursor is updated after a transaction.

A: READ COMMITTED and REPEATABLE READ are transaction isolation levels, whose details are explained here. I would not correlate the locks set on table rows in different transaction isolation modes with the result set. A transaction with isolation level REPEATABLE READ instead creates a snapshot of the rows that are accessed by the transaction. Let’s consider a table:

mysql> create table ti(id int not null primary key, f1 int) engine=innodb;
Query OK, 0 rows affected (0.56 sec)
mysql> insert into ti values(1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9);
Query OK, 9 rows affected (0.03 sec)
Records: 9  Duplicates: 0  Warnings: 0

Then start the transaction and select a few rows from this table:

mysql1> begin;
Query OK, 0 rows affected (0.00 sec)
mysql1> select * from ti where id < 5;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
+----+------+
4 rows in set (0.04 sec)

Now let’s update another set of rows in another transaction:

mysql2> update ti set f1 = id*2 where id > 5;
Query OK, 4 rows affected (0.06 sec)
Rows matched: 4  Changed: 4  Warnings: 0
mysql2> select * from ti;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
|  6 |   12 |
|  7 |   14 |
|  8 |   16 |
|  9 |   18 |
+----+------+
9 rows in set (0.00 sec)

You see that the first four rows – which we accessed in the first transaction – were not modified, and the last four were modified. If InnoDB only saved the cursor (as someone answered above), we would expect to see the same result if we ran a SELECT * ... query in our old transaction, but it actually shows the whole table content as it was before the modification:

mysql1> select * from ti;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
|  6 |    6 |
|  7 |    7 |
|  8 |    8 |
|  9 |    9 |
+----+------+
9 rows in set (0.00 sec)

So “snapshot” is a better word than “cursor” for the result set. In the case of READ COMMITTED, the first transaction would see the modified rows:

mysql1> drop table ti;
Query OK, 0 rows affected (0.11 sec)
mysql1> create table ti(id int not null primary key, f1 int) engine=innodb;
Query OK, 0 rows affected (0.38 sec)
mysql1> insert into ti values(1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9);
Query OK, 9 rows affected (0.04 sec)
Records: 9  Duplicates: 0  Warnings: 0
mysql1> set transaction isolation level read committed;
Query OK, 0 rows affected (0.00 sec)
mysql1> begin;
Query OK, 0 rows affected (0.00 sec)
mysql1> select * from ti where id < 5;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
+----+------+
4 rows in set (0.00 sec)

Let’s update all rows in the table this time:

mysql2> update ti set f1 = id*2;
Query OK, 9 rows affected (0.04 sec)
Rows matched: 9  Changed: 9  Warnings: 0

Now the first transaction sees not only the modified rows with id >= 5 (which were not in the initial result set), but also the modified rows with id < 5 (which existed in the initial result set):

mysql1> select * from ti;
+----+------+
| id | f1   |
+----+------+
|  1 |    2 |
|  2 |    4 |
|  3 |    6 |
|  4 |    8 |
|  5 |   10 |
|  6 |   12 |
|  7 |   14 |
|  8 |   16 |
|  9 |   18 |
+----+------+
9 rows in set (0.00 sec)

by Sveta Smirnova at August 01, 2016 10:43 PM

Shlomi Noach

Introducing gh-ost: triggerless online schema migrations

I'm thoroughly happy to introduce gh-ost: triggerless, controllable, auditable, testable, trusted online schema change tool released today by GitHub.

gh-ost now powers our production schema migrations. We hit some serious limitations using pt-online-schema-change on our large volume, high traffic tables, to the effect of driving our database to a near grinding halt or even to the extent of causing outages. With gh-ost, we are now able to migrate our busiest tables at any time, peak hours and heavy workloads included, without causing impact to our service.

gh-ost supports testing in production. It goes a long way to build trust, both in integrity and in control. Are your databases just too busy for you to run existing online-schema-change tools? Have you suffered outages due to migrations? Are you tired of babysitting migrations that run up to 3:00am? Tired of being the only one tailing logs? Please, take a look at gh-ost. I believe it changes the online migration paradigm.
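
To give an idea of the workflow, an invocation can look roughly like the sketch below. The flag names follow the gh-ost documentation, but the host, credentials and schema are placeholders, and the full set of required flags depends on your replication topology:

gh-ost \
  --host=replica.example.com --user=ghost --password=secret \
  --database=mydb --table=big_table \
  --alter="ADD COLUMN is_archived TINYINT NOT NULL DEFAULT 0" \
  --execute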

For a more thorough overview, please read the announcement on the GitHub Engineering Blog, and proceed to the documentation.

gh-ost is open sourced under the MIT license.

by shlomi at August 01, 2016 05:19 PM

Jean-Jerome Schmidt

ClusterControl Tips & Tricks: Customizing your Database Backups

ClusterControl provides centralized backup management, and it supports the standard mysqldump and Percona Xtrabackup backup methods. We believe the chosen command line arguments for the respective methods are optimal for most database workloads, and comply with MySQL backup best practices. We have been influenced by all the feedback we have received over the years while working with DBAs and sysadmins. However, you might still want to customize your backup. In this post, we will show you how to do this.

By default, ClusterControl will append the following MySQL backup-related lines inside the my.cnf if they aren’t already in there:

[mysqldump]
max_allowed_packet = 512M
# default_character_set = utf8
user=backupuser
password=[random password]

[xtrabackup]
user=backupuser
password=[random password]

The above might not be enough in some circumstances. Let’s see how we customize this for our environment, using the respective backup method.

mysqldump

By default, ClusterControl creates 4 mysqldump files with the following suffix:

  • _triggerseventroutines - Triggers, event and routines
  • _data - Database data
  • _schema - Database schema
  • _mysql - MySQL system database

Let’s say we have 5 databases and we chose to back up all of them. Here is what ClusterControl will execute when performing the backup using the mysqldump method (commands are wrapped with backslashes for readability):

  • Triggers, event and routines dump file:

    $ /usr/bin/mysqldump \
    --defaults-file=/etc/mysql/my.cnf \
    --flush-privileges \
    --hex-blob \
    --opt \
    --no-create-info \
    --no-data \
    --set-gtid-purged=OFF \
    --triggers \
    -R --events \
    --single-transaction \
    --skip-comments \
    --skip-lock-tables  \
    --skip-add-locks \
    --databases accounting mydb s9s_cmon sakila sbtest
  • Data dump file:

    $ /usr/bin/mysqldump \
    --defaults-file=/etc/my.cnf \
    --flush-privileges \
    --hex-blob \
    --opt \
    --no-create-info \
    --master-data=2 \
    --set-gtid-purged=OFF \
    --skip-triggers \
    --single-transaction \
    --skip-comments \
    --skip-lock-tables \
    --skip-add-locks \
    --databases accounting mydb s9s_cmon sakila sbtest

    *master-data=2 is included only if the MySQL node is generating binary log.

  • Schema dump file:

    $ /usr/bin/mysqldump \
    --defaults-file=/etc/my.cnf \
    --flush-privileges \
    --hex-blob \
    --opt  \
    --no-data \
    --set-gtid-purged=OFF  \
    --add-drop-table \
    --skip-triggers \
    --single-transaction \
    --skip-comments  \
    --skip-lock-tables \
    --databases accounting mydb s9s_cmon sakila sbtest
  • Mysql database dump file:

    $ /usr/bin/mysqldump \
    --defaults-file=/etc/mysql/my.cnf \
    --flush-privileges \
    --hex-blob \
    --opt \
    --set-gtid-purged=OFF \
    --single-transaction \
    --skip-comments \
    --skip-lock-tables \
    --skip-add-locks \
    --databases mysql

From the above command lines, we can see that for each mysqldump command, ClusterControl passes the MySQL configuration file via the --defaults-file argument. This lets the mysqldump process read the content of the [mysqldump] directive. By default, ClusterControl configures the backup user credentials together with max_allowed_packet, similar to the following:

[mysqldump]
max_allowed_packet = 512M
# default_character_set = utf8
user=backupuser
password=[random password]

The advantage of this is that we can include extra options for mysqldump. Unfortunately, the --defaults-file argument can only be specified as the first argument. Also note that command line arguments take precedence over what has been configured inside my.cnf under the [mysqldump] directive. For example, if we add skip-comments=0 inside my.cnf while the mysqldump command line ends with --skip-comments (or --skip-comments=1), the former will be ignored and the latter will be used.

Nevertheless, we can still use the configuration file to customize our backups with other mysqldump options. For example, we can exclude tables that we don’t want to back up by using the ignore-table parameter (in “database.table” format). Add the following lines into the MySQL configuration file:

[mysqldump]
max_allowed_packet = 512M
# default_character_set = utf8
user=backupuser
password=[random password]
ignore-table=sbtest.sbtest9
ignore-table=sbtest.sbtest10
ignore-table=sbtest.sbtest1

Once configured, we can just trigger a new mysqldump job from ClusterControl and we will have those tables skipped by mysqldump. No MySQL restart is required.

Percona Xtrabackup

ClusterControl executes Xtrabackup with arguments that depend on the options you choose when creating the backup. Based on those options, the complete Xtrabackup command would be:

$ ulimit -n 256000 && LC_ALL=C /usr/bin/innobackupex --defaults-file=/etc/mysql/my.cnf  --galera-info --parallel 1 --stream=xbstream --no-timestamp .

The first command, “ulimit -n 256000”, ensures that Percona Xtrabackup can open a huge number of file descriptors (in case the databases contain many tables). Take note of --defaults-file=/etc/mysql/my.cnf: similar to mysqldump, innobackupex reads the content of the MySQL configuration from the following directives and variables:

[mysqld]
datadir=[physical path to MySQL data directory]
tmpdir=[path to temporary directory]

[xtrabackup]
user=backupuser
password=[random password]

If you would like to customize the backup options for Percona Xtrabackup, you can add them directly under the [xtrabackup] directive. For example, let’s say we want Xtrabackup to record the replication position when backing up a slave. We can add something like this:

[xtrabackup]
user=backupuser
password=[random password]
slave-info=1

Triggering the Xtrabackup job will then also produce a file called xtrabackup_slave_info. No MySQL restart is required.

We hope this helps you better manage your MySQL backups!

by Severalnines at August 01, 2016 11:25 AM

July 30, 2016

Valeriy Kravchuk

Fun with Bugs #44 - Community Bugs Fixed in MySQL 5.7.14

MySQL 5.7.14 was officially released yesterday. So, it's time to check what bugs reported by MySQL Community in public were fixed in this release. Some of these bugs are presented below.

As usual, let me start with InnoDB. The following bugs were fixed there:
  • Bug #80296 - "FTS query exceeds result cache limit". It was reported (for 5.6, but I do not see new 5.6 release notes yet) by Monty Solomon and verified by Umesh.
  • Bug #80304 - "generated columns don't work with foreign key actions". It was reported by Guilhem Bichot based on a test case by Peter Gulutzan presented here. As with most community bug reports during the last 2-3 years, it was verified by Umesh.
  • Bug #80298 - "Full-Text queries with additional secondary index gives NULL or Zero rows", was reported by Ray Lambe and verified by Umesh.
  • Bug #76728 - "reduce lock_sys->mutex contention for transaction that only contains SELECT". This old bug report by Zhai Weixiang (who had provided a patch) was verified by Sinisa Milivojevic.
  • Bug #80083 - "Setting innodb_monitor_enable to ALL does not enable all monitors". It was reported by Davi Arnaut and verified by Miguel Solorzano.
  • Bug #79772 - "Foreign key not allowed when a virtual index exists". It was reported and verified by Jesper Wisborg Krogh from Oracle.
There are many more bugs fixed in InnoDB, but all of them were reported in Oracle's internal bugs database by Oracle employees. I do not like this trend.

Now, let's check replication bugs that were fixed:
  • Bug #79324 - "Slave is ~10x slower to execute set of statements compared to master RBR", was reported by Serge Grachov and verified by Umesh.
  • Bug #62008 - "read-only option does not allow inserts/updates on temporary tables". This bug was reported a long time ago by Ivan Stanojevic and verified by me when I worked in Oracle. It's really good to see it fixed now!
Some bugs were fixed in Performance_Schema (who could imagine it has bugs...), but they were either reported internally or remain private, like Bug #81464. Just take into account that a SELECT from some P_S tables could crash the server before 5.7.14, based on the release notes...

This time I see several build-related bugs fixed, like these:
  • Bug #81274 - "Add support for Solaris Studio 12.5 aka 5.14". It was reported (and probably fixed) by Tor Didriksen.
  • Bug #81593 - "adapt to gcc 5.3 on solaris". It was also reported and fixed by Tor Didriksen. Personally, I am happy to see that Oracle still cares about Solaris and related software. Historical sentiments...
  • Bug #80996 - "correct make_pair for c++11 (contribution)". This fix was contributed by Daniel Black
  • Bug #80371 - "MySQL fails to build with new default mode in GCC6". It was reported by Terje Røsten.
Last but not least, I also have to mention this bug in audit (and, thus, query rewrite) plugins: Bug #81298 - "query rewrite plugin suffers scalability issues". It was reported by Vadim Tkachenko and verified by Sinisa Milivojevic. This fix is a great improvement.

To summarize, I see reasons to upgrade for those who rely a lot on FTS indexes in InnoDB, replication, audit plugins and Performance_schema. I had not even tried to build 5.7.14 from source yet, so I do not have any personal experience to share.

by Valeriy Kravchuk (noreply@blogger.com) at July 30, 2016 04:58 PM

July 29, 2016

Peter Zaitsev

MariaDB 10.2 CHECK and DEFAULT clauses

In this blog post, we’ll look at the MariaDB 10.2 CHECK and DEFAULT clauses.

MariaDB 10.2 includes some long-awaited features. In this blog, we are going to discuss the improvements to some table definitions: the DEFAULT clause and CHECK constraints. These clauses describe columns’ default values and rules for data validation.

Note that MariaDB 10.2 is still in alpha stage. This article describes the current state of these features, which could change before MariaDB 10.2 becomes GA.

The DEFAULT clause

The DEFAULT clause has always been supported in MariaDB/MySQL, but traditionally it only accepted literal values (like “hello world” or “2”). MariaDB 10.2 removes this limitation, so DEFAULT can now accept most SQL expressions. For example:

  • fiscal_year SMALLINT DEFAULT (YEAR(NOW()))
  • valid_until DATE DEFAULT (NOW() + INTERVAL 1 YEAR)
  • owner VARCHAR(100) DEFAULT (USER())

Additionally, MariaDB 10.2 allows you to set a DEFAULT value for the TEXT and BLOB columns. This was not possible in previous versions. While this might look like a small detail, it can be hard to add a column to an existing table that is used by production applications, if it cannot have a default value.

The DEFAULT clause has some very reasonable limitations. For example, it cannot contain a subquery or a stored function. An apparently strange limitation is that we can mention another column in DEFAULT only if that column is defined earlier in the CREATE TABLE statement.
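Putting these rules together, here is a minimal sketch of a table definition using the new syntax (the table and column names are invented for illustration, and details may still change while 10.2 is in alpha):

CREATE TABLE article (
  id INT NOT NULL PRIMARY KEY,
  created_at DATETIME DEFAULT (NOW()),
  valid_until DATE DEFAULT (NOW() + INTERVAL 1 YEAR),
  owner VARCHAR(100) DEFAULT (USER()),
  body TEXT DEFAULT '',                        -- TEXT/BLOB defaults are new in 10.2
  body_length INT DEFAULT (CHAR_LENGTH(body))  -- allowed because body is defined first
) ENGINE=InnoDB;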

Note that DEFAULT can make use of non-deterministic functions even if the binary log uses the STATEMENT format. In this case, default non-deterministic values will be logged in the ROW format.

CHECK constraints

CHECK constraints are SQL expressions that are checked when a row is inserted or updated. If this expression result is false (0, empty string, empty date) or NULL, the statement will fail with an error. The error message states which CHECK failed in a way that is quite easy to parse:

ERROR 4022 (23000): CONSTRAINT `consistent_dates` failed for `test`.`author`

Some examples of CHECK constraints:

  • CONSTRAINT non_empty_name CHECK (CHAR_LENGTH(name) > 0)
  • CONSTRAINT consistent_dates CHECK (birth_date IS NULL OR death_date IS NULL OR birth_date < death_date)
  • CONSTRAINT past_date CHECK (birth_date < NOW())

A possible trick is checking that a column is different from its default value. This forces users to assign values explicitly.
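As a quick sketch of how this looks in practice (the table is invented for illustration and simply reuses the constraints above), a violating row is rejected with the error format shown earlier:

CREATE TABLE author (
  id INT NOT NULL PRIMARY KEY,
  name VARCHAR(100),
  birth_date DATE,
  death_date DATE,
  CONSTRAINT non_empty_name CHECK (CHAR_LENGTH(name) > 0),
  CONSTRAINT consistent_dates CHECK (birth_date IS NULL OR death_date IS NULL OR birth_date < death_date)
) ENGINE=InnoDB;

-- birth_date is later than death_date, so consistent_dates fails:
INSERT INTO author VALUES (1, 'Jane Doe', '1870-05-01', '1820-03-15');
-- ERROR 4022 (23000): CONSTRAINT `consistent_dates` failed for `test`.`author`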

CHECK constraints cannot be added or altered. It is only possible to drop them. This is an important limitation for production servers.

Another limitation is that CHECK metadata is not accessible via INFORMATION_SCHEMA. The only way to find out if a table has CHECK clauses is to parse the output of SHOW CREATE TABLE.
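In other words, inspecting and dropping are the only management operations available today. A rough sketch, using the author table above (the ALTER TABLE ... DROP CONSTRAINT form is my assumption of the drop syntax; check the 10.2 documentation, as this may still change before GA):

SHOW CREATE TABLE author\G                            -- the only way to see the CHECK clauses
ALTER TABLE author DROP CONSTRAINT consistent_dates;  -- dropping is the only supported change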

The exact behavior of CHECK constraints in a replication environment depends on the master binary log format. If it is STATEMENT, the slaves will apply CHECK constraints to events received from the master. If it is ROW, only the master will need to apply constraints, because failed statements will not be replicated.

Thus, in all cases, we recommend having identical constraints on master and slaves, and only using deterministic constraints.

Performance

While I didn’t run a professional benchmark, I can say that both DEFAULT and CHECK clauses don’t have a noticeable impact on a simple test where we insert one million rows (on my local machine).

However, these clauses evaluate an SQL expression each time a row is inserted or updated. The overhead is at least equal to the cost of evaluating that expression. If high write performance is important, you will probably not want to use complex data validation.

To check how fast an expression is, we can use the BENCHMARK() function:

MariaDB [(none)]> SELECT BENCHMARK(10000000, (555 / 100 * 20));
+---------------------------------------+
| BENCHMARK(10000000, (555 / 100 * 20)) |
+---------------------------------------+
| 0                                     |
+---------------------------------------+
1 row in set (1.36 sec)
MariaDB [(none)]> SELECT BENCHMARK(100000000, MD5('hello world'));
+------------------------------------------+
| BENCHMARK(100000000, MD5('hello world')) |
+------------------------------------------+
| 0                                        |
+------------------------------------------+
1 row in set (14.84 sec)

In this example, we executed the specified expressions ten million and one hundred million times, respectively. BENCHMARK() always returns 0, but what we want to check is the execution time. We can see, for example, that evaluating MD5('hello world') takes less than 0.000002 seconds per execution. In some cases, we may want to retry the same expressions with different parameters (longer strings, higher numbers, etc.) to check if the execution time varies.

Unfortunately, we don’t have a status variable which tells us how many times MariaDB evaluated CHECK clauses. If our workload performs many writes, that variable could help us to find out if CHECK constraints are slowing down inserts. Maybe the MariaDB team can take this as a suggestion for the future.

by Federico Razzoli at July 29, 2016 07:35 PM

July 28, 2016

Peter Zaitsev

Percona Monitoring and Management 1.0.2 Beta

Percona announces the release of Percona Monitoring and Management 1.0.2 Beta on 28 July 2016.

Like prior versions, PMM is distributed through Docker Hub and is free to download. Full instructions for download and installation of the server and client are available in the documentation.

Notable changes to the tool include:

  • Upgraded to Grafana 3.1.0.
  • Upgraded to Prometheus 1.0.1.
  • Set default metrics retention to 30 days.
  • Eliminated port 9001. Now the container uses only one configurable port, 80 by default.
  • Eliminated the need to specify ADDRESS variable when creating Docker container.
  • Completely re-wrote pmm-admin with more functions.
  • Added ability to stop all services using the new pmm-admin.
  • Added support to name instances using the new pmm-admin.
  • Query Analytics Application updates:
    • Redesigned queries profile table
    • Redesigned metrics table
    • Redesigned instance settings page
    • Added sparkline charts
    • Added ability to show more than ten queries
  • Various updates for MongoDB dashboards.

The full release notes are available in the documentation. The documentation also includes details on installation and architecture.

A demonstration of the tool has been set up at pmmdemo.percona.com.

We have also implemented forums for the discussion of PMM.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Some screen shots of the updates:

Note the new sparkline that shows the current load in context (so you know if the number is higher than, lower than, or in line with normal), and the option to “Load next 10 queries” at the bottom of the listing.

Sparklines in QAN

Our admin tool was completely re-written with new functions:

pmm-admin --help output

 

pmm-admin list command output

 

pmm-admin check-network output, which provides information on the status of the client’s network connection to the server.

by Bob Davis at July 28, 2016 07:39 PM

Upcoming Webinar August 2 10:00 am PDT: MySQL and Ceph

Join Brent Compton, Kyle Bader and Yves Trudeau on August 2, 2016 at 10 am PDT (UTC-7) for a MySQL and Ceph webinar.

Many operators select OpenStack as their control plane of choice for providing both internal and external IT services. The OpenStack user survey repeatedly shows Ceph as the dominant backend for providing persistent storage volumes through OpenStack Cinder. When building applications and repatriating old workloads, developers are discovering the need to provide database services on OpenStack infrastructure. Given MySQL’s ubiquity, and its reliance on persistent storage, it is of utmost importance to understand how to achieve the performance demanded by today’s applications. Databases like MySQL can be incredibly IO intensive, and Ceph offers a great opportunity to go beyond the limitations presented by a single scale-up system. Since Ceph provides a mutable object store with atomic operations, could MySQL store InnoDB pages directly in Ceph?

This talk reviews the general architecture of Ceph and then discusses benchmark results from small to mid-size Ceph clusters. These benchmarks lead to prescriptive guidance around tuning Ceph storage nodes (OSDs), and cover the impact of the amount of physical memory and the presence of SSDs, high-speed networks or RAID controllers.

Speakers:
Brent Compton
Director Storage Solution Architectures, Red Hat
Brent Compton is Director Storage Solution Architectures at Red Hat. He leads the team responsible for building Ceph and Gluster storage reference architectures with Red Hat Storage partners. Before Red Hat, Brent was responsible for emerging non-volatile memory software technologies at Fusion-io. Previous enterprise software leadership roles include VP Product Management at Micromuse (now IBM Tivoli Netcool) and Product Marketing Director within HP’s OpenView software division. Brent also served as Director Middleware Development Platforms at the LDS Church and as CIO at Joint Commission International. Brent has a tight-knit family, and can be found on skis or a mountain bike whenever possible.
Kyle Bader
Sr Solution Architect, Red Hat
Kyle Bader, a Red Hat senior architect, provides expertise in the design and operation of petabyte-scale storage systems using Ceph. He joined Red Hat as part of the 2014 Inktank acquisition. As a senior systems engineer at DreamHost, he helped implement, operate, and design Ceph and OpenStack-based systems for DreamCompute and DreamObjects cloud products.
Yves Trudeau
Principal Architect
Yves is a Principal Consultant at Percona, specializing in MySQL High-Availability and scaling solutions. Before joining Percona in 2009, he worked as a senior consultant for MySQL AB and Sun Microsystems, assisting customers across North America with NDB Cluster and Heartbeat/DRBD technologies. Yves holds a Ph.D. in Experimental Physics from Université de Sherbrooke. He lives in Québec, Canada with his wife and three daughters.

by Dave Avery at July 28, 2016 06:27 PM

Jean-Jerome Schmidt

Planets9s - Sign up for our webinar trilogy on MySQL Query Tuning

Welcome to this week’s Planets9s, covering all the latest resources and technologies we create around automation and management of open source database infrastructures.

Sign up for our webinar trilogy on MySQL Query Tuning

This is a new webinar trilogy on MySQL Query Tuning, which follows the popular webinar on MySQL database performance tuning. In this trilogy, we will look at query tuning process and tools to help with that. We’ll cover topics such as SQL tuning, indexing, the optimizer and how to leverage EXPLAIN to gain insight into execution plans. This is a proper deep-dive into optimising MySQL queries, which we’re covering in three parts.

Sign up for the webinars

ClusterControl Developer Studio: MongoDB Replication Lag Advisor

This blog post explains, step by step, how we implemented our MongoDB replication lag advisor in our Developer Studio. We have included this advisor in ClusterControl 1.3.2, and enabled it by default on any MongoDB cluster or replica set. ClusterControl Developer Studio allows you to write your own scripts, advisors and alerts. With just a few lines of code, you can already automate your clusters. Happy clustering!

Read the blog

MySQL on Docker: Single Host Networking for MySQL Containers

Having covered the basics of running MySQL in a container and how to build a custom MySQL image in our previous MySQL on Docker posts, we are now going to cover the basics of how Docker handles single-host networking and how MySQL containers can leverage that. We’d love to hear your feedback, so feel free to comment on our blogs as well.

Read the blog

That’s it for this week! Feel free to share these resources with your colleagues and follow us in our social media channels.

Have a good end of the week,

Jean-Jérôme Schmidt
Planets9s Editor
Severalnines AB

by Severalnines at July 28, 2016 07:24 AM

July 27, 2016

Peter Zaitsev

Percona Live Europe Amsterdam 2016 Tutorial Schedule is Up!

This blog post lists the Percona Live Europe Amsterdam 2016 tutorial schedule.

We are excited to announce that the tutorial schedule for the Percona Live Europe Amsterdam Open Source Database Conference 2016 is up!

The Percona Live Europe Amsterdam Open Source Database Conference is the premier event for the diverse and active open source community, as well as businesses that develop and use open source software. The conferences have a technical focus with an emphasis on the core topics of MySQL, MongoDB, and other open source databases. Tackling subjects such as analytics, architecture and design, security, operations, scalability and performance, Percona Live provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs.

This conference is an opportunity to network with peers and technology professionals by bringing together accomplished DBAs, system architects and developers from around the world to share their knowledge and experience – all to help you learn how to tackle your open source database challenges in a whole new way. These tutorials are a must for any data performance professional!

The Percona Live Europe Open Source Database Conference is October 3-5 at the Mövenpick Hotel Amsterdam City Centre.

Click through to the tutorial link right now, look them over, and pick which sessions you want to attend. Discounted passes available below!

Tutorial List:
Early Bird Discounts

Just a reminder to everyone out there: our Early Bird discount rate for the Percona Live Europe Amsterdam Open Source Database Conference is only available ‘til August 8, 2016, 11:30 pm PST! This rate gets you all the excellent and amazing opportunities that Percona Live offers, at a very reasonable price!

Sponsor Percona Live

Become a conference sponsor! We have sponsorship opportunities available for this annual MySQL, MongoDB and open source database event. Sponsors become a part of a dynamic and growing ecosystem and interact with hundreds of DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solutions vendors, and entrepreneurs who attend the event.

by Kortney Runyan at July 27, 2016 10:09 PM

Monitoring MongoDB with Nagios

In this blog, we’ll discuss monitoring MongoDB with Nagios.

There is a significant amount of talk around graphing MongoDB metrics using things like Prometheus, Data Dog, New Relic, and Ops Manager from MongoDB Inc. However, I haven’t noticed a lot of talk around “What MongoDB alerts should I be setting up?”

While building out Percona’s remote DBA service for MongoDB, I looked at Prometheus’s AlertManager. After reviewing it, I’m not sure it’s quite ready to be used exclusively. We needed to decide quickly whether there were better Nagios checks on the market, or whether I needed to write my own.

In the end, we settled on a hybrid approach. There are some good frameworks, but we needed to create or tweak some of the checks for “SEV 1”- or “SEV 2”-type issues (which are most important to me). One of the most common problems for operations, Ops, DevOps and DBA teams, and most of engineering, is alert spam. As such, I wanted to be very careful to only alert on things pointing to immediate danger or a current outage. As a result, we have now added pmp-check-mongo.py to the GitHub repository for the Percona Monitoring Plugins. Since we use Grafana and Prometheus for metrics and graphing, there are no accompanying Cacti information templates. In the future, we’ll need to decide how this will change PMP over time. In the meantime, we wanted to make the tool available now and worry about some of those issues later on.

As part of this push, I want to give you some real world examples of how you might use this tool. There are many options available to you, and Nagios is still a bit green in regards to making those options as user-friendly as our tools are.

Usage: pmp-check-mongo.py [options]
Options:
  -h, --help                         show this help message and exit
  -H HOST, --host=HOST               The hostname you want to connect to
  -P PORT, --port=PORT               The port mongodb is running on
  -u USER, --user=USER               The username you want to login as
  -p PASSWD, --password=PASSWD       The password you want to use for that user
  -W WARNING, --warning=WARNING      The warning threshold you want to set
  -C CRITICAL, --critical=CRITICAL   The critical threshold you want to set
  -A ACTION, --action=ACTION         The action you want to take. Valid choices are
                                     (check_connections, check_election, check_lock_pct,
                                     check_repl_lag, check_flushing, check_total_indexes,
                                     check_balance, check_queues, check_cannary_test,
                                     check_have_primary, check_oplog, check_index_ratio,
                                     check_connect) Default: check_connect
  -s SSL, --ssl=SSL                  Connect using SSL
  -r REPLICASET, --replicaset=REPLICASET    Connect to replicaset
  -c COLLECTION, --collection=COLLECTION    Specify the collection in check_cannary_test
  -d DATABASE, --database=DATABASE          Specify the database in check_cannary_test
  -q QUERY, --query=QUERY                   Specify the query, only used in check_cannary_test
  --statusfile=STATUS_FILENAME      File to current store state data in for delta checks
  --backup-statusfile=STATUS_FILENAME_BACKUP    File to previous store state data in for delta checks
  --max-stale=MAX_STALE             Age of status file to make new checks (seconds)

There seems to be a huge amount going on here, but let’s break it down into a few categories:

  • Connection options
  • Actions
  • Action options
  • Status options

Hopefully, this takes some of the scariness out of the script above.

Connection options
  • Host / Port Number
    • Pretty simple, this is just the host you want to connect to and what TCP port it is listening on.
  • Username and Password
    • Like with Host/Port, this is some of your normal and typical Mongo connection field options. If you do not set both the username and password, the system will assume auth was disabled.
  • SSL
    • This is mostly around the old SSL support in Mongo clients (which was a boolean). This tool needs updating to support the more modern SSL connection options. Use this as a “deprecated” feature that might not work on newer versions.
  • ReplicaSet
    • Very particular option that is only used for a few checks and verifies that the connection uses a replicaset connection. Using this option lets the tool automatically find a primary node for you, and is helpful to some checks specifically around replication and high availability (HA):
      • check_election
      • check_repl_lag
      • check_cannary_test
      • check_have_primary
      • check_oplog
Actions and what they mean
  • check_connections
    • This relates to memory usage (each connection consumes memory), but beyond that you need to know if your typical connection count suddenly doubles. That indicates something unexpected happened in the application or database and caused everything to reconnect. It often takes up to 10 minutes for those old connections to go away.
  • check_election
    • This uses the status file options we will cover in a minute, but it checks to see if the primary from the last check differs from the current found primary. If so, it alerts. This check should only have a threshold of one before it alarms (as an alert means an HA event occurred).
  • check_lock_pct
    • MMAPv1 only: this engine takes a write lock on the whole database or collection, depending on the version. This is a crucial metric for determining if MMAP writes are blocking reads, meaning you need to scale the DB layer in some way.
  • check_repl_lag
    • Checks the replication stream to understand how far a given node lags behind the primary. To accomplish this, it writes a fake record into the test DB to force a write. Without this, a read-only system would artificially look lagged because no new oplog entries get created.
  • check_flushing
    • A common issue with MongoDB is very long flush times, causing a system halt. This is caused by your disk subsystem not keeping up, and then the DB having to wait on flushing to make sure writes get correctly journaled.
  • check_total_indexes
    • The more indexes you have, the more the planner has to work to determine which index is a good fit. This increases the risk that the recovery of a failure will take a long time. This is due to the way a restore builds indexes and how MongoDB can only make one index at a time.
  • check_balance
    • While MongoDB should keep things in balance across a cluster, many things can happen: jumbo chunks, a disabled balancer, the balancer constantly attempting to move the same chunk but failing, and even adding/removing shards. This alert is for these cases, as an imbalance means some records might get served faster than others. It is purely based on the chunk count that the MongoDB balancer itself uses, which is not necessarily the same as disk usage.
  • check_queues
    • No matter what engine you have selected, a backlog of sustained reads or writes indicates your DB layer is unable to keep up with demand. It is important in these cases to send an alert if the rate is maintained. You might notice this is also in our Prometheus exporter for graphics as both trending and alerting are necessary to watch in a MongoDB system.
  • check_cannary_test
    • This runs a typical query against the database and sets critical/warning levels based on the latency of the returned query. While not as accurate as full synthetic transactions, queries through the application path are a good way to measure response time expectations and SLAs.
  • check_have_primary
    • If we had an HA event but failed to get back up quickly, it’s important to know if a new primary is causing writes to error on the system. This check simply determines if the replica set has a primary, which means it can handle reads and writes.
  • check_oplog
    • This check is all about how much oplog history you have. It is much like measuring how much binlog history you have in MySQL. This is important because, when recovering from a backup and performing a point-in-time recovery, you can use the current oplog only if its oldest entry is older than the backup timestamp. As a result, the threshold is normally set to three times the backup interval, to guarantee that you have plenty of time to find the newest backup and still perform a point-in-time recovery from it.
  • check_index_ratio
    • This is an older metric that modern MongoDB versions will not find useful, but in the past, it was a good way to understand the percentage of queries not handled by an index.
  • check_connect
    • A very basic check to ensure it can connect (and optionally login) to MongoDB and verify the server is working.
Status File options

These options rarely need to be changed but are present in case you want to store the status on an SHM mount point to avoid actual disk writes.

  • statusfile
    • This is where a copy of the current rs.status, serverStatus and other command data is stored
  • backup-statusfile
    • Like statusfile, but the status file is moved here when a new check is done. These two files can then be compared to find the delta between two checkpoints.
  • max-stale
    • This is the maximum age for which an old status file is still valid. Deltas older than this aren’t allowed; the option exists to protect the system from drawing wrong conclusions from a status file that is hours or days old.

If you have any questions on how to use these parameters, feel free to let us know. In the code, there is also a defaults dictionary for most of these options, so in many cases setting the warning and critical levels is not needed.

by David Murphy at July 27, 2016 02:08 PM

Jean-Jerome Schmidt

New Webinar Trilogy: The MySQL Query Tuning Deep-Dive

Following our popular webinar on MySQL database performance tuning, we’re excited to introduce a new webinar trilogy dedicated to MySQL query tuning.

This is an in-depth look into the ins and outs of optimising MySQL queries conducted by Krzysztof Książek, Senior Support Engineer at Severalnines.

When done right, tuning MySQL queries and indexes can significantly increase the performance of your application as well as decrease response times. This is why we’ll be covering this complex topic over the course of three webinars of 60 minutes each.

Dates

Part 1: Query tuning process and tools

Tuesday, August 30th
Register

Part 2: Indexing and EXPLAIN - deep dive

Tuesday, September 27th
Register

Part 3: Working with the optimizer and SQL tuning

Tuesday, October 25th
Register

Agenda

Part 1: Query tuning process and tools

  • Query tuning process
    • Build
    • Collect
    • Analyze
    • Tune
    • Test
  • Tools
    • tcpdump
    • pt-query-digest

Part 2: Indexing and EXPLAIN - deep dive

  • How B-Tree indexes are built?
  • Indexes - MyISAM vs. InnoDB
  • Different index types
    • B-Tree
    • Fulltext
    • Hash
  • Indexing gotchas
  • EXPLAIN walkthrough - query execution plan

Part 3: Working with optimizer and SQL tuning

  • Optimizer
    • How execution plans are calculated
    • InnoDB statistics
  • Hinting the optimizer
    • Index hints
    • JOIN order modifications
    • Tweakable optimizations
  • Optimizing SQL

Speaker

Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience in managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard. He’s the main author of the Severalnines blog and webinar series: Become a MySQL DBA.

by Severalnines at July 27, 2016 01:12 PM

July 26, 2016

Peter Zaitsev

Testing Samsung storage in tpcc-mysql benchmark of Percona Server

This blog post will detail the results of Samsung storage in the tpcc-mysql benchmark using Percona Server.

I had an opportunity to test different Samsung storage devices under tpcc-mysql benchmark powered by Percona Server 5.7. You can find a summary with details here https://github.com/Percona-Lab-results/201607-tpcc-samsung-storage/blob/master/summary-tpcc-samsung.md

I have in my possession:

  • Samsung 850 Pro, 2TB: This is a SATA device and is positioned as consumer-oriented, something that you would use in a high-end user desktop. As of this post, I estimate the price of this device as around $430/TB.
  • Samsung SM863, 1.92TB: This device is also SATA, and is positioned for server usage. The current price is about $600/TB.
  • Samsung PM1725, 800GB: This is an NVMe device, in a 2.5″ form factor, but it requires a connection to a PCIe slot, which I had to allocate in my server. The device is high-end, oriented for server-side and demanding workloads. The current price is about $1300/TB.

I am going to use 1000 warehouses in the tpcc-mysql benchmarks, which corresponds roughly to a data size of 100GB.

This benchmark varies the innodb_buffer_pool_size from 5GB to 115GB. With a 5GB buffer pool size, only a very small portion of the data fits into memory, which results in intensive foreground IO reads and intensive background IO writes. With 115GB, almost all data fits into memory, which results in very few (or almost zero) IO reads and moderate background IO writes.
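The post does not state how the buffer pool was resized between runs. Since MySQL and Percona Server 5.7 support online buffer pool resizing, one way to step through the sizes without restarting the server would be something like this (a sketch, not necessarily what was used for these results):

-- resize the buffer pool online between benchmark runs (5.7+)
SET GLOBAL innodb_buffer_pool_size = 5 * 1024 * 1024 * 1024;   -- 5GB for the first run
-- ... run tpcc-mysql, then grow the pool for the next run:
SET GLOBAL innodb_buffer_pool_size = 15 * 1024 * 1024 * 1024;  -- 15GB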

Buffer pool sizes in the middle of the interval produce intermediate amounts of IO reads and writes. For example, we can see the read-to-write ratio on the chart below (obtained for the PM1725 device) with different buffer pool sizes:

tpcc-mysql benchmarks

We can see that for the 5GB buffer pool size we have 56,000 read IOPS and 32,000 write IOPS. For 115GB, the reads are minimal at about 300 IOPS and the background writes are at the 20,000 IOPS level. Reads gradually decline with the increasing buffer pool size.

The charts are generated with the Percona Monitoring and Management tools.

Results

Let’s review the results. The first chart shows measurements taken every one second, allowing us to see the trends and stalls.

tpcc-mysql benchmarks

If we take averages, the results are:

tpcc-mysql benchmarks

In table form (the results are in new order transactions per minute (NOTPM)):

bp, GB   pm1725      sam850     sam863      pm1725 / sam863   pm1725 / sam850
5         42427.57    1931.54    14709.69   2.88              21.97
15        78991.67    2750.85    31655.18   2.50              28.72
25       108077.56    5156.72    56777.82   1.90              20.96
35       122582.17    8986.15    93828.48   1.31              13.64
45       127828.82   12136.51   123979.99   1.03              10.53
55       130724.59   19547.81   127971.30   1.02               6.69
65       131901.38   27653.94   131020.07   1.01               4.77
75       133184.70   38210.94   131410.40   1.01               3.49
85       133058.50   39669.90   131657.16   1.01               3.35
95       133553.49   39519.18   132882.29   1.01               3.38
105      134021.26   39631.03   132126.29   1.01               3.38
115      134037.09   39469.34   132683.55   1.01               3.40

Conclusion

The Samsung 850 obviously can’t keep up with the more advanced SM863 and PM1725. The PM1725 shows a greater benefit with smaller buffer pool sizes. With large amounts of memory, there is practically no difference from the SM863. The reason is that with big buffer pool sizes, MySQL does not push the IO subsystem hard enough to use all of the PM1725’s performance.

For reference, the my.cnf file is:

[mysqld]
datadir=/var/lib/mysql
socket=/tmp/mysql.sock
ssl=0
symbolic-links=0
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
# general
thread_cache_size=2000
table_open_cache = 200000
table_open_cache_instances=64
back_log=1500
query_cache_type=0
max_connections=4000
# files
innodb_file_per_table
innodb_log_file_size=15G
innodb_log_files_in_group=2
innodb_open_files=4000
innodb_io_capacity=10000
loose-innodb_io_capacity_max=12000
innodb_lru_scan_depth=1024
innodb_page_cleaners=32
# buffers
innodb_buffer_pool_size= 200G
innodb_buffer_pool_instances=8
innodb_log_buffer_size=64M
# tune
innodb_doublewrite= 1
innodb_support_xa=0
innodb_thread_concurrency=0
innodb_flush_log_at_trx_commit= 1
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_max_dirty_pages_pct=90
join_buffer_size=32K
sort_buffer_size=32K
innodb_use_native_aio=0
innodb_stats_persistent = 1
# perf special
innodb_adaptive_flushing = 1
innodb_flush_neighbors = 0
innodb_read_io_threads = 16
innodb_write_io_threads = 8
innodb_purge_threads=4
innodb_adaptive_hash_index=0
innodb_change_buffering=none
loose-innodb-log_checksum-algorithm=crc32
loose-innodb-checksum-algorithm=strict_crc32
loose-innodb_sched_priority_cleaner=39
loose-metadata_locks_hash_instances=256

by Vadim Tkachenko at July 26, 2016 06:00 PM

July 25, 2016

Peter Zaitsev

Percona XtraBackup 2.4.4 is now available

Percona announces the GA release of Percona XtraBackup 2.4.4 on July 25th, 2016. You can download it from our download site and from apt and yum repositories.

Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, Percona XtraBackup drives down backup costs while providing unique features for MySQL backups.

New Features:

  • Percona XtraBackup has been rebased on MySQL 5.7.13.

Bugs Fixed:

  • Percona XtraBackup reported the difference in the actual size of the system tablespace and the size which was stored in the tablespace header. This check is now skipped for tablespaces with autoextend support. Bug fixed #1550322.
  • Because Percona Server 5.5 and MySQL 5.6 store the LSN offset for large log files at different places inside the redo log header, Percona XtraBackup was trying to guess which offset is better to use by trying to read from each one and compare the log block numbers and assert lsn_chosen == 1 when both LSNs looked correct, but they were different. Fixed by improving the server detection. Bug fixed #1568009.
  • Percona XtraBackup didn’t correctly detect when tables were both compressed and encrypted. Bug fixed #1582130.
  • Percona XtraBackup would crash if the keyring file was empty. Bug fixed #1590351.
  • Backup couldn’t be prepared when the size in cache didn’t match the physical size. Bug fixed #1604299.
  • Free Software Foundation address in copyright notices was outdated. Bug fixed #1222777.
  • The backup process would fail if the datadir specified on the command line was not the same as the one reported by the server. Percona XtraBackup now allows the datadir from my.cnf to override the one from SHOW VARIABLES. xtrabackup prints a warning that they don’t match, but continues. Bug fixed #1526467.
  • With the upstream change of the maximum page size from 16K to 64K, the size of the incremental buffer became 1G, which increased the RAM requirement to 1G in order to prepare the backup, even though there is no need to allocate such a large buffer for smaller pages. Bug fixed #1582456.
  • Backup process would fail on MariaDB Galera cluster operating in GTID mode if binary logs were in non-standard directory. Bug fixed #1517629.

Other bugs fixed: #1583717, #1583954, and #1599397.

Release notes with all the bugfixes for Percona XtraBackup 2.4.4 are available in our online documentation. Please report any bugs to the launchpad bug tracker.

by Hrvoje Matijakovic at July 25, 2016 06:05 PM

MongoDB Consistent Backups

In this post, I’m going to discuss MongoDB consistent backups, and how to achieve them.

You might have read before that MongoDB backups are not consistent. But what if I told you there is a tool that could make them consistent? What if this tool would also make them cluster-wide consistent, automatically compress the backup, become the first step toward continual incremental recording, notify your monitoring system and upload the backup to cloud storage for you?

It’s all TRUE!

Recently Percona-Labs created a new repository aimed at exactly these issues. We hope it will eventually grow into something that becomes part of the officially supported tools (like Percona Toolkit and Percona’s Xtrabackup utility). Before we get into how it works, let’s talk about why we need it and its key highlights. Then (for all the engineering types reading this) we can discuss what it does and why.

Why do we need a consistent backup tool?

The first thing to note is that you absolutely can’t have a consistent backup on a working system unless your node is in a replica set. (You could even have a single-node replica set for this to be accurate.) Why? Consistency requires an operations log to say what changes occurred from the first point in the backup to the last point. This lets us ensure we are consistent to the end timestamp of the backup. Without the ability to take a “snapshot” of the data and then save it while other changes occur, we cannot verify consistency to the point when the backup started. MongoDB does not have ACID-like isolation in this way. However, it can be consistent to the backup end point by applying any deltas at the end of the backup restore process.

You might say, “but mongodump already provides --oplog for this feature.” You are right: it does, and it works great if you only have a single replica set to back up. When we bring sharding into the mix, however, things get vastly more complicated. It ignores that flag and hits your primaries:

[Diagram: per-shard dumps and oplog recording ending at different times, leaving the consistency point uncovered]

In the diagram above you can see that the backup and oplog recording for the first shard ended long before the second shard. As such, the consistency point needed is nowhere close to being covered by the red line. Even if all your shards are the same size, there would be some level of variance due to network, disk, CPU and memory speeds. The new tool helps you here by keeping track of the dumps, and also by having a thread record the oplog for all shards until the last shard finishes. This ensures that all shards can be synced to the point in time where the last shard finished. At that moment in time, we have a consistent backup across all the shards. As you can see below, the oplog watcher finished on both shards after the last shard finished. On recovery, they remain in sync.

[Diagram: oplog recording continuing on all shards until the last shard’s dump completes]

You might ask, “well, what about the meta-data stored in the config servers?” This is a great question, as the behavior of our tool differs depending on whether you’re using MongoDB 3.2’s new config servers as a replica set feature, or a legacy config server approach.

In the legacy mode, we fsyncAndLock the config servers just long enough to record a server config data dump. Then we stop the oplog tailer threads for all the shards. After that, and after the oplog tailers finish, we unlock the config server. This ensures we remove the race conditions that could occur if it took longer than expected to close an oplog cursor. However, if we run in 3.2 mode, the config servers act just like another shard: they get dumped at the same time, and the oplog is tailed until we complete the data shard dumps. The newest features available in MongoDB Community, MongoDB Enterprise, and Percona Server for MongoDB 3.2 make the process much simpler.

Key Takeaways from new tool

  1. Not yet an official Percona tool, but being used already by people as it’s just a wrapper to run multiple mongo dumps for you.
  2. If you execute the make setup, it outputs a single binary file that needs only python2.7 installed on your database system, even though under the hood it’s running many python modules in a virtualenv
  3. Dumps all shard in parallel and keeps tailing the oplog until all dumps are complete
  4. Handled backing up metadata for old and new config server topologies
  5. Can currently upload to S3, but more cloud storage is coming
  6. Backups compressed by default
  7. Uses the cluster_name,  time, and shard_name to make backup paths look like  /cluster1/<timestamp>/shard1.tgz, helping you keep things organized and letting you remove old backups by timestamp and cluster name.

Desired Roadmap

  • Mature into an officially supported Percona product like Xtrabackup
  • Fully Opensource and welcoming community improvements
  • Extending uploading to  CloudFiles by Rackspace, Azure ZRS, Google Cloud Storage and more
  • Complementary documentation on restores but can just natively use mongorestore tool also
  • Modular backup methods to extend to mongodump, LVM snapshots, ISCSI, EBS snapshots, MongoDB commands and more
  • Encryption before saving to disk
  • Partial backups and restores limit to specific databases and collections
  • Offline backup querying

Please be sure to check out the GitHub @mongodb_consistent_backup and log any issues or features requests.

Feel free to reach out to me on Twitter @dbmurphy_data or @percona with any questions or suggestions as well.

by David Murphy at July 25, 2016 05:35 PM

Jean-Jerome Schmidt

MySQL on Docker: Single Host Networking for MySQL Containers

Networking is critical in MySQL. It is a fundamental resource for managing access to the server from client applications and other replication peers. The behaviour of a containerized MySQL service is determined by how the MySQL image is spawned with the “docker run” command. With Docker single-host networking, a MySQL container can run in an isolated environment (only reachable by containers in the same network), in an open environment (where the MySQL service is totally exposed to the outside world), or with no network at all.

In the previous two blog posts, we covered the basics of running MySQL in a container and how to build a custom MySQL image. In today’s post, we are going to cover the basics of how Docker handles single-host networking and how MySQL containers can leverage that.

3 Types of Networks

By default, Docker creates 3 networks on the machine host upon installation:

$ docker network ls
NETWORK ID          NAME                DRIVER
1a54de857c50        host                host
1421a175401a        bridge              bridge
62bf0f8a1267        none                null

Each network driver has its own characteristic, explained in the next sections.

Host Network

The host network adds a container to the machine host’s network stack. You may imagine that containers running in this network connect to the same network interfaces as the machine host. It has the following characteristics:

  • Container’s network interfaces will be identical with the machine host.
  • Only one host network per machine host. You can’t create more.
  • You have to explicitly specify “--net=host” in the “docker run” command line to assign a container to this network.
  • Container linking, “--link mysql-container:mysql” is not supported.
  • Port mapping, “-p 3307:3306” is not supported.

Let’s create a container on the host network with “--net=host”:

$ docker run \
--name=mysql-host \
--net=host \
-e MYSQL_ROOT_PASSWORD=mypassword \
-v /storage/mysql-host/datadir:/var/lib/mysql \
-d mysql

When we look into the container’s network interface, the network configuration inside the container is identical to the machine host:

[machine-host]$ docker exec -it mysql-host /bin/bash
[container-host]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:fa:f6:30 brd ff:ff:ff:ff:ff:ff
    inet 192.168.55.166/24 brd 192.168.55.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fefa:f630/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:93:50:ee:c8 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:93ff:fe50:eec8/64 scope link

In this setup, the container does not need any forwarding rules in iptables since it’s already attached to the same network as the host. Hence, port mapping using option “-p” is not supported and Docker will not manage the firewall rules of containers that run in this type of network.

If you look at the listening ports on the host machine, port 3306 is listening as it should:

[machine-host]$ netstat -tulpn | grep 3306
tcp6       0      0 :::3306                 :::*                    LISTEN      25336/mysqld

Having a MySQL container running on the Docker host network is similar to having a standard MySQL server installed on the host machine. This is only helpful if you want to dedicate the host machine to MySQL, but still have it managed by Docker.

Now, our container architecture can be illustrated like this:

Containers created on host network are reachable by containers created inside the default docker0 and user-defined bridge.

Bridge network

Bridging allows multiple networks to communicate independently while keeping them separated on the same physical host. You may imagine this as being similar to another internal network inside the host machine. Only containers in the same network can reach each other, as well as the host machine. If the host machine can reach the outside world, so can the containers.

There are two types of bridge networks:

  1. Default bridge (docker0)
  2. User-defined bridge

Default bridge (docker0)

The default bridge network, docker0 will be automatically created by Docker upon installation. You can verify this by using the “ifconfig” or “ip a” command. The default IP range is 172.17.0.1/16 and you can change this inside /etc/default/docker (Debian) or /etc/sysconfig/docker (RedHat). Refer to Docker documentation if you would like to change this.

Let’s jump into an example. Basically, if you don’t explicitly specify “--net” parameter in the “docker run” command, Docker will create the container under the default docker0 network:

$ docker run \
--name=mysql-bridge \
-p 3307:3306 \
-e MYSQL_ROOT_PASSWORD=mypassword \
-v /storage/mysql-bridge/datadir:/var/lib/mysql \
-d mysql

And when we look at the container’s network interface, Docker creates one network interface, eth0 (excluding localhost):

[machine-host]$ docker exec -it mysql-container-bridge /bin/bash
[container-host]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:2/64 scope link
       valid_lft forever preferred_lft forever

By default, Docker utilises iptables to manage packet forwarding to the bridge network. Each outgoing connection will appear to originate from one of the host machine’s own IP addresses. The following are the machine’s NAT chains after the above container was started:

[machine-host]$ iptables -L -n -t nat
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  172.17.0.0/16        0.0.0.0/0
MASQUERADE  tcp  --  172.17.0.2           172.17.0.2           tcp dpt:3306

Chain DOCKER (2 references)
target     prot opt source               destination
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:3307 to:172.17.0.2:3306

The above rules allow port 3307 to be exposed on the machine host based on the port mapping option “-p 3307:3306” in the “docker run” command line. If we look at the netstat output on the host, we can see MySQL is listening on port 3307, owned by the docker-proxy process:

[machine-host]$ netstat -tulpn | grep 3307
tcp6       0      0 :::3307                 :::*                    LISTEN      4150/docker-proxy
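
To verify the mapping from the host side, we can connect with the mysql client through the published port, using the root password set when the container was created:

[machine-host]$ mysql -uroot -pmypassword -h127.0.0.1 -P3307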

At this point, our container setup can be illustrated below:

The default bridge network supports the use of port mapping and container linking to allow communication between containers in the docker0 network. If you would like to link another container, you can use the “--link” option in the “docker run” command line. The Docker documentation provides extensive details on how container linking works by exposing environment variables and an auto-configured host mapping through the /etc/hosts file.
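
As a quick illustration of linking (the alias “db” here is just an example), you can start a throwaway container linked to the MySQL container above and inspect what Docker injects; expect environment variables such as DB_PORT_3306_TCP_ADDR and a “db” entry in /etc/hosts pointing at the mysql-bridge container:

[machine-host]$ docker run --rm --link mysql-bridge:db mysql sh -c 'env | grep DB_; grep db /etc/hosts'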

User-defined bridge

Docker also allows us to create a custom bridge network, a.k.a. a user-defined bridge network (you can also create a user-defined overlay network, but we are going to cover that in the next blog post). It behaves exactly like the docker0 network, in that each container in the network can immediately communicate with the other containers in it. The network itself, however, isolates the containers from external networks.

The big advantage of this kind of network is that all containers can resolve each other’s container names. Consider the following network:

[machine-host]$ docker network create mysql-network

Then, create five MySQL containers under the user-defined network:

[machine-host]$ for i in {1..5}; do docker run --name=mysql$i --net=mysql-network -e MYSQL_ROOT_PASSWORD=mypassword -d mysql; done

Now, log into one of the containers (mysql3):

[machine-host]$ docker exec -it mysql3 /bin/bash

We can then ping all containers in the network without ever linking them:

[mysql3-container]$ for i in {1..5}; do ping -c 1 mysql$i ; done
PING mysql1 (172.18.0.2): 56 data bytes
64 bytes from 172.18.0.2: icmp_seq=0 ttl=64 time=0.151 ms
--- mysql1 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.151/0.151/0.151/0.000 ms
PING mysql2 (172.18.0.3): 56 data bytes
64 bytes from 172.18.0.3: icmp_seq=0 ttl=64 time=0.138 ms
--- mysql2 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.138/0.138/0.138/0.000 ms
PING mysql3 (172.18.0.4): 56 data bytes
64 bytes from 172.18.0.4: icmp_seq=0 ttl=64 time=0.087 ms
--- mysql3 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.087/0.087/0.087/0.000 ms
PING mysql4 (172.18.0.5): 56 data bytes
64 bytes from 172.18.0.5: icmp_seq=0 ttl=64 time=0.353 ms
--- mysql4 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.353/0.353/0.353/0.000 ms
PING mysql5 (172.18.0.6): 56 data bytes
64 bytes from 172.18.0.6: icmp_seq=0 ttl=64 time=0.135 ms
--- mysql5 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.135/0.135/0.135/0.000 ms

If we look at the resolver settings, we can see that Docker configures an embedded DNS server:

[mysql3-container]$ cat /etc/resolv.conf
search localdomain
nameserver 127.0.0.11
options ndots:0

The embedded DNS server maintains the mapping between each container name and its IP address on the network the container is connected to, in this case mysql-network. This feature facilitates node discovery in the network and is extremely useful when building a cluster of MySQL containers using MySQL clustering technologies like MySQL replication, Galera Cluster or MySQL Cluster.
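
For example, when configuring MySQL replication between these containers, a slave can point to its master by container name instead of IP address. A minimal sketch, assuming a replication user “repl” with password “replpassword” has already been created on mysql1:

[mysql3-container]$ mysql -uroot -pmypassword -e "CHANGE MASTER TO MASTER_HOST='mysql1', MASTER_USER='repl', MASTER_PASSWORD='replpassword'; START SLAVE;"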

At this point, our container setup can be illustrated as the following:

Default vs User-defined Bridge

The following table summarizes the major differences between these two networks:

Network deployment
  Default bridge (docker0): Created automatically by Docker upon installation.
  User-defined bridge: Created by the user.

Container deployment
  Default bridge (docker0): Containers are attached to this network by default.
  User-defined bridge: Explicitly specify “--net=[network-name]” in the “docker run” command.

Container linking
  Default bridge (docker0): Allows you to link multiple containers together and send connection information from one to another by using “--link [container-name]:[service-name]”. When containers are linked, information about a source container can be sent to a recipient container.
  User-defined bridge: Not supported.

Port mapping
  Default bridge (docker0): Supported, e.g. by using “-p 3307:3306”.
  User-defined bridge: Supported, e.g. by using “-p 3307:3306”.

Name resolver
  Default bridge (docker0): Not supported (unless you link the containers).
  User-defined bridge: All containers in the network can resolve each other’s container names to IP addresses. Docker versions before 1.10 use /etc/hosts; 1.10 and later use the embedded DNS server.

Packet forwarding
  Default bridge (docker0): Yes, via iptables.
  User-defined bridge: Yes, via iptables.

Example usage for MySQL
  Default bridge (docker0): Standalone MySQL.
  User-defined bridge: MySQL replication, Galera Cluster, MySQL Cluster (setups involving more than one MySQL container).
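
To double-check which containers are attached to the user-defined network and the IP addresses they were assigned, inspect the network; the output includes the subnet and the list of connected containers:

[machine-host]$ docker network inspect mysql-network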

No network

We can also create a container without any network attached to it by specifying “--net=none” in the “docker run” command. The container is then only accessible through an interactive shell (e.g. via “docker exec”), and no additional network interface is configured inside it.

Consider the following:

[machine-host]$ docker run --name=mysql0 --net=none -e MYSQL_ROOT_PASSWORD=mypassword -d mysql

Looking at the container’s network interfaces, only the loopback interface is available:

[machine-host]$ docker exec -it mysql0 /bin/bash
[mysql0-container]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

A container on the none network cannot join any other network. Nevertheless, the MySQL container is still running, and you can access it directly from a shell inside the container using the mysql client over localhost or the Unix socket:

[mysql0-container]$ mysql -uroot -pmypassword -h127.0.0.1 -P3306
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6
Server version: 5.7.13 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>
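
Connecting through the Unix socket works as well. In the official MySQL image the socket usually lives at /var/run/mysqld/mysqld.sock, although the path can differ in other images:

[mysql0-container]$ mysql -uroot -pmypassword --socket=/var/run/mysqld/mysqld.sock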

Example use cases for running a MySQL container on this network include verifying backups by testing the restoration process, preparing a backup created with, e.g., Percona XtraBackup, and testing queries against different versions of MySQL server.
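
As a sketch of the backup-verification use case (the paths and container name below are purely illustrative), you could mount a datadir restored with Percona XtraBackup into a network-less container and check that MySQL starts and the data is readable, using the root password of the backed-up server when prompted:

[machine-host]$ docker run --name=mysql-restore-check --net=none \
-v /backups/restored-datadir:/var/lib/mysql \
-d mysql
[machine-host]$ docker exec -it mysql-restore-check mysql -uroot -p -e "SHOW DATABASES;"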

At this point, our container setup can be illustrated as the following:

This concludes today’s blog. In the next blog post, we are going to look into multi-host networking (using overlay networks) together with Docker Swarm, an orchestration tool to manage containers across multiple machine hosts.

by Severalnines at July 25, 2016 11:25 AM

July 21, 2016

Jean-Jerome Schmidt

Planets9s - Watch our webinar replays for the MySQL, MongoDB and PostgreSQL DBA

Welcome to this week’s Planets9s, covering all the latest resources and technologies we create around automation and management of open source database infrastructures.

Watch our webinar replays for the MySQL, MongoDB and PostgreSQL DBA

Whether you’re interested in open source datastores such as MySQL, MariaDB, Percona, MongoDB or PostgreSQL; load balancers such as HAProxy, MaxScale or ProxySQL; whether you’re in DB Ops or DevOps; or looking to automate and manage your databases… Chances are that we have a relevant webinar replay for you. And we have just introduced a new search feature for our webinar replays, which makes it easier and quicker to find the replay you’re looking for.

Search for a webinar replay

Severalnines boosts US health care provider’s IT operations

This week we were delighted to announce that US health care provider Accountable Health Inc. uses our flagship product ClusterControl to outcompete its larger rivals. To quote Greg Sarrica, Director of IT development at AHI: “Using ClusterControl was an absolute no-brainer for me. AHI looked for an alternative to Oracle and IBM, which could match our demands and with our budget. We wanted to give our clients frictionless access to their healthcare information without portals crashing and potentially losing their personal data. Now we have a solution that allows us to be agile when competing in the fast-moving US healthcare market.”

Read the press release

ClusterControl Tips & Tricks: Best practices for database backups

Backups - one of the most important things to take care of while managing databases. It is said there are two types of people: those who back up their data and those who will. In this new blog post in the Tips & Tricks series, we discuss good practices around backups and show you how you can build a reliable backup system using ClusterControl.

Read the blog

That’s it for this week! Feel free to share these resources with your colleagues and follow us in our social media channels.

Have a good end of the week,

Jean-Jérôme Schmidt
Planets9s Editor
Severalnines AB

by Severalnines at July 21, 2016 01:33 PM