Open QueryReport from Barcamp Johor Bahru (30.8.2010, 04:27 UTC)

This weekend, I decided to attend BarcampJB pretty last minute. Lucky for me, barcamps are made for chaotics like me, so it was no problem at all. I found some friends that live here in Kuala Lumpur who I drove down to JB with (JB is around a 5 hour drive from KL, we did it in 3.5 :) ).

The camp was very interesting. Because JB is on the border with Singapore, there’s a good crossover between Malaysian and Singaporean techies.

I decided to go all out and give three talks on Saturday: First up was the MMM talk I’ve given at a few conferences before. All went well, and later on in the day some people approached me for more in-depth questions. It still seems that people have this idea in their head that they somehow need MySQL Cluster when there is more then one machine involved. When I explain them that that is very rarely the case and they can achieve what they want with MMM as well, they are often happy to hear that.

My next talk was more of a personal development one. People keep asking me here where I am from. When I explain to them that I’ve been location independent for the last 3 years, they are usually very eager to find out how I pull that off. I decided to summarise my experiences and put them in a talk. This talk was very well attended and I loved giving it. Most of the attendants were young techies, they are usually in a perfect position to do something very similar to what I’m doing.

The last talk was a lightning talk on Zabbix, the Open Source monitoring system we use at Open Query. Quick, and dirty, but effective.

Other interesting talks I attended were on breeze, an online banking application made for Standard Chartered bank that looks very slick and usable (If anyone from my bank is reading this: get with the program and fix our banking application to enter the 21st century please ;) ).

Conary and Foresight Linux were interesting as well. Conary (the package management system in Foresight Linux) is not quite mature yet, but definitely a very interesting technology. I was interested to hear about it and hope to see it become more mainstream in the future.

Daniel Cerventus gave a good lightning talk on what not to do as a startup. The main message was to just do it, and not wait for grant money or VC’s. Some solid tips as well, one of them being to run your potential name through Namechk, a handy potential username checker for many services.

There was obviously also a lot of networking and we went for a foot massage at the end of the day. Funny fact: I was the only one to stay awake through the massage (Even though I am narcoleptic), while two of my  friends (who I won’t name here ;) ) snored all the way through it :)

All in all another succesful tech event in Malaysia. Definitely one of the many reasons I love living here!

Link
Henrik IngoThe rise of Maker Culture and the leadership role of O'Reilly (27.8.2010, 12:33 UTC)

Inspired by my previous blog about the Lastwear open clothing company, I decided it is again time to take the pulse on how open source is spreading outside of the software world. The last time I did this was in 2008, where I concluded:

read more

Link
Open QueryOpen Query on Twitter/Identi.ca (27.8.2010, 03:47 UTC)

Open Query now has its own @openquery account on Twitter and Identi.ca so you can conveniently follow us there for announcements and tips – and also ask us questions! All OQ engineers can post/reply. The OQ site front page also tracks this feed.

Previously I was posting from my personal @arjenlentz account with #openquery hashtag, but that’s obviously less practical.

Link
Peter ZaitsevInnoDB memory allocation, ulimit, and OpenSUSE (23.8.2010, 20:55 UTC)

I recently encountered an interesting case. A customer reported that mysqld crashed on start on OpenSUSE 11.2 kernel 2.6.31.12-0.2-desktop x86_64   with 96 GB RAM when the innodb_buffer_pool_size was set to anything more than 62 GB. I decided to try it with 76 GB. The error message was an assert due to a failed malloc() in ut_malloc_low() in ut/ut0mem.c inside InnoDB source code. InnoDB wraps the majority of its memory allocations in ut_malloc_low(), so to get an idea of the pattern of requested allocations I added a debugging fprintf() to tell me how much was being allocated and whether it was successful.

I discovered something interesting. I expected the allocation to fail on the 76 GB of the buffer pool, due to some weird memory mapping issue and a continuous block of 76 GB not being available. However, that is not what happened. 76 GB buffer was allocated successfully. What was failing is the allocation of 3.37GB after that. What in the world could InnoDB need that was 3.37 GB? There was nothing in the settings that asked for anything close to 3 GB explicitly.

Source code is the ultimate documentation, and I took advantage of that. My good friend GDB guided me to buf_pool_init() in buf/buf0buf.c. There I found the following:

buf_pool->frame_mem = os_mem_alloc_large(
UNIV_PAGE_SIZE * (n_frames + 1),
TRUE, FALSE);

That was the buffer pool itself, the 76 GB of it. And now the buffer pool’s friend:

buf_pool->blocks = ut_malloc(sizeof(buf_block_t) * max_size);

3.6 GB of it!

From the comments in the code (InnoDB code actually has very good comments), max_size is the maximum number of buffer pool pages (16K each), n_frames which is the same thing unless AWE is used, but it was not used, so I did not worry about it.

What shall we call that friend? It is used for storing some meta information about buffer pool pages. The most natural name I could come up with from reading the source code is the blocks array.

Thus we can see that we are allocating another chunk that is in proportion to the setting of innodb_buffer_pool_size for the blocks array. The exact proportions will probably vary from version to version, but roughly about 1 G for every 25 G of the buffer pool. This can become significant in the proper innodb_buffer_pool_size estimations when the system has a lot of RAM and you want to have the largest possible innodb_buffer_pool_size. Do not forget to give the blocks array some room!

While this was an interesting investigation, it nevertheless did not explain why there was not enough room for a 76 GB buffer pool. Even with the extra 3.37 GB allocation, there was still some free memory. Or was there? Maybe some hidden monster was eating it up? I quickly wrote this hack to prove or disprove the monster’s presence.

I verified that I could allocate and initialize two chunks of 40 GB from two separate processes, but not 80 GB from one. In fact, 80GB allocation failed right in malloc(), did not even get to initialization. I tested it with allocating 70 GB concurrently in each process so as to overrun physical memory + swap. Both allocations were successful, one initialized successfully, the other was killed by the OOM kill during initialization.

This smelled like a low ulimit, and sure enough it was. ulimit -m ulimited; ulimit -v unlimited did the magic, and mysqld successfully started with an 80 GB buffer pool. Apparenly OpenSUSE defaults are set in proportion to physical memory to keep the memory-hungry applications from taking the system down. On this particular system (96 GB physical memory, 2 GB swap it decided to set the virtual memory ulimit (-v) to 77.27 GB, and the physical memory (-m) to 80.40 GB).


Entry posted by Sasha Pachev | 7 comments

Add to: delicious | digg |

Truncated by Planet PHP, read more at the original (another 924 bytes)

Link
Peter ZaitsevHigh availability for MySQL on Amazon EC2 – Part 4 – The instance restart script (19.8.2010, 12:54 UTC)

This post is the fourth of a series that started here.

From the previous of this series, we now have resources configured but instead of starting MySQL, Pacemaker invokes a script to start (or restart) the EC2 instance running MySQL. This blog post describes the instance restart script. Remember, I am more a DBA than a script writer so it might not be written in the most optimal way.

First, let's recap what's the script has to perform (the full script is given below).

  1. Kill the MySQL EC2 instance if running
  2. Make sure the MySQL EC2 instance is stopped
  3. Prepare the user-data script for the new MySQL EC2 instance
  4. Launch the new MySQL instance
  5. Make sure it is running
  6. Reconfigure local heartbeat
  7. Broadcast the new MySQL instance IP to the application servers

Kill the MySQL EC2 instance

In order to kill the existing MySQL EC2 instance, we first have to identify it. This is done by:

CODE:
  1. OLD_INSTANCE_ID=`ec2-describe-instances -K $PK -C $CERT | /usr/local/bin/filtre_instances.pl | grep $AMI_HA_MYSQL | egrep "running|pending" | tail -n 1 | cut -d'|' -f3`

by filtering on the AMI type of the instance. Since an instance can be listed at the "stopped" state, it is mandatory to filter for states "running" or "pending". Then the instance is terminated with:

CODE:
  1. ec2-terminate-instances -K $PK -C $CERT $OLD_INSTANCE_ID> /dev/null

Make sure the MySQL EC2 instance is stopped

Terminating an EC2 instance is not instantaneous, we can confirm an instance is really stopped by monitoring its status and wait until it is actually "terminated". The code below is how the script performs this task.

CODE:
  1. #wait until the old instance is terminated  it takes a few seconds to stop
  2. done="false"
  3. while [ $done == "false" ]
  4. do
  5.     status=`ec2-describe-instances -K $PK -C $CERT $OLD_INSTANCE_ID | /usr/local/bin/filtre_instances.pl |  grep -c terminated`
  6.       if [ "$status" -eq "1"

Truncated by Planet PHP, read more at the original (another 45376 bytes)

Link
Peter ZaitsevPercona talks at OpenSQL Camp this weekend (19.8.2010, 10:18 UTC)

Four Perconians (perconites?) will be at OpenSQL Camp in Sankt Augustin, Germany this weekend presenting talks on:

  • Recovery of Lost or Corrupted InnoDB Tables
  • Keep your MySQL backend online no matter what
  • XtraDB -- InnoDB on steroids
  • Xtrabackup for MySQL

If you would like to stop by and say hello, we are Aleksandr, Istvan, Morgan and Aurimas (pictures here).

If you can make the (approximate) location, but not the date, we also have training in Frankfurt in three weeks time.


Entry posted by Morgan Tocker | One comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Link
Peter ZaitsevAnnouncing Training for Operations Teams (18.8.2010, 20:45 UTC)

We're opening up registration for our new training courses today.  In short: we are moving from two days to a new four-day format.  The new additions are created by:

Our developer course has also undergone revision, and we now have more query tuning examples, and a new instrumentation chapter.

What is operations training?

Many companies split their technical staff between development, and operations.  The operations team is responsible for tasks such as capacity planning, backup/disaster recovery, and carrying a pager. They are the heroes that fight fires.

Our operations day of training is delivered as a hands on class.  Attendees will be divided into teams, and given a series of challenges to complete on EC2 machines.  As part of the development of this course we wrote a sample LAMP application, embedded with minor flaws which students will need to fix while they try and keep the application up.

Where can you attend?

We're starting off by branding the operations day as a 'BETA'.  You can attend for only $100 at San Francisco Thursday, 30 Sep 2010, New York Thursday, 14 Oct 2010 or Vancouver Thursday, 21 Oct 2010. There is also a discount of attending all four days for $1,450 if you book before 30 August.

After our initial BETA, the courses will be available in more USA and international locations. A partial list is already available on the training section of our website. We will confirm more cities in the coming weeks.


Entry posted by Morgan Tocker | 3 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Link
Shlomi Noachmylvmbackup HOWTO: minimal privileges & filesystem copy (17.8.2010, 17:42 UTC)

This HOWTO discusses two (unrelated) issues with mylvmbackup:

  • The minimal privileges required to take MySQL backups with mylvmbackup.
  • Making (non compressed) file system copy of one’s data files.

Minimal privileges

Some just give mylvmbackup the root account, which is far too permissive. We now consider what the minimal requirements of mylvmbackup are.

The queries mylvmbackup issues are:

  • FLUSH TABLES
  • FLUSH TABLES WITH READ LOCK
  • SHOW MASTER STATUS
  • SHOW SLAVE STATUS
  • UNLOCK TABLES

Both SHOW MASTER STATUS & SHOW SLAVE STATUS require either the SUPER or REPLICATION CLIENT privilege. Since SUPER is more powerful, we choose REPLICATION CLIENT.

The FLUSH TABLES * and UNLOCK TABLES require the RELOAD privilege.

However, we are not done yet. mylvmbackup connects to the mysql database, which means we must also have some privilege there, too. We choose the SELECT privilege.

Finally, here are the commands to create a mylvmbackup user with minimal privileges:

CREATE USER 'mylvmbackup'@'localhost' IDENTIFIED BY '12345';
GRANT RELOAD, REPLICATION CLIENT ON *.* TO 'mylvmbackup'@'localhost';
GRANT SELECT ON mysql.* TO 'mylvmbackup'@'localhost';

In the mylvmbackup.conf file, the correlating rows are:

[mysql]
user=mylvmbackup
password=12345
host=localhost

Filesystem copy

By default, mylvmbackup creates a .tar.gz compressed backup file of your data. This is good if the reason you’re running mylvmbackup is to, well, make a backup. However, as with all backups, one may be making the backup so as to create a replication server. But in this case you don’t really want compressed data: you want the data extracted on the replication server, just as it is on the original host.

mylvmbackup supports backing up the files using rsync.

To copy MySQL data to a remote host, configure the following in the mylvmbackup.conf file:

[fs]
backupdir=shlomi@backuphost:/data/backup/mysql
[misc]
backuptype=rsync

You may be prompted to enter password, unless you have the user’s public key stored on the remote host.

Normally, rsync is considered as remote-sync, but it also works on local file systems. If you have a remote directory mounted on your file system (e.g. with nfs), you can use the fact that rsync works just as well with local file systems:

[fs]
backupdir=/mnt/backup/mysql
[misc]
backuptype=rsync

Voila! Your backup is complete.

Link
Peter ZaitsevTesting MySQL column stores (16.8.2010, 16:17 UTC)

Recently I had the opportunity to do some testing on a large data set against two MySQL column-store storage engines. I'd like to note that this effort was sponsored by Infobright, but this analysis reflects my independent testing from an objective viewpoint.

I performed two different types of testing. The first focused on core functionality and compatibility of ICE (Infobright Community Edition) compared with MyISAM on a small data set. The second part of my testing compared the performance and accuracy of ICE with InfiniDB Community Edition on a 950GB data set.

The first first part of my analysis focused on testing specific MySQL functionally with Infobright's storage engine. A lot of my tests involved corner or edge cases since I have experience developing such cases for MySQL column-based storage engines in the past. I reported any bugs that I found, and contributed my test cases to ICE. In fact, some of the issues have already been addressed in the most recent software release. An example of such a problem would be "select avg(char_column) from table" where the column contains a mix of ascii data, such as names and numeric data. This is an example of implicit casting that probably wouldn't happen in a real application.

Importantly, I didn't find significant defects in Infobright that would be "show stoppers" for typical OLAP analysis queries. These tests were intended to approximate what would happen if you ported a MyISAM OLAP application to ICE. My testing suggests that with some basic testing, an application could be ported to ICE with a good chance of success. Of course, a good test environment is something that I think every operations team should insist on.

The second part of my analysis focused on testing a total of 29 queries on the large data set. I compared a number of different factors between the two databases, including:

  1. Ease of installation
  2. Loading capability and speed of loading
  3. Accuracy of results of queries over the large data set
  4. Speed of results of queries over the large data set
  5. Basic security aspects

The report that I produced may be found here.

It should be noted that the second set of tests included 29 different queries, some of which were provided by Infobright and others which I contributed. InfiniDB does not support as many data types and aggregate functions as ICE, and therefore it could not run some of the queries. ICE supports almost all of the MySQL aggregation functions. Notably, GROUP_CONCAT is not supported, which is something I hope they rectify in a future version.

In addition, I was not able to get accurate results for all of the queries on InfiniDB. In particular the query "select count(*) from carsales.sales_fact" when run on InfiniDB returned a number that was higher than it should have been, and several GROUP BY queries returned unexpected results as well. I did not change any "out of the box" settings for Infobright. Even after I modified the configuration settings on a 16GB box, one query did not have enough memory to complete on InfiniDB.

Overall, I would say that ICE is more "ready for prime time" than InfiniDB given the inconsistencies that I encountered. I will try to reproduce the problems on InfiniDB into easily reproducible test cases which I may contribute to them, but this is difficult given the size of the data set involved. ICE was able to execute queries quickly, and with accurate results.


Entry posted by Justin Swanhart | 13 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Link
Shlomi NoachMMM for MySQL single reader role (12.8.2010, 12:12 UTC)

The standard documentation and tutorials on MMM for MySQL, for master-master replication setup, suggest one Virtual IP for the writer role, and two Virtual IPs for the reader role. It can be desired to only have a single virtual IP for the reader role, as explained below.

The two IPs for the reader role

A simplified excerpt from the mmm_common.conf sample configuration file, as can be found on the project’s site and which is most quoted:

...
<host db1>
  ip                      192.168.0.11
  mode                    master
  peer                    db2
</host>

<host db2>
  ip                      192.168.0.12
  mode                    master
  peer                    db1
</host>
...
<role writer>
  hosts                   db1, db2
  ips                     192.168.0.100
  mode                    exclusive
</role>

<role reader>
  hosts                   db1, db2
  ips                     192.168.0.101, 192.168.0.102
  mode                    balanced
</role>

In the above setup db1 & db2 participate in master-master active-passive replication. Whenever you need to write something, you use 192.18.0.100, which is the virtual IP for the writer role. Whenever you need to read something, you use either 192.168.0.101 or 192.168.0.102, which are the virtual IPs of the two machines, this time in read role. Logic says one wishes to distribute reads between the two machines.

One IP for reader role

I have a few cases where the above setup is not satisfactory: there is a requirement to know the IP of the passive (read-only) master. Reason? There are queries which we only want to execute on the slave (reporting, long analysis), and only execute on the active master when this isn’t possible. Sometimes we might even prefer waiting for a slave to come back up rather than execute a query on the master.

This may involve an application level solution, or a connection-pool level solution (“get me a slave’s connection, or, if that’s not possible, get me the master’s”).

Anyway, neither 192.168.0.101 nor 192.168.0.102 relate to a particular machine’s role status. That is, the fact that one of the machines is in writer mode or not does not affect these virtual IPs.

The solution is a minor change to the configuration file. Real minor:

<role reader>
  hosts                   db1, db2
  ips                     192.168.0.101
  mode                    balanced
</role>

In this new setup the two nodes compete for a single reader role virtual IP. There is no 192.168.0.102 anymore. Although it does not reflect from the configuration file, it turns out MMM acts in a smart way; the way you would expect it to run.

There is nothing to suggest in the above that the IPs 192.168.0.100 & 192.168.0.101 will be distributed between the two machines. But you would like them to. And MMM does that. It makes sure that, if possible, one of the machines (say db1) gets the writer role, hence 192.168.0.100, and the other (db2) the reader role, hence 192.168.0.101.

Moreover, it prefers that situation over a current known situation: say db1 went down. The writer role moves to db2. When db1 is up again, MMM acts smartly: it does not give it back the writer role (since moving the active master around is costly, after all), but does give it the reader role, along with the 192.168.2.101 IP. So it takes care not to leave a server without a role, while preferring to move the writer role as little as possible.

Link
LinksRSS 0.92   RDF 1.
Atom Feed