Henrik IngoWhat you can do to help get rid of open core (26.7.2010, 12:44 UTC)

Much has been said about open core, but with the OSI coming out squarely against it on the one hand, and Rackspace and NASA creating the OpenStack.org project as a "true open source replacement" for Eucalyptus on the other hand, it seems open core is now much less attractive than it was only a week ago. It seems everyone has now learned what open core is and agrees that it is not open source, nor is it good for open source. (And by "everyone" I mean everyone that really are open source advocates, naturally those who directly or indirectly are trying to profit from open core will continue to promote the model for a long time to come.)

The final question that remains to be answered is, if I know about open core and don't like it, what can I do to help prevent its spreading and rather promote the adoption of true open source?

read more

Link
Peter ZaitsevCaching could be the last thing you want to do (24.7.2010, 18:39 UTC)

I recently had a run-in with very popular PHP ecommerce package which makes me want to voice a recurring mistake I see in how many web applications are architected.

What is that mistake?

The ecommerce package I was working with depended on caching.  Out of the box it couldn’t serve 10 pages/second unless I enabled some features which were designed to be “optional” (but clearly they weren’t).

I think with great tools like memcached it is easy to get carried away and use it as the mallet for every performance problem, but in many cases it should not be your first choice.  Here is why:

  • Caching might not work for all visitors - You look at a page, it loads fast.  But is this the same for every user?  Caching can sometimes be an optimization that makes the average user have a faster experience, but in reality you should be caring more that all users get a good experience (Peter explains why here, talking about six sigma).  In practice it can often be the same user that has all the cache misses, which can make this problem even worse.
  • Caching can reduce visibility – You look at the performance profile of what takes the most time for a page to load and start trying to apply optimization.  The problem is that the profile you are looking at may skew what you should really be optimizing.  The real need (thinking six sigma again) is to know what the miss path costs, but it is somewhat hidden.
  • Cache management is really hard – have you planned for cache stampeding, or many cache items being invalidated at the same time?

What alternative approach should be taken?

Caching should be seen more as a burden that many applications just can’t live without.  You don’t want that burden until you have exhausted all other easily reachable optimizations.

What other optimizations are possible?

Before implementing caching, here is a non-exhaustive checklist to run through:

  • Do you understand every execution plan of every query? If you don’t, set long_query_time=0 and use mk-query-digest to capture queries.  Run them through MySQL’s EXPLAIN command.
  • Do your queries SELECT *, only to use subset of columns?  Or do you extract many rows, only to use a subset? If so, you are extracting too much data, and (potentially) limiting further optimizations like covering indexes.
  • Do you have information about how many queries were required to generate each page? Or more specifically do you know that each one of those queries is required, and that none of those queries could potentially be eliminated or merged?

I believe this post can be summed up as “Optimization rarely decreases complexity. Avoid adding complexity by only optimizing what is necessary to meet your goals.”  – a quote from Justin’s slides on instrumentation-for-php.  In terms of future-proofing design, many applications are better off keeping it simple and (at least initially) refusing the temptation to try and solve some problems “like the big guys do”.


Entry posted by Morgan Tocker | 11 comments

Add to: delicious | digg |

Truncated by Planet PHP, read more at the original (another 773 bytes)

Link
Shlomi NoachSQL trick: overcoming GROUP_CONCAT limitation in special cases (21.7.2010, 13:14 UTC)

In Verifying GROUP_CONCAT limit without using variables, I have presented a test to verify if group_concat_max_len is sufficient for known limitations. I will follow the path where I assume I cannot control group_concat_max_len, not even in session scope, and show an SQL solution, dirty as it is, to overcome the GROUP_CONCAT limitation, under certain conditions.

Sheeri rightfully asks why I wouldn’t just set group_concat_max_len in session scope. The particular case I have is that I’m providing a VIEW definition. I’d like users to “install” that view, i.e. to CREATE it on their database. The VIEW does some logic, and uses GROUP_CONCAT to implement that logic.

Now, I have no control on the DBA or developer who created the view. The creation of the view has nothing to do with the group_concat_max_len setting on her database instance.

An example

OK, apologies aside. Using the sakila database, I execute:

mysql> SELECT GROUP_CONCAT(last_name) FROM actor \G
*************************** 1. row ***************************
GROUP_CONCAT(last_name): AKROYD,AKROYD,AKROYD,ALLEN,ALLEN,ALLEN,ASTAIRE,BACALL,BAILEY,BAILEY,BALE,BALL,BARRYMORE,BASINGER,BENING,BENING,BERGEN,BERGMAN,BERRY,BERRY,BERRY,BIRCH,BLOOM,BOLGER,BOLGER,BRIDGES,BRODY,BRODY,BULLOCK,CAGE,CAGE,CARREY,CHAPLIN,CHASE,CHASE,CLOSE,COSTNER,CRAWFORD,CRAWFORD,CRONYN,CRONYN,CROWE,CRUISE,CRUZ,DAMON,DAVIS,DAVIS,DAVIS,DAY-LEWIS,DEAN,DEAN,DEE,DEE,DEGENERES,DEGENERES,DEGENERES,DENCH,DENCH,DEPP,DEPP,DERN,DREYFUSS,DUKAKIS,DUKAKIS,DUNST,FAWCETT,FAWCETT,GABLE,GARLAND,GARLAND,GARLAND,GIBSON,GOLDBERG,GOODING,GOODING,GRANT,GUINESS,GUINESS,GUINESS,HACKMAN,HACKMAN,HARRIS,HARRIS,HARRIS,HAWKE,HESTON,HOFFMAN,HOFFMAN,HOFFMAN,HOPE,HOPKINS,HOPKINS,HOPKINS,HOPPER,HOPPER,HUDSON,HUNT,HURT,JACKMAN,JACKMAN,JOHANSSON,JOHANSSON,JOHANSSON,JOLIE,JOVOVICH,KEITEL,KEITEL,KEITEL,KILMER,KILMER,KILMER,KILMER,KILMER,LEIGH,LOLLOBRIGIDA,MALDEN,MANSFIELD,MARX,MCCONAUGHEY,MCCONAUGHEY,MCDORMAND,MCKELLEN,MCKELLEN,MCQUEEN,MCQUEEN,MIRANDA,MONROE,MONROE,MOSTEL,MOSTEL,NEESON,NEESON,NICHOLSON,NOLTE,NOLTE,NOLTE,NOLTE,OLIVIER,OLIVIER,PALTROW,PALTROW,P
1 row in set, 1 warning (0.00 sec)

mysql> SHOW WARNINGS;
+---------+------+--------------------------------------+
| Level   | Code | Message                              |
+---------+------+--------------------------------------+
| Warning | 1260 | 1 line(s) were cut by GROUP_CONCAT() |
+---------+------+--------------------------------------+
1 row in set (0.00 sec)

So, my GROUP_CONCAT has been truncated. How much did I lose?

mysql> SELECT SUM(LENGTH(last_name) + 1) - 1 FROM actor;
+--------------------------------+
| SUM(LENGTH(last_name) + 1) - 1 |
+--------------------------------+
|                           1445 |
+--------------------------------+

(In the above query I counted the separating commas; they are part of the GROUP_CONCAT limit).

The special case at hand

The proposed SQL trick assumes the following:

  • The length of the GROUP_CONCAT result is known to be under a certain value.
  • A GROUP_CONCAT of any set of n rows is known to be shorter than (or equal to) 1024 characters.

In our above example, I happen to know that the length of the GROUP_CONCAT result is below 2048. I also happen to know that any 100 rows will yield in a GROUP_CONCAT length of less than 1024.

How can I know this? Well, the length of my VARCHAR, or the fact I’m handling INT values can give me upper bounds on total lengths.

Steps towards the solution

Returning to our example, my intention becomes clearer: I want to work it out in two phases (later on I’ll show how this can be done in more phases). Any of the following is good:

mysql> SELECT GROUP_CONCAT(last_name) FROM actor WHERE actor_id BETWEEN 1 and 100 \G
*************************** 1. row ***************************
GROUP_CONCAT(last_name): GUINESS,WAHLBERG,CHASE,DAVIS,LOLLOBRIGIDA,NICHOLSON,MOSTEL,JOHANSSON,SWANK,GABLE,CAGE,BERRY,WOOD,BERGEN,OLIVIER,COSTNER,VOIGHT,TORN,FAWCETT,

Truncated by Planet PHP, read more at the original (another 7055 bytes)

Link
Peter ZaitsevEstimating Replication Capacity (21.7.2010, 02:51 UTC)

It is easy for MySQL replication to become bottleneck when Master server is not seriously loaded and the more cores and hard drives the get the larger the difference becomes, as long as replication
remains single thread process. At the same time it is a lot easier to optimize your system when your replication runs normally - if you need to add/remove indexes and do other schema changes you probably would be looking at some methods involving replication if you can't take your system down. So here comes the catch in many systems - we find system is in need for optimization when replication can't catch up but yet optimization process we're going to use relays on replication being functional and being able to catch up quickly.

So the question becomes how can we estimate replication capacity, so we can deal with replication load before slave is unable to catch up.

Need to replication capacity is not only needed in case you're planning to use replication to perform system optimization, it is also needed on other cases. For example in sharded environment you may need to schedule downtime or set object read only to move it to another shard. It is much nicer if it can be planned in advance rather than done on emergency basics when slave(s) are unable to catch up and application is suffering because of stale data. This especially applies to Software as Service providers which often have very strict SLA agreements with their customers and which can have a lot of data per customer so move can take considerable amount of time.

So what is replication capacity I call replication capacity the ability to replicate the master load. If replication is able to replicate 3 times the write load from the master without falling behind I will call it replication capacity of 3. When used with context of applying binary logs (for example point in time recovery from backup) replication capacity of 1 will mean you can reply 1 hour worth of binary logs within 1 hour. I will call "replication load" the inverse of replication capacity - this is basically what percentage of time the replication thread was busy replicating events vs staying idle.

Note you can speak about idle replication capacity, when box does not do anything else as well as loaded replication capacity when the box serves the normal load. Both are important. You care about idle replication capacity when you have no load on the slave and need it to catch up or when restoring from backup, the loaded replication capacity matters during normal operation.

So we defined what replication capacity is. There is however no tools which can tell us straight what replication capacity is for the given system. It also tends to float depending on the load similar as loadavg metrics. Here are some of the ways to measure it:

1) Use "UserStats" functionality from Google patches, which is now available in Percona Server and MariaDB. This is the probably the easiest and most accurate approach but it
does not work in Oracle MySQL Server. set userstat_running=1 and run following query:

SQL:
  1. mysql> SELECT * FROM information_schema.user_statistics WHERE user="#mysql_system#" \G
  2. *************************** 1. row ***************************
  3. USER: #mysql_system#
  4. TOTAL_CONNECTIONS: 1
  5. CONCURRENT_CONNECTIONS:

Truncated by Planet PHP, read more at the original (another 23577 bytes)

Link
Venu AnugantiMapReduce – DBInputFormat – Serialization on readers (20.7.2010, 05:46 UTC)
Last week I was working on EC2 MySQL server where one of the slave is taking lot of time to catch-up; and only job that is running on that server is mapreduce job to access InnoDB tables for read-only meta data. And debugging it further, noticed that every access to database server is serialized with [...]
Link
Kurt von FinckRename Maria Contest Winner! (20.7.2010, 04:15 UTC)
After two months of submissions, Monty Program employee review, community voting and Monty’s final decision, we are happy to announce that the Maria storage engine will henceforth be known as … Aria! Congratulations to Chris Tooley who suggested the name. Chris said about Aria in his submission, “Maria without the ‘M’, plus aria is a pleasant musical [...]
Link
Monty Program Group BlogRename Maria Contest Winner (20.7.2010, 03:52 UTC)

After two months of submissions, Monty Program employee review, community voting and Monty’s final decision, we are happy to announce that the Maria storage engine will henceforth be known as …

Aria!

Congratulations to Chris Tooley who suggested the name. Chris said about Aria in his submission, “Maria without the ‘M’, plus aria is a pleasant musical term.” Chris is now the proud new owner of a System 76 Meerkat net-top computer. Thanks to our good friends at System76 for providing this nifty prize.

Hopefully, in time, “Aria” will also be a pleasing database engine term. And now we will not have the confusion between MariaDB and Maria.

Link
Venu AnugantiRandom Pauses In MySQL – File Handle Serialization (20.7.2010, 03:36 UTC)
Last month, I blogged about a case involving InnoDB, where all threads acting on InnoDB tables completely stuck for about few hours doing nothing; until we found a way to get around and make the threads to run and do the actual work. There are few more cases where the server can get into pause [...]
Link
Monty SaysWhat is an Open Source Company? (18.7.2010, 18:10 UTC)
One of the hot topics here at the Community Leadership Summit in Portland is "what is an open source company ?". Simon Phipps has a got a lot of good points on this in his blog about Open Source Business.

We have companies like SugarCRM and Eucalyptus marketing themselves as "open source companies", even while not all of their code is available under an open source license.

To me it's clear that just because some of your product(s) is available under an open source license, you can't claim to be an open source company, as that would make the term meaningless. Under such a definition even Microsoft would be an open source company, as some of their products are now available as open source.

SugarCRM and Eucalyptus are clearly 'open core' companies, not open source companies. While open core is somewhat better than closed source, open core products have all the same disadvantages as closed source if you depend on a single feature of the closed parts for your business. In this case:

- You can't change, modify, port or redistribute the code.
- You can't fix bugs or extend the code.
- You are locked to the platforms that the vendor provides
- You are locked to one vendor.

In other words, the product as a whole should be regarded as a closed source product.

A little background why I feel so strongly about the term "open source company".

When MySQL AB was founded, David's and my intention was to create an open source company. Our definition was back then very simple "all software we produce should be under an open source license". When we took in investors we ensured that MySQL AB would stay as an open source company by putting a clause about this in our shareholder contracts.

David and I did however make a small mistake in that the shareholder agreement only said that "MySQL software" should be kept under an open source license. This allowed the MySQL management in 2006 to release Merlin, the MySQL monitor, as a closed source product, by claiming "this was not based on the MySQL server code". So even if we, the founders, managed to keep the MySQL server free, MySQL AB was only an "open source company" until 2006.

Learning from my mistake and to ensure that Monty Program Ab would always be an open source company, Zak Greant and I created the Hacking business model. Monty Program Ab follows this model and has additionally made a public promise that everything we create and release to our users will be under an open source license.

So what would then be a good definition for calling onces company "an open source company"?

I would like to suggest the following one:

1) You have to be a company that produces software.
2) All software the company delivers to its users must be available to everyone under an open source license. This includes all server code that is required to run and use the software.

In addition it would be good if the company could publicly state that all code they produce and release in the future will be under an open source license, but personally I would not require the company's to have to do this as some companies would have a hard time to do this.

At least here at the Leadership summit, the above definition seems to be acceptable to those that I have talked to. Please comment what you think about this!
Link
Peter ZaitsevSSD: Free space and write performance (18.7.2010, 03:46 UTC)

( cross posting from SSD Performance Blog )
In previous post On Benchmarks on SSD, commenter touched another interesting point. Available free space affects write performance on SSD card significantly. The reason is still garbage collector, which operates more efficiently the more free space you have. Again, to read mode on garbage collector and write problem you can check Write amplification wiki page.

To see how performance drops with decreasing free space, let's run sysbench fileio random write benchmark with different file sizes.

For test I took FusionIO 320 GB SLC PCIe DUO™ ioDrive card, with software stripping between two cards, and there if graph how throughput depends on available free space ( the bigger file - the less free space)

The system specification and used scripts you can see on Benchmark Wiki

On graph you can see two line ( yes, there are two lines, even they are almost identical).
First line is result when FusionIO is formatted to use full capacity, and second line is for case when I use additional space reservation ( 25% in this case, that is 240GB available). There is no difference in this case, however additional over-provisioning protects you from overusing space, and keeps performance on corresponding level.

It is clear the maximal throughput strongly depends on available free space.
With 100GiB utilization we have 933.60 MiB/sec,
with 150GiB (half of capacity) 613.48 MiB/sec and
with 200GiB it drops to 354.37 MiB/sec, which is 2.6x times less comparing with 100GiB.

So returning to question how to run proper benchmark, the result significantly depends what percentage of space on card is used, the results for 100GiB file on 160 GB card, will be different from the results for 100GiB file on 320 GB card.

Beside free space, the performance also depends on garbage collector algorithm by itself, and the card from different manufactures will show different results. Some new coming cards make high performance in case with high space utilization as competitive advantage, and I am going to run the same analysis on different cards.


Entry posted by Vadim | 7 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Link
LinksRSS 0.92   RDF 1.
Atom Feed