Archive for the 'Performance' Category

Are vibrations killing enterprise disk performance?

According to this paper by Julian Turner, anti-vibration devices such as the AVR 1000 can yield impressive enterprise storage performance gains.

I am quoting the paper here:

“However, there was a shocking 246% performance difference for the RandomRead1m test and a 56% and 61% difference for the Random Read 2k and 8k tests respectively. The performance difference of Random Writes was similarly compelling with 52% difference for the 2k case, 88% difference for the 8k case, and 34% difference for 1m test.”

Sequential IO only improved by about 10-15% though. ZDNET has a good summary here. And they have a good point: if random IO on regular disks can be improved by 50%, that is not good news for SSDs.

New Software Design Technique Allows Programs To Run Faster

Quoting the original article: “Researchers at North Carolina State University have developed a new approach to software development that will allow common computer programs to run up to 20 percent faster and possibly incorporate new security measures.”

Actually, research into efficient memory management on multiprocessor systems is not a novel idea, but this approach, where all memory management is devoted to a separate thread, is quite interesting.
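
The general idea can be loosely illustrated with the following Python toy (entirely my own sketch; the actual research operates at the compiler and allocator level for compiled programs, and this is in no way their implementation). A dedicated thread takes care of releasing objects so the main thread never pays for deallocation:

import queue
import threading

free_q = queue.Queue()

def deallocator():
    # the dedicated "memory management" thread
    while True:
        obj = free_q.get()  # take over the last reference...
        del obj             # ...and let the object be reclaimed here

threading.Thread(target=deallocator, daemon=True).start()

# the main thread hands objects over instead of destroying them in place
data = [0] * 1_000_000
free_q.put(data)
del data  # the deallocator thread now holds the only reference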

Another piece of work that I evaluated about 10 years ago while at Sun is the HOARD memory allocator. It is a drop-in replacement for the memory allocation routines (the good old malloc) of C and C++ programs. What it does is allow a greater level of concurrency in the management of the heap. You can actually use it on UNIX/Linux with existing compiled programs by preloading the Hoard shared library.
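
For instance, on Linux, something like “LD_PRELOAD=/usr/lib/libhoard.so ./myprogram” should do the trick (the exact library path varies from system to system, and myprogram is just a hypothetical binary): the dynamic linker resolves malloc and friends to Hoard’s versions instead of libc’s, with no recompilation needed.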

Performance Basics: Scalability

Scalability is a word you hear so much in IT departments that it is sometimes on the verge of being a “buzz word”, you know, like “synergy”. The word has its roots in High Performance Computing and became widely used in enterprise environments when big distributed systems started to be used to solve complex problems or serve a great number of users (hundreds of thousands or even millions in the case of big websites).

How it started in HPC (High Performance Computing)

I studied parallel computing in the mid-90s and, at the time, I remember our teachers saying “maybe one day, everything you are learning will be used outside of the realm of High Performance Computing”. That was highly prophetic: the Internet was mainly a research tool then and computers with more than one processor were laboratory machines; the very notion of having two CPUs in a laptop would have been mind-boggling at the time.

And the most important thing to know when it comes to parallel computing is that some problems cannot be parallelized. For instance, iterative calculations can be very tricky because iteration n+1 needs the results of iteration n, and so on. On the other hand, the processing of 2D images is generally easy to parallelize since you can cut the image into portions and each portion can be processed by a different CPU. I am oversimplifying things, but this notion is so important that a man named Gene Amdahl came up with what is now known as Amdahl’s law.

Let me quote Wikipedia here: “it [Amdahl’s law] is used to find the maximum expected improvement to an overall system when only part of the system is improved”. In other words, if you take a program and 25% of it cannot be parallelized, then you will not be able to make it more than 4 times faster, whatever the number of CPUs you throw at it:

Amdahl's law
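
For the curious, here is a minimal Python sketch of the law (the function and variable names are mine): with only 75% of a program parallelizable, the speedup flattens out just below 4 no matter how many CPUs you add.

def speedup(p, n):
    # Amdahl's law: p is the parallelizable fraction, n the number of CPUs
    return 1.0 / ((1.0 - p) + p / n)

for n in (1, 2, 4, 16, 1024):
    print(n, round(speedup(0.75, n), 2))  # prints 1.0, 1.6, 2.29, 3.37, 3.99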

This is the exact same problem that everybody is now experiencing on their home computers equipped with several cores: some programs will just use one core and the other cores will do nothing. In some cases, it might be because the programmer is lazy or has not learned how to parallelize code; in other cases, it is simply because the problem cannot be parallelized. I always find it amusing to hear or read people ranting about their favorite program not taking advantage of their shiny new 4-core machines.

Well, it is because it can be very hard to parallelize some portions of code, and a lot of people have spent their academic lives working on these issues. In the field of HPC, the way to measure scalability is to measure the speedup, or “how much the execution time of my program is reduced with regard to the number of processors I throw at it” (i.e. the execution time on one processor divided by the execution time on N processors).

The coming of distributed systems to the Enterprise

In 1995, something called PVM (Parallel Virtual Machine) was all the rage since it allowed scientists to spread calculations over networked machines, and these machines could be inexpensive workstations. Of course, Amdahl’s law still applied and it was only worth it if you could parallelize the application you were working on. Since then, other projects like MPI or OpenMP have been developed with the same goal in mind. The convergence of these research projects and the availability of the Internet to a wide audience, although not directly linked, is quite remarkable.

The first example that comes to mind is the arrival of load balancer appliances in the very late 90s to spread web server load over several machines, thus increasing the throughput. Until then, web servers often ran on a single machine sitting on someone’s desk. But when the Internet user population numbered in the hundreds of thousands instead of a few thousands, this way of doing things did not cut it anymore. So programs, but more often specialized appliances, were invented to spread the load over more than one web server. This means that if 100 users tried to access your website simultaneously, 50 would be directed to webserver 1 and 50 to webserver 2 (a toy sketch of this round-robin idea follows below). This is not that different a concept from what people had been doing in High Performance Computing using PVM/MPI, etc. And luckily for us, serving static content is very easy to parallelize: there is no interdependency or single bottleneck.
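
Here is that toy round-robin dispatcher in Python (the server names are made up, and a real appliance does this at the network level rather than in application code):

import itertools
from collections import Counter

servers = ["webserver1", "webserver2"]
rotation = itertools.cycle(servers)   # round-robin: 1, 2, 1, 2, ...

tally = Counter()
for request in range(100):            # 100 simultaneous users
    tally[next(rotation)] += 1        # each request goes to the next backend in turn

print(tally)  # Counter({'webserver1': 50, 'webserver2': 50})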

The modern notion of Scalability for Enterprise applications

I will stop the comparisons between the ultra-specialized HPC world and its enterprise counterpart here, but I just wanted to show that these two worlds might sometimes benefit from looking over each other’s shoulders.

Nowadays, scalability can have multiple meanings but it often boils down to this: if I throw more distributed resources at an IT system, will it be able to serve more customers (throughput) in an acceptable time (latency)? Or, what does it take to increase the capacity of my system?

Scalability in an enterprise environment is indeed about how to handle the growing usage of a given IT system. Back in prehistoric ages, circa 1990, new generations of computers arrived every 18 months like clockwork, offered twice the processing speed, and most programs benefited from it since they were all mono-threaded. But nowadays, most IT systems are made of different components, each with their own scalability issues.

Take a typical 3-tier web environment composed of these tiers:

  1. Web Servers
  2. Application servers
  3. Database servers

The scalability of the whole system depends on the scalability of each tier. In other words, if one tier is a bottleneck, increasing the capacity of the other tiers will not increase your overall capacity. This might seem obvious, but what is often not obvious is which tier is actually the bottleneck!

The good news is that this is not exactly a new problem since it pretty much falls under Amdahl’s law. So what you need to ask yourself is:

  • How much of the system (and its subsystems) can be improved by throwing more resources at it? In other words, how parallelized is it already?
  • What does it take to improve the system and its subsystems? Better code? More CPU? More IO throughput? More memory?
  • What improvement will it yield? What will be the consequences? Will more customers be served? Will they be served faster or as fast? Etc.

In the end, it comes back to finding the bottleneck in the overall system and solving it, which might be easy (e.g. serving static content is very parallelizable) or extremely difficult (e.g. lots of threads waiting on a single resource to become available). Note that IT systems should usually be built with scalability in mind, which would avoid any detective work when the time to increase capacity has come, but alas it is not always the case.

Gnuplot: how to plot graphs from any UNIX machine

gnuplot is a plotting tool that I discovered in the 90s as a student. The binary itself is only around 1.2 MB! People often forget about it and would rather use a spreadsheet program. Granted, a spreadsheet will probably give you prettier graphs, but gnuplot is very handy for graphing in an automated manner, with a very small footprint.

Let’s say you have a file containing measurements like this (the first column is the measure point, the second column contains the measured values):

bash-3.2$ more toto
1 4
2 6
3 8
4 7
5 12
6 5
7 9
8 3

Then to draw it, just fire up gnuplot and type:

gnuplot> plot './toto' using 1:2 with lines;

And bang, you get this:

gnuplot simple example

1:2 means columns 1 and 2, and “with lines” means that a line will join the points (there are plenty of options such as boxes, vectors, etc.).

You can also improve things a little by creating a command file containing this for instance:

set xlabel "The title of the X Axis"
set ylabel "The title of the y Axis"
set xrange [1:8]
set yrange [0:14]
plot './toto' using 1:2 with lines 4;

And execute:

gnuplot < commandfile

You will get this:

gnuplot better example

These are just tiny examples, but by using command files you can automate the generation of graphs very easily.
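
For example, here is a minimal sketch of that kind of automation in Python (assuming gnuplot is on your PATH and was built with PNG support; the file names are mine):

import subprocess

# the same commands as above, but generated by the script itself
script = """
set terminal png
set output 'toto.png'
set xlabel "The title of the X Axis"
set ylabel "The title of the y Axis"
set xrange [1:8]
set yrange [0:14]
plot './toto' using 1:2 with lines
"""

# equivalent to: gnuplot < commandfile
subprocess.run(["gnuplot"], input=script, text=True, check=True)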

gnuplot has tons of options and is capable of much, much more: check out the official page!

For these examples, I have used the version of gnuplot provided by MacPorts.

dstat for Linux: an alternative to sar/vmstat/iostat, etc.

Written in Python, dstat is a neat piece of tooling. It is a monitoring tool akin to sar, iostat, vmstat, etc., and it allows you to measure a host of metrics. You can install it on any modern Ubuntu box by typing “apt-get install dstat” (and I am sure it is available for any major distro).

By just typing dstat, you’ll get this (refreshed every second):

dstat1 output

There are quite a few options:

Dstat options:
-c, --cpu              enable cpu stats
-C 0,3,total           include cpu0, cpu3 and total
-d, --disk             enable disk stats
-D total,hda           include hda and total
-g, --page             enable page stats
-i, --int              enable interrupt stats
-I 5,eth2              include int5 and interrupt used by eth2
-l, --load             enable load stats
-m, --mem              enable memory stats
-n, --net              enable network stats
-N eth1,total          include eth1 and total
-p, --proc             enable process stats
-s, --swap             enable swap stats
-S swap1,total         include swap1 and total
-t, --time             enable time/date output
-T, --epoch            enable time counter (seconds since epoch)
-y, --sys              enable system stats
--ipc                  enable ipc stats
--lock                 enable lock stats
--raw                  enable raw stats
--tcp                  enable tcp stats
--udp                  enable udp stats
--unix                 enable unix stats
-M stat1,stat2         enable external stats
--mods stat1,stat2
-a, --all              equals -cdngy (default)
-f, --full             expand -C, -D, -I, -N and -S discovery lists
-v, --vmstat           equals -pmgdsc -D total
--integer              show integer values
--nocolor              disable colors (implies --noupdate)
--noheaders            disable repetitive headers
--noupdate             disable intermediate updates
--output file          write CSV output to file

For example, “dstat -mp” will show memory and process-related metrics with a refresh rate of one second (the delay is tweakable):

dstat example 2

Last but not least, you can export the output to CSV.
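
For instance, judging from the options above, something like “dstat -cdn --output stats.csv 1 60” should display CPU, disk and network stats once per second for a minute while writing the same data to stats.csv (the trailing delay and count arguments work as in the rest of this family of tools).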

What I find especially neat is that you can combine any metrics with any other metrics (a bit more difficult to do with sar for instance).

iStat Menus for OS X

This is a cool monitoring app for Mac OS X. It looks a bit over the top in terms of exhaustiveness, but we all like that, don’t we? It is called iStat Menus and is donationware. It sits in the menu bar and displays ‘real time’ information related to the usage of CPU, memory, disk, etc. It will also show you the temperature sensor status as well as the various fan speeds. The cpu menu looks like this (I like that you can see the top 5 CPU hogs at a glance):

iStats cpu menu

It is quite configurable:

iStats preferences

Hopefully it will not be a resource hog itself; I will give it a go.

Performance Basics: Bottlenecks

Today’s computer systems are increasingly complex: they are made of components from different suppliers, themselves made of components from other suppliers, and so on. Even your cheapo laptop is made that way. The same applies whether we are talking about hardware or software (big software publishers license or “borrow” code from others).

All these components, again whether hardware or software, can have an impact on performance, be it perceived (“my computer feels sluggish”) or measured (the time it takes to perform a certain task). Often, solving performance problems starts with identifying the one component that has the biggest negative impact on performance, the so-called “bottleneck”.

To take the simple example of a single computer (enterprise systems composed of a myriad of networked computers are much more complex), there are a number of basic components whose utilization must be measured in order to identify a bottleneck:

  • CPU
  • Disk (Storage)
  • Memory
  • Network

There is a lot of software for measuring the utilization of these components, and Operating Systems usually come with basic tools to monitor real time performance: sar/iostat/vmstat, etc. on UNIX/Linux style systems, the Activity Monitor on Mac OS, the Windows Task Manager or Resource Monitor on Windows (depending on the version).

Let’s say you are converting a video file from one format to another, which is something that is becoming increasingly common. Here is what the iostat command measures on my Mac while iMovie is finishing the export of a video:

iostat measurement while iMovie is converting a movie

What is important here is the “cpu” column and its “id” subcolumn. It shows the idle time, i.e. the time that the CPU (or rather the combination of the two cores on my particular machine) does not spend working. You can see that while iMovie is converting, it is less than 20%, and when iMovie stops, it goes up to about 70%. Meanwhile, the disks (first two columns) are not being used at all. This would indicate that the CPU might be the “bottleneck”. Of course, you would have to use other tools such as the Activity Monitor to measure the memory usage and network occupancy and double check that the CPU is indeed the limiting factor, i.e. that the application is “CPU bound”.
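
If you prefer to script this kind of check, here is a rough sketch using the third-party psutil package for Python (my own addition, not what I used above; install it first with “pip install psutil”). Persistently low idle time combined with quiet disks points towards a CPU-bound workload:

import psutil

# sample CPU usage over one second; 'idle' is the same idea as iostat's "id"
cpu = psutil.cpu_times_percent(interval=1)
print("cpu idle: %.1f%%" % cpu.idle)

# cumulative disk activity since boot; watch how fast it grows between samples
disk = psutil.disk_io_counters()
print("disk reads: %d, writes: %d" % (disk.read_count, disk.write_count))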

In a more complex enterprise environment, you would use more or less the same method: measure as much as you can and identify the “hot spots”.

Once they are identified, you may (or may not) have a solution. The busy component might be a bottleneck just because it has to work hard, and this is the way it is: you might be able to upgrade it (or not, if you are already using the latest and greatest). It could also be that the busy component is a bottleneck because the software is unnecessarily overusing it. In that case, the software might be optimized. The recent audio driver problem on the Mac Pros is a prime example of a bug that put an uncalled-for strain on the CPU. It could also be that there is no solution: take the example of transmitting messages to Mars (the planet, not the chocolate bar), where you will always hit a “hard limit” imposed by physics in terms of transmission times.

In conclusion, there is no definite recipe, it is fun detective work though!

Performance Basics: Latency, Throughput and Load

I am often surprised by the fact that even seasoned IT professionals sometimes get confused by the differences between latency, throughput (aka bandwidth) and load. Simply put, here are the differences:

  • latency is the time it takes for a single operation to complete. For a client accessing a website, for instance, this means “the time it takes to load one page”. In the web world, latency is more often called “response time”.
  • throughput is the number of operations a system can deliver per unit of time. To take the website example again, this means “how many pages can be served per second”.
  • the load can be a bit of a fuzzy concept. Simply put, it is about the relationship between the throughput and the latency. To take the webserver example again, this means: “How many pages per second can I deliver with each page being served under xxx milliseconds?”

To take a plumbing analogy, the latency is determined by the length of a pipe while the throughput is determined by the diameter of the same pipe.
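
For the record, the textbook formalization of that relationship between throughput and latency is Little’s law (my addition, not part of the definitions above): the number of operations in flight equals throughput multiplied by latency. A tiny Python illustration with made-up numbers:

latency_s = 0.2          # each page takes 200 ms to serve
in_flight = 50           # the system can work on 50 requests at once
throughput = in_flight / latency_s  # Little's law, rearranged
print("max pages per second: %.0f" % throughput)  # prints 250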

Depending on the need of users, throughput might be more important than latency or vice versa:

  • In the case of online games, latency is often more important since clients are only transmitting small amounts of data (the position of the player). The faster this small amount of data gets transmitted to the server the better: this can give you a competitive advantage. Let’s not forget that latency on a global planetary scale is dictated by distances, as you will be limited by the speed of light, and by the network equipment you have to go through, each device adding a bit of latency: a transatlantic transfer of data, however small, will take about 100 ms (see the back-of-the-envelope calculation below). Interestingly enough, traders might face issues similar to online action gamers if the latency to reach a trading server is too high: they will have a competitive disadvantage.
  • In the case of video streaming, latency is not so much of an issue (it dictates how long you have to wait before the video starts) but throughput will determine the quality of the video (forget about HD content on a small pipe).
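
Back-of-the-envelope, with round figures of my own: light travels through optical fiber at roughly 200,000 km/s, and a transatlantic link is on the order of 6,000 km, so the one-way propagation delay alone is about 6,000 / 200,000 = 30 ms, i.e. roughly 60 ms for a round trip, before adding the small delay of each piece of network equipment along the way. Hence the ~100 ms figure above.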

Those are just a few examples, but often a given performance goal has to be defined in terms of both latency and throughput, as these are two very different things…

Next on the topic of Performance, we will be talking about bottlenecks.