Suppose you have an application "mousepling" that should play a pling when you click a button. The latency is the time between your finger clicking the mouse button and you hearing the pling.
The total latency in this setup is the sum of several individual latencies with different causes.
Points 1-3 are interesting, but beyond the scope of this document. Nevertheless, be aware that they exist, so that even if you have optimized everything else down to really low values, you may not necessarily get exactly the result you calculated.
Telling the server to play something usually involves a single MCOP call. There are benchmarks which confirm that, on the same host with unix domain sockets, telling the server to play something can be done about 9000 times per second with the current implementation.
I expect that most of this is kernel overhead, mostly switching from one application to another. Of course this value changes with the exact type of the parameters: transferring a whole image with one call will be slower than transferring only a single long value. The same is true for the return code. However, for ordinary strings (such as the filename of the wav file to play) this shouldn't be a problem.
That means we can approximate this time with 1/9000 sec, which is below 0.15 ms. We'll see that this is not relevant.
The server needs to do buffering, so that no dropouts are heard while other applications, such as your X11 server or the "mousepling" application, are running.
The way this is done under linux is that there are a number of fragments, each of a certain size. The server refills fragments, and the soundcard plays fragments.
So suppose there are three fragments. The server refills the first, the soundcard starts playing it. The server refills the second. The server refills the third. The server is done, other applications can do something now.
When the soundcard has finished playing the first fragment, it starts playing the second and the server starts refilling the first. And so on.
The maximum latency you get with all that is (number of fragments)*(size of each fragment)/(samplingrate * (size of each sample)).
Suppose we assume 44kHz stereo, and 7 fragments of 1024 bytes each (the current aRts defaults); we get 40 ms.
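This formula is easy to check with a short sketch (the function name and defaults are mine; 4 bytes per sample frame assumes 16-bit stereo):

```python
# Worst-case buffering latency, as derived above:
#   latency = fragments * fragment_size / (sample_rate * sample_size)

def buffer_latency_ms(fragments, fragment_bytes, sample_rate=44100, sample_bytes=4):
    """Maximum latency in milliseconds for the given fragment setup."""
    return 1000.0 * fragments * fragment_bytes / (sample_rate * sample_bytes)

print(round(buffer_latency_ms(7, 1024), 1))  # aRts defaults: ~40 ms
print(round(buffer_latency_ms(3, 256), 1))   # tuned setup:   ~4.4 ms
```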
These values can be tuned according to your needs. However, the CPU usage increases with smaller latencies, as the sound server needs to refill the buffers more often, and in smaller parts. It is also mostly impossible to reach better values without giving the soundserver realtime priority, as otherwise you'll often get drop-outs.
However, it is realistic to do something like 3 fragments with 256 bytes each, which would make this value 4.4 ms.
With a 4.4 ms delay, the idle CPU usage of aRts would be about 7.5%; with a 40 ms delay, about 3% (on a PII-350; this value may depend on your soundcard, kernel version and other factors).
Suppose your distance from the speakers is 2 meters. Sound travels at about 330 meters per second, so we can approximate this time with 6 ms.
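Summing the pieces discussed so far gives a rough end-to-end estimate for the mousepling case (a back-of-the-envelope sketch using the values from the text):

```python
SPEED_OF_SOUND = 330.0  # m/s, approximate

mcop_ms = 1000.0 / 9000                      # one MCOP call, < 0.15 ms
buffer_ms = 7 * 1024 * 1000.0 / (44100 * 4)  # default fragment setup, ~40 ms
travel_ms = 2.0 / SPEED_OF_SOUND * 1000.0    # 2 m to the speakers, ~6 ms

total = mcop_ms + buffer_ms + travel_ms
print(round(total))  # ~47 ms, dominated by the server-side buffering
```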
Streaming applications are applications that produce their sound themselves. Assume a game which outputs a constant stream of samples and should now be adapted to play back via aRts. As an example: when I press a key, the figure I am playing jumps, and a boing sound is played.
First of all, you need to know how aRts does streaming. It's very similar to the I/O with the soundcard. The game sends some packets with samples to the sound server, let's say three packets. As soon as the sound server is done with the first packet, it sends a confirmation back to the game that this packet is done.
The game creates another packet of sound and sends it to the server. Meanwhile the server starts consuming the second sound packet, and so on.
The latency here is similar to that in the simple case.
As above - beyond the scope of this document
Obviously, the streaming latency depends on the time it takes for all packets used for streaming to be played once. So it is (number of packets)*(size of each packet)/(samplingrate * (size of each sample)).
As you can see, this is the same formula that applies for the fragments. However, for games it makes no sense to use delays as small as above.
I'd say a realistic configuration for games would be 3 packets of 2048 bytes each. The resulting latency would be 35 ms.
This is based on the following: assume that the game renders 25 frames per second (for the display). It is probably safe to assume that you won't notice a difference in sound output of one frame. Thus a 1/25 second delay for streaming is acceptable, which in turn means 40 ms would be okay.
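The same latency formula applied to this game configuration, compared against one display frame (a sketch; the function name is mine):

```python
def stream_latency_ms(packets, packet_bytes, sample_rate=44100, sample_bytes=4):
    # time until every packet in flight has been played once
    return 1000.0 * packets * packet_bytes / (sample_rate * sample_bytes)

game_ms = stream_latency_ms(3, 2048)  # 3 packets of 2048 bytes each
frame_ms = 1000.0 / 25                # one frame at 25 fps = 40 ms

print(round(game_ms))       # ~35 ms
print(game_ms <= frame_ms)  # fits within the one-frame budget
```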
Most people will also not run their games with realtime priority, and the danger of dropouts in the sound is not to be neglected. Streaming with 3 packets of 256 bytes each is possible (I tried it), but causes a lot of CPU usage for streaming.
You can calculate these exactly as above.
There are a lot of factors which influence CPU usage in a complex scenario, with some streaming applications and some others, some plugins on the server, etc. To name a few:
If you play two streams simultaneously, you need to do additions. If you apply a filter, some calculations are involved. As a simplified example: adding two streams involves maybe four CPU cycles per addition, which on a 350MHz processor is 44100*2*4/350000000 = 0.1% CPU usage.
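As a sanity check of that estimate (the cycle count is the rough guess from the text, not a measurement):

```python
sample_rate = 44100     # samples per second
channels = 2            # stereo
cycles_per_add = 4      # rough guess for one addition
cpu_hz = 350_000_000    # PII-350

cpu_percent = 100.0 * sample_rate * channels * cycles_per_add / cpu_hz
print(round(cpu_percent, 1))  # ~0.1%
```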
aRts needs to decide which plugin calculates what, and when. This takes time; take a profiler if you are interested in the details. Generally, the less realtime you do (i.e. the larger the blocks that can be calculated at a time), the less scheduling overhead you have. Once you calculate blocks of 128 samples at a time (thus using fragment sizes of 512 bytes), the scheduling overhead is probably not worth thinking about.
aRts uses floats internally as its data format. These are easy to handle, and on recent processors not slower than integer operations. However, if there are clients which play data that is not float (like a game that should do its sound output via aRts), it needs to be converted.
The same applies if you want to replay the sounds on your soundcard. The soundcard wants integers, so you need to convert.
Here are numbers for a Celeron, approximate ticks per sample, with -O2 and egcs 2.91.66 (measured by Eugene Smith <firstname.lastname@example.org>). This is of course highly processor dependent:
So that means 1% CPU usage for conversion and 5% for interpolation on this 350 MHz processor.
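A sketch of how such per-sample tick counts translate into CPU load (the function is mine, and the 40-ticks value is only a placeholder chosen to reproduce the ~1% result, not one of the measured numbers):

```python
def conversion_cpu_percent(ticks_per_sample, sample_rate=44100, channels=2,
                           cpu_hz=350_000_000):
    # fraction of CPU cycles spent converting a stereo stream, in percent
    return 100.0 * ticks_per_sample * sample_rate * channels / cpu_hz

# ~40 ticks per sample on a 350 MHz CPU comes out at about 1%:
print(round(conversion_cpu_percent(40), 1))
```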
MCOP does, as a rule of thumb, 9000 invocations per second. Much of this is not MCOP's fault, but relates to the two kernel causes named below. However, this gives a basis for calculating the cost of streaming.
Each data packet transferred through streaming can be considered one MCOP invocation. Of course large packets are slower than 9000 packets/s, but that's the idea.
Suppose you use packet sizes of 1024 bytes. Then, to transfer a stream with 44kHz stereo, you need to transfer 44100*4/1024 = 172 packets per second. Suppose you could transfer 9000 packets with 100% CPU usage; then you get (172*100)/9000 = 2% CPU usage due to streaming with 1024 byte packets.
These are approximations. However, they show that you would be much better off (if you can afford the latency) to use, for instance, packets of 4096 bytes. We can make a compact formula here, by calculating the packet size which causes 100% CPU usage as 44100*4/9000 = 19.6 bytes, and thus getting the quick formula:
streaming cpu usage in percent = 1960/(your packet size)
which gives us 0.5% CPU usage when streaming with 4096 byte packets.
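The quick formula, written out as a small sketch (the function name is mine):

```python
# streaming cpu usage in percent = 1960 / (packet size in bytes),
# derived from 44100*4 bytes/sec and at most 9000 MCOP invocations/sec.

def streaming_cpu_percent(packet_bytes, sample_rate=44100, sample_bytes=4,
                          invocations_per_sec=9000):
    packets_per_sec = sample_rate * sample_bytes / packet_bytes
    return 100.0 * packets_per_sec / invocations_per_sec

print(round(streaming_cpu_percent(1024), 1))  # ~1.9% (the ~2% from above)
print(round(streaming_cpu_percent(4096), 1))  # ~0.5%
```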
(This is part of the MCOP protocol overhead).
Switching between two processes takes time: new memory mappings are set up, the caches are invalidated, and whatever else (if there is a kernel expert reading this, let me know the exact causes).
This means: it takes time. I am not sure how many context switches linux can do per second, but that number isn't infinite. Thus, I suppose quite a bit of the MCOP protocol overhead is due to context switching. In the beginning of MCOP, I did tests using the same communication inside one process, and it was much faster (about four times as fast).
(This is part of the MCOP protocol overhead).
Transferring data between processes is currently done via sockets. This is convenient, as the usual select() methods can be used to determine when a message has arrived, and it can easily be combined with other I/O sources such as audio I/O, the X11 server or whatever else.
However, those read and write calls certainly cost processor cycles. For small invocations (such as transferring one midi event) this is probably not so bad; for large invocations (such as transferring one video frame of several megabytes) this is clearly a problem.
Adding the use of shared memory to MCOP where appropriate is probably the best solution. However, it should be done transparently to the application programmer.
Take a profiler or do other tests to find out exactly how much current audio streaming is impacted by not using shared memory. It's not too bad, though, as audio streaming (playing an mp3) can be done with 6% total CPU usage for artsd and artscat (plus 5% for the mp3 decoder). However, this includes everything from the necessary calculations up to the socket overhead, so I'd say in this setup you could perhaps save 1% by using shared memory.
These benchmarks were done with the current development snapshot. I also wanted to try out the really hard cases, so this is not what everyday applications should use.
I wrote an application called streamsound which sends streaming data to aRts. Here it is running with realtime priority (without problems), and one small serverside (volume-scaling and clipping) plugin:
4974 stefan 20 0 2360 2360 1784 S 0 17.7 1.8 0:21 artsd
5016 stefan 20 0 2208 2208 1684 S 0 7.2 1.7 0:02 streamsound
5002 stefan 20 0 2208 2208 1684 S 0 6.8 1.7 0:07 streamsound
4997 stefan 20 0 2208 2208 1684 S 0 6.6 1.7 0:07 streamsound
Each of them is streaming with 3 fragments of 1024 bytes each (18 ms). There are three such clients running simultaneously. I know that this looks like a bit much, but as I said: take a profiler, find out what costs time, and if you like, improve it.
However, I don't think using streaming like that is realistic or makes sense.
To take it even more to the extreme, I tried what would be the lowest latency possible. Result:
You can do streaming without interruptions with one client application, if you take 2 fragments of 128 bytes between aRts and the soundcard, and between the client application and aRts. This means that you have a total maximum latency of 4*128/(44100*4) = 3 ms, where 1.5 ms is generated by the soundcard I/O and 1.5 ms through the communication with aRts.
Both applications need to run with realtime priority.
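The extreme setup above can be checked with the same fragment formula (a sketch; the function name is mine):

```python
# Two hops (client -> aRts, aRts -> soundcard), each buffered with
# 2 fragments of 128 bytes, at 44kHz stereo (4 bytes per sample frame).

def hop_latency_ms(fragments, fragment_bytes, sample_rate=44100, sample_bytes=4):
    return 1000.0 * fragments * fragment_bytes / (sample_rate * sample_bytes)

per_hop = hop_latency_ms(2, 128)
print(round(per_hop, 1))      # ~1.5 ms per hop
print(round(2 * per_hop, 1))  # ~2.9 ms total, i.e. the 3 ms from the text
```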
But: this costs an enormous amount of CPU. This example cost about 45% of my P-II/350. It also starts to click if you run top, move windows on your X11 display or do disk I/O. All these are kernel issues. The problem is that scheduling two or more applications with realtime priority costs an enormous amount of effort too, even more so if they communicate, notify each other, etc.
Finally, a more real-life example: aRts with artsd and one artscat (one streaming client) running 16 fragments of 4096 bytes each:
5548 stefan 12 0 2364 2364 1752 R 0 4.9 1.8 0:03 artsd
5554 stefan 3 0 752 752 572 R 0 0.7 0.5 0:00 top
5550 stefan 2 0 2280 2280 1696 S 0 0.5 1.7 0:00 artscat