This page was created following a discussion (started by Dan Weber) about the feasibility of scaling repro into the millions-of-users range. Although the discussion covered more than performance alone (it also touched on reliability, security, ...), this page currently focuses on how to improve the performance of the stack and DUM.
The purpose of this page is to consolidate the ideas and experiences put forth in that discussion and to develop them further.
Areas of Interest
Transport Event Selection
- Move away from select() to each platform's best IO selection API (epoll, ...). This will help a lot for TCP/TLS and will also allow handling more than 1024 sockets
- Leverage libevent so that the platform IO selection API (pre-notification) is abstracted (it should provide the best API per platform). Can libevent also provide kernel-level async?
- Leverage asio to move to an asynchronous IO post-notification model
- Consider using kernel-level asynchronous IO APIs (rather than current synchronous ones) to be able to handle multiple events at the same time (asio might provide this)
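As a rough illustration of the first item, the sketch below waits for readable descriptors with epoll instead of select(). It is Linux-only and is not taken from the resiprocate sources: the class name ReadableSet and the function demoReadyCount are hypothetical, and a pipe stands in for a socket so the example needs no network setup.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <vector>

// Hypothetical helper (not a resiprocate class): tracks descriptors with
// epoll and reports which of them are readable.
class ReadableSet {
public:
    ReadableSet() : mEpollFd(epoll_create1(0)) { assert(mEpollFd >= 0); }
    ~ReadableSet() { close(mEpollFd); }
    ReadableSet(const ReadableSet&) = delete;
    ReadableSet& operator=(const ReadableSet&) = delete;

    // Register fd for read-readiness events. Unlike select(), the number of
    // registered descriptors is not bounded by FD_SETSIZE.
    bool add(int fd) {
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = fd;
        return epoll_ctl(mEpollFd, EPOLL_CTL_ADD, fd, &ev) == 0;
    }

    // Wait up to timeoutMs milliseconds and return the readable descriptors.
    // The wait is restarted if a signal interrupts it.
    std::vector<int> wait(int timeoutMs) {
        epoll_event events[64];
        int n;
        do {
            n = epoll_wait(mEpollFd, events, 64, timeoutMs);
        } while (n < 0 && errno == EINTR);
        std::vector<int> ready;
        for (int i = 0; i < n; ++i) {
            ready.push_back(events[i].data.fd);
        }
        return ready;
    }

private:
    int mEpollFd;
};

// Demo: write one byte into a pipe and count the descriptors that epoll
// reports as readable. Returns -1 if the setup fails.
int demoReadyCount() {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    int count = -1;
    {
        ReadableSet set;
        char byte = 'x';
        if (set.add(fds[0]) && write(fds[1], &byte, 1) == 1) {
            count = static_cast<int>(set.wait(1000).size());
        }
    }
    close(fds[0]);
    close(fds[1]);
    return count;
}
```

The cost of each wait() call depends on the number of ready descriptors rather than on the number registered, which is where the benefit over select() for many TCP/TLS connections would be expected to come from. A portable version would need a kqueue or IOCP counterpart, which is the gap libevent or asio would fill.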
Object Allocation and Destruction
- Use a different memory allocator than the default operator new (tcmalloc, ..., maybe a memory pool?)
- Object size could be reduced (SipMessage)
- Current object memory consumption is large and causes fragmentation (performance hit)
- Current implementation of object streaming is costly and could be improved (boost::spirit::karma could help, and some work has been done with resip faststream)
- [Dan Weber] This may be because a SipMessage's components are allocated all over the heap. A per-message allocator that works somewhat like a deque, using, say, 512-byte chunks, would give more contiguous memory and make it easier to encode the SipMessage to the stream. In addition, with something like boost::fusion adapters with a bit vector and some set fields for things like parameters, far fewer allocations would occur on the heap. Consider std::string vs. resip::Data. In copy-on-write implementations (such as libstdc++ before its C++11 ABI), std::string is reference counted using atomic instructions. So if you initialize all the parameter values from the first std::string, they will share its buffer rather than being deep-copied until they are filled in. The first one takes up 8 bytes on the stack plus 64 bytes or so on the heap, and each unfilled one takes 8 bytes on the stack until it is filled in.
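The per-message allocator idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not resiprocate code: ChunkArena, store, and demoChunkCount are hypothetical names, the 512-byte chunk size is the figure suggested above, and requests larger than one chunk are simply rejected by an assertion where a real allocator would need a fallback path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical per-message arena: memory is handed out from fixed-size
// chunks, so the pieces of one message sit next to each other instead of
// being scattered across the heap. Everything is freed at once when the
// arena is destroyed.
class ChunkArena {
public:
    static constexpr std::size_t kChunkSize = 512;

    // Return `size` bytes aligned to `align` (which must be a power of two).
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(size <= kChunkSize && "larger requests need a fallback path");
        std::size_t offset = (mUsed + align - 1) & ~(align - 1);
        if (mChunks.empty() || offset + size > kChunkSize) {
            // Current chunk is full (or none exists yet): start a new one.
            mChunks.push_back(std::make_unique<Chunk>());
            offset = 0;
        }
        mUsed = offset + size;
        return mChunks.back()->bytes + offset;
    }

    // Copy a character range into the arena and return its new location.
    const char* store(const char* text, std::size_t len) {
        char* dst = static_cast<char*>(allocate(len, 1));
        std::memcpy(dst, text, len);
        return dst;
    }

    std::size_t chunkCount() const { return mChunks.size(); }

private:
    struct Chunk {
        alignas(std::max_align_t) unsigned char bytes[kChunkSize];
    };
    std::vector<std::unique_ptr<Chunk>> mChunks;
    std::size_t mUsed = 0;  // bytes used in the most recent chunk
};

// Demo: ten short header-like strings (210 bytes in total) fit in a single
// chunk, i.e. one heap allocation instead of ten.
std::size_t demoChunkCount() {
    ChunkArena arena;
    for (int i = 0; i < 10; ++i) {
        arena.store("Via: SIP/2.0/UDP host", 21);
    }
    return arena.chunkCount();
}
```

Because consecutive store() calls return adjacent addresses within a chunk, encoding a message back to a stream could walk a few large buffers rather than many small ones. Whether this pays off for SipMessage would have to be measured.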
Multicore/Multi CPU and Threading
- Current model (stack and DUM threading) makes little use of multiple cores
- Adding more threads does not always improve performance (especially when inter-thread communication is based on locking)
- Trying to put the transports into their own thread did not seem to pay off
- Investigate lock-free FIFOs between stack and DUM to see how performance improves on multicore
- Threading the transaction layer did improve the performance (Dan has an implementation of hash-based concurrency in svn)
- DUM's performance could be improved using Software Transactional Memory
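To make the lock-free FIFO item more concrete, here is a minimal sketch of a bounded single-producer/single-consumer queue built on std::atomic. It assumes exactly one producer thread (for example the stack) and one consumer thread (for example DUM). SpscFifo is a hypothetical name and is not the existing resip Fifo class.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical bounded lock-free FIFO for exactly one producer thread and
// one consumer thread. T must be default constructible and movable.
template <typename T>
class SpscFifo {
public:
    // One extra slot is kept empty to tell "full" apart from "empty".
    explicit SpscFifo(std::size_t capacity) : mSlots(capacity + 1) {
        assert(capacity > 0);
    }

    // Producer thread only. Returns false if the queue is full.
    bool push(T value) {
        const std::size_t tail = mTail.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % mSlots.size();
        if (next == mHead.load(std::memory_order_acquire)) {
            return false;
        }
        mSlots[tail] = std::move(value);
        // Release: the slot contents become visible before the new tail.
        mTail.store(next, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns false if the queue is empty.
    bool pop(T& out) {
        const std::size_t head = mHead.load(std::memory_order_relaxed);
        if (head == mTail.load(std::memory_order_acquire)) {
            return false;
        }
        out = std::move(mSlots[head]);
        // Release: the slot is handed back to the producer after the move.
        mHead.store((head + 1) % mSlots.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> mSlots;
    std::atomic<std::size_t> mHead{0};  // next slot to read
    std::atomic<std::size_t> mTail{0};  // next slot to write
};
```

In use, one thread would call push() and the other pop(), each retrying or yielding when the call returns false. Neither side ever takes a mutex, so the two threads do not block each other, but the design only holds for a single producer and a single consumer. Several producers (for example multiple transport threads) would need a different queue or one queue per producer.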