Comments on Tsuna's blog: "How long does it take to make a context switch?"

sourcejedi (2015-06-12):
There are links in a HN thread about threads: https://news.ycombinator.com/item?id=6979150

[in the video link] Google measured:

50 ns or less "mode switch"
~1 µs minimum context switch (futex, threads pinned to the same core)
~3 µs without pinning

They looked at the pinned case. Oddly, their conclusion was that the time is mainly spent running the kernel CPU scheduler.

Then they showed how to get it down to 200 ns without pinning. They want to switch between two threads in the same way your benchmark does: the running one goes to sleep, and the sleeping one starts running. They added a new system call that basically just swaps the scheduling state of the two threads (instead of running the full scheduler).

(In a way it sounds similar to Binder IPC calls on Android: http://kroah.com/log/blog/2014/01/15/kdbus-details/)

Anonymous (2014-04-13):
@phantomfive Also, you're assuming it's "another-IO-bound" workload. What about when you're really doing something intense in a thread? In that case, you really _do_ want either another pool of threads you talk to via a pipe, or some headroom on the number of threads, so you can "experiment" with the concurrency value.
Anonymous (2014-04-13):
@phantomfive In fact Microsoft IOCPs (the one thing I love about Windows, especially before kqueue/epoll) default to the number of CPUs, but multiplying by 1.5x if you have anything else that could block mixed in is still better than thousands of threads. See http://msdn.microsoft.com/en-us/library/windows/desktop/aa365198%28v=vs.85%29.aspx : "The best overall maximum value to pick for the concurrency value is the number of CPUs on the computer. If your transaction required a lengthy computation, a larger concurrency value will allow more threads to run. Each completion packet may take longer to finish, but more completion packets will be processed at the same time. You can experiment with the concurrency value in conjunction with profiling tools to achieve the best effect for your application."

Interestingly, Solaris and AIX also have something called IOCPs, but I'm not sure how similar they are: http://en.wikipedia.org/wiki/Input/output_completion_port

phantomfive (2014-04-07):
"What happens when you are waiting for many ms for a DB to get back to you?"
Use non-blocking I/O; it's better for so many reasons.

Anonymous (2014-04-07):
I liked your post. I do not agree with your final assessment about using one software thread per hardware thread. What happens when you are waiting for many ms for a DB to get back to you?
Should you just hold up all the other jobs in the queue because you refused to multithread? Threading makes a lot of sense when you take into account the distributed nature of modern architectures.

Anonymous (2013-09-23):
Thanks for putting this together.

tsuna (2013-07-05):
As far as the Linux kernel's scheduling is concerned, there are no threads or processes. Everything is just a "task".

If you have a process with one thread, then there is one task that has that PID. If you have a process with three threads, then there are three tasks that share the same PID (but have different TIDs). But the scheduler doesn't care; all it sees is tasks that want to run, and its goal is to schedule them somewhere.

The only difference is that when you switch from one task to another, and both tasks share the same virtual address space, then no TLB flush occurs.

So yes, if you have multiple cores (whether they are all in the same physical CPU or you have multiple CPUs), one core could be executing one thread of a process, and the next time quantum could be given to either another thread of that same process, or another process altogether.

Also possible is the event where one thread is running on a core, and the next time quantum it gets is on another core.
This would yield suboptimal performance, especially if both cores are not on the same physical CPU.

phantomfive (2013-07-05):
If you only have two threads on two different cores, then yeah, you're probably not doing context switches (but you could have caching issues).

Anonymous (2013-07-05):
Yes, but the difference in performance with and without CPU affinity only applies to process context switches and not to thread context switches. Do you agree?

I don't understand how a thread context switch can occur between two threads on different cores.

Yvan

phantomfive (2013-07-04):
It turns out switching between threads and processes isn't significantly different, except in some special cases.

So you can use them as a representation for either threads or processes. (If that answer isn't precise enough for you, please make your own test and post the results! It will be interesting.)
Anonymous (2013-07-04):
Hello,
your article is very interesting and helpful, thanks!

I'm unclear about something: you speak about context switching without specifying each time whether you mean a "process context switch" or a "thread context switch". For me there is a huge difference: threads share the same virtual address space. Since threads belong to the same process, and a process is executed on one processor core (am I right up to here?), can we really have a thread context switch between two cores? Or do your graphs only represent process context switches?
I find the difference between thread context switches and process context switches a bit blurred in your explanation.
Could you explain?

Thank you very much,

Yvan Gadeau

Anonymous (2013-05-29):
This is very helpful. Thanks a lot.

Пабел Синглит (2013-01-27):
Thanks a lot! Your investigation helped me greatly!

phantomfive (2012-12-13):
Excellent work!

Harish B (2012-03-04):
How do you measure the time taken by a single context switch?
Is it the same for every context switch, or does it vary?

tsuna (2011-02-14):
pankaj@tux: this article shows that context switching takes a lot more time than just a simple user-kernel switch (a.k.a. mode switch). System calls only do a mode switch.

pankaj@tux (2011-02-14):
Which should take more time: a context switch or a user-kernel mode switch?

tsuna (2010-11-29):
@James Aguilar: I'm not sure I follow the calculation that leads you to conclude that you can't spend more than 3% of your time context switching. I have a MySQL database server running on a dual E7220 (= 4 actual cores) that is doing 25k context switches per second, meaning it does on average 25000/4 = 6250 switches per core per second. If each switch takes 30 µs, each core spends 187500 µs = 187.5 ms per second doing context switches, which translates to 18.75% of the CPU cycles being wasted on switching.
In practice, though, I'm guessing this DB doesn't pay the full 30 µs penalty, thanks to the shared working set and saved TLB flushes (most of the active threads are part of the same process, so they share the same address space), so we can treat 18.75% as a theoretical upper bound on the percentage of CPU cycles wasted on context switching.

tsuna (2010-11-29):
@Anonymous 5 and James Aguilar: async servers are more likely to perform better than non-async servers. YMMV depending on the type of server we're talking about. At the very least, not doing context switches saves TLB flushes. If you design and implement your multi-threaded server appropriately, you can maximize performance by using dedicated threads per task type and CPU affinity, in order to keep the high cache hit rate that boosts performance so much. Yes, it's more work to implement servers this way, but it's the price to pay for the best performance. I changed a server application at StumbleUpon to be fully async and non-blocking, and I gained from 40% to a full order of magnitude in performance, depending on the workload.

tsuna (2010-11-29):
@Anonymous 4: I left HyperThreading on because that's what we use on our servers at StumbleUpon (for a variety of reasons). The only "true numbers" are the ones that actually matter for your environment.
I know how HT works and the limitations it has, but that isn't relevant to this post.

tsuna (2010-11-29):
@Anonymous 1: All the tests happened in an x86_64 environment. I will update the post to reflect this.

tsuna (2010-11-29):
@Adrian Cockcroft: I didn't use lmbench because I was wondering how to write such a benchmark in the first place, and you always learn more by building things yourself than by using something that already exists. My little benchmarks aren't meant to be alternatives to lmbench; I simply provided the source code so others could unambiguously see how I did the benchmarks and reproduce them.

James Aguilar (2010-11-28):
Your conclusion about writing servers in async style versus sync style is not justified by the data. You said yourself that the problem with a context switch is that it trashes your cache. What do you think happens when you submit one async operation and the thread switches to working on another one? Because all the data for the second operation is not in the cache, it has to be faulted in. The effect is the same as the effect of switching to another thread. If the data for the second op does not have to be faulted in (i.e.
it is already in the cache), then the same would hold for thread switching, and the switch would be inexpensive.

I can buy that you would not want to have many more CPU-bound threads than hardware threads, but even that conclusion is dubious. If the cost of a context switch is 30 microseconds and the Linux kernel has 1000 time slices per second, then the most you could be spending on context switches is 30 ms per second, or about three percent of the computer's power. Maybe through other mechanisms the situation is actually worse than this, but without experimentation it will not be easy to be sure.

Anonymous (2010-11-27):
An async server still has context switches, and I mean from a conceptual angle, not a processor angle. A select server still has to change which data structures it is working on. During those events you are switching the context, and L2 will need to be switched out.

And don't forget the other cost, which would be code cleanliness and complexity.

Anonymous (2010-11-27):
Leaving HyperThreading on does not give a good indication. HyperThreading is not a true doubling of hardware threads; there is a probability of contention for the same rare resources. This test should be re-run with HyperThreading disabled to get a look at the true numbers. HyperThreading works best when your concern is total throughput, where lazy workloads can better make use of idle areas of the CPU.
In a contentious environment with latency sensitivity, HT works against predictability and against overall performance as measured by latency.