Tsuna's blog
In code we trust.

<h2>CVE-2022-4696 mitigation on GKE (2023-03-30)</h2>There's a <a href="https://access.redhat.com/security/cve/cve-2022-4696" target="_blank">CVE</a> on GCP that could lead to a privilege escalation (see security bulletin <a href="https://cloud.google.com/anthos/clusters/docs/security-bulletins#gcp-2023-001-gke" target="_blank">GCP-2023-001</a>). It can be mitigated by blocking the affected syscalls with a seccomp profile. Unfortunately, Kubernetes doesn't make it easy to deploy a profile: you can only reference a file under the kubelet's root directory, and GKE doesn't provide an easy facility to deploy such files to all the nodes in the cluster. So here's a quick workaround to mitigate the CVE for specific workloads that are at risk (of course, the recommended course of action is to upgrade GKE instead).<br /><pre>apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: seccomp-config
    app.kubernetes.io/part-of: seccomp
  name: seccomp-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: seccomp-profiles
  labels:
    app.kubernetes.io/name: seccomp-config
    app.kubernetes.io/part-of: seccomp
data:
  CVE-2022-4696.json: |
    {
      "defaultAction": "SCMP_ACT_ALLOW",
      "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
      ],
      "syscalls": [
        {
          "names": [
            "io_uring_enter",
            "io_uring_register",
            "io_uring_setup"
          ],
          "action": "SCMP_ACT_ERRNO"
        }
      ]
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: seccomp-config
  labels:
    app: seccomp-config
    app.kubernetes.io/name: seccomp-config
    app.kubernetes.io/part-of: seccomp
spec:
  selector:
    matchLabels:
      app: seccomp-config
  template:
    metadata:
      labels:
        app: seccomp-config
        name: seccomp-config
        app.kubernetes.io/name: seccomp-config
        app.kubernetes.io/part-of: seccomp
    spec:
      containers:
      - name: seccomp-config
        image: busybox
        command:
        - "sh"
        - "-c"
        - "ls -lR /host && cp -v /config/*.json /host/ && sleep infinity"
        volumeMounts:
        - name: hostdir
          mountPath: /host
        - name: seccomp-profiles
          mountPath: /config
        resources:
          requests:
            cpu: 1m
            memory: 1Mi
          limits:
            cpu: 25m
            memory: 25Mi
        livenessProbe:
          exec:
            command:
            - "true"
          periodSeconds: 600
        securityContext:
          privileged: true
      volumes:
      - name: seccomp-profiles
        configMap:
          defaultMode: 420
          name: seccomp-profiles
      - name: hostdir
        hostPath:
          path: /var/lib/kubelet/seccomp
          type: DirectoryOrCreate
      serviceAccountName: seccomp-config</pre>
<div>
This deploys the seccomp profile on all the nodes with a DaemonSet (alternatively, consider using the <a href="https://github.com/kubernetes-sigs/security-profiles-operator" target="_blank">security profiles operator</a>). You may need to deploy this in a namespace where privileged pods are allowed. Then, in any pod where you want to plug this hole, add the following to the <code>securityContext</code> of the container:
</div>
<pre>  seccompProfile:
    localhostProfile: CVE-2022-4696.json
    type: Localhost
</pre>
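<div>
If you want to double-check that the profile is actually being enforced inside a container, here's a small Python sketch (my own, not from the GKE docs) that pokes <code>io_uring_setup</code> directly via <code>ctypes</code>. The syscall number 425 assumes x86_64, and the exact errno you get back depends on the kernel and the seccomp action (SCMP_ACT_ERRNO usually maps to EPERM in container runtimes), so treat it as a rough check only:
</div>
<pre>import ctypes
import errno

libc = ctypes.CDLL(None, use_errno=True)
SYS_io_uring_setup = 425  # x86_64 syscall number

# With the profile applied, SCMP_ACT_ERRNO should make this fail with EPERM.
# Without it, the NULL params pointer fails with EFAULT (or ENOSYS on old kernels).
ret = libc.syscall(SYS_io_uring_setup, 1, None)
err = ctypes.get_errno()
if ret == -1 and err == errno.EPERM:
    print("io_uring blocked: the seccomp profile is in effect")
else:
    print("io_uring not blocked (ret=%d, errno=%d)" % (ret, err))
</pre>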
<h2>tar hanging when extracting an archive during a Docker build (2022-09-13)</h2>Consider the following excerpt of a Dockerfile building nginx:
<div><br /></div>
<pre>RUN \
    mkdir -p /usr/src/ngx_brotli \
    && cd /usr/src/ngx_brotli \
    && git init \
    && git remote add origin https://github.com/google/ngx_brotli.git \
    && git fetch --depth 1 origin $NGX_BROTLI_COMMIT \
    && git checkout --recurse-submodules -q FETCH_HEAD \
    && git submodule update --init --depth 1 \
    && cd .. \
    && curl -fSL https://nginx.org/download/nginx-$NGINX_VERSION.tar.gz -o nginx.tar.gz \
    && curl -fSL https://nginx.org/download/nginx-$NGINX_VERSION.tar.gz.asc -o nginx.tar.gz.asc \
    && sha512sum nginx.tar.gz nginx.tar.gz.asc \
    && export GNUPGHOME="$(mktemp -d)" \
    && gpg --keyserver keyserver.ubuntu.com --recv-keys 13C82A63B603576156E30A4EA0EA981B66B0D967 \
    && gpg --batch --verify nginx.tar.gz.asc nginx.tar.gz \
    <b>&& rm -rf "$GNUPGHOME" \</b>
    && tar -C /usr/src -vxzf nginx.tar.gz
</pre>
<div>Looks pretty simple, right? Yet it's hanging.
Halfway through the archive extraction, <span style="font-family: courier;">tar</span> goes into an endless busyloop:</div>
<div><br /></div>
<pre>root@docker-desktop:/# strace -fp 29116
strace: Process 29116 attached
wait4(-1, 0x7ffff15dcf24, WNOHANG, NULL) = 0
wait4(-1, 0x7ffff15dcf24, WNOHANG, NULL) = 0
wait4(-1, 0x7ffff15dcf24, WNOHANG, NULL) = 0
wait4(-1, 0x7ffff15dcf24, WNOHANG, NULL) = 0
wait4(-1, 0x7ffff15dcf24, WNOHANG, NULL) = 0
wait4(-1, 0x7ffff15dcf24, WNOHANG, NULL) = 0
wait4(-1, 0x7ffff15dcf24, WNOHANG, NULL) = 0
[...]
</pre>
<div><br /></div>
<div>More interestingly, I couldn't reproduce this by manually running the shell commands one by one in an identical interactive container.</div>
<div><br /></div>
<div>I realized that the <span style="font-family: courier; font-size: small;">gpg --recv-keys</span> call left behind a couple of processes, <span style="font-family: courier; font-size: x-small;">dirmngr</span> and <span style="font-family: courier; font-size: x-small;">gpg-agent</span>, and that those appeared to be the culprits (for reasons not yet clear to me).</div>
<div><br /></div>
<div>The easiest way to "fix" this was to ask them to terminate by deleting the <span style="font-family: courier; font-size: small;">$GNUPGHOME</span> directory (which <span style="font-family: courier; font-size: small;">dirmngr</span> watches with inotify).</div>
<div><br /></div>
<div>Just posting this here in case anyone else ever gets puzzled the way I did.</div>

<h2>Creating an admin account on Kubernetes (2018-07-07)</h2>I spent a bunch of time Googling how to do this, so I figured it could help someone else if I posted the steps to add an admin account on a <a href="https://kubernetes.io/">Kubernetes</a> cluster managed with <a href="https://github.com/kubernetes/kops">kops</a>.<br />
<br />
k8s has service accounts, but that's not what you want for an admin account, i.e. one equivalent to having root privileges on the cluster. Instead you simply need to create a certificate/key pair for the user and sign it with the master's CA (certificate authority).<br />
<br />
In this example we'll create an account for user <code>foobar</code>.<br />
<ol>
<li>Create a private key:<br />
<code>openssl genrsa -out foobar.key 2048</code><br />
For extra security you can also opt for 4096 bits for the key but for some reason <code>kops</code> defaults to 2048 right now.</li>
<li>Create a CSR (Certificate Signing Request)<br />
<code>openssl req -new -key foobar.key -out foobar.csr -subj '/CN=foobar/O=system:masters'</code><br />
The CN (Common Name) contains the user name and the O (Organization Name) must be <a href="https://kubernetes.io/docs/reference/access-authn-authz/rbac/#user-facing-roles"><code>system:masters</code></a> to be a super-user.</li>
<li>Fetch the cluster CA's certificate and private key from S3, from the bucket <code>kops</code> was configured to use:<br />
<code>aws s3 sync $KOPS_STATE_STORE/$NAME/pki pki</code><br />
Here the variables <code>$KOPS_STATE_STORE</code> and <code>$NAME</code> are the ones referred to in the <a href="https://github.com/kubernetes/kops/blob/master/docs/aws.md#creating-your-first-cluster">kops documentation</a>. For example:<br />
<code>aws s3 sync s3://prefix-example-com-state-store/myfirstcluster.example.com/pki pki</code><br />
All the PKI files will be downloaded from S3 into the local <code>pki</code> directory.</li>
<li>Issue the certificate using the master's CA:
<code>openssl x509 -req -in foobar.csr -CA pki/issued/ca/*.crt -CAkey pki/private/ca/*.key -CAcreateserial -out foobar.crt -days 3650</code></li>
</ol>
At this point you could give the private key (<code>foobar.key</code>) and the certificate (<code>foobar.crt</code>) to the user, but if you want to be a bit nicer and generate a self-contained <code>kubectl</code> config for them, here's how:
<br />
<pre>kubectl --kubeconfig=kcfg config \
    set-credentials $NAME --client-key=foobar.key --client-certificate=foobar.crt --embed-certs=true
kubectl --kubeconfig=kcfg config \
    set-cluster $NAME --embed-certs=true --server=https://api.k8s.example.com --certificate-authority pki/issued/ca/*.crt
kubectl --kubeconfig=kcfg config \
    set-context $NAME --cluster=$NAME --user=$NAME
kubectl --kubeconfig=kcfg config \
    use-context $NAME
</pre>
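Before handing it over, you can sanity-check that the new credentials actually work. Here's a quick sketch using the official Kubernetes Python client (an assumption on my part: <code>pip install kubernetes</code> is available; any kubectl command run with <code>--kubeconfig=kcfg</code> would do just as well):
<br />
<pre>from kubernetes import client, config

# Load the freshly generated config and list the nodes, which requires the
# cluster-wide privileges we just granted to this user.
config.load_kube_config(config_file="kcfg")
for node in client.CoreV1Api().list_node().items:
    print(node.metadata.name)
</pre>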
You can then hand over the <code>kcfg</code> file to the user and they could use it directly as their <code>~/.kube/config</code> if they don't already have one.<br />
<br />
Don't forget to <code>rm -rf pki</code> to delete the files you downloaded from S3.

<h2>Why I left Arista Networks (2018-01-09)</h2><i>edit: I rejoined Arista Networks in early 2020 ;)</i><br />
5 years ago, I wrote a blog post on <a href="http://blog.tsunanet.net/2013/03/why-i-joined-arista-networks.html">why I joined Arista Networks</a> back in 2012. As I am now suddenly and unexpectedly leaving the company, I figured I'd write a bit of a retrospective and perhaps bring some closure to this otherwise fairly quiet blog. I know that the original blog post has been used by candidates considering joining Arista, and even though I didn't write it with this in mind originally, I wanted to give a bit of an update to those considering joining the company in 2018 and beyond.<br />
<h2>
Why I left Arista</h2>
<div>
I was very happy and thriving at Arista and wasn't looking for a change. But I guess change was looking for me and somehow managed to convince me to join a new startup as co-founder. I won't say much more on that topic for now, but it's one of those opportunities that was too big to pass up. It's not in the networking industry, so it's not competitive with Arista.</div>
<div>
<br /></div>
<div>
I really struggled with this change. It took a massive amount of questioning to accept the idea of leaving such a great company, with a great team, working on great projects, to throw myself into the unknown and push myself <i>way</i> outside of my comfort zone. But I felt like I had to try, that I had to seize this opportunity.</div>
<h2>
Arista in 2018</h2>
<div>
Everything that I wrote in my <a href="http://blog.tsunanet.net/2013/03/why-i-joined-arista-networks.html">original blog post</a> is still true as far as I'm concerned. The big difference is that in the meantime Arista has established itself as one of the truly remarkable success stories in recent Silicon Valley history.</div>
<div>
<br /></div>
<div>
Now “Arista Networks” may not be a household name like Google or Facebook, but make no mistake, Arista's success in the networking industry is on the same track as Google's success in search or Facebook's success in social media.</div>
<div>
<br /></div>
<div>
Many others have tried (or are trying) to claim a piece of the networking cake dominated by Cisco, and I really cannot think of any other company succeeding in any meaningful way in that space. If anything, previously established players have all but disappeared (e.g. Force10, Brocade) or become largely irrelevant (e.g. Extreme). As the two remaining industry giants, Cisco and Juniper, are tumbling, steadily losing market share and focus, the brightest rising star in the datacenter networking industry has been Arista. And yet Arista still only commands a low double digit market share, so there is a lot of room to grow further while also strategically expanding the TAM (Total Addressable Market).</div>
<div>
<br /></div>
<div>
There are a number of tailwinds benefiting the company:</div>
<div>
<ul>
<li><b>Competitors still can't get their act together</b> and continue to overpromise and underdeliver. Quality issues continue to plague them. Arista manages its roadmap carefully and will not hesitate to say "no" to a customer if they cannot commit to what the customer is asking for, rather than promise something that they know cannot be delivered on time or at all. Quality remains paramount and the team is constantly trying to improve automated testing processes to ensure that every new release that comes out is better than the previous one and that no regression sneaks back into the code. This includes things like automatically running tests based on what code changed by leveraging code-coverage information gleaned during earlier test runs, automatically triaging and root-causing unexplained test failures, and more. There is a strong emphasis on building/improving tools and creating a development environment where everyone can be productive [1].</li>
<li><b>The routing business is collapsing into the datacenter networking industry.</b> This trend started a couple of years ago and should by now be clear to anybody in the industry. The gap between a "switch" and a "router" has been shrinking steadily, to the point that we now commonly see datacenter switches play the role of edge peering boxes, backbone routers, cross-datacenter interconnects, etc. This is hurting Juniper particularly badly, because this space was their bread and butter. But with the wrong hardware and the wrong software, they cannot compete with the density and cost per port of commodity hardware. The only lead they kept, and mostly the only differences that remain between switches and routers, are in specialized routing software. And since Arista is a software company, not a hardware company, the team has been hard at work implementing routing features and scaling the routing code way beyond what has ever been done on datacenter networking platforms. This is probably one of the biggest boosts to Arista's TAM, and much work remains to be done in that space to close that gap fully. It's very exciting.</li>
<li><b>Arista has been leading innovation in the networking industry.</b> Whenever a new chip comes out, Arista is often the first to make it bridge a packet, sometimes before the chip vendor has done it themselves. On many occasions, Arista has managed to push the hardware at a scale that exceeds the data sheet of the underlying hardware. This is only made possible by Arista's edge on the software front. Furthermore, Arista has influenced chip design with the silicon vendors they partner with to further widen the gap between the cost/performance of commodity hardware and vendor-proprietary ASICs like those designed at great cost by Cisco and Juniper. Arista has been leading industry standards like 25/50G and more recently 200/400G, with the new <a href="http://osfpmsa.org/">OSFP initiative</a>. Arista was the first to take to market new technologies like VXLAN, internet-scale routing in a sub-$20k 1RU top of rack switch, streaming telemetry and network programmability, etc.</li>
<li><b>Arista's execution has been flawless.</b> The company faced some pretty serious challenges, including a set of massive lawsuits from the 800 pound gorilla with a virtually unlimited legal budget that would stop at nothing to slow them down or tarnish their image. Despite all this, the company kept its head down and its focus, fought fearlessly for what was right, and managed to deliver 14 consecutive "beat and raise" quarters that turned it into a Wall Street darling. This is really a function of the amazing exec team that has been at the helm of the company.</li>
<li><b>Arista is in the segment of the networking industry that is growing the fastest.</b> There are a lot of products and areas in the overall networking industry but datacenter networking is the one growing the fastest, because everything is going to the cloud, and the cloud runs on this stuff. Arista has managed to remain laser focused on this specific segment of the industry, slowly expanding into connected areas where opportunities existed to go after some low hanging fruits (e.g. tap aggregation, routing, and more). Arista is present at a large scale in virtually all the major cloud environments out there. Again, the name might not quite have the mindshare of a Google or a Facebook, but these days it's virtually impossible to use the Internet without going through Arista devices.</li>
</ul>
<div>
And while the headcount has more than quintupled since I joined, the company has managed to remain surprisingly apolitical and bullshit-free. There have been growing pains, for sure, and it's not like everything is perfect and just happy rainbow unicorns either, but the company culture is essentially unchanged, and that's what actually matters.</div>
</div>
<div>
<br /></div>
<div>
So it was really, really, <i>really</i> freaking hard to say goodbye. I've been lucky to be very happy everywhere I worked in my career, but to this point Arista has been by far the best company I've worked at.</div>
<div>
<br /></div>
<div>
So... As Douglas Adams would say: So long, and thanks for all the fish.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
[1] A footnote worthwhile adding regarding the emphasis on tooling. Ken Duda, one of the co-founders, is very involved in developer tools. After becoming a <a href="https://golang.org/">Go</a> fanboy he spent months working on a new way to put together development workspaces using Docker containers. There are several people working with him on this new tool now and it has become the de-facto standard way of managing Arista's massive workspaces, which comprise millions of lines of code and often need to pull in tens of gigabytes of stuff. This has saved everybody a lot of time and helped support / enable changes to the CI (Continuous Integration) workflow.</div>
<div>
<br /></div>
<div>
<span style="font-size: x-small;">Additional disclaimer for this post: the views expressed in this blog are my own, and Arista didn't review/approve/endorse anything I wrote here.</span></div>
<h2>Getting cash without selling stocks (2017-05-01)</h2>I haven't posted anything here in a while, just been busy with life and hating Blogger's interface (and being too lazy to move to something else). But I wanted to share some of what I've learned recently on the ways one can get liquidity, because I've run into too many people who told me "damn I wish I'd known this earlier!".<br />
<br />
<u>Disclaimer:</u> This post, or anything else on this blog, is not financial / legal / investing / tax advice, just some pointers for you to research. Each person's situation is different and what works well for one person may not be applicable for another.<br />
<br />
<h3>
So you IPO'ed or got an exit?</h3>
<div>
Good for you, your shares are now convertible to real $$. The generally accepted strategy is to sell your stock and buy a low-cost diversified portfolio to avoid the risk of keeping all your eggs in the same basket. You can either do it yourself by buying Vanguard funds, or use a service like <a href="http://wlth.fr/13PtxO2">Wealthfront</a> (disclaimer: referral link) to <a href="https://blog.wealthfront.com/introducing-selling-plan/">do it for you</a>.<br />
<br />
Now, many people also want to use this liquidity to buy a home. This is especially true in the Bay Area where the real estate market <a href="https://twitter.com/louisgray/status/859198709925462017">is</a> <a href="http://www.sfgate.com/realestate/article/S-F-real-estate-over-bidding-Outer-Sunset-25th-Av-11102327.php">crazy</a>, with the <a href="https://www.zillow.com/san-francisco-ca/home-values/">median home price in SF being around $1.2m</a> as of today, and your typical jumbo loan requiring a 20-30% downpayment, i.e. $240k to $360k of cash on hand. I personally never had anything even close to this much cash, so I never thought buying a home was an option for me, even though I could technically afford the monthly mortgage payments (see this great <a href="https://www.nytimes.com/interactive/2014/upshot/buy-rent-calculator.html">rent-vs-buy calculator</a> to run the maths for you, you might be surprised).<br />
<br />
I've seen a lot of people sell off a big chunk of their shares, or even sometimes all of it, to buy a home or even just make a downpayment. They were then hit with a huge tax bill and, sometimes, the regret of having cashed out too soon and not having captured some of the upside of their stock.<br />
<br />
There is a lot of research that shows that IPOs typically underperform the market in the short term (1-2 years), and that investors buying at post-IPO <a href="http://www.kellogg.northwestern.edu/researchcomputing/workshops/papers/ritter_jf1991.pdf">prices</a> <a href="http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=1015&context=econ_wp">typically</a> <a href="http://www.cbsnews.com/news/why-ipos-underperform/">underperform</a> the market in the long term (3-6 years) as well. Wealthfront has a nice <a href="https://blog.wealthfront.com/strategies-for-selling-stock-post-ipo-wide/">blog post comparing the different selling strategies</a> across a few different scenarios.<br />
<br />
But if your strategy of choice is to diversify over the course of the next 3-5 years, as opposed to cashing out as quickly as possible, then that makes it much harder to get access to the cash to buy a home, unless you're willing to take a big tax hit.<br />
<br />
<h3>
Borrowing cash against assets</h3>
</div>
<div>
Enter the wonderfully dangerous world of lines of credit you can get against your (now-liquid) assets. I didn't even know this was a thing until a couple of months ago, but there is a plethora of financial products to get liquidity by borrowing against your stocks: SBLOC (Securities-Backed Lines of Credit), PAL / LAL (Pledged / Liquidity Asset Line), pledged-asset mortgages, etc. And margin loans. They all come with slightly different trade-offs but the basic idea is essentially the same: it's a bit like taking a HELOC (Home Equity Line of Credit) against your assets. If you don't know what that means, don't worry, keep reading.</div>
<div>
<br /></div>
<div>
I'm going to focus on margin loans because they're what I've researched the most, the easiest to access and most flexible, and the best deal I've found in my case, with <a href="https://www.interactivebrokers.com/en/index.php?f=1595">Interactive Brokers (IB) offering interest rates currently around 2%</a> (indexed on the <a href="https://ycharts.com/indicators/effective_federal_funds_rate">overnight Fed rate</a>).</div>
<div>
<br /></div>
<div>
Your brokerage account typically starts as a cash account – i.e. you put cash in (or you get cash by selling shares) and you can use cash to buy stocks. You can upgrade your account to a margin account in order to increase your buying power, so that your broker will lend you money and use the shares you buy as collateral. But that's not what we're interested in here: we already have shares and we want to get cash.<br />
<br /></div>
<h3>
Margin loan 101</h3>
<div>
I found it rather hard in the beginning to grok how this worked, so after being confused for a couple of weeks I spent a bunch of time reading select chapters of a couple of books that prepare students for the “<a href="https://en.wikipedia.org/wiki/Series_7_exam">Series 7 Examination</a>” used to certify stockbrokers, and the explanations there were much clearer than anything else I could find online. It's all very simple in the end and makes a lot of sense. As I mentioned earlier, this works mostly like a HELOC but with investment leverage.<br />
<br />
Let’s take a concrete example. You open a margin account and transfer in $2000 worth of XYZ stock. Your account now looks like this:<br />
<br />
<table>
<tbody>
<tr><td>Market Value (MV)</td><td colspan="2">= $2000</td><td></td></tr>
<tr><td>Debit (DB)</td><td>= $0</td><td>(you haven’t borrowed anything yet)</td><td></td></tr>
<tr><td>Equity (EQ)</td><td>= $2000</td><td>(you own all the stock you put in)</td><td></td></tr>
</tbody></table>
<br />
There are two margin requirements: the “initial” margin requirement, required to open new positions (e.g. buy stock), and the “maintenance” margin requirement, needed to keep your account in good standing. With IB the initial margin requirement (IM) is 50% and maintenance margin (MM) is 25% <small>(for accounts funded with long positions that meet certain conditions of liquidity, which most mid/large-cap stocks do)</small>.<br />
<br />
The difference between your equity and your initial margin requirement is the Special Memorandum Account (SMA); it's like a credit line you can use.<br />
SMA = EQ - IM = $2000 - $1000 = $1000.<br />
<small>(Detail: SMA is actually a high watermark, so it can end up being greater than EQ - IM if your stocks go up and then down).</small><br />
<br />
This $1000 you could withdraw in cash (a bit like taking a HELOC against the part of your house that you own) or you could invest it with leverage (maybe 2x, 3x leverage, sometimes more).<br />
<br />
So let’s say you decide to withdraw the entire amount in cash (again, like taking a HELOC). You now have:<br />
<table>
<tbody>
<tr><td>MV</td><td colspan="2">= 2000</td></tr>
<tr><td>DB</td><td>= 1000</td><td>(you owe the broker $1000)</td></tr>
<tr><td>EQ</td><td>= 1000</td><td>(you now only really own half of the stock, since you borrowed against the other half)</td></tr>
<tr><td>SMA</td><td>= 0</td><td>(you depleted your credit line)</td></tr>
<tr><td>MM</td><td>= 500</td><td>(25% of MV: how much equity you need to be in good standing)</td></tr>
</tbody></table>
<br />
Now your equity is $1000, which is greater than your maintenance margin of $500, so you’re good. Let’s see what happens if XYZ starts to tank. For example let’s say it drops 25%.<br />
<br />
<table>
<tbody>
<tr><td>MV</td><td>= 1500</td><td>(lost 25% of value)</td></tr>
<tr><td>DB</td><td>= 1000</td><td>(the amount you owe to the broker obviously didn’t change)</td></tr>
<tr><td>EQ</td><td>= 500</td><td>(difference between MV and DB)</td></tr>
<tr><td>SMA</td><td>= 0</td><td>(still no credit left)</td></tr>
<tr><td>MM</td><td>= 375</td><td>(25% of MV)</td></tr>
</tbody></table>
<br />
In this case the account is still in good standing because you still have $500 of equity in the account and the maintenance margin is $375.<br />
<br />
Now if the stock dips further, let’s say your account value drops to $1350, we have:<br />
<br />
<table>
<tbody>
<tr><td>MV</td><td>= 1350</td></tr>
<tr><td>DB</td><td>= 1000</td></tr>
<tr><td>EQ</td><td>= 350</td></tr>
<tr><td>MM</td><td>= 337.5</td></tr>
</tbody></table>
<br />
Now you’re running close to the wire but you’re still good, as EQ >= MM. But if the account value were to drop a bit further, to $1332, you’d be in the red and get a margin call:<br />
<br />
<table>
<tbody>
<tr><td>MV</td><td>= 1332</td></tr>
<tr><td>DB</td><td>= 1000</td></tr>
<tr><td>EQ</td><td>= 332</td></tr>
<tr><td>MM</td><td>= 333</td></tr>
</tbody></table>
<br />
Now EQ < MM: your equity is $1 short of the maintenance margin. The broker will liquidate your XYZ shares until EQ == MM again (and perhaps even a bit more to give you a bit of a cushion).<br />
<br />
Bottom line: if you withdraw your entire SMA and don’t open any positions, you can only absorb a 33% drop in market value before you get a margin call for maintenance margin violation. Obviously if you don’t use the entire SMA, you then have more breathing room.<br />
<br />
Obviously this whole thing is super safe for the broker: if they start to liquidate you automatically and aggressively when you go into margin violation (like IB would do), there is almost no way they fail to recover the money they loaned out to you, unless something absolutely dramatic happens, such as your position becoming illiquid and them getting stuck while trying to liquidate you. This is why they have requirements such as minimum daily trading volume, minimum market cap, and minimum share price which, if not met, result in increased margin requirements – IPO shares are also typically subject to a 100% margin requirement, so you typically have to wait if you're just about to IPO, but it's not clear to me how long exactly – you might be able to get some liquidity before the lock-up period expires?<br />
<br />
You have to run the numbers: based on the assets you have and the amount you borrow, how much would they need to tank before you get a margin call? Based on that, and your assessment of the likelihood that such a scenario would unfold, you can gauge how much risk you're taking and what a reasonable balance to maintain is.<br />
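<br />
Here's a tiny Python sketch of that arithmetic, using the numbers from the example above (and the same simplifying assumptions: a single long position, a 25% maintenance requirement, and no interest accruing):
<br />
<pre>def margin_call_value(debit, maintenance=0.25):
    """Market value below which you get a margin call.

    Equity is MV - debit, and a call happens when MV - debit falls below
    maintenance * MV, i.e. when MV drops below debit / (1 - maintenance).
    """
    return debit / (1.0 - maintenance)

mv, debit = 2000.0, 1000.0           # the example from this post
call_at = margin_call_value(debit)   # ~1333.33
drop = 100 * (1 - call_at / mv)      # ~33%
print("margin call below MV = %.2f, i.e. a %.0f%% drop from %.0f" % (call_at, drop, mv))
</pre>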
<br />
<h3>
Negotiating margin terms</h3>
<div>
I very recently figured out that while <a href="https://www.interactivebrokers.com/en/index.php?f=1340">Interactive Brokers seems to be the only one</a> with such <a href="https://www.interactivebrokers.com/en/index.php?f=1595">low interest rates</a> (around 2% when everybody else charges 5-8%), with the exception perhaps of <a href="https://blog.wealthfront.com/introducing-portfolio-line-of-credit/">Wealthfront's Portfolio Line of Credit</a> clocking in at around 3-5%, you can actually negotiate the published rates. I've read various stories online of people getting good deals with the broker of their choice, and usually the negotiation involves transferring your assets to IB and coming back to your broker saying "this is what I get with IB, but if you are willing to earn my business back, we can talk".</div>
<div>
<br /></div>
<div>
I did this with E*TRADE recently; they not only matched but slightly beat IB's rate, and made it a flat rate (as opposed to IB's blended rate, which would only beat my negotiated rate for really large balances), along with a cash incentive and a couple of months of free trading (I'm not an active trader anyway but I thought I'd just mention it here). Morgan Stanley was also willing to give me a similar deal. I'm not a big fan of E*TRADE (to say the least) but there is some value in keeping things together with my company stock plan, and I also appreciate their efforts to win me back.<br />
<br /></div>
<h3>
Buying a home</h3>
</div>
<div>
So once you have access to liquidity via the margin loan, the cool thing is that you don't pay anything until you start withdrawing money from the account. And then you'll be paying interest monthly on whatever balance you have (beware that the rate is often based on the daily Fed / LIBOR rate, so keep an eye on how that changes over time). Actually, you don't even have to pay the interest; it'll just get debited from your balance — not that I would recommend this, but let's just say the terms of this type of loan are incredibly flexible.</div>
<div>
<br /></div>
<div>
You can then either do a traditional mortgage, where the downpayment comes in part or in full from the margin loan – generally speaking lenders don't want the downpayment to be borrowed money, but since the margin loan is secured by your assets, that's often fine by them (I've had only one lender, SoFi, ironically, turn me down because of this; other banks were fine with it) – or, if you have enough assets (more than 2x the value of the property), borrow the entire amount in cash, make a cash offer (unfortunately a common occurrence in the Bay Area), and then get a mortgage within 90 days of closing the deal. This is called <a href="https://www.quickenloans.com/blog/delayed-financing-uncommon-refinance-option-cash-buyers">delayed financing</a>, and it works exactly like a mortgage, except it kicks in after you've closed on the property with cash. This way you pay yourself back 70-80% of the amount, and enjoy the mortgage interest deduction (while it lasts) and the security of having a fixed rate locked for 30 years.</div>
<div>
<br /></div>
<div>
I know at least two people that are also considering using this trick to do expensive home remodels, where it's not clear just how expensive exactly the work will be, and having the flexibility of getting access to large amounts of cash fast, without selling stocks / incurring taxable events at inconvenient times, is a great plus.</div>
<div>
<br /></div>
<div>
This whole contraption allows you to decouple your spending from the sale of your assets. Or you may decide to pay the loan back in other ways than by selling assets (e.g. monthly payments using your regular income), thereby preserving your portfolio and saving a lot in taxes. Basically a bit like having your cake and eating it too.</div>
<h2>Why I joined Arista Networks (2013-03-08)</h2>Over the past few months, many people have asked me why I jumped from the "web world" to the "network industry" to work at <a href="http://www.arista.com/">Arista Networks</a>. I asked myself this question more than once, and it was a bit of a leap of faith, but here's why I did it, and why I'm happy I did it.<br />
<h2>
Choosing a company to work for</h2>
<div>
There is a negative unemployment rate in Silicon Valley provided you know how to type on a keyboard. It's ridiculous, but all the tech companies are <a href="http://www.youtube.com/watch?v=I6IQ_FOCE6I">hiring like there's no tomorrow</a>. So needless to say, when the time came to make a move, I had too many options available to me. It's not easy to decide where you'll want to spend the next X years of your life.</div>
<div>
<br /></div>
<div>
My #1 requirement for my next job was to work with great people. This ranked above salary, likelihood of company success, and possibly even location (although I really wanted to try to stay in SF). I wanted to feel like I felt when I was at Google, when I could look around me and assume that all these engineers I didn't know were smarter than me, because most of them were. I could have returned to Google too, but I was up for something new.</div>
<div>
<br /></div>
<div>
I quickly wound up with 3 really good offers. One from <a href="https://www.cloudflare.com/">CloudFlare</a>, who's coming to kick the butt of the big CDNs, one from <a href="https://twitter.com/">Twitter</a>, which you know already, and one from this datacenter networking company called <a href="http://www.arista.com/">Arista</a>. The first two were to work on interesting, large-scale distributed systems. But the last one was different.</div>
<h2>
Why did I interview with Arista?</h2>
<div>
So why did I decide to interview with Arista in the first place? In November 2010, I was shopping for datacenter networking gear to rebuild an entire network from scratch. I heard about Arista and quickly realized that their switches and software architecture were exactly what I'd been looking for since the previous year already (basically since I left Google). We ended up buying Arista and I was a happy customer for about 2 years, until I joined them.</div>
<div>
<br /></div>
<div>
I don't like to interact with most vendors. Most of them want to take you out to lunch or ball games, or invite you to useless events to brainwash you with sales pitches. But my relationship with Arista was good; the people we were interacting with on the sales and SE side were absolutely stellar. In April 2011, they invited me to an event they regularly hold, a "Customer Exchange", at their HQ. I wasn't convinced this would be a good use of my time, but I decided to give it a shot, and RSVPed yes.<br />
<br /></div>
<div>
I remember coming home that evening of April, and telling my wife "wow, if I was looking for a job, I'd definitely consider Arista". The event was entirely bullshit-free, and I got to meet the <a href="http://www.arista.com/en/company/management-team">exec team</a>, who literally blew me away. If you know me, you know I'm not impressed easily, but that day I was really unsettled by what I'd seen. I didn't want to change jobs then, so I tried to get over it.</div>
<div>
<br /></div>
<div>
Over the following year, I went to their 2 subsequent customer exchanges, and each time I came back with that same feeling of "darn, these guys are awesome". I mean, I knew the product already; I knew why it was good, as well as what its problems, limitations, and areas for improvement were, because I used it daily. I knew the roadmap, so it was clear to me where the company was headed (I unfortunately couldn't say so for Twitter). Everybody – mark my words – <i>everybody</i> I had met so far at Arista, with no exception, was stellar: support (TAC), sales, a handful of engineers, and all their execs and virtually all VPs, marketing, bizdev, etc.</div>
<div>
<br /></div>
<div>
So I decided to give it a shot and interview with them, and see where that would take me.</div>
<h2>
What's the deal with Arista's people?</h2>
<div>
Arista isn't your typical Silicon Valley company. First of all, it doesn't have any outside investors. The company was entirely funded by its founders, something quite unusual around the Valley, doubly so for a company that sells hardware. By the way, Arista isn't a hardware company. There are 3 times more software engineers than hardware engineers. Sure we do some really cool stuff on the hardware side, and our hardware engineers are really pushing the envelope, allowing us to build switches that run faster and in a smaller footprint than competitors that use the same chips. But most of the efforts and investments, and ultimately what really makes the difference, are in the software.</div>
<div>
<br /></div>
<div>
Let's take a look at the three founders, maybe you'll start to get a sense of why I speak so highly of Arista's people.</div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Andreas_bechtolsheim.jpg/160px-Andreas_bechtolsheim.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Andreas_bechtolsheim.jpg/160px-Andreas_bechtolsheim.jpg" /></a></div>
<a href="http://en.wikipedia.org/wiki/Andy_Bechtolsheim">Andy Bechtolsheim</a>, co-founder of Sun Microsystems, is one of the legends of Silicon Valley. He's one of the brains who put together hardware design, except he seems to do so one or two years ahead of everybody else. I always loved his talks at the Arista Customer Exchange as they gave me a glimpse of how technology was going to evolve over the next few years, a glimpse into the future. Generally he was right, although some of this predictions took more time than anticipated to materialize.<br />
Andy is truly passionate about that stuff, and he seems to have a special interest for optical technologies (e.g. 100Gbps transceivers and such). He's putting the German touch to our hardware engineering: efficiency. :)</div>
<div>
<br />
Then there is <a href="http://en.wikipedia.org/wiki/David_Cheriton">David Cheriton</a>, professor at Stanford, who isn't on his first stint with Andy. The two had founded Granite Systems in '95, which got acquired in just about a year by Cisco, for over $200M. This apparently made David a bit of a celebrity at Stanford, and in '98 two students called Larry & Sergey sought his advice to start their company, a search engine for the web. David invited them over to talk about their project, and also invited Andy. They liked the idea so much that they each gave them a $100k check to start Google. This 2x$100k investment alone yielded a 10000x return, so now you know why Arista didn't need to raise any money :)<br />
David is passionate about software engineering & distributed systems, and it should be no surprise that virtually all of Arista's software is built upon a framework that came out of David's work.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.arista.com/assets/images/mgmt-photos/kenneth_duda.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://www.arista.com/assets/images/mgmt-photos/kenneth_duda.jpg" height="200" width="142" /></a></div>
Last but not least, Ken Duda, who isn't new to the Arista gang either, as he was the first employee at Granite in '95. Ken is one of the most brilliant software engineers I've ever met. Other common points he shares with Andy and David: super low key, very pragmatic, visionary, incredibly intelligent, truly passionate about what he's doing. So passionate in fact that when Arista was hosting a 24h-long hackathon (<a href="https://twitter.com/tsunanet/status/307263997831426048">Hack-a-Switch</a>), he was eager to stay with us all night long to hack on some code (to be fair I think he slept about 2 hours on a beanbag). I will always remember this <a href="https://twitter.com/tsunanet/status/307480335338332161">WTF moment we had around 5am</a> with some JavaScript idiosyncrasy for the web interface we were building; that was epic (when you're tired...).<br />
Not only is Ken one of those extraordinary software engineers, he's also one of the best leaders I've met, and I'm glad he's our CTO, as he's pushing things in the right direction.<br />
<br />
Of course, it's not all about those three guys. What's even more amazing about Arista is that our <a href="http://www.arista.com/en/company/management-team">VPs of engineering</a> are like that too. The "management layer" is fairly thin, with only a handful of VPs in engineering and a handful of managers who got promoted based on meritocracy, and that "management layer", if I dare to call it that, is one of the most technically competent and most capable of driving a tech company that I've ever seen.<br />
<br />
I would also like to point out that our CEO is a woman, which is also unusual, unfortunately, for a tech company. It's a coincidence that today is International Women's Day, but let me just say that there is a reason why <a href="http://en.wikipedia.org/wiki/Jayshree_Ullal">Jayshree Ullal</a> frequently ranks high in lists such as "Top X most influential executives", "Top X most powerful people in technology", etc. Like everybody else at Arista, she has a very deep understanding of the industry, our technology, what we're building, <i>how</i> we're building it, and where we should be going next.<br />
<br />
Heck, even our VP of <i>marketing</i>, <a href="https://twitter.com/dgourlay">Doug Gourlay</a>, could be VP of engineering or CTO at other tech companies. I remember the first time I saw him, at the first Arista Customer Exchange: I couldn't help but think "here comes the marketing guy". But not only did his talk make a lot of sense, he was also explaining why our approach to configuring networks today sucks and how it could be done better, and he was spot on. For a moment I just thought he was really good at talking about something he didn't genuinely understand, a common trait of alluring VPs of marketing, but as he kept talking and correctly answering questions, no matter how technical, it was obvious that he knew exactly what he was talking about. Mind=blown.<br />
<h2>
Company culture</h2>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://twitter.com/tsunanet/status/307264719373344768" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img alt="Hack-a-switch" border="0" height="200" src="https://pbs.twimg.com/media/BEOfomLCIAAPNH3.jpg" title="" width="164" /></a></div>
<div>
So we have a bunch of tech leaders, some of the sharpest minds in this industry, who are all passionate, low-key, and want to build the best datacenter networking gear out there. This has a profound impact on company culture, and Doug made something click in my mind not so long ago: company culture <i>is</i> a lasting competitive advantage. Company culture is what enables you to hire, design, build, drive, and ship a product in one way vs another. It's incredibly important.</div>
<div>
<br /></div>
<div>
Arista's culture is open, "do the right thing", "if you see something wrong/broken, fix it because you can", a lot like Google. No office drama – yes, Silicon Valley startups tend to have a fair bit of office drama. Ken is particularly sensitive to all the bullshit things you typically see in management, ridiculous processes (e.g. Cisco's infamous "manage out the bottom 10% performers in your organization"), red tape, and other stupid, unproductive things. Therefore this simply doesn't exist at Arista.</div>
<div>
<br /></div>
<div>
One of the striking peculiarities of the engineering culture at Arista that I haven't seen anywhere else (not saying that it doesn't exist anywhere else, just that I personally never came across this), is that teams aren't very well defined groups. Teams form and dissolve as projects come and go. People try to gravitate around the projects they're interested in, and those who end up working together on a particular project make up the de facto team of that project, for the duration of that project. Then they move along and go do something else with other people. It's incredibly flexible.<br />
<br />
So all in all, I'm very happy I joined Arista, although I'm sure it would have been a lot of fun too with my friends over at Twitter or CloudFlare. There are a lot of very exciting things happening right now, and a lot of cool challenges to be tackled ahead of us.<br />
<br />
Jan 2018 update: <a href="http://blog.tsunanet.net/2018/01/why-i-left-arista-networks.html">I just left Arista</a>.<br />
<br />
<small>Additional disclaimer for this post: the views expressed in this blog are my own, and Arista didn't review/approve/endorse anything I wrote here.</small></div>
<h2>Google uses captcha to improve StreetView image recognition (2013-02-06)</h2>I just stumbled on one of these for the first time:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5FzbU9cpkBEfvh4Lh-hbJ0f0Mt5K68sF837JYBsUtcsOYeqyXjhQMqLnKjabDwIOAiFymRSGg1oiBEWd4TeVaLSrDZyp-kpOgnaWfBUDfN2hMFRjgw-5Qj2qvF4_32mW-EKk8lQlnaxA/s1600/streetview-captcha-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5FzbU9cpkBEfvh4Lh-hbJ0f0Mt5K68sF837JYBsUtcsOYeqyXjhQMqLnKjabDwIOAiFymRSGg1oiBEWd4TeVaLSrDZyp-kpOgnaWfBUDfN2hMFRjgw-5Qj2qvF4_32mW-EKk8lQlnaxA/s400/streetview-captcha-1.png" width="316" /></a></div>
Here's another one:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimGBDvBfB4UCLtikMweuknn8JQqPvuOFNOFzsPYIZCYOIJ0OiNTIHmmePdpXPltRRVG0raN6AwehQzdza34MDMsVrJPh6N8Aj5MJu0OKLsA_C7GoA0HpfeipUhP1nJulkk-ivIsAgXK2M/s1600/streetview-captcha-2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimGBDvBfB4UCLtikMweuknn8JQqPvuOFNOFzsPYIZCYOIJ0OiNTIHmmePdpXPltRRVG0raN6AwehQzdza34MDMsVrJPh6N8Aj5MJu0OKLsA_C7GoA0HpfeipUhP1nJulkk-ivIsAgXK2M/s400/streetview-captcha-2.png" width="316" /></a></div>
These were on some Blogger blogs. Looks like Google is using captchas to help improve StreetView's address extraction quality.

<h2>Using debootstrap with grsec (2013-01-27)</h2>If you attempt to use <code>debootstrap</code> with grsec (more specifically with a kernel compiled with <code>CONFIG_GRKERNSEC_CHROOT_MOUNT=y</code>), you may see it bail out because of this error:<br />
<pre>W: Failure trying to run: chroot <i>path/to/root</i> mount -t proc proc /proc</pre>
One way to work around this is to bind-mount procfs into the new chroot. Just apply the following patch before running <code>debootstrap</code>:<br />
<pre>--- /usr/share/debootstrap/functions.orig  2013-01-27 02:05:55.000000000 -0800
+++ /usr/share/debootstrap/functions       2013-01-27 02:06:39.000000000 -0800
@@ -975,12 +975,12 @@
         umount_on_exit /proc/bus/usb
         umount_on_exit /proc
         umount "$TARGET/proc" 2>/dev/null || true
<b>-        in_target mount -t proc proc /proc
+        sudo mount -o bind /proc "$TARGET/proc"</b>
         if [ -d "$TARGET/sys" ] && \
            grep -q '[[:space:]]sysfs' /proc/filesystems 2>/dev/null; then
             umount_on_exit /sys
             umount "$TARGET/sys" 2>/dev/null || true
<b>-            in_target mount -t sysfs sysfs /sys
+            sudo mount -o bind /sys "$TARGET/sys"</b>
         fi
         on_exit clear_mtab
         ;;</pre>
As a side note, a minbase chroot of Precise (12.04 LTS) takes only 142MB of disk space.

<h2>Sudden large increases in MySQL slave lag caused by clock drift (2012-11-09)</h2>Just in case this ever helps anyone else, I had a machine where slave lag (as reported by <code>Seconds_Behind_Master</code> in <code>SHOW SLAVE STATUS</code>) would sometimes suddenly jump to 7 hours and then come back, and jump again, and come back.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5gb1Q4I57IMm1iQ7YaKc-QwEfl_h8YmY5s_04JZpNLRyIcHd2sLA5RJnpZJYFjo-Ch0h7ocst4NzI3E57lg_3ePgEKKEWRMmoo1OHWRF2EHjSy7GfhtqwT5UpHWgfsL8dFNVUnF5IINQ/s1600/Seconds_Behind_Master_sudden_jump.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5gb1Q4I57IMm1iQ7YaKc-QwEfl_h8YmY5s_04JZpNLRyIcHd2sLA5RJnpZJYFjo-Ch0h7ocst4NzI3E57lg_3ePgEKKEWRMmoo1OHWRF2EHjSy7GfhtqwT5UpHWgfsL8dFNVUnF5IINQ/s800/Seconds_Behind_Master_sudden_jump.png" width="800" /></a></div>
<br />
Turns out, the machine's clock was off by 7 hours and no one had noticed! After fixing NTP synchronization, the issue remained; I suspect that MySQL keeps a base timestamp in memory that was still off by 7 hours.<br />
<br />
The fix was to <code>STOP SLAVE; START SLAVE;</code>

<h2>Python's screwed up exception hierarchy (2012-10-18)</h2>Doing this in Python is bad bad bad:<br />
<pre>try:
    # some code
except Exception, e:  # Bad
    log.error("Uncaught exception!", e)
</pre>
Yet you need to do something like that, typically in the event loop of an application server, or when one library is calling into another library and needs to make sure that no exception escapes from the call, or that all exceptions are re-packaged in another type of exception.<br />
<br />
The reason the above is bad is that Python badly screwed up their standard exception hierarchy.
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEmvdwG8-0Dq_82cYQyGshzTv0dnWFDHoGcGjc28QdiYE_2lrS_jmFmFrPuOfRYuM4NVqYqwuPyBxCgFZLCqmVQIMdd2K3YiQyZZ0M6hZ96eDxYtKL1rUdKTzQnIFh6JYW4VpZsDYkvSs/s1600/snake-eating-itself-fail.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEmvdwG8-0Dq_82cYQyGshzTv0dnWFDHoGcGjc28QdiYE_2lrS_jmFmFrPuOfRYuM4NVqYqwuPyBxCgFZLCqmVQIMdd2K3YiQyZZ0M6hZ96eDxYtKL1rUdKTzQnIFh6JYW4VpZsDYkvSs/s320/snake-eating-itself-fail.jpg" width="320" /></a>
<pre>__builtin__.object
    BaseException
        Exception
            StandardError
                ArithmeticError
                <b>AssertionError</b>
                AttributeError
                BufferError
                EOFError
                EnvironmentError
                <b>ImportError</b>
                LookupError
                MemoryError
                <b>NameError</b>
                    <b>UnboundLocalError</b>
                ReferenceError
                RuntimeError
                    NotImplementedError
                <b>SyntaxError</b>
                    <b>IndentationError</b>
                        <b>TabError</b>
                <b>SystemError</b>
                TypeError
                ValueError
</pre>
Meaning, if you try to catch all <code>Exception</code>s, you're also hiding real problems like syntax errors (!!), typoed imports, etc. But then what are you gonna do? Even if you wrote something silly such as:<br />
<pre>try:
    # some code
except (ArithmeticError, ..., ValueError), e:
    log.error("Uncaught exception!", e)
</pre>
You still wouldn't catch the many cases where people define new types of exceptions that inherit directly from <code>Exception</code>. So it looks like your only option is to catch <code>Exception</code> and then filter out things you really don't want to catch, e.g.:
<br />
<pre>try:
    # some code
except Exception, e:
    if isinstance(e, (AssertionError, ImportError, NameError, SyntaxError, SystemError)):
        raise
    log.error("Uncaught exception!", e)
</pre>
But then nobody does this. And pylint still complains.<br />
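If you do want to go down that road, one way to make the filtering reusable is to keep the list of "programming error" types in one place and wrap the check in a small helper. This is just my own sketch (sticking to the post's Python 2 syntax); <code>handle_request</code> is a placeholder for whatever your event loop actually dispatches to:
<br />
<pre>import logging

log = logging.getLogger(__name__)

# Exceptions that almost always indicate a bug rather than a runtime condition.
PROGRAMMING_ERRORS = (AssertionError, ImportError, NameError,
                      SyntaxError, SystemError)

def log_or_reraise(exc):
    """Re-raise programming errors, log everything else."""
    if isinstance(exc, PROGRAMMING_ERRORS):
        raise  # bare raise: re-raises the exception currently being handled
    log.error("Uncaught exception: %r", exc)

try:
    handle_request()  # placeholder
except Exception, e:
    log_or_reraise(e)
</pre>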
<br />
Unfortunately it looks like Python 3.0 didn't fix the problem :( – they only moved <code>SystemExit</code>, <code>KeyboardInterrupt</code>, and <code>GeneratorExit</code> to be subclasses of <code>BaseException</code> but that's all.<br />
<br />
They should have introduced another separate level of hierarchy for those errors that you generally don't want to catch because they are programming errors or internal errors (i.e. bugs) in the underlying Python runtime.

<h2>Perforce killed my productivity. Again. (2012-10-06)</h2>I've used Perforce for 2 years at Google. Google got a lot of things right, but Perforce has always been a pain in the ass to deal with, despite the huge amount of tooling Google built on top. I miss a lot of things from my days at Google, but Perforce is definitely not on the list. Isn't it ironic that for a company that builds large distributed systems on commodity machines, their P4 server had to be by far the beefiest, most expensive server? Oh and guess what ended up happening to P4 at Google?<br />
<br />
Anyways, after a 3 year break during which I happily forgot my struggle with Perforce, I am now back to using it. Sigh. Now what's 'funny' is that Arista has the same problem as Google: they locked themselves in through tools. When you have a large code base of tools built on top of an SCM, it's really, <i>really</i> hard to migrate to something else.<br />
<br />
Arista, like Google, literally has tens of thousands of lines of code of tools built around Perforce. It's kind of ironic that Perforce, the company, doesn't appear to have done anything actively evil to lock the customers in. The customers got locked in by themselves. Also note that in both of these instances the companies started quite a few years ago, back when Git didn't exist, or barely existed in Arista's case, so Perforce was a reasonable choice at the time (provided you had the $$$, that is) given that the only other options then were quite brain damaging.<br />
<br />
Now I could go on and repeat all the things that have been written many times all over the web about why Perforce sucks. Yes it's slow, yes you can't work offline, yes you can't do anything that doesn't make it wanna talk to the server, yes it makes all your freaking files read-only and it forces you to tell the server that you're going to edit a file, etc.<br />
<br />
But Perforce has its own advantages too. It has quasi-decent branching / merging capabilities (merging is often more painful than with Git IMO). It gives you a flexible way to compose your working copy, what's in it, where it comes from. It's more forgiving for organizations that like to dump a lot of random crap in their SCM. This seems fairly common, people just find it convenient to commit binaries and such. It is convenient indeed if you lack better tools, but that doesn't mean it's right.<br />
<a href="http://cdn.memegenerator.net/instances/400x400/27903989.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img alt="Used to be a productive software engineer, took a P4 arrow in the knee" border="0" src="http://cdn.memegenerator.net/instances/400x400/27903989.jpg" height="320" title="" width="320" /></a><br />
So what's my gripe with Perforce? It totally ruins my workflow. This makes my life as a software engineer utterly miserable. I always work on multiple things at the same time. Most of the time they're related. I may be working on a big change, and I want to break it down into many small incremental steps. And I often like to revisit these steps. Or I just wanna go back and forth between a few somewhat related things as I work on an idea and sort of wander into connected ideas. And I want to get my code reviewed. Before it gets upstream.<br />
<br />
This means that I use <span style="font-family: Courier New, Courier, monospace;">git rebase</span> very, <i>very</i> extensively. And <span style="font-family: Courier New, Courier, monospace;">git stash</span>. I find that this is the hardest thing to explain to people who don't know Git. But once it clicks in your mind, and you understand how powerful <span style="font-family: Courier New, Courier, monospace;">git rebase</span> is, you realize it's the best Swiss army knife to manipulate your changes and their history. When it comes to writing code, it's literally my best friend after <span style="font-family: Courier New, Courier, monospace;">vim</span>.<br />
<br />
Git, as a tool to manipulate changes made to files, is several orders of magnitude better and more convenient. It's so simple to select what goes into what commit, undo, redo, squash, split, swap, drop, amend changes. I always feel like I can manipulate my code and commits effortlessly, that it's malleable, flexible. I'm removing some lint around some code I'm refactoring? No problem, <span style="font-family: Courier New, Courier, monospace;">git commit -p</span> to select hunk-by-hunk what goes into the refactoring commit and what goes into the "small clean up" commit. Perforce on the other hand doesn't offer anything but "mark this file for add/edit/delete" and "put these files in a change" and "commit the change". This isn't the 1990s anymore, but it sure feels like it.<br />
<br />
With Perforce you have to serialize your workflow, you have to accept committing things that will require subsequent "fix previous commit" commits, and thus you tend to commit fewer, bigger changes because breaking up a change into smaller chunks is a pain in the ass. And when you realize you got it wrong, you can't go back, you just have to fix it up with another change. And your project history is all fugly. I've used the <span style="font-family: Courier New, Courier, monospace;">patch</span> command more over the past 2 months than in the previous 3 years combined. I'm back to the stone age.<br />
<br />
Oh and you can't switch back and forth between branches. At all. Like, you just can't. Period. This means you have to maintain multiple workspaces and try to parallelize your work across them. I already have 8 workspaces across 2 servers at Arista, each of which contains mostly-the-same copy of several GB of code. The overhead to go back and forth between them is significant, so I end up switching a lot less than when I just do <span style="font-family: Courier New, Courier, monospace;">git checkout <i>somebranch</i></span>. And of course creating a new branch/workspace is extremely time consuming, as in we're talking minutes, so you really don't wanna do it unless you know you're going to amortize the cost over the next several days.<br />
<br />
I think the fact that P4 coerces you into a workflow that sucks shows in Perforce's marketing material and product strategy too. Now they're rolling out this Git integration, dubbed Perforce Git Fusion, that essentially makes the P4 server speak Git so that you can work with Git but still use P4 on the server. They sell it as "improving the Git experience". That must be the best joke of the year. But I think the reality is that engineers don't want to deal with the bullshit way of doing things Perforce imposes, and they want to work with Git. Anyways this integration sounds great, I would love to use it to stop the pain, only you have to be on a recent enough version of Perforce to be able to use it, and if you're not you "just" need to pay an arm and a fucking leg to upgrade.<br />
<br />
My lame workaround: overlay a Git repo on top of my P4 workspace, <span style="font-family: Courier New, Courier, monospace;">p4 edit</span> the files I want to work on, maintain the changes in Git until I'm ready to push them upstream. Still a royal PITA, but at least I can manipulate the files in my workspace.<br />
<br />
And then, of course, there is the problem that I'm impatient. I can't stand waiting more than 500ms at a prompt. It's quite rare to be able to <span style="font-family: Courier New, Courier, monospace;">p4 edit</span> a file in less than a second or two. At 1:30am on Saturday, after a dozen <span style="font-family: Courier New, Courier, monospace;">p4 edit</span>s in a row, I was able to get the latency down to 300-500ms (yes it really took a dozen edits/reverts in a row to reliably get lower latency). It often takes several <i>minutes</i> to trace the history of a file or a branch, or to blame a file ... when that's useful at all with Perforce.<br />
<br />
We're in 2012, soon 2013, running on 32 core 128GB RAM machines hooked to 10G/40G networks with an RTT of less than 60µs. Why would I ever need to wait more than a handful of milliseconds for any of these mundane things to happen?<br />
<br />
So, you know what Perforce, (╯°□°)╯︵ ┻━┻<br />
<br />
Edit: despite the fact that Arista uses Perforce, which is a bummer, I love that place, love the people I work with and what we're building. So you should join!tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com13tag:blogger.com,1999:blog-8260739278874294486.post-5059214607221823622012-04-14T20:35:00.000-07:002012-10-06T01:46:35.628-07:00How Apache Hadoop is molesting IOException all dayToday I'd like to rant on one thing that's been bugging me for last couple years with Apache Hadoop (and all its derived projects). It's a big issue that concerns us all. We have to admit it, each time we write code for the Apache Hadoop stack, we feel bad about it, but we try hard to ignore what's happening right before our eyes. I'm talking, of course, about the constant abuse and molestation of <code>IOException</code>.
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-cXqrHIpi8_NdHp-94xd47T5EcyM-8ONpG6RYbiJ9tPwiYComg7df7be5XQ18r7jB_KxTWbWJiFu3aIOhwnFItXJ7ZK9an3R-YCIAEBbudAL-13aVt1Jfsn7NTwCjYr7mCOpKnM2xbdM/s1600/hadoop-catch-ioexception.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="330" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-cXqrHIpi8_NdHp-94xd47T5EcyM-8ONpG6RYbiJ9tPwiYComg7df7be5XQ18r7jB_KxTWbWJiFu3aIOhwnFItXJ7ZK9an3R-YCIAEBbudAL-13aVt1Jfsn7NTwCjYr7mCOpKnM2xbdM/s400/hadoop-catch-ioexception.jpg" width="400" /></a></div>
I'm not even going to debate how checked exceptions are like communism (good idea in theory, totally fails in practice). Even if people don't get that, I wish they at least stopped the madness with this poor little <code>IOException</code>.
<br />
Let's review again what <a href="http://docs.oracle.com/javase/6/docs/api/java/io/IOException.html"><code>IOException</code></a> is for:
<br />
<blockquote>
"<i>Signals that an I/O exception of some sort has occurred. This class is the general class of exceptions produced by failed or interrupted I/O operations.</i>"</blockquote>
In Hadoop everything is an <code>IOException</code>. <em>Everything</em>. Some assertion fails, <code>IOException</code>. A number exceeds the maximum allowed by the config, <code>IOException</code>. Some protocol versions don't match, <code>IOException</code>. Hadoop needs to fart, <code>IOException</code>.
<br />
How are you supposed to handle these exceptions? Everything is declared as <code>throws IOException</code> and everything is catching, wrapping, re-throwing, logging, eating, and ignoring <code>IOException</code>s. Impossible. No matter what goes wrong, you're left clueless. And it's not like there is a nice exception hierarchy to help you handle them. No, virtually everything is just a bare <code>IOException</code>.
<br />
Because of this, it's not uncommon to see code that inspects the message of the exception (a bare <code>String</code>) to try to figure out what's wrong and what to do with it. A friend of mine was recently explaining to me how Apache Kafka was "stringly typed" (a new cutting-edge paradigm whereby you show the middle finger to the type system and stuff everything in <code>String</code>s). Well Hadoop has invented something better than checked exceptions: they have <em>stringed exceptions</em>. Unfortunately, half of the time you can't even leverage this awesome new idiom because the message of the exception itself is useless. For example, when a MapReduce job chokes on a corrupted file, it will just throw an <code>IOException</code> without telling you the path of the problematic file. This way it's more fun: once you nail it down (with a binary search of course), you feel like you accomplished something. Or you'll get messages like "<code>IOException: Split metadata size exceeded 10000000.</code>". Figuring out what the actual value was is left as an exercise for the reader.
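<br />
To make this concrete, here's the kind of code this design forces on users. This is a hypothetical fragment (made-up method name), not code lifted from Hadoop:
<pre>try {
  someHadoopCall();  // Hypothetical call, declared as "throws IOException" like everything else.
} catch (IOException e) {
  // "Stringed exceptions" in action: the message is the only way to tell
  // what actually went wrong, so we grep it.
  final String msg = e.getMessage();
  if (msg != null && msg.contains("Split metadata size exceeded")) {
    // Bump the relevant limit in the job configuration and resubmit.
  } else {
    throw e;  // No idea what this is, let the caller figure it out.
  }
}</pre>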
<br />
So, seriously Apache folks...
<br />
<div style="border: 2px solid; font-size: xx-large; text-align: center; width: 100%;">
Stop Abusing <code>IOException</code>!</div>
Leave this poor little <code>IOException</code> alone!
<br />
Hadoop (0.20.2) currently has a whopping 1300+ lines of code creating bare <code>IOException</code>s. HBase (0.92.1) has over 400. Apache committers should consider <em>every single one</em> of these lines as a code smell that needs to be fixed, that's <em>begging</em> to be fixed. Please introduce a new base exception type, and create a sound exception hierarchy.
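<br />
For the record, here's roughly what I mean. This is a sketch of one possible shape for such a hierarchy, with made-up class names (not an actual Hadoop API):
<pre>// Sketch only: hypothetical classes, each top-level class in its own file.
public class HadoopException extends Exception {
  public HadoopException(final String msg) {
    super(msg);
  }
  public HadoopException(final String msg, final Throwable cause) {
    super(msg, cause);
  }
}

/** Thrown when an input file can't be read or parsed. */
public class CorruptInputException extends HadoopException {
  private final String path;

  public CorruptInputException(final String path, final Throwable cause) {
    super("Corrupted or unreadable input: " + path, cause);
    this.path = path;  // Carry the path along instead of burying it in (or omitting it from) the message.
  }

  public String getPath() {
    return path;
  }
}</pre>
With something like this, callers can catch <code>CorruptInputException</code> specifically, and the offending path travels with the exception instead of having to be rediscovered by binary search.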
<br />
<b>Updates</b>:
<br />
<ul>
<li>Apr 15: There is now an <a href="https://issues.apache.org/jira/browse/HBASE-5796">issue for HBase to fix their abuse of <code>IOException</code> (HBASE-5796)</a>.</li>
<li>Will update if someone from Hadoop/HDFS/MapReduce files a similar issue on their side.</li>
</ul>
<br />
tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com0tag:blogger.com,1999:blog-8260739278874294486.post-41779055967078471702012-02-06T00:16:00.000-08:002012-02-06T09:52:41.759-08:00Devirtualizing method calls in JavaIf you've read code I wrote, chances are you've seen I'm a strong adept of <a href="http://www.parashift.com/c++-faq-lite/const-correctness.html">const correctness</a> (<a href="http://en.wikipedia.org/wiki/Const-correctness">WP</a>). Naturally, when I started writing Java code (to my despair), I became equally adept of "final correctness". This is mostly because the keywords <code>const</code> (C/C++) and <code>final</code> (Java/Scala) are truly here to help the compiler help you. Many things aren't supposed to change. References in a given scope are often not made point to another object, various methods aren't supposed to be overridden, most classes aren't designed to be subclassed, etc. In C/C++ <code>const</code> also helps avoid doing unintentional pointer arithmetic. So when something isn't supposed to happen, if you state it explicitly, you allow the compiler to catch and report any violation of this otherwise implicit assumption.
<p>
The other aspect of const correctness is that you also help the compiler itself. Often the extra bit of information enables it to produce more efficient code. In Java especially, <a href="http://java.sun.com/docs/books/jls/third_edition/html/memory.html#66562"><code>final</code> plays an important role in thread safety</a>, and when used on <code>String</code>s as well as built-in types. Here's an example of the latter:
<pre>
1 final class concat {
2 public static void main(final String[] _) {
3 String a = "a";
4 String b = "b";
5 System.out.println(a + b);
6 final String X = "X";
7 final String Y = "Y";
8 System.out.println(X + Y);
9 }
10 }
</pre>
Which gets compiled to:
<pre>
public static void main(java.lang.String[]);
Code:
0: ldc #2; //String a
2: astore_1
3: ldc #3; //String b
5: astore_2
6: getstatic #4; //Field java/lang/System.out:Ljava/io/PrintStream;
9: new #5; //class java/lang/StringBuilder
12: dup
13: invokespecial #6; //Method java/lang/StringBuilder."<init>":()V
16: aload_1
17: invokevirtual #7; //Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
20: aload_2
21: invokevirtual #7; //Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
24: invokevirtual #8; //Method java/lang/StringBuilder.toString:()Ljava/lang/String;
27: invokevirtual #9; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
30: getstatic #4; //Field java/lang/System.out:Ljava/io/PrintStream;
33: ldc #10; //String XY
35: invokevirtual #9; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
38: return
}
</pre>
In the original code, lines 3-4-5 are identical to lines 6-7-8 modulo the presence of two <code>final</code> keywords. Yet, lines 3-4-5 get compiled to 14 byte code instructions (lines 0 through 27), whereas 6-7-8 turn into only 3 (lines 30 through 35). I find it kind of amazing that the compiler doesn't even bother optimizing such a simple piece of code, even when used with the <code>-O</code> flag which, most people say, is almost a no-op as of Java 1.3 – at least I checked in OpenJDK6, and it's truly a no-op there, the flag is only accepted for backwards compatibility. OpenJDK6 has a <code>-XO</code> flag instead, but the Sun Java install that comes on Mac OS X doesn't recognize it...
<p>
There was another thing that I thought was a side effect of <code>final</code>. I thought any method marked <code>final</code>, or any method in a class marked <code>final</code>, would allow the <em>compiler</em> to devirtualize method calls. Well, it turns out that I was wrong. Not only does it not do this, the JVM considers this compile-time optimization downright illegal! Only the JIT compiler is allowed to do it.
<p>
All method calls in Java are compiled to an <a href="http://java.sun.com/docs/books/jvms/second_edition/html/Instructions2.doc6.html#invokevirtual"><code>invokevirtual</code></a> byte code instruction, except:
<ul>
<li>Constructors and private methods use <code>invokespecial</code>.</li>
<li>Static methods use <code>invokestatic</code>.</li>
<li>Virtual method calls on objects with a static type that is an interface use <code>invokeinterface</code>.</li>
</ul>
The last one is weird: one might wonder why special-case virtual method calls when the static type is an interface. The reason essentially boils down to the fact that if the static type is not an interface, then we know at compile-time which entry in the vtable to use for that method, and all we have to do at runtime is essentially to read that entry from the vtable. If the static type is an interface, the compiler doesn't even know which entry in the vtable will be used, as this depends on where in the class hierarchy the interface gets implemented.
<p>
Anyway, I always imagined that having a <code>final</code> method meant that the compiler would compile all calls to it using <code>invokespecial</code> instead of <code>invokevirtual</code>, to "devirtualize" the method calls since it already knows for sure at compile-time where to transfer execution. Doing this at compile time seems like a trivial optimization, while leaving this up to the JIT is far more complex. But no, the compiler doesn't do this. It's not even legal to do it!
<pre>
interface iface {
int foo();
}
class base implements iface {
public int foo() {
return (int) System.nanoTime();
}
}
final class sealed extends base { // Implies that foo is final
}
final class sealedfinal extends base {
public final int foo() { // Redefine it to be sure / help the compiler.
return super.foo();
}
}
public final class devirt {
public static void main(String[] a) {
int n = 0;
final iface i = new base();
n ^= i.foo(); // invokeinterface
final base b = new base();
n ^= b.foo(); // invokevirtual
final sealed s = new sealed();
n ^= s.foo(); // invokevirtual
    final sealedfinal sf = new sealedfinal();
    n ^= sf.foo();          // invokevirtual
}
}
</pre>
A simple <a href="http://code.google.com/p/caliper/">Caliper</a> benchmark also shows that in practice all 4 calls above have exactly the same performance characteristic (see <a href="https://gist.github.com/1750505">full microbenchmark</a>). This seems to indicate that the JIT compiler is able to devirtualize the method calls in all these cases.
<p>
To try to manually devirtualize one of the last two calls, I applied a binary patch (courtesy of <code>xxd</code>) on the <code>.class</code> generated by <code>javac</code>. After doing this, <code>javap</code> correctly shows an <code>invokespecial</code> instruction. To my dismay the JVM then rejects the byte code: <code>Exception in thread "main" java.lang.VerifyError: (class: devirt, method: timeInvokeFinalFinal signature: (I)I) Illegal use of nonvirtual function call</code>
<p>
I find the wording of the JLS slightly ambiguous as to whether or not this is truly illegal, but in any case the Sun JVM rejects it, so it can't be used anyway.
<p>
The moral of the story is that <code>javac</code> is really only translating Java code into pre-parsed Java code. Nothing interesting happens at all in the "compiler", which should really be called the pre-parser. They don't even bother doing any kind of trivial optimization. <em>Everything</em> is left up to the JIT compiler. Also Java byte code is bloated, but then it's normal, it's Java :)tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com12tag:blogger.com,1999:blog-8260739278874294486.post-50352099890209396772011-10-08T12:16:00.000-07:002011-10-08T13:44:16.048-07:00Hardware Growler for Mac OS X LionJust in case this could be of any use to someone else, I compiled Growl 1.2.2 for Lion with the fix for <a href="http://code.google.com/p/growl/issues/detail?id=223">HardwareGrowler crash on Lion</a> that happens when disconnecting from a wireless network or waking up the Mac.
You can <a href="http://tsunanet.net/~tsuna/Growl-1.2.2-Lion-x86_64.dmg">download it here</a>. The binary should work on Snow Leopard too. It's only compiled for x86_64 CPUs.tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com12tag:blogger.com,1999:blog-8260739278874294486.post-74601602327984152102011-09-13T11:12:00.000-07:002012-04-26T23:41:22.125-07:00ext4 2x faster than XFS?For a lot of people, the conventional wisdom is that XFS outperforms ext4. I'm not sure whether this is just because XFS used to be a lot faster than ext2 or ext3 or what. I don't have anything against XFS, and actually I would like to see it outperform ext4, unfortunately my benchmarks show otherwise. I'm wondering whether I'm doing something wrong.
<p/>
In the benchmark below, the same machine and same HDDs were tested with 2 different RAID controllers. In most tests, ext4 has better results than XFS. In some tests, the difference is as much as 2x. Here are the details of the config:
<ul>
<li>CPU: 2 x <a href="http://ark.intel.com/products/47927">Intel L5630</a> (Westmere microarchitecture, so 2x4x2 = 16 hardware threads and lots of caches)</li>
<li>RAM: 2 x 6 x 8GB = 96GB DDR3 ECC+Reg Dual-Rank DIMMs</li>
<li>Disks: 12 x <a href="http://www.wdc.com/en/products/products.aspx?id=30">Western Digital (WD) RE4</a> (model: WD2003FYYS – 2TB SATA 7200rpm)</li>
<li>RAID controllers: <a href="http://www.adaptec.com/en-us/products/controllers/hardware/sas/performance/sas-51645/">Adaptec 51645</a> and <a href="http://www.lsi.com/products/storagecomponents/Pages/MegaRAIDSAS9280-16i4e.aspx">LSI MegaRaid 9280-16i4e</a></li>
</ul>
Both RAID controllers are equipped with 512MB of RAM and are in their respective default factory config, except that WriteBack mode was enabled on the LSI because it's disabled by default (!). One other notable difference between the default configurations is that the Adaptec uses a strip size of 256k whereas the LSI uses 64k – this was left unchanged. Both arrays were created as RAID10 (6 pairs of 2 disks, so no spares). One controller was tested at a time, in the same machine and with the same disks. The OS (Linux 2.6.32) was on a separate RAID1 of 2 drives. The IO scheduler in use was "deadline". <a href="http://sysbench.sourceforge.net/">SysBench</a> was using <code>O_DIRECT</code> on 64 files, for a total of 100GB of data.
<p/>
Some observations:
<ul>
<li>Formatting XFS with the <a href="/2011/08/mkfsxfs-raid10-optimal-performance.html">optimal values for <code>sunit</code> and <code>swidth</code></a> doesn't lead to much better performance. The gain is about 2%, except for sequential writes where it actually makes things <em>worse</em>. Yes, there was no partition table, the whole array was formatted directly as one single big filesystem.</li>
<li>Creating more allocation groups in XFS than physical threads doesn't lead to better performance.</li>
<li>XFS has much better random write throughput at low concurrency levels, but quickly degrades to the same performance level as ext4 with more than 8 threads.</li>
<li>ext4 has consistently better random read/write throughput and latency, even at high concurrency levels.</li>
<li>Similarly, for random reads ext4 also has much better throughput and latency.</li>
<li>By default XFS creates too few allocation groups, which artificially limits its performance at high concurrency levels. It's important to create as many AGs as hardware threads. ext4, on the other hand, doesn't really need any tuning as it performs well out of the box.</li>
</ul>
<p/>
See the <a href="http://tsunanet.net/~tsuna/benchmarks/ext4-xfs-raid10/sysbench.html">benchmark results</a> in full screen or look at the <a href="http://tsunanet.net/~tsuna/benchmarks/ext4-xfs-raid10/">raw outputs</a> of SysBench.
<iframe src="http://tsunanet.net/~tsuna/benchmarks/ext4-xfs-raid10/sysbench.html" width="940" height="1000"><a href="http://tsunanet.net/~tsuna/benchmarks/ext4-xfs-raid10/sysbench.html">See the benchmark results</a></iframe>tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com14tag:blogger.com,1999:blog-8260739278874294486.post-62393513719231788622011-08-27T20:13:00.003-07:002011-08-27T22:07:39.878-07:00Hitachi 7K3000 vs WD RE4 vs Seagate Constellation ESThese days, the <a href="http://www.hitachigst.com/internal-drives/desktop/deskstar/deskstar-7k3000">Hitachi 7K3000</a> seems like the best bang for your bucks. You can get 2TB disks for around US$100. The 7K3000 isn't an "enterprise disk", so many people wouldn't buy it for their servers.
<br />It's not clear what disks sold with the Enterprise™©® label really do to justify the big price difference. Often it seems like the hardware is exactly the same, but the firmware behaves differently, notably to report errors faster. In desktop environments, you want the disk to try hard to read bad sectors, but in RAID arrays it's better to give up quickly and let the RAID controller know, otherwise the disks might timeout from the controller's point of view, and the whole disk might be incorrectly considered dead and trigger a spurious rebuild.
<br />So I recently benchmarked the Hitachi 7K3000 against two other "enterprise" disks, the Western Digital RE4 and the Seagate Constellation ES.
<br /><h3>The line up</h3><ul><li><a href="http://www.hitachigst.com/internal-drives/desktop/deskstar/deskstar-7k3000">Hitachi 7K3000</a> model: HDS723020BLA642 – the baseline</li><li><a href="http://www.wdc.com/en/products/products.aspx?id=30">Western Digital (WD) RE4</a> model: WD2003FYYS</li><li><a href="http://www.seagate.com/www/en-us/products/enterprise-hard-drives/constellation-es/constellation-es-1/">Seagate Constellation ES</a> model: ST2000NM0011</li></ul>All disks are 3.5" 2TB SATA 7200rpm with 64MB of cache, all but the WD are 6Gb/s SATA. The WD is 3Gb/s – not that this really matters, as I have yet to see a spinning disk of this grade exceed 2Gb/s.
<br />Both enterprise disks cost about $190, so about 90% more (almost double the price) than the Hitachi. Are they worth the extra money?
<br /><h3>The test</h3>I ended up using <a href="http://sysbench.sourceforge.net/">SysBench</a> to compare the drives. I had all 3 drives connected to the motherboard of the same machine, a dual <a href="http://ark.intel.com/products/47927">L5630</a> with 96GB of RAM, running Linux 2.6.32. Drives and OS were using their default config, except the "deadline" IO scheduler was in effect (whereas vanilla Linux uses CFQ by default since 2.6.18). SysBench used <code>O_DIRECT</code> for all its accesses. Each disk was formatted with ext4 – no partition table, the whole disk was used directly. Default formatting and mount options were used. SysBench was told to use 64 files, for a total of 100GB of data. Every single test was repeated 4 times and then averages were plotted. Running all the tests takes over 20h.
<br />SysBench produces some kind of a free-form output which isn't very easy to use. So I wrote a Python script to parse the results and a bit of JavaScript to visualize them. The code is available on GitHub: <a href="https://github.com/tsuna/sysbench-tools">tsuna/sysbench-tools</a>.
<br /><h3>Results</h3>A picture is worth a thousand words, so <a href="http://tsunanet.net/~tsuna/benchmarks/7K3000-RE4-ConstellationES/sysbench.html">take a look at the graphs</a>. Overall the WD RE4 is a clear winner for me, as it outperforms its 2 buddies on all tests involving random accesses. The Seagate doesn't seem worth the money. Although it's the best at sequential reads, the Hitachi is pretty much on par with it while costing almost half as much.
<br />So I'll buy the Hitachi 7K3000 for everything, and pay the extra premium for the WD RE4 for MySQL servers, because MySQL isn't a cheap bastard and needs every drop of performance it can get out of the IO subsystem. No, I don't want to buy ridiculously expensive and power-hungry 15k RPM SAS drives, thank you.
<br />The raw outputs of SysBench are available here: <a href="http://tsunanet.net/~tsuna/benchmarks/7K3000-RE4-ConstellationES">http://tsunanet.net/~tsuna/benchmarks/7K3000-RE4-ConstellationES</a>tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com5tag:blogger.com,1999:blog-8260739278874294486.post-50531266720419615632011-08-19T15:20:00.006-07:002011-08-23T10:48:56.961-07:00Formatting XFS for optimal performance on RAID10XFS has terribly bad performance out of the box, especially on large RAID arrays. Unlike ext4, the filesystem needs to be formatted with the right parameters to perform well. If you don't get the parameters right, you need to reformat the filesystem as they can't be changed later.
<br />
<br />The 3 main parameters are:<ul><li><code>agcount</code>: Number of allocation groups</li><li><code>sunit</code>: Stripe size (as configured on your RAID controller)</li><li><code>swidth</code>: Stripe width (number of data disks, excluding parity / spare disks)</li></ul>Let's take an example: you have 12 disks configured in a <a href="http://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_1_.2B_0">RAID 10</a> (so 6 pairs of disks in RAID 1, and RAID 0 across the 6 pairs). Let's assume the RAID controller was instructed to use a stripe size of 256k. Then we have:
<br /><ul><li><code>sunit</code> = 256k / <i>512</i> = 512</i>, because <code>sunit</code> is in multiple of <i>512</i> byte sectors<li><code>swidth</code> = <i>6</i> * 512 = 3072, because in a RAID10 with 12 disks we have <i>6</i> data disks excluding parity disks (and no hot spares in this case)</li></ul>Now XFS internally split the filesystem into "allocation groups" (AG). Essentially an AG is like a filesystem on its own. XFS splits the filesystem into multiple AGs in order to help increase parallelism, because each AG has its own set of locks. My rule of thumb is to create as many AGs as you have hardware threads. So if you have a dual-CPU configuration, with 4 cores with HyperThreading, then you have 2 x 4 x 2 = 16 hardware threads, so you should create 16 AGs.<pre>$ sudo mkfs.xfs -f -d sunit=512,swidth=$((512*6)),agcount=16 /dev/sdb<br/>Warning: AG size is a multiple of stripe width. This can cause performance<br/>problems by aligning all AGs on the same disk. To avoid this, run mkfs with<br/>an AG size that is one stripe unit smaller, for example 182845376.<br/>meta-data=/dev/sdb isize=256 agcount=16, agsize=182845440 blks<br/> = sectsz=512 attr=2<br/>data = bsize=4096 blocks=2925527040, imaxpct=5<br/> = <b>sunit=64 swidth=384 blks</b><br/>naming =version 2 bsize=4096 ascii-ci=0<br/>log =internal log bsize=4096 blocks=521728, version=2<br/> = sectsz=512 sunit=64 blks, lazy-count=1<br/>realtime =none extsz=4096 blocks=0, rtextents=0</pre>Now from the output above, we can see 2 problems:<ol><li>There's this warning message we better pay attention to.</li><li>The values of <code>sunit</code> and <code>swidth</code> printed don't correspond to what we asked for.</li></ol>The reason the values printed don't match what we wanted is because they're in multiples of "block size". We can see that <code>bsize=4096</code>, so sure enough the numbers match up: 4096 x 64 = 512 x 512 = our stripe size of 256k.
<br />
<br />Now let's look at this warning message. It suggests that we use <code>agsize=182845376</code> instead of <code>agsize=182845440</code>. When we specified the number of AGs we wanted, XFS automatically figured out the size of each AG, but now it's complaining that this size is suboptimal. Yay. Now <code>agsize</code> is specified in blocks (so multiples of 4096), but the command line tool expects the value in bytes. At this point you're probably thinking like me: "you must be kidding me, right? Some options are in bytes, some in sectors, some in blocks?!" Yes.
<br />
<br />So to make it all work:<pre>$ sudo mkfs.xfs -f -d sunit=512,swidth=$((512*6)),agsize=$((182845376*4096)) /dev/sdb<br/>meta-data=/dev/sdb isize=256 agcount=16, agsize=182845376 blks<br/> = sectsz=512 attr=2<br/>data = bsize=4096 blocks=2925526016, imaxpct=5<br/> = sunit=64 swidth=384 blks<br/>naming =version 2 bsize=4096 ascii-ci=0<br/>log =internal log bsize=4096 blocks=521728, version=2<br/> = sectsz=512 sunit=64 blks, lazy-count=1<br/>realtime =none extsz=4096 blocks=0, rtextents=0</pre>It's critical that you get this right before you start using the filesystem. There's no way to change them later. You might be tempted to try using <code>mount -o remount,sunit=X,swidth=Y</code>, and the command will succeed but do nothing. The only XFS parameter you can change at runtime is <code>nobarrier</code> (see the <a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=fs/xfs/linux-2.6/xfs_super.c;h=18a4b8e11df2d4241bcfafd59297c30e961241ad;hb=HEAD#l1241">source code of XFS's remount support in the Linux kernel</a>), which you should use if you have a battery-backup unit (BBU) on your RAID card, although the performance boost seems pretty small on DB-type workloads, even with 512MB of RAM on the controller.
<br />
<br />Next post: how much of a performance difference is there when you give XFS the right <code>sunit</code>/<code>swidth</code> parameters, and does this allow XFS to beat ext4's performance.tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com8tag:blogger.com,1999:blog-8260739278874294486.post-53980845755148862682011-08-15T16:36:00.005-07:002011-08-19T00:42:40.172-07:00e1000e scales a lot better than bnx2At StumbleUpon we've had a never ending string of problems with Broadcom's cards that use the <a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/net/bnx2.h;h=a4d83409f20555eb60c73e9d10ada9edd2a777b3">bnx2</a> driver. The machine cannot handle more than 100kpps (packets/s), the driver has bugs that will lock up the NIC until it gets reset manually when you use <a href="http://en.wikipedia.org/wiki/Jumbo_frame">jumbo frames</a> and/or <a href="http://en.wikipedia.org/wiki/TCP_segmentation_offloading">TSO</a> (TCP Segmentation Offloading).
<br />
<br />So we switched everything to Intel NICs. Not only they don't have these nasty bugs, but also they scale better. They can do up to 170kpps each way before they start discarding packets. Graphs courtesy of <a href="http://opentsdb.net/">OpenTSDB</a>: <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiV_tmwmbHee0JOfElImRvnBAmhe3Nrb-vCj-HVqCvyG2jicOIG1b7jNCTr4kiSpRhCdooeljosByTXGR0uZiTr5w8s_7yeHwHwUkGcDjaqsBbxejhOOJIUxUq-GkNYo4g-_fiyFZEqiDQ/s1600/e1000e_packets-and-drops.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiV_tmwmbHee0JOfElImRvnBAmhe3Nrb-vCj-HVqCvyG2jicOIG1b7jNCTr4kiSpRhCdooeljosByTXGR0uZiTr5w8s_7yeHwHwUkGcDjaqsBbxejhOOJIUxUq-GkNYo4g-_fiyFZEqiDQ/s1600/e1000e_packets-and-drops.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5641233666736931426" /></a><div style="text-align: center;">Packets/s vs. packets dropped/s</div><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEZLM5gKchdymgGUe750gfbMcJelJIGaCG4U3jmWz9X-GERodv5kFTSavSaLJBuuPqfwVs0Qr0nBGhWRLYU0Y_GU3BaqtL3JpPhnJ8XzyGhR5lPOvIPwwBip8gt8tVFtvRltj8Z7TPFUo/s1600/e1000e_packets-and-interrupts.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEZLM5gKchdymgGUe750gfbMcJelJIGaCG4U3jmWz9X-GERodv5kFTSavSaLJBuuPqfwVs0Qr0nBGhWRLYU0Y_GU3BaqtL3JpPhnJ8XzyGhR5lPOvIPwwBip8gt8tVFtvRltj8Z7TPFUo/s1600/e1000e_packets-and-interrupts.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5641233669909504674" /></a><div style="text-align: center;">Packets/s vs. interrupts/s</div>
<br />
<br />We can also see how the NIC is doing interrupt coalescing at high packet rates. Yay.
<br /><small>Kernel tested: 2.6.32-31-server x86_64 from Lucid, running on 2 L5630 with 48GB of RAM.</small>tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com0tag:blogger.com,1999:blog-8260739278874294486.post-80199987690160281042011-07-28T00:22:00.002-07:002011-07-28T00:25:02.366-07:00VM warning: GC locker is held; pre-dump GC was skippedIf you ever run into this message while using the Sun JVM / OpenJDK:<pre>Java HotSpot(TM) 64-Bit Server VM warning: GC locker is held; pre-dump GC was skipped</pre>then I wouldn't worry too much about it as it seems like it's <a href="http://www.google.com/codesearch#62XBuw3RHgs/src/share/vm/gc_implementation/shared/vmGCOperations.cpp&q=%22GC%20locker%20is%20held%22&type=cs&l=117">printed</a> when running a <code>jmap -histo:live</code> while the GC is already running or holding a certain lock in the jVM.tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com1tag:blogger.com,1999:blog-8260739278874294486.post-90926185417089736752011-06-03T12:52:00.003-07:002012-10-06T01:47:38.181-07:00Clarifications on Linux's NUMA statsAfter reading the excellent post on <a href="http://jcole.us/blog/archives/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/">The MySQL “swap insanity” problem and the effects of the NUMA architecture</a>, I remembered about the existence of <a href="http://www.kernel.org/doc/Documentation/numastat.txt"><code>/sys/devices/system/node/node*/numastat</code></a> and decided to add these numbers to a collector for <a href="http://opentsdb.net/">OpenTSDB</a>. But whenever I add a collector that reads metrics from <code>/proc</code> or <code>/sys</code>, I always need to go read the Linux kernel's source code, because most metrics tend to be misleading and under-documented (when they're documented at all).<br />
<br />
In this case, if you RTFM, you'll get this:<br />
<pre>Numa policy hit/miss statistics
/sys/devices/system/node/node*/numastat
All units are pages. Hugepages have separate counters.
numa_hit A process wanted to allocate memory from this node, and succeeded.
numa_miss A process wanted to allocate memory from another node, but ended up with memory from this node.
numa_foreign A process wanted to allocate on this node, but ended up with memory from another one.
local_node A process ran on this node and got memory from it.
other_node A process ran on this node and got memory from another node.
interleave_hit Interleaving wanted to allocate from this node and succeeded.</pre>
I was very confused about the last one, about the exact difference between the second and the third one, and about the difference between the first 3 metrics and the next 2.<br />
<br />
After RTFSC, the relevant part of the code appeared to be in <code>mm/vmstat.c</code>:<br />
<pre>void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags)
{
if (z->zone_pgdat == preferred_zone->zone_pgdat) {
__inc_zone_state(z, NUMA_HIT);
} else {
__inc_zone_state(z, NUMA_MISS);
__inc_zone_state(preferred_zone, NUMA_FOREIGN);
}
if (z->node == ((flags & __GFP_OTHER_NODE) ?
preferred_zone->node : numa_node_id()))
__inc_zone_state(z, NUMA_LOCAL);
else
__inc_zone_state(z, NUMA_OTHER);
}</pre>
<br />
So here's what it all really means:<br />
<ul>
<li><code>numa_hit</code>: Number of pages allocated from the node the process wanted.</li>
<li><code>numa_miss</code>: Number of pages allocated from this node, but the process preferred another node.</li>
<li><code>numa_foreign</code>: Number of pages allocated on another node, but the process preferred this node.</li>
<li><code>local_node</code>: Number of pages allocated from this node while the process was running locally.</li>
<li><code>other_node</code>: Number of pages allocated from this node while the process was running remotely (on another node).</li>
<li><code>interleave_hit</code>: Number of pages allocated successfully with the interleave strategy.</li>
</ul>
<br />
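Once the semantics are clear, collecting these numbers boils down to reading a handful of tiny files. Here's a minimal, hypothetical sketch (not the actual collector I wrote for OpenTSDB) that simply dumps every counter of every node:
<pre>import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

final class numastats {
  public static void main(final String[] args) throws IOException {
    final File[] nodes = new File("/sys/devices/system/node").listFiles();
    if (nodes == null) {
      return;  // No NUMA topology exposed by this kernel.
    }
    for (final File node : nodes) {
      if (!node.getName().startsWith("node")) {
        continue;  // Skip entries such as "possible" or "online".
      }
      final BufferedReader in =
        new BufferedReader(new FileReader(new File(node, "numastat")));
      try {
        String line;
        while ((line = in.readLine()) != null) {  // e.g. "numa_hit 16690867"
          System.out.println(node.getName() + " " + line);
        }
      } finally {
        in.close();
      }
    }
  }
}</pre>
<br />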
I was originally confused about <code>numa_foreign</code> but this metric can actually be useful to see what happens when a node runs out of free pages. If a process attempts to get a page from its local node, but this node is out of free pages, then the <code>numa_miss</code> of that node will be incremented (indicating that the node is out of memory) and another node will accomodate the process's request. But in order to know which nodes are "lending memory" to the out-of-memory node, you need to look at <code>numa_foreign</code>. Having a high value for <code>numa_foreign</code> for a particular node indicates that this node's memory is under-utilized so the node is frequently accommodating memory allocation requests that failed on other nodes.tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com0tag:blogger.com,1999:blog-8260739278874294486.post-20363408550001662232011-05-07T22:12:00.004-07:002011-05-20T15:01:50.316-07:00JVM u24 segfault in clearerr on JauntyAt StumbleUpon we've been tracking down a weird problem with one of our application servers written in Java. We run Sun's <code>jdk1.6.0_24</code> on Ubuntu Jaunty (9.04 – yes, these servers are old and due for an upgrade) and this application seems to do something that causes the JVM to segfault:<pre>[6972247.491417] hbase_regionser[32760]: segfault at 8 ip 00007f26cabd608b sp 00007fffb0798270 error 4 in libc-2.9.so[7f26cab66000+168000]<br />[6972799.682147] hbase_regionser[30904]: segfault at 8 ip 00007f8878fb608b sp 00007fff09b69900 error 4 in libc-2.9.so[7f8878f46000+168000]</pre>What's odd is that the problem always happens on different hosts, and almost always around 6:30 - 6:40 am. Go figure.<br /><h3>Understanding segfault messages from the Linux kernel</h3>Let's try to make sense of the messages shown above, logged by the Linux kernel. Back in Linux v2.6.28, it was logged by <a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/x86/mm/fault.c;hb=v2.6.28#l772"><code> do_page_fault</code></a>, but since then this big function has been refactored into multiple smaller functions, so look for <code>show_signal_msg</code> now.<pre> 791 printk(<br /> 792 "%s%s[%d]: segfault at %lx ip %p sp %p error %lx",<br /> 793 task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,<br /> 794 tsk->comm, task_pid_nr(tsk), address,<br /> 795 (void *) regs->ip, (void *) regs->sp, error_code);<br /> 796 print_vma_addr(" in ", regs->ip);</pre>From the above, we see that <code>segfault at 8</code> means that the code attempted to access the address "8", which is what caused the segfault (because there is no page ever mapped at address 0 <small>(normally)</small>). <code>ip</code> stands for instruction pointer, so the code that triggered the segfault was mapped at the address 0x00007f8878fb608b. <code>sp</code> is stack pointer and isn't very relevant here. <code>error 4</code> means that this was a read access (4 = <code>PF_USER</code>, which used to be a <code>#define</code> but is now part of <code>enum x86_pf_error_code</code>). The rest of the message tells us that the address of the instruction pointer falls inside the memory region mapped for the code of the libc, and it tells us in square brackets that the libc is mapped at the base address 0x7f8878f46000 and that there's 168000 bytes of code mapped. 
So that means that we were at 0x00007f8878fb608b - 0x7f8878f46000 = 0x7008b into the libc when the segfault occurred.<br /><h3>So where did the segfault occur exactly?</h3>Since now we know what offset into the libc we were while the segfault happened, we can fire <code>gdb</code> and see what's up with that code:<pre>$ gdb -q /lib/libc.so.6<br />(no debugging symbols found)<br />(gdb) x/i 0x7008b<br />0x7008b <clearerr+27>: cmp %r8,0x8(%r10)</pre>Interesting... So the JVM is segfaulting in <a href="http://pubs.opengroup.org/onlinepubs/007908799/xsh/clearerr.html"><code>clearerr</code></a>. We're 27 bytes into this function when the segfault happens. Let's see what the function does up to here:<pre>(gdb) disas clearerr<br />Dump of assembler code for function clearerr:<br />0x0000000000070070 <clearerr+0>: push %rbx<br />0x0000000000070071 <clearerr+1>: mov (%rdi),%eax<br />0x0000000000070073 <clearerr+3>: mov %rdi,%rbx<br />0x0000000000070076 <clearerr+6>: test %ax,%ax<br />0x0000000000070079 <clearerr+9>: js 0x700c7 <clearerr+87><br />0x000000000007007b <clearerr+11>: mov 0x88(%rdi),%r10<br />0x0000000000070082 <clearerr+18>: mov %fs:0x10,%r8<br />0x000000000007008b <clearerr+27>: cmp %r8,0x8(%r10)<br />0x000000000007008f <clearerr+31>: je 0x700c0 <clearerr+80><br />0x0000000000070091 <clearerr+33>: xor %edx,%edx<br />0x0000000000070093 <clearerr+35>: mov $0x1,%esi<br />0x0000000000070098 <clearerr+40>: mov %edx,%eax<br />0x000000000007009a <clearerr+42>: cmpl $0x0,0x300fa7(%rip) # 0x371048<br />0x00000000000700a1 <clearerr+49>: je 0x700ac <clearerr+60><br />0x00000000000700a3 <clearerr+51>: lock cmpxchg %esi,(%r10)<br />[...]</pre>Reminder: the prototype of the function is <code>void clearerr(FILE *stream);</code> so there's one pointer argument and no return value. The code above starts by saving <code>rbx</code> (because it's the callee's responsibility to save this register), then dereferences the first (and only) argument (passed in <code>rdi</code>) and saves the dereferenced address in <code>eax</code>. Then it copies the pointer passed in argument in <code>rbx</code>. It then tests whether low 16 bits in <code>eax</code> are negative and jumps over some code if it is, because they contain the <code>_flags</code> field of the <code>FILE*</code> passed in argument. At this point it helps to know what a <code>FILE</code> looks like. This structure is opaque so it depends on the libc implementation. In this case, it's the <a href="http://sourceware.org/git/?p=glibc.git;hb=glibc-2.9;a=blob;f=libio/libio.h#l271">glibc's</a>:<pre> 271 struct _IO_FILE {<br /> 272 int _flags; /* High-order word is _IO_MAGIC; rest is flags. */<br />[...]<br /> 310 _IO_lock_t *_lock;<br /> 311 #ifdef _IO_USE_OLD_IO_FILE<br /> 312 };</pre>Then it's looking 0x88 = 136 bytes into the <code>FILE*</code> passed in argument and storing this in <code>r10</code>. If you look at the definition of <code>FILE*</code> and add up the offsets, 136 bytes into the <code>FILE*</code> you'll find the <code>_IO_lock_t *_lock;</code> member of the struct, the mutex that protects this <code>FILE*</code>. Then we're loading address 0x10 from the FS segment in <code>r8</code>. On Linux x86_64, the <a href="http://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture#Segment_Registers">F segment</a> is used for thread-local data. In this case it's loading a pointer to a structure that corresponds to the local thread. 
Finally, we're comparing <code>r8</code> to the value 8 bytes into the value pointed to by <code>r10</code>, and kaboom, we get a segfault. This suggest that <code>r10</code> is a <code>NULL</code> pointer, meaning that the <code>_lock</code> of the <code>FILE*</code> given in argument is <code>NULL</code>. Now that's weird. I'm not sure how this happened. So the assembly code above is essentially doing:<pre>void clearerr(FILE *stream) {<br /> if (stream->_flags & 0xFFFF >= 0) {<br /> struct pthread* self = /* mov %fs:0x10,%r8 -- (can't express this in C, but you can use <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/arch_prctl.2.html">arch_prctl</a>) */;<br /> struct lock_t lock = *stream->_lock;<br /> if (lock.owner != self) { // We segfault here, when doing lock->owner<br /> mutex_lock(lock.lock);<br /> lock.owner = self;<br /> }<br /> // ...<br /> }<br /> // ...<br />}</pre>What's odd is that the return value of the JVM is 143 (128+<code>SIGTERM</code>) and not 139 (=128+<code>SIGSEGV</code>). Maybe it's because the JVM is always catching and handling <code>SIGSEGV</code> (they do this to allow the JIT to optimize away some <code>NULL</code>-pointer checks and translate them into <code>NullPointerExceptions</code>, among other things). But even then, normally the JVM will write a file where it complains about the segfault, asks you to file a bug, and dumps all the registers and whatnot... We should see that file somewhere. Yet it's nowhere to be found in the JVM's current working directory or anywhere else I looked.<br /><br />So this segfault remains a mystery so far. Next step: run the application server with <code>ulimit -c unlimited</code> and analyze a core dump.tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com3tag:blogger.com,1999:blog-8260739278874294486.post-66341789670730251562011-03-14T20:49:00.012-07:002011-03-14T22:23:03.900-07:00The "Out of socket memory" errorI recently did some work on some of our frontend machines (on which we run <a href="http://www.varnish-cache.org/">Varnish</a>) at StumbleUpon and decided to track down some of the errors the Linux kernel was regularly throwing in <code>kern.log</code> such as:<pre>Feb 25 08:23:42 foo kernel: [3077014.450011] Out of socket memory</pre>Before we get started, let me tell you that <strong>you should NOT listen to any blog or forum post without doing your homework</strong>, especially when the post recommends that you tune up virtually every TCP related knob in the kernel. These people don't know what they're doing and most probably don't understand much to TCP/IP. Most importantly, their voodoo won't help you fix your problem and might actually make it worse.<br /><br /><h1>Dive in the Linux kernel</h1><br />In order to best understand what's going on, the best thing is to go read the code of the kernel. Unfortunately, the kernel's error messages or counters are often imprecise, confusing, or even misleading. But they're important. 
And reading the kernel's code isn't nearly as hard as what people say.<br /><br /><h3 id="out_of_socket_memory_error">The "Out of socket memory" error</h3><br />The only match for "Out of socket memory" in the kernel's code (as of v2.6.38) is in <a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=net/ipv4/tcp_timer.c;hb=v2.6.38-rc8#l82"><code>net/ipv4/tcp_timer.c</code></a>:<pre> 66 static int tcp_out_of_resources(struct sock *sk, int do_reset)<br /> 67 {<br /> 68 struct tcp_sock *tp = tcp_sk(sk);<br /> 69 int shift = 0;<br /> 70 <br /> 71 /* If peer does not open window for long time, or did not transmit<br /> 72 * anything for long time, penalize it. */<br /> 73 if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)<br /> 74 shift++;<br /> 75 <br /> 76 /* If some dubious ICMP arrived, penalize even more. */<br /> 77 if (sk->sk_err_soft)<br /> 78 shift++;<br /> 79 <br /> 80 if (tcp_too_many_orphans(sk, shift)) {<br /> 81 if (net_ratelimit())<br /> 82 printk(KERN_INFO "Out of socket memory\n");<br /></pre>So the question is: when does <code>tcp_too_many_orphans</code> return true? Let's take a look in <a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=include/net/tcp.h;hb=v2.6.38-rc8#l268"><code>include/net/tcp.h</code></a>:<pre> 268 static inline bool tcp_too_many_orphans(struct sock *sk, int shift)<br /> 269 {<br /> 270 struct percpu_counter *ocp = sk->sk_prot->orphan_count;<br /> 271 int orphans = percpu_counter_read_positive(ocp);<br /> 272 <br /> 273 if (orphans << shift > sysctl_tcp_max_orphans) {<br /> 274 orphans = percpu_counter_sum_positive(ocp);<br /> 275 if (orphans << shift > sysctl_tcp_max_orphans)<br /> 276 return true;<br /> 277 }<br /> 278 <br /> 279 if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&<br /> 280 atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])<br /> 281 return true;<br /> 282 return false;<br /> 283 }<br /></pre>So <strong>two conditions that can trigger this "Out of socket memory" error</strong>:<ol><li>There are "too many" orphan sockets (most common).</li><li>The socket already has the minimum amount of memory and we can't give it more because TCP is already using more than its limit.</li></ol>In order to remedy to your problem, you need to figure out which case you fall into. The vast majority of the people (especially those dealing with frontend servers like Varnish) fall into case 1.<br /><br /><h3>Are you running out of TCP memory?</h3><br />Ruling out case 2 is easy. All you need is to see how much memory your kernel is configured to give to TCP vs how much is actually being used. If you're close to the limit (uncommon), then you're in case 2. Otherwise (most common) you're in case 1. The kernel keeps track of the memory allocated to TCP in multiple of pages, not in bytes. This is a first bit of confusion that a lot of people run into because some settings are in bytes and other are in pages (and most of the time 1 page = 4096 bytes).<br /><br />Rule out case 2: find how much memory the kernel is willing to give to TCP:<pre>$ cat /proc/sys/net/ipv4/<a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/networking/ip-sysctl.txt;hb=v2.6.38-rc8#l298">tcp_mem</a><br />3093984 4125312 6187968</pre>The values are in number of pages. 
They get <a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=net/ipv4/tcp.c;hb=v2.6.38-rc8#l3279">automatically sized at boot time</a> (values above are for a machine with 32GB of RAM). They mean:<ol><li>When TCP uses less than 3093984 pages (11.8GB), the kernel will consider it below the "low threshold" and won't bother TCP about its memory consumption.</li><li>When TCP uses more than 4125312 pages (15.7GB), enter the "memory pressure" mode.</li><li>The maximum number of pages the kernel is willing to give to TCP is 6187968 (23.6GB). When we go above this, we'll start seeing the "Out of socket memory" error and Bad Things will happen.</li></ol>Now let's find how much of that memory TCP actually uses.<pre>$ cat /proc/net/sockstat<br />sockets: used 14565<br />TCP: inuse 35938 orphan 21564 tw 70529 alloc 35942 mem 1894<br />UDP: inuse 11 mem 3<br />UDPLITE: inuse 0<br />RAW: inuse 0<br />FRAG: inuse 0 memory 0</pre>The last value on the second line (<code>mem 1894</code>) is the number of pages allocated to TCP. In this case we can see that 1894 is <i>way</i> below 6187968, so there's no way we can possibly be running out of TCP memory. So in this case, the "Out of socket memory" error was caused by the number of orphan sockets.<br /><br /><h3>Do you have "too many" orphan sockets?</h3><br />First of all: what's an orphan socket? It's simply a socket that isn't associated to a file descriptor. For instance, after you <code>close()</code> a socket, you no longer hold a file descriptor to reference it, but it still exists because the kernel has to keep it around for a bit more until TCP is done with it. Because orphan sockets aren't very useful to applications (since applications can't interact with them), the kernel is trying to limit the amount of memory consumed by orphans, and it does so by limiting the number of orphans that stick around. If you're running a frontend web server (or an HTTP load balancer), then you'll most likely have a sizeable number of orphans, and that's perfectly normal.<br /><br />In order to find the limit on the number of orphan sockets, simply do:<pre>$ cat /proc/sys/net/ipv4/<a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/networking/ip-sysctl.txt;hb=v2.6.38-rc8#l271">tcp_max_orphans</a><br />65536</pre>Here we see the default value, which is 64k. In order to find the number of orphan sockets in the system, look again in <code>sockstat</code>:<pre>$ cat /proc/net/sockstat<br />sockets: used 14565<br />TCP: inuse 35938 orphan 21564 tw 70529 alloc 35942 mem 1894<br />[...]</pre>So in this case we have 21564 orphans. That doesn't seem very close to 65536... Yet, if you look once more at <a href="#out_of_socket_memory_error">the code above</a> that prints the warning, you'll see that there is this <code>shift</code> variable that has a value between 0 and 2, and that the check is testing <code>if (orphans << shift > sysctl_tcp_max_orphans)</code>. What this means is that in certain cases, the kernel decides to penalize some sockets more, and it does so by multiplying the number of orphans by 2x or 4x to artificially increase the "score" of the "bad socket" to penalize. The problem is that due to the way this is implemented, you can see a worrisome "Out of socket memory" error when in fact you're still 4x below the limit and you just had a couple "bad sockets" (which happens frequently when you have an Internet facing service). 
So unfortunately that means that you need to raise the maximum number of orphan sockets even if you're 2x or 4x away from the threshold. What value is reasonable depends on your situation. Observe how the count of orphans in <code>/proc/net/sockstat</code> changes when your server is at peak traffic, multiply that value by 4, round it up a bit to have a nice value, and set it. You can set it by doing an <code>echo</code> of the new value into <code>/proc/sys/net/ipv4/tcp_max_orphans</code>, and don't forget to update the value of <code>net.ipv4.tcp_max_orphans</code> in <code>/etc/sysctl.conf</code> so that your change persists across reboots.<br /><br />That's all you need to get rid of these "Out of socket memory" errors, most of which are "false alarms" due to the <code>shift</code> variable in the implementation.tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com10tag:blogger.com,1999:blog-8260739278874294486.post-90063766208248680812010-12-10T18:30:00.005-08:002012-10-06T01:46:57.325-07:00Java IO: slowest readLine everI have a fairly simple problem: I want to count the number of lines in a file, then seek back to after the first line, and then read the file line by line. Easy, eh? Not in Java. Enter the utterly retarded world of the JDK.<br />
<br />
So if you're a n00b, you'll start with a <a href="http://download.oracle.com/javase/6/docs/api/java/io/FileInputStream.html"><code>FileInputStream</code></a>, but you'll quickly realize that seeking around with it isn't really possible... Indeed, the only way to go back to a previous position in the file is to call <a href="http://download.oracle.com/javase/6/docs/api/java/io/InputStream.html#reset()"><code>reset()</code></a>, which will take you back to the previous location you marked with <a href="http://download.oracle.com/javase/6/docs/api/java/io/InputStream.html#mark(int)"><code>mark(int)</code></a>. The argument to <code>mark</code> is "the maximum limit of bytes that can be read before the mark position becomes invalid". OK WTF.<br />
<br />
If you dig around some more, you'll see that you should really be using a <a href="http://download.oracle.com/javase/6/docs/api/java/io/RandomAccessFile.html"><code>RandomAccessFile</code></a> – so much for good OO design. The other seemingly cool thing about <code>RandomAccessFile</code> is that it's got a <a href="http://download.oracle.com/javase/6/docs/api/java/io/RandomAccessFile.html#readLine()"><code>readLine()</code></a> method. Unfortunately, this method was implemented by a 1st year CS student who probably dropped out before understanding the basics of systems programming.<br />
<br />
Believe it or not, <code>readLine()</code> reads the file one byte at a time. It does one system call to <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/read.2.html"><code>read</code></a> per byte. As such, it's 2 orders of magnitude slower than it could be... In fact, you can't really implement a readline function that's much slower than that. facepalm.<br />
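If you want to see the damage for yourself, here's a quick C sketch (illustration only, not the JDK's code) that counts lines both ways: once with one <code>read()</code> syscall per byte, which is essentially what <code>readLine()</code> does, and once through a plain stdio buffer. Run it on any decently sized file under <code>strace -c</code> to see the syscall counts; the buffered version is roughly the "2 orders of magnitude" faster mentioned above.<br />
<pre>/* Illustration only: one read() syscall per byte vs buffered reads. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Roughly what RandomAccessFile.readLine() does: 1 syscall per byte. */
static long count_lines_slow(const char *path) {
  int fd = open(path, O_RDONLY);
  char c;
  long lines = 0;
  if (fd < 0) return -1;
  while (read(fd, &c, 1) == 1)      /* one read() per byte */
    if (c == '\n')
      lines++;
  close(fd);
  return lines;
}

/* Same thing through stdio's userland buffer: one read() every few KB. */
static long count_lines_fast(const char *path) {
  FILE *f = fopen(path, "r");
  int c;
  long lines = 0;
  if (f == NULL) return -1;
  while ((c = getc(f)) != EOF)
    if (c == '\n')
      lines++;
  fclose(f);
  return lines;
}

int main(int argc, char **argv) {
  if (argc != 2) return 1;
  printf("slow: %ld lines\n", count_lines_slow(argv[1]));
  printf("fast: %ld lines\n", count_lines_fast(argv[1]));
  return 0;
}</pre>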
<br />
PS: This is with Sun's JRE version 1.6.0_22-b04. JDK7/OpenJDK has the same <a href="http://www.google.com/codesearch/p?hl=en#UkL11lIAx-s/src/share/classes/java/io/RandomAccessFile.java&q=readLine&d=2&l=882">implementation</a>. Apache Harmony's implementation is no different, so Android has the same retarded <a href="http://www.google.com/codesearch/p?hl=en#cZwlSNS7aEw/libcore/luni/src/main/java/java/io/RandomAccessFile.java&q=readLine&l=563">implementation</a>.tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com11tag:blogger.com,1999:blog-8260739278874294486.post-55123190808594698432010-12-09T19:52:00.004-08:002010-12-09T20:01:16.140-08:00OpenTSDB at Strata'11I will be <a href="http://strataconf.com/strata2011/public/schedule/detail/16996">speaking</a> about <a href="http://opentsdb.net">OpenTSDB</a> at the <a href="http://strataconf.com/strata2011">Strata conference</a>, Wednesday, February 02, 2011, in Santa Clara, CA. You can sign up with this promo code and get a 25% discount: str11fsd.<br />Strata is a new conference about large-scale systems put together by O'Reilly.<br /><a href="http://strataconf.com/strata2011"><img src="http://assets.en.oreilly.com/1/event/55/strata2011_spkr_210x60.jpg" width="210" height="60" border="0" alt="Strata 2011" title="Strata 2011"/></a>tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com0tag:blogger.com,1999:blog-8260739278874294486.post-60226491269777345322010-11-14T20:53:00.021-08:002022-07-29T09:03:04.075-07:00How long does it take to make a context switch?That's an interesting question I'm willing to spend some of my time on. Someone at StumbleUpon hypothesized that with all the improvements in the <a href="http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)">Nehalem architecture</a> (marketed as Intel i7), context switching would be much faster. How would you devise a test to empirically find an answer to this question? How expensive are context switches anyway? (tl;dr answer: <strong>very expensive</strong>)<br />
<h2>
The lineup</h2>
<i>April 21, 2011 update: I added an "extreme" Nehalem and a low-voltage Westmere.</i><br />
<i>April 1, 2013 update: Added an Intel Sandy Bridge E5-2620.</i><br />
I've put six different CPUs, spanning five generations, to the test:<br />
<ul>
<li>A dual <a href="http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Woodcrest.22_.2865_nm.29">Intel 5150</a> (Woodcrest, based on the <a href="http://en.wikipedia.org/wiki/Core_(microarchitecture)">old "Core" architecture</a>, 2.67GHz). The 5150 is a dual-core, and so in total the machine has 4 cores available. Kernel: 2.6.28-19-server x86_64.</li>
<li>A dual <a href="http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Harpertown.22_.2845_nm.29">Intel E5440</a> (Harpertown, based on the <a href="http://en.wikipedia.org/wiki/Penryn_(microarchitecture)#Penryn">Penryn architecture</a>, 2.83GHz). The E5440 is a quad-core so the machine has a total of 8 cores. Kernel: 2.6.24-26-server x86_64.</li>
<li>A dual <a href="http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Gainestown.22_.2845_nm.29">Intel E5520</a> (Gainestown, based on the <a href="http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)">Nehalem architecture</a>, aka i7, 2.27GHz). The E5520 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Kernel: 2.6.28-18-generic x86_64.</li>
<li>A dual <a href="http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Gainestown.22_.2845_nm.29">Intel X5550</a> (Gainestown, based on the <a href="http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)">Nehalem architecture</a>, aka i7, 2.67GHz). The X5550 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Note: the X5550 is in the "server" product line. This CPU is 3x more expensive than the previous one. Kernel: 2.6.28-15-server x86_64.</li>
<li>A dual <a href="http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Gulftown.22_.2832_nm.29">Intel L5630</a> (Gulftown, based on the <a href="http://en.wikipedia.org/wiki/Westmere_(microarchitecture)">Westmere architecture</a>, aka i7, 2.13GHz). The L5630 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Note: the L5630 is a "low-voltage" CPU. At equal price, this CPU is in theory 16% less powerful than a non-low-voltage CPU. Kernel: 2.6.32-29-server x86_64.</li>
<li>A dual <a href="http://ark.intel.com/products/64594">Intel E5-2620</a> (Sandy Bridge-EP, based on the <a href="http://en.wikipedia.org/wiki/Sandy_Bridge">Sandy Bridge architecture</a>, aka E5, 2GHz). The E5-2620 is a hexa-core and has HyperThreading, so the machine has a total of 12 cores, or 24 "hardware threads". Kernel: 3.4.24 x86_64.</li>
</ul>
As far as I can tell, all CPUs are set to a constant clock rate (no Turbo Boost or anything fancy). All the Linux kernels are those built and distributed by Ubuntu.<br />
<h2>
First idea: with syscalls (fail)</h2>
My first idea was to make a cheap system call many times in a row, time how long it took, and compute the average time spent per syscall. The cheapest system call on Linux these days seems to be <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/gettid.2.html"><code>gettid</code></a>. Turns out, this was a naive approach: system calls don't actually cause a full context switch anymore; the kernel can get away with a "mode switch" (go from user mode to kernel mode, then back to user mode). That's why when I ran my first test program, <code>vmstat</code> wouldn't show a noticeable increase in the number of context switches. But this test is interesting too, even though it's not what I wanted originally.<br />
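The core of that first test is tiny; here's a sketch of the idea (the real thing, with proper error handling, is <code>timesyscall.c</code>, linked below):<br />
<pre>/* Sketch: average the cost of a cheap syscall over many iterations. */
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

int main(void) {
  const int n = 10 * 1000 * 1000;
  struct timeval start, end;

  gettimeofday(&start, NULL);
  for (int i = 0; i < n; i++)
    syscall(SYS_gettid);           /* direct syscall, no libc caching */
  gettimeofday(&end, NULL);

  long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
               + (end.tv_usec - start.tv_usec) * 1000LL;
  printf("%lld ns/syscall\n", ns / n);
  return 0;
}</pre>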
<br />
Source code: <a href="https://github.com/tsuna/contextswitch/blob/master/timesyscall.c">timesyscall.c</a> Results:<br />
<ul>
<li>Intel 5150: 105ns/syscall</li>
<li>Intel E5440: 87ns/syscall</li>
<li>Intel E5520: 58ns/syscall</li>
<li>Intel X5550: 52ns/syscall</li>
<li>Intel L5630: 58ns/syscall</li>
<li>Intel E5-2620: 67ns/syscall</li>
</ul>
Now that's nice: more expensive CPUs perform noticeably better (note, however, the slight increase in cost on Sandy Bridge). But that's not really what we wanted to know. So to test the cost of a context switch, we need to force the kernel to de-schedule the current process and schedule another one instead. And to benchmark the CPU, we need to get the kernel to do nothing but this in a tight loop. How would you do this?<br />
<h2>Second idea: with <code>futex</code></h2>
The way I did it was to abuse <a href="http://en.wikipedia.org/wiki/Futex"><code>futex</code></a> (<a href="http://www.kernel.org/doc/man-pages/online/pages/man2/futex.2.html">RTFM</a>). <code>futex</code> is the low-level Linux-specific primitive used by most threading libraries to implement blocking operations such as waiting on contended mutexes, semaphores that run out of permits, condition variables, etc. If you would like to know more, go read <a href="http://people.redhat.com/drepper/futex.pdf">Futexes Are Tricky</a> by Ulrich Drepper. Anyway, with a futex, it's easy to suspend and resume processes. My test forks off a child process, and the parent and the child take turns waiting on the futex. When the parent waits, the child wakes it up and goes on to wait on the futex, until the parent wakes it and goes on to wait again. It's a kind of ping-pong: "I wake you up, you wake me up...".<br />
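For the curious, here's a stripped-down sketch of that ping-pong. The numbers below come from the real benchmark, <code>timectxsw.c</code>, linked below; this version just shows the mechanism and skips all error handling:<br />
<pre>/* Sketch: two processes ping-ponging on a shared futex word. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static int *turn;  /* 0 = parent's turn, 1 = child's turn */

static void wait_until(int want) {
  int cur;
  while ((cur = __atomic_load_n(turn, __ATOMIC_ACQUIRE)) != want)
    syscall(SYS_futex, turn, FUTEX_WAIT, cur, NULL, NULL, 0);  /* block */
}

static void hand_over(int next) {
  __atomic_store_n(turn, next, __ATOMIC_RELEASE);
  syscall(SYS_futex, turn, FUTEX_WAKE, 1, NULL, NULL, 0);      /* wake peer */
}

int main(void) {
  const int iterations = 500000;
  turn = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  *turn = 0;

  if (fork() == 0) {                        /* child */
    for (int i = 0; i < iterations; i++) {
      wait_until(1);
      hand_over(0);
    }
    _exit(0);
  }

  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);
  for (int i = 0; i < iterations; i++) {    /* parent */
    hand_over(1);
    wait_until(0);
  }
  clock_gettime(CLOCK_MONOTONIC, &end);
  wait(NULL);

  long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
               + (end.tv_nsec - start.tv_nsec);
  /* Each iteration is 2 switches: parent -> child -> parent. */
  printf("%lld ns/context switch\n", ns / (2LL * iterations));
  return 0;
}</pre>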
<br />
Source code: <a href="https://github.com/tsuna/contextswitch/blob/master/timectxsw.c">timectxsw.c</a> Results:<br />
<ul>
<li>Intel 5150: ~4300ns/context switch</li>
<li>Intel E5440: ~3600ns/context switch</li>
<li>Intel E5520: ~4500ns/context switch</li>
<li>Intel X5550: ~3000ns/context switch</li>
<li>Intel L5630: ~3000ns/context switch</li>
<li>Intel E5-2620: ~3000ns/context switch</li>
</ul>
Note: those results include the overhead of the <code>futex</code> system calls.<br />
<br />
Now you must take those results with a grain of salt. The micro-benchmark does <em>nothing</em> but context switching. In practice context switching is expensive because it screws up the CPU caches (L1, L2, L3 if you have one, and the <a href="http://en.wikipedia.org/wiki/Translation_lookaside_buffer">TLB</a> – don't forget the TLB!).<br />
<h2>
CPU affinity</h2>
Things are harder to predict in an SMP environment, because the performance can vary wildly depending on whether a task is migrated from one core to another (especially if the migration is across physical CPUs). I ran the benchmarks again but this time I pinned the processes/threads on a single core (or "hardware thread"). The performance speedup is dramatic.<br />
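Pinning is trivial to do from C as well: a single call to <code>sched_setaffinity</code> before forking makes both processes compete for the same core (sketch below); from the shell, <code>taskset -c 0</code> achieves the same thing.<br />
<pre>/* Sketch: pin the current process (and any future children) to CPU 0. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(0, &set);                          /* hardware thread #0 */
  if (sched_setaffinity(0, sizeof(set), &set) != 0) {
    perror("sched_setaffinity");
    return 1;
  }
  /* ... fork() + the futex ping-pong from above; the child inherits
     the affinity mask, so both processes stay on the same core. */
  return 0;
}</pre>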
<br />
Source code: <a href="https://github.com/tsuna/contextswitch/blob/master/cpubench.sh">cpubench.sh</a> Results:<br />
<ul>
<li>Intel 5150: ~1900ns/process context switch, ~1700ns/thread context switch</li>
<li>Intel E5440: ~1300ns/process context switch, ~1100ns/thread context switch</li>
<li>Intel E5520: ~1400ns/process context switch, ~1300ns/thread context switch</li>
<li>Intel X5550: ~1300ns/process context switch, ~1100ns/thread context switch</li>
<li>Intel L5630: ~1600ns/process context switch, ~1400ns/thread context switch</li>
<li>Intel E5-2620: ~1600ns/process context switch, ~1300ns/thread context switch</li>
</ul>
Performance boost: 5150: 66%, E5440: 65-70%, E5520: 50-54%, X5550: 55%, L5630: 45%, E5-2620: 45%.<br />
<br />
The performance gap between thread switches and process switches seems to increase with newer CPU generations (5150: 7-8%, E5440: 5-15%, E5520: 11-20%, X5550: 15%, L5630: 13%, E5-2620: 19%). Overall the penalty of switching from one task to another remains very high. Bear in mind that those artificial tests do absolutely zero computation, so they probably have 100% cache hit in L1d and L1i. In the real world, switching between two tasks (threads or processes) typically incurs significantly higher penalties due to cache pollution. But we'll get back to this later.<br />
<h2>
Threads vs. processes</h2>
After producing the numbers above, I quickly criticized Java applications, because it's fairly common to create shitloads of threads in Java, and the cost of context switching becomes high in such applications. Someone retorted that, yes, Java uses lots of threads but threads have become significantly faster and cheaper with the <a href="http://en.wikipedia.org/wiki/Native_POSIX_Thread_Library">NPTL</a> in Linux 2.6. They said that normally there's no need to do a TLB flush when switching between two threads of the same process. That's true, you can go check the source code of the Linux kernel (<a href="http://livegrep.com/search/linux?q=void%5C+switch_mm+file%3Ax86%2Finclude%2Fasm%2Fmmu_context.h"><code>switch_mm</code> in <code>mmu_context.h</code></a>):<br />
<pre>static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                             struct task_struct *tsk)
{
        unsigned cpu = smp_processor_id();

        if (likely(prev != next)) {
                <i>[...]</i>
                load_cr3(next->pgd);
        } else {
                <i>[don't typically reload cr3]</i>
        }
}</pre>
In this code, the kernel expects to be switching between tasks that have different memory structures, in which case it updates <a href="http://en.wikipedia.org/wiki/Control_register#CR3">CR3</a>, the register that holds a pointer to the <a href="http://en.wikipedia.org/wiki/Page_table">page table</a>. Writing to CR3 automatically causes a TLB flush on x86.<br />
<br />
In practice though, with the default kernel scheduler and a busy server-type workload, it's fairly infrequent to go through the code path that skips the call to <code>load_cr3</code>. Plus, different threads tend to have different working sets, so even if you skip this step, you still end up polluting the L1/L2/L3/TLB caches. I re-ran the benchmark above with 2 threads instead of 2 processes (source: <a href="https://github.com/tsuna/contextswitch/blob/master/timetctxsw.c">timetctxsw.c</a>) but the results aren't significantly different (this varies a lot depending on scheduling and luck, but averaged over many runs it's typically only 100ns faster to switch between threads if you don't set a custom CPU affinity).<br />
<h2>
Indirect costs in context switches: cache pollution</h2>
The results above are in line with a paper published by a bunch of guys from the University of Rochester: <a href="http://www.cs.rochester.edu/u/cli/research/switch.pdf">Quantifying The Cost of Context Switch</a>. On an unspecified Intel Xeon (the paper was written in 2007, so the CPU was probably not <i>too</i> old), they end up with an average time of 3800ns. They use another method I thought of, which involves writing / reading 1 byte to / from a pipe to block / unblock a couple of processes. I thought that (ab)using futex would be better since futex essentially exposes a scheduling interface to userland.<br />
<br />
The paper goes on to explain the indirect costs involved in context switching, which are due to cache interference. Beyond a certain working set size (about half the size of the L2 cache in their benchmarks), the cost of context switching increases dramatically (by 2 orders of magnitude).<br />
<br />
I think this is a more realistic expectation. Not sharing data between threads leads to optimal performance, but it also means that every thread has its own working set and that when a thread is migrated from one core to another (or worse, across physical CPUs), the cache pollution is going to be costly. Unfortunately, when an application has many more active threads than hardware threads, this happens all the time. That's why not creating more active threads than there are hardware threads available is so important, because in this case it's easier for the Linux scheduler to keep re-scheduling the same threads on the core they last used ("weak affinity").<br />
<br />
Having said that, these days, our CPUs have much larger caches, and can even have an L3 cache.<br />
<ul>
<li>5150: L1i & L1d = 32K each, L2 = 4M</li>
<li>E5440: L1i & L1d = 32K each, L2 = 6M</li>
<li>E5520: L1i & L1d = 32K each, L2 = 256K/core, L3 = 8M (same for the X5550)</li>
<li>L5630: L1i & L1d = 32K each, L2 = 256K/core, L3 = 12M</li>
<li>E5-2620: L1i & L1d = 32K each, L2 = 256K/core, L3 = 15M</li>
</ul>
Note that in the case of the E5520/X5550/L5630 (the ones marketed as "i7") as well as the Sandy Bridge E5-2620, the L2 cache is tiny but there's one L2 cache per core (with HT enabled, this gives us 128K per hardware thread). The L3 cache is shared by all cores that are on each physical CPU.<br />
<br />
Having more cores is great, but it also increases the chance that your task will be rescheduled onto a different core. The cores have to "migrate" cache lines around, which is expensive. I recommend reading <a href="http://www.akkadia.org/drepper/cpumemory.pdf">What Every Programmer Should Know About Memory</a> by Ulrich Drepper (yes, him again!) to understand more about how this works and the performance penalties involved.<br />
<br />
So how does the cost of context switching increase with the size of the working set? This time we'll use another micro-benchmark, <a href="https://github.com/tsuna/contextswitch/blob/master/timectxswws.c">timectxswws.c</a>, which takes as an argument the number of pages to use as a working set. This benchmark is exactly the same as the one used earlier to test the cost of context switching between two processes except that now each process does a <code>memset</code> on the working set, which is <em>shared</em> across both processes. Before starting, the benchmark times how long it takes to write over all the pages in the requested working set. This time is then discounted from the total time taken by the test. This attempts to estimate the <em>overhead</em> of overwriting pages across context switches.<br />
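Conceptually, the only additions to the earlier ping-pong sketch are a shared working set and a call to dirty it on every turn, with the cost of one standalone pass subtracted at the end. Roughly (names and the fill value are illustrative):<br />
<pre>/* Additions to the earlier ping-pong sketch (working-set variant). */
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

static char *working_set;   /* mmap'ed MAP_SHARED | MAP_ANONYMOUS */
static size_t ws_bytes;     /* ws_pages * PAGE_SIZE, from the command line */

/* Dirty every page of the shared working set. */
static void touch_working_set(void) {
  memset(working_set, 0x5a, ws_bytes);
}

/* In both ping-pong loops, call touch_working_set() right after
 * wait_until(); before the timed run, time touch_working_set() alone
 * and subtract that baseline from the total. */</pre>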
<br />
Here are the results for the 5150:<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPNe4cUbrimuXM_XBKm6P9IClF-73wsNGy8kY3I_sAHGcpN9JwOacgsmVwti55mIeCE9O_Q4k4VU1cGYO_y2N3tdWz3QsgNC8Nl7dyMC05XwBTbCLm3vOYHDG76pcb2uNw5rGa1EZ0eaU/s1600/5150.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5539700659150745810" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPNe4cUbrimuXM_XBKm6P9IClF-73wsNGy8kY3I_sAHGcpN9JwOacgsmVwti55mIeCE9O_Q4k4VU1cGYO_y2N3tdWz3QsgNC8Nl7dyMC05XwBTbCLm3vOYHDG76pcb2uNw5rGa1EZ0eaU/s1600/5150.png" style="cursor: hand; cursor: pointer; display: block; margin: 0px auto 10px; text-align: center;" /></a>As we can see, the time needed to write a 4K page more than doubles once our working set is bigger than what we can fit in the L1d (32K). The time per context switch keeps going up and up as the working set size increases, but beyond a certain point the benchmark becomes dominated by memory accesses and is no longer actually testing the overhead of a context switch, it's simply testing the performance of the memory subsystem.<br />
<br />
Same test, but this time with CPU affinity (both processes pinned on the same core):<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixpTl0DOEXwF1iZkFW-TF-pjBFgGhZSltZptT4weR8x4gmxw8WTFibtO5aWCnGDhdXTQMalImXCl-Y2GgdK57J1b-JvTcPeA5edHM1KJ5eRbVQXaDLMhVrjKmoRqIlmySQALr2lIF7wck/s1600/5150-affinity.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5539700665101992450" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixpTl0DOEXwF1iZkFW-TF-pjBFgGhZSltZptT4weR8x4gmxw8WTFibtO5aWCnGDhdXTQMalImXCl-Y2GgdK57J1b-JvTcPeA5edHM1KJ5eRbVQXaDLMhVrjKmoRqIlmySQALr2lIF7wck/s1600/5150-affinity.png" style="cursor: hand; cursor: pointer; display: block; margin: 0px auto 10px; text-align: center;" /></a>Oh wow, watch this! It's an <i>order of magnitude</i> faster when pinning both processes on the same core! Because the working set is shared, it fits entirely in the 4M L2 cache and cache lines simply need to be transferred from L2 to L1d, instead of being transferred from core to core (potentially across 2 physical CPUs, which is far more expensive than within the same CPU).
<br />
Now the results for the i7 processor:<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhVUyOR3UsIG5y48sVNDwJZenppzuEleaSp4ZBkcntTwMcfHuxuXL7FSprWRQtGH_WbNgg6b_R8hVwYg6q3BPMvKpLEG1ZAJwh2bgsE7P4GTX6EwHQesRQEhakg-_fBpYoGBsL8uUwBTU/s1600/E5520.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5539685230539457890" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhVUyOR3UsIG5y48sVNDwJZenppzuEleaSp4ZBkcntTwMcfHuxuXL7FSprWRQtGH_WbNgg6b_R8hVwYg6q3BPMvKpLEG1ZAJwh2bgsE7P4GTX6EwHQesRQEhakg-_fBpYoGBsL8uUwBTU/s1600/E5520.png" style="cursor: hand; cursor: pointer; display: block; margin: 0px auto 10px; text-align: center;" /></a>Note that this time I covered larger working set sizes, hence the log scale on the X axis.<br />
<br />
So yes, context switching on i7 is faster, but only for so long. Real applications (especially Java applications) tend to have large working sets and so typically pay the highest price when undergoing a context switch. Other observations about the Nehalem architecture used in the i7:<br />
<ul>
<li>Going from L1 to L2 is almost unnoticeable. It takes about 130ns to write a page with a working set that fits in L1d (32K) and only 180ns when it fits in L2 (256K). In this respect, the L2 on Nehalem is more of an "L1.5", since its latency is simply not comparable to that of the L2 of previous CPU generations.</li>
<li> As soon as the working set increases beyond 1024K, the time needed to write a page jumps to 750ns. My theory here is that 1024K = 256 pages = half of the TLB of the core, which is shared by the two HyperThreads. Because now both HyperThreads are fighting for TLB entries, the CPU core is constantly doing page table lookups.</li>
</ul>
Speaking of TLB, the Nehalem has an interesting architecture. Each core has a 64 entry "L1d TLB" (there's no "L1i TLB") and a unified 512 entry "L2TLB". Both are dynamically allocated between both HyperThreads.<br />
<h2>
Virtualization</h2>
I was wondering how much overhead there is when using virtualization. I repeated the benchmarks for the dual E5440, once in a normal Linux install, once while running the same install inside VMware ESX Server. The result is that, on average, it's 2.5x to 3x more expensive to do a context switch when using virtualization. My <i>guess</i> is that this is due to the fact that the guest OS can't update the page table itself, so when it attempts to change it, the hypervisor intervenes, which causes an extra 2 context switches (one to get inside the hypervisor, one to get out, back to the guest OS).<br />
<br />
This probably explains why Intel added the EPT (<a href="http://en.wikipedia.org/wiki/Extended_Page_Table">Extended Page Table</a>) on the Nehalem, since it enables the guest OS to modify its own page table without the help of the hypervisor, and the CPU is able to do the end-to-end memory address translation on its own, entirely in hardware (virtual address to "guest-physical" address to physical address).<br />
<h2>
Parting words</h2>
Context switching is expensive. My rule of thumb is that it'll cost you about 30µs of CPU overhead. This seems to be a good worst-case approximation. Applications that create too many threads that are constantly fighting for CPU time (such as Apache's HTTPd or many Java applications) can waste considerable amounts of CPU cycles just to switch back and forth between different threads. I think the sweet spot for optimal CPU use is to have the same number of worker threads as there are hardware threads, and write code in an asynchronous / non-blocking fashion. Asynchronous code tends to be CPU bound, because anything that would block is simply deferred to later, until the blocking operation completes. This means that threads in asynchronous / non-blocking applications are much more likely to use their full time quantum before the kernel scheduler preempts them. And if there's the same number of runnable threads as there are hardware threads, the kernel is very likely to reschedule threads on the same core, which <em>significantly</em> helps performance.<br />
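If you want to size a worker pool that way at runtime, the number of available hardware threads is one <code>sysconf</code> call away (tiny sketch):<br />
<pre>/* Sketch: size a worker pool to the number of online hardware threads. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
  long hw_threads = sysconf(_SC_NPROCESSORS_ONLN);  /* cores x HyperThreads */
  printf("use %ld worker threads\n", hw_threads);
  return 0;
}</pre>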
<br />
Another hidden cost that severely impacts server-type workloads is that after being switched out, even if your process becomes runnable, it'll have to wait in the kernel's run queue until a CPU core is available for it. Linux kernels are often compiled with <code>HZ=100</code>, which entails that processes are given time slices of 10ms. If your thread has been switched out but becomes runnable almost immediately, and there are 2 other threads before it in the run queue waiting for CPU time, your thread may have to wait up to 20ms in the worst scenario to get CPU time. So depending on the average length of the run queue (which is reflected in load average), and how long your threads typically run before getting switched out again, this can considerably impact performance.<br />
<br />
It is illusory to imagine that NPTL or the Nehalem architecture made context switching cheaper in real-world server-type workloads. Default Linux kernels don't do a good job at keeping CPU affinity, even on idle machines. You must explore alternative schedulers or use <a href="http://linux.die.net/man/1/taskset"><code>taskset</code></a> or <a href="http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html"><code>cpuset</code></a> to control affinity yourself. If you're running multiple different CPU-intensive applications on the same server, manually partitioning cores across applications can help you achieve very significant performance gains.tsunahttp://www.blogger.com/profile/06114951663056205324noreply@blogger.com30