+ into a global register variable if possible; if no register is
+ available, fall back to GCC's __thread extension to create a
+ thread-local variable.
+
+ Even on x86, where registers are scarce, a register variable is
+ worthwhile here: I measured a slowdown of about 2-5% with the
+ __thread version.