Getting some rest - random thoughts

Today, I'm learning to get rest the hard way. No, not representational state transfer, and not reStructuredText - actually stopping and recovering from the last year. From looking for work while planning a wedding, to working for eight full-time hours and commuting for six, to getting run-down and never really recovering, this has been one crazy year. I've had no desire at all to put down my keyboard and stop thinking about code, but I feel that I no longer have any choice. I am so very tired.

My plan is to cut the time I spend coding on interesting problems into nice discrete blocks, to plan ahead the things I will work on in that time, and to take generous rest breaks. I'll spend some time writing about the things I've been playing with, even incomplete thoughts.

Entangled

I've moved my research project into its own garden, where I can tweak the IR and compiler to fit the needs of the analysis and optimiser. At the moment it can compile several versions of Python bytecode into two different SSA forms - the first with implicit stack and locals, and the second with the more familiar arguments. It also has a web-based visualisation that can animate the SSA graph as it is constructed. I have a little work to do on the visualisation before I can get back to the next graph transformation - I want to display and highlight the use-dep relationships by drawing nice paths in SVG. Then I get to convert the explicit SSA form into one with explicit special-method invocation.
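
As a rough sketch (my own notation here, not the project's actual IR or op names), the two forms for a trivial function look something like this:

# Rough sketch only - not the actual IR or its op names.
import dis

def add(a, b):
    return a + b

dis.dis(add)
# CPython bytecode (pre-3.11, roughly):
#   LOAD_FAST a; LOAD_FAST b; BINARY_ADD; RETURN_VALUE
#
# First SSA form - the stack and locals are still implicit, every op
# just produces a fresh value name:
#   v0 = LOAD_FAST 'a'
#   v1 = LOAD_FAST 'b'
#   v2 = BINARY_ADD          # operands come from the modelled stack
#   RETURN_VALUE             # consumes the top of the modelled stack
#
# Second SSA form - the stack and locals have been resolved into
# explicit arguments:
#   v0 = param 'a'
#   v1 = param 'b'
#   v2 = binary_add(v0, v1)
#   return v2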

Starting over with my own compiler has made me realise some of the design decisions I wish had been made in pypy. Entangled has static descriptions of the languages it operates on, much like Chez Scheme has. This gives a great place to add extra information to particular operations, such as their stack effect, whether they never return, and how to decode their operand. It also means we can be certain of the language that reaches each pass - we know there are no operations we don't expect.
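
As a sketch of the shape this takes (the names here are my own illustration, not Entangled's actual description format):

# Sketch only - invented names, not the real description format.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class OpInfo:
    name: str
    stack_effect: int                            # net pushes minus pops
    never_returns: bool                          # e.g. raises or ends the function
    decode_operand: Callable[[bytes, int], object]

def decode_u8(code: bytes, offset: int) -> int:
    return code[offset]

def no_operand(code: bytes, offset: int) -> None:
    return None

# CPython 3.10 opcode numbers, described statically:
PY310 = {
    0x64: OpInfo("LOAD_CONST",   +1, False, decode_u8),
    0x7c: OpInfo("LOAD_FAST",    +1, False, decode_u8),
    0x17: OpInfo("BINARY_ADD",   -1, False, no_operand),
    0x53: OpInfo("RETURN_VALUE", -1, True,  no_operand),
}

# A pass can then assert that every opcode it meets is described,
# which is what guarantees no unexpected operations reach it.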

Another thing I wish we had had in pypy and in firm is that there is no global state in the compiler. Individual passes can maintain their own state; for example, if we did call-family lowering in this compiler, that pass would maintain a concordance between functions and their 'call family' - the set of signatures they are invoked with, as well as other functions that may be aliased at the same call site. However, once that pass was done, any information not present in the graph would be thrown away.
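
The shape of such a pass, sketched with invented names rather than the real thing:

# Sketch of a pass that keeps its own state and throws it away afterwards.
# Invented names, not the real pass.
class CallFamilyLowering:
    def __init__(self):
        # Pass-local concordance: function -> the signatures it is invoked
        # with, plus the functions that may be aliased at the same call sites.
        self.call_families = {}

    def run(self, graphs):
        for graph in graphs:
            self._collect(graph)
        for graph in graphs:
            self._rewrite(graph)
        # Anything not written back into the graphs dies with this object;
        # no compiler-wide tables survive the pass.

    def _collect(self, graph): ...
    def _rewrite(self, graph): ...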

I've also discovered how to get 80 percent of the benefit with only 20 percent of the algebra: we can track, with static analysis, which objects we know have no other alias present on the stack. Or more weakly, which objects have no alias that has been written through. The truly interesting bit of this compiler will be how we determine this quality and how we use it.
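
Roughly - and this is only a sketch of the weaker form, not the actual analysis - it looks like abstract interpretation over the stack, marking values as unaliased until something could create a second live reference:

# Sketch: track which values are 'unaliased' - no other reference to them
# is live on the modelled stack.  The op names are invented, not the real IR.
def analyse(block):
    stack = []                                   # entries are (value_id, unaliased?)
    facts = {}
    for op in block:
        if op.name == "NEW_OBJECT":
            stack.append((op.result, True))      # fresh allocation: no aliases yet
        elif op.name == "DUP":
            vid, _ = stack[-1]
            stack[-1] = (vid, False)             # now two stack references exist
            stack.append((vid, False))
        elif op.name == "LOAD_GLOBAL":
            stack.append((op.result, False))     # could be aliased anywhere
        elif op.name == "STORE_ATTR":
            obj_id, _ = stack.pop()              # object written through
            val_id, _ = stack.pop()              # stored value escapes, so any
            stack = [(v, u and v != val_id)      # remaining copy of it is aliased
                     for v, u in stack]
        facts[op] = list(stack)
    return facts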

Bytecode VMs

I love to play with bytecode VMs. Of particular interest to me are VMs for bootstrapping. These involve writing minimal assembler or C, thus providing a runtime on which you can write code to get things done. This is something of a mashup of two things: on one hand, the bootstrapping king is FORTH, which can be defined quite easily with a little assembler code; on the other hand, a truly nice bytecode VM is implemented by femtolisp, which contains entirely printable opcodes.

My bytecode VMs have no practical use - they are just fun to implement. First I like to define the types and operations I would like to have, then I define the representation of values and what state needs to be tracked, and then write a function to perform each operation. I take care that each operation is defined in terms of primitives that translate easily into assembly.
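
A toy version, just to make the shape concrete (this is not one of the actual VMs):

# Toy dispatch loop - not one of the real VMs, just the shape of one.
def op_push(vm, operand):  vm.stack.append(operand)
def op_add(vm, operand):   vm.stack.append(vm.stack.pop() + vm.stack.pop())
def op_print(vm, operand): print(vm.stack.pop())

OPS = {"push": op_push, "add": op_add, "print": op_print}

class VM:
    def __init__(self, code):
        self.code = code              # a list of (opname, operand) pairs
        self.stack = []
        self.pc = 0

    def run(self):
        while self.pc < len(self.code):
            opname, operand = self.code[self.pc]
            self.pc += 1
            OPS[opname](self, operand)

VM([("push", 2), ("push", 3), ("add", None), ("print", None)]).run()   # prints 5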

I've recently designed two of these languages - a fancy one with datatypes and fancy matching and all the stuff I'd want in a bytecode language - and a very simple one, with just array-lists, bytestrings, fixed-size integers, closures, and a handful of constants (booleans and a none value). This last one has the handy benefit that integer literals appear almost as their hexadecimal representation in source. I've written a handful of functions in it, such as equals? and foldr, and can confirm it's not the sort of language you would want to use on a regular basis, but it is quite usable.

The most dramatic thing I have taken away from the latter implementation: make all your functions unary, and have them take a list. If instead you have your CALL_* operations take an extra nargs parameter, that requires a lot more work in the code for the operation, and then you still need to do extra work to actually get the arguments desired to the top of the stack. Another alternative is just giving the function the current stack to work with, but that truly complicates return, as well as one other thing I ran into.
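
In interpreter terms, the difference looks roughly like this (a sketch; push_frame is an invented helper):

# With CALL_N the operation has to shuffle nargs values itself:
def op_call_n(vm, nargs):
    args = [vm.stack.pop() for _ in range(nargs)]    # pop, then restore order
    func = vm.stack.pop()
    vm.push_frame(func, list(reversed(args)))

# With a unary convention the caller already built the argument list,
# and CALL just moves one value:
def op_call(vm, _operand):
    arglist = vm.stack.pop()
    func = vm.stack.pop()
    vm.push_frame(func, arglist)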

That is, the second big takeaway - you really want a locals vector. Before I started creating one in my functions as a general practice, I found myself needing to set the stack up a certain way so that I could tail-call into different conditions. Most of the time, that was not a constant-time process. So make it easy - I tend to push a vector with default values for locals at function entry.
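
Concretely, the prologue and the locals accessors come out something like this (a sketch; frame_base is an invented detail):

# Sketch: push a locals vector with default values at function entry, so
# later code can read and write locals by index in constant time instead
# of rearranging the stack.
def function_prologue(vm, n_locals):
    vm.frame_base = len(vm.stack)                # remember where the vector sits
    vm.stack.append([None] * n_locals)

def op_local_get(vm, index):
    vm.stack.append(vm.stack[vm.frame_base][index])

def op_local_set(vm, index):
    vm.stack[vm.frame_base][index] = vm.stack.pop()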

The third takeaway is that if people are going to write bytecode by hand - that is, not compilers or assemblers - don't use a relative offset or even an absolute position in your jump opcode: instead, have it take a command string as its argument. This made function construction fairly readable: a function starts by allocating an empty vector, appending a string for each basic block, appending the rest of the function's 'environment', and then combining the environment with the function's start block.
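
On the interpreter side the mechanism is tiny - something like this sketch (invented names again, not the real implementation):

# Sketch: a jump opcode that takes a command string rather than an offset.
def op_jump(vm, _operand):
    block = vm.stack.pop()        # a command string, typically pulled out of
                                  # the function's environment vector
    vm.code = block               # abandon the current block entirely
    vm.pc = 0                     # the main loop keeps reading single-character
                                  # operations from vm.code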

I made a couple of other concessions for readability. Whitespace is a no-op, so spaces can be used to logically group different operations. Comments can be embedded, as the ; operation skips commands until the next newline. String literals can be embedded too, with the " operation performing a similar function.

Here's a code example - foldr:

;; ZL creates a list of length zero.  This will be our environment.
ZL
;; Here, the backtick ` aborts with the message on the top of the
;; stack, and the letter 'a' at the end appends this block to the
;; list.
"\"number is not iterable\"`"a
;; take ([) item 9 (X9) from the environment (e) and branch (q)
;; to it.  Presumably I wanted to keep the list functions near
;; each other.
"eX9[q"a
"eXb[q"a
"\"function is not iterable\"`"a
"\"none is not iterable\"`"a
"\"true is not iterable\"`"a
"\"false is not iterable\"`"a
"\"<unknown> is not iterable\"`"a
;; bX1 = first argument, ZmZ[ is the index, ZmX1[ is the
;; accumulated result
"eXA Z bX1[l ZmZ[ =p +q"a
"Zm bZ[ ZL bX1[ ZmZ[ [a ZmX1[a . X1] Zm ZmZ[ ] eX9[q"a
"ZmZ[z"a
;; strings
"eXE Z bX1[_ ZmZ[ =p +q"a
"Zm bZ[ ZL bX1[ ZmZ[ wa ZmX1[a . X1] Zm ZmZ[ ] eXD[q"a
"ZmZ[z"a

;; start block
;; push local vector [index, result], branch on type of arg1
"ZL Za bX2[a eX1mhq"
\  ;; Construct function

It's clearly not pretty or easy to use - but there is no undefined behaviour or weird linking behaviour, and you can see how - stranded on a desert island with nothing but an m68k and a way to write the operations - you could bootstrap your way to a real language that you could work in. You could easily write a compiler to generate code for it, too.

Instruction Set Architectures

I'm a real chip geek, and it's an interesting time to be one: at the high end, there's real interest around RISC-V. On the lower end, devices with ARM processors in them are everywhere. You can even find some MIPS-based routers if you are prepared to do some ground work. I got into software right after 2006, during the rise of amd64 in the consumer space. The thing to know about that time is that, just like in the early 90s, you had to mortgage your house to get something interesting that still had the power to do what you wanted to do. The only real players at the time were the SPARC family and the Itanium II, and they were both priced well beyond anything I could afford.

Anyway, we now have the ability to design, test, and run different levels of emulation for any kind of hardware we want to build, right from our desks - so it's well worth playing with. You can build stack architectures like the RTX2010 or the Burroughs large systems, or you can build wide vector architectures like AMD's GCN, which debuted with the Radeon HD 7xxx series of graphics cards.

Me - I'm interested in three things: flexible security, wide execution, and great performance tools available to sophisticated compilers and safe runtimes. Oh, and the less patent-hobbled the better. I'm still bitter at Intel for pricing the Itanium II beyond developers who wanted to develop for it, and for killing the i960 in the consumer market.

It's great fun to play with the concepts behind ISA design. For example: have you ever wondered why the different operations on a CPU take the time that they do? Multiplication typically takes three times as long as addition, which might sound puzzling until you find out that they are probably using a Dadda multiplier, which is truly a beautiful trick. Why, on a modern x86-64 CPU, does integer division take 29 mostly-pipelined cycles while floating-point division takes 9 unpipelined cycles? Because there is only one hardware low-latency division unit, and it's in the floating-point path. Fun exercise: figure out how they implemented the different division units.

As someone writing a compiler that is good at figuring out how to remove false data dependencies, I find it fun to look at wide architectures and work out how to make them fit real-world programming languages. These are languages which are semantically about graph walking and control flow, and which require all sorts of dynamic allocation patterns - but if we can figure out how to execute that sort of code on vector architectures, we have a great path to real performance improvements.

As a compiler architect, I think there are a few things missing from the GPU that would make it a great tool for parallelising typical inner loops. These include a way to represent a conditional append to a result list within each loop iteration, ways to easily handle function pointers, and a way to partition the streams by a value and handle each value one at a time.
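
For the first of those, the usual workaround today is a mask plus an exclusive prefix sum - stream compaction - which is roughly the pattern I would rather be able to express directly. A numpy sketch of it:

# Sketch: a conditional append per iteration, expressed as mask + scan.
import numpy as np

xs = np.arange(16)
keep = (xs % 3) == 0                     # the per-iteration condition
positions = np.cumsum(keep) - keep       # exclusive prefix sum of the mask
out = np.empty(keep.sum(), dtype=xs.dtype)
out[positions[keep]] = xs[keep]          # scatter each kept element to its slot
print(out)                               # [ 0  3  6  9 12 15]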

I can't say yet that I've figured it out - until I can see the sort of changes I can make to the way we lay out objects in the compiler - but I'm having a lot of fun exploring the possibilities.

A good place to start playing would actually be the old Moxie Logic blog posts about GCC support. I have often wondered whether GCC could be augmented to graph the cost/benefit of different numbers of registers and to show how they get used.
