Topaz Spec

Topaz is a new VM heavily based on Sean Barrett's SLUX. The VM has the following rough goals:

It should be simple to use. In priority order, ease of use matters most for the programmer, then the library author, interpreter author, compiler author, and VM spec author, but obviously if it's a pain in the ass to use for any of them, people won't use it.
It should be designed so that multiple IF languages can compile to it, by which we mean 'Inform' and 'TADS'.
It should be useful for non-IF applications also
It should support multiple I/O systems
It should be cool and not annoying and stuff
It should be discussed extensively on #vm-design

Random bits:

There is garbage collection, although with no specified algorithm or guarantees
It's stack-based, I think. The Parrot project is going with registers but nobody else seems to. I'm not sure there's a win doing it, and it seems more complicated to generate the code. The Parrot discussion is here, and is pretty interesting.

Data types:

string (immutable; stored in some compressed format in the data file; passed by reference; can be created dynamically)
int (32 bits, signed)
list (passed by reference; can be created/grow/shrink dynamically)
hash (passed by reference; can be created dynamically)
object (I dunno. Like TADS/Inform objects, I guess)
property (property : object :: key : hash)
function (or a method or whatever. passed by reference)
nil (I guess we need this)
true (I don't know if we need this -- language implementation of booleans should probably be done with ints, yes?)

Stuff in the file:

The file layout is a header with stuff (SLUX has a checksum, which seems like a potentially cool idea), then a list of special objects, then a list of property id/property name pairs, then all the object definitions, then all the string definitions, then all the list/hash definitions, then all the function definitions.

The list of special objects is a set of (id number, special-type string) pairs. The id number is that of some object used elsewhere in the game (it can be defined in the object definition section, but that definition will be ignored), and the special type is 'system' or 'glk' or 'regex' or something. This is the type, not the name, of the object; the name can be anything in the compiler (and just gets reduced to an id number anyway). An object mentioned in the special objects list is considered to be defined, and any method calls on it are handled internally by the interpreter rather than by looking up the method definition and evaluating that. For example, if object 32 is of type 'system', and you call 32.'exit', it might exit the program or something. It won't actually attempt to look up the exit property and evaluate it on the object. It should probably be an error to have multiple objects defined as being the same type.
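A rough sketch of how an interpreter might route calls on special objects. Everything named here (Interpreter, DuplicateSpecialType, the builtins table, the call signature) is made up for illustration and is not part of the spec; the only behaviour taken from the text above is that special objects bypass normal property lookup and that duplicate types are probably an error.

```python
class DuplicateSpecialType(Exception):
    """Raised when two objects claim the same special type."""


class Interpreter:
    def __init__(self, special_objects, objects):
        # special_objects: list of (object id, special-type string) pairs.
        # objects: dict of object id -> {method name: callable}, standing in
        # for the ordinary object definition section.
        self.special = {}
        seen_types = set()
        for obj_id, special_type in special_objects:
            if special_type in seen_types:
                raise DuplicateSpecialType(special_type)
            seen_types.add(special_type)
            self.special[obj_id] = special_type
        self.objects = objects
        self.exited = False
        # Hypothetical built-in handlers keyed by (special type, method name).
        self.builtins = {("system", "exit"): self._system_exit}

    def _system_exit(self, *args):
        self.exited = True
        return None

    def call(self, obj_id, method, *args):
        special_type = self.special.get(obj_id)
        if special_type is not None:
            # Special object: never consult the object definition section.
            handler = self.builtins.get((special_type, method))
            if handler is None:
                raise AttributeError(f"{special_type} has no built-in {method!r}")
            return handler(*args)
        # Ordinary object: look the method up and evaluate it.
        return self.objects[obj_id][method](*args)
```

With object 32 listed as 'system', call(32, "exit") runs the built-in handler even if the object section also defines an exit method for object 32.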

The property name/property id thing is a mapping of property names to id numbers. The bytecode generally only uses id numbers to reference properties, and if you use a string it needs to be converted to a thing of type property (which is a property id number) first. The table is there 1) to handle merging different bytecode files, 2) so people can use dynamic property lookup (i.e., by string), with the table providing the conversion, and 3) for debugging.
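Here's a minimal sketch of such a table covering the string-to-id conversion and the merge case. The class name, the method names, and the renumbering policy (append new names after the highest existing id) are all assumptions; the spec doesn't say how merging should renumber anything.

```python
class PropertyTable:
    def __init__(self, pairs):
        # pairs: iterable of (property id, property name)
        self.id_by_name = {}
        self.name_by_id = {}
        for prop_id, name in pairs:
            self.id_by_name[name] = prop_id
            self.name_by_id[prop_id] = name

    def lookup(self, name):
        """Dynamic lookup: property name string -> id, or None if unknown."""
        return self.id_by_name.get(name)

    def merge(self, other):
        """Fold another file's table into this one.

        Returns a dict mapping the other file's ids to ids in this table,
        which the merger would then use to rewrite the other file's bytecode.
        """
        remap = {}
        next_id = max(self.name_by_id, default=-1) + 1
        for old_id, name in sorted(other.name_by_id.items()):
            if name in self.id_by_name:
                remap[old_id] = self.id_by_name[name]
            else:
                self.id_by_name[name] = next_id
                self.name_by_id[next_id] = name
                remap[old_id] = next_id
                next_id += 1
        return remap
```

The name_by_id direction is the part that would serve debugging, since it turns an id found in bytecode back into something readable.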

Each object definition is, of course, an object id number, the size of the definition, object flags, a superclass list, and then a list of property id/value pairs. Values are strings/ints/lists/hashes/objects/properties -- these are all of course references to the actual thing, not the things themselves (except ints).
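To make that layout concrete, here is one possible binary encoding, purely as a sketch. The field widths, the little-endian byte order, the type tags, and the function names are all invented; the spec only fixes the order of the fields. The payload is either the int itself or a reference number for everything else.

```python
import struct

# Hypothetical type tags for property values.
TAGS = {"int": 0, "string": 1, "list": 2, "hash": 3, "object": 4, "property": 5}
TAG_NAMES = {tag: name for name, tag in TAGS.items()}


def pack_object(obj_id, flags, superclasses, props):
    """props: list of (property id, type name, payload int)."""
    body = struct.pack("<HH", flags, len(superclasses))
    body += b"".join(struct.pack("<I", s) for s in superclasses)
    body += struct.pack("<H", len(props))
    for prop_id, type_name, payload in props:
        body += struct.pack("<HBi", prop_id, TAGS[type_name], payload)
    # Assumption: "size of definition" counts the bytes after the id/size prefix.
    return struct.pack("<II", obj_id, len(body)) + body


def unpack_object(data, offset=0):
    obj_id, size = struct.unpack_from("<II", data, offset)
    pos = offset + 8
    end = pos + size
    flags, n_super = struct.unpack_from("<HH", data, pos)
    pos += 4
    superclasses = list(struct.unpack_from("<%dI" % n_super, data, pos))
    pos += 4 * n_super
    (n_props,) = struct.unpack_from("<H", data, pos)
    pos += 2
    props = []
    for _ in range(n_props):
        prop_id, tag, payload = struct.unpack_from("<HBi", data, pos)
        pos += 7
        props.append((prop_id, TAG_NAMES[tag], payload))
    if pos != end:
        raise ValueError("object size field does not match its contents")
    obj = {"id": obj_id, "flags": flags,
           "superclasses": superclasses, "properties": props}
    return obj, end
```

Storing the size up front means a loader can skip an object (say, one that's in the special objects list) without parsing it.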

The string definitions are just, I dunno, a bunch of strings. They should be compressed somehow, maybe a la the Z-machine or maybe a la how SLUX does it or maybe something fancier. Some provision should be made for internationalization/Unicode.

List/hash definitions. I figure lists and hashes can go in the same space, and hashes'll just be lists that definitely have even size. The only downside is now you have to store a type bit before each thing in the list/hash section to say whether it's a list or a hash, right?
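The "hash is just an even-sized list plus a type bit" idea might look like this on the loader side. The function names and the use of a Python bool for the type bit are assumptions for illustration only.

```python
def decode_collection(is_hash, items):
    """Turn a (type bit, flat item list) entry into a list or a hash."""
    if not is_hash:
        return list(items)
    if len(items) % 2 != 0:
        raise ValueError("hash entry must contain an even number of items")
    # Even positions are keys, odd positions are the matching values.
    return dict(zip(items[0::2], items[1::2]))


def encode_collection(value):
    """Inverse of decode_collection: returns (type bit, flat item list)."""
    if isinstance(value, dict):
        flat = []
        for key, val in value.items():
            flat.extend((key, val))
        return True, flat
    return False, list(value)
```

So the cost really is just the one bit per entry, plus an evenness check for hashes at load time.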

Function definitions. This is, I guess, just a big block of bytecode that gets indexed into. Ok, no, wait, functions should probably have lists of their bytecode size, and how much stack space they'll take up, and how many arguments they take, and so on. This could all get handled with initial opcodes in the function but I'm not sure there's a win doing it that way.
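A sketch of the "header in front of each function" option. The three fields come from the paragraph above; their widths, their order, and the names FunctionDef, pack_function and parse_function are assumptions.

```python
import struct
from dataclasses import dataclass


@dataclass
class FunctionDef:
    max_stack: int   # how much stack space the function needs
    num_args: int    # how many arguments it takes
    code: bytes      # the bytecode itself


def pack_function(max_stack, num_args, code):
    # Hypothetical header: u32 bytecode size, u16 stack size, u8 arg count.
    return struct.pack("<IHB", len(code), max_stack, num_args) + code


def parse_function(data, offset=0):
    """Parse one function; returns (FunctionDef, offset of the next one)."""
    code_size, max_stack, num_args = struct.unpack_from("<IHB", data, offset)
    start = offset + 7
    code = data[start:start + code_size]
    if len(code) != code_size:
        raise ValueError("function section is truncated")
    return FunctionDef(max_stack, num_args, code), start + code_size
```

One thing a separate header buys over initial opcodes: the interpreter can check argument counts and reserve stack space before executing a single instruction.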

And that's it for that.

The I/O system. You can change the I/O system at will in this VM. There is a method on the system object (which means "the object of type 'system'") which gets you a new I/O object of the requested type (or fails, and presumably returns nil or something). The 'bootstrap' I/O system is guaranteed to exist and be requestable (and in fact is requested at startup). It supports, I dunno, a single output stream a la stdout and a single line-input-only stream a la stdin. It's possible that the I/O system object should always be referenced as system.io, like how Java does it. The alternative is, what? The programmer will have to call request_io("bootstrap") -- ok, I can already see this is going to be lamer. So system.request_io("glk") will either set the I/O system to glk or do nothing, returning 1 or 0 accordingly, and you can access the I/O system object with system.io.
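The system.request_io / system.io arrangement could be sketched like this. The class names, the factory dict, and the write/read_line methods on the bootstrap object are assumptions; only request_io, system.io, the 1/0 return value and the guaranteed bootstrap system come from the text.

```python
import io


class BootstrapIO:
    """One output stream and one line-input stream, a la stdout/stdin."""

    def __init__(self, out, inp):
        self.out = out
        self.inp = inp

    def write(self, text):
        self.out.write(text)

    def read_line(self):
        return self.inp.readline().rstrip("\n")


class System:
    def __init__(self, available):
        # available: dict of I/O type name -> zero-argument factory.
        self.available = available
        self.io = None
        # The bootstrap system is requested at startup and must exist.
        if self.request_io("bootstrap") != 1:
            raise RuntimeError("interpreter lacks the bootstrap I/O system")

    def request_io(self, kind):
        """Switch I/O systems; return 1 on success, 0 (and do nothing) on failure."""
        factory = self.available.get(kind)
        if factory is None:
            return 0
        self.io = factory()
        return 1
```

A game that wants glk but can live without it would call system.request_io("glk"), ignore a 0, and keep writing through system.io either way.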

I think the only other thing is opcodes. Presumably it works like this:
.. no, ok, I'm not going to write out all the opcodes. There should be all the obvious ones, from the z-spec/slux spec/tads spec (ha ha). We can eliminate any of the ones dealing with I/O, and some things that are not called particularly often should be methods on special objects, not opcodes in themselves. So the questions are, basically,

Operator overloading? TADS has this with, e.g., the add opcode, which does list append if you give it a list and addition if you give it ints. This is messy and a pain to implement. On the other hand, if you don't do this, you can't support language operator overloading unless you make all/many of your add calls generate opcodes a la "push type; jump unless type == int; add; jump unless type == list; listappend;" etc. (You don't have to do this if you can work out the type of the variable at compile time, but generally you can't do that except for local variables.) SLUX, again, has something cool here -- non-object types have 'pseudoclass' objects associated with them, and if you use an opcode on a value which is the wrong type, it sends the opcode as a message to the value's type's pseudoclass (presumably with the value as an argument).
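For what it's worth, the pseudoclass fallback might be sketched like this. This is a guess at the shape of the idea, not SLUX's actual mechanism; op_add, the pseudoclasses dict, and the handler signature are all invented, and 32-bit overflow is ignored.

```python
def op_add(a, b, pseudoclasses):
    """The 'add' opcode: native for ints, otherwise a message to a pseudoclass.

    pseudoclasses: dict of Python type -> {opcode name: handler(receiver, arg)},
    standing in for the per-type pseudoclass objects.
    """
    # Fast path: the type the opcode natively understands.
    if isinstance(a, int) and isinstance(b, int):
        return a + b
    # Wrong type: send the opcode as a message to the value's pseudoclass.
    handler = pseudoclasses.get(type(a), {}).get("add")
    if handler is None:
        raise TypeError(f"add is not defined for {type(a).__name__}")
    return handler(a, b)


# A list pseudoclass that makes 'add' behave like TADS-style list append.
pseudoclasses = {list: {"add": lambda lst, item: lst + [item]}}
```

The nice property is that the interpreter's add stays an int-only opcode, and the TADS-style list behaviour lives in data the compiler emits rather than in the VM.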
What should be an opcode and not a special object method? I'm thinking basic functions for the different types (key lookup on hash, property evaluate or superclass on object) should be opcodes and not special object methods. But what about slightly less common but still fairly common things like strlen? Or substr? (This is all based on the assumption that special object methods are noticeably more expensive than opcodes, which may or may not be the case. They are definitely more expensive in terms of number of instructions, since you need to push the arguments and so on.)
What optimizations do we put in? I figure we've got about a hundred opcodes, maybe fewer. It's probably not much good to try and store the opcode in less than a byte, so we have like 150 extra opcodes. So, what do we do with them? One thing we can do is add optimizations for common operations. For instance, to call a method you probably do
- push property
- push object
- call
This is, I dunno, 7 bytes. Now, most of the time, you will know what the property id is at compile time. So we could save a byte on almost every property reference (of which there will be lots) by adding a second form of call, so you can do
- push object
- call 32 (or whatever the property is)
or even a third form that allows "call 100 32" since you often even know the object. So we can potentially save over 25% of the bytes on a very common operation just by adding another opcode. And there are lots of cases like this, where you could add opcodes to call things on the self object, or to push the first/second/third method argument onto the stack, etc. And I think this is how most languages do it. TADS certainly does (for instance, it's got all the following jumps: jump, jump if true, jump if false, jump if equal, jump if not equal, jump if greater-than, jump if greater-than-or-equal, jump if less-than, jump if less-than-or-equal). This is fine, I guess, but it seems like the pain for interpreter authors is really more than it's worth, and leads to lots of code duplication in interpreters which leads to errors etc.
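The byte counts above work out if you assume one-byte opcodes and two-byte operands (that encoding is my assumption; it's just the one that makes the generic sequence come to 7 bytes). The opcode names below are made up.

```python
OPCODE_BYTES = 1   # assumed
OPERAND_BYTES = 2  # assumed


def size(instructions):
    """Total encoded size of a list of (opcode name, operand count) pairs."""
    return sum(OPCODE_BYTES + OPERAND_BYTES * n for _, n in instructions)


generic = [("push_property", 1), ("push_object", 1), ("call", 0)]
known_prop = [("push_object", 1), ("call_prop", 1)]   # "call 32"
known_both = [("call_obj_prop", 2)]                   # "call 100 32"

for name, seq in [("generic", generic), ("known property", known_prop),
                  ("known object and property", known_both)]:
    saved = size(generic) - size(seq)
    print(f"{name}: {size(seq)} bytes, saves {saved / size(generic):.1%}")
```

Under those assumptions the second form saves one byte in seven and the third saves two in seven, which is where "over 25%" comes from.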
So an alternate theory is we could reserve some opcodes for macros. A macro opcode would mean "execute the N bytes of code at this address then keep going". This has a number of obvious disadvantages, including "what do jumps from within the macro do" and "would this hurt JIT?" and "this wouldn't actually work to compress 'push obj; push prop; call' for arbitrary obj and prop". But it does seem kinda cool, and would, I think, let compilers do fancier optimization on larger blocks of code and would be good if there's some bytecode the compiler commonly outputs that there is no preset opcode for.
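A toy version of the macro idea, to make "execute the N bytes of code at this address then keep going" concrete. The four opcodes, their numbering, and the one-byte operands are invented for the sketch, and it dodges the hard questions (jumps out of a macro, JIT) entirely.

```python
PUSH, ADD, MACRO, HALT = 0, 1, 2, 3  # hypothetical opcode numbers


def run(code, pc=0, end=None, stack=None):
    """Execute code[pc:end] on the given stack and return the stack."""
    stack = [] if stack is None else stack
    end = len(code) if end is None else end
    while pc < end:
        op = code[pc]
        if op == PUSH:              # PUSH value
            stack.append(code[pc + 1])
            pc += 2
        elif op == ADD:             # pop two, push their sum
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == MACRO:           # MACRO address length
            addr, length = code[pc + 1], code[pc + 2]
            run(code, addr, addr + length, stack)  # run the macro body...
            pc += 3                                # ...then keep going
        elif op == HALT:
            break
        else:
            raise ValueError(f"unknown opcode {op} at {pc}")
    return stack


program = bytes([
    PUSH, 1,       # offset 0
    MACRO, 9, 3,   # offset 2: run the 3 bytes at offset 9
    MACRO, 9, 3,   # offset 5: and again
    HALT,          # offset 8
    PUSH, 10, ADD  # offset 9: macro body, "add 10 to top of stack"
])
```

Note the body here has the constant 10 baked in, which is exactly the limitation mentioned above: it can't stand in for "push obj; push prop; call" with arbitrary obj and prop.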

I think this is the whole spec. Ok, except for restore/restart/undo. The way those work (plus start, when you first load the game) is, when one of the four is executed, the call stack is cleared, every object in the game is reset to the appropriate state, and then system.main is called, with one of 'start', 'restore', etc being passed as an argument to it. Uh, and this suggests that you can in fact have user-defined methods on special objects.
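Here's a sketch of that re-entry behaviour. What "the appropriate state" means for each case is my guess (initial state for start/restart, a saved state for restore, the last snapshot for undo), and the VM class, save_undo, and the shape of main are all hypothetical.

```python
import copy


class VM:
    def __init__(self, initial_objects, main):
        self.initial_objects = copy.deepcopy(initial_objects)
        self.main = main            # stands in for system.main
        self.call_stack = []
        self.objects = {}
        self.undo_snapshot = None

    def _reenter(self, reason, state):
        # Same three steps for all four cases: clear the call stack,
        # reset every object, then call main with the reason.
        self.call_stack.clear()
        self.objects = copy.deepcopy(state)
        return self.main(self, reason)

    def start(self):
        return self._reenter("start", self.initial_objects)

    def restart(self):
        return self._reenter("restart", self.initial_objects)

    def restore(self, saved_objects):
        return self._reenter("restore", saved_objects)

    def save_undo(self):
        self.undo_snapshot = copy.deepcopy(self.objects)

    def undo(self):
        if self.undo_snapshot is None:
            raise RuntimeError("nothing to undo")
        return self._reenter("undo", self.undo_snapshot)
```

Since main always starts from an empty call stack, the game's main has to use the reason argument to decide whether to print the intro or just pick up at the prompt.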