Topaz is a new VM heavily based on Sean Barrett's
SLUX.
The VM has the following rough goals:
- It should be simple to use. In order of priority, ease of use matters
for the programmer, library author, interpreter author, compiler author,
and VM spec author, but obviously if it's a pain in the ass to use in
any of these areas, people won't use it.
- It should be designed so that multiple IF languages can compile
to it, by which we mean 'Inform' and 'TADS'.
- It should be useful for non-IF applications too.
- It should support multiple I/O systems.
- It should be cool and not annoying and stuff.
- It should be discussed extensively on #vm-design.
Random bits:
- There is garbage collection, although with no specified algorithm
or guarantees
- It's stack-based, I think. The Parrot project is going with
registers but nobody else seems to. I'm not sure there's a win
doing it, and it seems more complicated to generate the code. The
Parrot discussion is here,
and is pretty interesting.
Data types:
- string (immutable; stored in some compressed format in the data
file; passed by reference; can be created dynamically)
- int (32 bits, signed)
- list (passed by reference; can be created/grow/shrink dynamically)
- hash (passed by reference; can be created dynamically)
- object (I dunno. Like TADS/Inform objects, I guess)
- property (property : object :: key : hash)
- function (or a method or whatever. passed by reference)
- nil (I guess we need this)
- true (I don't know if we need this -- language implementation of
booleans should probably be done with ints, yes?)
Stuff in the file:
The file layout is a header with stuff (SLUX has a checksum, which
seems like a potentially cool idea), then a list of special objects,
then a list of property id/property names, then all the object
definitions, then all the string definitions, then all the list/hash
definitions, then all the function definitions.
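Here's a rough sketch of that layout in Python, mostly to pin down the section order. None of the byte-level details are decided anywhere: the magic number, the CRC32 checksum, the per-section length fields and the section names are all assumptions made up for this example.

```python
import struct
import zlib

# Section order as described in the text; the names themselves are invented.
SECTIONS = ["special_objects", "property_names", "objects",
            "strings", "lists_hashes", "functions"]

MAGIC = b"TOPZ"  # hypothetical magic number


def pack_file(sections):
    """Build a file image: header first, then each section in the fixed order.

    Assumed header: magic, CRC32 of the body, one 32-bit length per section.
    """
    body = b"".join(sections[name] for name in SECTIONS)
    lengths = [len(sections[name]) for name in SECTIONS]
    header = (MAGIC
              + struct.pack(">I", zlib.crc32(body))
              + struct.pack(">6I", *lengths))
    return header + body


def unpack_file(image):
    """Split a file image back into its sections, checking the checksum."""
    if image[:4] != MAGIC:
        raise ValueError("not a Topaz file")
    (checksum,) = struct.unpack(">I", image[4:8])
    lengths = struct.unpack(">6I", image[8:32])
    body = image[32:]
    if zlib.crc32(body) != checksum:
        raise ValueError("checksum mismatch")
    sections, offset = {}, 0
    for name, length in zip(SECTIONS, lengths):
        sections[name] = body[offset:offset + length]
        offset += length
    return sections
```

The point of the checksum is just that an interpreter can refuse a damaged file up front instead of falling over halfway through a game.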
The list of special objects is a set of (id number, special-type
string) pairs. The id number is that of some object used elsewhere in
the game (it can be defined in the object definition section but that
definition will be ignored), and the special type is 'system' or 'glk'
or 'regex' or something. This is the type, not the name, of the
object; the name can be anything in the compiler (and just gets reduced
to an id number anyway). An object mentioned in the special objects
list is considered to be defined, and any method calls on the object
are handled by the interpreter internally rather than actually looking
up the method definition and evaluating that. For example, if object
32 is of type 'system', and you call 32.'exit', it might exit the
program or something. It won't actually attempt to look up the exit
property and evaluate it on the object. It should probably be an error
to have multiple objects defined as being the same type.
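A minimal sketch of that dispatch rule, in Python. The Interpreter class, its builtins table and the way ordinary objects are stored are all invented for illustration; the only things taken from the text are that special objects are intercepted by type and that two objects sharing a type is an error.

```python
class Interpreter:
    def __init__(self, special_objects, objects):
        # special_objects: {object id: special-type string}, from the file's
        # special objects list.
        # objects: {object id: {method name: callable}}, the ordinary
        # definitions (a stand-in for real bytecode).
        types = list(special_objects.values())
        if len(types) != len(set(types)):
            raise ValueError("two objects share a special type")
        self.special = dict(special_objects)
        self.objects = objects
        self.exited = False
        # Built-in handlers, keyed by (special type, method name).
        self.builtins = {("system", "exit"): self._system_exit}

    def _system_exit(self):
        self.exited = True
        return None

    def call(self, obj_id, method, *args):
        special_type = self.special.get(obj_id)
        if special_type is not None:
            # Handled internally; any definition of this object in the
            # object section is never consulted.
            handler = self.builtins[(special_type, method)]
            return handler(*args)
        return self.objects[obj_id][method](*args)
```

So with object 32 registered as 'system', `vm.call(32, "exit")` runs the interpreter's own handler even if the object section also happens to define an exit property on object 32.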
The property name/property id thing is a mapping of property names to
id numbers. The bytecode generally only uses id numbers to reference
properties, and if you use a string it needs to be converted to a
thing of type property (which is a property id number) first. The
table has three uses: 1) merging different bytecode files, 2) dynamic
property lookup (i.e., by string), where the table provides the
conversion, and 3) debugging.
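To make the merging case concrete, here's a small Python sketch. It assumes the table is just a name-to-id mapping and that merging means keeping the first file's ids and renumbering the second file's; merge_property_tables and lookup_property are hypothetical names, not anything the spec defines.

```python
def merge_property_tables(table_a, table_b):
    """Merge two {name: id} tables.

    Returns the merged table plus a remap {old id in b: new id}, which is
    what you'd apply to b's bytecode. Ids from a are kept as they are.
    """
    merged = dict(table_a)
    next_id = max(merged.values(), default=-1) + 1
    remap = {}
    # Walk b in id order so new ids are handed out deterministically.
    for name, old_id in sorted(table_b.items(), key=lambda item: item[1]):
        if name not in merged:
            merged[name] = next_id
            next_id += 1
        remap[old_id] = merged[name]
    return merged, remap


def lookup_property(table, name):
    """Dynamic lookup: turn a string into a property id, or None if unknown."""
    return table.get(name)
```

Without the names in the file there'd be no way to tell that property 0 in one file and property 1 in another are both 'desc', which is why the table has to ship with the bytecode.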
Each object definition is of course an object id number, size of
definition, object flags, superclass list, then a list of property
id/value pairs. Values are strings/ints/lists/hashes/objects/properties
-- these are all of course references to the actual thing, not the
things themselves (except ints).
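In Python terms an object definition might look something like the following. This is only a sketch: the Value and ObjectDef classes, the kind strings and the heap table are all invented here, and the "size of definition" field is left out because it only matters when reading the file.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Value:
    kind: str      # "int", "string", "list", "hash", "object" or "property"
    payload: int   # the number itself for "int", otherwise an id to look up


@dataclass
class ObjectDef:
    obj_id: int
    flags: int = 0
    superclasses: list = field(default_factory=list)   # object ids
    properties: dict = field(default_factory=dict)     # property id -> Value


def resolve(value, heap):
    """Turn a Value into the thing it denotes.

    heap is a hypothetical {kind: {id: thing}} table standing in for the
    string, list/hash and object sections of the file.
    """
    if value.kind == "int":
        return value.payload                   # ints are stored directly
    return heap[value.kind][value.payload]     # everything else is a reference
```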
The string definitions are just, I dunno, a bunch of strings. They
should be compressed somehow, maybe a la the z-machine or maybe a la
how SLUX does it or maybe something fancier. Some provision should be
made for internationalization/unicode.
List/hash definitions. I figure lists and hashes can go in the same
space, and hashes'll just be lists that definitely have even size. The
only downside is now you have to store a type bit before each thing in
the list/hash section to say whether it's a list or a hash, right?
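A quick Python sketch of the shared list/hash idea, assuming a hash is stored flat as key, value, key, value and the type bit has already been read. decode_entry is a made-up name.

```python
def decode_entry(is_hash, items):
    """Interpret one entry from the shared list/hash section.

    is_hash is the type bit; items is the flat sequence of stored values.
    """
    if not is_hash:
        return list(items)
    if len(items) % 2 != 0:
        raise ValueError("a hash needs an even number of items")
    # Pair the items up as key, value, key, value, ...
    return dict(zip(items[0::2], items[1::2]))
```

For example, `decode_entry(True, ["a", 1, "b", 2])` gives `{"a": 1, "b": 2}`, while the same items with the bit clear are just a four-element list.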
Function definitions. This is, I guess, just a big block of bytecode
that gets indexed into. Ok, no, wait, each function should probably
record its bytecode size, and how much stack space it'll take
up, and how many arguments it takes, and so on. This could all get
handled with initial opcodes in the function but I'm not sure there's
a win doing it that way.
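Here's what the header-per-function version could look like in Python. The field widths (32-bit size, 16-bit stack slots, 8-bit argument count) are guesses made up for this sketch, as are the function names.

```python
import struct

# Assumed header: bytecode size, stack slots, argument count (big-endian).
HEADER = struct.Struct(">IHB")


def pack_function(bytecode, stack_slots, num_args):
    """Prefix a block of bytecode with its header."""
    return HEADER.pack(len(bytecode), stack_slots, num_args) + bytecode


def read_function(block, offset):
    """Read one function at offset; return (info, offset of the next one)."""
    size, stack_slots, num_args = HEADER.unpack_from(block, offset)
    start = offset + HEADER.size
    info = {
        "stack_slots": stack_slots,
        "num_args": num_args,
        "bytecode": block[start:start + size],
    }
    return info, start + size
```

One thing this buys over initial opcodes is that the interpreter can skip from function to function, and size the stack frame, without decoding any bytecode at all.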
And that's it for that.
The I/O system. You can change the I/O system at will in this
VM. There is a method on the system object (which means "the object of
type 'system'") which gets you a new I/O object of the requested type
(or fails, and presumably returns nil or something). The 'bootstrap' I/O
system is guaranteed to exist and be requestable (and in fact is
requested at startup). This supports, I dunno, a single stream a la
stdout and a single line-input-only stream a la stdin. It's possible
that the I/O system object should always be referenced as system.io,
a bit like Java's System.in and System.out. The alternative is, what?
The programmer will
have to call request_io("bootstrap") -- ok, I can already see this is
going to be lamer. So system.request_io("glk") will either set the I/O
system to glk or do nothing, and return 1/0 accordingly, and you can
access the I/O system object with system.io.
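Sketched in Python, the system.io arrangement might look like this. The System and BootstrapIO classes and the factory table are invented for the example; what comes from the text is that 'bootstrap' always exists, is requested at startup, and that request_io returns 1 or 0 and leaves things alone on failure.

```python
class BootstrapIO:
    """One output stream and one line-input stream, like stdout and stdin."""

    def __init__(self, input_lines=()):
        self.output = []
        self._input = list(input_lines)

    def write(self, text):
        self.output.append(text)

    def read_line(self):
        # Returns None when there is no more input.
        return self._input.pop(0) if self._input else None


class System:
    def __init__(self, available):
        # available: {I/O system name: factory}; 'bootstrap' must be there.
        if "bootstrap" not in available:
            raise ValueError("bootstrap I/O must exist")
        self._available = available
        self.io = None
        self.request_io("bootstrap")   # requested at startup

    def request_io(self, name):
        factory = self._available.get(name)
        if factory is None:
            return 0                   # unknown system: leave system.io alone
        self.io = factory()
        return 1
```

A game that wants glk would call `system.request_io("glk")` and, if it gets 0 back, just carry on using whatever `system.io` already is.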
I think the only other thing is opcodes. Presumably it works like
this:
... no, ok, I'm not going to write out all the opcodes. There should be
all the obvious ones, from the z-spec/SLUX spec/TADS spec (ha ha). We
can eliminate any of the ones dealing with I/O, and some things that
are not called particularly often should be methods on special
objects, not opcodes in themselves. So the questions are, basically,
- Operator overloading? TADS has this with, e.g., the add opcode,
which does list append if you give it a list and addition if you
give it ints. This is messy and a pain to implement. On the other
hand, if you don't do this, you can't support language operator
overloading unless you make all/many of your add calls generate
opcodes a la "push type; jump unless type == int; add; jump
unless type == list; listappend;" etc. (You don't have to do this
if you can work out the type of the variable at compile-time but
generally you can't do this except for local variables). SLUX,
again, has something cool here -- non-object types have
'pseudoclass' objects associated with them, and if you use an
opcode on a value which is the wrong type, it sends the opcode as
a message to the value's type's pseudoclass (presumably with the type as
an argument).
- What should be an opcode and not a special object method? I'm thinking
basic functions for the different types (key lookup on hash,
property evaluate or superclass on object) should be opcodes and
not special object methods. But what about slightly less common but
still fairly common things like strlen? Or substr? (This is all
based on the assumption that special object methods are
noticeably more expensive than opcodes, which may or may not be
the case. They are definitely more expensive in terms of number
of instructions, since you need to push the arguments and so on.)
- What optimizations do we put in? I figure we've got about a
hundred opcodes, maybe fewer. It's probably not much good to try
and store the opcode in less than a byte, so we have
like 150 spare opcodes. So, what do we do with them? One thing we
can do is add optimizations for common operations. For instance,
to call a method you probably do
- push property
- push object
- call
This is, I dunno, 7 bytes. Now, most of the time, you will know
what the property id is at compile time. So we could save a byte
on almost every property reference (of which there will be lots)
by adding a second form of call, so you can do
- push object
- call 32 (or whatever the property is)
or even a third form that allows "call 100 32" since you often
even know the object, which saves a second byte. So we can
potentially save over 25% of the
bytes on a very common operation just by adding another
opcode. And there are lots of cases like this, where you could
add opcodes to call things on the self object, or to push the
first/second/third method argument onto the stack, etc. And I
think this is how most languages do it. TADS certainly does (for
instance, it's got all the following jumps: jump, jump if true,
jump if false, jump if equal, jump if not equal, jump if
greater-than, jump if greater-than-or-equal, jump if less-than,
jump if less-than-or-equal). This is fine, I guess, but it seems
like the pain for interpreter authors is really more than it's
worth, and leads to lots of code duplication in interpreters
which leads to errors etc.
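The byte counts for the three call forms can be checked with a few lines of Python. The encoding is an assumption: one byte per opcode and two bytes per object or property id, which is what makes the plain form come out to 7 bytes.

```python
OPCODE = 1    # bytes per opcode
OPERAND = 2   # bytes per object/property id (assumed, so that the total is 7)

forms = {
    "push prop; push obj; call": 3 * OPCODE + 2 * OPERAND,
    "push obj; call prop":       2 * OPCODE + 2 * OPERAND,
    "call obj prop":             1 * OPCODE + 2 * OPERAND,
}


def saving(form):
    """Fraction of bytes saved relative to the three-instruction form."""
    base = forms["push prop; push obj; call"]
    return (base - forms[form]) / base
```

Under those assumptions the forms take 7, 6 and 5 bytes, so the second form saves one byte in seven and the third saves two in seven, which is where the "over 25%" comes from.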
So an alternate theory is we could reserve some opcodes for macros. A
macro opcode would mean "execute the N bytes of code at this address
then keep going". This has a number of obvious disadvantages,
including "what do jumps from within the macro do" and "would this
hurt JIT?" and "this wouldn't actually work to compress 'push obj;
push prop; call' for arbitrary obj and prop". But it does seem kinda
cool, and would, I think, let compilers do fancier optimization on
larger blocks of code and would be good if there's some bytecode the
compiler commonly outputs that there is no preset opcode for.
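A toy version of the macro opcode, in Python, just to show the "execute N bytes at this address then keep going" part. The opcode numbers, the one-byte operands and the run function are all made up, and there are deliberately no jumps, since what a jump inside a macro should do is exactly the open question.

```python
PUSH, ADD, MACRO = 0, 1, 2   # made-up opcode numbers


def run(code, start, end, stack=None):
    """Execute code[start:end] on a tiny stack machine and return the stack."""
    stack = [] if stack is None else stack
    pc = start
    while pc < end:
        op = code[pc]
        if op == PUSH:                    # PUSH n
            stack.append(code[pc + 1])
            pc += 2
        elif op == ADD:                   # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == MACRO:                 # MACRO addr length
            addr, length = code[pc + 1], code[pc + 2]
            run(code, addr, addr + length, stack)   # run the macro body...
            pc += 3                                  # ...then keep going
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack


# Main program is bytes 0..7; the macro body "PUSH 10; ADD" lives at 8..10.
program = bytes([PUSH, 1, MACRO, 8, 3, MACRO, 8, 3, PUSH, 10, ADD])
```

Here `run(program, 0, 8)` pushes 1 and then runs the three-byte macro body twice, so each use costs three bytes at the call site, which only pays off when the body is longer than that.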
I think this is the whole spec. Ok, except for
restore/restart/undo. The way those work (plus start, when you first
load the game) is, when one of the four is executed, the call stack is
cleared, every object in the game is reset to the appropriate state,
and then system.main is called, with one of 'start', 'restore', etc
being passed as an argument to it. Uh, and this suggests that you can
in fact have user-defined methods on special objects.
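The common path for all four could be sketched like this in Python. The Game class, its snapshot table and the main callback (standing in for a user-defined system.main) are hypothetical; how the "appropriate state" is actually stored for restore and undo isn't specified anywhere above.

```python
import copy


class Game:
    def __init__(self, initial_objects, main):
        start_state = copy.deepcopy(initial_objects)
        # 'start' and 'restart' both go back to the initial state.
        self.snapshots = {"start": start_state, "restart": start_state}
        self.objects = {}
        self.call_stack = []
        self.main = main   # stands in for the user-defined system.main

    def snapshot(self, kind):
        """Record the current state for a later 'restore' or 'undo'."""
        self.snapshots[kind] = copy.deepcopy(self.objects)

    def enter(self, reason):
        """Clear the call stack, reset every object, then call main(reason)."""
        self.call_stack.clear()
        self.objects = copy.deepcopy(self.snapshots[reason])
        return self.main(self, reason)
```

The nice part is that the game only ever has one entry point, and it finds out why it was entered from the argument rather than from four separate hooks.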