Igor,
those of us who design our own hardware have options that are not available when using conventional processors. In the case of object tables, we can use virtually addressed object caches (invented in the Mushroom project - http://www.wolczko.com/mushroom/index.html) to eliminate most of the cost.
In a conventional processor, think about what happens when we execute an instruction like
load R3, R7, R1
where R1 has the index of the instance variable we want to read (multiplied by the word size, depending on the processor), R7 is the oop for the object, and R3 will receive the value of the instance variable. The first step is that R7 and R1 are added, and the result is the virtual address of the instance variable. Then the top (20 or so) bits are looked up in the TLB (translation look-aside buffer) of the MMU (memory management unit) and, if found there, replaced with the associated physical bits, forming the physical address of the instance variable. The last step is that the top bits of the physical address (28 bits of a 32-bit address in the case of a cache with 16-byte lines) are used to find the right line in the data cache, and the bottom 4 bits select the bytes from that line to be loaded into R3.
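To make the three steps concrete, here is a rough C sketch of that conventional load path. All the constants, names and the one-entry "TLB" are made up for illustration (32-bit addresses, 4 KB pages, 16-byte lines), not any real ISA:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* bottom 12 bits are the page offset      */
#define LINE_SHIFT 4    /* bottom 4 bits select a byte in the line */

/* A one-entry "TLB" mapping one virtual page to one physical page. */
static uint32_t tlb_vpn = 0x12345;   /* virtual page number  (20 bits) */
static uint32_t tlb_ppn = 0x00ABC;   /* physical page number (20 bits) */

uint32_t translate(uint32_t r7_oop, uint32_t r1_offset) {
    uint32_t vaddr = r7_oop + r1_offset;     /* step 1: the addition   */
    uint32_t vpn   = vaddr >> PAGE_SHIFT;    /* step 2: TLB lookup     */
    assert(vpn == tlb_vpn);                  /* pretend it always hits */
    uint32_t paddr = (tlb_ppn << PAGE_SHIFT) | (vaddr & 0xFFF);
    /* step 3: the cache splits paddr into a 28-bit line address
     * and a 4-bit byte select                                    */
    uint32_t line = paddr >> LINE_SHIFT;
    uint32_t byte = paddr &  0xF;
    (void)line; (void)byte;
    return paddr;
}
```

So a load through page 0x12345 lands in physical page 0xABC with the same page offset.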
Of course, sometimes the "page" isn't in the TLB or the data cache doesn't have the needed line, but let's not worry about that for now.
Imagine that we redesign our processor so that the same instruction works like this: we concatenate R7 and R1 into a 64-bit virtual instance variable address, use the top 60 bits to find the right line in the data cache, and use the bottom 4 bits to select the bytes from that line to be loaded into R3. We have saved one addition and one MMU lookup at the cost of a larger tag for the cache. An additional cost is that two objects can't share the same cache line as they can in a conventional processor, but that doesn't hurt much.
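In C pseudo-hardware terms, the hit path of such a virtually addressed object cache might look like the sketch below (a one-line "cache" with invented names, assuming 32-bit oops and offsets and 16-byte lines):

```c
#include <stdint.h>

typedef struct {
    uint64_t tag;        /* top 60 bits of <oop, offset>           */
    uint8_t  data[16];   /* one cache line belonging to one object */
} line_t;

static line_t cache_line;  /* a one-line "cache" for illustration */

/* Returns 1 on a hit (byte stored into *out), 0 on a miss. */
int lookup(uint32_t r7_oop, uint32_t r1_offset, uint8_t *out) {
    /* concatenate R7 and R1 into a 64-bit virtual ivar address */
    uint64_t vaddr = ((uint64_t)r7_oop << 32) | r1_offset;
    uint64_t tag   = vaddr >> 4;    /* top 60 bits find the line   */
    uint32_t byte  = vaddr & 0xF;   /* bottom 4 bits pick the byte */
    if (cache_line.tag != tag)
        return 0;                   /* miss: must refill from memory */
    *out = cache_line.data[byte];   /* hit: no add, no MMU lookup    */
    return 1;
}
```

Note there is no addition and no translation anywhere on the hit path; the tag comparison is all the work there is.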
When we can't find the cache line we need, we have to bring in data from main memory. That can be done by adding R7 and R1, masking off the bottom 4 bits, doing the MMU lookup and fetching the 16 bytes from the resulting address. This would be compatible with the direct-pointer Squeak. But we could instead use R7 as an index into an object table, fetch the base address, add R1 to that, mask off the bottom 4 bits, do an MMU lookup (or not - the object table itself could double as a virtual memory system) and fetch the 16 bytes into the new cache line. Since cache misses are rare, the extra memory access here does not hurt performance very much.
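The object-table variant of the miss path is just as short. A hypothetical sketch (table size, names and the absence of a separate MMU step are all illustrative assumptions):

```c
#include <stdint.h>

#define OT_SIZE 256
static uint32_t object_table[OT_SIZE];  /* oop -> object base address */

/* Compute the line-aligned address to refill from on a cache miss. */
uint32_t miss_address(uint32_t r7_oop, uint32_t r1_offset) {
    uint32_t base = object_table[r7_oop];       /* the extra memory access */
    uint32_t addr = (base + r1_offset) & ~0xFu; /* mask to the line start  */
    /* (optionally an MMU lookup would go here; or the object table
     * itself doubles as the virtual memory map, as suggested above) */
    return addr;  /* fetch 16 bytes from here into the new cache line */
}
```

The indirection through the table happens only on misses, which is what makes its cost tolerable.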
Note that virtual caches are considered a bad thing in the C world because of aliasing problems: two virtual addresses might map to the same physical address and then you could have two copies of the same data in the cache and no way to keep them consistent. With object addressing, this is much easier to avoid.
-- Jecel