Hi Torge,
If your assumption is right, the difference should be measurable from within the VM with different images. I don't know whether the devices we're talking about have something equivalent to the RDTSC instruction on i386 processors, but if there's a _really_ cheap way of measuring sub-microsecond units (RDTSC, for example, counts clock cycles), then it might be really worthwhile to instrument a VM and see whether you can find a difference (e.g., time spent in critical areas such as "full" method lookup as a percentage of overall time spent).
The problem I have with your measurements is that (I think) they are not really giving you any "good enough" evidence to make a case here. Even if it is true that faster images show smaller lookup lengths, the differences could still be attributed to many other factors. Lots of things have changed, and chasing VM inefficiencies is typically very hard if you don't have any "hard numbers" to go along with it.
Personally, my feeling is "half and half" here. Yes, there could be a problem with the mcache size as well as the speed of the full method lookup. But then, it's _really_ hard to tell without hard numbers.
Cheers, - Andreas
-----Original Message-----
From: squeak-dev-admin@lists.squeakfoundation.org [mailto:squeak-dev-admin@lists.squeakfoundation.org] On Behalf Of Torge.Husfeldt@gmx.de
Sent: Sunday, December 29, 2002 6:38 PM
To: squeak-dev@lists.squeakfoundation.org
Cc: squeak-dev@lists.squeakfoundation.org
Subject: Re: Pocket PC Performance
Hi All, Can someone who encounters the performance problems mentioned in this thread please try out the following code snippets and report on the outcome?!
First try in a workspace:

    | lookupLengths |
    lookupLengths _ SortedCollection new.
    Behavior allSubInstancesDo: [:class | | md |
        md _ class methodDict.
        lookupLengths addAll: (md keys asSortedCollection collect: [:sel |
            (((md scanFor: sel) - sel identityHash) \\ md basicSize) -> (class -> sel)])].
    lookupLengths asBag sortedCounts inspect.
    (lookupLengths last: 100) inspect.

This will give you two inspectors. The first shows the sorted counts of a bag whose entries should be interpreted as follows:

    #occurrences -> #lookupLength -> sampleClass -> sampleSelector

Please report on the differences between a slow image and an acceptable image (preferably on the same system). The second gives you the details of the 100 methods with the highest lookup lengths. Please skim this list to see whether you can detect any Morphic-specific selectors with long lookup lengths.
The second thing I want you to try is to grow all your MethodDictionaries that have excessive lookup lengths. The following code snippet will do this for you.

    | lookupLengths |
    Behavior allSubInstancesDo: [:class | | md |
        md _ class methodDict.
        md isEmpty ifFalse: [
            lookupLengths _ SortedCollection new.
            lookupLengths addAll: (md keys asSortedCollection collect: [:sel |
                (((md scanFor: sel) - sel identityHash) \\ md basicSize) -> (class -> sel)]).
            (lookupLengths last key > 9) ifTrue: [md grow]]]
Please report whether your image "feels" any swifter after this operation.

Note#0: Be sure not to have any PackagePaneBrowser (aka 5-pane browser) open when you do your tests, because these beasts will stop all Morphic updating (and maybe event dispatch) for up to one second every second on a slow machine. This is due to a design bug which can very easily be avoided using a changeset I once posted to the list but don't have the patience to dig up right now.

Note#1: These operations might take a _very_ long time (especially on a slow system), so be patient. On my 1700+ it was in the range of seconds, but since you're encountering problems especially on slow systems you will probably do the tests there, too, so don't say I didn't warn you.

Note#2: Lookup lengths stand for the number of probes the VM has to do in _a single method dictionary_ to find the method corresponding to a selector. This is just a minimum measure because it doesn't count the probes spent while following the superclass chain. These numbers are typically small for almost empty method dictionaries, but the total may become huge when all superclasses have long probe chains and the selector is only implemented in ProtoObject.

Note#3: It is nowhere near guaranteed that this will change anything, because lookup lengths aren't _supposed_ to make a difference. It is widely believed that the VM's lookup cache mechanism should deal with the performance hit that would result from long probe chains. I have, however, two strong hints that lookup lengths _might_ be part of the problem you encounter:

Hint#1: The problem has arisen rather gradually, and no one has yet been able to find any particular change that made the difference.

Hint#2: The lookup cache (as I understand it) seems to be rather small for as big a system as Morphic (I only saw space for 512 entries last time I looked) and gets flushed on several occasions (such as GCs).
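To make the notion of a lookup length concrete, here is a minimal C sketch of a linearly probed hash table. It is not the Squeak MethodDictionary code: the ProbeTable type and both functions are hypothetical names made up for illustration, and the sketch assumes plain linear probing from the slot hash mod size. The probe length is the distance between a key's home slot and the slot where it is actually found, which is the quantity the workspace snippets above estimate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical open-addressed table with linear probing.
   A slot value of 0 marks an empty slot, so keys must be nonzero. */
typedef struct {
    unsigned *slots;  /* stored keys */
    size_t    size;   /* number of slots */
} ProbeTable;

/* Store key in the first free slot at or after its home slot (hash mod size). */
void probe_insert(ProbeTable *t, unsigned key, unsigned hash) {
    assert(t->size > 0 && key != 0);
    size_t i = hash % t->size;
    for (size_t n = 0; n < t->size; n++) {
        if (t->slots[i] == 0 || t->slots[i] == key) {
            t->slots[i] = key;
            return;
        }
        i = (i + 1) % t->size;
    }
    assert(0 && "table full");
}

/* Number of slots skipped before key is found; 0 means it sits in its
   home slot. Returns -1 if the key is not in the table. */
long probe_length(const ProbeTable *t, unsigned key, unsigned hash) {
    assert(t->size > 0);
    size_t home = hash % t->size;
    for (size_t n = 0; n < t->size; n++) {
        size_t i = (home + n) % t->size;
        if (t->slots[i] == key) return (long)n;
        if (t->slots[i] == 0) return -1;
    }
    return -1;
}
```

With three keys whose hashes are 1, 5 and 9, a 4-slot table sends all of them to home slot 1, so the last one is found after 2 extra probes. In an 8-slot table the same keys spread out and the longest probe drops to 1, which is the effect growing a dictionary is hoped to have.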
Looking forward to your feedback, Torge
"Andreas Raab" andreas.raab@gmx.de is claimed by the authorities to have written:
Personally, my feeling is "half and half" here. Yes, there could be a problem with the mcache size as well as the speed of the full method lookup. But then, it's _really_ hard to tell without hard numbers.
The only hard numbers we currently have in this area relate to the mcache changes proposed by scott wossname a while ago. We did a series of tests on just about every platform available (including my Acorn, with the same cpu etc as the pocketpc stuff) and found no measurable improvements. Of course, this only demonstrates that we found no improvements with that particular change and those particular tests, but it's at least circumstantial evidence.
tim
This is a revision. I've broken the change set into two parts (another part follows). This part deals only with procedures declared int that return nothing, and fixes them to be declared void. This avoids the "routine returns nothing, int expected" warning messages.
--
===========================================================================
John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================
This change set provides an 18-28% improvement in the GC logic using a test image that has 1.1 million objects in it on the PowerPC. I won't comment on whether this makes any difference on Intel; however, someone might try it and see what happens.
In early December I took a look at the garbage collector and realized for the mark logic, at least on the PowerPC, we were moving three key variables in and out of memory multiple times per object, versus keeping everything in a register or two. Also we managed to make some procedure calls for each object we looked at, which is time consuming.
In discussing a possible fix with Tim, we realized that in some cases, because of inlining, some global variables could actually become local variables, since different SLANG methods are folded together.
This change set adds logic to review variable usage and, where possible, folds variables that appear in only one inlined method into a declaration within that method and drops them from the globals. Right now I only deal with interp.c, and I've not really studied, say, variables used across N methods, where N is > 2 but say < 6, to understand what could be done with a bit more tweaking.
Interestingly enough, these variables are: field, parentField, child in the mark logic; cntx, val in the interpret routine.
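A minimal C sketch of the idea described above, under stated assumptions: g_field and both functions are hypothetical names, not code from interp.c. When a variable is a global, the compiler often has to assume other code can observe it and may store it to memory on every iteration. Declaring the same variable inside the one function that uses it leaves the compiler free to keep it in a register. Whether it actually does depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical VM-style global, visible to every function. */
long g_field;

/* Global version: each iteration writes g_field, and the compiler may
   have to perform that store to memory every time around the loop. */
long sum_with_global(const long *objs, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        g_field = objs[i];
        total += g_field;
    }
    return total;
}

/* Localized version: the variable is declared in the only function
   that uses it, so it can live in a register for the whole loop. */
long sum_with_local(const long *objs, size_t n) {
    long total = 0;
    long field;
    for (size_t i = 0; i < n; i++) {
        field = objs[i];
        total += field;
    }
    return total;
}

/* Self-check: the transformation must not change the result. */
void check_sums_agree(void) {
    const long objs[] = { 3, 5, 7, 11 };
    assert(sum_with_global(objs, 4) == sum_with_local(objs, 4));
}
```

The point of the sketch is only that the two functions are equivalent in behaviour; any speed difference has to be measured on the machine in question.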
A bit of slang changing was also done to:
1) Provide an array headerTypeBytes, a lookup array for the header byte length calculation versus doing a couple of IF statements.
2) A few restructuring hints to ensure procedure calls actually become inlined, where in the past they violated an inlining rule.
3) Some really large routines in the compacting logic were broken into multiple routines, because the logic is really in multiple steps and this helps the compiler (ah stupid compilers) deal with register allocation better. Also it helps one think about why we have multiple steps.
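Item 1 can be sketched in C as follows. Only the array name headerTypeBytes comes from the message. The two-bit type field, the byte counts in the table, and the function names are assumptions chosen for illustration, not values taken from the VM source.

```c
#include <assert.h>

/* Assumption: the low two bits of a header word select the header type. */
enum { HEADER_TYPE_MASK = 3 };

/* Extra header bytes for each of the four header types.
   The values are illustrative assumptions. */
static const int headerTypeBytes[4] = { 8, 4, 0, 0 };

/* Version with a couple of IF statements. */
int extra_header_bytes_branchy(unsigned header) {
    unsigned type = header & HEADER_TYPE_MASK;
    if (type == 0) return 8;
    if (type == 1) return 4;
    return 0;
}

/* Version with a table lookup: one mask and one indexed load, no branches. */
int extra_header_bytes_table(unsigned header) {
    return headerTypeBytes[header & HEADER_TYPE_MASK];
}

/* Self-check: both versions agree for every type and for higher bits set. */
void check_header_bytes(void) {
    for (unsigned h = 0; h < 16; h++)
        assert(extra_header_bytes_branchy(h) == extra_header_bytes_table(h));
}
```

The table version trades two conditional branches for one small memory read, which tends to help in a loop that visits every object.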
I think a bit more work could be done in this area, but I'll look for feedback first.
Also I've moved the logic from voidIsNotInt that dealt with ensuring bogus unused variables are deleted into this change set.
Now for the Globals as a structure logic, I've attached an addendum change set (GCMakeItFasterGStruct-JMM.6.cs.gz) to provide the following feature: declare foo as a pointer to the foo structure as a local. This makes for better PowerPC code, which improves overall performance by a few percentage points. You also need this change set, because of some structure changes, if you've filed in the globals as a structure change sets and want the faster GC speed.
--
===========================================================================
John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================
OOps forgot:
I also changed the logic that actually moves the bytes in the compactor a bit, to decrement the target from the starting location versus what was there before. Although GCC made sense of the loop and built some nice assembler, CodeWarrior for Mac Classic extruded some ugly assembler. With the change both compilers produce optimal (almost) code for moving chunks of bytes.
The Smalltalk for the GC benchmark is this ugly chunk of code.
    | counter target keep |
    counter _ 0.
    Smalltalk garbageCollect.
    Smalltalk garbageCollect.
    Smalltalk garbageCollect.
    Smalltalk allObjectsDo: [:e | counter _ counter + 1].
    target _ 1000000 - counter.
    keep _ OrderedCollection new: 60000.
    target // 10 timesRepeat: [| element |
        element _ OrderedCollection new: 10.
        10 timesRepeat: [element add: Object new].
        keep add: element].
    Smalltalk garbageCollect.
    Smalltalk garbageCollect.
    Smalltalk garbageCollect.
    counter _ 0.
    Smalltalk allObjectsDo: [:e | counter _ counter + 1].
    Transcript show: counter; cr.
    10 timesRepeat: [
        Transcript show: (Time millisecondsToRun: [Smalltalk garbageCollect]); cr].
PS optimal assembler would consider cache line logic. But that's a job for a custom memory move routine. memcpy does not work well here because it has too many checks and usually we move things 4/8/12 bytes or so, so the overhead of calling memcpy is too great.
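As a rough illustration of why a hand-written loop can beat memcpy for moves of only a few words, here is a minimal C sketch. It is not the compactor's actual loop: move_words_down is a hypothetical name, and the sketch assumes objects slide toward lower addresses, so copying the lowest word first is safe even when source and destination overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy nWords 32-bit words from src to dst, lowest address first.
   Assumption: dst is at or below src, as when a compactor slides an
   object down over freed space, so overlapping regions are safe. */
void move_words_down(uint32_t *dst, const uint32_t *src, size_t nWords) {
    assert((uintptr_t)dst <= (uintptr_t)src);
    while (nWords-- > 0) {
        *dst++ = *src++;
    }
}
```

For a move of 1 to 3 words the loop body runs only a handful of instructions, with no call overhead and none of the alignment and size checks a general-purpose memcpy performs first.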
--
===========================================================================
John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================
Let's see, today I built upon the Squeak port found in FreeBSD 4.6 and found one issue. (Now, I could have used the official distribution, but I was curious about what was in the FreeBSD ports collection.)
In my VM changes I added some accessors for variables, but if you just file in the fasterGC change set you don't have an accessor for savedWindowSize, thus it gets localized. Thus you must add an accessor for this instance variable to prevent it from being localized. The unix source base cheerfully refers to it directly versus using an accessor.
However I did get to run a GC benchmark on my i386 box.
a 5156 non-gnuified VM gives raw numbers of 1157-1574 on 1156640 objects
a 5156 non-gnuified VM with the fasterGC change set gives 1039-1436 on 1156206 objects
Ah, that's about a 10% improvement. Versus the 30% I get on the mac, but I suspect that is because providing more working registers versus memory can be done on the mac, but not on the register poor intel machine, however things *are* a bit better...
PS mmmm, now it seems that the Makefile in the port has cflags as -O -pipe -fomit-frame-pointer -DLSB_FIRST=1 -funroll-loops
how very odd, should that not be -O2 ? Mmm well some testing says it doesn't make much of a difference... I'll leave that issue up to the folks that fiddle with the intel compiler options.
Oh and the FreeBSD port doesn't gnuify either, perhaps someone might chase why? Or prod someone to update that unix distribution? mm implies Roland Jesse jesse@cs.uni-magdeburg.de built the port
--
===========================================================================
John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================
On Mon, Dec 30, 2002 at 12:27:19AM -0800, John M McIntosh wrote:
This change set provides a 18 - 28% improvement in the GC logic using an test image that has 1.1 million objects in it on the PowerPC.
Here's a macroBenchmark (on a G3-400 Powerbook):
    normal VM:        #(23176  165464  71470  26862  0  23717  14090)
    with GC changes:  #(22306  154462  67454  24451  0  21615  14134)
    approx. % faster:    3.9     7.1    5.9    9.8      9.7      0
Benchmark #2 triggers 10 full collections, it shows nicely how much faster the GC got:
Benchmark #2: 165464ms
---------------------
uptime   165.5s
full     10 totalling 5,920ms (4.0% uptime), avg 592.0ms
incr     15932 totalling 33,728ms (20.0% uptime), avg 2.0ms
tenures  1,550 (avg 10 GCs/tenure)

Benchmark #2: 154462ms
---------------------
uptime   154.5s
full     10 totalling 4,781ms (3.0% uptime), avg 478.0ms
incr     16036 totalling 25,407ms (16.0% uptime), avg 2.0ms
tenures  1,554 (avg 10 GCs/tenure)
Because the average ms calculation is rounded, it's not apparent that the incremental GC actually runs 25% faster unless you do the math.
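The math can be spelled out with a small C sketch, using only the incremental GC totals and counts reported above. The two helper function names are made up for illustration.

```c
#include <assert.h>

/* Average milliseconds per collection. */
double avg_ms(double total_ms, double count) {
    assert(count > 0);
    return total_ms / count;
}

/* Percentage by which `after` is smaller than `before`. */
double percent_reduction(double before, double after) {
    assert(before > 0);
    return 100.0 * (before - after) / before;
}
```

Calling avg_ms(33728, 15932) gives about 2.12 ms per incremental GC for the normal VM, and avg_ms(25407, 16036) gives about 1.58 ms with the GC changes. Both round to the 2.0ms shown in the statistics, yet percent_reduction of the two averages comes out at roughly 25%.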
On Tuesday, December 31, 2002, at 01:40 AM, Marcus Denker wrote:
-- Marcus Denker marcus@ira.uka.de -- Squeak! http://squeak.de
--
===========================================================================
John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================
Someone has to say it - HAPPY NEW YEAR TO YOU ALL ! HAPPY SQUEAKING! HIP ! HIP ! HOORAY !
Edmund
HAPPY NEW YEAR, EVERYONE.
Cheers,
PhiHo.
----- Original Message -----
From: "John M McIntosh" johnmci@smalltalkconsulting.com
To: squeak-dev@lists.squeakfoundation.org
Sent: Monday, December 30, 2002 3:27 AM
Subject: [ENH][VM][FIX] faster garbage collection, and removal of unused vars
A very happy 2003 to you and all, PhiHo!
Let's Squeak through another one!
Gary
----- Original Message -----
From: PhiHo Hoang
To: squeak-dev@lists.squeakfoundation.org
Sent: Wednesday, January 01, 2003 12:13 AM
Subject: It's now 2003 (was Re: [ENH][VM][FIX] faster garbage collection, and removal of unused vars)