BFS and CFS and Cogs, Oh My


Casey Ransberger

19 Apr 2013

9:38 a.m.

I had a brief chat tonight with Con Kolivas, who wrote BFS (which implements kernel stuff that will make Cog happier under Linux on machines with sub-supercomputing quantities of CPUs).

It sounds like there are actually two reasons it hasn't made it into the mainline kernel:

a) he doesn't have time to support it, and b) the other kernel folks don't want it.

Oh well. Since right now I'm focused on Raspbian, I sent a message explaining what it was, why I want it, etc on their web board. If I do get it in, support would have to fall to me. Yikes, right? ;)

Basically it would be a fork, done with the original author's blessing. This would also make it easy to change the (rather unfortunate) f-ing name of the thing.

For the curious, at the end of this message I'm linking to a bit about why he wrote it. It's worth mentioning that it's rather in the spirit of simple-is-better, which I think fits with the Squeak ethos a little. Not to mention that it's aimed at the desktop/personal computing experience, rather than CFS's focus on big iron and tons of CPUs, so it's really appropriate to the Raspberry Pi, as well as potentially the XO.

The only real advantage to forking the work is to bring support for it into the Squeak community. I doubt I'll have any more luck lobbying the kernel crowd to get it in as a secondary process scheduler than Con did, but maybe, just maybe, the Raspbian folks will see some value in making the Squeak VM go faster / possibly use less power (given that they're putting Scratch on the desktop by default) and be open to taking the patch. Especially if I can come with benchmarks that make it desirable across the board on the device (it sounds like it can probably bring the rain on that count if I'm reading this stuff right.)

Whether or not I get warm feelings from the Raspbian folks, I'm going to look at "how can I make this patch as painless as possible to apply to as many distros as possible?"

Of course if I actually do this, I can probably kiss my social life goodbye, so I'm not without reservations. Not to mention that the number of people who'll care is probably smaller than one might hope.

I wondered what folks on vm-dev might think. If this is another one of my crazy projects that no one actually wants, please let me know posthaste :D

P.S.

If you're worried that I might not be qualified to maintain a kernel patch, I think this should help to assuage your fears:

http://media.tumblr.com/tumblr_me0qrvGwDP1r3z80e.jpg

And here's Con's rationale for implementing it. Warning: you might not want to pull this up while your parents are watching unless you've had a good long talk with them about what words are appropriate to use in various social contexts. Raising parents can be very challenging! And putting soap in their mouths only seems to upset them...

http://ck.kolivas.org/patches/bfs/bfs-faq.txt

-- Casey Ransberger


Alex Bradbury

19 Apr

3:16 p.m.

On 19 April 2013 08:38, Casey Ransberger casey.obrien.r@gmail.com wrote:

> I had a brief chat with Con Kolivas, who did BFS (which implements kernel stuff that will make Cog happier under Linux on machines with sub-supercomputing quantities of CPUs) tonight.
>
> It sounds like there are actually two reasons it hasn't made it into the mainline kernel:
>
> a) he doesn't have time to support it, and b) the other kernel folks don't want it.
>
> Oh well. Since right now I'm focused on Raspbian, I sent a message explaining what it was, why I want it, etc on their web board. If I do get it in, support would have to fall to me. Yikes, right? ;)

Yes, for political reasons it seems unlikely anything like BFS would get into the upstream kernel. If someone can do work to actually show noticeable performance gains then that would make us (the Raspberry Pi Foundation) interested in exploring further. Real workloads that perform much better with an alternative scheduler would be much more interesting than microbenchmarks. Of course the next step after that wouldn't be dumping the upstream scheduler and switching to BFS, but it would certainly justify taking a closer look.

I'm not entirely sure why you want to fork BFS - as far as I can see Con Kolivas is keeping the BFS and his larger -ck patchset up to date with upstream releases.

In conclusion (from a Raspberry Pi perspective): please do play with BFS on the pi, do something useful with it (if it solves the recently discussed issues with heartbeat+cogvm then swell), then let's think about where to go from there.

Regards,

Alex

Casey Ransberger

5:02 p.m.

Well hi, Alex:) Nice to meet you!

I agree wrt real workloads; microbenchmarks mean very little. Apache might be a good candidate for benching. That's off the top of my head; see below, where I ask what you'd like to see instrumented.

My main problem with BFS is that the name -- under the rather restrictive conditions I had to agree to in order to join the Pi Foundation's message boards -- cannot even be discussed. That and it might upset some parents, not to mention possibly confusing some kids.

Let's not doubt that kids will get their hands in the kernel. That would be a poor assumption, if any of them are a bit like we were.

The other reason for suggesting a fork is that the original author has stated that he hasn't got intentions around supporting the work on a broad scale, and I think something which could end up empowering Pi users ought to have someone backing it up. I'm considering being that person. This is the part where my own self interest says "shut up and go home" and I fail almost entirely to listen.

Here's what Con and I talked about: I would fork, change the name, track his work, and then contribute back anything of value. Downstream forkiness, basically. I shield him from support randomization, and that makes this thing supportable.

What I'm getting at: it's a matter of branding. Sorry, I don't use the marketing-department hat often, at least not in public, but here we are!

Also: it's worth noting that I'm a bit of a culture-jammer. Changing the name might have some very funny positive effects. At Apple, when no one wanted to hear about Smalltalk anymore, some very clever people "invented" Squeak. Which of course they'd already invented as Smalltalk, etcetera. But the radar hadn't learned about Squeak yet, and so the balloon sailed away underneath it one more time. Or that's the version of the story that I heard?

I'll take your advice and start experimenting (was going to anyway.) The output of my experimentation, assuming I don't get run over by a bus in the meantime, will be some macro benches. Then we can start talking turkey, no? ;) Anyway I'd like to come with facts and numbers to a discussion like that, rather than conjecture.

Thanks for your thoughtful reply!

I'd like to ask a favour: can you name off some not-Squeak applications (maybe Python based stuff or something?) that you'd like to see some numbers around between the two schedulers? Feel free to reply to me directly, as I imagine this veers away from the focus of vm-dev a bit.

Casey


Eliot Miranda

6:23 p.m.

Hi Alex,

On Fri, Apr 19, 2013 at 6:16 AM, Alex Bradbury asb@asbradbury.org wrote:

> Yes, for political reasons it seems unlikely anything like BFS would get in to the upstream kernel. If someone can do work to actually show noticeable performance gains then that would make us (the Raspberry Pi Foundation) interested in exploring further. Real workloads that perform much better with an alternative scheduler would be much more interesting than microbenchmarks.

This isn't about workload or performance. It is about basic functionality. The CFS scheduler does not support multiple thread priorities for user processes (actually, for the non-real-time scheduling policy, and the real-time scheduling policy is available only to superuser processes). AFAIA it is the only main-stream pthreads scheduler that doesn't. AFAIA BFS (what a name?!) does support multiple thread priorities for user processes.

Within the Squeak Cog VM (and in a number of other VMs, Smalltalk and Java VMs amongst them) there's a heartbeat which is used to cause the VM to periodically break out of normal processing and poll for events. A heartbeat is both much more efficient and more regular than e.g. decrementing a counter as part of normal processing (e.g. frame build on entering non-leaf methods). Ideally the heartbeat is implemented as a thread spinning, blocking in e.g. nanosleep and then forcing the breakout before entering nanosleep again. But this requires that the heartbeat thread runs at a higher priority than the main VM thread(s). On Linux under the CFS this isn't possible. The fallback is to use an interval timer (setitimer with ITIMER_REAL) and a signal handler (for SIGALRM). This is a poor substitute:

- system calls are interrupted, which can play havoc with external code
- when debugging, the heartbeat signal must be disabled because otherwise one is constantly stepping into the signal handler
- certain Linux kernels have bugs with signal delivery and threads which can cause the loss of a thread's context, ending up with two threads having the same context, hence the setitimer approach works only with a strictly single-threaded VM (this is a bug I found and worked around late last year in Red Hat Enterprise Linux WS release 4 (Nahant Update 4) vintage kernels, which alas I have customers using)

Either of these solutions would seem straightforward from the outside:

- make SCHED_RR and/or SCHED_FIFO available to user processes
- implement multiple priorities for SCHED_OTHER

Expecting to be able to install a VM as a setuid program is not realistic.

I think you'll find that this kind of architectural issue is present in a number of multi-media applications, not just dynamic language virtual machines. The restriction to a single thread priority is, frankly, pathetic. If you see Raspbian and Pi as a platform for multi-media apps then I would urge you to bring any influence you have to bear on getting the Linux kernel community to provide multiple thread priorities. The lack thereof is a significant limitation.

best regards, Eliot Miranda


-- best, Eliot

Steve Rees

21 Apr

2:10 p.m.

Hi Eliot,

On 19/04/2013 17:23, Eliot Miranda wrote:

> This isn't about workload or performance. It is about basic functionality. The CFS scheduler does not support multiple thread priorities for user processes (actually, for the non-real-time scheduling policy, and the real-time scheduling policy is available only to superuser processes).

This isn't entirely true. Out of the box, unprivileged processes can't change the scheduling policy, but in kernels from 2.6.12 onward it is possible to configure your system to allow this without resorting to setuid root.

Quoting from the man page for sched_setscheduler - http://linux.die.net/man/2/sched_setscheduler - (the privilege restrictions are the same as for pthread_attr_setschedpolicy), "If an unprivileged process has a nonzero RLIMIT_RTPRIO soft limit, then it can change its scheduling policy and priority, subject to the restriction that the priority cannot be set to a value higher than the maximum of its current priority and its RLIMIT_RTPRIO soft limit."

Using the pam_limits.so module, one can set the RLIMIT_RTPRIO soft limit higher than zero, which then allows the use of the SCHED_FIFO and SCHED_RR policies with priorities up to the soft limit.

One way to achieve this is to add the following lines to the file /etc/security/limits.conf.

* hard rtprio 1
* soft rtprio 1

or you can add a squeakvm.conf file to /etc/security/limits.d with those same lines, eg.

# /etc/security/limits.d/squeakvm.conf
* hard rtprio 1
* soft rtprio 1

This grants the capability to all unprivileged users, but you will need to log out and log in again for it to take effect, as PAM limits are applied at user login.

The only problem with this approach is that it might conflict with other global settings for rtprio. An alternative is to grant the privilege to a group (e.g. squeakvm) and then add users to that group, so that only they can switch to the SCHED_FIFO or SCHED_RR policies and change the priorities of threads:

# /etc/security/limits.d/squeakvm.conf
@squeakvm hard rtprio 1
@squeakvm soft rtprio 1

This will grant the ability only to users in the squeakvm group. The 1 in the examples above is the maximum priority. Higher levels could be used, but a level of at least 1 is necessary to trigger the capability.

Of course the group needs to exist for this to take effect.

sudo groupadd squeakvm

There's a handy test program on the pthread_setschedparam man page - http://linux.die.net/man/3/pthread_setschedparam - that can be used for experimentation. I've attached the source. I tried this out on an up-to-date Ubuntu Server 12.04 LTS VM running on a MacBook Pro under VMware Fusion. YMMV.

pthreads_sched_test is a bit of a verbose name, so I named the test program "schedtest" when I compiled it. Here are the results of my tests.

First, compile the program

gcc pthreads_sched_test.c -o schedtest -lpthread

The first set of tests were performed without making any changes to the PAM limits.

Running schedtest without arguments gives the following

./schedtest

Scheduler settings of main thread
    policy=SCHED_OTHER, priority=0

Scheduler settings in 'attr'
    policy=SCHED_OTHER, priority=0
    inheritsched is INHERIT

Scheduler attributes of new thread
    policy=SCHED_OTHER, priority=0

Trying to change the policy and priority of the new thread the program creates gives the following

./schedtest -ar1 -i e

Scheduler settings of main thread
    policy=SCHED_OTHER, priority=0

Scheduler settings in 'attr'
    policy=SCHED_RR, priority=1
    inheritsched is EXPLICIT

pthread_create: Operation not permitted

Trying to change the priority of the main thread gives

./schedtest -mr1

pthread_setschedparam: Operation not permitted

As Eliot described, the default configuration prevents unprivileged user processes from changing the priority or scheduling policy.

After adding the /etc/security/limits.d/squeakvm.conf file described above, adding my user to the squeakvm group and logging out and back in again, the tests are somewhat more successful. Note that these are the only additional privileges given to the squeakvm group.

schedtest -ar1 -i e

Scheduler settings of main thread
    policy=SCHED_OTHER, priority=0

Scheduler settings in 'attr'
    policy=SCHED_RR, priority=1
    inheritsched is EXPLICIT

Scheduler attributes of new thread
    policy=SCHED_RR, priority=1

schedtest -mr1 -ao0 -i e

Scheduler settings of main thread
    policy=SCHED_RR, priority=1

Scheduler settings in 'attr'
    policy=SCHED_OTHER, priority=0
    inheritsched is EXPLICIT

Scheduler attributes of new thread
    policy=SCHED_OTHER, priority=0

Does this give sufficient flexibility without having to patch the kernel's scheduler (whatever its name)?

Cheers, Steve


-- You can follow me on twitter at http://twitter.com/smalltalkhacker

Eliot Miranda

22 Apr

8:02 p.m.

Thanks, Steve, that's great news! I'll try and look at this really soon.


-- best, Eliot

Eliot Miranda

25 May

1:13 a.m.

Steve,

thank you _so much_! This works like a charm. At least on Red Hat I didn't have to add a group, and things work fine with the first of the /etc/security/limits.d/squeakvm.conf approaches.

On Sun, Apr 21, 2013 at 5:10 AM, Steve Rees < squeak-vm-dev@vimes.worldonline.co.uk> wrote:

...

Hi Eliot,

On 19/04/2013 17:23, Eliot Miranda wrote:

Hi Alex,

On Fri, Apr 19, 2013 at 6:16 AM, Alex Bradbury asb@asbradbury.org wrote:

...
On 19 April 2013 08:38, Casey Ransberger casey.obrien.r@gmail.com wrote:

...
I had a brief chat with Con Kolivas, who did BFS (which implements

kernel stuff that will make Cog happier under Linux on machines with sub-supercomputing quantities of CPUs) tonight.

...
It sounds like there are actually two reasons it hasn't made it into

the mainline kernel:

...
a) he doesn't have time to support it, and b) the other kernel folks don't want it.

Oh well. Since right now I'm focused on Raspbian, I sent a message

explaining what it was, why I want it, etc on their web board. If I do get it in, support would have to fall to me. Yikes, right? ;)

Yes, for political reasons it seems unlikely anything like BFS would get in to the upstream kernel. If someone can do work to actually show noticeable performance gains then that would make us (the Raspberry Pi Foundation) interested in exploring further. Real workloads that perform much better with an alternative scheduler would be much more interesting than microbenchmarks.

This isn't about workload or performance. It is about basic functionality. The CFS scheduler does not support multiple thread priorities for user processes (actually, for the non-real-time scheduling policy, and the real-time scheduling policy is available only to superuser processes).

This isn't entirely true. Out-of-the-box unprivileged processes can't change the scheduling policy, but in kernels after 2.6.12 it is possible to configure your system to allow this without resorting to setuid root.

Quoting from the man page for sched_setscheduler - http://linux.die.net/man/2/sched_setscheduler - (the privilege restrictions are the same as for pthread_attr_setschedpolicy), "If an unprivileged process has a nonzero RLIMIT_RTPRIO soft limit, then it can change its scheduling policy and priority, subject to the restriction that the priority cannot be set to a value higher than the maximum of its current priority and its RLIMIT_RTPRIO soft limit."

Using the pam_limits.so module, one can set the RLIMIT_RTPRIO soft limit higher than zero, which then allows the use of the SCHED_FIFO and SCHED_RR policies with priorities up to the soft limit.
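For illustration, a program can check at startup whether the RLIMIT_RTPRIO soft limit has actually been raised before it asks for a real-time policy. This is only a minimal C sketch, assuming Linux with glibc. The helper names rtprio_soft_limit and rtprio_allowed are hypothetical, and the check ignores the "current priority" half of the man page rule (it assumes the caller starts at SCHED_OTHER, priority 0).

```c
#include <sys/resource.h>

/* Hypothetical helper: fetch the RLIMIT_RTPRIO soft limit.
 * Returns 0 on success, -1 if getrlimit() fails. */
int rtprio_soft_limit(rlim_t *soft)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_RTPRIO, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    return 0;
}

/* Hypothetical helper: would an unprivileged caller that currently runs
 * at priority 0 be allowed to request real-time priority `prio`?
 * Simplification: the caller's current priority is assumed to be 0. */
int rtprio_allowed(rlim_t soft, int prio)
{
    if (prio <= 0)
        return 0;
    return soft == RLIM_INFINITY || (rlim_t)prio <= soft;
}
```

With the default limit of 0 the second helper reports that priority 1 is not allowed; once a limit of 1 is configured and the user has logged in again, it reports that priority 1 is allowed and priority 2 is not.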

One way to achieve this is to add the following lines to the file /etc/security/limits.conf.

* hard rtprio 1
* soft rtprio 1

or you can add a squeakvm.conf file to /etc/security/limits.d with those same lines, eg.

# /etc/security/limits.d/squeakvm.conf
* hard rtprio 1
* soft rtprio 1

This grants this capability to unprivileged users, but you will need to log out and log in again for it to take effect, as PAM limits are applied at user login.

The only problem with this approach is that it might conflict with other global settings for rtprio. An alternative is to grant the privilege to a group (eg. squeakvm) and then add users to that group, which lets them select the SCHED_FIFO or SCHED_RR policies and change the priorities of threads:

# /etc/security/limits.d/squeakvm.conf
@squeakvm hard rtprio 1
@squeakvm soft rtprio 1

This will grant the ability only to users in the squeakvm group. The 1 in the examples above is the maximum priority that may be requested. Higher levels could be used, but a level of at least 1 is necessary to trigger the capability.

Of course the group needs to exist for this to take effect.

sudo groupadd squeakvm

There's a handy test program on the pthread_setschedparam man page - http://linux.die.net/man/3/pthread_setschedparam - that can be used for experimentation. I've attached the source. I tried this out on an up-to-date Ubuntu Server 12.04 LTS VM running on a MacBook Pro under VMware Fusion. YMMV.

pthreads_sched_test is a bit of a verbose name, so I named the test program "schedtest" when I compiled it. Here are the results of my tests.

First, compile the program

gcc pthreads_sched_test.c -o schedtest -lpthread

The first set of tests was performed without making any changes to the PAM limits.

Running schedtest without arguments gives the following

./schedtest

Scheduler settings of main thread
    policy=SCHED_OTHER, priority=0

Scheduler settings in 'attr'
    policy=SCHED_OTHER, priority=0
    inheritsched is INHERIT

Scheduler attributes of new thread
    policy=SCHED_OTHER, priority=0

Trying to change the policy and priority of the new thread the program creates gives the following

./schedtest -ar1 -i e

Scheduler settings of main thread
    policy=SCHED_OTHER, priority=0

Scheduler settings in 'attr'
    policy=SCHED_RR, priority=1
    inheritsched is EXPLICIT

pthread_create: Operation not permitted

Trying to change the priority of the main thread gives

./schedtest -mr1

pthread_setschedparam: Operation not permitted

As Eliot described, the default configuration prevents unprivileged user processes from changing the priority or scheduling policy.

After adding the /etc/security/limits.d/squeakvm.conf file described above, adding my user to the squeakvm group and logging out and back in again, the tests are somewhat more successful. Note that these are the only additional privileges given to the squeakvm group.

schedtest -ar1 -i e

Scheduler settings of main thread
    policy=SCHED_OTHER, priority=0

Scheduler settings in 'attr'
    policy=SCHED_RR, priority=1
    inheritsched is EXPLICIT

Scheduler attributes of new thread
    policy=SCHED_RR, priority=1

schedtest -mr1 -ao0 -i e

Scheduler settings of main thread
    policy=SCHED_RR, priority=1

Scheduler settings in 'attr'
    policy=SCHED_OTHER, priority=0
    inheritsched is EXPLICIT

Scheduler attributes of new thread
    policy=SCHED_OTHER, priority=0

Does this give sufficient flexibility without having to patch the kernel's scheduler (whatever its name)?
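As a rough illustration of how a VM might take advantage of this, the `-ar1 -i e` run corresponds to creating a thread with explicit SCHED_RR attributes at priority 1. The C sketch below is not the Cog VM's code. The names heartbeat_start, heartbeat_stop_and_join and heartbeat_ticks are hypothetical, the 2 ms period is an arbitrary choice, and the counter merely stands in for whatever "break out and poll" mechanism a real VM uses. If the rtprio limit has not been configured, pthread_create fails with EPERM and the sketch falls back to a thread with default scheduling.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int heartbeat_ticks;  /* hypothetical stand-in for the VM's poll request */
static atomic_int heartbeat_quit;   /* set to 1 to ask the heartbeat thread to exit */

/* Sleep for one period, record a beat, repeat until asked to stop. */
static void *heartbeat_main(void *arg)
{
    const struct timespec period = { 0, 2 * 1000 * 1000 }; /* 2 ms, illustrative */

    (void)arg;
    while (!atomic_load(&heartbeat_quit)) {
        nanosleep(&period, NULL);
        atomic_fetch_add(&heartbeat_ticks, 1);
    }
    return NULL;
}

/* Try to start the heartbeat under SCHED_RR at priority 1.
 * Returns 0 on success or an errno value; *realtime is set to 1 only
 * if the real-time policy was actually granted. */
int heartbeat_start(pthread_t *tid, int *realtime)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 1 };
    int err;

    *realtime = 0;
    err = pthread_attr_init(&attr);
    if (err != 0)
        return err;
    /* Without EXPLICIT the new thread would inherit the creator's policy. */
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);
    pthread_attr_setschedparam(&attr, &sp);

    err = pthread_create(tid, &attr, heartbeat_main, NULL);
    pthread_attr_destroy(&attr);
    if (err == 0) {
        *realtime = 1;
        return 0;
    }
    if (err == EPERM)  /* no rtprio allowance: fall back to default scheduling */
        err = pthread_create(tid, NULL, heartbeat_main, NULL);
    return err;
}

/* Ask the heartbeat thread to exit and wait for it. */
void heartbeat_stop_and_join(pthread_t tid)
{
    atomic_store(&heartbeat_quit, 1);
    pthread_join(tid, NULL);
}
```

Like schedtest, this needs to be linked with -lpthread on older glibc versions. A caller can inspect the realtime flag to decide whether to log a warning or switch to a signal-based heartbeat instead.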

Cheers, Steve

AFAIA it is the only main-stream pthreads scheduler that doesn't. AFAIA BFS (what a name?!) does support multiple thread priorities for user processes.

Within the Squeak Cog VM (and in a number of other VMs, Smalltalk and Java VMs amongst them) there's a heartbeat which is used to cause the VM to periodically break out of normal processing and poll for events. A heartbeat is both much more efficient, and more regular than e.g. decrementing a counter as part of normal processing (e.g. frame build on entering non-leaf methods). Ideally the heartbeat is implemented as a thread spinning, blocking in e.g. nanosleep and then forcing the breakout before entering nanosleep again. But this requires that the heartbeat thread runs at a higher priority than the main VM thread(s). On Linux under the CFS this isn't possible. The fallback is to use an interval timer (setitimer with ITIMER_REAL) and a signal handler (for SIGALRM). This is a poor substitute:

- system calls are interrupted, which can play havoc with external code

- when debugging, the heartbeat signal must be disabled because otherwise one is constantly stepping into the signal handler

- certain Linux kernels have bugs with signal delivery and threads which can cause the loss of a thread's context, ending up with two threads having the same context, hence the setitimer approach works only with a strictly single-threaded VM (this is a bug I found and worked around late last year in Red Hat Enterprise Linux WS release 4 (Nahant Update 4) vintage kernels, which alas I have customers using)

Either of these solutions would seem straight-forward from the outside:

- make SCHED_RR and/or SCHED_FIFO available to user processes.

- implement multiple priorities for SCHED_OTHER

Expecting to be able to install a VM as a setuid program is not realistic.

I think you'll find that this kind of architectural issue is present in a number of multi-media applications, not just dynamic language virtual machines. The restriction to a single thread priority is, frankly, pathetic. If you see Raspbian and Pi as a platform for multi-media apps then I would urge you to bring any influence you have to bear on getting the Linux kernel community to provide multiple thread priorities. The lack thereof is a significant limitation.

best regards, Eliot Miranda

Of course the next step after that wouldn't be dumping the upstream scheduler and switching to BFS, but it would certainly justify taking a closer look.

I'm not entirely sure why you want to fork BFS - as far as I can see Con Kolivas is keeping the BFS and his larger -ck patchset up to date with upstream releases.

In conclusion (from a Raspberry Pi perspective): please do play with BFS on the pi, do something useful with it (if it solves the recently discussed issues with heartbeat+cogvm then swell), then let's think about where to go from there.

Regards,

Alex

-- best, Eliot

-- You can follow me on twitter at http://twitter.com/smalltalkhacker

-- best, Eliot

Steve Rees

5:12 a.m.

Hi Eliot, Happy to help. Glad to hear it works on Redhat too. I only had access to Ubuntu at the time I tried it. The only reason I suggested the group-based version was to avoid any potential conflicts with other config on the machine. As you say, the global version works too, it just applies to all users, which may or may not be what one wants.

Cheers, Steve


vm-dev@lists.squeakfoundation.org

participants (4)

Alex Bradbury
Casey Ransberger
Eliot Miranda
Steve Rees