Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutant causes MRI to get stuck #265

Closed
plexus opened this issue Oct 21, 2014 · 57 comments
Closed

Mutant causes MRI to get stuck #265

plexus opened this issue Oct 21, 2014 · 57 comments
Labels

Comments

@plexus
Copy link
Contributor

plexus commented Oct 21, 2014

Seeing this consistently on current Yaks.

Steps to reproduce

git clone https://github.com/plexus/yaks.git
cd yaks
git checkout 6febe63
rake mutant

Ruby version

ruby 2.1.3p242 (2014-09-19 revision 47630) [x86_64-linux]

Gemfile.lock: https://gist.github.com/ae2fab52a0999a3eeea7

It runs almost to the end, but then gets stuck at

Mutant configuration:
Matcher:            #<Mutant::Matcher::Config match_expressions=[<Mutant::Expression: Yaks*>] subject_ignores=[] subject_selects=[]>
Integration:        rspec
Expect Coverage:    100.00%
Processes:          4
Includes:           ["lib"]
Requires:           ["yaks"]
Available Subjects: 143
Subjects:           143
Mutations:          5362
Kills:              5351
Alive:              10
Runtime:            454.85s
Killtime:           398.27s
Overhead:           14.21%
Coverage:           99.81%
Expected:           100.00%
Active subjects:    1
Yaks::Mapper::Association#initialize:/home/arne/github/yaks/yaks/lib/yaks/mapper/association.rb:9 mutations: 87
- rspec:Yaks::Mapper::Association
- rspec:Yaks::Mapper::Association#add_to_resource
- rspec:Yaks::Mapper::Association#render_as_link?
- rspec:Yaks::Mapper::Association#map_rel
- rspec:Yaks::Mapper::Association#resolve_association_mapper
......................................................................................
(86/87) 100% - killtime: 6.46s runtime: 16.54s overhead: 10.08s
@mbj
Copy link
Owner

mbj commented Oct 21, 2014

@plexus Very good reproduction. I'll see what I can do later this day after finishing commercial activities.

@mbj mbj added the bug label Oct 23, 2014
@mbj
Copy link
Owner

mbj commented Oct 23, 2014

@plexus I got it reproduced and fixed in my actors branch. Need to clean up lots of ugly code I wrote there (and maybe use an external actor implementation, not sure yet). Just a heads up from me.

@plexus
Copy link
Contributor Author

plexus commented Oct 23, 2014

is that branch on GH? not seeing it after a fetch...

@mbj
Copy link
Owner

mbj commented Oct 23, 2014

I can push it. But do not comment it etc its WIP (I do not push code that does not pass all specs yet. Only when I have collabs I do so).

@mbj
Copy link
Owner

mbj commented Oct 23, 2014

is that branch on GH? not seeing it after a fetch...

My main reason is not to trigger reviews by others when I know there are still issues. This is experimental and not done with tests up-front ;)

@mbj
Copy link
Owner

mbj commented Oct 23, 2014

@plexus Here ya go. Do not look into the code, better focus on the semantics in your case. #266

@plexus
Copy link
Contributor Author

plexus commented Oct 23, 2014

If you've seen a bit of my code you know I'm not the most critical when it comes to these things :) anyway hope it works, need to up the coverage of Yaks again!

@mbj
Copy link
Owner

mbj commented Oct 23, 2014

@plexus you can always use an older version of mutant where parallel nesting worked through an older parallel dependency.

@mbj
Copy link
Owner

mbj commented Oct 23, 2014

@plexus parallel is a nice piece of software. But I do not like its internals too much to dive in and fix my issues. My solution will have other nice (mutant specific) benefits, see the PR description.

@plexus
Copy link
Contributor Author

plexus commented Oct 23, 2014

ok, will try older versions

@mbj
Copy link
Owner

mbj commented Oct 23, 2014

@plexus you could also exclude the problematic subject. --ignore-subject.

@mbj
Copy link
Owner

mbj commented Oct 24, 2014

@plexus A recent commit on master removes the nested use of parallel that could potentially fix your issue. Parallel is not thread save and fails in weird ways when it gets used nested.

@mbj
Copy link
Owner

mbj commented Oct 24, 2014

@plexus I'd love you confirm the problem is solved to release 0.6.4 with the fix.

@plexus
Copy link
Contributor Author

plexus commented Oct 27, 2014

Alas... now it runs up to

Mutant configuration:
Matcher:            #<Mutant::Matcher::Config match_expressions=[<Mutant::Expression: Yaks*>] subject_ignores=[] subject_selects=[]>
Integration:        rspec
Expect Coverage:    100.00%
Jobs:               4
Includes:           ["lib"]
Requires:           ["yaks"]
Available Subjects: 150
Subjects:           150
Mutations:          5594
Kills:              4941
Alive:              652
Runtime:            1247.67s
Killtime:           555.33s
Overhead:           124.67%
Coverage:           88.34%
Expected:           100.00%
Active subjects:    1
Yaks::Format::Hal#serialize_resource:/home/arne/github/yaks/yaks/lib/yaks/format/hal.rb:30 mutations: 127
- rspec:Yaks::Format::Hal
..............................................................................................................................
(126/127) 100% - killtime: 7.27s runtime: 30.99s overhead: 23.71s

then hangs, when I Ctrl-C I get

^CParallel execution interrupted, exiting ...
rake aborted!
Interrupt: 
/home/arne/github/yaks/Rakefile:10:in `block (2 levels) in delegate_task'
/home/arne/github/yaks/Rakefile:9:in `block in delegate_task'
Tasks: TOP => mutant
rake aborted!
Interrupt: Interrupt
(See full trace by running task with --trace)
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:398:in `block in trap_interrupt'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:345:in `call'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:345:in `join'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:345:in `block in wait_for_threads'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:343:in `map'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:343:in `wait_for_threads'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:140:in `block in in_threads'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:386:in `kill_on_ctrl_c'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:140:in `in_threads'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:231:in `work_in_threads'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:189:in `map'
/home/arne/github/mutant/lib/mutant/runner.rb:49:in `run'
/home/arne/github/mutant/lib/mutant/runner.rb:25:in `initialize'
/home/arne/.gem/ruby/2.1.3/gems/adamantium-0.2.0/lib/adamantium/class_methods.rb:17:in `new'
/home/arne/.gem/ruby/2.1.3/gems/adamantium-0.2.0/lib/adamantium/class_methods.rb:17:in `new'
/home/arne/.gem/ruby/2.1.3/gems/procto-0.0.2/lib/procto.rb:62:in `block in initialize'
/home/arne/github/mutant/lib/mutant/env.rb:40:in `run'
/home/arne/.gem/ruby/2.1.3/gems/procto-0.0.2/lib/procto.rb:62:in `public_send'
/home/arne/.gem/ruby/2.1.3/gems/procto-0.0.2/lib/procto.rb:62:in `block in initialize'
/home/arne/github/mutant/lib/mutant/cli.rb:25:in `run'
/home/arne/github/yaks/rakelib/shared.rake:12:in `block in mutant_task'
Tasks: TOP => mutant
(See full trace by running task with --trace)

@mbj
Copy link
Owner

mbj commented Oct 27, 2014

@plexus I count this as another improvement over the initial behavior.

@plexus
Copy link
Contributor Author

plexus commented Oct 27, 2014

Running the single subject on master reproduces the behavior, on actors that one now finishes fine. When I add a timeout it also finishes fine on master. So that's good enough for me! Just have to see if it makes it through a full run but seems it will be fine.

@mbj
Copy link
Owner

mbj commented Oct 27, 2014

Mhh.

Master and Actors should be semantically equivalent in terms of termination behavior. When adding a timeout fixes master, but actors was not even affected there are some effects I oversaw.

BTW: Do you notice a small (around 15%) speedup when using recent master, with hand rolling isolation rather than using parallel I saw a reduction of auom coverage time from 34s to 25s.

When your full run passes I'm gonna release 0.6.4 from master anyway. As its a improvement over 0.6.3.

@mbj mbj closed this as completed in fc8f924 Oct 27, 2014
@plexus
Copy link
Contributor Author

plexus commented Oct 27, 2014

Now it's a different subject that gets stuck, timeout isn't helping.

On master it just gets stuck like before

Mutant configuration:
Matcher:            #<Mutant::Matcher::Config match_expressions=[<Mutant::Expression: Yaks::Format::Hal#serialize_embedded>] subject_ignores=[] subject_selects=[]>
Integration:        rspec
Expect Coverage:    100.00%
Jobs:               4
Includes:           ["lib"]
Requires:           ["yaks"]
Available Subjects: 1
Subjects:           1
Mutations:          64
Kills:              62
Alive:              0
Runtime:            4.34s
Killtime:           6.85s
Overhead:           -36.75%
Coverage:           100.00%
Expected:           100.00%
Active subjects:    1
Yaks::Format::Hal#serialize_embedded:/home/arne/github/yaks/yaks/lib/yaks/format/hal.rb:81 mutations: 64
- rspec:Yaks::Format::Hal
..............................................................
(62/64) 100% - killtime: 6.85s runtime: 14.79s overhead: 7.94s
^CParallel execution interrupted, exiting ...
rake aborted!
Interrupt: Interrupt
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:398:in `block in trap_interrupt'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:345:in `call'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:345:in `join'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:345:in `block in wait_for_threads'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:343:in `map'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:343:in `wait_for_threads'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:140:in `block in in_threads'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:386:in `kill_on_ctrl_c'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:140:in `in_threads'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:231:in `work_in_threads'
/home/arne/.gem/ruby/2.1.3/gems/parallel-1.3.3/lib/parallel.rb:189:in `map'
/home/arne/github/mutant/lib/mutant/runner.rb:49:in `run'
/home/arne/github/mutant/lib/mutant/runner.rb:25:in `initialize'
/home/arne/.gem/ruby/2.1.3/gems/adamantium-0.2.0/lib/adamantium/class_methods.rb:17:in `new'
/home/arne/.gem/ruby/2.1.3/gems/adamantium-0.2.0/lib/adamantium/class_methods.rb:17:in `new'
/home/arne/.gem/ruby/2.1.3/gems/procto-0.0.2/lib/procto.rb:62:in `block in initialize'
/home/arne/github/mutant/lib/mutant/env.rb:40:in `run'
/home/arne/.gem/ruby/2.1.3/gems/procto-0.0.2/lib/procto.rb:62:in `public_send'
/home/arne/.gem/ruby/2.1.3/gems/procto-0.0.2/lib/procto.rb:62:in `block in initialize'
/home/arne/github/mutant/lib/mutant/cli.rb:25:in `run'
/home/arne/github/yaks/rakelib/shared.rake:12:in `block in mutant_task'
Tasks: TOP => mutant
(See full trace by running task with --trace)

On actors it also gets stuck but the Runtime keeps going up

Mutant configuration:
Matcher:            #<Mutant::Matcher::Config match_expressions=[<Mutant::Expression: Yaks::Format::Hal#serialize_embedded>] subject_ignores=[] subject_selects=[]>
Integration:        rspec
Expect Coverage:    100.00%
Jobs:               4
Includes:           ["lib"]
Requires:           ["yaks"]
Available Subjects: 1
Subjects:           1
Mutations:          64
Kills:              63
Alive:              0
Runtime:            13.34s
Killtime:           6.34s
Overhead:           110.27%
Coverage:           100.00%
Expected:           100.00%
Active subjects:    1
Yaks::Format::Hal#serialize_embedded:/home/arne/github/yaks/yaks/lib/yaks/format/hal.rb:81 mutations: 64
- rspec:Yaks::Format::Hal
...............................................................
(63/64) 100% - killtime: 6.34s runtime: 13.75s overhead: 7.41s
rake aborted!
Interrupt: 
/home/arne/github/mutant/lib/mutant/reporter/cli.rb:89:in `write'
/home/arne/github/mutant/lib/mutant/reporter/cli.rb:47:in `progress'
/home/arne/github/mutant/lib/mutant/runner.rb:39:in `block in initialize'
/home/arne/github/mutant/lib/mutant/runner.rb:36:in `loop'
/home/arne/github/mutant/lib/mutant/runner.rb:36:in `initialize'
/home/arne/.gem/ruby/2.1.3/gems/adamantium-0.2.0/lib/adamantium/class_methods.rb:17:in `new'
/home/arne/.gem/ruby/2.1.3/gems/adamantium-0.2.0/lib/adamantium/class_methods.rb:17:in `new'
/home/arne/.gem/ruby/2.1.3/gems/procto-0.0.2/lib/procto.rb:62:in `block in initialize'
/home/arne/github/mutant/lib/mutant/env.rb:40:in `run'
/home/arne/.gem/ruby/2.1.3/gems/procto-0.0.2/lib/procto.rb:62:in `public_send'
/home/arne/.gem/ruby/2.1.3/gems/procto-0.0.2/lib/procto.rb:62:in `block in initialize'
/home/arne/github/mutant/lib/mutant/cli.rb:25:in `run'
/home/arne/github/yaks/rakelib/shared.rake:12:in `block in mutant_task'
Tasks: TOP => mutant
(See full trace by running task with --trace)

@mbj mbj reopened this Oct 27, 2014
@mbj
Copy link
Owner

mbj commented Oct 27, 2014

@plexus Actors decouples killing from reporting. For that reason the report timer still triggers nicely in such cases. We'll need worker reporting to I plan for actors where we know which worker is doing what right now.

What is your timeout set to? Rubies timers are not very deterministic, I suggest to wait 2-10x longer to be sure the timeout has a chance to get a rid of the mutation.

@plexus
Copy link
Contributor Author

plexus commented Oct 27, 2014

I first had it to 0.1 seconds, but then later set it to 1 sec. With 0.01 or smaller I got segfaults :)

@mbj
Copy link
Owner

mbj commented Oct 27, 2014

hehe. I'll look into it. Can we setup a branch in yaks to use mutants master to reproduce this problem?

@plexus
Copy link
Contributor Author

plexus commented Oct 27, 2014

yeah sure, I can do that

@mbj
Copy link
Owner

mbj commented Oct 27, 2014

@plexus I released 0.6.4 with the first set of fixes. Even with your ongoing problem I'm confident its an improvement over 0.6.3.

@mbj
Copy link
Owner

mbj commented Oct 28, 2014

@plexus Okay. So I did a very deep dive. Short story: Can you try 1.9.3 I think a specific MRI bug is that causes this clusterfuck is not present here.

Long story: We know that mutant exposes bugs in MRI, bugs that manifest in segfaults. This behavior was visible from 2.0.0 upwards. Some of the bugs had been easily reducible. And you reported it upstream and the problems got fixed. But there is a class of segfaults that where not easily to reduce. Mutant, for this reason, counts crashing killforks as "killed" and moves on. But there is a class of MRI bugs being exposed where MRI segfaults and than locks itself up in a futex while handling the segfault. This only happens when a timer thread is active. It causes a killfork to never terminate. Mutant infinitely waits for this killfork to terminate resulting in the hang you see.

I strac-ed / gdb-ed such a hanging killfork and will try to reproduce it outside the mutant / rspec / yaks environment. This will take some time.

As a faster countermeasure we probably also need to timeout killforks from mutants parent side. Something I never wanted to build for various reasons:

  • Its simply more complex ;)
  • Ruby and timers is ugly (see the segfault story above)
  • I always wanted people to be able to paste the mutanted code and run the specs and see the missing failure. In case mutant fails by itself we loose this property that made debugging very easy. Through people will probably see something "interesting" when pasting in that mutated code ;)
  • To avoid a false sense of coverage kills caused by mutant-side-timeouts must be explicitly reported. This will result in additional state. "killed", "alive", "killed-by-mutant-timeout-because-mri-is-likely-buggy".

I probably need to pile up much motivation to dig into that messy ecosystem to produce such a workaround / fix.

@mbj
Copy link
Owner

mbj commented Oct 28, 2014

@plexus

I first had it to 0.1 seconds, but then later set it to 1 sec. With 0.01 or smaller I got segfaults :)

Does this mean you reproduced the segfaults outside mutant already? Can you create me a branch from that state so I can begin reducing it?

@mbj
Copy link
Owner

mbj commented Oct 29, 2014

@plexus I reduced it to this short one, crashes most of recent MRI with a segfault.

I could not reduce the segfault + stuck in futex yet. But I think getting this one reported / fixed is a good start. BTW 2.1.1 (at least my build) seems not to be affected. Everything else >= 2.0.0 is. //cc @whitequark I promised many of these weeks ago. This is the first time I had the energy to reduce them from ~20kloc. Hope I get the others reduced also.

class Foo
  include Enumerable

  def initialize(items)
    @items = items
  end

  def each(&block)
    @items.__send__(:each, &block)
  rescue Exception
  end

  def more
    to_a # or any other method from Enumerable.
    [self]
  end
end

# Intentional infinite recursion to trigger the bug
# This is not code I'd write. But its code ruby should not segfault on.
def call(resource)
  resource.more.each(&method(__method__))
end

# Removing the thread results in *correct* behavior (stack level to deep error).
Thread.new do
  call(Foo.new([]))
end.join

@mbj
Copy link
Owner

mbj commented Oct 29, 2014

@plexus Will you report this one upstream? You had good luck the last time we reduced one from mutant ;)

@plexus
Copy link
Contributor Author

plexus commented Oct 29, 2014

How about setting smaller timeouts? On latest trunk it doesn't crash with 0.1 for me but it does with 0.001 for example.

@mbj
Copy link
Owner

mbj commented Oct 29, 2014

@plexus My reduction does not involve timeouts. I think it might be the kernel version that produces other scheduling patterns. What kernel you are on? I can downgrade to see of the race (I expect its one) does not occur here. In this case I think my reproduction will help upstream to fix early.

@mbj
Copy link
Owner

mbj commented Oct 29, 2014

How about setting smaller timeouts? On latest trunk it doesn't crash with 0.1 for me but it does with 0.001 for example.

I think I'll just write a fuzzer interleaving threads with timeouts in random ways to get a better picture.

@mbj
Copy link
Owner

mbj commented Oct 29, 2014

This will be fun with unparser assistance ;)

@plexus
Copy link
Contributor Author

plexus commented Oct 29, 2014

kernel: Linux lutze 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

@whitequark
Copy link

@mbj Oops, sorry, misread that as addressed to me, somehow.

@mbj
Copy link
Owner

mbj commented Oct 29, 2014

@plexus Just had luck reproducing it on an ubuntu VM:

class Foo
  include Enumerable

  def each(&block)
    [].__send__(:each, &block)
  rescue Exception
  end

  def more
    to_a # any method from enumerable
    [self]
  end
end

def call(resource)
  resource.more.each(&method(__method__))
end

Thread.new do
  Kernel.catch(Exception.new) do
    call(Foo.new)
  end
end.join

Can you confirm it also crashes on your machine? I think it would be reportable to upstream than.

@mbj
Copy link
Owner

mbj commented Oct 29, 2014

Using the CrowdCI I got confirmations about segfaults / hangs, but not on all systems: https://gist.github.com/mbj/31163a8e712573877268

@mbj
Copy link
Owner

mbj commented Oct 30, 2014

@plexus So some people can reproduce the segfault. IMO its reportable to upstream with this findings. What do you think?

@plexus
Copy link
Contributor Author

plexus commented Oct 30, 2014

Yup, can confirm that it crashes on 2.1.4 and trunk. I'll report it.

@mbj
Copy link
Owner

mbj commented Oct 30, 2014

@plexus Nice. I'm working in mutant side timeouts of killforks within the actors branch. In the hope this timeouts do NOT trigger the bug we are trying to workaround :P

@plexus
Copy link
Contributor Author

plexus commented Oct 30, 2014

Ruby lang issue regarding the segfault: https://bugs.ruby-lang.org/issues/10460

@mbj
Copy link
Owner

mbj commented Oct 30, 2014

@plexus Thx.

@mbj
Copy link
Owner

mbj commented Nov 3, 2014

I added a brief summary about this to mutants README, as it will probably take long to resolve this bug.

@mbj mbj changed the title Mutant doesn't terminate Mutant causes MRI to get stuck Nov 4, 2014
@mbj
Copy link
Owner

mbj commented Nov 13, 2014

I just tried with freshly released MRI 2.1.5 and it still segfaults.

@mbj
Copy link
Owner

mbj commented Dec 22, 2014

I think upstream issue https://bugs.ruby-lang.org/issues/10626 is related to this one. I saw the same symptoms but could never isolate them in a stable way.

@mbj
Copy link
Owner

mbj commented Jan 31, 2015

@plexus This problem was probably solved with: ruby/ruby@8fe95fe can you confirm?

@plexus
Copy link
Contributor Author

plexus commented Feb 9, 2015

Yup works fine now, great news!

@mbj
Copy link
Owner

mbj commented Feb 9, 2015

@plexus Yeah. I hope MRI people cut a release with that one soon. So I can continue mutant development on a more deterministic environment. I'll close this issue now.

@mbj
Copy link
Owner

mbj commented Nov 23, 2020

This issue is cited a lot, hence years after it was closed let me point interested people to configurable coverage criteria mutant added in the latest release:

https://github.com/mbj/mutant/blob/master/docs/configuration.md#coverage_criteria

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants