eflomal crashes during filtering #63

Open
yvesscherrer opened this issue May 31, 2023 · 1 comment
@yvesscherrer

Alignment model creation works fine, but during filtering Eflomal crashes with the following error message:

INFO:opusfilter.opusfilter:Running step 5: filter
20343327it [10:23, 32615.14it/s]
INFO:eflomal:Prepared 20343327 sentences for alignment
INFO:eflomal:Reading lexical priors...
INFO:eflomal:1618911 (of 2174631) pairs of lexical priors used
Traceback (most recent call last):
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/bin/opusfilter", line 31, in <module>
    of.execute_steps(overwrite=args.overwrite, last=args.last)
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/opusfilter.py", line 224, in execute_steps
    self._run_step(step, num + 1, overwrite)
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/opusfilter.py", line 289, in _run_step
    self.step_functions[step['type']](parameters, overwrite=overwrite)
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/opusfilter.py", line 96, in wrapper
    return self.parallelize(*args, **kwargs)
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/opusfilter.py", line 141, in parallelize
    self.func(obj, parameters, overwrite)
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/opusfilter.py", line 380, in filter_data
    for idx, pair in enumerate(pairs):
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/word_alignment.py", line 170, in _filtergen
    self.aligner.align(
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/eflomal/__init__.py", line 72, in align
    align(srcf.name, trgf.name,
  File "python/eflomal/eflomal.pyx", line 161, in eflomal.cython.align
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/eflomal/bin/eflomal', '-m', '3', '-s', '/tmp/tmpawsij1rg', '-t', '/tmp/tmpphsceo43', '-n', '3', '-N', '0.2', '-1', '2', '-q', '-2', '1', '-3', '2', '-F', '/tmp/tmpyamo5usj', '-R', '/tmp/tmps4d0ndvi', '-p', '/tmp/tmp18jxqkax']' died with <Signals.SIGKILL: 9>.

The Eflomal unittest (test_eflomal.py) runs fine:

/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/eflomal/bin/eflomal -m 3 -s /tmp/tmpst1zbe0v -t /tmp/tmps4j5_0m8 -n 3 -N 0.2 -1 721 -2 721 -3 2887 -f /tmp/tmpf50or8p5 -r /tmp/tmp98dw6njz
Read texts (3 sentences): 0.000 s
Vocabulary sizes are 9 (source), 9 (target)
Created alignment structures: 0.000 s
Created alignment structures: 0.000 s
Randomized alignment: 0.002 s
Aligning with model 1 (721 iterations)
Randomized alignment: 0.000 s
Aligning with model 1 (721 iterations)
Done: 0.002 s
Aligning with model 2 (721 iterations)
Done: 0.002 s
Aligning with model 2 (721 iterations)
Done: 0.001 s
Aligning with model 3 (2887 iterations)
Done: 0.001 s
Aligning with model 3 (2887 iterations)
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmpf50or8p5 for 3 sentencess
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmp98dw6njz for 3 sentencess
./mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/eflomal/bin/eflomal -m 3 -s /tmp/tmpfpe3h_i5 -t /tmp/tmpqggbus3t -n 3 -N 0.2 -1 721 -2 721 -3 2887 -f /tmp/tmp4y0_3tw1 -r /tmp/tmpk5nynnwy -p /tmp/tmp4yygknic
Read texts (3 sentences): 0.000 s
Vocabulary sizes are 9 (source), 9 (target)
Created alignment structures: 0.000 s
Created alignment structures: 0.000 s
Randomized alignment: 0.001 s
Aligning with model 1 (721 iterations)
Randomized alignment: 0.001 s
Aligning with model 1 (721 iterations)
Done: 0.001 s
Aligning with model 2 (721 iterations)
Done: 0.002 s
Aligning with model 2 (721 iterations)
Done: 0.002 s
Aligning with model 3 (2887 iterations)
Done: 0.001 s
Aligning with model 3 (2887 iterations)
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmpk5nynnwy for 3 sentencess
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmp4y0_3tw1 for 3 sentencess
./mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/eflomal/bin/eflomal -m 3 -s /tmp/tmpdd0kzzqb -t /tmp/tmpex4wlj51 -n 3 -N 0.2 -1 721 -2 721 -3 2887 -f /tmp/tmpjxe0px3n -r /tmp/tmpu0jpju0y
Read texts (3 sentences): 0.000 s
Vocabulary sizes are 9 (source), 9 (target)
Created alignment structures: 0.000 s
Created alignment structures: 0.000 s
Randomized alignment: 0.002 s
Aligning with model 1 (721 iterations)
Randomized alignment: 0.002 s
Aligning with model 1 (721 iterations)
Done: 0.002 s
Aligning with model 2 (721 iterations)
Done: 0.003 s
Aligning with model 2 (721 iterations)
Done: 0.002 s
Aligning with model 3 (2887 iterations)
Done: 0.001 s
Aligning with model 3 (2887 iterations)
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmpu0jpju0y for 3 sentencess
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmpjxe0px3n for 3 sentencess
.
----------------------------------------------------------------------
Ran 3 tests in 0.182s

OK

The OpusFilter unit test also seems to run fine:

.........
----------------------------------------------------------------------
Ran 9 tests in 0.911s

OK

@svirpioj

The process was most likely killed for exceeding memory limits, which would explain the SIGKILL. Eflomal uses a considerable amount of memory on large inputs, and its usage appears to grow linearly with corpus size: for a corpus of 20 million sentence pairs, it used 10 gigabytes.
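
As a rough check, the figures quoted above work out to about 500 bytes per sentence pair. The sketch below does that arithmetic and extrapolates to the 20,343,327 pairs reported in the log. It assumes decimal gigabytes and strictly linear scaling, neither of which is confirmed here, and `estimate_memory_gb` is a hypothetical helper, not part of eflomal or OpusFilter.

```python
def estimate_memory_gb(num_pairs, ref_pairs=20_000_000, ref_gb=10.0):
    """Linearly extrapolate memory use from one reference measurement.

    Assumption: memory grows strictly linearly with the number of pairs.
    """
    return ref_gb * num_pairs / ref_pairs


# 10 GB (taken as 10 * 10^9 bytes) spread over 20 million pairs
bytes_per_pair = 10.0 * 1e9 / 20_000_000

# Corpus size taken from the "Prepared 20343327 sentences" log line
estimate = estimate_memory_gb(20_343_327)

print(f"{bytes_per_pair:.0f} bytes per pair, "
      f"about {estimate:.1f} GB for the corpus in the log")
```

If the machine (or a WSL memory cap) offers less than that, a kill during alignment would be consistent with this estimate.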

Possible solutions:

  • Split the files into smaller subsets before filtering.
  • If you use multiple filters, put WordAlignFilter last, so that it runs on the least data.
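
The first suggestion can be sketched in plain Python, outside OpusFilter. The function name `split_parallel`, the `.partNNNN` suffix, and the default chunk size are all assumptions made for illustration; the only requirement is that the two files are line-aligned, so that both sides are cut at the same line numbers.

```python
from itertools import islice


def split_parallel(src_path, tgt_path, chunk_size=1_000_000):
    """Split two line-aligned files into numbered chunk files.

    Returns a list of (source_chunk_path, target_chunk_path) tuples.
    """
    chunk_paths = []
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        pairs = zip(src, tgt)  # reads both files in lockstep
        index = 0
        while True:
            # Only one chunk of pairs is held in memory at a time
            chunk = list(islice(pairs, chunk_size))
            if not chunk:
                break
            src_out = f"{src_path}.part{index:04d}"
            tgt_out = f"{tgt_path}.part{index:04d}"
            with open(src_out, "w", encoding="utf-8") as fs, \
                    open(tgt_out, "w", encoding="utf-8") as ft:
                for src_line, tgt_line in chunk:
                    fs.write(src_line)
                    ft.write(tgt_line)
            chunk_paths.append((src_out, tgt_out))
            index += 1
    return chunk_paths
```

Each resulting pair of chunk files can then be filtered separately and the outputs concatenated afterwards.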

The score step, and the filter step with filterfalse=True, process the data in chunks automatically, but the normal filter step does not. Maybe there should be an option for that.
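
To illustrate what chunking would mean for a plain filter, here is a minimal sketch. It is not OpusFilter's implementation: `filter_in_chunks`, `score_chunk`, and `accept` are hypothetical names, and `score_chunk` merely stands in for an expensive batch scorer such as a word aligner run. The point is that the scorer only ever sees `chunk_size` pairs at once, so its peak memory is bounded by the chunk rather than by the whole corpus.

```python
from itertools import islice


def filter_in_chunks(pairs, score_chunk, accept, chunk_size=100_000):
    """Yield accepted pairs, scoring at most chunk_size pairs at a time."""
    iterator = iter(pairs)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        # score_chunk returns one score per pair, in the same order
        for pair, score in zip(chunk, score_chunk(chunk)):
            if accept(score):
                yield pair


# Toy scorer: absolute difference in token counts between the two sides
def length_diff(chunk):
    return [abs(len(s.split()) - len(t.split())) for s, t in chunk]


pairs = [("a b", "x y"), ("a", "x y z w"), ("a b c", "x y")]
kept = list(filter_in_chunks(pairs, length_diff, lambda d: d <= 1, chunk_size=2))
print(kept)
```

With the toy scorer, the second pair is dropped because its sides differ by three tokens, while the other two pass.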
