Important fix (untokenized data written to .bin files) #2
When I run the code, it always prints: "Tried to find tokenized story file 9bfbb6ede20df9611c2a8b42980629658dc5ec23.story in both directories cnn_stories_tokenized and dm_stories_tokenized. Couldn't find it."
@yangze01 That error message means that it's trying to find a tokenized version of one of your story files, but can't. You probably had some error during tokenization that resulted in an incomplete set of tokenized files in `cnn_stories_tokenized` or `dm_stories_tokenized`. The latest commit now has more informative checks and error messages.
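As an illustration of this failure mode, here is a minimal sketch for spotting an incomplete tokenization run. The helper name and source-directory paths are illustrative assumptions, and it assumes the tokenized files keep their original `.story` filenames:

```python
import os

def find_untokenized(stories_dir, tokenized_dir):
    """Return story filenames present in stories_dir but missing from tokenized_dir."""
    return sorted(set(os.listdir(stories_dir)) - set(os.listdir(tokenized_dir)))

# Paths are hypothetical; point these at your own story directories.
for src, tok in [("cnn/stories", "cnn_stories_tokenized"),
                 ("dailymail/stories", "dm_stories_tokenized")]:
    missing = find_untokenized(src, tok)
    if missing:
        print("%d stories in %s were never tokenized, e.g. %s"
              % (len(missing), src, missing[0]))
```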
@abisee Thanks. I re-downloaded the original CNN/Daily Mail stories, and it works.
This is a notification that the code to obtain the CNN / Daily Mail dataset unfortunately had a bug which caused the untokenized data to be written to the `.bin` files (not the tokenized data, as intended). The fix has been committed here.

If you've already created your `.bin` and `vocab` files, I advise you to recreate them. To do this:

1. Pull the latest version of the `cnn-dailymail` repo.
2. Delete your `finished_files` directory (but keep the `cnn_stories_tokenized` and `dm_stories_tokenized` directories).
3. Comment out `tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)` and `tokenize_stories(dm_stories_dir, dm_tokenized_stories_dir)` (lines 178 and 179) of `make_datafiles.py`. This is because you don't need to retokenize the data. (See the sketch after this list.)
4. Run `make_datafiles.py`. This will create the new `.bin` and `vocab` files.
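For reference, this is roughly what step 3 looks like; the comments are mine, and the exact line numbers may differ in your checkout:

```python
# In make_datafiles.py (around lines 178-179), disable re-tokenization
# by commenting out the two tokenize_stories calls -- the files in
# cnn_stories_tokenized and dm_stories_tokenized are already tokenized:
# tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)
# tokenize_stories(dm_stories_dir, dm_tokenized_stories_dir)

# Then rerun the script to regenerate the .bin and vocab files, e.g.:
#   python make_datafiles.py <cnn_stories_dir> <dailymail_stories_dir>
# (invocation shown for illustration; check the repo README for exact usage)
```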
If you've already begun training with the Tensorflow code, I advise you to restart training with the new datafiles. Switching the `vocab` and `.bin` files mid-training will not work.

Apologies for the inconvenience.
Tagging people to whom this may be relevant: @prokopevaleksey @tianjianjiang @StevenLOL @MrGLaDOS @hate5six @liuchen11 @bugtig @ayushoriginal @BenJamesbabala @BinbinBian @caomw @halolimat @ml-lab @ParseThis @qiang2100 @scylla @tonydeep @yiqingyang2012 @YuxuanHuang @Rahul-Iisc @pj-parag