What are these characters in the bin file? #20

Open
JunjieCheng opened this issue Feb 1, 2018 · 1 comment
JunjieCheng commented Feb 1, 2018

I opened the file in 'rb' mode and printed it line by line. The output contains many bytes that are not readable text:

with open('/users/cheng/NLP/Data/finished_files/chunked/test_000.bin', 'rb') as file:
    for line in file:
        print(line)
b'R\x1e\x00\x00\x00\x00\x00\x00\n'
b'\xcf<\n'
b'\xf0\x02\n'
b'\x08abstract\x12\xe3\x02\n'
b'\xe0\x02\n'
b"\xdd\x02<s> marseille prosecutor says `` so far no videos were used in the crash investigation '' despite media reports . </s> <s> journalists at bild and paris match are `` very confident '' the video clip is real , an editor says . </s> <s> andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says . </s>\n"
b'\xd99\n'
b'\x07article\x12\xcd9\n'
b'\xca9\n'

Then I tried to process the files myself by splitting out the article and the abstract and writing them to separate files, but after most of the files had been processed I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

How can I get a clean article and abstract from these files?

@JafferWilson

@JunjieCheng This is the binary format that the TensorFlow code accepts for testing; it is preprocessed data. The code takes binary data because the system can read it quickly. If you do not want to convert to binary, you can change the code to suit your needs, since it is openly available. Please do not ask what to change, as that part is up to you. If you run into any issue, ask here.
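
For anyone who only wants the text back out: the dump in the question looks like a sequence of records, each an 8-byte little-endian length (`R\x1e\x00\x00\x00\x00\x00\x00`) followed by a serialized tf.Example protobuf with `abstract` and `article` byte features. That layout is inferred from the printed bytes and is not confirmed in this thread, so treat the following as a minimal sketch under that assumption. It would also explain the UnicodeDecodeError: iterating a binary file "by line" splits on every 0x0a byte, and the protobuf length bytes mixed into each chunk (such as 0x80) are not valid UTF-8. The helper names `read_records`, `read_varint`, `fields`, and `parse_example` are made up for illustration.

```python
import struct


def read_records(path):
    """Yield the raw bytes of each length-prefixed record in a .bin file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return  # end of file
            # Assumption: 8-byte little-endian signed length, then the payload.
            (length,) = struct.unpack("<q", header)
            yield f.read(length)


def read_varint(buf, pos):
    """Decode one protobuf varint starting at pos; return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def fields(buf):
    """Yield (field_number, payload) for length-delimited fields in buf."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        if key & 7 != 2:  # only wire type 2 (length-delimited) is expected here
            raise ValueError("unexpected wire type %d" % (key & 7))
        length, pos = read_varint(buf, pos)
        yield key >> 3, buf[pos:pos + length]
        pos += length


def parse_example(record):
    """Return {feature_name: text} for the bytes features of one record."""
    out = {}
    for num, features in fields(record):      # Example.features (field 1)
        if num != 1:
            continue
        for num2, entry in fields(features):  # one map entry per feature
            if num2 != 1:
                continue
            parts = dict(fields(entry))       # 1 = feature name, 2 = Feature
            name = parts[1].decode("utf-8")
            for fnum, blist in fields(parts.get(2, b"")):
                if fnum != 1:                 # 1 = bytes_list; skip other kinds
                    continue
                for vnum, value in fields(blist):
                    if vnum == 1:
                        # Replace undecodable bytes instead of raising.
                        out[name] = value.decode("utf-8", errors="replace")
    return out
```

To use it, loop with `for record in read_records(path):` and read `parse_example(record)["article"]` and `["abstract"]`. If TensorFlow is installed, the hand-written protobuf walk can be replaced with `tf.train.Example.FromString(record)`, keeping only the `read_records` framing step.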
