got error when processing training data #24

Open
sindax123 opened this issue Dec 25, 2022 · 5 comments

Comments

@sindax123

Hi, dear developer,
I got an error when processing training data with the TR0 data provided by MXfold2:

$ python process_data_newdataset.py TR0
Traceback (most recent call last):
  File "process_data_newdataset.py", line 69, in <module>
    pair_dict_all_list = [[int(item_tmp)-1,int(t2[1].split('\n')[index_tmp])-1] for index_tmp,item_tmp in enumerate(t1[1].split('\n')) if int(t2[1].split('\n')[index_tmp]) != 0]
  File "process_data_newdataset.py", line 69, in <listcomp>
    pair_dict_all_list = [[int(item_tmp)-1,int(t2[1].split('\n')[index_tmp])-1] for index_tmp,item_tmp in enumerate(t1[1].split('\n')) if int(t2[1].split('\n')[index_tmp]) != 0]
ValueError: invalid literal for int() with base 10: 'X'

Since I have no idea what the data is supposed to look like, I am confused by this problem. Could you please tell me how to fix it? Thank you!

@sindax123
Author

When I printed t0, t1, and t2 in the code, some of the files were processed successfully, while for others t0, t1, and t2 turned out to be, respectively:
(0, 'OS')
(0, '\x00\x05\x16\x07\x00\x02\x00\x00Mac')
(0, 'X')

@sperfu
Contributor

sperfu commented Dec 26, 2022

Hi there,

We used this script to process several different formats of training data, so we may have altered parts of process_data_newdataset.py along the way. One solution is to load those files with pickle (a Python package) and check exactly what they contain. I hope that will work.
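As a rough illustration of that inspection step, the sketch below loads a single file with `pickle` and reports what kind of object it holds. It assumes the file really is a pickle, which this thread does not confirm for the TR0 files; the function name `describe_pickle` and any path passed to it are hypothetical and not part of the repository.

```python
import pickle


def describe_pickle(path):
    """Load one pickle file and return a short summary of its contents.

    Hypothetical helper for inspection only; it is not part of
    process_data_newdataset.py. Only unpickle files you trust.
    """
    with open(path, "rb") as fh:
        obj = pickle.load(fh)

    # Record the top-level type, and the size if the object has one.
    summary = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["length"] = len(obj)

    # For sequences, keep a truncated repr of the first element as a preview.
    if isinstance(obj, (list, tuple)) and len(obj) > 0:
        summary["first_item"] = repr(obj[0])[:200]
    return summary
```

Calling `describe_pickle` on one file that processes correctly and one that fails, then comparing the two summaries, is one way to see where the formats diverge. If `pickle.load` itself raises an error, the file is probably not a pickle at all.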

Thanks

@sindax123
Author

Thank you for your reply! I checked the contents of the data and found that some of it is invalid: it yields "OS" instead of an RNA sequence, and this accounts for at least half of the dataset. I wonder whether this is normal or whether something is wrong with my dataset. If something is wrong with my dataset, where else can I get the data?

@sperfu
Contributor

sperfu commented Dec 27, 2022

I wonder if there is a format issue related to the operating system (note the "OS" and "Mac" strings in your output); it seems you used macOS to handle those files. We processed the files on Linux (Ubuntu), so you may want to pay attention to that.
Secondly, if that doesn't solve your problem, you can refer to the MXfold2 paper. They also provide those datasets.

@sindax123
Author

Thank you for your reply! I think I have figured out what the problem is by double-checking the data. In the TR0 folder I downloaded, each RNA sequence comes with two files, named “._bpRNA_XXXXX” and “bpRNA_XXXX” respectively. I suppose it can be fixed by adding a filter condition.
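A minimal sketch of such a filter condition is shown here, assuming the script collects its inputs by listing a directory. The helper name `list_data_files` is hypothetical and not taken from process_data_newdataset.py. Files whose names start with `._` are the companion files that macOS creates alongside the real ones, and the bytes `\x00\x05\x16\x07` printed earlier in this thread match the AppleDouble header those files begin with.

```python
from pathlib import Path


def list_data_files(data_dir):
    """Return the regular files in data_dir, skipping macOS '._*' companion files.

    Hypothetical helper; adapt the condition to however the real script
    enumerates its input files.
    """
    return sorted(
        p
        for p in Path(data_dir).iterdir()
        # Keep real data files such as 'bpRNA_...' and drop '._bpRNA_...'.
        if p.is_file() and not p.name.startswith("._")
    )
```

For example, `list_data_files("TR0")` would return only the `bpRNA_...` entries, so the parsing code never sees the `._` files. Deleting the `._*` files from the folder before running the script should have the same effect.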


2 participants