Help with Converting Spatio-Temporal Dataset for Consumption #19

Open
trecius opened this issue Apr 26, 2016 · 3 comments

Comments

@trecius

trecius commented Apr 26, 2016

Hello,

I have a spatio-temporal dataset that I have compiled. It's in a TSV format, and I'd like your RNNSharp to consume the input for classification as well as recognition. My features are continuous values in the range [0, 1]. My TSV file looks like the following:

ID1 0.923 0.223 0.573 0.235 0.111
ID1 0.920 0.228 0.353 0.213 0.098
ID1 0.901 0.677 0.235 0.551 0.121
...
ID1 0.853 0.383 0.301 0.618 0.132

ID1 0.918 0.733 0.622 0.222 0.238
ID1 0.985 0.682 0.793 0.221 0.465
...
ID1 0.953 0.788 0.912 0.228 0.539

ID2 0.918 0.733 0.622 0.222 0.238
ID2 0.985 0.682 0.793 0.221 0.465
...
ID2 0.953 0.788 0.912 0.228 0.539

Each line in my TSV is a snapshot at a specific moment in time. When all snapshots are combined, they describe the spatio-temporal entity. These entities are separated by an EMPTY LINE. Therefore, the first instance of ID1 is all the lines until you reach the empty line. The second instance of ID1 is the next set of contiguous lines, and so on. Note that the first TSV value is just a class label and is not a feature. Also, I have 6 class labels for this spatio-temporal dataset.
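As a rough illustration of the layout described above, here is a minimal Python sketch that groups the lines into entities. It assumes the fields are separated by tabs or spaces and that the real file contains no literal `...` lines. The name `parse_entities` is hypothetical and is not part of RNNSharp.

```python
from typing import List, Optional, Tuple

# One entity = (class label, list of frames); each frame is a list of floats.
Entity = Tuple[str, List[List[float]]]


def parse_entities(text: str) -> List[Entity]:
    """Group blank-line-separated blocks of lines into (label, frames) pairs."""
    entities: List[Entity] = []
    label: Optional[str] = None
    frames: List[List[float]] = []
    for line in text.splitlines():
        fields = line.split()  # splits on tabs or spaces
        if not fields:
            # A blank line closes the current entity, if one is open.
            if frames:
                entities.append((label, frames))
            label, frames = None, []
            continue
        if label is None:
            label = fields[0]
        elif fields[0] != label:
            raise ValueError(f"label changed inside an entity: {line!r}")
        # Everything after the first column is a feature value.
        frames.append([float(v) for v in fields[1:]])
    if frames:  # the file may not end with a blank line
        entities.append((label, frames))
    return entities
```

Two consecutive blocks with the same label come back as two separate entities, which matches the "second instance of ID1" case.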

1.) First, how can I transform my data into an "embedded feature" that is in the correct model format? I assume this is the Txt2Vec?

2.) Additionally, I will have to create a corpus. Will the following work for the corpus?

ID1 ClassLabel1
ID2 ClassLabel2
ID3 ClassLabel3
ID4 ClassLabel4
ID5 ClassLabel5
ID6 ClassLabel6

3.) Additional steps or a walkthrough would be greatly appreciated. I hope this information helps all others who are trying to consume RNNSharp. When I finish, I hope to compile a walkthrough for others, so they can easily consume this great technology.

Thank you.

@zhongkaifu
Owner

For each time frame (one line in your training corpus), if it contains only 5 features, you could build an embedding model like the following, in which each time frame has its own unique ID:
ID1 0.923 0.223 0.573 0.235 0.111
ID2 0.920 0.228 0.353 0.213 0.098
ID3 0.901 0.677 0.235 0.551 0.121
...
ID2 0.920 0.228 0.353 0.213 0.098

I just updated RNNSharp to support embedding models in raw text format, so you can use the above format for training directly. Please replace WORDEMBEDDING_FILENAME with WORDEMBEDDING_RAW_FILENAME in the configuration file.

For #2, yes, it looks good. For example, it may look like:
ID1 Wave
ID2 Label2
ID2 Wave
...
IDn LabelX

Each time frame has a corresponding label as its result.
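The conversion described in this reply could be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_rnnsharp_inputs` is hypothetical, the `ID<n>` key scheme simply mirrors the example above, and tab-separated output is assumed rather than confirmed by RNNSharp's documentation.

```python
from typing import List, Tuple

# One entity = (class label, list of frames); each frame is a list of floats.
Entity = Tuple[str, List[List[float]]]


def build_rnnsharp_inputs(entities: List[Entity]) -> Tuple[str, str]:
    """Return (raw embedding text, training corpus text) for the given entities."""
    embedding_lines: List[str] = []
    corpus_lines: List[str] = []
    frame_no = 0
    for label, frames in entities:
        for frame in frames:
            frame_no += 1
            key = f"ID{frame_no}"  # hypothetical unique key per time frame
            # Embedding file: key followed by the continuous feature values.
            embedding_lines.append("\t".join([key] + [str(v) for v in frame]))
            # Training corpus: the same key followed by the label of its entity.
            corpus_lines.append(f"{key}\t{label}")
        corpus_lines.append("")  # a blank line ends each entity in the corpus
    embedding_text = "\n".join(embedding_lines) + "\n"
    corpus_text = "\n".join(corpus_lines) + "\n"
    return embedding_text, corpus_text
```

Only the corpus text contains blank lines; the embedding text is a flat list of key and value rows.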

@trecius
Author

trecius commented Apr 27, 2016

Hello:

I'm getting closer. I've since extracted all the time frames that I want to train on into a single file: rawModel.txt. It has the format:

\t\t\t\t\t
\t\t\t\t\t
...
\t\t\t\t\t

I've also created a train.txt file, and it is in the format:

\t
\t
\t
...
\t

Finally, I've also created a template.txt file. It looks like this:

U01:%x[0,0]
U02:%x[0,1]
U03:%x[0,2]
U04:%x[0,3]
U05:%x[0,4]
U06:%x[-1,0]
U07:%x[-1,1]
U08:%x[-1,2]
U09:%x[-1,3]
U10:%x[-1,4]
U11:%x[1,0]
U12:%x[1,1]
U13:%x[1,2]
U14:%x[1,3]
U15:%x[1,4]

I've modified the BAT file to use the new files, but it's not working the way I had planned.

1.) How does RNNSharp (RNNSharpConsole) know when one spatio-temporal entity has ended and a new one begins? I'm mostly asking about the edge cases. I've tried to split them up using a blank line, but an exception is thrown, stating that the lengths are not the same.

@zhongkaifu
Owner

Since you are going to use continuous values as features, template.txt should keep only one line: U01:%x[0,0]. All of the other lines are used for discrete features only.

In the training corpus, RNNSharp uses a blank line to separate two entities, but the embedding model (rawModel.txt in your example) does not need blank lines. The embedding model is just a set of key-value pairs: RNNSharp looks up each keyword in the embedding model and gets its dense features for encoding or decoding.
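Before rerunning training, a quick consistency check of the two files may help narrow things down. The Python sketch below is a guess at plausible causes of a length mismatch (blank lines in the embedding file, rows of unequal width, corpus keys with no embedding entry). The thread does not confirm which, if any, of these triggers the exception, and `check_inputs` is a hypothetical helper, not an RNNSharp tool. Tab-separated files are assumed.

```python
from typing import Dict, List


def check_inputs(embedding_text: str, corpus_text: str) -> List[str]:
    """Return human-readable problems found in the two input files."""
    problems: List[str] = []
    widths: Dict[str, int] = {}  # embedding key -> number of values on its row

    for n, line in enumerate(embedding_text.splitlines(), start=1):
        if not line.strip():
            # The embedding file is a flat key-value list, so no separators.
            problems.append(f"embedding line {n}: blank line")
            continue
        key, *values = line.split("\t")
        widths[key] = len(values)

    if len(set(widths.values())) > 1:
        problems.append("embedding rows have differing numbers of values")

    for n, line in enumerate(corpus_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines separate entities in the corpus
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(f"corpus line {n}: expected 2 columns, found {len(fields)}")
        elif fields[0] not in widths:
            problems.append(f"corpus line {n}: key {fields[0]!r} missing from embedding file")

    return problems
```

An empty list means none of these particular problems were found; it does not guarantee that RNNSharp will accept the files.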

RNNSharp already supports embedding models in raw text format; you can sync the latest code from the depot and use it. In your case, the configuration file looks like:

#The file name for template feature set
TFEATURE_FILENAME: tfeature
#The context range for the template feature set. Here, the context is the current token only
TFEATURE_CONTEXT: 0

WORDEMBEDDING_RAW_FILENAME: rawModel.txt
#The context range for word embedding.
WORDEMBEDDING_CONTEXT: -1, 0, 1
#The column index applied word embedding feature
WORDEMBEDDING_COLUMN: 0
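
To illustrate what a context of -1, 0, 1 means for dense features, the Python sketch below concatenates the vectors of the previous, current, and next frames within one entity, so 5 values per frame become 15 per time step. This is a conceptual illustration only. It is not RNNSharp's code, the name `window_features` is hypothetical, and padding with zeros at entity boundaries is an assumption.

```python
from typing import List, Sequence


def window_features(frames: List[List[float]],
                    offsets: Sequence[int] = (-1, 0, 1)) -> List[List[float]]:
    """Concatenate each frame with its neighbours at the given offsets."""
    dim = len(frames[0])
    rows: List[List[float]] = []
    for i in range(len(frames)):
        row: List[float] = []
        for off in offsets:
            j = i + off
            if 0 <= j < len(frames):
                row.extend(frames[j])
            else:
                # Assumption: positions outside the entity are zero-padded.
                row.extend([0.0] * dim)
        rows.append(row)
    return rows
```

The context never crosses an entity boundary in this sketch, because it is applied to one entity's frames at a time.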

I hope this information helps. As for the exception you mentioned, could you please share more details about it?
