calculate epoch #1140

Closed
mackmake opened this issue Jan 30, 2024 · 4 comments

Comments

@mackmake

Hi,
I want to know how many iterations it takes to see all of my data exactly once, i.e. one epoch.
Does the preprocessing/training code calculate and log this somewhere, or should I calculate it myself?
The related part of my stdout file is shown below, in case it helps:

 > dataset split:
    train:
     document indices in [0, 76507324) total of 76507324 documents
    validation:
     document indices in [76507324, 78875972) total of 2368648 documents
    test:
     document indices in [78875972, 78954927) total of 78955 documents
 > loading doc-idx mapping from path/to/mydata/_text_document_train_indexmap_24000000ns_2048sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /path/to/mydata/_text_document_train_indexmap_24000000ns_2048sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /path/to/mydata/_text_document_train_indexmap_24000000ns_2048sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.013 seconds
    total number of samples: 31267487
    total number of epochs: 2

If possible, please explain the formula used to calculate it.
Thanks

@StellaAthena
Member

StellaAthena commented Jan 30, 2024

(number of tokens) / (global batch size * sequence length) is the number of steps needed for a single epoch.
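The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `steps_per_epoch`, the `drop_last` flag, and the numbers in the usage lines are hypothetical and do not come from the training code or from this thread.

```python
def steps_per_epoch(num_tokens, global_batch_size, seq_length, drop_last=True):
    """Number of optimizer steps needed to consume num_tokens once.

    Each step consumes global_batch_size sequences of seq_length tokens.
    drop_last=True counts only full batches; False rounds up to include
    a final partial batch. (Hypothetical helper, not part of any library.)
    """
    tokens_per_step = global_batch_size * seq_length
    if drop_last:
        return num_tokens // tokens_per_step
    # Integer ceiling division, avoids floating-point rounding on large counts.
    return -(-num_tokens // tokens_per_step)


# Illustrative numbers only: 1,000,000 tokens, batch of 8 sequences, 128 tokens each.
print(steps_per_epoch(1_000_000, 8, 128))                   # full batches only
print(steps_per_epoch(1_000_000, 8, 128, drop_last=False))  # including a partial batch
```

Whether the last partial batch counts as a step depends on how the data loader is configured, which is why the sketch exposes both roundings.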

@mackmake
Author

Thanks for your quick response.
But how can I find the number of tokens?
I have a large dataset, and I think counting its tokens would take a lot of time.
Does the preprocessing code count them or give an approximation?

@StellaAthena
Member

StellaAthena commented Jan 31, 2024

It looks like it's 64,035,813,376 (samples * sequence_length = 31267487 * 2048 = 64035813376).
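As a rough sketch, the same arithmetic can be reproduced directly from the stdout excerpt in the original post, without re-tokenizing the dataset. The helper name `tokens_from_log` is hypothetical, and the regular expressions assume the log lines and the `..._2048sl_...` index-map filename pattern look exactly like the ones quoted in this thread. The log also reports `total number of epochs: 2`. Whether the sample count covers both of those passes is not settled in this thread, so the sketch only extracts that value and leaves its interpretation open.

```python
import re

# Excerpt copied from the log quoted in the original post.
LOG = """
 > loading sample-idx mapping from /path/to/mydata/_text_document_train_indexmap_24000000ns_2048sl_1234s_sample_idx.npy
    total number of samples: 31267487
    total number of epochs: 2
"""


def tokens_from_log(log_text):
    """Return (samples, seq_length, epochs, tokens) parsed from the log text.

    Hypothetical helper: assumes the sequence length is encoded in the
    index-map filename as '_<N>sl_' and that the sample and epoch counts
    appear on their own 'total number of ...' lines.
    """
    samples = int(re.search(r"total number of samples:\s*(\d+)", log_text).group(1))
    epochs = int(re.search(r"total number of epochs:\s*(\d+)", log_text).group(1))
    seq_length = int(re.search(r"_(\d+)sl_", log_text).group(1))
    return samples, seq_length, epochs, samples * seq_length


samples, seq_length, epochs, tokens = tokens_from_log(LOG)
print(f"{samples} samples * {seq_length} tokens = {tokens:,} tokens")
print(f"epochs reported in the log: {epochs}")
```

Dividing the resulting token count by (global batch size * sequence length) then gives the step count; the global batch size is not stated anywhere in this thread, so it has to come from your own training config.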

@mackmake
Author

mackmake commented Feb 3, 2024

Oh, that's right.
Thanks very much.

mackmake closed this as completed Feb 3, 2024