BERT experiments

I fixed a few things; the original version is here: https://github.com/google-research/bert

First, I needed to apply BERT to Japanese sentences, so I prepared a dataset called JAS. Then I modified run_classifier.py and added a processor class:

# Added to run_classifier.py (pandas must also be imported at the top of the file):
import pandas as pd


class JasProcessor(DataProcessor):
  """Processor for the JAS data set (TSV files with 'text' and 'label' columns)."""

  def read_tsv(self, path):
    # Read a tab-separated file with a header row containing 'text' and 'label'.
    df = pd.read_csv(path, sep="\t")
    return [(str(text), str(label)) for text, label in zip(df['text'], df['label'])]

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self.read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self.read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self.read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    """See base class."""
    # JAS is a six-class classification task.
    return ["0", "1", "2", "3", "4", "5"]

  def _create_examples(self, lines, set_type):
    """Creates InputExamples from (text, label) pairs."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      text_a = tokenization.convert_to_unicode(line[0])
      label = tokenization.convert_to_unicode(line[1])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
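
For the script to pick up the new class, it also has to be registered in the processors dictionary inside main() of run_classifier.py. A minimal sketch of that registration; the existing entries shown here follow the upstream script and may differ slightly in this fork:

# In main() of run_classifier.py, register the new task name.
processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
    "jas": JasProcessor,  # new entry so --task_name=jas selects the JAS data
}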

Then I ran it. The result:

eval_accuracy = 0.363
eval_loss = 1.5512451
global_step = 937
loss = 1.5512451

This is a decent result given that the dataset is not very large. The baseline model (JAS_old/run.py) scores lower (accuracy: 0.333).

vocab.txt for SentencePiece

I modified tokenization.py to support a SentencePiece vocabulary: https://github.com/sugiyamath/bert/blob/master/tokenization.py

Fixed lines: 159, 211-219.

These changes disable text normalization and the character-based tokenization applied to Chinese characters, because I want to keep a larger subword vocabulary for Japanese.
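
The kind of edit described, sketched against the upstream BasicTokenizer. This is an illustration of the idea, not the literal diff, and the line numbers in the fork may not line up with it:

# In tokenization.py, inside class BasicTokenizer (sketch, not the exact patch):
def tokenize(self, text):
  """Tokenizes a piece of text, keeping Japanese text intact for SentencePiece."""
  text = convert_to_unicode(text)
  text = self._clean_text(text)

  # Disabled: per-character splitting of CJK characters, so that SentencePiece
  # subwords covering whole Japanese words survive into WordPiece lookup.
  # text = self._tokenize_chinese_chars(text)

  orig_tokens = whitespace_tokenize(text)
  split_tokens = []
  for token in orig_tokens:
    if self.do_lower_case:
      token = token.lower()
      # Disabled: Unicode accent-stripping normalization, which rewrites
      # characters and can break matches against the SentencePiece vocabulary.
      # token = self._run_strip_accents(token)
    split_tokens.extend(self._run_split_on_punc(token))
  return whitespace_tokenize(" ".join(split_tokens))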

I also created pre_example.sh: https://github.com/sugiyamath/bert/blob/master/pre_example.sh
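
For reference, a minimal sketch of how a BERT-style vocab.txt can be exported from a trained SentencePiece model. The model and output file names here are hypothetical, not files from this repository:

import sentencepiece as spm

# Hypothetical paths; point these at your own SentencePiece model.
sp = spm.SentencePieceProcessor()
sp.Load("ja_sentencepiece.model")

# BERT expects one token per line in vocab.txt, plus its special tokens.
# You may also want to filter out SentencePiece's own <unk>/<s>/</s> pieces.
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
pieces = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]

with open("vocab.txt", "w", encoding="utf-8") as f:
  for token in special_tokens + pieces:
    f.write(token + "\n")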
