This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Clojure] Clojure BERT QA example #14691

Merged: 6 commits, Apr 14, 2019

Member

@gigasquid gigasquid commented Apr 13, 2019

Description

Thanks to @lanking520 and the JVM team - we were able to convert the Java BERT QA example to the Clojure package 💯

It differs slightly from the Java version: the sample questions and answers, along with their ground truths, are stored in an external EDN file. That makes it possible to process multiple examples and easier to edit them or add more.
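As a rough illustration of the external EDN approach, here is a minimal sketch of reading question/answer maps with `clojure.edn`. The keys mirror the example output in this description; the inline string stands in for the file contents, and the names `sample-edn` and `qa-examples` are assumptions made for this sketch, not names taken from the PR.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical stand-in for the contents of the external EDN file.
;; The keys match the ones shown in the example output.
(def sample-edn
  "[{:input-answer \"Susan had a cat named Sammy when she lived in the green house.\"
     :input-question \"What was Susan's cat named?\"
     :ground-truth-answers [\"Sammy\" \"sammy\"]}]")

;; edn/read-string only reads data; it never evaluates code.
(def qa-examples (edn/read-string sample-edn))

;; Walk every example, so adding a map to the EDN data adds a test case.
(doseq [{:keys [input-question ground-truth-answers]} qa-examples]
  (println input-question "->" ground-truth-answers))
```

Reading from disk would presumably look like `(edn/read-string (slurp path))`, where `path` points at wherever the example keeps its EDN file.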

Example output:

===============================
      Question Answer Data
{:input-answer
 "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm.",
 :input-question
 "By what main attribute are computational problems classified utilizing computational complexity theory?",
 :ground-truth-answers
 ["Computational complexity theory"
  "Computational  complexity theory"
  "complexity theory"]}

  Predicted Answer:  [inherent difficulty]
===============================
===============================
      Question Answer Data
{:input-answer
 "Steam engines are external combustion engines, where the working fluid is separate from the combustion products. Non-combustion heat sources such as solar power, nuclear power or geothermal energy may be used. The ideal thermodynamic cycle used to analyze this process is called the Rankine cycle. In the cycle, water is heated and transforms into steam within a boiler operating at a high pressure. When expanded through pistons or turbines, mechanical work is done. The reduced-pressure steam is then condensed and pumped back into the boiler.",
 :input-question
 "Along with geothermal and nuclear, what is a notable non-combustion heat source?",
 :ground-truth-answers
 ["solar"
  "solar power"
  "solar power, nuclear power or geothermal energysolar"]}

  Predicted Answer:  [solar power]
===============================
===============================
      Question Answer Data
{:input-answer
 "In the 1960s, a series of discoveries, the most important of which was seafloor spreading, showed that the Earth's lithosphere, which includes the crust and rigid uppermost portion of the upper mantle, is separated into a number of tectonic plates that move across the plastically deforming, solid, upper mantle, which is called the asthenosphere. There is an intimate coupling between the movement of the plates on the surface and the convection of the mantle: oceanic plate motions and mantle convection currents always move in the same direction, because the oceanic lithosphere is the rigid upper thermal boundary layer of the convecting mantle. This coupling between rigid plates moving on the surface of the Earth and the convecting mantle is called plate tectonics.",
 :input-question
 "What was the most important discovery that led to the understanding that Earth's lithosphere is separated into tectonic plates?",
 :ground-truth-answers ["seafloor spreading"]}

  Predicted Answer:  [seafloor spreading]
===============================
===============================
      Question Answer Data
{:input-answer
 "Susan had a cat named Sammy when she lived in the green house.",
 :input-question "What was Susan's cat named?",
 :ground-truth-answers ["Sammy" "sammy"]}

  Predicted Answer:  [sammy]
===============================
===============================
      Question Answer Data
{:input-answer
 "Rich Hickey is the creator of the Clojure language. Before Clojure, he developed dotLisp, a similar project based on the .NET platform, and three earlier attempts to provide interoperability between Lisp and Java: a Java foreign language interface for Common Lisp, A Foreign Object Interface for Lisp, and a Lisp-friendly interface to Java Servlets.",
 :input-question "Who created Clojure?",
 :ground-truth-answers ["rich" "hickey"]}

  Predicted Answer:  [rich hickey]
===============================

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
      • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

New example code for BERT QA, based on the Java example, along with a new integration test

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@gigasquid gigasquid added the Clojure and pr-work-in-progress labels Apr 13, 2019
rename core to infer
add integration test
@gigasquid gigasquid removed the pr-work-in-progress label Apr 13, 2019
@gigasquid gigasquid changed the title from "[Clojure] [WIP] Clojure BERT QA example" to "[Clojure] Clojure BERT QA example" Apr 13, 2019
By default, this example uses the `bert_12_768_12` model with extra layers for QA jobs.

After that, to be able to use it in Java, we need to export the dictionary from the script to parse the text
to actual indexes. Please add the following lines after [this line](https://github.com/dmlc/gluon-nlp/blob/master/scripts/bert/staticbert/static_finetune_squad.py#L262).
Contributor

should we add an option to this file itself (i.e. create a PR) to export the vocabulary?

Member Author

Interesting idea. @lanking520 did the original exploration and documentation on this. What do you think?


For this tutorial, you can get the model and vocabulary by running the following bash script. The script uses `wget` to download these artifacts from AWS S3.

From the `scala-package/examples/scripts/infer/bert/` folder run:
Contributor

shouldn't this be within the clojure bert-qa example folder?

:token2idx (get vocab "token_to_idx")}))

(defn tokens->idxs [token2idx tokens]
  (mapv #(get token2idx % (get token2idx "[UNK]")) tokens))
Contributor

(let [unk-idx (get token2idx "[UNK]")] ...)?

Member Author

much better
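
For reference, a minimal sketch of what the suggested `let` binding might look like. It looks up the `"[UNK]"` index once instead of once per token; the exact shape of the final code in the PR is an assumption here.

```clojure
(defn tokens->idxs
  "Maps each token to its vocabulary index, falling back to the
  index of \"[UNK]\" for tokens missing from token2idx."
  [token2idx tokens]
  ;; Bind the fallback index once rather than inside the mapping fn.
  (let [unk-idx (get token2idx "[UNK]")]
    (mapv #(get token2idx % unk-idx) tokens)))

;; Toy vocabulary, made up for illustration only.
(tokens->idxs {"[UNK]" 0 "cat" 1} ["cat" "dog"])
;; => [1 0]
```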


(defn get-vocab []
  (let [vocab (json/parse-stream (clojure.java.io/reader "model/vocab.json"))]
    {:idx2token (get vocab "idx_to_token")
Contributor

nit:

{:idx->token ... 
 :token->idx ...}

?
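
A small sketch of the renamed keys, assuming the vocabulary JSON has already been parsed into a map with string keys. The helper name `vocab->lookup` is hypothetical and exists only to keep this snippet free of the JSON dependency.

```clojure
(defn vocab->lookup
  "Hypothetical helper: takes an already-parsed vocab map and returns
  the two lookup tables under arrow-style keyword names."
  [vocab]
  {:idx->token (get vocab "idx_to_token")
   :token->idx (get vocab "token_to_idx")})

;; Toy vocabulary, made up for illustration only.
(vocab->lookup {"idx_to_token" ["[UNK]" "cat"]
                "token_to_idx" {"[UNK]" 0 "cat" 1}})
;; => {:idx->token ["[UNK]" "cat"], :token->idx {"[UNK]" 0, "cat" 1}}
```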

(break-out-punctuation s target-char)
[s]))

(defn tokenizer [s]
Contributor

tokenize?

Contributor

@kedarbellare kedarbellare left a comment

lgtm! this is great!! only minor comments/suggestions

@gigasquid
Member Author

Thanks so much for the feedback @kedarbellare - I'll work on implementing it 😸

(map #(string/replace % "<punc>" str-match))))

(defn break-out-punctuations [s]
  (if-let [target-char (first (re-seq #"[.,?!]" s))]
Member

I did see some tokens like ... in your example; maybe cover those as well.

Member Author

Good point. I was changing the data examples around and I don't have one with that in there now, but I'll keep my eye out for them in the future.
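
One possible way to cover the ellipsis case, sketched under the assumption that `...` should count as a single punctuation token. The pattern and the helper name `first-punctuation` are illustrative only and are not part of the merged code.

```clojure
;; Assumption: an ellipsis is one token. The alternation lists the
;; three-dot form first so it wins over a single "." at the same position.
(def punctuation-pattern #"\.{3}|[.,?!]")

(defn first-punctuation
  "Hypothetical helper: returns the first punctuation token in s, or nil."
  [s]
  (re-find punctuation-pattern s))

(first-punctuation "wait... what?")
;; => "..."

(re-seq punctuation-pattern "wait... what?")
;; => ("..." "?")
```

Strings without an ellipsis behave as before: `(first-punctuation "Sammy.")` returns `"."`, and a string with no punctuation returns `nil`.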

@gigasquid gigasquid merged commit c2ba51b into apache:master Apr 14, 2019
@gigasquid gigasquid deleted the clojure-bert-qa-example branch April 14, 2019 19:37
kedarbellare pushed a commit to kedarbellare/incubator-mxnet that referenced this pull request Apr 20, 2019
* Initial working example for bert qa

* add RAT
rename core to infer
add integration test

* add rat for project.clj

* Couldn’t resist adding a qa about Clojure

* rat for readme

* feedback from @kedarbellare
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

3 participants