Validate TriviaQA #456

Closed · StellaAthena opened this issue May 1, 2023 · 7 comments · Fixed by #525
Labels: good first issue (Good for newcomers), validation (For validation of task implementations)

Comments

@StellaAthena (Member)

No description provided.

StellaAthena added the help wanted (Contributors and extra help welcome), good first issue (Good for newcomers), and validation (For validation of task implementations) labels on May 1, 2023
@seopbo (Contributor) commented May 4, 2023

    def construct_requests(self, doc, ctx):
        ret = []
        for alias in self._remove_prefixes(doc["answer"]["aliases"]):
            _, is_prediction = rf.loglikelihood(ctx, " " + alias)
            ret.append(is_prediction)
        return ret

    def process_results(self, doc, results):
        return {"acc": float(any(results))}

I think the snippets above should be changed to the following. cc @StellaAthena

    def construct_requests(self, doc, ctx):
        ret = []
        for alias in self._remove_prefixes(doc["answer"]["aliases"]):
            # Keep the loglikelihood (first element), not the is_greedy flag.
            is_prediction, _ = rf.loglikelihood(ctx, " " + alias)
            ret.append(is_prediction)
        return ret

    def process_results(self, doc, results):
        # Predict the alias with the highest loglikelihood, then compare
        # against the gold answer string.
        pred = self._remove_prefixes(doc["answer"]["aliases"])[np.argmax(results)]
        gold = doc["answer"]["value"]
        return {"acc": float(pred == gold)}

StellaAthena changed the title from "TriviaQA" to "Validate TriviaQA" on May 6, 2023
@StellaAthena (Member, Author)

@seopbo Great work! Can you write up a bit about how you came to this conclusion, what the paper says, etc? Right now validating your work requires largely redoing it, so it would be good to have the relevant info collected in one place to make verification easier.

StellaAthena removed the help wanted (Contributors and extra help welcome) label on May 6, 2023
@seopbo (Contributor) commented May 6, 2023

> @seopbo Great work! Can you write up a bit about how you came to this conclusion, what the paper says, etc? Right now validating your work requires largely redoing it, so it would be good to have the relevant info collected in one place to make verification easier.

I think the triviaqa task is treated as a multiple-choice task in lm-evaluation-harness. In the previous code, `ret` is a list of the values computed for each choice, and the `results` arg of `process_results` is exactly what `construct_requests` returns. In my case, because of `{"acc": float(any(results))}`, the returned accuracy is always 1.
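
A tiny illustration (hypothetical flags) of the degenerate metric: the old code kept the boolean `is_greedy` element of each loglikelihood result, so the score only checks whether some alias was the model's greedy continuation, and the loglikelihoods never enter into it:

    # Hypothetical is_greedy flags, one per alias.
    results = [False, True, False]
    print({"acc": float(any(results))})  # {'acc': 1.0} as soon as any flag is True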

P.S.

  1. I implemented the loglikelihood calculation in the style of textsynth, using text-generation-inference (https://github.com/huggingface/text-generation-inference).
  2. The results below are from my 13.6B bilingual model (Korean data: my own dataset; English data: the Pile; code data: GitHub):
  • 0 shots: 51.02
  • 1 shot: 55.41
  • 5 shots: 58
  • 64 shots: 59.52

cc @StellaAthena

@seopbo (Contributor) commented May 6, 2023

In the LLaMA paper, the TriviaQA implementation differs from both the previous code and my code: the author of the previous code implemented the triviaqa task in a multiple-choice style, whereas the LLaMA paper's implementation actually generates the answer. cc @StellaAthena

[image: excerpt from the LLaMA paper]

@seopbo (Contributor) commented May 6, 2023

If we want to implement the triviaqa task in the style of the LLaMA paper, https://github.com/EleutherAI/lm-evaluation-harness/blob/polyglot/lm_eval/tasks/korquad.py is a good reference. I think we should use the `greedy_until` function instead of `loglikelihood` and check whether the generated text is included in `doc["answer"]["aliases"]`, as in the snippet below. cc @StellaAthena

    def construct_requests(self, doc, ctx):
        """Uses RequestFactory to construct Requests and returns an iterable of
        Requests which will be sent to the LM.

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param ctx: str
            The context string, generated by fewshot_context. This includes the natural
            language description, as well as the few shot examples, and the question
            part of the document for `doc`.
        """
        # Generate freely until a likely end-of-answer delimiter.
        continuation = rf.greedy_until(ctx, ["\n", ".", ","])
        return continuation

    def process_results(self, doc, results):
        # Normalize the generation and every alias (lowercase, strip
        # punctuation) before the exact-match membership check.
        continuation = (
            results[0].strip().lower().translate(str.maketrans("", "", string.punctuation))
        )
        list_of_candidates = [
            alias.lower().translate(str.maketrans("", "", string.punctuation))
            for alias in self._remove_prefixes(doc["answer"]["aliases"])
        ]
        return {"em": float(continuation in list_of_candidates)}

    def aggregation(self):
        return {
            "em": mean,
        }

    def higher_is_better(self):
        return {"em": True}

@StellaAthena (Member, Author)

@seopbo Apologies for my delayed response, but if you open a PR correcting the implementation I will merge it.

Thank you!

@seopbo (Contributor) commented May 23, 2023

> @seopbo Apologies for my delayed response, but if you open a PR correcting the implementation I will merge it.
>
> Thank you!

Okay, I will open a PR for this soon. @StellaAthena
