unable to handle new lines with proposed fix #235

Closed
progressEdd opened this issue Oct 11, 2023 · 4 comments

Comments

@progressEdd

When a string with multiple line breaks is passed into a chain, kor is unable to parse the raw text into the schema.
Solution: adding delimiters such as ``` around the text improves extraction. You can see this in the example at the end.

pip show kor
Name: kor
Version: 1.0.0

For example:

test_lots_of_new_lines = """ long extra text a the start of the document. value that will get extracted: name: John Smith





                                            probably will be missed main street: 123 main st




"""

will return:

{'data': {'customer_info': [{'MainStreet': '',
    'MainZipCode': '',
    'MainCity': '',
    'MainState': 'John',
    'CustomerFirstName': 'Smith',
    'CustomerLastName': '',
    'CustomerEmail': '',
    'CustomerPhone': '',
    'State': ''}]},
 'raw': 'MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State\n"123 main st"||||John|Smith||||\n',
 'errors': [],
 'validated_data': {}}

Only the name is returned in the dictionary, but the street address does show up in the raw output.

For further debugging, here's the output of chain.prompt.format_prompt(text=test_lots_of_new_lines).to_string(). To work with GitHub formatting, I removed the code block from the prompt.

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

TypeScript

customer_info: Array<{ // Customer information from a narrative.
 MainStreet: string // street address
 MainZipCode: string // zip code, usually 5 digits
 MainCity: string // US city of the business or customer
 MainState: string // US state of the business or customer
 CustomerFirstName: string // first name of the customer
 CustomerLastName: string // last name of the customer
 CustomerEmail: string // email address of the customer
 CustomerPhone: string // business phone of the customer
 State: string // service state
}>



Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: MainStreet
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
"Contact Address 1 & Contact Address 2 
 123 Industrial Rd, 5678 E Pine St"||||||||

Input: MainZipCode
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
|"Contact Zip 
 12345"|||||||

Input: MainCity
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
||"Contact City 
 Reno"||||||

Input: MainState
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
|||"Contact State 
 California"|||||

Input: CustomerFirstName
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
||||"Contact First 
 John"||||

Input: CustomerLastName
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
|||||"Contact Last 
 Smith"|||

Input: CustomerEmail
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
||||||"Contact Email 
 [email protected]"||

Input: CustomerPhone
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
|||||||"Contact Business Phone 
 123-123-1234"|

Input: State
Output: MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State
||||||||"Service State 
 Florida"

Input:  long extra text a the start of the document. value that will get extracted: name: John Smith





                                            probably will be missed main steet: 123 main st





Output:

When I add delimiters, MainStreet is now extracted:
chain.run(text="```"+test_lots_of_new_lines+"```")

{'data': {'customer_info': [{'MainStreet': '123 main st',
    'MainZipCode': '',
    'MainCity': '',
    'MainState': '',
    'CustomerFirstName': 'John',
    'CustomerLastName': 'Smith',
    'CustomerEmail': '',
    'CustomerPhone': '',
    'State': ''}]},
 'raw': 'MainStreet|MainZipCode|MainCity|MainState|CustomerFirstName|CustomerLastName|CustomerEmail|CustomerPhone|State\r\n"123 main st"||||John|Smith|||\r\n',
 'errors': [],
 'validated_data': {}}
@eyurtsev
Owner

Hi @progressEdd, thanks for the issue report. create_extraction_chain allows you to pass in an input formatter.

Here's the API reference: https://eyurtsev.github.io/kor/generated/kor.html#kor.create_extraction_chain

You can do something like this:

# For CSV encoding
chain = create_extraction_chain(llm, node, encoder_or_encoder_class="csv", input_formatter="triple_quotes")

# For JSON encoding
chain = create_extraction_chain(llm, node, encoder_or_encoder_class="json",
                                input_formatter="triple_quotes")

or pass in a callable to apply whatever formatting you want.

Another trick is to collapse a lot of contiguous whitespace -- it'll improve the results and reduce the token count.
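
For example, here's a rough sketch that combines both ideas by passing a callable formatter that collapses blank lines and then wraps the text in triple backticks. The callable signature (raw text in, formatted string out) is an assumption here, so double-check it against the API reference above; llm and node are the same objects as in the snippet above.

import re

from kor import create_extraction_chain


def collapse_and_quote(text: str) -> str:
    # Collapse runs of blank lines, then wrap the input in triple backticks
    # so it is clearly delimited inside the prompt.
    text = re.sub(r"\n{2,}", "\n", text)
    return f"```\n{text}\n```"


chain = create_extraction_chain(
    llm, node, encoder_or_encoder_class="csv", input_formatter=collapse_and_quote
)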

@progressEdd
Author

progressEdd commented Oct 12, 2023

Gotcha, I must have missed it. I was looking at the example code pages and didn't look too in-depth into the module page.

I'll have to look into input_formatter because that's cleaner than the string concatenation. Does the CSV encoder offer advantages (in token usage or LLM performance) over passing the data as plain text and trimming whitespace?

Another trick is to collapse a lot of contiguous whitespace -- it'll improve the results and reduce the token count.
Are there any functions that can do that? I was using this function I adapted:

import re

from icecream import ic  # ic() prints labelled debug output


def count_metrics(text):
    ic(len(text))
    # Remove excessive newlines and runs of consecutive whitespace
    text = re.sub(r"\n{2,}", "\n", text)
    text = re.sub(r"\s{2,}", " ", text)

    # Calculate the number of words, characters, and tokens
    word_count = len(text.split())
    char_count = len(text)
    token_count = len(text.split())  # rough estimate: assumes each word is one token

    ic(word_count, char_count, token_count)
    return text

I got it from
https://github.com/ajitdash/pview/blob/05a27d739d0dbae44e212975f400fc4a58b99937/Token%20Optimization%20-Tool/toklimit4f.py#L46

which was referenced by
https://techcommunity.microsoft.com/t5/healthcare-and-life-sciences/unlocking-the-power-of-tokens-optimizing-token-usage-in-gpt-for/ba-p/3826665
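
Plugging that cleanup into the chain from earlier would look like this:

# Collapse the whitespace first, then run extraction on the cleaned text.
cleaned = count_metrics(test_lots_of_new_lines)
result = chain.run(text=cleaned)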

@eyurtsev
Owner

Does the CSV encoder offer advantages (in token usage or LLM performance) over passing the data as plain text and trimming whitespace?

The encoders are used to encode the desired output from the model rather than the input (e.g., plain text).

It tells the LLM how to structure its output so that the output can be parsed into a structured representation.

JSON is more flexible and supports more complex structured data representations, but it also uses more tokens.
CSV is less flexible, but potentially more reliable, and it uses fewer tokens.

To get a sense of what's going on, you can follow the tutorial, try out both encoders, and print out the prompt that goes to the LLM:
https://eyurtsev.github.io/kor/objects.html#what-s-the-actual-prompt
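
For example, a minimal sketch (reusing the llm and node objects from the earlier snippet) that prints the prompt each encoder produces for the same schema:

from kor import create_extraction_chain

for encoder in ("csv", "json"):
    chain = create_extraction_chain(llm, node, encoder_or_encoder_class=encoder)
    print(f"--- {encoder} prompt ---")
    # Same call used earlier in this issue to inspect the rendered prompt.
    print(chain.prompt.format_prompt(text="[your text]").to_string())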

@progressEdd
Author

Gotcha, that makes sense. I thought the CSV encoder had the capability of truncating/processing the input text. I'll rely on the code I shared earlier.
