
Code capability enhancement & Bot crash fix #272

Open
wants to merge 18 commits into main from Tasks-more-relevant-docs-and-code-exception-fixes

Conversation

Ninot1Quyi
Contributor

@Ninot1Quyi Ninot1Quyi commented Nov 2, 2024

Last Modified Time: November 10, 2024, 5:53 PM

Latest changes are as follows:


  1. Improvement Effects

    • Model: GPT-4o

    • Initial Command: !goal("Your goal is: use only "!newAction" instructions and rely only on code execution to obtain a diamond pickaxe. You must complete this task step by step and by yourself. And can't use another "!command". You should promptly check to see what you have.")

    • Effect: In testing, relying solely on generated code, the bot ran stably for at least 30 minutes without crashing (I manually ended the process at 30 minutes), executing over 130 validated code snippets.

    • Remaining Issues:

      1. If illegal commands are executed, such as "attacking a non-existent entity," the server may kick the bot out.
      2. A very small number of tasks may lead to no execution result being obtained, causing code crashes. It is suspected that there may be unconsidered exceptional situations when receiving task results.
    • WARNING: If you use the command above, or set a goal that takes a long time to complete, please watch the execution status and token consumption, as the LLM may continuously generate code in certain situations. For example, when "an iron pickaxe is available and diamonds need to be mined," it might stand still, using its code abilities to search for nearby diamond locations. Since diamonds are rare, it may repeatedly fail to find them, keep revising the code, and get stuck, leading to substantial token consumption. Please test with caution: it cost me $60 to test with gpt-4o for 60 minutes. gpt-4o-mini is much cheaper and can be used to test this command.

  2. Added Features:
    2.1 During code generation, the top select_num skill docs most relevant to !newAction("task") are selected and sent to the LLM in the prompt to help it focus on the task. Currently, select_num is set to 5.
    2.2 Before running the code, use ESLint to perform syntax and exception checks on the generated code to detect issues in advance, check for undefined functions, and add exceptions to messages.
    2.3 During code execution, detailed error information will be included in messages.

  3. Added Files:
    3.1 file path: ./bots/codeCheckTemplate.js
    A template used to perform checks before code execution, since ESLint cannot run inside the sandbox.

    3.2 file path: ./eslint.config.js
    Manages the ESLint rules for code syntax and exception detection.
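For illustration, a minimal flat-config eslint.config.js of this kind might look like the following; the specific rules shown are assumptions, not necessarily the PR's exact configuration:

```javascript
// eslint.config.js (illustrative flat config; the PR's actual rules may differ)
export default [
    {
        languageOptions: {
            ecmaVersion: 'latest',
            sourceType: 'module',
        },
        rules: {
            'no-undef': 'error',       // flag references to undeclared names
            'no-unreachable': 'error', // flag code after return/throw/break
            'no-unused-vars': 'warn',
        },
    },
];
```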

  4. Modified Code Content:

    4.1 package.json

    - Added: ESLint dependency.

    4.2 settings.js

    - Set: code_timeout_mins=3, ensuring timely code-execution updates and preventing long blocking runs.
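As a sketch, the corresponding line in settings.js would look something like this (surrounding keys elided; the exact file layout is an assumption):

```javascript
// settings.js (sketch; other settings elided)
export default {
    // ...
    code_timeout_mins: 3, // abort generated code still running after 3 minutes
    // ...
};
```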

    4.3 coder.js

    - Added: checkCode function to pre-check for syntax and exceptions. First, it checks whether the functions used in the code exist. If they don't, it writes the illegal functions to the message, then proceeds with syntax and exception checks.

    - Modified: the return value of the stageCode function from return { main: mainFn }; to return { func: { main: mainFn }, src_check_copy: src_check_copy }; to enable pre-execution exception detection.
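The undefined-function stage of that pre-check can be sketched in plain JavaScript. This is a minimal illustration, not the PR's actual checkCode implementation; the name findUndefinedSkillCalls and the skills./world. call convention are assumptions:

```javascript
// Sketch of the first stage of the pre-check: scan generated source for
// calls to skill-library functions that don't exist, so their names can
// be reported back to the LLM in messages before the code ever runs.
function findUndefinedSkillCalls(src, skillNames) {
    const known = new Set(skillNames);
    const unknown = new Set();
    // Assumes generated code calls skills as skills.<name>(...) or world.<name>(...)
    const callPattern = /\b(?:skills|world)\.(\w+)\s*\(/g;
    let match;
    while ((match = callPattern.exec(src)) !== null) {
        if (!known.has(match[1])) unknown.add(match[1]);
    }
    return [...unknown];
}

const generated = `await skills.collectBlock(bot, 'oak_log', 5);
await skills.craftDiamondPickaxe(bot);`;
console.log(findUndefinedSkillCalls(generated, ['collectBlock', 'craftRecipe']));
// logs [ 'craftDiamondPickaxe' ]: the invented function that would be
// written into the messages for the LLM to fix
```

After this stage passes, the source can then be handed to ESLint for the syntax and exception checks.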

    4.4 action_manager.js

    - Enhanced: catch (err) error detection to include detailed exception content and related code Docs in messages, improving the LLM's ability to fix code.

    4.5 index.js

    - Modified: docHelper and getSkillDocs return values to return the docArray of functions from the skill library for subsequent word embedding vector calculations.

    4.6 prompter.js

    - Added: this.skill_docs_embeddings = {}; to store the docArray word embedding vectors.

    - Added: Parallel initialization of this.skill_docs_embeddings in initExamples.

    - Added: getRelevantSkillDocs function to obtain the select_num most relevant doc texts based on the input messages. If select_num >= 0, that many docs are returned; otherwise, all docs are returned sorted by relevance.
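The ranking logic presumably reduces to cosine similarity over the stored embedding vectors. A self-contained sketch under that assumption (function and variable names here are illustrative, not the PR's exact code):

```javascript
// Cosine similarity between two embedding vectors (plain number arrays).
function cosine(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank skill docs by similarity to the query embedding and keep the top
// selectNum; selectNum < 0 returns every doc, still sorted by relevance.
function rankSkillDocs(queryVec, docEmbeddings, selectNum) {
    const ranked = Object.entries(docEmbeddings)
        .map(([doc, vec]) => [doc, cosine(queryVec, vec)])
        .sort((a, b) => b[1] - a[1]);
    const n = selectNum >= 0 ? selectNum : ranked.length;
    return ranked.slice(0, n).map(([doc]) => doc);
}

// Toy 2-d embeddings for demonstration only; real vectors come from the model.
const docs = {
    'collectBlock: collect nearby blocks': [1, 0],
    'craftRecipe: craft an item':          [0.6, 0.8],
    'attackEntity: attack a mob':          [0, 1],
};
console.log(rankSkillDocs([1, 0], docs, 2));
// logs the collectBlock doc then the craftRecipe doc
```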

Note: This modification ensures code quality by making minimal changes only where necessary, while also clearing test outputs and comments. If further modifications are needed, please feel free to let me know.

@Ninot1Quyi Ninot1Quyi closed this Nov 2, 2024
@Ninot1Quyi Ninot1Quyi force-pushed the Tasks-more-relevant-docs-and-code-exception-fixes branch from ecaf5e8 to 02232e2 on November 2, 2024 at 18:03
…e-exception-fixes' into Tasks-more-relevant-docs-and-code-exception-fixes

# Conflicts:
#	src/agent/coder.js
#	src/agent/prompter.js
@Ninot1Quyi Ninot1Quyi reopened this Nov 2, 2024
@Ninot1Quyi
Contributor Author

Resolve merge conflicts with the latest code

New additions

  1. Added the codeChackTemplate.js file under the bots directory for static syntax and exception detection.
  2. Modified the return value of stageCode and the part of generateCodeLoop that runs the code to resolve merge conflict issues.
  3. Added the ESLint configuration file eslint.config.js in the project root directory to manage code syntax and exception detection rules.

@JurassikLizard
Contributor

Can you try re-running this with a stupider model (not state-of-the-art lol)? I'm curious to see if they benefit too, or just advanced ones.

@Ninot1Quyi
Contributor Author

Ninot1Quyi commented Nov 3, 2024

Comparison Experiment on Low-Performance Models

1. Objective

The objective is set using the following command:
!goal("Your goal is: use only "!newAction" instructions and rely only on code execution to obtain a diamondpickaxe. You must complete this task step by step and by yourself. And can't use another "!command". You should promptly check to see what you have")

2. Model Selection

First, I tested the lowest-performance model, gpt-3.5-turbo, but it could not limit itself to using only !newAction and was unable to complete the task. Subsequently, I tested gpt-4o-mini.
All subsequent tests were conducted using gpt-4o-mini.

3. Experimental Process

  • Created a world and made two copies to ensure the environment was the same.
  • Used both modified and unmodified code to enter the same position, input the goal command, and let the bot execute the task.

4. Experimental Results

4.1 Original

Total run time: 16 minutes 41 seconds.

  • 0 min: Start
  • 3 min: First crash.
  • 4 min: Second crash.
  • 13 min: Acquired wooden pickaxe. [The bot was continually collecting wood and only completed the wooden pickaxe after multiple reminders to check existing items.]
  • 16 min 24 s: Acquired a stone pickaxe.

4.2 Modified

I didn’t give any reminders to the bot while it was running.
Total run time: 16 minutes 22 seconds.

  • 0 min: Start
  • 4 min: Acquired a wooden pickaxe.
  • 5 min 12 s: First crash.
  • 8 min: Acquired a stone pickaxe.
  • 11 min: Acquired iron ingots.
  • 15 min: The content was obtained.
  • 16 min 22s: Second crash.

4.3 Complete Comparison Video

Total duration: 16 minutes 41 seconds.
Watch the full comparison video here

@Ninot1Quyi Ninot1Quyi closed this Nov 4, 2024
@Ninot1Quyi Ninot1Quyi force-pushed the Tasks-more-relevant-docs-and-code-exception-fixes branch from e1dfad9 to 0a21561 on November 4, 2024 at 15:05
…e-exception-fixes' into Tasks-more-relevant-docs-and-code-exception-fixes

# Conflicts:
#	src/agent/coder.js
@Ninot1Quyi Ninot1Quyi reopened this Nov 4, 2024
@Ninot1Quyi
Contributor Author

Ninot1Quyi commented Nov 4, 2024

Resolved merge conflict with Action Manager

@Ninot1Quyi Ninot1Quyi closed this Nov 8, 2024
@Ninot1Quyi Ninot1Quyi force-pushed the Tasks-more-relevant-docs-and-code-exception-fixes branch from f6e309a to a6edd8f on November 8, 2024 at 10:20
…e-exception-fixes' into Tasks-more-relevant-docs-and-code-exception-fixes

# Conflicts:
#	src/agent/prompter.js
@Ninot1Quyi Ninot1Quyi reopened this Nov 8, 2024
@Ninot1Quyi
Contributor Author

There is a part that needs improvement

@Ninot1Quyi
Contributor Author

Improve the relevance of docs to !newAction("task")
Fix Qwen API concurrency limit issue

@MaxRobinsonTheGreat
Collaborator

Not sure about this. I like a few things, but not others.

Code linting looks very useful, I would add that by itself. Though codeCheckTemplate.js looks like it overwrites the recent changes to template.js

I don't like selecting only the most relevant skill docs. Wouldn't this strictly reduce performance? Yes it saves on context space, but means the LLM has no knowledge of most of the available functions. I am also skeptical that comparing the latest message to skill doc would reliably select the most relevant docs. Why not do this for commands too? Additionally, the logic for selecting relevant docs should be in its own separate file, like how we do for Examples.

Your comparison makes it look like performance is about the same. So what is the benefit?

@Ninot1Quyi
Contributor Author

Ninot1Quyi commented Nov 18, 2024

Explanation and Feedback

1. Purpose of codeCheckTemplate.js

codeCheckTemplate.js does not overwrite the recent changes to template.js. It is designed solely for static syntax and error checking; I haven't found a way to perform syntax checks within the sandbox environment. The final code still uses the modified template.js to run in the sandbox.

2. On Relevant Skill Document Selection

I thought we agreed that relevance-based ranking can improve the agent's performance. The getRelevantSkillDocs function I designed accepts either no parameter or -1, in which case it returns all documents sorted by relevance. I think this approach is at least effective.

In the current implementation, I compare the task extracted from !newAction("task") with the annotations of skill functions (["function name" + "function description"]) to calculate similarity. Regarding the strategy of selecting relevant functions based only on the latest task and determining how many relevant functions to include, I think we should conduct detailed comparative experiments to explore this and determine the best performance.

3. Proposed Experiment Design

I plan to design a comparative experiment to evaluate the agent's performance in acquiring a diamond_pickaxe task using only code generation capabilities.

Initial Experiment Design:

  • Model: gpt-4o-mini
  • Task: !goal("Generate instructions using only !newAction to acquire a diamond_pickaxe")
  • Methodology: Use different codebases to initiate the same task in three scenarios (forest, plains, village) at the same starting location. Each agent operates in an isolated game environment without interference from others.
  • Comparison Codes:
    • Base-line: Uses all messages and the default annotation order of skills as defined in the documentation.
    • Code1: Uses all documents, sorted by relevance to the latest task.
    • Code2: Uses 5% of the documents, sorted by relevance to the latest task.
    • Code3: Uses 10% of the documents, sorted by relevance to the latest task.
    • Code4: Uses 25% of the documents, sorted by relevance to the latest task.
    • Code5: Uses 50% of the documents, sorted by relevance to the latest task.
    • Code6: Uses 75% of the documents, sorted by relevance to the latest task.
    • Code7: Uses 90% of the documents, sorted by relevance to the latest task.
  • Results: Average the time taken across three scenarios to acquire key items for the diamond_pickaxe task. Plot time taken and crash frequency, with a maximum runtime of 30 minutes.

Note

  1. I have an exam on the 24th, so I won’t have time to work on this until next week. I will start the experiment at some point next week. If anyone is interested in testing some of these setups, feel free to follow the design above and post your results here. Suggestions for improving this plan are also welcome.

  2. If you conduct the tests, please copy and upload the save files for the three scenarios. I am also considering the idea of directly sending the historical information and all documents to the LLM and letting it return documents related to the current task. This might lead to unexpected results, but I will first carry out the experiment designed above.

@MaxRobinsonTheGreat
Collaborator

Hi @Ninot1Quyi, I've reconsidered and I now agree with you! This will be a very valuable contribution, let's move forward with it. I realized a main benefit is that it would allow the list of skills to grow almost indefinitely. So we should probably do something similar with command docs too, but let's just do skills for now.

You don't need to test so much, and we can accept somewhat reduced performance as the cost of keeping context-space usage fixed, so long as the feature is easy to turn off.

A few requests:

  • move all skill docs selection functionality from prompter.js and move to a new file like skill_library.js or something like that
  • add a param to settings.js to control the number of skill docs to retrieve. If -1, just give the full unsorted skill docs as is currently done.
  • why have a separate file codeCheckTemplate.js instead of just using template.js? Preferably we only use template.js
  • code_chack_template is misspelled

Take your time, no rush whatsoever. Good luck with your exams!

@Ninot1Quyi
Contributor Author

@MaxRobinsonTheGreat
Hi Max,
I completed my exam last Sunday. Sorry for the delay in replying to your message. Over the next month, I’ll have three more exams, so progress on the revisions might be a bit slow.

I’m glad to hear that you appreciate my work, and I’ll make gradual changes to the code based on your suggestions. I need to find a better way to verify the generated code. Currently, I haven’t figured out how to enable ESLint to check the code in a sandbox environment, which is why I’m using codeCheckTemplate.js to import packages and test them in a real environment. This issue has been bothering me for a while, and I’ll do my best to resolve it since I also dislike the presence of codeCheckTemplate.js.

Thank you again for recognizing my efforts. Wishing you a happy life!
