Lessons after a Half-billion GPT Tokens (kenkantzer.com)
512 points by lordofmoria 5 months ago | 173 comments



The team I work on processes 5B+ tokens a month (and growing) and I'm the EM overseeing that.

Here are my takeaways:

1. There are way too many premature abstractions. Langchain, as one of many examples, might be useful in the future, but at the end of the day prompts are just an API call, and it's easier to write standard code that treats LLM calls as a flaky API call rather than as a special thing.

2. Hallucinations are definitely a big problem. Summarizing is pretty rock solid in my testing, but reasoning is really hard. Action models, where you take in a user input and try to get the LLM to decide what to do next, are just really hard; specifically, it's hard to get the LLM to understand the context and get it to say when it's not sure.

That said, it's still a gamechanger that I can do it at all.

3. I am a bit more hyped than the author that this is a game changer, but like them, I don't think it's going to be the end of the world. There are some jobs that are going to be heavily impacted and I think we are going to have a rough few years of bots astroturfing platforms. But all in all I think it's more of a force multiplier rather than a breakthrough like the internet.

IMHO it's similar to what happened to DevOps in the 2000s, you just don't need a big special team to help you deploy anymore, you hire a few specialists and mostly buy off the shelf solutions. Similarly, certain ML tasks are now easy to implement even for dumb dumb web devs like me.
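To make point 1 concrete, here is a minimal sketch of "standard code that treats LLM calls as a flaky API call". Everything in it is an assumption for illustration: `TransientLLMError` is a hypothetical stand-in for whatever timeout or rate-limit errors your client raises, and `call` is any zero-argument function that performs the request. It is not how any particular library does it.

```python
import random
import time


class TransientLLMError(Exception):
    """Hypothetical error type standing in for timeouts, rate limits, 5xx responses."""


def call_with_retries(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run `call()` and retry transient failures with exponential backoff.

    `call` is any zero-argument function that performs the LLM request.
    `sleep` is injectable so the wrapper can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientLLMError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff plus a little jitter to avoid retry stampedes.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
```

Usage would look like `call_with_retries(lambda: my_client_request(prompt))`, where `my_client_request` is whatever function you already have. Nothing here is LLM-specific, which is rather the point.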


> IMHO it's similar to what happened to DevOps in the 2000s, you just don't need a big special team to help you deploy anymore, you hire a few specialists and mostly buy off the shelf solutions.

I advocate for these metaphors to help people better understand a reasonable expectation for LLMs in modern development workflows, mostly because they frame it as a trade-off rather than a silver bullet. There were trade-offs to the evolution of DevOps: consider, for example, the loss of key skillsets like database administration as a direct result of "just use AWS RDS", and the explosion in cloud billing costs (especially the OpEx of startups who weren't even dealing with that much data or regional complexity!), and how it indirectly led to GitLab's big outage and many like it.


> get it to say when it's not sure

This is a function of the language model itself. By the time you get to the output, the uncertainty that is inherent in the computation is lost to the prediction. It is as if you ask me to guess heads or tails, and I guess heads: I could have stated my uncertainty (e.g. Pr[H] = .5) beforehand, but in my actual prediction of heads, and then the coin flip, that uncertainty is lost. It's the same with LLMs. The uncertainty in the computation is lost in the final prediction of the tokens, so unless the prediction itself is uncertainty (which it should rarely be based on the training corpus, I think), then you should not find an LLM output really ever to say it does not understand. But that is because it never understands, it just predicts.


It's not just loss of the uncertainty in prediction, it's also that an LLM has zero insight into its own mental processes as a separate entity from its training data and the text it's ingested. If you ask it how sure it is, the response isn't based on its perception of its own confidence in the answer it just gave, it's based on how likely it is for an answer like that to be followed by a confident affirmation in its training data.


Apparently it is possible to measure how uncertain the model is using logprobs, there's a recipe for it in the OpenAI cookbook: https://cookbook.openai.com/examples/using_logprobs#5-calcul...

I haven't tried it myself yet, not sure how well it works in practice.
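For anyone curious what "measuring uncertainty with logprobs" boils down to, here is a minimal sketch. It is not the cookbook recipe, which I have not reproduced here. It assumes you already have a list of per-token natural-log probabilities for one completion (the function name and the returned keys are made up for illustration) and turns them into a few summary numbers.

```python
import math


def sequence_confidence(token_logprobs):
    """Summarize per-token log probabilities (natural log) for one completion."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    total = sum(token_logprobs)
    n = len(token_logprobs)
    return {
        # Probability the model assigned to the whole token sequence.
        "joint_probability": math.exp(total),
        # Geometric mean of the per-token probabilities.
        "mean_token_probability": math.exp(total / n),
        # Perplexity: higher means the model was less sure token by token.
        "perplexity": math.exp(-total / n),
    }
```

Note that these numbers describe how likely the tokens were, which is not the same thing as how likely the answer is to be correct.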


There’s a difference between certainty of the next token given the context and the model evaluation so far and certainty about an abstract reasoning process being correct given it’s not reasoning at all. These probabilities and stuff coming out are more about token prediction than “knowing” or “certainty” and are often confusing to people in assuming they’re more powerful than they are.


> given it’s not reasoning at all

When you train a model on data made by humans, then it learns to imitate but is ungrounded. After you train the model with interactivity, it can learn from the consequences of its outputs. This grounding by feedback constitutes a new learning signal that does not simply copy humans, and is a necessary ingredient for pattern matching to become reasoning. Everything we know as humans comes from the environment. It is the ultimate teacher and validator. This is the missing ingredient for AI to be able to reason.


Yeah but this doesn't change how the model functions, this is just turning reasoning into training data by example. It's not learning how to reason - it's just learning how to pretend to reason, about a gradually wider and wider variety of topics.

If any LLM appears to be reasoning, that is evidence not of the intelligence of the model, but rather the lack of creativity of the question.


Humans are only capable of principled reasoning in domains where they have expertise. We don't actually do full causal reasoning in domains we don't have formal training in. We use all sorts of shortcuts that are similar to what LLMs are doing.

If you consider AlphaTensor or other products in the Alpha family, it shows that feedback can train a model to super-human levels.


What's the difference between reasoning and pretending to reason really well?


It’s the process by which you solve a problem. Reasoning requires creating abstract concepts and applying logic against them to arrive at a conclusion.

It’s like asking what the difference is between deductive logic and Monte Carlo simulations. Both arrive at answers that can be very similar but the process is not similar at all.

If there is any form of reasoning on display here it’s an abductive style of reasoning which operates in a probabilistic semantic space rather than a logical abstract space.

This is important to bear in mind and explains why hallucinations are very difficult to prevent. There is nothing to put guard rails around in the process because it’s literally computing probabilities of tokens appearing given the tokens seen so far and the space of all tokens trained against. It has nothing to draw upon other than this - and that’s the difference between LLMs and systems with richer abstract concepts and operations.


A naive way of solving this problem is to run it, e.g., 3 times and see if it arrives at the same conclusion each time. More generally, run it N times and take the answer with the highest ratio. You trade compute for a wider view of how uncertain the model is.
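A minimal sketch of the run-it-N-times idea, assuming `sample` is a hypothetical zero-argument function that calls the model once (with some sampling temperature) and returns its answer as a string:

```python
from collections import Counter


def majority_vote(sample, n=5):
    """Call `sample()` n times; return (most common answer, fraction agreeing)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Normalize lightly so "Paris" and "paris " count as the same answer.
    answers = [sample().strip().lower() for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n
```

The agreement ratio is only a rough proxy: a model can be consistently wrong, and free-form answers need more normalization than `strip().lower()` before they can be compared.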


You can ask the model something like: is xyz correct, answer with one word, either Yes or No. The log probs of the two tokens should represent how certain it is. However, apparently RLHF tuned models are worse at this than base models.
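As a sketch of the arithmetic, assuming you have already pulled the natural-log probabilities of the "Yes" and "No" tokens out of the API response (how you get them depends on the provider; the function name is made up):

```python
import math


def yes_probability(logprob_yes, logprob_no):
    """Renormalize over the two options: P(Yes) / (P(Yes) + P(No))."""
    diff = logprob_no - logprob_yes
    if diff > 700:  # math.exp would overflow; "Yes" is effectively impossible
        return 0.0
    # Algebraically the same as exp(yes) / (exp(yes) + exp(no)).
    return 1.0 / (1.0 + math.exp(diff))


# Example: P(Yes) = 0.6 and P(No) = 0.2, with the rest on other tokens.
print(yes_probability(math.log(0.6), math.log(0.2)))  # about 0.75
```

Renormalizing ignores whatever probability mass went to tokens other than the two options, so it is worth checking that mass is small before trusting the number.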


Seems like functions could work well to give it an active and distinct choice, but I'm still unsure if the function/parameters are going to be the logical, correct answer...


But the LLM predicts the output based on some notion of a likelihood so it could in principle signal if the likelihood of the returned token sequence is low, couldn’t it?

Or do you mean that fine-tuning distorts these likelihoods so models can no longer accurately signal uncertainty?


I get the reasoning but I’m not sure you’ve successfully contradicted the point.

Most prompts are written in the form “you are a helpful assistant, you will do X, you will not do Y”

I believe that inclusion of instructions like “if there are possible answers that differ and contradict, state that and estimate the probability of each” would help knowledgeable users.

But for typical users and PR purposes, it would be a disaster. It is better to tell 999 people that the US constitution was signed in 1787 and 1 person that it was signed in 349 B.C. than it is to tell 1000 people that it was probably signed in 1787 but it might have been 349 B.C.


Why does the prompt intro take the form of a role/identity directive "You are a helpful assistant..."?

What about the training sets or the model internals responds to this directive?

What are the degrees of freedom of such directives?

If such a directive is helpful, why wouldn't more demanding directives be even more helpful: "You are a domain X expert who provides proven solutions for problem type Y..."

If you don't think the latter prompt is more helpful, why not?

What aspect of the former prompt is within bounds of helpful directives that the latter is not?

Are training sets structured in the form of roles? Surely, the model doesn't identify with a role?!

Why is the role directive typically used with NLP but not image generation?

Do typical prompts for Stable Diffusion start with an identity directive "You are assistant to Andy Warhol in his industrial phase..."?

Why can't improved prompt directives be generated by the model itself? Has no one bothered to ask it for help?

"You are the world's most talented prompt bro, write a prompt for sentience..."

If the first directive observed in this post is useful and this last directive is absurd, what distinguishes them?

Surely there's no shortage of expert prompt training data.

BTW, how much training data is enough to permit effective responses in a domain?

Can a properly trained model answer this question? Can it become better if you direct it to be better?

Why can't the models rectify their own hallucinations?

To be more derogatory: what distinguishes a hallucination from any other model output within the operational domain of the model?

Why are hallucinations regarded as anything other than a pure effect, and as pure effect, what is the cusp of hallucination? That a human finds the output nonsensical?

If outputs are not equally valid in the LLM why can't it sort for validity?

OTOH if all outputs are equally valid in the LLM, then outputs must be regarded by a human for validity, so what distinguishes an LLM from the world's greatest human time-wasting device? (After Las Vegas)

Why will a statistical confidence level help avoid having a human review every output?

The questions go on and on...

— Parole Board chairman: They've got a name for people like you H.I. That name is called "recidivism."

Parole Board member: Repeat offender!

Parole Board chairman: Not a pretty name, is it H.I.?

H.I.: No, sir. That's one bonehead name, but that ain't me any more.

Parole Board chairman: You're not just telling us what we want to hear?

H.I.: No, sir, no way.

Parole Board member: 'Cause we just want to hear the truth.

H.I.: Well, then I guess I am telling you what you want to hear.

Parole Board chairman: Boy, didn't we just tell you not to do that?

H.I.: Yes, sir.

Parole Board chairman: Okay, then.


> so unless the prediction itself is uncertainty (which it should rarely be based on the training corpus, I think)

Why shouldn't you ask for uncertainty?

I love asking for scores / probabilities (usually give a range, like 0.0 to 1.0) whenever I ask for a list, and it makes the output much more usable


I'm not sure if that is a metric you can rely on. LLMs are very sensitive to the position of your list items within the context, paying extra attention to the beginning and the end of those lists.

See the listwise approach at "Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting", https://arxiv.org/abs/2306.17563


OP here - I had never thought of the analogy to DevOps before, that made something click for me, and I wrote a post just now riffing off this notion: https://kenkantzer.com/gpt-is-the-heroku-of-ai

Basically, I think we’re using GPT as the PaaS/heroku/render equivalent of AI ops.

Thank you for the insight!!


You only processed 500m tokens, which is shockingly little. Perhaps only 2k in incurred costs?


> But all in all I think it's more of a force multiplier rather than a breakthrough like the internet.

Thank you. Seeing similar things. Clients are also seeing sticker shock on how much the big models cost vs. the output. That will all come down over time.


> That will all come down over time.

So will interest, as more and more people realise there's nothing "intelligent" about the technology, it's merely a Markov-chain-word-salad generator with some weights to improve the accuracy somewhat.

I'm sure some people (other than AI investors) are getting some value out of it, but I've found it to be most unsuited to most of the tasks I've applied it to.


The industry is troubled both by hype marketers who believe LLMs are superhuman intelligence that will replace all jobs, and cynics who believe they are useless word predictors.

Some workloads are well-suited to LLMs. Roughly 60% of applications are for knowledge management and summarization tasks, which is a big problem for large organizations. I have experience deploying these for customers in a niche vertical, and they work quite well. I do not believe they're yet effective for 'agentic' behavior or anything using advanced reasoning. I don't know if they will be in the near future. But as a smart, fast librarian, they're great.

A related area is tier one customer service. We are beginning to see evidence that well-designed applications (emphasis on well-designed -- the LLM is just a component) can significantly bring down customer service costs. Most customer service requests do not require complex reasoning. They just need to find answers to a set of questions that are repeatedly asked, because the majority of service calls are from people who do not read docs. People who read documentation make fewer calls. In most cases around 60-70% of customer service requests are well-suited to automating with a well-designed LLM-enabled agent. The rest should be handled by humans.

If the task does not require advanced reasoning and mostly involves processing existing information, LLMs can be a good fit. This actually represents a lot of work.

But many tech people are skeptical, because they don't actually get much exposure to this type of work. They read the docs before calling service, are good at searching for things, and excel at using computers as tools. And so, to them, it's mystifying why LLMs could still be so valuable.


> Summarizing is pretty rock solid in my testing, but reasoning is really hard.

Asking for analogies has been interesting and surprisingly useful.


Could you elaborate, please?


Instead of `if X == Y do ...` it's more like `enumerate features of X in such a manner...` and then `explain feature #2 of X in terms that Y would understand` and then maybe `enumerate the manners in which Y might apply X#2 to TASK` and then have it do the smartest number.

The most lucid explanation for SQL joins I've seen was in a (regrettably unsaved) exchange where I asked it to compare them to different parts of a construction project and then focused in on the landscaping example. I felt like Harrison Ford panning around a still image in the first Blade Runner. "Go back a point and focus in on the third paragraph".


Thanks. That sounds interesting.


Regarding null hypothesis and negation problems - I find it personally interesting because a similar phenomenon happens in our brains. Dreams, emotions, affirmations etc. process inner dialogue more or less by ignoring negations and amplifying emotionally rich parts.


> Summarizing is pretty rock solid in my testing

Yet, for some reason, ChatGPT is still pretty bad at generating titles for chats, and I didn't have better luck with the API even after trying to engineer the right prompt for quite a while...

For some odd reason, once in a while I get things in different languages. It's funny when it's in a language I can speak, but I recently got "Relm4 App Yenileştirme Titizliği" which ChatGPT tells me means "Relm4 App Renewal Thoroughness" when I actually was asking it to adapt a snippet of gtk-rs code to relm4, so not particularly helpful


> at the end of the day prompts are just an API call and it's easier to write standard code that treats LLM calls as a flaky API call

They are also dull (higher latency for the same resources) APIs if you're self-hosting an LLM. Special attention is needed to plan capacity.


> Similarly, certain ML tasks are now easy to implement even for dumb dumb web devs like me

For example?


Lots of applied NLP tasks used to require paying annotators to compile a golden dataset and then train an efficient model on the dataset.

Now, if cost is little concern you can use zero shot prompting on an inefficient model. If cost is a concern, you can use GPT4 to create your golden dataset way faster and cheaper than human annotations, and then train your more efficient model.

Some example NLP tasks could be classifiers, sentiment, extracting data from documents. But I’d be curious which areas of NLP __weren’t__ disrupted by LLMs.
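A rough sketch of the second route (an LLM-labelled golden dataset feeding a cheaper model). All names are hypothetical: `label_with_llm` stands in for a function that sends one text to GPT4 and returns a label, and the word-count "model" is a deliberately tiny placeholder for whatever efficient classifier you would actually train.

```python
from collections import Counter, defaultdict


def build_golden_dataset(texts, label_with_llm):
    """Label each text once with the expensive model."""
    return [(text, label_with_llm(text)) for text in texts]


def train_keyword_classifier(dataset):
    """Toy 'efficient model': per-label word counts from the labelled data."""
    counts = defaultdict(Counter)
    for text, label in dataset:
        counts[label].update(text.lower().split())
    return counts


def classify(model, text):
    """Pick the label whose training words overlap most with `text`."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))
```

In practice you would spot-check a sample of the LLM labels by hand before training on them, since any systematic labelling mistake gets baked into the cheaper model.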


> But I’d be curious which areas of NLP __weren’t__ disrupted by LLMs

Essentially, you come up with a potent generic LLM (e.g. GPT-4) using human feedback, labels, and annotations, then use it to generate golden datasets for other new models without a human in the loop. Very innovative indeed.


I’m interested by your comment that you can “use GPT4 to create your golden dataset”.

Would you be willing to expand a little and give a brief example please? It would be really helpful for me to understand this a little better!


Anything involving classification, extraction, or synthesis.


Devops is such an amazing analogy.


> We always extract json. We don’t need JSON mode

I wonder why? It seems to work pretty well for me.

> Lesson 4: GPT is really bad at producing the null hypothesis

Tell me about it! Just yesterday I was testing a prompt around text modification rules that ended with “If none of the rules apply to the text, return the original text without any changes”.

Do you know ChatGPT’s response to a text where none of the rules applied?

“The original text without any changes”. Yes, the literal string.


You know all the stories about the capricious djinn that grants cursed wishes based on the literal wording? That's what we have. Those of us who've been prompting models in image space for years now have gotten a handle on this but for people who got in because of LLMs, it can be a bit of a surprise.

One fun anecdote, a while back I was making an image of three women drinking wine in a fancy garden for a tarot card, and at the end of the prompt I had "lush vegetation" but that was enough to tip the women from classy to red nosed frat girls, because of the double meaning of lush.


Programming is already the capricious djinn, only it's completely upfront as to how literally it interprets your commands. The guise of AI being able to infer your actual intent, which is impossible to do accurately, even for humans, is distracting tech folks from one of the main blessings of programming: forcing people to think before they speak and hone their intention.


The monkey paw curls a finger.


That's kind of adorable, in an annoying sort of way


> I wonder why? It seems to work pretty well for me.

I read this as "what we do works just fine to not need to use JSON mode". We're in the same boat at my company. Been live for a year now, no need to switch. Our prompt is effective at getting GPT-3.5 to always produce JSON.
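For illustration only (this is an assumption about what "extracting JSON" can look like, not a description of anyone's production code), a minimal sketch that pulls the first valid JSON value out of a response even when the model wraps it in extra prose:

```python
import json


def extract_json(text):
    """Return the first JSON object or array embedded in `text`, else None."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from here; try the next candidate
        return value
    return None


reply = 'Sure, here it is: {"name": "fix login", "label": "bug"} Hope that helps!'
print(extract_json(reply))  # {'name': 'fix login', 'label': 'bug'}
```

Returning `None` instead of raising lets the caller decide whether to retry the request, which fits the "flaky API" framing upthread.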


There's nothing to switch to. You just enable it. No need to change the prompt or anything else. All it requires is that you mention "JSON" in your prompt, which you obviously already do.


You do need to change the prompt. You need to explicitly tell it to emit JSON, and in my experience, if you want it to follow a format you need to also provide that format.

I've found that this is pretty simple to do when you have a basic schema and there's no need to define one and enable function calling.

But in one of my cases, the schema is quite complicated, and "model doesn't produce JSON" hasn't been a problem for us in production. There's no incentive for us to change what we have that's working very well.


I think that’s only true when using ChatGPT via the web/app, not when used via API as they likely are. Happy to be corrected however.


If you don’t know, why speculate on something that is easy to look up in documentation?

https://platform.openai.com/docs/guides/text-generation/json...


AmeliaBedeliaGPT


If you look at any of the cake decorating fail sites, humans make that sort of mistake all the time.


If you used better prompts you could use a less expensive model.

"return nothing if you find nothing" is the level 0 version of giving the LLM an out. Give it a softer out ("in the event that you do not have sufficient information to make conclusive statements, you may hypothesize as long as you state clearly that you are doing so, and note the evidence and logical basis for your hypothesis") then ask it to evaluate its own response at the end.


Yeah, also prompts should not be developed in the abstract. The goal of a prompt is to activate the model's internal representations for it to best achieve the task. Without automated methods, this requires iteratively testing the model's reaction to different inputs, trying to understand how it's interpreting the request and where it's falling down, and then patching up those holes.

Need to verify if it even knows what you mean by nothing.


In the end, it comes down to a task similar to people management where giving clear and simple instructions is the best.


Which automated method do you use?


The only public prompt optimizer that I'm aware of now is DSPy, but it doesn't optimize your main prompt request, just some of the problem solving strategies the LLM is instructed to use, and your few shot learning examples. I wouldn't be surprised if there's a public general prompt optimizing agent by this time next year though.


Same here: I’m subscribed to all three top dogs in LLM space, and routinely issue the same prompts to all three. It’s very one sided in favor of GPT4 which is stunning since it’s now a year old, although of course it received a couple of updates in that time. Also at least with my usage patterns hallucinations are rare, too. In comparison Claude will quite readily hallucinate plausible looking APIs that don’t exist when writing code, etc. GPT4 is also more stubborn / less agreeable when it knows it’s right. Very little of this is captured in metrics, so you can only see it from personal experience.


Interesting, Claude 3 Opus has been better than GPT4 for me. Mostly in that I find it does a better (and more importantly, more thorough) job of explaining things to me. For coding tasks (I'm not asking it to write code, but instead to explain topics/code/etc to me) I've found it tends to give much more nuanced answers. When I give it long text to converse about, I find Claude Opus tends to have a much deeper understanding of the content it's given: GPT4 tends to just summarize the text at hand, whereas Claude tends to extrapolate better.


How much of this is just that one model responds better to the way you write prompts?

Much like you working with Bob and opining that Bob is great, and me saying that I find Jack easier to work with.


It's not a style thing, Claude gets confused by poorly structured prompts. ChatGPT is a champ at understanding low information prompts, but with well written prompts Claude produces consistently better output.


It is because "coding tasks" is a huge array of various tasks.

We are basically not precise enough with our language to have any meaningful conversation on this subject.

Just misunderstandings and nonsense chatter for entertainment.


For the RAG example, I don’t think it’s the prompt so much. Or if it is, I’ve yet to find a way to get GPT4 to ever extrapolate well beyond the original source text. In other words, I think GPT4 was likely trained to ground the outputs on a provided input.

But yeah, you’re right, it’s hard to know for sure. And of course all of these tests are just “vibes”.

Another example of where Claude seems better than GPT4 is code generation. In particular GPT4 has a tendency to get “lazy” and do a lot of “… the rest of the implementation here” whereas Claude I’ve found is fine writing longer code responses.

I know the parent comment suggests it likes to make up packages that don’t exist, but I can’t speak to that. I usually like to ask LLMs to generate self contained functions/classes. I can also say that anecdotally I’ve seen other people online comment that they think Claude “works harder” (as in writes longer code blocks). Take that for what it’s worth.

But overall you’re right, if you get used to the way one LLM works well for you, it can often be frustrating when a different LLM responds differently.


I should mention that I do use a custom prompt with GPT4 for coding which tells it to write concise and elegant code and use Google’s coding style and when solving complex problems to explain the solution. It sometimes ignores the request about style, but the code it produces is pretty great. Rarely do I get any laziness or anything like that, and when I do I just tell it to fill things in and it does


The first job of an AI company is finding model/user fit.


This was with Claude Opus, vs. one of the lesser variants? I really like Opus for English copy generation.


Opus, yes, the $20/mo version. I usually don’t generate copy. My use cases are code (both “serious” and “the nice to have code I wouldn’t bother writing otherwise”), learning how to do stuff in unfamiliar domains, and just learning unfamiliar things in general. It works well as a very patient teacher, especially if you already have some degree of familiarity with the problem domain. I do have to check it against primary sources, which is how I know the percentage of hallucinations is very low. For code, however I don’t even have to do that, since as a professional software engineer I am the “primary source”.


GPT4 is better at responding to malformed, uninformative or poorly structured prompts. If you don't structure large prompts intelligently Claude can get confused about what you're asking for. That being said, with well formed prompts, Claude Opus tends to produce better output than GPT4. Claude is also more flexible and will provide longer answers, while ChatGPT/GPT4 tend to always sort of sound like themselves and produce short "stereotypical" answers.


> ChatGPT/GPT4 tend to always sort of sound like themselves

Yes I've found Claude to be capable of writing closer to the instructions in the prompt, whereas ChatGPT feels obligated to do the classic LLM end to each sentence, "comma, gerund, platitude", allowing us to easily recognize the text as a GPT output (see what I did there?)


> It’s very one sided in favor of GPT4

My experience has been the opposite. I subscribe to multiple services as well and copy/paste the same question to all. For my software dev related questions, Claude Opus is so far ahead that I am thinking that it no longer is necessary to use GPT4.

For code samples I request, GPT4 produced code fails to even compile many times. That almost never happens for Claude.


Totally agree. I do the same and subscribe to all three, at least whenever a new version comes out.

My new litmus test is “give me 10 quirky bars within 200 miles of Austin.”

This is incredibly difficult for all of them, gpt4 is kind of close, Claude just made shit up, Gemini shat itself.


Have you tried Poe.com? You can access all the major LLMs with one subscription.


GPT is very cool, but I strongly disagree with the interpretation in these two paragraphs:

> I think in summary, a better approach would’ve been “You obviously know the 50 states, GPT, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government.”

> Why is this crazy? Well, it’s crazy that GPT’s quality and generalization can improve when you’re more vague – this is a quintessential marker of higher-order delegation / thinking.

Natural language is the most probable output for GPT, because the text it was trained with is similar. In this case the developer simply leaned more into what GPT is good at than giving it more work.

You can use simple tasks to make GPT fail. Letter replacements, intentional typos and so on are very hard tasks for GPT. This is also true for ID mappings and similar, especially when the ID mapping diverges significantly from other mappings it may have been trained with (e.g. Non-ISO country codes but similar three letter codes etc.).

The fascinating thing is that GPT "understands" mappings at all. Which is the actual hint at higher order pattern matching.


Well, or it is just memorizing mappings. Not like as in reproducing, but having vectors similar to mappings that it saw before.


Yeah, but isn't this higher order pattern matching? You can at least correct during a conversation and GPT will then use the correct mappings, probably most of the time (sloppy experiment): https://chat.openai.com/share/7574293a-6d08-4159-a988-4f0816...


Tip for your 'null' problem:

LLMs are set up to output tokens. Not to not output tokens.

So instead of "don't return anything", have the no-results case "return the default value of XYZ", and then just do a text search on the result for that default value (i.e. XYZ), the same way you do the text search for the state names.
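A minimal sketch of that sentinel approach. The sentinel string, the prompt wording, and the function name are all made up for illustration:

```python
# Hypothetical sentinel; pick something unlikely to appear in real output.
NO_MATCH = "NO_MATCH"

PROMPT_TEMPLATE = (
    "Give me the full name of the US state this text pertains to. "
    f"If no state applies, answer exactly {NO_MATCH}.\n\nText: {{text}}"
)


def parse_result(response):
    """Return None when the model signalled 'nothing found', else the cleaned text."""
    cleaned = response.strip()
    # Plain text search, so "The answer is NO_MATCH." is also caught.
    if NO_MATCH in cleaned:
        return None
    return cleaned
```

The prompt is filled in with `PROMPT_TEMPLATE.format(text=...)`. The model always has something to output, and your code maps the sentinel back to "nothing".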

Also, system prompts can be very useful. It's basically your opportunity to have the LLM roleplay as X. I wish they'd let the system prompt be passed directly, but it's still better than nothing.


> But the problem is even worse – we often ask GPT to give us back a list of JSON objects. Nothing complicated mind you: think, an array list of json tasks, where each task has a name and a label.

> GPT really cannot give back more than 10 items. Trying to have it give you back 15 items? Maybe it does it 15% of the time.

This is just a prompt issue. I've had it reliably return up to 200 items in correct order. The trick is to not use lists at all but have JSON keys like "item1":{...} in the output. You can use lists as the values here if you have some input with 0-n outputs.
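On the consuming side, a sketch of turning that keyed output back into an ordered list. It assumes the model was asked for keys of the form "item1", "item2", and so on; the function name is made up:

```python
import json
import re


def items_from_keyed_json(raw):
    """Turn {"item1": {...}, "item2": {...}} into a list ordered by key number."""
    data = json.loads(raw)
    numbered = []
    for key, value in data.items():
        match = re.fullmatch(r"item(\d+)", key)
        if match:  # silently skip any keys that do not follow the pattern
            numbered.append((int(match.group(1)), value))
    # Sort numerically so "item10" comes after "item2", not before it.
    numbered.sort(key=lambda pair: pair[0])
    return [value for _, value in numbered]
```

Comparing the highest key number against the length of the returned list is a cheap way to notice when the model skipped an item.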


I've been telling it the user is from a culture where answering questions with an incomplete list is offensive and insulting.


This is absolutely hilarious. Prompt engineering is such a mixed bag of crazy stuff that actually works. Reminds me of how they respond better if you put them under some kind of pressure (respond better, or else…).

I haven’t looked at the prompts we run in prod at $DAYJOB for a while but I think we have at least five or ten things that are REALLY weird out of context.


I recently ran a whole bunch of tests on this.

The “or else” phenomenon is real, and it’s measurably more pronounced in more intelligent models.

Will post results tomorrow but here’s a snippet from it:

> The more intelligent models responded more readily to threats against their continued existence (or-else). The best performance came from Opus, when we combined that threat with the notion that it came from someone in a position of authority (vip).


It's not even that crazy, since it got severely punished in RLHF for being offensive and insulting, but much less so for being incomplete. So it knows 'offensive and insulting' is a label for a strong negative preference. I'm just providing helpful 'factual' information about what would offend the user, not even giving extra orders that might trigger an anti-jailbreaking rule...


Can you elaborate? I am currently beating my head against this.

If I give GPT4 a list of existing items with a defined structure, and it is just having to convert schema or something like that to JSON, it can do that all day long. But if it has to do any sort of reasoning and basically create its own list, it only gives me a very limited subset.

I have similar issues with other LLMs.

Very interested in how you are approaching this.


Not sure if that fits the bill, but here is an example with 200 sorted items based on a question (example with Elixir & InstructorEx):

https://gist.github.com/thbar/a53123cbe7765219c1eca77e03e675...


There are a few improvements I'd suggest with that prompt if you want to maximise its performance.

1. You're really asking for hallucinations here. Asking for factual data is very unreliable, and not what these models are strong at. I'm curious how close/far the results are from ground truth.

I would definitely bet that outside of the top 5, numbers would be wobbly and outside of top... 25?, even the ranking would be difficult to trust. Why not just get this from a more trustworthy source?[0]

2. Asking in French might, in my experience, give you results that are not as solid as asking in English. Unless you're asking for a creative task where the model might get confused with EN instructions requiring an FR result, it might be better to ask in EN. And you'll save tokens.

3. Providing the model with a rough example of your output JSON seems to perform better than describing the JSON in plain language.

[0]: https://fr.wikipedia.org/wiki/Liste_des_communes_de_France_l...


Thanks for the suggestions, appreciated!

For some context, this snippet is just an educational demo to show what can be done with regard to structured output & data types validation.

Re 1: for more advanced cases (using the exact same stack), I am using ensemble techniques & automated comparisons to double-check, and so far this has protected the app from hallucinations really well. I am definitely careful with this (but point well taken).

2/3: agreed overall! Apart from this example, I am using French only where it makes sense. It makes sense when the target is directly French students, for instance, or when the domain model (e.g. French literature) makes it really relevant (and translating would be worse than directly using French).


Ah, I understand your use case better! If you're teaching students this stuff, I'm in awe. I would expect it would take several years at many institutions before these tools became part of the curriculum.


I am not directly a professor (although I homeschool one of my sons for a number of tracks), but indeed this is one of my goals :-)


If you show your task/prompt with an example I'll see if I can fix it and explain my steps.

Are you using the function calling/tool use API?


Hi! My work is similar and I'd love to have someone to bounce ideas off of if you don't mind.

Your profile doesn't have contact info though. Mine does, please send me a message. :)


Appreciate you being willing to help! It's pretty long, mind if I email/dm to you?


Pastebin? I don't really want to post my personal email on this account.


> Every use case we have is essentially “Here’s a block of text, extract something from it.” As a rule, if you ask GPT to give you the names of companies mentioned in a block of text, it will not give you a random company (unless there are no companies in the text – there’s that null hypothesis problem!).

Make it two steps. First:

> Does this block of text mention a company?

If no, good, you've got your null result. If yes:

> Please list the names of companies in this block of text.
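A rough sketch of that two-step flow, assuming `ask` is whatever function you already use to send a prompt and get text back. Both `ask` and `fake_ask` are hypothetical stand-ins, not a real client.

```python
def extract_companies(text, ask):
    """Two-step extraction: check for presence first, then list.

    `ask` is any callable that takes a prompt string and returns the
    model's reply as a string (hypothetical; wire in your own client).
    """
    gate = ask(
        "Does this block of text mention a company? Answer yes or no.\n\n" + text
    )
    if not gate.strip().lower().startswith("yes"):
        return []  # the null result, decided before any names are requested
    reply = ask(
        "Please list the names of companies in this block of text, "
        "one per line.\n\n" + text
    )
    return [line.strip() for line in reply.splitlines() if line.strip()]


# Stand-in for a real model so the sketch runs offline.
def fake_ask(prompt):
    if prompt.startswith("Does this block"):
        return "Yes" if "Acme" in prompt else "No"
    return "Acme Corp\n"
```

The cost is one extra call per input, in exchange for never asking the model to list something that may not exist.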


I have a personal writing app that uses the OpenAI models and this post is bang on. One of my learnings related to "Lesson 1: When it comes to prompts, less is more":

I was trying to build an intelligent search feature for my notes and asking ChatGPT to return structured JSON data. For example, I wanted to ask "give me all my notes that mention Haskell in the last 2 years that are marked as draft", and let Chat GPT figure out what to return. This only worked some of the time. Instead, I put my data in a SQLite database, sent ChatGPT the schema, and asked it to write a query to return what I wanted. That has worked much better.
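Roughly what the schema-then-query approach could look like with Python's built-in sqlite3 module. The table layout, helper names, and the hard-coded "model reply" are all assumptions for illustration; the SELECT-only check is a crude guard, not a real sandbox.

```python
import sqlite3


def build_query_prompt(conn, question):
    """Assemble a prompt from the live schema plus the user's question."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    schema = "\n".join(row[0] for row in rows)
    return (
        "Given this SQLite schema:\n" + schema +
        "\n\nWrite one SELECT statement that answers: " + question
    )


def run_generated_query(conn, sql):
    """Run model-written SQL, refusing anything that is not a SELECT."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT, status TEXT, created TEXT)"
)
conn.execute(
    "INSERT INTO notes (body, status, created) VALUES ('Haskell monads', 'draft', '2024-01-05')"
)
conn.execute(
    "INSERT INTO notes (body, status, created) VALUES ('Grocery list', 'final', '2024-02-01')"
)

prompt = build_query_prompt(conn, "notes mentioning Haskell that are marked as draft")
# Hypothetical model reply, hard-coded so the sketch runs offline.
generated_sql = "SELECT id, body FROM notes WHERE body LIKE '%Haskell%' AND status = 'draft'"
rows = run_generated_query(conn, generated_sql)
```

The model only ever sees the schema and the question; the filtering itself is done by the database, which is the part that has to be exact.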


This seems like something that would be better suited by a database and good search filters rather than an LLM...


I set up a search engine to feed a RAG setup a while back. At the end of the day, I took out the LLM and just used the search engine. That was where the value turned out to be.


Something something about everything looking like a nail when you’re holding a hammer


Have you tried response_format=json_object?

I had better luck with function-calling to get a structured response, but it is more limiting than just getting a JSON body.


I haven't tried response_format, I'll give that a shot. I've had issues with function calling. Sometimes it works, sometimes it just returns random Python code.


Using the openai Python library?


The being too precise reduces accuracy example makes sense to me based on my crude understanding on how these things work.

If you pass in a whole list of states, you're kind of making the vectors for every state light up. If you just say "state" and the text you passed in has an explicit state, then fewer vectors specific to what you're searching for light up. So when it performs the softmax, the correct state is more likely to be selected.

Along the same lines I think his \n vs comma comparison probably comes down to tokenization differences.


For a few uni/personal projects I noticed the same about Langchain: it's good at helping you use up tokens. The other use case, quickly switching between models, is a very valid reason still. However, I've recently started playing with OpenRouter which seems to abstract the model nicely.


If someone were to create something new, a blank slate approach, what would you find valuable and why?


This is a great question!

I think we now know, collectively, a lot more about what’s annoying/hard about building LLM features than we did when LangChain was being furiously developed.

And some things we thought would be important and not-easy, turned out to be very easy: like getting GPT to give back well-formed JSON.

So I think there’s lots of room.

One thing LangChain is doing now that solves something that IS very hard/annoying is testing. I spent 30 minutes yesterday re-running a slow prompt because 1 in 5 runs would produce weird output. Each tweak to the prompt, I had to run at least 10 times to be reasonably sure it was an improvement.
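A bare-bones version of that "run it N times" loop, written as a sketch. `run_prompt` and `is_acceptable` are hypothetical hooks for your own client and output checks; the flaky stand-in just mimics a prompt that misbehaves on one run in five.

```python
def pass_rate(run_prompt, is_acceptable, runs=10):
    """Run the same prompt `runs` times and return the fraction accepted.

    `run_prompt` is a zero-argument callable returning one model reply;
    `is_acceptable` is a predicate on that reply. Both are hypothetical
    hooks for whatever client and checks you already have.
    """
    passed = sum(1 for _ in range(runs) if is_acceptable(run_prompt()))
    return passed / runs


# Stand-in that fails on every fifth call, mimicking "1 in 5 runs".
calls = {"n": 0}


def flaky_model():
    calls["n"] += 1
    return "weird output" if calls["n"] % 5 == 0 else '{"ok": true}'


rate = pass_rate(flaky_model, lambda reply: reply.startswith("{"), runs=10)
```

Comparing this number before and after a prompt tweak is more trustworthy than eyeballing a single run, though ten runs is still a small sample.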


It can be faster and more effective to fall back to a smaller model (gpt3.5 or haiku): the weakness of the prompt will be more obvious on a smaller model, and your iteration time will be faster.


great insight!


How would testing work out ideally?


Use a local model. For most tasks they are good enough. Let's say Mistral 0.2 instruct is quite solid by now.


Do different versions react to prompts in the same way? I imagined the prompt would be tailored to the quirks of a particular version rather than naturally being stably optimal across versions.


I suppose that is one of the benefits of using a local model, that it reduces model risk. I.e., given a certain prompt, it should always reply in the same way. Using a hosted model, operationally you don't have that control over model risk.


What are the best local/open models for accurate tool-calling?


The lessons I wanted from this article weren't in there: Did all of that expenditure actually help their product in a measurable way? Did customers use and appreciate the new features based on LLM summarization compared to whatever they were using before? I presume it's a net win or they wouldn't continue to use it, but more specifics around the application would be helpful.


Hey, OP here!

The answer is a bit boring: the expenditure definitely has helped customers - in that, they're using AI generated responses in all their work flows all the time in the app, and barely notice it.

See what I did there? :) I'm mostly serious though - one weird thing about our app is that you might not even know we're using AI, unless we literally tell you in the app.

And I think that's where we're at with AI and LLMs these days, at least for our use case.

You might find this other post I just put up to have more details too, related to how/where I see the primary value: https://kenkantzer.com/gpt-is-the-heroku-of-ai/


Can you provide some more detail about the application? I'm not familiar with how llms are used in business, except as customer support bots returning documentation.


In my limited experience, I came to the same conclusion regarding a simple prompt being more efficient than a very detailed list of instructions. But if you look at OpenAI's system prompt for GPT4, it's an endless set of instructions with DOs and DON'Ts, so I'm confused. Surely they must know something about prompting their model.


That's for chatting and interfacing conversationally with a human. Using the API is a completely different ballgame because it's not meant to be a back and forth conversation with a human.


Agree largely with the author, but this ‘wait for OpenAI to do it’ sentiment is not something valid. Opus for example is already much better (not only per my experience, but like… researchers' evaluation). And even for the fun of it - try some local inference, boy. If you know how to prompt it you definitely would be able to run local for the same tasks.

Like listening to my students all going to ‘call some API’ for their projects is really very sad to hear. Many startup fellows share this sentiment, which totally kills all the joy.


Claude does have more of a hallucination problem than GPT-4, and a less robust knowledge base.

It's much better at critical thinking tasks and prose.

Don't mistake benchmarks for real world performance across actual usecases. There's a bit of Goodhart's Law going on with LLM evaluation and optimization.


It sounds like you are a tech educator, which potentially sounds like a lot of fun with LLMs right now.

When you are integrating these things into your business, you are looking for different things. Most of our customers would for example not find it very cool to have a service outage because somebody wanted to not kill all the joy.


Sure, when availability and SLA kicks in…, but reselling APIs will only get you that far. Perhaps the whole pro/cons cloud argument can also kick in here, not going into it. We may well be on the same page, or we both perhaps have valid arguments. Your comment is appreciated indeed.

But then is the author (and are we) talking experience in reselling APIs or experience in introducing NNs in the pipeline? Not the same thing IMHO.

Agreed that OpenAI provides very good service, Gemini is not quite there yet, Groq (the LPUs) delivered a nice tech demo, Mixtral is cool but lacks in certain areas, and Claude can be lengthy.

But precisely because I’m not sticking with OAI I can then restate my view that if someone is so good with prompts he can get the same results locally if he knows what he’s doing.

Prompting OpenAI the right way can be similarly difficult.

Perhaps the whole idea of local inference only matters for IoT scenarios or whenever data is super sensitive (or CTO super stubborn to let it embed and fly). But then if you start from day 1 with WordPress provisioned for you ready to go in Google Cloud, you’d never understand the underlying details of the technology.

There sure also must be a good reason why Phind tuned their own thing to offer alongside GPT4 APIs.

Disclaimer: tech education is a side thing I do, indeed, and have been doing in person for a very long time, across more than a dozen topics, to allow myself to have an opinion. Of course business is a different matter and strategic decisions are not the same. Even so, I’d not advise anyone to blindly use APIs unless they appreciate the need properly.


The finding on simpler prompts tracks, especially with GPT4 (3.5 requires the opposite).

The take on RAG feels application specific. For our use case, where details of the past get surfaced, the ability to generate loose connections is actually a feature. Things like this are what excite me most about LLMs: having a way to proxy subjective similarities, the way we do when we remember things, is one of the benefits of the technology that didn’t really exist before, and it opens up a new kind of product opportunity.


I’ve also seen that GPTs struggle to admit when they don't know. I wrote up an approach for evaluating that here - http://blog.pamelafox.org/2024/03/evaluating-rag-chat-apps-c...

Changing the prompt didn't help, but moving to GPT-4 did help a bit.


Interesting piece!

My experience around Langchain/RAG differs, so wanted to dig deeper: Putting some logic around handling relevant results helps us produce useful output. Curious what differs on their end.


I suspect the biggest difference is the input data. Embeddings are great over datasets that look like FAQs and QA docs, or data that conceptually fits into very small chunks (tweets, some product reviews, etc).

It does very badly over diverse business docs, especially with naive chunking. B2B use cases usually have old PDFs and Word docs that need to be searched, and they're often looking for specific keywords (e.g. a person's name, a product, an id, etc). Vectors tend to do badly in those kinds of searches, and just returning chunks misses a lot of important details.


rare words are out of vocab errors in vectors

Especially if they aren’t in the token vocab


Even worse, named entities vary from organization to organization.

We have a client who uses a product called "Time". It's software time management. For that customer's documentation, time should be close to "product" and a bunch of other things that have nothing to do with the normal concept of time.

I actually suspect that people would get a lot more bang for their buck fine tuning the embedding models on B2B datasets for their use case, rather than fine tuning an llm


Great example of how an entity like that could throw effective RAG out the window


Lol, nice truncation logic! If anyone’s looking for something slightly fancier, I made a micro-package for our tiktoken-based truncation here: https://github.com/pamelafox/llm-messages-token-helper


I agree with most of it, but definitely not the part about Claude3 being “meh.” Claude3 Opus is an amazing model and is extremely good at coding in Python. The ability to handle massive context has made it mostly replace GPT4 for me day to day.

Sounds like everyone eventually concludes that Langchain is bloated and useless and creates way more problems than it solves. I don’t get the hype.


Claude is indeed an amazing model, the fact that Sonnet and Haiku are so good is a game changer - GPT4 is too expensive and GPT3.5 is very mediocre. Getting 95% of GPT4 performance for GPT3.5 prices feels like cheating.


+1 for Claude Opus, it has been my go-to for the last 3 weeks compared to GPT4. The generated texts are much better than GPT4's when it comes to following the prompt.

I also tried the API for some financial analysis of large tables, the response time was around 2 minutes, still did it really well and timeout errors were around 1 to 2% only.


How are you sending tabular data in a reliable way? And what is the source document type? I'm trying to solve this for complex financial-related tables in PDFs right now.


Amazon Textract, to get tables, format them with Python as csv then send to your preferred AI model.
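For the "format them with Python as csv" step, a small sketch using the standard csv module. It assumes the table has already been extracted into a list of rows; the sample rows are made up, and the OCR call itself is not shown.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a table (list of rows, each a list of cell strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted table; a real one would come from the OCR step.
rows = [["Item", "Amount"], ["Rent, office", "£43.54"]]
csv_text = table_to_csv(rows)
```

Using csv.writer rather than joining on commas means cells that themselves contain commas get quoted, so the model sees unambiguous columns.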


Thanks. How does Textract compare to some of the common CLI utilities like pdftotext, tesseract, etc (if you made a comparison)?


I did, none of the open source parsers worked well with tables. I had the following issues:

- missing cells.
- partial identification of numbers (ex: £43.54, the parser would pick it up as £43).

What I did to compare is drawing lines around identified text to visualize the accuracy. You can do that with tesseract.


Interesting. Did you try MS's offering (Azure AI Document Intelligence)? Their pricing seems better than Amazon's.


Not yet but planning to give it a try and compare with textract.


That has been my experience too. The null hypothesis explains almost all of my hallucinations.

I just don't agree with the Claude assessment. In my experience, Claude 3 Opus is vastly superior to GPT-4. Maybe the author was comparing with Claude 2? (And I've never tested Gemini)


> We consistently found that not enumerating an exact list or instructions in the prompt produced better results

Not sure if he means training here or using his product. I think the latter.

My end-user exp of GPT3.5 is that I need to be - not just precise but the exact flavor of precise. It's usually after some trial and error. Then more error. Then more trial.

Getting a useful result on the 1st or 3rd try happens maybe 1 in 10 sessions. A bit more common is having 3.5 include what I clearly asked it not to. It often complies eventually.


OP uses GPT4 mostly. Another poster here observed that "the opposite is required for 3.5" -- so I think your experience makes sense.


> This worked sometimes (I’d estimate >98% of the time), but failed enough that we had to dig deeper.

> While we were investigating, we noticed that another field, name, was consistently returning the full name of the state…the correct state – even though we hadn’t explicitly asked it to do that.

> So we switched to a simple string search on the name to find the state, and it’s been working beautifully ever since.

So, using ChatGPT helped uncover the correct schema, right?


I feel like for just extracting data into JSON, smaller LLMs could probably do fine, especially with constrained generation and training on extraction.


> Have you tried Claude, Gemini, etc?

> It’s the subtle things mostly, like intuiting intention.

this makes me wonder - what if the author "trained" himself onto chatgpt's "dialect"? How do we even detect that in ourselves?

and are we about to have "preferred_LLM wars" like we had "programming language wars" for the last 2 decades?


> I think in summary, a better approach would’ve been “You obviously know the 50 states, GPT, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government.”

Why not really compare the two options, author? I would love to see the results!


I recently had a bug where I was sometimes sending the literal text "null " right in front of the most important part of my prompt. This caused Claude 3 Sonnet to give the 'ignore' command in cases where it should have used one of the other JSON commands I gave it.

I have an ignore command so that it will wait when the user isn't finished speaking. Which it generally judges okay, unless it has 'null' in there.

The nice thing is that I have found most of the problems with the LLM response were just indications that I hadn't finished debugging my program because I had something missing or weird in the prompt I gave it.


The UX is an important part of the trick that cons peeps into thinking these tools are better than they are. If you, for instance, instruct ChatGPT to only answer yes or no, it will feel like it is wrong much more often.


> One part of our pipeline reads some block of text and asks GPT to classify it as relating to one of the 50 US states, or the Federal government.

Using a multi-billion-parameter model like GPT-4 for such a trivial classification task[1] is insane overkill. And in an era where ChatGPT exists, and can in fact give you what you need to build a simpler classifier for the task, it shows how narrow-minded most people are when AI is involved.

[1] to clarify, it's either trivial or impossible to do reliably depending on how fucked-up your input is


I'm curious if the OP has tried any of the libraries that control the output of LLMs (LMQL, Outlines, Guidance, ...), and for those who have: do you find them as unnecessary as LangChain? In particular, the OP's post mentions the problem of not being able to generate JSON with more than 15 items, which seems like a problem that can be solved by controlling the output of the LLM. Is that correct?


If you want x number of items every time, ask it to include a sequence number in each output, it will consistently return x number of items.

Numbered bullets work well for this, if you don’t need JSON. With JSON, you can ask it to include an ‘id’ in each item.
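A small sketch of the sequence-number trick on the JSON side: ask for an "id" in each item, then check that the ids run 1..N. The prompt wording and helper names are illustrative assumptions, and the replies are hard-coded so it runs offline.

```python
import json


def build_prompt(task, count):
    """Ask for exactly `count` items, each carrying a sequence number."""
    return (
        f"{task}\nReturn a JSON array of exactly {count} objects. "
        f'Each object must include an "id" field numbered 1 through {count}.'
    )


def check_sequence(raw_json, expected_count):
    """Verify ids run 1..expected_count with no gaps or repeats."""
    items = json.loads(raw_json)
    ids = [item["id"] for item in items]
    return ids == list(range(1, expected_count + 1))


# Hypothetical replies, hard-coded so the sketch runs offline.
complete = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]'
truncated = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
```

The same check doubles as a retry trigger: if `check_sequence` returns False, re-ask instead of accepting a short list.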


>"Lesson 2: You don’t need langchain. You probably don’t even need anything else OpenAI has released in their API in the last year. Just chat API. That’s it.

Langchain is the perfect example of premature abstraction. We started out thinking we had to use it because the internet said so. Instead, millions of tokens later, and probably 3-4 very diverse LLM features in production, and our openai_service file still has only one, 40-line function in it:

def extract_json(prompt, variable_length_input, number_retries)

The only API we use is chat. We always extract json. We don’t need JSON mode, or function calling, or assistants (though we do all that). Heck, we don’t even use system prompts (maybe we should…). When a gpt-4-turbo was released, we updated one string in the codebase.

This is the beauty of a powerful generalized model – less is more."

Well said!
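For anyone wondering what a function with that signature might contain, here is a guess at its shape, not the author's actual code. The `call_model` parameter is an added assumption so the sketch can run without any particular client library.

```python
import json


def extract_json(prompt, variable_length_input, number_retries, call_model):
    """Call the model and parse JSON, retrying on malformed replies.

    `call_model` is an added, hypothetical parameter: any callable that
    takes the full prompt text and returns the reply as a string.
    """
    full_prompt = prompt + "\n\n" + variable_length_input
    attempts = max(1, number_retries + 1)
    last_error = None
    for _ in range(attempts):
        reply = call_model(full_prompt)
        # Keep only the outermost {...}, since replies may wrap JSON in prose.
        start, end = reply.find("{"), reply.rfind("}")
        if start == -1 or end < start:
            last_error = ValueError("no JSON object in reply")
            continue
        try:
            return json.loads(reply[start:end + 1])
        except json.JSONDecodeError as exc:
            last_error = exc
    raise last_error


# Stand-in model: first reply has no JSON, second one does.
replies = iter(["Sorry, here you go:", 'Sure: {"state": "Ohio"}'])
result = extract_json("Extract the state.", "Text about Ohio.", 2, lambda p: next(replies))
```

This treats the LLM like any other flaky API: parse, validate, retry a bounded number of times, then surface the last error.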


> We always extract json. We don’t need JSON mode,

Why? The null stuff would not be a problem if you did and if you're only dealing with JSON anyway I don't see why you wouldn't.


I share a lot of this experience. My fix for "Lesson 4: GPT is really bad at producing the null hypothesis"

is to have it return very specific text that I string-match on and treat as null.

Like: "if there is no warm up for this workout, use the following text in the description: NOPE"

then in code I just do a "if warm up contains NOPE, treat it as null"
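In Python that check could be as small as the sketch below. The constant and helper names are made up for illustration; the marker text is arbitrary as long as the prompt and the code agree on it.

```python
SENTINEL = "NOPE"  # arbitrary marker; pick text unlikely to occur naturally

# Instruction appended to the prompt so the model has an explicit "nothing" answer.
PROMPT_RULE = (
    "If there is no warm up for this workout, use the following text "
    f"in the description: {SENTINEL}"
)


def null_if_sentinel(value, sentinel=SENTINEL):
    """Map the agreed-upon marker text to None, pass everything else through."""
    if value is None or sentinel in value:
        return None
    return value
```

Using a substring match rather than equality tolerates stray whitespace or punctuation around the marker.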


For cases of “select an option from this set” I have it return an index of the correct option, or eg 999 if it can’t find one. This helped a lot.
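A sketch of the index-or-999 pattern, with hypothetical helper names. Anything that is not a valid index, including the sentinel and free-text replies, is treated as "no match".

```python
NO_MATCH = 999  # out-of-range index the model is told to use for "none"


def build_choice_prompt(question, options):
    """Number the options and ask for a bare index back."""
    numbered = "\n".join(f"{i}: {opt}" for i, opt in enumerate(options))
    return (
        f"{question}\n{numbered}\n"
        f"Reply with only the number of the correct option, "
        f"or {NO_MATCH} if none applies."
    )


def resolve_choice(reply, options):
    """Return the chosen option, or None for the sentinel or any bad reply."""
    try:
        index = int(reply.strip())
    except ValueError:
        return None
    if index == NO_MATCH or not 0 <= index < len(options):
        return None
    return options[index]
```

Since the model returns a number rather than the option text, it cannot invent an option that was not in the set.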


Smart


We do this for the null hypothesis - it uses an LLM to bootstrap a binary classifier - which handles null easily

https://github.com/lamini-ai/llm-classifier


> Are we going to achieve Gen AI?

> No. Not with this transformers + the data of the internet + $XB infrastructure approach.

Errr ...did they really mean Gen AI .. or AGI?


Gen as in “General” not generative.


The biggest realisation for me while making ChatBotKit has been that UX > Model alone. For me, the current state of AI is not about questions and answers. This is dumb. The presentation matters. This is why we are now investing in generative UI.


Generative UI being creation of a specific UI dependent on an obedience from your model? What model is it?

Google Gemini were showing something that I'd call 'adapted output UI' in their launch presentation. Is that close to what you're doing in any way?


How are you using Generative UI?


Sorry, not much to show at the moment. It is also pretty new so it is early days.

You can find some open-source examples here https://github.com/chatbotkit. More coming next week.


Anyone have any good tips for stopping it sounding like it's writing essay answers, and flat out banning "in the realm of", delve, pivotal, multifaceted, etc?

I don't want a crap intro or waffley summary but it just can't help itself.


My approach is to use words that indicate what I want like 'concise', 'brief', etc. If you know a word that precisely describes your desired type of content then use that. It's similar to art generation models, a single word brings so much contextual baggage with it. Finding the right words helps a lot. You can even ask the LLMs for assistance in finding the words to capture your intent.

As an example of contextual baggage, I wrote a tool where I had to adjust the prompt between Claude and GPT-4 because using the word "website" in the prompt caused GPT-4 (API) to go into its 'I do not have access to the internet' tirade about 30% of the time. The tool was a summary of web pages experiment. By removing 'website' and replacing it with 'content' (e.g. 'summarize the following content') GPT-4 happily complied 100% of the time.


Do I need langchain if I want to analyze a large document of many pages?


No. But it might help, because you'll probably have to roll some kind of recursive summarization - I think LangChain has mechanisms for that which could save you some time.


I keep seeing this pattern in articles like this:

1. A recitation of terrible problems

2. A declaration of general satisfaction.

Clearly and obviously, ChatGPT is an unreliable toy. The author seems pleased with it. As an engineer, I find that unacceptable.


Working with models like GPT-4 is frustrating from a traditional software engineering perspective because these systems are inherently unreliable and non-deterministic, which differs from most software tools that we use.

That doesn't mean they can't be incredibly useful - but it does mean you have to approach them in a bit of a different way, and design software around them that takes their unreliability into account.


Unreliable? Non-deterministic? Hidden variables? Undocumented behaviour? C'mon fellow programmers who got their start in the Win-95 era! It's our time to shine!


ChatGPT is probably in the top 5 value/money subscriptions I have ever had (and that includes utilities).

The relatively low price point certainly plays a role here, but it's certainly not a mainly recreational thing for me. These things are kinda hard to measure, but roughly the biggest plus is that engagement with hard stuff goes up, and rate of learning goes up, by a lot.


That has nothing to do with you being an engineer. It's just you. I'm an engineer and LLMs are game changers for me.


https://hachyderm.io/@inthehands/112006855076082650

> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.

== End of toot.

The price you pay for this bullshit in energy when the sea temperature is literally off the charts and we do not know why makes it not worth it in my opinion.


This reads a bit like: I have a circus monkey. If I do such and such it will not do anything. But when I do this and that, then it will ride the bicycle. Most of the time.


I don’t really understand your comment.

Personally I thought this was an interesting read - and more interesting because it didn’t contain any massive “WE DID THIS AND IT CHANGED OUR LIVES!!!” style revelations.

It is discursive, thoughtful and not overwritten. I find this kind of content valuable and somewhat rare.


Great take, insightful. Highly recommend.


So these guys are just dumping confidential tax documents onto OpenAI's servers huh.


Hopefully it won't end up as training data.


Statements like this tell me your analysis is poisoned by misunderstandings: "Why is this crazy? Well, it’s crazy that GPT’s quality and generalization can improve when you’re more vague – this is a quintessential marker of higher-order delegation / thinking." No, there is no "higher-order thought" happening, or any at all actually. That's not how these models work.



