
Congrats on the launch! A quick search in the YC startup directory brought up 5-10 companies doing pretty much the same thing:

- https://www.ycombinator.com/companies/tableflow

- https://www.ycombinator.com/companies/reducto

- https://www.ycombinator.com/companies/mindee

- https://www.ycombinator.com/companies/omniai

- https://www.ycombinator.com/companies/trellis

At the same time, accurate document extraction is becoming a commodity with powerful VLMs. Are you planning to focus on a specific industry, or how do you plan to differentiate?


Yes, there is definitely a boom in document-related startups. We see our niche as focusing on non-technical users. We have focused on making it easy to build schemas, providing an audit and review experience, and integrating into downstream applications.

Hey we're on that list! Congrats on the launch Max & team!

I could definitely point to minor differences between all the platforms, but you're right that everyone is tackling the same unstructured data problem.

In general, I think it will be a couple of years before anyone really butts heads in the market. The problem space is just that big. I'm constantly blown away by how big the document problem is at these mid-sized businesses. And most of these companies don't have any engineers on staff, so no attempt has ever been made to fix it.


"accurate document extraction is becoming a commodity with powerful VLMs"

Agree.

The capability is fairly trivial for orgs with decent technical talent. The tech / processes all look similar:

User uploads file --> Azure prebuilt-layout returns .MD --> prompt + .MD + schema sent to LLM --> JSON returned. Do whatever you want with it.
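For anyone who wants to picture it, here's a minimal sketch of that flow, assuming the OpenAI Python SDK for the LLM step; the layout/OCR step is stubbed out as a hypothetical helper, and the schema fields are made up for illustration:

    import json
    from openai import OpenAI  # assumes the standard OpenAI Python SDK

    client = OpenAI()

    # Hypothetical helper: in practice this would call a layout/OCR service
    # (e.g. Azure's prebuilt-layout model) and return the document as Markdown.
    def layout_to_markdown(file_path: str) -> str:
        raise NotImplementedError("plug in your OCR/layout provider here")

    # Illustrative schema; real fields depend on the documents.
    SCHEMA = {"invoice_number": "string", "invoice_date": "YYYY-MM-DD", "total": "number"}

    def extract(file_path: str) -> dict:
        md = layout_to_markdown(file_path)
        prompt = (
            "Extract the fields below from the document. Answer with JSON only, "
            f"matching this schema:\n{json.dumps(SCHEMA, indent=2)}\n\nDocument:\n{md}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(resp.choices[0].message.content)

Everything interesting (chunking, page limits, validation) happens outside this happy path.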


Totally agree that this is becoming the standard "reference architecture" for this kind of pipeline. The only thing that complicates this a lot today is complex inputs. For simple 1-2 page PDFs, what you describe works quite well out of the box, but for 100+ page docs it starts to fall over in the ways I described in another comment.

Are really large inputs solved at Midship? If so, I'd consider that a differentiator (at least today). The demo's limited to 15 pages, and I don't see any marketing around long-context or complex inputs on the site.

I suspect this problem gets solved in the next iteration or two of commodity models. In the meantime, being smart about how the context gets divvied up works OK.

I do like the UI you appear to have for citing information: drawing the polygons around the data, and showing where it appears in the PDF. Nice.


Why all those steps? Why not just file + prompt to JSON directly?

Having the text (for now) is still pretty important for quality output. The vision models are quite good, but not a replacement for a quality OCR step. A combination of Text + Vision is compelling too.

Execution is everything. Not to drop a link in someone else's HN launch, but I'm building https://therapy-forms.com and these guys are way ahead of me on UI, polish, and probably overall quality. I do think there are plenty of slightly different niches here, but even if there were not, execution is everything. Heck, it's likely I'll wind up as a Midship customer; my spare time to fiddle with OCR models is desperately limited, and all I want to do is sell to clinics.

Just a heads up: I tried to sign up, but the button doesn't seem to work.

See what I mean about execution?

Do you know if there are any good (preferably C++) libraries for extracting data tables from PDFs?

TableFlow co-founder here - I don't want to distract from the Midship launch (congrats!) but did want to add my 2 cents.

We see a ton of industries/use cases still bogged down by manual workflows that start with data extraction. These are often large companies throwing many people at the issue ($$). The vast majority of these companies lack the technical teams required to leverage VLMs directly (or at least the desire to manage their own software). There's a ton of room for tailored solutions here, and I don't think it's a winner-take-all space.


+1 to what Mitch said. We believe there is a large market of non-technical users who can now automate extraction tasks but do not know how to interact with APIs. Midship is another option for them that requires zero programming!

I'm curious - what does the AI coding setup of the HN community look like, and how has your experience been so far?

I want to get some broader feedback before completely switching my workflow to Aide or Cursor.


I tried Cursor and found it annoying. I don’t really like talking to AI in IDE chat windows. For whatever reason, I really prefer a web browser. I also didn’t like the overall experience.

I’m still using Copilot in VS Code every day. I recently switched from OpenAI to Claude for the browser-based chat stuff and I really like it. The UI for coding assistance in Claude is excellent. Very well thought out.

Claude also has a nice feature called Projects where you can upload a bunch of stuff to build context which is great - so for instance if you are doing an API integration you can dump all the API docs into the project and then every chat you have has that context available.

As with all the AI tools you have to be quite careful. I do find that errors slip into my code more easily when I am not writing it all myself. Reading (or worse, skimming) source code is just different than writing it. However, between type safety and unit testing, I find I get rid of the bugs pretty quickly and overall my productivity is multiples of what it was before.


This is me also; I don't like the UX/DX of Cursor and such just yet.

I can't tell if it is a UX thing or if it also doesn't suit my mental model.

I religiously use Copilot, and then paste stuff into Claude or ChatGPT (both pro) when needed.


I am on day 8 of Cursor's 14-day trial. If things continue to go well, I will be switching from Webstorm to Cursor for my Typescript projects.

The AI integrations are a huge productivity boost. There is a substantial difference in the quality of the AI suggestions between using Claude on the side, and having Claude be deeply integrated in the codebase.

I think I accepted about 60-70% of the suggestions Cursor provided.

Some highlights of Cursor:

- Wrote about 80% of a Vite plugin for consolidating articles in my blog (built on remix.run)

- Wrote a Github Action for automated deployments. Using Cursor to write automation scripts is a tangible productivity boost.

- Made meaningful alterations to a libpg_query fork that allowed it to be cross-compiled to iOS. I have very little experience with C compilation; it would have taken me substantially longer to figure this out on my own.

There are some downsides to using Cursor though:

- Cursor can get too eager with its suggestions, and I'm not seeing any easy way to temporarily or conditionally turn them off. This was especially bad when I was writing blog posts.

- Cursor does really well with Bash and Typescript, but does not work very well with Kotlin or Swift.

- This is a personal thing, but I'm still not used to some of the shortcuts that Cursor uses (Cursor is built on top of VSCode).


I would not be able to leave a JetBrains product for Kotlin, or Xcode for Swift.

Overall it's so unfortunate that JetBrains doesn't have a Cursor-level AI plugin*, because JetBrains IDEs by themselves are so much more powerful than base-level VS Code that it actually erases some small portion of the gains from AI...

(* people will link many JetBrains AI plugins, but none are polished enough)


I probably would switch to Cursor for Swift projects too if it weren't for the fact that I will still need Xcode to compile the app.

I also agree that the non-AI parts of JetBrains products are much better than the non-AI parts of Cursor. JetBrains' refactoring tools are still unmatched.

That said, I think the AI part is compelling enough to warrant the switch. There are code rewrite tasks that JetBrains would struggle with, that LLMs can do fairly easily.


JetBrains is very interesting; what are the best-performing extensions out there for it?

I do wonder what API-level access we get over there as well. For Sidecar to run, we need the LSP + a web/panel for the UX part (deeper editor-layer access like the undo and redo stack would also be cool, but isn't totally necessary).


You can get both by using Aider (yes, confusingly similar name). https://aider.chat

It does the multi-file editing, asking to add files etc., but as a CLI/local web app tool.


It's great that Cursor is working for you. I do think LLMs in general are far, far better at TypeScript and Python than at other languages (a reflection of the training data).

What features of Cursor were the most compelling to you? I know their autocomplete experience is elite, but I'm wondering if there are other features you use often!


Their autocomplete experience is decent, but I've gotten the most value out of Cursor's "chat + codebase context" (no idea what it's called). The feature where you feed it the entire codebase as part of the context, and let Cursor suggest changes to any parts of the codebase.

Ohh, interesting.. I tried it on a couple of big repos and it was a bit of a miss for me. How large are the codebases you work on? I want to get a sense check on where the behavior deteriorates with embedding + GPT-3.5-based reranker search (not sure if they are doing more now!)

The largest repo I used with Cursor was about 600,000 lines long.

That's a good metric to aim for... creating a full local index for 600k lines is pretty expensive, but there are a bunch of heuristics which can take us pretty far:

- looking at git commits

- making use of recently accessed files

- keyword search

If I set these constraints and allow for maybe around 2 LLM round trips, we can get pretty far in terms of performance.
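Roughly the kind of thing I mean (a toy sketch, not what we ship; the weights and limits are made up):

    import subprocess
    from collections import Counter

    def recent_files(repo: str) -> Counter:
        # Files touched in the last 200 commits, weighted by how often they appear.
        out = subprocess.run(
            ["git", "-C", repo, "log", "--name-only", "--pretty=format:", "-n", "200"],
            capture_output=True, text=True, check=True,
        ).stdout
        return Counter(line for line in out.splitlines() if line.strip())

    def keyword_hits(repo: str, keywords: list[str]) -> Counter:
        # Cheap keyword search via `git grep -l`, one call per keyword.
        hits = Counter()
        for kw in keywords:
            out = subprocess.run(["git", "-C", repo, "grep", "-l", "-i", kw],
                                 capture_output=True, text=True).stdout
            hits.update(out.splitlines())
        return hits

    def candidate_files(repo: str, keywords: list[str], top_k: int = 20) -> list[str]:
        # Blend recency and keyword relevance into one score, then hand only
        # the top_k files to the (expensive) LLM round trips.
        recency, matches = recent_files(repo), keyword_hits(repo, keywords)
        score = {f: 2 * matches[f] + recency[f] for f in set(recency) | set(matches)}
        return sorted(score, key=score.get, reverse=True)[:top_k]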


It really depends on what you're doing. AI is great for generating a ton of text at once but only a small subset of programming tasks clearly benefit from this.

Outside of this it's an autocomplete that's generally two-thirds incorrect. If you keep coding as you normally do and accept correct suggestions as they appear, you'll see a few percentage points of productivity increase.

For highly regular patterns you'll see a drastically improved productivity increase. Sadly this is also a small subset of the programming space.

One exception might be translating user stories into unit tests, but I'm waiting for positive feedback to declare this.


I'm using Copilot in VScode every day, it works fine, but I mostly use it as glorified one-line autocomplete. I almost never accept multi-line suggestions, don't even look at them.

I tried to use AI more deeply, like with aider, but so far I just don't like it. I'm very sensitive to the tiny details of code, and AI almost never gets them right. I guess the main reason I don't like AI is that I love to write code, simple as that. I don't want to automate that part of my work. I'm fine with trivial autocompletes, but I'm not fine with releasing control over the entire code.

What I would love is to automate interaction with other humans. I don't want to talk to colleagues, boss or other people. I want AI to do so and present me some short extracts.


I can give my broader feedback:

- Codegen tools today are still not great: The lack of context and not using LSP really burns down the quality of the generated code.

- Autocomplete is great: it's pretty nice, and IMHO it helps finish your thoughts and code faster. It's like IntelliSense, but better.

If you are working on a greenfield project, AI codegen really shines today and there are many tools in the market for that.

With Aide, we wanted it to work for engineers who spend >= 6 months on the same project and there are deep dependencies between classes/files and the project overall.

For quick answers, I have a renewed habit of going to o1-preview or Sonnet 3.5 and then fact-checking that with Google (I haven't been to Stack Overflow in a long while now).

Do give AI coding a chance; I think you will be excited, to say the least, about what's coming, and you'll develop habits for how to best use the tool.


> Codegen tools today are still not great: The lack of context and not using LSP really burns down the quality of the generated code

Have you tried Aider?

They've done some discovery on this subject, and it currently uses tree-sitter.


Yup, I have.

We also use tree-sitter for the smartness of understanding symbols (https://github.com/codestoryai/sidecar/blob/ba20fb3596c71186...) and also talk to the Language Server through the editor.

What we found was that it's not just about having access to these tools, but about smartly performing `go-to-definition`, `go-to-reference`, etc. to grab the right context as and when required.

Every LLM call in between slows down the response time, so there is a fair bit of heuristics we use today to sidestep that process.


GitHub Copilot in either VS Code or JetBrains IDEs. Having more or less the same experience across multiple tools is lovely and meets me where I am, instead of making me get a new tool.

The chat is okay, the autocomplete is also really pleasant for snippets and anything boilerplate heavy. The context awareness also helps. No advanced features like creating entirely new structures of files, though.

Of course, I’ll probably explore additional tools in the future, but for now LLMs are useful in my coding and also sometimes help me figure out what I should Google, because nowadays seemingly accurate search terms return trash.


Yeah, I am also getting the sense that people want tooling which meets them in their preferred environment.

Do you use any of the AI features that edit multiple files or do a lot more from a single instruction?


Cursor works amazingly well day to day. Copilot is not even comparable there. I like but rarely use aider and Plandex. I'd use them more if the interface didn't take me completely away from the IDE. Currently they're closer to "work on this while I'm taking a break".

Have you tried the latest Copilot with Workspace, where you can use Claude and add files to the context?

I was deep into AI coding experiments since last December before all the VS Code Extensions and IDEs came out.

I wrote a few scripts to get to a semi-automated workflow where I have control over the source code context and the code editing portion, because I believe I can do better than AI in those areas.

Eventually I built my own desktop app which is 16x Prompt: https://prompt.16x.engineer/


I'm still copy/pasting between VS Code and ChatGPT. I just don't want to invest/commit yet because this workflow is good enough for me. It lets me chat about design, architecture, UX, and product in the same context as the code, which I find helpful.

Pros

- Only one subscription needed

- Very simple

- Highly flexible/adaptive to what part of workflow I'm in

Cons

- More legwork

- Copy/pasting sometimes results in errors due to incomplete context


I've been building and using these tools for well more than a year now, so here's my journey on building and using them (ORDER BY DESC datetime).

(1) My view now (Nov 2024) is that code building is very conversational and iterative. You need to be able to tweak aspects of generated code by talking to the LLM. For example: "Can you use a params object instead of individual parameters in addToCart?". You also need the ability to sync generated code into your project, run it, and pipe any errors back into the model for refinement. So basically, a very incremental approach to writing it.

For this I made a Chrome plugin, which allowed ChatGPT and Claude to edit source code (using Chrome's File System APIs). You can see a video here: https://www.youtube.com/watch?v=HHzqlI6LLp8

The code is here, but it's WIP and for very early users, so please don't give negative reviews yet: https://github.com/codespin-ai/codespin-chrome-extension
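To make (1) concrete, here's a bare-bones sketch of that generate, run, and pipe-errors-back loop (not my extension's implementation; it assumes the target is a runnable Python file, and the model and file handling are simplified):

    import subprocess
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o"  # placeholder

    def refine(path: str, request: str, max_rounds: int = 3) -> bool:
        # Ask for an edit, write it into the project, run it, and feed any
        # error output back to the model for the next attempt.
        history = [{"role": "user",
                    "content": f"{request}\n\nCurrent file:\n{open(path).read()}\n\n"
                               "Reply with the full updated file and nothing else."}]
        for _ in range(max_rounds):
            reply = client.chat.completions.create(model=MODEL, messages=history)
            code = reply.choices[0].message.content
            with open(path, "w") as f:
                f.write(code)
            run = subprocess.run(["python", path], capture_output=True, text=True)
            if run.returncode == 0:
                return True  # it runs; stop iterating
            history += [{"role": "assistant", "content": code},
                        {"role": "user", "content": f"Running it failed:\n{run.stderr}\nPlease fix."}]
        return False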

(2) Earlier this year, I thought I should build a VS Code plugin. It actually works quite well and allows you to edit code without leaving VS Code. It does stuff like adding dependencies, model selection, prompt histories, sharing git diffs, etc. Toward the end, I was convinced that edits need to be conversations, and hence I don't use it as much these days.

Link: https://github.com/codespin-ai/codespin-vscode-extension

(3) Prior to that (2023), I built this same thing in CLI. The idea was that you'd include prompt files in your project, and say something like `my-magical-tool gen prompt.md`. Code would be mostly written as markdown prompt files, and almost never edited directly. In the end, I felt that some form of IDE integration is required - which led to the VSCode extension above.

Link: https://github.com/codespin-ai/codespin

All of these tools were primarily built with AI, so these are not hypotheticals. In addition, I've built half a dozen projects with them; some of that is code running in production, and some is hobby stuff like webjsx.org.

Basically, my takeaway is this: code editing is conversational. You need to design a project to be AI-friendly, which means smaller, modular code which can be easily understood by LLMs. Also, my way of using AI is not auto-complete based; I prefer generating from higher level inputs spanning multiple files.


That's a great way to build a tool which solves your need.

In Aide as well, we realised that the major missing loop was the self-correction one; it needs to iteratively expand and do more.

Our proactive agent is our first stab at that, and we also realised that the flow from chat -> edit needs to be very free-form, with the edits a bit more high-level.

I do think you will find value in Aide, do let me know if you got a chance to try it out


>I do think you will find value in Aide, do let me know if you got a chance to try it out

Is there a relationship between Aide and Aider, or is it just a name resemblance?


just a name resemblance, no relationship otherwise

> I do think you will find value in Aide, do let me know if you got a chance to try it out

Absolutely, will do it over the weekend. Best of luck with the launch.


But why use the web interface instead of Copilot, Cursor, Zed, Cline, Aider?

There are some advantages.

1) Cost. More people have ChatGPT/Claude than Copilot. And it's cheaper to load large contexts into ChatGPT than into the API. For example, o1-preview is $15/million tokens via the API, while ChatGPT is a fixed $20/month that someone can use for everything else as well (see the back-of-envelope sketch after this list).

Of course, there are times when I just use the VS Code plugin via API as well.

2) I want to stay in VS Code, so that excludes some of the options you mentioned.

3) I don't find tiled VSCode + ChatGPT much of a hindrance.

4) Things might have improved a bit, but previously the Web-based chat interface was more refined/mature than the integrated interface.
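The back-of-envelope math on point 1 (both prices are the ones quoted above and change often; the monthly token volume is a made-up assumption):

    # o1-preview via the API: $15 per million tokens (as quoted above).
    api_price_per_million = 15.00
    tokens_per_month = 2_000_000          # hypothetical heavy copy/paste usage
    api_cost = tokens_per_month / 1_000_000 * api_price_per_million   # $30.00

    chatgpt_plus_flat = 20.00             # $/month, and it covers everything else too
    print(api_cost, chatgpt_plus_flat)    # 30.0 20.0 -> the flat plan wins at this volume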


I never got the appeal of having the AI directly in your editor. I've tried Copilot and whatever JetBrains are calling their assistant, and I found they mostly just got in the way. So for me it's no AI in the editor, and ChatGPT in a browser for when I do need some help.

Besides Claude.vim for "AI pair programming"? :) (tbh it works well only for small things)

I'm using Codeium and it's pretty decent at picking up the right context automatically; usually it autocompletes quite flawlessly within a ~100 kLoC project. (So far I haven't been using the chat much, just autocomplete.)


Any reason you don't use the chat often, or maybe it's not your use case?

I'm not the parent poster, but in my case I very rarely use it because it's not in the Neovim UI; it opens in a browser.

I've also had some issues where it doesn't seem to work reliably, but that could be related to my setup.


Yeah, I am learning that on Neovim you can own a buffer region and instead use that for AI back-and-forth.. it's a very interesting space.

> what does the AI coding setup of the HN community look like

GitHub Copilot and Copilot Chat in a JetBrains IDE. Cut and paste to Claude for anything else.


Neovim + CopilotChat + CopilotLSP + Copilot subscription. You can dump your context, autocomplete, chat and select Claude or O1. Best deal so far.

Cursor works well - it uses RAG on your code to give context and can directly reference the latest docs of whatever you're using.

Not perfect, but good for incrementally building things/finding bugs.


Using Cursor and it's been great!

Founders care about development experience a lot and it shows.

Yet to try others, but already satisfied so not required.


VS Code + Cline + OpenRouter using the Claude Sonnet 3.5 20241022 model; it's unreal the shit it can do.

VS Code plugins. Codeium at home. GitHub Copilot at work. Both are good. Probably equivalent.

Codeium recently pushed an annoying update that limits your Ctrl-I prompt to one line, and it's lost if you lose focus, e.g. to check another file. There is a GH issue for that.


I tried GH Copilot again recently with Claude. It was complete shit. Dog slow and gave incomplete responses. Back to aider.

What was so bad about it? Genuinely curious, because they did make quite a bit of noise about the integration.

It kept truncating files only about 600 lines long. It also seems to rewrite the entire file each time instead of just sending diffs like aider, making it super slow.

Interestingly, I had this problem with Claude (their web chat) and not Copilot. However, there were times where it was unresponsive.

Oh, I see your point now. It's weird that they are not doing search-and-replace-style editing. Although now that OpenAI also has Predicted Outputs, I think this will improve and it won't make mistakes while rewriting longer files.

The 600-line limit might be due to the output token limit on the LLM (not sure what they are using for the code rewriting).


Yeah I guess it's a response limit. It makes it a deal breaker though.

It's not nearly as helpful as Claude.ai - it seems to only want to do the minimum required. On top of that it will quite regularly ignore what you've asked, give you back the exact code you gave it, or even generate syntactically invalid code.

It's amazing how much difference the prompt must make, because using it is like going back to GPT-3.5, yet it's the same model.


I've seen quite a few YC startups working on AI-powered RPA, and now it looks like a foundational model player is directly competing in their space. It will be interesting to see whether Anthropic will double down on this or leave it to third-party developers to build commercial applications around it.


We're one of those players (https://github.com/Skyvern-AI/skyvern) and we're definitely watching the space with a lot of excitement

We thought it was inevitable that OpenAI / Anthropic would veer into this space and start to become competitive with us. We actually expected OpenAI to do it first!

What this confirms is that there is significant interest in computer/browser automation, and the problem is still unsolved. We will see whether the automation itself is an application-layer problem (our approach) or whether the model needs to be intertwined with the application (Anthropic's approach here).


Has anyone seen AI agents working in production at scale? It doesn't matter if you're using Swarm, LangChain, or any other orchestration framework if the underlying issue is that AI agents are too slow, too expensive, and too unreliable. I wrote about AI agent hype vs. reality[0] a while ago, and I don't think it has changed yet.

[0] https://www.kadoa.com/blog/ai-agents-hype-vs-reality


Yes, we use agents in a human-support-agent-facing application. It has many sub-agents used to summarize and analyze a lot of different models, prior support cases, knowledge base information, third-party data sets, etc., to form an expert on a specific customer and their unique situation when detecting potential fraud and other cases. The goal of the expert is to reduce the cognitive load on our support agent in analyzing an often complex situation with lots of information more rapidly and reliably. Because there is no right answer and the goal is error reduction, not elimination, it's not necessary to have determinism, just to do better than a human at understanding a lot of divergent information rapidly and answering various queries. Cost isn't an issue because the decisions are high value. Speed isn't an issue because the alternative is a human attempting to make sense of an enormous amount of information in many systems. It has dramatically improved our precision and recall over pure humans.


Isn’t the best customer service:

    Cost to Solve < Remaining LTV * Profit Margin
In other words, do the details matter? If the customer leaves because you don’t take a fraudulent $10 return, but he’s worth $1,000 in the long term, that’s dumb.

You might think that such a user doesn’t exist. Then you’d be getting the details wrong again! Example: Should ISPs disconnect users for piracy? Should Apple close your iCloud sub for pirating Apple TV? Should Amazon lose accounts for rejecting returns? Etc etc.

A business that makes CS more detail-oriented is 200% the wrong solution.


The fraud we deal with is a lot more than $10.


Do you find that the entities committing fraud are using generative AI tools to facilitate the crimes?


They use every tool you can imagine. Most are not imaginative but many are profoundly smart. They could do anything they set their minds to and for some reason this is what they do.


The problem with agents is divergence. Very quickly, an ensemble of agents will start doing their own things and it’s impossible to get something that consistently gets to your desired state.

There is a whole class of problems that do not require low latency. But not having consistency makes them pretty useless.

Frameworks don't solve that. You'll probably need some sort of ground-truth injection at every sub-agent level. I.e., you just need data.

Totally agree with you. Unreliability is the thing that needs solving first.


> The problem with agents is divergence. Very quickly, an ensemble of agents will start doing their own things and it’s impossible to get something that consistently gets to your desired state.

Sounds like management to me.


Sticky goals and reevaluation of tasks are one way to keep the end result on track.

How does gpt o1 solve this?


I use my own agent all day, every day. Here is one example: https://x.com/xundecidability/status/1835085853506650269

I've been using the general agent to build specialised sub-agents. Here's an example search agent beating perplexity: https://x.com/xundecidability/status/1835059091506450493


Do you have any code to share?

I'm failing to see the point of the example, unless the agents can do things on multiple threads. For example, let's say we have Boss Agent.

I can ask Boss agent to organize a trip for five people to the Netherlands.

Boss Agent can ask some basic questions about where my friends are traveling from and what our budget is.

Then the travel agent can go and look up how we each can get there, the hotel agent can search for hotel prices, the weather agent can make sure it's nice out, and the sightseeing agent can suggest things for us to do. And I guess the correspondence agent can send out emails to my actual friends.

If this is multi-threaded, you could get a ton of work done much faster. But if it's all running on a single thread anyway, then couldn't Boss Agent just switch functionality after completing each job?


That particular task didn't need parallel agents or any of the advanced features.

The prompt was: <prompt> Research claude pricing with caching and then review a conversation history to calculate the cost. First, search online for pricing for anthropic api with and without caching enabled for all of the models: claude-3-haiku, claude-3-opus and claude-3.5-sonnet (sonnet 3.5). Create a json file with ALL the pricing data.

from the llm history db, fetch the response.response_json.usage for each result under conversation_id=01j7jzcbxzrspg7qz9h8xbq1ww llm_db=$(llm logs path) schema=$(sqlite3 $llm_db '.schema') example usage: { "input_tokens": 1086, "output_tokens": 1154, "cache_creation_input_tokens": 2364, "cache_read_input_tokens": 0 }

Calculate the actual costs of each prompt by using the usage object for each response based the actual token usage cached or not. Also calculate/simulate what it would have cost if the tokens where not cached. Create interactive graphs of different kinds to show the real cost of conversation, the cache usage, and a comparison to what it would have costed without caching.

Write to intermediary files along the way.

Ask me if anything is unclear. </prompt>
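For reference, the core of that cost calculation is small. A sketch (the per-million-token prices and cache multipliers here are assumptions, not the agent's actual output):

    # Per-million-token prices (assumed): normal input, output, cache write, cache read.
    PRICES = {"claude-3-5-sonnet": {"input": 3.00, "output": 15.00,
                                    "cache_write": 3.75, "cache_read": 0.30}}

    def cost(usage: dict, model: str = "claude-3-5-sonnet") -> float:
        p = PRICES[model]
        return (usage.get("input_tokens", 0) * p["input"]
                + usage.get("output_tokens", 0) * p["output"]
                + usage.get("cache_creation_input_tokens", 0) * p["cache_write"]
                + usage.get("cache_read_input_tokens", 0) * p["cache_read"]) / 1_000_000

    usage = {"input_tokens": 1086, "output_tokens": 1154,
             "cache_creation_input_tokens": 2364, "cache_read_input_tokens": 0}
    print(cost(usage))   # ~0.029433 with these assumed prices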

I just gave it your task and I'll share the results tomorrow (I'm off to bed).


True. In the classic form of automation, reasoning is externalized into rules. In the case of AI agents, reasoning is internalized within a language model. This is a fundamental difference. The problem is that language models are not designed to reason. They are designed to predict the next most likely word. They mimic human skills but possess no general intelligence. They are not ready to function without a human in the loop. So, what are the implications of this new form of automation that AI agents represent? https://www.lycee.ai/blog/ai-agents-automation-eng


I want to hear more about this. I'm playing with Langroid, crewAI, and DSPy, and they all layer so many abstractions on top of a shifting LLM landscape. I can't believe anyone is really using them in the way their README goals profess.


Not you in particular, but I hear this common refrain that the "LLM landscape is shifting", but what exactly is shifting? Yes, new models are constantly announced, but at the end of the day, interacting with LLMs involves making calls to an API, and the OpenAI API (and perhaps Anthropic's variant) has become fairly established. This API will obviously not change significantly any time soon.
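To make that concrete, the call surface in question is roughly this (a minimal sketch with the OpenAI Python SDK; the base_url and model values are placeholders):

    from openai import OpenAI

    # Many providers and local servers expose an OpenAI-compatible endpoint,
    # which is why the "shifting landscape" is often just a model-name swap.
    client = OpenAI()  # or OpenAI(base_url="http://localhost:8000/v1", api_key="...")

    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # swap the model, keep the call
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Say hello in one sentence."},
        ],
    )
    print(resp.choices[0].message.content)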

Given that there is (a fairly standard) API to interact with LLMs, the next question is, what abstractions and primitives help easily build applications on top of these, while giving enough flexibility for complex use cases.

The features in Langroid have evolved in response to the requirements of various use-cases that arose while building applications for clients, or companies that have requested them.


Sonnet 3.5 and other large-context models made context management approaches irrelevant and will continue to do so.

o1 (and likely Sonnet 3.5) made chain-of-thought and other complex prompt engineering irrelevant.

The Realtime API (and others that will soon follow) will make the best VTT > LLM > TTV pipelines irrelevant.

VLMs will likely make LLMs irrelevant. Who knows what Google has planned for Gemini 2.

The point is that building these complex agents has been proven a waste of time over and over again, at least until we see a plateau in models. It's much easier to swap in a single API call and modify one or two prompts than to rework a convoluted agentic approach, especially when it's very clear that the same prompts can't be reused reliably between different models.


I encourage you to run evals on result quality for real b2b tasks before making these claims. Almost all of your post is measurably wrong in ways that cause customers to churn an AI product same-day.


I appreciate your comment.

I suppose my comment is reserved more for the documentation than the actual models in the wild?

I do worry that LLM service providers won't do any better than REST API providers at versioning their backends. Even if we specify the model in the call to the API, it feels like it will silently be upgraded behind the scenes. There are so many parameters that could be adjusted to "improve" the experience for users even if the weights don't change.

I prefer to use open-weight models when possible. But so many agentic frameworks, like this one (to be fair, I would not expect OpenAI to offer a framework that works local-first), treat the local LLM experience as second class, at best.


Years ago we complained about the speed with which new JavaScript frameworks were popping into existence. Today it goes an order of magnitude faster, and the quality of the outputs can only suffer. Yes, there's code, but it's so-so; interfaces and APIs change dramatically, and the documentation is a few versions behind. Whoever has time to compare simply cannot do it in depth, and ideas also get dropped along the way. I don't want to call it a mess because that's too negative; having many ideas is great, but I feel we're still in the brainstorming phase.


> The underlying issue is that AI agents are too slow,

Inference speed is being rapidly optimized, especially for edge devices.

> too expensive,

The half-life of OpenAI's API pricing is a couple of months. While the bleeding-edge model is always costly, API costs rapidly become accessible to the public.

> and too unreliable

Out of the 3 points raised, this is probably the most up in the air. Personally I chalk this up to side effects of OpenAI's rapid growth over the last few years. I think this gets solved, especially once price and latency have been figured out.

IMO, the biggest unknown here isn't a technical one, but rather a business one: I don't think it's certain that products built on multi-agent architectures will be addressing a need for end users. Most of the talk I see in this space is by people excited about building with LLMs, not by people who are asking to pay for these products.


Frankly, what you are describing is a money-printing machine. You should expect anyone who has figured out such a thing to keep it as a trade secret, until the FOSS community figures out and publishes something comparable.

I don't think the tech is ready yet for other reasons, but the absence of anyone publishing is not good evidence against it.


Agents can work in production, but usually only when they are closer to "workflows" that are very targeted to a specific use case.


If the solution the agents create is immediately useful, then waiting a few minutes or longer for the answer is fine.


Yes, I've built a lot of stuff (in batch, not to respond to user queries). Mostly large-scale code generation and testing tasks.


We've been working on AI-automated web scraping at Kadoa[0], and our early experiments were similar to those in the article. We started when only the expensive and slow GPT-3 was available, which pushed us to develop a cost-effective solution at scale.

Here is what we ended up with:

- Extraction: We use codegen to generate CSS selectors or XPath extraction code. Using an LLM for every data extraction would be expensive and slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.

- Cleansing & transformation: We use small fine-tuned LLMs to clean and map data into the desired format.

- Validation: Unstructured data is a pain to validate. Alongside traditional data validation methods like reverse search, we use LLM-as-a-judge to evaluate the data quality.

We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.

Combining traditional ETL engineering methods with small, well-evaluated LLM steps was the way to go for us.

[0] https://kadoa.com
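To illustrate the codegen idea, a stripped-down sketch (model, prompt, and field names are placeholders here, not our production pipeline):

    import json
    from bs4 import BeautifulSoup   # assumes beautifulsoup4 is installed
    from openai import OpenAI

    client = OpenAI()

    def generate_selectors(sample_html: str, fields: list[str]) -> dict:
        # One LLM call per site (not per page): ask for a CSS selector per field.
        prompt = (f"Given this HTML, return a JSON object mapping each of {fields} "
                  f"to a CSS selector that extracts it:\n\n{sample_html[:20000]}")
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(resp.choices[0].message.content)

    def scrape(html: str, selectors: dict) -> dict:
        # Cheap, deterministic extraction on every subsequent page.
        soup = BeautifulSoup(html, "html.parser")
        result = {}
        for field, sel in selectors.items():
            el = soup.select_one(sel)
            result[field] = el.get_text(strip=True) if el else None
        return result

When a site changes and the selectors stop matching, the generation step simply runs again against the new HTML.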


this! I've been following Kadoa since its very first days. Great team.


As an engineer who became a founder, I cannot recommend the book enough. Whether you're improving your landing page, writing your pitch deck, reaching out to customers, or developing your company and product strategy, communicating effectively in writing is a crucial skill.

Invest in improving your writing skills. It will pay dividends in every aspect of your business.


The post is not about the book. In fact, as he commented here, he had never heard of the book.


Curious to hear some real-world use cases for fine-tuning and what your setup looks like in terms of:

- frameworks

- hardware

- foundation models (OS vs 3rd party)

- evaluation & training data

Did the performance gains justify the investment for fine-tuning?


I think vision models have a much bigger ROI from fine-tuning than language models. That being said, I do consider fine-tunes to be helpful in improving smaller models as long as the domain is limited in scope. In other words, fine-tuning allows you to get similar performance out of smaller models, but improved performance in larger models seems pretty elusive at the moment, albeit somewhat possible with very creatively engineered training datasets.

An example of this is DeepSeek-Coder, which can essentially be considered a fine-tune of a fine-tune of a Mixtral model. It performs very similarly to Claude 3.5 Sonnet, which is pretty damn impressive, but it does it at less than 1/10th the cost.

What I don't understand, though, is why anyone would even remotely consider fine-tuning a GPT-4o model that they will never fully own, when they could spend the same resources on fine-tuning a Llama 3.1 model that they will own. And even if you absolutely don't care about ownership (???), why not do a fine-tune of an Anthropic model, which is already significantly better than GPT-4o? At this point, with the laggard performance of OpenAI and their shameless attempts at regulatory hostility to competitors, I can't imagine ever giving them any of my money, let alone letting them own my derivative work.


"An example of this is DeepSeek-Coder, which can essentially be considered a fine-tune of a fine-tune of a Mixtral model"

I've not heard that anywhere else. My impression from https://huggingface.co/deepseek-ai/deepseek-coder-33b-base was that DeepSeek-Coder was trained from scratch.


The current DeepSeek-Coder version (v2) is actually a fine-tune of the DeepSeek v2 model and was not trained from scratch.

I'm now getting conflicting information about the origin of the DeepSeek MoE framework, so I may be wrong about it starting with a Mixtral model.


Very cool to see more YC hard tech startups emerging [0]. These are the kind of moonshot projects I love to see getting funded (instead of just SaaS and AI).

[0] https://www.ycombinator.com/companies/industry/hard-tech


You mention validation and schema guarantees as key features for high accuracy. Are you using an LLM-as-a-judge combined with traditional checks for this?


Yes, we combine LLM-as-a-judge with traditional checks like reverse search in the original data sources, defining your own post-processing logic, and a simple classifier for confidence scores.
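For the curious, an LLM-as-a-judge check can be as small as this (a sketch; the model, rubric, and acceptance threshold are illustrative assumptions, not our production setup):

    import json
    from openai import OpenAI

    client = OpenAI()

    def judge_extraction(source_text: str, extracted: dict) -> dict:
        # Ask a model to grade the extracted record against the source document.
        prompt = (
            "You are a strict data-quality judge. Given the source text and the "
            "extracted record, return JSON with 'score' (0-1) and 'issues' (list of "
            "strings) for values that are missing, malformed, or unsupported by the "
            f"source.\n\nSource:\n{source_text}\n\nExtracted:\n{json.dumps(extracted)}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        verdict = json.loads(resp.choices[0].message.content)
        verdict["accepted"] = verdict.get("score", 0) >= 0.8   # assumed threshold
        return verdict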


Fascinating how traditionally very complex and hard ML problems are slowly becoming commodities with AI:

- transcription

- machine translation

- OCR

- image recognition

