At the same time, accurate document extraction is becoming a commodity with powerful VLMs. Are you planning to focus on a specific industry, or how do you plan to differentiate?
Yes, there is definitely a boom in document-related startups. We see our niche as focusing on non-technical users. We have focused on making it easy to build schemas, providing an audit-and-review experience, and integrating into downstream applications.
Hey we're on that list! Congrats on the launch Max & team!
I could definitely point to minor differences between all the platforms, but you're right that everyone is tackling the same unstructured data problem.
In general, I think it will be a couple years before anyone really butts heads in the market. The problem space is just that big. I'm constantly blown away by how big the document problem is at these mid-sized businesses. And most of these companies don't have any engineers on staff. So no attempt has ever been made to fix it.
Totally agree that this is becoming the standard "reference architecture" for this kind of pipeline. The only thing that complicates this a lot today is complex inputs. For simple 1-2 page PDFs, what you describe works quite well out of the box, but for 100+ page docs it starts to fall over in ways I described in another comment.
Are really large inputs solved at Midship? If so, I'd consider that a differentiator (at least today). The demo's limited to 15 pages, and I don't see any marketing around long-context or complex inputs on the site.
I suspect this problem gets solved in the next iteration or two of commodity models. In the meantime, being smart about how the context gets divvied up works ok.
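For what it's worth, here's a naive sketch of one way to divvy up the context for long documents (overlapping page windows, merged field by field); not claiming this is what any particular product does, and `extract_fields` is a stand-in for whatever single-call extractor you use:

```python
# Minimal sketch of page-window chunking for long documents (illustrative only).
from typing import Callable

def extract_long_doc(
    pages: list[str],
    extract_fields: Callable[[str], dict],  # hypothetical single-call extractor
    window: int = 10,
    overlap: int = 2,
) -> dict:
    """Run extraction over overlapping page windows and merge the results."""
    merged: dict = {}
    step = window - overlap
    for start in range(0, len(pages), step):
        chunk = "\n\n".join(pages[start:start + window])
        result = extract_fields(chunk)
        # Later windows only fill fields that earlier windows left empty.
        for key, value in result.items():
            if value not in (None, "", []) and key not in merged:
                merged[key] = value
    return merged
```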
I do like the UI you appear to have for citing information: drawing polygons around the data and then showing where they appear in the PDF. Nice.
Having the text (for now) is still pretty important for quality output. The vision models are quite good, but not a replacement for a quality OCR step. A combination of Text + Vision is compelling too.
Execution is everything. Not to drop a link in someone else’s HN launch but I’m building https://therapy-forms.com and these guys are way ahead of me on UI, polish, and probably overall quality. I do think there’s plenty of slightly different niches here, but even if there were not, execution is everything. Heck it’s likely I’ll wind up as a midship customer, my spare time to fiddle with OCR models is desperately limited and all I want to do is sell to clinics.
TableFlow co-founder here - I don't want to distract from the Midship launch (congrats!) but did want to add my 2 cents.
We see a ton of industries/use-cases still bogged down by manual workflows that start with data extraction. These are often large companies throwing many people at the issue ($$). The vast majority of these companies lack technical teams required to leverage VLMs directly (or at least the desire to manage their own software). There’s a ton of room for tailored solutions here, and I don't think it's a winner-take-all space.
+1 to what mitch said. We believe there is a large market of non-technical users who can now automate extraction tasks but do not know how to interact with APIs. Midship is another option for them that requires zero programming!
I tried Cursor and found it annoying. I don’t really like talking to AI in IDE chat windows. For whatever reason, I really prefer a web browser. I also didn’t like the overall experience.
I’m still using Copilot in VS Code every day. I recently switched from OpenAI to Claude for the browser-based chat stuff and I really like it. The UI for coding assistance in Claude is excellent. Very well thought out.
Claude also has a nice feature called Projects where you can upload a bunch of stuff to build context which is great - so for instance if you are doing an API integration you can dump all the API docs into the project and then every chat you have has that context available.
As with all the AI tools you have to be quite careful. I do find that errors slip into my code more easily when I am not writing it all myself. Reading (or worse, skimming) source code is just different than writing it. However, between type safety and unit testing, I find I get rid of the bugs pretty quickly and overall my productivity is multiples of what it was before.
I am on day 8 of Cursor's 14-day trial. If things continue to go well, I will be switching from Webstorm to Cursor for my Typescript projects.
The AI integrations are a huge productivity boost. There is a substantial difference in the quality of the AI suggestions between using Claude on the side, and having Claude be deeply integrated in the codebase.
I think I accepted about 60-70% of the suggestions Cursor provided.
Some highlights of Cursor:
- Wrote about 80% of a Vite plugin for consolidating articles in my blog (built on remix.run)
- Wrote a Github Action for automated deployments. Using Cursor to write automation scripts is a tangible productivity boost.
- Made meaningful alterations to a libpg_query fork that allowed it to be cross-compiled to iOS. I have very little experience with C compilation; it would have taken me a very long time to figure this out on my own.
There are some downsides to using Cursor though:
- Cursor can get too eager with its suggestions, and I'm not seeing any easy way to temporarily or conditionally turn them off. This was especially bad when I was writing blog posts.
- Cursor does really well with Bash and Typescript, but does not work very well with Kotlin or Swift.
- This is a personal thing, but I'm still not used to some of the shortcuts that Cursor uses (Cursor is built on top of VSCode).
I would not be able to leave a JetBrains product for Kotlin, or Xcode for Swift.
Overall it's so unfortunate that JetBrains doesn't have a Cursor-level AI plugin* because JetBrains IDEs by themselves are so much more powerful than base-level VS Code that it actually erases some small portion of the gains from AI...
(* people will link many Jetbrains AI plugins, but none are polished enough)
I probably would switch to Cursor for Swift projects too if it weren't for the fact that I will still need Xcode to compile the app.
I also agree with the non-AI parts of JetBrains products being much better than the non-AI parts of Cursor. JetBrains' refactoring tools are still unmatched.
That said, I think the AI part is compelling enough to warrant the switch. There are code rewrite tasks that JetBrains would struggle with, that LLMs can do fairly easily.
JetBrains is very interesting; what are the best-performing extensions out there for it?
I do wonder what API-level access we get over there as well. For sidecar to run, we need LSP + a web/panel for the UX part (deeper editor-layer access like the undo and redo stack would also be cool, but not totally necessary).
It's great that Cursor is working for you. I do think LLMs in general are far, far better on TypeScript and Python compared to other languages (a reflection of the training data).
What features of cursor were the most compelling to you? I know their autocomplete experience is elite but wondering if there are other features which you use often!
Their autocomplete experience is decent, but I've gotten the most value out of Cursor's "chat + codebase context" (no idea what it's called). The feature where you feed it the entire codebase as part of the context, and let Cursor suggest changes to any parts of the codebase.
ohh interesting.. I tried it on a couple of big repos and it was a bit of a miss for me. How large are the codebases you work on? I want to get a sense check on where the behavior deteriorates with embedding + GPT-3.5-based reranker search (not sure if they are doing more now!)
that's a good metric to aim for... creating a full local index for 600k lines is pretty expensive, but there are a bunch of heuristics which can take us pretty far:
- looking at git commits
- making use of recently accessed files
- keyword search
If I set these constraints and allow for maybe around 2 LLM round trips, we can get pretty far in terms of performance.
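For concreteness, a rough sketch of what those cheap heuristics could look like in practice (illustrative only, not Aide's or Cursor's actual code; the recently-accessed-files signal would come from the editor rather than git):

```python
# Cheap candidate-file selection before spending any LLM round trips.
import subprocess
from pathlib import Path

def recently_committed_files(repo: str, n_commits: int = 20) -> list[str]:
    # Files touched by the last few commits, deduped but order-preserving.
    out = subprocess.run(
        ["git", "-C", repo, "log", f"-{n_commits}", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in dict.fromkeys(out.splitlines()) if line.strip()]

def keyword_hits(repo: str, keywords: list[str], exts=(".ts", ".py")) -> list[str]:
    # Naive keyword search; a real tool would use ripgrep or a prebuilt index.
    hits = []
    for path in Path(repo).rglob("*"):
        if path.suffix in exts and path.is_file():
            text = path.read_text(errors="ignore")
            if any(k in text for k in keywords):
                hits.append(str(path))
    return hits

def candidate_files(repo: str, keywords: list[str], limit: int = 30) -> list[str]:
    ranked = recently_committed_files(repo) + keyword_hits(repo, keywords)
    return list(dict.fromkeys(ranked))[:limit]  # dedupe, keep order, cap context size
```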
It really depends on what you're doing. AI is great for generating a ton of text at once but only a small subset of programming tasks clearly benefit from this.
Outside of this it's an autocomplete that's generally 2/3rds incorrect. If you keep coding as you normally do and accept correct solutions as they appear you'll see a few percentage productivity increase.
For highly regular patterns you'll see a drastically improved productivity increase. Sadly this is also a small subset of the programming space.
One exception might be translating user stories into unit tests, but I'm waiting for positive feedback to declare this.
I'm using Copilot in VScode every day, it works fine, but I mostly use it as glorified one-line autocomplete. I almost never accept multi-line suggestions, don't even look at them.
I tried to use AI deeper, like using aider, but so far I just don't like it. I'm very sensitive to the tiny details of code and AI almost never got it right. I guess actually the main reason that I don't like AI is that I love to write code, simple as that. I don't want to automate that part of my work. I'm fine with trivial autocompletes, but I'm not fine with releasing control over the entire code.
What I would love is to automate interaction with other humans. I don't want to talk to colleagues, boss or other people. I want AI to do so and present me some short extracts.
I can give my broader feedback:
- Codegen tools today are still not great:
The lack of context and not using the LSP really hurts the quality of the generated code.
- Autocomplete is great
Autocomplete is pretty nice; IMHO it helps finish your thoughts and code faster. It's like IntelliSense, but better.
If you are working on a greenfield project, AI codegen really shines today and there are many tools in the market for that.
With Aide, we wanted it to work for engineers who spend >= 6 months on the same project and there are deep dependencies between classes/files and the project overall.
For quick answers, I have a renewed habit of going to o1-preview or Sonnet 3.5 and then fact-checking that with Google (haven't been to Stack Overflow in a long while now).
Do give AI coding a chance, I think you will be excited to say the least for the coming future and develop habits on how to best use the tool.
What we found was that it's not just about having access to these tools, but about smartly performing `go-to-definition`, `go-to-reference`, etc. to grab the right context as and when required.
Every LLM call in between slows down the response time, so there are a fair number of heuristics which we use today to sidestep that process.
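To make that concrete, here's a minimal sketch of on-demand context expansion via go-to-definition; the `LspClient` and `read_snippet` pieces are hypothetical stand-ins, not Aide's actual interfaces:

```python
from typing import Callable, Protocol

class LspClient(Protocol):  # hypothetical interface; real LSP clients differ
    def definition(self, file: str, line: int, col: int) -> tuple[str, int] | None: ...

def expand_context(
    lsp: LspClient,
    file: str,
    symbols: list[tuple[int, int]],           # (line, col) of symbols the model asked about
    read_snippet: Callable[[str, int], str],  # e.g. returns ~30 lines around a location
    budget: int = 5,
) -> list[str]:
    """Follow definitions of requested symbols, up to a budget, instead of
    shipping the whole repo as context."""
    snippets: list[str] = []
    seen: set[tuple[str, int]] = set()
    for line, col in symbols:
        if len(snippets) >= budget:
            break  # cap the extra context
        target = lsp.definition(file, line, col)
        if target and target not in seen:
            seen.add(target)
            snippets.append(read_snippet(*target))
    return snippets
```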
GitHub Copilot in either VS Code or JetBrains IDEs. Having more or less the same experience across multiple tools is lovely and meets me where I am, instead of making me get a new tool.
The chat is okay, the autocomplete is also really pleasant for snippets and anything boilerplate heavy. The context awareness also helps. No advanced features like creating entirely new structures of files, though.
Of course, I’ll probably explore additional tools in the future, but for now LLMs are useful in my coding and also sometimes help me figure out what I should Google, because nowadays seemingly accurate search terms return trash.
Cursor works amazingly well day to day. Copilot is not even comparable there. I like but rarely use aider and plandex. I'd use them more if the interface didn't take me completely away from the IDE. Currently they're closer to "work on this while I'm taking a break".
I've been deep into AI coding experiments since last December, before all the VS Code extensions and IDEs came out.
I wrote a few scripts to get to a semi-automated workflow where I have control over the source code context and the code editing portion, because I believe I can do better than AI in those areas.
I'm still copy/pasting between VS Code and ChatGPT. I just don't want to invest/commit yet because this workflow is good enough for me. It lets me chat about design, architecture, UX, and product in the same context as the code, which I find helpful.
Pros
- Only one subscription needed
- Very simple
- Highly flexible/adaptive to what part of workflow I'm in
Cons
- More legwork
- Copy/pasting sometimes results in errors due to incomplete context
I've been building and using these tools for well over a year now, so here's my journey building and using them (ORDER BY datetime DESC).
(1) My view now (Nov 2024) is that code building is very conversational and iterative. You need to be able to tweak aspects of generated code by talking to the LLM. For example: "Can you use a params object instead of individual parameters in addToCart?". You also need the ability to sync generated code into your project, run it, and pipe any errors back into the model for refinement. So basically, a very incremental approach to writing it.
For this I made a Chrome plugin, which allowed ChatGPT and Claude to edit source code (using Chrome's File System APIs). You can see a video here: https://www.youtube.com/watch?v=HHzqlI6LLp8
(2) Earlier this year, I thought I should build a VS Code plugin. It actually works quite well, allows you to edit code without leaving VSCode. It does stuff like adding dependencies, model selection, prompt histories, sharing git diffs etc. Towards the end, I was convinced that edits need to be conversations, and hence I don't use it as much these days.
(3) Prior to that (2023), I built this same thing in CLI. The idea was that you'd include prompt files in your project, and say something like `my-magical-tool gen prompt.md`. Code would be mostly written as markdown prompt files, and almost never edited directly. In the end, I felt that some form of IDE integration is required - which led to the VSCode extension above.
All of these tools were primarily built with AI. So these are not hypotheticals. In addition, I've built half a dozen projects with it; some of it code running in production and hobby stuff like webjsx.org.
Basically, my takeaway is this: code editing is conversational. You need to design a project to be AI-friendly, which means smaller, modular code which can be easily understood by LLMs. Also, my way of using AI is not auto-complete based; I prefer generating from higher level inputs spanning multiple files.
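To illustrate the loop described above (generate, sync into the project, run, feed errors back), here's a bare-bones sketch; `ask_llm` is a stand-in for whatever chat API or plugin carries the conversation, not any of the tools mentioned:

```python
# Minimal generate -> write -> run -> refine loop for conversational code editing.
import subprocess

def refine(ask_llm, prompt: str, path: str, test_cmd: list[str], max_rounds: int = 3) -> str:
    code = ask_llm(prompt)
    for _ in range(max_rounds):
        with open(path, "w") as f:
            f.write(code)                       # sync the generated code into the project
        run = subprocess.run(test_cmd, capture_output=True, text=True)
        if run.returncode == 0:
            return code                         # tests pass, stop iterating
        # Pipe the failure back into the conversation and ask for a fix.
        code = ask_llm(f"The code in {path} failed with:\n{run.stderr}\nPlease fix it.")
    return code
```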
That's a great way to build a tool that solves your need.
In Aide as well, we realised that the major missing loop was the self-correction one; it needs to iteratively expand and do more.
Our proactive agent is our first stab at that, and we also realised that the flow from chat -> edit needs to be very free-form, and the edits are a bit more high level.
I do think you will find value in Aide; do let me know if you get a chance to try it out.
1) Cost. More people have ChatGPT/Claude than Copilot. And it's cheaper to load large contexts into ChatGPT than into the API. For example, o1-preview is $15/million tokens via the API, while ChatGPT is a fixed $20/month that someone can use for everything else as well.
Of course, there are times when I just use the VS Code plugin via API as well.
2) I want to stay in VS Code. So that excludes some of the options you mentioned
3) I don't find tiled VSCode + ChatGPT much of a hindrance.
4) Things might have improved a bit, but previously the Web-based chat interface was more refined/mature than the integrated interface.
I never got the appeal of having the AI directly in your editor, I've tried Copilot and whatever JetBrains are calling their assistant and I found it mostly just got in the way. So for me it's no AI in editor and ChatGPT in a browser for when I do need some help.
Besides Claude.vim for "AI pair programming"? :)
(tbh it works well only for small things)
I'm using Codeium and it's pretty decent at picking up the right context automatically; it usually autocompletes quite flawlessly within a ~100 kLoC project. (So far I haven't been using the chat much, just autocomplete.)
VS Code plugins. Codeium at home. GitHub Copilot at work. Both are good. Probably equivalent.
Codeium recently pushed an annoying update that limits your Ctrl-I prompt to one line, and the prompt is lost if you lose focus, e.g. to check another file. There is a GH issue for that.
It kept truncating files only about 600 lines long. It also seems to rewrite the entire file each time instead of just sending diffs like aider, making it super slow.
oh, I see your point now. It's weird that they are not doing search-and-replace style editing.
Although now that OpenAI also has Predicted Outputs, I think this will improve and it won't make mistakes while rewriting longer files.
The 600-line limit might be due to the output token limit on the LLM (not sure what they are using for the code rewriting).
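For anyone unfamiliar, search-and-replace style editing just means the model emits the snippet to find plus its replacement, and the tool patches that span instead of re-emitting the whole file. A minimal sketch (illustrative format, not aider's exact one):

```python
def apply_edit(source: str, search: str, replace: str) -> str:
    # Refuse ambiguous or missing matches rather than corrupting the file.
    if source.count(search) != 1:
        raise ValueError("SEARCH block must match exactly once")
    return source.replace(search, replace, 1)

original = "def total(items):\n    return sum(items)\n"
patched = apply_edit(
    original,
    search="    return sum(items)\n",
    replace="    return sum(i.price for i in items)\n",
)
```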
It's not nearly as helpful as Claude.ai - it seems to only want to do the minimum required. On top of that it will quite regularly ignore what you've asked, give you back the exact code you gave it, or even generate syntactically invalid code.
It's amazing how much difference the prompt must make, because using it is like going back to GPT-3.5, yet it's the same model.
I've seen quite a few YC startups working on AI-powered RPA, and now it looks like a foundational model player is directly competing in their space. It will be interesting to see whether Anthropic will double down on this or leave it to third-party developers to build commercial applications around it.
We thought it was inevitable that OpenAI / Anthropic would veer into this space and start to become competitive with us. We actually expected OpenAI to do it first!
What this confirms is that there is significant interest in computer / browser automation, and the problem is still unsolved. We will see whether the automation itself is an application-layer problem (our approach) or whether the model needs to be intertwined with the application (Anthropic's approach here).
Has anyone seen AI agents working in production at scale? It doesn't matter if you're using Swarm, LangChain, or any other orchestration framework if the underlying issue is that AI agents are too slow, too expensive, and too unreliable. I wrote about AI agent hype vs. reality[0] a while ago, and I don't think it has changed yet.
Yes, we use agents in an application facing human support agents. It has many sub-agents used to summarize and analyze a lot of different models, prior support cases, knowledge-base information, third-party data sets, etc., to form an expert on a specific customer and their unique situation when detecting potential fraud and other cases. The goal of the expert is to reduce the cognitive load on our support agent when analyzing an often complex situation with lots of information, more rapidly and reliably. Because there is no right answer and the goal is error reduction, not elimination, determinism isn't necessary; it just has to do better than a human at understanding a lot of divergent information rapidly and answering various queries. Cost isn't an issue because the decisions are high value. Speed isn't an issue because the alternative is a human attempting to make sense of an enormous amount of information in many systems. It has dramatically improved our precision and recall over pure humans.
In other words, do the details matter? If the customer leaves because you don’t take a fraudulent $10 return, but he’s worth $1,000 in the long term, that’s dumb.
You might think that such a user doesn’t exist. Then you’d be getting the details wrong again! Example: Should ISPs disconnect users for piracy? Should Apple close your iCloud sub for pirating Apple TV? Should Amazon lose accounts for rejecting returns? Etc etc.
A business that makes CS more detail-oriented is 200% the wrong solution.
They use every tool you can imagine. Most are not imaginative but many are profoundly smart. They could do anything they set their minds to and for some reason this is what they do.
The problem with agents is divergence. Very quickly, an ensemble of agents will start doing their own things and it’s impossible to get something that consistently gets to your desired state.
There are a whole class of problems that do not require low-latency. But not having consistency makes them pretty useless.
Frameworks don’t solve that. You’ll probably need some sort of ground-truth injection at every sub-agent level. I.e., you just need data.
Totally agree with you. Unreliability is the thing that needs solving first.
> The problem with agents is divergence. Very quickly, an ensemble of agents will start doing their own things and it’s impossible to get something that consistently gets to your desired state.
I'm failing to see the point in the example, unless the agents can do things on multiple threads. For example, let's say we have Boss Agent.
I can ask Boss agent to organize a trip for five people to the Netherlands.
Boss agent can ask some basic questions about where my friends are traveling from and what our budget is.
Then travel agent can go and look up how we each can get there, hotel agent can search for hotel prices, weather agent can make sure it's nice out, sightseeing agent can suggest things for us to do. And I guess correspondence agent can send out emails to my actual friends.
If this is multi-threaded, you could get a ton of work done much faster. But if it's all running on a single thread anyway, then couldn't the boss agent just switch functionality after completing each job?
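To make the fan-out concrete, here's a toy sketch with hypothetical sub-agents; run concurrently they overlap their waiting time, while a single-threaded boss would simply call them one after another:

```python
import asyncio

# Hypothetical sub-agents; in practice each would wrap its own LLM/tool calls.
async def travel_agent(friends, budget): return f"routes for {len(friends)} people"
async def hotel_agent(city, budget): return f"hotels in {city} under {budget}"
async def weather_agent(city): return f"forecast for {city}"
async def sightseeing_agent(city): return f"things to do in {city}"

async def boss_agent(friends, budget, city="Netherlands"):
    # Independent sub-tasks run concurrently; a single-threaded boss would just
    # "switch functionality after completing each job" and call them in sequence.
    travel, hotels, weather, sights = await asyncio.gather(
        travel_agent(friends, budget),
        hotel_agent(city, budget),
        weather_agent(city),
        sightseeing_agent(city),
    )
    return {"travel": travel, "hotels": hotels, "weather": weather, "sights": sights}

print(asyncio.run(boss_agent(["ann", "ben", "cat", "dia", "eli"], budget=3000)))
```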
That particular task didn't need parallel agents or any of the advanced features.
The prompt was:
<prompt>
Research claude pricing with caching and then review a conversation history to calculate the cost.
First, search online for pricing for anthropic api with and without caching enabled for all of the models: claude-3-haiku, claude-3-opus and claude-3.5-sonnet (sonnet 3.5).
Create a json file with ALL the pricing data.
from the llm history db, fetch the response.response_json.usage for each result under conversation_id=01j7jzcbxzrspg7qz9h8xbq1ww
llm_db=$(llm logs path)
schema=$(sqlite3 $llm_db '.schema')
example usage: {
"input_tokens": 1086,
"output_tokens": 1154,
"cache_creation_input_tokens": 2364,
"cache_read_input_tokens": 0
}
Calculate the actual costs of each prompt by using the usage object for each response, based on the actual token usage, cached or not.
Also calculate/simulate what it would have cost if the tokens were not cached.
Create interactive graphs of different kinds to show the real cost of the conversation, the cache usage, and a comparison to what it would have cost without caching.
Write to intermediary files along the way.
Ask me if anything is unclear.
</prompt>
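For reference, the cost math the prompt asks for boils down to a few multiplications over the usage object. A sketch with assumed per-million-token rates for claude-3.5-sonnet (illustrative values; check Anthropic's pricing page for the real numbers):

```python
RATES = {  # USD per million tokens (assumed, illustrative values)
    "input": 3.00,
    "output": 15.00,
    "cache_write": 3.75,   # cache creation costs more than plain input
    "cache_read": 0.30,    # cache reads cost much less
}

def cost(usage: dict) -> tuple[float, float]:
    actual = (
        usage["input_tokens"] * RATES["input"]
        + usage["cache_creation_input_tokens"] * RATES["cache_write"]
        + usage["cache_read_input_tokens"] * RATES["cache_read"]
        + usage["output_tokens"] * RATES["output"]
    ) / 1e6
    # Counterfactual: every input token billed at the normal input rate.
    uncached = (
        (usage["input_tokens"]
         + usage["cache_creation_input_tokens"]
         + usage["cache_read_input_tokens"]) * RATES["input"]
        + usage["output_tokens"] * RATES["output"]
    ) / 1e6
    return actual, uncached

usage = {"input_tokens": 1086, "output_tokens": 1154,
         "cache_creation_input_tokens": 2364, "cache_read_input_tokens": 0}
print(cost(usage))  # (actual cost, simulated cost without caching)
```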
I just gave it your task and I'll share the results tomorrow (I'm off to bed).
True.
In the classic form of automation, reasoning is externalized into rules. In the case of AI agents, reasoning is internalized within a language model. This is a fundamental difference. The problem is that language models are not designed to reason. They are designed to predict the next most likely word. They mimic human skills but possess no general intelligence. They are not ready to function without a human in the loop. So, what are the implications of this new form of automation that AI agents represent?
https://www.lycee.ai/blog/ai-agents-automation-eng
I want to hear more about this. I'm playing with Langroid, crew.ai, and DSPy, and they all layer so many abstractions on top of a shifting LLM landscape. I can't believe anyone is really using them in the way their README goals profess.
Not you in particular, but I hear this common refrain that the "LLM landscape is shifting", but what exactly is shifting? Yes new models are constantly announced, but at the end of the day, interacting with the LLMs involves making calls to an API, and the OpenAI API (and perhaps Anthropic's variant) has become fairly established, and this API will obviously not change significantly any time soon.
Given that there is (a fairly standard) API to interact with LLMs, the next question is, what abstractions and primitives help easily build applications on top of these, while giving enough flexibility for complex use cases.
The features in Langroid have evolved in response to the requirements of various use-cases that arose while building applications for clients, or companies that have requested them.
Sonnet 3.5 and other large context models made context management approaches irrelevant and will continue to do so.
o1 (and likely Sonnet 3.5) made chain-of-thought and other complex prompt engineering irrelevant.
The Realtime API (and others that will soon follow) will make the best VTT > LLM > TTV pipelines irrelevant.
VLMs will likely make LLMs irrelevant. Who knows what Google has planned for Gemini 2.
The point is, building these complex agents has been proven a waste of time over and over again, at least until we see a plateau in models. It's much easier to swap in a single API call and modify one or two prompts than to rework a convoluted agentic approach. Especially when it's very clear that the same prompts can't be reused reliably between different models.
I encourage you to run evals on result quality for real b2b tasks before making these claims. Almost all of your post is measurably wrong in ways that cause customers to churn an AI product same-day.
I suppose my comment is reserved more for the documentation than the actual models in the wild?
I do worry that LLM service providers won't do any better than rest API providers in versioning their backend. Even if we specify the model in the call to the API, it feels like it will silently be upgraded behind the scenes. There are so many parameters that could be adjusted to "improve" the experience for users even if the weights don't change.
I prefer to use open weight models when possible. But so many agentic frameworks, like this one (to be fair, I would not expect OpenAI to offer a framework that work local first), treat the local LLM experience as second class, at best.
Years ago we complained about the speed with which new JavaScript frameworks were popping into existence. Today it goes an order of magnitude faster, and the quality of the outputs can only be suffering. Yes, there's code, but interfaces and APIs change dramatically, and the documentation is a few versions behind. Whoever has time to compare simply cannot do it in depth, and ideas also get dropped along the way. I don't want to call it a mess because that's too negative; having many ideas is great, but I feel we're still in the brainstorming phase.
> The underlying issue is that AI agents are too slow,
Inference speed is being rapidly optimized, especially for edge devices.
> too expensive,
The half-life of OpenAI's API pricing is a couple of months. While the bleeding-edge model is always costly, its capabilities rapidly become available to the public at lower API prices.
> and too unreliable
Out of the 3 points raised, this is probably the most up in the air. Personally I chalk this up to side effects of OpenAI's rapid growth over the last few years. I think this gets solved, especially once price and latency have been figured out.
IMO, the biggest unknown here isn't a technical one, but rather a business one: I don't think it's certain that products built on multi-agent architectures will be addressing a need for end users. Most of the talk I see in this space is by people excited about building with LLMs, not by people who are asking to pay for these products.
Frankly, what you are describing is a money-printing machine. You should expect anyone who has figured out such a thing to keep it as a trade secret, until the FOSS community figures out and publishes something comparable.
I don’t think the tech is ready yet for other reasons, but the absence of anyone publishing is not good evidence against it.
We've been working on AI-automated web scraping at Kadoa[0], and our early experiments were similar to those in the article. We started when only the expensive and slow GPT-3 was available, which pushed us to develop a cost-effective solution at scale.
Here is what we ended up with:
- Extraction: We use codegen to generate CSS selectors or XPath extraction code. Using an LLM for every data extraction would be expensive and slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.
- Cleansing & transformation: We use small fine-tuned LLMs to clean and map data into the desired format.
- Validation: Unstructured data is a pain to validate. Alongside traditional data validation methods like reverse search, we use LLM-as-a-judge to evaluate data quality.
We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.
Combining traditional ETL engineering methods with small, well-evaluated LLM steps was the way to go for us.
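As a rough illustration of the codegen-extraction idea (one LLM call per site to produce selectors, then plain parsing for every page); this is a sketch, not Kadoa's actual implementation, and `llm` is a stand-in for whichever model call you use:

```python
import json
from bs4 import BeautifulSoup

def generate_selectors(llm, sample_html: str, fields: list[str]) -> dict[str, str]:
    # One call per site, not one per page. Expected reply is JSON like
    # {"title": "h1.product-title", "price": "span.price"}.
    prompt = (f"Given this HTML, return a JSON object mapping each of "
              f"{fields} to a CSS selector:\n{sample_html[:8000]}")
    return json.loads(llm(prompt))

def extract(html: str, selectors: dict[str, str]) -> dict[str, str | None]:
    # Plain CSS-selector extraction: cheap and fast for every subsequent page.
    soup = BeautifulSoup(html, "html.parser")
    out: dict[str, str | None] = {}
    for field, css in selectors.items():
        node = soup.select_one(css)
        out[field] = node.get_text(strip=True) if node else None  # None => selector drifted, regenerate
    return out
```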
As an engineer who became a founder, I cannot recommend the book enough. Whether you're improving your landing page, writing your pitch deck, reaching out to customers, or developing your company and product strategy, communicating effectively in writing is a very crucial skill.
Invest in improving your writing skills. It will pay dividends in every aspect of your business.
I think vision models have a much bigger ROI from fine-tuning than language models. That being said, I do consider fine-tunes helpful for improving smaller models as long as the domain is limited in scope. In other words, fine-tuning allows you to get similar performance out of smaller models, but improved performance in larger models seems pretty elusive at the moment, albeit somewhat possible with very creatively engineered training datasets.
An example of this is DeepSeek-Coder, which can essentially be considered a fine-tune of a fine-tune of a Mixtral model. It performs very similarly to Claude 3.5 Sonnet, which is pretty damn impressive, but it does it at less than 1/10th the cost.
What I don't understand though is why anyone would even remotely consider fine-tuning a GPT-4o model that they will never fully own, when they could spend the same resources on fine-tuning a Llama 3.1 model that they will own. And even if you absolutely don't care about ownership (???), why not do a fine-tune of an Anthropic model, which is already significantly better than GPT-4o. At this point, with the laggard performance of OpenAI and their shameless attempts at regulatory hostility to competitors, I can't imagine ever giving them any of my money, let alone owning my derivative work.
Very cool to see more YC hard tech startups emerging [0]. These are the kind of moonshot projects I love to see getting funded (instead of just SaaS and AI).
You mention validation and schema guarantees as key features for high accuracy. Are you using an LLM-as-a-judge combined with traditional checks for this?
Yes, we combine LLM-as-a-judge with traditional checks like reverse search in the original data sources, user-defined post-processing logic, and a simple classifier for a confidence score.
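A minimal sketch of how such a blended check might look; `judge_llm` is a stand-in for the actual judge call, and the specific rules and weights are purely illustrative:

```python
def validate(record: dict, source_text: str, judge_llm) -> float:
    """Blend cheap deterministic checks with an LLM judge into one confidence score."""
    score = 1.0
    for field, value in record.items():
        # Crude "reverse search": extracted values should appear in the source.
        if value not in (None, "") and str(value) not in source_text:
            score -= 0.2
    score = max(0.0, score)
    # Then ask an LLM judge for a 0-1 plausibility rating.
    verdict = float(judge_llm(
        "On a scale of 0 to 1, how well does this record match the source?\n"
        f"Record: {record}\nSource: {source_text[:4000]}\nAnswer with a number only."
    ))
    return round(0.5 * score + 0.5 * min(max(verdict, 0.0), 1.0), 3)
```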
- https://www.ycombinator.com/companies/tableflow
- https://www.ycombinator.com/companies/reducto
- https://www.ycombinator.com/companies/mindee
- https://www.ycombinator.com/companies/omniai
- https://www.ycombinator.com/companies/trellis