DocuChat is a full-stack web application built with C#, .NET 8.0, and Blazor. The goal is to provide a cost-effective chat interface for interacting with documents, prioritizing data accuracy, minimizing errors, and delivering a high-quality user experience.
- Parses SEC documents and extracts relevant data. (build in progress)
- Uses an LLM and RAG to power a chatbot that interprets and answers queries based on information extracted from SEC documents. (coming soon)
- Ensure you have the .NET 8 SDK installed on your machine. You can download it from the official .NET website.
- Clone the repository, then open the solution:
  git clone https://github.com/your-repo/DocumentAPI.git
- Open the user secrets file and paste in the following. I am on a Mac using the Rider IDE, where the file opens via right-click on the project -> Tools -> .NET User Secrets.
  {"SecUserAgent": "Personal-Project/1.0 (+{{your email}}@gmail.com)"}
- Run the project, then call the API with the requests below.
  curl --location 'https://localhost:5084/api/sec/sec-parser' \
    --header 'accept: */*' \
    --header 'Content-Type: application/json' \
    --header 'X-CALLING-APP: CompanyA' \
    --data '{
      "secDocumentUrls": [
        "https://www.sec.gov/Archives/edgar/data/320193/000032019319000119/a10-k20199282019.htm",
        "https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm",
        "https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm",
        "https://www.sec.gov/Archives/edgar/data/320193/000032019321000105/aapl-20210925.htm",
        "https://www.sec.gov/Archives/edgar/data/320193/000032019320000096/aapl-20200926.htm",
        "https://www.sec.gov/Archives/edgar/data/789019/000156459020034944/msft-10k_20200630.htm",
        "https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm"
      ],
      "secDocumentTypeEnum": 1
    }'
  curl --location 'https://localhost:5084/api/sec/batch-get-sec-urls?formType=1&startDate=2019-04-30&endDate=2024-04-30' \
    --header 'accept: */*'
- Data
  - Requirement
    - Start with 10-K and 10-Q forms.
    - Extract the specific sections from the forms.
    - Run a load test with, say, 200 documents.
- Algorithms
  - Levenshtein distance for measuring text similarity
- Libraries
  - HtmlAgilityPack to parse the HTML documents
  - Carter for routing and handling requests, so I don't need to write my own filters from scratch and can spend more time on the business logic.
  - Polly for resilience, expressing policies such as Retry, WaitAndRetry, and CircuitBreaker in a fluent manner.
  - ...
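
The Levenshtein distance listed under Algorithms can be sketched in a few lines of C#. This is a minimal illustration, not the project's actual code: the `TextSimilarity` class and its `LevenshteinDistance` and `Similarity` methods are hypothetical names. How the project applies the score is also an assumption here; one plausible use is comparing a heading found in a filing against an expected section title.

```csharp
using System;

// Hypothetical helper, not taken from the DocuChat code base.
public static class TextSimilarity
{
    // Dynamic-programming Levenshtein distance: the minimum number of
    // single-character insertions, deletions, and substitutions needed
    // to turn source into target. Only two rows of the table are kept.
    public static int LevenshteinDistance(string source, string target)
    {
        if (source is null) throw new ArgumentNullException(nameof(source));
        if (target is null) throw new ArgumentNullException(nameof(target));
        if (source.Length == 0) return target.Length;
        if (target.Length == 0) return source.Length;

        var previous = new int[target.Length + 1];
        var current = new int[target.Length + 1];

        // Distance from the empty prefix of source to each prefix of target.
        for (var j = 0; j <= target.Length; j++) previous[j] = j;

        for (var i = 1; i <= source.Length; i++)
        {
            current[0] = i;
            for (var j = 1; j <= target.Length; j++)
            {
                var substitutionCost = source[i - 1] == target[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,          // insertion
                             previous[j] + 1),            // deletion
                    previous[j - 1] + substitutionCost);  // substitution or match
            }
            // The row just computed becomes the previous row for the next pass.
            (previous, current) = (current, previous);
        }
        return previous[target.Length];
    }

    // Normalizes the distance to a score in [0, 1], where 1 means identical.
    public static double Similarity(string source, string target)
    {
        var maxLength = Math.Max(source.Length, target.Length);
        if (maxLength == 0) return 1.0;
        return 1.0 - (double)LevenshteinDistance(source, target) / maxLength;
    }
}
```

For example, `TextSimilarity.LevenshteinDistance("kitten", "sitting")` returns 3. The comparison is case-sensitive, so callers that want case-insensitive matching would normalize both strings first.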